Researchers from MIT and other institutions investigated whether machine learning could help them build better hash functions.

Hashing is essential to most online databases, from library catalogues to e-commerce stores. A hash function generates a code that determines where a piece of data will be stored, so the data can be located and retrieved efficiently by recomputing that code. But because traditional hash functions assign codes essentially at random, two different pieces of data can occasionally hash to the same value. This causes collisions: a user searching for one item lands on a slot holding several pieces of data, and finding the right one takes much longer, slowing searches and degrading performance.
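To make the collision problem concrete, here is a minimal Python sketch, using Python's built-in hash() as a stand-in for a conventional hash function; the keys and slot count are illustrative, not from the study:

```python
num_slots = 10
keys = ["apple", "banana", "cherry", "date", "elderberry",
        "fig", "grape", "honeydew", "kiwi", "lemon"]

slots = {}
for key in keys:
    # Pseudo-random slot assignment, as a conventional hash would do.
    slot = hash(key) % num_slots
    slots.setdefault(slot, []).append(key)

# Any slot holding more than one key is a collision: a lookup there
# must scan every stored key to find the right one. With 10 keys in
# 10 slots, at least one collision is very likely (though not
# guaranteed on any given run).
for slot, stored in sorted(slots.items()):
    print(slot, stored)
```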

Perfect hash functions are a subset of hash functions explicitly designed to place data without collisions. However, they take longer to compute than conventional hash functions, and a new one must be built for each dataset, which is time-consuming. Because hashing is used in so many applications, including database indexing, data compression, and cryptography, fast and effective hash functions are essential. This tension is what led the team to ask whether machine learning could do better.

They discovered that, in some circumstances, using learned models instead of conventional hash functions could halve the number of collisions. These learned models are produced by running a machine-learning algorithm on a dataset to capture its particular characteristics. The team's experiments also showed that learned models were often more computationally efficient than perfect hash functions.

Procedure

A traditional hash function takes a data input, called a key, and generates a pseudo-random number, or code, that determines where that key will be stored. For example, if ten keys are to be placed in ten slots, the function assigns each key a random integer between 1 and 10. Two keys are likely to land in the same slot, causing collisions. Perfect hash functions avoid collisions entirely: the researchers supply the function with additional information, such as the number of slots the data will occupy, and it performs extra computation to determine exactly where to place each key so that no collisions occur. But these extra calculations make the function harder to build and less efficient.
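As an illustration of that extra up-front work, here is a naive Python sketch that brute-forces a collision-free ("perfect") assignment by searching for a hash seed. Real perfect-hash constructions are far more sophisticated; the seeded CRC32 and the key set here are just illustrative stand-ins:

```python
import zlib

def seeded_hash(key: str, seed: int) -> int:
    # CRC32 of the seeded key: a cheap, deterministic stand-in hash.
    return zlib.crc32(f"{seed}:{key}".encode())

keys = ["apple", "banana", "cherry", "date", "elderberry",
        "fig", "grape", "honeydew", "kiwi", "lemon"]
num_slots = len(keys)

# Keep trying seeds until no two keys share a slot. This search is
# the "extra computation" a perfect hash pays before it can be used,
# and it only holds for this fixed key set.
seed = 0
while len({seeded_hash(k, seed) % num_slots for k in keys}) < num_slots:
    seed += 1

print(f"seed {seed} gives a perfect hash for this fixed key set")
```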

The researchers took a small sample from a dataset and used machine learning to approximate how the data are distributed. The learned model then uses that approximation to predict where a key belongs in the dataset. They found that learned models were easier to build and faster to run than perfect hash functions, and that they caused fewer collisions than traditional hash functions when the data were distributed predictably. However, if the data are not smoothly distributed, because the gaps between data points vary too widely, learned models can produce more collisions.
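A minimal Python sketch of the idea, assuming a single linear approximation of the data's cumulative distribution fitted on a small sample; the paper's models are more elaborate, and all names and constants here are illustrative:

```python
import random
from collections import Counter

random.seed(0)
data = [random.uniform(0.0, 1000.0) for _ in range(10_000)]
num_slots = 10_000

# Approximate the CDF from a small sample with one linear model:
# cdf(key) ~ (key - lo) / (hi - lo).
sample = random.sample(data, 100)
lo, hi = min(sample), max(sample)

def learned_hash(key: float) -> int:
    cdf = (key - lo) / (hi - lo)  # estimated relative rank of the key
    return min(num_slots - 1, max(0, int(cdf * num_slots)))

# Count keys that share a slot with another key.
counts = Counter(learned_hash(k) for k in data)
colliding = sum(c - 1 for c in counts.values() if c > 1)
print(f"{colliding} of {len(data)} keys collide")
```

Because the data here really are uniform, the linear CDF spreads keys nearly evenly across slots; on irregularly spaced data the same model would bunch keys together and collide more.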

Results

When data were distributed predictably, learned models reduced the fraction of colliding keys in a dataset from 30 percent to 15 percent compared with traditional hash functions. They also achieved higher throughput than perfect hash functions. In the best cases, learned models reduced runtime by nearly 30 percent.

As the researchers looked into how learned models could be used for hashing, they also found that speed was strongly affected by the number of sub-models. Each learned model is composed of smaller linear models, each approximating the data distribution over a different part of the data. With more sub-models, the learned model produces a closer approximation, but takes longer to run. Building on this analysis, the researchers want to use learned models to design hash functions for other kinds of data. They also plan to explore learned hashing for databases in which data can be inserted or deleted: when the data change this way, the model must be updated too, and updating a model while preserving its accuracy is a difficult problem.
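A rough Python sketch of the sub-model trade-off, assuming the sorted keys are split into equal-count segments with one linear CDF segment each; the researchers' models are more sophisticated, and the skewed test data here are illustrative:

```python
import bisect
import random
from collections import Counter

random.seed(1)
# Skewed data: gaps between points vary, so one global line fits
# the CDF poorly and a piecewise fit helps.
data = sorted(random.expovariate(1.0) for _ in range(10_000))
num_slots = 10_000

def build_submodels(sorted_keys, k):
    """Split the sorted keys into k equal-count ranges and fit one
    linear CDF segment per range: cdf = base + slope * (key - lo)."""
    n = len(sorted_keys)
    step = n // k
    bounds, models = [], []
    for i in range(k):
        start = i * step
        end = n if i == k - 1 else (i + 1) * step
        lo, hi = sorted_keys[start], sorted_keys[end - 1]
        slope = ((end - start) / n) / (hi - lo) if hi > lo else 0.0
        bounds.append(lo)
        models.append((slope, start / n, lo))
    return bounds, models

def learned_hash(key, bounds, models):
    # Pick the sub-model responsible for this part of the key range.
    j = max(0, bisect.bisect_right(bounds, key) - 1)
    slope, base, lo = models[j]
    cdf = base + slope * (key - lo)
    return min(num_slots - 1, max(0, int(cdf * num_slots)))

for k in (1, 4, 64):  # more sub-models: closer fit, more work
    bounds, models = build_submodels(data, k)
    counts = Counter(learned_hash(x, bounds, models) for x in data)
    collisions = sum(c - 1 for c in counts.values() if c > 1)
    print(f"{k} sub-models: {collisions} colliding keys")
```

On data like this, a single global line crams most keys into a few slots; adding sub-models cuts collisions at the cost of extra model lookups.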

Conclusion

Hashing is one of the most fundamental operations in database management, underlying many of the most important data structures and algorithms. Traditional hash functions, however, approximate a function that maps each key to a random value, which can lead to collisions, where more than one key maps to the same value. There are many well-known strategies for handling collisions, such as chaining, probing, and cuckoo hashing.
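For concreteness, here is a minimal Python sketch of one of those remedies, bucket chaining; probing and cuckoo hashing expose the same put/get interface but resolve collisions by relocating keys instead of chaining them. The class and method names are illustrative, not from the paper:

```python
class ChainedHashTable:
    """Bucket chaining: colliding keys share a slot's list, and a
    lookup scans that list. More collisions mean longer scans."""

    def __init__(self, num_slots: int):
        self.buckets = [[] for _ in range(num_slots)]

    def _slot(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # chain a new (key, value)

    def get(self, key, default=None):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return default

table = ChainedHashTable(10)
table.put("apple", 1)
table.put("banana", 2)
print(table.get("apple"), table.get("banana"))
```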

In this project, the researchers ask whether using learned models instead of traditional hash functions can reduce collisions, and whether doing so improves performance, particularly for indexing and joins. They first show that learned models can reduce collisions in some situations, depending on how the data are distributed. They then evaluate learned models as hash functions inside bucket-chaining, linear-probing, and cuckoo hash tables.
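The sketch below illustrates, under simplified assumptions, how a learned model can drop into a linear-probing table in place of a conventional hash; the single linear CDF model and the uniform data are illustrative, not the paper's setup:

```python
import random

def probe_insert(slots, slot_of, key):
    """Open addressing with linear probing: start at the predicted
    slot and walk forward to the first free one. Returns the number
    of probes used for this insert."""
    m = len(slots)
    i = slot_of(key)
    probes = 1
    while slots[i] is not None:
        i = (i + 1) % m
        probes += 1
    slots[i] = key
    return probes

random.seed(2)
data = sorted(random.uniform(0.0, 1.0) for _ in range(5_000))
m = 10_000
lo, hi = data[0], data[-1]

hashes = {
    "traditional": lambda k: hash(k) % m,
    "learned": lambda k: min(m - 1, int((k - lo) / (hi - lo) * m)),
}

for name, slot_of in hashes.items():
    slots = [None] * m
    total = sum(probe_insert(slots, slot_of, k) for k in data)
    print(f"{name}: {total / len(data):.2f} probes per insert on average")
```

Because the learned model places keys in nearly rank order, probe chains stay short; the conventional hash clusters keys at random and probes more often.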

The researchers find that learned models can do two things:

(1) reduce probe latency by up to 1.4x, and

(2) cut non-partitioned hash join runtime by 28% compared with the next best baseline, for certain datasets.

Furthermore, the researchers find that learned models can outperform traditional hash functions, but only for certain data distributions.
