Hash Table

From GPWiki

A common problem in programming is finding an element in a collection, preferably fast. If you just dump all your elements into an array or linked list, you'll have to start at the front and check every element until you find the one you're looking for. If the element is not present, you'll need to check every element to find that out.

  • One partial solution is to sort the contents of your container, and use a binary search to find the element in logarithmic time.
    • Arrays can efficiently be accessed at any point, but they have the problem that inserting elements in the middle is inefficient.
    • Lists allow efficient insertion and deletion in the middle, but you cannot access middle elements in constant time.
  • Another solution to this problem is to use a less linear data structure.
    • Trees are an option, or
    • hash tables.

This article discusses hash tables.

Standard Hash Tables

A hash table is in effect an array, but elements are not added from front to back. Instead, a special function called a hashing function determines where in the array each element should go. The hashing function uses some attribute of the data to calculate this position. For example, a very simple hashing function for a string is to add up the ASCII values of its characters, divide by the size of the array, and take the remainder. By calculating the hash code, we can quickly discover where our data is stored and retrieve it without searching the entire array. A common use of hash tables is to implement maps.

Hash tables are typically faster than trees or sorted arrays/lists because they support fewer operations. In particular, hash tables do not support iterating over the elements in order. By not maintaining ordering, insertions and lookups can be faster. However, in the worst case hash table operations degrade to O(n), while a Red-Black or AVL tree is always O( lg(n) ).

Many high-level languages such as Python and Perl have hash tables as a first-class data type.

A C Example

This is an example of code that uses a simple hash algorithm and a small hash table of ints. Basic knowledge of C pointer arithmetic and C's representation of strings is assumed. Also, there is no checking for collisions.

#define HASHTABLESIZE 100
 
/* the hash table */
int hashtable[HASHTABLESIZE];
 
/**
 * this function accepts a string as its argument, and
 * returns the hash code of the string. the algorithm is
 * the well-known "djb2" string hash by Daniel J. Bernstein
 **/
unsigned long
hashstring (char *str)
{
  unsigned long hash = 5381;
 
  /* loop until end of string: hash = hash * 33 + character */
  while (*str)
    hash = (hash << 5) + hash + *str++;
 
  return hash;
}
 
/**
 * this function sets the value in the hash table at the
 * given key to the given value.
 **/
void
putvalue (char *key, int value)
{
  hashtable[hashstring (key) % HASHTABLESIZE] = value;
}
 
/**
 * this function returns the value in the hash table
 * assigned to the given key.
 **/
int
getvalue (char *key)
{
  return hashtable[hashstring (key) % HASHTABLESIZE];
}

Using the hash table is very simple. The following code should print '42':

putvalue ("forty-two", 42);
printf ("%d\n", getvalue ("forty-two"));

The following code should print 'josh':

char *kids[] = { "alex", "robby", "josh" };
 
/* store the string pointer itself; note that casting a pointer
 * to int assumes a pointer fits in an int, which is not portable */
putvalue ("winner", (int) kids[2]);
puts ((char *) getvalue ("winner"));

Multi-Dimensional Hash Tables

While rarely used in practice, hash tables can be multi-dimensional. If the array has more than one dimension, say 2, then whenever a data item and key are put into the table, the first dimension's index can be computed by one hash algorithm and the second dimension's index by another. Using two or more hash algorithms this way can reduce the chance of a collision.

Dealing With Collisions

One problem that arises when using hash tables is what to do when a collision occurs - that is, when more than one element hashes to the same position. A perfect hash function has a unique hash for every item of data, but such functions are only practical when the full set of keys is known in advance. When designing your hash function, try to minimize collisions, as dealing with them can be expensive time-wise.


Chaining

To solve collisions by chaining, have each element in the array point to a linked list of nodes containing the data instead of storing the data directly. This way all elements with the same hash code can be stored in one linked list. It is also possible to use other types of collections to store elements that share a hash (this is known as bucket hashing), however the overhead of creating these collections means it can become quite slow.

Open Addressing

Open addressing solves the collision problem by choosing another slot for the element if its hashed slot is already taken. With linear probing, the element is inserted into the next free slot. This means that when retrieving data, we must search every slot from the hash position onward until we either find the key or reach an empty slot, which proves the key is absent. Also, when removing data we must replace it with a dummy value that tells the search function a value used to be there; otherwise the search could miss elements with the same hash that were added after the removed value.

Quadratic probing is a similar technique, except we use a quadratic formula to calculate where the value should go next if its slot is taken. One common variant uses newhash = hash + (-1)^(i-1) * ((i + 1)/2)^2 (with integer division), where i is the probe number, starting at 1 and increasing on each collision; this produces the offset sequence +1, -1, +4, -4, +9, -9, and so on. The advantage of this technique over linear probing is that it reduces the clusters of data in one part of the array that slow down the search function.

The third type of open addressing is known as double hashing. To double hash, you write a second hash function and add its value to the original hash whenever a slot is taken; if that slot is also taken, you add the second hash value again, and so on until you find an empty slot. Note that you should always divide by the size of the array and take the remainder to make sure you stay inside the boundary of the array.