Tangible Computing
16. Hash Tables




16.1 What is a hash table?

A hash table is a data structure that has roughly constant time access to the entry that matches a given key. One thinks of the hash table as an unordered set containing many instances of the same kind of item. One field of the item is designated as the key. Then, given an item with a specific key, one can insert the item into the table, or lookup up the item and retrieve the item form the table.

Hash tables are used to implement a variety of data structures, the most common being a map or associative array which maps keys of some type to values of some type. They are most space-efficient when the space of possible keys is much larger than the set of actual keys you are using.

16.2 Implementing a Hash Table with Collision Lists

The hash table has a function, called the hash function that takes a key from an item and then computes an integer index called a hash value associated with the key. You use this index to decide where to place the item in the hash table, or where to look it up.

In theory, if you had a hash function that generated hash values in the range 0 to n-1, you could simply have an array of size n that stores the items. The hash function would compute where in the array to store a given item.

But there is a wrinkle. Hash functions have the property that many keys are mapped to the same hash value. So it is possible that two or more items that you want to place in the array have the same index. This is called a collision. The main difference between the various hash table implementations is the way they handle collisions.

The intuition behind collision lists is as follows. The table consists of a number of buckets that can store multiple items. The hash function takes a key from an item and the resulting hash value determines what bucket the data element belongs to. If you are doing an insert, the data is put into that bucket. If you are doing a lookup, then you rummage in the bucket looking for an entry with a matching key, and if you find it then return it.

A hash table with collision lists consists of a table of buckets. Each bucket is just a pointer to the head of a linked list of nodes that contains the items in the bucket. Each item contains a key field, and other data fields.

It is best to look at an example. Suppose that we want to store information about students and look them up by student id. We have the information in this data structure:

code/HashTables/student/student.h

    #ifndef _student_h
    #define _student_h
     
    typedef struct {
        int id;
        char name[64];
        char grade[3];
    } student_info;
    #endif


We want to look up students by id number. Since student ids can be quite large (32-bit unsigned) we cannot simply have a big array with an entry for each possible id. Thus a hash table is a good candidate data structure.

Lets first look at a picture of waht is going on.

Here is a sequence of additions to the hash table data structure, in order of Student 0, Student 1, and so on. The hash table has size 5, and there are 10 students, so there will be collisions.

Step 0:


Step 1:



Step 2:


Step 3:


Step 4:


Step 5:


Step 6:


Step 7:


Step 8:


Step 9:


Step 10:


Now the code

code/HashTables/student/hashtable.h

    #ifndef _hashtable_h
    #define _hashtable_h
     
     
    #include "student.h"
    /*
        Hash table of student records.
     
        The hash table is implemented using collision lists.  Each collision
        list contains all the student info records whose id hashes to the same 
        index value.
     
        NOTE: we manipulate and store actual student records in the hash table, 
        not pointers to the records.
    */
     
    typedef struct ht_node {
        student_info item;
        struct ht_node *next;
    } ht_node;
     
    typedef struct {
        /* items in the table are placed into buckets, each bin holding items with
        the same hash index */
        ht_node **buckets;  
        int num_buckets;
    } ht_table_instance;
     
    typedef ht_table_instance *hashtable;
     
    /* 
        Constructor: create a hash table that will contain roughly size_hint
        items and return a pointer to the instance.
    */
    hashtable ht_create(int size_hint);
     
    /*
        Destructor: release all the storage used by the hash table.
    */
    void ht_destroy(hashtable h);
     
    /*
        Insert the given item into the hash table using id as the key.
        If an item already exists in the table with the same id, replace 
        it with this new one.
    */
    void ht_insert(hashtable h, student_info item);
     
    /*
        Lookup the student info item that matches the given id.
     
        If present, return it in the struct pointed to by itemp, and 
        return 1.
     
        If missing, don't modify the item, and return 0.
    */
    int ht_lookup(hashtable h, int id, student_info *itemp);
     
     
    #endif


code/HashTables/student/hashtable.c

    #include "hashtable.h"
    #include <stdlib.h>
    #include <assert.h>
     
    /*
        A very simple hash function
    */
    int ht_hash(hashtable ht, int id) {
        return id % ht->num_buckets;
        }
     
    hashtable ht_create(int size_hint)
    {
        int num_buckets = size_hint;
        hashtable ht = (hashtable) malloc(sizeof(ht_table_instance));
        assert(ht);
     
        ht->num_buckets = num_buckets;
        ht->buckets = (ht_node **) malloc(sizeof(ht_node *) * ht->num_buckets);
        assert(ht->buckets);
     
        /* Initialize linked lists to NULL */
        for(int i=0; i < ht->num_buckets; i++) {
            ht->buckets[i] = NULL;
            }
     
        return ht;
    }
     
    void ht_insert(hashtable ht, student_info item) {
        // what index in the table does it hash to?
        int index = ht_hash(ht, item.id);
     
        // empty position
        if ( ! ht->buckets[index] ) {
            // empty list, just place it there */
            ht_node *n = (ht_node *) malloc(sizeof(ht_node));
            assert(n);
            n->item = item;
            n->next = 0;
            ht->buckets[index] = n;
            return;
            } 
     
        // collision, insert into the list if not already there
        ht_node *cur = ht->buckets[index];
        while (cur) {
            if ( (cur->item).id == item.id ) {
                // already in table, replace it
                cur->item = item;
                cur = 0;
            } else if ( cur->next ) {
                // no match, but more to look at, move on
                cur = cur->next;
            } else {
                // no match, at end of list, insert at end
                ht_node *n = (ht_node *) malloc(sizeof(ht_node));
                assert(n);
                cur->next = n; 
                n->item = item;
                n->next = 0;
                cur = 0;
                }
            }
        }
     
    int ht_lookup(hashtable ht, int id, student_info *itemp) {
        int index = ht_hash(ht, id);
     
        if ( ! ht->buckets[index] ) {
            // empty list, not present 
            return 0;
            }
     
        // walk along collision list looking for matching id
        ht_node *cur = ht->buckets[index];
        while (cur) {
            if ( (cur->item).id == id ) {
                // found it, return the item and success
                *itemp = cur->item;
                return 1;
                }
     
            // no match, more to look at, move on
            cur = cur->next;
            }
     
        // could not find it
        return 0;
        }
     
    void ht_destroy(hashtable ht) {
        /* unimplemented stub */
        }


And another style of test program.

code/HashTables/student/testhash.c

    #include "hashtable.h"
    #include <stdio.h>
     
    void lookup_test(hashtable ht, int test_id) {
        student_info s;
        if (ht_lookup(ht, test_id, &s) ) {
            printf("Id %d has id %d, name %s, grade %s\n", 
                test_id, s.id, s.name, s.grade);
            }
        else {
            printf("Id %d not found in hash table\n", test_id);
            }
        }
     
      
    /* our test data */
    student_info students[] = { 
        { 451241, "Marjorie G. Smith", "B+" },
        { 519244, "Michael P. Anderson", "C", },
        { 129426, "Ben E. Harrison", "A-" }, 
        { 743295, "William A. Leslie", "B-" },
        { 623599, "Brenda M. Crumble", "A+" },
        { 195151, "Mary M. Toney", "A-" },
    };
    int num_students = 6;
     
    int test_ids[] = {
        195151, 519244, 666222,
        };
    int num_test_ids = 3;
      
    int main(int argc, char *argv[]) {
        // need to know implementation to realise that size 3 will create collisions
        hashtable ht = ht_create(3);
     
        // how test collisions?
        for (int i=0; i < num_students; i++) {
            ht_insert(ht, students[i]);
            }
     
        for (int i=0; i < num_test_ids; i++) {
            int test_id = test_ids[i];
            lookup_test(ht, test_id);
            }
     
        // how would you test this?
        ht_destroy(ht);
        }

16. Hash Tables
Tangible Computing / Version 3.20 2013-03-25