Phase 2: Evaluating Hash Functions

A very important part of using a hash table is choosing a hash function!

In this assignment, users of your HashSet<T> will need to provide a myhash function for their type.

In this part, you'll be exploring possible hash functions for strings, and develop a myhash function for std::strings that users can use if they don't want to come up with one themselves.

Let's Go

Find Hash Functions

In this part (and only this part) of the assignment, your hash functions do not need to be your own original work. In particular, in an exception to the usual course rules, you may (and are encouraged to!) ask ChatGPT or similar assistants to help in the creation of the hash functions, as described below.

Remember that attribution is still required.

Find (at least) three distinct hash functions for strings and add their implementations to stringhash.cpp (replacing the three placeholder functions currently there). Be sure to update the contents of the hashInfo list at the bottom of stringhash.cpp as described in the comment above hashInfo.

You may copy code from any source for the hash functions except for

The work of current or prior CS 70 students.
Any implementation of the C++ standard library's std::hash function.

Our lessons have described one straightforward hash function for strings and you can find many more examples in textbooks, Wikipedia, and other online sources. You must adhere to the following rules:

Ensure that your version of the hash function takes only a std::string (by constant reference) and returns a size_t value.
- Some example code you find may use C-style strings. Adapting that code to use C++ strings will require some changes to the code; see the hints below.
- Similarly, some examples include the table size as an additional parameter. In most cases, you can easily remove that part of the code as it will just have % tableSize at the end of the function.
- If you find code that is in C (or very old-style C++, or in some other language), you will need to adapt it to modern C++. You may use a language model like ChatGPT to help you with this process, as described below.
Avoid hash functions that require additional libraries or significant support infrastructure. The hash function should be a self-contained function that can easily be copied into your code.
Try to find functions that are reasonably different from one another. For example, if you find several variants of the same basic idea (e.g., all based on polynomial accumulation with different constants), they may not be particularly interesting for testing.
Provide attribution
- Whenever your code is based on someone else's, you must be clear about its provenance, even if you make substantial modifications (e.g., converting code that worked on C strings to work with C++'sstd::string).
  - Failing to provide attribution is plagiarism and is an Honor Code violation.
- If the code has an open-source license, you should note which one.
  - Note that it is our understanding that submitting a homework assignment as part of an educational experience does not count as “distribution” under software-licensing rules.
- If you use ChatGPT or another coding assistant, check any citations it provides to make sure that they aren't hallucinations.
Only obtain code for hash functions, nothing else!
- You may not copy code for other parts of the assignment, including hash tables themselves.

Using ChatGPT or Other AI Assistants

Unlike any other parts of assignments in the class, you may use ChatGPT (or similar AI assistants) to help you find hash functions (but not for any other parts of the assignment). If you do, you must provide attribution to ChatGPT in your code. You may not, however, enable GitHub Copilot in Visual Studio Code. Use the chat interface directly in a web browser.

Thus, you may use a conversation with ChatGPT to

Suggest sources to find C++ hash functions,
Adapt code written for other languages to work with a std::string parameter and return a size_t value, or,
Create innovative new hash functions based on its knowledge of the subject area.

A sample conversation can be found here, which includes several example hash functions that you can use, but you can also just use the chat as inspiration to guide your own conversation with ChatGPT.

Here's some suggested prompt context you can include in your requests to ChatGPT:

When providing hash function(s), please note the following:

* The hash function(s) should be named in camelCase, based on their origin or characteristics.
* The hash function(s) should take a single parameter of type `std::string` (by constant reference) and return a `size_t` value.
* The hash function(s) should not require any additional libraries or significant support infrastructure beyond what is in the C++20 standard library.
* There is no need for a table-size parameter; the hash function(s) should just return a `size_t` hash value which will be reduced modulo the table size later.
* Please include comments indicating the source of the hash function(s) and any applicable open-source license information.  Be sure to provide attribution for your own contributions to the code as well.
* This code will be used as part of hash-function testing in a homework assignment, so we will be hoping to have some variety in the hash functions provided.  None of the hash functions should be obviously terrible, but equally, if they're all variants on the same basic idea, that won't be very interesting for testing.
* When giving multiple hash functions, avoid functions that are essentially the same modulo different constants.  In particular, multiplicative hashes implemented using shifts and adds as an “optimization” are no different, on a modern compiler, from those using multiplication directly.  Be sure to point out this issue if it arises as it isn't always obvious, and when doing code conversions, just convert to the multiplication form.

Determine Hash-Function Properties

If you run make stringhash-test, a program named stringhash-test will be generated. (The source code for this program is not part of the starter code for the assignment, but it is included on the CS 70 server). Use this program to determine the properties of your selected hash functions.

The stringhash-test program takes two command-line arguments:

The number of hash-table buckets to simulate; and,
The file name of a dictionary of words.

In the /cs70/data directory, we have provided several dictionary files for you to use:

smalldict.words (about 350 words)
ispell.words (about 35,000 words)
largedict.words (about 280,000 words)
web2 (about 250,000 words)
all.words (about 410,000 words)

Although you can start off by running the tests with a small number of buckets and a small dictionary, if you have a decent hash function you can also just go ahead and try the web2 word list with a large number of buckets; for example,

./stringhash-test 150000 /cs70/data/web2

As an example of good performance, here are the results of running against the web2 file with the C++ standard library's provided hash function:

== Testing std::hash<std::string>

Inserted 234937 items in 47 ms

For 150000 buckets:
  - Empty bucket percentage:            20.9027%
  - Min items per bucket:               0
  - Max items per bucket:               10
  - Average items per bucket:           1.56625
  - Standard deviation:                 1.00081
  - Average items per non-empty bucket: 1.98015

For 'infinite' buckets (actually 2^32):
  - Expected: 6.42533 hash collisions, stddev 2.53482
  - Actual:   234930 unique hashes, 7 hash collisions:
  - Specifically:
        4243118819: subrhomboid soulcake
        687986930: Pleurotomaria neogenesis
        1985490478: predynamite blasphemy
        3339483196: doxologize bemoaningly
        418121517: phlebitic bebannered
        1406104780: supertranscendent Koreishite
        1751510580: Tat Occamist

Average avalanche:
        32.0234/64 bits for one bit change in 1 character key
        31.9893/64 bits for one bit change in 2 character key
        31.9989/64 bits for one bit change in 4 character key
        31.9997/64 bits for one bit change in 8 character key
        32.0005/64 bits for one bit change in 16 character key
        31.9943/64 bits for one bit change in 32 character key
        31.9986/64 bits for one bit change in 64 character key
        32.0072/64 bits for one bit change in 128 character key
        32.0046/64 bits for one bit change in 256 character key

Interpreting the Results

The tests show some useful information about the performance of the chosen hash table in three blocks:

Real-World Performance

The first part of the test shows the kind of behavior you might see in a hash table in practice. For example, in the results above, theory tells us to expect 20.8826% of the buckets to be empty, and the results almost exactly match that value.

Performance with an Extremely Large Hash Table

The second part imagines a very large hash table with 2³² buckets. The number of collisions should be very small, and there should be no obvious similarities between the few words that collide.

Avalanche Behavior

The final test provides “avalanche” information. One of the properties we want hash functions to have is that very similar keys don't produce similar hash values. We want a nice uniform distribution, not hash values clustering together.

The avalanche test tells you the result of the question, “If we hash a string of a specific length and then change just one bit at random, how different is the new hash value from the original hash value (in terms of the average number of bits in the 64-bit number that change)?” Ideally, half the bits should change, and you can see that std::hash does a good job of meeting that goal.

One more bit of information:

Speed: The test also measures speed. Both hashing speed and the spread of hash values impact the performance of a hash table, but speed results will probably only be meaningful for a large dictionary like web2 because hashing is so quick overall.

It's also only meaningful to compare the speeds of good hash functions—because a bad hash function maps to fewer unique buckets, it may get an “unfair” speed advantage in this particular test.

Choose and Explain

Imagine that you were going to use your hash table to implement a spell checker. It would store English-language words and then be used to look up strings to see if they appear in the dictionary. Using the results of your tests, decide which of your three hash functions you would choose for this scenario.

Briefly document the rationale for your choice of hash function as a block comment in stringhash.cpp. There is a TODO placeholder in the spot where your explanation should go.

Note that there is no “best” hash function! Hash functions all exist in a space of trade-offs. So there isn't a “correct” answer to this question, but your rationale should discuss the trade-offs that you are making with your selection. We are looking for sensible reasoning and thoughtful comparisons, not for some objectively correct response.

Helpful Hints

If you find a hash function written in C, it will probably look something like

unsigned int simpleAddHash(const char* str) {
    size_t hash = 0;

    // C strings end with a zero byte (`\0`), so we loop until we see it.
    for (char* p = str; *p != '\0'; ++p) {
        hash = hash + *p;
    }

    return hash;
}

It might even look like this code, which is equivalent to the above, but “cleverer”:

unsigned int simpleAddHash(const char* str) {
    size_t hash = 0;
    char c;

    while ((c = *str++)) {
        hash = hash + c;
    }

    return hash;
}

That second version seems hard to understand.
I get it, the = is assignment, putting *str into c, and the loop exits when c is zero. Does it run faster that way?
No. Quite a lot of C programmers think this kind of code will somehow run faster. Maybe that was true in 1972, but it isn't true with a modern compiler. Some C programmers also think that this kind of cryptic code looks impressive.
Maybe they think it's is cool because when you finally figure out what it's doing, you have a “aha!” moment?
Meh. I'm not impressed by gibberish code. Even if it is shorter.

You would need to convert functions like these to use a C++ std::string; for example,

size_t simpleAddHash(const string& str) {
    size_t hash = 0;

    for (unsigned char c : str) {
        hash = hash + c;
    }

    return hash;
}

Similarly, some C programmers might write
```
x = (x << 2) + x;
```
instead of
```
x = x * 5;
```
thinking that a binary shift and an addition is faster than a multiplication.

But, once again, they're mistaken. The compiler already understands that 5*x and 4*x + x are the exact same thing. Writing code this way just makes the code needlessly cryptic.
Since this class isn't about converting weird, cryptic, falsely optimized C code to C++, you can absolutely get ChatGPT's help to do this conversion.
Or just choose a better written hash function to work with.

(When logged in, completion status appears here.)

Previous Page Next Page

CS 70

Warning: Missing Prerequisites