Phase 2: Evaluating Hash Functions
A very important part of using a hash table is choosing a hash function!
In this assignment, users of your HashSet<T> will need to provide a myhash function for their type.
In this part, you'll be exploring possible hash functions for strings, and develop a myhash function for std::strings that users can use if they don't want to come up with one themselves.
Let's Go
Find Hash Functions
In this part (and only this part) of the assignment, your hash functions do not need to be your own original work. In particular, in an exception to the usual course rules, you may (and are encouraged to!) ask ChatGPT or similar assistants to help in the creation of the hash functions, as described below.
Remember that attribution is still required.
Find (at least) three distinct hash functions for strings and add their implementations to stringhash.cpp (replacing the three placeholder functions currently there). Be sure to update the contents of the hashInfo list at the bottom of stringhash.cpp as described in the comment above hashInfo.
You may copy code from any source for the hash functions except for
- The work of current or prior CS 70 students.
- Any implementation of the C++ standard library's
std::hashfunction.
Our lessons have described one straightforward hash function for strings and you can find many more examples in textbooks, Wikipedia, and other online sources. You must adhere to the following rules:
- Ensure that your version of the hash function takes only a
std::string(by constant reference) and returns asize_tvalue.- Some example code you find may use C-style strings. Adapting that code to use C++ strings will require some changes to the code; see the hints below.
- Similarly, some examples include the table size as an additional parameter. In most cases, you can easily remove that part of the code as it will just have
% tableSizeat the end of the function. - If you find code that is in C (or very old-style C++, or in some other language), you will need to adapt it to modern C++. You may use a language model like ChatGPT to help you with this process, as described below.
- Avoid hash functions that require additional libraries or significant support infrastructure. The hash function should be a self-contained function that can easily be copied into your code.
- Try to find functions that are reasonably different from one another. For example, if you find several variants of the same basic idea (e.g., all based on polynomial accumulation with different constants), they may not be particularly interesting for testing.
- Provide attribution
- Whenever your code is based on someone else's, you must be clear about its provenance, even if you make substantial modifications (e.g., converting code that worked on C strings to work with C++'s
std::string).- Failing to provide attribution is plagiarism and is an Honor Code violation.
- If the code has an open-source license, you should note which one.
- Note that it is our understanding that submitting a homework assignment as part of an educational experience does not count as “distribution” under software-licensing rules.
- If you use ChatGPT or another coding assistant, check any citations it provides to make sure that they aren't hallucinations.
- Whenever your code is based on someone else's, you must be clear about its provenance, even if you make substantial modifications (e.g., converting code that worked on C strings to work with C++'s
- Only obtain code for hash functions, nothing else!
- You may not copy code for other parts of the assignment, including hash tables themselves.
Using ChatGPT or Other AI Assistants
Unlike any other parts of assignments in the class, you may use ChatGPT (or similar AI assistants) to help you find hash functions (but not for any other parts of the assignment). If you do, you must provide attribution to ChatGPT in your code. You may not, however, enable GitHub Copilot in Visual Studio Code. Use the chat interface directly in a web browser.
Thus, you may use a conversation with ChatGPT to
- Suggest sources to find C++ hash functions,
- Adapt code written for other languages to work with a
std::stringparameter and return asize_tvalue, or, - Create innovative new hash functions based on its knowledge of the subject area.
A sample conversation can be found here, which includes several example hash functions that you can use, but you can also just use the chat as inspiration to guide your own conversation with ChatGPT.
Here's some suggested prompt context you can include in your requests to ChatGPT:
When providing hash function(s), please note the following:
* The hash function(s) should be named in camelCase, based on their origin or characteristics.
* The hash function(s) should take a single parameter of type `std::string` (by constant reference) and return a `size_t` value.
* The hash function(s) should not require any additional libraries or significant support infrastructure beyond what is in the C++20 standard library.
* There is no need for a table-size parameter; the hash function(s) should just return a `size_t` hash value which will be reduced modulo the table size later.
* Please include comments indicating the source of the hash function(s) and any applicable open-source license information. Be sure to provide attribution for your own contributions to the code as well.
* This code will be used as part of hash-function testing in a homework assignment, so we will be hoping to have some variety in the hash functions provided. None of the hash functions should be obviously terrible, but equally, if they're all variants on the same basic idea, that won't be very interesting for testing.
Determine Hash-Function Properties
If you run make stringhash-test, a program named stringhash-test will be generated. (The source code for this program is not part of the starter code for the assignment, but it is included on the CS 70 server). Use this program to determine the properties of your selected hash functions.
The stringhash-test program takes two command-line arguments:
- The number of hash-table buckets to simulate; and,
- The file name of a dictionary of words.
In the /cs70/data directory, we have provided several dictionary files for you to use:
smalldict.words(about 350 words)ispell.words(about 35,000 words)largedict.words(about 280,000 words)web2(about 250,000 words)all.words(about 410,000 words)
Although you can start off by running the tests with a small number of buckets and a small dictionary, if you have a decent hash function you can also just go ahead and try the web2 word list with a large number of buckets; for example,
./stringhash-test 150000 /cs70/data/web2
As an example of good performance, here are the results of running against the web2 file with the C++ standard library's provided hash function:
== Testing std::hash<std::string>
Inserted 234937 items in 47 ms
For 150000 buckets:
- Empty bucket percentage: 20.9027%
- Min items per bucket: 0
- Max items per bucket: 10
- Average items per bucket: 1.56625
- Standard deviation: 1.00081
- Average items per non-empty bucket: 1.98015
For 'infinite' buckets (actually 2^32):
- Expected: 6.42533 hash collisions, stddev 2.53482
- Actual: 234930 unique hashes, 7 hash collisions:
- Specifically:
4243118819: subrhomboid soulcake
687986930: Pleurotomaria neogenesis
1985490478: predynamite blasphemy
3339483196: doxologize bemoaningly
418121517: phlebitic bebannered
1406104780: supertranscendent Koreishite
1751510580: Tat Occamist
Average avalanche:
32.0234/64 bits for one bit change in 1 character key
31.9893/64 bits for one bit change in 2 character key
31.9989/64 bits for one bit change in 4 character key
31.9997/64 bits for one bit change in 8 character key
32.0005/64 bits for one bit change in 16 character key
31.9943/64 bits for one bit change in 32 character key
31.9986/64 bits for one bit change in 64 character key
32.0072/64 bits for one bit change in 128 character key
32.0046/64 bits for one bit change in 256 character key
Interpreting the Results
The tests show some useful information about the performance of the chosen hash table in three blocks:
- Real-World Performance
- The first part of the test shows the kind of behavior you might see in a hash table in practice. For example, in the results above, theory tells us to expect 20.8826% of the buckets to be empty, and the results almost exactly match that value.
- Performance with an Extremely Large Hash Table
- The second part imagines a very large hash table with 232 buckets. The number of collisions should be very small, and there should be no obvious similarities between the few words that collide.
- Avalanche Behavior
-
The final test provides “avalanche” information. One of the properties we want hash functions to have is that very similar keys don't produce similar hash values. We want a nice uniform distribution, not hash values clustering together.
The avalanche test tells you the result of the question, “If we hash a string of a specific length and then change just one bit at random, how different is the new hash value from the original hash value (in terms of the average number of bits in the 64-bit number that change)?” Ideally, half the bits should change, and you can see that
std::hashdoes a good job of meeting that goal.
One more bit of information:
- Speed
- The test also measures speed. Both hashing speed and the spread of hash values impact the performance of a hash table, but speed results will probably only be meaningful for a large dictionary like
web2because hashing is so quick overall.
It's also only meaningful to compare the speeds of good hash functions—because a bad hash function maps to fewer unique buckets, it may get an “unfair” speed advantage in this particular test.
Choose and Explain
Imagine that you were going to use your hash table to implement a spell checker. It would store English-language words and then be used to look up strings to see if they appear in the dictionary. Using the results of your tests, decide which of your three hash functions you would choose for this scenario.
Briefly document the rationale for your choice of hash function as a block comment in stringhash.cpp. There is a TODO placeholder in the spot where your explanation should go.
Note that there is no “best” hash function! Hash functions all exist in a space of trade-offs. So there isn't a “correct” answer to this question, but your rationale should discuss the trade-offs that you are making with your selection. We are looking for sensible reasoning and thoughtful comparisons, not for some objectively correct response.
Helpful Hints
If you find a hash function written in C, it will probably look something like
unsigned int simpleAddHash(const char* str) {
size_t hash = 0;
// C strings end with a zero byte (`\0`), so we loop until we see it.
for (char* p = str; *p != '\0'; ++p) {
hash = hash + *p;
}
return hash;
}
It might even look like this code, which is equivalent to the above, but “cleverer”:
unsigned int simpleAddHash(const char* str) {
size_t hash = 0;
char c;
while ((c = *str++)) {
hash = hash + c;
}
return hash;
}
That second version seems hard to understand.
I get it, the
=is assignment, putting*strintoc, and the loop exits whencis zero. Does it run faster that way?
No. Quite a lot of C programmers think this kind of code will somehow run faster. Maybe that was true in 1972, but it isn't true with a modern compiler. Some C programmers also think that this kind of cryptic code looks impressive.
Maybe they think it's is cool because when you finally figure out what it's doing, you have a “aha!” moment?
Meh. I'm not impressed by gibberish code. Even if it is shorter.
You would need to convert functions like these to use a C++ std::string; for example,
size_t simpleAddHash(const string& str) {
size_t hash = 0;
for (unsigned char c : str) {
hash = hash + c;
}
return hash;
}
Similarly, some C programmers might write
x = (x << 2) + x;instead of
x = x * 5;thinking that a binary shift and an addition is faster than a multiplication.
But, once again, they're mistaken. The compiler already understands that
5*xand4*x + xare the exact same thing. Writing code this way just makes the code needlessly cryptic.
Since this class isn't about converting weird, cryptic, falsely optimized C code to C++, you can absolutely get ChatGPT's help to do this conversion.
Or just choose a better written hash function to work with.
(When logged in, completion status appears here.)