CS70, Spring 2004

Homework Assignment #10

This assignment is due at 9 P.M. on Wednesday, April 21st, 2004. As usual, the README file is due at midnight the same day (i.e., the moment that Thursday starts). Refer to the homework policies page for submission instructions and general homework guidelines.

The primary purpose of this assignment is to gain experience with hash tables.

Overview

In this assignment, you will create a simple spell checker. The program will read a dictionary from a file that is given as the first argument, insert the words into a hash table, and report collision statistics.

After reading the dictionary, the spelling checker will read a list of words from the standard input. Each word will be looked up in the dictionary. If it is incorrect, it will be written to the standard output together with a list of suggested corrections. (This is similar to ispell's -a mode.) The algorithm for generating corrections is given below.

The Hash Table

The dictionary will consist of a list of words, separated by whitespace. For convenience, the words will be given in lower case, so you do not need to worry about capitalization issues. You will need to insert them into a hash table that grows dynamically as necessary to hold the dictionary while keeping the load factor low enough. It is up to you to decide how to handle collisions: separate chaining, linear probing, quadratic probing, or rehashing. Your hash table must be implemented as a general-purpose class, although it does not need to be templated.

Hash Functions

Designing a good hash function is something of a black art. We have provided a separate Web page that briefly discusses some hash functions and how they work. However, you don't necessarily have to write a hash function as part of this assignment.

Because we don't have time to cover hash functions in lecture, and because of the limited amount of time you have to work on the program, we have provided a hash function for you. Actually, we have provided several for you to choose from, together with a header file that you can #include so that they are easy to use.

If you are short on time, we suggest that you use hashStringCRC (with a prime table size) or hashStringBUZ as your hash function. However, if you have more time, we suggest that you experiment with several different hash functions to find out which works best (in terms of the collision statistics).

All of the hash functions have descriptive comments in the source file. Before you choose a function, be sure to read the comments (for example, some functions work very badly with certain table sizes).

You are not required to use one of our hash functions. If you want to experiment with writing your own, please feel free. However, you should test your function thoroughly so that you can be sure that it gives you good collision statistics in a wide variety of conditions.

Hash Table Statistics

To help you understand how your hashing code works, you should track and report the following statistics:

The number of times you had to expand the table.
The load factor in the table.
The number of insertions that encountered a collision.
The length of the longest known collision chain (depending on your collision-handling method, this might be less than the length of the longest chain in the table: why?).

Note that all but the first of these statistics will need to be reinitialized whenever you expand the table. After you read the dictionary, you should report the above statistics to cerr. If you wind up with a collision chain longer than about 15, there is something seriously wrong with your hash function or your collision method, and points will be deducted. (This means that linear probing is probably inappropriate.)
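As a rough illustration of how the statistics might be tracked, here is a minimal sketch of a separate-chaining table for strings. Everything in it is an assumption made for the example: the class and member names, the 0.75 load-factor limit, the "double plus one" growth rule, and the placeholder hash function (which is not one of the provided ones). It uses the library list class only to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Illustrative sketch only; all names and constants are hypothetical.
class StringHashTable {
    typedef std::list<std::string> Chain;

public:
    explicit StringHashTable(std::size_t initialSize = 11)
        : size_(initialSize), count_(0), expansions_(0),
          collisions_(0), longestChain_(0),
          buckets_(new Chain[initialSize])
    {
        assert(initialSize > 0);
    }

    ~StringHashTable() { delete[] buckets_; }

    void insert(const std::string& word)
    {
        if (contains(word))
            return;                       // ignore duplicates
        // Expand before the load factor would exceed 3/4 (assumed limit).
        if (4 * (count_ + 1) > 3 * size_)
            expand();
        place(word);
    }

    bool contains(const std::string& word) const
    {
        const Chain& chain = buckets_[hash(word) % size_];
        for (Chain::const_iterator it = chain.begin(); it != chain.end(); ++it) {
            if (*it == word)
                return true;
        }
        return false;
    }

    std::size_t wordCount() const    { return count_; }
    std::size_t expansions() const   { return expansions_; }
    std::size_t collisions() const   { return collisions_; }
    std::size_t longestChain() const { return longestChain_; }
    double loadFactor() const { return double(count_) / double(size_); }

private:
    StringHashTable(const StringHashTable&);            // not copyable
    StringHashTable& operator=(const StringHashTable&); // not assignable

    // Placeholder hash, used only so the sketch is self-contained.
    static unsigned long hash(const std::string& word)
    {
        unsigned long h = 0;
        for (std::string::size_type i = 0; i < word.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(word[i]);
        return h;
    }

    // Add a word known to be absent, updating the statistics.
    void place(const std::string& word)
    {
        Chain& chain = buckets_[hash(word) % size_];
        if (!chain.empty())
            ++collisions_;                // this insertion hit a collision
        chain.push_back(word);
        if (chain.size() > longestChain_)
            longestChain_ = chain.size();
        ++count_;
    }

    // Grow the table and re-hash every word; all statistics except the
    // expansion count start over and are recomputed during the re-hash.
    void expand()
    {
        Chain* oldBuckets = buckets_;
        const std::size_t oldSize = size_;
        size_ = 2 * oldSize + 1;
        buckets_ = new Chain[size_];
        count_ = collisions_ = longestChain_ = 0;
        ++expansions_;
        for (std::size_t i = 0; i < oldSize; ++i) {
            for (Chain::const_iterator it = oldBuckets[i].begin();
                 it != oldBuckets[i].end(); ++it)
                place(*it);
        }
        delete[] oldBuckets;
    }

    std::size_t size_;
    std::size_t count_;
    std::size_t expansions_;
    std::size_t collisions_;
    std::size_t longestChain_;
    Chain* buckets_;
};
```

With separate chaining the longest chain is known exactly; with probing methods the same bookkeeping would count probes instead, which is one reason the reported figure can differ from the true longest chain.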

Spell Checking

Once the dictionary has been created, your program will read a list of words from standard input. If a word is found in the dictionary, your program should produce no output. Otherwise, you should generate suggested corrections and write them, together with the original word (converted to lowercase), as a single output line. For example, suppose the input word was "Wird". The output might be:

wird: bird gird ward word wild wind wire wiry

Unlike the dictionary, the words input to your program may be in any case. You can convert a string to lower case by including the cctype header file and using the isupper and tolower functions:

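#include <cctype>
#include <string>
...
    string mystring("ABcdEFg!@KLm");
    for (string::iterator nextChar = mystring.begin();
      nextChar != mystring.end();
      ++nextChar) {
        // Cast to unsigned char: the <cctype> functions are undefined
        // for negative char values.
        if (isupper(static_cast<unsigned char>(*nextChar)))
            *nextChar = tolower(static_cast<unsigned char>(*nextChar));
    }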

Generating Corrections

The easiest way to generate corrections in a spell checker is a trial-and-error method. If we assume that the misspelled word contains only a single error, we can try all possible corrections and look each up in the dictionary.

Traditionally, spelling checkers have looked for four possible errors: a wrong letter ("wird"), an inserted letter ("woprd"), a deleted letter ("wrd"), or a pair of adjacent transposed letters ("wrod"). To simplify this assignment, you will only need to deal with the first possibility, a wrong letter. When a word isn't found in the dictionary, you will need to look up all variants that can be generated by changing one letter. For example, given "wird," you should look up "aird", "bird", "cird", etc. through "zird", then "ward", "wbrd", "wcrd" through "wzrd", and so forth. Whenever you find a match in the dictionary, you should add it to your output line.
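The single-letter substitution search described above could be sketched roughly as follows. The function name is hypothetical, and std::set is used purely as a stand-in for the dictionary so that the example is self-contained; in the assignment the lookup would go to your own hash table. The sketch assumes the word has already been converted to lowercase.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns every dictionary word that differs from
// `word` in exactly one letter position, in the order the variants are
// tried (position by position, 'a' through 'z').
std::vector<std::string> suggestCorrections(
    const std::string& word,
    const std::set<std::string>& dictionary)   // stand-in for the hash table
{
    std::vector<std::string> found;
    for (std::string::size_type pos = 0; pos < word.size(); ++pos) {
        std::string variant = word;
        for (char letter = 'a'; letter <= 'z'; ++letter) {
            if (letter == word[pos])
                continue;                      // that is the original word
            variant[pos] = letter;
            if (dictionary.count(variant) > 0)
                found.push_back(variant);
        }
    }
    return found;
}
```

Trying the positions from left to right and the letters in alphabetical order yields the same ordering as the "wird" example: bird, gird, ward, word, and so on.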

Input Format

Both the dictionary and the file to be spell-checked consist of arbitrary-length words separated by whitespace. The easy way to represent them is as C++ strings. You can then easily read them in and manipulate them using something like:

    string word;
    // ...
    while (cin >> word) {
        // Example manipulation: capitalize the first letter of each word.
        if (islower(word[0]))
            word[0] = toupper(word[0]);
    }

When used with a string, the >> operator will skip over any whitespace and then grab the next string of non-whitespace characters -- exactly what you need.

For convenience, neither the dictionary nor the input file will contain punctuation. If you would like to test your spelling checker on a "real" input file (such as your README), you can remove the punctuation with the tr program. The method for using tr varies depending on your system. On any sane operating system (e.g., Linux), you could do:

    tr -c 'A-Za-z \010-\015' ' ' < README | ./assign_10 my-dictionary

On Turing, however, you have to use a broken notation:

    tr -c '[A-Za-z \010-\015]' '[ *0]' < README | ./assign_10 my-dictionary

(You may find it instructive to study the tr manual page to learn how the above command works.)

If you have a file that has already been cleaned up (so it only contains alphabetics and whitespace), you could do:

    ./assign_10 my-dictionary < error-filled-file.txt

You can also just type directly to stdin:

    ./assign_10 my-dictionary

In that case, you'll need to type control-D at the end of your input to generate an EOF.

Sample Dictionaries

You may wish to create a very small sample dictionary of your own for initial testing. A slightly larger dictionary of 341 words should help you to get most of your bugs out. When you're fairly confident, you can try your luck with over 34,000 words in an all-lowercase version of the ispell dictionary. The latter file can be found on Turing in "~cs70grad/ispell.words".

Output Format

The spell-check program should produce one line of statistical output on cerr, and zero or more lines of correction output on cout.

The statistical output should be in the format:
n expansions, load factor f, n collisions, longest chain n
where n is an integer and f is a floating-point (double) number.
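One way to build that line is sketched below. The function name and parameter names are made up for the example, and the default stream formatting is assumed to be acceptable for the load factor.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: builds the statistics line in the required format.
std::string formatStatistics(std::size_t expansions, double loadFactor,
                             std::size_t collisions, std::size_t longestChain)
{
    std::ostringstream line;
    line << expansions << " expansions, load factor " << loadFactor
         << ", " << collisions << " collisions, longest chain "
         << longestChain;
    return line.str();
}
```

The result would then be written with something like cerr << formatStatistics(...) << endl; so that it stays separate from the corrections on cout.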

Each line in the correction output should consist of the incorrect word, followed by a colon and zero or more corrections, separated by spaces. There should be exactly one space after the colon (unless there are no corrections), and there should be no space at the end of the line. The following are examples of valid output lines:

xyzzy:
foo: for
wird: bird gird ward word wild wind wire wiry
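The spacing rules are easy to get wrong, so here is a small sketch of one way to satisfy them: put the space before each correction rather than after it. The function name is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: "word:" followed by " correction" for each
// suggestion, which leaves no space after the colon when the list is
// empty and no trailing space at the end of the line.
std::string formatCorrections(const std::string& word,
                              const std::vector<std::string>& corrections)
{
    std::string line = word + ":";
    for (std::vector<std::string>::size_type i = 0;
         i < corrections.size(); ++i) {
        line += ' ';
        line += corrections[i];
    }
    return line;
}
```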

If every word in the input is found in the dictionary, the spell checker should produce no output on cout.

Sample Files

When you check out your copy of the assignment, you will get a copy of "simple.dict", the small dictionary. Because the ispell dictionary is moderately large, it is not included in the checkout. Instead, you can use it directly from the CS70 grader account. For example:

    ./assign_10 ~cs70grad/ispell.words < error-filled-file.txt

No Restrictions on C++ Libraries

With one exception, there are no restrictions on your use of the standard C++ libraries for this assignment: you are not allowed to use a hash-table library! In particular, you are allowed to use the string type, which will greatly simplify your life on input. You can read a word from a stream (either for the dictionary or to spell-check it) with code like this:

    string nextWord;
    stream >> nextWord;
    if (!stream) {
        // EOF was hit (or the read failed), so nextWord is not valid.
    }

Although you are allowed to use the library's list class, we would prefer that you use your own templated list class from previous assignments. After all, that's why you put so much work into it, right?

Also, a small word of warning: you may be tempted to use the vector class from the library to manage your hash buckets. Although it is possible to do so effectively, it is trickier than might first appear. In particular, the resize function is not appropriate for resizing hash tables. I recommend that rather than using vector, you simply manage the array of hash buckets yourself.

Compilation

The code you submit will be compiled with the g++ options -Wall and -pedantic. Your program should produce no errors or warning messages when compiled with these options on Turing. If you absolutely cannot get rid of a warning, even with the help of the professor or the graders, document it in the README file along with the names of anyone who helped you try to understand the problem.

Submitting

As usual, you must check out your assignment before beginning by using "cs70checkout hw10". This is true even though you will be writing 100% of the program yourself. The checkout will provide you with two C++ source files, hashfuncs.hh and hashfuncs.cc, which implement a variety of hash functions. However, you are not required to use these hash functions.

Your submission should consist of a number of files:

Makefile: A "make file" containing instructions on how to compile your program with the make utility.
The makefile you provide must produce an executable named assign_10.
assign_10.cc: The C++ code for your main program for the assignment.
*.hh, *.cc, *.icc: Header and source files containing the classes you implement. Some of these can be lifted directly from previous assignments, or can be extended versions of classes in previous assignments. It is up to you to choose the names for these files.
README: A documentation file, as specified in the homework policies page. Note that this file is not due until 3 hours after the other files in the assignment.

If you wish, you can create other files to help you develop this assignment, but it is not necessary.

When you have a working solution, you must submit your files with cs70submit. If you create any new files, you need to tell the submission system about them by mentioning them once on a cs70submit command line. For convenience, we have provided dummy versions of README, Makefile, and assign_10.cc so that they will be sure to get submitted.

Tricky Stuff

As usual, there are parts of this assignment that contain traps. Here are a few:

When you are developing a hash table, it is wise to start your debugging with a dummy hash function that always returns zero. Once you are sure your collision handling works correctly, you can write a real hash function.
If you are using one of the provided hash functions, try several to see which one works best for you.
If you get excessive collisions, be sure your hash function is returning values that are well spread out. Test it with "nearby" values such as "aaa", "aab", "baa", etc.
Remember that the hash functions can return a number larger than the table size (except for hashStringBase256). You must reduce the hash value yourself to make sure you don't go beyond your array bounds.
If you use separate chaining, you can use your list class (with a few extensions) to manage the chains, or you can use the STL list class. Alternatively, you could re-implement the list functions inside your hash-table code, but that approach isn't as "C++-ish".
Remember that when you expand your hash table, you must re-hash everything in the current table, since the wrapping due to the modulo function will change. If you are using separate chaining, don't forget to follow your collision chains appropriately.
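Two of the points above can be sketched in a few lines: a dummy hash function for early debugging, and the reduction of a raw hash value to a bucket index. The names are hypothetical, and the unsigned long return type is an assumption; check hashfuncs.hh for the actual signatures of the provided functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Debugging aid: every key lands in bucket 0, so every insertion after
// the first exercises the collision-handling code.
unsigned long hashAlwaysZero(const std::string&)
{
    return 0;
}

// Reduce a raw hash value to a valid index in a table of tableSize
// buckets. The result depends on tableSize, which is why every entry
// has to be re-hashed when the table is expanded.
std::size_t bucketIndex(unsigned long hashValue, std::size_t tableSize)
{
    assert(tableSize > 0);
    return static_cast<std::size_t>(hashValue % tableSize);
}
```

For instance, a raw hash value of 25 falls in bucket 4 of a 7-bucket table but in bucket 10 of a 15-bucket table, so an entry left in place after an expansion would no longer be found.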

There is more information on using C++ on Turing available in the departmental quick-reference guide and the C++ quick reference guide. You can find information about debugging in the gdb quick reference guide.

This page is maintained by Geoff Kuenning.