CS 70, Spring 2001
Assignment 7: DNA Recombination

The program for this assignment, and everything else except the README file, is due at 9 PM on Wednesday, March 28th, 2001. As usual, the README file is due at 12 midnight on the same day (i.e., the moment Thursday starts). Refer to the homework policies page for general homework guidelines.

The primary purpose of this assignment is to get you used to writing C++ iterators. You will also be developing a preliminary list class. Both the list class and the iterator for it will be useful to you in later assignments.

NOTE: the list class you develop in this assignment will be central to assignments 8 and 9. MAKE SURE you develop it well and debug it thoroughly. If you blow this assignment off, you will do poorly on the following two assignments as well. A correct solution to this assignment will NOT be distributed to the class. It is your responsibility to be sure you have a properly working list class and iterator.

Overview

One of the more creative approaches to artificial intelligence is the genetic algorithm, invented by Prof. John Holland of the University of Michigan.

In brief, a genetic algorithm simulates the process of evolution by applying the usual rules of genetics to simulate natural selection. In real life, natural selection's primary goal is the continuation of the species, and organisms that achieve that goal tend to be propagated. In a genetic algorithm, on the other hand, the primary goal is to satisfy a "fitness function" chosen by the programmer. For example, a simple fitness function might interpret the genes of an organism as the value of x in a complicated equation. The natural-selection process could then be tuned to prefer organisms that generate an output near zero, so that the survivors would eventually produce a solution to the equation.

Genetic algorithms were the first step in the current research area called "artificial life", and they have been used to successfully solve many problems that were otherwise intractable.

In this assignment, we will create a program that uses a genetic algorithm to find approximate square roots of integers. Although it is simplified compared to a production implementation, the program demonstrates the basic outline and capabilities of a genetic algorithm.
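To make the idea of a fitness function concrete for the square-root problem, here is a minimal sketch. It is not the code in assign_07.cc: the names genesToValue and fitnessError are hypothetical, a std::vector<int> stands in for the gene list, and the sketch assumes that the genes are decimal digits with the most significant digit first.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical helper: interpret a gene sequence of decimal digits
// (most significant digit first) as a non-negative integer.
long genesToValue(const std::vector<int>& genes)
{
    long value = 0;
    for (int digit : genes) {
        assert(digit >= 0 && digit <= 9);   // each gene is one digit
        value = value * 10 + digit;
    }
    return value;
}

// Hypothetical fitness measure: how far x*x is from the target.
// Smaller is better; 0 means x is an exact square root.
long fitnessError(const std::vector<int>& genes, long target)
{
    long x = genesToValue(genes);
    return std::labs(x * x - target);
}
```

With this measure, the genes 0001414 score better against a target of 2000000 than the genes 0001415 do, because 1414 * 1414 = 1999396 is closer to the target than 1415 * 1415 = 2002225.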

There are three basic processes in evolution: mutation, crossover, and selection. Mutation involves selecting a gene site and modifying it in some fashion, usually by replacing it with another gene. Mutation is very rare both in real life and in genetic algorithms. Crossover is the most important process in generating new organisms. It involves taking two gene strings (usually from two parent organisms), cutting them both at the same point, and re-splicing them so that the head of the result comes from one parent and the tail from the other. Real genetic algorithms usually generate two children in this process, and may splice at more than one point, but we'll simplify things in our implementation.

The final step, selection, involves evaluating the organisms according to some criterion (the "fitness function") and choosing the ones that are most successful. In real life, selection is the harsh process of "survival of the fittest." In a genetic algorithm, the same method is used: the least fit organisms are discarded (i.e., killed) without being allowed to reproduce. As in real life, there is some randomness, so that a somewhat unfit organism has a chance of surviving even when a more fit one is discarded. This randomness turns out to be important to the success of the method, since any two slightly unfit parents might (through crossover) generate an extremely fit child.

Because we will not have time to implement an entire genetic algorithm, much of the code has been provided for you. You must supply the underlying data structure (a linked list), and must also write the two small functions that perform mutation and crossover.

Data Structures

IntList

An organism will be represented entirely by its gene sequence, which in turn will be represented using a singly linked list. Each element in the list will contain only a single integer from 0 to 9 (represented by the C++ type int), plus a link to the next element. The list must have a separate header that is not a plain element, which means that you must implement two classes (the header and the element). The cleanest approach is to make the element a nested private class of the header, so that only the header (IntList) is visible from outside.
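The header-plus-nested-element layout described above might be sketched as follows. This is only an illustration in modern C++ (C++11 or later), not the required interface: the member names head_, size_, value_, and next_ and the size function are assumptions, and copying is disabled purely to keep the sketch short (a real solution will need proper copying).

```cpp
#include <cassert>
#include <cstddef>

class IntList
{
    public:
        IntList() : head_(nullptr), size_(0) {}
        ~IntList() { while (!isEmpty()) popHead(); }

        // Copying is disabled only to keep this sketch short.
        IntList(const IntList&) = delete;
        IntList& operator=(const IntList&) = delete;

        bool isEmpty() const { return head_ == nullptr; }
        std::size_t size() const { return size_; }

        // Insert a new element at the front of the list.
        void pushHead(int value)
        {
            head_ = new Element(value, head_);
            ++size_;
        }

        // Remove the front element and return its value.
        int popHead()
        {
            assert(!isEmpty());
            Element* old = head_;
            int value = old->value_;
            head_ = old->next_;
            delete old;
            --size_;
            return value;
        }

    private:
        // The element type is private, so only IntList is visible
        // from outside.
        struct Element
        {
            Element(int value, Element* next)
                : value_(value), next_(next) {}
            int value_;         // a single gene
            Element* next_;     // link to the next element, or nullptr
        };

        Element* head_;         // first element, or nullptr if empty
        std::size_t size_;      // number of elements in the list
};
```

Because Element is private, client code can never hold a pointer to a list element directly. That restriction is what makes a separate iterator class necessary.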

You are not allowed to use a doubly linked list in this assignment.

Your linked list must be named IntList (so that it can be used by the main driver program) and must support the following operations. Note that, since the main driver program is supplied, the function names cannot be changed.

In addition, you must implement an output operator (operator<<) for IntList. I suggest that you use the technique recommended in Weiss: provide a public print function, and have operator<< call print. The output operator should write all the integers in the list concatenated together, with no blanks or newlines. (This design is a very poor approach in general, and will be changed next term. The right way to do it would be to separate the integers with blanks or commas.)
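The print-then-operator<< technique can be sketched as below. To keep the example self-contained, a hypothetical stand-in class named DigitList (backed by std::vector<int>) is used in place of your IntList; the toString helper is likewise an assumption that exists only to make the result easy to inspect.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for IntList; only the printing idiom matters here.
class DigitList
{
    public:
        explicit DigitList(const std::vector<int>& digits)
            : digits_(digits)
        {
            for (int d : digits_)
                assert(d >= 0 && d <= 9);   // each gene is one digit
        }

        // Public print function: writes the digits with no separators.
        std::ostream& print(std::ostream& out) const
        {
            for (int d : digits_)
                out << d;
            return out;
        }

    private:
        std::vector<int> digits_;
};

// The output operator needs no friend access; it just calls print.
std::ostream& operator<<(std::ostream& out, const DigitList& list)
{
    return list.print(out);
}

// Hypothetical helper: capture the printed form in a string.
std::string toString(const DigitList& list)
{
    std::ostringstream stream;
    stream << list;
    return stream.str();
}
```

The advantage of this arrangement is that operator<< can be an ordinary non-member function that uses only the public interface of the class.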

Finally, you may find it helpful to implement a few other standard list functions: pushHead, popHead, isEmpty, and possibly popTail. Several of these functions will be useful in future assignments, and you will find it much easier to do those assignments if you implement the functions now, while your list class is simple, rather than waiting until later when you have converted it into a templated class. However, only the list above is absolutely required.

IntListIterator

You must also implement an iterator for IntList, which must be named IntListIterator. The iterator must support the following functions at a minimum:

In addition, you may wish to support a copy constructor, assignment operator, and postincrement operator. It would not be appropriate to implement operator->, since int is not a class.
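One possible shape for such an iterator is sketched below, in modern C++ (C++11 or later). The specific operations shown (a constructor taking the list, a boolean validity test, operator*, and preincrement) are assumptions chosen for illustration and are not necessarily the required set; the IntList here is cut down to the minimum needed for the sketch, and sumGenes is a hypothetical helper.

```cpp
#include <cassert>

class IntListIterator;

class IntList
{
    public:
        IntList() : head_(nullptr) {}
        ~IntList()
        {
            while (head_ != nullptr) {
                Element* old = head_;
                head_ = head_->next_;
                delete old;
            }
        }
        IntList(const IntList&) = delete;             // kept short on purpose
        IntList& operator=(const IntList&) = delete;

        void pushHead(int value) { head_ = new Element(value, head_); }

    private:
        friend class IntListIterator;   // the iterator may see Element
        struct Element
        {
            Element(int value, Element* next)
                : value_(value), next_(next) {}
            int value_;
            Element* next_;
        };
        Element* head_;
};

class IntListIterator
{
    public:
        // Start at the first element of the list.
        explicit IntListIterator(IntList& list) : current_(list.head_) {}

        // True while the iterator still refers to an element.
        explicit operator bool() const { return current_ != nullptr; }

        // Access the current gene; returns a reference so it can be changed.
        int& operator*() const
        {
            assert(current_ != nullptr);
            return current_->value_;
        }

        // Preincrement: advance to the next element.
        IntListIterator& operator++()
        {
            assert(current_ != nullptr);
            current_ = current_->next_;
            return *this;
        }

    private:
        IntList::Element* current_;
};

// Hypothetical helper: walk the whole list and add up the genes.
int sumGenes(IntList& list)
{
    int total = 0;
    for (IntListIterator it(list); it; ++it)
        total += *it;
    return total;
}
```

Since operator* returns a reference, the same iterator serves both for reading genes and for replacing one in place.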

What You Need to Build

You are provided with a single file, assign_07.cc, which is the main driver program. You are not allowed to modify assign_07.cc except by adding code at the "ADD STUFF" locations.

You must create or modify the following files:

assign_07.cc
This must be the file that you downloaded from this Web page, with additions at the "ADD STUFF" parts. You may not change any other part of the file.
Makefile
For this assignment, the Makefile will not be provided. You must write your own, and it must be correct. If you do not provide a Makefile, your program will not compile and you will receive a zero for functionality. Be sure your dependencies are correct; you may wish to use g++ -M to help.
intlist.hh
This file will contain the interface definition for the IntList and IntListIterator classes. Note that both classes must be defined by this file, either by placing both definitions in the file, or by having it #include whatever file(s) contain the remaining definitions.
*.hh
Any other header files that you feel are necessary to implement your code. (There is no requirement that there be any other header files, but you might find it useful.)
*.cc
Any other source files that you feel are necessary to implement your code.

In assign_07.cc, you must provide the functions that perform genetic mutation and crossover. The mutation function modifies its list argument in place, changing a single gene at a specified position (0-indexed). You must use your iterator for access to the list. The crossover function creates (and returns) an entirely new list, choosing each gene from one of the two parents depending on the position argument. Again, you must use your iterator to access the parent lists.
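The two operations can be illustrated with the following sketch. It is not the code expected at the "ADD STUFF" locations: std::forward_list<int> (a standard singly linked list) stands in for IntList, the function names and parameter lists are hypothetical, and the sketch assumes that genes before the crossover position come from the first parent and the rest from the second.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>

// Stand-in for IntList so that the sketch is self-contained.
typedef std::forward_list<int> GeneList;

// Replace the gene at a 0-indexed position, in place, using an iterator.
void mutate(GeneList& genes, std::size_t position, int newGene)
{
    GeneList::iterator it = genes.begin();
    for (std::size_t i = 0; i < position; ++i) {
        assert(it != genes.end());      // position must be inside the list
        ++it;
    }
    assert(it != genes.end());
    *it = newGene;
}

// Build a new list: genes before `position` come from `first`,
// genes at or after `position` come from `second`.
// Iteration stops at the end of the shorter parent.
GeneList crossover(const GeneList& first, const GeneList& second,
                   std::size_t position)
{
    GeneList child;
    GeneList::iterator tail = child.before_begin();
    GeneList::const_iterator a = first.begin();
    GeneList::const_iterator b = second.begin();
    for (std::size_t i = 0;
         a != first.end() && b != second.end();
         ++i, ++a, ++b) {
        // Append at the tail so the child keeps the parents' gene order.
        tail = child.insert_after(tail, i < position ? *a : *b);
    }
    return child;
}
```

For example, crossing 1234 with 5678 at position 2 yields 1278 under these assumptions. Note that both parent iterators advance in step, even though only one of them supplies the gene at each position.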

The places where you need to provide code are marked by "ADD STUFF" comments.

Since assign_07.cc is provided to you, you must maintain stylistic consistency in that file. However, you are not required to use any specific coding style in the other files that you create. Since you are creating them from scratch, any good style is acceptable. In particular, you do not have to match the style of assign_07.cc in those files.

As usual, you can also download the provided file as a bundle, either as a gzipped tar file or as a ZIP archive.

Submission Mechanics

For assignment 7, you must submit the following files:

Testing

Testing is your responsibility. We will not provide exact test cases for you. You should test your program a number of times, under different conditions.

In its default condition, the program is nondeterministic (i.e., two successive runs may produce different results). To make testing easier, the program accepts a switch that makes it deterministic. If you use "-S n", where n is an integer, the random seed will be set to that value. Specifying the random seed will allow you to control the program's behavior so that you can reproduce bugs.
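The effect of fixing the seed can be seen in the small sketch below. It does not show how assign_07.cc generates random numbers; it uses the standard std::mt19937 generator as an assumption, and the function name randomGenes is hypothetical.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper: generate `length` random digits from a given seed.
std::vector<int> randomGenes(unsigned seed, int length)
{
    assert(length >= 0);
    std::mt19937 generator(seed);                    // seeded explicitly
    std::uniform_int_distribution<int> digit(0, 9);  // one gene = one digit
    std::vector<int> genes;
    for (int i = 0; i < length; ++i)
        genes.push_back(digit(generator));
    return genes;
}
```

Two calls with the same seed produce the same gene sequence, which is exactly the property that lets you rerun a failing case and see the same bug again.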

You will also find it instructive to run the program with the -d switch, and to run it for many different values of the -g, -m, -p, -r, and -s switches. Judicious reading of the comments, together with experimentation, will reveal the purpose of these switches and how they interact.

We will not limit ourselves to running only simple test cases. You can expect that we will run stress tests in an attempt to break your program. I strongly suggest that you attempt to break it yourself, so that we won't be able to do so. In particular, make sure you ask it to find the roots of a lot of numbers, all on one command line.

Sample Runs

To make it clearer how the program is used, here are some sample runs. First, we can approximate the square root of 2000000 (which is just 1000 times the square root of 2). The "%" represents the command prompt.

% ./assign_07 -S 12345 2000000
0001414 * 1414 = 1999396
If we start with a different random seed, we get a different result:
% ./assign_07 -S 54321 2000000
0001415 * 1415 = 2002225
A third attempt gives a pretty bad answer:
% ./assign_07 -S 1 2000000
0000989 * 989 = 978121
Finally, we can change the number of generations (-g), the mutation rate (-m), the population size (-p), the selection pool size (-s, which should be smaller than the population size), and the number of randomly chosen survivors (-r, which should usually be pretty small), and run with debugging (-d):
% ./assign_07 -S 1 -g 100 -m 0.1 -p 100 -s 50 -r 3 -d 2000000
Generation 0: 0003616
Generation 1: 0001993
Generation 5: 0001912
Generation 7: 0001501
Generation 11: 0001413
Generation 22: 0001414
0001414 * 1414 = 1999396

Note 1: the running time of the program is O(population size * number of generations). Don't use huge numbers or you'll wait all day!

Note 2: If you don't specify the -S switch, you will get different results every time you run the program. That's a feature, not a bug.

Note 3: The defaults are:

Tricky Stuff

As usual there are some tricky parts to this assignment. Some of them are:


© 2001, Geoff Kuenning

This page is maintained by Geoff Kuenning.