Homework 3b Problem 1: DP Design and Implementation

Recall that RNA is a sequence of nucleic acids of types A, U, C, and G. (Similar to DNA, except that RNA uses uracil "U" rather than thymine "T"). The primary structure of RNA is just its linear string of A, U, C, and G symbols. RNA folds into a 2-dimensional secondary structure in which A binds with U while C binds with G.

A reasonable first-order approximation is that RNA will fold in such a way as to maximize the total number of A-U and C-G matches. Moreover, in general, RNA avoids pseudoknots meaning that if an A in position i in the linear string matches with U at position j > i then the symbols at indices between i and j (if there are any) can match among themselves but cannot match with symbols at positions less than i or greater than j.

Your first task

Your first task is to design and implement a recursive algorithm for this problem. It need not be particularly fast! The algorithm should be implemented in a function called fold and it should take a RNA string as input. The function returns the maximum number of matches possible for that string. For example, here is our solution (implemented in Python) in action:

>>> fold("AAUUGC")
3

>>> fold("AAGUCU")
2
Note that in the first case, the solution matches the A in position 0 (using Python's convention of indices beginning with 0) with the U in position 3, the A in position 1 with the U in position 2, and the G and C at the end match, for a total of 3 matches. In the second case, the A in position matches the U in position 5 and the G and the C match, for a total of 2 matches. If we had matched the A in position 0 with the U in position 3 then we could not have matched the G and the C because that would have "crossed" the matching of the A and the U.

While you're welcome to implement your program in any imperative programming language (defined to be C, C++, Java, or Python), Python will be the shortest. Please comment your program and keep it as simple and easy-to-read as possible. About 25% of the points for this problem will be allocated for quality of code. As long as your code is relatively simple, easy-to-read, and well-documented, you should expect to get these points.

Here is another test input (primate pps-mir1 miRNA):

>>> fold("ACCUACUCAGAGUACAUACUUCUUUAUGUACCCAUAUGAACAUACAAUGCUAUGGAAUGUAAAGAAGUAUGUAUUUUUGGUAGGC")
This string is so long that it will take your recursive function almost forever to solve. On to your next task...

DP to the Rescue!

Your next task is to design and implement a DP version of your recursive algorithm. The function should be called foldDP. This one will handle the above RNA string in 1 to 2 seconds on a typical personal computer:
>>> foldDP("ACCUACUCAGAGUACAUACUUCUUUAUGUACCCAUAUGAACAUACAAUGCUAUGGAAUGUAAAGAAGUAUGUAUUUUUGGUAGGC")
38
You should use a real DP table rather than memoizing. As we noted in class, memoization uses a hash function and hashing is not guaranteed to be constant time.

What to submit

You should submit the following:
  1. A printout of your code. Please keep it as simple and elegant as you can for maximum readability and include comments to make it easier for the grutors to understand what you're doing. Remember that about 25% of the points will be assigned for readability and simplicity of your code.
  2. A screenshot (or cut-and-paste text) of your RNA program run (that is, with the solution displayed) with these two input strings that we downloaded from a bioinformatics database (the first one is for Daphnia pulex bantam stem-loop and the second is 5S rRNA for Acanthamoeba castella):
    "UCGGCUUUUAAACCUCGAGACUAGUUUUCUCAGUGAUCUGCCAGAUAUUAUUUGAAUUUCUGAGAUCAUUGUGAAAGCUGAUUUUGUGGAUAGCCUCGACU"
    
    "GGAUACGGCCAUACUGCGCAGAAAGCACCGCUUCCCAUCCGAACAGCGAAGUUAAGCUGCGCCAGGCGGUGUUAGUACUGGGGUGGGCGACCACCCGGGAAUCCACCGUGCCGUAUCCU"