Recall that RNA is a sequence of nucleic acids of types A, U, C, and G. (Similar to DNA, except that RNA uses uracil "U" rather than thymine "T"). The primary structure of RNA is just its linear string of A, U, C, and G symbols. RNA folds into a 2-dimensional secondary structure in which A binds with U while C binds with G.
A reasonable first-order approximation is that RNA will fold in such a way as to maximize the total number of A-U and C-G matches. Moreover, in general, RNA avoids pseudoknots meaning that if an A in position i in the linear string matches with U at position j > i then the symbols at indices between i and j (if there are any) can match among themselves but cannot match with symbols at positions less than i or greater than j.
Your first task is to design and implement a recursive algorithm for this problem. It need not be particularly fast! The algorithm should be implemented in a function called fold and it should take a RNA string as input. The function returns the maximum number of matches possible for that string. For example, here is our solution (implemented in Python) in action:
>>> fold("AAUUGC")
3
>>> fold("AAGUCU")
2
Note that in the first case, the solution matches the A in position 0
(using Python's convention of indices beginning with 0) with the U in
position 3, the A in position 1 with the U in position 2, and the G
and C at the end match, for a total of 3 matches. In the second case,
the A in position matches the U in position 5 and the G and the C
match, for a total of 2 matches. If we had matched the A in position
0 with the U in position 3 then we could not have matched the G and
the C because that would have "crossed" the matching of the A and the
U.
While you're welcome to implement your program in any imperative programming language (defined to be C, C++, Java, or Python), Python will be the shortest. Please comment your program and keep it as simple and easy-to-read as possible. About 25% of the points for this problem will be allocated for quality of code. As long as your code is relatively simple, easy-to-read, and well-documented, you should expect to get these points.
Here is another test input (primate pps-mir1 miRNA):
>>> fold("ACCUACUCAGAGUACAUACUUCUUUAUGUACCCAUAUGAACAUACAAUGCUAUGGAAUGUAAAGAAGUAUGUAUUUUUGGUAGGC")
This string is so long that it will take your recursive function
almost forever to solve. On to your next task...
>>> foldDP("ACCUACUCAGAGUACAUACUUCUUUAUGUACCCAUAUGAACAUACAAUGCUAUGGAAUGUAAAGAAGUAUGUAUUUUUGGUAGGC")
38
You should use a real DP table rather than memoizing. As we
noted in class, memoization uses a hash function and hashing is not
guaranteed to be constant time.
"UCGGCUUUUAAACCUCGAGACUAGUUUUCUCAGUGAUCUGCCAGAUAUUAUUUGAAUUUCUGAGAUCAUUGUGAAAGCUGAUUUUGUGGAUAGCCUCGACU" "GGAUACGGCCAUACUGCGCAGAAAGCACCGCUUCCCAUCCGAACAGCGAAGUUAAGCUGCGCCAGGCGGUGUUAGUACUGGGGUGGGCGACCACCCGGGAAUCCACCGUGCCGUAUCCU"