Chapter 5: Imperative Programming

The imperative is to define what is right and do it.

—Barbara Jordan

5.1 A Computer that Knows You (Better than You Know Yourself?)

These days it seems that just about every website out there is trying to recommend something. Websites recommend books, movies, music, and activities. Some even recommend friends!

How does Netflix know what movies we’re likely to enjoy watching or Amazon know what we’ll like to read?

../_images/Alien6.PNG

Netflix recommended the movie “Alien” to me.

The answer to this question lies in a broad and important subfield of computer science: data mining. Data mining focuses on extracting useful information from very large quantities of unstructured data.

In this chapter we will examine a simplified version of a fundamental data mining algorithm called collaborative filtering (CF). Collaborative filtering is widely used in many successful recommendation systems, as well as many other application areas where there’s a lot of data to be studied (e.g., financial markets, geological data, biological data, web pages, etc.). Our goal in this chapter is to build a music recommender system that uses a basic version of the CF approach.

Of course, our recommender system and our CF approach will necessarily be simplified. To build an industrial-strength recommender system is quite complicated. The system that won the Netflix prize (see http://www.netflixprize.com for more details) consists of several different sophisticated algorithms. See the paper “The BellKor Solution to the Netflix Grand Prize” by Yehuda Koren at http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

To motivate the idea behind collaborative filtering, let’s go back to the dark ages before the World Wide Web became pervasive. In those ancient times – perhaps before you were born – people discovered movies, books, restaurants, and music by asking for recommendations from people whom they knew to have tastes similar to their own.

Collaborative filtering systems work on this same fundamental principle–they find people who generally like the same things you do, and then recommend the things that they like that you might not have discovered yet. Actually, this is only one type of collaborative filtering called user-based CF. If you’re dying to learn more about different types of CF, good! Keep taking CS courses. Or for a more immediate gratification, read the Wikipedia page: http://en.wikipedia.org/wiki/Collaborative_filtering

Let’s consider a simple example in which a system is trying to recommend music you might like. First, the system needs to know something about your tastes, so you have to give it some information about what you like. Let’s say you tell the system you like Lady Gaga, Katy Perry, Justin Bieber and Maroon 5.

../_images/Alien6.PNG

I talked to your roommate and learned that those are your favorite groups, so don’t try to deny it!

The system knows about the following additional five “stored” users (users for which we’ve already stored some data):

  • April likes Maroon 5, The Rolling Stones and The Beatles
  • Ben likes Lady Gaga, Adele, Kelly Clarkson, The Dixie Chicks and Lady Antebellum
  • Cory likes Kelly Clarkson, Lady Gaga, Katy Perry, Justin Bieber and Lady Antebellum
  • Dave likes The Beatles, Maroon 5, Eli Young Band and Scotty McCreery
  • Edgar likes Adele, Maroon 5, Katy Perry, and Bruno Mars.

Determining a recommendation for you involves two steps:

  • Determine which user (or users) have tastes most similar to yours.
  • Choose and rank artists that the similar user likes. These are your recommendations.

Of course, there are many algorithms for performing each of the above steps, and many interesting problems that can arise in performing them, but again we will keep things relatively simple here. Our system will use the following simple CF algorithm:

  • For each user, count the number of matching artists between your profile and the profile of that user.
  • Choose the user with the highest number of matches.
  • Recommend the artists who appear on their list and not on yours.

So in the above example, the system determines that Cory has the greatest overlap with you: He likes three of the same artists you like (Lady Gaga, Katy Perry and Justin Bieber) whereas none of the others have more than two groups in common with you. So, the recommendation system chooses Cory as the person whose tastes are most similar to yours. It also sees that Cory likes Lady Antebellum and Kelly Clarkson and therefore recommends that you check out these additional artists.

This is an admittedly simplified example and you can probably come up with a number of good questions about this approach. For example:

  • What if I already know that I don’t like Kelly Clarkson? How could the system use this information?
  • What if there are multiple people who tie for the best match with my tastes? Conversely, what if no one matches my tastes?
  • In a real system, there are millions of users. How can the system efficiently find the best match?

Most of these answers are beyond the scope of this chapter, but we encourage you to think about how you might answer these questions as you read through the rest of this chapter. If you’re intrigued, great! This is exactly the kind of stuff you’ll learn as you continue your CS studies. But for now, back to the foundations for our music recommendation system.

5.1.1 Our Goal: A Music Recommender System

In the rest of this chapter we will build a music recommender system that implements the basic collaborative filtering algorithm described above. The transcript below shows an example of an interaction with the system.

Welcome to the music recommender system!
What is your name? Christine

Welcome Christine.  Since you are a new user, I will need to gather some information about your music preferences.

Please enter an artist or band that you like: Maroon 5

Please enter another artist or band that you like,
or just press enter to see your recommendations: Lady Gaga

Please enter another artist or band that you like,
or just press enter to see your recommendations: Katy Perry

Please enter another artist or band that you like,
or just press enter to see your recommendations: Justin Bieber

Please enter another artist or band that you like,
or just press enter to see your recommendations:

Christine, based on the users I currently know about,
I believe you might like:
Lady Antebellum
Kelly Clarkson

I hope you enjoy them!  I will save your preferred artists and will have new recommendations for you in the future.

The system keeps track of users and their preferences. As more users rate more songs, the system can generate additional recommendations, both for new and existing users. So, the next time Christine uses the system, she’s likely to get some new recommendations.

Before we introduce the techniques that we’ll use to implement the recommendation system, take a few moments to think about the steps involved in the program demonstrated above. What data do you need to gather and store? How do you need to process the data? What do you need to produce as output?

Although there are many ways to actually write the program, the underlying components are the following:

  • The ability to gather input from the user;
  • The ability to repeat a task (e.g., asking for the user’s preferred artists, comparing against stored users’ preferences) many times;
  • The ability to store and manipulate data in different ways (e.g., storing the user’s responses, manipulating the stored users’ preferences); and
  • The ability to save data between program runs, and to load saved data (e.g., loading the stored users’ preferences from a file, saving the current user’s preferences to a file).

Throughout the rest of this chapter we will learn ways to perform each of the above tasks, ultimately resulting in a fully-functional music recommender system.

5.2 Getting Input from the User

The first task that we have listed above is to get input from the user. Fortunately, this is a simple task thanks to Python’s built-in raw_input function. (This function is called input in Python 3.) Here’s an example of some code that uses the raw_input function:

def greeting():
    name = raw_input('What is your name? ')
    print ('Nice to meet you,', name)

And here’s how that code looks when it is run:

>>> greeting()
What is your name? Christine
Nice to meet you, Christine

The function raw_input takes one argument: a string that will be printed when the raw_input function is run (e.g., the string 'What is your name?' in the example above). When Python sees the raw_input command, it prints that string and then waits for the user to type in her own string. The user types in a string and presses Return (or Enter). At that point, Python assigns the user’s input string to the variable appearing on the left-hand side of the = sign. In the example above, the string that the user typed in will be assigned to the variable name.

Python’s raw_input function will always return a string. If you want that string to be converted into something else, like an integer or a floating point number (a number with a decimal point – also known as a “float”), that can be done using Python’s data conversion functions. In the example below, we take the user’s input string and convert it into a floating point number.

In the shell, this should appear as:

>>> plus42()
Enter a number: 15
15 + 42 = 57.0
../_images/Alien6.PNG

Proving that watermelons don’t float!

Note that converting data from one type to another (e.g., from strings to floats) can be “risky.” In the example above, we converted the user’s input string “15” into the floating point number \(15.0\). However, had the user entered the string “watermelon”, Python’s attempt to convert that to a floating point number would have failed and resulted in an error message.

5.3 Repeated Tasks–Loops

The next feature that our program needs is to perform some set of actions multiple times. For example, the recommender system needs to ask the user to enter a number of artists that they like. For the moment, let’s change the system slightly so that instead of allowing the user to enter as many artists as she likes, the system asks the user to enter exactly four artists. The algorithm that performs this data-collection looks something like:

  • Repeat the following four times:
    • Display a prompt to the user;
    • Obtain the user’s input; and
    • Record the user’s answer.
../_images/Alien6.PNG

Actually, 42 times is just the right number of times for me!

Imagine that we have a few lines of Python code that gather and store input from the user. Since we want to do this four times, we could just make four copies of that code, one for each time that we want to ask the user for an artist. For only four preferences, this wouldn’t be so bad. But imagine that we wanted to ask the user for 10 preferences? Or 42? The code would quickly become long and cumbersome. Furthermore, if we decided that we wanted to make a small change to the way that we ask the user for input, we’d have to make that change in 4, 10, or 42 places in our program. This would be very annoying for the programmer! Moreover, ultimately we might want to let the user herself specify how many artists she wants to enter, instead of fixing that value before the program runs. In that case, we simply wouldn’t know in advance how many copies of the code to place in our program.

We have already seen one way to perform repeated tasks in Chapter 3. In fact, we’ve also seen a second: list comprehensions. This chapter presents another way to perform these same repeated tasks: iteration or loops.

Just as recursion allows a programmer to express self-similarity in a natural way, loops naturally express sequential repetition. It turns out that recursion and loops are often equally good choices for doing something repeatedly, but in many cases it’s much easier and more natural to use one rather than the other. More on that shortly!

5.3.1 Recursion vs. Iteration at the Low Level

../_images/Alien6.PNG

If you skipped Chapter 4, no worries– you might jump to Section 5.3.2

To illustrate the difference between iteration and recursion, let’s travel backwards in time and revisit the recursive factorial function:

def factorial(n):
    if n == 0:
        return 1
    else:
        answer = factorial(n - 1)
        return n * answer

In Chapter 4, we implemented this function in HMMM and saw that, at the machine level, the recursion uses a stack to keep track of the function calls and the values of the variables in each function call. In that chapter, we also looked at another very different implementation of the factorial function that was considerably simpler and did not use a stack. Here it is:

#
# Calculate N factorial
# Input: N
# Output: N!
#
# Register usage:
#
#       r1      N
#       r2      Running product
#

00 read r1          # read n from the user
01 loadn r13 1      # load 1 into the return register
                    # (base case return value)
02 jeqz r1 06       # if we are at the base case, jump to the
                    # end
03 mul r13 r13 r1   # else multiply the answer by n
04 addn r1 -1       # and decrement n
05 jump 02          # go back to line 2 and test for base case
                    # again
06 write r13        # we're done so print the answer
07 halt             # and halt

In contrast to the recursive version, this implementation uses a variable (r13 in this example) to gradually “accumulate” the answer. Initially, r13 is set to 1. In lines 3–5, we multiply r13 by the current value of n (stored in register r1), decrement n by 1, and jumps back to line 2 to test if we should do this again. We repeat this loop until the value of n reaches 0. At that point, r13 contains the product \(n \times (n-1) \times \dots \times 1\) and we’re done.

In some sense, this iterative approach is simpler than recursion because it just loops “round and round”, updating the value of a variable until it’s done. There’s no need to keep track (on the stack) of what “things were like” before the recursive call and reinstate those “things” (from the stack) when the recursion returns. Let’s now see how this works in Python.

5.3.2 Definite Iteration: for loops

In many cases, we want to repeat a certain computation a specific number of times. Python’s for loop expresses this idea. Let’s take a look at how we can use a for loop to implement the iteration in the recommendation program to collect a fixed number of artists that the user likes.

To begin, assume the user will enter three artists.

artists = []

for i in [0, 1, 2]:
    next_artist = raw_input('Enter an artist that you like:')
    artists.append(next_artist)

print ('Thank you!  We'll work on your recommendations now.')

In Python 3, remember to use input instead of raw_input.

This for loop will repeat its “body”, lines 4–5, exactly 3 times. Let’s look closely at how this works. Line 3 above is the header of the loop, with five required parts:

  1. The keyword for;
  2. The name of a variable that will control the loop. In our case that variable is named i. It’s safest to use a new variable name, created just for this for loop, or one whose old value is no longer needed;
  3. The keyword in;
  4. A sequence such as a list or a string. In our case it is the list [0, 1, 2]; and
  5. A colon. This is the end of the header and the start of the for loop body.

As we noted, lines 4–5 are called the body of the loop. The instructions in the loop body must be indented consistently within the loop header, just as statements within functions are. Note that line 7 above is not part of the loop body because it is not indented.

So what exactly is going on in this for loop? Our variable i will initially take the first value in the list, which is 0. It will then perform the lines of code that are indented below the for loop. In our case, those are lines 4 and 5. Line 4 uses Python’s raw_input function (or input in Python 3) which prints the string

Enter an artist that you like:

and then pauses to let the user type in a response. Once the user has typed a response and pressed the ENTER ( or RETURN ) key, that response is placed in the variable named next_artist. So, next_artist will be a string that is the name of an artist. Line 5 uses Python’s append function to add that string to the list artists.

Python recognizes that this is the end of the body of the for loop because line 7 is not indented – that is, it’s at the same level of indentation as line 3. So, Python now goes back to line 3 and this time sets i to be 1. Now, lines 4 and 5 are performed again, resulting in the user being asked for another string and that string being appended to the end of the artists list.

Again, Python goes back up to line 3 and this time i takes the value 2. Once again, Python executes lines 4 and 5 and we get another artist and add it to the artists list.

../_images/Alien6.PNG

Four fors!

Finally, i has no more values to take since we told it to take values from the list [0, 1, 2]. So now the loop is done and Python continues computing at line 7.

How could we ask the user to respond to four examples instead of three? Easy! We could modify the loop header to be:

for i in [0, 1, 2, 3]:

What about 25 iterations? Well, we could replace the list [0, 1, 2, 3] with the list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] but who wants to type all that? (We certainly didn’t find it very much fun!) Instead, we can use the built-in range function (that we saw in Chapter 3) to automatically generate that list:

../_images/Alien6.PNG

Remember that range(25) does indeed generate that 25-element list from 0 to 24.

for i in range(25):

Does the control variable have to be named i? No – any valid variable name can be used here. For clarity and to avoid confusion, you should generally avoid using a name that you have already used earlier in your function, although we often reuse variable names in loops.

Finally, a word on style: In general, variable names like i are not very descriptive. However, a variable used in a loop is a very temporary variable – it serves the purpose of allowing us to loop and then once the loop is over we’re done with it. Therefore, it’s quite common to use short variable names in for loops, and the names i, j, and k are particularly popular.

5.3.3 How Is the Control Variable Used?

In the example above, the control variable took on a new value at every iteration through the loop, but we never explicitly used the value of variable inside the body of the loop. Therefore, it didn’t matter what the list of elements in our for loop header was—these elements were ignored. We could have accomplished the same thing with any three-element list header:

for i in ['I', 'love', 'Spam']:

or

for i in [42, 42, 42]:

In both cases, we would have gone through the loop three times.

Sometimes the value of the control variable is important to the computation inside the for loop. For example, consider this iterative version of the factorial function. Here we’ve named the control variable factor to suggest its role:

def factorial(n):
    answerSoFar = 1
    for factor in range(1, n+1):
        answerSoFar = answerSoFar * factor
    return answerSoFar

When we call factorial(4), the loop becomes:

for factor in range(1, 5):
    answerSoFar = answerSoFar * factor

What happens here?

  • After the first time through the loop answerSoFar holds the result of its previous value (1) times the value of factor (1). When the loop completes its first iteration answerSoFar will be 1 (i.e., 1*1).
  • In the second iteration through the loop answerSoFar will again be assigned to hold the product of the previous value of answerSoFar times factor. Since factor‘s value is now 2, answerSoFar will equal 1*2, or 2, when this second loop iteration ends.
  • After the third time through the loop, answerSoFar will be 2*3 or 6, and the fourth time through answerSoFar will become :math: 4*6, or 24.

The loop repeats exactly 4 times because range(1, 5) has four elements: [1, 2, 3, 4].

Let’s now “unroll” this four-iteration loop to see what it does at each iteration. Calling factorial(4) becomes

# factorial function begins
answerSoFar = 1

# loop iteration 1
factor = 1
answerSoFar = answerSoFar * factor  # answerSoFar becomes 1
# iteration 2
factor = 2
answerSoFar = answerSoFar * factor  # answerSoFar becomes 2
# iteration 3
factor = 3
answerSoFar = answerSoFar * factor  # answerSoFar becomes 6
# iteration 4
factor = 4
answerSoFar = answerSoFar * factor  # answerSoFar becomes 24

# loop ends, 24 is returned

5.3.4 Accumulating Answers

Consider what happened to answerSoFar throughout the iterations above. It started with a value and it changed that value each time through the loop, so that when the loop completed the returned answerSoFar was the final answer. This technique of accumulation is very common, and the variable that “accumulates” the desired result is called an accumulator.

Let’s look at another example. The listDoubler function below returns a new list in which each element is double the value of its corresponding element in the input list, aList. Which variable is the accumulator here?

(ch05_ld)

In this case, the accumulator is the list doubledList rather than a number. It’s growing in length one element at a time instead of one factor at a time as in factorial.

Let’s consider one more example where the loop control variable’s value is important to the functionality of a loop. Returning to our recommender program, let’s write a loop that calculates the number of matches between two lists. This function will be useful when we’re comparing the current user’s preferences to the stored preferences of other users to determine which stored user most closely matches the current user’s preferences. For this problem, we have two lists of artist/band names, for example:

>>> userPrefs
['Maroon 5', 'Lady Gaga', 'Justin Bieber']
>>> storedUserPrefs
['Maroon 5', 'Kelly Clarkson', 'Lady Gaga', 'Bruno Mars']

Our first version of this function will implement the following algorithm:

  • Initialize a counter to 0
  • For each artist in the user’s preferences:
  • If that artist is in the stored user’s preferences too, add one to count

This simple algorithm is implemented with the following Python code:

def numMatches(userPrefs, storedUserPrefs):
    ''' return the number of elements that match between
        the lists userPrefs and storedUserPrefs '''
    count = 0
    for item in userPrefs:
        if item in storedUsersPrefs:
            count += 1
    return count

In this function we’ve used a shorthand notation to increment the value of count.

../_images/Alien6.PNG

Similar shorthands include -=, *=, and /=

The line

count += 1

is just shorthand for

count = count + 1

Notice that the code above is actually not specific to our music recommender function, but in fact can be used to compare any two lists and return the number of matching elements between them. For this reason, it’s stylistically better to use generic list names for the parameters to this function – that is, we could replace the variable name userPrefs with a more general name such as listA and storedUserPrefs with listB.

def numMatches(listA, listB):
    ''' return the number of elements that match between
        listA and listB '''
    count = 0
    for item in listA:
        if item in listB:
            count += 1
    return count

Finally, let’s complete the process of choosing the stored user with the highest number of matches to the current user. Let’s assume that each stored user is represented by a list of that user’s preferences and then we put all of those lists into one master list. Using the representation, the stored users’ preferences might look like this:

[['Maroon 5', 'The Rolling Stones', 'The Beatles'],

['Lady Gaga', 'Adele', 'Kelly Clarkson', 'The Dixie Chicks', 'Lady Antebellum'],

['Kelly Clarkson', 'Lady Gaga', 'Katy Perry', 'Justin Bieber', 'Lady Antebellum'],

['The Beatles', 'Maroon 5', 'Eli Young Band', 'Scotty McCreery'],

['Adele', 'Maroon 5', 'Katy Perry', 'Bruno Mars']]

Note that we have not included the names of the stored users, since we only care about their preferences for now. However, we can think about the stored users as having indices. In the list above, index 0 corresponds to the user who likes ['Maroon 5', 'The Rolling Stones', 'The Beatles']. There are four other users with indices 1, 2, 3, and 4. Let’s will assume that this list does not contain the preferences of the current user.

So, imagine that we want to calculate the index of the stored user with the best match to our current user. For example, if our current user likes ['Lady Gaga', 'Katy Perry', 'Justin Bieber', 'Maroon 5'] then the stored user at index 2 in the stored user list above has the most matches in common – three matches, whereas all other stored users have fewer matches.

The algorithm for finding the user with the best match to the current user is the following:

  • Initialize the maximum number of matches seen so far to \(0\)
  • For each stored user:
  • Count the number of matches between that stored user’s preferences and the current user’s preferences
  • If that number of matches is greater than the maximum number of matches so far:
  • Update the maximum number of matches seen so far
  • Keep track of index of that user
  • Return the index of the user with the maximum number of matches

How could we express this in Python? A first attempt might begin like this:

def findBestUser(userPrefs, allUsersPrefs):
    ''' Given a list of user artist preferences and a
        list of lists representing all stored users'
        preferences, return the index of the stored
        user with the most matches to the current user. '''
    max_matches = 0 # no matches found yet!
    best_index = 0
    for pref_list in allUsersPrefs:
        curr_matches = numMatches(userPrefs, pref_list)
        if curr_matches > max_matches:
           # somehow get the index of pref_list??

Note that we maintain a variable called max_matches that stores the maximum number of matches found so far. Initially it is set to 0, because we haven’t done any comparisons yet and have no matches. We also maintain a variable called best_index that stores the index of the list in the allUsersPrefs that has the highest number of matches. We’ve initialized this to 0 as well. Is this a good idea? It suggests that the list with index 0 has the best number of matches so far. In fact, we haven’t even looked at the list with index 0 at this point, so this may seem a bit weird. On the other hand, it doesn’t hurt to initialize this variable to 0 since we’re about to begin counting matches in the loop and the best solution will be at least 0.

Notice that we reach a dead end in the last line of our function above. We have the element from the allPrefs list, but we have no way of getting the index that corresponds to it, and it’s that index that we’re committed to finding. Can you think of a way to solve this problem—before peeking at the solution below?

Here’s our next attempt—this time, more successful:

def findBestUser(userPrefs, allUsersPrefs):
    ''' Given a list of user artist preferences and a
        list of lists represented all stored users'
        preferences, return the index of the stored
        user with the most matches to the current user. '''
    max_matches = 0
    best_index = 0
    for i in range(len(allUsersPrefs)):
        curr_matches = countMatches(userPrefs,
                                    allPrefs[i])
        if curr_matches > max_matches:
            best_index = i
            max_matches = curr_matches
    return best_index

What’s the difference? Notice that now our for loop is iterating not over the lists in allUsersPrefs but rather over the indices of those lists. We do this using the range function to generate the list of all indices in allUsersPrefs. With this index variable, we can access the element in the list and we can store that index in our best_index variable.

5.3.5 Indefinite Iteration: while Loops

In all of the examples above, we knew exactly how many times we wanted to loop. However, in many cases, we can’t know how many iterations we need — the number of iterations of the loop depends on some external factor out of our control.

Let’s continue with our recommendation program. In the last section we collected a fixed number of preferred artists from the user. But in our original program we allowed the user to enter as many artists as they wished, whether that number was one or one thousand.

Recall that for loops always run a definite number of times, so this situation calls for a different kind of loop: the while. A while loop runs as long as its boolean condition is true.

../_images/Alien6.PNG

In other words, a while loop runs for a while.

Let’s take a close look at the structure of a while loop that implements the new desired behavior for our recommender program

newPref = raw_input("Please enter the name of an \
artist or band that you like: " )

while newPref != '':
    prefs.append(newPref)
    newPref = raw_input("Please enter an artist or band \
that you like, or just press enter to see recommendations: ")

print('Thanks for your input!')

In Python 3, remember to use input rather than raw_input.

The while loop is similar to the for loop in that it consists of a loop header (line 4) and a loop body (lines 5-7). The loop header consists of the following three elements, in order:

  • The keyword while
  • A boolean expression. In our example this expression is newPref != ''
  • A colon

As with for loops, the loop body must be indented under the loop header. Thus, line 9 in the above example is not inside the loop body. Note that both of the raw_input statements are split over more than one line of text, but Python considers each really just one line because of the textbackslash symbols at the end of the lines.

A while loop will execute as long as the Boolean expression in the header evaluates to True. In this case, the Boolean expression is newPref != ''. Assuming that the user entered a non-empty string in line 1, the expression newPref != '' will evaluate to True the first time it is evaluated. Thus, we enter the body of the for loop and execute lines 5–7. If the user entered another non-empty string, then this Boolean will again evaluate to True. This will repeat until, eventually, the user enters no string – she just presses RETURN or ENTER. At this point, newPref is an empty string. Thus, when we return to evaluate the Boolean expression newPref != '' in the while statement, it is now False and the loop terminates, causing execution to continue at line 9.

5.3.6 for Loops vs. while Loops

../_images/Alien6.PNG

But aliens don’t have thumbs!

Often it’s clear whether a computational problem calls for definite (for) or indefinite (while) iteration. One simple rule-of-thumb is that for loops are ideally suited to cases when you know exactly how many iterations of the loop will be conducted whereas while loops are ideal for cases when it’s less clear in advance how many times the loop must repeat.

It is always possible to use a while loop to emulate the behavior of a for loop. For example, we could express the factorial function with a while loop as follows:

def factorial(n):
    answer = 1
    while n > 0:
        answer = answer * n
        n = n-1
    return answer
../_images/Alien6.PNG

...that saves typing!

Here we’ve used answer instead of answerSoFar with the understanding that answer will not actually hold the value of the desired answer until the end of the loop–this is a common style for naming accumulator variables.

Be careful! Sometimes when people try to use a while loop to perform a specific number of iterations their code ends up looking like this:

def factorial(n):
    answer = 1
    while n > 0:
        answer = answer * n
    return answer

What happens when you run factorial(5) using the function above? The loop will run, but it will never stop!

../_images/Alien6.PNG

This while loop depends on the fact that n will eventually reach 0, but in the body of the loop we never change the value of n and we never get n! Python will happily continue multiplying answer * n forever–or at least until you get tired of waiting and hit Ctrl-C to stop the program.

Go through the steps and see if you can spot the error in the following version:

(ch05_fac)

This code will also run forever. But why? After all, we definitely do decrease n. This bug is more subtle. Remember that the while loop runs only the code in the body of the loop before repeating. Because the statement that subtracts 1 from n is not indented, it is not part of the while loop’s body. Again, the loop variable does not change within the loop, and it runs forever.

../_images/Alien6.PNG

The Apple Corporation’s address is One Infinite Loop, Cupertino, CA

A loop that never ends is called an infinite loop and it is a common programming bug. When using a while loop, remember to update your loop-control variable inside the loop. It’s an advantage of for loops that this updating is done automatically! In fact, the tendency to accidentally create infinite loops is so common that it leads us to an important takeaway message about the choice between for and while loops.

Takeaway message: If you know in advance how many times you want to run a loop, use a for loop; if you don’t, use a while loop.

5.3.7 Creating Infinite Loops On Purpose

Sometimes infinite loops can come in handy. They’re not actually infinite, but the idea is that we will stop them when we are “done,” and we don’t have to decide what “done” means until later in the loop itself.

../_images/Alien6.PNG

I’m pro-crastination!

For example, consider a different version of the recommender loop above that gathers data from the user.

numCorrect = 0

while True:      # run forever -- or at least as long as needed...
    newPref = raw_input("Please enter an artist or band that you like, \
                 or just press enter to see recommendations: ")
    if newPref:
        prefs.append(newPref)
    else:
        break

print('Thanks for your input!')

The body of the else statement contains one instruction: the break instruction. break will immediately halt the execution of the loop that it appears in, causing the code to jump to the next line immediately after the loop body. break can be used in any kind of loop, and its effect is always the same–if the code reaches a break statement, Python immediately exits the containing loop and proceeds with the next line after the loop. If you have one loop inside another loop (a perfectly OK thing to do, as you’ll see below), the break statement exits only the innermost loop.

../_images/Alien6.PNG

Yes! I need a break!

You might ask, “Do we really need a
break ?”

After all, the loop above can be written with a more informative condition, as we saw above. Which approach is better? It’s a matter of style. Some prefer the “delayed decision” approach, writing loops that appear to run too long, only to break out of them from the inside; others prefer to put all of the conditions directly in the loop header. The advantage of the latter approach is that the condition helps clarify the context for the loop. The advantage of the former (in some cases anyway) is that it avoids the awkward double-input statement—there’s no need to ask the user to enter their initial response in a separate input statement before the loop begins.

5.3.8 Iteration Is Efficient

The heart of imperative programming is the ability to change the value of variables—one or more accumulators—until a desired result is reached. These in-place changes can be more efficient, because they save the overhead of recursive function calls. For example, on our aging computer, the Python code

counter = 0
while counter < 10000:
    counter = counter + 1

ran in 2.6 milliseconds. The "equivalent" recursive program

def increment(value, times):
    if times <= 0:
        return value
    return increment(value + 1, times - 1)

counter = increment(0, 10000)

ran more than an order of magnitude slower, in 38.3 milliseconds.

Why the difference? Both versions evaluate 10,000 boolean tests; both execute the same 10,000 additions. The difference comes from the overhead of building and removing the stack frames used to implement the function calls made recursively.

Memory differences are even more dramatic: storing partial results on the stack can quickly exhaust even today’s huge memory stores.

5.4 References and Mutable vs. Immutable Data

At the beginning of this chapter we introduced the two key components of imperative programming: iteration and data mutation. We have already discussed iteration in some depth. Now we begin to explore the concept of data mutation.

5.4.1 Assignment by Reference

The assignment and re-assignment of values to one variable—the accumulator—characterizes imperative programming with loops. All of these assignments are efficient, but only if the size of the copied data is small! Floating-point values and typical integers are stored in a small space, often the size of one register (32 or 64 bits). They can be copied from place to place rapidly. For example, the assignment operation

# suppose x refers to the value 42 right now
y = x

runs extremely fast, probably best measured in nanoseconds, as suggested by the timings in the previous section.

Lists, on the other hand, can grow very large. Consider this code:

# suppose that list1 holds the value of
# range(1000042) right now
list2 = list1

This assignment makes list2 refer to a 1,000,042-element list—a potentially expensive proposition, if it involved 1,000,042 individual integer assignments like the y = x example above. What’s more, there’s no guarantee the elements of list1 aren’t lists themselvesldots .

How does Python make both integer and list assignments efficient? It does so through a simple, single rule for all of its data types: Assignments copy a single reference.

Reference? It turns out that when you assign a piece of data (e.g., an integer) to a variable, what you are actually doing is storing the memory location of that piece of data in the variable. We say that the variable holds a reference to the piece of data. When we think of a variable, we typically think only of the data to which the variable refers but you can also obtain the reference (i.e., the memory location of that data) with Python’s built-in id() function:

>>> x = 42
>>> x           # Python will reply with x's value
42
>>> id(x)       # asks for x's reference (the memory location of its contents)
505494448        # this will be different on your machine!

When we talk about a variable’s value, we mean the data to which the variable refers; when we talk about a variable’s reference, we mean the memory location of that data. For example, in the code above, the value of x is 42, and its reference is 505494448, which is the location of the value 42 in memory. By the way, the idea of the memory location of a piece of data should be very familiar to you from Chapter 4. If you like, you can think of the variables that store the references as registers on the CPU. (This is not quite exactly correct, but a reasonable conceptual model.) Figure 5.1 illustrates this concept graphically.

../_images/fivepointone.png

Figure 5.1: A depiction of how Python stores data. The boxes on the left are the variables, which you can think of as registers on the CPU, which store references to locations in memory. The actual data is stored in memory.

Python makes assignment efficient by only copying the reference, and not the data:

>>> x = 42      # this puts the value 42 in the next memory slot, 505494448 and then it gives x a copy
                # of that memory reference
>>> y = x       # copies the reference in x into y, so that x and y both refer to the same integer 42 in memory
>>> id(x)       # asks for x's reference (the memory location of its contents)
505494448
>>> id(y)       # asks for y's reference (the memory location of its contents)
505494448

As you would expect, changes to x do not affect y:

>>> x = 43      # this puts 43 in the next memory slot,505494464
>>> id(x)
505494464        # x's reference has changed
>>> id(y)
505494448        # but y's has not

The result of executing the above code is shown on the right in Figure 5.1.

../_images/fivepointtwo.png

Figure 5.2: A box-and-arrow diagram abstraction of how Python stores list data. All of the elements in the list (and the list itself) are in memory, but we have abstracted away from the exact addresses of these elements.

Assignment happens the same way, regardless of the data types involved:

../_images/Alien6.PNG

It is possible to reprogram how assignment works for user-defined data types, but this is the default.

>>> list1 = [42,43]  # this will create the list [42,43] and give its location to list1
>>> list2 = list1    # give list2 that reference, as well
>>> list1            # the values of each are as expected
[42,43]
>>> list2
[42,43]
>>> id(list1)        # asks for list1's reference (its memory location)
538664
>>> id(list2)        # asks for list2's reference (its memory location)
538664

As the data we store gets more complicated, it becomes cumbersome to try to actually represent what is happening in memory faithfully, so computer scientists often use a type of diagram called a box-and-arrow diagram to illustrate how references refer to memory. A box-and-arrow diagram for the above code is shown on the left in Figure 5.2. In this figure, the list [42, 43] exists in memory at the location referred to by both list1 and list2, but instead of writing out the actual memory location, we indicate that list1 and list2 refer to the same data in memory using arrows. In fact, notice something interesting about Figure5.2: the elements in the list are also references to data in other locations in memory! This is a fundamental difference between integers and lists: a list refers to the memory location of a collection of potentially many elements, each of which has its own reference to the underlying data!

As with integers, if an assignment is made involving a list, one reference is copied, as shown in the code below, and on the right in Figure 5.2:

>>> list1 = [44]     # will create the list [44] and make list1 refer to it
>>> id(list1)        # list1's reference has changed
541600
>>> id(list2)        # but list2's has not
538664

The assignment list2 = list1 caused both list1 and list2 to refer to the same list.

This one-reference rule can have surprising repercussions. Consider this example, in which x, list1[0], and list2[0] start out by holding the same reference to the value 42:

>>> x = 42           # to get started
>>> list1 = [x]      # similar to before
>>> list2 = list1    # give list2 that reference, as well
>>> id(x)        # all refer to the same data
505494448
>>> id(list1[0])
505494448
>>> id(list2[0])
505494448

What happens when we change list1[0]?

>>> list1[0] = 43    # this will change the reference held by the "zeroth" element of list1
>>> list1[0]
43               # not surprising at all
>>> list2[0]
43               # aha! list1 and list2 refer to the same list
>>> x
42               # x is a distinct reference

Let’s see

>>> id(list1[0])     # indeed, the reference of that zeroth element has changed
505494464
>>> id(list2[0])     # list2's zeroth element has changed too!
505494464
>>> id(x)            # x is still, happily, the same as before
505494448

This example is illustrated in Figure 5.3 and codelens examples ‘ch05_ref1’ and ‘ch05_ref2’.

(ch05_ref1)

(ch05_ref2)

../_images/fig5point3.png

Figure 5.3: A graphical illustration of what happens when you modify the elements in a list that is referenced by two variables.

5.4.2 Mutable Data Types Can Be Changed Using Other Names!

Note that list2[0] has been changed in the above example, even though no assignment statements involving list2[0] were run! Because list2 is really just another name for the exact same data as list1, we say that list2 is an alias of list1. Note that aliases should be used with caution as they can be both powerful and dangerous, as we explore more below.

Lists are an example of a mutable data type. They are mutable because their component parts (in this case, the elements of the list) can be modified. If you have two different variables that both refer to the same mutable piece of data – as we do with list1 and list2 above – changes that are made to the data’s components using one of those variables will also be seen when you use the other variable, since both variables refer to the same thing.

../_images/Alien6.PNG

Two images here—mutation can be good ... or bad.

Data types that do not allow changes to their components are termed immutable. Integers, floating-point values, and booleans are examples of immutable types. This isn’t surprising, because they do not have any accessible component parts that could mutate anyway.

../_images/Alien6.PNG

I love to talk. I guess that makes me immutable!

../_images/Alien6.PNG

Python lets you compute what bits make up these data types, but you can’t change—or even read—the individual bits themselves.

Types with component parts can also be immutable: for example, strings are an immutable type. If you try to change a piece of a string, Python will complain:

>>> s = 'immutable'
>>> s[0] = ' '
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

Keep in mind that, mutable or immutable, you can always use an assignment to make a variable refer to a different piece of data.

>>> s = 'immutable'
>>> s
'immutable'
>>> s = 'mutable'
>>> s
'mutable'

Takeaway Message

There are a few key ideas to keep in mind:

  • All Python variables have a value (the piece of data to which they refer) and a reference (the memory location of that piece of data).
  • Python assignment copies only the reference.
  • Mutable data types like lists allow assignment to component parts. Immutable data types like strings do not allow assignment to component parts.
  • If you have two different variables that both refer to the same mutable piece of data, changes that are made to the data’s components using one of the variables will also be seen when you use the other variable.
  • It is always possible to use an assignment statement to make a variable refer to a different piece of data.

5.5 Mutable Data + Iteration: Sorting out Artists

Now that we have introduced the two fundamental concepts in imperative programming, we return to our recommendation example in order to motivate and illustrate the power of data mutation and iteration together. The example we will use is sorting a list of elements.

5.5.1 Why Sort? Running Time Matters

Before we dive into the details of sorting, let’s establish (at least one reason) why it is useful to have data that is in sorted order. In short, sorted data makes data processing much faster.

But how do we measure what is “fast” vs. what is “slow?” The key to analyzing how long a program takes to run is to count the number of operations that it will perform for a given size input. Computer Scientists rarely care how fast or slow programs are for very small input: they are almost always fast if the input is small. However, when processing large inputs (for example, millions of users who have each rated hundreds or even thousands of artists), speed becomes critical.

../_images/Alien6.PNG

In fact, if we were trying to build a system to handle millions of users, we would need to make many more optimizations and use fundamentally different algorithms, but that is beyond the scope of this chapter. Here we show you how to speed things up to handle slightly larger inputs.

So, let’s look at how long it takes to calculate the number of matches between two lists using the function from section 5.3.4, reproduced here:

def numMatches(list1, list2):
    ''' return the number of elements that match between
        list1 and list2 '''
    count = 0
    for item in list1:
        if item in list2:
            count += 1
    return count

Let’s say for the moment that each list has four elements in it. Take a moment to think about how many comparisons you think the program will make... then read on below.

First, you take the first element of the first list, and ask if it is in the second list. The in command is a bit deceptive because it hides a significant number of comparisons. How does Python tell if an item is in a list? It has to compare that item to every item in the list! So in this case, the first item in the first list is compared to every item (i.e., four of them) in the second list to determine whether it is in that list.

“Wait!” you say. If the item is actually in the list, it doesn’t actually have to check all four, and can stop checking when it finds the item in question. This is exactly correct, but in fact it doesn’t matter in our analysis. For the purpose of an analysis like this, computer scientists are quite pessimistic. They rarely care what happens when things work out well–what they care about is what might possibly happen in the worst case. In this case, the worst case is when the item is not in the list and Python has to compare it to every item in the list to determine it’s not there. Since what we care about is this worst case behavior, we will perform our analysis as if we were dealing with the worst case.

So, back to the analysis: For the first item in the list, Python made four comparisons to the items in the second list – commands that were hidden away in the in command. Now our program moves on to the second item in the first list, where it again makes four comparisons with the second list. Similarly, it makes four comparisons for each of the third and the fourth elements in the first list. For a total of \(4 + 4 + 4 + 4 = 4 * 4 = 16\) .Again, this probably doesn’t sound like such a bad number. After all, your computer can make 16 comparisons in way less than a second. But what if our lists were longer? What if the user had rated 100 artists and the comparison user had rated 1000 (high, but not crazy)? Then the system would have to do 1000 comparisons (to the items in the second list) for each of the 100 items in the first list for a total of \(100 * 1000 = 10^5\) comparisons. Still not huge, but hopefully you can see where this is going. In general, the matching algorithm we’ve written above takes \(N*M\) comparisons, where \(N\) is the size of the first list and \(M\) is the size of the second list. For simplicity, we might just assume that the two lists will always be the same length, \(N\), in which case it makes \(N^2\) comparisons.

The good news is that we can do significantly better, but we have to make an assumption about the lists themselves. What if our lists were sorted alphabetically? How could that make our matching algorithm faster? The answer is that we can keep the lists “synchronized,” so to speak, and march through both lists at the same time, rather than pulling a single element out of the first list and comparing it to all of the elements in the second. For example, if you are looking at the first element of the first list and it is “Black Eyed Peas” and the first element of the second list is “Counting Crows” you know that the Black Eyed Peas do not appear in the second list, because C is already past B. So you can simply discard the Black Eyed Peas and move on to the next artist in the first list.

../_images/Alien6.PNG

Discarding the “Black Eyed Peas” would save my hearing!

Here’s what the new algorithm looks like. Remember that it assumes the lists will be in sorted order (we’ll talk about how to sort lists later).

  • Initialize a counter to 0
  • Set the current item in each list to be the first item in each list
  • Repeat the following until you reach the end of one of the two lists:
  • Compare the current items in each list.
  • If they are equal, increment the counter and advance the current item of both lists to the next items in the lists.
  • Else, if the current item in the first list is alphabetically before the current item in the second list, advance the current item in the first list.
  • Else, advance the current item in the second list.
  • The counter holds the number of matches.

Before you look at the code below, ask yourself: “What kind of loop should I use here? A for loop or while loop?” When you think you’ve got an answer you’re happy with, read on.

Here’s the corresponding Python code:

(ch05_matches)

Using ==, >, <, etc. on strings is perfectly valid and these operators will compare strings alphabetically. Recall from section 4.2.3 that text is represented numerically. In this encoding all of the capital letters are mapped to consecutive numbers, with ‘A’ getting the lowest number and ‘Z’ getting the highest. The lower-case letters are also mapped to consecutive numbers, which are all higher than the number used for ‘Z’. It is these encodings that are used when strings are compared. This means that strings that start with capital letters will always come “alphabetically” before those that start with lower case letters. This issue is important to consider in our program (we haven’t yet), and we’ll discuss it below in Section 5.5.2.

../_images/Alien6.PNG

Python 2 and earlier versions use ASCII encoding, while Python 3 uses a slightly different encoding called Unicode, but everything here applies to both encodings.

Now the question remains, is this approach really faster than the previous approach to comparing the elements in two lists? The answer is: definitely yes. Let’s again look at the number of comparisons that would need to be made to compare two lists with 4 elements each. Imagine that none of the elements match and that the lists are exactly interleaved alphabetically. That is, first the element in list 1 is smaller, then the element in list 2 is smaller, and so on, as in the lists ["Amy Winehouse", "Coldplay",  "Madonna", "Red Hot Chili Peppers"] and ["Black Eyed Peas", "Dave Matthews Band", "Maroon 5", "Stevie Nicks"]

With these two lists, the code above will never trigger on the first if condition–it will always increment either i or j, but not both. Furthermore, it will run out of elements in one list just before it runs out of elements in the second. Essentially it will look at all the elements in both lists.

At first it might seem like we haven’t made any improvements. After all, aren’t we still looking at all the elements of both lists? Aha, but now there’s an important difference! Whereas before we were looking at all the elements in the second list for each element in the first list, here we are only looking at all the elements in the second list once. In other words, each time through the loop, either i or j is incremented, and they are never decremented. So either i or j will hit the end of the list after all of the elements of that list have been looked at and one fewer than the elements of the other list have been looked at. So in this example, this means we will make exactly 7 comparisons. In general, if the lists are both length \(N\) , the number of comparisons this algorithm will make is \(N+N-1\) or \(2N - 1\). So even for the case where one list has 100 elements and the second has 1000, this is only about 1100 comparisons, a significant improvement over the \(10^5\) of the previous approach!

In technical terms, computer scientists call this second algorithm a linear time algorithm because the equation which describes the number of comparisons is linear. The first algorithm is called a quadratic time algorithm because its equation is quadratic.

5.5.2 A Simple Sorting Algorithm: Selection Sort

Now that we’ve motivated at least one reason to sort data (there are many others—you can probably think of several of them), let’s look at an algorithm for actually doing the sorting.

Let’s consider the list of artists that the user likes that we collect by prompting the user at the start of the program. Recall that the code that gathers this list is the following:

newPref = raw_input("Plese enter the name of an artist \
or band that you like: " )

while newPref != '':
    prefs.append(newPref)
    newPref = raw_input("Please enter an artist or band \
that you like, or just press enter to see recommendations: ")

print ('Thanks for your input!')

Remember that in Python 3 you’ll need to use input rather than raw_input.

The artists the user has entered are stored in the list prefs, but that list is entirely unsorted. We would like to sort this list alphabetically, and clean up the formatting of the text while we’re at it.

../_images/Alien6.PNG

I don’t think this will work for LMFAO!

First, to make our lives easier, and to facilitate the matching process between user profiles, we will make a pass through the list to standardize the format of the artist names. In particular, we want to be sure that there is no leading (or trailing) whitespace, and that the artist or band names are in Title Case (i.e., the first letter of each word is capitalized, and the rest are lower case). Even though this format may fail for some bands , we’ll work with it because it gives us a standard representation that allows the strings to be sorted and compared without us having to worry about case issues getting in the way.

../_images/Alien6.PNG

Recall that all uppercase strings are considered “less than” lower case strings, and it would be confusing to the user to alphabetically sort ZZ Top before adele.

Because strings are immutable objects, we can’t actually change them, but rather we have to generate new strings with the formatting we desire. Here is the code that accomplishes this goal (note that we also build a new list, which is not strictly necessary):

standardPrefs = []
for artist in prefs:
    standardPrefs.append(artist.strip().title())

The strip function returns a new string identical to artist, but without any leading or trailing whitespace. Then the title function returns that same string in Title Case.

Now that our data is standardized, we can sort it from smallest (alphabetically first) to largest. Sorting is an important topic of study in computer science in its own right.

../_images/Alien6.PNG

I sort of knew that it was important!

Here we will discuss one sorting algorithm, but there’s much more to be said on the subject.

To develop our algorithm, we start with small cases: what are the computational pieces that enable the rearrangement of a list?

A two-element list is the smallest (non-trivial) case to consider: in the worst case, the two elements would need to swap places with each other.

In fact, this idea of a swap is all we need! Imagine a large list that we’d like sorted in ascending order.

  • First, we could find the smallest element. Then, we could swap that smallest element with the first element of the list.
  • Next, we would search for the second-smallest element, which is the smallest element of the rest of the list. Then, we could swap it into place as the second element of the overall list.
  • We continue this swapping so that the third-smallest element takes the third place in the list, do the same for the fourth, ... and so on ... until we run out of elements.

This algorithm is called selection sort because it proceeds by repeatedly selecting the minimum remaining element and moving it to the next position in the list.

What functions will we need to write selection sort? It seems we need only two:

  • index_of_min(aList,starting_index), which will return the index of the smallest element in aList, starting from index starting_index.
  • swap(aList,i,j), which will swap the values of aList[i] and aList[j]

Here is the Python code for this algorithm:

(ch05_selSort)

Notice that the code above does not return anything. But when we run it, we see that it works. After the call to selectionSort, our list is sorted!

>>> standardPrefs
['Maroon 5', 'Adele', 'Lady Gaga']
>>> selectionSort(standardPrefs)
>>> standardPrefs
['Adele', 'Lady Gaga', 'Maroon 5']

5.5.3 Why selectionSort Works

Why does this code work? What is more, why does it work even though swap does not have a return statement? There are two key factors.

First, lists are mutable. Thus, two (or more) variables can refer to the same list, and changes that are made to list elements using one variable will also be seen when you use the other variables. But where are the two variables referring to the same list? It seems that standardPrefs is the only variable referring to the original three-element list in the example above.

This is the second factor: when inputs are passed into functions, the function parameters are assigned to each input just as if an assignment statement had been explicitly written, e.g.,

listToSort = standardPrefs

occurs at the beginning of the call to selectionSort(standardPrefs).

As a result, as long as listToSort and standardPrefs refer to the same list, any changes that are made to listToSort‘s elements will affect standardPrefs‘s elements—because listToSort‘s elements and standardPrefs‘s elements are one and the same thing.

Takeaway message: Passing inputs into a function is equivalent to assigning those inputs to the parameters of that function. Thus, the subtleties of mutable and immutable data types apply, just as they do in ordinary assignment.

The left side of Figure 5.4 illustrates what is happening at the beginning of the first call to swap. swap‘s variables are displayed in red, while selectionSort‘s variables are displayed in black. Notice that swap‘s i and j refer to the index locations in the list aList (which is just another name for the list listToSort).

../_images/SwapBadFigure.png

Figure 5.4: A graphical depiction of the variables in selectionSort and swap at the beginning and the end of the first call to swap

The right side of Figure 5.4 shows how the variables look at the very end of the first call to swap. When swap finishes, its variables disappear. But notice that even when swap‘s aList, i, j, and temp are gone, the value of the list has been changed in memory. Because swap‘s aList and selectionSort‘s listToSort and the original standardPrefs are all references to the same collection of mutable elements, the effect of assignment statements on those elements will be shared among all of these names. After all, they all name the same list!

5.5.4 A Swap of a Different Sort

Consider a very minor modification to the swap and selectionSort functions that has a very major impact on the results:

def selectionSort(listToSort):
    '''sorts listToSort iteratively and in-place'''
    for starting_index in range(len(listToSort)):
        min_elem_index = index_of_min(listToSort, starting_index)
        # now swap the elements!
        swap(listToSort[starting_index], listToSort[min_elem_index])

def swap(a, b):
    '''swaps the values of a and b'''
    temp = a
    a = b
    b = temp

The code looks almost the same, but now it does not work!

>>> standardPrefs     # Assume this list is already populated with standardized user preferences
['Maroon 5', 'Adele', 'Lady Gaga']
>>> selectionSort(standardPrefs)
>>> standardPrefs
['Maroon 5', 'Adele', 'Lady Gaga']
../_images/SwapFigureBad.png

Figure 5.5: A graphical depiction of the variables in selectionSort and the new (and faulty) swap at the beginning and the end of the first call to swap

The variables a and b in swap indeed do get swapped, as the following before-and-after figure illustrates. But nothing happens to the elements of aList or, to say the same thing, nothing happens to the elements of standardPrefs.

Figure 5.5 illustrates what is happening before and after the call to swap.

What happened? This time, swap has only two parameters, a and b, whose references are copies of the references of listToSort[start_index] and listToSort[min_elem_index], respectively. Keep in mind Python’s mechanism for assignment: Assignments make a copy of a reference.

Thus, when swap runs this time, its assignment statements are working on copies of references. All of the swapping happens just as specified, so that the values referred to by a and b are, indeed, reversed. But the references held within selectionSort‘s list listToSort have not changed. Thus, the value of listToSort does not change. Since the value of standardPrefs is the value of listToSort, nothing happens.

As we mentioned at the start of this section, Selection Sort is just one of many sorting algorithms, some of which are faster than others (Selection Sort is not particularly fast, though it’s certainly not the slowest approach either). In practice, especially for now, you’ll probably just call Python’s built-in sort function for lists, which is efficient, and more importantly, correctly implemented.

>>> standardPrefs     # Assume this list is already populated with standardized user preferences
['Maroon 5', 'Adele', 'Lady Gaga']
>>> standardPrefs.sort()
>>> standardPrefs
['Adele', 'Lady Gaga', 'Maroon 5']
../_images/2Darray.png

Figure 5.6: A graphical depiction of a 2D array. Following our model from above, the green box is stored in the CPU, and the rest of the data (including the references to other lists) is stored in memory.

5.5.5 2D Arrays and Nested Loops

It turns out that you can store more than just numbers in lists: arbitrary data can be stored. We’ve already seen many examples in Chapter 2 where lists were used to store strings, numbers, and combinations of those data types. Lists of lists are not only possible, but powerful and common.

In an imperative context, lists are often called arrays, and in this section we examine another common array structure: arrays that store other arrays, often called 2D arrays.

../_images/Alien6.PNG

Our attorneys have advised us to note that technically, there are very subtle differences between “lists” and “arrays” – but the distinctions are too fine to worry about here.

The concept behind a 2D array is simple: it is just a list whose elements themselves are lists. For example, the code

>>> a2DList = [[1, 2, 3, 4],[5, 6, 7, 8], [9, 10, 11, 12]]

creates a list named a2DList where each of a2DList‘s elements is itself a list. Figure 5.6 illustrates this 2D array (list) graphically. In the figure the array name has been shortened to just A for conciseness.

You can access the individual elements in the 2D array by using “nested” indices as shown below:

>>> a2DList[0]    # Get the first element of a2DList
[1, 2, 3, 4]
>>> a2DList[0][0] # Get the first element of a2DList[0]
1
>>> a2DList[1][1] # Get the second element of a2DList[1]
6

We can also ask questions about a2DList‘s height (the number of rows) and its width (the number of columns):

>>> len(a2DList)    # The number of elements (rows) in a2DList
3
>>> len(a2DList[0]) # The number of elements in a list in a2DList (i.e., the number of columns)
4

Typically, arrays have equal-length rows, though “jagged” arrays are certainly possible.

This 2D structure turns out to be a very powerful tool for representing data. In fact, we’ve already seen an example of this in our recommender program: the stored users’ preferences were represented by a list-of-lists, i.e., a 2D array. In this case, the array certainly was “jagged” as different stored users had rated different numbers of artists.

Above we wrote some code to standardize the representation of one particular user’s preferences. But we might be unsure about the database—are those lists also standardized (and sorted)? If not, no problem. It’s easy to write code that quickly standardizes all the elements in all of the lists:

(ch05_stand)

In the above code, the outer loop controls the iteration over the individual users. That is, for each iteration of the outer loop, the variable storedUser is assigned a list containing the preferences of one of the users. The inner loop then iterates over the elements in that user’s list of preferences. In other words, artist is simply a string.

We can also write the above code as follows:

def standardizeAll(storedPrefs):
    ''' Returns a new list of lists of stored user preferences,
        with each artist string in Title Case,
        with leading and trailing whitespace removed.
    '''
    standardStoredPrefs = []
    for i in range(len(storedPrefs)):
        standardStoredUser = []
        for j in range(len(storedPrefs[i])):
            standardStoredUser.append(storedPrefs[i][j].strip().title())
        standardStoredPrefs.append(standardStoredUser)
    return standardStoredPrefs

This code does the same thing, but we use the indexed version of the for loop instead of letting the for loop iterate directly over the elements in the list. Either construct is fine, but the latter makes the next example more clear.

Because lists are mutable, we don’t actually have to create a whole new 2D array just to change the formatting of the strings in the array. Remember that strings themselves are not mutable, so we can’t directly change the strings stored in the list. However, we can change which strings are stored in the original list, as follows:

def standardizeAll(storedPrefs):
    ''' Mutates storedPrefs so that each string is in
        Title Case, with leading and trailing whitespace removed.
        Returns nothing.
    '''
    for i in range(len(storedPrefs):
        for j in range(len(storedPrefs[i])):
            standardArtist = storedPrefs[i][j].strip().title()
            storedPrefs[i][j] = standardArtist

Notice that this code is slightly simpler than the code above, and avoids the overhead of creating a whole new list of lists.

5.5.6 Dictionaries

So far we’ve looked at one type of mutable data: arrays. Of course, there are many others. In fact, in the next chapter you’ll learn how to create your own mutable data types. But more on that later. For now we will examine a built-in datatype called a dictionary that allows us to create mappings between pieces of data.

../_images/Alien6.PNG

Have you ever looked up dictionary in a dictionary?

To motivate the need for a dictionary, let’s once again return to the recommender program. Until now, in our examples we have not been associating stored user’s names with their preferences. While we don’t need to store the user’s name with their examples in order to calculate the best matching user (assuming we know that the list of stored users does not contain the current user!), there are many advantages of doing so. For one, we can make sure that the user doesn’t get matched against themselves! Additionally, it might be nice in an extended version of the system to suggest possible “music friends” to the user, whose tastes they may want to track.

So how will we associate user names with their preferences? One way would be to simply use lists. For example, we could have a convention where the first element in each user’s stored preferences is actually the name of the user herself. This approach works, but it is not very elegant and is likely to lead to mistakes. For example, what if another programmer working on this system forgets that this is the representation and starts treating user names as artist names? Perhaps not a tragedy, but incorrect behavior nonetheless. In short, elegant design is important!

What we need is a single place to store the mapping from user names to user preferences that we can pass around to all of the functions that need to know about it. This is where the dictionary comes in handy! A dictionary allows you to create mappings between pieces of (immutable) data (the keys) and other pieces of (any kind of) data (the values). Here’s an example of how they work:

>>> myDict = {}     # creates a new empty dictionary

# create an association between 'April' her
# list of preferred artists.
# 'April' is the key and the list is the value
>>> myDict['April'] = ['Maroon 5', 'The Rolling Stones',
                       'The Beatles']

# Ditto for Ben and his list
>>> myDict['Ben'] =  ['Lady Gaga', 'Adele',
                      'Kelly Clarkson', 'The Dixie Chicks']
>>> myDict                  # display the mappings currently in the dictionary
{'April': ['Maroon 5', 'The Rolling Stones', 'The Beatles'],
'Ben': ['Lady Gaga', 'Adele', 'Kelly Clarkson', 'The Dixie Chicks']}
>>> myDict['April']         # get the item mapped to 'April'
['Maroon 5', 'The Rolling Stones', 'The Beatles']
>>> myDict['f']             # get the item mapped to 'f'. Will cause an error because
                            # mappings are one-way--'f' is not
                            # a valid key.
Traceback (most recent call last):
File "<pyshell\#14>", line 1, in <module>
myDict['f']
KeyError: 'f'
>>> myDict.has_key('f')     # Check whether a key is in the
                            # dictionary
False
>>> myDict.keys()           # Get the keys in the dictionary
[April', 'Ben']             # Notice that this is a special kind of
                            # object that you can iterate
                            # over, but that's not a list
>>> myDict[1] = 'one'       # keys can be any immutable type
>>> myDict[1.5] = [3, 5, 7] # values can also be mutable, and
                            # we can mix types in the same
                            # dictionary
>>> myDict[[1, 2]] = 'one'  # Keys cannot be mutable

Traceback (most recent call last):
 File "<pyshell\#36>", line 1, in <module>
 myDict[[1, 2]] = 'one'
TypeError: list objects are unhashable

# a shorthand way to create a dictionary
>>> userPrefs = {'April': ['Maroon 5', 'The Rolling Stones', 'The Beatles'],
'Ben': ['Lady Gaga', 'Adele', 'Kelly Clarkson', 'The Dixie Chicks']}
>>> userPrefs
'April': ['Maroon 5', 'The Rolling Stones', 'The Beatles'],
'Ben': ['Lady Gaga', 'Adele', 'Kelly Clarkson', 'The Dixie Chicks']

The order in which the key-value pairs appear in Python’s output of userPrefs represents its internal representation of the data. They are not always in the same order.

Now let’s look at how we can modify our recommender code to use a dictionary instead of a list of lists:

def getBestUser(currUser, prefs, userMap):
    ''' Gets recommendations for currUser based on the users in userMap (a dictionary)
    and the current user's preferences in prefs (a list) '''
    users = userMap.keys()
    bestuser = None
    bestscore = -1
    for user in users:
        score = numMatches(prefs, userMap[user])
        if score > bestscore and currUser != user:
            bestscore = score
            bestuser = user
    return bestuser

Notice that dictionaries allow us to make sure we’re not matching a user to their own stored preferences! Pretty simple... and flexible too!

5.6 Reading and Writing Files

We’ve almost got all the pieces we need to build our recommender program. The last major piece we are missing is the ability to read and write files, which we will need to load and store the preferences for the users the system already knows about.

Fortunately, working with files is very simple in Python, and we illustrate file input and output (i.e., file I/O) through example. Imagine that our stored user preferences exist in a file named musicrec-store.txt. There is one line in the file for each user and the format of each line is the following:

username:artist1,artist2,...,artistN

We can write the following code to read in and process all the lines from this file:

def loadUsers(fileName):
    ''' Reads in a file of stored users' preferences stored
        in the file 'fileName'.
        Returns a dictionary containing a mapping of user
        names to a list of preferred artists
    '''
    file = open(fileName, 'r')
    userDict = {}
    for line in file:
        # Read and parse a single line
        [userName, bands] = line.strip().split(":")
        bandList = bands.split(",")
        bandList.sort()
        userDict[userName] = bandList
    file.close()
    return userDict

When we call this function we would pass it the filename musicrec-store.txt.

There are a few key pieces to this function. First, we have to open the file for reading. This task is accomplished with the line file = open(fileName, 'r'), which gives us a link to the contents of the file through file, but we can only read from this file (we cannot write to it). If we wanted to write to the file, we would specify 'w' as the second argument to the open function; if we wanted to both read and write, we would pass 'w+' as the second argument.

Once we have our handle to the file’s contents (file) we can read lines from it using the for loop construct above. As the for loop iterates over file, what it is actually doing is pulling out one line at a time until there are no more lines, at which point the for loop ends.

Finally, when we have read all the data out of the file and stored it in the dictionary in our program, we can close the file using file.close().

The function below shows the part of the code that writes the stored user preferences (including the current user) back to the file:

def saveUserPreferences(userName, prefs, userMap, fileName):
    ''' Writes all of the user preferences to the file.
        Returns nothing. '''
    userMap[userName] = prefs
    file = open(fileName, "w")
    for user in userMap:
        toSave = str(user) + ":" + ",".join(userMap[user]) + \
                    "\n"
        file.write(toSave)
    file.close()

5.7 Putting It All Together: Program Design

Now that we’ve got all the tools, it’s time to construct the whole recommender program. However, as a disclaimer, large scale program design can at times feel more like an art than a science. Maybe more science than art! There are certainly guiding principles and theories, but you rarely “just get it right” on your first try, no matter how careful you are. So expect that in implementing any program of reasonable size, you’ll need to make a couple of revisions. Think of it like writing a paper—if you like writing papers.

../_images/Alien6.PNG

CS is at least as creative and precise as careful writing; don’t be afraid of creating—and redoing—several drafts!

The first step to program design is trying to figure out what data your program is responsible for and how that data comes into the program (input), gets manipulated (computation) and is output from the program (output). In fact, this task is so important that we had you do this at the beginning of the chapter.

After identifying the data your program must keep track of, the next step is to decide what data structures you will use to keep track of this data. Do you need ints, lists, or dictionaries, or some other structure entirely? Take a moment to choose variable names and data types for each of the piece of data you identified at the beginning of the chapter (or that you identify now).

Finally, now that you have identified our data and the computation that our program needs to perform, you are ready to start writing functions. We can start with any function we like–there’s no right order as long as we are able to test each function as we write it.

Our complete program is given below in listing 5.1. Let’s return to the idea of collaborative filtering and examine how this program implements the basic CF algorithm we described in section 5.1. As we described there, the basic idea behind collaborative filtering is to try to make predictions about a user by looking at data from a number of other users. There are two basic steps to a CF algorithm:

  • Find the user (or users) most similar to the current user,
  • Make predictions based on what they like.
../_images/2Darray.png

In the code in listing 5.1, these two steps are performed in helper functions called in getRecommendations. First, findBestUser finds and returns the user whose tastes are closest to the current user. Then drop returns a list of the artists who the best user likes who are not already in the current user’s list.

Notice that even though the code is relatively short, we have chosen to separate the two stages of the algorithm into two separate functions for clarity. Generally it is a good idea for each semantic piece of your algorithm to have its own function. Notice also that findBestUser also relies on a helper function (numMatches), which again helps make the specific functionality of each piece of the code more clear.

One more item that’s new in the code below is the last line:

if __name__ == "__main__": main()

You will notice that the function main implements the main control of the recommendation program. We could run our recommendation program by loading this code into the Python interpreter and then typing:

>>> main()

However, to keep the user from having to type this extra command, we include the above line, which tells the program to automatically run the function main as soon as the code is loaded into the interpreter.

Finally, you’ll notice the variable PREF_FILE at the top of the program. This variable is known as a global variable because it is accessible to all of the functions in the program. By convention, global variable names are often ALL CAPS to distinguish them from local variables and parameters. Generally global variables should be avoided, but there are a few special cases such as this one where we want to avoid sprinkling a hard-coded value (in this case, the name of the file) throughout the program. Using a global variable for the filename makes it easy to change in the future.

# A very simple music recommender system.

PREF_FILE = "musicrec-store.txt"

def loadUsers(fileName):
    ''' Reads in a file of stored users' preferences
        stored in the file 'fileName'.
        Returns a dictionary containing a mapping
        of user names to a list preferred artists
    '''
    file = open(fileName, 'r')
    userDict = {}
    for line in file:
        # Read and parse a single line
        [userName, bands] = line.strip().split(":")
        bandList = bands.split(",")
        bandList.sort()
        userDict[userName] = bandList
    file.close()
    return userDict
         
def getPreferences(userName, userMap):
    ''' Returns a list of the uesr's preferred artists.

        If the system already knows about the user,
        it gets the preferences out of the userMap
        dictionary and then asks the user if she has
        additional preferences.  If the user is new,
        it simply asks the user for her preferences. '''
    newPref = ""
    if userName in userMap:
        prefs = userMap[userName]
        print("I see that you have used the system before.")
        print("Your music preferences include:")
        for artist in prefs:
            print(artist)
        print("Please enter another artist or band that you")
        print("like, or just press enter")
        newPref = input("to see your recommendations: ")
    else:
        prefs = []
        print("I see that you are a new user.")
        print("Please enter the name of an artist or band")
        newPref = input("that you like: " )
        
    while newPref != "":
        prefs.append(newPref.strip().title())
        print("Please enter another artist or band that you")
        print("like, or just press enter")
        newPref = input("to see your recommendations: ")
        
    # Always keep the lists in sorted order for ease of
    # comparison
    prefs.sort()
    return prefs


def getRecommendations(currUser, prefs, userMap):
    ''' Gets recommendations for a user (currUser) based
        on the users in userMap (a dictionary)
        and the user's preferences in pref (a list).
        Returns a list of recommended artists.  '''
    bestUser = findBestUser(currUser, prefs, userMap)
    recommendations = drop(prefs, userMap[bestUser])
    return recommendations

def findBestUser(currUser, prefs, userMap):
    ''' Find the user whose tastes are closest to the current
        user.  Return the best user's name (a string) '''
    users = userMap.keys()
    bestUser = None
    bestScore = -1
    for user in users:
        score = numMatches(prefs, userMap[user])
        if score > bestScore and currUser != user:
            bestScore = score
            bestUser = user
    return bestUser

def drop(list1, list2):
    ''' Return a new list that contains only the elements in
        list2 that were NOT in list1. '''
    list3 = []
    i = 0
    j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            list3.append(list2[j])
            j += 1
    
    return list3

def numMatches( list1, list2 ):
    ''' return the number of elements that match between
        two sorted lists '''
    matches = 0
    i = 0
    j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            matches += 1
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return matches

def saveUserPreferences(userName, prefs, userMap, fileName):
    ''' Writes all of the user preferences to the file.
        Returns nothing. '''
    userMap[userName] = prefs
    file = open(fileName, "w")
    for user in userMap:
        toSave = str(user) + ":" + ",".join(userMap[user]) + \
                    "\n"
        file.write(toSave)
    file.close()    

def main():
    ''' The main recommendation function '''
    userMap = loadUsers(PREF_FILE)
    print("Welcome to the music recommender system!")

    userName = input("Please enter your name: ")
    print ("Welcome,", userName)

    prefs = getPreferences(userName, userMap)
    recs = getRecommendations(userName, prefs, userMap)

    # Print the user's recommendations
    if len(recs) == 0:
        print("I'm sorry but I have no recommendations")
        print("for you right now.")
    else:
        print(userName, "based on the users I currently")
        print("know about, I believe you might like:")
        for artist in recs:
            print(artist)

        print("I hope you enjoy them! I will save your")
        print("preferred artists and have new")
        print(" recommendations for you in the future")

    saveUserPreferences(userName, prefs, userMap, PREF_FILE)
    

if __name__ == "__main__": main()

5.8 Conclusion

We’ve covered a lot of ground in this chapter, ultimately resulting in a recommender program similar in spirit to ones that you undoubtedly use yourself.

../_images/Alien6.PNG

Are there any recommender programs that recommend recommender programs?

We’ve also seen that there are often multiple different ways of doing the same thing. For example, recursion, for loops, and while loops all allow us to repeat a computational task. How do you determine which one to use? Some problems are inherently recursive. For example, the edit distance problem from Chapter 2 was very naturally suited for recursion because it can be solved using solutions to smaller versions of the same problem – the so-called “recursive substructure” problem. Other problems, like computing the factorial of a number, can be solved naturally – albeit differently – using recursion or loops. And then there are cases where loops seem like the best way to do business. The fact that there are choices like these to be made is part of what makes designing programs fun and challenging.