ART-House: A Collaborative Filtering System for Movies using the ART2 Algorithm

Stephen Jones
December 12, 2006

Problem

The Netflix Prize is a $1 million award for an algorithm that significantly improves Netflix's movie recommender system. When I read about it in the news, it sounded like a good fit for my neural nets project. Though it would have been nice, I did not set out with the intention of winning the prize; my goal in this project is to implement a collaborative filtering system for movie ratings.

The ART Algorithm

After learning about several clustering algorithms, I found that the ART algorithm seemed the most powerful and adaptable. ART uses competitive learning to cluster inputs and can create new clusters when an input vector is too different from the existing clusters (what "too different" means is governed by a vigilance parameter). If clusters correspond to groups of users with similar taste in movies, then the ability to have a variable number of clusters is very desirable. In fact, research has shown that the ART algorithm performs well in collaborative filtering compared to other clustering networks [1]. I am working with the ART2 algorithm, which clusters continuous inputs, whereas the ART1 algorithm works only with discrete inputs.
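The vigilance test at the heart of ART can be sketched as follows. This is a simplification: the cosine similarity used here stands in for ART2's actual reset computation on normalized activations, and the names `match`, `resonates`, and `rho` are illustrative, not from the paper.

```java
// Sketch of ART-style vigilance: a new input joins an existing cluster only
// if its match with the cluster's prototype exceeds the vigilance rho.
// Cosine similarity is used as a simplified stand-in for ART2's reset test.
public class Vigilance {
    static double match(double[] input, double[] prototype) {
        double dot = 0, ni = 0, np = 0;
        for (int i = 0; i < input.length; i++) {
            dot += input[i] * prototype[i];
            ni += input[i] * input[i];
            np += prototype[i] * prototype[i];
        }
        return dot / (Math.sqrt(ni) * Math.sqrt(np) + 1e-12);
    }

    // True when the input resonates with the prototype; otherwise the network
    // would try the next-best cluster or, failing all of them, create a new one.
    static boolean resonates(double[] input, double[] prototype, double rho) {
        return match(input, prototype) >= rho;
    }
}
```

A high `rho` (e.g. 0.9) forces tight clusters and creates many of them; a low `rho` merges looser matches into fewer clusters.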

Data Processing

A substantial sub-problem here is actually organizing the data for training and testing. Netflix offers contest entrants movie rating data with 17,700 movies rated by over 480,000 users, for more than 100 million ratings. For the scope of this project, I only want to deal with a subset of this data so that my machine can run tests in a reasonable amount of time. This goal raises several issues. The data set can be limited by a minimum number of ratings per movie, a cap on the number of users, or a cap on the number of movies, and limiting one variable will necessarily limit another, since no user has rated every movie. Another issue is that taking a subset of the data can significantly affect how well the algorithm works, depending on both the training subset and the testing subset. It would be most desirable for the algorithm's performance on the testing subset to reflect its performance on the entire data set.

Implementation

The basic idea for the implementation is to cluster user profile vectors, where each element of a profile vector is that user's rating of a movie. If a user has not seen a movie, the corresponding element is assigned a default rating of 2.5. To predict a user's rating of a certain movie, the algorithm consults the user's cluster's "characteristic vector" to see what the average user in that cluster rated the movie.
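The profile construction and prediction steps above can be sketched as follows. The encoding of "not seen" as a raw rating of 0 and the names `toProfile` and `predict` are illustrative assumptions, not details from the actual implementation.

```java
// Sketch of profile construction and prediction: unseen movies get a neutral
// default rating, and a prediction is read straight out of the cluster's
// characteristic vector of average ratings.
public class Predictor {
    static final double DEFAULT_RATING = 2.5; // stands in for "not seen"

    // Build a profile vector, filling unseen movies (encoded here as 0)
    // with the neutral default rating.
    static double[] toProfile(double[] rawRatings) {
        double[] profile = new double[rawRatings.length];
        for (int i = 0; i < rawRatings.length; i++) {
            profile[i] = rawRatings[i] > 0 ? rawRatings[i] : DEFAULT_RATING;
        }
        return profile;
    }

    // The predicted rating is the cluster average for that movie.
    static double predict(double[] characteristic, int movieIndex) {
        return characteristic[movieIndex];
    }
}
```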

ART2

ART1 is a simple algorithm compared to ART2, which must deal with the problem that continuous inputs can be arbitrarily close to one another [2]. The input layer contains several neurons that pre-process input vectors using a normalizer function and a threshold function that sets all values below a certain point to zero; this structure is shown in Fig. 1. I do not fully understand exactly how this input layer solves the problem, but its operation was clear enough for me to implement. The "Feature Recognition" paper [2] laid out the variables and the update equations in a very straightforward way, and the basic implementation was painless.
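The normalize-and-threshold pre-processing can be sketched as below. This is a condensed caricature of the multi-neuron input layer in [2], not a faithful reproduction of its update equations; the parameter name `theta` and the normalize/threshold/renormalize ordering are assumptions.

```java
// Sketch of ART2-style input pre-processing: normalize the vector to unit
// length, zero out components below a noise threshold theta, then
// renormalize. A simplification of the full input layer described in [2].
public class Preprocess {
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        if (norm == 0) return out; // zero vector stays zero
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    // Suppress components below theta, treating them as noise.
    static double[] threshold(double[] v, double theta) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] >= theta ? v[i] : 0;
        return out;
    }

    // One pass through the input layer.
    static double[] preprocess(double[] v, double theta) {
        return normalize(threshold(normalize(v), theta));
    }
}
```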


Fig. 1. Structure of ART2 network [2].

The only snag in the implementation was that, because of the processing in the input layer, the weights for each cluster are not on the 5-star rating scale. Even after much research, I could not find a way to convert a cluster's weights into a characteristic vector that gives the cluster's average ratings. Instead, I maintain those vectors myself in the network: whenever a user's profile vector is clustered, the original, unprocessed vector is averaged into that cluster's characteristic vector.
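The "averaged into the cluster's characteristic vector" step can be kept as an incremental running mean, so the whole history of absorbed profiles never needs to be stored. The class and field names here are illustrative, not from the actual code.

```java
// Sketch of maintaining a cluster's characteristic vector as a running
// average: when the n-th raw (unprocessed) profile joins the cluster, the
// stored average is updated incrementally without revisiting old profiles.
public class Characteristic {
    double[] average; // per-movie average rating for this cluster
    int count;        // number of profiles absorbed so far

    Characteristic(int numMovies) {
        average = new double[numMovies];
        count = 0;
    }

    // Fold one user's raw profile into the cluster average:
    // avg_new = avg_old + (x - avg_old) / n
    void absorb(double[] profile) {
        count++;
        for (int i = 0; i < average.length; i++) {
            average[i] += (profile[i] - average[i]) / count;
        }
    }
}
```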

Data

The Netflix data comes in 17,700 text files, one for each movie in the data set. In each file, user identification numbers are listed next to the rating that user gave the movie.

Creating a Database

Since the algorithm needs user profile vectors, the data needs to be indexed by user rather than by movie. I decided to learn a little about databases in Java and implemented a simple database to organize my data: a Java program reads in the movie files and creates a JavaDB with that information. The main table has a column for the user number, a column for the movie number, and a column for the rating. Creating this table is quite time-consuming, but it only had to be done a few times, once for each set of movies I worked with.

Cutting Down the Size

It was easiest to start out by limiting the number of movies in the database, since decreasing that number speeds up the database creation process significantly and also decides the length of the user profile vectors. After creating several movie sets I decided to use 250 movies for most of my tests. I randomly chose the movies from the 17,700 available.

Limiting the number of users was not so simple. I considered choosing different numbers of users and randomly creating several data sets from a reduced set of movies. For training, it is useful to remove users that have sparse profile vectors because they do not have much information to offer [1]. I decided to set a minimum number of movies that a user must have rated in order to be included in the data set. This does not let me directly set the number of users in the data set, but it cuts down both the size of the data set and the number of uninformative user profiles.
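The minimum-ratings filter amounts to a single pass over per-user rating counts. This sketch assumes the counts have already been pulled from the database; the method and parameter names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the sparse-profile filter: keep only users whose number of
// rated movies meets a minimum threshold (e.g. 5% of the movie set).
public class UserFilter {
    // userIds[i] has rated ratingCounts[i] movies; keep users meeting minMovies.
    static Set<Integer> qualifyingUsers(int[] userIds, int[] ratingCounts,
                                        int minMovies) {
        Set<Integer> keep = new HashSet<>();
        for (int i = 0; i < userIds.length; i++) {
            if (ratingCounts[i] >= minMovies) keep.add(userIds[i]);
        }
        return keep;
    }
}
```

For the 250-movie subset, a 5% threshold corresponds to `minMovies = 13` (ceiling of 12.5).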

Rather than deleting those users' rows from the main table, which can be very time-consuming, I created a second table with one entry per user marking the users that have rated enough movies to qualify for inclusion. Using this second table also lets me reuse the same main table of movie/user/rating data to generate several data sets with different minimum-movie requirements. During this step of the data processing, I randomly chose 2/3 of the data set for training and marked the other 1/3 for testing; this assignment is also stored in the second table of the database.
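The 2/3–1/3 assignment can be sketched as a per-user random flag, mirroring the marker column in the second table. The fixed seed and the method name are illustrative choices, not details of the actual program.

```java
import java.util.Random;

// Sketch of the train/test split: each qualifying user is independently
// flagged for training with probability 2/3 rather than copied into a
// separate table. A fixed seed makes the split reproducible.
public class Splitter {
    // Returns true for users assigned to the training set (about 2/3 of them).
    static boolean[] assignTraining(int numUsers, long seed) {
        Random rng = new Random(seed);
        boolean[] isTraining = new boolean[numUsers];
        for (int i = 0; i < numUsers; i++) {
            isTraining[i] = rng.nextDouble() < 2.0 / 3.0;
        }
        return isTraining;
    }
}
```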

Results

I worked mainly with three sets of data, drawn from the 250-movie subset, which contained over 325,000 users. In a set with a 5% minimum-movie threshold there were over 25,000 users; in a 15% threshold set there were 391 users; and in a 25% threshold set there were only 63 users. I first ran some tests using the 5% threshold set because it was the most representative of the larger data set, varying the maximum number of clusters, the vigilance, and the learning rate. When I varied the learning rate, the results did not change in any regular way, though 0.6 seemed to be a consistently good value. When I varied the maximum number of clusters, increasing the allowable clusters improved the MSE without increasing runtime much. The algorithm seems to choose a good number of clusters on its own, and if speed is not an issue there seems to be no good reason to limit the clusters in this case. Decreasing the vigilance from the high value of 0.9 did not lower the MSE on the 5% threshold set.
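For reference, the MSE figures throughout this section are the mean squared error between predicted and actual ratings over the known ratings in the test set, which can be sketched as:

```java
// Sketch of the evaluation metric: mean squared error between predicted and
// actual ratings over a test set's known (user, movie, rating) triples.
public class Mse {
    static double mse(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - actual[i];
            sum += d * d;
        }
        return sum / predicted.length;
    }
}
```

On the 1-to-5 star scale, an MSE of 1.339 corresponds to a typical prediction error of roughly 1.16 stars.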

As suggested in [1], the sets with higher threshold values, and thus denser profile vectors, generally had lower MSE than the 5% threshold set. The 25% threshold set, though, did not perform as well as the 15% threshold set. I imagine that the 41 users in the 25% training set are not representative enough of the entire set to predict well for the 22 users in its testing set. Indeed, when I took the weights from the 25% threshold training and ran them on the 5% threshold testing set, the MSE was 1.757, higher than that of any other test run I performed.


Fig. 2. Runs on the 15% threshold set show overfitting.

The 15% threshold set did the best of any of the sets. It performed particularly well when I lowered the vigilance value. It makes sense that if a very small set is representative of a larger set, then two vectors that seem similar within that set should probably be grouped in different clusters. The only problem with setting the vigilance so low is that, with the number of epochs I was using, the clusters seemed to overfit the training set, as depicted in Fig. 2. Recognizing the overfitting, I took the weights from the epoch with the lowest MSE and ran those on the 5% threshold testing set. The resulting MSE was 1.339, better than any other run on the 5% threshold testing set, even those that trained on the 5% threshold training set. This shows that the algorithm does benefit from working with mostly dense user profile vectors, and that there is a reasonable chance the algorithm will scale well when training on dense data and then recommending for a larger user base with sparser profile vectors.

Further Work

One idea for future work is a vigilance parameter that increases gradually during training; perhaps that would combat some of the overfitting seen when working with a lower vigilance. The more obvious extension is to make the algorithm work on all 17,700 movies in the data set. Unfortunately, accommodating that much information would require significantly modifying the way user profile vectors are stored in memory.

Deliverables

Bibliography

  1. Graef, Guntram and Christian Schaefer. "Application of ART2 Networks and Self-Organizing Maps to Collaborative Filtering." Lecture Notes in Computer Science, Volume 2266. pp. 296-309, 2002.
  2. Lankalapalli, Kishore, Subrata Chatterjee and T. C. Chang. "Feature Recognition using ART2: A Self-Organizing Neural Network." Journal of Intelligent Manufacturing, Volume 8. pp. 203-214, 1997.