The goal of this project is to use neural networks to categorize newsgroup messages. Text classification is important for searching through information, and in this project we try to distinguish various categories of newsgroup messages. We compare a variety of approaches: a Self-Organizing Map (SOM), a Support Vector Machine (SVM), and an ensemble of Multi-Layer Perceptrons (MLPs).
In 2009, Zaghloul, Lee, and Trimi compared neural networks to support vector machines. They used 400 documents, which they classified into 9 separate categories. As input to their networks, they used the in-document frequencies of 100 pre-selected words: the most common words across all documents, with some filtering. The team showed results for only 5 of the categories, claiming that there were not enough positive training examples for the other 4. The neural network showed per-category classification rates between 48% and 80%, and the SVM accuracies were between 54% and 66%. They concluded that there was no significant difference between the performance of the neural network and the SVM, and thus that neural networks can be used for text classification.
The team of Rauber, Schweighofer, and Merkl used a self-organizing map to classify legal text documents in a project from 2000. Their paper showed results for a 5x5 mapping of 43 documents. Each of the 25 nodes was labeled according to keywords that were representative of documents that were classified to that node. This map gave "satisfying" results, for the documents grouped to one node were similar. For example, a node grouping documents relating to the energy market was labeled with the following keywords: energy, transit, grid, pressure, and electricity.
Stanford graduate students Sullivan and Kulkarni worked on a project similar to ours, comparing Naive Bayes variants to an SVM. Their project used the same dataset as ours: the Twenty Newsgroups dataset from the machine learning collection at UCI. Their Naive Bayes classifiers showed accuracies from 82.3% to 94.5%. As input to their SVM, they used the frequencies of the top ten words chosen from each category by the Chi-square method.
A last motivation for this project was an assignment of Audrey's from another CS course, where a basic Naive Bayes classifier was used to classify the same dataset, achieving an accuracy of 83.2% on the Twenty Newsgroups dataset.

The dataset we used is from UCI and contains messages from 20 newsgroups, 20,000 messages in total. The most recent messages are from 9/9/99, and the collection goes back a few years.
These are the 20 categories:
For our tests, we used 80% of the data as training data and 20% as test data, chosen randomly from the overall data set. Overall, 3,808 messages were in the test set and 16,192 in the training set.
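A random split of this kind can be sketched as follows. This is a minimal illustration, not our actual preprocessing code; the function name and fixed seed are our own choices, and the counts above show our real split was not an exact 20% cut.

```python
import random

def split_dataset(messages, test_fraction=0.2, seed=0):
    """Shuffle the messages and split them into (train, test) sets."""
    rng = random.Random(seed)          # fixed seed for a repeatable split
    shuffled = list(messages)          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(20000))
```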
We decided to use word frequency to distinguish the newsgroups. Using all words would be impractical, so we started with the 400 most frequent words. We then filtered out what we determined to be useless words: prepositions, conjunctions, auxiliary verbs, numbers (both spelled out and as numerals), and symbols. We ended up with the following 243 words to be considered.
all, edu, go, find, writes, enough, send, only, going, under, get, very, de, every, probably, list, email, try, bad, team, where, wrote, set, says, up, second, computer, best, subject, even, what, said, please, state, version, above, between, new, net, public, available, never, however, here, let, key, others, news, come, both, great, last, many, against, etc, com, point, cc, ca, article, cs, use, mark, few, much, call, tell, more, life, else, must, case, look, car, bill, following, example, program, give, chip, high, heard, someone, something, want, keep, information, different, end, rather, things, make, same, how, used, max, after, wrong, law, data, man, off, maybe, well, thought, person, without, software, order, help, just, less, being, when, over, years, course, through, thanks, world, yes, still, yet, before, group, seems, actually, better, other, might, image, real, good, around, government, read, possible, game, know, using, bit, now, day, name, like, always, university, either, each, found, mean, right, old, people, hard, some, back, year, out, space, god, since, looking, re, got, gov, run, power, free, quite, reason, put, org, post, card, about, anything, david, thing, place, think, first, own, into, number, one, down, done, least, another, next, little, her, support, question, system, long, anyone, way, john, files, lot, jesus, part, line, believe, true, made, see, uk, problem, called, ac, general, say, file, need, seen, work, any, again, no, able, book, take, which, really, sure, though, who, problems, most, nothing, why, windows, drive, nasa, time, far, dos, having, fact, once
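The feature-extraction pipeline described above can be sketched as follows. This is a simplified illustration: the tiny `STOP_WORDS` set stands in for our full filtering rules, and the function names are our own.

```python
from collections import Counter

# Hypothetical stand-in for the filtering described above
# (prepositions, conjunctions, auxiliary verbs, numbers, symbols).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was"}

def build_vocabulary(documents, size=400):
    """Take the `size` most frequent words over all documents, then filter."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    frequent = [w for w, _ in counts.most_common(size)]
    return [w for w in frequent if w not in STOP_WORDS and w.isalpha()]

def frequency_vector(document, vocabulary):
    """Per-document word-frequency vector over the kept vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]

docs = ["the team wrote the program", "the program uses data and data files"]
vocab = build_vocabulary(docs, size=10)
vec = frequency_vector(docs[1], vocab)
```

In our project the same idea is applied with 400 candidate words filtered down to the 243 listed above, yielding a 243-dimensional frequency vector per message.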
For this part of the project we used the package SOM-PAK. We tried out a variety of mappings, and overall found very poor results. However, these maps after training did group similar categories closer and made reasonable sense. Examples of these maps are shown below, along with a table showing the accuracies of various maps.
An example 4x5 map (the numbers on the map correspond to the label of each node; nodes with similar shades were found to be more similar to one another):
The groupings on this map provide insight into our data set. For example:
Another interesting way to see this mapping is through the graph below. This graph more clearly shows how far nodes corresponding to 19 and 3 are from others.
However, the maps we used for our analysis had larger dimensions than 4x5. Below is an example of our 15x15 mapping.
Again, we can observe a number of interesting insights into our mapping, such as the following:
Our mappings seem to work reasonably well. However, when we measure how often the SOM matched a test sample to a node whose label agrees with the sample's own label, the results are not great. Accuracy here is strictly the fraction of test samples whose assigned node carries the same label as the sample. The results are shown in the table below:
| Map Dimension | Accuracy |
|---|---|
| 8x5 | 12.9% |
| 8x10 | 15.4% |
| 15x15 | 19.7% |
| 20x18 | 20.6% |
Since our accuracy improves as we add more nodes, it seems that in maps of lower dimension, incorrectly classified samples are still being grouped with nodes that correspond to a category similar to the sample's own. As these nodes break up, more categories are represented in that area, and test samples are more often grouped with nodes that are labeled identically to the samples.
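The strict scoring rule used for the table above can be sketched as follows. The function and the toy matching rule are illustrative only; in our runs the best-matching unit comes from the trained SOM-PAK map.

```python
def som_label_accuracy(test_samples, node_labels, best_matching_unit):
    """Fraction of test samples whose best-matching node carries the
    same category label as the sample itself."""
    correct = 0
    for vector, label in test_samples:
        if node_labels.get(best_matching_unit(vector)) == label:
            correct += 1
    return correct / len(test_samples)

# Toy illustration: two labeled nodes and a trivial matching rule.
node_labels = {0: "sci.space", 1: "rec.autos"}
bmu = lambda v: 0 if v[0] > 0 else 1
samples = [([1], "sci.space"), ([-1], "rec.autos"),
           ([1], "rec.autos"), ([-1], "rec.autos")]
accuracy = som_label_accuracy(samples, node_labels, bmu)
```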
We used the package LIBSVM for our support vector machine, with the same training and test vectors as for the self-organizing map. Our best accuracy for the SVM was 50.03%, using the Radial Basis Function kernel with the parameter C set to 25.
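The Radial Basis Function option computes the kernel K(x, z) = exp(-gamma * ||x - z||^2), which can be sketched as below. Note that C is the soft-margin penalty in the SVM optimization, not part of the kernel itself; the gamma value here is only a placeholder.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial Basis Function kernel: exp(-gamma * ||x - z||^2).
    gamma=1.0 is a placeholder, not the value used in our runs."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```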
We used an ensemble approach with MLPs: we trained multiple MLPs separately and used them together to make decisions. Each MLP had 243 inputs, one for each of the words, and 20 outputs, one for each of the categories. A correct output would have a 1 for the correct newsgroup and a 0 for all the others. There was a single hidden layer, and the network was fully connected between layers. Each network was trained for 100 epochs using gradient descent with a learning rate of 0.05. Each node had a logistic output varying from 0 to 1. The training data was randomized so that the newsgroups would not be clumped together. First, we present the results of the individual MLPs:
| Number of Hidden Nodes | Accuracy |
|---|---|
| 50 | 32.2% |
| 100 | 41.8% |
| 200 | 43.9% |
| 400 | 29.9% |
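The forward pass of the single-hidden-layer architecture described above can be sketched as follows. This is a minimal illustration with our own names; bias terms are omitted for brevity.

```python
import math

def logistic(x):
    """Logistic activation, giving outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(inputs, hidden_weights, output_weights):
    """Fully connected single-hidden-layer MLP with logistic units,
    as in the architecture above (243 inputs, 20 outputs in practice).
    Biases are omitted for brevity."""
    hidden = [logistic(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    return [logistic(sum(w * h for w, h in zip(row, hidden)))
            for row in output_weights]

# Toy call: one input, one hidden node, one output node.
outputs = mlp_forward([1.0], [[0.0]], [[0.0]])
```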
Our ensemble approach used two different methods to choose which newsgroup a message belongs to. The first method, which we call the electoral college method, has each MLP vote for the newsgroup it thinks the message belongs to, and the newsgroup with the most votes wins. In the event of a tie, the raw output values are summed and the category with the largest total is chosen.
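The electoral college method, including the tie-break on summed raw outputs, can be sketched as follows (an illustrative implementation with our own names):

```python
def electoral_college(raw_outputs):
    """Each MLP's output vector votes for its argmax category; ties are
    broken by the largest summed raw output over the tied categories."""
    n_categories = len(raw_outputs[0])
    votes = [0] * n_categories
    for outputs in raw_outputs:
        votes[outputs.index(max(outputs))] += 1
    top = max(votes)
    tied = [i for i, v in enumerate(votes) if v == top]
    if len(tied) == 1:
        return tied[0]
    sums = [sum(outputs[i] for outputs in raw_outputs) for i in tied]
    return tied[sums.index(max(sums))]

# Clear majority: two of three MLPs pick category 0.
winner = electoral_college([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.7, 0.2]])
# One vote each for 0 and 1; category 1 wins on summed raw outputs.
tie_winner = electoral_college([[0.6, 0.1, 0.0], [0.1, 0.9, 0.0]])
```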
| MLPs used | Accuracy |
|---|---|
| 50/100/200/400 | 40.2% |
| 50/100/200 | 41.0% |
After receiving poor results from all of our classifiers compared to the relatively unintelligent Naive Bayes classifier, we decided to investigate how Naive Bayes would do with information on only the 243 words we used for the other classifiers, rather than all of the words of each document. As expected, with this limited information the Naive Bayes performed much worse, with an accuracy of 6.25%. It also successfully classified documents from only two categories: almost all of the messages labeled misc.forsale were classified correctly, and about 1/5 of the comp.sys.mac.hardware group.
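A vocabulary-restricted multinomial Naive Bayes of the kind tested here can be sketched as follows. This is a generic illustration with add-one smoothing, not the course assignment's actual classifier; all names are our own.

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocabulary):
    """Multinomial Naive Bayes restricted to a fixed vocabulary,
    with add-one (Laplace) smoothing. Each doc is a list of words."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc if w in vocabulary)
        total = sum(counts.values())
        log_probs = {w: math.log((counts[w] + 1) / (total + len(vocabulary)))
                     for w in vocabulary}
        model[label] = (math.log(len(docs) / total_docs), log_probs)
    return model

def classify_nb(model, doc):
    """Pick the class with the highest log-posterior; words outside
    the restricted vocabulary are simply ignored."""
    def score(label):
        prior, log_probs = model[label]
        return prior + sum(log_probs[w] for w in doc if w in log_probs)
    return max(model, key=score)

vocabulary = {"space", "nasa", "sale", "offer"}
model = train_nb({"sci.space": [["space", "nasa", "space"]],
                  "misc.forsale": [["sale", "offer", "sale"]]}, vocabulary)
prediction = classify_nb(model, ["nasa", "space", "the"])
```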
One thing we decided to look into after finishing our original goals for the project was whether we could achieve better results by lowering the number of categories into which we were trying to place messages. For this, we randomly chose which categories to keep and removed samples of the other categories from the training and test datasets. Using the same parameters as in the 20-category experiments above, this extension showed improved results, as expected.
| Number of Hidden Nodes | Accuracy |
|---|---|
| 50 | 67.5% |
| 100 | 69.6% |
| 200 | 72.5% |
| 400 | 69.2% |
| Number of Hidden Nodes | Accuracy |
|---|---|
| 50 | 44.1% |
| 100 | 34.4% |
| 200 | 30.1% |
| 400 | 43.1% |
| MLPs used | Accuracy |
|---|---|
| 50/100/200/400 | 42.4% |
| 50/100/200 | 36.1% |
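The category filtering used for this extension can be sketched as follows (an illustrative helper with our own names; the number of kept categories is a parameter, since the sketch is not tied to any particular run above):

```python
import random

def restrict_categories(samples, keep_count, seed=0):
    """Randomly keep keep_count categories and drop samples of the rest.
    Each sample is a (features, label) pair."""
    labels = sorted({label for _, label in samples})
    kept = set(random.Random(seed).sample(labels, keep_count))
    return [(x, label) for x, label in samples if label in kept], kept

# Toy data: four categories, two samples each; keep two categories.
samples = [(i, label) for label in ["a", "b", "c", "d"] for i in range(2)]
reduced, kept = restrict_categories(samples, keep_count=2)
```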
Overall, it seems that SVMs were the most effective at classifying these newsgroup messages with the data we provided; they consistently performed better than any of the other options. After that, MLPs performed next best. The ensemble approach did not significantly improve the performance of the MLPs, but given the success of previous papers, this is likely due to too few MLPs and not enough training time. The SOM is a nice way to visualize information but does not actually categorize it very effectively. Finally, the Naive Bayes classifier seems effective, but it must use more information than any of the other techniques mentioned to get these results; when restricted to the same amount of information, it actually performs significantly worse. Also, as the number of categories decreased, the accuracies for each method tended to increase.