The goal of this project is to use neural networks to categorize newsgroup messages. Text classification is important for searching through information, and in this project we try to distinguish various categories of newsgroup messages. We compare a variety of approaches: a Self-Organizing Map (SOM), a Support Vector Machine (SVM), and an ensemble of Multi-Layer Perceptrons (MLPs).
In 2009, Zaghloul, Lee, and Trimi compared neural networks to support vector machines. They used 400 documents, which they classified into 9 separate categories. As input to their networks, they used the in-document frequencies of 100 pre-selected words: the most common words across all documents, with some filtering. The team showed results for only 5 of the categories, claiming that there were not enough positive training examples for the other 4. The neural network showed per-category classification rates between 48% and 80%, and the SVM accuracies were between 54% and 66%. They concluded that there was no significant difference between the performance of the neural network and the SVM, and thus that neural networks can be used for text classification.
The team of Rauber, Schweighofer, and Merkl used a self-organizing map to classify legal text documents in a project from 2000. Their paper showed results for a 5x5 mapping of 43 documents. Each of the 25 nodes was labeled according to keywords that were representative of documents that were classified to that node. This map gave "satisfying" results, for the documents grouped to one node were similar. For example, a node grouping documents relating to the energy market was labeled with the following keywords: energy, transit, grid, pressure, and electricity.
Stanford graduate students Sullivan and Kulkarni worked on a project similar to ours, comparing Naive Bayes variants to an SVM. Their project used the same dataset as ours: the Twenty Newsgroups dataset from the machine learning collection at UCI. Their Naive Bayes classifiers showed accuracies from 82.3% to 94.5%. As input to their SVM, they used the frequencies of the top ten words chosen from each category by the Chi-square method.
A last motivation for this project was an assignment of Audrey's from another CS course, where a basic Naive Bayes classifier was used to classify the same dataset, achieving an accuracy of 83.2% on the Twenty Newsgroups dataset.

The dataset we used is from UCI and contains messages from 20 newsgroups, 20,000 messages in total. The most recent messages are from 9/9/99, and the collection goes back a few years.
These are the 20 categories:
For our tests, we used 80% of the data as training data and 20% as test data, chosen randomly from the overall data set. Overall, 3,808 messages were in the test set and 16,192 in the training set.
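A random split of this kind can be sketched as follows. This is a minimal illustration, not our actual preprocessing code; the function name and fixed seed are our own choices, and the counts above show our real split was not an exact 20% cut.

```python
import random

def split_dataset(messages, test_fraction=0.2, seed=0):
    """Shuffle the messages and split them into (train, test) sets."""
    rng = random.Random(seed)          # fixed seed for a repeatable split
    shuffled = list(messages)          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(20000))
```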
We decided to use word frequency to distinguish the newsgroups. Using all words would be impractical, so we started with the 400 most frequent words. We then filtered out what we determined to be useless words: prepositions, conjunctions, auxiliary verbs, numbers (both spelled out and as numerals), and symbols. We ended up with the following 243 words to be considered.
all, edu, go, find, writes, enough, send, only, going, under, get, very, de, every, probably, list, email, try, bad, team, where, wrote, set, says, up, second, computer, best, subject, even, what, said, please, state, version, above, between, new, net, public, available, never, however, here, let, key, others, news, come, both, great, last, many, against, etc, com, point, cc, ca, article, cs, use, mark, few, much, call, tell, more, life, else, must, case, look, car, bill, following, example, program, give, chip, high, heard, someone, something, want, keep, information, different, end, rather, things, make, same, how, used, max, after, wrong, law, data, man, off, maybe, well, thought, person, without, software, order, help, just, less, being, when, over, years, course, through, thanks, world, yes, still, yet, before, group, seems, actually, better, other, might, image, real, good, around, government, read, possible, game, know, using, bit, now, day, name, like, always, university, either, each, found, mean, right, old, people, hard, some, back, year, out, space, god, since, looking, re, got, gov, run, power, free, quite, reason, put, org, post, card, about, anything, david, thing, place, think, first, own, into, number, one, down, done, least, another, next, little, her, support, question, system, long, anyone, way, john, files, lot, jesus, part, line, believe, true, made, see, uk, problem, called, ac, general, say, file, need, seen, work, any, again, no, able, book, take, which, really, sure, though, who, problems, most, nothing, why, windows, drive, nasa, time, far, dos, having, fact, once
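The feature-extraction pipeline described above can be sketched as follows. This is a simplified illustration: the tiny `STOP_WORDS` set stands in for our full filtering rules, and the function names are our own.

```python
from collections import Counter

# Hypothetical stand-in for the filtering described above
# (prepositions, conjunctions, auxiliary verbs, numbers, symbols).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was"}

def build_vocabulary(documents, size=400):
    """Take the `size` most frequent words over all documents, then filter."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    frequent = [w for w, _ in counts.most_common(size)]
    return [w for w in frequent if w not in STOP_WORDS and w.isalpha()]

def frequency_vector(document, vocabulary):
    """Per-document word-frequency vector over the kept vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]

docs = ["the team wrote the program", "the program uses data and data files"]
vocab = build_vocabulary(docs, size=10)
vec = frequency_vector(docs[1], vocab)
```

In our project the same idea is applied with 400 candidate words filtered down to the 243 listed above, yielding a 243-dimensional frequency vector per message.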
For this part of the project we used the package SOM-PAK. We tried out a variety of mappings, and overall found very poor results. However, these maps after training did group similar categories closer and made reasonable sense. Examples of these maps are shown below, along with a table showing the accuracies of various maps.
An example 4x5 map (the numbers on the map correspond to the label of each node; nodes with similar shades were found to be more similar to one another):
The groupings on this map provide insight into our data set. For example:
Another interesting way to see this mapping is through the graph below. This graph more clearly shows how far nodes corresponding to 19 and 3 are from others.
However, the maps we used for our analysis had larger dimensions than 4x5. Below is an example of our 15x15 mapping.
Again, we can observe a number of interesting insights into our mapping, such as the following:
Our mappings seem to work reasonably well. However, when we measure how often the SOM matched a test sample to a node whose label agrees with the sample's own label, the results are not great. Accuracy here is strictly the fraction of test samples whose assigned node carries the same label as the sample. The results are shown in the table below:
| Map Dimension | Accuracy |
|---|---|
| 8x5 | 12.9% |
| 8x10 | 15.4% |
| 15x15 | 19.7% |
| 20x18 | 20.6% |
Since our accuracy improves as we add more nodes, it seems that in maps of lower dimension, incorrectly classified samples are still being grouped with nodes that correspond to a category similar to the sample's own. As these nodes break up, more categories are represented in that area, and test samples are more often grouped with nodes that are labeled identically to the samples.
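The strict scoring rule used for the table above can be sketched as follows. The function and the toy matching rule are illustrative only; in our runs the best-matching unit comes from the trained SOM-PAK map.

```python
def som_label_accuracy(test_samples, node_labels, best_matching_unit):
    """Fraction of test samples whose best-matching node carries the
    same category label as the sample itself."""
    correct = 0
    for vector, label in test_samples:
        if node_labels.get(best_matching_unit(vector)) == label:
            correct += 1
    return correct / len(test_samples)

# Toy illustration: two labeled nodes and a trivial matching rule.
node_labels = {0: "sci.space", 1: "rec.autos"}
bmu = lambda v: 0 if v[0] > 0 else 1
samples = [([1], "sci.space"), ([-1], "rec.autos"),
           ([1], "rec.autos"), ([-1], "rec.autos")]
accuracy = som_label_accuracy(samples, node_labels, bmu)
```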
We used the package LIBSVM for our support vector machine, with the same training and test vectors as for the self-organizing map. Our best accuracy for the SVM was 50.03%, using the Radial Basis Function kernel with the parameter C set to 25.
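The Radial Basis Function option computes the kernel K(x, z) = exp(-gamma * ||x - z||^2), which can be sketched as below. Note that C is the soft-margin penalty in the SVM optimization, not part of the kernel itself; the gamma value here is only a placeholder.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial Basis Function kernel: exp(-gamma * ||x - z||^2).
    gamma=1.0 is a placeholder, not the value used in our runs."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```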
We used an ensemble approach with MLPs: we trained multiple MLPs separately and used them together to make decisions. Each MLP had 243 inputs, one for each of the words, and 20 outputs, one for each of the categories. A correct output would have a 1 for the correct newsgroup and a 0 for all the others. There was a single hidden layer, and the network was fully connected between layers. Each network was trained for 100 epochs using gradient descent with a learning rate of 0.05. Each node had a logistic output varying from 0 to 1. The training data was randomized so that the newsgroups would not be clumped together. First, we present the results of the individual MLPs:
| Number of Hidden Nodes | Accuracy |
|---|---|
| 50 | 32.2% |
| 100 | 41.8% |
| 200 | 43.9% |
| 400 | 29.9% |
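The forward pass of the single-hidden-layer architecture described above can be sketched as follows. This is a minimal illustration with our own names; bias terms are omitted for brevity.

```python
import math

def logistic(x):
    """Logistic activation, giving outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(inputs, hidden_weights, output_weights):
    """Fully connected single-hidden-layer MLP with logistic units,
    as in the architecture above (243 inputs, 20 outputs in practice).
    Biases are omitted for brevity."""
    hidden = [logistic(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    return [logistic(sum(w * h for w, h in zip(row, hidden)))
            for row in output_weights]

# Toy call: one input, one hidden node, one output node.
outputs = mlp_forward([1.0], [[0.0]], [[0.0]])
```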
Our ensemble approach used two different methods to choose which newsgroup a message belongs to. The first method, which we call the electoral college method, has each MLP vote for the newsgroup it thinks the message belongs to, and the newsgroup with the most votes wins. In the event of a tie, the raw output values are summed and the category with the largest total is chosen.
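The electoral college method, including the tie-break on summed raw outputs, can be sketched as follows (an illustrative implementation with our own names):

```python
def electoral_college(raw_outputs):
    """Each MLP's output vector votes for its argmax category; ties are
    broken by the largest summed raw output over the tied categories."""
    n_categories = len(raw_outputs[0])
    votes = [0] * n_categories
    for outputs in raw_outputs:
        votes[outputs.index(max(outputs))] += 1
    top = max(votes)
    tied = [i for i, v in enumerate(votes) if v == top]
    if len(tied) == 1:
        return tied[0]
    sums = [sum(outputs[i] for outputs in raw_outputs) for i in tied]
    return tied[sums.index(max(sums))]

# Clear majority: two of three MLPs pick category 0.
winner = electoral_college([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.7, 0.2]])
# One vote each for 0 and 1; category 1 wins on summed raw outputs.
tie_winner = electoral_college([[0.6, 0.1, 0.0], [0.1, 0.9, 0.0]])
```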
| MLPs used | Accuracy |
|---|---|
| 50/100/200/400 | 40.2% |
| 50/100/200 | 41.0% |
After receiving poor results from all of our classifiers compared to the relatively unintelligent Naive Bayes classifier, we decided to investigate how Naive Bayes would do with information on only the 243 words we used for the other classifiers, rather than all of the words of each document. As expected, with this limited information the Naive Bayes performed much worse, with an accuracy of 6.25%. It also successfully classified documents from only two categories: almost all of the messages labeled misc.forsale were classified correctly, and about 1/5 of the comp.sys.mac.hardware group.
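A vocabulary-restricted multinomial Naive Bayes of the kind tested here can be sketched as follows. This is a generic illustration with add-one smoothing, not the course assignment's actual classifier; all names are our own.

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocabulary):
    """Multinomial Naive Bayes restricted to a fixed vocabulary,
    with add-one (Laplace) smoothing. Each doc is a list of words."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc if w in vocabulary)
        total = sum(counts.values())
        log_probs = {w: math.log((counts[w] + 1) / (total + len(vocabulary)))
                     for w in vocabulary}
        model[label] = (math.log(len(docs) / total_docs), log_probs)
    return model

def classify_nb(model, doc):
    """Pick the class with the highest log-posterior; words outside
    the restricted vocabulary are simply ignored."""
    def score(label):
        prior, log_probs = model[label]
        return prior + sum(log_probs[w] for w in doc if w in log_probs)
    return max(model, key=score)

vocabulary = {"space", "nasa", "sale", "offer"}
model = train_nb({"sci.space": [["space", "nasa", "space"]],
                  "misc.forsale": [["sale", "offer", "sale"]]}, vocabulary)
prediction = classify_nb(model, ["nasa", "space", "the"])
```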
One thing we decided to look into after finishing our original goals for the project was whether we could achieve better results by lowering the number of categories into which we were trying to place messages. For this, we randomly chose which categories to keep and removed samples of the other categories from the training and test datasets. Using the same parameters as in the 20-category experiments above, this extension showed improved results, as expected.
| Number of Hidden Nodes | Accuracy |
|---|---|
| 50 | 67.5% |
| 100 | 69.6% |
| 200 | 72.5% |
| 400 | 69.2% |
| Number of Hidden Nodes | Accuracy |
|---|---|
| 50 | 44.1% |
| 100 | 34.4% |
| 200 | 30.1% |
| 400 | 43.1% |
| MLPs used | Accuracy |
|---|---|
| 50/100/200/400 | 42.4% |
| 50/100/200 | 36.1% |
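The category filtering used for this extension can be sketched as follows (an illustrative helper with our own names; the number of kept categories is a parameter, since the sketch is not tied to any particular run above):

```python
import random

def restrict_categories(samples, keep_count, seed=0):
    """Randomly keep keep_count categories and drop samples of the rest.
    Each sample is a (features, label) pair."""
    labels = sorted({label for _, label in samples})
    kept = set(random.Random(seed).sample(labels, keep_count))
    return [(x, label) for x, label in samples if label in kept], kept

# Toy data: four categories, two samples each; keep two categories.
samples = [(i, label) for label in ["a", "b", "c", "d"] for i in range(2)]
reduced, kept = restrict_categories(samples, keep_count=2)
```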
Overall, it seems that SVMs were the most effective at classifying these newsgroup messages with the data we provided; they consistently performed better than any of the other options. After that, MLPs performed next best. The ensemble approach did not significantly improve the performance of the MLPs, but given the success of previous papers, this is likely due to too few MLPs and not enough training time. The SOM is a nice way to visualize information but does not actually categorize it very effectively. Finally, the Naive Bayes classifier seems effective, but it must use more information than any of the other techniques mentioned to get these results; when restricted to the same amount of information, it actually performs significantly worse. Also, as the number of categories decreased, the accuracies for each method tended to increase.