Starter Code

Get the starter code on GitHub Classroom

Docker

New data has been added to the course docker image for this lab. If you use docker, you will want to pull the updated image.

Introduction

In this lab you will investigate the Lexical Sample Task data taken from Senseval 3 and you will build a supervised decision list classifier, trained on the provided training data and tested on the provided test data.

Answers to written questions should be added to a analysis.md, which should be converted into a pdf (analysis.pdf) for your final submission.

Examining the data

The data provided by the Senseval 3 competition is in /cs/cs159/data/senseval3/. Training and testing data are in separate subdirectories. Before you begin work on the lab, you need to spend a few minutes looking at the source data. In particular, examine the files /cs/cs159/data/senseval3/train/EnglishLS.train and /cs/cs159/data/senseval3/test/EnglishLS.test.

When you look through the training data file, you will see the text <lexelt item="activate.v">. This indicates that until the next lexelt tag, all of the training examples are for the ambiguous verb “activate”. There are 228 training examples provided for the lexelt ‘activate.v’. Each training example, called an instance, is given an instance id. The first instance has the id ‘activate.v.bnc.00024693’. The next line in the file indicates the sense label for the word “activate” in the text that follows. The first instance is labeled with sense ‘38201’.

After the instance id and answer are provided, the there is a short piece of text labeled with the tag <context>. This text contains the word “activate” and provides a few sentences worth of context to help you determine which sense is the correct sense. In addition, the context indicates the word that you are trying to disambiguate. In the first context, you will find <head>activate</head>. It might seem strange to label the “head” word like this, but there are a few good reasons to do so. First, sometimes the word in the context isn’t the same as the lexelt. For example, in the second instance, the head word is “activated”, not “activate”. Second, it is possible that the word you are trying to disambiguate appears multiple times in the same piece of text, so marking the head word lets you know which one of the words you should be determining the sense of. Finally, sometimes there are multiple head words marked in the same context, indicating that those words all share the same sense.

Spend some time looking through the training data file. You should find that some instances are labeled with two senses, indicating that annotators either disagreed on what the correct sense was, or they believed that both senses were represented. In addition, some instances are labeled with the sense “U”, meaning “Unclear”. This means that the annotators weren’t clear which sense to assign to the word. There are even some cases where the word was given two sense labels but one of them as a “U”, meaning that some of the annotators agreed on one sense but other annotators were unclear on how to assign a sense to it.

Parsing the data

The source data for the Senseval task is tricky to parse correctly. The starter code includes a file, Lexelt.py, that will do this parsing for you. Before you add to the file, make sure that you understand it, and that you can use it directly like this:

$ python3
>>> from Lexelt import get_data, get_key
>>> train_fp = open('/cs/cs159/data/senseval3/train/EnglishLS.train', 'r')
>>> train_data = get_data(train_fp)
>>> train_data.keys()
dict_keys(['activate.v', 'add.v', 'appear.v', ..., 'write.v'])
>>> train_data['add.v'].keys()
dict_keys(['add.v.bnc.00000134', 'add.v.bnc.00000242', 'add.v.bnc.00000837', ..., 'add.v.bnc.00083170'])
>>> this_instance = train_data['add.v'].get('add.v.bnc.00000242')
>>> this_instance
<Lexelt.LexEltInstance object at 0x7f47a2624fd0>
>>> heads = this_instance.heads
>>> heads
[36]
>>> this_instance.words[heads[0]]
'added'
>>> trainkey_fp = open('/cs/cs159/data/senseval3/train/EnglishLS.train.key', 'r')
>>> get_key(trainkey_fp, train_data)
>>> train_data['add.v'].get('add.v.bnc.00000242').answers
['42603']

Exploring the training data

This section asks you to write some helper functions that will make it easier to answer the questions in the following section. All of these functions should be new member functions of the LexElt class.

pos(self): Returns the part of speech of a LexElt object.
num_headwords(self): Returns a list with the number of headwords in each instance of a LexElt object.
num_answers(self): Returns a list with the number of answers in each instance of a LexElt object.
get_all_senses(self): Returns a single list containing all of the senses of all of the instances of a LexElt object. Excludes U senses.
count_unique_senses(self): Returns the number of unique senses used by at least one of the instances of a LexElt object. Excludes U.
most_frequent_sense(self): Returns the sense that is associated with the largest number of instances of a LexElt object. Ties can be broken any way you like.

Confirm that your functions work before moving on:

>>> lexelt = train_data['activate.v']
>>> lexelt.pos()
'v'
>>> lexelt.num_headwords()
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> lexelt.num_answers()
[1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1]
>>> lexelt.get_all_senses()
['38201', '38201', '38202', '38203', '38201', '38203', '38203', '38201', '38201', '38201', '38202', '38202', '38201', '38203', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38202', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38203', '38201', '38203', '38201', '38203', '38203', '38201', '38201', '38202', '38201', '38201', '38201', '38202', '38201', '38201', '38202', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38202', '38202', '38201', '38202', '38201', '38201', '38201', '38201', '38202', '38203', '38203', '38202', '38204', '38202', '38202', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38202', '38203', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38202', '38204', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38202', '38202', '38201', '38202', '38202', '38202', '38202', '38202', '38202', '38202', '38201', '38202', '38201', '38202', '38201', '38202', '38201', '38202', '38202', '38202', '38202', '38202', '38201', '38202', '38201', '38202', '38202', '38201', '38201', '38201', '38202', '38201', '38202', '38202']
>>> lexelt.count_unique_senses()
4
>>> lexelt.most_frequent_sense()
'38201'

Questions

You should write the solutions to the following questions in the main function of Lexelt.py, and include your answers in your analysis.md file. The first question has been answered for you in the starter code.

How many different “lexelts” (lexical elements) are there? The lexelts represent the polysemous word types that you are being asked to disambiguate.
What is the breakdown of part of speech (verbs, nouns and adjectives) of these lexelts? You can determine the part of speech by looking at the suffix attached to the lexelt: activate.v is a verb, shelter.n is a noun, and hot.a is an adjective.
How many instances (training examples) are there for the lexelt organization.n?
The introduction to the lab mentioned that an instance can have multiple head words and how an instance can have multiple answers. Make a small table showing the breakdown on the number of head words per instance. Make another table showing the breakdown on the number of answers per instance. Don’t break this down per lexelt - just one small table that summarizes all of the data.
How many senses of activate.v are represented in the training data? (Do not count “U” as a sense for this question.)
One common baseline for this task is to assume that you guess randomly. However, rather than actually making random guesses, you can just assume that you will get of the instances for a particular lexelt correct, where is the number of senses for that lexelt. For example, if there 5 senses for a lexelt , the random baseline will get of the instances of the lexelt correct. It doesn’t matter if the number of instances for is not a multiple of : if there are 37 instances for a lexelt with 5 senses, you will say that a random guess will give you of them correct. The question for you is as follows: what percentage of the instances in the training data would you be able to label correctly if you just guessed randomly? Two important notes:
- You should ignore “U” when counting how many senses a lexelt has for this question.
- You aren’t actually checking the answer key in this question. It doesn’t matter what the actual answer is: you are just trying to get an estimate for how many you could get right if you just guessed randomly.
Another common baseline is the “most frequent sense”, or MFS, baseline. In the MFS baseline, you find the most frequent sense for each lexelt and the label every instance with that most frequent sense. What percentage of the instances in the training data would you label correctly using the most frequent sense baseline? For this question, you can assume that only the first sense listed is the “right” one to predict, even though you should use all of the listed senses for all of the instances to determine the most frequent sense. Is the MFS baseline higher or lower than you expected? Discuss.

Exploring the test data

How many instances of activate.v are there in the test data?
There are two similar questions that are not repeats of one another:
1. Are there any senses of organization.n that are represented in the training data but not in the test data? Assuming the answer is yes, is this a problem? Discuss briefly.
2. Are there any senses of organization.n that are represented in the test data but not in the training data? Assuming the answer is yes, is this a problem? Discuss briefly.
For each lexelt in the training data, find the most frequent sense. Repeat this for the test data. How many (and what percentage of) lexelts have the same most frequent sense in the training data and the test data? You can choose how to handle ties but state the method that you used. Discuss the implications of any disparities you found between the MFS in the training and the MFS in the test.
Use the MFS sense from the training data to label the test data. What is your accuracy labeling the test data using the MFS sense baseline? Compare to how the random baseline and the MFS baseline performed when applied to the training data. How does this match your intuition? Discuss. Note that if an instance is labeled with “U”, then “U” is the correct labeling for the sense. Your system is supposed to be able to predict when “U” is appropriate for the instance, so if you predict an actual sense label (e.g. 'organization%1:04:02::') but the true label is “U”, you are incorrect.

Feature extraction

In this week’s lab, you will be writing a decision list classifier. Later in the semester, you will use scikit-learn, which has many prebuilt classifiers for you to use. One of your goals this week is build your decision list classifier with an interface similar to that of the classifiers in scikit-learn so that you can compare the performance of your classifier against other classifiers you could have used, and so that you start to become familiar with the scikit-learn API.

One very important thing to keep in mind as you work through this lab is that your are not training a classifier for all of the lexelts at once. Rather, you will train a classifier on each lexelt (e.g. activate.v) and then use that classifier to label the test instances for that lexelt. You can then throw that classifier away and train a new classifier on the next lexelt (e.g. add.v) and use that classifier to label the test instances for that lexelt, throw it away, etc.

Feature vectors and labels

In order to train a supervised classifier, you will need training data. For this task, you are provided with multiple instances for each lexelt. You will represent each instance as a feature vector. A feature vector is a numerical representation for a particular training instance. Each feature vector in the training data will have an accompanying target, a numerical representation of the sense that the instance was labeled with.

In order to conform with the scikit-learn interface, you will put all of your feature vectors (for a single lexelt, e.g. 'activate.v') into a numpy array, where each row of the array is an individual feature vector. In the scikit-learn documentation, this array of feature vectors is denoted by the variable $X$ . If there are $n$ training examples and $m$ features, $X$ will be an $n{\times}m$ array.

Similarly, you will put all of your targets (for a single lexelt) into a 1-dimensional numpy array. The scikit-learn documentation refers to this array of targets as $y$ . Each element $y[i]$ is the target (often called the class label) for $i^{th}$ training example in $X$ .

Feature extraction

For the word sense disambiguation problem, you will extract two types of features:

bag of words features
collocational features

Bag of words features are simply the count for each word within some window size of the head word. For this lab, we will set the window size large enough to include all of the tokens in the context. Don’t throw away things like punctuation, numbers, etc. If your classifier finds those tokens useful, it can use them. If they are just noise, your classifier should be able to ignore them.

For example, for instance id ‘activate.v.bnc.00044852’, the bag of words features would show that the token ‘receptors’ appeared 4 times in the context, the token ‘(‘ appeared 2 times, and the token ‘dimensions’ appeared 1 time.

Collocational features are the n-grams that include the head word and their counts. For this lab, you will extract the bigrams and trigrams that include the head word. If there are multiple head words in a single context, you will extract bigrams and trigrams for each one. The instance above, activate.v.bnc.00044852, contains the following text (where activated is the head word):

… upon which parts of the sensory system are activated : stimulation of the retinal receptors …

The collocational features in this instance are the bigrams “are activated” and “activated :”; the trigrams “system are activated”, “are activated :”, and “activated : stimulation”. You can represent these n-grams as a single “word” by joining the tokens in the n-gram together using underscores, such as “system_are_activated”. Just like with bag-of-words features, you’ll want to keep track of the count of each of these n-grams. In this case, the collocation “system_are_activated” appears 1 time.

Add following member functions to the LexEltInstance class:

bow_features(self) should return a Counter object over all of the words in the instance.
colloc_features(self) should return a Counter object over all of the collocational features (bigrams and trigrams) for the instance.
make_features(self) should combine the output of bow_features and colloc_features into a single Counter, and should store that Counter as a features attribute of the LexEltInstance object.
get_feature_labels(self) should return all of the keys of the object’s features attribute.

You are welcome to write additional helper member functions; perphaps having separate functions to find bigrams and trigrams would be helpful here?

Now you should be able to interact with a LexEltInstance like this:

>>> this_instance = lexelt.get('activate.v.bnc.00044852')
>>> this_instance.make_features()
>>> this_instance.features
Counter({'the': 20, 'of': 14, ',': 10, 'to': 7, 'in': 6, 'are': 6, 'and': 5, 'experience': 4, 'system': 4, '.': 4, 'receptors': 4,...})
>>> this_instance.get_feature_labels()
dict_keys(['For', 'neurophysiologists', 'and', 'neuropsychologists', ',', 'the', 'way', 'forward', 'in', 'understanding', 'perception', 'has', 'been', 'to', 'correlate', 'these', 'dimensions', 'of', 'experience', 'with', 'firstly', 'material', 'properties', 'experienced', 'object', 'or', 'event', '(', 'usually', 'regarded', 'as', 'stimulus', ')', 'secondly', 'patterns', 'discharges', 'sensory', 'system', '.', 'Qualitative', 'Aspects', 'Experience', 'The', 'quality', 'modality', 'depends', 'less', 'upon', 'energy', 'reaching', 'nervous', 'than', 'which', 'parts', 'are', 'activated', ':', 'stimulation', 'retinal', 'receptors', 'causes', 'an', 'light', ';', 'inner', 'ear', 'gives', 'rise', 'sound', 'so', 'on', 'Muller', "'s", 'nineteenth', '-', 'century', 'doctrine', 'specific', 'energies', 'formalized', 'ordinary', 'observation', 'that', 'different', 'sense', 'organs', 'sensitive', 'physical', 'world', 'when', 'they', 'stimulated', 'sensations', 'those', 'It', 'was', 'proposed', 'there', 'endings', 'within', 'attuned', 'types', 'example', 'eye', 'respond', 'cochlear', 'vibrations', 'air', 'are_activated', 'activated_:', 'system_are_activated', 'are_activated_:', 'activated_:_stimulation'])

Storing features in a `numpy` array

Now that you know what features to extract, you need to store them in a numpy array. To do that, you first need to know the complete set of features that you have (for one lexelt!). The first step, therefore, is to go through each instance of a lexelt and collect the features of them.

Write a member function of the LexElt class called get_features(). This function should do the following:

Call make_features for all of the LexElt object’s instances
Build a (sorted) list of all of the features that show up at least once in at least one instance for the LexElt object.
Create a numpy array the right size to hold a count for each of the features for each of the instances.
Pass the list of feature names to a new member function of the LexEltInstance class, to_vector, that will return a 1-dimensional numpy array.
Store the array from to_vector in the right row of the feature array.

get_features should then return both the list of feature labels and the numpy array with all of the features for all of the instances.

The to_vector(self, feature_list) function in the LexEltInstance class should return a 1-dimensional array with the count of each feature for that instance. The features must be returned in the same order they appear in feature_list so that we will know the values in one column correspond to the same feature for all instances!

If everything is working correctly, you should now be able to do the following:

>>> this_instance.to_vector(['and', 'types', 'system_are_activated'])
array([5, 1, 1])
>>> labels, X = lexelt.get_features()
>>> labels
['!', '%', "'", "'d", "'ll", "'m", "'re", "'s", "'ve", '(', '(_directly_activated', '(_injunctions_activated', '(_tonically_activating', ')', ')_is_activated', ..., 'zero', 'zipper', 'zoom']
>>> X
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])
>>> X.shape
(228, 7074)

Storing answers in a `numpy` array

The data that we are working with occasionally has instances labeled with multiple correct answers. That’s a challenge to work with, so we’re going to simplify the problem and keep only the first listed answer for each instance. Like the $X$ array you created above, you’re going to need the complete set of possible answers first. Once you have that, you’re going to map each answer to an integer and then store those integers in a one-dimensional numpy array $y$ , such that the first entry in the array $y$ is the target for the first vector in $X$ .

Write a get_targets function that returns a (sorted) list of answer labels along with a $y$ array with the correct label for each instance:

>>> answer_labels, Y = lexelt.get_targets()
>>> answer_labels
['38201', '38202', '38203', 'U']
>>> Y
array([0, 0, 2, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
       0, 2, 0, 3, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 2, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 2, 0, 0,
       0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2,
       0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 1])

How does test data fit into this?

The two functions you just wrote above (get_features and get_targets) were designed to take new training data and make the $X$ and $y$ arrays needed for training.

Once your classifier is trained, you’ll need to pass in test data into the classifier. And we’ll need to convert the test data into feature vectors, too. Fortunately, you’ve written almost all of the code needed to handle the test data. The one important thing is that when you read in the test data, you have to use the feature index that you already made. For any test feature that you haven’t seen in the training data, just throw it away. Your classifier doesn’t know how to classify instances based on that new feature anyway, so just ignore it.

You will need to modify your get_features and get_answers functions to each take an optional labels argument so that they handles unseen data properly, but it should be a quick fix!

When you’re done, you should be able to do the following:

>>> test_fp = open('/cs/cs159/data/senseval3/test/EnglishLS.test', 'r')
>>> test_data = get_data(test_fp)
>>> testkey_fp = open('/cs/cs159/data/senseval3/test/EnglishLS.test.key', 'r')
>>> get_key(testkey_fp, test_data)
>>> test_lexelt = test_data['activate.v']
>>> _, Xtest = test_lexelt.get_features(labels)
>>> Xtest
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 3., ..., 0., 0., 0.]])
>>> _, Ytest = test_lexelt.get_targets(answer_labels)
>>> Ytest
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  1,  1,  0,  0,  0,  0,  1,
        0,  0,  1,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  1,
        1,  1,  1,  0,  1,  0,  1,  1,  0,  1,  0,  0])

(Note that if a sense wasn’t seen at all in the training data, it’s indicated with -1 in the test y array! )

How supervised decision list classifiers work

Supervised decision list classifiers are trained to recognize the features that are high-quality indicators of a particular sense. Once trained, the classifier can be used to label unknown senses in test data.

Making decisions

At this point, you should have already extracted the features from training data and stored them in $X$ , and extracted the targets and stored them in $y$ as described above.

First, let’s look at a simplified version of the problem where you are trying to decide between only two senses. When a classifier only has to decide between two choices, it is called a binary classifier. In a binary classifier, you either need to decide if an unknown head word should be labeled as $\text{sense}_1$ or $\text{sense}_2$ .

For each feature $f$ , which could be a word in the bag of words or a collocation, you compute the log-odds ratio of the probabilities of each sense given the feature to see how good that feature is in distinguishing between the two sense:

\begin{align} \text{score}(f) &= \log_2 \frac{P(\text{sense}_1|f)}{P(\text{sense}_2|f)}
\end{align}

A high magnitude score (e.g. +10 or -10 vs +.02 or -.02) for feature $f$ indicates that the feature is good at distinguishing the senses. A low magnitude score indicates that the feature is not good at distinguishing the senses. A positive score means that $\text{sense}_1$ is more likely; a negative score means that $\text{sense}_2$ is more likely.

Probabilities can be estimated using maximum likelihood estimation as:

\begin{align} P(\text{sense}_1|f) &= \log_2 \frac{\text{count(}f\text{ is present in sense}_1)}{\text{count(}f\text{ is present in either sense)}}
\end{align}

Since you’re going dividing the two probabilities by one another and their denominators will be the same, the denominators cancel each other out, leaving us with:

\begin{align} \text{score}(f) = \log_2 \frac{P(\text{sense}_1|f)}{P(\text{sense}_2|f)} &= \log_2 \frac{\text{count(}f\text{ is present in sense}_1)}{\text{count(}f\text{ is present in sense}_2)}
\end{align}

Smoothing

In some cases, the numerator or denominator might be zero, so you will use Laplace smoothing (it’s easy!) but instead of $+1$ , you will use $+\alpha$ smoothing, adding $\alpha = 0.1$ to both the numerator and denominator in order to avoid division by zero errors, and to avoid taking the log of 0.

\begin{align} \text{score}(f) &= \log_2 \frac{\text{count(}f\text{ is present in sense}_1) + \alpha}{\text{count(}f\text{ is present in sense}_2) + \alpha}
\end{align}

Classification with more than two senses

In the lexical sample data you have, you can’t directly use the formulation above because you don’t have a binary classification: you have to decide which sense label to assign when there might be many more than 2 choices. Therefore, you don’t use the formulation shown above. Rather, you use the following modification for each sense:

\begin{align} \text{score}(i,f) &= \log_2\frac{P(\text{sense}_i|f)}{P(\text{NOT sense}_i|f)}
\end{align}

Like before, high positive values will be strong indicators of $\text{sense}_1$ . Low positive values will be weak indicators of $\text{sense}_1$ . Negative values will be indicators that it is not $\text{sense}_1$ , but since we don’t classify words as not being a particular sense, your decision list classifier won’t make use of negative scores.

Similar to what you showed above, you can rewrite the ratio of probabilities as follows:

\begin{equation} \text{score}(i,f) = \log_2 \frac{P(\text{sense}_i|f)}{P(\text{NOT sense}_i|f)} = \log_2 \frac{\text{count(}f\text{ is present in sense}_i)}{\text{count(}f\text{ is present all other senses except sense}_i)} \end{equation}

Finally, you will use Laplace smoothing:

\begin{equation} \text{score}(i,f) = \log_2 \frac{\text{count(}f\text{ is present in sense}_i) + \alpha}{\text{count(}f\text{ is present all other senses except sense}_i) + \alpha} \end{equation}

Writing the decision list classifier

You will fill in the implementaiton of a DecisionList class in DecisionList.py that conforms to the scikit-learn API.

To match scikit-learn, the class has two methods: fit(X,y), which will be used to train the classifier, and predict(X) which will be used to predict senses in the test data. The constructor (the __init__ method) for the DecisionList will take three optional parameters:

A default_target that gives the fall-back class the classifier will predict if it doesn’t have any information matching an instance
An alpha, that determines the value to be used for Laplace smoothing. Set alpha to $0.1$ if it is not specified.
A min_score that will be used later to say something about how confident the classifier should be in one of its rules before it trusts it. Set min_score to $0$ if it is not specified.

Training the decision list classifier

Your fit(X,y) method will take two parameters: $X$ , the numpy array of feature vectors for the training data, and $y$ , the numpy array of targets from the training data.

To train a decision list from the features (of a single lexelt!), you will first calculate a score for each (feature $f$ , sense $i$ ) pair. For some senses, a particular feature might be a good indicator; but for another sense of the same word, it will not be a good indicator. Sort these (feature $f$ , sense $i$ ) pairs so that the (feature $f$ , sense $i$ ) pair with the highest positive score first. Throw away any scores that are less than min_score. Save the resulting list of (feature $f$ , sense $i$ ) pairs as a rules attribute of the DecisionList object.

To make this code easier to read (and debug), write a helper member function get_score(self, feature, label, X, y) that takes a feature, a label, an $X$ feature array and a $y$ label array, and returns a score for that feature for that label. You should use the laplace smoothed score from above in this function.

The fit method doesn’t return anything. All of the state of the decision list is stored in the object.

Check that your fit function is working:

>>> from DecisionList import DecisionList
>>> d = DecisionList(alpha=0.1, min_score=5, default_target=5)
>>> d.get_score(0, 0, X, Y)
2.471305718925589
>>> d.get_score(0, 1, X, Y)
-6.149747119504682
>>> d.get_score(42, 0, X, Y)
3.4594316186372978
>>> d.fit(X, Y)
>>> d.rules[:10]
[(7067, 0), (4077, 0), (4047, 0), (1218, 0), (878, 1), (4482, 0), (3326, 0), (5218, 0), (1800, 0), (4980, 0)]

Using your decision list on test data

Now that your classifier is trained, you’re ready to use it to predict labels for new instances.

Your predict(X) method will take an array of feature vectors representing the test data.

For each instance in the test data (a single row in the matrix $X$ ), you walk down the .rules decision list until you find a matching (feature $f$ , sense $i$ ) pair. Since the decision list is sorted from most predictive to least predictive, you will end up classifying your instance based on the presence of the best matching feature. As you walk down the (feature $f$ , sense $i$ ) decision list, if an instance contains feature $f$ , then you will label that instance as sense $i$ . If the instance does not contain $f$ , proceed to the next (sense, feature) pair. Continue this process until you have exhausted all features in your decision list.

If none of the features in the decision list match the classifier needs to label the instance with some default classification. We will pass that value in to the DecisionList constructor through the default_target parameter. Since we have already seen that most frequent sense is a good baseline classifier, we will use the MFS for each word as the default sense when we build the DecisionList classifier.

Your predict(X) method will return a one-dimensional numpy array containing the classification. The resulting numpy array should have the same format as the array $y$ that you passed into the fit method.

To break the work of predict down into smaller (more testable!) pieces, write a helper method predict_one that takes a feature vector for a single instance as input, and returns a predicted label for that instance. Then predict can call predict_one on each row of the X array:

>>> d.predict_one(Xtest[0])
0
>>> d.predict_one(Xtest[101])
1
>>> d.predict(Xtest)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]

Important: In case it was not obvious from all the reminders, you need to create a separate decision list for each lexelt (e.g. one decision list for bank.n, another for activate.v, etc.)

DecisionList Questions

Continue to put your answers to these questions in analysis.md.

Implement the decision list described above, apply it to the test data, and report your accuracy across all instances of all lexelts. (The DecisionList.py main function already has code to help you do this!) Does your decision list outperform the MFS baseline?
Examine the rules in the decision list for activate.v. What problems do you notice in the rules you have generated? Describe and comment on several examples.
Choose two (or more) of the following and report your results. Provide some discussion: were you surprised by the result? Why do you think you got that result?
- Use Tfidf to downweight common features. Assuming your training matrix is called X_train and your test matrix is called X_test:
```
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer.fit(X_train)
>>> X_train = transformer.transform(X_train).toarray()
>>> X_test = transformer.transform(X_test).toarray()
```
Comment on the impact of the transformation on your results. Does it help? Why do you think that is?
- How does your choice of min_score affect your results? Try a range of possible values, and report trends that you find.
- What if you used another classifier instead of your decision list classifier? If you wrote things to the scikit-learn API, this question is a piece of cake! This example uses the GaussianNB classifier but you can try any classifier in scikit-learn instead. To do so, simply import the classifier you want (e.g. from sklearn.naive_bayes import GaussianNB). Now find the line where you created your DecisionList object. It probably looks something like this in your code:
```
my_classifier = DecisionList(alpha=alpha, default_target=mfs)
```
Simply replace that line with one of the classifiers from scikit-learn:
```
my_classifier = GaussianNB()
```
You shouldn’t have to change anything else! Your code will now run using the classifier you chose from scikit-learn. Play around and let us know what you discover!
- Is there a correlation between the score a rule has and the accuracy of the prediction? Do high scoring rules work well? Do low scoring rules work equally well? A nice visualization could show a bar chart of the accuracy of the rules (y-axis) plotted against the score of the rule (x-axis). You’re welcome to come up with whatever you’d like here – we’re interested to see what you find!

Lab 05

Due Wednesday February 26, 2020

Starter Code

Docker

Introduction

Examining the data

Parsing the data

Exploring the training data

Questions

Exploring the test data

Feature extraction

Feature vectors and labels

Feature extraction

Storing features in a `numpy` array

Storing answers in a `numpy` array

How does test data fit into this?

How supervised decision list classifiers work

Making decisions

Smoothing

Classification with more than two senses

Writing the decision list classifier

Training the decision list classifier

Using your decision list on test data

DecisionList Questions

Lab 05

Due Wednesday February 26, 2020

Starter Code

Docker

Introduction

Examining the data

Parsing the data

Exploring the training data

Questions

Exploring the test data

Feature extraction

Feature vectors and labels

Feature extraction

Storing features in a numpy array

Storing answers in a numpy array

How does test data fit into this?

How supervised decision list classifiers work

Making decisions

Smoothing

Classification with more than two senses

Writing the decision list classifier

Training the decision list classifier

Using your decision list on test data

DecisionList Questions

Storing features in a `numpy` array

Storing answers in a `numpy` array