Starter Code
Get the starter code on GitHub Classroom
Docker
New data has been added to the course docker image for this lab. If you use docker, you will want to pull the updated image.
Introduction
This week, you’ll spend some time working with an HMM Part of Speech tagger. Rather than implementing it from scratch, you’ll make a series of modifications to the tagger. Along the way, you’ll analyze the impact of different configuration settings on the tagger’s performance. You’ll also get a chance to explore how the (automatically-predicted) gender of authors correlates with the performance of your POS tagger.
Understanding the Starter Code
The starter code includes three files:
- HmmTagger.py defines a class HMMTagger that implements an HMM Part of Speech tagger.
- evaluate.py runs a part of speech tagger (like an HMMTagger) on a directory of data and calculates the tagger’s accuracy.
- read_tags.py contains helper functions for loading directories full of texts labeled for part of speech.
The HMMTagger Class
The two main externally-facing functions of an HMMTagger object are train and predict. You can use an HMMTagger like this:
nlp = spacy.load("en_core_web_sm")
train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1)
tagger.train(train_dir)
test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger.predict(test_sentence)
print([token.tag_ for token in test_sentence])
To be consistent with the spacy interface for tagger objects, you will also be able to access predict by calling an HMMTagger with a sequence of tokens:
nlp = spacy.load("en_core_web_sm")
train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1, vocab_size=20000)
tagger.train(train_dir)
test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger(test_sentence)
print([token.tag_ for token in test_sentence])
The HmmTagger.py file can also be run as a script from the command line. When you do that, it has the following interface:
usage: HmmTagger.py [-h] --dir DIR --output FILE [--alpha ALPHA]
Train (and save) hmm models for POS tagging
optional arguments:
-h, --help show this help message and exit
--dir DIR, -d DIR Read training data from DIR
--output FILE, -o FILE
Save output to FILE
--alpha ALPHA, -a ALPHA
Alpha value for add-alpha smoothing
Running HmmTagger.py will train an HMMTagger object on all of the files in dir, then save the (binary) model to output:
python3 HmmTagger.py --dir /cs/cs159/data/pos/wsj/train --output model.pkl
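Since the model is saved as a pickle file (the .pkl extension and the “binary” note above suggest pickle, though check the starter code to confirm), you should be able to load a trained model back into a later Python session with something like:

import pickle

# The HMMTagger class must be importable for unpickling to succeed.
from HmmTagger import HMMTagger

with open("model.pkl", "rb") as f:
    tagger = pickle.load(f)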
Document HMMTagger.py!
I’m not asking you to write an HMM tagger this week, but I do want you to understand how it works. Take the time to carefully document the class. Every member function should have a docstring. The following functions should have block- or line-level comments that show you understand the computations that are being made:
- do_train_sent()
- train()
- normalize_probabilities() (and, by extension, normalize())
- predict()
- backtrace()
When in doubt, err on the side of over-commenting, since we want to make sure that you understand all of the little details that go into the tagger.
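For example, the style of documentation we’re looking for might look something like this. The function below is a hypothetical illustration, not code from the starter files (the real normalize() may work differently):

def normalize(self, counts):
    """Convert a dictionary of counts into a smoothed probability distribution.

    Applies add-alpha smoothing: every count is incremented by self.alpha
    before dividing by the smoothed total.
    """
    # The denominator adds alpha once per outcome, so the smoothed
    # probabilities still sum to 1.
    total = sum(counts.values()) + self.alpha * len(counts)
    return {word: (count + self.alpha) / total
            for word, count in counts.items()}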
Exploring Tag Set Collapsing
This paper from Garimella et al. reports some interesting results about author gender and POS tagger performance. Later in the lab, you’ll get a chance to try to replicate some of their results. But before we can do that, we need to make some modifications to our tagger.
Garimella et al. report results on the Universal Tag Set, which is a much smaller tag set than PTB. In this section of the lab, you will analyze the impact of tag set granularity on tagging performance.
- First, train an HMMTagger on the brown data set (/cs/cs159/data/pos/brown) and test it on the wsj training data set (/cs/cs159/data/pos/wsj/train) using the Penn Treebank tags that are recorded in the data. Record your accuracy in a table in analysis.md.
- Next, repeat the above, but convert all of the tags from Penn Treebank to the Universal Tag Set before training/testing. To do that (a sketch of the key pieces follows this list):
  - Add a new flag universal to the interface for the HmmTagger.py file. You don’t want to break the existing functionality, so have universal default to False by making its action="store_true".
  - Add a new named parameter do_universal to the HMMTagger class’s initializer. do_universal should default to False. Inside the HMMTagger object, save the new parameter as a data member called do_universal.
  - Update the main() function in HmmTagger.py to pass the new command-line argument args.universal to the HMMTagger initializer.
  - Modify the train() function in HmmTagger.py to pass the value of self.do_universal to the parse_dir function.
  - Add a universal flag to evaluate.py just like you did for HmmTagger.py.
  - Modify the main function of evaluate.py to pass the value of args.universal to read_dir, just like you did for the training method of HMMTagger.
- Finally, try one more configuration. This time, you’ll train on the full PTB tag set, but at evaluation time, you’ll map all of the tags to the universal tag set before evaluating them. To do that:
  - Modify the main function of evaluate.py so that after it calls tagger, but before it compares the tags, it changes each token’s tag_ attribute to the right Universal Tag Set tag. There’s a dictionary defined in read_tags.py that should help, so this should only require a couple of lines of code.
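Here is a rough sketch of the two key pieces. The -u short flag, the variable names, and PTB_TO_UNIVERSAL (a stand-in for whatever the dictionary in read_tags.py is actually called) are assumptions; check the starter code for the real names:

# In HmmTagger.py (and later evaluate.py): the new command-line flag.
parser.add_argument("--universal", "-u", action="store_true",
                    help="Convert PTB tags to the Universal Tag Set")

# In evaluate.py's main(), after running the tagger but before comparing
# tags: map each token's predicted tag to its Universal equivalent.
for token in doc:
    token.tag_ = PTB_TO_UNIVERSAL[token.tag_]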
In analysis.md, describe how the performance of your tagger changed for each of the three configurations. What do the results say about tag granularity? Are the results surprising to you? Why or why not?
We might wonder if the accuracy is affected by a mismatch in the style of the data: maybe the Brown corpus is just very different from the WSJ corpus! Repeat the above, but using /cs/cs159/data/pos/wsj/train as your training data and /cs/cs159/data/pos/wsj/test as your testing data.
In analysis.md, comment on whether the results are consistent after the change in dataset. What, if anything, do you conclude from your results?
For the remainder of the lab, you should train using the full PTB Tag Set and test using the Universal Tag Set.
Exploring Vocabulary Size Effects
The starter HMM model adds all of the words in the training set to its vocabulary. For the brown data set, that means the vocabulary contains every word type in the corpus, plus the <<OOV>> token.
In this section of the lab, you will explore some of the time, space, and performance trade-offs that come from varying the size of the vocabulary.
First, modify the HmmTagger.py interface so that it can take a vocabulary size as a command-line argument. To do that, you should:
- Add a --vocabsize, -v argument to the HmmTagger.py interface. The default value should be None, which will correspond to keeping all of the words in the vocabulary (in other words, the default behavior will be the same as the starter code behavior).
- Modify update_vocab so that it only keeps the vocabsize most frequent words in the vocabulary (a sketch follows this list).
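A minimal sketch of the frequency cutoff, assuming the word counts are available as a dictionary or collections.Counter (the actual data structures in update_vocab may differ):

from collections import Counter

def top_n_vocab(word_counts, vocabsize):
    # None means no limit: keep every word (the default behavior).
    if vocabsize is None:
        return set(word_counts)
    # most_common(n) returns the n highest-count (word, count) pairs.
    return {word for word, _ in Counter(word_counts).most_common(vocabsize)}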
Next, build six models. All of them should be trained on the brown corpus using the PTB tag set. They should vary in their vocabulary size: 1000, 2000, 5000, 10000, 20000, or 50000. Note that since 50000 is larger than the number of different word types in the brown corpus, the 50000 model will keep all of the tokens in the vocabulary.
In your analysis.md, add a table that reports the following for each of the six models:
- How long it takes to train the model (in seconds)
- How long it takes to test the model on the /cs/cs159/data/pos/wsj data (in seconds)
- How big the model is (in Kilobytes or Megabytes, as appropriate)
- The model’s accuracy
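One possible layout for the table (fill in your own measurements):

| Vocab size | Train time (s) | Test time (s) | Model size | Accuracy |
|------------|----------------|---------------|------------|----------|
| 1000       |                |               |            |          |
| 2000       |                |               |            |          |
| 5000       |                |               |            |          |
| 10000      |                |               |            |          |
| 20000      |                |               |            |          |
| 50000      |                |               |            |          |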
Hint: The command-line time command can be used to time how long another program takes to run. For example, to see how long it takes to build model.pkl:
$ (prompt) time python3 HmmTagger.py -d /cs/cs159/data/pos/brown -o model.pkl
real 1m13.679s
user 1m12.865s
sys 0m0.487s
This output says that it took 1:13 in real (clock) time, 1:12 in processor time, and 0.487 seconds of system time to run HmmTagger.py on my laptop. You should report the user time for this lab.
Hint: To repeat the same command on several values, you can use a bash for loop: for <var> in <sequence>; do <command>; done. For example, if I have files named file1.txt, file2.txt, file3.txt and file4.txt in my directory, I could run:
$ (prompt) for i in 1 2 3 4; do mv file${i}.txt file${i}.md; done
…to rename the files to file1.md, file2.md, file3.md, and file4.md.
After you build your table, comment on the patterns you observe in analysis.md.
Exploring Document Size Effects
In this section, you will explore whether the length of test documents affects the POS tagger’s performance.
For this part, you should train on the brown data with a vocabulary size of 20000. You should test on the wsj data. Train with the PTB tag set and evaluate with the Universal set.
Update evaluate.py so that it generates a scatter plot of document size (in tokens, on the x-axis) and tagger accuracy (as a percent, on the y-axis). To do that, you should:
- Update evaluate.py’s argparse interface to take a new argument --output, -o that will take the name of a file to write an image to. This argument should default to None so that if it’s not provided, the behavior of the script will stay the same as it was in previous steps.
- Update main() so that if args.output is not None, a scatter plot is generated in a .png file, with document size (in tokens) on the x-axis and accuracy (as a percent) on the y-axis.
Note: To get a point for every file, you won’t be able to use parse_dir directly. Instead, look at parse_dir for a model of some of the code you’ll want to add to your main function.
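A minimal sketch of the plotting step, assuming you’ve collected per-document sizes and accuracies into two parallel lists (the variable names here are placeholders):

import matplotlib.pyplot as plt

# doc_lengths[i] and accuracies[i] describe the i-th test document.
plt.scatter(doc_lengths, accuracies)
plt.xlabel("Document size (tokens)")
plt.ylabel("Accuracy (%)")
plt.savefig(args.output)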
Include your plot in analysis.md. Comment on any patterns you notice, and discuss why you think you might see those patterns.
Exploring Gender Effects on POS Tag Results
In this last part of the lab, you will check whether our HMMTagger shows the same patterns that Garimella et al. reported in their paper.
Continue to use a vocabulary size of 20000, training on the PTB tag set and testing on the Universal tag set. You will try a variety of training data sets. You will test on wsj, since that’s the data set for which we have author gender information (automatically predicted from names).
The /cs/cs159/data/pos/wsj/train and /cs/cs159/data/pos/wsj/test directories are both split into two subdirectories, female/ and male/, corresponding to the article labels from Garimella et al.’s data release.
Set the action of the argparse dir argument to "append". That way, you’ll be able to use the -d (or --dir) argument more than once to list all of the directories you want to compare.
Update main to loop through all of the directories passed through --dir, processing each in turn. Your script should report the accuracy for each category in addition to reporting the overall accuracy.
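A sketch of the two changes; the evaluate_dir helper is hypothetical shorthand for however your main currently scores one directory:

# action="append" collects repeated -d/--dir values into a list.
parser.add_argument("--dir", "-d", action="append", required=True,
                    help="Read test data from DIR (repeatable)")

# In main(): score each directory separately, then report overall accuracy.
total_correct = total_tokens = 0
for directory in args.dir:
    correct, tokens = evaluate_dir(tagger, directory)  # hypothetical helper
    total_correct += correct
    total_tokens += tokens
    print(f"{directory}: {correct / tokens:.2%}")
print(f"Overall: {total_correct / total_tokens:.2%}")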
Experiments
Report accuracies on the following data set configurations:
- Train on wsj/train/male, wsj/train/female, wsj/train, or brown/
- Test on wsj/test/male, wsj/test/female, or wsj/test
Report accuracy for each train/test pair.
In analysis.md, comment on patterns you see. Are they consistent with the findings of Garimella et al.? How do you think the imbalance in data set size in our data affects the results you’re seeing?