Starter Code
Get the starter code on GitHub Classroom
Docker
New data has been added to the course docker image for this lab. If you use docker, you will want to pull the updated image.
Introduction
This week, you’ll spend some time working with an HMM Part of Speech tagger. Rather than implementing it from scratch, you’ll make a series of modifications to the tagger. Along the way, you’ll analyze the impact of different configuration settings on the tagger’s performance. You’ll also get a chance to explore how the (automatically-predicted) gender of authors correlates with the performance of your POS tagger.
Understanding the Starter Code
The starter code includes three files:
- HmmTagger.py defines a class HMMTagger that implements an HMM Part of Speech tagger.
- evaluate.py runs a part of speech tagger (like an HMMTagger) on a directory of data and calculates the tagger’s accuracy.
- read_tags.py contains helper functions for loading directories full of texts labeled for part of speech.
The HMMTagger Class
The two main externally-facing functions of an HMMTagger object are train and predict. You can use an HMMTagger like this:
nlp = spacy.load("en_core_web_sm")
train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1)
tagger.train(train_dir)
test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger.predict(test_sentence)
print([token.tag_ for token in test_sentence])
To be consistent with the spacy interface for tagger objects, you will also be able to access predict by calling an HMMTagger with a sequence of tokens:
nlp = spacy.load("en_core_web_sm")
train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1, vocab_size=20000)
tagger.train(train_dir)
test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger(test_sentence)
print([token.tag_ for token in test_sentence])
The HmmTagger.py file can also be run as a script from the command line. When you do that, it has the following interface:
usage: HmmTagger.py [-h] --dir DIR --output FILE [--alpha ALPHA]
Train (and save) hmm models for POS tagging
optional arguments:
-h, --help show this help message and exit
--dir DIR, -d DIR Read training data from DIR
--output FILE, -o FILE
Save output to FILE
--alpha ALPHA, -a ALPHA
Alpha value for add-alpha smoothing
Running HmmTagger.py will train an HMMTagger object on all of the files in dir, then save the (binary) model to output:
python3 HmmTagger.py --dir /cs/cs159/data/pos/wsj/train --output model.pkl
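Since the model is saved as a pickle file (the .pkl extension and the “binary” note above suggest pickle, though check the starter code to confirm), you should be able to load a trained model back into a later Python session with something like:

import pickle

# The HMMTagger class must be importable for unpickling to succeed.
from HmmTagger import HMMTagger

with open("model.pkl", "rb") as f:
    tagger = pickle.load(f)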
Document HMMTagger.py!
I’m not asking you to write an HMM tagger this week, but I do want you to understand how it works. Take the time to carefully document the class. Every member function should have a docstring. The following functions should have block- or line-level comments that show you understand the computations that are being made:
- do_train_sent()
- train()
- normalize_probabilities() (and, by extension, normalize())
- predict()
- backtrace()
When in doubt, err on the side of over-commenting, since we want to make sure that you understand all of the little details that go into the tagger.
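For example, the style of documentation we’re looking for might look something like this. The function below is a hypothetical illustration, not code from the starter files (the real normalize() may work differently):

def normalize(self, counts):
    """Convert a dictionary of counts into a smoothed probability distribution.

    Applies add-alpha smoothing: every count is incremented by self.alpha
    before dividing by the smoothed total.
    """
    # The denominator adds alpha once per outcome, so the smoothed
    # probabilities still sum to 1.
    total = sum(counts.values()) + self.alpha * len(counts)
    return {word: (count + self.alpha) / total
            for word, count in counts.items()}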
Exploring Tag Set Collapsing
This paper from Garimella et al. reports some interesting results about author gender and POS tagger performance. Later in the lab, you’ll get a chance to try to replicate some of their results. But before we can do that, we need to make some modifications to our tagger.
Garimella et al. report results on the Universal Tag Set, which is a much smaller tag set than PTB. In this section of the lab, you will analyze the impact of tag set granularity on tagging performance.
- First, train an HMMTagger on the brown data set (/cs/cs159/data/pos/brown) and test it on the wsj training data set (/cs/cs159/data/pos/wsj/train) using the Penn Treebank tags that are recorded in the data. Record your accuracy in a table in analysis.md.
- Next, repeat the above, but convert all of the tags from Penn Treebank to the Universal Tag Set before training/testing. To do that (a sketch of the key pieces follows this list):
  - Add a new flag universal to the interface for the HmmTagger.py file. You don’t want to break the existing functionality, so have universal default to False by making its action="store_true".
  - Add a new named parameter do_universal to the HMMTagger class’s initializer. do_universal should default to False. Inside the HMMTagger object, save the new parameter as a data member called do_universal.
  - Update the main() function in HmmTagger.py to pass the new command-line argument args.universal to the HMMTagger initializer.
  - Modify the train() function in HmmTagger.py to pass the value of self.do_universal to the parse_dir function.
  - Add a universal flag to evaluate.py just like you did for HmmTagger.py.
  - Modify the main function of evaluate.py to pass the value of args.universal to read_dir, just like you did for the training method of HMMTagger.
- Finally, try one more configuration. This time, you’ll train on the full PTB tag set, but at evaluation time, you’ll map all of the tags to the universal tag set before evaluating them. To do that:
  - Modify the main function of evaluate.py so that after it calls tagger, but before it compares the tags, it changes each token’s tag_ attribute to the right Universal Tag Set tag. There’s a dictionary defined in read_tags.py that should help, so this should only require a couple of lines of code.
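Here is a rough sketch of the two key pieces. The -u short flag, the variable names, and PTB_TO_UNIVERSAL (a stand-in for whatever the dictionary in read_tags.py is actually called) are assumptions; check the starter code for the real names:

# In HmmTagger.py (and later evaluate.py): the new command-line flag.
parser.add_argument("--universal", "-u", action="store_true",
                    help="Convert PTB tags to the Universal Tag Set")

# In evaluate.py's main(), after running the tagger but before comparing
# tags: map each token's predicted tag to its Universal equivalent.
for token in doc:
    token.tag_ = PTB_TO_UNIVERSAL[token.tag_]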
In analysis.md, describe how the performance of your tagger changed for each of the three configurations. What do the results say about tag granularity? Are the results surprising to you? Why or why not?
We might wonder if the accuracy is affected by a mismatch in the style of the data: maybe the Brown corpus is just very different from the WSJ corpus! Repeat the above, but using /cs/cs159/data/pos/wsj/train as your training data and /cs/cs159/data/pos/wsj/test as your testing data.
In analysis.md, comment on whether the results are consistent after the change in dataset. What, if anything, do you conclude from your results?
For the remainder of the lab, you should train using the full PTB Tag Set and test using the Universal Tag Set.
Exploring Vocabulary Size Effects
The starter HMM model adds all of the words in the training set to its vocabulary. For the brown data set, that means the vocabulary contains every word type in the corpus, plus the <<OOV>> token.
In this section of the lab, you will explore some of the time, space, and performance trade-offs that come from varying the size of the vocabulary.
First, modify the HmmTagger.py interface so that it can take a vocabulary size as a command-line argument. To do that, you should:
- Add a --vocabsize, -v argument to the HmmTagger.py interface. The default value should be None, which will correspond to keeping all of the words in the vocabulary (in other words, the default behavior will be the same as the starter code behavior).
- Modify update_vocab so that it only keeps the vocabsize most frequent words in the vocabulary (a sketch follows this list).
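A minimal sketch of the frequency cutoff, assuming the word counts are available as a dictionary or collections.Counter (the actual data structures in update_vocab may differ):

from collections import Counter

def top_n_vocab(word_counts, vocabsize):
    # None means no limit: keep every word (the default behavior).
    if vocabsize is None:
        return set(word_counts)
    # most_common(n) returns the n highest-count (word, count) pairs.
    return {word for word, _ in Counter(word_counts).most_common(vocabsize)}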
Next, build six models. All of them should be trained on the brown corpus using the PTB tag set. They should vary in their vocabulary size: 1000, 2000, 5000, 10000, 20000, or 50000. Note that since 50000 is larger than the number of different word types in the brown corpus, the 50000 model will keep all of the tokens in the vocabulary.
In your analysis.md, add a table that reports the following for each of the six models:
- How long it takes to train the model (in seconds)
- How long it takes to test the model on the /cs/cs159/data/pos/wsj data (in seconds)
- How big the model is (in Kilobytes or Megabytes, as appropriate)
- The model’s accuracy
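One possible layout for the table (fill in your own measurements):

| Vocab size | Train time (s) | Test time (s) | Model size | Accuracy |
|------------|----------------|---------------|------------|----------|
| 1000       |                |               |            |          |
| 2000       |                |               |            |          |
| 5000       |                |               |            |          |
| 10000      |                |               |            |          |
| 20000      |                |               |            |          |
| 50000      |                |               |            |          |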
Hint: The command-line time command can be used to time how long another program takes to run. For example, to see how long it takes to build model.pkl:
$ (prompt) time python3 HmmTagger.py -d /cs/cs159/data/pos/brown -o model.pkl
real 1m13.679s
user 1m12.865s
sys 0m0.487s
This output says that it took 1:13 in real (clock) time, 1:12 in processor time, and 0.487 seconds of system time to run HmmTagger.py on my laptop. You should report the user time for this lab.
Hint: To repeat the same command on several values, you can use a bash for loop: for <var> in <sequence>; do <command>; done. For example, if I have files named file1.txt, file2.txt, file3.txt and file4.txt in my directory, I could run:
$ (prompt) for i in 1 2 3 4; do mv file${i}.txt file${i}.md; done
…to rename the files to file1.md, file2.md, file3.md, and file4.md.
After you build your table, comment on the patterns you observe in analysis.md.
Exploring Document Size Effects
In this section, you will explore whether the length of test documents affects the POS tagger’s performance.
For this part, you should train on the brown data with a vocabulary size of 20000. You should test on the wsj data. Train with the PTB tag set and evaluate with the Universal set.
Update evaluate.py so that it generates a scatter plot of document size (in tokens, on the x-axis) and tagger accuracy (as a percent, on the y-axis). To do that, you should:
- Update evaluate.py’s argparse interface to take a new argument --output, -o that will take the name of a file to write an image to. This argument should default to None so that if it’s not provided, the behavior of the script will stay the same as it was in previous steps.
- Update main() so that if args.output is not None, a scatter plot is generated in a .png file, with document size (in tokens) on the x-axis and accuracy (as a percent) on the y-axis.
Note: To get a point for every file, you won’t be able to use parse_dir directly. Instead, look at parse_dir for a model of some of the code you’ll want to add to your main function.
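A minimal sketch of the plotting step, assuming you’ve collected per-document sizes and accuracies into two parallel lists (the variable names here are placeholders):

import matplotlib.pyplot as plt

# doc_lengths[i] and accuracies[i] describe the i-th test document.
plt.scatter(doc_lengths, accuracies)
plt.xlabel("Document size (tokens)")
plt.ylabel("Accuracy (%)")
plt.savefig(args.output)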
Include your plot in analysis.md. Comment on any patterns you notice, and discuss why you think you might see those patterns.
Exploring Gender Effects on POS Tag Results
In this last part of the lab, you will check whether our HMMTagger shows the same patterns that Garimella et al. reported in their paper.
Continue to use a vocabulary size of 20000, training on the PTB tag set and testing on the Universal tag set. You will try a variety of training data sets. You will test on wsj, since that’s the data set for which we have author gender information (automatically predicted from names).
The /cs/cs159/data/pos/wsj/train and /cs/cs159/data/pos/wsj/test directories are both split into two subdirectories, female/ and male/, corresponding to the article labels from Garimella et al.’s data release.
Set the action of the argparse dir argument to "append". That way, you’ll be able to use the -d (or --dir) argument more than once to list all of the directories you want to compare.
Update main to loop through all of the directories passed through --dir, processing each in turn. Your script should report the accuracy for each category in addition to reporting the overall accuracy.
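A sketch of the two changes; the evaluate_dir helper is hypothetical shorthand for however your main currently scores one directory:

# action="append" collects repeated -d/--dir values into a list.
parser.add_argument("--dir", "-d", action="append", required=True,
                    help="Read test data from DIR (repeatable)")

# In main(): score each directory separately, then report overall accuracy.
total_correct = total_tokens = 0
for directory in args.dir:
    correct, tokens = evaluate_dir(tagger, directory)  # hypothetical helper
    total_correct += correct
    total_tokens += tokens
    print(f"{directory}: {correct / tokens:.2%}")
print(f"Overall: {total_correct / total_tokens:.2%}")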
Experiments
Report accuracies on the following data set configurations:
- Train on wsj/train/male, wsj/train/female, wsj/train, or brown/
- Test on wsj/test/male, wsj/test/female, or wsj/test
Report accuracy for each train/test pair.
In analysis.md, comment on patterns you see. Are they consistent with the findings of Garimella et al.? How do you think the imbalance in data set size in our data affects the results you’re seeing?