LAB 6: Text Generation using RNNs

Parts

Part 1: Shakespeare using word tokenization

Steps

Create a language model data loader (see the Language Model Using DataBlock section of Chapter 10).

Show the batch and ensure it looks proper (same section).
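When you look at the batch, the thing to check is that the target text is the input text shifted over by one token. Here is a minimal pure-Python sketch of that relationship. Note that lm_pairs is a hypothetical helper made up for illustration, not a fastai function, and the real data loader also takes care of batching for you.

```python
def lm_pairs(tokens, seq_len):
    """Split a token stream into (input, target) pairs for language modeling.

    The target is the input shifted left by one token, so the model
    learns to predict token i+1 from the tokens up to i.
    (Hypothetical helper for illustration; not part of fastai.)
    """
    pairs = []
    # Stop one token early so every input position has a target.
    for start in range(0, len(tokens) - 1, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]
        # Drop a ragged final chunk whose target would be shorter than its input.
        if len(x) == len(y):
            pairs.append((x, y))
    return pairs


tokens = "to be or not to be that is the question".split()
pairs = lm_pairs(tokens, seq_len=3)
for x, y in pairs:
    print(x, "->", y)
```

If a row of your shown batch does not have this one-token offset between the two columns, something is off in how the data loader was built.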

Create a non-pretrained LanguageModelLearner (see the Fine-Tuning the Language Model section of Chapter 10).

Train the model for one or two epochs.

Generate some fake Shakespeare: at least two examples of at least 50 words, each starting with the text prompt "What is the temptation" (see the Text Generation section of Chapter 10).

Hints

You may find it faster to pip install my fastai fork. This repo is a fork of fastai as of 3/4/21 that raises the declared supported version of PyTorch to 1.8. There may or may not be errors using fastai with PyTorch 1.8; however, it sure installs a lot quicker! Use:

 !pip install git+git://github.com/nrhodes/fastai.git

rather than:

 !pip install -Uqq fastai

The tokenizers cache tokenized values, so I recommend that you use a separate Colab notebook for each part of this lab (otherwise, if you switch tokenizers for a dataset, all hell will break loose). Either submit a master notebook that contains links to the three notebooks, one for each part, or have the end of notebook 1 link to notebook 2 (and the same for 2 to 3).

Part 2: Shakespeare using character tokenization

Tokenization

Numericalization with fastai

Create a custom tokenizer.

Decide what rules you'll want, if any.

Pass in a vocabulary to the TextBlock.
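To make the tokenize-then-numericalize idea concrete, here is a minimal pure-Python sketch of a character tokenizer and a matching vocabulary. The names CharTokenizer, build_vocab, and numericalize are hypothetical, made up for this example. As far as I can tell, fastai's built-in base tokenizers are callables of this same shape (an iterable of strings in, one token list per string out), but treat that as an assumption and check the fastai docs before wiring yours into the TextBlock.

```python
class CharTokenizer:
    """Hypothetical tokenizer: each character of the text is one token."""

    def __call__(self, items):
        # Take an iterable of strings and yield one token list per string.
        return (list(str(text)) for text in items)


def build_vocab(texts):
    """Collect the sorted set of characters that appear in the texts."""
    return sorted({ch for text in texts for ch in text})


def numericalize(tokens, vocab):
    """Map each token to its index in the vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index[tok] for tok in tokens]


texts = ["To be, or not to be"]
tok = CharTokenizer()
tokens = next(tok(texts))
vocab = build_vocab(texts)
ids = numericalize(tokens, vocab)
decoded = "".join(vocab[i] for i in ids)
print(tokens[:5], ids[:5], decoded == texts[0])
```

Keep in mind that any rule that inserts a multi-character marker into the text will have that marker split into single characters by a tokenizer like this one, which is one reason to think carefully about which rules you keep.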

Part 3: Linux source code using character tokenization

Train on Linux source code rather than Shakespeare.

Write your own greedy_predict routine that'll:

Take a prompt and a length. You'll output the language model's prediction for the most likely output of the given length that starts with the given prompt.
You'll repeatedly call the language model. At each step, you'll get a probability distribution for characters. You'll greedily choose the next character (highest probability).
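The shape of the greedy loop can be sketched without a real model. In the sketch below, toy_model is a hypothetical stand-in that returns a dictionary of next-character probabilities; in your notebook you would get the distribution from your trained learner instead. I am also assuming that length counts the characters generated after the prompt, which is one reasonable reading of the spec.

```python
# Hypothetical stand-in for the trained model: maps the last character
# of the text to a probability distribution over the next character.
TOY = {
    "a": {"b": 0.75, "c": 0.25},
    "b": {"a": 1.0},
    "c": {"a": 1.0},
}


def toy_model(text):
    return TOY[text[-1]]


def greedy_predict(next_char_probs, prompt, length):
    """Extend prompt by `length` characters, always taking the most probable one."""
    text = prompt
    for _ in range(length):
        probs = next_char_probs(text)
        # Greedy choice: the character with the highest probability.
        text += max(probs, key=probs.get)
    return text


result = greedy_predict(toy_model, "a", 5)
print(result)  # ababab
```

Greedy prediction is deterministic: the same prompt always produces the same output.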

Write your own random_predict routine that'll:

Take a prompt and a length. You'll output a sample from the language model of the given length that starts with the given prompt (not necessarily the most likely output).
You'll repeatedly call the language model. At each step, you'll get a probability distribution for characters. You'll sample the next character from the probability distribution.
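The sampling loop differs from the greedy one only in how the next character is picked. Here is a minimal sketch using random.choices from the standard library. As before, toy_model is a hypothetical stand-in for your trained model, and length is assumed to count generated characters. The rng parameter is my own addition so that results can be made repeatable.

```python
import random

# Hypothetical stand-in for the trained model: maps the last character
# of the text to a probability distribution over the next character.
TOY = {
    "a": {"b": 0.75, "c": 0.25},
    "b": {"a": 1.0},
    "c": {"a": 1.0},
}


def toy_model(text):
    return TOY[text[-1]]


def random_predict(next_char_probs, prompt, length, rng=random):
    """Extend prompt by `length` characters sampled from the model's distribution."""
    text = prompt
    for _ in range(length):
        probs = next_char_probs(text)
        chars = list(probs)
        weights = [probs[ch] for ch in chars]
        # Sample one character in proportion to its probability.
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text


sample = random_predict(toy_model, "a", 10, rng=random.Random(0))
print(sample)
```

Passing a seeded random.Random makes a run reproducible, which helps when you are debugging; leave it out to get a different sample each time.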

Write your own beam_predict routine that'll:

Take a prompt, a length, and k. You'll output the language model's prediction for the most likely output of the given length that starts with the given prompt.
Do a beam search (an improvement on greedy prediction) to compute the most likely output.
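One way to structure a beam search is sketched below, under the same assumptions as before: toy_model is a hypothetical stand-in for your trained model, and length counts generated characters. The sketch scores each candidate by the sum of log probabilities and keeps the k best candidates at every step. The toy probabilities are chosen so that the greedy first step leads to a worse overall output.

```python
import math

# Hypothetical stand-in for the trained model: maps the last character
# of the text to a probability distribution over the next character.
TOY = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.55, "b": 0.45},
    "c": {"c": 0.9, "a": 0.1},
}


def toy_model(text):
    return TOY[text[-1]]


def beam_predict(next_char_probs, prompt, length, k):
    """Keep the k best partial outputs (by total log probability) at each step."""
    beams = [(0.0, prompt)]  # (log probability of the generated characters, text)
    for _ in range(length):
        candidates = []
        for score, text in beams:
            for ch, p in next_char_probs(text).items():
                if p > 0:  # log(0) is undefined, so skip impossible characters
                    candidates.append((score + math.log(p), text + ch))
        # Keep only the k highest-scoring candidates.
        beams = sorted(candidates, reverse=True)[:k]
    return beams[0][1]


print(beam_predict(toy_model, "a", 3, k=1))  # abab (same as greedy)
print(beam_predict(toy_model, "a", 3, k=2))  # accc
```

With k=1 this reduces to greedy prediction. A larger k explores more alternatives, but a finite beam is still a heuristic: it is not guaranteed to find the single most likely output.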

Challenges

Challenge 1: We've looked at word tokenization, sub-word tokenization, and character tokenization. Let's take it to the extreme and do bit tokenization.

Train a language model on the Linux kernel corpus using bit tokenization and generate output using beam search.
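Here is a minimal sketch of what bit tokenization could look like, assuming the text is encoded as UTF-8 and each bit becomes a "0" or "1" token (so the vocabulary has just two entries). The choice of encoding is my assumption, and text_to_bits and bits_to_text are hypothetical helper names.

```python
def text_to_bits(text):
    """Encode text as UTF-8 and return one '0'/'1' token per bit."""
    return [bit for byte in text.encode("utf-8") for bit in format(byte, "08b")]


def bits_to_text(bits):
    """Inverse of text_to_bits; incomplete trailing bits are dropped."""
    usable = len(bits) - len(bits) % 8
    data = bytes(int("".join(bits[i:i + 8]), 2) for i in range(0, usable, 8))
    # Generated bits may not form valid UTF-8, so replace bad sequences.
    return data.decode("utf-8", errors="replace")


bits = text_to_bits("int")
print("".join(bits))       # 011010010110111001110100
print(bits_to_text(bits))  # int
```

The decoder is deliberately forgiving, since a model generating one bit at a time can easily stop mid-byte or produce byte sequences that are not valid UTF-8.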

Challenge 2: Rewrite random_predict and beam_predict so that they can take multiple prompts. Put the multiple prompts into a batch and calculate the predictions in parallel. For the beam prediction, do all the predictions in parallel using the batching mechanism.
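The key structural change for batching is that the model gets called once per step for all prompts, rather than once per step per prompt. Here is a rough sketch of that idea for random prediction. The function toy_batch_model is a hypothetical stand-in that takes a list of texts and returns one distribution per text. With a real model you would build a tensor batch instead, and you would have to deal with prompts of different lengths, which this sketch sidesteps.

```python
import random

# Hypothetical next-character probabilities, keyed by the last character.
TOY = {
    "a": {"b": 0.75, "c": 0.25},
    "b": {"a": 1.0},
    "c": {"a": 1.0},
}


def toy_batch_model(texts):
    """Hypothetical batched model: one distribution per text, from one call."""
    return [TOY[text[-1]] for text in texts]


def batched_random_predict(batch_next_probs, prompts, length, rng=random):
    """Extend every prompt by `length` sampled characters, one model call per step."""
    texts = list(prompts)
    for _ in range(length):
        # A single call covers every prompt in the batch.
        all_probs = batch_next_probs(texts)
        for i, probs in enumerate(all_probs):
            chars = list(probs)
            weights = [probs[ch] for ch in chars]
            texts[i] += rng.choices(chars, weights=weights, k=1)[0]
    return texts


results = batched_random_predict(
    toy_batch_model, ["a", "ab", "c"], 4, rng=random.Random(0)
)
print(results)
```

For beam search, the same idea applies, except that the batch at each step holds all k beams for every prompt.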

This completes the lab.

Submit instructions

Make sure that the output of all cells is up-to-date.

Rename your notebook:

Click on the notebook name at the top of the window.
Rename to "CS152Sp21Lab6 FirstName1/FirstName2" (using the correct lab number, along with your two first names). I need this naming so I can easily navigate through the large number of shared docs I will have by the end of the semester.

Choose File/Save

Share your notebook with me:

Click on the Share button at the top-right of your notebook.
Enter rhodes@g.hmc.edu as the email address.
Click the pencil icon and select Can comment.
Click on Done.

Enter the URL of your colab notebook in this submittal form. Do not copy the URL from the address bar (which may contain an authuser parameter and which I will not be able to open). Instead, click Share and Copy link to obtain the correct link. Enter your names in alphabetical order.

At this point, you and I will go back and forth until the lab is approved.

I will provide inline comments as I evaluate the submission (Google should notify you of these comments via email).
You will then need to address those comments. Please do not resolve or delete the comments. I will use them as a record of our conversation. You can respond to them ("Fixed" perhaps).
Once you have addressed all the comments in this round, fill out the submittal form again.
Once I am completely satisfied with your lab, I will add an LGTM (Looks Good to Me) comment.
At that point, set up an office hour appointment with me. I'll meet with you and your partner and we'll have a short discussion about the lab. Both of you should be able to answer questions about any part of the lab.
