Lab 4: Yelp review regression

Introduction

This lab predicts star ratings for Yelp reviews. Although this could be framed as a regression problem, we make it easier by structuring it as a classification problem: treat reviews with 4 or 5 stars as positive, and reviews with 1 or 2 stars as negative.
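
The star-to-label rule above can be sketched as a small function. This is only an illustration: the name star_to_label is made up here, and returning None for 3-star reviews is an assumption, since the rule above does not say what to do with them.

```python
from typing import Optional

def star_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a sentiment label.

    3-star reviews return None. That is an assumption of this sketch:
    the lab text does not say how to treat them.
    """
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None

ratings = [5, 1, 3, 4, 2]
labels = [star_to_label(s) for s in ratings]
print(labels)  # ['positive', 'negative', None, 'positive', 'negative']
```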

Data

The data, with reviews rated from 1 to 5 stars, is available from fastai at URLs.YELP_REVIEWS. Alternatively, a slightly modified version of the data (with positive-negative labels already created from the star ratings) is available from fastai at URLs.YELP_REVIEWS_POLARITY.

The training and validation datasets are huge. I suggest you use at most 10% of the data for training/validation because otherwise training will take forever.
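
One way to cut the data down is to take a seeded random sample, so that you get the same subset every time you rerun the notebook. The sketch below uses plain Python on a list of hypothetical (label, text) rows; the subsample helper and the stand-in rows are not part of fastai. If your data is in a pandas DataFrame, df.sample(frac=0.1) does the same job.

```python
import random

def subsample(rows, fraction=0.1, seed=42):
    """Return a random subset containing about `fraction` of `rows`."""
    rng = random.Random(seed)              # fixed seed so the subset is reproducible
    k = max(1, int(len(rows) * fraction))  # keep at least one row
    return rng.sample(rows, k)             # sample without replacement

# Hypothetical (label, text) rows standing in for the real dataset.
rows = [(i % 2, f"review {i}") for i in range(1000)]
small = subsample(rows, fraction=0.1)
print(len(small))  # 100
```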

The three ways to solve this problem

Try all of the following ways of solving this problem:

For parts 2 and 3, you'll need a dictionary of unique words (which can be obtained from the dataloader). From then on, you can use the index into this dictionary rather than the string itself.
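
To make the idea of "index into the dictionary" concrete, here is a toy sketch. The fastai dataloader already does this for you (and its tokenizer is more elaborate, adding special tokens); the build_vocab and numericalize helpers below are hypothetical and use simple whitespace splitting only for illustration.

```python
def build_vocab(reviews):
    """Assign each distinct word an integer index, in order of first appearance."""
    vocab = {}
    for review in reviews:
        for word in review.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def numericalize(review, vocab):
    """Replace each known word with its index in `vocab`; unknown words are skipped."""
    return [vocab[w] for w in review.lower().split() if w in vocab]

reviews = ["great food great service", "terrible service"]
vocab = build_vocab(reviews)
print(vocab)                                 # {'great': 0, 'food': 1, 'service': 2, 'terrible': 3}
print(numericalize("great service", vocab))  # [0, 2]
```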

  1. Use the pretrained language model from fastai. Although you'll want to fine-tune the classifier, there's no need to fine-tune the language model itself.
  2. Bag of words: create a vector that represents the set of words that are present in the review. Use that to predict the sentiment (positive or negative).

    A good way to approach this is to use a standard linear neural network with 1 or 2 hidden layers. How many outputs from the last layer?

    You'll need a new layer at the beginning of the neural network that converts from the format the dataloader provides, a tensor of word numbers, one for each word in the review:

         [3, 5, 1, 9, 57, ..., 12]
         
    to a tensor of length equal to the number of words in the dictionary (dls.train.vocab[0]), with a 1 at each location that a word is present:
         [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, ...]
         
    (note that in this example entries at index 3, 5, 1, 9, ..., and 12 are set to 1; all others are 0).
  3. Embedding: create an embedding matrix (with an embedding dimension of 10, perhaps?) that is applied to each word. Limit the review to some reasonable size (100 words, perhaps, chosen from the beginning and end of the review). Run those words through the embedding and then through the remainder of a neural network.

    Make sure you are using the same embedding matrix for each word.
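
For approach 2, the conversion layer described above might look something like the following PyTorch sketch. The class name MultiHot is made up, and the sketch assumes the input is a (batch, seq_len) tensor of long indices. Note that if the batch is padded, the padding index gets marked as present too, which you may or may not want. The rest of the network is left to you.

```python
import torch
import torch.nn as nn

class MultiHot(nn.Module):
    """Convert a batch of word-index sequences into multi-hot vectors."""

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, x):
        # x: (batch, seq_len) tensor of word indices, dtype torch.long
        out = torch.zeros(x.shape[0], self.vocab_size, device=x.device)
        # Write a 1 at every index that appears in each row.
        # A word that repeats just writes the same 1 again.
        out.scatter_(1, x, 1.0)
        return out

layer = MultiHot(vocab_size=13)   # 13 is a toy size; use your dictionary size
x = torch.tensor([[3, 5, 1, 9, 12]])
print(layer(x)[0].tolist())
# [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```

Because the layer has no learnable parameters, it can sit at the front of an nn.Sequential without affecting what the optimizer trains.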

Compare and contrast the results of all three approaches.
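
As a starting point for approach 3, here is a rough PyTorch sketch of the front end: keep words from the beginning and end of the review, pad short reviews, and run every position through one shared nn.Embedding. The names head_and_tail and EmbedReview are made up, and pad_idx=0 is an assumption (check which index your dataloader uses for padding). Flattening the result is only one option; averaging over positions is another.

```python
import torch
import torch.nn as nn

def head_and_tail(x, max_len=100):
    """Keep the first and last max_len // 2 tokens of each row of x."""
    if x.shape[1] <= max_len:
        return x
    half = max_len // 2
    return torch.cat([x[:, :half], x[:, -(max_len - half):]], dim=1)

class EmbedReview(nn.Module):
    """Truncate or pad to max_len, embed each word, and flatten."""

    def __init__(self, vocab_size, emb_dim=10, max_len=100, pad_idx=0):
        super().__init__()
        self.max_len = max_len
        self.pad_idx = pad_idx  # assumption: 0 is the padding index
        # A single nn.Embedding, so every word position shares one matrix.
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)

    def forward(self, x):
        x = head_and_tail(x, self.max_len)
        if x.shape[1] < self.max_len:
            # Right-pad short reviews so every row has max_len tokens.
            pad = x.new_full((x.shape[0], self.max_len - x.shape[1]), self.pad_idx)
            x = torch.cat([x, pad], dim=1)
        e = self.emb(x)       # (batch, max_len, emb_dim)
        return e.flatten(1)   # (batch, max_len * emb_dim)

print(head_and_tail(torch.arange(10).unsqueeze(0), max_len=4).tolist())  # [[0, 1, 8, 9]]

layer = EmbedReview(vocab_size=500)
long_batch = torch.randint(0, 500, (4, 250))
short_batch = torch.randint(0, 500, (4, 30))
print(layer(long_batch).shape)   # torch.Size([4, 1000])
print(layer(short_batch).shape)  # torch.Size([4, 1000])
```

The output of this layer can feed the same kind of hidden layers you used for the bag-of-words model.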

Suggestions

Challenge 1: Use a pretrained embedding like GloVe or word2vec for step 3.

Challenge 2: Use embeddings in conjunction with an LSTM for step 3.

Challenge 3: Fine-tune the underlying language model for step 1 before fine-tuning the classifier based on it.

Challenge 4: Train a regression model rather than a classification model for the three steps.

This completes the lab.

Submit instructions

  1. Make sure that the output of all cells is up-to-date.
  2. Rename your notebook:
    1. Click on the notebook name at the top of the window.
    2. Rename to "CS152Sp21Lab4 FirstName1/FirstName2" (using the correct lab number, along with your two first names). I need this naming so I can easily navigate through the large number of shared docs I will have by the end of the semester.
  3. Choose File/Save.
  4. Share your notebook with me:
    1. Click on the Share button at the top-right of your notebook.
    2. Enter rhodes@g.hmc.edu as the email address.
    3. Click the pencil icon and select Can comment.
    4. Click on Done.
  5. Enter the URL of your colab notebook in this submittal form. Do not copy the URL from the address bar (which may contain an authuser parameter and which I will not be able to open). Instead, click Share and Copy link to obtain the correct link. Enter your names in alphabetical order.
  6. At this point, you and I will go back and forth until the lab is approved.
    1. I will provide inline comments as I evaluate the submission (Google should notify you of these comments via email).
    2. You will then need to address those comments. Please do not resolve or delete the comments. I will use them as a record of our conversation. You can respond to them ("Fixed" perhaps).
    3. Once you have addressed all the comments in this round, fill out the submittal form again.
    4. Once I am completely satisfied with your lab, I will add an LGTM (Looks Good to Me) comment.
    5. At that point, set up an office hour appointment with me. I'll meet with you and your partner and we'll have a short discussion about the lab. Both of you should be able to answer questions about any part of the lab.
