Introduction
This course will introduce you to a broad range of topics in natural language processing, including language modeling, part-of-speech tagging, spelling correction, morphology, syntactic parsing, semantics, and machine translation. If time permits, we may also cover speech recognition, natural language generation, or discourse systems.
Course Goals
By the end of the course you will:
- learn the algorithms and data structures central to natural language processing
- learn how these algorithms and data structures are used to access large text corpora
- learn how to use text corpora as the basis for training probabilistic machine learning algorithms
- build components of large NLP systems such as language models, part-of-speech taggers, morphological analyzers, parsers, and text classifiers
- design experiments, analyze the results, and report on them
- learn to use LaTeX to write papers that can be submitted to a conference
- read and analyze primary literature in NLP, implementing some of the algorithms described in order to better understand the work and suggest improvements to it
Class Information
Professor: Julie Medero
Office: Olin 1269
Office Hours: Mondays 1:15-3:00 and Fridays 1-2 in Olin 1269
Grutors: Celela Chen and Celine Park
Grutoring hours: Wednesdays 3-5 (LAC computer lab)
Class time: Tuesday and Thursday 9:35-10:50am
Classroom: SHAN 1480
GitHub organization: hmc-cs159-spring2020
Prerequisites:
- CS70 or CS52 + CS62
- CS81
- A course in probability
Textbooks
You should not purchase any textbooks this semester.
- Jurafsky and Martin, Speech and Language Processing, 3rd edition, 2019 – draft edition online
Tentative schedule
Readings
Reading will be assigned each week, and you are expected to complete it before class each Tuesday. In-class time will involve individual and group problem sessions, discussions, and other activities that depend on you having done the reading. You will be asked to fill out a worksheet each Tuesday; if you’d like to print your own copy and make notes as you read, you can bring it to class with you.
Assignments
Programming assignments will typically be available on Thursdays before class and due on the following Wednesday evening at 10:00pm. In the second half of the semester, your programming work will focus on a large project. There will be several intermediate deadlines.
| WEEK | DAY | ANNOUNCEMENTS | TOPIC & READING | LABS & PROJECTS |
|---|---|---|---|---|
| 1 | Jan 21 | | Introduction, Regular Expressions, Encodings (Handouts) | |
| | Jan 23 | | | |
| 2 | Jan 28 | | Tokenization, Normalization, Segmentation (Handouts) | |
| | Jan 30 | | | |
| 3 | Feb 04 | | N-Grams and Smoothing (Handouts) | |
| | Feb 06 | | | |
| 4 | Feb 11 | | Vector Semantics (Handouts) | |
| | Feb 13 | | | |
| 5 | Feb 18 | | Word Sense Disambiguation (Handouts). Options: Pick a paper to present from last year's Wordnet Conference | |
| | Feb 20 | | | |
| 6 | Feb 25 | | Part of Speech Tagging (Handouts) | |
| | Feb 27 | | | |
| 7 | Mar 03 | | Text Classification | |
| | Mar 05 | | | |
| 8 | Mar 10 | | Midterm Review (Handouts; no new reading) | |
| | Mar 12 | First exam (List of Topics) | | |
| | Mar 17 | Spring Break | | |
| | Mar 19 | | | |
| | Mar 24 | | | |
| | Mar 26 | | | |
| 9 | Mar 31 | | Plan for the rest of the semester (reading group, optional project) | |
| | Apr 02 | | | |
| 10 | Apr 07 | | Information Extraction | |
| | Apr 09 | | | |
| 11 | Apr 14 | | Automatic Speech Recognition | |
| | Apr 16 | | | |
| 12 | Apr 21 | | Summarization | |
| | Apr 23 | | | |
| 13 | Apr 28 | | Machine Translation | |
| | Apr 30 | | | |
Grading
Your final grade will be calculated as follows:
- 55% Labs and projects
- 20% First in-class exam
- 20% Second in-class exam
- 5% Class participation and attendance
Final grades are assigned according to the grading scale below:
| GRADE | MINIMUM SCORE |
|---|---|
| A | 93 |
| A- | 90 |
| B+ | 87 |
| B | 83 |
| B- | 80 |
| C+ | 77 |
| C | 73 |
| C- | 70 |
| D+ | 67 |
| D | 65 |
Final grades are truncated, not rounded. For example, an 82.8 will receive a B-.
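The weighting and truncation rules above can be sketched in Python. This is only an illustration, not an official grade calculator: it assumes every component is scored on a 0-100 scale, the names `final_score` and `letter_grade` are made up for this example, and the "F" returned below 65 is an assumption, since the scale above does not list a grade under D.

```python
import math

# Weights (in percent) copied from the grading breakdown above.
WEIGHTS = {"labs": 55, "exam1": 20, "exam2": 20, "participation": 5}

# Minimum truncated score for each letter grade, from the scale above.
GRADE_SCALE = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (65, "D"),
]


def final_score(component_scores):
    """Return the weighted course score on a 0-100 scale.

    component_scores maps each key of WEIGHTS to a score in [0, 100]
    (an assumption about how the components are scored).
    """
    weighted_sum = sum(
        WEIGHTS[name] * component_scores[name] for name in WEIGHTS
    )
    return weighted_sum / 100


def letter_grade(score):
    """Truncate the score (no rounding) and look up the letter grade."""
    truncated = math.floor(score)
    for minimum, letter in GRADE_SCALE:
        if truncated >= minimum:
            return letter
    return "F"  # Assumption: the syllabus does not list a grade below D.
```

For instance, component scores of 90 (labs), 80 and 85 (exams), and 100 (participation) give a weighted score of 87.5, which truncates to 87 and maps to a B+. An 82.8 truncates to 82 and maps to a B-, matching the example above.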
Programming language
Assignments presuppose knowledge of Python 3. You will almost certainly end up learning some Perl and bash scripting, but you are not expected to know either in advance.
Please make sure that each program you turn in has:
- A comment at the top of the program that includes
  - Program authors
  - A brief description of what the program does
- Concise comments that summarize major sections of your code
- Meaningful variable and function names
- Well-organized code
- White space to improve legibility
- Lines shorter than 80 characters (whenever possible)
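As a rough illustration of the checklist above, a small program might be laid out as follows. The author names are placeholders, and the token-counting task and the function name `count_tokens` are invented for this sketch; it is not taken from any assignment.

```python
# Authors: Student One and Student Two (placeholder names)
# Description: Counts how often each whitespace-separated token occurs
#              in a piece of text and prints the counts, most frequent
#              first.

from collections import Counter


def count_tokens(text):
    """Return a Counter mapping each lowercased token to its frequency."""
    lowercase_tokens = text.lower().split()
    return Counter(lowercase_tokens)


def main():
    # Build the counts for a small sample sentence.
    sample_sentence = "The cat saw the other cat"
    token_counts = count_tokens(sample_sentence)

    # Report tokens in order of decreasing frequency.
    for token, count in token_counts.most_common():
        print(f"{token}\t{count}")


if __name__ == "__main__":
    main()
```

The header comment names the authors and describes the program, each section of `main` has a one-line summary, and every line stays under 80 characters.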
I expect that you will use Python 3 for all of your lab assignments, but if you would like to use something different, you are welcome to come talk to me about your plan.
Accessing the CS labs after hours
All students enrolled in CS courses are eligible for 24-hour access to Olin and the CS labs (Beckman 102 and 104). The code for the lab doors will be shared on the course Piazza site. If you are an off-campus student who does not have access to the building, you should visit the HMC Facilities & Maintenance office in the basement of the Platt Campus Center to get access added to your card.
Policies
Assignment Extension Policy
You have three late days that can be used on weekly labs or a final project component at any point during the semester without penalty. You can use all three late days on one assignment, or split them across multiple assignments. To use a late day, you must email me after you have completed the assignment and pushed to your repository.
I encourage you to work together on your assignments for this class. Weekly labs and the final project can be done in groups of up to three. If you work in a group, only one of you needs to use your late day(s).
Illness
If you get sick during the term, please notify me immediately, even if you think that being sick will not affect your ability to complete your assignments. You should also notify me any time that you’re sick enough to miss any classes or find that your performance is below par for any reason.
Classroom Environment
As your instructor, I am committed to creating a classroom environment that welcomes all students, regardless of race, gender, social class, religious beliefs, etc. We all have implicit biases, and I will try to continually examine my judgments, words and actions to keep my biases in check and treat everyone fairly. I expect that you will do the same, and that you will let me know if there is anything I can do to make sure everyone is encouraged to succeed in this class.
Reminder: The Honor Code
All students—even those from other colleges—are expected to understand and comply with Harvey Mudd College’s Honor Code. If you haven’t already done so, you must read, sign, and abide by the computer-science department’s interpretation of the Honor Code to participate in this course. Specifically:
- You must not exchange literal copies of material, whether that material consists of code, program output, or English-language text (e.g., documentation). You also may not copy material from published or online sources, with or without cosmetic changes (such as altering variable names), without explicit permission. If you do have permission to use externally written material, you must attribute it properly and clearly indicate which material is yours and which material is not yours. Publishing your own homework or exams from CS70 on the web (e.g., in a public GitHub repository) violates this policy.
- You should not do anything that a reasonable student peer would describe as “subverting the clear intent of the assignment,” unless you have asked for and received permission to do so. Finding open-sourced code that you can use to solve an assigned problem, for example, would typically be subverting the intent of the assignment because your shortcut means that you do not learn what the assignment aims to teach.
- If you use any sources to assist you, you must document them. For example, all assignments have a hints-and-tips page on the course website. If a “tip” from that page ends up incorporated into your graded work, you must credit the source. (A clarification about assignment requirements or a debugging tip, however, need not be credited.)
- If you aren’t sure whether something you’ve done or plan to do is allowed, you should explicitly document what you did and, if at all possible, consult with the course staff, ideally before you take the questionable action. Similarly, document any extensive or particularly important help you obtain, even if that help seems legitimate. If you’ve been helped so much that we can’t consider the work truly your own, you might not be able to get full credit for it, but proper attribution will avoid an Honor Code violation.
- Academic integrity also involves being careful enough to avoid unintentionally breaking the rules. Thus, you must read instructions in assignments and exams carefully so that you are aware of any limitations they place on you, such as time restrictions or restrictions on information sources you may consult. Similarly, if you see something that plausibly seems like it ought to be off-limits to you, such as a GitHub directory belonging to another student or files from a previous semester, you should immediately contact us to let us know that something doesn’t seem right, rather than looking further at something that perhaps should have been off-limits.
These principles apply to all methods and media of discussion or exchange (voice, writing, email, etc.).
Academic Accommodations
If you anticipate or experience academic barriers based on a disability (including mental health, chronic or temporary medical conditions), please let me know immediately so that we can privately discuss options. Any student with a documented disability who requires reasonable accommodations should contact their home college’s disability officer:
- CMC: Kari Rood - Kari.Rood@cmc.edu or Disabilityservices@cmc.edu
- CGU: Quamina Carter - Quamina.Carter@cgu.edu
- HMC: Brandon Ice - bice@hmc.edu
- KGI: Andrea Mozqueda - Andrea_Mozqueda@kgi.edu
- Pitzer: Gabriella Tempestoso - Gabriella_Tempestoso@pitzer.edu
- Pomona: Jan Collins Eaglin - jan.collins-eaglin@pomona.edu
- Scripps: Bianca Vinci - bvinci@scrippscollege.edu