Tokenization, Normalization, and Segmentation
utterance       disfluency          filled pause
lemma           wordform            word type
word token      dialects            code switching
tokenization    word segmentation   case folding
lemmatization   stemming            sentence segmentation
Evaluation
macroaveraging      multinomial
microaveraging      extrinsic evaluation
F1 measure          intrinsic evaluation
precision           training set
recall              development set
F-measure           test set
gold labels         perplexity
contingency table   null hypothesis
multi-label         bootstrap test
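
As a small illustration of how precision, recall, F1, macroaveraging, and microaveraging relate, here is a minimal sketch. The function names and the toy gold and predicted labels are made up for this example and do not come from any particular library.

```python
from collections import Counter

def per_class_counts(gold, pred):
    """Return {label: (tp, fp, fn)} from parallel lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[g] += 1  # the gold class gains a false negative
    return {c: (tp[c], fp[c], fn[c]) for c in set(gold) | set(pred)}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(counts):
    """Macroaveraging: compute F1 per class, then average the F1 scores."""
    return sum(prf(*c)[2] for c in counts.values()) / len(counts)

def micro_f1(counts):
    """Microaveraging: pool the counts over all classes, then compute one F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)[2]

# Toy data, invented for illustration only.
gold = ["pos", "pos", "neg", "neg", "neu", "neg"]
pred = ["pos", "neg", "neg", "neg", "pos", "neg"]
counts = per_class_counts(gold, pred)
print(macro_f1(counts), micro_f1(counts))
```

In this toy data the rare class "neu" is never predicted, so it pulls the macroaverage down, while the microaverage is dominated by the frequent class "neg".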
N-Grams and Smoothing
language model                  sparsity
n-gram                          zeros
bigram                          closed/open vocabulary
trigram                         OOV word
chain rule                      Laplace smoothing
Markov assumption               backoff
maximum likelihood estimation   discounting
normalize                       interpolation
relative frequency
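
Several of these terms can be seen together in a minimal bigram sketch: the maximum likelihood estimate is a relative frequency, unseen bigrams come out as zeros, and Laplace (add-one) smoothing removes those zeros. The function names and the three-sentence toy corpus are assumptions made for this example.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams and their histories, padding each sentence with <s> and </s>."""
    history, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        history.update(tokens[:-1])              # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))
    # Words that can be predicted: all word types plus the end marker.
    vocab = {w for sent in sentences for w in sent.split()} | {"</s>"}
    return history, bigrams, vocab

def p_mle(w_prev, w, history, bigrams):
    """Relative frequency estimate: C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / history[w_prev] if history[w_prev] else 0.0

def p_laplace(w_prev, w, history, bigrams, vocab):
    """Add-one smoothing: (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (history[w_prev] + len(vocab))

# Toy corpus, for illustration only.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
history, bigrams, vocab = train_bigrams(corpus)

print(p_mle("<s>", "I", history, bigrams))               # seen bigram
print(p_mle("<s>", "ham", history, bigrams))             # unseen bigram: a zero
print(p_laplace("<s>", "ham", history, bigrams, vocab))  # small but nonzero
```

The smoothed estimates for a given history still sum to one over the vocabulary, because the added counts in the numerators total exactly the V added to the denominator.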
Vector Semantics
vector semantics       tf-idf algorithm
embeddings             term frequency
term-document matrix   document frequency
vector space model     idf
row vector             co-occurrence
word-word matrix       debiasing
cosine similarity
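
The sketch below ties together term frequency, document frequency, idf, and cosine similarity on a tiny term-document matrix. It assumes one common weighting variant, tf = log10(count + 1) and idf = log10(N / df); other variants exist, and the function names and toy documents are invented for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors), one tf-idf weighted vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(w for doc in tokenized for w in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            math.log10(counts[w] + 1) * math.log10(n_docs / df[w])
            for w in vocab
        ])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents, for illustration only.
docs = ["sweet cherry pie", "sweet strawberry pie", "digital computer information"]
vocab, vectors = tfidf_vectors(docs)

print(cosine(vectors[0], vectors[1]))  # share "sweet" and "pie": similarity above zero
print(cosine(vectors[0], vectors[2]))  # no shared terms: similarity is zero
```

Under this weighting a term that occurs in every document gets an idf of zero, so it contributes nothing to the similarity.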
Word Sense Disambiguation
word sense   lexical sample task
zeugma       all-words task
WordNet      semantic concordance
gloss        most frequent sense
synset       one sense per discourse
supersense   word sense disambiguation
Part of Speech Tagging
part of speech      degree               wh-pronoun
closed class        manner               auxiliary verb
open class          temporal adverb      copula
function word       preposition          modal verb
noun                particle             interjection
proper noun         phrasal verb         POS tagging
common noun         determiner           ambiguous / disambiguation
count noun          article              accuracy
mass noun           conjunction          sequence model
verb                complementizer       Markov chain
adjective           pronoun              Markov assumption
adverb              personal pronoun     Hidden Markov Model
locative            possessive pronoun   decoding
Viterbi algorithm   beam search          unknown words
Text Classification
text categorization         naive Bayes assumption
sentiment analysis          linear classifier
language id                 unknown words
authorship attribution      stop words
generative classifier       binary naive Bayes
discriminative classifier   sentiment lexicon
multinomial naive Bayes     hyperpartisan news
bag-of-words                clickbait
prior probability           fake news
likelihood
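
To make the roles of the prior probability, the likelihood, the bag-of-words representation, and unknown words concrete, here is a minimal multinomial naive Bayes sketch with add-one smoothing. The function names and the toy review data are assumptions for this example; it is a sketch, not a reference implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns log priors, log likelihoods, and the vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.split())  # bag of words: order is ignored
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # Add-one smoothing so a word unseen in a class does not zero out the product.
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab

def classify(text, log_prior, log_likelihood, vocab):
    """Pick the class maximizing log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        # Unknown words (not in the training vocabulary) are simply skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in vocab
        )
    return max(scores, key=scores.get)

# Toy training data, for illustration only.
train = [
    ("just plain boring", "neg"),
    ("entirely predictable and lacks energy", "neg"),
    ("no surprises and very few laughs", "neg"),
    ("very powerful", "pos"),
    ("the most fun film of the summer", "pos"),
]
log_prior, log_likelihood, vocab = train_nb(train)
print(classify("predictable with no fun", log_prior, log_likelihood, vocab))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the word "with" is dropped at test time because it never appeared in training.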