hw3: n-gram models for spelling correction!
about this assignment
This homework is adapted from Dan Jurafsky and Chris Manning's online NLP course. They based
their code on Peter Norvig's spelling corrector. So there's
a long line of happy borrowing here. I think this is a lovely instructive
example of using a language model.
In this assignment you will train a language model and use it to build a spell
checker. Specifically, you will implement part of a noisy-channel model
for spelling correction. We give you the likelihood term, or edit
model; your job is to build the language model, which serves as the prior
distribution in the noisy-channel model. At test time you will be given a
sentence containing exactly one typing error. We then select the correction
which scores highest under the noisy-channel model, using your language
model as the prior. Your language models will be evaluated for accuracy:
the number of test sentences corrected correctly, divided by the total number
of test sentences.
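Concretely, the correction step picks the candidate sentence w maximizing
log P(x | w) + log P(w), where P(x | w) comes from the edit model and P(w)
from your language model. Here is a minimal sketch of that selection loop;
edit_model.candidates() is a made-up accessor for illustration, and the real
loop lives in SpellCorrect.py:

def correct_sentence(sentence, edit_model, language_model):
    # Try every candidate produced by undoing one edit, keeping the one that
    # maximizes log P(x | candidate) + log P(candidate).
    best, best_score = None, float('-inf')
    for candidate, edit_logprob in edit_model.candidates(sentence):  # hypothetical accessor
        score = edit_logprob + language_model.score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best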
Get the starter code and data here.
data
We will be using the writings of secondary-school children, collected by David
Holbrook. The training data is located in the data/ directory. A summary of
the contents:
- holbrook-tagged-train.dat: the corpus to train your language models
- holbrook-tagged-dev.dat: a corpus of spelling errors for development
- count_1edit.txt: a table listing counts of single-character edits, x | w
Note that the data files do not contain <s> and </s> markers, but
the code which reads in the data adds them.
instructions
Implement the following three language models:
- Laplace Unigram Language Model: a unigram model with add-one
smoothing. Treat out-of-vocabulary items as words that were seen zero times
in training.
- Laplace Bigram Language Model: a bigram model with add-one
smoothing.
- Stupid Backoff Language Model: an unsmoothed bigram model that backs
off to an add-one smoothed unigram model (see the sketch after this list).
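A minimal sketch of the Stupid Backoff scoring rule, assuming you keep
bigram_counts, unigram_counts, and total_tokens around from train() (those
attribute names are illustrative, not part of the starter code):

import math

def score_bigram(self, prev, word):
    # Unsmoothed bigram relative frequency when the bigram was seen in training...
    count = self.bigram_counts.get((prev, word), 0)
    if count > 0:
        return math.log(count) - math.log(self.unigram_counts[prev])
    # ...otherwise back off to the add-one smoothed unigram probability.
    # (Brants et al.'s original Stupid Backoff also multiplies in a 0.4
    # discount when backing off; the description above doesn't mention one.)
    return (math.log(self.unigram_counts.get(word, 0) + 1)
            - math.log(self.total_tokens + len(self.unigram_counts)))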
We have provided you with a uniform language model so you can see the basic
layout, located in UniformLanguageModel.{py,java}.
To implement a language model you need to implement two functions:
- train(HolbrookCorpus corpus): takes a corpus and trains your
language model. Compute any counts or other corpus statistics in this
function. See the example UniformLanguageModel for how to access sentences in
the corpus and the words in those sentences.
- score(List words): takes a list of strings as its
argument and returns a numerical score: the log-probability of the sentence
under your language model. Use whatever statistics you computed in
train() here.
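For orientation, here is roughly what the Laplace unigram model might look
like. The corpus accessors (corpus.corpus, sentence.data, datum.word) are
guesses; mirror whatever UniformLanguageModel actually does:

import math
from collections import defaultdict

class LaplaceUnigramLanguageModel:
    def __init__(self, corpus):
        self.counts = defaultdict(int)
        self.total = 0
        self.train(corpus)

    def train(self, corpus):
        # Tally every token in the training corpus.
        for sentence in corpus.corpus:       # accessor names are guesses
            for datum in sentence.data:
                self.counts[datum.word] += 1
                self.total += 1

    def score(self, sentence):
        # Add-one smoothed log-probability of a list of words:
        #   P(w) = (count(w) + 1) / (total + V)
        # Unseen words have count 0 but still get positive probability.
        v = len(self.counts)
        return sum(math.log(self.counts.get(w, 0) + 1) - math.log(self.total + v)
                   for w in sentence)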
Evaluation
Your language models will be evaluated on the development data set and a
held-out test set. The performance of each language model will be evaluated
with respect to the performance of our solution implementation. To help with
your implementation, we give you the expected performance of each language
model on the development set:
- Laplace Unigram Language Model: 0.11
- Laplace Bigram Language Model: 0.13
- Stupid Backoff Language Model: 0.18
Note that the performance we expect from your language model is not that great!
We have provided you with a very simple edit model, not a lot of training data,
and the task is rather difficult. You will receive full credit for
implementations which meet the stated thresholds.
Given Code
The rest of the scaffolding has been provided (reading corpora, computing edit
probabilities, computing the argmax correction). A short summary of the remaining files:
- SpellCorrect.py: Computes the most likely correction given a
language model and an edit model. The main() function here will load all of your
language models and print performance on the development data, which is useful
for debugging. It may be useful to comment out some of the tests in main() while
developing.
- EditModel.py: Reads the count_1edit.txt file and computes the
probability of corrections. The candidate corrections are all strings within
Damerau-Levenshtein edit distance 1. The probability of no correction is fixed
at 0.9 (P(x|x) = 0.9). Note that the EditModel isn't great, but your
language models will be evaluated using this model, so it won't affect your
grade.
- HolbrookCorpus.py: Reads in the corpus and generates test cases
from misspellings.
- Sentence.py: Holds the data for a given sentence, which is a list
of Datums. Contains helper functions for generating the correct sentence and
the sentence with the spelling error.
- Datum.py: Contains two strings, word and error. The word
is the corrected word, and error contains the spelling error. For tokens
which are spelled correctly in the corpus, error = ''.
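For example, a misspelled token and a correct token might look like this (the
constructor signature here is a guess; check Datum.py for the real one):

misspelled = Datum('their', 'thier')  # intended word, error as written
correct = Datum('house', '')          # spelled correctly, so error is empty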
Running the code
Using Python 3, run:
$ cd src
$ python3 SpellCorrect.py
HARD MODE
There's a lot of code here, but the stuff you have to implement is pretty
straightforward. If that's not tuff enuff for you, here are some ideas.
- Custom Language Model: implement a more sophisticated language
model of your choice (note that we haven't discussed these in class!).
Ideas include interpolated Kneser-Ney, Good-Turing, linear interpolation,
trigrams, or any other language model you can come up with. Your goal is for
your custom language model to outperform the three language models we ask you
to implement. (A small interpolation sketch appears after this list.)
- Use more training data: This is the HARD MODE option that I
(alexr) would recommend. There really isn't very much training data for the
language model here. Maybe you can get better accuracy if you use more text?
Questions to answer here include: where can you get more text? What format does it
need to be in? Will any English text work, or will some work better than
others? How much more accuracy do you get, and how much bigger does the
training corpus have to be to get it?
- Use a different programming language: Not enough lambdas? Really
wish this was in MIPS assembly or Standard O'Prolog of Moscow? Go for it. But
you have to rebuild all the framework code too.
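For the custom-model route, even a simple linear interpolation of your bigram
and unigram estimates may do well once the weights are tuned. A sketch, reusing
the count attributes from the earlier examples (the 0.7/0.3 weights are
placeholders; tune them on holbrook-tagged-dev.dat):

import math

def score_bigram(self, prev, word):
    # Mix the unsmoothed bigram estimate with the add-one smoothed unigram.
    prev_count = self.unigram_counts.get(prev, 0)
    p_bigram = (self.bigram_counts.get((prev, word), 0) / prev_count
                if prev_count else 0.0)
    p_unigram = ((self.unigram_counts.get(word, 0) + 1)
                 / (self.total_tokens + len(self.unigram_counts)))
    return math.log(0.7 * p_bigram + 0.3 * p_unigram)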
turning it in
Turn in your three language model files (LaplaceUnigramLanguageModel.py,
LaplaceBigramLanguageModel.py, and StupidBackoffLanguageModel.py), along with a
text file hw3.txt describing your accuracies, any difficulties you
had, and whatever HARD MODE extensions (if any -- totally optional!) you ended
up doing. Use Oncourse! Due Thursday, October 11, at 1:30pm.