hw3: n-gram models for spelling correction!

about this assignment

This homework is adapted from Dan Jurafsky and Chris Manning's online NLP course. They based their code on Peter Norvig's spelling corrector. So there's a long line of happy borrowing here. I think this is a lovely instructive example of using a language model.

In this assignment you will train a language model to build a spell checker. Specifically, you will implement part of a noisy-channel model for spelling correction. We provide the likelihood term, or edit model; your job is to build the language model, which serves as the prior distribution in the noisy-channel model. At test time you will be given a sentence containing exactly one typing error. We then select the correction that scores highest under the noisy-channel model, using your language model as the prior. Your language models will be evaluated for accuracy: the number of valid corrections divided by the number of test sentences.
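In code, the selection step can be sketched like this (the function names here are illustrative, not the starter code's):

```python
def noisy_channel_correct(typo_sentence, candidates, edit_logprob, lm_logprob):
    # Score each candidate correction by log P(typo | candidate) from the
    # edit model plus log P(candidate) from the language model (the prior),
    # and return the argmax.
    return max(candidates,
               key=lambda cand: edit_logprob(typo_sentence, cand) + lm_logprob(cand))
```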

Get the starter code and data here.


We will be using the writings of secondary-school children, collected by David Holbrook. The training data is located in the data/ directory. A summary of the contents:

Note that the data files do not contain <s> and </s> markers, but the code which reads in the data adds them.
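The reader's padding step amounts to something like this (a sketch; the actual helper lives in the provided scaffolding):

```python
def pad(tokens, start="<s>", stop="</s>"):
    # Mimic the data reader: wrap each tokenized sentence in boundary markers.
    return [start] + tokens + [stop]
```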


Implement the following three language models: a Laplace-smoothed unigram model, a Laplace-smoothed bigram model, and a stupid backoff model. So you can see the basic layout, we have provided a uniform language model in UniformLanguageModel.{py,java}.
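For orientation, a uniform language model boils down to something like the sketch below; the provided UniformLanguageModel file is the authoritative layout, and this class name and its methods are only assumptions:

```python
import math

class UniformLM:
    """Sketch of a uniform language model: every word in the
    vocabulary gets the same probability, 1/|V|."""

    def train(self, corpus):
        # Collect the vocabulary from a list of tokenized sentences.
        self.vocab = set()
        for sentence in corpus:
            self.vocab.update(sentence)

    def score(self, sentence):
        # log P(sentence) = sum over tokens of log(1 / |V|)
        return len(sentence) * -math.log(len(self.vocab))
```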

To implement a language model, you need to write two functions:
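As one concrete possibility (the exact interface is defined by the starter code; the train/score names and the 0.4 backoff weight below are assumptions), a stupid-backoff bigram model might look like:

```python
import math
from collections import Counter

class StupidBackoffLanguageModel:
    """Sketch: bigram maximum-likelihood estimates, backing off to an
    add-one-smoothed unigram scaled by 0.4 when a bigram is unseen."""

    def train(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            self.unigrams.update(sentence)
            self.bigrams.update(zip(sentence, sentence[1:]))
        self.total = sum(self.unigrams.values())

    def score(self, sentence):
        V = len(self.unigrams)
        logprob = 0.0
        for prev, word in zip(sentence, sentence[1:]):
            if self.bigrams[(prev, word)] > 0:
                # Seen bigram: plain relative frequency.
                logprob += math.log(self.bigrams[(prev, word)] / self.unigrams[prev])
            else:
                # Unseen bigram: back off to the smoothed unigram, discounted.
                logprob += math.log(0.4 * (self.unigrams[word] + 1)
                                    / (self.total + V))
        return logprob
```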


Your language models will be evaluated on the development data set and a held-out test set. The performance of each language model will be evaluated with respect to the performance of our solution implementation. To help with your implementation, we give you the expected performance of each language model on the development set:

Note that the performance we expect from your language model is not that great! We have provided you with a very simple edit model, not a lot of training data, and the task is rather difficult. You will receive full credit for implementations which meet the stated thresholds.
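To make the smoothing concrete, here is a minimal sketch of an add-one (Laplace) smoothed unigram scorer; the class and method names are illustrative, not the starter code's:

```python
import math
from collections import Counter

class LaplaceUnigramLM:
    """Sketch of a Laplace (add-one) smoothed unigram language model."""

    def train(self, corpus):
        self.counts = Counter(w for sentence in corpus for w in sentence)
        self.total = sum(self.counts.values())

    def score(self, sentence):
        V = len(self.counts)
        # Every word, seen or not, gets its count bumped by one, so unseen
        # words (e.g. candidate corrections) never receive zero probability.
        return sum(math.log((self.counts[w] + 1) / (self.total + V))
                   for w in sentence)
```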

Given Code

The rest of the scaffolding has been provided (reading corpora, computing edit probabilities, computing the argmax correction). A short summary of the remaining files:

Running the code

Using Python 3, run:

$ cd src
$ python3 SpellCorrect.py


There's a lot of code here, but the stuff you have to implement is pretty straightforward. If that's not tuff enuff for you, here are some ideas.

turning it in

Turn in your three language model files (LaplaceUnigramLanguageModel.py, LaplaceBigramLanguageModel.py, and StupidBackoffLanguageModel.py), along with a text file hw3.txt, describing your accuracies, any difficulties you had, and whatever HARD MODE extensions (if any -- totally optional!) you ended up doing. Use Oncourse! Due on Thursday October 11 at 1:30pm.