In previous assignments, we gave you a lot of code to work with. This one is all you. Show us what you've got. You don't have to use Python; you can use whatever language you want (although if it's obscure or not multi-platform, let us know beforehand).
This one is kind of involved, but you can do it. Please come see us during office hours to talk about it sometime during the next two weeks, or schedule another appointment if you honestly can't make it during any office hours. Email us; we're flexible. Coming to talk to either Alex or Can is part of the assignment.
This part is due for the homework ping on Thursday 8 November!
You get a small treebank (adapted from SMULTRON), and your job is to extract a PCFG from it. What are all the production rules? What are their weights? You can set the weights for the rules using the formulas on page 467 of Jurafsky and Martin; just count up the number of times a given production is used, out of all the times the production's left-hand side occurs. Have some way to print out the rules with their weights. Make sure all the different ways to make an NP (for example) have weights that sum to 1.0!
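The counting could look something like the sketch below, assuming the trees have been read into nested Python tuples like ('S', ('NP', ('NN', 'cats')), ...). The representation and function names here are just illustrative, not required:

```python
from collections import defaultdict

def count_productions(tree, counts):
    """Recursively count productions in a tree represented as nested
    tuples, e.g. ('S', ('NP', ('NN', 'cats')), ...). Leaves (words)
    are bare strings."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_productions(c, counts)

def pcfg_from_treebank(trees):
    """Maximum-likelihood weights: count(lhs -> rhs) / count(lhs)."""
    counts = defaultdict(int)
    for t in trees:
        count_productions(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs]
            for (lhs, rhs), n in counts.items()}
```

On the example tree further down (cats eat cat food), this gives NP -> NN and NP -> NN NN a weight of 0.5 each, which sum to 1.0 as required.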
Write a program that takes the grammar you extracted from the treebank and converts it to Chomsky Normal Form. Use the algorithm on page 437 of Jurafsky and Martin. NB: you're going to have to figure out what to do with the weights. You can't just drop the weights! You don't have to read in a grammar from a text file or some other serialization -- you can keep it in memory from part 1. Be able to print out the new CNF grammar too.
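For the binarization step, one common way to keep the weights intact is to give every rule headed by a newly introduced nonterminal a weight of 1.0, so the product of weights over a parse is unchanged. Here's a hedged sketch of just that step (the rule format and the `|`-style naming of new symbols are assumptions; unit-production removal, where you multiply weights along the chain, is left for you):

```python
def binarize(rules):
    """Binarize right-hand sides longer than 2 symbols.
    `rules` maps (lhs, rhs_tuple) -> weight. A rule A -> B C D (w)
    becomes A -> B A|C_D (w) and A|C_D -> C D (1.0); the introduced
    rule gets weight 1.0 so parse probabilities are preserved."""
    out = {}
    for (lhs, rhs), w in rules.items():
        cur_lhs, cur_w = lhs, w
        while len(rhs) > 2:
            new_sym = cur_lhs + '|' + '_'.join(rhs[1:])
            out[(cur_lhs, (rhs[0], new_sym))] = cur_w
            cur_lhs, rhs, cur_w = new_sym, rhs[1:], 1.0
        out[(cur_lhs, rhs)] = cur_w
    return out
```

Because the introduced rules all have weight 1.0, the weight of the original long rule travels intact on the first binary rule, and the product over any parse comes out the same as before.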
Adapt your CKY parser from hw4 to use probabilities! Have it return the single most probable parse for an input sentence. The pseudocode on page 465 of Jurafsky and Martin (Figure 14.3) might be helpful.
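A minimal sketch of what the probabilistic chart could look like, assuming the CNF grammar is kept as a dict mapping (lhs, rhs) to probability (as in the earlier parts); backpointers record the split point so the best tree can be rebuilt. This is one possible shape, not the required one:

```python
def pcky(words, rules):
    """Probabilistic CKY over a CNF grammar. `rules` maps
    (lhs, rhs) -> probability, where rhs is either a one-tuple
    (terminal word,) or a pair of nonterminals. Returns
    (probability, tree) for the most probable S parse, or None."""
    n = len(words)
    # chart[i][j]: nonterminal -> (best probability, backpointer)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (lhs, rhs), p in rules.items():
            if rhs == (w,) and p > chart[i][i + 1].get(lhs, (0.0,))[0]:
                chart[i][i + 1][lhs] = (p, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for (lhs, rhs), p in rules.items():
                    if len(rhs) != 2:
                        continue
                    b, c = rhs
                    if b in chart[i][k] and c in chart[k][j]:
                        prob = p * chart[i][k][b][0] * chart[k][j][c][0]
                        if prob > chart[i][j].get(lhs, (0.0,))[0]:
                            chart[i][j][lhs] = (prob, (k, b, c))
    if 'S' not in chart[0][n]:
        return None
    def build(i, j, sym):
        bp = chart[i][j][sym][1]
        if isinstance(bp, str):                    # preterminal over a word
            return (sym, bp)
        k, b, c = bp
        return (sym, build(i, k, b), build(k, j, c))
    return chart[0][n]['S'][0], build(0, n, 'S')
```

Note this tree still contains whatever symbols your CNF conversion introduced; you'll probably want to undo the binarization before evaluating against the treebank.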
What precision, recall, and F-measure do you get when parsing the first ten sentences of the treebank? Write a program to calculate this. You're going to have to enumerate all the spans both in the "gold standard" parses from the treebank, and in your highest-ranked parses from your parser. Don't count the POS tags as "constituents" in your tree when calculating precision and recall. For example, in the tree:
(S (NP (NN cats)) (VP (V eat) (NP (NN cat) (NN food))))
... there are four constituents: S from 0 to 4, NP from 0 to 1, VP from 1 to 4, and NP from 2 to 4. By the way, evaluating on sentences from the training data is not good practice; but this is a very small treebank.
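The span extraction and the precision/recall/F computation could be sketched like this, again assuming the tuple representation; preterminals (POS tags) contribute no spans, matching the example above. Treating the spans as a set ignores duplicated identical spans (say, a unary chain of two NPs over the same words), which is fine for a sketch:

```python
def get_spans(tree, start=0, out=None):
    """Collect (label, i, j) for each non-preterminal constituent.
    Returns (end index of this subtree, list of spans). A node whose
    single child is a bare string is a POS tag and is skipped."""
    if out is None:
        out = []
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return start + 1, out          # POS tag: covers one word, no span
    i = start
    for c in children:
        i, _ = get_spans(c, i, out)
    out.append((label, start, i))
    return i, out

def prf(gold_spans, pred_spans):
    """Labeled precision, recall, and F-measure over constituent spans."""
    g, p = set(gold_spans), set(pred_spans)
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

For your actual numbers, you'd pool the counts over the first ten sentences (sum the correct, predicted, and gold counts before dividing) rather than averaging per-sentence scores.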
In hw5.txt, answer these questions. Also put whatever other questions, commentary, and observations you'd like to pass along.
Let us know if you need some HARD MODE things to try. This seems hard enough.
Turn in your code and your hw5.txt, with your answers and a quick description of what you did, and any HARD MODE extensions. Use Oncourse! Due on Thursday November 15 at 1:30pm.