This homework will have a little bit more programming than before, but also some discussion. Write up any written answers in a text file called hw2.txt. You'll complete the given program (just follow the instructions below), and zip up any files you changed or created into an archive named YOURUSERNAME-hw2.zip. Turn it in on OnCourse, under "hw2".
As mentioned before, feel free to discuss with your friends and classmates! Just make sure all the code and text you turn in was typed by you. If you get substantial ideas from people or online sources, make sure to cite your sources! (this is good not just for honesty's sake, but it helps me know about good sources in the future; maybe we can share them)
For this homework, you're given three small data sets to get started with, so we can make sure a simple classifier is working. They're all from the UCI Machine Learning repository, but I've modified them slightly so they work with our code. They're in the datasets directory, and each one has already been split into training and test portions, so you can evaluate how well the classifier works. Importantly, your code won't be able to get all the answers right, and that's to be expected! My version gets 1 flower, 3 congresspeople, and 16 scale problems wrong. Yours will probably do about the same, if your code works right!
Run the classifier like this:
$ python3 naivebayes.py datasets/house-votes-84.train datasets/house-votes-84.test
Get the starter code and data here.
In naivebayes.py, finish up the places marked TODO in the train function. The goal of the train function is that it learns the probability distribution over the different classes (how often did each class happen?), and the conditional probability distribution for each value for each feature, given a particular class. For example: if we're trying to classify congresspeople into parties, given that we already know a congressperson is a Democrat, what's the probability that they voted "yes" on a certain issue?
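To make that concrete, here is a minimal sketch of the kind of counting train needs to do. It assumes each training instance is a (label, feature_values) pair, and the names (priors, cond_probs, train_sketch) are just illustrative; your starter code's data structures will look different, so treat this as a picture of the idea, not as code to paste in.

    from collections import Counter, defaultdict

    def train_sketch(instances):
        label_counts = Counter()
        # feature_value_counts[label][feature_index][value] = count
        feature_value_counts = defaultdict(lambda: defaultdict(Counter))

        for label, features in instances:
            label_counts[label] += 1
            for i, value in enumerate(features):
                feature_value_counts[label][i][value] += 1

        total = sum(label_counts.values())
        # P(class): how often did each class happen?
        priors = {label: count / total for label, count in label_counts.items()}

        # P(feature i = value | class): relative frequency within each class
        cond_probs = defaultdict(dict)
        for label, per_feature in feature_value_counts.items():
            for i, value_counts in per_feature.items():
                n = label_counts[label]
                cond_probs[label][i] = {v: c / n for v, c in value_counts.items()}

        return priors, cond_probs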
Once this is working, you should be able to inspect the conditional probabilities: just print them out at the appropriate place in the code. I recommend testing this on the house-votes-84 dataset, since it's just a two-way classification. Maybe you'll find that some of the votes were very polarizing?
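One way to eyeball the learned distributions, again assuming the priors/cond_probs shapes sketched above: for every issue, print P(vote = "y" | party) for each party and look at how far apart the two numbers are. (The value "y" is just my guess at how yes-votes are encoded in the modified data set; check the actual feature values in your copy.) This helper is hypothetical, not part of the starter code.

    def print_polarizing(cond_probs, value="y"):
        labels = sorted(cond_probs.keys())
        features = sorted(cond_probs[labels[0]].keys())
        for i in features:
            probs = [cond_probs[label][i].get(value, 0.0) for label in labels]
            spread = max(probs) - min(probs)
            print(f"feature {i}: " +
                  ", ".join(f"P({value}|{l})={p:.2f}" for l, p in zip(labels, probs)) +
                  f"  (spread {spread:.2f})")

Issues with a large spread between the two parties are the polarizing ones.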
Finish up the classify function in naivebayes.py. It should return the most probable class for a given instance, making use of the class prior probabilities and conditional probabilities computed earlier in train.
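Here is a sketch of that idea, assuming the same illustrative priors/cond_probs structures as above. Working in log space avoids numerical underflow when multiplying many small probabilities; the tiny floor value (1e-9) is a stand-in for however your train function handles feature values that never occurred with a given class.

    import math

    def classify_sketch(features, priors, cond_probs):
        best_label, best_score = None, float("-inf")
        for label, prior in priors.items():
            # log P(class) + sum over features of log P(value | class)
            score = math.log(prior)
            for i, value in enumerate(features):
                p = cond_probs[label].get(i, {}).get(value, 1e-9)
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label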
In hw2.txt tell us how well your classifier does on the three data sets. Importantly: if it does better on some of the problems than others, why do you think that is? Can you imagine ways to improve accuracy on the problem that your classifier has trouble with? Is the problem inherently hard, or is the classifier just missing the point? How could you help it?