hw1: basic text processing and regular expressions


Write up your answers for part one in a text file called part1.txt. For part two, you'll complete the program date_extract.py, and for part three, you'll complete the program eliza.py. Zip up those three files, and any other code that you changed or created (libraries you modified or added, for example), into an archive named YOURUSERNAME-hw1.zip, and turn it in on OnCourse, under "hw1".

As mentioned before, feel free to discuss with your friends and classmates! Just make sure all the code and text you turn in was typed by you. If you get substantial ideas from people or online sources, make sure to cite your sources! (this is good not just for honesty's sake, but it helps me know about good sources in the future; maybe we can share them)

Get the starter code and data here.

part one: text processing and a little bit of corpus exploration

Using the Unix utilities tr, sort, uniq and perhaps cat and head, explore the included wikinews articles. What are the 25 most common words in the articles? What sequence of commands did you use to produce this answer? (remember: connect a bunch of them together with pipes! don't forget sort -rn to sort things by number, in reverse order.)
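One possible shape for that pipeline is sketched below on a toy string, so it runs anywhere; for the actual answer you'd replace the echo with cat over the wikinews files (the paths are whatever came in the starter archive, so they're left out here).

```shell
# Toy demo of a word-count pipeline. For the homework, swap the echo
# for `cat` over the wikinews article files.
echo "The cat sat. The cat ran." |
  tr 'A-Z' 'a-z' |     # fold case so "The" and "the" count together
  tr -sc 'a-z' '\n' |  # turn each run of non-letters into one newline
  sort |               # group identical words so uniq can count them
  uniq -c |            # prefix each word with its count
  sort -rn |           # biggest counts first
  head -25             # keep the top 25
```

The trick is that uniq only collapses adjacent duplicate lines, which is why the first sort has to come before it.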

If you're feeling energetic: explore the Enron email dataset and pick the most common words from the sent mail of an Enron employee or two, or find some other large cache of interesting text!

part two: regular expressions in Python

Complete the included date_extract.py: this is really straightforward. It just finds dates in text files and then prints out each date matched, followed by the month from that date. You can get fancier if you want, but I just want you to be able to match the dates in the wikinews articles included with the homework.
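For a sense of the shape this can take: a minimal sketch, assuming the articles write dates like "March 4, 2011" (the actual formats in the wikinews files may differ, so check them first). The function name and regex here are my own illustration, not the starter code's.

```python
import re

# Month names alternated into one group so we can capture which month matched.
MONTHS = ("January|February|March|April|May|June|"
          "July|August|September|October|November|December")

# Matches e.g. "March 4, 2011"; group 1 is the month.
date_re = re.compile(r"(%s)\s+\d{1,2},\s+\d{4}" % MONTHS)

def find_dates(text):
    """Return (full_date, month) pairs for every date found in text."""
    return [(m.group(0), m.group(1)) for m in date_re.finditer(text)]

for date, month in find_dates("On March 4, 2011, the vote passed."):
    print(date, "->", month)  # March 4, 2011 -> March
```

Putting the month names in a capture group is what lets you print the month separately from the full match, which is exactly the two outputs the assignment asks for.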

part three: build your therapeutic chatbot!
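The classic ELIZA trick is to match the user's input against a list of regex patterns and echo a captured fragment back inside a canned template. A minimal sketch of that loop, with made-up rules that are my own illustration and not the ones in eliza.py:

```python
import re

# Illustrative (pattern, response template) pairs -- not the starter code's.
RULES = [
    (re.compile(r"\bi am (.*)", re.I), "Why do you say you are %s?"),
    (re.compile(r"\bi feel (.*)", re.I), "Why do you feel %s?"),
]
DEFAULT = "Please, tell me more."

def respond(line):
    """Return a canned response, echoing back part of the input if a rule fires."""
    for pattern, template in RULES:
        m = pattern.search(line)
        if m:
            # Drop trailing punctuation before slotting the fragment in.
            return template % m.group(1).rstrip(".!?")
    return DEFAULT

print(respond("I am sad today."))  # -> Why do you say you are sad today?
```

A fuller version would also swap pronouns in the captured fragment ("my" to "your", and so on) before echoing it back.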