CS 210 - PA02

CS 210 Intermediate Computing with Data Structures

Programming Assignment 2 (pa02) Spring 2006

Due Monday Feb. 27, 2006 at noon

Purpose

This assignment aims to help you:

Work with Collection classes, in particular Maps and Sets.
Use main() of classes to do simple unit tests.
Use the powerful new Scanner class to tokenize input.

Reading

Before working on this assignment, you should read chapter 6, esp. sections 6.7 and 6.8.

Description

In this assignment, we will analyze English text. We can use the Scanner class to tokenize the text for us, turning it into a sequence of words, stripping off quotes, etc. We also want to detect ends of sentences.

We will find the word that most commonly follows (in the same sentence) each specific word in the document, not counting "stop words", the short words that have little specific meaning in themselves. A file of stop words will be available, as well as input for the English text.

Armed with this analysis, we will try an experiment making up sentences.

Implementation

The aim here is simplicity of application code, using the JDK to the fullest. Don’t implement anything you can find in the JDK!

1. First tokenize the input document with Scanner (with default delimiters) and simply count how many times each word appears in the document. Turn the uppercase letters to lowercase, so that capitalization doesn't make one word appear as two. For this, set up a Map from String to Integer, where the String is a word and the Integer the count of times seen. Use a TreeMap to keep the words in alphabetical order. Print out the first 10 words and their frequencies. Call this program Words1.java.
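As a rough illustration of the counting idea in step 1 (not a reference solution), the sketch below feeds a Scanner with default delimiters into a TreeMap from String to Integer. The class name WordCountSketch and the helper names countWords and printFirst are made up for this example, and Map.merge assumes Java 8 or later; with an older JDK, a get-then-put pair does the same job.

```java
import java.io.File;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// Hypothetical sketch of step 1; none of these names are required by the assignment.
class WordCountSketch {

    // Count lowercased tokens split on whitespace (the Scanner default).
    // A TreeMap keeps its keys in sorted (alphabetical) order.
    static Map<String, Integer> countWords(Scanner in) {
        Map<String, Integer> counts = new TreeMap<>();
        while (in.hasNext()) {
            String word = in.next().toLowerCase();
            counts.merge(word, 1, Integer::sum); // insert 1, or add 1 to the old count
        }
        return counts;
    }

    // Print at most `limit` entries in key order, one per line.
    static void printFirst(Map<String, Integer> counts, int limit) {
        int shown = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (shown++ >= limit) break;
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }

    public static void main(String[] args) throws Exception {
        // Read the file named on the command line, or fall back to a tiny inline sample.
        Scanner in = args.length > 0
                ? new Scanner(new File(args[0]))
                : new Scanner("See Spot run. See Sally run.");
        printFirst(countWords(in), 10);
    }
}
```

Note that with default delimiters the token "run." keeps its period, so it is counted separately from "run". That is the motivation for the more careful tokenizer in step 2.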

2. Now handle detailed word and sentence structure as follows. Set up a WordTokenizer class wrapping a Scanner, with a String getToken() method that returns either:

a word without surrounding punctuation or whitespace, that is, a sequence of letters, '-', or an interior quote as in "can't".
an end-sentence mark: "." or "?" or "!". You may assume these do not occur at the start or in the middle of words. Return "." for any of these.
a null if out of input.

In the main() of WordTokenizer, read a text file named as a command-line argument and output the tokens.

Example text:

See Spot run.
See Sally run.
Run, Spot, run.
Run fast!

Tokens returned by getToken, separated by spaces: See Spot run . See Sally run . Run Spot run . Run fast .
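One possible way to organize such a tokenizer is sketched below, under several assumptions: the class name WordTokenizerSketch, the two regular expressions, and the queue of pending tokens are all choices made for this example, and only ASCII letters and the straight apostrophe are recognized. Each raw Scanner token is scanned for words, and a single "." is queued if the raw token contains any end-sentence mark.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of step 2; only the method name getToken comes from the assignment.
class WordTokenizerSketch {
    // Runs of letters, optionally joined by single interior apostrophes or hyphens.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");
    // Any end-sentence mark; all three are reported as ".".
    private static final Pattern END_MARK = Pattern.compile("[.?!]");

    private final Scanner in;
    private final Deque<String> pending = new ArrayDeque<>();

    WordTokenizerSketch(Scanner in) {
        this.in = in;
    }

    // Returns the next word, "." for an end of sentence, or null when input is exhausted.
    String getToken() {
        while (pending.isEmpty() && in.hasNext()) {
            String raw = in.next();
            Matcher m = WORD.matcher(raw);
            while (m.find()) {
                pending.add(m.group());
            }
            // Relies on the assignment's assumption that these marks only end words.
            if (END_MARK.matcher(raw).find()) {
                pending.add(".");
            }
        }
        return pending.poll(); // poll() returns null on an empty queue
    }

    public static void main(String[] args) {
        String text = "See Spot run.\nSee Sally run.\nRun, Spot, run.\nRun fast!";
        WordTokenizerSketch t = new WordTokenizerSketch(new Scanner(text));
        StringBuilder out = new StringBuilder();
        for (String tok = t.getToken(); tok != null; tok = t.getToken()) {
            out.append(tok).append(' ');
        }
        System.out.println(out.toString().trim());
    }
}
```

Run on the example text, this sketch prints: See Spot run . See Sally run . Run Spot run . Run fast .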

3. Stop-word filtering. Write a StopWord class that can load stop words from the file stopwords.txt, put them in a Set of String, and then handle calls to boolean isStopWord(String). Write a main() for the StopWord class that reads a file of words using the Scanner with default delimiters, drops the stop words in it, and outputs the rest.
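A minimal sketch of the Set-based lookup follows. It assumes the stop words are separated by whitespace (the actual layout of stopwords.txt is not described here), and the class name StopWordSketch, the constructor taking a Scanner, and the inline sample words are all invented for illustration.

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical sketch of step 3; only the method name isStopWord comes from the assignment.
class StopWordSketch {
    private final Set<String> stopWords = new HashSet<>();

    // Load whitespace-separated stop words from any Scanner source (file or string).
    StopWordSketch(Scanner source) {
        while (source.hasNext()) {
            stopWords.add(source.next().toLowerCase());
        }
    }

    // Case-insensitive membership test against the loaded set.
    boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase());
    }

    public static void main(String[] args) {
        // Inline stand-ins for stopwords.txt and the input file of words.
        StopWordSketch stop = new StopWordSketch(new Scanner("the a of and"));
        Scanner words = new Scanner("the cat and the hat");
        while (words.hasNext()) {
            String w = words.next();
            if (!stop.isStopWord(w)) {
                System.out.println(w);
            }
        }
    }
}
```

A HashSet is used here because only membership tests are needed; a TreeSet would work equally well if sorted order were wanted.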

4. Now set up a class WordStats that contains the word, its frequency, and a TreeMap of String to Integer to hold statistics on the words seen just after this word. In class TextAnalyzer, set up a TreeMap of String to WordStats to hold all the information we need. TextAnalyzer should support addWordAndFollower(String, String) and String getMostFrequentFollower(String), among other methods. In its main(), devise a test for this class that doesn't need the other classes of this assignment. In Words2.java, use all the classes you have written so far to read a text file, tokenize it, make the tokens lowercase, drop the stop words, and put (word, follower) pairs in the TextAnalyzer via addWordAndFollower, or (word, null) for words at ends of sentences. It should output the first 10 words, their frequencies, and their most frequent followers (any one of them in case of a tie).

Example output for the example text above, in the form word(frequency): most frequent follower. Note the alphabetical order.

fast(1):
run(5): fast (tied with spot)
sally(1): run
see(2): sally (tied with spot)
spot(2): run
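The two-level map structure might be sketched as follows. This is one possible arrangement rather than the required design: the class names carry a Sketch suffix to mark them as hypothetical, getFrequency is an invented helper, and counting one occurrence of the word per addFollower call (including calls with null) is an assumption. Map.merge and Map.computeIfAbsent need Java 8 or later.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of step 4; only addWordAndFollower, addFollower and
// getMostFrequentFollower are names taken from the assignment.
class TextAnalyzerSketch {
    private final Map<String, WordStatsSketch> stats = new TreeMap<>();

    void addWordAndFollower(String word, String follower) {
        stats.computeIfAbsent(word, WordStatsSketch::new).addFollower(follower);
    }

    String getMostFrequentFollower(String word) {
        WordStatsSketch s = stats.get(word);
        return s == null ? null : s.getMostFrequentFollower();
    }

    int getFrequency(String word) {
        WordStatsSketch s = stats.get(word);
        return s == null ? 0 : s.frequency;
    }

    public static void main(String[] args) {
        // (word, follower) pairs for the example text, written out by hand so that
        // this test needs no tokenizer; null marks the end of a sentence.
        String[][] pairs = {
            {"see", "spot"}, {"spot", "run"}, {"run", null},
            {"see", "sally"}, {"sally", "run"}, {"run", null},
            {"run", "spot"}, {"spot", "run"}, {"run", null},
            {"run", "fast"}, {"fast", null}
        };
        TextAnalyzerSketch ta = new TextAnalyzerSketch();
        for (String[] p : pairs) {
            ta.addWordAndFollower(p[0], p[1]);
        }
        for (String w : ta.stats.keySet()) {
            String f = ta.getMostFrequentFollower(w);
            System.out.println(w + "(" + ta.getFrequency(w) + "):" + (f == null ? "" : " " + f));
        }
    }
}

class WordStatsSketch {
    final String word;
    int frequency;
    final Map<String, Integer> followers = new TreeMap<>();

    WordStatsSketch(String word) {
        this.word = word;
    }

    // Assumption: each call records one occurrence of the word; null means no follower.
    void addFollower(String follower) {
        frequency++;
        if (follower != null) {
            followers.merge(follower, 1, Integer::sum);
        }
    }

    // Strict '>' over sorted keys means ties go to the alphabetically first follower.
    String getMostFrequentFollower() {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With the hand-written pairs in main(), the sketch reports the same words, frequencies, and followers as the example output, without the "(tied with ...)" notes.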

5. Now for the experiment. Use the provided files tom.txt (Tom Sawyer) and alice.txt (Alice in Wonderland). Generate a random number x from 0 to N-1, where N is the number of words in the Map. Output a "sentence" of up to 5 words by starting with word x, followed by its most frequent follower, if any. If it does have a follower, look up that word and output its most frequent follower, and so on, until a word fails to have a follower or 5 words have been output. Generate 10 "sentences" this way. Do any of them make any sense? Of course, you have to supply the stop words yourself to make them into real sentences. Call this program Sentences.java.
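The follower-chasing loop can be sketched in isolation as below. To keep the example self-contained, a plain Map from each word to its most frequent follower stands in for the TextAnalyzer; that map, the class name SentenceSketch, and the method makeSentence are assumptions of this sketch, and the sample data is the small Spot example rather than tom.txt or alice.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch of step 5; all names here are invented for illustration.
class SentenceSketch {

    // Follow the chain word -> most frequent follower until there is no follower
    // or maxWords words have been output.
    static String makeSentence(Map<String, String> bestFollower, String start, int maxWords) {
        StringBuilder sb = new StringBuilder();
        String word = start;
        for (int n = 0; n < maxWords && word != null; n++) {
            if (n > 0) {
                sb.append(' ');
            }
            sb.append(word);
            word = bestFollower.get(word); // null when the word has no follower
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data: each word mapped to its most frequent follower (or null).
        Map<String, String> bestFollower = new TreeMap<>();
        bestFollower.put("fast", null);
        bestFollower.put("run", "fast");
        bestFollower.put("sally", "run");
        bestFollower.put("see", "sally");
        bestFollower.put("spot", "run");

        // Copy the keys once so that word x can be fetched by index.
        List<String> words = new ArrayList<>(bestFollower.keySet());
        Random rng = new Random();
        for (int i = 0; i < 10; i++) {
            int x = rng.nextInt(words.size()); // uniform in 0..N-1
            System.out.println(makeSentence(bestFollower, words.get(x), 5));
        }
    }
}
```

The 5-word cap matters because follower chains can cycle (for example, two words that each most often follow the other), so the loop would otherwise never end.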

memo.txt

In the file memo.txt, answer the following questions, in one to three pages (60-180 lines) of text. Use complete sentences, as you would for an English class.

1. What problems did you encounter and how did you solve them?

· Did you use the textbook examples to help you? Which ones?

· Did you access any information on the Web during this work? Where?

· Other resources?

2. Did the unit tests help you find bugs before assembling the whole program?

3. Did you tokenize by sentences, then by words, or by words, and then adjust for end-of-sentences found? Explain your algorithm for tokenization.

4. Did you find any ends of sentences that did not obey our simple rule of . or ? or ! as an ending? Any other irregularities?

5. Did any of your experimental sentences make any sense?

Turn In

Use the turnin system to submit your files. Before the due date, submit the following files to the pa02 folder of your turnin system account:

memo.txt (plain text file)
Words1.java, Words2.java: Both have main() and take a text file as a single argument, and output the information on the first 10 words (in alphabetical order) or fewer if necessary. One line of output for each word.
WordTokenizer.java: method getToken(), etc. Also main() with unit test as specified.
StopWord.java: method isStopWord(String), etc. Also main() with unit test as specified.
WordStats.java: methods addFollower(String), String getMostFrequentFollower(), etc. Also main() with your unit test.
TextAnalyzer.java: methods addWordAndFollower(String, String), getMostFrequentFollower(String), etc. Also main() with your unit test.
Sentences.java: has main() and takes a text file as a single argument and outputs 10 sentences, one line of output for each sentence.

We will run a program that compiles and tests the Java code, and collect all files for the grader. It is your responsibility to make sure those files are present, that their names are spelled correctly (e.g. we will not find Memo.txt or memo.TXT), and that the files are turned in to the proper assignment folder.