CS 210 Intermediate Computing with Data Structures 

Programming Assignment 2 (pa02) Spring 2006

Due Monday Feb. 27, 2006 at noon

Purpose

This assignment aims to help you:

Reading

Before working on this assignment, you should read chapter 6, esp. sections 6.7 and 6.8.

Description

In this assignment, we will analyze English text.  We can use the Scanner class to tokenize the text for us, turning it into a sequence of words, stripping off quotes, etc.  We also want to detect ends of sentences.

We will find the word that most commonly follows (in the same sentence) each specific word in the document, not counting "stop words", the short words that have little specific meaning in themselves.  A file of stop words will be available, as well as input for the English text.

Armed with this analysis, we will try an experiment making up sentences.

Implementation

The aim here is simplicity of application code, using the JDK to the fullest.  Don’t implement anything you can find in the JDK! 

1.  First tokenize the input document with Scanner (with default delimiters) and simply count how many times each word appears in the document.  Convert each word to lowercase, so that capitalization doesn't make one word appear as two.  For this, set up a Map from String to Integer, where the String is a word and the Integer is the number of times it has been seen.  Use a TreeMap to keep the words in alphabetical order.  Print out the first 10 words and their frequencies.  Call this program Words1.java.
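The counting step above can be sketched as follows.  This is only an illustration, not a reference solution: the class name WordCountSketch and the method countWords are made up for this example, and the input is a string literal rather than a file.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// Hypothetical sketch of step 1; names are illustrative, not required by the assignment.
class WordCountSketch {
    // Count lowercased tokens from the scanner; a TreeMap keeps the keys sorted.
    static Map<String, Integer> countWords(Scanner in) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        while (in.hasNext()) {
            String word = in.next().toLowerCase();
            Integer seen = counts.get(word);
            counts.put(word, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords(new Scanner("See Spot run. See Sally run."));
        int printed = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (printed++ == 10) break;   // only the first 10 words
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Note that with default delimiters the Scanner splits only on whitespace, so punctuation stays attached: here "run." is counted as a word.  Step 2 deals with that.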

2.  Now handle detailed word and sentence structure as follows. Set up a WordTokenizer class wrapping a Scanner, with a String getToken() method that returns either:

In the main() of WordTokenizer, read a text file named as a command-line argument and output the tokens.

Example text:

See Spot run.
See Sally run.
Run, Spot, run.
Run fast!

Tokens returned by getToken, separated by spaces:  See Spot run . See Sally run . Run Spot run . Run fast .
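One possible shape for such a tokenizer is sketched below.  It rests on assumptions that go beyond the text above: getToken() is taken to return a word with surrounding punctuation stripped, the string "." for any end of sentence (. or ? or !), and null when the input is exhausted.  The class name WordTokenizerSketch and the regular expressions are illustrative choices.

```java
import java.util.Scanner;

// Hypothetical sketch of step 2; the token contract is an assumption (see lead-in).
class WordTokenizerSketch {
    private final Scanner in;
    private boolean pendingEnd = false;   // true when a "." token is owed after a word

    WordTokenizerSketch(Scanner in) {
        this.in = in;
    }

    // Returns the next word, "." for an end of sentence, or null at end of input.
    String getToken() {
        if (pendingEnd) {
            pendingEnd = false;
            return ".";
        }
        while (in.hasNext()) {
            String raw = in.next();
            // Sentence end: . ? or ! possibly followed by closing quotes or brackets.
            boolean ends = raw.matches(".*[.?!][\"')\\]]*");
            // Strip leading and trailing characters that are not letters or digits.
            String word = raw.replaceAll("^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "");
            if (word.isEmpty()) {
                if (ends) return ".";     // punctuation-only token such as "..."
                continue;                 // nothing useful in this token
            }
            pendingEnd = ends;
            return word;
        }
        return null;
    }
}
```

This sketch tokenizes by words and then adjusts for the ends of sentences it finds.  It treats abbreviations such as "Mr." as sentence ends, which is one of the irregularities the memo asks about.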

3.  Stop-word filtering.  Write a StopWord class that can load stop words from the file stopwords.txt, put them in a Set of String, and then answer calls to boolean isStopWord(String).  Write a main() for the StopWord class that reads a file of words using the Scanner with default delimiters, drops the stop words in it, and outputs the rest.
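A minimal sketch of the stop-word lookup follows.  It assumes the stop words are separated by whitespace in the file and that matching should ignore case; neither is stated above.  The class name StopWordSketch and the choice of HashSet are illustrative.

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical sketch of step 3; the file format is an assumption (see lead-in).
class StopWordSketch {
    private final Set<String> stopWords = new HashSet<String>();

    // Load every whitespace-separated token from the source as a stop word.
    StopWordSketch(Scanner source) {
        while (source.hasNext()) {
            stopWords.add(source.next().toLowerCase());
        }
    }

    // Case-insensitive membership test.
    boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase());
    }
}
```

To load from the real file, the constructor could be given new Scanner(new java.io.File("stopwords.txt")), which requires handling FileNotFoundException.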

4.  Now set up a class WordStats that contains the word, its frequency, and a TreeMap of String to Integer to hold statistics on the words seen just after this word.  In class TextAnalyzer, set up a TreeMap of String to WordStats to hold all the information we need.  TextAnalyzer should support addWordAndFollower(String, String) and String getMostFrequentFollower(String), among other methods.  In its main(), devise a test for this class that doesn't need the other classes of this assignment.

In program Words2.java, use all the classes you have written so far to read a text file, tokenize it, make the tokens lowercase, drop the stop words, and put (word, follower) pairs in the TextAnalyzer via addWordAndFollower, or (word, null) for words at ends of sentences.  It should output the first 10 words, their frequencies, and their most frequent followers (any one of them, if there is a tie).

Output for the example text above, one line per word in the form word(frequency): most frequent follower.  Note the alphabetical order.

fast(1):
run(5): fast             (tied with spot)
sally(1): run
see(2): sally             (tied with spot)
spot(2): run
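The two classes of step 4 might be laid out as in the sketch below.  The names WordStatsSketch and TextAnalyzerSketch, the field names, and the tie-breaking rule (the alphabetically first follower wins) are assumptions made for illustration; the assignment accepts any one of the tied followers.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of step 4; names and tie-breaking are illustrative choices.
class WordStatsSketch {
    final String word;
    int frequency = 0;
    final TreeMap<String, Integer> followers = new TreeMap<String, Integer>();

    WordStatsSketch(String word) {
        this.word = word;
    }
}

class TextAnalyzerSketch {
    final TreeMap<String, WordStatsSketch> stats = new TreeMap<String, WordStatsSketch>();

    // Record one occurrence of word; follower is null at the end of a sentence.
    void addWordAndFollower(String word, String follower) {
        WordStatsSketch ws = stats.get(word);
        if (ws == null) {
            ws = new WordStatsSketch(word);
            stats.put(word, ws);
        }
        ws.frequency++;
        if (follower != null) {
            Integer n = ws.followers.get(follower);
            ws.followers.put(follower, n == null ? 1 : n + 1);
        }
    }

    // Returns null if the word is unknown or has no followers.
    String getMostFrequentFollower(String word) {
        WordStatsSketch ws = stats.get(word);
        if (ws == null) return null;
        String best = null;
        int bestCount = 0;
        // Entries come in alphabetical order; ">" keeps the first of any ties.
        for (Map.Entry<String, Integer> e : ws.followers.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

A standalone test could call addWordAndFollower with hand-written pairs such as ("see", "spot") and ("run", null) and then check the followers returned, without touching the tokenizer or the stop-word classes.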

5.  Now for the experiment.  Use the provided files tom.txt (Tom Sawyer) and alice.txt (Alice in Wonderland).  Generate a random number x from 0 to N-1, where N is the number of words in the Map.  Output a "sentence" of up to 5 words by starting with word x, then its most frequent follower, if any.  If it does have a follower, look up that word and output its most frequent follower, etc., until a word fails to have a follower or 5 words have been output.  Generate 10 "sentences" this way.  Do any of them make any sense?  Of course, you have to supply the stop words yourself to make them into real sentences.  Call this program Sentences.java.
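The follower chain can be sketched as below.  To keep the example self-contained, it works on a simplified, hypothetical structure: a TreeMap from each word to its most frequent follower (null if it has none), rather than on the TextAnalyzer itself.  The class name SentenceSketch and the method makeSentence are made up, and the map is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch of step 5 over a simplified word -> best follower map.
class SentenceSketch {
    // Build a "sentence" of at most maxWords words, starting at a random word.
    static String makeSentence(TreeMap<String, String> bestFollower,
                               Random rng, int maxWords) {
        // A TreeMap has no positional access, so copy the keys to a list first.
        List<String> words = new ArrayList<String>(bestFollower.keySet());
        String current = words.get(rng.nextInt(words.size()));   // x in 0..N-1
        StringBuilder sentence = new StringBuilder();
        for (int i = 0; i < maxWords && current != null; i++) {
            if (i > 0) sentence.append(' ');
            sentence.append(current);
            current = bestFollower.get(current);   // null ends the chain
        }
        return sentence.toString();
    }
}
```

Calling makeSentence(map, new Random(), 5) ten times would give the ten "sentences".  Copying the keys on every call is wasteful for a large map; building the list once is an easy improvement.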

memo.txt

In the file memo.txt, answer the following questions, in one to three pages (60-180 lines) of text.  Use complete sentences, as you would for an English class.

1.  What problems did you encounter and how did you solve them?

·       Did you use the textbook examples to help you?  Which ones?

·       Did you access any information on the Web during this work?  Where?

·       Other resources?

2.  Did the unit tests help you find bugs before assembling the whole program?

3.  Did you tokenize by sentences, then by words, or by words, and then adjust for end-of-sentences found?  Explain your algorithm for tokenization.

4.  Did you find any ends of sentences that did not obey our simple rule of . or ? or ! as an ending?  Any other irregularities?

5.  Did any of your experimental sentences make any sense?

Turn In

Use the turnin system to submit your files.  Before the due date, submit the following files to the pa02 folder of your turnin system account:

We will run a program that compiles and tests the Java code, and collect all files for the grader.  It is your responsibility to make sure those files are present, that their names are spelled correctly (e.g. we will not find Memo.txt or memo.TXT) and that the files are turned in to the proper assignment folder.