CS210 Class 9

Class 9 Tues., Feb. 21

We were looking at an inventory example using a Map from String name to ItemInfo.

"pencil" -> ("pencil", 120, 42)   120 pencils in bin 42

"tape" -> ("tape", 44, 11)   44 rolls of tape in bin 11

…

public class ItemInfo {
    private String name;
    private int quantity;
    private int binNumber;

    // Constructor, getters, setters…
}

Map<String, ItemInfo> inventory = new TreeMap<String, ItemInfo>();

ItemInfo item = new ItemInfo("pencil", 120, 42);

inventory.put("pencil", item);

…

later:

ItemInfo item = inventory.get("pencil");

System.out.println("we have " + item.getQuantity() + " " + item.getName() + "s");

Now we want to find all the information in the Map.

Look at Map interface, pg 237, for way to scan through Map.

see Set<KeyType> keySet();

Set<String> keys = inventory.keySet();

for (String k : keys)
    System.out.println("key = " + k + ", val = " + inventory.get(k));
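Putting the pieces so far together, here is a minimal runnable sketch of the inventory scan. The ItemInfo below is a stand-in for the class in the notes (setters omitted), and its toString() is an addition so that printing a value is readable. The class name InventoryScan and its two helper methods are made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the ItemInfo class from the notes (setters omitted).
class ItemInfo {
    private final String name;
    private final int quantity;
    private final int binNumber;

    ItemInfo(String name, int quantity, int binNumber) {
        this.name = name;
        this.quantity = quantity;
        this.binNumber = binNumber;
    }

    String getName() { return name; }
    int getQuantity() { return quantity; }
    int getBinNumber() { return binNumber; }

    // Added for this sketch so that values print readably.
    @Override
    public String toString() {
        return "(" + name + ", " + quantity + ", " + binNumber + ")";
    }
}

class InventoryScan {
    // Builds the two-item inventory used in the notes.
    static Map<String, ItemInfo> sampleInventory() {
        Map<String, ItemInfo> inventory = new TreeMap<String, ItemInfo>();
        inventory.put("pencil", new ItemInfo("pencil", 120, 42));
        inventory.put("tape", new ItemInfo("tape", 44, 11));
        return inventory;
    }

    // Visits every key via keySet() and looks up its value with get().
    static String describe(Map<String, ItemInfo> inventory) {
        StringBuilder sb = new StringBuilder();
        for (String k : inventory.keySet()) {
            sb.append("key = ").append(k)
              .append(", val = ").append(inventory.get(k)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(sampleInventory()));
    }
}
```

Because the Map is a TreeMap, the keys come out in alphabetical order: pencil first, then tape.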

Look at Map interface in general, pg 237

Drop "extends Serializable" there, and anywhere else you see it in Chap. 6.

In the JDK, containsKey, get, and remove actually take an Object argument rather than the key type. Collection does the same for its methods that depend on equals.

This means the compiler can't catch our mistake if we try to get an Integer key from a Map of String to whatever. It doesn't stop us from doing anything that's useful.
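A small sketch of that mistake, using a HashMap of String to Integer (the class and method names are made up for this example). The lookup with an int key compiles without complaint and simply finds nothing.

```java
import java.util.HashMap;
import java.util.Map;

class WrongKeyType {
    // Looks up an Integer key in a Map keyed by String.
    // This compiles because get() is declared to take Object.
    static Integer lookupWithIntegerKey() {
        Map<String, Integer> quantities = new HashMap<String, Integer>();
        quantities.put("pencil", 120);
        return quantities.get(42);   // no such key, so the result is null
    }

    public static void main(String[] args) {
        System.out.println(lookupWithIntegerKey());
    }
}
```

With a HashMap the bad lookup just returns null. A TreeMap has to compare keys to search, and its documentation says get throws ClassCastException when the key cannot be compared with the keys in the map, so there the mistake shows up at run time instead.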

We have HashMap and TreeMap, both easy to use with String, Integer, etc. keys. Advantage of TreeMap is order by key. HashMap is a little faster.
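To see the ordering difference, here is a short sketch that fills a TreeMap and a HashMap with the same entries and lists the keys in iteration order. The "eraser" item and the class name MapOrderDemo are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    // Puts the same three sample entries into whichever map is passed in,
    // then returns the keys in the map's own iteration order.
    static List<String> keysAfterFilling(Map<String, Integer> map) {
        map.put("tape", 44);
        map.put("pencil", 120);
        map.put("eraser", 7);   // made-up item for illustration
        return new ArrayList<String>(map.keySet());
    }

    public static void main(String[] args) {
        // TreeMap: keys always come back sorted.
        System.out.println(keysAfterFilling(new TreeMap<String, Integer>()));
        // HashMap: same keys, but the order is not specified.
        System.out.println(keysAfterFilling(new HashMap<String, Integer>()));
    }
}
```

The TreeMap line is always [eraser, pencil, tape]. The HashMap line has the same three keys in whatever order the hashing produces, so code should not rely on it.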

Intro to PA2

Analyzing English text, including alice.txt, Alice in Wonderland, by Lewis Carroll, and tom.txt, Tom Sawyer, by Mark Twain. First example is

test1.txt:

See Spot run.
See Sally run.
Run, Spot, run.
Run fast!

We want to tokenize this into tokens like this:

See Spot run . See Sally run . Run Spot run . Run fast .

Then make them lowercase, like this, so run and Run count the same.

see spot run . see sally run . run spot run . run fast .

Then for each word, keep track of its followers in the same sentence:

see->spot

spot->run

see->sally

sally->run

run->spot

spot->run

run->fast

For this, we set up a Map of String to Integer for each word, counting how often each follower occurs:

Map for see: spot->1, sally->1
Map for spot: run->2
Map for sally: run->1
Map for run: spot->1, fast->1
Map for fast: (empty)
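One way to build these per-word maps, as a rough sketch rather than the PA2 design: keep a Map from each word to its own follower-count Map, and walk the token list remembering the previous word. The class name FollowerCounts, the method countFollowers, and the convention that "." marks a sentence end in the token list are assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FollowerCounts {
    // tokens: lowercase words, with "." marking the end of each sentence.
    static Map<String, Map<String, Integer>> countFollowers(List<String> tokens) {
        Map<String, Map<String, Integer>> followers =
            new TreeMap<String, Map<String, Integer>>();
        String previous = null;              // previous word in the same sentence
        for (String token : tokens) {
            if (token.equals(".")) {
                previous = null;             // followers do not cross sentences
                continue;
            }
            if (!followers.containsKey(token)) {
                followers.put(token, new TreeMap<String, Integer>());
            }
            if (previous != null) {
                Map<String, Integer> counts = followers.get(previous);
                Integer old = counts.get(token);
                counts.put(token, old == null ? 1 : old + 1);
            }
            previous = token;
        }
        return followers;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "see", "spot", "run", ".", "see", "sally", "run", ".",
            "run", "spot", "run", ".", "run", "fast", ".");
        System.out.println(countFollowers(tokens));
    }
}
```

Every word gets a map, even one like fast that never has a follower, so later lookups do not need a null check on the inner map.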

Then for each word, we can find the most common follower:

see: spot

spot: run

sally: run

run: spot

fast:
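Finding the most common follower is a scan over one word's count map. In this sketch, ties go to whichever follower the map lists first, which with a LinkedHashMap means the one seen first in the text; that tie rule is an assumption, chosen because it reproduces "run: spot" above. The names MostCommonFollower and mostCommon are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MostCommonFollower {
    // Returns the follower with the highest count, or null if there are none.
    // On a tie, the follower that the map iterates over first wins.
    static String mostCommon(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (String word : counts.keySet()) {
            int count = counts.get(word);
            if (count > bestCount) {
                best = word;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so "spot" is seen before "fast".
        Map<String, Integer> runFollowers = new LinkedHashMap<String, Integer>();
        runFollowers.put("spot", 1);
        runFollowers.put("fast", 1);
        System.out.println("run: " + mostCommon(runFollowers));
    }
}
```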

Random sentence: take random first word, use followers, at most 5 words

spot run spot run spot

see spot run spot run.
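The two sample sentences above are consistent with following the most common follower from a starting word until there is no follower or the word limit is reached. Here is a sketch under that reading; the class RandomSentence, its methods, and the representation (a Map from each word to its most common follower, null if none) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class RandomSentence {
    // The "most common follower" table from the notes; fast has no follower.
    static Map<String, String> sampleFollowers() {
        Map<String, String> next = new HashMap<String, String>();
        next.put("see", "spot");
        next.put("spot", "run");
        next.put("sally", "run");
        next.put("run", "spot");
        next.put("fast", null);
        return next;
    }

    // Picks a starting word uniformly at random from the known words.
    static String randomFirstWord(Map<String, String> next, Random rng) {
        List<String> words = new ArrayList<String>(next.keySet());
        return words.get(rng.nextInt(words.size()));
    }

    // Follows the chain from first, stopping at maxWords or a dead end.
    static List<String> generate(Map<String, String> next, String first, int maxWords) {
        List<String> sentence = new ArrayList<String>();
        String current = first;
        while (current != null && sentence.size() < maxWords) {
            sentence.add(current);
            current = next.get(current);
        }
        return sentence;
    }

    public static void main(String[] args) {
        Map<String, String> next = sampleFollowers();
        String first = randomFirstWord(next, new Random());
        System.out.println(generate(next, first, 5));
    }
}
```

Starting from spot this gives spot run spot run spot, and starting from see it gives see spot run spot run, matching the two examples.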

Have Maps of String to Integer, String to WordStats object.
Also Set of String for StopWords.

Let's tackle the tokenization using Scanner.

As in hw, can use delimiters.

If we use the default Scanner as in the hw, we would get

See Spot run. See Sally run. Run, Spot, run. Run fast!

All the non-letters would be fastened on the words.

For example, the line

See "Spot" (the dog) run.

gives five tokens, with the quotes, parentheses, and period still attached:

See   "Spot"   (the   dog)   run.
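A quick way to check what the default Scanner produces is to collect its tokens into a list. This is only a sketch; the class name DefaultScannerDemo and the helper tokens are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DefaultScannerDemo {
    // Collects the tokens a Scanner yields with its default
    // (whitespace) delimiter.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        Scanner in = new Scanner(text);
        while (in.hasNext()) {
            result.add(in.next());
        }
        in.close();
        return result;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighboring word.
        System.out.println(tokens("See \"Spot\" (the dog) run."));
        System.out.println(tokens("Run, Spot, run."));
    }
}
```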

Second example in hw: useDelimiter(",\\s"). Here we would get three "tokens":
See Spot run. See Sally run. Run

Spot

run. Run fast!
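The same experiment with the comma delimiter, run on the test1.txt text as a String. As before, the class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CommaDelimiterDemo {
    // Splits the text wherever a comma is followed by one whitespace char.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        Scanner in = new Scanner(text).useDelimiter(",\\s");
        while (in.hasNext()) {
            result.add(in.next());
        }
        in.close();
        return result;
    }

    public static void main(String[] args) {
        String text = "See Spot run.\nSee Sally run.\nRun, Spot, run.\nRun fast!";
        List<String> found = tokens(text);
        System.out.println(found.size() + " tokens");
        for (String token : found) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Only the two commas in the third line act as delimiters, so the first and last tokens each span more than one line of the file.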

This gives us an idea for sentences. Sentences end in periods, question marks, or exclamation points, followed by whitespace, so we can use that as the delimiter.

First basics with ordinary letters. These are regular expressions, also studied in CS240 for use with grep, UNIX search

abc means match "abc"; cba means match "cba". Order counts.

a|b|c means match a or b or c, same as c|b|a

[abc] also means a|b|c, or [cba]

[a-c] another way to say it

[a-zA-Z] matches any letter

[^abc] is any char other than a or b or c

[^A-Za-z] is any char other than a letter

a* is any number of a's, including none at all

a+ is one or more a's

a? is an optional single a

Whitespace char: \s. We need to write \\s in a String constant, because \ is a special mark for String constants.
There are other special forms, like \\d for a decimal digit.
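These rules are easy to try out with java.util.regex.Pattern. Note that Pattern.matches tests whether the whole input matches the pattern, not just some part of it. The class name RegexBasics and the helper method are made up for this sketch.

```java
import java.util.regex.Pattern;

class RegexBasics {
    // True if the ENTIRE input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("abc", "abc"));       // order counts
        System.out.println(matches("abc", "cba"));
        System.out.println(matches("a|b|c", "b"));       // alternatives
        System.out.println(matches("[a-c]", "b"));       // character range
        System.out.println(matches("[^A-Za-z]", "7"));   // any non-letter
        System.out.println(matches("a*", ""));           // zero a's is fine
        System.out.println(matches("a+", ""));           // needs at least one
        System.out.println(matches("a?", "aa"));         // at most one
        System.out.println(matches("\\s", " "));         // whitespace char
        System.out.println(matches("\\d", "7"));         // decimal digit
    }
}
```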

Default delimiter for Scanner is a run of one or more whitespace characters (spaces, tabs, newlines).
Delimiter of ",\\s" is comma followed by a whitespace char
Delimiter of ",\\s*" is comma followed by any number of whitespace chars.

So the delimiter for sentences is set by useDelimiter("[.?!]\\s*"); Use the star to gobble up extra whitespace we don't want to see.
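Here is a sketch of the sentence delimiter applied to the test1.txt text. The class name SentenceSplitter and the method sentences are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class SentenceSplitter {
    // Splits text at . ? or ! plus any whitespace that follows.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<String>();
        Scanner in = new Scanner(text).useDelimiter("[.?!]\\s*");
        while (in.hasNext()) {
            result.add(in.next());
        }
        in.close();
        return result;
    }

    public static void main(String[] args) {
        String text = "See Spot run.\nSee Sally run.\nRun, Spot, run.\nRun fast!";
        for (String sentence : sentences(text)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Each sentence comes back without its ending punctuation and without the newline that followed it. Commas inside a sentence are still there at this stage.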

Then we can tokenize a sentence into words. A word is terminated by a run of one or more non-letters:

useDelimiter("[^A-Za-z]+");

Use + here rather than *: a delimiter pattern that can match the empty string would also match between adjacent letters.

But this makes can't into two words. Also see-saw. Can we fix this? Add ' and - to the letter chars. But - has a special meaning inside [], so we need to escape it with \, which is written \\ in a String constant:

useDelimiter("[^A-Za-z'\\-]+");

OK, now can't is an OK word. This does allow 'xxx' as a word. If you want to fix this, it's possible via a fancier pattern.
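A sketch of word tokenization for one sentence, with lowercasing folded in. It writes the delimiter with + (one or more) so that it can never match the empty string between two letters, and escapes the hyphen as \\- inside the String constant. The class name WordSplitter and the method words are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordSplitter {
    // Splits a sentence into lowercase words; letters, ' and - are
    // word characters, and any run of other characters is a delimiter.
    static List<String> words(String sentence) {
        List<String> result = new ArrayList<String>();
        Scanner in = new Scanner(sentence).useDelimiter("[^A-Za-z'\\-]+");
        while (in.hasNext()) {
            result.add(in.next().toLowerCase());
        }
        in.close();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Run, Spot, run"));
        System.out.println(words("I can't find the see-saw"));
    }
}
```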

Note that you will need to keep the current sentence around for several calls to getToken().

You could go the other way, tokenizing words and allowing '.?! through, then cleaning up and saving sentence-ends for the next call.
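For the first approach (sentences first, then words), here is one possible shape for keeping the current sentence around between calls. This is a sketch only, not the required PA2 design: the class name SentenceTokenizer, the choice to return "." at each sentence end, and returning null at end of input are all assumptions.

```java
import java.util.Scanner;

class SentenceTokenizer {
    private final Scanner sentences;   // yields one sentence at a time
    private Scanner words;             // scanner over the current sentence,
                                       // kept between calls to getToken()

    SentenceTokenizer(String text) {
        sentences = new Scanner(text).useDelimiter("[.?!]\\s*");
    }

    // Returns the next lowercase word, "." at the end of each sentence,
    // or null when the input is used up.
    String getToken() {
        if (words == null) {
            if (!sentences.hasNext()) {
                return null;
            }
            words = new Scanner(sentences.next())
                        .useDelimiter("[^A-Za-z'\\-]+");
        }
        if (words.hasNext()) {
            return words.next().toLowerCase();
        }
        words = null;                  // current sentence is finished
        return ".";
    }

    public static void main(String[] args) {
        SentenceTokenizer tokenizer =
            new SentenceTokenizer("See Spot run.\nRun fast!");
        for (String t = tokenizer.getToken(); t != null; t = tokenizer.getToken()) {
            System.out.print(t + " ");
        }
        System.out.println();
    }
}
```

The words field is the state that survives from one call to the next; it is reset to null when a sentence runs out, so the following call moves on to the next sentence.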