CS 310 Homework 4
Ethan Bolker, Woojin Paik
Spring 2002
Due date: see course calendar

In this assignment you will do some pencil and paper exercises with B-trees, heaps and Red-Black trees (no programming), finish the instrumentation of your hash table Map implementation, and carry out some performance studies in which you compare timings for various Map implementations.

As usual, do all your programming work in a hw4 subdirectory of your cs310 directory. But we won't collect electronically this time, since this homework calls for pictures and graphs. Hard copy is due in class on the assignment due date. Legible hand written work is acceptable, but the more you can get the computer to do for you the better. You can type the words and leave space for hand drawn data structures. You can draw the graphs by hand too if you must, but if you can use Excel you should. The graders may look into your hw4 directory if necessary.

Part I - Pencil and paper

  1. Weiss, Problem 4.36.

  2. Weiss, Problem 6.2.

  3. Weiss, Problem 6.3. (use 6.2 part a)

  4. Weiss, Problem 6.6. It is not a good idea to simply count all the nodes; find a better method. Give the final result and describe how you got it.

  5. Weiss, Problem 12.6 (if Red Black trees were discussed in class).

Part II - Hash table statistics

This is a question left over from hw3. It was optional there. Complete it now if you didn't then.

Hashtable performance depends on how spread out the data are in the table. If you haven't already done it in hw3, add instrumentation to your hashtable that will collect information on the distribution of data in the buckets.

So if you implement open hashing (separate chaining in Weiss) you should report the number of buckets of each size, this way:

	items in bucket		number of buckets
		0			100
		1			 50
		2			 22
		3			  4
		4			  1
		5			  3
	       >5 			  4
		
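The bucket report above can be computed with a single pass over the table. Here is a minimal sketch in C; the node type, the `MAXREPORT` cutoff, and the function name are illustrative assumptions, not part of the provided code - adapt it to your own HashMap internals:

```c
#include <stddef.h>

/* Hypothetical chain-node layout: each bucket is the head of a singly
 * linked list. Adjust to match your own HashMap's node struct. */
struct node { struct node *next; };

#define MAXREPORT 5  /* sizes above this are lumped into the ">5" row */

/* Fill counts[0..MAXREPORT] with the number of buckets holding exactly
 * that many items; counts[MAXREPORT+1] gets the ">5" overflow row. */
void bucketHistogram(struct node *buckets[], size_t nbuckets,
                     size_t counts[MAXREPORT + 2])
{
    for (size_t i = 0; i <= MAXREPORT + 1; i++)
        counts[i] = 0;
    for (size_t b = 0; b < nbuckets; b++) {
        size_t len = 0;                        /* length of this chain */
        for (struct node *p = buckets[b]; p != NULL; p = p->next)
            len++;
        counts[len > MAXREPORT ? MAXREPORT + 1 : len]++;
    }
}
```

Printing the table is then just a loop over `counts`, with the last entry labeled ">5".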
If you implement closed hashing (open addressing in Weiss) you should report the number of clusters (consecutive full locations) of each size:
	size of cluster		number of clusters
		1			 50
		2			 22
		3			  4
		4			  1
		5			  3
	       >5 			  4
		
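For closed hashing the corresponding report scans the table for maximal runs of consecutive full slots. A minimal sketch, assuming you can produce an occupancy flag per slot (the function name and `MAXCLUSTER` cutoff are illustrative; this version also treats the table as non-circular, so a real table that wraps around would need the first and last runs joined):

```c
#include <stddef.h>

#define MAXCLUSTER 5  /* cluster sizes above this go in the ">5" row */

/* Given occupied[s] != 0 for each full slot, fill counts[1..MAXCLUSTER]
 * with the number of clusters of each exact size and counts[MAXCLUSTER+1]
 * with the ">5" overflow row. */
void clusterHistogram(const int occupied[], size_t tablesize,
                      size_t counts[MAXCLUSTER + 2])
{
    for (size_t i = 0; i <= MAXCLUSTER + 1; i++)
        counts[i] = 0;
    size_t run = 0;                       /* length of the current run */
    for (size_t s = 0; s < tablesize; s++) {
        if (occupied[s]) {
            run++;
        } else if (run > 0) {             /* an empty slot ends a cluster */
            counts[run > MAXCLUSTER ? MAXCLUSTER + 1 : run]++;
            run = 0;
        }
    }
    if (run > 0)                          /* cluster ending at the boundary */
        counts[run > MAXCLUSTER ? MAXCLUSTER + 1 : run]++;
}
```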

Part III - performance measurements

In this part of the homework you will analyze the performance of several different map implementations: ours using a linked list, yours using a hash table, and ours using a hash table. Link the hw4/makefile to your hw4 directory. Link or copy your HashMap.c and instrumented ListMap.c to that directory. If you don't trust your implementations you may use our solutions (the solution for hw3 will be available soon).

When you type make, three programs will be created for you: timemapMyHash, timemapClassHash and timemapList, which test your hashtable, our hashtable, and our list implementations. If you want to use a different hash function (the default is hash1) type

make HASH=hash[123]
(remember to make clean first). The source of the testing program can be found in $cs310/src/timemap.c. Feel free to copy the file to your directory and modify it if you need to. If you do, you will also need to copy and modify the makefile.

Each time* executable reads words from an input file to build a Map for which those words are keys. Then it reads keys from stdin and calls get, delete and put on that Map. It surrounds those calls with calls to start and stop clocks, using the ARM (Application Response Measurement) API specified in /home/eb/arm/arm.h, to time the Map API function response times. Then it writes timing statistics and mapStatus output to file timeSomething.dat. To answer the questions below you must be able to read and understand those statistics.
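The start/stop pattern the timing programs use looks roughly like the following. This is only an illustration of the idea using the standard clock() function, not the actual ARM API from /home/eb/arm/arm.h, and mapGet here is a stub standing in for your Map's real get:

```c
#include <stddef.h>
#include <time.h>

/* Stub standing in for the real Map get call being measured. */
static int mapGet(const char *key) { (void)key; return 0; }

/* Time a batch of get calls and return the mean response time per
 * call, in seconds. The real timemap programs do the equivalent with
 * ARM start/stop calls around each Map operation. */
double timeGets(const char *keys[], size_t nkeys)
{
    clock_t start = clock();              /* start the clock */
    for (size_t i = 0; i < nkeys; i++)
        mapGet(keys[i]);                  /* the calls being measured */
    clock_t stop = clock();               /* stop the clock */
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)nkeys;
}
```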

Here's an example showing how to use the timing tools:

   % make
   % head -1000 wuther.words | timemapMyHash 15000 wuther.words 
The first line builds the executables, using the default (bad) hash function hash1. The second line runs timemapMyHash, which builds a Map from the first 15000 lines of file wuther.words and then uses the first 1000 lines of that file (read from stdin) to generate Map commands. File wuther.words contains the full text of Emily Bronte's "Wuthering Heights", with each of its 118492 words on a separate line, converted to lower case with punctuation removed.

After you have run timemapList you will find the performance statistics in timemapList.dat. We have annotated that file, explaining the meaning of some of the numbers.

Note: you can answer many of the questions below even if you did not manage to write a working hashtable implementation in the last homework. Use the one we have provided instead if you must.

Warmup questions

Each of these can be answered with one or two well chosen experiments - but you may need to do more than one or two in order to find some from which you can draw conclusions. Remember that small samples (small maps, few queries) are likely to generate unreproducible results. Choose input values large enough to yield reliable measurements but small enough so that they don't run for an unreasonable length of time. Whenever you can, try to find two independent ways to come to the same conclusion rather than relying on just one set of numbers. And please don't report too many unrealistic decimal places of accuracy.

If you run the same program a few times you may notice that the first run takes longer than the following ones, and that the timings for subsequent runs are not identical. The first effect is the result of caching of disk data in RAM, the second is the result of measurement inaccuracies. To make your measurements more accurate you may want to repeat each experiment a few times (e.g. 5-6), ignore the first run, and take the average of the timing results of the subsequent runs.
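Averaging while discarding the warm-up run can be done by hand or in a spreadsheet; as a C one-liner it is simply (the function name is an illustrative assumption):

```c
#include <stddef.h>

/* Mean of runs[1..nruns-1], discarding the first (cache-warming) run.
 * Requires nruns >= 2. */
double meanIgnoringFirst(const double runs[], size_t nruns)
{
    double sum = 0.0;
    for (size_t i = 1; i < nruns; i++)    /* skip runs[0] */
        sum += runs[i];
    return sum / (double)(nruns - 1);
}
```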

  1. Explain why you would expect failed calls to get to be a lot slower than successful ones for a list implementation of a Map but perhaps slightly faster for a hash table implementation. Perform experiments to verify your expectation.

    If you run timemapMyHash or timemapList as we did above, using the same file to build and query the map, then all the calls to get in the query loop will succeed. In order to test your prediction you will need to change timemapMyHash/List or invoke it differently. Do one or the other, in order to collect the data you need.

  2. Answer the previous question for the time differences between successful and failed calls to delete and between calls to put that overwrite and those that don't.

  3. Recompile your HashMap.c without the debug option -g and with various levels of compiler optimization enabled (see man gcc). Discuss the effects of these changes on performance.

HashMap performance

  1. Measure and discuss the performance implications of various choices of hash function. Use the mapStatus output that tells you the number of buckets of various sizes (open hashing) or the sizes of the clusters (closed hashing).

  2. Since you wrote the hashtable code, you know what value of the load factor triggers rehashing in your implementation. Can you perform experiments from which you can deduce the value of the load factor that causes a rehash? You can try your experiments on our implementation.

  3. Can you detect a performance difference between your hash table implementation and the one we have provided?

Looking for trends

Each of the questions considered so far can be answered with a single small set of experiments. Now we want to look at how Map performance depends on the size of the Map (the number of actual entries, not the number specified on the timing study command line). That is, we want to examine the O(N) behavior of get and nextKey. To do that you must perform a sequence of experiments for each implementation. After you have collected all your data, draw graphs and discuss them.

Do you see the constant and linear times for get predicted by theory for the implementations using hashing and lists? Can you detect a performance difference between your hash table implementation and the one we have provided?

Does iteration take time O(N) as expected for all implementations?

Your results are probably best presented graphically. To make that easier for you the ARM package specifies

	void armCsvReport(void);  
A call to that function puts a comma-separated report into the timing output file. That format is easy to import into a spreadsheet.

Warning: The natural (and faulty) behavior of most spreadsheets is to display the list of data points like

    ( 1388, y1)
    ( 2395, y2)
    (13254, y3)
    (32987, y4)
at four equally spaced points along the x axis. That's clearly a disaster when it comes to seeing how the y values are growing. The points along the x axis must represent the actual x values - in this case the number of entries in the map. If you can't make your spreadsheet do this, then you should draw your graphs by hand.

Some things to look at:


Deliverables: