In this assignment you will do some pencil and paper exercises with B-trees, heaps and Red-Black trees (no programming), finish the instrumentation of your hash table Map implementation, and carry out some performance studies in which you compare timings for various Map implementations.
As usual, do all your programming work in a hw4 subdirectory of your cs310 directory. But we won't collect electronically this time, since this homework calls for pictures and graphs. Hard copy is due in class on the assignment due date. Legible handwritten work is acceptable, but the more you can get the computer to do for you the better. You can type the words and leave space for hand-drawn data structures. You can draw the graphs by hand too if you must, but if you can use Excel you should. The graders may look into your hw4 directory if necessary.
This is a question left over from hw3. It was optional there. Complete it now if you didn't then.
Hash table performance depends on how spread out the data are in the table. You need to add instrumentation to your hash table that will collect information on the distribution of data in the buckets (if you haven't done it in hw3).
So if you implement open hashing (separate chaining in Weiss) you should report the number of buckets of each size, this way:

items in bucket    number of buckets
       0                  100
       1                   50
       2                   22
       3                    4
       4                    1
       5                    3
      >5                    4

If you implement closed hashing (open addressing in Weiss) you should report the number of clusters (consecutive full locations) of each size:

size of cluster    number of clusters
       1                   50
       2                   22
       3                    4
       4                    1
       5                    3
      >5                    4
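For the open hashing case the first table above can be produced with a single pass over the array of chains. Here is a minimal sketch of such a report; the type and field names (Node, next, table, numBuckets) are hypothetical placeholders for whatever your HashMap.c actually declares:

    /* Sketch only: counts how many buckets hold 0, 1, ..., 5 or >5 items.
     * All struct and field names below are hypothetical placeholders;
     * substitute the ones your HashMap.c really uses. */
    #include <stdio.h>

    typedef struct node { struct node *next; /* key, value, ... */ } Node;
    typedef struct { Node **table; int numBuckets; } HashMap;

    void reportBucketSizes(const HashMap *map) {
        int counts[7] = {0};              /* counts[6] collects sizes > 5 */
        for (int b = 0; b < map->numBuckets; b++) {
            int size = 0;
            for (Node *n = map->table[b]; n != NULL; n = n->next)
                size++;
            counts[size > 5 ? 6 : size]++;
        }
        printf("items in bucket  number of buckets\n");
        for (int s = 0; s <= 5; s++)
            printf("%15d  %17d\n", s, counts[s]);
        printf("%15s  %17d\n", ">5", counts[6]);
    }

The closed hashing report is analogous: walk the table once, counting the lengths of maximal runs of full slots.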
Copy the provided hw4/makefile to your hw4 directory. Link or copy your HashMap.c and instrumented ListMap.c to that directory. If you don't trust your implementations you may use ours (the solution for hw3 will be available soon).
When you type make, three programs will be created for you: timemapMyHash, timemapClassHash and timemapList, which will test your hash table, our hash table, and our list implementations respectively. If you want to use a different hash function (the default is hash1), type make HASH=hash[123] (remember to make clean first).
The source of the testing program can be found in $cs310/src/timemap.c. Feel free to copy the file to your directory and modify it if you need to. If you do, you will also need to copy and modify the makefile.
Each time* executable reads words from an input file to build a Map for which those words are keys. Then it reads keys from stdin and calls get, delete and put on that Map. It surrounds those calls with calls to start and stop clocks, using the ARM (Application Response Measurement) api specified in /home/eb/arm/arm.h to time the Map api function response times. Then it writes timing statistics and mapStatus output to a file timeSomething.dat. To answer the questions below you must be able to read and understand those statistics.
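The structure of that timed query loop is roughly like the self-contained sketch below. The functions armStartClock, armStopClock and get are hypothetical stand-ins invented for illustration; the real clock api is declared in /home/eb/arm/arm.h and its names and signatures may differ:

    /* Sketch of the timing structure only, not the real timemap.c. */
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-ins for the ARM start/stop clock calls;
     * see /home/eb/arm/arm.h for the actual api. */
    static clock_t started;
    static clock_t elapsed;
    static void armStartClock(void) { started = clock(); }
    static void armStopClock(void)  { elapsed += clock() - started; }

    /* A dummy stand-in for the Map api call being timed. */
    static int get(const char *key) { return key[0]; }

    int main(void) {
        char key[256];
        long calls = 0;
        while (scanf("%255s", key) == 1) {   /* read keys from stdin */
            armStartClock();                 /* time just the get call */
            get(key);
            armStopClock();
            calls++;
        }
        printf("%ld calls, %.6f s total in get\n",
               calls, (double)elapsed / CLOCKS_PER_SEC);
        return 0;
    }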
Here's an example showing how to use the timing tools:
% make
% head -1000 wuther.words | timemapMyHash 15000 wuther.words

The first line creates an executable to time the hash table Map implementation using the default (bad) hash function hash1. The timemapMyHash program builds a Map from the first 15000 lines of file wuther.words and then uses the first 1000 lines of that file (read from stdin) to generate Map commands. File wuther.words contains the full text of Emily Bronte's "Wuthering Heights" with each of its 118492 words on a separate line, converted to lower case with punctuation removed.
After you have run timemapList you will find the performance statistics in timemapList.dat. We have annotated that file, explaining the meaning of some of the numbers.
Note: you can answer many of the questions below even if you did not manage to write a working hash table implementation in the last homework. Use the one we have provided instead if you must.
Each of these can be answered with one or two well chosen experiments - but you may need to do more than one or two in order to find some from which you can draw conclusions. Remember that small samples (small maps, few queries) are likely to generate unreproducible results. Choose input values large enough to yield reliable measurements but small enough so that they don't run for an unreasonable length of time. Whenever you can, try to find two independent ways to come to the same conclusion rather than relying on just one set of numbers. And please don't report too many unrealistic decimal places of accuracy.
If you run the same program a few times you may notice that the first run takes longer than the following ones, and that the timings for subsequent runs are not identical. The first effect is the result of caching of disk data in RAM, the second is the result of measurement inaccuracies. To make your measurements more accurate you may want to repeat each experiment a few times (e.g. 5-6), ignore the first run, and take the average of the timing results of the subsequent runs.
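For example, in a Bourne-style shell (sh or bash) you could automate six runs like this, assuming, by analogy with timemapList.dat, that the statistics land in timemapMyHash.dat (check the actual file name):

    for i in 1 2 3 4 5 6
    do
        head -1000 wuther.words | timemapMyHash 15000 wuther.words
        mv timemapMyHash.dat run$i.dat   # keep each run's statistics
    done
    # run1.dat is the cache-warming run; ignore it when averaging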
You might expect unsuccessful calls to get to be a lot slower than successful ones for a list implementation of a Map, but perhaps slightly faster for a hash table implementation. Perform experiments to verify your expectation.
If you run timemapMyHash or timemapList as we did above, using the same file to build and query the map, then all the calls to get in the query loop will succeed. In order to test your prediction you will need to change timemapMyHash/List or invoke it differently. Do one or the other, in order to collect the data you need.
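One way to invoke it differently is to transform the query words so that none of them can possibly be in the map, for example by appending letters to every word (the suffix xq here is arbitrary):

    % head -1000 wuther.words | sed 's/$/xq/' | timemapMyHash 15000 wuther.words

Every get in the query loop should now miss, while the number and lengths of the queries stay almost the same as in the all-hits run, so the two runs are directly comparable.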
Compare the performance of calls to put that overwrite an existing entry and those that don't.
Recompile HashMap.c without the debug option -g and with various levels of compiler optimization enabled (see man gcc). Discuss the effects of these changes on performance.
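The exact compile commands depend on the makefile. If it follows the common convention of honoring a CFLAGS variable (an assumption - check the makefile we provide), a comparison might look like this:

    % make clean; make CFLAGS=-O0
    % head -1000 wuther.words | timemapMyHash 15000 wuther.words
    % make clean; make CFLAGS=-O2
    % head -1000 wuther.words | timemapMyHash 15000 wuther.words

Otherwise, edit the compiler flags in the makefile by hand between runs.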
Each of the questions considered so far can be answered with a single small set of experiments. Now we want to look at how Map performance depends on the size of the Map (the number of actual entries, not the number specified on the timing study command line). That is, we want to examine the O(N) behavior of get and nextKey. To do that you must perform a sequence of experiments for each implementation.
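For example, a sweep for the list implementation might double the build size each time (the sizes below are illustrative; choose ones that give reliable timings on your machine):

    % head -1000 wuther.words | timemapList 2000 wuther.words
    % head -1000 wuther.words | timemapList 4000 wuther.words
    % head -1000 wuther.words | timemapList 8000 wuther.words
    % head -1000 wuther.words | timemapList 16000 wuther.words

After each run record the actual number of entries from the statistics file, not the command line argument, since repeated words in the input do not add new entries.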
After you have collected all your data, draw graphs and discuss them.
Do you see the constant and linear times for get predicted by theory for the implementations using hashing and lists? Can you detect a performance difference between your hash table implementation and the one we have provided?
Does iteration take time O(N) as expected for all implementations?
Your results are probably best presented graphically. To make that easier for you the ARM package specifies

    void armCsvReport(void);

A call to that function puts a comma separated report into the timing output file. That format is easy to import into a spreadsheet.
Warning: The natural (and faulty) behavior of most spreadsheets is to display a list of data points like

    ( 1388, y1)
    ( 2395, y2)
    (13254, y3)
    (32987, y4)

above four equally spaced points along the x axis. That's clearly a disaster when it comes to seeing how the y values are growing. The points along the x axis must represent the actual x values - in this case the number of entries in the map. (In Excel, an XY scatter chart does this; an ordinary line chart does not.) If you can't make your spreadsheet do this, then you should draw your graphs by hand.