This
homework assignment has two parts.
The first part helps you to learn the basic of how c4.5 operates. The second part is designed for you to
gather empirical evidence on the generation of decision trees. You will be utilizing c4.5 and the
adult data set which is provided in the course website.
Before start working on the project assignment:
cs470/670 account (use the apply command, ask the
operator for help in case of problems). After you apply for the account a cs470/670 subdirectory will be
created in your home directory (it may take a few hours) IF YOU HAVE NOT
DONE SO ALREADY.o DO NOT CHANGE ACCESS RIGHTS FOR THIS DIRECTORY OR ANYTHING YOU CREATE IN THERE
Read the following steps carefully, as they apply to
the group project assignment as well:
cs310 directory create a proj1 subdirectory and do
all your work there. ·
Please protect your
code from access by others (if you work on the project from your home directory
before moving the final version to your cs470/cs670 account).
Installing software and make training/testing data available:
·
You
will download c4.5 program from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
to your proj1 sub-directory in the course account directory.
·
You
will compile c4.5 (you need to change Makefile to change ‘cc’ to ‘gcc’ and also
remove the references to getopt to compile correctly.)
·
You
will make a link to the following files from the proj1 subdiectory:
/home/wjpaik/adult/adult.names
/home/wjpaik/adult/adult.data
/home/wjpaik/adult/adult.test
PART 1 -Running
C4.5:
C4.5 comes with a set of
example datasets. One of these datasets is the “vote” database in the Data
directory. Notices the “vote” database consists of three files vote.names, vote.data,
and vote.test. Open the vote.names file in a text editor. First read the introduction
to the dataset. Following the opening comments are the classifiers. In this case
the classifiers are democrat and republican. After the classifiers are the attributes,
in this case the each attribute is a bill that was voted on. The attribute
values are all ‘n’, ‘y’, and ‘u’ that correspond to no, yes, unknown
respectfully. Open the vote.data file. This file contains examples of the
attribute’s values in order of the names file, followed by the example’s target
classification. Each line is a separate example. The vote.test file is formatted
exactly like the vote.data file, however it is used to validate the decision
tree generated from the vote.data file.
Step 1: Now that you have an
understanding of the underlying dataset, we can construct a decision tree using
c4.5. Type this command in the Unix prompt:
c4.5 -f vote
This executes c4.5 with the
vote database as it’s input. The –f switch signifies “vote” is filename stem.
This produces a Decision Tree, a Simplified Decision Tree, and an Evaluation on
training data.
Step 2: Note that the leaf
nodes return the appropriate classification depending on the attribute’s value.
The number in the prentices next to a given classification is the number of
times that path was followed in the example set. According to the decision
tree, how many times did this expression
physician
fee freeze = n Ù adoption of the budget
resolution = y
get executed from the
training set? What is the target classifier the decision tree output?
Step 3: This number may be
followed by a second number (e.g., 4.0/2.0), in which case the second value
(2.0) equals the number of classification errors encountered out of the total
number of classifications made from the training data in that particular path
of the decision tree (4.0). According to the decision tree, how many classification
errors did this path have in the training data?
physician
fee freeze = y Ù synfuels corporation
cutback = n
Step 4: In the “Evaluation
on training data” section, the unpruned decision tree and the pruned decision
tree are evaluated against the training data instances to test the fitness of
each. The first table illustrates the fitness of the unpruned tree. It
has two columns:
The second table illustrates
the fitness of the pruned tree. It has three columns:
Step 5: C4.5 can also
evaluate the generated decision tree(s) against test data. Enter the following
command into the Unix prompt:
c4.5 -u -f vote
This command generates the
same output, but appends evaluation of the tree(s) using the test data, that
follows the same format as the evaluation on the training data. The evaluation
of the test set also gives a classification matrix that shows the number of
correct versus incorrect classifications.
PART 2 –Analysis
of Decision Tree:
The adult database consists
of data collected from the US Census Bureau. The decision tree will try to
classify where an individual ma ke above 50K or below 50K, depending on the
given attributes. The focus of this assignment is to determine how much data is
needed to train decision tree that has the highest probability of correctness.
You should write a script to do the following assignment. And store the output
for further analysis.
Step 1: The adult.data file
has 32562 examples. You need to split the adult.data file into 20 files, each
containing successively larger datase ts. That is to say you should have
1(32562/20) examples in your first dataset. Your second dataset will have 2(32562/20).
And so on until you have 20 datasets. (Note that you will also have to copy the
adult.test and adult.names files so that the file stem matches the file stem of
the dataset’s file stem).
Step 2: Run c4.5 on each of
the 20 copies of the database. Use the –u switch to test against unseen data in
the filestem.test dataset. Also use the –g switch to make c4.5 use the
information gain criteria when selecting nodes.
Step 3: Graph a
“training-set size” v. “training data performance of unpruned tree” chart.
Step 4: Graph a
“training-set size” v. “training data performance of pruned” chart.
Step 5: Graph a
“training-set size” v. “test data performance of unpruned tree” chart.
Step
6: Graph a “training-set size” v. “test data performance of pruned” chart.
You will be graded on the
accuracy and the presentation of your hypothesizes and collected data. That is
to say, your data should not only be correct, but be presented in scientific
manner. By “Scientific”, I mean the scientific method: