Individual Project: Decision Tree Learning

Due: Part 1 at 5:30pm on October 23, 2002 & Part 2 at 5:30pm on October 30, 2002

 

This homework assignment has two parts.  The first part helps you to learn the basic of how c4.5 operates.  The second part is designed for you to gather empirical evidence on the generation of decision trees.  You will be utilizing c4.5 and the adult data set which is provided in the course website.

 

What to do:

 

Before start working on the project assignment:

 

o        DO NOT CHANGE ACCESS RIGHTS FOR THIS DIRECTORY OR ANYTHING YOU CREATE IN THERE

 

Read the following steps carefully, as they apply to the group project assignment as well:

 

·         Please protect your code from access by others (if you work on the project from your home directory before moving the final version to your cs470/cs670 account).

 

Installing software and make training/testing data available:

 

·         You will download c4.5 program from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz to your proj1 sub-directory in the course account directory.

·         You will compile c4.5 (you need to change Makefile to change ‘cc’ to ‘gcc’ and also remove the references to getopt to compile correctly.)

·         You will make a link to the following files from the proj1 subdiectory:

/home/wjpaik/adult/adult.names

/home/wjpaik/adult/adult.data

/home/wjpaik/adult/adult.test

 

PART 1 -Running C4.5:

 

C4.5 comes with a set of example datasets. One of these datasets is the “vote” database in the Data directory. Notices the “vote” database consists of three files vote.names, vote.data, and vote.test. Open the vote.names file in a text editor. First read the introduction to the dataset. Following the opening comments are the classifiers. In this case the classifiers are democrat and republican. After the classifiers are the attributes, in this case the each attribute is a bill that was voted on. The attribute values are all ‘n’, ‘y’, and ‘u’ that correspond to no, yes, unknown respectfully. Open the vote.data file. This file contains examples of the attribute’s values in order of the names file, followed by the example’s target classification. Each line is a separate example. The vote.test file is formatted exactly like the vote.data file, however it is used to validate the decision tree generated from the vote.data file.

 

Step 1: Now that you have an understanding of the underlying dataset, we can construct a decision tree using c4.5. Type this command in the Unix prompt:

 

c4.5 -f vote

 

This executes c4.5 with the vote database as it’s input. The –f switch signifies “vote” is filename stem. This produces a Decision Tree, a Simplified Decision Tree, and an Evaluation on training data.

 

Step 2: Note that the leaf nodes return the appropriate classification depending on the attribute’s value. The number in the prentices next to a given classification is the number of times that path was followed in the example set. According to the decision tree, how many times did this expression

 

physician fee freeze = n Ù adoption of the budget resolution = y

 

get executed from the training set? What is the target classifier the decision tree output?

 

Step 3: This number may be followed by a second number (e.g., 4.0/2.0), in which case the second value (2.0) equals the number of classification errors encountered out of the total number of classifications made from the training data in that particular path of the decision tree (4.0). According to the decision tree, how many classification errors did this path have in the training data?

 

physician fee freeze = y Ù synfuels corporation cutback = n

 

Step 4: In the “Evaluation on training data” section, the unpruned decision tree and the pruned decision tree are evaluated against the training data instances to test the fitness of each. The first table illustrates the fitness of the unpruned tree. It has two columns:

 

  1. Size: the size of the unpruned tree. That is, the number of nodes of which it is composed.
  2. Errors: the number of classification errors and their corresponding error percentage from the total number of cases.

 

The second table illustrates the fitness of the pruned tree. It has three columns:

 

  1. Size: the size of the pruned tree. It is either less than or equal to that of the unpruned tree depending upon the extent of the pruning performed by C4.5.
  2. Errors: the number of classification errors and their corresponding actual error percentage after pruning.
  3. Estimate: the estimated error percentage of the tree after pruning, useful when comparing with the actual percentage.

 

Step 5: C4.5 can also evaluate the generated decision tree(s) against test data. Enter the following command into the Unix prompt:

 

c4.5 -u -f vote

 

This command generates the same output, but appends evaluation of the tree(s) using the test data, that follows the same format as the evaluation on the training data. The evaluation of the test set also gives a classification matrix that shows the number of correct versus incorrect classifications.

 

PART 1 – Questions:

 

  1. Describe what you have done to complete the steps one to five – report any assumptions or decisions made while following each step.  In addition, report any problem that you encountered.
  1. Answer questions in steps 2 and 3.
  2. The “Simplified Decision Tree”, is the pruned decision tree. How many attributes does the simplified decision tree have compared to the unpruned decision tree in this example?
  3. When evaluating the unpruned versus the pruned decision tree with the training data, the error on the pruned decision tree is higher. However, when evaluating the decision trees against the test data, the error on the unpruned decision tree is higher. Explain.

 

 

 

 

PART 1 - What to turn in:

 

PART 2 –Analysis of Decision Tree:

 

The adult database consists of data collected from the US Census Bureau. The decision tree will try to classify where an individual ma ke above 50K or below 50K, depending on the given attributes. The focus of this assignment is to determine how much data is needed to train decision tree that has the highest probability of correctness. You should write a script to do the following assignment. And store the output for further analysis.

 

Step 1: The adult.data file has 32562 examples. You need to split the adult.data file into 20 files, each containing successively larger datase ts. That is to say you should have 1(32562/20) examples in your first dataset. Your second dataset will have 2(32562/20). And so on until you have 20 datasets. (Note that you will also have to copy the adult.test and adult.names files so that the file stem matches the file stem of the dataset’s file stem).

Step 2: Run c4.5 on each of the 20 copies of the database. Use the –u switch to test against unseen data in the filestem.test dataset. Also use the –g switch to make c4.5 use the information gain criteria when selecting nodes.

Step 3: Graph a “training-set size” v. “training data performance of unpruned tree” chart.

Step 4: Graph a “training-set size” v. “training data performance of pruned” chart.

Step 5: Graph a “training-set size” v. “test data performance of unpruned tree” chart.

Step 6: Graph a “training-set size” v. “test data performance of pruned” chart.

 

You will be graded on the accuracy and the presentation of your hypothesizes and collected data. That is to say, your data should not only be correct, but be presented in scientific manner. By “Scientific”, I mean the scientific method:

  1. Observe some aspect of the world.
  2. Invent a tentative description, called a hypothesis, which is consistent with what you have observed.
  3. Use the hypothesis to make predictions.
  4. Test those predictions by experiments or further observations and modify the hypothesis in the light of your results.
  5. Repeat steps 2 and 3 until there are no discrepancies between theory and experiment and/or observation.

 

PART 2 – Questions:

 

  1. Describe what you have done to complete the steps one to six – report any assumptions or decisions made while following each step.  In addition, report any problem that you encountered.
  2. Describe the behavior of each graph. That is, does the graphs increase, decrease, or stay the same when adding more examples to the training data? Explain why the graphs behave in this fashion. (Hint: How m any attributes are there for the adult database? Of these attributes, how many attributes do they have? How does this Hypothesis Space compare to the number of examples used for training? See section 18.6 of Russell and Norvig)
  3. In this experiment:
    1. Did you randomly select the examples from the dataset, or did you chop first (32562/20) then for the second example use the first 2(32562/20), and so on?
    2. Is one selection process better than the other? Explain.

 

PART 2 - What to turn in: