What is Decision Tree Learning?
An important technique in machine learning, decision trees are used extensively in data mining. They are able to produce human-readable descriptions of trends in the underlying relationships of a dataset and can be used for classfication and prediction tasks. The technique has been used successfully in many different areas, such as medical diagnosis, plant classification, and customer marketing strategies.
How to Use These Pages
The following pages in this Tutorial will present the basic ideas and techniques involved in Decision Tree Learning with references to research in the area. They should be worked through in sequential order, or use the navigation to skip to a particular page. Every now and then, there will be interactive pages. These are intended to be used as aids in understanding the various algorithms and concepts. Once you feel that you have a good understanding of the material, please use the resources section or contribute to the site with comments or additions. Thanks for looking - have fun!
Tutorial (1): A simple decision tree
An example dataset
Imagine we have the following data about a fictitious marketing strategy. Say some company sent out some promotion to various houses and recorded a few facts about each house and also whether the people responded or not:
| District | House Type | Income | Previous Customer |
Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Responded | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Responded | Nothing |
| Rural | Semi-detached | Low | Responded | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Responded | Responded |
| Rural | Terrace | High | Responded | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Responded | Nothing |
Imagine that we had thousands and thousands of instances (records) of this stuff. Here we have only 14, but if we had a lot, then it would be reasonable to assume that there would be some patterns in it. What sort of patterns? What could we find out? Well, we might discover some underlying relationships between some of the attributes, in particular it would be good to know which factors influence whether someone responds or not. That is, which factors most strongly affect a household's response to the promotion. In the data above for example, we can see that all rural households responded. This would be useful to know, as next time we might only have so many promotional brochures and so we would like to be selective as to where we send them in order to get the most responses. The example above is pretty trivial and we could probable analyse it manually just by looking, but the general idea if we had more instances would be to build some sort of classifier which could be used to examine the underlying relationships and make future predictions about the target concept (in this case the outcome of a mailed promotion). This is where automated building of decision trees comes in - a technique that can be used to generate some rules about data and then perform generalisation and prediction tasks.
We'll stick with the example data above in this tutorial and use it to show how we could build a decision tree to analyse it. Of course we're not going to 'build' any trees ourselves-we'd get some software to do it, or some seeds and a pot- but it's important to examine the techniques involved.
The Decision Tree
How can a tree help here? Well, in order to generate a set of rules we can construct a decision tree. This is done top-down from a root node and involves partitioning the data into subsets that contain instances that have similar values. Doing this for the dataset above can result in such a tree:
| District | |||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||
Explanation
Ok, so the nodes in brown in the tree correspond to attributes. At each node the dataset is split into subsets based on the value of the attribute at the node. For instance, at the root node, we split the entire dataset into three subsets. One that contains only instances (rows, tuples, whatever) that have the value 'Suburban' for the 'District' attribute, one that that contains only instances where the District attribute is 'Urban', and one where the all the instances are 'Rural' for that attribute. The numbers on the branches are important here: They correspond to the number of instances in each subset that have one and only one value for the target attribute ('Outcome'). This basically says how well the given value of the attribute we split on relates to the target attribute. What? Look at the tree - the middle branch of the first node. '4/4' below 'Rural' indicates that all four of the instances with District=Rural have the same value for 'Outcome' (in this case 'Responded'). This is good, because we have split the data using this attribute=value pairing to perfectly classify all instances that have this pairing. In other cases the value of the District attribute does not lead to a perfect, or pure subset. These ideas are related to entropy which we shall examine later. Anyway, continuing with the above tree - look at the first branch of the first node. This tells us that when District=Suburban, only 3 of 5 instances have the same value of the target attribute. In this case, it is necessary to continue splitting up this subset using other attribute tests until we have only pure subsets. The 5 instances which have District=Suburban on the left-most branch are then tested with 'House-Type' and are split into further subsets. The tree construction continues until purity, or until all the subsets are pure (with respect to the target attribute). When this occurs the branch terminates in a green leaf node that specifies what value the target attribute takes for all instances that have been filtered down this branch.
Rules From The Tree
Ok, so we can represent the data with a tree? So what? Well we can extract rules from the tree quite easily. Just read off the paths of all the leaf nodes. This gives us (from left to right in the tree):
A disjunction of conjunctions. This is useful for summarising the data and extracting the underlying relationships.
How can this be used for classification and prediction?
Well, say for example that we wanted to predict the outcome of mailing to a certain house. We could just of course do a look-up on our dataset to see if the characteristics of this new house matched any we had mailed to before with the assumption that the new house will respond in the same way. This won't always be possible though, as our dataset here doesn't represent all the possible combinations. Instead we use the decision tree to generalise.
E.g. If we know the District we're going to mail to is Urban and the person was a previous customer, then the tree predicts that the person will not respond (Follow the attributes and values down the tree).
Practically?
Ok, so this illustrates the basic idea how we can use Decision Tree Learning. You might be thinking that this is a very small, contrived example and that really it's all fairly random what happened. Well, yes, yes and yes, but the same basic idea is used in practical situations. Imagine if we had thousands of records of data for a concept like the one above and maybe lots more attributes, perhaps some of them with numeric attributes. We wouldn't be able to analyse such data by just looking at it and so constructing a decision tree would help. Furthermore, the more data we have, then the more chance that we can get a real insight into the any underlying function or relationship between the attributes. This is because the tree generalises when used for predictions and we would be more confident about it's accuracy if it had been constructed from many examples instead of just a few. There are many other complications and details to worry about, but all that can be looked at later. Right now, we are going to look at the tree building process a bit more.
One Dataset: Many Trees
For any given dataset there are a lot of possible trees that we could construct. Instead of having the root node as 'District', it could have been 'Income' for example. Likewise, the second child node could have been 'House-Type' instead of 'Previous-Customer'. Have a go building a tree from this dataset on the next page. You will see that there are many possible trees that can be built to model this data. They are all perfectly legitimate decision trees. Try to find the shortest (least number of nodes).
Tutorial (2): Exercise 1
Constructing a Simple Tree
The dataset:
| District | House Type | Income | Previous Customer | Outcome |
| Suburban | Detached | High | No | Nothing |
| Suburban | Detached | High | Yes | Nothing |
| Rural | Detached | High | No | Responded |
| Urban | Semi-detached | High | No | Responded |
| Urban | Semi-detached | Low | No | Responded |
| Urban | Semi-detached | Low | Yes | Nothing |
| Rural | Semi-detached | Low | Yes | Responded |
| Suburban | Terrace | High | No | Nothing |
| Suburban | Semi-detached | Low | No | Responded |
| Urban | Terrace | Low | No | Responded |
| Suburban | Terrace | Low | Yes | Responded |
| Rural | Terrace | High | Yes | Responded |
| Rural | Detached | Low | No | Responded |
| Urban | Terrace | High | Yes | Nothing |
The Decision Tree: Interactively build it
- Click on the root node below and
start building the tree. - Non leaf nodes can be "pruned" once they have been chosen (by clicking on the node and selecting "prune node completely")
- The ratios on the branches indicate how well the chosen attribute at a node splits the remaining data based on the target attribute ('outcome').
- Click on any nodes to hilight the rows in the data table that the rule down to that node covers.
| Click here to begin |
Tutorial (3): Occam's Razor
The Smallest Tree
If you did the practical exercise on the previous page, you will notice that the smallest complete tree that it is possible to build from the is this:
| District | ||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||
There were other trees that you could build, some of them much larger and more complex than this one. The interesting thing is though, is that each of the trees describes the data perfectly - no instances are left unaccounted for. Each tree has no exceptions and is a perfectly valid hypothesis over the data. In that case, the shorter hypothesis is preferred. Why?
Occam's Razor
This principle thought up a long time ago by William Occam while shaving(!) states that the shortest hypothesis, or solution to a problem, should be the one we should prefer (over longer more complicated ones). This is one of the fundamental tenet's of the way western science works and has received much debate and controversy. Let's just say that most of the stuff concerning decision trees follows the idea of Occam's razor and looks to always contstruct small, simple, easily-understandable tree structures.
Why not just construct the shortest tree each time each time then for every dataset? As simple as it was for our dataset, in the general case this isn't so simple and can take a very long time on some datasets. Formally, this has shown to be an NP-complete problem which means that no computer algorithm can solve this in a sensible amount of time (in the general case). Therefore algorithms have been developed which find a good tree, usually very close to the best. It's now necessary to start examining some of these decision tree algorithms in more detail.
|