25 Vehicle classification with Decision Trees

  • Datasets: Vehicle (mlbench)
  • Algorithms:
    • Decision Trees
  • Instructions: book “Applied Predictive Modeling Techniques”, Lewis, N.D.

25.1 Load packages

library(mlbench)  # provides the Vehicle dataset
library(tree)     # provides tree(), cv.tree(), prune.tree() and prune.misclass()
data(Vehicle)


25.2 Prepare data

N <- nrow(Vehicle)
# draw 500 row indices without replacement for the training set
train <- sample(1:N, 500, FALSE)
# training and test sets
trainset <- Vehicle[train,]
testset  <- Vehicle[-train,]
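
The split above is random, so the fitted tree and every number reported below change from run to run unless a seed is fixed. A minimal sketch of a reproducible split follows. It assumes an arbitrary seed (123) and uses the built-in iris data as a stand-in so that it runs without mlbench; neither choice comes from the original text.

```r
# Reproducible hold-out split (sketch; the seed value is an arbitrary assumption)
set.seed(123)
n <- nrow(iris)               # 150 rows; iris stands in for Vehicle here
train_idx <- sample(n, 100)   # 100 distinct row indices, drawn without replacement

train_part <- iris[train_idx, ]    # rows selected for training
test_part  <- iris[-train_idx, ]   # negative indexing keeps all remaining rows

nrow(train_part)   # 100
nrow(test_part)    # 50
# no row appears in both parts
length(intersect(rownames(train_part), rownames(test_part)))  # 0
```

The same pattern applies to Vehicle by calling set.seed() before sample(1:N, 500, FALSE).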

25.3 Estimate the decision tree

fit <- tree(Class ~., data = trainset, split = "deviance")
# fit <- tree(Class ~., data = Vehicle[train,], split ="deviance")
# print the fitted tree, one line per node
fit

We use deviance as the splitting criterion; a common alternative is split = "gini".

At each branch of the tree (after the root) we see, in order:
1. the branch number (e.g. in this case 1, 2, 14 and 15);
2. the split (e.g. Elong < 41.5);
3. the number of samples going along that split (e.g. 229);
4. the deviance associated with that split (e.g. 489.1);
5. the predicted class (e.g. opel);
6. the associated class probabilities (e.g. ( 0.222707 0.410480 0.366812 0.000000 ));
7. for a terminal node (or leaf), the symbol "*".
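
As a check on how the deviance and the probabilities of a node relate, the example node can be recomputed by hand. The sketch below uses the usual textbook definitions of multinomial deviance and the Gini index. The class counts are not printed by tree(); they are derived here by multiplying the 229 samples by the quoted probabilities, and the helper names node_deviance and node_gini are hypothetical.

```r
# Class counts at the example node: 229 samples times the printed
# probabilities (0.222707, 0.410480, 0.366812, 0) give 51, 94, 84 and 0.
# These counts are derived for illustration, not printed by tree().
counts <- c(51, 94, 84, 0)

# Multinomial deviance of a node: -2 * sum(n_k * log(p_k)).
# Empty classes are skipped because 0 * log(0) is taken as 0.
node_deviance <- function(n) {
  p <- n / sum(n)
  keep <- n > 0
  -2 * sum(n[keep] * log(p[keep]))
}

# Gini index of a node: 1 - sum(p_k^2)
node_gini <- function(n) {
  p <- n / sum(n)
  1 - sum(p^2)
}

round(node_deviance(counts), 1)  # 489.1, the deviance quoted for this node
round(node_gini(counts), 3)
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed, which is why either can serve as a splitting criterion.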

summary(fit)
#> Classification tree:
#> tree(formula = Class ~ ., data = trainset, split = "deviance")
#> Variables actually used in tree construction:
#>  [1] "Elong"        "Max.L.Ra"     "Comp"         "Pr.Axis.Ra"   "Sc.Var.maxis"
#>  [6] "Max.L.Rect"   "Scat.Ra"      "Circ"         "D.Circ"       "Skew.maxis"  
#> Number of terminal nodes:  16 
#> Residual mean deviance:  0.943 = 456 / 484 
#> Misclassification error rate: 0.252 = 126 / 500

Notice that summary(fit) shows:
1. the type of tree, in this case a classification tree;
2. the formula used to fit the tree;
3. the variables used to fit the tree;
4. the number of terminal nodes, in this case 16;
5. the residual mean deviance, 0.943;
6. the misclassification error rate, 0.252 or 25.2%.
The exact values vary from run to run because the training sample is drawn at random.
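
The two ratios in the printed summary can be checked by hand. This short sketch only re-uses the numbers shown in the output above; the variable names are hypothetical.

```r
n_train  <- 500   # rows in the training set
n_leaves <- 16    # terminal nodes reported in the printed summary
n_wrong  <- 126   # misclassified training rows reported in the printed summary

# The residual mean deviance divides the total deviance by
# (number of observations - number of terminal nodes)
n_train - n_leaves   # 484

# The misclassification error rate is the share of misclassified training rows
n_wrong / n_train    # 0.252
```

Because this error rate is measured on the training rows themselves, it is an optimistic estimate of how the tree performs on new data.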

plot(fit); text(fit)

25.4 Assess model

Unfortunately, classification trees tend to overfit the data. One way to reduce this risk is cross-validation: for each hold-out sample we fit the model on the remaining data and note at what tree size the hold-out results are best (by deviance or by misclassification rate), then we hold out a different sample and repeat. The cv.tree() function carries this out. Here it is called with K = 346 folds, first with the misclassification rate (FUN = prune.misclass) and then with the deviance (FUN = prune.tree). This is close to, but not exactly, leave-one-out cross-validation: the training set has 500 rows, so a true leave-one-out run would need K = 500.

fitM.cv <- cv.tree(fit, K=346, FUN = prune.misclass)
fitP.cv <- cv.tree(fit, K=346, FUN = prune.tree)

The results are plotted side by side in Figure 1.2. The jagged lines show where the minimum deviance and the minimum misclassification occurred with the cross-validated tree. Since the cross-validated misclassification and deviance both reach their minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

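par(mfrow = c(1, 2))
plot(fitM.cv)  # cross-validated misclassification against tree size
plot(fitP.cv)  # cross-validated deviance against tree size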
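
Besides reading the minimum off the plot, the best tree size can be picked programmatically. The sketch below assumes only that the cross-validation result is a list with size and dev components, which is the shape cv.tree() returns. The helper name best_size is hypothetical, and the toy values are made up for illustration; they are not results from the Vehicle data.

```r
# Return the smallest tree size that attains the minimum cross-validated
# deviance (or misclassification count).
best_size <- function(cv) {
  min(cv$size[cv$dev == min(cv$dev)])
}

# Made-up values for illustration only
toy_cv <- list(size = c(16, 12, 8, 4, 1),
               dev  = c(140, 140, 152, 210, 360))
best_size(toy_cv)  # 12: sizes 16 and 12 tie, so the smaller tree is chosen

# With the objects from this section one could call, for example:
# best_size(fitM.cv)
# pruned <- prune.misclass(fit, best = best_size(fitM.cv))
```

Preferring the smaller tree among ties is a design choice that favors the simpler model. If the selected size is close to the size of the original tree, as the text reports here, pruning changes little.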

25.5 Make predictions

We use the test set and the fitted decision tree to predict vehicle classes, then display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of about 32.7% on the test set.

testLabels <- Vehicle$Class[-train]
# Confusion Matrix
pred <- predict(fit, newdata = testset)
# for each row, pick the class (column) with the highest predicted probability
pred.class <- colnames(pred)[max.col(pred, ties.method = "random")]
cm <- table(testLabels, pred.class, 
      dnn = c("Observed Class", "Predicted Class"))
cm
#>               Predicted Class
#> Observed Class bus opel saab van
#>           bus   85    1    1   5
#>           opel   3   70   10   2
#>           saab   7   67   14   7
#>           van    1    4    5  64
# Accuracy (share of correctly classified test rows)
sum(diag(cm)) / sum(cm)
#> [1] 0.673
# pred <- predict(fit, newdata = Vehicle[-train,])
# pred.class <- colnames(pred)[max.col(pred, ties.method = c("random"))]
# table(Vehicle$Class[-train], pred.class, 
#       dnn = c("Observed Class", "Predicted Class"))
# compare predictions with the true test labels, not with the whole data frame
error_rate <- 1 - sum(pred.class == testLabels) / nrow(testset)
round(error_rate, 3)
#> [1] 0.327
# error_rate = (1 - sum(pred.class == Vehicle$Class[-train])/346)
# round(error_rate,3)
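
The overall accuracy hides large differences between classes. As a sketch, the confusion matrix printed above can be re-entered by hand to compute the per-class sensitivity (recall), that is, the share of each observed class that was predicted correctly. The object name cm_shown is hypothetical, and its values are copied from the output shown in this section.

```r
# Confusion matrix copied from the printed output (rows: observed, columns: predicted)
classes <- c("bus", "opel", "saab", "van")
cm_shown <- matrix(c(85,  1,  1,  5,
                      3, 70, 10,  2,
                      7, 67, 14,  7,
                      1,  4,  5, 64),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(Observed = classes, Predicted = classes))

# Overall accuracy: correctly classified rows over all rows
accuracy <- sum(diag(cm_shown)) / sum(cm_shown)
round(accuracy, 3)   # 0.673

# Per-class sensitivity: diagonal entry divided by the observed-class row total
recall <- diag(cm_shown) / rowSums(cm_shown)
round(recall, 3)     # bus 0.924, opel 0.824, saab 0.147, van 0.865
```

In this run most saab vehicles are predicted as opel, which accounts for the bulk of the test error.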