Introduction

Sometimes we build a classification tree not in order to make a prediction for an individual, but rather to identify “groups” of individuals that have an elevated probability of possessing a particular characteristic.

For example, in the case of Justin Verlander’s pitches we might like to identify which sorts of pitches are especially likely to be curve balls (the value “CU” of pitch_type).

First let’s set up the data frame, removing the season and gamedate variables that we won’t use as predictors:

ver2 <- verlander
ver2$season <- NULL
ver2$gamedate <- NULL

Now we divide the data into training and test sets:

dfs <- divideTrainTest(seed = 3030, prop.train = 0.5, data = ver2)
verTrain <- dfs$train
verTest <- dfs$test
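If divideTrainTest() is not available (it is assumed here to come from an add-on package rather than base R), a seeded random split can be sketched in base R. The data frame below is a toy stand-in, not ver2:

```r
# Base-R sketch of a reproducible 50/50 train/test split (toy data frame):
df <- data.frame(x = 1:10, y = letters[1:10])

set.seed(3030)                                   # make the split reproducible
trainRows <- sample(nrow(df), size = 0.5 * nrow(df))
dfTrain <- df[trainRows, ]                       # sampled rows form the training set
dfTest  <- df[-trainRows, ]                      # the rest form the test set

nrow(dfTrain)   # 5
nrow(dfTest)    # 5
```

Setting the seed before sampling plays the role of the seed argument above: rerunning the code reproduces the same split.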

Then we build a tree on the training set:

trMod <- tree(pitch_type ~ ., data = verTrain)
trMod
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 7653 22050.00 FF ( 0.1643800 0.1821508 0.4398275 0.1305370 0.0831047 )  
##    2) speed < 91.85 3331  7421.00 CU ( 0.3767637 0.4184929 0.0024017 0.0114080 0.1909337 )  
##      4) pfx_x < -2.465 1328   719.40 CH ( 0.9435241 0.0000000 0.0060241 0.0286145 0.0218373 ) *
##      5) pfx_x > -2.465 2003  2488.00 CU ( 0.0009985 0.6959561 0.0000000 0.0000000 0.3030454 )  
##       10) pfx_z < -2.995 1363   257.40 CU ( 0.0000000 0.9809244 0.0000000 0.0000000 0.0190756 ) *
##       11) pfx_z > -2.995 640   411.20 SL ( 0.0031250 0.0890625 0.0000000 0.0000000 0.9078125 )  
##         22) pfx_x < 4.715 553    52.95 SL ( 0.0036166 0.0036166 0.0000000 0.0000000 0.9927667 ) *
##         23) pfx_x > 4.715 87   114.50 CU ( 0.0000000 0.6321839 0.0000000 0.0000000 0.3678161 ) *
##    3) speed > 91.85 4322  4628.00 FF ( 0.0006941 0.0000000 0.7769551 0.2223508 0.0000000 )  
##      6) pfx_x < -8.365 968   370.10 FT ( 0.0000000 0.0000000 0.0475207 0.9524793 0.0000000 ) *
##      7) pfx_x > -8.365 3354   473.00 FF ( 0.0008945 0.0000000 0.9874776 0.0116279 0.0000000 ) *

We can tell that pitches at terminal nodes 10 and 23 are especially likely to be curve balls.

Distribution at the Nodes: Training Set

The distAtNodes() function gives us this information directly, in the form of a table:

trainTab <- distAtNodes(trMod, df = verTrain, resp_varname = "pitch_type")
trainTab
##     pitch_type
## node   CH   CU   FF   FT   SL
##   4  1253    0    8   38   29
##   6     0    0   46  922    0
##   7     3    0 3312   39    0
##   10    0 1337    0    0   26
##   22    2    2    0    0  549
##   23    0   55    0    0   32
rowPerc(trainTab)
##     pitch_type
## node     CH     CU     FF     FT     SL  Total
##   4   94.35   0.00   0.60   2.86   2.18 100.00
##   6    0.00   0.00   4.75  95.25   0.00 100.00
##   7    0.09   0.00  98.75   1.16   0.00 100.00
##   10   0.00  98.09   0.00   0.00   1.91 100.00
##   22   0.36   0.36   0.00   0.00  99.28 100.00
##   23   0.00  63.22   0.00   0.00  36.78 100.00
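Rather than scanning the percentages by eye, one can flag high-likelihood nodes programmatically. Since distAtNodes() returns a table, base R’s prop.table() gives the row percentages directly; the sketch below transcribes the training counts shown above into a matrix so that it runs on its own:

```r
# Counts at the terminal nodes, transcribed from trainTab above:
counts <- matrix(
  c(1253,    0,    8,   38,   29,
       0,    0,   46,  922,    0,
       3,    0, 3312,   39,    0,
       0, 1337,    0,    0,   26,
       2,    2,    0,    0,  549,
       0,   55,    0,    0,   32),
  nrow = 6, byrow = TRUE,
  dimnames = list(node = c("4", "6", "7", "10", "22", "23"),
                  pitch_type = c("CH", "CU", "FF", "FT", "SL"))
)

rowPercs <- prop.table(counts, margin = 1) * 100   # row percentages
highCU <- rownames(rowPercs)[rowPercs[, "CU"] > 50]
highCU   # "10" "23": nodes where most pitches are curve balls
```

The 50% cutoff is an arbitrary choice for illustration; in practice you would pick a threshold suited to your application.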

Distribution at the Nodes: Test Set

The model points to two nodes as “high-likelihood” nodes for curve balls. But a more reliable estimate of the probability of a curve ball at each of these nodes comes from applying the model to new data, not to the data that was used to build the model. Hence we also apply the distAtNodes() function to the model and the test set:

testTab <- distAtNodes(trMod, df = verTest, resp_varname = "pitch_type")
testTab
##     pitch_type
## node   CH   CU   FF   FT   SL
##   4  1287    0   13   28   22
##   6     0    0   47  943    0
##   7     0    0 3330   51    4
##   10    0 1248    0    0   30
##   22    5   13    0    0  545
##   23    0   61    0    0   27
rowPerc(testTab)
##     pitch_type
## node     CH     CU     FF     FT     SL  Total
##   4   95.33   0.00   0.96   2.07   1.63 100.00
##   6    0.00   0.00   4.75  95.25   0.00 100.00
##   7    0.00   0.00  98.38   1.51   0.12 100.00
##   10   0.00  97.65   0.00   0.00   2.35 100.00
##   22   0.89   2.31   0.00   0.00  96.80 100.00
##   23   0.00  69.32   0.00   0.00  30.68 100.00

Again nodes 10 and 23 stand out, with about the same probability of a curve ball at each node.

Cautions

Variability

When a node does not contain many individuals, estimates of the distribution of the response variable at that node are subject to a great deal of chance variation. Watch for this especially when you are building trees with many nodes, or when your training or test sets are not large to begin with.
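The effect of node size on chance variation can be illustrated with a quick simulation in base R. Suppose the true probability of a curve ball at a node is 0.65 (a made-up value, as are the node sizes below); estimates from a small node scatter far more widely than estimates from a large one:

```r
set.seed(1010)   # for reproducibility

p <- 0.65        # hypothetical true curve-ball probability at the node

# Simulate 1000 node-level estimates for a small node and a large node:
smallNode <- rbinom(1000, size = 30,   prob = p) / 30     # 30 pitches at the node
largeNode <- rbinom(1000, size = 1000, prob = p) / 1000   # 1000 pitches at the node

sd(smallNode)   # roughly sqrt(p * (1 - p) / 30),   about 0.09
sd(largeNode)   # roughly sqrt(p * (1 - p) / 1000), about 0.015
```

Both estimates are centered at the true probability, but the standard deviation shrinks with the square root of the node size, which is why small-node percentages should be read with caution.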

Missing Values

We are interested here only in the distribution of individuals at terminal nodes. However, when an observation is missing the value of a variable that the tree requires at some node, the tree stops at that node and makes its prediction based upon all observations that pass through it. In order to prevent non-terminal nodes from showing up in your analysis, you should remove all observations with missing values.

One way to accomplish this is with the complete.cases() function. Suppose, for example, that we wish to work with m111survey from the tigerstats package. It so happens that three of the rows in the data frame contain at least one missing value. Create a new copy that removes them:

s2 <- subset(m111survey, complete.cases(m111survey))
nrow(m111survey)
## [1] 71
nrow(s2)
## [1] 68

Sure enough, the three offending observations were removed.
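Base R’s na.omit() accomplishes the same thing in one step. The toy data frame below (a stand-in for m111survey, so the example runs on its own) shows that the two approaches agree:

```r
# A small data frame with a missing value in each of two rows:
d <- data.frame(height = c(70, NA, 64, 68),
                sex    = c("male", "female", NA, "female"))

d1 <- subset(d, complete.cases(d))   # the approach used above
d2 <- na.omit(d)                     # equivalent one-step alternative

nrow(d1)   # 2
nrow(d2)   # 2
```

na.omit() also records which rows it dropped in an "na.action" attribute, which can be handy when you want to report how many observations were removed.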