Sometimes we build a classification tree not in order to make a prediction for an individual, but rather to identify "groups" of individuals that have an elevated probability of possessing a particular characteristic.
For example, suppose that in the case of Justin Verlander's pitches we would like to identify which sorts of pitches are especially likely to be curve balls (value "CU").
First let’s set up the data frame:
ver2 <- verlander
ver2$season <- NULL
ver2$gamedate <- NULL
Now we divide the data into training and test sets:
dfs <- divideTrainTest(seed = 3030, prop.train = 0.5, data = ver2)
verTrain <- dfs$train
verTest <- dfs$test
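(For readers curious about what this step amounts to: an equivalent split can be sketched in base R. This is only an illustration on made-up data, not the actual implementation of divideTrainTest().)

```r
# A base-R sketch of a 50/50 train/test split, on a toy data frame:
set.seed(3030)
df <- data.frame(x = rnorm(100), y = rnorm(100))
trainRows <- sample(seq_len(nrow(df)), size = floor(0.5 * nrow(df)))
dfTrain <- df[trainRows, ]   # rows chosen for training
dfTest  <- df[-trainRows, ]  # everything else goes to the test set
nrow(dfTrain)  # 50
nrow(dfTest)   # 50
```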
Then we build a tree on the training set:
trMod <- tree(pitch_type ~ ., data = verTrain)
trMod
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 7653 22050.00 FF ( 0.1643800 0.1821508 0.4398275 0.1305370 0.0831047 )
## 2) speed < 91.85 3331 7421.00 CU ( 0.3767637 0.4184929 0.0024017 0.0114080 0.1909337 )
## 4) pfx_x < -2.465 1328 719.40 CH ( 0.9435241 0.0000000 0.0060241 0.0286145 0.0218373 ) *
## 5) pfx_x > -2.465 2003 2488.00 CU ( 0.0009985 0.6959561 0.0000000 0.0000000 0.3030454 )
## 10) pfx_z < -2.995 1363 257.40 CU ( 0.0000000 0.9809244 0.0000000 0.0000000 0.0190756 ) *
## 11) pfx_z > -2.995 640 411.20 SL ( 0.0031250 0.0890625 0.0000000 0.0000000 0.9078125 )
## 22) pfx_x < 4.715 553 52.95 SL ( 0.0036166 0.0036166 0.0000000 0.0000000 0.9927667 ) *
## 23) pfx_x > 4.715 87 114.50 CU ( 0.0000000 0.6321839 0.0000000 0.0000000 0.3678161 ) *
## 3) speed > 91.85 4322 4628.00 FF ( 0.0006941 0.0000000 0.7769551 0.2223508 0.0000000 )
## 6) pfx_x < -8.365 968 370.10 FT ( 0.0000000 0.0000000 0.0475207 0.9524793 0.0000000 ) *
## 7) pfx_x > -8.365 3354 473.00 FF ( 0.0008945 0.0000000 0.9874776 0.0116279 0.0000000 ) *
We can tell that pitches at terminal nodes 10 and 23 are especially likely to be curve balls.
The distAtNodes()
function gives us this information directly, in the form of a table:
trainTab <- distAtNodes(trMod, df = verTrain, resp_varname = "pitch_type")
trainTab
## pitch_type
## node CH CU FF FT SL
## 4 1253 0 8 38 29
## 6 0 0 46 922 0
## 7 3 0 3312 39 0
## 10 0 1337 0 0 26
## 22 2 2 0 0 549
## 23 0 55 0 0 32
rowPerc(trainTab)
## pitch_type
## node CH CU FF FT SL Total
## 4 94.35 0.00 0.60 2.86 2.18 100.00
## 6 0.00 0.00 4.75 95.25 0.00 100.00
## 7 0.09 0.00 98.75 1.16 0.00 100.00
## 10 0.00 98.09 0.00 0.00 1.91 100.00
## 22 0.36 0.36 0.00 0.00 99.28 100.00
## 23 0.00 63.22 0.00 0.00 36.78 100.00
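Rather than scanning the table by eye, we can pick out the high-curve-ball nodes programmatically. Rebuilding the training-set counts from the table above, base R's prop.table() plays the role of rowPerc() (the 50% threshold below is just one reasonable choice):

```r
# Counts at the terminal nodes, copied from trainTab above:
counts <- matrix(c(1253,    0,    8,  38,  29,
                      0,    0,   46, 922,   0,
                      3,    0, 3312,  39,   0,
                      0, 1337,    0,   0,  26,
                      2,    2,    0,   0, 549,
                      0,   55,    0,   0,  32),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(node = c("4", "6", "7", "10", "22", "23"),
                                 pitch_type = c("CH", "CU", "FF", "FT", "SL")))
# Row percentages, as in rowPerc():
percs <- prop.table(counts, margin = 1) * 100
# Nodes where the majority of pitches are curve balls:
rownames(percs)[percs[, "CU"] > 50]  # "10" "23"
```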
The model points to two "high-likelihood" nodes for curve balls. But a more reliable estimate of the probability of a curve ball at each of these nodes is provided by applying the model to new data, not the data that was used to build the model itself. Hence we also use the distAtNodes()
function on the model and the test set:
testTab <- distAtNodes(trMod, df = verTest, resp_varname = "pitch_type")
testTab
## pitch_type
## node CH CU FF FT SL
## 4 1287 0 13 28 22
## 6 0 0 47 943 0
## 7 0 0 3330 51 4
## 10 0 1248 0 0 30
## 22 5 13 0 0 545
## 23 0 61 0 0 27
rowPerc(testTab)
## pitch_type
## node CH CU FF FT SL Total
## 4 95.33 0.00 0.96 2.07 1.63 100.00
## 6 0.00 0.00 4.75 95.25 0.00 100.00
## 7 0.00 0.00 98.38 1.51 0.12 100.00
## 10 0.00 97.65 0.00 0.00 2.35 100.00
## 22 0.89 2.31 0.00 0.00 96.80 100.00
## 23 0.00 69.32 0.00 0.00 30.68 100.00
Again nodes 10 and 23 stand out—and with about the same probability of curve-ball at each node.
When a node does not contain a large number of individuals, estimates of the distribution of the response variable at that node are subject to a great deal of chance variation. Watch for this especially when you are building trees with many nodes, or when your training or test sets are not large to begin with.
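The usual standard-error approximation for a proportion, sqrt(p(1 - p)/n), makes the point concrete. Using the curve-ball proportions and node sizes from the training-set tree above:

```r
# Node 10 is large (n = 1363); node 23 is small (n = 87):
p10 <- 0.9809; n10 <- 1363
p23 <- 0.6322; n23 <- 87
se10 <- sqrt(p10 * (1 - p10) / n10)
se23 <- sqrt(p23 * (1 - p23) / n23)
round(c(node10 = se10, node23 = se23), 3)  # 0.004 vs. 0.052
```

The estimated curve-ball probability at node 23 is roughly thirteen times less precise than the one at node 10, which is why its percentage swings from 63.22 on the training set to 69.32 on the test set.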
We are interested here in the distribution of individuals at terminal nodes only. However, when an observation is missing the value of a variable that the tree requires at some node, the tree will stop at that node and make its prediction based upon all observations that pass through that node. To prevent non-terminal nodes from showing up in your analysis, you should remove all observations with missing values.
One way to accomplish this is with the complete.cases()
function. Suppose, for example, that we wish to work with m111survey
from the tigerstats
package. It so happens that three of the rows in the data frame contain at least one missing value. Create a new copy that removes them:
s2 <- m111survey[complete.cases(m111survey), ]
nrow(m111survey)
## [1] 71
nrow(s2)
## [1] 68
Sure enough, the three offending observations were removed.
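Base R's na.omit() accomplishes the same thing in a single step. (The illustration below uses a small made-up data frame, since m111survey lives in the tigerstats package.)

```r
# A toy data frame with missing values in rows 2 and 3:
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
toy2 <- na.omit(toy)  # keeps only the rows with no NAs
nrow(toy2)  # 1
```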