favstats() comes from the mosaic package, and we will use some data from the tigerstats package, so make sure that both are loaded:
require(mosaic)
require(tigerstats)
In this tutorial we will work with the m111survey data frame from the tigerstats package. If you are not yet familiar with this data, then run:
data(m111survey)
View(m111survey)
help(m111survey)
Remember that you can also learn the type of each variable in the data frame with the str() function:
str(m111survey)
## 'data.frame': 71 obs. of 12 variables:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
## $ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
## $ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
## $ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
## $ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
## $ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
## $ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
## $ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
## $ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
## $ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
“favstats” is short for “favorite statistics”: it will give you some of the most popular summary statistics for numerical variables.
Suppose, for example, that you want to know how fast people in the m111survey sample tend to drive, when they drive their fastest. Then you want to study the numerical variable fastest: the fastest speed each person claims to have ever driven, measured in miles per hour. Just try favstats() with the usual formula-data input:
favstats(~fastest, data = m111survey)
## min Q1 median Q3 max mean sd n missing
## 60 90.5 102 119.5 190 105.9014 20.8773 71 0
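Remember what each of the statistics tells you: min and max are the smallest and largest values; Q1, median, and Q3 are the first quartile, the median, and the third quartile; mean is the arithmetic average; and sd is the standard deviation.
We also see that 71 students reported a fastest speed (n) and that none of the values is missing (missing).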
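As a rough cross-check, the columns that favstats() reports can be reproduced with base R functions. The sketch below uses a small made-up vector speeds (hypothetical, not taken from m111survey), and it assumes that favstats() computes quartiles the way R's quantile() does by default.

```r
# Hypothetical fastest speeds (mph); NA stands for a missing answer.
speeds <- c(85, 90, 100, 110, 120, NA)

observed <- speeds[!is.na(speeds)]      # drop missing values first
five_num <- quantile(observed)          # min, Q1, median, Q3, max

summary_stats <- c(
  min     = unname(five_num["0%"]),
  Q1      = unname(five_num["25%"]),
  median  = unname(five_num["50%"]),
  Q3      = unname(five_num["75%"]),
  max     = unname(five_num["100%"]),
  mean    = mean(observed),
  sd      = sd(observed),
  n       = length(observed),           # number of non-missing values
  missing = sum(is.na(speeds))          # number of missing values
)
print(summary_stats)
```

Note that n counts only the non-missing values, so n plus missing equals the number of rows that were examined.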
Studying the relationship between a numerical variable and a factor variable involves what is popularly known as “breaking the data into groups” based on the values of the factor variable. More formally, we obtain the conditional distributions of the numerical variable given the various possible values of the factor variable, and look for differences between these distributions. If we see large differences, then we know that the factor variable “makes a difference” in the likely values of the numerical variable, i.e., the two variables are related.
For example, we might want to know whether the fastest speed one drives is related to one’s sex. The relevant variables in m111survey are then the numerical variable fastest and the factor variable sex.
In formula-data input for favstats(), when you break a numerical variable down by a factor variable, the formula follows the format:
\[\text{numerical} \sim \text{factor}.\]
So we run the following command:
favstats(fastest~sex, data = m111survey)
## sex min Q1 median Q3 max mean sd n missing
## 1 female 60 90 95 110.0 145 100.0500 17.60966 40 0
## 2 male 85 99 110 122.5 190 113.4516 22.56818 31 0
The first row of the output gives a summary of the conditional distribution of fastest, given that sex is female.
The second row summarizes the conditional distribution of fastest, given that sex is male.
The two conditional distributions are not the same. For example, we see that the mean fastest speed for the females was about 100 mph, whereas for the guys it was about 113.5 mph. The guys appear to drive faster than the gals: for this sample of students, fastest speed driven does indeed appear to be related to sex.
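To see what “breaking the data into groups” amounts to, here is a minimal base R sketch on a made-up data frame (the name toy and its values are hypothetical). It splits the numerical variable by the levels of the factor and summarizes each piece, which is roughly what a formula of the form numerical ~ factor asks favstats() to do.

```r
# Hypothetical mini-survey: not the real m111survey data.
toy <- data.frame(
  fastest = c(90, 100, 110, 95, 120, 130),
  sex     = factor(c("female", "female", "female", "male", "male", "male"))
)

# One vector of speeds per level of the factor.
groups <- split(toy$fastest, toy$sex)

# Summarize each conditional distribution.
group_means <- sapply(groups, mean)
group_sds   <- sapply(groups, sd)

print(group_means)
print(group_sds)

# Difference between the group means (male minus female).
mean_gap <- group_means[["male"]] - group_means[["female"]]
print(mean_gap)
```

A large gap between the group means, relative to the spread within each group, is the kind of difference that suggests the two variables are related.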
Sometimes you want just a few of the numbers from favstats(). If you would like to display only those numbers, you can do so using brackets “[” and “]”, along with a vector of the names of the columns you want to see. For example, to display only the mean and the standard deviation for fastest, ask for:
favstats(~fastest, data=m111survey)[c("mean","sd")]
## mean sd
## 105.9014 20.8773
The brackets are R’s way of locating particular parts of an object. If you want to display more than one column, make sure to combine their names (in quotes) into a vector with the c() function, as shown above.
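The output of favstats() appears to behave like an ordinary data frame, so the usual data-frame indexing applies. The sketch below uses a hand-built one-row data frame fav as a stand-in (hypothetical; its numbers are copied by hand from the favstats() output for fastest shown earlier) to contrast single brackets, which keep a data frame, with double brackets or $, which pull out the bare number.

```r
# Stand-in for the favstats() output shown earlier (values copied by hand).
fav <- data.frame(min = 60, Q1 = 90.5, median = 102, Q3 = 119.5, max = 190,
                  mean = 105.9014, sd = 20.8773, n = 71, missing = 0)

# Single brackets with a character vector: still a data frame, fewer columns.
subset_cols <- fav[c("mean", "sd")]
print(subset_cols)

# Double brackets (or $) extract one column as a plain number.
mean_value <- fav[["mean"]]
print(mean_value)

# Handy for further arithmetic, e.g. the mean plus or minus one sd.
one_sd_range <- c(lower = fav$mean - fav$sd, upper = fav$mean + fav$sd)
print(one_sd_range)
```

Use single brackets when you only want a shorter display, and double brackets or $ when you need the value itself for a later calculation.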
When you are breaking a numerical variable into groups, you will probably also want to see the group names: this requires the addition of the column named for the factor variable in question. Therefore, to see just the mean and the standard deviation for fastest broken down by sex, ask for:
favstats(fastest~sex, data = m111survey)[c("sex","mean","sd")]
## sex mean sd
## 1 female 100.0500 17.60966
## 2 male 113.4516 22.56818
favstats() specializes in numerical variables: it does not like being used to study a factor variable by itself. Look at what happens if you try to use it to study the factor variable sex:
favstats(~sex, data = m111survey)
## Warning in FUN(eval(formula[[2]], data, .envir), ...): Auto-converting
## factor to numeric.
## min Q1 median Q3 max mean sd n missing
## 1 1 1 2 2 1.43662 0.4994967 71 0
favstats() converted the factor variable to a numeric variable and tried to give you some useful information based on that conversion, but elements of the output (especially the mean and the standard deviation) are meaningless in the context of factor variables. To study a factor variable, you should instead use functions like xtabs().
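A short base R sketch, run on a made-up factor (the data frame toy and its values are hypothetical), shows both why the converted numbers are hard to interpret and what xtabs() offers instead. R stores a factor as integer codes, one per level, so the “mean” is only an average of those codes.

```r
# Hypothetical factor; levels sort alphabetically: female = 1, male = 2.
toy <- data.frame(sex = factor(c("female", "male", "female", "male", "male")))

# The integer codes behind the factor.
codes <- as.numeric(toy$sex)
print(codes)

# Averaging the codes gives 1 plus the proportion in the second level.
code_mean <- mean(codes)
print(code_mean)

# A frequency table is the natural summary of a factor.
counts <- xtabs(~ sex, data = toy)
print(counts)

# Proportions, if you prefer them to counts.
print(prop.table(counts))
```

The same reasoning explains the output above: with 31 males among 71 students, the reported mean of 1.43662 is 1 + 31/71, that is, 1 plus the proportion of males in the sample.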
You can incorporate additional factor variables into your analysis. For example, suppose you want to break down the students’ fastest speed not only by sex but also by where they prefer to sit in a classroom. This is accomplished as follows:
favstats(fastest ~ sex + seat, data = m111survey)
## sex.seat min Q1 median Q3 max mean sd n missing
## 1 female.1_front 60 87.5 100.0 116.00 130 99.63158 20.276452 19 0
## 2 male.1_front 85 97.5 112.5 119.00 130 108.50000 15.556349 8 0
## 3 female.2_middle 80 90.0 95.0 101.00 110 94.93750 8.290306 16 0
## 4 male.2_middle 85 100.0 109.0 121.25 143 111.00000 16.737184 16 0
## 5 female.3_back 90 110.0 120.0 125.00 145 118.00000 20.186629 5 0
## 6 male.3_back 95 96.5 110.0 142.50 190 124.71429 36.976183 7 0
The two “breakdown” variables after the “~” are separated by a “+”. Once again, you may want to limit your output:
favstats(fastest ~ sex + seat, data = m111survey)[c("sex.seat", "mean","sd","n")]
## sex.seat mean sd n
## 1 female.1_front 99.63158 20.276452 19
## 2 male.1_front 108.50000 15.556349 8
## 3 female.2_middle 94.93750 8.290306 16
## 4 male.2_middle 111.00000 16.737184 16
## 5 female.3_back 118.00000 20.186629 5
## 6 male.3_back 124.71429 36.976183 7
You can break down by more than two factor variables, but the resulting tables can be rather messy.
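When the combined labels such as female.1_front get unwieldy, one possible alternative is base R's aggregate(), which keeps each factor in its own column. This is only a sketch of a different tool, not something favstats() does for you, and the data frame toy below is made up for illustration.

```r
# Hypothetical data: two factors and one numerical variable.
toy <- data.frame(
  fastest = c(90, 100, 110, 120, 95, 105, 115, 125),
  sex     = factor(rep(c("female", "male"), times = 4)),
  seat    = factor(rep(c("1_front", "3_back"), each = 4))
)

# Mean fastest speed for every sex-by-seat combination,
# with one column per factor instead of a combined label.
cell_means <- aggregate(fastest ~ sex + seat, data = toy, FUN = mean)
print(cell_means)
```

Adding a third factor on the right-hand side of the formula simply adds another column of labels, which can be easier to scan than a long combined name.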