Learning Objectives Use R to numerically and visually summarize:
As with last week, you will need to both create an R Markdown lab report and submit your answers to the five marked quiz questions to the corresponding Canvas Quiz. You may download the template for the lab report from Canvas.
The Centers for Disease Control and Prevention (CDC) conducts the Behavioral Risk Factor Surveillance System (BRFSS), an annual telephone survey of 350,000 people in the United States designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The CDC website contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
We will read in the dataset from the website of Dr. Yibi Huang at the University of Chicago, who developed the original version of this lab assignment. read.table() is one of many ways to read in a dataset. While you are at it, don’t forget to load our usual packages. These lines of code must be included at the top of your R Markdown document in order for your document to knit.
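# Load the tidyverse packages (dplyr, ggplot2, ...) used throughout the lab
library(tidyverse)
# Read the whitespace-delimited file; header=TRUE takes column names from the first row
cdc <- read.table("http://www.stat.uchicago.edu/~yibi/s220/labs/data/cdc.dat",
                  header=TRUE)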
To make sure that you have loaded the data correctly, try opening it in the data viewer with View(cdc). Remember that you may also view the variable names with names(cdc), and you can take a peek at the different variables with glimpse(cdc).
Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (y) or did not (n). hlthplan indicates whether the respondent had some form of health coverage (y) or did not (n). smoke100 indicates whether the respondent had smoked at least 100 cigarettes in their lifetime. The other variables record the respondent’s height in inches, weight in pounds, desired weight (wtdesire), age in years (age), and gender (gender).
Consider the variable genhlth in the CDC dataset, which takes on values excellent, very good, good, fair, and poor. Since this is a categorical variable, we cannot summarize it with the tools we practiced last week. We cannot compute the mean or make a histogram. This week, we will explore tools in R for summarizing a categorical variable through tables and bar charts.
The simplest type of table in R is a frequency table for a single categorical variable. All we need to do is select() our variable of interest and pipe (%>%) this variable into the function table(). Try it out with the command below:
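As a minimal sketch of the select-then-table() recipe described above: the toy data frame and its values are made up so that the block runs on its own, and in the lab you would use cdc in its place.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  genhlth = c("good", "excellent", "fair", "good", "very good", "poor", "good")
)

# Select the single categorical variable and pipe it into table()
# to count how many rows fall in each category.
freq <- toy %>% select(genhlth) %>% table()
freq
```

The categories come out in alphabetical order, because the column is plain text at this point.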
Note that R has ordered the categories of genhlth alphabetically. genhlth is an ordinal variable, and our table will be more readable if we put the categories in the natural order of excellent > very good > good > fair > poor. We need to tell R that the categories of genhlth should be treated as ordered. We can accomplish this with the following command:
cdc <- cdc %>% mutate(genhlth = ordered(genhlth,
                                        levels=c("poor", "fair", "good",
                                                 "very good", "excellent")))
We have overwritten the original genhlth column of the dataset with a new, ordered variable. When we remake the table, the order is more natural:
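A small sketch of the effect of ordering, using a made-up toy data frame rather than cdc: once the column is an ordered factor, table() follows the level order instead of the alphabet.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  genhlth = c("good", "excellent", "fair", "good", "very good", "poor", "good")
)

# Declare the natural ordering of the categories, from worst to best.
toy <- toy %>%
  mutate(genhlth = ordered(genhlth,
                           levels = c("poor", "fair", "good",
                                      "very good", "excellent")))

# The frequency table now lists the categories in level order.
freq <- toy %>% select(genhlth) %>% table()
freq
```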
Suppose that we would rather see relative frequencies (percentages). We can pipe (%>%) the table above into a new function, prop.table(), that divides all entries in the table by the total number of observations in the table. Try running the following command:
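One way the piping into prop.table() could look, sketched on a made-up toy data frame of ten rows so the proportions are easy to check by hand; in the lab, cdc takes the place of toy.

```r
library(tidyverse)

# Hypothetical stand-in for cdc with 10 rows.
toy <- data.frame(
  genhlth = c("good", "good", "good", "good", "excellent", "excellent",
              "very good", "very good", "fair", "poor")
)

# Counts first, then divide every count by the total number of rows.
props <- toy %>% select(genhlth) %>% table() %>% prop.table()
props
```

Here 4 of the 10 toy rows are good, so that entry is 0.4, and all entries add up to 1.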
A nice visual display of a frequency table for one variable is a bar plot. Bar plots are available in the ggplot framework using the command geom_bar().
Create a bar plot of the genhlth variable using ggplot. If you need a hint, start with the code that you used to create histograms in Lab 1, then incorporate the following components:
data = cdc
aes(x=genhlth)
geom_bar()
Body Mass Index (BMI) is a weight-to-height ratio and can be calculated as: \[ \text{BMI} = \frac{\text{weight in pounds}}{(\text{height in inches})^2} \times 703 \]
703 is the approximate conversion factor that takes the ratio from imperial units (pounds and inches) to the metric units in which BMI is defined (kilograms and meters). BMI does not appear directly in the cdc dataset, but last week we learned the tools to add new variables to our dataset.
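To see where the 703 comes from, the sketch below rebuilds the factor from the exact unit definitions (1 pound = 0.45359237 kg, 1 inch = 0.0254 m) and checks it on a single hypothetical person (175 pounds, 70 inches). The variable names and the person are made up for illustration.

```r
# Exact unit definitions.
kg_per_lb <- 0.45359237
m_per_in  <- 0.0254

# Converting pounds / inches^2 into kg / m^2 multiplies by this factor.
conv <- kg_per_lb / m_per_in^2
conv  # roughly 703.07, which the formula rounds to 703

# Hypothetical person: 175 pounds, 70 inches tall.
bmi_imperial <- 175 / 70^2 * 703                      # formula from the lab
bmi_metric   <- (175 * kg_per_lb) / (70 * m_per_in)^2 # kg / m^2 directly
c(bmi_imperial, bmi_metric)
```

The two BMI values agree to two decimal places; the small gap comes from rounding the factor to 703.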
Add a new variable, BMI, to the cdc dataset. What is the mean BMI in the dataset? Round your answer to 1 decimal place. The following hints might help:
Last week, we used mutate() to add the avg_speed variable to the dataset using the formula distance/(air_time/60). Use the same method here to add BMI to the dataset using the formula given above.
Once you have added BMI to the dataset, you can use summarize() to find the mean BMI. For hints on how to do this, see Exercise 4 in Lab 1.
Suppose we want to know whether the distribution of the quantitative variable BMI is different among individuals with different levels of genhlth. In other words, we want to know if there is a relationship between genhlth, a categorical variable, and BMI, a quantitative variable. One great way to do this is to create side-by-side boxplots showing the distribution of BMI for each separate value of genhlth. Try out the code below:
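A minimal sketch of side-by-side boxplots in ggplot, assuming a categorical variable on the x-axis and a quantitative one on the y-axis. The toy data frame, its three health levels, and its BMI values are all made up; in the lab you would pass cdc (with your BMI column added) instead.

```r
library(tidyverse)

# Hypothetical stand-in for cdc with a ready-made BMI column.
toy <- data.frame(
  genhlth = ordered(rep(c("poor", "good", "excellent"), each = 4),
                    levels = c("poor", "good", "excellent")),
  BMI = c(31, 29, 35, 27, 26, 24, 28, 25, 22, 23, 21, 24)
)

# One boxplot of BMI per level of genhlth, drawn side by side.
p <- ggplot(data = toy, aes(x = genhlth, y = BMI)) +
  geom_boxplot()
p
```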
Now let’s consider the relationship between genhlth, which we explored above, and smoke100, which is a yes/no variable measuring whether an individual has smoked at least 100 cigarettes in their lifetime. We have already explored the univariate distribution of genhlth, but we should also take a quick look at the univariate distribution of smoke100.
To explore the relationship between two categorical variables, genhlth and smoke100, we should start with a contingency table. A contingency table is simply a table of two categorical variables. We can make one as follows by selecting two variables from our dataset.
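A sketch of a two-variable table built the same way as the one-variable table, using a made-up toy data frame in place of cdc. The first selected variable becomes the rows and the second becomes the columns.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  smoke100 = c("n", "y", "y", "n", "y", "n", "n", "y"),
  genhlth  = c("good", "poor", "good", "excellent",
               "fair", "good", "excellent", "good")
)

# Selecting two variables and piping into table() cross-tabulates them:
# rows are smoke100, columns are genhlth.
tab <- toy %>% select(smoke100, genhlth) %>% table()
tab
```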
You may notice that the contingency table above does not show row totals or column totals. We can take care of this by piping this table into the addmargins() function.
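As an illustration of addmargins(), again on a made-up toy data frame rather than cdc: the function appends a Sum row and a Sum column to the contingency table.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  smoke100 = c("n", "y", "y", "n", "y", "n", "n", "y"),
  genhlth  = c("good", "poor", "good", "excellent",
               "fair", "good", "excellent", "good")
)

# Cross-tabulate, then add row totals, column totals, and the grand total.
with_totals <- toy %>%
  select(smoke100, genhlth) %>%
  table() %>%
  addmargins()
with_totals
```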
This table is showing raw counts. It is often more useful to look at percentages, but now that we have two variables we need to be careful to specify which type of percentages we want. For example, consider the following table, which shows the joint percentages.
##         genhlth
## smoke100    poor    fair    good very good excellent
##        n 0.01145 0.04555 0.13910   0.18790   0.14395
##        y 0.02240 0.05540 0.14465   0.16070   0.08890
The prop.table() function took every entry of our original table and divided it by 20,000 (the number of cases in the entire dataset).
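A sketch of how joint proportions like these can be produced, assuming the contingency table is piped into prop.table() with no extra argument. The toy data frame is made up and has 8 rows, so every count is divided by 8.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  smoke100 = c("n", "y", "y", "n", "y", "n", "n", "y"),
  genhlth  = c("good", "poor", "good", "excellent",
               "fair", "good", "excellent", "good")
)

# With no margin argument, prop.table() divides each cell by the grand total.
joint <- toy %>% select(smoke100, genhlth) %>% table() %>% prop.table()
joint
```

Two of the eight toy rows are n and good, so that cell is 0.25, and all cells together sum to 1.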
A mosaic plot is a cool graphical summary of a two-way table. Run the command below to make a mosaic plot:
The areas of the boxes in a mosaic plot correspond to the proportion of total observations with that combination of variable values. For example, the box corresponding to n and poor in the upper left corner has an area that is 1.145% of the overall area in the plot. The color=TRUE argument simply adds shading to the plot for aesthetic purposes (try this command without color=TRUE and see what happens).
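One possible form of the mosaic plot command, assuming base R's mosaicplot() applied to a contingency table. The toy data frame and the plot title are made up for illustration, and cdc would replace toy in the lab.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  smoke100 = c("n", "y", "y", "n", "y", "n", "n", "y"),
  genhlth  = c("good", "poor", "good", "excellent",
               "fair", "good", "excellent", "good")
)

tab <- toy %>% select(smoke100, genhlth) %>% table()

# Each box's area is proportional to the share of rows in that cell;
# color = TRUE shades the boxes, main sets an (illustrative) title.
tab %>% mosaicplot(color = TRUE, main = "smoke100 by genhlth (toy data)")
```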
The box corresponding to excellent and y should have an area that is what proportion of the area of the entire plot? Round your answer to two decimal places.
Sometimes we are interested in the row percentages or column percentages in our table. Let’s examine the conditional distribution of genhlth given smoke100. If instead of using prop.table() we use prop.table(1), R will compute row percentages in our table. The 1 stands for “row”.
##         genhlth
## smoke100       poor       fair       good  very good  excellent
##        n 0.02168766 0.08627711 0.26347192 0.35590492 0.27265840
##        y 0.04745260 0.11736045 0.30642940 0.34043004 0.18832751
Among individuals who have not smoked at least 100 cigarettes, 27% report being in excellent health. Among individuals who have, only 19% report being in excellent health. If we instead use prop.table(2), we get the column percentages from our original table, which correspond to the conditional distribution of smoke100 given genhlth. The 2 stands for “column”.
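The difference between the two margins can be sketched on a made-up toy data frame (not cdc): with margin 1 each row sums to 1, and with margin 2 each column sums to 1.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  smoke100 = c("n", "y", "y", "n", "y", "n", "n", "y"),
  genhlth  = c("good", "poor", "good", "excellent",
               "fair", "good", "excellent", "good")
)

tab <- toy %>% select(smoke100, genhlth) %>% table()

# Row percentages: distribution of genhlth within each smoke100 group.
row_pct <- tab %>% prop.table(1)
# Column percentages: distribution of smoke100 within each genhlth level.
col_pct <- tab %>% prop.table(2)

row_pct
col_pct
```

In the toy data, half of the n group reports good health (row percentage 0.5), while every excellent respondent is in the n group (column percentage 1).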
Consider the table that you made using prop.table(2), which shows the conditional distribution of smoke100 given genhlth. In one sentence, interpret the number 0.6617.
Quiz Question: Among individuals who consider themselves to be in excellent health, what proportion have smoked at least 100 cigarettes? Round your answer to two decimal places.
A standardized bar plot is an easy way to visualize row or column percentages. Suppose that we want to visualize the distribution of smoke100 given each value of genhlth. We can create a bar plot where the x-axis is genhlth and the bars are filled (colored) using smoke100. Try out the code below:
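A minimal sketch of a standardized bar plot, assuming geom_bar() with position="fill" so that every bar is rescaled to height 1. The toy data frame is made up, and cdc would take its place in the lab.

```r
library(tidyverse)

# Hypothetical stand-in for cdc; replace `toy` with cdc in the lab.
toy <- data.frame(
  smoke100 = c("n", "y", "y", "n", "y", "n", "n", "y"),
  genhlth  = c("good", "poor", "good", "excellent",
               "fair", "good", "excellent", "good")
)

# One bar per genhlth level, split by smoke100; position = "fill"
# stretches each bar to the same height so the segments show proportions.
p <- ggplot(data = toy, aes(x = genhlth, fill = smoke100)) +
  geom_bar(position = "fill")
p
```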
Do smoking status and general health status appear to be associated? Justify your answer.
Remake your bar chart of genhlth and smoke100 but remove the position="fill" argument. Is this plot as useful as the previous plot?
This lab is a modified version of Yibi Huang’s Lab 3, developed for Stat 220 at the University of Chicago. Yibi Huang’s lab, in turn, is based on the OpenIntro lab framework.
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.