Lab 1: Exploring Numerical Data

Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times and speed. We will also learn the indispensable skills of data processing and subsetting.

A Note to the Student

The text included in lab 0, the tidyverse tutorial, and this lab is all meant to help you. Please read the lab text; do not just skip to the exercises.

The setup

What do you need to turn in?

We will be using R Markdown to create reproducible lab reports. For this lab, we have created a template for you to use. You will need to record all of your code and your answers to all exercises in this document, as you will be uploading your final lab report to Canvas. This report will be graded: we want to see that you wrote your own code and put some effort into the discussion questions. To reinforce the importance of reproducibility, the grader may execute some of your code to make sure that it runs.

You will be turning in the knit version of your R Markdown report. In order to knit your document correctly, you need to make sure that all code you use to produce your answers is included in the R Markdown file. This includes the code to load the libraries and the dataset. To avoid hard-to-catch errors with knitting, we recommend knitting your lab report each time you complete an exercise. Do not wait until you are all done with the assignment to try knitting for the first time. Knitting your document produces an HTML file (.html) in the same folder as your original .Rmd file, and this HTML file is what you will upload to Canvas. If you cannot find either file, try checking your downloads folder (since you downloaded the template from Canvas).

Separate from your lab report, you need to submit answers to the questions marked Quiz Question in the Canvas quiz associated with the lab. The Canvas quiz answers are graded for correctness.

Loading our packages

The first step in most of our labs will be to load the tidyverse. Since we already installed the packages during the tutorial, we do not need to re-install them. We just need to load them using the library() command.

library(tidyverse)

Importing the data

The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.

Begin by loading the nycflights data frame. Type the following command to load the data. Note that we highly recommend typing these commands, rather than copying and pasting them. This will ensure that the commands themselves are sinking in, and you will start to see how different commands relate to one another.

source("http://www.openintro.org/stat/data/nycflights.R")

The data set nycflights shows up in your workspace. R has stored this data set as a data frame. Each row represents an observation and each column represents a variable. To view the names of the variables, type the command

names(nycflights)

This returns the names of the variables in this data frame. Remember that you can use glimpse to take a quick peek at your data to understand its contents better.

glimpse(nycflights)

Quiz Question: What is the observational unit in this dataset?
- A Year
- An Airline
- A Flight
- A City

The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:

On average, how delayed were flights?
On average, how delayed were certain subsets of flights?
How do delays vary by airline or by month?
How does average speed of a flight vary with distance traveled?

The first two questions involve just one variable at a time (delay time), while the last two questions involve relationships between variables. Below, we will first explore univariate summaries of the data, and then we will dive into relationships between variables.

Univariate summaries

A first look at departure delays

Let’s start by examining the distribution of departure delays of all flights with a histogram.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

This command says to plot the dep_delay variable from the nycflights data frame on the x-axis, and to use a histogram as the type of plot.

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use:

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

As you can see, some features that are visible with the narrower bin width (15 minutes) are obscured when the bin width is too wide (150 minutes).
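The effect of bin width can also be seen in numbers. The sketch below is only an illustration: the data frame toy and the helper bin_counts() are hypothetical names, and the delay values are made up rather than taken from nycflights. It counts how many values land in each bin by rounding every delay down to the left edge of its bin. geom_histogram() may place its bin edges differently, so treat the counts as a rough picture of what a histogram does.

```r
library(dplyr)

# Hypothetical departure delays in minutes (made up, not real nycflights values)
toy <- tibble(dep_delay = c(-5, -3, 0, 2, 4, 20, 25, 130, 140))

# Hypothetical helper: count observations per bin by rounding each
# delay down to the left edge of the bin it falls in
bin_counts <- function(data, width) {
  data %>%
    mutate(bin_start = floor(dep_delay / width) * width) %>%
    count(bin_start)
}

narrow <- bin_counts(toy, 15)   # five bins: early, near on-time, and late flights separate
wide   <- bin_counts(toy, 150)  # two bins: almost every flight lands in the same bin

narrow
wide
```

With the wide bins, the two very late flights can no longer be told apart from the flights that left on time.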

Delays within groups

Suppose you want to visualize delays only of flights headed to Los Angeles. You need to first filter the data for flights with that destination (dest == "LAX") and then make a histogram of the departure delays of only those flights.

lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()

Let’s decipher these two commands. The first says to take the nycflights data frame, filter for flights with LAX as a destination, and save the result as a new data frame called lax_flights. The second command makes a histogram of departure delays in this new, smaller data frame.

Now that we have this smaller data frame, we can obtain numerical summaries for only the LA flights.

lax_flights %>%
  summarize(
    mean_dd     = mean(dep_delay),
    median_dd   = median(dep_delay),
    sample_size = n()
  )

Note that in the summarize function you created a list of three different numerical summaries that you were interested in. The names of these elements are user defined, like mean_dd, median_dd, sample_size, and you can customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the correct function names. Note that n() reports the sample size.
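As a small illustration of naming your own summaries, the sketch below applies a few other base R summary functions (sd, min, max) alongside mean, median, and n(). The data frame toy_flights and its values are made up for this example and are not part of nycflights.

```r
library(dplyr)

# Hypothetical departure delays in minutes (made up, not real nycflights values)
toy_flights <- tibble(dep_delay = c(-4, -2, 0, 3, 58))

toy_summary <- toy_flights %>%
  summarize(
    mean_dd     = mean(dep_delay),    # arithmetic average
    median_dd   = median(dep_delay),  # middle value
    sd_dd       = sd(dep_delay),      # standard deviation
    min_dd      = min(dep_delay),     # smallest delay
    max_dd      = max(dep_delay),     # largest delay
    sample_size = n()                 # number of rows
  )

toy_summary
```

The result is a data frame with one row and one column per summary, with the column names you chose on the left of each equals sign.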

You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February:

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

Note that you can separate the conditions using commas if you want flights that are both headed to SFO and in February. If you are interested in flights that are either headed to SFO or in February (or both), you can use the | operator instead of the comma.
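The difference between the comma and | is easiest to see on a tiny example. The sketch below uses a made-up data frame, toy_flights, with four hypothetical rows (not real nycflights data).

```r
library(dplyr)

# Hypothetical flights (made up, not real nycflights rows)
toy_flights <- tibble(
  dest  = c("SFO", "SFO", "LAX", "LAX"),
  month = c(2, 7, 2, 9)
)

# Comma: a row is kept only if BOTH conditions are true
both <- toy_flights %>%
  filter(dest == "SFO", month == 2)

# |: a row is kept if AT LEAST ONE condition is true
either <- toy_flights %>%
  filter(dest == "SFO" | month == 2)

nrow(both)    # 1
nrow(either)  # 3
```

Only the first row satisfies both conditions, while the first three rows each satisfy at least one of them.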

Quiz Question: How many flights are there in the dataset that were headed to SFO in February?
Consider the set of flights headed to SFO in February. Make a histogram of the distribution of the arrival delays (arr_delay) of these flights and describe the distribution.
Calculate the mean and median arrival delay time of the SFO February flights and compare the two numbers. Which is larger? Why does this make sense?

Exploring relationships between variables

Delays by month and by carrier

Another useful technique is quickly calculating summary statistics for various groups within your dataset. Above, we found the average arrival delay time of the SFO February flights. Suppose we want to know if the average arrival delay differs in different months of the year.

We can take our nycflights and first group_by month and then summarize the data within these groups. Finally, we can arrange the results in descending order to easily see which month has the most delayed arrivals.

nycflights %>%
  group_by(month) %>%
  summarize(median_ad = median(arr_delay), n_flights = n()) %>%
  arrange(desc(median_ad))
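To see what each step of this pipeline contributes, here is the same pattern on a made-up data frame. The name toy_flights and all of its values are hypothetical; the point is only to show the shape of the result.

```r
library(dplyr)

# Hypothetical arrival delays in minutes for two months (made up values)
toy_flights <- tibble(
  month     = c(1, 1, 1, 6, 6, 6),
  arr_delay = c(-10, -2, 5, 0, 12, 40)
)

by_month <- toy_flights %>%
  group_by(month) %>%                        # one group per month
  summarize(
    median_ad = median(arr_delay),           # one median per group
    n_flights = n()                          # rows in each group
  ) %>%
  arrange(desc(median_ad))                   # largest median first

by_month
```

The result has one row per month. Because of arrange(desc(median_ad)), the month with the larger median (month 6 in this toy data) appears in the first row.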

Quiz Question: Based only on the median, which month seems to have the longest typical arrival delays? A positive value means that the arrival was delayed; a negative value means that the plane arrived early. For the quiz, enter the number of the month (e.g., February = 2, March = 3).

The carrier (airline) variable is coded according to the following system.

carrier: Two letter carrier abbreviation.
- 9E: Endeavor Air Inc.
- AA: American Airlines Inc.
- AS: Alaska Airlines Inc.
- B6: JetBlue Airways
- DL: Delta Air Lines Inc.
- EV: ExpressJet Airlines Inc.
- F9: Frontier Airlines Inc.
- FL: AirTran Airways Corporation
- HA: Hawaiian Airlines Inc.
- MQ: Envoy Air
- OO: SkyWest Airlines Inc.
- UA: United Air Lines Inc.
- US: US Airways Inc.
- VX: Virgin America
- WN: Southwest Airlines Co.
- YV: Mesa Airlines Inc.

Quiz Question: Modify the code that you wrote above to determine which carrier tends to have the most delayed arrivals based on the mean. To answer this question, you should
- group_by carrier, then
- summarize mean arrival delays.
Enter your answer as the 2-letter airline code.

Average speed by distance

Recall from the lab tutorial that we can mutate a data frame to add new variables. Here, let’s mutate the nycflights data frame so that it contains the speed traveled by the plane for each flight (in mph). Speed is simply distance divided by air_time, although note that air_time is recorded in minutes so we first convert air_time to hours.

nycflights <- nycflights %>% mutate(speed = distance/(air_time/60))
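The unit conversion is worth checking by hand once. The sketch below uses a made-up data frame, toy_flights, with hypothetical distances and air times, and splits the calculation into two steps so the conversion from minutes to hours is visible.

```r
library(dplyr)

# Hypothetical flights (made up values, not real nycflights rows)
toy_flights <- tibble(
  distance = c(1000, 300),  # miles
  air_time = c(150, 45)     # minutes
)

toy_flights <- toy_flights %>%
  mutate(
    hours            = air_time / 60,        # minutes -> hours
    speed            = distance / hours,     # miles per hour
    miles_per_minute = distance / air_time   # what you get if you skip the conversion
  )

toy_flights
```

For the first row, 150 minutes is 2.5 hours, so the speed is 1000 / 2.5 = 400 mph. Dividing by air_time directly would give about 6.67, which is miles per minute rather than miles per hour.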

Quiz Question: Using group_by and summarize and arrange, which carrier seems to operate flights with the highest average speeds (use the mean)? Enter the two letter code.
Create a scatterplot showing the relationship between the distance of a flight and its speed. Describe the relationship between distance and speed in your own words. Here are some hints:
- Use ggplot() combined with the appropriate dataset and the command aes(x=distance, y=speed)
- Use geom_point() because this is a scatterplot.
Consider the airline that operates flights with the highest average speeds. Why do you think that this airline operates flights with the highest speeds? Think about the relationship that you described in exercise 8, and support your explanation with concrete numbers. Hint: which airline operates flights with the longest average distances?

Conclusion

This concludes lab 1. Be sure to submit your quiz question answers to Canvas.

To turn your R Markdown file into a lab report, click the Knit button at the top of the panel. As long as you have included all of your code correctly in R chunks, a nicely formatted HTML version of your report should pop up in a new window; this is what you will submit to Canvas. Remember that your R Markdown report should include your code and your answers to the discussion questions, all presented in an organized fashion (following the template is a good way to accomplish this).

The most common errors in knitting R Markdown files occur when your Markdown file uses a variable or dataset that is stored in your environment, but that is not actually created within the code chunks of your Markdown file. When knitting, your Markdown file cannot access things in your environment; it creates its own environment. So any variable used in the Markdown file must be created in the Markdown file. If you are having issues with knitting, please see a TA.

Acknowledgements

This lab is a modified version of Lab 2 (Introduction to Data) from the OpenIntro Tidyverse Labs. The OpenIntro labs can be found at this link: https://www.openintro.org/stat/labs.php.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.