ModernDive

D Learning Check Solutions

D.1 Chapter 2 Solutions

library(dplyr)
library(ggplot2)
library(nycflights13)

(LC2.1) What does any ONE row in this flights dataset refer to?

  • A. Data on an airline
  • B. Data on a flight
  • C. Data on an airport
  • D. Data on multiple flights

Solution: This is data on a flight. Not a flight path! Example:

  • a flight path would be United 1545 to Houston
  • a flight would be United 1545 to Houston at a specific date/time. For example: 2013/1/1 at 5:15am.

(LC2.2) What are some examples in this dataset of categorical variables? What makes them different than quantitative variables?

Solution: Hint: Type ?flights in the console to see what all the variables mean!

  • Categorical:
    • carrier the company
    • dest the destination
    • flight the flight number. Even though this is a number, its simply a label. Example United 1545 is not less than United 1714
  • Quantitative:
    • distance the distance in miles
    • time_hour time

(LC2.3) What does int, dbl, and chr mean in the output above?

Solution:

  • int: integer. Used to count things i.e. a discrete value. Ex: the # of cars parked in a lot
  • dbl: double. Used to measure things. i.e. a continuous value. Ex: your height in inches
  • chr: character. i.e. text

D.2 Chapter 3 Solutions

library(nycflights13)
library(ggplot2)
library(dplyr)

(LC3.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights) in the console. In what respect do these data frames differ?

Solution: flights contains all flight data, while alaska_flights contains only data from Alaskan carrier “AS”. We can see that flights has 336776 rows while alaska_flights has only 714

(LC3.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship?

Solution: The later a plane departs, typically the later it will arrive.

(LC3.3) What variables (not necessarily in the flights data frame) would you expect to have a negative correlation (i.e. a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here.

Solution: An example in the weather dataset is visibility, which measure visibility in miles. As visibility increases, we would expect departure delays to decrease.

(LC3.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?

Solution: The point (0,0) means no delay in departure nor arrival. From the point of view of Alaska airlines, this means the flight was on time. It seems most flights are at least close to being on time.

(LC3.5) What are some other features of the plot that stand out to you?

Solution: Different people will answer this one differently. One answer is most flights depart and arrive less than an hour late.

(LC3.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example above.

Solution: Many possibilities for this one, see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seems to be only two blocks of time where flights depart.

ggplot(data = alaska_flights, mapping = aes(x = dep_time, y = dep_delay)) +
  geom_point()

(LC3.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?

Solution: Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot? It thins out the points so we address overplotting. But more importantly it hints at the (statistical) density and distribution of the points: where are the points concentrated, where do they occur. We will see more about densities and distributions in Chapter 6 when we switch gears to statistical topics.

(LC3.8) After viewing the Figure 3.4 above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 3.2?

Solution: After viewing the Figure 3.4 above, give a range of arrival delays and departure delays that occur most frequently? How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 3.2? The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.

(LC3.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ?

Solution: Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ? The rows of early_january_weather are a subset of weather.

(LC3.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement whereas the hour variable does not?

Solution: View() the flights data frame again. Why does the time_hour variable correctly identify the hour of the measurement whereas the hour variable does not? Because to uniquely identify an hour, we need the year/month/day/hour sequence, whereas there are only 24 possible hour’s.

(LC3.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?

Solution: Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis? Because lines suggest connectedness and ordering.

(LC3.12) Why are linegraphs frequently used when time is the explanatory variable?

Solution: Why are linegraphs frequently used when time is the explanatory variable? Because time is sequential: subsequent observations are closely related to each other.

(LC3.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013.

Solution: Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013. Humidity is a good one to look at, since this very closely related to the cycles of a day.

ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = humid)) +
  geom_line()

(LC3.14) What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?

Solution: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the temp variabile by View(weather), we see that the precision of each temperature recording is 2 decimal places.

(LC3.15) Would you classify the distribution of temperatures as symmetric or skewed?

Solution: It is rather symmetric, i.e. there are no long tails on only one side of the distribution

(LC3.16) What would you guess is the “center” value in this distribution? Why did you make that choice?

Solution: The center is around 55.26°F. By running the summary() command, we see that the mean and median are very similar. In fact, when the distribution is symmetric the mean equals the median.

(LC3.17) Is this data spread out greatly from the center or is it close? Why?

Solution: This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures:

While, it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F for a range of about 40°F. Seattle temperatures are much less spread out than New York i.e. much more consistent over the year. New York on the other hand has much colder days in the winter and much hotter days in the summer. Expressed differently, the middle 50% of values, as delineated by the interquartile range is 30°F:

(LC3.18) What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?

Solution:

  • Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.
  • Because we see temp recordings split by month, we are considering the relationship between these two variables. For example, for example for summer months, temperatures tend to be higher.

(LC3.19) What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?

Solution:

  • While month is technically a number between 1-12, we’re viewing it as a categorical variable here. Specifically an ordinal categorical variable since there is a ordering to the categories
  • 25, 50, 75, 100 are temperatures

(LC3.20) For which types of data-sets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.

Solution:

  • We’d have 365 facets to look at. Way to many.
  • We don’t really care about day-to-day fluctuation in weather so much, but maybe more week-to-week variation. We’d like to focus on seasonal trends.

(LC3.21) Does the temp variable in the weather data-set have a lot of variability? Why do you say that?

Solution: Again, like in LC (LC3.17), this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain!

(LC3.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.

Solution: It appears to be an outlier. Let’s revisit the use of the filter command to hone in on it. We want all data points where the month is 5 and temp<25

weather %>% 
  filter(month==5 & temp < 25)
# A tibble: 1 x 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 JFK     2013     5     8    22  13.1  12.0  95.3       80       8.06        NA
# … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

There appears to be only one hour and only at JFK that recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (La Guardia)?

(LC3.23) Which months have the highest variability in temperature? What reasons do you think this is?

Solution: We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):

  • The distance from the 1st to the 3rd quartiles i.e. the length of the boxes
  • You can also think of this as the spread of the middle 50% of the data

Just from eyeballing it, it seems

  • November has the biggest IQR, i.e. the widest box, so has the most variation in temperature
  • August has the smallest IQR, i.e. the narrowest box, so is the most consistent temperature-wise

Here’s how we compute the exact IQR values for each month (we’ll see this more in depth Chapter 5 of the text):

  1. group the observations by month then
  2. for each group, i.e. month, summarize it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE then
  3. arrange the table in descending order of IQR
weather %>%
  group_by(month) %>%
  summarize(IQR = IQR(temp, na.rm=TRUE)) %>%
  arrange(desc(IQR))
month IQR
11 16.02
12 14.04
1 13.77
9 12.06
4 12.06
5 11.88
6 10.98
10 10.98
2 10.08
7 9.18
3 9.00
8 7.02

(LC3.24) We looked at the distribution of a numerical variable over a categorical variable here with this boxplot. Why can’t we look at the distribution of one numerical variable over the distribution of another numerical variable? Say, temperature across pressure, for example?

Solution: Because we need a way to group many numerical observations together, say by grouping by month. For pressure, we have near unique values for pressure, i.e. no groups, so we can’t make boxplots.

(LC3.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

Solution: In a histogram, the bin corresponding to where an outlier lies may not by high enough for us to see. In a boxplot, they are explicitly labelled separately.

(LC3.26) Why are histograms inappropriate for visualizing categorical variables?

Solution: Histograms are for numerical variables i.e. the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable.

(LC3.27) What is the difference between histograms and barplots?

Solution: See above.

(LC3.28) How many Envoy Air flights departed NYC in 2013?

Solution: Envoy Air is carrier code MQ and thus 26397 flights departed NYC in 2013.

(LC3.29) What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?

Solution: What a pain! We’ll see in Chapter 5 on Data Wrangling that applying arrange(desc(n)) will sort this table in descending order of n!

(LC3.30) Why should pie charts be avoided and replaced by barplots?

Solution: In our opinion, comparisons using horizontal lines are easier than comparing angles and areas of circles.

(LC3.31) What is your opinion as to why pie charts continue to be used?

Solution: Legacy?

(LC3.32) What kinds of questions are not easily answered by looking at the above figure?

Solution: Because the red, green, and blue bars don’t all start at 0 (only red does), it makes comparing counts hard.

(LC3.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?

Solution: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.}

(LC3.34) Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case?

Solution: We can easily compare the different aiports for a given carrier using a single comparison line i.e. things are lined up

(LC3.35) What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general?

Solution: It is hard to get totals for each airline.

(LC3.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?

Solution: Not that different than using side-by-side; depends on how you want to organize your presentation.

(LC3.37) What information about the different carriers at different airports is more easily seen in the faceted barplot?

Solution: Now we can also compare the different carriers within a particular airport easily too. For example, we can read off who the top carrier for each airport is easily using a single horizontal line.


D.3 Chapter 4 Solutions

library(dplyr)
library(ggplot2)
library(nycflights13)
library(tidyr)
library(readr)

(LC4.1) Consider the following data frame of average number of servings of beer, spirits, and wine consumption in three countries as reported in the FiveThirtyEight article Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?

# A tibble: 3 x 4
  country     beer_servings spirit_servings wine_servings
  <chr>               <int>           <int>         <int>
1 Canada                240             122           100
2 South Korea           140              16             9
3 USA                   249             158            84

This data frame is not in tidy format. What would it look like if it were?

Solution: There are three variables of information included: country, alcohol type, and number of servings. In tidy format, each of these variables of information are included in their own column.

# A tibble: 9 x 3
  country     `alcohol type` servings
  <chr>       <chr>             <int>
1 Canada      beer                240
2 Canada      spirit              122
3 Canada      wine                100
4 South Korea beer                140
5 South Korea spirit               16
6 South Korea wine                  9
7 USA         beer                249
8 USA         spirit              158
9 USA         wine                 84

Note that how the rows are sorted is inconsequential in whether or not the data frame is in tidy format. In other words, the following data frame sorted by alcohol type instead of country is equally in tidy format.

# A tibble: 9 x 3
  country     `alcohol type` servings
  <chr>       <chr>             <int>
1 Canada      beer                240
2 South Korea beer                140
3 USA         beer                249
4 Canada      spirit              122
5 South Korea spirit               16
6 USA         spirit              158
7 Canada      wine                100
8 South Korea wine                  9
9 USA         wine                 84

(LC4.2) What properties of the observational unit do each of lat, lon, alt, tz, dst, and tzone describe for the airports data frame? Note that you may want to use ?airports to get more information.

Solution: lat long represent the airport geographic coordinates, alt is the altitude above sea level of the airport (Run airports %>% filter(faa == "DEN") to see the altitude of Denver International Airport), tz is the time zone difference with respect to GMT in London UK, dst is the daylight savings time zone, and tzone is the time zone label.

(LC4.3) Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy dataset that matches these conditions.

Solution:

  • In the weather example in LC3.8, the combination of origin, year, month, day, hour are identification variables as they identify the observation in question.
  • Anything else pertains to observations: temp, humid, wind_speed, etc.

(LC4.4) Convert the dem_score data frame into a tidy data frame and assign the name of dem_score_tidy to the resulting long-formatted data frame.

Solution: Running the following in the console:

dem_score_tidy <- gather(data = dem_score, key = year, value = democracy_score, - country)

Let’s now compare the dem_score and dem_score_tidy. dem_score has democracy score information for each year in columns, whereas in dem_score_tidy there are explicit variables year and democracy_score. While both representations of the data contain the same information, we can only use ggplot() to create plots using the dem_score_tidy data frame.

dem_score
# A tibble: 96 x 10
   country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
   <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Albania        -9     -9     -9     -9     -9     -9     -9     -9      5
 2 Argentina      -9     -1     -1     -9     -9     -9     -8      8      7
 3 Armenia        -9     -7     -7     -7     -7     -7     -7     -7      7
 4 Australia      10     10     10     10     10     10     10     10     10
 5 Austria        10     10     10     10     10     10     10     10     10
 6 Azerbaijan     -9     -7     -7     -7     -7     -7     -7     -7      1
 7 Belarus        -9     -7     -7     -7     -7     -7     -7     -7      7
 8 Belgium        10     10     10     10     10     10     10     10     10
 9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
# … with 86 more rows
dem_score_tidy
# A tibble: 864 x 3
   country    year  democracy_score
   <chr>      <chr>           <dbl>
 1 Albania    1952               -9
 2 Argentina  1952               -9
 3 Armenia    1952               -9
 4 Australia  1952               10
 5 Austria    1952               10
 6 Azerbaijan 1952               -9
 7 Belarus    1952               -9
 8 Belgium    1952               10
 9 Bhutan     1952              -10
10 Bolivia    1952               -4
# … with 854 more rows

(LC4.5) Read in the life expectancy data stored at https://moderndive.com/data/le_mess.csv and convert it to a tidy data frame.

Solution: The code is similar

life_expectancy <- read_csv('https://moderndive.com/data/le_mess.csv')
life_expectancy_tidy <- gather(data = life_expectancy, key = year, value = life_expectancy, -country)

We observe the same construct structure with respect to year in life_expectancy vs life_expectancy_tidy as we did in dem_score vs dem_score_tidy:

life_expectancy
# A tibble: 202 x 67
   country `1951` `1952` `1953` `1954` `1955` `1956` `1957` `1958` `1959` `1960`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan…   27.1   27.7   28.2   28.7   29.3   29.8   30.3   30.9   31.4   31.9
 2 Albania   54.7   55.2   55.8   56.6   57.4   58.4   59.5   60.6   61.8   62.9
 3 Algeria   43.0   43.5   44.0   44.4   44.9   45.4   45.9   46.4   47.0   47.5
 4 Angola    31.0   31.6   32.1   32.7   33.2   33.8   34.3   34.9   35.4   36.0
 5 Antigu…   58.3   58.8   59.3   59.9   60.4   60.9   61.4   62.0   62.5   63.0
 6 Argent…   61.9   62.5   63.1   63.6   64.0   64.4   64.7   65     65.2   65.4
 7 Armenia   62.7   63.1   63.6   64.1   64.5   65     65.4   65.9   66.4   66.9
 8 Aruba     59.0   60.0   61.0   61.9   62.7   63.4   64.1   64.7   65.2   65.7
 9 Austra…   68.7   69.1   69.7   69.8   70.2   70.0   70.3   70.9   70.4   70.9
10 Austria   65.2   66.8   67.3   67.3   67.6   67.7   67.5   68.5   68.4   68.8
# … with 192 more rows, and 56 more variables: `1961` <dbl>, `1962` <dbl>,
#   `1963` <dbl>, `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>,
#   `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
#   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
#   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
#   `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
#   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
#   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>
life_expectancy_tidy
# A tibble: 13,332 x 3
   country             year  life_expectancy
   <chr>               <chr>           <dbl>
 1 Afghanistan         1951             27.1
 2 Albania             1951             54.7
 3 Algeria             1951             43.0
 4 Angola              1951             31.0
 5 Antigua and Barbuda 1951             58.3
 6 Argentina           1951             61.9
 7 Armenia             1951             62.7
 8 Aruba               1951             59.0
 9 Australia           1951             68.7
10 Austria             1951             65.2
# … with 13,322 more rows

(LC4.6) What are common characteristics of “tidy” datasets?

Solution: Rows correspond to observations, while columns correspond to variables.

(LC4.7) What makes “tidy” datasets useful for organizing data?

Solution: Tidy datasets are an organized way of viewing data. We’ll see later that this format is required for the ggplot2 and dplyr packages for data visualization and wrangling.

(LC4.8) What are some advantages of data in normal forms? What are some disadvantages?

Solution: When datasets are in normal form, we can easily _join them with other datasets! For example, can we join the flights data with the planes data? We’ll see this more in Chapter 5!


D.4 Chapter 5 Solutions

library(dplyr)
library(ggplot2)
library(nycflights13)

(LC5.1) What’s another way using the “not” operator ! we could filter only the rows that are not going to Burlington, VT nor Seattle, WA in the flights data frame? Test this out using the code above.

Solution:

# Original in book
not_BTV_SEA <- flights %>% 
  filter(!(dest == "BTV" | dest == "SEA"))

# Alternative way
not_BTV_SEA <- flights %>% 
  filter(!dest == "BTV" & !dest == "SEA")

# Yet another way
not_BTV_SEA <- flights %>% 
  filter(dest != "BTV" & dest != "SEA")

(LC5.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?

Solution: The missing patients may have died of lung cancer! So to ignore them might seriously bias your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself:

  • There is a systematic reasons why certain values are missing? If so, you might be biasing your results!
  • If there isn’t, then it might be ok to “sweep missing values under the rug.”

(LC5.3) Modify the above summarize function to create summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to?

Solution: It corresponds to a count of the number of observations/rows:

weather %>% 
  summarize(count = n())
# A tibble: 1 x 1
  count
  <int>
1 26115

(LC5.4) Why doesn’t the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE)) first.

summary_temp <- weather %>%   
  summarize(mean = mean(temp, na.rm = TRUE)) %>% 
  summarize(std_dev = sd(temp, na.rm = TRUE))

Solution: Consider the output of only running the first two lines:

weather %>%   
  summarize(mean = mean(temp, na.rm = TRUE))
# A tibble: 1 x 1
   mean
  <dbl>
1  55.3

Because after the first summarize(), the variable temp disappears as it has been collapsed to the value mean. So when we try to run the second summarize(), it can’t find the variable temp to compute the standard deviation of.

(LC5.5) Recall from Chapter 3 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year?

Solution:

month mean std_dev
1 35.6 10.22
2 34.3 6.98
3 39.9 6.25
4 51.7 8.79
5 61.8 9.68
6 72.2 7.55
7 80.1 7.12
8 74.5 5.19
9 67.4 8.47
10 60.1 8.85
11 45.0 10.44
12 38.4 9.98

The standard deviation is a quantification of spread and variability. We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days.

(LC5.6) What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?

Solution:

summary_temp_by_day <- weather %>% 
  group_by(year, month, day) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )
summary_temp_by_day
# A tibble: 364 x 5
# Groups:   year, month [12]
    year month   day  mean std_dev
   <dbl> <dbl> <int> <dbl>   <dbl>
 1  2013     1     1  37.0    4.00
 2  2013     1     2  28.7    3.45
 3  2013     1     3  30.0    2.58
 4  2013     1     4  34.9    2.45
 5  2013     1     5  37.2    4.01
 6  2013     1     6  40.1    4.40
 7  2013     1     7  40.6    3.68
 8  2013     1     8  40.1    5.77
 9  2013     1     9  43.2    5.40
10  2013     1    10  43.8    2.95
# … with 354 more rows

Note: group_by(day) is not enough, because day is a value between 1-31. We need to group_by(year, month, day)

library(dplyr)
library(nycflights13)

summary_temp_by_month <- weather %>% 
  group_by(month) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )

(LC5.7) Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting dataset?

Solution:

by_monthly_origin <- flights %>% 
  group_by(month, origin) %>% 
  summarize(count = n())
by_monthly_origin
month origin count
1 EWR 9893
1 JFK 9161
1 LGA 7950
2 EWR 9107
2 JFK 8421
2 LGA 7423
3 EWR 10420
3 JFK 9697
3 LGA 8717
4 EWR 10531
4 JFK 9218
4 LGA 8581
5 EWR 10592
5 JFK 9397
5 LGA 8807
6 EWR 10175
6 JFK 9472
6 LGA 8596
7 EWR 10475
7 JFK 10023
7 LGA 8927
8 EWR 10359
8 JFK 9983
8 LGA 8985
9 EWR 9550
9 JFK 8908
9 LGA 9116
10 EWR 10104
10 JFK 9143
10 LGA 9642
11 EWR 9707
11 JFK 8710
11 LGA 8851
12 EWR 9922
12 JFK 9146
12 LGA 9067

In by_monthly_origin the month column is now first and the rows are sorted by month instead of origin. If you compare the values of count in by_origin_monthly and by_monthly_origin using the View() function, you’ll see that the values are actually the same, just presented in a different order.

(LC5.8) How could we identify how many flights left each of the three airports for each carrier?

Solution: We could summarize the count from each airport using the n() function, which counts rows.

count_flights_by_airport <- flights %>% 
  group_by(origin, carrier) %>% 
  summarize(count=n())
count_flights_by_airport
origin carrier count
EWR 9E 1268
EWR AA 3487
EWR AS 714
EWR B6 6557
EWR DL 4342
EWR EV 43939
EWR MQ 2276
EWR OO 6
EWR UA 46087
EWR US 4405
EWR VX 1566
EWR WN 6188
JFK 9E 14651
JFK AA 13783
JFK B6 42076
JFK DL 20701
JFK EV 1408
JFK HA 342
JFK MQ 7193
JFK UA 4534
JFK US 2995
JFK VX 3596
LGA 9E 2541
LGA AA 15459
LGA B6 6002
LGA DL 23067
LGA EV 8826
LGA F9 685
LGA FL 3260
LGA MQ 16928
LGA OO 26
LGA UA 8044
LGA US 13136
LGA WN 6087
LGA YV 601

All remarkably similar! Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) funciton sums all values of a certain numerical variable VARIABLE_NAME.

(LC5.9) How does the filter operation differ from a group_by followed by a summarize?

Solution:

  • filter picks out rows from the original dataset without modifying them, whereas
  • group_by %>% summarize computes summaries of numerical variables, and hence reports new values.

(LC5.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value?

Solution:

  • Say a flight departed 20 minutes late, i.e. dep_delay = 20
  • Then arrived 10 minutes late, i.e. arr_delay = 10.
  • Then gain = dep_delay - arr_delay = 20 - 10 = 10 is positive, so it “made up/gained time in the air.”
  • 0 means the departure and arrival time were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.
  • I never understood this. If the pilot says “we’re going make up time in the air” because of delay by flying faster, why don’t you always just fly faster to begin with?

(LC5.11) Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.

Solution: No because you can’t do direct arithmetic on times. The difference in time between 12:03 and 11:59 is 4 minutes, but 1203-1159 = 44

(LC5.12) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values.

Solution: Most of the time the gain is a little under zero, most of the time the gain is between -50 and 50 minutes. There are some extreme cases however!

(LC5.13) Looking at Figure 5.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour?

Solution: Because hour is simply a value between 0 and 23; to identify a specific hour, we need to know which year, month, day and at which airport.

(LC5.14) What surprises you about the top 10 destinations from NYC in 2013?

Solution: This question is subjective! What surprises me is the high number of flights to Boston. Wouldn’t it be easier and quicker to take the train?

(LC5.15) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways.

Solution:

# The regular way:
flights %>% 
  select(dest, air_time, distance)
# A tibble: 336,776 x 3
   dest  air_time distance
   <chr>    <dbl>    <dbl>
 1 IAH        227     1400
 2 IAH        227     1416
 3 MIA        160     1089
 4 BQN        183     1576
 5 ATL        116      762
 6 ORD        150      719
 7 FLL        158     1065
 8 IAD         53      229
 9 MCO        140      944
10 ORD        138      733
# … with 336,766 more rows
# Since they are sequential columns in the dataset
flights %>% 
  select(dest:distance)
# A tibble: 336,776 x 3
   dest  air_time distance
   <chr>    <dbl>    <dbl>
 1 IAH        227     1400
 2 IAH        227     1416
 3 MIA        160     1089
 4 BQN        183     1576
 5 ATL        116      762
 6 ORD        150      719
 7 FLL        158     1065
 8 IAD         53      229
 9 MCO        140      944
10 ORD        138      733
# … with 336,766 more rows
# Not as effective, by removing everything else
flights %>% 
  select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
         -sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin, 
         -hour, -minute, -time_hour)
# A tibble: 336,776 x 6
   dest  air_time distance  gain hours gain_per_hour
   <chr>    <dbl>    <dbl> <dbl> <dbl>         <dbl>
 1 IAH        227     1400    -9 3.78          -2.38
 2 IAH        227     1416   -16 3.78          -4.23
 3 MIA        160     1089   -31 2.67         -11.6 
 4 BQN        183     1576    17 3.05           5.57
 5 ATL        116      762    19 1.93           9.83
 6 ORD        150      719   -16 2.5           -6.4 
 7 FLL        158     1065   -24 2.63          -9.11
 8 IAD         53      229    11 0.883         12.5 
 9 MCO        140      944     5 2.33           2.14
10 ORD        138      733   -10 2.3           -4.35
# … with 336,766 more rows

(LC5.16) How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains.

Solution:

# Anything that starts with "d"
flights %>% 
  select(starts_with("d"))
# A tibble: 336,776 x 5
     day dep_time dep_delay dest  distance
   <int>    <int>     <dbl> <chr>    <dbl>
 1     1      517         2 IAH       1400
 2     1      533         4 IAH       1416
 3     1      542         2 MIA       1089
 4     1      544        -1 BQN       1576
 5     1      554        -6 ATL        762
 6     1      554        -4 ORD        719
 7     1      555        -5 FLL       1065
 8     1      557        -3 IAD        229
 9     1      557        -3 MCO        944
10     1      558        -2 ORD        733
# … with 336,766 more rows
# Anything related to delays:
flights %>% 
  select(ends_with("delay"))
# A tibble: 336,776 x 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows
# Anything related to departures:
flights %>% 
  select(contains("dep"))
# A tibble: 336,776 x 3
   dep_time sched_dep_time dep_delay
      <int>          <int>     <dbl>
 1      517            515         2
 2      533            529         4
 3      542            540         2
 4      544            545        -1
 5      554            600        -6
 6      554            558        -4
 7      555            600        -5
 8      557            600        -3
 9      557            600        -3
10      558            600        -2
# … with 336,766 more rows

(LC5.17) Why might we want to use the select() function on a data frame?

Solution: To narrow down the data frame, to make it easier to look at. Using View() for example.

(LC5.18) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.

Solution:

top_five <- flights %>% 
  group_by(dest) %>% 
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay)) %>% 
  top_n(n = 5)
top_five
# A tibble: 5 x 2
  dest  avg_delay
  <chr>     <dbl>
1 CAE        41.8
2 TUL        33.7
3 OKC        30.6
4 JAC        28.1
5 TYS        24.1

(LC5.19) Using the datasets included in the nycflights13 package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:

  1. Crucial: Unless you are very confident in what you are doing, it is worthwhile to not starting coding right away, but rather first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code).
  2. Take a close look at all the datasets using the View() function: flights, weather, planes, airports, and airlines to identify which variables are necessary to compute available seat miles.
  3. Figure 5.7 above showing how the various datasets can be joined will also be useful.
  4. Consider the data wrangling verbs in Table 5.1 as your toolbox!

Solution: Here are some examples of student-written pseudocode. Based on our own pseudocode, let’s first display the entire solution.

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance) %>% 
  mutate(ASM = seats * distance) %>% 
  group_by(carrier) %>% 
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>% 
  arrange(desc(ASM))
# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 UA      15516377526
 2 DL      10532885801
 3 B6       9618222135
 4 AA       3677292231
 5 US       2533505829
 6 VX       2296680778
 7 EV       1817236275
 8 WN       1718116857
 9 9E        776970310
10 HA        642478122
11 AS        314104736
12 FL        219628520
13 F9        184832280
14 YV         20163632
15 MQ          7162420
16 OO          1299835

Let’s now break this down step-by-step. To compute the available seat miles for a given flight, we need the distance variable from the flights data frame and the seats variable from the planes data frame, necessitating a join by the key variable tailnum as illustrated in Figure 5.7. To keep the resulting data frame easy to view, we’ll select() only these two variables and carrier:

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance)
# A tibble: 284,170 x 3
   carrier seats distance
   <chr>   <int>    <dbl>
 1 UA        149     1400
 2 UA        149     1416
 3 AA        178     1089
 4 B6        200     1576
 5 DL        178      762
 6 UA        191      719
 7 B6        200     1065
 8 EV         55      229
 9 B6        200      944
10 B6        200     1028
# … with 284,160 more rows

Now for each flight we can compute the available seat miles ASM by multiplying the number of seats by the distance via a mutate():

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance) %>% 
  # Added:
  mutate(ASM = seats * distance)
# A tibble: 284,170 x 4
   carrier seats distance    ASM
   <chr>   <int>    <dbl>  <dbl>
 1 UA        149     1400 208600
 2 UA        149     1416 210984
 3 AA        178     1089 193842
 4 B6        200     1576 315200
 5 DL        178      762 135636
 6 UA        191      719 137329
 7 B6        200     1065 213000
 8 EV         55      229  12595
 9 B6        200      944 188800
10 B6        200     1028 205600
# … with 284,160 more rows

Next we want to sum the ASM for each carrier. We achieve this by first grouping by carrier and then summarizing using the sum() function:

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance) %>% 
  mutate(ASM = seats * distance) %>% 
  # Added:
  group_by(carrier) %>% 
  summarize(ASM = sum(ASM))
# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 9E        776970310
 2 AA       3677292231
 3 AS        314104736
 4 B6       9618222135
 5 DL      10532885801
 6 EV       1817236275
 7 F9        184832280
 8 FL        219628520
 9 HA        642478122
10 MQ          7162420
11 OO          1299835
12 UA      15516377526
13 US       2533505829
14 VX       2296680778
15 WN       1718116857
16 YV         20163632

However, because for certain carriers certain flights have missing NA values, the resulting table also returns NA’s. We can eliminate these by adding a na.rm = TRUE argument to sum(), telling R that we want to remove the NA’s in the sum. We saw this in Section (summarize):

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance) %>% 
  mutate(ASM = seats * distance) %>% 
  group_by(carrier) %>% 
  # Modified:
  summarize(ASM = sum(ASM, na.rm = TRUE))
# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 9E        776970310
 2 AA       3677292231
 3 AS        314104736
 4 B6       9618222135
 5 DL      10532885801
 6 EV       1817236275
 7 F9        184832280
 8 FL        219628520
 9 HA        642478122
10 MQ          7162420
11 OO          1299835
12 UA      15516377526
13 US       2533505829
14 VX       2296680778
15 WN       1718116857
16 YV         20163632

Finally, we arrange() the data in desc()ending order of ASM.

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance) %>% 
  mutate(ASM = seats * distance) %>% 
  group_by(carrier) %>% 
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>% 
  # Added:
  arrange(desc(ASM))
# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 UA      15516377526
 2 DL      10532885801
 3 B6       9618222135
 4 AA       3677292231
 5 US       2533505829
 6 VX       2296680778
 7 EV       1817236275
 8 WN       1718116857
 9 9E        776970310
10 HA        642478122
11 AS        314104736
12 FL        219628520
13 F9        184832280
14 YV         20163632
15 MQ          7162420
16 OO          1299835

While the above data frame is correct, the IATA carrier code is not always useful. For example, what carrier is WN? We can address this by joining with the airlines dataset using carrier is the key variable. While this step is not absolutely required, it goes a long way to making the table easier to make sense of. It is important to be empathetic with the ultimate consumers of your presented data!

flights %>% 
  inner_join(planes, by = "tailnum") %>% 
  select(carrier, seats, distance) %>% 
  mutate(ASM = seats * distance) %>% 
  group_by(carrier) %>% 
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>% 
  arrange(desc(ASM)) %>% 
  # Added:
  inner_join(airlines, by = "carrier")
# A tibble: 16 x 3
   carrier         ASM name                       
   <chr>         <dbl> <chr>                      
 1 UA      15516377526 United Air Lines Inc.      
 2 DL      10532885801 Delta Air Lines Inc.       
 3 B6       9618222135 JetBlue Airways            
 4 AA       3677292231 American Airlines Inc.     
 5 US       2533505829 US Airways Inc.            
 6 VX       2296680778 Virgin America             
 7 EV       1817236275 ExpressJet Airlines Inc.   
 8 WN       1718116857 Southwest Airlines Co.     
 9 9E        776970310 Endeavor Air Inc.          
10 HA        642478122 Hawaiian Airlines Inc.     
11 AS        314104736 Alaska Airlines Inc.       
12 FL        219628520 AirTran Airways Corporation
13 F9        184832280 Frontier Airlines Inc.     
14 YV         20163632 Mesa Airlines Inc.         
15 MQ          7162420 Envoy Air                  
16 OO          1299835 SkyWest Airlines Inc.      

D.5 Chapter 6 Solutions

library(ggplot2)
library(dplyr)
library(moderndive)
library(gapminder)
library(skimr)