Bike Rentals

Excess Rentals in TfL Bike Sharing

How did bike rental volumes fluctuate in recent years? The data comes from Transport for London (TfL) and is published as open government data on the London Datastore; it can be downloaded from the link used in the code below. We will use TfL data on how many bikes were hired every single day. We can get the latest data by running the following:

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2020-09-18T09%3A06%3A54/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20200919%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20200919T224752Z&X-Amz-Expires=300&X-Amz-Signature=a18923e87aab7725bbc1c87abf21b03daf8c067853e7c51fde1af0d5d5e12782&X-Amz-SignedHeaders=host]
##   Date: 2020-09-19 22:47
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 165 kB
## <ON DISK>  C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpAdINEQ\file5a446fb0505f.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))
# parse dates to get year, month, week, and day of month
bike <- bike0 %>% 
  clean_names() %>% 
  rename(bikes_hired = number_of_bicycle_hires) %>% 
  mutate(year = year(day),
         month = lubridate::month(day, label = TRUE),
         week = isoweek(day), 
         dayy = day(day))
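
Before going further, a quick sanity check that the file was parsed as expected can help (a minimal sketch; the exact values depend on when the file is downloaded):

```r
# Quick check of the parsed data: date coverage and a summary of daily hires
range(bike$day)           # first and last day in the series
summary(bike$bikes_hired) # distribution of daily hires
```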

We can see the daily number of bike hires in 2019 and 2020 in the following graph:

bike_first <- bike %>% 
  filter(year >= 2019) 

bike_first
## # A tibble: 609 x 6
##    day                 bikes_hired  year month  week  dayy
##    <dttm>                    <dbl> <dbl> <ord> <dbl> <int>
##  1 2019-01-01 00:00:00       14148  2019 Jan       1     1
##  2 2019-01-02 00:00:00       19746  2019 Jan       1     2
##  3 2019-01-03 00:00:00       21552  2019 Jan       1     3
##  4 2019-01-04 00:00:00       20863  2019 Jan       1     4
##  5 2019-01-05 00:00:00       13907  2019 Jan       1     5
##  6 2019-01-06 00:00:00       14262  2019 Jan       1     6
##  7 2019-01-07 00:00:00       25668  2019 Jan       2     7
##  8 2019-01-08 00:00:00       27757  2019 Jan       2     8
##  9 2019-01-09 00:00:00       26568  2019 Jan       2     9
## 10 2019-01-10 00:00:00       26022  2019 Jan       2    10
## # ... with 599 more rows
ggplot(bike_first, aes(x = dayy, y = bikes_hired)) +
  geom_line() +
  facet_wrap(~ year + month, ncol = 12, scales = "free_x") +
  theme_minimal() +
  theme(aspect.ratio = 2/1.5) +
  labs(title = 'TfL bike rentals fluctuations', 
       x = NULL, 
       y = 'Bike rentals', 
       caption = 'Source: TfL, London Datastore')

From the graph, we can see large discrepancies in the number of bike rentals between February and April 2020, a change that may be due to coronavirus restrictions. To investigate this further, we are going to compare bike rental volumes across years.

Next, we will reproduce two graphs presented in the LBS Statistics with R course.

## Change from monthly expected bike rentals

### Data wrangling for calculating the expected and excess rentals

We first wrangle the data to adapt it to our needs. In the following chunk, we calculate two data frames: `bike_expected` and `bike_avg`. The former holds the expected trend (the blue line on the plots), the latter the actual trend (the thin black line). We then join them using `left_join()`.

### Mean or median?

We are going to use the median for these purposes. The median is much less affected by outliers and extreme values, such as those occurring during 2020, because it is simply the middle value separating the upper half of the data from the lower half. It is therefore better at reflecting the true expected or 'middle' trend.
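
As a toy illustration (made-up numbers, not TfL data), a single extreme day moves the mean a lot but barely shifts the median:

```r
# Toy example: one lockdown-style day drags the mean down, the median barely moves
normal_days  <- c(30000, 31000, 32000, 33000, 34000)
with_outlier <- c(normal_days, 2000)

mean(normal_days)    # 32000
mean(with_outlier)   # 27000
median(normal_days)  # 32000
median(with_outlier) # 31500
```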


# calculate the expected number of bikes to be hired per month
bike_expected <- bike %>% 
  filter(year >= 2015 & year < 2020) %>% # exclude 2020, since we do not have observations for the entire year 
  group_by(month) %>% 
  summarize(expected_per_month = median(bikes_hired)) %>% 
  select(month, expected_per_month)
## `summarise()` ungrouping output (override with `.groups` argument)
# Show results
kable(head(bike_expected,5))
| month | expected_per_month |
|:------|-------------------:|
| Jan   | 21816.0 |
| Feb   | 23169.0 |
| Mar   | 24559.0 |
| Apr   | 28818.5 |
| May   | 33192.0 |
# calculate the actual (median) bike hires per month for each year
bike_avg <- bike %>% 
  filter(year >= 2015 & year <= 2020) %>% 
  group_by(month, year) %>% 
  summarize(average_per_month = median(bikes_hired)) %>% 
  select(month, average_per_month, year)
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
# Show results
kable(head(bike_avg,5 ))
| month | average_per_month | year |
|:------|------------------:|-----:|
| Jan   | 21405 | 2015 |
| Jan   | 20948 | 2016 |
| Jan   | 22122 | 2017 |
| Jan   | 23374 | 2018 |
| Jan   | 24357 | 2019 |
# unite the two calculations:
# join the two tables on the 'month' variable, in order to have both the expected and the actual number of bike rentals
big_table <- left_join(bike_avg, bike_expected, by = 'month') %>%  
  mutate(date = ymd(paste(as.character(year), as.character(month), "1")), # build a date for each year-month
         bottom_line = ifelse(average_per_month > expected_per_month, expected_per_month, average_per_month)) # the lower of the two curves at each point
# Show Results
kable(head(big_table, 5))
| month | average_per_month | year | expected_per_month | date       | bottom_line |
|:------|------------------:|-----:|-------------------:|:-----------|------------:|
| Jan   | 21405 | 2015 | 21816 | 2015-01-01 | 21405 |
| Jan   | 20948 | 2016 | 21816 | 2016-01-01 | 20948 |
| Jan   | 22122 | 2017 | 21816 | 2017-01-01 | 21816 |
| Jan   | 23374 | 2018 | 21816 | 2018-01-01 | 21816 |
| Jan   | 24357 | 2019 | 21816 | 2019-01-01 | 21816 |

### Plotting

Now we can plot this:

big_table %>% 
  
  ggplot(aes(x=month))+
  facet_wrap(~year)+
  
  #red and green bands
  geom_ribbon(aes(ymin=bottom_line,ymax=expected_per_month, group=year), fill="#EAB5B7")+ #red band: between bottom_line and expected_per_month
  geom_ribbon(aes(ymin=bottom_line,ymax=average_per_month, group=year), fill="#CBEBCE")+ #green band: between bottom_line and average_per_month
  
  #plot line curves
  geom_line(aes(y = average_per_month, group=year), size = 0.3, color = "black") +
  geom_line(aes(y = expected_per_month, group=year), size = 0.6, color = "blue")+
  
  #aesthetics
  theme_minimal()+
  theme(aspect.ratio = 1/3, 
        axis.text=element_text(size=6))+ #reduce text size in axes to avoid overlap
  labs(title='Monthly changes in TfL bike rentals', 
       subtitle = 'Change from monthly average shown in blue and calculated between 2015-2019', 
       x=NULL, 
       y='Bike rentals', 
       caption = 'Source: TfL, London Datastore')


The blue line represents the expected (median) number of daily bike hires for each month, computed over the 2015-2019 period. The thin black line represents the number of bikes hired in that specific year (again computed monthly as a median). When the number of bikes hired falls below the monthly expectation, the difference is shaded in red. Conversely, when it is greater, the difference is shaded in green.
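
To put rough numbers on those shaded areas, a minimal sketch (assuming `big_table` as created above) counts, for each year, how many months sat above or below the 2015-2019 expected level:

```r
# Sketch: per-year tally of months above vs. below the expected level
# (differences are in median daily hires, matching the curves on the plot)
big_table %>%
  mutate(excess = average_per_month - expected_per_month) %>%
  group_by(year) %>%
  summarize(months_above = sum(excess > 0),
            months_below = sum(excess < 0))
```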

### What happens on the graph?

On the graph, we can see a growing trend in bike hires over the years. In the initial years of 2015 and 2016, the service was used less than the period average, possibly because London's cycling infrastructure was less developed than in later years. Through 2017 and 2018, bike use rises. In 2020, a large decrease is once again noted in the early months of the year, due to the coronavirus restrictions which confined people to their homes. Once the restrictions were lifted, a large resurgence of bike use can be seen, as people hired more bikes to enjoy the outdoors and their regained freedom of movement.

## Percentage change from weekly average

The next graph looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to the second (weeks 14-26) and fourth (weeks 40-52) quarters.

### Data wrangling for the next graph

We will wrangle the data again. We will create `bike_avg_week` to calculate the actual (median) hires per week for each year in 2015-2020, and `bike_expected_week` to calculate the expected hires per week over 2015-2019.

# Actual (median) hires per week for each year, 2015-2020
bike_avg_week <- bike %>%
  # filter 2015-2020
  filter(year >= 2015 & year <= 2020) %>% 
  # group by week and year
  group_by(week, year) %>% 
  # summarise statistics (median)
  summarize(average_per_week = median(bikes_hired)) %>% 
  # select vars of interest
  select(week, average_per_week, year)
## `summarise()` regrouping output by 'week' (override with `.groups` argument)
# Show Results
kable(head(bike_avg_week, 10))
| week | average_per_week | year |
|-----:|-----------------:|-----:|
| 1 |  9491.0 | 2015 |
| 1 | 20644.0 | 2016 |
| 1 | 18973.0 | 2017 |
| 1 | 15349.5 | 2018 |
| 1 | 15214.0 | 2019 |
| 1 | 15695.0 | 2020 |
| 2 | 20566.0 | 2015 |
| 2 | 22266.0 | 2016 |
| 2 | 20963.0 | 2017 |
| 2 | 24683.0 | 2018 |
# Expected hires per week, 2015-2019
bike_expected_week <- bike %>% 
  # filter 2015-2019 (exclude the incomplete 2020)
  filter(year >= 2015 & year < 2020) %>% 
  # group by week
  group_by(week) %>% 
  # calculate median
  summarize(expected_per_week = median(bikes_hired)) %>% 
  # select vars of interest
  select(week, expected_per_week)
## `summarise()` ungrouping output (override with `.groups` argument)
# Show results
kable(head(bike_expected, 10))
| month | expected_per_month |
|:------|-------------------:|
| Jan | 21816.0 |
| Feb | 23169.0 |
| Mar | 24559.0 |
| Apr | 28818.5 |
| May | 33192.0 |
| Jun | 36708.0 |
| Jul | 38912.0 |
| Aug | 35108.0 |
| Sep | 33402.0 |
| Oct | 31069.0 |

And now we join them, computing `change_week` as (average_per_week − expected_per_week) / expected_per_week:

# Join data frames and compute the percentage change from the expected level
big_table_week <- left_join(bike_avg_week, bike_expected_week, by = 'week') %>%
  mutate(change_week = (average_per_week - expected_per_week) / expected_per_week) %>% 
  mutate(topline = ifelse(change_week > 0, change_week, 0)) # positive changes only, used for shading
# Show results
kable(head(big_table_week, 10))
| week | average_per_week | year | expected_per_week | change_week | topline |
|-----:|-----------------:|-----:|------------------:|------------:|--------:|
| 1 |  9491.0 | 2015 | 15777.5 | -0.3984472 | 0.0000000 |
| 1 | 20644.0 | 2016 | 15777.5 |  0.3084456 | 0.3084456 |
| 1 | 18973.0 | 2017 | 15777.5 |  0.2025353 | 0.2025353 |
| 1 | 15349.5 | 2018 | 15777.5 | -0.0271272 | 0.0000000 |
| 1 | 15214.0 | 2019 | 15777.5 | -0.0357154 | 0.0000000 |
| 1 | 15695.0 | 2020 | 15777.5 | -0.0052290 | 0.0000000 |
| 2 | 20566.0 | 2015 | 22285.0 | -0.0771371 | 0.0000000 |
| 2 | 22266.0 | 2016 | 22285.0 | -0.0008526 | 0.0000000 |
| 2 | 20963.0 | 2017 | 22285.0 | -0.0593224 | 0.0000000 |
| 2 | 24683.0 | 2018 | 22285.0 |  0.1076060 | 0.1076060 |
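
As a quick check of `change_week`: for week 1 of 2016 the weekly median was 20,644 hires against an expected 15,777.5, so the change is (20644 − 15777.5) / 15777.5 ≈ +30.8%, which matches the `change_week` value in the table.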

Now we can construct the desired plot:

ggplot(data = big_table_week, aes(x=week, y=change_week, group=year))+
  geom_line()+
  
  #facet by year
  facet_wrap(~year)+
  
  #add grey vertical bands in background
  annotate("rect", xmin=13, xmax=26, ymin=-Inf, ymax=Inf, alpha=0.4, fill="grey") +
  annotate("rect", xmin=39, xmax=53, ymin=-Inf, ymax=Inf, alpha=0.4, fill="grey") +
  
  #add red and green bands between curve and y = 0
  geom_ribbon(aes(ymin=0,ymax=change_week), fill="#CBEBCE", color="black", size=0.05)+ #green
  geom_ribbon(aes(ymin=change_week, ymax=topline), fill="#EAB5B7", color="black", size=0.05) + #red
  
  #tick marks on bottom axis
  geom_rug(data = . %>% filter(change_week <= 0), sides="b", color='#EAB5B7', size = 0.8) + # red ticks where change_week <= 0
  geom_rug(data = . %>% filter(change_week > 0), sides="b", color='#CBEBCE', size = 0.8) + # green ticks where change_week > 0
  
  #horizontal line at y = 0
  geom_hline(yintercept = 0, size=0) + 
  
  #aesthetics
  scale_y_continuous(expand = expansion(mult = .05), 
                     labels=scales::percent)+ # show y-axis labels as percentages
  theme_minimal()+
  theme(aspect.ratio = 1/2) + #wider
  labs(title = "Weekly changes in TfL bike rentals", # plot labels
       subtitle = "% changes from weekly averages calculated between 2015-2019",
       caption = "Source: TfL, London Datastore",
       y = NULL)


The above graph tells largely the same story as the earlier graph, only this time displayed as percentage changes relative to a weekly expected level. We can once again see the growing popularity of bike use from 2015 through 2019, a major decrease during the pandemic in early 2020, and a resurgence once the restrictions were lifted.
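
To pin down the size of that 2020 drop, a minimal sketch (assuming `big_table_week` as created above, and dplyr ≥ 1.0 for `slice_min()`) pulls out the weeks of 2020 with the largest shortfall:

```r
# Sketch: the three weeks of 2020 with the most negative % change
big_table_week %>%
  ungroup() %>%                      # the table is still grouped by week
  filter(year == 2020) %>%
  slice_min(change_week, n = 3) %>%
  select(week, average_per_week, expected_per_week, change_week)
```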