Bike Rentals
Excess Rentals in TFL Bike Sharing
How did the bike rentals volume fluctuate in the last years? The data comes from public Government data on Tlf bike sharing and can be downloaded from this link. We will use TfL data on how many bikes were hired every single day. We can get the latest data by running the following:
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2020-09-18T09%3A06%3A54/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20200919%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20200919T224752Z&X-Amz-Expires=300&X-Amz-Signature=a18923e87aab7725bbc1c87abf21b03daf8c067853e7c51fde1af0d5d5e12782&X-Amz-SignedHeaders=host]
## Date: 2020-09-19 22:47
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 165 kB
## <ON DISK> C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpAdINEQ\file5a446fb0505f.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day),
dayy=day(day))
We can see the distribution of bikes for 2019 and 2020 in the following graph:
bike_first<-bike %>%
filter(year>=2019 )
bike_first
## # A tibble: 609 x 6
## day bikes_hired year month week dayy
## <dttm> <dbl> <dbl> <ord> <dbl> <int>
## 1 2019-01-01 00:00:00 14148 2019 Jan 1 1
## 2 2019-01-02 00:00:00 19746 2019 Jan 1 2
## 3 2019-01-03 00:00:00 21552 2019 Jan 1 3
## 4 2019-01-04 00:00:00 20863 2019 Jan 1 4
## 5 2019-01-05 00:00:00 13907 2019 Jan 1 5
## 6 2019-01-06 00:00:00 14262 2019 Jan 1 6
## 7 2019-01-07 00:00:00 25668 2019 Jan 2 7
## 8 2019-01-08 00:00:00 27757 2019 Jan 2 8
## 9 2019-01-09 00:00:00 26568 2019 Jan 2 9
## 10 2019-01-10 00:00:00 26022 2019 Jan 2 10
## # ... with 599 more rows
ggplot(bike_first, aes(x=dayy, y=bikes_hired, fill='red'))+
geom_line(aes(x=dayy, y=bikes_hired))+
#facet_grid(year~month, scales='free_x')+
facet_wrap(~ year + month, ncol = 12, scales = "free_x")+
#scale_y_continuous(breaks=seq(0, 5000, by = 2500))
theme_minimal()+
theme(aspect.ratio=2/1.5)+
labs(title='Tlf bike rentals fluctuations',
x=NULL,
y='Bike rentals',
caption = 'Source:Tlf, London Data Score')
#theme(strip.text.x = element_text(size=2, angle=75),
# strip.text.y = element_text(size=2, face="bold"),
# strip.background = element_rect(colour="red", fill="#CCCCFF"))
From the graph, we could see high discrepancies in the number of bike rentals during February-April, a change that may be due to Coronavirus restrictions. To further investigate that, we are going to compare volumes of bikes rentals across years.
Next, we will reproduce 2 graphs presented in LBS Statistics with R course.
##Chnage from montly expected bike rentals
### Data Wrangling for calculationg the expected and excess rentals
I will first wrangle the data to adapt it to our needs. In the following chunk, we calculate 2 dataframes: `bike_expected` and `bike_avg`. The former has the expected trends (blue line on the plot), the latter will have the actual trend (thin black line on the plots). The I will join them using `left_join()`.
### Mean or median?
We are going to use median for these purposes. That is because the median is much less affected by outliers and extreme values, such as those occurring during 2020, as it is simply the value in the middle between the higher and lower half of a given range. It is therefore much better at reflecting the true expected or 'middle' trend.
```r
#calculate expected number of bikes to be hired per month
bike_expected<-bike %>%
filter(year>= 2015 & year<2020) %>% #we do not take 2020, since we do not have observations from the entire year
group_by(month) %>%
summarize(expected_per_month = median(bikes_hired)) %>%
select(month, expected_per_month)
## `summarise()` ungrouping output (override with `.groups` argument)
# Show results
kable(head(bike_expected,5))
month | expected_per_month |
---|---|
Jan | 21816.0 |
Feb | 23169.0 |
Mar | 24559.0 |
Apr | 28818.5 |
May | 33192.0 |
#calculate average bike hires
bike_avg<-bike %>%
filter(year>= 2015 & year<=2020) %>%
group_by(month, year) %>%
summarize(average_per_month = median(bikes_hired)) %>%
select(month, average_per_month, year)
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
# Show results
kable(head(bike_avg,5 ))
month | average_per_month | year |
---|---|---|
Jan | 21405 | 2015 |
Jan | 20948 | 2016 |
Jan | 22122 | 2017 |
Jan | 23374 | 2018 |
Jan | 24357 | 2019 |
#unite the two calculations
#we join the two tables on the 'month' variable, in order to have both expected and actual number of bikes rentals
big_table<-left_join(bike_avg, bike_expected, by='month') %>%
mutate(date = ymd(paste(as.character(year), as.character(month), "1")), #get dates
bottom_line=ifelse(average_per_month>expected_per_month,expected_per_month,average_per_month)) #minimum of both curves everywhere
# Show Results
kable(head(big_table, 5))
month | average_per_month | year | expected_per_month | date | bottom_line |
---|---|---|---|---|---|
Jan | 21405 | 2015 | 21816 | 2015-01-01 | 21405 |
Jan | 20948 | 2016 | 21816 | 2016-01-01 | 20948 |
Jan | 22122 | 2017 | 21816 | 2017-01-01 | 21816 |
Jan | 23374 | 2018 | 21816 | 2018-01-01 | 21816 |
Jan | 24357 | 2019 | 21816 | 2019-01-01 | 21816 |
Plotting
Now we can plot this:
big_table %>%
ggplot(aes(x=month))+
facet_wrap(~year)+
#green and red bands
geom_ribbon(aes(ymin=bottom_line,ymax=expected_per_month, group=year), fill="#EAB5B7")+ #red band: between bottom and expected_per_month
geom_ribbon(aes(ymin=bottom_line,ymax=average_per_month, group=year), fill="#CBEBCE")+ #green band: between bottom and avg_per_month
#plot line curves
geom_line(aes(y = average_per_month, group=year), size = 0.3,color = "black") +
geom_line(aes(y = expected_per_month, group=year), size = 0.6, color = "blue")+
#aesthetics
theme_minimal()+
theme(aspect.ratio = 1/3,
axis.text=element_text(size=6))+ #reduce text size in axes to avoid overlap
labs(title='Montly changes in Tlf bike rentals',
subtitle = 'Change from monthly average shown in blue and calculated between 2015-2020',
x=NULL,
y='Bike rentals',
caption = 'Source:Tlf, London Data Score')
The blue line represents the average number of bikes computed monthly for the 2015- 2019 period. The thin black line represents the number of bikes rented in that specific year (computed monthly).When the number of bikes rented is smaller than the average of bikes per month (calculated from 2015-2019), the difference is depicted in red. Conversely, when the number is greater, the difference is depicted in green.
What happens on the graph?
On the graph, we can appreciate the a growing trend of bike hirings throughout the years. In the initial years of 2015 and 2016, the service was used less than average. This might be because the infrastructure of London was less suited to bikers than in later years. Through 2017 and 2018, the bike use is seen to rise. In 2020, a large decrease is once again noted in the early months of the year. This is due to the corona-virus restrictions which confined people to their houses. Upon the lifting of the restrictions, a large resurgence of bike use can be noticed, as people hired more bikes to enjoy the outdoors and their regained freedom of movement.
Percentages change from weekly average
The next graph looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to the second (weeks 14-26) and fourth (weeks 40-52) quarters.
Data Wrangling for next Graph
We will wrangle this data again. We will create the bike_avg_week
to estimate the average per week around years 2015-2020, and the bike_expected_week
to appreciate the expected amount of hirings per week in 2015-2020.
# Avearage per week in 2015-2020
bike_avg_week<-bike %>%
# Filter 2015-2020
filter(year>= 2015 & year<=2020) %>%
# Group by week
group_by(week, year) %>%
# summarise statistics (median)
summarize(average_per_week = median(bikes_hired)) %>%
# select vars of interest
select(week, average_per_week, year)
## `summarise()` regrouping output by 'week' (override with `.groups` argument)
# Show Results
kable(head(bike_avg_week, 10))
week | average_per_week | year |
---|---|---|
1 | 9491.0 | 2015 |
1 | 20644.0 | 2016 |
1 | 18973.0 | 2017 |
1 | 15349.5 | 2018 |
1 | 15214.0 | 2019 |
1 | 15695.0 | 2020 |
2 | 20566.0 | 2015 |
2 | 22266.0 | 2016 |
2 | 20963.0 | 2017 |
2 | 24683.0 | 2018 |
# Average per week in 2015-2020
bike_expected_week<-bike %>%
# filter 2015-2020
filter(year>= 2015 & year<2020) %>%
# group by week
group_by(week) %>%
# calculate median
summarize(expected_per_week = median(bikes_hired)) %>%
# select vars of interest
select(week, expected_per_week)
## `summarise()` ungrouping output (override with `.groups` argument)
# Show results
kable(head(bike_expected, 10))
month | expected_per_month |
---|---|
Jan | 21816.0 |
Feb | 23169.0 |
Mar | 24559.0 |
Apr | 28818.5 |
May | 33192.0 |
Jun | 36708.0 |
Jul | 38912.0 |
Aug | 35108.0 |
Sep | 33402.0 |
Oct | 31069.0 |
And now we will join them
# Join data frames
big_table_week<-left_join(bike_avg_week, bike_expected_week, by='week') %>%
mutate(change_week=(average_per_week-expected_per_week)/expected_per_week) %>%
mutate(topline = ifelse(change_week>0, change_week, 0))
# Show results
kable(head(big_table_week, 10))
week | average_per_week | year | expected_per_week | change_week | topline |
---|---|---|---|---|---|
1 | 9491.0 | 2015 | 15777.5 | -0.3984472 | 0.0000000 |
1 | 20644.0 | 2016 | 15777.5 | 0.3084456 | 0.3084456 |
1 | 18973.0 | 2017 | 15777.5 | 0.2025353 | 0.2025353 |
1 | 15349.5 | 2018 | 15777.5 | -0.0271272 | 0.0000000 |
1 | 15214.0 | 2019 | 15777.5 | -0.0357154 | 0.0000000 |
1 | 15695.0 | 2020 | 15777.5 | -0.0052290 | 0.0000000 |
2 | 20566.0 | 2015 | 22285.0 | -0.0771371 | 0.0000000 |
2 | 22266.0 | 2016 | 22285.0 | -0.0008526 | 0.0000000 |
2 | 20963.0 | 2017 | 22285.0 | -0.0593224 | 0.0000000 |
2 | 24683.0 | 2018 | 22285.0 | 0.1076060 | 0.1076060 |
Now, we can construct the desired plot
ggplot(data = big_table_week, aes(x=week, y=change_week, group=year))+
geom_line()+
#facet by year
facet_wrap(~year)+
#add grey vertical bands in background
annotate("rect", xmin=13, xmax=26, ymin=-Inf, ymax=Inf, alpha=0.4, fill="grey") +
annotate("rect", xmin=39, xmax=53, ymin=-Inf, ymax=Inf, alpha=0.4, fill="grey") +
#add red and green bands between curve and y = 0
geom_ribbon(aes(ymin=0,ymax=change_week), fill="#CBEBCE", color="black", size=0.05)+ #green
geom_ribbon(aes(ymin=change_week, ymax=topline), fill="#EAB5B7", color="black", size=0.05) + #red\
#tick marks on bottom axis
geom_rug(data = . %>% filter(change_week <= 0), sides="b", color='#EAB5B7', size = 0.8) + #filter to color only the ticks where change_week < 0 red
geom_rug(data = . %>% filter(change_week > 0), sides="b", color='#CBEBCE', size = 0.8) + #green
#horizontal line at y = 0
geom_hline(yintercept = 0, size=0) +
#aesthetics
scale_y_continuous(expand = expansion(mult = .05),
labels=scales::percent)+ #y-axis lables
theme_minimal()+
theme(aspect.ratio = 1/2) + #wider
labs(title = "Weekly changes in TfL bike rentals", # add labels to the df
subtitle = "% changes from weekly averages calculated between 2015-2019",
caption = "Source: fivethirtyeight.com - 2014", # Source
y = NULL)
The above graph tells largely the same story as the graph shown earlier in the discussion, only this time displayed as changes with respect to a weekly average. We can once again appreciate a growing populatiry of bike use from 2015 through to 2019, a major decrease during the pandemic of early 2020, and a resurgence upon the lifting of the restrictions.