This project developed a method of predicting bikeshare trips in New York City with emphasis on the effect of bicycle infrastructure, such as bike lanes. We used our modeling predictions to explore bike planning scenarios in New York City. This project is one of the first to use data capturing the unique routes taken by each bikeshare trip, so the methods used herein could be applied to other similar datasets. This document will outline the motivation for our project, exploratory analysis, feature engineering and model building, and the scenario planning tool displayed in our application.
This project was completed for the final capstone in the Master of Urban Spatial Analytics program at the University of Pennsylvania. We are very thankful to our instructors, Ken Steif, Michael Fichman, and Matt Harris, for their insight and patient guidance as we undertook this large task. We are also incredibly grateful and indebted to our clients, Kyle Yakal-Kremski, Apaar Bansal, and Paula Rubira from the New York City Department of Transportation, for their generosity in sharing such a unique dataset with a group of students and working with us through this process.
Riding a bike in a city can be a harrowing task of dodging turning vehicles, weaving around parked cars and delivery trucks, and watching out for a car door that may open any moment. Bike lanes are an important way to carve out space for cyclists to improve the comfort and safety of riding. However, it can be a very challenging task to build broad support for bike lanes and decide which streets should receive bike lanes first. Our project provides a new way of predicting the impact of a bike lane based on unique ridership data from the Citibike system in New York City, in which people rent bikes from docked stations.
A transportation planner’s task of deciding where to install a new bike lane is a challenging one. Planners must create a network throughout the city that allows cyclists to flow between destinations and protects them from the dangers of the road. Planners may prioritize streets for bike lane investments based on the number of collisions between cars and cyclists or the value of a certain road to the network. There are also strategic factors, such as a street repaving, that can present a convenient time for lane reconfiguration. It is also crucial to develop relationships in communities where new bike lanes will be installed to build support for the changes and ensure community members know how to use the new cycling infrastructure.
This project provides a unique way of prioritizing streets for bike lane investments based on observed bikeshare ridership. Predicting the impact of a new bike lane on bike ridership gives our client, the NYC DOT, an additional tool for deciding which streets to prioritize. Furthermore, if planners can show a community board the most important streets for a new bike lane, or quantify the impact of adding a more protected facility, they can better advocate for the most needed changes and build relationships with groups who are more skeptical of bike lanes.
To develop this model, we relied on two main data sources. New York City’s street centerlines dataset shows the type of bike lane on any given street and where the bike lane network has expanded over time. We combined this with trip data from the bikeshare system, Citibike, in which people can rent bikes at docked stations. To predict bikeshare ridership, we relied on additional data sources including American Community Survey 2015-2019 Estimates, Open Street Map amenity data, and additional amenity datasets from New York City Open Data.
Dataset | Source |
---|---|
Citibike Trips | NYC Department of Transportation |
LION Street Centerlines | Bytes of the Big Apple Archive |
Citibike Stations | NYC Department of Transportation |
Demographics | American Community Survey 2015-2019 Estimates |
Job Locations | Longitudinal Employer-Household Dynamics |
Office and Retail Amenities | OpenStreetMap |
Additional Amenities | NYC Open Data |
This project is unique because it relies on a very uncommon type of trip data that is not usually shared with the public. Most publicly available data on docked bikeshare systems is origin-destination data, meaning you know the stations where a trip starts and ends but not the route the cyclist took in between. Since 2018, Citibike has allowed riders to opt in to having their trip tracked via their phone’s GPS. The resulting data, rather than being a pair of stations, is a line that shows the exact route a rider took. This allows much greater insight into the choices cyclists make when they ride and lets us calculate the ridership of bikeshare trips on any given block.
New York City uses three categories for recording bike lanes, which we have termed Protected Lanes, Unprotected Lanes, and Sharrows. The highest level of protection recorded in the data is a protected lane, which is separated from vehicle lanes by parked cars or another kind of buffer to create a dedicated space for cyclists. An unprotected lane is painted onto the street immediately adjacent to the vehicle lane; even though this is a bike lane, it offers less protection because vehicles can easily drift into it. The final category, sharrow, is used when streets are not wide enough for separate vehicle and bike lanes. The term combines "share" and "arrow" and refers to the painted symbol of a bicycle beneath chevrons, meant to encourage drivers to share the lane with cyclists. Since a sharrow is very similar to a street without a bike lane, we excluded sharrows from our definition of a bike lane.
Protected Lane | Unprotected Lane | Sharrow |
---|---|---|
Since the Citibike network does not extend throughout all of New York, we defined a study area based on the time frame used by our predictive model: August 2020. We chose this month because it captures the warm weather of summer and includes stations in the Bronx, which was added to the Citibike network in July 2020. Our study area therefore comprises a half-mile buffer around the Citibike stations available in August 2020, extending across most of Manhattan and parts of the Bronx, Brooklyn, and Queens.
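As a sketch, this extent can be constructed in R with sf, assuming `stations` is a point layer of the August 2020 Citibike stations in EPSG:2263 (a foot-based projection, so a half mile is 2,640 feet):

```r
library(sf)
library(dplyr)

# dissolve half-mile buffers around all August 2020 stations into one study area
extent <- stations %>%
  st_buffer(2640) %>%  # half mile in feet (EPSG:2263)
  st_union() %>%
  st_sf()
```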
Converting individual bikeshare trips into useful ridership data is a very challenging task. Since this is such a novel analysis, recording our data cleaning process will be helpful to anyone pursuing a similar project in the future. The steps we pursued are as follows:
Buffer Width
For ease of calculation, we chose a single buffer width of 15 feet. When buffers for adjacent street segments overlap, assigning each GPS point to exactly one buffer becomes time-consuming in R. A 15-foot buffer allowed us to capture most points while minimizing overlap between buffers. However, many streets are more than 30 feet wide, so this method could miss some ridership; using a variable buffer width based on street width could improve the calculation in the future, as sketched below.
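A variable-width buffer is a small change with sf; a minimal sketch, assuming `streets` is the LION centerline layer and using its `StreetWidt` field (width in feet), with a hypothetical 30-foot fallback for missing values:

```r
library(sf)
library(dplyr)

# buffer each centerline to half its recorded street width instead of a flat 15 ft
streets <- streets %>%
  mutate(width_ft = ifelse(is.na(StreetWidt), 30, StreetWidt))  # assumed default width
streets_buffer <- st_buffer(streets, dist = streets$width_ft / 2)
```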
Lines to Points
The built-in tool in ArcMap, Generate Points Along Lines, is meant to place points along each line at a specified distance. However, we found that with a dataset of 100,000 trips, the resulting data was very spatially inconsistent, with large peaks of 300-400 riders on segments where no adjoining street had counts higher than 30 trips. This suggested that the tool was not generating points in the way its documentation described. We performed a second points generation using a custom tool from a Stack Exchange post, which resulted in much more spatially consistent ridership counts. However, there are still some segments where the count is several hundred trips higher than on any adjoining segment. Future analyses should investigate this step and find the optimal way to generate points along lines, whether in ArcMap, ArcGIS Pro, or R; one starting point in R is sketched below.
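As one starting point, sf can generate evenly spaced points along lines natively; a minimal sketch, assuming `trips` holds the trip linestrings in a foot-based projection and using a hypothetical 50-foot spacing:

```r
library(sf)
library(dplyr)

# one sampled point every 50 ft along each trip line
trip_points <- trips %>%
  st_cast("LINESTRING") %>%
  mutate(geometry = st_line_sample(geometry, density = 1 / 50)) %>%  # MULTIPOINT per trip
  st_cast("POINT")  # explode to one row per point; trip attributes are carried along
```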
Figure 1. Original (left) vs. improved (right) ridership counts. Original counts created using ArcMap Generate Points Along Lines. Improved counts created using custom toolbox, Create Points from Lines, created by another user.
Our exploratory analysis was motivated by two questions: does adding a bike lane change ridership on a street, and which features best predict bikeshare ridership across the network?
Though it may seem intuitive that bike infrastructure such as bike lanes and cycle tracks makes riding a bike more appealing, the unique kind of data used in this project allows us to test that assumption. To explore whether bike ridership responds to changes in bike infrastructure, we performed a difference-in-difference analysis on several streets before and after bike lanes were added.
To explore the change in ridership on a particular street when a bike lane was added, we used a difference-in-difference approach. This exploratory method compares a treatment and a control group before and after an intervention. Both groups change over time, but the additional change in the treatment group is roughly attributable to the effect of the treatment.
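Formally, the difference-in-difference estimate is

$$\hat{\delta} = (Y_{treatment}^{post} - Y_{treatment}^{pre}) - (Y_{control}^{post} - Y_{control}^{pre}),$$

the change in trips on the treatment streets minus the change on the control streets over the same period. The bottom-right "Difference" cell in each table below is this estimate.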
For our purposes, difference-in-difference asks: is there a larger change in trips on streets with a new bike lane than on similar streets without one? We chose two areas of Manhattan in which to test this exploratory approach: East Harlem and Lower Manhattan. Both areas had new bike lanes added in 2018 and were served by the Citibike network.
Lower Manhattan saw the longest addition of protected bike lane in 2018, on 7th Avenue. We chose 3rd Avenue as the control because it was one of few parallel streets in the area without a bike lane.
Treatment: 7th Avenue | Control: 3rd Avenue |
---|---|
Type | Pre | Post | Difference |
---|---|---|---|
Treatment | 3786 | 8397 | 4611 |
Control | 1713 | 2903 | 1190 |
Difference | 2073 | 5494 | 3421 |
Our difference-in-difference results for Lower Manhattan show that the treatment group saw a larger increase in bikeshare trips than the control. This suggests that the new protected bike lane on 7th Avenue may have had a positive impact on ridership.
To test this approach in another neighborhood, we chose East Harlem because it is less dense than Lower Manhattan but had several bike lanes added in 2018. Unlike 7th Avenue, all of these additions were unprotected bike lanes. We chose four other side streets as controls.
Treatment: E 110th St | Control: E 109th St |
---|---|
Type | Pre | Post | Difference |
---|---|---|---|
Treatment | 178 | 332 | 154 |
Control | 152 | 407 | 255 |
Difference | 26 | -75 | -101 |
The results of the East Harlem difference-in-difference analysis show that the control group actually had a larger increase in trips than the treatment group and overtook it. Perhaps because ridership on these streets is in the hundreds rather than the thousands, a small change in actual riders can have a large effect on the outcome. These results could also suggest that unprotected bike lanes do not encourage bikeshare ridership as much as protected bike lanes. Further difference-in-difference analyses at other time points and in other regions of the city would be needed to clarify the interaction between new bike lanes and Citibike ridership.
Our exploratory difference-in-difference results are encouraging and may suggest that adding bike lanes, or at least protected bike lanes, to the network over time has helped create a more welcoming environment for cyclists. However, to build a scenario planning tool for our client, we needed to develop a predictive model that could estimate the change in ridership once a new bike lane is added.
We explored several features to investigate which were the best predictors of bikeshare ridership. We tested features of the street centerline data itself, such as road width and speed limit, as well as demographics from the American Community Survey and amenities from OpenStreetMap and NYC Open Data. The charts below show the correlation between each predictor and the dependent variable, Count.
We also analyzed these independent variables for multicollinearity, as shown in the plot below. The two most correlated features were median rent and the percentage of White residents. However, the correlation value between these two features was less than 0.8, so both were included.
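A minimal sketch of this check, assuming `info.Aug` is the engineered dataset from the appendix and using an illustrative subset of the numeric predictors:

```r
library(sf)
library(ggcorrplot)

# pairwise correlations among candidate predictors; we flagged pairs above ~0.8
num_vars <- st_drop_geometry(info.Aug)[, c("MedRent", "pWhite", "pSubway", "pNoVeh",
                                           "StreetWidt", "jobs_in_tract")]
ggcorrplot(cor(num_vars, use = "pairwise.complete.obs"),
           type = "lower", lab = TRUE)
```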
Our final model included the following features:
We modeled Citibike ridership on each street segment in our study area during the first two weeks of August 2020. We chose August 2020 because it is a warm month well suited to biking and was the first month in which Citibike stations were open in the Bronx, which we felt was important to include. Though bikeshare ridership trends shifted during the pandemic, we modeled total ridership over a two-week period, so changes in peak commuting patterns during COVID-19 were not a concern for our model.
Here we used the features selected during feature engineering to build machine learning models that predict bike counts on each road segment in our study area. We compare how well the models account for the variance in bike ridership and how well they predict unseen data, then choose the best-performing model to predict bike counts across the study area. The three models we used are:
Linear Model: an OLS linear regression. As noted in the exploratory analysis, the bike counts are not normally distributed, so when training the linear model we log-transformed the dependent variable to better satisfy the linear regression assumptions.
Poisson Model: a log-linear model especially suited to predicting count data. Because our data are counts with a large number of zeros and the residuals are not normally distributed, Poisson regression is a natural candidate.
Random Forest: An ensemble learning method that operates by constructing a multitude of decision trees.
We first split our data into a 75%/25% training and test set. We then performed 10-fold cross-validation to tune the parameters and calculate training accuracy, which tells us how well each model fits the training data. We also calculated test accuracy, which tells us how well each model predicts unseen data. Comparing the three models on these metrics determined which one we used to predict bike counts in the study area.
After tuning the parameters and training the models, we first plotted the MAE (Mean Absolute Error) and RMSE (Root Mean Square Error) on the training set using the optimized parameters for each model.
The random forest has the lowest MAE and RMSE of the three models, though the differences between models are modest. We then plotted "Actual vs. Predicted" scatter plots to see how each model performs on the training set.
All three models underpredict. Among them, linear regression looks the least predictive, while Poisson and random forest capture the trend using the selected features. The random forest generally predicts bike counts well because its predictions have relatively small variance.
Turning to the test set, the three models perform similarly to how they performed on the training data.
We also plotted the distribution of absolute errors on the test set. The first plot shows errors below 90 (relatively small errors): linear regression and random forest produce more very small errors (below 5), but all models have long tails, indicating some large errors.
The plot below shows the large errors (greater than 90). Here the random forest performs better, producing fewer large errors.
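These error distributions can be reproduced from the test-set predictions; a sketch using the `val_preds` data frame assembled in the modeling appendix, whose `absE` column holds the absolute error per segment:

```r
library(ggplot2)
library(dplyr)

# distribution of the smaller (< 90) absolute errors, by model
ggplot(filter(val_preds, absE < 90), aes(absE, fill = model)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(x = "Absolute error (trips)", y = "Segments", fill = "Model")
```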
We therefore chose the random forest to predict bike ride counts in our study area.
After predicting bike counts with the random forest, we analyzed the prediction errors.
We used our model to explore several scenarios where new bike lanes could be added in the future. Each scenario comprises several similar streets that could serve as alternatives to one another. For example, the Central Park Transverses scenario is made up of four streets that cross Central Park, none of which have bike lanes. Comparing the predicted increase in ridership across these alternatives could help planners decide which would be most impactful.
Scenario | Street | Length (Miles) | Median Ridership | Max Ridership |
---|---|---|---|---|
Brooklyn | ATLANTIC AVENUE | 6.6 | 4.0 | 68 |
Brooklyn | BUSHWICK AVENUE | 3.1 | 9.0 | 104 |
Brooklyn | FLATBUSH AVENUE | 6.5 | 7.5 | 47 |
Brooklyn | MEEKER AVENUE | 2.7 | 5.0 | 22 |
Brooklyn | MYRTLE AVENUE | 5.1 | 14.0 | 105 |
Central Park Transverses | 65 ST TRANSVERSE | 1.7 | 24.0 | 143 |
Central Park Transverses | 79 ST TRANSVERSE | 1.9 | 38.0 | 71 |
Central Park Transverses | 86 ST TRANSVERSE | 0.9 | 42.5 | 78 |
Central Park Transverses | 97 ST TRANSVERSE | 1.2 | 26.0 | 69 |
Upper Manhattan | ADAM CLAYTON POWELL JR BOULEVARD | 5.1 | 3.0 | 112 |
Upper Manhattan | AMSTERDAM AVENUE | 2.5 | 8.0 | 18 |
Upper Manhattan | LENOX AVENUE | 3.8 | 17.5 | 58 |
For each scenario, we predicted the change in ridership that would be expected if a protected or unprotected bike lane were added to each street. In the case of Upper Manhattan, several of the avenues already had an unprotected bike lane, so we only predicted the change in ridership expected from an upgrade to a protected bike lane.
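A minimal sketch of how one such prediction can be generated, reusing the fitted random forest (`rf_final_fit`), prepped recipe (`dat_recipe`), and modeling data (`data`) from the appendix; the scenario filter is illustrative:

```r
library(recipes)
library(dplyr)

# flip the bike lane level on the scenario's segments and re-predict
scenario_data <- data %>%
  filter(SegmentID %in% scenario1$SegmentID) %>%
  mutate(bikeLaneLv = "Protected")  # hypothetical new protected lane

scenario_data$.pred_new <- predict(rf_final_fit,
                                   new_data = bake(dat_recipe, scenario_data))$.pred

# percent change in predicted ridership relative to the baseline prediction
pct_change <- 100 * (sum(scenario_data$.pred_new) - sum(scenario_data$.pred)) /
  sum(scenario_data$.pred)
```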
All Central Park transverses were predicted to have higher growth in ridership under a protected lane than under an unprotected one. The unprotected lane predictions showed an increase of around 100% for all four streets, while the protected lane prediction for the 97 St Transverse was an increase of over 300%; under both lane types, the 97 St Transverse shows the highest percent increase in ridership. The street segments with the largest increases tend to fall toward the edges of the park, which reflects the spatial lag variable: ridership is low on the transverses themselves but high on the roads bordering the park and on the park's loop road.
In Upper Manhattan, we only predicted the impact of a protected bike lane since several streets already had an unprotected bike lane. Amsterdam Avenue has by far the greatest predicted increase in ridership, at over 300%, while the other two avenues are predicted to gain around 100% to 150%. Interestingly, the map shows that many segments on Lenox Avenue and Adam Clayton Powell Jr. Boulevard are predicted to lose ridership, although overall ridership on both streets was still predicted to increase.
In Brooklyn, our model predicted that Meeker Avenue would show the largest increase in ridership, at almost 400% with a protected bike lane and 200% with an unprotected one. Myrtle Avenue was projected to have the least growth, with Atlantic Avenue, Bushwick Avenue, and Flatbush Avenue falling in the middle. Due to its orientation, the entirety of Meeker Avenue sits in a more densely traveled part of the Citibike network, while the other streets in this scenario extend out to the edge of the study area. This may help explain the higher predictions for Meeker Avenue, though comparing percent growth in ridership was meant to normalize for such effects.
Though the model's performance leaves room for improvement, it still offers useful insight into predicted ridership for several bike planning scenarios in New York City. With further revision, these predictions could be made more accurate and extended to other streets in the city.
Calculating Ridership
# Calculating Ridership
# pre-processing steps:
## generate points from line trip data (see section 2.2)
## buffer street centerlines to desired distance (15ft in this analysis)
library(sf)
library(tidyverse)
library(mapview)
#read in data
points_new<-st_read('Aug1_14_points.shp')
extent <- st_read("station_extent_Aug2020.shp") %>%
st_transform(st_crs(points_new))
cd<-st_read('https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson') %>%
st_transform(st_crs(points_new))
cd_extent<-st_intersection(cd, extent)
#clip points to study area extent
points_new_extent<-st_intersection(points_new, extent)
#bring in roads
bike20d_buffer <- st_read("lion2020d_buffer15.shp") %>%
st_transform(st_crs(points_new))
# filter out non-bikeable LION segments (e.g. highways and ferry routes)
bike20d_buffer <- subset(bike20d_buffer,
                         !(RW_TYPE %in% c(2, 4, 7, 12, 14)) &
                           !(FeatureTyp %in% c(1, 2, 3, 5, 7, 8, 9)) &
                           !(SegmentTyp %in% c('E', 'G', 'F', 'T')))
#segment ID is not unique, so select distinct segment IDs
bike20d_buffer <- distinct(bike20d_buffer,SegmentID,.keep_all = T)
#clip street segments to study area
bike20d_buffer<- st_intersection(bike20d_buffer, extent)
save.image(file = "pointsClean_raw.RData")
#function that cleans points for one community district
cleaner<- function(cb){
cd_chosen<- cd %>%
filter(boro_cd==cb)
streets <- st_intersection(bike20d_buffer, cd_chosen)
points_extent <- st_intersection(points_new_extent, cd_chosen)
points_clean <- points_extent[streets,]%>%
mutate(pnt_id = seq.int(nrow(.))) %>%
st_join(., streets, join=st_intersects, left=T, largest=TRUE)
return(points_clean)
}
#run the cleaning function for each community district in the study area
cd_list <- c(
  #Manhattan
  '101','102','103','104','105','106','107','108','109','110','111','112','164',
  #Bronx
  '201','202','203','204',
  #Brooklyn
  '301','302','303','304','306','307','308','309','316','355',
  #Queens
  '401','402','405')
points_clean <- lapply(setNames(cd_list, cd_list), cleaner)
save.image(file = "pointsClean_all.RData")
#function for counting trips
countTrip <- function(pnt){
tripCount <- pnt %>%
as.data.frame() %>%
select(-geometry) %>%
group_by(FID_1,SegmentID) %>%
summarise() %>%
group_by(SegmentID) %>%
summarise(tripCount = n())
return(tripCount)
}
#count trips on each segment for every community district
counts <- lapply(points_clean, countTrip)
save.image(file = "pointsCounted.RData")
#join the counted trips to the road data, one community district at a time
info.Aug_list <- lapply(counts, function(ct) {
  bike20d_buffer %>%
    merge(., ct, by = 'SegmentID', all.y = TRUE) %>%
    rename(Count = tripCount)
})
#bind everything back together
all_test <- do.call(rbind, info.Aug_list)
glimpse(all_test)
#group by street and segment id and sum ridership (some streets are split by community districts)
all_grouped<-all_test%>%
group_by(Street, SegmentID)%>%
summarize(Count=sum(Count))
all_grouped<-st_set_geometry(all_grouped, NULL)
#join back to the original bike20d_buffer so segments with no trips are retained (Count = 0)
mergeCols <- c("Street", "SegmentID")
ridership<- bike20d_buffer%>%
merge(., all_grouped, by=mergeCols, all.x=T)
ridership$Count[is.na(ridership$Count)] <- 0
#export as RData
save.image(file = "ridership.RData")
Difference-in-Difference Analysis
#Difference-in-Difference
#pre-processing: select the street segments that will form each treatment and control group
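# a minimal sketch (not the original script) of the computation behind the
# difference-in-difference tables above; assumes 'treatment' and 'control' are
# segment-level ridership data frames with a 'period' column ("pre"/"post")
library(tidyverse)
did_table <- bind_rows(
  treatment %>% mutate(group = "Treatment"),
  control %>% mutate(group = "Control")) %>%
  group_by(group, period) %>%
  summarise(trips = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = period, values_from = trips) %>%
  mutate(difference = post - pre)
# DiD estimate: change in treatment minus change in control
did_estimate <- did_table$difference[did_table$group == "Treatment"] -
  did_table$difference[did_table$group == "Control"]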
Feature Engineering
#Feature Engineering
#pre-steps: download LEHD work area characteristics data
library(tidyverse)
library(dplyr)
library(sf)
library(lubridate)
library(mapview)
library(viridis)
library(ggplot2)
library(kableExtra)
library(stargazer)
library(FNN)
library(ggcorrplot)
library(spdep)
library(plotly)
library(car)
library(ranger)
library(tidycensus)
library(corpcor)
library(caret)
library(rjson)
library(tidyr)
library(sp)
library(osmdata)
load("ridership.RData")
#NYC boroughs
borough<- st_read('https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson')%>%
st_transform(st_crs(ridership))
boros_4<-subset(borough, boro_code<5)
#citibike stations
station<- st_read('Infrastructure/Citibike_Stations/Stations_Aug_2020.shp')%>%
st_transform(st_crs(ridership))
#Station Extent
extent <- st_read('OtherData/Station_Extent_2020/station_extent_Aug2020.shp')%>%
st_transform(st_crs(ridership))
#boroughs within study area
boroughs_clip<-st_intersection(borough, extent)
#neighborhoods
neighborhood <- st_read('https://data.cityofnewyork.us/resource/q2z5-ai38.geojson')%>%
st_transform(st_crs(ridership))
#only neighborhood in citibike extent
neighborhood <- neighborhood[extent,]
#set ridership to info.Aug
info.Aug<-ridership
#add bike lane level
info.Aug <- info.Aug %>%
mutate(
bikeLaneLv = case_when(
BikeLane == 1 | BikeLane == 5 | BikeLane == 8 | BikeLane == 9 | BikeLane == 10 ~ "Protected",
BikeLane == 2 | BikeLane == 3 | BikeLane == 4 | BikeLane == 6 | BikeLane == 11 ~ "Unprotected",
TRUE ~ "noBikeLane"
))
#select useful columns
info.Aug <- info.Aug %>%
select(Street,SegmentID, Count, FeatureTyp, SegmentTyp, NonPed,
TrafDir, SegCount, LBoro, XFrom, YFrom, RW_TYPE, Snow_Prior,
Number_Tra, BIKE_TRAFD,
StreetWidt, TRUCK_ROUT, POSTED_SPE,
bikeLaneLv, geometry)
# collect bike lane information
bikelanes <- info.Aug %>% filter(bikeLaneLv!="noBikeLane")
####=====Feature Engineering========####
#change projection
#info.Aug <- info.Aug %>% st_transform('ESRI:102711')
#bike20d_buffer <- bike20d_buffer %>% st_transform('ESRI:102711')
#bikelanes <- bikelanes %>% st_transform('ESRI:102711')
#station <- station %>% st_transform('ESRI:102711')
#join neighborhood names to each segment
info.Aug_nh <- st_join(info.Aug, neighborhood, join=st_intersects, left=TRUE,largest = TRUE)
info.Aug <- info.Aug_nh %>% select(-(ntacode:county_fips),-shape_leng, -boro_code)
#label segments that did not join to a neighborhood
info.Aug <- info.Aug %>%
mutate(
ntaname = case_when(
is.na(ntaname) ~ "Connecting Neighborhoods",
TRUE ~ ntaname
))
###Distance to nearest bikelanes
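#nn_function() below returns the average distance from each feature to its k
#nearest neighbors; it is not defined in this script. A minimal definition
#consistent with its use here (an assumption, built on FNN::get.knnx) is:
nn_function <- function(measureFrom, measureTo, k) {
  nn.dist <- FNN::get.knnx(as.matrix(measureTo), as.matrix(measureFrom), k)$nn.dist
  rowMeans(as.matrix(nn.dist))
}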
info.Aug <- info.Aug %>%
mutate(dist.lane = nn_function(st_coordinates(st_centroid(info.Aug$geometry)), st_coordinates(st_centroid(bikelanes$geometry)), 1))
#borough
info.Aug$isMH<-0
info.Aug$isMH[info.Aug$LBoro==1]<-1
#travel lanes
info.Aug$Number_Tra <- as.numeric(info.Aug$Number_Tra)
#citibike stations
#drop all columns but geometry
station<-station%>%
select()
info.Aug$citibike.Buffer_small <-
st_buffer(info.Aug, 250) %>%
aggregate(mutate(station, counter = 1),., sum) %>%
pull(counter)
info.Aug$citibike.Buffer_small[is.na(info.Aug$citibike.Buffer_small)] <- 0
#fix NAs
#compute medians for imputation
median_lanes <- median(info.Aug$Number_Tra, na.rm=TRUE)
median_width<- median(info.Aug$StreetWidt, na.rm = TRUE)
info.Aug$POSTED_SPE<- as.numeric(info.Aug$POSTED_SPE)
median_speed<-median(info.Aug$POSTED_SPE, na.rm=TRUE)
info.Aug$Number_Tra[is.na(info.Aug$Number_Tra)] <- median_lanes
info.Aug$StreetWidt[is.na(info.Aug$StreetWidt)] <- median_width
info.Aug$TRUCK_ROUT[is.na(info.Aug$TRUCK_ROUT)] <- 0
info.Aug$POSTED_SPE[is.na(info.Aug$POSTED_SPE)] <- median_speed
#### Census Features ####
#variables available at: https://api.census.gov/data/2019/acs/acs5/variables.html
v19 <- load_variables(2019, "acs5", cache = TRUE)
census_df <- data.frame(vars = c("B01003_001E",
'B02001_002E',
'B01001A_002E',
'B01001A_017E',
'B01002_001E',
'B19013_001E',
'B25031_001E',
'B25044_001E',
'B08006_003E',
'B08006_010E',
'B08131_001E',
'B08141_002E',
'B08141_001E',
'B08303_001E',
'B03001_003E',
'B26001_001E', #GROUP QUARTERS POPULATION
'B25007_012E',
#'B08201_001E',
'B25002_001E',
'B25002_003E',
'B08006_014E'),
colNames2 = c('TotPop',
'WhitePop',
'TotMale',
'TotFemale',
'MedianAge',
'MedHHInc',
'MedRent',
'Vehicles_Avail',
'Commute_DriveAlone',
'Commute_Subway',
'Travel_Time',
'No_Vehicle',
'Means_of_Transport_pop',
'Travel_Time_pop',
'LatinoPop',
'GroupQuarters',
'NumRenters',
#'AvgHHSize',
'TotUnits',
'VacUnits',
'Commute_Bike'),
stringsAsFactors = FALSE)
census_vars <- census_df$vars
census_colNames <- census_df$colNames
# Function for renaming columns after collecting census data
rename_census_cols <- function(x){
output <- x %>%
rename_at(vars(census_vars),
~ census_colNames)
output
}
#set census api key
census_key <- 'your census key here'
census_api_key(census_key, overwrite = TRUE, install = TRUE)
# Collect census data and geometries
Census_raw <- get_acs(geography = "tract",
variables = census_vars,
year = 2019,
state = "NY",
geometry = TRUE,
county = c("New York", "Kings", "Queens", "Bronx"),
output = "wide") %>%
rename_census_cols %>%
dplyr::select(GEOID,
geometry,
census_colNames) %>%
st_transform(2263)
Census_raw <- Census_raw %>%
  mutate(AREA = st_area(geometry),
         sq_mi = AREA / 27878400) #square feet to square miles (EPSG:2263 is in feet)
#fix NAs
colSums(is.na(Census_raw))
median_rent<-median(Census_raw$MedRent, na.rm=TRUE)
Census_raw$MedRent[is.na(Census_raw$MedRent)] <- median_rent
# calculate some vars
Census <- Census_raw %>%
st_transform(2263) %>%
mutate(pWhite = WhitePop / TotPop,
PopDens = TotPop/sq_mi,
Mean_Commute_Time = Travel_Time / Travel_Time_pop,
pSubway = Commute_Subway / Means_of_Transport_pop,
pDrive = Commute_DriveAlone/Means_of_Transport_pop,
pBike = Commute_Bike/Means_of_Transport_pop,
pNoVeh = No_Vehicle / TotPop)
#fix weird values in calculated vars
#set infinity values to NA
Census_geo<-Census%>%
select(GEOID)
Census_st<- st_set_geometry(Census, NULL)
Census_fix <- do.call(data.frame, lapply(Census_st, function(x) replace(x, is.infinite(x), NA)))
#fix NAs in calculated vars
colSums(is.na(Census_fix))
median_pWhite<-median(Census_fix$pWhite, na.rm=TRUE)
median_subway<-median(Census_fix$pSubway, na.rm=TRUE)
median_noveh<-median(Census_fix$pNoVeh, na.rm=TRUE)
median_rent<-median(Census_fix$MedRent, na.rm=TRUE)
Census_fix$pWhite[is.na(Census_fix$pWhite)] <- median_pWhite
Census_fix$pSubway[is.na(Census_fix$pSubway)] <- median_subway
Census_fix$pNoVeh[is.na(Census_fix$pNoVeh)] <- median_noveh
Census_fix$MedRent[is.na(Census_fix$MedRent)] <- median_rent
#if % greater than 1, set value equal to median
summary(Census_fix)
Census_fix$pSubway[Census_fix$pSubway>1.01] <- median_subway
#get geometry back
Census<- inner_join(Census_geo, Census_fix)
#select the variables of interest
#Census <- Census %>%
# select(GEOID, sq_mi,pWhite, pSubway, pNoVeh,MedRent)
#join with info.Aug
Census<-st_transform(Census, crs=st_crs(info.Aug))
info.Aug_census <- st_join(st_centroid(info.Aug), Census, left = TRUE)
summary(is.na(info.Aug_census))
summary(is.na(Census))
#You will find there are 16 records not joined to a census tract
#They were not joined because they fall just outside (near) the boundary of the tract polygons
#We will join them to the nearest census tract
colNms <- names(info.Aug_census)
ThoseNotJoined <- info.Aug_census[is.na(info.Aug_census$GEOID),] %>% select(all_of(colNms))
joinThemAgain <- st_join(ThoseNotJoined,Census,left=T,join = st_nearest_feature)
ThoseJoined <- info.Aug_census[!is.na(info.Aug_census$GEOID),]
info.Aug_census <- rbind(ThoseJoined,joinThemAgain)
rm(ThoseJoined,ThoseNotJoined,joinThemAgain)
info.Aug<-info.Aug_census
#### Environment Features ####
#make new dataset to play with
info.Aug_env<- info.Aug
#sidewalk cafes
sidewalk_cafe<- read.csv('https://data.cityofnewyork.us/resource/qcdj-rwhu.csv?$limit=2000')
sidewalk_cafe<-subset(sidewalk_cafe, lic_status =='Active')
sidewalk_cafe <- st_as_sf(sidewalk_cafe, coords=c('longitude', 'latitude'), crs=4326)
#average distance to the 3 nearest sidewalk cafes
sidewalk_cafe<-st_transform(sidewalk_cafe, crs=st_crs(info.Aug))
info.Aug_env<- info.Aug_env %>%
mutate(
sidewalk_caf_nn3 = nn_function(st_coordinates(st_centroid(info.Aug_env)), st_coordinates(sidewalk_cafe), 3))
#big parks
parks<- st_read('OtherData/parks/geo_export_8469ffba-1951-4e52-916c-c9c4dfa54c18.shp')
big_parks<-subset(parks, acres>100)
big_parks<-st_transform(big_parks, crs=st_crs(info.Aug_env))
info.Aug_env<- info.Aug_env %>%
mutate(bigPark = lengths(st_within(geometry, big_parks)))
info.Aug_env$bigPark[info.Aug_env$bigPark==2]<-1
#greenway
info.Aug_env$isGreenway<-ifelse(grepl('greenway', info.Aug_env$Street, ignore.case=T), 1, 0)
#distance to subway stations
subway<-st_read('OtherData/NYC_Subway/stops_nyc_subway_may2019.shp')
subway<-st_transform(subway, st_crs(info.Aug_env))
info.Aug_env<- info.Aug_env %>%
mutate(subway_nn5 = nn_function(st_coordinates(st_centroid(info.Aug_env)), st_coordinates(subway), 5))
#reset whole dataset to include these features
info.Aug<-info.Aug_env
# jobs
jobs_mh<-st_read('OtherData/LODES_WAC/manhattan/points_2018.shp')
jobs_bx<-st_read('OtherData/LODES_WAC/bronx/points_2018.shp')
jobs_bk<-st_read('OtherData/LODES_WAC/brooklyn/points_2018.shp')
jobs_qn<-st_read('OtherData/LODES_WAC/queens/points_2018.shp')
jobs<-rbind(jobs_mh, jobs_bx, jobs_bk, jobs_qn)
WAC <- jobs %>%
dplyr::select(id, c000) %>%
mutate(GEOID = as.character(substr(id, 1, 11))) %>%
group_by(GEOID) %>%
summarize(jobs_in_tract = sum(c000, na.rm = TRUE)) %>%
filter(GEOID %in% info.Aug$GEOID)
WAC<-st_set_geometry(WAC, NULL)
info.Aug_jobs<- left_join(info.Aug, WAC,by = "GEOID")
#get rid of NAs
info.Aug_jobs$jobs_in_tract[is.na(info.Aug_jobs$jobs_in_tract)] <- 0
info.Aug<-info.Aug_jobs
#### OSM Features ####
info.Aug_osm<-info.Aug
#retail
extent<- st_transform(extent, st_crs(info.Aug))
retail <- opq ("New York City USA") %>%
add_osm_feature(key = 'shop') %>%
osmdata_sf(.)
retail <- st_geometry(retail$osm_points) %>%
st_transform(st_crs(info.Aug)) %>%
st_sf() %>%
st_intersection(extent) %>%
mutate(Legend = 'Retail') %>%
dplyr::select(Legend, geometry)
ggplot()+geom_sf(data=retail)
#office
office <- opq ("New York City USA") %>%
add_osm_feature(key = 'office') %>%
osmdata_sf(.)
office <- st_geometry(office$osm_points) %>%
st_transform(st_crs(info.Aug)) %>%
st_sf() %>%
st_intersection(extent) %>%
mutate(Legend = 'Office') %>%
dplyr::select(Legend, geometry)
#add density features
# retail
retail_ct <- st_join(retail, Census,join = st_intersects, left = T) %>%
group_by(GEOID,sq_mi) %>%
summarise(count_retail= n())
retail_ct$density_retail <- retail_ct$count_retail/retail_ct$sq_mi
# office
office_ct <- st_join(office,Census,left = T) %>%
group_by(GEOID,sq_mi) %>%
summarise(count_office = n())
office_ct$density_office <- office_ct$count_office/office_ct$sq_mi
#join back to info.Aug
retail_ct<- retail_ct%>%
select(GEOID, count_retail, density_retail)%>%
st_set_geometry(NULL)
office_ct<- office_ct%>%
select(GEOID, count_office, density_office)%>%
st_set_geometry(NULL)
info.Aug_osm <- info.Aug_osm %>% left_join(retail_ct,by = "GEOID") %>%
left_join(office_ct, by = "GEOID")
info.Aug<-info.Aug_osm
#fix NAs
colSums(is.na(info.Aug))
## fill NAs in the POI features with 0 rather than the median (no amenities recorded means none nearby)
info.Aug$count_retail[is.na(info.Aug$count_retail)] <- 0
info.Aug$count_office[is.na(info.Aug$count_office)] <- 0
info.Aug$density_retail[is.na(info.Aug$density_retail)] <- 0
info.Aug$density_office[is.na(info.Aug$density_office)] <- 0
## Spatial lag
#helper that returns the index of the kth nearest segment (rather than a distance)
nn_idx <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.index[, k]
  return(nn)
}
#the 1st nearest neighbor is the segment itself, so neighbor k uses index k + 1
#make a copy to index neighbor counts from
info.Aug1 <- info.Aug
#store the indices of the 1st through 10th nearest neighboring segments
cent_xy <- st_coordinates(st_centroid(info.Aug$geometry))
for (i in 1:10) {
  info.Aug[[paste0("cidx", i)]] <- nn_idx(cent_xy, cent_xy, i + 1)
}
#extract each neighbor's trip count based on the stored indices
for (i in 1:10) {
  info.Aug[[paste0("c", i, "Count")]] <- info.Aug1$Count[info.Aug[[paste0("cidx", i)]]]
}
#spatial lag features: average of the k nearest neighbor counts after dropping
#the single highest and lowest values
trimmed_lag <- function(df, k) {
  counts <- as.matrix(st_drop_geometry(df)[paste0("c", 1:k, "Count")])
  (rowSums(counts) - apply(counts, 1, max) - apply(counts, 1, min)) / (k - 2)
}
info.Aug$avgCount10 <- trimmed_lag(info.Aug, 10)
info.Aug$avgCount7 <- trimmed_lag(info.Aug, 7)
info.Aug$avgCount5 <- trimmed_lag(info.Aug, 5)
Running the Model
#Running the Model
####====Regression====####
options(scipen = 999)
rm(list = ls())
#Package installs -------------------------------------------------------------
load.fun <- function(x) {
  x <- as.character(x)
  if (!requireNamespace(x, quietly = TRUE)) {
    print(paste0(x, " : not installed; installing"))
    install.packages(x)
  }
  require(x, character.only = TRUE)
}
########### Required Packages ###########
packages = c("bayesplot", "lme4", "rstan", "shinystan", "RcppEigen",
"tidyverse", "tidyr", "AmesHousing", "broom", "caret", "dials", "doParallel", "e1071", "earth",
"ggrepel", "glmnet", "ipred", "klaR", "kknn", "pROC", "rpart", "randomForest",
"sessioninfo", "tidymodels","ranger", "recipes", "workflows", "themis","xgboost",
"sf", "nngeo", "mapview","poissonreg",'gridExtra')
for (pkg in packages) {
  load.fun(pkg)
}
#read in data from feature engineering
load("C:/Users/xinti/Box/MUSA_800_Practicum/Data/RHistoryData/model_clean_xl03.RData")
rm(list=setdiff(ls(), "info.Aug"))
gc()
#drop extreme outlier segments before modeling
info.Aug <- info.Aug %>% filter(Count < 400)
selectedP <- c("bikeLaneLv" , 'dist.lane' , 'Number_Tra' , 'StreetWidt' , 'TRUCK_ROUT' , 'citibike.Buffer_small' , 'isMH' ,
'pWhite' , 'MedRent' , 'pNoVeh' , 'pSubway' ,
'sidewalk_caf_nn3' , 'bigPark' , 'isGreenway' , 'subway_nn5' ,
'jobs_in_tract','density_retail','density_office','ntaname',"avgCount5")
selectedF <- c("bikeLaneLv" , 'dist.lane' , 'Number_Tra' , 'StreetWidt' , 'TRUCK_ROUT' , 'citibike.Buffer_small' , 'isMH' ,
'pWhite' , 'MedRent' , 'pNoVeh' , 'pSubway' ,
'sidewalk_caf_nn3' , 'bigPark' , 'isGreenway' , 'subway_nn5' ,
'jobs_in_tract','density_retail','density_office','ntaname','Count',"avgCount5",'SegmentID')
#Check NAs
info.Aug.model <- info.Aug %>% st_drop_geometry() %>% dplyr::select(selectedF)
info.Aug.geo <- info.Aug %>% dplyr::select(SegmentID,geometry)
#Create dataset without NAs
data <- info.Aug.model %>%
drop_na()
#Set seed to make the analysis reproducible
set.seed(717)
theme_set(theme_bw())
"%!in%" <- Negate("%in%")
g <- glimpse
data$originalCount = data$Count
data$Count = data$Count + 1 #add 1 so log(Count) is defined for zero-count segments
data$isMH = as.character(data$isMH)
data$bigPark = as.character(data$bigPark)
data$isGreenway = as.character(data$isGreenway)
#Split the data into training and testing sets
data_split <- rsample::initial_split(data, strata = "Count", prop = 0.75)
train.set_wtID <- rsample::training(data_split)
test.set_wtID <- rsample::testing(data_split)
train.set <- train.set_wtID %>% dplyr::select(selectedP,Count,originalCount)
test.set <- test.set_wtID %>% dplyr::select(selectedP,Count,originalCount)
cv_splits <- rsample::vfold_cv(train.set, strata = "Count", v = 10)
print(cv_splits)
#Create recipe
model_rec_lm <- recipe(Count ~ bikeLaneLv + dist.lane + Number_Tra+StreetWidt+TRUCK_ROUT+citibike.Buffer_small+isMH+
pWhite+MedRent+pNoVeh+pSubway+
sidewalk_caf_nn3+bigPark+isGreenway+subway_nn5+
jobs_in_tract+density_retail+density_office+ntaname+avgCount5, data = train.set) %>%
update_role(ntaname, new_role = "ntaname") %>%
step_log(Count) %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_center(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) %>%
step_scale(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) #%>% #put on standard deviation scale
glimpse(model_rec_lm %>% prep() %>% juice())
model_rec <- recipe(originalCount ~ bikeLaneLv + dist.lane + Number_Tra+StreetWidt+TRUCK_ROUT+citibike.Buffer_small+isMH+
pWhite+MedRent+pNoVeh+pSubway+
sidewalk_caf_nn3+bigPark+isGreenway+subway_nn5+
jobs_in_tract+density_retail+density_office+ntaname+avgCount5, data = train.set) %>%
update_role(ntaname, new_role = "ntaname") %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_center(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) %>%
step_scale(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) #%>% #put on standard deviation scale
glimpse(model_rec %>% prep() %>% juice())
# library(poissonreg)
lm_plan <-
linear_reg() %>%
set_engine("lm")
ps_plan <-
poissonreg::poisson_reg() %>%
set_engine("glm")
rf_plan <- parsnip::rand_forest() %>%
parsnip::set_args(mtry = tune()) %>%
parsnip::set_args(min_n = tune()) %>%
parsnip::set_args(trees = 2000) %>%
parsnip::set_engine("ranger", importance = "impurity") %>%
parsnip::set_mode("regression")
# GridSearch
rf_grid <- expand.grid(mtry = c(3,6,9),
min_n = c(100,500,800))
#Create workflow
lm_wf <-
workflows::workflow() %>%
add_recipe(model_rec_lm) %>%
add_model(lm_plan)
ps_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(ps_plan)
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
lm_tuned <- lm_wf %>%
fit_resamples(.,
resamples = cv_splits,#cv_splits_geo
control = control,
metrics = metric_set(rmse, rsq))
ps_tuned <- ps_wf %>%
fit_resamples(.,
resamples = cv_splits,#cv_splits_geo
control = control,
metrics = metric_set(rmse, rsq))
rf_tuned <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits,#cv_splits_geo
grid = rf_grid,
control = control,
metrics = metric_set(rmse, rsq))
show_best(lm_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(ps_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned, metric = "rmse", n = 15, maximize = FALSE)
lm_best_params <- select_best(lm_tuned, metric = "rmse", maximize = FALSE)
ps_best_params <- select_best(ps_tuned, metric = "rmse", maximize = FALSE)
rf_best_params <- select_best(rf_tuned, metric = "rmse", maximize = FALSE)
# Pull best hyperparam preds from 10-fold cross validated predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)
ps_best_OOF_preds <- collect_predictions(ps_tuned)
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
filter(mtry == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])
lm_best_OOF_preds <- lm_best_OOF_preds %>% dplyr::select(-id,-.config) %>%
mutate(.pred = exp(.pred)-1,
Count = exp(Count)-1)
ps_best_OOF_preds <- ps_best_OOF_preds %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
rf_best_OOF_preds <- rf_best_OOF_preds %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
OOF_preds <- rbind(data.frame(lm_best_OOF_preds %>% dplyr::select(.pred,Count), model = "Linear Regression"),
data.frame(ps_best_OOF_preds %>% dplyr::select(.pred,Count), model = "Poisson"),
data.frame(rf_best_OOF_preds %>% dplyr::select(.pred,Count),model = "Random Forest")) %>%
group_by(model) %>%
mutate(
RMSE = yardstick::rmse_vec(Count, .pred),
MAE = yardstick::mae_vec(Count, .pred),
MAPE = yardstick::mape_vec((Count+1), (.pred+1))) %>%
ungroup()
# average error for each model
# RMSE
ggplot(data = OOF_preds %>%
dplyr::select(model, RMSE) %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = round(RMSE,1))) +
theme_bw()
# MAE
ggplot(data = OOF_preds %>%
dplyr::select(model, MAE) %>%
distinct() ,
aes(x = model, y = MAE, group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = round(MAE,1))) +
theme_bw()
# Scatter plots: Predicted vs Observed
ggplot(OOF_preds, aes(y=.pred , x = Count,group = model))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~model,ncol = 2)+
theme_bw()+
ylim(0,500)+
xlim(0,500)
##===============Predict the test set=============####
#Final workflow
lm_best_wf <- finalize_workflow(lm_wf, lm_best_params)
ps_best_wf <- finalize_workflow(ps_wf, ps_best_params)
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
lm_val_fit_geo <- lm_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
ps_val_fit_geo <- ps_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
rf_val_fit_geo <- rf_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
# collect test set predictions from last_fit model
lm_val_pred_geo <- collect_predictions(lm_val_fit_geo)
ps_val_pred_geo <- collect_predictions(ps_val_fit_geo)
rf_val_pred_geo <- collect_predictions(rf_val_fit_geo)
lm_val_pred_geo <- lm_val_pred_geo %>% mutate(.pred = exp(.pred)-1,
Count = exp(Count)-1)
ps_val_pred_geo <- ps_val_pred_geo %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
rf_val_pred_geo <- rf_val_pred_geo %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
# Aggregate test set predictions (they do not overlap with training prediction set, which is OOF_preds)
val_preds <- rbind(data.frame(dplyr::select(lm_val_pred_geo, .pred, Count), model = "lm"),
data.frame(dplyr::select(ps_val_pred_geo, .pred, Count), model = "PS"),
data.frame(dplyr::select(rf_val_pred_geo, .pred, Count), model = "RF")
) %>%
group_by(model) %>%
mutate(RMSE = yardstick::rmse_vec(Count, .pred),
MAE = yardstick::mae_vec(Count, .pred),
MAPE = yardstick::mape_vec((Count+1), (.pred+1)),
absE=abs(Count-.pred)) %>%
ungroup()
# MAE chart
ggplot(data = val_preds %>%
dplyr::select(model, MAE) %>%
distinct() ,
aes(x = model, y = MAE,group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = round(MAE,1))) +
theme_bw()
# Observed vs. Predicted on the test set
ggplot(val_preds, aes(y=.pred , x = Count,group = model))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~model,ncol = 2)+
theme_bw()+
ylim(0,500)+
xlim(0,500)
##===============Find where the model performs well=============####
# First extract work flow
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
# fit model on the whole dataset
rf_wflow_final_fit <- fit(rf_best_wf, data = data)
# extract recipe
dat_recipe <- pull_workflow_prepped_recipe(rf_wflow_final_fit)
# extract fit
rf_final_fit <- pull_workflow_fit(rf_wflow_final_fit)
# predict bike counts for the whole dataset
data$.pred <- predict(rf_final_fit,
new_data = bake(dat_recipe,data))$.pred
data <- data %>%
mutate(absE = abs(originalCount - .pred),# calculate absolute error
Count = Count - 1)
data$originalCount <- NULL
yardstick::mae_vec(data$Count, data$.pred)
ggplot()+geom_histogram(data = data,aes(absE),binwidth = 1,fill="#219ebc")
# observed vs. predicted by different street characteristics
ggplot(data, aes(y=.pred , x = Count,group = TRUCK_ROUT))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~TRUCK_ROUT,ncol = 2)+
theme_bw()+
ylim(0,500)+
xlim(0,500)+
labs(x = "Observed",
     y = "Predicted",
     title = "Observed vs. Predicted by truck route designation")
ggplot(data %>% filter(Number_Tra<6), aes(y=.pred , x = Count,group = Number_Tra))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~Number_Tra,ncol = 3)+
theme_bw()+
ylim(0,500)+
xlim(0,500)+
labs(x = "Observed",
y = "Predicted",
title = "Observed vs. Predicted on the testing set")
# join the geometry
data.geo <- data %>% left_join(info.Aug,by = "SegmentID") %>% st_as_sf()
ggplot() +
geom_sf(data = boroughs_clip, fill = "#D4D4D4") +
geom_sf(data = data.geo, aes(colour = absE)) +
scale_colour_viridis(direction = -1, discrete = FALSE, option="viridis")+
labs(title="Absolute predicting Error on each road segment") +
mapTheme()
# calculate error by neighborhoods
data.nta <- data.geo %>% st_drop_geometry() %>%
group_by(ntaname) %>%
summarise(MAE = mean(absE))
error.nta <- neighborhood %>% left_join(data.nta,by="ntaname")
ggplot()+
geom_sf(data = error.nta, aes(fill=MAE)) +
scale_fill_viridis(direction = 1, discrete = FALSE, option="viridis")+
labs(title="MAE by neighborhoods") + mapTheme()
# calculate error by boroughs
data.geo <- data.geo %>% st_transform(crs = 4326)
boroughs_clip <- boroughs_clip %>% st_transform(crs = 4326)
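# join on segment centroids so each road segment is assigned to exactly one borough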
data.boro <- st_join(st_centroid(data.geo),boroughs_clip,left=T)
data.boro <- data.boro %>% st_drop_geometry() %>%
group_by(boro_name) %>%
summarise(MAE = mean(absE))
error.boro <- boroughs_clip %>% left_join(data.boro,by="boro_name")
ggplot()+
geom_sf(data = error.boro, aes(fill=MAE)) +
scale_fill_viridis(direction = 1, discrete = FALSE, option="viridis")+
labs(title="MAE by neighborhoods") + mapTheme()
Scenario Predictions
##===============Scenario Predictions=============####
##Read Data
scenarios <- st_read("E:/NYCDOT/Scenarios/scenariosShort.shp")
neighborhoods <- st_read("E:/NYCDOT/NYCneighborhood/nynta.shp")
boro <- st_read("E:/NYCDOT/Borough Boundaries/geo_export_4c045526-dcb9-4e1a-b602-59b848e20e6a.shp")
boro <- boro %>%
filter(boro_name == "Brooklyn")
boro <- boro %>% st_transform(st_crs(scenarios))
boro2 <- st_read("E:/NYCDOT/Borough Boundaries/geo_export_4c045526-dcb9-4e1a-b602-59b848e20e6a.shp")
boro2 <- boro2 %>%
filter(boro_name == "Manhattan" | boro_name == "Bronx")
boro2 <- boro2 %>% st_transform(st_crs(scenarios))
##Find and Clean Range
scenario1_range <- neighborhoods %>% filter(NTACode == "MN99")
scenario1 <- scenarios[scenario1_range,] %>%
filter(!SegmentID %in% c("0137647", "0134012", "0150953", "0150951",
"0071072", "0071082", "0111014", "0071078",
"0108162", "9000012", "0071084", "9000011")) # drop hand-picked segments excluded from scenario 1
scenario_23 <- scenarios %>%
filter(! SegmentID %in% scenario1$SegmentID)
scenario2 <- scenario_23[boro2,]
scenario3 <- scenario_23 %>%
filter(! SegmentID %in% scenario2$SegmentID)
#check
mapview(list(scenario1, scenario2, scenario3), color = list("red", "blue", "purple"))
#Make Samples
sample1 <- info.Aug %>%
filter(SegmentID %in% scenario1$SegmentID)
sample2 <- info.Aug %>%
filter(SegmentID %in% scenario2$SegmentID)
sample3 <- info.Aug %>% filter(SegmentID %in% scenario3$SegmentID)
# scenario 1: keep only segments without an existing bike lane
sample1 <- sample1 %>%
filter(bikeLaneLv == "noBikeLane")
scenario1 <- scenario1 %>%
filter(SegmentID %in% sample1$SegmentID)
#check
mapview(sample2, zcol="bikeLaneLv")
mapview(sample3, zcol="bikeLaneLv")
sample3 <- sample3 %>%
filter(bikeLaneLv != "Protected")
scenario3 <- scenario3 %>%
filter(SegmentID %in% sample3$SegmentID)
# read buffered scenario geometries (used below to join predictions back to the road segments)
scenario1_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario1_Buffer.shp")
scenario2_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario2_Buffer.shp")
scenario3_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario3_Buffer.shp")
mapview(list(scenario1_buffer, scenario2_buffer, scenario3_buffer))
#### Predictions
#reference:https://hansjoerg.me/2020/02/09/tidymodels-for-machine-learning/
Brooklyn <- boro %>% st_transform(st_crs(info.Aug))
Manhattan <- boro2 %>%
filter(boro_name == "Manhattan") %>% st_transform(st_crs(info.Aug))
info.Brooklyn <- info.Aug[Brooklyn,]
info.Manhattan <- info.Aug[Manhattan,]
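# restrict each dataset to the model's feature set (selectedF, defined earlier)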
info.Bk_model <- info.Brooklyn %>% st_drop_geometry() %>% dplyr::select(selectedF)
info.M_model <- info.Manhattan %>% st_drop_geometry() %>% dplyr::select(selectedF)
sample1 <- sample1 %>% dplyr::select(selectedF)
sample2 <- sample2 %>% dplyr::select(selectedF)
sample3 <- sample3 %>% dplyr::select(selectedF)
#Predict scenario1
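# refit the best random forest on Manhattan segments only; this fit is used to predict scenarios 1 and 2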
rf_Manhattan_fit <- fit(rf_best_wf, data = info.M_model)
rf_final_Mfit <- pull_workflow_fit(rf_Manhattan_fit)
dia_rec3_M <- pull_workflow_prepped_recipe(rf_Manhattan_fit)
# counterfactual versions of sample 1: every segment assigned a protected or an unprotected lane
sample1_protect <- sample1 %>%
mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample1_unprotect <- sample1 %>%
mutate(bikeLaneLv = "Unprotected") %>% st_drop_geometry()
#make predictions
#Scenario1
sample1_protect$.pred <- predict(rf_final_Mfit,
new_data = bake(dia_rec3_M , sample1_protect))$.pred
sample1_unprotect$.pred <- predict(rf_final_Mfit,
new_data = bake(dia_rec3_M , sample1_unprotect))$.pred
# join predictions back and compute percent change relative to the observed Count
sample1$pre_protect <- sample1_protect$.pred
sample1 <- sample1 %>%
mutate(diff_protect = ((pre_protect - Count) / Count) * 100)
sample1$pre_unprotect <- sample1_unprotect$.pred
sample1 <- sample1 %>%
mutate(diff_unprotect = ((pre_unprotect - Count) / Count) * 100)
sample1.info <- sample1 %>%
dplyr::select(SegmentID, Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)
sample1.info <- sample1.info %>% st_transform(st_crs(scenario1))
#join back to road segment
scenario1_infonew <- st_join(scenario1_buffer %>% st_transform(st_crs(sample1.info)),
sample1.info, left=TRUE, join = st_intersects)
scenario1_infonew <- distinct(scenario1_infonew,SegmentID.x,.keep_all = T)
scenario1_infonew <- scenario1_infonew %>%
dplyr::select(Street, SegmentID.x,Count,pre_protect, diff_protect, pre_unprotect, diff_unprotect)
scenario1_infonew <- scenario1_infonew %>% st_transform(st_crs("EPSG:4326"))
names(scenario1_infonew)[2] <- "SegmentID"
st_write(scenario1_infonew, "scenario1.geojson")
##==Predict Scenario2
#make a protected counterfactual version of sample 2 data
sample2_protect <- sample2 %>%
mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample2_protect$.pred <- predict(rf_final_Mfit,
new_data = bake(dia_rec3_M ,sample2_protect))$.pred
sample2_protect <- sample2_protect %>%
mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # clamp zeros to avoid Inf in percent change
#join back and calculate diff
sample2$pre_protect <- sample2_protect$.pred
sample2$Count_c <- sample2_protect$Count_c
sample2 <- sample2 %>%
mutate(diff_protect = ((pre_protect - Count) / Count_c) * 100)
sample2.info <- sample2 %>%
dplyr::select(SegmentID, Count, pre_protect, diff_protect)
# join predictions back to the scenario road segments
scenario2_info <- st_join(scenario2_buffer %>% st_transform(st_crs(sample2.info)),
sample2.info, left=TRUE, join = st_intersects)
scenario2_info <- distinct(scenario2_info, SegmentID.x,.keep_all = T)
scenario2_info <- scenario2_info %>%
dplyr::select(Street, SegmentID.x,Count,pre_protect, diff_protect)
scenario2_info <- scenario2_info %>% st_transform(st_crs("EPSG:4326"))
names(scenario2_info)[2] <- "SegmentID"
mapview(scenario2_info)
st_write(scenario2_info, "scenario2.geojson")
##==Predict Scenario3
rf_Brooklyn_fit <- fit(rf_best_wf, data = info.Bk_model)
rf_final_Bfit <- pull_workflow_fit(rf_Brooklyn_fit)
dia_rec3_B <- pull_workflow_prepped_recipe(rf_Brooklyn_fit)
#make different versions of sample 3 data
mapview(sample3, zcol="bikeLaneLv")
sample3_protect <- sample3 %>%
mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample3_unprotect <- sample3 %>%
filter(bikeLaneLv != "Unprotected") %>%
mutate(bikeLaneLv = "Unprotected") %>% st_drop_geometry()
#make predictions
#Scenario3
sample3_protect$.pred <- predict(rf_final_Bfit,
new_data = bake(dia_rec3_B, sample3_protect))$.pred
sample3_unprotect$.pred <- predict(rf_final_Bfit,
new_data = bake(dia_rec3_B, sample3_unprotect))$.pred
sample3_protect <- sample3_protect %>%
mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # clamp zeros to avoid Inf in percent change
sample3_unprotect <- sample3_unprotect %>%
mutate(Count_c = ifelse(Count == 0, 0.01, Count))
#join back and calculate diff
sample3$pre_protect <- sample3_protect$.pred
sample3$Count_c <-sample3_protect$Count_c
sample3 <- sample3 %>%
mutate(diff_protect = ((pre_protect - Count) / Count_c) * 100)
sample3 <- merge(sample3, sample3_unprotect %>% dplyr::select(SegmentID, .pred),
by = "SegmentID", all.x = TRUE) %>%
rename(pre_unprotect = .pred) %>%
mutate(diff_unprotect = ((pre_unprotect - Count) / Count_c) * 100)
# segments skipped in the unprotected counterfactual get placeholder zeros
sample3$pre_unprotect[is.na(sample3$pre_unprotect)] <- 0
sample3$diff_unprotect[is.na(sample3$diff_unprotect)] <- 0
sample3.info <- sample3 %>%
dplyr::select(SegmentID, Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)
#change back to segment
scenario3_info <- st_join(scenario3_buffer %>% st_transform(st_crs(sample3.info)),
sample3.info, left=TRUE, join = st_intersects)
scenario3_info <- distinct(scenario3_info, SegmentID.x,.keep_all = T)
scenario3_info <- scenario3_info %>%
dplyr::select(Street, SegmentID.x,Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)
scenario3_info <- scenario3_info %>% st_transform(st_crs("EPSG:4326"))
names(scenario3_info)[2] <- "SegmentID"
mapview(scenario3_info)
st_write(scenario3_info, "scenario3.geojson")
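# Hedged sketch (not part of the original pipeline): the predict / clamp /
# percent-change steps repeat for each scenario, so a small helper could wrap
# them. Assumes each sample carries Count and bikeLaneLv columns and that a
# fitted parsnip model and prepped recipe are supplied, as above.
predict_scenario <- function(sample, lane_level, model_fit, recipe_prepped) {
  counterfactual <- sample %>%
    mutate(bikeLaneLv = lane_level) %>%   # assign the hypothetical lane type
    st_drop_geometry()
  preds <- predict(model_fit, new_data = bake(recipe_prepped, counterfactual))$.pred
  sample %>%
    mutate(.pred = preds,
           Count_c = ifelse(Count == 0, 0.01, Count),  # clamp zeros to avoid Inf
           diff = ((.pred - Count) / Count_c) * 100)   # percent change vs. observed
}
# e.g. predict_scenario(sample1, "Protected", rf_final_Mfit, dia_rec3_M)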