
1. Introduction

1.1 About this Project

This project developed a method of predicting bikeshare trips in New York City with emphasis on the effect of bicycle infrastructure, such as bike lanes. We used our modeling predictions to explore bike planning scenarios in New York City. This project is one of the first to use data capturing the unique routes taken by each bikeshare trip, so the methods used herein could be applied to other similar datasets. This document will outline the motivation for our project, exploratory analysis, feature engineering and model building, and the scenario planning tool displayed in our application.

This project was completed for the final capstone in the Master of Urban Spatial Analytics program at the University of Pennsylvania. We are very thankful to our instructors, Ken Steif, Michael Fichman, and Matt Harris, for their insight and patient guidance as we undertook this large task. We are also incredibly grateful and indebted to our clients, Kyle Yakal-Kremski, Apaar Bansal, and Paula Rubira from the New York City Department of Transportation, for their generosity in sharing such a unique dataset with a group of students and working with us through this process.

1.2 Motivation



Riding a bike in a city can be a harrowing task of dodging turning vehicles, weaving around parked cars and delivery trucks, and watching out for a car door that may open any moment. Bike lanes are an important way to carve out space for cyclists to improve the comfort and safety of riding. However, it can be a very challenging task to build broad support for bike lanes and decide which streets should receive bike lanes first. Our project provides a new way of predicting the impact of a bike lane based on unique ridership data from the Citibike system in New York City, in which people rent bikes from docked stations.

1.3 Use Case

A transportation planner’s task of deciding where to install a new bike lane is a challenging one. Planners must create a network throughout the city that allows cyclists to flow between destinations and protects them from the dangers of the road. Planners may prioritize streets for bike lane investments based on the number of collisions between cars and cyclists or the value of a certain road to the network. There are also strategic factors, such as a street repaving, that can present a convenient time for lane reconfiguration. It is also crucial to develop relationships in communities where new bike lanes will be installed to build support for the changes and ensure community members know how to use the new cycling infrastructure.




This project provides a unique way of prioritizing streets for bike lane investments based on observed bikeshare ridership. Predicting the impact of a new bike lane on bike ridership would give our client, the NYC DOT, an additional tool for deciding which streets to prioritize for bike lanes. Furthermore, if planners can explain to a community board which streets most need a bike lane, or quantify the impact of adding a more protected facility, it could help them advocate for the most needed changes or build relationships with groups who are more skeptical of bike lanes.

2. Data

2.1 Sources

To develop this model, we relied on two main data sources. New York City’s street centerlines dataset shows the type of bike lane on any given street and where the bike lane network has expanded over time. We combined this with trip data from the bikeshare system, Citibike, in which people can rent bikes at docked stations. To predict bikeshare ridership, we relied on additional data sources including American Community Survey 2015-2019 Estimates, Open Street Map amenity data, and additional amenity datasets from New York City Open Data.

Dataset                       Source
Citibike Trips                NYC Department of Transportation
LION Street Centerlines       Bytes of the Big Apple Archive
Citibike Stations             NYC Department of Transportation
Demographics                  American Community Survey 2015-2019 Estimates
Job Locations                 Longitudinal Employer-Household Dynamics
Office and Retail Amenities   OpenStreetMap
Additional Amenities          NYC Open Data

Trip Data

This project is unique because it relies on a very uncommon type of trip data that is not usually shared with the public. Most publicly available data on docked bikeshare systems is origin-destination data, meaning you know the stations where a trip starts and ends but not the route the cyclist took in between. Since 2018, Citibike has allowed riders to opt-in to having their trip tracked via their phone’s GPS. The resulting data, rather than being a pair of stations, is a line that shows the exact route a rider took. This allows much greater insight into the choices cyclists make when they ride and allows us to calculate the ridership of bikeshare trips on any given block.




Types of Bike Lanes

New York City uses three categories for recording bike lanes, which we have termed Protected Lanes, Unprotected Lanes, and Sharrows. The highest level of protection recorded in the data is a protected lane, which is separated from vehicle lanes by parked cars or another kind of buffer to create a dedicated space for cyclists. An unprotected lane is one in which the bike lane is painted onto the street immediately adjacent to the vehicle lane. Even though this is a bike lane, it offers less protection because vehicles can easily drift into it. The final category, sharrow, is used when streets are not wide enough for separate vehicle and bike lanes. The term sharrow refers to the painted symbol of a bicycle beneath arrows, meant to encourage drivers to share the lane with cyclists. Since a street with sharrows functions much like a street without a bike lane, we excluded sharrows from our category of bike lane.

Protected Lane Unprotected Lane Sharrow
protected unprotected sharrow

Study Area

Since the Citibike network does not extend throughout all of New York, we defined a study area based on the time frame used by our predictive model: August 2020. We chose this month because it captures the warm weather of summer and includes stations in the Bronx, which were only added to the Citibike network in July 2020. Our study area therefore comprises a half-mile buffer around the Citibike stations that were available in August 2020. It covers most of Manhattan and parts of the Bronx, Brooklyn, and Queens.

2.2 Data Cleaning

Converting individual bikeshare trips into useful ridership data is a very challenging task. Since this is such a novel analysis, recording our data cleaning process will be helpful to anyone pursuing a similar project in the future. The steps we pursued are as follows:

  1. From the NYC LION street centerlines, filter out non-bikeable streets (e.g. highways, tunnels, ferry routes)
  2. Create a 15 foot buffer around street centerlines
  3. Convert each line to points
  4. Spatial join each point to one street segment buffer
  5. Group by street segment and Trip ID to sum the number of trips per segment
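The steps above can be sketched with toy geometries (hypothetical coordinates and IDs; the real pipeline used the 15 ft LION buffers and the trip GPS points):

```r
library(sf)
library(dplyr)

# two toy street centerlines standing in for LION segments
seg <- st_sf(
  SegmentID = c("A", "B"),
  geometry = st_sfc(
    st_linestring(rbind(c(0, 0), c(10, 0))),
    st_linestring(rbind(c(0, 5), c(10, 5)))
  )
)

# step 2: buffer each centerline (15 ft in the real analysis)
seg_buf <- st_buffer(seg, dist = 1)

# step 3: GPS points from two trips (trip 2 touches both segments)
pts <- st_sf(
  TripID = c(1, 1, 1, 2, 2),
  geometry = st_sfc(
    st_point(c(1, 0)), st_point(c(5, 0)), st_point(c(9, 0)),
    st_point(c(5, 0)), st_point(c(5, 5))
  )
)

# step 4: spatially join each point to the segment buffer it falls in
pts_joined <- st_join(pts, seg_buf, join = st_intersects)

# step 5: each trip counts once per segment it touches
trip_counts <- pts_joined %>%
  st_drop_geometry() %>%
  distinct(SegmentID, TripID) %>%
  count(SegmentID, name = "tripCount")

trip_counts   # segment A: 2 trips, segment B: 1 trip
```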



Possible Improvements

Buffer Width
For ease of calculation, we chose a single buffer width of 15 feet. When buffers for different street segments overlap, assigning each point to only one of those buffers is more time-consuming in R. Using a buffer of 15 feet allowed us to capture most points and minimize overlap between buffers. However, many streets are more than 30 feet wide, so this method could miss some ridership. Using a variable buffer width based on street width could improve this calculation in the future.

Lines to Points
The built-in ArcMap tool Generate Points Along Lines is meant to place a point along each line at a specified interval. However, with a dataset of 100,000 trips, the resulting counts were very spatially inconsistent, with peaks of 300-400 riders on segments where no adjoining street had counts higher than 30 trips. This suggested that the tool was not generating points as documented. We performed a second points generation using a custom tool from a Stack Exchange post, which produced much more spatially consistent ridership counts. However, there are still some segments whose counts are several hundred trips higher than any adjoining segment. Future analyses should investigate this step and find the optimal way to generate points along lines, whether in ArcMap, ArcGIS Pro, or R.


Figure 1. Original (left) vs. improved (right) ridership counts. Original counts created using ArcMap's Generate Points Along Lines; improved counts created using Create Points from Lines, a custom toolbox shared by another user.

3. Exploratory Analysis

Our exploratory analysis was motivated by two questions:

  • What are the spatial trends in bikeshare ridership?
  • Do bike lanes influence bikeshare ridership?

3.2 Impact of Bike Lanes on Ridership

Though it may seem intuitive that bike infrastructure such as bike lanes and cycle tracks makes riding a bike more appealing, the unique kind of data used in this project allows us to test that assumption. To explore whether bike ridership responds to changes in bike infrastructure, we performed a difference-in-difference analysis on several streets before and after bike lanes were added.

Difference-in-difference is an exploratory method that compares a treatment group and a control group before and after an intervention. Both groups change over time, but the additional change in the treatment group is roughly attributed to the effect of the treatment.


diff chart


For our purposes, difference-in-difference asks: did trips change more on streets that received a new bike lane than on similar streets without one? We chose two areas of Manhattan in which to test this exploratory approach: East Harlem and Lower Manhattan. Both areas had new bike lanes added in 2018 and were served by the Citibike network.


Lower Manhattan

lower mh map


Lower Manhattan saw the longest protected bike lane addition of 2018, on 7th Avenue. We chose 3rd Avenue as the control because it was one of the few parallel streets in the area without a bike lane.


Treatment: 7th Avenue Control: 3rd Avenue
treatment1 img control1 img


LM chart
Lower Manhattan Difference in Difference Results

Type         Pre    Post   Difference
Treatment    3786   8397   4611
Control      1713   2903   1190
Difference   2073   5494   3421

Our difference in difference results for Lower Manhattan show that the treatment group saw a larger increase in bikeshare trips than the control. This suggests that the new protected bike lane on 7th Avenue may have had a positive impact on ridership.
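The table's difference-in-difference estimate can be reproduced with a few lines of base R:

```r
# Lower Manhattan trip totals before and after the 7th Avenue protected lane
pre  <- c(treatment = 3786, control = 1713)
post <- c(treatment = 8397, control = 2903)

change <- post - pre   # treatment: 4611, control: 1190
did <- change[["treatment"]] - change[["control"]]
did
# 3421: the extra growth on the treated street beyond the control's growth
```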


East Harlem

e harlem map


To test this approach in another neighborhood, we chose East Harlem because it is less dense than Lower Manhattan but had several bike lanes added in 2018. Unlike 7th Avenue, all of these additions were unprotected bike lanes. We chose four other side streets as controls.


Treatment: E 110th St Control: E 109th St
treatment1 img control1 img


EH chart
East Harlem Difference in Difference Results

Type         Pre    Post   Difference
Treatment    178    332    154
Control      152    407    255
Difference   26     -75    -101

The results of the East Harlem difference in difference analysis show that the control group actually had a larger increase in trips than the treatment group, overtaking it in total trips. Perhaps because ridership on these streets is in the hundreds rather than the thousands, a small change in actual riders can have a large effect on the outcome. These results could also suggest that unprotected bike lanes do not encourage bikeshare ridership as much as protected bike lanes do. Further difference in difference analyses at different time points and in other regions of the city would be needed to clarify the interaction between new bike lanes and Citibike ridership.

4. Feature Engineering

Our exploratory difference in difference results are encouraging and may suggest that adding bike lanes, or at least protected bike lanes, to the network over time has helped create a more welcoming environment for cyclists. However, to build a scenario planning tool for our client, we needed to develop a predictive model that would allow us to predict the change in ridership once a new bike lane was added.

Correlations

We explored several features to investigate which were the best predictors of bikeshare ridership. We tested features of the street centerline data itself, such as road width and speed limit, as well as demographics from the American Community Survey and amenities from OpenStreetMap and NYC Open Data. The charts below show the correlation between each predictor and the dependent variable, Count.

Street Network

Demographics

Amenities

Multicollinearity

We also analyzed these independent variables for multicollinearity, as shown in the plot below. The two most correlated features were median rent and the percentage of White residents. However, the correlation value between these two features was less than 0.8, so both were included.
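The screening rule can be sketched as follows (toy predictor values chosen for illustration; the real check ran over the model's full feature set):

```r
# flag predictor pairs whose absolute correlation exceeds the 0.8 threshold
X <- data.frame(
  med_rent  = c(1200, 1500, 2100, 2500, 3100),
  pct_white = c(0.30, 0.35, 0.55, 0.60, 0.72),
  st_width  = c(30, 44, 28, 60, 36)
)
cors <- cor(X)
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]])
# flags the med_rent / pct_white pair
```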

Final Features

Our final model included the following features:

  • Street Network
    • Bike Lane: Protected vs Unprotected vs None
    • Number of Vehicle Travel Lanes
    • Street Width
    • Truck Route
    • Greenway Yes/No
  • Bike Lane & Citibike Networks
    • Distance to the Nearest Bike Lane
    • Number of Citibike Stations: within 250 ft
  • Demographics
    • Median Rent
    • Percent White Residents
    • Percent Without a Vehicle
    • Percent who Commute by Subway
  • Amenities
    • Jobs in Census Tract
    • Office Density
    • Retail Density
    • Large Park: whether in a large park
    • Distance to Subway Station
    • Distance to Sidewalk Cafe
  • Geography
    • Neighborhood Tabulation Area
    • Manhattan: Yes/No
    • Spatial Lag: average trip count of 5 nearest street segments

5. Modeling

5.1 Model Basics

We modeled Citibike ridership on each street segment in our study area in the first two weeks of August 2020. We chose August 2020 because it was both a warm month good for biking and the first month in which Citibike stations were open in the Bronx, which we felt was important to include. Though bikeshare ridership trends have shifted during the pandemic, we modeled the total ridership over a two week period, so the changes in peak commuting patterns during the COVID-19 pandemic were not a concern for our model.

5.2 Model Building

In this section, we used the features selected during feature engineering to build machine learning models that predict bike counts on each road segment in our study area. We compare how well different models account for the variance in bike ridership and how well they predict unseen data, then choose the best-performing model to predict bike counts across the study area. The three models we used are:

  • Linear Model: an OLS linear regression. As noted in the exploratory analysis, the bike counts are not normally distributed, so when training the linear model we log-transformed the dependent variable to better conform to the assumptions of linear regression.

  • Poisson Model: a log-linear model especially suited to count data. Because our counts include a large number of zeros and the residuals are not normally distributed, a Poisson model may be a good choice.

  • Random Forest: An ensemble learning method that operates by constructing a multitude of decision trees.
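The first two model forms can be sketched in base R on toy data (the random forest used the ranger package with the full feature set; log1p is assumed here as one way to handle the zero counts):

```r
set.seed(7)
d <- data.frame(width = runif(200, 20, 60))          # toy predictor
d$Count <- rpois(200, lambda = exp(0.05 * d$width))  # synthetic segment counts

# OLS on the log-transformed counts (log1p keeps the zeros defined)
ols <- lm(log1p(Count) ~ width, data = d)

# Poisson log-linear model, suited to counts with many zeros
pois <- glm(Count ~ width, data = d, family = poisson)

# random forest (ranger package), e.g.:
# rf <- ranger::ranger(Count ~ width, data = d, num.trees = 500)

coef(pois)[["width"]]   # close to the true 0.05 used to simulate the counts
```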

We first split our data 80%/20% into a training set and a test set. We then performed 10-fold cross-validation to tune model parameters and calculate training accuracy, which tells us how well each model fits the training data. We also calculated test accuracy, which tells us how well each model predicts unseen data. By comparing the performance of the three models, we decided which one to use to predict bike counts in the study area.
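The split-and-fold setup can be sketched in base R on a hypothetical 1,000-segment dataset (in practice caret's cross-validation handles the fold assignment internally):

```r
set.seed(1)
n <- 1000                                       # pretend 1,000 street segments
train_idx <- sample(n, size = round(0.8 * n))   # 80% training set
test_idx  <- setdiff(seq_len(n), train_idx)     # 20% held-out test set

# assign each training row to one of 10 cross-validation folds
folds <- sample(rep(1:10, length.out = length(train_idx)))

c(train = length(train_idx), test = length(test_idx))   # 800 / 200
table(folds)                                            # 80 rows per fold
```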

5.3 Model Evaluation

After tuning the parameters and training the models, we first plotted the MAE (Mean Absolute Error) and RMSE (Root Mean Square Error) on the training set using the optimized parameters for each model.
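For reference, the two metrics can be computed directly (toy values for illustration):

```r
# MAE and RMSE as used to compare the models
mae  <- function(actual, pred) mean(abs(actual - pred))
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

actual <- c(10, 0, 25, 3)   # toy segment counts
pred   <- c(12, 1, 18, 3)
mae(actual, pred)    # 2.5
rmse(actual, pred)   # about 3.67
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently when a model makes a few very large mistakes.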

MAE_RMSE training

The random forest has the lowest MAE and RMSE of the three models, although the differences between models are small. We then plotted actual vs. predicted values to examine each model's performance on the training set.

scatter plot training

All three models tend to underpredict. Among them, linear regression looks least predictive, while Poisson and random forest capture the trend in the selected features. The random forest generally predicts bike counts best, with relatively small variance.

On the test set, the three models perform much as they did on the training set.

error analysis test set

We also plotted the distribution of absolute errors on the test set. The first plot shows absolute errors of less than 90 (relatively small errors). As shown below, linear regression and random forest produce more small errors of less than 5. However, all models have long tails, indicating some large errors.

distribution of error1

The plot below shows errors larger than 90. Here the random forest performs better, producing fewer large errors.

distribution of error2

We therefore chose the random forest to predict bike ride counts in our study area.

5.4 Prediction & Error Analysis

After predicting bike counts with the random forest, we analyzed the prediction errors.

error map and pred map
Most of the large errors are in Manhattan, which has the highest ridership. Interestingly, our model underpredicts bike counts on roads with heavy ridership; further study is needed to understand why.
error map each boro
The MAE by neighborhood plot also shows that the model performs worst in Manhattan, especially Lower Manhattan.
error map each nta


6. Scenarios

We used our model to explore several scenarios in which new bike lanes could be added in the future. Each scenario comprises several similar streets that could serve as alternatives to one another. For example, the Central Park Transverses scenario is made up of four streets that cross Central Park, none of which have bike lanes. Comparing the predicted increase in ridership across these alternatives could help planners decide which would be most impactful.


6.1 Scenarios Chosen

Streets Chosen for Each Scenario
Scenario                   Street                             Length (mi)   Median Ridership   Max Ridership
Brooklyn                   ATLANTIC AVENUE                    6.6           4.0                68
Brooklyn                   BUSHWICK AVENUE                    3.1           9.0                104
Brooklyn                   FLATBUSH AVENUE                    6.5           7.5                47
Brooklyn                   MEEKER AVENUE                      2.7           5.0                22
Brooklyn                   MYRTLE AVENUE                      5.1           14.0               105
Central Park Transverses   65 ST TRANSVERSE                   1.7           24.0               143
Central Park Transverses   79 ST TRANSVERSE                   1.9           38.0               71
Central Park Transverses   86 ST TRANSVERSE                   0.9           42.5               78
Central Park Transverses   97 ST TRANSVERSE                   1.2           26.0               69
Upper Manhattan            ADAM CLAYTON POWELL JR BOULEVARD   5.1           3.0                112
Upper Manhattan            AMSTERDAM AVENUE                   2.5           8.0                18
Upper Manhattan            LENOX AVENUE                       3.8           17.5               58


For each scenario, we predicted the change in ridership that would be expected if a protected or unprotected bike lane were added to each street. In the case of Upper Manhattan, several of the avenues already had an unprotected bike lane, so we only predicted the change in ridership expected from an upgrade to a protected bike lane.
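The scenario method, re-predicting each street with its lane type upgraded, can be sketched on toy data with a stand-in linear model (the project used the tuned random forest and the full feature set; the variable names here mirror the feature-engineering code):

```r
set.seed(3)
d <- data.frame(
  bikeLaneLv = sample(c("noBikeLane", "Unprotected", "Protected"), 300, replace = TRUE),
  width      = runif(300, 20, 60)
)
# synthetic counts in which protected lanes carry more trips
d$Count <- rpois(300, lambda = 10 + 8 * (d$bikeLaneLv == "Protected") +
                                3 * (d$bikeLaneLv == "Unprotected"))
fit <- lm(Count ~ bikeLaneLv + width, data = d)

# scenario: give the no-lane streets a protected lane and re-predict
baseline <- subset(d, bikeLaneLv == "noBikeLane")
upgraded <- transform(baseline, bikeLaneLv = "Protected")
pct_change <- 100 * (sum(predict(fit, upgraded)) - sum(predict(fit, baseline))) /
  sum(predict(fit, baseline))
pct_change   # predicted percent gain in ridership from the upgrade
```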


6.2 Predictions


Scenario 1: Central Park Transverses

All Central Park transverses were predicted to see higher ridership growth under a protected lane than an unprotected one. Unprotected lane predictions showed an increase of around 100% for all four streets, while under a protected lane the 97 St Transverse was predicted to gain over 300%; for both lane types, the 97 St Transverse shows the highest percent increase in ridership. The segments with the largest predicted increases tend to fall toward the edges of the park, reflecting the spatial lag variable: ridership is low on the transverses themselves but high on the roads bordering the park and on the park's loop road.


Protected
Unprotected



Scenario 2: Upper Manhattan

In Upper Manhattan, we only predicted the impact of a protected bike lane, since several streets already had an unprotected lane. Amsterdam Avenue is predicted to have by far the greatest increase in ridership, at over 300%, while the other two avenues are predicted to gain around 100% to 150%. Interestingly, the map shows that many segments on Lenox Avenue and Adam Clayton Powell Jr. Boulevard are predicted to lose ridership, although overall ridership on each avenue was still predicted to increase.



Scenario 3: Brooklyn

In Brooklyn, our model predicted that Meeker Avenue would show the largest increase in ridership, at almost 400% with a protected bike lane and 200% with an unprotected one. Myrtle Avenue was projected to have the least growth, with Atlantic Avenue, Bushwick Avenue, and Flatbush Avenue falling in the middle. Due to its orientation, the entire length of Meeker Avenue lies in a more densely traveled part of the Citibike network, while the other streets in this scenario extend out to the edge of the study area. This might help explain the higher predictions for Meeker Avenue, although comparing percent growth in ridership was meant to normalize for such effects.


Protected


sce 3 protect plot
scenario3 plot
Unprotected


sce 3 unprotect plot
scenario2 plot


Next Steps

Though the model's performance leaves room for improvement, it can still offer interesting insights into predicted ridership for several bike planning scenarios in New York City. With further revision, this type of analysis could be made more accurate and expanded to other streets in the city.


7. Code Appendix

Calculating Ridership

# Calculating Ridership
# pre-processing steps: 
## generate points from line trip data (see section 2.2)
## buffer street centerlines to desired distance (15ft in this analysis)

library(sf)
library(tidyverse)
library(mapview)

#read in data
points_new<-st_read('Aug1_14_points.shp')

extent <- st_read("station_extent_Aug2020.shp") %>%
  st_transform(st_crs(points_new))

cd<-st_read('https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson') %>%
  st_transform(st_crs(points_new))

cd_extent<-st_intersection(cd, extent)


#clip points to study area extent
points_new_extent<-st_intersection(points_new, extent)

#bring in roads
bike20d_buffer <- st_read("lion2020d_buffer15.shp") %>%
  st_transform(st_crs(points_new))

# filter out non-bikeable LION segments (e.g. highways and ferry routes)
bike20d_buffer <- subset(bike20d_buffer,
                         !(RW_TYPE %in% c(2, 4, 7, 12, 14)) &
                         !(FeatureTyp %in% c(1, 2, 3, 5, 7, 8, 9)) &
                         !(SegmentTyp %in% c('E', 'F', 'G', 'T')))

#segment ID is not unique, so select distinct segment IDs
bike20d_buffer <- distinct(bike20d_buffer,SegmentID,.keep_all = T)

#clip street segments to study area
bike20d_buffer<- st_intersection(bike20d_buffer, extent)

save.image(file = "pointsClean_raw.RData")


#function that cleans points for one community district
cleaner<- function(cb){
  cd_chosen<- cd %>%
    filter(boro_cd==cb)
  streets <- st_intersection(bike20d_buffer, cd_chosen)
  points_extent <- st_intersection(points_new_extent, cd_chosen)
  points_clean <- points_extent[streets,]%>%
    mutate(pnt_id = seq.int(nrow(.))) %>%
    st_join(., streets, join=st_intersects, left=T, largest=TRUE)
  return(points_clean)
}

#run the function for each cd
#Manhattan
points_clean_101<- cleaner('101')
points_clean_102<- cleaner('102')
points_clean_103<- cleaner('103')
points_clean_104<- cleaner('104')
points_clean_105<- cleaner('105')
points_clean_106<- cleaner('106')
points_clean_108<- cleaner('108')
points_clean_164<- cleaner('164')
points_clean_107<- cleaner('107')
points_clean_111<- cleaner('111')
points_clean_110<- cleaner('110')
points_clean_109<- cleaner('109')
points_clean_112<- cleaner('112')

save.image(file = "pointsClean_Manhattan.RData")

points_clean_201<- cleaner('201')
points_clean_202<- cleaner('202')
points_clean_203<- cleaner('203')
points_clean_204<- cleaner('204')

save.image(file = "pointsClean_MHBX.RData")


points_clean_301<- cleaner('301')
points_clean_302<- cleaner('302')
points_clean_303<- cleaner('303')
points_clean_304<- cleaner('304')
points_clean_306<- cleaner('306')
points_clean_308<- cleaner('308')
points_clean_307<- cleaner('307')
points_clean_355<- cleaner('355')
points_clean_309<- cleaner('309')
points_clean_316<- cleaner('316')

save.image(file = "pointsClean_MHBXBK.RData")

points_clean_405<- cleaner('405')
points_clean_402<- cleaner('402')
points_clean_401<- cleaner('401')

save.image(file = "pointsClean_MHBXBKQN.RData")


#function for counting trips: each trip (FID_1) counts once per segment
countTrip <- function(pnt){
  tripCount <- pnt %>%
    as.data.frame() %>%
    select(-geometry) %>%
    group_by(FID_1,SegmentID) %>%
    summarise() %>%   #collapse to one row per trip-segment pair
    group_by(SegmentID) %>%
    summarise(tripCount = n())
  return(tripCount)
}

#count trips in the previous dataframe
count_101 <- countTrip(points_clean_101)
count_102 <- countTrip(points_clean_102)
count_103 <- countTrip(points_clean_103)
count_104 <- countTrip(points_clean_104)
count_105 <- countTrip(points_clean_105)
count_106 <- countTrip(points_clean_106)


count_107 <- countTrip(points_clean_107)
count_108 <- countTrip(points_clean_108)
count_109 <- countTrip(points_clean_109)
count_110 <- countTrip(points_clean_110)
count_111 <- countTrip(points_clean_111)
count_112 <- countTrip(points_clean_112)
count_164 <- countTrip(points_clean_164)

count_301 <- countTrip(points_clean_301)
count_302 <- countTrip(points_clean_302)
count_303 <- countTrip(points_clean_303)
count_304 <- countTrip(points_clean_304)
count_306 <- countTrip(points_clean_306)
count_307 <- countTrip(points_clean_307)
count_308 <- countTrip(points_clean_308)
count_309 <- countTrip(points_clean_309)
count_316 <- countTrip(points_clean_316)
count_355 <- countTrip(points_clean_355)


count_201 <- countTrip(points_clean_201)
count_202 <- countTrip(points_clean_202)
count_203 <- countTrip(points_clean_203)
count_204 <- countTrip(points_clean_204)


count_401 <- countTrip(points_clean_401)
count_402 <- countTrip(points_clean_402)
count_405 <- countTrip(points_clean_405)

save.image(file = "pointsCounted.RData")


#join the counted trips to the road data
#helper to merge one community district's counts onto the buffered streets
joinCounts <- function(count_df){
  bike20d_buffer %>%
    merge(., count_df, by = 'SegmentID', all.y = T) %>%
    rename(Count = tripCount)
}

#apply to every community district and bind the results back together
count_list <- list(
  #Manhattan
  count_101, count_102, count_103, count_104, count_105, count_106,
  count_107, count_108, count_109, count_110, count_111, count_112,
  count_164,
  #Bronx
  count_201, count_202, count_203, count_204,
  #Brooklyn
  count_301, count_302, count_303, count_304, count_306, count_307,
  count_308, count_309, count_316, count_355,
  #Queens
  count_401, count_402, count_405)

all_test <- do.call(rbind, lapply(count_list, joinCounts))

glimpse(all_test)

#group by street and segment id and sum ridership (some streets are split by community districts)
all_grouped<-all_test%>%
  group_by(Street, SegmentID)%>%
  summarize(Count=sum(Count))
all_grouped<-st_set_geometry(all_grouped, NULL)


#join back to the original bike20d_buffer so that segments with zero trips are restored
mergeCols <- c("Street", "SegmentID")
ridership<- bike20d_buffer%>%
  merge(., all_grouped, by=mergeCols, all.x=T)
ridership$Count[is.na(ridership$Count)] <- 0

#export as RData
save.image(file = "ridership.RData")

Difference-in-Differences Analysis

#Difference-in-Differences
#pre-processing: select the street segments that will form the treatment and control groups
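The segment-selection code for this step is not shown here. As a minimal illustration of the estimator itself (all object and column names below are hypothetical and the data are simulated, not the project's ridership data), the difference-in-differences treatment effect is the coefficient on the treatment-by-post interaction:

```r
# Hedged sketch: difference-in-differences on simulated segment counts.
# Names are illustrative; the real analysis uses the treatment/control
# street segments selected in the pre-processing step above.
set.seed(717)
n_per_group <- 100
did_df <- data.frame(
  treated = rep(c(0, 1), each = n_per_group),  # 1 = segment received a bike lane
  post    = rep(c(0, 1), times = n_per_group)  # 1 = observed after installation
)
# simulate trip counts with a true treatment effect of +20 trips
did_df$Count <- 100 + 10 * did_df$treated + 5 * did_df$post +
  20 * did_df$treated * did_df$post + rnorm(2 * n_per_group, sd = 5)

# the interaction term estimates the treatment effect
did_fit <- lm(Count ~ treated * post, data = did_df)
coef(did_fit)["treated:post"]
```

With balanced groups, the interaction coefficient recovers the simulated +20 effect up to sampling noise.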

Feature Engineering

#Feature Engineering
#pre-steps: download LEHD work area characteristics data

library(tidyverse)
library(dplyr)
library(sf)
library(lubridate)
library(mapview)
library(viridis)
library(ggplot2)
library(kableExtra)
library(stargazer)
library(FNN)
library(ggcorrplot)
library(spdep)
library(plotly)
library(car)
library(ranger)
library(tidycensus)
library(corpcor)
library(caret)
library(rjson)
library(tidyr)
library(sp)
library(osmdata)

load("ridership.RData")

#NYC boroughs
borough<- st_read('https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson')%>%
  st_transform(st_crs(ridership))

boros_4<-subset(borough, boro_code<5)


#citibike stations
station<- st_read('Infrastructure/Citibike_Stations/Stations_Aug_2020.shp')%>%
  st_transform(st_crs(ridership))

#Station Extent
extent <- st_read('OtherData/Station_Extent_2020/station_extent_Aug2020.shp')%>%
  st_transform(st_crs(ridership))

#boroughs within study area
boroughs_clip<-st_intersection(borough, extent)


#neighborhoods
neighborhood <- st_read('https://data.cityofnewyork.us/resource/q2z5-ai38.geojson')%>%
  st_transform(st_crs(ridership))


#only neighborhoods within the citibike extent
neighborhood <- neighborhood[extent,]


#set ridership to info.Aug
info.Aug<-ridership

#add bike lane level
info.Aug <- info.Aug %>% 
  mutate(
    bikeLaneLv = case_when(
      BikeLane == 1 | BikeLane == 5 | BikeLane == 8 | BikeLane == 9 | BikeLane == 10 ~ "Protected",
      BikeLane == 2 | BikeLane == 3 | BikeLane == 4 | BikeLane == 6 | BikeLane == 11 ~ "Unprotected",
      TRUE ~ "noBikeLane"
    ))


#select useful columns
info.Aug <- info.Aug %>%
  select(Street,SegmentID, Count, FeatureTyp, SegmentTyp, NonPed,
         TrafDir, SegCount, LBoro, XFrom, YFrom, RW_TYPE, Snow_Prior,
         Number_Tra, BIKE_TRAFD, 
         StreetWidt, TRUCK_ROUT, POSTED_SPE,  
         bikeLaneLv, geometry)

# collect bike lane information
bikelanes <- info.Aug %>% filter(bikeLaneLv!="noBikeLane")


####=====Feature Engineering========####
#change projection
#info.Aug <- info.Aug %>% st_transform('ESRI:102711')
#bike20d_buffer <- bike20d_buffer %>% st_transform('ESRI:102711')
#bikelanes <- bikelanes %>% st_transform('ESRI:102711')
#station <- station %>% st_transform('ESRI:102711')

#Number of trips in each neighborhood
info.Aug_nh <- st_join(info.Aug, neighborhood, join=st_intersects, left=TRUE,largest = TRUE)
info.Aug <- info.Aug_nh %>% select(-(ntacode:county_fips),-shape_leng, -boro_code)


## label segments that did not join to a neighborhood
info.Aug <- info.Aug %>% 
  mutate(
    ntaname = case_when(
      is.na(ntaname) ~ "Connecting Neighborhoods",
      TRUE ~ ntaname
    ))

###Distance to nearest bikelanes
info.Aug <- info.Aug %>%
  mutate(dist.lane = nn_function(st_coordinates(st_centroid(info.Aug$geometry)), st_coordinates(st_centroid(bikelanes$geometry)), 1))


#borough
info.Aug$isMH<-0
info.Aug$isMH[info.Aug$LBoro==1]<-1

#travel lanes
info.Aug$Number_Tra <- as.numeric(info.Aug$Number_Tra)


#citibike stations
#drop all columns but geometry
station<-station%>%
  select()

View(info.Aug)

info.Aug$citibike.Buffer_small <-
  st_buffer(info.Aug, 250) %>% 
  aggregate(mutate(station, counter = 1),., sum) %>%
  pull(counter)
info.Aug$citibike.Buffer_small[is.na(info.Aug$citibike.Buffer_small)] <- 0

#fixNAs
#find averages
median_lanes <- median(info.Aug$Number_Tra, na.rm=TRUE)
median_width<- median(info.Aug$StreetWidt, na.rm = TRUE)
info.Aug$POSTED_SPE<- as.numeric(info.Aug$POSTED_SPE)
median_speed<-median(info.Aug$POSTED_SPE, na.rm=TRUE)


info.Aug$Number_Tra[is.na(info.Aug$Number_Tra)] <- median_lanes
info.Aug$StreetWidt[is.na(info.Aug$StreetWidt)] <- median_width
info.Aug$TRUCK_ROUT[is.na(info.Aug$TRUCK_ROUT)] <- 0
info.Aug$POSTED_SPE[is.na(info.Aug$POSTED_SPE)] <- median_speed


#### Census Features ####
#variables available at: https://api.census.gov/data/2019/acs/acs5/variables.html
v19 <- load_variables(2019, "acs5", cache = TRUE)

census_df <- data.frame(vars = c("B01003_001E",
                                 'B02001_002E',
                                 'B01001A_002E',
                                 'B01001A_017E',
                                 'B01002_001E',
                                 'B19013_001E',
                                 'B25031_001E',
                                 'B25044_001E',
                                 'B08006_003E',
                                 'B08006_010E',
                                 'B08131_001E',
                                 'B08141_002E',
                                 'B08141_001E',
                                 'B08303_001E',
                                 'B03001_003E',
                                 'B26001_001E', #GROUP QUARTERS POPULATION
                                 'B25007_012E',
                                 #'B08201_001E',
                                 'B25002_001E',
                                 'B25002_003E',
                                 'B08006_014E'),
                        colNames2 = c('TotPop',
                                      'WhitePop',
                                      'TotMale',
                                      'TotFemale',
                                      'MedianAge',
                                      'MedHHInc',
                                      'MedRent',
                                      'Vehicles_Avail',
                                      'Commute_DriveAlone',
                                      'Commute_Subway',
                                      'Travel_Time',
                                      'No_Vehicle',
                                      'Means_of_Transport_pop',
                                      'Travel_Time_pop',
                                      'LatinoPop',
                                      'GroupQuarters',
                                      'NumRenters',
                                      #'AvgHHSize',
                                      'TotUnits',
                                      'VacUnits',
                                      'Commute_Bike'),
                        stringsAsFactors = FALSE)

census_vars <- census_df$vars
census_colNames <- census_df$colNames2

# Function for renaming columns after collecting census data
rename_census_cols <- function(x){
  
  output <- x %>% 
    rename_at(vars(census_vars), 
              ~ census_colNames)
  
  output
}

#set census api key
census_key <- 'your census key here'
census_api_key(census_key, overwrite = TRUE, install = TRUE)

# Collect census data and geometries
Census_raw <- get_acs(geography = "tract", 
                      variables = census_vars, 
                      year = 2019, 
                      state = "NY", 
                      geometry = TRUE, 
                      county = c("New York", "Kings", "Queens", "Bronx"),
                      output = "wide") %>%
  rename_census_cols %>%
  dplyr::select(GEOID, 
                geometry,
                census_colNames) %>% 
  st_transform(2263)

Census_raw <- Census_raw %>%
  mutate(AREA = st_area(geometry),
         sq_mi = AREA / 27878400) #EPSG:2263 is in feet; 27,878,400 sq ft per sq mi

#fix NAs 
colSums(is.na(Census_raw))

median_rent<-median(Census_raw$MedRent, na.rm=TRUE)
Census_raw$MedRent[is.na(Census_raw$MedRent)] <- median_rent

# calculate some vars
Census <- Census_raw %>% 
  st_transform(2263) %>% 
  mutate(pWhite = WhitePop / TotPop,
         PopDens = TotPop/sq_mi,
         Mean_Commute_Time = Travel_Time / Travel_Time_pop,
         pSubway = Commute_Subway / Means_of_Transport_pop,
         pDrive = Commute_DriveAlone/Means_of_Transport_pop,
         pBike = Commute_Bike/Means_of_Transport_pop,
         pNoVeh = No_Vehicle / TotPop)

#fix weird values in calculated vars
#set infinity values to NA
Census_geo<-Census%>%
  select(GEOID)

Census_st<- st_set_geometry(Census, NULL)
Census_fix <- do.call(data.frame, lapply(Census_st, function(x) replace(x, is.infinite(x), NA)))


#fix NAs in calculated vars
colSums(is.na(Census_fix))

median_pWhite<-median(Census_fix$pWhite, na.rm=TRUE)
median_subway<-median(Census_fix$pSubway, na.rm=TRUE)
median_noveh<-median(Census_fix$pNoVeh, na.rm=TRUE)
median_rent<-median(Census_fix$MedRent, na.rm=TRUE)



Census_fix$pWhite[is.na(Census_fix$pWhite)] <- median_pWhite
Census_fix$pSubway[is.na(Census_fix$pSubway)] <- median_subway
Census_fix$pNoVeh[is.na(Census_fix$pNoVeh)] <- median_noveh
Census_fix$MedRent[is.na(Census_fix$MedRent)] <- median_rent


#if % greater than 1, set value equal to median
summary(Census_fix)
Census_fix$pSubway[Census_fix$pSubway>1.01] <- median_subway

#get geometry back
Census<- inner_join(Census_geo, Census_fix)

#select the variables of interest
#Census <- Census %>%
#  select(GEOID, sq_mi,pWhite, pSubway, pNoVeh,MedRent)

#join with info.Aug
Census<-st_transform(Census, crs=st_crs(info.Aug))
info.Aug_census<-st_join(st_centroid(info.Aug), left=TRUE, Census)

summary(is.na(info.Aug_census))
summary(is.na(Census))
#You will find there are 16 records not joined to a census tract
#They fail to join because their centroids fall just outside the boundary of the tract polygons
#We will join them to the nearest census tract instead
colNms <- names(info.Aug_census)
ThoseNotJoined <- info.Aug_census[is.na(info.Aug_census$GEOID),] %>% select(all_of(colNms))
joinThemAgain <- st_join(ThoseNotJoined,Census,left=T,join = st_nearest_feature)
ThoseJoined <- info.Aug_census[!is.na(info.Aug_census$GEOID),]
info.Aug_census <- rbind(ThoseJoined,joinThemAgain)

rm(ThoseJoined,ThoseNotJoined,joinThemAgain)
info.Aug<-info.Aug_census

#### Environment Features ####

#make new dataset to play with
info.Aug_env<- info.Aug

#sidewalk cafes
sidewalk_cafe<- read.csv('https://data.cityofnewyork.us/resource/qcdj-rwhu.csv?$limit=2000')
sidewalk_cafe<-subset(sidewalk_cafe, lic_status =='Active')
sidewalk_cafe <- st_as_sf(sidewalk_cafe, coords=c('longitude', 'latitude'), crs=4326)

#nn version
sidewalk_cafe<-st_transform(sidewalk_cafe, crs=st_crs(info.Aug))

info.Aug_env<- info.Aug_env %>%
  mutate(
    sidewalk_caf_nn3 = nn_function(st_coordinates(st_centroid(info.Aug_env)), st_coordinates(sidewalk_cafe), 3))



#big parks
parks<- st_read('OtherData/parks/geo_export_8469ffba-1951-4e52-916c-c9c4dfa54c18.shp')

big_parks<-subset(parks, acres>100)

big_parks<-st_transform(big_parks, crs=st_crs(info.Aug_env))

info.Aug_env<- info.Aug_env %>% 
  mutate(bigPark = lengths(st_within(geometry, big_parks)))
info.Aug_env$bigPark[info.Aug_env$bigPark==2]<-1


#greenway
info.Aug_env$isGreenway<-ifelse(grepl('greenway', info.Aug_env$Street, ignore.case=T), 1, 0)

#distance to subway stations
subway<-st_read('OtherData/NYC_Subway/stops_nyc_subway_may2019.shp')

subway<-st_transform(subway, st_crs(info.Aug_env))


info.Aug_env<- info.Aug_env %>%
  mutate(subway_nn5 = nn_function(st_coordinates(st_centroid(info.Aug_env)), st_coordinates(subway), 5))


#reset whole dataset to include these features
info.Aug<-info.Aug_env

# jobs
jobs_mh<-st_read('OtherData/LODES_WAC/manhattan/points_2018.shp')
jobs_bx<-st_read('OtherData/LODES_WAC/bronx/points_2018.shp')
jobs_bk<-st_read('OtherData/LODES_WAC/brooklyn/points_2018.shp')
jobs_qn<-st_read('OtherData/LODES_WAC/queens/points_2018.shp')


jobs<-rbind(jobs_mh, jobs_bx, jobs_bk, jobs_qn)

WAC <- jobs %>% 
  dplyr::select(id, c000) %>% 
  mutate(GEOID = as.character(substr(id, 1, 11))) %>% 
  group_by(GEOID) %>% 
  summarize(jobs_in_tract = sum(c000, na.rm = TRUE)) %>% 
  filter(GEOID %in% info.Aug$GEOID)

WAC<-st_set_geometry(WAC, NULL)

info.Aug_jobs<- left_join(info.Aug, WAC,by = "GEOID")

#get rid of NAs
info.Aug_jobs$jobs_in_tract[is.na(info.Aug_jobs$jobs_in_tract)] <- 0

info.Aug<-info.Aug_jobs

#### OSM Features ####
info.Aug_osm<-info.Aug

#retail
extent<- st_transform(extent, st_crs(info.Aug))

retail <- opq ("New York City USA") %>%
  add_osm_feature(key = 'shop') %>%
  osmdata_sf(.)

retail <- st_geometry(retail$osm_points) %>%
  st_transform(st_crs(info.Aug)) %>%
  st_sf() %>%
  st_intersection(extent) %>%
  mutate(Legend = 'Retail') %>%
  dplyr::select(Legend, geometry)

ggplot()+geom_sf(data=retail)

#office
office <- opq ("New York City USA") %>%
  add_osm_feature(key = 'office') %>%
  osmdata_sf(.)

office <- st_geometry(office$osm_points) %>%
  st_transform(st_crs(info.Aug)) %>%
  st_sf() %>%
  st_intersection(extent) %>%
  mutate(Legend = 'Office') %>%
  dplyr::select(Legend, geometry)

#add density features
# retail
retail_ct <- st_join(retail, Census,join = st_intersects, left = T) %>%
  group_by(GEOID,sq_mi) %>%
  summarise(count_retail= n())

retail_ct$density_retail <- retail_ct$count_retail/retail_ct$sq_mi

# office
office_ct <- st_join(office,Census,left = T) %>% 
  group_by(GEOID,sq_mi) %>% 
  summarise(count_office = n())

office_ct$density_office <- office_ct$count_office/office_ct$sq_mi

#join back to info.Aug
retail_ct<- retail_ct%>%
  select(GEOID, count_retail, density_retail)%>%
  st_set_geometry(NULL)

office_ct<- office_ct%>%
  select(GEOID, count_office, density_office)%>%
  st_set_geometry(NULL)

info.Aug_osm <- info.Aug_osm %>% left_join(retail_ct,by = "GEOID") %>% 
  left_join(office_ct, by = "GEOID")

info.Aug<-info.Aug_osm

#fix NAs
colSums(is.na(info.Aug))

## fill NAs in the POI features with 0 instead of the median
info.Aug$count_retail[is.na(info.Aug$count_retail)] <- 0
info.Aug$count_office[is.na(info.Aug$count_office)] <- 0
info.Aug$density_retail[is.na(info.Aug$density_retail)] <- 0
info.Aug$density_office[is.na(info.Aug$density_office)] <- 0
                                         
                                         
## Spatial lag
                                         
#function to get the index of the k-th closest feature
nn_idx <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.index[, k]
}
 
#the first nearest neighbor of a segment is the segment itself, so k starts at 2
#make a copy
info.Aug1 <- info.Aug
                                         
#indices of the 1st through 10th nearest other segments (k = 2..11 skips the segment itself)
segment_xy <- st_coordinates(st_centroid(info.Aug$geometry))
for (k in 2:11) {
  info.Aug[[paste0("cidx", k - 1)]] <- nn_idx(segment_xy, segment_xy, k)
}
                                         

                                         
#extract the neighbors' trip counts based on the indices
for (k in 1:10) {
  info.Aug[[paste0("c", k, "Count")]] <- info.Aug1$Count[info.Aug[[paste0("cidx", k)]]]
}
                                         
#spatial lag features: trimmed means of the nearest neighbors' counts,
#dropping the single highest and lowest value before averaging
nbr_counts <- sapply(1:10, function(k) info.Aug[[paste0("c", k, "Count")]])
trimmed_mean <- function(m) {
  apply(m, 1, function(x) (sum(x) - max(x) - min(x)) / (length(x) - 2))
}

info.Aug$avgCount10 <- trimmed_mean(nbr_counts[, 1:10])
info.Aug$avgCount7  <- trimmed_mean(nbr_counts[, 1:7])
info.Aug$avgCount5  <- trimmed_mean(nbr_counts[, 1:5])

Running the Model

#Running the Model
####====Regression====####

options(scipen = 999)
rm(list = ls())

#Package installs -------------------------------------------------------------
load.fun <- function(x) { 
  x <- as.character(x) 
  if (isTRUE(x %in% .packages(all.available = TRUE))) { 
    print(paste0(x, " : already installed; requiring"))
    require(x, character.only = TRUE)
  } else { 
    print(paste0(x, " : not installed; installing"))
    install.packages(x)
    print(paste0(x, " : installed and requiring"))
    require(x, character.only = TRUE)
  } 
} 

########### Required Packages ###########
packages = c("bayesplot", "lme4", "rstan", "shinystan", "RcppEigen",
             "tidyverse", "tidyr", "AmesHousing", "broom", "caret", "dials", "doParallel", "e1071", "earth",
             "ggrepel", "glmnet", "ipred", "klaR", "kknn", "pROC", "rpart", "randomForest",
             "sessioninfo", "tidymodels","ranger", "recipes", "workflows", "themis","xgboost",
             "sf", "nngeo", "mapview","poissonreg",'gridExtra')

for(i in seq_along(packages)){
  pkg <- as.character(packages[i])
  load.fun(pkg)
}


#read in data from feature engineering
load("C:/Users/xinti/Box/MUSA_800_Practicum/Data/RHistoryData/model_clean_xl03.RData")


rm(list=setdiff(ls(), "info.Aug"))
gc()
info.Aug <- info.Aug %>% filter(Count<400)
selectedP <- c("bikeLaneLv" , 'dist.lane' , 'Number_Tra' , 'StreetWidt' , 'TRUCK_ROUT' , 'citibike.Buffer_small' , 'isMH' ,
               'pWhite' , 'MedRent' , 'pNoVeh' , 'pSubway' , 
               'sidewalk_caf_nn3'  , 'bigPark' , 'isGreenway' , 'subway_nn5' ,
               'jobs_in_tract','density_retail','density_office','ntaname',"avgCount5")
selectedF <- c("bikeLaneLv" , 'dist.lane' , 'Number_Tra' , 'StreetWidt' , 'TRUCK_ROUT' , 'citibike.Buffer_small' , 'isMH' ,
               'pWhite' , 'MedRent' , 'pNoVeh' , 'pSubway' , 
               'sidewalk_caf_nn3'  , 'bigPark' , 'isGreenway' , 'subway_nn5' ,
               'jobs_in_tract','density_retail','density_office','ntaname','Count',"avgCount5",'SegmentID')
#Check NAs
info.Aug.model <- info.Aug %>% st_drop_geometry() %>% dplyr::select(all_of(selectedF))
info.Aug.geo <- info.Aug %>% dplyr::select(SegmentID,geometry)


#Create dataset without NAs
data <- info.Aug.model %>%
  drop_na()
#Set seed to make the analysis reproducible
set.seed(717)

theme_set(theme_bw())

"%!in%" <- Negate("%in%")
g <- glimpse

data$originalCount = data$Count
data$Count = data$Count+1
data$isMH = as.character(data$isMH)
data$bigPark = as.character(data$bigPark)
data$isGreenway = as.character(data$isGreenway)

#Split the data into training and testing sets
data_split <- rsample::initial_split(data, strata = "Count", prop = 0.75)
train.set_wtID <- rsample::training(data_split)
test.set_wtID  <- rsample::testing(data_split)

train.set <- train.set_wtID %>% dplyr::select(all_of(selectedP),Count,originalCount) 
test.set <- test.set_wtID %>% dplyr::select(all_of(selectedP),Count,originalCount) 


cv_splits <- rsample::vfold_cv(train.set, strata = "Count", v = 10)
print(cv_splits)


#Create recipe
model_rec_lm <- recipe(Count ~ bikeLaneLv + dist.lane + Number_Tra+StreetWidt+TRUCK_ROUT+citibike.Buffer_small+isMH+
                       pWhite+MedRent+pNoVeh+pSubway+
                       sidewalk_caf_nn3+bigPark+isGreenway+subway_nn5+
                       jobs_in_tract+density_retail+density_office+ntaname+avgCount5, data = train.set) %>% 
  update_role(ntaname, new_role = "ntaname") %>% 
  step_log(Count) %>% 
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
              subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) %>%
  step_scale(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
             subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) #%>% #put on standard deviation scale

glimpse(model_rec_lm %>% prep() %>% juice())


model_rec <- recipe(originalCount ~ bikeLaneLv + dist.lane + Number_Tra+StreetWidt+TRUCK_ROUT+citibike.Buffer_small+isMH+
                      pWhite+MedRent+pNoVeh+pSubway+
                      sidewalk_caf_nn3+bigPark+isGreenway+subway_nn5+
                      jobs_in_tract+density_retail+density_office+ntaname+avgCount5, data = train.set) %>% 
  update_role(ntaname, new_role = "ntaname") %>% 
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
              subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) %>%
  step_scale(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
             subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) #%>% #put on standard deviation scale

glimpse(model_rec %>% prep() %>% juice())


# library(poissonreg)

lm_plan <- 
  linear_reg() %>% 
  set_engine("lm")


ps_plan <- 
  poissonreg::poisson_reg() %>% 
  set_engine("glm")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 2000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")



# GridSearch
rf_grid <- expand.grid(mtry = c(3,6,9), 
                       min_n = c(100,500,800))


#Create workflow
lm_wf <-
  workflows::workflow() %>% 
  add_recipe(model_rec_lm) %>% 
  add_model(lm_plan)

ps_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(ps_plan)

rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)

# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)



lm_tuned <- lm_wf %>%
  fit_resamples(.,
                resamples = cv_splits,#cv_splits_geo
                control   = control,
                metrics   = metric_set(rmse, rsq))

ps_tuned <- ps_wf %>%
  fit_resamples(.,
            resamples = cv_splits,#cv_splits_geo
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits,#cv_splits_geo
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))


show_best(lm_tuned, metric = "rmse", n = 15)
show_best(ps_tuned, metric = "rmse", n = 15)
show_best(rf_tuned, metric = "rmse", n = 15)


lm_best_params     <- select_best(lm_tuned, metric = "rmse")
ps_best_params     <- select_best(ps_tuned, metric = "rmse")
rf_best_params     <- select_best(rf_tuned, metric = "rmse")


# Pull best hyperparam preds from 10-fold cross validated predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned) 
ps_best_OOF_preds <- collect_predictions(ps_tuned) 
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>% 
  filter(mtry  == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])


lm_best_OOF_preds <- lm_best_OOF_preds %>% dplyr::select(-id,-.config) %>% 
  mutate(.pred = exp(.pred)-1,
         Count = exp(Count)-1)

ps_best_OOF_preds <- ps_best_OOF_preds %>% dplyr::select(-id,-.config) %>% 
  rename(Count = originalCount)

rf_best_OOF_preds <- rf_best_OOF_preds %>% dplyr::select(-id,-.config) %>% 
  rename(Count = originalCount)


OOF_preds <- rbind(data.frame(lm_best_OOF_preds %>% dplyr::select(.pred,Count), model = "Linear Regression"),
                   data.frame(ps_best_OOF_preds %>% dplyr::select(.pred,Count), model = "Poisson"),
                   data.frame(rf_best_OOF_preds %>% dplyr::select(.pred,Count),model = "Random Forest")) %>% 
  group_by(model) %>% 
  mutate(
    RMSE = yardstick::rmse_vec(Count, .pred),
    MAE  = yardstick::mae_vec(Count, .pred),
    MAPE = yardstick::mape_vec((Count+1), (.pred+1))) %>% 
  ungroup()
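The RMSE, MAE, and MAPE columns above are computed with yardstick; as a sanity check on what those columns mean, the same metrics can be written out in base R (the `obs`/`pred` vectors below are toy values, illustrative only):

```r
# standard regression error metrics, written out by hand
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))
mape <- function(obs, pred) mean(abs((obs - pred) / obs)) * 100

obs  <- c(10, 20, 30)
pred <- c(12, 18, 33)
round(rmse(obs, pred), 2)  # 2.38
round(mae(obs, pred), 2)   # 2.33
round(mape(obs, pred), 2)  # 13.33
```

This is also why the MAPE call above adds 1 to both vectors: many segments have an observed count of 0, and MAPE divides by the observed value.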


# average error for each model
# RMSE
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "red") +
  geom_label(aes(label = round(RMSE,1))) +
  theme_bw()

# MAE
ggplot(data = OOF_preds %>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "red") +
  geom_label(aes(label = round(MAE,1))) + 
  theme_bw() 


# Scatter plots: Predicted vs Observed
ggplot(OOF_preds, aes(y=.pred , x = Count,group = model))+ 
  geom_point(alpha = 0.3) +
  coord_equal() +
  geom_abline(linetype = "dashed",color = "red") +
  geom_smooth(method="lm", color = "blue") +
  facet_wrap(~model,ncol = 2)+
  theme_bw()+
  ylim(0,500)+
  xlim(0,500)



##===============Predict the test set=============####

#Final workflow
lm_best_wf     <- finalize_workflow(lm_wf, lm_best_params)
ps_best_wf     <- finalize_workflow(ps_wf, ps_best_params)
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)


lm_val_fit_geo <- lm_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(rmse, rsq))

ps_val_fit_geo <- ps_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo <- rf_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(rmse, rsq))


# collect test set predictions from last_fit model
lm_val_pred_geo     <- collect_predictions(lm_val_fit_geo)
ps_val_pred_geo     <- collect_predictions(ps_val_fit_geo)
rf_val_pred_geo     <- collect_predictions(rf_val_fit_geo)

lm_val_pred_geo <- lm_val_pred_geo %>% mutate(.pred = exp(.pred)-1,
                                              Count = exp(Count)-1)


ps_val_pred_geo <- ps_val_pred_geo %>% dplyr::select(-id,-.config) %>% 
  rename(Count = originalCount)

rf_val_pred_geo <- rf_val_pred_geo %>% dplyr::select(-id,-.config) %>% 
  rename(Count = originalCount)


# Aggregate test set predictions (they do not overlap with training prediction set, which is OOF_preds)
val_preds <- rbind(data.frame(dplyr::select(lm_val_pred_geo, .pred, Count), model = "lm"),
                   data.frame(dplyr::select(ps_val_pred_geo, .pred, Count), model = "PS"),
                   data.frame(dplyr::select(rf_val_pred_geo, .pred, Count), model = "RF")
                   ) %>% 
  group_by(model) %>% 
  mutate(RMSE = yardstick::rmse_vec(Count, .pred),
         MAE  = yardstick::mae_vec(Count, .pred),
         MAPE = yardstick::mape_vec((Count+1), (.pred+1)),
         absE=abs(Count-.pred)) %>% 
  ungroup()


# MAE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE,group = 1)) +
  geom_path(color = "red") +
  geom_label(aes(label = round(MAE,1))) + 
  theme_bw() 

# Observed vs. Predicted on the test set
ggplot(val_preds, aes(y=.pred , x = Count,group = model))+ 
  geom_point(alpha = 0.3) +
  coord_equal() +
  geom_abline(linetype = "dashed",color = "red") +
  geom_smooth(method="lm", color = "blue") +
  facet_wrap(~model,ncol = 2)+
  theme_bw()+
  ylim(0,500)+
  xlim(0,500)



##===============Find where the model performs well=============####

# First extract work flow 
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)
# fit model on the whole dataset
rf_wflow_final_fit <- fit(rf_best_wf, data = data)
# extract recipe
dat_recipe     <- pull_workflow_prepped_recipe(rf_wflow_final_fit)
# extract fit
rf_final_fit <- pull_workflow_fit(rf_wflow_final_fit)

# predict bike counts for the whole dataset
data$.pred <- predict(rf_final_fit, 
                      new_data = bake(dat_recipe,data))$.pred

data <- data %>% 
  mutate(absE = abs(originalCount - .pred),# calculate absolute error
         Count = Count - 1) 

data$originalCount <- NULL 

yardstick::mae_vec(data$Count, data$.pred)

ggplot()+geom_histogram(data = data,aes(absE),binwidth = 1,fill="#219ebc")

# observed vs. predicted by different street characteristics
ggplot(data, aes(y=.pred , x = Count,group = TRUCK_ROUT))+ 
  geom_point(alpha = 0.3) +
  coord_equal() +
  geom_abline(linetype = "dashed",color = "red") +
  geom_smooth(method="lm", color = "blue") +
  facet_wrap(~TRUCK_ROUT,ncol = 2)+
  theme_bw()+
  ylim(0,500)+
  xlim(0,500)+
  labs(x = "Observed",
       y = "Predicted",
       title = "Observed vs. Predicted by truck route designation")

ggplot(data %>% filter(Number_Tra<6), aes(y=.pred , x = Count,group = Number_Tra))+ 
  geom_point(alpha = 0.3) +
  coord_equal() +
  geom_abline(linetype = "dashed",color = "red") +
  geom_smooth(method="lm", color = "blue") +
  facet_wrap(~Number_Tra,ncol = 3)+
  theme_bw()+
  ylim(0,500)+
  xlim(0,500)+
  labs(x = "Observed",
       y = "Predicted",
       title = "Observed vs. Predicted by number of travel lanes")


# join the geometry
data.geo <- data %>% left_join(info.Aug,by = "SegmentID") %>% st_as_sf()

ggplot() +
  geom_sf(data = boroughs_clip, fill = "#D4D4D4") +
  geom_sf(data = data.geo, aes(colour = absE)) +
  scale_colour_viridis(direction = -1, discrete = FALSE, option="viridis")+
  labs(title="Absolute prediction error by road segment") +
  mapTheme()

# calculate error by neighborhoods
data.nta <- data.geo %>% st_drop_geometry() %>% 
  group_by(ntaname) %>% 
  summarise(MAE = mean(absE))

error.nta <- neighborhood %>% left_join(data.nta,by="ntaname")

ggplot()+
  geom_sf(data = error.nta, aes(fill=MAE)) +
  scale_fill_viridis(direction = 1, discrete = FALSE, option="viridis")+
  labs(title="MAE by neighborhood") + mapTheme()


# calculate error by boroughs
data.geo <- data.geo %>% st_transform(crs = 4326)
boroughs_clip <- boroughs_clip %>% st_transform(crs = 4326)
data.boro <- st_join(st_centroid(data.geo),boroughs_clip,left=T)
data.boro <- data.boro %>% st_drop_geometry() %>% 
  group_by(boro_name) %>% 
  summarise(MAE = mean(absE))

error.boro <- boroughs_clip %>% left_join(data.boro,by="boro_name")

ggplot()+
  geom_sf(data = error.boro, aes(fill=MAE)) +
  scale_fill_viridis(direction = 1, discrete = FALSE, option="viridis")+
  labs(title="MAE by borough") + mapTheme()

Scenario Predictions
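Each scenario below repeats one counterfactual pattern: take the segments in a scenario, overwrite `bikeLaneLv` with the hypothetical lane type, re-predict ridership with the fitted model, and report the change as a percent of the observed count. A minimal, self-contained sketch of that pattern (a toy data frame and `lm` stand in for the tuned random forest workflow; all names here are illustrative, not from the project data):

```r
# Toy stand-in for the scenario workflow: fit a model, flip the bike-lane
# attribute to a hypothetical value, re-predict, and express the change as
# a percent of the observed count.
set.seed(1)
toy <- data.frame(
  Count      = rpois(50, 100),                                   # observed rides
  bikeLaneLv = sample(c("noBikeLane", "Unprotected"), 50, TRUE), # current lane type
  width      = runif(50, 20, 60)                                 # a covariate
)
fit <- lm(Count ~ bikeLaneLv + width, data = toy)

# Counterfactual: every segment gets the hypothetical lane type
toy_cf <- toy
toy_cf$bikeLaneLv <- "Unprotected"

toy$pre_unprotect  <- predict(fit, newdata = toy_cf)
toy$diff_unprotect <- (toy$pre_unprotect - toy$Count) / toy$Count * 100
```

The same flip-predict-difference steps are applied per scenario below, with the Manhattan- or Brooklyn-fit random forest supplying the predictions.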

#Scenario Predictions
##Read Data
scenarios <- st_read("E:/NYCDOT/Scenarios/scenariosShort.shp")
neighborhoods <- st_read("E:/NYCDOT/NYCneighborhood/nynta.shp")
boro <- st_read("E:/NYCDOT/Borough Boundaries/geo_export_4c045526-dcb9-4e1a-b602-59b848e20e6a.shp")

boro <- boro %>%
  filter(boro_name == "Brooklyn")
boro <- boro %>% st_transform(st_crs(scenarios))

boro2 <- st_read("E:/NYCDOT/Borough Boundaries/geo_export_4c045526-dcb9-4e1a-b602-59b848e20e6a.shp")
boro2 <- boro2 %>%
  filter(boro_name == "Manhattan" | boro_name == "Bronx")
boro2 <- boro2 %>% st_transform(st_crs(scenarios))

##Find and Clean Range
scenario1_range <- neighborhoods %>% filter(NTACode == "MN99")
scenario1 <- scenarios[scenario1_range,] %>% 
  filter(! SegmentID %in% c("0137647", "0134012", "0150953", "0150951",
                            "0071072", "0071082", "0111014", "0071078",
                            "0108162", "9000012", "0071084", "9000011"))


scenario_23 <- scenarios %>%
  filter(! SegmentID %in% scenario1$SegmentID)
scenario2 <- scenario_23[boro2,]


scenario3 <- scenario_23 %>%
  filter(! SegmentID %in% scenario2$SegmentID)

#check
mapview(list(scenario1, scenario2, scenario3), color = list("red", "blue", "purple"))

#Make Samples
sample1 <- info.Aug %>%
  filter(SegmentID %in% scenario1$SegmentID)

sample2 <- info.Aug %>%
  filter(SegmentID %in% scenario2$SegmentID)

sample3 <- info.Aug %>% filter(SegmentID %in% scenario3$SegmentID)

#Edit
sample1 <- sample1 %>%
  filter(bikeLaneLv == "noBikeLane")
scenario1 <- scenario1 %>%
  filter(SegmentID %in% sample1$SegmentID)

#check
mapview(sample2, zcol="bikeLaneLv")
mapview(sample3, zcol="bikeLaneLv")

sample3 <- sample3 %>%
  filter(bikeLaneLv != "Protected")

scenario3 <- scenario3 %>%
  filter(SegmentID %in% sample3$SegmentID)

#read buffered
scenario1_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario1_Buffer.shp")
scenario2_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario2_Buffer.shp")
scenario3_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario3_Buffer.shp")

mapview(list(scenario1_buffer, scenario2_buffer, scenario3_buffer))


#### Predictions
#reference:https://hansjoerg.me/2020/02/09/tidymodels-for-machine-learning/

Brooklyn <- boro %>% st_transform(st_crs(info.Aug))
Manhattan <- boro2 %>%
  filter(boro_name == "Manhattan") %>% st_transform(st_crs(info.Aug))


info.Brooklyn <- info.Aug[Brooklyn,] 
info.Manhattan <- info.Aug[Manhattan,]

info.Bk_model <- info.Brooklyn %>% st_drop_geometry() %>% dplyr::select(selectedF)
info.M_model <- info.Manhattan %>% st_drop_geometry() %>% dplyr::select(selectedF)

sample1 <- sample1 %>% dplyr::select(selectedF)
sample2 <- sample2 %>% dplyr::select(selectedF)
sample3 <- sample3 %>% dplyr::select(selectedF)


#Predict scenario1
rf_Manhattan_fit <- fit(rf_best_wf, data = info.M_model)

rf_final_Mfit <- pull_workflow_fit(rf_Manhattan_fit)
dia_rec3_M    <- pull_workflow_prepped_recipe(rf_Manhattan_fit)

#make different versions of sample 1 data
sample1_protect <- sample1 %>%
  mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample1_unprotect <- sample1 %>%
  mutate(bikeLaneLv = "Unprotected") %>% st_drop_geometry()


#make predictions

#Scenario1
sample1_protect$.pred <- predict(rf_final_Mfit, 
                          new_data = bake(dia_rec3_M , sample1_protect))$.pred


sample1_unprotect$.pred <- predict(rf_final_Mfit, 
                                 new_data = bake(dia_rec3_M , sample1_unprotect))$.pred

#join back and calculate diff
sample1$pre_protect <- sample1_protect$.pred
sample1 <- sample1 %>%
  mutate(diff_protect = ((pre_protect - Count) / Count) * 100)

sample1$pre_unprotect <- sample1_unprotect$.pred
sample1 <- sample1 %>%
  mutate(diff_unprotect = ((pre_unprotect - Count) / Count) * 100)

sample1.info <- sample1 %>%
  dplyr::select(SegmentID, Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)

sample1.info <- sample1.info %>% st_transform(st_crs(scenario1))

#join back to road segment
scenario1_infonew <- st_join(scenario1_buffer %>% st_transform(st_crs(sample1.info)),
                             sample1.info, left=TRUE, join = st_intersects)

scenario1_infonew <- distinct(scenario1_infonew,SegmentID.x,.keep_all = T)

scenario1_infonew <- scenario1_infonew %>%
  dplyr::select(Street, SegmentID.x,Count,pre_protect, diff_protect, pre_unprotect, diff_unprotect)

scenario1_infonew <- scenario1_infonew %>% st_transform(st_crs("EPSG:4326"))

scenario1_infonew <- dplyr::rename(scenario1_infonew, SegmentID = SegmentID.x)

st_write(scenario1_infonew, "scenario1.geojson")



##==Predict Scenario2

#make different versions of sample 2 data
sample2_protect <- sample2 %>%
  mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()

sample2_protect$.pred <- predict(rf_final_Mfit, 
                                 new_data = bake(dia_rec3_M ,sample2_protect))$.pred

sample2_protect <- sample2_protect %>%
  mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # avoid Inf when Count is 0

#join back and calculate diff
sample2$pre_protect <- sample2_protect$.pred
sample2$Count_c <- sample2_protect$Count_c

sample2 <- sample2 %>%
  mutate(diff_protect = ((pre_protect - Count) / Count_c) * 100)

sample2.info <- sample2 %>%
  dplyr::select(SegmentID, Count, pre_protect, diff_protect)

#change back to segment

scenario2_info <- st_join(scenario2_buffer %>% st_transform(st_crs(sample2.info)),
                             sample2.info, left=TRUE, join = st_intersects)

scenario2_info <- distinct(scenario2_info, SegmentID.x,.keep_all = T)

scenario2_info <- scenario2_info %>%
  dplyr::select(Street, SegmentID.x,Count,pre_protect, diff_protect)

scenario2_info <- scenario2_info %>% st_transform(st_crs("EPSG:4326"))

scenario2_info <- dplyr::rename(scenario2_info, SegmentID = SegmentID.x)

mapview(scenario2_info)

st_write(scenario2_info, "scenario2.geojson")


##==Predict Scenario3
rf_Brooklyn_fit <- fit(rf_best_wf, data = info.Bk_model)

rf_final_Bfit <- pull_workflow_fit(rf_Brooklyn_fit)
dia_rec3_B    <- pull_workflow_prepped_recipe(rf_Brooklyn_fit)

#make different versions of sample 3 data
mapview(sample3, zcol="bikeLaneLv")

sample3_protect <- sample3 %>%
  mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()

sample3_unprotect <- sample3 %>%
  filter(bikeLaneLv != "Unprotected") %>% # only flip segments not already unprotected
  mutate(bikeLaneLv = "Unprotected") %>% st_drop_geometry()


#make predictions

#Scenario3
sample3_protect$.pred <- predict(rf_final_Bfit, 
                                 new_data = bake(dia_rec3_B, sample3_protect))$.pred


sample3_unprotect$.pred <- predict(rf_final_Bfit, 
                                   new_data = bake(dia_rec3_B, sample3_unprotect))$.pred


sample3_protect <- sample3_protect %>%
  mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # avoid Inf when Count is 0


sample3_unprotect <- sample3_unprotect %>%
  mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # avoid Inf when Count is 0

#join back and calculate diff
sample3$pre_protect <- sample3_protect$.pred
sample3$Count_c <-sample3_protect$Count_c

sample3 <- sample3 %>%
  mutate(diff_protect = ((pre_protect - Count) / Count_c) * 100)

sample3 <- merge(sample3,
                 sample3_unprotect %>% dplyr::select(SegmentID, pre_unprotect = .pred),
                 by = "SegmentID", all.x = TRUE)

sample3 <- sample3 %>%
  mutate(diff_unprotect = ((pre_unprotect - Count) / Count_c) * 100)


# segments that were already unprotected were not re-predicted; fill with 0
sample3$pre_unprotect[is.na(sample3$pre_unprotect)] <- 0
sample3$diff_unprotect[is.na(sample3$diff_unprotect)] <- 0

sample3.info <- sample3 %>%
  dplyr::select(SegmentID, Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)



#change back to segment

scenario3_info <- st_join(scenario3_buffer %>% st_transform(st_crs(sample3.info)),
                          sample3.info, left=TRUE, join = st_intersects)

scenario3_info <- distinct(scenario3_info, SegmentID.x,.keep_all = T)

scenario3_info <- scenario3_info %>%
  dplyr::select(Street, SegmentID.x,Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)

scenario3_info <- scenario3_info %>% st_transform(st_crs("EPSG:4326"))

scenario3_info <- dplyr::rename(scenario3_info, SegmentID = SegmentID.x)

mapview(scenario3_info)

st_write(scenario3_info, "scenario3.geojson")