This project developed a method of predicting bikeshare trips in New York City with emphasis on the effect of bicycle infrastructure, such as bike lanes. We used our modeling predictions to explore bike planning scenarios in New York City. This project is one of the first to use data capturing the unique routes taken by each bikeshare trip, so the methods used herein could be applied to other similar datasets. This document will outline the motivation for our project, exploratory analysis, feature engineering and model building, and the scenario planning tool displayed in our application.
This project was completed for the final capstone in the Master of Urban Spatial Analytics program at the University of Pennsylvania. We are very thankful to our instructors, Ken Steif, Michael Fichman, and Matt Harris, for their insight and patient guidance as we undertook this large task. We are also incredibly grateful and indebted to our clients, Kyle Yakal-Kremski, Apaar Bansal, and Paula Rubira from the New York City Department of Transportation, for their generosity in sharing such a unique dataset with a group of students and working with us through this process.
Riding a bike in a city can be a harrowing task of dodging turning vehicles, weaving around parked cars and delivery trucks, and watching out for a car door that may open any moment. Bike lanes are an important way to carve out space for cyclists to improve the comfort and safety of riding. However, it can be a very challenging task to build broad support for bike lanes and decide which streets should receive bike lanes first. Our project provides a new way of predicting the impact of a bike lane based on unique ridership data from the Citibike system in New York City, in which people rent bikes from docked stations.
A transportation planner’s task of deciding where to install a new bike lane is a challenging one. Planners must create a network throughout the city that allows cyclists to flow between destinations and protects them from the dangers of the road. Planners may prioritize streets for bike lane investments based on the number of collisions between cars and cyclists or the value of a certain road to the network. There are also strategic factors, such as a street repaving, that can present a convenient time for lane reconfiguration. It is also crucial to develop relationships in communities where new bike lanes will be installed to build support for the changes and ensure community members know how to use the new cycling infrastructure.
This project provides a unique way of prioritizing streets for bike lane investments based on observed bikeshare ridership. Predicting the impact of a new bike lane on bike ridership gives our client, the NYC DOT, an additional tool for deciding which streets to prioritize. Furthermore, if planners can show a community board the most important streets for a new bike lane, or quantify the impact of adding a more protected facility, they can better advocate for the most needed changes and build relationships with groups who are more skeptical of bike lanes.
To develop this model, we relied on two main data sources. New York City’s street centerlines dataset shows the type of bike lane on any given street and where the bike lane network has expanded over time. We combined this with trip data from the bikeshare system, Citibike, in which people can rent bikes at docked stations. To predict bikeshare ridership, we relied on additional data sources including American Community Survey 2015-2019 Estimates, Open Street Map amenity data, and additional amenity datasets from New York City Open Data.
Dataset | Source |
---|---|
Citibike Trips | NYC Department of Transportation |
LION Street Centerlines | Bytes of the Big Apple Archive |
Citibike Stations | NYC Department of Transportation |
Demographics | American Community Survey 2015-2019 Estimates |
Job Locations | Longitudinal Employer-Household Dynamics |
Office and Retail Amenities | OpenStreetMap |
Additional Amenities | NYC Open Data |
This project is unique because it relies on a very uncommon type of trip data that is not usually shared with the public. Most publicly available data on docked bikeshare systems is origin-destination data, meaning you know the stations where a trip starts and ends but not the route the cyclist took in between. Since 2018, Citibike has allowed riders to opt in to having their trip tracked via their phone’s GPS. The resulting data, rather than being a pair of stations, is a line that shows the exact route a rider took. This allows much greater insight into the choices cyclists make when they ride and lets us calculate the ridership of bikeshare trips on any given block.
New York City uses three categories for recording bike lanes, which we have termed Protected Lanes, Unprotected Lanes, and Sharrows. The highest level of protection recorded in the data is a protected lane, which is separated from vehicle lanes by parked cars or another kind of buffer to create a dedicated space for cyclists. An unprotected lane is painted onto the street immediately adjacent to the vehicle lane; even though this is a bike lane, it offers less protection because vehicles can easily drift into it. The final category, sharrow, is used when streets are not wide enough for separate vehicle and bike lanes. The term combines "share" and "arrow" and refers to the painted symbol of a bicycle beneath chevrons, meant to encourage drivers to share the lane with cyclists. Since a sharrow is very similar to a street without a bike lane, we excluded sharrows from our definition of a bike lane.
Protected Lane | Unprotected Lane | Sharrow |
---|---|---|
Since the Citibike network does not extend throughout all of New York, we defined a study area based on the time frame used by our predictive model: August 2020. We chose this month because it captures the warm weather of summer and includes stations in the Bronx, which was added to the Citibike network in July 2020. Our study area therefore comprises a half-mile buffer around the Citibike stations available in August 2020, extending across most of Manhattan and parts of the Bronx, Brooklyn, and Queens.
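As a sketch, this extent can be constructed in R with sf, assuming `stations` is a point layer of the August 2020 Citibike stations in EPSG:2263 (a foot-based projection, so a half mile is 2,640 feet):

```r
library(sf)
library(dplyr)

# dissolve half-mile buffers around all August 2020 stations into one study area
extent <- stations %>%
  st_buffer(2640) %>%  # half mile in feet (EPSG:2263)
  st_union() %>%
  st_sf()
```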
Converting individual bikeshare trips into useful ridership data is a very challenging task. Since this is such a novel analysis, recording our data cleaning process will be helpful to anyone pursuing a similar project in the future. The steps we pursued are as follows:
Buffer Width
For ease of calculation, we chose a single buffer width of 15 feet. When buffers for adjacent street segments overlap, assigning each GPS point to exactly one buffer becomes time-consuming in R. A 15-foot buffer allowed us to capture most points while minimizing overlap between buffers. However, many streets are more than 30 feet wide, so this method could miss some ridership; using a variable buffer width based on street width could improve the calculation in the future, as sketched below.
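A variable-width buffer is a small change with sf; a minimal sketch, assuming `streets` is the LION centerline layer and using its `StreetWidt` field (width in feet), with a hypothetical 30-foot fallback for missing values:

```r
library(sf)
library(dplyr)

# buffer each centerline to half its recorded street width instead of a flat 15 ft
streets <- streets %>%
  mutate(width_ft = ifelse(is.na(StreetWidt), 30, StreetWidt))  # assumed default width
streets_buffer <- st_buffer(streets, dist = streets$width_ft / 2)
```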
Lines to Points
The built-in tool in ArcMap, Generate Points Along Lines, is meant to place points along each line at a specified distance. However, we found that with a dataset of 100,000 trips, the resulting data was very spatially inconsistent, with large peaks of 300-400 riders on segments where no adjoining street had counts higher than 30 trips. This suggested that the tool was not generating points in the way its documentation described. We performed a second points generation using a custom tool from a Stack Exchange post, which resulted in much more spatially consistent ridership counts. However, there are still some segments where the count is several hundred trips higher than on any adjoining segment. Future analyses should investigate this step and find the optimal way to generate points along lines, whether in ArcMap, ArcGIS Pro, or R; one starting point in R is sketched below.
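As one starting point, sf can generate evenly spaced points along lines natively; a minimal sketch, assuming `trips` holds the trip linestrings in a foot-based projection and using a hypothetical 50-foot spacing:

```r
library(sf)
library(dplyr)

# one sampled point every 50 ft along each trip line
trip_points <- trips %>%
  st_cast("LINESTRING") %>%
  mutate(geometry = st_line_sample(geometry, density = 1 / 50)) %>%  # MULTIPOINT per trip
  st_cast("POINT")  # explode to one row per point; trip attributes are carried along
```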
Figure 1. Original (left) vs. improved (right) ridership counts. Original counts created using ArcMap Generate Points Along Lines. Improved counts created using custom toolbox, Create Points from Lines, created by another user.
Our exploratory analysis was motivated by two questions: does adding a bike lane change ridership on a street, and which features best predict bikeshare ridership across the network?
Though it may seem intuitive that bike infrastructure such as bike lanes and cycle tracks makes riding a bike more appealing, the unique kind of data used in this project allows us to test that assumption. To explore whether bike ridership responds to changes in bike infrastructure, we performed a difference-in-difference analysis on several streets before and after bike lanes were added.
To explore the change in ridership on a particular street when a bike lane was added, we used a difference-in-difference approach. This exploratory method compares a treatment and a control group before and after an intervention. Both groups change over time, but the additional change in the treatment group is roughly attributable to the effect of the treatment.
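Formally, the difference-in-difference estimate is

$$\hat{\delta} = (Y_{treatment}^{post} - Y_{treatment}^{pre}) - (Y_{control}^{post} - Y_{control}^{pre}),$$

the change in trips on the treatment streets minus the change on the control streets over the same period. The bottom-right "Difference" cell in each table below is this estimate.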
For our purposes, difference-in-difference asks: is there a larger change in trips on streets with a new bike lane than on similar streets without one? We chose two areas of Manhattan in which to test this exploratory approach: East Harlem and Lower Manhattan. Both areas had new bike lanes added in 2018 and were served by the Citibike network.
Lower Manhattan saw the longest addition of protected bike lane in 2018, on 7th Avenue. We chose 3rd Avenue as the control because it was one of few parallel streets in the area without a bike lane.
Treatment: 7th Avenue | Control: 3rd Avenue |
---|---|
Type | Pre | Post | Difference |
---|---|---|---|
Treatment | 3786 | 8397 | 4611 |
Control | 1713 | 2903 | 1190 |
Difference | 2073 | 5494 | 3421 |
Our difference-in-difference results for Lower Manhattan show that the treatment group saw a larger increase in bikeshare trips than the control. This suggests that the new protected bike lane on 7th Avenue may have had a positive impact on ridership.
To test this approach in another neighborhood, we chose East Harlem because it is less dense than Lower Manhattan but had several bike lanes added in 2018. Unlike 7th Avenue, all of these additions were unprotected bike lanes. We chose four other side streets as controls.
Treatment: E 110th St | Control: E 109th St |
---|---|
Type | Pre | Post | Difference |
---|---|---|---|
Treatment | 178 | 332 | 154 |
Control | 152 | 407 | 255 |
Difference | 26 | -75 | -101 |
The results of the East Harlem difference-in-difference analysis show that the control group actually had a larger increase in trips than the treatment group and overtook it. Perhaps because ridership on these streets is in the hundreds rather than the thousands, a small change in actual riders can have a large effect on the outcome. These results could also suggest that unprotected bike lanes do not encourage bikeshare ridership as much as protected bike lanes. Further difference-in-difference analyses at other time points and in other regions of the city would be needed to clarify the interaction between new bike lanes and Citibike ridership.
Our exploratory difference-in-difference results are encouraging and may suggest that adding bike lanes, or at least protected bike lanes, to the network over time has helped create a more welcoming environment for cyclists. However, to build a scenario planning tool for our client, we needed to develop a predictive model that could estimate the change in ridership once a new bike lane is added.
We explored several features to investigate which were the best predictors of bikeshare ridership. We tested features of the street centerline data itself, such as road width and speed limit, as well as demographics from the American Community Survey and amenities from OpenStreetMap and NYC Open Data. The charts below show the correlation between each predictor and the dependent variable, Count.
We also analyzed these independent variables for multicollinearity, as shown in the plot below. The two most correlated features were median rent and the percentage of White residents. However, the correlation value between these two features was less than 0.8, so both were included.
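A minimal sketch of this check, assuming `info.Aug` is the engineered dataset from the appendix and using an illustrative subset of the numeric predictors:

```r
library(sf)
library(ggcorrplot)

# pairwise correlations among candidate predictors; we flagged pairs above ~0.8
num_vars <- st_drop_geometry(info.Aug)[, c("MedRent", "pWhite", "pSubway", "pNoVeh",
                                           "StreetWidt", "jobs_in_tract")]
ggcorrplot(cor(num_vars, use = "pairwise.complete.obs"),
           type = "lower", lab = TRUE)
```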
Our final model included the following features:
We modeled Citibike ridership on each street segment in our study area during the first two weeks of August 2020. We chose August 2020 because it is a warm month well suited to biking and was the first month in which Citibike stations were open in the Bronx, which we felt was important to include. Though bikeshare ridership trends shifted during the pandemic, we modeled total ridership over a two-week period, so changes in peak commuting patterns during COVID-19 were not a concern for our model.
Here we used the features selected during feature engineering to build machine learning models that predict bike counts on each road segment in our study area. We compare how well the models account for the variance in bike ridership and how well they predict unseen data, then choose the best-performing model to predict bike counts across the study area. The three models we used are:
Linear Model: an OLS linear regression. As noted in the exploratory analysis, the bike counts are not normally distributed, so when training the linear model we log-transformed the dependent variable to better satisfy the linear regression assumptions.
Poisson Model: a log-linear model especially suited to predicting count data. Because our data are counts with a large number of zeros and the residuals are not normally distributed, Poisson regression is a natural candidate.
Random Forest: An ensemble learning method that operates by constructing a multitude of decision trees.
We first split our data into a 75%/25% training and test set. We then performed 10-fold cross-validation to tune the parameters and calculate training accuracy, which tells us how well each model fits the training data. We also calculated test accuracy, which tells us how well each model predicts unseen data. Comparing the three models on these metrics determined which one we used to predict bike counts in the study area.
After tuning the parameters and training the models, we first plotted the MAE (Mean Absolute Error) and RMSE (Root Mean Square Error) on the training set using the optimized parameters for each model.
The random forest has the lowest MAE and RMSE of the three models, though the differences between models are modest. We then plotted "Actual vs. Predicted" scatter plots to see how each model performs on the training set.
All three models underpredict. Among them, linear regression looks the least predictive, while Poisson and random forest capture the trend using the selected features. The random forest generally predicts bike counts well because its predictions have relatively small variance.
Turning to the test set, the three models perform similarly to how they performed on the training data.
We also plotted the distribution of absolute errors on the test set. The first plot shows errors below 90 (relatively small errors): linear regression and random forest produce more very small errors (below 5), but all models have long tails, indicating some large errors.
The plot below shows the large errors (greater than 90). Here the random forest performs better, producing fewer large errors.
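These error distributions can be reproduced from the test-set predictions; a sketch using the `val_preds` data frame assembled in the modeling appendix, whose `absE` column holds the absolute error per segment:

```r
library(ggplot2)
library(dplyr)

# distribution of the smaller (< 90) absolute errors, by model
ggplot(filter(val_preds, absE < 90), aes(absE, fill = model)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(x = "Absolute error (trips)", y = "Segments", fill = "Model")
```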
We therefore chose the random forest to predict bike ride counts in our study area.
After predicting bike counts with the random forest, we analyzed the prediction errors.
We used our model to explore several scenarios where new bike lanes could be added in the future. Each scenario comprises several similar streets that could serve as alternatives to one another. For example, the Central Park Transverses scenario is made up of four streets that cross Central Park, none of which have bike lanes. Comparing the predicted increase in ridership across these alternatives could help planners decide which would be most impactful.
Scenario | Street | Length (Miles) | Median Ridership | Max Ridership |
---|---|---|---|---|
Brooklyn | ATLANTIC AVENUE | 6.6 | 4.0 | 68 |
Brooklyn | BUSHWICK AVENUE | 3.1 | 9.0 | 104 |
Brooklyn | FLATBUSH AVENUE | 6.5 | 7.5 | 47 |
Brooklyn | MEEKER AVENUE | 2.7 | 5.0 | 22 |
Brooklyn | MYRTLE AVENUE | 5.1 | 14.0 | 105 |
Central Park Transverses | 65 ST TRANSVERSE | 1.7 | 24.0 | 143 |
Central Park Transverses | 79 ST TRANSVERSE | 1.9 | 38.0 | 71 |
Central Park Transverses | 86 ST TRANSVERSE | 0.9 | 42.5 | 78 |
Central Park Transverses | 97 ST TRANSVERSE | 1.2 | 26.0 | 69 |
Upper Manhattan | ADAM CLAYTON POWELL JR BOULEVARD | 5.1 | 3.0 | 112 |
Upper Manhattan | AMSTERDAM AVENUE | 2.5 | 8.0 | 18 |
Upper Manhattan | LENOX AVENUE | 3.8 | 17.5 | 58 |
For each scenario, we predicted the change in ridership that would be expected if a protected or unprotected bike lane were added to each street. In the case of Upper Manhattan, several of the avenues already had an unprotected bike lane, so we only predicted the change in ridership expected from an upgrade to a protected bike lane.
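A minimal sketch of how one such prediction can be generated, reusing the fitted random forest (`rf_final_fit`), prepped recipe (`dat_recipe`), and modeling data (`data`) from the appendix; the scenario filter is illustrative:

```r
library(recipes)
library(dplyr)

# flip the bike lane level on the scenario's segments and re-predict
scenario_data <- data %>%
  filter(SegmentID %in% scenario1$SegmentID) %>%
  mutate(bikeLaneLv = "Protected")  # hypothetical new protected lane

scenario_data$.pred_new <- predict(rf_final_fit,
                                   new_data = bake(dat_recipe, scenario_data))$.pred

# percent change in predicted ridership relative to the baseline prediction
pct_change <- 100 * (sum(scenario_data$.pred_new) - sum(scenario_data$.pred)) /
  sum(scenario_data$.pred)
```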
All Central Park transverses were predicted to have higher growth in ridership under a protected lane than under an unprotected one. The unprotected lane predictions showed an increase of around 100% for all four streets, while the protected lane prediction for the 97 St Transverse was an increase of over 300%; under both lane types, the 97 St Transverse shows the highest percent increase in ridership. The street segments with the largest increases tend to fall toward the edges of the park, which reflects the spatial lag variable: ridership is low on the transverses themselves but high on the roads bordering the park and on the park's loop road.
In Upper Manhattan, we only predicted the impact of a protected bike lane since several streets already had an unprotected bike lane. Amsterdam Avenue has by far the greatest predicted increase in ridership, at over 300%, while the other two avenues are predicted to gain around 100% to 150%. Interestingly, the map shows that many segments on Lenox Avenue and Adam Clayton Powell Jr. Boulevard are predicted to lose ridership, although overall ridership on both streets was still predicted to increase.
In Brooklyn, our model predicted that Meeker Avenue would show the largest increase in ridership, at almost 400% with a protected bike lane and 200% with an unprotected one. Myrtle Avenue was projected to have the least growth, with Atlantic Avenue, Bushwick Avenue, and Flatbush Avenue falling in the middle. Due to its orientation, the entirety of Meeker Avenue sits in a more densely traveled part of the Citibike network, while the other streets in this scenario extend out to the edge of the study area. This may help explain the higher predictions for Meeker Avenue, though comparing percent growth in ridership was meant to normalize for such effects.
Though the model's performance leaves room for improvement, it still offers useful insight into predicted ridership for several bike planning scenarios in New York City. With further revision, these predictions could be made more accurate and extended to other streets in the city.
Calculating Ridership
# Calculating Ridership
# pre-processing steps:
## generate points from line trip data (see section 2.2)
## buffer street centerlines to desired distance (15ft in this analysis)
library(sf)
library(tidyverse)
library(mapview)
#read in data
points_new<-st_read('Aug1_14_points.shp')
extent <- st_read("station_extent_Aug2020.shp") %>%
st_transform(st_crs(points_new))
cd<-st_read('https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson') %>%
st_transform(st_crs(points_new))
cd_extent<-st_intersection(cd, extent)
#clip points to study area extent
points_new_extent<-st_intersection(points_new, extent)
#bring in roads
bike20d_buffer <- st_read("lion2020d_buffer15.shp") %>%
st_transform(st_crs(points_new))
# filter out non-bikeable LION segments (e.g. highways and ferry routes)
bike20d_buffer <- subset(bike20d_buffer,
                         !(RW_TYPE %in% c(2, 4, 7, 12, 14)) &
                           !(FeatureTyp %in% c(1, 2, 3, 5, 7, 8, 9)) &
                           !(SegmentTyp %in% c('E', 'G', 'F', 'T')))
#segment ID is not unique, so select distinct segment IDs
bike20d_buffer <- distinct(bike20d_buffer,SegmentID,.keep_all = T)
#clip street segments to study area
bike20d_buffer<- st_intersection(bike20d_buffer, extent)
save.image(file = "pointsClean_raw.RData")
#function that cleans points for one community district
cleaner<- function(cb){
cd_chosen<- cd %>%
filter(boro_cd==cb)
streets <- st_intersection(bike20d_buffer, cd_chosen)
points_extent <- st_intersection(points_new_extent, cd_chosen)
points_clean <- points_extent[streets,]%>%
mutate(pnt_id = seq.int(nrow(.))) %>%
st_join(., streets, join=st_intersects, left=T, largest=TRUE)
return(points_clean)
}
#run the cleaning function for each community district in the study area
cd_list <- c(
  #Manhattan
  '101','102','103','104','105','106','107','108','109','110','111','112','164',
  #Bronx
  '201','202','203','204',
  #Brooklyn
  '301','302','303','304','306','307','308','309','316','355',
  #Queens
  '401','402','405')
points_clean <- lapply(setNames(cd_list, cd_list), cleaner)
save.image(file = "pointsClean_all.RData")
#function for counting trips
countTrip <- function(pnt){
tripCount <- pnt %>%
as.data.frame() %>%
select(-geometry) %>%
group_by(FID_1,SegmentID) %>%
summarise() %>%
group_by(SegmentID) %>%
summarise(tripCount = n())
return(tripCount)
}
#count trips on each segment for every community district
counts <- lapply(points_clean, countTrip)
save.image(file = "pointsCounted.RData")
#join the counted trips to the road data, one community district at a time
info.Aug_list <- lapply(counts, function(ct) {
  bike20d_buffer %>%
    merge(., ct, by = 'SegmentID', all.y = TRUE) %>%
    rename(Count = tripCount)
})
#bind everything back together
all_test <- do.call(rbind, info.Aug_list)
glimpse(all_test)
#group by street and segment id and sum ridership (some streets are split by community districts)
all_grouped<-all_test%>%
group_by(Street, SegmentID)%>%
summarize(Count=sum(Count))
all_grouped<-st_set_geometry(all_grouped, NULL)
#join back to the original bike20d_buffer so segments with no trips are retained (Count = 0)
mergeCols <- c("Street", "SegmentID")
ridership<- bike20d_buffer%>%
merge(., all_grouped, by=mergeCols, all.x=T)
ridership$Count[is.na(ridership$Count)] <- 0
#export as RData
save.image(file = "ridership.RData")
Difference-in-Difference Analysis
#Difference-in-Difference
#pre-processing: select the street segments that will form each treatment and control group
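# a minimal sketch (not the original script) of the computation behind the
# difference-in-difference tables above; assumes 'treatment' and 'control' are
# segment-level ridership data frames with a 'period' column ("pre"/"post")
library(tidyverse)
did_table <- bind_rows(
  treatment %>% mutate(group = "Treatment"),
  control %>% mutate(group = "Control")) %>%
  group_by(group, period) %>%
  summarise(trips = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = period, values_from = trips) %>%
  mutate(difference = post - pre)
# DiD estimate: change in treatment minus change in control
did_estimate <- did_table$difference[did_table$group == "Treatment"] -
  did_table$difference[did_table$group == "Control"]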
Feature Engineering
#Feature Engineering
#pre-steps: download LEHD work area characteristics data
library(tidyverse)
library(dplyr)
library(sf)
library(lubridate)
library(mapview)
library(viridis)
library(ggplot2)
library(kableExtra)
library(stargazer)
library(FNN)
library(ggcorrplot)
library(spdep)
library(plotly)
library(car)
library(ranger)
library(tidycensus)
library(corpcor)
library(caret)
library(rjson)
library(tidyr)
library(sp)
library(osmdata)
load("ridership.RData")
#NYC boroughs
borough<- st_read('https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson')%>%
st_transform(st_crs(ridership))
boros_4<-subset(borough, boro_code<5)
#citibike stations
station<- st_read('Infrastructure/Citibike_Stations/Stations_Aug_2020.shp')%>%
st_transform(st_crs(ridership))
#Station Extent
extent <- st_read('OtherData/Station_Extent_2020/station_extent_Aug2020.shp')%>%
st_transform(st_crs(ridership))
#boroughs within study area
boroughs_clip<-st_intersection(borough, extent)
#neighborhoods
neighborhood <- st_read('https://data.cityofnewyork.us/resource/q2z5-ai38.geojson')%>%
st_transform(st_crs(ridership))
#only neighborhood in citibike extent
neighborhood <- neighborhood[extent,]
#set ridership to info.Aug
info.Aug<-ridership
#add bike lane level
info.Aug <- info.Aug %>%
mutate(
bikeLaneLv = case_when(
BikeLane == 1 | BikeLane == 5 | BikeLane == 8 | BikeLane == 9 | BikeLane == 10 ~ "Protected",
BikeLane == 2 | BikeLane == 3 | BikeLane == 4 | BikeLane == 6 | BikeLane == 11 ~ "Unprotected",
TRUE ~ "noBikeLane"
))
#select useful columns
info.Aug <- info.Aug %>%
select(Street,SegmentID, Count, FeatureTyp, SegmentTyp, NonPed,
TrafDir, SegCount, LBoro, XFrom, YFrom, RW_TYPE, Snow_Prior,
Number_Tra, BIKE_TRAFD,
StreetWidt, TRUCK_ROUT, POSTED_SPE,
bikeLaneLv, geometry)
# collect bike lane information
bikelanes <- info.Aug %>% filter(bikeLaneLv!="noBikeLane")
####=====Feature Engineering========####
#change projection
#info.Aug <- info.Aug %>% st_transform('ESRI:102711')
#bike20d_buffer <- bike20d_buffer %>% st_transform('ESRI:102711')
#bikelanes <- bikelanes %>% st_transform('ESRI:102711')
#station <- station %>% st_transform('ESRI:102711')
#join neighborhood names to each segment
info.Aug_nh <- st_join(info.Aug, neighborhood, join=st_intersects, left=TRUE,largest = TRUE)
info.Aug <- info.Aug_nh %>% select(-(ntacode:county_fips),-shape_leng, -boro_code)
#label segments that did not join to a neighborhood
info.Aug <- info.Aug %>%
mutate(
ntaname = case_when(
is.na(ntaname) ~ "Connecting Neighborhoods",
TRUE ~ ntaname
))
###Distance to nearest bikelanes
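#nn_function() below returns the average distance from each feature to its k
#nearest neighbors; it is not defined in this script. A minimal definition
#consistent with its use here (an assumption, built on FNN::get.knnx) is:
nn_function <- function(measureFrom, measureTo, k) {
  nn.dist <- FNN::get.knnx(as.matrix(measureTo), as.matrix(measureFrom), k)$nn.dist
  rowMeans(as.matrix(nn.dist))
}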
info.Aug <- info.Aug %>%
mutate(dist.lane = nn_function(st_coordinates(st_centroid(info.Aug$geometry)), st_coordinates(st_centroid(bikelanes$geometry)), 1))
#borough
info.Aug$isMH<-0
info.Aug$isMH[info.Aug$LBoro==1]<-1
#travel lanes
info.Aug$Number_Tra <- as.numeric(info.Aug$Number_Tra)
#citibike stations
#drop all columns but geometry
station<-station%>%
select()
info.Aug$citibike.Buffer_small <-
st_buffer(info.Aug, 250) %>%
aggregate(mutate(station, counter = 1),., sum) %>%
pull(counter)
info.Aug$citibike.Buffer_small[is.na(info.Aug$citibike.Buffer_small)] <- 0
#fix NAs
#compute medians for imputation
median_lanes <- median(info.Aug$Number_Tra, na.rm=TRUE)
median_width<- median(info.Aug$StreetWidt, na.rm = TRUE)
info.Aug$POSTED_SPE<- as.numeric(info.Aug$POSTED_SPE)
median_speed<-median(info.Aug$POSTED_SPE, na.rm=TRUE)
info.Aug$Number_Tra[is.na(info.Aug$Number_Tra)] <- median_lanes
info.Aug$StreetWidt[is.na(info.Aug$StreetWidt)] <- median_width
info.Aug$TRUCK_ROUT[is.na(info.Aug$TRUCK_ROUT)] <- 0
info.Aug$POSTED_SPE[is.na(info.Aug$POSTED_SPE)] <- median_speed
#### Census Features ####
#variables available at: https://api.census.gov/data/2019/acs/acs5/variables.html
v19 <- load_variables(2019, "acs5", cache = TRUE)
census_df <- data.frame(vars = c("B01003_001E",
'B02001_002E',
'B01001A_002E',
'B01001A_017E',
'B01002_001E',
'B19013_001E',
'B25031_001E',
'B25044_001E',
'B08006_003E',
'B08006_010E',
'B08131_001E',
'B08141_002E',
'B08141_001E',
'B08303_001E',
'B03001_003E',
'B26001_001E', #GROUP QUARTERS POPULATION
'B25007_012E',
#'B08201_001E',
'B25002_001E',
'B25002_003E',
'B08006_014E'),
colNames2 = c('TotPop',
'WhitePop',
'TotMale',
'TotFemale',
'MedianAge',
'MedHHInc',
'MedRent',
'Vehicles_Avail',
'Commute_DriveAlone',
'Commute_Subway',
'Travel_Time',
'No_Vehicle',
'Means_of_Transport_pop',
'Travel_Time_pop',
'LatinoPop',
'GroupQuarters',
'NumRenters',
#'AvgHHSize',
'TotUnits',
'VacUnits',
'Commute_Bike'),
stringsAsFactors = FALSE)
census_vars <- census_df$vars
census_colNames <- census_df$colNames
# Function for renaming columns after collecting census data
rename_census_cols <- function(x){
output <- x %>%
rename_at(vars(census_vars),
~ census_colNames)
output
}
#set census api key
census_key <- 'your census key here'
census_api_key(census_key, overwrite = TRUE, install = TRUE)
# Collect census data and geometries
Census_raw <- get_acs(geography = "tract",
variables = census_vars,
year = 2019,
state = "NY",
geometry = TRUE,
county = c("New York", "Kings", "Queens", "Bronx"),
output = "wide") %>%
rename_census_cols %>%
dplyr::select(GEOID,
geometry,
census_colNames) %>%
st_transform(2263)
Census_raw <- Census_raw %>%
  mutate(AREA = st_area(geometry),
         sq_mi = AREA / 27878400) #square feet to square miles (EPSG:2263 is in feet)
#fix NAs
colSums(is.na(Census_raw))
median_rent<-median(Census_raw$MedRent, na.rm=TRUE)
Census_raw$MedRent[is.na(Census_raw$MedRent)] <- median_rent
# calculate some vars
Census <- Census_raw %>%
st_transform(2263) %>%
mutate(pWhite = WhitePop / TotPop,
PopDens = TotPop/sq_mi,
Mean_Commute_Time = Travel_Time / Travel_Time_pop,
pSubway = Commute_Subway / Means_of_Transport_pop,
pDrive = Commute_DriveAlone/Means_of_Transport_pop,
pBike = Commute_Bike/Means_of_Transport_pop,
pNoVeh = No_Vehicle / TotPop)
#fix weird values in calculated vars
#set infinity values to NA
Census_geo<-Census%>%
select(GEOID)
Census_st<- st_set_geometry(Census, NULL)
Census_fix <- do.call(data.frame, lapply(Census_st, function(x) replace(x, is.infinite(x), NA)))
#fix NAs in calculated vars
colSums(is.na(Census_fix))
median_pWhite<-median(Census_fix$pWhite, na.rm=TRUE)
median_subway<-median(Census_fix$pSubway, na.rm=TRUE)
median_noveh<-median(Census_fix$pNoVeh, na.rm=TRUE)
median_rent<-median(Census_fix$MedRent, na.rm=TRUE)
Census_fix$pWhite[is.na(Census_fix$pWhite)] <- median_pWhite
Census_fix$pSubway[is.na(Census_fix$pSubway)] <- median_subway
Census_fix$pNoVeh[is.na(Census_fix$pNoVeh)] <- median_noveh
Census_fix$MedRent[is.na(Census_fix$MedRent)] <- median_rent
#if % greater than 1, set value equal to median
summary(Census_fix)
Census_fix$pSubway[Census_fix$pSubway>1.01] <- median_subway
#get geometry back
Census<- inner_join(Census_geo, Census_fix)
#select the variables of interest
#Census <- Census %>%
# select(GEOID, sq_mi,pWhite, pSubway, pNoVeh,MedRent)
#join with info.Aug
Census<-st_transform(Census, crs=st_crs(info.Aug))
info.Aug_census <- st_join(st_centroid(info.Aug), Census, left = TRUE)
summary(is.na(info.Aug_census))
summary(is.na(Census))
#You will find there are 16 records not joined to a census tract
#They were not joined because they fall just outside (near) the boundary of the tract polygons
#We will join them to the nearest census tract
colNms <- names(info.Aug_census)
ThoseNotJoined <- info.Aug_census[is.na(info.Aug_census$GEOID),] %>% select(all_of(colNms))
joinThemAgain <- st_join(ThoseNotJoined,Census,left=T,join = st_nearest_feature)
ThoseJoined <- info.Aug_census[!is.na(info.Aug_census$GEOID),]
info.Aug_census <- rbind(ThoseJoined,joinThemAgain)
rm(ThoseJoined,ThoseNotJoined,joinThemAgain)
info.Aug<-info.Aug_census
#### Environment Features ####
#make new dataset to play with
info.Aug_env<- info.Aug
#sidewalk cafes
sidewalk_cafe<- read.csv('https://data.cityofnewyork.us/resource/qcdj-rwhu.csv?$limit=2000')
sidewalk_cafe<-subset(sidewalk_cafe, lic_status =='Active')
sidewalk_cafe <- st_as_sf(sidewalk_cafe, coords=c('longitude', 'latitude'), crs=4326)
#average distance to the 3 nearest sidewalk cafes
sidewalk_cafe<-st_transform(sidewalk_cafe, crs=st_crs(info.Aug))
info.Aug_env<- info.Aug_env %>%
mutate(
sidewalk_caf_nn3 = nn_function(st_coordinates(st_centroid(info.Aug_env)), st_coordinates(sidewalk_cafe), 3))
#big parks
parks<- st_read('OtherData/parks/geo_export_8469ffba-1951-4e52-916c-c9c4dfa54c18.shp')
big_parks<-subset(parks, acres>100)
big_parks<-st_transform(big_parks, crs=st_crs(info.Aug_env))
info.Aug_env<- info.Aug_env %>%
mutate(bigPark = lengths(st_within(geometry, big_parks)))
info.Aug_env$bigPark[info.Aug_env$bigPark==2]<-1
#greenway
info.Aug_env$isGreenway<-ifelse(grepl('greenway', info.Aug_env$Street, ignore.case=T), 1, 0)
#distance to subway stations
subway<-st_read('OtherData/NYC_Subway/stops_nyc_subway_may2019.shp')
subway<-st_transform(subway, st_crs(info.Aug_env))
info.Aug_env<- info.Aug_env %>%
mutate(subway_nn5 = nn_function(st_coordinates(st_centroid(info.Aug_env)), st_coordinates(subway), 5))
#reset whole dataset to include these features
info.Aug<-info.Aug_env
# jobs
jobs_mh<-st_read('OtherData/LODES_WAC/manhattan/points_2018.shp')
jobs_bx<-st_read('OtherData/LODES_WAC/bronx/points_2018.shp')
jobs_bk<-st_read('OtherData/LODES_WAC/brooklyn/points_2018.shp')
jobs_qn<-st_read('OtherData/LODES_WAC/queens/points_2018.shp')
jobs<-rbind(jobs_mh, jobs_bx, jobs_bk, jobs_qn)
WAC <- jobs %>%
dplyr::select(id, c000) %>%
mutate(GEOID = as.character(substr(id, 1, 11))) %>%
group_by(GEOID) %>%
summarize(jobs_in_tract = sum(c000, na.rm = TRUE)) %>%
filter(GEOID %in% info.Aug$GEOID)
WAC<-st_set_geometry(WAC, NULL)
info.Aug_jobs<- left_join(info.Aug, WAC,by = "GEOID")
#get rid of NAs
info.Aug_jobs$jobs_in_tract[is.na(info.Aug_jobs$jobs_in_tract)] <- 0
info.Aug<-info.Aug_jobs
#### OSM Features ####
info.Aug_osm<-info.Aug
#retail
extent<- st_transform(extent, st_crs(info.Aug))
retail <- opq ("New York City USA") %>%
add_osm_feature(key = 'shop') %>%
osmdata_sf(.)
retail <- st_geometry(retail$osm_points) %>%
st_transform(st_crs(info.Aug)) %>%
st_sf() %>%
st_intersection(extent) %>%
mutate(Legend = 'Retail') %>%
dplyr::select(Legend, geometry)
ggplot()+geom_sf(data=retail)
#office
office <- opq ("New York City USA") %>%
add_osm_feature(key = 'office') %>%
osmdata_sf(.)
office <- st_geometry(office$osm_points) %>%
st_transform(st_crs(info.Aug)) %>%
st_sf() %>%
st_intersection(extent) %>%
mutate(Legend = 'Office') %>%
dplyr::select(Legend, geometry)
#add density features
# retail
retail_ct <- st_join(retail, Census,join = st_intersects, left = T) %>%
group_by(GEOID,sq_mi) %>%
summarise(count_retail= n())
retail_ct$density_retail <- retail_ct$count_retail/retail_ct$sq_mi
# office
office_ct <- st_join(office,Census,left = T) %>%
group_by(GEOID,sq_mi) %>%
summarise(count_office = n())
office_ct$density_office <- office_ct$count_office/office_ct$sq_mi
#join back to info.Aug
retail_ct<- retail_ct%>%
select(GEOID, count_retail, density_retail)%>%
st_set_geometry(NULL)
office_ct<- office_ct%>%
select(GEOID, count_office, density_office)%>%
st_set_geometry(NULL)
info.Aug_osm <- info.Aug_osm %>% left_join(retail_ct,by = "GEOID") %>%
left_join(office_ct, by = "GEOID")
info.Aug<-info.Aug_osm
#fix NAs
colSums(is.na(info.Aug))
## fill NAs in the POI features with 0 rather than the median (no amenities recorded means none nearby)
info.Aug$count_retail[is.na(info.Aug$count_retail)] <- 0
info.Aug$count_office[is.na(info.Aug$count_office)] <- 0
info.Aug$density_retail[is.na(info.Aug$density_retail)] <- 0
info.Aug$density_office[is.na(info.Aug$density_office)] <- 0
## Spatial lag
#helper that returns the index of the kth nearest segment (rather than a distance)
nn_idx <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.index[, k]
  return(nn)
}
#the 1st nearest neighbor is the segment itself, so neighbor k uses index k + 1
#make a copy to index neighbor counts from
info.Aug1 <- info.Aug
#store the indices of the 1st through 10th nearest neighboring segments
cent_xy <- st_coordinates(st_centroid(info.Aug$geometry))
for (i in 1:10) {
  info.Aug[[paste0("cidx", i)]] <- nn_idx(cent_xy, cent_xy, i + 1)
}
#extract each neighbor's trip count based on the stored indices
for (i in 1:10) {
  info.Aug[[paste0("c", i, "Count")]] <- info.Aug1$Count[info.Aug[[paste0("cidx", i)]]]
}
#spatial lag features: average of the k nearest neighbor counts after dropping
#the single highest and lowest values
trimmed_lag <- function(df, k) {
  counts <- as.matrix(st_drop_geometry(df)[paste0("c", 1:k, "Count")])
  (rowSums(counts) - apply(counts, 1, max) - apply(counts, 1, min)) / (k - 2)
}
info.Aug$avgCount10 <- trimmed_lag(info.Aug, 10)
info.Aug$avgCount7 <- trimmed_lag(info.Aug, 7)
info.Aug$avgCount5 <- trimmed_lag(info.Aug, 5)
Running the Model
#Running the Model
####====Regression====####
options(scipen = 999)
rm(list = ls())
#Package installs -------------------------------------------------------------
load.fun <- function(x) {
  x <- as.character(x)
  if (!requireNamespace(x, quietly = TRUE)) {
    print(paste0(x, " : not installed; installing"))
    install.packages(x)
  }
  require(x, character.only = TRUE)
}
########### Required Packages ###########
packages = c("bayesplot", "lme4", "rstan", "shinystan", "RcppEigen",
"tidyverse", "tidyr", "AmesHousing", "broom", "caret", "dials", "doParallel", "e1071", "earth",
"ggrepel", "glmnet", "ipred", "klaR", "kknn", "pROC", "rpart", "randomForest",
"sessioninfo", "tidymodels","ranger", "recipes", "workflows", "themis","xgboost",
"sf", "nngeo", "mapview","poissonreg",'gridExtra')
for (pkg in packages) {
  load.fun(pkg)
}
#read in data from feature engineering
load("C:/Users/xinti/Box/MUSA_800_Practicum/Data/RHistoryData/model_clean_xl03.RData")
rm(list=setdiff(ls(), "info.Aug"))
gc()
#drop extreme outlier segments before modeling
info.Aug <- info.Aug %>% filter(Count < 400)
selectedP <- c("bikeLaneLv" , 'dist.lane' , 'Number_Tra' , 'StreetWidt' , 'TRUCK_ROUT' , 'citibike.Buffer_small' , 'isMH' ,
'pWhite' , 'MedRent' , 'pNoVeh' , 'pSubway' ,
'sidewalk_caf_nn3' , 'bigPark' , 'isGreenway' , 'subway_nn5' ,
'jobs_in_tract','density_retail','density_office','ntaname',"avgCount5")
selectedF <- c("bikeLaneLv" , 'dist.lane' , 'Number_Tra' , 'StreetWidt' , 'TRUCK_ROUT' , 'citibike.Buffer_small' , 'isMH' ,
'pWhite' , 'MedRent' , 'pNoVeh' , 'pSubway' ,
'sidewalk_caf_nn3' , 'bigPark' , 'isGreenway' , 'subway_nn5' ,
'jobs_in_tract','density_retail','density_office','ntaname','Count',"avgCount5",'SegmentID')
#Check NAs
info.Aug.model <- info.Aug %>% st_drop_geometry() %>% dplyr::select(selectedF)
info.Aug.geo <- info.Aug %>% dplyr::select(SegmentID,geometry)
#Create dataset without NAs
data <- info.Aug.model %>%
drop_na()
#Set seed to make the analysis reproducible
set.seed(717)
theme_set(theme_bw())
"%!in%" <- Negate("%in%")
g <- glimpse
data$originalCount = data$Count
data$Count = data$Count + 1 #add 1 so log(Count) is defined for zero-count segments
data$isMH = as.character(data$isMH)
data$bigPark = as.character(data$bigPark)
data$isGreenway = as.character(data$isGreenway)
#Split the data into training and testing sets
data_split <- rsample::initial_split(data, strata = "Count", prop = 0.75)
train.set_wtID <- rsample::training(data_split)
test.set_wtID <- rsample::testing(data_split)
train.set <- train.set_wtID %>% dplyr::select(selectedP,Count,originalCount)
test.set <- test.set_wtID %>% dplyr::select(selectedP,Count,originalCount)
cv_splits <- rsample::vfold_cv(train.set, strata = "Count", v = 10)
print(cv_splits)
#Create recipe
model_rec_lm <- recipe(Count ~ bikeLaneLv + dist.lane + Number_Tra+StreetWidt+TRUCK_ROUT+citibike.Buffer_small+isMH+
pWhite+MedRent+pNoVeh+pSubway+
sidewalk_caf_nn3+bigPark+isGreenway+subway_nn5+
jobs_in_tract+density_retail+density_office+ntaname+avgCount5, data = train.set) %>%
update_role(ntaname, new_role = "ntaname") %>%
step_log(Count) %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_center(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) %>%
step_scale(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) #%>% #put on standard deviation scale
glimpse(model_rec_lm %>% prep() %>% juice())
model_rec <- recipe(originalCount ~ bikeLaneLv + dist.lane + Number_Tra+StreetWidt+TRUCK_ROUT+citibike.Buffer_small+isMH+
pWhite+MedRent+pNoVeh+pSubway+
sidewalk_caf_nn3+bigPark+isGreenway+subway_nn5+
jobs_in_tract+density_retail+density_office+ntaname+avgCount5, data = train.set) %>%
update_role(ntaname, new_role = "ntaname") %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_center(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) %>%
step_scale(dist.lane, StreetWidt,MedRent,sidewalk_caf_nn3,
subway_nn5, jobs_in_tract,density_retail,density_office,avgCount5) #%>% #put on standard deviation scale
glimpse(model_rec %>% prep() %>% juice())
# library(poissonreg)
lm_plan <-
linear_reg() %>%
set_engine("lm")
ps_plan <-
poissonreg::poisson_reg() %>%
set_engine("glm")
rf_plan <- parsnip::rand_forest() %>%
parsnip::set_args(mtry = tune()) %>%
parsnip::set_args(min_n = tune()) %>%
parsnip::set_args(trees = 2000) %>%
parsnip::set_engine("ranger", importance = "impurity") %>%
parsnip::set_mode("regression")
# GridSearch
rf_grid <- expand.grid(mtry = c(3,6,9),
min_n = c(100,500,800))
#Create workflow
lm_wf <-
workflows::workflow() %>%
add_recipe(model_rec_lm) %>%
add_model(lm_plan)
ps_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(ps_plan)
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
lm_tuned <- lm_wf %>%
fit_resamples(.,
resamples = cv_splits,#cv_splits_geo
control = control,
metrics = metric_set(rmse, rsq))
ps_tuned <- ps_wf %>%
fit_resamples(.,
resamples = cv_splits,#cv_splits_geo
control = control,
metrics = metric_set(rmse, rsq))
rf_tuned <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits,#cv_splits_geo
grid = rf_grid,
control = control,
metrics = metric_set(rmse, rsq))
show_best(lm_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(ps_tuned, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned, metric = "rmse", n = 15, maximize = FALSE)
lm_best_params <- select_best(lm_tuned, metric = "rmse", maximize = FALSE)
ps_best_params <- select_best(ps_tuned, metric = "rmse", maximize = FALSE)
rf_best_params <- select_best(rf_tuned, metric = "rmse", maximize = FALSE)
# Pull best hyperparam preds from 10-fold cross validated predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)
ps_best_OOF_preds <- collect_predictions(ps_tuned)
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
filter(mtry == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])
lm_best_OOF_preds <- lm_best_OOF_preds %>% dplyr::select(-id,-.config) %>%
mutate(.pred = exp(.pred)-1,
Count = exp(Count)-1)
ps_best_OOF_preds <- ps_best_OOF_preds %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
rf_best_OOF_preds <- rf_best_OOF_preds %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
OOF_preds <- rbind(data.frame(lm_best_OOF_preds %>% dplyr::select(.pred,Count), model = "Linear Regression"),
data.frame(ps_best_OOF_preds %>% dplyr::select(.pred,Count), model = "Poisson"),
data.frame(rf_best_OOF_preds %>% dplyr::select(.pred,Count),model = "Random Forest")) %>%
group_by(model) %>%
mutate(
RMSE = yardstick::rmse_vec(Count, .pred),
MAE = yardstick::mae_vec(Count, .pred),
MAPE = yardstick::mape_vec((Count+1), (.pred+1))) %>%
ungroup()
# average error for each model
# RMSE
ggplot(data = OOF_preds %>%
dplyr::select(model, RMSE) %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = round(RMSE,1))) +
theme_bw()
# MAE
ggplot(data = OOF_preds %>%
dplyr::select(model, MAE) %>%
distinct() ,
aes(x = model, y = MAE, group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = round(MAE,1))) +
theme_bw()
# Scatter plots: Predicted vs Observed
ggplot(OOF_preds, aes(y=.pred , x = Count,group = model))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~model,ncol = 2)+
theme_bw()+
ylim(0,500)+
xlim(0,500)
##===============Predict the test set=============####
#Final workflow
lm_best_wf <- finalize_workflow(lm_wf, lm_best_params)
ps_best_wf <- finalize_workflow(ps_wf, ps_best_params)
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
lm_val_fit_geo <- lm_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
ps_val_fit_geo <- ps_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
rf_val_fit_geo <- rf_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(rmse, rsq))
# collect test set predictions from last_fit model
lm_val_pred_geo <- collect_predictions(lm_val_fit_geo)
ps_val_pred_geo <- collect_predictions(ps_val_fit_geo)
rf_val_pred_geo <- collect_predictions(rf_val_fit_geo)
lm_val_pred_geo <- lm_val_pred_geo %>% mutate(.pred = exp(.pred)-1,
Count = exp(Count)-1)
ps_val_pred_geo <- ps_val_pred_geo %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
rf_val_pred_geo <- rf_val_pred_geo %>% dplyr::select(-id,-.config) %>%
rename(Count = originalCount)
# Aggregate test set predictions (they do not overlap with training prediction set, which is OOF_preds)
val_preds <- rbind(data.frame(dplyr::select(lm_val_pred_geo, .pred, Count), model = "lm"),
data.frame(dplyr::select(ps_val_pred_geo, .pred, Count), model = "PS"),
data.frame(dplyr::select(rf_val_pred_geo, .pred, Count), model = "RF")
) %>%
group_by(model) %>%
mutate(RMSE = yardstick::rmse_vec(Count, .pred),
MAE = yardstick::mae_vec(Count, .pred),
MAPE = yardstick::mape_vec((Count+1), (.pred+1)),
absE=abs(Count-.pred)) %>%
ungroup()
# MAE chart
ggplot(data = val_preds %>%
dplyr::select(model, MAE) %>%
distinct() ,
aes(x = model, y = MAE,group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = round(MAE,1))) +
theme_bw()
# Observed vs. Predicted on the test set
ggplot(val_preds, aes(y=.pred , x = Count,group = model))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~model,ncol = 2)+
theme_bw()+
ylim(0,500)+
xlim(0,500)
##===============Find where the model performs well=============####
# First extract work flow
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
# fit model on the whole dataset
rf_wflow_final_fit <- fit(rf_best_wf, data = data)
# extract recipe
dat_recipe <- pull_workflow_prepped_recipe(rf_wflow_final_fit)
# extract fit
rf_final_fit <- pull_workflow_fit(rf_wflow_final_fit)
# predict bike counts for the whole dataset
data$.pred <- predict(rf_final_fit,
new_data = bake(dat_recipe,data))$.pred
data <- data %>%
mutate(absE = abs(originalCount - .pred),# calculate absolute error
Count = Count - 1)
data$originalCount <- NULL
yardstick::mae_vec(data$Count, data$.pred)
ggplot()+geom_histogram(data = data,aes(absE),binwidth = 1,fill="#219ebc")
# observed vs. predicted by different street characteristics
ggplot(data, aes(y=.pred , x = Count,group = TRUCK_ROUT))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~TRUCK_ROUT,ncol = 2)+
theme_bw()+
ylim(0,500)+
xlim(0,500)+
labs(x = "Observed",
     y = "Predicted",
     title = "Observed vs. Predicted by truck route designation")
ggplot(data %>% filter(Number_Tra<6), aes(y=.pred , x = Count,group = Number_Tra))+
geom_point(alpha = 0.3) +
coord_equal() +
geom_abline(linetype = "dashed",color = "red") +
geom_smooth(method="lm", color = "blue") +
facet_wrap(~Number_Tra,ncol = 3)+
theme_bw()+
ylim(0,500)+
xlim(0,500)+
labs(x = "Observed",
y = "Predicted",
title = "Observed vs. Predicted on the testing set")
# join the geometry
data.geo <- data %>% left_join(info.Aug,by = "SegmentID") %>% st_as_sf()
ggplot() +
geom_sf(data = boroughs_clip, fill = "#D4D4D4") +
geom_sf(data = data.geo, aes(colour = absE)) +
scale_colour_viridis(direction = -1, discrete = FALSE, option="viridis")+
labs(title="Absolute predicting Error on each road segment") +
mapTheme()
# calculate error by neighborhoods
data.nta <- data.geo %>% st_drop_geometry() %>%
group_by(ntaname) %>%
summarise(MAE = mean(absE))
error.nta <- neighborhood %>% left_join(data.nta,by="ntaname")
ggplot()+
geom_sf(data = error.nta, aes(fill=MAE)) +
scale_fill_viridis(direction = 1, discrete = FALSE, option="viridis")+
labs(title="MAE by neighborhoods") + mapTheme()
# calculate error by boroughs
data.geo <- data.geo %>% st_transform(crs = 4326)
boroughs_clip <- boroughs_clip %>% st_transform(crs = 4326)
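# join on segment centroids so each road segment is assigned to exactly one borough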
data.boro <- st_join(st_centroid(data.geo),boroughs_clip,left=T)
data.boro <- data.boro %>% st_drop_geometry() %>%
group_by(boro_name) %>%
summarise(MAE = mean(absE))
error.boro <- boroughs_clip %>% left_join(data.boro,by="boro_name")
ggplot()+
geom_sf(data = error.boro, aes(fill=MAE)) +
scale_fill_viridis(direction = 1, discrete = FALSE, option="viridis")+
labs(title="MAE by neighborhoods") + mapTheme()
Scenario Predictions
##===============Scenario Predictions=============####
##Read Data
scenarios <- st_read("E:/NYCDOT/Scenarios/scenariosShort.shp")
neighborhoods <- st_read("E:/NYCDOT/NYCneighborhood/nynta.shp")
boro <- st_read("E:/NYCDOT/Borough Boundaries/geo_export_4c045526-dcb9-4e1a-b602-59b848e20e6a.shp")
boro <- boro %>%
filter(boro_name == "Brooklyn")
boro <- boro %>% st_transform(st_crs(scenarios))
boro2 <- st_read("E:/NYCDOT/Borough Boundaries/geo_export_4c045526-dcb9-4e1a-b602-59b848e20e6a.shp")
boro2 <- boro2 %>%
filter(boro_name == "Manhattan" | boro_name == "Bronx")
boro2 <- boro2 %>% st_transform(st_crs(scenarios))
##Find and Clean Range
scenario1_range <- neighborhoods %>% filter(NTACode == "MN99")
scenario1 <- scenarios[scenario1_range,] %>%
filter(!SegmentID %in% c("0137647", "0134012", "0150953", "0150951",
"0071072", "0071082", "0111014", "0071078",
"0108162", "9000012", "0071084", "9000011")) # drop hand-picked segments excluded from scenario 1
scenario_23 <- scenarios %>%
filter(! SegmentID %in% scenario1$SegmentID)
scenario2 <- scenario_23[boro2,]
scenario3 <- scenario_23 %>%
filter(! SegmentID %in% scenario2$SegmentID)
#check
mapview(list(scenario1, scenario2, scenario3), color = list("red", "blue", "purple"))
#Make Samples
sample1 <- info.Aug %>%
filter(SegmentID %in% scenario1$SegmentID)
sample2 <- info.Aug %>%
filter(SegmentID %in% scenario2$SegmentID)
sample3 <- info.Aug %>% filter(SegmentID %in% scenario3$SegmentID)
# scenario 1: keep only segments without an existing bike lane
sample1 <- sample1 %>%
filter(bikeLaneLv == "noBikeLane")
scenario1 <- scenario1 %>%
filter(SegmentID %in% sample1$SegmentID)
#check
mapview(sample2, zcol="bikeLaneLv")
mapview(sample3, zcol="bikeLaneLv")
sample3 <- sample3 %>%
filter(bikeLaneLv != "Protected")
scenario3 <- scenario3 %>%
filter(SegmentID %in% sample3$SegmentID)
# read buffered scenario geometries (used below to join predictions back to the road segments)
scenario1_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario1_Buffer.shp")
scenario2_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario2_Buffer.shp")
scenario3_buffer <- st_read("E:/NYCDOT/Scenarios/buffered/scenario3_Buffer.shp")
mapview(list(scenario1_buffer, scenario2_buffer, scenario3_buffer))
#### Predictions
#reference:https://hansjoerg.me/2020/02/09/tidymodels-for-machine-learning/
Brooklyn <- boro %>% st_transform(st_crs(info.Aug))
Manhattan <- boro2 %>%
filter(boro_name == "Manhattan") %>% st_transform(st_crs(info.Aug))
info.Brooklyn <- info.Aug[Brooklyn,]
info.Manhattan <- info.Aug[Manhattan,]
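# restrict each dataset to the model's feature set (selectedF, defined earlier)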
info.Bk_model <- info.Brooklyn %>% st_drop_geometry() %>% dplyr::select(selectedF)
info.M_model <- info.Manhattan %>% st_drop_geometry() %>% dplyr::select(selectedF)
sample1 <- sample1 %>% dplyr::select(selectedF)
sample2 <- sample2 %>% dplyr::select(selectedF)
sample3 <- sample3 %>% dplyr::select(selectedF)
#Predict scenario1
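# refit the best random forest on Manhattan segments only; this fit is used to predict scenarios 1 and 2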
rf_Manhattan_fit <- fit(rf_best_wf, data = info.M_model)
rf_final_Mfit <- pull_workflow_fit(rf_Manhattan_fit)
dia_rec3_M <- pull_workflow_prepped_recipe(rf_Manhattan_fit)
# counterfactual versions of sample 1: every segment assigned a protected or an unprotected lane
sample1_protect <- sample1 %>%
mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample1_unprotect <- sample1 %>%
mutate(bikeLaneLv = "Unprotected") %>% st_drop_geometry()
#make predictions
#Scenario1
sample1_protect$.pred <- predict(rf_final_Mfit,
new_data = bake(dia_rec3_M , sample1_protect))$.pred
sample1_unprotect$.pred <- predict(rf_final_Mfit,
new_data = bake(dia_rec3_M , sample1_unprotect))$.pred
# join predictions back and compute percent change relative to the observed Count
sample1$pre_protect <- sample1_protect$.pred
sample1 <- sample1 %>%
mutate(diff_protect = ((pre_protect - Count) / Count) * 100)
sample1$pre_unprotect <- sample1_unprotect$.pred
sample1 <- sample1 %>%
mutate(diff_unprotect = ((pre_unprotect - Count) / Count) * 100)
sample1.info <- sample1 %>%
dplyr::select(SegmentID, Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)
sample1.info <- sample1.info %>% st_transform(st_crs(scenario1))
#join back to road segment
scenario1_infonew <- st_join(scenario1_buffer %>% st_transform(st_crs(sample1.info)),
sample1.info, left=TRUE, join = st_intersects)
scenario1_infonew <- distinct(scenario1_infonew,SegmentID.x,.keep_all = T)
scenario1_infonew <- scenario1_infonew %>%
dplyr::select(Street, SegmentID.x,Count,pre_protect, diff_protect, pre_unprotect, diff_unprotect)
scenario1_infonew <- scenario1_infonew %>% st_transform(st_crs("EPSG:4326"))
names(scenario1_infonew)[2] <- "SegmentID"
st_write(scenario1_infonew, "scenario1.geojson")
##==Predict Scenario2
#make a protected counterfactual version of sample 2 data
sample2_protect <- sample2 %>%
mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample2_protect$.pred <- predict(rf_final_Mfit,
new_data = bake(dia_rec3_M ,sample2_protect))$.pred
sample2_protect <- sample2_protect %>%
mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # clamp zeros to avoid Inf in percent change
#join back and calculate diff
sample2$pre_protect <- sample2_protect$.pred
sample2$Count_c <- sample2_protect$Count_c
sample2 <- sample2 %>%
mutate(diff_protect = ((pre_protect - Count) / Count_c) * 100)
sample2.info <- sample2 %>%
dplyr::select(SegmentID, Count, pre_protect, diff_protect)
# join predictions back to the scenario road segments
scenario2_info <- st_join(scenario2_buffer %>% st_transform(st_crs(sample2.info)),
sample2.info, left=TRUE, join = st_intersects)
scenario2_info <- distinct(scenario2_info, SegmentID.x,.keep_all = T)
scenario2_info <- scenario2_info %>%
dplyr::select(Street, SegmentID.x,Count,pre_protect, diff_protect)
scenario2_info <- scenario2_info %>% st_transform(st_crs("EPSG:4326"))
names(scenario2_info)[2] <- "SegmentID"
mapview(scenario2_info)
st_write(scenario2_info, "scenario2.geojson")
##==Predict Scenario3
rf_Brooklyn_fit <- fit(rf_best_wf, data = info.Bk_model)
rf_final_Bfit <- pull_workflow_fit(rf_Brooklyn_fit)
dia_rec3_B <- pull_workflow_prepped_recipe(rf_Brooklyn_fit)
#make different versions of sample 3 data
mapview(sample3, zcol="bikeLaneLv")
sample3_protect <- sample3 %>%
mutate(bikeLaneLv = "Protected") %>% st_drop_geometry()
sample3_unprotect <- sample3 %>%
filter(bikeLaneLv != "Unprotected") %>%
mutate(bikeLaneLv = "Unprotected") %>% st_drop_geometry()
#make predictions
#Scenario3
sample3_protect$.pred <- predict(rf_final_Bfit,
new_data = bake(dia_rec3_B, sample3_protect))$.pred
sample3_unprotect$.pred <- predict(rf_final_Bfit,
new_data = bake(dia_rec3_B, sample3_unprotect))$.pred
sample3_protect <- sample3_protect %>%
mutate(Count_c = ifelse(Count == 0, 0.01, Count)) # clamp zeros to avoid Inf in percent change
sample3_unprotect <- sample3_unprotect %>%
mutate(Count_c = ifelse(Count == 0, 0.01, Count))
#join back and calculate diff
sample3$pre_protect <- sample3_protect$.pred
sample3$Count_c <-sample3_protect$Count_c
sample3 <- sample3 %>%
mutate(diff_protect = ((pre_protect - Count) / Count_c) * 100)
sample3 <- merge(sample3, sample3_unprotect %>% dplyr::select(SegmentID, .pred),
by = "SegmentID", all.x = TRUE) %>%
rename(pre_unprotect = .pred) %>%
mutate(diff_unprotect = ((pre_unprotect - Count) / Count_c) * 100)
# segments skipped in the unprotected counterfactual get placeholder zeros
sample3$pre_unprotect[is.na(sample3$pre_unprotect)] <- 0
sample3$diff_unprotect[is.na(sample3$diff_unprotect)] <- 0
sample3.info <- sample3 %>%
dplyr::select(SegmentID, Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)
#change back to segment
scenario3_info <- st_join(scenario3_buffer %>% st_transform(st_crs(sample3.info)),
sample3.info, left=TRUE, join = st_intersects)
scenario3_info <- distinct(scenario3_info, SegmentID.x,.keep_all = T)
scenario3_info <- scenario3_info %>%
dplyr::select(Street, SegmentID.x,Count, pre_protect, diff_protect, pre_unprotect, diff_unprotect)
scenario3_info <- scenario3_info %>% st_transform(st_crs("EPSG:4326"))
names(scenario3_info)[2] <- "SegmentID"
mapview(scenario3_info)
st_write(scenario3_info, "scenario3.geojson")
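# Hedged sketch (not part of the original pipeline): the predict / clamp /
# percent-change steps repeat for each scenario, so a small helper could wrap
# them. Assumes each sample carries Count and bikeLaneLv columns and that a
# fitted parsnip model and prepped recipe are supplied, as above.
predict_scenario <- function(sample, lane_level, model_fit, recipe_prepped) {
  counterfactual <- sample %>%
    mutate(bikeLaneLv = lane_level) %>%   # assign the hypothetical lane type
    st_drop_geometry()
  preds <- predict(model_fit, new_data = bake(recipe_prepped, counterfactual))$.pred
  sample %>%
    mutate(.pred = preds,
           Count_c = ifelse(Count == 0, 0.01, Count),  # clamp zeros to avoid Inf
           diff = ((.pred - Count) / Count_c) * 100)   # percent change vs. observed
}
# e.g. predict_scenario(sample1, "Protected", rf_final_Mfit, dia_rec3_M)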