1.3 Data Preprocessing
1.3.1 TLC Yellow Taxi 2019 Data
The following data preprocessing pipeline was applied to each of the monthly datasets:
- Only the 9 relevant features identified in the section above were selected, namely: `trip_distance`, `DOLocationID`, `PULocationID`, `tpep_pickup_datetime`, `tpep_dropoff_datetime`, `payment_type`, `fare_amount`, `tip_amount`, and `total_amount`.
- Trips with non-positive `trip_distance` were removed.
- `tpep_pickup_datetime` and `tpep_dropoff_datetime` were converted to datetime objects, and trips whose pick-up or drop-off time falls outside the current month were removed.
- A `trip_duration` attribute was created by subtracting `tpep_pickup_datetime` from `tpep_dropoff_datetime` and rounding to the nearest minute.
- Trips with non-positive `trip_duration` were removed.
- Trips with a `trip_duration` of more than 720 minutes (more than 12 hours) were also removed. There are two plausible explanations for such a trip: either the driver forgot to turn off the meter at the end of a shift, which is an input error, or it is a highly unusual interstate trip, in which case the `DOLocationID` is likely to lie outside NYC and should be excluded.
- Trips with either `DOLocationID` or `PULocationID` equal to 264 or 265 were removed, as these are unknown locations according to the taxi zone lookup table.
- Duplicated instances were removed.
- A subset on `payment_type == 1` was also created, keeping the attributes `DOLocationID`, `PULocationID`, `trip_duration`, `total_amount`, `tpep_pickup_datetime`, and `trip_distance`. Since only credit-card tips (payment type 1) are consistently recorded, any analysis relating to profitability uses this subset. Note that `total_amount` also encapsulates `fare_amount`, `tip_amount`, and other taxes, and is thus the more appropriate measure of revenue for the profitability formula. These steps are sketched in code below.
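A minimal sketch of the per-month pipeline, assuming the 2019 files are read as CSV with the `tpep_*` schema; the function names and file paths are illustrative, not from the original code:

```python
import pandas as pd

COLS = ["trip_distance", "DOLocationID", "PULocationID",
        "tpep_pickup_datetime", "tpep_dropoff_datetime",
        "payment_type", "fare_amount", "tip_amount", "total_amount"]

def preprocess_month(path: str, year: int, month: int) -> pd.DataFrame:
    """One monthly TLC file -> cleaned dataframe (illustrative helper)."""
    df = pd.read_csv(path, usecols=COLS,
                     parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])

    # Positive trip distance only.
    df = df[df["trip_distance"] > 0]

    # Both pick-up and drop-off must fall inside the current month.
    for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
        df = df[(df[col].dt.year == year) & (df[col].dt.month == month)]

    # Duration in minutes, rounded to the nearest minute.
    dur = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["trip_duration"] = dur.dt.round("min").dt.total_seconds() / 60

    # Drop non-positive durations and trips longer than 12 hours.
    df = df[(df["trip_duration"] > 0) & (df["trip_duration"] <= 720)]

    # 264 and 265 are unknown zones in the taxi zone lookup table.
    df = df[~df["DOLocationID"].isin([264, 265])
            & ~df["PULocationID"].isin([264, 265])]

    return df.drop_duplicates()

def credit_card_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Credit-card-only subset used for profitability analysis."""
    keep = ["DOLocationID", "PULocationID", "trip_duration",
            "total_amount", "tpep_pickup_datetime", "trip_distance"]
    return df.loc[df["payment_type"] == 1, keep]
```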
After preprocessing all 12 monthly datasets, each attribute was merged into an annual pandas Series and serialised into the PyArrow Feather format. For analysis and visualisation, only the attributes of interest are loaded into memory (under 700 MB per attribute), which allows for faster computation compared to using Spark (a sketch follows). Geovisualisation was performed with the GeoPandas package.
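A minimal sketch of this column-per-file layout, assuming one Feather file per attribute (the directory layout and helper name are illustrative):

```python
import pandas as pd
import pyarrow.feather as feather

# Concatenate a single attribute across the 12 cleaned monthly dataframes
# and write it to its own Feather file (one file per attribute).
def save_attribute(monthly_frames, attr, out_dir="annual"):
    annual = pd.concat((m[attr] for m in monthly_frames), ignore_index=True)
    feather.write_feather(annual.to_frame(name=attr), f"{out_dir}/{attr}.feather")

# Later, load only the attributes a given analysis needs.
trip_distance = feather.read_feather("annual/trip_distance.feather")["trip_distance"]
```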
Table 1: Dataset size (instances) comparison before and after preprocessing
For profitability analysis, a dataframe consisting of credit-card-only `trip_distance`, `PULocationID`, `trip_duration`, and `total_amount` was created. A `rate_per_trip` attribute was created using the formula in Equation 2. The dataframe was then grouped by `PULocationID`, averaging the other values to calculate the `zone_profitability` according to Equation 1 (see the sketch below).
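A minimal sketch of this aggregation, assuming `cc` is the credit-card-only dataframe; the `rate_per_trip` expression below (revenue per minute) is only a placeholder for Equation 2, whose exact form is given above:

```python
# Placeholder for Equation 2: revenue per minute of trip time.
cc["rate_per_trip"] = cc["total_amount"] / cc["trip_duration"]

# Group by pick-up zone and average the remaining values (feeds Equation 1).
zone_profitability = (
    cc.groupby("PULocationID")[["trip_distance", "trip_duration",
                                "total_amount", "rate_per_trip"]]
      .mean()
)
```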
1.3.2 MTA Static Transport Accessibility Data
The schedule and stop location datasets required no cleaning, as the data is static. The preprocessing steps for this data replicate the TAT calculation process of Correa, Xie and Ozbay (2017), with two differences: a shorter cell diameter, and aggregation based on taxi zones instead of Neighbourhood Tabulation Areas.
- Firstly, a hex grid with a cell diameter of 500 m was fitted over the NYC map, which means that the time taken to traverse each cell on foot is about 5 minutes. Each cell is characterised by the location of its centroid and was assigned to its nearest subway station `stop_id` through a nearest-neighbour BallTree query. The haversine shortest distance between a cell centroid and its assigned station was also returned (see Figure 10a in the Analysis section below, and the first sketch after this list).
- `departure_time` was converted to 24-hour format and only the hour value was extracted. Values greater than or equal to 24, such as "25:15", indicate that the train departed from its first stop on the previous day and is now departing from its current stop at 1:15 AM on the next day. These values were also converted to standard 24-hour format and had their hour components extracted.
- To get the average number of trains per hour (a component of the TAT formula, Equation 3), the total count of trains departing from each `stop_id`–`departure_time` combination was divided by 3, since each such combination appears under 3 schedule types in the original dataset: Weekday, Saturday, and Sunday.
- The dataframe (11896 × 3) was converted from long to wide format, with each `departure_time` becoming its own column, resulting in a (496 × 24) dataframe with `stop_id` as the unique index for each row.
- The 8 extra values resulting from the pivot were due to stops 901 and 902 not having any trips between 12 AM and 4 AM, and as such were imputed with 0.
- The TAT per hour for each cell was calculated using the two components above, `avg_freq` and `distance`.
- Once the TAT for each cell is determined, the TAT per taxi zone is obtained by averaging the values across the cells included within the zone (see Figure 10b in the Analysis section below, and the second sketch after this list). Note that this resulted in each zone being assigned a nearest stop, which is also used in the preprocessing for the Transport Usage dataset.
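A minimal sketch of the nearest-station assignment, assuming `stations` holds `stop_id` with latitude/longitude columns and `cells` holds the hex-cell centroids (column names are illustrative):

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # haversine distances come back in radians

# Build the tree on station coordinates; BallTree's haversine metric
# expects [lat, lon] pairs in radians.
tree = BallTree(np.radians(stations[["stop_lat", "stop_lon"]].to_numpy()),
                metric="haversine")

# Query the nearest station for every cell centroid.
dist_rad, idx = tree.query(np.radians(cells[["lat", "lon"]].to_numpy()), k=1)

cells["stop_id"] = stations["stop_id"].to_numpy()[idx.ravel()]
cells["distance"] = dist_rad.ravel() * EARTH_RADIUS_M  # metres
```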
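And a sketch of the schedule aggregation, pivot, and per-zone averaging, reusing `cells` from the previous sketch. The final TAT expression (walking time plus half the headway) is only a placeholder for Equation 3, whose exact form is given above:

```python
import pandas as pd

# Hours >= 24 (e.g. "25:15") wrap to the next day, so take the hour modulo 24.
schedule["hour"] = schedule["departure_time"].str.split(":").str[0].astype(int) % 24

# Departures per stop_id-hour combination, averaged over the 3 schedule types
# (Weekday, Saturday, Sunday).
avg_freq = (schedule.groupby(["stop_id", "hour"]).size() / 3).rename("avg_freq")

# Long to wide: one row per stop_id, one column per hour. Stops 901 and 902
# have no trips between 12 AM and 4 AM, so those 8 gaps are imputed with 0.
wide = avg_freq.reset_index().pivot(index="stop_id", columns="hour",
                                    values="avg_freq").fillna(0)

# Placeholder for Equation 3: walking time to the assigned station plus the
# expected wait (half the headway), per hour, assuming ~5 km/h walking speed.
WALK_M_PER_MIN = 5000 / 60
freq_per_cell = wide.loc[cells["stop_id"]].to_numpy()           # (n_cells, 24)
walk_minutes = (cells["distance"] / WALK_M_PER_MIN).to_numpy()[:, None]
tat_per_cell = walk_minutes + 60 / (2 * freq_per_cell.clip(min=1e-9))

# TAT per taxi zone: average across the cells contained in each zone.
tat_zone = pd.DataFrame(tat_per_cell, index=cells["zone"]).groupby(level=0).mean()
```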
Table 2: Final TAT dataframe (3750 rows of centroids, 496 stop IDs and 260 Zone IDs)
1.3.3 qri Transport Usage Data
The turnstile data has been preprocessed by qri while maintaining MTA's original column structure, which allowed for minimal data cleaning and seamless SQL joins with the two previous datasets. Note that since the number of turnstile entries is only an estimate of actual public transport usage and is collected periodically rather than in real time, comparisons with hourly taxi usage do not hold strong statistical power. Averaging methods were therefore used in the following preprocessing to reduce variance in the comparisons, albeit at the cost of the hourly resolution of the preprocessed TLC dataframe.
- `complex_id` (the primary key of the turnstile dataset) was converted to `gtfs_stop_id` (the primary key of the MTA schedule dataset, which was also used for the TAT data) via a SQL join using the reference table from the MTA website. Since the turnstile dataset excludes data for the three commuter rails, a left join ensured that only stops with non-null usage data are included in the resulting dataframe.
- Each `gtfs_stop_id` was assigned to its containing taxi zone from the TAT data.
- The `entries` counts were then grouped by `zone` and averaged to find the daily average number of people using public transport in each taxi zone.
- The daily average number of pick-ups (`num_of_trips`) grouped by `zone` from the preprocessed TLC dataframe was merged in and used in tandem with the `entries` column to calculate the mode-preference ratio per zone (as detailed in Equation 4; a sketch follows this list).
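A minimal sketch of the final merge, assuming Equation 4 takes a ratio of the two per-zone daily averages (the exact form is given above; dataframe names are illustrative):

```python
import pandas as pd

# Daily average turnstile entries and taxi pick-ups per taxi zone.
avg_entries = turnstile.groupby("zone")["entries"].mean()
avg_pickups = tlc.groupby("zone")["num_of_trips"].mean()

# Merge the two per-zone averages and compute the mode-preference ratio
# (placeholder for Equation 4: pick-ups relative to public transport entries).
mode_pref = (
    pd.concat([avg_entries, avg_pickups], axis=1)
      .dropna()  # keep only zones present in both datasets
      .assign(mode_preference=lambda d: d["num_of_trips"] / d["entries"])
)
```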
Correa, D., Xie, K., & Ozbay, K. (2017). Exploring the taxi and Uber demand in New York City: An empirical analysis and spatial modeling. In 96th Annual Meeting of the Transportation Research Board, Washington, DC.