1.3 Data Preprocessing

1.3.1 TLC Yellow Taxi 2019 Data

The following data preprocessing pipeline was applied to each of the monthly datasets (a pandas sketch follows the list):

  • Only the 9 relevant features identified in the section above were selected, namely: trip_distance, DOLocationID, PULocationID, tpep_pickup_datetime, tpep_dropoff_datetime, payment_type, fare_amount, tip_amount, and total_amount.
  • Trips with non-positive trip_distance were removed.
  • tpep_pickup_datetime and tpep_dropoff_datetime were converted to datetime objects, and trips whose pick-up or drop-off time fell outside the current month were removed.
  • A trip_duration attribute was created by subtracting tpep_pickup_datetime from tpep_dropoff_datetime and rounding to the nearest minute.
  • Trips with non-positive trip_duration were removed.
  • Trips with a trip_duration of more than 720 minutes (12 hours) were also removed. There are two plausible reasons for a trip lasting more than 12 hours: either the driver forgot to turn off the meter at the end of a shift, which is an input error, or the trip was a highly unusual interstate journey, in which case the DOLocationID is likely outside NYC and should be excluded anyway.
  • Trips with a DOLocationID or PULocationID of 264 or 265 were removed, as these are unknown locations according to the taxi zone lookup table.
  • Duplicate instances were removed.
  • A subset with payment_type == 1 was also created, keeping the attributes DOLocationID, PULocationID, trip_duration, total_amount, tpep_pickup_datetime, and trip_distance. Since only credit card tips (payment type 1) are consistently recorded, any analysis relating to profitability uses this subset. Note that total_amount also encapsulates fare_amount, tip_amount, and other taxes, and is therefore the more appropriate representation of revenue in the profitability formula.
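As an illustration, here is a minimal pandas sketch of the per-month cleaning steps. The preprocess_month helper, the CSV input, and the file naming are assumptions for the sketch, not the report's exact code:

```python
import pandas as pd

COLS = ["trip_distance", "DOLocationID", "PULocationID",
        "tpep_pickup_datetime", "tpep_dropoff_datetime",
        "payment_type", "fare_amount", "tip_amount", "total_amount"]

def preprocess_month(path: str, year: int, month: int) -> pd.DataFrame:
    df = pd.read_csv(path, usecols=COLS,
                     parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])

    # Keep only trips with positive distance
    df = df[df["trip_distance"] > 0]

    # Keep trips whose pick-up and drop-off both fall within the current month
    in_month = ((df["tpep_pickup_datetime"].dt.year == year)
                & (df["tpep_pickup_datetime"].dt.month == month)
                & (df["tpep_dropoff_datetime"].dt.year == year)
                & (df["tpep_dropoff_datetime"].dt.month == month))
    df = df[in_month]

    # Trip duration in minutes, rounded to the nearest minute
    df["trip_duration"] = ((df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"])
                           .dt.total_seconds() / 60).round()

    # Remove non-positive durations and trips longer than 12 hours
    df = df[(df["trip_duration"] > 0) & (df["trip_duration"] <= 720)]

    # Drop trips to or from unknown zones 264 and 265
    df = df[~df["DOLocationID"].isin([264, 265])
            & ~df["PULocationID"].isin([264, 265])]

    # Remove duplicate records
    return df.drop_duplicates()
```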

After preprocessing all 12 monthly datasets, each attribute was merged into an annual pandas Series and serialised in the PyArrow Feather format. For analysis and visualisation, only the attributes of interest are loaded into memory (under 700 MB per attribute), which allows for faster computation than using Spark. Geovisualisation was performed with the GeoPandas package.
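A sketch of this per-attribute serialisation, reusing the hypothetical preprocess_month helper above (file names are assumptions):

```python
import pandas as pd

# Build the annual series for each attribute and store each one as its own
# Feather file, so attributes can later be loaded independently.
monthly = [preprocess_month(f"yellow_tripdata_2019-{m:02d}.csv", 2019, m)
           for m in range(1, 13)]

for col in ["trip_distance", "trip_duration", "total_amount"]:
    annual = pd.concat(frame[col] for frame in monthly)
    annual.to_frame().reset_index(drop=True).to_feather(f"{col}.feather")

# Later, load only the attribute of interest into memory
trip_duration = pd.read_feather("trip_duration.feather")["trip_duration"]
```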

Table 1: Dataset size (instances) comparison before and after preprocessing

For the profitability analysis, a dataframe consisting of credit-card-only trip_distance, PULocationID, trip_duration, and total_amount was created. A rate_per_trip attribute was computed using the formula in Equation 2. The dataframe was then grouped by PULocationID and the remaining values averaged to calculate zone_profitability according to Equation 1, as sketched below.
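A minimal sketch of that aggregation. The exact definitions live in Equations 1 and 2; revenue per minute is used here only as a stand-in:

```python
# Assumption: rate_per_trip is shown here as revenue per minute of trip time;
# substitute the report's actual Equation 2 if it differs.
cc = credit_card_df.copy()  # credit-card-only subset from the preprocessing step
cc["rate_per_trip"] = cc["total_amount"] / cc["trip_duration"]

# Equation 1 stand-in: zone profitability as the mean rate over all pick-ups
zone_profitability = (cc.groupby("PULocationID")["rate_per_trip"]
                        .mean()
                        .rename("zone_profitability"))
```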

1.3.2 MTA Static Transport Accessibility Data

The schedule and stop location datasets required no cleaning, as the data is static. The preprocessing steps for this data replicate the TAT calculation process by Correa, Xie and Ozbay (2017)1, with two differences: a shorter cell diameter, and aggregation based on taxi zones instead of Neighbourhood Tabulation Areas.

  • Firstly, a hex grid with a cell diameter of 500 m was fitted over the NYC map, meaning the time taken to traverse each cell on foot is about 5 minutes. Each cell is characterised by the location of its centroid and was assigned to its nearest subway station stop_id through a nearest-neighbour BallTree query (see the sketch after this list). The haversine (great-circle) distance between a cell centroid and its assigned station was also recorded (see Figure 10a in the Analysis section below).
  • departure_time was converted to 24-hour format and only the hour value was extracted. Values of 24 hours or greater, such as “25:15”, indicate that the train departed from its first stop on the previous service day and is now departing from its current stop at 1:15 AM on the next day. These values were wrapped back into standard 24-hour format before their hour components were extracted (see the schedule sketch after this list).
  • To obtain the average number of trains per hour component of the TAT formula (Equation 3), the total count of trains departing for each stop_id and departure_time combination was divided by 3, since each combination appears under 3 schedule types in the original dataset: Weekday, Saturday, and Sunday.
  • The dataframe (11,896 × 3) was converted from long to wide format, with each departure_time hour becoming its own column, resulting in a 496 × 24 dataframe with stop_id as the unique index for each row.
  • The 8 extra cells resulting from the pivot were due to stops 901 and 902 not having any trips between 12 AM and 4 AM, and were imputed with 0.
  • The TAT per hour for each cell was calculated from the two components above, avg_freq and distance.
  • Once the TAT for each cell is determined, the TAT per taxi zone is obtained by averaging the values across the cells within the zone (see Figure 10b in the Analysis section below). Note that this also assigns each zone a nearest stop, which is reused in the preprocessing of the Transport Usage dataset.
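A sketch of the cell-to-station assignment with scikit-learn's BallTree and the haversine metric. The stations and cells dataframes and their column names are assumptions:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

# stations: stop_id, stop_lat, stop_lon (from the MTA stops file)
# cells: lat, lon of each 500 m hex-cell centroid
station_rad = np.deg2rad(stations[["stop_lat", "stop_lon"]].to_numpy())
cell_rad = np.deg2rad(cells[["lat", "lon"]].to_numpy())

# The haversine metric expects [lat, lon] in radians and returns
# great-circle distances in radians
tree = BallTree(station_rad, metric="haversine")
dist_rad, idx = tree.query(cell_rad, k=1)

cells["stop_id"] = stations["stop_id"].to_numpy()[idx.ravel()]
cells["distance_m"] = dist_rad.ravel() * EARTH_RADIUS_M
```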
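And a sketch of the schedule-frequency steps (hour wrapping, averaging over the 3 schedule types, and the long-to-wide pivot), assuming a schedule dataframe with stop_id and GTFS-style departure_time strings:

```python
# GTFS departure_time strings such as "25:15:00" run past midnight;
# taking the hour modulo 24 wraps them back into 24-hour format
schedule["hour"] = schedule["departure_time"].str.split(":").str[0].astype(int) % 24

# Count departures per stop and hour, then divide by the 3 schedule types
# (Weekday, Saturday, Sunday) to get the average trains per hour
avg_freq = (schedule.groupby(["stop_id", "hour"]).size().div(3)
                    .rename("avg_freq").reset_index())

# Long-to-wide pivot: one row per stop_id, one column per hour (0-23);
# the 8 missing stop-hour cells (stops 901 and 902, 12-4 AM) become 0
wide = avg_freq.pivot(index="stop_id", columns="hour", values="avg_freq").fillna(0)
```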

Table 2: Final TAT dataframe (3,750 centroid rows, 496 stop IDs and 260 zone IDs)

1.3.3 qri Transport Usage Data

The turnstile data had already been preprocessed by qri while maintaining MTA's original column structure, which meant little further cleaning was needed and allowed seamless SQL joins with the two previous datasets. Note that since the number of turnstile entries is only an estimate of actual public transport usage and is collected periodically rather than in real time, comparisons with hourly taxi usage do not hold strong statistical power. Averaging methods were therefore used in the following preprocessing to reduce variance in the comparisons, albeit at the cost of the hourly resolution of the preprocessed TLC dataframe.

  • complex_id (the primary key of the turnstile dataset) was converted to gtfs_stop_id (the primary key of the MTA schedule dataset, also used for the TAT data) via a SQL join on the reference table from the MTA website. Since the turnstile dataset excludes data for the three commuter rails, a left join from the turnstile data ensured that only stops with non-null usage data are included in the resulting dataframe.
  • Each gtfs_stop_id was assigned to its containing taxi zone from the TAT data.
  • The entry counts were then grouped by zone and averaged to find the daily average number of people using public transport in each taxi zone.
  • The daily average number of pick-ups (num_of_trips) per zone from the preprocessed TLC dataframe was merged in and used in tandem with the entries column to calculate the mode preference ratio per zone (as detailed in Equation 4; a sketch follows this list).
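A sketch of the final merge and ratio. The zone_id column name and the ratio shown are stand-ins; the actual definition is Equation 4:

```python
# usage: daily average turnstile entries per taxi zone (from the steps above)
# pickups: daily average num_of_trips per zone from the TLC dataframe
merged = usage.merge(pickups, on="zone_id", how="left")

# Assumption: mode preference shown as the share of trips taken by taxi;
# substitute the report's actual Equation 4 if it differs.
merged["mode_preference"] = merged["num_of_trips"] / (merged["num_of_trips"]
                                                      + merged["entries"])
```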

  1. Correa, D., Xie, K., & Ozbay, K. (2017). Exploring the taxi and Uber demand in New York City: An empirical analysis and spatial modeling. In 96th Annual Meeting of the Transportation Research Board, Washington, DC.↩︎