1.3 Data Preprocessing

1.3.1 TLC Yellow Taxi 2019 Data

The following preprocessing pipeline was applied to each monthly dataset:

  • Only the 9 relevant features identified in the section above were selected: trip_distance, DOLocationID, PULocationID, tpep_pickup_datetime, tpep_dropoff_datetime, payment_type, fare_amount, tip_amount and total_amount.
  • Trips with non-positive trip_distance were removed.
  • tpep_pickup_datetime and tpep_dropoff_datetime were converted to datetime objects, and trips whose pick-up or drop-off time fell outside the current month were removed.
  • A trip_duration attribute was created by subtracting tpep_pickup_datetime from tpep_dropoff_datetime, rounded to the nearest minute.
  • Trips with non-positive trip_duration were removed.
  • Trips with a trip_duration of more than 720 minutes (12 hours) were also removed. There are two plausible explanations for such a trip: either the driver forgot to turn off the meter at the end of a shift, which is an input error, or it was a highly unusual interstate trip, in which case the DOLocationID is likely no longer in NYC and should be excluded.
  • Trips with either DOLocationID or PULocationID being 264 or 265 were removed, as they are unknown locations according to the taxi zone lookup table.
  • Duplicated instances were removed.
  • A subset of trips with payment_type == 1 was also created, with attributes DOLocationID, PULocationID, trip_duration, total_amount, tpep_pickup_datetime and trip_distance. Since only credit-card tips (payment type 1) are consistently recorded, any analysis relating to profitability uses this subset. Note that the total_amount variable encapsulates fare_amount, tip_amount and other taxes, and is therefore the more appropriate measure of revenue for the profitability formula.
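The filtering steps above can be sketched in pandas as follows. The function and column names are the TLC schema's; the function itself is a minimal illustration, not the report's exact code.

```python
import pandas as pd

COLS = ["trip_distance", "DOLocationID", "PULocationID",
        "tpep_pickup_datetime", "tpep_dropoff_datetime",
        "payment_type", "fare_amount", "tip_amount", "total_amount"]

def preprocess_month(df: pd.DataFrame, year: int, month: int) -> pd.DataFrame:
    """Apply the cleaning rules above to one monthly TLC dataframe."""
    df = df[COLS].copy()
    # keep only positive trip distances
    df = df[df["trip_distance"] > 0]
    # parse datetimes and keep trips falling entirely inside the month
    for c in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
        df[c] = pd.to_datetime(df[c])
    start = pd.Timestamp(year=year, month=month, day=1)
    end = start + pd.offsets.MonthBegin(1)
    in_month = lambda s: (s >= start) & (s < end)
    df = df[in_month(df["tpep_pickup_datetime"]) &
            in_month(df["tpep_dropoff_datetime"])]
    # trip duration in whole minutes; keep (0, 720]
    df["trip_duration"] = ((df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"])
                           .dt.total_seconds().div(60).round())
    df = df[(df["trip_duration"] > 0) & (df["trip_duration"] <= 720)]
    # drop unknown zones (264/265) and duplicated instances
    unknown = {264, 265}
    df = df[~df["DOLocationID"].isin(unknown) & ~df["PULocationID"].isin(unknown)]
    return df.drop_duplicates()
```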

After preprocessing all 12 monthly datasets, each attribute was merged into an annual Pandas series and serialised to the PyArrow Feather format. For analysis and visualisation, only the attributes of interest are loaded into memory (under 700MB per attribute), which allows for faster computation than using Spark. Geovisualisation was performed with the GeoPandas package.

Table 1: Dataset size (instances) comparison before and after preprocessing

| Month     | Original   | Preprocessed | Reduction % | Credit Only | Credit Only % |
|-----------|-----------:|-------------:|------------:|------------:|--------------:|
| January   |  7,667,792 |    7,406,029 |         3.4 |   5,321,953 |          71.9 |
| February  |  7,019,375 |    6,855,350 |         2.3 |   5,088,365 |          74.2 |
| March     |  7,832,545 |    7,676,761 |         2.0 |   5,630,306 |          73.3 |
| April     |  7,433,139 |    7,264,644 |         2.3 |   5,254,251 |          72.3 |
| May       |  7,565,261 |    7,391,333 |         2.3 |   5,353,756 |          72.4 |
| June      |  6,941,024 |    6,770,765 |         2.5 |   4,883,001 |          72.1 |
| July      |  6,310,419 |    6,147,767 |         2.6 |   4,327,608 |          70.4 |
| August    |  6,073,357 |    5,908,200 |         2.7 |   4,117,299 |          69.7 |
| September |  6,567,788 |    6,400,096 |         2.6 |   4,624,211 |          72.3 |
| October   |  7,213,891 |    7,042,317 |         2.4 |   5,111,277 |          72.6 |
| November  |  6,878,111 |    6,708,031 |         2.5 |   4,875,722 |          72.7 |
| December  |  6,896,317 |    6,725,257 |         2.5 |   4,772,482 |          71.0 |
| Total     | 84,399,019 |   82,296,550 |         2.5 |  59,360,231 |          72.1 |

For the profitability analysis, a dataframe consisting of credit-card-only trip_distance, PULocationID, trip_duration and total_amount was created. A rate_per_trip attribute was computed using the formula in Equation 2. The dataframe was then grouped by PULocationID and the remaining values were averaged to calculate zone_profitability according to Equation 1.
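The grouping step can be sketched as below. Equations 1 and 2 are not reproduced in this section, so the rate formula shown (dollars per minute) is only a placeholder assumption; the actual definition is the report's Equation 2.

```python
import pandas as pd

def zone_profitability(credit_df: pd.DataFrame) -> pd.Series:
    """Average a per-trip rate by pick-up zone.

    credit_df is the credit-card-only dataframe with columns
    PULocationID, total_amount and trip_duration (minutes).
    """
    df = credit_df.copy()
    # placeholder for Equation 2: revenue per minute of trip time
    df["rate_per_trip"] = df["total_amount"] / df["trip_duration"]
    # placeholder for Equation 1: mean rate per pick-up zone
    return df.groupby("PULocationID")["rate_per_trip"].mean()
```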

1.3.2 MTA Static Transport Accessibility Data

The schedule and stop location datasets require no cleaning, as the data is static. The preprocessing steps for this data replicate the TAT calculation process of Correa, Xie and Ozbay (2017), with the differences being a shorter cell diameter and aggregation based on taxi zones instead of Neighbourhood Tabulation Areas.

  • Firstly, a hex grid with a cell diameter of 500m was fitted over the NYC map, so that each cell takes roughly 5 minutes to traverse on foot. Each cell is characterised by the location of its centroid and was assigned to the stop_id of its nearest subway station via a nearest-neighbour BallTree query. The haversine (great-circle) distance between each cell centroid and its assigned station was also recorded (see Figure 10a in the Analysis section below).
  • departure_time was converted to 24-hour format and only the hour value was extracted. Values greater than or equal to 24, such as “25:15”, indicate that the train departed from its first stop on the previous service day and is departing from its current stop after midnight (here, 1:15 AM the next day); these values were wrapped back into the standard 24-hour range before their hour components were extracted.
  • To obtain the average number of trains per hour component of the TAT formula (Equation 3), the total count of trains departing from each stop_id–departure_time combination was divided by 3, since each combination appears under 3 schedule types in the original dataset: Weekday, Saturday and Sunday.
  • The dataframe (11896 × 3) was converted from long to wide format, with each departure_time hour becoming its own column, resulting in a (496 × 24) dataframe with stop_id as the unique index for each row.
  • The 8 extra cells created by the pivot arose because stops 901 and 902 have no trips between 12 AM and 4 AM; these cells were imputed with 0.
  • The TAT per hour for each cell was calculated from the two components above, avg_freq and distance.
  • Once the TAT for each cell was determined, the TAT per taxi zone was calculated by averaging the values across the cells contained within the zone (see Figure 10b in the Analysis section below). Note that this also resulted in each zone being assigned a nearest stop, which is reused in the preprocessing of the Transport Usage dataset.
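The main computational steps above (nearest-station assignment, GTFS hour wrapping, and the long-to-wide pivot) can be sketched as follows. Function names and input shapes are illustrative assumptions; the BallTree query with the haversine metric matches the approach described, with coordinates converted to radians and the returned angular distances scaled by the Earth's radius.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def assign_nearest_station(cell_latlon_deg, station_latlon_deg):
    """Assign each hex-cell centroid to its nearest subway station.

    Both inputs are (n, 2) arrays of [lat, lon] in degrees. Returns the
    index of the nearest station for each cell and the haversine
    distance in metres.
    """
    tree = BallTree(np.radians(station_latlon_deg), metric="haversine")
    dist_rad, idx = tree.query(np.radians(cell_latlon_deg), k=1)
    return idx.ravel(), dist_rad.ravel() * EARTH_RADIUS_M

def hour_of_service_day(departure_time: str) -> int:
    """GTFS times may exceed 24:00; map e.g. '25:15:00' to hour 1."""
    return int(departure_time.split(":")[0]) % 24

def trains_per_hour(departures: pd.DataFrame) -> pd.DataFrame:
    """Pivot departures (columns: stop_id, departure_time) into a
    stops x 24-hour table of average departures per hour, dividing by
    the 3 schedule types (Weekday, Saturday, Sunday) and imputing
    missing hours with 0."""
    df = departures.assign(hour=departures["departure_time"].map(hour_of_service_day))
    counts = df.groupby(["stop_id", "hour"]).size().div(3).rename("avg_freq")
    wide = counts.reset_index().pivot(index="stop_id", columns="hour", values="avg_freq")
    return wide.reindex(columns=range(24)).fillna(0)
```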

Table 2: Final TAT dataframe (3750 rows of centroids, 496 stop IDs and 260 Zone IDs)

| Column            | Description                               | Sample Data                                       |
|-------------------|-------------------------------------------|---------------------------------------------------|
| zone_id           | Taxi zone ID                              | 1                                                 |
| zone              | Taxi zone name                            | Newark Airport                                    |
| geometry          | Taxi zone geometry                        | POLYGON ((-74.18445 40.69500, -74.18449 40.695... |
| centroid          | Hex cell geometry                         | POINT (-73.91154 40.91112)                        |
| stop_id           | Closest station stop ID                   | 106                                               |
| closest_stop_geom | Closest station geometry                  | POINT (-73.90983 40.87456)                        |
| distance          | Distance from centroid to closest station | 1142.440469                                       |
| TAT_00_00         | Transport Access Time at 12 AM            | 60.711021                                         |
| TAT_01_00         | Transport Access Time at 1 AM             | 61.78245                                          |
| TAT_02_00         | Transport Access Time at 2 AM             | 64.28245                                          |
| ...               | ...                                       | ...                                               |
| TAT_23_00         | Transport Access Time at 11 PM            | 58.78245                                          |

1.3.3 qri Transport Usage Data

The turnstile data has been preprocessed by qri while maintaining MTA's original column structure, which meant little data cleaning was required and allowed seamless SQL joins with the two previous datasets. Note that since the number of turnstile entries is only an estimate of actual public transport usage, and is collected periodically rather than in real time, comparisons with hourly taxi usage carry little statistical power. Averaging was therefore used in the following preprocessing to reduce variance in the comparisons, at the cost of losing the hourly resolution of the preprocessed TLC dataframe.

  • complex_id (the primary key of the turnstile dataset) was converted to gtfs_stop_id (the primary key of the MTA schedule dataset, also used for the TAT data) via a SQL join using the reference table from the MTA website. Since the turnstile dataset excludes data for the three commuter rails, a left join ensured that only stops with non-null usage data are included in the resulting dataframe.
  • Each gtfs_stop_id was assigned to its containing taxi zone from the TAT data.
  • The entry counts were then grouped by zone and averaged to find the daily average number of people using public transport in each taxi zone.
  • The daily average number of pick-ups (num_of_trips) per zone from the preprocessed TLC dataframe was merged in and used together with the entries column to calculate the mode preference ratio per zone (as detailed in Equation 4).
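The join-and-average pipeline above could look like the sketch below (shown with pandas merges rather than SQL). All column and table names are assumptions for illustration, and the final ratio is only a stand-in for the report's Equation 4.

```python
import pandas as pd

def mode_preference(turnstile, stop_lookup, stop_zone, taxi_trips):
    """Hypothetical schema:
    turnstile:   complex_id, entries (daily entry counts)
    stop_lookup: complex_id -> gtfs_stop_id (MTA reference table)
    stop_zone:   gtfs_stop_id -> zone_id (from the TAT data)
    taxi_trips:  zone_id, num_of_trips (daily pick-up counts per zone)
    """
    # map turnstile complexes to GTFS stops, then to taxi zones
    usage = (turnstile.merge(stop_lookup, on="complex_id", how="left")
                      .merge(stop_zone, on="gtfs_stop_id", how="left"))
    # daily average public-transport usage per zone
    per_zone = usage.groupby("zone_id")["entries"].mean().rename("avg_entries")
    # daily average taxi pick-ups per zone
    taxi = taxi_trips.groupby("zone_id")["num_of_trips"].mean()
    out = pd.concat([per_zone, taxi], axis=1).dropna()
    # placeholder for Equation 4: taxi pick-ups relative to transit entries
    out["mode_preference"] = out["num_of_trips"] / out["avg_entries"]
    return out
```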

  1. Correa, D., Xie, K., & Ozbay, K. (2017). Exploring the taxi and Uber demand in New York City: An empirical analysis and spatial modeling. In 96th Annual Meeting of the Transportation Research Board, Washington, DC.