2.1 Dataset & Sampling

The second phase of the project uses the 2019 subset of the Yellow Taxi trip data preprocessed during phase 1, retaining 8 relevant features: trip_distance, DOLocationID, PULocationID, trip_duration, total_amount, pickup_hour, tip_amount, and fare_amount. Since Total Amount encapsulates both Fare Amount and Tip Amount, and only credit-card tips are consistently recorded, the following analysis considers only the subset of 59,360,231 credit-card instances.
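For concreteness, a minimal pandas sketch of this filtering step is shown below. The file path is hypothetical, and the presence of a payment_type column in the phase-1 output is an assumption; in the TLC data dictionary, payment_type == 1 denotes credit card.

```python
import pandas as pd

# Hypothetical path to the phase-1 preprocessed 2019 data.
trips = pd.read_parquet("data/yellow_taxi_2019_cleaned.parquet")

# Assumption: a payment_type column survives phase 1; in the TLC data
# dictionary, payment_type == 1 denotes credit card, the only payment
# type for which tip_amount is consistently recorded.
trips = trips[trips["payment_type"] == 1]

FEATURES = [
    "trip_distance", "DOLocationID", "PULocationID", "trip_duration",
    "total_amount", "pickup_hour", "tip_amount", "fare_amount",
]
trips = trips[FEATURES]
```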

2.1.1 Feature Engineering

Area

During phase 1, grouping the zone locations into 6 areas (Downtown, Midtown, Uptown, LGA, JFK, Others) allowed for better comparisons between zones, particularly those in Manhattan. The current analysis therefore also groups DOLocationID and PULocationID into DOArea and PUArea, effectively introducing two different spatial resolutions at which to compare the models.
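Continuing the sketch above, the grouping could be applied with a zone-to-area lookup. The dictionary entries shown are illustrative only (zone IDs 132 and 138 are the TLC codes for JFK and LaGuardia); the real mapping covers every TLC zone.

```python
# Illustrative zone-to-area lookup; the real table covers all TLC zones.
ZONE_TO_AREA = {
    132: "JFK",      # JFK Airport
    138: "LGA",      # LaGuardia Airport
    161: "Midtown",  # Midtown Center
    # ... remaining zones assigned to Downtown, Midtown, Uptown, Others
}

# Derive the coarser spatial resolution from each location ID; zones
# missing from this illustrative lookup fall back to "Others".
for id_col, area_col in [("PULocationID", "PUArea"), ("DOLocationID", "DOArea")]:
    trips[area_col] = trips[id_col].map(ZONE_TO_AREA).fillna("Others")
```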

Transport Access Time (TAT)

Although its effect was found not to be significant in phase 1, public transport demand was only quantified as a daily average in the visualization report, due to the lack of tools to visualize hourly demand as measured by TAT. TAT originally measures the average time taken to reach the nearest subway station from a taxi zone at a specific hour of the day:

\[\text{TAT}_h\,(\text{min}) = 60 \times \frac{\text{Distance to nearest subway}}{v_\text{walking}} + \frac{60}{\text{Average number of trains at hour } h}\]

Explanatory modelling allows for the inclusion of time-specific TAT as a factor, which can reveal more about how public transport demand affects Total Amount. Consequently, TAT is merged into the working dataset based on the pickup hour and pickup zone of each trip.
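A minimal sketch of the computation and merge follows, assuming a hypothetical lookup table subway with one row per (zone, hour) pair; the column names and the 5 km/h walking speed are assumptions.

```python
V_WALKING = 5.0  # assumed average walking speed, km/h

# `subway` is a hypothetical DataFrame with one row per (zone_id, hour),
# holding the distance to the nearest subway station and the average
# number of trains arriving in that hour.
subway["TAT"] = (
    60 * subway["dist_to_subway_km"] / V_WALKING
    + 60 / subway["avg_trains_per_hour"]
)

# Attach the hour-specific TAT of the pickup zone to every trip.
trips = trips.merge(
    subway[["zone_id", "hour", "TAT"]],
    left_on=["PULocationID", "pickup_hour"],
    right_on=["zone_id", "hour"],
    how="left",
).drop(columns=["zone_id", "hour"])
```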

2.1.2 Sampling

Unlike predictive modelling, where the goal is to build an accurate model, explanatory modelling does not require as many training data points, provided that the resulting inferences are scientifically meaningful and statistically significant. Thus, the second phase of the project does not use all 59 million data points to train the models, since doing so would increase training time and limit the number and complexity of the models that can be investigated.

One million instances are randomly sampled without replacement from the over 59 million data points in the chosen dataset. A Kolmogorov–Smirnov test [1] is performed on each feature of the sample to ensure that its sample distribution is not statistically different from the original distribution. The obtained sample is further randomly split into 80,000 training instances and 20,000 testing instances.
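A sketch of this procedure, using scipy's two-sample Kolmogorov–Smirnov test, is shown below; the random seeds and the conventional 0.05 threshold are assumptions not stated in the report.

```python
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

# Draw the one-million-row sample without replacement (seed is an assumption).
sample = trips.sample(n=1_000_000, replace=False, random_state=42)

# Two-sample KS test per feature: a p-value above the conventional 0.05
# threshold gives no evidence that the sample distribution differs from
# the original distribution.
for col in FEATURES:
    stat, p = ks_2samp(sample[col], trips[col])
    print(f"{col}: KS statistic = {stat:.4f}, p-value = {p:.3f}")

# Split into the stated 80,000 training and 20,000 testing instances.
train, test = train_test_split(
    sample, train_size=80_000, test_size=20_000, random_state=42
)
```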

Some tasks in the following analysis do not use all 80,000 training instances, due to computational cost and the quality of visualizations, but rather a sub-sample of this set. Specifically, the pairwise plots use a sub-sample of size 100, whereas the initial phase of model building uses a sub-sample of size 10,000. The final model is trained on the full training set and evaluated on the full testing set.
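These fixed-size sub-samples could be drawn once up front, e.g. (seeds again assumed):

```python
# Fixed-size sub-samples for the cheaper tasks (seeds are assumptions).
pairplot_sample = train.sample(n=100, random_state=0)      # pairwise plots
model_dev_sample = train.sample(n=10_000, random_state=0)  # initial model building
# The final model is fit on `train` and evaluated on `test` in full.
```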


  1. Lilliefors, H. W. (1967). On the Kolmogorov–Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318), 399–402.↩︎