Phase 2 Explanatory Modelling

The visualisation phase of the project “Exploring Yellow Taxi Profitability in New York City: A Spatio-Temporal Analysis” found that profitability in 2019 is mostly related to pickup hour and pickup location, the two major factors affecting taxi demand. Specifically, even though other suburbs reported longer trip durations on average and consequently higher total fare amounts, the three hotspot areas – Manhattan, LaGuardia Airport (LGA) and JFK Airport (JFK) – had the advantage of consistently high demand for taxis throughout the day. The effect of competition from the subway as an alternative mode of transport was also investigated by measuring the average transport preference for each suburb. This factor, however, does not seem to significantly affect taxi demand, and it is concluded that demand is high enough to sustain both modes of transport.

The second phase of the project focuses on quantifying the relationships between the above-mentioned factors through explanatory modelling on the 2019 TLC dataset. In phase 1, our main outcome variable was the expected zone profitability metric, which represents the expected profit from picking up a customer in a given zone, adjusted for the demand and competition level (via the number of trips) and the logistic cost (via the trip duration). A major drawback of this metric is that it is derived and averaged at the zone level, which makes it unsuitable for modelling at the individual trip level. To measure profitability at the individual trip level, another metric, the rate per trip, was introduced:

\[\text{Rate per trip (dollar/min)} = \frac{\text{Total Fare Amount}-\text{ACPM}\times\text{Distance (miles)}}{\text{Duration (min)}}\]

where \(\text{ACPM}=0.58\) is the cost per mile estimated by TLC. As Total Fare Amount is highly linearly correlated with the other two factors, with pairwise Pearson’s correlations of 0.81 and 0.95 respectively, the proposed metric can be approximated by Total Fare Amount alone. As such, from a modelling perspective, Total Amount is the main random variable of interest: it can both represent profitability at the trip level and quantify the relationship between the trip factors. Following the phase 1 results, two time-independent hypotheses are proposed:

  • Total Amount is higher for trips starting and ending in hotspot areas.
  • Total Amount is higher for trips starting in areas with lower public transport accessibility.
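
One way to make these two hypotheses concrete is to read them as sign restrictions on the coefficients of a linear model. This is purely an illustrative sketch and not necessarily the specification used later in the report. The hotspot indicator \(H_i\), the public transport accessibility score \(A_i\), the coefficients \(\beta_0, \beta_1, \beta_2\) and the optional zone-level random intercept \(u_{z(i)}\) are all notation assumed here for illustration:

\[\text{Total Amount}_i = \beta_0 + \beta_1 H_i + \beta_2 A_i + u_{z(i)} + \varepsilon_i\]

Under this assumed form, the first hypothesis corresponds to \(\beta_1 > 0\) (trips starting and ending in hotspot areas have a higher Total Amount), and the second to \(\beta_2 < 0\) (Total Amount decreases as public transport accessibility at the pickup location increases).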

Data preprocessing is performed in Python. Feature analysis and modelling are performed in R; the modelling uses the base R lm function and the lmer function from the lme4 package (see the Explanatory Modelling section for more information about the models used).
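
As an illustration of the trip-level metric defined above, the following minimal Python sketch computes the rate per trip for a single record. The function name, argument names and the example trip are hypothetical and chosen only for illustration; they are not taken from the project code. The guard against non-positive durations is likewise an assumption about how invalid records might be handled.

```python
# Estimated cost per mile in dollars, as quoted from TLC in the text.
ACPM = 0.58


def rate_per_trip(total_fare, distance_miles, duration_min, acpm=ACPM):
    """Return the rate per trip (dollar/min) for one trip.

    Implements (total fare - ACPM * distance) / duration.
    """
    if duration_min <= 0:
        # Assumption: trips with zero or negative duration are treated as invalid.
        raise ValueError("duration_min must be positive")
    return (total_fare - acpm * distance_miles) / duration_min


# Hypothetical trip: $20.00 total fare, 5 miles, 15 minutes.
example_rate = rate_per_trip(20.0, 5.0, 15.0)
print(round(example_rate, 2))
```

For the hypothetical trip above, the operating cost is 0.58 × 5 = 2.90 dollars, so the rate is (20.00 − 2.90) / 15 ≈ 1.14 dollars per minute.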