2.2 Feature Transformation

2.2.1 Log Transformation of Numerical Features

Phase 1 analysis found that most of the trip-related numerical features (all except pickup hour) have extremely high kurtosis and skewness, mainly because demand is heavily disproportionate between hotspot and non-hotspot areas. One remedy is to apply a log base 10 transformation to the five trip-related numerical features: trip_duration, trip_distance, total_amount, tip_amount, and fare_amount. Instances with a negative value in any of these five features were discarded, and a small constant of 0.001 was added to every remaining value prior to the log transformation to avoid taking the log of zero. As explained in the visualization report, the few negative values in the amount features may indicate refunds or transactional disputes, which carry no statistical importance for the two proposed hypotheses and are therefore not considered in this analysis.
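The filtering and transformation steps described above can be sketched in R as follows. This is only a minimal illustration under stated assumptions, not the exact pipeline used in the analysis: the data frame name trips, the helper names num_cols and eps, and all toy values are hypothetical, while the five column names and the constant 0.001 come from the text.

```r
# Hypothetical toy data; column names follow the text, values are made up.
trips <- data.frame(
  trip_duration = c(600, 0, 1200, 300),
  trip_distance = c(1.5, 0.0, 3.2, 0.8),
  total_amount  = c(12.3, 3.0, -5.0, 8.8),
  tip_amount    = c(2.0, 0.0, 0.0, 1.5),
  fare_amount   = c(9.5, 2.5, -4.5, 6.0)
)

# The five features that are log-transformed.
num_cols <- c("trip_duration", "trip_distance", "total_amount",
              "tip_amount", "fare_amount")

# Small constant added before the log to avoid log10(0).
eps <- 0.001

# Discard any instance with a negative value in at least one of the five features.
keep <- rowSums(trips[, num_cols] < 0) == 0
trips_clean <- trips[keep, , drop = FALSE]

# Apply log10(x + eps) column by column.
trips_log <- trips_clean
trips_log[num_cols] <- lapply(trips_clean[num_cols],
                              function(x) log10(x + eps))
```

In this toy example the third row is dropped because two of its amount features are negative, and the zero-valued entries in the second row remain finite after the transformation thanks to the added constant.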

This transformation pipeline results in a training set of 799999 instances and a testing set of 199999 instances. From this point onwards, each log-transformed attribute is referred to simply by its name (e.g., “Duration” refers to log10 of trip_duration). No transformation is applied to TAT, so that it retains its theoretical meaning.

2.2.2 Factorization of Categorical Features

The categorical features, namely DOLocationID (250 levels), PULocationID (250 levels), DOArea (6 levels), and PUArea (6 levels), are converted to R factor objects. pickup_hour is treated either as a numerical feature (not log-transformed) or as a categorical (factor) feature, depending on the specific use case.
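A minimal sketch of this factorization is shown below. It is an illustration only: the data frame trips, the location IDs, the area labels "A" and "B", and the extra column name pickup_hour_f are hypothetical, and the 0 to 23 range for pickup_hour is an assumption rather than something stated in the text.

```r
# Hypothetical toy data; IDs and area labels are placeholders, not real values.
trips <- data.frame(
  PULocationID = c(132, 48, 132, 7),
  DOLocationID = c(48, 7, 230, 132),
  PUArea       = c("A", "B", "A", "A"),
  DOArea       = c("B", "A", "B", "A"),
  pickup_hour  = c(8, 17, 23, 8),
  stringsAsFactors = FALSE
)

# Convert the four categorical features to factors.
cat_cols <- c("DOLocationID", "PULocationID", "DOArea", "PUArea")
trips[cat_cols] <- lapply(trips[cat_cols], factor)

# Keep pickup_hour numeric and add a factor copy for categorical use cases.
# Fixing the levels to 0:23 (an assumption) keeps all hours as levels
# even when some hours do not occur in the data.
trips$pickup_hour_f <- factor(trips$pickup_hour, levels = 0:23)
```

Keeping both the numeric pickup_hour and a factor copy is one simple way to switch between the two treatments per model; converting on the fly inside a model formula with factor(pickup_hour) would work as well.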