2.4 Model Selection
The goal of explanatory modelling is to provide insights into each explanatory variable rather than to arrive at a complex black-box model with high predictive accuracy. Therefore, the proposed methodology is to compare the performance of different variations within the same model family and attempt to pinpoint which variables are responsible for the differences in the predicted values of Total Amount. The Linear Model (LM) family is chosen as it is easy to interpret, efficient to train with very little risk of non-convergence, and handles mixed-type inputs well (it reduces to ANOVA-style regression when categorical predictors are present, and interaction terms can be included). An extension of the LM, the Linear Mixed Model (LME), is also considered.
2.4.1 Linear Additive Models
In the additive LMs, the relationship between the response variable to be predicted (Total Amount) and the set of predictors \(X\) (assumed to affect the response independently) can be expressed as:
\[\log{(\text{Total amount})}=X\beta + \varepsilon\] For each categorical predictor with \(n\) factor levels, the coefficient vector \(\beta\) contains \(n - 1\) additional terms, each corresponding to the contrast with the first (reference) factor level. The error vector \(\varepsilon\) is assumed to be normally distributed.
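As an illustration only (the report's own tooling is not shown here), below is a minimal Python/statsmodels sketch of fitting such an additive model with a formula interface. The trips data frame is synthetic, and the column names (Total_amount, Duration, TAT, PUArea, DOArea) are assumed labels for the predictors discussed in this report, with Duration and TAT treated as numeric.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the training data; column names are assumptions.
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "Total_amount": rng.uniform(5, 80, 200),
    "Duration": rng.uniform(2, 60, 200),
    "TAT": rng.uniform(0, 24, 200),
    "PUArea": rng.choice(["Downtown", "Uptown", "JFK"], 200),
    "DOArea": rng.choice(["Downtown", "Uptown", "JFK"], 200),
})

# C(...) expands each categorical predictor into n - 1 contrast terms
# against its first (reference) level, as described above.
additive = smf.ols(
    "np.log(Total_amount) ~ Duration + TAT + C(PUArea) + C(DOArea)",
    data=trips,
).fit()
print(additive.summary())
```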
2.4.2 Linear Interaction Models
From the results of the Feature Analysis section, it is clear that the predictors exhibit some degree of collinearity and interaction, violating the independence assumption of the additive models. Questions such as “will there be a difference in Total Amount between a trip beginning in Downtown and ending at JFK compared to a trip beginning in Uptown and ending at JFK?” can be answered with the interaction models, which introduce an interaction coefficient vector \(\beta_*\) and an interaction design matrix \(X_*\):
\[\log{(\text{Total amount})}=X\beta + X_* \beta_* + \varepsilon\] The interaction design matrix \(X_*\) can handle categorical-categorical, numerical-categorical, and numerical-numerical interactions. Note that for categorical-categorical interactions, the number of additional coefficients grows with the product of the numbers of factor levels. Hence, the Area-level spatial resolution is significantly more computationally efficient than the Zone-level resolution, as the interaction between drop-off zones and pick-up zones can introduce up to 62500 terms in the model!
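A corresponding sketch, under the same assumed column names as the additive example, shows how interaction terms enter the formula; in this syntax the * operator adds main effects plus their interaction, while : adds only the interaction term.

```python
import numpy as np
import statsmodels.formula.api as smf

def fit_interaction_model(trips):
    """Fit a sketch of the interaction model on a data frame with the assumed
    columns (Total_amount, Duration, TAT, PUArea, DOArea).

    C(PUArea) * C(DOArea) expands into one coefficient per combination of
    non-reference levels, which is why the coarser Area resolution keeps
    the design matrix manageable compared with Zone-level resolution.
    """
    return smf.ols(
        "np.log(Total_amount) ~ Duration + TAT + C(PUArea) * C(DOArea)",
        data=trips,
    ).fit()
```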
2.4.3 Linear Mixed Models
LMEs (Gałecki & Burzykowski, 2013) are also used to handle interactions between variables, by additionally modelling the effect of the numerical predictors as random variables (called “random effects”) that vary across the factor levels of selected categorical predictors.
\[\log{(\text{Total amount})}=X\beta + Zu + \varepsilon\] For example, from the Feature Analysis section, the Total Amount for trips finishing at JFK and LGA is generally greater than for other trips. This between-group relationship is captured by the fixed-effect coefficients in \(\beta\), similarly to the classic LMs. However, among the trips finishing at JFK, the ones that begin in areas closer to JFK will have a lower Total Amount than the ones that begin further away. This within-group relationship is additionally modelled by the random-effect coefficients in \(u\), where \(Z\) is the design matrix for the within-group random effects.
LMEs are conceptually similar to one-factor LMs. A random-intercept LME fits a different mean Total Amount for each group, drawn from a common distribution, whereas a one-factor LM assumes all groups share the same baseline mean Total Amount (i.e., the intercept) and models the effect of being in a specific group as a fixed shift of that intercept. Note that LMEs do not have a standard degrees-of-freedom measure, as they are effectively an LM nested within another LM.
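As a hedged sketch of how such a model might be specified, the following assumes the same hypothetical trips data frame and uses the drop-off area as the grouping factor, with a random intercept and a random Duration slope per group; the grouping choice is illustrative rather than taken from the report.

```python
import numpy as np
import statsmodels.formula.api as smf

def fit_mixed_model(trips):
    """Sketch of an LME: fixed effects for Duration and TAT, plus a random
    intercept and a random Duration slope for each drop-off area (the
    grouping factor). Column names and the grouping choice are assumptions."""
    return smf.mixedlm(
        "np.log(Total_amount) ~ Duration + TAT",
        data=trips,
        groups="DOArea",        # within-group structure from the drop-off area
        re_formula="~Duration", # random intercept + random Duration slope
    ).fit()
```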
2.4.4 Experimental Design
Table 3: Experimental design
A benchmark model M0 is first established by running an AIC-based stepwise feature selection on the largest possible interaction model of all 4 predictors (Duration + TAT + PUArea + DOArea). As we are not interested in maximising predictive power but rather in the effect of each term in the model, the experimental design (Table 3) mostly considers models of a smaller scope than M0 (except for M9). The description column highlights the feature of interest in each model, as well as the type of model used (additive, interaction, or mixed model). The initial phase of model building investigates how closely the performance of the alternative models M1-M9 comes to the benchmark level, and compares related additive, interaction and mixed models to identify which is most suitable for describing the relationship between each predictor and the target variable. The second phase trains the best model on the full training set, evaluates it on the testing set and performs error analysis.
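The exact stepwise procedure is not reproduced here; the following is a simplified backward-elimination sketch driven by AIC, using the same hypothetical column names and formula interface as the earlier examples. It illustrates the idea behind building M0, not the precise algorithm used in the report.

```python
import numpy as np
import statsmodels.formula.api as smf

def backward_aic(data, response, terms):
    """Greedy backward elimination on a list of model terms, dropping the
    single term whose removal lowers AIC the most at each pass (a simplified
    stand-in for the stepwise selection used to obtain the benchmark M0)."""
    current = list(terms)
    best_aic = smf.ols(f"{response} ~ {' + '.join(current)}", data=data).fit().aic
    while len(current) > 1:
        trials = []
        for term in current:
            reduced = [t for t in current if t != term]
            aic = smf.ols(f"{response} ~ {' + '.join(reduced)}", data=data).fit().aic
            trials.append((aic, reduced))
        best_aic_drop, best_terms = min(trials, key=lambda t: t[0])
        if best_aic_drop >= best_aic:
            break  # no single-term removal improves AIC any further
        best_aic, current = best_aic_drop, best_terms
    return current, best_aic

# Hypothetical starting point: the full interaction model of all four predictors.
full_terms = ["Duration", "TAT", "C(PUArea)", "C(DOArea)",
              "Duration:C(DOArea)", "C(PUArea):C(DOArea)"]
# selected_terms, aic_m0 = backward_aic(trips, "np.log(Total_amount)", full_terms)
```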
2.4.5 Model Selection Metrics
Akaike Information Criterion (AIC)
To compare performance between LMs and LMEs, the traditional goodness-of-fit measure \(R^2\) is not suitable due to the structure of LMEs. As such, the preferred metric for model comparison is the Akaike Information Criterion (AIC), which is defined as
\[AIC = 2 \times \text{Number of parameters} - 2\log{\hat{L}}\]
An advantage of AIC is that it handles the trade-off between the goodness of fit of a model and its simplicity by penalising the number of parameters. The lower the AIC, the better the model in comparison to other models trained on the same dataset.
We also use the Model Relative Likelihood (MRL), a comparison ratio derived from the AIC, to compare each alternative model with the benchmark model M0.
\[MRL_i = \exp{\left(\frac{AIC(M_0)-AIC(M_i)}{2}\right)}\]
An MRL close to 1 indicates that the model performs about as well as M0, whereas an MRL close to 0 means that the model fits substantially worse than M0.
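A minimal sketch of computing the MRL from two fitted models follows, assuming statsmodels-style results objects that expose an aic attribute.

```python
import numpy as np

def mrl(aic_benchmark, aic_alternative):
    """Relative likelihood of an alternative model against the benchmark M0,
    as defined above: values near 1 indicate a comparable fit, values near 0
    indicate a substantially worse fit."""
    return float(np.exp((aic_benchmark - aic_alternative) / 2.0))

# Hypothetical usage with two fitted results m0 and m1:
# print(mrl(m0.aic, m1.aic))
```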
Root Mean Squared Error (RMSE)
Each model is also evaluated on predictive performance using its predictions on the testing set. The traditional Root Mean Squared Error (RMSE) is calculated for each model; for the current regression task, the lower the RMSE, the better the model performs.
\[RMSE = \sqrt{\frac{1}{N_\text{test}}\sum_{i=1}^{N_\text{test}}\left(\text{Predicted}_i-\text{Actual}_i\right)^2}\]
Note that the reported RMSEs are transformed back from the log scale so that they can be interpreted as differences from the actual Total Amount in dollar values.
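A sketch of one way to obtain such a dollar-scale RMSE is shown below, assuming the model predicts on the log scale and the predictions are exponentiated before comparison; this is an illustrative choice, not necessarily the exact back-transformation used in the report.

```python
import numpy as np

def rmse_dollars(log_predictions, actual_total_amount):
    """Dollar-scale RMSE: exponentiate log-scale predictions back to dollars
    before comparing with the observed Total Amount (one way to obtain the
    back-transformed RMSE described above)."""
    predicted = np.exp(np.asarray(log_predictions, dtype=float))
    actual = np.asarray(actual_total_amount, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Hypothetical usage with a fitted model and a held-out test set:
# print(rmse_dollars(model.predict(test), test["Total_amount"]))
```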
Gałecki, A., & Burzykowski, T. (2013). Linear mixed-effects model. In Linear Mixed-Effects Models Using R (pp. 245-273). Springer, New York, NY.