America’s Next Top Regression Model

A huge part of any research comes down to modeling. You explore a bunch of variables and try to see which ones are important predictors of the outcome you care about. In transportation, modeling is extremely important for the planners and leaders who make decisions about transportation infrastructure. If you want to build a new light rail line in your city, you have to be able to show that you can get ridership out of it.

I have spent the last week looking at how to build a model to predict ridership levels. Why is ridership so important? Well, the majority of positive outcomes from public transit are contingent on high ridership and use. You cannot get economic development and congestion relief from a new rail service if no one rides it. And while no transit organization makes a profit from fares, high fare revenue helps keep a transit system functioning without having to rely completely on government subsidies. Ridership, while not the most detailed and fair metric, is often used as a quick and easy indicator of a successful system.

What variables drive ridership? Arguably, everything. While constructing a model, it is worthwhile to consider as many variables as possible to begin with, and then throw out the ones that do not appear to be statistically significant.

Off the bat, we could say some obvious predictors of ridership are the route miles (DRM), revenue vehicle miles (RVM) and service area. These are characteristics of the physical system. But demographics of the host city also impact ridership. These include population, unemployment, travel time to work and the travel time index (TTI). The last two are measures of congestion in the city. The travel time index is calculated by taking the time it takes to make a trip during peak hours and dividing it by the time it takes to make the same trip at free-flow speeds (i.e. no traffic). For example, an index of 1.20 means that a trip that takes 10 min at free-flow speeds takes 12 min in rush hour traffic. While that may not seem bad at first, consider that the national average travel time to work is 25.7 min (American Community Survey 2014 Estimate). So even a TTI of 1.20 can add several minutes to a typical commute.
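To make the TTI arithmetic concrete, here is a minimal Python sketch. The function names are my own, and treating the 25.7 min national average as an observed peak-hour time is an assumption made only for illustration.

```python
def travel_time_index(peak_minutes, free_flow_minutes):
    """TTI = peak-hour travel time divided by free-flow travel time."""
    return peak_minutes / free_flow_minutes


def delay_minutes(observed_peak_minutes, tti):
    """Minutes of a peak-hour trip attributable to congestion, given its TTI."""
    free_flow = observed_peak_minutes / tti
    return observed_peak_minutes - free_flow


# The example from the text: a 10 min free-flow trip that takes 12 min at rush hour.
print(travel_time_index(12, 10))

# Assumption: treat the 25.7 min average commute as an observed peak-hour time.
print(round(delay_minutes(25.7, 1.20), 1))
```

Under that assumption, a TTI of 1.20 works out to a bit over four minutes of congestion delay each way.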

To begin developing a working model, I tested four variables: service population, unemployment, travel time to work and the travel time index. I predicted that service population would be the dominating variable.

After running a multivariate regression in Excel, I found that unemployment, travel time to work and TTI all had p-values greater than 0.05. By the conventional 0.05 threshold, this means that these variables are not statistically significant in the model. Only service population had a coefficient statistically different from zero, with a p-value of 0.013308. Keeping all the variables in the model, I used this regression to make predictions for 2014 ridership. I compared the model’s predictions with the actual ridership in a scatter plot, with the predicted values on the y-axis and the actual values on the x-axis. If the model is accurate, the points should fall along a straight line at a 45° angle, where predicted equals actual.


Above is the plot from the first regression model. Not a terrible fit, but this could be a lot better.
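For anyone who wants to try this kind of regression outside Excel, here is a rough pure-Python sketch of ordinary least squares that reports a t-statistic for each coefficient. It is not the workflow I used (that was Excel's regression tool), the function names are my own, and the city data below is made up purely for illustration. Excel's p-values come from these same t-statistics via the t distribution with n - k - 1 degrees of freedom (k predictors plus an intercept).

```python
def invert(M):
    """Gauss-Jordan inversion with partial pivoting (assumes M is non-singular)."""
    k = len(M)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(M)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def ols(X, y):
    """Least squares with an intercept.

    X is a list of rows of predictor values, y the list of outcomes.
    Returns (coefficients, standard_errors, t_statistics), intercept first.
    """
    n = len(y)
    A = [[1.0] + list(row) for row in X]  # design matrix with intercept column
    k = len(A[0])
    # Normal equations: (A^T A) beta = A^T y
    AtA = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    Aty = [sum(A[r][i] * y[r] for r in range(n)) for i in range(k)]
    inv = invert(AtA)
    beta = [sum(inv[i][j] * Aty[j] for j in range(k)) for i in range(k)]
    resid = [y[r] - sum(A[r][j] * beta[j] for j in range(k)) for r in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [(sigma2 * inv[i][i]) ** 0.5 for i in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t


# Made-up data for eight hypothetical cities (NOT the data behind this post):
# [service population in millions, unemployment rate in %]
X = [[0.6, 5.1], [1.2, 6.3], [1.8, 4.8], [2.4, 7.0],
     [0.9, 5.5], [3.1, 6.1], [2.0, 5.9], [1.5, 4.5]]
y = [4.2, 8.9, 13.5, 16.8, 6.0, 23.1, 14.2, 10.7]  # annual ridership, millions

beta, se, t = ols(X, y)
for name, b, tv in zip(["intercept", "service population", "unemployment"], beta, t):
    print(f"{name}: coefficient={b:.3f}, t={tv:.2f}")
```

As a rough rule of thumb, a coefficient whose t-statistic is well below 2 in absolute value will not clear the 0.05 threshold; the exact cutoff depends on the degrees of freedom.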

After some thought, I decided to throw out all the variables except for service population. I added DRM, RVM and service area to see what happened.

The Excel regression showed that three of the four variables were statistically significant. DRM was the only one with a p-value greater than 0.05. Keeping DRM in the model anyway, I again graphed the predicted ridership against the actual ridership. The results are shown in the plot below.


Here we see a much tighter fit of the data. The R² value is high at 0.93 and the slope is 0.945 (close to 1, which would correspond to the 45° line!). The question becomes whether it is a good idea to keep DRM in the model or reject it because of its p-value. I decided to keep it. When building a regression model, one cannot lose sight of what the variables mean and their nature. RVM, as I discussed in my last post, is arguably not a good independent variable because it can easily be adjusted in response to ridership. In other words, the direction of causality could be reversed. Since DRM is a measure of the physical tracks built, it is safer to assume that ridership responds to changes in DRM. Because of this, I have kept DRM in the model.
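The predicted-versus-actual check used for both plots can be put into numbers by fitting a line through the scatter (predicted on the y-axis, actual on the x-axis) and reading off its slope, intercept and R². Below is a minimal Python sketch of that check; the function name is my own and the ridership figures are invented for illustration, not taken from my data.

```python
def fit_check(actual, predicted):
    """Fit predicted = slope * actual + intercept; return (slope, intercept, r_squared)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sxx = sum((a - mean_a) ** 2 for a in actual)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    sxy = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_a
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation of the two series
    return slope, intercept, r_squared


# Invented annual ridership figures (millions) for six hypothetical systems.
actual = [3.1, 7.4, 10.2, 15.8, 21.0, 26.5]
predicted = [4.0, 6.9, 11.1, 14.2, 20.3, 25.1]

slope, intercept, r2 = fit_check(actual, predicted)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

A perfect model would give a slope of 1, an intercept of 0 and an R² of 1. A slope near 1 on its own is not enough, since a large intercept would mean the model is consistently over- or under-predicting.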

Going forward, I want to keep developing this model and test it against individual cities over time, to see what can be learned from the gap between what the model predicts and what actually happened with light rail ridership.