When a model predicts not just the next step, but the next 10 steps into the future, what kind of evaluation is needed to check its performance over this longer period?
To check the performance of a model predicting 10 steps into the future, a multi-step-ahead forecasting evaluation is needed, specifically one that uses a rolling forecast origin. This approach simulates real-world use by repeatedly issuing forecasts and then moving the origin, the point in time from which forecasts are made, forward through the data. For example, the model predicts steps 1 to 10 ahead of Monday's data; then, with Tuesday as the new forecast origin, it predicts the next 10 steps again based on the updated information. Repeating this across a large portion of the historical data yields many sets of 10-step forecasts, allowing performance to be assessed under varying conditions.
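As a rough illustration, the sketch below implements a rolling-origin loop under some simplifying assumptions: the data live in a NumPy array, the model is stood in for by a hypothetical `naive_forecast` function (it just repeats the last observed value), and the helper names (`rolling_origin_forecasts`, `HORIZON`, `MIN_TRAIN`) are invented for this example rather than taken from any particular library.

```python
import numpy as np

HORIZON = 10      # forecast 10 steps ahead from each origin
MIN_TRAIN = 100   # earliest origin: first 100 points reserved for fitting

def naive_forecast(history, horizon):
    """Placeholder model: repeat the last observed value for every future step."""
    return np.repeat(history[-1], horizon)

def rolling_origin_forecasts(series, forecast_fn, horizon=HORIZON, min_train=MIN_TRAIN):
    """Collect (forecast, actual) pairs for every valid forecast origin."""
    forecasts, actuals = [], []
    last_origin = len(series) - horizon
    for origin in range(min_train, last_origin + 1):
        history = series[:origin]                       # data available at this origin
        forecasts.append(forecast_fn(history, horizon)) # 10-step forecast from here
        actuals.append(series[origin:origin + horizon]) # the 10 values that actually occurred
    return np.array(forecasts), np.array(actuals)       # shape: (n_origins, horizon)

# Synthetic random-walk data purely for demonstration; swap in a real series and model.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))
preds, obs = rolling_origin_forecasts(series, naive_forecast)
print(preds.shape, obs.shape)   # (391, 10) (391, 10): one row of 10-step forecasts per origin
```

Each row of `preds` and `obs` corresponds to one forecast origin, and each column to one horizon step, which is exactly the structure the per-horizon metrics below operate on.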
Evaluating these predictions involves assessing both point forecasts and probabilistic forecasts.
For point forecasts, which are single-value predictions for each future step, performance is typically assessed with accuracy metrics computed separately for each forecast horizon, i.e. for step 1, step 2, and so on up to step 10. Common metrics include Mean Absolute Error (MAE), the average magnitude of the difference between predicted and actual values, and Root Mean Squared Error (RMSE), the square root of the average squared error, which weights large errors more heavily. Examining these metrics at each individual step within the 10-step horizon shows how prediction error typically grows as the forecast horizon lengthens. It is crucial to look at the error trend across all 10 steps, not just an overall average, because errors often compound over longer horizons.
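Continuing the sketch above (and still assuming the `preds` and `obs` arrays of shape `(n_origins, 10)` from the hypothetical rolling-origin loop), per-horizon MAE and RMSE can be computed by averaging down each column:

```python
errors = preds - obs                                   # signed error per origin and step
mae_per_step = np.mean(np.abs(errors), axis=0)         # MAE for horizons 1..10
rmse_per_step = np.sqrt(np.mean(errors ** 2, axis=0))  # RMSE for horizons 1..10

for h, (mae, rmse) in enumerate(zip(mae_per_step, rmse_per_step), start=1):
    print(f"step {h:2d}: MAE={mae:.3f}  RMSE={rmse:.3f}")
# Plotting mae_per_step against the horizon makes the error growth with lead time visible.
```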
For probabilistic forecasts, which provide an estimate of uncertainty alongside the point prediction, such as prediction intervals or quantile forecasts, the evaluation focuses on two key aspects: coverage and sharpness. Coverage is the percentage of actual observed values that fall within the predicted intervals, indicating how reliable the uncertainty estimate is. Sharpness is the width of those intervals; narrower intervals are more informative, provided they maintain adequate coverage. Metrics such as the Continuous Ranked Probability Score (CRPS), which measures the distance between the forecast distribution and the observed outcome, or the Pinball Loss, used for evaluating quantile forecasts, quantify this performance. Together, evaluating across many forecast origins and at each individual horizon step gives a robust picture of the model's predictive capabilities over the full 10-step period.
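As a final sketch, the snippet below computes per-step coverage, sharpness, and pinball loss, assuming the model emits 10% and 90% quantile forecasts for each step. Since the placeholder model above produces only point forecasts, the `lower` and `upper` bounds here are simulated by offsetting `preds` by an arbitrary ±1.5; a real evaluation would use the model's own quantile outputs.

```python
lower = preds - 1.5   # stand-in for the model's 10% quantile forecasts (illustrative only)
upper = preds + 1.5   # stand-in for the model's 90% quantile forecasts (illustrative only)

coverage = np.mean((obs >= lower) & (obs <= upper), axis=0)  # per-step fraction of actuals inside the interval
sharpness = np.mean(upper - lower, axis=0)                   # per-step average interval width

def pinball_loss(y_true, y_quantile, tau):
    """Average pinball (quantile) loss for quantile level tau, per horizon step."""
    diff = y_true - y_quantile
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff), axis=0)

loss_q10 = pinball_loss(obs, lower, tau=0.10)
loss_q90 = pinball_loss(obs, upper, tau=0.90)
print("coverage per step:  ", np.round(coverage, 2))
print("mean width per step:", np.round(sharpness, 2))
print("pinball loss (q10): ", np.round(loss_q10, 3))
print("pinball loss (q90): ", np.round(loss_q90, 3))
```

For a nominal 80% interval (10% to 90%), coverage close to 0.8 at every step indicates well-calibrated uncertainty; coverage that decays at longer horizons, or intervals that widen dramatically, shows where the probabilistic forecasts degrade over the 10-step period.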