Evaluation#

This gallery focuses on forecast-evaluation plotting in GeoPrior.

The pages in this section are built as guided lessons for users who already have model predictions, forecast tables, interval forecasts, or ensemble outputs and now want to answer a practical question:

How good is this forecast, where does it fail, and which evaluation view should I trust for the next decision?

That is the central purpose of this gallery.

Unlike the forecasting gallery, which focuses on visualizing forecast paths, forecast tables, temporal trajectories, and spatial forecast maps, the pages in this section are organized around a different job:

turning forecasts into interpretable evidence about forecast quality.

These lessons teach users how to read evaluation views such as:

  • metric change across forecast horizon,

  • direct forecast comparisons,

  • radar summaries across segments or score families,

  • time-weighted summaries that emphasize chosen horizons,

  • prediction-stability views,

  • naïve-benchmark skill via Theil’s U,

  • interval coverage,

  • interval width and sharpness,

  • weighted interval score,

  • continuous ranked probability score,

  • quantile calibration,

  • compact QCE summaries,

  • and final multi-metric score radars.

In other words, this gallery is about judging forecast quality from multiple angles. It helps users decide whether a forecast is accurate, stable, calibrated, sharp enough to be useful, and strong enough to report, compare, or improve.

Module guide#

Module

Main output

Purpose

plot_metric_over_horizon_overview.py

Horizon-wise metric lesson

Learn how point or interval-oriented metrics change across forecast steps, how to compare grouped subsets, and how to spot long-horizon degradation before relying on one global score.

plot_forecast_comparison_overview.py

Forecast-comparison lesson

Compare forecast paths directly in temporal or spatial form so users can connect score summaries back to what the model is actually predicting.

plot_metric_radar_overview.py

Segment-radar lesson

Compare one chosen metric across segments, groups, or categories to see where performance concentrates or breaks down.

plot_time_weighted_metric_overview.py

Time-weighted metric lesson

Emphasize specific horizons through inverse-time or custom weights and learn when a weighted score tells a more useful story than an unweighted average.

plot_prediction_stability_overview.py

Stability lesson

Read prediction stability as a complementary quality signal and learn when forecast paths are too erratic even if average error looks acceptable.

plot_theils_u_score_overview.py

Relative-skill lesson

Benchmark forecasts against a naïve reference using Theil’s U and decide whether the model is truly beating a simple persistence strategy.

plot_coverage_overview.py

Coverage lesson

Inspect whether interval forecasts cover observations often enough and learn why coverage alone is useful but insufficient.

plot_mean_interval_width_overview.py

Width lesson

Read interval width as a sharpness view and judge whether the forecast is informative or only safe because intervals are broad.

plot_weighted_interval_score_overview.py

WIS lesson

Combine interval width and miss penalties into one score and learn how to read the trade-off between sharpness and reliability.

plot_crps_overview.py

CRPS lesson

Evaluate full predictive distributions or ensembles with a score that rewards both concentration and correctness.

plot_quantile_calibration_overview.py

Quantile-calibration lesson

Compare nominal and empirical quantile behavior to determine whether quantile forecasts are calibrated across the distribution.

plot_qce_donut_overview.py

QCE-summary lesson

Read compact quantile calibration error as an at-a-glance summary of where calibration gaps remain across quantile levels.

plot_radar_scores_overview.py

Multi-score radar lesson

Build compact score profiles across categories or metrics and use them as a final synthesis view for comparison-oriented reporting.

Suggested reading paths#

There is no single correct order, but four reading paths are especially useful.

Point-forecast quality path#

Choose this path when you first want to know whether the point forecast itself is useful.

Recommended order:

  1. plot_metric_over_horizon_overview.py

  2. plot_forecast_comparison_overview.py

  3. plot_time_weighted_metric_overview.py

  4. plot_theils_u_score_overview.py

This path helps answer questions such as:

  • Do later horizons become much harder?

  • Does the forecast path look reasonable when plotted directly?

  • Should near-term or long-term horizons matter more in my summary?

  • Is the model really beating a simple baseline?

Uncertainty and interval-quality path#

Choose this path when your forecast includes uncertainty bounds and you want to judge whether those bounds are believable and useful.

Recommended order:

  1. plot_coverage_overview.py

  2. plot_mean_interval_width_overview.py

  3. plot_weighted_interval_score_overview.py

  4. plot_crps_overview.py

This path helps answer questions such as:

  • Are the intervals covering observations often enough?

  • Are they too wide to be decision-useful?

  • Does a better score come from better calibration or just wider bands?

  • Does the full predictive distribution still look strong when judged as a distribution, not only as a median path?

Calibration-reading path#

Choose this path when your main concern is whether quantiles and probabilities are aligned with observed frequencies.

Recommended order:

  1. plot_quantile_calibration_overview.py

  2. plot_qce_donut_overview.py

This path helps answer questions such as:

  • Which quantiles are over- or under-covering?

  • Is the calibration error concentrated in the tails or spread across all quantiles?

  • Does the forecast need recalibration before reporting?

Comparison and synthesis path#

Choose this path when you need a compact comparison across segments, outputs, or score families.

Recommended order:

  1. plot_metric_radar_overview.py

  2. plot_prediction_stability_overview.py

  3. plot_radar_scores_overview.py

This path helps answer questions such as:

  • Which segment or subgroup is hardest?

  • Is one model smoother but less accurate?

  • Which score profile best matches the decision rule that matters for my application?

How to use these lessons with real forecast outputs#

Most evaluation pages begin with small synthetic or hand-built forecast examples. That is useful for documentation because it keeps the lesson stable and easy to read.

However, the real workflow value comes from replacing those demo arrays or demo tables with your own outputs.

A practical pattern is:

  1. export or collect the forecast outputs you want to evaluate,

  2. identify whether you have point forecasts, interval forecasts, quantile tables, or ensembles,

  3. open the matching lesson in this gallery,

  4. replace the demo arrays or DataFrame with your real outputs,

  5. check that column names and shapes follow the helper’s expected contract,

  6. and then read the plot using the same interpretation logic taught on the page.

This is one of the main goals of the evaluation gallery: the examples should remain useful after the user already has real forecast results.

How to read the plots in this section#

Most plots in this gallery are intentionally practical. They are not only decorative figures. They are decision plots.

That means they are meant to answer questions like:

  • Where does error rise across the forecast horizon?

  • Is the model better than a naïve baseline?

  • Are intervals calibrated or merely wide?

  • Does one output behave differently from the others?

  • Which quantiles are failing calibration?

  • Which metric profile best fits the reporting goal?

When reading these pages, users should usually ask:

What forecast decision does this plot support?

That question keeps evaluation practical and prevents the user from reducing forecast quality to a single number too early.

Why multi-view evaluation matters#

In a forecasting workflow like GeoPrior, weak decisions often come from looking at one score in isolation:

  • low MAE may hide poor long-horizon behavior,

  • acceptable coverage may hide intervals that are too wide,

  • good calibration at one quantile may hide tail failures,

  • a strong score average may hide one unstable output,

  • and a smooth forecast may still underperform a naïve baseline.

Multi-view evaluation helps catch those cases earlier and more clearly. That is why this gallery is intentionally positioned between:

  • forecast generation,

  • calibration and evaluation reading,

  • and model comparison or reporting.

Notes#

  • These pages are intentionally lesson-oriented and interpretation-first.

  • Most examples use compact synthetic or template-like forecast arrays so the documentation remains stable and quick to build.

  • The goal is not only to show what a plotting helper returns, but to teach how a user should read the result.

  • A good practical sequence is:

    • start with horizon-wise error and direct forecast comparison,

    • then inspect interval quality and probabilistic scores,

    • then inspect quantile calibration,

    • and finally use radar-style summaries for compact comparison.

Learn how to read interval reliability with plot_coverage

Learn how to read interval reliability with plot_coverage

Read ensemble forecast quality with plot_crps

Read ensemble forecast quality with plot_crps

Learn to compare forecasts visually with plot_forecast_comparison

Learn to compare forecasts visually with plot_forecast_comparison

Learn how to read forecast sharpness with plot_mean_interval_width

Learn how to read forecast sharpness with plot_mean_interval_width

Read forecast quality horizon by horizon with plot_metric_over_horizon

Read forecast quality horizon by horizon with plot_metric_over_horizon

Compare forecast quality across groups with plot_metric_radar

Compare forecast quality across groups with plot_metric_radar

Learn how forecast smoothness behaves with plot_prediction_stability

Learn how forecast smoothness behaves with plot_prediction_stability

Read quantile miscalibration with plot_qce_donut

Read quantile miscalibration with plot_qce_donut

Read quantile reliability with plot_quantile_calibration

Read quantile reliability with plot_quantile_calibration

Compare compact score profiles with plot_radar_scores

Compare compact score profiles with plot_radar_scores

Learn how to benchmark a forecast against a naive baseline with plot_theils_u_score

Learn how to benchmark a forecast against a naive baseline with plot_theils_u_score

Learn how horizon emphasis changes the score with plot_time_weighted_metric

Learn how horizon emphasis changes the score with plot_time_weighted_metric

Learn how to judge interval forecasts with plot_weighted_interval_score

Learn how to judge interval forecasts with plot_weighted_interval_score