.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/forecasting/plot_evaluate_forecast.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_forecasting_plot_evaluate_forecast.py: Evaluate forecast tables with ``evaluate_forecast`` ==================================================== This lesson teaches how to use :func:`geoprior.utils.forecast_utils.evaluate_forecast` to turn an evaluation forecast table into a set of readable diagnostics. Why this page matters --------------------- A forecast table is useful, but it is still only an intermediate object. After formatting predictions into ``df_eval``, the next question is: **How good is the forecast, and where does it become less reliable?** That is the role of ``evaluate_forecast``. The function consumes the ``df_eval`` output produced by :func:`format_and_forecast` (or any compatible evaluation DataFrame), then computes: - deterministic accuracy metrics, - interval coverage and sharpness in quantile mode, - optional per-horizon diagnostics, - optional user-defined extra metrics. When a time column such as ``coord_t`` is present, the function groups by time and returns one metric block per time value, plus an optional ``"__overall__"`` block for the complete evaluation set. It can also save JSON or a flattened CSV representation. What this lesson teaches ------------------------ We will: 1. create a synthetic spatial evaluation dataset, 2. convert raw model-like outputs into ``df_eval``, 3. call the real ``evaluate_forecast`` helper, 4. inspect the nested result structure, 5. add a custom metric, 6. build a small explanatory figure from the returned metrics. The figure is important here because it shows *why* this helper is useful: it makes horizon-wise and time-wise forecast quality visible. .. GENERATED FROM PYTHON SOURCE LINES 55-59 Imports ------- We use the real forecast formatting and evaluation utilities from GeoPrior. .. GENERATED FROM PYTHON SOURCE LINES 59-69 .. code-block:: Python import numpy as np import pandas as pd import matplotlib.pyplot as plt from geoprior.utils.forecast_utils import ( evaluate_forecast, format_and_forecast, ) .. GENERATED FROM PYTHON SOURCE LINES 70-84 Step 1 - Build a compact synthetic spatial forecast --------------------------------------------------- We mimic a small city-like grid. Each spatial location becomes one sample, and the model produces a forecast for three horizon steps. We create raw arrays in the same style used by the forecast formatter: - ``y_pred["subs_pred"]`` with shape ``(B, H, Q, O)`` - ``y_true["subsidence"]`` with shape ``(B, H, O)`` - ``coords`` with shape ``(B, H, 3)`` This keeps the lesson aligned with the actual workflow. .. GENERATED FROM PYTHON SOURCE LINES 84-166 .. code-block:: Python rng = np.random.default_rng(123) nx = 8 ny = 6 xv = np.linspace(0.0, 12_000.0, nx) yv = np.linspace(0.0, 8_000.0, ny) X, Y = np.meshgrid(xv, yv) x_flat = X.ravel() y_flat = Y.ravel() B = x_flat.size H = 3 Q = 3 O = 1 quantiles = [0.1, 0.5, 0.9] future_years = np.array([2023, 2024, 2025], dtype=int) train_end_year = 2022 xn = (x_flat - x_flat.min()) / (x_flat.max() - x_flat.min()) yn = (y_flat - y_flat.min()) / (y_flat.max() - y_flat.min()) hotspot = np.exp( -( ((xn - 0.72) ** 2) / 0.022 + ((yn - 0.36) ** 2) / 0.032 ) ) ridge = 0.50 * np.exp( -( ((xn - 0.30) ** 2) / 0.030 + ((yn - 0.74) ** 2) / 0.050 ) ) gradient = 0.45 * xn + 0.24 * (1.0 - yn) coords = np.zeros((B, H, 3), dtype=float) for i in range(B): for h in range(H): coords[i, h, 0] = h + 1 coords[i, h, 1] = x_flat[i] coords[i, h, 2] = y_flat[i] y_true_subs = np.zeros((B, H, O), dtype=float) y_pred_subs = np.zeros((B, H, Q, O), dtype=float) for h in range(H): lead_scale = [1.00, 1.18, 1.42][h] median = ( 2.0 + 1.3 * gradient + 2.2 * hotspot + 0.9 * ridge ) * lead_scale width = ( 0.30 + 0.08 * xn + 0.05 * hotspot + 0.06 * (h + 1) ) q10 = median - width q50 = median q90 = median + width actual = median + rng.normal(0.0, 0.16, size=B) y_true_subs[:, h, 0] = actual y_pred_subs[:, h, 0, 0] = q10 y_pred_subs[:, h, 1, 0] = q50 y_pred_subs[:, h, 2, 0] = q90 y_pred = {"subs_pred": y_pred_subs} y_true = {"subsidence": y_true_subs} print("Prediction shape:", y_pred["subs_pred"].shape) print("Truth shape:", y_true["subsidence"].shape) print("Coordinate shape:", coords.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none Prediction shape: (48, 3, 3, 1) Truth shape: (48, 3, 1) Coordinate shape: (48, 3, 3) .. GENERATED FROM PYTHON SOURCE LINES 167-177 Step 2 - Format the raw outputs into ``df_eval`` ------------------------------------------------ ``evaluate_forecast`` works on an evaluation DataFrame, not directly on raw tensors. So we first create that table using the real formatting helper. We use ``eval_export="all"`` so the evaluation table contains all horizons rather than only the last one. This makes the later per-horizon diagnostics much more informative. .. GENERATED FROM PYTHON SOURCE LINES 177-198 .. code-block:: Python df_eval, df_future = format_and_forecast( y_pred=y_pred, y_true=y_true, coords=coords, quantiles=quantiles, target_name="subsidence", train_end_time=train_end_year, future_time_grid=future_years, eval_export="all", value_mode="rate", city_name="SyntheticCity", model_name="GeoPrior-demo", dataset_name="synthetic-eval-demo", verbose=0, ) print("df_eval shape:", df_eval.shape) print("") print(df_eval.head(8).to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none df_eval shape: (144, 9) sample_idx forecast_step coord_t subsidence_actual subsidence_q10 subsidence_q50 subsidence_q90 coord_x coord_y 0 1 2020.0000 2.1537 1.9520 2.3120 2.6720 0.0000 0.0000 0 2 2021.0000 3.0945 2.3082 2.7282 3.1482 0.0000 0.0000 0 3 2022.0000 3.2345 2.8030 3.2830 3.7630 0.0000 0.0000 1 1 2020.0000 2.3367 2.0241 2.3956 2.7670 1714.2857 0.0000 1 2 2021.0000 2.7119 2.3953 2.8268 3.2582 1714.2857 0.0000 1 3 2022.0000 3.2139 2.9103 3.4017 3.8931 1714.2857 0.0000 2 1 2020.0000 2.6852 2.0963 2.4792 2.8620 3428.5714 0.0000 2 2 2021.0000 2.9306 2.4825 2.9254 3.3683 3428.5714 0.0000 .. GENERATED FROM PYTHON SOURCE LINES 199-216 Step 3 - Run the real forecast evaluator ---------------------------------------- We now call the actual helper. Important settings: - ``per_horizon=True``: compute MAE, MSE, RMSE, and R² by forecast step; - ``quantile_interval=(0.1, 0.9)``: evaluate interval coverage and sharpness using the q10-q90 band. In quantile mode, the helper uses the median forecast as the deterministic prediction and adds interval metrics when lower and upper quantiles are available. It also groups by ``coord_t`` when that column is present. .. GENERATED FROM PYTHON SOURCE LINES 216-230 .. code-block:: Python metrics = evaluate_forecast( df_eval, target_name="subsidence", per_horizon=True, quantile_interval=(0.1, 0.9), verbose=0, ) print("Returned object type:", type(metrics).__name__) print("") print("Top-level keys:") print(list(metrics.keys())) .. rst-class:: sphx-glr-script-out .. code-block:: none Returned object type: dict Top-level keys: ['2020.0', '2021.0', '2022.0', '__overall__'] .. GENERATED FROM PYTHON SOURCE LINES 231-243 Step 4 - Understand the result structure ---------------------------------------- Because our ``df_eval`` contains multiple time values in ``coord_t``, the helper returns a nested dictionary. Typical structure: - one key per time value, - plus ``"__overall__"`` for the whole evaluation set. Each block contains overall metrics, and the overall block also contains the merged per-horizon summaries. .. GENERATED FROM PYTHON SOURCE LINES 243-251 .. code-block:: Python overall = metrics["__overall__"] print("") print("Overall metrics block") for key, value in overall.items(): print(f"{key}: {value}") .. rst-class:: sphx-glr-script-out .. code-block:: none Overall metrics block overall_mae: 0.11841463887587696 overall_mse: 0.02232009503193533 overall_rmse: 0.14939911322339008 overall_r2: 0.9577081394557485 coverage80: 1.0 sharpness80: 0.9260785235305219 per_horizon_mae: {1: 0.12583531038681153, 2: 0.10452059872115223, 3: 0.12488800751966711} per_horizon_mse: {1: 0.02347620407777981, 2: 0.018069552126142144, 3: 0.02541452889188403} per_horizon_rmse: {1: 0.1532194637693913, 2: 0.1344230342096999, 3: 0.15941934917657904} per_horizon_r2: {1: 0.9026842779970629, 2: 0.9358584695856399, 3: 0.9426421737377303} .. GENERATED FROM PYTHON SOURCE LINES 252-255 We can also inspect one time-specific block. This is useful when we want to know whether performance differs from one evaluation year to another. .. GENERATED FROM PYTHON SOURCE LINES 255-264 .. code-block:: Python year_keys = [k for k in metrics.keys() if k != "__overall__"] first_year = sorted(year_keys, key=lambda x: float(x))[0] print("") print(f"Metrics for year {first_year}") for key, value in metrics[first_year].items(): print(f"{key}: {value}") .. rst-class:: sphx-glr-script-out .. code-block:: none Metrics for year 2020.0 overall_mae: 0.12583531038681153 overall_mse: 0.02347620407777981 overall_rmse: 0.1532194637693913 overall_r2: 0.9026842779970629 coverage80: 1.0 sharpness80: 0.8060785235305218 per_horizon_mae: {1: 0.12583531038681153} per_horizon_mse: {1: 0.02347620407777981} per_horizon_rmse: {1: 0.1532194637693913} per_horizon_r2: {1: 0.9026842779970629} .. GENERATED FROM PYTHON SOURCE LINES 265-287 What these keys mean -------------------- In quantile mode, the helper returns: - ``overall_mae`` - ``overall_mse`` - ``overall_rmse`` - ``overall_r2`` - ``coverage80`` - ``sharpness80`` and, with ``per_horizon=True``: - ``per_horizon_mae`` - ``per_horizon_mse`` - ``per_horizon_rmse`` - ``per_horizon_r2`` The ``__overall__`` block is especially useful because it merges all time groups into one complete summary, which is often the easiest way to compare horizon behavior across the full evaluation split. .. GENERATED FROM PYTHON SOURCE LINES 289-293 Step 5 - Turn the nested dictionary into small tables ----------------------------------------------------- For interpretation, it helps to convert the nested result into compact DataFrames. .. GENERATED FROM PYTHON SOURCE LINES 293-339 .. code-block:: Python def _as_int_label(x): try: return int(float(x)) except Exception: return x year_rows = [] for key in sorted(year_keys, key=lambda x: float(x)): m = metrics[key] year_rows.append( { "year": _as_int_label(key), "overall_mae": m["overall_mae"], "overall_rmse": m["overall_rmse"], "overall_r2": m["overall_r2"], "coverage80": m.get("coverage80", np.nan), "sharpness80": m.get("sharpness80", np.nan), } ) df_year_metrics = pd.DataFrame(year_rows) horizon_rows = [] for h, mae in overall["per_horizon_mae"].items(): horizon_rows.append( { "forecast_step": int(float(h)), "mae": float(mae), "rmse": float(overall["per_horizon_rmse"][h]), "r2": float(overall["per_horizon_r2"][h]), } ) df_horizon_metrics = pd.DataFrame(horizon_rows).sort_values( "forecast_step" ) print("") print("Per-year summary") print(df_year_metrics.to_string(index=False)) print("") print("Per-horizon summary") print(df_horizon_metrics.to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none Per-year summary year overall_mae overall_rmse overall_r2 coverage80 sharpness80 2020 0.1258 0.1532 0.9027 1.0000 0.8061 2021 0.1045 0.1344 0.9359 1.0000 0.9261 2022 0.1249 0.1594 0.9426 1.0000 1.0461 Per-horizon summary forecast_step mae rmse r2 1 0.1258 0.1532 0.9027 2 0.1045 0.1344 0.9359 3 0.1249 0.1594 0.9426 .. GENERATED FROM PYTHON SOURCE LINES 340-349 Step 6 - Add a custom extra metric ---------------------------------- The helper can also compute user-defined metrics. If we pass a mapping ``{name: func}``, the evaluator will call the function with ``(y_true, y_pred, **kwargs)`` when possible, and it also supports simpler prediction-only signatures. This makes the helper easy to extend without modifying the core registry. .. GENERATED FROM PYTHON SOURCE LINES 349-365 .. code-block:: Python def mean_signed_error(y_true, y_pred): return float(np.mean(y_pred - y_true)) metrics_extra = evaluate_forecast( df_eval, target_name="subsidence", per_horizon=True, extra_metrics={"mse_signed": mean_signed_error}, verbose=0, ) print("") print("Custom metric from '__overall__'") print(metrics_extra["__overall__"]["mse_signed"]) .. rst-class:: sphx-glr-script-out .. code-block:: none Custom metric from '__overall__' -0.017260129542610102 .. GENERATED FROM PYTHON SOURCE LINES 366-382 Step 7 - Build an explanatory figure from the metrics ----------------------------------------------------- This is the most important visual step in the lesson. ``evaluate_forecast`` itself returns metrics rather than a plot, but the returned structure is exactly what we need to build compact explanation figures. Here we show two useful views: - left: accuracy degradation across forecast steps, - right: coverage-versus-sharpness evolution by year. This is why the helper is valuable: it turns a forecast table into diagnostics that can be summarized visually and compared across horizons and times. .. GENERATED FROM PYTHON SOURCE LINES 382-435 .. code-block:: Python fig, axes = plt.subplots(1, 2, figsize=(11, 4.2)) # Left panel: horizon-wise deterministic error axes[0].plot( df_horizon_metrics["forecast_step"], df_horizon_metrics["mae"], marker="o", label="MAE", ) axes[0].plot( df_horizon_metrics["forecast_step"], df_horizon_metrics["rmse"], marker="s", label="RMSE", ) axes[0].set_xlabel("Forecast step") axes[0].set_ylabel("Error") axes[0].set_title("Error grows across the horizon") axes[0].grid(True, linestyle=":", alpha=0.6) axes[0].legend() # Right panel: year-wise coverage and sharpness ax2 = axes[1] ax2.plot( df_year_metrics["year"], df_year_metrics["coverage80"], marker="o", label="Coverage80", ) ax2.set_xlabel("Evaluation year") ax2.set_ylabel("Coverage80") ax2.set_title("Interval quality by year") ax2.grid(True, linestyle=":", alpha=0.6) ax2b = ax2.twinx() ax2b.plot( df_year_metrics["year"], df_year_metrics["sharpness80"], marker="s", linestyle="--", label="Sharpness80", ) ax2b.set_ylabel("Sharpness80") # merge legends from both y-axes lines_a, labels_a = ax2.get_legend_handles_labels() lines_b, labels_b = ax2b.get_legend_handles_labels() ax2.legend(lines_a + lines_b, labels_a + labels_b, loc="best") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/forecasting/images/sphx_glr_plot_evaluate_forecast_001.png :alt: Error grows across the horizon, Interval quality by year :srcset: /auto_examples/forecasting/images/sphx_glr_plot_evaluate_forecast_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 436-466 How to read the figure ---------------------- Left panel ~~~~~~~~~~ This panel answers: - does error increase with forecast step? In most forecasting systems, later horizons are harder. A gentle rise in MAE and RMSE is usually expected. A sudden jump at one step can indicate instability or a mismatch between how the model handles short and long horizons. Right panel ~~~~~~~~~~~ This panel answers: - are the predictive intervals behaving consistently over time? ``coverage80`` tells us how often the truth falls inside the q10-q90 band. ``sharpness80`` tells us how wide that band is. A useful interpretation is: - high coverage with enormous width: safe but not very informative, - low width with poor coverage: sharp but overconfident, - reasonable coverage with moderate width: a better balanced forecast. .. GENERATED FROM PYTHON SOURCE LINES 468-479 Step 8 - Optional export behavior --------------------------------- ``evaluate_forecast`` can also save its outputs. - JSON preserves the nested dictionary structure. - CSV flattens the result into rows with columns like ``coord_t``, ``metric``, ``horizon``, and ``value``. The helper supports both because dictionary form is convenient in code, while CSV form is convenient for reports, spreadsheets, or later plotting. .. GENERATED FROM PYTHON SOURCE LINES 479-500 .. code-block:: Python # Example (kept commented to avoid writing files during the lesson): # # metrics_json = evaluate_forecast( # df_eval, # target_name="subsidence", # per_horizon=True, # savefile="generated/eval_metrics.json", # save_format="json", # verbose=0, # ) # # metrics_csv = evaluate_forecast( # df_eval, # target_name="subsidence", # per_horizon=True, # savefile="generated/eval_metrics.csv", # save_format="csv", # verbose=0, # ) .. GENERATED FROM PYTHON SOURCE LINES 501-514 Final takeaway -------------- ``evaluate_forecast`` is the core helper that turns an evaluation forecast table into readable diagnostics. Once you have ``df_eval``, this function tells you: - how accurate the median forecast is, - how interval quality behaves, - how performance changes across the horizon, - and how those diagnostics vary by evaluation time. That makes it one of the most important utility pages in the forecasting section. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 0.312 seconds) .. _sphx_glr_download_auto_examples_forecasting_plot_evaluate_forecast.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_evaluate_forecast.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_evaluate_forecast.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_evaluate_forecast.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_