Reliability diagrams for probabilistic forecasts#

This lesson teaches how to use geoprior.utils.forecast_utils.plot_reliability_diagram() to inspect the calibration of probabilistic forecast intervals.

Why this page matters#

A coverage number such as 0.74 for an intended 80% interval is useful, but it is still only one number.

A reliability diagram answers a richer question:

Across a range of nominal interval probabilities, how often did the forecast intervals actually contain the truth?

That is the role of plot_reliability_diagram.

The real helper draws:

  • a diagonal baseline representing perfect calibration,

  • one curve per model,

  • the empirical coverage obtained from the model’s quantile forecast columns.

It can read either:

  • forecast DataFrames directly, or

  • precomputed reliability points via nominal_probs and observed_probs.

What the function expects#

The helper accepts models_data as a mapping from model names to either:

  • a pandas DataFrame, or

  • a nested dict containing at least forecasts and optionally styling such as color, marker, and style.

When forecast tables are provided, the helper uses the target prefix to look for quantile columns such as subsidence_q10 or subsidence_q90, compares them with y_true, and plots nominal-versus-observed coverage.

What this lesson teaches#

We will:

  1. create a synthetic spatial evaluation dataset,

  2. build several forecast variants with different interval behavior,

  3. compute compact reliability tables numerically,

  4. plot the real reliability diagram,

  5. demonstrate the helper’s alternative precomputed-point mode.

This page is synthetic so it remains fully executable during the documentation build.

Imports#

We use the real reliability helper, and one calibration helper to produce a calibrated comparison curve.

import numpy as np
import pandas as pd

from geoprior.utils.calibrate import calibrate_quantile_forecasts
from geoprior.utils.forecast_utils import plot_reliability_diagram

Step 1 - Build a compact spatial evaluation dataset#

We mimic a small city-like spatial grid observed over three forecast steps. Each location-step pair becomes one row in the evaluation table.

We generate one latent continuous truth field and then build several quantile-forecast variants around it.

rng = np.random.default_rng(52)

nx = 9
ny = 6
steps = [1, 2, 3]
years = {1: 2024, 2: 2025, 3: 2026}

xv = np.linspace(0.0, 12_000.0, nx)
yv = np.linspace(0.0, 8_000.0, ny)
X, Y = np.meshgrid(xv, yv)

x_flat = X.ravel()
y_flat = Y.ravel()
n_sites = x_flat.size

xn = (x_flat - x_flat.min()) / (x_flat.max() - x_flat.min())
yn = (y_flat - y_flat.min()) / (y_flat.max() - y_flat.min())

hotspot = np.exp(
    -(
        ((xn - 0.72) ** 2) / 0.020
        + ((yn - 0.34) ** 2) / 0.032
    )
)
ridge = 0.55 * np.exp(
    -(
        ((xn - 0.28) ** 2) / 0.028
        + ((yn - 0.74) ** 2) / 0.050
    )
)
gradient = 0.46 * xn + 0.20 * (1.0 - yn)

# Quantiles we want to expose in the forecast tables.
# The reliability helper works with quantile suffixes like q10/q90.
quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
             0.60, 0.70, 0.80, 0.90, 0.95]

# Approximate z-scores for a symmetric Gaussian-style quantile family.
z_map = {
    0.05: -1.6449,
    0.10: -1.2816,
    0.20: -0.8416,
    0.30: -0.5244,
    0.40: -0.2533,
    0.50: 0.0,
    0.60: 0.2533,
    0.70: 0.5244,
    0.80: 0.8416,
    0.90: 1.2816,
    0.95: 1.6449,
}

truth_rows: list[dict[str, float | int]] = []

for step in steps:
    scale = {1: 1.00, 2: 1.18, 3: 1.42}[step]

    mu = (
        2.1
        + 1.45 * gradient
        + 2.1 * hotspot
        + 0.95 * ridge
    ) * scale

    sigma_true = (
        0.28
        + 0.08 * xn
        + 0.04 * hotspot
        + 0.04 * step
    )

    actual = mu + rng.normal(0.0, sigma_true, size=n_sites)

    for i in range(n_sites):
        truth_rows.append(
            {
                "sample_idx": i,
                "forecast_step": step,
                "coord_t": years[step],
                "coord_x": float(x_flat[i]),
                "coord_y": float(y_flat[i]),
                "subsidence_actual": float(actual[i]),
                "mu_latent": float(mu[i]),
                "sigma_true": float(sigma_true[i]),
            }
        )

truth_df = pd.DataFrame(truth_rows)

print("Truth table shape:", truth_df.shape)
print("")
print(truth_df.head(8).to_string(index=False))
Truth table shape: (162, 8)

 sample_idx  forecast_step  coord_t    coord_x  coord_y  subsidence_actual  mu_latent  sigma_true
          0              1     2024     0.0000   0.0000             2.1257     2.3900      0.3200
          1              1     2024  1500.0000   0.0000             2.5534     2.4734      0.3300
          2              1     2024  3000.0000   0.0000             2.5351     2.5568      0.3400
          3              1     2024  4500.0000   0.0000             2.6073     2.6403      0.3500
          4              1     2024  6000.0000   0.0000             2.8450     2.7285      0.3601
          5              1     2024  7500.0000   0.0000             3.0898     2.8430      0.3707
          6              1     2024  9000.0000   0.0000             3.0166     2.9444      0.3810
          7              1     2024 10500.0000   0.0000             3.0603     2.9907      0.3903

Step 2 - Build several quantile forecast variants#

We create three different uncertainty behaviors:

  • Overconfident: intervals too narrow;

  • Balanced: intervals close to the latent truth spread;

  • Conservative: intervals too wide.

Then we calibrate the overconfident one so the reliability diagram can show how calibration changes the curve.

def make_quantile_forecast_df(
    truth_data: pd.DataFrame,
    *,
    width_scale: float,
    median_bias: float = 0.0,
) -> pd.DataFrame:
    out = truth_data[
        [
            "sample_idx",
            "forecast_step",
            "coord_t",
            "coord_x",
            "coord_y",
            "subsidence_actual",
            "mu_latent",
            "sigma_true",
        ]
    ].copy()

    mu = out["mu_latent"].to_numpy(float) + float(median_bias)
    sigma = out["sigma_true"].to_numpy(float) * float(width_scale)

    for q in quantiles:
        col = f"subsidence_q{int(q * 100)}"
        out[col] = mu + z_map[q] * sigma

    # keep the canonical median column users will expect
    # from downstream tooling
    return out.drop(columns=["mu_latent", "sigma_true"])


df_over = make_quantile_forecast_df(truth_df, width_scale=0.55)
df_bal = make_quantile_forecast_df(truth_df, width_scale=1.00)
df_cons = make_quantile_forecast_df(truth_df, width_scale=1.65)

print("")
print("Overconfident forecast head")
print(df_over.head(6).to_string(index=False))
Overconfident forecast head
 sample_idx  forecast_step  coord_t   coord_x  coord_y  subsidence_actual  subsidence_q5  subsidence_q10  subsidence_q20  subsidence_q30  subsidence_q40  subsidence_q50  subsidence_q60  subsidence_q70  subsidence_q80  subsidence_q90  subsidence_q95
          0              1     2024    0.0000   0.0000             2.1257         2.1005          2.1644          2.2419          2.2977          2.3454          2.3900          2.4346          2.4823          2.5381          2.6156          2.6795
          1              1     2024 1500.0000   0.0000             2.5534         2.1748          2.2408          2.3206          2.3782          2.4274          2.4734          2.5194          2.5686          2.6261          2.7060          2.7719
          2              1     2024 3000.0000   0.0000             2.5351         2.2492          2.3171          2.3994          2.4587          2.5094          2.5568          2.6041          2.6548          2.7141          2.7964          2.8644
          3              1     2024 4500.0000   0.0000             2.6073         2.3236          2.3936          2.4783          2.5393          2.5915          2.6403          2.6890          2.7412          2.8023          2.8870          2.9569
          4              1     2024 6000.0000   0.0000             2.8450         2.4028          2.4747          2.5619          2.6247          2.6784          2.7285          2.7787          2.8324          2.8952          2.9824          3.0543
          5              1     2024 7500.0000   0.0000             3.0898         2.5076          2.5817          2.6714          2.7360          2.7913          2.8430          2.8946          2.9499          3.0145          3.1043          3.1783

Step 3 - Calibrate the overconfident forecast#

We use the real interval-calibration workflow from the uncertainty utilities. This gives us a fourth comparison curve:

  • Calibrated overconfident forecast

The calibration helper fits horizon-wise factors from the evaluation table and rescales quantiles around the median. That makes it a natural companion for a reliability page.

df_over_cal, _, cal_stats = calibrate_quantile_forecasts(
    df_eval=df_over,
    df_future=None,
    target_name="subsidence",
    step_col="forecast_step",
    interval=(0.1, 0.9),
    target_coverage=0.8,
    median_q=0.5,
    use="auto",
    keep_original=True,
    enforce_monotonic="cummax",
    verbose=0,
)

print("")
print("Calibration stats")
print(cal_stats)
Calibration stats
{'target': 0.8, 'interval': [0.1, 0.9], 'f_max': 5.0, 'tol': 0.02, 'overall_key': '__overall__', 'factors_source': 'fit', 'factors': {'1': 1.8629894256591797, '2': 1.3924503326416016, '3': 1.911332130432129}, 'eval_before': {'coverage': 0.5864197530864198, 'sharpness': 0.5672223460842876, 'per_horizon': {'1': {'coverage': 0.5925925925925926, 'sharpness': 0.5108319460842875}, '2': {'coverage': 0.6296296296296297, 'sharpness': 0.5672223460842875}, '3': {'coverage': 0.5370370370370371, 'sharpness': 0.6236127460842877}}}, 'eval_after': {'coverage': 0.8148148148148148, 'sharpness': 0.9778115122895517, 'per_horizon': {'1': {'coverage': 0.8148148148148148, 'sharpness': 0.9516745138439275}, '2': {'coverage': 0.8148148148148148, 'sharpness': 0.7898289444868157}, '3': {'coverage': 0.8148148148148148, 'sharpness': 1.1919310785379116}}}}

Step 4 - Build a compact numerical reliability table#

Before plotting, it helps to compute empirical coverage manually.

We use a handful of symmetric intervals:

  • 90% -> q05 to q95

  • 80% -> q10 to q90

  • 60% -> q20 to q80

  • 40% -> q30 to q70

  • 20% -> q40 to q60

These are exactly the kinds of nominal-versus-observed comparisons that a reliability diagram visualizes.

interval_pairs = {
    0.90: (5, 95),
    0.80: (10, 90),
    0.60: (20, 80),
    0.40: (30, 70),
    0.20: (40, 60),
}


def reliability_table(
    df: pd.DataFrame,
    *,
    actual_col: str = "subsidence_actual",
    prefix: str = "subsidence",
) -> pd.DataFrame:
    rows = []
    y_true = df[actual_col].to_numpy(float)

    for nominal, (lo_q, hi_q) in interval_pairs.items():
        lo_col = f"{prefix}_q{lo_q}"
        hi_col = f"{prefix}_q{hi_q}"

        lo = df[lo_col].to_numpy(float)
        hi = df[hi_col].to_numpy(float)

        covered = (y_true >= lo) & (y_true <= hi)

        rows.append(
            {
                "nominal_prob": float(nominal),
                "observed_freq": float(np.mean(covered)),
                "mean_width": float(np.mean(hi - lo)),
                "calibration_gap": float(np.mean(covered) - nominal),
            }
        )

    return pd.DataFrame(rows).sort_values("nominal_prob")


rel_over = reliability_table(df_over)
rel_bal = reliability_table(df_bal)
rel_cons = reliability_table(df_cons)
rel_cal = reliability_table(df_over_cal)

print("")
print("Balanced model reliability table")
print(rel_bal.to_string(index=False))
Balanced model reliability table
 nominal_prob  observed_freq  mean_width  calibration_gap
       0.2000         0.2407      0.2038           0.0407
       0.4000         0.4568      0.4220           0.0568
       0.6000         0.6543      0.6772           0.0543
       0.8000         0.8210      1.0313           0.0210
       0.9000         0.8889      1.3237          -0.0111

Step 5 - Summarize calibration error numerically#

A compact scalar summary can help readers compare the curves before they inspect the figure.

We compute the mean absolute deviation from the perfect-calibration diagonal across the available nominal probabilities.

def mean_abs_reliability_gap(df_rel: pd.DataFrame) -> float:
    return float(
        np.mean(
            np.abs(
                df_rel["observed_freq"].to_numpy(float)
                - df_rel["nominal_prob"].to_numpy(float)
            )
        )
    )


df_summary = pd.DataFrame(
    [
        {
            "model": "Overconfident",
            "mean_abs_gap": mean_abs_reliability_gap(rel_over),
        },
        {
            "model": "Balanced",
            "mean_abs_gap": mean_abs_reliability_gap(rel_bal),
        },
        {
            "model": "Conservative",
            "mean_abs_gap": mean_abs_reliability_gap(rel_cons),
        },
        {
            "model": "Calibrated",
            "mean_abs_gap": mean_abs_reliability_gap(rel_cal),
        },
    ]
)

print("")
print("Mean absolute reliability gap")
print(df_summary.to_string(index=False))
Mean absolute reliability gap
        model  mean_abs_gap
Overconfident        0.1640
     Balanced        0.0368
 Conservative        0.1891
   Calibrated        0.0232

How to read these numbers#

A smaller mean absolute gap means the curve lies closer to the perfect-calibration line on average.

This is not the only calibration summary one could use, but it is a simple way to connect the later figure to a numerical comparison.

Step 6 - Plot the real reliability diagram#

We now call the actual GeoPrior helper.

The helper accepts a mapping from model names to either DataFrames or nested dicts. Here we use nested dicts so the lesson can also show the styling hooks documented by the function.

y_true = df_over["subsidence_actual"].reset_index(drop=True)

models_data = {
    "Overconfident": {
        "forecasts": df_over.reset_index(drop=True),
        "marker": "o",
        "style": "-",
    },
    "Balanced": {
        "forecasts": df_bal.reset_index(drop=True),
        "marker": "s",
        "style": "-",
    },
    "Conservative": {
        "forecasts": df_cons.reset_index(drop=True),
        "marker": "^",
        "style": "-",
    },
    "Calibrated": {
        "forecasts": df_over_cal.reset_index(drop=True),
        "marker": "D",
        "style": "-",
    },
}

plot_reliability_diagram(
    models_data=models_data,
    y_true=y_true,
    prefix="subsidence",
    figsize=(7.2, 7.2),
    title="Reliability diagram across forecast variants",
    plot_style="default",
    verbose=0,
)
Reliability diagram across forecast variants

How to read the figure#

The diagonal represents perfect calibration:

  • a nominal 80% interval should contain the truth 80% of the time,

  • a nominal 40% interval should contain the truth 40% of the time,

  • and so on.

Curves below the diagonal are typically overconfident: the forecast intervals are too narrow for the actual uncertainty.

Curves above the diagonal are conservative: the intervals are wide enough to cover more often than they claim.

In this synthetic example, the overconfident model should sit below the diagonal more often, the conservative model should tend above it, and the calibrated model should move closer to the diagonal.

Step 7 - Compare reliability by forecast horizon#

Reliability can change with lead time. So we also build a compact per-horizon table for the most common interval, q10-q90.

def coverage80_by_step(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for step, g in df.groupby("forecast_step", sort=True):
        y_true = g["subsidence_actual"].to_numpy(float)
        lo = g["subsidence_q10"].to_numpy(float)
        hi = g["subsidence_q90"].to_numpy(float)

        covered = (y_true >= lo) & (y_true <= hi)

        rows.append(
            {
                "forecast_step": int(step),
                "coverage80": float(np.mean(covered)),
                "mean_width80": float(np.mean(hi - lo)),
            }
        )

    return pd.DataFrame(rows)


print("")
print("Coverage80 by step: calibrated model")
print(coverage80_by_step(df_over_cal).to_string(index=False))
Coverage80 by step: calibrated model
 forecast_step  coverage80  mean_width80
             1      0.8148        0.9517
             2      0.8148        0.7898
             3      0.8148        1.1919

Why this matters#

A model can look well calibrated overall and still degrade at later horizons. Reliability diagrams give the broad probability-level view, while simple per-step tables help show where the calibration problem is strongest.

Step 8 - Demonstrate the precomputed-point mode#

The helper can also work from precomputed reliability points instead of forecast tables.

This is useful when:

  • the reliability points were computed elsewhere,

  • or when you want to compare a stored summary rather than reprocess the forecast table.

Here we pass the calibrated model that way.

precomputed_models = {
    "Calibrated (precomputed)": {
        "nominal_probs": rel_cal["nominal_prob"].tolist(),
        "observed_probs": rel_cal["observed_freq"].tolist(),
        "marker": "D",
        "style": "-",
    }
}

plot_reliability_diagram(
    models_data=precomputed_models,
    y_true=None,
    prefix="subsidence",
    figsize=(6.6, 6.6),
    title="Reliability diagram from precomputed points",
    plot_style="default",
    verbose=0,
)
Reliability diagram from precomputed points

Final takeaway#

Reliability diagrams answer a different question from Brier score or interval width alone.

They show whether probabilistic intervals are honest across a range of nominal probabilities.

That is why this page belongs in the uncertainty gallery:

  • interval calibration teaches how to adjust intervals,

  • coverage versus sharpness teaches the main trade-off,

  • reliability diagrams show whether the probabilistic statements themselves line up with reality.

Total running time of the script: (0 minutes 0.304 seconds)

Gallery generated by Sphinx-Gallery