.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/uncertainty/plot_reliability_diagram.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_uncertainty_plot_reliability_diagram.py: Reliability diagrams for probabilistic forecasts ================================================ This lesson teaches how to use :func:`geoprior.utils.forecast_utils.plot_reliability_diagram` to inspect the calibration of probabilistic forecast intervals. Why this page matters --------------------- A coverage number such as 0.74 for an intended 80% interval is useful, but it is still only one number. A reliability diagram answers a richer question: **Across a range of nominal interval probabilities, how often did the forecast intervals actually contain the truth?** That is the role of ``plot_reliability_diagram``. The real helper draws: - a diagonal baseline representing perfect calibration, - one curve per model, - the empirical coverage obtained from the model's quantile forecast columns. It can read either: - forecast DataFrames directly, or - precomputed reliability points via ``nominal_probs`` and ``observed_probs``. What the function expects ------------------------- The helper accepts ``models_data`` as a mapping from model names to either: - a pandas DataFrame, or - a nested dict containing at least ``forecasts`` and optionally styling such as ``color``, ``marker``, and ``style``. When forecast tables are provided, the helper uses the target ``prefix`` to look for quantile columns such as ``subsidence_q10`` or ``subsidence_q90``, compares them with ``y_true``, and plots nominal-versus-observed coverage. What this lesson teaches ------------------------ We will: 1. create a synthetic spatial evaluation dataset, 2. build several forecast variants with different interval behavior, 3. compute compact reliability tables numerically, 4. plot the real reliability diagram, 5. demonstrate the helper's alternative precomputed-point mode. This page is synthetic so it remains fully executable during the documentation build. .. GENERATED FROM PYTHON SOURCE LINES 63-67 Imports ------- We use the real reliability helper, and one calibration helper to produce a calibrated comparison curve. .. GENERATED FROM PYTHON SOURCE LINES 67-74 .. code-block:: Python import numpy as np import pandas as pd from geoprior.utils.calibrate import calibrate_quantile_forecasts from geoprior.utils.forecast_utils import plot_reliability_diagram .. GENERATED FROM PYTHON SOURCE LINES 75-83 Step 1 - Build a compact spatial evaluation dataset --------------------------------------------------- We mimic a small city-like spatial grid observed over three forecast steps. Each location-step pair becomes one row in the evaluation table. We generate one latent continuous truth field and then build several quantile-forecast variants around it. .. GENERATED FROM PYTHON SOURCE LINES 83-177 .. code-block:: Python rng = np.random.default_rng(52) nx = 9 ny = 6 steps = [1, 2, 3] years = {1: 2024, 2: 2025, 3: 2026} xv = np.linspace(0.0, 12_000.0, nx) yv = np.linspace(0.0, 8_000.0, ny) X, Y = np.meshgrid(xv, yv) x_flat = X.ravel() y_flat = Y.ravel() n_sites = x_flat.size xn = (x_flat - x_flat.min()) / (x_flat.max() - x_flat.min()) yn = (y_flat - y_flat.min()) / (y_flat.max() - y_flat.min()) hotspot = np.exp( -( ((xn - 0.72) ** 2) / 0.020 + ((yn - 0.34) ** 2) / 0.032 ) ) ridge = 0.55 * np.exp( -( ((xn - 0.28) ** 2) / 0.028 + ((yn - 0.74) ** 2) / 0.050 ) ) gradient = 0.46 * xn + 0.20 * (1.0 - yn) # Quantiles we want to expose in the forecast tables. # The reliability helper works with quantile suffixes like q10/q90. quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95] # Approximate z-scores for a symmetric Gaussian-style quantile family. z_map = { 0.05: -1.6449, 0.10: -1.2816, 0.20: -0.8416, 0.30: -0.5244, 0.40: -0.2533, 0.50: 0.0, 0.60: 0.2533, 0.70: 0.5244, 0.80: 0.8416, 0.90: 1.2816, 0.95: 1.6449, } truth_rows: list[dict[str, float | int]] = [] for step in steps: scale = {1: 1.00, 2: 1.18, 3: 1.42}[step] mu = ( 2.1 + 1.45 * gradient + 2.1 * hotspot + 0.95 * ridge ) * scale sigma_true = ( 0.28 + 0.08 * xn + 0.04 * hotspot + 0.04 * step ) actual = mu + rng.normal(0.0, sigma_true, size=n_sites) for i in range(n_sites): truth_rows.append( { "sample_idx": i, "forecast_step": step, "coord_t": years[step], "coord_x": float(x_flat[i]), "coord_y": float(y_flat[i]), "subsidence_actual": float(actual[i]), "mu_latent": float(mu[i]), "sigma_true": float(sigma_true[i]), } ) truth_df = pd.DataFrame(truth_rows) print("Truth table shape:", truth_df.shape) print("") print(truth_df.head(8).to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none Truth table shape: (162, 8) sample_idx forecast_step coord_t coord_x coord_y subsidence_actual mu_latent sigma_true 0 1 2024 0.0000 0.0000 2.1257 2.3900 0.3200 1 1 2024 1500.0000 0.0000 2.5534 2.4734 0.3300 2 1 2024 3000.0000 0.0000 2.5351 2.5568 0.3400 3 1 2024 4500.0000 0.0000 2.6073 2.6403 0.3500 4 1 2024 6000.0000 0.0000 2.8450 2.7285 0.3601 5 1 2024 7500.0000 0.0000 3.0898 2.8430 0.3707 6 1 2024 9000.0000 0.0000 3.0166 2.9444 0.3810 7 1 2024 10500.0000 0.0000 3.0603 2.9907 0.3903 .. GENERATED FROM PYTHON SOURCE LINES 178-191 Step 2 - Build several quantile forecast variants ------------------------------------------------- We create three different uncertainty behaviors: - Overconfident: intervals too narrow; - Balanced: intervals close to the latent truth spread; - Conservative: intervals too wide. Then we calibrate the overconfident one so the reliability diagram can show how calibration changes the curve. .. GENERATED FROM PYTHON SOURCE LINES 191-231 .. code-block:: Python def make_quantile_forecast_df( truth_data: pd.DataFrame, *, width_scale: float, median_bias: float = 0.0, ) -> pd.DataFrame: out = truth_data[ [ "sample_idx", "forecast_step", "coord_t", "coord_x", "coord_y", "subsidence_actual", "mu_latent", "sigma_true", ] ].copy() mu = out["mu_latent"].to_numpy(float) + float(median_bias) sigma = out["sigma_true"].to_numpy(float) * float(width_scale) for q in quantiles: col = f"subsidence_q{int(q * 100)}" out[col] = mu + z_map[q] * sigma # keep the canonical median column users will expect # from downstream tooling return out.drop(columns=["mu_latent", "sigma_true"]) df_over = make_quantile_forecast_df(truth_df, width_scale=0.55) df_bal = make_quantile_forecast_df(truth_df, width_scale=1.00) df_cons = make_quantile_forecast_df(truth_df, width_scale=1.65) print("") print("Overconfident forecast head") print(df_over.head(6).to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none Overconfident forecast head sample_idx forecast_step coord_t coord_x coord_y subsidence_actual subsidence_q5 subsidence_q10 subsidence_q20 subsidence_q30 subsidence_q40 subsidence_q50 subsidence_q60 subsidence_q70 subsidence_q80 subsidence_q90 subsidence_q95 0 1 2024 0.0000 0.0000 2.1257 2.1005 2.1644 2.2419 2.2977 2.3454 2.3900 2.4346 2.4823 2.5381 2.6156 2.6795 1 1 2024 1500.0000 0.0000 2.5534 2.1748 2.2408 2.3206 2.3782 2.4274 2.4734 2.5194 2.5686 2.6261 2.7060 2.7719 2 1 2024 3000.0000 0.0000 2.5351 2.2492 2.3171 2.3994 2.4587 2.5094 2.5568 2.6041 2.6548 2.7141 2.7964 2.8644 3 1 2024 4500.0000 0.0000 2.6073 2.3236 2.3936 2.4783 2.5393 2.5915 2.6403 2.6890 2.7412 2.8023 2.8870 2.9569 4 1 2024 6000.0000 0.0000 2.8450 2.4028 2.4747 2.5619 2.6247 2.6784 2.7285 2.7787 2.8324 2.8952 2.9824 3.0543 5 1 2024 7500.0000 0.0000 3.0898 2.5076 2.5817 2.6714 2.7360 2.7913 2.8430 2.8946 2.9499 3.0145 3.1043 3.1783 .. GENERATED FROM PYTHON SOURCE LINES 232-242 Step 3 - Calibrate the overconfident forecast --------------------------------------------- We use the real interval-calibration workflow from the uncertainty utilities. This gives us a fourth comparison curve: - Calibrated overconfident forecast The calibration helper fits horizon-wise factors from the evaluation table and rescales quantiles around the median. That makes it a natural companion for a reliability page. .. GENERATED FROM PYTHON SOURCE LINES 242-261 .. code-block:: Python df_over_cal, _, cal_stats = calibrate_quantile_forecasts( df_eval=df_over, df_future=None, target_name="subsidence", step_col="forecast_step", interval=(0.1, 0.9), target_coverage=0.8, median_q=0.5, use="auto", keep_original=True, enforce_monotonic="cummax", verbose=0, ) print("") print("Calibration stats") print(cal_stats) .. rst-class:: sphx-glr-script-out .. code-block:: none Calibration stats {'target': 0.8, 'interval': [0.1, 0.9], 'f_max': 5.0, 'tol': 0.02, 'overall_key': '__overall__', 'factors_source': 'fit', 'factors': {'1': 1.8629894256591797, '2': 1.3924503326416016, '3': 1.911332130432129}, 'eval_before': {'coverage': 0.5864197530864198, 'sharpness': 0.5672223460842876, 'per_horizon': {'1': {'coverage': 0.5925925925925926, 'sharpness': 0.5108319460842875}, '2': {'coverage': 0.6296296296296297, 'sharpness': 0.5672223460842875}, '3': {'coverage': 0.5370370370370371, 'sharpness': 0.6236127460842877}}}, 'eval_after': {'coverage': 0.8148148148148148, 'sharpness': 0.9778115122895517, 'per_horizon': {'1': {'coverage': 0.8148148148148148, 'sharpness': 0.9516745138439275}, '2': {'coverage': 0.8148148148148148, 'sharpness': 0.7898289444868157}, '3': {'coverage': 0.8148148148148148, 'sharpness': 1.1919310785379116}}}} .. GENERATED FROM PYTHON SOURCE LINES 262-276 Step 4 - Build a compact numerical reliability table ---------------------------------------------------- Before plotting, it helps to compute empirical coverage manually. We use a handful of symmetric intervals: - 90% -> q05 to q95 - 80% -> q10 to q90 - 60% -> q20 to q80 - 40% -> q30 to q70 - 20% -> q40 to q60 These are exactly the kinds of nominal-versus-observed comparisons that a reliability diagram visualizes. .. GENERATED FROM PYTHON SOURCE LINES 276-325 .. code-block:: Python interval_pairs = { 0.90: (5, 95), 0.80: (10, 90), 0.60: (20, 80), 0.40: (30, 70), 0.20: (40, 60), } def reliability_table( df: pd.DataFrame, *, actual_col: str = "subsidence_actual", prefix: str = "subsidence", ) -> pd.DataFrame: rows = [] y_true = df[actual_col].to_numpy(float) for nominal, (lo_q, hi_q) in interval_pairs.items(): lo_col = f"{prefix}_q{lo_q}" hi_col = f"{prefix}_q{hi_q}" lo = df[lo_col].to_numpy(float) hi = df[hi_col].to_numpy(float) covered = (y_true >= lo) & (y_true <= hi) rows.append( { "nominal_prob": float(nominal), "observed_freq": float(np.mean(covered)), "mean_width": float(np.mean(hi - lo)), "calibration_gap": float(np.mean(covered) - nominal), } ) return pd.DataFrame(rows).sort_values("nominal_prob") rel_over = reliability_table(df_over) rel_bal = reliability_table(df_bal) rel_cons = reliability_table(df_cons) rel_cal = reliability_table(df_over_cal) print("") print("Balanced model reliability table") print(rel_bal.to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none Balanced model reliability table nominal_prob observed_freq mean_width calibration_gap 0.2000 0.2407 0.2038 0.0407 0.4000 0.4568 0.4220 0.0568 0.6000 0.6543 0.6772 0.0543 0.8000 0.8210 1.0313 0.0210 0.9000 0.8889 1.3237 -0.0111 .. GENERATED FROM PYTHON SOURCE LINES 326-333 Step 5 - Summarize calibration error numerically ------------------------------------------------ A compact scalar summary can help readers compare the curves before they inspect the figure. We compute the mean absolute deviation from the perfect-calibration diagonal across the available nominal probabilities. .. GENERATED FROM PYTHON SOURCE LINES 333-370 .. code-block:: Python def mean_abs_reliability_gap(df_rel: pd.DataFrame) -> float: return float( np.mean( np.abs( df_rel["observed_freq"].to_numpy(float) - df_rel["nominal_prob"].to_numpy(float) ) ) ) df_summary = pd.DataFrame( [ { "model": "Overconfident", "mean_abs_gap": mean_abs_reliability_gap(rel_over), }, { "model": "Balanced", "mean_abs_gap": mean_abs_reliability_gap(rel_bal), }, { "model": "Conservative", "mean_abs_gap": mean_abs_reliability_gap(rel_cons), }, { "model": "Calibrated", "mean_abs_gap": mean_abs_reliability_gap(rel_cal), }, ] ) print("") print("Mean absolute reliability gap") print(df_summary.to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none Mean absolute reliability gap model mean_abs_gap Overconfident 0.1640 Balanced 0.0368 Conservative 0.1891 Calibrated 0.0232 .. GENERATED FROM PYTHON SOURCE LINES 371-378 How to read these numbers ------------------------- A smaller mean absolute gap means the curve lies closer to the perfect-calibration line on average. This is not the only calibration summary one could use, but it is a simple way to connect the later figure to a numerical comparison. .. GENERATED FROM PYTHON SOURCE LINES 380-387 Step 6 - Plot the real reliability diagram ------------------------------------------ We now call the actual GeoPrior helper. The helper accepts a mapping from model names to either DataFrames or nested dicts. Here we use nested dicts so the lesson can also show the styling hooks documented by the function. .. GENERATED FROM PYTHON SOURCE LINES 387-423 .. code-block:: Python y_true = df_over["subsidence_actual"].reset_index(drop=True) models_data = { "Overconfident": { "forecasts": df_over.reset_index(drop=True), "marker": "o", "style": "-", }, "Balanced": { "forecasts": df_bal.reset_index(drop=True), "marker": "s", "style": "-", }, "Conservative": { "forecasts": df_cons.reset_index(drop=True), "marker": "^", "style": "-", }, "Calibrated": { "forecasts": df_over_cal.reset_index(drop=True), "marker": "D", "style": "-", }, } plot_reliability_diagram( models_data=models_data, y_true=y_true, prefix="subsidence", figsize=(7.2, 7.2), title="Reliability diagram across forecast variants", plot_style="default", verbose=0, ) .. image-sg:: /auto_examples/uncertainty/images/sphx_glr_plot_reliability_diagram_001.png :alt: Reliability diagram across forecast variants :srcset: /auto_examples/uncertainty/images/sphx_glr_plot_reliability_diagram_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 424-442 How to read the figure ---------------------- The diagonal represents perfect calibration: - a nominal 80% interval should contain the truth 80% of the time, - a nominal 40% interval should contain the truth 40% of the time, - and so on. Curves below the diagonal are typically overconfident: the forecast intervals are too narrow for the actual uncertainty. Curves above the diagonal are conservative: the intervals are wide enough to cover more often than they claim. In this synthetic example, the overconfident model should sit below the diagonal more often, the conservative model should tend above it, and the calibrated model should move closer to the diagonal. .. GENERATED FROM PYTHON SOURCE LINES 444-448 Step 7 - Compare reliability by forecast horizon ------------------------------------------------ Reliability can change with lead time. So we also build a compact per-horizon table for the most common interval, q10-q90. .. GENERATED FROM PYTHON SOURCE LINES 448-473 .. code-block:: Python def coverage80_by_step(df: pd.DataFrame) -> pd.DataFrame: rows = [] for step, g in df.groupby("forecast_step", sort=True): y_true = g["subsidence_actual"].to_numpy(float) lo = g["subsidence_q10"].to_numpy(float) hi = g["subsidence_q90"].to_numpy(float) covered = (y_true >= lo) & (y_true <= hi) rows.append( { "forecast_step": int(step), "coverage80": float(np.mean(covered)), "mean_width80": float(np.mean(hi - lo)), } ) return pd.DataFrame(rows) print("") print("Coverage80 by step: calibrated model") print(coverage80_by_step(df_over_cal).to_string(index=False)) .. rst-class:: sphx-glr-script-out .. code-block:: none Coverage80 by step: calibrated model forecast_step coverage80 mean_width80 1 0.8148 0.9517 2 0.8148 0.7898 3 0.8148 1.1919 .. GENERATED FROM PYTHON SOURCE LINES 474-480 Why this matters ---------------- A model can look well calibrated overall and still degrade at later horizons. Reliability diagrams give the broad probability-level view, while simple per-step tables help show where the calibration problem is strongest. .. GENERATED FROM PYTHON SOURCE LINES 482-494 Step 8 - Demonstrate the precomputed-point mode ----------------------------------------------- The helper can also work from precomputed reliability points instead of forecast tables. This is useful when: - the reliability points were computed elsewhere, - or when you want to compare a stored summary rather than reprocess the forecast table. Here we pass the calibrated model that way. .. GENERATED FROM PYTHON SOURCE LINES 494-514 .. code-block:: Python precomputed_models = { "Calibrated (precomputed)": { "nominal_probs": rel_cal["nominal_prob"].tolist(), "observed_probs": rel_cal["observed_freq"].tolist(), "marker": "D", "style": "-", } } plot_reliability_diagram( models_data=precomputed_models, y_true=None, prefix="subsidence", figsize=(6.6, 6.6), title="Reliability diagram from precomputed points", plot_style="default", verbose=0, ) .. image-sg:: /auto_examples/uncertainty/images/sphx_glr_plot_reliability_diagram_002.png :alt: Reliability diagram from precomputed points :srcset: /auto_examples/uncertainty/images/sphx_glr_plot_reliability_diagram_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 515-528 Final takeaway -------------- Reliability diagrams answer a different question from Brier score or interval width alone. They show whether probabilistic intervals are *honest* across a range of nominal probabilities. That is why this page belongs in the uncertainty gallery: - interval calibration teaches how to adjust intervals, - coverage versus sharpness teaches the main trade-off, - reliability diagrams show whether the probabilistic statements themselves line up with reality. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 0.304 seconds) .. _sphx_glr_download_auto_examples_uncertainty_plot_reliability_diagram.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_reliability_diagram.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_reliability_diagram.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_reliability_diagram.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_