.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/inspection/plot_calibration_stats_overview.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_inspection_plot_calibration_stats_overview.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_inspection_plot_calibration_stats_overview.py:


Inspect calibration statistics before trusting interval forecasts
=================================================================

This lesson explains how to inspect the compact
``calibration_stats.json`` artifact produced by the forecast
calibration workflow.

Why this file matters
---------------------
Calibration is easy to misunderstand. A calibration step can improve
coverage while also making intervals too wide to be useful. In other
words, "better calibrated" does not automatically mean "better for
forecasting decisions."

The calibration-stats artifact helps answer practical questions such as:

- Was the target interval clearly defined?
- Did the overall coverage move closer to the target?
- Which forecast horizons required widening the intervals?
- Did the long-horizon correction become too aggressive?
- Was the gain in coverage paid for by a large increase in sharpness
  (that is, by much wider intervals)?

The goal of this page is therefore not only to call plotting helpers.
It is to teach how to read calibration results step by step and decide
whether the calibrated intervals are believable and still useful.

.. GENERATED FROM PYTHON SOURCE LINES 33-74

.. code-block:: Python


    from __future__ import annotations

    import json
    import tempfile
    from pathlib import Path
    from pprint import pprint

    import matplotlib.pyplot as plt
    import pandas as pd

    from geoprior.utils.inspect import (
        calibration_stats_factors_frame,
        calibration_stats_overall_frame,
        calibration_stats_per_horizon_frame,
        generate_calibration_stats,
        inspect_calibration_stats,
        load_calibration_stats,
        plot_calibration_boolean_summary,
        plot_calibration_factors,
        plot_calibration_overall_metrics,
        plot_calibration_per_horizon_coverage,
        plot_calibration_per_horizon_sharpness,
        summarize_calibration_stats,
    )

    pd.set_option("display.max_columns", 20)
    pd.set_option("display.width", 100)
    pd.set_option("display.float_format", lambda v: f"{v:0.6f}")

    CALIBRATION_PALETTE = {
        "factors": "#4361EE",
        "overall": ["#457B9D", "#1D3557", "#F4A261", "#E76F51"],
        "coverage_before": "#6C757D",
        "coverage_after": "#2A9D8F",
        "sharp_before": "#B56576",
        "sharp_after": "#E56B6F",
        "checks": "#5E548E",
    }


.. GENERATED FROM PYTHON SOURCE LINES 75-92

What this artifact is really checking
-------------------------------------
The compact calibration-stats artifact is intentionally narrow. It does
not try to replace the full interpretable evaluation JSON. Instead, it
focuses on the small set of quantities that matter most when asking a
calibration question:

1. the target interval and tolerance,
2. the per-horizon widening factors,
3. the overall coverage/sharpness before calibration,
4. the same quantities after calibration,
5. the per-horizon behaviour behind the global summary.

This is useful because interval quality is usually a trade-off. We want
coverage to approach the target, but we do not want the intervals to
become unnecessarily wide.

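
As a point of reference, the two quantities tracked by the artifact can
be computed from raw interval forecasts roughly as follows. This is a
minimal sketch, not the workflow's own implementation. It assumes that
coverage is the fraction of observations falling inside the interval and
that sharpness is the mean interval width (consistent with the
convention in this lesson that a larger sharpness value means a wider
interval). The helpers ``coverage_and_sharpness`` and ``widen``, the toy
arrays, and the idea of scaling intervals symmetrically about their
midpoints are all hypothetical illustrations.

.. code-block:: Python

    import numpy as np


    def coverage_and_sharpness(y_true, lower, upper):
        """Return (empirical coverage, mean interval width)."""
        y_true = np.asarray(y_true, dtype=float)
        lower = np.asarray(lower, dtype=float)
        upper = np.asarray(upper, dtype=float)
        inside = (y_true >= lower) & (y_true <= upper)
        return float(inside.mean()), float((upper - lower).mean())


    def widen(lower, upper, factor):
        """Scale intervals about their midpoints (assumed factor semantics)."""
        lower = np.asarray(lower, dtype=float)
        upper = np.asarray(upper, dtype=float)
        mid = (lower + upper) / 2.0
        half = (upper - lower) / 2.0 * factor
        return mid - half, mid + half


    # Toy data: five observations, one of which misses its interval.
    y = [1.0, 2.0, 3.0, 4.0, 5.0]
    lo = [0.5, 1.5, 3.5, 3.0, 4.0]
    hi = [1.5, 2.5, 4.5, 5.0, 6.0]

    cov, sharp = coverage_and_sharpness(y, lo, hi)
    print(f"before: coverage={cov:.2f}, sharpness={sharp:.2f}")

    # Widening recovers the missed observation, at the cost of width.
    lo_wide, hi_wide = widen(lo, hi, factor=2.0)
    cov_wide, sharp_wide = coverage_and_sharpness(y, lo_wide, hi_wide)
    print(f"after:  coverage={cov_wide:.2f}, sharpness={sharp_wide:.2f}")

The toy factor of 2.0 is deliberately exaggerated so that the coverage
gain and the width cost are both easy to see.
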
.. GENERATED FROM PYTHON SOURCE LINES 95-105

Create a realistic demo calibration artifact
--------------------------------------------
For a documentation lesson, we want a stable example that behaves like
a real saved calibration file, without rerunning the full Stage-2
workflow. The generator gives us exactly that.

In this example, we intentionally let the third horizon receive the
largest widening factor, because long-range forecasts are often where
interval calibration struggles the most.

.. GENERATED FROM PYTHON SOURCE LINES 105-162

.. code-block:: Python


    workdir = Path(tempfile.mkdtemp(prefix="gp_cal_stats_"))
    out_dir = workdir

    cal_path = out_dir / "nansha_calibration_stats.json"
    generate_calibration_stats(
        cal_path,
        overrides={
            "target": 0.80,
            "tol": 0.02,
            "factors": {
                "1": 1.0,
                "2": 1.0,
                "3": 1.085,
            },
            "eval_before": {
                "coverage": 0.865,
                "sharpness": 33.08,
                "per_horizon": {
                    "1": {
                        "coverage": 0.9790543662405667,
                        "sharpness": 23.244874687581664,
                    },
                    "2": {
                        "coverage": 0.8223728117459829,
                        "sharpness": 27.592224706985775,
                    },
                    "3": {
                        "coverage": 0.7935725653267621,
                        "sharpness": 48.40746668258111,
                    },
                },
            },
            "eval_after": {
                "coverage": 0.806,
                "sharpness": 33.92,
                "per_horizon": {
                    "1": {
                        "coverage": 0.9790543662405667,
                        "sharpness": 23.244874687581664,
                    },
                    "2": {
                        "coverage": 0.8223728117459829,
                        "sharpness": 27.592224706985775,
                    },
                    "3": {
                        "coverage": 0.8010410698701166,
                        "sharpness": 50.89692692253824,
                    },
                },
            },
        },
    )

    print("Written calibration file")
    print(f" - {cal_path}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Written calibration file
     - /tmp/gp_cal_stats_lvl72pgz/nansha_calibration_stats.json


.. GENERATED FROM PYTHON SOURCE LINES 163-170

Load the artifact with the real reader
--------------------------------------
The loader is designed to accept both a direct calibration-stats JSON
and the richer interpretable evaluation JSON that embeds the same
block. For the first pass, we use the direct file because it is the
cleanest way to understand the structure.

.. GENERATED FROM PYTHON SOURCE LINES 170-184

.. code-block:: Python


    cal_record = load_calibration_stats(cal_path)

    print("\nArtifact header")
    pprint(
        {
            "kind": cal_record.kind,
            "city": cal_record.city,
            "model": cal_record.model,
            "path": str(cal_record.path),
            "meta": cal_record.meta,
        }
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Artifact header
    {'city': None,
     'kind': 'calibration_stats',
     'meta': {'has_eval_after': True,
              'has_eval_before': True,
              'has_factors': True,
              'n_top_keys': 9,
              'source_name': 'nansha_calibration_stats.json',
              'top_keys': ['target',
                           'interval',
                           'f_max',
                           'tol',
                           'overall_key',
                           'factors_source',
                           'factors',
                           'eval_before',
                           'eval_after']},
     'model': None,
     'path': '/tmp/gp_cal_stats_lvl72pgz/nansha_calibration_stats.json'}


.. GENERATED FROM PYTHON SOURCE LINES 185-194

Start with the compact summary
------------------------------
A good inspection habit is to read the semantic summary before you look
at individual horizons.

This answers the first decision question:

*Did calibration move the overall interval behaviour in the right
direction?*

.. GENERATED FROM PYTHON SOURCE LINES 194-200

.. code-block:: Python


    summary = summarize_calibration_stats(cal_record)

    print("\nCompact summary")
    print(json.dumps(summary, indent=2))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Compact summary
    {
      "target": 0.8,
      "interval_low": 0.1,
      "interval_high": 0.9,
      "tol": 0.02,
      "n_horizons": 3,
      "factors_source": "fit",
      "coverage_before": 0.865,
      "coverage_after": 0.806,
      "sharpness_before": 33.08,
      "sharpness_after": 33.92,
      "coverage_error_before": 0.06499999999999995,
      "coverage_error_after": 0.006000000000000005,
      "coverage_error_improved": true,
      "target_reached_after": true,
      "max_factor": 1.085,
      "min_factor": 1.0,
      "has_eval_before": true,
      "has_eval_after": true,
      "has_factors": true,
      "skipped": false
    }

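
The error and flag fields in the compact summary can be reproduced by
hand from the four headline numbers. The sketch below assumes that the
coverage error is the absolute distance to the target and that
``target_reached_after`` means this distance is no larger than ``tol``.
Those rules are inferred from the printed values, so the actual logic
inside ``summarize_calibration_stats`` may differ.

.. code-block:: Python

    # Headline values copied from the demo summary.
    target, tol = 0.80, 0.02
    coverage_before, coverage_after = 0.865, 0.806

    error_before = abs(coverage_before - target)
    error_after = abs(coverage_after - target)

    improved = error_after < error_before
    reached = error_after <= tol  # assumed meaning of "target reached"

    print(f"error_before={error_before:.3f}, error_after={error_after:.3f}")
    print(f"improved={improved}, reached={reached}")

Working through the arithmetic once makes it easier to spot a summary
whose flags disagree with its own coverage numbers.
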
.. GENERATED FROM PYTHON SOURCE LINES 201-212

Read the main tidy tables
-------------------------
The helper frames turn the nested JSON into compact views that are
easier to reason about than the raw payload.

We use three levels of reading:

1. the factor table,
2. the overall before/after table,
3. the per-horizon tables.

.. GENERATED FROM PYTHON SOURCE LINES 212-236

.. code-block:: Python


    factors = calibration_stats_factors_frame(cal_record)
    overall = calibration_stats_overall_frame(cal_record)
    per_before = calibration_stats_per_horizon_frame(
        cal_record,
        which="eval_before",
    )
    per_after = calibration_stats_per_horizon_frame(
        cal_record,
        which="eval_after",
    )

    print("\nPer-horizon calibration factors")
    print(factors)

    print("\nOverall before/after metrics")
    print(overall)

    print("\nPer-horizon coverage and sharpness before calibration")
    print(per_before)

    print("\nPer-horizon coverage and sharpness after calibration")
    print(per_after)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Per-horizon calibration factors
      horizon    factor
    0       1  1.000000
    1       2  1.000000
    2       3  1.085000

    Overall before/after metrics
             which  coverage  sharpness  coverage_error
    0  eval_before  0.865000  33.080000        0.065000
    1   eval_after  0.806000  33.920000        0.006000

    Per-horizon coverage and sharpness before calibration
             which horizon  coverage  sharpness
    0  eval_before       1  0.979054  23.244875
    1  eval_before       2  0.822373  27.592225
    2  eval_before       3  0.793573  48.407467

    Per-horizon coverage and sharpness after calibration
            which horizon  coverage  sharpness
    0  eval_after       1  0.979054  23.244875
    1  eval_after       2  0.822373  27.592225
    2  eval_after       3  0.801041  50.896927


.. GENERATED FROM PYTHON SOURCE LINES 237-254

Interpret the overall table
---------------------------
The most useful first comparison is the distance to the target
coverage. The calibration summary already computes
``coverage_error_before`` and ``coverage_error_after``, but it is worth
making that comparison explicit in the lesson.

Here the target is 0.80. So:

- a coverage of 0.865 means the uncalibrated intervals were too
  conservative overall,
- a coverage of 0.806 is much closer to the target,
- but sharpness also became larger, which means the intervals got wider.

That is the core calibration trade-off.

.. GENERATED FROM PYTHON SOURCE LINES 254-264

.. code-block:: Python


    overall_view = overall.copy()
    overall_view["target"] = summary["target"]
    overall_view["distance_to_target"] = (
        overall_view["coverage"] - summary["target"]
    ).abs()

    print("\nOverall interpretation table")
    print(overall_view)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Overall interpretation table
             which  coverage  sharpness  coverage_error    target  distance_to_target
    0  eval_before  0.865000  33.080000        0.065000  0.800000            0.065000
    1   eval_after  0.806000  33.920000        0.006000  0.800000            0.006000


.. GENERATED FROM PYTHON SOURCE LINES 265-272

Compare horizons directly
-------------------------
Global coverage can hide where the correction really happened.
Per-horizon inspection often reveals that the calibration step was
almost inactive at short horizons and mainly widened the uncertain
tail.

.. GENERATED FROM PYTHON SOURCE LINES 272-311

.. code-block:: Python


    horizon_compare = per_before.merge(
        per_after,
        on="horizon",
        suffixes=("_before", "_after"),
    )
    horizon_compare["coverage_delta"] = (
        horizon_compare["coverage_after"]
        - horizon_compare["coverage_before"]
    )
    horizon_compare["sharpness_delta"] = (
        horizon_compare["sharpness_after"]
        - horizon_compare["sharpness_before"]
    )
    horizon_compare["coverage_error_before"] = (
        horizon_compare["coverage_before"] - summary["target"]
    ).abs()
    horizon_compare["coverage_error_after"] = (
        horizon_compare["coverage_after"] - summary["target"]
    ).abs()

    print("\nPer-horizon comparison")
    print(
        horizon_compare.loc[
            :,
            [
                "horizon",
                "coverage_before",
                "coverage_after",
                "coverage_delta",
                "coverage_error_before",
                "coverage_error_after",
                "sharpness_before",
                "sharpness_after",
                "sharpness_delta",
            ],
        ]
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Per-horizon comparison
      horizon  coverage_before  coverage_after  coverage_delta  coverage_error_before  \
    0       1         0.979054        0.979054        0.000000               0.179054
    1       2         0.822373        0.822373        0.000000               0.022373
    2       3         0.793573        0.801041        0.007469               0.006427

       coverage_error_after  sharpness_before  sharpness_after  sharpness_delta
    0              0.179054         23.244875        23.244875         0.000000
    1              0.022373         27.592225        27.592225         0.000000
    2              0.001041         48.407467        50.896927         2.489460


.. GENERATED FROM PYTHON SOURCE LINES 312-330

How to read the factor table
----------------------------
Calibration factors are one of the easiest places to make a wrong
decision.

A factor near 1.0 means the interval was left essentially unchanged.
A larger factor means the interval had to be widened.

This does **not** mean a larger factor is always bad. It simply means
that horizon needed more correction. The real question is whether the
correction improved coverage enough to justify the extra width.

In this demo:

- H1 and H2 stay at 1.0, so calibration left their intervals unchanged
  (both already cover at or above the 0.80 target),
- H3 is widened, which is consistent with it being the hardest horizon
  to calibrate.

.. GENERATED FROM PYTHON SOURCE LINES 330-339

.. code-block:: Python


    print("\nFactor reading note")
    if not factors.empty:
        worst = factors.loc[factors["factor"].idxmax()]
        print(
            f"Largest widening factor: H{worst['horizon']} -> "
            f"{worst['factor']:.3f}"
        )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Factor reading note
    Largest widening factor: H3 -> 1.085


.. GENERATED FROM PYTHON SOURCE LINES 340-354

Plot the main calibration views
-------------------------------
A compact first inspection usually benefits from four plots:

1. the widening factors,
2. the overall before/after coverage and sharpness,
3. per-horizon coverage before calibration,
4. per-horizon coverage after calibration.

This figure helps answer:

*Where did calibration act, and did it move the empirical coverage
toward the target?*

.. GENERATED FROM PYTHON SOURCE LINES 354-401

.. code-block:: Python


    fig, axes = plt.subplots(
        2,
        2,
        figsize=(12.4, 8.8),
        constrained_layout=True,
    )

    plot_calibration_factors(
        axes[0, 0],
        cal_record,
        title="Per-horizon widening factors",
        color=CALIBRATION_PALETTE["factors"],
        edgecolor="#1F2937",
        linewidth=0.9,
        alpha=0.92,
    )
    plot_calibration_overall_metrics(
        axes[0, 1],
        cal_record,
        title="Overall coverage and sharpness",
        color=CALIBRATION_PALETTE["overall"],
        edgecolor="#1F2937",
        linewidth=0.9,
        alpha=0.92,
    )
    plot_calibration_per_horizon_coverage(
        axes[1, 0],
        cal_record,
        which="eval_before",
        title="Coverage before calibration",
        color=CALIBRATION_PALETTE["coverage_before"],
        marker="s",
        linewidth=2.2,
        markersize=7,
    )
    plot_calibration_per_horizon_coverage(
        axes[1, 1],
        cal_record,
        which="eval_after",
        title="Coverage after calibration",
        color=CALIBRATION_PALETTE["coverage_after"],
        marker="o",
        linewidth=2.2,
        markersize=7,
    )



.. image-sg:: /auto_examples/inspection/images/sphx_glr_plot_calibration_stats_overview_001.png
   :alt: Per-horizon widening factors, Overall coverage and sharpness, Coverage before calibration, Coverage after calibration
   :srcset: /auto_examples/inspection/images/sphx_glr_plot_calibration_stats_overview_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 402-413

How to interpret the first figure
---------------------------------
When reading the first figure, look for the following pattern:

- are factors concentrated on the harder horizons?
- does the post-calibration coverage move closer to the target?
- does the correction stay moderate rather than explosive?

In this demo, the main action is at H3. That is exactly the kind of
pattern we usually expect when uncertainty grows with horizon.

.. GENERATED FROM PYTHON SOURCE LINES 415-421

Plot the width trade-off directly
---------------------------------
Coverage alone is never enough.
The second figure focuses on the cost of calibration in terms of
interval width and on the final pass/fail-style checks.

.. GENERATED FROM PYTHON SOURCE LINES 421-459

.. code-block:: Python


    fig, axes = plt.subplots(
        1,
        3,
        figsize=(14.2, 4.4),
        constrained_layout=True,
    )

    plot_calibration_per_horizon_sharpness(
        axes[0],
        cal_record,
        which="eval_before",
        title="Sharpness before calibration",
        color=CALIBRATION_PALETTE["sharp_before"],
        marker="^",
        linewidth=2.2,
        markersize=7,
    )
    plot_calibration_per_horizon_sharpness(
        axes[1],
        cal_record,
        which="eval_after",
        title="Sharpness after calibration",
        color=CALIBRATION_PALETTE["sharp_after"],
        marker="D",
        linewidth=2.2,
        markersize=6.5,
    )
    plot_calibration_boolean_summary(
        axes[2],
        cal_record,
        title="Calibration decision checks",
        color=CALIBRATION_PALETTE["checks"],
        edgecolor="#1F2937",
        linewidth=0.8,
        alpha=0.9,
    )



.. image-sg:: /auto_examples/inspection/images/sphx_glr_plot_calibration_stats_overview_002.png
   :alt: Sharpness before calibration, Sharpness after calibration, Calibration decision checks
   :srcset: /auto_examples/inspection/images/sphx_glr_plot_calibration_stats_overview_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 460-477

Read the sharpness plots with care
----------------------------------
A larger sharpness value means a wider interval. In many workflows,
that is the hidden price of achieving better calibration.

The right question is therefore not:

*Did sharpness increase?*

but:

*Did sharpness increase by an amount that is acceptable given the
gain in coverage?*

In this demo the widening is concentrated at H3. That is plausible,
because the later horizon needed the strongest factor adjustment.

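
One way to make that question concrete is to express the width increase
in relative terms and set it against the reduction in coverage error.
The sketch below reuses the overall and H3 numbers from this demo
artifact. The ``cases`` and ``report`` names and the choice of a
relative width increase as the cost measure are illustrative
assumptions, not part of ``geoprior``.

.. code-block:: Python

    target = 0.80

    # (coverage, sharpness) before and after calibration, from the demo.
    cases = {
        "overall": ((0.865, 33.08), (0.806, 33.92)),
        "H3": (
            (0.7935725653267621, 48.40746668258111),
            (0.8010410698701166, 50.89692692253824),
        ),
    }

    report = {}
    for name, ((cov_b, sharp_b), (cov_a, sharp_a)) in cases.items():
        # Cost: how much wider the intervals became, as a fraction.
        rel_width_increase = sharp_a / sharp_b - 1.0
        # Benefit: how much closer the coverage moved to the target.
        error_reduction = abs(cov_b - target) - abs(cov_a - target)
        report[name] = (rel_width_increase, error_reduction)
        print(
            f"{name}: width +{rel_width_increase:.1%}, "
            f"coverage error reduced by {error_reduction:.4f}"
        )

Whether a given width increase is acceptable remains a domain decision.
The point of the table is only to put cost and benefit side by side.
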
.. GENERATED FROM PYTHON SOURCE LINES 479-484

Use the all-in-one inspection bundle
------------------------------------
``inspect_calibration_stats(...)`` is handy when you want the raw
payload, the semantic summary, and the main tidy frames in one place.

.. GENERATED FROM PYTHON SOURCE LINES 484-490

.. code-block:: Python


    bundle = inspect_calibration_stats(cal_record)

    print("\nInspector bundle keys")
    print(sorted(bundle))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Inspector bundle keys
    ['factors', 'overall', 'payload', 'per_horizon_after', 'per_horizon_before', 'summary']


.. GENERATED FROM PYTHON SOURCE LINES 491-498

The module also understands the nested eval-style structure
-----------------------------------------------------------
A useful design detail of this inspector is that it can also read the
calibration block when it is embedded inside the richer interpretable
evaluation JSON. This means a user can inspect calibration even when
they only have the larger eval artifact at hand.

.. GENERATED FROM PYTHON SOURCE LINES 498-522

.. code-block:: Python


    eval_like_path = out_dir / "nansha_eval_interpretable.json"
    eval_like_payload = {
        "city": "nansha",
        "model": "GeoPriorSubsNet",
        "interval_calibration": {
            "factors_per_horizon_from_cal_stats": bundle["payload"],
            "factors_per_horizon": {
                row["horizon"]: row["factor"]
                for _, row in bundle["factors"].iterrows()
            },
        },
    }
    eval_like_path.write_text(
        json.dumps(eval_like_payload, indent=2),
        encoding="utf-8",
    )

    nested_record = load_calibration_stats(eval_like_path)
    nested_summary = summarize_calibration_stats(nested_record)

    print("\nNested eval-style summary")
    print(json.dumps(nested_summary, indent=2))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Nested eval-style summary
    {
      "target": 0.8,
      "interval_low": 0.1,
      "interval_high": 0.9,
      "tol": 0.02,
      "n_horizons": 3,
      "factors_source": "fit",
      "coverage_before": 0.865,
      "coverage_after": 0.806,
      "sharpness_before": 33.08,
      "sharpness_after": 33.92,
      "coverage_error_before": 0.06499999999999995,
      "coverage_error_after": 0.006000000000000005,
      "coverage_error_improved": true,
      "target_reached_after": true,
      "max_factor": 1.085,
      "min_factor": 1.0,
      "has_eval_before": true,
      "has_eval_after": true,
      "has_factors": true,
      "skipped": false
    }


.. GENERATED FROM PYTHON SOURCE LINES 523-537

A practical decision rule
-------------------------
For this artifact family, a reasonable quick decision rule is to
require that:

- the file contains both before and after evaluation blocks,
- the calibration factors are present,
- the post-calibration coverage error improved,
- the post-calibration coverage is within tolerance of the target,
- and the largest widening factor is still interpretable.

That last part is domain-dependent. There is no universal numeric
threshold, but extremely large factors are a warning sign that the
forecast distribution may have been badly mis-scaled upstream.

.. GENERATED FROM PYTHON SOURCE LINES 537-570

.. code-block:: Python


    max_factor = summary.get("max_factor")
    structurally_ready = all(
        bool(summary.get(name, False))
        for name in [
            "has_eval_before",
            "has_eval_after",
            "has_factors",
            "coverage_error_improved",
            "target_reached_after",
        ]
    )
    # The 2.0 cut-off is an illustrative choice for this demo,
    # not a universal threshold.
    moderate_factors = (
        max_factor is not None and float(max_factor) <= 2.0
    )

    print("\nDecision note")
    if structurally_ready and moderate_factors:
        print(
            "This demo calibration artifact looks healthy enough to trust "
            "the calibrated intervals in later evaluation and reporting."
        )
    elif structurally_ready:
        print(
            "Calibration improved the target fit, but the widening factors "
            "deserve a closer look before you rely on the result."
        )
    else:
        print(
            "This calibration result needs attention before you use the "
            "calibrated intervals downstream."
        )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Decision note
    This demo calibration artifact looks healthy enough to trust the calibrated intervals in later evaluation and reporting.


.. GENERATED FROM PYTHON SOURCE LINES 571-586

Final checklist for real workflow review
----------------------------------------
When you inspect your own saved calibration artifact, try to answer
these questions in order:

1. Is the target interval the one I intended to calibrate?
2. Did overall coverage move toward the target?
3. Which horizons needed the correction?
4. Did the width increase stay concentrated on the difficult cases?
5. Are the factors still small enough to remain believable?

If the answer is mostly yes, the calibration step is doing something
useful. If not, the right fix may be upstream in the forecast model,
not in the calibration layer itself.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.642 seconds)


.. _sphx_glr_download_auto_examples_inspection_plot_calibration_stats_overview.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_calibration_stats_overview.ipynb <plot_calibration_stats_overview.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_calibration_stats_overview.py <plot_calibration_stats_overview.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_calibration_stats_overview.zip <plot_calibration_stats_overview.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_