.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/inspection/plot_ablation_record_overview.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_inspection_plot_ablation_record_overview.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_inspection_plot_ablation_record_overview.py:


Inspect ablation records before choosing a configuration
========================================================

This lesson explains how to inspect the JSONL artifact
``ablation_record.jsonl``.

Why this file matters
---------------------
Ablation work is not only about finding the *best* score. It is about
understanding *why* one configuration behaves differently from another.
In practice, an ablation record helps answer questions such as:

- Which variant really reduced RMSE or MAE?
- Was the apparent gain only at short horizons?
- Did a physics-weight change improve fit while harming stability?
- Are two variants actually comparable in their lambda weights?
- Is a strong score still believable when epsilon diagnostics degrade?

This page is therefore written as a **reading lesson**, not only as an
API demo. We will use the ablation inspector to move from raw JSONL
records to interpretable tables, comparison plots, and a final decision
rule.

.. GENERATED FROM PYTHON SOURCE LINES 30-73

.. code-block:: Python


    from __future__ import annotations

    import json
    import tempfile
    from pathlib import Path
    from pprint import pprint

    import matplotlib.pyplot as plt
    import pandas as pd

    from geoprior.utils.inspect import (
        ablation_config_frame,
        ablation_metrics_frame,
        ablation_per_horizon_frame,
        ablation_record_flags_frame,
        ablation_record_runs_frame,
        generate_ablation_record,
        inspect_ablation_record,
        load_ablation_record,
        plot_ablation_boolean_summary,
        plot_ablation_lambda_weights,
        plot_ablation_metric_by_variant,
        plot_ablation_per_horizon_metric,
        plot_ablation_run_counts,
        plot_ablation_top_variants,
        summarize_ablation_record,
    )

    pd.set_option("display.max_columns", 32)
    pd.set_option("display.width", 110)
    pd.set_option("display.float_format", lambda v: f"{v:0.6f}")

    ABLATION_PALETTE = {
        "counts": ["#4C78A8", "#72B7B2", "#54A24B", "#F58518"],
        "rmse": ["#2E86AB", "#A23B72", "#F18F01", "#C73E1D"],
        "lambdas": ["#355070", "#6D597A", "#B56576", "#E56B6F", "#EAAC8B", "#84A59D", "#52796F"],
        "horizon_lines": ["#3A86FF", "#8338EC", "#FF006E", "#FB5607"],
        "top": ["#118AB2", "#06D6A0", "#FFD166", "#EF476F"],
        "checks": "#5C677D",
    }


.. GENERATED FROM PYTHON SOURCE LINES 74-89

JSONL is a little different from the other artifacts
----------------------------------------------------

Most inspection files in this gallery are single JSON objects.
``ablation_record.jsonl`` is different: it is a newline-delimited log,
so each line is one ablation record.

That difference matters conceptually:

1. we are usually comparing *variants*, not reading one run in
   isolation,
2. there may be multiple seeds or repeated evaluations,
3. one metric alone is rarely enough to choose a variant.

In other words, the goal is to compare *patterns* across records.

.. GENERATED FROM PYTHON SOURCE LINES 92-109

Create a realistic demo ablation file
-------------------------------------

For a gallery lesson, we want a stable and readable set of ablation
rows without rerunning the full experiment pipeline.

The helper already creates a realistic family of records, but here we
push the variants a little further apart so the lesson becomes easier
to interpret:

- ``baseline`` stays as the reference,
- ``gw_heavier`` gets a larger groundwater weight,
- ``smoother`` makes the smoothness term more visible,
- ``bounds_stronger`` increases the bounds penalty.

This is a good teaching setup because it gives us meaningful
differences in both scalar and horizon-wise behavior.

.. GENERATED FROM PYTHON SOURCE LINES 109-207

.. code-block:: Python


    workdir = Path(tempfile.mkdtemp(prefix="gp_ablation_"))
    out_dir = workdir
    ablation_path = out_dir / "nansha_ablation_record.jsonl"

    generate_ablation_record(
        ablation_path,
        overrides=[
            {
                "ablation": "baseline",
                "seed": 11,
                "lambda_gw": 0.10,
                "lambda_smooth": 0.010,
                "lambda_bounds": 0.05,
                "mae": 0.01190,
                "rmse": 0.01776,
                "r2": 0.87970,
                "epsilon_prior": 8.78,
                "per_horizon_mae": {
                    "H1": 0.00536,
                    "H2": 0.01229,
                    "H3": 0.01807,
                },
                "per_horizon_r2": {
                    "H1": 0.8924,
                    "H2": 0.8812,
                    "H3": 0.8720,
                },
            },
            {
                "ablation": "gw_heavier",
                "seed": 12,
                "lambda_gw": 0.18,
                "lambda_smooth": 0.010,
                "lambda_bounds": 0.05,
                "mae": 0.01135,
                "rmse": 0.01710,
                "r2": 0.88450,
                "epsilon_prior": 9.35,
                "per_horizon_mae": {
                    "H1": 0.00510,
                    "H2": 0.01166,
                    "H3": 0.01729,
                },
                "per_horizon_r2": {
                    "H1": 0.8960,
                    "H2": 0.8865,
                    "H3": 0.8758,
                },
            },
            {
                "ablation": "smoother",
                "seed": 13,
                "lambda_gw": 0.10,
                "lambda_smooth": 0.050,
                "lambda_bounds": 0.05,
                "mae": 0.01225,
                "rmse": 0.01802,
                "r2": 0.87720,
                "epsilon_prior": 7.95,
                "per_horizon_mae": {
                    "H1": 0.00562,
                    "H2": 0.01256,
                    "H3": 0.01857,
                },
                "per_horizon_r2": {
                    "H1": 0.8910,
                    "H2": 0.8790,
                    "H3": 0.8680,
                },
            },
            {
                "ablation": "bounds_stronger",
                "seed": 14,
                "lambda_gw": 0.10,
                "lambda_smooth": 0.010,
                "lambda_bounds": 0.12,
                "mae": 0.01170,
                "rmse": 0.01742,
                "r2": 0.88210,
                "epsilon_prior": 8.12,
                "per_horizon_mae": {
                    "H1": 0.00518,
                    "H2": 0.01202,
                    "H3": 0.01796,
                },
                "per_horizon_r2": {
                    "H1": 0.8948,
                    "H2": 0.8832,
                    "H3": 0.8731,
                },
            },
        ],
    )

    print("Written ablation file")
    print(f" - {ablation_path}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Written ablation file
     - /tmp/gp_ablation_nxwhtrru/nansha_ablation_record.jsonl


.. GENERATED FROM PYTHON SOURCE LINES 208-215

Look at the raw JSONL form first
--------------------------------

Before using the inspection helpers, it is useful to remember what
this artifact really looks like on disk. Each line is its own JSON
record. This is one reason why the ablation inspector uses tables so
heavily: raw JSONL becomes hard to compare once the file grows.

.. GENERATED FROM PYTHON SOURCE LINES 215-223

.. code-block:: Python


    print("\nFirst two raw lines")
    with ablation_path.open("r", encoding="utf-8") as stream:
        for idx, line in enumerate(stream, start=1):
            print(line.strip())
            if idx >= 2:
                break


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    First two raw lines
    {"timestamp": "20260228-191355", "city": "nansha", "model": "GeoPriorSubsNet", "pde_mode": "both", "use_effective_h": true, "kappa_mode": "kb", "hd_factor": 0.6, "lambda_cons": 0.0, "lambda_gw": 0.1, "lambda_prior": 0.0, "lambda_smooth": 0.01, "lambda_mv": 0.0, "lambda_bounds": 0.05, "lambda_q": 0.0005, "r2": 0.8797, "mse": 0.000315, "mae": 0.0119, "rmse": 0.01776, "coverage80": 0.8554, "sharpness80": 0.0454, "metrics": {"r2": 0.8797, "mse": 0.000315, "mae": 0.0119, "rmse": 0.0178, "coverage80": 0.8554, "sharpness80": 0.0454, "units": {"subs_unit_to_si": 0.001, "subs_factor_si_to_real": 1000.0, "subs_metrics_unit": "m", "time_units": "year", "seconds_per_time_unit": 31556952.0}}, "units": {"subs_unit_to_si": 0.001, "subs_factor_si_to_real": 1000.0, "subs_metrics_unit": "m", "time_units": "year", "seconds_per_time_unit": 31556952.0}, "epsilon_prior": 8.78, "epsilon_cons": 0.00552, "epsilon_gw": 4.38e-07, "per_horizon_mae": {"H1": 0.00536, "H2": 0.01229, "H3": 0.01807}, "per_horizon_r2": {"H1": 0.8924, "H2": 0.8812, "H3": 0.872}, "ablation": "baseline", "seed": 11}
    {"timestamp": "20260228-191355", "city": "nansha", "model": "GeoPriorSubsNet", "pde_mode": "both", "use_effective_h": true, "kappa_mode": "kb", "hd_factor": 0.6, "lambda_cons": 0.0, "lambda_gw": 0.18, "lambda_prior": 0.0, "lambda_smooth": 0.01, "lambda_mv": 0.0, "lambda_bounds": 0.05, "lambda_q": 0.0005, "r2": 0.8845, "mse": 0.000315, "mae": 0.01135, "rmse": 0.0171, "coverage80": 0.8554, "sharpness80": 0.0454, "metrics": {"r2": 0.8805000000000001, "mse": 0.000315, "mae": 0.011500000000000002, "rmse": 0.0174, "coverage80": 0.8554, "sharpness80": 0.0454, "units": {"subs_unit_to_si": 0.001, "subs_factor_si_to_real": 1000.0, "subs_metrics_unit": "m", "time_units": "year", "seconds_per_time_unit": 31556952.0}}, "units": {"subs_unit_to_si": 0.001, "subs_factor_si_to_real": 1000.0, "subs_metrics_unit": "m", "time_units": "year", "seconds_per_time_unit": 31556952.0}, "epsilon_prior": 9.35, "epsilon_cons": 0.00552, "epsilon_gw": 4.38e-07, "per_horizon_mae": {"H1": 0.0051, "H2": 0.01166, "H3": 0.01729}, "per_horizon_r2": {"H1": 0.896, "H2": 0.8865, "H3": 0.8758}, "ablation": "gw_heavier", "seed": 12}


.. GENERATED FROM PYTHON SOURCE LINES 224-230

Load the artifact through the real reader
-----------------------------------------

The reader returns a plain list of normalized records. That keeps
the JSONL structure faithful while giving us a stable starting point
for tables and plots.

.. GENERATED FROM PYTHON SOURCE LINES 230-239

.. code-block:: Python


    records = load_ablation_record(ablation_path)

    print("\nHow many records were loaded?")
    print(len(records))

    print("\nFirst normalized record")
    pprint(records[0])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    How many records were loaded?
    4

    First normalized record
    {'ablation': 'baseline',
     'city': 'nansha',
     'coverage80': 0.8554,
     'epsilon_cons': 0.00552,
     'epsilon_gw': 4.38e-07,
     'epsilon_prior': 8.78,
     'hd_factor': 0.6,
     'kappa_mode': 'kb',
     'lambda_bounds': 0.05,
     'lambda_cons': 0.0,
     'lambda_gw': 0.1,
     'lambda_mv': 0.0,
     'lambda_prior': 0.0,
     'lambda_q': 0.0005,
     'lambda_smooth': 0.01,
     'mae': 0.0119,
     'metrics': {'coverage80': 0.8554,
                 'mae': 0.0119,
                 'mse': 0.000315,
                 'r2': 0.8797,
                 'rmse': 0.0178,
                 'sharpness80': 0.0454,
                 'units': {'seconds_per_time_unit': 31556952.0,
                           'subs_factor_si_to_real': 1000.0,
                           'subs_metrics_unit': 'm',
                           'subs_unit_to_si': 0.001,
                           'time_units': 'year'}},
     'model': 'GeoPriorSubsNet',
     'mse': 0.000315,
     'pde_mode': 'both',
     'per_horizon_mae': {'H1': 0.00536, 'H2': 0.01229, 'H3': 0.01807},
     'per_horizon_r2': {'H1': 0.8924, 'H2': 0.8812, 'H3': 0.872},
     'r2': 0.8797,
     'rmse': 0.01776,
     'seed': 11,
     'sharpness80': 0.0454,
     'timestamp': '20260228-191355',
     'units': {'seconds_per_time_unit': 31556952.0,
               'subs_factor_si_to_real': 1000.0,
               'subs_metrics_unit': 'm',
               'subs_unit_to_si': 0.001,
               'time_units': 'year'},
     'use_effective_h': True}


.. GENERATED FROM PYTHON SOURCE LINES 240-251

Start with the semantic summary
-------------------------------

A useful first question is not "Which plot should I draw?" but
rather:

*Does this file look complete enough to support a fair comparison?*

The semantic summary answers exactly that. It tells us whether the
file contains records, core metrics, horizon-wise metrics, units,
and key configuration knobs.

.. GENERATED FROM PYTHON SOURCE LINES 251-257

.. code-block:: Python


    summary = summarize_ablation_record(records)

    print("\nCompact summary")
    print(json.dumps(summary, indent=2))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Compact summary
    {
      "record_count": 4,
      "variant_count": 4,
      "seed_count": 4,
      "variants": [
        "baseline",
        "bounds_stronger",
        "gw_heavier",
        "smoother"
      ],
      "has_metrics": true,
      "has_per_horizon": true,
      "has_units": true,
      "has_lambda_weights": true,
      "best_by_rmse": {
        "variant": "gw_heavier",
        "value": 0.0171
      },
      "best_by_r2": {
        "variant": "gw_heavier",
        "value": 0.8845
      },
      "checks": {
        "has_records": true,
        "has_timestamp": true,
        "has_core_metrics": true,
        "has_per_horizon_metrics": true,
        "has_units_block": true,
        "has_config_knobs": true
      }
    }


.. GENERATED FROM PYTHON SOURCE LINES 258-271

Build the tidy comparison tables
--------------------------------

JSONL is best inspected after converting it into tidy tables.
We will use four different views:

1. a run-level table,
2. a long-form scalar-metrics table,
3. a per-horizon table,
4. a compact config/weights table.

Each table answers a different question, so it is worth keeping
them conceptually separate.

.. GENERATED FROM PYTHON SOURCE LINES 271-293

.. code-block:: Python


    runs = ablation_record_runs_frame(records)
    metrics = ablation_metrics_frame(records)
    per_h = ablation_per_horizon_frame(records)
    config = ablation_config_frame(records)
    flags = ablation_record_flags_frame(records)

    print("\nRun-level view")
    print(runs)

    print("\nScalar metric rows")
    print(metrics.head(18))

    print("\nPer-horizon rows")
    print(per_h)

    print("\nConfiguration / weight view")
    print(config)

    print("\nBoolean flags")
    print(flags)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Run-level view
       record_id          variant  seed        timestamp    city            model pde_mode  use_effective_h  \
    0          1         baseline    11  20260228-191355  nansha  GeoPriorSubsNet     both             True   
    1          2       gw_heavier    12  20260228-191355  nansha  GeoPriorSubsNet     both             True   
    2          3         smoother    13  20260228-191355  nansha  GeoPriorSubsNet     both             True   
    3          4  bounds_stronger    14  20260228-191355  nansha  GeoPriorSubsNet     both             True   

      kappa_mode  hd_factor  lambda_cons  lambda_gw  lambda_prior  lambda_smooth  lambda_mv  lambda_bounds  \
    0         kb   0.600000     0.000000   0.100000      0.000000       0.010000   0.000000       0.050000   
    1         kb   0.600000     0.000000   0.180000      0.000000       0.010000   0.000000       0.050000   
    2         kb   0.600000     0.000000   0.100000      0.000000       0.050000   0.000000       0.050000   
    3         kb   0.600000     0.000000   0.100000      0.000000       0.010000   0.000000       0.120000   

       lambda_q       r2      mse      mae     rmse  coverage80  sharpness80  epsilon_prior  epsilon_cons  \
    0  0.000500 0.879700 0.000315 0.011900 0.017760    0.855400     0.045400       8.780000      0.005520   
    1  0.000500 0.884500 0.000315 0.011350 0.017100    0.855400     0.045400       9.350000      0.005520   
    2  0.000500 0.877200 0.000315 0.012250 0.018020    0.855400     0.045400       7.950000      0.005520   
    3  0.000500 0.882100 0.000315 0.011700 0.017420    0.855400     0.045400       8.120000      0.005520   

       epsilon_gw  subs_unit_to_si  subs_factor_si_to_real subs_metrics_unit time_units  seconds_per_time_unit  
    0    0.000000         0.001000             1000.000000                 m       year        31556952.000000  
    1    0.000000         0.001000             1000.000000                 m       year        31556952.000000  
    2    0.000000         0.001000             1000.000000                 m       year        31556952.000000  
    3    0.000000         0.001000             1000.000000                 m       year        31556952.000000  

    Scalar metric rows
        record_id     variant  seed         metric    value
    0           1    baseline    11             r2 0.879700
    1           1    baseline    11            mse 0.000315
    2           1    baseline    11            mae 0.011900
    3           1    baseline    11           rmse 0.017760
    4           1    baseline    11     coverage80 0.855400
    5           1    baseline    11    sharpness80 0.045400
    6           1    baseline    11  epsilon_prior 8.780000
    7           1    baseline    11   epsilon_cons 0.005520
    8           1    baseline    11     epsilon_gw 0.000000
    9           2  gw_heavier    12             r2 0.884500
    10          2  gw_heavier    12            mse 0.000315
    11          2  gw_heavier    12            mae 0.011350
    12          2  gw_heavier    12           rmse 0.017100
    13          2  gw_heavier    12     coverage80 0.855400
    14          2  gw_heavier    12    sharpness80 0.045400
    15          2  gw_heavier    12  epsilon_prior 9.350000
    16          2  gw_heavier    12   epsilon_cons 0.005520
    17          2  gw_heavier    12     epsilon_gw 0.000000

    Per-horizon rows
        record_id          variant  seed metric horizon    value
    0           1         baseline    11    mae      H1 0.005360
    1           1         baseline    11    mae      H2 0.012290
    2           1         baseline    11    mae      H3 0.018070
    3           1         baseline    11     r2      H1 0.892400
    4           1         baseline    11     r2      H2 0.881200
    5           1         baseline    11     r2      H3 0.872000
    6           2       gw_heavier    12    mae      H1 0.005100
    7           2       gw_heavier    12    mae      H2 0.011660
    8           2       gw_heavier    12    mae      H3 0.017290
    9           2       gw_heavier    12     r2      H1 0.896000
    10          2       gw_heavier    12     r2      H2 0.886500
    11          2       gw_heavier    12     r2      H3 0.875800
    12          3         smoother    13    mae      H1 0.005620
    13          3         smoother    13    mae      H2 0.012560
    14          3         smoother    13    mae      H3 0.018570
    15          3         smoother    13     r2      H1 0.891000
    16          3         smoother    13     r2      H2 0.879000
    17          3         smoother    13     r2      H3 0.868000
    18          4  bounds_stronger    14    mae      H1 0.005180
    19          4  bounds_stronger    14    mae      H2 0.012020
    20          4  bounds_stronger    14    mae      H3 0.017960
    21          4  bounds_stronger    14     r2      H1 0.894800
    22          4  bounds_stronger    14     r2      H2 0.883200
    23          4  bounds_stronger    14     r2      H3 0.873100

    Configuration / weight view
       record_id          variant  seed        timestamp    city            model pde_mode  use_effective_h  \
    0          1         baseline    11  20260228-191355  nansha  GeoPriorSubsNet     both             True   
    1          2       gw_heavier    12  20260228-191355  nansha  GeoPriorSubsNet     both             True   
    2          3         smoother    13  20260228-191355  nansha  GeoPriorSubsNet     both             True   
    3          4  bounds_stronger    14  20260228-191355  nansha  GeoPriorSubsNet     both             True   

      kappa_mode  hd_factor  lambda_cons  lambda_gw  lambda_prior  lambda_smooth  lambda_mv  lambda_bounds  \
    0         kb   0.600000     0.000000   0.100000      0.000000       0.010000   0.000000       0.050000   
    1         kb   0.600000     0.000000   0.180000      0.000000       0.010000   0.000000       0.050000   
    2         kb   0.600000     0.000000   0.100000      0.000000       0.050000   0.000000       0.050000   
    3         kb   0.600000     0.000000   0.100000      0.000000       0.010000   0.000000       0.120000   

       lambda_q  subs_unit_to_si  subs_factor_si_to_real subs_metrics_unit time_units  seconds_per_time_unit  
    0  0.000500         0.001000             1000.000000                 m       year        31556952.000000  
    1  0.000500         0.001000             1000.000000                 m       year        31556952.000000  
    2  0.000500         0.001000             1000.000000                 m       year        31556952.000000  
    3  0.000500         0.001000             1000.000000                 m       year        31556952.000000  

    Boolean flags
       record_id          variant  seed             flag  value
    0          1         baseline    11  use_effective_h   True
    1          2       gw_heavier    12  use_effective_h   True
    2          3         smoother    13  use_effective_h   True
    3          4  bounds_stronger    14  use_effective_h   True


.. GENERATED FROM PYTHON SOURCE LINES 294-311

Read the run-level table as a comparison checklist
--------------------------------------------------

The run-level table is the fastest way to check whether records are
*comparable* at all.

Things worth checking here:

- Are all variants from the same city and model?
- Did the PDE mode stay fixed, or are we accidentally mixing two
  kinds of experiments?
- Are units consistent across rows?
- Did only one configuration knob change, or did several knobs move
  together?

A common ablation mistake is to compare variants that changed more
than one meaningful thing. The table makes that easier to detect.

.. GENERATED FROM PYTHON SOURCE LINES 311-329

.. code-block:: Python


    compare_cols = [
        col
        for col in [
            "variant",
            "city",
            "model",
            "pde_mode",
            "use_effective_h",
            "kappa_mode",
            "hd_factor",
            "time_units",
        ]
        if col in runs.columns
    ]
    print("\nComparison checklist view")
    print(runs.loc[:, compare_cols])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Comparison checklist view
               variant    city            model pde_mode  use_effective_h kappa_mode  hd_factor time_units
    0         baseline  nansha  GeoPriorSubsNet     both             True         kb   0.600000       year
    1       gw_heavier  nansha  GeoPriorSubsNet     both             True         kb   0.600000       year
    2         smoother  nansha  GeoPriorSubsNet     both             True         kb   0.600000       year
    3  bounds_stronger  nansha  GeoPriorSubsNet     both             True         kb   0.600000       year


.. GENERATED FROM PYTHON SOURCE LINES 330-339

Aggregate scalar metrics by variant
-----------------------------------

The long-form metric table becomes much easier to interpret once we
aggregate by variant. For a deep inspection lesson, this is where we
start asking ranking questions.

Here we compute a compact mean metric view. In a real experiment,
this step becomes even more useful when you have repeated seeds.

.. GENERATED FROM PYTHON SOURCE LINES 339-354

.. code-block:: Python


    metric_pivot = (
        metrics.pivot_table(
            index="variant",
            columns="metric",
            values="value",
            aggfunc="mean",
        )
        .reset_index()
        .sort_values("rmse")
    )

    print("\nMean scalar metrics by variant")
    print(metric_pivot)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Mean scalar metrics by variant
    metric          variant  coverage80  epsilon_cons  epsilon_gw  epsilon_prior      mae      mse       r2  \
    2            gw_heavier    0.855400      0.005520    0.000000       9.350000 0.011350 0.000315 0.884500   
    1       bounds_stronger    0.855400      0.005520    0.000000       8.120000 0.011700 0.000315 0.882100   
    0              baseline    0.855400      0.005520    0.000000       8.780000 0.011900 0.000315 0.879700   
    3              smoother    0.855400      0.005520    0.000000       7.950000 0.012250 0.000315 0.877200   

    metric     rmse  sharpness80  
    2      0.017100     0.045400  
    1      0.017420     0.045400  
    0      0.017760     0.045400  
    3      0.018020     0.045400  


.. GENERATED FROM PYTHON SOURCE LINES 355-371

How to interpret scalar ranking
-------------------------------

A careful ablation reading usually follows this order:

1. lower ``rmse`` or ``mae`` is good,
2. higher ``r2`` is good,
3. coverage and sharpness should still look reasonable together,
4. epsilon diagnostics should not quietly become much worse.

That last point matters a lot. A variant can improve fit while also
pushing the physics consistency in a less trustworthy direction.

In this demo, ``gw_heavier`` looks strongest on fit metrics, but we
should still compare its epsilon level against the others before we
celebrate it.

.. GENERATED FROM PYTHON SOURCE LINES 371-392

.. code-block:: Python


    epsilon_view = metric_pivot.loc[
        :, [
            col
            for col in [
                "variant",
                "rmse",
                "r2",
                "coverage80",
                "sharpness80",
                "epsilon_prior",
                "epsilon_cons",
                "epsilon_gw",
            ]
            if col in metric_pivot.columns
        ]
    ]

    print("\nFit metrics together with epsilon diagnostics")
    print(epsilon_view)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Fit metrics together with epsilon diagnostics
    metric          variant     rmse       r2  coverage80  sharpness80  epsilon_prior  epsilon_cons  epsilon_gw
    2            gw_heavier 0.017100 0.884500    0.855400     0.045400       9.350000      0.005520    0.000000
    1       bounds_stronger 0.017420 0.882100    0.855400     0.045400       8.120000      0.005520    0.000000
    0              baseline 0.017760 0.879700    0.855400     0.045400       8.780000      0.005520    0.000000
    3              smoother 0.018020 0.877200    0.855400     0.045400       7.950000      0.005520    0.000000


.. GENERATED FROM PYTHON SOURCE LINES 393-409

Inspect the lambda weights directly
-----------------------------------

One of the easiest ways to misread an ablation file is to focus only
on the outcome metrics and forget the actual weights that produced
them.

The lambda view is therefore essential. It tells us whether the
variants are changing:

- the groundwater residual weight,
- the smoothness weight,
- the bounds penalty,
- or a more complex combination.

Good ablation reading connects *weight changes* to *metric changes*.

.. GENERATED FROM PYTHON SOURCE LINES 409-427

.. code-block:: Python


    lambda_cols = ["variant"] + [
        col
        for col in [
            "lambda_cons",
            "lambda_gw",
            "lambda_prior",
            "lambda_smooth",
            "lambda_mv",
            "lambda_bounds",
            "lambda_q",
        ]
        if col in config.columns
    ]

    print("\nLambda-weight comparison")
    print(config.loc[:, lambda_cols])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Lambda-weight comparison
               variant  lambda_cons  lambda_gw  lambda_prior  lambda_smooth  lambda_mv  lambda_bounds  lambda_q
    0         baseline     0.000000   0.100000      0.000000       0.010000   0.000000       0.050000  0.000500
    1       gw_heavier     0.000000   0.180000      0.000000       0.010000   0.000000       0.050000  0.000500
    2         smoother     0.000000   0.100000      0.000000       0.050000   0.000000       0.050000  0.000500
    3  bounds_stronger     0.000000   0.100000      0.000000       0.010000   0.000000       0.120000  0.000500


.. GENERATED FROM PYTHON SOURCE LINES 428-442

Inspect the horizon-wise behavior
---------------------------------

Scalar averages can hide a very important phenomenon:
a variant may help early horizons but degrade later ones.

This is why the per-horizon table matters. We can read it as a
*degradation curve*.

In many forecasting tasks, a good variant should:

- keep short-horizon error low,
- avoid a sharp long-horizon blow-up,
- and preserve a sensible ranking across horizons.

.. GENERATED FROM PYTHON SOURCE LINES 442-456

.. code-block:: Python


    per_h_pivot = (
        per_h.pivot_table(
            index=["variant", "horizon"],
            columns="metric",
            values="value",
            aggfunc="mean",
        )
        .reset_index()
    )

    print("\nPer-horizon comparison table")
    print(per_h_pivot)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Per-horizon comparison table
    metric          variant horizon      mae       r2
    0              baseline      H1 0.005360 0.892400
    1              baseline      H2 0.012290 0.881200
    2              baseline      H3 0.018070 0.872000
    3       bounds_stronger      H1 0.005180 0.894800
    4       bounds_stronger      H2 0.012020 0.883200
    5       bounds_stronger      H3 0.017960 0.873100
    6            gw_heavier      H1 0.005100 0.896000
    7            gw_heavier      H2 0.011660 0.886500
    8            gw_heavier      H3 0.017290 0.875800
    9              smoother      H1 0.005620 0.891000
    10             smoother      H2 0.012560 0.879000
    11             smoother      H3 0.018570 0.868000


.. GENERATED FROM PYTHON SOURCE LINES 457-470

Plot the main comparison views
------------------------------

A compact ablation review is usually easier when we look at four
complementary views:

1. how many runs belong to each variant,
2. which variants rank best on RMSE,
3. how the lambda weights differ,
4. how MAE behaves across horizons.

Notice that each plot answers a different decision question. That is
much better than drawing many redundant score charts.

.. GENERATED FROM PYTHON SOURCE LINES 470-521

.. code-block:: Python


    fig, axes = plt.subplots(
        2,
        2,
        figsize=(13.0, 9.0),
        constrained_layout=True,
    )

    axes[1, 0].set_prop_cycle(color=ABLATION_PALETTE["lambdas"])
    axes[1, 1].set_prop_cycle(color=ABLATION_PALETTE["horizon_lines"])

    plot_ablation_run_counts(
        records,
        ax=axes[0, 0],
        title="Runs available per variant",
        color=ABLATION_PALETTE["counts"],
        edgecolor="#1F2937",
        linewidth=0.9,
        alpha=0.92,
    )
    plot_ablation_metric_by_variant(
        records,
        metric="rmse",
        ax=axes[0, 1],
        title="Mean RMSE by variant",
        color=ABLATION_PALETTE["rmse"],
        edgecolor="#1F2937",
        linewidth=0.9,
        alpha=0.92,
    )
    plot_ablation_lambda_weights(
        records,
        ax=axes[1, 0],
        title="Lambda weights behind each variant",
        edgecolor="white",
        linewidth=0.8,
        alpha=0.92,
        legend_kws={"ncol": 2, "frameon": False, "fontsize": 9},
    )
    plot_ablation_per_horizon_metric(
        records,
        metric="mae",
        ax=axes[1, 1],
        title="Per-horizon MAE by variant",
        linewidth=2.2,
        markersize=7,
        marker="o",
        alpha=0.95,
        legend_kws={"frameon": False, "fontsize": 9},
    )


.. image-sg:: /auto_examples/inspection/images/sphx_glr_plot_ablation_record_overview_001.png
   :alt: Runs available per variant, Mean RMSE by variant, Lambda weights behind each variant, Per-horizon MAE by variant
   :srcset: /auto_examples/inspection/images/sphx_glr_plot_ablation_record_overview_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'Per-horizon MAE by variant'}, xlabel='horizon', ylabel='mae'>


.. GENERATED FROM PYTHON SOURCE LINES 522-542

How to read these plots
-----------------------

Here is a practical interpretation order:

**Run count plot**
  Check whether comparison is balanced. If one variant has many more
  runs or seeds, its mean score may be more stable than the others.

**RMSE plot**
  This is the quick performance ranking. It is often the first chart
  a reader notices, but it should never be read in isolation.

**Lambda plot**
  This explains *what actually changed*. Without this chart, a good
  score might look mysterious or misleading.

**Per-horizon MAE plot**
  This shows where the gain happens. A variant that only improves
  H1 but worsens H3 may not be the best operational choice.

.. GENERATED FROM PYTHON SOURCE LINES 545-552

Plot the best-ranked variants explicitly
----------------------------------------

The top-variants plot is useful when the file becomes longer and you
want a quick shortlist. In a real workflow, this is often the chart
that helps decide which configurations deserve a rerun, deeper
diagnosis, or reporting.

.. GENERATED FROM PYTHON SOURCE LINES 552-569

.. code-block:: Python


    fig, ax = plt.subplots(
        figsize=(8.6, 4.6),
        constrained_layout=True,
    )
    plot_ablation_top_variants(
        records,
        metric="rmse",
        top_n=4,
        ax=ax,
        title="Best variants by mean RMSE",
        color=ABLATION_PALETTE["top"],
        edgecolor="#1F2937",
        linewidth=0.9,
        alpha=0.94,
    )


.. image-sg:: /auto_examples/inspection/images/sphx_glr_plot_ablation_record_overview_002.png
   :alt: Best variants by mean RMSE
   :srcset: /auto_examples/inspection/images/sphx_glr_plot_ablation_record_overview_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'Best variants by mean RMSE'}, xlabel='rmse', ylabel='variant'>


.. GENERATED FROM PYTHON SOURCE LINES 570-577

Plot the structural checks separately
-------------------------------------

The boolean summary is a small but useful guardrail. It tells us
whether the file carries the minimum information needed for a real
comparison: records, metrics, horizon metrics, units, and config
knobs.

.. GENERATED FROM PYTHON SOURCE LINES 577-592

.. code-block:: Python


    fig, ax = plt.subplots(
        figsize=(8.0, 4.2),
        constrained_layout=True,
    )
    plot_ablation_boolean_summary(
        records,
        ax=ax,
        title="Ablation artifact structural checks",
        color=ABLATION_PALETTE["checks"],
        edgecolor="#1F2937",
        linewidth=0.8,
        alpha=0.9,
    )


.. image-sg:: /auto_examples/inspection/images/sphx_glr_plot_ablation_record_overview_003.png
   :alt: Ablation artifact structural checks
   :srcset: /auto_examples/inspection/images/sphx_glr_plot_ablation_record_overview_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'Ablation artifact structural checks'}>


.. GENERATED FROM PYTHON SOURCE LINES 593-600

Save the full inspection bundle
-------------------------------

The all-in-one inspector is useful when you want to keep the
semantic summary, tidy frames, and saved figures together. This is a
convenient pattern for reports, gallery generation, or later CLI
helpers that may inspect a whole experiment folder.

.. GENERATED FROM PYTHON SOURCE LINES 600-613

.. code-block:: Python


    bundle_dir = out_dir / "inspection_bundle"
    bundle = inspect_ablation_record(
        records,
        output_dir=bundle_dir,
        stem="lesson_ablation",
        save_figures=True,
    )

    print("\nSaved inspection figures")
    for name, path in bundle["figure_paths"].items():
        print(f" - {name}: {path}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Saved inspection figures
     - lesson_ablation_run_counts.png: /tmp/gp_ablation_nxwhtrru/inspection_bundle/lesson_ablation_run_counts.png
     - lesson_ablation_rmse.png: /tmp/gp_ablation_nxwhtrru/inspection_bundle/lesson_ablation_rmse.png
     - lesson_ablation_r2.png: /tmp/gp_ablation_nxwhtrru/inspection_bundle/lesson_ablation_r2.png
     - lesson_ablation_lambda_weights.png: /tmp/gp_ablation_nxwhtrru/inspection_bundle/lesson_ablation_lambda_weights.png
     - lesson_ablation_per_h_mae.png: /tmp/gp_ablation_nxwhtrru/inspection_bundle/lesson_ablation_per_h_mae.png
     - lesson_ablation_checks.png: /tmp/gp_ablation_nxwhtrru/inspection_bundle/lesson_ablation_checks.png


.. GENERATED FROM PYTHON SOURCE LINES 614-627

A practical decision rule
-------------------------

A simple reading rule for ablation records can be:

- choose variants with strong scalar fit metrics,
- reject variants whose horizon-wise behavior degrades badly,
- reject variants whose epsilon diagnostics become suspicious,
- and always confirm the configuration knobs that changed.

In this demo, the most plausible candidate is the one that combines
a good RMSE ranking with stable horizon behavior and no obvious
structural warning.

.. GENERATED FROM PYTHON SOURCE LINES 627-644

.. code-block:: Python


    best = summary.get("best_by_rmse") or {}
    best_variant = best.get("variant")

    print("\nDecision note")
    if best_variant:
        print(
            "A good next candidate for deeper review is: "
            f"{best_variant!r}. But the final choice should still "
            "consider horizon-wise behavior and epsilon diagnostics, "
            "not only the scalar RMSE ranking."
        )
    else:
        print(
            "No clear best variant could be identified from the "
            "available ablation records."
        )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Decision note
    A good next candidate for deeper review is: 'gw_heavier'. But the final choice should still consider horizon-wise behavior and epsilon diagnostics, not only the scalar RMSE ranking.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.599 seconds)


.. _sphx_glr_download_auto_examples_inspection_plot_ablation_record_overview.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_ablation_record_overview.ipynb <plot_ablation_record_overview.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_ablation_record_overview.py <plot_ablation_record_overview.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_ablation_record_overview.zip <plot_ablation_record_overview.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_