Diagnostics
===========

Diagnostics are a first-class part of GeoPrior-v3.

In a physics-guided workflow, a run is not trustworthy merely
because it finished. A useful diagnostic workflow must help
you answer questions such as:

- Did the stage use the intended inputs and configuration?
- Did the tensors, feature lists, and manifests agree?
- Did training remain numerically stable?
- Did forecast quality improve after calibration?
- Did the physics-facing diagnostics remain plausible?
- Are transfer or inference results being interpreted under
  the correct target-domain contract?

This page explains how to read diagnostics across the staged
GeoPrior-v3 workflow and how to connect them back to the
artifacts each stage produces.

Why diagnostics matter
----------------------

GeoPrior-v3 is explicitly organized as a staged scientific
workflow in which diagnostics, artifact inspection, and
audits are part of the normal execution model rather than an
optional extra. The workflow guide already emphasizes
progressive validation and warns that a numerically
successful run is not automatically a scientifically
trustworthy one.

That is why diagnostics appear throughout the workflow:

- Stage-1 validates the preprocessing and sequence-export
  contract.
- Stage-2 validates training-time handoff, optimization, and
  forecast export.
- Stage-3 validates tuning outcomes and tuned forecast
  behavior.
- Stage-4 validates saved-model reuse, calibration policy,
  and inference outputs.
- Later workflow steps continue this pattern through export,
  reporting, and downstream interpretation.

A useful way to think about the diagnostics system is:
**contract diagnostics + optimization diagnostics +
forecast diagnostics + physics diagnostics**.

Diagnostic layers in GeoPrior-v3
--------------------------------

GeoPrior-v3 diagnostics naturally fall into four layers.

**1. Contract diagnostics**

These diagnose whether a stage received the right inputs and
whether the workflow contract remains internally consistent.

Typical examples include:

- manifest presence and correctness,
- feature-list agreement,
- tensor shape agreement,
- split and holdout summaries,
- city or run mismatch detection,
- scaling and coordinate metadata consistency.

**2. Optimization diagnostics**

These diagnose training or tuning behavior.

Typical examples include:

- training history curves,
- early stopping behavior,
- NaN termination,
- best-epoch selection,
- tuning-summary comparison,
- trial instability during hyperparameter search.

**3. Forecast diagnostics**

These diagnose the quality and usability of outputs.

Typical examples include:

- point metrics such as MAE or RMSE,
- interval metrics such as coverage and sharpness,
- calibrated vs uncalibrated forecast comparison,
- per-horizon metrics,
- evaluation and future CSV inspection.

**4. Physics diagnostics**

These diagnose whether the model remains physically
interpretable.

Typical examples include:

- epsilon-style residual summaries,
- physics payload exports,
- physical-parameter summaries,
- identifiability-oriented summaries,
- scaled vs raw residual interpretation.

A compact mental model
----------------------

A practical mental model is:

.. code-block:: text

   contract checks
        ↓
   optimization stability
        ↓
   forecast quality
        ↓
   physics consistency
        ↓
   transfer / inference interpretation

If any early layer is wrong, later layers become harder to
trust.

Stage-1 diagnostics
-------------------

Stage-1 diagnostics are mostly about the **data contract**.

Stage-1 writes the manifest, train/validation/test NPZ
bundles, and related CSV snapshots. The manifest records
artifact paths, tensor shapes, sequence dimensions, feature
lists, column roles, configuration snapshot, split details,
and software metadata. It is also the place where you can
inspect the canonical naming block for subsidence,
groundwater, thickness, surface-elevation, and coordinate
roles.

After Stage-1, the first things to inspect are:

- the run directory,
- saved raw / cleaned / scaled CSV snapshots,
- exported NPZ bundles,
- the manifest,
- recorded feature lists,
- recorded shapes,
- holdout summaries,
- and scaling or censoring metadata.

A Stage-1 run is not trustworthy if the manifest and the NPZ
artifacts disagree, if feature ordering drifted, or if the
city / model identity in the artifacts does not match the
intended run.

.. admonition:: Best practice

   Inspect the Stage-1 manifest before moving to Stage-2.

   It is the main handshake artifact for the rest of the
   workflow.

Stage-2 diagnostics
-------------------

Stage-2 is where diagnostics become much richer.

The training stage already performs a Stage-1 → Stage-2
handoff audit, checking items such as manifest identity, city
match, required NPZ existence, tensor last-dimension
agreement with recorded feature lists, horizon / time-step
consistency, scaling metadata availability, and coordinate
compatibility.

During and after training, typical Stage-2 artifacts include:

- training log CSV,
- training summary JSON,
- Stage-2 run manifest,
- model checkpoint artifacts,
- ``scaling_kwargs.json``,
- evaluation forecast CSVs,
- calibrated forecast CSVs,
- evaluation diagnostics JSON,
- physics payload NPZ,
- extracted physical parameter tables.

This is why Stage-2 should not be interpreted only as
``model.fit()``. It is also a diagnostics and export stage.

The first things to inspect after Stage-2 are:

- the training summary,
- the run manifest,
- the CSV history log,
- the best-model bundle,
- the saved scaling snapshot,
- evaluation diagnostics,
- calibrated forecast outputs,
- and the physics payload.

Stage-3 diagnostics
-------------------

Stage-3 diagnostics focus on **selection quality** and
**tuned-model trustworthiness**.

The tuning stage records:

- tuner logs,
- best hyperparameters JSON,
- tuned model manifest,
- ``tuning_summary.json``,
- tuned evaluation forecast CSVs,
- tuned future forecast CSVs,
- ``eval_diagnostics_tuned.json``,
- tuned physics payloads,
- and optional Stage-3 audit outputs.

These diagnostics matter because Stage-3 is not only search.
It is also a tuned evaluation and export stage.

A tuned winner should therefore be reviewed along at least
three axes:

- did the search remain stable?
- did the chosen hyperparameters actually improve the run?
- did the physics-facing diagnostics remain interpretable?

A lower validation loss alone is not enough in a
physics-guided setting.

Stage-4 diagnostics
-------------------

Stage-4 diagnostics are about **saved-model reuse** and
**target-contract-aware inference**.

The inference stage writes its outputs under a target-domain
inference directory and can produce:

- evaluation CSVs,
- future forecast CSVs,
- ``inference_summary.json``,
- and, when optional evaluation flags are enabled,
  ``transfer_eval.json``.

The summary JSON records items such as:

- dataset name,
- city,
- model,
- horizon,
- mode,
- whether the run was calibrated,
- output CSV paths,
- and, when targets and scaler information are available,
  physical-unit point and interval metrics.

The optional ``transfer_eval.json`` extends this with
scaled-space loss information, approximate physics
diagnostics, and debug details about the rebuild path.

This stage is especially important in same-domain reuse,
zero-shot transfer, source-calibrator reuse, and
target-validation recalibration settings, where you must
distinguish carefully between source-model artifacts and the
target-domain contract.

What to read first in a diagnostic review
-----------------------------------------

A good diagnostic review sequence is:

**After Stage-1**

1. manifest
2. feature lists
3. shapes
4. split summary

**After Stage-2**

1. training summary JSON
2. history log
3. evaluation diagnostics JSON
4. calibrated forecast CSVs
5. physics payload

**After Stage-3**

1. tuning summary
2. best hyperparameters JSON
3. tuned diagnostics JSON
4. tuned forecast CSVs
5. tuned physics payload

**After Stage-4**

1. inference summary JSON
2. transfer-eval JSON, if present
3. evaluation CSV
4. future CSV
5. calibration policy

This order makes it easier to separate contract problems from
optimization problems and optimization problems from physics
problems.

Interpreting forecast metrics
-----------------------------

GeoPrior-v3 diagnostics span both point forecasts and
interval forecasts.

Common point-facing metrics include:

- MAE,
- MSE,
- RMSE,
- and, in some summaries, :math:`R^2`.

Common interval-facing metrics include:

- coverage,
- sharpness,
- and per-horizon variants of the same ideas.

A good mental model is:

- **point metrics** tell you how close the central forecast
  is;
- **interval metrics** tell you whether the uncertainty band
  is both useful and honest.

When calibrated and uncalibrated forecast outputs both
exist, compare them explicitly rather than reading only one
set of CSVs.

Interpreting calibration
------------------------

Calibration is an explicit part of later GeoPrior stages.

Stage-2 fits interval calibration after training.
Stage-3 can calibrate tuned quantile outputs before export.
Stage-4 can reuse a source calibrator, load a calibrator
explicitly, or fit a new one on target validation data.

That means calibration should be treated as a visible part of
interpretation, not as a hidden post-processing detail.

Ask these questions:

- Was the output calibrated or not?
- Which calibrator policy was used?
- Was the calibrator source-side or target-side?
- Are you comparing calibrated and uncalibrated outputs
  fairly?

If those questions are ignored, interval comparisons become
harder to trust.

Interpreting physics diagnostics
--------------------------------

The physics side has its own diagnostic language.

The math page already distinguishes two important families of
epsilon diagnostics:

- **raw epsilons**, which still carry the original residual
  scale;
- **scaled epsilons**, which are unitless and correspond more
  directly to the optimization view.

Typical keys include:

- ``epsilon_cons_raw`` and ``epsilon_cons``
- ``epsilon_gw_raw`` and ``epsilon_gw``
- ``epsilon_prior``

A useful interpretation rule is:

- if raw epsilons are very large but scaled epsilons are
  moderate, scaling may be doing its job;
- if both are unstable or exploding, you likely need to
  inspect units, coordinate spans, or scaling assumptions.

The same math page also emphasizes that GeoPrior uses
SI-consistent physics internally, with head-like quantities
treated in meters, time converted to seconds for physics, and
positive SI fields such as :math:`K`, :math:`S_s`,
:math:`H`, and :math:`\\tau`. This is why diagnostics should
always be interpreted together with the data-and-units and
scaling pages.

A practical physics-diagnostics checklist
-----------------------------------------

When reviewing a physics-aware run, ask:

- Are epsilon curves finite?
- Are raw and scaled epsilons telling a consistent story?
- Does ``epsilon_prior`` suggest a severe timescale mismatch?
- Were physics payloads exported successfully?
- Do extracted physical-parameter tables look plausible?
- Are unit assumptions and coordinate conventions still the
  same as those recorded by Stage-1 and Stage-2?

If the answer to any of these is “not sure,” pause before
making a scientific interpretation.

Programmatic diagnostics helpers
--------------------------------

The public model-facing namespace also exposes several helper
utilities that are useful when diagnostics are inspected from
Python rather than only from files.

Examples include:

- ``debug_tensor_interval``
- ``debug_val_interval``
- ``debug_quantile_crossing_np``
- ``plot_history_in``
- ``fit_interval_calibrator_on_val``
- ``apply_calibrator_to_subs``
- ``extract_physical_parameters``
- ``load_physics_payload``
- ``plot_physics_values_in``
- ``identifiability_diagnostics_from_payload``
- ``summarise_effective_params``
- ``ident_audit_dict``

These tools are worth documenting here because they give the
user a path from stage-level artifact inspection to
programmatic post hoc diagnosis.

Common diagnostic patterns
--------------------------

**Pattern 1: Good training loss, poor forecast CSVs**

This often means the optimization loop was stable, but the
forecast interpretation step still needs inspection. Check:

- calibration policy,
- quantile layout handling,
- output formatting,
- and the specific split being evaluated.

**Pattern 2: Good point metrics, poor interval behavior**

This often means the central prediction is fine, but the
uncertainty bands are too narrow, too wide, or poorly
calibrated. Compare calibrated and uncalibrated outputs.

**Pattern 3: Good supervised metrics, unstable physics
diagnostics**

This often points to unit, scaling, coordinate, or physics
weight problems rather than a pure forecast-quality issue.

**Pattern 4: Good tuned result, poor transfer or inference
reuse**

This often means the saved model is fine, but the target
contract, calibration policy, or bundle reconstruction path
needs inspection.

**Pattern 5: Strange behavior very early in the workflow**

This often goes back to Stage-1. Recheck the manifest,
feature lists, split summaries, and exported tensors before
debugging later stages.

Common mistakes
---------------

**Reading only one summary file**

A training summary without the forecast diagnostics, or an
inference summary without the target contract, gives only a
partial picture.

**Ignoring manifests**

The manifest is not bookkeeping clutter. It is how the
workflow records meaning.

**Treating calibration as invisible**

Later-stage diagnostics must be interpreted in light of the
calibration policy.

**Reading physics metrics without checking units**

Residual diagnostics can look alarming or reassuring for the
wrong reason if unit assumptions are wrong.

**Comparing runs with different contracts**

Two runs are not automatically comparable if their Stage-1
feature order, target semantics, or scaling conventions
differ.

Best practices
--------------

.. admonition:: Best practice

   Review diagnostics in layers.

   Start with contract checks, then optimization, then
   forecast quality, then physics interpretation.

.. admonition:: Best practice

   Keep summary JSONs, manifests, and forecast CSVs together.

   A single artifact rarely explains the full story.

.. admonition:: Best practice

   Compare calibrated and uncalibrated outputs explicitly.

   This is especially important when interval behavior is part
   of the scientific claim.

.. admonition:: Best practice

   Treat physics diagnostics as scientific evidence, not only
   as debugging metadata.

   They are part of the model’s interpretability contract.

.. admonition:: Best practice

   When in doubt, go back one stage earlier.

   Many late-stage diagnostic surprises originate in earlier
   artifact or semantics drift.

A compact diagnostics map
-------------------------

The GeoPrior-v3 diagnostics story can be summarized like
this:

.. code-block:: text

   Stage-1
     manifest + shapes + feature lists + split summaries
        ↓
   Stage-2
     history logs + training summary + eval diagnostics
     + calibrated forecasts + physics payload
        ↓
   Stage-3
     tuning summary + best_hps + tuned diagnostics
     + tuned forecasts + tuned physics payload
        ↓
   Stage-4
     inference_summary + transfer_eval + eval/future CSVs
     + calibration-policy interpretation
        ↓
   scientific review
     point metrics + interval metrics + physics diagnostics
     + transfer / reuse interpretation

Read next
---------

The best next pages after this one are:

.. grid:: 1 1 2 2
   :gutter: 3

   .. grid-item-card:: Inference and export
      :link: inference_and_export
      :link-type: doc
      :class-card: sd-shadow-sm

      See how diagnostic artifacts relate to the reusable
      inference and export workflow.

   .. grid-item-card:: Data and units
      :link: ../scientific_foundations/data_and_units
      :link-type: doc
      :class-card: sd-shadow-sm

      Revisit the unit and coordinate assumptions that affect
      physics-facing diagnostics.

   .. grid-item-card:: Scaling
      :link: ../scientific_foundations/scaling
      :link-type: doc
      :class-card: sd-shadow-sm

      Review the scaling rules that shape both optimization
      and physics residual interpretation.

   .. grid-item-card:: Maths
      :link: ../scientific_foundations/maths
      :link-type: doc
      :class-card: sd-shadow-sm card--physics

      Connect the workflow diagnostics back to the formal math
      behind residuals, losses, and epsilon summaries.

.. seealso::

   - :doc:`workflow_overview`
   - :doc:`stage1`
   - :doc:`stage2`
   - :doc:`stage3`
   - :doc:`stage4`
   - :doc:`inference_and_export`
   - :doc:`../scientific_foundations/data_and_units`
   - :doc:`../scientific_foundations/scaling`
   - :doc:`../scientific_foundations/maths`