Stage-5 ======= Stage-5 is the **cross-city transfer evaluation and benchmarking stage** of the GeoPrior-v3 workflow. It is the stage where GeoPrior-v3 moves from single-run training, tuning, or inference into a structured **source-to-target transfer analysis**. Rather than focusing on one city and one saved model in isolation, Stage-5 evaluates how a source-domain model behaves when applied to another city under different transfer strategies, calibration policies, and rescaling assumptions. This makes Stage-5 one of the most scientifically important stages in the workflow. It is where GeoPrior-v3 asks not only: - *Can the model fit one city well?* but also: - *How well does the learned structure transfer?* - *How sensitive is performance to scaling assumptions?* - *What changes when calibration is reused or refit?* - *Does warm-start adaptation improve target performance?* In the current implementation, Stage-5 is explicitly exposed as a **cross-city transfer evaluation** workflow supporting **baseline**, **transfer**, and **warm-start** strategies. What Stage-5 does ----------------- Stage-5 orchestrates a grid of transfer experiments between two cities. At a high level, it: - loads Stage-1 bundles for two cities; - builds a set of transfer jobs across directions and strategies; - resolves the source model artifact to reuse; - aligns the target inputs to the source schema; - optionally reprojects target dynamic inputs into source scaling; - optionally applies calibration from the source or target side; - evaluates each job on validation or test splits; - writes per-job forecast CSVs and physics payloads; - aggregates all jobs into transfer summary files. This means Stage-5 is best understood as a **benchmarking matrix stage** rather than as a single model execution step. Why Stage-5 matters ------------------- Stage-5 matters because real geohazard models are often used beyond the exact city or domain on which they were trained. A model that performs well in one city may fail in another for reasons such as: - different scaling conventions, - different feature distributions, - different static covariate availability, - different future-driver ordering, - or different calibration behavior. Stage-5 is designed to make those differences explicit. Instead of hiding transfer behavior inside ad hoc notebooks, GeoPrior-v3 gives it a dedicated stage with: - explicit strategies, - explicit job construction, - explicit schema audits, - explicit calibration modes, - and explicit benchmark outputs. That makes the transfer study reproducible, comparable, and scientifically inspectable. Stage-5 entry point ------------------- Stage-5 is exposed through the modern GeoPrior wrapper while preserving the existing transfer implementation underneath. The wrapper supports a configuration-driven launch style with options such as: - ``--config`` - ``--config-root`` - ``--city-a`` - ``--city-b`` - ``--model`` - ``--results-dir`` - repeated ``--set KEY=VALUE`` overrides The wrapper then forwards the original transfer arguments to the legacy Stage-5 CLI unchanged. A good first move is: .. code-block:: bash geoprior-run --help Then inspect the Stage-5 transfer subcommand in your installed environment. .. note:: The wrapper is intentionally thin. It seeds defaults from configuration but preserves the original transfer CLI behavior. The installed help output is therefore the best source for the exact invocation syntax. The Stage-5 problem setup ------------------------- Stage-5 is built around two cities: - **city A** - **city B** It then constructs directional experiments such as: - ``A_to_A`` - ``B_to_B`` - ``A_to_B`` - ``B_to_A`` These directions are combined with one or more strategies, splits, calibration modes, and rescale modes to form the full job matrix. This is a very important design choice: Stage-5 does not evaluate only one transfer path. It builds a **systematic experiment grid**. Strategies evaluated by Stage-5 ------------------------------- Stage-5 supports three main strategies. **1. baseline** This is the in-domain reference case: - ``A_to_A`` - ``B_to_B`` It answers the question: *How well does each city perform when evaluated against its own trained or tuned artifact family?* Baseline runs are important because they define the natural comparison point for transfer performance. **2. xfer** This is the zero-shot cross-city transfer case: - ``A_to_B`` - ``B_to_A`` It answers the question: *How well does a source-domain model generalize directly to a target city without warm-start adaptation?* **3. warm** This is the warm-start transfer case: - ``A_to_B`` with target adaptation - ``B_to_A`` with target adaptation It answers the question: *Can limited target-domain adaptation improve transfer performance compared with pure zero-shot reuse?* Taken together, these three strategies let Stage-5 compare in-domain performance, zero-shot transfer, and limited target adaptation in one reproducible framework. Splits evaluated by Stage-5 --------------------------- Stage-5 supports evaluation on: - ``val`` - ``test`` These splits are read from the Stage-1 bundles for each city. This is useful because the same transfer strategy can be studied under validation-like and held-out evaluation conditions. Calibration modes ----------------- Stage-5 supports several calibration modes: - ``none`` - ``source`` - ``target`` Each one answers a different question. **none** No interval calibration is applied. This is the cleanest way to inspect raw transferred quantile behavior. **source** A calibrator near the source model is reused. This asks whether a calibration learned in the source domain remains useful after transfer. **target** A new calibrator is fitted on the target validation set. This asks whether transfer improves when quantile intervals are adapted to the target domain. .. admonition:: Best practice Be explicit about which calibration mode you are using when interpreting Stage-5 results. These modes answer different scientific questions and should not be merged casually in reporting. Rescale modes ------------- Stage-5 also supports two rescale modes: - ``as_is`` - ``strict`` These are especially important in cross-city work. **as_is** The target city remains in its own scaling convention. This reflects a more direct target-domain inference path. **strict** The target dynamic inputs are reprojected into the source scaling space before transfer evaluation. This reflects a stricter source-space transfer setup. In practice, this helps disentangle whether performance differences are coming from: - learned model structure, - domain shift, - or mismatched scaling conventions. .. note:: Baseline runs do not need strict reprojection. In the implementation, baseline is effectively kept in ``as_is`` mode. What Stage-5 consumes --------------------- Stage-5 consumes **Stage-1 bundles for both cities**. These bundles provide: - the manifest for each city, - train/validation/test NPZ arrays, - scaler metadata, - feature lists, - target semantics, - and configuration snapshots. The transfer stage uses these Stage-1 contracts to build both the source-side and target-side evaluation context. This is one of the strongest parts of the design: Stage-5 does not guess the schema of either city. It reads both from their exported preprocessing contracts. How Stage-5 resolves the source model ------------------------------------- For each transfer job, Stage-5 resolves a source-side model artifact. It supports source model selection modes such as: - ``auto`` - ``tuned`` - ``trained`` and load policies such as: - ``auto`` - ``full`` - ``weights`` This allows Stage-5 to reuse: - tuned winners from Stage-3, - standard trained models from Stage-2, - full saved model bundles, - or weights-plus-rebuild workflows. This is important because transfer benchmarking should not be tied to only one artifact format. Schema alignment and transfer safety ------------------------------------ One of the most important responsibilities of Stage-5 is **schema validation between source and target**. The stage explicitly checks: - static feature compatibility, - dynamic feature compatibility, - future feature compatibility, - dimension agreement, - and feature-order agreement. Static inputs are aligned by name so that shared source-side static semantics can still be used even when the target has missing or extra static features. Dynamic and future features are more strict. If their names or order differ, Stage-5 can: - fail explicitly, or - allow controlled reordering in a soft mode when the same names exist but the order differs. This is critical for trustworthy transfer evaluation. .. admonition:: Best practice Prefer harmonized Stage-1 feature schemas across cities. Stage-5 can help detect and sometimes soften transfer mismatches, but the most trustworthy setup is still to use the same feature lists and ordering across the cities being compared. Warm-start behavior ------------------- The ``warm`` strategy is a special case in Stage-5. Instead of evaluating the source model directly on the target city, it first adapts the source model using a subset of target data, then evaluates the warmed model on the selected split. Warm-start settings include items such as: - warm split, - warm sample count, - warm fraction, - warm epochs, - warm learning rate, - warm sampling seed. This creates a useful middle ground between: - zero-shot transfer, - and full retraining on the target domain. Warm-start is especially important when you want to study whether a small amount of target adaptation produces a large gain in performance. Internal structure of Stage-5 ----------------------------- The user does not need to know every helper function to use Stage-5, but it is helpful to understand the overall flow. **1. Load Stage-1 bundles for both cities** Stage-5 first loads the Stage-1 bundle for city A and city B. **2. Build directional experiment jobs** It constructs all jobs implied by the chosen strategies, splits, calibration modes, and rescale modes. **3. Prepare target inputs for one job** For a given job, it loads the target split arrays and normalizes them into inference-ready form. **4. Align static schema and validate transfer compatibility** The target static inputs are aligned to the source schema by name, and the dynamic/future schema is audited carefully. **5. Optionally reproject target dynamics** If strict rescaling is enabled, target dynamics are moved into source scaling space before evaluation. **6. Load or reuse the source model** The source model is resolved from the saved artifact family. **7. Optionally calibrate** Depending on calibration mode, Stage-5 may load a source calibrator, fit a target calibrator, or run without one. **8. Predict and format outputs** The transferred or warmed model is used to create evaluation and future forecast CSVs. **9. Compute metrics and diagnostics** Stage-5 computes summary metrics, per-horizon metrics, and optional physics-facing exports. **10. Aggregate the experiment matrix** All completed jobs are written into a shared JSON and CSV benchmark summary. Per-job outputs --------------- Each successful Stage-5 job can write job-level artifacts such as: - evaluation CSV, - future forecast CSV, - optional physics payload NPZ, - schema audit details, - warm-start metadata, - and per-job metrics returned to the final summary table. The per-job forecast base name follows the logical pattern: .. code-block:: text _to__ ___ This keeps the output naming explicit and easy to audit. Metrics and summaries --------------------- Stage-5 records several types of metrics. Typical summary metrics include: - coverage80, - sharpness80, - overall MAE, - overall MSE, - overall RMSE, - overall R². It also records per-horizon metrics such as: - per-horizon MAE, - per-horizon MSE, - per-horizon RMSE, - per-horizon R². This is important because transfer may behave differently at different forecast horizons, and Stage-5 is designed to make that visible. Unit handling ------------- Stage-5 includes explicit unit handling for subsidence outputs and metrics. It supports settings such as: - exported CSV subsidence unit, - source physical output unit, - metrics unit. This is especially useful in transfer settings, where the workflow may need to keep CSV presentation units and internal metric units conceptually distinct. The stage can also recompute metrics from the exported evaluation CSV when needed. This is helpful when unit, scaling, or calibration conventions have changed and you want the final benchmark summary to reflect the exported products rather than stale intermediate values. The main Stage-5 benchmark outputs ---------------------------------- After all jobs finish, Stage-5 writes two main benchmark artifacts: **1. xfer_results.json** This contains the full list of completed job dictionaries, including metrics, schema audit fields, calibration choices, rescale mode, model source details, and warm-start metadata. **2. xfer_results.csv** This flattens the benchmark results into a tabular form that is easier to inspect, compare, plot, or export into papers and reports. The CSV includes columns such as: - strategy, - rescale mode, - direction, - source city, - target city, - split, - calibration, - overall metrics, - coverage and sharpness, - warm-start details, - schema audit details, - per-horizon metrics. This makes Stage-5 the natural source for transfer summary figures and cross-city comparison tables. Where Stage-5 writes outputs ---------------------------- Stage-5 writes its benchmark outputs under a timestamped transfer results directory with the general structure: .. code-block:: text /xfer/__// This is a good design because it keeps one complete transfer experiment grouped together. Inside that directory, you will typically find: - job-level eval/future CSVs, - physics payload NPZ files where available, - ``xfer_results.json``, - ``xfer_results.csv``. What to inspect after Stage-5 completes --------------------------------------- After a Stage-5 run, inspect at least: - the experiment output directory, - ``xfer_results.json``, - ``xfer_results.csv``, - a few representative job-level eval CSVs, - a few representative future CSVs, - any physics payload files, - the schema audit fields in the summary, - and the warm-start metadata when warm runs were included. You should be able to answer questions such as: - Which strategies were actually evaluated? - Which calibration modes were included? - Which rescale modes were used? - Did the benchmark include both directions? - Were any schema reorders or mismatches recorded? - Did warm-start improve over zero-shot transfer? - Did strict rescaling help or hurt? - Were baseline runs clearly separated from transfer runs? Common Stage-5 mistakes ----------------------- **Treating baseline and transfer as interchangeable** Baseline is the in-domain reference. It should not be mixed casually with cross-city transfer outcomes. **Ignoring schema audit fields** Transfer results are much harder to trust if dynamic or future features were reordered or mismatched. **Comparing calibrated and uncalibrated runs without care** Calibration mode changes the interpretation of interval metrics and should be tracked explicitly. **Running too many combinations without a clear question** Stage-5 can generate a large job matrix. It is better to start from a focused scientific question than to evaluate every possible combination blindly. **Forgetting that warm-start is a different regime** Warm-start is not just “better transfer.” It is a distinct adaptation setting and should be reported as such. Best practices -------------- .. admonition:: Best practice Always report Stage-5 results with strategy, calibration mode, and rescale mode shown explicitly. Otherwise, transfer comparisons become hard to interpret. .. admonition:: Best practice Review the schema audit fields before trusting a transfer comparison. A model may appear to transfer poorly when the real issue is a schema mismatch rather than a modeling limitation. .. admonition:: Best practice Keep baseline, zero-shot transfer, and warm-start results separated clearly in figures and tables. These are different experimental questions. .. admonition:: Best practice Use ``xfer_results.csv`` as the main summary table for downstream analysis. It is the cleanest compact artifact for comparing all jobs in one place. A compact Stage-5 map --------------------- The Stage-5 workflow can be summarized like this: .. code-block:: text Stage-1 bundle (city A) + Stage-1 bundle (city B) ↓ build directions and job matrix ↓ choose strategy: baseline / xfer / warm ↓ choose split: val / test ↓ choose calibration: none / source / target ↓ choose rescale mode: as_is / strict ↓ align target schema to source contract ↓ load or rebuild source model artifact ↓ optionally warm-start on target subset ↓ predict + format eval/future CSVs ↓ compute overall + per-horizon metrics ↓ export physics payload where available ↓ aggregate into xfer_results.json + xfer_results.csv Read next --------- The next best pages after Stage-5 are: .. grid:: 1 1 2 2 :gutter: 3 .. grid-item-card:: Diagnostics :link: diagnostics :link-type: doc :class-card: sd-shadow-sm Learn how to interpret benchmark metrics, schema audit fields, and physics-facing outputs. .. grid-item-card:: Inference and export :link: inference_and_export :link-type: doc :class-card: sd-shadow-sm See how forecast CSVs, summaries, and related artifacts fit into the broader GeoPrior workflow. .. grid-item-card:: Configuration :link: configuration :link-type: doc :class-card: sd-shadow-sm card--configuration Review how transfer defaults such as city pair and results directory can be driven from config. .. grid-item-card:: CLI guide :link: cli :link-type: doc :class-card: sd-shadow-sm card--cli Move from the stage narrative to the full command interface. .. seealso:: - :doc:`workflow_overview` - :doc:`stage3` - :doc:`stage4` - :doc:`diagnostics` - :doc:`inference_and_export` - :doc:`configuration` - :doc:`cli` - :doc:`../scientific_foundations/scaling` - :doc:`../scientific_foundations/data_and_units`