.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/diagnostics/plot_spatial_block_holdout.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_diagnostics_plot_spatial_block_holdout.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_diagnostics_plot_spatial_block_holdout.py:

Spatial-block holdout as a Stage-1 diagnostic
=============================================

This lesson teaches how to use the GeoPrior holdout splitter for
train/validation/test group design, with special attention to the
``"spatial_block"`` strategy.

We focus on:

- :class:`geoprior.utils.holdout_utils.HoldoutSplit`
- :func:`geoprior.utils.holdout_utils.split_groups_holdout`

Why this page matters
---------------------
A random split can look statistically fine and still be a weak
generalization test in spatial forecasting. If nearby points are placed
into different subsets, train and test may become unrealistically
similar. That is why GeoPrior supports a spatial-block strategy.

What the real utility does
--------------------------
``split_groups_holdout`` splits unique groups into train, validation,
and test subsets using either:

- ``"random"``
- ``"spatial_block"``

In spatial-block mode, the helper:

- requires ``x_col`` and ``y_col``,
- requires ``block_size > 0``,
- bins groups into coarse ``(bx, by)`` blocks,
- shuffles blocks rather than individual points,
- then returns a :class:`HoldoutSplit`.

The resulting :class:`HoldoutSplit` object stores the three group tables
and checks that they are disjoint.

What this lesson teaches
------------------------
We will:

1. build a compact spatial group table,
2. compare random and spatial-block splits,
3. check disjointness,
4. visualize the two splitting strategies,
5. explain when spatial-block holdout is preferable.

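
The spatial-block steps listed under *What the real utility does* can be
sketched in plain pandas. This is a minimal illustration under stated
assumptions, not the GeoPrior implementation: the function name
``sketch_spatial_block_split``, the floor-based binning, and the rounding
of block counts are hypothetical choices made for this example only.

.. code-block:: Python

    import numpy as np
    import pandas as pd


    def sketch_spatial_block_split(
        groups, x_col, y_col, block_size,
        val_frac=0.2, test_frac=0.2, seed=42,
    ):
        """Illustrative block-wise split (hypothetical helper)."""
        if block_size <= 0:
            raise ValueError("block_size must be > 0")

        df = groups.copy()
        # Assumption: blocks are floor(coordinate / block_size) bins.
        df["bx"] = np.floor(df[x_col] / block_size).astype(int)
        df["by"] = np.floor(df[y_col] / block_size).astype(int)

        # Shuffle the unique blocks, not the individual groups.
        blocks = df[["bx", "by"]].drop_duplicates().reset_index(drop=True)
        order = np.random.default_rng(seed).permutation(len(blocks))
        n_val = int(round(val_frac * len(blocks)))
        n_test = int(round(test_frac * len(blocks)))

        labels = np.full(len(blocks), "train", dtype=object)
        labels[order[:n_val]] = "val"
        labels[order[n_val:n_val + n_test]] = "test"
        blocks["subset"] = labels

        # Every group inherits the subset of the block it falls into.
        df = df.merge(blocks, on=["bx", "by"], how="left")
        return {
            name: df[df["subset"] == name].drop(columns="subset")
            for name in ("train", "val", "test")
        }


    # Six groups that fall into three blocks of two groups each.
    demo = pd.DataFrame(
        {
            "group_id": range(6),
            "coord_x": [0.0, 100.0, 5000.0, 5100.0, 9000.0, 9100.0],
            "coord_y": [0.0] * 6,
        }
    )
    parts = sketch_spatial_block_split(
        demo, "coord_x", "coord_y", block_size=4000.0,
        val_frac=1 / 3, test_frac=1 / 3,
    )

Because whole blocks are assigned, every group in a block lands in the
same subset, and the subset sizes follow the block sizes rather than the
requested fractions exactly.
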

This page is synthetic so it remains fully executable during the
documentation build.

.. GENERATED FROM PYTHON SOURCE LINES 57-59

Imports
-------

.. GENERATED FROM PYTHON SOURCE LINES 59-68

.. code-block:: Python

    from __future__ import annotations

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from geoprior.utils.holdout_utils import split_groups_holdout

.. GENERATED FROM PYTHON SOURCE LINES 69-73

Step 1 - Build a compact group table
------------------------------------
``split_groups_holdout`` operates on a unique group table rather than
the full long temporal DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 73-93

.. code-block:: Python

    nx = 10
    ny = 7

    xv = np.linspace(0.0, 12_000.0, nx)
    yv = np.linspace(0.0, 8_000.0, ny)
    X, Y = np.meshgrid(xv, yv)

    groups = pd.DataFrame(
        {
            "group_id": np.arange(X.size),
            "coord_x": X.ravel().astype(float),
            "coord_y": Y.ravel().astype(float),
        }
    )

    print("Number of unique groups:", len(groups))
    print("")
    print(groups.head(10).to_string(index=False))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Number of unique groups: 70

    group_id      coord_x  coord_y
           0     0.000000      0.0
           1  1333.333333      0.0
           2  2666.666667      0.0
           3  4000.000000      0.0
           4  5333.333333      0.0
           5  6666.666667      0.0
           6  8000.000000      0.0
           7  9333.333333      0.0
           8 10666.666667      0.0
           9 12000.000000      0.0

.. GENERATED FROM PYTHON SOURCE LINES 94-97

Step 2 - Run the random split
-----------------------------
This is the simpler baseline.

.. GENERATED FROM PYTHON SOURCE LINES 97-106

.. code-block:: Python

    split_random = split_groups_holdout(
        groups,
        seed=42,
        val_frac=0.2,
        test_frac=0.2,
        strategy="random",
    )

.. GENERATED FROM PYTHON SOURCE LINES 107-114

Step 3 - Run the spatial-block split
------------------------------------
In spatial-block mode we must provide:

- ``x_col``
- ``y_col``
- ``block_size``

.. GENERATED FROM PYTHON SOURCE LINES 114-126

..
.. code-block:: Python

    split_block = split_groups_holdout(
        groups,
        seed=42,
        val_frac=0.2,
        test_frac=0.2,
        strategy="spatial_block",
        x_col="coord_x",
        y_col="coord_y",
        block_size=4000.0,
    )

.. GENERATED FROM PYTHON SOURCE LINES 127-130

Step 4 - Inspect sizes and disjointness
---------------------------------------
``HoldoutSplit`` includes an explicit disjointness check. The size
summary is worth reading as well: because spatial-block mode assigns
whole blocks, its subset sizes can deviate from the requested fractions.

.. GENERATED FROM PYTHON SOURCE LINES 130-172

.. code-block:: Python

    split_random.check_disjoint()
    split_block.check_disjoint()

    summary = pd.DataFrame(
        [
            {
                "strategy": "random",
                "subset": "train",
                "n_groups": len(split_random.train_groups),
            },
            {
                "strategy": "random",
                "subset": "val",
                "n_groups": len(split_random.val_groups),
            },
            {
                "strategy": "random",
                "subset": "test",
                "n_groups": len(split_random.test_groups),
            },
            {
                "strategy": "spatial_block",
                "subset": "train",
                "n_groups": len(split_block.train_groups),
            },
            {
                "strategy": "spatial_block",
                "subset": "val",
                "n_groups": len(split_block.val_groups),
            },
            {
                "strategy": "spatial_block",
                "subset": "test",
                "n_groups": len(split_block.test_groups),
            },
        ]
    )

    print("")
    print(summary.to_string(index=False))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

         strategy subset  n_groups
           random  train        42
           random    val        14
           random   test        14
    spatial_block  train        46
    spatial_block    val        12
    spatial_block   test        12

.. GENERATED FROM PYTHON SOURCE LINES 173-177

Step 5 - Plot the random split
------------------------------
This shows how groups are scattered across subsets when sampling is
independent.

.. GENERATED FROM PYTHON SOURCE LINES 177-202

..
.. code-block:: Python

    fig, ax = plt.subplots(figsize=(7.2, 5.2))

    for label, gdf, marker in [
        ("train", split_random.train_groups, "o"),
        ("val", split_random.val_groups, "s"),
        ("test", split_random.test_groups, "^"),
    ]:
        ax.scatter(
            gdf["coord_x"],
            gdf["coord_y"],
            marker=marker,
            s=80,
            label=label,
        )

    ax.set_xlabel("coord_x")
    ax.set_ylabel("coord_y")
    ax.set_title("Random holdout split")
    ax.grid(True, linestyle=":", alpha=0.5)
    ax.legend()

    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/diagnostics/images/sphx_glr_plot_spatial_block_holdout_001.png
   :alt: Random holdout split
   :srcset: /auto_examples/diagnostics/images/sphx_glr_plot_spatial_block_holdout_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 203-206

Step 6 - Plot the spatial-block split
-------------------------------------
This view makes the block logic visible.

.. GENERATED FROM PYTHON SOURCE LINES 206-231

.. code-block:: Python

    fig, ax = plt.subplots(figsize=(7.2, 5.2))

    for label, gdf, marker in [
        ("train", split_block.train_groups, "o"),
        ("val", split_block.val_groups, "s"),
        ("test", split_block.test_groups, "^"),
    ]:
        ax.scatter(
            gdf["coord_x"],
            gdf["coord_y"],
            marker=marker,
            s=80,
            label=label,
        )

    ax.set_xlabel("coord_x")
    ax.set_ylabel("coord_y")
    ax.set_title("Spatial-block holdout split")
    ax.grid(True, linestyle=":", alpha=0.5)
    ax.legend()

    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/diagnostics/images/sphx_glr_plot_spatial_block_holdout_002.png
   :alt: Spatial-block holdout split
   :srcset: /auto_examples/diagnostics/images/sphx_glr_plot_spatial_block_holdout_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 232-235

Step 7 - Visualize the coarse spatial blocks directly
-----------------------------------------------------
This helps explain what ``block_size`` is doing under the hood.

.. GENERATED FROM PYTHON SOURCE LINES 235-269

..
.. code-block:: Python

    block_size = 4000.0

    groups_with_blocks = groups.copy()
    groups_with_blocks["bx"] = np.floor(
        groups_with_blocks["coord_x"] / block_size
    ).astype(int)
    groups_with_blocks["by"] = np.floor(
        groups_with_blocks["coord_y"] / block_size
    ).astype(int)
    groups_with_blocks["block_label"] = (
        groups_with_blocks["bx"].astype(str)
        + ","
        + groups_with_blocks["by"].astype(str)
    )

    block_codes = pd.factorize(groups_with_blocks["block_label"])[0]

    fig, ax = plt.subplots(figsize=(7.4, 5.4))

    sc = ax.scatter(
        groups_with_blocks["coord_x"],
        groups_with_blocks["coord_y"],
        c=block_codes,
        s=85,
    )

    ax.set_xlabel("coord_x")
    ax.set_ylabel("coord_y")
    ax.set_title("Spatial blocks induced by block_size")
    ax.grid(True, linestyle=":", alpha=0.5)

    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/diagnostics/images/sphx_glr_plot_spatial_block_holdout_003.png
   :alt: Spatial blocks induced by block_size
   :srcset: /auto_examples/diagnostics/images/sphx_glr_plot_spatial_block_holdout_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 270-288

How to read the split plots
---------------------------

Random split
~~~~~~~~~~~~
Random splitting is simple and often statistically efficient, but it
does not respect spatial neighborhood structure.

Spatial-block split
~~~~~~~~~~~~~~~~~~~
Spatial-block splitting is harsher but often more realistic for
geospatial generalization. Nearby points are grouped into the same
holdout block, which reduces spatial leakage.

Block map
~~~~~~~~~
The block visualization shows the coarse bins used before the actual
train/val/test assignment. A larger ``block_size`` makes larger spatial
neighborhoods move together.

.. GENERATED FROM PYTHON SOURCE LINES 290-306

Step 8 - Practical guidance
---------------------------
Use random split when:

- spatial leakage is not a major concern,
- or when you need a quick baseline.

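
Whether spatial leakage is a major concern can be gauged roughly by
measuring how far each test group lies from its nearest training group.
The sketch below is one possible check, not part of GeoPrior: the helper
name ``nearest_train_distance`` and the toy coordinates are hypothetical.

.. code-block:: Python

    import numpy as np


    def nearest_train_distance(train_xy, test_xy):
        """Distance from each test point to its closest training point."""
        train_xy = np.asarray(train_xy, dtype=float)
        test_xy = np.asarray(test_xy, dtype=float)
        # Pairwise coordinate differences, shape (n_test, n_train, 2).
        diff = test_xy[:, None, :] - train_xy[None, :, :]
        # Euclidean distances, then the minimum over training points.
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


    # Toy coordinates: one test point next to training data, one far away.
    train_xy = [[0.0, 0.0], [1000.0, 0.0]]
    test_xy = [[0.0, 300.0], [5000.0, 0.0]]
    dist = nearest_train_distance(train_xy, test_xy)

With the splits from this page, the same helper could be fed the
``coord_x`` and ``coord_y`` columns of ``train_groups`` and
``test_groups``. Distances that are small compared with the spatial
correlation scale of the target suggest that a random split may be
optimistic.
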

Use spatial-block split when:

- the task is genuinely spatial,
- nearby samples are strongly correlated,
- and you want a more realistic out-of-area validation test.

This is why spatial-block holdout belongs in the diagnostics section:
it defines *how demanding* the downstream evaluation really is.

.. GENERATED FROM PYTHON SOURCE LINES 308-316

Final takeaway
--------------
``HoldoutSplit`` and ``split_groups_holdout`` are not just utilities.
They define the credibility of the Stage-1 split design.

In GeoPrior, the spatial-block strategy is the diagnostic tool to use
when spatial leakage would make a random split too optimistic.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.475 seconds)

.. _sphx_glr_download_auto_examples_diagnostics_plot_spatial_block_holdout.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_spatial_block_holdout.ipynb <plot_spatial_block_holdout.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_spatial_block_holdout.py <plot_spatial_block_holdout.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_spatial_block_holdout.zip <plot_spatial_block_holdout.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_