.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/tables_and_summaries/build_batch_spatial_sampling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_tables_and_summaries_build_batch_spatial_sampling.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_tables_and_summaries_build_batch_spatial_sampling.py:


Build non-overlapping spatial sample batches
============================================

This example teaches you how to use GeoPrior's
``batch-spatial-sampling`` build utility.

The previous ``spatial-sampling`` lesson produced **one** compact
sampled table. This lesson goes one step further: it produces
**several non-overlapping sampled tables** from the same input.

Why this matters
----------------
Batch sampling is useful when one compact sample is not enough.
Sometimes you want several small-but-representative subsets so you can:

- run repeated demos or smoke tests,
- compare stability across sampled subsets,
- distribute work across several smaller jobs,
- or prepare several compact teaching datasets without reusing the
  same rows every time.

That is exactly what ``batch-spatial-sampling`` is for.

.. GENERATED FROM PYTHON SOURCE LINES 27-33

Imports
-------
We call the real production entrypoint from the project code.
For the synthetic spatial support, we reuse the shared helpers from
``geoprior.scripts.utils`` so the lesson remains consistent with the
rest of the documentation.

.. GENERATED FROM PYTHON SOURCE LINES 33-53

.. code-block:: Python

    from __future__ import annotations

    import tempfile
    from pathlib import Path

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    from geoprior.cli.build_batch_spatial_sampling import (
        build_batch_spatial_sampling_main,
    )
    from geoprior.scripts.utils import (
        SpatialSupportSpec,
        make_spatial_field,
        make_spatial_scale,
        make_spatial_support,
    )

.. GENERATED FROM PYTHON SOURCE LINES 54-77

Build two synthetic city supports
---------------------------------
We again start from ``SpatialSupportSpec`` so the user can see how a
reusable spatial support is defined.

Key parameters
--------------
city:
    Only the city label attached to the generated support.
center_x, center_y:
    Approximate center of the synthetic projected or geographic space.
span_x, span_y:
    Half-width and half-height of the city's extent.
nx, ny:
    Mesh density before masking.
jitter_x, jitter_y:
    Small perturbations so the support is not an exact grid.
footprint:
    Synthetic city shape. Here we use the city-like footprints.
keep_frac:
    Fraction of masked support points to keep.
seed:
    Keeps the support reproducible across doc builds.

.. GENERATED FROM PYTHON SOURCE LINES 77-116

.. code-block:: Python

    ns_support = make_spatial_support(
        SpatialSupportSpec(
            city="Nansha",
            center_x=113.52,
            center_y=22.74,
            span_x=0.17,
            span_y=0.11,
            nx=64,
            ny=50,
            jitter_x=0.0014,
            jitter_y=0.0011,
            footprint="nansha_like",
            keep_frac=0.84,
            seed=11,
        )
    )

    zh_support = make_spatial_support(
        SpatialSupportSpec(
            city="Zhongshan",
            center_x=113.38,
            center_y=22.53,
            span_x=0.19,
            span_y=0.13,
            nx=66,
            ny=52,
            jitter_x=0.0015,
            jitter_y=0.0012,
            footprint="zhongshan_like",
            keep_frac=0.82,
            seed=23,
        )
    )

    print("Synthetic support sizes")
    print(f" - Nansha : {ns_support.sample_idx.size:,} points")
    print(f" - Zhongshan: {zh_support.sample_idx.size:,} points")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Synthetic support sizes
     - Nansha : 1,336 points
     - Zhongshan: 1,330 points

.. GENERATED FROM PYTHON SOURCE LINES 117-130

Convert the supports into a richer multi-year table
---------------------------------------------------
``batch-spatial-sampling`` operates on a regular DataFrame, so we now
build one realistic synthetic input table.
Each row will carry:

- spatial coordinates,
- a time stamp,
- several categorical stratification fields,
- one continuous response-like field,
- and a unique ``row_uid`` so we can verify that batches do not share
  the same sampled row.

.. GENERATED FROM PYTHON SOURCE LINES 130-216

.. code-block:: Python

    rng = np.random.default_rng(84)
    years = [2020, 2021, 2022, 2023, 2024]
    frames: list[pd.DataFrame] = []

    for support, amp0, drift_x, drift_y in [
        (ns_support, 6.8, 0.10, 0.08),
        (zh_support, 7.4, 0.07, 0.10),
    ]:
        base_mean = make_spatial_field(
            support,
            amplitude=amp0,
            drift_x=drift_x,
            drift_y=drift_y,
            phase=0.25,
            hotspot_weight=0.92,
            secondary_weight=0.56,
            ridge_weight=0.18,
            wave_weight=0.14,
            local_weight=0.06,
        )

        for year in years:
            step = year - years[0]
            scale = make_spatial_scale(
                support,
                base=0.30,
                x_weight=0.08,
                hotspot_weight=0.06,
                step_weight=0.025,
                step=step,
            )
            mean = base_mean + 0.62 * step
            noise = rng.normal(0.0, scale * 0.55)

            frame = support.to_frame().rename(
                columns={
                    "coord_x": "longitude",
                    "coord_y": "latitude",
                }
            )
            frame["year"] = int(year)

            # Keep a compact but meaningful set of categorical variables so
            # the lesson can show why stratification is useful.
            frame["lithology_class"] = np.where(
                frame["y_norm"] > 0.62,
                "Clay",
                np.where(frame["x_norm"] > 0.57, "Fill", "Sand"),
            )
            frame["development_zone"] = np.where(
                frame["x_norm"] + frame["y_norm"] > 1.10,
                "Urban core",
                "Expansion belt",
            )
            frame["hydro_zone"] = np.where(
                frame["y_norm"] < 0.34,
                "Low plain",
                np.where(frame["y_norm"] < 0.67, "Middle belt", "High belt"),
            )

            frame["rainfall_mm"] = (
                1260
                + 72 * step
                + 48 * frame["y_norm"]
                + rng.normal(0.0, 12.0, len(frame))
            )
            frame["subsidence_mm"] = mean + noise

            # Unique row key used only for demonstration checks after batching.
            frame["row_uid"] = [
                f"{support.city}_{idx}_{year}"
                for idx in frame["sample_idx"].to_numpy()
            ]

            frames.append(frame)

    full_df = pd.concat(frames, ignore_index=True)

    print("")
    print("Synthetic input table")
    print(full_df.head(8).to_string(index=False))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Synthetic input table
    sample_idx  longitude  latitude  x_norm  y_norm    city  year  lithology_class  development_zone  hydro_zone  rainfall_mm  subsidence_mm        row_uid
             0   113.5074   22.6590  0.4624  0.1423  Nansha  2020             Sand    Expansion belt   Low plain    1260.3657        -0.4042  Nansha_0_2020
             1   113.4808   22.6617  0.3856  0.1542  Nansha  2020             Sand    Expansion belt   Low plain    1267.5303         0.1706  Nansha_1_2020
             2   113.4906   22.6629  0.4140  0.1595  Nansha  2020             Sand    Expansion belt   Low plain    1247.5171        -0.3128  Nansha_2_2020
             3   113.5044   22.6627  0.4537  0.1586  Nansha  2020             Sand    Expansion belt   Low plain    1271.5833         0.1742  Nansha_3_2020
             4   113.5120   22.6610  0.4760  0.1510  Nansha  2020             Sand    Expansion belt   Low plain    1246.3361        -0.3348  Nansha_4_2020
             5   113.5210   22.6606  0.5018  0.1496  Nansha  2020             Sand    Expansion belt   Low plain    1287.0786        -0.1820  Nansha_5_2020
             6   113.5295   22.6605  0.5266  0.1491  Nansha  2020             Sand    Expansion belt   Low plain    1264.6117        -0.0300  Nansha_6_2020
             7   113.5329   22.6611  0.5365  0.1517  Nansha  2020             Sand    Expansion belt   Low plain    1261.6202        -0.0895  Nansha_7_2020

.. GENERATED FROM PYTHON SOURCE LINES 217-221

Write one input file per city
-----------------------------
The build command uses the shared table-reader utilities, so we teach
the multi-file workflow directly.

.. GENERATED FROM PYTHON SOURCE LINES 221-243

.. code-block:: Python

    tmp_dir = Path(
        tempfile.mkdtemp(prefix="gp_sg_batch_spatial_sampling_")
    )

    ns_csv = tmp_dir / "nansha_spatial_panel.csv"
    zh_csv = tmp_dir / "zhongshan_spatial_panel.csv"

    full_df.loc[full_df["city"] == "Nansha"].to_csv(
        ns_csv,
        index=False,
    )
    full_df.loc[full_df["city"] == "Zhongshan"].to_csv(
        zh_csv,
        index=False,
    )

    print("")
    print("Input files")
    print(" -", ns_csv.name)
    print(" -", zh_csv.name)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Input files
     - nansha_spatial_panel.csv
     - zhongshan_spatial_panel.csv

.. GENERATED FROM PYTHON SOURCE LINES 244-256

Run the real batch-spatial-sampling builder
-------------------------------------------
We ask the command to:

- read both city tables,
- produce four non-overlapping sampled batches,
- preserve balance across city, year, and lithology class,
- write one stacked output table,
- and also write one file per batch into a split directory.

Here ``sample-size`` is the **total** sampling size across all batches,
not the size of each batch individually.

.. GENERATED FROM PYTHON SOURCE LINES 256-297

.. code-block:: Python

    stacked_csv = tmp_dir / "batch_spatial_sampling_gallery.csv"
    split_dir = tmp_dir / "batch_files"

    build_batch_spatial_sampling_main(
        [
            str(ns_csv),
            str(zh_csv),
            "--sample-size",
            "0.24",
            "--n-batches",
            "4",
            "--stratify-by",
            "city",
            "year",
            "lithology_class",
            "--spatial-cols",
            "longitude",
            "latitude",
            "--spatial-bins",
            "8",
            "7",
            "--method",
            "relative",
            "--min-relative-ratio",
            "0.03",
            "--random-state",
            "42",
            "--batch-col",
            "batch_id",
            "--output",
            str(stacked_csv),
            "--split-dir",
            str(split_dir),
            "--split-prefix",
            "spatial_batch_",
            "--verbose",
            "1",
        ]
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Generating stratification keys for 13,330 records...
    This may take some time. Please be patient...
    Stratification keys generated successfully for 13,330 records.
    Creating 4 stratified batches with a total of 3,199 samples...
    Batch Sampling Progress: 0%| | 0/4 [00:00

.. GENERATED FROM PYTHON SOURCE LINES 500-515

How to read this output
-----------------------
The point of batch sampling is not just to create many files. The
important thing is that the batches remain useful.
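
As a rough illustration of what "useful" can mean in practice, the
sketch below counts shared ``row_uid`` values between every pair of
batches in a small stand-in table. The table contents and the 1-based
``batch_id`` values are assumptions made for illustration. Only the
column names ``batch_id`` and ``row_uid`` come from this lesson, and the
check is not the builder's own implementation.

.. code-block:: Python

    import numpy as np
    import pandas as pd

    # Hypothetical stand-in for the stacked output table. Only the two
    # columns used in this lesson are needed: batch_id and row_uid.
    stacked = pd.DataFrame(
        {
            "batch_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
            "row_uid": [
                "Nansha_0_2020", "Nansha_5_2021", "Zhongshan_2_2020",
                "Nansha_1_2020", "Zhongshan_7_2022",
                "Nansha_3_2023", "Nansha_4_2024",
                "Zhongshan_1_2021", "Zhongshan_9_2024",
            ],
        }
    )

    # One set of row keys per batch.
    uid_sets = stacked.groupby("batch_id")["row_uid"].apply(set)
    batch_ids = list(uid_sets.index)

    # Pairwise overlap counts: cell (i, j) is the number of row keys
    # that batches i and j share. The diagonal is each batch's size.
    overlap = pd.DataFrame(
        [
            [len(uid_sets.loc[i] & uid_sets.loc[j]) for j in batch_ids]
            for i in batch_ids
        ],
        index=batch_ids,
        columns=batch_ids,
    )

    # Non-overlapping batches leave every off-diagonal cell at zero.
    off_diagonal = overlap.to_numpy()[~np.eye(len(batch_ids), dtype=bool)]
    print(overlap)
    print("Batches are disjoint:", bool((off_diagonal == 0).all()))

With the real stacked output, the same steps should apply after loading
the CSV with ``pd.read_csv``, provided the ``row_uid`` column is kept in
the sampled rows.
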
In this preview, the desirable pattern is:

- each batch still covers both cities,
- each batch still contains multiple years,
- the stacked table remains representative of the original footprint,
- and the off-diagonal overlap cells stay at zero.

That combination makes ``batch-spatial-sampling`` a practical builder
for repeated experiments, debugging, teaching, and lightweight batch
workflows.

.. GENERATED FROM PYTHON SOURCE LINES 517-561

Command-line usage
------------------
The same workflow can be run directly from the command line.

Family entry point::

    geoprior-build batch-spatial-sampling \
        nansha_spatial_panel.csv \
        zhongshan_spatial_panel.csv \
        --sample-size 0.24 \
        --n-batches 4 \
        --stratify-by city year lithology_class \
        --spatial-cols longitude latitude \
        --spatial-bins 8 7 \
        --method relative \
        --min-relative-ratio 0.03 \
        --random-state 42 \
        --batch-col batch_id \
        --output batch_spatial_sampling_gallery.csv \
        --split-dir batch_files \
        --split-prefix spatial_batch_

Root entry point::

    geoprior build batch-spatial-sampling \
        nansha_spatial_panel.csv \
        zhongshan_spatial_panel.csv \
        --sample-size 0.24 \
        --n-batches 4 \
        --stratify-by city year lithology_class \
        --spatial-cols longitude latitude \
        --spatial-bins 8 7 \
        --method relative \
        --min-relative-ratio 0.03 \
        --random-state 42 \
        --batch-col batch_id \
        --output batch_spatial_sampling_gallery.csv \
        --split-dir batch_files \
        --split-prefix spatial_batch_

The shared data-reader arguments still apply here, so the command can
also read one or many input tables in CSV / TSV / Parquet / Excel /
JSON / Feather / Pickle form, and Excel paths may use selectors such as
``input.xlsx::Sheet1``.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 2.291 seconds)


.. _sphx_glr_download_auto_examples_tables_and_summaries_build_batch_spatial_sampling.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: build_batch_spatial_sampling.ipynb <build_batch_spatial_sampling.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: build_batch_spatial_sampling.py <build_batch_spatial_sampling.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: build_batch_spatial_sampling.zip <build_batch_spatial_sampling.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_