.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/tables_and_summaries/build_spatial_clusters.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_tables_and_summaries_build_spatial_clusters.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_tables_and_summaries_build_spatial_clusters.py:


Build spatial cluster tables with ``spatial-clusters``
======================================================

This lesson teaches how to use GeoPrior's ``spatial-clusters`` build
command.

Unlike row sampling, this builder does not reduce the table size.
Its goal is to **add region labels** inferred from spatial geometry.
That is useful when you want to:

- split one city into compact spatial zones,
- create region identifiers for diagnostics or ablations,
- build cluster-aware summaries,
- or prepare later workflows that need a region column.

Why this matters
----------------
Large geospatial tables often contain clear spatial structure even
before any physics or forecasting model is trained. A clustering step
can expose that structure directly from the coordinates.

In GeoPrior, the ``spatial-clusters`` command reads one or many tabular
files, merges them into one DataFrame, clusters the rows from two
coordinate columns, and writes the enriched table back to disk.

What this lesson teaches
------------------------
We will:

1. build a realistic synthetic spatial table from the shared helper
   utilities,
2. save it as two separate input files,
3. run the real ``spatial-clusters`` CLI entrypoint,
4. inspect the generated region labels,
5. build one compact visual preview,
6. end with direct command-line examples.

.. GENERATED FROM PYTHON SOURCE LINES 41-46

Imports
-------
We use the real production CLI entrypoint and the shared synthetic
spatial-support helpers that are already reused elsewhere in the
documentation.

.. GENERATED FROM PYTHON SOURCE LINES 46-61

.. code-block:: Python


    from __future__ import annotations

    import tempfile
    from pathlib import Path

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    from geoprior.cli.build_spatial_clusters import (
        build_spatial_clusters_main,
    )
    from geoprior.scripts import utils as script_utils


.. GENERATED FROM PYTHON SOURCE LINES 62-72

Step 1 - Build a synthetic city with three spatial districts
------------------------------------------------------------
Instead of generating arbitrary random longitude/latitude pairs, we
reuse the shared spatial-support helpers. That keeps the lesson close to
the rest of the gallery and makes the synthetic geometry easier to
reason about.

We create three compact supports with different centers. Each support
acts like one urban district. Later, the clustering command should be
able to recover these zones from the coordinates alone.

.. GENERATED FROM PYTHON SOURCE LINES 72-171

.. code-block:: Python


    specs = [
        # Western district
        script_utils.SpatialSupportSpec(
            city="ClusterDemo",
            center_x=1_500.0,
            center_y=2_600.0,
            span_x=850.0,
            span_y=650.0,
            nx=19,
            ny=15,
            jitter_x=18.0,
            jitter_y=18.0,
            footprint="ellipse",
            keep_frac=0.92,
            seed=11,
        ),
        # Central district
        script_utils.SpatialSupportSpec(
            city="ClusterDemo",
            center_x=4_700.0,
            center_y=3_300.0,
            span_x=1_050.0,
            span_y=820.0,
            nx=22,
            ny=16,
            jitter_x=20.0,
            jitter_y=20.0,
            footprint="nansha_like",
            keep_frac=0.88,
            seed=21,
        ),
        # Eastern district
        script_utils.SpatialSupportSpec(
            city="ClusterDemo",
            center_x=8_000.0,
            center_y=2_000.0,
            span_x=920.0,
            span_y=700.0,
            nx=20,
            ny=15,
            jitter_x=22.0,
            jitter_y=22.0,
            footprint="zhongshan_like",
            keep_frac=0.90,
            seed=31,
        ),
    ]

    frames: list[pd.DataFrame] = []
    rng = np.random.default_rng(123)

    for idx, spec in enumerate(specs, start=1):
        support = script_utils.make_spatial_support(spec)
        frame = support.to_frame()

        # Add a hidden reference label so the lesson can compare the input
        # geometry with the recovered cluster
        # labels. This column is only a
        # teaching aid; the builder itself will not use it.
        frame["district_true"] = f"district_{idx}"

        # Add a few realistic-looking continuous fields. These are not used
        # directly by the clustering command, but they make the synthetic
        # table feel more like a true geospatial artifact.
        field = script_utils.make_spatial_field(
            support,
            amplitude=1.4 + 0.20 * idx,
            drift_x=0.50 * idx,
            drift_y=0.25 * idx,
            phase=0.35 * idx,
            local_weight=0.12,
        )
        spread = script_utils.make_spatial_scale(
            support,
            base=0.20 + 0.02 * idx,
            x_weight=0.06,
            hotspot_weight=0.05,
        )

        frame["lithology_score"] = field
        frame["hydro_variability"] = spread
        frame["subsidence_proxy"] = (
            4.0
            + 1.3 * field
            + 2.4 * spread
            + rng.normal(0.0, 0.18, size=len(frame))
        )
        frames.append(frame)

    city_df = pd.concat(frames, ignore_index=True)
    city_df["sample_idx"] = np.arange(len(city_df), dtype=int)

    # The support helpers expose ``coord_x`` and ``coord_y``. We keep these
    # names on purpose so the lesson can teach ``--spatial-cols``.
    print("Synthetic table shape:", city_df.shape)
    print("")
    print(city_df.head(10).to_string(index=False))


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Synthetic table shape: (384, 10)

    sample_idx   coord_x   coord_y x_norm y_norm        city district_true lithology_score hydro_variability subsidence_proxy
             0 1374.3433 2170.5028 0.4275 0.1841 ClusterDemo    district_1          0.4632            0.2465           5.0158
             1 1502.2758 2152.4765 0.5002 0.1708 ClusterDemo    district_1          0.4597            0.2527           5.1378
             2 1603.9449 2151.7415 0.5579 0.1702 ClusterDemo    district_1          0.5074            0.2587           5.5124
             3 1114.3191 2246.8747 0.2798 0.2406 ClusterDemo    district_1          0.5367            0.2368           5.3010
             4 1216.3383 2238.7680 0.3378 0.2346 ClusterDemo    district_1          0.5279            0.2404           5.4289
             5 1317.2824 2224.7146 0.3951 0.2242 ClusterDemo    district_1          0.5108            0.2444           5.3544
             6 1389.7828 2247.3991 0.4363 0.2410 ClusterDemo    district_1          0.5481            0.2482           5.1938
             7 1510.7747 2236.4552 0.5050 0.2329 ClusterDemo    district_1          0.6161            0.2565           5.5140
             8 1592.5551 2235.2596 0.5515 0.2320 ClusterDemo    district_1          0.7233            0.2637           5.5163
             9 1697.7536 2184.0257 0.6112 0.1941 ClusterDemo    district_1          0.6661            0.2671           5.4489


.. GENERATED FROM PYTHON SOURCE LINES 172-177

Step 2 - Save the synthetic table as two input files
----------------------------------------------------
The CLI accepts one or many tabular inputs. To make that visible in the
lesson, we split the synthetic city into two files and then let the real
command merge them back together.

.. GENERATED FROM PYTHON SOURCE LINES 177-201

.. code-block:: Python


    tmp_dir = Path(
        tempfile.mkdtemp(prefix="gp_sg_spatial_clusters_")
    )
    left_csv = tmp_dir / "cluster_demo_west.csv"
    right_csv = tmp_dir / "cluster_demo_east.csv"
    output_csv = tmp_dir / "cluster_demo_with_regions.csv"

    x_mid = float(city_df["coord_x"].median())
    city_df.loc[city_df["coord_x"] <= x_mid].to_csv(
        left_csv,
        index=False,
    )
    city_df.loc[city_df["coord_x"] > x_mid].to_csv(
        right_csv,
        index=False,
    )

    print("")
    print("Input files")
    print(" -", left_csv.name)
    print(" -", right_csv.name)


.. rst-class:: sphx-glr-script-out

.. code-block:: none


    Input files
     - cluster_demo_west.csv
     - cluster_demo_east.csv

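
The split-and-merge idea in this step can be checked independently of
GeoPrior. The sketch below is a minimal pandas-only illustration. It
assumes that merging the inputs amounts to a row-wise concatenation; the
toy table, the ``toy_west.csv`` and ``toy_east.csv`` file names, and the
use of ``pd.concat`` are illustrative assumptions, not the builder's own
loading code.

.. code-block:: Python

    import tempfile
    from pathlib import Path

    import pandas as pd

    # Toy stand-in for the synthetic city table (values are made up).
    toy = pd.DataFrame(
        {
            "sample_idx": range(6),
            "coord_x": [100.0, 250.0, 400.0, 550.0, 700.0, 850.0],
            "coord_y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        }
    )

    # Same rule as the lesson: split at the median of the x coordinate.
    x_mid = float(toy["coord_x"].median())
    west = toy.loc[toy["coord_x"] <= x_mid]
    east = toy.loc[toy["coord_x"] > x_mid]

    with tempfile.TemporaryDirectory() as tmp:
        west_csv = Path(tmp) / "toy_west.csv"
        east_csv = Path(tmp) / "toy_east.csv"
        west.to_csv(west_csv, index=False)
        east.to_csv(east_csv, index=False)

        # Assumption: a plain concatenation stands in for the merge step.
        merged = pd.concat(
            [pd.read_csv(west_csv), pd.read_csv(east_csv)],
            ignore_index=True,
        )

    # Restore the original row order to compare with the input table.
    restored = merged.sort_values("sample_idx").reset_index(drop=True)
    print(len(west), len(east), len(merged))

Because ``<=`` and ``>`` partition the rows around the median, every row
lands in exactly one file, so the merged table should have the same
number of rows as the input.
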
.. GENERATED FROM PYTHON SOURCE LINES 202-222

Step 3 - Run the real ``spatial-clusters`` command
--------------------------------------------------
We call the production CLI entrypoint exactly as a user would, but from
inside the lesson script.

Important options shown here
----------------------------
``--spatial-cols``
    We explicitly tell the command to cluster from ``coord_x`` and
    ``coord_y`` rather than the default longitude/latitude names.

``--cluster-col``
    Name of the new label column.

``--algorithm``
    Backend. The CLI exposes kmeans, dbscan, and agglo.

``--n-clusters``
    We set it to 3 here so the lesson stays deterministic. With KMeans,
    omitting this argument lets the helper try to auto-detect a suitable
    k.

``--output``
    Any supported tabular path can be used here.

.. GENERATED FROM PYTHON SOURCE LINES 222-243

.. code-block:: Python


    build_spatial_clusters_main(
        [
            str(left_csv),
            str(right_csv),
            "--spatial-cols",
            "coord_x",
            "coord_y",
            "--cluster-col",
            "region_id",
            "--algorithm",
            "kmeans",
            "--n-clusters",
            "3",
            "--output",
            str(output_csv),
            "--verbose",
            "1",
        ]
    )


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Scaling coordinates...
    Clustering with KMEANS...
    [OK] loaded 384 row(s), created 3 cluster label(s), and wrote 384 row(s) to /tmp/gp_sg_spatial_clusters_i0etrnl1/cluster_demo_with_regions.csv


.. GENERATED FROM PYTHON SOURCE LINES 244-248

Step 4 - Read the enriched table back in
----------------------------------------
The output is the original table plus one additional cluster label
column.

.. GENERATED FROM PYTHON SOURCE LINES 248-255

.. code-block:: Python


    clustered = pd.read_csv(output_csv)

    print("")
    print("Clustered table")
    print(clustered.head(10).to_string(index=False))


.. rst-class:: sphx-glr-script-out

.. code-block:: none


    Clustered table
    sample_idx   coord_x   coord_y x_norm y_norm        city district_true lithology_score hydro_variability subsidence_proxy region_id
             0 1374.3433 2170.5028 0.4275 0.1841 ClusterDemo    district_1          0.4632            0.2465           5.0158         2
             1 1502.2758 2152.4765 0.5002 0.1708 ClusterDemo    district_1          0.4597            0.2527           5.1378         2
             2 1603.9449 2151.7415 0.5579 0.1702 ClusterDemo    district_1          0.5074            0.2587           5.5124         2
             3 1114.3191 2246.8747 0.2798 0.2406 ClusterDemo    district_1          0.5367            0.2368           5.3010         2
             4 1216.3383 2238.7680 0.3378 0.2346 ClusterDemo    district_1          0.5279            0.2404           5.4289         2
             5 1317.2824 2224.7146 0.3951 0.2242 ClusterDemo    district_1          0.5108            0.2444           5.3544         2
             6 1389.7828 2247.3991 0.4363 0.2410 ClusterDemo    district_1          0.5481            0.2482           5.1938         2
             7 1510.7747 2236.4552 0.5050 0.2329 ClusterDemo    district_1          0.6161            0.2565           5.5140         2
             8 1592.5551 2235.2596 0.5515 0.2320 ClusterDemo    district_1          0.7233            0.2637           5.5163         2
             9 1697.7536 2184.0257 0.6112 0.1941 ClusterDemo    district_1          0.6661            0.2671           5.4489         2


.. GENERATED FROM PYTHON SOURCE LINES 256-260

Step 5 - Summarize the recovered regions
----------------------------------------
A quick summary helps the user understand what the new labels mean in
practice.

.. GENERATED FROM PYTHON SOURCE LINES 260-277

.. code-block:: Python


    summary = (
        clustered.groupby("region_id")
        .agg(
            n_points=("sample_idx", "size"),
            x_center=("coord_x", "mean"),
            y_center=("coord_y", "mean"),
            mean_proxy=("subsidence_proxy", "mean"),
        )
        .sort_values("n_points", ascending=False)
        .reset_index()
    )

    print("")
    print("Recovered cluster summary")
    print(summary.to_string(index=False))


.. rst-class:: sphx-glr-script-out

.. code-block:: none


    Recovered cluster summary
    region_id n_points  x_center  y_center mean_proxy
            0      146 4772.9506 3263.0429     6.4099
            1      122 8041.8442 1970.4160     7.0305
            2      116 1495.5305 2578.7135     5.8214

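
The summary above groups by the recovered label only. Because the
synthetic table also carries ``district_true``, a cross-tabulation is
one way to check how well the labels line up with the reference
districts without relying on the integer values themselves. The sketch
below is a minimal illustration on made-up labels; the ``toy`` table and
the ``purity`` measure are assumptions introduced here, not part of the
builder or its output.

.. code-block:: Python

    import pandas as pd

    # Hypothetical labels for nine points; not taken from the lesson output.
    toy = pd.DataFrame(
        {
            "district_true": (
                ["district_1"] * 3
                + ["district_2"] * 3
                + ["district_3"] * 3
            ),
            "region_id": [2, 2, 2, 0, 0, 0, 1, 1, 0],
        }
    )

    # Rows: reference districts. Columns: recovered region labels.
    table = pd.crosstab(toy["district_true"], toy["region_id"])

    # Share of each district captured by its dominant region label.
    purity = table.max(axis=1) / table.sum(axis=1)

    print(table)
    print(purity.round(2))

A district whose points all share one label has a purity of 1.0,
whichever integer that label happens to be. Applied to the lesson's
table, the same two calls would use ``clustered["district_true"]`` and
``clustered["region_id"]``.
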
.. GENERATED FROM PYTHON SOURCE LINES 278-287

Step 6 - Build one compact preview figure
-----------------------------------------
The command can display its own diagnostic plot with ``--view``. For the
gallery page, however, we create one compact and fully controlled figure
that compares:

- the hidden synthetic districts used to build the input,
- the region labels recovered by the command,
- cluster sizes and cluster centroids.

.. GENERATED FROM PYTHON SOURCE LINES 287-351

.. code-block:: Python


    fig, axes = plt.subplots(
        1,
        3,
        figsize=(13.5, 4.6),
        constrained_layout=True,
    )

    # Left: hidden reference districts used to build the synthetic city.
    for label in sorted(city_df["district_true"].unique()):
        sub = city_df.loc[city_df["district_true"] == label]
        axes[0].scatter(
            sub["coord_x"],
            sub["coord_y"],
            s=14,
            alpha=0.85,
            label=label,
        )

    axes[0].set_title("Synthetic districts")
    axes[0].set_xlabel("coord_x")
    axes[0].set_ylabel("coord_y")
    axes[0].legend(frameon=False, fontsize=8)
    axes[0].grid(True, linestyle=":", alpha=0.35)
    axes[0].set_aspect("equal", adjustable="box")

    # Middle: output labels created by the real builder.
    for region in sorted(clustered["region_id"].unique()):
        sub = clustered.loc[clustered["region_id"] == region]
        axes[1].scatter(
            sub["coord_x"],
            sub["coord_y"],
            s=14,
            alpha=0.85,
            label=f"region {region}",
        )

    axes[1].set_title("Recovered region labels")
    axes[1].set_xlabel("coord_x")
    axes[1].set_ylabel("coord_y")
    axes[1].legend(frameon=False, fontsize=8)
    axes[1].grid(True, linestyle=":", alpha=0.35)
    axes[1].set_aspect("equal", adjustable="box")

    # Right: size summary plus centroid annotations.
    axes[2].bar(summary["region_id"].astype(str), summary["n_points"])
    axes[2].set_title("Cluster sizes")
    axes[2].set_xlabel("region_id")
    axes[2].set_ylabel("n_points")
    axes[2].grid(True, axis="y", linestyle=":", alpha=0.35)

    for _, row in summary.iterrows():
        axes[2].text(
            x=str(int(row["region_id"])),
            y=float(row["n_points"]),
            s=(
                f"({row['x_center']:.0f},\n"
                f" {row['y_center']:.0f})"
            ),
            ha="center",
            va="bottom",
            fontsize=7,
        )

    plt.show()


.. image-sg:: /auto_examples/tables_and_summaries/images/sphx_glr_build_spatial_clusters_001.png
   :alt: Synthetic districts, Recovered region labels, Cluster sizes
   :srcset: /auto_examples/tables_and_summaries/images/sphx_glr_build_spatial_clusters_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 352-367

What to learn from this output
------------------------------
The exact integer labels are arbitrary. What matters is the spatial
partition they create.

In this lesson, a good result means:

- points that belong to the same compact district mostly share one
  region label,
- different districts receive different labels,
- the final output is still the full table, only enriched with a new
  clustering column.

Once a region column exists, it can be reused later for summaries,
filtering, diagnostics, or downstream experiments.

.. GENERATED FROM PYTHON SOURCE LINES 369-428

Command-line usage
------------------
The lesson above used the real CLI entrypoint from Python. At the
terminal, the same workflow looks like this:

Basic KMeans clustering with an explicit number of clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   geoprior-build spatial-clusters \
     cluster_demo_west.csv cluster_demo_east.csv \
     --spatial-cols coord_x coord_y \
     --cluster-col region_id \
     --algorithm kmeans \
     --n-clusters 3 \
     --output cluster_demo_with_regions.csv

The same command through the root dispatcher
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   geoprior build spatial-clusters \
     cluster_demo_west.csv cluster_demo_east.csv \
     --spatial-cols coord_x coord_y \
     --cluster-col region_id \
     --algorithm kmeans \
     --n-clusters 3 \
     --output cluster_demo_with_regions.csv

Optional diagnostic plot
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   geoprior-build spatial-clusters \
     cluster_demo_west.csv cluster_demo_east.csv \
     --spatial-cols coord_x coord_y \
     --algorithm kmeans \
     --cluster-col region_id \
     --view \
     --output cluster_demo_with_regions.csv

Other supported backends
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   geoprior-build spatial-clusters demo.csv \
     --spatial-cols coord_x coord_y \
     --algorithm dbscan \
     --output cluster_demo_dbscan.csv

   geoprior-build spatial-clusters demo.csv \
     --spatial-cols coord_x coord_y \
     --algorithm agglo \
     --n-clusters 3 \
     --output cluster_demo_agglo.csv


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.491 seconds)


.. _sphx_glr_download_auto_examples_tables_and_summaries_build_spatial_clusters.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: build_spatial_clusters.ipynb <build_spatial_clusters.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: build_spatial_clusters.py <build_spatial_clusters.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: build_spatial_clusters.zip <build_spatial_clusters.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_