geoprior.utils.spatial_utils

spatial_utils: a collection of utilities for geospatial and positional data analysis, filtering, and transformations.

Functions

batch_spatial_sampling(data[, sample_size, ...])

Create stratified spatial sample batches from a DataFrame.

create_spatial_clusters(df[, spatial_cols, ...])

Cluster 2D spatial data in df using the chosen algorithm and optionally plot the results.

deg_to_m_from_lat(lat_deg)

Approx WGS84 meters per degree at reference latitude.

dual_merge(df1, df2[, feature_cols, ...])

Merge two DataFrames based on specified feature columns.

extract_coordinates(df[, as_frame, drop_xy, ...])

Extract coordinate columns or their midpoint from a DataFrame.

extract_spatial_roi(df, x_range, y_range[, ...])

Extracts a spatial Region of Interest (ROI) from a DataFrame.

extract_zones_from(z[, threshold, ...])

Extract zones by filtering values against a threshold rule.

filter_position(df, pos[, pos_cols, ...])

filter_position is a utility that filters a pandas.DataFrame based on user-specified spatial positions.

gen_buffered_negative_samples(df, target_col)

Generate buffer-based negative samples around existing points or gauge stations.

gen_negative_samples(df, target_col[, ...])

Generate synthetic negative samples for spatial binary classification tasks.

gen_negative_samples_plus(df, target_col[, ...])

Generates negative samples for modeling in spatial scenarios, offering multiple strategies.

get_xy_coordinates(df[, as_frame, drop_xy, ...])

Check whether the coordinate values x, y exist in the data.

make_forecast_ready_sample(data[, ...])

Build a compact, forecast-ready panel sample.

spatial_sampling(data[, sample_size, ...])

Draw a stratified sample of spatial data that represents the distribution of the whole area and includes different years.

geoprior.utils.spatial_utils.spatial_sampling(data, sample_size=0.01, stratify_by=None, spatial_bins=10, spatial_cols=None, method='abs', min_relative_ratio=0.01, random_state=42, savefile=None, verbose=1)

Draw a stratified sample of spatial data that represents the distribution of the whole area and includes different years.

This function performs stratified sampling on spatial data, ensuring that the sample reflects both spatial distribution and temporal aspects of the entire dataset. It combines spatial stratification based on coordinates and additional stratification columns specified by the user.

Parameters:
  • data (pandas.DataFrame) – The input DataFrame to sample from. Must contain spatial coordinate columns (e.g., ‘longitude’, ‘latitude’) and any columns specified in stratify_by.

  • sample_size (float or int, optional) – The proportion or absolute number of samples to select. If float, should be between 0.0 and 1.0 and represents the fraction of the dataset to include in the sample. If int, represents the absolute number of samples to select. Default is 0.01 (1% of the data).

  • stratify_by (list of str, optional) – List of column names to stratify by.

  • spatial_bins (int or tuple/list of int, optional) – Number of bins to divide the spatial coordinates into. If an integer, the same number of bins is used for all spatial dimensions. If a tuple or list, its length must match the number of spatial columns, specifying the number of bins for each spatial dimension. Default is 10.

  • spatial_cols (list or tuple of str, optional) – List of spatial coordinate column names. Can accept one or two columns. If None, the function checks for columns named ‘longitude’ and/or ‘latitude’ in data. If only one spatial column is provided or found, a warning is issued, suggesting that providing both spatial columns is recommended for more accurate sampling. If more than two columns are provided, an error is raised.

  • method (str, {'abs', 'relative'}, default 'abs') – Defines how the sample size is determined. 'abs' or 'absolute' uses a fixed sampling proportion based on sample_size. 'relative' scales sampling by dataset stratification so small groups still receive a proportional sample controlled by min_relative_ratio.

  • min_relative_ratio (float, default 0.01) – Controls the minimum allowable fraction of records that must be sampled when method='relative'. It must be between 0 and 1. For example, min_relative_ratio=0.05 requests at least 5 percent of the total dataset size from each stratification group when possible; if a group is smaller than that minimum, the entire subset is sampled instead.

  • random_state (int, optional) – Random seed for reproducibility. Default is 42.

  • verbose (int, default 1) – Controls progress-bar and status output during execution. Larger values produce more detailed messages.
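
The interaction between sample_size and min_relative_ratio described above can be sketched as follows. This is only one possible reading of the parameter description, not the library's implementation: the helper name relative_group_size, the proportional term, and the rounding are all assumptions made for illustration.

```python
import math

def relative_group_size(group_size, total_size, sample_size, min_relative_ratio):
    """Hypothetical per-group sample count under method='relative'."""
    # Assumption: the group first gets its proportional share of the fraction.
    proportional = math.ceil(group_size * sample_size)
    # Minimum requested per group, expressed against the total dataset size.
    floor = math.ceil(min_relative_ratio * total_size)
    # A group smaller than the minimum is taken in full.
    return min(group_size, max(proportional, floor))

total = 1000
large_group = relative_group_size(800, total, sample_size=0.01, min_relative_ratio=0.05)
small_group = relative_group_size(30, total, sample_size=0.01, min_relative_ratio=0.05)
print(large_group, small_group)
```

Under these assumptions the large group is raised to the 5 percent minimum, while the small group is sampled in full because it holds fewer rows than that minimum.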

Returns:

sampled_data – A sampled DataFrame representing the distribution of the whole area and including different years.

Return type:

pandas.DataFrame

Notes

The function performs stratified sampling based on spatial bins and other specified stratification columns. Spatial coordinates are binned using quantile-based discretization (pandas.qcut()), ensuring each bin has approximately the same number of observations.
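
To make the idea of equal-frequency binning concrete, here is a minimal standard-library sketch. It is a rank-based approximation written for illustration, not pandas.qcut itself (which computes quantile edges and treats ties differently); the helper name equal_frequency_bins and the coordinate values are hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    # Rank the positions by value, then cut the ranking into n_bins slices.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels

longitudes = [113.02, 113.95, 113.40, 113.11, 113.77, 113.58, 113.33, 113.86]
bins = equal_frequency_bins(longitudes, n_bins=4)
counts = [bins.count(b) for b in range(4)]
print(bins)    # [0, 3, 1, 0, 2, 2, 1, 3]
print(counts)  # [2, 2, 2, 2]
```

Because the cut points follow the data rather than fixed coordinate intervals, densely sampled areas get narrower bins and sparse areas get wider ones.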

Let \(N\) be the total number of samples in data, and \(n\) be the desired sample size. The function calculates the number of samples to draw from each stratification group based on the proportion of the group size to the total dataset size:

(1)#\[n_i = \left\lceil \frac{N_i}{N} \times n \right\rceil\]

where \(N_i\) is the size of group \(i\), and \(n_i\) is the number of samples to draw from group \(i\).
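
As a worked instance of equation (1), the sketch below allocates a sample across three groups. The group labels and sizes are made up for illustration, and the helper allocate is hypothetical; integer ceiling division is used instead of floating-point arithmetic to avoid rounding surprises.

```python
def allocate(group_sizes, n):
    """Per-group sample counts n_i = ceil(N_i / N * n)."""
    total = sum(group_sizes.values())
    # (a + b - 1) // b is ceiling division for positive integers.
    return {g: (size * n + total - 1) // total for g, size in group_sizes.items()}

group_sizes = {"2019": 620, "2020": 250, "2021": 130}
allocation = allocate(group_sizes, n=25)
print(allocation)                 # {'2019': 16, '2020': 7, '2021': 4}
print(sum(allocation.values()))   # 27
```

Since every group is rounded up, the per-group counts can add up to slightly more than \(n\), as they do here (27 instead of 25).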

The function ensures that all specified spatial and stratification columns exist in data, that the number of spatial bins matches the number of spatial columns, and that the sample size is valid. A warning is issued when only one spatial column is used because two spatial columns usually give more reliable spatial sampling.

Examples

>>> from geoprior.utils.spatial_utils import spatial_sampling
>>> import pandas as pd
>>> # Assume 'df' is a pandas DataFrame with columns
>>> # 'longitude', 'latitude', 'year', and other data.
>>> sampled_df = spatial_sampling(
...     data=df,
...     sample_size=0.05,
...     stratify_by=['year', 'geological_category'],
...     spatial_bins=(10, 15),
...     spatial_cols=['longitude', 'latitude'],
...     random_state=42
... )
>>> print(sampled_df.shape)

See also

pandas.qcut

Quantile-based discretization function used for binning.

sklearn.model_selection.StratifiedShuffleSplit

For stratified sampling.

batch_spatial_sampling

Resample spatial data with batching.

geoprior.utils.spatial_utils.extract_coordinates(df, as_frame=False, drop_xy=False, error='raise', verbose=0)

Extract coordinate columns or their midpoint from a DataFrame.

Parameters:
  • df (pandas.DataFrame) – DataFrame expected to contain longitude/latitude or easting/northing columns.

  • as_frame (bool, optional) – If True, return the coordinate columns as a DataFrame. Otherwise return their midpoint.

  • drop_xy (bool, optional) – If True, remove detected coordinate columns from the returned DataFrame.

  • error (bool or {'raise', 'warn', 'ignore'}, optional) – Error-handling policy for invalid inputs.

  • verbose (int, optional) – Verbosity level for detection messages.

Returns:

Tuple containing the extracted coordinates or midpoint, the DataFrame with optional coordinate removal, and the detected coordinate-column names.

Return type:

tuple

Notes

Longitude/latitude are preferred over easting/northing when both are present.

geoprior.utils.spatial_utils.batch_spatial_sampling(data, sample_size=0.1, n_batches=10, stratify_by=None, spatial_bins=10, spatial_cols=None, method='abs', min_relative_ratio=0.01, random_state=42, verbose=1)

Create stratified spatial sample batches from a DataFrame.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame used for sampling.

  • sample_size (float or int, optional) – Total sample size as a fraction or absolute count.

  • n_batches (int, optional) – Number of batches to generate.

  • stratify_by (str or list of str or None, optional) – Additional columns used for stratification.

  • spatial_bins (int or sequence of int, optional) – Number of spatial bins used when discretizing coordinates.

  • spatial_cols (list or tuple of str or None, optional) – Spatial coordinate columns.

  • method ({'abs', 'absolute', 'relative'}, optional) – Strategy used to translate sample_size into per-batch sample counts.

  • min_relative_ratio (float, optional) – Minimum relative group size used by method='relative'.

  • random_state (int, optional) – Random seed for reproducibility.

  • verbose (int, optional) – Verbosity level.

Returns:

Stratified batches sampled without overlap.

Return type:

list of pandas.DataFrame

Notes

Spatial coordinates are discretized with pandas.qcut and combined with stratify_by columns so batches preserve the overall data distribution as closely as possible.
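
One simple way to obtain batches that do not overlap is to shuffle the row indices once and deal the selected rows out to the batches. The sketch below shows only that idea and leaves out stratification entirely; it is an assumption about how non-overlap could be achieved, not a description of the library's code, and split_into_batches is a hypothetical helper.

```python
import random

def split_into_batches(indices, total_sample, n_batches, seed=42):
    """Split a random selection of indices into disjoint batches."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    # Keep the first total_sample indices, then deal them out round-robin.
    chosen = shuffled[:total_sample]
    return [chosen[i::n_batches] for i in range(n_batches)]

batches = split_into_batches(range(100), total_sample=20, n_batches=4)
print([len(b) for b in batches])  # [5, 5, 5, 5]
```

Every index appears in at most one batch because each batch is a disjoint slice of the same shuffled selection.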

geoprior.utils.spatial_utils.extract_zones_from(z, threshold='auto', condition='auto', use_negative_criteria=True, percentile=10, x=None, y=None, data=None, view=False, plot_type='scatter', figsize=(8, 6), savefile=None, axis_off=False, show_grid=True, **kwargs)

Extract zones by filtering values against a threshold rule.

Parameters:
  • z (array-like, pandas.Series, or str) – Input data to filter. If z is a string, it is interpreted as a column name in data.

  • threshold ({'auto'} or float or int or tuple, optional) – Filtering criterion. Use 'auto' for percentile-based thresholding, a scalar for a single cutoff, or a length-2 tuple for interval filtering.

  • condition ({'auto', 'above', 'below', 'between'}, optional) – Relation between the data and the threshold.

  • use_negative_criteria (bool, optional) – Controls the automatic condition when condition='auto'.

  • percentile (int or float, optional) – Percentile used when threshold='auto'.

  • x (array-like, pandas.Series, str, or None, optional) – Optional coordinates or column names used for plotting.

  • y (array-like, pandas.Series, str, or None, optional) – Optional coordinates or column names used for plotting.

  • data (pandas.DataFrame or None, optional) – Data source used when x, y, or z are column names.

  • view (bool, optional) – Whether to visualize the filtered result.

  • plot_type (str, optional) – Plot type used when view=True. Common values include 'scatter', 'line', and 'hist'.

  • figsize (tuple, optional) – Figure size for plotting.

  • savefile (str or None, optional) – Optional path used when saving the figure.

  • axis_off (bool, optional) – Whether to hide axes in the plot.

  • show_grid (bool, optional) – Whether to display the plot grid.

  • **kwargs (dict) – Additional plotting keyword arguments.

Returns:

Filtered values and any optional plotting outputs defined by the implementation.

Return type:

object

Notes

When x, y, or z are passed as strings, the function relies on extract_array_from to retrieve the corresponding arrays from data.
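
The explicit threshold rules ('above', 'below', 'between') can be illustrated with a small standalone sketch. It does not cover the 'auto' settings, and whether the real function treats the boundaries as inclusive is not stated in this reference, so the inclusive comparisons below are an assumption; filter_by_threshold is a hypothetical helper.

```python
def filter_by_threshold(values, threshold, condition):
    """Keep values that satisfy the given relation to the threshold."""
    if condition == "above":
        return [v for v in values if v >= threshold]
    if condition == "below":
        return [v for v in values if v <= threshold]
    if condition == "between":
        low, high = threshold  # a length-2 tuple defines the interval
        return [v for v in values if low <= v <= high]
    raise ValueError(f"unsupported condition: {condition!r}")

z = [-12.0, -3.5, 0.0, 4.2, 9.8]
print(filter_by_threshold(z, -3.5, "below"))         # [-12.0, -3.5]
print(filter_by_threshold(z, (-5.0, 5.0), "between"))  # [-3.5, 0.0, 4.2]
```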

geoprior.utils.spatial_utils.filter_position(df, pos, pos_cols=None, find_closest=True, threshold=0.01, error='raise', verbose=0)

filter_position is a utility that filters a pandas.DataFrame based on user-specified spatial positions. It can match positions exactly or compute distances to find the closest points within a threshold.

For a single dimension, the distance is computed as:

(2)#\[d = |x - p|\]

For multi-dimensional data with n coordinates, the Euclidean distance is computed as:

(3)#\[d = \sqrt{\sum_{i=1}^n (x_i - p_i)^2}\]
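
The distance test in equations (2) and (3) can be sketched with the standard library alone. The helper within_threshold is hypothetical, and treating the threshold as inclusive is an assumption; math.dist covers the one-dimensional case too, since the distance between (x,) and (p,) is |x - p|.

```python
import math

def within_threshold(points, pos, threshold):
    """Return the points whose Euclidean distance to pos is at most threshold."""
    return [p for p in points if math.dist(p, pos) <= threshold]

points = [
    (113.309998, 22.831362),
    (113.310001, 22.831364),
    (113.500000, 22.900000),
]
matches = within_threshold(points, pos=(113.31, 22.83), threshold=0.01)
print(matches)  # the first two points; the third is about 0.2 degrees away
```
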
Parameters:
  • df (pandas.DataFrame) – The DataFrame that will be filtered. This parameter is essential and must contain columns referenced by pos_cols if pos_cols is not None.

  • pos (float or tuple of floats) – The reference position(s) to match or approximate. When pos_cols is None, pos is interpreted as an index value. Otherwise, each value in pos aligns with a specific column in pos_cols.

  • pos_cols (str or tuple of str, optional) – Name(s) of the column(s) in df to match against pos. If pos_cols=None, then pos is treated as an index filter. If multiple columns are given (e.g., latitude and longitude), each coordinate in pos should correspond to one column in pos_cols.

  • find_closest (bool, optional) – If True, nearest-neighbor filtering is performed within the distance threshold. If False, exact matches are used.

  • threshold (float, optional) – The maximum distance within which points are considered a match if find_closest is True. The unit corresponds to the column data (e.g., degrees for geographic lat/lon).

  • error ({'raise', 'warn', 'ignore'}, optional) – Specifies how to handle dimension mismatches or missing values. If 'raise', a ValueError will be raised. If 'warn', a warning is printed and extra values are ignored. If 'ignore', mismatches are silently ignored.

  • verbose (int, optional) – Controls the level of output messages: 0 for no output, 1 for basic info, 2 for additional details, and 3 or higher for a comprehensive summary.

Returns:

A new DataFrame that contains only rows matching or approximating the specified position(s) within the given threshold if find_closest is True.

Return type:

pandas.DataFrame

Notes

When pos_cols is None, the function attempts to filter by DataFrame index using the first element of pos. This approach may fail for multi-level indexes unless error='warn' or error='ignore' is used to bypass the dimension mismatch.

Examples

>>> from geoprior.utils.spatial_utils import filter_position
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'lon': [113.309998, 113.310001],
...     'lat': [22.831362, 22.831364]
... })
>>> # Exact match
>>> result_exact = filter_position(df, pos=(113.309998,
...                                         22.831362),
...                                pos_cols=('lon', 'lat'),
...                                find_closest=False)
>>> # Nearest match with threshold
>>> result_close = filter_position(df,
...                                pos=(113.31,
...                                     22.83),
...                                pos_cols=('lon',
...                                          'lat'),
...                                find_closest=True,
...                                threshold=0.01)

See also

geoprior.utils.data_utils.truncate_data

Truncate multiple DataFrames based on spatial coordinates or index alignment with a base DataFrame.

geoprior.utils.spatial_utils.create_spatial_clusters(df, spatial_cols=None, cluster_col='region', n_clusters=None, algorithm='kmeans', view=True, figsize=(14, 10), s=60, plot_style='seaborn', cmap='tab20', show_grid=True, grid_props=None, auto_scale=True, savefile=None, verbose=1, **kwargs)

Cluster 2D spatial data in df using the chosen algorithm and optionally plot the results.

This function extracts two coordinate columns from df and forms clusters with one of 'kmeans', 'dbscan', or 'agglo' (agglomerative). Where relevant, it uses filter_valid_kwargs to strip out parameters that the chosen estimator does not accept, and it writes the cluster labels into cluster_col.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame holding spatial coordinates and optional other fields.

  • spatial_cols (list of str, optional) – Two-column list for x and y coordinates. Defaults to ['longitude','latitude'] if None.

  • cluster_col (str, default 'region') – Name of the column to store the assigned cluster labels.

  • n_clusters (int, optional) – Number of clusters to form. If not provided for KMeans, it is auto-detected. For DBSCAN or Agglomerative, a warning is issued if not set.

  • algorithm (str, default 'kmeans') – Choice of clustering algorithm among ['kmeans','dbscan','agglo'].

  • view (bool, default True) – If True, displays a scatterplot of the final clusters.

  • figsize (tuple, default (14, 10)) – Size of the displayed figure for the cluster plot.

  • s (int, default 60) – Marker size in the scatterplot.

  • plot_style (str, default 'seaborn') – Matplotlib style used for the plot.

  • cmap (str, default 'tab20') – Colormap name used to differentiate clusters.

  • show_grid (bool, default True) – Toggles grid lines on or off.

  • grid_props (dict, optional) – Additional keyword arguments controlling the grid style.

  • auto_scale (bool, default True) – If True, standardize coordinates before clustering.

  • savefile (str, optional) – File path to save the data with an additional <cluster_col> storing the assigned cluster labels if desired.

  • verbose (int, default 1) – Controls console logs. Higher values yield more details about scaling and cluster detection.

  • **kwargs – Additional keyword arguments passed to the chosen algorithm (filtered by filter_valid_kwargs for KMeans, DBSCAN, and AgglomerativeClustering).

Returns:

A copy of <df> with an additional <cluster_col> storing the assigned cluster labels.

Return type:

pandas.DataFrame

Notes

If <auto_scale> is True, it uses a standard scaler to normalize the coordinate columns. The scatterplot is generated using the library seaborn for enhanced styling.

By default, for <algorithm> = “kmeans”, the model attempts to minimize:

(4)#\[J = \sum_{i=1}^{N} \min_{\mu_j} \lVert x_i - \mu_j \rVert^2\]

where \(x_i\) are the scaled or raw 2D coordinates in df and \(\mu_j\) are the cluster centroids. If n_clusters is not provided, the function can auto-detect it using silhouette and elbow analysis.
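
For intuition, the objective in equation (4) can be evaluated by hand for a fixed set of centroids. The sketch below is a plain-Python illustration, not the clustering code used by the function; kmeans_objective is a hypothetical helper, and the centroids are simply the means of the two obvious point pairs.

```python
def kmeans_objective(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total

points = [(0.1, 1.0), (0.2, 1.1), (2.2, 2.1), (2.3, 2.2)]
centroids = [(0.15, 1.05), (2.25, 2.15)]
J = kmeans_objective(points, centroids)
print(round(J, 6))  # 0.02
```

Each point lies 0.05 away from its centroid on both axes, so it contributes 0.005 and the four points sum to 0.02. A single centroid placed between the two pairs would give a much larger value, which is what the elbow analysis looks at when comparing candidate values of n_clusters.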

Examples

>>> from geoprior.utils.spatial_utils import create_spatial_clusters
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "longitude": [0.1, 0.2, 2.2, 2.3],
...     "latitude": [1.0, 1.1, 2.1, 2.2]
... })
>>> # KMeans with auto scale and auto-detect k
>>> result = create_spatial_clusters(
...     df=df,
...     algorithm="kmeans",
...     view=True
... )
>>> # DBSCAN with custom arguments
>>> result_db = create_spatial_clusters(
...     df=df,
...     algorithm="dbscan",
...     eps=0.5,
...     min_samples=2
... )

See also

filter_valid_kwargs

Helps discard unsupported keyword arguments for chosen estimators.

geoprior.utils.spatial_utils.gen_negative_samples(df, target_col, spatial_cols=('longitude', 'latitude'), feature_cols=None, buffer_km=10, neg_feature_range=(0, 5), num_neg_per_pos=1, use_gpd='auto', view=False, savefile=None, verbose=1)

Generate synthetic negative samples for spatial binary classification tasks.

This function creates additional samples labeled as non-events within a specified spatial buffer around the positive (event) observations. The main idea is to generate negative examples that reflect realistic conditions but have not triggered an event, thereby assisting models in distinguishing occurrences from non-occurrences.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the positive samples (events). Must include both the target column and the specified spatial columns.

  • target_col (str) – Column name for the binary target (e.g. landslide). Rows where this column is 1 (or True) are considered positive samples.

  • spatial_cols (tuple of str, default ('longitude', 'latitude')) – Tuple specifying the longitude and latitude column names in df.

  • feature_cols (list of str or None, default None) – List of feature columns to use or to simulate for generated negatives. If None, the function automatically detects numeric and categorical columns excluding spatial_cols and target_col.

  • buffer_km (float, default 10) – Spatial buffer in kilometers used to define the radius around each positive sample within which negative samples are created.

  • neg_feature_range (tuple of int, default (0, 5)) – Value range (minimum, maximum) used for simulating numeric feature values in negative samples if the corresponding feature column does not exist in <df>.

  • num_neg_per_pos (int, default 1) – Number of negative samples to generate per positive sample. For instance, if num_neg_per_pos=2, each positive sample spawns two negatives.

  • use_gpd (str, default 'auto') – If set to ‘auto’, the function tries to import GeoPandas for visualization. If ‘none’, no GeoPandas usage will occur.

  • view (bool, default False) – Whether to visualize the generated samples on a map. Attempts to use geopandas if installed; falls back to matplotlib if ‘auto’ is chosen and GeoPandas is not available.

  • savefile (str or None, default None) – Path to which the resulting DataFrame is saved if provided. Handled by the decorator that wraps this function.

  • verbose (int, default 1) – Verbosity level. 0 for silent, 1 for progress indication, 2 for more messages, 3 for debugging output.

Returns:

Combined DataFrame with both original positive samples and newly generated negative samples. The <target_col> is 1 for positives and 0 for negatives.

Return type:

pandas.DataFrame

columns_manager

This internal function is used to handle the processing of columns for features and spatial parameters.

Notes

  • If a feature column exists in df, the negative samples will copy or randomly select categories for categorical columns, and sample integers within neg_feature_range for numeric columns.

  • If feature_cols is empty or does not exist in df, the function simulates all values for negative samples.

  • When view=True, circles depicting the buffer zone around each positive sample are drawn for visualization.

  • The conversion of 1 degree to roughly 111 km is an approximation. It holds well for latitude, while the length of one degree of longitude shrinks with the cosine of the latitude.

Mathematically, we define the spatial buffer in degrees as:

(5)#\[\Delta = \frac{\text{buffer\_km}}{111.0},\]

where \(111.0\) km approximates the distance of one degree of latitude or longitude. For each positive sample at location \((lat, lon)\), we generate \(n\) new points with offsets \(\delta_{lat}\) and \(\delta_{lon}\), each drawn from a uniform distribution \(U(-\Delta, \Delta)\):

(6)#\[\begin{aligned} &lat_{new} = lat + \delta_{lat},\\ &lon_{new} = lon + \delta_{lon}. \end{aligned}\]

Combined with randomly sampled or inferred feature values, these new samples serve as negative examples for modeling tasks such as landslide prediction.
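
The location part of this procedure can be sketched in a few lines of standard-library Python. This is an illustration of the two formulas above under stated assumptions, not the function's actual code: sample_negative_locations is a hypothetical helper, feature simulation is left out, and random.Random stands in for whatever generator the library uses.

```python
import random

def sample_negative_locations(lat, lon, buffer_km, n, seed=None):
    """Draw n locations offset from (lat, lon) by uniform noise in degrees."""
    rng = random.Random(seed)
    delta = buffer_km / 111.0  # buffer expressed in degrees
    return [
        (lat + rng.uniform(-delta, delta), lon + rng.uniform(-delta, delta))
        for _ in range(n)
    ]

delta = 10 / 111.0
negatives = sample_negative_locations(24.5, 113.5, buffer_km=10, n=3, seed=0)
for lat_new, lon_new in negatives:
    print(round(lat_new, 4), round(lon_new, 4))
```

Because the two offsets are drawn independently, the generated points fill a square of half-width \(\Delta\) in degree space around the positive sample.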

Examples

>>> from geoprior.utils.spatial_utils import gen_negative_samples
>>> import pandas as pd
>>> import numpy as np
>>> df_pos = pd.DataFrame({
...        'latitude': np.random.uniform(24.0, 25.0, 5),
...        'longitude': np.random.uniform(113.0, 114.0, 5),
...        'rainfall_day_1': np.random.randint(10, 30, 5),
...        'rainfall_day_2': np.random.randint(10, 30, 5),
...        'rainfall_day_3': np.random.randint(10, 30, 5),
...        'rainfall_day_4': np.random.randint(10, 30, 5),
...        'rainfall_day_5': np.random.randint(10, 30, 5),
...        'landslide': 1
...    })
>>> combined = gen_negative_samples(
...     df=df_pos,
...     target_col='landslide',
...     buffer_km=10,
...     num_neg_per_pos=2,
...     view=False,
...     verbose=2
... )
>>> print(combined.head())

See also

check_spatial_columns

Ensures the existence of required spatial columns.

exist_features

Verifies the presence of specified features in <df>.

columns_manager

Handles both feature and spatial columns for processing.

geoprior.utils.spatial_utils.gen_buffered_negative_samples(df, target_col, spatial_cols=('longitude', 'latitude'), feature_cols=None, buffer_km=10, neg_feature_range=(0, 5), num_neg_per_pos=1, strategy='landslide', gauge_data=None, use_gpd='auto', id_col='auto', view=False, savefile=None, seed=None, verbose=1)

Generate buffer-based negative samples around existing points or gauge stations.

This function creates additional negative samples for binary spatial events. It either takes an existing landslide dataset (when strategy is ‘landslide’) or a separate gauge dataset (if strategy is ‘gauge’) to serve as the base points for generating negatives within a circular buffer. The function validates input columns and parameters via _validate_negatives_sampling before constructing synthetic samples.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing positive event samples (e.g., landslides). Must include <target_col> and <spatial_cols>.

  • target_col (str) – Name of the binary target column (1 for event, 0 for no event).

  • spatial_cols (tuple of str, default ('longitude', 'latitude')) – Indicates which columns hold longitude and latitude in df.

  • feature_cols (list of str, optional) – Additional feature columns to simulate or copy for negatives. If None, all columns except <spatial_cols> and target_col are used.

  • buffer_km (float, default 10) – The radial distance in kilometers for sampling negative points around each base point.

  • neg_feature_range (tuple of float, default (0, 5)) – A numeric range from which feature values are drawn for negative samples if the column is numeric.

  • num_neg_per_pos (int, default 1) – Number of negatives to generate per positive (landslide) or gauge point.

  • strategy (str, default 'landslide') – Defines the base from which negative samples are generated. Use 'landslide' or 'event' to sample around rows in df. Use 'gauge' to sample around rows in gauge_data.

  • gauge_data (pandas.DataFrame, optional) – Required if strategy is 'gauge'. Must contain spatial_cols.

  • use_gpd (bool or 'auto', default 'auto') – If 'auto', attempts to use GeoPandas for visualization if installed. Otherwise, falls back to Matplotlib. This parameter is forwarded to the underlying visualization function.

  • id_col (str or list of str, default 'auto') – Column(s) representing IDs in df. If 'auto', the function tries to detect possible ID columns. Used by _validate_negatives_sampling.

  • view (bool, default False) – Whether to visualize the sampled negatives around the base points.

  • savefile (str, optional) – If provided, saves the final combined dataset (positives and negatives) to a CSV file at the specified path.

  • seed (int, optional) – Seed for NumPy’s random generator, ensuring reproducible offsets in negative sampling.

  • verbose (int, default 1) – Controls console messages: 1 for minimal, 2 for more detailed logs.

Returns:

The combined dataset containing both the original (positive) rows, labeled with target_col = 1, and the newly generated negative rows, labeled target_col = 0.

Return type:

pandas.DataFrame

_validate_negatives_sampling

Validates required columns and parameters, including num_neg_per_pos and neg_feature_range.

visualize_negative_sampling

Generates a plot showing the negative samples around the base points if view is True.

Notes

  • If strategy is 'gauge', gauge_data must be provided and contain columns longitude and latitude.

  • When view is True, circles are drawn to illustrate the buffer radius.

  • The ratio of 1 degree to roughly 111 km is an approximation. It holds well for latitude, while the length of one degree of longitude shrinks with the cosine of the latitude.

Formally, a buffer in degrees \(\Delta\) is computed by:

(7)#\[\Delta = \frac{\text{buffer\_km}}{111},\]

where \(111\) is an approximate km-per-degree conversion factor. Each base point \((lat, lon)\) spawns \(n\) negatives, each offset by \(\delta_{lat}\), \(\delta_{lon}\) drawn from \(U(-\Delta, \Delta)\).
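
To see what the fixed 111 km-per-degree factor implies on the ground, the sketch below converts the degree buffer back to kilometers along each axis. It assumes a spherical Earth, on which one degree of longitude spans about 111 km times the cosine of the latitude; buffer_extent_km is a hypothetical helper and not part of the library.

```python
import math

KM_PER_DEGREE = 111.0  # same approximation as in the formula above

def buffer_extent_km(buffer_km, lat_deg):
    """Approximate north-south and east-west reach of the degree buffer, in km."""
    delta_deg = buffer_km / KM_PER_DEGREE
    north_south = delta_deg * KM_PER_DEGREE
    # One degree of longitude shrinks with the cosine of the latitude.
    east_west = delta_deg * KM_PER_DEGREE * math.cos(math.radians(lat_deg))
    return north_south, east_west

for lat in (0.0, 24.5, 60.0):
    ns, ew = buffer_extent_km(10, lat)
    print(lat, round(ns, 2), round(ew, 2))
```

Under this approximation a 10 km buffer still reaches about 10 km north and south at any latitude, but only about 9.1 km east and west at 24.5 degrees and about 5 km at 60 degrees.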

Examples

Below is an illustration of how to generate negative samples around both existing event locations (strategy=`landslide`) and separate gauge stations (strategy=`gauge`) using gen_buffered_negative_samples.

First, we simulate a small DataFrame of positive landslide samples with rainfall attributes, as well as a separate DataFrame for gauge stations:

>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)
>>> positive_samples = pd.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'latitude': np.random.uniform(24.0, 25.0, 5),
...     'longitude': np.random.uniform(113.0, 114.0, 5),
...     'rainfall_day_1': np.random.randint(10, 30, 5),
...     'rainfall_day_2': np.random.randint(10, 30, 5),
...     'rainfall_day_3': np.random.randint(10, 30, 5),
...     'rainfall_day_4': np.random.randint(10, 30, 5),
...     'rainfall_day_5': np.random.randint(10, 30, 5),
...     'landslide': [1]*5
... })
>>> gauge_data = pd.DataFrame({
...     'gauge_id': ['G1', 'G2', 'G3'],
...     'latitude': np.random.uniform(24.0, 25.0, 3),
...     'longitude': np.random.uniform(113.0, 114.0, 3)
... })

We then call gen_buffered_negative_samples to produce negatives around these data using two different strategies:

>>> from geoprior.utils.spatial_utils import gen_buffered_negative_samples
>>> # Generate negatives around landslide points
>>> results_landslide = gen_buffered_negative_samples(
...     df=positive_samples,
...     target_col='landslide',
...     spatial_cols=('longitude', 'latitude'),
...     feature_cols=[f'rainfall_day_{i+1}' for i in range(5)],
...     buffer_km=10,
...     num_neg_per_pos=1,
...     strategy='landslide',
...     verbose=1
... )
>>> # Generate negatives around the gauge stations
>>> results_gauge = gen_buffered_negative_samples(
...     df=positive_samples,
...     target_col='landslide',
...     spatial_cols=('longitude', 'latitude'),
...     feature_cols=[f'rainfall_day_{i+1}' for i in range(5)],
...     buffer_km=10,
...     num_neg_per_pos=1,
...     strategy='gauge',
...     gauge_data=gauge_data,
...     verbose=1
... )

See also

gen_negative_samples

Generate synthetic negative samples for spatial binary classification tasks.

_validate_negatives_sampling

Ensures inputs and parameters are correct.

visualize_negative_sampling

Plots the positive and negative points for inspection.

geoprior.utils.spatial_utils.gen_negative_samples_plus(df, target_col, spatial_cols=('longitude', 'latitude'), feature_cols=None, buffer_km=10, neg_feature_range=(0, 5), num_neg_per_pos=1, strategy='landslide', gauge_data=None, elevation_data=None, similarity_features=None, time_col=None, cluster_method='kmeans', use_gpd='auto', id_col='auto', view=False, savefile=None, verbose=1, seed=None)

Generates negative samples for modeling in spatial scenarios, offering multiple strategies. The function calls gen_buffered_negative_samples when the strategy argument is 'landslide', 'event', or 'gauge'. It also calls generate_negative_samples partially in the 'hybrid' strategy. Internally, each sample is augmented to produce negative instances according to a chosen method.

(8)#\[\text{buffer\_deg} = \frac{\text{buffer\_km}}{111.0}\]

The above formula approximates degrees from kilometers near the equator. The output is a combined dataset containing original positives and generated negatives. The ratio of negatives per positive is controlled by num_neg_per_pos.

Parameters:
  • df (pandas.DataFrame) – Input data containing spatial coordinates and features.

  • target_col (str) – Name of the classification target column. Positive samples in df are already labeled; negatives are generated by the function.

  • spatial_cols (tuple of str, optional) – Columns representing longitude and latitude in df.

  • feature_cols (list of str, optional) – Additional feature columns used to drive or constrain negative sampling processes.

  • buffer_km (float, optional) – Radius in kilometers for local negative sampling. Used to compute buffer degrees.

  • neg_feature_range (tuple of float, optional) – Lower and upper range for continuous features in random negative generation.

  • num_neg_per_pos (int, optional) – Number of negatives to generate per positive instance.

  • strategy (str, optional) –

    Defines the sampling approach. Options include 'landslide', 'gauge', 'random_global', 'temporal_shift', 'clustered_negatives', 'environmental_similarity', 'elevation_based', and 'hybrid'.

    See more in User Guide.

  • gauge_data (pandas.DataFrame, optional) – Reference data for gauge-based or hybrid strategies.

  • elevation_data (pandas.DataFrame, optional) – Elevation records, used if strategy='elevation_based'.

  • similarity_features (list of str, optional) – Columns for nearest neighbor computation in 'environmental_similarity'.

  • time_col (str, optional) – Name of the time column for 'temporal_shift'. Required if using that strategy.

  • cluster_method (str, optional) – Clustering algorithm for 'clustered_negatives'. Default is 'kmeans'.

  • use_gpd (bool or str, optional) – Indicator for whether geopandas is used in buffer-based processes.

  • id_col (str, optional) – Column name used as an identifier. If ‘auto’, a default is used.

  • view (bool, optional) – Flag for visualizing or previewing the results.

  • savefile (str, optional) – Path to save the output dataset. If None, no file is saved.

  • verbose (int, optional) – Verbosity level. Higher values yield more logs.

  • seed (int, optional) – Random seed for reproducibility. If None, randomness is not fixed.

Returns:

A combined DataFrame containing the original positive samples, labeled 1, and the newly generated negative samples, labeled 0.

Return type:

pandas.DataFrame
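To make the shape of the returned frame concrete, here is a minimal sketch of how positives and uniformly drawn negatives could be combined. It is not the library's implementation: sketch_random_negatives is a hypothetical stand-in, and drawing negatives inside the bounding box of the positives is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd


def sketch_random_negatives(df, target_col,
                            spatial_cols=("longitude", "latitude"),
                            num_neg_per_pos=1, seed=0):
    """Hypothetical stand-in: uniform negatives in the positives' bounding box."""
    rng = np.random.default_rng(seed)
    x_col, y_col = spatial_cols
    n_neg = len(df) * num_neg_per_pos
    negatives = pd.DataFrame({
        x_col: rng.uniform(df[x_col].min(), df[x_col].max(), n_neg),
        y_col: rng.uniform(df[y_col].min(), df[y_col].max(), n_neg),
        target_col: 0,  # generated rows are labeled 0
    })
    positives = df.assign(**{target_col: 1})  # original rows are labeled 1
    return pd.concat([positives, negatives], ignore_index=True)


positives = pd.DataFrame({
    "longitude": [10.1, 10.2, 10.3],
    "latitude": [45.1, 45.2, 45.3],
    "target": [1, 1, 1],
})
combined = sketch_random_negatives(positives, "target", num_neg_per_pos=2)
print(combined["target"].value_counts())
```

With three positives and num_neg_per_pos=2, the combined frame holds nine rows, six of which are negatives.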

Notes

When strategy='hybrid', the negatives come from two distinct calls to generate_negative_samples, one for the 'landslide' sub-strategy and one for the 'gauge' sub-strategy; the two partial sets are then merged.

Examples

>>> from geoprior.utils.spatial_utils import gen_negative_samples_plus
>>> import numpy as np
>>> import pandas as pd
>>> df_example = pd.DataFrame({
...     "longitude": [10.1, 10.2, 10.3],
...     "latitude":  [45.1, 45.2, 45.3],
...     "feature":   [3.4, 2.1, 6.7],
...     "target":    [1, 1, 1]
... })
>>> # Reference gauges, only needed for the gauge-based or hybrid strategies
>>> gauge_data = pd.DataFrame({
...     'gauge_id': ['G1', 'G2', 'G3'],
...     'latitude': np.random.uniform(24.0, 25.0, 3),
...     'longitude': np.random.uniform(113.0, 114.0, 3)
... })
>>> # Generate random global negatives
>>> result = gen_negative_samples_plus(
...     df_example,
...     target_col="target",
...     strategy="random_global"
... )
>>> print(result.head())

See also

gen_buffered_negative_samples

Generates negative samples within a buffer region around reference events or gauges.

generate_negative_samples

A simpler negative sampling utility for certain strategies.

geoprior.utils.spatial_utils.extract_spatial_roi(df, x_range, y_range, x_col='longitude', y_col='latitude', snap_to_closest=True, savefile=None, **kwargs)[source]#

Extracts a spatial Region of Interest (ROI) from a DataFrame.

This function filters a DataFrame to include only the data points that fall within a specified rectangular bounding box defined by x and y coordinate ranges.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing the spatial data.

  • x_range (tuple of (float, float)) – A tuple containing the minimum and maximum desired values for the x-coordinate (e.g., longitude). The order does not matter.

  • y_range (tuple of (float, float)) – A tuple containing the minimum and maximum desired values for the y-coordinate (e.g., latitude). The order does not matter.

  • x_col (str, default 'longitude') – The name of the column in df that contains the x-coordinates.

  • y_col (str, default 'latitude') – The name of the column in df that contains the y-coordinates.

  • snap_to_closest (bool, default True) – If True, and a value in x_range or y_range does not exist in the data, the function will “snap” to the nearest available coordinate in the dataset. If False, it will use the exact boundaries provided.

  • savefile (str, optional) – The path to save the resulting DataFrame as a CSV file.

Returns:

A new DataFrame containing only the rows that fall within the specified spatial bounding box.

Return type:

pd.DataFrame

Raises:

ValueError – If x_col or y_col is not found in the DataFrame, or if the range tuples are not provided correctly.
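The bounding-box filter and the snapping behavior described above can be sketched in plain pandas. This is an illustrative re-implementation under stated assumptions, not the function's actual code: sketch_extract_roi is a hypothetical name, and snapping each bound to the nearest unique coordinate value is one plausible reading of snap_to_closest.

```python
import numpy as np
import pandas as pd


def sketch_extract_roi(df, x_range, y_range,
                       x_col="longitude", y_col="latitude",
                       snap_to_closest=True):
    """Hypothetical sketch of a bounding-box filter with optional snapping."""
    def bounds(values, rng):
        lo, hi = sorted(rng)  # the order of the tuple does not matter
        if snap_to_closest:
            unique = np.unique(values)
            # Assumption: move each bound to the nearest coordinate in the data
            lo = unique[np.abs(unique - lo).argmin()]
            hi = unique[np.abs(unique - hi).argmin()]
        return lo, hi

    x_lo, x_hi = bounds(df[x_col].to_numpy(), x_range)
    y_lo, y_hi = bounds(df[y_col].to_numpy(), y_range)
    # Series.between is inclusive on both ends by default
    mask = df[x_col].between(x_lo, x_hi) & df[y_col].between(y_lo, y_hi)
    return df.loc[mask].copy()


pts = pd.DataFrame({
    "longitude": [113.0, 113.5, 114.0, 114.5],
    "latitude": [24.0, 24.5, 25.0, 25.5],
})
roi = sketch_extract_roi(pts, x_range=(113.9, 113.4), y_range=(24.4, 25.1))
exact = sketch_extract_roi(pts, x_range=(113.9, 113.4), y_range=(24.4, 25.1),
                           snap_to_closest=False)
print(len(roi), len(exact))
```

In this toy grid, snapping widens the longitude bounds to 113.5 and 114.0 and so keeps two points, whereas the exact bounds keep only one.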

geoprior.utils.spatial_utils.make_forecast_ready_sample(data, sample_size=0.05, time_col='year', spatial_cols=None, group_cols=None, stratify_by=None, spatial_bins=10, time_steps=3, forecast_horizon=1, require_consecutive=True, keep_years=None, year_mode='latest', min_groups=5, max_groups=None, columns_to_keep=None, method='abs', min_relative_ratio=0.01, random_state=42, export_path=None, export_format=None, savefile=None, sort_output=True, verbose=1)[source]#

Build a compact, forecast-ready panel sample.

The function samples spatial groups rather than individual rows, then reconstructs the full panel for the selected groups. This is much safer for demo/testing than row-wise sampling because each sampled location keeps enough temporal history for sequence construction.
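The difference between group-level and row-wise sampling can be sketched as follows. This is a minimal illustration, not the function's internal logic: sketch_group_sample is a hypothetical helper, and it uses simple random sampling of groups where the real function applies stratified spatial sampling.

```python
import pandas as pd


def sketch_group_sample(panel, group_cols, frac, random_state=42):
    """Hypothetical sketch: sample whole groups, then keep all their rows."""
    group_cols = list(group_cols)
    groups = panel[group_cols].drop_duplicates()
    n_groups = max(1, round(len(groups) * frac))
    chosen = groups.sample(n=n_groups, random_state=random_state)
    # The inner merge restores every time step of each chosen group
    return panel.merge(chosen, on=group_cols, how="inner")


panel = pd.DataFrame({
    "longitude": [113.0] * 4 + [113.5] * 4 + [114.0] * 4 + [114.5] * 4,
    "latitude": [24.0] * 4 + [24.5] * 4 + [25.0] * 4 + [25.5] * 4,
    "year": [2019, 2020, 2021, 2022] * 4,
})
sample = sketch_group_sample(panel, ("longitude", "latitude"), frac=0.5)
counts = sample.groupby(["longitude", "latitude"])["year"].count()
print(counts)
```

Each sampled location keeps all four of its years, whereas sampling half of the rows directly would typically leave gaps in the per-location history.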

Parameters:
  • data (pandas.DataFrame) – Input panel DataFrame.

  • sample_size (float or int, default 0.05) – Group-level sample size passed to the internal spatial sampler. A float is interpreted as the fraction of eligible groups; an int is interpreted as the absolute number of eligible groups.

  • time_col (str, default 'year') – Time column.

  • spatial_cols (tuple/list of str, optional) – Spatial coordinate columns. If None, the function searches for ‘longitude’ and ‘latitude’.

  • group_cols (tuple/list of str, optional) – Group identifier columns. If None, uses spatial_cols.

  • stratify_by (list/tuple of str, optional) – Extra group-level columns used for stratification. Typical examples: ['lithology_class'] or ['city', 'lithology_class'].

  • spatial_bins (int or tuple/list, default 10) – Spatial bins passed to spatial_sampling().

  • time_steps (int, default 3) – Lookback window length.

  • forecast_horizon (int, default 1) – Forecast horizon length.

  • require_consecutive (bool, default True) – If True, each kept group must contain at least one consecutive run of length time_steps + forecast_horizon.

  • keep_years (int, optional) – If provided, keep only this many years per group after sampling. Must be >= time_steps + forecast_horizon.

  • year_mode ({'all', 'latest', 'earliest', 'random'}, default 'latest') – How to trim years when keep_years is given.

  • min_groups (int, default 5) – Minimum number of eligible groups required.

  • max_groups (int, optional) – Hard cap on sampled groups after spatial sampling.

  • columns_to_keep (list/tuple of str, optional) – Restrict the returned columns. Group and time columns are always preserved.

  • method ({'abs', 'absolute', 'relative'}, default 'abs') – Same meaning as in spatial_sampling().

  • min_relative_ratio (float, default 0.01) – Minimum relative sampling ratio when method='relative'.

  • random_state (int, optional) – Random seed.

  • export_path (str, optional) – Explicit path for non-CSV export.

  • export_format (str, optional) – Export format for export_path. If None, inferred from suffix.

  • savefile (str, optional) – CSV save path handled by @SaveFile.

  • sort_output (bool, default True) – Sort final output by group and time.

  • verbose (int, default 1) – Verbosity level.
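The eligibility rule behind require_consecutive (a run of at least time_steps + forecast_horizon consecutive time values per group) can be sketched as below. The helper has_consecutive_run and the "site" column are hypothetical and used only for illustration; the library's own check may differ.

```python
import numpy as np
import pandas as pd


def has_consecutive_run(years, min_len):
    """Hypothetical helper: True if `years` holds min_len consecutive values."""
    ys = np.unique(np.asarray(years))  # sorted, duplicates removed
    if len(ys) == 0:
        return min_len <= 0
    best = run = 1
    for prev, cur in zip(ys[:-1], ys[1:]):
        run = run + 1 if cur - prev == 1 else 1
        best = max(best, run)
    return best >= min_len


time_steps, forecast_horizon = 3, 1
required = time_steps + forecast_horizon  # 4 consecutive years needed

panel = pd.DataFrame({
    "site": ["a"] * 4 + ["b"] * 4,
    "year": [2019, 2020, 2021, 2022, 2015, 2017, 2018, 2019],
})
eligible = panel.groupby("site")["year"].apply(
    lambda s: has_consecutive_run(s, required)
)
print(eligible)
```

Site "a" has four consecutive years and qualifies; site "b" has a gap after 2015, so its longest run is three years and it would be dropped.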

Returns:

A compact panel sample that preserves group-wise temporal structure for forecast demos/tests.

Return type:

pandas.DataFrame

Examples

>>> from geoprior.utils.spatial_utils import make_forecast_ready_sample
>>> # `df` is assumed to be an existing panel DataFrame with the columns used below
>>> # Small demo sample, latest 4 years per location
>>> demo = make_forecast_ready_sample(
...     data=df,
...     sample_size=0.05,
...     stratify_by=["lithology_class"],
...     spatial_cols=("longitude", "latitude"),
...     time_col="year",
...     time_steps=3,
...     forecast_horizon=1,
...     keep_years=4,
...     year_mode="latest",
...     require_consecutive=True,
...     savefile="demo_panel.csv",
... )
>>> # Slightly richer test panel, keep all years, export parquet
>>> demo = make_forecast_ready_sample(
...     data=df,
...     sample_size=150,
...     stratify_by=["city", "lithology_class"],
...     spatial_cols=("longitude", "latitude"),
...     time_col="year",
...     time_steps=3,
...     forecast_horizon=1,
...     keep_years=None,
...     export_path="demo_panel.parquet",
...     export_format="parquet",
... )