geoprior.utils.spatial_utils
geospatial_utils - A collection of utilities for geospatial and positional data analysis, filtering, and transformations.
Functions
|
Create stratified spatial sample batches from a DataFrame. |
|
Cluster 2D spatial data in df using the chosen algorithm and optionally plot the results. |
|
Approx WGS84 meters per degree at reference latitude. |
|
Merge two DataFrames based on specified feature columns. |
|
Extract coordinate columns or their midpoint from a DataFrame. |
|
Extracts a spatial Region of Interest (ROI) from a DataFrame. |
|
Extract zones by filtering values against a threshold rule. |
|
filter_position is a utility that filters a pandas.DataFrame based on user-specified spatial positions. |
|
Generate buffer-based negative samples around existing points or gauge stations. |
|
Generate synthetic negative samples for spatial binary classification tasks. |
|
Generates negative samples for modeling in spatial scenarios, offering multiple strategies. |
|
Check whether the coordinate values x, y exist in the data. |
|
Build a compact, forecast-ready panel sample. |
|
Sample spatial data intelligently to represent the distribution of the whole area and include different years. |
- geoprior.utils.spatial_utils.spatial_sampling(data, sample_size=0.01, stratify_by=None, spatial_bins=10, spatial_cols=None, method='abs', min_relative_ratio=0.01, random_state=42, savefile=None, verbose=1)
Sample spatial data intelligently to represent the distribution of the whole area and include different years.
This function performs stratified sampling on spatial data, ensuring that the sample reflects both spatial distribution and temporal aspects of the entire dataset. It combines spatial stratification based on coordinates and additional stratification columns specified by the user.
- Parameters:
data (pandas.DataFrame) – The input DataFrame to sample from. Must contain spatial coordinate columns (e.g., 'longitude', 'latitude') and any columns specified in stratify_by.
sample_size (float or int, optional) – The proportion or absolute number of samples to select. If float, should be between 0.0 and 1.0 and represents the fraction of the dataset to include in the sample. If int, represents the absolute number of samples to select. Default is 0.01 (1% of the data).
stratify_by (list of str, optional) – List of column names to stratify by.
spatial_bins (int or tuple/list of int, optional) – Number of bins to divide the spatial coordinates into. If an integer, the same number of bins is used for all spatial dimensions. If a tuple or list, its length must match the number of spatial columns, specifying the number of bins for each spatial dimension. Default is 10.
spatial_cols (list or tuple of str, optional) – List of spatial coordinate column names. Accepts one or two columns. If None, the function checks for columns named 'longitude' and/or 'latitude' in data. If only one spatial column is provided or found, a warning recommends providing both spatial columns for more accurate sampling. If more than two columns are provided, an error is raised.
method (str, {'abs', 'relative'}, default 'abs') – Defines how the sample size is determined. 'abs' or 'absolute' uses a fixed sampling proportion based on sample_size. 'relative' scales sampling by dataset stratification so small groups still receive a proportional sample controlled by min_relative_ratio.
min_relative_ratio (float, default 0.01) – Controls the minimum allowable fraction of records that must be sampled when method='relative'. It must be between 0 and 1. For example, min_relative_ratio=0.05 requests at least 5 percent of the total dataset size from each stratification group when possible; if a group is smaller than that minimum, the entire subset is sampled instead.
random_state (int, optional) – Random seed for reproducibility. Default is 42.
verbose (int, default 1) – Controls progress-bar and status output during execution. Larger values produce more detailed messages.
- Returns:
sampled_data – A sampled DataFrame representing the distribution of the whole area and including different years.
- Return type:
Notes
The function performs stratified sampling based on spatial bins and other specified stratification columns. Spatial coordinates are binned using quantile-based discretization (pandas.qcut()), ensuring each bin has approximately the same number of observations.
Let \(N\) be the total number of samples in data, and \(n\) be the desired sample size. The function calculates the number of samples to draw from each stratification group based on the proportion of the group size to the total dataset size:
\[n_i = \left\lceil \frac{N_i}{N} \times n \right\rceil \tag{1}\]
where \(N_i\) is the size of group \(i\), and \(n_i\) is the number of samples to draw from group \(i\).
The function ensures that all specified spatial and stratification columns exist in data, that the number of spatial bins matches the number of spatial columns, and that the sample size is valid. A warning is issued when only one spatial column is used because two spatial columns usually give more reliable spatial sampling.
Examples
>>> from geoprior.utils.spatial_utils import spatial_sampling
>>> import pandas as pd
>>> # Assume 'df' is a pandas DataFrame with columns
>>> # 'longitude', 'latitude', 'year', and other data.
>>> sampled_df = spatial_sampling(
...     data=df,
...     sample_size=0.05,
...     stratify_by=['year', 'geological_category'],
...     spatial_bins=(10, 15),
...     spatial_cols=['longitude', 'latitude'],
...     random_state=42
... )
>>> print(sampled_df.shape)
See also
pandas.qcut – Quantile-based discretization function used for binning.
sklearn.model_selection.StratifiedShuffleSplit – For stratified sampling.
batch_spatial_sampling – Resample spatial data with batching.
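The allocation rule from the Notes of spatial_sampling, \(n_i = \lceil N_i / N \times n \rceil\), can be illustrated with a short sketch. This is not the library's implementation: the helper name `allocate_per_group`, the `_bin` column suffix, and the toy data are assumptions made for the example.

```python
import math

import numpy as np
import pandas as pd


def allocate_per_group(df, spatial_cols, stratify_by, n_bins, n_total):
    """Bin coordinates with qcut, then compute n_i = ceil(N_i / N * n)."""
    work = df.copy()
    bin_cols = []
    for col in spatial_cols:
        bin_col = f"{col}_bin"  # hypothetical helper column name
        # duplicates="drop" avoids errors when quantile edges coincide
        work[bin_col] = pd.qcut(
            work[col], q=n_bins, labels=False, duplicates="drop"
        )
        bin_cols.append(bin_col)
    # One group per (spatial bin, stratification value) combination
    group_sizes = work.groupby(bin_cols + list(stratify_by)).size()
    total = len(work)
    return group_sizes.apply(lambda n_i: math.ceil(n_i / total * n_total))


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "longitude": rng.uniform(113.0, 114.0, 200),
    "latitude": rng.uniform(22.0, 23.0, 200),
    "year": rng.choice([2020, 2021], 200),
})
alloc = allocate_per_group(
    df, ["longitude", "latitude"], ["year"], n_bins=2, n_total=20
)
print(alloc)
```

Because each group count is rounded up, the per-group allocations can sum to slightly more than the requested total.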
- geoprior.utils.spatial_utils.extract_coordinates(df, as_frame=False, drop_xy=False, error='raise', verbose=0)
Extract coordinate columns or their midpoint from a DataFrame.
- Parameters:
df (pandas.DataFrame) – DataFrame expected to contain longitude/latitude or easting/northing columns.
as_frame (bool, optional) – If True, return the coordinate columns as a DataFrame. Otherwise return their midpoint.
drop_xy (bool, optional) – If True, remove detected coordinate columns from the returned DataFrame.
error (bool or {'raise', 'warn', 'ignore'}, optional) – Error-handling policy for invalid inputs.
verbose (int, optional) – Verbosity level for detection messages.
- Returns:
Tuple containing the extracted coordinates or midpoint, the DataFrame with optional coordinate removal, and the detected coordinate-column names.
- Return type:
Notes
Longitude/latitude are preferred over easting/northing when both are present.
- geoprior.utils.spatial_utils.batch_spatial_sampling(data, sample_size=0.1, n_batches=10, stratify_by=None, spatial_bins=10, spatial_cols=None, method='abs', min_relative_ratio=0.01, random_state=42, verbose=1)
Create stratified spatial sample batches from a DataFrame.
- Parameters:
data (pandas.DataFrame) – Input DataFrame used for sampling.
sample_size (float or int, optional) – Total sample size as a fraction or absolute count.
n_batches (int, optional) – Number of batches to generate.
stratify_by (str or list of str or None, optional) – Additional columns used for stratification.
spatial_bins (int or sequence of int, optional) – Number of spatial bins used when discretizing coordinates.
spatial_cols (list or tuple of str or None, optional) – Spatial coordinate columns.
method ({'abs', 'absolute', 'relative'}, optional) – Strategy used to translate sample_size into per-batch sample counts.
min_relative_ratio (float, optional) – Minimum relative group size used by method='relative'.
random_state (int, optional) – Random seed for reproducibility.
verbose (int, optional) – Verbosity level.
- Returns:
Stratified batches sampled without overlap.
- Return type:
Notes
Spatial coordinates are discretized with pandas.qcut and combined with stratify_by columns so batches preserve the overall data distribution as closely as possible.
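One way to obtain batches "without overlap" is to sample each batch from a pool that shrinks after every draw. The sketch below shows only that idea and deliberately ignores stratification; the helper name `disjoint_batches` and its parameters are hypothetical and do not describe how batch_spatial_sampling is implemented.

```python
import pandas as pd


def disjoint_batches(df, n_batches, batch_size, random_state=42):
    """Draw batches from a shrinking pool so no row appears twice."""
    remaining = df
    batches = []
    for i in range(n_batches):
        take = min(batch_size, len(remaining))
        if take == 0:
            break  # pool exhausted before all batches were drawn
        # Vary the seed per batch so batches are reproducible yet distinct
        batch = remaining.sample(n=take, random_state=random_state + i)
        batches.append(batch)
        # Remove the rows just drawn from the pool
        remaining = remaining.drop(batch.index)
    return batches


df = pd.DataFrame({"value": range(20)})
batches = disjoint_batches(df, n_batches=4, batch_size=3)
all_idx = [i for b in batches for i in b.index]
print(len(batches), len(all_idx), len(set(all_idx)))
```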
- geoprior.utils.spatial_utils.extract_zones_from(z, threshold='auto', condition='auto', use_negative_criteria=True, percentile=10, x=None, y=None, data=None, view=False, plot_type='scatter', figsize=(8, 6), savefile=None, axis_off=False, show_grid=True, **kwargs)
Extract zones by filtering values against a threshold rule.
- Parameters:
z (array-like, pandas.Series, or str) – Input data to filter. If z is a string, it is interpreted as a column name in data.
threshold ({'auto'} or float or int or tuple, optional) – Filtering criterion. Use 'auto' for percentile-based thresholding, a scalar for a single cutoff, or a length-2 tuple for interval filtering.
condition ({'auto', 'above', 'below', 'between'}, optional) – Relation between the data and the threshold.
use_negative_criteria (bool, optional) – Controls the automatic condition when condition='auto'.
percentile (int or float, optional) – Percentile used when threshold='auto'.
x (array-like, pandas.Series, str, or None, optional) – Optional coordinates or column names used for plotting.
y (array-like, pandas.Series, str, or None, optional) – Optional coordinates or column names used for plotting.
data (pandas.DataFrame or None, optional) – Data source used when x, y, or z are column names.
view (bool, optional) – Whether to visualize the filtered result.
plot_type (str, optional) – Plot type used when view=True. Common values include 'scatter', 'line', and 'hist'.
figsize (tuple, optional) – Figure size for plotting.
savefile (str or None, optional) – Optional path used when saving the figure.
axis_off (bool, optional) – Whether to hide axes in the plot.
show_grid (bool, optional) – Whether to display the plot grid.
**kwargs (dict) – Additional plotting keyword arguments.
- Returns:
Filtered values and any optional plotting outputs defined by the implementation.
- Return type:
Notes
When x, y, or z are passed as strings, the function relies on extract_array_from to retrieve the corresponding arrays from data.
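Percentile-based thresholding, as used when threshold='auto', can be sketched with NumPy. The helper name `zones_below_percentile` and the choice of a "below or equal" comparison are assumptions for illustration; how extract_zones_from maps condition='auto' and use_negative_criteria onto 'above' or 'below' is not shown here.

```python
import numpy as np


def zones_below_percentile(z, percentile=10):
    """Return values at or below the given percentile, plus the cutoff."""
    z = np.asarray(z, dtype=float)
    # Linear-interpolated percentile of the data acts as the threshold
    cutoff = np.percentile(z, percentile)
    return z[z <= cutoff], cutoff


values = np.arange(1, 101)  # 1, 2, ..., 100
selected, cutoff = zones_below_percentile(values, percentile=10)
print(cutoff, selected)
```

With the default linear interpolation the cutoff for this toy input falls between the 10th and 11th smallest values, so exactly ten values are kept.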
- geoprior.utils.spatial_utils.filter_position(df, pos, pos_cols=None, find_closest=True, threshold=0.01, error='raise', verbose=0)
filter_position is a utility that filters a pandas.DataFrame based on user-specified spatial positions. It can match positions exactly or compute distances to find the closest points within a threshold.
For a single dimension, the distance is computed as:
\[d = |x - p| \tag{2}\]
For multi-dimensional data with n coordinates, the Euclidean distance is computed as:
\[d = \sqrt{\sum_{i=1}^n (x_i - p_i)^2} \tag{3}\]
- Parameters:
df (pandas.DataFrame) – The DataFrame to filter. It must contain the columns referenced by pos_cols if pos_cols is not None.
pos (float or tuple of floats) – The reference position(s) to match or approximate. When pos_cols is None, pos is interpreted as an index value. Otherwise, each value in pos aligns with a specific column in pos_cols.
pos_cols (str or tuple of str, optional) – Name(s) of the column(s) in df to match against pos. If pos_cols=None, then pos is treated as an index filter. If multiple columns are given (e.g., latitude and longitude), each coordinate in pos should correspond to one column in pos_cols.
find_closest (bool, optional) – If True, nearest-neighbor filtering is performed within the distance threshold. If False, exact matches are used.
threshold (float, optional) – The maximum distance within which points are considered a match if find_closest is True. The unit corresponds to the column data (e.g., degrees for geographic lat/lon).
error ({'raise', 'warn', 'ignore'}, optional) – Specifies how to handle dimension mismatches or missing values. If 'raise', a ValueError is raised. If 'warn', a warning is printed and extra values are ignored. If 'ignore', mismatches are silently ignored.
verbose (int, optional) – Controls the level of output messages: 0 for no output, 1 for basic info, 2 for additional details, and >=3 for a comprehensive summary.
- Returns:
A new DataFrame that contains only rows matching or approximating the specified position(s) within the given threshold if find_closest is True.
- Return type:
Notes
When pos_cols is None, the function attempts to filter by DataFrame index using the first element of pos. This approach may fail for multi-level indexes unless error='warn' or error='ignore' is used to bypass the dimension mismatch.
Examples
>>> from geoprior.utils.spatial_utils import filter_position
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'lon': [113.309998, 113.310001],
...     'lat': [22.831362, 22.831364]
... })
>>> # Exact match
>>> result_exact = filter_position(df, pos=(113.309998, 22.831362),
...                                pos_cols=('lon', 'lat'),
...                                find_closest=False)
>>> # Nearest match with threshold
>>> result_close = filter_position(df,
...                                pos=(113.31, 22.83),
...                                pos_cols=('lon', 'lat'),
...                                find_closest=True,
...                                threshold=0.01)
See also
geoprior.utils.data_utils.truncate_data – Truncate multiple DataFrames based on spatial coordinates or index alignment with a base DataFrame.
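The Euclidean distance used by filter_position for multi-column positions can be reproduced directly with NumPy. The sketch below is an illustration under assumptions, not the function's source: the helper name `rows_within` is hypothetical, and treating the threshold as inclusive is a choice made for the example.

```python
import numpy as np
import pandas as pd


def rows_within(df, pos, pos_cols, threshold):
    """Keep rows whose Euclidean distance to `pos` is at most `threshold`."""
    coords = df[list(pos_cols)].to_numpy(dtype=float)
    diffs = coords - np.asarray(pos, dtype=float)
    # d = sqrt(sum_i (x_i - p_i)^2), computed row by row
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    return df[dist <= threshold]


df = pd.DataFrame({
    "lon": [113.309998, 113.310001, 113.5],
    "lat": [22.831362, 22.831364, 22.9],
})
close = rows_within(
    df, pos=(113.31, 22.83), pos_cols=("lon", "lat"), threshold=0.01
)
print(close)
```

Distances here are in the units of the columns (degrees), which matches the description of the threshold parameter.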
- geoprior.utils.spatial_utils.create_spatial_clusters(df, spatial_cols=None, cluster_col='region', n_clusters=None, algorithm='kmeans', view=True, figsize=(14, 10), s=60, plot_style='seaborn', cmap='tab20', show_grid=True, grid_props=None, auto_scale=True, savefile=None, verbose=1, **kwargs)
Cluster 2D spatial data in df using the chosen algorithm and optionally plot the results.
This function extracts two coordinate columns from df and forms clusters with 'kmeans', 'dbscan', or 'agglo' (agglomerative). Where relevant, it uses filter_valid_kwargs to strip out parameters that the selected estimator does not accept, and it writes the cluster labels into cluster_col.
- Parameters:
df (pandas.DataFrame) – Input DataFrame holding spatial coordinates and optional other fields.
spatial_cols (list of str, optional) – Two-column list for x and y coordinates. Defaults to ['longitude', 'latitude'] if None.
cluster_col (str, default 'region') – Name of the column to store the assigned cluster labels.
n_clusters (int, optional) – Number of clusters to form. If not provided for KMeans, it is auto-detected. For DBSCAN or Agglomerative, a warning is issued if not set.
algorithm (str, default 'kmeans') – Choice of clustering algorithm among ['kmeans', 'dbscan', 'agglo'].
view (bool, default True) – If True, displays a scatterplot of the final clusters.
figsize (tuple, default (14, 10)) – Size of the displayed figure for the cluster plot.
s (int, default 60) – Marker size in the scatterplot.
plot_style (str, default 'seaborn') – Matplotlib style used for the plot.
cmap (str, default 'tab20') – Colormap name used to differentiate clusters.
show_grid (bool, default True) – Toggles grid lines on or off.
grid_props (dict, optional) – Additional keyword arguments controlling the grid style.
auto_scale (bool, default True) – If True, standardize coordinates before clustering.
savefile (str, optional) – File path used to save the data, including the cluster_col column of assigned cluster labels.
verbose (int, default 1) – Controls console logs. Higher values yield more details about scaling and cluster detection.
**kwargs – Additional keyword arguments passed to the chosen algorithm (filtered by filter_valid_kwargs for KMeans, DBSCAN, AgglomerativeClustering).
- Returns:
A copy of <df> with an additional <cluster_col> storing the assigned cluster labels.
- Return type:
Notes
If auto_scale is True, a standard scaler normalizes the coordinate columns. The scatterplot is generated with seaborn for enhanced styling.
By default, for algorithm='kmeans', the model attempts to minimize:
\[J = \sum_{i=1}^{N} \min_{\mu_j} \lVert x_i - \mu_j \rVert^2 \tag{4}\]
where \(x_i\) are the scaled or raw 2D coordinates in df and \(\mu_j\) are the cluster centroids. The function can optionally auto-detect n_clusters using a silhouette and elbow analysis if not provided.
Examples
>>> from geoprior.utils.spatial_utils import create_spatial_clusters
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "longitude": [0.1, 0.2, 2.2, 2.3],
...     "latitude": [1.0, 1.1, 2.1, 2.2]
... })
>>> # KMeans with auto scale and auto-detect k
>>> result = create_spatial_clusters(
...     df=df,
...     algorithm="kmeans",
...     view=True
... )
>>> # DBSCAN with custom arguments
>>> result_db = create_spatial_clusters(
...     df=df,
...     algorithm="dbscan",
...     eps=0.5,
...     min_samples=2
... )
See also
filter_valid_kwargs – Helps discard unsupported keyword arguments for chosen estimators.
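The k-means objective \(J\) from the Notes of create_spatial_clusters can be evaluated by hand for a given set of centroids. The sketch below only computes the objective for fixed centroids; it does not fit a model, and the helper name `kmeans_objective` and the hand-picked centroids are assumptions for the example.

```python
import numpy as np


def kmeans_objective(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared distances with shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # For each point keep the nearest centroid, then sum
    return d2.min(axis=1).sum()


# Same coordinates as the documented example, unscaled
coords = np.array([[0.1, 1.0], [0.2, 1.1], [2.2, 2.1], [2.3, 2.2]])
# Hand-picked centroids: the mean of each obvious pair of points
centroids = np.array([[0.15, 1.05], [2.25, 2.15]])
J = kmeans_objective(coords, centroids)
print(J)
```

Each point sits 0.05 away from its centroid along both axes, so every term contributes 0.005 and the objective is about 0.02.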
- geoprior.utils.spatial_utils.gen_negative_samples(df, target_col, spatial_cols=('longitude', 'latitude'), feature_cols=None, buffer_km=10, neg_feature_range=(0, 5), num_neg_per_pos=1, use_gpd='auto', view=False, savefile=None, verbose=1)
Generate synthetic negative samples for spatial binary classification tasks.
This function creates additional samples labeled as non-events within a specified spatial buffer around the positive (event) observations. The main idea is to generate negative examples that reflect realistic conditions but have not triggered an event, thereby assisting models in distinguishing occurrences from non-occurrences.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing the positive samples (events). Must include both the target column and the specified spatial columns.
target_col (str) – Column name for the binary target (e.g. landslide). Rows where this column is 1 (or True) are considered positive samples.
spatial_cols (tuple of str, default ('longitude', 'latitude')) – Tuple specifying the longitude and latitude column names in df.
feature_cols (list of str or None, default None) – List of feature columns to use or to simulate for generated negatives. If None, the function automatically detects numeric and categorical columns excluding spatial_cols and target_col.
buffer_km (float, default 10) – Spatial buffer in kilometers used to define the radius around each positive sample within which negative samples are created.
neg_feature_range (tuple of int, default (0, 5)) – Value range (minimum, maximum) used for simulating numeric feature values in negative samples if the corresponding feature column does not exist in df.
num_neg_per_pos (int, default 1) – Number of negative samples to generate per positive sample. For instance, if num_neg_per_pos=2, each positive sample spawns two negatives.
use_gpd (str, default 'auto') – If set to 'auto', the function tries to import GeoPandas for visualization. If 'none', GeoPandas is not used.
view (bool, default False) – Whether to visualize the generated samples on a map. Attempts to use GeoPandas if installed; falls back to Matplotlib if 'auto' is chosen and GeoPandas is not available.
savefile (str or None, default None) – Path to which the resulting DataFrame is saved if provided. Handled by the decorator that wraps this function.
verbose (int, default 1) – Verbosity level. 0 for silent, 1 for progress indication, 2 for more messages, 3 for debugging output.
- Returns:
Combined DataFrame with both original positive samples and newly generated negative samples. The <target_col> is 1 for positives and 0 for negatives.
- Return type:
- `columns_manager`
This internal function is used to handle the processing of columns for features and spatial parameters.
Notes
If a feature column exists in df, the negative samples copy or randomly select categories for categorical columns, and sample integers within neg_feature_range for numeric columns.
If feature_cols is empty or does not exist in df, the function simulates all values for negative samples.
When view=True, circles depicting the buffer zone around each positive sample are drawn for visualization.
The conversion of 1 degree to roughly 111 km is an approximation: it holds well for latitude, while the length of one degree of longitude shrinks with the cosine of the latitude.
Mathematically, the spatial buffer in degrees is defined as:
\[\Delta = \frac{\text{buffer\_km}}{111.0}, \tag{5}\]
where \(111.0\) km approximates the length of one degree. For each positive sample at location \((lat, lon)\), \(n\) new points are generated with offsets \(\delta_{lat}\) and \(\delta_{lon}\), each drawn from a uniform distribution \(U(-\Delta, \Delta)\):
\[\begin{aligned} lat_{new} &= lat + \delta_{lat}, \\ lon_{new} &= lon + \delta_{lon}. \end{aligned} \tag{6}\]
Combined with randomly sampled or inferred feature values, these new samples serve as negative examples for modeling tasks such as landslide prediction.
Examples
>>> from geoprior.utils.spatial_utils import gen_negative_samples
>>> import pandas as pd
>>> import numpy as np
>>> df_pos = pd.DataFrame({
...     'latitude': np.random.uniform(24.0, 25.0, 5),
...     'longitude': np.random.uniform(113.0, 114.0, 5),
...     'rainfall_day_1': np.random.randint(10, 30, 5),
...     'rainfall_day_2': np.random.randint(10, 30, 5),
...     'rainfall_day_3': np.random.randint(10, 30, 5),
...     'rainfall_day_4': np.random.randint(10, 30, 5),
...     'rainfall_day_5': np.random.randint(10, 30, 5),
...     'landslide': 1
... })
>>> combined = gen_negative_samples(
...     df=df_pos,
...     target_col='landslide',
...     buffer_km=10,
...     num_neg_per_pos=2,
...     view=False,
...     verbose=2
... )
>>> print(combined.head())
See also
check_spatial_columns – Ensures the existence of required spatial columns.
exist_features – Verifies the presence of specified features in df.
columns_manager – Handles both feature and spatial columns for processing.
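The buffer conversion and uniform offsets described in the Notes of gen_negative_samples can be sketched with the standard library alone. This is an illustration of the formulas, not the function's source: the helper name `sample_offsets`, the constant name `KM_PER_DEGREE`, and the seed handling are assumptions.

```python
import random

KM_PER_DEGREE = 111.0  # approximate length of one degree, as in the Notes


def sample_offsets(lat, lon, buffer_km, n, seed=None):
    """Generate n points offset from (lat, lon) by U(-delta, delta) degrees."""
    rng = random.Random(seed)
    delta = buffer_km / KM_PER_DEGREE  # buffer expressed in degrees
    return [
        (lat + rng.uniform(-delta, delta), lon + rng.uniform(-delta, delta))
        for _ in range(n)
    ]


points = sample_offsets(24.5, 113.5, buffer_km=10, n=3, seed=0)
for new_lat, new_lon in points:
    print(round(new_lat, 4), round(new_lon, 4))
```

Since the two offsets are drawn independently, the generated points fill a square of half-width \(\Delta\) around the base point rather than a circle of radius \(\Delta\).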
- geoprior.utils.spatial_utils.gen_buffered_negative_samples(df, target_col, spatial_cols=('longitude', 'latitude'), feature_cols=None, buffer_km=10, neg_feature_range=(0, 5), num_neg_per_pos=1, strategy='landslide', gauge_data=None, use_gpd='auto', id_col='auto', view=False, savefile=None, seed=None, verbose=1)
Generate buffer-based negative samples around existing points or gauge stations.
This function creates additional negative samples for binary spatial events. It either takes an existing landslide dataset (when strategy is ‘landslide’) or a separate gauge dataset (if strategy is ‘gauge’) to serve as the base points for generating negatives within a circular buffer. The function validates input columns and parameters via _validate_negatives_sampling before constructing synthetic samples.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing positive event samples (e.g., landslides). Must include target_col and spatial_cols.
target_col (str) – Name of the binary target column (1 for event, 0 for no event).
spatial_cols (tuple of str, default ('longitude', 'latitude')) – Indicates which columns hold longitude and latitude in df.
feature_cols (list of str, optional) – Additional feature columns to simulate or copy for negatives. If None, all columns except spatial_cols and target_col are used.
buffer_km (float, default 10) – The radial distance in kilometers for sampling negative points around each base point.
neg_feature_range (tuple of float, default (0, 5)) – A numeric range from which feature values are drawn for negative samples if the column is numeric.
num_neg_per_pos (int, default 1) – Number of negatives to generate per positive (landslide) or gauge point.
strategy (str, default 'landslide') – Defines the base from which negative samples are generated. Use 'landslide' or 'event' to sample around rows in df. Use 'gauge' to sample around rows in gauge_data.
gauge_data (pandas.DataFrame, optional) – Required if strategy is 'gauge'. Must contain spatial_cols.
use_gpd (bool or 'auto', default 'auto') – If 'auto', attempts to use GeoPandas for visualization if installed. Otherwise, falls back to Matplotlib. This parameter is forwarded to the underlying visualization function.
id_col (str or list of str, default 'auto') – Column(s) representing IDs in df. If 'auto', the function tries to detect possible ID columns. Used by _validate_negatives_sampling.
view (bool, default False) – Whether to visualize the sampled negatives around the base points.
savefile (str, optional) – If provided, saves the final combined dataset (positives and negatives) to a CSV file at the specified path.
seed (int, optional) – Seed for NumPy's random generator, ensuring reproducible offsets in negative sampling.
verbose (int, default 1) – Controls console messages: 1 for minimal, 2 for more detailed logs.
- Returns:
The combined dataset containing both the original (positive) rows, labeled with target_col = 1, and the newly generated negative rows, labeled target_col = 0.
- Return type:
- ``_validate_negatives_sampling``
Validates required columns and parameters, including num_neg_per_pos and neg_feature_range.
- ``visualize_negative_sampling``
Generates a plot showing the negative samples around the base points if view is True.
Notes
If strategy is 'gauge', gauge_data must be provided and contain the columns longitude and latitude.
When view is True, circles are drawn to illustrate the buffer radius.
The ratio of 1 degree to roughly 111 km is an approximation: it is accurate for latitude, but one degree of longitude shortens with the cosine of the latitude.
Formally, a buffer in degrees \(\Delta\) is computed by:
\[\Delta = \frac{\text{buffer\_km}}{111}, \tag{7}\]
where \(111\) is an approximate km-per-degree conversion factor. Each base point \((lat, lon)\) spawns \(n\) negatives, each offset by \(\delta_{lat}\), \(\delta_{lon}\) drawn from \(U(-\Delta, \Delta)\).
Examples
Below is an illustration of how to generate negative samples around both existing event locations (strategy='landslide') and separate gauge stations (strategy='gauge') using gen_buffered_negative_samples.
First, we simulate a small DataFrame of positive landslide samples with rainfall attributes, as well as a separate DataFrame for gauge stations:
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)
>>> positive_samples = pd.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'latitude': np.random.uniform(24.0, 25.0, 5),
...     'longitude': np.random.uniform(113.0, 114.0, 5),
...     'rainfall_day_1': np.random.randint(10, 30, 5),
...     'rainfall_day_2': np.random.randint(10, 30, 5),
...     'rainfall_day_3': np.random.randint(10, 30, 5),
...     'rainfall_day_4': np.random.randint(10, 30, 5),
...     'rainfall_day_5': np.random.randint(10, 30, 5),
...     'landslide': [1]*5
... })
>>> gauge_data = pd.DataFrame({
...     'gauge_id': ['G1', 'G2', 'G3'],
...     'latitude': np.random.uniform(24.0, 25.0, 3),
...     'longitude': np.random.uniform(113.0, 114.0, 3)
... })
We then call gen_buffered_negative_samples to produce negatives around these data using two different strategies:
>>> from geoprior.utils.spatial_utils import gen_buffered_negative_samples
>>> # Generate negatives around landslide points
>>> results_landslide = gen_buffered_negative_samples(
...     df=positive_samples,
...     target_col='landslide',
...     spatial_cols=('longitude', 'latitude'),
...     feature_cols=[f'rainfall_day_{i+1}' for i in range(5)],
...     buffer_km=10,
...     num_neg_per_pos=1,
...     strategy='landslide',
...     verbose=1
... )
>>> # Generate negatives around the gauge stations
>>> results_gauge = gen_buffered_negative_samples(
...     df=positive_samples,
...     target_col='landslide',
...     spatial_cols=('longitude', 'latitude'),
...     feature_cols=[f'rainfall_day_{i+1}' for i in range(5)],
...     buffer_km=10,
...     num_neg_per_pos=1,
...     strategy='gauge',
...     gauge_data=gauge_data,
...     verbose=1
... )
See also
gen_negative_samples – Generate synthetic negative samples for spatial binary classification tasks.
_validate_negatives_sampling – Ensures inputs and parameters are correct.
visualize_negative_sampling – Plots the positive and negative points for inspection.
- geoprior.utils.spatial_utils.gen_negative_samples_plus(df, target_col, spatial_cols=('longitude', 'latitude'), feature_cols=None, buffer_km=10, neg_feature_range=(0, 5), num_neg_per_pos=1, strategy='landslide', gauge_data=None, elevation_data=None, similarity_features=None, time_col=None, cluster_method='kmeans', use_gpd='auto', id_col='auto', view=False, savefile=None, verbose=1, seed=None)
Generates negative samples for modeling in spatial scenarios, offering multiple strategies. The function calls gen_buffered_negative_samples when the strategy argument is 'landslide', 'event', or 'gauge'. It also calls generate_negative_samples partially in the 'hybrid' strategy. Internally, each sample is augmented to produce negative instances according to a chosen method.
\[\text{buffer\_deg} = \frac{\text{buffer\_km}}{111.0} \tag{8}\]
The above formula approximates degrees from kilometers near the equator. The output is a combined dataset containing original positives and generated negatives. The ratio of negatives per positive is controlled by num_neg_per_pos.
- Parameters:
df (pandas.DataFrame) – Input data containing spatial coordinates and features.
target_col (str) – Name of the classification target column. Positive samples in df are labeled; negatives are generated.
spatial_cols (tuple of str, optional) – Columns representing longitude and latitude in df.
feature_cols (list of str, optional) – Additional feature columns used to drive or constrain negative sampling processes.
buffer_km (float, optional) – Radius in kilometers for local negative sampling. Used to compute buffer degrees.
neg_feature_range (tuple of float, optional) – Lower and upper range for continuous features in random negative generation.
num_neg_per_pos (int, optional) – Number of negatives to generate per positive instance.
strategy (str, optional) – Defines the sampling approach. Options include 'landslide', 'gauge', 'random_global', 'temporal_shift', 'clustered_negatives', 'environmental_similarity', 'elevation_based', and 'hybrid'. See more in User Guide.
gauge_data (pandas.DataFrame, optional) – Reference data for gauge-based or hybrid strategies.
elevation_data (pandas.DataFrame, optional) – Elevation records, used if strategy='elevation_based'.
similarity_features (list of str, optional) – Columns for nearest neighbor computation in 'environmental_similarity'.
time_col (str, optional) – Name of the time column for 'temporal_shift'. Required if using that strategy.
cluster_method (str, optional) – Clustering algorithm for 'clustered_negatives'. Default is 'kmeans'.
use_gpd (bool or str, optional) – Indicator for whether GeoPandas is used in buffer-based processes.
id_col (str, optional) – Column name used as an identifier. If 'auto', a default is used.
view (bool, optional) – Flag for visualizing or previewing the results.
savefile (str, optional) – Path to save the output dataset. If None, no file is saved.
verbose (int, optional) – Verbosity level. Higher values yield more logs.
seed (int, optional) – Random seed for reproducibility. If None, randomness is not fixed.
- Returns:
A combined DataFrame containing the original positive samples labeled as 1 and newly generated negative samples labeled 0.
- Return type:
Notes
When strategy='hybrid', partial sets of negatives come from two distinct calls to generate_negative_samples for the 'landslide' and 'gauge' sub-strategies, and are then merged.
Examples
>>> from geoprior.utils.spatial_utils import gen_negative_samples_plus
>>> import numpy as np
>>> import pandas as pd
>>> df_example = pd.DataFrame({
...     "longitude": [10.1, 10.2, 10.3],
...     "latitude": [45.1, 45.2, 45.3],
...     "feature": [3.4, 2.1, 6.7],
...     "target": [1, 1, 1]
... })
>>> gauge_data = pd.DataFrame({
...     'gauge_id': ['G1', 'G2', 'G3'],
...     'latitude': np.random.uniform(24.0, 25.0, 3),
...     'longitude': np.random.uniform(113.0, 114.0, 3)
... })
>>> # Generate random global negatives
>>> result = gen_negative_samples_plus(
...     df_example,
...     target_col="target",
...     strategy="random_global"
... )
>>> print(result.head())
See also
gen_buffered_negative_samples – Generates negative samples within a buffer region around reference events or gauges.
generate_negative_samples – A simpler negative sampling utility for certain strategies.
- geoprior.utils.spatial_utils.extract_spatial_roi(df, x_range, y_range, x_col='longitude', y_col='latitude', snap_to_closest=True, savefile=None, **kwargs)
Extracts a spatial Region of Interest (ROI) from a DataFrame.
This function filters a DataFrame to include only the data points that fall within a specified rectangular bounding box defined by x and y coordinate ranges.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing the spatial data.
x_range (tuple of (float, float)) – A tuple containing the minimum and maximum desired values for the x-coordinate (e.g., longitude). The order does not matter.
y_range (tuple of (float, float)) – A tuple containing the minimum and maximum desired values for the y-coordinate (e.g., latitude). The order does not matter.
x_col (str, default 'longitude') – The name of the column in df that contains the x-coordinates.
y_col (str, default 'latitude') – The name of the column in df that contains the y-coordinates.
snap_to_closest (bool, default True) – If True, and a value in x_range or y_range does not exist in the data, the function will “snap” to the nearest available coordinate in the dataset. If False, it will use the exact boundaries provided.
savefile (str, optional) – The path to save the resulting DataFrame as a CSV file.
- Returns:
A new DataFrame containing only the rows that fall within the specified spatial bounding box.
- Return type:
pd.DataFrame
- Raises:
ValueError – If x_col or y_col is not found in the DataFrame, or if the range tuples are not provided correctly.
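The bounding-box filter and the snapping behaviour described above can be sketched with plain pandas. This is an illustrative re-implementation under stated assumptions, not the library's code: the name `extract_roi_sketch` is hypothetical, and the snapping rule (move each bound to the nearest value present in the column) is one plausible reading of snap_to_closest.

```python
import numpy as np
import pandas as pd


def extract_roi_sketch(df, x_range, y_range, x_col="longitude",
                       y_col="latitude", snap_to_closest=True):
    """Hypothetical bounding-box filter mirroring the documented behaviour."""
    for col in (x_col, y_col):
        if col not in df.columns:
            raise ValueError(f"Column {col!r} not found in DataFrame.")

    def bounds(values, pair):
        if len(pair) != 2:
            raise ValueError("Each range must hold exactly two values.")
        lo, hi = sorted(pair)  # the order of the pair does not matter
        if snap_to_closest:
            arr = values.to_numpy()
            # Assumed rule: snap each bound to the nearest existing coordinate.
            lo = arr[np.abs(arr - lo).argmin()]
            hi = arr[np.abs(arr - hi).argmin()]
        return lo, hi

    x_lo, x_hi = bounds(df[x_col], x_range)
    y_lo, y_hi = bounds(df[y_col], y_range)
    # Series.between is inclusive on both ends by default.
    mask = df[x_col].between(x_lo, x_hi) & df[y_col].between(y_lo, y_hi)
    return df.loc[mask].copy()


grid = pd.DataFrame({
    "longitude": [113.0, 113.5, 114.0, 114.5],
    "latitude": [22.0, 22.5, 23.0, 23.5],
    "value": [1, 2, 3, 4],
})

# Exact bounds: only longitude 114.0 lies inside [113.6, 114.1].
strict = extract_roi_sketch(grid, (113.6, 114.1), (22.0, 23.5),
                            snap_to_closest=False)
# Snapped bounds: 113.6 moves to 113.5 and 114.1 moves to 114.0.
snapped = extract_roi_sketch(grid, (113.6, 114.1), (22.0, 23.5))
```

With snapping enabled the ROI can grow or shrink slightly relative to the requested box, which matters when the coordinates sit on a coarse grid.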
- geoprior.utils.spatial_utils.make_forecast_ready_sample(data, sample_size=0.05, time_col='year', spatial_cols=None, group_cols=None, stratify_by=None, spatial_bins=10, time_steps=3, forecast_horizon=1, require_consecutive=True, keep_years=None, year_mode='latest', min_groups=5, max_groups=None, columns_to_keep=None, method='abs', min_relative_ratio=0.01, random_state=42, export_path=None, export_format=None, savefile=None, sort_output=True, verbose=1)
Build a compact, forecast-ready panel sample.
The function samples spatial groups rather than individual rows, then reconstructs the full panel for the selected groups. This is much safer for demo/testing than row-wise sampling because each sampled location keeps enough temporal history for sequence construction.
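The difference between group-level and row-wise sampling can be illustrated with a small pandas sketch. It assumes a toy panel and uses a plain random draw of groups; the library itself delegates the draw to its stratified spatial sampler, so treat this only as a picture of the "sample groups, then rebuild the panel" idea. The column name `value` and the variable names are illustrative.

```python
import pandas as pd

# Toy panel: 4 locations x 5 years.
locations = [(113.0, 22.0), (113.5, 22.5), (114.0, 23.0), (114.5, 23.5)]
panel = pd.DataFrame(
    [(lon, lat, year, float(year - 2018))
     for lon, lat in locations
     for year in range(2018, 2023)],
    columns=["longitude", "latitude", "year", "value"],
)
group_cols = ["longitude", "latitude"]

# Step 1: sample groups (locations), not individual rows.
groups = panel[group_cols].drop_duplicates()
picked = groups.sample(n=2, random_state=42)

# Step 2: rebuild the full time series for every picked group.
sample = panel.merge(picked, on=group_cols, how="inner")
sample = sample.sort_values(group_cols + ["year"]).reset_index(drop=True)

# Every sampled location keeps all 5 years, so windows can still be built.
print(sample.groupby(group_cols)["year"].nunique())
```

A row-wise `panel.sample(frac=0.5)` would instead leave most locations with gaps in their history, which is what makes it unsuitable for sequence construction.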
- Parameters:
data (pandas.DataFrame) – Input panel DataFrame.
sample_size (float or int, default 0.05) – Group-level sample size passed to the internal spatial sampler.
  - float: fraction of eligible groups
  - int: absolute number of eligible groups
time_col (str, default 'year') – Time column.
spatial_cols (tuple/list of str, optional) – Spatial coordinate columns. If None, the function searches for 'longitude' and 'latitude'.
group_cols (tuple/list of str, optional) – Group identifier columns. If None, uses spatial_cols.
stratify_by (list/tuple of str, optional) – Extra group-level columns used for stratification. Typical examples: ['lithology_class'] or ['city', 'lithology_class'].
spatial_bins (int or tuple/list, default 10) – Spatial bins passed to spatial_sampling().
time_steps (int, default 3) – Lookback window length.
forecast_horizon (int, default 1) – Forecast horizon length.
require_consecutive (bool, default True) – If True, each kept group must contain at least one consecutive run of length time_steps + forecast_horizon.
keep_years (int, optional) – If provided, keep only this many years per group after sampling. Must be >= time_steps + forecast_horizon.
year_mode ({'all', 'latest', 'earliest', 'random'}, default 'latest') – How to trim years when keep_years is given.
min_groups (int, default 5) – Minimum number of eligible groups required.
max_groups (int, optional) – Hard cap on sampled groups after spatial sampling.
columns_to_keep (list/tuple of str, optional) – Restrict the returned columns. Group and time columns are always preserved.
method ({'abs', 'absolute', 'relative'}, default 'abs') – Same meaning as in spatial_sampling().
min_relative_ratio (float, default 0.01) – Minimum relative sampling ratio when method='relative'.
random_state (int, optional) – Random seed.
export_path (str, optional) – Explicit path for non-CSV export.
export_format (str, optional) – Export format for export_path. If None, inferred from suffix.
savefile (str, optional) – CSV save path handled by @SaveFile.
sort_output (bool, default True) – Sort final output by group and time.
verbose (int, default 1) – Verbosity level.
- Returns:
A compact panel sample that preserves group-wise temporal structure for forecast demos/tests.
- Return type:
Examples
>>> from geoprior.utils.spatial_utils import make_forecast_ready_sample
>>> # `df` is assumed to be an existing panel DataFrame with the columns used below
>>> # Small demo sample, latest 4 years per location
>>> demo = make_forecast_ready_sample(
...     data=df,
...     sample_size=0.05,
...     stratify_by=["lithology_class"],
...     spatial_cols=("longitude", "latitude"),
...     time_col="year",
...     time_steps=3,
...     forecast_horizon=1,
...     keep_years=4,
...     year_mode="latest",
...     require_consecutive=True,
...     savefile="demo_panel.csv",
... )
>>> # Slightly richer test panel, keep all years, export parquet
>>> demo = make_forecast_ready_sample(
...     data=df,
...     sample_size=150,
...     stratify_by=["city", "lithology_class"],
...     spatial_cols=("longitude", "latitude"),
...     time_col="year",
...     time_steps=3,
...     forecast_horizon=1,
...     keep_years=None,
...     export_path="demo_panel.parquet",
...     export_format="parquet",
... )
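The require_consecutive and keep_years rules can be checked by hand with a short sketch. The helpers `has_consecutive_run` and `keep_latest_years` are hypothetical names written for this illustration, assuming integer years and the 'latest' trimming mode; they are not part of the library.

```python
import numpy as np
import pandas as pd


def has_consecutive_run(years, length):
    """True if `years` contains at least `length` consecutive integer years."""
    ys = np.unique(np.asarray(list(years), dtype=int))  # sorted, de-duplicated
    if len(ys) < length:
        return False
    # A new run starts wherever the gap to the previous year is not 1.
    run_id = np.cumsum(np.r_[0, np.diff(ys) != 1])
    return bool(np.bincount(run_id).max() >= length)


def keep_latest_years(df, group_cols, time_col, keep_years):
    """Keep the last `keep_years` rows (ordered by time) of every group."""
    ordered = df.sort_values(list(group_cols) + [time_col])
    return ordered.groupby(list(group_cols), sort=False).tail(keep_years)


time_steps, forecast_horizon = 3, 1
window = time_steps + forecast_horizon  # minimum run length per group

panel = pd.DataFrame({
    "longitude": [113.0] * 5 + [114.0] * 5,
    "latitude": [22.0] * 5 + [23.0] * 5,
    "year": [2018, 2019, 2020, 2021, 2022] + [2015, 2017, 2019, 2021, 2022],
})
group_cols = ["longitude", "latitude"]

# Drop groups without a long enough consecutive run, then trim to 4 years.
kept = panel.groupby(group_cols).filter(
    lambda g: has_consecutive_run(g["year"], window)
)
trimmed = keep_latest_years(kept, group_cols, "year", keep_years=window)
print(trimmed)
```

In this toy panel the second location has gaps (its longest run is 2021 to 2022), so only the first location survives, trimmed to 2019 through 2022. Note that keeping the latest rows does not by itself guarantee the retained years are consecutive; the sketch does not re-check that.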