geoprior.utils.sequence_utils
Sequence-building helpers for temporal model inputs.
Functions
build_future_sequences_npz | Build history–future sequences and save them as compressed NPZ files.
check_sequence_feasibility | Quick pre-flight feasibility check for sliding-window sequence generation.
generate_pinn_sequences | Generate input/target arrays for PINN models using various sampling methods (rolling, strided, random, expanding, bootstrap).
generate_ts_sequences | Generate time-series windows for encoder/decoder and covariates.
get_sequence_counts | Return the total number of feasible sliding-window sequences and a mapping group → count using the requested execution engine.
Exceptions
SequenceGeneratorError | Raised when no sequence can be generated with the given settings.
- geoprior.utils.sequence_utils.check_sequence_feasibility(df, *, time_col, group_id_cols=None, time_steps=12, forecast_horizon=3, engine='vectorized', mode=None, logger=<built-in function print>, verbose=0, error='warn')
Quick pre-flight feasibility check for sliding-window sequence generation
Checks whether the input table is long enough, per group, to yield at least one (look-back + horizon) sliding window, without allocating large NumPy tensors. It is typically called immediately before prepare_pinn_data_sequences() or similar generators to "fail fast" on data shortages.
- Parameters:
df (pandas.DataFrame) – Tidy time-series table in long format. Every row represents one observation timestamp (and optionally one entity when group_id_cols is given). The function never mutates df.
time_col (str) – Column that defines temporal order inside each trajectory. Must be sortable; no other assumptions (numeric, datetime, …) are made.
group_id_cols (list of str or None, default None) – Column names that jointly identify independent trajectories (e.g. ["well_id"] or ["site", "layer_id"]). When None, the whole DataFrame is treated as a single group.
time_steps (int, default 12) – Look-back window \(T_\text{past}\) consumed by the encoder.
forecast_horizon (int, default 3) – Prediction horizon \(H\) produced by the decoder.
engine ({'vectorized', 'native', 'pyarrow'}, default 'vectorized') – Execution backend.
  - 'vectorized' – fastest; single DataFrame.groupby.size() call (C-level) plus NumPy math.
  - 'native' – reproduces the original Python loop for debuggability.
  - 'pyarrow' – forces pandas' Arrow backend, then runs the same vectorised logic; ~20 % faster on very wide frames when pyarrow ≥ 14 is installed.
mode ({'pihal_like', 'tft_like'} or None, optional) – Present only for API symmetry. Ignored: feasibility depends solely on time_steps + forecast_horizon.
logger (callable, default print()) – Sink for human-readable log messages. Must accept a single str.
verbose (int, default 0) – Verbosity level: 0 → silent, 1 → summary lines, 2 → per-group detail.
error ({'raise', 'warn', 'ignore'}, default 'warn') – Action when no group is long enough.
  - 'raise' – raise SequenceGeneratorError.
  - 'warn' – emit UserWarning, return False.
  - 'ignore' – stay silent, return False.
- Returns:
- Raises:
SequenceGeneratorError – Raised only when error='raise' and all groups fail the length check.
- Return type:
Notes
A group passes the check iff
\[\text{len(group)} \;\ge\; T_\text{past} + H \tag{1}\]
No validation of time-gaps, duplicates, or NaNs is performed; those are deferred to the full preparation routine.
The Arrow backend (engine='pyarrow') can accelerate very wide frames because each column is represented as a contiguous Arrow array with cheap zero-copy slicing.
Examples
Minimal usage
>>> from geoprior.utils.sequence_utils import check_sequence_feasibility
>>> ok, counts = check_sequence_feasibility(
...     df,
...     time_col="date",
...     group_id_cols=["site"],
...     time_steps=6,
...     forecast_horizon=3,
... )
>>> ok
True
>>> counts
{'A': 9, 'B': 9}
Fail-fast behaviour
>>> check_sequence_feasibility(
...     df_small,
...     time_col="t",
...     time_steps=10,
...     forecast_horizon=5,
...     error="raise",
... )
Traceback (most recent call last):
    ...
SequenceGeneratorError: No group is long enough ...
Switching engines
>>> _, _ = check_sequence_feasibility(
...     df,
...     time_col="ts",
...     group_id_cols=None,
...     engine="pyarrow",  # requires pandas 2.1+, pyarrow installed
...     verbose=1,
... )
✅ Feasible: 1 234 567 sequences possible.
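The length check from the Notes can be illustrated with plain pandas. This is a minimal sketch and not the library's implementation: the helper name `feasible_groups` and the toy frame are hypothetical, and only the rule "a group passes iff its length is at least time_steps + forecast_horizon" is taken from the text above.

```python
import pandas as pd


def feasible_groups(df, group_id_cols, time_steps, forecast_horizon):
    """Return (any_group_ok, {group: passed}) for len(group) >= T_past + H."""
    min_len = time_steps + forecast_horizon
    if group_id_cols:
        # One C-level pass over the frame: number of rows per group.
        sizes = df.groupby(group_id_cols).size()
    else:
        # No grouping columns: the whole frame is a single group.
        sizes = pd.Series({"<all>": len(df)})
    passed = sizes >= min_len
    return bool(passed.any()), passed.to_dict()


# Hypothetical toy data: site A has 10 rows, site B has only 4.
toy = pd.DataFrame(
    {
        "site": ["A"] * 10 + ["B"] * 4,
        "date": list(range(10)) + list(range(4)),
    }
)
ok, passed = feasible_groups(toy, ["site"], time_steps=6, forecast_horizon=3)
print(ok, passed)
```

Only row counts are inspected, so the check stays cheap even for large tables; gaps, duplicates, and NaNs are not examined, in line with the Notes.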
References
McKinney, W. pandas 2.0 User Guide, sec. “GroupBy: split-apply-combine’’.
Arrow Project. (2025). Arrow Columnar Memory Format v2.
- geoprior.utils.sequence_utils.get_sequence_counts(df, *, group_id_cols, min_len, engine='vectorized', verbose=0, logger=<built-in function print>)
Return the total number of feasible sliding-window sequences and a mapping group → count using the requested execution engine.
- Parameters:
engine ({'vectorized', 'native', 'pyarrow'}, default 'vectorized') – Execution backend.
  - 'vectorized' – fast C-level DataFrame.groupby.size() (recommended).
  - 'native' – original Python loop (easier to debug, slower).
  - 'pyarrow' – forces pandas' Arrow backend if available, then runs the vectorised path. Falls back silently to 'vectorized' when pyarrow is not installed.
df (DataFrame)
min_len (int)
verbose (int)
- Return type:
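As a rough illustration of what such a count involves, the sketch below assumes rolling windows with stride 1, so that a group with n rows yields max(n - min_len + 1, 0) sequences. That formula, the helper names, and the single grouping column are assumptions made for the example and are not taken from the library. It contrasts a vectorised groupby.size() path with a plain Python loop.

```python
import pandas as pd


def counts_vectorized(df, group_col, min_len):
    """Vectorised path: one groupby.size() call plus arithmetic on the result."""
    sizes = df.groupby(group_col).size()
    counts = (sizes - min_len + 1).clip(lower=0)
    return int(counts.sum()), {k: int(v) for k, v in counts.items()}


def counts_loop(df, group_col, min_len):
    """Loop path: same result, easier to step through in a debugger."""
    counts = {}
    for key, group in df.groupby(group_col):
        counts[key] = max(len(group) - min_len + 1, 0)
    return sum(counts.values()), counts


# Hypothetical toy data: 17 rows for site A, 5 rows for site B.
toy = pd.DataFrame({"site": ["A"] * 17 + ["B"] * 5})
total, per_group = counts_vectorized(toy, "site", min_len=9)
print(total, per_group)
```

Both helpers return the same pair for the same input; the loop version is slower but makes the per-group arithmetic explicit.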
- geoprior.utils.sequence_utils.generate_pinn_sequences(df, time_col, subsidence_col, gwl_col, dynamic_cols, static_cols=None, future_cols=None, spatial_cols=None, group_id_cols=None, time_steps=12, forecast_horizon=3, output_subsidence_dim=1, output_gwl_dim=1, mode='pihal_like', normalize_coords=True, cols_to_scale=None, method='rolling', stride=1, random_samples=None, expand_step=1, n_bootstrap=0, progress_hook=None, stop_check=None, verbose=1, _logger=None, **kwargs)
Generate input/target arrays for PINN models using various sampling methods (rolling, strided, random, expanding, bootstrap).
- Parameters:
df (pd.DataFrame) – Full time-series data.
time_col (str) – Name of the time coordinate column.
subsidence_col (str) – Name of the subsidence target column.
gwl_col (str) – Name of the groundwater level target column.
dynamic_cols (list[str]) – Names of past-covariate columns.
static_cols (list[str], optional) – Names of static feature columns.
future_cols (list[str], optional) – Names of known-future feature columns.
spatial_cols ((str, str), optional) – Tuple of (lon_col, lat_col) for spatial coords.
group_id_cols (list[str], optional) – Column(s) identifying independent time-series groups.
time_steps (int, default 12) – Look-back window length T.
forecast_horizon (int, default 3) – Prediction horizon H.
output_subsidence_dim (int, default 1) – Size of the last dimension of the subsidence target.
output_gwl_dim (int, default 1) – Size of the last dimension of the GWL target.
mode ({'pihal_like', 'tft_like'}, default 'pihal_like') – Shapes the "future" window length for TFT vs. PIHALNet.
normalize_coords (bool, default True) – Apply MinMax scaling to (t, x, y) across all sequences.
cols_to_scale (list[str] or 'auto' or None) – Additional columns to scale via MinMax.
method ({'rolling', 'strided', 'random', 'expanding', 'bootstrap'}) – Sequence-generation strategy.
stride (int, default 1) – Step size for 'strided' sampling.
random_samples (int, optional) – Number of random start indices for 'random' sampling.
expand_step (int, default 1) – Increment size for 'expanding' sampling.
n_bootstrap (int, default 0) – Number of blocks for 'bootstrap' sampling.
progress_hook (callable, optional) – Called with a float in [0, 1] to report overall progress.
stop_check (callable, optional) – If it returns True, sequence generation is aborted early.
verbose (int, default 1) – Verbosity level (higher = more logs).
_logger (logging.Logger or callable, optional) – Logger or print-style function for vlog().
**kwargs – Passed to helper.
- Returns:
inputs (dict[str, np.ndarray]) – Contains 'coords', 'dynamic_features', optionally 'static_features' and 'future_features'.
targets (dict[str, np.ndarray]) – Contains 'subsidence' and 'gwl' arrays.
coord_scaler (MinMaxScaler or None) – Fitted scaler for coords, if normalization was applied.
- Return type:
tuple[dict[str, ndarray], dict[str, ndarray], MinMaxScaler | None]
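One plausible reading of three of the sampling methods ('rolling', 'strided', 'random') is in terms of window start indices within a single group. The sketch below is illustrative only: the helper `window_starts` and its `seed` argument are hypothetical, and the library's exact semantics (including 'expanding' and 'bootstrap', which are not covered here) may differ.

```python
import numpy as np


def window_starts(n_rows, time_steps, forecast_horizon, method="rolling",
                  stride=1, random_samples=None, seed=0):
    """Start indices of windows spanning time_steps + forecast_horizon rows."""
    window = time_steps + forecast_horizon
    n_valid = n_rows - window + 1  # number of distinct start positions
    if n_valid <= 0:
        return np.empty(0, dtype=int)
    if method == "rolling":
        # Every admissible start position.
        return np.arange(n_valid)
    if method == "strided":
        # Every `stride`-th start position.
        return np.arange(0, n_valid, stride)
    if method == "random":
        # A random subset of start positions, drawn without replacement.
        rng = np.random.default_rng(seed)
        k = n_valid if random_samples is None else min(random_samples, n_valid)
        return np.sort(rng.choice(n_valid, size=k, replace=False))
    raise ValueError(f"method not covered by this sketch: {method!r}")


rolling = window_starts(20, time_steps=12, forecast_horizon=3)
strided = window_starts(20, 12, 3, method="strided", stride=2)
sampled = window_starts(20, 12, 3, method="random", random_samples=3)
print(rolling, strided, sampled)
```

With 20 rows and a 15-row window there are six admissible start positions, so 'strided' with stride=2 keeps every other one and 'random' draws a subset of them.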
- geoprior.utils.sequence_utils.generate_ts_sequences(df, time_col, dynamic_cols, static_cols=None, future_cols=None, spatial_cols=None, group_id_cols=None, time_steps=12, forecast_horizon=1, normalize_coords=True, cols_to_scale=None, method='rolling', stride=1, random_samples=None, expand_step=1, n_bootstrap=0, progress_hook=None, stop_check=None, verbose=1, _logger=None, **kwargs)
Generate time-series windows for encoder/decoder and covariates. Supports rolling, strided, random, expanding, and bootstrap.
- Parameters:
df (pd.DataFrame) – Input frame with time and feature columns.
time_col (str) – Name of the time coordinate column.
dynamic_cols (list[str]) – Past-covariate columns for encoder inputs.
static_cols (list[str] or None) – Static covariate columns, repeated per window.
future_cols (list[str] or None) – Known-future covariates for decoder inputs.
spatial_cols (tuple(str, str) or None) – (lon, lat) column names for spatial coords.
group_id_cols (list[str] or None) – Columns to group by for independent series.
time_steps (int) – Number of past steps (T) per window.
forecast_horizon (int) – Number of future steps (H) per window.
normalize_coords (bool) – If True, MinMax-scale spatial coords.
cols_to_scale (list[str] or 'auto' or None) – Other columns to MinMax-scale.
method (str) – One of 'rolling', 'strided', 'random', 'expanding', 'bootstrap'.
stride (int) – Step size for 'strided' windows.
random_samples (int or None) – Number of random windows if method='random'.
expand_step (int) – Increment for 'expanding' windows.
n_bootstrap (int) – Number of bootstrap samples if method='bootstrap'.
progress_hook (callable or None) – Receives a float in [0, 1] as work progresses.
stop_check (callable or None) – If it returns True, generation is aborted.
verbose (int) – Verbosity level; values > 0 log progress.
_logger (callable or None) – Logger to use for messages.
- Returns:
- Raises:
SequenceGeneratorError – If no valid windows could be generated.
- Return type:
tuple[dict[str, ndarray], dict[str, ndarray], MinMaxScaler | None]
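For a single group, rolling encoder/decoder windows can be sketched with NumPy's `sliding_window_view`. This is an illustration under stated assumptions rather than the library's code: the helper `rolling_windows` is hypothetical, and it assumes that past covariates cover the first T steps of each window and known-future covariates cover the following H steps.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_windows(dynamic, future, time_steps, forecast_horizon):
    """Return past windows (N, T, F_dyn) and future windows (N, H, F_fut)."""
    total = time_steps + forecast_horizon
    # sliding_window_view appends the window axis last: (N, F, total).
    # Transpose to the conventional (N, total, F) layout.
    dyn_win = sliding_window_view(dynamic, total, axis=0).transpose(0, 2, 1)
    fut_win = sliding_window_view(future, total, axis=0).transpose(0, 2, 1)
    past = dyn_win[:, :time_steps, :]   # encoder inputs
    ahead = fut_win[:, time_steps:, :]  # decoder inputs
    return past, ahead


# Hypothetical series: 20 time steps, 2 dynamic and 1 future feature.
dynamic = np.arange(40, dtype=float).reshape(20, 2)
future = np.arange(20, dtype=float).reshape(20, 1)
past, ahead = rolling_windows(dynamic, future, time_steps=12, forecast_horizon=3)
print(past.shape, ahead.shape)
```

The views share memory with the input arrays, so no data is copied until the windows are modified or stacked across groups.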
- geoprior.utils.sequence_utils.build_future_sequences_npz(df_scaled, *, time_col, time_col_num, lon_col, lat_col, time_steps, train_end_time=None, forecast_start_time=None, forecast_horizon=None, subs_col=None, gwl_col=None, h_field_col=None, static_features=None, dynamic_features=None, future_features=None, group_id_cols=None, mode=None, model_name=None, artifacts_dir=None, prefix='future', future_mode='auto', normalize_coords=False, coord_scaler=None, verbose=1, logger=None, stop_check=None, progress_hook=None, **kws)
Build history–future sequences and save them as compressed NPZ files.
This helper constructs, for each spatial group, a sliding window of time_steps "history" points followed by a multi-step forecast horizon and exports the resulting NumPy arrays to disk. It is time-agnostic: the time_col can be numeric (e.g. year, index), year-like floats, datetimes, or strings, as long as equality on that column is meaningful.
If train_end_time, forecast_start_time, or forecast_horizon are not provided, they are inferred from the sorted unique values in df_scaled[time_col]:
- train_end_time: by default the second-to-last unique time, leaving at least one future step.
- forecast_start_time: by default the first time strictly after train_end_time.
- forecast_horizon: by default one time step ahead, clipped to the number of available future points.
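The default inference rules above can be sketched as follows. The helper name `infer_forecast_window` is hypothetical, and the behaviour for edge cases (for example, fewer than two unique times) is an assumption made for the example, not something the documentation specifies.

```python
def infer_forecast_window(times, train_end_time=None,
                          forecast_start_time=None, forecast_horizon=None):
    """Fill in missing temporal settings from the sorted unique time values."""
    uniq = sorted(set(times))
    if train_end_time is None:
        if len(uniq) < 2:
            raise ValueError("need at least two unique times to infer a split")
        # Second-to-last unique time leaves at least one future step.
        train_end_time = uniq[-2]
    if forecast_start_time is None:
        later = [t for t in uniq if t > train_end_time]
        if not later:
            raise ValueError("no time value after train_end_time")
        # First time strictly after the end of training.
        forecast_start_time = later[0]
    if forecast_horizon is None:
        # Default of one step, clipped to the available future points.
        available = [t for t in uniq if t >= forecast_start_time]
        forecast_horizon = min(1, len(available))
    return train_end_time, forecast_start_time, forecast_horizon


years = [2018, 2019, 2019, 2020, 2021, 2022]
inferred = infer_forecast_window(years)
print(inferred)
```

Because the rules rely only on sorting and comparison, the same sketch works for integers, floats, datetimes, or ISO-formatted strings.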
For each valid group, the function builds:
- history dynamic features of shape (time_steps, n_dynamic);
- future features of shape (time_steps + H, n_future) when mode starts with "tft", or (H, n_future) otherwise;
- one static feature vector of shape (n_static,);
- coordinates over the horizon of shape (H, 3) with columns [time_num, lon, lat];
- an H_field array of shape (H, 1);
- optional subsidence and groundwater targets of shape (H, 1) each.
All per-group arrays are stacked along a new batch dimension and written as two NPZ files:
- <prefix>_inputs.npz: coordinates, dynamic, static, future features and H field.
- <prefix>_targets.npz: subsidence and groundwater targets.
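To make the shapes concrete, the sketch below builds placeholder arrays with the stacked shapes described above and writes them with `numpy.savez_compressed`. The array key names, the helper `future_shape`, and the temporary output directory are assumptions for illustration; the keys actually stored by the function are not documented here.

```python
import os
import tempfile

import numpy as np


def future_shape(time_steps, horizon, n_future, mode=None):
    """Per-group future-feature shape: history + horizon for 'tft*' modes."""
    if mode is not None and str(mode).lower().startswith("tft"):
        return (time_steps + horizon, n_future)
    return (horizon, n_future)


B, T, H = 4, 5, 2                 # groups, history steps, horizon
n_dyn, n_sta, n_fut = 3, 2, 1     # feature counts

# Key names below are hypothetical placeholders.
inputs = {
    "coords": np.zeros((B, H, 3)),  # columns: [time_num, lon, lat]
    "dynamic_features": np.zeros((B, T, n_dyn)),
    "static_features": np.zeros((B, n_sta)),
    "future_features": np.zeros((B, *future_shape(T, H, n_fut, "tft_like"))),
    "H_field": np.zeros((B, H, 1)),
}
targets = {
    "subsidence": np.full((B, H, 1), np.nan),
    "gwl": np.full((B, H, 1), np.nan),
}

out_dir = tempfile.mkdtemp()
inputs_path = os.path.join(out_dir, "future_inputs.npz")
targets_path = os.path.join(out_dir, "future_targets.npz")
np.savez_compressed(inputs_path, **inputs)
np.savez_compressed(targets_path, **targets)

# Reload to confirm what was written.
with np.load(inputs_path) as data:
    loaded_shapes = {key: data[key].shape for key in data.files}
print(loaded_shapes)
```

Reading the archive back with `np.load` is a quick way to verify batch size and per-array shapes before feeding the files to a model.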
- Parameters:
df_scaled (pandas.DataFrame) – Pre-processed (typically scaled) dataframe containing all required columns: time, spatial coordinates, static/dynamic/future features and optional targets.
time_col (str) – Name of the column encoding the temporal index (e.g. "year", "date", "t_index"). May be numeric, datetime, or string.
time_col_num (str or None) – Optional numeric time column used as a tie-breaker when multiple rows share the same time_col value. If provided and present in a group, the last row sorted by this column is selected for that time.
lon_col (str) – Name of the longitude (or x-coordinate) column.
lat_col (str) – Name of the latitude (or y-coordinate) column.
time_steps (int) – Length of the history window (number of time steps in the past). Must be strictly positive.
train_end_time (object, optional) – Effective end of the training period. If None, it is inferred as the second-to-last unique value in df_scaled[time_col] (after sorting).
forecast_start_time (object, optional) – First time step of the forecast horizon. If None, it is inferred as the first unique time strictly greater than train_end_time.
forecast_horizon (int, optional) – Number of future time steps to include. If None, a default horizon of 1 is used and clipped to the maximum number of available future time points.
subs_col (str, optional) – Name of the subsidence target column. If None or missing from a group, subsidence targets are filled with NaN.
gwl_col (str, optional) – Name of the groundwater-level target column. If None or missing from a group, groundwater targets are filled with NaN.
h_field_col (str, optional) – Name of the hydraulic-head field column used as an additional horizon-level input (H_field). If None or missing, a zero field is used.
static_features (list of str, optional) – Names of static (time-invariant) feature columns. Any names not present in the dataframe are silently ignored.
dynamic_features (list of str, optional) – Names of dynamic (history) feature columns used to build the (time_steps, n_dynamic) sequence. Missing columns are ignored.
future_features (list of str, optional) – Names of future covariate columns used to build the history+future or future-only sequence, depending on mode. Missing columns are ignored.
group_id_cols (list of str, optional) – Columns used to define spatial (or logical) groups, typically something like ["lon", "lat"] or a station identifier. If None or empty, the entire dataframe is treated as a single global group.
mode (str, optional) – Controls how future features are constructed. If the lower-cased value starts with "tft" (e.g. "tft_like"), future features are built on top of both history and future rows. Otherwise, only the forecast horizon rows are used.
model_name (str, optional) – Optional model identifier used only in logging messages.
artifacts_dir (str, optional) – Directory where NPZ files are written. If None or empty, the current working directory is used.
prefix (str, default "future") – Prefix for the output NPZ filenames: "<prefix>_inputs.npz" and "<prefix>_targets.npz".
future_mode ({'auto', 'pure-inference', 'pure-data-driven'}, default 'auto') – Strategy used to construct the future (forecast) portion of the sequences.
  - 'pure-data-driven': Use only time points that actually exist in df_scaled strictly after the history window. All future time indices must be present in the data; otherwise a ValueError is raised. This corresponds to the original, strictly data-driven behaviour.
  - 'pure-inference': Always synthesize future time points from the last history time, using the median positive time step (or 1.0 as a fallback). Future inputs are built by re-using the last available history row (for future_features, H_field, etc.), and future targets (e.g. subsidence, GWL) are filled with NaN since the true future is unknown. This mode does not require any rows beyond train_end_time.
  - 'auto': Try data-driven mode first. If there are enough actual future time points after train_end_time to cover the requested forecast_horizon, behave like 'pure-data-driven'. If not, automatically fall back to the synthetic 'pure-inference' behaviour described above and emit an informational log message via vlog.
verbose (int, default 1) – Verbosity level forwarded to geoprior.utils.vlog(). A value >= 3 provides detailed progress logs (temporal inference, per-group status, dropped groups, etc.).
logger (logging.Logger or callable, optional) – Optional logger or logging function used by geoprior.utils.vlog(). If None, messages are printed to standard output.
**kws – Reserved for future extensions. Currently ignored.
normalize_coords (bool)
coord_scaler (Any | None)
- Returns:
A small dictionary with the absolute paths to the written NPZ files: {"future_inputs_npz": <path>, "future_targets_npz": <path>}.
- Return type:
dict
- Raises:
ValueError – If there are not enough history points before train_end_time to satisfy time_steps, if no future points are available after forecast_start_time, or if all groups are dropped due to incomplete history/horizon windows.
Notes
Groups that do not contain all required history and future times are dropped without raising an error; the number of dropped groups is reported via geoprior.utils.vlog() when verbose > 0.
Examples
>>> from geoprior.nn.pinn.sequences import (
...     build_future_sequences_npz,
... )
>>> result = build_future_sequences_npz(
...     df_scaled=df_scaled,
...     time_col="year",
...     time_col_num="t_index",
...     lon_col="lon",
...     lat_col="lat",
...     time_steps=5,
...     # Let the function infer times/horizon:
...     train_end_time=None,
...     forecast_start_time=None,
...     forecast_horizon=None,
...     subs_col="subsidence",
...     gwl_col="gwl",
...     h_field_col="H_field",
...     static_features=["lithology_class"],
...     dynamic_features=["rainfall_mm", "GWL_depth_bgs_z"],
...     future_features=["normalized_urban_load_proxy"],
...     group_id_cols=["lon", "lat"],
...     mode="tft_like",
...     model_name="GeoPriorSubsNet",
...     artifacts_dir="results/zhongshan/future_npz",
...     prefix="zhongshan_future",
...     verbose=2,
... )
>>> result["future_inputs_npz"]
'results/zhongshan/future_npz/zhongshan_future_inputs.npz'
>>> result["future_targets_npz"]
'results/zhongshan/future_npz/zhongshan_future_targets.npz'
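The 'pure-inference' strategy described under future_mode can be illustrated for a numeric time axis. This is a sketch under assumptions, not the library's code: the helper `synthesize_future_times` is hypothetical, history times are assumed to be sorted, and the feature values are made up.

```python
import numpy as np


def synthesize_future_times(history_times, horizon):
    """Extend a sorted numeric time axis by `horizon` synthetic steps."""
    t = np.asarray(history_times, dtype=float)
    steps = np.diff(t)
    positive = steps[steps > 0]
    # Median positive step, or 1.0 when no positive step exists.
    step = float(np.median(positive)) if positive.size else 1.0
    return t[-1] + step * np.arange(1, horizon + 1)


horizon = 3
future_times = synthesize_future_times([2018, 2019, 2020, 2022], horizon)

# Future inputs re-use the last available history row (values are made up).
last_history_row = np.array([0.3, 0.7])
future_features = np.tile(last_history_row, (horizon, 1))

# The true future is unknown, so targets are NaN placeholders.
future_targets = np.full((horizon, 1), np.nan)
print(future_times, future_features.shape, future_targets.shape)
```

Using the median rather than the mean step keeps a single irregular gap in the history (here, 2020 to 2022) from stretching the synthetic time axis.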