geoprior.utils.sequence_utils

Sequence-building helpers for temporal model inputs.

Functions

build_future_sequences_npz(df_scaled, *, ...)

Build history–future sequences and save them as compressed NPZ files.

check_sequence_feasibility(df, *, time_col)

Quick pre-flight feasibility check for sliding-window sequence generation.

generate_pinn_sequences(df, time_col, ...[, ...])

Generate input/target arrays for PINN models using various sampling methods (rolling, strided, random, expanding, bootstrap).

generate_ts_sequences(df, time_col, dynamic_cols)

Generate time-series windows for encoder/decoder and covariates.

get_sequence_counts(df, *, group_id_cols, ...)

Return the total number of feasible sliding-window sequences and a mapping group → count using the requested execution engine.

Exceptions

SequenceGeneratorError

Raised when no sequence can be generated with the given settings.

geoprior.utils.sequence_utils.check_sequence_feasibility(df, *, time_col, group_id_cols=None, time_steps=12, forecast_horizon=3, engine='vectorized', mode=None, logger=print, verbose=0, error='warn')

Quick pre-flight feasibility check for sliding-window sequence generation.

Checks whether the input table is long enough, per group, to yield at least one (look-back + horizon) sliding window, without allocating large NumPy tensors. It is typically called immediately before prepare_pinn_data_sequences() or similar generators to “fail fast” on data shortages.

Parameters:
  • df (pandas.DataFrame) – Tidy time-series table in long format. Every row represents one observation timestamp (and optionally one entity when group_id_cols is given). The function never mutates df.

  • time_col (str) – Column that defines temporal order inside each trajectory. Must be sortable; no other assumptions (numeric, datetime, …) are made.

  • group_id_cols (list of str or None, default None) – Column names that jointly identify independent trajectories (e.g. ["well_id"] or ["site", "layer_id"]). When None the whole DataFrame is treated as a single group.

  • time_steps (int, default 12) – Look-back window \(T_\text{past}\) consumed by the encoder.

  • forecast_horizon (int, default 3) – Prediction horizon \(H\) produced by the decoder.

  • engine ({'vectorized', 'native', 'pyarrow'}, default 'vectorized') –

    • 'vectorized' – fastest; single DataFrame.groupby.size() call (C-level) plus NumPy math.

    • 'native' – reproduces the original Python loop for debuggability.

    • 'pyarrow' – forces pandas’ Arrow backend, then runs the same vectorised logic; ~20 % faster on very wide frames when pyarrow ≥ 14 is installed.

  • mode ({'pihal_like', 'tft_like'} or None, optional) – Present only for API symmetry. Ignored – feasibility depends solely on time_steps + forecast_horizon.

  • logger (callable, default print) – Sink for human-readable log messages. Must accept a single str.

  • verbose (int, default 0) – Verbosity level: 0 → silent, 1 → summary lines, 2 → per-group detail.

  • error ({'raise', 'warn', 'ignore'}, default 'warn') –

    Action when no group is long enough.

    • 'raise' – raise SequenceGeneratorError.

    • 'warn' – emit UserWarning, return False.

    • 'ignore' – stay silent, return False.

Returns:

  • feasible (bool) – True iff at least one sequence can be produced, otherwise False.

  • counts (dict) – Mapping group key → # sequences. The key is a tuple of the group values—or None when group_id_cols is None.

Raises:

SequenceGeneratorError – Raised only when error='raise' and all groups fail the length check.
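The three error policies can be pictured with a small stand-alone sketch. It does not use geoprior: the exception class below is a local stand-in with the same name, and `handle_infeasible` is a hypothetical helper introduced only to illustrate the documented behaviour when no group is long enough.

```python
import warnings


class SequenceGeneratorError(Exception):
    """Local stand-in for the library exception (illustration only)."""


def handle_infeasible(error="warn", message="No group is long enough."):
    """Hypothetical dispatcher for the documented 'raise'/'warn'/'ignore' policies."""
    if error == "raise":
        raise SequenceGeneratorError(message)
    if error == "warn":
        warnings.warn(message, UserWarning)
    elif error != "ignore":
        raise ValueError(f"Unknown error policy: {error!r}")
    # For 'warn' and 'ignore' the caller still receives a False feasibility flag.
    return False
```

With `error="ignore"` the sketch returns False without any message, while `error="warn"` returns False after emitting a UserWarning.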

Return type:

tuple[bool, dict[str | tuple, int]]

Notes

A group passes the check iff

\[\text{len(group)} \;\ge\; T_\text{past} + H \tag{1}\]

No validation of time-gaps, duplicates, or NaNs is performed; those are deferred to the full preparation routine.

The Arrow backend (engine='pyarrow') can accelerate very wide frames because each column is represented as a contiguous Arrow array with cheap zero-copy slicing.
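Rule (1) can be illustrated with a plain-Python sketch that is independent of geoprior. The helper name `count_windows` is hypothetical, and the per-group count `n - (T_past + H) + 1` is an assumption (stride-1 sliding windows) that the notes above do not state explicitly.

```python
from collections import Counter


def count_windows(group_keys, time_steps=12, forecast_horizon=3):
    """Hypothetical helper: apply rule (1) to per-group row counts."""
    min_len = time_steps + forecast_horizon
    sizes = Counter(group_keys)  # number of rows per group
    # Assumption: stride-1 windows, so n rows give n - min_len + 1 windows.
    counts = {key: max(0, n - min_len + 1) for key, n in sizes.items()}
    feasible = any(count > 0 for count in counts.values())
    return feasible, counts


# Two groups with 17 rows each and one short group with 5 rows.
keys = ["A"] * 17 + ["B"] * 17 + ["C"] * 5
feasible, counts = count_windows(keys, time_steps=6, forecast_horizon=3)
print(feasible, counts)
```

Under that assumption, a group needs exactly 6 + 3 = 9 rows to yield its first window, and every additional row adds one more.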

Examples

  • Minimal usage

>>> from geoprior.utils.sequence_utils import check_sequence_feasibility
>>> ok, counts = check_sequence_feasibility(
...     df,
...     time_col="date",
...     group_id_cols=["site"],
...     time_steps=6,
...     forecast_horizon=3,
... )
>>> ok
True
>>> counts
{'A': 9, 'B': 9}

  • Fail-fast behaviour

>>> check_sequence_feasibility(
...     df_small,
...     time_col="t",
...     time_steps=10,
...     forecast_horizon=5,
...     error="raise",
... )
Traceback (most recent call last):
...
SequenceGeneratorError: No group is long enough ...

  • Switching engines

>>> _, _ = check_sequence_feasibility(
...     df,
...     time_col="ts",
...     group_id_cols=None,
...     engine="pyarrow",   # requires pandas 2.1+, pyarrow installed
...     verbose=1,
... )
✅ Feasible: 1 234 567 sequences possible.


geoprior.utils.sequence_utils.get_sequence_counts(df, *, group_id_cols, min_len, engine='vectorized', verbose=0, logger=print)

Return the total number of feasible sliding-window sequences and a mapping group → count using the requested execution engine.

Parameters:
  • engine ({'vectorized', 'native', 'pyarrow'}, default 'vectorized') –

    Execution backend.

    • 'vectorized' – fast C-level DataFrame.groupby.size() (recommended).

    • 'native' – original Python loop (easier to debug, slower).

    • 'pyarrow' – forces pandas’ Arrow backend if available, then runs the vectorised path. Falls back silently to 'vectorized' when pyarrow is not installed.

  • df (DataFrame)

  • group_id_cols (list[str] | None)

  • min_len (int)

  • verbose (int)

Return type:

tuple[int, dict[str | tuple, int], Series]
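The silent fallback described for engine='pyarrow' can be sketched as follows. This is not the library's code: `resolve_engine` is a hypothetical helper, and probing the installation with `importlib.util.find_spec` is just one common way to test whether an optional dependency is available.

```python
import importlib.util

VALID_ENGINES = ("vectorized", "native", "pyarrow")


def resolve_engine(engine="vectorized"):
    """Hypothetical helper: pick the engine that will actually run."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"engine must be one of {VALID_ENGINES}, got {engine!r}")
    # Fall back to the vectorised path when pyarrow cannot be imported.
    if engine == "pyarrow" and importlib.util.find_spec("pyarrow") is None:
        return "vectorized"
    return engine


print(resolve_engine("pyarrow"))  # 'pyarrow' if installed, else 'vectorized'
```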

geoprior.utils.sequence_utils.generate_pinn_sequences(df, time_col, subsidence_col, gwl_col, dynamic_cols, static_cols=None, future_cols=None, spatial_cols=None, group_id_cols=None, time_steps=12, forecast_horizon=3, output_subsidence_dim=1, output_gwl_dim=1, mode='pihal_like', normalize_coords=True, cols_to_scale=None, method='rolling', stride=1, random_samples=None, expand_step=1, n_bootstrap=0, progress_hook=None, stop_check=None, verbose=1, _logger=None, **kwargs)

Generate input/target arrays for PINN models using various sampling methods (rolling, strided, random, expanding, bootstrap).

Parameters:
  • df (pd.DataFrame) – Full time-series data.

  • time_col (str) – Name of the time coordinate column.

  • subsidence_col (str) – Name of the subsidence target column.

  • gwl_col (str) – Name of the groundwater level target column.

  • dynamic_cols (list[str]) – Names of past-covariate columns.

  • static_cols (list[str], optional) – Names of static feature columns.

  • future_cols (list[str], optional) – Names of known-future feature columns.

  • spatial_cols ((str, str), optional) – Tuple of (lon_col, lat_col) for spatial coords.

  • group_id_cols (list[str], optional) – Column(s) identifying independent time-series groups.

  • time_steps (int, default 12) – Look-back window length T.

  • forecast_horizon (int, default 3) – Prediction horizon H.

  • output_subsidence_dim (int, default 1) – Last-dim of subsidence target.

  • output_gwl_dim (int, default 1) – Last-dim of GWL target.

  • mode ({'pihal_like','tft_like'}, default 'pihal_like') – Shapes the “future” window length for TFT vs. PIHALNet.

  • normalize_coords (bool, default True) – Apply MinMax scaling to (t,x,y) across all sequences.

  • cols_to_scale (list[str] or 'auto' or None) – Additional columns to scale via MinMax.

  • method ({'rolling','strided','random','expanding','bootstrap'}) – Sequence-generation strategy.

  • stride (int, default 1) – Step size for ‘strided’ sampling.

  • random_samples (int, optional) – Number of random start indices for ‘random’ sampling.

  • expand_step (int, default 1) – Increment size for ‘expanding’ sampling.

  • n_bootstrap (int, default 0) – Number of blocks for ‘bootstrap’ sampling.

  • progress_hook (callable, optional) – Called with float in [0,1] to report overall progress.

  • stop_check (callable, optional) – If it returns True, sequence generation is aborted early.

  • verbose (int, default 1) – Verbosity level (higher = more logs).

  • _logger (logging.Logger or callable, optional) – Logger or print‐style function for vlog().

  • **kwargs – Passed to helper.

Returns:

  • inputs (dict[str, np.ndarray]) – Contains ‘coords’, ‘dynamic_features’, optionally ‘static_features’ and ‘future_features’.

  • targets (dict[str, np.ndarray]) – Contains ‘subsidence’ and ‘gwl’ arrays.

  • coord_scaler (MinMaxScaler or None) – Fitted scaler for coords, if normalization was applied.
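The effect of normalize_coords=True can be approximated with a NumPy-only sketch of min-max scaling over all sequences at once. It does not use the MinMaxScaler returned by the function; the helper name `minmax_scale_coords` and the assumed array layout `(n_sequences, n_steps, 3)` with columns (t, x, y) are illustrative assumptions.

```python
import numpy as np


def minmax_scale_coords(coords):
    """Scale each of the (t, x, y) columns to [0, 1] across all sequences."""
    flat = coords.reshape(-1, coords.shape[-1])
    lo = flat.min(axis=0)
    hi = flat.max(axis=0)
    # Constant columns get a span of 1.0 to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (coords - lo) / span, lo, hi


# Assumed layout: 2 sequences, 2 steps each, columns (t, x, y).
coords = np.array([
    [[2020.0, 113.0, 22.0], [2021.0, 113.0, 22.0]],
    [[2020.0, 113.5, 22.5], [2021.0, 113.5, 22.5]],
])
scaled, lo, hi = minmax_scale_coords(coords)
print(scaled.shape, scaled.min(), scaled.max())
```

Keeping `lo` and `hi` plays the role of the fitted scaler: the same bounds can be reused to transform coordinates at inference time.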

Return type:

tuple[dict[str, ndarray], dict[str, ndarray], MinMaxScaler | None]
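How the sampling method affects which windows are taken can be sketched by listing window start indices for one group. The helper `window_starts` is hypothetical and covers only 'rolling', 'strided', and 'random'; 'expanding' and 'bootstrap' are left out because their exact semantics are not specified above. The `seed` argument is an addition for reproducibility and is not part of the documented signature.

```python
import random


def window_starts(n_rows, time_steps, forecast_horizon, method="rolling",
                  stride=1, random_samples=None, seed=0):
    """Hypothetical helper: start indices of (look-back + horizon) windows."""
    last_start = n_rows - (time_steps + forecast_horizon)
    if last_start < 0:
        return []  # the group is too short for a single window
    all_starts = list(range(last_start + 1))
    if method == "rolling":
        return all_starts
    if method == "strided":
        return all_starts[::stride]
    if method == "random":
        k = len(all_starts) if random_samples is None else min(random_samples, len(all_starts))
        return sorted(random.Random(seed).sample(all_starts, k))
    raise NotImplementedError(f"method {method!r} is not covered by this sketch")


print(window_starts(20, 12, 3))                             # every start index
print(window_starts(20, 12, 3, method="strided", stride=2))  # every second one
```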

geoprior.utils.sequence_utils.generate_ts_sequences(df, time_col, dynamic_cols, static_cols=None, future_cols=None, spatial_cols=None, group_id_cols=None, time_steps=12, forecast_horizon=1, normalize_coords=True, cols_to_scale=None, method='rolling', stride=1, random_samples=None, expand_step=1, n_bootstrap=0, progress_hook=None, stop_check=None, verbose=1, _logger=None, **kwargs)

Generate time-series windows for encoder/decoder and covariates. Supports rolling, strided, random, expanding, and bootstrap.

Parameters:
  • df (pd.DataFrame) – Input frame with time and feature columns.

  • time_col (str) – Name of the time coordinate column.

  • dynamic_cols (list[str]) – Past-covariate columns for encoder inputs.

  • static_cols (list[str] or None) – Static covariate columns, repeated per window.

  • future_cols (list[str] or None) – Known-future covariates for decoder inputs.

  • spatial_cols (tuple(str,str) or None) – (lon, lat) column names for spatial coords.

  • group_id_cols (list[str] or None) – Columns to group by for independent series.

  • time_steps (int) – Number of past steps (T) per window.

  • forecast_horizon (int) – Number of future steps (H) per window.

  • normalize_coords (bool) – If True, MinMax-scale spatial coords.

  • cols_to_scale (list[str] or 'auto' or None) – Other columns to MinMax-scale.

  • method (str) – One of 'rolling', 'strided', 'random', 'expanding', 'bootstrap'.

  • stride (int) – Step size for 'strided' windows.

  • random_samples (int or None) – Number of random windows if method='random'.

  • expand_step (int) – Increment for 'expanding' windows.

  • n_bootstrap (int) – Number of bootstrap samples if method='bootstrap'.

  • progress_hook (callable or None) – Receives a float in [0, 1] as work progresses.

  • stop_check (callable or None) – If it returns True, generation is aborted.

  • verbose (int) – Verbosity level. >0 logs progress.

  • _logger (callable or None) – Logger to use for messages.

Returns:

  • inputs (dict of np.ndarray) – Keys 'encoder_inputs', 'static', 'future', 'coords'.

  • targets (dict of np.ndarray) – ‘decoder_targets’.

  • coord_scaler (MinMaxScaler or None) – Fitted scaler for coords, if normalized.

Raises:

SequenceGeneratorError – If no valid windows could be generated.

Return type:

tuple[dict[str, ndarray], dict[str, ndarray], MinMaxScaler | None]
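The progress_hook and stop_check callbacks follow a common cooperative-cancellation pattern, sketched below without geoprior. The loop body is a placeholder for building one window, and `run_with_hooks` is a hypothetical name; only the callback contracts (a float in [0, 1], and a zero-argument callable returning a bool) come from the parameter descriptions above.

```python
def run_with_hooks(n_items, progress_hook=None, stop_check=None):
    """Hypothetical loop that reports progress and honours a stop request."""
    done = 0
    for _ in range(n_items):
        if stop_check is not None and stop_check():
            break  # abort early, keeping what was completed so far
        done += 1  # placeholder for building one window
        if progress_hook is not None:
            progress_hook(done / n_items)
    return done


seen = []
calls = {"n": 0}


def stop_after_three():
    # Request a stop once three items have been allowed through.
    calls["n"] += 1
    return calls["n"] > 3


completed = run_with_hooks(5, progress_hook=seen.append, stop_check=stop_after_three)
print(completed, seen)
```

In a GUI or notebook setting, `progress_hook` would typically update a progress bar and `stop_check` would read a cancel flag.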

geoprior.utils.sequence_utils.build_future_sequences_npz(df_scaled, *, time_col, time_col_num, lon_col, lat_col, time_steps, train_end_time=None, forecast_start_time=None, forecast_horizon=None, subs_col=None, gwl_col=None, h_field_col=None, static_features=None, dynamic_features=None, future_features=None, group_id_cols=None, mode=None, model_name=None, artifacts_dir=None, prefix='future', future_mode='auto', normalize_coords=False, coord_scaler=None, verbose=1, logger=None, stop_check=None, progress_hook=None, **kws)

Build history–future sequences and save them as compressed NPZ files.

This helper constructs, for each spatial group, a sliding window of time_steps “history” points followed by a multi-step forecast horizon and exports the resulting NumPy arrays to disk. It is time-agnostic: the time_col can be numeric (e.g. year, index), year-like floats, datetimes, or strings, as long as equality on that column is meaningful.

If train_end_time, forecast_start_time, or forecast_horizon are not provided, they are inferred from the sorted unique values in df_scaled[time_col]:

  • train_end_time: by default the second-to-last unique time, leaving at least one future step.

  • forecast_start_time: by default the first time strictly after train_end_time.

  • forecast_horizon: by default one time step ahead, clipped to the number of available future points.
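The three defaults above can be sketched in plain Python. `infer_time_defaults` is a hypothetical helper that works on any sortable time values; the check for at least two unique times and the choice to clip only the default horizon are assumptions of this sketch, not documented behaviour.

```python
def infer_time_defaults(times, train_end_time=None, forecast_start_time=None,
                        forecast_horizon=None):
    """Hypothetical helper mirroring the documented inference rules."""
    unique_times = sorted(set(times))
    if len(unique_times) < 2:
        raise ValueError("Need at least two unique times to infer defaults.")
    if train_end_time is None:
        # Second-to-last unique time, leaving at least one future step.
        train_end_time = unique_times[-2]
    if forecast_start_time is None:
        later = [t for t in unique_times if t > train_end_time]
        if not later:
            raise ValueError("No time point after train_end_time.")
        forecast_start_time = later[0]
    if forecast_horizon is None:
        # Default of one step, clipped to the available future points.
        available = [t for t in unique_times if t >= forecast_start_time]
        forecast_horizon = min(1, len(available))
    return train_end_time, forecast_start_time, forecast_horizon


years = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
print(infer_time_defaults(years))
```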

For each valid group, the function builds:

  • history dynamic features of shape (time_steps, n_dynamic);

  • future features of shape (time_steps + H, n_future) when mode starts with "tft", or (H, n_future) otherwise;

  • one static feature vector of shape (n_static,);

  • coordinates over the horizon of shape (H, 3) with columns [time_num, lon, lat];

  • an H_field array of shape (H, 1);

  • optional subsidence and groundwater targets of shape (H, 1) each.
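The mode-dependent length of the future-feature window reduces to a single rule, shown here as a sketch. `future_window_length` is a hypothetical helper; the case-insensitive prefix test follows the description of the mode parameter.

```python
def future_window_length(time_steps, forecast_horizon, mode=None):
    """Hypothetical helper: number of rows in the future-feature window."""
    if mode is not None and str(mode).lower().startswith("tft"):
        # TFT-like modes see future covariates over history and horizon.
        return time_steps + forecast_horizon
    # All other modes see future covariates over the horizon only.
    return forecast_horizon


print(future_window_length(5, 3, mode="tft_like"))    # 8
print(future_window_length(5, 3, mode="pihal_like"))  # 3
```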

All per-group arrays are stacked along a new batch dimension and written as two NPZ files:

  • <prefix>_inputs.npz: coordinates, dynamic, static, future features and H field.

  • <prefix>_targets.npz: subsidence and groundwater targets.
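The stacking and export step can be pictured with NumPy's `savez_compressed`. This is only a sketch: the array key names below ('coords', 'dynamic_features', 'static_features', 'future_features', 'H_field', 'subsidence', 'gwl') are assumptions, since the keys stored in the real files are not listed here, and the arrays are filled with placeholder values.

```python
import os
import tempfile

import numpy as np

n_groups, T, H = 4, 5, 3
n_dynamic, n_static, n_future = 2, 1, 1

# Per-group arrays stacked along a leading batch dimension (assumed key names).
inputs = {
    "coords": np.zeros((n_groups, H, 3)),
    "dynamic_features": np.zeros((n_groups, T, n_dynamic)),
    "static_features": np.zeros((n_groups, n_static)),
    "future_features": np.zeros((n_groups, H, n_future)),
    "H_field": np.zeros((n_groups, H, 1)),
}
targets = {
    "subsidence": np.full((n_groups, H, 1), np.nan),
    "gwl": np.full((n_groups, H, 1), np.nan),
}

out_dir = tempfile.mkdtemp()
inputs_path = os.path.join(out_dir, "future_inputs.npz")
targets_path = os.path.join(out_dir, "future_targets.npz")
np.savez_compressed(inputs_path, **inputs)
np.savez_compressed(targets_path, **targets)

# Reading the archive back gives one array per key.
with np.load(inputs_path) as loaded:
    shapes = {name: loaded[name].shape for name in loaded.files}
print(shapes)
```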

Parameters:
  • df_scaled (pandas.DataFrame) – Pre-processed (typically scaled) dataframe containing all required columns: time, spatial coordinates, static/dynamic/future features and optional targets.

  • time_col (str) – Name of the column encoding the temporal index (e.g. "year", "date", "t_index"). May be numeric, datetime, or string.

  • time_col_num (str or None) – Optional numeric time column used as a tie-breaker when multiple rows share the same time_col value. If provided and present in a group, the last row sorted by this column is selected for that time.

  • lon_col (str) – Name of the longitude (or x-coordinate) column.

  • lat_col (str) – Name of the latitude (or y-coordinate) column.

  • time_steps (int) – Length of the history window (number of time steps in the past). Must be strictly positive.

  • train_end_time (object, optional) – Effective end of the training period. If None, it is inferred as the second-to-last unique value in df_scaled[time_col] (after sorting).

  • forecast_start_time (object, optional) – First time step of the forecast horizon. If None, it is inferred as the first unique time strictly greater than train_end_time.

  • forecast_horizon (int, optional) – Number of future time steps to include. If None, a default horizon of 1 is used and clipped to the maximum number of available future time points.

  • subs_col (str, optional) – Name of the subsidence target column. If None or missing from a group, subsidence targets are filled with NaN.

  • gwl_col (str, optional) – Name of the groundwater-level target column. If None or missing from a group, groundwater targets are filled with NaN.

  • h_field_col (str, optional) – Name of the hydraulic-head field column used as an additional horizon-level input (H_field). If None or missing, a zero field is used.

  • static_features (list of str, optional) – Names of static (time-invariant) feature columns. Any names not present in the dataframe are silently ignored.

  • dynamic_features (list of str, optional) – Names of dynamic (history) feature columns used to build the (time_steps, n_dynamic) sequence. Missing columns are ignored.

  • future_features (list of str, optional) – Names of future covariate columns used to build the history+future or future-only sequence, depending on mode. Missing columns are ignored.

  • group_id_cols (list of str, optional) – Columns used to define spatial (or logical) groups, typically something like ["lon", "lat"] or a station identifier. If None or empty, the entire dataframe is treated as a single global group.

  • mode (str, optional) – Controls how future features are constructed. If the lower-cased value starts with "tft" (e.g. "tft_like"), future features are built on top of both history and future rows. Otherwise, only the forecast horizon rows are used.

  • model_name (str, optional) – Optional model identifier used only in logging messages.

  • artifacts_dir (str, optional) – Directory where NPZ files are written. If None or empty, the current working directory is used.

  • prefix (str, default "future") – Prefix for the output NPZ filenames: "<prefix>_inputs.npz" and "<prefix>_targets.npz".

  • future_mode ({'auto', 'pure-inference', 'pure-data-driven'}, default 'auto') –

    Strategy used to construct the future (forecast) portion of the sequences.

    • 'pure-data-driven': Use only time points that actually exist in df_scaled strictly after the history window. All future time indices must be present in the data; otherwise a ValueError is raised. This corresponds to the original, strictly data-driven behaviour.

    • 'pure-inference': Always synthesize future time points from the last history time, using the median positive time step (or 1.0 as a fallback). Future inputs are built by re-using the last available history row (for future_features, H_field, etc.), and future targets (e.g. subsidence, GWL) are filled with NaN since the true future is unknown. This mode does not require any rows beyond train_end_time.

    • 'auto': Try data-driven mode first. If there are enough actual future time points after train_end_time to cover the requested forecast_horizon, behave like 'pure-data-driven'. If not, automatically fall back to the synthetic 'pure-inference' behaviour described above and emit an informational log message via vlog.

  • verbose (int, default 1) – Verbosity level forwarded to geoprior.utils.vlog(). A value >= 3 provides detailed progress logs (temporal inference, per-group status, dropped groups, etc.).

  • logger (logging.Logger or callable, optional) – Optional logger or logging function used by geoprior.utils.vlog(). If None, messages are printed to standard output.

  • **kws – Reserved for future extensions. Currently ignored.

  • normalize_coords (bool)

  • coord_scaler (Any | None)

  • stop_check (Callable[[], bool])

  • progress_hook (Callable[[float], None] | None)

Returns:

A small dictionary with the absolute paths to the written NPZ files:

{"future_inputs_npz": <path>, "future_targets_npz": <path>}.

Return type:

dict

Raises:

ValueError – If there are not enough history points before train_end_time to satisfy time_steps, if no future points are available after forecast_start_time, or if all groups are dropped due to incomplete history/horizon windows.

Notes

Groups that do not contain all required history and future times are dropped without raising an error; the number of dropped groups is reported via geoprior.utils.vlog() when verbose > 0.
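The three future_mode strategies can be sketched for numeric time values. `future_times` is a hypothetical helper, and computing the median positive step from the history times up to train_end_time is an assumption; the documentation only says that the median positive time step is used, with 1.0 as a fallback.

```python
from statistics import median


def future_times(unique_times, train_end_time, horizon, future_mode="auto"):
    """Hypothetical helper: pick or synthesize the forecast time points."""
    if future_mode not in ("auto", "pure-inference", "pure-data-driven"):
        raise ValueError(f"Unknown future_mode: {future_mode!r}")
    times = sorted(unique_times)
    history = [t for t in times if t <= train_end_time]
    actual = [t for t in times if t > train_end_time]
    if not history:
        raise ValueError("No history available up to train_end_time.")

    # Data-driven path: use real future times when enough of them exist.
    if future_mode != "pure-inference" and len(actual) >= horizon:
        return actual[:horizon], "data"
    if future_mode == "pure-data-driven":
        raise ValueError("Not enough future time points in the data.")

    # Synthetic path: extend from the last history time by the median step.
    steps = [b - a for a, b in zip(history, history[1:]) if b - a > 0]
    step = median(steps) if steps else 1.0
    last = history[-1]
    return [last + step * (k + 1) for k in range(horizon)], "synthetic"


years = [2018, 2019, 2020, 2021, 2022]
print(future_times(years, 2020, 2))  # enough real future points
print(future_times(years, 2020, 3))  # 'auto' falls back to synthetic times
```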

Examples

>>> from geoprior.utils.sequence_utils import (
...     build_future_sequences_npz,
... )
>>> result = build_future_sequences_npz(
...     df_scaled=df_scaled,
...     time_col="year",
...     time_col_num="t_index",
...     lon_col="lon",
...     lat_col="lat",
...     time_steps=5,
...     # Let the function infer times/horizon:
...     train_end_time=None,
...     forecast_start_time=None,
...     forecast_horizon=None,
...     subs_col="subsidence",
...     gwl_col="gwl",
...     h_field_col="H_field",
...     static_features=["lithology_class"],
...     dynamic_features=["rainfall_mm", "GWL_depth_bgs_z"],
...     future_features=["normalized_urban_load_proxy"],
...     group_id_cols=["lon", "lat"],
...     mode="tft_like",
...     model_name="GeoPriorSubsNet",
...     artifacts_dir="results/zhongshan/future_npz",
...     prefix="zhongshan_future",
...     verbose=2,
... )
>>> result["future_inputs_npz"]
'results/zhongshan/future_npz/zhongshan_future_inputs.npz'
>>> result["future_targets_npz"]
'results/zhongshan/future_npz/zhongshan_future_targets.npz'