geoprior.utils.data_utils

Data utilities.

Functions

`mask_by_reference`(data, ref_col[, values, ...])	Masks (replaces) values in columns other than the reference column for rows in which the reference column matches (or is closest to) the specified value(s).
`nan_ops`(data[, auxi_data, data_kind, ops, ...])	Perform operations on NaN values within data structures, handling both primary data and optional witness data based on specified parameters.
`pop_labels_in`(df, columns, labels[, ...])	Remove specific categories (labels) from columns in a dataframe.
`widen_temporal_columns`(data, dt_col[, ...])	Convert a long PIHALNet prediction table into a wide format where each temporal slice becomes a dedicated column.

geoprior.utils.data_utils.mask_by_reference(data, ref_col, values=None, find_closest=False, fill_value=0, mask_columns=None, error='raise', verbose=0, inplace=False, savefile=None)

Masks (replaces) values in columns other than the reference column for rows in which the reference column matches (or is closest to) the specified value(s).

If a row’s reference-column value is matched, that row’s values in the other columns are overwritten by fill_value. The reference column itself is not modified.

This function supports both exact and approximate matching:

Exact matching is used if find_closest=False.
Approximate (closest) matching is used if find_closest=True and the reference column is numeric.

By default, if the reference column does not exist or if the given values cannot be found (or approximated) in the reference column, an exception is raised. This behavior can be adjusted with the error parameter.
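The two matching modes can be illustrated with plain pandas. This is only a minimal sketch of the idea, not the library's implementation: the helper name `match_rows` is hypothetical, and the tie handling (every equally close row matches) is an assumption.

```python
import pandas as pd


def match_rows(ref: pd.Series, value, find_closest: bool = False) -> pd.Series:
    """Return a boolean mask of rows whose reference value matches `value`.

    Hypothetical helper, written only to illustrate the matching rules.
    """
    if find_closest and pd.api.types.is_numeric_dtype(ref):
        # Approximate matching: keep the row(s) at minimal absolute distance.
        distance = (ref - value).abs()
        return distance == distance.min()
    # Exact matching, also used as the fallback for non-numeric columns.
    return ref == value


df = pd.DataFrame({"A": [10, 0, 8, 0], "B": [2, 0.5, 18, 85]})

exact = match_rows(df["A"], 0)                       # rows where A == 0
closest = match_rows(df["A"], 9, find_closest=True)  # rows where A is 8 or 10

masked = df.copy()
masked.loc[closest, ["B"]] = -999  # the reference column "A" is left untouched
```

In this sketch a search for 9 selects both A=8 and A=10 because they are equally distant; how the real function resolves such ties should be checked against its source.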

Parameters:

data (pd.DataFrame) – The input DataFrame containing the data to be masked.
ref_col (str) – The column in data serving as the reference for matching or finding the closest values.
values (Any or sequence of Any, optional) –
The reference values to look for in ref_col. This can be:
- A single value (e.g., 0 or "apple").
- A list/tuple of values (e.g., [0, 10, 25]).
- If values is None, all rows are masked (i.e. all rows match), effectively overwriting the entire DataFrame (except the reference column) with fill_value.
Note that if find_closest=False, these values must appear in the reference column; otherwise, an error or warning is triggered (depending on the error setting).
find_closest (bool, default False) – If True, performs an approximate match for numeric reference columns. For each entry in values, the function locates the row(s) in ref_col whose value is numerically closest. Non-numeric reference columns fall back to exact matching regardless of this setting.
fill_value (Any, default 0) –
The value used to fill/mask the non-reference columns wherever the condition (exact or approximate match) is met. This can be any valid type, e.g., integer, float, string, np.nan, etc. If fill_value='auto' and multiple values are given, each row matched by a particular reference value is filled with that same reference value.
Examples:
- If values=9 and fill_value='auto', the fill value is 9 for matched rows.
- If values=['a', 10] and fill_value='auto', then rows matching ‘a’ are filled with ‘a’, and rows matching 10 are filled with 10.
mask_columns (str or list of str, optional) – If specified, only these columns are masked. If None, all columns except ref_col are masked. If any column in mask_columns does not exist in the DataFrame and error='raise', a KeyError is raised; otherwise, a warning may be issued or ignored.
error ({'raise', 'warn', 'ignore'}, default 'raise') –
Controls how to handle errors:
- 'raise': raise an error if the reference column does not exist or if any of the given values cannot be matched (or approximated).
- 'warn': only issue a warning instead of raising an error.
- 'ignore': silently ignore any issues.
verbose (int, default 0) –
Verbosity level:
- 0: silent (no messages).
- 1: minimal feedback.
- 2 or 3: more detailed messages for debugging.
inplace (bool, default False) – If True, performs the operation in place and returns the original DataFrame with modifications. If False, returns a modified copy, leaving the original unaltered.
savefile (str or None, optional) – File path where the DataFrame is saved if the decorator-based saving is active. If None, no saving occurs.

Returns:

A DataFrame where rows matching the specified condition (exact or approximate) have had their non-reference columns replaced by fill_value.

Return type:

pd.DataFrame

Raises:

KeyError – If error='raise' and ref_col is not in data.columns.
ValueError – If error='raise' and no exact/approx match can be found for one or more entries in values.

Notes

If values is None, all rows are masked in the non-ref columns, effectively overwriting them with fill_value.
When find_closest=True, approximate matching is performed only if the reference column is numeric. For non-numeric data, it falls back to exact matching.
When multiple reference values are provided, each is processed in turn. If fill_value='auto', each matched row is filled with that specific reference value.
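The fill_value='auto' behaviour described in these notes can be sketched in plain pandas as follows. The function `auto_fill_sketch` is a hypothetical stand-in that assumes exact matching only; it is not the library's code.

```python
import pandas as pd


def auto_fill_sketch(df: pd.DataFrame, ref_col: str, values) -> pd.DataFrame:
    """Fill the non-reference columns of matched rows with the matched value.

    Hypothetical illustration of fill_value='auto' (exact matching only).
    """
    out = df.copy()  # mirrors inplace=False: the input is left unaltered
    other_cols = [c for c in out.columns if c != ref_col]
    for value in values:  # each reference value is processed in turn
        out.loc[out[ref_col] == value, other_cols] = value
    return out


df = pd.DataFrame({
    "A": [10, 0, 8, 0],
    "B": [2, 0.5, 18, 85],
    "C": [34, 0.8, 12, 4.5],
})
res = auto_fill_sketch(df, "A", [0, 8])
```

Rows with A=0 end up with 0 in B and C, the row with A=8 ends up with 8, and the row with A=10 is unchanged.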

Examples

>>> import pandas as pd
>>> from geoprior.utils.data_utils import mask_by_reference
>>>
>>> df = pd.DataFrame({
...     "A": [10, 0, 8, 0],
...     "B": [2, 0.5, 18, 85],
...     "C": [34, 0.8, 12, 4.5],
...     "D": [0, 78, 25, 3.2]
... })
>>>
>>> # Example 1: Exact matching, replace all columns except 'A' with 0
>>> masked_df = mask_by_reference(
...     data=df,
...     ref_col="A",
...     values=0,
...     fill_value=0,
...     find_closest=False,
...     error="raise"
... )
>>> print(masked_df)
>>> # 'B', 'C', 'D' for rows where A=0 are replaced with 0.
>>>
>>> # Example 2: Approximate matching for a numeric ref_col
>>> # 9 lies between 8 and 10, so rows with A=8 and A=10 are the closest
>>> # and get masked in the non-reference columns.
>>> masked_df2 = mask_by_reference(
...     data=df,
...     ref_col="A",
...     values=9,
...     find_closest=True,
...     fill_value=-999
... )
>>> print(masked_df2)
... # Rows 0 (A=10) and 2 (A=8) are replaced with -999 in columns B,C,D
>>>
>>> # Example 3: fill_value='auto' with multiple values
>>> # Rows matching A=0 => fill with 0; rows matching A=8 => fill with 8
>>> res3 = mask_by_reference(df, "A", [0, 8], fill_value='auto')
>>> print(res3)
... # => rows with A=0 => B,C,D replaced by 0
... # => rows with A=8 => B,C,D replaced by 8
>>>
>>> # Example 4: mask_columns=['C','D'] => only columns C and D are masked
>>> res4 = mask_by_reference(df, "A", values=0, fill_value=999,
...                         mask_columns=["C","D"])
>>> print(res4)
... # Rows where A=0 => columns C,D replaced by 999, while B remains unchanged
>>>

geoprior.utils.data_utils.nan_ops(data, auxi_data=None, data_kind=None, ops='check_only', action=None, error='raise', process=None, condition=None, savefile=None, verbose=0)

Perform operations on NaN values within data structures, handling both primary data and optional witness data based on specified parameters.

This function provides a comprehensive toolkit for managing missing values (NaN) in various data structures such as NumPy arrays, pandas DataFrames, and pandas Series. Depending on the ops parameter, it can check for the presence of NaNs, validate data integrity, or sanitize the data by filling or dropping NaN values. The function also supports handling witness data (auxi_data), which can be crucial in scenarios where the relationship between primary and witness data must be maintained.

\[\text{Processed\_data} = \begin{cases} \text{filled\_data} & \text{if action is 'fill'} \\ \text{dropped\_data} & \text{if action is 'drop'} \\ \text{original\_data} & \text{otherwise} \end{cases} \tag{1}\]
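The three modes selected by ops, and the fill/drop case split above, can be sketched for a single pandas Series. This is a simplified illustration and not the library's implementation: `nan_ops_sketch` is a hypothetical name, and treating 'fill' as a forward fill followed by a backward fill is an assumption about what the 'both' method of fill_NaN does.

```python
import pandas as pd


def nan_ops_sketch(data: pd.Series, ops: str = "check_only", action=None):
    """Hypothetical single-Series illustration of the ops/action logic."""
    has_nan = bool(data.isna().any())
    if ops == "check_only":
        return has_nan  # boolean indicator only, data untouched
    if ops == "validate":
        if has_nan:
            raise ValueError("Data contains NaN values.")
        return data
    if ops == "sanitize":
        # action=None defaults to 'drop' when sanitizing
        if (action or "drop") == "fill":
            return data.ffill().bfill()  # assumed meaning of method 'both'
        return data.dropna()
    raise ValueError(f"Invalid ops: {ops!r}")
```

The sketch leaves out auxi_data, the error policy, and the process/condition switches, which the real function layers on top of this basic dispatch.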

Parameters:

data (array-like, pandas.DataFrame, or pandas.Series) – The primary data structure containing NaN values to be processed.
auxi_data (array-like, pandas.DataFrame, or pandas.Series, optional) – Auxiliary data that accompanies the primary data. Its role depends on the data_kind parameter. If data_kind is ‘target’, auxi_data is treated as feature data, and vice versa. This is useful for operations that need to maintain the alignment between primary and witness data.
data_kind ({'target', 'feature', None}, optional) – Specifies the role of the primary data. If set to ‘target’, data is considered target data, and auxi_data (if provided) is treated as feature data. If set to ‘feature’, data is treated as feature data, and auxi_data is considered target data. If None, no special handling is applied, and witness data is ignored unless explicitly required by other parameters.
ops ({'check_only', 'validate', 'sanitize'}, default 'check_only') –
Defines the operation to perform on the NaN values in the data:
- 'check_only': Checks whether the data contains any NaN values and returns a boolean indicator.
- 'validate': Validates that the data does not contain NaN values. If NaNs are found, it raises an error or warns based on the error parameter.
- 'sanitize': Cleans the data by either filling or dropping NaN values based on the action, process, and condition parameters.
action ({'fill', 'drop'}, optional) –
Specifies the action to take when ops is set to ‘sanitize’:
- 'fill': Fills NaN values using the fill_NaN function with the method set to ‘both’.
- 'drop': Drops NaN values based on the conditions and process specified. If data_kind is ‘target’, it handles NaNs in a way that preserves data integrity for machine learning models.
- If None, defaults to ‘drop’ when sanitizing.
Note: If ops is not ‘sanitize’ and action is set, an error is raised indicating conflicting parameters.
error ({'raise', 'warn', None}, default 'raise') –
Determines the error handling policy:
- 'raise': Raises exceptions when encountering issues.
- 'warn': Emits warnings instead of raising exceptions.
- None: Defaults to the base policy, which is typically ‘warn’.
This parameter is utilized by the error_policy function to enforce consistent error handling throughout the operation.
process ({'do', 'do_anyway'}, optional) –
Works in conjunction with the action parameter when action is ‘drop’:
- 'do': Drops NaN values only if certain conditions are met.
- 'do_anyway': Forces the dropping of NaN values regardless of conditions.
This provides flexibility in handling NaNs based on the specific requirements of the dataset and the analysis being performed.
condition (callable or None, optional) – A callable that defines a condition for dropping NaN values when action is ‘drop’. For example, it can specify that the number of NaNs should not exceed a certain fraction of the dataset. If the condition is not met, the behavior is controlled by the process parameter.
verbose (int, default 0) –
Controls the verbosity level of the function’s output for debugging purposes:
- 0: No output.
- 1: Basic informational messages.
- 2: Detailed processing messages.
- 3: Debug-level messages with complete trace of operations.
Higher verbosity levels provide more insights into the function’s internal operations, aiding in debugging and monitoring.
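The interplay between condition and process described above might look like the following sketch. The exact signature that nan_ops expects for condition is not stated here, so the form used below (a callable that takes the data and returns a bool) is an assumption, and both helper names are hypothetical.

```python
import pandas as pd


def nan_fraction_at_most(limit: float):
    """Build a condition: True when the NaN fraction does not exceed `limit`."""
    def condition(data: pd.Series) -> bool:
        return float(data.isna().mean()) <= limit
    return condition


def conditional_drop(data: pd.Series, condition, process: str = "do") -> pd.Series:
    """Drop NaNs if the condition holds, or unconditionally with 'do_anyway'."""
    if process == "do_anyway" or condition(data):
        return data.dropna()
    return data  # condition not met and process == 'do': leave data as-is
```

With half of the values missing, a 25 % limit leaves the data untouched under process='do', while process='do_anyway' drops the NaNs regardless.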

Returns:

The sanitized data structure with NaN values handled according to the specified parameters. If auxi_data is provided and processed, a tuple containing the sanitized data and auxi_data is returned. Otherwise, only the sanitized data is returned. When ops is 'check_only', a boolean indicator is returned instead of data.

Return type:

array-like, pandas.DataFrame, or pandas.Series

Raises:

ValueError –
- If an invalid value is provided for ops or data_kind.
- If auxi_data does not align with data in shape.
- If sanitization conditions are not met and the error policy is set to ‘raise’.
Warning –
- Emits warnings when NaN values are present and the error policy is set to ‘warn’.

Examples

>>> from geoprior.utils.data_utils import nan_ops
>>> import pandas as pd
>>> import numpy as np
>>> # Example with target data and witness feature data
>>> target = pd.Series([1, 2, np.nan, 4])
>>> features = pd.DataFrame({
...     'A': [5, np.nan, 7, 8],
...     'B': ['x', 'y', 'z', np.nan]
... })
>>> # Check for NaNs
>>> nan_ops(target, auxi_data=features, data_kind='target', ops='check_only')
(True, True)
>>> # Validate data (will raise ValueError if NaNs are present)
>>> nan_ops(target, auxi_data=features, data_kind='target', ops='validate')
Traceback (most recent call last):
    ...
ValueError: Target contains NaN values.
>>> # Sanitize data by dropping NaNs
>>> cleaned_target, cleaned_features = nan_ops(
...     target,
...     auxi_data=features,
...     data_kind='target',
...     ops='sanitize',
...     action='drop',
...     verbose=2
... )
Dropping NaN values.
Dropped NaNs successfully.
>>> cleaned_target
0    1.0
1    2.0
3    4.0
dtype: float64
>>> cleaned_features
     A    B
0  5.0    x
3  8.0  NaN

Notes

The nan_ops function is designed to provide a robust framework for handling missing values in datasets, especially in machine learning workflows where the integrity of target and feature data is paramount. By allowing conditional operations and providing flexibility in error handling, it ensures that data preprocessing can be tailored to the specific needs of the analysis.

The function leverages helper utilities such as fill_NaN, drop_nan_in, and error_policy to maintain consistency and reliability across different data structures and scenarios. The verbosity levels aid developers in tracing the function’s execution flow, making it easier to debug and verify data transformations.
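Keeping target and feature data aligned while dropping rows is the core concern here. The sketch below shows two possible row-selection rules in plain pandas; which rule drop_nan_in actually applies is not specified in this section, so both are shown as assumptions rather than as the library's behaviour.

```python
import numpy as np
import pandas as pd

target = pd.Series([1, 2, np.nan, 4])
features = pd.DataFrame({
    "A": [5, np.nan, 7, 8],
    "B": ["x", "y", "z", np.nan],
})

# Rule 1 (assumption): drop rows where the target is missing, and drop the
# same rows from the features so both stay aligned on the index.
keep_target = target.notna()
aligned_target = target[keep_target]
aligned_features = features.loc[keep_target]

# Rule 2 (assumption): additionally require every feature to be present.
keep_all = keep_target & features.notna().all(axis=1)
strict_target = target[keep_all]
strict_features = features.loc[keep_all]
```

In both cases a single boolean mask is applied to the target and the features, which is what guarantees that the two outputs share the same index.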

See also

geoprior.utils.base_utils.fill_NaN: Fills NaN values in numeric data structures using specified methods.
geoprior.core.array_manager.drop_nan_in: Drops NaN values from data structures, optionally alongside witness data.
geoprior.core.utils.error_policy: Determines how errors are handled based on user-specified policies.
geoprior.core.array_manager.array_preserver: Preserves and restores the original structure of array-like data.

geoprior.utils.data_utils.widen_temporal_columns(data, dt_col, spatial_cols=None, target_name=None, round_dt=True, ignore_cols=None, nan_op=None, nan_thresh=None, savefile=None, verbose=0)

Convert a long PIHALNet prediction table into a wide format where each temporal slice becomes a dedicated column.

The routine pivots columns whose names follow the pattern

<base>           deterministic forecast
<base>_qXX       quantile forecast (e.g., subsidence_q10)
<base>_actual    ground‑truth column

and produces columns of the form

<base>_<year>            point forecast
<base>_<year>_qXX        quantile forecast
<base>_<year>_actual     ground‑truth value
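The renaming rule can be written as a small helper. This is a sketch of the naming convention only: the regular expression and the function `wide_name` are hypothetical and assume that quantile suffixes always have the form q followed by digits.

```python
import re

# Optional suffix: a quantile tag such as q10, or the literal "actual".
_PATTERN = re.compile(r"^(?P<base>.+?)(?:_(?P<suffix>q\d+|actual))?$")


def wide_name(col: str, year: int) -> str:
    """Map a long-format column name to its wide-format name for `year`."""
    match = _PATTERN.match(col)
    if match is None:  # only happens for an empty column name
        raise ValueError("column name must be non-empty")
    base, suffix = match.group("base"), match.group("suffix")
    if suffix is None:
        return f"{base}_{year}"           # deterministic forecast
    return f"{base}_{year}_{suffix}"      # quantile or ground-truth column
```

For instance, `wide_name("subsidence_q10", 2020)` gives `"subsidence_2020_q10"`, with the year inserted between the base and the suffix.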

If duplicate (spatial, year) pairs are found, values are aggregated with a group-wise mean (pandas.Series.groupby followed by mean) prior to pivoting to avoid “Index contains duplicate entries” errors.
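The effect of aggregating before pivoting can be reproduced with plain pandas. The snippet below is a minimal sketch on made-up data, not the function's internal code: a direct pivot fails on duplicate (spatial, year) pairs, whereas a group-wise mean followed by unstack succeeds.

```python
import pandas as pd

# Synthetic data: one location, each year appears twice.
df_long = pd.DataFrame({
    "coord_x": [1.0, 1.0, 1.0, 1.0],
    "coord_y": [2.0, 2.0, 2.0, 2.0],
    "coord_t": [2019, 2020, 2019, 2020],
    "subsidence_q50": [1.0, 2.0, 3.0, 4.0],
})

# A direct pivot raises because (coord_x, coord_y, coord_t) is not unique.
try:
    df_long.pivot(index=["coord_x", "coord_y"], columns="coord_t",
                  values="subsidence_q50")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Aggregate duplicates with the mean, then spread years into columns.
agg = df_long.groupby(["coord_x", "coord_y", "coord_t"])["subsidence_q50"].mean()
wide = agg.unstack("coord_t")
wide.columns = [f"subsidence_{year}_q50" for year in wide.columns]
wide = wide.reset_index()
```

Each duplicated year collapses to its mean (2.0 for 2019 and 3.0 for 2020 in this data), leaving one row per spatial location.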

Parameters:

data (PathLike object or pandas.DataFrame) – Long‑format DataFrame returned by geoprior.utils.format_pihalnet_predictions.
dt_col (str) – Column holding the temporal coordinate (e.g., 'coord_t'). Must be numeric or datetime‑coercible. When round_dt is True, values are rounded to integers.
spatial_cols ((str, str) or None, default None) – Names of x and y spatial coordinates. These are retained as leading columns in the output. If None, the function falls back to 'sample_idx' or an auto‑generated 'row_id'.
target_name (str or None, default None) – Restrict pivoting to a specific base (e.g., 'subsidence'). When None, every base present in data is widened.
round_dt (bool, default True) – Round dt_col to the nearest integer (helpful for fractional years such as 2020.0001).
ignore_cols (list[str] or None, default None) – Additional columns to carry through unchanged. Values are propagated per spatial location using the first non‑null entry.
nan_op ({'drop', 'fill', 'both', None}, default None) –
Strategy for NaN handling after pivot:
- 'fill' – forward‑fill then back‑fill missing values.
- 'drop' – drop rows containing NaNs (see nan_thresh).
- 'both' – fill then drop according to nan_thresh.
- None – leave NaNs untouched.
nan_thresh (float or None, default None) –
When nan_op contains 'drop', rows are dropped if the proportion of missing values exceeds nan_thresh. Set nan_thresh = 0 to require no NaNs, 0.5 to allow ≤ 50 % missing, etc.

\[\text{row kept} \;\Longleftrightarrow\; \frac{\text{NaNs in row}}{\text{row width}} \le \text{nan\_thresh} \tag{2}\]
savefile (str, optional) – If a file path is provided, the final wide-format DataFrame will be saved as a CSV file.
verbose (int, default 0) – Diagnostic verbosity from 0 (silent) to 5 (trace every step).
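The nan_thresh rule for dropping rows can be sketched as a row-wise NaN fraction compared against the threshold. The helper `drop_sparse_rows` is hypothetical; in particular, whether the library counts all columns or only the widened value columns as the "row width" is an assumption here (all columns are counted).

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(wide: pd.DataFrame, nan_thresh: float) -> pd.DataFrame:
    """Keep rows whose fraction of NaN cells is at most `nan_thresh`."""
    if not 0.0 <= nan_thresh <= 1.0:
        raise ValueError("nan_thresh must lie in [0, 1]")
    nan_fraction = wide.isna().mean(axis=1)  # NaNs in row / row width
    return wide.loc[nan_fraction <= nan_thresh]


wide = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, np.nan],
    "d": [4.0, 4.0, 4.0],
})
kept = drop_sparse_rows(wide, nan_thresh=0.4)
```

The three rows have NaN fractions of 0, 0.25 and 0.75, so a threshold of 0.4 keeps the first two and a threshold of 0 keeps only the first.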

Returns:

Wide‑format frame with spatial identifiers first, followed by year‑wise forecast, quantile, and actual columns.

Return type:

pandas.DataFrame

Raises:

KeyError – dt_col missing from data or spatial_cols absent.
ValueError – No columns match target_name or nan_thresh is outside \([0, 1]\).

Notes

Duplicate indices are aggregated with the arithmetic mean before pivoting. Modify the aggregation lambda inside the function for alternative choices.
If ignore_cols is provided, their first non‑null value per spatial location is appended to the output.
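Carrying a column through as "first non-null value per spatial location" corresponds to a group-wise first in pandas, which skips missing entries by default. The following is a sketch on made-up data and is not taken from the function's source.

```python
import numpy as np
import pandas as pd

df_long = pd.DataFrame({
    "coord_x": [1.0, 1.0, 2.0, 2.0],
    "coord_y": [5.0, 5.0, 6.0, 6.0],
    "region": [np.nan, "A", "B", "B"],
})

# GroupBy.first() returns the first non-null entry within each group.
carried = (
    df_long.groupby(["coord_x", "coord_y"])[["region"]]
    .first()
    .reset_index()
)
```

The first location has a missing region in its first row, so the value "A" from its second row is the one that is propagated.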

Examples

Minimal usage on a tiny synthetic set

>>> import pandas as pd
>>> from geoprior.utils.data_utils import widen_temporal_columns
>>>
>>> df_long = pd.DataFrame(
...     {
...         "coord_x": [113.15, 113.15, 113.15, 113.15],
...         "coord_y": [22.63, 22.63, 22.63, 22.63],
...         "coord_t": [2019, 2020, 2019, 2020],
...         "subsidence_q50": [0.09, 0.10, 0.12, 0.13],
...         "subsidence_actual": [0.08, 0.11, 0.10, 0.14],
...     }
... )
>>>
>>> wide = widen_temporal_columns(
...     df_long,
...     dt_col="coord_t",
...     spatial_cols=("coord_x", "coord_y"),
...     verbose=2,
... )
[INFO] Initial rows: 4, columns: 2
[INFO] Widening base 'subsidence' (2 columns)
[DONE] Final wide shape: (1, 4)
>>> wide
   coord_x  coord_y  subsidence_2019_actual  subsidence_2020_actual  \
0   113.15    22.63                   0.08                   0.11

   subsidence_2019_q50  subsidence_2020_q50
0                 0.12                 0.13

End‑to‑end example with NaN handling, ignored columns, and two targets

>>> import numpy as np
>>> rng = pd.date_range("2018", periods=3, freq="Y").year
>>> n = 5  # five spatial locations
>>>
>>> # build synthetic long DataFrame
>>> df_long = pd.DataFrame(
...     {
...         "sample_idx": np.repeat(np.arange(n), len(rng)),
...         "coord_x": np.repeat(np.linspace(113.4, 113.5, n), len(rng)),
...         "coord_y": np.repeat(np.linspace(22.1, 22.2, n), len(rng)),
...         "coord_t": np.tile(rng, n),
...         "region": np.repeat(["A", "B", "A", "B", "A"], len(rng)),
...         "subsidence_q10": np.random.rand(n * len(rng)),
...         "subsidence_q50": np.random.rand(n * len(rng)),
...         "subsidence_q90": np.random.rand(n * len(rng)),
...         "subsidence_actual": np.random.rand(n * len(rng)),
...         "GWL_q50": np.random.rand(n * len(rng)),
...     }
... )
>>>
>>> # introduce NaNs for demonstration
>>> df_long.loc[df_long.sample(frac=0.2).index, "subsidence_q50"] = np.nan
>>>
>>> wide = widen_temporal_columns(
...     df_long,
...     dt_col="coord_t",
...     spatial_cols=("coord_x", "coord_y"),
...     ignore_cols=["region"],
...     target_name=None,      # widen both 'subsidence' and 'GWL'
...     nan_op="both",         # fill then drop rows with many NaNs
...     nan_thresh=0.4,        # allow at most 40 % missing
...     verbose=3,
... )
[INFO] Initial rows: 15, columns: 7
[INFO] Widening base 'GWL' (1 columns)
  └─ 0 duplicate rows in 'GWL_q50' → aggregated
[INFO] Widening base 'subsidence' (4 columns)
  └─ 0 duplicate rows in 'subsidence_q10' → aggregated
  └─ 0 duplicate rows in 'subsidence_q50' → aggregated
  └─ 0 duplicate rows in 'subsidence_q90' → aggregated
  └─ 0 duplicate rows in 'subsidence_actual' → aggregated
[INFO] Missing values filled (ffill+bfill).
[INFO] Rows with >40% NaN dropped.
[DONE] Final wide shape: (5, 19)
>>> wide.iloc[:2, :8]  # show first 8 columns
   coord_x  coord_y  GWL_2018_q50  GWL_2019_q50  GWL_2020_q50  \
0  113.400       ...         ...          ...          ...
1  113.425       ...         ...          ...          ...

   subsidence_2018_actual  subsidence_2019_actual  subsidence_2020_actual
0                     ...                     ...                     ...
1                     ...                     ...                     ...