geoprior.utils.generic_utils

Provides common helper functions for validation, comparison, and other generic operations.

Functions

apply_affix(value[, label, mode, ...])

Apply either a prefix or suffix (with optional version, date, and custom label) to the string form of value.

are_all_values_in_bounds(values[, bounds, ...])

Check if all evaluable input values are within specified numeric bounds.

as_tuple(obj[, names, ctx, strict, ...])

Convert model I/O structures (dict/list/tuple/tensor-like) into a tuple.

cast_hp_to_bool(params, param_name[, ...])

Casts a hyperparameter value in the params dictionary to a boolean.

cast_multiple_bool_params(params, ...)

Casts a list of boolean hyperparameters to ensure they are Python booleans.

check_group_column_validity(df, group_col[, ...])

Validate a grouping column for categorical-style use and optionally bin it.

default_results_dir([start, env_var, ...])

Resolve the canonical 'results' directory with robust fallbacks.

detect_dt_format(series)

Detect the datetime format of a pandas Series containing datetime values.

ensure_cols_exist(df, *cols[, strict, ...])

Validate that every requested column is present in a DataFrame.

ensure_directory_exists(path)

Ensure that a directory exists at the given path, creating it if needed.

exclude_duplicate_kwargs(func, ...)

Prevents the user from overriding existing parameters

find_id_column(df[, strategy, ...])

Identify potential ID column(s) in a pandas DataFrame using multiple heuristic strategies.

get_actual_column_name(df[, tname, ...])

Determines the actual target column name in the given DataFrame.

getenv_stripped(name[, default, allow_empty])

Read an environment variable and strip whitespace robustly.

handle_emptiness(obj[, ops, empty_as_none])

Smart helper to check or normalize empty/None values.

insert_affix_in(filepath, affix, *[, separator])

Insert an affix between the base name and extension of a filename.

map_scales_choice(scales_choice_str)

Maps a string choice for scales to an actual list of scale values or None if no scales are provided.

normalize_model_inputs(*data)

normalize_time_column(df, time_col[, ...])

Normalize a time column into a datetime column and an integer year.

print_box(msg[, width, align, border_char, ...])

Print a boxed message with customizable styling.

print_config_table(sections[, title, ...])

Pretty-print configuration or hyperparameters as a key/value table.

rename_dict_keys(data[, param_to_rename, order])

Renames keys in the data dictionary based on the provided param_to_rename dictionary.

reorder_columns(df, columns[, pos])

Reorder columns in a DataFrame by moving specified columns to a chosen position.

save_all_figures([output_dir, prefix, fmts, ...])

Save all currently open Matplotlib figures to disk in specified formats.

save_figure(figure[, savefile, save_fmts, ...])

Save the given matplotlib figure to disk in one or more formats.

select_mode([mode, default, canonical])

Resolve a user-supplied mode alias to a canonical value.

split_train_test_by_time(df, time_col, cutoff, *)

Split a DataFrame into train/test based on a time cutoff, robust to different time formats.

transform_contributions(contributions[, ...])

Converts the feature contributions either to a direct percentage,

verify_identical_items(list1, list2[, mode, ...])

Check if two lists contain identical elements according to the specified mode.

vlog(message[, verbose, level, depth, mode, ...])

Log or print messages with optional indentation and bracketed tags.

Classes

ExistenceChecker()

A utility class for checking and ensuring the existence of files and directories on the filesystem.

class geoprior.utils.generic_utils.ExistenceChecker

Bases: object

A utility class for checking and ensuring the existence of files and directories on the filesystem.

This class provides static methods to verify whether a given path exists and to create directories or files if necessary. It raises informative exceptions when paths are invalid or cannot be created.

ensure_directory(path)

Ensure a directory exists at the specified path.

ensure_file(path, create_parent_dirs=False)

Ensure a file exists at the specified path, optionally creating parent directories.

Examples

>>> from pathlib import Path
>>> from geoprior.utils.generic_utils import ExistenceChecker
>>> # Ensure a directory exists
>>> dir_path = ExistenceChecker.ensure_directory("data/output")
>>> isinstance(dir_path, Path)
True
>>> # Ensure a file exists, creating parent directories
>>> file_path = ExistenceChecker.ensure_file(
...     "data/output/results.txt", create_parent_dirs=True
... )
>>> file_path.exists()
True

Notes

  • Uses pathlib.Path.mkdir(…, parents=True, exist_ok=True) under the hood to create directories.

  • Creating a file will produce an empty file if it does not exist.

  • Raises TypeError if the given path is not a str or pathlib.Path, and appropriate OSError/FileExistsError for filesystem errors.

See also

pathlib.Path.mkdir

Method to create directories.

pathlib.Path.touch

Method to create an empty file.

os.makedirs

Legacy function for creating directories recursively.

os.path.exists

Check if a path exists.

static ensure_directory(path)

Ensure that a directory exists at the given path, creating it if needed.

Parameters:

path (str or pathlib.Path) – The filesystem path for which to ensure directory existence. Can be either a string or a pathlib.Path object.

Returns:

A Path object pointing to the existing (or newly created) directory.

Return type:

pathlib.Path

Raises:
  • TypeError – If path is not a string or pathlib.Path.

  • FileExistsError – If a file (not a directory) already exists at path.

  • OSError – If the directory cannot be created for any other reason (e.g., insufficient permissions).

static ensure_file(path, create_parent_dirs=False)

Ensure that a file exists at the given path, creating it if needed.

If create_parent_dirs is True, any missing parent directories will be created automatically.

Parameters:
  • path (str or pathlib.Path) – The filesystem path for the file that must exist.

  • create_parent_dirs (bool, optional) – If True, create any missing parent directories. Default is False.

Returns:

A Path object pointing to the existing (or newly created) file.

Return type:

pathlib.Path

Raises:
  • TypeError – If path is not a string or pathlib.Path.

  • FileExistsError – If a directory (not a file) already exists at path.

  • OSError – If the file or parent directories cannot be created due to filesystem errors.
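The checks described for ensure_directory and ensure_file can be sketched with pathlib alone. This is a minimal illustration of the documented contract, not the library's source; the names ensure_directory_sketch and ensure_file_sketch are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_directory_sketch(path):
    """Hypothetical stand-in for the documented ensure_directory contract."""
    if not isinstance(path, (str, Path)):
        raise TypeError("path must be a str or pathlib.Path")
    p = Path(path)
    if p.exists() and not p.is_dir():
        # A regular file occupies the target path.
        raise FileExistsError(f"{p} exists and is not a directory")
    p.mkdir(parents=True, exist_ok=True)
    return p


def ensure_file_sketch(path, create_parent_dirs=False):
    """Hypothetical stand-in for the documented ensure_file contract."""
    if not isinstance(path, (str, Path)):
        raise TypeError("path must be a str or pathlib.Path")
    p = Path(path)
    if p.is_dir():
        # A directory occupies the target path.
        raise FileExistsError(f"{p} exists and is a directory")
    if create_parent_dirs:
        p.parent.mkdir(parents=True, exist_ok=True)
    # touch() creates an empty file and leaves existing content untouched.
    p.touch(exist_ok=True)
    return p


with tempfile.TemporaryDirectory() as tmp:
    created = ensure_file_sketch(
        Path(tmp) / "data" / "output" / "results.txt",
        create_parent_dirs=True,
    )
    file_was_created = created.is_file()
    file_size = created.stat().st_size
    same_dir = ensure_directory_sketch(created.parent) == created.parent
```

Working inside a temporary directory keeps the example from leaving files behind.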

geoprior.utils.generic_utils.ensure_directory_exists(path)

Ensure that a directory exists at the given path, creating it if needed.

This function checks whether the provided path exists and is a directory. If the path does not exist, it attempts to create the directory (including any necessary parent directories). If a file with the same name already exists, or if creation fails, an exception is raised.

Parameters:

path (str or pathlib.Path) – The filesystem path for which to ensure directory existence. Can be either a string or a pathlib.Path object.

Returns:

A Path object pointing to the existing (or newly created) directory.

Return type:

pathlib.Path

Raises:
  • TypeError – If path is not a string or pathlib.Path.

  • FileExistsError – If a file (not a directory) already exists at path.

  • OSError – If the directory cannot be created for any other reason (e.g., insufficient permissions).

Examples

>>> from pathlib import Path
>>> from geoprior.utils.generic_utils import ensure_directory_exists
>>> output_dir = ensure_directory_exists("data/output")
>>> isinstance(output_dir, Path)
True
>>> # The directory "data/output" now exists on disk.

Notes

  • Uses pathlib.Path.mkdir(…, parents=True, exist_ok=True) under the hood for cross-platform compatibility.

  • If path already exists as a directory, this function returns immediately without modifying it.

See also

pathlib.Path.mkdir

Method to create a directory.

os.makedirs

Legacy function for creating directories recursively.

geoprior.utils.generic_utils.verify_identical_items(list1, list2, mode='unique', ops='check_only', error='raise', objname=None)

Check if two lists contain identical elements according to the specified mode.

In “unique” mode, the function compares the unique elements in each list. In “ascending” mode, it compares elements pairwise in order.

Parameters:
  • list1 (list) – The first list of items.

  • list2 (list) – The second list of items.

  • mode ({'unique', 'ascending'}, default "unique") –

    The mode of comparison:
    • “unique”: Compare unique elements (order-insensitive).

    • “ascending”: Compare each element pairwise in order.

  • ops ({'check_only', 'validate'}, default "check_only") – If “check_only”, returns True/False indicating a match. If “validate”, returns the validated list.

  • error ({'raise', 'warn', 'ignore'}, default "raise") – Specifies how to handle mismatches.

  • objname (str, optional) – A name to include in error messages.

Returns:

Depending on ops, returns True/False or the validated list.

Return type:

bool or list

Examples

>>> from geoprior.utils.generic_utils import verify_identical_items
>>> list1 = [0.1, 0.5, 0.9]
>>> list2 = [0.1, 0.5, 0.9]
>>> verify_identical_items(list1, list2, mode="unique", ops="validate")
[0.1, 0.5, 0.9]
>>> verify_identical_items(list1, list2, mode="ascending", ops="check_only")
True

Notes

In “ascending” mode, both lists must have the same length, and the function compares each corresponding pair of elements. In “unique” mode, the function uses the set of unique values for comparison. If the lists contain mixed types, the function attempts to compare their string representations.
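The difference between the two modes can be illustrated with a small stand-alone sketch. The function identical_sketch is hypothetical. It covers only the boolean comparison and leaves out the ops, error, and mixed-type handling described above.

```python
def identical_sketch(list1, list2, mode="unique"):
    """Hypothetical restatement of the two documented comparison modes."""
    if mode == "unique":
        # Order and duplicates are ignored; only the sets of values matter.
        return set(list1) == set(list2)
    if mode == "ascending":
        # Lengths must match, then elements are compared position by position.
        return len(list1) == len(list2) and all(
            a == b for a, b in zip(list1, list2)
        )
    raise ValueError(f"unknown mode: {mode!r}")


same_set = identical_sketch([0.1, 0.5, 0.9], [0.9, 0.1, 0.5, 0.5], mode="unique")
same_order = identical_sketch([0.1, 0.5, 0.9], [0.9, 0.1, 0.5], mode="ascending")
```

Here same_set is True because both lists reduce to the same set, while same_order is False because the elements differ position by position.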

geoprior.utils.generic_utils.vlog(message, verbose=None, level=3, depth='auto', mode=None, vp=True, logger=None, **kws)

Log or print messages with optional indentation and bracketed tags.

This function, vlog, allows conditional logging or printing of messages based on a global or passed-in verbose level. Its behavior depends on whether mode is 'log' or 'naive'. When mode='log', the message is printed only if \(\text{verbose} \geq \text{level}\). Otherwise, for mode in [None, 'naive'], the verbosity threshold determines which bracketed prefix (e.g. [INFO], [DEBUG], [TRACE]) is added, unless the message already contains such a prefix. Messages are indented according to:

\[\text{indentation} = 2 \times \text{depth} \tag{1}\]

where \(\text{depth}\) is either manually specified or auto-derived from level (1 = ERROR, 2 = WARNING, 3 = INFO, 4/5 = DEBUG, 6/7 = TRACE).

Parameters:
  • message (str) – The text to be printed or logged.

  • verbose (int, optional) – Overall verbosity threshold. If None, it looks for a global variable named verbose. Default is None.

  • level (int, default 3) –

    Severity or importance level of the message. Commonly:

    • 1 = ERROR

    • 2 = WARNING

    • 3 = INFO

    • 4,5 = DEBUG

    • 6,7 = TRACE

  • depth (int or str, default "auto") – Indentation level used for the printed message. If "auto", the depth is computed from level.

  • mode (str, optional) – Determines logging mode. If set to 'log', prints messages only if \(\text{verbose} \geq \text{level}\). Otherwise (if None or 'naive'), it follows a custom logic driven by verbose.

  • vp (bool, default True) – If True, the function automatically prepends bracketed tags (e.g. [INFO]) unless the message already contains one of [INFO], [DEBUG], [ERROR], [WARNING], or [TRACE].

  • logger (logging.Logger or Callable[[str], None], optional) –

    Custom sink that receives the already-formatted message string.

    • If you pass a standard logging.Logger instance, the message is routed through logger.info.

    • If you supply any callable that accepts a single str (e.g. a GUI text-append function), that callable is invoked directly.

    • Defaults to print, which writes to stdout.

  • kws (dict, optional) – Additional keyword arguments, reserved for future extensions.

Returns:

This function does not return anything. It either prints the message to stdout or omits it, depending on verbose, level, and mode.

Return type:

None

Notes

This function is helpful for selectively displaying or logging messages in applications that adapt to the user’s required verbosity. By default, each level has a specific bracketed tag and an auto indentation depth.

Examples

>>> from geoprior.utils.generic_utils import vlog
>>> # Example with mode='log'
>>> # This prints only if global or passed-in
>>> # verbose >= 4.
>>> vlog("Check debugging details.", verbose=3,
...      level=4, mode='log')
>>> # Example with mode='naive'
>>> # If verbose=2, it displays as [INFO] prefixed.
>>> vlog("Loading data...", verbose=2, mode='naive')

See also

globals

Used to retrieve the fallback verbose value if not explicitly passed.
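The level-to-tag mapping, the indentation rule in equation (1), and the mode='log' threshold can be sketched as follows. The helper names are hypothetical, depth is passed explicitly because the auto-derivation rule is not spelled out here, and placing the indentation before the tag is an assumption.

```python
# Level-to-tag mapping as listed in the parameter description.
TAGS = {1: "ERROR", 2: "WARNING", 3: "INFO",
        4: "DEBUG", 5: "DEBUG", 6: "TRACE", 7: "TRACE"}


def format_message_sketch(message, level=3, depth=0, vp=True):
    """Hypothetical formatter: indent by 2 * depth and add a tag if missing."""
    indent = " " * (2 * depth)
    has_tag = any(f"[{tag}]" in message for tag in set(TAGS.values()))
    prefix = f"[{TAGS[level]}] " if vp and not has_tag else ""
    return f"{indent}{prefix}{message}"


def should_emit_log_mode(verbose, level):
    """In mode='log', a message is shown only when verbose >= level."""
    return verbose >= level


tagged = format_message_sketch("Loading data...", level=3, depth=1)
untouched = format_message_sketch("[DEBUG] cache hit", level=4, depth=0)
suppressed = not should_emit_log_mode(verbose=3, level=4)
```

With these inputs, tagged is `"  [INFO] Loading data..."`, untouched keeps its existing `[DEBUG]` prefix, and suppressed is True, which matches the first documented example, where verbose=3 is below level=4.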

geoprior.utils.generic_utils.detect_dt_format(series)

Detect the datetime format of a pandas Series containing datetime values.

This function inspects a non-null sample from the datetime Series and infers the format string based on its components (year, month, day, hour, minute, and second). It returns a format string that can be used with strftime. For example, if the sample indicates only a year is relevant, it returns "%Y"; if full date information is present, it returns "%Y-%m-%d"; and if time details are also present, it extends the format accordingly.

Parameters:

series (pandas.Series) – A Series containing datetime values (dtype datetime64).

Returns:

A datetime format string (e.g., "%Y", "%Y-%m-%d", or "%Y-%m-%d %H:%M:%S") that represents the resolution of the data.

Return type:

str

Examples

>>> from geoprior.utils.generic_utils import detect_dt_format
>>> import pandas as pd
>>> dates = pd.to_datetime(['2023-01-01', '2024-01-01', '2025-01-01'])
>>> fmt = detect_dt_format(pd.Series(dates))
>>> print(fmt)
%Y

Notes

The detection logic checks if month, day, hour, minute, and second are all default values (e.g., month == 1, day == 1, hour == 0, etc.) and infers the most compact format that still represents the data accurately.
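The default-value check described above can be sketched with standard-library datetime objects. The function detect_format_sketch is hypothetical. It scans every value in a plain list rather than a sample from a pandas Series, and it returns only the three formats named in this section.

```python
from datetime import datetime


def detect_format_sketch(values):
    """Hypothetical format detection based on default-valued components."""
    # Any non-midnight time component means time resolution is needed.
    has_time = any((v.hour, v.minute, v.second) != (0, 0, 0) for v in values)
    # Any value not on January 1st means month/day resolution is needed.
    has_date = any((v.month, v.day) != (1, 1) for v in values)
    if has_time:
        return "%Y-%m-%d %H:%M:%S"
    if has_date:
        return "%Y-%m-%d"
    return "%Y"


yearly = detect_format_sketch([datetime(2023, 1, 1), datetime(2024, 1, 1)])
daily = detect_format_sketch([datetime(2023, 1, 1), datetime(2023, 3, 15)])
timed = detect_format_sketch([datetime(2023, 3, 15, 8, 30, 0)])
```

One consequence of this heuristic is that a daily series whose values all fall on January 1st is indistinguishable from yearly data.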

geoprior.utils.generic_utils.get_actual_column_name(df, tname=None, actual_name=None, error='raise', default_to=None)

Determines the actual target column name in the given DataFrame.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the target column.

  • tname (str, optional) – The base target name (e.g., “subsidence”). If not found in the DataFrame, it will attempt to find a matching column using “<tname>_actual” format.

  • actual_name (str, optional) – If provided, this name will be returned as the actual target column name.

  • error ({'raise', 'warn', 'ignore'}, default 'raise') –

    Specifies how to handle the case when no valid column is found:
    • 'raise': Raises a ValueError.

    • 'warn': Issues a warning and returns None.

    • 'ignore': Silently returns None.

Returns:

The determined actual column name, or None if no match is found and error='warn' or error='ignore'.

Return type:

str or None

Raises:

ValueError – If no valid target column is found and error='raise'.

Examples

>>> import pandas as pd
>>> from geoprior.utils.generic_utils import get_actual_column_name
>>> df = pd.DataFrame({'subsidence_actual': [1, 2, 3]})
>>> get_actual_column_name(df, tname="subsidence")
'subsidence_actual'
>>> df = pd.DataFrame({'subsidence': [1, 2, 3]})
>>> get_actual_column_name(df, tname="subsidence")
'subsidence'
>>> df = pd.DataFrame({'actual': [1, 2, 3]})
>>> get_actual_column_name(df)
'actual'
>>> df = pd.DataFrame({'measurement': [1, 2, 3]})
>>> get_actual_column_name(df, tname="subsidence", error="warn")
Warning: Could not determine the actual target column in the DataFrame.
None

geoprior.utils.generic_utils.transform_contributions(contributions, to_percent=True, normalize=False, norm_range=(0, 1), scale_type=None, zero_division='warn', epsilon=1e-06, log_transform=False)

Converts the feature contributions either to a direct percentage, normalizes them to a custom range, or applies a scaling strategy based on the chosen parameters.

Parameters:
  • contributions (dict) – A dictionary where keys are feature names and values are the feature contributions. Each value is expected to be a numerical value representing the contribution of the respective feature.

  • to_percent (bool, optional, default True) – Whether to convert the contributions to percentages. If True, each value in contributions will be multiplied by 100. This is useful when contributions are given in decimal form but are expected as percentages.

  • normalize (bool, optional, default False) – Whether to normalize the contributions using min-max scaling. If True, the values will be scaled to the range defined in norm_range.

  • norm_range (tuple, optional, default (0, 1)) – A tuple specifying the range (min, max) for normalization. This range is applied when normalize is set to True. The contributions will be rescaled so that the minimum value maps to norm_range[0] and the maximum value maps to norm_range[1].

  • scale_type (str, optional, default None) –

    The scaling strategy:
    • 'zscore': Performs Z-score normalization.

    • 'log': Applies a logarithmic transformation to the data.

    If None, no scaling is applied.

  • zero_division (str, optional, default 'warn') –

    Defines how to handle zero or missing values in the contributions:
    • 'skip': Skips zero values (no modification).

    • 'warn': Issues a warning if zero values are found.

    • 'replace': Replaces zeros with a small value defined by epsilon to avoid division by zero or undefined results.

  • epsilon (float, optional, default 1e-6) – A small value used to replace zeros when zero_division is set to 'replace'. This prevents division by zero errors during transformations like Z-score or log transformation.

  • log_transform (bool, optional, default False) – Whether to apply a logarithmic transformation to the contributions. If True, it applies the natural logarithm to each value in the contributions dictionary. Only positive values are valid for log transformation, and zero values are either skipped or replaced based on the zero_division parameter.

Returns:

A dictionary with feature names as keys and the transformed feature contributions as values. The transformation is applied according to the chosen parameters.

Return type:

dict

See also

numpy.mean

Compute the arithmetic mean of an array.

numpy.std

Compute the standard deviation of an array.

Notes

  • If scale_type='zscore', each contribution is standardized as:

    \[\frac{X - \mu}{\sigma}\]

    where \(X\) is the contribution, \(\mu\) is the mean of the contributions, and \(\sigma\) is the standard deviation of the contributions.

  • If log_transform=True, the function applies the natural logarithm:

    \[\log(X) \text{ for } X > 0 \tag{2}\]

  • The zero_division parameter handles zero values by either skipping, warning, or replacing them with a small value (epsilon).
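The z-score, min-max, and zero-replacement steps described for this function can be written out on a plain dictionary. This is a standard-library sketch with hypothetical helper names. It assumes the population standard deviation (the default of numpy.std) and does not reproduce how the library combines or orders these options.

```python
import math
from statistics import fmean, pstdev


def zscore_sketch(contribs):
    """Standardize values as (X - mu) / sigma, using the population sigma."""
    values = list(contribs.values())
    mu, sigma = fmean(values), pstdev(values)
    return {k: (v - mu) / sigma for k, v in contribs.items()}


def min_max_sketch(contribs, norm_range=(0, 1)):
    """Rescale so the minimum maps to norm_range[0] and the maximum to norm_range[1]."""
    lo, hi = min(contribs.values()), max(contribs.values())
    a, b = norm_range
    return {k: a + (v - lo) * (b - a) / (hi - lo) for k, v in contribs.items()}


def log_with_replace_sketch(contribs, epsilon=1e-6):
    """Natural log, with zeros replaced by epsilon ('replace' policy)."""
    return {k: math.log(epsilon if v == 0 else v) for k, v in contribs.items()}


sample = {"a": 1.0, "b": 2.0, "c": 3.0}
z = zscore_sketch(sample)
scaled = min_max_sketch(sample, norm_range=(0, 1))
logged = log_with_replace_sketch({"zero": 0.0, "one": 1.0})
```

Both zscore_sketch and min_max_sketch divide by a quantity that is zero when all contributions are equal, so a production version would need a guard for that case.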

Examples

>>> from geoprior.utils.generic_utils import transform_contributions
>>> contributions = {
...     'GWL': 2.226836617133828,
...     'rainfall_mm': 12.398293851061492,
...     'normalized_seismic_risk_score': 0.9402759347406523,
...     'normalized_density': 4.806074194258057,
...     'density_concentration': 5.666943330566496e-06,
...     'geology': 1.2798872011280326e-05,
...     'density_tier': 1.044039559604414e-05,
...     'rainfall_category': 0.0
... }
>>> transform_contributions(contributions, to_percent=True, normalize=True)
>>> transform_contributions(contributions, to_percent=False, scale_type='zscore')

geoprior.utils.generic_utils.exclude_duplicate_kwargs(func, existing_kwargs, user_kwargs)

Prevents the user from overriding existing parameters in a target function. The method exclude_duplicate_kwargs checks both developer-specified and function-level parameter names to exclude them from user_kwargs.

\[\text{final\_kwargs} = \{\,(k, v) \in \text{user\_kwargs} \mid k \notin \text{protected\_params}\,\} \tag{3}\]

Parameters:
  • func (callable) – The target function whose valid parameters are checked. It uses Python’s introspection to gather the acceptable parameter names.

  • existing_kwargs (dict or list) –

    Developer-defined parameters to protect. Can be:
    • A dictionary of parameter-value pairs (e.g., {'ax': ax_obj, 'data': df}) whose keys are excluded from user overrides.

    • A list of parameter names (e.g., ['ax', 'data']) to protect from user overrides.

  • user_kwargs (dict) – The user-supplied keyword arguments that are candidates for merging with existing_kwargs. This dictionary is filtered to remove collisions with protected parameters.

Returns:

A filtered dictionary of user-defined arguments that do not overlap with protected parameters.

Return type:

dict[str, Any]

See also

inspect.signature

Used to introspect function parameters.

filter_valid_kwargs

Another inline function that discards user params not valid for a given function.

Notes

By default, if existing_kwargs is a dictionary, its keys are treated as protected parameter names. If it’s a list, those items are protected. The function signature of func is also used to verify that only recognized parameters are protected. Keyword-filtering patterns like this are covered in Beazley and Jones [26].
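The filtering rule can be sketched with inspect.signature. This follows one reading of the note above, in which a name is protected only when it appears both in existing_kwargs and in the signature of func. The names exclude_sketch and plot are hypothetical, and plot stands in for a plotting function so that the example needs no third-party packages.

```python
import inspect


def exclude_sketch(func, existing_kwargs, user_kwargs):
    """Hypothetical filter: drop user kwargs that collide with protected names."""
    # Dict keys or list items both yield the candidate protected names.
    names = (existing_kwargs.keys()
             if isinstance(existing_kwargs, dict) else existing_kwargs)
    # Only names that func actually declares are treated as protected.
    declared = set(inspect.signature(func).parameters)
    protected = {name for name in names if name in declared}
    return {k: v for k, v in user_kwargs.items() if k not in protected}


def plot(x, y, palette=None, **kwargs):
    """Stand-in target function with a catch-all **kwargs."""
    return x, y, palette, kwargs


safe_from_dict = exclude_sketch(
    plot,
    {"x": "species", "y": "sepal_length", "palette": "viridis"},
    {"x": "petal_width", "color": "red"},
)
safe_from_list = exclude_sketch(plot, ["x"], {"x": 1, "y": 2})
```

The user's attempt to override x is dropped in both calls, while color passes through because it was never protected.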

Examples

>>> from geoprior.utils.generic_utils import exclude_duplicate_kwargs
>>> import seaborn as sns
>>> # Developer has some base kwargs
... base_kwargs = {
...     'x': 'species',
...     'y': 'sepal_length',
...     'palette': 'viridis'
... }
>>> # User tries to override 'x' with new param
... user_args = {
...     'x': 'petal_width',
...     'color': 'red'
... }
>>> # Filter out duplicates
... safe_args = exclude_duplicate_kwargs(
...     sns.scatterplot,
...     base_kwargs,
...     user_args
... )
>>> safe_args
{'color': 'red'}

geoprior.utils.generic_utils.reorder_columns(df, columns, pos='end')

Reorder columns in a DataFrame by moving specified columns to a chosen position.

This function locates columns in the original DataFrame df and rearranges them based on the parameter pos. If pos is “end”, columns are appended to the end. If “begin” or “start”, they are placed at the front. If “center”, they are inserted at the midpoint of the remaining columns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame to be modified.

  • columns (str or iterable of str) – A single column name or multiple column names to reposition. If a single string is given, it is converted to a list with one element.

  • pos (str, int, or float, default "end") –

    Determines the target placement:
    • "end": Append after all other columns.

    • "begin" or "start": Prepend at the start.

    • "center": Insert at the midpoint of remaining columns.

    • integer or float: Insert at zero-based index among the remaining columns. If out of bounds, the original DataFrame is returned unchanged.

Returns:

A new DataFrame with columns moved as specified by pos.

Return type:

pandas.DataFrame

`reorder_columns_in`

This method rearranges columns without altering values or data order beyond column placement.

Notes

  • The function checks if columns exist in df, ignoring columns not present.

  • A warning is issued if the position is beyond the range of valid indices.

  • Negative indices for integer pos are converted to positive by adding the total number of remaining columns.

\[i_{\text{center}} = \left\lfloor \frac{|R|}{2} \right\rfloor, \tag{4}\]

where \(|R|\) is the number of remaining columns after removing the target columns. For integer or float pos, the target columns are inserted at index \(\lfloor pos \rfloor\) among the remaining columns. Column-order management follows common DataFrame practices discussed in McKinney [27].
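The index arithmetic above can be illustrated on a plain list of column names, without pandas. The function reorder_names_sketch is hypothetical, and treating an index equal to the number of remaining columns as in bounds (equivalent to "end") is an assumption.

```python
import math


def reorder_names_sketch(all_columns, columns, pos="end"):
    """Hypothetical column-name reordering following the documented rules."""
    if isinstance(columns, str):
        columns = [columns]
    # Columns that are not present are ignored.
    targets = [c for c in columns if c in all_columns]
    remaining = [c for c in all_columns if c not in targets]
    if pos == "end":
        idx = len(remaining)
    elif pos in ("begin", "start"):
        idx = 0
    elif pos == "center":
        idx = len(remaining) // 2  # floor(|R| / 2), as in equation (4)
    else:
        idx = math.floor(pos)
        if idx < 0:
            idx += len(remaining)  # negative indices count from the end
        if not 0 <= idx <= len(remaining):
            return list(all_columns)  # out of bounds: leave the order unchanged
    return remaining[:idx] + targets + remaining[idx:]


cols = ["id", "latitude", "landslide", "longitude"]
at_end = reorder_names_sketch(cols, "landslide", pos="end")
at_center = reorder_names_sketch(cols, "landslide", pos="center")
at_float = reorder_names_sketch(cols, "landslide", pos=1.7)
unchanged = reorder_names_sketch(cols, "landslide", pos=10)
```

With three remaining columns, "center" and pos=1.7 both resolve to index 1, so 'landslide' lands between 'id' and 'latitude'.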

Examples

>>> from geoprior.utils.generic_utils import reorder_columns
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'id': [1, 2, 3],
...     'latitude': [10.1, 10.2, 10.3],
...     'landslide': [0, 1, 0],
...     'longitude': [20.1, 20.2, 20.3]
... })
>>> # Move 'landslide' to the end (default)
>>> reorder_columns(data, 'landslide', pos="end")
   id  latitude  longitude  landslide
0   1      10.1       20.1          0
1   2      10.2       20.2          1
2   3      10.3       20.3          0

See also

pandas.DataFrame.reindex

Pandas method for reindexing or reordering columns more generally.

geoprior.utils.generic_utils.find_id_column(df, strategy='naive', regex_pattern=None, uniqueness_threshold=0.95, errors='raise', empty_as_none=True, as_list=False, case_sensitive=False, as_frame=False)

Identify potential ID column(s) in a pandas DataFrame using multiple heuristic strategies.

The function examines column names and/or data properties to detect columns likely to serve as unique identifiers. This is particularly useful for large datasets where the ID field is not explicitly labeled, and for quick scanning of possible key columns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame in which to search for potential ID columns.

  • strategy ({'naive', 'exact', 'dtype', 'regex', 'prefix_suffix'}, default 'naive') –

    Defines the logic for detecting ID columns:
    • exact: Checks for a column name that exactly matches id (case sensitivity controlled by case_sensitive).

    • naive: Searches for columns where id is part of the name (e.g., location_id) subject to case sensitivity.

    • prefix_suffix: Considers columns prefixed or suffixed with id or _id.

    • dtype: Examines columns having data types commonly used for IDs (integer, string, or object) and checks if they show high uniqueness via \(\text{uniqueness\_ratio} \geq \text{uniqueness\_threshold}\).

    • regex: Uses a custom regular expression regex_pattern to find matches in column names.

  • regex_pattern (str, optional) – Required if strategy is 'regex'. The pattern is compiled via re.compile, with case sensitivity determined by case_sensitive.

  • uniqueness_threshold (float, default 0.95) –

    For the dtype strategy, columns are flagged as ID candidates if the ratio:

    \[r = \frac{ \text{unique\_values} }{ \text{non\_NA\_rows} } \tag{5}\]

    satisfies \(r \geq \text{uniqueness\_threshold}\), or if the number of unique values equals the number of non-null rows.

  • errors ({'raise', 'warn', 'ignore'}, default 'raise') –

    How to handle no-match cases:
    • raise: Raises a ValueError.

    • warn: Issues a UserWarning and returns based on as_frame or empty_as_none.

    • ignore: Returns an empty result based on the same parameters without warning.

  • empty_as_none (bool, default True) – Applies only if as_frame is False. Defines whether to return None (if True) or an empty list (if False) when no ID column is found and errors is 'warn' or 'ignore'.

  • as_list (bool, default False) – If True, return all matched columns. If False, return only the first match. Affects both name returns and DataFrame returns.

  • case_sensitive (bool, default False) – If False, comparisons (including regex) are performed in a case-insensitive manner.

  • as_frame (bool, default False) – If True, return the matched columns as a pandas DataFrame. If as_list is True, it may include multiple columns. If no column is found, returns an empty DataFrame (if errors is 'warn' or 'ignore').

Returns:

Depends on as_frame, as_list, and the number of matching columns:

  • as_frame=False, as_list=False: returns the first match as a string, or None/[].

  • as_frame=False, as_list=True: returns all matching column names as a list of strings.

  • as_frame=True, as_list=False: returns a DataFrame with the first matched column. If no match is found, an empty DataFrame may be returned.

  • as_frame=True, as_list=True: returns a DataFrame with all matched columns included.

Return type:

str or List[str] or pandas.DataFrame or None

Notes

  • For the dtype strategy, integer, string, and object columns are inspected. The function calculates a uniqueness ratio and compares it against uniqueness_threshold.

  • Negative or zero thresholds are invalid, as are values above 1.

  • If the DataFrame has no columns or is empty, the behavior is determined by errors.

  • The relational-model motivation for schema-oriented column handling goes back to Codd [28].
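The uniqueness ratio of equation (5) and the naive name match can be sketched on plain Python lists. Both helpers are hypothetical, None stands in for a missing value, and the substring test is one interpretation of "id is part of the name".

```python
def uniqueness_ratio_sketch(values):
    """r = unique_values / non_NA_rows, with None treated as missing."""
    non_na = [v for v in values if v is not None]
    if not non_na:
        return 0.0
    return len(set(non_na)) / len(non_na)


def naive_id_names_sketch(column_names, case_sensitive=False):
    """Return the names containing 'id', optionally ignoring case."""
    return [
        name for name in column_names
        if "id" in (name if case_sensitive else name.lower())
    ]


ratio_unique = uniqueness_ratio_sketch([101, 102, 103])
ratio_repeated = uniqueness_ratio_sketch([1, 1, 2, None])
matches = naive_id_names_sketch(["ID_code", "Name", "value"])
matches_cs = naive_id_names_sketch(["ID_code", "Name", "value"], case_sensitive=True)
```

Under this reading, a pure substring match would also flag names such as 'width', which is one reason the stricter exact and prefix_suffix strategies can be preferable.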

Examples

>>> from geoprior.utils.generic_utils import find_id_column
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'ID_code': [101, 102, 103],
...     'Name': ['Alice', 'Bob', 'Charlie'],
...     'value': [10, 20, 30]
... })
>>> # Example using the 'naive' strategy
>>> col = find_id_column(data, strategy='naive')
>>> print(col)  # Might return 'ID_code'
>>> # Example with as_list=True
>>> cols = find_id_column(data, strategy='naive',
...                       as_list=True)
>>> print(cols)  # ['ID_code']

See also

re.compile

The regex compilation method used when strategy='regex'.

pandas.api.types.is_integer_dtype

Checks integer type.

pandas.api.types.is_string_dtype

Checks string type.

pandas.api.types.is_object_dtype

Checks object type.

geoprior.utils.generic_utils.check_group_column_validity(df, group_col, ops='check_only', max_unique=10, auto_bin=False, bins=4, error='warn', bin_labels=None, verbose=True)

Validate a grouping column for categorical-style use and optionally bin it.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame holding the grouping column.

  • group_col (str) – Name of the candidate grouping column in df.

  • ops ({'check_only', 'binning', 'validate'}, optional) – Operation mode. Use "check_only" to return a boolean, "binning" to bin the column when needed and return a modified DataFrame, or "validate" to check validity while honoring error.

  • max_unique (int, optional) – Maximum number of unique numeric values allowed before the column is treated as too continuous for categorical use.

  • auto_bin (bool, optional) – Whether to auto-bin a numeric column when ops='binning'.

  • bins (int, optional) – Number of bins to create when binning is applied.

  • error ({'warn', 'raise', 'ignore'}, optional) – Policy used when validation fails.

  • bin_labels (list of str or None, optional) – Custom labels for generated bins.

  • verbose (bool, optional) – Whether to emit informational messages.

Returns:

Returns a boolean for ops='check_only'. Otherwise returns a DataFrame, possibly with a transformed group_col.

Return type:

bool or pandas.DataFrame

Notes

When quantile binning is used, interval boundaries are derived from the numeric distribution of group_col.
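The idea behind the max_unique check and quantile binning can be sketched with the standard library. The helpers are hypothetical. The cut points come from statistics.quantiles with its default method, which need not match the boundaries the library computes.

```python
from bisect import bisect_left
from statistics import quantiles


def is_too_continuous_sketch(values, max_unique=10):
    """A column with more than max_unique distinct values is not categorical-like."""
    return len(set(values)) > max_unique


def quantile_bins_sketch(values, bins=4):
    """Assign each value a bin index using quantile cut points."""
    cuts = quantiles(values, n=bins)  # bins - 1 interior boundaries
    return [bisect_left(cuts, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8]
bin_ids = quantile_bins_sketch(values, bins=4)
needs_binning = is_too_continuous_sketch(range(20), max_unique=10)
```

Because the boundaries follow the data distribution, each of the four bins receives two of the eight values here.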

geoprior.utils.generic_utils.save_all_figures(output_dir='figures', prefix='figure', fmts=('png',), close=True, dpi=150, transparent=False, timestamp=True, verbose=True)

Save all currently open Matplotlib figures to disk in specified formats.

Parameters:
  • output_dir (str) – Directory where figures will be saved. Created if it does not exist.

  • prefix (str) – Filename prefix for each figure.

  • fmts (list or tuple of str) – File formats/extensions to use (e.g., ('png', 'pdf')).

  • close (bool) – Whether to close each figure after saving. Default is True.

  • dpi (int or None) – Resolution in dots per inch. None uses Matplotlib default.

  • transparent (bool) – Whether to save figures with transparent background.

  • timestamp (bool) – Append current timestamp (YYYYmmddTHHMMSS) to filenames.

  • verbose (bool) – Print progress messages.

Returns:

List of saved file paths.

Return type:

List[str]
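The filename pattern implied by the prefix and timestamp parameters can be sketched as follows. The helper figure_path_sketch is hypothetical, the prefix_number_timestamp.ext layout is inferred from the example output in this section, and the name used when timestamp=False is an assumption.

```python
from datetime import datetime
from pathlib import Path


def figure_path_sketch(output_dir, prefix, number, fmt, timestamp=True, now=None):
    """Build a path such as <output_dir>/<prefix>_<number>_<YYYYmmddTHHMMSS>.<fmt>."""
    parts = [prefix, str(number)]
    if timestamp:
        # The 'now' argument exists only to make the example reproducible.
        parts.append((now or datetime.now()).strftime("%Y%m%dT%H%M%S"))
    return Path(output_dir) / f"{'_'.join(parts)}.{fmt}"


stamped = figure_path_sketch(
    "plots", "figure", 1, "png", now=datetime(2025, 5, 21, 15, 30, 45)
)
plain = figure_path_sketch("plots", "figure", 2, "pdf", timestamp=False)
```

Using a sortable timestamp keeps repeated runs from overwriting earlier figures.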

Examples

>>> import matplotlib.pyplot as plt
>>> plt.figure(); plt.plot([1, 2, 3])
>>> from geoprior.utils.generic_utils import save_all_figures
>>> paths = save_all_figures(output_dir="plots", fmts=("png",))
>>> print(paths)
['plots/figure_1_20250521T153045.png']

geoprior.utils.generic_utils.rename_dict_keys(data, param_to_rename=None, order='forward')

Renames keys in the data dictionary based on the provided param_to_rename dictionary.

For each mapping in param_to_rename, the function checks whether the old key is present in data. If it is, the key is renamed to the mapped new key. If it is not (for example, because data already uses the new key), that mapping is skipped. When no rename applies, the original dictionary is returned.

Parameters:
  • data (dict) – The dictionary whose keys may be renamed. The function will iterate over the keys of this dictionary and rename them according to the mapping provided in param_to_rename.

  • param_to_rename (dict, optional) – A dictionary mapping old keys to new keys. Each key in this dictionary represents an old key that may be found in data, and the corresponding value is the new key. If None, no renaming is performed. If a key in data matches an old key in param_to_rename, that key will be renamed.

  • order ({'forward', 'reverse'}, default 'forward') –

    Order for renaming keys in a flat dict:

    forward (default):
      param_to_rename = {old_key: new_key}

    reverse:
      param_to_rename = {
        canonical_key: alias or (alias1, alias2, ...)
      }
      The first alias found in `data` is moved under the
      canonical key. If the canonical key already exists,
      nothing is changed for that mapping.
    

Returns:

The updated dictionary with keys renamed as per the param_to_rename mapping. If no keys need renaming, the original dictionary is returned.

Return type:

dict

Raises:

ValueError – If param_to_rename is neither None nor a dictionary.

Examples

>>> from geoprior.utils.generic_utils import rename_dict_keys

Example 1: Renaming a key in the dictionary:

>>> data = {"subsidence": 100}
>>> param_to_rename = {"subsidence": "subs_pred"}
>>> rename_dict_keys(data, param_to_rename)
{'subs_pred': 100}

Example 2: When the key is already valid (no change needed):

>>> data = {"subs_pred": 100}
>>> param_to_rename = {"subsidence": "subs_pred"}
>>> rename_dict_keys(data, param_to_rename)
{'subs_pred': 100}

Example 3: When param_to_rename is None, no renaming is performed:

>>> data = {"subsidence": 100}
>>> rename_dict_keys(data)
{'subsidence': 100}

Notes

  • If param_to_rename is None, no renaming occurs, and the data dictionary is returned as is.

  • This function raises an error if param_to_rename is not a dictionary. Ensure that the parameter is a valid dictionary of old-to-new key mappings.
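
To make the difference between the two order values concrete, here is a minimal re-implementation of the documented behaviour. It is a sketch written from the description above, not the library's source; the name `rename_keys_sketch` and the sample keys `"subs"` and `"gwl"` are hypothetical.

```python
def rename_keys_sketch(data, mapping=None, order="forward"):
    """Illustrative version of the forward/reverse renaming rules."""
    if mapping is None:
        return data  # nothing to rename
    if not isinstance(mapping, dict):
        raise ValueError("mapping must be a dict or None")

    out = dict(data)  # work on a shallow copy
    if order == "forward":
        # mapping = {old_key: new_key}
        for old, new in mapping.items():
            if old in out:
                out[new] = out.pop(old)
    elif order == "reverse":
        # mapping = {canonical_key: alias or (alias1, alias2, ...)}
        for canonical, aliases in mapping.items():
            if canonical in out:
                continue  # canonical key already present: leave it alone
            if isinstance(aliases, str):
                aliases = (aliases,)
            for alias in aliases:
                if alias in out:
                    out[canonical] = out.pop(alias)
                    break  # only the first alias found is moved
    else:
        raise ValueError(f"unknown order: {order!r}")
    return out


forward = rename_keys_sketch({"subsidence": 100}, {"subsidence": "subs_pred"})
reverse = rename_keys_sketch(
    {"subs": 100, "gwl": 3.2},
    {"subs_pred": ("subsidence", "subs")},
    order="reverse",
)
print(forward, reverse)
```

The reverse form is convenient when several legacy spellings should collapse onto one canonical key, since a single entry lists all accepted aliases.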

geoprior.utils.generic_utils.normalize_time_column(df, time_col, datetime_col='datetime_temp', year_col='year_int', drop_orig=False)[source]#

Normalize a time column into a datetime column and an integer year.

The input column may contain integer years, strings, or existing pandas datetime values. The function creates datetime_col with parsed timestamps and year_col with the extracted integer year. When drop_orig=True, the original time_col is removed and datetime_col is renamed back to time_col.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing a time column named time_col.

  • time_col (str) – Name of the column to normalize.

  • datetime_col (str, default 'datetime_temp') – Name of the parsed datetime column.

  • year_col (str, default 'year_int') – Name of the extracted integer year column.

  • drop_orig (bool, default False) – If True, drop the original time_col after parsing and rename datetime_col back to time_col.

Returns:

A copy of df with the parsed datetime column and integer year column.

Return type:

pandas.DataFrame

Raises:
  • ValueError – If time_col is missing or parsing fails for any entry.

  • TypeError – If df is not a pandas DataFrame.
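
The normalization rule can be illustrated on single values with the standard library. This is only a per-value sketch of the idea. The real function works on DataFrame columns, and its parsing rules are not documented here, so two choices below are assumptions: a bare year maps to 1 January, and other strings are parsed as ISO 8601. The helper name `to_datetime_and_year` is hypothetical.

```python
from datetime import datetime


def to_datetime_and_year(value):
    """Return (datetime, year) for an int year, a string, or a datetime."""
    if isinstance(value, datetime):
        parsed = value
    elif isinstance(value, int):
        # Assumption: an integer year is anchored at 1 January.
        parsed = datetime(value, 1, 1)
    elif isinstance(value, str):
        text = value.strip()
        if text.isdigit():
            parsed = datetime(int(text), 1, 1)
        else:
            # Assumption: ISO 8601 strings; raises ValueError otherwise.
            parsed = datetime.fromisoformat(text)
    else:
        raise ValueError(f"cannot parse time value: {value!r}")
    return parsed, parsed.year


for raw in (2020, "2019", "2021-06-15", datetime(2022, 3, 1)):
    print(raw, "->", to_datetime_and_year(raw))
```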

geoprior.utils.generic_utils.select_mode(mode=None, default='pihal_like', canonical=None)[source]#

Resolve a user-supplied mode alias to a canonical value.

Parameters:
  • mode (str or None, optional) – Case-insensitive mode alias. Accepted values include 'pihal', 'pihal_like', 'tft', 'tft_like', or None to fall back to default.

  • default (str, optional) – Canonical value returned when mode is None. Defaults to 'pihal_like'.

  • canonical (dict or list or None, optional) – Custom alias mapping. A dictionary maps input strings to canonical values. A list is treated as an identity mapping for its items.

Returns:

Canonical string corresponding to the resolved mode.

Return type:

str

Raises:

ValueError – If mode does not match any accepted alias.
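
The alias resolution described above can be sketched as follows. The built-in alias table is an assumption (the source does not state which canonical strings the aliases map to), and `resolve_mode` is a hypothetical stand-in, not the library function.

```python
def resolve_mode(mode=None, default="pihal_like", canonical=None):
    """Resolve a case-insensitive alias to a canonical string."""
    if canonical is None:
        # Assumed alias table; the real one in select_mode may differ.
        canonical = {
            "pihal": "pihal_like",
            "pihal_like": "pihal_like",
            "tft": "tft_like",
            "tft_like": "tft_like",
        }
    elif isinstance(canonical, dict):
        canonical = {str(k).lower(): v for k, v in canonical.items()}
    else:
        # A list is treated as an identity mapping for its items.
        canonical = {str(item).lower(): item for item in canonical}

    if mode is None:
        return default
    key = str(mode).strip().lower()
    if key not in canonical:
        raise ValueError(
            f"Unknown mode {mode!r}; expected one of {sorted(canonical)}"
        )
    return canonical[key]


print(resolve_mode("TFT"))
print(resolve_mode(None))
print(resolve_mode("fast", canonical=["fast", "slow"]))
```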

geoprior.utils.generic_utils.normalize_model_inputs(*data)[source]#

Parameters:

data (DataFrame | Mapping[str, DataFrame] | list | tuple)

Return type:

dict[str, DataFrame]

geoprior.utils.generic_utils.print_config_table(sections, title=None, table_width=None, sort_keys=True, key_col_fraction=0.35, max_value_length=200, log_fn=None)[source]#

Pretty-print configuration or hyperparameters as a key/value table.

This helper is intended for CLI scripts (Stage-1, training, tuning) so that the user can quickly inspect which parameters are actually in effect.

Parameters:
  • sections (dict or sequence of (str, dict)) –

    If a single dict is passed, all key/value pairs are printed in one block.

    If a sequence is passed, it must contain (name, params) tuples, where name is a section label (e.g. "Physics") and params is a dict mapping parameter names to values.

  • title (str, optional) – Optional title displayed above the table (centered).

  • table_width (int, optional) – Total width of the printed table. If None, the function tries to use geoprior.api.util.get_table_size(). If that fails, it falls back to the terminal width (via shutil.get_terminal_size) or 80 characters.

  • sort_keys (bool, default True) – Whether to sort parameter names alphabetically within each section.

  • key_col_fraction (float, default 0.35) – Fraction of the table width allocated to the parameter-name column. The remainder is used for the value column.

  • max_value_length (int, default 200) – Maximum number of characters kept from the stringified value. Longer values are truncated with an ellipsis ("...") before being wrapped onto multiple lines.

  • log_fn (callable, optional) – Function used to emit lines (defaults to print()). This allows capturing the table in logs if needed.

Returns:

The full rendered table as a single string. It is always emitted via log_fn as a side effect.

Return type:

str

Notes

  • Nested containers (lists, tuples, dicts) are rendered in a compact one-line form and then wrapped to fill the value column.

  • This function is intentionally lightweight and does not depend on external tabulation libraries, so it can be used safely in minimal Stage-1 / Stage-2 scripts.
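
The interplay of key_col_fraction, max_value_length, and log_fn can be sketched with `textwrap` alone. The function below is a simplified stand-in written from the parameter descriptions, not the library's code: the name `render_table`, the `" | "` separator, and the sample parameters are assumptions, and sections and titles are omitted.

```python
import textwrap


def render_table(params, table_width=40, key_col_fraction=0.35,
                 max_value_length=200, sort_keys=True, log_fn=print):
    """Render a flat dict as a two-column key/value table."""
    key_w = max(1, int(table_width * key_col_fraction))
    val_w = max(1, table_width - key_w - 3)  # 3 characters for " | "
    items = sorted(params.items()) if sort_keys else list(params.items())

    lines = []
    for key, value in items:
        text = str(value)
        if len(text) > max_value_length:
            # Truncate first, then wrap the shortened text.
            text = text[: max(0, max_value_length - 3)] + "..."
        wrapped = textwrap.wrap(text, width=val_w) or [""]
        for i, chunk in enumerate(wrapped):
            label = str(key) if i == 0 else ""  # key only on the first row
            lines.append(f"{label:<{key_w}} | {chunk}")

    table = "\n".join(lines)
    log_fn(table)  # emitted as a side effect, and also returned
    return table


captured = []
table = render_table({"lr": 0.001, "epochs": 50}, log_fn=captured.append)
print(table)
```

Routing output through a callable such as `captured.append` or a logger method is what allows the same table to be both displayed and stored.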