geoprior.utils.base_utils

Essential utilities for data processing and analysis in FusionLab, offering functions for normalization, interpolation, feature selection, outlier removal, and various data manipulation tasks.

Adapted for FusionLab from the original geoprior.utils.base_utils.

Functions

`check_file_exists`(package, resource)	Check if a file exists in a package's directory with importlib.resources.
`detect_categorical_columns`(data[, ...])	Detect categorical columns in a dataset by examining column types and user-defined criteria.
`download_file`(url, filename[, dstpath])	Download a remote file.
`extract_target`(data, target_names[, drop, ...])	Extracts specified target column(s) from a multidimensional numpy array or pandas DataFrame.
`fancier_downloader`(url, filename[, dstpath, ...])	Download a remote file with a progress bar and optional size verification.
`fillNaN`(arr[, method])	Fill NaN values in a numpy array, pandas Series, or pandas DataFrame using specified methods for forward filling, backward filling, or both.
`fill_NaN`(arr[, method])	Fill NaN values in an array-like structure using specified methods.
`move_file`(file_path, directory)	Move file to a directory.
`select_features`(data[, features, ...])	Selects features from a dataset based on various criteria and returns a new DataFrame.
`validate_target_in`(df, target[, error, verbose])	Validate and process the target variable, ensuring it is consistent with the features in the DataFrame.

geoprior.utils.base_utils.detect_categorical_columns(data, integer_as_cat=True, float0_as_cat=True, min_unique_values=None, max_unique_values=None, handle_nan=None, return_frame=False, consider_dt_as=None, verbose=0)

Detect categorical columns in a dataset by examining column types and user-defined criteria. Columns with integer type or float values ending with .0 can be categorized as categorical, depending on settings. Also handles user-defined thresholds for minimum and maximum unique values.

(1) \[\forall x \in X,\; x = \lfloor x \rfloor\]

Equation (1) states that a float column is treated as categorical only if every value \(x\) in the column equals its floor, that is, has no fractional part. The function relies on the helpers build_data_if, drop_nan_in, fill_NaN, parameter_validator, and smart_format (private helpers prefixed with _ are not listed).
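The integrality condition in Equation (1) can be checked directly with NumPy. The following is a minimal sketch rather than the library's implementation: the helper name `is_integer_valued` is hypothetical, and dropping NaN values before the comparison is an assumption.

```python
import numpy as np
import pandas as pd


def is_integer_valued(column: pd.Series) -> bool:
    """Return True if every non-missing value x satisfies x == floor(x)."""
    values = column.dropna().to_numpy(dtype=float)
    # An empty or all-NaN column gives no evidence of being integer-valued.
    if values.size == 0:
        return False
    return bool(np.all(values == np.floor(values)))


df = pd.DataFrame({"B": [1.0, 2.0, 3.0], "D": [1.5, 2.0, 3.0]})
flags = {name: is_integer_valued(df[name]) for name in df.columns}
print(flags)
```

Column B passes the check because all of its values are whole numbers, while D fails because of the single value 1.5.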

Parameters:

data (DataFrame or array-like) – The input data to analyze. If not a DataFrame, it will be converted internally.
integer_as_cat (bool, optional) – If True, integer-type columns are considered categorical. Default is True.
float0_as_cat (bool, optional) – If True, float columns whose values can be cast to integer without remainder are considered categorical. Default is True.
min_unique_values (int or None, optional) – Minimum number of unique values in a column to qualify as categorical. If None, no minimum check is applied.
max_unique_values (int, 'auto', or None, optional) – Maximum number of unique values allowed for a column to be considered categorical. If 'auto', set the limit to the column’s own unique count. If None, no maximum check is applied.
handle_nan (str or None, optional) – Handling method for missing data. Can be 'drop' to remove rows with NaNs, 'fill' to impute them via forward/backward fill, or None for no change.
return_frame (bool, optional) – If True, returns a DataFrame of detected categorical columns; otherwise returns a list of column names. Default is False.
consider_dt_as (str, optional) – Indicates how to handle or convert datetime columns when ops='validate'. Use None to keep datetime columns as-is. Use 'numeric' for timestamp-style conversion, 'float', 'float32' or 'float64' for float conversion, 'int', 'int32' or 'int64' for integer conversion, and 'object' or 'category' to convert them to Python objects such as strings. If conversion fails, behavior follows the configured error policy.
verbose (int, optional) – Verbosity level. If greater than 0, a summary of detected columns is printed.

Returns:

Either a list of column names or a DataFrame containing the categorical columns, depending on the value of return_frame.

Return type:

list or DataFrame

Examples

>>> from geoprior.utils.base_utils import detect_categorical_columns
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': [1.0, 2.0, 3.0],
...     'C': ['cat', 'dog', 'mouse']
... })
>>> detect_categorical_columns(df)
['A', 'B', 'C']

Notes

This function focuses on flexible treatment of integer and float columns. Combined with verbose settings, it can provide detailed feedback. Using 'drop' or 'fill' for handle_nan helps reduce disruptions caused by missing data. The array-programming background is discussed in Harris et al. [36].

The function uses flexible criteria for determining whether a column should be treated as categorical, allowing for detection of columns with integer values or float values ending in .0 as categorical columns. The method is useful when preparing data for machine learning algorithms that expect categorical inputs, such as decision trees or classification models.
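To make the detection criteria concrete, here is a simplified, hypothetical stand-in written with plain pandas. It is a sketch under stated assumptions, not the function's actual code: the name `sketch_detect_categorical` is invented for illustration, non-numeric columns are assumed to always count as candidates, and the 'auto' setting, NaN handling, and datetime conversion are omitted.

```python
import pandas as pd
from pandas.api import types as pdt


def sketch_detect_categorical(df, integer_as_cat=True, float0_as_cat=True,
                              min_unique_values=None, max_unique_values=None):
    """Hypothetical, simplified illustration of the detection rules."""
    detected = []
    for name in df.columns:
        col = df[name].dropna()
        if pdt.is_integer_dtype(col):
            candidate = integer_as_cat
        elif pdt.is_float_dtype(col):
            # Float columns qualify only when every value is a whole number.
            candidate = float0_as_cat and bool((col % 1 == 0).all())
        else:
            # Assumption: non-numeric columns (e.g. strings) are candidates.
            candidate = not pdt.is_numeric_dtype(col)
        n_unique = col.nunique()
        if min_unique_values is not None and n_unique < min_unique_values:
            candidate = False
        if max_unique_values is not None and n_unique > max_unique_values:
            candidate = False
        if candidate:
            detected.append(name)
    return detected


df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [1.0, 2.0, 3.0],
    "C": ["cat", "dog", "mouse"],
    "D": [0.5, 1.5, 2.5],
    "E": ["x", "x", "y"],
})
default_result = sketch_detect_categorical(df)
no_int_result = sketch_detect_categorical(df, integer_as_cat=False)
capped_result = sketch_detect_categorical(df, max_unique_values=2)
```

In this sketch the unique-value thresholds act as a final filter, so a column that passes the type rule can still be rejected for having too few or too many distinct values.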

This method uses the helper function build_data_if from geoprior.utils.validator to ensure that the input data is a DataFrame. If the input is not a DataFrame, it creates one, giving column names that start with input_name.

See also

build_data_if: Validates and converts input into a DataFrame if needed.
drop_nan_in: Drops NaN values from a DataFrame along axis=0.
fill_NaN: Fills missing data in a DataFrame using forward and backward fill.

geoprior.utils.base_utils.extract_target(data, target_names, drop=True, columns=None, return_y_X=False)

Extracts specified target column(s) from a multidimensional numpy array or pandas DataFrame.

Options are available to rename the columns of a DataFrame and to control whether the extracted columns are dropped from the original data.

Parameters:

data (Union[np.ndarray, pd.DataFrame]) – The input data from which target columns are to be extracted. Can be a NumPy array or a pandas DataFrame.
target_names (Union[str, int, List[Union[str, int]]]) – The name(s) or integer index/indices of the column(s) to extract. If data is a DataFrame, this can be a mix of column names and indices. If data is a NumPy array, only integer indices are allowed.
drop (bool, default True) – If True, the extracted columns are removed from the original data. If False, the original data remains unchanged.
columns (Optional[List[str]], default None) – If provided and data is a DataFrame, specifies new names for the columns in data. The length of columns must match the number of columns in data. This parameter is ignored if data is a NumPy array.
return_y_X (bool, default False) – If True, returns a tuple (y, X) where X is the data with the target columns removed and y is the target columns. If False, returns only y.

Returns:

If return_y_X is True, returns a tuple (y, X) where y is the target columns and X is the data with the target columns removed. If return_y_X is False, returns only y.

Return type:

Union[ArrayLike, pd.Series, pd.DataFrame, Tuple[ pd.DataFrame, ArrayLike]]

Raises:

ValueError –
- If columns is provided and its length does not match the number of columns in data.
- If any of the specified target_names do not exist in data.
- If target_names includes a mix of strings and integers for a NumPy array input.
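For orientation, the same split between target and features can be written with plain pandas and NumPy. This is a sketch of the underlying operations, not of extract_target itself, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# DataFrame case: select the target by name, then drop it from the features.
y_df = df["B"]
X_df = df.drop(columns=["B"])

# NumPy case: columns have no names, so only integer positions make sense.
arr = np.arange(12).reshape(4, 3)
y_arr = arr[:, 2]
X_arr = np.delete(arr, 2, axis=1)
```

Because a NumPy array carries no column labels, the array branch can only address columns by position, which is why string names are rejected for array input.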

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.base_utils import extract_target
>>> df = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': [4, 5, 6],
...     'C': [7, 8, 9]
... })
>>> target = extract_target(df, 'B', drop=True, return_y_X=False)
>>> print(target)
0    4
1    5
2    6
Name: B, dtype: int64
>>> target, remaining = extract_target(df, 'B', drop=True, return_y_X=True)
>>> print(target)
0    4
1    5
2    6
Name: B, dtype: int64
>>> print(remaining)
   A  C
0  1  7
1  2  8
2  3  9
>>> arr = np.random.rand(5, 3)
>>> target, modified_arr = extract_target(arr, 2, return_y_X=True)
>>> print(target)
>>> print(modified_arr)

geoprior.utils.base_utils.fancier_downloader(url, filename, dstpath=None, check_size=False, error='raise', verbose=True)

Download a remote file with a progress bar and optional size verification.

This function downloads a file from the specified url and saves it locally with the given filename. It provides a visual progress bar during the download process and offers an option to verify the downloaded file’s size against the expected size to ensure data integrity. Additionally, the function allows for moving the downloaded file to a specified destination directory.

(2) \[|S_{downloaded} - S_{expected}| < \epsilon\]

where \(S_{downloaded}\) is the size of the downloaded file, \(S_{expected}\) is the size specified by the server, and \(\epsilon\) is a small tolerance value.
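The size comparison above can be sketched with the standard library alone. The helper name `size_matches` and the default tolerance of one byte are assumptions made for illustration; in a real download the expected size would come from the server, for example from a Content-Length header.

```python
import os
import tempfile


def size_matches(path: str, expected_size: int, epsilon: float = 1.0) -> bool:
    """Check |S_downloaded - S_expected| < epsilon, with sizes in bytes."""
    downloaded_size = os.path.getsize(path)
    return abs(downloaded_size - expected_size) < epsilon


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.bin")
    with open(path, "wb") as fh:
        fh.write(b"\0" * 1024)  # stand-in for a downloaded file
    complete = size_matches(path, expected_size=1024)
    truncated = size_matches(path, expected_size=2048)
```

With integer byte counts and a tolerance of 1, the check passes only on an exact size match. A larger tolerance would accept small discrepancies.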

Parameters:

url (str) – The URL from which to download the remote file.
filename (str) – The desired name for the local file. This is the name under which the file will be saved after downloading.
dstpath (Optional[str], default None) – The destination directory path where the downloaded file should be saved. If None, the file is saved in the current working directory.
check_size (bool, default False) –
Whether to verify the size of the downloaded file against the expected size obtained from the server. This is useful for ensuring the integrity of the downloaded file. When True, the function checks:

(3)\[|S_{downloaded} - S_{expected}| < \epsilon\]

If the size check fails:
- If error='raise', an exception is raised.
- If error='warn', a warning is emitted.
- If error='ignore', the discrepancy is ignored, and the function continues.
error (str, default 'raise') –
Specifies how to handle errors during the size verification process.
- 'raise': Raises an exception if the file size does not match.
- 'warn': Emits a warning and continues execution.
- 'ignore': Silently ignores the size discrepancy and proceeds.
verbose (bool, default True) – Controls the verbosity of the function. If True, the function will print informative messages about the download status, including progress updates and success or failure notifications.
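The three error policies can be illustrated with a small dispatcher. This is a sketch only: the helper name `report_size_mismatch` is hypothetical, and raising RuntimeError for a size mismatch is an assumption.

```python
import warnings


def report_size_mismatch(message: str, error: str = "raise") -> bool:
    """Apply a 'raise' / 'warn' / 'ignore' policy; return True to continue."""
    if error not in {"raise", "warn", "ignore"}:
        raise ValueError(
            f"error must be 'raise', 'warn' or 'ignore', got {error!r}"
        )
    if error == "raise":
        # Assumption: a mismatch is surfaced as RuntimeError.
        raise RuntimeError(message)
    if error == "warn":
        warnings.warn(message, stacklevel=2)
    # 'warn' and 'ignore' both let the caller carry on.
    return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    continued = report_size_mismatch("size mismatch", error="warn")
n_warnings = len(caught)
```

Validating the policy string first means a typo such as 'warning' fails loudly instead of being silently treated as 'ignore'.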

Returns:

Returns None if dstpath is provided and the file is moved to the destination. Otherwise, returns the local filename as a string.

Return type:

Optional[str]

Raises:

RuntimeError – If the download fails and error is set to 'raise'.
ValueError – If an invalid value is provided for the error parameter.

Examples

>>> from geoprior.utils.base_utils import fancier_downloader
>>> url = 'https://example.com/data/file.h5'
>>> local_filename = 'file.h5'
>>> # Download to current directory without size check
>>> fancier_downloader(url, local_filename)
>>>
>>> # Download to a specific directory with size verification
>>> fancier_downloader(
...     url,
...     local_filename,
...     dstpath='/path/to/save/',
...     check_size=True,
...     error='warn',
...     verbose=True
... )
>>>
>>> # Handle size mismatch by raising an exception
>>> fancier_downloader(
...     url,
...     local_filename,
...     check_size=True,
...     error='raise'
... )

Notes

Progress Bar: The function uses the tqdm library to display a progress bar during the download. If tqdm is not installed, it falls back to a basic downloader without a progress bar.
Directory Creation: If the specified dstpath does not exist, the function will attempt to create it to ensure the file is saved correctly.
File Integrity: Enabling check_size helps in verifying that the downloaded file is complete and uncorrupted. However, it does not perform a checksum verification.
Progress-reporting patterns and surrounding tooling are described in [37, 38].
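The optional-dependency pattern described in the notes can be sketched as follows. The helper `copy_stream` is hypothetical and copies between file-like objects so that it runs without network access. In an actual download the source would be an HTTP response stream.

```python
import io

try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None


def copy_stream(src, dst, total=None, chunk_size=8192):
    """Copy src to dst in chunks, with a progress bar if tqdm is installed."""
    bar = None
    if tqdm is not None:
        bar = tqdm(total=total, unit="B", unit_scale=True)
    written = 0
    try:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
            if bar is not None:
                bar.update(len(chunk))
    finally:
        if bar is not None:
            bar.close()
    return written


source = io.BytesIO(b"x" * 20000)
target = io.BytesIO()
n_bytes = copy_stream(source, target, total=20000)
```

The returned byte count is the natural input for a size check: comparing it with the expected size detects truncated transfers, though not corrupted content, since no checksum is computed.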

See also

requests.get: Function to perform HTTP GET requests.
tqdm: A library for creating progress bars.
os.makedirs: Function to create directories.
geoprior.utils.base_utils.check_file_exists: Utility to check file existence.

geoprior.utils.base_utils.fillNaN(arr, method='ff')

Fill NaN values in a numpy array, pandas Series, or pandas DataFrame using specified methods for forward filling, backward filling, or both.

Parameters:

arr (Union[np.ndarray, pd.Series, pd.DataFrame]) – The input data containing NaN values to be filled. This can be a numpy array, pandas Series, or DataFrame expected to contain numeric data.
method (str, optional) –
The method used for filling NaN values. Valid options are:
- 'ff': forward fill (default)
- 'bf': backward fill
- 'both': applies both forward and backward fill sequentially

Returns:

The array with NaN values filled according to the specified method. The return type matches the input type (numpy array, Series, or DataFrame).

Return type:

Union[np.ndarray, pd.Series, pd.DataFrame]
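The three fill strategies correspond to standard pandas operations, shown here on a small Series. This is an illustration of the fill semantics rather than of fillNaN's internals, and the forward-then-backward order used for 'both' is an assumption.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()       # 'ff': carry the last valid value forward
backward = s.bfill()      # 'bf': carry the next valid value backward
both = s.ffill().bfill()  # 'both': forward fill, then backward fill the rest
```

Forward fill alone leaves a leading NaN and backward fill alone leaves a trailing NaN. Only the combination fills every gap, provided the data contains at least one valid value.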

geoprior.utils.base_utils.select_features(data, features=None, dtypes_inc=None, dtypes_exc=None, coerce=False, columns=None, verify_integrity=False, parse_features=False, include_missing=None, exclude_missing=None, transform=None, regex=None, callable_selector=None, inplace=False, **astype_kwargs)

Selects features from a dataset based on various criteria and returns a new DataFrame.

Conceptually, the selected columns are the subset of the input column set that satisfies the requested feature names, data-type filters, regex patterns, callable selectors, and missing-data conditions.
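Each of these criteria has a direct counterpart in plain pandas, which may help clarify what is being selected. The snippet below sketches those counterparts and makes no claim about how select_features combines them internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Blue", "Red", "Green"],
    "Price": [200.0, 300.0, 100.0],
    "Discount": [20.0, 30.0, np.nan],
})

# Data-type filter: keep numeric columns only.
by_dtype = df.select_dtypes(include="number")

# Regex filter: keep columns whose name matches the pattern.
by_regex = df.filter(regex=r"^Price|Discount$")

# Callable selector: keep columns for which the predicate is True.
by_callable = df.loc[:, [c for c in df.columns if c.startswith("C")]]

# Missing-data condition: keep columns that contain at least one NaN.
with_missing = df.loc[:, df.isna().any()]
```

Each expression returns a new DataFrame and leaves `df` untouched.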

Parameters:

data (Union[pd.DataFrame, dict, np.ndarray, list]) – The dataset from which to select features. Can be a pandas DataFrame, a dictionary, a NumPy array, or a list of dictionaries/lists.
features (Optional[Union[List[str], Pattern, Callable[[str], bool]]], default None) – Specific feature names to select. Can also be a regex pattern or a callable that takes a column name and returns True if the column should be selected.
dtypes_inc (Optional[Union[str, List[str]]], default None) – The data type(s) to include in the selection. Possible values are the same as for the pandas include parameter in select_dtypes.
dtypes_exc (Optional[Union[str, List[str]]], default None) – The data type(s) to exclude from the selection. Possible values are the same as for the pandas exclude parameter in select_dtypes.
coerce (bool, default False) – If True, numeric columns are coerced to the appropriate types without selection, ignoring features, dtypes_inc, and dtypes_exc parameters.
columns (Optional[List[str]], default None) – Column names to use if data is a NumPy array or a list without column names.
verify_integrity (bool, default False) – Verifies the data type integrity and converts data to the correct types if necessary.
parse_features (bool, default False) – Parses string features and converts them to an iterable object (e.g., lists).
include_missing (Optional[bool], default None) – If True, includes only columns with missing values. If False, excludes columns with missing values.
exclude_missing (Optional[bool], default None) – If True, excludes columns with any missing values.
transform (Optional, default None) – Function or dictionary of functions to apply to the selected columns. If a dictionary is provided, keys should correspond to column names.
regex (Optional[Union[str, Pattern]], default None) – Regular expression pattern to select columns.
callable_selector (Optional[Callable[[str], bool]], default None) – A callable that takes a column name and returns True if the column should be selected.
inplace (bool, default False) – If True, modifies the data in place. Otherwise, returns a new DataFrame.
**astype_kwargs (Any) – Additional keyword arguments for pandas.DataFrame.astype.

Returns:

A new DataFrame with the selected features.

Return type:

pd.DataFrame

Raises:

ValueError – If no columns match the selection criteria and coerce is False.
TypeError – If regex is not a string or compiled regex pattern. If callable_selector is not a callable. If transform is not a callable or a dictionary of callables. If provided parameters are of incorrect types.

Examples

>>> from geoprior.utils.base_utils import select_features
>>> import pandas as pd
>>> import re
>>> import numpy as np
>>> data = {
...     "Color": ['Blue', 'Red', 'Green'],
...     "Name": ['Mary', "Daniel", "Augustine"],
...     "Price ($)": ['200', "300", "100"],
...     "Discount": [20, 30, np.nan]
... }
>>> select_features(data, dtypes_inc='number', verify_integrity=True)
   Price ($)  Discount
0      200.0      20.0
1      300.0      30.0
2      100.0       NaN

>>> select_features(data, features=['Color', 'Price ($)'])
   Color Price ($)
0   Blue       200
1    Red       300
2  Green       100

>>> select_features(
...     data,
...     regex='^Price|Discount$',
...     transform={'Price ($)': lambda x: x / 100}
... )
   Price ($)  Discount
0        2.0      20.0
1        3.0      30.0
2        1.0       NaN

>>> select_features(
...     data,
...     callable_selector=lambda col: col.startswith('C')
... )
   Color
0   Blue
1    Red
2  Green

Notes

This function is particularly useful in data preprocessing pipelines where the presence of certain features is critical for later analysis or modeling steps. When using regex patterns, ensure that the pattern accurately reflects the intended column names to avoid unintended matches. The callable provided to callable_selector should accept a single column-name string and return a boolean. Transformation functions should be designed to handle the data types of the selected columns to avoid runtime errors. Related selection and coercion behavior is documented in [39, 40, 41, 42].
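The advice on transformation functions can be illustrated with a dictionary of per-column callables. This sketch assumes each callable receives a whole column (a pandas Series), which the description above does not confirm. The loop is illustrative, not the library's code.

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [200.0, 300.0, 100.0],
    "Discount": [20.0, 30.0, 10.0],
})

# Keys name the columns to transform; values are the functions to apply.
transform = {"Price": lambda col: col / 100}

result = df.copy()
for name, func in transform.items():
    # Skip keys that do not match a selected column.
    if name in result.columns:
        result[name] = func(result[name])
```

A function such as `col / 100` only works on numeric data, so string-typed columns would need to be converted first (for example with `pd.to_numeric`) to avoid a runtime error.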

geoprior.utils.base_utils.fill_NaN(arr, method='ff')

Fill NaN values in an array-like structure using specified methods. Handles numeric and non-numeric data separately to preserve data integrity.

Parameters:

arr (array-like, pandas.DataFrame, or pandas.Series) – The input data structure containing NaN values to be filled.
method (str, default 'ff') –
The method to use for filling NaN values. Accepted values:
- Forward fill: 'forward', 'ff', 'fwd'
- Backward fill: 'backward', 'bf', 'bwd'
- Both: 'both', 'ffbf', 'fbwf', 'bff', 'full'

Returns:

The input data structure with NaN values filled according to the specified method.

Return type:

array-like, pandas.DataFrame, or pandas.Series

Raises:

ValueError – If the provided fill method is not recognized.

Notes

Mathematically, the function performs:

(4) \[\text{Filled\_array} = \begin{cases} \text{fillNaN(arr, method)} & \text{if arr is numeric} \\ \text{concat(fillNaN(numeric\_parts, method), non\_numeric\_parts)} & \text{otherwise} \end{cases}\]

This ensures that non-numeric data remains unaltered while NaN values in numeric columns are appropriately filled.

The function preserves the original structure of the input array by utilizing array_preserver. Numeric columns are filled using the specified method, while non-numeric columns remain unchanged.
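The split-fill-recombine logic can be sketched with plain pandas. This is an illustration of the idea only: the real function works through array_preserver and fillNaN, which are replaced here by `select_dtypes`, `ffill`, and `concat`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": ["x", np.nan, "y", "z"],
})

# Fill only the numeric part; leave the non-numeric part untouched.
numeric_part = df.select_dtypes(include="number").ffill()
non_numeric_part = df.select_dtypes(exclude="number")

# Recombine and restore the original column order.
filled = pd.concat([numeric_part, non_numeric_part], axis=1)[df.columns]
```

Reindexing by `df.columns` at the end matters when numeric and non-numeric columns are interleaved, because concatenation alone would group them by type.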

Examples

>>> from geoprior.utils.base_utils import fill_NaN
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, np.nan, 4],
...     'B': ['x', np.nan, 'y', 'z']
... })
>>> fill_NaN(df, method='ff')
     A    B
0  1.0    x
1  2.0  NaN
2  2.0    y
3  4.0    z

See also

geoprior.core.array_manager.array_preserver: Preserves and restores array structures.
geoprior.core.array_manager.to_array: Converts input to a pandas-compatible array-like structure.
geoprior.core.checks.is_numeric_dtype: Checks if the array has numeric data types.
geoprior.utils.base_utils.fillNaN: Core function to fill NaN values in numeric data.

geoprior.utils.base_utils.validate_target_in(df, target, error='raise', verbose=0)

Validate and process the target variable, ensuring it is consistent with the features in the DataFrame.

Parameters:

df (pandas.DataFrame) – The DataFrame containing the features and possibly the target column.
target (str or pandas.Series or pandas.DataFrame) – The target variable to validate and process.
error ({'raise', 'warn', 'ignore'}, optional) – Behavior to use when target validation fails. Use 'raise' to raise an exception, 'warn' to continue with a warning, or 'ignore' to skip reporting.
verbose (int, optional) – Verbosity level for logging. Use 0 for no output, 1 for basic information, and 2 for detailed information.

Returns:

target (pandas.Series) – The processed target variable.
df (pandas.DataFrame) – The DataFrame containing the features and target.
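The exact checks performed by validate_target_in are not listed here, so the following is only a hypothetical sketch of what a consistency check between a target and a feature DataFrame might look like. The function name `sketch_validate_target` and every rule in it (name lookup, single-column requirement, length match) are assumptions.

```python
import pandas as pd


def sketch_validate_target(df: pd.DataFrame, target):
    """Hypothetical illustration: resolve `target` to a Series matching `df`."""
    if isinstance(target, str):
        # Assumption: a string names a column that must exist in df.
        if target not in df.columns:
            raise ValueError(f"Target column {target!r} not found in DataFrame.")
        return df[target], df
    if isinstance(target, pd.DataFrame):
        # Assumption: a DataFrame target must hold exactly one column.
        if target.shape[1] != 1:
            raise ValueError("Expected a single-column target DataFrame.")
        target = target.iloc[:, 0]
    if len(target) != len(df):
        raise ValueError("Target length does not match the number of rows in df.")
    return target, df


frame = pd.DataFrame({"x1": [0.1, 0.2, 0.3], "y": [0, 1, 0]})
y_by_name, _ = sketch_validate_target(frame, "y")
y_by_series, _ = sketch_validate_target(frame, pd.Series([1, 1, 0], name="label"))
```

The sketch always raises on failure. The documented error parameter would instead allow such failures to be downgraded to a warning or ignored.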