geoprior.utils.base_utils
Essential utilities for data processing and analysis in FusionLab, offering functions for normalization, interpolation, feature selection, outlier removal, and various data manipulation tasks.
Adapted for FusionLab from the original geoprior.utils.base_utils.
Functions
check_file_exists | Check if a file exists in a package's directory with importlib.resources.
detect_categorical_columns | Detect categorical columns in a dataset by examining column types and user-defined criteria.
Download a remote file.
extract_target | Extracts specified target column(s) from a multidimensional numpy array or pandas DataFrame.
fancier_downloader | Download a remote file with a progress bar and optional size verification.
fillNaN | Fill NaN values in a numpy array, pandas Series, or pandas DataFrame using specified methods for forward filling, backward filling, or both.
fill_NaN | Fill NaN values in an array-like structure using specified methods.
Move file to a directory.
select_features | Selects features from a dataset based on various criteria and returns a new DataFrame.
validate_target_in | Validate and process the target variable, ensuring it is consistent with the features in the DataFrame.
- geoprior.utils.base_utils.detect_categorical_columns(data, integer_as_cat=True, float0_as_cat=True, min_unique_values=None, max_unique_values=None, handle_nan=None, return_frame=False, consider_dt_as=None, verbose=0)
Detect categorical columns in a dataset by examining column types and user-defined criteria. Integer columns, and float columns whose values all end in .0, can be treated as categorical, depending on the settings. User-defined thresholds for the minimum and maximum number of unique values are also supported.
(1) \[\forall x \in X,\; x = \lfloor x \rfloor\]
The equation above states that a float column is treated as categorical only if every value \(x\) equals its integer part. This function leverages the inline methods build_data_if, drop_nan_in, fill_NaN, parameter_validator, and smart_format (excluding those prefixed with _).
- Parameters:
data (DataFrame or array-like) – The input data to analyze. If not a DataFrame, it will be converted internally.
integer_as_cat (bool, optional) – If True, integer-type columns are considered categorical. Default is True.
float0_as_cat (bool, optional) – If True, float columns whose values can be cast to integer without remainder are considered categorical. Default is True.
min_unique_values (int or None, optional) – Minimum number of unique values in a column to qualify as categorical. If None, no minimum check is applied.
max_unique_values (int or 'auto' or None, optional) – Maximum number of unique values allowed for a column to be considered categorical. If 'auto', set the limit to the column’s own unique count. If None, no maximum check is applied.
handle_nan (str or None, optional) – Handling method for missing data. Can be 'drop' to remove rows with NaNs, 'fill' to impute them via forward/backward fill, or None for no change.
return_frame (bool, optional) – If True, returns a DataFrame of detected categorical columns; otherwise returns a list of column names. Default is False.
consider_dt_as (str, optional) – Indicates how to handle or convert datetime columns when ops='validate'. Use None to keep datetime columns as-is. Use 'numeric' for timestamp-style conversion, 'float', 'float32' or 'float64' for float conversion, 'int', 'int32' or 'int64' for integer conversion, and 'object' or 'category' to convert them to Python objects such as strings. If conversion fails, behavior follows the configured error policy.
verbose (int, optional) – Verbosity level. If greater than 0, a summary of detected columns is printed.
- Returns:
Either a list of column names or a DataFrame containing the categorical columns, depending on the value of return_frame.
- Return type:
list or DataFrame
Examples
>>> from geoprior.utils.base_utils import detect_categorical_columns
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': [1.0, 2.0, 3.0],
...     'C': ['cat', 'dog', 'mouse']
... })
>>> detect_categorical_columns(df)
['A', 'B', 'C']
Notes
This function focuses on flexible treatment of integer and float columns. Combined with verbose settings, it can provide detailed feedback. Using 'drop' or 'fill' for handle_nan helps reduce disruptions caused by missing data. The array-programming background is discussed in Harris et al. [36].
Treating integer columns, and float columns whose values end in .0, as categorical is useful when preparing data for machine learning algorithms that expect categorical inputs, such as decision trees or classification models.
This method uses the helper function build_data_if from geoprior.utils.validator to ensure that the input data is a DataFrame. If the input is not a DataFrame, it creates one, giving column names that start with input_name.
See also
build_data_if: Validates and converts input into a DataFrame if needed.
drop_nan_in: Drops NaN values from a DataFrame along axis=0.
fill_NaN: Fills missing data in a DataFrame using forward and backward fill.
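The integer-valued float criterion in equation (1) can be sketched with plain pandas and NumPy. This is a minimal illustration, not the library's implementation: the helper name `is_integer_valued` is hypothetical, and ignoring NaN values before the check is an assumption.

```python
import numpy as np
import pandas as pd


def is_integer_valued(series: pd.Series) -> bool:
    """Hypothetical helper: True if every non-missing x satisfies x == floor(x)."""
    values = series.dropna().to_numpy(dtype=float)
    # Assumption: an all-NaN column gives no evidence, so it is not flagged.
    if values.size == 0:
        return False
    return bool(np.all(values == np.floor(values)))


df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0],  # integer-valued floats
    "D": [1.5, 2.0, 3.0],  # contains a fractional value
})

# Keep only the float columns that pass the check in equation (1).
float_cols = df.select_dtypes(include="float64").columns
candidates = [col for col in float_cols if is_integer_valued(df[col])]
print(candidates)
```

Only column `B` passes here, since `D` contains 1.5.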
- geoprior.utils.base_utils.extract_target(data, target_names, drop=True, columns=None, return_y_X=False)
Extracts specified target column(s) from a multidimensional numpy array or pandas DataFrame, with options to rename columns in a DataFrame and to control whether the extracted columns are dropped from the original data.
- Parameters:
data (Union[np.ndarray, pd.DataFrame]) – The input data from which target columns are to be extracted. Can be a NumPy array or a pandas DataFrame.
target_names (Union[str, int, List[Union[str, int]]]) – The name(s) or integer index/indices of the column(s) to extract. If data is a DataFrame, this can be a mix of column names and indices. If data is a NumPy array, only integer indices are allowed.
drop (bool, default True) – If True, the extracted columns are removed from the original data. If False, the original data remains unchanged.
columns (Optional[List[str]], default None) – If provided and data is a DataFrame, specifies new names for the columns in data. The length of columns must match the number of columns in data. This parameter is ignored if data is a NumPy array.
return_y_X (bool, default False) – If True, returns a tuple (y, X) where X is the data with the target columns removed and y is the target columns. If False, returns only y.
- Returns:
If return_y_X is True, returns a tuple (y, X) where y is the target columns and X is the data with the target columns removed. If return_y_X is False, returns only y.
- Return type:
Union[ArrayLike, pd.Series, pd.DataFrame, Tuple[pd.DataFrame, ArrayLike]]
- Raises:
ValueError – If columns is provided and its length does not match the number of columns in data. If any of the specified target_names do not exist in data. If target_names includes a mix of strings and integers for a NumPy array input.
Examples
>>> from geoprior.utils.base_utils import extract_target
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': [4, 5, 6],
...     'C': [7, 8, 9]
... })
>>> target = extract_target(df, 'B', drop=True, return_y_X=False)
>>> print(target)
0    4
1    5
2    6
Name: B, dtype: int64
>>> target, remaining = extract_target(df, 'B', drop=True, return_y_X=True)
>>> print(target)
0    4
1    5
2    6
Name: B, dtype: int64
>>> print(remaining)
   A  C
0  1  7
1  2  8
2  3  9
>>> arr = np.random.rand(5, 3)
>>> target, modified_arr = extract_target(arr, 2, return_y_X=True)
>>> print(target)
>>> print(modified_arr)
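For orientation, the (y, X) split described above corresponds roughly to the following plain pandas and NumPy operations. This is a sketch of the idea only, under the assumption that extraction is a column selection followed by a drop; it does not call or reproduce extract_target itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# DataFrame case: select the target by name, then drop it from the features.
y_df = df["B"]
X_df = df.drop(columns=["B"])

# NumPy case: columns have no names, so only integer positions make sense.
arr = np.arange(15).reshape(5, 3)
target_idx = 2
y_arr = arr[:, target_idx]
X_arr = np.delete(arr, target_idx, axis=1)

print(list(X_df.columns))
print(y_arr.shape, X_arr.shape)
```

Both `df.drop` and `np.delete` return new objects, so the inputs are left unchanged in this sketch.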
- geoprior.utils.base_utils.fancier_downloader(url, filename, dstpath=None, check_size=False, error='raise', verbose=True)
Download a remote file with a progress bar and optional size verification.
This function downloads a file from the specified url and saves it locally with the given filename. It provides a visual progress bar during the download process and offers an option to verify the downloaded file’s size against the expected size to ensure data integrity. Additionally, the function allows for moving the downloaded file to a specified destination directory.
(2) \[|S_{downloaded} - S_{expected}| < \epsilon\]
where \(S_{downloaded}\) is the size of the downloaded file, \(S_{expected}\) is the size specified by the server, and \(\epsilon\) is a small tolerance value.
- Parameters:
url (str) – The URL from which to download the remote file.
filename (str) – The desired name for the local file. This is the name under which the file will be saved after downloading.
dstpath (Optional[str], default None) – The destination directory path where the downloaded file should be saved. If None, the file is saved in the current working directory.
check_size (bool, default False) – Whether to verify the size of the downloaded file against the expected size obtained from the server. This is useful for ensuring the integrity of the downloaded file. When True, the function checks:
(3) \[|S_{downloaded} - S_{expected}| < \epsilon\]
If the size check fails:
 - If error='raise', an exception is raised.
 - If error='warn', a warning is emitted.
 - If error='ignore', the discrepancy is ignored, and the function continues.
error (str, default 'raise') – Specifies how to handle errors during the size verification process.
 - 'raise': Raises an exception if the file size does not match.
 - 'warn': Emits a warning and continues execution.
 - 'ignore': Silently ignores the size discrepancy and proceeds.
verbose (bool, default True) – Controls the verbosity of the function. If True, the function will print informative messages about the download status, including progress updates and success or failure notifications.
- Returns:
Returns None if dstpath is provided and the file is moved to the destination. Otherwise, returns the local filename as a string.
- Return type:
Optional[str]
- Raises:
RuntimeError – If the download fails and error is set to 'raise'.
ValueError – If an invalid value is provided for the error parameter.
Examples
>>> from geoprior.utils.base_utils import fancier_downloader
>>> url = 'https://example.com/data/file.h5'
>>> local_filename = 'file.h5'
>>> # Download to current directory without size check
>>> fancier_downloader(url, local_filename)
>>>
>>> # Download to a specific directory with size verification
>>> fancier_downloader(
...     url,
...     local_filename,
...     dstpath='/path/to/save/',
...     check_size=True,
...     error='warn',
...     verbose=True
... )
>>>
>>> # Handle size mismatch by raising an exception
>>> fancier_downloader(
...     url,
...     local_filename,
...     check_size=True,
...     error='raise'
... )
Notes
Progress Bar: The function uses the tqdm library to display a progress bar during the download. If tqdm is not installed, it falls back to a basic downloader without a progress bar.
Directory Creation: If the specified dstpath does not exist, the function will attempt to create it to ensure the file is saved correctly.
File Integrity: Enabling check_size helps confirm that the download is complete. It does not perform a checksum verification, so corruption that leaves the file size unchanged goes undetected.
Progress-reporting patterns and surrounding tooling are described in [37, 38].
See also
requests.get: Function to perform HTTP GET requests.
tqdm: A library for creating progress bars.
os.makedirs: Function to create directories.
geoprior.utils.base_utils.check_file_exists: Utility to check file existence.
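The size check in equations (2) and (3), together with the three error policies, can be sketched without any network access. The helper name `verify_size`, its `tol` parameter, and the choice of RuntimeError for a mismatch are assumptions made for illustration; the source only states that "an exception is raised".

```python
import warnings


def verify_size(downloaded: int, expected: int, tol: int = 1,
                error: str = "raise") -> bool:
    """Hypothetical helper: check |downloaded - expected| < tol, then apply the policy.

    With integer byte counts, tol=1 means the sizes must match exactly.
    """
    if error not in {"raise", "warn", "ignore"}:
        raise ValueError(f"Invalid error policy: {error!r}")

    ok = abs(downloaded - expected) < tol
    if ok or error == "ignore":
        return ok

    message = f"Size mismatch: got {downloaded} bytes, expected {expected} bytes."
    if error == "raise":
        # Assumption: the exception type used by the library is not documented.
        raise RuntimeError(message)
    warnings.warn(message)
    return ok


print(verify_size(1024, 1024))
print(verify_size(1000, 1024, error="ignore"))
```

Validating the policy string before doing any comparison means a typo such as `error='warning'` fails immediately rather than only when a mismatch happens to occur.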
- geoprior.utils.base_utils.fillNaN(arr, method='ff')
Fill NaN values in a numpy array, pandas Series, or pandas DataFrame using specified methods for forward filling, backward filling, or both.
- Parameters:
arr (Union[np.ndarray, pd.Series, pd.DataFrame]) – The input data containing NaN values to be filled. This can be a numpy array, pandas Series, or DataFrame expected to contain numeric data.
method (str, optional) – The method used for filling NaN values. Valid options are:
 - 'ff': forward fill (default)
 - 'bf': backward fill
 - 'both': applies both forward and backward fill sequentially
- Returns:
The array with NaN values filled according to the specified method. The return type matches the input type (numpy array, Series, or DataFrame).
- Return type:
Union[np.ndarray, pd.Series, pd.DataFrame]
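The three fill modes can be illustrated with pandas' own `ffill` and `bfill` methods. This is a sketch of the semantics rather than the fillNaN implementation; in particular, applying the forward fill before the backward fill for 'both' is an assumption based on the order in which the description lists them.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()        # 'ff': the leading NaN stays, nothing precedes it
backward = s.bfill()       # 'bf': the trailing NaN stays, nothing follows it
both = s.ffill().bfill()   # 'both': forward fill, then backward fill the rest

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

A single-direction fill cannot fill a gap at the starting edge of that direction, which is why the combined mode is the only one that leaves no NaN in this series.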
- geoprior.utils.base_utils.select_features(data, features=None, dtypes_inc=None, dtypes_exc=None, coerce=False, columns=None, verify_integrity=False, parse_features=False, include_missing=None, exclude_missing=None, transform=None, regex=None, callable_selector=None, inplace=False, **astype_kwargs)
Selects features from a dataset based on various criteria and returns a new DataFrame.
Conceptually, the selected columns are the subset of the input column set that satisfies the requested feature names, data-type filters, regex patterns, callable selectors, and missing-data conditions.
- Parameters:
data (Union[pd.DataFrame, dict, np.ndarray, list]) – The dataset from which to select features. Can be a pandas DataFrame, a dictionary, a NumPy array, or a list of dictionaries/lists.
features (Optional[Union[List[str], Pattern, Callable[[str], bool]]], default None) – Specific feature names to select. Can also be a regex pattern or a callable that takes a column name and returns True if the column should be selected.
dtypes_inc (Optional[Union[str, List[str]]], default None) – The data type(s) to include in the selection. Possible values are the same as for the pandas include parameter in select_dtypes.
dtypes_exc (Optional[Union[str, List[str]]], default None) – The data type(s) to exclude from the selection. Possible values are the same as for the pandas exclude parameter in select_dtypes.
coerce (bool, default False) – If True, numeric columns are coerced to the appropriate types without selection, ignoring the features, dtypes_inc, and dtypes_exc parameters.
columns (Optional[List[str]], default None) – Column names to use if data is a NumPy array or a list without column names.
verify_integrity (bool, default False) – Verifies the data type integrity and converts data to the correct types if necessary.
parse_features (bool, default False) – Parses string features and converts them to an iterable object (e.g., lists).
include_missing (Optional[bool], default None) – If True, includes only columns with missing values. If False, excludes columns with missing values.
exclude_missing (Optional[bool], default None) – If True, excludes columns with any missing values.
transform (Optional, default None) – Function or dictionary of functions to apply to the selected columns. If a dictionary is provided, keys should correspond to column names.
regex (Optional[Union[str, Pattern]], default None) – Regular expression pattern to select columns.
callable_selector (Optional[Callable[[str], bool]], default None) – A callable that takes a column name and returns True if the column should be selected.
inplace (bool, default False) – If True, modifies the data in place. Otherwise, returns a new DataFrame.
**astype_kwargs (Any) – Additional keyword arguments for pandas.DataFrame.astype.
- Returns:
A new DataFrame with the selected features.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no columns match the selection criteria and coerce is False.
TypeError – If regex is not a string or compiled regex pattern. If callable_selector is not a callable. If transform is not a callable or a dictionary of callables. If provided parameters are of incorrect types.
Examples
>>> from geoprior.utils.base_utils import select_features
>>> import pandas as pd
>>> import re
>>> import numpy as np
>>> data = {
...     "Color": ['Blue', 'Red', 'Green'],
...     "Name": ['Mary', "Daniel", "Augustine"],
...     "Price ($)": ['200', "300", "100"],
...     "Discount": [20, 30, np.nan]
... }
>>> select_features(data, dtypes_inc='number', verify_integrity=True)
   Price ($)  Discount
0      200.0      20.0
1      300.0      30.0
2      100.0       NaN

>>> select_features(data, features=['Color', 'Price ($)'])
   Color Price ($)
0   Blue       200
1    Red       300
2  Green       100

>>> select_features(
...     data,
...     regex='^Price|Discount$',
...     transform={'Price ($)': lambda x: x / 100}
... )
   Price ($)  Discount
0        2.0        20
1        3.0        30
2        1.0       NaN

>>> select_features(
...     data,
...     callable_selector=lambda col: col.startswith('C')
... )
   Color
0   Blue
1    Red
2  Green
Notes
This function is particularly useful in data preprocessing pipelines where the presence of certain features is critical for later analysis or modeling steps. When using regex patterns, ensure that the pattern accurately reflects the intended column names to avoid unintended matches. The callable provided to callable_selector should accept a single column-name string and return a boolean. Transformation functions should be designed to handle the data types of the selected columns to avoid runtime errors. Related selection and coercion behavior is documented in [39, 40, 41, 42].
See also
validate_feature, pandas.DataFrame.select_dtypes, pandas.DataFrame.astype
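The individual selection criteria can be reproduced with plain pandas to see which columns each one picks. This sketch does not call select_features; combining criteria by intersection in the last step is an assumption drawn from the conceptual description, not a documented rule.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Blue", "Red", "Green"],
    "Name": ["Mary", "Daniel", "Augustine"],
    "Price ($)": [200.0, 300.0, 100.0],
    "Discount": [20.0, 30.0, np.nan],
})

# Regex criterion: names starting with "Price" or ending with "Discount".
pattern = re.compile(r"^Price|Discount$")
by_regex = [c for c in df.columns if pattern.search(c)]

# Callable criterion: any function from column name to bool.
by_callable = [c for c in df.columns if c.startswith("C")]

# Dtype criterion, using the same vocabulary as DataFrame.select_dtypes.
numeric = list(df.select_dtypes(include="number").columns)

# Missing-data criterion: columns without any NaN.
no_missing = [c for c in df.columns if not df[c].isna().any()]

# Assumed combination rule: a column must satisfy every requested criterion.
selected = [c for c in df.columns if c in by_regex and c in no_missing]
print(selected)
```

Note that the alternation in `^Price|Discount$` anchors each branch separately, so a column named "Price Discount Extra" would still match through the first branch.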
- geoprior.utils.base_utils.fill_NaN(arr, method='ff')
Fill NaN values in an array-like structure using specified methods. Handles numeric and non-numeric data separately to preserve data integrity.
- Parameters:
arr (array-like, pandas.DataFrame, or pandas.Series) – The input data structure containing NaN values to be filled.
method (str, default 'ff') – The method to use for filling NaN values. Accepted values:
 - Forward fill: 'forward', 'ff', 'fwd'
 - Backward fill: 'backward', 'bf', 'bwd'
 - Both: 'both', 'ffbf', 'fbwf', 'bff', 'full'
- Returns:
The input data structure with NaN values filled according to the specified method.
- Return type:
array-like, pandas.DataFrame, or pandas.Series
- Raises:
ValueError – If the provided fill method is not recognized.
Notes
Mathematically, the function performs:
(4) \[\text{Filled\_array} = \begin{cases} \text{fillNaN(arr, method)} & \text{if arr is numeric} \\ \text{concat(fillNaN(numeric\_parts, method), non\_numeric\_parts)} & \text{otherwise} \end{cases}\]
This ensures that non-numeric data remains unaltered while NaN values in numeric columns are appropriately filled.
The function preserves the original structure of the input array by utilizing array_preserver. Numeric columns are filled using the specified method, while non-numeric columns remain unchanged.
Examples
>>> from geoprior.utils.base_utils import fill_NaN
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, np.nan, 4],
...     'B': ['x', np.nan, 'y', 'z']
... })
>>> fill_NaN(df, method='ff')
     A  B
0  1.0  x
1  2.0  x
2  2.0  y
3  4.0  z
See also
geoprior.core.array_manager.array_preserver: Preserves and restores array structures.
geoprior.core.array_manager.to_array: Converts input to a pandas-compatible array-like structure.
geoprior.core.checks.is_numeric_dtype: Checks if the array has numeric data types.
geoprior.utils.base_utils.fillNaN: Core function to fill NaN values in numeric data.
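The "otherwise" branch of equation (4) can be sketched for a mixed-type DataFrame using plain pandas. This follows the equation and the Notes text literally (numeric parts filled, non-numeric parts passed through unchanged) and uses a forward fill as the method; it is an illustration under those assumptions, not the fill_NaN implementation, and it does not use array_preserver.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": ["x", None, "y", "z"],
})

# Split the columns into numeric and non-numeric parts.
numeric_cols = df.select_dtypes(include="number").columns
other_cols = df.columns.difference(numeric_cols)

# Fill only the numeric part, then concatenate and restore the column order.
filled_numeric = df[numeric_cols].ffill()
result = pd.concat([filled_numeric, df[other_cols]], axis=1)[df.columns]

print(result["A"].tolist())
print(result["B"].isna().tolist())
```

Reindexing with `df.columns` at the end matters because `Index.difference` returns a sorted index, which would otherwise reorder the columns.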
- geoprior.utils.base_utils.validate_target_in(df, target, error='raise', verbose=0)
Validate and process the target variable, ensuring it is consistent with the features in the DataFrame.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing the features and possibly the target column.
target (str or pandas.Series or pandas.DataFrame) – The target variable to validate and process.
error ({'raise', 'warn', 'ignore'}, optional) – Behavior to use when target validation fails. Use 'raise' to raise an exception, 'warn' to continue with a warning, or 'ignore' to skip reporting.
verbose (int, optional) – Verbosity level for logging. Use 0 for no output, 1 for basic information, and 2 for detailed information.
- Returns:
target (pandas.Series) – The processed target variable.
df (pandas.DataFrame) – The DataFrame containing the features and target.