geoprior.utils.validator

Provides a comprehensive set of functions and warnings for validating and ensuring the integrity of data. This includes utilities for checking data consistency, validating machine learning targets, ensuring proper data types, and handling various validation scenarios.

Functions

array_to_frame(X, *[, to_frame, columns, ...])

Validates and optionally converts an array-like object to a pandas DataFrame, applying specified column names if provided or generating them if the force parameter is set.

array_to_frame2(X, *[, to_frame, columns, ...])

Extension of is_frame dedicated to validating the reconversion of X and y back to frames.

assert_all_finite(X, *[, allow_nan, ...])

Throw a ValueError if X contains NaN or infinity.

assert_xy_in(x, y, *[, data, asarray, ...])

Assert the name of x and y in the given data.

build_data_if(data[, columns, to_frame, ...])

Validates and converts data into a pandas DataFrame if requested, optionally enforcing consistent column naming.

build_series_if(*arr[, series_names, ...])

Constructs one or more pandas Series from the provided input arrays.

check_X_y(X, y[, accept_sparse, ...])

Input validation for standard estimators.

check_array(array, *[, accept_large_sparse, ...])

Input validation on an array, list, or similar.

check_classification_targets(*y[, ...])

Validate that the target arrays are suitable for classification tasks.

check_consistency_size(*arrays)

Check that the arrays have a consistent size and raise an error otherwise.

check_consistent_length(*arrays)

Check that all arrays have consistent first dimensions.

check_donut_inputs([values, data, labels, ...])

Validate and/or build inputs for donut chart plotting.

check_epsilon(eps[, y_true, y_pred, ...])

Dynamically determine or validate an epsilon value for numerical computations.

check_has_run_method(estimator[, msg, ...])

Check if the given estimator has a callable run method or any other specified method.

check_is_fitted(estimator[, attributes, ...])

Perform is_fitted validation for estimator.

check_is_fitted2(estimator, attributes, *[, msg])

Perform is_fitted validation for estimator.

check_is_runned(estimator[, attributes, ...])

Validate if an estimator instance has been "runned" (executed) prior to invoking dependent methods.

check_memory(memory)

Check that memory is joblib.Memory-like.

check_mixed_data_types(data)

Checks if the given data (DataFrame or numpy array) contains both numerical and categorical columns.

check_random_state(seed)

Turn seed into a np.random.RandomState instance.

check_scalar(x, name, target_type, *[, ...])

Validate scalar parameters type and value.

check_symmetric(array, *[, tol, ...])

Make sure that array is 2D, square and symmetric.

check_y(y[, multi_output, y_numeric, ...])

Validates the target array y, ensuring it is suitable for classification or regression tasks based on its content and the specified strategy.

contains_nested_objects(lst[, strict, ...])

Determines whether a list contains nested objects.

convert_array_to_pandas(X, *[, to_frame, ...])

Converts an array-like object to a pandas DataFrame or Series, applying provided column names or series name.

convert_to_numeric(value[, ...])

Helper function to convert values to float.

ensure_2d(X[, output_format])

Ensure that the input X is converted to a 2-dimensional structure.

ensure_non_negative(*arrays[, err_msg])

Ensure that provided arrays contain only non-negative values.

filter_valid_kwargs(callable_obj, kwargs)

Filter and return only the valid keyword arguments for a given callable object.

get_estimator_name(estimator)

Get the estimator name, whether or not it is an instantiated object.

handle_zero_division(y_true[, ...])

Preprocess input arrays to handle cases where zero could cause division errors in subsequent metric computations.

has_fit_parameter(estimator, parameter)

Check whether the estimator's fit method supports the given parameter.

has_methods(models, methods[, strict, ...])

Validate that one or more model objects implement required methods.

has_required_attributes(model, attributes)

Check if the model has all required Keras-specific attributes.

is_array_like(obj[, numpy_check])

Check if an object is array-like.

is_binary_class(y[, accept_multioutput])

Check whether the target array represents binary classification.

is_categorical(data, column[, strict, error])

Checks if a specified column in a DataFrame or Series is of a categorical type.

is_frame(arr[, df_only, raise_exception, ...])

Check if arr is a pandas DataFrame or Series.

is_installed(module)

Checks if TensorFlow is installed.

is_normalized(arr[, method])

Checks if the provided array is normalized according to the specified method.

is_square_matrix(data[, data_type])

Determine whether the input, either a DataFrame or an array-like structure, forms a square matrix.

is_time_series(data, time_col[, ...])

Check if the provided DataFrame is time series data.

is_valid_policies(nan_policy[, allowed_policies])

Validates the nan_policy or any policy argument to ensure it is one of the acceptable options (allowed_policies).

normalize_array(arr[, normalize, method])

Checks if an array is normalized according to the specified method and normalizes it if required based on the 'normalize' parameter.

parameter_validator(param_name, target_strs)

Creates a validator function for ensuring a parameter's value matches one of the allowed target strings, optionally applying normalization.

process_y_pairs(*ys[, error, solo_return, ops])

Process and validate paired arrays of ground truth (y_true) and predicted values (y_pred) for machine learning evaluation.

recheck_data_types(data[, coerce_numeric, ...])

Rechecks and coerces column data types in a DataFrame to the most appropriate numeric or datetime types if initially identified as objects.

set_array_back(X, *[, to_frame, columns, ...])

Set the array back to a frame: reconvert the NumPy array to a pandas Series or DataFrame.

to_dtype_str(arr[, return_values])

Convert numeric or object dtype to string dtype.

validate_and_adjust_ranges(**kwargs)

Validates and adjusts the provided range tuples to ensure each is composed of two numerical values and is sorted in ascending order.

validate_batch_size(batch_size, n_samples[, ...])

Validate the batch size against the number of samples.

validate_comparison_data(df[, alignment])

Validates a DataFrame to ensure it is a square matrix and that the index and column names match.

validate_data_types(data[, expected_type, ...])

Checks for mixed data types in a pandas Series or DataFrame and handles according to the specified policies.

validate_dates(start_date, end_date[, ...])

Validates and parses start and end years/dates, with options for output formatting.

validate_distribution(distribution[, ...])

Validates or generates distributions for given elements, ensuring the sum equals 1 if check_normalization is True.

validate_dtype_selector(dtype_selector)

Validates and categorizes the dtype_selector using regex, including handling cases where 'only' is specifically included.

validate_estimator_methods(estimator, methods)

Validate that the specified methods exist and are callable on the given estimator.

validate_fit_weights(y[, sample_weight, ...])

Validate and compute sample weights for fitting.

validate_length_range(length_range[, ...])

Validates the review length range ensuring it's a tuple with two integers where the first value is less than the second.

validate_multiclass_target(y[, ...])

Validates that the target data is suitable for multiclass classification.

validate_multioutput(value[, extra])

Validate the multioutput parameter value and handle special cases.

validate_nan_policy(nan_policy, *arrays[, ...])

Validates and applies a specified nan_policy to input arrays and optionally to sample weights.

validate_numeric(value[, convert_to, ...])

Validates if a given value is numeric.

validate_performance_data([...])

Validates and preprocesses model performance data to ensure it conforms to the necessary structure and constraints for statistical and machine learning analysis.

validate_positive_integer(value, variable_name)

Validates whether the given value is a positive integer or zero based on the parameter and rounds float values according to the specified method.

validate_sample_weights(weights, y[, normalize])

Validates that the sample weights are suitable for use in calculations.

validate_scores(scores[, true_labels, mode, ...])

Validates that the scores represent valid probability distributions and checks consistency between scores and true labels in multi-output scenarios.

validate_sequences(sequences[, n_features, ...])

Validate and reshape sequences input for a neural network.

validate_sets(data[, mode, allow_empty, ...])

Validates whether the input data is a set in 'base' mode or a dictionary of sets in 'deep' mode.

validate_square_matrix(data[, align, ...])

Validate that the input data forms a square matrix and optionally aligns its indices and columns if specified.

validate_strategy([strategy, error, ops, ...])

Validate and construct a strategy dictionary for imputing missing data.

validate_weights(weights[, min_value, ...])

Validates and optionally normalizes the given weights array to ensure all elements meet specified criteria and the structure is suitable for computations.

validate_yy(y_true, y_pred[, expected_type, ...])

Validates the shapes and types of actual and predicted target arrays, ensuring they are compatible for further analysis or metrics calculation.

Exceptions

DataConversionWarning

Warning used to notify implicit data conversions happening in the code.

PositiveSpectrumWarning

Warning raised when the eigenvalues of a PSD matrix have issues.

exception geoprior.utils.validator.DataConversionWarning[source]#

Bases: UserWarning

Warning used to notify implicit data conversions happening in the code.

This warning occurs when some input data needs to be converted or interpreted in a way that may not match the user’s expectations. For example, this warning may occur when the user:

  • passes an integer array to a function that expects float input and will convert the input;

  • requests a non-copying operation, but a copy is required to meet the implementation’s data-type expectations;

  • passes an input whose shape can be interpreted ambiguously.

Changed in version 0.18: Moved from sklearn.utils.validation.
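Because the warning derives from UserWarning, the standard warnings machinery can record it or escalate it to an error. The sketch below is illustrative only: it defines a local stand-in class with the same name and a hypothetical converter, to_float_list, rather than importing anything from geoprior.

```python
import warnings


class DataConversionWarning(UserWarning):
    """Local stand-in for geoprior.utils.validator.DataConversionWarning."""


def to_float_list(values):
    """Hypothetical converter that warns when integer input is cast to float."""
    if any(isinstance(v, int) and not isinstance(v, bool) for v in values):
        warnings.warn(
            "Integer input was converted to float.",
            DataConversionWarning,
            stacklevel=2,
        )
    return [float(v) for v in values]


# Record the warning instead of letting it print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    converted = to_float_list([1, 2, 3])

# Escalate the warning to an exception to locate the offending call.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=DataConversionWarning)
    try:
        to_float_list([4, 5])
        escalated = False
    except DataConversionWarning:
        escalated = True
```

The same two patterns should apply to the real class once it is imported from geoprior.utils.validator, since the filters match on the warning category.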

exception geoprior.utils.validator.PositiveSpectrumWarning[source]#

Bases: UserWarning

Warning raised when the eigenvalues of a PSD matrix have issues.

This warning is typically raised by _check_psd_eigenvalues when the eigenvalues of a positive semidefinite (PSD) matrix such as a gram matrix (kernel) present significant negative eigenvalues, or bad conditioning i.e. very small non-zero eigenvalues compared to the largest eigenvalue.

Added in version 0.22.

geoprior.utils.validator.array_to_frame(X, *, to_frame=False, columns=None, raise_exception=False, raise_warning=True, input_name='', force=False)[source]#

Validates and optionally converts an array-like object to a pandas DataFrame, applying specified column names if provided or generating them if the force parameter is set.

Parameters:
  • X (array-like) – The array to potentially convert to a DataFrame.

  • columns (str or list of str, optional) – The names for the resulting DataFrame columns or the Series name.

  • to_frame (bool, default False) – If True, converts X to a DataFrame if it isn’t already one.

  • input_name (str, default '') – The name of the input variable, used for error and warning messages.

  • raise_warning (bool, default True) – If True and to_frame is True but columns are not provided, a warning is issued unless force is True.

  • raise_exception (bool, default False) – If True, raises an exception when to_frame is True but columns are not provided and force is False.

  • force (bool, default False) – Forces the conversion of X to a DataFrame by generating column names based on input_name if columns are not provided.

Returns:

The potentially converted DataFrame or Series, or X unchanged.

Return type:

pd.DataFrame or pd.Series

Examples

>>> from geoprior.utils.validator import array_to_frame
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> X = data.data
>>> array_to_frame(X, to_frame=True, columns=['sepal_length', 'sepal_width',
...                                           'petal_length', 'petal_width'])
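The interplay of columns, force, raise_exception and raise_warning described above can be sketched without pandas. The helper resolve_columns below is hypothetical, and the generated naming scheme (input_name followed by the column position) is an assumption; the documentation only states that names are based on input_name.

```python
import warnings


def resolve_columns(n_features, columns=None, *, input_name="", force=False,
                    raise_exception=False, raise_warning=True):
    """Return the column names to use, or None when no conversion happens."""
    if columns is not None:
        names = [columns] if isinstance(columns, str) else list(columns)
        if len(names) != n_features:
            raise ValueError(
                f"Expected {n_features} column names, got {len(names)}."
            )
        return names
    if force:
        # Assumed naming scheme: input_name followed by the column position.
        prefix = input_name or "col"
        return [f"{prefix}{i}" for i in range(n_features)]
    message = "Cannot convert the array to a frame without column names."
    if raise_exception:
        raise ValueError(message)
    if raise_warning:
        warnings.warn(message, UserWarning, stacklevel=2)
    return None  # caller keeps the array unchanged


explicit = resolve_columns(2, ["a", "b"])
generated = resolve_columns(3, input_name="X", force=True)
skipped = resolve_columns(3, raise_warning=False)
```

A None result corresponds to the documented case where X is returned unchanged.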
geoprior.utils.validator.array_to_frame2(X, *, to_frame=False, columns=None, raise_exception=False, raise_warning=True, input_name='', force=False)[source]#

Extension of is_frame dedicated to validating the reconversion of X and y back to frames.

Parameters:
  • X (Array-like) – Array to convert to frame.

  • columns (str or list of str) – Series name or columns names for pandas.Series and DataFrame.

  • to_frame (bool, default False) – If True, reconvert the array to a frame using the columns; otherwise no action is performed and the same array is returned.

  • input_name (str, default "") – The data name used to construct the error message.

  • raise_warning (bool, default True) – If True, issue a warning when conversion is required. If set to ignore, warnings are silenced.

  • raise_exception (bool, default False) – If True then raise an exception if array is not symmetric.

  • force (bool, default False) – Force conversion of the array to a frame if columns is not supplied, generating names from the combination of input_name and the range of X.shape[1].

Returns:

X

Return type:

converted array

Example

>>> from geoprior.datasets import fetch_data
>>> from geoprior.utils.validator import array_to_frame
>>> data = fetch_data('hlogs').frame
>>> array_to_frame(data.k.values,
...                to_frame=True, columns=None, input_name='y',
...                raise_warning="silence")
array([nan, nan, nan, ..., nan, nan, nan])  # mute

geoprior.utils.validator.assert_all_finite(X, *, allow_nan=False, estimator_name=None, input_name='')[source]#

Throw a ValueError if X contains NaN or infinity.

Parameters:
  • X ({ndarray, sparse matrix}) – The input data.

  • allow_nan (bool, default False) – If True, do not throw error when X contains NaN.

  • estimator_name (str, default None) – The estimator name, used to construct the error message.

  • input_name (str, default "") – The data name used to construct the error message. In particular if input_name is “X” and the data has NaN values and allow_nan is False, the error message will link to the imputer documentation.
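A dependency-free sketch of the check follows. It is not the library's implementation: the function name assert_finite_sketch is hypothetical, it works on a flat sequence of floats rather than an ndarray or sparse matrix, and it assumes that infinity is rejected even when allow_nan is True, mirroring the "allow-nan" behaviour documented for check_array.

```python
import math


def assert_finite_sketch(values, *, allow_nan=False, input_name=""):
    """Raise ValueError if values holds infinity, or NaN when not allowed."""
    label = input_name or "Input"
    for value in values:
        if math.isinf(value):
            # Assumption: infinity is never accepted, whatever allow_nan is.
            raise ValueError(f"{label} contains infinity.")
        if math.isnan(value) and not allow_nan:
            raise ValueError(f"{label} contains NaN.")


assert_finite_sketch([1.0, 2.5])                           # passes silently
assert_finite_sketch([1.0, float("nan")], allow_nan=True)  # NaN tolerated
```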

geoprior.utils.validator.assert_xy_in(x, y, *, data=None, asarray=True, to_frame=False, columns=None, xy_numeric=False, ignore=None, **kws)[source]#

Assert the name of x and y in the given data.

Check whether string arguments passed to x and y are valid in the data, then retrieve the x and y array values.

Parameters:
  • x (array-like 1d or str) – One-dimensional arrays. In principle, if data is supplied, they must be series. If x and y are given as string values, data must be supplied. The x and y names must be included in the dataframe; otherwise an error is raised.

  • y (array-like 1d or str) – One-dimensional arrays. In principle, if data is supplied, they must be series. If x and y are given as string values, data must be supplied. The x and y names must be included in the dataframe; otherwise an error is raised.

  • data (pd.DataFrame) – Data containing the x and y names. Must be supplied when x and y are given as string names.

  • asarray (bool, default True) – Return x and y as arrays rather than series.

  • to_frame (bool, default False) – Convert data to a DataFrame using either the column names or, when the keyword parameter force=True, the input names.

  • columns (list of str, optional) – Column names used to transform the array (data) into a DataFrame.

  • xy_numeric (bool, default False) – Convert x and y to numeric values.

  • ignore (str, optional) – Either ‘x’ or ‘y’. If set, that array is ignored and not asserted.

  • kws (dict) – Keyword arguments passed to array_to_frame().

Returns:

x, y – One dimensional array or pd.Series

Return type:

Arraylike

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.validator import assert_xy_in
>>> x, y = np.random.rand(7 ), np.arange (7 )
>>> data = pd.DataFrame ({'x': x, 'y':y} )
>>> assert_xy_in (x='x', y='y', data = data )
(array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
        0.15599452, 0.05808361]),
 array([0, 1, 2, 3, 4, 5, 6]))
>>> assert_xy_in (x=x, y=y)
(array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
        0.15599452, 0.05808361]),
 array([0, 1, 2, 3, 4, 5, 6]))
>>> assert_xy_in (x=x, y=data.y) # y is a series
(array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
        0.15599452, 0.05808361]),
 array([0, 1, 2, 3, 4, 5, 6]))
>>> assert_xy_in (x=x, y=data.y, asarray =False ) # return y like it was
(array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
        0.15599452, 0.05808361]),
0    0
1    1
2    2
3    3
4    4
5    5
6    6
Name: y, dtype: int32)
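The name-resolution rule (string arguments are looked up in data, arrays pass through) can be sketched as follows. This is an illustration under stated assumptions: resolve_xy is a hypothetical helper, a plain dict of column lists stands in for the DataFrame, and the exception types are chosen for the sketch rather than taken from the library.

```python
def resolve_xy(x, y, data=None, ignore=None):
    """Look up string x/y in a column mapping; pass arrays through as lists."""

    def pick(value, role):
        if role == ignore:
            return value  # ignored argument: returned as given, unchecked
        if isinstance(value, str):
            if data is None:
                raise TypeError(f"{role}={value!r} is a name, so data is required.")
            if value not in data:
                raise ValueError(f"{role} name {value!r} not found in data.")
            return list(data[value])
        return list(value)

    return pick(x, "x"), pick(y, "y")


table = {"x": [0.1, 0.2, 0.3], "y": [0, 1, 2]}  # stand-in for a DataFrame
by_name = resolve_xy("x", "y", data=table)
by_value = resolve_xy([1, 2], (3, 4))
```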
geoprior.utils.validator.build_data_if(data, columns=None, to_frame=True, input_name='data', col_prefix='col_', force=False, error='warn', coerce_datetime=False, coerce_numeric=True, start_incr_at=0, **kw)[source]#

Validates and converts data into a pandas DataFrame if requested, optionally enforcing consistent column naming. Intended to standardize data structures for downstream analysis.

See geoprior.utils.data_utils.build_df() for the full documentation.

geoprior.utils.validator.check_X_y(X, y, accept_sparse=False, *, accept_large_sparse=True, dtype='numeric', order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, multi_output=False, ensure_min_samples=1, ensure_min_features=1, y_numeric=False, estimator=None, to_frame=False)[source]#

Input validation for standard estimators.

Checks X and y for consistent length, enforces X to be 2D and y 1D. By default, X is checked to be non-empty and containing only finite values. Standard input checks are also applied to y, such as checking that y does not have np.nan or np.inf targets. For multi-label y, set multi_output=True to allow 2D and sparse y. If the dtype of X is object, attempt converting to float, raising on failure.

Parameters:
  • X ({ndarray, list, sparse matrix}) – Input data.

  • y ({ndarray, list, sparse matrix}) – Labels.

  • accept_sparse (str, bool or list of str, default False) – String[s] representing allowed sparse matrix formats, such as ‘csc’, ‘csr’, etc. If the input is sparse but not in the allowed format, it will be converted to the first listed format. True allows the input to be any format. False means that a sparse matrix input will raise an error.

  • accept_large_sparse (bool, default True) – If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by accept_sparse, accept_large_sparse=False will cause it to be accepted only if its indices are stored with a 32-bit dtype.

  • dtype ('numeric', type, list of type or None, default 'numeric') – Data type of result. If None, the dtype of the input is preserved. If “numeric”, dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

  • order ({'F', 'C'}, default None) – Whether an array will be forced to be fortran or c-style.

  • copy (bool, default False) – Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

  • force_all_finite (bool or 'allow-nan', default True) – Whether to raise an error on np.inf, np.nan, pd.NA in X. This parameter does not influence whether y can have np.inf, np.nan, pd.NA values. Use True to require all values of X to be finite, False to allow np.inf, np.nan, and pd.NA, or "allow-nan" to allow only np.nan and pd.NA while still rejecting infinite values. pd.NA is accepted and converted into np.nan.

  • ensure_2d (bool, default True) – Whether to raise a value error if X is not 2D.

  • allow_nd (bool, default False) – Whether to allow X.ndim > 2.

  • multi_output (bool, default False) – Whether to allow 2D y (array or sparse matrix). If false, y will be validated as a vector. y cannot have np.nan or np.inf values if multi_output=True.

  • ensure_min_samples (int, default 1) – Make sure that X has a minimum number of samples in its first axis (rows for a 2D array).

  • ensure_min_features (int, default 1) – Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when X has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.

  • y_numeric (bool, default False) – Whether to ensure that y has a numeric type. If dtype of y is object, it is converted to float64. Should only be used for regression algorithms.

  • estimator (str or estimator instance, default None) – If passed, include the name of the estimator in warning messages.

Returns:

  • X_converted (object) – The converted and validated X.

  • y_converted (object) – The converted and validated y.
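The shape and length rules above (2D X, 1D y, minimum samples and features, consistent length) can be illustrated with plain lists. The function check_X_y_sketch is a hypothetical, simplified stand-in: it ignores sparse input, dtype conversion and finiteness, and does not reproduce the library's error messages.

```python
def check_X_y_sketch(X, y, *, ensure_min_samples=1, ensure_min_features=1):
    """Validate a list-of-rows X against a flat list y and return copies."""
    X = list(X)
    y = list(y)
    if not all(isinstance(row, (list, tuple)) for row in X):
        raise ValueError("X must be 2D: a sequence of rows.")
    if any(isinstance(target, (list, tuple)) for target in y):
        raise ValueError("y must be 1D unless multi-output is allowed.")
    if len(X) < ensure_min_samples:
        raise ValueError(
            f"Found {len(X)} sample(s); at least {ensure_min_samples} required."
        )
    widths = {len(row) for row in X}
    if len(widths) > 1:
        raise ValueError("X must be 2D: rows have different lengths.")
    n_features = widths.pop() if widths else 0
    if n_features < ensure_min_features:  # a minimum of 0 disables the check
        raise ValueError(
            f"Found {n_features} feature(s); at least {ensure_min_features} required."
        )
    if len(X) != len(y):
        raise ValueError(f"Inconsistent lengths: X has {len(X)}, y has {len(y)}.")
    return [list(row) for row in X], y


X_ok, y_ok = check_X_y_sketch([[1, 2], [3, 4]], [0, 1])
```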

geoprior.utils.validator.check_array(array, *, accept_large_sparse=True, dtype='numeric', accept_sparse=False, order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, estimator=None, input_name='', to_frame=True)[source]#

Input validation on an array, list, or similar.

By default, the input is checked to be a non-empty 2D array containing only finite values. If the dtype of the array is object, attempt converting to float, raising on failure.

Parameters:
  • array (object) – Input object to check / convert.

  • accept_sparse (str, bool or list/tuple of str, default False) – String[s] representing allowed sparse matrix formats, such as ‘csc’, ‘csr’, etc. If the input is sparse but not in the allowed format, it will be converted to the first listed format. True allows the input to be any format. False means that a sparse matrix input will raise an error.

  • accept_large_sparse (bool, default True) – If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by accept_sparse, accept_large_sparse=False will cause it to be accepted only if its indices are stored with a 32-bit dtype.

  • dtype ('numeric', type, list of type or None, default 'numeric') – Data type of result. If None, the dtype of the input is preserved. If “numeric”, dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

  • order ({'F', 'C'} or None, default None) – Whether an array will be forced to be fortran or c-style. When order is None (default), then if copy=False, nothing is ensured about the memory layout of the output array; otherwise (copy=True) the memory layout of the returned array is kept as close as possible to the original array.

  • copy (bool, default False) – Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

  • force_all_finite (bool or 'allow-nan', default True) – Whether to raise an error on np.inf, np.nan, or pd.NA in array. Use True to require all values to be finite, False to allow np.inf, np.nan, and pd.NA, or "allow-nan" to allow only np.nan and pd.NA while still rejecting infinite values. pd.NA is converted into np.nan.

  • ensure_2d (bool, default True) – Whether to raise a value error if array is not 2D.

  • ensure_min_samples (int, default 1) – Make sure that the array has a minimum number of samples in its first axis (rows for a 2D array). Setting to 0 disables this check.

  • ensure_min_features (int, default 1) – Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when the input data has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.

  • estimator (str or estimator instance, default None) – If passed, include the name of the estimator in warning messages.

  • input_name (str, default "") – The data name used to construct the error message. In particular if input_name is “X” and the data has NaN values and allow_nan is False, the error message will link to the imputer documentation.

  • to_frame (bool, default True) – Reconvert array back to pd.Series or pd.DataFrame if the original array is pd.Series or pd.DataFrame.

Returns:

array_converted – The converted and validated array.

Return type:

object
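The accept_sparse and dtype rules are easy to misread, so the sketch below restates them as two small pure-Python functions. Both helpers are hypothetical, formats and dtypes are represented by their string names, and "float64" is assumed as the float type used when an object array is converted under dtype="numeric".

```python
def resolve_sparse_format(fmt, accept_sparse):
    """Return the sparse format the input would end up in."""
    if accept_sparse is False:
        raise TypeError("Sparse input is not accepted.")
    if accept_sparse is True:
        return fmt  # any format is allowed
    allowed = [accept_sparse] if isinstance(accept_sparse, str) else list(accept_sparse)
    if not allowed:
        raise ValueError("accept_sparse must list at least one format.")
    # Keep an allowed format; otherwise convert to the first listed one.
    return fmt if fmt in allowed else allowed[0]


def resolve_dtype(input_dtype, dtype="numeric"):
    """Return the dtype name of the validated result."""
    if dtype is None:
        return input_dtype
    if isinstance(dtype, (list, tuple)):
        # Convert to the first type only if the input dtype is not listed.
        return input_dtype if input_dtype in dtype else dtype[0]
    if dtype == "numeric":
        return "float64" if input_dtype == "object" else input_dtype
    return dtype


converted_format = resolve_sparse_format("coo", ["csr", "csc"])
kept_dtype = resolve_dtype("float32", ["float64", "float32"])
```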

geoprior.utils.validator.check_classification_targets(*y, target_type='numeric', strategy='auto', verbose=False)[source]#

Validate that the target arrays are suitable for classification tasks.

This function is designed to ensure that target arrays (y) contain only finite, categorical values, and it raises a ValueError if the targets do not meet the criteria necessary for classification tasks, such as the presence of continuous values, NaNs, or infinite values.

This validation is crucial for preprocessing steps in machine learning pipelines to ensure that the data is appropriate for classification algorithms.

Parameters:
  • *y (array-like) – One or more target arrays to be validated. The input can be in the form of lists, numpy arrays, or pandas series. Each array is checked individually to ensure it meets the criteria for classification targets.

  • target_type (str, optional) – The expected data type of the target arrays. Supported values are ‘numeric’ and ‘object’. If ‘numeric’, the function attempts to convert the target arrays to integers, raising an error if conversion is not possible due to non-numeric values. If ‘object’, the target arrays are left as numpy arrays of dtype object, suitable for categorical classification without conversion. Default is ‘numeric’.

  • strategy (str, optional) –

    Defines the approach for evaluating if the target arrays are suitable for classification based on their unique values and data types. The ‘auto’ strategy uses heuristic or automatic detection to decide whether target data should be treated as categorical, which is useful for most cases. Custom strategies can be defined to enforce specific validation rules or preprocessing steps based on the nature of the target data (e.g., ‘continuous’, ‘multilabel-indicator’, ‘unknown’). These custom strategies should align with the outcomes of a predefined type_of_target function, allowing for nuanced handling of different target data scenarios. The default value is 'auto', which applies general rules for categorization and numeric conversion where applicable.

    If a strategy other than 'auto' is specified, it directly influences how the data is validated and potentially converted, based on the expected or detected type of target data:

    • If ‘continuous’, the function checks if the data can be used for regression tasks and raises an error for classification use without explicit binning.

    • If ‘multilabel-indicator’, it validates the data for multilabel classification tasks and ensures appropriate format.

    • If ‘unknown’, it attempts to validate the data with generic checks, raising errors for any unclear or unsupported data formats.

  • verbose (bool, optional) – If set to True, the function prints a message for each target array checked, confirming that it is suitable for classification. This is helpful for debugging and when validating multiple target arrays simultaneously.

Raises:

ValueError – If any of the target arrays contain values unsuitable for classification. This includes arrays with continuous values, NaNs, infinite values, or arrays that do not represent categorical data properly.

Examples

Using the function with a single array of integer labels:

>>> import numpy as np
>>> from geoprior.utils.validator import check_classification_targets
>>> y = [1, 2, 3, 2, 1]
>>> check_classification_targets(y)
[array([1, 2, 3, 2, 1], dtype=object)]

Using the function with multiple arrays, including a mix of integer and string labels:

>>> y1 = [0, 1, 0, 1]
>>> y2 = ["spam", "ham", "spam", "ham"]
>>> check_classification_targets(y1, y2, verbose=True)
Targets are suitable for classification.
Targets are suitable for classification.
[array([0, 1, 0, 1], dtype=object), array(['spam', 'ham', 'spam', 'ham'], dtype=object)]

Attempting to use the function with an array containing NaN values:

>>> y_with_nan = [1, np.nan, 2, 1]
>>> check_classification_targets(y_with_nan)
ValueError: Target values contain NaN or infinite numbers, which are not
suitable for classification.

Attempting to use the function with a continuous target array:

>>> y_continuous = np.linspace(0, 1, 10)
>>> check_classification_targets(y_continuous)
ValueError: The number of unique values is too high for a classification task.

Validating and converting a mixed-type target array to numeric:

>>> y_mixed = [1, '2', 3.0, '4', 5]
>>> check_classification_targets(y_mixed, target_type='numeric')
ValueError: Target array at index 0 contains non-numeric values, which
cannot be converted to integers: ['2', '4']...

Validating object target arrays without attempting conversion:

>>> y_str = ["apple", "banana", "cherry"]
>>> check_classification_targets(y_str, target_type='object')
[array(['apple', 'banana', 'cherry'], dtype=object)]
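The checks illustrated above can be approximated in a few lines of plain Python. This is a simplified sketch, not the library's logic: to_class_labels is a hypothetical name, and a "no fractional values" rule is used here in place of the unique-value heuristic, whose threshold is not documented on this page.

```python
import math


def to_class_labels(y, target_type="numeric"):
    """Return validated class labels, converting to int for 'numeric'."""
    labels = list(y)
    for value in labels:
        if isinstance(value, float) and not math.isfinite(value):
            raise ValueError("Target values contain NaN or infinite numbers.")
    if target_type == "object":
        return labels  # left as-is, no conversion attempted
    if target_type != "numeric":
        raise ValueError("target_type must be 'numeric' or 'object'.")
    non_numeric = [v for v in labels if not isinstance(v, (int, float))]
    if non_numeric:
        raise ValueError(f"Non-numeric values cannot be converted: {non_numeric}")
    # Assumed stand-in for the continuous-target heuristic.
    if any(float(v) != int(v) for v in labels):
        raise ValueError("Continuous targets are not suitable for classification.")
    return [int(v) for v in labels]


numeric_labels = to_class_labels([1, 2.0, 3, 2, 1])
string_labels = to_class_labels(["apple", "banana"], target_type="object")
```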
geoprior.utils.validator.check_consistency_size(*arrays)[source]#

Check that the arrays have a consistent size and raise an error otherwise.

geoprior.utils.validator.check_consistent_length(*arrays)[source]#

Check that all arrays have consistent first dimensions.

Checks whether all objects in arrays have the same shape or length.

Parameters:

*arrays (list or tuple of input objects.) – Objects that will be checked for consistent length.

geoprior.utils.validator.check_donut_inputs(values=None, data=None, labels=None, ops='check', labels_as_index=True, index=None, origin_index='drop', value_name='auto')[source]#

Validate and/or build inputs for donut chart plotting.

This function accepts inputs in various forms and either returns a validated pair of numeric values and labels or builds a new \(n \times 1\) DataFrame from them. The function supports two modes:

  • In ops="check", it returns a tuple (values, labels) after validating that the numeric values are appropriate for plotting.

  • In ops="build", it returns a pandas DataFrame constructed from the inputs. If labels_as_index is True, the labels become the DataFrame index; otherwise, they form a separate column. If an index is provided, it is used to reset the DataFrame index and the original index is either dropped or kept based on origin_index.

The function also accepts inputs through a DataFrame or Series (data). In such cases, if values is a str, it is interpreted as a column name of the DataFrame. Similarly, if labels is a str, it is used to fetch the label column.

\[S = \{ x_i \}_{i=1}^{n} \quad \text{and} \quad L = \{ l_i \}_{i=1}^{n} \tag{1}\]

where \(S\) denotes the numeric values and \(L\) denotes the corresponding labels.

Parameters:
  • values (array-like or str, optional) – Numeric values for the donut slices. If data is a DataFrame and values is a string ("colname"), then the column "colname" is used. If data is a Series and values is not provided, the series values are used.

  • data (pandas.Series or pandas.DataFrame, optional) – Data source from which to fetch values and labels. If provided, the function extracts the corresponding numeric data. For a DataFrame, if values (or labels) is a string ("colname"), the function fetches the column named "colname".

  • labels (array-like or str, optional) – Labels for the donut slices. If data is provided and labels is a string ("colname"), then the function uses the specified column as labels. If omitted, the function uses the index of the DataFrame or Series.

  • ops ("check" or "build", optional) – Operation mode of the function. In "check" mode, the function returns a tuple (values, labels) after validation. In "build" mode, it returns a new DataFrame built from the inputs. The default is "check".

  • labels_as_index (bool, optional) – If ops="build", this flag determines whether the labels are used as the DataFrame index. If True, the labels become the index; if False, they form a separate column. The default is True.

  • index (array-like or str, optional) – New index to assign in "build" mode. If a string is provided, it must correspond to a column in the DataFrame and that column is used as the new index. If a list is provided, it directly replaces the DataFrame index. To retain the original index, see origin_index.

  • origin_index ("drop" or "keep", optional) – Specifies whether to drop or retain the original index when resetting the DataFrame index. If set to "keep", the original index is saved in a new column named origin_index. The default is "drop".

  • value_name ("auto" or str, optional) – Name to use for the numeric values in the built DataFrame (when ops="build"). If set to "auto" (or None), the default name "Value" is used unless overridden by the source data. Otherwise, the provided string (e.g., "Total") is used as the column name.

Returns:

  • If ops="check", returns a tuple (values, labels) where values is a NumPy array of numeric values and labels is a list of labels.

  • If ops="build", returns a pandas DataFrame constructed from the inputs. If labels_as_index is True, the DataFrame index is set to the provided labels (or the new index if index is specified). Otherwise, the DataFrame contains separate columns for the labels and numeric values.

Return type:

tuple of (ndarray, list) or pandas.DataFrame

Examples

Build inputs from a DataFrame with explicit column names:

>>> from geoprior.utils.validator import check_donut_inputs
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "Sales": [100, 200, 150],
...     "Country": ["USA", "Canada", "Mexico"]
... })
>>> # Build a DataFrame using "Sales" as values and "Country" as index
>>> new_df = check_donut_inputs(
...     values="Sales",
...     data=df,
...     labels="Country",
...     ops="build",
...     labels_as_index=True,
...     index="Country",
...     origin_index="drop"
... )
>>> new_df
        Sales
USA      100
Canada   200
Mexico   150

Check inputs when values and labels are passed directly, without a data source:

>>> values, labs = check_donut_inputs(
...     values=[10, 20, 30],
...     labels=["A", "B", "C"],
...     ops="check"
... )
>>> values
array([10., 20., 30.])
>>> labs
['A', 'B', 'C']

Notes

The function internally calls the inline helper check_numeric_dtype to ensure that the provided numeric data satisfies the necessary type constraints. The function supports grouping or multiple donut charts by using the input DataFrame directly. See also check_numeric_dtype() for numeric type validation.

geoprior.utils.validator.check_epsilon(eps, y_true=None, y_pred=None, base_epsilon=1e-10, scale_factor=1e-05)[source]#

Dynamically determine or validate an epsilon value for numerical computations.

This function either validates a provided epsilon if it is a numeric value, or calculates an appropriate epsilon dynamically based on the input data. The dynamic calculation aims to adjust epsilon based on the scale of the input data, providing flexibility and adaptability in algorithms where numerical stability is critical.

Parameters:
  • eps ({'auto', float}) – The epsilon value to use. If ‘auto’, the function dynamically determines an appropriate epsilon based on y_true and y_pred. If a float, it validates this as the epsilon value.

  • y_true (array-like, optional) – True values array. Used in conjunction with y_pred to dynamically determine epsilon if eps is ‘auto’. If None, this input is ignored.

  • y_pred (array-like, optional) – Predicted values array. Used alongside y_true for epsilon determination. If None, this input is ignored.

  • base_epsilon (float, optional) – Base epsilon value used as a starting point in dynamic determination. This value is adjusted based on the scale_factor and the input data to compute the final epsilon.

  • scale_factor (float, optional) – Scaling factor applied to adjust the base epsilon in relation to the scale of the input data. Helps tailor the epsilon to the problem’s numerical scale.

Returns:

The determined or validated epsilon value. Ensures numerical operations are conducted with an appropriate epsilon to avoid division by zero or other numerical instabilities.

Return type:

float

Examples

>>> from geoprior.utils.validator import check_epsilon
>>> y_true = [1, 2, 3]
>>> y_pred = [1.1, 1.9, 3.05]
>>> check_epsilon('auto', y_true, y_pred)
1e-05  # Example output; the actual value depends on the `determine_epsilon` implementation.
>>> check_epsilon(1e-8)
1e-08

Notes

Using ‘auto’ for eps allows algorithms to adapt to different scales of data, enhancing numerical stability without manually tuning the epsilon value.
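The scaling rule itself is not spelled out above. The following is a minimal sketch under a stated assumption: epsilon is taken as the larger of base_epsilon and scale_factor times the largest absolute value in the inputs. The helper name sketch_auto_epsilon and this rule are illustrative only and may differ from what the library computes.

```python
def sketch_auto_epsilon(eps, y_true=None, y_pred=None,
                        base_epsilon=1e-10, scale_factor=1e-5):
    """Hypothetical stand-in for the 'auto' epsilon logic (assumed rule)."""
    if eps != "auto":
        # A user-supplied epsilon is only validated: it must be a positive number.
        if isinstance(eps, bool) or not isinstance(eps, (int, float)):
            raise ValueError("eps must be 'auto' or a positive number.")
        if eps <= 0:
            raise ValueError("eps must be strictly positive.")
        return float(eps)

    # Collect absolute values from whichever arrays were provided.
    magnitudes = [
        abs(float(v))
        for arr in (y_true, y_pred)
        if arr is not None
        for v in arr
    ]
    if not magnitudes:
        # No data to scale against: fall back to the base value.
        return base_epsilon

    # Assumed rule: never go below base_epsilon, otherwise scale with the data.
    return max(base_epsilon, scale_factor * max(magnitudes))
```

With this rule, larger-magnitude targets yield a proportionally larger epsilon, while an explicit float passes through unchanged.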

geoprior.utils.validator.check_has_run_method(estimator, msg=None, method_name='run')[source]#

Check if the given estimator has a callable run method or any other specified method. This utility helps validate that an object can execute the expected method before further actions are taken.

Parameters:
  • estimator (object) – The object (instance or class) to check for the presence of the run method or another specified method.

  • msg (str, optional) – Custom error message to display if the method is missing. If None, a default message is generated based on the method_name.

  • method_name (str, default "run") – The method name to check for. This defaults to run, but you can specify any method name. The method must be callable.

Raises:

AttributeError – Raised if the run method (or any specified method) does not exist on the object or is not callable.

Examples

>>> from geoprior.utils.validator import check_has_run_method
>>> class MyClass:
...     def run(self):
...         pass
>>> check_has_run_method(MyClass())  # No error
>>> class MyClassWithoutRun:
...     pass
>>> check_has_run_method(MyClassWithoutRun())  # Raises AttributeError

Notes

This function performs several checks:

  1. Existence check: It checks whether the run method (or any other specified method) exists in the estimator object.

  2. Callable check: It ensures that the method is callable, which rules out attributes that might exist but aren’t methods.

  3. Static/class method check: The function accepts static or class methods as valid callable methods.

  4. Bound method check: It verifies that instance methods are bound to an object when required, which ensures they can be called properly in the given context.

This function can be expressed as a validation function:

(2)#\[\text{check\_has\_method}(estimator, method\_name) = \begin{cases} \text{valid}, & \text{if method exists and callable} \\ \text{invalid}, & \text{if method is missing or not callable} \end{cases}\]

It determines whether the method is callable or raises an error otherwise. Callable-method validation here follows the Python documentation and the staticmethod overview in [43, 44].
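The existence and callable checks listed above can be sketched with getattr and callable. This is an illustrative stand-in, not the library's code; the helper name sketch_has_callable_method and the two small classes are hypothetical.

```python
def sketch_has_callable_method(obj, method_name="run", msg=None):
    """Hypothetical helper: raise AttributeError unless obj.<method_name> is callable."""
    # A missing attribute falls back to None, which is not callable.
    method = getattr(obj, method_name, None)
    if not callable(method):
        owner = obj.__name__ if isinstance(obj, type) else type(obj).__name__
        raise AttributeError(
            msg or f"{owner} has no callable '{method_name}' method."
        )
    return True


class Runner:
    def run(self):
        return "done"


class Broken:
    run = 42  # the attribute exists but is not callable


sketch_has_callable_method(Runner())   # instance with a bound method
sketch_has_callable_method(Runner)     # the class itself also exposes a callable
```

Passing Broken() raises AttributeError, because the attribute exists but fails the callable check.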

See also

validate_estimator_methods

A helper function to validate multiple methods on an estimator.

geoprior.utils.validator.check_is_fitted(estimator, attributes=None, *, msg=None, all_or_any=<built-in function all>)[source]#

Perform is_fitted validation for estimator.

Checks if the estimator is fitted by verifying the presence of fitted attributes (ending with a trailing underscore) and otherwise raises a NotFittedError with the given message.

If an estimator does not set any attributes with a trailing underscore, it can define a __sklearn_is_fitted__ or __fusionlab_is_fitted__ method returning a boolean to specify if the estimator is fitted or not.

Parameters:
  • estimator (estimator instance) – Estimator instance for which the check is performed.

  • attributes (str, list or tuple of str, default None) –

    Attribute name(s) given as a string or a list/tuple of strings, e.g. ["coef_", "estimator_", ...] or "coef_".

    If None, the estimator is considered fitted if there exists an attribute that ends with an underscore and does not start with a double underscore.

  • msg (str, default None) –

    The default error message is, “This %(name)s instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this estimator.”

    For custom messages if “%(name)s” is present in the message string, it is substituted for the estimator name.

    E.g.: “Estimator, %(name)s, must be fitted before sparsifying”.

  • all_or_any (callable, {all, any}, default all) – Specify whether all or any of the given attributes must exist.

Raises:
  • TypeError – If the estimator is a class or not an estimator instance

  • NotFittedError – If the attributes are not found.
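The trailing-underscore convention described above can be illustrated without any dependency. This is a rough sketch that returns a boolean instead of raising NotFittedError; sketch_is_fitted and TinyModel are hypothetical names, and the real function performs additional checks (for example the __sklearn_is_fitted__ hook).

```python
def sketch_is_fitted(estimator, attributes=None, all_or_any=all):
    """Hypothetical helper: report whether fitted attributes are present."""
    if attributes is None:
        # Any instance attribute ending in "_" but not starting with "__" counts.
        return any(
            name.endswith("_") and not name.startswith("__")
            for name in vars(estimator)
        )
    if isinstance(attributes, str):
        attributes = [attributes]
    # all_or_any is the built-in all or any, as in the signature above.
    return all_or_any([hasattr(estimator, name) for name in attributes])


class TinyModel:
    def fit(self, xs):
        self.mean_ = sum(xs) / len(xs)  # set only during fitting
        return self


unfitted = TinyModel()
fitted = TinyModel().fit([1, 2, 3])
```

Here sketch_is_fitted(fitted, ["mean_", "coef_"], all_or_any=any) is satisfied by mean_ alone, whereas the default all requires both attributes.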

geoprior.utils.validator.check_is_fitted2(estimator, attributes, *, msg=None)[source]#

Perform is_fitted validation for estimator.

Checks if the estimator is fitted by looking for attributes set during fitting. Typically, these attributes end with an underscore (‘_’).

Parameters:
  • estimator (BaseEstimator) – An instance of a scikit-learn estimator.

  • attributes (str or list of str) – The attributes to check for. These are typically set in the ‘fit’ method.

  • msg (str, optional) – The message to raise in the NotFittedError. If not provided, a default message is used.

Raises:

NotFittedError – If the given attributes are not found in the estimator.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from geoprior.utils.validator import check_is_fitted2
>>> clf = RandomForestClassifier()
>>> check_is_fitted2(clf, ['feature_importances_'])
NotFittedError: This RandomForestClassifier instance is not fitted yet.

geoprior.utils.validator.check_is_runned(estimator, attributes=None, *, msg=None, all_or_any=<built-in function all>)[source]#

Validate if an estimator instance has been “runned” (executed) prior to invoking dependent methods. This check ensures that the estimator is in the appropriate operational state, allowing users to identify and address runtime issues effectively.

If an estimator does not set “runned” attributes (such as _is_runned), it may define a __gofast_is_runned__ method. This method should return a boolean indicating whether the estimator is “runned” or not.

Parameters:
  • estimator (object) –

    The instance of the estimator or class being validated. This parameter represents the object in which dependent methods are validated to confirm that the “runned” state has been achieved.

    To determine the “runned” status, the function checks for specific attributes or, if defined, the __gofast_is_runned__ method.

  • attributes (str, list, or tuple of str, optional, default None) –

    Specifies the name(s) of attributes that indicate the “runned” status, such as ['_is_runned'] or ['_is_fitted']. If these attributes are present and set to True, the estimator is considered to have been runned.

    If attributes is set to None, the function will default to checking for _is_runned. This default provides flexibility for estimators that employ standard runned flags.

  • msg (str, optional, default None) –

    Custom error message to be displayed if the validation fails. By default, this error message uses the class name of the estimator in the format:

    ”This %(name)s instance has not been ‘runned’ yet. Call ‘run’ with appropriate arguments before using this method.”

    To customize the message, include %(name)s as a placeholder for the estimator’s class name.

  • all_or_any (callable, {all, any}, optional, default all) – Determines whether all or any of the specified attributes must be present and set to True. By default, the function expects all attributes to be set to True. Set to any for greater flexibility with multiple attributes.

The estimator may also define an optional callable method, __gofast_is_runned__. If defined, it should return a boolean indicating the “runned” status of the estimator, which provides an alternative to using attributes.

Raises:

RuntimeError – If none of the specified attributes are set to True or if the __gofast_is_runned__ method (if present) returns False.

Notes

The check_is_runned function ensures that methods dependent on the “runned” status are only executed after the estimator has completed all required preliminary processes, like fit or run. This helper mirrors the fitted-state checks described in [45, 46].

Examples

>>> from geoprior.utils.validator import check_is_runned
>>> class ExampleClass:
...     def __init__(self):
...         self._is_runned = False
...
...     def run(self):
...         self._is_runned = True
...         print("Run completed.")
...
...     def process_data(self):
...         check_is_runned(self)
...         print("Processing data...")
>>> model = ExampleClass()
>>> model.process_data()  # Raises RuntimeError
>>> model.run()
>>> model.process_data()  # Now it works

See also

check_is_fitted

Validates that an estimator has been “fitted” before further use.

validate_estimator_methods

Validates essential estimator methods.

geoprior.utils.validator.check_memory(memory)[source]#

Check that memory is joblib.Memory-like.

joblib.Memory-like means that memory can be converted into a joblib.Memory instance (typically a str denoting the location) or has the same interface (has a cache method).

Parameters:

memory (None, str or object with the joblib.Memory interface) –

  • If string, the location where to create the joblib.Memory interface.

  • If None, no caching is done and the Memory object is completely transparent.

Returns:

memory – A correct joblib.Memory object.

Return type:

object with the joblib.Memory interface

Raises:

ValueError – If memory is not joblib.Memory-like.

geoprior.utils.validator.check_mixed_data_types(data)[source]#

Checks if the given data (DataFrame or numpy array) contains both numerical and categorical columns.

Parameters:

data (pd.DataFrame or np.ndarray) – The data to check. Can be a pandas DataFrame or a numpy array. If data is a numpy array, it is temporarily converted to a DataFrame for type checking.

Returns:

True if the data contains both numerical and categorical columns, False otherwise.

Return type:

bool

Examples

Using with a pandas DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.validator import check_mixed_data_types
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> print(check_mixed_data_types(df))
True

Using with a numpy array:

>>> array = np.array([[1, 'a'], [2, 'b'], [3, 'c']])
>>> print(check_mixed_data_types(array))
True

With data containing only numerical values:

>>> df_numeric_only = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> print(check_mixed_data_types(df_numeric_only))
False

With data containing only categorical values:

>>> df_categorical_only = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f']})
>>> print(check_mixed_data_types(df_categorical_only))
False

geoprior.utils.validator.check_random_state(seed)[source]#

Turn seed into a np.random.RandomState instance.

Parameters:

seed (None, int or instance of RandomState) – If seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.

Returns:

The random state object based on seed parameter.

Return type:

numpy.random.RandomState

geoprior.utils.validator.check_scalar(x, name, target_type, *, min_val=None, max_val=None, include_boundaries='both')[source]#

Validate scalar parameters type and value.

Parameters:
  • x (object) – The scalar parameter to validate.

  • name (str) – The name of the parameter to be printed in error messages.

  • target_type (type or tuple) – Acceptable data types for the parameter.

  • min_val (float or int, default None) – The minimum valid value the parameter can take. If None (default) it is implied that the parameter does not have a lower bound.

  • max_val (float or int, default None) – The maximum valid value the parameter can take. If None (default) it is implied that the parameter does not have an upper bound.

  • include_boundaries ({"left", "right", "both", "neither"}, default "both") – Whether the interval defined by min_val and max_val should include the boundaries. Use "left" for [min_val, max_val), "right" for (min_val, max_val], "both" for [min_val, max_val], or "neither" for (min_val, max_val).

Returns:

x – The validated number.

Return type:

numbers.Number

Raises:
  • TypeError – If the parameter’s type does not match the desired type.

  • ValueError – If the parameter’s value violates the given bounds. If min_val, max_val and include_boundaries are inconsistent.
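The four include_boundaries options map onto strict or non-strict comparisons at each end of the interval. A minimal sketch of just that bounds logic follows; the helper name sketch_in_interval is hypothetical, and it returns a boolean rather than raising as check_scalar does.

```python
import operator


def sketch_in_interval(x, min_val=None, max_val=None, include_boundaries="both"):
    """Hypothetical helper: test x against an interval with configurable ends."""
    if include_boundaries not in ("left", "right", "both", "neither"):
        raise ValueError(f"Unknown include_boundaries: {include_boundaries!r}")

    # Closed end -> non-strict comparison; open end -> strict comparison.
    lower_ok = operator.ge if include_boundaries in ("left", "both") else operator.gt
    upper_ok = operator.le if include_boundaries in ("right", "both") else operator.lt

    if min_val is not None and not lower_ok(x, min_val):
        return False
    if max_val is not None and not upper_ok(x, max_val):
        return False
    return True
```

For example, with min_val=0 and max_val=1, the value 0 is accepted under "left" and "both" but rejected under "right" and "neither".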

geoprior.utils.validator.check_symmetric(array, *, tol=1e-10, raise_warning=True, raise_exception=False)[source]#

Make sure that array is 2D, square and symmetric.

If the array is not symmetric, then a symmetrized version is returned. Optionally, a warning or exception is raised if the matrix is not symmetric.

Parameters:
  • array ({ndarray, sparse matrix}) – Input object to check / convert. Must be two-dimensional and square, otherwise a ValueError will be raised.

  • tol (float, default 1e-10) – Absolute tolerance for equivalence of arrays. Default = 1E-10.

  • raise_warning (bool, default True) – If True then raise a warning if conversion is required.

  • raise_exception (bool, default False) – If True then raise an exception if array is not symmetric.

Returns:

array_sym – Symmetrized version of the input array, i.e. the average of array and array.transpose(). If sparse, then duplicate entries are first summed and zeros are eliminated.

Return type:

{ndarray, sparse matrix}
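The symmetrization described above, averaging the array with its transpose, can be shown on plain nested lists. This is an illustrative dense-only sketch with a hypothetical name, sketch_symmetrize; it does not cover sparse input, warnings, or exceptions.

```python
def sketch_symmetrize(matrix, tol=1e-10):
    """Hypothetical helper: return (symmetric_matrix, was_already_symmetric)."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be two-dimensional and square.")

    # Compare each off-diagonal pair within the absolute tolerance.
    already_symmetric = all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )
    if already_symmetric:
        return [list(row) for row in matrix], True

    # Average the matrix with its transpose, entry by entry.
    averaged = [
        [(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
        for i in range(n)
    ]
    return averaged, False
```

Applied to [[1, 2], [4, 3]], the off-diagonal entries 2 and 4 are both replaced by their mean, 3.0.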

geoprior.utils.validator.check_y(y, multi_output=False, y_numeric=False, input_name='y', estimator=None, to_frame=False, allow_nan=False)[source]#

Validates the target array y, ensuring it is suitable for classification or regression tasks based on its content and the specified strategy.

Parameters:
  • y (array-like) – Target values to validate.

  • multi_output (bool, default False) – Whether to allow two-dimensional y values. If False, y is validated as a vector. When multi_output=True, y still cannot contain np.nan or np.inf values unless allow_nan permits NaNs.

  • y_numeric (bool, default False) – Whether to ensure that y has a numeric type. If dtype of y is object, it is converted to float64. Should only be used for regression algorithms.

  • input_name (str, default "y") – Data name used to construct the error message.

  • estimator (str or estimator instance, default None) – If passed, include the name of the estimator in warning messages.

  • allow_nan (bool, default False) – If True, do not raise an error when y contains NaN values.

  • to_frame (bool, default False) – Reconvert the validated array to its initial pandas type when the input was provided as a pandas Series or DataFrame.

Returns:

y_converted – The converted and validated y.

Return type:

object

geoprior.utils.validator.contains_nested_objects(lst, strict=False, allowed_types=None)[source]#

Determines whether a list contains nested objects.

Parameters:
  • lst (list) – The list to be checked for nested objects.

  • strict (bool, optional) – If True, all items in the list must be nested objects. If False, the function returns True if any item is a nested object. Default is False.

  • allowed_types (tuple of types, optional) – A tuple of types to consider as nested objects. If None, common nested types like list, set, dict, and tuple are checked. Default is None.

Returns:

True if the list contains nested objects according to the given parameters, otherwise False.

Return type:

bool

Notes

A nested object is defined as any item within the list that is not a primitive data type (e.g., int, float, str) or is a complex structure like lists, sets, dictionaries, etc. The function can be customized to check for specific types using the allowed_types parameter.
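The strict and non-strict modes correspond to all and any over a per-item type test. A small sketch, assuming the default nested types are list, set, dict, and tuple as stated above; the helper name sketch_contains_nested is hypothetical.

```python
def sketch_contains_nested(lst, strict=False, allowed_types=None):
    """Hypothetical helper: any/all test for container-typed items."""
    nested_types = allowed_types or (list, set, dict, tuple)
    flags = [isinstance(item, nested_types) for item in lst]
    if strict:
        # Every item must be nested; an empty list is treated as not nested.
        return bool(flags) and all(flags)
    return any(flags)
```

Treating an empty list as False in strict mode is a choice made for this sketch; the source does not say how that edge case is handled.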

Examples

>>> from geoprior.utils.validator import contains_nested_objects
>>> example_list1 = [{1, 2}, [3, 4], {'key': 'value'}]
>>> example_list2 = [1, 2, 3, [4]]
>>> example_list3 = [1, 2, 3, 4]
>>> contains_nested_objects(example_list1)
True  # non-strict, contains nested objects
>>> contains_nested_objects(example_list1, strict=True)
True  # strict, all are nested objects
>>> contains_nested_objects(example_list2)
True  # non-strict, contains at least one nested object
>>> contains_nested_objects(example_list2, strict=True)
False  # strict, not all are nested objects
>>> contains_nested_objects(example_list3)
False  # non-strict, no nested objects
>>> contains_nested_objects(example_list3, strict=True)
False  # strict, no nested objects

geoprior.utils.validator.convert_array_to_pandas(X, *, to_frame=False, columns=None, input_name='X')[source]#

Converts an array-like object to a pandas DataFrame or Series, applying provided column names or series name.

Parameters:
  • X (array-like) – The array to convert to a DataFrame or Series.

  • to_frame (bool, default False) – If True, converts the array to a DataFrame. Otherwise, returns the array unchanged.

  • columns (str or list of str, optional) – Name(s) for the columns of the resulting DataFrame or the name of the Series.

  • input_name (str, default 'X') – The name of the input variable; used in constructing error messages.

Returns:

  • pd.DataFrame or pd.Series – The converted DataFrame or Series. If to_frame is False, returns X unchanged.

  • columns (str or list of str) – The column names of the DataFrame or the name of the Series, if applicable.

Raises:
  • TypeError – If X is not array-like or if columns is neither a string nor a list of strings.

  • ValueError – If the conversion to DataFrame is requested but columns is not provided, or if the length of columns does not match the number of columns in X.

geoprior.utils.validator.ensure_2d(X, output_format='auto')[source]#

Ensure that the input X is converted to a 2-dimensional structure.

Parameters:
  • X (array-like or pandas.DataFrame) – The input data to convert. Can be a list, numpy array, or DataFrame.

  • output_format (str, optional) – The format of the returned object. Options are “auto”, “array”, or “frame”. “auto” returns a DataFrame if X is a DataFrame, otherwise a numpy array. “array” always returns a numpy array. “frame” always returns a pandas DataFrame.

Returns:

The converted 2-dimensional structure, either as a numpy array or DataFrame.

Return type:

ndarray or DataFrame

Raises:

ValueError – If the output_format is not one of the allowed values.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.validator import ensure_2d
>>> X = np.array([1, 2, 3])
>>> ensure_2d(X, output_format="array")
array([[1],
       [2],
       [3]])
>>> df = pd.DataFrame([1, 2, 3])
>>> ensure_2d(df, output_format="frame")
   0
0  1
1  2
2  3

geoprior.utils.validator.ensure_non_negative(*arrays, err_msg=None)[source]#

Ensure that provided arrays contain only non-negative values.

This function checks each provided array for non-negativity. If any negative values are found in any array, it raises a ValueError. This check is crucial for computations or algorithms where negative values are not permissible, such as logarithmic transformations.

Parameters:
  • *arrays (array-like) – One or more array-like structures (e.g., lists, numpy arrays). Each array is checked for non-negativity.

  • err_msg (str, optional) – Specify a custom error message if negative values are found.

Raises:

ValueError – If any array contains negative values, a ValueError is raised with a message indicating that only non-negative values are expected.

Examples

>>> from geoprior.utils.validator import ensure_non_negative
>>> y_true = [0, 1, 2, 3]
>>> y_pred = [0.5, 2.1, 3.5, -0.1]
>>> ensure_non_negative(y_true, y_pred)
ValueError: Negative value found. Expect only non-negative values.

Note

The function uses a variable number of arguments, allowing flexibility in the number of arrays checked in a single call.

geoprior.utils.validator.filter_valid_kwargs(callable_obj, kwargs)[source]#

Filter and return only the valid keyword arguments for a given callable object.

This function checks if the arguments in kwargs are valid for the provided callable object (function, lambda function, method, or class). If any argument is not valid, it is removed from kwargs. The function returns only the valid kwargs.

Parameters:
  • callable_obj (callable) – The callable object (function, lambda function, method, or class) for which the keyword arguments need to be validated.

  • kwargs (dict) – Dictionary of keyword arguments to be validated against the callable object.

Returns:

valid_kwargs – Dictionary containing only the valid keyword arguments for the callable object.

Return type:

dict

Examples

>>> from geoprior.utils.validator import filter_valid_kwargs
>>> def example_func(a, b, c=3):
...     pass
>>> kwargs = {'a': 1, 'b': 2, 'd': 4}
>>> filter_valid_kwargs(example_func, kwargs)
{'a': 1, 'b': 2}
>>> class ExampleClass:
...     def __init__(self, x, y, z=10):
...         pass
>>> kwargs = {'x': 1, 'y': 2, 'a': 3}
>>> filter_valid_kwargs(ExampleClass, kwargs)
{'x': 1, 'y': 2}
>>> filter_valid_kwargs(ExampleClass(x=1, y=2), kwargs)
{'x': 1, 'y': 2}

Notes

This function uses the inspect module to retrieve the signature of the given callable object and validate the keyword arguments.
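The signature-based filtering mentioned above can be sketched with inspect.signature. The helper name sketch_filter_kwargs is hypothetical, and passing every key through when the callable accepts **kwargs is an assumption of this sketch, not documented behavior of the library function.

```python
import inspect


def sketch_filter_kwargs(callable_obj, kwargs):
    """Hypothetical helper: keep only keys the callable accepts by keyword."""
    params = inspect.signature(callable_obj).parameters.values()

    # Assumption: a **kwargs catch-all means every key is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return dict(kwargs)

    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {p.name for p in params if p.kind in keyword_kinds}
    return {key: value for key, value in kwargs.items() if key in accepted}


def example_func(a, b, c=3):
    return a + b + c


filtered = sketch_filter_kwargs(example_func, {"a": 1, "b": 2, "d": 4})
print(filtered)  # {'a': 1, 'b': 2}
```

For a class, inspect.signature reports the __init__ parameters without self, so the same helper also works when a class is passed.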

geoprior.utils.validator.get_estimator_name(estimator)[source]#

Get the estimator name, whether it is an instantiated object or not.

Parameters:

estimator (callable or instance) – A class or instantiated object that has a fit method.

Returns:

The name of the estimator.

Return type:

str

geoprior.utils.validator.handle_zero_division(y_true, zero_division='warn', metric_name='metric computation', epsilon=1e-15, replace_with=None)[source]#

Preprocess input arrays to handle cases where zero could cause division errors in subsequent metric computations.

Parameters:
  • y_true (array-like) – The input data array where zeros might cause division errors.

  • zero_division ({'warn', 'raise', 'ignore'}, default 'warn') – Determines the action to perform when a zero is encountered. Use "warn" to issue a warning and replace zeros with replace_with or epsilon, "raise" to raise an error, or "ignore" to leave zeros unchanged when the metric can handle them natively.

  • metric_name (str, optional) – Name of the metric for which this preprocessing is being done, to be included in warnings or error messages for better context.

  • epsilon (float, optional) – Small value to use as default replacement if replace_with is None, default is 1e-15.

  • replace_with (float or None, optional) – A specific value to replace zeros with, if None, epsilon is used.

Returns:

The processed array with modifications based on the zero_division strategy.

Return type:

numpy.ndarray

Raises:

ValueError – If zero_division is ‘raise’ and zero is found in y_true.

Notes

Using replace_with allows for custom behavior when handling zeros, which can be tailored to the specific requirements of different metric computations.
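The three zero_division strategies can be sketched on a plain list. This is an illustrative stand-in with a hypothetical name, sketch_handle_zeros; it returns a list rather than a NumPy array, and the warning text is invented for the example.

```python
import warnings


def sketch_handle_zeros(values, zero_division="warn",
                        metric_name="metric computation",
                        epsilon=1e-15, replace_with=None):
    """Hypothetical helper mirroring the 'warn' / 'raise' / 'ignore' strategies."""
    if zero_division not in ("warn", "raise", "ignore"):
        raise ValueError(f"Unknown zero_division: {zero_division!r}")

    values = [float(v) for v in values]
    if 0.0 not in values or zero_division == "ignore":
        return values  # nothing to do, or zeros are left untouched

    if zero_division == "raise":
        raise ValueError(f"Zero value found in input for {metric_name}.")

    # 'warn': substitute replace_with if given, otherwise epsilon.
    replacement = epsilon if replace_with is None else replace_with
    warnings.warn(f"Zeros replaced with {replacement} for {metric_name}.")
    return [replacement if v == 0.0 else v for v in values]
```

Calling it with replace_with=0.001 on [0, 1, 2, 3, 0] substitutes both zeros, in line with the example that follows.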

Examples

>>> from geoprior.utils.validator import handle_zero_division
>>> y_true = [0, 1, 2, 3, 0]
>>> processed_y_true = handle_zero_division(
...     y_true, replace_with=0.001, zero_division='warn'
... )
>>> print(processed_y_true)
[1.e-03 1.e+00 2.e+00 3.e+00 1.e-03]

geoprior.utils.validator.has_methods(models, methods, strict=True, check_status='check_only', msg=None)[source]#

Validate that one or more model objects implement required methods.

Parameters:
  • models (object or list of objects) – Model instance or collection of model instances to validate.

  • methods (list of str) – Public method names that each model must implement.

  • strict (bool, optional) – If True, raise an AttributeError when a required method is missing.

  • check_status ({'validate', 'check_only'}, optional) – Return mode. Use 'validate' to return validated models and 'check_only' to return a boolean flag.

  • msg (str or None, optional) – Optional custom error message using {model} and {methods} placeholders.

Returns:

Validated models when check_status='validate' or a boolean flag when check_status='check_only'.

Return type:

list of objects or bool

Raises:

AttributeError – If strict is True and a required method is missing.

geoprior.utils.validator.has_fit_parameter(estimator, parameter)[source]#

Check whether the estimator’s fit method supports the given parameter.

Parameters:
  • estimator (object) – An estimator to inspect.

  • parameter (str) – The searched parameter.

Returns:

is_parameter – Whether the parameter was found to be a named parameter of the estimator’s fit method.

Return type:

bool
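A check of this kind can be written by inspecting the signature of the fit method. The sketch below is a standalone illustration; sketch_has_fit_parameter and WeightedMean are hypothetical names, not part of the library.

```python
import inspect


def sketch_has_fit_parameter(estimator, parameter):
    """Hypothetical helper: is `parameter` a named parameter of estimator.fit?"""
    return parameter in inspect.signature(estimator.fit).parameters


class WeightedMean:
    """Toy estimator whose fit method accepts sample_weight."""

    def fit(self, X, y=None, sample_weight=None):
        return self


supports_weights = sketch_has_fit_parameter(WeightedMean(), "sample_weight")
supports_groups = sketch_has_fit_parameter(WeightedMean(), "groups")
```

Because fit is looked up on the instance, self is already bound and does not appear among the inspected parameters.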

Examples

>>> from sklearn.svm import SVC
>>> from geoprior.utils.validator import has_fit_parameter
>>> has_fit_parameter(SVC(), "sample_weight")
True

geoprior.utils.validator.has_required_attributes(model, attributes)[source]#

Check if the model has all required Keras-specific attributes.

This function is part of the deep validation process to ensure that the model not only inherits from Keras model classes but also implements essential methods.

Parameters:
  • model (Any) – The model object to inspect.

  • attributes (list of str) – A list of strings representing the names of the attributes to check for in the model.

Returns:

True if the model contains all specified attributes, False otherwise.

Return type:

bool

geoprior.utils.validator.is_binary_class(y, accept_multioutput=False)[source]#

Check whether the target array represents binary classification. Optionally, handle multi-output arrays if each output is binary.

Parameters:
  • y (array-like) – The target array to be checked. This can be a 1D array for single output or a 2D array for multiple outputs if accept_multioutput is True.

  • accept_multioutput (bool, default False) – If True, the function checks if each column in a multi-dimensional array is binary. If False, the function checks if the entire array is binary.

Returns:

Returns True if y is binary (or each output is binary if multi-output is accepted), False otherwise.

Return type:

bool
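The per-column versus whole-array distinction can be sketched on nested lists. This assumes "binary" means exactly two distinct values, which the text above does not state explicitly; the helper name sketch_is_binary is hypothetical.

```python
def sketch_is_binary(y, accept_multioutput=False):
    """Hypothetical helper: assumed definition is exactly two distinct values."""
    rows = list(y)
    is_2d = bool(rows) and all(isinstance(row, (list, tuple)) for row in rows)

    if not is_2d:
        # Single output: look at the distinct values of the vector.
        return len(set(rows)) == 2

    if accept_multioutput:
        # Multi-output: every column must be binary on its own.
        return all(len(set(column)) == 2 for column in zip(*rows))

    # Otherwise the flattened array as a whole must be binary.
    flat = [value for row in rows for value in row]
    return len(set(flat)) == 2
```

Under this definition a constant target such as [1, 1, 1] is not binary; a looser "at most two values" rule would be an equally plausible reading.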

Examples

>>> from geoprior.utils.validator import is_binary_class
>>> is_binary_class([0, 1, 1, 0])
True
>>> is_binary_class([[0, 1], [1, 0], [0, 1], [1, 0]], accept_multioutput=True)
True
>>> is_binary_class([0, 1, 2, 3])
False

geoprior.utils.validator.is_categorical(data, column, strict=False, error='raise')[source]#

Checks if a specified column in a DataFrame or Series is of a categorical type.

Parameters:
  • data (DataFrame or Series) – The DataFrame or Series to check.

  • column (str) – The name of the column to check.

  • strict (bool, optional) – If True, only considers pandas CategoricalDtype as categorical. If False, also considers object dtype that often represents categorical data. Default is False.

  • error (str, optional) – Specifies how to handle situations when the column does not exist. Options are ‘raise’, ‘warn’, or ‘ignore’. Default is ‘raise’.

Returns:

True if the column is categorical, otherwise False.

Return type:

bool

Raises:

ValueError – If the column does not exist and error is set to ‘raise’.

Examples

>>> import pandas as pd
>>> from geoprior.utils.validator import is_categorical
>>> df = pd.DataFrame({
...     'fruit': ['Apple', 'Banana', 'Cherry'],
...     'count': [10, 20, 15]
... })
>>> df['fruit'] = df['fruit'].astype('category')
>>> print(is_categorical(df, 'fruit'))
True
>>> print(is_categorical(df, 'count'))
False
>>> print(is_categorical(df, 'non_existent', error='warn'))
Warning: Column 'non_existent' not found in the dataframe.
False

geoprior.utils.validator.is_frame(arr, df_only=False, raise_exception=False, objname=None, error='raise')[source]#

Check if arr is a pandas DataFrame or Series.

If df_only=True, the function checks strictly for a pandas DataFrame. Otherwise, it accepts either a pandas DataFrame or Series. This utility is often used to validate input data before processing, ensuring that the input conforms to expected types.

Parameters:
  • arr (object) – The object to examine. Typically a pandas DataFrame or Series, but can be any Python object.

  • df_only (bool, optional) – If True, only verifies that arr is a DataFrame. If False, checks for either a DataFrame or a Series. Default is False.

  • raise_exception (bool, optional) – If True, this will override error="raise". This parameter is deprecated and will be removed soon. Default is False.

  • error (str, optional) –

    Determines the action when arr is not a valid frame:

    • "raise": Raises a TypeError.

    • "warn": Issues a warning.

    • "ignore": Does nothing.

    Default is "raise".

  • objname (str or None, optional) – A custom name used in the error message if error is set to "raise". If None, a generic name is used.

Returns:

True if arr is a DataFrame or Series (or strictly a DataFrame if df_only=True), otherwise False.

Return type:

bool

Raises:

TypeError – If error="raise" and arr is not a valid frame. The error message guides the user to provide the correct type (a DataFrame when df_only=True, otherwise a DataFrame or Series).

Notes

This function does not convert or modify arr. It merely checks its compatibility with common DataFrame/Series interfaces by examining attributes such as ‘columns’ or ‘name’. For a DataFrame, arr.columns should exist, and for a Series, a ‘name’ attribute is often present. Both DataFrame and Series implement __array__, making them NumPy array-like.

Examples

>>> import pandas as pd
>>> from geoprior.utils.validator import is_frame
>>> df = pd.DataFrame({'A': [1,2,3]})
>>> is_frame(df)
True
>>> s = pd.Series([4,5,6], name='S')
>>> is_frame(s)
True
>>> is_frame(s, df_only=True)
False

If error="raise":

>>> is_frame(s, df_only=True, error="raise", objname='Input')
Traceback (most recent call last):
    ...
TypeError: 'Input' parameter expects a DataFrame. Got 'Series'
geoprior.utils.validator.is_installed(module)[source]#

Checks if a module (for example, TensorFlow) is installed.

This function attempts to find the package specification for module without importing the package. It’s a lightweight method to verify the presence of a package such as TensorFlow in the environment.

Returns:

True if the module is installed, False otherwise.

Return type:

bool

Parameters:

module (str)

Examples

>>> from geoprior.utils.validator import is_installed
>>> print(is_installed("tensorflow"))
True  # Output will be True if TensorFlow is installed, False otherwise.
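
The "find the specification without importing" approach described above can be approximated with the standard library. The following is a minimal sketch of that general technique, not the geoprior implementation; module_available is a hypothetical name introduced only for illustration.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Hypothetical helper: True if `name` can be located by the import system."""
    try:
        # For a top-level name, find_spec locates the module without executing it.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names with a missing parent, or for modules
        # whose __spec__ is unset.
        return False


print(module_available("json"))
print(module_available("surely_not_a_real_module_name"))
```

Because nothing is executed for a top-level lookup, a check like this stays cheap even for heavy packages.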
geoprior.utils.validator.is_normalized(arr, method='sum')[source]#

Checks if the provided array is normalized according to the specified method.

Parameters:
  • arr (array-like) – The array to check for normalization.

  • method (str, optional) – The normalization method to check against. Use "01" to confirm values are within [0, 1] with minimum 0 and maximum 1, "zscore" to confirm mean 0 and standard deviation 1, or "sum" to confirm the array sums to 1. Default is "sum".

Returns:

Returns True if the array is normalized according to the specified method, False otherwise.

Return type:

bool

Examples

>>> import numpy as np
>>> from geoprior.utils.validator import is_normalized
>>> arr = np.array([0.25, 0.25, 0.25, 0.25])
>>> is_normalized(arr, method='sum')
True
>>> arr = np.array([0, 0.5, 1])
>>> is_normalized(arr, method='01')
True
>>> arr = np.array([1, -1, 1, -1])
>>> is_normalized(arr, method='zscore')
True
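
The three checks named in the method parameter can be restated with the standard library alone. This is a sketch under stated assumptions: looks_normalized and its tol argument are hypothetical, and the population standard deviation is assumed for "zscore" because that is what makes the [1, -1, 1, -1] example above evaluate to True.

```python
import math
import statistics


def looks_normalized(values, method="sum", tol=1e-8):
    """Hypothetical re-statement of the 'sum', '01' and 'zscore' checks."""
    values = [float(v) for v in values]
    if method == "sum":
        return math.isclose(sum(values), 1.0, abs_tol=tol)
    if method == "01":
        # Minimum must be 0 and maximum must be 1.
        return (math.isclose(min(values), 0.0, abs_tol=tol)
                and math.isclose(max(values), 1.0, abs_tol=tol))
    if method == "zscore":
        # Assumption: population standard deviation (divide by n).
        return (math.isclose(statistics.fmean(values), 0.0, abs_tol=tol)
                and math.isclose(statistics.pstdev(values), 1.0, abs_tol=tol))
    raise ValueError(f"Unknown method: {method!r}")


print(looks_normalized([0.25, 0.25, 0.25, 0.25], method="sum"))
print(looks_normalized([1, -1, 1, -1], method="zscore"))
```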
geoprior.utils.validator.is_square_matrix(data, data_type=None)[source]#

Determine whether the input, either a DataFrame or an array-like structure, forms a square matrix.

Automatically detects the data type unless specified. Supports data inputs that can be converted to a NumPy array.

Parameters:
  • data (DataFrame, array-like, or any object convertible to a numpy array) – The input data to check.

  • data_type (str, optional) – The expected type of the input data. Valid options are ‘array’ or ‘dataframe’. If not specified, the data type is inferred. Default interpretation is as an ‘array’.

Returns:

Returns True if the data is a square matrix, otherwise False.

Return type:

bool

Raises:
  • ValueError – If data_type is neither ‘array’ nor ‘dataframe’.

  • TypeError – If the input data does not match the expected format or cannot be processed.

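Examples

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.validator import is_square_matrix
>>> is_square_matrix(np.array([[1, 2], [3, 4]]))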
True
>>> is_square_matrix(pd.DataFrame([[1, 2, 3], [4, 5, 6]]))
False
>>> is_square_matrix([[1, 2], [3, 4]], data_type='array')
True

Notes

A square matrix has an equal number of rows and columns. This function checks the dimensionality and shape of the data to confirm if it meets this criterion.

geoprior.utils.validator.is_time_series(data, time_col, check_time_interval=False)[source]#

Check if the provided DataFrame is time series data.

Parameters:
  • data (pandas.DataFrame) – The DataFrame to be checked.

  • time_col (str) – The name of the column in data expected to represent time.

Returns:

True if data is a time series, False otherwise.

Return type:

bool

Example

>>> import pandas as pd
>>> from geoprior.utils.validator import is_time_series
>>> df = pd.DataFrame({
...     'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
...     'Value': [1, 2, 3, 4, 5]
... })
>>> # Should return True if the Date column
>>> # can be converted to datetime
>>> print(is_time_series(df, 'Date'))
geoprior.utils.validator.is_valid_policies(nan_policy, allowed_policies=None)[source]#

Validates the nan_policy or any policy argument to ensure it is one of the acceptable options (allowed_policies).

This function is used to enforce conformity to predefined NaN handling strategies in data processing tasks.

Parameters:
  • nan_policy (str) – The NaN handling policy to validate. Acceptable values are: ‘propagate’ - NaN values are propagated, i.e., no action is taken. ‘omit’ - NaN values are omitted before proceeding with the operation. ‘raise’ - Raises an error if NaN values are present.

  • allowed_policies (list of str, optional) – A list of allowable policy options. If None, defaults to [‘propagate’, ‘omit’, ‘raise’].

Raises:

ValueError – If nan_policy is not one of the valid options in allowed_policies.

Returns:

The verified nan_policy value, confirming it is within allowed parameters.

Return type:

str
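
The membership check described above amounts to a few lines. The sketch below is an illustration only; check_policy is a hypothetical name, and the error message wording is an assumption rather than the library's actual text.

```python
def check_policy(policy, allowed=None):
    """Hypothetical helper: return `policy` if it is one of the allowed options."""
    # Fall back to the three documented defaults when no list is supplied.
    allowed = list(allowed) if allowed is not None else ["propagate", "omit", "raise"]
    if policy not in allowed:
        raise ValueError(f"Invalid policy {policy!r}; expected one of {allowed}.")
    return policy


print(check_policy("omit"))
print(check_policy("skip", allowed=["skip", "fill"]))
```

Returning the validated value lets a caller write the check and the assignment in one statement.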

Examples

>>> from geoprior.utils.validator import is_valid_policies
>>> is_valid_policies('omit')  # This should pass without an error.
>>> is_valid_policies('ignore')  # This should raise a ValueError.
geoprior.utils.validator.normalize_array(arr, normalize='auto', method='01')[source]#

Checks if an array is normalized according to the specified method and normalizes it if required based on the ‘normalize’ parameter.

Parameters:
  • arr (array-like) – The input array to check and potentially normalize.

  • normalize (str or bool, optional) – Controls whether normalization is applied. Use "auto" to normalize only when the array is not already normalized for the selected method. Use True to always normalize and False to return the array unchanged. Default is "auto".

  • method (str, optional) – Normalization method to apply. Use "01" for min-max scaling, "zscore" for standardization, or "sum" to scale values so they sum to 1. Default is "01".

Returns:

The normalized array, or the original array if no normalization was applied.

Return type:

np.ndarray

Raises:

ValueError – If an unknown normalization method is specified or if normalization cannot be performed due to data characteristics (e.g., zero variance).

Examples

>>> import numpy as np
>>> from geoprior.utils.validator import normalize_array
>>> data = np.array([1, 2, 3, 4, 5])
>>> normalized_data = normalize_array(data, normalize=True, method='01')
>>> print("Normalized between 0 and 1:", normalized_data)
Normalized between 0 and 1: [0.   0.25 0.5  0.75 1.  ]
>>> zscore_data = normalize_array(data, normalize=True, method='zscore')
>>> print("Standardized (Z-score):", zscore_data)
Standardized (Z-score): [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
>>> sum_data = normalize_array(data, normalize=True, method='sum')
>>> print("Normalized by sum:", sum_data)
Normalized by sum: [0.06666667 0.13333333 0.2        0.26666667 0.33333333]
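
The three methods can be written out by hand to see where the numbers above come from. This is a standard-library sketch, not the geoprior code; rescale is a hypothetical name, and the population standard deviation is assumed for "zscore" because it reproduces the ±1.41421356 values in the example output.

```python
import statistics


def rescale(values, method="01"):
    """Hypothetical re-implementation of the '01', 'zscore' and 'sum' scalings."""
    values = [float(v) for v in values]
    if method == "01":
        lo, hi = min(values), max(values)
        if hi == lo:
            raise ValueError("Min-max scaling is undefined for constant data.")
        return [(v - lo) / (hi - lo) for v in values]
    if method == "zscore":
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population std (divide by n)
        if std == 0:
            raise ValueError("Z-score scaling is undefined for zero variance.")
        return [(v - mean) / std for v in values]
    if method == "sum":
        total = sum(values)
        if total == 0:
            raise ValueError("Sum scaling is undefined when the values sum to 0.")
        return [v / total for v in values]
    raise ValueError(f"Unknown method: {method!r}")


print(rescale([1, 2, 3, 4, 5], method="01"))
print(rescale([1, 2, 3, 4, 5], method="zscore"))
```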
geoprior.utils.validator.parameter_validator(param_name, target_strs, match_method='contains', raise_exception=True, **kws)[source]#

Creates a validator function for ensuring a parameter’s value matches one of the allowed target strings, optionally applying normalization.

This higher-order function returns a validator that can be used to check if a given parameter value matches allowed criteria, optionally raising an exception or normalizing the input.

Parameters:
  • param_name (str) – Name of the parameter to be validated. Used in error messages to indicate which parameter failed validation.

  • target_strs (list of str) – A list of acceptable string values for the parameter.

  • match_method (str, optional) – The method used to match the input string against the target strings. The default method is ‘contains’, which checks if the input string contains any of the target strings.

  • raise_exception (bool, optional) – Specifies whether an exception should be raised if validation fails. Defaults to True, raising an exception on failure.

  • **kws (dict) – Keyword arguments passed to geoprior.core.utils.normalize_string().

Returns:

A closure that takes a single string argument (the parameter value) and returns a normalized version of it if the parameter matches the target criteria. If the parameter does not match and raise_exception is True, it raises an exception; otherwise, it returns the original value.

Return type:

function

Examples

>>> from geoprior.utils.validator import parameter_validator
>>> validate_outlier_method = parameter_validator(
...  'outlier_method', ['z_score', 'iqr'])
>>> outlier_method = "z_score"
>>> print(validate_outlier_method(outlier_method))
z_score
>>> validate_fill_missing = parameter_validator(
...  'fill_missing', ['median', 'mean', 'mode'], raise_exception=False)
>>> fill_missing = "average"  # This does not match but won't raise an exception.
>>> print(validate_fill_missing(fill_missing))
average

Notes

  • The function leverages a custom utility function normalize_string from a module named geoprior.core.utils. This utility is assumed to handle string normalization and matching based on the provided match_method.

  • If raise_exception is set to False and the input does not match any target string, the input string is returned unchanged. This behavior allows for optional enforcement of the validation rules.

  • The primary use case for this function is to validate and optionally normalize parameters for configuration settings or function arguments where only specific values are allowed.
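
The closure pattern described in these notes can be sketched as follows. This is an illustration under stated assumptions: make_validator is a hypothetical name, and exact matching after lower-casing and stripping stands in for whatever normalize_string actually does with match_method.

```python
def make_validator(param_name, targets, raise_exception=True):
    """Hypothetical closure factory mirroring the behaviour described above."""
    # Assumption: matching is exact after lower-casing and stripping whitespace.
    normalized = {t.strip().lower(): t for t in targets}

    def validate(value):
        key = str(value).strip().lower()
        if key in normalized:
            return normalized[key]  # canonical spelling of the matched target
        if raise_exception:
            raise ValueError(
                f"{param_name!r} must be one of {list(targets)}; got {value!r}."
            )
        return value  # unchanged when enforcement is optional

    return validate


validate_method = make_validator("outlier_method", ["z_score", "iqr"])
print(validate_method(" IQR "))
```

The factory captures param_name and the target list once, so the returned function can be reused wherever the same parameter is accepted.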

geoprior.utils.validator.process_y_pairs(*ys, error='warn', solo_return=False, ops='check_only')[source]#

Process and validate paired arrays of ground truth (y_true) and predicted values (y_pred) for machine learning evaluation.

Parameters:
  • *ys (ArrayLike) – Variable-length sequence of array-likes containing alternating (y_true, y_pred) pairs. Must contain even number of inputs.

  • error ({'raise', 'warn', 'ignore'}, default 'warn') –

    Handling strategy for validation errors:

    • 'raise': Immediately raise ValueError.

    • 'warn': Issue UserWarning but continue processing.

    • 'ignore': Silently skip invalid pairs.

  • solo_return (bool, default False) – When processing single pair, return as individual arrays instead of length-1 lists.

  • ops ({'check_only', 'validate'}, default 'check_only') –

    Processing mode:

    • 'check_only': Verify pair lengths without modification.

    • 'validate': Clean data (remove NaNs) and validate dtypes.

Returns:

Processed pairs as (y_trues, y_preds) tuple. Return type depends on solo_return and number of valid pairs.

Return type:

Tuple[List[ArrayLike], List[ArrayLike]] or Tuple[ArrayLike, ArrayLike]

Raises:
  • ValueError

    • If input count is odd and error='raise'

    • Length mismatch in pairs when error='raise'

    • Invalid error or ops values

  • UserWarning

    • When odd input count and error='warn'

    • Length mismatches when error='warn'

Examples

Basic usage with valid pairs:

>>> from geoprior.utils.validator import process_y_pairs
>>> y_true1 = [1.2, 2.3, 3.4]
>>> y_pred1 = [1.1, 2.4, 3.3]
>>> y_true2 = [4.5, 5.6]
>>> y_pred2 = [4.4, 5.7]
>>> process_y_pairs(y_true1, y_pred1, y_true2, y_pred2)
([[1.2, 2.3, 3.4], [4.5, 5.6]], [[1.1, 2.4, 3.3], [4.4, 5.7]])

Handling mismatched pair with warnings:

>>> y_bad = [1, 2, 3]
>>> p_bad = [1, 2]
>>> process_y_pairs(y_bad, p_bad, error='warn')
UserWarning: Length mismatch in pair 0: 3 vs 2
([], [])

Full validation pipeline:

>>> import numpy as np
>>> y_clean, p_clean = process_y_pairs(
...     [1, np.nan, 3], [np.nan, 2.1, 3.2],
...     ops='validate', solo_return=True
... )
>>> y_clean
array([3.])
>>> p_clean
array([3.2])

Notes

Ensures input pairs meet requirements for downstream analysis through:

\[
\begin{aligned}
&\forall i \in \{0, 2, 4, \ldots\},\quad (y_{true}^i, y_{pred}^i) \rightarrow (\tilde{y}_{true}^i, \tilde{y}_{pred}^i)\ \text{where} \\
&\text{len}(\tilde{y}_{true}^i) = \text{len}(\tilde{y}_{pred}^i) \\
&\text{and}\ \tilde{y}_{true}^i \in \mathbb{R}^{n},\ \tilde{y}_{pred}^i \in \mathbb{R}^{n}
\end{aligned}
\tag{3}
\]
  1. Uses drop_nan_in for NaN removal and index resetting during validation

  2. Applies validate_yy for dtype consistency checks and array flattening

  3. Forward references for ArrayLike allow flexibility - accepts any array-like structure (list, numpy array, pandas Series, etc.)

  4. The type and array-handling conventions rely on the Python language reference and NumPy’s array-programming model [36, 47].
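
The pairing and length check in 'check_only' mode can be sketched with plain Python. This is a simplified illustration, not the library's code: split_pairs is a hypothetical name, and dropping a trailing unpaired input is an assumption about how an odd input count is handled.

```python
import warnings


def split_pairs(*ys, error="warn"):
    """Hypothetical sketch: pair inputs as (y_true, y_pred) and keep equal-length pairs."""
    if error not in {"raise", "warn", "ignore"}:
        raise ValueError(f"Invalid error value: {error!r}")

    def report(message):
        if error == "raise":
            raise ValueError(message)
        if error == "warn":
            warnings.warn(message, UserWarning)

    if len(ys) % 2:
        report("Expected an even number of arrays.")

    y_trues, y_preds = [], []
    # Even positions hold y_true, odd positions hold y_pred; zip drops an unpaired tail.
    for i, (y_true, y_pred) in enumerate(zip(ys[0::2], ys[1::2])):
        if len(y_true) != len(y_pred):
            report(f"Length mismatch in pair {i}: {len(y_true)} vs {len(y_pred)}")
            continue  # skip the invalid pair
        y_trues.append(y_true)
        y_preds.append(y_pred)
    return y_trues, y_preds


print(split_pairs([1.2, 2.3], [1.1, 2.4], [4.5], [4.4]))
```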

See also

drop_nan_in

Core NaN removal and index resetting function

validate_yy

Array validation and dtype consistency checker

sklearn.utils.check_consistent_length

Scikit-learn’s length validation

geoprior.utils.validator.to_dtype_str(arr, return_values=False)[source]#

Convert numeric or object dtype to string dtype.

This avoids a particular TypeError that occurs when an array is filled with np.nan and also contains string values. Converting the array to dtype str, rather than keeping it as ‘object’, avoids this error.

Parameters:
  • arr (array-like) – Array with any NumPy or pandas dtype.

  • return_values (bool, default False) – If True, returns the array values in string dtype. This might be useful when a Series with object or numeric dtype is passed.

Returns:

Array-like with dtype str. Note that if a DataFrame or Series is passed, the object dtype changes only if return_values is set to True; otherwise the same object is returned.

Return type:

array-like

geoprior.utils.validator.validate_and_adjust_ranges(**kwargs)[source]#

Validates and adjusts the provided range tuples to ensure each is composed of two numerical values and is sorted in ascending order.

This function takes multiple range specifications as keyword arguments, each expected to be a tuple of two numerical values (min, max). It validates the format and contents of each range, adjusting them if necessary to ensure that each tuple is ordered as (min, max).

Parameters:

**kwargs (dict) – Keyword arguments where each key is the name of a range (e.g., ‘lat_range’) and its corresponding value is a tuple of two numerical values representing the minimum and maximum of that range.

Returns:

A dictionary with the same keys as the input, but with each tuple value adjusted to ensure it is in the format (min, max).

Return type:

dict

Raises:

ValueError – If any provided range tuple does not contain exactly two values, contains non-numerical values, or if the min value is not less than the max value.

Examples

>>> from geoprior.utils.validator import validate_and_adjust_ranges
>>> validate_and_adjust_ranges(lat_range=(34.00, 36.00), lon_range=(-118.50, -117.00))
{'lat_range': (34.0, 36.0), 'lon_range': (-118.5, -117.0)}
>>> validate_and_adjust_ranges(time_range=(10.0, 0.01))
{'time_range': (0.01, 10.0)}
>>> validate_and_adjust_ranges(invalid_range=(1, 'a'))
ValueError: invalid_range must contain numerical values.

Notes

This function is particularly useful for preprocessing input ranges for various analyses, ensuring consistency and correctness of range specifications. It automates the adjustment of provided ranges, simplifying the setup process for further data processing or modeling tasks.
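
The validate-then-reorder step can be sketched as below. This is an illustration under stated assumptions: order_ranges is a hypothetical name, the error messages are modelled on the example above, and the treatment of equal bounds is not covered.

```python
from numbers import Real


def order_ranges(**ranges):
    """Hypothetical sketch: check each range has two numbers and return it as (min, max)."""
    ordered = {}
    for name, bounds in ranges.items():
        if not isinstance(bounds, (tuple, list)) or len(bounds) != 2:
            raise ValueError(f"{name} must contain exactly two values.")
        # bool is a subclass of int, so it is excluded explicitly.
        if not all(isinstance(b, Real) and not isinstance(b, bool) for b in bounds):
            raise ValueError(f"{name} must contain numerical values.")
        ordered[name] = (min(bounds), max(bounds))
    return ordered


print(order_ranges(time_range=(10.0, 0.01)))
```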

geoprior.utils.validator.validate_batch_size(batch_size, n_samples, min_batch_size=1, max_batch_size=None)[source]#

Validate the batch size against the number of samples.

This function checks whether the provided batch_size is appropriate given the total number of samples n_samples. It ensures that the batch size meets specified minimum and maximum limits, raising appropriate errors if any constraints are violated.

Parameters:
  • batch_size (int) – The size of each batch. This must be a positive integer, as batches must contain at least one sample. A ValueError will be raised if this value is less than the minimum allowed batch size or exceeds the total number of samples.

  • n_samples (int) – The total number of samples in the dataset. This value must be positive and greater than or equal to the batch_size. If batch_size is greater than n_samples, a ValueError is raised.

  • min_batch_size (int, optional) – The minimum allowed batch size (default is 1). This parameter defines the smallest permissible batch size. A ValueError will be raised if the batch_size is less than this value.

  • max_batch_size (int, optional) – The maximum allowed batch size (default is None, meaning no upper limit). This parameter can be used to restrict the size of the batch to a specified maximum value. If max_batch_size is provided, a ValueError will be raised if the batch_size exceeds this limit.

Returns:

The validated batch size.

Return type:

int

Raises:

ValueError – If the batch_size is less than the min_batch_size, greater than the n_samples, or exceeds the max_batch_size if specified. Additionally, if batch_size is not a positive integer, a ValueError is raised.

Notes

Let B represent the batch_size and N represent the n_samples. The validation can be expressed mathematically as:

\[
\text{If } B < \text{min\_batch\_size} \text{ or } B > N \text{ or } B > \text{max\_batch\_size}: \quad \text{raise ValueError}
\tag{4}
\]

This function is essential for managing data batching in machine learning workflows, where improper batch sizes can lead to inefficient training or runtime errors. The practical mini-batch constraint follows standard deep-learning training guidance [48].
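
The condition above translates directly into a chain of guards. The following sketch is illustrative only; check_batch_size is a hypothetical name and the messages are assumptions, not the library's wording.

```python
def check_batch_size(batch_size, n_samples, min_batch_size=1, max_batch_size=None):
    """Hypothetical sketch of the constraints B >= min, B <= N and B <= max."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError("batch_size must be a positive integer.")
    if batch_size < min_batch_size:
        raise ValueError(f"batch_size must be at least {min_batch_size}.")
    if batch_size > n_samples:
        raise ValueError(f"batch_size cannot exceed n_samples ({n_samples}).")
    if max_batch_size is not None and batch_size > max_batch_size:
        raise ValueError(f"batch_size cannot exceed max_batch_size ({max_batch_size}).")
    return batch_size


print(check_batch_size(32, 100, max_batch_size=32))
```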

Examples

>>> from geoprior.utils.validator import validate_batch_size
>>> validate_batch_size(32, 100)  # Valid case
>>> validate_batch_size(0, 100)  # Raises ValueError
>>> validate_batch_size(150, 100)  # Raises ValueError
>>> validate_batch_size(32, 100, max_batch_size=32)  # Valid case
>>> validate_batch_size(40, 100, max_batch_size=32)  # Raises ValueError
geoprior.utils.validator.validate_comparison_data(df, alignment='auto')[source]#

Validates a DataFrame to ensure it is a square matrix and that the index and column names match. Optionally aligns the index names to the column names or vice versa based on the alignment parameter.

Parameters:
  • df (pandas.DataFrame) – The DataFrame to validate.

  • alignment (str, default 'auto') – Controls how the DataFrame’s index and columns are aligned if they do not match. Options are ‘auto’, ‘index_to_columns’, and ‘columns_to_index’.

Returns:

The validated and potentially modified DataFrame.

Return type:

pandas.DataFrame

Raises:

ValueError – If the DataFrame is not square or if index and column names do not match and no suitable alignment option is specified.

Examples

>>> import pandas as pd
>>> from geoprior.utils.validator import validate_comparison_data
>>> data = pd.DataFrame({
...     'A': [1, 0.9, 0.8],
...     'B': [0.9, 1, 0.85],
...     'C': [0.8, 0.85, 1]
... }, index=['A', 'B', 'X'])
>>> print(validate_comparison_data(data, alignment='index_to_columns'))
>>> data = pd.DataFrame({
...     1: [1, 0.9, 0.8],
...     2: [0.9, 1, 0.85],
...     3: [0.8, 0.85, 1]
... }, index=[1, 2, 'X'])
>>> print(validate_comparison_data(data, alignment='auto'))
geoprior.utils.validator.validate_data_types(data, expected_type='numeric', nan_policy='omit', return_data=False, error='raise')[source]#

Checks for mixed data types in a pandas Series or DataFrame and handles according to the specified policies. This function is designed to ensure data consistency by verifying that data matches expected type criteria, offering options to manage and report any discrepancies.

Parameters:
  • data (pd.Series or pd.DataFrame) – The data to be checked. This can be a pandas Series or DataFrame.

  • expected_type ({'numeric', 'categoric', 'both'}, default 'numeric') –

    Specifies the type of data expected:

    • ’numeric’: All data should be of numeric types (int, float).

    • ’categoric’: All data should be categorical, typically strings or pandas Categorical datatype.

    • ’both’: Any mix of numeric and categorical data is considered valid.

  • nan_policy ({'raise', 'omit', 'propagate'}, default 'omit') –

    Determines how NaN values are handled:

    • ’raise’: Raises an error if NaN values are found.

    • ’warn’: Issues a warning if NaN values are found but proceeds.

    • ’propagate’: Continues execution without addressing NaNs.

  • return_data (bool, default False) – If True, returns a DataFrame or Series (depending on the input) that only includes data rows that conform to the expected_type. If False, returns None.

  • error ({'raise', 'warn'}, default 'raise') –

    Configures the error handling behavior when data types do not conform to the expected_type:

    • ’raise’: Raises a TypeError if mixed types are detected.

    • ’warn’: Emits a warning but attempts to continue by filtering non-conforming data if return_data is True.

Returns:

Depending on return_data, this function may return a filtered version of data that conforms to the expected_type or None if return_data is False.

Return type:

pd.Series or pd.DataFrame or None

Raises:
  • ValueError – If NaN values are present and nan_policy is set to ‘raise’.

  • TypeError – If data types do not conform to expected_type and error is set to ‘raise’.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.validator import validate_data_types
>>> df = pd.DataFrame({'A': [1, 2, 'a', 3.5, np.nan], 'B': ['x', 'y', 'z', None, 't']})
>>> validate_data_types(df, expected_type='numeric', nan_policy='warn',
...                  return_data=True, error='warn')
UserWarning: NaN values found in the data, but processing will continue.
UserWarning: Expected numeric types but found mixed types.
Non-numeric data will be ignored.
   A
0  1.0
1  2.0
3  3.5

Notes

The validate_data_types function is useful in data preprocessing steps, particularly when you need to ensure that data fed into a machine learning algorithm meets certain type requirements. Handling mixed data types early on can prevent issues in model training and evaluation.

geoprior.utils.validator.validate_dates(start_date, end_date, return_as_date_str=False, date_format='%Y-%m-%d')[source]#

Validates and parses start and end years/dates, with options for output formatting.

This function ensures the validity of provided start and end years or dates, checks if they fall within a reasonable range, and allows the option to return the validated years or dates in a specified string format.

Parameters:
  • start_date (int, float, or str) – The starting year or date. Can be an integer, float (converted to integer), or string in “YYYY” or “YYYY-MM-DD” format.

  • end_date (int, float, or str) – The ending year or date, with the same format options as start_date.

  • return_as_date_str (bool, optional) – If True, returns the start and end dates as strings in the specified format. Default is False, returning years as integers.

  • date_format (str, optional) – The format string for output dates if return_as_date_str is True. Default format is “%Y-%m-%d”.

Returns:

A tuple of two elements, either integers (years) or strings (formatted dates), representing the validated start and end years or dates.

Return type:

tuple

Raises:

ValueError – If the input years or dates are invalid, out of the acceptable range, or if the start year/date does not precede the end year/date.

Examples

>>> from geoprior.utils.validator import validate_dates
>>> validate_dates(1999, 2001)
(1999, 2001)
>>> validate_dates("1999/01/01", "2001/12/31", return_as_date_str=True)
('1999-01-01', '2001-12-31')
>>> validate_dates("1999", "1998")
ValueError: The start date/time must precede the end date/time.
>>> validate_dates("1899", "2001")
ValueError: Years must be within the valid range: 1900 to [current year].

Notes

The function supports flexible input formats for years and dates, including handling both slash “/” and dash “-” separators in date strings. It enforces logical and chronological order between start and end inputs and allows customization of the output format for date strings.
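
The separator handling and ordering check described above can be sketched with the standard library. This is an illustration, not the geoprior implementation: parse_date and ordered_dates are hypothetical names, and the valid-year range check is omitted.

```python
from datetime import datetime


def parse_date(value):
    """Hypothetical parser for a numeric year, 'YYYY', 'YYYY-MM-DD' or 'YYYY/MM/DD'."""
    if isinstance(value, (int, float)):
        return datetime(int(value), 1, 1)
    text = str(value).strip().replace("/", "-")  # accept both separators
    for fmt in ("%Y-%m-%d", "%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")


def ordered_dates(start, end, date_format="%Y-%m-%d"):
    """Parse both inputs, enforce start < end, and format the result."""
    start_dt, end_dt = parse_date(start), parse_date(end)
    if start_dt >= end_dt:
        raise ValueError("The start date/time must precede the end date/time.")
    return start_dt.strftime(date_format), end_dt.strftime(date_format)


print(ordered_dates("1999/01/01", "2001/12/31"))
```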

geoprior.utils.validator.validate_distribution(distribution, elements=None, kind=None, check_normalization=True)[source]#

Validates or generates distributions for given elements, ensuring the sum equals 1 if check_normalization is True.

Parameters:
  • distribution (str, tuple, list) – The distribution to be validated or generated. If ‘auto’, generates a random distribution for the specified number of elements. Can also be a tuple or list representing an explicit distribution.

  • elements (int, list of str, optional) – Defines how many elements the distribution should be generated for when ‘auto’ is used. If a list of strings is provided, its length is used to determine the number of elements.

  • kind (str, optional) – Specifies the kind of distribution. Use "probs" for probability distributions, where the sum should equal 1 and values must be non-negative.

  • check_normalization (bool, optional) – If True, ensures that the sum of the distribution equals 1. Default is True.

Returns:

A tuple representing the validated or generated distribution.

Return type:

tuple

Raises:

ValueError – If the provided distribution does not meet the specified conditions.

Examples

>>> from geoprior.utils.validator import validate_distribution
>>> validate_distribution("auto", elements=['positive', 'neutral', 'negative'])
(0.1450318690603951, 0.5660028611331361, 0.2889652698064687)
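
The values returned for "auto" are random, so they differ between runs. One common way to generate such a distribution is to draw positive numbers and divide by their total; the sketch below illustrates that idea and is not necessarily how geoprior does it. random_probs, check_probs, and the seed argument are hypothetical.

```python
import math
import random


def random_probs(n, seed=None):
    """Hypothetical generator: n non-negative values that sum to 1."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return tuple(v / total for v in raw)


def check_probs(dist, tol=1e-8):
    """Hypothetical validator for an explicit probability distribution."""
    dist = tuple(float(v) for v in dist)
    if any(v < 0 for v in dist):
        raise ValueError("Probabilities must be non-negative.")
    if not math.isclose(sum(dist), 1.0, abs_tol=tol):
        raise ValueError("Probabilities must sum to 1.")
    return dist


probs = random_probs(3, seed=0)
print(len(probs), math.isclose(sum(probs), 1.0))
```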
geoprior.utils.validator.validate_dtype_selector(dtype_selector)[source]#

Validates and categorizes the dtype_selector using regex, including handling cases where ‘only’ is specifically included.

Parameters:

dtype_selector (str) – Input dtype selector string.

Returns:

Categorized dtype_selector based on predefined patterns. If "only" is included, the returned category reflects this so it can drive specific data-type handling.

Return type:

str

Raises:

ValueError – If the input dtype_selector does not match any predefined category.

geoprior.utils.validator.validate_estimator_methods(estimator, methods, msg=None)[source]#

Validate that the specified methods exist and are callable on the given estimator.

This utility function is designed to check whether an estimator (or any object) contains the required methods, such as fit or predict, and ensures that those methods are callable. It helps prevent runtime errors by verifying the presence of expected methods.

Parameters:
  • estimator (object) – The object (instance or class) to check for the presence of the specified methods. The estimator can be an instance of a class or the class itself, and it should implement the required methods.

  • methods (list of str) – List of method names (as strings) to validate. Each method name must exist on the estimator and be callable. Examples of methods might include fit, run, or predict.

  • msg (str, optional) – Custom error message to display if any method is missing or not callable. If None, a default message is generated for each missing or invalid method based on the method name.

Raises:

AttributeError – If any method in methods is not present or not callable on the estimator.

Examples

>>> from geoprior.utils.validator import validate_estimator_methods
>>> class MyClass:
...     def fit(self):
...         pass
...     def run(self):
...         pass
>>> validate_estimator_methods(MyClass(), ['fit', 'run'])  # No error
>>> class IncompleteClass:
...     def fit(self):
...         pass
>>> validate_estimator_methods(IncompleteClass(), ['fit', 'run'])
# Raises AttributeError for missing `run` method

Notes

This helper is useful when you want to ensure that an object, such as an estimator or a model, exposes several callable methods before proceeding. If any method is missing or not callable, the function raises an AttributeError. Method-callability checks follow the Python documentation and the callable-object discussion in [43, 49].
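
The presence-and-callability check described here rests on two built-ins, getattr and callable. The sketch below shows the idea; require_methods and the Model class are hypothetical and only illustrate the pattern.

```python
def require_methods(obj, methods):
    """Hypothetical helper: raise AttributeError unless every name is a callable attribute."""
    owner = obj.__name__ if isinstance(obj, type) else type(obj).__name__
    for name in methods:
        # getattr with a default avoids a separate hasattr call.
        if not callable(getattr(obj, name, None)):
            raise AttributeError(f"{owner} has no callable method {name!r}.")


class Model:
    def fit(self):
        pass

    predict = None  # present as an attribute, but not callable


require_methods(Model(), ["fit"])  # passes silently
require_methods(Model, ["fit"])    # works on the class itself as well
```

Checking callable rather than mere presence catches attributes such as predict above, which exist but cannot be invoked.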

See also

check_has_run_method

Validate the presence of a single method, defaulting to run.

geoprior.utils.validator.validate_fit_weights(y, sample_weight=None, weighted_y=False)[source]#

Validate and compute sample weights for fitting.

Parameters:
  • y (array-like of shape (n_samples,)) – Target values.

  • sample_weight (array-like of shape (n_samples,), default None) – Sample weights. If None, then samples are equally weighted.

  • weighted_y (bool, default False) – If True, compute the weighted target values.

Returns:

  • sample_weight (array-like of shape (n_samples,)) – Validated sample weights.

  • weighted_y_values (array-like of shape (n_samples,), optional) – Weighted target values if weighted_y is True.

Raises:

ValueError – If sample_weight is not None and its length does not match the length of y, or if any value in sample_weight is negative.

Notes

This function checks the input sample weights, ensuring they are consistent with the target values y. If sample_weight is None, it returns an array of ones indicating equal weighting. Otherwise, it validates and returns the given sample weights. If weighted_y is True, it also computes and returns the weighted target values.

Examples

>>> import numpy as np
>>> from geoprior.utils.validator import validate_fit_weights
>>> y = np.array([0, 1, 1, 0, 1])
>>> validate_fit_weights(y)
array([1., 1., 1., 1., 1.])
>>> sample_weight = np.array([1, 0.5, 1, 1.5, 1])
>>> validate_fit_weights(y, sample_weight)
array([1. , 0.5, 1. , 1.5, 1. ])
>>> validate_fit_weights(y, sample_weight, weighted_y=True)
(array([1. , 0.5, 1. , 1.5, 1. ]), array([0. , 0.5, 1. , 0. , 1. ]))
>>> validate_fit_weights(y, weighted_y=True)
(array([1., 1., 1., 1., 1.]), array([0., 1., 1., 0., 1.]))
geoprior.utils.validator.validate_length_range(length_range, sorted_values=True, param_name=None)[source]#

Validates the review length range ensuring it’s a tuple with two integers where the first value is less than the second.

Parameters:
  • length_range (tuple) – A tuple containing two values that represent the minimum and maximum lengths of reviews.

  • sorted_values (bool, default True) – If True, the function expects the input length range to be sorted in ascending order and will automatically sort it if not. If False, the input length range is not expected to be sorted, and it will remain as provided.

  • param_name (str, optional) – The name of the parameter being validated. If None, the default name ‘length_range’ will be used in error messages.

Returns:

The validated length range.

Return type:

tuple

Raises:

ValueError – If the length range does not meet the requirements.

Examples

>>> from geoprior.utils.validator import validate_length_range
>>> validate_length_range((202, 25))
(25, 202)
>>> validate_length_range((202,))
ValueError: length_range must be a tuple with two elements.
geoprior.utils.validator.validate_multiclass_target(y, accept_multioutput=False, return_classes=False)[source]#

Validates that the target data is suitable for multiclass classification. Optionally accepts multi-output targets and can return the unique classes.

Parameters:
  • y (array-like) – The target data to be validated, expected to contain class labels for multiclass classification. Can be a multi-output array if accept_multioutput is set to True.

  • accept_multioutput (bool, optional) – Allows the target array to be multi-dimensional (default is False).

  • return_classes (bool, optional) – If True, returns the unique classes instead of a validation boolean.

Returns:

If return_classes is False, returns True if the target data is valid for multiclass classification, otherwise raises a ValueError. If return_classes is True, returns the unique classes in the target data.

Return type:

bool or array

Raises:

ValueError – If any of the following conditions are not met: - If accept_multioutput is False, the target data must be one-dimensional. - All elements in the target array must be non-negative integers. - The target array must contain at least two distinct classes.

Examples

>>> from geoprior.utils.validator import validate_multiclass_target
>>> validate_multiclass_target([0, 1, 2, 1, 0])
True
>>> validate_multiclass_target([0, 0, 0])
Traceback (most recent call last):
    ...
ValueError: Target array must contain at least two distinct classes.
>>> validate_multiclass_target([0.5, 1.2, 2.3])
Traceback (most recent call last):
    ...
ValueError: All elements in the target array must be non-negative integers.
>>> validate_multiclass_target([[1, 2], [2, 3]], accept_multioutput=True,
...                              return_classes=True)
array([1, 2, 3])
geoprior.utils.validator.validate_multioutput(value, extra='')[source]#

Validate the multioutput parameter value and handle special cases.

This function checks whether the provided multioutput value is one of the accepted strings (‘raw_values’, ‘uniform_average’, ‘raise’, ‘warn’). The values ‘warn’ and ‘raise’ signal that multioutput is not applicable: the function emits a warning for ‘warn’ and raises an error for ‘raise’.

Parameters:
  • value (str) – The value of the multioutput parameter to be validated. Accepted values are ‘raw_values’, ‘uniform_average’, ‘raise’, ‘warn’.

  • extra (str, optional) – Additional text to include in the warning or error message if multioutput is not applicable.

Returns:

The validated multioutput value in lowercase if it is one of the accepted values. For ‘warn’ and ‘raise’, the function issues the warning or raises the error instead of returning a value.

Return type:

str

Raises:

ValueError – If value is not one of the accepted strings, or if value is ‘raise’.

Examples

>>> from geoprior.utils.validator import validate_multioutput
>>> validate_multioutput('raw_values')
'raw_values'
>>> validate_multioutput('warn', extra=' for Dice Similarity Coefficient')
# This will warn that multioutput parameter is not applicable for Dice
# Similarity Coefficient.
>>> validate_multioutput('raise', extra=' for Gini Coefficient')
# This will raise a ValueError indicating that multioutput parameter
# is not applicable for Gini Coefficient.
>>> validate_multioutput('average')
# This will raise a ValueError indicating 'average' is an invalid value
# for multioutput parameter.

Note

The function is designed to ensure API consistency across various metrics functions by providing a standard way to handle multioutput parameter values, especially in contexts where multiple outputs are not applicable.
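
The dispatch on the four accepted values can be sketched as follows. This is an illustrative approximation only: the name sketch_validate_multioutput and the message wording are assumptions, and the real function may differ in detail.

```python
import warnings

ACCEPTED_VALUES = ("raw_values", "uniform_average", "raise", "warn")


def sketch_validate_multioutput(value, extra=""):
    """Hypothetical illustration of the documented multioutput contract."""
    value = str(value).lower()
    if value not in ACCEPTED_VALUES:
        raise ValueError(f"Invalid value {value!r} for multioutput parameter.")
    message = f"multioutput parameter is not applicable{extra}."
    if value == "warn":
        # Signal the non-applicable case without stopping execution.
        warnings.warn(message, UserWarning)
        return None
    if value == "raise":
        # Signal the non-applicable case as a hard error.
        raise ValueError(message)
    return value


print(sketch_validate_multioutput("RAW_VALUES"))  # raw_values
```

A metric that has no meaningful per-output result could call such a helper with 'warn' or 'raise' and a descriptive extra string, while metrics that do support multiple outputs would pass the user's choice through.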

geoprior.utils.validator.validate_nan_policy(nan_policy, *arrays, sample_weights=None)[source]#

Validates and applies a specified nan_policy to input arrays and optionally to sample weights. This utility is essential for pre-processing data prior to statistical analyses or model training, where appropriate handling of NaN values is critical to ensure accurate and reliable outcomes.

Parameters:
  • nan_policy ({'propagate', 'raise', 'omit'}) – Defines how to handle NaNs in the input arrays. ‘propagate’ returns the input data without changes. ‘raise’ throws an error if NaNs are detected. ‘omit’ removes rows with NaNs across all input arrays and sample weights.

  • *arrays (array-like) – Variable number of input arrays to be validated and adjusted based on the specified nan_policy.

  • sample_weights (array-like, optional) – Sample weights array to be validated and adjusted in tandem with the input arrays according to nan_policy. Defaults to None.

Returns:

  • arrays (tuple of np.ndarray) – Adjusted input arrays, with modifications applied based on nan_policy. The order of arrays in the tuple corresponds to the order of input.

  • sample_weights (np.ndarray or None) – Adjusted sample weights, modified according to nan_policy if provided. Returns None if no sample_weights were provided.

Raises:

ValueError – If nan_policy is not among the valid options (‘propagate’, ‘raise’, ‘omit’) or if NaNs are detected when nan_policy is set to ‘raise’.

Notes

Handling NaN values is a critical step in data preprocessing, especially in datasets with missing values. The choice of nan_policy can significantly impact subsequent statistical analysis or predictive modeling by either including, excluding, or signaling errors for observations with missing values. This function ensures consistent application of the chosen policy across multiple datasets, facilitating robust and error-free analyses.
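
The three policies can be illustrated with plain Python lists. This is a minimal sketch under stated assumptions: sketch_apply_nan_policy is a hypothetical name, inputs are assumed to be equal-length one-dimensional sequences, and results are returned as lists where the library returns NumPy arrays.

```python
import math


def sketch_apply_nan_policy(nan_policy, *arrays, sample_weights=None):
    """Hypothetical list-based illustration of 'propagate', 'raise', 'omit'."""
    if nan_policy not in ("propagate", "raise", "omit"):
        raise ValueError(f"Invalid nan_policy: {nan_policy!r}")
    columns = list(arrays)
    if sample_weights is not None:
        columns.append(sample_weights)

    def has_nan(row):
        # A row is affected when any aligned value is a float NaN.
        return any(isinstance(v, float) and math.isnan(v) for v in row)

    rows = list(zip(*columns))
    if nan_policy == "raise" and any(has_nan(row) for row in rows):
        raise ValueError("NaN values detected in input arrays.")
    if nan_policy == "omit":
        # Drop the same row positions from every array and from the weights.
        rows = [row for row in rows if not has_nan(row)]
    cleaned = [[row[i] for row in rows] for i in range(len(columns))]
    out_arrays = tuple(cleaned[: len(arrays)])
    out_weights = cleaned[-1] if sample_weights is not None else None
    return out_arrays, out_weights


y_true = [1.0, float("nan"), 3.0]
y_pred = [1.0, 2.0, 3.0]
arrays, sw = sketch_apply_nan_policy("omit", y_true, y_pred,
                                     sample_weights=[0.5, 0.5, 1.0])
print(arrays)  # ([1.0, 3.0], [1.0, 3.0])
print(sw)      # [0.5, 1.0]
```
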

Examples

>>> import numpy as np
>>> from geoprior.utils.validator import validate_nan_policy
>>> y_true = np.array([1, np.nan, 3])
>>> y_pred = np.array([1, 2, 3])
>>> sample_weights = np.array([0.5, 0.5, 1.0])
>>> arrays, sw = validate_nan_policy('omit', y_true, y_pred,
...                                  sample_weights=sample_weights)
>>> arrays
(array([1., 3.]), array([1., 3.]))
>>> sw
array([0.5, 1. ])
geoprior.utils.validator.validate_numeric(value, convert_to='float', allow_negative=True, min_value=None, max_value=None, check_mode='soft')[source]#

Validates if a given value is numeric. It can accept numeric strings and numpy arrays of single values. Optionally converts the value to either float or integer.

Parameters:
  • value (Any) – The value to be validated as numeric. This can be of any type but is expected to be convertible to a numeric type. Accepted types include numeric strings (e.g., "42"), single-element numpy arrays (e.g., np.array([3.14])), integers, and floats.

  • convert_to (str, optional) – Type to convert the validated numeric value to. Use "float" for floating-point output or "int" for integer output. Defaults to "float".

  • allow_negative (bool, optional) – Whether to allow negative values. If False, negative values raise a ValueError. Defaults to True.

  • min_value (float or int, optional) – The minimum value allowed. If None, no minimum value check is applied. Defaults to None.

  • max_value (float or int, optional) – The maximum value allowed. If None, no maximum value check is applied. Defaults to None.

  • check_mode (str, optional) – Validation mode. Use "soft" to accept single-element iterables and validate their single value, or "strict" to accept only non-iterable numeric inputs. Defaults to "soft".

Returns:

The validated and optionally converted numeric value. The type of the return value is determined by the convert_to parameter.

Return type:

float or int

Raises:

ValueError – If the value is not numeric or does not meet the specified criteria.

Notes

The function can coerce single-element NumPy arrays, numeric strings, and, in soft mode, single-element iterables before validating the result. The validated value is then converted to float or int and checked against the sign and range constraints. Array coercion details are documented in NumPy developers [50].
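
The order of operations described here (unwrap, coerce, check sign and range, convert) can be sketched without NumPy. The helper below is hypothetical and covers only lists, tuples, strings, and numbers; the range-error wording is an assumption.

```python
def sketch_validate_numeric(value, convert_to="float", allow_negative=True,
                            min_value=None, max_value=None, check_mode="soft"):
    """Hypothetical illustration of the documented validation steps."""
    # Step 1: in soft mode, unwrap a single-element list or tuple.
    if isinstance(value, (list, tuple)):
        if check_mode == "soft" and len(value) == 1:
            value = value[0]
        else:
            raise ValueError(f"Value '{value}' is not a numeric type.")
    # Step 2: coerce numbers and numeric strings to float.
    try:
        number = float(value)
    except (TypeError, ValueError):
        raise ValueError(f"Value '{value}' is not a numeric type.") from None
    # Step 3: apply the sign and range constraints.
    if not allow_negative and number < 0:
        raise ValueError(f"Negative values are not allowed: {number}")
    if min_value is not None and number < min_value:
        raise ValueError(f"Value {number} is below the minimum {min_value}.")
    if max_value is not None and number > max_value:
        raise ValueError(f"Value {number} is above the maximum {max_value}.")
    # Step 4: convert to the requested output type.
    return int(number) if convert_to == "int" else number


print(sketch_validate_numeric("42", convert_to="int"))  # 42
print(sketch_validate_numeric([123]))                   # 123.0
```
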

Examples

>>> import numpy as np
>>> from geoprior.utils.validator import validate_numeric
>>> validate_numeric("42", convert_to='int')
42
>>> validate_numeric(np.array([3.14]), convert_to='float')
3.14
>>> validate_numeric([123], check_mode='soft')
123.0
>>> validate_numeric([123], check_mode='strict')
Traceback (most recent call last):
    ...
ValueError: Value '[123]' is not a numeric type.
>>> validate_numeric("-123.45", allow_negative=False)
Traceback (most recent call last):
    ...
ValueError: Negative values are not allowed: -123.45

See also

numpy.array

Numpy arrays, which can be validated by this function.

geoprior.utils.validator.validate_performance_data(model_performance_data=None, nan_policy='raise', convert_integers=True, check_performance_range=True, verbose=False)[source]#

Validates and preprocesses model performance data to ensure it conforms to the necessary structure and constraints for statistical and machine learning analysis. The function accepts either a dictionary or a DataFrame as input and performs the following tasks:

  1. Converts data to a DataFrame if it is provided as a dictionary.

  2. Converts integer values to floats, ensuring compatibility with statistical processing.

  3. Manages NaN values according to the specified nan_policy.

  4. Validates that performance data falls within a valid range, ensuring values lie within [0, 1].

The function is adaptable, capable of being used directly or as a decorator, with or without configuration parameters.

Parameters:
  • model_performance_data (Union[Dict[str, List[float]], pd.DataFrame], optional) – The input model performance data to validate. Can be provided as either a dictionary (with model names as keys and performance metrics as lists) or a DataFrame where each column represents a model.

  • nan_policy (str, default 'raise') – The policy to handle NaN values: * ‘raise’: Raises a ValueError if NaNs are detected. * ‘omit’: Drops rows with NaNs. * ‘propagate’: Ignores NaNs during performance range checks.

  • convert_integers (bool, default True) – Converts integer values within the data to floats if set to True, which is useful for consistency when computing metrics.

  • check_performance_range (bool, default True) – Ensures that performance values lie within the range [0, 1]. If any value falls outside this range, an error is raised unless nan_policy is set to ‘propagate’.

  • verbose (bool, default False) – If True, displays steps of the data validation process for tracking operations and debugging.

geoprior.utils.validator.actual_validate_performance_data(data)#

Validates and processes the data according to specified policies and constraints.

Usage

This function can be used in three primary ways:

  1. As a function: provide data directly to perform validation.

>>> from geoprior.utils.validator import validate_performance_data
>>> data = {'model1': [0.85, 0.90, 0.92], 'model2': [0.80, 0.87, 0.88]}
>>> validate_performance_data(data)

  2. As a decorator: validate the first argument of a function. If used without parentheses, default values are applied.

>>> @validate_performance_data
... def process_data(validated_data):
...     print(validated_data)

  3. As a decorator with parameters: customize validation by specifying parameters.

>>> @validate_performance_data(nan_policy='omit', verbose=True)
... def process_data(validated_data):
...     print(validated_data)

Notes

The validation process includes statistical pre-checks, using custom modules to convert data and handle NaNs. For integer-to-float conversion, the convert_to_numeric function is utilized, while NaN policies are verified using is_valid_policies. The comparison framing for multiple models follows Demšar [51].
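
One common way to let a single callable act as a plain function, a bare decorator, and a parameterized decorator is to branch on the type of its first argument. The sketch below shows that pattern with a simplified range check on dictionaries; sketch_validate and its low and high parameters are hypothetical, and this is not necessarily how validate_performance_data is implemented.

```python
import functools


def sketch_validate(data=None, *, low=0.0, high=1.0):
    """Hypothetical validator usable as a function or as a decorator."""

    def check(performance):
        # Simplified stand-in for the real validation: a range check only.
        for model, scores in performance.items():
            for score in scores:
                if not low <= score <= high:
                    raise ValueError(
                        f"{model}: score {score} outside [{low}, {high}]")
        return performance

    def decorate(func):
        @functools.wraps(func)
        def wrapper(first, *args, **kwargs):
            # Validate the first positional argument before calling func.
            return func(check(first), *args, **kwargs)
        return wrapper

    if data is None:       # @sketch_validate(high=...) returns a decorator
        return decorate
    if callable(data):     # bare @sketch_validate receives the function itself
        return decorate(data)
    return check(data)     # direct call with data


@sketch_validate
def best_model(validated_data):
    return max(validated_data, key=lambda name: max(validated_data[name]))


print(best_model({"model1": [0.85, 0.92], "model2": [0.80, 0.88]}))  # model1
```
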

See also

DataFrameFormatter

Formatter for handling DataFrame structures.

MultiFrameFormatter

Formatter for handling multiple DataFrames.

geoprior.utils.validator.validate_positive_integer(value, variable_name, include_zero=False, round_float=None, msg=None)[source]#

Validates that the given value is a positive integer (or zero, when include_zero is True) and rounds float values according to the specified method.

Parameters:
  • value (int or float) – The value to validate.

  • variable_name (str) – The name of the variable for error message purposes.

  • include_zero (bool, optional) – If True, zero is considered a valid value. Default is False.

  • round_float (str, optional) – If “ceil”, rounds up float values; if “floor”, rounds down float values; if None, truncates float values to the nearest whole number towards zero.

  • msg (str, optional) – Custom error message used when the type check fails.

Returns:

The validated value converted to an integer.

Return type:

int

Raises:

ValueError – If the value is not a positive integer or zero (based on include_zero), or if the round_float parameter is improperly specified.
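
The effect of the three round_float settings is easiest to see side by side. The helper below is a hypothetical sketch of the documented behavior, not the library code; its error messages are assumptions and the msg parameter is omitted.

```python
import math


def sketch_validate_positive_integer(value, variable_name,
                                     include_zero=False, round_float=None):
    """Hypothetical illustration of the documented rounding rules."""
    if round_float not in (None, "ceil", "floor"):
        raise ValueError("round_float must be 'ceil', 'floor', or None.")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{variable_name} must be an int or float.")
    if isinstance(value, float):
        if round_float == "ceil":
            value = math.ceil(value)    # 3.2 -> 4
        elif round_float == "floor":
            value = math.floor(value)   # 3.8 -> 3
        else:
            value = math.trunc(value)   # 3.8 -> 3, truncation toward zero
    minimum = 0 if include_zero else 1
    if value < minimum:
        kind = "a non-negative" if include_zero else "a positive"
        raise ValueError(f"{variable_name} must be {kind} integer.")
    return int(value)


print(sketch_validate_positive_integer(3.2, "n_steps", round_float="ceil"))  # 4
print(sketch_validate_positive_integer(3.8, "n_steps"))                      # 3
print(sketch_validate_positive_integer(0, "n_steps", include_zero=True))     # 0
```
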

geoprior.utils.validator.validate_sample_weights(weights, y, normalize=False)[source]#

Validates that the sample weights are suitable for use in calculations.

This function checks that the sample weights are non-negative and match the length of the target array y. It raises an error if any conditions are not met. If a single number is provided as weights, it will be converted into an array with repeated values matching the length of y.

Parameters:
  • weights (array-like or number) – The sample weights to be validated. Each weight must be non-negative. A single number will be converted to an array with repeated values.

  • y (array-like) – The target array that the weights should correspond to. The length of weights must match the length of y.

  • normalize (bool, optional) – If True, weights will be normalized to sum to 1. Default is False.

Returns:

The validated sample weights as a numpy array.

Return type:

numpy.ndarray

Raises:

ValueError – If weights are not one-dimensional, if any weight is negative, or if the length of weights does not match the length of y.

Examples

>>> from geoprior.utils.validator import validate_sample_weights
>>> y = [0, 1, 2, 3]
>>> weights = [0.1, 0.2, 0.3, 0.4]
>>> validate_sample_weights(weights, y)
array([0.1, 0.2, 0.3, 0.4])
>>> weights = [-0.1, 0.2, 0.3, 0.4]
>>> validate_sample_weights(weights, y)
Traceback (most recent call last):
    ...
ValueError: Sample weights must be non-negative.
>>> weights = [0.1, 0.2, 0.3]
>>> validate_sample_weights(weights, y)
Traceback (most recent call last):
    ...
ValueError: Length of sample weights must match length of y.
geoprior.utils.validator.validate_sets(data, mode='base', allow_empty=True, element_type=None, key_type=<class 'str'>)[source]#

Validates whether the input data is a set in ‘base’ mode or a dictionary of sets in ‘deep’ mode. Optional parameters control whether empty sets or dictionaries are allowed and which element and key types are expected. Returns the data if it passes validation.

Parameters:
  • data (Union[set, Dict[str, set]]) –

    The input data to validate. It can be either a single set or a dictionary where keys are set names and values are sets.

    • base mode : A single set.

    • deep mode : A dictionary of sets.

  • mode (str, optional) – The mode in which to validate the data. Options are ‘base’ for a single set and ‘deep’ for a dictionary of sets. Default is ‘base’.

  • allow_empty (bool, optional) – Whether to allow empty sets or dictionaries. Default is True.

  • element_type (type, optional) – The expected type of elements in the set(s). If provided, the function checks whether all elements are of this type. Default is None (no type check).

  • key_type (type, optional) – The expected type of keys in the dictionary when in ‘deep’ mode. Default is str.

Returns:

The original data if it matches the specified mode and additional criteria. Raises ValueError if validation fails.

Return type:

Union[set, Dict[str, set]]

Examples

>>> from geoprior.utils.validator import validate_sets
>>> validate_sets({1, 2, 3}, mode='base')
{1, 2, 3}
>>> validate_sets({"Set1": {1, 2, 3}, "Set2": {3, 4, 5}}, mode='deep')
{'Set1': {1, 2, 3}, 'Set2': {3, 4, 5}}
>>> validate_sets({"Set1": {1, 2, 3}, "Set2": [3, 4, 5]}, mode='deep')
Traceback (most recent call last):
    ...
ValueError: Data validation failed: expected all values to be sets
>>> validate_sets(set(), mode='base', allow_empty=False)
Traceback (most recent call last):
    ...
ValueError: Data validation failed: empty set is not allowed
>>> validate_sets({"Set1": set()}, mode='deep', allow_empty=False)
Traceback (most recent call last):
    ...
ValueError: Data validation failed: empty dictionary is not allowed
>>> validate_sets({"Set1": {1, 2, 3}}, mode='deep', element_type=int)
{'Set1': {1, 2, 3}}

Notes

This function checks the type of the input data based on the specified mode. In ‘base’ mode, it ensures the data is a set. In ‘deep’ mode, it ensures the data is a dictionary where all values are sets. Additional parameters allow for checking if sets are empty, if elements are of a specific type, and if dictionary keys are of a specific type. The core type test used here is documented in Python Software Foundation [52].

See also

isinstance

Python built-in function to check an object’s type.

geoprior.utils.validator.validate_strategy(strategy=None, error='raise', ops='validate', rename_key=False, **kwargs)[source]#

Validate and construct a strategy dictionary for imputing missing data.

This function processes the input strategy to ensure it conforms to the expected format for imputing missing values in numerical and categorical features. It provides flexibility in handling different strategies and error management, making it suitable for integration with scikit-learn’s imputation tools.

Parameters:
  • strategy (Optional[Union[str, Dict[str, str]]], default None) – Defines the imputation strategy for numerical and categorical features. A string is parsed into a dictionary with keys "numeric" and "categorical", a dictionary is used directly, and None selects the default strategy.

  • error (str, default 'raise') – Error handling behavior for invalid strategy tokens. Use "raise" to raise a ValueError, "warn" to emit a warning, or "ignore" to skip invalid tokens silently.

  • ops (str, default 'validate') – Operation mode of the validator. Use "passthrough" to return the input strategy unchanged when it is already a dictionary, "check_only" to validate without modifying it, or "validate" to validate and construct the strategy dictionary from the input.

  • rename_key (bool, default False) – If True, rename aliases such as "num", "numeric", or "numerical" to "numeric", and aliases such as "cat", "categorical", or "categoric" to "categorical". Other keys remain unchanged.

  • **kwargs – Additional keyword arguments for future extensions.

Returns:

Returns the input strategy dictionary for ops='passthrough', True or False for ops='check_only', and the validated or modified strategy dictionary for ops='validate'.

Return type:

Union[Dict[str, str], bool]

Raises:

ValueError – If an invalid error or ops parameter is provided, or if the strategy tokens are invalid and error is set to ‘raise’.

Notes

The function limits numerical strategies to "median" and "mean" while categorical strategies default to "constant". It also handles key aliasing to keep the returned dictionary consistent.

Examples

>>> from geoprior.utils.validator import validate_strategy
>>> validate_strategy('mean constant')
{'numeric': 'mean', 'categorical': 'constant'}
>>> validate_strategy({'num': 'mean', 'cat': 'constant'}, rename_key=True)
{'numeric': 'mean', 'categorical': 'constant'}
>>> validate_strategy('numeric categorical', ops='check_only')
False
>>> validate_strategy('invalid_strategy', error='warn')
{'numeric': 'median', 'categorical': 'constant'}

See also

sklearn.impute.SimpleImputer

Imputation transformer for completing missing values.

geoprior.utils.validator.validate_scores(scores, true_labels=None, mode='strict', accept_multi_output=False)[source]#

Validates that the scores represent valid probability distributions and checks consistency between scores and true labels in multi-output scenarios.

Parameters:
  • scores (list or np.ndarray) – A list of np.ndarrays for multi-output probabilities, or a single np.ndarray for single-output probabilities. Each ndarray should contain probability distributions where each row sums to approximately 1 and has non-negative values.

  • true_labels (list or np.ndarray, optional) – The true labels corresponding to the scores. This parameter must be provided in multi-output scenarios to check the alignment of labels and scores. Each element or row in true_labels should correspond to the equivalent in scores.

  • mode (str, optional (default "strict")) – Validation mode for checking probability distributions. Use "strict" to require each row to sum to 1 within numerical tolerance, "soft" to require non-negative scores with totals no greater than 1, or "passthrough" to only check that each score lies in the interval [0, 1].

  • accept_multi_output (bool, default False) – Flag indicating whether scores with multiple outputs are accepted. If False and scores are provided as a list, a ValueError will be raised.

Returns:

The validated scores as a NumPy array.

Return type:

np.ndarray

Raises:

ValueError – If multi-output scores are provided and not accepted. If there is a mismatch in the number of outputs between scores and true_labels. If scores or any subset of scores do not form valid probability distributions. If there is a mismatch in format expectations between scores and true_labels in terms of multi-output handling.

Notes

The function is designed to handle both single and multi-output probability distributions. For multi-output scenarios, both scores and true_labels should be lists of np.ndarrays. This function is particularly useful in scenarios involving machine learning models where output probabilities need to be validated before further processing or metrics calculations.
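
The difference between the three validation modes comes down to which constraint is placed on each row. The sketch below applies them to nested lists; sketch_check_distribution and the tolerance tol are assumptions for illustration, and the library's own tolerance is not documented here.

```python
import math


def sketch_check_distribution(rows, mode="strict", tol=1e-8):
    """Hypothetical per-row check for 'strict', 'soft', and 'passthrough'."""
    if mode not in ("strict", "soft", "passthrough"):
        raise ValueError(f"Unknown mode: {mode!r}")
    for row in rows:
        # Every mode requires each score to lie in [0, 1].
        if any(p < 0 or p > 1 for p in row):
            raise ValueError("Each score must lie in the interval [0, 1].")
        total = sum(row)
        # 'strict' additionally requires the row to sum to 1 within tol.
        if mode == "strict" and not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"Row sums to {total}, expected 1.")
        # 'soft' only requires that the row total does not exceed 1.
        if mode == "soft" and total > 1.0 + tol:
            raise ValueError(f"Row sums to {total}, which exceeds 1.")
    return rows


scores = [[0.1, 0.9], [0.8, 0.2]]
print(sketch_check_distribution(scores))                      # [[0.1, 0.9], [0.8, 0.2]]
print(sketch_check_distribution([[0.3, 0.3]], mode="soft"))   # [[0.3, 0.3]]
```
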

Examples

>>> import numpy as np
>>> from geoprior.utils.validator import validate_scores
>>> scores_single = np.array([[0.1, 0.9], [0.8, 0.2]])
>>> print(validate_scores(scores_single))
[[0.1 0.9]
 [0.8 0.2]]
>>> scores_multi = [np.array([[0.1, 0.9]]), np.array([[0.8, 0.2]])]
>>> true_labels_multi = [np.array([1]), np.array([0])]
>>> print(validate_scores(scores_multi, true_labels_multi, accept_multi_output=True))
[array([[0.1, 0.9]]), array([[0.8, 0.2]])]
geoprior.utils.validator.validate_square_matrix(data, align=False, align_mode='auto', message='')[source]#

Validate that the input data forms a square matrix and optionally aligns its indices and columns if specified.

Parameters:
  • data (DataFrame or array-like) – The input data to validate as a square matrix.

  • align (bool, default False) – Whether to align the DataFrame’s index with its columns.

  • align_mode (str, default 'auto') – Alignment mode if indices and columns do not match. Options are ‘auto’, ‘index_to_columns’, and ‘columns_to_index’.

  • message (str, default '') – Additional message to append to the error if validation fails.

Returns:

The validated or aligned square matrix.

Return type:

DataFrame or array-like

Raises:

ValueError – If the input is not a square matrix.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from geoprior.utils.validator import validate_square_matrix
>>> validate_square_matrix(np.array([[1, 2], [3, 4]]))
array([[1, 2],
       [3, 4]])
>>> validate_square_matrix(pd.DataFrame([[1, 2], [3, 4, 5]]))
Traceback (most recent call last):
    ...
ValueError: Input must be a square matrix.

Notes

A square matrix has an equal number of rows and columns. This function checks the dimensionality of the data and, if align is set to True, aligns the index and columns.

geoprior.utils.validator.validate_weights(weights, min_value=None, max_value=None, normalize=False, allowed_dims=1)[source]#

Validates and optionally normalizes the given weights array to ensure all elements meet specified criteria and the structure is suitable for computations.

Parameters:
  • weights (array-like) – Weights to be validated. Can be a list, tuple, or numpy array.

  • min_value (float, optional) – Minimum allowable value for weights (inclusive). If None, weights are expected to be non-negative. Explicitly set to a negative value if negative weights are allowed.

  • max_value (float or None, optional) – Maximum allowable value for weights (inclusive). If None, no upper limit is enforced.

  • normalize (bool, optional) – If True, weights will be normalized to sum to 1. Default is False.

  • allowed_dims (int or tuple, optional) – Specifies the allowed dimensions of the weights array. Default is 1 (one-dimensional). If a tuple is provided, weights must match one of the dimensions specified in the tuple.

Returns:

A numpy array of the validated and optionally normalized weights.

Return type:

np.ndarray

Raises:

ValueError – If weights contain values outside the specified range, or if the format or dimensions are not suitable.

Examples

>>> from geoprior.utils.validator import validate_weights
>>> validate_weights([0.25, 0.75, 0.5], normalize=True)
array([0.16666667, 0.5       , 0.33333333])
>>> validate_weights([-0.1, 0.9], min_value=0)
Traceback (most recent call last):
    ...
ValueError: Weights must be non-negative.
>>> validate_weights([0.1, 0.2, 0.7], max_value=0.5)
Traceback (most recent call last):
    ...
ValueError: Weights must not exceed 0.5.
>>> validate_weights([1, 2, 3], allowed_dims=2)
Traceback (most recent call last):
    ...
ValueError: Weights dimensions not allowed.
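
Normalization divides each weight by the sum of all weights, so the result always sums to 1. The following sketch works through that arithmetic for one-dimensional input; sketch_validate_weights is a hypothetical helper that omits the allowed_dims check and returns a list where the library returns an array.

```python
def sketch_validate_weights(weights, min_value=None, max_value=None,
                            normalize=False):
    """Hypothetical 1-D illustration of the range checks and normalization."""
    values = [float(w) for w in weights]
    # With min_value=None the documented default is non-negative weights.
    lower = 0.0 if min_value is None else min_value
    if any(v < lower for v in values):
        raise ValueError(f"Weights must not be below {lower}.")
    if max_value is not None and any(v > max_value for v in values):
        raise ValueError(f"Weights must not exceed {max_value}.")
    if normalize:
        total = sum(values)
        if total == 0:
            raise ValueError("Cannot normalize weights that sum to zero.")
        values = [v / total for v in values]
    return values


normalized = sketch_validate_weights([0.25, 0.75, 0.5], normalize=True)
# 0.25 / 1.5, 0.75 / 1.5, 0.5 / 1.5
print([round(w, 4) for w in normalized])  # [0.1667, 0.5, 0.3333]
```
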
geoprior.utils.validator.validate_yy(y_true, y_pred, expected_type=None, *, validation_mode='strict', flatten=False)[source]#

Validates the shapes and types of actual and predicted target arrays, ensuring they are compatible for further analysis or metrics calculation.

Parameters:
  • y_true (array-like) – True target values.

  • y_pred (array-like) – Predicted target values.

  • expected_type (str, optional) – The expected sklearn type of the target (‘binary’, ‘multiclass’, etc.).

  • validation_mode (str, optional) – Validation strictness. Currently, only ‘strict’ is implemented, which requires y_true and y_pred to have the same shape and match the expected_type.

  • flatten (bool, optional) – If True, both y_true and y_pred are flattened to one-dimensional arrays.

Raises:

ValueError – If y_true and y_pred do not meet the validation criteria.

Returns:

The validated y_true and y_pred arrays, potentially flattened.

Return type:

tuple
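
The shape comparison and optional flattening can be illustrated with nested lists. This is a rough sketch under stated assumptions: the helper names are hypothetical, inputs are assumed to be regular (non-ragged) nested lists, and the expected_type check is left out.

```python
def nested_shape(values):
    """Return the shape of a regular nested list, e.g. [[1], [2]] -> (2, 1)."""
    shape = []
    while isinstance(values, (list, tuple)):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


def flatten_nested(values):
    """Flatten a nested list into a one-dimensional list."""
    if isinstance(values, (list, tuple)):
        return [leaf for item in values for leaf in flatten_nested(item)]
    return [values]


def sketch_validate_yy(y_true, y_pred, flatten=False):
    """Hypothetical 'strict' check: both targets must share one shape."""
    if nested_shape(y_true) != nested_shape(y_pred):
        raise ValueError(
            f"Shape mismatch: {nested_shape(y_true)} vs {nested_shape(y_pred)}")
    if flatten:
        return flatten_nested(y_true), flatten_nested(y_pred)
    return y_true, y_pred


print(sketch_validate_yy([[0], [1], [1]], [[0], [1], [0]], flatten=True))
# ([0, 1, 1], [0, 1, 0])
```
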