geoprior.utils.io_utils#
Input/Output utilities for managing file paths, directories, and loading serialized data within FusionLab. Provides error-checked deserialization, directory management, and archive handling (e.g., .tgz, .zip), streamlining file operations and data recovery.
Adapted for FusionLab from the original geoprior.utils.io_utils.
Functions
|
Ensures a directory exists for saving files, creating it if necessary. |
|
Deserialize and load data from a serialized file using joblib or pickle. |
|
Translate a CSV file using a dictionary created from a markdown-style parser file. |
|
Extracts a single file from a tar archive with a progress bar. |
|
Dynamically load data from a joblib-saved dictionary with flexible key access. |
|
Retrieve and parse JSON data from a URL. |
|
Downloads a .tgz file from a specified URL, saves it to a directory, and optionally extracts a specific file from the archive. |
|
Extracts a specific file from a local .tgz archive and optionally renames it. |
|
Generate a filename based on a variable name for YAML configuration. |
|
Validates an input key and substitutes it with a valid key if necessary, based on a mapping of valid keys to their possible substitutes. |
|
Check whether a given key exists in valid_keys and return a list if several keys are found. |
|
Find key in a list of default keys and select the best match. |
|
Loads a CSV file into a pandas DataFrame. |
|
Load data from a serialized file (e.g., pickle or joblib format). |
|
Moves a file to the specified path. |
|
Parses a CSV file or serializes data to a CSV file. |
|
Parse and manage JSON configuration files, either loading data from or saving data to a JSON file. |
|
Parse a markdown-style file with key-value pairs separated by a delimiter. |
|
Parse and handle YAML configuration files for loading or saving data. |
|
Generates output message for configuration file operations. |
|
Rename files from one set of names or paths to another. |
|
Removes spaces and replaces accented characters in a string. |
|
Quick save your job using 'joblib' or persistent Python pickle module. |
|
Creates a directory if it does not exist. |
|
Serialize and save a Python object to a binary file using either |
|
Serializes a Python object to a binary file using either joblib or pickle. |
|
Create a directory if it does not already exist. |
|
Store a DataFrame to HDF5, write it to CSV, or sanitize it in memory. |
|
Store a data object in Hierarchical Data Format 5 (HDF5). |
|
Export data objects to a text or JSON file with optional custom formatting. |
|
Extracts files from a ZIP archive based on various filtering criteria and saves them to a specified directory. |
Classes
|
A class for managing and organizing files within a directory structure. |
- class geoprior.utils.io_utils.FileManager(root_dir, target_dir, file_types=None, name_patterns=None, move=False, overwrite=False, create_dirs=False)[source]#
Bases:
BaseClass
A class for managing and organizing files within a directory structure. This class provides methods to filter, organize, and rename files in bulk based on file extensions and name patterns. All operations are executed via the run method to ensure proper initialization and state management.
Mathematically, if \(\mathcal{F}\) represents the set of files in the root directory and \(\phi(f)\) is a filtering function that selects files based on file type and name pattern, then the FileManager produces a subset
\[\mathcal{F}' = \{ f \in \mathcal{F} \mid \phi(f) \} \tag{1}\]
and performs operations such as moving or copying to reorganize these files into a target directory.
- Parameters:
root_dir (str) – The root directory containing the files to be managed. This directory must exist and contain the files subject to filtering.
target_dir (str) – The directory where the organized files will be placed. If necessary, this directory can be created when create_dirs is True.
file_types (list of str, optional) – A list of file extensions (e.g., ['.csv', '.json']) used to filter the files. If None, no file type filtering is applied.
name_patterns (list of str, optional) – A list of substrings (e.g., ['2023', 'report']) to filter file names. If None, all file names are included.
move (bool, optional) – If True, files are moved from the source to the target directory; otherwise, they are copied. Default is False.
overwrite (bool, optional) – If True, existing files in the target directory will be overwritten. If False, existing files are skipped. Default is False.
create_dirs (bool, optional) – If True, missing directories in the target path are created. Default is False.
- Variables:
- run(pattern, replacement)[source]#
Executes the file organization process. It filters files using the criteria provided at initialization and, if a pattern and corresponding replacement are given, performs bulk renaming.
- get_processed_files()[source]#
Returns a list of file paths that have been processed and organized into the target directory.
Examples
>>> from geoprior.utils.io_utils import FileManager
>>> manager = FileManager(
...     root_dir='data/raw',
...     target_dir='data/processed',
...     file_types=['.csv', '.json'],
...     name_patterns=['2023', 'report'],
...     move=True,
...     overwrite=True,
...     create_dirs=True
... )
>>> manager.run(pattern='old', replacement='new')
>>> processed = manager.get_processed_files()
>>> print(processed)
Notes
The public method run orchestrates the file management operations by first calling the internal method _organize_files() to filter and move or copy files from the source directory to the target directory. If renaming is needed, _rename_files() is invoked with the specified pattern and replacement. The method get_processed_files() compiles a list of all files that have been organized, based on a walk of the target directory. The directory traversal and file-operation APIs are documented in [29, 30].
See also
shutil.move: To move files between directories.
shutil.copy2: To copy files while preserving file metadata.
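The selection rule \(\phi(f)\) described for this class can be sketched with the standard library alone. This is a minimal illustration, not FileManager's implementation: the helper name `select_files` is hypothetical, and the sketch assumes a file must pass both the extension filter and the name-pattern filter, which the description above does not specify.

```python
from pathlib import Path


def select_files(root_dir, file_types=None, name_patterns=None):
    """Return files under root_dir that pass both filters (hypothetical helper).

    Assumption: a file is kept when its extension is listed in file_types AND
    its name contains at least one entry of name_patterns. None disables a filter.
    """
    selected = []
    for path in sorted(Path(root_dir).rglob("*")):
        if not path.is_file():
            continue
        type_ok = file_types is None or path.suffix in file_types
        name_ok = name_patterns is None or any(p in path.name for p in name_patterns)
        if type_ok and name_ok:
            selected.append(path)
    return selected
```

Calling `select_files('data/raw', ['.csv', '.json'], ['2023', 'report'])` would return the subset that a move or copy step could then act on.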
- __init__(root_dir, target_dir, file_types=None, name_patterns=None, move=False, overwrite=False, create_dirs=False)[source]#
Initialize the base class.
- run(pattern=None, replacement=None)[source]#
Executes file organization operations.
This method filters files based on the specified file types and name patterns, then organizes them by moving or copying into the target directory. Additionally, if a pattern is provided, file names containing that pattern are renamed by replacing the pattern with the specified replacement.
- Parameters:
- Returns:
self – The instance itself after executing operations.
- Return type:
Examples
>>> manager = FileManager(...)
>>> manager.run(pattern='old', replacement='new')
- get_processed_files()[source]#
Retrieves a list of processed files in the target directory.
- Returns:
files – A list containing the full paths of the files that have been organized into the target directory.
- Return type:
Examples
>>> manager = FileManager(...)
>>> manager.run()
>>> files = manager.get_processed_files()
>>> print(files)
- fit(pattern=None, replacement=None)#
Executes file organization operations.
This method filters files based on the specified file types and name patterns, then organizes them by moving or copying into the target directory. Additionally, if a pattern is provided, file names containing that pattern are renamed by replacing the pattern with the specified replacement.
- Parameters:
- Returns:
self – The instance itself after executing operations.
- Return type:
Examples
>>> manager = FileManager(...)
>>> manager.run(pattern='old', replacement='new')
- help(**kwargs)#
- my_params = FileManager( root_dir, target_dir, file_types=None, name_patterns=None, move=False, overwrite=False, create_dirs=False )#
- geoprior.utils.io_utils.cpath(savepath=None, dpath='_default_path_')[source]#
Ensures a directory exists for saving files, creating it if necessary.
- Parameters:
- Returns:
The absolute path to the validated or created directory.
- Return type:
Examples
>>> from geoprior.utils.io_utils import cpath
>>> default_path = cpath()
>>> print(f"Files will be saved to: {default_path}")
>>> custom_path = cpath('/path/to/save')
>>> print(f"Files will be saved to: {custom_path}")
Notes
cpath validates the directory path and, if necessary, creates the directory tree. If a problem occurs during creation, an error message is printed.
See also
pathlib.Path.mkdir: Utility for directory creation.
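The validate-or-create behavior described in the Notes can be approximated with pathlib. The sketch below is an assumption-laden illustration rather than the body of cpath: the name `ensure_dir` is hypothetical, and returning the path as a string is an assumption, since the return type is not stated here.

```python
from pathlib import Path


def ensure_dir(savepath=None, default="_default_path_"):
    """Create the directory if needed and return its absolute path (hypothetical helper)."""
    target = Path(savepath if savepath is not None else default)
    # parents=True builds the whole tree; exist_ok=True makes repeated calls harmless.
    target.mkdir(parents=True, exist_ok=True)
    return str(target.resolve())
```

Unlike the behavior described for cpath, this sketch lets any OSError propagate instead of printing a message.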
- geoprior.utils.io_utils.deserialize_data(filename, verbose=0)[source]#
Deserialize and load data from a serialized file using joblib or pickle.
The function attempts to load the serialized data from the provided file filename using joblib first. If joblib fails, it tries to load the data using pickle. An error is raised if both methods fail.
- Parameters:
- Returns:
The data loaded from the serialized file, or None if loading fails.
- Return type:
Any
- Raises:
TypeError – If filename is not a string, as file paths must be provided as strings.
FileNotFoundError – If the specified filename does not exist or cannot be located.
IOError – If both joblib and pickle fail to deserialize the data from the file.
ValueError – If the file was successfully read but yielded no data (i.e., None).
Examples
>>> from geoprior.utils.io_utils import deserialize_data
>>> data = deserialize_data('path/to/serialized_data.pkl', verbose=1)
Data loaded successfully from 'path/to/serialized_data.pkl' using joblib.
Notes
The function first attempts deserialization with joblib to leverage efficient file handling for large datasets. If joblib encounters an error, it falls back to pickle, which provides broader compatibility with Python objects but may be less optimized for large datasets. Loader semantics for the two backends are documented in [31, 32].
See also
joblib.load: Joblib’s load function for fast I/O operations on large data.
pickle.load: Pickle’s load function for deserializing Python objects.
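The joblib-first, pickle-second strategy described in the Notes might look roughly like the following. This is a sketch under stated assumptions, not the library's code: `load_with_fallback` is a hypothetical name, joblib is treated as optional so that the example still runs without it, and the verbose messages are omitted.

```python
import pickle

try:  # joblib is optional in this sketch; the documented function relies on it
    import joblib
except ImportError:
    joblib = None


def load_with_fallback(filename):
    """Try joblib.load first, then pickle.load (hypothetical helper)."""
    if not isinstance(filename, str):
        raise TypeError("filename must be a string")
    errors = []
    if joblib is not None:
        try:
            return joblib.load(filename)
        except FileNotFoundError:
            raise  # a missing file is not a deserialization problem
        except Exception as exc:
            errors.append(exc)  # remember the failure and try pickle next
    try:
        with open(filename, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        raise
    except Exception as exc:
        errors.append(exc)
    raise IOError(f"could not deserialize {filename!r}: {errors}")
```

As with any pickle-based loader, this should only be pointed at files from a trusted source.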
- geoprior.utils.io_utils.extract_tar_with_progress(tar, member, path)[source]#
Extracts a single file from a tar archive with a progress bar.
- Parameters:
tar (tarfile.TarFile) – Opened tar file object.
member (tarfile.TarInfo) – Tar member (file) to be extracted.
path (Path) – Directory path where the file will be extracted.
Examples
>>> import tarfile
>>> from pathlib import Path
>>> from geoprior.utils.io_utils import extract_tar_with_progress
>>> with tarfile.open('data.tar.gz', 'r:gz') as tar:
...     member = tar.getmember('file.csv')
...     extract_tar_with_progress(tar, member, Path('output_dir'))
Notes
Uses tqdm for progress tracking of the file extraction process.
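To make the idea of progress-tracked extraction concrete, the sketch below copies one archive member in chunks and reports the byte count after each chunk. It is an illustration only: `extract_member_in_chunks` and its `on_progress` callback are hypothetical, and a plain callback stands in for the tqdm bar so that the example needs nothing beyond the standard library.

```python
import tarfile
from pathlib import Path


def extract_member_in_chunks(tar, member, path, on_progress=None, chunk_size=64 * 1024):
    """Copy one regular-file member out of an open archive, chunk by chunk."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    # Keep only the base name so a crafted member name cannot escape `path`.
    target = path / Path(member.name).name
    source = tar.extractfile(member)
    if source is None:
        raise ValueError(f"{member.name!r} is not a regular file")
    done = 0
    with source, open(target, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            done += len(chunk)
            if on_progress is not None:
                on_progress(done, member.size)  # bytes written so far, total bytes
    return target
```

A tqdm bar could be driven from the callback by updating it with the number of bytes written since the previous call.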
- geoprior.utils.io_utils.fetch_tgz_from_url(data_url, tgz_filename, data_path=None, file_to_retrieve=None, **kwargs)[source]#
Downloads a .tgz file from a specified URL, saves it to a directory, and optionally extracts a specific file from the archive.
This function retrieves a .tgz file from the provided data_url and saves it to the specified data_path directory. If file_to_retrieve is specified, the function will extract only that file from the archive; otherwise, the entire archive will be extracted.
- Parameters:
data_url (str) – The URL to download the .tgz file from.
tgz_filename (str) – The name to assign to the downloaded .tgz file.
data_path (Union[str, Path], optional) – Directory where the downloaded file will be saved. Defaults to a ‘tgz_data’ directory in the current working directory if not specified.
file_to_retrieve (str, optional) – Specific filename to extract from the .tgz archive. If not provided, the entire archive is extracted.
**kwargs (dict) – Additional keyword arguments to pass to the extraction method.
- Returns:
Path to the extracted file if a specific file was requested; otherwise, returns None.
- Return type:
Optional[Path]
- Raises:
FileNotFoundError – If the specified file_to_retrieve is not found in the archive.
Examples
>>> from geoprior.utils.io_utils import fetch_tgz_from_url
>>> data_url = 'https://example.com/data.tar.gz'
>>> extracted_file = fetch_tgz_from_url(
...     data_url, 'data.tar.gz', data_path='data_dir', file_to_retrieve='file.csv')
>>> print(extracted_file)
Notes
Uses the tqdm progress bar for tracking download progress.
- geoprior.utils.io_utils.fetch_tgz_locally(tgz_file, filename, savefile='tgz', rename_outfile=None)[source]#
Extracts a specific file from a local .tgz archive and optionally renames it.
This function fetches a specific file filename from a local tar archive located at tgz_file, and saves it to savefile. If rename_outfile is specified, the file is renamed after extraction.
- Parameters:
tgz_file (str) – Full path to the tar file.
filename (str) – Name of the target file to extract from the archive.
savefile (str, optional) – Destination directory for the extracted file, defaulting to ‘tgz’.
rename_outfile (str, optional) – New name for the fetched file. If not provided, retains the original name.
- Returns:
Full path to the fetched and possibly renamed file.
- Return type:
- Raises:
FileNotFoundError – If the tgz_file or the specified filename is not found.
Examples
>>> from geoprior.utils.io_utils import fetch_tgz_locally
>>> fetched_file = fetch_tgz_locally(
...     'path/to/archive.tgz', 'file.csv', savefile='extracted', rename_outfile='renamed.csv')
>>> print(fetched_file)
- geoprior.utils.io_utils.dummy_csv_translator(csv_fn, pf, delimiter=':', destfile='pme.en.csv')[source]#
Translate a CSV file using a dictionary created from a markdown-style parser file.
- Parameters:
csv_fn (str) – Path to the source CSV file.
pf (str) – Path to the markdown-style file used to create the translation dictionary.
delimiter (str, default ':') – Delimiter used in the parser file to separate key-value pairs.
destfile (str, default 'pme.en.csv') – Name of the destination file for the translated CSV.
- Returns:
DataFrame – Translated CSV data as a DataFrame.
list – List of untranslated terms found in the source CSV.
Notes
This function uses parse_md to read the parser file and apply translations to the CSV content.
Missing translations are collected and returned for review.
Examples
>>> from geoprior.utils.io_utils import dummy_csv_translator
>>> df, missing = dummy_csv_translator(
...     "data.csv", "parser_file.md", delimiter=":", destfile="output.csv")
>>> print(df.head())
>>> print(missing)
- geoprior.utils.io_utils.fetch_json_data_from_url(url, todo='load')[source]#
Retrieve and parse JSON data from a URL.
- Parameters:
url (str) – Uniform Resource Locator (URL) from which JSON data is fetched.
todo ({'load', 'dump'}, default 'load') – Action to perform with JSON:
- ‘load’: Load JSON data from the URL.
- ‘dump’: Parse and prepare data from the URL for saving in a JSON file.
- Returns:
A tuple of todo action, filename (or data source), and parsed data.
- Return type:
- Raises:
urllib.error.URLError – If there is an issue accessing the URL.
Notes
The function uses json.loads to parse data directly from a URL response, supporting convenient access to web-hosted JSON content.
- geoprior.utils.io_utils.get_config_fname_from_varname(data, config_fname=None, config='.yml')[source]#
Generate a filename based on a variable name for YAML configuration.
- Parameters:
data (Any) – The data object from which the variable name will be derived to create a YAML configuration filename.
config_fname (str, optional) – Custom configuration filename. If None, the name of data will be used as the filename.
config (str, default '.yml') – The file extension/type for the configuration file. Can be ‘.yml’, ‘.json’, or ‘.csv’.
- Returns:
A suitable filename for saving the configuration data.
- Return type:
- Raises:
ValueError – If config_fname cannot be derived or an invalid file type is provided.
Notes
This function supports dynamic filename generation based on variable names, which aids in maintaining a clear configuration structure for serialized data. Files are saved with appropriate extensions based on the config type.
- geoprior.utils.io_utils.get_valid_key(input_key, default_key, substitute_key_dict=None, regex_pattern='[#&*@!,;\\s]\\s*', deep_search=True)[source]#
Validates an input key and substitutes it with a valid key if necessary, based on a mapping of valid keys to their possible substitutes. If the input key is not provided or is invalid, a default key is used.
- Parameters:
input_key (str) – The key to validate and possibly substitute.
default_key (str) – The default key to use if input_key is None, empty, or not found in the substitute mapping.
substitute_key_dict (dict, optional) – A mapping of valid keys to lists of their possible substitutes. This allows for flexible key substitution and validation.
regex_pattern (str, default '[#&*@!,;\s]\s*') – The base pattern used to split the text into columns.
deep_search (bool, default True) – If True, the key finder is not sensitive to lower/upper case or to whether numeric data is included.
- Returns:
A valid key, which is either the original input_key if valid, a substituted key if the original was found in the substitute mappings, or the default_key.
- Return type:
Notes
This function also leverages an external validation through key_checker for a deep search validation, ensuring the returned key is within the set of valid keys.
Example
>>> from geoprior.utils.io_utils import get_valid_key
>>> substitute_key_dict = {'valid_key1': ['vk1', 'key1'], 'valid_key2': ['vk2', 'key2']}
>>> get_valid_key('vk1', 'default_key', substitute_key_dict)
'valid_key1'
>>> get_valid_key('unknown_key', 'default_key', substitute_key_dict)
'KeyError...'
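The substitution step in the description (alias to canonical key, otherwise the default) can be sketched as a plain dictionary scan. The helper `resolve_key` is hypothetical and follows only the prose description: it does not reproduce the regex splitting or the key_checker validation mentioned in the Notes, and the case-insensitive comparison is an assumption.

```python
def resolve_key(input_key, default_key, substitute_key_dict=None):
    """Map an alias to its canonical key, else return default_key (hypothetical helper)."""
    substitute_key_dict = substitute_key_dict or {}
    if not input_key:
        return default_key  # None or empty input falls back to the default
    wanted = str(input_key).strip().lower()
    for valid_key, aliases in substitute_key_dict.items():
        candidates = [valid_key, *aliases]
        # Assumption: comparison ignores case, in the spirit of deep_search.
        if wanted in (str(c).lower() for c in candidates):
            return valid_key
    return default_key
```

With the mapping from the example, `resolve_key('vk1', 'default_key', substitute_key_dict)` yields `'valid_key1'`, and an unknown key yields `'default_key'`.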
- geoprior.utils.io_utils.key_checker(keys, valid_keys, regex=None, pattern=None, deep_search=False)[source]#
Check whether a given key exists in valid_keys and return a list if several keys are found.
- Parameters:
keys (str, list of str) – Key value to find in valid_keys.
valid_keys (list) – List of valid keys.
regex (re object) – Regular expression object. The default is:
>>> import re
>>> re.compile(r'[_#&*@!_,;\s-]\s*', flags=re.IGNORECASE)
pattern (str, default '[_#&*@!_,;\s-]\s*') – The base pattern used to split the text into columns.
deep_search (bool, default False) – If True, the key finder is not sensitive to lower/upper case or to whether numeric data is included.
- Returns:
keys – List of keys that exist in valid_keys.
- Return type:
str or list
Examples
>>> from geoprior.utils.io_utils import key_checker
>>> key_checker('h502', valid_keys=['h502', 'h253', 'h2601'])
'h502'
>>> key_checker('h502+h2601', valid_keys=['h502', 'h253', 'h2601'])
['h502', 'h2601']
>>> key_checker('h502 h2601', valid_keys=['h502', 'h253', 'h2601'])
['h502', 'h2601']
>>> key_checker(['h502', 'h2601'], valid_keys=['h502', 'h253', 'h2601'])
['h502', 'h2601']
>>> key_checker(['h502', 'h2602'], valid_keys=['h502', 'h253', 'h2601'])
UserWarning: key 'h2602' is missing in ['h502', 'h2602']
'h502'
>>> key_checker(['502', 'H2601'], valid_keys=['h502', 'h253', 'h2601'], deep_search=True)
['h502', 'h2601']
- geoprior.utils.io_utils.key_search(keys, default_keys, parse_keys=True, regex=None, pattern=None, deep=Ellipsis, raise_exception=Ellipsis)[source]#
Find key in a list of default keys and select the best match.
- Parameters:
keys (str or list) – A string or a list of keys. When multiple keys are passed as a single string, separate them with spaces.
default_keys (str or list) – The candidate keys to search among. Can be literal text. When literal text is passed, it is better to provide a regex so that some characters are skipped and the text is parsed properly.
parse_keys (bool, default True) – Parse the literal string using the default pattern and regex.
Added in version 0.2.7.
regex (re object) – Regular expression object. The regex specifies the kind of data to parse. The default is:
>>> import re
>>> re.compile(r'[_#&*@!_,;\s-]\s*', flags=re.IGNORECASE)
pattern (str, default '[_#&*@!_,;\s-]\s*') – The base pattern used to split the text into columns. The pattern matters especially when some characters are part of a word rather than separators. For example, for a data column named ‘DH_Azimuth’, if a pattern is not explicitly provided, the default pattern splits it into two separate words, which is far from the expected result.
raise_exception (bool, default False) – Raise an error when the key is not found.
- Returns:
List of valid keys, or None if no key is found (default).
- Return type:
list or None
Examples
>>> from geoprior.utils.io_utils import key_search
>>> key_search('h502-hh2601', default_keys=['h502', 'h253', 'HH2601'])
['h502']
>>> key_search('h502-hh2601', default_keys=['h502', 'h253', 'HH2601'], deep=True)
['h502', 'HH2601']
>>> key_search('253', default_keys=("I m here to find key among h502, h253 and HH2601"))
['h253']
>>> key_search('east', default_keys=['DH_East', 'DH_North'], deep=True)
['East']
>>> key_search('east', default_keys=['DH_East', 'DH_North'], deep=True, parse_keys=False)
['DH_East']
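The remark about ‘DH_Azimuth’ in the pattern parameter can be checked directly with the re module. The snippet below is a small illustration, not part of key_search: `split_keys` is a hypothetical helper that only applies the quoted default pattern with `re.split`.

```python
import re

# Default separator pattern quoted in the parameter list above.
DEFAULT_PATTERN = r"[_#&*@!_,;\s-]\s*"


def split_keys(text, pattern=DEFAULT_PATTERN):
    """Split a key string on the separator pattern, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


print(split_keys("h502-hh2601"))                               # ['h502', 'hh2601']
print(split_keys("DH_Azimuth"))                                # ['DH', 'Azimuth']
print(split_keys("DH_Azimuth", pattern=r"[#&*@!,;\s-]\s*"))    # ['DH_Azimuth']
```

Because the default pattern treats the underscore as a separator, a pattern without it is needed to keep names such as ‘DH_Azimuth’ in one piece.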
- geoprior.utils.io_utils.load_serialized_data(filename, verbose=0)[source]#
Load data from a serialized file (e.g., pickle or joblib format).
- Parameters:
- Returns:
Data loaded from the file, or None if deserialization fails.
- Return type:
Any
- Raises:
TypeError – If filename is not a string.
FileExistsError – If the specified file does not exist.
Examples
>>> from geoprior.utils.io_utils import load_serialized_data
>>> data = load_serialized_data('data/my_data.pkl', verbose=3)
Notes
This function attempts to load serialized data using joblib and falls back to pickle if needed. Verbose output provides feedback on the loading process and on the success or failure of each step.
See also
joblib.load: High-performance loading utility.
pickle.load: General-purpose Python serialization library.
- geoprior.utils.io_utils.load_csv(data_path, delimiter=',', **kwargs)[source]#
Loads a CSV file into a pandas DataFrame.
This function reads a comma-separated values (CSV) file into a pandas DataFrame, with the ability to specify a custom delimiter. It provides support for additional options passed to pandas.read_csv for more granular control over the data loading process.
- Parameters:
data_path (str) – The file path to the CSV file that is to be loaded. The file path must lead to a .csv file. If the file does not exist at the specified path, a FileNotFoundError is raised.
delimiter (str, optional) – The character used to separate values in the CSV file. The default is , for standard CSVs. If a different delimiter is used in the file (e.g., ;), it can be specified here.
**kwargs (dict) – Additional keyword arguments that will be passed directly to pandas.read_csv. For instance, users can specify header, index_col, dtype, and other options supported by read_csv for more customized data handling.
- Returns:
A pandas DataFrame containing the loaded data, with the specified options applied.
- Return type:
DataFrame
- Raises:
FileNotFoundError – If the specified file does not exist at the provided data_path.
ValueError – If the file specified by data_path is not a CSV file (i.e., does not have a .csv extension), a ValueError is raised to ensure correct file type.
Notes
This function simplifies the process of loading CSV data into a DataFrame, with a straightforward parameter for delimiter customization and full access to pandas.read_csv options. It is ideal for basic CSV loading tasks, as well as more complex ones requiring specific column handling, type casting, and missing value handling, which can be passed via **kwargs. CSV-oriented DataFrame loading patterns are discussed in McKinney [27].
Examples
Suppose you have a CSV file example.csv with the following content:
name,age,city
Alice,30,New York
Bob,25,Los Angeles
To load this file into a DataFrame:
>>> from geoprior.utils.io_utils import load_csv
>>> df = load_csv('example.csv')
>>> print(df)
    name  age         city
0  Alice   30     New York
1    Bob   25  Los Angeles
If the file uses a semicolon (;) as the delimiter:
>>> df = load_csv('example.csv', delimiter=';')
Additionally, you can pass custom read_csv parameters through **kwargs, such as specifying a column as the index:
>>> df = load_csv('example.csv', index_col='name')
>>> print(df)
       age         city
name
Alice   30     New York
Bob     25  Los Angeles
See also
pandas.read_csv: Full documentation for loading CSV files into a DataFrame with detailed parameter options.
- geoprior.utils.io_utils.move_cfile(cfile, savepath=None, **ckws)[source]#
Moves a file to the specified path. If moving fails, copies and deletes the original.
- Parameters:
- Returns:
The new file path and a confirmation message.
- Return type:
Tuple[str,str]
Examples
>>> from geoprior.utils.io_utils import move_cfile
>>> new_path, msg = move_cfile('myfile.txt', 'new_directory')
>>> print(new_path, msg)
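The move-then-copy fallback in the summary line could be written roughly as follows. This is a sketch, not the source of move_cfile: `move_with_fallback` is a hypothetical name, it returns only the new path rather than a path and message pair, and the creation of the destination directory is an assumption.

```python
import os
import shutil


def move_with_fallback(cfile, savepath):
    """Move cfile into savepath; if the move fails, copy it and delete the original."""
    os.makedirs(savepath, exist_ok=True)  # assumption: missing directories are created
    destination = os.path.join(savepath, os.path.basename(cfile))
    try:
        shutil.move(cfile, destination)
    except OSError:  # shutil.Error is a subclass of OSError
        shutil.copy2(cfile, destination)  # copy2 preserves file metadata
        os.remove(cfile)
    return destination
```

Note that `shutil.move` already copies and removes the source itself when a plain rename is not possible, for example across filesystems, so the explicit fallback mainly covers other failure modes.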
- geoprior.utils.io_utils.parse_csv(csv_fn=None, data=None, todo='reader', fieldnames=None, savepath=None, header=False, verbose=0, **csvkws)[source]#
Parses a CSV file or serializes data to a CSV file.
This function allows loading (reading) from or dumping (writing) to a CSV file. It supports standard CSV and dictionary-based CSV formats.
- Parameters:
csv_fn (str, optional) – The CSV filename for reading or writing. For writing operations, if data is provided and todo is set to ‘writer’ or ‘dictwriter’, this specifies the output CSV filename.
data (list, optional) – Data to write in the form of a list of lists or dictionaries.
todo (str, default 'reader') – Specifies the operation type:
- ‘reader’ or ‘dictreader’: Reads data from a CSV file.
- ‘writer’ or ‘dictwriter’: Writes data to a CSV file.
fieldnames (list of str, optional) – List of keys for dictionary-based writing to specify the field order.
savepath (str, optional) – Directory to save the CSV file when writing. Defaults to ‘_savecsv_’ if not provided and the path does not exist.
header (bool, default False) – If True, includes headers when writing with DictWriter.
verbose (int, default 0) – Controls the verbosity level for output messages.
**csvkws (dict, optional) – Additional arguments passed to csv.writer or csv.DictWriter.
- Returns:
Parsed data from the CSV file, as a list of lists or a list of dictionaries, based on the operation. Returns None when writing.
- Return type:
Union[List[Dict],List[List[str]],None]
Notes
For writing data, the method uses either csv.writer for regular CSV or csv.DictWriter for dictionary-based CSV depending on the value of todo.
Examples
>>> from geoprior.utils.io_utils import parse_csv
>>> data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
>>> parse_csv(csv_fn='output.csv', data=data, todo='dictwriter', fieldnames=['name', 'age'])
>>> loaded_data = parse_csv(csv_fn='output.csv', todo='dictreader', fieldnames=['name', 'age'])
>>> print(loaded_data)
[{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]
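For readers who want to see the dictionary-based mode in terms of the standard csv module named in the Notes, here is a minimal sketch. The two helper names are hypothetical and the code does not claim to mirror parse_csv internally. One point worth knowing is that `csv.DictReader` returns every value as a string.

```python
import csv


def write_dict_rows(csv_fn, rows, fieldnames, header=False):
    """Write a list of dicts with csv.DictWriter (hypothetical helper)."""
    with open(csv_fn, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if header:
            writer.writeheader()
        writer.writerows(rows)


def read_dict_rows(csv_fn, fieldnames=None):
    """Read rows back as dicts; with fieldnames=None the first row is the header."""
    with open(csv_fn, newline="", encoding="utf-8") as fh:
        return [dict(row) for row in csv.DictReader(fh, fieldnames=fieldnames)]
```

When no header row was written, the same fieldnames must be supplied on reading, and numeric columns need an explicit conversion such as `int(row['age'])`.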
- geoprior.utils.io_utils.parse_json(json_fn=None, data=None, todo='load', savepath=None, verbose=0, **jsonkws)[source]#
Parse and manage JSON configuration files, either loading data from or saving data to a JSON file.
- Parameters:
json_fn (str, optional) – JSON filename or URL. If data is provided and todo is ‘dump’, json_fn will be used as the output filename. If todo is ‘load’, json_fn is the input filename or URL.
data (Any, optional) – Data in Python object format to serialize and save if todo is ‘dump’.
todo ({'load', 'loads', 'dump', 'dumps'}, default 'load') – Action to perform with JSON:
- ‘load’: Load data from a JSON file.
- ‘loads’: Parse a JSON string.
- ‘dump’: Serialize data to a JSON file.
- ‘dumps’: Serialize data to a JSON string.
savepath (str, optional) – Path where the JSON file will be saved if todo is ‘dump’. If savepath does not exist, it will save to the default path ‘_savejson_’.
verbose (int, default 0) – Controls verbosity of output messages.
**jsonkws (dict) – Additional keyword arguments passed to json.dump or json.dumps when saving data.
- Returns:
The data loaded from the JSON file or URL if todo is ‘load’, or data after saving if todo is ‘dump’.
- Return type:
Any
- Raises:
json.JSONDecodeError – If there is an issue with reading or writing the JSON file.
TypeError – If the JSON file or data cannot be processed.
Notes
This function uses json.load, json.loads, json.dump, and json.dumps for efficient handling of JSON files and strings.
See also
fetch_json_data_from_url: Fetches JSON data from a given URL.
get_config_fname_from_varname: Utility for generating JSON configuration filenames based on variable names.
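The four todo actions map naturally onto the four json functions listed in the Notes. The sketch below shows one way such a dispatch could look. It is not parse_json itself: `json_action` is a hypothetical name, URL handling and savepath logic are left out, and passing the JSON string through `data` for ‘loads’ is an assumption.

```python
import json


def json_action(todo="load", json_fn=None, data=None, **jsonkws):
    """Dispatch on todo to json.load / loads / dump / dumps (hypothetical helper)."""
    if todo == "load":
        with open(json_fn, encoding="utf-8") as fh:
            return json.load(fh)
    if todo == "loads":
        return json.loads(data)  # assumption: the JSON string arrives via `data`
    if todo == "dump":
        with open(json_fn, "w", encoding="utf-8") as fh:
            json.dump(data, fh, **jsonkws)
        return data  # mirrors "data after saving" in the Returns section
    if todo == "dumps":
        return json.dumps(data, **jsonkws)
    raise ValueError(f"unsupported todo: {todo!r}")
```

Extra keyword arguments such as `indent=2` are forwarded only to the writing functions, in line with the description of **jsonkws.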
- geoprior.utils.io_utils.parse_md(pf, delimiter=':')[source]#
Parse a markdown-style file with key-value pairs separated by a delimiter.
- Parameters:
- Yields:
Tuple[str, str] – A tuple containing the key and processed value.
- Raises:
IOError – If the provided path does not lead to a valid file.
Notes
This function yields key-value pairs by reading the file line-by-line.
It applies sanitize_unicode_string to keys to ensure data consistency.
Examples
>>> from geoprior.utils.io_utils import parse_md
>>> list(parse_md('parser_file.md', delimiter=':'))
[('key1', 'Value1'), ('key2', 'Value2')]
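The line-by-line yielding described in the Notes can be illustrated with a small generator. This sketch is an assumption-based stand-in: `iter_key_values` is a hypothetical name, lines without the delimiter are simply skipped, and the sanitize_unicode_string step applied to keys is left out.

```python
import os


def iter_key_values(pf, delimiter=":"):
    """Yield (key, value) pairs from a 'key<delimiter>value' text file."""
    # Because this is a generator, the check runs on first iteration, not at call time.
    if not os.path.isfile(pf):
        raise IOError(f"not a valid file: {pf!r}")
    with open(pf, encoding="utf-8") as fh:
        for line in fh:
            if delimiter not in line:
                continue  # skip blank lines and lines without a pair
            # Split on the first delimiter only, so values may contain it.
            key, value = line.split(delimiter, 1)
            yield key.strip(), value.strip()
```

Wrapping the call in `dict(...)` gives the kind of lookup table that a translation step could use.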
- geoprior.utils.io_utils.parse_yaml(yml_fn=None, data=None, todo='load', savepath=None, verbose=0, **ymlkws)[source]#
Parse and handle YAML configuration files for loading or saving data.
- Parameters:
yml_fn (str, optional) – The YAML filename. If data is provided and todo is set to ‘dump’, yml_fn will be used as the output filename. If todo is set to ‘load’, yml_fn is the input filename to read from.
data (Any, optional) – Data in a Python object format that will be serialized and saved as a YAML file if todo is ‘dump’.
todo ({'load', 'dump'}, default 'load') – Action to perform with the YAML file:
- ‘load’: Load data from the YAML file specified by yml_fn.
- ‘dump’: Serialize data into a YAML format and save to yml_fn.
savepath (str, optional) – Path where the YAML file will be saved if todo is ‘dump’. If not provided, a default path will be used. The function will ensure that the path exists.
verbose (int, default 0) – Controls verbosity of output messages.
**ymlkws (dict) – Additional keyword arguments passed to yaml.dump when saving data.
- Returns:
The data loaded from the YAML file if todo is ‘load’, or data after saving if todo is ‘dump’.
- Return type:
Any
- Raises:
yaml.YAMLError – If there is an issue with reading or writing the YAML file.
Notes
This function uses safe_load and safe_dump methods from PyYAML for secure handling of YAML files.
See also
get_config_fname_from_varname: Utility for generating YAML configuration filenames based on variable names.
- geoprior.utils.io_utils.print_cmsg(cfile, todo='load', config='YAML')[source]#
Generates output message for configuration file operations.
- Parameters:
- Returns:
Confirmation message for the configuration operation.
- Return type:
Examples
>>> from geoprior.utils.io_utils import print_cmsg
>>> msg = print_cmsg('config.yml', 'dump')
>>> print(msg)
--> YAML 'config.yml' data was successfully saved.
- geoprior.utils.io_utils.rename_files(src_files, dst_files, basename=None, extension=None, how='py', prefix=True, keep_copy=True, trailer='_', sortby=None, **kws)[source]#
Rename files from one set of names or paths to another.
- Parameters:
src_files (str or list of str) – Source files or a directory containing files to rename.
dst_files (str or list of str) – Destination file names or destination directory.
basename (str or None, optional) – Base name used when generating numbered destination files.
extension (str or None, optional) – Optional extension filter when src_files is a directory.
how (str, optional) – Numbering convention used when destination names are generated.
prefix (bool, optional) – Whether generated numbering is appended after the basename.
keep_copy (bool, optional) – Whether to keep copies of the original files.
trailer (str, optional) – Separator inserted between the basename and the generated counter.
sortby (regex, callable, or None, optional) – Optional sort key used when collecting files from a directory.
**kws (dict) – Additional keyword arguments forwarded to os.rename.
- Return type:
None
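The way basename, trailer, and a generated counter could combine into numbered destination names can be sketched as follows. This is an illustration only, not the library's implementation: the helpers numbered_names and rename_with_copy, the zero-padding width, and the reading of prefix=True as "basename first, counter after" are all assumptions made for the example.

```python
import os
import shutil
import tempfile


def numbered_names(src_files, basename, trailer="_", prefix=True, width=3):
    """Build one destination name per source file (hypothetical helper)."""
    names = []
    for i, src in enumerate(src_files, start=1):
        ext = os.path.splitext(src)[1]
        counter = str(i).zfill(width)
        # Assumption: prefix=True puts the basename before the counter.
        if prefix:
            stem = f"{basename}{trailer}{counter}"
        else:
            stem = f"{counter}{trailer}{basename}"
        names.append(stem + ext)
    return names


def rename_with_copy(src_files, dst_files, keep_copy=True):
    """Copy each source to its destination, or move it if keep_copy is False."""
    for src, dst in zip(src_files, dst_files):
        if keep_copy:
            shutil.copy2(src, dst)
        else:
            os.replace(src, dst)


# Demonstration in a throwaway directory.
workdir = tempfile.mkdtemp()
sources = []
for name in ("b.txt", "a.txt"):
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(name)
    sources.append(path)

sources.sort()
targets = [os.path.join(workdir, n) for n in numbered_names(sources, "run")]
rename_with_copy(sources, targets, keep_copy=True)
result = sorted(os.listdir(workdir))
print(result)
```

With keep_copy=True the originals stay next to the numbered copies; passing keep_copy=False in this sketch moves the files instead.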
- geoprior.utils.io_utils.sanitize_unicode_string(str_)[source]#
Removes spaces and replaces accented characters in a string.
- Parameters:
- Returns:
The sanitized string with removed spaces and replaced accents.
- Return type:
Examples
>>> from geoprior.utils.io_utils import sanitize_unicode_string
>>> sentence = ('Nos clients sont extrêmement satisfaits '
...             'de la qualité du service fourni. En outre Nos clients '
...             'rachètent frequemment nos "services".')
>>> sanitize_unicode_string(sentence)
'nosclientssontextrmementsatisfaitsdelaqualitduservicefournienoutrenosclientsrachtentfrequemmentnosservices'
>>> sanitize_unicode_string("Élève à l'école")
'elevealecole'
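One common way to replace accented characters and drop spaces relies on Unicode normalization from the standard library. The sketch below shows that technique; the helper name strip_accents_and_spaces is hypothetical, and the library function may be implemented differently.

```python
import re
import unicodedata


def strip_accents_and_spaces(text):
    """Lowercase, transliterate accents to ASCII, drop non-alphanumerics."""
    # NFKD splits a character such as 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors='ignore' discards the combining marks.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove spaces, punctuation, and anything else outside a-z and 0-9.
    return re.sub(r"[^a-z0-9]", "", ascii_text.lower())


cleaned = strip_accents_and_spaces("Élève à l'école")
print(cleaned)
```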
- geoprior.utils.io_utils.save_job(job, savefile, *, protocol=None, append_versions=True, append_date=True, fix_imports=True, buffer_callback=None, **job_kws)[source]#
Quick save your job using ‘joblib’ or persistent Python pickle module.
- Parameters:
job (Any) – Object to save, preferably a model or a dict of models.
savefile (str or path-like object) – Name of the file in which to store the model. The file argument must have a write() method that accepts a single bytes argument. It can thus be a file object opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.
append_versions (bool, default True) – Append the version of the joblib or pickle module, followed by the scikit-learn, NumPy, and pandas versions, to the filename. This records which versions were in use when the file was saved, which helps when loading it after the system or the modules have been upgraded, particularly for data that has been stored for a long time.
append_date (bool, default True) – Append the current date to the filename.
protocol (int, optional) – The optional protocol argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2, 3, 4 and 5. The default protocol is 4. It was introduced in Python 3.4 and is incompatible with previous versions.
Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.
fix_imports (bool, default True) – If fix_imports is True and protocol is less than 3, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.
buffer_callback (callable, optional) – If buffer_callback is None (the default), buffer views are serialized into file as part of the pickle stream.
If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream.
It is an error if buffer_callback is not None and protocol is None or smaller than 5.
**job_kws (dict) – Additional keyword arguments passed to joblib.dump().
- Returns:
The final filename where the job was saved.
- Return type:
Notes
This function appends system-specific metadata like versions and date to the filename, which can aid in tracking compatibility over time.
Examples
>>> from geoprior.utils.io_utils import save_job
>>> model = {"key": "value"}  # Replace with actual model object
>>> savefile = save_job(model, "my_model", append_date=True, append_versions=True)
>>> print(savefile)
'my_model.20240101.sklearn_v1.0.numpy_v1.21.joblib'
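The idea of tagging a filename with the date and module versions can be sketched with the standard library alone. The helper stamped_name and the exact tag layout below are assumptions made for illustration; the layout merely imitates the example output above, and the sketch records the Python version rather than third-party package versions so that it runs without extra dependencies.

```python
import datetime
import os
import pickle
import sys
import tempfile


def stamped_name(savefile, versions=None, append_date=True, ext=".pkl", today=None):
    """Return savefile plus optional date and version tags (hypothetical layout)."""
    parts = [savefile]
    if append_date:
        today = today or datetime.date.today()
        parts.append(today.strftime("%Y%m%d"))
    # One tag per module, e.g. 'numpy_v1.21'.
    for module, version in (versions or {}).items():
        parts.append(f"{module}_v{version}")
    return ".".join(parts) + ext


py_version = f"{sys.version_info.major}.{sys.version_info.minor}"
name = stamped_name(
    "my_model", {"python": py_version}, today=datetime.date(2024, 1, 1)
)

# Save with the highest available pickle protocol, then read the file back.
path = os.path.join(tempfile.mkdtemp(), name)
with open(path, "wb") as fh:
    pickle.dump({"key": "value"}, fh, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as fh:
    restored = pickle.load(fh)
print(name)
```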
- geoprior.utils.io_utils.save_path(nameOfPath)[source]#
Creates a directory if it does not exist.
- Parameters:
nameOfPath (str) – Name or path of the directory to create.
- Returns:
The path of the created directory. If it exists, returns the existing path.
- Return type:
Examples
>>> from geoprior.utils.io_utils import save_path
>>> save_path("test_directory")
'path/to/test_directory'
- geoprior.utils.io_utils.serialize_data(data, filename=None, savepath=None, to=None, force=True, compress=None, pickle_protocol=5, verbose=0)[source]#
Serialize and save a Python object to a binary file using either joblib or pickle. The function handles file naming, overwrite behavior, and compression options.
The final file path is computed as:
\[\text{filepath} = \text{savepath} \oplus \text{filename} \tag{2}\]
where \(\oplus\) denotes string concatenation.
- Parameters:
data (Any) – The Python object to serialize. The object must be compatible with either joblib.dump or pickle.dump.
filename (str, optional) – The target filename for the serialized data. If None, a filename is generated using the current timestamp, e.g., "__mydumpedfile_20230315_123045.pkl".
savepath (str, optional) – The directory in which to save the file. If not specified, the current working directory (os.getcwd()) is used. The directory is created if it does not exist.
to (str, optional) – The serialization method to use. Acceptable values are 'joblib' and 'pickle'. If None, the default is 'joblib'.
force (bool, default True) – If True, any existing file with the same name is overwritten. If False, a timestamp is appended to the filename to ensure uniqueness.
compress (int or str, optional) – Compression level or method for joblib.dump. If None, no compression is applied.
pickle_protocol (int, default pickle.HIGHEST_PROTOCOL) – The pickle protocol to use when serializing with pickle.dump.
verbose (int, default 0) – Controls the verbosity of output messages. Higher values produce more detailed logging during the serialization process.
- Returns:
The full path to the saved serialized file.
- Return type:
Examples
>>> from geoprior.utils.io_utils import serialize_data
>>> import numpy as np
>>> data = {"a": np.arange(10), "b": np.random.rand(10)}
>>> filepath = serialize_data(
...     data, filename="mydata.pkl", savepath="output",
...     to="pickle", force=False, verbose=1
... )
>>> print(filepath)
/current/working/directory/output/mydata_<timestamp>.pkl
Notes
The function first constructs the file path from savepath and filename. If a file already exists and force is False, a timestamp is appended to ensure uniqueness. Then, depending on the value of to, the function attempts to serialize the data using either joblib.dump (with optional compression via the compress parameter) or pickle.dump (using the specified pickle_protocol). If an error occurs during serialization, an IOError is raised.
See also
joblib.dump – Serialize objects to disk using Joblib.
pickle.dump – Serialize objects to disk using Pickle.
os.getcwd – Retrieve the current working directory.
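The path construction and the force=False behavior described in the Notes can be sketched with pickle only. The helper dump_unique is hypothetical, and joining the directory and filename with os.path.join is an assumption about how the concatenation is done; the joblib branch is left out so that the example has no third-party dependency.

```python
import os
import pickle
import tempfile
import time


def dump_unique(data, filename, savepath=None, force=True):
    """Pickle data to savepath/filename, timestamping the name on conflict."""
    savepath = savepath or os.getcwd()
    os.makedirs(savepath, exist_ok=True)  # create the directory if missing
    filepath = os.path.join(savepath, filename)
    if os.path.exists(filepath) and not force:
        # Keep the existing file and write under a timestamped name instead.
        stem, ext = os.path.splitext(filename)
        stamp = time.strftime("%Y%m%d_%H%M%S")
        filepath = os.path.join(savepath, f"{stem}_{stamp}{ext}")
    try:
        with open(filepath, "wb") as fh:
            pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    except (OSError, pickle.PicklingError) as exc:
        raise IOError(f"Could not serialize to {filepath!r}") from exc
    return filepath


outdir = os.path.join(tempfile.mkdtemp(), "output")
first = dump_unique({"a": 1}, "mydata.pkl", savepath=outdir)
second = dump_unique({"a": 2}, "mydata.pkl", savepath=outdir, force=False)
print(os.path.basename(first), os.path.basename(second))
```

The second call leaves mydata.pkl untouched and writes to a name of the form mydata_<timestamp>.pkl, matching the shape of the example output above.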
- geoprior.utils.io_utils.serialize_data_in(data, filename=None, force=True, savepath=None, verbose=0)[source]#
Serializes a Python object to a binary file using either joblib or pickle.
This function attempts to serialize the input data using the joblib.dump method. If this attempt fails, it falls back to using pickle.dump. The final file path is constructed by concatenating the directory specified by savepath (or the current working directory if savepath is None) with the given filename. Mathematically, the file path is given by:
\[\text{filepath} = \text{savepath} \oplus \text{filename} \tag{3}\]
where \(\oplus\) denotes string concatenation.
- Parameters:
data (Any) – The Python object to serialize. It must be compatible with either joblib or pickle serialization.
filename (str, optional) – The target filename for the serialized data. If None, a filename is generated using the current timestamp formatted as "%Y%m%d%H%M%S" (e.g., "serialized_20230315123045.pkl").
force (bool, default True) – Determines whether to overwrite an existing file with the same filename. If False, a timestamp is appended to the filename to ensure uniqueness.
savepath (str, optional) – The directory in which to save the serialized file. If not specified, the file is saved to the current working directory (os.getcwd()).
verbose (int, default 0) – Controls the verbosity of output messages. Higher values produce more detailed logging during the serialization process.
- Returns:
The complete file path to which the data has been serialized.
- Return type:
Examples
>>> from geoprior.utils.io_utils import serialize_data_in
>>> data = {"a": 1, "b": 2}
>>> filepath = serialize_data_in(data, filename='data.pkl',
...                              force=True, verbose=1)
>>> print(filepath)
/path/to/current/directory/data.pkl
Notes
The function first tries to serialize the input data using joblib.dump. In case of any exception during this attempt, it falls back to using pickle.dump. This dual approach improves robustness in diverse runtime environments where one serialization method might be unsupported or encounter issues with the given data type.
See also
joblib.dump – Serialize objects to disk using Joblib.
pickle.dump – Serialize objects to disk using Pickle.
os.getcwd – Retrieve the current working directory.
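The try-then-fall-back pattern from the Notes can be sketched in a backend-agnostic way. In the documented function the primary serializer would be joblib.dump; here a deliberately failing stand-in (broken_dump, hypothetical) takes its place so the example runs without joblib, and dump_with_fallback is likewise a name invented for this sketch.

```python
import os
import pickle
import tempfile


def dump_with_fallback(data, filepath, primary, fallback=pickle.dump):
    """Try primary(data, fileobj); on any exception retry with fallback."""
    try:
        with open(filepath, "wb") as fh:
            primary(data, fh)
        return "primary"
    except Exception:
        # Reopening in 'wb' mode truncates partial output of the failed attempt.
        with open(filepath, "wb") as fh:
            fallback(data, fh)
        return "fallback"


def broken_dump(data, fileobj):
    """Stand-in for a primary serializer that fails."""
    raise RuntimeError("primary serializer unavailable")


path = os.path.join(tempfile.mkdtemp(), "data.pkl")
used = dump_with_fallback({"a": 1, "b": 2}, path, primary=broken_dump)
with open(path, "rb") as fh:
    restored = pickle.load(fh)
print(used, restored)
```

Returning which backend was used is a design choice of the sketch: a caller that later reloads the file may need to know which loader to apply.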
- geoprior.utils.io_utils.spath(name_of_path)[source]#
Create a directory if it does not already exist.
- Parameters:
name_of_path (str) – Path-like object to create if it doesn’t exist.
- Returns:
The absolute path to the created or existing directory.
- Return type:
Examples
>>> from geoprior.utils.io_utils import spath
>>> path = spath('data/saved_models')
>>> print(f"Directory available at: {path}")
Notes
spath is useful for quickly ensuring that a specific directory is available for storing files. It provides feedback if the directory already exists.
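The behavior described here, creating the directory when needed and reporting when it already exists, corresponds to a short standard-library idiom. The helper ensure_dir and the wording of its message are assumptions for illustration, not the library's code.

```python
import os
import tempfile


def ensure_dir(name_of_path):
    """Create the directory if needed and return its absolute path."""
    existed = os.path.isdir(name_of_path)
    # exist_ok=True makes the call safe when the directory is already there.
    os.makedirs(name_of_path, exist_ok=True)
    if existed:
        print(f"Directory already exists: {name_of_path}")
    return os.path.abspath(name_of_path)


base = tempfile.mkdtemp()
target = os.path.join(base, "data", "saved_models")
first = ensure_dir(target)   # creates data/saved_models under base
second = ensure_dir(target)  # prints the "already exists" message
```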
- geoprior.utils.io_utils.store_or_write_hdf5(df, key=None, mode='a', kind=None, path_or_buf=None, encoding='utf8', csv_sep=',', index=Ellipsis, columns=None, sanitize_columns=False, func=None, args=(), applyto=None, **func_kwds)[source]#
Store a DataFrame to HDF5, write it to CSV, or sanitize it in memory.
- Parameters:
df (pandas.DataFrame or array-like) – Input data to store, export, or sanitize.
key (str or None, optional) – Group key used when storing to HDF5.
mode ({'a', 'w', 'r+'}, optional) – File mode used when opening an HDF5 store.
kind ({'store', 'write', None}, optional) – Operation to perform. Use 'store' for HDF5 output, 'write' for CSV export, or None to return a sanitized DataFrame.
path_or_buf (str, path-like, pandas.HDFStore, file-like, or None, optional) – Destination path, buffer, or open HDF5 store.
encoding (str, optional) – Output encoding used for CSV export.
csv_sep (str, optional) – Field separator used for CSV export.
index (bool, optional) – Whether to write the index when exporting to CSV.
columns (list of str or None, optional) – Column names used when constructing a DataFrame from an array.
sanitize_columns (bool, optional) – Whether to sanitize column names with the built-in regex helper.
func (callable or None, optional) – Optional custom sanitizing function applied to selected columns.
args (tuple, optional) – Positional arguments forwarded to func.
applyto (str or list of str or None, optional) – Column or columns to which func should be applied.
**func_kwds (dict) – Keyword arguments forwarded to func.
- Returns:
Returns None when kind is 'store' or 'write'. Otherwise returns the resulting DataFrame.
- Return type:
- geoprior.utils.io_utils.to_hdf5(data, fn, objname=None, close=True, **hdf5_kws)[source]#
Store a data object in Hierarchical Data Format 5 (HDF5).
This function serializes the input data into an HDF5 file. It supports both pandas DataFrames and NumPy arrays. If data is a DataFrame, it uses pd.HDFStore (which requires the pytables package) to store the data. If data is a NumPy array, it uses h5py.File to create a dataset.
The file path is constructed by concatenating the specified savepath (or the current working directory if savepath is not provided) with the provided filename (fn). The function automatically appends the appropriate file extension: .h5 for DataFrames and .hdf5 for arrays.
\[\text{filepath} = \text{savepath} \oplus \text{filename} \oplus \text{extension} \tag{4}\]
where \(\oplus\) denotes string concatenation.
- Parameters:
data (Any) – The data object to be stored. Must be either a NumPy array or a pandas DataFrame.
fn (str) – The file path (without extension) where the HDF5 file will be saved.
objname (str, optional) – The name under which to store the data within the HDF5 file. Defaults to 'data' if not provided.
close (bool, default True) – If True, the file is closed after writing. If False, the file remains open for additional modifications.
**hdf5_kws (dict, optional) – Additional keyword arguments to pass to the HDFStore constructor (for DataFrames) or to customize dataset creation (for arrays). Common options include mode for the file mode, complevel for compression level, complib for the compression library, and fletcher32 to enable the Fletcher32 checksum. For mode, use 'r' for read-only access, 'w' to create a new file, 'a' to append or create, and 'r+' to open an existing file for reading and writing.
- Returns:
store – An IO interface for the stored data. For DataFrames, this is a pd.HDFStore object; for arrays, an h5py.File object.
- Return type:
Examples
>>> import os
>>> import pandas as pd
>>> from geoprior.utils.io_utils import to_hdf5
>>> data = pd.DataFrame({
...     'a': [1, 2, 3],
...     'b': [4, 5, 6]
... })
>>> save_path = os.path.join('output', 'datafile')
>>> # Keep the store open so the data can be read back below.
>>> store = to_hdf5(data, fn=save_path, objname='mydata', close=False, verbose=1)
>>> # Access stored data:
>>> retrieved = store['mydata']
>>> print(retrieved.head())
>>> store.close()
Notes
Ensure the dependency pytables is installed when serializing a DataFrame. When serializing NumPy arrays, the dataset is created with the name "dataset_01". If close is set to False, the caller is responsible for closing the store. The pandas and NumPy foundations underlying this serialization path are summarized in [33, 34].
See also
joblib.dump, pickle.dump, h5py.File
- geoprior.utils.io_utils.zip_extractor(zip_file, samples='*', ftype=None, savepath=None, pwd=None)[source]#
Extracts files from a ZIP archive based on various filtering criteria and saves them to a specified directory.
The extraction process can be controlled by the samples parameter to limit the number of files extracted, or by the ftype parameter to filter by a specific file extension. The resulting file names are returned as a list.
\[\text{Extracted Files} = \{ f \in \mathcal{A} \mid \phi(f) \} \tag{5}\]
where \(\mathcal{A}\) is the set of all files in the archive, and \(\phi(f)\) is a predicate that checks if a file matches the desired extension and is within the specified sample count.
- Parameters:
zip_file (str) – Full path to the ZIP archive file.
samples (int or str, optional) – Number of files to extract. If set to '*', all files are extracted. Default is '*'.
ftype (str, optional) – File extension filter (e.g., '.csv'). Only files with this extension are extracted. If no matching files are found, a ValueError is raised.
savepath (str, optional) – Directory where the extracted files will be stored. If not provided, files are extracted to the current working directory.
pwd (str or bytes, optional) – Password for encrypted ZIP files. If provided as a string, it will be used as is (or can be encoded to bytes as needed).
- Returns:
A list of extracted file names (with paths).
- Return type:
Examples
>>> from geoprior.utils.io_utils import zip_extractor
>>> extracted_files = zip_extractor(
...     'data/archive.zip',
...     samples='*',
...     ftype='.csv',
...     savepath='data/extracted',
...     pwd='secret'
... )
>>> print(extracted_files)
['folder1/file1.csv', 'folder2/file2.csv', ...]
Notes
The function first validates the input ZIP file using check_files (assumed to be defined in the package). It then determines the sample count and filters files by extension if ftype is provided. Extraction is done via the standard ZipFile.extract or ZipFile.extractall methods.
See also
zipfile.ZipFile.extract – Extract a single file from a ZIP archive.
zipfile.ZipFile.extractall – Extract all files from a ZIP archive.
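The filtering predicate, a matching extension combined with a sample limit, can be sketched with the standard zipfile module. The helper extract_matching is hypothetical and omits the check_files validation step; encoding a string password to UTF-8 bytes is an assumption about how a str value for pwd would be handled.

```python
import os
import tempfile
import zipfile


def extract_matching(zip_file, samples="*", ftype=None, savepath=None, pwd=None):
    """Extract archive members filtered by extension and limited in count."""
    savepath = savepath or os.getcwd()
    if isinstance(pwd, str):
        pwd = pwd.encode("utf-8")  # zipfile expects the password as bytes
    with zipfile.ZipFile(zip_file) as archive:
        # Skip directory entries, which end with a slash.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        if ftype is not None:
            names = [n for n in names if n.lower().endswith(ftype.lower())]
            if not names:
                raise ValueError(f"No files with extension {ftype!r} in archive")
        if samples != "*":
            names = names[: int(samples)]
        for name in names:
            archive.extract(name, path=savepath, pwd=pwd)
    return names


# Build a small unencrypted archive to demonstrate the filter.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("folder1/file1.csv", "a,b\n1,2\n")
    zf.writestr("folder2/file2.csv", "a,b\n3,4\n")
    zf.writestr("readme.txt", "notes")

outdir = os.path.join(workdir, "out")
extracted = extract_matching(archive_path, ftype=".csv", savepath=outdir)
print(extracted)
```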
- geoprior.utils.io_utils.fetch_joblib_data(job_file, *keys, error_mode='raise', verbose=0)[source]#
Dynamically load data from a joblib-saved dictionary with flexible key access.
- Parameters:
job_file (str) – Path to the joblib file containing a dictionary.
*keys (str) – Variable-length list of dictionary keys to retrieve.
error_mode ({'raise', 'warn', 'ignore'}, default 'raise') – Handling of missing keys:
- 'raise': Immediately raise KeyError.
- 'warn': Issue a warning and skip missing keys.
- 'ignore': Silently skip missing keys.
verbose (int, default 0) – Verbosity level:
- 0: No output.
- 1: Basic loading information.
- 2: Detailed debugging output.
- Returns:
Full dictionary if no keys specified
Tuple of values for requested keys (maintaining order)
- Return type:
Union[Dict, Tuple]
- Raises:
FileNotFoundError – If specified job_file doesn’t exist
TypeError – If loaded data isn’t a dictionary
KeyError – If requested key not found and error_mode=’raise’
Examples
>>> from geoprior.utils.io_utils import fetch_joblib_data
>>> data = fetch_joblib_data('data.joblib', 'X_train', 'y_train')
>>> X, y = fetch_joblib_data('data.joblib', 'X_val', 'y_val', verbose=1)
>>> full_dict = fetch_joblib_data('data.joblib')
Notes
Maintains original insertion order for Python 3.7+ dictionaries
Missing keys in ‘warn’/’ignore’ modes result in shorter return tuple
Joblib files must contain dictionary objects
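The key-selection rules above, including the three error modes and the shorter tuple that results from skipped keys, can be sketched on a plain in-memory dictionary. The helper select_keys is hypothetical and leaves out the file-loading step, so it says nothing about how the library reads the joblib file.

```python
import warnings


def select_keys(data, *keys, error_mode="raise"):
    """Return the whole dict, or a tuple of values for the requested keys."""
    if not isinstance(data, dict):
        raise TypeError(f"Expected a dict, got {type(data).__name__}")
    if not keys:
        return data
    values = []
    for key in keys:
        if key in data:
            values.append(data[key])
        elif error_mode == "raise":
            raise KeyError(key)
        elif error_mode == "warn":
            warnings.warn(f"Key {key!r} not found; skipping.")
        # error_mode == "ignore": skip the missing key silently
    return tuple(values)


store = {"X_train": [1, 2], "y_train": [0, 1]}
X, y = select_keys(store, "X_train", "y_train")
partial = select_keys(store, "X_train", "X_val", error_mode="ignore")
print(X, y, partial)
```

Because missing keys are skipped in the 'warn' and 'ignore' modes, tuple unpacking such as `X, y = ...` is only safe when every requested key is known to exist.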
- geoprior.utils.io_utils.to_txt(d, filename=None, format='txt', indent=2, width=80, depth=None, compat=False, include_header=True, mode='w', encoding='utf-8', overwrite=True, header=None, footer=None, serializer=None, savepath=None, verbose=1, logger=None, **kwargs)[source]#
Export data objects to a text or JSON file with optional custom formatting.
The function, <to_txt>, handles writing <d> (a string, dict, list, or general object) to a file named <filename>. When no filename is given, it automatically generates one based on the current date/time. If <format> is “json” and <d> is valid for JSON serialization, it attempts a JSON export. Otherwise, it falls back to text mode, leveraging Python’s built-in pformat and an optional <serializer> for advanced transformations.
\[\text{FileName}_{\text{timestamp}} \rightarrow \text{output} \tag{6}\]
where \(\text{FileName}_{\text{timestamp}}\) is an auto-generated name like output_20230101_123456.txt if <filename> is not provided.
- Parameters:
d (
object) – Data to write. Can be any Python object supported by pformat, or a dict if <format> is ‘json’.filename (
str, optional) – Full path (or name) of the output file. If None, a time-stamped name is produced, prefixed with ‘output_’.format (
str, default'txt') – File format, either"txt"or"json". If it fails to serialize as JSON, the process reverts to text.indent (
int, default2) – Indentation level for pretty-printing text or JSON.width (
int, default80) – Wrap width for formatted text lines.depth (
int, optional) – Maximum depth to which nested structures are expanded. If None, no limit is applied.compat (
bool, defaultFalse) – If True, instructs pformat to produce more compact text. Not used when exporting JSON.include_header (
bool, defaultTrue) – Whether to include a decorative header (with timestamp) at the top of the file in text mode.mode (
str, default'w') – File writing mode. Typically ‘w’ for overwrite, ‘a’ for append.encoding (
str, default'utf-8') – Text encoding used when opening the file.overwrite (
bool, defaultTrue) – If False, raises an error if the file already exists.header (
str, optional) – Custom header text (if <include_header> is True). Overwrites the default header if given.footer (
str, optional) – Custom footer text appended at the end of the file, if <include_header> is True.serializer (
callable, optional) – A function that transforms <d> before printing. If it fails, <d> remains unchanged.verbose (
int, default1) – Verbosity level for logging. Higher values yield more console messages (e.g., file stats at <verbose>>=3).**kwargs – Additional parameters passed to the JSON serializer (json.dump) or pformat.
- Returns:
The final filename used to store the output (potentially auto-generated).
- Return type:
Notes
If <format> is “json”, the function tries json.dump with a few standard parameters. If an exception occurs, it reverts to text export. The <serializer> argument allows custom transformations, such as flattening nested dicts or converting objects to JSON-serializable representations. The standard-library JSON behavior used here is documented in Python Software Foundation [35].
Examples
>>> from geoprior.utils.io_utils import to_txt
>>> my_data = {"name": "Alice", "age": 30}
>>> # Basic text export
>>> txt_file = to_txt(my_data, verbose=2)
>>> # Enforce JSON format
>>> json_file = to_txt(my_data, format='json', indent=4)
See also
pformat – Pretty-print complex Python data structures.
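The JSON-first export with a text fallback described in the Notes can be sketched with json and pprint from the standard library. The helper export is hypothetical and covers only the format decision; headers, footers, the serializer hook, and automatic filenames are left out.

```python
import json
import os
import tempfile
from pprint import pformat


def export(d, filename, fmt="txt", indent=2, width=80):
    """Write d as JSON when possible, otherwise as pretty-printed text."""
    text = None
    if fmt == "json":
        try:
            # Serialize to a string first so a failure leaves no partial file.
            text = json.dumps(d, indent=indent)
        except (TypeError, ValueError):
            fmt = "txt"  # not JSON-serializable: fall back to text mode
    if text is None:
        text = pformat(d, indent=indent, width=width)
    with open(filename, "w", encoding="utf-8") as fh:
        fh.write(text)
    return fmt


workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "data.json")
text_path = os.path.join(workdir, "data.txt")

used_json = export({"name": "Alice", "age": 30}, json_path, fmt="json", indent=4)
# A set is not JSON-serializable, so this call falls back to text.
used_text = export({"tags": {1, 2}}, text_path, fmt="json")
print(used_json, used_text)
```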