Data I/O
Overview
The piblin_jax.dataio module provides a comprehensive file I/O system for reading and
writing measurement data. It implements an extensible architecture that supports multiple
file formats with automatic format detection and intelligent hierarchy building.
The I/O system is designed around several key principles:
Format Agnostic: The module provides generic readers for CSV and TXT files that automatically parse column-based data. The extensible reader registry allows easy addition of custom format parsers without modifying core code.
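Auto-Detection: File formats are detected automatically. Detection currently uses the file extension, with a fallback to the generic readers; header- and content-based detection are reserved for future extensions. You can read files without knowing their format in advance, which makes batch processing straightforward.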
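Automatic Hierarchy Building: When reading multiple files, the system analyzes metadata to identify constant and varying experimental conditions and returns a hierarchical structure (ExperimentSet, Experiment, MeasurementSet), so large datasets do not need manual organization. As noted under build_hierarchy, the current implementation places all measurements in a single Experiment and MeasurementSet; grouping by varying conditions is a planned enhancement.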
Batch Operations: Read entire directories or multiple files at once. The module provides convenient functions for common workflows like reading all CSV files in a directory or processing files from multiple experimental runs.
Metadata Preservation: All file-level metadata (filenames, paths, timestamps) is automatically captured and attached to the resulting data structures. Inline metadata from file headers is also preserved.
The module currently supports CSV and TXT formats out of the box, with the infrastructure in place for adding instrument-specific readers (e.g., TA Instruments, Anton Paar) as needed.
Quick Examples
Reading a Single File
Read a data file with automatic format detection:
from piblin_jax.dataio import read_file
# Read CSV file - format auto-detected
measurement = read_file("experiment_data.csv")
# Access the data
print(f"Number of datasets: {len(measurement.datasets)}")
print(f"Metadata: {measurement.metadata}")
Reading Multiple Files
Read and organize multiple files into a hierarchy:
from piblin_jax.dataio import read_files
# List of files to read
files = [
    "sample_25C_rep1.csv",
    "sample_25C_rep2.csv",
    "sample_25C_rep3.csv",
    "sample_50C_rep1.csv",
    "sample_50C_rep2.csv",
]
# Read all files and build hierarchy
experiment_set = read_files(files)
# Hierarchy is built automatically based on conditions
for exp in experiment_set.experiments:
    print(f"Temperature: {exp.metadata['temperature']}")
    print(f"Replicates: {len(exp.measurement_sets[0].measurements)}")
Reading Entire Directories
Process all files in a directory:
from piblin_jax.dataio import read_directory
# Read all CSV files in directory
experiment_set = read_directory(
    "/path/to/data",
    pattern="*.csv"
)
# Read recursively with custom pattern
experiment_set = read_directory(
    "/path/to/data",
    pattern="sample_*.txt",
    recursive=True
)
print(f"Total measurements: {len(experiment_set.get_all_measurements())}")
Registering Custom Readers
Add support for custom file formats:
from piblin_jax.dataio import read_file, register_reader
from piblin_jax.data.collections import Measurement

class MyCustomReader:
    """Read custom instrument format."""

    def read(self, filepath):
        # Parse file (parse_my_format is a placeholder for your own parser)
        data = parse_my_format(filepath)
        # Create datasets (create_dataset is a placeholder as well)
        datasets = [create_dataset(data)]
        # Return Measurement
        return Measurement(
            datasets=datasets,
            metadata={"source": filepath}
        )

# Register the reader class for .dat files; register_reader expects a
# class or factory whose instances implement read(filepath)
register_reader(".dat", MyCustomReader)
# Now you can read .dat files
measurement = read_file("data.dat")
See Also
Data Structures - Data structures created by I/O operations
Transformations - Processing data after loading
pathlib Documentation - Path handling utilities
API Reference
Module Contents
Data I/O system for piblin-jax.
This module provides a comprehensive file I/O system with:
- Generic CSV and TXT readers
- Auto-detection of file formats
- Batch reading of multiple files
- Automatic hierarchy building from file lists
- Extensible reader registry
Main Functions
read_file : Read single file with auto-detection
read_files : Read multiple files and build hierarchy
read_directory : Read all matching files in a directory
read_directories : Read multiple directories
Examples
Read a single file:
>>> from piblin_jax.dataio import read_file, read_files, read_directory, read_directories
>>> measurement = read_file("data.csv")
Read multiple files:
>>> files = ["sample1.csv", "sample2.csv", "sample3.csv"]
>>> experiment_set = read_files(files)
Read an entire directory:
>>> experiment_set = read_directory("/path/to/data", pattern="*.csv")
Read multiple directories:
>>> paths = ["/path/to/exp1", "/path/to/exp2"]
>>> experiment_set = read_directories(paths)
- piblin_jax.dataio.build_hierarchy(measurements)[source]
Build hierarchical structure from flat list of measurements.
Analyzes the conditions across all measurements and organizes them into a hierarchical structure:
1. Extract all conditions from all measurements
2. Identify constant conditions (same across all) -> Experiment level
3. Identify varying conditions -> MeasurementSet grouping
4. Group measurements by conditions
- Parameters:
measurements (list[Measurement]) – Flat list of measurements to organize
- Returns:
Hierarchical organization of measurements
- Return type:
ExperimentSet
Notes
Current implementation creates a simple hierarchy:
- One ExperimentSet containing
  - One Experiment containing
    - One MeasurementSet with all measurements
Future enhancements can implement more sophisticated grouping based on varying conditions to create multiple Experiments and MeasurementSets.
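The simple hierarchy described in these notes can be pictured with a minimal sketch. The dataclasses below are hypothetical stand-ins, not the library's own ExperimentSet, Experiment, and MeasurementSet classes, and build_simple_hierarchy is an illustrative name. The sketch assumes only the attribute names that appear in the examples on this page (experiments, measurement_sets, measurements).

```python
from dataclasses import dataclass, field
from typing import Any


# Hypothetical stand-ins for the library's hierarchy classes.
@dataclass
class SketchMeasurementSet:
    measurements: list[Any] = field(default_factory=list)


@dataclass
class SketchExperiment:
    measurement_sets: list[SketchMeasurementSet] = field(default_factory=list)


@dataclass
class SketchExperimentSet:
    experiments: list[SketchExperiment] = field(default_factory=list)


def build_simple_hierarchy(measurements: list[Any]) -> SketchExperimentSet:
    """Wrap a flat list in one ExperimentSet > Experiment > MeasurementSet."""
    measurement_set = SketchMeasurementSet(measurements=list(measurements))
    experiment = SketchExperiment(measurement_sets=[measurement_set])
    return SketchExperimentSet(experiments=[experiment])


hierarchy = build_simple_hierarchy(["m1", "m2", "m3"])
print(len(hierarchy.experiments))  # 1
print(len(hierarchy.experiments[0].measurement_sets[0].measurements))  # 3
```

Whatever the number of inputs, this shape always yields exactly one experiment, which matches the doctest result of 1 shown in the examples for this function.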
Examples
Build hierarchy from file list:
>>> from piblin_jax.dataio.readers import read_file
>>> measurements = [
...     read_file("sample1.csv"),
...     read_file("sample2.csv"),
... ]
>>> experiment_set = build_hierarchy(measurements)
>>> len(experiment_set.experiments)
1
Access measurements:
>>> for exp in experiment_set.experiments:
...     for ms in exp.measurement_sets:
...         for m in ms.measurements:
...             print(m.conditions)
- piblin_jax.dataio.detect_reader(filepath)[source]
Auto-detect appropriate reader for file.
Uses a multi-layer detection strategy:
1. Extension-based: Matches file extension to registered readers
2. Header-based: Checks file headers for instrument signatures (future)
3. Content-based: Analyzes file content structure (future)
4. Fallback: Returns generic reader based on best guess
- Parameters:
filepath (str | Path) – Path to file
- Returns:
Instance of appropriate reader class
- Return type:
Reader instance
Examples
>>> reader = detect_reader("data.csv")
>>> isinstance(reader, GenericCSVReader)
True
>>> reader = detect_reader("data.txt")
>>> isinstance(reader, GenericTXTReader)
True
Notes
Currently implements Layer 1 (extension-based) and Layer 4 (fallback). Layers 2 and 3 are reserved for future extensions to detect specific instrument file formats.
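The two layers that are currently implemented can be illustrated with a small standalone sketch. Everything here is hypothetical: the stand-in reader classes, the READERS dictionary, and detect_reader_sketch are illustrative names, and both the lowercasing of the suffix and the choice of the CSV stand-in as the fallback are assumptions rather than documented behavior.

```python
from pathlib import Path
from typing import Callable, Union


class CsvReaderSketch:
    """Hypothetical stand-in for a generic CSV reader."""


class TxtReaderSketch:
    """Hypothetical stand-in for a generic TXT reader."""


# Layer 1: extension -> zero-argument callable that builds a reader.
READERS: dict[str, Callable[[], object]] = {
    ".csv": CsvReaderSketch,
    ".txt": TxtReaderSketch,
}


def detect_reader_sketch(filepath: Union[str, Path]) -> object:
    """Pick a reader by file extension, falling back to the CSV stand-in."""
    # Lowercasing the suffix is an assumption made for this sketch.
    suffix = Path(filepath).suffix.lower()
    # Layer 4: unknown extensions fall back to a generic reader (assumed: CSV).
    factory = READERS.get(suffix, CsvReaderSketch)
    return factory()


print(type(detect_reader_sketch("data.csv")).__name__)     # CsvReaderSketch
print(type(detect_reader_sketch("notes.txt")).__name__)    # TxtReaderSketch
print(type(detect_reader_sketch("run.unknown")).__name__)  # CsvReaderSketch
```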
- piblin_jax.dataio.read_directories(path_list, pattern='*.csv', recursive=False)[source]
Read multiple directories and combine into single hierarchy.
Scans multiple directories for matching files and builds a unified hierarchical structure from all measurements.
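- Parameters:
path_list – Directories to scan
pattern (default: '*.csv') – Glob-style pattern that file names must match
recursive (default: False) – Whether to search subdirectories as well
- Returns:
Hierarchical organization of all measurements from all directories
- Return type:
ExperimentSet
- Raises:
FileNotFoundError – If any directory does not exist
ValueError – If any file cannot be parsed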
Examples
Read from multiple directories:
>>> paths = ["/data/exp1", "/data/exp2", "/data/exp3"]
>>> experiment_set = read_directories(paths)
With custom pattern:
>>> experiment_set = read_directories(
...     paths,
...     pattern="*.txt"
... )
Recursive search:
>>> experiment_set = read_directories(
...     paths,
...     recursive=True
... )
Notes
All measurements from all directories are combined and analyzed together to build a unified hierarchy. This is useful when an experiment spans multiple directories.
- piblin_jax.dataio.read_directory(path, pattern='*.csv', recursive=False)[source]
Read all matching files in a directory.
Scans a directory for files matching the pattern, reads them all, and builds a hierarchical structure.
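- Parameters:
path – Directory to scan
pattern (default: '*.csv') – Glob-style pattern that file names must match
recursive (default: False) – Whether to search subdirectories as well
- Returns:
Hierarchical organization of all measurements
- Return type:
ExperimentSet
- Raises:
FileNotFoundError – If the directory does not exist
ValueError – If any file cannot be parsed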
Examples
Read all CSV files in a directory:
>>> experiment_set = read_directory("/data/experiment1")
Read all TXT files:
>>> experiment_set = read_directory("/data/experiment1", pattern="*.txt")
Read recursively:
>>> experiment_set = read_directory(
...     "/data",
...     pattern="*.csv",
...     recursive=True
... )
Read with custom pattern:
>>> experiment_set = read_directory(
...     "/data",
...     pattern="sample_A*.csv"
... )
Notes
Files are sorted alphabetically before reading for consistent ordering. All measurements are analyzed together to build the hierarchy.
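The scan-and-sort step can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the module's implementation: find_matching_files is a hypothetical helper, and using pathlib's glob and rglob for the pattern and recursive arguments is a guess at how such a scan could work.

```python
import tempfile
from pathlib import Path
from typing import Union


def find_matching_files(
    path: Union[str, Path], pattern: str = "*.csv", recursive: bool = False
) -> list[Path]:
    """Return files under `path` that match `pattern`, in sorted order."""
    directory = Path(path)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")
    # rglob also descends into subdirectories; glob stays at the top level.
    matches = directory.rglob(pattern) if recursive else directory.glob(pattern)
    # Sorting gives the same reading order on every run.
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "b.csv").write_text("1,2\n")
    (root / "a.csv").write_text("3,4\n")
    (root / "sub").mkdir()
    (root / "sub" / "c.csv").write_text("5,6\n")
    print([p.name for p in find_matching_files(root)])
    # ['a.csv', 'b.csv']
    print([p.name for p in find_matching_files(root, recursive=True)])
    # ['a.csv', 'b.csv', 'c.csv']
```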
- piblin_jax.dataio.read_file(filepath)[source]
Read file with automatic format detection.
This is the main entry point for reading individual files. It automatically detects the file format and uses the appropriate reader.
- Parameters:
filepath (str | Path) – Path to file
- Returns:
Measurement object containing datasets and metadata
- Return type:
Measurement
- Raises:
FileNotFoundError – If the file does not exist
ValueError – If the file format is invalid or cannot be parsed
Examples
Read a CSV file:
>>> measurement = read_file("data.csv")
Read a TXT file:
>>> measurement = read_file("experiment.txt")
Read with explicit path:
>>> from pathlib import Path
>>> measurement = read_file(Path("/data/experiment/sample1.csv"))
Notes
This function combines detection and reading in a single call. For more control over the reading process, you can use detect_reader() followed by calling the reader’s read() method directly.
- piblin_jax.dataio.read_files(file_list)[source]
Read multiple files and build hierarchical structure.
Reads all files in the list, automatically detecting formats, and organizes them into a hierarchical ExperimentSet based on their experimental conditions.
- Parameters:
file_list (Sequence[str | Path]) – List of file paths to read
- Returns:
Hierarchical organization of all measurements
- Return type:
ExperimentSet
- Raises:
FileNotFoundError – If any file in the list does not exist
ValueError – If any file cannot be parsed
Examples
Read specific files:
>>> files = ["sample1.csv", "sample2.csv", "sample3.csv"]
>>> experiment_set = read_files(files)
>>> len(experiment_set.experiments)
1
With Path objects:
>>> from pathlib import Path
>>> files = list(Path("/data").glob("*.csv"))
>>> experiment_set = read_files(files)
Notes
All measurements from all files are analyzed together to identify constant and varying conditions, which determines the hierarchy structure. Files with the same conditions are grouped together.
- piblin_jax.dataio.register_reader(extension, reader_class)[source]
Register a custom reader for a file extension.
This allows users to add support for custom file formats without modifying the core library.
- Parameters:
extension (str) – File extension (should include the dot, e.g., “.xyz”)
reader_class (Type | Callable) – Reader class or factory function that returns a reader instance. The reader must implement a read(filepath) method that returns a Measurement object.
Examples
Register a custom reader class:
>>> class MyCustomReader:
...     def read(self, filepath):
...         # ... custom reading logic
...         pass
>>> register_reader('.xyz', MyCustomReader)
Register a factory function:
>>> register_reader('.custom', lambda: GenericCSVReader(delimiter='|'))
Notes
Custom readers should follow the same interface as GenericCSVReader, implementing a read(filepath) method that returns a Measurement.
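A class and a factory function can share one registry because both are zero-argument callables that produce a reader instance. The sketch below illustrates that idea only; register_reader_sketch, make_reader, and DelimitedReader are hypothetical names, the extension normalization is an assumption, and the stand-in reader returns rows of strings rather than a Measurement.

```python
from typing import Any, Callable

# Registry mapping an extension to a zero-argument callable.
_READERS: dict[str, Callable[[], Any]] = {}


def register_reader_sketch(extension: str, reader_class: Callable[[], Any]) -> None:
    """Store a class or factory under a normalized extension."""
    if not extension.startswith("."):
        extension = "." + extension  # normalization is an assumption
    _READERS[extension.lower()] = reader_class


def make_reader(extension: str) -> Any:
    """Call the stored class or factory to obtain a reader instance."""
    return _READERS[extension.lower()]()


class DelimitedReader:
    """Hypothetical reader; returns rows of strings, not a Measurement."""

    def __init__(self, delimiter: str = ",") -> None:
        self.delimiter = delimiter

    def read(self, filepath: str) -> list[list[str]]:
        with open(filepath, encoding="utf-8") as handle:
            lines = [line.rstrip("\n") for line in handle]
        return [line.split(self.delimiter) for line in lines if line.strip()]


# A class and a lambda factory are both zero-argument callables.
register_reader_sketch(".xyz", DelimitedReader)
register_reader_sketch(".custom", lambda: DelimitedReader(delimiter="|"))

print(make_reader(".xyz").delimiter)     # ,
print(make_reader(".custom").delimiter)  # |
```

The factory form is useful when a reader needs constructor arguments, as in the pipe-delimited example above.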
Readers
Base Reader Interface
File readers and auto-detection system.
This module provides:
- Generic CSV and TXT readers
- Multi-layer auto-detection system
- Extensible reader registry
- read_file function for automatic file reading
The auto-detection system uses four layers:
1. Extension-based (.csv, .txt, etc.)
2. Header-based (instrument signatures)
3. Content-based (parse first lines)
4. Fallback to generic readers
- class piblin_jax.dataio.readers.GenericCSVReader(delimiter=',', comment_char='#')[source]
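Bases: object
Generic CSV file reader with metadata extraction.
This reader handles CSV files with optional header comments containing metadata. It supports various delimiters and automatically creates Dataset objects from the parsed data.
- Parameters:
delimiter (str, optional) – Column delimiter (default: “,”)
comment_char (str, optional) – Comment character for header lines (default: “#”)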
Examples
Read a standard CSV file:
>>> reader = GenericCSVReader()
>>> measurement = reader.read("data.csv")
Read a tab-delimited file:
>>> reader = GenericCSVReader(delimiter="\t")
>>> measurement = reader.read("data.tsv")
File format example:
# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0
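To make the file layout concrete, here is a minimal parsing sketch for the format shown above. It is an illustration under assumptions rather than the reader's actual logic: parse_csv_text is a hypothetical helper, header metadata is assumed to use "key: value" pairs after the comment character, values are kept as strings, and data fields are assumed to be numeric.

```python
def parse_csv_text(
    text: str, delimiter: str = ",", comment_char: str = "#"
) -> tuple[dict[str, str], list[list[float]]]:
    """Split text into header metadata and numeric data rows."""
    metadata: dict[str, str] = {}
    rows: list[list[float]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(comment_char):
            # Assumed header form: "# Key: value"
            key, sep, value = line[len(comment_char):].partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
            continue
        rows.append([float(cell) for cell in line.split(delimiter)])
    return metadata, rows


sample = """# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0
"""
metadata, rows = parse_csv_text(sample)
print(metadata)  # {'Temperature': '25', 'Pressure': '1.0', 'Sample': 'A1'}
print(rows)      # [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
```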
Methods
read(filepath) – Read CSV file and return Measurement object.
- __init__(delimiter=',', comment_char='#')[source]
Initialize GenericCSVReader.
See class docstring for parameter details.
- read(filepath)[source]
Read CSV file and return Measurement object.
Parses the CSV file, extracting metadata from headers and creating appropriate Dataset objects from the data columns.
- Parameters:
filepath (str | Path) – Path to CSV file
- Returns:
Measurement object containing datasets and metadata
- Return type:
Measurement
- Raises:
FileNotFoundError – If the file does not exist
ValueError – If the file format is invalid or cannot be parsed
- class piblin_jax.dataio.readers.GenericTXTReader(comment_char='#')[source]
Bases: GenericCSVReader
Generic TXT file reader (whitespace-delimited).
This reader handles text files with whitespace-delimited columns (spaces or tabs). It inherits from GenericCSVReader but uses whitespace splitting instead of a specific delimiter.
- Parameters:
comment_char (str, optional) – Comment character for header lines (default: “#”)
Examples
Read a whitespace-delimited file:
>>> reader = GenericTXTReader()
>>> measurement = reader.read("data.txt")
File format example:
# Temperature: 25
# Sample: A1
0.0 0.0
1.0 1.0
2.0 4.0
Notes
This reader is suitable for files where columns are separated by any amount of whitespace (spaces, tabs, or combinations). It automatically handles varying amounts of whitespace between columns.
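Whitespace splitting of this kind is what Python's str.split() does when called without an argument. The sketch below shows that behavior on single data lines; split_columns is a hypothetical helper, and whether the reader uses str.split() internally is an assumption.

```python
def split_columns(line: str) -> list[float]:
    """Split one data line on any run of spaces or tabs and parse floats."""
    # str.split() with no argument treats consecutive whitespace as one
    # separator and ignores leading and trailing whitespace.
    return [float(token) for token in line.split()]


print(split_columns("0.0 0.0"))         # [0.0, 0.0]
print(split_columns("  1.0\t\t1.0  "))  # [1.0, 1.0]
print(split_columns("2.0   \t 4.0"))    # [2.0, 4.0]
```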
Methods
read(filepath) – Read CSV file and return Measurement object.
- piblin_jax.dataio.readers.detect_reader(filepath)[source]
Auto-detect appropriate reader for file.
- piblin_jax.dataio.readers.read_file(filepath)[source]
Read file with automatic format detection.
- piblin_jax.dataio.readers.register_reader(extension, reader_class)[source]
Register a custom reader for a file extension.
CSV Reader
Generic CSV file reader with metadata extraction.
This module provides a flexible CSV reader that can handle various delimiters, extract metadata from file headers, and create appropriate Dataset objects.
- class piblin_jax.dataio.readers.csv.GenericCSVReader(delimiter=',', comment_char='#')[source]
Bases: object
Generic CSV file reader with metadata extraction.
TXT Reader
Generic TXT file reader for whitespace-delimited data files.
This module provides a TXT reader that extends the CSV reader for whitespace-delimited files, commonly used in scientific data.
- class piblin_jax.dataio.readers.txt.GenericTXTReader(comment_char='#')[source]
Bases: GenericCSVReader
Generic TXT file reader (whitespace-delimited).
Hierarchy Building
Hierarchy building algorithm for organizing measurements.
This module provides algorithms for building hierarchical data structures from flat lists of measurements by analyzing conditions and grouping data.
- piblin_jax.dataio.hierarchy.build_hierarchy(measurements)[source]
Build hierarchical structure from flat list of measurements.
- piblin_jax.dataio.hierarchy.group_by_conditions(measurements, grouping_keys)[source]
Group measurements by specific condition keys.
This is a utility function for more advanced hierarchy building that groups measurements based on specific condition values.
- Parameters:
measurements (list[Measurement]) – Measurements to group
grouping_keys (list[str]) – Condition keys to group by
- Returns:
Dictionary mapping condition value tuples to lists of measurements
- Return type:
dict[tuple[Any, ...], list[Measurement]]
Examples
Group by temperature:
>>> groups = group_by_conditions(measurements, ['Temperature'])
>>> for temp_value, group in groups.items():
...     print(f"Temperature {temp_value}: {len(group)} measurements")
Group by multiple conditions:
>>> groups = group_by_conditions(
...     measurements,
...     ['Temperature', 'Pressure']
... )
Notes
This function is provided for future extensions to the hierarchy building algorithm that may want to create separate Experiments or MeasurementSets based on specific conditions.
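The returned mapping, from tuples of condition values to lists of measurements, can be illustrated with plain dictionaries. This is a sketch under assumptions: group_by_keys is a hypothetical helper, each record is a dict of conditions standing in for a Measurement, and a missing key is grouped under None.

```python
from collections import defaultdict
from typing import Any


def group_by_keys(
    records: list[dict[str, Any]], grouping_keys: list[str]
) -> dict[tuple[Any, ...], list[dict[str, Any]]]:
    """Group records by the tuple of values found under grouping_keys."""
    groups: dict[tuple[Any, ...], list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        # A missing key contributes None to the tuple (an assumption).
        key = tuple(record.get(name) for name in grouping_keys)
        groups[key].append(record)
    return dict(groups)


records = [
    {"Temperature": 25, "Pressure": 1.0, "file": "a.csv"},
    {"Temperature": 25, "Pressure": 2.0, "file": "b.csv"},
    {"Temperature": 50, "Pressure": 1.0, "file": "c.csv"},
]
for key, group in group_by_keys(records, ["Temperature"]).items():
    print(key, [r["file"] for r in group])
# (25,) ['a.csv', 'b.csv']
# (50,) ['c.csv']
```

Keys are tuples even when only one grouping key is given, which keeps the single-key and multi-key cases uniform.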
- piblin_jax.dataio.hierarchy.identify_varying_conditions(measurements)[source]
Identify which conditions vary across measurements.
- Parameters:
measurements (list[Measurement]) – Measurements to analyze
- Returns:
Set of condition keys that have different values across measurements
- Return type:
set[str]
Examples
>>> varying = identify_varying_conditions(measurements)
>>> print(varying)
{'Temperature', 'Sample'}
Notes
This function is useful for determining which conditions should be used to group measurements into different MeasurementSets or Experiments.
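The check for varying conditions can be sketched with plain dictionaries standing in for each measurement's conditions. varying_keys is a hypothetical helper, and treating a key that is absent from some measurements as varying is an assumption of this sketch, not documented behavior.

```python
from typing import Any

_MISSING = object()  # sentinel for keys absent from a measurement


def varying_keys(condition_dicts: list[dict[str, Any]]) -> set[str]:
    """Return the keys whose values differ across the given dicts."""
    all_keys: set[str] = set()
    for conditions in condition_dicts:
        all_keys.update(conditions)

    varying: set[str] = set()
    for key in all_keys:
        values = [c.get(key, _MISSING) for c in condition_dicts]
        # Any value that differs from the first marks the key as varying.
        if any(value != values[0] for value in values[1:]):
            varying.add(key)
    return varying


conditions = [
    {"Temperature": 25, "Sample": "A1", "Pressure": 1.0},
    {"Temperature": 50, "Sample": "B2", "Pressure": 1.0},
]
print(sorted(varying_keys(conditions)))  # ['Sample', 'Temperature']
```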
Writers
Data writers module for piblin-jax.
This module provides functionality for writing piblin-jax data structures to various file formats. Writer implementations will be added in future phases to support common data serialization formats.
Future implementations may include:
- CSV writers for tabular data
- HDF5 writers for large datasets
- JSON/YAML writers for metadata
- Binary formats for efficient storage
Examples
Future usage example:
from piblin_jax.dataio.writers import CSVWriter
writer = CSVWriter()
writer.write(dataset, 'output.csv')