Data I/O

Overview

The piblin_jax.dataio module provides a file I/O system for reading measurement data; writers are planned but not yet implemented. Its extensible architecture supports multiple file formats, detects the format of each file automatically, and organizes the resulting measurements into a hierarchy.

The I/O system is designed around several key principles:

Format Agnostic: The module provides generic readers for CSV and TXT files that automatically parse column-based data. The extensible reader registry allows easy addition of custom format parsers without modifying core code.
Auto-Detection: File formats are detected from the file extension, with a fallback to a generic reader; header-based and content-based detection are reserved for future extensions. You can read files without knowing their format in advance, making batch processing straightforward.
Automatic Hierarchy Building: When reading multiple files, the system analyzes metadata to identify constant and varying experimental conditions, then builds a hierarchical structure (ExperimentSet, Experiment, MeasurementSet). The current implementation places all measurements in one Experiment containing one MeasurementSet; grouping by varying conditions is a planned enhancement.
Batch Operations: Read entire directories or multiple files at once. The module provides convenient functions for common workflows like reading all CSV files in a directory or processing files from multiple experimental runs.
Metadata Preservation: All file-level metadata (filenames, paths, timestamps) is automatically captured and attached to the resulting data structures. Inline metadata from file headers is also preserved.

The module currently supports CSV and TXT formats out of the box, with the infrastructure in place for adding instrument-specific readers (e.g., TA Instruments, Anton Paar) as needed.

Quick Examples

Reading a Single File

Read a data file with automatic format detection:

from piblin_jax.dataio import read_file

# Read CSV file - format auto-detected
measurement = read_file("experiment_data.csv")

# Access the data
print(f"Number of datasets: {len(measurement.datasets)}")
print(f"Metadata: {measurement.metadata}")

Reading Multiple Files

Read and organize multiple files into a hierarchy:

from piblin_jax.dataio import read_files

# List of files to read
files = [
    "sample_25C_rep1.csv",
    "sample_25C_rep2.csv",
    "sample_25C_rep3.csv",
    "sample_50C_rep1.csv",
    "sample_50C_rep2.csv",
]

# Read all files and build hierarchy
experiment_set = read_files(files)

# The hierarchy is built automatically. Per the build_hierarchy notes, the
# current implementation returns one Experiment with one MeasurementSet.
for exp in experiment_set.experiments:
    for ms in exp.measurement_sets:
        print(f"Measurements: {len(ms.measurements)}")
        for m in ms.measurements:
            print(m.conditions)

Reading Entire Directories

Process all files in a directory:

from piblin_jax.dataio import read_directory

# Read all CSV files in directory
experiment_set = read_directory(
    "/path/to/data",
    pattern="*.csv"
)

# Read recursively with custom pattern
experiment_set = read_directory(
    "/path/to/data",
    pattern="sample_*.txt",
    recursive=True
)

print(f"Total measurements: {len(experiment_set.get_all_measurements())}")

Registering Custom Readers

Add support for custom file formats:

from piblin_jax.dataio import read_file, register_reader
from piblin_jax.data.collections import Measurement

class MyCustomReader:
    """Reader for a custom instrument format."""

    def read(self, filepath):
        # parse_my_format and create_dataset are placeholders for your own
        # parsing and Dataset-construction code.
        data = parse_my_format(filepath)
        datasets = [create_dataset(data)]

        # Readers must return a Measurement
        return Measurement(
            datasets=datasets,
            metadata={"source": str(filepath)}
        )

# Register the reader class for .dat files
register_reader(".dat", MyCustomReader)

# read_file now dispatches .dat files to MyCustomReader
measurement = read_file("data.dat")

API Reference

Module Contents

Data I/O system for piblin-jax.

This module provides a comprehensive file I/O system with:

- Generic CSV and TXT readers
- Auto-detection of file formats
- Batch reading of multiple files
- Automatic hierarchy building from file lists
- Extensible reader registry

Main Functions

read_file : Read single file with auto-detection
read_files : Read multiple files and build hierarchy
read_directory : Read all matching files in a directory
read_directories : Read multiple directories

Examples

Read a single file:

>>> from piblin_jax.dataio import read_file, read_files, read_directory, read_directories
>>> measurement = read_file("data.csv")

Read multiple files:

>>> files = ["sample1.csv", "sample2.csv", "sample3.csv"]
>>> experiment_set = read_files(files)

Read an entire directory:

>>> experiment_set = read_directory("/path/to/data", pattern="*.csv")

Read multiple directories:

>>> paths = ["/path/to/exp1", "/path/to/exp2"]
>>> experiment_set = read_directories(paths)

piblin_jax.dataio.build_hierarchy(measurements)[source]

Build hierarchical structure from flat list of measurements.

Analyzes the conditions across all measurements and organizes them into a hierarchical structure:

1. Extract all conditions from all measurements
2. Identify constant conditions (same across all) -> Experiment level
3. Identify varying conditions -> MeasurementSet grouping
4. Group measurements by conditions

Parameters:

measurements (list[Measurement]) – Flat list of measurements to organize

Returns:

Hierarchical organization of measurements

Return type:

ExperimentSet

Notes

Current implementation creates a simple hierarchy:

- One ExperimentSet containing
- One Experiment containing
- One MeasurementSet with all measurements

Future enhancements can implement more sophisticated grouping based on varying conditions to create multiple Experiments and MeasurementSets.

Examples

Build hierarchy from file list:

>>> from piblin_jax.dataio.readers import read_file
>>> measurements = [
...     read_file("sample1.csv"),
...     read_file("sample2.csv"),
... ]
>>> experiment_set = build_hierarchy(measurements)
>>> len(experiment_set.experiments)
1

Access measurements:

>>> for exp in experiment_set.experiments:
...     for ms in exp.measurement_sets:
...         for m in ms.measurements:
...             print(m.conditions)
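The first three steps listed above can be sketched without the library. The following is only an illustration under stated assumptions: plain dictionaries stand in for each measurement's conditions, and split_conditions is a hypothetical helper, not part of piblin_jax. A key counts as constant only when every measurement carries it with an equal value.

```python
from typing import Any


def split_conditions(
    condition_dicts: list[dict[str, Any]],
) -> tuple[dict[str, Any], set[str]]:
    """Split condition keys into constant (key -> shared value) and varying."""
    # Step 1: collect every key that appears in any measurement.
    all_keys: set[str] = set()
    for conditions in condition_dicts:
        all_keys.update(conditions)

    constant: dict[str, Any] = {}
    varying: set[str] = set()
    missing = object()  # sentinel for "key absent from this measurement"
    for key in sorted(all_keys):
        values = [conditions.get(key, missing) for conditions in condition_dicts]
        first = values[0]
        # Steps 2-3: constant if present everywhere with equal values,
        # otherwise varying.
        if first is not missing and all(v == first for v in values[1:]):
            constant[key] = first
        else:
            varying.add(key)
    return constant, varying


# Hypothetical conditions for three measurements
conditions = [
    {"Temperature": 25, "Sample": "A1", "Pressure": 1.0},
    {"Temperature": 50, "Sample": "A1", "Pressure": 1.0},
    {"Temperature": 50, "Sample": "A1", "Pressure": 1.0},
]
constant, varying = split_conditions(conditions)
print(constant)
print(varying)
```

Here Sample and Pressure would belong at the Experiment level, while Temperature is a candidate for grouping.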

piblin_jax.dataio.detect_reader(filepath)[source]

Auto-detect appropriate reader for file.

Uses a multi-layer detection strategy:

1. Extension-based: Matches file extension to registered readers
2. Header-based: Checks file headers for instrument signatures (future)
3. Content-based: Analyzes file content structure (future)
4. Fallback: Returns generic reader based on best guess

Parameters:

filepath (str | Path) – Path to file

Returns:

Instance of appropriate reader class

Return type:

Reader instance

Examples

>>> reader = detect_reader("data.csv")
>>> isinstance(reader, GenericCSVReader)
True

>>> reader = detect_reader("data.txt")
>>> isinstance(reader, GenericTXTReader)
True

Notes

Currently implements Layer 1 (extension-based) and Layer 4 (fallback). Layers 2 and 3 are reserved for future extensions to detect specific instrument file formats.
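As a rough illustration of how Layer 1 and Layer 4 can fit together with a reader registry, here is a minimal sketch. It is not the library's implementation: the stand-in classes, the register and detect helpers, the case-insensitive extension match, and the choice of the CSV stand-in as the fallback are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict


class CsvStandIn:
    """Stand-in for a comma-delimited reader (hypothetical)."""


class TxtStandIn:
    """Stand-in for a whitespace-delimited reader (hypothetical)."""


# Maps a lowercase extension (with dot) to a class or zero-argument factory.
_READERS: Dict[str, Callable[[], object]] = {
    ".csv": CsvStandIn,
    ".txt": TxtStandIn,
}


def register(extension: str, factory: Callable[[], object]) -> None:
    """Add or replace the reader factory for an extension."""
    _READERS[extension.lower()] = factory


def detect(filepath) -> object:
    """Layer 1: look up the extension. Layer 4: fall back to a generic reader."""
    suffix = Path(filepath).suffix.lower()
    # Assumption: unknown extensions fall back to the CSV stand-in.
    factory = _READERS.get(suffix, CsvStandIn)
    return factory()


class DatStandIn:
    """Stand-in for a user-supplied reader (hypothetical)."""


register(".dat", DatStandIn)
print(type(detect("run1.DAT")).__name__)
print(type(detect("run1.txt")).__name__)
print(type(detect("run1.unknown")).__name__)
```

Because the registry stores callables, a class and a zero-argument factory function are handled the same way: both are called to produce a reader instance.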

piblin_jax.dataio.read_directories(path_list, pattern='*.csv', recursive=False)[source]

Read multiple directories and combine into single hierarchy.

Scans multiple directories for matching files and builds a unified hierarchical structure from all measurements.

Parameters:

path_list (Sequence[str | Path]) – List of directory paths to scan
pattern (str, optional) – Glob pattern for file matching (default: "*.csv")
recursive (bool, optional) – If True, search recursively in subdirectories (default: False)

Returns:

Hierarchical organization of all measurements from all directories

Return type:

ExperimentSet

Raises:

FileNotFoundError – If any directory does not exist
ValueError – If any file cannot be parsed

Examples

Read from multiple directories:

>>> paths = ["/data/exp1", "/data/exp2", "/data/exp3"]
>>> experiment_set = read_directories(paths)

With custom pattern:

>>> experiment_set = read_directories(
...     paths,
...     pattern="*.txt"
... )

Recursive search:

>>> experiment_set = read_directories(
...     paths,
...     recursive=True
... )

Notes

All measurements from all directories are combined and analyzed together to build a unified hierarchy. This is useful when an experiment spans multiple directories.

piblin_jax.dataio.read_directory(path, pattern='*.csv', recursive=False)[source]

Read all matching files in a directory.

Scans a directory for files matching the pattern, reads them all, and builds a hierarchical structure.

Parameters:

path (str | Path) – Directory path to scan
pattern (str, optional) – Glob pattern for file matching (default: "*.csv"). Examples: "*.txt", "*.dat", "sample_*.csv"
recursive (bool, optional) – If True, search recursively in subdirectories (default: False)

Returns:

Hierarchical organization of all measurements

Return type:

ExperimentSet

Raises:

FileNotFoundError – If the directory does not exist
ValueError – If any file cannot be parsed

Examples

Read all CSV files in a directory:

>>> experiment_set = read_directory("/data/experiment1")

Read all TXT files:

>>> experiment_set = read_directory("/data/experiment1", pattern="*.txt")

Read recursively:

>>> experiment_set = read_directory(
...     "/data",
...     pattern="*.csv",
...     recursive=True
... )

Read with custom pattern:

>>> experiment_set = read_directory(
...     "/data",
...     pattern="sample_A*.csv"
... )

Notes

Files are sorted alphabetically before reading for consistent ordering. All measurements are analyzed together to build the hierarchy.
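The file discovery described here (glob pattern, optional recursion, alphabetical ordering) can be approximated with pathlib alone. This is a sketch under assumptions, not the library's code: find_files is a hypothetical helper, and the example builds a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_files(path, pattern="*.csv", recursive=False):
    """Return matching files under path, sorted for consistent ordering."""
    directory = Path(path)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")
    # rglob descends into subdirectories; glob stays at the top level.
    matches = directory.rglob(pattern) if recursive else directory.glob(pattern)
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("b.csv", "a.csv", "notes.txt", "sub/c.csv"):
        (root / name).write_text("0.0,0.0\n")

    flat = [p.relative_to(root).as_posix() for p in find_files(root)]
    deep = [p.relative_to(root).as_posix() for p in find_files(root, recursive=True)]

print(flat)
print(deep)
```

Sorting before reading means the order of measurements does not depend on the order in which the file system happens to list entries.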

piblin_jax.dataio.read_file(filepath)[source]

Read file with automatic format detection.

This is the main entry point for reading individual files. It automatically detects the file format and uses the appropriate reader.

Parameters:

filepath (str | Path) – Path to file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

FileNotFoundError – If the file does not exist
ValueError – If the file format is invalid or cannot be parsed

Examples

Read a CSV file:

>>> measurement = read_file("data.csv")

Read a TXT file:

>>> measurement = read_file("experiment.txt")

Read with explicit path:

>>> from pathlib import Path
>>> measurement = read_file(Path("/data/experiment/sample1.csv"))

Notes

This function combines detection and reading in a single call. For more control over the reading process, you can use detect_reader() followed by calling the reader’s read() method directly.

piblin_jax.dataio.read_files(file_list)[source]

Read multiple files and build hierarchical structure.

Reads all files in the list, automatically detecting formats, and organizes them into a hierarchical ExperimentSet based on their experimental conditions.

Parameters:

file_list (Sequence[str | Path]) – List of file paths to read

Returns:

Hierarchical organization of all measurements

Return type:

ExperimentSet

Raises:

FileNotFoundError – If any file in the list does not exist
ValueError – If any file cannot be parsed

Examples

Read specific files:

>>> files = ["sample1.csv", "sample2.csv", "sample3.csv"]
>>> experiment_set = read_files(files)
>>> len(experiment_set.experiments)
1

With Path objects:

>>> from pathlib import Path
>>> files = list(Path("/data").glob("*.csv"))
>>> experiment_set = read_files(files)

Notes

All measurements from all files are analyzed together to identify constant and varying conditions, which determines the hierarchy structure. Files with the same conditions are grouped together.

piblin_jax.dataio.register_reader(extension, reader_class)[source]

Register a custom reader for a file extension.

This allows users to add support for custom file formats without modifying the core library.

Parameters:

extension (str) – File extension (should include the dot, e.g., “.xyz”)
reader_class (Type | Callable) – Reader class or factory function that returns a reader instance. The reader must implement a read(filepath) method that returns a Measurement object.

Examples

Register a custom reader class:

>>> class MyCustomReader:
...     def read(self, filepath):
...         # ... custom reading logic
...         pass
>>> register_reader('.xyz', MyCustomReader)

Register a factory function:

>>> register_reader('.custom', lambda: GenericCSVReader(delimiter='|'))

Notes

Custom readers should follow the same interface as GenericCSVReader, implementing a read(filepath) method that returns a Measurement.

Readers

Base Reader Interface

File readers and auto-detection system.

This module provides:

- Generic CSV and TXT readers
- Multi-layer auto-detection system
- Extensible reader registry
- read_file function for automatic file reading

The auto-detection system uses four layers:

1. Extension-based (.csv, .txt, etc.)
2. Header-based (instrument signatures)
3. Content-based (parse first lines)
4. Fallback to generic readers

class piblin_jax.dataio.readers.GenericCSVReader(delimiter=',', comment_char='#')[source]

Bases: object

Generic CSV file reader with metadata extraction.

This reader handles CSV files with optional header comments containing metadata. It supports various delimiters and automatically creates Dataset objects from the parsed data.

Parameters:

delimiter (str, optional) – Column delimiter character (default: “,”). Common values: “,” (CSV), “\t” (TSV), “;” (European CSV)
comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a standard CSV file:

>>> reader = GenericCSVReader()
>>> measurement = reader.read("data.csv")

Read a tab-delimited file:

>>> reader = GenericCSVReader(delimiter="\t")
>>> measurement = reader.read("data.tsv")

File format example:

# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0

Methods

read(filepath)

Read CSV file and return Measurement object.

__init__(delimiter=',', comment_char='#')[source]

Initialize GenericCSVReader.

See class docstring for parameter details.

read(filepath)[source]

Read CSV file and return Measurement object.

Parses the CSV file, extracting metadata from headers and creating appropriate Dataset objects from the data columns.

Parameters:

filepath (str | Path) – Path to CSV file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

FileNotFoundError – If the file does not exist
ValueError – If the file format is invalid or cannot be parsed
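To make the file format shown in the class example concrete, the sketch below parses comment-prefixed "key: value" header lines into a metadata dictionary and the remaining lines into rows of floats. It is an illustration only: parse_delimited is a hypothetical helper, metadata values are kept as strings, and how the actual reader converts values or builds Dataset objects is not specified here.

```python
def parse_delimited(text, delimiter=",", comment_char="#"):
    """Split text into header metadata and numeric data rows."""
    metadata = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(comment_char):
            # Header line: "# key: value"; comments without ":" are ignored.
            key, sep, value = line[len(comment_char):].partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
            continue
        # Data line: float() raises ValueError on a non-numeric field.
        rows.append([float(field) for field in line.split(delimiter)])
    return metadata, rows


sample = """# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0
"""
metadata, rows = parse_delimited(sample)
print(metadata)
print(rows)
```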

class piblin_jax.dataio.readers.GenericTXTReader(comment_char='#')[source]

Bases: GenericCSVReader

Generic TXT file reader (whitespace-delimited).

This reader handles text files with whitespace-delimited columns (spaces or tabs). It inherits from GenericCSVReader but uses whitespace splitting instead of a specific delimiter.

Parameters:

comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a whitespace-delimited file:

>>> reader = GenericTXTReader()
>>> measurement = reader.read("data.txt")

File format example:

# Temperature: 25
# Sample: A1
0.0 0.0
1.0 1.0
2.0 4.0

Notes

This reader is suitable for files where columns are separated by any amount of whitespace (spaces, tabs, or combinations). It automatically handles varying amounts of whitespace between columns.

Methods

read(filepath)

Read CSV file and return Measurement object.

__init__(comment_char='#')[source]

Initialize GenericTXTReader.

See class docstring for parameter details.
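The whitespace handling described in the notes for this reader matches the behavior of Python's str.split() called without arguments. The short sketch below shows the difference from splitting on a fixed delimiter; whether GenericTXTReader uses str.split() internally is an assumption.

```python
line = "0.0   1.0\t\t4.0  "

# str.split() with no argument splits on runs of any whitespace and
# ignores leading and trailing whitespace.
fields = line.split()
values = [float(field) for field in fields]

# Splitting on a single literal space leaves empty strings and embedded tabs.
naive = line.split(" ")

print(values)
print(naive)
```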

piblin_jax.dataio.readers.detect_reader(filepath)[source]

Auto-detect appropriate reader for file.

Uses a multi-layer detection strategy:

1. Extension-based: Matches file extension to registered readers
2. Header-based: Checks file headers for instrument signatures (future)
3. Content-based: Analyzes file content structure (future)
4. Fallback: Returns generic reader based on best guess

Parameters:

filepath (str | Path) – Path to file

Returns:

Instance of appropriate reader class

Return type:

Reader instance

Examples

>>> reader = detect_reader("data.csv")
>>> isinstance(reader, GenericCSVReader)
True

>>> reader = detect_reader("data.txt")
>>> isinstance(reader, GenericTXTReader)
True

Notes

Currently implements Layer 1 (extension-based) and Layer 4 (fallback). Layers 2 and 3 are reserved for future extensions to detect specific instrument file formats.

piblin_jax.dataio.readers.read_file(filepath)[source]

Read file with automatic format detection.

This is the main entry point for reading individual files. It automatically detects the file format and uses the appropriate reader.

Parameters:

filepath (str | Path) – Path to file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

FileNotFoundError – If the file does not exist
ValueError – If the file format is invalid or cannot be parsed

Examples

Read a CSV file:

>>> measurement = read_file("data.csv")

Read a TXT file:

>>> measurement = read_file("experiment.txt")

Read with explicit path:

>>> from pathlib import Path
>>> measurement = read_file(Path("/data/experiment/sample1.csv"))

Notes

This function combines detection and reading in a single call. For more control over the reading process, you can use detect_reader() followed by calling the reader’s read() method directly.

piblin_jax.dataio.readers.register_reader(extension, reader_class)[source]

Register a custom reader for a file extension.

This allows users to add support for custom file formats without modifying the core library.

Parameters:

extension (str) – File extension (should include the dot, e.g., “.xyz”)
reader_class (Type | Callable) – Reader class or factory function that returns a reader instance. The reader must implement a read(filepath) method that returns a Measurement object.

Examples

Register a custom reader class:

>>> class MyCustomReader:
...     def read(self, filepath):
...         # ... custom reading logic
...         pass
>>> register_reader('.xyz', MyCustomReader)

Register a factory function:

>>> register_reader('.custom', lambda: GenericCSVReader(delimiter='|'))

Notes

Custom readers should follow the same interface as GenericCSVReader, implementing a read(filepath) method that returns a Measurement.

CSV Reader

Generic CSV file reader with metadata extraction.

This module provides a flexible CSV reader that can handle various delimiters, extract metadata from file headers, and create appropriate Dataset objects.

class piblin_jax.dataio.readers.csv.GenericCSVReader(delimiter=',', comment_char='#')[source]

Bases: object

Generic CSV file reader with metadata extraction.

This reader handles CSV files with optional header comments containing metadata. It supports various delimiters and automatically creates Dataset objects from the parsed data.

Parameters:

delimiter (str, optional) – Column delimiter character (default: “,”). Common values: “,” (CSV), “\t” (TSV), “;” (European CSV)
comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a standard CSV file:

>>> reader = GenericCSVReader()
>>> measurement = reader.read("data.csv")

Read a tab-delimited file:

>>> reader = GenericCSVReader(delimiter="\t")
>>> measurement = reader.read("data.tsv")

File format example:

# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0

Methods

read(filepath)

Read CSV file and return Measurement object.

__init__(delimiter=',', comment_char='#')[source]

Initialize GenericCSVReader.

See class docstring for parameter details.

read(filepath)[source]

Read CSV file and return Measurement object.

Parses the CSV file, extracting metadata from headers and creating appropriate Dataset objects from the data columns.

Parameters:

filepath (str | Path) – Path to CSV file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

FileNotFoundError – If the file does not exist
ValueError – If the file format is invalid or cannot be parsed

TXT Reader

Generic TXT file reader for whitespace-delimited data files.

This module provides a TXT reader that extends the CSV reader for whitespace-delimited files, commonly used in scientific data.

class piblin_jax.dataio.readers.txt.GenericTXTReader(comment_char='#')[source]

Bases: GenericCSVReader

Generic TXT file reader (whitespace-delimited).

This reader handles text files with whitespace-delimited columns (spaces or tabs). It inherits from GenericCSVReader but uses whitespace splitting instead of a specific delimiter.

Parameters:

comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a whitespace-delimited file:

>>> reader = GenericTXTReader()
>>> measurement = reader.read("data.txt")

File format example:

# Temperature: 25
# Sample: A1
0.0 0.0
1.0 1.0
2.0 4.0

Notes

This reader is suitable for files where columns are separated by any amount of whitespace (spaces, tabs, or combinations). It automatically handles varying amounts of whitespace between columns.

Methods

read(filepath)

Read CSV file and return Measurement object.

__init__(comment_char='#')[source]

Initialize GenericTXTReader.

See class docstring for parameter details.

Hierarchy Building

Hierarchy building algorithm for organizing measurements.

This module provides algorithms for building hierarchical data structures from flat lists of measurements by analyzing conditions and grouping data.

piblin_jax.dataio.hierarchy.build_hierarchy(measurements)[source]

Build hierarchical structure from flat list of measurements.

Analyzes the conditions across all measurements and organizes them into a hierarchical structure:

1. Extract all conditions from all measurements
2. Identify constant conditions (same across all) -> Experiment level
3. Identify varying conditions -> MeasurementSet grouping
4. Group measurements by conditions

Parameters:

measurements (list[Measurement]) – Flat list of measurements to organize

Returns:

Hierarchical organization of measurements

Return type:

ExperimentSet

Notes

Current implementation creates a simple hierarchy:

- One ExperimentSet containing
- One Experiment containing
- One MeasurementSet with all measurements

Future enhancements can implement more sophisticated grouping based on varying conditions to create multiple Experiments and MeasurementSets.

Examples

Build hierarchy from file list:

>>> from piblin_jax.dataio.readers import read_file
>>> measurements = [
...     read_file("sample1.csv"),
...     read_file("sample2.csv"),
... ]
>>> experiment_set = build_hierarchy(measurements)
>>> len(experiment_set.experiments)
1

Access measurements:

>>> for exp in experiment_set.experiments:
...     for ms in exp.measurement_sets:
...         for m in ms.measurements:
...             print(m.conditions)

piblin_jax.dataio.hierarchy.group_by_conditions(measurements, grouping_keys)[source]

Group measurements by specific condition keys.

This is a utility function for more advanced hierarchy building that groups measurements based on specific condition values.

Parameters:

measurements (list[Measurement]) – Measurements to group
grouping_keys (list[str]) – Condition keys to group by

Returns:

Dictionary mapping condition value tuples to lists of measurements

Return type:

dict[tuple[Any, ...], list[Measurement]]

Examples

Group by temperature:

>>> groups = group_by_conditions(measurements, ['Temperature'])
>>> for temp_value, group in groups.items():
...     print(f"Temperature {temp_value}: {len(group)} measurements")

Group by multiple conditions:

>>> groups = group_by_conditions(
...     measurements,
...     ['Temperature', 'Pressure']
... )

Notes

This function is provided for future extensions to the hierarchy building algorithm that may want to create separate Experiments or MeasurementSets based on specific conditions.
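The grouping itself is a standard dictionary-of-lists pattern. The sketch below applies it to plain condition dictionaries rather than Measurement objects; group_condition_dicts is a hypothetical helper, and using None for a missing key is an assumption rather than documented behavior.

```python
from collections import defaultdict
from typing import Any


def group_condition_dicts(
    items: list[dict[str, Any]],
    grouping_keys: list[str],
) -> dict[tuple[Any, ...], list[dict[str, Any]]]:
    """Group items by the tuple of their values for grouping_keys."""
    groups: dict[tuple[Any, ...], list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        # Assumption: a missing key contributes None to the group key.
        key = tuple(item.get(k) for k in grouping_keys)
        groups[key].append(item)
    return dict(groups)


# Hypothetical conditions for four measurements
items = [
    {"Temperature": 25, "Pressure": 1.0, "Sample": "A1"},
    {"Temperature": 25, "Pressure": 1.0, "Sample": "A2"},
    {"Temperature": 50, "Pressure": 1.0, "Sample": "A1"},
    {"Temperature": 50, "Pressure": 2.0, "Sample": "A1"},
]
by_temperature = group_condition_dicts(items, ["Temperature"])
by_both = group_condition_dicts(items, ["Temperature", "Pressure"])
for key, group in by_both.items():
    print(key, len(group))
```

Keys are always tuples, even for a single grouping key, which keeps the return type uniform across one or several conditions.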

piblin_jax.dataio.hierarchy.identify_varying_conditions(measurements)[source]

Identify which conditions vary across measurements.

Parameters:

measurements (list[Measurement]) – Measurements to analyze

Returns:

Set of condition keys that have different values across measurements

Return type:

set[str]

Examples

>>> varying = identify_varying_conditions(measurements)
>>> print(varying)
{'Temperature', 'Sample'}

Notes

This function is useful for determining which conditions should be used to group measurements into different MeasurementSets or Experiments.

Writers

Data writers module for piblin-jax.

This module provides functionality for writing piblin-jax data structures to various file formats. Writer implementations will be added in future phases to support common data serialization formats.

Future implementations may include:

- CSV writers for tabular data
- HDF5 writers for large datasets
- JSON/YAML writers for metadata
- Binary formats for efficient storage

Examples

Future usage example:

from piblin_jax.dataio.writers import CSVWriter

writer = CSVWriter()
writer.write(dataset, 'output.csv')
