Data I/O

Overview

The piblin_jax.dataio module provides a file I/O system for reading measurement data; writers are planned but not yet implemented. It uses an extensible architecture that supports multiple file formats, with automatic format detection and hierarchy building.

The I/O system is designed around several key principles:

  • Format Agnostic: The module provides generic readers for CSV and TXT files that automatically parse column-based data. The extensible reader registry allows easy addition of custom format parsers without modifying core code.

  • Auto-Detection: File formats are detected automatically, currently from the file extension with a fallback to a generic reader; header-based and content-based detection are planned. You can read files without knowing their format in advance, which makes batch processing straightforward.

  • Automatic Hierarchy Building: When reading multiple files, the system analyzes metadata to identify constant and varying experimental conditions and builds a hierarchical structure (ExperimentSet, Experiment, MeasurementSet). The current implementation places all measurements in a single Experiment containing one MeasurementSet; finer grouping by varying conditions is planned. This removes the need to organize large datasets by hand.

  • Batch Operations: Read entire directories or multiple files at once. The module provides convenient functions for common workflows like reading all CSV files in a directory or processing files from multiple experimental runs.

  • Metadata Preservation: All file-level metadata (filenames, paths, timestamps) is automatically captured and attached to the resulting data structures. Inline metadata from file headers is also preserved.

The module currently supports CSV and TXT formats out of the box, with the infrastructure in place for adding instrument-specific readers (e.g., TA Instruments, Anton Paar) as needed.

Quick Examples

Reading a Single File

Read a data file with automatic format detection:

from piblin_jax.dataio import read_file

# Read CSV file - format auto-detected
measurement = read_file("experiment_data.csv")

# Access the data
print(f"Number of datasets: {len(measurement.datasets)}")
print(f"Metadata: {measurement.metadata}")

Reading Multiple Files

Read and organize multiple files into a hierarchy:

from piblin_jax.dataio import read_files

# List of files to read
files = [
    "sample_25C_rep1.csv",
    "sample_25C_rep2.csv",
    "sample_25C_rep3.csv",
    "sample_50C_rep1.csv",
    "sample_50C_rep2.csv",
]

# Read all files and build hierarchy
experiment_set = read_files(files)

# Hierarchy is built automatically; the current implementation
# places all measurements in one Experiment and one MeasurementSet
for exp in experiment_set.experiments:
    for ms in exp.measurement_sets:
        print(f"Measurements: {len(ms.measurements)}")

Reading Entire Directories

Process all files in a directory:

from piblin_jax.dataio import read_directory

# Read all CSV files in directory
experiment_set = read_directory(
    "/path/to/data",
    pattern="*.csv"
)

# Read recursively with custom pattern
experiment_set = read_directory(
    "/path/to/data",
    pattern="sample_*.txt",
    recursive=True
)

print(f"Total measurements: {len(experiment_set.get_all_measurements())}")

Registering Custom Readers

Add support for custom file formats:

from piblin_jax.dataio import read_file, register_reader
from piblin_jax.data.collections import Measurement

class MyCustomReader:
    """Read custom instrument format."""

    def read(self, filepath):
        # Parse file (parse_my_format is a placeholder for your own parser)
        data = parse_my_format(filepath)

        # Create datasets (create_dataset is a placeholder as well)
        datasets = [create_dataset(data)]

        # Return Measurement
        return Measurement(
            datasets=datasets,
            metadata={"source": str(filepath)}
        )

# Register the reader class for .dat files; the registry expects a class
# or factory whose instances implement read(filepath)
register_reader(".dat", MyCustomReader)

# Now you can read .dat files
measurement = read_file("data.dat")

API Reference

Module Contents

Data I/O system for piblin-jax.

This module provides a comprehensive file I/O system with:

  • Generic CSV and TXT readers

  • Auto-detection of file formats

  • Batch reading of multiple files

  • Automatic hierarchy building from file lists

  • Extensible reader registry

Main Functions

  • read_file: Read a single file with auto-detection

  • read_files: Read multiple files and build a hierarchy

  • read_directory: Read all matching files in a directory

  • read_directories: Read multiple directories

Examples

Read a single file:

>>> from piblin_jax.dataio import read_file, read_files, read_directory, read_directories
>>> measurement = read_file("data.csv")

Read multiple files:

>>> files = ["sample1.csv", "sample2.csv", "sample3.csv"]
>>> experiment_set = read_files(files)

Read an entire directory:

>>> experiment_set = read_directory("/path/to/data", pattern="*.csv")

Read multiple directories:

>>> paths = ["/path/to/exp1", "/path/to/exp2"]
>>> experiment_set = read_directories(paths)

piblin_jax.dataio.build_hierarchy(measurements)

Build hierarchical structure from flat list of measurements.

Analyzes the conditions across all measurements and organizes them into a hierarchical structure:

  1. Extract all conditions from all measurements

  2. Identify constant conditions (same across all) -> Experiment level

  3. Identify varying conditions -> MeasurementSet grouping

  4. Group measurements by conditions
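
Steps 1 to 3 above can be sketched with plain dictionaries standing in for each measurement's conditions. This is only an illustration of the idea, not the library's implementation: `split_conditions` is a hypothetical helper, and treating a key that is missing from some measurements as varying is an assumption of the sketch.

```python
from typing import Any


def split_conditions(
    conditions_list: list[dict[str, Any]],
) -> tuple[dict[str, Any], set[str]]:
    """Split condition keys into constant (key -> value) and varying (keys).

    Hypothetical helper: each dict stands in for one measurement's conditions.
    """
    # Step 1: collect every condition key seen in any measurement.
    all_keys: set[str] = set()
    for conditions in conditions_list:
        all_keys.update(conditions)

    constant: dict[str, Any] = {}
    varying: set[str] = set()
    missing = object()  # sentinel for "key absent from this measurement"
    for key in all_keys:
        values = [conditions.get(key, missing) for conditions in conditions_list]
        first = values[0]
        # Steps 2 and 3: constant only if present everywhere with one value.
        if first is not missing and all(v == first for v in values):
            constant[key] = first
        else:
            varying.add(key)
    return constant, varying


conditions_list = [
    {"Sample": "A1", "Temperature": 25},
    {"Sample": "A1", "Temperature": 50},
]
constant, varying = split_conditions(conditions_list)
print(constant)  # {'Sample': 'A1'}
print(varying)   # {'Temperature'}
```

In this picture, the constant conditions would describe the Experiment as a whole, while the varying keys are the candidates for grouping measurements.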

Parameters:

measurements (list[Measurement]) – Flat list of measurements to organize

Returns:

Hierarchical organization of measurements

Return type:

ExperimentSet

Notes

The current implementation creates a simple hierarchy:

  • One ExperimentSet containing

  • One Experiment containing

  • One MeasurementSet with all measurements

Future enhancements can implement more sophisticated grouping based on varying conditions to create multiple Experiments and MeasurementSets.

Examples

Build hierarchy from file list:

>>> from piblin_jax.dataio import build_hierarchy
>>> from piblin_jax.dataio.readers import read_file
>>> measurements = [
...     read_file("sample1.csv"),
...     read_file("sample2.csv"),
... ]
>>> experiment_set = build_hierarchy(measurements)
>>> len(experiment_set.experiments)
1

Access measurements:

>>> for exp in experiment_set.experiments:
...     for ms in exp.measurement_sets:
...         for m in ms.measurements:
...             print(m.conditions)

piblin_jax.dataio.detect_reader(filepath)

Auto-detect appropriate reader for file.

Uses a multi-layer detection strategy:

  1. Extension-based: Matches file extension to registered readers

  2. Header-based: Checks file headers for instrument signatures (future)

  3. Content-based: Analyzes file content structure (future)

  4. Fallback: Returns generic reader based on best guess

Parameters:

filepath (str | Path) – Path to file

Returns:

Instance of appropriate reader class

Return type:

Reader instance

Examples

>>> from piblin_jax.dataio.readers import GenericCSVReader, GenericTXTReader, detect_reader
>>> reader = detect_reader("data.csv")
>>> isinstance(reader, GenericCSVReader)
True
>>> reader = detect_reader("data.txt")
>>> isinstance(reader, GenericTXTReader)
True

Notes

Currently implements Layer 1 (extension-based) and Layer 4 (fallback). Layers 2 and 3 are reserved for future extensions to detect specific instrument file formats.
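
A minimal sketch of how Layer 1 and Layer 4 might fit together is shown below. It does not reproduce the library's code: the stub classes, the `_READERS` dictionary, and `detect_reader_sketch` are hypothetical names, and both the lower-casing of extensions and the choice of the CSV stub as the fallback are assumptions.

```python
from pathlib import Path


class CSVReaderStub:
    """Stand-in for a comma-delimited reader."""


class TXTReaderStub:
    """Stand-in for a whitespace-delimited reader."""


# Hypothetical registry: extension -> reader class or zero-argument factory.
_READERS = {".csv": CSVReaderStub, ".txt": TXTReaderStub}


def detect_reader_sketch(filepath):
    """Return a reader instance chosen by extension, else a generic fallback."""
    # Assumption: extensions are compared case-insensitively.
    extension = Path(filepath).suffix.lower()
    # Layer 1: look the extension up in the registry.
    factory = _READERS.get(extension)
    if factory is None:
        # Layer 4: nothing registered, so fall back to a generic reader.
        factory = CSVReaderStub
    # Calling the class (or factory) with no arguments yields the instance.
    return factory()


print(type(detect_reader_sketch("data.TXT")).__name__)  # TXTReaderStub
print(type(detect_reader_sketch("data.xyz")).__name__)  # CSVReaderStub
```

Storing a zero-argument callable rather than an instance means a class and a factory function can be registered in the same way.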

piblin_jax.dataio.read_directories(path_list, pattern='*.csv', recursive=False)

Read multiple directories and combine into single hierarchy.

Scans multiple directories for matching files and builds a unified hierarchical structure from all measurements.

Parameters:
  • path_list (Sequence[str | Path]) – List of directory paths to scan

  • pattern (str, optional) – Glob pattern for file matching (default: "*.csv")

  • recursive (bool, optional) – If True, search recursively in subdirectories (default: False)

Returns:

Hierarchical organization of all measurements from all directories

Return type:

ExperimentSet

Raises:

Examples

Read from multiple directories:

>>> paths = ["/data/exp1", "/data/exp2", "/data/exp3"]
>>> experiment_set = read_directories(paths)

With custom pattern:

>>> experiment_set = read_directories(
...     paths,
...     pattern="*.txt"
... )

Recursive search:

>>> experiment_set = read_directories(
...     paths,
...     recursive=True
... )

Notes

All measurements from all directories are combined and analyzed together to build a unified hierarchy. This is useful when an experiment spans multiple directories.

piblin_jax.dataio.read_directory(path, pattern='*.csv', recursive=False)

Read all matching files in a directory.

Scans a directory for files matching the pattern, reads them all, and builds a hierarchical structure.

Parameters:
  • path (str | Path) – Directory path to scan

  • pattern (str, optional) – Glob pattern for file matching (default: "*.csv"). Examples: “*.txt”, “*.dat”, “sample_*.csv”

  • recursive (bool, optional) – If True, search recursively in subdirectories (default: False)

Returns:

Hierarchical organization of all measurements

Return type:

ExperimentSet

Raises:

Examples

Read all CSV files in a directory:

>>> experiment_set = read_directory("/data/experiment1")

Read all TXT files:

>>> experiment_set = read_directory("/data/experiment1", pattern="*.txt")

Read recursively:

>>> experiment_set = read_directory(
...     "/data",
...     pattern="*.csv",
...     recursive=True
... )

Read with custom pattern:

>>> experiment_set = read_directory(
...     "/data",
...     pattern="sample_A*.csv"
... )

Notes

Files are sorted alphabetically before reading for consistent ordering. All measurements are analyzed together to build the hierarchy.
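
The file discovery step can be approximated with the standard library alone. The sketch below is an assumption about how pattern matching, recursion, and sorting could be combined; `find_files` is a hypothetical helper, not part of piblin_jax.

```python
import tempfile
from pathlib import Path


def find_files(path, pattern="*.csv", recursive=False):
    """Return files under `path` that match `pattern`, sorted alphabetically."""
    directory = Path(path)
    # rglob descends into subdirectories; glob stays at the top level.
    matches = directory.rglob(pattern) if recursive else directory.glob(pattern)
    # Keep regular files only and sort for a reproducible read order.
    return sorted(p for p in matches if p.is_file())


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("b.csv", "a.csv", "notes.txt", "sub/c.csv"):
        (root / name).write_text("0.0,0.0\n")

    print([p.name for p in find_files(root)])
    # ['a.csv', 'b.csv']
    print([p.name for p in find_files(root, recursive=True)])
    # ['a.csv', 'b.csv', 'c.csv']
```

Sorting matters because the order in which a glob yields directory entries is not guaranteed, so an unsorted result could differ between runs or platforms.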

piblin_jax.dataio.read_file(filepath)

Read file with automatic format detection.

This is the main entry point for reading individual files. It automatically detects the file format and uses the appropriate reader.

Parameters:

filepath (str | Path) – Path to file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

Examples

Read a CSV file:

>>> measurement = read_file("data.csv")

Read a TXT file:

>>> measurement = read_file("experiment.txt")

Read with explicit path:

>>> from pathlib import Path
>>> measurement = read_file(Path("/data/experiment/sample1.csv"))

Notes

This function combines detection and reading in a single call. For more control over the reading process, you can use detect_reader() followed by calling the reader’s read() method directly.

piblin_jax.dataio.read_files(file_list)

Read multiple files and build hierarchical structure.

Reads all files in the list, automatically detecting formats, and organizes them into a hierarchical ExperimentSet based on their experimental conditions.

Parameters:

file_list (Sequence[str | Path]) – List of file paths to read

Returns:

Hierarchical organization of all measurements

Return type:

ExperimentSet

Raises:

Examples

Read specific files:

>>> files = ["sample1.csv", "sample2.csv", "sample3.csv"]
>>> experiment_set = read_files(files)
>>> len(experiment_set.experiments)
1

With Path objects:

>>> from pathlib import Path
>>> files = list(Path("/data").glob("*.csv"))
>>> experiment_set = read_files(files)

Notes

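All measurements from all files are analyzed together to identify constant and varying conditions, and the result is passed through build_hierarchy(). In the current implementation this yields a single Experiment containing one MeasurementSet with all measurements; grouping by varying conditions is planned.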

piblin_jax.dataio.register_reader(extension, reader_class)

Register a custom reader for a file extension.

This allows users to add support for custom file formats without modifying the core library.

Parameters:
  • extension (str) – File extension (should include the dot, e.g., “.xyz”)

  • reader_class (Type | Callable) – Reader class or factory function that returns a reader instance. The reader must implement a read(filepath) method that returns a Measurement object.

Examples

Register a custom reader class:

>>> class MyCustomReader:
...     def read(self, filepath):
...         # ... custom reading logic
...         pass
>>> register_reader('.xyz', MyCustomReader)

Register a factory function:

>>> register_reader('.custom', lambda: GenericCSVReader(delimiter='|'))

Notes

Custom readers should follow the same interface as GenericCSVReader, implementing a read(filepath) method that returns a Measurement.

Readers

Base Reader Interface

File readers and auto-detection system.

This module provides:

  • Generic CSV and TXT readers

  • Multi-layer auto-detection system

  • Extensible reader registry

  • read_file function for automatic file reading

The auto-detection system uses four layers:

  1. Extension-based (.csv, .txt, etc.)

  2. Header-based (instrument signatures)

  3. Content-based (parse first lines)

  4. Fallback to generic readers

class piblin_jax.dataio.readers.GenericCSVReader(delimiter=',', comment_char='#')

Bases: object

Generic CSV file reader with metadata extraction.

This reader handles CSV files with optional header comments containing metadata. It supports various delimiters and automatically creates Dataset objects from the parsed data.

Parameters:
  • delimiter (str, optional) – Column delimiter character (default: “,”). Common values: “,” (CSV), “\t” (TSV), “;” (European CSV)

  • comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a standard CSV file:

>>> reader = GenericCSVReader()
>>> measurement = reader.read("data.csv")

Read a tab-delimited file:

>>> reader = GenericCSVReader(delimiter="\t")
>>> measurement = reader.read("data.tsv")

File format example:

# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0
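
To make the file layout concrete, the sketch below parses the example above with the standard library only. It is not the reader's actual implementation: `parse_delimited_text` is a hypothetical helper, and keeping metadata values as strings and converting every data cell to float are assumptions.

```python
def parse_delimited_text(text, delimiter=",", comment_char="#"):
    """Split text into (metadata, rows) using 'key: value' comment headers."""
    metadata = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(comment_char):
            # Header comment: keep it as metadata when it looks like "key: value".
            key, sep, value = line[len(comment_char):].partition(":")
            if sep:
                metadata[key.strip()] = value.strip()
            continue
        # Data line: one float per delimited column.
        rows.append([float(cell) for cell in line.split(delimiter)])
    return metadata, rows


text = """# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0
"""
metadata, rows = parse_delimited_text(text)
print(metadata)  # {'Temperature': '25', 'Pressure': '1.0', 'Sample': 'A1'}
print(rows)      # [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
```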

Methods

read(filepath)

Read CSV file and return Measurement object.

__init__(delimiter=',', comment_char='#')

Initialize GenericCSVReader.

See class docstring for parameter details.

read(filepath)

Read CSV file and return Measurement object.

Parses the CSV file, extracting metadata from headers and creating appropriate Dataset objects from the data columns.

Parameters:

filepath (str | Path) – Path to CSV file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

class piblin_jax.dataio.readers.GenericTXTReader(comment_char='#')

Bases: GenericCSVReader

Generic TXT file reader (whitespace-delimited).

This reader handles text files with whitespace-delimited columns (spaces or tabs). It inherits from GenericCSVReader but uses whitespace splitting instead of a specific delimiter.

Parameters:

comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a whitespace-delimited file:

>>> reader = GenericTXTReader()
>>> measurement = reader.read("data.txt")

File format example:

# Temperature: 25
# Sample: A1
0.0 0.0
1.0 1.0
2.0 4.0

Notes

This reader is suitable for files where columns are separated by any amount of whitespace (spaces, tabs, or combinations). It automatically handles varying amounts of whitespace between columns.
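
In plain Python, this behaviour matches `str.split()` called with no arguments, which treats any run of spaces and tabs as a single separator. Whether the reader uses `str.split()` internally is an assumption; the snippet only illustrates the splitting rule described above.

```python
lines = ["0.0 0.0", "1.0\t1.0", "2.0    \t 4.0"]

# split() with no argument collapses any run of whitespace into one separator.
rows = [[float(cell) for cell in line.split()] for line in lines]
print(rows)  # [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]

# For contrast, splitting on a single space leaves empty strings behind.
naive = "2.0  4.0".split(" ")
print(naive)  # ['2.0', '', '4.0']
```

The contrast shows why a fixed one-character delimiter is a poor fit for column-aligned text files.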

Methods

read(filepath)

Read whitespace-delimited file and return Measurement object.

__init__(comment_char='#')

Initialize GenericTXTReader.

See class docstring for parameter details.

piblin_jax.dataio.readers.detect_reader(filepath)

Auto-detect appropriate reader for file.

Uses a multi-layer detection strategy:

  1. Extension-based: Matches file extension to registered readers

  2. Header-based: Checks file headers for instrument signatures (future)

  3. Content-based: Analyzes file content structure (future)

  4. Fallback: Returns generic reader based on best guess

Parameters:

filepath (str | Path) – Path to file

Returns:

Instance of appropriate reader class

Return type:

Reader instance

Examples

>>> from piblin_jax.dataio.readers import GenericCSVReader, GenericTXTReader, detect_reader
>>> reader = detect_reader("data.csv")
>>> isinstance(reader, GenericCSVReader)
True
>>> reader = detect_reader("data.txt")
>>> isinstance(reader, GenericTXTReader)
True

Notes

Currently implements Layer 1 (extension-based) and Layer 4 (fallback). Layers 2 and 3 are reserved for future extensions to detect specific instrument file formats.

piblin_jax.dataio.readers.read_file(filepath)

Read file with automatic format detection.

This is the main entry point for reading individual files. It automatically detects the file format and uses the appropriate reader.

Parameters:

filepath (str | Path) – Path to file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

Examples

Read a CSV file:

>>> measurement = read_file("data.csv")

Read a TXT file:

>>> measurement = read_file("experiment.txt")

Read with explicit path:

>>> from pathlib import Path
>>> measurement = read_file(Path("/data/experiment/sample1.csv"))

Notes

This function combines detection and reading in a single call. For more control over the reading process, you can use detect_reader() followed by calling the reader’s read() method directly.

piblin_jax.dataio.readers.register_reader(extension, reader_class)

Register a custom reader for a file extension.

This allows users to add support for custom file formats without modifying the core library.

Parameters:
  • extension (str) – File extension (should include the dot, e.g., “.xyz”)

  • reader_class (Type | Callable) – Reader class or factory function that returns a reader instance. The reader must implement a read(filepath) method that returns a Measurement object.

Examples

Register a custom reader class:

>>> class MyCustomReader:
...     def read(self, filepath):
...         # ... custom reading logic
...         pass
>>> register_reader('.xyz', MyCustomReader)

Register a factory function:

>>> register_reader('.custom', lambda: GenericCSVReader(delimiter='|'))

Notes

Custom readers should follow the same interface as GenericCSVReader, implementing a read(filepath) method that returns a Measurement.

CSV Reader

Generic CSV file reader with metadata extraction.

This module provides a flexible CSV reader that can handle various delimiters, extract metadata from file headers, and create appropriate Dataset objects.

class piblin_jax.dataio.readers.csv.GenericCSVReader(delimiter=',', comment_char='#')

Bases: object

Generic CSV file reader with metadata extraction.

This reader handles CSV files with optional header comments containing metadata. It supports various delimiters and automatically creates Dataset objects from the parsed data.

Parameters:
  • delimiter (str, optional) – Column delimiter character (default: “,”). Common values: “,” (CSV), “\t” (TSV), “;” (European CSV)

  • comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a standard CSV file:

>>> reader = GenericCSVReader()
>>> measurement = reader.read("data.csv")

Read a tab-delimited file:

>>> reader = GenericCSVReader(delimiter="\t")
>>> measurement = reader.read("data.tsv")

File format example:

# Temperature: 25
# Pressure: 1.0
# Sample: A1
0.0,0.0
1.0,1.0
2.0,4.0

Methods

read(filepath)

Read CSV file and return Measurement object.

__init__(delimiter=',', comment_char='#')

Initialize GenericCSVReader.

See class docstring for parameter details.

read(filepath)

Read CSV file and return Measurement object.

Parses the CSV file, extracting metadata from headers and creating appropriate Dataset objects from the data columns.

Parameters:

filepath (str | Path) – Path to CSV file

Returns:

Measurement object containing datasets and metadata

Return type:

Measurement

Raises:

TXT Reader

Generic TXT file reader for whitespace-delimited data files.

This module provides a TXT reader that extends the CSV reader for whitespace-delimited files, commonly used in scientific data.

class piblin_jax.dataio.readers.txt.GenericTXTReader(comment_char='#')

Bases: GenericCSVReader

Generic TXT file reader (whitespace-delimited).

This reader handles text files with whitespace-delimited columns (spaces or tabs). It inherits from GenericCSVReader but uses whitespace splitting instead of a specific delimiter.

Parameters:

comment_char (str, optional) – Comment character for header lines (default: “#”)

Examples

Read a whitespace-delimited file:

>>> reader = GenericTXTReader()
>>> measurement = reader.read("data.txt")

File format example:

# Temperature: 25
# Sample: A1
0.0 0.0
1.0 1.0
2.0 4.0

Notes

This reader is suitable for files where columns are separated by any amount of whitespace (spaces, tabs, or combinations). It automatically handles varying amounts of whitespace between columns.

Methods

read(filepath)

Read whitespace-delimited file and return Measurement object.

__init__(comment_char='#')

Initialize GenericTXTReader.

See class docstring for parameter details.

Hierarchy Building

Hierarchy building algorithm for organizing measurements.

This module provides algorithms for building hierarchical data structures from flat lists of measurements by analyzing conditions and grouping data.

piblin_jax.dataio.hierarchy.build_hierarchy(measurements)

Build hierarchical structure from flat list of measurements.

Analyzes the conditions across all measurements and organizes them into a hierarchical structure:

  1. Extract all conditions from all measurements

  2. Identify constant conditions (same across all) -> Experiment level

  3. Identify varying conditions -> MeasurementSet grouping

  4. Group measurements by conditions

Parameters:

measurements (list[Measurement]) – Flat list of measurements to organize

Returns:

Hierarchical organization of measurements

Return type:

ExperimentSet

Notes

The current implementation creates a simple hierarchy:

  • One ExperimentSet containing

  • One Experiment containing

  • One MeasurementSet with all measurements

Future enhancements can implement more sophisticated grouping based on varying conditions to create multiple Experiments and MeasurementSets.

Examples

Build hierarchy from file list:

>>> from piblin_jax.dataio.hierarchy import build_hierarchy
>>> from piblin_jax.dataio.readers import read_file
>>> measurements = [
...     read_file("sample1.csv"),
...     read_file("sample2.csv"),
... ]
>>> experiment_set = build_hierarchy(measurements)
>>> len(experiment_set.experiments)
1

Access measurements:

>>> for exp in experiment_set.experiments:
...     for ms in exp.measurement_sets:
...         for m in ms.measurements:
...             print(m.conditions)

piblin_jax.dataio.hierarchy.group_by_conditions(measurements, grouping_keys)

Group measurements by specific condition keys.

This is a utility function for more advanced hierarchy building that groups measurements based on specific condition values.

Parameters:
  • measurements (list[Measurement]) – Measurements to group

  • grouping_keys (list[str]) – Condition keys to group by

Returns:

Dictionary mapping condition value tuples to lists of measurements

Return type:

dict[tuple[Any, ...], list[Measurement]]

Examples

Group by temperature:

>>> groups = group_by_conditions(measurements, ['Temperature'])
>>> for temp_value, group in groups.items():
...     print(f"Temperature {temp_value}: {len(group)} measurements")

Group by multiple conditions:

>>> groups = group_by_conditions(
...     measurements,
...     ['Temperature', 'Pressure']
... )

Notes

This function is provided for future extensions to the hierarchy building algorithm that may want to create separate Experiments or MeasurementSets based on specific conditions.
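
The grouping itself can be illustrated without the library. In the sketch below, `(name, conditions)` pairs stand in for Measurement objects and `group_by_keys` is a hypothetical helper; mapping a missing condition key to None is an assumption of the sketch.

```python
from collections import defaultdict


def group_by_keys(items, grouping_keys):
    """Group (name, conditions) pairs by the values of `grouping_keys`."""
    groups = defaultdict(list)
    for name, conditions in items:
        # Build one tuple of condition values per item; missing keys become None.
        group_key = tuple(conditions.get(key) for key in grouping_keys)
        groups[group_key].append(name)
    return dict(groups)


items = [
    ("rep1", {"Temperature": 25, "Pressure": 1.0}),
    ("rep2", {"Temperature": 25, "Pressure": 1.0}),
    ("rep3", {"Temperature": 50, "Pressure": 1.0}),
]
print(group_by_keys(items, ["Temperature"]))
# {(25,): ['rep1', 'rep2'], (50,): ['rep3']}
print(group_by_keys(items, ["Temperature", "Pressure"]))
# {(25, 1.0): ['rep1', 'rep2'], (50, 1.0): ['rep3']}
```

Because the documented return type uses tuples as keys, grouping by a single condition still produces one-element tuples such as `(25,)` rather than bare values.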

piblin_jax.dataio.hierarchy.identify_varying_conditions(measurements)

Identify which conditions vary across measurements.

Parameters:

measurements (list[Measurement]) – Measurements to analyze

Returns:

Set of condition keys that have different values across measurements

Return type:

set[str]

Examples

>>> varying = identify_varying_conditions(measurements)
>>> print(varying)
{'Temperature', 'Sample'}

Notes

This function is useful for determining which conditions should be used to group measurements into different MeasurementSets or Experiments.

Writers

Data writers module for piblin-jax.

This module provides functionality for writing piblin-jax data structures to various file formats. Writer implementations will be added in future phases to support common data serialization formats.

Future implementations may include:

  • CSV writers for tabular data

  • HDF5 writers for large datasets

  • JSON/YAML writers for metadata

  • Binary formats for efficient storage

Examples

Planned usage (CSVWriter is not yet implemented):

from piblin_jax.dataio.writers import CSVWriter

writer = CSVWriter()
writer.write(dataset, 'output.csv')