h5py
https://github.com/h5py/h5py
h5py is a pythonic wrapper around HDF5 that enables reading and writing HDF5 files from Python with ...
Context Summary (auto-generated)
# h5py

h5py is a Pythonic interface to the HDF5 binary data format that lets you store huge amounts of numerical data and easily manipulate it from NumPy. HDF5 enables you to slice into multi-terabyte datasets stored on disk as if they were real NumPy arrays, store thousands of datasets in a single file with hierarchical organization, and attach metadata as attributes to any object. The library runs on Python 3.10+ and provides both high-level and low-level interfaces to the HDF5 library.

The core concepts are simple: groups work like dictionaries, and datasets work like NumPy arrays. Every object in an HDF5 file has a name arranged in a POSIX-style hierarchy with `/`-separators. h5py supports a variety of transparent storage features including compression, error detection, chunked I/O, parallel I/O with MPI, and Single Writer Multiple Reader (SWMR) mode for concurrent access patterns.

## File Objects - Opening and Creating Files

File objects serve as your entry point into HDF5. They support standard modes like r/w/a and should be closed when no longer in use. Every File instance is also an HDF5 group representing the root group of the file.

```python
import h5py
import numpy as np

# Create new file (truncate if exists)
f = h5py.File('myfile.hdf5', 'w')
f.close()

# Open existing file (read-only by default)
f = h5py.File('myfile.hdf5', 'r')
f.close()

# Read/write if exists, create otherwise
f = h5py.File('myfile.hdf5', 'a')
f.close()

# Using context manager (recommended)
with h5py.File('myfile.hdf5', 'w') as f:
    f['dataset'] = np.arange(100)
    print(f"File name: {f.filename}")
    print(f"Mode: {f.mode}")

# File access modes:
# 'r'         - Read-only, file must exist (default)
# 'r+'        - Read/write, file must exist
# 'w'         - Create file, truncate if exists
# 'w-' or 'x' - Create file, fail if exists
# 'a'         - Read/write if exists, create otherwise
```

## Creating Datasets

Datasets are homogeneous collections of data elements with an immutable datatype and shape. They support NumPy-style slicing for reading and writing data to disk.

```python
import h5py
import numpy as np

with h5py.File('datasets.hdf5', 'w') as f:
    # Create dataset with shape and dtype
    dset1 = f.create_dataset('default', (100,))
    dset2 = f.create_dataset('ints', (100,), dtype='i8')

    # Create dataset from existing array
    arr = np.arange(100)
    dset3 = f.create_dataset('from_array', data=arr)

    # Shorthand: assign array directly to group
    f['shorthand'] = np.random.random((50, 50))

    # Create multidimensional dataset
    dset4 = f.create_dataset('3d_data', (10, 20, 30), dtype='f')

    # Access dataset properties
    print(f"Shape: {dset4.shape}")   # (10, 20, 30)
    print(f"Dtype: {dset4.dtype}")   # float32
    print(f"Size: {dset4.size}")     # 6000
    print(f"Ndim: {dset4.ndim}")     # 3
```

## Reading and Writing Data with Slicing

HDF5 datasets support NumPy-style slicing to read and write data. Slice specifications are translated directly to HDF5 hyperslab selections for fast, efficient file access.

```python
import h5py
import numpy as np

with h5py.File('slicing.hdf5', 'w') as f:
    # Create and populate a dataset
    dset = f.create_dataset('data', (10, 10, 10), dtype='f')
    dset[...] = np.random.random((10, 10, 10))

    # Various slicing operations
    value = dset[0, 0, 0]        # Single element
    row = dset[0, 2:10, 1:9:3]   # Slice with step
    plane = dset[:, ::2, 5]      # Every other element
    first = dset[0]              # First 2D slice
    partial = dset[1, 5]         # 1D slice
    ellipsis1 = dset[0, ...]     # Use ellipsis
    ellipsis2 = dset[..., 6]     # Ellipsis at start
    all_data = dset[()]          # Read entire dataset

    # Writing data with broadcasting
    dset[0, :, :] = np.arange(10)  # Broadcasts to (10, 10)

    # Correct way to write (single indexing operation)
    dset[0, 1] = 3.0      # Correct
    # dset[0][1] = 3.0    # Wrong! Modifies a copy only

    print(f"Retrieved shape: {all_data.shape}")
```
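Beyond plain slices, h5py selections also accept a list of indices along a single axis ("fancy" indexing, with the indices given in increasing order), and `Dataset.read_direct` reads a selection straight into a pre-allocated NumPy array without an intermediate copy. A brief sketch, reusing the `slicing.hdf5` file created above:

```python
import h5py
import numpy as np

with h5py.File('slicing.hdf5', 'r') as f:
    dset = f['data']   # the (10, 10, 10) dataset written above

    # Coordinate-list ("fancy") selection: a list is allowed along one
    # axis, with indices in increasing order
    picked = dset[0, [1, 3, 7], :]
    print(f"Picked shape: {picked.shape}")   # (3, 10)

    # Read a selection directly into an existing array
    out = np.empty((5, 10, 10), dtype=dset.dtype)
    dset.read_direct(out, source_sel=np.s_[0:5, :, :], dest_sel=np.s_[0:5, :, :])
    print(f"Copied shape: {out.shape}")
```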
## Groups and Hierarchical Organization

Groups are the container mechanism for organizing HDF5 files. They operate like dictionaries with string keys for object names and support POSIX-style paths with `/`-separators.

```python
import h5py
import numpy as np

with h5py.File('groups.hdf5', 'w') as f:
    # Create groups
    grp = f.create_group('subgroup')
    print(f"Group name: {grp.name}")       # /subgroup

    # Create nested groups
    subgrp = grp.create_group('nested')
    print(f"Nested name: {subgrp.name}")   # /subgroup/nested

    # Create intermediate groups automatically
    deep = f.create_group('/deep/path/to/group')

    # Create dataset in group
    dset = grp.create_dataset('data', (50,), dtype='f')
    print(f"Dataset path: {dset.name}")    # /subgroup/data

    # Dictionary-style access
    retrieved = f['subgroup/data']
    print(f"'subgroup' in f: {'subgroup' in f}")

    # Iteration over group members
    for name in f:
        print(f"Member: {name}")

    # Get keys, values, items
    print(f"Keys: {list(f.keys())}")

    # Recursive traversal with visit
    def print_name(name):
        print(name)
    f.visit(print_name)

    # Visit with objects
    def print_name_and_obj(name, obj):
        print(f"{name}: {type(obj)}")
    f.visititems(print_name_and_obj)
```

## Attributes - Storing Metadata

Attributes are small named pieces of data attached directly to Groups and Datasets. They are the official way to store metadata in HDF5 and support a dictionary-style interface.

```python
import h5py
import numpy as np

with h5py.File('attributes.hdf5', 'w') as f:
    dset = f.create_dataset('data', data=np.arange(100))

    # Set attributes using dictionary syntax
    dset.attrs['temperature'] = 99.5
    dset.attrs['units'] = 'Kelvin'
    dset.attrs['calibration'] = np.array([1.0, 2.0, 3.0])

    # Read attributes
    temp = dset.attrs['temperature']
    print(f"Temperature: {temp}")

    # Check existence
    print(f"'units' exists: {'units' in dset.attrs}")

    # Iterate over attributes
    for key in dset.attrs:
        print(f"{key}: {dset.attrs[key]}")

    # Get all as dict-like items
    for key, value in dset.attrs.items():
        print(f"{key} = {value}")

    # Attributes on groups
    grp = f.create_group('experiment')
    grp.attrs['date'] = '2024-01-15'
    grp.attrs['researcher'] = 'Dr. Smith'

    # Modify existing attribute while preserving type
    dset.attrs.modify('temperature', 100.0)

    # Create with explicit type control
    dset.attrs.create('precise', 3.14159, dtype='f8')
```

## Chunked Storage and Compression

Chunked storage divides datasets into regularly-sized pieces stored haphazardly on disk and indexed using a B-tree. This enables resizable datasets and compression filters.
```python
import h5py
import numpy as np

with h5py.File('chunked.hdf5', 'w') as f:
    # Explicit chunk shape
    dset1 = f.create_dataset('chunked', (1000, 1000), chunks=(100, 100), dtype='f')

    # Auto-chunking
    dset2 = f.create_dataset('auto_chunked', (1000, 1000), chunks=True, dtype='f')

    # GZIP compression (levels 0-9)
    dset3 = f.create_dataset('gzip', (1000, 1000), compression='gzip', compression_opts=4)

    # LZF compression (fast, moderate compression)
    dset4 = f.create_dataset('lzf', (1000, 1000), compression='lzf')

    # Shuffle filter improves compression
    dset5 = f.create_dataset('shuffled', (1000, 1000), compression='gzip', shuffle=True)

    # Fletcher32 checksum for error detection
    dset6 = f.create_dataset('checksummed', (1000, 1000), compression='gzip', fletcher32=True)

    # Write data
    data = np.random.random((1000, 1000)).astype('f')
    dset3[...] = data

    # Iterate over chunks
    for chunk_slice in dset1.iter_chunks():
        print(f"Chunk: {chunk_slice}")
        # arr = dset1[chunk_slice]  # Read chunk

    # Check compression info
    print(f"Compression: {dset3.compression}")
    print(f"Chunks: {dset3.chunks}")
```

## Resizable Datasets

Datasets can be resized after creation up to a maximum shape specified with the `maxshape` parameter. Use `None` for unlimited dimensions.

```python
import h5py
import numpy as np

with h5py.File('resizable.hdf5', 'w') as f:
    # Fixed maximum size
    dset1 = f.create_dataset('fixed_max', (10, 10), maxshape=(500, 20))

    # Unlimited on first axis
    dset2 = f.create_dataset('unlimited', (10, 10), maxshape=(None, 10))

    # Unlimited 1D dataset (note tuple syntax)
    dset3 = f.create_dataset('log', (0,), maxshape=(None,), dtype='f')

    # Append data by resizing
    for i in range(5):
        new_data = np.random.random(100)
        current_size = dset3.shape[0]
        new_size = current_size + len(new_data)

        # Resize and write
        dset3.resize(new_size, axis=0)
        dset3[current_size:new_size] = new_data

    print(f"Final shape: {dset3.shape}")     # (500,)

    # Resize multidimensional
    dset1.resize((100, 15))
    print(f"Resized shape: {dset1.shape}")   # (100, 15)
```

## String Handling

HDF5 supports both fixed-length and variable-length strings with ASCII or UTF-8 encoding. h5py provides utilities to create and check string types.

```python
import h5py
import numpy as np

with h5py.File('strings.hdf5', 'w') as f:
    # Variable-length strings (implicit, UTF-8)
    f['vlen_strings'] = ["hello", "world", "variable", "length"]

    # Variable-length strings (explicit)
    dt = h5py.string_dtype(encoding='utf-8')
    dset = f.create_dataset('vlen_utf8', (4,), dtype=dt)
    dset[:] = ["alpha", "beta", "gamma", "delta"]

    # Fixed-length strings (using NumPy S dtype)
    f['fixed_strings'] = np.array(["abc", "def"], dtype='S10')

    # Fixed-length with explicit dtype
    dt_fixed = h5py.string_dtype(encoding='utf-8', length=20)
    dset2 = f.create_dataset('fixed_utf8', (3,), dtype=dt_fixed)
    dset2[:] = ["one", "two", "three"]

    # Reading strings
    data = f['vlen_strings'][:]
    print(f"Type: {type(data[0])}")   # bytes

    # Read as Python str objects
    str_data = f['vlen_strings'].asstr()[:]
    print(f"As string: {str_data}")

    # Check string dtype info
    info = h5py.check_string_dtype(dset.dtype)
    if info:
        print(f"Encoding: {info.encoding}, Length: {info.length}")
```

## Special Types - Enums and Variable-Length Arrays

h5py supports HDF5 special types including enumerated types and variable-length (ragged) arrays that have no direct NumPy equivalent.
```python
import h5py
import numpy as np

with h5py.File('special_types.hdf5', 'w') as f:
    # Enumerated type
    colors = {"RED": 0, "GREEN": 1, "BLUE": 2, "YELLOW": 3}
    dt_enum = h5py.enum_dtype(colors, basetype='i')
    dset_enum = f.create_dataset('colors', (10,), dtype=dt_enum)
    dset_enum[0] = 0  # RED
    dset_enum[1] = 2  # BLUE

    # Check enum dtype
    enum_info = h5py.check_enum_dtype(dset_enum.dtype)
    print(f"Enum values: {enum_info}")

    # Variable-length (ragged) arrays
    dt_vlen = h5py.vlen_dtype(np.dtype('int32'))
    dset_vlen = f.create_dataset('ragged', (5,), dtype=dt_vlen)
    dset_vlen[0] = [1, 2, 3]
    dset_vlen[1] = [10, 20, 30, 40, 50]
    dset_vlen[2] = [100]

    # Read single element (returns array)
    print(f"Element 0: {dset_vlen[0]}")   # array([1, 2, 3])

    # Read multiple (returns object array of arrays)
    print(f"Elements 0:2: {dset_vlen[0:2]}")

    # Complex numbers (compatible format)
    complex_arr = np.array([1+2j, 3+4j, 5+6j], dtype='c8')
    f['complex'] = complex_arr

    # Opaque dtype for datetime
    dt_arr = np.array([np.datetime64('2024-01-15')])
    f['datetime'] = dt_arr.astype(h5py.opaque_dtype(dt_arr.dtype))
```

## Links - Hard, Soft, and External

HDF5 supports multiple link types: hard links (direct pointers), soft links (symbolic paths), and external links (references to other files).

```python
import h5py
import numpy as np

# Create external file first
with h5py.File('external_source.hdf5', 'w') as f:
    f['shared_data'] = np.arange(100)

with h5py.File('links.hdf5', 'w') as f:
    # Create a dataset
    f['original'] = np.arange(10)

    # Hard link (same object, different name)
    f['hard_link'] = f['original']
    print(f"Same object: {f['hard_link'] == f['original']}")  # True

    # Soft link (symbolic path)
    f['soft_link'] = h5py.SoftLink('/original')

    # Soft link to non-existent target (will dangle)
    f['dangling'] = h5py.SoftLink('/does_not_exist')

    # External link to another file
    f['external'] = h5py.ExternalLink('external_source.hdf5', '/shared_data')

    # Access external data (file opened transparently)
    external_data = f['external'][:]
    print(f"External data: {external_data[:5]}")

    # Check link type using get()
    link_info = f.get('soft_link', getlink=True)
    print(f"Link type: {type(link_info)}")   # SoftLink
    print(f"Link path: {link_info.path}")
```

## File Drivers and In-Memory Files

HDF5 provides various file drivers for different storage backends including in-memory files, split files, and cloud storage (S3).

```python
import h5py
import numpy as np
import io

# In-memory file with BytesIO
bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['data'] = np.arange(100)

# Get raw bytes
raw_bytes = bio.getvalue()
print(f"File size: {len(raw_bytes)} bytes")

# Read from BytesIO
bio.seek(0)
with h5py.File(bio, 'r') as f:
    print(f"Data: {f['data'][:5]}")

# Core driver (pure in-memory, optional write-back)
with h5py.File('memory_only.hdf5', 'w', driver='core', backing_store=False) as f:
    f['temp'] = np.random.random((1000, 1000))
    # File discarded when closed

# Core driver with write-back
with h5py.File('core_backed.hdf5', 'w', driver='core', backing_store=True) as f:
    f['persistent'] = np.arange(50)
    # Written to disk on close

# New in-memory API (h5py 3.13+)
# f = h5py.File.in_memory()
# f['data'] = [1, 2, 3]
# hdf_bytes = f.id.get_file_image()

# Family driver for large files on limited filesystems
# with h5py.File('family%d.hdf5', 'w', driver='family',
#                memb_size=2**30) as f:
#     f['large'] = np.zeros((10000, 10000))
```
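The cloud-storage (S3) case mentioned above is handled by HDF5's read-only `ros3` driver, which h5py exposes only when the underlying HDF5 library was built with ros3 support. A minimal sketch, assuming such a build; the bucket URL and credentials below are placeholders:

```python
import h5py

# Read-only access to an HDF5 file in S3 via the ros3 driver
# (requires an HDF5 build with ros3 enabled; the URL is a placeholder)
url = "https://example-bucket.s3.amazonaws.com/archive/data.hdf5"

with h5py.File(url, 'r', driver='ros3') as f:
    print(list(f.keys()))

# For private buckets, credentials are passed as bytes:
# with h5py.File(url, 'r', driver='ros3', aws_region=b'us-east-1',
#                secret_id=b'<access key id>', secret_key=b'<secret key>') as f:
#     ...
```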
## Virtual Datasets (VDS)

Virtual datasets map multiple source datasets into a single sliceable dataset via an interface layer, allowing transparent access to distributed data.

```python
import h5py
import numpy as np

# Create source files
for i in range(4):
    with h5py.File(f'source_{i}.hdf5', 'w') as f:
        f['data'] = np.arange(100) + i * 100

# Create virtual dataset layout
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i8')
for i in range(4):
    source = h5py.VirtualSource(f'source_{i}.hdf5', 'data', shape=(100,))
    layout[i] = source

# Create file with virtual dataset
with h5py.File('virtual.hdf5', 'w', libver='latest') as f:
    f.create_virtual_dataset('combined', layout, fillvalue=-1)

# Read virtual dataset (transparent to reader)
with h5py.File('virtual.hdf5', 'r') as f:
    vds = f['combined']
    print(f"Virtual shape: {vds.shape}")   # (4, 100)
    print(f"Is virtual: {vds.is_virtual}")
    print(f"Row 0: {vds[0, :10]}")
    print(f"Row 3: {vds[3, :10]}")

    # Get source information
    for src in vds.virtual_sources():
        print(f"Source: {src.file_name}, {src.dset_name}")

# Context manager for building VDS
with h5py.File('vds_context.hdf5', 'w', libver='latest') as f:
    with f.build_virtual_dataset('vdata', (4, 100), 'i8') as layout:
        for i in range(4):
            layout[i] = h5py.VirtualSource(
                f'source_{i}.hdf5', 'data', shape=(100,))
```

## Single Writer Multiple Reader (SWMR)

SWMR mode allows concurrent reading of an HDF5 file while it is being written from another process, with guaranteed file consistency.
```python
import h5py
import numpy as np

# Writer process
with h5py.File('swmr.hdf5', 'w', libver='latest') as f:
    # Create resizable dataset before SWMR mode
    dset = f.create_dataset('data', (4,), maxshape=(None,), chunks=(4,), dtype='i')
    dset[:] = [1, 2, 3, 4]

    # Switch to SWMR mode
    f.swmr_mode = True
    print(f"SWMR mode: {f.swmr_mode}")   # True
    # Now readers can open the file

    # Append data in SWMR mode
    for i in range(5):
        current_size = dset.shape[0]
        new_size = current_size + 4
        dset.resize(new_size, axis=0)
        dset[current_size:] = np.arange(4) + i * 10

        # Flush to make visible to readers
        dset.flush()

# Reader process (separate script/process)
with h5py.File('swmr.hdf5', 'r', libver='latest', swmr=True) as f:
    dset = f['data']

    # Refresh to see latest data
    dset.refresh()
    shape = dset.shape
    print(f"Current shape: {shape}")

    # Read latest data
    data = dset[:]
    print(f"Data: {data}")
```

## Parallel HDF5 with MPI

Parallel HDF5 enables writing to a single file from multiple MPI processes simultaneously, useful for high-performance computing workloads.

```python
# Requires MPI-enabled HDF5 build and mpi4py
# Run with: mpiexec -n 4 python script.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

# Open file with MPI driver (collective operation)
with h5py.File('parallel.hdf5', 'w', driver='mpio', comm=comm) as f:
    # Create dataset (collective - all processes must call)
    dset = f.create_dataset('data', (size, 100), dtype='f')

    # Each process writes its own row (independent)
    dset[rank, :] = np.random.random(100) + rank

    # Create group (collective)
    grp = f.create_group('metadata')
    grp.attrs['num_processes'] = size

    # Synchronize before reading
    comm.Barrier()

    # All processes can read
    if rank == 0:
        print(f"Total shape: {dset.shape}")
        print(f"Sum: {dset[:].sum()}")

# Atomic mode for stricter consistency
# with h5py.File('atomic.hdf5', 'w', driver='mpio',
#                comm=comm) as f:
#     f.atomic = True  # Enable atomic mode
```

## Dimension Scales

Dimension scales label dataset dimensions and associate coordinate datasets, providing context for interpreting multidimensional data.
```python
import h5py
import numpy as np

with h5py.File('dimscales.hdf5', 'w') as f:
    # Create main dataset
    data = np.random.random((100, 50, 25))
    f['temperature'] = data

    # Create coordinate arrays
    f['time'] = np.arange(100) * 0.1   # seconds
    f['latitude'] = np.linspace(-90, 90, 50)
    f['longitude'] = np.linspace(-180, 180, 25)

    # Convert to dimension scales
    f['time'].make_scale('time')
    f['latitude'].make_scale('latitude')
    f['longitude'].make_scale('longitude')

    # Attach scales to dimensions
    f['temperature'].dims[0].attach_scale(f['time'])
    f['temperature'].dims[1].attach_scale(f['latitude'])
    f['temperature'].dims[2].attach_scale(f['longitude'])

    # Label dimensions
    f['temperature'].dims[0].label = 'time'
    f['temperature'].dims[1].label = 'lat'
    f['temperature'].dims[2].label = 'lon'

# Read dimension scale information
with h5py.File('dimscales.hdf5', 'r') as f:
    dset = f['temperature']

    # Get dimension labels
    labels = [dim.label for dim in dset.dims]
    print(f"Labels: {labels}")   # ['time', 'lat', 'lon']

    # Get scale names for a dimension
    print(f"Dim 0 scales: {list(dset.dims[0].keys())}")

    # Access scale dataset
    time_scale = dset.dims[0][0]   # First scale on dim 0
    print(f"Time values: {time_scale[:5]}")

    # Check if dataset is a scale
    print(f"Is scale: {f['time'].is_scale}")   # True
```

## Direct Chunk Read/Write

For advanced use cases, h5py allows direct reading and writing of compressed chunks without decompression, enabling efficient data pipelines.

```python
import h5py
import numpy as np
import zlib

with h5py.File('direct_chunks.hdf5', 'w') as f:
    # Create chunked, compressed dataset
    dset = f.create_dataset('data', (100, 100), chunks=(10, 10),
                            compression='gzip', compression_opts=4)

    # Normal write (h5py handles compression)
    dset[:10, :10] = np.random.random((10, 10))

    # Write pre-compressed chunk directly
    chunk_data = np.arange(100, dtype='f').reshape(10, 10)
    compressed = zlib.compress(chunk_data.tobytes(), level=4)

    # Write compressed bytes directly to the chunk at offset (10, 0)
    dset.id.write_direct_chunk((10, 0), compressed)

with h5py.File('direct_chunks.hdf5', 'r') as f:
    dset = f['data']

    # Normal read (automatic decompression)
    data = dset[10:20, 0:10]
    print(f"Read data shape: {data.shape}")

    # Read raw compressed chunk
    # compressed_bytes = dset.id.read_direct_chunk((10, 0))
```

## Copy and Move Operations

Groups provide methods to copy objects within or between files and move/rename objects within a file.

```python
import h5py
import numpy as np

with h5py.File('source.hdf5', 'w') as src:
    src['data'] = np.arange(100)
    src['data'].attrs['info'] = 'original'
    grp = src.create_group('group')
    grp['nested'] = np.zeros((10, 10))

with h5py.File('source.hdf5', 'r') as src:
    with h5py.File('dest.hdf5', 'w') as dst:
        # Copy dataset between files
        src.copy('data', dst, name='copied_data')

        # Copy group with all contents
        src.copy('group', dst)

        # Copy without attributes
        src.copy('data', dst, name='no_attrs', without_attrs=True)

with h5py.File('dest.hdf5', 'a') as f:
    # Move/rename within file
    f.move('copied_data', 'renamed_data')

    # Copy within same file
    f.copy('renamed_data', f, name='duplicate')
    print(f"Keys: {list(f.keys())}")

    # Delete object
    del f['duplicate']
```

h5py is the standard Python interface for working with HDF5 files in scientific computing, data analysis, and machine learning workflows. Its NumPy-like API makes it intuitive for Python users while providing access to HDF5's powerful features for handling large datasets that don't fit in memory.
Common use cases include storing simulation output, managing experimental data with metadata, sharing large datasets across research teams, and building data pipelines that process data in chunks. The library integrates seamlessly with the scientific Python ecosystem including NumPy, pandas, and xarray. For large-scale applications, h5py supports parallel I/O through MPI for HPC clusters, SWMR mode for real-time data streaming applications, and virtual datasets for organizing distributed data. The combination of hierarchical organization, compression, chunking, and metadata support makes HDF5 files self-describing archives that remain accessible for long-term data preservation.
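As a small illustration of the chunk-oriented pipeline pattern mentioned above, the sketch below streams a dataset block by block so that only one slab is in memory at a time; the file and dataset names are placeholders chosen for the example:

```python
import h5py
import numpy as np

# Build a sample input file so the sketch is self-contained; in practice
# the input would be an existing (possibly very large) dataset.
with h5py.File('pipeline_input.hdf5', 'w') as f:
    f.create_dataset('raw', data=np.random.random((10_000, 64)),
                     chunks=(1_000, 64), compression='gzip')

block = 1_000  # rows processed per iteration

with h5py.File('pipeline_input.hdf5', 'r') as src, \
     h5py.File('pipeline_output.hdf5', 'w') as dst:
    raw = src['raw']
    out = dst.create_dataset('normalized', shape=raw.shape, dtype='f8',
                             chunks=raw.chunks, compression='gzip')

    # Stream through the dataset one block of rows at a time, so only a
    # single slab is resident in memory.
    for start in range(0, raw.shape[0], block):
        stop = min(start + block, raw.shape[0])
        slab = raw[start:stop]                 # read one block
        out[start:stop] = (slab - slab.mean()) / slab.std()

    out.attrs['description'] = 'per-block normalized copy of raw'
```

Reading and writing in multiples of the chunk shape keeps each I/O operation aligned with how HDF5 lays the data out on disk.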