h5py
https://github.com/h5py/h5py
h5py is a pythonic wrapper around HDF5 that enables reading and writing HDF5 files from Python with ...
Context Summary (auto-generated)
# h5py

h5py is a Pythonic interface to the HDF5 binary data format that lets you store huge amounts of numerical data and easily manipulate it from NumPy. HDF5 enables you to slice into multi-terabyte datasets stored on disk as if they were real NumPy arrays, store thousands of datasets in a single file with hierarchical organization, and attach metadata as attributes to any object. The library runs on Python 3.10+ and provides both high-level and low-level interfaces to the HDF5 library.

The core concepts are simple: groups work like dictionaries, and datasets work like NumPy arrays. Every object in an HDF5 file has a name arranged in a POSIX-style hierarchy with `/`-separators. h5py supports a variety of transparent storage features including compression, error detection, chunked I/O, parallel I/O with MPI, and Single Writer Multiple Reader (SWMR) mode for concurrent access patterns.

## File Objects - Opening and Creating Files

File objects serve as your entry point into HDF5. They support standard modes like r/w/a and should be closed when no longer in use. Every File instance is also an HDF5 group representing the root group of the file.

```python
import h5py
import numpy as np

# Create new file (truncate if exists)
f = h5py.File('myfile.hdf5', 'w')
f.close()

# Open existing file (read-only by default)
f = h5py.File('myfile.hdf5', 'r')
f.close()

# Read/write if exists, create otherwise
f = h5py.File('myfile.hdf5', 'a')
f.close()

# Using context manager (recommended)
with h5py.File('myfile.hdf5', 'w') as f:
    f['dataset'] = np.arange(100)
    print(f"File name: {f.filename}")
    print(f"Mode: {f.mode}")

# File access modes:
# 'r'         - Read-only, file must exist (default)
# 'r+'        - Read/write, file must exist
# 'w'         - Create file, truncate if exists
# 'w-' or 'x' - Create file, fail if exists
# 'a'         - Read/write if exists, create otherwise
```

## Creating Datasets

Datasets are homogeneous collections of data elements with an immutable datatype and shape. They support NumPy-style slicing for reading and writing data to disk.

```python
import h5py
import numpy as np

with h5py.File('datasets.hdf5', 'w') as f:
    # Create dataset with shape and dtype
    dset1 = f.create_dataset('default', (100,))
    dset2 = f.create_dataset('ints', (100,), dtype='i8')

    # Create dataset from existing array
    arr = np.arange(100)
    dset3 = f.create_dataset('from_array', data=arr)

    # Shorthand: assign array directly to group
    f['shorthand'] = np.random.random((50, 50))

    # Create multidimensional dataset
    dset4 = f.create_dataset('3d_data', (10, 20, 30), dtype='f')

    # Access dataset properties
    print(f"Shape: {dset4.shape}")   # (10, 20, 30)
    print(f"Dtype: {dset4.dtype}")   # float32
    print(f"Size: {dset4.size}")     # 6000
    print(f"Ndim: {dset4.ndim}")     # 3
```

## Reading and Writing Data with Slicing

HDF5 datasets support NumPy-style slicing to read and write data. Slice specifications are translated directly to HDF5 hyperslab selections for fast, efficient file access.

```python
import h5py
import numpy as np

with h5py.File('slicing.hdf5', 'w') as f:
    # Create and populate a dataset
    dset = f.create_dataset('data', (10, 10, 10), dtype='f')
    dset[...] = np.random.random((10, 10, 10))

    # Various slicing operations
    value = dset[0, 0, 0]        # Single element
    row = dset[0, 2:10, 1:9:3]   # Slice with step
    plane = dset[:, ::2, 5]      # Every other element
    first = dset[0]              # First 2D slice
    partial = dset[1, 5]         # 1D slice
    ellipsis1 = dset[0, ...]     # Use ellipsis
    ellipsis2 = dset[..., 6]     # Ellipsis at start
    all_data = dset[()]          # Read entire dataset

    # Writing data with broadcasting
    dset[0, :, :] = np.arange(10)  # Broadcasts to (10, 10)

    # Correct way to write (single indexing operation)
    dset[0, 1] = 3.0      # Correct
    # dset[0][1] = 3.0    # Wrong! Modifies a copy only

    print(f"Retrieved shape: {all_data.shape}")
```
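Beyond plain slices, h5py selections also accept a list of indices along a single axis ("fancy" indexing, with the indices given in increasing order), and `Dataset.read_direct` reads a selection straight into a pre-allocated NumPy array without an intermediate copy. A brief sketch, reusing the `slicing.hdf5` file created above:

```python
import h5py
import numpy as np

with h5py.File('slicing.hdf5', 'r') as f:
    dset = f['data']   # the (10, 10, 10) dataset written above

    # Coordinate-list ("fancy") selection: a list is allowed along one
    # axis, with indices in increasing order
    picked = dset[0, [1, 3, 7], :]
    print(f"Picked shape: {picked.shape}")   # (3, 10)

    # Read a selection directly into an existing array
    out = np.empty((5, 10, 10), dtype=dset.dtype)
    dset.read_direct(out, source_sel=np.s_[0:5, :, :], dest_sel=np.s_[0:5, :, :])
    print(f"Copied shape: {out.shape}")
```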
## Groups and Hierarchical Organization

Groups are the container mechanism for organizing HDF5 files. They operate like dictionaries with string keys for object names and support POSIX-style paths with `/`-separators.

```python
import h5py
import numpy as np

with h5py.File('groups.hdf5', 'w') as f:
    # Create groups
    grp = f.create_group('subgroup')
    print(f"Group name: {grp.name}")       # /subgroup

    # Create nested groups
    subgrp = grp.create_group('nested')
    print(f"Nested name: {subgrp.name}")   # /subgroup/nested

    # Create intermediate groups automatically
    deep = f.create_group('/deep/path/to/group')

    # Create dataset in group
    dset = grp.create_dataset('data', (50,), dtype='f')
    print(f"Dataset path: {dset.name}")    # /subgroup/data

    # Dictionary-style access
    retrieved = f['subgroup/data']
    print(f"'subgroup' in f: {'subgroup' in f}")

    # Iteration over group members
    for name in f:
        print(f"Member: {name}")

    # Get keys, values, items
    print(f"Keys: {list(f.keys())}")

    # Recursive traversal with visit
    def print_name(name):
        print(name)
    f.visit(print_name)

    # Visit with objects
    def print_name_and_obj(name, obj):
        print(f"{name}: {type(obj)}")
    f.visititems(print_name_and_obj)
```

## Attributes - Storing Metadata

Attributes are small named pieces of data attached directly to Groups and Datasets. They are the official way to store metadata in HDF5 and support a dictionary-style interface.

```python
import h5py
import numpy as np

with h5py.File('attributes.hdf5', 'w') as f:
    dset = f.create_dataset('data', data=np.arange(100))

    # Set attributes using dictionary syntax
    dset.attrs['temperature'] = 99.5
    dset.attrs['units'] = 'Kelvin'
    dset.attrs['calibration'] = np.array([1.0, 2.0, 3.0])

    # Read attributes
    temp = dset.attrs['temperature']
    print(f"Temperature: {temp}")

    # Check existence
    print(f"'units' exists: {'units' in dset.attrs}")

    # Iterate over attributes
    for key in dset.attrs:
        print(f"{key}: {dset.attrs[key]}")

    # Get all as dict-like items
    for key, value in dset.attrs.items():
        print(f"{key} = {value}")

    # Attributes on groups
    grp = f.create_group('experiment')
    grp.attrs['date'] = '2024-01-15'
    grp.attrs['researcher'] = 'Dr. Smith'

    # Modify existing attribute while preserving type
    dset.attrs.modify('temperature', 100.0)

    # Create with explicit type control
    dset.attrs.create('precise', 3.14159, dtype='f8')
```

## Chunked Storage and Compression

Chunked storage divides datasets into regularly-sized pieces stored haphazardly on disk and indexed using a B-tree. This enables resizable datasets and compression filters.
```python
import h5py
import numpy as np

with h5py.File('chunked.hdf5', 'w') as f:
    # Explicit chunk shape
    dset1 = f.create_dataset('chunked', (1000, 1000), chunks=(100, 100), dtype='f')

    # Auto-chunking
    dset2 = f.create_dataset('auto_chunked', (1000, 1000), chunks=True, dtype='f')

    # GZIP compression (levels 0-9)
    dset3 = f.create_dataset('gzip', (1000, 1000), compression='gzip', compression_opts=4)

    # LZF compression (fast, moderate compression)
    dset4 = f.create_dataset('lzf', (1000, 1000), compression='lzf')

    # Shuffle filter improves compression
    dset5 = f.create_dataset('shuffled', (1000, 1000), compression='gzip', shuffle=True)

    # Fletcher32 checksum for error detection
    dset6 = f.create_dataset('checksummed', (1000, 1000), compression='gzip', fletcher32=True)

    # Write data
    data = np.random.random((1000, 1000)).astype('f')
    dset3[...] = data

    # Iterate over chunks
    for chunk_slice in dset1.iter_chunks():
        print(f"Chunk: {chunk_slice}")
        # arr = dset1[chunk_slice]  # Read chunk

    # Check compression info
    print(f"Compression: {dset3.compression}")
    print(f"Chunks: {dset3.chunks}")
```

## Resizable Datasets

Datasets can be resized after creation up to a maximum shape specified with the `maxshape` parameter. Use `None` for unlimited dimensions.

```python
import h5py
import numpy as np

with h5py.File('resizable.hdf5', 'w') as f:
    # Fixed maximum size
    dset1 = f.create_dataset('fixed_max', (10, 10), maxshape=(500, 20))

    # Unlimited on first axis
    dset2 = f.create_dataset('unlimited', (10, 10), maxshape=(None, 10))

    # Unlimited 1D dataset (note tuple syntax)
    dset3 = f.create_dataset('log', (0,), maxshape=(None,), dtype='f')

    # Append data by resizing
    for i in range(5):
        new_data = np.random.random(100)
        current_size = dset3.shape[0]
        new_size = current_size + len(new_data)

        # Resize and write
        dset3.resize(new_size, axis=0)
        dset3[current_size:new_size] = new_data

    print(f"Final shape: {dset3.shape}")     # (500,)

    # Resize multidimensional
    dset1.resize((100, 15))
    print(f"Resized shape: {dset1.shape}")   # (100, 15)
```

## String Handling

HDF5 supports both fixed-length and variable-length strings with ASCII or UTF-8 encoding. h5py provides utilities to create and check string types.

```python
import h5py
import numpy as np

with h5py.File('strings.hdf5', 'w') as f:
    # Variable-length strings (implicit, UTF-8)
    f['vlen_strings'] = ["hello", "world", "variable", "length"]

    # Variable-length strings (explicit)
    dt = h5py.string_dtype(encoding='utf-8')
    dset = f.create_dataset('vlen_utf8', (4,), dtype=dt)
    dset[:] = ["alpha", "beta", "gamma", "delta"]

    # Fixed-length strings (using NumPy S dtype)
    f['fixed_strings'] = np.array(["abc", "def"], dtype='S10')

    # Fixed-length with explicit dtype
    dt_fixed = h5py.string_dtype(encoding='utf-8', length=20)
    dset2 = f.create_dataset('fixed_utf8', (3,), dtype=dt_fixed)
    dset2[:] = ["one", "two", "three"]

    # Reading strings
    data = f['vlen_strings'][:]
    print(f"Type: {type(data[0])}")   # bytes

    # Read as Python str objects
    str_data = f['vlen_strings'].asstr()[:]
    print(f"As string: {str_data}")

    # Check string dtype info
    info = h5py.check_string_dtype(dset.dtype)
    if info:
        print(f"Encoding: {info.encoding}, Length: {info.length}")
```

## Special Types - Enums and Variable-Length Arrays

h5py supports HDF5 special types including enumerated types and variable-length (ragged) arrays that have no direct NumPy equivalent.
```python
import h5py
import numpy as np

with h5py.File('special_types.hdf5', 'w') as f:
    # Enumerated type
    colors = {"RED": 0, "GREEN": 1, "BLUE": 2, "YELLOW": 3}
    dt_enum = h5py.enum_dtype(colors, basetype='i')
    dset_enum = f.create_dataset('colors', (10,), dtype=dt_enum)
    dset_enum[0] = 0  # RED
    dset_enum[1] = 2  # BLUE

    # Check enum dtype
    enum_info = h5py.check_enum_dtype(dset_enum.dtype)
    print(f"Enum values: {enum_info}")

    # Variable-length (ragged) arrays
    dt_vlen = h5py.vlen_dtype(np.dtype('int32'))
    dset_vlen = f.create_dataset('ragged', (5,), dtype=dt_vlen)
    dset_vlen[0] = [1, 2, 3]
    dset_vlen[1] = [10, 20, 30, 40, 50]
    dset_vlen[2] = [100]

    # Read single element (returns array)
    print(f"Element 0: {dset_vlen[0]}")   # array([1, 2, 3])

    # Read multiple (returns object array of arrays)
    print(f"Elements 0:2: {dset_vlen[0:2]}")

    # Complex numbers (compatible format)
    complex_arr = np.array([1+2j, 3+4j, 5+6j], dtype='c8')
    f['complex'] = complex_arr

    # Opaque dtype for datetime
    dt_arr = np.array([np.datetime64('2024-01-15')])
    f['datetime'] = dt_arr.astype(h5py.opaque_dtype(dt_arr.dtype))
```

## Links - Hard, Soft, and External

HDF5 supports multiple link types: hard links (direct pointers), soft links (symbolic paths), and external links (references to other files).

```python
import h5py
import numpy as np

# Create external file first
with h5py.File('external_source.hdf5', 'w') as f:
    f['shared_data'] = np.arange(100)

with h5py.File('links.hdf5', 'w') as f:
    # Create a dataset
    f['original'] = np.arange(10)

    # Hard link (same object, different name)
    f['hard_link'] = f['original']
    print(f"Same object: {f['hard_link'] == f['original']}")  # True

    # Soft link (symbolic path)
    f['soft_link'] = h5py.SoftLink('/original')

    # Soft link to non-existent target (will dangle)
    f['dangling'] = h5py.SoftLink('/does_not_exist')

    # External link to another file
    f['external'] = h5py.ExternalLink('external_source.hdf5', '/shared_data')

    # Access external data (file opened transparently)
    external_data = f['external'][:]
    print(f"External data: {external_data[:5]}")

    # Check link type using get()
    link_info = f.get('soft_link', getlink=True)
    print(f"Link type: {type(link_info)}")   # SoftLink
    print(f"Link path: {link_info.path}")
```

## File Drivers and In-Memory Files

HDF5 provides various file drivers for different storage backends including in-memory files, split files, and cloud storage (S3).

```python
import h5py
import numpy as np
import io

# In-memory file with BytesIO
bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['data'] = np.arange(100)

# Get raw bytes
raw_bytes = bio.getvalue()
print(f"File size: {len(raw_bytes)} bytes")

# Read from BytesIO
bio.seek(0)
with h5py.File(bio, 'r') as f:
    print(f"Data: {f['data'][:5]}")

# Core driver (pure in-memory, optional write-back)
with h5py.File('memory_only.hdf5', 'w', driver='core', backing_store=False) as f:
    f['temp'] = np.random.random((1000, 1000))
    # File discarded when closed

# Core driver with write-back
with h5py.File('core_backed.hdf5', 'w', driver='core', backing_store=True) as f:
    f['persistent'] = np.arange(50)
    # Written to disk on close

# New in-memory API (h5py 3.13+)
# f = h5py.File.in_memory()
# f['data'] = [1, 2, 3]
# hdf_bytes = f.id.get_file_image()

# Family driver for large files on limited filesystems
# with h5py.File('family%d.hdf5', 'w', driver='family',
#                memb_size=2**30) as f:
#     f['large'] = np.zeros((10000, 10000))
```
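The cloud-storage (S3) case mentioned above is handled by HDF5's read-only `ros3` driver, which h5py exposes only when the underlying HDF5 library was built with ros3 support. A minimal sketch, assuming such a build; the bucket URL and credentials below are placeholders:

```python
import h5py

# Read-only access to an HDF5 file in S3 via the ros3 driver
# (requires an HDF5 build with ros3 enabled; the URL is a placeholder)
url = "https://example-bucket.s3.amazonaws.com/archive/data.hdf5"

with h5py.File(url, 'r', driver='ros3') as f:
    print(list(f.keys()))

# For private buckets, credentials are passed as bytes:
# with h5py.File(url, 'r', driver='ros3', aws_region=b'us-east-1',
#                secret_id=b'<access key id>', secret_key=b'<secret key>') as f:
#     ...
```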
## Virtual Datasets (VDS)

Virtual datasets map multiple source datasets into a single sliceable dataset via an interface layer, allowing transparent access to distributed data.

```python
import h5py
import numpy as np

# Create source files
for i in range(4):
    with h5py.File(f'source_{i}.hdf5', 'w') as f:
        f['data'] = np.arange(100) + i * 100

# Create virtual dataset layout
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i8')
for i in range(4):
    source = h5py.VirtualSource(f'source_{i}.hdf5', 'data', shape=(100,))
    layout[i] = source

# Create file with virtual dataset
with h5py.File('virtual.hdf5', 'w', libver='latest') as f:
    f.create_virtual_dataset('combined', layout, fillvalue=-1)

# Read virtual dataset (transparent to reader)
with h5py.File('virtual.hdf5', 'r') as f:
    vds = f['combined']
    print(f"Virtual shape: {vds.shape}")   # (4, 100)
    print(f"Is virtual: {vds.is_virtual}")
    print(f"Row 0: {vds[0, :10]}")
    print(f"Row 3: {vds[3, :10]}")

    # Get source information
    for src in vds.virtual_sources():
        print(f"Source: {src.file_name}, {src.dset_name}")

# Context manager for building VDS
with h5py.File('vds_context.hdf5', 'w', libver='latest') as f:
    with f.build_virtual_dataset('vdata', (4, 100), 'i8') as layout:
        for i in range(4):
            layout[i] = h5py.VirtualSource(
                f'source_{i}.hdf5', 'data', shape=(100,))
```

## Single Writer Multiple Reader (SWMR)

SWMR mode allows concurrent reading of an HDF5 file while it is being written from another process, with guaranteed file consistency.
```python
import h5py
import numpy as np

# Writer process
with h5py.File('swmr.hdf5', 'w', libver='latest') as f:
    # Create resizable dataset before SWMR mode
    dset = f.create_dataset('data', (4,), maxshape=(None,), chunks=(4,), dtype='i')
    dset[:] = [1, 2, 3, 4]

    # Switch to SWMR mode
    f.swmr_mode = True
    print(f"SWMR mode: {f.swmr_mode}")   # True
    # Now readers can open the file

    # Append data in SWMR mode
    for i in range(5):
        current_size = dset.shape[0]
        new_size = current_size + 4
        dset.resize(new_size, axis=0)
        dset[current_size:] = np.arange(4) + i * 10

        # Flush to make visible to readers
        dset.flush()

# Reader process (separate script/process)
with h5py.File('swmr.hdf5', 'r', libver='latest', swmr=True) as f:
    dset = f['data']

    # Refresh to see latest data
    dset.refresh()
    shape = dset.shape
    print(f"Current shape: {shape}")

    # Read latest data
    data = dset[:]
    print(f"Data: {data}")
```

## Parallel HDF5 with MPI

Parallel HDF5 enables writing to a single file from multiple MPI processes simultaneously, useful for high-performance computing workloads.

```python
# Requires MPI-enabled HDF5 build and mpi4py
# Run with: mpiexec -n 4 python script.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

# Open file with MPI driver (collective operation)
with h5py.File('parallel.hdf5', 'w', driver='mpio', comm=comm) as f:
    # Create dataset (collective - all processes must call)
    dset = f.create_dataset('data', (size, 100), dtype='f')

    # Each process writes its own row (independent)
    dset[rank, :] = np.random.random(100) + rank

    # Create group (collective)
    grp = f.create_group('metadata')
    grp.attrs['num_processes'] = size

    # Synchronize before reading
    comm.Barrier()

    # All processes can read
    if rank == 0:
        print(f"Total shape: {dset.shape}")
        print(f"Sum: {dset[:].sum()}")

# Atomic mode for stricter consistency
# with h5py.File('atomic.hdf5', 'w', driver='mpio',
#                comm=comm) as f:
#     f.atomic = True  # Enable atomic mode
```

## Dimension Scales

Dimension scales label dataset dimensions and associate coordinate datasets, providing context for interpreting multidimensional data.
```python
import h5py
import numpy as np

with h5py.File('dimscales.hdf5', 'w') as f:
    # Create main dataset
    data = np.random.random((100, 50, 25))
    f['temperature'] = data

    # Create coordinate arrays
    f['time'] = np.arange(100) * 0.1   # seconds
    f['latitude'] = np.linspace(-90, 90, 50)
    f['longitude'] = np.linspace(-180, 180, 25)

    # Convert to dimension scales
    f['time'].make_scale('time')
    f['latitude'].make_scale('latitude')
    f['longitude'].make_scale('longitude')

    # Attach scales to dimensions
    f['temperature'].dims[0].attach_scale(f['time'])
    f['temperature'].dims[1].attach_scale(f['latitude'])
    f['temperature'].dims[2].attach_scale(f['longitude'])

    # Label dimensions
    f['temperature'].dims[0].label = 'time'
    f['temperature'].dims[1].label = 'lat'
    f['temperature'].dims[2].label = 'lon'

# Read dimension scale information
with h5py.File('dimscales.hdf5', 'r') as f:
    dset = f['temperature']

    # Get dimension labels
    labels = [dim.label for dim in dset.dims]
    print(f"Labels: {labels}")   # ['time', 'lat', 'lon']

    # Get scale names for a dimension
    print(f"Dim 0 scales: {list(dset.dims[0].keys())}")

    # Access scale dataset
    time_scale = dset.dims[0][0]   # First scale on dim 0
    print(f"Time values: {time_scale[:5]}")

    # Check if dataset is a scale
    print(f"Is scale: {f['time'].is_scale}")   # True
```

## Direct Chunk Read/Write

For advanced use cases, h5py allows direct reading and writing of compressed chunks without decompression, enabling efficient data pipelines.

```python
import h5py
import numpy as np
import zlib

with h5py.File('direct_chunks.hdf5', 'w') as f:
    # Create chunked, compressed dataset
    dset = f.create_dataset('data', (100, 100), chunks=(10, 10),
                            compression='gzip', compression_opts=4)

    # Normal write (h5py handles compression)
    dset[:10, :10] = np.random.random((10, 10))

    # Write pre-compressed chunk directly
    chunk_data = np.arange(100, dtype='f').reshape(10, 10)
    compressed = zlib.compress(chunk_data.tobytes(), level=4)

    # Write compressed bytes directly to the chunk at offset (10, 0)
    dset.id.write_direct_chunk((10, 0), compressed)

with h5py.File('direct_chunks.hdf5', 'r') as f:
    dset = f['data']

    # Normal read (automatic decompression)
    data = dset[10:20, 0:10]
    print(f"Read data shape: {data.shape}")

    # Read raw compressed chunk
    # compressed_bytes = dset.id.read_direct_chunk((10, 0))
```

## Copy and Move Operations

Groups provide methods to copy objects within or between files and move/rename objects within a file.

```python
import h5py
import numpy as np

with h5py.File('source.hdf5', 'w') as src:
    src['data'] = np.arange(100)
    src['data'].attrs['info'] = 'original'
    grp = src.create_group('group')
    grp['nested'] = np.zeros((10, 10))

with h5py.File('source.hdf5', 'r') as src:
    with h5py.File('dest.hdf5', 'w') as dst:
        # Copy dataset between files
        src.copy('data', dst, name='copied_data')

        # Copy group with all contents
        src.copy('group', dst)

        # Copy without attributes
        src.copy('data', dst, name='no_attrs', without_attrs=True)

with h5py.File('dest.hdf5', 'a') as f:
    # Move/rename within file
    f.move('copied_data', 'renamed_data')

    # Copy within same file
    f.copy('renamed_data', f, name='duplicate')
    print(f"Keys: {list(f.keys())}")

    # Delete object
    del f['duplicate']
```

h5py is the standard Python interface for working with HDF5 files in scientific computing, data analysis, and machine learning workflows. Its NumPy-like API makes it intuitive for Python users while providing access to HDF5's powerful features for handling large datasets that don't fit in memory.
Common use cases include storing simulation output, managing experimental data with metadata, sharing large datasets across research teams, and building data pipelines that process data in chunks. The library integrates seamlessly with the scientific Python ecosystem including NumPy, pandas, and xarray. For large-scale applications, h5py supports parallel I/O through MPI for HPC clusters, SWMR mode for real-time data streaming applications, and virtual datasets for organizing distributed data. The combination of hierarchical organization, compression, chunking, and metadata support makes HDF5 files self-describing archives that remain accessible for long-term data preservation.
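As a small illustration of the chunk-oriented pipeline pattern mentioned above, the sketch below streams a dataset block by block so that only one slab is in memory at a time; the file and dataset names are placeholders chosen for the example:

```python
import h5py
import numpy as np

# Build a sample input file so the sketch is self-contained; in practice
# the input would be an existing (possibly very large) dataset.
with h5py.File('pipeline_input.hdf5', 'w') as f:
    f.create_dataset('raw', data=np.random.random((10_000, 64)),
                     chunks=(1_000, 64), compression='gzip')

block = 1_000  # rows processed per iteration

with h5py.File('pipeline_input.hdf5', 'r') as src, \
     h5py.File('pipeline_output.hdf5', 'w') as dst:
    raw = src['raw']
    out = dst.create_dataset('normalized', shape=raw.shape, dtype='f8',
                             chunks=raw.chunks, compression='gzip')

    # Stream through the dataset one block of rows at a time, so only a
    # single slab is resident in memory.
    for start in range(0, raw.shape[0], block):
        stop = min(start + block, raw.shape[0])
        slab = raw[start:stop]                 # read one block
        out[start:stop] = (slab - slab.mean()) / slab.std()

    out.attrs['description'] = 'per-block normalized copy of raw'
```

Reading and writing in multiples of the chunk shape keeps each I/O operation aligned with how HDF5 lays the data out on disk.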