### Get Matter array file size Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Retrieves the file size of the created 'matter_arr' object to assess storage efficiency. ```R file.info(a1@paths)$size / 1e9 ``` -------------------------------- ### Saving and Loading SummarizedExperiment with HDF5 Source: https://context7.com/bioconductor/hdf5array/llms.txt Demonstrates how to save a SummarizedExperiment object to disk using HDF5 for assay data and then reload it. It also shows how to perform a quick resave after modifying metadata without rewriting the HDF5 data. ```APIDOC ## Saving and Loading SummarizedExperiment with HDF5 ### Description Save a SummarizedExperiment object to disk using HDF5 for assay data, allowing for efficient storage and retrieval of large datasets. This includes options for custom chunk dimensions and compression levels. Reload the experiment and perform quick resaves after metadata modifications. ### Usage ```r # Save to disk dir <- tempfile() saveHDF5SummarizedExperiment(se, dir=dir, chunkdim=c(100, 25), # custom chunk dims level=6L, # gzip level verbose=TRUE) list.files(dir) # "assays.h5" "se.rds" # Reload with HDF5-backed assays se2 <- loadHDF5SummarizedExperiment(dir) assay(se2, "counts") # HDF5Matrix – still on disk # After adding metadata, quick-resave without rewriting HDF5 colData(se2)$batch <- sample(c("A","B"), ncol, replace=TRUE) quickResaveHDF5SummarizedExperiment(se2, verbose=TRUE) # Use a prefix to store multiple objects in the same directory dir2 <- tempfile() dir.create(dir2) saveHDF5SummarizedExperiment(se, dir=dir2, prefix="exp1_") saveHDF5SummarizedExperiment(se, dir=dir2, prefix="exp2_") list.files(dir2) # "exp1_assays.h5" "exp1_se.rds" "exp2_..." 
se_exp1 <- loadHDF5SummarizedExperiment(dir2, prefix="exp1_")
```
```

--------------------------------

### Save and Load SummarizedExperiment with HDF5 assays

Source: https://context7.com/bioconductor/hdf5array/llms.txt

Saves a SummarizedExperiment object to disk, writing assays as HDF5 datasets and metadata to an RDS file. The result is a directory containing 'se.rds' and 'assays.h5'. `loadHDF5SummarizedExperiment` reconstructs the object with HDF5-backed assays. `quickResaveHDF5SummarizedExperiment` re-serializes only metadata.

```R
library(HDF5Array)
library(SummarizedExperiment)

# Build a toy SummarizedExperiment
nrow <- 200; ncol <- 50
counts <- matrix(rpois(nrow * ncol, lambda=5), nrow=nrow,
                 dimnames=list(paste0("gene", seq_len(nrow)),
                               paste0("cell", seq_len(ncol))))
se <- SummarizedExperiment(assays=list(counts=counts))
```

--------------------------------

### Save and Load SummarizedExperiment to HDF5

Source: https://context7.com/bioconductor/hdf5array/llms.txt

Saves a SummarizedExperiment object to disk using HDF5 for assays, allowing for efficient reloading. Supports custom chunk dimensions and compression levels.

```R
dir <- tempfile()
saveHDF5SummarizedExperiment(se, dir=dir,
                             chunkdim=c(100, 25),  # custom chunk dims
                             level=6L,             # gzip level
                             verbose=TRUE)
list.files(dir)  # "assays.h5" "se.rds"

se2 <- loadHDF5SummarizedExperiment(dir)
assay(se2, "counts")  # HDF5Matrix – still on disk

# One batch label per column of se2
colData(se2)$batch <- sample(c("A","B"), ncol(se2), replace=TRUE)
quickResaveHDF5SummarizedExperiment(se2, verbose=TRUE)
```

```R
dir2 <- tempfile()
dir.create(dir2)
saveHDF5SummarizedExperiment(se, dir=dir2, prefix="exp1_")
saveHDF5SummarizedExperiment(se, dir=dir2, prefix="exp2_")
list.files(dir2)  # "exp1_assays.h5" "exp1_se.rds" "exp2_..."
se_exp1 <- loadHDF5SummarizedExperiment(dir2, prefix="exp1_") ``` -------------------------------- ### saveHDF5SummarizedExperiment() / loadHDF5SummarizedExperiment() Source: https://context7.com/bioconductor/hdf5array/llms.txt Saves and loads SummarizedExperiment objects with HDF5-backed assays. saveHDF5SummarizedExperiment writes assays to HDF5 datasets and metadata to an .rds file, while loadHDF5SummarizedExperiment reconstructs the object. ```APIDOC ## saveHDF5SummarizedExperiment() / loadHDF5SummarizedExperiment() ### Description Saves a `SummarizedExperiment` object to disk by writing all assays as HDF5 datasets and serialising the R metadata (colData, rowData, etc.) to an `.rds` file. The result is a directory containing `se.rds` and `assays.h5`. `loadHDF5SummarizedExperiment()` reconstructs the object with HDF5-backed assays. `quickResaveHDF5SummarizedExperiment()` re-serialises only the metadata without touching the HDF5 file. ### Usage ```r library(HDF5Array) library(SummarizedExperiment) # Build a toy SummarizedExperiment nrow <- 200; ncol <- 50 counts <- matrix(rpois(nrow * ncol, lambda=5), nrow=nrow, dimnames=list(paste0("gene", seq_len(nrow)), paste0("cell", seq_len(ncol)))) se <- SummarizedExperiment(assays=list(counts=counts)) # ... further usage examples for saving and loading ... ``` ``` -------------------------------- ### H5SparseMatrix Operations Source: https://context7.com/bioconductor/hdf5array/llms.txt Demonstrates basic operations on an H5SparseMatrix, including dimension checking, sparsity, non-zero count, subsetting, and extraction of non-zero data by column. It also shows coercion to a dgCMatrix. 
```R dim(sm) # c(500, 300) is_sparse(sm) # TRUE nzcount(sm) # number of nonzero entries # Subset (delayed) sm[1:10, 1:20] # Extract nonzero values by column (low-level, avoids materialising full rows) nz_cols <- extractNonzeroDataByCol(sm, 1:5) lengths(nz_cols) # Coerce to in-memory sparse matrix as(sm, "dgCMatrix") ``` -------------------------------- ### Create a large integer array Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Initializes a large 3D integer array with random values for benchmarking purposes. ```R set.seed(123) a0 <- array(as.integer(runif(250e6, max=100)), dim=c(3000, 800, 125)) ``` -------------------------------- ### Manage HDF5 Dump Directory and File Settings Source: https://context7.com/bioconductor/hdf5array/llms.txt Control the location and naming of automatically created HDF5 datasets. These settings are global and propagated to BiocParallel workers. ```R library(HDF5Array) # --- Dump directory and file --- getHDF5DumpDir() # current auto-dump directory (in tempdir()) setHDF5DumpDir("~/my_dumps") # redirect auto-dumps to a custom directory setHDF5DumpFile("~/my_dumps/results.h5") # pin all auto-dumps to one file getHDF5DumpFile() lsHDF5DumpFile() # list datasets in the current dump file setHDF5DumpName("/experiment1/counts") # pin the next dataset name getHDF5DumpName() # --- Chunk geometry --- getHDF5DumpChunkLength() # 1,000,000 elements (default) setHDF5DumpChunkLength(500000L) getHDF5DumpChunkShape() # "scale" (default) setHDF5DumpChunkShape("first-dim-grows-first") # Compute chunk dims for a given array shape getHDF5DumpChunkDim(c(20000L, 500L)) # e.g. 
c(2000, 500) # --- Compression --- getHDF5DumpCompressionLevel() # 6 (default; 0 = none, 9 = max) setHDF5DumpCompressionLevel(9L) # --- Dump log (shows every dataset created in this session) --- m <- matrix(runif(100), 10, 10) writeHDF5Array(m, name="test1") writeHDF5Array(m + 1, name="test2") showHDF5DumpLog() # [2025-01-15 10:00:01] #1 In file '.../auto....h5': creation of dataset # '/test1' (10x10:double, chunkdims=10x10, level=6) ``` -------------------------------- ### H5ADMatrix() Source: https://context7.com/bioconductor/hdf5array/llms.txt Constructs a DelayedMatrix backed by the central X matrix or any /layers matrix in an .h5ad (AnnData) file. It handles both dense and sparse storage and populates rownames/colnames from the var and obs groups. ```APIDOC ## H5ADMatrix() ### Description Constructs a `DelayedMatrix` backed by the central `X` matrix (or any `/layers` matrix) in an `.h5ad` (AnnData) file. Automatically handles both dense (`HDF5ArraySeed`) and sparse (`CSC_H5ADMatrixSeed` / `CSR_H5ADMatrixSeed`) storage, and populates `rownames`/`colnames` from the `var` and `obs` groups. 
### Usage ```r library(HDF5Array) library(zellkonverter) # provides test h5ad files # Obtain an example h5ad file h5ad_path <- system.file("extdata", "krumsiek11.h5ad", package="zellkonverter") # Load the central X matrix X <- H5ADMatrix(h5ad_path) X # <200 x 11> matrix of class H5ADMatrix and type "double": dim(X) # c(200, 11) rownames(X) # cell barcodes from obs/_index colnames(X) # gene names from var/_index # Load a specific layer instead of X # (requires the h5ad file to have a /layers/counts group) # counts <- H5ADMatrix(h5ad_path, layer="counts") # Arithmetic is delayed log1p_X <- log1p(X) class(log1p_X) # "DelayedMatrix" # Realise to memory as.matrix(X[1:5, ]) # Access the underlying seed to inspect storage format seed(X) # Dense_H5ADMatrixSeed / CSC_H5ADMatrixSeed / CSR_H5ADMatrixSeed is_sparse(X) # TRUE if stored as h5sparse nzcount(X) # only works for sparse seeds ``` ``` -------------------------------- ### writeTENxMatrix() Source: https://context7.com/bioconductor/hdf5array/llms.txt Writes any matrix-like object to disk in the 10x Genomics HDF5 sparse format (CSR layout). It processes the input column-by-column for large matrices and returns a TENxMatrix pointing to the result. ```APIDOC ## writeTENxMatrix() ### Description Writes any matrix-like object to disk in the 10x Genomics HDF5 sparse format (CSR layout with standard group structure). Returns a `TENxMatrix` pointing to the result. Block-processes the input column-by-column so that arbitrarily large matrices can be written without loading them fully into memory. 
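As a rough illustration of what column-wise block processing means (this is not the package's actual implementation), the base-R sketch below walks a small matrix in blocks of columns and only handles one block at a time. The block width `block_ncol` is a hypothetical value chosen for the example.

```r
# Illustrative only: visit a matrix in blocks of columns, one block at a time.
m <- matrix(c(0, 1, 0, 2, 0, 0, 3, 0, 4, 0, 0, 5), nrow = 3)  # 3 x 4 matrix
block_ncol <- 2L                       # hypothetical block width

starts <- seq(1L, ncol(m), by = block_ncol)
nz_total <- 0L
for (s in starts) {
  cols  <- s:min(s + block_ncol - 1L, ncol(m))
  block <- m[, cols, drop = FALSE]     # only this block is needed at once
  nz_total <- nz_total + sum(block != 0)
}
nz_total                               # total nonzero count across all blocks
```

A real writer would append each block's nonzero data to the file instead of counting it, but the loop structure is the same idea.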
### Usage ```r library(HDF5Array) library(Matrix) m <- rsparsematrix(5000, 3000, density=0.02, dimnames=list(paste0("g", 1:5000), paste0("b", 1:3000))) h5f <- tempfile(fileext=".h5") tenx <- writeTENxMatrix(m, h5f, group="counts", level=6L, # gzip compression (0–9) verbose=TRUE) # sparsity: 0.98 tenx nzcount(tenx) # actual stored nonzero count sparsity(tenx) # fraction of zero entries # Round-trip: coerce a TENxMatrix back to dgCMatrix stopifnot(all.equal(as(tenx, "dgCMatrix"), as(m, "dgCMatrix"))) # Using coercion shorthand (writes to current dump file) tenx2 <- as(m, "TENxMatrix") path(tenx2) ``` ``` -------------------------------- ### Compare extracted slices from different formats Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Verifies that the data extracted from Matter and HDF5 arrays are identical. ```R identical(x1, x2) ``` ```R identical(x1, x3) ``` -------------------------------- ### H5SparseMatrix() Constructor Source: https://context7.com/bioconductor/hdf5array/llms.txt Constructs a DelayedMatrix backed by an HDF5 sparse matrix stored in CSR/CSC/Yale format. The sparse layout is automatically detected from HDF5 group attributes, but can also be overridden. It supports efficient operations on specific slices through nonzero-data extraction by column or row. ```APIDOC ## H5SparseMatrix() ### Description Constructs a `DelayedMatrix` backed by an HDF5 sparse matrix stored in CSR/CSC/Yale format (as produced by Python's `scipy.sparse` or AnnData). The sparse layout is detected automatically from the HDF5 group attributes; it can also be overridden. Nonzero-data extraction by column or row is available for efficient operations on specific slices. 
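To make the layout terminology concrete, here is a minimal sketch that uses an in-memory `dgCMatrix` from the Matrix package rather than an HDF5 file. It assumes that the compressed-column slots `x`, `i`, and `p` play the same role as the `data`, `indices`, and `indptr` arrays of a compressed sparse layout; how a given HDF5 group names or orders these arrays is not verified here.

```r
library(Matrix)

# A small 4 x 3 sparse matrix in compressed sparse column (CSC) form
m <- sparseMatrix(i = c(1, 3, 2, 4, 4),
                  j = c(1, 1, 2, 2, 3),
                  x = c(10, 20, 30, 40, 50),
                  dims = c(4, 3))

m@x   # nonzero values, stored column by column
m@i   # 0-based row index of each nonzero value
m@p   # column pointers: column j owns entries p[j] + 1 .. p[j + 1]

# The number of nonzero entries per column follows from the pointers
nnz_per_col <- diff(m@p)
```

In a CSR layout the roles of rows and columns are swapped: the pointer array delimits rows and the index array holds column indices.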
### Usage ```r library(HDF5Array) # Write a sparse matrix in 10x/CSC format first, then reload m <- Matrix::rsparsematrix(500, 300, density=0.05) h5f <- tempfile(fileext=".h5") # writeTENxMatrix writes 10x CSR format; H5SparseMatrix reads generic h5sparse tenx <- writeTENxMatrix(m, h5f, group="matrix") # H5SparseMatrix works on any h5sparse group (CSR or CSC) sm <- H5SparseMatrix(h5f, "matrix") ``` ``` -------------------------------- ### R Session Information Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? The `sessionInfo()` function in R provides details about the R version, platform, loaded packages, and their versions. This is useful for reproducibility. ```r > sessionInfo() R version 3.6.0 Patched (2019-05-02 r76454) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.5 LTS Matrix products: default BLAS: /home/hpages/R/R-3.6.r76454/lib/libRblas.so LAPACK: /home/hpages/R/R-3.6.r76454/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] HDF5Array_1.13.9 rhdf5_2.29.3 DelayedArray_0.11.8 [4] IRanges_2.19.16 S4Vectors_0.23.25 BiocGenerics_0.31.6 [7] matrixStats_0.55.0 matter_1.11.1 biglm_0.9-1 [10] DBI_1.0.0 BiocParallel_1.19.3 loaded via a namespace (and not attached): [1] lattice_0.20-38 digest_0.6.21 grid_3.6.0 irlba_2.3.3 [5] Matrix_1.2-17 Rhdf5lib_1.7.5 tools_3.6.0 compiler_3.6.0 ``` -------------------------------- ### HDF5 Dump Management Source: https://context7.com/bioconductor/hdf5array/llms.txt Provides functions to manage the global options for HDF5 dump directories, file pinning, dataset naming, 
chunk geometry, and compression levels. It also includes functionality to show the dump log. ```APIDOC ## HDF5 Dump Management ### Description A set of `get/set` functions control where and how automatically created HDF5 datasets are stored. These global options are propagated to `BiocParallel` workers, ensuring consistent dump locations and compression settings across parallel jobs. ### Usage ```r library(HDF5Array) # --- Dump directory and file --- getHDF5DumpDir() # current auto-dump directory (in tempdir()) setHDF5DumpDir("~/my_dumps") # redirect auto-dumps to a custom directory setHDF5DumpFile("~/my_dumps/results.h5") # pin all auto-dumps to one file getHDF5DumpFile() lsHDF5DumpFile() # list datasets in the current dump file setHDF5DumpName("/experiment1/counts") # pin the next dataset name getHDF5DumpName() # --- Chunk geometry --- getHDF5DumpChunkLength() # 1,000,000 elements (default) setHDF5DumpChunkLength(500000L) getHDF5DumpChunkShape() # "scale" (default) setHDF5DumpChunkShape("first-dim-grows-first") # Compute chunk dims for a given array shape getHDF5DumpChunkDim(c(20000L, 500L)) # e.g. c(2000, 500) # --- Compression --- getHDF5DumpCompressionLevel() # 6 (default; 0 = none, 9 = max) setHDF5DumpCompressionLevel(9L) # --- Dump log (shows every dataset created in this session) --- m <- matrix(runif(100), 10, 10) writeHDF5Array(m, name="test1") writeHDF5Array(m + 1, name="test2") showHDF5DumpLog() # [2025-01-15 10:00:01] #1 In file '.../auto....h5': creation of dataset # '/test1' (10x10:double, chunkdims=10x10, level=6) ``` ``` -------------------------------- ### HDF5Array() Constructor Source: https://context7.com/bioconductor/hdf5array/llms.txt Constructs a DelayedArray (or DelayedMatrix) backed by a conventional dense HDF5 dataset. It supports local file paths or H5File objects for remote files (e.g., S3). Options include enabling memory-optimized block processing for sparse data and overriding the inferred R type. 
```APIDOC ## HDF5Array() ### Description Constructs a `DelayedArray` (or `DelayedMatrix`) backed by a conventional dense HDF5 dataset. Accepts a local file path or an `H5File` object for S3-hosted files. The optional `as.sparse` flag enables memory-optimized block processing for zero-heavy datasets, and `type` overrides the automatically inferred R type. ### Usage ```r library(HDF5Array) # --- Local file --- toy_h5 <- system.file("extdata", "toy.h5", package="HDF5Array") h5ls(toy_h5) M2 <- HDF5Array(toy_h5, "M2") # Override inferred type M2_int <- HDF5Array(toy_h5, "M2", type="integer") # Flag as sparse for memory-efficient block processing M2_sp <- HDF5Array(toy_h5, "M2", as.sparse=TRUE) # Toggle sparse flag after construction is_sparse(M2) <- TRUE # Standard array operations (all delayed) dim(M2) # --- Remote file on Amazon S3 --- h5file <- H5File("https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5", s3=TRUE) HDF5Array(h5file, "a1") ``` ``` -------------------------------- ### Construct H5ADMatrix from .h5ad file Source: https://context7.com/bioconductor/hdf5array/llms.txt Loads the central 'X' matrix or a specific layer from an AnnData (.h5ad) file into a DelayedMatrix. Supports both dense and sparse storage, and populates row/column names from metadata. Arithmetic operations are delayed. 
```R library(HDF5Array) library(zellkonverter) h5ad_path <- system.file("extdata", "krumsiek11.h5ad", package="zellkonverter") X <- H5ADMatrix(h5ad_path) X # <200 x 11> matrix of class H5ADMatrix and type "double": dim(X) rownames(X) colnames(X) # Load a specific layer instead of X # (requires the h5ad file to have a /layers/counts group) # counts <- H5ADMatrix(h5ad_path, layer="counts") # Arithmetic is delayed log1p_X <- log1p(X) class(log1p_X) # Realise to memory as.matrix(X[1:5, ]) # Access the underlying seed to inspect storage format seed(X) # Dense_H5ADMatrixSeed / CSC_H5ADMatrixSeed / CSR_H5ADMatrixSeed is_sparse(X) nzcount(X) ``` -------------------------------- ### Create Matter array and DelayedArray object Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Generates a 'matter_arr' object and wraps it in a DelayedArray for efficient handling of large arrays. Measures the time taken for creation. ```R library(matter) system.time(a1 <- matter_arr(a0, datamode="integer", dim=dim(a0))) ``` ```R library(DelayedArray) A1 <- DelayedArray(a1) ``` -------------------------------- ### TENxMatrix() Source: https://context7.com/bioconductor/hdf5array/llms.txt Constructs a DelayedMatrix backed by the HDF5 sparse matrix format used by 10x Genomics. This is suitable for Cell Ranger output .h5 files. ```APIDOC ## TENxMatrix() ### Description Constructs a `DelayedMatrix` backed by the HDF5 sparse matrix format used by 10x Genomics (CSR with `shape`, `data`, `indices`, `indptr`, `barcodes`, `genes` datasets under a named group). This is the appropriate constructor for Cell Ranger output `.h5` files. 
### Usage ```r library(HDF5Array) library(TENxBrainData) # provides example 10x data # Download 1.3 Million Brain Cell Dataset (subset) tenx_file <- TENxBrainData() # returns a SingleCellExperiment # Or load directly from an .h5 file: # tenx_file <- "path/to/filtered_gene_bc_matrices.h5" # m <- TENxMatrix(tenx_file, group="mm10") # Create a TENxMatrix from an in-memory sparse matrix library(Matrix) m <- rsparsematrix(1000, 500, density=0.01, dimnames=list(paste0("gene", 1:1000), paste0("cell", 1:500))) h5f <- tempfile(fileext=".h5") tenx <- writeTENxMatrix(m, h5f, group="matrix", verbose=TRUE) tenx # <1000 x 500> sparse matrix of class TENxMatrix and type "double": dim(tenx) # c(1000, 500) is_sparse(tenx) # TRUE nzcount(tenx) # number of nonzero entries # Delayed subsetting and arithmetic sub <- tenx[1:100, 1:50] row_sums <- rowSums(sub) # block-processed # Coerce to dgCMatrix for in-memory use dgc <- as(tenx, "dgCMatrix") ``` ``` -------------------------------- ### Create compressed HDF5 DelayedArray Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Writes the array to an HDF5 file using HDF5Array with compression enabled (level=6). Measures the time and reports the file size. ```R system.time(A3 <- writeHDF5Array(a0, chunkdim=c(50, 50, 10), level=6)) ``` ```R file.info(path(A3))$size / 1e9 ``` -------------------------------- ### Construct HDF5Array from Local File Source: https://context7.com/bioconductor/hdf5array/llms.txt Constructs a DelayedArray backed by a dense HDF5 dataset. Supports overriding the inferred R type and enabling sparse processing for zero-heavy datasets. Use for local HDF5 files. 
```r library(HDF5Array) # --- Local file --- toy_h5 <- system.file("extdata", "toy.h5", package="HDF5Array") h5ls(toy_h5) # group name otype dclass dim # 0 / M1 H5I_DATASET FLOAT 100 x 3 # 1 / M2 H5I_DATASET FLOAT 200 x 3 M2 <- HDF5Array(toy_h5, "M2") M2 # <200 x 3> matrix of class HDF5Matrix and type "double": # Override inferred type M2_int <- HDF5Array(toy_h5, "M2", type="integer") type(M2_int) # "integer" # Flag as sparse for memory-efficient block processing M2_sp <- HDF5Array(toy_h5, "M2", as.sparse=TRUE) is_sparse(M2_sp) # TRUE # Toggle sparse flag after construction is_sparse(M2) <- TRUE # Standard array operations (all delayed) dim(M2) # c(200, 3) dimnames(M2) M2[1:5, ] # subset – no data loaded until needed as.array(M2) # materialise into memory ``` -------------------------------- ### Construct TENxMatrix from 10x Genomics HDF5 file Source: https://context7.com/bioconductor/hdf5array/llms.txt Creates a DelayedMatrix from the HDF5 sparse matrix format used by 10x Genomics. This is suitable for Cell Ranger output files. It supports delayed subsetting and arithmetic, and can be coerced to a dgCMatrix. 
```R library(HDF5Array) library(TENxBrainData) tenx_file <- TENxBrainData() # Or load directly from an .h5 file: # tenx_file <- "path/to/filtered_gene_bc_matrices.h5" # m <- TENxMatrix(tenx_file, group="mm10") # Create a TENxMatrix from an in-memory sparse matrix library(Matrix) m <- rsparsematrix(1000, 500, density=0.01, dimnames=list(paste0("gene", 1:1000), paste0("cell", 1:500))) h5f <- tempfile(fileext=".h5") tenx <- writeTENxMatrix(m, h5f, group="matrix", verbose=TRUE) tenx # <1000 x 500> sparse matrix of class TENxMatrix and type "double": dim(tenx) is_sparse(tenx) nzcount(tenx) # Delayed subsetting and arithmetic sub <- tenx[1:100, 1:50] row_sums <- rowSums(sub) # Coerce to dgCMatrix for in-memory use dgc <- as(tenx, "dgCMatrix") ``` -------------------------------- ### extractNonzeroDataByCol() / extractNonzeroDataByRow() Source: https://context7.com/bioconductor/hdf5array/llms.txt Low-level generics for extracting nonzero data from H5SparseMatrix (or TENxMatrix) objects column by column or row by row, without materializing the full matrix. Returns a NumericList or IntegerList. ```APIDOC ## extractNonzeroDataByCol() / extractNonzeroDataByRow() ### Description Low-level generics for extracting nonzero data from `H5SparseMatrix` (or `TENxMatrix`) objects one or more columns (or rows) at a time, without materialising the full matrix. Return a `NumericList` or `IntegerList` parallel to the requested indices. 
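The shape of the result can be illustrated without an HDF5 file. The sketch below uses a plain in-memory `dgCMatrix` and a hypothetical helper, `nonzero_by_col()`, which is not part of HDF5Array; it only mimics the idea of returning one vector of nonzero values per requested column, in the order requested.

```r
library(Matrix)

# Hypothetical helper (not part of HDF5Array): for each requested column,
# return the nonzero values stored for that column in a dgCMatrix.
nonzero_by_col <- function(m, j) {
  stopifnot(is(m, "dgCMatrix"))
  lapply(j, function(col) {
    start <- m@p[col] + 1L       # first stored entry of the column (1-based)
    end   <- m@p[col + 1L]       # last stored entry of the column
    if (end < start) numeric(0) else m@x[start:end]
  })
}

m <- sparseMatrix(i = c(1, 3, 2, 4, 4),
                  j = c(1, 1, 2, 2, 4),
                  x = c(10, 20, 30, 40, 50),
                  dims = c(4, 4))

nz <- nonzero_by_col(m, c(1L, 3L, 4L))
lengths(nz)   # one count per requested column; column 3 is empty
```

Because only the pointer array is consulted to locate each column, columns that are not requested are never read.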
### Usage ```r library(HDF5Array) library(Matrix) m <- rsparsematrix(1000, 500, density=0.05) h5f <- tempfile(fileext=".h5") tenx <- writeTENxMatrix(m, h5f, group="mat") # Extract nonzero values for columns 10, 20, 30 nz <- extractNonzeroDataByCol(tenx, c(10L, 20L, 30L)) length(nz) # 3 lengths(nz) # number of nonzero entries in each requested column nz[[1]] # nonzero values in column 10 # For CSR-layout H5SparseMatrix, use extractNonzeroDataByRow sm_csr <- H5SparseMatrix(h5f, "mat") # layout determined by file # if CSR: # nz_rows <- extractNonzeroDataByRow(sm_csr, 1:5) ``` ``` -------------------------------- ### Extract Nonzero Data from Sparse HDF5 Matrices Source: https://context7.com/bioconductor/hdf5array/llms.txt Low-level generics for extracting nonzero data from H5SparseMatrix or TENxMatrix objects by column or row. Returns a NumericList or IntegerList parallel to the requested indices. ```R library(HDF5Array) library(Matrix) m <- rsparsematrix(1000, 500, density=0.05) h5f <- tempfile(fileext=".h5") tenx <- writeTENxMatrix(m, h5f, group="mat") # Extract nonzero values for columns 10, 20, 30 nz <- extractNonzeroDataByCol(tenx, c(10L, 20L, 30L)) length(nz) # 3 lengths(nz) # number of nonzero entries in each requested column nz[[1]] # nonzero values in column 10 # For CSR-layout H5SparseMatrix, use extractNonzeroDataByRow sm_csr <- H5SparseMatrix(h5f, "mat") # layout determined by file # if CSR: # nz_rows <- extractNonzeroDataByRow(sm_csr, 1:5) ``` -------------------------------- ### Write matrix to 10x Genomics HDF5 sparse format Source: https://context7.com/bioconductor/hdf5array/llms.txt Writes a matrix-like object to disk in the 10x Genomics HDF5 sparse format (CSR layout). This function block-processes input column-by-column, allowing large matrices to be written without full in-memory loading. It returns a TENxMatrix pointing to the written file. 
```R library(HDF5Array) library(Matrix) m <- rsparsematrix(5000, 3000, density=0.02, dimnames=list(paste0("g", 1:5000), paste0("b", 1:3000))) h5f <- tempfile(fileext=".h5") tenx <- writeTENxMatrix(m, h5f, group="counts", level=6L, # gzip compression (0–9) verbose=TRUE) # sparsity: 0.98 tenx nzcount(tenx) sparsity(tenx) # Round-trip: coerce a TENxMatrix back to dgCMatrix stopifnot(all.equal(as(tenx, "dgCMatrix"), as(m, "dgCMatrix"))) # Using coercion shorthand (writes to current dump file) tenx2 <- as(m, "TENxMatrix") path(tenx2) ``` -------------------------------- ### Create uncompressed HDF5 DelayedArray Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Writes the array to an HDF5 file using HDF5Array, with compression disabled (level=0). Measures the time and reports the file size. ```R library(HDF5Array) system.time(A2 <- writeHDF5Array(a0, chunkdim=c(50, 50, 10), level=0)) ``` ```R file.info(path(A2))$size / 1e9 ``` -------------------------------- ### Construct HDF5Array from Remote S3 File Source: https://context7.com/bioconductor/hdf5array/llms.txt Constructs an HDF5Array from a remote HDF5 file hosted on Amazon S3. Requires an H5File object configured for S3 access. ```r library(HDF5Array) # --- Remote file on Amazon S3 --- h5file <- H5File("https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5", s3=TRUE) HDF5Array(h5file, "a1") ``` -------------------------------- ### Configure parallel processing for DelayedArray operations Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Sets up parallel processing parameters (workers and block size) and verbosity for DelayedArray operations. This configuration affects the performance of subsequent summarization tasks. 
```R
library(BiocParallel)  # provides MulticoreParam()

workers <- 4
block_size <- 2.5e6  # 2.5 MB (block size is given in bytes)
setAutoBPPARAM(MulticoreParam(workers))
setAutoBlockSize(block_size)
DelayedArray:::set_verbose_block_processing(TRUE)
```

--------------------------------

### Construct H5SparseMatrix

Source: https://context7.com/bioconductor/hdf5array/llms.txt

Constructs a DelayedMatrix from an HDF5 sparse matrix (CSR/CSC/Yale format). The sparse layout is detected automatically from the HDF5 attributes and can also be overridden. Use for sparse matrices stored in HDF5.

```r
library(HDF5Array)

# Write a sparse matrix in 10x/CSC format first, then reload
m <- Matrix::rsparsematrix(500, 300, density=0.05)
h5f <- tempfile(fileext=".h5")
# writeTENxMatrix writes the 10x (CSC) format; H5SparseMatrix reads generic h5sparse
tenx <- writeTENxMatrix(m, h5f, group="matrix")

# H5SparseMatrix works on any h5sparse group (CSR or CSC)
sm <- H5SparseMatrix(h5f, "matrix")
sm
```

--------------------------------

### Reshape HDF5 Dataset with ReshapedHDF5Array

Source: https://context7.com/bioconductor/hdf5array/llms.txt

Wraps an HDF5 dataset as a DelayedArray with a user-supplied virtual reshape. This allows the on-disk data dimensions to differ from the in-memory view without copying data.

```R
library(HDF5Array)

toy_h5 <- system.file("extdata", "toy.h5", package="HDF5Array")

# The dataset "M2" is stored as 200 x 3 on disk
M2 <- HDF5Array(toy_h5, "M2")
dim(M2)  # c(200, 3)

# Reshape to 3D without touching the file
M2r <- ReshapedHDF5Array(toy_h5, "M2", dim=c(4L, 50L, 3L))
dim(M2r)    # c(4, 50, 3)
class(M2r)  # "ReshapedHDF5Array"

# All DelayedArray operations still work
M2r[1, , ]  # 50 x 3 slice
as.array(M2r[1:2, 1:5, ])
```

--------------------------------

### Extract random subset using extract_array from Matter array

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time to extract a random subset of elements using the 'extract_array' function from the 'matter_arr' object. ```R i <- list(sample(3000L, 50), sample(800L, 25), sample(125L, 10)) system.time(x1 <- extract_array(a1, i)) ``` -------------------------------- ### Write Array to HDF5 File Source: https://context7.com/bioconductor/hdf5array/llms.txt Writes any array-like or DelayedArray object to an HDF5 file using block processing. Returns an HDF5Array pointing to the new dataset. Supports control over chunk dimensions, compression, and data types. ```r library(HDF5Array) h5file <- tempfile(fileext=".h5") # Write a plain matrix m0 <- matrix(runif(364, min=-1), nrow=26, dimnames=list(letters, LETTERS[1:14])) M1 <- writeHDF5Array(m0, h5file, name="M1", chunkdim=c(5, 5)) chunkdim(M1) # c(5, 5) dimnames(M1) # dimnames are stored in the HDF5 file by default # Skip writing dimnames M1b <- writeHDF5Array(m0, h5file, name="M1b", with.dimnames=FALSE) is.null(dimnames(M1b)) # TRUE # Write a sparse matrix (auto-detected; as.sparse flag set on result) sm <- Matrix::rsparsematrix(20, 8, density=0.1) M2 <- writeHDF5Array(sm, h5file, name="M2", chunkdim=c(5, 5)) is_sparse(M2) # TRUE # Realize a DelayedArray with pending operations to disk M3_delayed <- log(t(DelayedArray(m0)) + 1) M3 <- writeHDF5Array(M3_delayed, h5file, name="M3", chunkdim=c(5, 5), level=6L) M3 # Coercion shorthand – writes to the current HDF5 dump file auto <- as(m0, "HDF5Array") path(auto) # path to auto-generated dump file # Use a compact 32-bit float type to reduce disk footprint M4 <- writeHDF5Array(m0, h5file, name="M4", H5type="H5T_IEEE_F32LE") ``` -------------------------------- ### Extract subset with complex indexing from Matter array Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets? Measures the time for extracting a subset using more complex indexing from the 'matter_arr' object. 
```R
system.time(x1 <- a1[(310:11)*7, (1:100)*8, 77])
```

--------------------------------

### ReshapedHDF5Array()

Source: https://context7.com/bioconductor/hdf5array/llms.txt

Wraps an HDF5 dataset as a DelayedArray with a user-supplied virtual reshape, allowing on-disk dimensions to differ from the in-memory view without data copying.

```APIDOC
## ReshapedHDF5Array()

### Description
Wraps an HDF5 dataset as a `DelayedArray` with a user-supplied virtual reshape, allowing the on-disk data dimensions to differ from the in-memory view without copying data.

### Usage
```r
library(HDF5Array)

toy_h5 <- system.file("extdata", "toy.h5", package="HDF5Array")

# The dataset "M2" is stored as 200 x 3 on disk
M2 <- HDF5Array(toy_h5, "M2")
dim(M2)  # c(200, 3)

# Reshape to 3D without touching the file
M2r <- ReshapedHDF5Array(toy_h5, "M2", dim=c(4L, 50L, 3L))
dim(M2r)    # c(4, 50, 3)
class(M2r)  # "ReshapedHDF5Array"

# All DelayedArray operations still work
M2r[1, , ]  # 50 x 3 slice
as.array(M2r[1:2, 1:5, ])
```
```

--------------------------------

### Calculate column sums for uncompressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time taken to compute column sums for a slice of the uncompressed HDF5 DelayedArray, showing block processing messages and potential errors.

```R
system.time(cs2 <- colSums(A2[ , , 77L]))
```

--------------------------------

### Verify that the column sums are identical

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Uses `identical()` to check that the column sums computed from the Matter array (`cs1`), the uncompressed HDF5 array (`cs2`), and the compressed HDF5 array (`cs3`) are exactly the same.

```r
identical(cs1, cs2)
# [1] TRUE
identical(cs1, cs3)
# [1] TRUE
```

--------------------------------

### writeHDF5Array() Function

Source: https://context7.com/bioconductor/hdf5array/llms.txt

Writes any array-like or DelayedArray object to an HDF5 file using block processing, ensuring the object is never fully materialized in memory. It returns an HDF5Array pointing to the newly written dataset and allows control over chunk dimensions, compression level, HDF5 datatype, and dimname storage.

```APIDOC
## writeHDF5Array()

### Description
Writes any array-like or `DelayedArray` object to an HDF5 file via block processing, so the object is never fully materialised in memory. Returns an `HDF5Array` pointing to the newly written dataset. Chunk dimensions, compression level, H5 datatype, and dimname storage are all controllable.

### Usage
```r
library(HDF5Array)

h5file <- tempfile(fileext=".h5")

# Write a plain matrix
m0 <- matrix(runif(364, min=-1), nrow=26,
             dimnames=list(letters, LETTERS[1:14]))
M1 <- writeHDF5Array(m0, h5file, name="M1", chunkdim=c(5, 5))

# Skip writing dimnames
M1b <- writeHDF5Array(m0, h5file, name="M1b", with.dimnames=FALSE)

# Write a sparse matrix (auto-detected; as.sparse flag set on result)
sm <- Matrix::rsparsematrix(20, 8, density=0.1)
M2 <- writeHDF5Array(sm, h5file, name="M2", chunkdim=c(5, 5))

# Realize a DelayedArray with pending operations to disk
M3_delayed <- log(t(DelayedArray(m0)) + 1)
M3 <- writeHDF5Array(M3_delayed, h5file, name="M3", chunkdim=c(5, 5), level=6L)

# Coercion shorthand – writes to the current HDF5 dump file
auto <- as(m0, "HDF5Array")

# Use a compact 32-bit float type to reduce disk footprint
M4 <- writeHDF5Array(m0, h5file, name="M4", H5type="H5T_IEEE_F32LE")
```
```

--------------------------------

### Calculate column sums for compressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?
Measures the time taken to compute column sums for a slice of the compressed HDF5 DelayedArray, showing block processing messages and potential errors.

```R
system.time(cs3 <- colSums(A3[ , , 77L]))
```

--------------------------------

### Extract random subset using extract_array from uncompressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time to extract the same random subset using 'extract_array' from the uncompressed HDF5 DelayedArray.

```R
system.time(x2 <- extract_array(A2, i))
```

--------------------------------

### Extract random subset using extract_array from compressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time to extract the same random subset using 'extract_array' from the compressed HDF5 DelayedArray.

```R
system.time(x3 <- extract_array(A3, i))
```

--------------------------------

### Calculate column sums for Matter DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time taken to compute column sums for a slice of the Matter DelayedArray, showing block processing messages.

```R
system.time(cs1 <- colSums(A1[ , , 77L]))
```

--------------------------------

### Extract subset with complex indexing from compressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time for extracting the same subset from the compressed HDF5 DelayedArray.
```R
system.time(x3 <- as.matrix(A3[(310:11)*7, (1:100)*8, 77]))
```

--------------------------------

### Extract subset with complex indexing from uncompressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time for extracting the same subset from the uncompressed HDF5 DelayedArray.

```R
system.time(x2 <- as.matrix(A2[(310:11)*7, (1:100)*8, 77]))
```

--------------------------------

### Extract slice from compressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time taken to extract the same slice from the compressed HDF5 DelayedArray.

```R
system.time(x3 <- as.matrix(A3[891:1400, 401:700, 77]))
```

--------------------------------

### Extract slice from uncompressed HDF5 DelayedArray

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time taken to extract the same slice from the uncompressed HDF5 DelayedArray.

```R
system.time(x2 <- as.matrix(A2[891:1400, 401:700, 77]))
```

--------------------------------

### Extract slice from Matter array

Source: https://github.com/bioconductor/hdf5array/wiki/matter-vs-hdf5:-which-format-performs-better-for-storing-big-array-like-datasets?

Measures the time taken to extract a specific slice from the 'matter_arr' object.

```R
system.time(x1 <- a1[891:1400, 401:700, 77])
```