### Development Workflow with Live Reload

Source: https://github.com/google/array_record/blob/main/docs/README.md

Sets up a local development server with live reloading for documentation. Installs sphinx-autobuild and starts the server.

```bash
# Install additional development dependencies
pip install sphinx-autobuild
```

```bash
# Start live reload server
sphinx-autobuild . _build/html
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/google/array_record/blob/main/docs/README.md

Installs the necessary Python packages for building the documentation. Ensure you are in the docs directory.

```bash
pip install -r requirements.txt
```

--------------------------------

### Install ArrayRecord

Source: https://github.com/google/array_record/blob/main/docs/index.md

Install the ArrayRecord library using pip. For Apache Beam integration, install with the beam extra.

```bash
pip install array_record
```

```bash
pip install array_record[beam]
```

--------------------------------

### Install Apache Beam and ArrayRecord

Source: https://github.com/google/array_record/blob/main/beam/README.md

Installs the necessary Apache Beam and ArrayRecord libraries, including GCP support for Beam. Verifies the installed Beam version.

```bash
pip install apache-beam[gcp]==2.53.0
pip install array-record[beam]
# check that apache-beam is still at 2.53.0
pip show apache-beam
```

--------------------------------

### Clone ArrayRecord Repository and Navigate to Examples

Source: https://github.com/google/array_record/blob/main/beam/README.md

Clones the ArrayRecord GitHub repository and changes the directory to the Beam examples.
```bash
git clone https://github.com/google/array_record.git
cd array_record/beam/examples
```

--------------------------------

### Command Line Usage Example

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/beam_integration.md

Demonstrates how to run a Beam pipeline using the parsed arguments from the command line. This example shows typical arguments for input, output, runner, project, and region.

```bash
python my_pipeline.py \
  --input "gs://bucket/input/*.tfrecord" \
  --output "gs://bucket/output/" \
  --runner DataflowRunner \
  --project my-project \
  --region us-central1
```

--------------------------------

### Build Documentation

Source: https://github.com/google/array_record/blob/main/docs/README.md

Manually build the documentation using the make html command. Ensure you have Sphinx installed.

```bash
make html
```

--------------------------------

### ArrayRecordReader Deduction Guide

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Example of a deduction guide for constructing an ArrayRecordReader, allowing type inference for the source and target types.

```cpp
template explicit ArrayRecordReader(
    Src&& src,
    ArrayRecordReaderBase::Options options = ArrayRecordReaderBase::Options(),
    ARThreadPool* pool = nullptr) -> ArrayRecordReader>;
```

--------------------------------

### Create ArrayRecord Writer Example

Source: https://github.com/google/array_record/blob/main/docs/README.md

Demonstrates how to create an ArrayRecord writer and write data to a file. This example uses the Python API.

```python
from array_record.python import array_record_module

# Create a writer
writer = array_record_module.ArrayRecordWriter('example.array_record')
writer.write(b'Hello, ArrayRecord!')
writer.close()
```

--------------------------------

### Build PDF Documentation

Source: https://github.com/google/array_record/blob/main/docs/README.md

Builds the documentation in PDF format.
This requires a LaTeX installation.

```bash
make latexpdf
```

--------------------------------

### Create ArrayRecordDataSource with FileInstructions

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Example of creating a custom FileInstruction class and initializing ArrayRecordDataSource with a list of instructions.

```python
from dataclasses import dataclass

from array_record.python.array_record_data_source import ArrayRecordDataSource

@dataclass
class MyFileInstruction:
    filename: str
    skip: int
    take: int
    examples_in_shard: int

instructions = [
    MyFileInstruction("data-00000.arecord", skip=0, take=500, examples_in_shard=1000),
    MyFileInstruction("data-00001.arecord", skip=100, take=900, examples_in_shard=1000),
]
ds = ArrayRecordDataSource(instructions)
```

--------------------------------

### Beam Pipeline Command Line Example

Source: https://github.com/google/array_record/blob/main/_autodocs/INDEX.md

Demonstrates how to run a Beam pipeline using ArrayRecord with specified input, output, runner, and project settings.

```bash
python pipeline.py \
  --input "gs://bucket/input/*.tfrecord" \
  --output "gs://bucket/output/" \
  --runner DataflowRunner \
  --project my-project
```

--------------------------------

### C++ Riegeli RecordsMetadata Configuration Example

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of setting Riegeli file metadata using the RecordsMetadata protobuf type. Metadata can include compression settings and record counts.

```cpp
riegeli::RecordsMetadata metadata;
metadata.set_...();  // Set fields
options.set_metadata(metadata);
```

--------------------------------

### Complete ArrayRecordDataSource Configuration

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Example of creating an ArrayRecordDataSource from multiple files with specified reader options.
```python
from array_record.python import array_record_data_source
import glob

ds = array_record_data_source.ArrayRecordDataSource(
    paths=glob.glob("data-*.arecord"),
    reader_options={
        "readahead_buffer_size": "8MB",
        "max_parallelism": "4"
    }
)

print(f"Total records: {len(ds)}")

# Sequential iteration
for record in ds:
    process(record)

# Random access
batch = ds.__getitems__([0, 100, 500, 1000])
```

--------------------------------

### Complete ArrayRecordWriter Configuration

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Example of initializing and using ArrayRecordWriter with specific compression and parallelism options.

```python
from array_record.python.array_record_module import ArrayRecordWriter

writer = ArrayRecordWriter(
    "high_quality.arecord",
    options="group_size:64,zstd:9,window_log:21,max_parallelism:8"
)

for i in range(10000):
    writer.write(f"record_{i}".encode())

writer.close()
```

--------------------------------

### Complete ArrayRecordReader Configuration

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Example of initializing and using ArrayRecordReader with readahead buffer and parallelism options.

```python
from array_record.python.array_record_module import ArrayRecordReader

reader = ArrayRecordReader(
    "high_quality.arecord",
    options="readahead_buffer_size:16MB,max_parallelism:4"
)

print(f"Total records: {reader.num_records()}")
all_records = reader.read_all()
reader.close()
```

--------------------------------

### Simple Path for ArrayRecordDataSource

Source: https://github.com/google/array_record/blob/main/_autodocs/README.md

Example of creating an `ArrayRecordDataSource` with a single file path.
```python
ds = ArrayRecordDataSource("data.arecord")
```

--------------------------------

### Python Script with Flags Example

Source: https://github.com/google/array_record/blob/main/_autodocs/INDEX.md

Shows how to execute a Python script with specific flags related to thread configuration for computing and fetching records.

```bash
python script.py \
  --grain_num_threads_computing_num_records=32 \
  --grain_num_threads_fetching_records=8
```

--------------------------------

### Python ArrayRecordDataSource Usage Example

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of initializing an ArrayRecordDataSource to read from multiple ArrayRecord files. This data source implements the RandomAccessDataSource protocol.

```python
import glob

from array_record.python import array_record_data_source

# Assumes RandomAccessDataSource (the protocol named above) is already imported.
data_source: RandomAccessDataSource[bytes] = (
    array_record_data_source.ArrayRecordDataSource(
        glob.glob("data-*.arecord")
    )
)
```

--------------------------------

### Python Function Docstring Example

Source: https://github.com/google/array_record/blob/main/docs/README.md

Illustrates the structure of a Python docstring for a function, including arguments, return values, exceptions, and examples.

```python
def my_function(param1: str, param2: int = 0) -> bool:
    """Brief description of the function.

    Longer description with more details about the function's behavior,
    use cases, and any important considerations.

    Args:
        param1: Description of the first parameter.
        param2: Description of the second parameter. Defaults to 0.

    Returns:
        Description of what the function returns.

    Raises:
        ValueError: Description of when this exception is raised.

    Example:
        >>> result = my_function("test", 42)
        >>> print(result)
        True
    """
    pass
```

--------------------------------

### Python: ArrayRecordReader Configuration Options

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Shows examples of configuring ArrayRecordReader for different access patterns. Options include read-ahead buffer size, maximum parallelism, and index storage location.

```python
# Optimized for random access
reader = ArrayRecordReader("data.arecord", "readahead_buffer_size:0,max_parallelism:0")
```

```python
# Optimized for sequential read
reader = ArrayRecordReader("data.arecord", "readahead_buffer_size:16MB,max_parallelism:4")
```

```python
# Offload index for low memory
reader = ArrayRecordReader("data.arecord", "index_storage_option:offloaded")
```

--------------------------------

### Beam Shard Name Template Examples

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Demonstrates different patterns for output filenames in Beam, including the default, a custom pattern, and a pattern incorporating the current date.

```python
# Default (Beam standard)
shard_name_template=None  # Uses {num:05d}-of-{total:05d}
# Output: data-00000-of-00100, data-00001-of-00100, ...

# Custom pattern
shard_name_template="shard-{num:03d}"
# Output: shard-000, shard-001, ...

# With date
import datetime
date_str = datetime.date.today().strftime("%Y%m%d")
shard_name_template=f"{date_str}-{{num:05d}}"
# Output: 20240529-00000, 20240529-00001, ...
```

--------------------------------

### C++ Protobuf MessageLite Read/Write Example

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Shows how to read and write protocol buffer messages using the MessageLite base class. This is fundamental for proto serialization in ArrayRecord.
```cpp
// Read and deserialize proto
MyProto proto;
reader.ReadRecord(&proto);

// Write proto
writer.WriteRecord(proto);
```

--------------------------------

### Python: ArrayRecordWriter Fast Write Example

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_writer.md

Configures the ArrayRecordWriter for fast writes with minimal compression by disabling compression and setting a small group size.

```python
# Fast write, minimal compression
writer = ArrayRecordWriter("output.arecord", "uncompressed,group_size:1")
```

--------------------------------

### Create Riegeli String Reader

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of creating a Riegeli StringReader using the Maker utility, which reads from a std::string. Useful for in-memory deserialization.

```cpp
riegeli::Maker(src_string)
```

--------------------------------

### Complete Beam Pipeline for Writing ArrayRecords

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Example of using Apache Beam to create and write records to ArrayRecord files with specified sharding and compression.

```python
import apache_beam as beam
from array_record.beam.arrayrecordio import WriteToArrayRecord

with beam.Pipeline() as pipeline:
    (pipeline
     | "Create" >> beam.Create([b"record1", b"record2", b"record3"])
     | "Write" >> WriteToArrayRecord(
         file_path_prefix="gs://bucket/output/data",
         file_name_suffix=".arecord",
         num_shards=100,
         compression_type=beam.io.filesystem.CompressionTypes.AUTO
     ))
```

--------------------------------

### Python: ArrayRecordWriter Balanced Settings Example

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_writer.md

Configures the ArrayRecordWriter with balanced settings for compression and performance, using Brotli compression with a specific level and window log size.
```python
# Balanced settings
writer = ArrayRecordWriter("output.arecord", "group_size:8,brotli:6,window_log:20")
```

--------------------------------

### Python SupportsIndex Example

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Illustrates objects that satisfy the SupportsIndex protocol, which requires the __index__() method. This includes standard integers and numpy integers.

```python
# These are all SupportsIndex:
ds[0]            # int
ds[np.int64(5)]  # numpy integer
```

--------------------------------

### Create Riegeli String Writer

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of creating a Riegeli StringWriter using the Maker utility, which writes to a std::string. Useful for in-memory serialization.

```cpp
riegeli::Maker(&dest_string)
```

--------------------------------

### Valid Option String Format

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Example of a correctly formatted option string without whitespace around delimiters.

```python
# Valid
"group_size:8,zstd:3"
```

--------------------------------

### Create Riegeli Cord Writer

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of creating a Riegeli CordWriter using the Maker utility, which writes to a Riegeli Cord. Useful for efficient string-like data handling.

```cpp
riegeli::Maker(&dest_cord)
```

--------------------------------

### Create Riegeli Cord Reader

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of creating a Riegeli CordReader using the Maker utility, which reads from a Riegeli Cord. Useful for efficient string-like data handling.
```cpp
riegeli::Maker(&src_cord)
```

--------------------------------

### ArrayRecordReader Constructor and Options

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Initializes an ArrayRecordReader with optional configuration settings for read-ahead and index storage. Examples show how to optimize for random access, sequential reads, or low memory usage.

```APIDOC
### Configuration Options

The `options` string accepts comma-separated key:value pairs:

#### Read-Ahead Configuration

- `readahead_buffer_size:N` - Buffer size per thread in bytes, with optional suffix (B/K/M/G). Default: 0. Set to 0 for optimized random access.
- `max_parallelism:N|auto` - Number of concurrent read-ahead threads. Default: uses thread pool size. Set to 0 to disable read-ahead prefetching and optimize random access.

#### Index Storage

- `index_storage_option:in_memory|offloaded` - Where to store the record index. Default: in_memory.
  - `in_memory`: Loads all chunk offsets into memory (faster access, higher memory)
  - `offloaded`: Reads chunk offsets from disk on each access (lower memory, slower)

**Example:**

```python
# Optimized for random access
reader = ArrayRecordReader("data.arecord", "readahead_buffer_size:0,max_parallelism:0")

# Optimized for sequential read
reader = ArrayRecordReader("data.arecord", "readahead_buffer_size:16MB,max_parallelism:4")

# Offload index for low memory
reader = ArrayRecordReader("data.arecord", "index_storage_option:offloaded")
```
```

--------------------------------

### Run TFRecord to ArrayRecord Conversion Example

Source: https://github.com/google/array_record/blob/main/beam/README.md

Executes a Python script to convert TFRecord files to ArrayRecords. Requires filling in specific fields in the script and potentially configuring DataFlow pipeline options.
```bash
# Fill in the required fields in example_gcs_conversion.py
# If using Dataflow, set pipeline_options as instructed in example_gcs_conversion.py
python example_gcs_conversion.py
```

--------------------------------

### ArrayRecordReader Option String Format

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Configure ArrayRecordReader using a string of comma-separated key:value pairs. This example shows common reader options.

```text
readahead_buffer_size:16MB,max_parallelism:4,index_storage_option:in_memory
```

--------------------------------

### Import Core Python ArrayRecord Modules

Source: https://github.com/google/array_record/blob/main/_autodocs/INDEX.md

Import necessary classes for writing, reading, and data source operations in Python. Ensure the 'array_record' package is installed.

```python
from array_record.python.array_record_module import ArrayRecordWriter, ArrayRecordReader
from array_record.python.array_record_data_source import ArrayRecordDataSource
from array_record.beam.arrayrecordio import WriteToArrayRecord
```

--------------------------------

### Create Riegeli Writer from File Path

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of creating a Riegeli FileWriter using the Maker utility from a file path. This is a common way to initialize Riegeli writers for output files.

```cpp
riegeli::Maker("output.arecord")
```

--------------------------------

### Check Reader State at Boundaries in C++

Source: https://github.com/google/array_record/blob/main/_autodocs/errors.md

This C++ example emphasizes checking the reader's state (`ok()`) both before and after operations like `SeekRecord`. This ensures the reader is in a valid state for the operation and that the operation itself did not cause an error.
```cpp
// Always check object state before use
if (!reader.ok()) {
  return reader.status();
}

// And after operations
bool success = reader.SeekRecord(index);
if (!success || !reader.ok()) {
  return reader.status();
}
```

--------------------------------

### Create Riegeli Reader from File Path

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of creating a Riegeli FileReader using the Maker utility from a file path. This is a common way to initialize Riegeli readers for input files.

```cpp
riegeli::Maker("input.arecord")
```

--------------------------------

### C++ absl::string_view Usage Example

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Shows how to use absl::string_view to create a non-owning reference to string data. This avoids unnecessary memory allocations.

```cpp
// No allocation, just reference
absl::string_view view = reader.ReadRecord();
```

--------------------------------

### read(start, end)

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Reads a contiguous range of records from a specified start index (inclusive) to an end index (exclusive). Supports negative indexing.

```APIDOC
## read(start, end)

### Description

Reads a range of records from `start` (inclusive) to `end` (exclusive).

### Parameters

#### Path Parameters

- **start** (int) - Required - Starting index (inclusive, supports negative indexing)
- **end** (int) - Required - Ending index (exclusive, supports negative indexing)

### Returns

- **Sequence[bytes]**: Records in range

### Raises

- **IndexError**: Invalid range (start >= end, start < 0 and abs(start) > num_records, etc.)
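
The index rules described above can be sketched in plain Python. This is only an illustration under an assumption: that negative values count back from the record count the way Python slices do. `normalize_range` is a hypothetical helper, not part of array_record, and the library's actual checks may differ.

```python
def normalize_range(start: int, end: int, num_records: int) -> tuple[int, int]:
    """Resolve a possibly negative [start, end) pair to absolute indices.

    Hypothetical helper (not an array_record API). Assumes negative values
    count back from num_records, as in Python slicing.
    """

    def resolve(index: int) -> int:
        # A negative index is only valid if it stays within the file.
        if index < 0:
            if -index > num_records:
                raise IndexError(f"index {index} out of range for {num_records} records")
            return index + num_records
        return index

    abs_start, abs_end = resolve(start), resolve(end)
    # Reject empty or reversed ranges and ends past the last record.
    if abs_end > num_records or abs_start >= abs_end:
        raise IndexError(f"invalid range [{start}, {end}) for {num_records} records")
    return abs_start, abs_end
```

Under these assumed rules, `normalize_range(-10, -1, 100)` resolves to `(90, 99)`, which covers nine records because the end index stays exclusive.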

### Example

```python
reader = ArrayRecordReader("data.arecord")
records = reader.read(10, 20)   # Records 10-19
records = reader.read(-10, -1)  # Starts 10 records from the end; the final record is excluded (end is exclusive)
```
```

--------------------------------

### Configure ArrayRecordDataSource with Reader Options

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Demonstrates initializing ArrayRecordDataSource with reader options for buffer size, parallelism, and index storage.

```python
import glob

ds = ArrayRecordDataSource(
    glob.glob("data-*.arecord"),
    reader_options={
        "readahead_buffer_size": "8MB",
        "max_parallelism": "4",
        "index_storage_option": "in_memory"
    }
)
```

--------------------------------

### Initialize ArrayRecordReader

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Instantiate the ArrayRecordReader with a file path. Custom configuration options and buffer sizes can also be provided.

```python
from array_record.python.array_record_module import ArrayRecordReader

# Basic read
reader = ArrayRecordReader("data.arecord")

# With custom options
reader = ArrayRecordReader(
    "data.arecord",
    options="readahead_buffer_size:8MB,max_parallelism:4"
)
```

--------------------------------

### Options::readahead_buffer_size

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Gets the configured size of the read-ahead buffer.

```APIDOC
## Options::readahead_buffer_size

### Description

Gets the readahead buffer size.

### Method

`uint64_t readahead_buffer_size() const`

### Parameters

None

### Response

#### Success Response

`uint64_t` - The configured readahead buffer size.
```

--------------------------------

### Initialize ArrayRecord Writers with Zstandard and Snappy Compression

Source: https://github.com/google/array_record/blob/main/docs/core_concepts.md

Initializes an ArrayRecord file with specified group size and compression settings.
Use Zstandard for higher compression ratios or Snappy for faster compression/decompression.

```python
from array_record.python import array_record_module

zstd_writer = array_record_module.ArrayRecordWriter(
    'output.array_record',
    'group_size:1024,zstd:5,window_log:10'
)

snappy_writer = array_record_module.ArrayRecordWriter(
    'output.array_record',
    'group_size:1024,snappy'
)
```

--------------------------------

### Options::index_storage_option

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Gets the configured option for storing index data.

```APIDOC
## Options::index_storage_option

### Description

Gets the index storage option.

### Method

`IndexStorageOption index_storage_option() const`

### Parameters

None

### Response

#### Success Response

`IndexStorageOption` - The configured index storage option.
```

--------------------------------

### Initialize with pathlib.Path

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Demonstrates initializing ArrayRecordDataSource using a pathlib.Path object for specifying the data file.

```python
from pathlib import Path

ds = ArrayRecordDataSource(Path("data.arecord"))
```

--------------------------------

### Options::max_parallelism

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Gets the configured maximum parallelism for parallel operations.

```APIDOC
## Options::max_parallelism

### Description

Gets the maximum parallelism.

### Method

`std::optional max_parallelism() const`

### Parameters

None

### Response

#### Success Response

`std::optional` - The configured maximum parallelism.
```

--------------------------------

### C++: Initialize and Read Records with ArrayRecordReader

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Demonstrates initializing an ArrayRecordReader in C++ using a FileReader and iterating through records. It shows how to read records sequentially and handle potential errors.

```cpp
#include "cpp/array_record_reader.h"
#include "riegeli/bytes/file_reader.h"

array_record::ArrayRecordReader reader(
    riegeli::Maker("data.arecord"));

for (uint64_t i = 0; i < reader.NumRecords(); ++i) {
  absl::string_view record;
  if (!reader.ReadRecord(&record)) {
    return reader.status();
  }
  // Process record
}

if (!reader.Close()) {
  return reader.status();
}
```

--------------------------------

### Initialize ArrayRecord Reader and DataSource

Source: https://github.com/google/array_record/blob/main/docs/core_concepts.md

Illustrates the initialization of both ArrayRecordReader for direct API access and ArrayRecordDataSource (from the array_record_data_source module) for integration with pygrain, focusing on read options.

```python
from array_record.python import array_record_module
from array_record.python import array_record_data_source

# Option strings use key:value pairs separated by commas
reader = array_record_module.ArrayRecordReader(
    'output.array_record',
    'index_storage_option:offloaded,readahead_buffer_size:0'
)

# The data source class lives inside the array_record_data_source module
ds = array_record_data_source.ArrayRecordDataSource(
    'output.array_record',
    reader_options={
        'index_storage_option': 'offloaded',
    }
)
```

--------------------------------

### record_index()

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Gets the current 0-indexed position of the read cursor. This is primarily relevant when reading records sequentially.

```APIDOC
## record_index()

### Description

Returns the current read position (0-indexed).

### Returns

- **int**: Current record index

### Note

Only relevant in sequential reading mode.
```

--------------------------------

### Build Wheel Locally

Source: https://github.com/google/array_record/blob/main/oss/README.md

Run this script in the root folder to build a wheel for the current python3 version. Optionally specify a Python version.

```shell
./oss/build_whl.sh
```

```shell
PYTHON_VERSION=3.9 ./oss/build_whl.sh
```

--------------------------------

### Get Total Number of Records

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Retrieve the total count of records across all files managed by the data source.

```python
import glob

ds = ArrayRecordDataSource(glob.glob("data-*.arecord"))
num_records = len(ds)
```

--------------------------------

### ArrayRecordDataSource Integration with Grain DataLoader

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Example of using ArrayRecordDataSource with Grain's DataLoader for efficient data loading in ML pipelines.

```python
from array_record.python import array_record_data_source
import grain
import glob

data_source = array_record_data_source.ArrayRecordDataSource(
    glob.glob("train-*.arecord"),
    reader_options={"max_parallelism": "4"}
)

loader = grain.DataLoader(
    data_source=data_source,
    sampler=grain.RandomSampler(seed=42),
)

for records in loader:
    train(records)
```

--------------------------------

### C++: Initialize Writer Options and Read Records in Parallel

Source: https://github.com/google/array_record/blob/main/_autodocs/quick-reference.md

Shows how to convert string options to ArrayRecordWriter options and perform parallel reads with a callback in C++. Handles potential errors during these operations.
```cpp
auto result = array_record::ArrayRecordWriterBase::Options::FromString(options);
if (!result.ok()) {
  return result.status();
}

auto status = reader.ParallelReadRecords(callback);
if (!status.ok()) {
  LOG(ERROR) << status;
  return status;
}
```

--------------------------------

### Get Writer Options String

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Retrieve the options string that was used when the ArrayRecord file was originally written. This can be empty for older files.

```python
reader = ArrayRecordReader("data.arecord")
options = reader.writer_options_string()
# Output: "group_size:1,transpose:false,pad_to_block_boundary:false,zstd:3,window_log:20,max_parallelism:1"
```

--------------------------------

### Configuration Presets for Writes

Source: https://github.com/google/array_record/blob/main/_autodocs/quick-reference.md

Select configuration presets for optimal write performance or file size. 'uncompressed' is fastest, while 'zstd:11' offers high compression.

```python
"uncompressed,group_size:1"
```

```python
"group_size:128,zstd:11"
```

--------------------------------

### Generate Random Access Indices

Source: https://github.com/google/array_record/blob/main/docs/performance.md

Generates a list of random indices for dataset access using numpy. Ensure numpy is installed and imported.

```python
import numpy as np

rng = np.random.default_rng(42)
num_records = 65536  # dataset size
# Generator.permutation(n) returns a shuffled copy of range(n)
indices = [int(v) for v in rng.permutation(num_records)]
```

--------------------------------

### C++ InvalidArgumentError Example

Source: https://github.com/google/array_record/blob/main/_autodocs/errors.md

Shows how to construct an InvalidArgumentError in C++ for bad parameters, such as an invalid group_size. Ensure parameters meet the required constraints.
```cpp
#include "cpp/common.h"

return array_record::InvalidArgumentError(
    "Invalid group_size: %d (must be >= 1)", value);
```

--------------------------------

### Python ArrayRecordReader Initialization with Option String

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Initializes an ArrayRecordReader using an option string for configuration. Ensure the 'array_record.python.array_record_module' is imported.

```python
from array_record.python.array_record_module import ArrayRecordReader

# Option string
reader = ArrayRecordReader(
    "data.arecord",
    "readahead_buffer_size:16MB,max_parallelism:4"
)
```

--------------------------------

### C++: Write Records

Source: https://github.com/google/array_record/blob/main/_autodocs/README.md

Use ArrayRecordWriter in C++ to write records to a file. This example uses Riegeli FileWriter for underlying file operations.

```cpp
#include "cpp/array_record_writer.h"
#include "riegeli/bytes/file_writer.h"

array_record::ArrayRecordWriter writer(
    riegeli::Maker("output.arecord"));

writer.WriteRecord("record_1");
writer.WriteRecord("record_2");
writer.Close();
```

--------------------------------

### Balanced Configuration

Source: https://github.com/google/array_record/blob/main/_autodocs/README.md

A default performance preset offering a balance between compression and speed.

```python
"group_size:8,zstd:3"
```

--------------------------------

### Complete Data Pipeline with ArrayRecordDataSource

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Illustrates a full data pipeline using ArrayRecordDataSource, including initialization, iteration, random access, batch access, and cleanup.
```python from array_record.python import array_record_data_source import glob # Create data source from multiple sharded files data_source = array_record_data_source.ArrayRecordDataSource( paths=glob.glob("gs://bucket/dataset/*.arecord"), reader_options={ "readahead_buffer_size": "16MB", "max_parallelism": "8" } ) # Check total size print(f"Total examples: {len(data_source)}") # Sequential iteration for record in data_source: process(record) # Random access record_42 = data_source[42] # Batch access with specific indices batch = data_source.__getitems__([0, 100, 500, 1000]) # Cleanup data_source.close() ``` -------------------------------- ### C++ InternalError Example Source: https://github.com/google/array_record/blob/main/_autodocs/errors.md Illustrates creating an internal error in C++ using the common.h utility. This is used for unexpected internal states or parsing failures. ```cpp #include "cpp/common.h" return array_record::InternalError( "Failed to parse record at index %d", index); ``` -------------------------------- ### Python: Initialize ArrayRecord Writer and Reader with Error Handling Source: https://github.com/google/array_record/blob/main/_autodocs/quick-reference.md Demonstrates how to initialize ArrayRecordWriter and ArrayRecordReader in Python, including handling potential ValueErrors for invalid options and IndexErrors for end-of-file. ```python try: writer = ArrayRecordWriter("out.arecord", options) except ValueError as e: print(f"Invalid options: {e}") try: reader = ArrayRecordReader("in.arecord") record = reader.read() except IndexError: print("End of file") except Exception as e: print(f"Read error: {e}") ``` -------------------------------- ### Build HTML Documentation with Sphinx Source: https://github.com/google/array_record/blob/main/docs/README.md Builds the HTML version of the documentation using Sphinx. The output will be in the _build/html directory. ```bash # Using Sphinx directly sphinx-build -b html . 
_build/html
```

```bash
# Or using the Makefile
make html
```

--------------------------------

### Python: Initialize ArrayRecordWriter with Compression

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_writer.md

Initializes an ArrayRecordWriter with a specified file path and compression options. Use this to configure group size and compression algorithm like Zstd with a specific level.

```python
from array_record.python.array_record_module import ArrayRecordWriter

writer = ArrayRecordWriter("output.arecord", "group_size:1,zstd:3")
writer.write(b"record_1")
writer.write(b"record_2")
writer.write(b"record_3")
writer.close()
```

--------------------------------

### ArrayRecordWriter Option String Format

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Both Python and C++ use comma-separated key:value pairs for configuration. This example shows a typical string format.

```text
group_size:1,zstd:3,window_log:20,pad_to_block_boundary:false,transpose:false,max_parallelism:4
```

--------------------------------

### Initialize with File Range Syntax

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Utilize range syntax within string paths to read specific subsets of records from ArrayRecord files. This allows for partial file reads.

```python
from array_record.python.array_record_data_source import ArrayRecordDataSource

# Read records 100-199 (start inclusive, end exclusive)
ds = ArrayRecordDataSource("data.arecord[100:200]")

# Combine multiple ranges
ds = ArrayRecordDataSource([
    "data-00000.arecord[0:500]",    # First 500 records
    "data-00001.arecord[100:600]",  # Records 100-599
])
```

--------------------------------

### Initialize ArrayRecordDataSource with FileInstruction

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Initializes an ArrayRecordDataSource using a list of FileInstruction objects. This is how you specify which files and ranges to read from.
```python
from array_record.python import array_record_data_source

# `instruction` is a FileInstruction object defined elsewhere
ds = array_record_data_source.ArrayRecordDataSource([instruction])
```

--------------------------------

### Python ArrayRecordWriter Initialization with String Options

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Initialize ArrayRecordWriter in Python using a string of options. Ensure the 'array_record.python.array_record_module' is imported.

```python
from array_record.python.array_record_module import ArrayRecordWriter

# String-based options
writer = ArrayRecordWriter(
    "output.arecord",
    "group_size:8,zstd:3,window_log:20"
)
```

--------------------------------

### Initialize ArrayRecordWriter

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_writer.md

Demonstrates how to initialize an ArrayRecordWriter to write to a file named 'output.arecord'. It includes writing two records and closing the writer, with error handling.

```cpp
#include "cpp/array_record_writer.h"
#include "riegeli/bytes/file_writer.h"

// Write to file
array_record::ArrayRecordWriter writer(
    riegeli::Maker<riegeli::FileWriter>("output.arecord"));

writer.WriteRecord("record_1");
writer.WriteRecord("record_2");

if (!writer.Close()) {
  // Handle error
  return writer.status();
}
```

--------------------------------

### Get Total Number of Records

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Retrieve the total count of records stored within the ArrayRecord file. This is useful for understanding the file's size.

```python
from array_record.python.array_record_module import ArrayRecordReader

reader = ArrayRecordReader("data.arecord")
print(f"Total records: {reader.num_records()}")
```

--------------------------------

### Import Apache Beam ArrayRecord Components

Source: https://github.com/google/array_record/blob/main/_autodocs/INDEX.md

Import the necessary components for integrating ArrayRecord with Apache Beam pipelines in Python. This includes I/O transforms and Dataflow DoFns.
```python
from array_record.beam.arrayrecordio import WriteToArrayRecord
from array_record.beam.dofns import ConvertToArrayRecordGCS
```

--------------------------------

### C++ OutOfRangeError Example

Source: https://github.com/google/array_record/blob/main/_autodocs/errors.md

Demonstrates creating an OutOfRangeError in C++ when an index is beyond the file bounds. Verify that indices are within the valid range before accessing records.

```cpp
#include "cpp/common.h"

return array_record::OutOfRangeError(
    "Index %d beyond file bounds [0, %d)", idx, num_recs);
```

--------------------------------

### Python: ArrayRecordWriter High Compression Example

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_writer.md

Configures the ArrayRecordWriter for high compression density by setting a larger group size and Zstd compression level 11.

```python
# High compression density
writer = ArrayRecordWriter("output.arecord", "group_size:128,zstd:11")
```

--------------------------------

### Invalid Compression Combinations for ArrayRecordWriter

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

These examples demonstrate invalid combinations of compression algorithms which will raise a ValueError. Select exactly one compression algorithm or use 'uncompressed'.

```python
# ERROR: Multiple compression algorithms
"brotli:6,zstd:3"
"brotli,snappy"
```

--------------------------------

### Multiple Files with Ranges for ArrayRecordDataSource

Source: https://github.com/google/array_record/blob/main/_autodocs/README.md

Shows how to load multiple files with specific record ranges using `ArrayRecordDataSource`.
```python
# Multiple files with ranges
ds = ArrayRecordDataSource([
    "file1.arecord[0:500]",
    "file2.arecord[100:600]"
])
```

--------------------------------

### C++ ArrayRecordReader Initialization from Options String

Source: https://github.com/google/array_record/blob/main/_autodocs/configuration.md

Initializes a C++ ArrayRecordReader by parsing options from a string. This method uses the Riegeli library for file handling.

```cpp
#include "cpp/array_record_reader.h"

// Parse from string
auto opts_result = array_record::ArrayRecordReaderBase::Options::FromString(
    "readahead_buffer_size:16MB,max_parallelism:4");

auto reader = array_record::ArrayRecordReader(
    riegeli::Maker("data.arecord"),
    opts_result.value());
```

--------------------------------

### Get Global Thread Pool in C++

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Retrieves the global Eigen::ThreadPoolInterface instance used by ArrayRecord for default parallelism. This pool can be passed to ArrayRecordWriter for parallel operations.

```cpp
#include "cpp/thread_pool.h"

ARThreadPool* pool = array_record::ArrayRecordGlobalPool();
array_record::ArrayRecordWriter writer(..., pool);
```

--------------------------------

### Write Raw Binary Data with ArrayRecordWriter

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_writer.md

Shows how to write raw binary data using the WriteRecord method, taking a pointer to the data and its size.

```cpp
std::vector binary_data = {...};
writer.WriteRecord(binary_data.data(), binary_data.size());
```

--------------------------------

### Debug Sphinx Build with Verbose Output

Source: https://github.com/google/array_record/blob/main/docs/README.md

Enables verbose output during the Sphinx build process to help debug issues. Use this command to get more detailed logs.

```bash
sphinx-build -v -b html . \
_build/html
```

--------------------------------

### Fast Writes Configuration

Source: https://github.com/google/array_record/blob/main/_autodocs/README.md

A performance preset for fast writes using uncompressed data and a small group size.

```python
"uncompressed,group_size:1"
```

--------------------------------

### C++ absl::Status Error Handling Example

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Demonstrates checking the status of an operation using absl::Status and logging errors. This is a common pattern for error handling in C++ with Abseil.

```cpp
absl::Status status = reader.ParallelReadRecords(...);
if (!status.ok()) {
  LOG(ERROR) << status;
}
```

--------------------------------

### Initialize ArrayRecordDataSource

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Instantiate ArrayRecordDataSource with single or multiple file paths. Supports glob patterns and file range syntax. Reader options can be provided for performance tuning.

```python
from array_record.python.array_record_data_source import ArrayRecordDataSource
import glob

# Single file
ds = ArrayRecordDataSource("data.arecord")

# Multiple files with glob
ds = ArrayRecordDataSource(glob.glob("data-*.arecord"))

# File range syntax (records 100-199 from file 0, all from file 1)
ds = ArrayRecordDataSource([
    "data-00000.arecord[100:200]",
    "data-00001.arecord"
])

# With reader options
ds = ArrayRecordDataSource(
    glob.glob("data-*.arecord"),
    reader_options={"max_parallelism": "4"}
)
```

--------------------------------

### Read Records by Range

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Extract a contiguous block of records from the file using start (inclusive) and end (exclusive) indices. Supports negative indexing. An IndexError is raised for invalid ranges.
```python
from array_record.python.array_record_module import ArrayRecordReader

reader = ArrayRecordReader("data.arecord")
records = reader.read(10, 20)   # Records 10-19
records = reader.read(-10, -1)  # Records n-10 through n-2; the exclusive end omits the last record
```

--------------------------------

### Define FileInstruction Protocol

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_data_source.md

Defines the FileInstruction protocol with filename, skip, take, and examples_in_shard attributes, matching the TFDS interface.

```python
from typing import Protocol

class FileInstruction(Protocol):
    filename: str           # Path to file
    skip: int               # Skip first N records
    take: int               # Take N records
    examples_in_shard: int  # Total records in file
```

--------------------------------

### C++ absl::Span Usage for Parallel Reads

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Example of using absl::Span to pass a non-owning reference to an array of indices for parallel record reads. This is efficient for large index sets.

```cpp
std::vector<uint64_t> indices = {0, 5, 10};
absl::Span<const uint64_t> span = indices;
reader.ParallelReadRecordsWithIndices(span, callback);
```

--------------------------------

### Define PathLikeOrFileInstruction in Python

Source: https://github.com/google/array_record/blob/main/_autodocs/types.md

Demonstrates the usage of PathLikeOrFileInstruction, a type alias for various file path specifications including strings, pathlib.Path, epath.PathLike, and FileInstruction objects. It also shows how to use range syntax directly in strings.
```python
from pathlib import Path

# String
paths: PathLikeOrFileInstruction = "data.arecord"

# pathlib.Path
paths: PathLikeOrFileInstruction = Path("data.arecord")

# Range syntax (stored as string, parsed at init)
paths: PathLikeOrFileInstruction = "data.arecord[0:500]"

# FileInstruction
paths: PathLikeOrFileInstruction = my_file_instruction
```

--------------------------------

### File Organization Structure

Source: https://github.com/google/array_record/blob/main/_autodocs/INDEX.md

Illustrates the directory structure for technical reference documentation, highlighting key files for quick reference, architecture, configuration, and API details.

````markdown
```
Technical Reference Documentation
│
├── quick-reference.md          ← Start here for common tasks
├── architecture.md             ← System design and overview
├── configuration.md            ← All configuration options
├── types.md                    ← Type definitions
├── errors.md                   ← Error handling
│
└── api-reference/
    ├── array_record_writer.md      ← Writing records
    ├── array_record_reader.md      ← Reading records
    ├── array_record_data_source.md ← Multi-file data source
    └── beam_integration.md         ← Beam transforms and DoFns
```
````

--------------------------------

### Get the number of records from ArrayRecordDataSource

Source: https://github.com/google/array_record/blob/main/docs/python_reference.md

Use the len() function on an ArrayRecordDataSource object to retrieve the total number of records across all specified ArrayRecord files. Ensure the glob pattern correctly matches your files.

```python
import glob

from array_record.python import array_record_data_source

ds = array_record_data_source.ArrayRecordDataSource(glob.glob("output.array_record*"))
len(ds)
```

--------------------------------

### Configuration Presets for Reads

Source: https://github.com/google/array_record/blob/main/_autodocs/quick-reference.md

Choose reader configurations for random access or sequential reads. 'readahead_buffer_size:0' is suitable for random access, while larger buffers benefit sequential reads.
```python
# Writer
"group_size:1,zstd:3"

# Reader
"readahead_buffer_size:0,max_parallelism:0"
```

```python
# Reader
"readahead_buffer_size:16MB,max_parallelism:8"
```

--------------------------------

### Read Specific Records by Index with Callback

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Reads specific records by index in parallel. The callback receives the index within the provided 'indices' span and the record data. Use 'indices[indices_idx]' to get the actual record index.

```cpp
std::vector<uint64_t> indices = {0, 5, 10, 25};

auto status = reader.ParallelReadRecordsWithIndices(
    indices,
    [&](uint64_t indices_idx, absl::string_view record) -> absl::Status {
      // Position in `indices` maps back to the actual record index.
      uint64_t record_idx = indices[indices_idx];
      return absl::OkStatus();
    });
```

--------------------------------

### ArrayRecord DataSource Flow (Python)

Source: https://github.com/google/array_record/blob/main/_autodocs/architecture.md

Describes the Python DataSource for ArrayRecord, handling path parsing, initialization by scanning files for record counts, and providing methods for getting the total number of records and retrieving individual or multiple records efficiently.

```text
ArrayRecordDataSource(paths, reader_options)
    │
    ├─ Parse paths
    │   ├─ String → file path
    │   ├─ Range syntax → (file, [start:end])
    │   └─ FileInstruction → (file, skip, take)
    │
    ├─ Initialize (parallel)
    │   └─ Scan each file for record count
    │
    ├─ __len__()
    │   └─ Return sum of all file record counts
    │
    ├─ __getitem__(index)
    │   ├─ Map global index to (file_idx, local_pos)
    │   ├─ Get or create reader for file
    │   └─ Read single record
    │
    └─ __getitems__(indices)
        ├─ Group indices by file
        ├─ Parallel read from each file (thread pool)
        └─ Reorder results to match input order
```

--------------------------------

### Configure Index Storage Option

Source: https://github.com/google/array_record/blob/main/_autodocs/api-reference/array_record_reader.md

Sets the index storage option to in-memory or offloaded.
Default is kInMemory.

```cpp
Options& set_index_storage_option(IndexStorageOption storage_option)
```

--------------------------------

### Sequential Access with Read-Ahead Configuration

Source: https://github.com/google/array_record/blob/main/docs/core_concepts.md

Demonstrates how to configure ArrayRecordReader for sequential access with specific read-ahead buffer size and maximum parallelism. This is useful for iterating through large files efficiently.

```python
from array_record.python import array_record_module

reader = array_record_module.ArrayRecordReader(
    'output.array_record',
    'readahead_buffer_size:65536,max_parallelism:8'
)

for _ in range(reader.num_records()):
    record = reader.read()
```

--------------------------------

### ArrayRecord Write Flow

Source: https://github.com/google/array_record/blob/main/_autodocs/architecture.md

Details the write process in ArrayRecord, starting from user code calling the Python writer, which interfaces with a C++ extension. It covers buffering, chunk compression, and writing to a Riegeli writer, concluding with closing operations like writing the footer and postscript.

```text
User Code
    │
    ├─ Python: ArrayRecordWriter(path, options)
    │   └─ C++ Extension (array_record_module.cc)
    │       └─ ArrayRecordWriterBase
    │           ├─ Parse options
    │           ├─ Create ChunkEncoder
    │           └─ Allocate thread pool
    │
    ├─ WriteRecord()
    │   ├─ Buffer in encoder
    │   └─ When group_size reached:
    │       ├─ Compress chunk (parallel)
    │       └─ Write to Riegeli writer
    │
    └─ Close()
        ├─ Flush encoder
        ├─ Write footer (chunk offsets)
        └─ Write postscript
            └─ Close underlying writer
```

--------------------------------

### Random Access Optimization with Group Size 1

Source: https://github.com/google/array_record/blob/main/docs/core_concepts.md

Configures an ArrayRecordWriter for random access by setting group_size to 1, ensuring each record is a self-contained group for minimal read data per lookup. Also shows reader/data-source setup for both in-memory and offloaded index storage.
```python
from array_record.python import array_record_module
from array_record.python import array_record_data_source

# Writer with the default compression option, which is zstd:3
writer = array_record_module.ArrayRecordWriter(
    'output.array_record',
    'group_size:1'
)

# Reader/data-source with in-memory index
reader = array_record_module.ArrayRecordReader('output.array_record')
ds = array_record_data_source.ArrayRecordDataSource('output.array_record')

# Reader/data-source with offloaded index
reader = array_record_module.ArrayRecordReader(
    'output.array_record',
    'index_storage_option:offloaded'
)
ds = array_record_data_source.ArrayRecordDataSource(
    'output.array_record',
    reader_options={
        'index_storage_option': 'offloaded',
    }
)
```

--------------------------------

### Configure Snappy Compression for ArrayRecord Writer

Source: https://github.com/google/array_record/blob/main/docs/core_concepts.md

This configuration is for Snappy compression, prioritizing speed over file size reduction. It is suitable for random access scenarios.

```python
from array_record.python import array_record_module

writer = array_record_module.ArrayRecordWriter(
    'output.array_record',
    'group_size:1,snappy'
)
```
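The option strings used in these snippets share one shape: comma-separated `key:value` pairs, plus bare flags such as `snappy` or `uncompressed`. As a minimal sketch, assuming a hypothetical helper named `build_writer_options` (it is not part of ArrayRecord) and assuming the compression keys are the four that appear in the examples (`uncompressed`, `brotli`, `zstd`, `snappy`), such a string could be assembled from a dict and checked for conflicting compression choices before it reaches the writer:

```python
# Hypothetical helper, not part of the array_record package.
# Assumption: these are the compression keys, taken from the examples in this document.
COMPRESSION_KEYS = {"uncompressed", "brotli", "zstd", "snappy"}


def build_writer_options(options):
    """Join a dict into a comma-separated key:value option string.

    A value of None emits a bare flag such as "snappy".
    Booleans are lowercased to match strings like "transpose:false".
    """
    chosen = [key for key in options if key in COMPRESSION_KEYS]
    if len(chosen) > 1:
        # Mirrors the documented rule: pick exactly one compression algorithm.
        raise ValueError(f"Multiple compression algorithms: {chosen}")

    parts = []
    for key, value in options.items():
        if value is None:
            parts.append(key)
        elif isinstance(value, bool):
            parts.append(f"{key}:{str(value).lower()}")
        else:
            parts.append(f"{key}:{value}")
    return ",".join(parts)


snappy_options = build_writer_options({"group_size": 1, "snappy": None})
print(snappy_options)
```

The resulting string could then be passed as the second argument to `ArrayRecordWriter`. The library's own parser remains the authority on which keys and values are valid; this sketch only catches the multiple-compression case early.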