### Run Hugo Server for Local Website Preview

Source: https://github.com/apache/parquet-site/blob/production/README.md

Command to start the Hugo development server, allowing local preview of the Parquet website. This assumes Hugo is already installed and configured.

```shell
hugo server
```

--------------------------------

### Install NPM Modules and Run Hugo Server Inside Docker Container

Source: https://github.com/apache/parquet-site/blob/production/README.md

Commands to be executed inside the Docker container once it is running. They install the necessary npm modules (autoprefixer, postcss-cli, postcss) and then start the Hugo server, binding it to 0.0.0.0 so it is accessible from the host machine.

```shell
# Install necessary npm modules in the parquet-site directory
cd parquet-site
npm install -D autoprefixer
npm install -D postcss-cli
npm install -D postcss
hugo server --bind 0.0.0.0  # run the server
```

--------------------------------

### Install Mermaid CLI for Diagram Generation

Source: https://github.com/apache/parquet-site/blob/production/README.md

Command to install the Mermaid CLI tool via npm, which is required to build the SVG metadata diagrams from their Mermaid source files. It is installed as a development dependency.

```shell
npm install -D @mermaid-js/mermaid-cli
```

--------------------------------

### Install PostCSS Dependencies for CSS Building

Source: https://github.com/apache/parquet-site/blob/production/README.md

Commands to install the npm packages (autoprefixer, postcss-cli, postcss) required for building and updating CSS resources for the Parquet website. These tools are installed as development dependencies.
```shell
npm install -D autoprefixer
npm install -D postcss-cli
npm install -D postcss
```

--------------------------------

### Build Docker Image for Parquet Website Development

Source: https://github.com/apache/parquet-site/blob/production/README.md

Command to build a Docker image named `parquet-site` using the provided Dockerfile. This image encapsulates all necessary tools, allowing website development without installing Hugo and its dependencies locally.

```shell
docker build -t parquet-site .
```

--------------------------------

### Build Parquet Metadata SVG Diagrams

Source: https://github.com/apache/parquet-site/blob/production/README.md

Commands to generate SVG diagrams for FileMetaData and PageHeader using the Mermaid CLI tool. This requires Mermaid to be installed and the commands to be run from the `static/images` directory.

```shell
cd static/images
npx mmdc -i FileMetaData.mermaid -o FileMetaData.svg
npx mmdc -i PageHeader.mermaid -o PageHeader.svg
```

--------------------------------

### Encoded Bit-packed Values Example (Conceptual)

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

This conceptual example demonstrates how the 3-bit values (0-7) are packed into bytes, showing the resulting bit sequence and corresponding labels after packing.

```Conceptual
bit value: 00000101 00111001 01110111
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
```

--------------------------------

### Illustrating Bit-packed Values (Conceptual)

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

This conceptual example shows decimal values, their 3-bit binary representations, and corresponding bit labels before packing.
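Both bit-packing orders discussed in this document — the deprecated MSB-first BIT_PACKED order and the LSB-first order used by the RLE/bit-packing hybrid — can be sketched in Python. This is an illustrative sketch; the function names are mine, not a Parquet API.

```python
def pack_msb_first(values, width):
    """Deprecated BIT_PACKED order: fill each byte from its most
    significant bit down to its least significant bit."""
    out, acc, nbits = [], 0, 0
    for v in values:
        acc = (acc << width) | v       # append bits below the ones already seen
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    return bytes(out)

def pack_lsb_first(values, width):
    """RLE/bit-packing hybrid order: fill each byte from its least
    significant bit up to its most significant bit."""
    out, acc, nbits = [], 0, 0
    for v in values:
        acc |= v << nbits              # append bits above the ones already seen
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    return bytes(out)

values = [0, 1, 2, 3, 4, 5, 6, 7]      # each fits in 3 bits
print(pack_msb_first(values, 3).hex())  # 053977 -> 00000101 00111001 01110111
print(pack_lsb_first(values, 3).hex())  # 88c6fa -> 10001000 11000110 11111010
```

Running both packers on the eight 3-bit values reproduces the two byte sequences shown in the conceptual examples.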
```Conceptual
dec value: 0   1   2   3   4   5   6   7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX
```

--------------------------------

### Delta Encoding Example 1: Simple Ascending Sequence (1, 2, 3, 4, 5)

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

Illustrates the delta encoding process for a simple ascending sequence (1, 2, 3, 4, 5). It shows the intermediate delta calculations and the final encoded header and block structure, demonstrating how a bitwidth of zero is used when all deltas are identical.

```APIDOC
header: 8 (block size), 1 (miniblock count), 5 (value count), 1 (first value)
block: 1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)
```

--------------------------------

### Parquet RLE/Bit-Packing Bit Order Example

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

Illustrates the specific bit-packing order for the RLE/Bit-Packing Hybrid encoding, showing how values are packed from the least significant bit of each byte to the most significant bit, while preserving the usual order of bits within each value.

```APIDOC
dec value: 0   1   2   3   4   5   6   7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX

would be encoded like this where spaces mark byte boundaries (3 bytes):

bit value: 10001000 11000110 11111010
bit label: HIDEFABC RMNOJKLG VWXSTUPQ
```

--------------------------------

### Delta Encoding Example 2: Sequence with Negative Deltas (7, 5, 3, 1, 2, 3, 4, 5)

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

Demonstrates delta encoding for a sequence including negative deltas (7, 5, 3, 1, 2, 3, 4, 5). It details the calculation of the deltas, the minimum delta, the relative deltas, and the resulting encoded header and block, showing how the values are packed on 2 bits.
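The delta arithmetic used in these examples can be sketched in Python. This is illustrative only, not the parquet-java encoder; the miniblock framing and the actual bit-packing of DELTA_BINARY_PACKED are omitted.

```python
def delta_parts(values):
    """Split a sequence into (first value, minimum delta, relative deltas,
    bitwidth) -- the quantities DELTA_BINARY_PACKED stores per block."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    relative = [d - min_delta for d in deltas]   # all >= 0 by construction
    bitwidth = max(relative).bit_length()        # bits needed per relative delta
    return values[0], min_delta, relative, bitwidth

# Example 1: constant deltas need a bitwidth of 0
print(delta_parts([1, 2, 3, 4, 5]))            # (1, 1, [0, 0, 0, 0], 0)

# Example 2: negative deltas are absorbed by the minimum delta
print(delta_parts([7, 5, 3, 1, 2, 3, 4, 5]))   # (7, -2, [0, 0, 0, 3, 3, 3, 3], 2)
```

Subtracting the minimum delta makes every stored value non-negative, which is what lets a run of identical deltas collapse to bitwidth 0.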
```APIDOC
header: 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)
block: -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
```

--------------------------------

### Example of Deprecated Java Method in Parquet-Java

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/contributing.md

Illustrates the correct way to deprecate an interface, class, or method in Parquet-Java, adhering to semantic versioning. It includes the @Deprecated annotation and a Javadoc comment specifying the removal version and the alternative to use.

```Java
/**
 * @param c the current class
 * @return the corresponding logger
 * @deprecated will be removed in 2.0.0; use org.slf4j.LoggerFactory instead.
 */
@Deprecated
public static Log getLog(Class c) {
  return new Log(c);
}
```

--------------------------------

### Generate C++ Thrift Resources with Make

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/building.md

Instructions to generate the C++ Thrift resources for Parquet using the make command.

```Shell
make
```

--------------------------------

### Run Docker Container for Parquet Website Local Preview

Source: https://github.com/apache/parquet-site/blob/production/README.md

Command to run the `parquet-site` Docker container in interactive mode, mounting the current directory to `/parquet-site` inside the container and exposing local port 1313. This enables local preview of the website via Docker.

```shell
docker run -it -v `pwd`:/parquet-site -p 1313:1313 parquet-site
```

--------------------------------

### Clone and Initialize Parquet Website Repository

Source: https://github.com/apache/parquet-site/blob/production/README.md

Instructions to clone the Apache Parquet website repository from GitHub and initialize its Git submodules, which are necessary for the Docsy theme.
```shell
git clone git@github.com:apache/parquet-site.git
cd parquet-site
git submodule update --init --recursive
```

--------------------------------

### Build Java Resources with Maven

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/building.md

Instructions to build the Java resources for Parquet using the Maven package command. The current stable version should always be available from Maven Central.

```Shell
mvn package
```

--------------------------------

### Run Unit Tests for Parquet-Java

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/contributing.md

Instructions on how to execute the unit tests for the Parquet-Java project using Maven. This command should be run from the root directory of the project to ensure all tests are executed.

```Shell
mvn test
```

--------------------------------

### Conditionally Initialize Algolia DocSearch via Go Template

Source: https://github.com/apache/parquet-site/blob/production/layouts/partials/hooks/body-end.html

This Go Template block checks for the existence of the `algolia_docsearch` parameter within `.Site.Params`. If the parameter is present, it renders the JavaScript code necessary to initialize the Algolia DocSearch library, targeting the HTML element with the ID `#search_box`. The `debug` option is `false` by default and can be enabled during development.

```Go Template
{{ with .Site.Params.algolia_docsearch }}
docsearch({
  container: '#search_box',
  debug: false // Set debug to true if you want to inspect the modal
});
{{ end }}
```

--------------------------------

### Stage Parquet-Java Binaries to Nexus

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This command uploads the compiled binary artifacts for the release tag to the Apache Nexus staging repository.
It performs the actual deployment of the built components, making them available for review before final publication.

```sh
mvn release:perform -DskipTests -Darguments=-DskipTests
```

--------------------------------

### Parquet Compression Algorithm Support Matrix

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/implementationstatus.md

Details the compatibility of the compression algorithms (e.g., GZIP, SNAPPY, ZSTD) used in Parquet files across various Parquet implementations. The table shows full support (✅), no support (❌), or partial read support (R).

```APIDOC
| Compression | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UNCOMPRESSED | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| BROTLI | ✅ | ✅ | ✅ | ✅ | (R) | (R) | ✅ |
| GZIP | ✅ | ✅ | ✅ | ✅ | (R) | (R) | ✅ |
| LZ4 (deprecated) | ✅ | ❌ | ❌ | ✅ | ❌ | (R) | ❌ |
| LZ4_RAW | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
| LZO | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| SNAPPY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ZSTD | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
```

--------------------------------

### Prepare Parquet-Java Release Tag

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This script runs Maven's release:prepare process, ensuring a consistent tag name is used. After successful execution, the release tag will exist in the Git repository, marking the official release point.

```sh
./dev/prepare-release.sh
```

--------------------------------

### Verify Maven Deployment Permissions to Nexus

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This command verifies that you have the necessary permissions to deploy Parquet artifacts to Apache Nexus by pushing a snapshot. It is a crucial preliminary step before initiating a full release.
```sh
mvn deploy
```

--------------------------------

### Parquet Format-Level Feature Support Matrix

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/implementationstatus.md

Outlines the support for advanced format-level features in Parquet, such as bloom filters, statistics (min/max values), the page index, and modular encryption, across different implementations. Support levels include full (✅), partial read (R), partial (asterisk), or no support (❌).

```APIDOC
| Feature | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xxHash-based bloom filters | (R) | ✅ | ✅ | ✅ | (R) | | ✅ |
| Bloom filter length (1) | (R) | ✅ | ✅ | ✅ | (R) | | ✅ |
| Statistics min_value, max_value | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Page index | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | (R) |
| Page CRC32 checksum | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | (R) |
| Modular encryption | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ (*) |
| Size statistics (2) | ✅ | ✅ | (R) | ✅ | ✅ | | (R) |

* (1) In parquet.thrift: ColumnMetaData->bloom_filter_length
* (2) In parquet.thrift: ColumnMetaData->size_statistics
* (*) Partial support
```

--------------------------------

### Build and Upload Parquet-Java Source Tarball

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This script builds the official source tarball from the release tag's SHA1, signs it, and then uploads the necessary files using SVN. The source release is pushed to the Apache distribution development repository for public access and verification.

```sh
dev/source-release.sh
```

--------------------------------

### Rollback Changes After Failed Release Prepare

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

These commands revert any changes made by the `release:prepare` step if it fails.
They remove the temporary backup files and restore the `pom.xml` files to their original state using Git checkout, ensuring a clean slate for a retry.

```sh
find ./ -type f -name '*.releaseBackup' -exec rm {} \;
find ./ -type f -name 'pom.xml' -exec git checkout {} \;
```

--------------------------------

### Parquet Encoding Support Matrix

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/implementationstatus.md

Lists the support status of the Parquet data encodings (e.g., PLAIN, RLE_DICTIONARY, DELTA_BINARY_PACKED) across different Parquet implementations and libraries. The table indicates whether an encoding is fully supported (✅), partially supported for read (R), or not supported (❌).

```APIDOC
| Encoding | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PLAIN | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| PLAIN_DICTIONARY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | (R) |
| RLE_DICTIONARY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| BIT_PACKED (deprecated) | ✅ | ✅ | ✅ | ❌ (1) | (R) | (R) | ❌ |
| DELTA_BINARY_PACKED | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
| DELTA_LENGTH_BYTE_ARRAY | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
| DELTA_BYTE_ARRAY | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
| BYTE_STREAM_SPLIT | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |

* (1) Partial read support, but only in the case of level data with a bitwidth of 0
```

--------------------------------

### Parquet Physical Type Support Matrix

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/implementationstatus.md

Summarizes the support for the Parquet physical data types across implementations such as Arrow (C++), Parquet-Java, Arrow-Go, Arrow-RS, cuDF, Hyparquet, and DuckDB. Includes a note on the deprecated INT96 type.
```APIDOC
Data type            | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb
-------------------- | ----- | ------------ | -------- | -------- | ---- | --------- | ------
BOOLEAN              | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
INT32                | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
INT64                | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
INT96 (1)            | ✅    | ✅           | ✅       | ✅       | ✅   | (R)       | (R)
FLOAT                | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
DOUBLE               | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
BYTE_ARRAY           | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
FIXED_LEN_BYTE_ARRAY | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅

(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files
```

--------------------------------

### Add Maven Dependency for Parquet-Avro

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.14.0.md

This XML snippet shows how to add the `parquet-avro` dependency to your `pom.xml` file, allowing your Java project to use Apache Parquet's Avro integration. Replace `1.14.0` with the desired version.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.14.0</version>
</dependency>
...
```

--------------------------------

### Check Parquet-Java API Compatibility Violations

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/contributing.md

Command to verify API compatibility using the `japicmp` Maven plugin. This helps ensure that API changes adhere to semantic versioning rules, failing the build if an API is changed without the correct deprecation cycle.

```Shell
mvn verify -Dmaven.test.skip=true japicmp:cmp
```

--------------------------------

### Parquet High-Level Data API Feature Usage Support Matrix

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/implementationstatus.md

Shows the support for high-level data APIs and features such as external column data, sorting-column metadata, and row/page pruning using statistics or bloom filters across different Parquet implementations.
Support can be full (✅), write-only (W), partial (asterisk), read-only (R), or none (❌).

```APIDOC
| Feature | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
| --- | --- | --- | --- | --- | --- | --- | --- |
| External column data (1) | ✅ | ✅ | ❌ | ❌ | (W) | ✅ | ❌ |
| Row group "Sorting column" metadata (2) | ✅ | ❌ | ✅ | ✅ | (W) | ❌ | (R) |
| Row group pruning using statistics | ❌ | ✅ | ✅ (*) | ✅ | ✅ | ❌ | ✅ |
| Row group pruning using bloom filter | ❌ | ✅ | ✅ (*) | ✅ | ✅ | ❌ | ✅ |
| Reading select columns only | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Page pruning using statistics | ❌ | ✅ | ✅ (*) | ✅ | ❌ | ❌ | ❌ |

* (1) In parquet.thrift: ColumnChunk->file_path
* (2) In parquet.thrift: RowGroup->sorting_columns
* (*) Partial Support
```

--------------------------------

### Add Apache Parquet-Java Git Remote

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/contributing.md

Command to add a new Git remote named `github-apache` pointing to the official Apache Parquet-Java repository. This is useful for committers who manage maintenance branches and backport commits.

```Shell
git remote add github-apache git@github.com:apache/parquet-java.git
```

--------------------------------

### Cherry-Pick Commit to Parquet-Java Maintenance Branch

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/contributing.md

Steps to backport a merged commit from the master branch to a specific maintenance branch (e.g., `parquet-1.14.x`). This involves fetching all remotes, checking out the target branch, resetting it to the remote's state, cherry-picking the desired commit, and pushing the changes.
```Shell
git fetch --all
git checkout parquet-1.14.x
git reset --hard github-apache/parquet-1.14.x
git cherry-pick <commit-sha>
git push github-apache parquet-1.14.x
```

--------------------------------

### Add Parquet Avro Dependency to Maven pom.xml

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.14.1.md

This snippet demonstrates how to include the `parquet-avro` library as a dependency in a Maven `pom.xml` file. This is essential for Java projects that need to interact with Parquet files using Avro. Users should ensure they use the latest stable version for `version`.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.14.1</version>
</dependency>
...
```

--------------------------------

### Parquet Logical Type Support Matrix

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/implementationstatus.md

Summarizes the support for the Parquet logical data types across different implementations, including notes on specific limitations for types such as ENUM, UUID, INTERVAL, JSON, BSON, LIST, MAP, and FLOAT16.
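As a concrete illustration of what a logical type adds on top of a physical type: a DECIMAL column annotated on the INT32 physical type stores unscaled integers, and the scale lives only in the metadata. A minimal sketch (the names and the DECIMAL(9, 2) choice are mine for illustration, not a Parquet API):

```python
from decimal import Decimal

SCALE = 2  # DECIMAL(9, 2): values are stored as unscaled INT32 integers

def to_unscaled(value: Decimal) -> int:
    """Encode a decimal as the unscaled integer Parquet stores."""
    return int(value.scaleb(SCALE))

def from_unscaled(unscaled: int) -> Decimal:
    """Decode a stored integer back into a decimal using the scale."""
    return Decimal(unscaled).scaleb(-SCALE)

print(to_unscaled(Decimal("123.45")))  # 12345
print(from_unscaled(12345))            # 123.45
```

A reader that ignores the logical annotation still sees valid INT32 data, which is why several cells in the matrix note "only supported to use its annotated physical type".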
```APIDOC
Data type                                 | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb
----------------------------------------- | ----- | ------------ | -------- | -------- | ---- | --------- | ------
STRING                                    | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
ENUM                                      | ❌    | ✅           | ✅       | ✅ (1)   | ❌   | ✅        | ✅
UUID                                      | ❌    | ✅           | ✅       | ✅ (1)   | ❌   | ✅        | ✅
8, 16, 32, 64 bit signed and unsigned INT | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
DECIMAL (INT32)                           | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
DECIMAL (INT64)                           | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
DECIMAL (BYTE_ARRAY)                      | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | (R)
DECIMAL (FIXED_LEN_BYTE_ARRAY)            | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
DATE                                      | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
TIME (INT32)                              | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
TIME (INT64)                              | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
TIMESTAMP (INT64)                         | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
INTERVAL                                  | ✅    | ✅ (1)       | ✅       | ✅       | ❌   | ✅        | ✅
JSON                                      | ✅    | ✅ (1)       | ✅       | ✅ (1)   | ❌   | ✅        | ✅
BSON                                      | ❌    | ✅ (1)       | ✅       | ✅ (1)   | ❌   | ❌        | ❌
LIST                                      | ✅    | ✅           | ✅       | ✅       | ✅   | (R)       | ✅
MAP                                       | ✅    | ✅           | ✅       | ✅       | ✅   | (R)       | ✅
UNKNOWN (always null)                     | ✅    | ✅           | ✅       | ✅       | ✅   | ✅        | ✅
FLOAT16                                   | ✅    | ✅ (1)       | ✅       | ✅       | ✅   | ✅        | ✅

(1) Only supported to use its annotated physical type
```

--------------------------------

### Add Maven Dependency for Parquet-Avro

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.14.4.md

This XML snippet demonstrates how to add the `parquet-avro` dependency to your Maven `pom.xml` file. This allows your Java project to use the Parquet Avro module. Remember to replace `1.14.4` with the specific version you intend to use.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.14.4</version>
</dependency>
...
```

--------------------------------

### Add Parquet-Avro Dependency to Maven pom.xml

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.13.1.md

This snippet shows how to add the `parquet-avro` artifact as a dependency to your Maven `pom.xml` file. Replace `1.13.1` with the desired version. This allows your Java project to use Parquet functionality.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.13.1</version>
</dependency>
```

--------------------------------

### Move Apache Parquet release artifacts in SVN

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This SVN command moves the finalized Apache Parquet release artifacts from the development staging area (`dev`) to the official release repository (`release`). This step ensures that the release binaries are publicly available through the Apache distribution mirrors. Replace `<version>` and the RC number with the actual release values.

```sh
svn mv https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-<version>-rcN/ \
       https://dist.apache.org/repos/dist/release/parquet/apache-parquet-<version> \
       -m "Parquet: Add release <version>"
```

--------------------------------

### Add Maven Dependency for Parquet Avro

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.15.1.md

This XML snippet shows how to add the `parquet-avro` dependency to a Maven `pom.xml` file. It specifies the `groupId`, `artifactId`, and `version` for integrating Parquet Avro into a Java project.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.15.1</version>
</dependency>
...
```

--------------------------------

### Add Apache Parquet Maven Dependency

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.15.0.md

This XML snippet shows how to add the Apache Parquet Avro module as a dependency to a Maven project's pom.xml file. It specifies the group ID, artifact ID, and version for the dependency, allowing Maven to download and include the library in the project.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.15.0</version>
</dependency>
...
```

--------------------------------

### Add Apache Parquet Avro Maven Dependency

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.12.3.md

This XML snippet demonstrates how to include the Apache Parquet Avro library as a dependency in a Maven `pom.xml` file. It enables projects to utilize Parquet's Avro integration features.
Users should update the version number to the latest stable release or a specific desired version.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.12.3</version>
</dependency>
...
```

--------------------------------

### Apache Parquet Release VOTE Email Template

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This is a template for the VOTE email sent to the `dev@parquet.apache.org` mailing list. It includes all necessary details for the release candidate, such as the commit id, tag, tarball location, KEYS file, changelog, and staged binary artifacts, along with the voting options. The angle-bracket placeholders must be replaced with the actual release values.

```plaintext
Subject: [VOTE] Release Apache Parquet <version> RC<N>

Hi everyone,

I propose the following RC to be released as official Apache Parquet <version> release.

The commit id is <commit-id>
* This corresponds to the tag: apache-parquet-<version>-rc<N>
* https://github.com/apache/parquet-java/tree/<tag>

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/

You can find the KEYS file here:
* https://downloads.apache.org/parquet/KEYS

You can find the changelog here:
https://github.com/apache/parquet-java/releases/tag/apache-parquet-<version>-rc<N>

Binary artifacts are staged in Nexus here:
* https://repository.apache.org/content/groups/staging/org/apache/parquet/

This release includes important changes that I should have summarized here, but I'm lazy.

Please download, verify, and test.

Please vote in the next 72 hours.

[ ] +1 Release this as Apache Parquet <version>
[ ] +0
[ ] -1 Do not release this because...
```

--------------------------------

### Parquet File Format High-Level Layout

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/_index.md

This snippet illustrates the high-level byte layout of a Parquet file. It shows the initial and final magic numbers, the arrangement of column chunks across multiple row groups, and the trailing file metadata, which includes its length.
This structure allows for efficient single-pass writing and flexible data access.

```plaintext
4-byte magic number "PAR1"
<Column 1 Chunk 1>
<Column 2 Chunk 1>
...
<Column N Chunk 1>
...
<Column 1 Chunk M>
<Column 2 Chunk M>
...
<Column N Chunk M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
```

--------------------------------

### Add Apache Parquet Avro Maven Dependency

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.13.0.md

This XML snippet demonstrates how to include the Apache Parquet Avro module as a dependency in your Maven `pom.xml` file. It specifies the `groupId`, `artifactId`, and `version` (e.g., 1.13.0 or latest) required for Maven to manage the library in your project.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.13.0</version>
</dependency>
...
```

--------------------------------

### Hugo Website Navigation Bar Template

Source: https://github.com/apache/parquet-site/blob/production/layouts/partials/navbar.html

This Hugo template generates the main navigation bar for the website. It includes logic for displaying the site logo (if enabled), the site title, and iterating through the main menu items. It handles active states for the current page, distinguishes between internal and external links, and conditionally includes partials for version and language selectors as well as a search input field. The template adapts its rendering based on site parameters and shortcode usage.

```Go Template
{{ $cover := and (.HasShortcode "blocks/cover") (not .Site.Params.ui.navbar_translucent_over_cover_disable) -}}
{{ $baseURL := urls.Parse $.Site.Params.Baseurl -}}
[{{- /* */ -}}
{{- if ne .Site.Params.ui.navbar_logo false -}}
  {{ with resources.Get "icons/logo.svg" -}}
    {{ ( . | minify).Content | safeHTML -}}
  {{ end -}}
{{ end -}}
{{- /* */ -}}
{{- .Site.Title -}}
{{- /* */ -}}]({{ .Site.Home.RelPermalink }})
{{ $p := . -}}
{{ range .Site.Menus.main -}}
* {{ $active := or ($p.IsMenuCurrent "main" .) ($p.HasMenuCurrent "main" .) -}}
  {{ $href := "" -}}
  {{ with .Page -}}
    {{ $active = or $active ( $.IsDescendant .)
-}}
    {{ $href = .RelPermalink -}}
  {{ else -}}
    {{ $href = .URL | relLangURL -}}
  {{ end -}}
  {{ $isExternal := ne $baseURL.Host (urls.Parse .URL).Host -}}
  [{{- .Pre -}} {{ .Name }} {{- .Post -}}]({{ $href }})
{{ end -}}
{{ if .Site.Params.versions -}}
* {{ partial "navbar-version-selector.html" . -}}
{{ end -}}
{{ if (gt (len .Site.Home.Translations) 0) -}}
* {{ partial "navbar-lang-selector.html" . -}}
{{ end -}}
{{ partial "search-input.html" . }}
```

--------------------------------

### Tag final release and set development version for Apache Parquet

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/Contribution Guidelines/releasing.md

This shell command finalizes an Apache Parquet release. It adds the final release tag on top of the release candidate (RC) tag and updates the project's POM files with the new development version. After execution, the changes and the new tag should be pushed to GitHub using `git push --follow-tags`.

```sh
./dev/finalize-release
```

--------------------------------

### Thrift Definitions for Parquet Bloom Filter Structures

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/bloomfilter.md

These Thrift IDL definitions specify the data structures for Bloom filters in the Parquet file format. They cover the algorithm, hash function, and compression, the Bloom filter page header, and the Bloom filter offset in the column metadata, providing a blueprint for their serialization.

```APIDOC
/** Block-based algorithm type annotation. **/
struct SplitBlockAlgorithm {}

/** The algorithm used in Bloom filter. **/
union BloomFilterAlgorithm {
  /** Block-based Bloom filter. **/
  1: SplitBlockAlgorithm BLOCK;
}

/** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
 * algorithm. It uses the 64 bits version of xxHash. **/
struct XxHash {}

/**
 * The hash function used in Bloom filter.
 * This function takes the hash of a column value
 * using plain encoding.
 **/
union BloomFilterHash {
  /** xxHash Strategy. **/
  1: XxHash XXHASH;
}

/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}

union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
}

/**
 * Bloom filter header is stored at beginning of Bloom filter data of each column
 * and followed by its bitset.
 **/
struct BloomFilterPageHeader {
  /** The size of bitset in bytes **/
  1: required i32 numBytes;
  /** The algorithm for setting bits. **/
  2: required BloomFilterAlgorithm algorithm;
  /** The hash function used for Bloom filter. **/
  3: required BloomFilterHash hash;
  /** The compression used in the Bloom filter **/
  4: required BloomFilterCompression compression;
}

struct ColumnMetaData {
  ...
  /** Byte offset from beginning of file to Bloom filter data. **/
  14: optional i64 bloom_filter_offset;
}
```

--------------------------------

### Add Maven Dependency for Parquet-Avro

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.12.2.md

This XML snippet demonstrates how to include the `parquet-avro` dependency in a Maven `pom.xml` file. It specifies the necessary `groupId`, `artifactId`, and `version` to integrate the Parquet Avro module into a Java project managed by Maven.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.12.2</version>
</dependency>
...
```

--------------------------------

### Add Apache Parquet Dependency to Maven pom.xml

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.15.2.md

This snippet demonstrates how to add the Apache Parquet Avro module as a dependency to your Maven `pom.xml` file. Ensure you replace `1.15.2` with the specific version you intend to use.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.15.2</version>
</dependency>
...
```

--------------------------------

### C Pseudocode: SBBF filter_insert Operation

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/bloomfilter.md

This pseudocode demonstrates the `filter_insert` operation for a Split Block Bloom Filter (SBBF). It calculates the block index `i` using a multiply-shift trick that avoids a slow modulo operation, then retrieves the block and calls `block_insert` with the least significant 32 bits of the input hash `x`. The `>> 32` operator extracts the most significant 32 bits, and `(unsigned int32)` casts to the least significant 32 bits.

```C
void filter_insert(SBBF filter, unsigned int64 x) {
  unsigned int64 i = ((x >> 32) * filter.numberOfBlocks()) >> 32;
  block b = filter.getBlock(i);
  block_insert(b, (unsigned int32)x)
}
```

--------------------------------

### Parquet Encryption Algorithm Thrift Definitions

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encryption.md

Defines the core Thrift structures AesGcmV1, AesGcmCtrV1, and EncryptionAlgorithm used to specify the encryption method for Parquet files. They include fields for the AAD prefix and a unique file identifier, and a flag to indicate if the AAD prefix must be supplied by the reader.
```APIDOC
struct AesGcmV1 {
  /** AAD prefix **/
  1: optional binary aad_prefix

  /** Unique file identifier part of AAD suffix **/
  2: optional binary aad_file_unique

  /** In files encrypted with AAD prefix without storing it,
   * readers must supply the prefix **/
  3: optional bool supply_aad_prefix
}

struct AesGcmCtrV1 {
  /** AAD prefix **/
  1: optional binary aad_prefix

  /** Unique file identifier part of AAD suffix **/
  2: optional binary aad_file_unique

  /** In files encrypted with AAD prefix without storing it,
   * readers must supply the prefix **/
  3: optional bool supply_aad_prefix
}

union EncryptionAlgorithm {
  1: AesGcmV1 AES_GCM_V1
  2: AesGcmCtrV1 AES_GCM_CTR_V1
}
```

--------------------------------

### C: Define Salt Constants and Mask Function for Bloom Filter Block

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/bloomfilter.md

This snippet defines an array of eight odd unsigned 32-bit integer constants (`salt`) crucial for hashing. It also presents the `mask` function, which takes a 32-bit unsigned integer `x`. For each of the eight words in a `block`, `mask` multiplies `x` by the corresponding `salt` value, then right-shifts the result by 27 bits to determine which bit to set. This function generates a `block` with specific bits set, used as a template for `block_insert` and `block_check`.

```C
unsigned int32 salt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                          0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U}

block mask(unsigned int32 x) {
  block result
  for i in [0..7] {
    unsigned int32 y = x * salt[i]
    result.getWord(i).setBit(y >> 27)
  }
  return result
}
```

--------------------------------

### Add Parquet-Avro Dependency to Maven pom.xml

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.14.3.md

This snippet demonstrates how to add the 'parquet-avro' dependency to your Maven 'pom.xml' file.
It specifies the 'groupId', 'artifactId', and 'version' (e.g., 1.14.3 or latest) required to include Parquet-Avro in a Java project.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.14.3</version>
</dependency>
...
```

--------------------------------

### Delta Strings Encoding (DELTA_BYTE_ARRAY = 7)

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

Also known as incremental encoding or front compression, this method stores each string as the length of the prefix it shares with the previous entry plus its suffix. It uses delta-encoded prefix lengths followed by suffixes encoded as delta length byte arrays.

```APIDOC
Encoding: DELTA_BYTE_ARRAY (ID = 7)
Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY

This is also known as incremental encoding or front compression: for each element
in a sequence of strings, store the prefix length of the previous entry plus the
suffix. For a longer description, see
https://en.wikipedia.org/wiki/Incremental_encoding.

This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED),
followed by the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).

For example, if the data was "axis", "axle", "babble", "babyhood" then the encoded
data would be comprised of the following segments:

- DeltaEncoding(0, 2, 0, 3) (the prefix lengths)
- DeltaEncoding(4, 2, 6, 5) (the suffix lengths)
- "axislebabbleyhood"

Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the
redundancy.
```

--------------------------------

### Parquet Data Page Structure Specification

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/_index.md

Defines the byte-level layout and components of a Parquet data page, including the sequence and conditional presence of repetition levels, definition levels, and encoded values.
```Parquet Specification
ParquetDataPage {
  PageHeader header;                // Precedes the data section

  // Data section, total size defined by header.uncompressed_page_size
  // No padding allowed.
  Optional repetition_levels_data;  // Skipped if column is non-nested (path length 1)
                                    // If present and non-nested, values are always 1.
  Optional definition_levels_data;  // Skipped if column is required
                                    // If present and required, values are always max_definition_level.
  Required encoded_values;          // Always present.
}
```

--------------------------------

### Parquet RLE/Bit-Packing Hybrid Encoding Grammar

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/Data Pages/encodings.md

Defines the grammar for the Run Length Encoding / Bit-Packing Hybrid (RLE = 3) used in Parquet, detailing the structure of encoded data, runs, headers, and values. It specifies how lengths and run types are encoded using varint and bit-packing.

```APIDOC
rle-bit-packed-hybrid: <length> <encoded-data>
// length is not always prepended, please check the table below for more detail
length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)
encoded-data := <run>*
run := <bit-packed-run> | <rle-run>
bit-packed-run := <bit-packed-header> <bit-packed-values>
bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1)
// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
bit-pack-scaled-run-len := (bit-packed-run-len) / 8
bit-packed-run-len := *see 3 below*
bit-packed-values := *see 1 below*
rle-run := <rle-header> <repeated-value>
rle-header := varint-encode( (rle-run-len) << 1)
rle-run-len := *see 3 below*
repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
```

--------------------------------

### C Pseudocode: SBBF filter_check Operation

Source: https://github.com/apache/parquet-site/blob/production/content/en/docs/File Format/bloomfilter.md

This pseudocode illustrates the `filter_check` operation for an SBBF.
Similar to `filter_insert`, it determines the block index `i` using the same efficient bit manipulation technique. It then retrieves the specified block and invokes `block_check` with the least significant 32 bits of the input hash `x`, returning the boolean result.

```C
boolean filter_check(SBBF filter, unsigned int64 x) {
  unsigned int64 i = ((x >> 32) * filter.numberOfBlocks()) >> 32;
  block b = filter.getBlock(i);
  return block_check(b, (unsigned int32)x)
}
```

--------------------------------

### Add Parquet-Avro Dependency to Maven pom.xml

Source: https://github.com/apache/parquet-site/blob/production/content/en/blog/parquet-java/1.14.2.md

This XML snippet demonstrates how to include the 'parquet-avro' dependency within the 'dependencies' section of a Maven project's 'pom.xml' file. It specifies the group ID, artifact ID, and version for the Parquet Avro module. Users should update the version number to the desired or latest available release.

```XML
...
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.14.2</version>
</dependency>
...
```