### Install Thrift Compiler

Source: https://github.com/apache/parquet-java/blob/master/README.md

Installs the Thrift compiler from source. Ensure Thrift is in your PATH.

```bash
wget -nv https://archive.apache.org/dist/thrift/0.22.0/thrift-0.22.0.tar.gz
tar xzf thrift-0.22.0.tar.gz
cd thrift-0.22.0
chmod +x ./configure
./configure --disable-libs
sudo make install -j
```

--------------------------------

### Get Help for Run Script (General)

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Displays general information and usage instructions for the `./parquet-benchmarks/run.sh` script.

```bash
./parquet-benchmarks/run.sh
```

--------------------------------

### Get Parquet CLI Help

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Display help information for the Parquet CLI tool, including available commands and options.

```bash
parquet help
```

--------------------------------

### Get Help for Run Script

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Displays available arguments and options for the `./parquet-benchmarks/run.sh` script, including JMH specific flags.

```bash
./parquet-benchmarks/run.sh all -help
```

--------------------------------

### Install Thrift on macOS with Homebrew

Source: https://github.com/apache/parquet-java/blob/master/README.md

Installs Thrift 0.22.0 using Homebrew and sets the PATH. This is an alternative for macOS users.

```bash
brew install thrift
export PATH="/usr/local/opt/thrift@0.22.0/bin:$PATH"
```

--------------------------------

### Get Help for Specific Parquet CLI Command

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Retrieve detailed help for a specific Parquet CLI command, such as 'meta'.

```bash
parquet help meta
```

--------------------------------

### Predicate Pushdown with FilterApi

Source: https://context7.com/apache/parquet-java/llms.txt

Implement efficient data filtering using `FilterApi` for row group and column-level predicate pushdown. This example constructs a compound filter for age, username, and score, and applies it to an `AvroParquetReader`.

```java
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import java.util.HashSet;
import java.util.Set;

// Build a compound filter: (age >= 18 AND age < 65) AND username != null AND score in {9.5, 10.0}
FilterPredicate agePredicate = FilterApi.and(
    FilterApi.gtEq(FilterApi.intColumn("age"), 18),
    FilterApi.lt(FilterApi.intColumn("age"), 65)
);
FilterPredicate namePredicate = FilterApi.notEq(
    FilterApi.binaryColumn("username"), null
);

// The element type must match the column type (double)
Set<Double> topScores = new HashSet<>();
topScores.add(9.5);
topScores.add(10.0);
FilterPredicate scorePredicate = FilterApi.in(FilterApi.doubleColumn("score"), topScores);

FilterPredicate combined = FilterApi.and(FilterApi.and(agePredicate, namePredicate), scorePredicate);

Configuration conf = new Configuration();
Path inputPath = new Path("/data/users.parquet");

// Typed reader so that read() returns GenericRecord
try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(
        HadoopInputFile.fromPath(inputPath, conf))
    .withDataModel(org.apache.avro.generic.GenericData.get())
    .withFilter(FilterCompat.get(combined))
    .build()) {
    GenericRecord record;
    int count = 0;
    while ((record = reader.read()) != null) {
        count++;
        System.out.println(record.get("username") + ": " + record.get("score"));
    }
    System.out.println("Total matching records: " + count);
}
```

--------------------------------

### Run All Benchmarks Rigorously with Report

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Executes all benchmarks with more warm-up iterations, measurement iterations, and forks for more reliable results. Saves a JSON report for comparison.

```bash
./parquet-benchmarks/run.sh all -wi 5 -i 5 -f 3 -rff /tmp/benchmark1.json
```

--------------------------------

### Parquet CLI Tool: Build and Alias

Source: https://context7.com/apache/parquet-java/llms.txt

Instructions for building the Parquet CLI tool and creating an alias for easy access. Alternatively, run the JAR directly using `java -cp`.

```bash
# Build and alias the CLI tool
./mvnw clean install -DskipTests
alias parquet="hadoop jar parquet-cli-1.17.0-runtime.jar org.apache.parquet.cli.Main --dollar-zero parquet"

# Or run without Hadoop:
java -cp 'target/parquet-cli-1.17.0.jar:target/dependency/*' org.apache.parquet.cli.Main
```

--------------------------------

### Run a Specific Benchmark Suite

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Launches a predefined benchmark 'suite' (e.g., 'checksum') using JMH defaults. This typically takes about 30 minutes.

```bash
./parquet-benchmarks/run.sh checksum
```

--------------------------------

### Build Parquet CLI with Maven

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Build the project using Maven, skipping tests for a faster build.

```bash
./mvnw clean install -DskipTests
```

--------------------------------

### Run All Benchmarks Once

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Executes every available benchmark once, overriding the JMH defaults with no warm-up iterations, one measurement iteration, and one fork. This is useful for a quick check and typically takes around 20 minutes.

```bash
./parquet-benchmarks/run.sh all -wi 0 -i 1 -f 1
```

--------------------------------

### Build Jars with Maven Wrapper

Source: https://github.com/apache/parquet-java/blob/master/README.md

Package the project into JAR files using the Maven wrapper. This command compiles the code and creates the distributable artifacts.

```bash
./mvnw package
```

--------------------------------

### Run Unit Tests with Maven Wrapper

Source: https://github.com/apache/parquet-java/blob/master/README.md

Execute all unit tests for the project using the Maven wrapper script. This is a standard step before building or contributing code.

```bash
./mvnw test
```

--------------------------------

### Build Parquet with Maven

Source: https://github.com/apache/parquet-java/blob/master/README.md

Builds the Parquet Java project using Maven. Requires Thrift to be available in the PATH.

```bash
LC_ALL=C ./mvnw clean install
```

--------------------------------

### Run Parquet CLI using Hadoop Jar

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Execute the Parquet CLI tool using the shaded runtime JAR with the `hadoop` command.

```bash
hadoop jar parquet-cli-1.16.0-runtime.jar org.apache.parquet.cli.Main
```

--------------------------------

### Build Parquet with Vector API Support

Source: https://github.com/apache/parquet-java/blob/master/README.md

Builds Parquet JARs with experimental Java Vector API support enabled. Requires Java 17+ and specific CPU instruction sets.

```bash
./mvnw clean package -P vector-plugins
```

--------------------------------

### Convert Avro to Parquet using Config File

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Convert an Avro file to Parquet format, loading configuration properties from a specified file.

```bash
parquet convert input.avro -o output.parquet --config-file config.properties
```

--------------------------------

### Configure Parquet Hadoop Integration

Source: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md

Set the Parquet page size via the Hadoop configuration and the block size programmatically for a MapReduce job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;

Configuration conf = new Configuration();
// Page size in bytes, set through the Hadoop configuration
conf.set("parquet.page.size", "128");
// Job.getInstance replaces the deprecated Job(Configuration) constructor
Job writeJob = Job.getInstance(conf);
// Block (row group) size in bytes, set programmatically on the job
ParquetOutputFormat.setBlockSize(writeJob, 1024);
```

--------------------------------

### Configure Hadoop Parquet Output and Input Formats

Source: https://context7.com/apache/parquet-java/llms.txt

Configure ParquetOutputFormat for writing and ParquetInputFormat for reading within a Hadoop Job. This includes setting write/read support classes, tuning performance parameters like block and page size, specifying compression codecs, enabling dictionary encoding, and configuring per-column settings. For reading, it also covers schema projection and enabling various filter levels for optimized data retrieval.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

Configuration conf = new Configuration();

// --- Output configuration ---
Job writeJob = Job.getInstance(conf, "write-parquet");
writeJob.setOutputFormatClass(ParquetOutputFormat.class);

// Set write support class that converts records to Parquet events
ParquetOutputFormat.setWriteSupportClass(writeJob, MyWriteSupport.class);

// Tune block and page size
ParquetOutputFormat.setBlockSize(writeJob, 256 * 1024 * 1024); // 256 MB
ParquetOutputFormat.setPageSize(writeJob, 2 * 1024 * 1024); // 2 MB
ParquetOutputFormat.setCompression(writeJob, CompressionCodecName.ZSTD);
ParquetOutputFormat.setEnableDictionary(writeJob, true);

// Job.getInstance copies conf, so later settings must go on the job's own Configuration
Configuration writeConf = writeJob.getConfiguration();

// Per-column: disable dictionary for a high-cardinality binary column
writeConf.set("parquet.enable.dictionary#description", "false");

// Enable bloom filters for faster point lookups
writeConf.set("parquet.bloom.filter.enabled#user_id", "true");
writeConf.set("parquet.bloom.filter.expected.ndv#user_id", "1000000");

// --- Input / read configuration ---
Job readJob = Job.getInstance(conf, "read-parquet");
readJob.setInputFormatClass(ParquetInputFormat.class);
ParquetInputFormat.setReadSupportClass(readJob, MyReadSupport.class);
Configuration readConf = readJob.getConfiguration();

// Projection: only read the columns needed
readConf.set("parquet.read.schema",
    "message Projection { required int64 id; required binary username (STRING); }");

// Enable all filter levels
readConf.setBoolean("parquet.filter.stats.enabled", true);
readConf.setBoolean("parquet.filter.dictionary.enabled", true);
readConf.setBoolean("parquet.filter.columnindex.enabled", true);
readConf.setBoolean("parquet.filter.bloom.enabled", true);
```

--------------------------------

### Build Benchmark Uber-Jar with Maven

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Builds
the parquet-benchmarks module into an uber-jar using Maven Wrapper. This includes Parquet classes and dependencies, ready for JMH execution. Ensure you are in the project root directory.

```bash
./mvnw --projects parquet-benchmarks -amd -DskipTests -Denforcer.skip=true clean package
```

--------------------------------

### Run Specific Benchmark by Regex

Source: https://github.com/apache/parquet-java/blob/master/parquet-benchmarks/README.md

Executes benchmarks that match a given regular expression, targeting specific benchmark classes like NestedNullWritingBenchmarks.

```bash
./parquet-benchmarks/run.sh all org.apache.parquet.benchmarks.NestedNullWritingBenchmarks
```

--------------------------------

### Run Parquet CLI without Hadoop

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Execute the Parquet CLI tool by manually setting the classpath, including the main JAR and its dependencies. Avoid including the runtime JAR to prevent class loading conflicts.

```bash
java -cp 'target/parquet-cli-1.16.0.jar:target/dependency/*' org.apache.parquet.cli.Main
```

--------------------------------

### Configure Parquet Size Statistics

Source: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md

Use these configurations to control size statistics collection. The first setting enables it for all columns; the second disables it for a single column by appending `#` and the column path to the property name.

```java
conf.setBoolean("parquet.size.statistics.enabled", true);
```

```java
conf.setBoolean("parquet.size.statistics.enabled#column.path", false);
```

--------------------------------

### Convert Avro to Parquet with Configuration

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Convert an Avro file to Parquet format, enabling a specific configuration property for writing UUIDs.

```bash
parquet convert input.avro -o output.parquet --conf parquet.avro.write-parquet-uuid=true
```

--------------------------------

### Convert CSV to Parquet with Multiple Configurations

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Convert a CSV file to Parquet format, specifying a schema file and multiple configuration options for UUID writing and list structure.

```bash
parquet convert-csv input.csv -o output.parquet --schema schema.avsc --conf parquet.avro.write-parquet-uuid=true --conf parquet.avro.write-old-list-structure=false
```

--------------------------------

### Parquet CLI Tool: Convert Files

Source: https://context7.com/apache/parquet-java/llms.txt

Commands for converting CSV or Avro files to Parquet format. Supports specifying Avro schema, compression codecs, and other configuration options.

```bash
# Convert CSV to Parquet with a provided Avro schema
parquet convert-csv input.csv -o output.parquet --schema schema.avsc \
  --conf parquet.avro.write-parquet-uuid=true

# Convert Avro file to Parquet with ZSTD compression
parquet convert input.avro -o output.parquet \
  --conf parquet.compression=zstd
```

--------------------------------

### Configure AvroParquetReader and AvroParquetWriter

Source: https://context7.com/apache/parquet-java/llms.txt

Set Hadoop configurations for Avro integration. Use GenericDataSupplier for schema-less deserialization and specify a read schema for projection or evolution. Enable compatible mode for IndexedRecord materialization. Configure write options for list structures and UUID fields.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.HadoopInputFile;

Configuration conf = new Configuration();

// --- Read options ---
// Use GenericData (not SpecificData) for schema-less deserialization
conf.set("parquet.avro.data.supplier", "org.apache.parquet.avro.GenericDataSupplier");

// Set a compatible read schema for projection or evolution
conf.set("parquet.avro.read.schema", "{\"type\":\"record\",\"name\":\"User\",
```

--------------------------------

### Convert Avro to Parquet with List Structure Configuration

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Convert an Avro file to Parquet format, disabling the old list structure configuration.

```bash
parquet convert input.avro -o output.parquet --conf parquet.avro.write-old-list-structure=false
```

--------------------------------

### Parquet CLI Tool: Rewrite and Scan Files

Source: https://context7.com/apache/parquet-java/llms.txt

Commands to rewrite Parquet files, changing compression codecs and pruning columns. Also includes a command to scan all records for validation and row count.

```bash
# Rewrite a file: change codec, prune columns
parquet rewrite input.parquet output.parquet \
  --codec SNAPPY \
  --prune internal_flag,debug_info

# Scan all records (validates readability, reports row count)
parquet scan /data/users.parquet

# Check for corrupt column statistics (PARQUET-251)
parquet check-stats /data/users.parquet
```

--------------------------------

### Parquet CLI Tool: Inspect File Metadata

Source: https://context7.com/apache/parquet-java/llms.txt

Commands to inspect Parquet file metadata, including schema, row groups, compression, encoding, column index, page summaries, and footer.

```bash
# Print file metadata (schema, row groups, compression, encoding)
parquet meta /data/users.parquet

# Print schema in Avro format
parquet schema /data/users.parquet

# Print first 20 records
parquet cat --limit 20 /data/users.parquet

# Print column index and offset index
parquet column-index /data/users.parquet

# Print per-column sizes
parquet column-size /data/users.parquet

# Print page-level summaries
parquet pages /data/users.parquet

# Print the footer as JSON
parquet footer /data/users.parquet

# Check bloom filters for a specific column
parquet bloom-filter --column user_id /data/users.parquet
```

--------------------------------

### Create Parquet CLI Shell Alias

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Create a shell alias for a shorter command-line invocation of the Parquet CLI tool. Ensure the path to the JAR is correct.

```bash
alias parquet="hadoop jar /path/to/parquet-cli-1.16.0-runtime.jar org.apache.parquet.cli.Main --dollar-zero parquet"
```

--------------------------------

### Configure Parquet Hadoop Write Behavior

Source: https://context7.com/apache/parquet-java/llms.txt

Set various Hadoop `Configuration` properties to control Parquet writer behavior, including block size, page size, compression, dictionary encoding, and more. These settings are applied via `Configuration` or a Hadoop `Job` object.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;

Configuration conf = new Configuration();

// Block (row group) size: larger improves read IO, uses more writer memory
conf.setLong("parquet.block.size", 256 * 1024 * 1024L); // 256 MB (default 128 MB)

// Page size: smallest unit fully read to access one record
conf.setInt("parquet.page.size", 2 * 1024 * 1024); // 2 MB (default 1 MB)

// Compression codec: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4_raw
conf.set("parquet.compression", "zstd");

// ZSTD-specific: compression level 1-22
conf.setInt("parquet.compression.codec.zstd.level", 9);

// ZSTD parallel compression workers
conf.setInt("parquet.compression.codec.zstd.workers", 4);

// Dictionary encoding: enabled for all columns by default
conf.setBoolean("parquet.enable.dictionary", true);

// Disable dictionary for a specific high-cardinality column
conf.set("parquet.enable.dictionary#raw_event_json", "false");

// Writer version: PARQUET_1_0 or PARQUET_2_0
conf.set("parquet.writer.version", "PARQUET_2_0");

// Bloom filters for fast row-group-level point lookups
conf.setBoolean("parquet.bloom.filter.enabled", true);
conf.set("parquet.bloom.filter.expected.ndv#user_id", "5000000");
conf.set("parquet.bloom.filter.fpp#user_id", "0.01");

// Page checksums for data integrity
conf.setBoolean("parquet.page.write-checksum.enabled", true);

// Column index truncation length for binary min/max values
conf.setInt("parquet.columnindex.truncate.length", 64);

// Row count limits
conf.setInt("parquet.block.row.count.limit", 10000000);
conf.setInt("parquet.page.row.count.limit", 20000);

// Summary metadata: "all", "common_only", or "none"
conf.set("parquet.summary.metadata.level", "none");

Job job = Job.getInstance(conf);
```

--------------------------------

### Copy Parquet CLI Dependencies

Source: https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md

Copy
project dependencies to a folder using Maven, preparing for running without the `hadoop` command.

```bash
./mvnw dependency:copy-dependencies
```

--------------------------------

### Configure Column Encryption with PropertiesDrivenCryptoFactory

Source: https://context7.com/apache/parquet-java/llms.txt

Set Hadoop configuration properties to enable column-level AES-GCM encryption for sensitive data. Specify KMS client details, master keys for columns, and a separate key for the file footer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

Configuration conf = new Configuration();

// Specify which KMS client handles key wrapping
conf.set("parquet.encryption.kms.client.class", "com.example.MyKmsClient");
conf.set("parquet.encryption.kms.instance.url", "https://kms.example.com");
conf.set("parquet.encryption.key.access.token", "my-access-token");

// Encrypt specific columns with named master keys
// Format: masterKeyID:col1,col2;masterKeyID2:col3
conf.set("parquet.encryption.column.keys", "keyA:ssn,credit_card;keyB:salary");

// Encrypt the file footer with a separate key
conf.set("parquet.encryption.footer.key", "footerKey");

// Use AES-GCM encryption (default), with double wrapping for DEK security
conf.set("parquet.encryption.algorithm", "AES_GCM_V1");
conf.setBoolean("parquet.encryption.double.wrapping", true);

// Configure crypto factory on the output format
conf.set("parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

Job job = Job.getInstance(conf, "encrypted-write");
job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
// ... set WriteSupport and run job
```

--------------------------------

### Configure Column Statistics Collection

Source: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md

Enable or disable column statistics collection. The first setting applies to all columns; a specific column can be included or excluded by appending `#` and its column path to the property name.

```java
conf.setBoolean("parquet.column.statistics.enabled", true);
```

```java
conf.setBoolean("parquet.column.statistics.enabled#column.path", false);
```
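
The per-column form above is just the property name, a `#`, and the column path. The following is a minimal sketch of composing such keys, using only `java.util.Properties` as a stand-in for a Hadoop `Configuration` so it runs without dependencies. The class `ColumnStatisticsKeys`, the helpers `columnKey` and `statsConfig`, and the column path `payload.raw` are hypothetical names introduced for illustration; they are not part of parquet-java.

```java
import java.util.Properties;

public class ColumnStatisticsKeys {
    // Property name taken from the snippet above.
    static final String STATS_ENABLED = "parquet.column.statistics.enabled";

    // Hypothetical helper: builds the per-column form "<property>#<column.path>".
    static String columnKey(String property, String columnPath) {
        return property + "#" + columnPath;
    }

    // Hypothetical helper: enables statistics globally and writes a
    // per-column "false" entry for each listed column path.
    static Properties statsConfig(String... disabledColumns) {
        Properties props = new Properties();
        props.setProperty(STATS_ENABLED, "true");
        for (String column : disabledColumns) {
            props.setProperty(columnKey(STATS_ENABLED, column), "false");
        }
        return props;
    }

    public static void main(String[] args) {
        // "payload.raw" is an assumed column path, not one from the source.
        Properties props = statsConfig("payload.raw");
        // Print each key=value pair; iteration order is not guaranteed.
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real job, each entry would presumably be copied into the Hadoop `Configuration` with `conf.set(key, value)` before the writer is created.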