### Setup Virtual Environment and Install Dependencies

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md

Installs Jupyter and ipykernel for the current directory. Ensure you are in the parent folder before running.

```bash
cd ../   # back into vector_blob_demo/
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt   # adds jupyter + ipykernel for this folder
```

--------------------------------

### Setup Python Virtual Environment and Install Dependencies

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md

Creates a Python 3.12 virtual environment, activates it, and installs project dependencies from requirements.txt. Ensure Python 3.12 is installed.

```bash
cd hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo

# Homebrew: brew install python@3.12 if you don't have it yet
python3.12 -m venv .venv
source .venv/bin/activate
python --version   # sanity check: must be 3.12.x, not 3.13+

pip install --upgrade pip
pip install -r requirements.txt
```

--------------------------------

### Start Hudi Docker Demo Environment

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Execute this script to set up the Hudi demo environment using Docker. This is a prerequisite for running tests within the local Docker setup.

```shell
docker/setup_demo.sh
```

--------------------------------

### Start the Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

Run this command to start the Spark, Hudi, Hive Metastore, and MinIO services.
```bash
./run_spark_hudi.sh start
```

--------------------------------

### Run Hudi PySpark Quickstart

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/README.md

Execute the Hudi PySpark quickstart script, specifying the table name and either the Hudi package or JAR.

```bash
cd $HUDI_DIR
python3 hudi-examples/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py [-h] -t TABLE (-p PACKAGE | -j JAR)
```

--------------------------------

### Start Kafka Broker and Zookeeper

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Starts the Zookeeper and Kafka servers locally. Ensure KAFKA_HOME is set to your Kafka installation directory. These commands should be run in separate terminals.

```bash
export KAFKA_HOME=/path/to/kafka_install_dir
cd $KAFKA_HOME
# Run the following commands in separate terminals to keep them running
./bin/zookeeper-server-start.sh ./config/zookeeper.properties
./bin/kafka-server-start.sh ./config/server.properties
```

--------------------------------

### Hudi Fixture Generation Examples

Source: https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/resources/upgrade-downgrade-fixtures/README.md

Various command-line examples for generating specific Hudi table versions and configurations.
```bash
# Generate all available versions (6,8) - version 9 excluded due to local bundle requirement
./generate-fixtures.sh

# Generate specific versions only
./generate-fixtures.sh --version 6,8

# Generate only version 6
./generate-fixtures.sh --version 6

# Generate version 9 (requires locally built Hudi bundle)
./generate-fixtures.sh --version 9 --hudi-bundle-path /path/to/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT.jar

# Generate multiple versions including version 9
./generate-fixtures.sh --version 6,8,9 --hudi-bundle-path /path/to/bundle.jar

# Generate complex-keygen tables instead of mor tables
./generate-fixtures.sh --script-name generate-fixture-complex-keygen.scala

# Generate only version 6 complex-keygen table
./generate-fixtures.sh --version 6 --script-name generate-fixture-complex-keygen.scala
```

--------------------------------

### Start Hudi Metaserver

Source: https://github.com/apache/hudi/blob/master/hudi-platform-service/hudi-metaserver/README.md

Command to start the Hudi Metaserver service.

```shell
sh start_hudi_metaserver.sh
```

--------------------------------

### Start Schema Registry

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Starts the Kafka schema registry. Ensure the listener port is configured correctly.

```bash
./bin/schema-registry-start etc/schema-registry/schema-registry.properties
```

--------------------------------

### Initialize Hudi Docker Demo

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Commands to navigate to the docker directory and start the demo environment.

```bash
cd $HUDI_DIR/docker
./setup_demo.sh
```

--------------------------------

### Start Hudi Environment

Source: https://github.com/apache/hudi/blob/master/hudi-trino-plugin/src/test/resources/hudi-testing-data/hudi_non_part_cow.md

Initiates the Hudi environment using the PTL tool.
```shell
testing/bin/ptl env up --environment singlenode-hudi
```

--------------------------------

### Deploy Minio Server

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-k8s/README.md

Apply the Kubernetes configuration to start the Minio standalone server.

```shell
kubectl apply -f config/k8s/minio-standalone.yaml
```

--------------------------------

### Navigate to dbt project directory

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Change the current working directory to the hudi-examples/hudi-examples-dbt directory to begin the setup process.

```shell
cd hudi-examples/hudi-examples-dbt
```

--------------------------------

### Start GPG Agent

Source: https://github.com/apache/hudi/blob/master/release/release_guide.md

Start the GPG agent to manage GPG keys and unlock them for use. This involves setting environment variables for the agent.

```bash
eval $(gpg-agent --daemon --no-grab --write-env-file $HOME/.gpg-agent-info)
export GPG_TTY=$(tty)
export GPG_AGENT_INFO
```

--------------------------------

### Initialize Spark Session

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/06_hudi_trino_example.ipynb

Load utility functions and start the Spark session.

```python
%run utils.py
```

```python
spark = get_spark_session(app_name = "Hudi Trino Example")
```

--------------------------------

### Install Confluent HDFS Connector

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Installs the Confluent HDFS connector and copies its components to the Kafka plugins directory. Ensure CONFLUENT_DIR is set to your Confluent Platform installation path.
```bash
# Points CONFLUENT_DIR to Confluent Platform installation
export CONFLUENT_DIR=/path/to/confluent_install_dir
mkdir -p /usr/local/share/kafka/plugins
$CONFLUENT_DIR/bin/confluent-hub install confluentinc/kafka-connect-hdfs:10.1.0
cp -r $CONFLUENT_DIR/share/confluent-hub-components/confluentinc-kafka-connect-hdfs/* /usr/local/share/kafka/plugins/
```

--------------------------------

### Start Spark Shell with Hudi Bundle

Source: https://github.com/apache/hudi/blob/master/README.md

This command starts a Spark shell, including the Hudi Spark bundle JAR and necessary configurations for Hudi integration.

```bash
spark-3.5.0-bin-hadoop3/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```

--------------------------------

### Start Spark Thrift Server with Hudi integration

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Start the Spark Thrift server, configuring it to use Hudi extensions and catalog, and connecting to the Derby metastore. This command also sets up the warehouse directory and necessary JDBC configurations.
```shell
export SPARK_VERSION=3.2.3
wget https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz -P /opt/
tar -xf /opt/spark-$SPARK_VERSION-bin-hadoop2.7.tgz -C /opt/
export SPARK_HOME=/opt/spark-$SPARK_VERSION-bin-hadoop2.7

# install dependencies
cp $DERBY_HOME/lib/{derby,derbyclient}.jar $SPARK_HOME/jars/
wget https://repository.apache.org/content/repositories/releases/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.14.0/hudi-spark3.2-bundle_2.12-0.14.0.jar -P $SPARK_HOME/jars/

# start Thrift server connecting to Derby as HMS backend
$SPARK_HOME/sbin/start-thriftserver.sh \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.warehouse.dir=/tmp/hudi/hive/warehouse \
  --hiveconf hive.metastore.warehouse.dir=/tmp/hudi/hive/warehouse \
  --hiveconf hive.metastore.schema.verification=false \
  --hiveconf datanucleus.schema.autoCreateAll=true \
  --hiveconf javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
  --hiveconf 'javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/default;create=true'
```

--------------------------------

### Install dbt-spark with PyHive

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Install the dbt-spark adapter along with PyHive support, which is necessary for connecting to Spark. This command should be run within the activated virtual environment.

```shell
python3 -m pip install "dbt-spark[PyHive]"
```

--------------------------------

### Start Derby Network Server

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Start a local Derby network server, which will be used as the Hive Metastore backend for Spark. Ensure DERBY_HOME is set correctly.
```shell
export DERBY_VERSION=10.14.2.0
wget https://archive.apache.org/dist/db/derby/db-derby-$DERBY_VERSION/db-derby-$DERBY_VERSION-bin.tar.gz -P /opt/
tar -xf /opt/db-derby-$DERBY_VERSION-bin.tar.gz -C /opt/
export DERBY_HOME=/opt/db-derby-$DERBY_VERSION-bin
$DERBY_HOME/bin/startNetworkServer -h 0.0.0.0
```

--------------------------------

### Start Local Docker Registry

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-k8s/README.md

Launch a local Docker registry container within minikube and expose it on port 5001.

```sh
docker run -d -p 5001:5000 --restart=always --name registry registry:2
```

--------------------------------

### Clustering Job Configuration Properties

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Example properties file for configuring clustering strategies and write concurrency.

```properties
hoodie.datasource.write.recordkey.field=volume
hoodie.datasource.write.partitionpath.field=date
hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/hudi-test-topic/versions/latest
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=volume
hoodie.write.concurrency.mode=SINGLE_WRITER
```

--------------------------------

### Launch Spark Shell with Hudi Dependencies

Source: https://github.com/apache/hudi/blob/master/hudi-trino-plugin/src/test/resources/hudi-testing-data/hudi_non_part_mor.md

Use this command to start the Spark shell with the necessary Hudi and AWS bundles and configuration settings.
```bash
./spark-shell \
  --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.2,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.12.538 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```

--------------------------------

### Launch spark-sql shell with Hudi configurations

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Start a spark-sql shell with necessary Hudi packages and configurations for interacting with Hudi tables.

```shell
spark-sql \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.warehouse.dir=/tmp/hudi/hive/warehouse \
  --conf spark.hadoop.hive.metastore.warehouse.dir=/tmp/hudi/hive/warehouse \
  --conf spark.hadoop.hive.metastore.schema.verification=false \
  --conf spark.hadoop.datanucleus.schema.autoCreateAll=true \
  --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
  --conf 'spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/default;create=true' \
  --conf 'spark.hadoop.hive.cli.print.header=true'
```

--------------------------------

### Generate Automated Test Suites

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Example command to execute the test suite generation script from the docker folder.
```shell
./generate_test_suite.sh --execute_test_suite false --include_medium_test_suite_yaml true --include_long_test_suite_yaml true
```

--------------------------------

### Set Up Docker Demo Environment

Source: https://github.com/apache/hudi/blob/master/docker/README.md

After building new Docker images, run this script to bring up the Docker demo environment. Use 'dev' for the development environment.

```shell
./setup_demo.sh dev
```

--------------------------------

### Create Hudi COW Table with Vector Columns

Source: https://github.com/apache/hudi/blob/master/hudi-common/src/test/resources/vector_cross_engine_validation/README.md

Defines a Copy-on-Write (COW) Hudi table with various vector data types. Note that the `LOCATION` is the same as the MOR table example, which might need adjustment based on your setup.

```sql
CREATE TABLE vector_table_cow (
  id BIGINT,
  name STRING,
  embedding1 VECTOR(128) COMMENT 'document float embedding',
  embedding2 VECTOR(128, DOUBLE) COMMENT 'document double embedding',
  embedding3 VECTOR(128, INT8) COMMENT 'document INT8 embedding',
  ts BIGINT
) USING hudi
LOCATION '/tmp/hudi_vector_table_mor'
TBLPROPERTIES (
  primaryKey = 'id',
  preCombineField = 'ts',
  type = 'cow'
);
```

--------------------------------

### Initialize Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/03-scd-type2_and_type4.ipynb

Imports necessary libraries and configures the SparkSession for Hudi and MinIO.

```python
%run utils.py
```

```python
spark = get_spark_session("Spark Hudi SCD Types")
```

--------------------------------

### Query Example for Bitmap Index

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-92/rfc-92.md

Example SQL query targeting columns indexed with bitmap indexing.
```sql
SELECT MIN(age)
FROM hudi_table
WHERE gender = 'female' AND commute_type = 'car';
```

--------------------------------

### Initialize Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/04-schema-evolution.ipynb

Imports utility functions and initializes the SparkSession for Hudi and MinIO integration.

```python
%run utils.py
```

```python
spark = get_spark_session("Hudi Schema Evolution")
```

--------------------------------

### Build Hudi Test Suite Script Help

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Displays the help message for the prepare_integration_suite.sh script, detailing available parameters for building the Hudi test suite.

```shell
./prepare_integration_suite.sh --help
Usage: prepare_integration_suite.sh
   --spark-command, prints the spark command
   -h, hdfs-version
   -s, spark version
   -p, parquet version
   -a, avro version
   -s, hive version
```

--------------------------------

### Build Docker Images

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

Execute this script to build all necessary Docker images for the demo environment.

```bash
./build.sh
```

--------------------------------

### HoodieAvroRecordMerger Implementation

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-46/rfc-46.md

Example implementation of HoodieRecordMerger for backward compatibility using HoodieRecordPayload.

```APIDOC
## Class HoodieAvroRecordMerger

### Description
Backward compatibility implementation for `HoodieRecordMerger` that utilizes user-defined subclasses of `HoodieRecordPayload` to combine records. This provides a bridge for seamless migration to newer Hudi releases but may incur performance overhead due to intermediate Avro conversions.

### Methods
- `getMergingStrategyId()`: Returns the ID of the merging strategy. Expected to return `LATEST_RECORD_MERGING_STRATEGY`.
- `merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props)`: Merges two records. This operation must be associative and can handle the semantics of both `preCombine` and `combineAndGetUpdateValue`.
- `getRecordType()`: Returns the type of record this merger handles. Expected to return `HoodieRecordType.AVRO`.

### Example Usage (Conceptual)
```java
class HoodieAvroRecordMerger implements HoodieRecordMerger {

  @Override
  public String getMergingStrategyId() {
    return LATEST_RECORD_MERGING_STRATEGY;
  }

  // Interface methods are implicitly public, so the overrides must be declared public too.
  @Override
  public Option merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
    // Implementation details for merging records using HoodieRecordPayload
    // This method unifies the semantics of preCombine and combineAndGetUpdateValue.
    return Option.empty(); // Placeholder
  }

  @Override
  public HoodieRecordType getRecordType() {
    return HoodieRecordType.AVRO;
  }
}
```
```

--------------------------------

### SQL Transaction Syntax

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-73/rfc-73.md

Example of the proposed SQL syntax for defining transaction boundaries.

```plain
BEGIN tx1
// anything done here is associated with tx1.
// load table A; (load A's latest snapshot time into HoodieCatalog/driver memory)
// load table B; (load B's latest snapshot time into HoodieCatalog/driver memory)
// load table A again; (it will reuse the snapshot time already in HoodieCatalog/driver memory).
COMMIT / ROLLBACK
```

--------------------------------

### Create Spark DataFrame and Display

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/07_hudi_presto_example.ipynb

Converts the sample data into a Spark DataFrame and displays the first 5 rows. This is a preview before Hudi ingestion.
```python
input_df = spark.createDataFrame(data).toDF(*columns)
display(input_df, 5)
```

--------------------------------

### Schema Evolution DAG Example

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Sample DAG structure for validating schema evolution.

```text
rollback with num_rollbacks = 2
insert with schema_version = ....
upsert with fraction_upsert_per_file = 0.5
```

--------------------------------

### Restart the Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

This command will stop and then start all services, useful for applying configuration changes.

```bash
./run_spark_hudi.sh restart
```

--------------------------------

### List Docker Buildx Builders

Source: https://github.com/apache/hudi/blob/master/docker/README.md

Lists the available Docker buildx builders and their supported platforms. This is a preliminary step to ensure buildx is set up correctly.

```bash
# List builders
~ ❯❯❯ docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS  PLATFORMS
default * docker
  default default         running linux/amd64, linux/arm64, linux/arm/v7, linux/arm/v6
```

--------------------------------

### Configure HikariCP for Hudi Metaserver

Source: https://github.com/apache/hudi/blob/master/hudi-platform-service/hudi-metaserver/README.md

Example properties for configuring the MySQL database connection in hikariPool.properties.

```properties
jdbcUrl=jdbc:mysql://localhost:3306
dataSource.user=root
dataSource.password=password
```

--------------------------------

### Setup and Imports for Hudi Vector Blob Demo

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb

Configures Spark environment, cleans up previous runs, and imports necessary Python libraries and Hudi/Lance JARs. Ensure Hudi and Lance bundles are downloaded to ~/Downloads/ or specified via environment variables.
```python
# === Toggles ===
N_SAMPLES = 250
TOP_K = 5
EMBEDDING_MODEL = "mobilenetv3_small_100"
TABLE_PATH = "/tmp/hudi_main_demo_pets"
TABLE_NAME = "pets_main_demo"

# === Pre-JVM env (must run before any pyspark import) ===
import os

DRIVER_MEMORY = "4g"
os.environ.setdefault(
    "PYSPARK_SUBMIT_ARGS",
    f"--driver-memory {DRIVER_MEMORY} --conf spark.driver.maxResultSize=2g pyspark-shell",
)

# === Cleanup /tmp/ from prior runs ===
import shutil
from pathlib import Path

for pattern in ["/tmp/hudi_*_pets", "/tmp/pets_blob_container.bin", "/tmp/staging_pets_*.parquet"]:
    for p in Path("/").glob(pattern.lstrip("/")):
        if p.is_dir():
            shutil.rmtree(p, ignore_errors=True)
        elif p.is_file():
            p.unlink(missing_ok=True)
shutil.rmtree("spark-warehouse", ignore_errors=True)

# === Imports ===
import io
import sys

import numpy as np
import torch
import timm
from sklearn.preprocessing import normalize
from PIL import Image
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from torchvision.datasets import OxfordIIITPet

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import (
    ArrayType, BinaryType, FloatType, IntegerType,
    StringType, StructField, StructType,
)
from IPython.display import Image as IPyImage, display

# === Resolve jars (defaults to ~/Downloads/) ===
def _default_jar(name):
    return str(Path.home() / "Downloads" / name)

HUDI_JAR = os.getenv("HUDI_BUNDLE_JAR", _default_jar("hudi-spark3.5-bundle_2.12-1.2.0.jar"))
LANCE_JAR = os.getenv("LANCE_BUNDLE_JAR", _default_jar("lance-spark-bundle-3.5_2.12-0.4.0.jar"))
for jar in (HUDI_JAR, LANCE_JAR):
    if not Path(jar).is_file():
        sys.exit(f"ERROR: jar not found at {jar}. See ../README.md §1–2 for download URLs.")
```

--------------------------------

### Define DAG with ValidateDatasetNode

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Example DAG structure utilizing the ValidateDatasetNode for data integrity checks.
```text
Insert
Upsert
ValidateDatasetNode with delete_input_data = true
```

```text
Insert
Upsert
ValidateDatasetNode
```

--------------------------------

### Access Metrics Dashboard

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

URLs for accessing the local Graphite metrics and dashboard after starting the test suite.

```text
http://localhost:80
```

```text
http://localhost/dashboard
```

--------------------------------

### Initialize SCD Type 4 Data

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/03-scd-type2_and_type4.ipynb

Creates the initial DataFrame for the SCD Type 4 example.

```python
scd4_data = [
    ("2025-08-10 08:15:30", "uuid-001", "rider-A", "driver-X", 18.50, "new_york"),
    ("2025-08-10 09:22:10", "uuid-002", "rider-B", "driver-Y", 22.75, "san_francisco")
]
scd4_columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
scd4_initial_df = spark.createDataFrame(scd4_data).toDF(*scd4_columns)
```

--------------------------------

### Inspect Docker Buildx Builder

Source: https://github.com/apache/hudi/blob/master/docker/README.md

Inspects the 'mybuilder' to confirm its status and supported platforms. The '--bootstrap' flag ensures the builder is running.

```bash
~ ❯❯❯ docker buildx inspect --bootstrap
[+] Building 2.5s (1/1) FINISHED
 => [internal] booting buildkit                              2.5s
 => => pulling image moby/buildkit:master                    1.3s
 => => creating container buildx_buildkit_mybuilder0         1.2s
Name:   mybuilder
Driver: docker-container

Nodes:
Name:      mybuilder0
Endpoint:  unix:///var/run/docker.sock
Status:    running
Platforms: linux/amd64, linux/arm64, linux/arm/v7, linux/arm/v6
```

--------------------------------

### Stop the Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

Use this command to gracefully stop all running services in the Docker Compose setup.
```bash
./run_spark_hudi.sh stop
```

--------------------------------

### Variant Binary Encoding Example

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-99/rfc-99.md

Illustrates the binary encoding for Variant data, showing metadata and value components.

```plaintext
Metadata Bytes: [0x01, 0x02, 0x00, 0x07, 0x10, "updated", "new_field"]
Value Bytes: [0x02, 0x02, 0x01, 0x00, 0x01, 0x00, 0x03, 0x04, 0x0C, 0x7B]
```

--------------------------------

### Initialize Spark Session for Hudi

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/05-mastering-sql-procedures.ipynb

Sets up the Spark session for interacting with Hudi. Ensure utils.py is available.

```python
%run utils.py
spark = get_spark_session("Hudi SQL Procedures")
```

--------------------------------

### Filter vector search results

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-102/rfc-102.md

Example of chaining a WHERE clause to filter vector search results by category and price.

```scala
// Vector search with WHERE clause filtering
val result = spark.sql(
  s"""
     |SELECT id, name, price, category, _distance
     |FROM hudi_vector_search(
     |  'products',
     |  'embedding',
     |  ARRAY(1.0, 2.0, 3.0),
     |  10
     |)
     |WHERE category = 'electronics' AND price < 100
     |ORDER BY _distance
     |""".stripMargin
).collect()
```

--------------------------------

### Configure Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-k8s/README.md

Point the local docker-cli to the Docker daemon running inside minikube.

```sh
eval $(minikube -p minikube docker-env)
```

--------------------------------

### Get Help for Hudi SQL Procedures

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/05-mastering-sql-procedures.ipynb

Displays usage information for a specified Hudi SQL procedure, such as 'show_commits'.
```python
spark.sql(f"CALL help(cmd => 'show_commits')").show(truncate=False)
```

--------------------------------

### Build and Push Multi-Arch Docker Image

Source: https://github.com/apache/hudi/blob/master/docker/README.md

Builds a Docker image for a specified platform (e.g., linux/arm64) and pushes it to a registry. Run the command from hoodie/hadoop and pass the image's folder (e.g., base_java11) as the build context.

```bash
# Run under hoodie/hadoop, the <tag> is optional, "latest" by default
docker buildx build <image_folder_name> --platform <comma-separated,platforms> -t <hub-user>/<repo-name>[:<tag>] --push

# For example, to build the Java 11 base image
docker buildx build base_java11 --platform linux/arm64 -t apachehudi/hudi-hadoop_2.8.4-base-java11:linux-arm64-0.10.1 --push
```

--------------------------------

### Configure Spark Environment Variables

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/README.md

Set essential Spark environment variables for PySpark to locate and utilize Spark installations.

```bash
export SPARK_HOME=/path/to/spark/home
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/*.zip:$PYTHONPATH
```

--------------------------------

### Verify File System Changes

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/01-crud-operations.ipynb

Lists the files in the target partition to confirm the creation of a new .log file.

```python
ls(f"{base_path}/{table_name_mor}/city=san_francisco")
```

--------------------------------

### Join vector search results

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-102/rfc-102.md

Example of joining vector search results with another table to retrieve additional metadata.
```scala
// Vector search with JOIN
val result = spark.sql(
  s"""
     |SELECT vs.id, vs.name, c.category_name, vs._distance
     |FROM hudi_vector_search(
     |  'products',
     |  'embedding',
     |  ARRAY(1.5, 2.5),
     |  3
     |) vs
     |JOIN $categoriesTable c ON vs.category_id = c.category_id
     |ORDER BY vs._distance
     |""".stripMargin
).collect()
```
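
To make the shape of the join result above concrete, here is a minimal plain-Python sketch of a brute-force top-k search followed by a lookup join. It is not Hudi's implementation: the distance metric behind `_distance` is not stated in this section, so Euclidean distance is an assumption here, and `top_k_by_distance`, `products`, and `categories` are hypothetical names and data invented for illustration.

```python
import math


def top_k_by_distance(rows, query, k, embedding_key="embedding"):
    """Return the k rows nearest to `query`, each annotated with `_distance`.

    Assumption: Euclidean (L2) distance; the metric used by the SQL function
    in the snippet above may differ.
    """
    scored = [
        {**row, "_distance": math.dist(row[embedding_key], query)}
        for row in rows
    ]
    scored.sort(key=lambda r: r["_distance"])  # nearest first
    return scored[:k]


# Hypothetical sample data standing in for the 'products' table.
products = [
    {"id": 1, "name": "laptop", "category_id": 10, "embedding": [1.0, 2.0]},
    {"id": 2, "name": "phone", "category_id": 10, "embedding": [1.5, 2.5]},
    {"id": 3, "name": "sofa", "category_id": 20, "embedding": [9.0, 9.0]},
    {"id": 4, "name": "tablet", "category_id": 10, "embedding": [3.0, 4.0]},
]
# Hypothetical lookup standing in for the categories table.
categories = {10: "electronics", 20: "furniture"}

# Same query vector and k as the SQL example: ARRAY(1.5, 2.5), 3.
nearest = top_k_by_distance(products, query=[1.5, 2.5], k=3)

# The JOIN step: attach category_name to each search hit.
joined = [
    {
        "id": r["id"],
        "name": r["name"],
        "category_name": categories[r["category_id"]],
        "_distance": r["_distance"],
    }
    for r in nearest
]

for row in joined:
    print(row)
```

In this sketch the search runs first and the join only decorates the k hits, which mirrors how the SQL example aliases the search result as `vs` before joining.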