### Setup Virtual Environment and Install Dependencies

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md

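Creates a virtual environment in `vector_blob_demo/` (the parent of `notebooks/`) and installs the requirements, which add Jupyter and ipykernel. Run this from the `notebooks/` folder, since the first command moves up one level.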

```bash
cd ../    # back into vector_blob_demo/
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt    # adds jupyter + ipykernel for this folder
```

--------------------------------

### Setup Python Virtual Environment and Install Dependencies

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md

Creates a Python 3.12 virtual environment, activates it, and installs project dependencies from requirements.txt. Ensure Python 3.12 is installed.

```bash
cd hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo

# Homebrew: brew install python@3.12 if you don't have it yet
python3.12 -m venv .venv
source .venv/bin/activate

python --version     # sanity check: must be 3.12.x, not 3.13+
pip install --upgrade pip
pip install -r requirements.txt
```
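
The version sanity check above can also be done from inside Python, for example at the top of a notebook. This is a minimal sketch; the helper name `is_supported` is hypothetical and not part of the demo.

```python
import sys


def is_supported(version) -> bool:
    """Return True only for Python 3.12.x, matching the README's requirement."""
    # Compare major and minor only; any 3.12 patch release is accepted.
    return tuple(version[:2]) == (3, 12)


# Report whether the interpreter running this code qualifies.
print("Python 3.12 detected:", is_supported(sys.version_info))
```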

--------------------------------

### Start Hudi Docker Demo Environment

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Execute this script to set up the Hudi demo environment using Docker. This is a prerequisite for running tests within the local Docker setup.

```shell
docker/setup_demo.sh
```

--------------------------------

### Start the Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

Run this command to start the Spark, Hudi, Hive Metastore, and MinIO services.

```bash
./run_spark_hudi.sh start
```

--------------------------------

### Run Hudi PySpark Quickstart

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/README.md

Execute the Hudi PySpark quickstart script, specifying the table name and either the Hudi package or JAR.

```bash
cd $HUDI_DIR
python3 hudi-examples/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py [-h] -t TABLE (-p PACKAGE | -j JAR)
```
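
The usage line above is in `argparse` notation: `-t` is required, and exactly one of `-p` or `-j` must be given. The sketch below shows how a parser with that shape could be declared. It is an illustration, not the script's actual source; the table name `hudi_trips` is a made-up example, and the package coordinate is borrowed from another snippet in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts: -t TABLE (-p PACKAGE | -j JAR)."""
    parser = argparse.ArgumentParser(prog="HoodiePySparkQuickstart.py")
    parser.add_argument("-t", dest="table", required=True, help="Hudi table name")
    # Exactly one of -p / -j must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-p", dest="package", help="Hudi bundle as a Maven coordinate")
    source.add_argument("-j", dest="jar", help="path to a local Hudi bundle JAR")
    return parser


args = build_parser().parse_args(
    ["-t", "hudi_trips", "-p", "org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.2"]
)
print(args.table, args.package, args.jar)
```

Passing both `-p` and `-j`, or neither, makes `argparse` print an error and exit.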

--------------------------------

### Start Kafka Broker and Zookeeper

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Starts the Zookeeper and Kafka servers locally. Ensure KAFKA_HOME is set to your Kafka installation directory. These commands should be run in separate terminals.

```bash
export KAFKA_HOME=/path/to/kafka_install_dir
cd $KAFKA_HOME
# Run the following commands in separate terminals to keep them running
./bin/zookeeper-server-start.sh ./config/zookeeper.properties
./bin/kafka-server-start.sh ./config/server.properties
```

--------------------------------

### Hudi Fixture Generation Examples

Source: https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/resources/upgrade-downgrade-fixtures/README.md

Various command-line examples for generating specific Hudi table versions and configurations.

```bash
# Generate all available versions (6,8) - version 9 excluded due to local bundle requirement
./generate-fixtures.sh

# Generate specific versions only
./generate-fixtures.sh --version 6,8

# Generate only version 6
./generate-fixtures.sh --version 6

# Generate version 9 (requires locally built Hudi bundle)
./generate-fixtures.sh --version 9 --hudi-bundle-path /path/to/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT.jar

# Generate multiple versions including version 9
./generate-fixtures.sh --version 6,8,9 --hudi-bundle-path /path/to/bundle.jar

# Generate complex-keygen tables instead of mor tables
./generate-fixtures.sh --script-name generate-fixture-complex-keygen.scala

# Generate only version 6 complex-keygen table
./generate-fixtures.sh --version 6 --script-name generate-fixture-complex-keygen.scala
```

--------------------------------

### Start Hudi Metaserver

Source: https://github.com/apache/hudi/blob/master/hudi-platform-service/hudi-metaserver/README.md

Command to start the Hudi Metaserver service.

```shell
sh start_hudi_metaserver.sh
```

--------------------------------

### Start Schema Registry

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

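Starts the Kafka schema registry. Ensure the `listeners` port in `schema-registry.properties` matches the registry URL that Hudi is configured with (the clustering properties example in this document uses `http://localhost:8081`).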

```bash
./bin/schema-registry-start etc/schema-registry/schema-registry.properties
```

--------------------------------

### Initialize Hudi Docker Demo

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Commands to navigate to the docker directory and start the demo environment.

```bash
cd $HUDI_DIR/docker
./setup_demo.sh
```

--------------------------------

### Start Hudi Environment

Source: https://github.com/apache/hudi/blob/master/hudi-trino-plugin/src/test/resources/hudi-testing-data/hudi_non_part_cow.md

Brings up the `singlenode-hudi` environment using `ptl`, the Trino product test launcher.

```shell
testing/bin/ptl env up --environment singlenode-hudi
```

--------------------------------

### Deploy Minio Server

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-k8s/README.md

Apply the Kubernetes configuration to start the Minio standalone server.

```shell
kubectl apply -f config/k8s/minio-standalone.yaml
```

--------------------------------

### Navigate to dbt project directory

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Change the current working directory to the hudi-examples/hudi-examples-dbt directory to begin the setup process.

```shell
cd hudi-examples/hudi-examples-dbt
```

--------------------------------

### Start GPG Agent

Source: https://github.com/apache/hudi/blob/master/release/release_guide.md

Start the GPG agent to manage GPG keys and unlock them for use. This involves setting environment variables for the agent.

```bash
eval $(gpg-agent --daemon --no-grab --write-env-file $HOME/.gpg-agent-info)
export GPG_TTY=$(tty)
export GPG_AGENT_INFO
```

--------------------------------

### Initialize Spark Session

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/06_hudi_trino_example.ipynb

Load utility functions and start the Spark session.

```python
%run utils.py
```

```python
spark = get_spark_session(app_name = "Hudi Trino Example")
```

--------------------------------

### Install Confluent HDFS Connector

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Installs the Confluent HDFS connector and copies its components to the Kafka plugins directory. Ensure CONFLUENT_DIR is set to your Confluent Platform installation path.

```bash
# Points CONFLUENT_DIR to Confluent Platform installation
export CONFLUENT_DIR=/path/to/confluent_install_dir
mkdir -p /usr/local/share/kafka/plugins
$CONFLUENT_DIR/bin/confluent-hub install confluentinc/kafka-connect-hdfs:10.1.0
cp -r $CONFLUENT_DIR/share/confluent-hub-components/confluentinc-kafka-connect-hdfs/* /usr/local/share/kafka/plugins/
```

--------------------------------

### Start Spark Shell with Hudi Bundle

Source: https://github.com/apache/hudi/blob/master/README.md

This command starts a Spark shell, including the Hudi Spark bundle JAR and necessary configurations for Hudi integration.

```bash
spark-3.5.0-bin-hadoop3/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```

--------------------------------

### Start Spark Thrift Server with Hudi integration

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

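Download Spark 3.2.3, add the Derby client and Hudi bundle JARs, and start the Spark Thrift server with the Hudi extensions and catalog. The server uses Derby as the Hive Metastore backend, so `DERBY_HOME` must already be exported and the Derby network server must be listening on `localhost:1527`. The command also sets the warehouse directory and the JDBC connection settings.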

```shell
export SPARK_VERSION=3.2.3
wget https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz -P /opt/
tar -xf /opt/spark-$SPARK_VERSION-bin-hadoop2.7.tgz -C /opt/
export SPARK_HOME=/opt/spark-$SPARK_VERSION-bin-hadoop2.7

# install dependencies
cp $DERBY_HOME/lib/{derby,derbyclient}.jar $SPARK_HOME/jars/
wget https://repository.apache.org/content/repositories/releases/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.14.0/hudi-spark3.2-bundle_2.12-0.14.0.jar -P $SPARK_HOME/jars/

# start Thrift server connecting to Derby as HMS backend
$SPARK_HOME/sbin/start-thriftserver.sh \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.warehouse.dir=/tmp/hudi/hive/warehouse \
--hiveconf hive.metastore.warehouse.dir=/tmp/hudi/hive/warehouse \
--hiveconf hive.metastore.schema.verification=false \
--hiveconf datanucleus.schema.autoCreateAll=true \
--hiveconf javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
--hiveconf 'javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/default;create=true'
```

--------------------------------

### Install dbt-spark with PyHive

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Install the dbt-spark adapter along with PyHive support, which is necessary for connecting to Spark. This command should be run within the activated virtual environment.

```shell
python3 -m pip install "dbt-spark[PyHive]"
```

--------------------------------

### Start Derby Network Server

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Start a local Derby network server, which will be used as the Hive Metastore backend for Spark. Ensure DERBY_HOME is set correctly.

```shell
export DERBY_VERSION=10.14.2.0
wget https://archive.apache.org/dist/db/derby/db-derby-$DERBY_VERSION/db-derby-$DERBY_VERSION-bin.tar.gz -P /opt/
tar -xf /opt/db-derby-$DERBY_VERSION-bin.tar.gz -C /opt/
export DERBY_HOME=/opt/db-derby-$DERBY_VERSION-bin
$DERBY_HOME/bin/startNetworkServer -h 0.0.0.0
```

--------------------------------

### Start Local Docker Registry

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-k8s/README.md

Launch a local Docker registry container within minikube and expose it on port 5001.

```sh
docker run -d -p 5001:5000 --restart=always --name registry registry:2
```

--------------------------------

### Clustering Job Configuration Properties

Source: https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md

Example properties file for configuring clustering strategies and write concurrency.

```properties
hoodie.datasource.write.recordkey.field=volume
hoodie.datasource.write.partitionpath.field=date
hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/hudi-test-topic/versions/latest

hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=volume

hoodie.write.concurrency.mode=SINGLE_WRITER
```
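
The two size settings are given in bytes, which are easier to check once converted. The sketch below parses a copy of those two lines and prints them in MiB. The helper names are illustrative.

```python
PROPS = """\
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def to_mib(n_bytes):
    """Convert a byte count (int or numeric string) to mebibytes."""
    return int(n_bytes) / (1024 * 1024)


props = parse_properties(PROPS)
for key, value in props.items():
    print(f"{key} = {to_mib(value):.0f} MiB")
```

The target file size comes out as 1024 MiB (1 GiB) and the small-file limit as 600 MiB.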

--------------------------------

### Launch Spark Shell with Hudi Dependencies

Source: https://github.com/apache/hudi/blob/master/hudi-trino-plugin/src/test/resources/hudi-testing-data/hudi_non_part_mor.md

Use this command to start the Spark shell with the necessary Hudi and AWS bundles and configuration settings.

```bash
./spark-shell \
  --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.2,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.12.538 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```

--------------------------------

### Launch spark-sql shell with Hudi configurations

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-dbt/README.md

Start a spark-sql shell with necessary Hudi packages and configurations for interacting with Hudi tables.

```shell
spark-sql \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.warehouse.dir=/tmp/hudi/hive/warehouse \
--conf spark.hadoop.hive.metastore.warehouse.dir=/tmp/hudi/hive/warehouse \
--conf spark.hadoop.hive.metastore.schema.verification=false \
--conf spark.hadoop.datanucleus.schema.autoCreateAll=true \
--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
--conf 'spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/default;create=true' \
--conf 'spark.hadoop.hive.cli.print.header=true'
```
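
The Hive and JDO settings in this command carry a `spark.hadoop.` prefix, which Spark strips before copying the property into the Hadoop configuration. The sketch below generates those `--conf` arguments from the plain Hive keys. The function name is hypothetical; the keys and values are copied from the command above.

```python
import shlex

# Plain Hive/JDO keys as they would appear without the Spark prefix.
HIVE_SETTINGS = {
    "hive.metastore.warehouse.dir": "/tmp/hudi/hive/warehouse",
    "hive.metastore.schema.verification": "false",
    "datanucleus.schema.autoCreateAll": "true",
    "javax.jdo.option.ConnectionDriverName": "org.apache.derby.jdbc.ClientDriver",
    "javax.jdo.option.ConnectionURL": "jdbc:derby://localhost:1527/default;create=true",
}


def as_spark_conf_args(settings):
    """Turn {key: value} into ['--conf', 'spark.hadoop.key=value', ...]."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# shlex.join quotes values such as the JDBC URL, whose ';' is special to the shell.
print(shlex.join(["spark-sql"] + as_spark_conf_args(HIVE_SETTINGS)))
```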

--------------------------------

### Generate Automated Test Suites

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Example command to execute the test suite generation script from the docker folder.

```shell
./generate_test_suite.sh --execute_test_suite false --include_medium_test_suite_yaml true --include_long_test_suite_yaml true
```

--------------------------------

### Set Up Docker Demo Environment

Source: https://github.com/apache/hudi/blob/master/docker/README.md

After building new Docker images, run this script to bring up the Docker demo environment. Use 'dev' for the development environment.

```shell
./setup_demo.sh dev
```

--------------------------------

### Create Hudi COW Table with Vector Columns

Source: https://github.com/apache/hudi/blob/master/hudi-common/src/test/resources/vector_cross_engine_validation/README.md

Defines a Copy-on-Write (COW) Hudi table with float, double, and INT8 vector columns. Note that `LOCATION` points to `/tmp/hudi_vector_table_mor`, the same path as the MOR table example; change it if both tables need to exist at once so they do not share a base path.

```sql
CREATE TABLE vector_table_cow (
    id BIGINT,
    name STRING,
    embedding1 VECTOR(128) COMMENT 'document float embedding',
    embedding2 VECTOR(128, DOUBLE) COMMENT 'document double embedding',
    embedding3 VECTOR(128, INT8) COMMENT 'document INT8 embedding',
    ts BIGINT
) USING hudi
LOCATION '/tmp/hudi_vector_table_mor'
TBLPROPERTIES (
    primaryKey = 'id',
    preCombineField = 'ts',
    type = 'cow'
);
```
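
To compare the three vector columns, it helps to work out the raw payload per row. The sketch below multiplies the dimension by the element width (4 bytes for float, 8 for double, 1 for INT8). It assumes `VECTOR(128)` defaults to float elements, as the column comment suggests, and it ignores any encoding or metadata overhead, which is not described here.

```python
# Width in bytes of one element of each vector element type.
ELEMENT_BYTES = {"FLOAT": 4, "DOUBLE": 8, "INT8": 1}


def vector_payload_bytes(dimension, element_type="FLOAT"):
    """Raw bytes needed for one vector value, excluding storage overhead."""
    return dimension * ELEMENT_BYTES[element_type]


for column, element_type in [("embedding1", "FLOAT"), ("embedding2", "DOUBLE"), ("embedding3", "INT8")]:
    print(column, vector_payload_bytes(128, element_type), "bytes")
```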

--------------------------------

### Initialize Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/03-scd-type2_and_type4.ipynb

Imports necessary libraries and configures the SparkSession for Hudi and MinIO.

```python
%run utils.py
```

```python
spark = get_spark_session("Spark Hudi SCD Types")
```

--------------------------------

### Query Example for Bitmap Index

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-92/rfc-92.md

Example SQL query targeting columns indexed with bitmap indexing.

```sql
SELECT MIN(age) FROM hudi_table
WHERE gender = 'female' AND commute_type = 'car';
```
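
A toy model of how a bitmap index can answer this query: each distinct column value maps to a bitmap of the row positions that hold it, and the two predicates are combined with a bitwise AND before any row is read. The rows and helper names below are invented for illustration and do not reflect the RFC's storage format.

```python
rows = [
    {"age": 34, "gender": "female", "commute_type": "car"},
    {"age": 27, "gender": "male", "commute_type": "car"},
    {"age": 45, "gender": "female", "commute_type": "bus"},
    {"age": 29, "gender": "female", "commute_type": "car"},
    {"age": 52, "gender": "male", "commute_type": "bike"},
]


def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for pos, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index


def matching_positions(bitmap):
    """Yield the row positions whose bits are set in the bitmap."""
    pos = 0
    while bitmap:
        if bitmap & 1:
            yield pos
        bitmap >>= 1
        pos += 1


gender_idx = build_bitmap_index(rows, "gender")
commute_idx = build_bitmap_index(rows, "commute_type")

# AND the two bitmaps: only rows that satisfy both predicates remain.
hits = gender_idx.get("female", 0) & commute_idx.get("car", 0)
min_age = min(rows[p]["age"] for p in matching_positions(hits))
print(bin(hits), min_age)
```

Only the rows at positions 0 and 3 survive the AND, so `MIN(age)` is computed over two rows instead of five.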

--------------------------------

### Initialize Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/04-schema-evolution.ipynb

Imports utility functions and initializes the SparkSession for Hudi and MinIO integration.

```python
%run utils.py
```

```python
spark = get_spark_session("Hudi Schema Evolution")
```

--------------------------------

### Build Hudi Test Suite Script Help

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Displays the help message for the prepare_integration_suite.sh script, detailing available parameters for building the Hudi test suite.

```shell
./prepare_integration_suite.sh --help
Usage: prepare_integration_suite.sh
   --spark-command, prints the spark command
   -h, hdfs-version
   -s, spark version
   -p, parquet version
   -a, avro version
   -s, hive version
```

--------------------------------

### Build Docker Images

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

Execute this script to build all necessary Docker images for the demo environment.

```bash
./build.sh
```

--------------------------------

### HoodieAvroRecordMerger Implementation

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-46/rfc-46.md

Example implementation of HoodieRecordMerger for backward compatibility using HoodieRecordPayload.

````APIDOC
## Class HoodieAvroRecordMerger

### Description
Backward compatibility implementation for `HoodieRecordMerger` that utilizes user-defined subclasses of `HoodieRecordPayload` to combine records. This provides a bridge for seamless migration to newer Hudi releases but may incur performance overhead due to intermediate Avro conversions.

### Methods

- `getMergingStrategyId()`: Returns the ID of the merging strategy. Expected to return `LATEST_RECORD_MERGING_STRATEGY`.
- `merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props)`: Merges two records. This operation must be associative and can handle the semantics of both `preCombine` and `combineAndGetUpdateValue`.
- `getRecordType()`: Returns the type of record this merger handles. Expected to return `HoodieRecordType.AVRO`.

### Example Usage (Conceptual)
```java
class HoodieAvroRecordMerger implements HoodieRecordMerger {

   @Override
   public String getMergingStrategyId() {
      return LATEST_RECORD_MERGING_STRATEGY;
   }

   // Interface methods are public, so the overrides must be declared public too.
   @Override
   public Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
      // Implementation details for merging records using HoodieRecordPayload
      // This method unifies the semantics of preCombine and combineAndGetUpdateValue.
      return Option.empty(); // Placeholder
   }

   @Override
   public HoodieRecordType getRecordType() {
      return HoodieRecordType.AVRO;
   }
}
```
````

--------------------------------

### SQL Transaction Syntax

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-73/rfc-73.md

Example of the proposed SQL syntax for defining transaction boundaries.

```plain
BEGIN tx1
// anything done here is associated with tx1. 

// load table A; (load A's latest snapshot time into HoodieCatalog/driver memory)

// load table B; (load B's latest snapshot time into HoodieCatalog/driver memory)

// load table A again; (it will reuse the snapshot time already in HoodieCatalog/driver memory).

COMMIT / ROLLBACK
```
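
The key behavior in this outline is that a table's snapshot time is fixed the first time the transaction loads it and is reused on later loads. The sketch below models only that caching rule. The `Transaction` class and the commit timestamps are hypothetical and are not part of the RFC.

```python
class Transaction:
    """Pins each table's snapshot time on first load, then reuses it."""

    def __init__(self, name, latest_snapshot_time):
        self.name = name
        self._latest = latest_snapshot_time  # callable: table name -> snapshot time
        self._pinned = {}                    # table name -> pinned snapshot time

    def load(self, table):
        # Look up the latest snapshot only if this table was not loaded before.
        if table not in self._pinned:
            self._pinned[table] = self._latest(table)
        return self._pinned[table]


commits = {"A": "20250101120000", "B": "20250101120500"}
tx1 = Transaction("tx1", lambda table: commits[table])

first_a = tx1.load("A")           # pins A's snapshot time
tx1.load("B")                     # pins B's snapshot time
commits["A"] = "20250101130000"   # another writer commits to A meanwhile
second_a = tx1.load("A")          # reuses the pinned time, not the new commit

print(first_a, second_a)
```

A second transaction started after the new commit would pin the later snapshot time for table A.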

--------------------------------

### Create Spark DataFrame and Display

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/07_hudi_presto_example.ipynb

Converts the sample data into a Spark DataFrame and displays the first 5 rows. This is a preview before Hudi ingestion.

```python
input_df = spark.createDataFrame(data).toDF(*columns)
display(input_df, 5)
```

--------------------------------

### Schema Evolution DAG Example

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Sample DAG structure for validating schema evolution.

```text
rollback with num_rollbacks = 2
insert with schema_version = <version>
....
upsert with fraction_upsert_per_file = 0.5
```

--------------------------------

### Restart the Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

This command will stop and then start all services, useful for applying configuration changes.

```bash
./run_spark_hudi.sh restart
```

--------------------------------

### List Docker Buildx Builders

Source: https://github.com/apache/hudi/blob/master/docker/README.md

Lists the available Docker buildx builders and their supported platforms. This is a preliminary step to ensure buildx is set up correctly.

```bash
# List builders
$ docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS  PLATFORMS
default * docker
  default default         running linux/amd64, linux/arm64, linux/arm/v7, linux/arm/v6
```

--------------------------------

### Configure HikariCP for Hudi Metaserver

Source: https://github.com/apache/hudi/blob/master/hudi-platform-service/hudi-metaserver/README.md

Example properties for configuring the MySQL database connection in hikariPool.properties.

```properties
jdbcUrl=jdbc:mysql://localhost:3306
dataSource.user=root
dataSource.password=password
```

--------------------------------

### Setup and Imports for Hudi Vector Blob Demo

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb

Configures Spark environment, cleans up previous runs, and imports necessary Python libraries and Hudi/Lance JARs. Ensure Hudi and Lance bundles are downloaded to ~/Downloads/ or specified via environment variables.

```python
# === Toggles ===
N_SAMPLES       = 250
TOP_K           = 5
EMBEDDING_MODEL = "mobilenetv3_small_100"

TABLE_PATH = "/tmp/hudi_main_demo_pets"
TABLE_NAME = "pets_main_demo"

# === Pre-JVM env (must run before any pyspark import) ===
import os
DRIVER_MEMORY = "4g"
os.environ.setdefault(
    "PYSPARK_SUBMIT_ARGS",
    f"--driver-memory {DRIVER_MEMORY} --conf spark.driver.maxResultSize=2g pyspark-shell",
)

# === Cleanup /tmp/ from prior runs ===
import shutil
from pathlib import Path
for pattern in ["/tmp/hudi_*_pets", "/tmp/pets_blob_container.bin", "/tmp/staging_pets_*.parquet"]:
    for p in Path("/").glob(pattern.lstrip("/")):
        if p.is_dir():
            shutil.rmtree(p, ignore_errors=True)
        elif p.is_file():
            p.unlink(missing_ok=True)
shutil.rmtree("spark-warehouse", ignore_errors=True)

# === Imports ===
import io
import sys
import numpy as np
import torch
import timm
from sklearn.preprocessing import normalize
from PIL import Image
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from torchvision.datasets import OxfordIIITPet
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import (
    ArrayType, BinaryType, FloatType, IntegerType,
    StringType, StructField, StructType,
)
from IPython.display import Image as IPyImage, display

# === Resolve jars (defaults to ~/Downloads/) ===
def _default_jar(name): return str(Path.home() / "Downloads" / name)

HUDI_JAR  = os.getenv("HUDI_BUNDLE_JAR",  _default_jar("hudi-spark3.5-bundle_2.12-1.2.0.jar"))
LANCE_JAR = os.getenv("LANCE_BUNDLE_JAR", _default_jar("lance-spark-bundle-3.5_2.12-0.4.0.jar"))
for jar in (HUDI_JAR, LANCE_JAR):
    if not Path(jar).is_file():
        sys.exit(f"ERROR: jar not found at {jar}. See ../README.md §1–2 for download URLs.")

```

--------------------------------

### Define DAG with ValidateDatasetNode

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

Example DAG structure utilizing the ValidateDatasetNode for data integrity checks.

```text
     Insert
     Upsert
     ValidateDatasetNode with delete_input_data = true
```

```text
     Insert
     Upsert
     ValidateDatasetNode 
```

--------------------------------

### Access Metrics Dashboard

Source: https://github.com/apache/hudi/blob/master/hudi-integ-test/README.md

URLs for accessing the local Graphite metrics and dashboard after starting the test suite.

```text
http://localhost:80
```

```text
http://localhost/dashboard
```

--------------------------------

### Initialize SCD Type 4 Data

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/03-scd-type2_and_type4.ipynb

Creates the initial DataFrame for the SCD Type 4 example.

```python
scd4_data = [
    ("2025-08-10 08:15:30", "uuid-001", "rider-A", "driver-X", 18.50, "new_york"),
    ("2025-08-10 09:22:10", "uuid-002", "rider-B", "driver-Y", 22.75, "san_francisco")
]
scd4_columns = ["ts", "uuid", "rider", "driver", "fare", "city"]

scd4_initial_df = spark.createDataFrame(scd4_data).toDF(*scd4_columns)
```

--------------------------------

### Inspect Docker Buildx Builder

Source: https://github.com/apache/hudi/blob/master/docker/README.md

Inspects the current builder (`mybuilder` in this output) to confirm its status and supported platforms. The `--bootstrap` flag boots the builder first if it is not already running.

```bash
$ docker buildx inspect --bootstrap
[+] Building 2.5s (1/1) FINISHED
 => [internal] booting buildkit                                                   2.5s
 => => pulling image moby/buildkit:master                                         1.3s
 => => creating container buildx_buildkit_mybuilder0                              1.2s
Name:   mybuilder
Driver: docker-container

Nodes:
Name:      mybuilder0
Endpoint:  unix:///var/run/docker.sock
Status:    running

Platforms: linux/amd64, linux/arm64, linux/arm/v7, linux/arm/v6
```

--------------------------------

### Stop the Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/README.md

Use this command to gracefully stop all running services in the Docker Compose setup.

```bash
./run_spark_hudi.sh stop
```

--------------------------------

### Variant Binary Encoding Example

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-99/rfc-99.md

Illustrates the binary encoding for Variant data, showing metadata and value components.

```plaintext
Metadata Bytes: [0x01, 0x02, 0x00, 0x07, 0x10, "updated", "new_field"]
Value Bytes:   [0x02, 0x02, 0x01, 0x00, 0x01, 0x00, 0x03, 0x04, 0x0C, 0x7B]
```

--------------------------------

### Initialize Spark Session for Hudi

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/05-mastering-sql-procedures.ipynb

Sets up the Spark session for interacting with Hudi. Ensure utils.py is available.

```python
%run utils.py

spark = get_spark_session("Hudi SQL Procedures")
```

--------------------------------

### Filter vector search results

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-102/rfc-102.md

Example of chaining a WHERE clause to filter vector search results by category and price.

```scala
// Vector search with WHERE clause filtering
val result = spark.sql(
  s"""
     |SELECT id, name, price, category, _distance
     |FROM hudi_vector_search(
     |  'products',
     |  'embedding',
     |  ARRAY(1.0, 2.0, 3.0),
     |  10
     |)
     |WHERE category = 'electronics' AND price < 100
     |ORDER BY _distance
     |""".stripMargin
).collect()
```
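
One consequence of this query shape is worth spelling out. If the `WHERE` clause is evaluated on the rows the function returns (ordinary SQL semantics for a table function in `FROM`), the result can contain fewer than `k` rows. The pure-Python sketch below mimics that with a brute-force search. The Euclidean metric, the sample products, and the `vector_search` helper are all assumptions made for illustration.

```python
import math

products = [
    {"id": 1, "name": "usb cable", "price": 9.0, "category": "electronics", "embedding": [1.0, 2.0, 3.0]},
    {"id": 2, "name": "novel", "price": 15.0, "category": "books", "embedding": [1.0, 2.0, 4.0]},
    {"id": 3, "name": "laptop", "price": 1200.0, "category": "electronics", "embedding": [1.0, 3.5, 3.0]},
    {"id": 4, "name": "headphones", "price": 59.0, "category": "electronics", "embedding": [4.0, 6.0, 3.0]},
]


def vector_search(rows, column, query, k):
    """Brute-force top-k by Euclidean distance, adding a _distance field."""
    scored = [dict(row, _distance=math.dist(row[column], query)) for row in rows]
    return sorted(scored, key=lambda r: r["_distance"])[:k]


# Step 1: the search returns the k nearest rows regardless of category or price.
top = vector_search(products, "embedding", [1.0, 2.0, 3.0], 3)

# Step 2: the filter runs on those k rows only.
filtered = [r for r in top if r["category"] == "electronics" and r["price"] < 100]
print([r["id"] for r in top], [r["id"] for r in filtered])
```

Here the headphones satisfy the filter but are not among the three nearest rows, so only one row survives. Under this assumption, a larger `k` is the simplest way to leave room for the filter.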

--------------------------------

### Configure Docker Environment

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-k8s/README.md

Point the local docker-cli to the Docker daemon running inside minikube.

```sh
eval $(minikube -p minikube docker-env)
```

--------------------------------

### Get Help for Hudi SQL Procedures

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/05-mastering-sql-procedures.ipynb

Displays usage information for a specified Hudi SQL procedure, such as 'show_commits'.

```python
spark.sql("CALL help(cmd => 'show_commits')").show(truncate=False)
```

--------------------------------

### Build and Push Multi-Arch Docker Image

Source: https://github.com/apache/hudi/blob/master/docker/README.md

Builds a Docker image for a specified platform (e.g., linux/arm64) and pushes it to a registry. This command is executed from within the image's directory (e.g., hoodie/hadoop/base_java11).

```bash
# Run under hoodie/hadoop, the <tag> is optional, "latest" by default
docker buildx build <image_folder_name> --platform <comma-separated,platforms> -t <hub-user>/<repo-name>[:<tag>] --push

# For example, to build the Java 11 base image
docker buildx build base_java11 --platform linux/arm64 -t apachehudi/hudi-hadoop_2.8.4-base-java11:linux-arm64-0.10.1 --push
```

--------------------------------

### Configure Spark Environment Variables

Source: https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/README.md

Set essential Spark environment variables for PySpark to locate and utilize Spark installations.

```bash
export SPARK_HOME=/path/to/spark/home
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
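# A glob is not expanded inside a variable assignment, so join the zip files explicitly
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH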
```
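
The same path list can be assembled from Python, for example inside a notebook, where each zip under `python/lib` has to be listed explicitly. This is a sketch with a hypothetical helper name. The demo runs against a throwaway directory that mimics the Spark layout; the zip file names in it are placeholders.

```python
import glob
import os
import tempfile


def pyspark_path_entries(spark_home):
    """Return $SPARK_HOME/python followed by every zip in python/lib."""
    zips = sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))
    return [os.path.join(spark_home, "python")] + zips


# Demo: build a fake SPARK_HOME with two empty zip files and list the entries.
with tempfile.TemporaryDirectory() as fake_home:
    lib_dir = os.path.join(fake_home, "python", "lib")
    os.makedirs(lib_dir)
    for name in ("py4j-src.zip", "pyspark.zip"):
        open(os.path.join(lib_dir, name), "w").close()
    entries = pyspark_path_entries(fake_home)
    demo_names = [os.path.basename(entry) for entry in entries]

print(demo_names)
```

In a real session, the entries could be joined with `os.pathsep` for `PYTHONPATH` or prepended to `sys.path` before importing `pyspark`.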

--------------------------------

### Verify File System Changes

Source: https://github.com/apache/hudi/blob/master/hudi-notebooks/notebooks/01-crud-operations.ipynb

Lists the files in the target partition to confirm the creation of a new .log file.

```python
ls(f"{base_path}/{table_name_mor}/city=san_francisco")
```

--------------------------------

### Join vector search results

Source: https://github.com/apache/hudi/blob/master/rfc/rfc-102/rfc-102.md

Example of joining vector search results with another table to retrieve additional metadata.

```scala
// Vector search with JOIN
val result = spark.sql(
  s"""
     |SELECT vs.id, vs.name, c.category_name, vs._distance
     |FROM hudi_vector_search(
     |  'products',
     |  'embedding',
     |  ARRAY(1.5, 2.5),
     |  3
     |) vs
     |JOIN $categoriesTable c ON vs.category_id = c.category_id
     |ORDER BY vs._distance
     |""".stripMargin
).collect()
```