### Substrait-Spark Justfile Targets Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md Lists the available targets in the `justfile` for managing the Substrait-Spark example applications. These targets simplify building, running Spark, and executing specific examples. ```bash just Available recipes: buildapp # Builds the application into a JAR file consume arg # Consumes the Substrait plan file passed as the argument dataset # Runs a Spark dataset api query and produces a Substrait plan spark # Starts a simple Spark cluster locally in docker sql # Runs a Spark SQL api query and produces a Substrait plan ``` -------------------------------- ### Build and Run Substrait-Spark Examples Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This section provides bash commands for building the application JAR, setting up directories, copying datafiles, and running the Spark cluster using Docker Compose. It also shows how to submit the application to the Spark master. ```bash just buildapp # or ./gradlew build mkdir -p ./_data && chmod g+w ./_data mkdir -p ./_apps cp ./app/build/libs/app.jar ./_apps cp ./app/src/main/resources/*.csv ./_data ``` ```bash just spark ``` ```bash docker exec -it subtrait-spark-spark-1 bash /opt/spark/bin/spark-submit --master spark://subtrait-spark-spark-1:7077 --driver-memory 1G --executor-memory 1G /opt/spark-apps/app.jar ``` -------------------------------- ### Native Compilation with Gradle Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus-cli/README.md Builds the Isthmus-CLI as a native executable using Gradle. Requires GraalVM to be installed and its location set in the GRAALVM_HOME environment variable. ```bash ./gradlew nativeCompile ``` -------------------------------- ### View Textual Representation of Substrait Plan (Example) Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This snippet shows a sample textual representation of a Substrait plan, generated from a Spark query. It details the relations, functions, and data sources involved in the query execution. This format is useful for understanding the plan's structure and can be generated using the `SubstraitStringify` utility. ```text Root :: ImmutableSort [colour, count] +- Sort:: FieldRef#/I64/StructField{offset=1} ASC_NULLS_FIRST +- Project:: [Str, I64, Str, I64] +- Aggregate:: FieldRef#/Str/StructField{offset=0} +- Project:: [Str, Str, Str, Str] +- Join:: INNER equal:any_any : arg0 = FieldRef#/Str/StructField{offset=0} : arg1 = FieldRef#/Str/StructField{offset=2} +- Project:: [Str, Str, Str, Str, Str, Str, Str, Str, Str] +- Filter:: is_not_null:any : arg0 = FieldRef#/Str/StructField{offset=0} +- LocalFiles:: : file:///opt/spark-data/vehicles_subset_2023.csv len=1547 partition=0 start=0 +- Project:: [Str, Str, Str, Str, Str, Str, Str, Str, Str] +- Filter:: and:bool : arg0 = and:bool : arg0 = is_not_null:any : arg0 = FieldRef#/Str/StructField{offset=5} : arg1 = equal:any_any : arg0 = FieldRef#/Str/StructField{offset=5} : arg1 = : arg1 = is_not_null:any : arg0 = FieldRef#/Str/StructField{offset=1} +- LocalFiles:: : file:///opt/spark-data/tests_subset_2023.csv len=1491 partition=0 start=0 ``` -------------------------------- ### Execute Isthmus Tool with SQL Query Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This shell command shows an example of how to use the downloaded Isthmus binary package. It executes a SQL query against a defined table schema, demonstrating the tool's capability to process Substrait plans. ```shell $ ./isthmus-macOS-1.0.0 -c "CREATE TABLE Persons ( firstName VARCHAR, lastName VARCHAR, zip INT )" "SELECT lastName, FROM Persons WHERE zip = 90210" ``` -------------------------------- ### Display Isthmus-CLI Version Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus-cli/README.md Displays the installed version of the Isthmus-CLI. This command is executed from the compiled output directory. ```bash $ ./isthmus-cli/build/native/nativeCompile/isthmus --version ``` -------------------------------- ### Using Built-in Substrait Extension Functions in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Demonstrates accessing and utilizing Substrait extension functions from the default catalog for arithmetic, comparison, and string operations. It covers function URNs and how to build expressions using SubstraitBuilder. This example requires the substrait-java library. ```java import io.substrait.dsl.SubstraitBuilder; import io.substrait.extension.DefaultExtensionCatalog; import io.substrait.extension.SimpleExtension; import io.substrait.expression.Expression; import io.substrait.relation.Rel; import io.substrait.type.TypeCreator; import java.util.Arrays; SubstraitBuilder b = new SubstraitBuilder( DefaultExtensionCatalog.DEFAULT_COLLECTION ); TypeCreator R = TypeCreator.REQUIRED; // Available function URNs in default catalog // DefaultExtensionCatalog.FUNCTIONS_ARITHMETIC // DefaultExtensionCatalog.FUNCTIONS_BOOLEAN // DefaultExtensionCatalog.FUNCTIONS_COMPARISON // DefaultExtensionCatalog.FUNCTIONS_STRING // DefaultExtensionCatalog.FUNCTIONS_DATETIME // DefaultExtensionCatalog.FUNCTIONS_ROUNDING Rel scan = b.namedScan( Arrays.asList("products"), Arrays.asList("product_id", "name", "price", "discount", "category"), Arrays.asList(R.I64, R.STRING, R.FP64, R.FP64, R.STRING) ); // Using arithmetic functions: final_price = price * (1 - discount) Rel projected = b.project( input -> Arrays.asList( b.fieldReference(input, 0), // product_id b.fieldReference(input, 1), // name // Calculate: price * (1.0 - discount) b.scalarFn( DefaultExtensionCatalog.FUNCTIONS_ARITHMETIC, "multiply:fp64_fp64", R.FP64, b.fieldReference(input, 2), // price b.scalarFn( DefaultExtensionCatalog.FUNCTIONS_ARITHMETIC, "subtract:fp64_fp64", R.FP64, b.fp64(1.0), b.fieldReference(input, 3) // discount ) ), // String uppercase: UPPER(category) b.scalarFn( DefaultExtensionCatalog.FUNCTIONS_STRING, "upper:str", R.STRING, b.fieldReference(input, 4) ) ), scan ); // Filter with string matching: name LIKE '%widget%' Rel filtered = b.filter( input -> b.scalarFn( DefaultExtensionCatalog.FUNCTIONS_STRING, "like:str_str", R.BOOLEAN, b.fieldReference(input, 1), // name b.str("%widget%") ), projected ); Plan plan = b.plan(b.root(filtered)); ``` -------------------------------- ### Example Isthmus Output Structure Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This JSON snippet represents a partial output from the Isthmus tool, illustrating the structure of a Substrait plan. It includes details about extension URIs, extensions, and relations, showcasing how the tool serializes query plans. ```json { "extensionUris": [{ "extensionUriAnchor": 1, "uri": "/functions_comparison.yaml" }], "extensions": [{ "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 0, "name": "equal:any_any" } }], "relations": [{ "root": { ... } ``` -------------------------------- ### Maintenance Branch Creation (Shell) Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md Example commands for creating maintenance branches based on existing release tags. This is used for applying fixes to specific versions. ```shell # Current version on production: 2.5.7, version planned to use for maintenance: 2.1.7 $ git checkout -b 2.1.x v2.1.7 # for fix $ git checkout -b 2.x v2.1.7 # for maintenance ``` -------------------------------- ### Pre-release Branch Creation (Shell) Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md Example command for creating a pre-release branch for experimenting with breaking changes. Commits to this branch are considered pre-releases. ```shell # Current version on production: 2.5.7, version planned to use for maintenance: 2.1.7 $ git checkout -b new_feature_lots_changes beta ``` -------------------------------- ### Example Field Access Depth Map Computation Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus/src/main/java/io/substrait/isthmus/OuterReferenceResolver.md This example shows the resulting fieldAccessDepthMap for the given SQL query, illustrating how the depth is assigned to correlated fields based on their nesting level within subqueries. This map is crucial for resolving outer references correctly. ```text fieldAccessDepthMap: l_partkey=$corr0.p_partkey: $corr0.p_partkey -> 1 ps_partkey = $corr0.p_partkey: $corr0.p_partkey -> 2 ps_suppkey = $corr2.l_suppkey: $corr2.l_suppkey -> 1 ``` -------------------------------- ### Display Isthmus-CLI Help Information Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus-cli/README.md Shows the usage instructions and available options for the Isthmus-CLI. This includes options for input SQL, create statements, expressions, output format, and identifier casing. ```bash $ ./isthmus-cli/build/native/nativeCompile/isthmus --help ``` -------------------------------- ### Load and Convert Substrait Plan to Spark Plan in Java Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md Loads a binary protobuf Substrait plan from a file, converts it into a Substrait Plan object, and then transforms it into a Spark LogicalPlan for execution. Dependencies include standard Java IO and the Substrait-Java library. Inputs are a file path to the plan and a SparkSession. ```java byte[] buffer = Files.readAllBytes(Paths.get("spark_sql_substrait.plan")); io.substrait.proto.Plan proto = io.substrait.proto.Plan.parseFrom(buffer); ProtoPlanConverter protoToPlan = new ProtoPlanConverter(); Plan plan = protoToPlan.from(proto); ToLogicalPlan substraitConverter = new ToLogicalPlan(spark); LogicalPlan sparkPlan = substraitConverter.convert(plan); ``` -------------------------------- ### Create and Serialize Substrait Plan (Java) Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This Java code demonstrates how to obtain an optimized Spark LogicalPlan, convert it into a Substrait Plan object using `ToSubstraitRel`, and then serialize this Substrait Plan into a binary protobuf file. Error handling for file writing is included. ```java LogicalPlan optimised = result.queryExecution().optimizedPlan(); System.out.println(optimised); ToSubstraitRel toSubstrait = new ToSubstraitRel(); io.substrait.plan.Plan plan = toSubstrait.convert(optimised); PlanProtoConverter planToProto = new PlanProtoConverter(); byte[] buffer = planToProto.toProto(plan).toByteArray(); try { Files.write(Paths.get(ROOT_DIR, "spark_sql_substrait.plan"), buffer); } catch (IOException e){ e.printStackTrace(); } ``` -------------------------------- ### GPG Key Generation and Management (Shell) Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md Commands for generating, listing, and publishing GPG keys used for signing release artifacts. Ensure keyserver is accessible and the correct key ID is used. ```shell gpg --full-gen-key gpg --list-keys gpg --keyserver keyserver.ubuntu.com --send-keys C8BA52B3 gpg --export-secret-keys C8BA52B3 | base64 ``` -------------------------------- ### SQL Query with Nested Subqueries Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus/src/main/java/io/substrait/isthmus/OuterReferenceResolver.md This SQL query demonstrates a common pattern involving nested subqueries and correlations. It is used as an example to illustrate how the OuterReferenceResolver computes field access depths in the Substrait Java project. ```sql SELECT p_partkey, p_size FROM part p WHERE p_size < (SELECT sum(l_orderkey) FROM lineitem l WHERE l.l_partkey = p.p_partkey AND l_linenumber > (SELECT count(*) cnt FROM partsupp ps WHERE ps.ps_partkey = p.p_partkey AND PS.ps_suppkey = l.l_suppkey)) ``` -------------------------------- ### Download and Verify Substrait-Java POM and JAR Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This section shows how to download the core-1.0.0.pom and core-1.0.0.jar files for Substrait-Java, along with their associated PGP signatures. It also includes commands to verify the integrity and authenticity of these downloaded artifacts using GPG. ```shell wget https://s01.oss.sonatype.org/content/repositories/staging/io/substrait/core/1.0.0/core-1.0.0.pom wget https://s01.oss.sonatype.org/content/repositories/staging/io/substrait/core/1.0.0/core-1.0.0.pom.asc gpg --verify /Users/substrait/core-1.0.0.pom.asc ``` ```shell wget https://s01.oss.sonatype.org/content/repositories/release/io/substrait/core/1.0.0/core-1.0.0.jar wget https://s01.oss.sonatype.org/content/repositories/release/io/substrait/core/1.0.0/core-1.0.0.jar.asc gpg --verify /Users/substrait/core-1.0.0.jar.asc ``` ```shell wget https://s01.oss.sonatype.org/content/repositories/release/io/substrait/core/1.0.0/core-1.0.0.pom wget https://s01.oss.sonatype.org/content/repositories/release/io/substrait/core/1.0.0/core-1.0.0.pom.asc gpg --verify /Users/substrait/core-1.0.0.pom.asc ``` -------------------------------- ### Creating Substrait Window Functions in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Illustrates how to define and use window functions in Substrait for analytical queries using Java. This includes setting up partitioning, ordering, and specifying bounds for functions like ROW_NUMBER. Requires the substrait-java library. ```java import io.substrait.dsl.SubstraitBuilder; import io.substrait.extension.DefaultExtensionCatalog; import io.substrait.expression.Expression; import io.substrait.expression.WindowBound; import io.substrait.relation.Rel; import io.substrait.type.TypeCreator; import java.util.Arrays; SubstraitBuilder b = new SubstraitBuilder( DefaultExtensionCatalog.DEFAULT_COLLECTION ); TypeCreator R = TypeCreator.REQUIRED; // Query: SELECT employee_id, dept_id, salary, // ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rank // FROM employees Rel scan = b.namedScan( Arrays.asList("employees"), Arrays.asList("employee_id", "dept_id", "salary"), Arrays.asList(R.I64, R.I64, R.FP64) ); // Create window function with partitioning and ordering Expression.WindowFunctionInvocation rowNumber = Expression.WindowFunctionInvocation.builder() .declaration( DefaultExtensionCatalog.DEFAULT_COLLECTION.getWindowFunction( SimpleExtension.FunctionAnchor.of( DefaultExtensionCatalog.FUNCTIONS_WINDOW, "row_number:" ) ) ) .outputType(R.I64) .partitionBy(Arrays.asList( b.fieldReference(scan, 1) // PARTITION BY dept_id )) .sorts(Arrays.asList( Expression.SortField.builder() .expr(b.fieldReference(scan, 2)) // ORDER BY salary .direction(Expression.SortDirection.DESC) .build() )) .boundsType(Expression.WindowBoundsType.ROWS) .lowerBound(WindowBound.Unbounded.INSTANCE) .upperBound(WindowBound.CurrentRow.INSTANCE) .build(); // Project with window function Rel projected = b.project( input -> Arrays.asList( b.fieldReference(input, 0), // employee_id b.fieldReference(input, 1), // dept_id b.fieldReference(input, 2), // salary rowNumber // row_number ), scan ); Plan plan = b.plan(b.root(projected)); ``` -------------------------------- ### GPG Key Import (Shell) Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md Command to import the public GPG key from a keyserver, necessary for verifying signed artifacts. ```shell $ gpg --keyserver keyserver.ubuntu.com --recv-keys C8BA52B3 gpg: key F4A1E652C8BA52B3: public key "Substrait Java Artifacts (Java artifact signing key for the Substrait project) " imported gpg: Total processed: 1 gpg: imported: 1 ``` -------------------------------- ### Configure Maven for Substrait-Java Staging Repository Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This XML snippet demonstrates how to configure your Maven `pom.xml` file to include the Sonatype staging repository. This allows Maven to resolve and download Substrait-Java artifacts from the staging environment. ```xml sonatype-staging https://s01.oss.sonatype.org/content/groups/staging ... io.substrait core 1.0.0 io.substrait isthmus 1.0.0 ... ``` -------------------------------- ### Configure Gradle for Substrait-Java Staging Repository Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This Groovy snippet shows how to configure your Gradle build script to use the Sonatype staging repository. It adds the necessary repository URL and dependency declarations for `core` and `isthmus` artifacts. ```groovy repositories { maven { url = uri("https://s01.oss.sonatype.org/content/groups/staging") } } dependencies { ... implementation 'io.substrait:core:1.0.0' implementation 'io.substrait:isthmus:1.0.0' ... } ``` -------------------------------- ### Serialize Substrait Plan to Protobuf in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Demonstrates how to serialize a Substrait Plan object into protobuf format and deserialize it back. This is essential for storing plans or transmitting them over a network. It involves using `PlanProtoConverter` and `ProtoPlanConverter`. ```java import io.substrait.plan.Plan; import io.substrait.plan.PlanProtoConverter; import io.substrait.plan.ProtoPlanConverter; import io.substrait.extension.DefaultExtensionCatalog; import java.nio.file.Files; import java.nio.file.Paths; // Plan to protobuf bytes Plan plan = /* constructed plan */; PlanProtoConverter toProto = new PlanProtoConverter(); io.substrait.proto.Plan protoPlan = toProto.toProto(plan); byte[] bytes = protoPlan.toByteArray(); // Save to file Files.write(Paths.get("query.plan"), bytes); // Load from file byte[] loadedBytes = Files.readAllBytes(Paths.get("query.plan")); io.substrait.proto.Plan loadedProtoPlan = io.substrait.proto.Plan.parseFrom(loadedBytes); // Protobuf to Plan object ProtoPlanConverter fromProto = new ProtoPlanConverter( DefaultExtensionCatalog.DEFAULT_COLLECTION ); Plan reconstructedPlan = fromProto.from(loadedProtoPlan); ``` -------------------------------- ### Execute Substrait Query Plan with Python Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This Python code snippet demonstrates how to read a Substrait plan from a file, decode it, and execute it using a provided table provider. It assumes the plan is base64 encoded. ```python with open(PLAN_FILE, "wb") as file: planbytes = file.read() reader = substrait.run_query( base64.b64decode(planbytes), table_provider=self.simple_provider, ) result = reader.read_all() ``` -------------------------------- ### Convert Substrait Plan to Spark Logical Plan in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Loads and executes Substrait plans in Apache Spark using `ToLogicalPlan`. This process involves creating a SparkSession, registering necessary tables, loading a Substrait plan from a protobuf file, converting it to a Substrait `Plan` object, and then transforming it into a Spark `LogicalPlan`. The resulting Spark plan can then be executed within Spark. ```java import io.substrait.spark.logical.ToLogicalPlan; import io.substrait.plan.Plan; import io.substrait.plan.ProtoPlanConverter; import io.substrait.extension.DefaultExtensionCatalog; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan; import java.nio.file.Files; import java.nio.file.Paths; // Create Spark session SparkSession spark = SparkSession.builder() .appName("SubstraitToSpark") .master("local[*]") .getOrCreate(); // Register tables that the plan references spark.read() .option("header", "true") .csv("employees.csv") .createOrReplaceTempView("employees"); spark.read() .option("header", "true") .csv("departments.csv") .createOrReplaceTempView("departments"); // Load Substrait plan from file byte[] bytes = Files.readAllBytes(Paths.get("spark_query.plan")); io.substrait.proto.Plan protoPlan = io.substrait.proto.Plan.parseFrom(bytes); // Convert protobuf to Plan object ProtoPlanConverter protoConverter = new ProtoPlanConverter( DefaultExtensionCatalog.DEFAULT_COLLECTION ); Plan plan = protoConverter.from(protoPlan); // Convert Substrait plan to Spark logical plan ToLogicalPlan converter = new ToLogicalPlan(spark); LogicalPlan sparkPlan = converter.convert(plan); // Execute query in Spark Dataset result = Dataset.ofRows(spark, sparkPlan); result.show(); // Save results result.write() .option("header", "true") .csv("query_results"); ``` -------------------------------- ### Build Aggregation Queries with GROUP BY in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Illustrates constructing aggregation queries using SubstraitBuilder in Java. It defines grouping keys and multiple aggregate functions such as COUNT, SUM, and AVG, along with ordering. This requires the io.substrait library for plan construction. ```java import io.substrait.dsl.SubstraitBuilder; import io.substrait.extension.DefaultExtensionCatalog; import io.substrait.plan.Plan; import io.substrait.relation.Rel; import io.substrait.type.TypeCreator; import java.util.Arrays; SubstraitBuilder b = new SubstraitBuilder( DefaultExtensionCatalog.DEFAULT_COLLECTION ); TypeCreator R = TypeCreator.REQUIRED; // Query: SELECT dept, COUNT(*), SUM(salary), AVG(age) // FROM employees GROUP BY dept ORDER BY dept Rel scan = b.namedScan( Arrays.asList("employees"), Arrays.asList("id", "name", "dept", "salary", "age"), Arrays.asList(R.I64, R.STRING, R.STRING, R.FP64, R.I32) ); Rel agg = b.aggregate( input -> b.grouping(input, 2), // GROUP BY dept (index 2) input -> Arrays.asList( b.countStar(), // COUNT(*) b.sum(input, 3), // SUM(salary) b.avg(input, 4) // AVG(age) ), scan ); Rel sorted = b.sort( input -> b.sortFields(input, 0), // ORDER BY dept agg ); Plan plan = b.plan(b.root(sorted)); ``` -------------------------------- ### GPG Key Verification (Shell) Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md Steps to download and verify the GPG signature of release artifacts using the public key. This ensures the integrity and authenticity of the downloaded JAR and POM files. ```shell wget https://s01.oss.sonatype.org/content/repositories/staging/io/substrait/core/1.0.0/core-1.0.0.jar wget https://s01.oss.sonatype.org/content/repositories/staging/io/substrait/core/1.0.0/core-1.0.0.jar.asc gpg --verify /Users/substrait/core-1.0.0.jar.asc ``` -------------------------------- ### Build Substrait Plans with Fluent Java DSL Source: https://context7.com/substrait-io/substrait-java/llms.txt Demonstrates using SubstraitBuilder to construct a query plan programmatically with a fluent DSL. It chains methods for scanning, filtering, projecting, and sorting data, utilizing lambda expressions for defining operations. Requires io.substrait libraries. ```java import io.substrait.dsl.SubstraitBuilder; import io.substrait.extension.DefaultExtensionCatalog; import io.substrait.extension.SimpleExtension; import io.substrait.plan.Plan; import io.substrait.relation.Rel; import io.substrait.type.TypeCreator; import java.util.Arrays; // Initialize builder with extension catalog SimpleExtension.ExtensionCollection extensions = DefaultExtensionCatalog.DEFAULT_COLLECTION; SubstraitBuilder b = new SubstraitBuilder(extensions); TypeCreator R = TypeCreator.REQUIRED; TypeCreator N = TypeCreator.NULLABLE; // Build query: SELECT name, age FROM users WHERE age > 18 ORDER BY age DESC Rel scan = b.namedScan( Arrays.asList("users"), Arrays.asList("id", "name", "age", "email"), Arrays.asList(R.I64, R.STRING, R.I32, R.STRING) ); Rel filtered = b.filter( input -> b.scalarFn( DefaultExtensionCatalog.FUNCTIONS_COMPARISON, "gt:i32_i32", R.BOOLEAN, b.fieldReference(input, 2), b.i32(18) ), scan ); Rel projected = b.project( input -> Arrays.asList( b.fieldReference(input, 1), // name b.fieldReference(input, 2) // age ), filtered ); Rel sorted = b.sort( input -> b.sortField(input, 1, Rel.SortDirection.DESC), projected ); Plan plan = b.plan(b.root(sorted)); ``` -------------------------------- ### Configure Gradle for Substrait-Java Maven Central Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This Groovy snippet demonstrates how to add Substrait-Java dependencies (`core` and `isthmus` version 1.0.0) to your Gradle build file for use with Maven Central. ```groovy dependencies { ... implementation 'io.substrait:core:1.0.0' implementation 'io.substrait:isthmus:1.0.0' ... } ``` -------------------------------- ### Branch Configuration for Semantic Release (JSON) Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md Configuration defining the branch patterns used by semantic release for determining release types (e.g., pre-release, maintenance, stable). ```json { "branches": [ { "name": "+([0-9])?(.{+([0-9]),x}).x" }, { "name": "main" }, { "name": "next" }, { "name": "next-major" }, { "name": "beta", "prerelease": true }, { "name": "alpha", "prerelease": true } ] } ``` -------------------------------- ### Convert SQL to Substrait Plan in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Converts a SQL query into a Substrait Plan using Apache Calcite for parsing and optimization. This involves defining a database schema using SQL `CREATE TABLE` statements, processing them into a catalog, and then using the `SqlToSubstrait` utility to perform the conversion. The resulting plan can then be serialized. ```java import io.substrait.isthmus.SqlToSubstrait; import io.substrait.isthmus.sql.SubstraitCreateStatementParser; import io.substrait.plan.Plan; import io.substrait.plan.PlanProtoConverter; import org.apache.calcite.prepare.Prepare; // Define database schema using CREATE TABLE statements String schema = """ CREATE TABLE lineitem ( l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_quantity DECIMAL(15,2), l_extendedprice DECIMAL(15,2), l_discount DECIMAL(15,2), l_shipdate DATE ); CREATE TABLE orders ( o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus VARCHAR(1), o_totalprice DECIMAL(15,2), o_orderdate DATE ); """; // Create catalog from DDL Prepare.CatalogReader catalog = SubstraitCreateStatementParser.processCreateStatementsToCatalog(schema); // Convert SQL query to Substrait plan SqlToSubstrait sqlToSubstrait = new SqlToSubstrait(); String sql = """ SELECT l.l_orderkey, SUM(l.l_quantity) as total_qty, SUM(l.l_extendedprice) as total_price FROM lineitem l INNER JOIN orders o ON l.l_orderkey = o.o_orderkey WHERE o.o_orderdate >= DATE '2023-01-01' AND l.l_shipdate >= DATE '2023-01-01' GROUP BY l.l_orderkey HAVING SUM(l.l_quantity) > 100 ORDER BY total_price DESC LIMIT 10 """; Plan plan = sqlToSubstrait.convert(sql, catalog); // Serialize to protobuf PlanProtoConverter planToProto = new PlanProtoConverter(); byte[] bytes = planToProto.toProto(plan).toByteArray(); ``` -------------------------------- ### Spark and Substrait Inner Join Plan Comparison Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md Illustrates the structural similarity between Spark's LogicalPlan representation of an inner join and its equivalent in the Substrait format. This comparison highlights how the Substrait-Java library translates Spark operations. The Substrait format here is a custom string representation for readability. ```text +- Join Inner, (vehicle_id#2 = vehicle_id#10) :- Project [vehicle_id#2, colour#5] : +- Filter isnotnull(vehicle_id#2) +- Project [vehicle_id#10] +- Filter ((isnotnull(test_result#14) AND (test_result#14 = P)) AND isnotnull(vehicle_id#10)) ``` ```substrait +- Join:: INNER equal:any_any : arg0 = FieldRef#/Str/StructField{offset=0} : arg1 = FieldRef#/Str/StructField{offset=2} +- Project:: [Str, Str, Str, Str, Str, Str, Str, Str, Str] +- Filter:: is_not_null:any : arg0 = FieldRef#/Str/StructField{offset=0} +- Project:: [Str, Str, Str, Str, Str, Str, Str, Str, Str] +- Filter:: and:bool : arg0 = and:bool : arg0 = is_not_null:any : arg0 = FieldRef#/Str/StructField{offset=5} : arg1 = equal:any_any : arg0 = FieldRef#/Str/StructField{offset=5} : arg1 = : arg1 = is_not_null:any : arg0 = FieldRef#/Str/StructField{offset=1} ``` -------------------------------- ### Display Spark Logical and Optimized Query Plans Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This Java code retrieves and prints the logical and optimized query plans generated by Spark for a given DataFrame. It relies on Spark's queryExecution API. The input is a Spark DataFrame, and the output is the printed query plans to the console. ```java LogicalPlan logical = result.logicalPlan(); System.out.println(logical); LogicalPlan optimised = result.queryExecution().optimizedPlan(); System.out.println(optimised); ``` -------------------------------- ### Convert Spark Logical Plan to Substrait in Java Source: https://context7.com/substrait-io/substrait-java/llms.txt Converts Apache Spark logical plans to the Substrait format using `ToSubstraitRel`. This process involves creating a SparkSession, loading data, executing a Spark SQL query, obtaining the optimized Spark logical plan, and then converting it to a Substrait `Plan` object which is serialized to protobuf. This enables cross-engine query execution. ```java import io.substrait.spark.logical.ToSubstraitRel; import io.substrait.plan.Plan; import io.substrait.plan.PlanProtoConverter; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan; import java.nio.file.Files; import java.nio.file.Paths; // Create Spark session SparkSession spark = SparkSession.builder() .appName("SparkToSubstrait") .master("local[*]") .getOrCreate(); // Load data and create views spark.read() .option("header", "true") .csv("employees.csv") .createOrReplaceTempView("employees"); spark.read() .option("header", "true") .csv("departments.csv") .createOrReplaceTempView("departments"); // Execute SQL query in Spark String sql = """ SELECT d.dept_name, COUNT(*) as emp_count, AVG(e.salary) as avg_salary FROM employees e INNER JOIN departments d ON e.dept_id = d.dept_id WHERE e.hire_date >= '2020-01-01' GROUP BY d.dept_name HAVING COUNT(*) > 5 ORDER BY avg_salary DESC """; Dataset dataset = spark.sql(sql); dataset.show(); // Get optimized Spark logical plan LogicalPlan sparkPlan = dataset.queryExecution().optimizedPlan(); // Convert to Substrait ToSubstraitRel converter = new ToSubstraitRel(); Plan substraitPlan = converter.convert(sparkPlan); // Serialize to protobuf PlanProtoConverter planToProto = new PlanProtoConverter(); byte[] bytes = planToProto.toProto(substraitPlan).toByteArray(); Files.write(Paths.get("spark_query.plan"), bytes); System.out.println("Substrait plan saved to spark_query.plan"); ``` -------------------------------- ### Add Substrait-Spark Dependency (Gradle) Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This snippet demonstrates how to include the Substrait-Spark library in your Gradle project (Groovy syntax). This requires Java 17 or higher. ```groovy // https://mvnrepository.com/artifact/io.substrait/spark implementation 'io.substrait:spark:0.36.0' ``` -------------------------------- ### Load Datasets for PyArrow Table Provider Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This Python code loads data into PyArrow tables from Parquet files. These tables will be used by a custom table provider function to supply data to the Substrait query execution in PyArrow. ```python test = pq.read_table(TESTS_PQ_FILE) vehicles = pq.read_table(VEHICLES_PQ_FILE) ``` -------------------------------- ### Projection using Substrait Java Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus-cli/README.md Demonstrates how to perform a projection operation using the Substrait Java CLI. It takes a SQL CREATE TABLE statement and an expression for the new column. The output is a JSON representation of the Substrait plan. ```shell $ ./isthmus-cli/build/native/nativeCompile/isthmus -c "CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, N_NAME CHAR(25), N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))" \ -e "N_REGIONKEY + 10" ``` ```json { "extensionUris": [{ "extensionUriAnchor": 1, "uri": "/functions_arithmetic.yaml" }], "extensions": [{ "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 0, "name": "add:i64_i64" } }], "referredExpr": [{ "expression": { "scalarFunction": { "functionReference": 0, "args": [], "outputType": { "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "arguments": [{ "value": { "selection": { "directReference": { "structField": { "field": 2 } }, "rootReference": { } } } }, { "value": { "cast": { "type": { "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "input": { "literal": { "i32": 10, "nullable": false, "typeVariationReference": 0 } }, "failureBehavior": "FAILURE_BEHAVIOR_UNSPECIFIED" } } }], "options": [] } }, "outputNames": ["new-column"] }], "baseSchema": { "names": ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT"], "struct": { "types": [{ "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, { "fixedChar": { "length": 25, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }, { "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, { "varchar": { "length": 152, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }], "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "expectedTypeUrls": [] } ``` -------------------------------- ### SQL to Substrait Plan Conversion using isthmus-cli Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus-cli/README.md Converts a SQL CREATE TABLE statement and a SELECT query into a Substrait Plan. This process involves parsing the SQL, mapping it to Substrait's relational algebra, and generating a JSON representation of the plan. The command requires the SQL schema and query as arguments. ```bash > $ ./isthmus-cli/build/native/nativeCompile/isthmus \ -c "CREATE TABLE Persons ( firstName VARCHAR, lastName VARCHAR, zip INT )" \ "SELECT lastName, firstName FROM Persons WHERE zip = 90210" { "extensionUris": [{ "extensionUriAnchor": 1, "uri": "/functions_comparison.yaml" }], "extensions": [{ "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 0, "name": "equal:any1_any1" } }], "relations": [{ "root": { "input": { "project": { "common": { "emit": { "outputMapping": [3, 4] } }, "input": { "filter": { "common": { "direct": { } }, "input": { "read": { "common": { "direct": { } }, "baseSchema": { "names": ["FIRSTNAME", "LASTNAME", "ZIP"], "struct": { "types": [{ "string": { "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }, { "string": { "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }, { "i32": { "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }], "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "namedTable": { "names": ["PERSONS"] } } }, "condition": { "scalarFunction": { "functionReference": 0, "args": [{ "selection": { "directReference": { "structField": { "field": 2 } }, "rootReference": { } } }, { "literal": { "i32": 90210, "nullable": false, "typeVariationReference": 0 } }], "outputType": { "bool": { "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }, "arguments": [] } } } }, "expressions": [{ "selection": { "directReference": { "structField": { "field": 1 } }, "rootReference": { } } }, { "selection": { "directReference": { "structField": { "field": 0 } }, "rootReference": { } } }] } }, "names": ["LASTNAME", "FIRSTNAME"] } }], "expectedTypeUrls": [] } ``` -------------------------------- ### Add Substrait-Spark Dependency (Maven) Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This snippet shows how to add the Substrait-Spark library as a Maven dependency to your project. Ensure you are using Java 17 or greater. ```xml io.substrait spark 0.36.0 ``` -------------------------------- ### Filtering using Substrait Java Source: https://github.com/substrait-io/substrait-java/blob/main/isthmus-cli/README.md Illustrates how to apply a filter operation using the Substrait Java CLI. It involves a SQL CREATE TABLE statement and a comparison expression for filtering. The output is a JSON representing the Substrait plan. ```shell $ ./isthmus-cli/build/native/nativeCompile/isthmus -c "CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, N_NAME CHAR(25), N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))" \ -e "N_REGIONKEY > 10" ``` ```json { "extensionUris": [{ "extensionUriAnchor": 1, "uri": "/functions_comparison.yaml" }], "extensions": [{ "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 0, "name": "gt:any_any" } }], "referredExpr": [{ "expression": { "scalarFunction": { "functionReference": 0, "args": [], "outputType": { "bool": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "arguments": [{ "value": { "selection": { "directReference": { "structField": { "field": 2 } }, "rootReference": { } } } }, { "value": { "cast": { "type": { "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "input": { "literal": { "i32": 10, "nullable": false, "typeVariationReference": 0 } }, "failureBehavior": "FAILURE_BEHAVIOR_UNSPECIFIED" } } }], "options": [] } }, "outputNames": ["new-column"] }], "baseSchema": { "names": ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT"], "struct": { "types": [{ "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, { "fixedChar": { "length": 25, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }, { "i64": { "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, { "varchar": { "length": 152, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE" } }], "typeVariationReference": 0, "nullability": "NULLABILITY_REQUIRED" } }, "expectedTypeUrls": [] } ``` -------------------------------- ### Configure Maven for Substrait-Java Maven Central Source: https://github.com/substrait-io/substrait-java/blob/main/ci/release/README.md This XML snippet shows how to declare dependencies for Substrait-Java artifacts (`core` and `isthmus` with version 1.0.0) in your Maven `pom.xml` when using Maven Central as the repository. ```xml ... io.substrait core 1.0.0 io.substrait isthmus 1.0.0 ... ``` -------------------------------- ### Reading CSV Data with Spark Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This code snippet demonstrates how to read data from CSV files into Spark Datasets. It specifies options for the delimiter and header, and then displays the content of the datasets. This is a prerequisite for further data manipulation. ```java dsVehicles = spark.read().option("delimiter", ",").option("header", "true").csv(vehiclesFile); dsVehicles.show(); dsTests = spark.read().option("delimiter", ",").option("header", "true").csv(testsFile); dsTests.show(); ``` -------------------------------- ### Create and Execute SQL Query in Spark Source: https://github.com/substrait-io/substrait-java/blob/main/examples/substrait-spark/README.md This snippet defines a standard SQL query to count vehicles by color that passed a safety test and then executes it using Spark SQL. It utilizes Java's string templating for the SQL query and Spark's sql() method. The input is a SparkSession and the output is a Spark DataFrame with the query results. ```java String sqlQuery = """ SELECT vehicles.colour, count(*) as colourcount FROM vehicles INNER JOIN tests ON vehicles.vehicle_id=tests.vehicle_id WHERE tests.test_result = 'P' GROUP BY vehicles.colour ORDER BY count(*) """; var result = spark.sql(sqlQuery); result.show(); ```