### Run Docker Integration Tests with Maven

Source: https://github.com/apache/spark/blob/master/docs/building-spark.md

Install Docker and start the Docker service before running the integration tests. The first command builds and installs the project with unit tests skipped; the second runs the Docker integration tests with Maven.

```bash
./build/mvn install -DskipTests
```

```bash
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_{{site.SCALA_BINARY_VERSION}}
```

--------------------------------

### Run Integration Tests with Default Minikube

Source: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md

Execute integration tests using the default Minikube setup. Ensure Minikube is installed and running.

```bash
./dev/dev-run-integration-tests.sh
```

--------------------------------

### Start Spark Shell with Kafka Package

Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md

Starts a `spark-shell` session with the `spark-sql-kafka-0-10` package and its dependencies loaded, which is useful for experimenting interactively with the Kafka integration.

```bash
./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
```

--------------------------------

### Configure Kafka Batch Queries with Offsets

Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md

Demonstrates how to perform batch reads from Kafka using explicit starting and ending offsets for specific topics and partitions.

```python
# Subscribe to 1 topic; defaults to the earliest and latest offsets
df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to multiple topics, specifying explicit Kafka offsets
df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1,topic2") \
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") \
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```
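The `startingOffsets` and `endingOffsets` values are JSON strings, so they can be generated instead of hand-written. The following is a minimal sketch using only the standard library; the `offsets_json` helper and the constant names are hypothetical, and the resulting strings would be passed to `.option("startingOffsets", ...)` and `.option("endingOffsets", ...)`.

```python
import json

# Sentinel values used in the offsets JSON: -2 means earliest, -1 means latest.
EARLIEST = -2
LATEST = -1


def offsets_json(offsets):
    """Serialize {topic: {partition: offset}} into the JSON string form.

    Partition numbers become string keys, matching the hand-written JSON.
    """
    return json.dumps(
        {topic: {str(p): off for p, off in parts.items()}
         for topic, parts in offsets.items()},
        separators=(",", ":"),
    )


starting = offsets_json({"topic1": {0: 23, 1: EARLIEST}, "topic2": {0: EARLIEST}})
ending = offsets_json({"topic1": {0: 50, 1: LATEST}, "topic2": {0: LATEST}})

print(starting)
print(ending)
```

Generating the strings this way avoids quoting mistakes when the topic or partition list is built at runtime.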

--------------------------------

### Install SDP with pip

Source: https://github.com/apache/spark/blob/master/docs/declarative-pipelines-programming-guide.md

Use this command to quickly install the Spark Declarative Pipelines library along with PySpark. Ensure you have pip installed.

```bash
pip install pyspark[pipelines]
```

--------------------------------

### Create Table Example

Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-aux-describe-table.md

Demonstrates how to create a partitioned Parquet table named 'customer' with specified columns and comments.

```sql
CREATE TABLE customer(
        cust_id INT,
        state VARCHAR(20),
        name STRING COMMENT 'Short name'
    )
    USING parquet
    PARTITIONED BY (state);
```

--------------------------------

### Run Spark Streaming Word Count Application

Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md

Commands to execute the word count example using different Spark interfaces. Includes starting a Netcat server and submitting the application via spark-submit or run-example scripts.

```bash
# Terminal 1: start a Netcat server (it keeps running; use a second terminal for the commands below)
nc -lk 9999

# Submit Python example
./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999

# Run Scala example
./bin/run-example streaming.NetworkWordCount localhost 9999

# Run Java example
./bin/run-example streaming.JavaNetworkWordCount localhost 9999
```

--------------------------------

### Start Minikube with Recommended Resources

Source: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md

Start a Minikube cluster with the recommended minimum resources for integration tests. Ensure Minikube version 1.28.0 or greater is used and the kube-dns addon is enabled.

```bash
minikube start --cpus 4 --memory 6144
```

--------------------------------

### Start and Terminate Spark Streaming Computation (Python)

Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md

Starts the Spark Streaming computation and waits for it to terminate. This is called after all transformations have been set up.

```python
ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
```

--------------------------------

### Verify SparkR Installation and Vignettes

Source: https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md

After installing SparkR, this snippet loads the library and opens the SparkR vignettes, confirming that the package is installed correctly and that its documentation is accessible.

```R
library(SparkR)
vignette("sparkr-vignettes", package="SparkR")
```

--------------------------------

### Launch Spark Connect Server

Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart_connect.ipynb

Launches the Spark server with Spark Connect support so that clients can connect remotely. Load your environment variables first, as the first line of the snippet does.

```bash
source ~/.profile
$HOME/sbin/start-connect-server.sh
```

--------------------------------

### Install PySpark

Source: https://github.com/apache/spark/blob/master/python/docs/source/user_guide/loadandbehold.ipynb

Installs a specific version of the PySpark library. This is a prerequisite for running the subsequent PySpark code examples.

```python
!pip install pyspark==4.0.0.dev2
```

--------------------------------

### Insert Data Example

Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-aux-describe-table.md

Shows how to insert data into a specific partition of the 'customer' table.

```sql
INSERT INTO customer PARTITION (state = 'AR') VALUES (100, 'Mike');
```

--------------------------------

### Start and Terminate Spark Streaming Computation (Scala)

Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md

Starts the Spark Streaming computation and waits for it to terminate. This is invoked after all streaming transformations have been defined.

```scala
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
```

--------------------------------

### CREATE FUNCTION with Resources

Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-function.md

Example demonstrating how to create an external function, specifying the implementation class and providing necessary resources (JARs, files) using the USING clause.

```sql
CREATE FUNCTION my_func AS 'com.example.MyClass' USING JAR '/path/to/my_func.jar', FILE '/path/to/my_config.txt'
```

--------------------------------

### Run Scala and Java Spark Examples

Source: https://github.com/apache/spark/blob/master/docs/quick-start.md

This command demonstrates how to run pre-compiled Scala and Java examples from the Spark distribution. The run-example script simplifies the execution of these examples.

```bash
# For Scala and Java, use run-example:
./bin/run-example SparkPi
```

--------------------------------

### Install OpenBLAS via Package Manager

Source: https://github.com/apache/spark/blob/master/docs/ml-linalg-guide.md

Commands to install OpenBLAS with the system package manager: `apt-get` on Debian or Ubuntu, `yum` on CentOS or RHEL.

```bash
sudo apt-get install libopenblas-dev
```

```bash
sudo yum install openblas
```

--------------------------------

### Specify Starting Offsets by Timestamp (JSON)

Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md

Defines the starting point for a query using timestamps for each TopicPartition. This option takes precedence over `startingOffsets`. For streaming queries, it applies only when a new query starts; resuming always picks up from where it left off. Newly discovered partitions start at the earliest offset.

```json
{
  "topicA": {"0": 1000, "1": 1000},
  "topicB": {"0": 2000, "1": 2000}
}
```
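The timestamp JSON can also be generated from a date. The sketch below assumes the values are epoch milliseconds, as Kafka record timestamps are, and that the string is passed through the `startingOffsetsByTimestamp` option; the `to_epoch_millis` helper and the topic layout are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time ambiguity")
    return int(moment.timestamp() * 1000)


# Assumption: every partition should start from the same instant.
start = to_epoch_millis(datetime(2024, 1, 1, tzinfo=timezone.utc))

by_timestamp = json.dumps({
    "topicA": {"0": start, "1": start},
    "topicB": {"0": start, "1": start},
})
print(by_timestamp)
```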

--------------------------------

### Build SQL Documentation

Source: https://github.com/apache/spark/blob/master/docs/README.md

Run this script from the $SPARK_HOME/sql directory to build SQL documentation. Ensure Spark is built first.

```sh
$SPARK_HOME/sql/create-docs.sh
```

--------------------------------

### Start Spark Streaming Computation

Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md

Starts the Spark Streaming context to begin processing data and waits for the termination signal. This is the final step after defining all DStream transformations.

```java
jssc.start();
jssc.awaitTermination();
```

--------------------------------

### Manage Kafka Topics

Source: https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/streaming/KAFKA_TESTING.md

Examples for creating Kafka topics with varying partition configurations.

```python
# Create single partition topics
kafka_utils.create_topics(["topic1", "topic2"])

# Create multi-partition topic
kafka_utils.create_topics(["multi-partition-topic"], num_partitions=3)
```

--------------------------------

### Start Spark Shell (Python)

Source: https://github.com/apache/spark/blob/master/docs/quick-start.md

Starts the PySpark interactive shell, which is useful for learning the Spark API and analyzing data interactively in Python. Run `./bin/pyspark` from the Spark directory, or `pyspark` if PySpark was installed with pip.

```bash
./bin/pyspark
```

```bash
pyspark
```

--------------------------------

### Run Scala/Java Spark Example

Source: https://github.com/apache/spark/blob/master/docs/index.md

Executes a Scala or Java Spark sample program using the `run-example` script. The example shows running SparkPi with a parameter.

```bash
./bin/run-example SparkPi 10
```

--------------------------------

### Install Ruby Dependencies

Source: https://github.com/apache/spark/blob/master/docs/README.md

Navigate to the docs directory and install all required Ruby dependencies using Bundler.

```sh
$ cd "$SPARK_HOME"/docs
$ bundle install
```

--------------------------------

### Spark SQL range function examples

Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-qry-select-tvf.md

Provides examples of the 'range' function in Spark SQL for generating sequences of numbers. It can be called with an end value, start and end values, or with additional parameters for step and number of partitions.

```sql
-- range call with end
SELECT * FROM range(6 + cos(3));
```

```sql
-- range call with start and end
SELECT * FROM range(5, 10);
```

```sql
-- range call with numPartitions
SELECT * FROM range(0, 10, 2, 200);
```

```sql
-- range call with a table alias
SELECT * FROM range(5, 8) AS test;
```
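As a rough mental model, the calls above behave like Python's built-in `range`: the end value is exclusive and the step defaults to 1. The sketch below is plain Python rather than Spark; the `sql_range` helper is hypothetical, and truncating the fractional argument is an assumption made for illustration.

```python
import math


def sql_range(*args):
    """Mimic range(end), range(start, end) and range(start, end, step) as a Python list.

    A fourth numPartitions argument only affects how rows are distributed,
    so it is ignored here.
    """
    bounds = [int(a) for a in args[:3]]  # assumption: fractional arguments are truncated
    return list(range(*bounds))


print(sql_range(6 + math.cos(3)))
print(sql_range(5, 10))
print(sql_range(0, 10, 2, 200))
```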

--------------------------------

### Specify Starting Offsets (JSON)

Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md

Sets the starting point for a query using specific offsets for each TopicPartition. In the JSON, -2 refers to the earliest offset and -1 to the latest. For batch queries, latest is not allowed, so the `-1` in this example is valid only for streaming queries. For streaming queries, the setting applies only when a new query starts; a resumed query always picks up from where it left off. Newly discovered partitions start at the earliest offset.

```json
{
  "topicA": {"0": 23, "1": -1},
  "topicB": {"0": -2}
}
```
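Because latest (-1) is rejected as a starting offset in batch queries, it can help to check the JSON before submitting a batch job. This is a minimal sketch with a hypothetical `check_batch_starting_offsets` helper; it is not part of Spark.

```python
import json

LATEST = -1  # sentinel meaning "latest" in the offsets JSON


def check_batch_starting_offsets(offsets_json):
    """Raise ValueError if any partition in a batch query starts at 'latest'."""
    offsets = json.loads(offsets_json)
    bad = [
        (topic, partition)
        for topic, partitions in offsets.items()
        for partition, offset in partitions.items()
        if offset == LATEST
    ]
    if bad:
        raise ValueError(f"'latest' (-1) is not allowed as a batch starting offset: {bad}")
    return offsets


# This JSON contains a -1, so it is suitable for streaming queries only.
streaming_only = '{"topicA": {"0": 23, "1": -1}, "topicB": {"0": -2}}'
try:
    check_batch_starting_offsets(streaming_only)
except ValueError as err:
    print(err)
```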

--------------------------------

### Run Scala/Java Example

Source: https://github.com/apache/spark/blob/master/docs/rdd-programming-guide.md

Use the `run-example` script to execute Spark examples written in Scala or Java. Pass the class name of the example you want to run.

```bash
./bin/run-example SparkPi
```

--------------------------------

### Define DataFrame for Index Type Hinting

Source: https://github.com/apache/spark/blob/master/python/docs/source/tutorial/pandas_on_spark/typehints.md

Initializes a pandas DataFrame for demonstrating index type hinting. This setup is used across multiple examples.

```python
>>> import pandas as pd
>>> pdf = pd.DataFrame({'id': range(5)})
>>> sample = pdf.copy()
>>> sample["a"] = sample.id + 1
```

--------------------------------

### Install ASV

Source: https://github.com/apache/spark/blob/master/python/benchmarks/README.md

Install the Airspeed Velocity (ASV) benchmarking tool using pip. This is a prerequisite for running benchmarks.

```bash
pip install asv
```

--------------------------------

### Connect to Spark Connect Server and Create DataFrame

Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart_connect.ipynb

Demonstrates how to stop an existing Spark session, connect to a remote Spark Connect server at localhost:15002, and create a DataFrame. This involves initializing a remote SparkSession and then using it to construct a DataFrame from a list of Row objects.

```python
from pyspark.sql import SparkSession

SparkSession.builder.master("local[*]").getOrCreate().stop()

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

from datetime import datetime, date
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
```

--------------------------------

### Kafka Source Configuration Options

Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md

Lists optional configuration parameters for reading data from Kafka topics in Spark streaming and batch queries. They are set as options on the Kafka source reader.

```APIDOC
## Kafka source options

### Description
Options that control the behavior of the Kafka source reader, including rate limiting, partition splitting, and consumer group management.

### Parameters
- **minOffsetsPerTrigger** (long) - Optional - Minimum number of offsets to be processed per trigger interval.
- **maxTriggerDelay** (time) - Optional - Maximum time to delay a trigger if data is available.
- **minPartitions** (int) - Optional - Desired minimum number of partitions to read from Kafka.
- **maxRecordsPerPartition** (long) - Optional - Limit maximum number of records present in a partition.
- **groupIdPrefix** (string) - Optional - Prefix for consumer group identifiers.
- **kafka.group.id** (string) - Optional - Specific Kafka group id for consumer.
- **includeHeaders** (boolean) - Optional - Whether to include Kafka headers in the row.
- **startingOffsetsByTimestampStrategy** (string) - Optional - Strategy used when no offset matches the specified starting timestamp ('error' or 'latest').
```
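These parameters are set as options on the Kafka reader. The sketch below shows one way to apply a dictionary of them; `apply_options` and `RecordingReader` are hypothetical names, and `RecordingReader` is only a stand-in so the example runs without a Spark session. The option values are illustrative.

```python
# Illustrative option values for the Kafka source.
kafka_options = {
    "kafka.bootstrap.servers": "host1:port1,host2:port2",
    "subscribe": "topic1",
    "minPartitions": "10",
    "includeHeaders": "true",
}


def apply_options(reader, options):
    """Call reader.option(key, value) for each entry and return the configured reader."""
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader


class RecordingReader:
    """Stand-in for a Spark reader that records the options it receives."""

    def __init__(self):
        self.seen = {}

    def option(self, key, value):
        self.seen[key] = value
        return self


reader = apply_options(RecordingReader(), kafka_options)
print(reader.seen["minPartitions"])
```

With a real session, the same helper could be called as `apply_options(spark.readStream.format("kafka"), kafka_options).load()`.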

--------------------------------

### Define MultiIndex DataFrame for Type Hinting

Source: https://github.com/apache/spark/blob/master/python/docs/source/tutorial/pandas_on_spark/typehints.md

Initializes a pandas DataFrame with a MultiIndex for demonstrating MultiIndex type hinting. This setup is used for subsequent MultiIndex examples.

```python
>>> import pandas as pd
>>> midx = pd.MultiIndex.from_arrays(
...     [(1, 1, 2), (1.5, 4.5, 7.5)],
...     names=("int", "float"))
>>> pdf = pd.DataFrame(range(3), index=midx, columns=["id"])
>>> sample = pdf.copy()
>>> sample["a"] = sample.id + 1
```

--------------------------------

### GPU Resource Discovery Script Example

Source: https://github.com/apache/spark/blob/master/docs/running-on-yarn.md

An example script for discovering GPU resources available to an executor. This script must be executable and write resource information to STDOUT in the `ResourceInformation` format.

```bash
#!/bin/bash

# Example script to discover GPU resources.
# It prints a JSON string in the ResourceInformation format:
# a resource name plus the addresses available to this executor, e.g.
# {"name": "gpu", "addresses": ["0", "1"]}
# The addresses below are hard-coded for illustration; a real script would query the host.

echo '{"name": "gpu", "addresses": ["0", "1"]}'
```

--------------------------------

### Run Quick Benchmarks (Current Environment)

Source: https://github.com/apache/spark/blob/master/python/benchmarks/README.md

Execute benchmarks using the current Python environment for faster development cycles. The `--quick` flag runs each benchmark function only once, which is enough to check that it works but not to produce reliable timings.

```bash
./python/asv run --python=same --quick
```

--------------------------------

### PySpark Benchmark Class Structure

Source: https://github.com/apache/spark/blob/master/python/benchmarks/README.md

Example structure for a PySpark benchmark class using ASV. Includes setup, timing, and memory measurement methods with parameterization.

```python
class MyBenchmark:
    params = [[1000, 10000], ["option1", "option2"]]
    param_names = ["n_rows", "option"]

    def setup(self, n_rows, option):
        # Called before each benchmark method
        self.data = create_test_data(n_rows, option)

    def time_my_operation(self, n_rows, option):
        # Benchmark timing
        process(self.data)

    def peakmem_my_operation(self, n_rows, option):
        # Benchmark peak memory
        process(self.data)
```
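With a list of lists in `params`, ASV runs each benchmark once per combination of values, in other words over their Cartesian product. A short standard-library sketch of the combinations the class above would receive:

```python
from itertools import product

params = [[1000, 10000], ["option1", "option2"]]
param_names = ["n_rows", "option"]

# Each combination corresponds to one setup() call followed by the benchmark method.
combinations = [dict(zip(param_names, values)) for values in product(*params)]

for combo in combinations:
    print(combo)
```

Two values per parameter give four combinations, so adding parameters multiplies the total run time accordingly.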

--------------------------------

### Run R Spark Examples

Source: https://github.com/apache/spark/blob/master/docs/quick-start.md

This command illustrates how to execute R examples provided with Spark using spark-submit. It points to the R script file for execution.

```bash
# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R
```

--------------------------------

### Handle Spark Distribution Installation Prompt in SparkR

Source: https://github.com/apache/spark/blob/master/docs/sparkr-migration-guide.md

In older versions of SparkR, the Spark distribution was automatically downloaded and installed if not found. This behavior has changed to prompt the user. To revert to the automatic download, set the SPARKR_ASK_INSTALLATION environment variable to FALSE.

```r
Sys.setenv("SPARKR_ASK_INSTALLATION" = "FALSE")
```

--------------------------------

### Run JavaNetworkWordCount Example in Spark Streaming

Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md

This snippet shows the command to execute the JavaNetworkWordCount example provided by Spark Streaming. It requires a running network server on localhost port 9999 to receive and count words. The output displays the word counts over time.

```bash
# TERMINAL 2: RUNNING JavaNetworkWordCount

$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999
...
-------------------------------------------
Time: 1357008430000 ms
-------------------------------------------
(hello,1)
(world,1)
...
```
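To make the output format concrete, here is a plain-Python sketch of the counting step for a single batch. It is not the Spark Streaming implementation, and it assumes the Netcat terminal sent the single line `hello world` during that batch.

```python
from collections import Counter

# Assumption: the only input received in this batch was the line "hello world".
batch_lines = ["hello world"]

# Split each line into words, then count occurrences within the batch.
words = [word for line in batch_lines for word in line.split(" ")]
counts = Counter(words)

for word, count in counts.items():
    print(f"({word},{count})")
```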

--------------------------------

### Install Dependencies with pip

Source: https://github.com/apache/spark/blob/master/python/docs/source/development/contributing.md

Install and set up the development environment using pip for Python 3.11+.

```bash
pip install --upgrade -r dev/requirements.txt
```

--------------------------------

### Implement Kafka Integration Test

Source: https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/streaming/KAFKA_TESTING.md

Minimal example demonstrating how to set up a Kafka container, produce messages, and read them using PySpark within a unit test.

```python
import unittest
from pyspark.sql.tests.streaming.kafka_utils import KafkaUtils
from pyspark.testing.sqlutils import ReusedSQLTestCase

class MyKafkaTest(ReusedSQLTestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.kafka_utils = KafkaUtils()
        cls.kafka_utils.setup()

    @classmethod
    def tearDownClass(cls):
        cls.kafka_utils.teardown()
        super().tearDownClass()

    def test_kafka_read_write(self):
        # Create a topic
        topic = "test-topic"
        self.kafka_utils.create_topics([topic])

        # Send test data
        messages = [("key1", "value1"), ("key2", "value2")]
        self.kafka_utils.send_messages(topic, messages)

        # Read with Spark
        df = (
            self.spark.read
            .format("kafka")
            .option("kafka.bootstrap.servers", self.kafka_utils.broker)
            .option("subscribe", topic)
            .option("startingOffsets", "earliest")
            .load()
        )

        # Verify data
        results = df.selectExpr(
            "CAST(key AS STRING) as key",
            "CAST(value AS STRING) as value"
        ).collect()

        self.assertEqual(len(results), 2)
```

--------------------------------

### Submit Spark Streaming Application

Source: https://github.com/apache/spark/blob/master/docs/streaming/getting-started.md

Commands to execute the structured streaming word count example using the spark-submit or run-example utilities.

```bash
# Python
$ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
```

```bash
# Scala
$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999
```

```bash
# Java
$ ./bin/run-example org.apache.spark.examples.sql.streaming.JavaStructuredNetworkWordCount localhost 9999
```

```bash
# R
$ ./bin/spark-submit examples/src/main/r/streaming/structured_network_wordcount.R localhost 9999
```

--------------------------------

### Describe a Built-in Scalar Function with Extended Information

Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-aux-describe-function.md

Use this command with the EXTENDED option to get detailed usage information, including examples, for a built-in scalar function.

```sql
-- Describe a builtin scalar function.
-- Returns function name, implementing class and usage and examples.
DESC FUNCTION EXTENDED abs;
+-------------------------------------------------------------------+
|function_desc                                                      |
+-------------------------------------------------------------------+
|Function: abs                                                      |
|Class: org.apache.spark.sql.catalyst.expressions.Abs               |
|Usage: abs(expr) - Returns the absolute value of the numeric value.|
|Extended Usage:                                                    |
|    Examples:                                                      |
|      > SELECT abs(-1);                                            |
|       1                                                           |
|                                                                   |
+-------------------------------------------------------------------+
```

--------------------------------

### Install Python Spark Connect client

Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/install.md

Install the pure Python Spark Connect client library from PyPI. This package only supports spark.remote with connection URIs.

```bash
pip install pyspark-client
```

--------------------------------

### Get DataFrame Column Names (Python)

Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart_df.ipynb

Retrieves a list of all column names present in a PySpark DataFrame. This is a simple way to understand the structure of the data.

```python
df.columns
```

--------------------------------

### Initialize SparkContext in Python

Source: https://github.com/apache/spark/blob/master/docs/rdd-programming-guide.md

Create a SparkConf with application name and master URL, then use it to initialize a SparkContext.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
```

--------------------------------

### Evaluate Ranking Metrics in Scala and Java

Source: https://github.com/apache/spark/blob/master/docs/mllib-evaluation-metrics.md

Imports the `RankingMetrics` class used to evaluate ranking models in Scala and Java. The usage logic is elided in this snippet; only the imports are shown.

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics
// Example usage logic for RankingMetrics in Scala
```

```java
import org.apache.spark.mllib.evaluation.RankingMetrics;
// Example usage logic for RankingMetrics in Java
```
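To illustrate the kind of quantity a ranking evaluation reports, here is a plain-Python sketch of precision at k. It does not use the `RankingMetrics` API; the function names and the sample data are hypothetical.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the first k predicted items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    hits = sum(1 for item in predicted[:k] if item in relevant_set)
    return hits / k


def mean_precision_at_k(queries, k):
    """Average precision_at_k over (predicted, relevant) pairs."""
    return sum(precision_at_k(p, r, k) for p, r in queries) / len(queries)


# Hypothetical data: each query has a ranked prediction list and the truly relevant items.
queries = [
    ([1, 6, 2, 7, 8], [1, 2, 3, 4, 5]),
    ([4, 1, 5, 6, 2], [1, 2, 3]),
]
print(mean_precision_at_k(queries, 3))
```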

--------------------------------

### Build SparkR Documentation

Source: https://github.com/apache/spark/blob/master/docs/README.md

Execute this script from the $SPARK_HOME/R directory to build SparkR documentation. Ensure Spark is built first.

```sh
$SPARK_HOME/R/create-docs.sh
```

--------------------------------

### Correct Substring Indexing in SparkR's substr Method

Source: https://github.com/apache/spark/blob/master/docs/sparkr-migration-guide.md

In SparkR versions prior to 2.3.1, the `start` parameter of the `substr` method was incorrectly 0-based, leading to inconsistent results compared to R's native `substr`. This has been fixed in 2.3.1 and later, making the `start` parameter 1-based.

```r
# SparkR 2.3.0 and earlier (incorrect 0-based start)
# substr(lit('abcdef'), 2, 4)  # Result: 'abc'

# SparkR 2.3.1 and later (correct 1-based start)
substr(lit('abcdef'), 2, 4)  # Result: 'bcd'
```
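The corrected 1-based, end-inclusive indexing can be modeled in plain Python, which makes the off-by-one easy to see. The `substr_one_based` helper is hypothetical and is not SparkR code.

```python
def substr_one_based(text, start, stop):
    """Return characters start..stop of text, 1-based and inclusive on both ends."""
    # Python slices are 0-based and exclude the end, so only the start is shifted.
    return text[start - 1:stop]


print(substr_one_based("abcdef", 2, 4))
```

Under this model, the `'abc'` result from the older versions corresponds to the window starting one position too early.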

--------------------------------

### Create Temporary SQL Function with Session Qualifier

Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-sql-function.md

Illustrates creating temporary SQL functions using different qualifiers (`session`, `system.session`), all of which refer to the same per-session function. Also shows how to drop the function and which qualifiers are rejected.

```sql
-- Unqualified, `session`-qualified, and `system.session`-qualified names all create the same
-- temporary function in the per-session `system.session` namespace.
CREATE TEMPORARY FUNCTION add_one(x INT) RETURNS INT RETURN x + 1;

CREATE OR REPLACE TEMPORARY FUNCTION session.add_one(x INT) RETURNS INT
    RETURN x + 1;

CREATE OR REPLACE TEMPORARY FUNCTION system.session.add_one(x INT) RETURNS INT
    RETURN x + 1;

-- All three names refer to the same temporary function:
SELECT add_one(1), session.add_one(1), system.session.add_one(1);
 2  2  2

-- DROP TEMPORARY FUNCTION accepts the same qualifiers:
DROP TEMPORARY FUNCTION session.add_one;

-- Any other qualifier on a TEMPORARY function is rejected.
CREATE TEMPORARY FUNCTION mydb.bad_temp() RETURNS INT RETURN 1;
  [INVALID_TEMP_OBJ_QUALIFIER] qualifier `mydb` is not allowed for temporary FUNCTION ...

CREATE TEMPORARY FUNCTION system.builtin.bad_temp() RETURNS INT RETURN 1;
  [INVALID_TEMP_OBJ_QUALIFIER] qualifier `system`.`builtin` is not allowed for temporary FUNCTION ...
```

--------------------------------

### Initialize SparkSession for Structured Streaming

Source: https://github.com/apache/spark/blob/master/docs/streaming/getting-started.md

Initializes the SparkSession, which serves as the entry point for all Spark functionality. This setup is required before defining streaming queries or data transformations.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
```

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._
```

```java
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;

import java.util.Arrays;
import java.util.Iterator;

SparkSession spark = SparkSession
  .builder()
  .appName("JavaStructuredNetworkWordCount")
  .getOrCreate();
```

```r
sparkR.session(appName = "StructuredNetworkWordCount")
```

--------------------------------

### Serve Spark Documentation Locally

Source: https://github.com/apache/spark/blob/master/python/docs/source/development/contributing.md

Use this command to serve the Spark documentation locally and watch for changes. Ensure you are in the 'docs' directory.

```bash
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll serve --watch
```

--------------------------------

### Run Python Example

Source: https://github.com/apache/spark/blob/master/docs/rdd-programming-guide.md

Submit Python Spark applications using the `spark-submit` script. Ensure the path to your Python script is correctly specified.

```bash
./bin/spark-submit examples/src/main/python/pi.py
```

--------------------------------

### Python Spark Application Example

Source: https://github.com/apache/spark/blob/master/docs/quick-start.md

A simple PySpark application that counts lines containing 'a' and 'b' in a text file. It uses SparkSession for data processing and can be run via spark-submit or the Python interpreter if PySpark is pip installed.

```python
"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
```

```bash
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master "local[4]" \
  SimpleApp.py
...
Lines with a: 46, lines with b: 23
```

```bash
# Use the Python interpreter to run your application
$ python SimpleApp.py
...
Lines with a: 46, lines with b: 23
```

--------------------------------

### Build Site with API Docs using Jekyll

Source: https://github.com/apache/spark/blob/master/docs/README.md

Run this command in the 'docs' directory to build the Spark website, which also copies over generated API documentation. This process may take time if API docs haven't been generated recently.

```sh
bundle exec jekyll build
```