### Run Docker Integration Tests with Maven Source: https://github.com/apache/spark/blob/master/docs/building-spark.md Install Docker, start the service, and then run integration tests using Maven. This command builds the project first. ```bash ./build/mvn install -DskipTests ``` ```bash ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_{{site.SCALA_BINARY_VERSION}} ``` -------------------------------- ### Run Integration Tests with Default Minikube Source: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md Execute integration tests using the default Minikube setup. Ensure Minikube is installed and running. ```bash ./dev/dev-run-integration-tests.sh ``` -------------------------------- ### Start Spark Shell with Kafka Package Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md This snippet demonstrates how to initiate a `spark-shell` session with the `spark-sql-kafka-0-10` package and its dependencies pre-loaded. This is useful for interactive development and experimentation with Kafka integration within Spark. ```bash ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ... ``` -------------------------------- ### Configure Kafka Batch Queries with Offsets Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md Demonstrates how to perform batch reads from Kafka using explicit starting and ending offsets for specific topics and partitions. ```python # Subscribe to 1 topic defaults to the earliest and latest offsets df = spark \ .read \ .format("kafka") \ .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \ .option("subscribe", "topic1") \ .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") # Subscribe to multiple topics, specifying explicit Kafka offsets df = spark \ .read \ .format("kafka") \ .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \ .option("subscribe", "topic1,topic2") \ .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") \ .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") \ .load() df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") ``` -------------------------------- ### Install SDP with pip Source: https://github.com/apache/spark/blob/master/docs/declarative-pipelines-programming-guide.md Use this command to quickly install the Spark Declarative Pipelines library along with PySpark. Ensure you have pip installed. ```bash pip install pyspark[pipelines] ``` -------------------------------- ### Create Table Example Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-aux-describe-table.md Demonstrates how to create a partitioned Parquet table named 'customer' with specified columns and comments. ```sql CREATE TABLE customer( cust_id INT, state VARCHAR(20), name STRING COMMENT 'Short name' ) USING parquet PARTITIONED BY (state); ``` -------------------------------- ### Run Spark Streaming Word Count Application Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md Commands to execute the word count example using different Spark interfaces. Includes starting a Netcat server and submitting the application via spark-submit or run-example scripts. ```bash # Start Netcat server nc -lk 9999 # Submit Python example ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999 # Run Scala example ./bin/run-example streaming.NetworkWordCount localhost 9999 # Run Java example ./bin/run-example streaming.JavaNetworkWordCount localhost 9999 ``` -------------------------------- ### Start Minikube with Recommended Resources Source: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md Start a Minikube cluster with the recommended minimum resources for integration tests. Ensure Minikube version 1.28.0 or greater is used and the kube-dns addon is enabled. ```bash minikube start --cpus 4 --memory 6144 ``` -------------------------------- ### Start and Terminate Spark Streaming Computation (Python) Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md Starts the Spark Streaming computation and waits for it to terminate. This is called after all transformations have been set up. ```python ssc.start() # Start the computation ssc.awaitTermination() # Wait for the computation to terminate ``` -------------------------------- ### Verify SparkR Installation and Vignettes Source: https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md After installing SparkR, this R code snippet loads the library and displays the SparkR vignettes. This is a crucial step to confirm that the package has been installed correctly and that its documentation and examples are accessible. ```R library(SparkR) vignette("sparkr-vignettes", package="SparkR") ``` -------------------------------- ### Launch Spark Connect Server Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart_connect.ipynb Launches the Spark server with Spark Connect support. Ensure environment variables are loaded before execution. This script is essential for enabling remote connectivity. ```bash source ~/.profile $HOME/sbin/start-connect-server.sh ``` -------------------------------- ### Install PySpark Source: https://github.com/apache/spark/blob/master/python/docs/source/user_guide/loadandbehold.ipynb Installs a specific version of the PySpark library. This is a prerequisite for running the subsequent PySpark code examples. ```python !pip install pyspark==4.0.0.dev2 ``` -------------------------------- ### Insert Data Example Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-aux-describe-table.md Shows how to insert data into a specific partition of the 'customer' table. ```sql INSERT INTO customer PARTITION (state = 'AR') VALUES (100, 'Mike'); ``` -------------------------------- ### Start and Terminate Spark Streaming Computation (Scala) Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md Starts the Spark Streaming computation and waits for it to terminate. This is invoked after all streaming transformations have been defined. ```scala ssc.start() // Start the computation ssc.awaitTermination() # Wait for the computation to terminate ``` -------------------------------- ### CREATE FUNCTION with Resources Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-function.md Example demonstrating how to create an external function, specifying the implementation class and providing necessary resources (JARs, files) using the USING clause. ```sql CREATE FUNCTION my_func AS 'com.example.MyClass' USING JAR '/path/to/my_func.jar', FILE '/path/to/my_config.txt' ``` -------------------------------- ### Run Scala and Java Spark Examples Source: https://github.com/apache/spark/blob/master/docs/quick-start.md This command demonstrates how to run pre-compiled Scala and Java examples from the Spark distribution. The run-example script simplifies the execution of these examples. ```bash # For Scala and Java, use run-example: ./bin/run-example SparkPi ``` -------------------------------- ### Install OpenBLAS via Package Manager Source: https://github.com/apache/spark/blob/master/docs/ml-linalg-guide.md Commands to install OpenBLAS on common Linux distributions using standard package managers. ```bash sudo apt-get install libopenblas-dev ``` ```bash sudo yum install openblas ``` -------------------------------- ### Specify Starting Offsets by Timestamp (JSON) Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md Defines the starting point for a query using timestamps for each TopicPartition. This option takes precedence over `startingOffsets`. For streaming queries, it applies only when a new query starts; resuming always picks up from where it left off. Newly discovered partitions start at the earliest offset. ```json { "topicA": {"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000} } ``` -------------------------------- ### Build SQL Documentation Source: https://github.com/apache/spark/blob/master/docs/README.md Run this script from the $SPARK_HOME/sql directory to build SQL documentation. Ensure Spark is built first. ```sh $SPARK_HOME/sql/create-docs.sh ``` -------------------------------- ### Start Spark Streaming Computation Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md Starts the Spark Streaming context to begin processing data and waits for the termination signal. This is the final step after defining all DStream transformations. ```java jssc.start(); jssc.awaitTermination(); ``` -------------------------------- ### Manage Kafka Topics Source: https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/streaming/KAFKA_TESTING.md Examples for creating Kafka topics with varying partition configurations. ```python # Create single partition topics kafka_utils.create_topics(["topic1", "topic2"]) # Create multi-partition topic kafka_utils.create_topics(["multi-partition-topic"], num_partitions=3) ``` -------------------------------- ### Start Spark Shell (Python) Source: https://github.com/apache/spark/blob/master/docs/quick-start.md Starts the PySpark interactive shell. This is useful for learning the Spark API and analyzing data interactively using Python. It can be run directly from the Spark directory or if PySpark is installed via pip. ```bash ./bin/pyspark ``` ```bash pyspark ``` -------------------------------- ### Run Scala/Java Spark Example Source: https://github.com/apache/spark/blob/master/docs/index.md Executes a Scala or Java Spark sample program using the `run-example` script. The example shows running SparkPi with a parameter. ```bash ./bin/run-example SparkPi 10 ``` -------------------------------- ### Install Ruby Dependencies Source: https://github.com/apache/spark/blob/master/docs/README.md Navigate to the docs directory and install all required Ruby dependencies using Bundler. ```sh $ cd "$SPARK_HOME"/docs $ bundle install ``` -------------------------------- ### Spark SQL range function examples Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-qry-select-tvf.md Provides examples of the 'range' function in Spark SQL for generating sequences of numbers. It can be called with an end value, start and end values, or with additional parameters for step and number of partitions. ```sql -- range call with end SELECT * FROM range(6 + cos(3)); ``` ```sql -- range call with start and end SELECT * FROM range(5, 10); ``` ```sql -- range call with numPartitions SELECT * FROM range(0, 10, 2, 200); ``` ```sql -- range call with a table alias SELECT * FROM range(5, 8) AS test; ``` -------------------------------- ### Specify Starting Offsets (JSON) Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md Sets the starting point for a query using specific offsets for each TopicPartition. Offsets can be 'earliest' (-2) or 'latest' (-1). For batch queries, 'latest' is not allowed. For streaming queries, this applies only when a new query starts; resuming always picks up from where it left off. Newly discovered partitions start at the earliest offset. ```json { "topicA": {"0": 23, "1": -1}, "topicB": {"0": -2} } ``` -------------------------------- ### Run Scala/Java Example Source: https://github.com/apache/spark/blob/master/docs/rdd-programming-guide.md Use the `run-example` script to execute Spark examples written in Scala or Java. Pass the class name of the example you want to run. ```bash ./bin/run-example SparkPi ``` -------------------------------- ### Define DataFrame for Index Type Hinting Source: https://github.com/apache/spark/blob/master/python/docs/source/tutorial/pandas_on_spark/typehints.md Initializes a pandas DataFrame for demonstrating index type hinting. This setup is used across multiple examples. ```python >>> pdf = pd.DataFrame({'id': range(5)}) >>> sample = pdf.copy() >>> sample["a"] = sample.id + 1 ``` -------------------------------- ### Install ASV Source: https://github.com/apache/spark/blob/master/python/benchmarks/README.md Install the Airspeed Velocity (ASV) benchmarking tool using pip. This is a prerequisite for running benchmarks. ```bash pip install asv ``` -------------------------------- ### Connect to Spark Connect Server and Create DataFrame Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart_connect.ipynb Demonstrates how to stop an existing Spark session, connect to a remote Spark Connect server at localhost:15002, and create a DataFrame. This involves initializing a remote SparkSession and then using it to construct a DataFrame from a list of Row objects. ```python from pyspark.sql import SparkSession SparkSession.builder.master("local[*]").getOrCreate().stop() spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate() from datetime import datetime, date from pyspark.sql import Row df = spark.createDataFrame([ Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)), Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)), Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)) ]) df.show() ``` -------------------------------- ### GET /spark/kafka-source/configuration Source: https://github.com/apache/spark/blob/master/docs/streaming/structured-streaming-kafka-integration.md Defines the configuration parameters for reading data from Kafka topics in Spark streaming and batch queries. ```APIDOC ## GET /spark/kafka-source/configuration ### Description Configures the behavior of the Kafka source reader, including rate limiting, partition splitting, and consumer group management. ### Method GET ### Endpoint /spark/kafka-source/configuration ### Parameters #### Query Parameters - **minOffsetsPerTrigger** (long) - Optional - Minimum number of offsets to be processed per trigger interval. - **maxTriggerDelay** (time) - Optional - Maximum time to delay a trigger if data is available. - **minPartitions** (int) - Optional - Desired minimum number of partitions to read from Kafka. - **maxRecordsPerPartition** (long) - Optional - Limit maximum number of records present in a partition. - **groupIdPrefix** (string) - Optional - Prefix for consumer group identifiers. - **kafka.group.id** (string) - Optional - Specific Kafka group id for consumer. - **includeHeaders** (boolean) - Optional - Whether to include Kafka headers in the row. - **startingOffsetsByTimestampStrategy** (string) - Optional - Strategy when starting offset by timestamp does not match ('error' or 'latest'). ### Request Example { "minPartitions": 10, "includeHeaders": true } ### Response #### Success Response (200) - **status** (string) - Configuration applied successfully. #### Response Example { "status": "success", "applied_configs": { "minPartitions": 10, "includeHeaders": true } } ``` -------------------------------- ### Define MultiIndex DataFrame for Type Hinting Source: https://github.com/apache/spark/blob/master/python/docs/source/tutorial/pandas_on_spark/typehints.md Initializes a pandas DataFrame with a MultiIndex for demonstrating MultiIndex type hinting. This setup is used for subsequent MultiIndex examples. ```python >>> midx = pd.MultiIndex.from_arrays( ... [(1, 1, 2), (1.5, 4.5, 7.5)], ... names=("int", "float")) >>> pdf = pd.DataFrame(range(3), index=midx, columns=["id"]) >>> sample = pdf.copy() >>> sample["a"] = sample.id + 1 ``` -------------------------------- ### GPU Resource Discovery Script Example Source: https://github.com/apache/spark/blob/master/docs/running-on-yarn.md An example script for discovering GPU resources available to an executor. This script must be executable and write resource information to STDOUT in the `ResourceInformation` format. ```bash #!/bin/bash # Example script to discover GPU resources. # This script should output a JSON string representing ResourceInformation. # For example: # { # "resourceName": "yarn.io/gpu", # "resources": [ # "/dev/nvidia0", # "/dev/nvidia1" # ] # } echo "{\"resourceName\": \"yarn.io/gpu\", \"resources\": [\"/dev/nvidia0\", \"/dev/nvidia1\"]}" ``` -------------------------------- ### Run Quick Benchmarks (Current Environment) Source: https://github.com/apache/spark/blob/master/python/benchmarks/README.md Execute benchmarks using the current Python environment for faster development cycles. Use the --quick flag to limit the scope. ```bash ./python/asv run --python=same --quick ``` -------------------------------- ### PySpark Benchmark Class Structure Source: https://github.com/apache/spark/blob/master/python/benchmarks/README.md Example structure for a PySpark benchmark class using ASV. Includes setup, timing, and memory measurement methods with parameterization. ```python class MyBenchmark: params = [[1000, 10000], ["option1", "option2"]] param_names = ["n_rows", "option"] def setup(self, n_rows, option): # Called before each benchmark method self.data = create_test_data(n_rows, option) def time_my_operation(self, n_rows, option): # Benchmark timing process(self.data) def peakmem_my_operation(self, n_rows, option): # Benchmark peak memory process(self.data) ``` -------------------------------- ### Run R Spark Examples Source: https://github.com/apache/spark/blob/master/docs/quick-start.md This command illustrates how to execute R examples provided with Spark using spark-submit. It points to the R script file for execution. ```bash # For R examples, use spark-submit directly: ./bin/spark-submit examples/src/main/r/dataframe.R ``` -------------------------------- ### Handle Spark Distribution Installation Prompt in SparkR Source: https://github.com/apache/spark/blob/master/docs/sparkr-migration-guide.md In older versions of SparkR, the Spark distribution was automatically downloaded and installed if not found. This behavior has changed to prompt the user. To revert to the automatic download, set the SPARKR_ASK_INSTALLATION environment variable to FALSE. ```r Sys.setenv("SPARKR_ASK_INSTALLATION" = "FALSE") ``` -------------------------------- ### Run JavaNetworkWordCount Example in Spark Streaming Source: https://github.com/apache/spark/blob/master/docs/streaming-programming-guide.md This snippet shows the command to execute the JavaNetworkWordCount example provided by Spark Streaming. It requires a running network server on localhost port 9999 to receive and count words. The output displays the word counts over time. ```bash # TERMINAL 2: RUNNING JavaNetworkWordCount $ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999 ... ------------------------------------------- Time: 1357008430000 ms ------------------------------------------- (hello,1) (world,1) ... ``` -------------------------------- ### Install Dependencies with pip Source: https://github.com/apache/spark/blob/master/python/docs/source/development/contributing.md Install and set up the development environment using pip for Python 3.11+. ```bash pip install --upgrade -r dev/requirements.txt ``` -------------------------------- ### Implement Kafka Integration Test Source: https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/streaming/KAFKA_TESTING.md Minimal example demonstrating how to set up a Kafka container, produce messages, and read them using PySpark within a unit test. ```python import unittest from pyspark.sql.tests.streaming.kafka_utils import KafkaUtils from pyspark.testing.sqlutils import ReusedSQLTestCase class MyKafkaTest(ReusedSQLTestCase): @classmethod def setUpClass(cls): super().setUpClass() cls.kafka_utils = KafkaUtils() cls.kafka_utils.setup() @classmethod def tearDownClass(cls): cls.kafka_utils.teardown() super().tearDownClass() def test_kafka_read_write(self): # Create a topic topic = "test-topic" self.kafka_utils.create_topics([topic]) # Send test data messages = [("key1", "value1"), ("key2", "value2")] self.kafka_utils.send_messages(topic, messages) # Read with Spark df = ( self.spark.read .format("kafka") .option("kafka.bootstrap.servers", self.kafka_utils.broker) .option("subscribe", topic) .option("startingOffsets", "earliest") .load() ) # Verify data results = df.selectExpr( "CAST(key AS STRING) as key", "CAST(value AS STRING) as value" ).collect() self.assertEqual(len(results), 2) ``` -------------------------------- ### Submit Spark Streaming Application Source: https://github.com/apache/spark/blob/master/docs/streaming/getting-started.md Commands to execute the structured streaming word count example using the spark-submit or run-example utilities. ```python $ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999 ``` ```scala $ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999 ``` ```java $ ./bin/run-example org.apache.spark.examples.sql.streaming.JavaStructuredNetworkWordCount localhost 9999 ``` ```r $ ./bin/spark-submit examples/src/main/r/streaming/structured_network_wordcount.R localhost 9999 ``` -------------------------------- ### Describe a Built-in Scalar Function with Extended Information Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-aux-describe-function.md Use this command with the EXTENDED option to get detailed usage information, including examples, for a built-in scalar function. ```sql -- Describe a builtin scalar function. -- Returns function name, implementing class and usage and examples. DESC FUNCTION EXTENDED abs; +-------------------------------------------------------------------+ |function_desc | +-------------------------------------------------------------------+ |Function: abs | |Class: org.apache.spark.sql.catalyst.expressions.Abs | |Usage: abs(expr) - Returns the absolute value of the numeric value.| |Extended Usage: | | Examples: | | > SELECT abs(-1); | | 1 | | | +-------------------------------------------------------------------+ ``` -------------------------------- ### Install Python Spark Connect client Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/install.md Install the pure Python Spark Connect client library from PyPI. This package only supports spark.remote with connection URIs. ```bash pip install pyspark-client ``` -------------------------------- ### Get DataFrame Column Names (Python) Source: https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart_df.ipynb Retrieves a list of all column names present in a PySpark DataFrame. This is a simple way to understand the structure of the data. ```python df.columns ``` -------------------------------- ### Initialize SparkContext in Python Source: https://github.com/apache/spark/blob/master/docs/rdd-programming-guide.md Create a SparkConf with application name and master URL, then use it to initialize a SparkContext. ```python conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) ``` -------------------------------- ### Evaluate Ranking Metrics in Scala and Java Source: https://github.com/apache/spark/blob/master/docs/mllib-evaluation-metrics.md Demonstrates how to utilize the RankingMetrics API to evaluate model performance. The examples show the setup and execution of ranking evaluations in both Scala and Java environments. ```scala import org.apache.spark.mllib.evaluation.RankingMetrics // Example usage logic for RankingMetrics in Scala ``` ```java import org.apache.spark.mllib.evaluation.RankingMetrics; // Example usage logic for RankingMetrics in Java ``` -------------------------------- ### Build SparkR Documentation Source: https://github.com/apache/spark/blob/master/docs/README.md Execute this script from the $SPARK_HOME/R directory to build SparkR documentation. Ensure Spark is built first. ```sh $SPARK_HOME/R/create-docs.sh ``` -------------------------------- ### Correct Substring Indexing in SparkR's substr Method Source: https://github.com/apache/spark/blob/master/docs/sparkr-migration-guide.md In SparkR versions prior to 2.3.1, the `start` parameter of the `substr` method was incorrectly 0-based, leading to inconsistent results compared to R's native `substr`. This has been fixed in 2.3.1 and later, making the `start` parameter 1-based. ```r # SparkR 2.3.0 and earlier (incorrect 0-based start) # substr(lit('abcdef'), 2, 4) # Result: 'abc' # SparkR 2.3.1 and later (correct 1-based start) substr(lit('abcdef'), 2, 4) # Result: 'bcd' ``` -------------------------------- ### Create Temporary SQL Function with Session Qualifier Source: https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-sql-function.md Illustrates creating temporary SQL functions using different qualifiers (`session`, `system.session`) and how they all refer to the same per-session function. Also shows how to drop and the restrictions on qualifiers. ```sql -- Unqualified, `session`-qualified, and `system.session`-qualified names all create the same -- temporary function in the per-session `system.session` namespace. CREATE TEMPORARY FUNCTION add_one(x INT) RETURNS INT RETURN x + 1; CREATE OR REPLACE TEMPORARY FUNCTION session.add_one(x INT) RETURNS INT RETURN x + 1; CREATE OR REPLACE TEMPORARY FUNCTION system.session.add_one(x INT) RETURNS INT RETURN x + 1; -- All three names refer to the same temporary function: SELECT add_one(1), session.add_one(1), system.session.add_one(1); 2 2 2 -- DROP TEMPORARY FUNCTION accepts the same qualifiers: DROP TEMPORARY FUNCTION session.add_one; -- Any other qualifier on a TEMPORARY function is rejected. CREATE TEMPORARY FUNCTION mydb.bad_temp() RETURNS INT RETURN 1; [INVALID_TEMP_OBJ_QUALIFIER] qualifier `mydb` is not allowed for temporary FUNCTION ... CREATE TEMPORARY FUNCTION system.builtin.bad_temp() RETURNS INT RETURN 1; [INVALID_TEMP_OBJ_QUALIFIER] qualifier `system`.`builtin` is not allowed for temporary FUNCTION ... ``` -------------------------------- ### Initialize SparkSession for Structured Streaming Source: https://github.com/apache/spark/blob/master/docs/streaming/getting-started.md Initializes the SparkSession, which serves as the entry point for all Spark functionality. This setup is required before defining streaming queries or data transformations. ```python from pyspark.sql import SparkSession from pyspark.sql.functions import explode from pyspark.sql.functions import split spark = SparkSession \ .builder \ .appName("StructuredNetworkWordCount") \ .getOrCreate() ``` ```scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.SparkSession val spark = SparkSession .builder .appName("StructuredNetworkWordCount") .getOrCreate() import spark.implicits._ ``` ```java import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.sql.*; import org.apache.spark.sql.streaming.StreamingQuery; import java.util.Arrays; import java.util.Iterator; SparkSession spark = SparkSession .builder() .appName("JavaStructuredNetworkWordCount") .getOrCreate(); ``` ```r sparkR.session(appName = "StructuredNetworkWordCount") ``` -------------------------------- ### Serve Spark Documentation Locally Source: https://github.com/apache/spark/blob/master/python/docs/source/development/contributing.md Use this command to serve the Spark documentation locally and watch for changes. Ensure you are in the 'docs' directory. ```bash SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll serve --watch ``` -------------------------------- ### Run Python Example Source: https://github.com/apache/spark/blob/master/docs/rdd-programming-guide.md Submit Python Spark applications using the `spark-submit` script. Ensure the path to your Python script is correctly specified. ```bash ./bin/spark-submit examples/src/main/python/pi.py ``` -------------------------------- ### Python Spark Application Example Source: https://github.com/apache/spark/blob/master/docs/quick-start.md A simple PySpark application that counts lines containing 'a' and 'b' in a text file. It uses SparkSession for data processing and can be run via spark-submit or the Python interpreter if PySpark is pip installed. ```python """SimpleApp.py""" from pyspark.sql import SparkSession logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system spark = SparkSession.builder.appName("SimpleApp").getOrCreate() logData = spark.read.text(logFile).cache() numAs = logData.filter(logData.value.contains('a')).count() numBs = logData.filter(logData.value.contains('b')).count() print("Lines with a: %i, lines with b: %i" % (numAs, numBs)) spark.stop() ``` ```bash # Use spark-submit to run your application $ YOUR_SPARK_HOME/bin/spark-submit \ --master "local[4]" \ SimpleApp.py ... Lines with a: 46, Lines with b: 23 ``` ```bash # Use the Python interpreter to run your application $ python SimpleApp.py ... Lines with a: 46, Lines with b: 23 ``` -------------------------------- ### Build Site with API Docs using Jekyll Source: https://github.com/apache/spark/blob/master/docs/README.md Run this command in the 'docs' directory to build the Spark website, which also copies over generated API documentation. This process may take time if API docs haven't been generated recently. ```sh bundle exec jekyll build ```