### Build Arrow Vala Example

Source: https://github.com/apache/arrow/blob/main/c_glib/example/vala/README.md

This command demonstrates how to compile a Vala example that uses the Arrow GLib library. It requires the 'arrow-glib' and 'posix' packages to be installed.

```console
valac --pkg arrow-glib --pkg posix XXX.vala
```

--------------------------------

### Create a Schema with Fields and Metadata in Java

Source: https://github.com/apache/arrow/blob/main/docs/source/java/quickstartguide.rst

Demonstrates how to construct an Arrow Schema, which is a collection of Fields defining the columns of a dataset. This example includes creating two fields (an int32 and a UTF-8 string) and associating metadata with the schema itself.

```Java
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.util.HashMap;
import java.util.Map;
import static java.util.Arrays.asList;

Map<String, String> metadata = new HashMap<>();
metadata.put("K1", "V1");
metadata.put("K2", "V2");

Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
Schema schema = new Schema(asList(a, b), metadata);
System.out.println("Schema created: " + schema);
```

--------------------------------

### Setup Python Virtual Environment and Install Dependencies

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/python/building.rst

Creates a Python virtual environment named 'pyarrow-dev', activates it, and installs Python build dependencies from the specified requirements file. It also creates a 'dist' directory for library installation.
```bash
$ python3 -m venv pyarrow-dev
$ source ./pyarrow-dev/bin/activate
$ pip install -r arrow/python/requirements-build.txt
$ # This is the folder where we will install the Arrow libraries during
$ # development
$ mkdir dist
```

--------------------------------

### Prepare Arrow Site Fork for Documentation

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/release.rst

Clones your fork of the apache-arrow-site repository and sets up the 'upstream' remote. This is a one-time setup for preparing documentation.

```Bash
## Prepare your fork of https://github.com/apache/arrow-site .
## You need to do this only once.
# git clone git@github.com:kou/arrow-site.git ../
git clone git@github.com:<your-github-id>/arrow-site.git ../
cd ../arrow-site
## Add git@github.com:apache/arrow-site.git as "upstream" remote.
git remote add upstream git@github.com:apache/arrow-site.git
cd -
```

--------------------------------

### Verify Git Remotes

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/guide/step_by_step/set_up.rst

Displays the configured remote repositories for your local Arrow clone. This command helps verify that both your personal fork ('origin') and the official repository ('upstream') are correctly set up.

```console
$ git remote -v
```

--------------------------------

### Basic CMake Project Setup for Arrow Examples

Source: https://github.com/apache/arrow/blob/main/cpp/examples/tutorial_examples/CMakeLists.txt

This snippet sets the minimum CMake version, defines the project name, finds the Arrow Dataset package, and configures C++ compilation standards and flags. It's essential for initializing the build environment for Arrow examples.
```cmake
cmake_minimum_required(VERSION 3.25)

project(ArrowTutorialExamples)

find_package(ArrowDataset)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Werror -Wall -Wextra")
set(CMAKE_BUILD_TYPE Release)

message(STATUS "Arrow version: ${ARROW_VERSION}")
message(STATUS "Arrow SO version: ${ARROW_FULL_SO_VERSION}")
```

--------------------------------

### Setup Benchmarks Repository

Source: https://github.com/apache/arrow/blob/main/dev/conbench_envs/README.md

Clones the conbench benchmarks repository and installs it in development mode, making benchmark scripts and configurations available.

```bash
git clone https://github.com/ursacomputing/benchmarks.git
pushd benchmarks
python setup.py develop
popd
```

--------------------------------

### Configure Git User Information

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/guide/step_by_step/set_up.rst

Sets the global Git configuration for your username and email address. This is essential for tracking your contributions. Ensure you replace 'Your Name' and 'your.email@example.com' with your actual details.

```console
$ git config --global user.name "Your Name"
$ git config --global user.email your.email@example.com
```

--------------------------------

### Development Setup for Red Arrow

Source: https://github.com/apache/arrow/blob/main/ruby/red-arrow/README.md

Instructions for setting up the development environment for Red Arrow, including installing master versions of Arrow C++/GLib and running tests.

```bash
cd ruby/red-arrow
bundle install
bundle exec rake test
```

--------------------------------

### Clone Forked Arrow Repository

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/guide/step_by_step/set_up.rst

Clones your personal fork of the Apache Arrow repository to your local machine. Replace '<your-github-username>' with your actual GitHub username. This command downloads the entire project history.
```console
$ git clone https://github.com/<your-github-username>/arrow.git
```

--------------------------------

### Prepare Release Candidate: Git and GPG Setup

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/release.rst

Commands to prepare for creating a release candidate. This includes deleting any local tags for the release candidate version and sourcing a script to set up the GPG agent for signing artifacts. These steps are essential for ensuring a clean and secure release process.

```bash
# Delete the local tag for RC1 or later
git tag -d apache-arrow-<version>

# Setup gpg agent for signing binary artifacts
source dev/release/setup-gpg-agent.sh
```

--------------------------------

### Load Partitioned Dataset in Python

Source: https://github.com/apache/arrow/blob/main/docs/source/python/getstarted.rst

Demonstrates loading a partitioned dataset created with the Arrow dataset API. The API automatically detects partitions, enabling lazy loading of data chunks.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import datetime

# Load the partitioned dataset
birthdays_dataset = ds.dataset("savedir", format="parquet", partitioning=["years"])

# Access files within the dataset
print(birthdays_dataset.files)

# Iterate over batches and perform computations (e.g., calculate ages)
current_year = datetime.datetime.now(datetime.UTC).year
for table_chunk in birthdays_dataset.to_batches():
    print("AGES", pc.subtract(current_year, table_chunk["years"]))
```

--------------------------------

### Build and Install Arrow GLib with Meson and CMake Prefix Path

Source: https://github.com/apache/arrow/blob/main/c_glib/README.md

This Meson setup command is used when Arrow GLib needs to reference a locally built Arrow C++ library. It explicitly specifies the path to the Arrow C++ installation using the `--cmake-prefix-path` option, which can help resolve build mismatches.
```bash
$ meson setup c_glib.build c_glib --cmake-prefix-path=${arrow_cpp_install_prefix} -Dgtk_doc=true
```

--------------------------------

### Configure Filesystem Examples (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

This CMake code snippet configures the build for Filesystem examples. It defines a shared library for filesystem definitions and adds a filesystem usage example, linking necessary libraries and setting compile definitions for the example.

```cmake
if(ARROW_FILESYSTEM)
  add_library(filesystem_definition_example MODULE filesystem_definition_example.cc)
  target_link_libraries(filesystem_definition_example ${ARROW_EXAMPLE_LINK_LIBS})

  add_arrow_example(filesystem_usage_example)
  target_compile_definitions(filesystem-usage-example
                             PUBLIC FILESYSTEM_EXAMPLE_LIBPATH="$<TARGET_FILE:filesystem_definition_example>")
endif()
```

--------------------------------

### Build and Install Arrow GLib with Meson (macOS, Homebrew)

Source: https://github.com/apache/arrow/blob/main/c_glib/README.md

This set of commands builds and installs Arrow GLib on macOS using Meson and Homebrew. It first installs dependencies from the Brewfile, sets up a release build with 'meson setup', compiles the project with 'meson compile', and finally installs it with 'sudo meson install'.

```bash
$ brew bundle --file=c_glib/Brewfile
$ meson setup c_glib.build c_glib --buildtype=release
$ meson compile -C c_glib.build
$ sudo meson install -C c_glib.build
```

--------------------------------

### Setup Temporary Working Directory in Python

Source: https://github.com/apache/arrow/blob/main/docs/source/python/getstarted.rst

This snippet sets up a custom temporary working directory for file operations within the script. It saves the original directory and changes to the temporary one.
```python
import os
import tempfile

orig_working_dir = os.getcwd()
temp_working_dir = tempfile.mkdtemp(prefix="pyarrow-")
os.chdir(temp_working_dir)
```

--------------------------------

### Complete Dataset Example Code (C++)

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/datasets_tutorial.rst

This snippet provides the complete C++ code for the dataset example. It encompasses all the configurations and operations shown in the preceding sections, allowing users to review and run the entire example.

```cpp
// Doc section: Dataset Example
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <memory>
#include <string>

using arrow::Status;

Status ExampleWriteDataset(const std::shared_ptr<arrow::Table>& table,
                           const std::string& path) {
  // Make a local filesystem
  std::shared_ptr<arrow::fs::FileSystem> local_fs =
      std::make_shared<arrow::fs::LocalFileSystem>();

  // Make a partitioning method, declaring that we'd use Hive-style --
  // this is where we actually pass that to our writing function:
  auto partition_schema = arrow::schema({arrow::field("a", arrow::utf8())});
  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);

  // Options for writing to the filesystem
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = local_fs;
  write_options.base_dir = path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";

  // Prepare a Scanner over the in-memory table
  auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // Write Dataset to Disk
  ARROW_RETURN_NOT_OK(arrow::dataset::FileSystemDataset::Write(write_options, scanner));
  return arrow::Status::OK();
}
```

--------------------------------

### Install Arrow Headers

Source: https://github.com/apache/arrow/blob/main/cpp/src/arrow/vendored/CMakeLists.txt

This command installs the necessary headers for the Apache Arrow project, specifically from the 'arrow/vendored' directory. It's a prerequisite for building or using Arrow components.
```cmake
arrow_install_all_headers("arrow/vendored")
```

--------------------------------

### Write Partitioned Dataset in Python

Source: https://github.com/apache/arrow/blob/main/docs/source/python/getstarted.rst

Shows how to write an Arrow Table to disk as a partitioned dataset using the dataset API. Partitioning organizes data into smaller chunks based on column values, improving query performance for large datasets.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Assuming 'birthdays_table' is already defined
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())
birthdays_table = pa.table([days, months, years], names=["days", "months", "years"])

ds.write_dataset(birthdays_table, "savedir", format="parquet",
                 partitioning=ds.partitioning(
                     pa.schema([birthdays_table.schema.field("years")])
                 ))
```

--------------------------------

### Install Arrow Python Headers

Source: https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/CMakeLists.txt

Installs all necessary headers for the Arrow Python package. This is crucial for building Python extensions that interact with Arrow data structures.

```cmake
arrow_install_all_headers("arrow/python")
```

--------------------------------

### Configure Dataset Examples (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

This CMake code snippet configures the build for various dataset examples, including parquet scan, documentation, and execution plan examples. It conditionally enables these examples when ARROW_PARQUET and ARROW_DATASET are set, and links against appropriate shared or static dataset libraries.
```cmake
if(ARROW_PARQUET AND ARROW_DATASET)
  if(ARROW_BUILD_SHARED)
    set(DATASET_EXAMPLES_LINK_LIBS arrow_dataset_shared)
  else()
    set(DATASET_EXAMPLES_LINK_LIBS arrow_dataset_static)
  endif()

  add_arrow_example(dataset_parquet_scan_example EXTRA_LINK_LIBS
                    ${DATASET_EXAMPLES_LINK_LIBS})
  add_dependencies(dataset-parquet-scan-example parquet)

  add_arrow_example(dataset_documentation_example EXTRA_LINK_LIBS
                    ${DATASET_EXAMPLES_LINK_LIBS})
  add_dependencies(dataset-documentation-example parquet)

  add_arrow_example(execution_plan_documentation_examples EXTRA_LINK_LIBS
                    ${DATASET_EXAMPLES_LINK_LIBS})
  add_dependencies(execution-plan-documentation-examples parquet)

  if(PARQUET_REQUIRE_ENCRYPTION)
    add_arrow_example(parquet_column_encryption
                      EXTRA_SOURCES
                      ${PROJECT_SOURCE_DIR}/src/parquet/encryption/test_in_memory_kms.cc
                      EXTRA_LINK_LIBS
                      ${DATASET_EXAMPLES_LINK_LIBS})
    add_dependencies(parquet-column-encryption parquet)
  endif()

  if(ARROW_CSV)
    add_arrow_example(join_example EXTRA_LINK_LIBS ${DATASET_EXAMPLES_LINK_LIBS})
    add_dependencies(join-example parquet)
  endif()

  add_arrow_example(udf_example)
endif()
```

--------------------------------

### Basic Ruby Usage Example for Red Arrow Dataset

Source: https://github.com/apache/arrow/blob/main/ruby/red-arrow-dataset/README.md

This Ruby code snippet demonstrates the basic setup for using the Red Arrow Dataset library. It requires the 'arrow-dataset' gem and serves as a starting point for further dataset operations.

```ruby
require "arrow-dataset"
# TODO
```

--------------------------------

### Create an Int32 ValueVector in Java

Source: https://github.com/apache/arrow/blob/main/docs/source/java/quickstartguide.rst

Demonstrates the creation of a ValueVector to hold a sequence of 32-bit integers, including support for null values. It requires BufferAllocator for memory management and shows how to set values and their count.
```Java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

try(
    BufferAllocator allocator = new RootAllocator();
    IntVector intVector = new IntVector("fixed-size-primitive-layout", allocator);
){
    intVector.allocateNew(3);
    intVector.set(0, 1);
    intVector.setNull(1);
    intVector.set(2, 2);
    intVector.setValueCount(3);
    System.out.println("Vector created in memory: " + intVector);
}
```

--------------------------------

### Install All Archery Packages

Source: https://github.com/apache/arrow/blob/main/dev/archery/README.md

Installs all available Archery subpackages at once, providing a convenient way to get all functionalities for Arrow development. This is an alias executed with pip.

```shell
pip install -e "arrow/dev/archery[all]"
```

--------------------------------

### Setup Ubuntu for Apache Arrow Release Verification

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/release_verification.rst

This script installs the necessary packages on an Ubuntu system to perform a source verification of an Apache Arrow release candidate. It should be run from the root of the Arrow clone directory.

```bash
# From the arrow clone
sudo dev/release/setup-ubuntu.sh
```

--------------------------------

### Build Examples of C++ API Usage

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/cpp/building.rst

Enables the compilation of example programs demonstrating how to use the Arrow C++ API. This is helpful for developers learning to integrate Arrow.

```bash
cmake .. -DARROW_BUILD_EXAMPLES=ON
```

--------------------------------

### Development Setup on macOS with Homebrew

Source: https://github.com/apache/arrow/blob/main/ruby/red-arrow/README.md

Specific development setup for macOS users with Homebrew, detailing the installation of head versions of Apache Arrow and Arrow GLib before running tests.
```bash
cd ruby/red-arrow
bundle install
brew install apache-arrow --head
brew install apache-arrow-glib --head
bundle exec rake test
```

--------------------------------

### C++: Running an Acero Execution Plan Directly

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/acero/user_guide.rst

This outlines the steps to directly run an Acero execution plan when the standard `DeclarationToXyz` methods are insufficient. This is typically needed for unique scenarios like custom sink nodes or plans with multiple outputs. It involves creating an `ExecPlan`, adding sink nodes, adding declarations, validating, and starting the plan.

```c++
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Conceptual steps for running a plan directly:

// 1. Create a new ExecPlan object.
// ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::acero::ExecPlan> plan,
//                       arrow::acero::ExecPlan::Make());

// 2. Add sink nodes to your graph of Declaration objects.
// (Each output of the graph must be terminated by a sink declaration,
// e.g. a custom sink node.)

// 3. Use Declaration::AddToPlan to add your declaration to your plan.
// If the graph has multiple outputs, add the nodes one at a time.
// ARROW_ASSIGN_OR_RAISE(arrow::acero::ExecNode* node,
//                       declaration.AddToPlan(plan.get()));

// 4. Validate the plan.
// ARROW_RETURN_NOT_OK(plan->Validate());

// 5. Start the plan.
// plan->StartProducing();

// 6. Wait for the plan to finish.
// ARROW_RETURN_NOT_OK(plan->finished().status());

// Note: This is a simplified representation. Actual implementation requires
// specific node declarations and detailed API usage.
```

--------------------------------

### Install Arrow Headers

Source: https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/CMakeLists.txt

This function installs all header files for a specified Arrow component. It takes the component path as an argument, for example, 'arrow/array'. This is essential for ensuring that external projects can correctly compile against Arrow libraries.

```cmake
arrow_install_all_headers("arrow/array")
```

--------------------------------

### Example Server Listening Message

Source: https://github.com/apache/arrow/blob/main/docs/source/java/flight.rst

This is an example output message from a running Flight server indicating the port it is listening on. This message is typically printed to standard output when the server starts successfully.

```shell
Server listening on port 58104
```

--------------------------------

### Add Filesystem Test Suite

Source: https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/CMakeLists.txt

Defines a test suite named 'filesystem-test' for Arrow's filesystem component. It specifies source files, extra labels for organization, and compilation definitions, including the path to the Arrow filesystem example library.

```cmake
add_arrow_test(filesystem-test
               SOURCES
               filesystem_test.cc
               localfs_test.cc
               EXTRA_LABELS
               filesystem
               DEFINITIONS
               ARROW_FILESYSTEM_EXAMPLE_LIBPATH="$<TARGET_FILE:arrow_filesystem_example>"
               EXTRA_DEPENDENCIES
               arrow_filesystem_example)
```

--------------------------------

### Create CMake Build with Custom Install Prefix

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/cpp/building.rst

This example shows how to create a CMake build using a preset while overriding a default configuration option. Specifically, it sets the CMAKE_INSTALL_PREFIX to '/usr/local', ensuring the build artifacts are installed to that location.

```bash
$ cmake .. --preset ninja-debug-minimal -DCMAKE_INSTALL_PREFIX=/usr/local
```

--------------------------------

### Build Filesystem Example Library (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/src/arrow/testing/CMakeLists.txt

Builds the 'arrow_filesystem_example' library module if the ARROW_FILESYSTEM option is enabled. This involves compiling 'examplefs.cc' and linking it with necessary Arrow test and example libraries.

```cmake
if(ARROW_FILESYSTEM)
  add_library(arrow_filesystem_example MODULE examplefs.cc)
  target_link_libraries(arrow_filesystem_example ${ARROW_TEST_LINK_LIBS}
                        ${ARROW_EXAMPLE_LINK_LIBS})
endif()
```

--------------------------------

### String Data Buffer Layout Example

Source: https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions/Examples.rst

Illustrates the memory layout for a string data buffer, including validity bitmap, offsets, and the actual string values. The offsets buffer indicates the start of each string within the value buffer, and the validity bitmap tracks nullability.

```text
* field-1 array (`String` typed_value)
* Length: 10, Null count: 7
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Byte 1   | Bytes 2-63  |
  |--------------------------|----------|-------------|
  | 01000011                 | 00000000 | 0 (padding) |

* Offsets buffer (int32):

  | Bytes 0-43                          | Bytes 44-63           |
  |-------------------------------------|-----------------------|
  | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding) |

* Value buffer:

  | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63           |
  |-----------|-----------|------------|-----------------------|
  | noop      | login     | noop       | unspecified (padding) |
```

--------------------------------

### Create an Arrow Array in Python

Source: https://github.com/apache/arrow/blob/main/docs/source/python/getstarted.rst

Demonstrates how to create a basic Arrow Array with a specified data type. This is a fundamental building block for Arrow data structures.
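The `String` buffer layout illustrated earlier can be cross-checked with a few lines of plain Python. This is a minimal sketch (stdlib only, no pyarrow); the concrete column values are an assumption read off the offsets table, i.e. slots 0, 1, and 6 hold "noop", "login", and "noop", and the other seven slots are null.

```python
# Recompute the offsets buffer and validity byte for the example column.
values = ["noop", "login", None, None, None, None, "noop", None, None, None]

offsets = [0]
data = b""
for v in values:
    if v is not None:
        data += v.encode("utf-8")
    offsets.append(len(data))  # a null slot repeats the previous offset

# Validity bitmap: bit i is set iff slot i is non-null (bits are LSB-first).
validity = sum(1 << i for i, v in enumerate(values) if v is not None)

print(offsets)            # [0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13]
print(data)               # b'nooploginnoop'
print(f"{validity:08b}")  # 01000011
```

Note that `offsets[i+1] - offsets[i]` gives the byte length of slot `i`, which is why the seven null slots simply repeat the previous offset.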
```python
import pyarrow as pa

days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
```

--------------------------------

### Local FileSystem Example

Source: https://github.com/apache/arrow/blob/main/docs/source/python/filesystems.rst

Demonstrates how to use the `LocalFileSystem` to write data to a file and then read it back.

```APIDOC
## Local FS Example

### `fs.LocalFileSystem`

Provides access to files on the local machine.

**Example:**

```python
from pyarrow import fs

local = fs.LocalFileSystem()

# Write data to a file
with local.open_output_stream('/tmp/pyarrowtest.dat') as stream:
    stream.write(b'data')

# Reading the data would typically involve open_input_stream and reading the content.
```
```

--------------------------------

### CPP: Example of Consuming Sink Execution Node Implementation

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/acero/user_guide.rst

Provides the C++ code for implementing and using the Consuming Sink execution node. This example defines a `CustomSinkNodeConsumer` that increments a counter for each consumed batch and returns an OK status. It then creates an `ExecNode` of type `consuming_sink` using this consumer.

```cpp
// ConsumingSink Example
// The consuming sink example is not included in this file.
```

--------------------------------

### Prepare Environment for Dataset Reading (C++)

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/datasets_tutorial.rst

Initializes the environment for reading a dataset by calling the PrepareEnv helper function. This ensures that the necessary files and directories for the dataset are created on disk before proceeding.

```cpp
ARROW_RETURN_NOT_OK(PrepareEnv(root.path().ToString(), argc, argv));
```

--------------------------------

### Add Arrow Example with Flight SQL

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

Configures and adds an Arrow Flight SQL example.
The linking libraries depend on whether Arrow is built as a shared or static library. Requires the ARROW_FLIGHT_SQL build flag.

```cmake
if(ARROW_FLIGHT_SQL)
  if(ARROW_BUILD_SHARED AND ARROW_GRPC_USE_SHARED)
    set(FLIGHT_SQL_EXAMPLES_LINK_LIBS arrow_flight_sql_shared)
  else()
    set(FLIGHT_SQL_EXAMPLES_LINK_LIBS arrow_flight_sql_static)
  endif()

  add_arrow_example(flight_sql_example
```

--------------------------------

### Save and Load Arrow Table to Parquet in Python

Source: https://github.com/apache/arrow/blob/main/docs/source/python/getstarted.rst

Illustrates saving an Arrow Table to a Parquet file and then loading it back. Parquet is a common columnar storage format optimized for analytics.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Assuming 'birthdays_table' is already defined
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())
birthdays_table = pa.table([days, months, years], names=["days", "months", "years"])

# Save table to Parquet
pq.write_table(birthdays_table, 'birthdays.parquet')

# Load table back from Parquet
reloaded_birthdays = pq.read_table('birthdays.parquet')
reloaded_birthdays
```

--------------------------------

### Complete File I/O Example Code

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/io_tutorial.rst

This is the complete C++ code for the tutorial examples, covering file I/O operations for CSV and Parquet formats within the Apache Arrow library. It demonstrates reading and writing data using various Arrow components.
```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <iostream>
#include <memory>

arrow::Status Main() {
  // --- Build a simple table to write ---
  arrow::Int64Builder int_builder;
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(int_builder.AppendValues({1, 2, 3}));
  ARROW_RETURN_NOT_OK(string_builder.AppendValues({"a", "b", "c"}));
  std::shared_ptr<arrow::Array> int_array;
  std::shared_ptr<arrow::Array> str_array;
  ARROW_RETURN_NOT_OK(int_builder.Finish(&int_array));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&str_array));

  auto schema = arrow::schema({arrow::field("int_col", arrow::int64()),
                               arrow::field("str_col", arrow::utf8())});
  std::shared_ptr<arrow::Table> table_to_write =
      arrow::Table::Make(schema, {int_array, str_array});

  // --- CSV Write Example ---
  ARROW_ASSIGN_OR_RAISE(auto csv_outfile,
                        arrow::io::FileOutputStream::Open("output.csv"));
  ARROW_ASSIGN_OR_RAISE(
      auto csv_writer,
      arrow::csv::MakeCSVWriter(csv_outfile, table_to_write->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table_to_write));
  ARROW_RETURN_NOT_OK(csv_writer->Close());
  std::cout << "Wrote to output.csv" << std::endl;

  // --- CSV Read Example ---
  ARROW_ASSIGN_OR_RAISE(auto csv_infile,
                        arrow::io::ReadableFile::Open("output.csv"));
  ARROW_ASSIGN_OR_RAISE(
      auto csv_reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(), csv_infile,
                                    arrow::csv::ReadOptions::Defaults(),
                                    arrow::csv::ParseOptions::Defaults(),
                                    arrow::csv::ConvertOptions::Defaults()));
  ARROW_ASSIGN_OR_RAISE(auto csv_table, csv_reader->Read());
  std::cout << "Read from output.csv" << std::endl;

  // --- Parquet Write Example ---
  ARROW_ASSIGN_OR_RAISE(auto parquet_outfile,
                        arrow::io::FileOutputStream::Open("output.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table_to_write, arrow::default_memory_pool(), parquet_outfile, 100000));
  std::cout << "Wrote to output.parquet" << std::endl;

  // --- Parquet Read Example ---
  std::shared_ptr<arrow::io::ReadableFile> infile_parquet;
  ARROW_ASSIGN_OR_RAISE(infile_parquet,
                        arrow::io::ReadableFile::Open("output.parquet"));
  std::unique_ptr<parquet::arrow::FileReader> reader_parquet;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile_parquet, arrow::default_memory_pool(), &reader_parquet));
  std::shared_ptr<arrow::Table> table_read_parquet;
  ARROW_RETURN_NOT_OK(reader_parquet->ReadTable(&table_read_parquet));
  std::cout << "Read from output.parquet" << std::endl;

  return arrow::Status::OK();
}

int main(int argc, char** argv) {
  arrow::Status status = Main();
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```

--------------------------------

### Install Red Arrow Flight Gem

Source: https://github.com/apache/arrow/blob/main/ruby/red-arrow-flight/README.md

This command installs the Red Arrow Flight gem. It requires Apache Arrow Flight GLib to be installed beforehand. Ensure you follow the Apache Arrow installation guide for prerequisites.

```console
gem install red-arrow-flight
```

--------------------------------

### Configure Parquet Read/Write Examples (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

This CMake code snippet configures the build process for Parquet read/write examples. It conditionally adds the example based on the ARROW_PARQUET build flag and links against either shared or static Parquet libraries depending on ARROW_BUILD_SHARED.
```cmake
if(ARROW_PARQUET)
  if(ARROW_BUILD_SHARED)
    add_arrow_example(parquet_read_write EXTRA_LINK_LIBS parquet_shared)
  else()
    add_arrow_example(parquet_read_write EXTRA_LINK_LIBS parquet_static)
  endif()
endif()
```

--------------------------------

### Basic Flight Client and Server Operations in Java

Source: https://github.com/apache/arrow/blob/main/docs/source/java/flight.rst

Demonstrates basic client-server interaction using Apache Arrow Flight. It shows how to establish a connection, get a stream, and handle stream cancellation on both the client and server sides. This example requires a BufferAllocator and Location for connection.

```Java
Location location = Location.forGrpcInsecure("0.0.0.0", 58609);
try(BufferAllocator allocator = new RootAllocator();
    FlightClient tutorialFlightClient = FlightClient.builder(allocator, location).build()){
    try(FlightStream flightStream = tutorialFlightClient.getStream(new Ticket(new byte[]{}))) {
        // ...
        flightStream.cancel("tutorial-cancel", new Exception("Testing cancellation option!"));
    }
} catch (Exception e) {
    e.printStackTrace();
}

// Server
@Override
public void getStream(CallContext context, Ticket ticket, ServerStreamListener listener) {
    // ...
    listener.setOnCancelHandler(()->{
        // Implement logic to handle cancellation option
    });
}
```

--------------------------------

### Add Arrow Example with Compute and CSV

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

Adds an Arrow example that combines compute functionality with CSV reading/writing. The example links against either shared or static Arrow compute libraries, depending on the build configuration. Requires the ARROW_COMPUTE and ARROW_CSV build flags.
```cmake
if(ARROW_COMPUTE AND ARROW_CSV)
  if(ARROW_BUILD_SHARED)
    set(COMPUTE_KERNELS_LINK_LIBS arrow_compute_shared)
  else()
    set(COMPUTE_KERNELS_LINK_LIBS arrow_compute_static)
  endif()

  add_arrow_example(compute_and_write_csv_example EXTRA_LINK_LIBS
                    ${COMPUTE_KERNELS_LINK_LIBS})
endif()
```

--------------------------------

### Run Python Tests with Pytest

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/guide/index.rst

This snippet demonstrates how to run tests for the PyArrow library using the pytest framework from the terminal. It assumes pytest is installed and configured for the project.

```console
$ pytest pyarrow
```

--------------------------------

### Build and Install Arrow GLib with Meson (Others)

Source: https://github.com/apache/arrow/blob/main/c_glib/README.md

These console commands outline the process of building and installing Arrow GLib on systems other than macOS using Meson. It involves setting up the build directory, compiling the project, and then installing it. The '-Dgtk_doc=true' flag is used to enable GTK-Doc generation.

```bash
$ meson setup c_glib.build c_glib -Dgtk_doc=true
$ meson compile -C c_glib.build
$ sudo meson install -C c_glib.build
```

--------------------------------

### Add Upstream Remote for Arrow Repository

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/guide/step_by_step/set_up.rst

Adds the official Apache Arrow repository as a remote named 'upstream' to your local clone. This allows you to fetch changes from the main project. First, navigate into the cloned repository directory.

```console
$ cd arrow
$ git remote add upstream https://github.com/apache/arrow
```

--------------------------------

### Complete Compute Example Code in C++

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/compute_tutorial.rst

Provides the full source code for the Apache Arrow compute function tutorial example.
The listing below includes the necessary headers, the options setup, the function call, and inspection of the result.

```cpp
#include <cstdint>
#include <iostream>

#include "arrow/api.h"
#include "arrow/compute/api.h"
#include "arrow/testing/gtest_util.h"

using arrow::Datum;
using arrow::Int64Scalar;
using arrow::compute::CallFunction;
using arrow::compute::IndexOptions;

namespace {

// Doc section: Index Call
arrow::Result<Datum> IndexCallExample() {
  // Example data: a simple array of integers
  std::shared_ptr<arrow::Array> array =
      arrow::ArrayFromJSON(arrow::int64(), "[1, 5, 2223, 8, 10]");

  // Doc section: IndexOptions Declare and Assign
  IndexOptions options(arrow::MakeScalar(int64_t{2223}));

  return CallFunction("index", {Datum(array)}, &options);
}

}  // namespace

// Doc section: Compute Example
TEST(ComputeTest, Index) {
  ASSERT_OK_AND_ASSIGN(Datum result, IndexCallExample());

  // Doc section: Index Inspection
  // One last time, let's see what our Datum has! It is a Scalar holding a
  // 64-bit integer, and the value is 2 (the position of 2223 in the input).
  ASSERT_TRUE(result.is_scalar());
  ASSERT_EQ(result.scalar_as<Int64Scalar>().value, 2);
}
```

--------------------------------

### Install pyarrow with Flight RPC support

Source: https://github.com/apache/arrow/blob/main/docs/source/python/install.rst

This command installs the pyarrow package and adds support for the Flight RPC framework. This is a custom selection for users who require Flight capabilities beyond the standard pyarrow installation.
```shell
conda install -c conda-forge pyarrow libarrow-flight
```

--------------------------------

### Get Timezone Data Path using tzdata

Source: https://github.com/apache/arrow/blob/main/docs/source/python/install.rst

Retrieves the installation path of the tzdata package using Python. This is useful for setting the TZDIR environment variable or understanding where timezone data is located.

```python
import tzdata
print(tzdata.__file__)
```

--------------------------------

### Install and Activate Emscripten SDK

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/cpp/emscripten.rst

This snippet demonstrates how to clone the Emscripten SDK repository, install a specific version, activate it, and set up the environment variables for cross-compilation.

```shell
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
# Replace <version> with the desired EMSDK version,
# e.g. for Pyodide 0.26 you need EMSDK version 3.1.58.
./emsdk install <version>
./emsdk activate <version>
source ./emsdk_env.sh
```

--------------------------------

### Install PyArrow using Pip

Source: https://github.com/apache/arrow/blob/main/docs/source/python/install.rst

Installs the latest version of PyArrow from PyPI for Windows, Linux, and macOS. Ensure pip is version 19.0 or higher on Linux for prebuilt binary packages.

```bash
pip install pyarrow
```

--------------------------------

### Generate Partitioned Dataset Files (C++)

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/datasets_tutorial.rst

Helper function to generate a partitioned dataset on disk for the tutorial. It creates sample data and writes it out in a partitioned format, creating the necessary directories and files.

```cpp
arrow::Status PrepareEnv(const std::string& path, int /*argc*/, char** /*argv*/) {
  // For this tutorial, we'll generate some data on disk.
  // In practice, you'll likely have your own dataset.
  std::shared_ptr<arrow::fs::FileSystem> fs;
  ARROW_ASSIGN_OR_RAISE(fs, arrow::fs::FileSystemFromUri(path));

  // Use the main tutorial example file to generate a dataset.
  // This file contains the logic for generating the dataset files.
  return GenerateDataset(fs, path);
}
```

--------------------------------

### Starting a Flight Server

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/flight.rst

To start a server, create a `Location` to specify where to listen, and call `FlightServerBase::Init`. The server can be configured to shut down on signals and then started with `Serve`.

```APIDOC
## Starting a Flight Server

### Description
To start a server, create a :class:`arrow::flight::Location` to specify where to listen, and call :func:`arrow::flight::FlightServerBase::Init`. This will start the server, but won't block the rest of the program. Use :func:`arrow::flight::FlightServerBase::SetShutdownOnSignals` to enable stopping the server if an interrupt signal is received, then call :func:`arrow::flight::FlightServerBase::Serve` to block until the server stops.

### Example
```cpp
// Assume `server` holds an instance of your FlightServerBase subclass
std::unique_ptr<arrow::flight::FlightServerBase> server;

// Initialize the server
arrow::flight::Location location;
// Listen to all interfaces on a free port
ARROW_CHECK_OK(arrow::flight::Location::ForGrpcTcp("0.0.0.0", 0, &location));
arrow::flight::FlightServerOptions options(location);

// Start the server
ARROW_CHECK_OK(server->Init(options));
// Exit with a clean error code (0) on SIGTERM
ARROW_CHECK_OK(server->SetShutdownOnSignals({SIGTERM}));

std::cout << "Server listening on localhost:" << server->port() << std::endl;
ARROW_CHECK_OK(server->Serve());
```
```

--------------------------------

### Install pyarrow and Build HTML Documentation

Source: https://github.com/apache/arrow/blob/main/docs/source/developers/documentation.rst

Installs the 'pyarrow' library in non-editable mode and then builds the HTML documentation. This is a workaround for potential issues with building Python documentation on macOS Monterey.
```shell
pushd arrow/docs
python -m pip install ../python --quiet
make html
popd
```

--------------------------------

### Add Arrow Example with Substrait Engine

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

Sets up and adds an Arrow example for Substrait engine integration. The required linking libraries are determined by the Arrow build configuration (shared or static). Requires the ARROW_SUBSTRAIT build flag.

```cmake
if(ARROW_SUBSTRAIT)
  if(ARROW_BUILD_SHARED)
    set(ENGINE_SUBSTRAIT_CONSUMPTION_LINK_LIBS arrow_substrait_shared)
  else()
    set(ENGINE_SUBSTRAIT_CONSUMPTION_LINK_LIBS arrow_substrait_static)
  endif()
  add_arrow_example(engine_substrait_consumption EXTRA_LINK_LIBS
                    ${ENGINE_SUBSTRAIT_CONSUMPTION_LINK_LIBS})
endif()
```

--------------------------------

### Install All Headers for Arrow Vendored Datetime

Source: https://github.com/apache/arrow/blob/main/cpp/src/arrow/vendored/datetime/CMakeLists.txt

Installs all necessary headers for the vendored datetime library within Apache Arrow. This CMake function ensures that the datetime components are correctly set up for use.

```cmake
arrow_install_all_headers("arrow/vendored/datetime")
```

--------------------------------

### Install pyarrow-core with Parquet support

Source: https://github.com/apache/arrow/blob/main/docs/source/python/install.rst

This command installs the core pyarrow package along with the libparquet library, enabling support for reading and writing Parquet files. It's useful when you need specific components and want to manage dependencies explicitly.

```shell
conda install -c conda-forge pyarrow-core libparquet
```

--------------------------------

### Install Documentation and Scripts (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/CMakeLists.txt

Installs license files, README, and GDB scripts to their respective destinations. This ensures that important documentation and helper scripts are available after installation.
```cmake
install(FILES ${CMAKE_CURRENT_SOURCE_DIR}/../LICENSE.txt
              ${CMAKE_CURRENT_SOURCE_DIR}/../NOTICE.txt
              ${CMAKE_CURRENT_SOURCE_DIR}/README.md
        DESTINATION "${ARROW_DOC_DIR}")

install(FILES ${CMAKE_CURRENT_SOURCE_DIR}/gdb_arrow.py
        DESTINATION "${ARROW_GDB_DIR}")
```

--------------------------------

### Add Dependencies for Parquet Examples (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/examples/parquet/CMakeLists.txt

Ensures that the main 'parquet' target depends on all the example executables being built. This guarantees that examples are built after the core Parquet library.

```cmake
add_dependencies(parquet
                 parquet-low-level-example
                 parquet-low-level-example2
                 parquet-arrow-example
                 parquet-stream-api-example)

if(PARQUET_REQUIRE_ENCRYPTION)
  add_dependencies(parquet parquet-encryption-example
                   parquet-encryption-example-all-crypto-options)
endif()
```

--------------------------------

### Basic Ruby Usage Example

Source: https://github.com/apache/arrow/blob/main/ruby/red-arrow-flight/README.md

This Ruby code snippet demonstrates the basic setup for using the Red Arrow Flight library. It requires the 'arrow-flight' gem to be loaded. The 'TODO' comment indicates that further implementation is needed for actual usage.

```ruby
require "arrow-flight"
# TODO
```

--------------------------------

### Perform Computations on Arrow Data in Python

Source: https://github.com/apache/arrow/blob/main/docs/source/python/getstarted.rst

Demonstrates using Arrow's compute functions to perform operations on table columns, such as calculating value counts. This leverages Arrow's optimized compute kernels.
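Conceptually, a value-counts computation pairs each distinct value in a column with the number of times it occurs. The same tally can be sketched in pure Python with the standard library's `collections.Counter` (an illustration of the idea only, not the pyarrow API):

```python
from collections import Counter

# Tally occurrences of each year; Counter preserves first-appearance order
years = [1990, 2000, 1995, 2000, 1995]
counts = Counter(years)
print(list(counts.items()))  # -> [(1990, 1), (2000, 2), (1995, 2)]
```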
```python
import pyarrow as pa
import pyarrow.compute as pc

# Build a small example table of birthdays
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())
birthdays_table = pa.table([days, months, years],
                           names=["days", "months", "years"])

# Calculate value counts for the 'years' column
pc.value_counts(birthdays_table["years"])
```

--------------------------------

### Configure Gandiva Example (CMake)

Source: https://github.com/apache/arrow/blob/main/cpp/examples/arrow/CMakeLists.txt

This CMake code snippet configures the build for the Gandiva example. It conditionally enables the example and links against either shared or static Gandiva libraries based on the ARROW_BUILD_SHARED flag.

```cmake
if(ARROW_GANDIVA)
  if(ARROW_BUILD_SHARED)
    set(GANDIVA_EXAMPLE_LINK_LIBS gandiva_shared)
  else()
    set(GANDIVA_EXAMPLE_LINK_LIBS gandiva_static)
  endif()
  add_arrow_example(gandiva_example EXTRA_LINK_LIBS ${GANDIVA_EXAMPLE_LINK_LIBS})
endif()
```

--------------------------------

### Set FileSystemFactoryOptions for Dataset Creation (C++)

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/datasets_tutorial.rst

Sets up FileSystemFactoryOptions, which are necessary for configuring the dataset factory. These options specify how the dataset should be interpreted and processed.

```cpp
dataset::FileSystemFactoryOptions options;
```

--------------------------------

### Install All Headers for Arrow Filesystem

Source: https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/CMakeLists.txt

This function installs all headers required for the Apache Arrow filesystem module. It ensures that all necessary header files are available for development and compilation purposes.
```cmake
arrow_install_all_headers("arrow/filesystem")
```

--------------------------------

### Full Apache Arrow Dataset Example (C++)

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/dataset.rst

This is a comprehensive example demonstrating various functionalities of the Apache Arrow Datasets API in C++. It covers reading and writing partitioned data, interacting with different storage systems, and applying dataset operations. This example is intended to illustrate practical usage scenarios.

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/cast.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/discovery.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_ipc.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/scanner.h>
#include <arrow/filesystem/filesystem.h>
#include <parquet/arrow/writer.h>

// Example for demonstrating reading and writing partitioned data.
// This is a placeholder for the actual code within the literalinclude directive.
// The full code is available in the specified file and line numbers.
void example_partitioned_data() {
  // Code related to partitioned data operations would be here.
}

// Example for demonstrating reading from cloud storage.
// This is a placeholder for the actual code within the literalinclude directive.
void example_cloud_storage() {
  // Code related to cloud storage operations would be here.
}

int main() {
  // Placeholder for main execution logic that might call other examples.
  // The actual 'dataset_documentation_example.cc' contains the full implementation.
  // For demonstration purposes, we'll simulate the structure.
  std::cout << "This is a placeholder for the full Apache Arrow Dataset example.\n";
  std::cout << "Please refer to the 'dataset_documentation_example.cc' file for the complete code.\n";

  // Mocking the table creation for illustration
  arrow::Int64Builder int_builder;
  arrow::StringBuilder string_builder;
  ARROW_CHECK_OK(int_builder.AppendValues({1, 2, 3}));
  ARROW_CHECK_OK(string_builder.AppendValues({"a", "b", "c"}));
  std::shared_ptr<arrow::Array> int_array;
  ARROW_CHECK_OK(int_builder.Finish(&int_array));
  std::shared_ptr<arrow::Array> string_array;
  ARROW_CHECK_OK(string_builder.Finish(&string_array));

  auto schema = arrow::schema({arrow::field("ints", arrow::int64()),
                               arrow::field("strings", arrow::utf8())});
  auto table = arrow::Table::Make(schema, {int_array, string_array});

  // Example of creating an InMemoryDataset (as shown in another snippet)
  auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(std::move(table));
  auto scanner_builder = dataset->NewScan().ValueOrDie();
  std::cout << "Created an InMemoryDataset.\n";

  return 0;
}
```

--------------------------------

### Configure Dataset Write Options in C++

Source: https://github.com/apache/arrow/blob/main/docs/source/cpp/tutorials/datasets_tutorial.rst

This snippet shows how to set up `dataset::FileSystemDatasetWriteOptions` for writing a dataset to disk. It initializes the options with a partitioning scheme and a file format in preparation for writing.

```cpp
dataset::FileSystemDatasetWriteOptions write_options;
write_options.partitioning = partitioning;
write_options.file_format = format;
```
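The write options above pair a partitioning scheme with a file format; when the dataset is written, each row lands in a directory derived from its partition-column values. As a rough pure-Python illustration of how hive-style partitioning maps partition values to output directories (the helper `partition_path` is hypothetical, not part of any Arrow API):

```python
def partition_path(base_dir, partition_values):
    """Map a row's partition-column values to a hive-style directory path."""
    segments = [f"{column}={value}" for column, value in partition_values.items()]
    return "/".join([base_dir] + segments)

print(partition_path("dataset_root", {"year": 2000, "month": 1}))
# -> dataset_root/year=2000/month=1
```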