### Install uv Only

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Installs uv without setting up the full development environment.

```bash
make install-uv
```

--------------------------------

### Install PyIceberg Package

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/verify-release.md

Install the PyIceberg package from the source distribution using the `make install` command. This is a prerequisite for running tests.

```sh
make install
```

--------------------------------

### Install PyIceberg from Source

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Clones the repository and installs PyIceberg locally in editable mode with optional extras for S3 and Hive support.

```bash
git clone https://github.com/apache/iceberg-python.git
cd iceberg-python
pip3 install -e ".[s3fs,hive]"
```

--------------------------------

### Install PyIceberg with Bodo support

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Install PyIceberg with the dependencies needed for Bodo integration. The requirement is quoted so that the shell does not interpret the square brackets.

```bash
pip install "pyiceberg[bodo]"
```

--------------------------------

### Install PyIceberg Directly from GitHub

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Installs PyIceberg directly from a GitHub repository, useful for testing unreleased changes. Includes PyArrow support.

```bash
pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[pyarrow]"
```

--------------------------------

### Setup: Connect to a Catalog

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/pyiceberg_example.ipynb

Configure and load a catalog using SQLite for local testing. This requires setting up a temporary warehouse directory.

```python
# Import required libraries
import os
import tempfile

import pyarrow.compute as pc
from pyiceberg.catalog import load_catalog
```

```python
# Create a temporary warehouse location
warehouse_path = tempfile.mkdtemp(prefix="iceberg_warehouse_")
print(f"Warehouse location: {warehouse_path}")
```

```python
# Configure and load the catalog
catalog = load_catalog(
    "default",
    type="sql",
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
    warehouse=f"file://{warehouse_path}",
)
print("Catalog loaded successfully!")
print(f"Namespaces: {list(catalog.list_namespaces())}")
```

--------------------------------

### Install PyIceberg with Daft support

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Install PyIceberg with the dependencies needed for Daft integration. The requirement is quoted so that the shell does not interpret the square brackets.

```bash
pip install "pyiceberg[daft]"
```

--------------------------------

### List Namespaces

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Retrieve a list of all existing namespaces in the catalog. The example asserts that the newly created 'docs_example' namespace is the only one present.

```python
ns = catalog.list_namespaces()

assert ns == [("docs_example",)]
```

--------------------------------

### Install PyIceberg with Polars support

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Install PyIceberg with the dependencies needed for Polars integration. The requirement is quoted so that the shell does not interpret the square brackets.

```bash
pip install "pyiceberg[polars]"
```

--------------------------------

### Launch Jupyter Lab for Basic Experimentation

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Install notebook dependencies and launch Jupyter Lab in the 'notebooks/' directory for basic PyIceberg experimentation without external infrastructure.

```bash
make notebook
```

--------------------------------

### Import PyIceberg Libraries

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/pyiceberg_example.ipynb

Import necessary libraries for PyIceberg operations and display the installed version.

```python
# Import required libraries
import pyarrow as pa
import pyiceberg
from pyiceberg.catalog import load_catalog

print(f"PyIceberg version: {pyiceberg.__version__}")
```

--------------------------------

### SimpleLocationProvider Data File Path (Partitioned)

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Example of a data file path generated by SimpleLocationProvider for a table partitioned by 'category'.

```txt
s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

--------------------------------

### Install PyIceberg Nightly Build

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/nightly-build.md

Use this command to install the latest nightly build of PyIceberg from TestPyPI. This is recommended for testing purposes only.

```shell
pip install -i https://test.pypi.org/simple/ --pre pyiceberg
```

--------------------------------

### Install PyIceberg with S3 and Hive Support

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/index.md

Install the latest release of PyIceberg with optional dependencies for S3 file system and Hive metastore.

```sh
pip install "pyiceberg[s3fs,hive]"
```

--------------------------------

### SimpleLocationProvider Data File Path (Non-Partitioned)

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Example of a data file path generated by SimpleLocationProvider for a non-partitioned table.

```txt
s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

--------------------------------

### YAML Configuration for Multiple Catalogs

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Example of a .pyiceberg.yaml file configuring both a Hive and a REST catalog. This demonstrates how to define multiple catalog configurations in a single file.

```yaml
catalog:
  hive:
    uri: thrift://127.0.0.1:9083
    s3.endpoint: http://127.0.0.1:9000
    s3.access-key-id: admin
    s3.secret-access-key: password
  rest:
    uri: https://rest-server:8181/
    warehouse: my-warehouse
```

--------------------------------

### Complete Filter Examples

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/row-filter-syntax.md

Demonstrates combining various row filter operations to create complex filtering conditions.

```sql
-- Complex filter with multiple conditions
status = 'active' AND age > 18 AND NOT (country IN ('US', 'CA'))
```

```sql
-- Filter with string pattern matching
name LIKE 'John%' AND age >= 21
```

```sql
-- Filter with NULL checks and numeric comparisons
price IS NOT NULL AND price > 100 AND quantity > 0
```

```sql
-- Filter with multiple logical operations
(status = 'pending' OR status = 'processing') AND NOT (priority = 'low')
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Install pre-commit hooks to automatically run linters and formatters on code changes before each commit. This helps maintain code quality.

```bash
prek install
```

--------------------------------

### PyArrow Table Schema Example

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Example structure of a PyArrow Table returned by PyIceberg scans. Shows column names and data types.

```text
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us, tz=+00:00]
tpep_dropoff_datetime: timestamp[us, tz=+00:00]
----
VendorID: [[2,1,2,1,1,...,2,2,2,2,2],[2,1,1,1,2,...,1,1,2,1,2],...,[2,2,2,2,2,...,2,6,6,2,2],[2,2,2,2,2,...,2,2,2,2,2]]
tpep_pickup_datetime: [[2021-04-01 00:28:05.000000,...,2021-04-30 23:44:25.000000]]
tpep_dropoff_datetime: [[2021-04-01 00:47:59.000000,...,2021-05-01 00:14:47.000000]]
```

--------------------------------

### ObjectStoreLocationProvider Data File Path (Partitioned)

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Example of a data file path generated by ObjectStoreLocationProvider for a partitioned table, including binary directories for hash distribution.

```txt
s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

--------------------------------

### Define Partition Specification

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Define how a table's data should be partitioned. This example partitions data by the 'day' of the 'datetime' field.

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField

partition_spec = PartitionSpec(
    PartitionField(
        source_id=1, field_id=1000, transform="day", name="datetime_day"
    )
)
```

--------------------------------

### Launch Jupyter Lab with Infrastructure for Spark Integration

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Spin up the full integration test infrastructure (Spark, REST Catalog, Hive Metastore, Minio) via Docker Compose and launch Jupyter Lab for Spark integration examples.

```bash
make notebook-infra
```

--------------------------------

### ObjectStoreLocationProvider Data File Path (Partition Exclusion)

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Example of a data file path generated by ObjectStoreLocationProvider with partition exclusion enabled, omitting partition keys and values.

```txt
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

--------------------------------

### YAML Configuration for REST Catalog

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Example of a .pyiceberg.yaml file to configure a REST catalog named 'prod'. This file specifies the catalog's URI and credentials.

```yaml
catalog:
  prod:
    uri: http://rest-catalog/ws/
    credential: t-1234:secret
```

--------------------------------

### Configure REST Catalog via YAML

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Example configuration for a REST catalog named 'test_catalog' using a YAML file. This specifies the URI and credentials.

```yaml
catalog:
  test_catalog:
    uri: http://rest-catalog/ws/
    credential: t-1234:secret
```

--------------------------------

### Example File Metadata Data

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

This shows sample data for the PyArrow Table returned by `table.inspect.files()`. It illustrates the actual values for file paths, record counts, and other metadata fields for two Parquet files.

```python
content: [[0,0]]
file_path: [["s3://warehouse/default/table_metadata_files/data/00000-0-9ea7d222-6457-467f-bad5-6fb125c9aa5f.parquet","s3://warehouse/default/table_metadata_files/data/00000-0-afa8893c-de71-4710-97c9-6b01590d0c44.parquet"]]
file_format: [["PARQUET","PARQUET"]]
spec_id: [[0,0]]
record_count: [[3,3]]
file_size_in_bytes: [[5459,5459]]
column_sizes: [[keys:[1,2,3,4,5,...,8,9,10,11,12]values:[49,78,128,94,118,...,118,118,94,78,109],keys:[1,2,3,4,5,...,8,9,10,11,12]values:[49,78,128,94,118,...,118,118,94,78,109]]]
value_counts: [[keys:[1,2,3,4,5,...,8,9,10,11,12]values:[3,3,3,3,3,...,3,3,3,3,3],keys:[1,2,3,4,5,...,8,9,10,11,12]values:[3,3,3,3,3,...,3,3,3,3,3]]]
null_value_counts: [[keys:[1,2,3,4,5,...,8,9,10,11,12]values:[1,1,1,1,1,...,1,1,1,1,1],keys:[1,2,3,4,5,...,8,9,10,11,12]values:[1,1,1,1,1,...,1,1,1,1,1]]]
nan_value_counts: [[keys:[]values:[],keys:[]values:[]]]
lower_bounds: [[keys:[1,2,3,4,5,...,8,9,10,11,12]values:[00,61,61616161616161616161616161616161,01000000,0100000000000000,...,009B6ACA38F10500,009B6ACA38F10500,9E4B0000,01,00000000000000000000000000000000],keys:[1,2,3,4,5,...,8,9,10,11,12]values:[00,61,61616161616161616161616161616161,01000000,0100000000000000,...,009B6ACA38F10500,009B6ACA38F10500,9E4B0000,01,00000000000000000000000000000000]]]
upper_bounds:[[keys:[1,2,3,4,5,...,8,9,10,11,12]values:[00,61,61616161616161616161616161616161,01000000,0100000000000000,...,009B6ACA38F10500,009B6ACA38F10500,9E4B0000,01,00000000000000000000000000000000],keys:[1,2,3,4,5,...,8,9,10,11,12]values:[00,61,61616161616161616161616161616161,01000000,0100000000000000,...,009B6ACA38F10500,009B6ACA38F10500,9E4B0000,01,00000000000000000000000000000000]]]
key_metadata: [[0100,0100]]
split_offsets:[[[],[]]]
equality_ids:[[[],[]]]
sort_order_id:[[[],[]]]
readable_metrics: [
  -- is_valid: all not null
```

--------------------------------

### PyArrow Import Example

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Imports the pyarrow module, commonly used for data manipulation and type handling in PyIceberg.

```python
import pyarrow as pa
```

--------------------------------

### Describe a Table in JSON Format

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/cli.md

Use the '--output json' flag with the 'describe' command to get table details in JSON format, suitable for programmatic processing and automation. The output is piped to 'jq' for pretty-printing.

```sh
➜ pyiceberg --output json describe nyc.taxis | jq
{
  "identifier": [
    "nyc",
    "taxis"
  ],
  "metadata_location": "file:/.../nyc.db/taxis/metadata/00000-aa3a3eac-ea08-4255-b890-383a64a94e42.metadata.json",
  "metadata": {
    "location": "file:/.../nyc.db/taxis",
    "table-uuid": "6cdfda33-bfa3-48a7-a09e-7abb462e3460",
    "last-updated-ms": 1661783158061,
    "last-column-id": 19,
    "schemas": [
      {
        "type": "struct",
        "fields": [
          {
            "id": 1,
            "name": "VendorID",
            "type": "long",
            "required": false
          },
...
          {
            "id": 19,
            "name": "airport_fee",
            "type": "double",
            "required": false
          }
        ],
        "schema-id": 0,
        "identifier-field-ids": []
      }
    ],
    "current-schema-id": 0,
    "partition-specs": [
      {
        "spec-id": 0,
        "fields": []
      }
    ],
    "default-spec-id": 0,
    "last-partition-id": 999,
    "properties": {
      "owner": "root",
      "write.format.default": "parquet"
    },
    "current-snapshot-id": 5937117119577207000,
    "snapshots": [
      {
        "snapshot-id": 5937117119577207000,
        "timestamp-ms": 1661783158061,
        "manifest-list": "file:/.../nyc.db/taxis/metadata/snap-5937117119577207079-1-94656c4f-4c66-4600-a4ca-f30377300527.avro",
        "summary": {
          "operation": "append",
          "spark.app.id": "local-1661783139151",
          "added-data-files": "1",
          "added-records": "2979431",
          "added-files-size": "46600777",
          "changed-partition-count": "1",
          "total-records": "2979431",
          "total-files-size": "46600777",
          "total-data-files": "1",
          "total-delete-files": "0",
          "total-position-deletes": "0",
          "total-equality-deletes": "0"
        },
        "schema-id": 0
      }
    ],
    "snapshot-log": [
      {
        "snapshot-id": "5937117119577207079",
```

--------------------------------
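### Parse `describe` JSON Output in Python (illustrative sketch)

The JSON form of `describe` is meant for programmatic use, so a small consumer may help make that concrete. The following is a minimal sketch, not taken from the PyIceberg docs: `sample_output` is a hypothetical, heavily abbreviated document shaped like the output shown above, and `summarize` is an illustrative helper name. It assumes the raw CLI output carries the snapshot ID as an exact integer.

```python
import json

# Hypothetical, abbreviated sample shaped like the `describe` JSON output.
# In practice this string would be the captured stdout of the CLI command.
sample_output = """
{
  "identifier": ["nyc", "taxis"],
  "metadata": {
    "current-snapshot-id": 5937117119577207079,
    "snapshots": [
      {
        "snapshot-id": 5937117119577207079,
        "summary": {"operation": "append", "added-records": "2979431"}
      }
    ]
  }
}
"""


def summarize(describe_json: str) -> dict:
    """Extract the table name and current-snapshot details (illustrative helper)."""
    doc = json.loads(describe_json)
    metadata = doc["metadata"]
    current_id = metadata["current-snapshot-id"]
    # Find the snapshot entry whose ID matches the current snapshot.
    current = next(s for s in metadata["snapshots"] if s["snapshot-id"] == current_id)
    return {
        "table": ".".join(doc["identifier"]),
        "snapshot_id": current_id,
        "operation": current["summary"]["operation"],
        # Summary values are strings in the JSON, so convert counts explicitly.
        "added_records": int(current["summary"]["added-records"]),
    }


summary = summarize(sample_output)
print(summary)
```

Python's `json` module keeps large integers exact, so the snapshot ID can be compared directly. The rounded IDs in the jq-formatted output above (ending in 000 rather than 079) appear to be an artifact of jq's number handling rather than of the table metadata.

--------------------------------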
### REST Catalog Authentication Configuration

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Example YAML configuration for a REST catalog with pluggable authentication. Set `type` under `auth` to the desired authentication method (e.g., `oauth2`, `basic`, `custom`) and use the same name as the key for its type-specific configuration.

```yaml
catalog:
  default:
    type: rest
    uri: http://rest-catalog/ws/
    auth:
      type:
      :
        # Type-specific configuration
      impl: # Only for custom auth
```

--------------------------------

### Add Files with Custom Snapshot Properties and Duplicate Check

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

This example demonstrates adding Parquet files to an Iceberg table while specifying custom snapshot properties and explicitly enabling duplicate file checking. It also includes assertions to verify the NameMapping and the snapshot property.

```python
# Assume an existing Iceberg table object `tbl`

file_paths = [
    "s3a://warehouse/default/existing-1.parquet",
    "s3a://warehouse/default/existing-2.parquet",
]

# Custom snapshot properties
snapshot_properties = {"abc": "def"}

# Enable duplicate file checking
check_duplicate_files = True

# Add the Parquet files to the Iceberg table without rewriting
tbl.add_files(
    file_paths=file_paths,
    snapshot_properties=snapshot_properties,
    check_duplicate_files=check_duplicate_files
)

# NameMapping must have been set to enable reads
assert tbl.name_mapping() is not None

# Verify that the snapshot property was set correctly
assert tbl.metadata.snapshots[-1].summary["abc"] == "def"
```

--------------------------------

### Initiate Schema Transaction

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Start a transaction to perform multiple schema modifications, such as adding columns and updating properties.

```python
from pyiceberg.types import IntegerType

# Assume an existing Iceberg table object `table`
with table.transaction() as transaction:
    with transaction.update_schema() as update_schema:
        update_schema.add_column("some_other_field", IntegerType(), "doc")
    # ... Update properties etc
```

--------------------------------

### Represent Point Data in WKB Format

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/geospatial.md

Example of a Point(0, 0) represented as Well-Known Binary (WKB) bytes. PyIceberg uses WKB for storing geometry and geography values.

```python
# Example: Point(0, 0) in WKB format
point_wkb = bytes.fromhex("0101000000000000000000000000000000000000")
```

--------------------------------

### Serve Docs Locally

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/README.md

Run this command to serve the documentation locally. Ensure you are in the root directory of the project.

```sh
make docs-serve
```

--------------------------------

### Catalog Instantiation

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Demonstrates how to load an Iceberg catalog using configuration from a `.pyiceberg.yaml` file or by passing properties directly.

```APIDOC
## Catalog Instantiation

### Description
Instantiate an Iceberg catalog to read and write data. Catalogs can be loaded by name from a configuration file or by providing properties directly.

### Method
`load_catalog(name: str, **properties) -> Catalog`

### Parameters
- **name** (str) - The name of the catalog to load.
- **properties** (dict) - A dictionary of properties to configure the catalog.

### Example
```python
from pyiceberg.catalog import load_catalog

# Load catalog by name from .pyiceberg.yaml
catalog_by_name = load_catalog(name="prod")

# Load catalog by passing properties directly
catalog_direct = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```
```

--------------------------------

### Remove Deprecated API Example

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Example of a deprecated API that should be removed before a release.

```python
@deprecated(
    deprecated_in="0.1.0",
    removed_in="0.2.0",
    help_message="Please use load_something_else() instead",
)
```

--------------------------------

### Deprecation Message Example

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Example of using the deprecation_message function to inform users about API changes.

```python
deprecation_message(
    deprecated_in="0.1.0",
    removed_in="0.2.0",
    help_message="The old_property is deprecated. Please use the something_else property instead.",
)
```

--------------------------------

### Display Pyiceberg CLI Help

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/cli.md

Use the --help flag to display available commands and options for the pyiceberg CLI. This is useful for understanding the CLI's capabilities and structure.

```sh
➜ pyiceberg --help
Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...

Options:
  --catalog TEXT
  --verbose BOOLEAN
  --output [text|json]
  --ugi TEXT
  --uri TEXT
  --credential TEXT
  --warehouse TEXT
  --help                Show this message and exit.

Commands:
  create      Operation to create a namespace.
  describe    Describe a namespace or a table.
  drop        Operations to drop a namespace or table.
  files       List all the files of the table.
  list        List tables or namespaces.
  list-refs   List all the refs in the provided table.
  location    Return the location of the table.
  properties  Properties on tables/namespaces.
  rename      Rename a table.
  schema      Get the schema of the table.
  spec        Return the partition spec of the table.
  uuid        Return the UUID of the table.
  version     Print pyiceberg version.
```

--------------------------------

### Set Release Version and Verification Directory

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/verify-release.md

Set environment variables for the PyIceberg version to verify and a temporary directory to store downloaded artifacts. Set `PYICEBERG_VERSION` to the actual release candidate version.

```sh
export PYICEBERG_VERSION= # e.g. 0.6.1rc3
export PYICEBERG_VERIFICATION_DIR=/tmp/pyiceberg/${PYICEBERG_VERSION}
```

--------------------------------

### Complex Expression Example

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/expression-dsl.md

Combine multiple predicates and logical operators to construct intricate filtering logic. This example demonstrates nested AND and OR operations.

```python
from pyiceberg.expressions import (
    And,
    EqualTo,
    GreaterThan,
    GreaterThanOrEqual,
    LessThan,
    LessThanOrEqual,
    Not,
    Or,
)

# (age >= 18 AND age <= 65) AND (status = 'active' OR status = 'pending')
complex_filter = And(
    And(
        GreaterThanOrEqual("age", 18),
        LessThanOrEqual("age", 65)
    ),
    Or(
        EqualTo("status", "active"),
        EqualTo("status", "pending")
    )
)

# NOT (age < 18 OR age > 65)
age_in_range = Not(
    Or(
        LessThan("age", 18),
        GreaterThan("age", 65)
    )
)
```

--------------------------------

### Import GPG Keys

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/verify-release.md

Download the Apache Iceberg KEYS file and import it into your GPG keyring. This is the first step in verifying release signatures.

```sh
curl https://downloads.apache.org/iceberg/KEYS -o KEYS
gpg --import KEYS
```

--------------------------------

### Upgrade Pip

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/index.md

Ensure you are using an up-to-date version of pip before installing PyIceberg.

```sh
pip install --upgrade pip
```

--------------------------------

### Generate All Vendor Packages

Source: https://github.com/apache/iceberg-python/blob/main/vendor/README.md

Run this command to generate all vendor packages in a single, complete build.

```bash
make all
```

--------------------------------

### Configure Custom Catalog Implementation

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Set up a custom catalog implementation by specifying the Python module and class path, along with any custom configuration keys.

```yaml
catalog:
  default:
    py-catalog-impl: mypackage.mymodule.MyCatalog
    custom-key1: value1
    custom-key2: value2
```

--------------------------------

### Update Column Requirement

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Change the nullability of a column, for example making a required field optional. Going the other way (optional to required) is an incompatible change.

```python
with table.update_schema() as update:
    # Make a field optional
    update.update_column("symbol", required=False)
```

--------------------------------

### Create a Partitioned Iceberg Table

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

This snippet demonstrates creating a partitioned Iceberg table with a schema and partition specification.

```python
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, NestedField, StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)

# Assume `catalog` was loaded earlier with load_catalog()
tbl = catalog.create_table(
    "default.cities",
    schema=schema,
    partition_spec=PartitionSpec(PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name="city_identity"))
)
```

--------------------------------

### Load Catalog Programmatically

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Instantiate an Iceberg catalog by passing configuration properties directly to `load_catalog`. This is an alternative to using a `.pyiceberg.yaml` file.

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```

--------------------------------

### Prepare and Verify License Documentation

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/verify-release.md

Extract the source tarball, navigate into the extracted directory, and run the `./dev/check-license` script to validate the license headers. This ensures compliance with Apache licensing requirements.

```sh
export PYICEBERG_RELEASE_VERSION=${PYICEBERG_VERSION/rc?/} # remove rcX qualifier
tar xzf pyiceberg-${PYICEBERG_RELEASE_VERSION}.tar.gz
cd pyiceberg-${PYICEBERG_RELEASE_VERSION}
./dev/check-license
```

--------------------------------

### Update Column Type

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Modify the data type of an existing column, for example promoting a float to a double. Type changes other than Iceberg's allowed promotions are incompatible.

```python
from pyiceberg.types import DoubleType

with table.update_schema() as update:
    # Promote a float to a double
    update.update_column("bid", field_type=DoubleType())
```

--------------------------------

### Get All Table Properties using Python CLI

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/cli.md

Retrieve all properties associated with a given table. This helps in understanding the current configuration of a table.

```sh
➜ pyiceberg properties get table nyc.taxis
```

--------------------------------

### Convert Iceberg Table to Polars LazyFrame

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Converts an Iceberg table to a Polars LazyFrame for efficient data manipulation and filtering. Requires Polars to be installed.

```python
import polars as pl

# Assume an existing Iceberg table object `iceberg_table`
lf = iceberg_table.to_polars().filter(pl.col("ticket_id") > 10)
print(lf.collect())
```

--------------------------------

### Create Sample Data

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/pyiceberg_example.ipynb

Generate a sample PyArrow table with taxi-like data for writing to an Iceberg table.

```python
# Create sample data using PyArrow
import pyarrow as pa

# Sample taxi-like data
data = {
    "vendor_id": [1, 2, 1, 2, 1],
    "trip_distance": [1.5, 2.3, 0.8, 5.2, 3.1],
    "fare_amount": [10.0, 15.5, 6.0, 22.0, 18.0],
    "tip_amount": [2.0, 3.0, 1.0, 4.5, 3.5],
    "passenger_count": [1, 2, 1, 3, 2],
}

df = pa.table(data)
print("Sample data:")
print(df)
```

--------------------------------

### Conclude Vote Thread

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Example text for concluding a release vote thread on the dev mailing list after the voting period has ended and requirements are met.

```text
Thanks everyone for voting! The 72 hours have passed, and a minimum of 3 binding votes have been cast:

+1 Foo Bar (non-binding)
...
+1 Fokko Driesprong (binding)

The release candidate has been accepted as PyIceberg . Thanks everyone, when all artifacts are published the announcement will be sent out.

Kind regards,
```

--------------------------------

### Load SqlCatalog for Local Testing

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/index.md

Load the SqlCatalog implementation to manage Iceberg tables using a local SQLite database and filesystem warehouse. This is suitable for testing but not recommended for production.

```python
from pyiceberg.catalog import load_catalog

warehouse_path = "/tmp/warehouse"
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
```

--------------------------------

### Configure OneLake REST Catalog with Entra ID Authentication

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Use this configuration for OneLake REST catalog with Entra ID authentication. Ensure pyiceberg[entra-auth] is installed.

```yaml
catalog:
  onelake_catalog:
    type: rest
    uri: https://onelake.table.fabric.microsoft.com/iceberg
    warehouse: /
    auth:
      type: entra
    adls.account-name: onelake
    adls.account-host: onelake.blob.fabric.microsoft.com
```

--------------------------------

### Configure REST Catalog via Environment Variables

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Set environment variables to configure the URI and credentials for a catalog named 'test_catalog'. The example URI uses the `thrift://` scheme (as in the Hive configuration elsewhere in these docs); substitute the endpoint of your own catalog.

```bash
export PYICEBERG_CATALOG__TEST_CATALOG__URI=thrift://localhost:9083
export PYICEBERG_CATALOG__TEST_CATALOG__ACCESS_KEY_ID=username
export PYICEBERG_CATALOG__TEST_CATALOG__SECRET_ACCESS_KEY=password
```

--------------------------------

### Get Specific Table Property using Python CLI

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/cli.md

Retrieve the value of a single, specific property for a table. This is useful for checking the status of a particular configuration setting.

```sh
➜ pyiceberg properties get table nyc.taxis write.metadata.delete-after-commit.enabled
```

--------------------------------

### Create a Namespace

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/pyiceberg_example.ipynb

Create a new namespace within the loaded catalog. Check available namespaces after creation.

```python
# Create a namespace
catalog.create_namespace("default")
print(f"Available namespaces: {list(catalog.list_namespaces())}")
```

--------------------------------

### Custom UUID Location Provider Implementation

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Example of a custom LocationProvider that generates unique data file locations using UUIDs. This implementation extends the base LocationProvider and customizes the new_data_location method.

```python
import uuid
from typing import Optional

from pyiceberg.partitioning import PartitionKey
from pyiceberg.table.locations import LocationProvider
from pyiceberg.typedef import Properties


class UUIDLocationProvider(LocationProvider):
    def __init__(self, table_location: str, table_properties: Properties):
        super().__init__(table_location, table_properties)

    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
        # Can use any custom method to generate a file path given the partitioning information and file name
        prefix = f"{self.table_location}/{uuid.uuid4()}"
        return f"{prefix}/{partition_key.to_path()}/{data_file_name}" if partition_key else f"{prefix}/{data_file_name}"
```

--------------------------------

### Configure Glue Catalog with Static Credentials

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure a Glue catalog using static AWS access key ID, secret access key, and session token.

```yaml
catalog:
  default:
    type: glue
    glue.access-key-id:
    glue.secret-access-key:
    glue.session-token:
    glue.region:
    s3.endpoint: http://localhost:9000
    s3.access-key-id: admin
    s3.secret-access-key: password
```

--------------------------------

### Describe a Table in Default Format

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/cli.md

Use the 'describe' command to view detailed information about a specific table, including its metadata, schema, and snapshots. The output is in a human-readable format.

```sh
➜ pyiceberg describe nyc.taxis
Table format version  1
Metadata location     file:/.../nyc.db/taxis/metadata/00000-aa3a3eac-ea08-4255-b890-383a64a94e42.metadata.json
Table UUID            6cdfda33-bfa3-48a7-a09e-7abb462e3460
Last Updated          1661783158061
Partition spec        []
Sort order            []
Current schema        Schema, id=0
                      ├── 1: VendorID: optional long
                      ├── 2: tpep_pickup_datetime: optional timestamptz
                      ├── 3: tpep_dropoff_datetime: optional timestamptz
                      ├── 4: passenger_count: optional double
                      ├── 5: trip_distance: optional double
                      ├── 6: RatecodeID: optional double
                      ├── 7: store_and_fwd_flag: optional string
                      ├── 8: PULocationID: optional long
                      ├── 9: DOLocationID: optional long
                      ├── 10: payment_type: optional long
                      ├── 11: fare_amount: optional double
                      ├── 12: extra: optional double
                      ├── 13: mta_tax: optional double
                      ├── 14: tip_amount: optional double
                      ├── 15: tolls_amount: optional double
                      ├── 16: improvement_surcharge: optional double
                      ├── 17: total_amount: optional double
                      ├── 18: congestion_surcharge: optional double
                      └── 19: airport_fee: optional double
Current snapshot      Operation.APPEND: id=5937117119577207079, schema_id=0
Snapshots             Snapshots
                      └── Snapshot 5937117119577207079, schema 0: file:/.../nyc.db/taxis/metadata/snap-5937117119577207079-1-94656c4f-4c66-4600-a4ca-f30377300527.avro
Properties            owner                 root
                      write.format.default  parquet
```

--------------------------------

### Create Patch Branch from Tag

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Commands to create a new patch branch from an existing release tag and push it.

```bash
# Fetch all tags
git fetch --tags

# Assuming 0.8.0 is the latest release tag
git checkout -b pyiceberg-0.8.x pyiceberg-0.8.0

# Cherry-pick commits for the upcoming patch release
git cherry-pick

# Push the new branch
git push git@github.com:apache/iceberg-python.git pyiceberg-0.8.x
```

--------------------------------

### Configure SQL Catalog with PostgreSQL

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure a SQL catalog using PostgreSQL as the backend. Set init_catalog_tables to false to prevent automatic table creation.

```yaml
catalog:
  default:
    type: sql
    uri: postgresql+psycopg2://username:password@localhost/mydatabase
    init_catalog_tables: false
```

--------------------------------

### Create a Table with Iceberg Format Version 3

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/geospatial.md

Demonstrates creating a new table with the specified schema and explicitly setting the `format-version` property to '3', which is required for geospatial types.

```python
from pyiceberg.table import TableProperties

# Creating a v3 table
table = catalog.create_table(
    identifier="db.spatial_table",
    schema=schema,
    properties={
        TableProperties.FORMAT_VERSION: "3"
    }
)
```

--------------------------------

### Generate Individual Vendor Packages

Source: https://github.com/apache/iceberg-python/blob/main/vendor/README.md

Use these commands to generate specific vendor packages. 'make fb303' generates only the FB303 Thrift client, while 'make hive-metastore' generates only the Hive Metastore Thrift definitions.

```bash
make fb303 # FB303 Thrift client only
```

```bash
make hive-metastore # Hive Metastore Thrift definitions only
```

--------------------------------

### Run Linting and Formatting

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Execute the linting and autoformatting checks for the project. This command ensures code style consistency and catches potential issues.

```bash
make lint
```

--------------------------------

### Basic Authentication Configuration

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure basic authentication with a username and password. Ensure these credentials are kept secure.

```yaml
auth:
  type: basic
  basic:
    username: myuser
    password: mypass
```

--------------------------------

### Upload PyIceberg Release to PyPI

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Checks out the release artifacts from Apache SVN and uploads them to PyPI using twine. Requires the VERSION environment variable to be set and may require a PyPI API token.

```bash
: "${VERSION:?ERROR: VERSION is not set or is empty}"
svn checkout https://dist.apache.org/repos/dist/release/iceberg/pyiceberg-${VERSION} /tmp/iceberg-dist-release/pyiceberg-${VERSION}

cd /tmp/iceberg-dist-release/pyiceberg-${VERSION}
twine upload pyiceberg-*.whl pyiceberg-*.tar.gz
```

--------------------------------

### Import PySpark Libraries

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/spark_integration_example.ipynb

Import the PySpark SQL class needed to create a SparkSession.

```python
from pyspark.sql import SparkSession
```

--------------------------------

### Configure SQL Catalog with SQLite

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure a SQL catalog using SQLite. This is suitable for development and exploratory purposes only due to concurrency limitations.
```yaml
catalog:
  default:
    type: sql
    uri: sqlite:////tmp/pyiceberg.db
    init_catalog_tables: false
```

--------------------------------

### Show Tables in Default Namespace

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/spark_integration_example.ipynb

List all tables present in the 'default' namespace. This is useful for identifying available Iceberg tables.

```python
# Show tables in the default namespace
spark.sql("SHOW TABLES FROM default").show()
```

--------------------------------

### Apache Gravitino Catalog Configuration

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure integration with Apache Gravitino. Requires catalog URI and delegation headers. Uses noop authentication by default.

```yaml
catalog:
  gravitino_catalog:
    type: rest
    uri:
    header.X-Iceberg-Access-Delegation: vended-credentials
    auth:
      type: noop
```

--------------------------------

### Configure In-Memory Catalog

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure an in-memory catalog for testing and demos. It uses an in-memory SQLite database and is not suitable for production.

```yaml
catalog:
  default:
    type: in-memory
    warehouse: /tmp/pyiceberg/warehouse
```

--------------------------------

### Run Full Test Coverage

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/verify-release.md

Execute all unit and integration tests for PyIceberg with coverage reporting. This command spins up Docker containers to facilitate the testing process.

```sh
make test-coverage
```

--------------------------------

### Configure REST Catalog for Testing

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Set the PYICEBERG_TEST_CATALOG environment variable to specify which REST catalog to use for integration tests. Warning: Do not run against production catalogs.
```bash
export PYICEBERG_TEST_CATALOG=test_catalog
```

--------------------------------

### Run Unit Tests

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Execute the project's unit tests using pytest and coverage. Aims to enforce 90%+ code coverage.

```bash
make test
```

--------------------------------

### Configure Glue Catalog with Profile Name

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure a Glue catalog using an AWS profile name and region.

```yaml
catalog:
  default:
    type: glue
    glue.profile-name:
    glue.region:
    s3.endpoint: http://localhost:9000
    s3.access-key-id: admin
    s3.secret-access-key: password
```

--------------------------------

### Create and Push Signed Git Tag

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Bash script to set version variables, create a signed Git tag, and push it to the repository.

```bash
export VERSION=0.7.0
export RC=1

export VERSION_WITH_RC=${VERSION}rc${RC}
export GIT_TAG=pyiceberg-${VERSION_WITH_RC}

git tag -s ${GIT_TAG} -m "PyIceberg ${VERSION_WITH_RC}"
git push git@github.com:apache/iceberg-python.git ${GIT_TAG}
```

--------------------------------

### List Tables in Default Catalog

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/cli.md

Execute the 'list' command to display tables and namespaces within the default catalog. This command is useful for exploring available data.

```sh
➜  pyiceberg list
default
nyc
```

--------------------------------

### Specify Python Version for Environment

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Creates a virtual environment using a specific Python version and runs tests against it.
```bash
PYTHON=3.12 make install  # Create environment with Python 3.12
make test                 # Run tests against Python 3.12
```

--------------------------------

### Verify Release Signatures

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/verify-release.md

Check out the release artifacts from the Apache distribution repository, then iterate through the downloaded files and verify each signature with GPG. This confirms the integrity and authenticity of the release files.

```sh
svn checkout https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-$PYICEBERG_VERSION/ ${PYICEBERG_VERIFICATION_DIR}

cd ${PYICEBERG_VERIFICATION_DIR}

for name in $(ls pyiceberg-*.whl pyiceberg-*.tar.gz)
do
    gpg --verify ${name}.asc ${name}
done
```

--------------------------------

### Create and Append Data to an Iceberg Table

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

This snippet demonstrates how to create a new Iceberg table and append initial data using PyArrow and PyIceberg.

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

# Build an in-memory Arrow table with the rows to write
df = pa.Table.from_pylist(
    [
        {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
        {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
        {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)

catalog = load_catalog("default")

# Create the table from the Arrow schema, then append the rows
tbl = catalog.create_table("default.cities", schema=df.schema)

tbl.append(df)
```

--------------------------------

### Unity Catalog Configuration

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Configure integration with Databricks Unity Catalog using its REST API. Requires the workspace URL, the catalog name, and a Databricks personal access token (PAT).
```yaml
catalog:
  unity_catalog:
    type: rest
    uri: https:///api/2.1/unity-catalog/iceberg-rest
    warehouse:
    token:
```

--------------------------------

### Upload Artifacts to Apache Dev SVN

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Import the release candidate artifacts into the Apache Development SVN repository. The command requires the version and RC number to construct the correct paths.

```bash
: "${VERSION:?ERROR: VERSION is not set or is empty}"
: "${VERSION_WITH_RC:?ERROR: VERSION_WITH_RC is not set or is empty}"
: "${RC:?ERROR: RC is not set or is empty}"

svn import "svn-release-candidate-${VERSION}rc${RC}" "https://dist.apache.org/repos/dist/dev/iceberg/pyiceberg-${VERSION_WITH_RC}" -m "PyIceberg ${VERSION_WITH_RC}"
```

--------------------------------

### Inspect Table Manifests

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Use this to view a table's current file manifests. The output is a pyarrow.Table.
```python
table.inspect.manifests()
```

```python
pyarrow.Table
content: int8 not null
path: string not null
length: int64 not null
partition_spec_id: int32 not null
added_snapshot_id: int64 not null
added_data_files_count: int32 not null
existing_data_files_count: int32 not null
deleted_data_files_count: int32 not null
added_delete_files_count: int32 not null
existing_delete_files_count: int32 not null
deleted_delete_files_count: int32 not null
partition_summaries: list<item: struct<contains_null: bool not null, contains_nan: bool, lower_bound: string, upper_bound: string>> not null
  child 0, item: struct<contains_null: bool not null, contains_nan: bool, lower_bound: string, upper_bound: string>
      child 0, contains_null: bool not null
      child 1, contains_nan: bool
      child 2, lower_bound: string
      child 3, upper_bound: string
----
content: [[0]]
path: [["s3://warehouse/default/table_metadata_manifests/metadata/3bf5b4c6-a7a4-4b43-a6ce-ca2b4887945a-m0.avro"]]
length: [[6886]]
partition_spec_id: [[0]]
added_snapshot_id: [[3815834705531553721]]
added_data_files_count: [[1]]
existing_data_files_count: [[0]]
deleted_data_files_count: [[0]]
added_delete_files_count: [[0]]
existing_delete_files_count: [[0]]
deleted_delete_files_count: [[0]]
partition_summaries: [[    -- is_valid: all not null
    -- child 0 type: bool
[false]
    -- child 1 type: bool
[false]
    -- child 2 type: string
["test"]
    -- child 3 type: string
["test"]]]
```

--------------------------------

### Set Catalog Configuration via Environment Variables

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Use environment variables to configure catalog settings, such as the URI and S3 credentials. Double underscores represent nested fields in the YAML structure.
```sh
export PYICEBERG_CATALOG__DEFAULT__URI=thrift://localhost:9083
export PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID=username
export PYICEBERG_CATALOG__DEFAULT__S3__SECRET_ACCESS_KEY=password
```

--------------------------------

### Create Branch with Default Settings

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

Create a mutable branch referencing a specific snapshot with default retention settings.

```python
# Create a branch with default settings
table.manage_snapshots().create_branch(
    snapshot_id=snapshot_id,
    branch_name="dev"
).commit()
```

--------------------------------

### Run S3 Integration Tests

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/contributing.md

Execute the integration tests specific to S3. MinIO must be running.

```bash
make test-s3
```

--------------------------------

### No Authentication Configuration

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md

Use this configuration when no authentication is required for the catalog. This is the simplest authentication type.

```yaml
auth:
  type: noop
```

--------------------------------

### Show Namespaces in Spark

Source: https://github.com/apache/iceberg-python/blob/main/notebooks/spark_integration_example.ipynb

Display all available namespaces (databases) within the connected Spark environment. This helps in understanding the data organization.

```python
# Show available namespaces/databases
spark.sql("SHOW NAMESPACES").show()
```

--------------------------------

### Monitor and Watch GitHub Release Action

Source: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/how-to-release.md

Use the `gh` CLI to find the database ID of the release workflow run and then watch its progress. This is useful for tracking the artifact generation process.
```bash
: "${GIT_TAG:?ERROR: GIT_TAG is not set or is empty}"
RUN_ID=$(gh run list --repo apache/iceberg-python --workflow "Python Build Release Candidate" --branch "${GIT_TAG}" --event push --json databaseId -q '.[0].databaseId')
: "${RUN_ID:?ERROR: RUN_ID could not be determined}"
echo "Waiting for workflow to complete, this will take several minutes..."
gh run watch $RUN_ID --repo apache/iceberg-python
```
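--------------------------------

### Sketch: How Environment Variable Names Map to Nested Configuration

The double-underscore convention described under "Set Catalog Configuration via Environment Variables" can be illustrated in plain Python. This is a minimal sketch, not PyIceberg's own loader: `env_to_config` and `PREFIX` are hypothetical names. It assumes that the first two double underscores separate nesting levels, that any later double underscore becomes a dot, and that single underscores become hyphens. That assumption is chosen so the example variables line up with YAML keys such as `s3.access-key-id`.

```python
from typing import Any, Dict, Mapping

PREFIX = "PYICEBERG_"  # hypothetical constant; matches the prefix in the example variables


def env_to_config(environ: Mapping[str, str]) -> Dict[str, Any]:
    """Build a nested dict from PYICEBERG_-prefixed variables (illustrative only)."""
    config: Dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # ignore unrelated environment variables
        # Assumption: at most three levels (catalog -> catalog name -> property key).
        parts = name[len(PREFIX):].lower().split("__", 2)
        # Assumption: a remaining "__" becomes "." and "_" becomes "-" in each key.
        keys = [part.replace("__", ".").replace("_", "-") for part in parts]
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # descend, creating levels as needed
        node[keys[-1]] = value
    return config


example = {
    "PYICEBERG_CATALOG__DEFAULT__URI": "thrift://localhost:9083",
    "PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID": "username",
    "PYICEBERG_CATALOG__DEFAULT__S3__SECRET_ACCESS_KEY": "password",
}
print(env_to_config(example))
```

Under these assumptions, the three example variables end up under `catalog` → `default` with the keys `uri`, `s3.access-key-id`, and `s3.secret-access-key`, mirroring the layout of the YAML catalog examples.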