### Post-installation Setup

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Install the package locally and set up pre-commit hooks using pixi run commands.

```bash
pixi run postinstall
pixi run pre-commit-install
```

--------------------------------

### Install Development Environment

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Clone the repository, navigate to the directory, check Rust version, and install project dependencies using pixi.

```bash
git clone https://github.com/Quantco/dataframely
cd dataframely
rustup show
pixi install
```

--------------------------------

### Install dataframely with Pixi

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/index.md

Use this command to install dataframely using the Pixi package manager.

```bash
pixi add dataframely
```

--------------------------------

### Install dataframely skill using skills.sh

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Install the dataframely skill using the skills.sh command-line tool.

```bash
npx skills add Quantco/dataframely
```

--------------------------------

### SQL CREATE TABLE Statement Example

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Example of a generated SQL CREATE TABLE statement for the 'myTable' table, based on the 'MySchema' definition.

```sql
CREATE TABLE "myTable" (
    x BIGINT NOT NULL,
    y VARCHAR NOT NULL,
    PRIMARY KEY (x)
)
```

--------------------------------

### Install dataframely skill for Claude Code

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Install the dataframely skill for Claude Code by downloading the SKILL.md file to the specified directory.
```bash
mkdir -p .claude/skills/dataframely/
curl -o .claude/skills/dataframely/SKILL.md https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/skills/SKILL.md
```

--------------------------------

### Install dataframely with Pip

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/index.md

Use this command to install dataframely using the Pip package manager.

```bash
pip install dataframely
```

--------------------------------

### Sample Relational Collections

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Generate sample data for entire relational data models by calling `.sample()` on a `Collection` class. This example demonstrates sampling for invoices and their associated diagnoses.

```python
class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    code = dy.String(nullable=False, regex=r"[A-Z][0-9]{2,4}")


class HospitalInvoiceData(dy.Collection):
    invoice: dy.LazyFrame[InvoiceSchema]
    diagnosis: dy.LazyFrame[DiagnosisSchema]


invoice_data: HospitalInvoiceData = HospitalInvoiceData.sample(num_rows=10)
```

--------------------------------

### Install dataframely with Pixi or Pip

Source: https://github.com/quantco/dataframely/blob/main/README.md

Install the dataframely library using either the pixi package manager or pip. This is the first step to using dataframely for data frame validation.

```bash
pixi add dataframely
pip install dataframely
```

--------------------------------

### Collection.create_empty

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/generation.rst

Creates an empty collection, i.e. one whose members contain no rows.

```APIDOC
## Collection.create_empty

### Description
Creates an empty collection.

### Method
Class method, called on the collection class

### Parameters
(No parameters explicitly documented)

### Request Example
(Not applicable for this method)

### Response
An instance of the collection whose members contain no rows
```

--------------------------------

### Define a Basic Schema with Dataframely

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Define a schema for your data by subclassing `dy.Schema` and specifying column types and constraints. This example sets up expectations for housing data, including non-nullable columns and a minimum length for zip codes.

```python
import dataframely as dy


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)
```

--------------------------------

### Renamed Schema Conversion Functions

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

Schema conversion functions have been renamed for consistency with other packages. For example, `sql_schema` is now `to_sqlalchemy_columns`.

```python
# v1: schema.sql_schema()
# v2: schema.to_sqlalchemy_columns()

# v1: schema.pyarrow_schema()
# v2: schema.to_pyarrow_schema()

# v1: schema.polars_schema()
# v2: schema.to_polars_schema()
```

--------------------------------

### Inspect Failed Row Counts

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Use `failure.counts()` to get a summary of validation failures per rule. This is useful for quickly identifying which rules are failing and how often.
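
The keys of that summary combine a column name and a rule name, as in the `amount|min_exclusive` entry of the example output. A minimal plain-Python sketch of such a per-rule tally follows, using only the standard library. The `tally_failures` helper, the `rules` mapping, and the sample rows are hypothetical and are not part of dataframely; they only illustrate the shape of the result.

```python
from collections import Counter
from decimal import Decimal


def tally_failures(rows, rules):
    """Hypothetical helper: count, per "column|rule" key, the rows failing that rule."""
    counts = Counter()
    for row in rows:
        for key, check in rules.items():
            if not check(row):
                counts[key] += 1
    return dict(counts)


# Illustrative data only: the second invoice violates `min_exclusive=0`.
rows = [
    {"invoice_id": "001", "amount": Decimal("1000.00")},
    {"invoice_id": "002", "amount": Decimal("0.00")},
]
rules = {"amount|min_exclusive": lambda row: row["amount"] > Decimal(0)}

summary = tally_failures(rows, rules)
print(summary)
```

In dataframely itself, the equivalent summary comes from the call in the guide's snippet: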
```python
# Inspect the reasons for the failed rows
failure.counts()
```

```text
Result: {'amount|min_exclusive': 1}
```

--------------------------------

### Add Custom Rule for Column Ratios in Dataframely Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Extend your schema with custom rules using the `@dy.rule()` decorator to enforce cross-column expectations. This example adds a rule to ensure a reasonable ratio between bathrooms and bedrooms.

```python
import dataframely as dy
import polars as pl


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)
```

--------------------------------

### Build and Open Documentation

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Compile a local build of the documentation using pixi and open the generated HTML file in a web browser.

```bash
# Run build
pixi run -e docs postinstall
pixi run docs

# Open documentation
open docs/_build/html/index.html
```

--------------------------------

### Unit Testing with Generated Data

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Demonstrates setting up sample data for a specific schema (`OutputSchema`) to be used in unit tests, ensuring the function under test receives data in the expected format.
```python
from polars.testing import assert_frame_equal


class OutputSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    amount = dy.Decimal(nullable=False)
```

--------------------------------

### Write, Read, and Scan Parquet Directories

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Use these methods to write, read, or scan entire directories of Parquet files. Ensure the directory path is correctly specified.

```python
collection.write_parquet("/path/to/directory/")
collection.read_parquet("/path/to/directory/")
collection.scan_parquet("/path/to/directory/")
```

--------------------------------

### Generate Synthetic Test Data with `Schema.sample`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use `Schema.sample` for generating random data. Use `overrides` to pin specific columns to certain values for targeted testing. Use `create_empty()` for empty data frames.

```python
import polars as pl
from polars.testing import assert_frame_equal


def test_grouped_sum():
    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": ["a", "a", "b"],
    }).pipe(MyInputSchema.validate, cast=True)

    expected = pl.DataFrame({
        "col1": ["a", "b"],
        "col2": [3, 3],
    })

    result = my_code(df)

    assert_frame_equal(expected, result)
```

--------------------------------

### Schema.sample

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/generation.rst

Generates a sample of data based on the schema.

```APIDOC
## Schema.sample

### Description
Generates a sample of data based on the schema.

### Method
```python
Schema.sample(num_rows=..., overrides=...)
```

### Parameters
#### Query Parameters
- **num_rows** (int) - Optional - The number of rows to generate.
- **overrides** - Optional - Values to pin for specific columns.
```

--------------------------------

### Define Group Rules in Dataframely Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Use the `group_by` parameter in the `@dy.rule()` decorator to evaluate rules across groups of rows.
This example enforces a minimum count of houses per zip code.

```python
import dataframely as dy
import polars as pl


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)

    @dy.rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len() >= 2
```

--------------------------------

### Run Tests

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Execute all project tests using the pixi run test command. The tests path can be adjusted to target specific directories or modules.

```bash
pixi run test
```

--------------------------------

### Generate Synthetic Collection Data with `Collection.sample`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use `Collection.sample` for generating random collection data. Use `overrides` with lists of dicts to specify values for collection members. Use `create_empty()` for empty collections.

```python
MySchema.sample(num_rows=...)
MySchema.sample(overrides=...)
MySchema.create_empty()
```

--------------------------------

### Collection.sample

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/generation.rst

Generates random data for all members of a collection. Useful for testing.

```APIDOC
## Collection.sample

### Description
Generates random data for all members of a collection.

### Method
Class method, called on the collection class (for example `HospitalInvoiceData.sample(num_rows=10)`)

### Parameters
(Not explicitly documented here; the guide examples use `num_rows` and `overrides`)

### Request Example
(Not applicable for this method)

### Response
An instance of the collection populated with sampled data
```

--------------------------------

### Create and Register SQLAlchemy Table

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Create an SQLAlchemy Table object from the generated columns and register it with the database engine. This allows for table creation and data manipulation.

```python
my_table = sa.Table("myTable", sa.MetaData(), *columns)
my_table.create(engine)
```

--------------------------------

### Generate Random Data for Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Employ the `sample` method on a schema to generate synthetic data. This respects per-column validation rules like `regex`, `nullable`, and `primary_key`.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True, regex=r"\d{1,10}")
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Decimal(nullable=False)


df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.sample(num_rows=100)
```

--------------------------------

### Schema-less alternative for comparison

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Illustrates the schema-less alternative to dataframely type hinting, highlighting the reduced information provided to coding agents.

```python
def load_data(raw: pl.LazyFrame) -> pl.DataFrame: ...
```

--------------------------------

### ImplementationError

Source: https://github.com/quantco/dataframely/blob/main/docs/api/errors/index.rst

Raised for general implementation-related errors.
```APIDOC
## ImplementationError

### Description
A general-purpose exception for errors encountered during the implementation or execution of Dataframely features.

### Exception Type
`dataframely.exc.ImplementationError`
```

--------------------------------

### FailureInfo I/O Methods

Source: https://github.com/quantco/dataframely/blob/main/docs/api/filter_result/failure_info.rst

Methods for reading from and writing to various file formats like Parquet and Delta.

```APIDOC
## FailureInfo.write_parquet

### Description
Writes FailureInfo data to a Parquet file.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.sink_parquet

### Description
Sinks FailureInfo data to a Parquet file.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.read_parquet

### Description
Reads FailureInfo data from a Parquet file.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.scan_parquet

### Description
Scans FailureInfo data from a Parquet file.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.write_delta

### Description
Writes FailureInfo data to a Delta table.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.read_delta

### Description
Reads FailureInfo data from a Delta table.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.scan_delta

### Description
Scans FailureInfo data from a Delta table.

### Method
N/A (Method call)

### Parameters
None
```

--------------------------------

### Handle 1:N Relationships in Collection Sampling

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Override the `_preprocess_sample` class method in a `Collection` to handle complex relationships, such as generating a variable number of related records (diagnoses per invoice) to satisfy `@dy.filter` conditions.
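
The key idea is the shape of the per-sample dictionary: one entry for the shared primary key plus, per collection member, a list of row dictionaries. A minimal sketch in plain Python, with no dataframely import, may help before reading the override itself. The `fill_sample` helper is hypothetical and only mimics what such an override does to a single `sample` dictionary; it also draws from a seeded `random.Random` rather than the library's generator, which is an assumption made for reproducibility.

```python
import random
from typing import Any


def fill_sample(sample: dict[str, Any], index: int, rng: random.Random) -> dict[str, Any]:
    """Hypothetical helper mirroring the logic of a `_preprocess_sample` override."""
    # Use the running index as the shared primary key unless the caller pinned one.
    sample.setdefault("invoice_id", str(index))
    # Guarantee at least one diagnosis per invoice: 1 to 10 empty dicts, where
    # an empty dict stands for "sample every column of this member".
    if "diagnosis" not in sample:
        sample["diagnosis"] = [{} for _ in range(rng.randint(1, 10))]
    return sample


rng = random.Random(0)
samples = [fill_sample({}, i, rng) for i in range(3)]
# Values supplied by the caller are left untouched.
pinned = fill_sample({"diagnosis": [{"code": "A123"}]}, 7, rng)
```

Every generated invoice ends up with between one and ten diagnoses, which is what a one-to-at-least-one filter requires. The dataframely version from the guide is: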
```python
from random import random
from typing import Any, override

from dataframely.random import Generator


class HospitalInvoiceData(dy.Collection):
    invoice: dy.LazyFrame[InvoiceSchema]
    diagnosis: dy.LazyFrame[DiagnosisSchema]

    @dy.filter()
    def at_least_one_diagnosis(cls) -> pl.Expr:
        return dy.functional.require_relationship_one_to_at_least_one(
            cls.invoice,
            cls.diagnosis,
            on="invoice_id",
        )

    @classmethod
    @override
    def _preprocess_sample(cls, sample: dict[str, Any], index: int, generator: Generator):
        # Set common primary key.
        if "invoice_id" not in sample:
            sample["invoice_id"] = str(index)

        # Satisfy filter by adding 1-10 diagnoses.
        if "diagnosis" not in sample:
            # NOTE: Every key in the `sample` corresponds to one member of the collection.
            # In this case, diagnoses contains a list of N diagnoses.
            # Inside the list, one can simply pass empty dictionaries, which means that all columns
            # in the member will be sampled.
            sample["diagnosis"] = [{} for _ in range(0, int(random() * 10) + 1)]
        return sample
```

--------------------------------

### Inspect Generated SQL CREATE TABLE Statement

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Print the SQL CREATE TABLE statement that SQLAlchemy would generate for a given table. This is useful for verifying the schema definition before execution.

```python
from sqlalchemy.schema import CreateTable

print(CreateTable(my_table).compile())
```

--------------------------------

### Config

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Configuration settings for the dataframely library.

```APIDOC
## Config

### Description
Provides access to configuration settings for the dataframely library.

### Usage
```python
from dataframely import Config

# Access configuration values (`some_setting` is a placeholder, not a real option name)
print(Config.some_setting)
```
```

--------------------------------

### testing.create_schema

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Utility function to create a schema for testing purposes.

```APIDOC
## testing.create_schema

### Description
Dynamically creates a schema class, typically used for setting up test environments.

### Usage
```python
import dataframely as dy
from dataframely.testing import create_schema

# Assumed signature: create_schema(name, columns), where `columns` maps
# column names to dataframely column definitions.
schema = create_schema(
    "MySchema",
    {
        "id": dy.Int64(),
        "name": dy.String(),
    },
)
```
```

--------------------------------

### testing.create_collection

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Utility function to create a collection for testing purposes.

```APIDOC
## testing.create_collection

### Description
Dynamically creates a collection class from member schemas, often used together with `create_schema` in tests.

### Usage
```python
import dataframely as dy
from dataframely.testing import create_collection, create_schema

# Assumed signature: create_collection(name, schemas), where `schemas` maps
# member names to schema classes.
schema = create_schema(
    "MySchema",
    {
        "id": dy.Int64(primary_key=True),
        "value": dy.Float64(),
    },
)

collection = create_collection("MyCollection", {"first": schema})
```
```

--------------------------------

### Automethod Documentation

Source: https://github.com/quantco/dataframely/blob/main/docs/_templates/autosummary/method.rst

This section details the automethod directive used for generating documentation for a class method.

```APIDOC
## Method Documentation

This page provides documentation for a specific method within a class. The `automethod` directive is used to extract and display the documentation for this method.

### Method Signature
```python
{{ (class + '.'
+ name) | underline }}
```

### Module
```python
.. currentmodule:: {{ module }}
```

### Usage
```python
.. automethod:: {{ class }}.{{ name }}
```
```

--------------------------------

### Configure Ruff for Classmethod Decorators

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

If using Ruff, configure `pyproject.toml` to recognize `@dy.rule` as a decorator that transforms a method into a classmethod.

```toml
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
```

--------------------------------

### Create Empty DataFrame with Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Use `create_empty` to instantiate an empty DataFrame with the specified schema, ensuring correct data types and type hints without generating actual data.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True, regex=r"\d{1,10}")
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Decimal(nullable=False)


# Get data frame with correct type hint.
df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.create_empty()
```

--------------------------------

### Runtime Schema Enforcement with `validate` and `filter`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use `Schema.validate` to raise errors on failure, suitable for unexpected failures. Use `Schema.filter` to gracefully handle possible failures, returning valid rows and `FailureInfo` for introspection.

```python
result = df.pipe(MySchema.validate)
out, failures = df.pipe(MySchema.filter)
```

--------------------------------

### Schema.create_empty_if_none

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/generation.rst

Returns the given data frame, or an empty data frame matching the schema if `None` is passed.

```APIDOC
## Schema.create_empty_if_none

### Description
Returns the given data frame, or an empty data frame matching the schema if the input is `None`.

### Method
```python
Schema.create_empty_if_none(df)
```

### Parameters
#### Path Parameters
- **df** (DataFrame | LazyFrame | None) - Required - The data frame to pass through, or `None` to get an empty one.
```

--------------------------------

### Creating an Empty DataFrame with Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Use `create_empty` to generate an empty DataFrame that adheres to a defined schema. This is particularly useful for testing purposes.

```python
HouseSchema.create_empty()
```

--------------------------------

### Schema.create_empty

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/generation.rst

Creates an empty data frame that adheres to the schema.

```APIDOC
## Schema.create_empty

### Description
Creates an empty data frame that adheres to the schema.

### Method
```python
Schema.create_empty()
```
```

--------------------------------

### Inline Sampling with CollectionMember

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Demonstrates how to use `Annotated` with `dy.CollectionMember(inline_for_sampling=True)` so that non-primary-key columns of that member can be supplied directly at the top level of `overrides`. This simplifies data definition by avoiding nested structures for sampled fields.

```python
from typing import Annotated


class HospitalInvoiceData(dy.Collection):
    invoice: Annotated[
        dy.LazyFrame[InvoiceSchema],
        dy.CollectionMember(inline_for_sampling=True),
    ]
    diagnosis: dy.LazyFrame[DiagnosisSchema]
```

```python
HospitalInvoiceData.sample(overrides=[
    {
        "invoice_id": "1",
        "amount": 1000.0,
        "diagnosis": [{"code": "E11.2"}],
    }
])
```

--------------------------------

### Define Invoice Schema with column constraints

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Extends the InvoiceSchema to include column-level constraints such as primary keys, nullability, and minimum values.
```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    received_at = dy.Datetime(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))
```

--------------------------------

### Define Schema Rule as Classmethod in v2

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

Schema rules must now be defined as classmethods. Add the `cls` argument to your rule signatures to access schema information.

```python
class MySchema(dy.Schema):
    ...

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        ...
```

--------------------------------

### Schema.columns

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/metadata.rst

Retrieves the column definitions of the schema.

```APIDOC
## Schema.columns

### Description
Get the columns of the schema.

### Method
`Schema.columns()`

### Parameters
None

### Response
#### Success Response (dict)
- Returns a dictionary mapping each column name to its column definition.
```

--------------------------------

### Type hinting with dataframely schemas

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Use dataframely type hints to provide explicit schema information to coding agents, improving code understanding and maintainability.

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]: ...
```

--------------------------------

### Write Typed Data Frames with `Schema.write_...`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Prefer `MySchema.write_...` over `df.write_...` to persist schema metadata alongside data for later use during reading.
```python
MySchema.write_parquet(df, "path/to/file.parquet")
```

--------------------------------

### Create a Polars DataFrame for Validation

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Prepare a Polars DataFrame with sample housing data to be validated against a Dataframely schema. This includes defining columns and populating them with various data types and values, including nulls.

```python
import polars as pl

df = pl.DataFrame({
    "zip_code": ["01234", "01234", "1", "213", "123", "213"],
    "num_bedrooms": [2, 2, 1, None, None, 2],
    "num_bathrooms": [1, 2, 1, 1, 0, 8],
    "price": [100_000, 110_000, 50_000, 80_000, 60_000, 160_000]
})
```

--------------------------------

### Schema.matches

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/metadata.rst

Checks whether the schema matches another schema.

```APIDOC
## Schema.matches

### Description
Check whether this schema semantically matches another schema.

### Method
`Schema.matches(other)`

### Parameters
#### Path Parameters
- **other** (type[Schema]) - Required - The schema to compare with.

### Response
#### Success Response (bool)
- Returns True if the schemas match, False otherwise.
```

--------------------------------

### Serialize and Parse Schema Metadata

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Serialize a schema to a JSON string and then parse it using `json.loads`. This demonstrates the string-encoded representation of the schema, including its columns and rules.

```python
import json

json.loads(HouseSchema.serialize())
```

--------------------------------

### Writing Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/io.rst

Methods for writing data to different storage formats.

```APIDOC
## Schema.write_parquet

### Description
Writes a data frame to a Parquet file, persisting schema metadata alongside the data.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.sink_parquet

### Description
Sinks a lazy frame to a Parquet file, persisting schema metadata alongside the data.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.write_delta

### Description
Writes a data frame to a Delta Lake table, persisting schema metadata alongside the data.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Sample Data with Column Overrides (Column-wise)

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Use the `overrides` parameter in `sample` to specify values for certain columns. Dataframely infers the number of rows from the longest sequence provided and broadcasts other columns.

```python
from datetime import date

# Override values for specific columns.
df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.sample(overrides={
    # Use either <schema>.<column>.name or just the column name as a string.
    InvoiceSchema.invoice_id.name: ["1234567890", "2345678901", "3456789012"],
    # Dataframely will automatically infer the number of rows based on the longest given
    # sequence of values and broadcast all other columns to that shape.
    "admission_date": date(2025, 1, 1),
})
```

--------------------------------

### Read Typed Data Frames with `Schema.read_...`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Prefer `MySchema.read_...` over `pl.read_...` to leverage persisted schema metadata when reading data back in.
```python
df = MySchema.read_parquet("path/to/file.parquet")
```

--------------------------------

### Import necessary libraries

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Imports the required libraries for data manipulation and dataframely.

```python
from datetime import date, datetime
from decimal import Decimal

import polars as pl

import dataframely as dy
```

--------------------------------

### Convert Dataframely Schema to SQLAlchemy Columns

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Define a Dataframely schema and convert it into a list of SQLAlchemy columns. This is the first step in generating SQL table definitions.

```python
import dataframely as dy
import sqlalchemy as sa


class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)
    y = dy.String(nullable=False)


engine = sa.create_engine(...)
columns: list[sa.Column] = MySchema.to_sqlalchemy_columns(engine.dialect)
```

--------------------------------

### Schema.to_sqlalchemy_columns

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/conversion.rst

Converts the Schema object into a list of SQLAlchemy Column objects.

```APIDOC
## Schema.to_sqlalchemy_columns

### Description
Converts the Schema object into a list of SQLAlchemy Column objects.

### Method
```python
Schema.to_sqlalchemy_columns(dialect)
```

### Parameters
- **dialect** - Required - The SQLAlchemy dialect to generate columns for, e.g. `engine.dialect`.

### Response
#### Success Response
- A list of SQLAlchemy Column objects representing the schema.
```

--------------------------------

### Schema Serialization

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/io.rst

Methods for serializing and deserializing schemas.

```APIDOC
## Schema.serialize

### Description
Serializes the schema to a JSON string.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## deserialize_schema

### Description
Deserializes a schema from its serialized string representation.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## read_parquet_metadata_schema

### Description
Reads schema metadata from a Parquet file.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Define Schema with Column Metadata

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/column-metadata.md

Use the `metadata` parameter in column definitions to attach custom information. This is useful for marking columns as pseudonymized or providing database-specific details.

```python
class UserSchema(dy.Schema):
    id = dy.String(primary_key=True)
    # Mark last name column as pseudonymized and (non-docstring) comment on it.
    last_name = dy.String(metadata={
        "pseudonymized": True,
        "comment": "Pseudonymized using cryptographic hash function"
    })
    # Add information about database column type.
    address = dy.String(metadata={"database-type": "VARCHAR(MAX)"})
```

--------------------------------

### Reading Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/io.rst

Methods for reading data from different storage formats.

```APIDOC
## Schema.read_parquet

### Description
Reads a Parquet file into a data frame typed with the schema, using persisted schema metadata where available.
### Method (Not specified in source) ### Endpoint (Not specified in source) ### Parameters (Not specified in source) ### Request Example (Not specified in source) ### Response (Not specified in source) ``` ```APIDOC ## Schema.scan_parquet ### Description Scans a schema from a Parquet file. ### Method (Not specified in source) ### Endpoint (Not specified in source) ### Parameters (Not specified in source) ### Request Example (Not specified in source) ### Response (Not specified in source) ``` ```APIDOC ## Schema.read_delta ### Description Reads a schema from a Delta Lake table. ### Method (Not specified in source) ### Endpoint (Not specified in source) ### Parameters (Not specified in source) ### Request Example (Not specified in source) ### Response (Not specified in source) ``` ```APIDOC ## Schema.scan_delta ### Description Scans a schema from a Delta Lake table. ### Method (Not specified in source) ### Endpoint (Not specified in source) ### Parameters (Not specified in source) ### Request Example (Not specified in source) ### Response (Not specified in source) ``` -------------------------------- ### Documenting schema column meanings Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md Document additional domain information for schema columns using docstrings, such as the semantic meanings of enum values. ```python class HospitalStaySchema(dy.Schema): # Reason for admission to the hospital # N = Emergency # V = Transfer from another hospital # ... admission_reason = dy.Enum(["N", "V", ...]) ``` -------------------------------- ### Collection Matches Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/metadata.rst Check if a collection matches certain criteria. ```APIDOC ## Collection.matches ### Description Determines if the collection satisfies a given condition or matches a specified pattern. 
### Method
N/A (Method call on an object)

### Endpoint
N/A

### Parameters
- **criteria** (any) - The criteria or pattern to match against the collection.

### Request Example
```python
collection.matches(some_criteria)
```

### Response
#### Success Response
- **matches** (bool) - True if the collection matches the criteria, False otherwise.
```

--------------------------------

### Generating SQLAlchemy Columns from Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Convert a Dataframely schema into a list of SQLAlchemy columns. This facilitates the creation of SQL tables with types and constraints that match the schema.

```python
HouseSchema.to_sqlalchemy_columns()
```

--------------------------------

### Collection.join

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/operations.rst

Joins two collections based on specified keys or conditions.

```APIDOC
## Collection.join

### Description
Joins two collections based on specified keys or conditions.

### Method
(Not specified, likely a method call on a Collection object)

### Parameters
(Not specified in the source)

### Request Example
(Not specified in the source)

### Response
(Not specified in the source)
```

--------------------------------

### Type Hinting with Schemas for Function Signatures

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Define function signatures using `dy.DataFrame[Schema]` for static type checking. This ensures that functions receive DataFrames with the expected schema, improving code reliability.

```python
def train_model(df: dy.DataFrame[HouseSchema]) -> None:
    ...
```

--------------------------------

### Schema.to_pyarrow_schema

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/conversion.rst

Converts the Schema object into a PyArrow Schema object.

```APIDOC
## Schema.to_pyarrow_schema

### Description
Converts the Schema object into a PyArrow Schema object.
### Method
```python
Schema.to_pyarrow_schema()
```

### Parameters
None

### Response
#### Success Response
- A PyArrow Schema object representing the schema.
```

--------------------------------

### Collection Serialization

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/io.rst

Methods for serializing and deserializing collections, and reading Parquet metadata.

```APIDOC
## Collection.serialize

### Description
Serializes a collection into a specific format.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## deserialize_collection

### Description
Deserializes data into a collection object.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## read_parquet_metadata_collection

### Description
Reads metadata from a Parquet file specifically for collection-related information.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Define Invoice Schema with basic types

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Defines the base schema for the invoice data frame, specifying column names and their basic types.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String()
    admission_date = dy.Date()
    discharge_date = dy.Date()
    received_at = dy.Datetime()
    amount = dy.Decimal()
```

--------------------------------

### Writing Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/io.rst

Methods for writing collection data to various storage formats.
```APIDOC
## Collection.write_parquet

### Description
Writes the collection data to a Parquet file.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.sink_parquet

### Description
Sinks the collection data to Parquet by streaming the members' lazy frames to disk, in the manner of polars' `LazyFrame.sink_parquet`.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.write_delta

### Description
Writes the collection data to a Delta Lake table.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Override Sample Data Values

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Specify custom values for specific columns during data generation using the `overrides` parameter. Each dictionary in the list describes one row; columns that are not listed are sampled as usual.

```python
from datetime import date

df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.sample(overrides=[
    {"invoice_id": "1234567890", "admission_date": date(2025, 1, 1)},
    {"invoice_id": "2345678901", "admission_date": date(2025, 1, 1)},
    {"invoice_id": "3456789012", "admission_date": date(2025, 1, 1)},
])
```

--------------------------------

### Define Function Interface with Dataframely Schemas

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use schemas for all input and output data frames in a function. Omit type hints for private helpers unless schemas improve readability or testability. Omit schemas for short-lived temporary or function-local data frames.
```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
    # Internal data frames do not require schemas
    df: pl.LazyFrame = ...
    return MyPreprocessedSchema.validate(df, cast=True)
```

--------------------------------

### Collection.collect_all

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/operations.rst

Collects all members of a collection into a single structure.

```APIDOC
## Collection.collect_all

### Description
Collects all members of a collection into a single structure.

### Method
(Not specified, likely a method call on a Collection object)

### Parameters
(Not specified in the source)

### Request Example
(Not specified in the source)

### Response
(Not specified in the source)
```

--------------------------------

### Generating PyArrow Schema from Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Obtain a PyArrow schema from a Dataframely schema. This provides column data types and nullability information compatible with PyArrow.

```python
HouseSchema.to_pyarrow_schema()
```

--------------------------------

### Reading Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/io.rst

Methods for reading data from various storage formats into collections.

```APIDOC
## Collection.read_parquet

### Description
Reads data from a Parquet file into a collection.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.scan_parquet

### Description
Lazily scans Parquet data into a collection.
### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.read_delta

### Description
Reads data from a Delta Lake table into a collection.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.scan_delta

### Description
Lazily scans a Delta Lake table into a collection.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Generate Multiple Tables from Dataframely Collection

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Generate and create multiple SQL tables from a Dataframely Collection. This iterates through the collection's member schemas and creates corresponding SQLAlchemy tables.

```python
import sqlalchemy as sa

MyCollection: dy.Collection

meta = sa.MetaData()
for name, dy_schema in MyCollection.member_schemas().items():
    sa.Table(
        name,
        meta,
        *dy_schema.to_sqlalchemy_columns(dialect=engine.dialect),
    )
# `engine` is an existing SQLAlchemy engine; `create_all` needs it as its bind.
meta.create_all(engine)
```

--------------------------------

### Create a Data Frame Collection

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Define a `HospitalClaims` collection by subclassing `dy.Collection`. This collection groups `InvoiceSchema` and `DiagnosisSchema` lazy frames, enabling collection-level validation.
```python
# Introduce a collection for groups of schema-validated data frames
class HospitalClaims(dy.Collection):
    invoices: dy.LazyFrame[InvoiceSchema]
    diagnoses: dy.LazyFrame[DiagnosisSchema]
```

--------------------------------

### Schema.column_names

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/metadata.rst

Retrieves a list of all column names in the schema.

```APIDOC
## Schema.column_names

### Description
Get the list of column names.

### Method
`Schema.column_names()`

### Parameters
None

### Response
#### Success Response (list of strings)
- Returns a list of strings, where each string is a column name.
```

--------------------------------

### Control Schema Validation During Parquet Reads

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Specify the validation behavior when reading Parquet files. Use `"allow"` to suppress the warning and validate only if necessary, `"forbid"` to raise an error if validation cannot be skipped, or `"skip"` to never validate. Omit the argument for the default behavior, which warns when validation is required.

```python
# Will not warn and only validate if necessary
MySchema.read_parquet("my.parquet", validation="allow")
```

```python
# Will raise an error if validation cannot be skipped
MySchema.read_parquet("my.parquet", validation="forbid")
```

```python
# Dangerous: Will never validate. It's possible to load data that violates the schema!
MySchema.read_parquet("my.parquet", validation="skip")
```

--------------------------------

### Define and Serialize Schema to Delta Lake

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Define a custom schema and use it to validate a polars DataFrame before writing it to a Delta Lake table. This method is analogous to polars' native write operations.
```python
class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)

df: dy.DataFrame[MySchema] = MySchema.validate(
    pl.DataFrame(
        {"x": [1, 2, 3]}
    )
)

# Or to deltalake
MySchema.write_delta(df, "/path/to/table")
```

--------------------------------

### Renamed Primary Key Methods

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

Methods related to primary keys have been renamed to better reflect the concept of a single, potentially composite, primary key.

```python
# v1: Schema.primary_keys
# v2: Schema.primary_key

# v1: Collection.common_primary_keys
# v2: Collection.common_primary_key
```

--------------------------------

### Schema Inheritance for Shared Primary Key

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Leverage schema inheritance to reduce redundancy by defining a base `InvoiceIdSchema` for the common `invoice_id` primary key. This base schema is then inherited by `InvoiceSchema` and `DiagnosisSchema`.

```python
# Reduce redundancies in schemas by using schema inheritance.
# Here, we introduce a base schema for the shared primary key.
class InvoiceIdSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)


class InvoiceSchema(InvoiceIdSchema):
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    received_at = dy.Datetime(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))

    @dy.rule()
    def discharge_after_admission(cls) -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")

    @dy.rule()
    def received_at_after_discharge(cls) -> pl.Expr:
        return pl.col("received_at").dt.date() >= pl.col("discharge_date")


class DiagnosisSchema(InvoiceIdSchema):
    diagnosis_code = dy.String(primary_key=True, regex=r"[A-Z][0-9]{2,4}")
    is_main = dy.Bool(nullable=False)

    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return pl.col("is_main").sum() == 1
```

--------------------------------

### Define and Serialize Schema to Parquet

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Define a custom schema and use it to validate a polars DataFrame before writing it to a parquet file. This method is analogous to polars' native write operations.

```python
class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)

df: dy.DataFrame[MySchema] = MySchema.validate(
    pl.DataFrame(
        {"x": [1, 2, 3]}
    )
)

# The serialization methods provide interfaces that are as close as possible to the
# polars interface you are probably familiar with

# Writing to parquet
MySchema.write_parquet(df, "my.parquet")
```

--------------------------------

### Define Composite Primary Key in Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/primary-keys.md

Combine multiple columns as a primary key by setting `primary_key=True` on each. This ensures that the combination of values across these columns is unique for each record.
```python
class LineItemSchema(dy.Schema):
    invoice_id = dy.Int64(primary_key=True)
    item_id = dy.Int64(primary_key=True)
    price = dy.Decimal()
```

--------------------------------

### Collection To Dictionary

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/metadata.rst

Convert the collection to a dictionary representation.

```APIDOC
## Collection.to_dict

### Description
Converts the collection and its members into a dictionary format.

### Method
N/A (Method call on an object)

### Endpoint
N/A

### Parameters
None

### Request Example
```python
collection.to_dict()
```

### Response
#### Success Response
- **dictionary_representation** (dict) - A dictionary representing the collection.
```

--------------------------------

### Inspect Co-occurrence of Validation Failures

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Use `failure.cooccurrence_counts()` to understand how validation failures occur together. This helps in identifying systemic issues rather than isolated ones.

```python
# Inspect the co-occurrences of validation failures
failure.cooccurrence_counts()
```

```text
Result: {frozenset({'amount|min_exclusive'}): 1}
```

--------------------------------

### random.Generator

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Random number generator utility.

```APIDOC
## random.Generator

### Description
A random number generator utility, likely based on Python's standard `random` module or a similar implementation.

### Usage
```python
from dataframely.random import Generator

# Create a generator instance
rng = Generator()

# Generate random numbers
print(rng.random())
print(rng.randint(1, 10))
```
```

--------------------------------

### Define Cross-Column Constraints with Dataframely Rules

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Define cross-column constraints for a `dy.Schema` using methods decorated with `@dy.rule`.
Use expressive names for rules and `cls` to refer to schema columns.

```python
class MySchema(dy.Schema):
    col1 = dy.UInt8()
    col2 = dy.UInt8()

    @dy.rule()
    def col1_greater_col2(cls) -> pl.Expr:
        return cls.col1.col > cls.col2.col
```

--------------------------------

### AnnotationImplementationError

Source: https://github.com/quantco/dataframely/blob/main/docs/api/errors/index.rst

Raised when there are issues with annotation implementation.

```APIDOC
## AnnotationImplementationError

### Description
Specific error raised when problems occur with the implementation or usage of annotations within Dataframely.

### Exception Type
`dataframely.exc.AnnotationImplementationError`
```

--------------------------------

### Read Parquet Schema Eagerly

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Read a parquet file back into a dataframely DataFrame, inferring the schema from the stored metadata. This avoids re-validation if the schema matches.

```python
# Reading parquet eagerly
new_df: dy.DataFrame[MySchema] = MySchema.read_parquet("my.parquet")
```

--------------------------------

### FailureInfo Inspection Methods

Source: https://github.com/quantco/dataframely/blob/main/docs/api/filter_result/failure_info.rst

Methods for inspecting failure data, including identifying invalid entries and calculating counts.

```APIDOC
## FailureInfo.invalid

### Description
Checks for invalid entries within the FailureInfo.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.counts

### Description
Calculates the counts of failure occurrences.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.cooccurrence_counts

### Description
Calculates co-occurrence counts for failures.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.__len__

### Description
Returns the total number of items in FailureInfo.
### Method
N/A (Method call)

### Parameters
None
```

--------------------------------

### Read Delta Lake Schema Eagerly

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Read a Delta Lake table back into a dataframely DataFrame, inferring the schema from the stored metadata. This avoids re-validation if the schema matches.

```python
# Or deltalake eagerly
new_df: dy.DataFrame[MySchema] = MySchema.read_delta("/path/to/table")
```

--------------------------------

### Define a Dataframely Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Define a schema with various column types and a custom rule. This schema can then be serialized.

```python
class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)
```

--------------------------------

### Test Diabetes Invoice Amounts Generation

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

A pytest test case for the `get_diabetes_invoice_amounts` function. It uses `HospitalInvoiceData.sample` with overrides to create test data, including invoices with and without diabetes diagnoses, and asserts the output against an expected DataFrame.
```python
from polars.testing import assert_frame_equal


def test_get_diabetes_invoice_amounts() -> None:
    # Arrange
    invoice_data = HospitalInvoiceData.sample(
        overrides=[
            # Invoice with diabetes diagnosis
            {
                "invoice_id": "1",
                "invoice": {"amount": 1500.0},
                "diagnosis": [{"code": "E11.2"}],
            },
            # Invoice without diabetes diagnosis
            {
                "invoice_id": "2",
                "invoice": {"amount": 1000.0},
                "diagnosis": [{"code": "J45.909"}],
            },
        ]
    )
    expected = OutputSchema.validate(
        pl.DataFrame(
            {
                "invoice_id": ["1"],
                "amount": [1500.0],
            }
        ),
        cast=True,
    ).lazy()

    # Act
    actual = get_diabetes_invoice_amounts(invoice_data)

    # Assert
    assert_frame_equal(actual, expected)
```

--------------------------------

### Define Invoice Schema with cross-column rules

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Adds cross-column validation rules to the InvoiceSchema using the @dy.rule decorator to ensure logical data integrity.

```python
from decimal import Decimal


class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    received_at = dy.Datetime(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))

    @dy.rule()
    def discharge_after_admission(cls) -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")

    @dy.rule()
    def received_at_after_discharge(cls) -> pl.Expr:
        return pl.col("received_at").dt.date() >= pl.col("discharge_date")
```
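
--------------------------------

### Sketch: what the two invoice rules check per row

The two rules in `InvoiceSchema` are polars expressions that are evaluated for every row. As a rough illustration only, the plain-Python sketch below mirrors their logic using the standard library, without dataframely or polars. The helper `failed_rules` and the sample rows are hypothetical and not part of dataframely. Both rules compare with `>=`, so a same-day discharge or a same-day receipt passes, and `received_at` is truncated to its date before the comparison.

```python
from datetime import date, datetime


def discharge_after_admission(row: dict) -> bool:
    # Mirrors: pl.col("discharge_date") >= pl.col("admission_date")
    return row["discharge_date"] >= row["admission_date"]


def received_at_after_discharge(row: dict) -> bool:
    # Mirrors: pl.col("received_at").dt.date() >= pl.col("discharge_date")
    return row["received_at"].date() >= row["discharge_date"]


def failed_rules(row: dict) -> list[str]:
    """Hypothetical helper: names of the rules that a single row violates."""
    rules = {
        "discharge_after_admission": discharge_after_admission,
        "received_at_after_discharge": received_at_after_discharge,
    }
    return [name for name, rule in rules.items() if not rule(row)]


# Received on the day of discharge: passes both rules because of `>=`.
valid_row = {
    "admission_date": date(2025, 1, 1),
    "discharge_date": date(2025, 1, 5),
    "received_at": datetime(2025, 1, 5, 9, 30),
}

# Discharged before admission: violates the first rule only.
invalid_row = {
    "admission_date": date(2025, 1, 10),
    "discharge_date": date(2025, 1, 5),
    "received_at": datetime(2025, 1, 12, 8, 0),
}

print(failed_rules(valid_row))
print(failed_rules(invalid_row))
```

In dataframely itself these checks run vectorized over the whole data frame as part of schema validation; the sketch only shows the row-level meaning of each expression.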