### Post-installation Setup

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Install the package locally and set up pre-commit hooks using pixi run commands.

```bash
pixi run postinstall
pixi run pre-commit-install
```

--------------------------------

### Install Development Environment

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Clone the repository, navigate to the directory, check Rust version, and install project dependencies using pixi.

```bash
git clone https://github.com/Quantco/dataframely
cd dataframely
rustup show
pixi install
```

--------------------------------

### Install dataframely with Pixi

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/index.md

Use this command to install dataframely using the Pixi package manager.

```bash
pixi add dataframely
```

--------------------------------

### Install dataframely skill using skills.sh

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Install the dataframely skill using the skills.sh command-line tool.

```bash
npx skills add Quantco/dataframely
```

--------------------------------

### SQL CREATE TABLE Statement Example

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Example of a generated SQL CREATE TABLE statement for the 'myTable' table, based on the 'MySchema' definition.

```sql
CREATE TABLE "myTable"
(
    x BIGINT  NOT NULL,
    y VARCHAR NOT NULL,
    PRIMARY KEY (x)
)
```
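
One way to sanity-check a generated statement like this is to execute it against a throwaway database and inspect the resulting columns. The sketch below uses Python's standard-library `sqlite3` module. Choosing SQLite is an assumption made for illustration: the statement above may have been compiled for a different dialect, and SQLite records declared type names without enforcing them strictly.

```python
import sqlite3

DDL = """
CREATE TABLE "myTable"
(
    x BIGINT  NOT NULL,
    y VARCHAR NOT NULL,
    PRIMARY KEY (x)
)
"""

# An in-memory database is discarded when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# PRAGMA table_info yields one row per column:
# (cid, name, declared type, notnull flag, default value, primary-key position)
columns = conn.execute('PRAGMA table_info("myTable")').fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(name, col_type, "NOT NULL" if notnull else "NULL", "PK" if pk else "")

conn.close()
```

If the statement is malformed, `conn.execute` raises `sqlite3.OperationalError`, which makes this a cheap smoke test before running the DDL against a real database.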

--------------------------------

### Install dataframely skill for Claude Code

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Install the dataframely skill for Claude Code by downloading the SKILL.md file to the specified directory.

```bash
mkdir -p .claude/skills/dataframely/
curl -o .claude/skills/dataframely/SKILL.md https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/skills/SKILL.md
```

--------------------------------

### Install dataframely with Pip

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/index.md

Use this command to install dataframely using the Pip package manager.

```bash
pip install dataframely
```

--------------------------------

### Sample Relational Collections

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Generate sample data for entire relational data models by calling `.sample()` on a `Collection` class. This example demonstrates sampling for invoices and their associated diagnoses.

```python
class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    code = dy.String(nullable=False, regex=r"[A-Z][0-9]{2,4}")

class HospitalInvoiceData(dy.Collection):
    invoice: dy.LazyFrame[InvoiceSchema]
    diagnosis: dy.LazyFrame[DiagnosisSchema]

invoice_data: HospitalInvoiceData = HospitalInvoiceData.sample(num_rows=10)
```

--------------------------------

### Install dataframely with Pixi or Pip

Source: https://github.com/quantco/dataframely/blob/main/README.md

Install the dataframely library using either the pixi package manager or pip. This is the first step to using dataframely for data frame validation.

```bash
# Option 1: install with pixi
pixi add dataframely
# Option 2: install with pip
pip install dataframely
```

--------------------------------

### Collection.create_empty

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/generation.rst

Creates an instance of the collection in which every member is an empty data frame matching its schema. This is mainly useful in tests and as a typed placeholder.

```APIDOC
## Collection.create_empty

### Description
Creates an empty collection: each member is an empty data frame that conforms to its schema.

### Method
Class method, called on the collection class (for example `MyCollection.create_empty()`, where `MyCollection` is a placeholder name).

### Parameters
None documented in the source.

### Response
Not documented in the source; presumably an instance of the collection with empty members.
```

--------------------------------

### Define a Basic Schema with Dataframely

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Define a schema for your data by subclassing `dy.Schema` and specifying column types and constraints. This example sets up expectations for housing data, including non-nullable columns and a minimum length for zip codes.

```python
import dataframely as dy


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)
```

--------------------------------

### Renamed Schema Conversion Functions

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

Schema conversion functions have been renamed for consistency with other packages. For example, `sql_schema` is now `to_sqlalchemy_columns`.

```python
# v1: schema.sql_schema()
# v2: schema.to_sqlalchemy_columns()

# v1: schema.pyarrow_schema()
# v2: schema.to_pyarrow_schema()

# v1: schema.polars_schema()
# v2: schema.to_polars_schema()
```
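
When migrating a larger code base, the renames above can be applied mechanically. The following is a minimal sketch using only the standard library; `migrate_source` and `RENAMES` are hypothetical names introduced here, not part of dataframely, and the regex only handles direct method calls of the form `.old_name(`.

```python
import re

# v1 -> v2 method names, taken from the renames listed above.
RENAMES = {
    "sql_schema": "to_sqlalchemy_columns",
    "pyarrow_schema": "to_pyarrow_schema",
    "polars_schema": "to_polars_schema",
}

# Match ".old_name(" only, so that e.g. ".to_polars_schema(" is left untouched.
_PATTERN = re.compile(r"\.(" + "|".join(RENAMES) + r")\(")


def migrate_source(source: str) -> str:
    """Rewrite v1 conversion calls in `source` to their v2 names (hypothetical helper)."""
    return _PATTERN.sub(lambda m: "." + RENAMES[m.group(1)] + "(", source)


old = "cols = MySchema.sql_schema(dialect)\nschema = MySchema.polars_schema()"
print(migrate_source(old))
```

A text-based rewrite like this can also touch unrelated objects that happen to have a method with the same name, so the resulting diff should be reviewed by hand.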

--------------------------------

### Inspect Failed Row Counts

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Use `failure.counts()` to get a summary of validation failures per rule. This is useful for quickly identifying which rules are failing and how often.

```python
# Inspect the reasons for the failed rows
failure.counts()
```

```text
Result:
{'amount|min_exclusive': 1}
```
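
The keys in the result above appear to follow a `<column>|<rule>` pattern. Assuming that format holds (it is inferred from this single example, not from the API reference), a small standard-library sketch can split the keys and list the most frequent failures first. `split_rule_key` is a hypothetical helper, and the `counts` dict is copied from the output above rather than produced by dataframely.

```python
from typing import Optional

# Example mapping in the shape shown above: "<column>|<rule>" -> number of failing rows.
counts = {"amount|min_exclusive": 1}


def split_rule_key(key: str) -> tuple[Optional[str], str]:
    """Split a key into (column, rule); keys without "|" are treated as schema-level rules."""
    column, sep, rule = key.partition("|")
    return (column, rule) if sep else (None, key)


# List the most frequent failures first.
for key, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    column, rule = split_rule_key(key)
    print(f"column={column!r} rule={rule!r} failed_rows={n}")
```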

--------------------------------

### Add Custom Rule for Column Ratios in Dataframely Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Extend your schema with custom rules using the `@dy.rule()` decorator to enforce cross-column expectations. This example adds a rule to ensure a reasonable ratio between bathrooms and bedrooms.

```python
import dataframely as dy
import polars as pl


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)
```
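
To see what the rule's expression computes for individual rows, here is a plain-Python equivalent of the same predicate. It is only an illustration: `ratio_is_reasonable` is a hypothetical helper, the rows are made up, and the sketch does not model how null values or zero bedrooms are treated.

```python
def ratio_is_reasonable(num_bathrooms: int, num_bedrooms: int) -> bool:
    """Plain-Python equivalent of the rule's expression for a single row."""
    ratio = num_bathrooms / num_bedrooms
    return 1 / 3 <= ratio <= 3


# (num_bathrooms, num_bedrooms) pairs, made up for illustration.
rows = [(1, 2), (2, 2), (1, 3), (8, 2), (1, 4)]
for bathrooms, bedrooms in rows:
    print(bathrooms, bedrooms, ratio_is_reasonable(bathrooms, bedrooms))
```

Both bounds are inclusive, so a house with one bathroom and three bedrooms still passes, while eight bathrooms for two bedrooms does not.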

--------------------------------

### Build and Open Documentation

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Build the documentation locally using pixi and open the generated HTML file in a web browser.

```bash
# Run build
pixi run -e docs postinstall
pixi run docs

# Open documentation
open docs/_build/html/index.html
```

--------------------------------

### Unit Testing with Generated Data

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Demonstrates setting up sample data for a specific schema (`OutputSchema`) to be used in unit tests, ensuring the function under test receives data in the expected format.

```python
from polars.testing import assert_frame_equal


class OutputSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    amount = dy.Decimal(nullable=False)

```

--------------------------------

### Write, Read, and Scan Parquet Directories

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Use these methods to write, read, or scan entire directories of Parquet files. Ensure the directory path is correctly specified.

```python
collection.write_parquet("/path/to/directory/")
collection.read_parquet("/path/to/directory/")
collection.scan_parquet("/path/to/directory/")
```

--------------------------------

### Generate Synthetic Test Data with `Schema.sample`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

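Use `Schema.sample` to generate random data. Use `overrides` to pin specific columns to certain values for targeted testing. Use `create_empty()` for empty data frames. The test below shows a related pattern: hand-written input is validated with `MyInputSchema.validate(..., cast=True)` before it is passed to the code under test.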

```python
import polars as pl
from polars.testing import assert_frame_equal


def test_grouped_sum():
    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": ["a", "a", "b"],
    }).pipe(MyInputSchema.validate, cast=True)

    expected = pl.DataFrame({
        "col1": ["a", "b"],
        "col2": [3, 3],
    })

    result = my_code(df)

    assert_frame_equal(expected, result)
```

--------------------------------

### Schema.sample

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/generation.rst

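Generates random data that conforms to the schema.

```APIDOC
## Schema.sample

### Description
Generates a data frame of random rows that satisfy the schema's column constraints.

### Method
`Schema.sample(num_rows=None, overrides=None)`

### Parameters
- **num_rows** (int) - Optional - The number of rows to generate. When omitted, the row count is inferred from `overrides`; without overrides, a single row is generated.
- **overrides** - Optional - Values to pin for specific columns, as shown in the data-generation guide.
```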

--------------------------------

### Define Group Rules in Dataframely Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Use the `group_by` parameter in the `@dy.rule()` decorator to evaluate rules across groups of rows. This example enforces a minimum count of houses per zip code.

```python
import dataframely as dy
import polars as pl


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)

    @dy.rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls) -> pl.Expr:
        return pl.len() >= 2
```
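
The group rule can be pictured as counting rows per `zip_code` and then giving every row the verdict of its group. The sketch below reproduces that idea with the standard library only; it is an analogy rather than dataframely's implementation, and the zip codes are illustrative values.

```python
from collections import Counter

zip_codes = ["01234", "01234", "1", "213", "123", "213"]

# Count rows per group, mirroring `pl.len()` evaluated once per `zip_code` group.
group_sizes = Counter(zip_codes)

# In this sketch, every row inherits the verdict of its group.
passes = [group_sizes[z] >= 2 for z in zip_codes]
print(passes)
```

Here the rows with zip codes `"1"` and `"123"` fail because each of those groups contains only a single row.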

--------------------------------

### Run Tests

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/development.md

Execute all project tests using the pixi run test command. The tests path can be adjusted to target specific directories or modules.

```bash
pixi run test
```

--------------------------------

### Generate Synthetic Collection Data with `Collection.sample`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use `Collection.sample` for generating random collection data. Use `overrides` with lists of dicts to specify values for collection members. Use `create_empty()` for empty collections.

```python
MyCollection.sample(num_rows=...)
MyCollection.sample(overrides=...)
MyCollection.create_empty()
```

--------------------------------

### Collection.sample

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/generation.rst

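Generates random data for every member of a collection. This is useful for building consistent test inputs for a whole relational data model.

```APIDOC
## Collection.sample

### Description
Generates a collection instance filled with random data that conforms to the member schemas.

### Method
Class method, called on the collection class (for example `HospitalInvoiceData.sample(num_rows=10)`).

### Parameters
- **num_rows** - Optional - The number of samples to generate.
- **overrides** - Optional - A list of dicts that pin values for individual samples and their members.

### Response
Not documented in the source; the guides show it returning an instance of the collection.
```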

--------------------------------

### Create and Register SQLAlchemy Table

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Create an SQLAlchemy Table object from the generated columns and register it with the database engine. This allows for table creation and data manipulation.

```python
my_table = sa.Table("myTable", sa.MetaData(), *columns)
my_table.create(engine)
```

--------------------------------

### Generate Random Data for Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Employ the `sample` method on a schema to generate synthetic data. This respects per-column validation rules like `regex`, `nullable`, and `primary_key`.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True, regex=r"\d{1,10}")
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Decimal(nullable=False)

df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.sample(num_rows=100)
```

--------------------------------

### Schema-less alternative for comparison

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Illustrates the schema-less alternative to dataframely type hinting, highlighting the reduced information provided to coding agents.

```python
def load_data(raw: pl.LazyFrame) -> pl.DataFrame:
    ...
```

--------------------------------

### ImplementationError

Source: https://github.com/quantco/dataframely/blob/main/docs/api/errors/index.rst

Raised for general implementation-related errors.

```APIDOC
## ImplementationError

### Description
A general-purpose exception for errors encountered during the implementation or execution of Dataframely features.

### Exception Type
`dataframely.exc.ImplementationError`
```

--------------------------------

### FailureInfo I/O Methods

Source: https://github.com/quantco/dataframely/blob/main/docs/api/filter_result/failure_info.rst

Methods for reading from and writing to various file formats like Parquet and Delta.

```APIDOC
## FailureInfo.write_parquet

### Description
Writes the failure information to a Parquet file.

### Parameters
Not documented in the source.

## FailureInfo.sink_parquet

### Description
Streams the failure information to a Parquet file.

### Parameters
Not documented in the source.

## FailureInfo.read_parquet

### Description
Reads failure information from a Parquet file.

### Parameters
Not documented in the source.

## FailureInfo.scan_parquet

### Description
Lazily reads failure information from a Parquet file.

### Parameters
Not documented in the source.

## FailureInfo.write_delta

### Description
Writes the failure information to a Delta table.

### Parameters
Not documented in the source.

## FailureInfo.read_delta

### Description
Reads failure information from a Delta table.

### Parameters
Not documented in the source.

## FailureInfo.scan_delta

### Description
Lazily reads failure information from a Delta table.

### Parameters
Not documented in the source.
```

--------------------------------

### Handle 1:N Relationships in Collection Sampling

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Override the `_preprocess_sample` class method in a `Collection` to handle complex relationships, such as generating a variable number of related records (diagnoses per invoice) to satisfy `@dy.filter` conditions.

```python
from random import random
from typing import Any, override

import dataframely as dy
import polars as pl
from dataframely.random import Generator


class HospitalInvoiceData(dy.Collection):
    invoice: dy.LazyFrame[InvoiceSchema]
    diagnosis: dy.LazyFrame[DiagnosisSchema]

    @dy.filter()
    def at_least_one_diagnosis(cls) -> pl.Expr:
        return dy.functional.require_relationship_one_to_at_least_one(
            cls.invoice,
            cls.diagnosis,
            on="invoice_id",
        )

    @classmethod
    @override
    def _preprocess_sample(cls, sample: dict[str, Any], index: int, generator: Generator):
        # Set common primary key.
        if "invoice_id" not in sample:
            sample["invoice_id"] = str(index)

        # Satisfy filter by adding 1-10 diagnoses.
        if "diagnosis" not in sample:
            # NOTE: Every key in the `sample` corresponds to one member of the collection.
            # In this case, diagnoses contains a list of N diagnoses.
            # Inside the list, one can simply pass empty dictionaries, which means that all columns
            # in the member will be sampled.
            sample["diagnosis"] = [{} for _ in range(0, int(random() * 10) + 1)]
        return sample
```

--------------------------------

### Inspect Generated SQL CREATE TABLE Statement

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Print the SQL CREATE TABLE statement that SQLAlchemy would generate for a given table. This is useful for verifying the schema definition before execution.

```python
from sqlalchemy.schema import CreateTable

print(CreateTable(my_table).compile())
```

--------------------------------

### Config

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Configuration settings for the dataframely library.

```APIDOC
## Config

### Description
Provides access to configuration settings for the dataframely library.

### Usage
```python
from dataframely import Config

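# `some_setting` is a placeholder: the source does not list the real option names,
# so replace it with an option from the API reference before running this.
print(Config.some_setting)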
```
```

--------------------------------

### testing.create_schema

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Utility function to create a schema for testing purposes.

```APIDOC
## testing.create_schema

### Description
Creates a schema class dynamically from a name and a mapping of column names to column definitions, which is convenient when setting up tests.

### Usage
```python
import dataframely as dy
from dataframely.testing import create_schema

# Assumed signature: create_schema(name, columns), where `columns` maps
# column names to dataframely column definitions. Verify against the API reference.
schema = create_schema(
    "TestSchema",
    {
        "id": dy.Int64(primary_key=True),
        "name": dy.String(),
    },
)
```
```

--------------------------------

### testing.create_collection

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Utility function to create a collection for testing purposes.

```APIDOC
## testing.create_collection

### Description
Creates a collection class dynamically from a name and a mapping of member names to schemas, for use in tests.

### Usage
```python
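import dataframely as dy
from dataframely.testing import create_collection, create_schema

# Assumed signatures: create_schema(name, columns) and
# create_collection(name, schemas), where `schemas` maps member names
# to schema classes. Verify against the API reference.
first = create_schema(
    "First",
    {"id": dy.Int64(primary_key=True), "value": dy.Float64()},
)
second = create_schema("Second", {"id": dy.Int64(primary_key=True)})

collection = create_collection(
    "TestCollection",
    {"first": first, "second": second},
)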
```
```

--------------------------------

### Automethod Documentation

Source: https://github.com/quantco/dataframely/blob/main/docs/_templates/autosummary/method.rst

This Sphinx autosummary template renders the documentation page for a single method. It prints the qualified method name as an underlined title, sets the current module, and uses the `automethod` directive to extract and display the method's documentation.

```rst
{{ (class + '.' + name) | underline }}

.. currentmodule:: {{ module }}

.. automethod:: {{ class }}.{{ name }}
```

--------------------------------

### Configure Ruff for Classmethod Decorators

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

If using Ruff, configure `pyproject.toml` to recognize `@dy.rule` as a decorator that transforms a method into a classmethod.

```toml
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
```

--------------------------------

### Create Empty DataFrame with Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Use `create_empty` to instantiate an empty DataFrame with the specified schema, ensuring correct data types and type hints without generating actual data.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True, regex=r"\d{1,10}")
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Decimal(nullable=False)

# Get data frame with correct type hint.
df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.create_empty()
```

--------------------------------

### Runtime Schema Enforcement with `validate` and `filter`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use `Schema.validate` to raise errors on failure, suitable for unexpected failures. Use `Schema.filter` to gracefully handle possible failures, returning valid rows and `FailureInfo` for introspection.

```python
result = df.pipe(MySchema.validate)
out, failures = df.pipe(MySchema.filter)
```
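
The difference between the two calls is one of failure handling: one fails loudly, the other fails softly and returns the rejects. The sketch below illustrates that contrast in plain Python on a list of dicts. All names (`is_valid`, `validate`, `filter_rows`) are hypothetical stand-ins and do not reproduce dataframely's API or its `FailureInfo` type.

```python
def is_valid(row: dict) -> bool:
    """Hypothetical row check standing in for a schema's rules."""
    return row["amount"] > 0


def validate(rows: list[dict]) -> list[dict]:
    """Fail loudly: raise if any row violates the check."""
    bad = [r for r in rows if not is_valid(r)]
    if bad:
        raise ValueError(f"{len(bad)} row(s) failed validation")
    return rows


def filter_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Fail softly: return the valid rows plus the rejected ones for inspection."""
    good = [r for r in rows if is_valid(r)]
    bad = [r for r in rows if not is_valid(r)]
    return good, bad


rows = [{"amount": 10}, {"amount": -5}]
good, bad = filter_rows(rows)
print(good, bad)
```

The loud variant suits inputs that should never be invalid, while the soft variant suits pipelines that expect some bad rows and want to log or quarantine them.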

--------------------------------

### Schema.create_empty_if_none

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/generation.rst

Returns the given data frame unchanged, or an empty data frame matching the schema if the argument is `None`.

```APIDOC
## Schema.create_empty_if_none

### Description
Returns the data frame passed in if it is not `None`; otherwise creates an empty data frame that conforms to the schema.

### Method
`Schema.create_empty_if_none(df)`

### Parameters
- **df** - Required - A data frame conforming to the schema, or `None`.
```

--------------------------------

### Creating an Empty DataFrame with Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Use `create_empty` to generate an empty DataFrame that adheres to a defined schema. This is particularly useful for testing purposes.

```python
HouseSchema.create_empty()
```

--------------------------------

### Schema.create_empty

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/generation.rst

Creates an empty data frame that conforms to the schema.

```APIDOC
## Schema.create_empty

### Description
Creates a data frame with zero rows whose columns and data types match the schema.

### Method
`Schema.create_empty()`
```

--------------------------------

### Inline Sampling with CollectionMember

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Demonstrates how to use `Annotated` with `dy.CollectionMember(inline_for_sampling=True)` to allow direct supply of non-primary key columns at the top level of overrides. This simplifies data definition by avoiding nested structures for sampled fields.

```python
from typing import Annotated

class HospitalInvoiceData(dy.Collection):
    invoice: Annotated[
        dy.LazyFrame[InvoiceSchema],
        dy.CollectionMember(inline_for_sampling=True),
    ]
    diagnosis: dy.LazyFrame[DiagnosisSchema]
```

```python
HospitalInvoiceData.sample(overrides=[
    {
        "invoice_id": "1",
        "amount": 1000.0,
        "diagnosis": [{"code": "E11.2"}],
    }
])
```

--------------------------------

### Define Invoice Schema with column constraints

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Extends the InvoiceSchema to include column-level constraints such as primary keys, nullability, and minimum values.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    received_at = dy.Datetime(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))
```

--------------------------------

### Define Schema Rule as Classmethod in v2

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

Schema rules must now be defined as classmethods. Add the `cls` argument to your rule signatures to access schema information.

```python
class MySchema(dy.Schema):
    ...

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        ...
```

--------------------------------

### Schema.columns

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/metadata.rst

Retrieves detailed information about each column in the schema.

```APIDOC
## Schema.columns

### Description
Get the columns of the schema.

### Method
`Schema.columns()`

### Parameters
None

### Response
#### Success Response (dict)
- Returns a mapping from column names to the column definitions declared on the schema.
```

--------------------------------

### Type hinting with dataframely schemas

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Use dataframely type hints to provide explicit schema information to coding agents, improving code understanding and maintainability.

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
    ...
```

--------------------------------

### Write Typed Data Frames with `Schema.write_...`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Prefer `MySchema.write_...` over `df.write_...` to persist schema metadata alongside data for later use during reading.

```python
MySchema.write_parquet(df, "path/to/file.parquet")
```

--------------------------------

### Create a Polars DataFrame for Validation

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Prepare a Polars DataFrame with sample housing data to be validated against a Dataframely schema. This includes defining columns and populating them with various data types and values, including nulls.

```python
import polars as pl

df = pl.DataFrame({
    "zip_code": ["01234", "01234", "1", "213", "123", "213"],
    "num_bedrooms": [2, 2, 1, None, None, 2],
    "num_bathrooms": [1, 2, 1, 1, 0, 8],
    "price": [100_000, 110_000, 50_000, 80_000, 60_000, 160_000]
})
```

--------------------------------

### Schema.matches

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/metadata.rst

Checks whether this schema is semantically equal to another schema.

```APIDOC
## Schema.matches

### Description
Checks whether this schema semantically matches another schema, that is, whether both define the same columns and rules.

### Method
`Schema.matches(other)`

### Parameters
- **other** (Schema class) - Required - The schema to compare against.

### Response
#### Success Response (bool)
- Returns True if the schemas match, False otherwise.
```

--------------------------------

### Serialize and Parse Schema Metadata

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Serialize a schema to a JSON string and then parse it using `json.loads`. This demonstrates the string-encoded representation of the schema, including its columns and rules.

```python
import json

json.loads(HouseSchema.serialize())
```

--------------------------------

### Writing Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/io.rst

Methods for writing data to different storage formats.

```APIDOC
## Schema.write_parquet

### Description
Writes a data frame to a Parquet file and stores the schema's metadata alongside the data.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.sink_parquet

### Description
Streams a lazy frame to a Parquet file and stores the schema's metadata alongside the data.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.write_delta

### Description
Writes a data frame to a Delta Lake table and stores the schema's metadata alongside the data.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Sample Data with Column Overrides (Column-wise)

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Use the `overrides` parameter in `sample` to specify values for certain columns. Dataframely infers the number of rows from the longest sequence provided and broadcasts other columns.

```python
from datetime import date

# Override values for specific columns.
df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.sample(overrides={
    # Use either <schema>.<column>.name or just the column name as a string.
    InvoiceSchema.invoice_id.name: ["1234567890", "2345678901", "3456789012"],
    # Dataframely will automatically infer the number of rows based on the longest given
    # sequence of values and broadcast all other columns to that shape.
    "admission_date": date(2025, 1, 1),
})
```
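
The broadcasting behaviour described above can be sketched in plain Python: find the longest list, then repeat every scalar to that length. `broadcast_overrides` is a hypothetical helper written for illustration and is not part of dataframely. As a simplification of this sketch, lists shorter than the longest one are rejected rather than broadcast.

```python
from datetime import date
from typing import Any


def broadcast_overrides(overrides: dict[str, Any]) -> dict[str, list[Any]]:
    """Expand scalar overrides to the length of the longest list (hypothetical helper)."""
    lengths = [len(v) for v in overrides.values() if isinstance(v, list)]
    num_rows = max(lengths, default=1)
    result = {}
    for name, value in overrides.items():
        if isinstance(value, list):
            if len(value) != num_rows:
                # Simplification of this sketch: lists must already have the full length.
                raise ValueError(
                    f"column {name!r} has {len(value)} values, expected {num_rows}"
                )
            result[name] = value
        else:
            # Scalars are repeated once per row.
            result[name] = [value] * num_rows
    return result


expanded = broadcast_overrides({
    "invoice_id": ["1234567890", "2345678901", "3456789012"],
    "admission_date": date(2025, 1, 1),
})
print(expanded["admission_date"])
```

With the overrides from the example above, the three invoice IDs determine the row count and the single admission date is repeated for each row.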

--------------------------------

### Read Typed Data Frames with `Schema.read_...`

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Prefer `MySchema.read_...` over `pl.read_...` to leverage persisted schema metadata when reading data back in.

```python
df = MySchema.read_parquet("path/to/file.parquet")
```

--------------------------------

### Import necessary libraries

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Imports the required libraries for data manipulation and dataframely.

```python
from datetime import date, datetime
from decimal import Decimal

import polars as pl

import dataframely as dy
```

--------------------------------

### Convert Dataframely Schema to SQLAlchemy Columns

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Define a Dataframely schema and convert it into a list of SQLAlchemy columns. This is the first step in generating SQL table definitions.

```python
import dataframely as dy
import sqlalchemy as sa


class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)
    y = dy.String(nullable=False)


engine = sa.create_engine(...)
columns: list[sa.Column] = MySchema.to_sqlalchemy_columns(engine.dialect)
```

--------------------------------

### Schema.to_sqlalchemy_columns

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/conversion.rst

Converts the Schema object into a list of SQLAlchemy Column objects.

```APIDOC
## Schema.to_sqlalchemy_columns

### Description
Converts the Schema object into a list of SQLAlchemy Column objects.

### Method
```python
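Schema.to_sqlalchemy_columns(dialect)
```

### Parameters
- **dialect** - Required - The SQLAlchemy dialect to generate columns for, for example `engine.dialect`.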

### Response
#### Success Response
- A list of SQLAlchemy Column objects representing the schema.
```

--------------------------------

### Schema Serialization

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/io.rst

Methods for serializing and deserializing schemas.

```APIDOC
## Schema.serialize

### Description
Serializes the schema, including its columns and rules, to a JSON string.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## deserialize_schema

### Description
Deserializes a schema from its serialized string representation.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## read_parquet_metadata_schema

### Description
Reads schema metadata from a Parquet file.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Define Schema with Column Metadata

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/column-metadata.md

Use the `metadata` parameter in column definitions to attach custom information. This is useful for marking columns as pseudonymized or providing database-specific details.

```python
class UserSchema(dy.Schema):
    id = dy.String(primary_key=True)
    # Mark last name column as pseudonymized and (non-docstring) comment on it.
    last_name = dy.String(metadata={
        "pseudonymized": True,
        "comment": "Pseudonymized using cryptographic hash function"
    })
    # Add information about database column type.
    address = dy.String(metadata={"database-type": "VARCHAR(MAX)"})
```

--------------------------------

### Reading Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/io.rst

Methods for reading data from different storage formats.

```APIDOC
## Schema.read_parquet

### Description
Reads a Parquet file into a data frame that conforms to the schema.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.scan_parquet

### Description
Lazily reads a Parquet file into a lazy frame that conforms to the schema.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.read_delta

### Description
Reads a Delta Lake table into a data frame that conforms to the schema.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Schema.scan_delta

### Description
Lazily reads a Delta Lake table into a lazy frame that conforms to the schema.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Documenting schema column meanings

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/coding-agents.md

Document additional domain information for schema columns, such as the semantic meanings of enum values, in comments next to the column definitions.

```python
class HospitalStaySchema(dy.Schema):
    # Reason for admission to the hospital
    # N = Emergency
    # V = Transfer from another hospital
    # ...
    admission_reason = dy.Enum(["N", "V", ...])
```

--------------------------------

### Collection Matches

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/metadata.rst

Check whether a collection semantically matches another collection.

```APIDOC
## Collection.matches

### Description
Determines whether this collection semantically matches another collection.

### Method
N/A (Method call on an object)

### Endpoint
N/A

### Parameters
- **other** (type[Collection]) - The collection type to compare with.

### Request Example
```python
MyCollection.matches(OtherCollection)
```

### Response
#### Success Response
- **matches** (bool) - True if the collections match semantically, False otherwise.
```

--------------------------------

### Generating SQLAlchemy Columns from Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Convert a Dataframely schema into a list of SQLAlchemy columns. This facilitates the creation of SQL tables with types and constraints that match the schema.

```python
HouseSchema.to_sqlalchemy_columns()
```

--------------------------------

### Collection.join

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/operations.rst

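Filters the members of a collection by joining them onto a data frame of primary key values.

```APIDOC
## Collection.join

### Description
Filters the collection by joining its members onto a data frame that contains values of the common primary key, keeping or removing the matching rows.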

### Method
(Not specified, likely a method call on a Collection object)

### Parameters
(Not specified in the source)

### Request Example
(Not specified in the source)

### Response
(Not specified in the source)
```

--------------------------------

### Type Hinting with Schemas for Function Signatures

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Define function signatures using `dy.DataFrame[Schema]` for static type checking. This ensures that functions receive DataFrames with the expected schema, improving code reliability.

```python
def train_model(df: dy.DataFrame[HouseSchema]) -> None:
    ...

```

--------------------------------

### Schema.to_pyarrow_schema

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/conversion.rst

Converts the Schema object into a PyArrow Schema object.

```APIDOC
## Schema.to_pyarrow_schema

### Description
Converts the Schema object into a PyArrow Schema object.

### Method
```python
Schema.to_pyarrow_schema()
```

### Parameters
None

### Response
#### Success Response
- A PyArrow Schema object representing the schema.
```

--------------------------------

### Collection Serialization

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/io.rst

Methods for serializing and deserializing collections, and reading Parquet metadata.

```APIDOC
## Collection.serialize

### Description
Serializes a collection into a specific format.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## deserialize_collection

### Description
Deserializes data into a collection object.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## read_parquet_metadata_collection

### Description
Reads metadata from a Parquet file specifically for collection-related information.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Define Invoice Schema with basic types

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Defines the base schema for the invoice data frame, specifying column names and their basic types.

```python
class InvoiceSchema(dy.Schema):
    invoice_id = dy.String()
    admission_date = dy.Date()
    discharge_date = dy.Date()
    received_at = dy.Datetime()
    amount = dy.Decimal()
```

--------------------------------

### Writing Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/io.rst

Methods for writing collection data to various storage formats.

```APIDOC
## Collection.write_parquet

### Description
Writes the collection data to a Parquet file.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.sink_parquet

### Description
Streams the members of the collection into Parquet files instead of materializing them in memory first.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.write_delta

### Description
Writes the collection data to a Delta Lake table.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Override Sample Data Values

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

Specify custom values for specific columns during data generation using the `overrides` parameter.

```python
from datetime import date

# One row is sampled per override; columns not listed are generated randomly.
df: dy.DataFrame[InvoiceSchema] = InvoiceSchema.sample(overrides=[
    {"invoice_id": "1234567890", "admission_date": date(2025, 1, 1)},
    {"invoice_id": "2345678901", "admission_date": date(2025, 1, 1)},
    {"invoice_id": "3456789012", "admission_date": date(2025, 1, 1)},
])
```
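
The idea behind row-wise overrides can be sketched in plain Python: generate a value for every column, then let the override win wherever one is given. This is only a conceptual illustration, assuming one row per override entry; the function `sample_rows` and its value ranges are hypothetical, and the sketch ignores that real sampling must also satisfy the schema's constraints.

```python
import random
from datetime import date, timedelta
from typing import Any


def sample_rows(overrides: list[dict[str, Any]], seed: int = 0) -> list[dict[str, Any]]:
    """Generate one row per override; overridden columns take precedence."""
    rng = random.Random(seed)
    rows = []
    for override in overrides:
        # Randomly generated defaults for every column (hypothetical value ranges).
        generated = {
            "invoice_id": str(rng.randrange(10**9, 10**10)),
            "admission_date": date(2020, 1, 1) + timedelta(days=rng.randrange(365)),
        }
        # Later keys win, so values from `override` replace generated ones.
        rows.append({**generated, **override})
    return rows


rows = sample_rows([
    {"invoice_id": "1234567890"},
    {"admission_date": date(2025, 1, 1)},
])
```

Only the columns that matter for a test need to be pinned; everything else stays random.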

--------------------------------

### Define Function Interface with Dataframely Schemas

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Use schemas for all input and output data frames in a function. Omit type hints for private helpers unless schemas improve readability or testability. Omit schemas for short-lived temporary or function-local data frames.

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
    # Internal data frames do not require schemas
    df: pl.LazyFrame = ...
    return MyPreprocessedSchema.validate(df, cast=True)
```

--------------------------------

### Collection.collect_all

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/operations.rst

Materializes all members of a collection.

```APIDOC
## Collection.collect_all

### Description
Collects (materializes) every member of the collection and returns the collection with its members collected.

### Method
(Not specified, likely a method call on a Collection object)

### Parameters
(Not specified in the source)

### Request Example
(Not specified in the source)

### Response
(Not specified in the source)
```

--------------------------------

### Generating PyArrow Schema from Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/quickstart.md

Obtain a PyArrow schema from a Dataframely schema. This provides column data types and nullability information compatible with PyArrow.

```python
HouseSchema.to_pyarrow_schema()
```

--------------------------------

### Reading Data

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/io.rst

Methods for reading data from various storage formats into collections.

```APIDOC
## Collection.read_parquet

### Description
Reads data from a Parquet file into a collection.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.scan_parquet

### Description
Lazily reads Parquet data into a collection, deferring the actual read until the members are collected.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.read_delta

### Description
Reads data from a Delta Lake table into a collection.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

```APIDOC
## Collection.scan_delta

### Description
Lazily reads Delta Lake data into a collection, deferring the actual read until the members are collected.

### Method
(Not specified in source)

### Endpoint
(Not specified in source)

### Parameters
(Not specified in source)

### Request Example
(Not specified in source)

### Response
(Not specified in source)
```

--------------------------------

### Generate Multiple Tables from Dataframely Collection

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/sql-generation.md

Generate and create multiple SQL tables from a Dataframely Collection. This iterates through the collection's member schemas and creates corresponding SQLAlchemy tables.

```python
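# `MyCollection` is a dy.Collection subclass and `engine` an existing SQLAlchemy engine.
MyCollection: dy.Collection
meta = sa.MetaData()
for name, dy_schema in MyCollection.member_schemas().items():
    # Register one table per member schema on the shared metadata object.
    sa.Table(
        name,
        meta,
        *dy_schema.to_sqlalchemy_columns(dialect=engine.dialect),
    )
# SQLAlchemy 2.x requires the engine (or a connection) to be passed explicitly.
meta.create_all(engine)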
```

--------------------------------

### Create a Data Frame Collection

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Define a `HospitalClaims` collection by subclassing `dy.Collection`. This collection groups `InvoiceSchema` and `DiagnosisSchema` lazy frames, enabling collection-level validation.

```python
# Introduce a collection for groups of schema-validated data frames
class HospitalClaims(dy.Collection):
    invoices: dy.LazyFrame[InvoiceSchema]
    diagnoses: dy.LazyFrame[DiagnosisSchema]
```

--------------------------------

### Schema.column_names

Source: https://github.com/quantco/dataframely/blob/main/docs/api/schema/metadata.rst

Retrieves a list of all column names in the schema.

```APIDOC
## Schema.column_names

### Description
Get the list of column names.

### Method
`Schema.column_names()`

### Parameters
None

### Response
#### Success Response (list of strings)
- Returns a list of strings, where each string is a column name.
```

--------------------------------

### Control Schema Validation During Parquet Reads

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

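Specify the validation behavior when reading Parquet files. Use `"allow"` to validate silently and only when necessary, `"forbid"` to raise an error if validation cannot be skipped, or `"skip"` to never validate. Omitting the argument keeps the default behavior, which warns when validation is necessary.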

```python
# Will not warn and only validate if necessary
MySchema.read_parquet("my.parquet", validation="allow")
```

```python
# Will raise an error if validation cannot be skipped
MySchema.read_parquet("my.parquet", validation="forbid")
```

```python
# Dangerous: Will never validate. It's possible to load data that violates the schema!
MySchema.read_parquet("my.parquet", validation="skip")
```
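
The decision logic behind these modes can be summarized in a small sketch. It assumes four mode names (`"allow"`, `"warn"` as the default, `"forbid"`, `"skip"`) and that validation can be skipped exactly when the schema stored in the file matches the requested one; the function `plan_read` and its return strings are hypothetical and only describe the behavior, they are not part of dataframely.

```python
VALID_MODES = {"allow", "warn", "forbid", "skip"}


def plan_read(validation: str, stored_schema_matches: bool) -> str:
    """Describe what a reader would do for a validation mode (illustrative only)."""
    if validation not in VALID_MODES:
        raise ValueError(f"unknown validation mode: {validation!r}")
    # "skip" never validates; a matching stored schema makes validation unnecessary.
    if validation == "skip" or stored_schema_matches:
        return "read without validation"
    if validation == "allow":
        return "validate"
    if validation == "warn":
        return "warn, then validate"
    # Remaining mode is "forbid": validation is needed but must not run.
    return "raise"


plans = {mode: plan_read(mode, stored_schema_matches=False) for mode in sorted(VALID_MODES)}
```

Seen this way, `"skip"` is the only mode that can return data that was never checked against the schema.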

--------------------------------

### Define and Serialize Schema to Delta Lake

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Define a custom schema and use it to validate a polars DataFrame before writing it to a Delta Lake table. This method is analogous to polars' native write operations.

```python
class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)


df: dy.DataFrame[MySchema] = MySchema.validate(
    pl.DataFrame(
        {"x": [1, 2, 3]}
    )
)

# Writing to a Delta Lake table
MySchema.write_delta(df, "/path/to/table")
```

--------------------------------

### Renamed Primary Key Methods

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/migration/v1-v2.md

Methods related to primary keys have been renamed to better reflect the concept of a single, potentially composite, primary key.

```python
# v1: Schema.primary_keys
# v2: Schema.primary_key

# v1: Collection.common_primary_keys
# v2: Collection.common_primary_key
```

--------------------------------

### Schema Inheritance for Shared Primary Key

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Leverage schema inheritance to reduce redundancy by defining a base `InvoiceIdSchema` for the common `invoice_id` primary key. This base schema is then inherited by `InvoiceSchema` and `DiagnosisSchema`.

```python
from decimal import Decimal

# Reduce redundancies in schemas by using schema inheritance.
# Here, we introduce a base schema for the shared primary key.
class InvoiceIdSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)


class InvoiceSchema(InvoiceIdSchema):
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    received_at = dy.Datetime(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))

    @dy.rule()
    def discharge_after_admission(cls) -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")

    @dy.rule()
    def received_at_after_discharge(cls) -> pl.Expr:
        return pl.col("received_at").dt.date() >= pl.col("discharge_date")


class DiagnosisSchema(InvoiceIdSchema):
    diagnosis_code = dy.String(primary_key=True, regex=r"[A-Z][0-9]{2,4}")
    is_main = dy.Bool(nullable=False)

    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis(cls) -> pl.Expr:
        return pl.col("is_main").sum() == 1
```
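
To make the grouped rule `exactly_one_main_diagnosis` more concrete, the following plain-Python sketch performs the same check on a list of dictionaries: group by `invoice_id`, sum `is_main`, and flag groups whose sum is not 1. The helper name and the sample rows are hypothetical; this is not how dataframely evaluates rules internally.

```python
from collections import defaultdict
from typing import Any


def invoices_violating_main_rule(rows: list[dict[str, Any]]) -> set[str]:
    """Return invoice ids whose diagnoses do not contain exactly one main diagnosis."""
    main_counts: dict[str, int] = defaultdict(int)
    for row in rows:
        # Booleans sum like integers: True counts as 1, False as 0.
        main_counts[row["invoice_id"]] += int(row["is_main"])
    return {invoice_id for invoice_id, count in main_counts.items() if count != 1}


diagnoses = [
    {"invoice_id": "A", "diagnosis_code": "E112", "is_main": True},
    {"invoice_id": "A", "diagnosis_code": "J45", "is_main": False},
    {"invoice_id": "B", "diagnosis_code": "I10", "is_main": False},  # no main diagnosis
    {"invoice_id": "C", "diagnosis_code": "K21", "is_main": True},
    {"invoice_id": "C", "diagnosis_code": "K22", "is_main": True},  # two main diagnoses
]
violations = invoices_violating_main_rule(diagnoses)
```

Because the rule is evaluated per group, a violation concerns every row of the offending invoice, not just a single row.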

--------------------------------

### Define and Serialize Schema to Parquet

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Define a custom schema and use it to validate a polars DataFrame before writing it to a parquet file. This method is analogous to polars' native write operations.

```python
class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)


df: dy.DataFrame[MySchema] = MySchema.validate(
    pl.DataFrame(
        {"x": [1, 2, 3]}
    )
)

# The serialization methods provide interfaces that are as close as possible to the
# polars interface you are probably familiar with
# Writing to parquet
MySchema.write_parquet(df, "my.parquet")
```

--------------------------------

### Define Composite Primary Key in Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/primary-keys.md

Combine multiple columns as a primary key by setting `primary_key=True` on each. This ensures that the combination of values across these columns is unique for each record.

```python
class LineItemSchema(dy.Schema):
    invoice_id = dy.Int64(primary_key=True)
    item_id = dy.Int64(primary_key=True)
    price = dy.Decimal()
```
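
What uniqueness of a composite key means can be shown with a short plain-Python sketch: individual columns may repeat, only the combination must not. The helper `duplicate_keys` and the sample rows are hypothetical and independent of dataframely.

```python
from collections import Counter
from typing import Any


def duplicate_keys(rows: list[dict[str, Any]], key_columns: list[str]) -> set[tuple]:
    """Return every key combination that occurs more than once."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return {key for key, occurrences in counts.items() if occurrences > 1}


line_items = [
    {"invoice_id": 1, "item_id": 1},
    {"invoice_id": 1, "item_id": 2},  # same invoice, different item: allowed
    {"invoice_id": 2, "item_id": 1},  # same item id, different invoice: allowed
    {"invoice_id": 1, "item_id": 2},  # repeats the combination (1, 2): violation
]
duplicates = duplicate_keys(line_items, ["invoice_id", "item_id"])
```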

--------------------------------

### Collection To Dictionary

Source: https://github.com/quantco/dataframely/blob/main/docs/api/collection/metadata.rst

Convert the collection to a dictionary representation.

```APIDOC
## Collection.to_dict

### Description
Converts the collection and its members into a dictionary format.

### Method
N/A (Method call on an object)

### Endpoint
N/A

### Parameters
None

### Request Example
```python
collection.to_dict()
```

### Response
#### Success Response
- **dictionary_representation** (dict) - A dictionary representing the collection.
```

--------------------------------

### Inspect Co-occurrence of Validation Failures

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Use `failure.cooccurrence_counts()` to understand how validation failures occur together. This helps in identifying systemic issues rather than isolated ones.

```python
# Inspect the co-occurrences of validation failures
failure.cooccurrence_counts()
```

```text
Result:
{frozenset({'amount|min_exclusive'}): 1}
```
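
The shape of this result, a mapping from sets of rule names to row counts, can be reproduced with a small plain-Python sketch. It assumes that the set of failed rules is known for each invalid row; the helper and the sample data are hypothetical (rule names are borrowed from the invoice schema) and do not reflect dataframely's internals.

```python
from collections import Counter


def cooccurrence_counts(failed_rules_per_row: list[set[str]]) -> dict[frozenset[str], int]:
    """Count how many rows failed each exact combination of rules."""
    # frozenset makes each combination hashable so it can serve as a dictionary key.
    return dict(Counter(frozenset(rules) for rules in failed_rules_per_row if rules))


failed_rules = [
    {"amount|min_exclusive"},
    {"discharge_after_admission", "received_at_after_discharge"},
    {"discharge_after_admission", "received_at_after_discharge"},
]
counts = cooccurrence_counts(failed_rules)
```

A combination with a high count hints at a common root cause, such as swapped date columns triggering two rules at once.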

--------------------------------

### random.Generator

Source: https://github.com/quantco/dataframely/blob/main/docs/api/misc/index.rst

Random number generator utility.

```APIDOC
## random.Generator

### Description
A random number generator used for sampling data.

### Usage
```python
from dataframely.random import Generator

# Create a generator instance; seeding it makes the samples reproducible
rng = Generator(seed=42)

# Sample five integers
# (method name and keyword arguments are assumed; check the API reference)
print(rng.sample_int(5, min=1, max=10))
```
```

--------------------------------

### Define Cross-Column Constraints with Dataframely Rules

Source: https://github.com/quantco/dataframely/blob/main/skills/SKILL.md

Define cross-column constraints for a `dy.Schema` using methods decorated with `@dy.rule`. Use expressive names for rules and `cls` to refer to schema columns.

```python
class MySchema(dy.Schema):
    col1 = dy.UInt8()
    col2 = dy.UInt8()

    @dy.rule()
    def col1_greater_col2(cls) -> pl.Expr:
        return cls.col1.col > cls.col2.col
```

--------------------------------

### AnnotationImplementationError

Source: https://github.com/quantco/dataframely/blob/main/docs/api/errors/index.rst

Raised when there are issues with annotation implementation.

```APIDOC
## AnnotationImplementationError

### Description
Specific error raised when problems occur with the implementation or usage of annotations within Dataframely.

### Exception Type
`dataframely.exc.AnnotationImplementationError`
```

--------------------------------

### Read Parquet Schema Eagerly

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Read a Parquet file back into a `dy.DataFrame[MySchema]`. If the schema stored in the file's metadata matches `MySchema`, the data is not validated again.

```python
# Reading parquet eagerly
new_df: dy.DataFrame[MySchema] = MySchema.read_parquet("my.parquet")
```

--------------------------------

### FailureInfo Inspection Methods

Source: https://github.com/quantco/dataframely/blob/main/docs/api/filter_result/failure_info.rst

Methods for inspecting failure data, including identifying invalid entries and calculating counts.

```APIDOC
## FailureInfo.invalid

### Description
Returns the rows of the input data frame that failed validation.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.counts

### Description
Returns the number of validation failures for each rule.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.cooccurrence_counts

### Description
Returns how often each combination of rules failed together on the same rows, keyed by the set of rule names.

### Method
N/A (Method call)

### Parameters
None

## FailureInfo.__len__

### Description
Returns the number of rows that failed validation.

### Method
N/A (Method call)

### Parameters
None
```

--------------------------------

### Read Delta Lake Schema Eagerly

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Read a Delta Lake table back into a `dy.DataFrame[MySchema]`. If the schema stored in the table's metadata matches `MySchema`, the data is not validated again.

```python
# Reading a Delta Lake table eagerly
new_df: dy.DataFrame[MySchema] = MySchema.read_delta("/path/to/table")
```

--------------------------------

### Define a Dataframely Schema

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/serialization.md

Define a schema with various column types and a custom rule. This schema can then be serialized.

```python
class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)

```
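
For a single row, the rule above boils down to a range check on a quotient. The plain-Python sketch below mirrors it for scalar values; the function name is hypothetical, and treating zero bedrooms as a failure is an assumption of this sketch rather than documented behavior.

```python
def reasonable_bathroom_to_bedroom_ratio(num_bathrooms: int, num_bedrooms: int) -> bool:
    """Return True if the bathroom-to-bedroom ratio lies in [1/3, 3]."""
    if num_bedrooms == 0:
        # Assumption: a house without bedrooms cannot satisfy the ratio rule.
        return False
    ratio = num_bathrooms / num_bedrooms
    return 1 / 3 <= ratio <= 3


checks = {
    "1 bath / 3 beds": reasonable_bathroom_to_bedroom_ratio(1, 3),
    "3 baths / 1 bed": reasonable_bathroom_to_bedroom_ratio(3, 1),
    "4 baths / 1 bed": reasonable_bathroom_to_bedroom_ratio(4, 1),
    "1 bath / 4 beds": reasonable_bathroom_to_bedroom_ratio(1, 4),
}
```

Both bounds are inclusive, so the boundary cases of exactly 1/3 and exactly 3 pass.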

--------------------------------

### Test Diabetes Invoice Amounts Generation

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/features/data-generation.md

A pytest test case for the `get_diabetes_invoice_amounts` function. It uses `HospitalInvoiceData.sample` with overrides to create test data, including invoices with and without diabetes diagnoses, and asserts the output against an expected DataFrame.

```python
def test_get_diabetes_invoice_amounts() -> None:
    # Arrange
    invoice_data = HospitalInvoiceData.sample(
        overrides=[
            # Invoice with diabetes diagnosis
            {
                "invoice_id": "1",
                "invoice": {"amount": 1500.0},
                "diagnosis": [{"code": "E11.2"}],
            },
            # Invoice without diabetes diagnosis
            {
                "invoice_id": "2",
                "invoice": {"amount": 1000.0},
                "diagnosis": [{"code": "J45.909"}],
            },
        ]
    )
    expected = OutputSchema.validate(
        pl.DataFrame(
            {
                "invoice_id": ["1"],
                "amount": [1500.0],
            }
        ),
        cast=True,
    ).lazy()

    # Act
    actual = get_diabetes_invoice_amounts(invoice_data)

    # Assert
    assert_frame_equal(actual, expected)
```

--------------------------------

### Define Invoice Schema with cross-column rules

Source: https://github.com/quantco/dataframely/blob/main/docs/guides/examples/real-world.ipynb

Adds cross-column validation rules to the InvoiceSchema using the @dy.rule decorator to ensure logical data integrity.

```python
from decimal import Decimal


class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    received_at = dy.Datetime(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))

    @dy.rule()
    def discharge_after_admission(cls) -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")

    @dy.rule()
    def received_at_after_discharge(cls) -> pl.Expr:
        return pl.col("received_at").dt.date() >= pl.col("discharge_date")
```
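
The two rules compare a date with a date, and a datetime truncated to its date with a date. A plain-Python sketch with the standard `datetime` module shows the effect of that truncation for a single row; the function `check_invoice` and the sample values are hypothetical.

```python
from datetime import date, datetime


def check_invoice(
    admission_date: date, discharge_date: date, received_at: datetime
) -> dict[str, bool]:
    """Evaluate both cross-column rules for one invoice row."""
    return {
        "discharge_after_admission": discharge_date >= admission_date,
        # Only the calendar date of `received_at` matters, not the time of day.
        "received_at_after_discharge": received_at.date() >= discharge_date,
    }


# Received a few minutes after midnight on the discharge day: passes.
same_day = check_invoice(date(2025, 1, 1), date(2025, 1, 3), datetime(2025, 1, 3, 0, 5))
# Received late on the day before discharge: fails the second rule.
too_early = check_invoice(date(2025, 1, 1), date(2025, 1, 3), datetime(2025, 1, 2, 23, 59))
```

Both rules use `>=`, so a same-day discharge and a same-day receipt are considered valid.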