### Python Example: VACUUM All Tables

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook demonstrating how to perform a VACUUM operation on all tables using DiscoverX. This helps in reclaiming storage space.

```python
# Example notebook for vacuum_multiple_tables.py
```

--------------------------------

### Install DiscoverX Project Locally

Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md

Installs the DiscoverX project in editable mode with local and test development requirements. This command uses pip and assumes a Python environment is active.

```bash
pip install -e ".[local,test]"
```

--------------------------------

### Python Example: Scan with User Specified Formats

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook showing how to scan data sources using user-specified data formats. Provides flexibility in data ingestion and processing.

```python
# Example notebook for scan_with_user_specified_data_source_formats.py
```

--------------------------------

### Python Example: Create MLflow Gateway Routes

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook illustrating how to create MLflow gateway routes for integrating with services like MosaicML and OpenAI. Facilitates model serving and access.

```python
# Example notebook for mlflow_gateway_routes_examples.py
```

--------------------------------

### Python Example: Deep Clone a Catalog

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook demonstrating the deep cloning of a catalog, which duplicates all its contents. This is useful for backups or creating isolated environments.

```python
# Example notebook for deep_clone_schema.py
```

--------------------------------

### Python Example: Text Analysis with MosaicML & MLflow

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook for performing text analysis using MosaicML and Databricks MLflow. Integrates large language models for insights.

```python
# Example notebook for text_analysis_mosaicml_mlflow.py
```

--------------------------------

### Python Example: Delta Housekeeping Analysis

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook for executing Delta housekeeping tasks, providing statistics on table size, file counts, and optimization/vacuum timestamps. It offers recommendations for optimization.

```python
# Example notebook for exec_delta_housekeeping.py
```

--------------------------------

### Install OpenJDK 11 via Conda

Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md

Installs OpenJDK version 11.0.15 using Conda from the conda-forge channel. This is a dependency for running certain project functionalities.

```bash
conda install -c conda-forge openjdk=11.0.15
```

--------------------------------

### Python Example: PII Detection with Presidio

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook showcasing PII (Personally Identifiable Information) detection using the Presidio library within Databricks. It helps in identifying sensitive data.

```python
# Example notebook for pii_detection_presidio.py
```

--------------------------------

### Python Example: Text Analysis with OpenAI GPT

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook for conducting text analysis using OpenAI's GPT models within Databricks. Enables advanced natural language processing tasks.

```python
# Example notebook for text_analysis_openai_gpt.py
```

--------------------------------

### Install DiscoverX Library in Databricks

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

This snippet shows how to install the DiscoverX Python package within a Databricks notebook environment using pip.

```python
%pip install dbl-discoverx
```

--------------------------------

### Python Example: Detect Tables with Small Files

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook for detecting tables that contain a large number of small files, which can impact performance. This is part of DiscoverX's maintenance utilities.

```python
# Example notebook for detect_small_files.py
```

--------------------------------

### Setup Local Python Environment with Conda

Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md

Commands to create and activate a Conda Python environment for the DiscoverX project. Ensures the correct Python version and isolates project dependencies.

```bash
conda create -n discoverx python=3.9
conda activate discoverx
```

--------------------------------

### Python Example: Update Owner of Data Objects

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Example notebook for updating the ownership of data objects (tables, views, etc.) across the Databricks Unity Catalog. Useful for data governance and stewardship.

```python
# Example notebook for update_owner_of_data_objects.py
```

--------------------------------

### Initialize DiscoverX with Python

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Initializes the DiscoverX class to access all its functionality. It allows for basic initialization, initialization with locale-specific classification rules (e.g., US phone numbers, SSNs), and initialization with custom classification rules. This setup is crucial for utilizing DiscoverX's features for data governance and manipulation.

```python
from discoverx import DX

# Basic initialization
dx = DX()

# Initialize with locale-specific rules (e.g., US phone numbers, SSNs)
dx = DX(locale="US")

# Initialize with custom classification rules
from discoverx.rules import RegexRule

custom_rule = RegexRule(
    name="custom_id",
    description="Custom ID pattern",
    definition=r"^CUS-\d{6}$",
    match_example=["CUS-123456"],
    class_name="custom_id"
)
dx = DX(custom_rules=[custom_rule])
```

--------------------------------

### Apply Queries and Get Results

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Executes the DiscoverX queries and returns a unioned dataframe containing the results from all processed tables. Used after `with_sql` or `unpivot_string_columns`.

```python
.apply()
```

--------------------------------

### Define Custom Regex Rules for DiscoverX

Source: https://context7.com/databrickslabs/discoverx/llms.txt

This snippet shows how to define custom regular expression rules for DiscoverX, specifying patterns for employee IDs and order IDs. It includes examples of matching and non-matching data, along with class names for categorization.

```python
from dx.rules import RegexRule

employee_id_rule = RegexRule(
    name="employee_id",
    description="Company employee ID format",
    definition=r"^EMP\d{6}$",
    match_example=["EMP123456", "EMP999999"],
    nomatch_example=["EMP12345", "emp123456"],
    class_name="employee_id"
)

order_id_rule = RegexRule(
    name="order_id",
    description="Order ID format ORD-YYYYMMDD-####",
    definition=r"^ORD-\d{8}-\d{4}$",
    match_example=["ORD-20240115-0001"],
    class_name="order_id"
)
```

--------------------------------

### Select Tables and Execute SQL Templates with Python

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Executes SQL templates across multiple tables, matching a specified pattern, utilizing concurrent execution and automatic result aggregation. This allows for operations such as getting table details, Delta history, deep cloning tables, and changing table ownership. The `apply()` method triggers the execution of the defined SQL templates.

```python
from discoverx import DX

dx = DX()

# Get table details for all tables in a catalog
result_df = (
    dx.from_tables("my_catalog.my_schema.*")
    .with_sql("DESCRIBE DETAIL {full_table_name}")
    .apply()
)
result_df.display()

# Get Delta history for all tables
history_df = (
    dx.from_tables("my_catalog.*.*")
    .with_sql("DESCRIBE HISTORY {full_table_name}")
    .apply()
)

# Deep clone tables to a backup schema
(
    dx.from_tables("prod_catalog.sales.*")
    .with_sql(
        "CREATE TABLE IF NOT EXISTS {table_catalog}_backup.{table_schema}.{table_name} "
        "DEEP CLONE {full_table_name}"
    )
    .apply()
)

# Change table ownership across multiple tables
(
    dx.from_tables("my_catalog.public.*")
    .with_sql("ALTER TABLE {full_table_name} SET OWNER TO `data_engineering_team`")
    .apply()
)
```

--------------------------------

### Define Custom Regex Rules for DiscoverX

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md

Allows the creation of custom semantic classification rules using regular expressions. These rules can be added to the default set for more specific data matching. The `RegexRule` requires a name, description, definition (regex pattern), and examples.

```python
from discoverx.rules import RegexRule
from discoverx import DX

custom_rules = [
  RegexRule(
    name = "resource_request_id",
    description = "Resource request ID",
    definition = r"^AR-\d{9}$",
    match_example = ["AR-123456789"],
    nomatch_example = ["R-123"],
  )
]

dx = DX(custom_rules=custom_rules)
```

--------------------------------

### Filter Tables by Column Presence with Python

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Restricts operations to tables that contain specific columns before executing SQL templates or Python functions. This enables users to target operations based on data characteristics. Example uses include finding tables with specific columns and applying GDPR deletion based on column presence.

```python
from discoverx import DX

dx = DX()

# Find tables with user_id column and query them
user_tables_df = (
    dx.from_tables("*.*.*")
    .having_columns("user_id")
    .with_sql("SELECT COUNT(DISTINCT user_id) as unique_users FROM {full_table_name}")
    .apply()
)

# Find tables with email and phone columns
contact_tables = (
    dx.from_tables("customer_data.*.*")
    .having_columns("email", "phone_number")
    .with_sql("SELECT {full_table_name} as table_name, COUNT(*) as contact_count FROM {full_table_name}")
    .apply()
)

# Apply GDPR deletion to tables containing user_email
(
    dx.from_tables("*.*.*")
    .having_columns("user_email")
    .with_sql("DELETE FROM {full_table_name} WHERE user_email = 'user@example.com'")
    .apply()
)
```

--------------------------------

### Initialize and Scan with Custom Rules in DiscoverX

Source: https://context7.com/databrickslabs/discoverx/llms.txt

This code demonstrates initializing DiscoverX with custom rules and performing a scan on specified tables. It shows how to target specific rules by name and set a sample size for the scan.

```python
from dx.dx import DX

# Assuming employee_id_rule and order_id_rule are defined as above

# Initialize with custom rules
dx = DX(custom_rules=[employee_id_rule, order_id_rule])

# Scan using custom rules
dx.scan(
    from_tables="hr_data.*.*",
    rules="employee_id",
    sample_size=5000
)
```

--------------------------------

### Configure Concurrency and Preview Queries with Python

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Provides control over concurrent execution and the ability to preview generated SQL before execution. This allows users to optimize performance by setting custom concurrency levels and verifying SQL queries before running them. The `explain()` method shows the SQL for each table, and `apply()` executes the queries.

```python
from discoverx import DX

dx = DX()

# Set custom concurrency level
result = (
    dx.from_tables("large_catalog.*.*")
    .with_concurrency(20)  # Execute 20 queries in parallel
    .with_sql("OPTIMIZE {full_table_name}")
    .apply()
)

# Preview SQL that will be executed without running it
explorer = (
    dx.from_tables("prod_catalog.*.*")
    .with_sql("VACUUM {full_table_name} RETAIN 168 HOURS")
)
explorer.explain()  # Shows SQL for each table

# Execute after previewing
explorer.apply()
```

--------------------------------

### Initialize DiscoverX DataExplorer Object

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

This snippet demonstrates how to import and initialize the main DataExplorer (DX) object from the DiscoverX library. It requires specifying a locale.

```python
from discoverx import DX
dx = DX(locale="US")
```

--------------------------------

### Explain SQL Queries

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

An action that explains the SQL queries that would be executed by DiscoverX without actually running them. Typically used after `with_sql` or `unpivot_string_columns`.

```python
.explain()
```

--------------------------------

### Create Git Tag for Release

Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md

Creates an annotated Git tag for a specific project version and pushes it to the origin repository. This is part of the release pipeline process.

```bash
git tag -a v<your-project-version> -m "Release tag for version <your-project-version>"
git push origin --tags
```

--------------------------------

### Run Unit Tests with Pytest

Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md

Executes unit tests for the DiscoverX project using pytest, including code coverage reporting. Tests are located in the `tests/unit` directory.

```bash
pytest tests/unit --cov
```

--------------------------------

### Apply SQL Template to Tables

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Applies a provided SQL template to all selected tables. Subsequent actions can then be applied to the results.

```python
.with_sql("SQL_TEMPLATE")
```

--------------------------------

### Vacuum Multiple Tables with DiscoverX SQL

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Vacuum.md

This code snippet demonstrates how to vacuum multiple tables at once using DiscoverX. It utilizes the `dx.from_tables` method with a wildcard pattern and the `with_sql` method to apply the VACUUM command to all matched tables. The output can be displayed using `.display()`.

```python
dx.from_tables("*.*.*").with_sql("VACUUM {full_table_name}").display()
```

--------------------------------

### Select Tables using from_tables Pattern

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Demonstrates selecting tables based on a catalog, schema, and table pattern using a wildcard. This returns a DataExplorer object for further operations.

```python
dx.from_tables("<catalog>.<schema>.<table>")
```

--------------------------------

### Delta Housekeeping Analysis and Recommendations

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Generate comprehensive recommendations for Delta table maintenance, including analysis for OPTIMIZE and VACUUM operations. This function analyzes tables and provides actionable insights. It requires the 'discoverx' library.

```python
from discoverx import DX

dx = DX()

# Run housekeeping analysis on all tables
housekeeping_output = (
    dx.from_tables("production.*.*")
    .delta_housekeeping()
)

# Get recommendations DataFrame
recommendations_df = housekeeping_output.apply()

# View only recommendation columns
recommendations_df.select(
    "catalog", "database", "tableName",
    *[c for c in recommendations_df.columns if c.startswith("rec_")]
).display()

# Display full analysis including stats
housekeeping_output.display()

# Get HTML-formatted explanation
housekeeping_output.explain()
```

--------------------------------

### Execute SQL Templates on Multiple Tables

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

This functionality allows executing a SQL template against multiple tables concurrently. It supports various operations like describing details, showing history, deep cloning, tagging, and more. The available variables within the templates include `{full_table_name}`, `{table_catalog}`, `{table_schema}`, `{table_name}`, `{stack_string_columns}`, and `{stack_all_columns_as_string}`. Results are returned as a Spark DataFrame.

```sql
DESCRIBE DETAIL {full_table_name}
SHOW HISTORY {full_table_name}
CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}_clone.{table_name} DEEP CLONE {full_table_name}
CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}.{table_name}_empty_copy LIKE {full_table_name}
ALTER TABLE {full_table_name} SET TAGS ('tag_name' = 'tag_value')
ALTER TABLE {full_table_name} SET OWNER TO principal
SHOW PARTITIONS {full_table_name}
SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1
SELECT {stack_string_columns} AS (column_name, string_value) FROM {full_table_name}
SELECT {stack_all_columns_as_string} AS (column_name, string_value) FROM {full_table_name}
ALTER TABLE {full_table_name} CLUSTER BY (column1, column2)
VACUUM {full_table_name}
OPTIMIZE {full_table_name}
```

--------------------------------

### Apply Python Functions to Multiple Assets

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

DiscoverX enables the concurrent application of Python functions to multiple Lakehouse assets. This feature provides access to table metadata, including catalog, schema, table name, columns (with data type and partition index), and tags (if `.with_tags(True)` is used). This is useful for complex data processing or analysis tasks.

```python
# Example usage:
def process_table(table_info):
    print(f"Processing table: {table_info['catalog']}.{table_info['schema']}.{table_info['table']}")
    # Access columns
    for col in table_info['columns']:
        print(f"  Column: {col['name']} ({col['data_type']})")
    # Access tags (if available)
    if 'tags' in table_info:
        print(f"  Tags: {table_info['tags']}")
    # Add your custom logic here

# Assuming 'discoverx' is imported and configured
# discoverx.from_tables(...).apply(process_table)
```

--------------------------------

### Scan Tables with Custom Rules in DiscoverX

Source: https://context7.com/databrickslabs/discoverx/llms.txt

This snippet illustrates using the fluent API of DiscoverX to select tables and then scan them using a specific custom rule and sample size. It highlights the flexibility in applying rules to different table patterns.

```python
from dx.dx import DX

# Assuming order_id_rule is defined as above

# Use from_tables with custom rules
dx.from_tables("orders.*.*", custom_rules=[order_id_rule]).scan(
    rules="order_id",
    sample_size=10000
)
```

--------------------------------

### Display Query Results

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Executes the DiscoverX queries and displays the first 1000 rows of the combined results as a unioned dataframe. Used after `with_sql` or `unpivot_string_columns`.

```python
.display()
```

--------------------------------

### Scan Lakehouse with Regex Rules

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Scans the lakehouse using predefined regex rules to power semantic classification. This function is experimental.

```python
.scan(rules=RULES)
```

--------------------------------

### Aggregate IP Address Occurrences by Date using Discoverx

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Select_by_class.md

Demonstrates building summary tables by selecting date and IP address columns, then counting occurrences. It groups by catalog, schema, table, date, and IP column, and aggregates the count of IP address values. Requires Discoverx and a DataFrame `df`.

```python
df = (dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_ip_v4"])
    .groupby(["table_catalog", "table_schema", "table_name", "classified_columns.dx_iso_date.column", "classified_columns.dx_iso_date.value", "classified_columns.dx_ip_v4.column"])
    .agg(func.count("classified_columns.dx_ip_v4.value").alias("count"))
)
```

--------------------------------

### Display Available Classification Rules

Source: https://context7.com/databrickslabs/discoverx/llms.txt

View the built-in semantic classification rules provided by DiscoverX, including descriptions and the patterns they use. This function helps users understand what types of sensitive data can be detected. It requires the 'discoverx' library.

```python
from discoverx import DX

dx = DX(locale="US")

# Display all available rules in HTML format
dx.display_rules()

# Rules include patterns for:
# - Email addresses
# - Phone numbers (locale-specific)
# - IP addresses (IPv4, IPv6)
# - Social Security Numbers (US)
# - Credit card numbers
# - URLs
# - Postal codes
# - And more...
```

--------------------------------

### Scan with custom rules using Discoverx

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Defines and utilizes custom regular expression-based rules for data classification, enabling domain-specific pattern matching during scans. Requires importing RegexRule from discoverx.rules.

```python
from discoverx import DX
from discoverx.rules import RegexRule

# Example usage would follow here, defining rules and then using dx.scan with those rules.
```

--------------------------------

### Advanced SQL Template Variables in Discoverx

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Enables the use of dynamic variables within SQL templates for complex data operations. Supports stacking columns, using individual table component variables (catalog, schema, name), and applying liquid clustering for table optimization.

```python
from discoverx import DX

dx = DX()

# Stack string columns into column_name/value pairs
stacked_strings = (
    dx.from_tables("analytics.events.*")
    .with_sql("""
        SELECT
            '{full_table_name}' as source_table,
            {stack_string_columns} AS (column_name, string_value)
        FROM {full_table_name}
        LIMIT 100
    """)
    .apply()
)

# Stack ALL columns as strings
all_columns_stacked = (
    dx.from_tables("my_catalog.my_schema.my_table")
    .with_sql("""
        SELECT
            {stack_all_columns_as_string} AS (column_name, string_value)
        FROM {full_table_name}
        WHERE string_value IS NOT NULL
    """)
    .apply()
)

# Use individual component variables
custom_query = (
    dx.from_tables("prod.*.*")
    .with_sql("""
        SELECT
            '{table_catalog}' as catalog,
            '{table_schema}' as schema,
            '{table_name}' as table,
            COUNT(*) as row_count,
            '{full_table_name}' as full_name
        FROM {full_table_name}
    """)
    .apply()
)

# Apply liquid clustering to tables
(
    dx.from_tables("high_volume.*.*")
    .having_columns("user_id", "event_date")
    .with_sql("ALTER TABLE {full_table_name} CLUSTER BY (user_id, event_date)")
    .apply()
)
```

--------------------------------

### Apply Python Functions to Multiple Tables

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Execute custom Python functions concurrently across tables, accessing table metadata like columns, tags, and properties. This function takes a TableInfo object and returns structured data. It requires the 'discoverx' library. Dependencies include 'discoverx' and 'discoverx.table_info'.

```python
from discoverx import DX
from discoverx.table_info import TableInfo

dx = DX()

def analyze_table(table_info: TableInfo):
    """Custom function to analyze each table"""
    full_name = f"{table_info.catalog}.{table_info.schema}.{table_info.table}"

    # Access column information
    string_cols = [col.name for col in table_info.columns if col.data_type == "string"]
    partition_cols = [col.name for col in table_info.columns if col.partition_index is not None]

    print(f"Table: {full_name}")
    print(f"  Total columns: {len(table_info.columns)}")
    print(f"  String columns: {string_cols}")
    print(f"  Partition columns: {partition_cols}")

    # Return structured data
    return {
        "table": full_name,
        "column_count": len(table_info.columns),
        "string_column_count": len(string_cols)
    }

# Execute function across all tables
results = (
    dx.from_tables("analytics.sales.*")
    .with_concurrency(10)
    .map(analyze_table)
)

print(f"Analyzed {len(results)} tables")

# Access tags in custom function
def check_tags(table_info: TableInfo):
    if table_info.tags:
        print(f"Catalog tags: {table_info.tags.catalog_tags}")
        print(f"Table tags: {table_info.tags.table_tags}")
    return table_info

results_with_tags = (
    dx.from_tables("my_catalog.*.*")
    .with_tags(True)  # Enable tag fetching
    .map(check_tags)
)
```

--------------------------------

### Configure Scan Parameters in DiscoverX

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md

Defines detailed parameters for the `dx.scan` function, including table patterns, rule filters, sample size, and an option to preview the executed SQL. Use `None` for `sample_size` to perform a full table scan.

```python
dx.scan(
    from_tables="*.*.*", # Table pattern in form of <catalog>.<schema>.<table> ('*' is a wildcard)
    rules="*",           # Rule filter ('*' is a wildcard)
    sample_size=10000,   # Number of rows to sample, use None for a full table scan
    what_if=False        # If `True` it prints the SQL that would be executed
)
```

--------------------------------

### Scan Lakehouse for PII and Semantic Classes

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Scan tables using regex-based rules to identify columns containing sensitive data such as emails, phone numbers, IP addresses, and SSNs. The function allows specifying tables, rules, and sample sizes. It uses the 'discoverx' library and supports locale-specific rules. Results can be viewed, saved, or loaded.

```python
from discoverx import DX

dx = DX(locale="US")

# Scan all tables for PII using built-in rules
dx.scan(
    from_tables="customer_data.*.*",
```

```python
    rules="*",  # Use all available rules
    sample_size=10000  # Sample 10k rows per table
)

# Scan specific schemas with custom sample size
dx.scan(
    from_tables="analytics.events.*",
    rules="email,phone",  # Only check for emails and phones
    sample_size=5000
)

# Preview what would be scanned without executing
dx.scan(
    from_tables="*.*.*",
```

```python
    rules="*",
    sample_size=1000,
    what_if=True  # Print SQL without executing
)

# View scan results as DataFrame
scan_results_df = dx.scan_result
scan_results_df.display()

# Save scan results for later use
dx.save("my_catalog.governance.discoverx_scan_results")

# Load previously saved results
dx.load("my_catalog.governance.discoverx_scan_results")
```

--------------------------------

### Search for specific data using Discoverx

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Searches for specific data patterns within tables. It supports searching by email, auto-detecting classifications from search terms, and setting a minimum classification score threshold. Requires the discoverx library.

```python
from discoverx import DX

dx = DX()

# Search for a specific email address
email_results = dx.search(
    search_term="user@example.com",
    from_tables="*.*.*",
    by_class="email"  # Search in columns classified as email
)
email_results.display()

# Auto-detect class from search term (if pattern matches)
phone_results = dx.search(
    search_term="+1-555-123-4567",
    from_tables="customer_data.*.*"
    # by_class will be auto-detected as "phone"
)

# Set minimum classification score threshold
high_confidence_results = dx.search(
    search_term="192.168.1.1",
    from_tables="logs.*.*",
    by_class="ip_address",
    min_score=0.8  # Only search in columns with >80% match rate
)
```

--------------------------------

### Configure Query Concurrency

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Sets the maximum number of concurrent queries to be executed for DiscoverX operations. Defaults to 10.

```python
.with_concurrency(10)
```

--------------------------------

### Retrieve User Data Across Tables (Python, Discoverx)

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/GDPR_RoA.md

This code snippet demonstrates how to fetch all data for a specific user (e.g., user ID '1') from all tables that contain a 'user_id' column. It utilizes the Discoverx library to filter tables and apply a SQL query. The input is a user identifier, and the output is a DataFrame containing the user's data in JSON format.

```python
df = dx.from_tables("*.*.*")\
  .having_columns("user_id")\
  .with_sql("SELECT `user_id`, to_json(struct(*)) AS row_content FROM {full_table_name} WHERE `user_id` = 1")\
  .apply()
```

--------------------------------

### Filter Tables by Columns

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Extends table selection by filtering for tables that contain specific columns. This is a method chained after `from_tables`.

```python
.having_columns
```

--------------------------------

### Display Available Classification Rules

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md

Lists all available semantic classification rules supported by the DiscoverX library. This is useful for understanding the types of data that can be identified.

```python
dx.display_rules()
```

--------------------------------

### Detect Small Files in Delta Tables

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Identify Delta tables containing an excessive number of small files, which can lead to performance issues. This script defines thresholds for file size and number, then queries Delta table details to find problematic tables. It uses 'discoverx' and 'pyspark.sql.functions'.

```python
from discoverx import DX
from pyspark.sql.functions import col

dx = DX()

# Define thresholds
small_file_max_size_MB = 10
min_file_number = 100

# Find tables with small files
small_files_df = (
    dx.from_tables("*.*.*")
    .with_sql("DESCRIBE DETAIL {full_table_name}")
    .apply()
    .withColumn("average_file_size_MB", col("sizeInBytes") / col("numFiles") / 1024 / 1024)
    .withColumn(
        "has_too_many_small_files",
        (col("average_file_size_MB") < small_file_max_size_MB) & (col("numFiles") > min_file_number)
    )
    .filter("has_too_many_small_files")
)

# Display problematic tables
small_files_df.select(
    "table_catalog", "table_schema", "table_name",
    "numFiles", "average_file_size_MB"
).display()
```

--------------------------------

### Select data by semantic classes using Discoverx

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Retrieves data from columns matching specified semantic classifications across tables. It requires scanning the data first and can select single or multiple classes with a minimum confidence score. Outputs include classified columns and row content as JSON.

```python
from discoverx import DX

dx = DX(locale="US")

# Scan first
dx.scan(from_tables="*.*.*", sample_size=10000)

# Select all email columns across tables
email_data = dx.select_by_classes(
    from_tables="customer_data.*.*",
    by_classes="email"
)
email_data.display()

# Select multiple classes
contact_data = dx.select_by_classes(
    from_tables="*.*.*",
    by_classes=["email", "phone"],  # Tables with both email AND phone columns
    min_score=0.5  # Minimum 50% classification confidence
)

# Result includes classified_columns struct and row_content as JSON
contact_data.select("table_catalog", "table_schema", "table_name", "classified_columns").display()
```

--------------------------------

### Unpivot String Columns

Source: https://github.com/databrickslabs/discoverx/blob/master/README.md

Melts (unpivots) all string columns from the selected tables into a long format dataframe. Actions can be applied afterward.

```python
.unpivot_string_columns()
```

--------------------------------

### Select Columns by Semantic Class in Discoverx

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Select_by_class.md

Selects all columns belonging to specified semantic classes from tables. The `from_tables` argument accepts patterns like '*.*.*' to include all tables. `by_classes` is a list of Discoverx semantic class names. `min_score` can optionally filter classifications by confidence score.

```python
dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_email"], min_score=None)
```

--------------------------------

### Scan Tables with DiscoverX

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md

Initiates a scan on tables to classify columns semantically. It samples a specified number of rows per table. The result is a dataset containing a 'score' column indicating the fraction of matched records per rule.

```python
dx.scan(from_tables="*.*.*")
```

--------------------------------

### Basic Search in DiscoverX

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Search.md

Performs a search for a specific value across all tables in the data catalog. This function automatically attempts to classify the search term and restricts the search to relevant columns. It requires prior semantic classification to be run.

```python
dx.search("example_email@databricks.com", from_tables="*.*.*")
```

--------------------------------

### Search Lakehouse by Semantic Class

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Search the lakehouse for specific values within columns that have been classified by semantic type after a scan. This functionality enables targeted data exploration based on data classification. It requires the 'discoverx' library and assumes a prior scan has been performed.

```python
from discoverx import DX

dx = DX(locale="US")

# First, scan the lakehouse
dx.scan(from_tables="*.*.*", sample_size=10000)
```

--------------------------------

### Load DiscoverX Scan Results

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md

Loads previously saved semantic column scan results from a specified Delta table. This is useful for resuming analysis or accessing results in a different session.

```python
dx.load(full_table_name=<your-table-name>)
```

--------------------------------

### Delete Rows by Column Class using Discoverx

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Delete_by_class.md

Deletes rows from specified tables where a column of a given class matches provided values. Requires `yes_i_am_sure=True` for actual deletion. Uses a minimum score threshold for classification matching. Input is tables to scan, the class to match, the values to filter by, and a boolean to confirm deletion.

```python
dx.delete_by_class(from_tables="*.*.*", by_class="email", values=['example_email@databricks.com'], yes_i_am_sure=True, min_score=0.95)
```

```python
dx.delete_by_class(from_tables="*.*.*", by_class="email", values=['example_email@databricks.com'], yes_i_am_sure=False, min_score=0.95)
```

--------------------------------

### Delete User Data Across Tables with DiscoverX

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/GDPR_RoE.md

This code snippet demonstrates how to delete user data from all tables containing a specific user identifier column. It leverages DiscoverX to identify relevant tables and execute a SQL DELETE statement. Use .explain() to preview SQL before execution.

```python
dx.from_tables("*.*.*")\
  .having_columns("user_id")\
  .with_sql("DELETE FROM {full_table_name} WHERE `user_id` IN (1, 2, 3)")\
  .display() 
  # You can use .explain() instead of .display() to preview the generated SQL
```

--------------------------------

### Filter tables by data source format using Discoverx

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Restricts operations to specific data source formats such as Delta, Parquet, CSV, or JSON. This feature allows for targeted scanning or querying of tables based on their storage format, supporting single or multiple formats.

```python
from discoverx import DX

dx = DX()

# Only scan Delta tables (default behavior)
delta_tables = (
    dx.from_tables("*.*.*")
    .with_data_source_formats(["DELTA"])
    .with_sql("DESCRIBE DETAIL {full_table_name}")
    .apply()
)

# Include multiple formats
multi_format = (
    dx.from_tables("landing_zone.*.*")
    .with_data_source_formats(["DELTA", "PARQUET", "CSV", "JSON"])
    .with_sql("SELECT '{full_table_name}' as table_name, COUNT(*) as row_count FROM {full_table_name}")
    .apply()
)

# Scan with specific formats
dx.from_tables("*.*.*").scan(
    rules="email",
    sample_size=1000,
    data_source_formats=["DELTA", "PARQUET"]
)
```

--------------------------------

### Unpivot String Columns Across Tables with Python

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Melts all string columns into column_name/string_value pairs for pattern analysis or data discovery. This operation is useful for analyzing data across different tables and columns. The `sample_size` parameter can be used to limit the number of rows processed for performance reasons.

```python
from discoverx import DX

dx = DX()

# Unpivot all string columns from sample data
unpivoted_df = (
    dx.from_tables("analytics.*.*")
    .unpivot_string_columns(sample_size=1000)
    .apply()
)

# Result contains: column_name, string_value, table_catalog, table_schema, table_name
unpivoted_df.select("table_name", "column_name", "string_value").display()

# Unpivot all columns (casted to string)
all_columns_unpivoted = (
    dx.from_tables("my_catalog.my_schema.my_table")
    .unpivot_all_columns(sample_size=500)
    .apply()
)
```

--------------------------------

### Save DiscoverX Scan Results

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md

Saves the results of a previous semantic column scan to a specified Delta table. This allows for persistence and later retrieval of the classification findings.

```python
dx.save(full_table_name=<your-table-name>)
```

--------------------------------

### Delete data by semantic class (GDPR Right of Erasure) using Discoverx

Source: https://context7.com/databrickslabs/discoverx/llms.txt

Deletes rows across tables based on semantically classified columns, designed for GDPR compliance. It supports previewing deletions (what_if mode) before actual execution and can delete multiple values or specific types like phone numbers with optional minimum score thresholds.

```python
from discoverx import DX

dx = DX(locale="US")

# Scan first
dx.scan(from_tables="*.*.*", sample_size=10000)

# Preview deletion (what_if mode by default)
dx.delete_by_class(
    from_tables="customer_data.*.*",
    by_class="email",
    values="user@example.com",
    yes_i_am_sure=False  # Preview only
)

# Actually execute deletion
dx.delete_by_class(
    from_tables="customer_data.*.*",
    by_class="email",
    values="user@example.com",
    yes_i_am_sure=True  # Execute deletion
)

# Delete multiple values
dx.delete_by_class(
    from_tables="*.*.*",
    by_class="user_id",
    values=["user123", "user456", "user789"],
    yes_i_am_sure=True,
    min_score=0.9  # Only delete from high-confidence classified columns
)

# Delete phone number across all tables
dx.delete_by_class(
    from_tables="*.*.*",
    by_class="phone",
    values="+1-555-123-4567",
    yes_i_am_sure=True
)
```

--------------------------------

### DiscoverX Search with Minimum Score Filter

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Search.md

Conducts a search for a value across tables and filters results based on a minimum classification score. This is useful for prioritizing results that have a high confidence score based on the classification scan. Requires prior semantic classification.

```python
dx.search("example_email@databricks.com", from_tables="*.*.*", min_score=0.95)
```

--------------------------------

### DiscoverX Search with Explicit Classification

Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Search.md

Searches for a value across tables, explicitly specifying the classification rules to use for filtering columns. This allows for more targeted searches when the automatic classification might be too broad or imprecise. Requires prior semantic classification.

```python
dx.search("example_email@databricks.com", from_tables="*.*.*", by_classes=["dx_email"])
```