### Python Example: VACUUM All Tables Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook demonstrating how to perform a VACUUM operation on all tables using DiscoverX. This helps in reclaiming storage space. ```python # Example notebook for vacuum_multiple_tables.py ``` -------------------------------- ### Install DiscoverX Project Locally Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md Installs the DiscoverX project in editable mode with local and test development requirements. This command uses pip and assumes a Python environment is active. ```bash pip install -e ".[local,test]" ``` -------------------------------- ### Python Example: Scan with User Specified Formats Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook showing how to scan data sources using user-specified data formats. Provides flexibility in data ingestion and processing. ```python # Example notebook for scan_with_user_specified_data_source_formats.py ``` -------------------------------- ### Python Example: Create MLflow Gateway Routes Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook illustrating how to create MLflow gateway routes for integrating with services like MosaicML and OpenAI. Facilitates model serving and access. ```python # Example notebook for mlflow_gateway_routes_examples.py ``` -------------------------------- ### Python Example: Deep Clone a Catalog Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook demonstrating the deep cloning of a catalog, which duplicates all its contents. This is useful for backups or creating isolated environments. ```python # Example notebook for deep_clone_schema.py ``` -------------------------------- ### Python Example: Text Analysis with MosaicML & MLflow Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook for performing text analysis using MosaicML and Databricks MLflow. Integrates large language models for insights. ```python # Example notebook for text_analysis_mosaicml_mlflow.py ``` -------------------------------- ### Python Example: Delta Housekeeping Analysis Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook for executing Delta housekeeping tasks, providing statistics on table size, file counts, and optimization/vacuum timestamps. It offers recommendations for optimization. ```python # Example notebook for exec_delta_housekeeping.py ``` -------------------------------- ### Install OpenJDK 11 via Conda Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md Installs OpenJDK version 11.0.15 using Conda from the conda-forge channel. This is a dependency for running certain project functionalities. ```bash conda install -c conda-forge openjdk=11.0.15 ``` -------------------------------- ### Python Example: PII Detection with Presidio Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook showcasing PII (Personally Identifiable Information) detection using the Presidio library within Databricks. It helps in identifying sensitive data. ```python # Example notebook for pii_detection_presidio.py ``` -------------------------------- ### Python Example: Text Analysis with OpenAI GPT Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook for conducting text analysis using OpenAI's GPT models within Databricks. Enables advanced natural language processing tasks. ```python # Example notebook for text_analysis_openai_gpt.py ``` -------------------------------- ### Install DiscoverX Library in Databricks Source: https://github.com/databrickslabs/discoverx/blob/master/README.md This snippet shows how to install the DiscoverX Python package within a Databricks notebook environment using pip. ```python %pip install dbl-discoverx ``` -------------------------------- ### Python Example: Detect Tables with Small Files Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook for detecting tables that contain a large number of small files, which can impact performance. This is part of DiscoverX's maintenance utilities. ```python # Example notebook for detect_small_files.py ``` -------------------------------- ### Setup Local Python Environment with Conda Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md Commands to create and activate a Conda Python environment for the DiscoverX project. Ensures the correct Python version and isolates project dependencies. ```bash conda create -n discoverx python=3.9 conda activate discoverx ``` -------------------------------- ### Python Example: Update Owner of Data Objects Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Example notebook for updating the ownership of data objects (tables, views, etc.) across the Databricks Unity Catalog. Useful for data governance and stewardship. ```python # Example notebook for update_owner_of_data_objects.py ``` -------------------------------- ### Initialize DiscoverX with Python Source: https://context7.com/databrickslabs/discoverx/llms.txt Initializes the DiscoverX class to access all its functionality. It allows for basic initialization, initialization with locale-specific classification rules (e.g., US phone numbers, SSNs), and initialization with custom classification rules. This setup is crucial for utilizing DiscoverX's features for data governance and manipulation. ```python from discoverx import DX # Basic initialization dx = DX() # Initialize with locale-specific rules (e.g., US phone numbers, SSNs) dx = DX(locale="US") # Initialize with custom classification rules from discoverx.rules import RegexRule custom_rule = RegexRule( name="custom_id", description="Custom ID pattern", definition=r"^CUS-\d{6}$", match_example=["CUS-123456"], class_name="custom_id" ) dx = DX(custom_rules=[custom_rule]) ``` -------------------------------- ### Apply Queries and Get Results Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Executes the DiscoverX queries and returns a unioned dataframe containing the results from all processed tables. Used after `with_sql` or `unpivot_string_columns`. ```python .apply() ``` -------------------------------- ### Define Custom Regex Rules for DiscoverX Source: https://context7.com/databrickslabs/discoverx/llms.txt This snippet shows how to define custom regular expression rules for DiscoverX, specifying patterns for employee IDs and order IDs. It includes examples of matching and non-matching data, along with class names for categorization. ```python from dx.rules import RegexRule employee_id_rule = RegexRule( name="employee_id", description="Company employee ID format", definition=r"^EMP\d{6}$", match_example=["EMP123456", "EMP999999"], nomatch_example=["EMP12345", "emp123456"], class_name="employee_id" ) order_id_rule = RegexRule( name="order_id", description="Order ID format ORD-YYYYMMDD-####", definition=r"^ORD-\d{8}-\d{4}$", match_example=["ORD-20240115-0001"], class_name="order_id" ) ``` -------------------------------- ### Select Tables and Execute SQL Templates with Python Source: https://context7.com/databrickslabs/discoverx/llms.txt Executes SQL templates across multiple tables, matching a specified pattern, utilizing concurrent execution and automatic result aggregation. This allows for operations such as getting table details, Delta history, deep cloning tables, and changing table ownership. The `apply()` method triggers the execution of the defined SQL templates. ```python from discoverx import DX dx = DX() # Get table details for all tables in a catalog result_df = ( dx.from_tables("my_catalog.my_schema.*") .with_sql("DESCRIBE DETAIL {full_table_name}") .apply() ) result_df.display() # Get Delta history for all tables history_df = ( dx.from_tables("my_catalog.*.*") .with_sql("DESCRIBE HISTORY {full_table_name}") .apply() ) # Deep clone tables to a backup schema ( dx.from_tables("prod_catalog.sales.*") .with_sql( "CREATE TABLE IF NOT EXISTS {table_catalog}_backup.{table_schema}.{table_name} " "DEEP CLONE {full_table_name}" ) .apply() ) # Change table ownership across multiple tables ( dx.from_tables("my_catalog.public.*") .with_sql("ALTER TABLE {full_table_name} SET OWNER TO `data_engineering_team`") .apply() ) ``` -------------------------------- ### Define Custom Regex Rules for DiscoverX Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md Allows the creation of custom semantic classification rules using regular expressions. These rules can be added to the default set for more specific data matching. The `RegexRule` requires a name, description, definition (regex pattern), and examples. ```python from discoverx.rules import RegexRule from discoverx import DX custom_rules = [ RegexRule( name = "resource_request_id", description = "Resource request ID", definition = r"^AR-\d{9}$", match_example = ["AR-123456789"], nomatch_example = ["R-123"], ) ] dx = DX(custom_rules=custom_rules) ``` -------------------------------- ### Filter Tables by Column Presence with Python Source: https://context7.com/databrickslabs/discoverx/llms.txt Restricts operations to tables that contain specific columns before executing SQL templates or Python functions. This enables users to target operations based on data characteristics. Example uses include finding tables with specific columns and applying GDPR deletion based on column presence. ```python from discoverx import DX dx = DX() # Find tables with user_id column and query them user_tables_df = ( dx.from_tables("*.*.*") .having_columns("user_id") .with_sql("SELECT COUNT(DISTINCT user_id) as unique_users FROM {full_table_name}") .apply() ) # Find tables with email and phone columns contact_tables = ( dx.from_tables("customer_data.*.*") .having_columns("email", "phone_number") .with_sql("SELECT {full_table_name} as table_name, COUNT(*) as contact_count FROM {full_table_name}") .apply() ) # Apply GDPR deletion to tables containing user_email ( dx.from_tables("*.*.*") .having_columns("user_email") .with_sql("DELETE FROM {full_table_name} WHERE user_email = 'user@example.com'") .apply() ) ``` -------------------------------- ### Initialize and Scan with Custom Rules in DiscoverX Source: https://context7.com/databrickslabs/discoverx/llms.txt This code demonstrates initializing DiscoverX with custom rules and performing a scan on specified tables. It shows how to target specific rules by name and set a sample size for the scan. ```python from dx.dx import DX # Assuming employee_id_rule and order_id_rule are defined as above # Initialize with custom rules dx = DX(custom_rules=[employee_id_rule, order_id_rule]) # Scan using custom rules dx.scan( from_tables="hr_data.*.*", rules="employee_id", sample_size=5000 ) ``` -------------------------------- ### Configure Concurrency and Preview Queries with Python Source: https://context7.com/databrickslabs/discoverx/llms.txt Provides control over concurrent execution and the ability to preview generated SQL before execution. This allows users to optimize performance by setting custom concurrency levels and verifying SQL queries before running them. The `explain()` method shows the SQL for each table, and `apply()` executes the queries. ```python from discoverx import DX dx = DX() # Set custom concurrency level result = ( dx.from_tables("large_catalog.*.*") .with_concurrency(20) # Execute 20 queries in parallel .with_sql("OPTIMIZE {full_table_name}") .apply() ) # Preview SQL that will be executed without running it explorer = ( dx.from_tables("prod_catalog.*.*") .with_sql("VACUUM {full_table_name} RETAIN 168 HOURS") ) explorer.explain() # Shows SQL for each table # Execute after previewing explorer.apply() ``` -------------------------------- ### Initialize DiscoverX DataExplorer Object Source: https://github.com/databrickslabs/discoverx/blob/master/README.md This snippet demonstrates how to import and initialize the main DataExplorer (DX) object from the DiscoverX library. It requires specifying a locale. ```python from discoverx import DX dx = DX(locale="US") ``` -------------------------------- ### Explain SQL Queries Source: https://github.com/databrickslabs/discoverx/blob/master/README.md An action that explains the SQL queries that would be executed by DiscoverX without actually running them. Typically used after `with_sql` or `unpivot_string_columns`. ```python .explain() ``` -------------------------------- ### Create Git Tag for Release Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md Creates an annotated Git tag for a specific project version and pushes it to the origin repository. This is part of the release pipeline process. ```bash git tag -a v -m "Release tag for version " git push origin --tags ``` -------------------------------- ### Run Unit Tests with Pytest Source: https://github.com/databrickslabs/discoverx/blob/master/CONTRIBUTING.md Executes unit tests for the DiscoverX project using pytest, including code coverage reporting. Tests are located in the `tests/unit` directory. ```bash pytest tests/unit --cov ``` -------------------------------- ### Apply SQL Template to Tables Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Applies a provided SQL template to all selected tables. Subsequent actions can then be applied to the results. ```python .with_sql("SQL_TEMPLATE") ``` -------------------------------- ### Vacuum Multiple Tables with DiscoverX SQL Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Vacuum.md This code snippet demonstrates how to vacuum multiple tables at once using DiscoverX. It utilizes the `dx.from_tables` method with a wildcard pattern and the `with_sql` method to apply the VACUUM command to all matched tables. The output can be displayed using `.display()`. ```python dx.from_tables("*.*.*").with_sql("VACUUM {full_table_name}").display() ``` -------------------------------- ### Select Tables using from_tables Pattern Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Demonstrates selecting tables based on a catalog, schema, and table pattern using a wildcard. This returns a DataExplorer object for further operations. ```python dx.from_tables("..") ``` -------------------------------- ### Delta Housekeeping Analysis and Recommendations Source: https://context7.com/databrickslabs/discoverx/llms.txt Generate comprehensive recommendations for Delta table maintenance, including analysis for OPTIMIZE and VACUUM operations. This function analyzes tables and provides actionable insights. It requires the 'discoverx' library. ```python from discoverx import DX dx = DX() # Run housekeeping analysis on all tables housekeeping_output = ( dx.from_tables("production.*.*") .delta_housekeeping() ) # Get recommendations DataFrame recommendations_df = housekeeping_output.apply() # View only recommendation columns recommendations_df.select( "catalog", "database", "tableName", *[c for c in recommendations_df.columns if c.startswith("rec_")] ).display() # Display full analysis including stats housekeeping_output.display() # Get HTML-formatted explanation housekeeping_output.explain() ``` -------------------------------- ### Execute SQL Templates on Multiple Tables Source: https://github.com/databrickslabs/discoverx/blob/master/README.md This functionality allows executing a SQL template against multiple tables concurrently. It supports various operations like describing details, showing history, deep cloning, tagging, and more. The available variables within the templates include `{full_table_name}`, `{table_catalog}`, `{table_schema}`, `{table_name}`, `{stack_string_columns}`, and `{stack_all_columns_as_string}`. Results are returned as a Spark DataFrame. ```sql DESCRIBE DETAIL {full_table_name} SHOW HISTORY {full_table_name} CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}_clone.{table_name} DEEP CLONE {full_table_name} CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}.{table_name}_empty_copy LIKE {full_table_name} ALTER TABLE {full_table_name} SET TAGS ('tag_name' = 'tag_value') ALTER TABLE {full_table_name} SET OWNER TO principal SHOW PARTITIONS {full_table_name} SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1 SELECT {stack_string_columns} AS (column_name, string_value) FROM {full_table_name} SELECT {stack_all_columns_as_string} AS (column_name, string_value) FROM {full_table_name} ALTER TABLE {full_table_name} CLUSTER BY (column1, column2) VACUUM {full_table_name} OPTIMIZE {full_table_name} ``` -------------------------------- ### Apply Python Functions to Multiple Assets Source: https://github.com/databrickslabs/discoverx/blob/master/README.md DiscoverX enables the concurrent application of Python functions to multiple Lakehouse assets. This feature provides access to table metadata, including catalog, schema, table name, columns (with data type and partition index), and tags (if `.with_tags(True)` is used). This is useful for complex data processing or analysis tasks. ```python # Example usage: def process_table(table_info): print(f"Processing table: {table_info['catalog']}.{table_info['schema']}.{table_info['table']}") # Access columns for col in table_info['columns']: print(f" Column: {col['name']} ({col['data_type']})") # Access tags (if available) if 'tags' in table_info: print(f" Tags: {table_info['tags']}") # Add your custom logic here # Assuming 'discoverx' is imported and configured # discoverx.from_tables(...).apply(process_table) ``` -------------------------------- ### Scan Tables with Custom Rules in DiscoverX Source: https://context7.com/databrickslabs/discoverx/llms.txt This snippet illustrates using the fluent API of DiscoverX to select tables and then scan them using a specific custom rule and sample size. It highlights the flexibility in applying rules to different table patterns. ```python from dx.dx import DX # Assuming order_id_rule is defined as above # Use from_tables with custom rules dx.from_tables("orders.*.*", custom_rules=[order_id_rule]).scan( rules="order_id", sample_size=10000 ) ``` -------------------------------- ### Display Query Results Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Executes the DiscoverX queries and displays the first 1000 rows of the combined results as a unioned dataframe. Used after `with_sql` or `unpivot_string_columns`. ```python .display() ``` -------------------------------- ### Scan Lakehouse with Regex Rules Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Scans the lakehouse using predefined regex rules to power semantic classification. This function is experimental. ```python .scan(rules=RULES) ``` -------------------------------- ### Aggregate IP Address Occurrences by Date using Discoverx Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Select_by_class.md Demonstrates building summary tables by selecting date and IP address columns, then counting occurrences. It groups by catalog, schema, table, date, and IP column, and aggregates the count of IP address values. Requires Discoverx and a DataFrame `df`. ```python df = (dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_ip_v4"]) .groupby(["table_catalog", "table_schema", "table_name", "classified_columns.dx_iso_date.column", "classified_columns.dx_iso_date.value", "classified_columns.dx_ip_v4.column"]) .agg(func.count("classified_columns.dx_ip_v4.value").alias("count")) ) ``` -------------------------------- ### Display Available Classification Rules Source: https://context7.com/databrickslabs/discoverx/llms.txt View the built-in semantic classification rules provided by DiscoverX, including descriptions and the patterns they use. This function helps users understand what types of sensitive data can be detected. It requires the 'discoverx' library. ```python from discoverx import DX dx = DX(locale="US") # Display all available rules in HTML format dx.display_rules() # Rules include patterns for: # - Email addresses # - Phone numbers (locale-specific) # - IP addresses (IPv4, IPv6) # - Social Security Numbers (US) # - Credit card numbers # - URLs # - Postal codes # - And more... ``` -------------------------------- ### Scan with custom rules using Discoverx Source: https://context7.com/databrickslabs/discoverx/llms.txt Defines and utilizes custom regular expression-based rules for data classification, enabling domain-specific pattern matching during scans. Requires importing RegexRule from discoverx.rules. ```python from discoverx import DX from discoverx.rules import RegexRule # Example usage would follow here, defining rules and then using dx.scan with those rules. ``` -------------------------------- ### Advanced SQL Template Variables in Discoverx Source: https://context7.com/databrickslabs/discoverx/llms.txt Enables the use of dynamic variables within SQL templates for complex data operations. Supports stacking columns, using individual table component variables (catalog, schema, name), and applying liquid clustering for table optimization. ```python from discoverx import DX dx = DX() # Stack string columns into column_name/value pairs stacked_strings = ( dx.from_tables("analytics.events.*") .with_sql(""" SELECT '{full_table_name}' as source_table, {stack_string_columns} AS (column_name, string_value) FROM {full_table_name} LIMIT 100 """) .apply() ) # Stack ALL columns as strings all_columns_stacked = ( dx.from_tables("my_catalog.my_schema.my_table") .with_sql(""" SELECT {stack_all_columns_as_string} AS (column_name, string_value) FROM {full_table_name} WHERE string_value IS NOT NULL """) .apply() ) # Use individual component variables custom_query = ( dx.from_tables("prod.*.*") .with_sql(""" SELECT '{table_catalog}' as catalog, '{table_schema}' as schema, '{table_name}' as table, COUNT(*) as row_count, '{full_table_name}' as full_name FROM {full_table_name} """) .apply() ) # Apply liquid clustering to tables ( dx.from_tables("high_volume.*.*") .having_columns("user_id", "event_date") .with_sql("ALTER TABLE {full_table_name} CLUSTER BY (user_id, event_date)") .apply() ) ``` -------------------------------- ### Apply Python Functions to Multiple Tables Source: https://context7.com/databrickslabs/discoverx/llms.txt Execute custom Python functions concurrently across tables, accessing table metadata like columns, tags, and properties. This function takes a TableInfo object and returns structured data. It requires the 'discoverx' library. Dependencies include 'discoverx' and 'discoverx.table_info'. ```python from discoverx import DX from discoverx.table_info import TableInfo dx = DX() def analyze_table(table_info: TableInfo): """Custom function to analyze each table""" full_name = f"{table_info.catalog}.{table_info.schema}.{table_info.table}" # Access column information string_cols = [col.name for col in table_info.columns if col.data_type == "string"] partition_cols = [col.name for col in table_info.columns if col.partition_index is not None] print(f"Table: {full_name}") print(f" Total columns: {len(table_info.columns)}") print(f" String columns: {string_cols}") print(f" Partition columns: {partition_cols}") # Return structured data return { "table": full_name, "column_count": len(table_info.columns), "string_column_count": len(string_cols) } # Execute function across all tables results = ( dx.from_tables("analytics.sales.*") .with_concurrency(10) .map(analyze_table) ) print(f"Analyzed {len(results)} tables") # Access tags in custom function def check_tags(table_info: TableInfo): if table_info.tags: print(f"Catalog tags: {table_info.tags.catalog_tags}") print(f"Table tags: {table_info.tags.table_tags}") return table_info results_with_tags = ( dx.from_tables("my_catalog.*.*") .with_tags(True) # Enable tag fetching .map(check_tags) ) ``` -------------------------------- ### Configure Scan Parameters in DiscoverX Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md Defines detailed parameters for the `dx.scan` function, including table patterns, rule filters, sample size, and an option to preview the executed SQL. Use `None` for `sample_size` to perform a full table scan. ```python dx.scan( from_tables="*.*.*", # Table pattern in form of ..
('*' is a wildcard) rules="*", # Rule filter ('*' is a wildcard) sample_size=10000, # Number of rows to sample, use None for a full table scan what_if=False # If `True` it prints the SQL that would be executed ) ``` -------------------------------- ### Scan Lakehouse for PII and Semantic Classes Source: https://context7.com/databrickslabs/discoverx/llms.txt Scan tables using regex-based rules to identify columns containing sensitive data such as emails, phone numbers, IP addresses, and SSNs. The function allows specifying tables, rules, and sample sizes. It uses the 'discoverx' library and supports locale-specific rules. Results can be viewed, saved, or loaded. ```python from discoverx import DX dx = DX(locale="US") # Scan all tables for PII using built-in rules dx.scan( from_tables="customer_data.*.*", ``` ```python rules="*", # Use all available rules sample_size=10000 # Sample 10k rows per table ) # Scan specific schemas with custom sample size dx.scan( from_tables="analytics.events.*", rules="email,phone", # Only check for emails and phones sample_size=5000 ) # Preview what would be scanned without executing dx.scan( from_tables="*.*.*", ``` ```python rules="*", sample_size=1000, what_if=True # Print SQL without executing ) # View scan results as DataFrame scan_results_df = dx.scan_result scan_results_df.display() # Save scan results for later use dx.save("my_catalog.governance.discoverx_scan_results") # Load previously saved results dx.load("my_catalog.governance.discoverx_scan_results") ``` -------------------------------- ### Search for specific data using Discoverx Source: https://context7.com/databrickslabs/discoverx/llms.txt Searches for specific data patterns within tables. It supports searching by email, auto-detecting classifications from search terms, and setting a minimum classification score threshold. Requires the discoverx library. ```python from discoverx import DX dx = DX() # Search for a specific email address email_results = dx.search( search_term="user@example.com", from_tables="*.*.*", by_class="email" # Search in columns classified as email ) email_results.display() # Auto-detect class from search term (if pattern matches) phone_results = dx.search( search_term="+1-555-123-4567", from_tables="customer_data.*.*" # by_class will be auto-detected as "phone" ) # Set minimum classification score threshold high_confidence_results = dx.search( search_term="192.168.1.1", from_tables="logs.*.*", by_class="ip_address", min_score=0.8 # Only search in columns with >80% match rate ) ``` -------------------------------- ### Configure Query Concurrency Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Sets the maximum number of concurrent queries to be executed for DiscoverX operations. Defaults to 10. ```python .with_concurrency(10) ``` -------------------------------- ### Retrieve User Data Across Tables (Python, Discoverx) Source: https://github.com/databrickslabs/discoverx/blob/master/docs/GDPR_RoA.md This code snippet demonstrates how to fetch all data for a specific user (e.g., user ID '1') from all tables that contain a 'user_id' column. It utilizes the Discoverx library to filter tables and apply a SQL query. The input is a user identifier, and the output is a DataFrame containing the user's data in JSON format. ```python df = dx.from_tables("*.*.*")\ .having_columns("user_id")\ .with_sql("SELECT `user_id`, to_json(struct(*)) AS row_content FROM {full_table_name} WHERE `user_id` = 1")\ .apply() ``` -------------------------------- ### Filter Tables by Columns Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Extends table selection by filtering for tables that contain specific columns. This is a method chained after `from_tables`. ```python .having_columns ``` -------------------------------- ### Display Available Classification Rules Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md Lists all available semantic classification rules supported by the DiscoverX library. This is useful for understanding the types of data that can be identified. ```python dx.display_rules() ``` -------------------------------- ### Detect Small Files in Delta Tables Source: https://context7.com/databrickslabs/discoverx/llms.txt Identify Delta tables containing an excessive number of small files, which can lead to performance issues. This script defines thresholds for file size and number, then queries Delta table details to find problematic tables. It uses 'discoverx' and 'pyspark.sql.functions'. ```python from discoverx import DX from pyspark.sql.functions import col dx = DX() # Define thresholds small_file_max_size_MB = 10 min_file_number = 100 # Find tables with small files small_files_df = ( dx.from_tables("*.*.*") .with_sql("DESCRIBE DETAIL {full_table_name}") .apply() .withColumn("average_file_size_MB", col("sizeInBytes") / col("numFiles") / 1024 / 1024) .withColumn( "has_too_many_small_files", (col("average_file_size_MB") < small_file_max_size_MB) & (col("numFiles") > min_file_number) ) .filter("has_too_many_small_files") ) # Display problematic tables small_files_df.select( "table_catalog", "table_schema", "table_name", "numFiles", "average_file_size_MB" ).display() ``` -------------------------------- ### Select data by semantic classes using Discoverx Source: https://context7.com/databrickslabs/discoverx/llms.txt Retrieves data from columns matching specified semantic classifications across tables. It requires scanning the data first and can select single or multiple classes with a minimum confidence score. Outputs include classified columns and row content as JSON. ```python from discoverx import DX dx = DX(locale="US") # Scan first dx.scan(from_tables="*.*.*", sample_size=10000) # Select all email columns across tables email_data = dx.select_by_classes( from_tables="customer_data.*.*", by_classes="email" ) email_data.display() # Select multiple classes contact_data = dx.select_by_classes( from_tables="*.*.*", by_classes=["email", "phone"], # Tables with both email AND phone columns min_score=0.5 # Minimum 50% classification confidence ) # Result includes classified_columns struct and row_content as JSON contact_data.select("table_catalog", "table_schema", "table_name", "classified_columns").display() ``` -------------------------------- ### Unpivot String Columns Source: https://github.com/databrickslabs/discoverx/blob/master/README.md Melts (unpivots) all string columns from the selected tables into a long format dataframe. Actions can be applied afterward. ```python .unpivot_string_columns() ``` -------------------------------- ### Select Columns by Semantic Class in Discoverx Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Select_by_class.md Selects all columns belonging to specified semantic classes from tables. The `from_tables` argument accepts patterns like '*.*.*' to include all tables. `by_classes` is a list of Discoverx semantic class names. `min_score` can optionally filter classifications by confidence score. ```python dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_email"], min_score=None) ``` -------------------------------- ### Scan Tables with DiscoverX Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md Initiates a scan on tables to classify columns semantically. It samples a specified number of rows per table. The result is a dataset containing a 'score' column indicating the fraction of matched records per rule. ```python dx.scan(from_tables="*.*.*") ``` -------------------------------- ### Basic Search in DiscoverX Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Search.md Performs a search for a specific value across all tables in the data catalog. This function automatically attempts to classify the search term and restricts the search to relevant columns. It requires prior semantic classification to be run. ```python dx.search("example_email@databricks.com", from_tables="*.*.*") ``` -------------------------------- ### Search Lakehouse by Semantic Class Source: https://context7.com/databrickslabs/discoverx/llms.txt Search the lakehouse for specific values within columns that have been classified by semantic type after a scan. This functionality enables targeted data exploration based on data classification. It requires the 'discoverx' library and assumes a prior scan has been performed. ```python from discoverx import DX dx = DX(locale="US") # First, scan the lakehouse dx.scan(from_tables="*.*.*", sample_size=10000) ``` -------------------------------- ### Load DiscoverX Scan Results Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md Loads previously saved semantic column scan results from a specified Delta table. This is useful for resuming analysis or accessing results in a different session. ```python dx.load(full_table_name=) ``` -------------------------------- ### Delete Rows by Column Class using Discoverx Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Delete_by_class.md Deletes rows from specified tables where a column of a given class matches provided values. Requires `yes_i_am_sure=True` for actual deletion. Uses a minimum score threshold for classification matching. Input is tables to scan, the class to match, the values to filter by, and a boolean to confirm deletion. ```python dx.delete_by_class(from_tables="*.*.*", by_class="email", values=['example_email@databricks.com'], yes_i_am_sure=True, min_score=0.95) ``` ```python dx.delete_by_class(from_tables="*.*.*", by_class="email", values=['example_email@databricks.com'], yes_i_am_sure=False, min_score=0.95) ``` -------------------------------- ### Delete User Data Across Tables with DiscoverX Source: https://github.com/databrickslabs/discoverx/blob/master/docs/GDPR_RoE.md This code snippet demonstrates how to delete user data from all tables containing a specific user identifier column. It leverages DiscoverX to identify relevant tables and execute a SQL DELETE statement. Use .explain() to preview SQL before execution. ```python dx.from_tables("*.*.*")\ .having_columns("user_id")\ .with_sql("DELETE FROM {full_table_name} WHERE `user_id` IN (1, 2, 3)")\ .display() # You can use .explain() instead of .display() to preview the generated SQL ``` -------------------------------- ### Filter tables by data source format using Discoverx Source: https://context7.com/databrickslabs/discoverx/llms.txt Restricts operations to specific data source formats such as Delta, Parquet, CSV, or JSON. This feature allows for targeted scanning or querying of tables based on their storage format, supporting single or multiple formats. ```python from discoverx import DX dx = DX() # Only scan Delta tables (default behavior) delta_tables = ( dx.from_tables("*.*.*") .with_data_source_formats(["DELTA"]) .with_sql("DESCRIBE DETAIL {full_table_name}") .apply() ) # Include multiple formats multi_format = ( dx.from_tables("landing_zone.*.*") .with_data_source_formats(["DELTA", "PARQUET", "CSV", "JSON"]) .with_sql("SELECT '{full_table_name}' as table_name, COUNT(*) as row_count FROM {full_table_name}") .apply() ) # Scan with specific formats dx.from_tables("*.*.*").scan( rules="email", sample_size=1000, data_source_formats=["DELTA", "PARQUET"] ) ``` -------------------------------- ### Unpivot String Columns Across Tables with Python Source: https://context7.com/databrickslabs/discoverx/llms.txt Melts all string columns into column_name/string_value pairs for pattern analysis or data discovery. This operation is useful for analyzing data across different tables and columns. The `sample_size` parameter can be used to limit the number of rows processed for performance reasons. ```python from discoverx import DX dx = DX() # Unpivot all string columns from sample data unpivoted_df = ( dx.from_tables("analytics.*.*") .unpivot_string_columns(sample_size=1000) .apply() ) # Result contains: column_name, string_value, table_catalog, table_schema, table_name unpivoted_df.select("table_name", "column_name", "string_value").display() # Unpivot all columns (casted to string) all_columns_unpivoted = ( dx.from_tables("my_catalog.my_schema.my_table") .unpivot_all_columns(sample_size=500) .apply() ) ``` -------------------------------- ### Save DiscoverX Scan Results Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Semantic_classification.md Saves the results of a previous semantic column scan to a specified Delta table. This allows for persistence and later retrieval of the classification findings. ```python dx.save(full_table_name=) ``` -------------------------------- ### Delete data by semantic class (GDPR Right of Erasure) using Discoverx Source: https://context7.com/databrickslabs/discoverx/llms.txt Deletes rows across tables based on semantically classified columns, designed for GDPR compliance. It supports previewing deletions (what_if mode) before actual execution and can delete multiple values or specific types like phone numbers with optional minimum score thresholds. ```python from discoverx import DX dx = DX(locale="US") # Scan first dx.scan(from_tables="*.*.*", sample_size=10000) # Preview deletion (what_if mode by default) dx.delete_by_class( from_tables="customer_data.*.*", by_class="email", values="user@example.com", yes_i_am_sure=False # Preview only ) # Actually execute deletion dx.delete_by_class( from_tables="customer_data.*.*", by_class="email", values="user@example.com", yes_i_am_sure=True # Execute deletion ) # Delete multiple values dx.delete_by_class( from_tables="*.*.*", by_class="user_id", values=["user123", "user456", "user789"], yes_i_am_sure=True, min_score=0.9 # Only delete from high-confidence classified columns ) # Delete phone number across all tables dx.delete_by_class( from_tables="*.*.*", by_class="phone", values="+1-555-123-4567", yes_i_am_sure=True ) ``` -------------------------------- ### DiscoverX Search with Minimum Score Filter Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Search.md Conducts a search for a value across tables and filters results based on a minimum classification score. This is useful for prioritizing results that have a high confidence score based on the classification scan. Requires prior semantic classification. ```python dx.search("example_email@databricks.com", from_tables="*.*.*", min_score=0.95) ``` -------------------------------- ### DiscoverX Search with Explicit Classification Source: https://github.com/databrickslabs/discoverx/blob/master/docs/Search.md Searches for a value across tables, explicitly specifying the classification rules to use for filtering columns. This allows for more targeted searches when the automatic classification might be too broad or imprecise. Requires prior semantic classification. ```python dx.search("example_email@databricks.com", from_tables="*.*.*", by_classes=["dx_email"]) ```