### Install SSL Certificates for Python on macOS

Source: https://intugle.github.io/data-tools/docs/getting-started

Installs SSL certificates for Python installations from python.org on macOS. Replace '3.XX' with your specific Python version. Not needed for Homebrew Python.

```bash
/Applications/Python\ 3.XX/Install\ Certificates.command
```

--------------------------------

### Configure LLM Environment Variables

Source: https://intugle.github.io/data-tools/docs/getting-started

Sets environment variables for LLM provider and API key. This example uses OpenAI's GPT-3.5 Turbo model.

```bash
export LLM_PROVIDER="openai:gpt-3.5-turbo"
export OPENAI_API_KEY="your-openai-api-key"
```

--------------------------------

### Install Intugle Package with Pip

Source: https://intugle.github.io/data-tools/docs/getting-started

Installs the 'intugle' Python package using pip. This command should be run after activating the virtual environment.

```bash
pip install intugle
```

--------------------------------

### Create and Activate Python Virtual Environment

Source: https://intugle.github.io/data-tools/docs/getting-started

Creates a Python virtual environment named '.venv' and activates it. This is a standard practice for managing project dependencies.

```bash
python -m venv .venv
source .venv/bin/activate
```

--------------------------------

### Install libomp on macOS with Homebrew

Source: https://intugle.github.io/data-tools/docs/getting-started

Installs the 'libomp' library on macOS using the Homebrew package manager. This is a dependency for some Python packages on macOS.

```bash
brew install libomp
```

--------------------------------

### Qdrant Vector Database Setup (Bash)

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Shell commands to start a Qdrant vector database using Docker. It maps ports, mounts a volume for persistent storage, and names the container. This is a prerequisite for certain semantic search configurations.
```bash
# Prerequisites: Run Qdrant vector database
docker run -d -p 6333:6333 -p 6334:6334 \
    -v qdrant_storage:/qdrant/storage:z \
    --name qdrant qdrant/qdrant
```

--------------------------------

### Inject Custom LLM Instance in Intugle Settings

Source: https://intugle.github.io/data-tools/docs/getting-started

Demonstrates how to inject a pre-initialized custom LLM instance into Intugle's settings before importing Intugle modules. The custom LLM must inherit from `langchain_core.language_models.chat_models.BaseChatModel`.

```python
# main.py
from intugle.core import settings

# This must be an object that inherits from BaseChatModel
my_llm_instance = ...

# Set the custom instance in the settings
settings.CUSTOM_LLM_INSTANCE = my_llm_instance

# Now, any intugle modules imported after this point will use your custom LLM
# ... rest of your code
```

--------------------------------

### Run Qdrant Instance using Docker

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-search

This command starts a Qdrant vector database instance using Docker. It maps ports 6333 and 6334 for Qdrant communication and uses a Docker volume for persistent storage. Ensure Docker is installed and running on your system.

```bash
docker run -d -p 6333:6333 -p 6334:6334 \
    -v qdrant_storage:/qdrant/storage:z \
    --name qdrant qdrant/qdrant
```

--------------------------------

### Start MCP Server for Vibe Coding

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Provides the command to start the MCP server, which facilitates natural language data product generation. The server runs on localhost:8000 and exposes a semantic layer endpoint.
```bash
# Start MCP server
intugle-mcp

# Server runs on localhost:8000
# Endpoint: http://localhost:8000/semantic_layer/mcp
```

--------------------------------

### Initialize and Use DataSet Object

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Provides an example of initializing a `DataSet` object with data source information, running individual pipeline stages like profiling and key identification, and accessing table-level metadata.

```python
from intugle.analysis.models import DataSet

# Initialize dataset
data_source = {"path": "data/patients.csv", "type": "csv"}
dataset = DataSet(data_source, name="patients")

# Run pipeline stages individually
dataset.profile(save=True)
dataset.identify_datatypes(save=True)
dataset.identify_keys(save=True)
dataset.generate_glossary(domain="Healthcare", save=True)

# Access table-level metadata
print(f"Table Name: {dataset.source_table_model.name}")
print(f"Primary Key: {dataset.source_table_model.key}")
print(f"Description: {dataset.source_table_model.description}")
```

--------------------------------

### Start MCP Server

Source: https://intugle.github.io/data-tools/docs/vibe-coding

This command initiates the built-in MCP server from the project's root directory. The server runs on localhost:8000 by default and mounts the 'semantic_layer' and 'adapter' services, making the semantic layer accessible to AI assistants.

```bash
intugle-mcp
```

--------------------------------

### Install Intugle Python Package

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Demonstrates how to install the Intugle Python library, including optional support for Snowflake and Databricks, and macOS-specific dependencies.
```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install base package
pip install intugle

# Install with Snowflake support
pip install "intugle[snowflake]"

# Install with Databricks support
pip install "intugle[databricks]"

# macOS users: install libomp
brew install libomp
```

--------------------------------

### Defining Optional Dependencies in pyproject.toml

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Illustrates how to add a custom adapter's required driver library as an optional dependency in the `pyproject.toml` file, enabling installation via `pip install "intugle[myconnector]"`.

```toml
# In pyproject.toml

[project.optional-dependencies]
# ... other dependencies
myconnector = ["myconnector-driver-library>=1.0.0"]
```

--------------------------------

### Initializing and Running DataSet Stages

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/dataset

Shows how to initialize a DataSet with a data source (CSV in this example) and sequentially run analysis stages like profiling, datatype identification, key identification, and glossary generation, with options to save progress after each stage. Dataset names should not contain whitespaces.
```python
from intugle.analysis.models import DataSet

# Initialize the dataset
data_source = {"path": "path/to/my_data.csv", "type": "csv"}
dataset = DataSet(data_source, name="my_data")

# Run each stage individually and save progress
print("Step 1: Profiling...")
dataset.profile(save=True)

print("Step 2: Identifying Datatypes...")
dataset.identify_datatypes(save=True)

print("Step 3: Identifying Keys...")
dataset.identify_keys(save=True)

print("Step 4: Generating Glossary...")
dataset.generate_glossary(domain="my_domain", save=True)
```

--------------------------------

### Build and Deploy Semantic Model with Snowflake

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Shows how to initialize a `SemanticModel` with Snowflake datasets, build the model, and deploy it back to Snowflake. Includes configuration example for external Snowflake connections.

```python
from intugle import SemanticModel

# Snowflake datasets
datasets = {
    "CUSTOMERS": {
        "identifier": "CUSTOMERS",  # Must match key
        "type": "snowflake"
    },
    "ORDERS": {
        "identifier": "ORDERS",
        "type": "snowflake"
    }
}

sm = SemanticModel(datasets, domain="E-commerce")
sm.build()

# Deploy to Snowflake: sync metadata and create semantic view
sm.deploy(target="snowflake")

# Custom semantic view name
sm.deploy(target="snowflake", model_name="my_custom_semantic_view")
```

```yaml
# profiles.yml for external Snowflake connection
snowflake:
  type: snowflake
  account: your_snowflake_account
  user: your_username
  password: your_password
  role: your_role
  warehouse: your_warehouse
  database: your_database
  schema: your_schema
```

--------------------------------

### Configure LLM Provider for Intugle

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Shows how to configure the Large Language Model (LLM) provider and API key for Intugle, using OpenAI as an example.
```bash
# Configure LLM provider
export LLM_PROVIDER="openai:gpt-3.5-turbo"
export OPENAI_API_KEY="your-openai-api-key"
```

--------------------------------

### Generate Data Product Specification with Prompt

Source: https://intugle.github.io/data-tools/docs/vibe-coding

This example demonstrates invoking the 'create-dp' prompt within an MCP-compatible client to generate a 'product_spec' dictionary. The prompt takes a natural language request, such as 'show me the top 5 patients with the most claims', and utilizes tools like 'get_tables' and 'get_schema' to construct the specification.

```bash
/create-dp show me the top 5 patients with the most claims
```

```json
{
  "name": "top_5_patients_by_claims",
  "fields": [
    { "id": "patients.first", "name": "first_name" },
    { "id": "patients.last", "name": "last_name" },
    {
      "id": "claims.id",
      "name": "number_of_claims",
      "category": "measure",
      "measure_func": "count"
    }
  ],
  "filter": {
    "sort_by": [
      {
        "id": "claims.id",
        "alias": "number_of_claims",
        "direction": "desc"
      }
    ],
    "limit": 5
  }
}
```

--------------------------------

### Install Intugle with Databricks Dependencies

Source: https://intugle.github.io/data-tools/docs/connectors/databricks

Installs the Intugle library with optional dependencies required for Databricks integration. This includes PySpark, sqlglot, and databricks-sql-connector. Ensure you have Python and pip installed.

```bash
pip install "intugle[databricks]"
```

--------------------------------

### Join and Filter Across Tables for Patient Conditions

Source: https://intugle.github.io/data-tools/docs/core-concepts/data-product/advanced-examples

This example shows how to join tables and apply filters across them to answer complex questions. It finds patients from Boston diagnosed with conditions containing the word 'fracture'. The snippet defines a product specification and its corresponding generated SQL, demonstrating implicit joins and filtering on fields from different tables.
```python
product_spec = {
    "name": "conditions_of_boston_patients",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {"id": "conditions.description", "name": "condition"},
    ],
    "filter": {
        "selections": [
            {"id": "patients.city", "values": ["Boston"]},
        ],
        "wildcards": [
            {
                "id": "conditions.description",
                "value": "fracture",
                "option": "contains",
            }
        ],
        "limit": 10,
    },
}
```

```sql
SELECT
    "patients"."first" as first_name,
    "patients"."last" as last_name,
    "conditions"."description" as condition
FROM conditions
LEFT JOIN patients ON "conditions"."patient" = "patients"."id"
WHERE ("patients"."city" IN ('Boston',))
    AND "conditions"."description" LIKE '%fracture%'
LIMIT 10
```

--------------------------------

### Semantic Search - Standalone Client Initialization (Python)

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Demonstrates how to initialize the SemanticSearch client in Python. It can load configurations from default .yml files or from a custom project path, allowing for flexible setup of semantic search capabilities.

```python
# Standalone semantic search
from intugle.semantic_search import SemanticSearch

# Initialize (loads from .yml files)
search_client = SemanticSearch()

# Or specify custom path
# search_client = SemanticSearch(project_base="/path/to/models")
```

--------------------------------

### Combine Multiple Filters for Patients Data

Source: https://intugle.github.io/data-tools/docs/core-concepts/data-product/advanced-examples

This example demonstrates how to apply multiple filter criteria to find male patients in Boston. It uses the 'selections' and 'wildcards' lists within the 'filter' object. By default, conditions are combined with an 'AND' operator. This snippet defines a product specification for a data product and shows the generated SQL.
```python
product_spec = {
    "name": "male_patients_in_boston",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {"id": "patients.city", "name": "city"},
        {"id": "patients.gender", "name": "gender"},
    ],
    "filter": {
        "selections": [
            {"id": "patients.city", "values": ["Boston"]},
            {"id": "patients.gender", "values": ["M"]},
        ],
        "limit": 10,
    },
}
```

```sql
SELECT
    "patients"."first" as first_name,
    "patients"."last" as last_name,
    "patients"."city" as city,
    "patients"."gender" as gender
FROM patients
WHERE ("patients"."city" IN ('Boston',))
    AND ("patients"."gender" IN ('M',))
LIMIT 10
```

--------------------------------

### Configure Snowflake Connection via profiles.yml

Source: https://intugle.github.io/data-tools/docs/connectors/snowflake

Example 'profiles.yml' file demonstrating how to provide Snowflake connection credentials for external environments. This includes account, user, password, role, warehouse, database, and schema.

```yaml
snowflake:
  type: snowflake
  account:
  user:
  password:
  role:
  warehouse:
  database:
  schema:
```

--------------------------------

### Install Intugle with Snowflake Dependencies

Source: https://intugle.github.io/data-tools/docs/connectors/snowflake

Installs the Intugle package with optional dependencies for Snowflake integration, including the 'snowflake-snowpark-python' library.

```bash
pip install "intugle[snowflake]"
```

--------------------------------

### Example Product Spec Combining Patient and Condition Data

Source: https://intugle.github.io/data-tools/docs/core-concepts/data-product/joins

This Python example defines a `product_spec` for a data product named 'patient_conditions'. It selects patient names and condition descriptions, implicitly instructing the DataProduct builder to join the 'patients' and 'conditions' tables. A limit of 5 records is also applied.
```python
product_spec = {
    "name": "patient_conditions",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {"id": "conditions.description", "name": "condition"},
    ],
    "filter": {"limit": 5}
}
```

--------------------------------

### Semantic Search - Build and Search Model (Python)

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Provides a Python example for building a semantic model from specified datasets and then performing a semantic search query. It requires the SemanticModel class and assumes datasets are available. This process includes auto-indexing.

```python
from intugle import SemanticModel

# Build semantic model first
datasets = {
    "patients": {"path": "data/patients.csv", "type": "csv"},
    "allergies": {"path": "data/allergies.csv", "type": "csv"},
}

sm = SemanticModel(datasets, domain="Healthcare")
sm.build()

# Perform semantic search (auto-indexes on first run)
results = sm.search("reason for hospital visit")
print(results)
```

--------------------------------

### Generate Data Product with Natural Language via MCP

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Illustrates how to use natural language commands with an MCP-compatible client (like Cursor, Claude Code, or Gemini CLI) to automatically generate a data product specification. The example shows generating a data product for the top 5 patients with the most claims.
In the MCP-compatible client, run `/create-dp show me the top 5 patients with the most claims`; the AI then generates this specification automatically:

```json
{
  "name": "top_5_patients_by_claims",
  "fields": [
    {"id": "patients.first", "name": "first_name"},
    {"id": "patients.last", "name": "last_name"},
    {
      "id": "claims.id",
      "name": "number_of_claims",
      "category": "measure",
      "measure_func": "count"
    }
  ],
  "filter": {
    "sort_by": [
      {
        "id": "claims.id",
        "alias": "number_of_claims",
        "direction": "desc"
      }
    ],
    "limit": 5
  }
}
```

--------------------------------

### Configure Databricks Connection in profiles.yml

Source: https://intugle.github.io/data-tools/docs/connectors/databricks

Provides example configurations for the profiles.yml file to connect to Databricks. This file can be used for external connections requiring host, http_path, token, schema, and catalog, or for notebook connections specifying schema and catalog.

```yaml
databricks:
  host:
  http_path:
  token:
  schema:
  catalog: # Optional, for Unity Catalog
```

```yaml
databricks:
  schema:
  catalog: # Optional, for Unity Catalog
```

--------------------------------

### Configure Custom Connector in profiles.yml

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Example YAML configuration for a custom connector named 'myconnector'. It specifies connection details like host, port, user, password, and schema required for the adapter to establish a connection.

```yaml
# profiles.yml for custom connector
myconnector:
  host: localhost
  port: 5432
  user: myuser
  password: mypassword
  schema: public
```

--------------------------------

### Sort Data by Single Column Example (JSON Schema)

Source: https://intugle.github.io/data-tools/docs/core-concepts/data-product/sorting

An example of a JSON schema for sorting patient data by healthcare expenses in descending order. It includes field definitions and filter parameters.
```json
{
  "name": "patients_by_expenses",
  "fields": [
    {"id": "patients.first", "name": "first_name"},
    {"id": "patients.last", "name": "last_name"},
    {"id": "patients.healthcare_expenses", "name": "expenses"}
  ],
  "filter": {
    "sort_by": [
      {
        "id": "patients.healthcare_expenses",
        "direction": "desc"
      }
    ],
    "limit": 5
  }
}
```

--------------------------------

### Sort Data by Aggregated Field Example (JSON Schema)

Source: https://intugle.github.io/data-tools/docs/core-concepts/data-product/sorting

An example of a JSON schema for sorting cities by their total healthcare expenses in descending order. It defines a sum aggregation and uses the alias for sorting.

```json
{
  "name": "total_healthcare_expenses_by_city",
  "fields": [
    {"id": "patients.city", "name": "city"},
    {
      "id": "patients.healthcare_expenses",
      "name": "total_expenses",
      "category": "measure",
      "measure_func": "sum"
    }
  ],
  "filter": {
    "sort_by": [
      {
        "alias": "total_expenses",
        "direction": "desc"
      }
    ],
    "limit": 5
  }
}
```

--------------------------------

### Implement Custom Connector Adapter Class Skeleton

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Provides a basic skeleton for a custom adapter class in Python, inheriting from `intugle.adapters.adapter.Adapter`. This class will contain the core logic for interacting with a specific data source. It includes necessary imports and placeholders for the database driver and abstract method implementations. Users should refer to existing adapters like `DatabricksAdapter` or `SnowflakeAdapter` for comprehensive examples.
```python
from typing import Any, Optional

import pandas as pd

from intugle.adapters.adapter import Adapter
from intugle.adapters.factory import AdapterFactory
from intugle.adapters.models import ColumnProfile, ProfilingOutput
from .models import MyConnectorConfig, MyConnectorConnectionConfig
from intugle.core import settings

# Import your database driver


class MyConnectorAdapter(Adapter):
    # Adapter implementation details go here
    pass
```

--------------------------------

### Access LinkPredictor via SemanticModel in Python

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

This example demonstrates accessing the LinkPredictor functionality through a SemanticModel. It shows how to build a SemanticModel with specified datasets and domain, then retrieve the link predictor instance and discovered links. Shortcuts for accessing links as a list or DataFrame are also provided. The code assumes 'datasets' is a pre-defined list of DataSet objects and 'SemanticModel' is imported.

```python
sm = SemanticModel(datasets, domain="Healthcare")
sm.build()

predictor = sm.link_predictor
discovered_links = sm.links
links_dataframe = sm.links_df

sm.link_predictor.show_graph()
```

--------------------------------

### Get All Column Profiles DataFrame (Python)

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

This code retrieves a single Pandas DataFrame containing all column profiling metrics from all processed datasets using the 'profiling_df' property of a SemanticModel object. It then prints the first few rows of this DataFrame.
```python
# Get a single DataFrame of all column profiles
all_profiles = sm.profiling_df
print(all_profiles.head())
```

--------------------------------

### Get Consolidated Business Glossary DataFrame (Python)

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

This code retrieves a unified business glossary as a Pandas DataFrame from all datasets using the 'glossary_df' property of a SemanticModel object. The DataFrame includes table name, column name, description, and tags for every column. The first few rows are then printed.

```python
# Get a single, unified business glossary
full_glossary = sm.glossary_df
print(full_glossary.head())
```

--------------------------------

### DataProduct Sorting with Limit in Python

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

This Python snippet shows how to configure sorting and limiting for a DataProduct query. The 'filter' section within the 'product_spec' dictionary is used to define an array of 'sort_by' objects, each specifying the field 'id' and the 'direction' ('asc' or 'desc'). A 'limit' can also be applied to restrict the number of results. This example sorts patients by healthcare expenses in descending order and limits the output to the top 5.

```python
product_spec = {
    "name": "patients_by_expenses",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {"id": "patients.healthcare_expenses", "name": "expenses"},
    ],
    "filter": {
        "sort_by": [
            {
                "id": "patients.healthcare_expenses",
                "direction": "desc"
            }
        ],
        "limit": 5
    }
}
```

--------------------------------

### Python: Initialize and Build Semantic Model

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-search

Demonstrates how to define datasets, initialize a SemanticModel with a specified domain, build the model, and perform a semantic search. Assumes dataset configurations are available.
```python
from intugle import SemanticModel

datasets = {
    "allergies": {"path": "path/to/allergies.csv", "type": "csv"},
    "patients": {"path": "path/to/patients.csv", "type": "csv"},
    # ... add other datasets
}

# Initialize and build the semantic model
sm = SemanticModel(datasets, domain="Healthcare")
sm.build()

# Perform a semantic search
search_results = sm.search("reason for hospital visit")

# View the search results
print(search_results)
```

--------------------------------

### Build Data Product with Specification - Python

Source: https://intugle.github.io/data-tools/docs/core-concepts/data-product

Demonstrates how to define a product specification as a Python dictionary and use the DataProduct class to build a unified data product. It shows how to select fields, specify measures with aggregation functions, and apply sorting and limits. The generated SQL query and the resulting data as a Pandas DataFrame can be accessed.

```python
from intugle import DataProduct

# 1. Define the product specification for your data product
product_spec = {
    "name": "top_patients_by_claim_count",
    "fields": [
        {
            "id": "patients.first",
            "name": "first_name",
        },
        {
            "id": "patients.last",
            "name": "last_name",
        },
        {
            "id": "claims.id",
            "name": "number_of_claims",
            "category": "measure",
            "measure_func": "count"
        }
    ],
    "filter": {
        "sort_by": [
            {
                "id": "claims.id",
                "alias": "number_of_claims",
                "direction": "desc"
            }
        ],
        "limit": 10
    }
}

# 2. Initialize the DataProduct
# It automatically loads the manifest from the current directory
dp = DataProduct()

# 3. Build the data product
data_product = dp.build(product_spec)

# 4. Access the results
# View the data as a Pandas DataFrame
print(data_product.to_df())

# You can also inspect the generated SQL query
print(data_product.sql_query)
```

--------------------------------

### Build and Search with Intugle Search Client

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Demonstrates how to initialize the Intugle search client, build a search index, and perform natural language searches. Results contain metadata about the matched columns.

```python
from intugle.semantic_search import SemanticSearch

# Create the client (loads configuration from the project's .yml files)
search_client = SemanticSearch()

# Build the search index
search_client.initialize()

# Search with natural language
results = search_client.search("reason for hospital visit")

# Results include: column_id, score, relevancy, column_name,
# column_glossary, table_name, uniqueness, completeness
print(results)
```

--------------------------------

### Initialize and Use SemanticModel

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-search

This Python code shows the basic usage of the `SemanticModel` for building and performing searches. The `sm.build()` method prepares metadata, and the first call to `sm.search()` automatically vectorizes and indexes the data in Qdrant.

```python
from intugle import SemanticModel

# Assuming SemanticModel is initialized and configured
# sm = SemanticModel(...)

# Build the semantic model metadata
sm.build()

# Perform a search (first time will also index)
# results = sm.search("your natural language query")
```

--------------------------------

### Build Semantic Model with CSV Datasets

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Demonstrates creating a `SemanticModel` using local CSV files, running the full build pipeline (profile, predict links, generate glossary), and accessing enriched dataset metadata.
```python
from intugle import SemanticModel

# Initialize with datasets
datasets = {
    "patients": {"path": "data/patients.csv", "type": "csv"},
    "claims": {"path": "data/claims.csv", "type": "csv"},
    "allergies": {"path": "data/allergies.csv", "type": "csv"},
}

# Create semantic model with domain context
sm = SemanticModel(datasets, domain="Healthcare")

# Run full pipeline: profile, predict links, generate glossary
sm.build()

# Or run stages individually for granular control
sm.profile()
sm.predict_links()
sm.generate_glossary()

# Force rebuild, ignoring cache
sm.build(force_recreate=True)

# Access enriched datasets
patients_dataset = sm.datasets['patients']
print(f"Primary Key: {patients_dataset.source_table_model.key}")
print(f"Description: {patients_dataset.source_table_model.description}")

# Access utility DataFrames
profiling_data = sm.profiling_df  # All column profiles
relationships = sm.links_df       # Predicted relationships
glossary = sm.glossary_df         # Business glossary
```

--------------------------------

### Build and Deploy Semantic Model with Databricks

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Demonstrates initializing a `SemanticModel` with Databricks datasets, building the model, and deploying it to Databricks with various options. Includes configuration for external Databricks connections.
```python
from intugle import SemanticModel

# Databricks datasets
datasets = {
    "CUSTOMERS": {
        "identifier": "CUSTOMERS",
        "type": "databricks"
    },
    "ORDERS": {
        "identifier": "ORDERS",
        "type": "databricks"
    }
}

sm = SemanticModel(datasets, domain="E-commerce")
sm.build()

# Deploy to Databricks: sync glossary, tags, and set constraints
sm.deploy(target="databricks")

# Control deployment options
sm.deploy(
    target="databricks",
    sync_glossary=True,
    sync_tags=True,
    set_primary_keys=True,
    set_foreign_keys=True
)
```

```yaml
# profiles.yml for external Databricks connection
databricks:
  host: your_databricks_host
  http_path: your_sql_warehouse_http_path
  token: your_personal_access_token
  schema: your_schema
  catalog: your_catalog
```

--------------------------------

### Get Discovered Links DataFrame (Python)

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

This snippet demonstrates how to access a Pandas DataFrame of all discovered relationships (links) using the 'links_df' property of a SemanticModel object. The resulting DataFrame is then printed.

```python
# Get a DataFrame of all predicted links
all_links = sm.links_df
print(all_links)
```

--------------------------------

### Build Unified Data Queries with DataProduct in Python

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

This snippet illustrates how to use the DataProduct class to define and build a unified data query. It involves specifying the product's name, fields (with aliasing and measure functions), and filtering criteria (sorting and limiting). The code shows initializing DataProduct, building the product specification, and accessing the results as a DataFrame or inspecting the generated SQL query. Dependencies include 'intugle.DataProduct'.
```python
from intugle import DataProduct

product_spec = {
    "name": "top_patients_by_claim_count",
    "fields": [
        {
            "id": "patients.first",
            "name": "first_name",
        },
        {
            "id": "patients.last",
            "name": "last_name",
        },
        {
            "id": "claims.id",
            "name": "number_of_claims",
            "category": "measure",
            "measure_func": "count"
        }
    ],
    "filter": {
        "sort_by": [
            {
                "id": "claims.id",
                "alias": "number_of_claims",
                "direction": "desc"
            }
        ],
        "limit": 10
    }
}

dp = DataProduct()
data_product = dp.build(product_spec)

print(data_product.to_df())
print(data_product.sql_query)
```

--------------------------------

### Initialize SemanticModel from Dictionary - Python

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

Initializes the SemanticModel using a dictionary where keys are dataset names and values contain configuration like path and type. This is the recommended and most common method. It requires the 'intugle' library.

```python
from intugle import SemanticModel

datasets = {
    "allergies": {"path": "path/to/allergies.csv", "type": "csv"},
    "patients": {"path": "path/to/patients.csv", "type": "csv"},
    "claims": {"path": "path/to/claims.csv", "type": "csv"},
}

sm = SemanticModel(datasets, domain="Healthcare")
```

--------------------------------

### Initialize SemanticModel from DataSet Objects - Python

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

Initializes the SemanticModel with a list of pre-configured DataSet objects for more advanced scenarios. This requires the 'intugle' library and pre-instantiated DataSet objects.
```python
from intugle import SemanticModel, DataSet

# Create DataSet objects first
dataset_allergies = DataSet(data={"path": "path/to/allergies.csv", "type": "csv"}, name="allergies")
dataset_patients = DataSet(data={"path": "path/to/patients.csv", "type": "csv"}, name="patients")

# Initialize the SemanticModel with the list of objects
sm = SemanticModel([dataset_allergies, dataset_patients], domain="Healthcare")
```

--------------------------------

### Registering MyConnector Adapter with Factory

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Provides Python functions to register a custom adapter (`MyConnectorAdapter`) with the `AdapterFactory`. `can_handle_myconnector` checks if a given data configuration is compatible, and `register` adds the adapter to the factory.

```python
# In src/intugle/adapters/types/myconnector/myconnector.py

def can_handle_myconnector(df: Any) -> bool:
    try:
        MyConnectorConfig.model_validate(df)
        return True
    except Exception:
        return False


def register(factory: AdapterFactory):
    # Check if the required driver is installed
    # if MYCONNECTOR_DRIVER_AVAILABLE:
    factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter)
```

--------------------------------

### Adding Adapter to Default Plugins List

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Shows how to include the newly created custom adapter module in the `DEFAULT_PLUGINS` list within the `AdapterFactory` to make it discoverable by Intugle.

```python
# In src/intugle/adapters/factory.py

DEFAULT_PLUGINS = [
    "intugle.adapters.types.pandas.pandas",
    # ... other adapters
    "intugle.adapters.types.myconnector.myconnector",
]
```

--------------------------------

### Python: Standalone Semantic Search Initialization and Querying

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-search

Shows how to use the `SemanticSearch` class directly for initializing the search index and performing queries. This is useful when bypassing the full `SemanticModel` pipeline. Assumes .yml files are in the default location or a custom path is provided.

```python
from intugle.semantic_search import SemanticSearch

# This assumes your project's .yml files are in the default location.
# You can also specify the path to your models directory:
# search_client = SemanticSearch(project_base="/path/to/your/models")
search_client = SemanticSearch()

# 1. Initialize the search index.
# This reads the .yml files, vectorizes the metadata, and populates Qdrant.
# You only need to run this once, or whenever your source metadata changes.
print("Initializing semantic search index...")
search_client.initialize()
print("Initialization complete.")

# 2. Perform a search.
query = "reason for hospital visit"
search_results = search_client.search(query)

# View the results
print(search_results)
```

--------------------------------

### Implement Custom Connector Adapter

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Provides a skeleton for implementing a custom data adapter in Intugle. It includes methods for initialization, profiling, query execution, and data loading, along with registration logic.
```python
# Step 2: Implement adapter in myconnector.py
from typing import Any, Optional

import pandas as pd

from intugle.adapters.adapter import Adapter
from intugle.adapters.factory import AdapterFactory
# DataSetData is the type alias defined in src/intugle/adapters/models.py
from intugle.adapters.models import ColumnProfile, DataSetData, ProfilingOutput
from .models import MyConnectorConfig, MyConnectorConnectionConfig
from intugle.core import settings


class MyConnectorAdapter(Adapter):
    def __init__(self):
        connection_params = settings.PROFILES.get("myconnector", {})
        config = MyConnectorConnectionConfig.model_validate(connection_params)
        # self.connection = myconnector_driver.connect(**config.model_dump())
        pass

    def profile(self, data: Any, table_name: str) -> ProfilingOutput:
        # Return table metadata: row count, column names, dtypes
        raise NotImplementedError()

    def column_profile(self, data: Any, table_name: str, column_name: str, total_count: int) -> Optional[ColumnProfile]:
        # Return column statistics
        raise NotImplementedError()

    def execute(self, query: str):
        # Execute query and return results
        raise NotImplementedError()

    def to_df_from_query(self, query: str) -> pd.DataFrame:
        # Execute and return DataFrame
        raise NotImplementedError()

    def create_table_from_query(self, table_name: str, query: str) -> str:
        # Materialize query as table/view
        raise NotImplementedError()

    def create_new_config_from_etl(self, etl_name: str) -> "DataSetData":
        return MyConnectorConfig(identifier=etl_name)

    def intersect_count(self, table1: "DataSet", column1_name: str, table2: "DataSet", column2_name: str) -> int:
        # Calculate intersecting values count
        raise NotImplementedError()

    def load(self, data: Any, table_name: str):
        pass

    def to_df(self, data: DataSetData, table_name: str):
        config = MyConnectorConfig.model_validate(data)
        return self.to_df_from_query(f"SELECT * FROM {config.identifier}")

    def get_details(self, data: DataSetData):
        config = MyConnectorConfig.model_validate(data)
        return config.model_dump()


# Step 3: Register adapter
def can_handle_myconnector(df: Any) -> bool:
    try:
        MyConnectorConfig.model_validate(df)
        return True
    except Exception:
        return False


def register(factory: AdapterFactory):
    factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter)
```

--------------------------------

### Semantic Search - Environment Configuration (Bash)

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Shell commands to set environment variables for configuring semantic search, including Qdrant URL, API keys for OpenAI or Azure OpenAI, and the embedding model name. These variables are crucial for connecting to the necessary services.

```bash
# Configure environment
export QDRANT_URL="http://localhost:6333"
export QDRANT_API_KEY="your-qdrant-api-key"
export EMBEDDING_MODEL_NAME="openai:ada"
export OPENAI_API_KEY="your-openai-api-key"

# For Azure OpenAI
export EMBEDDING_MODEL_NAME="azure_openai:ada"
export AZURE_OPENAI_API_KEY="your-azure-openai-api-key"
export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
export OPENAI_API_VERSION="your-openai-api-version"
```

--------------------------------

### Explore Primary Key Description and Discovered Links (Python)

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

This snippet shows how to print the description of the primary key for a customers dataset and display discovered links using the LinkPredictor. It assumes 'customers_dataset' and 'link_predictor' objects are already initialized.

```python
print(f"Primary Key for customers: {customers_dataset.source_table_model.description}")
print("Discovered Links:")
print(link_predictor.get_links_df())
```

--------------------------------

### Deploy Semantic Model to Databricks

Source: https://intugle.github.io/data-tools/docs/connectors/databricks

Shows how to deploy a built Semantic Model to Databricks using the 'deploy()' method. This synchronizes metadata (comments, tags) and sets constraints (primary, foreign keys).
Optional parameters allow granular control over the deployment process.

```python
# Deploy the model to Databricks
sm.deploy(target="databricks")

# You can also control which parts of the deployment to run
sm.deploy(
    target="databricks",
    sync_glossary=True,
    sync_tags=True,
    set_primary_keys=True,
    set_foreign_keys=True
)
```

--------------------------------

### Configure Environment Variables for OpenAI Embeddings

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-search

These environment variables configure the connection to a Qdrant instance and an OpenAI embedding model. Set `QDRANT_URL`, `QDRANT_API_KEY` (if authorization is enabled), `EMBEDDING_MODEL_NAME`, and `OPENAI_API_KEY`.

```bash
# The URL of your running Qdrant instance
export QDRANT_URL="http://localhost:6333"

# Your Qdrant API key (only if you have enabled authorization)
export QDRANT_API_KEY="your-qdrant-api-key"

# The embedding model to use (the default is openai:ada)
export EMBEDDING_MODEL_NAME="openai:ada"

# Your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"
```

--------------------------------

### Run Full Semantic Model Pipeline - Python

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

Executes the entire SemanticModel pipeline, including profiling, link prediction, and business glossary generation, in the correct sequence. The build() method can also force a re-run of all stages, ignoring cached results.

```python
# Run the full pipeline from start to finish
sm.build()

# You can also force it to re-run everything, ignoring any cached results
sm.build(force_recreate=True)
```

--------------------------------

### Generate Business Glossary - Python

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

Executes the business glossary generation stage, using an LLM to create business-friendly context for the data.
This stage assumes the profile() stage has already been run. The generated information is saved back into each dataset's .yml file.

```python
# Run the glossary generation stage
# This assumes profile() has already been run
sm.generate_glossary()
```

--------------------------------

### MyConnectorAdapter Class Definition

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Skeleton of a custom data adapter, `MyConnectorAdapter`, which subclasses the `Adapter` base class and outlines the methods for profiling, querying, and data manipulation that a concrete adapter must implement.

```python
# Imports needed by the skeleton (in src/intugle/adapters/types/myconnector/myconnector.py)
from typing import Any, Optional

import pandas as pd

from intugle.adapters.adapter import Adapter
from intugle.adapters.models import ColumnProfile, DataSetData, ProfilingOutput
from .models import MyConnectorConfig, MyConnectorConnectionConfig
from intugle.core import settings


class MyConnectorAdapter(Adapter):
    def __init__(self):
        # Initialize your connection here
        connection_params = settings.PROFILES.get("myconnector", {})
        config = MyConnectorConnectionConfig.model_validate(connection_params)
        # self.connection = myconnector_driver.connect(**config.model_dump())
        pass

    # --- Must be implemented ---

    def profile(self, data: Any, table_name: str) -> ProfilingOutput:
        # Return table-level metadata: row count, column names, and dtypes
        raise NotImplementedError()

    def column_profile(self, data: Any, table_name: str, column_name: str, total_count: int) -> Optional[ColumnProfile]:
        # Return column-level statistics: null count, distinct count, samples, etc.
        raise NotImplementedError()

    def execute(self, query: str):
        # Execute a query and return the raw results
        raise NotImplementedError()

    def to_df_from_query(self, query: str) -> pd.DataFrame:
        # Execute a query and return the result as a pandas DataFrame
        raise NotImplementedError()

    def create_table_from_query(self, table_name: str, query: str) -> str:
        # Materialize a query as a new table or view
        raise NotImplementedError()

    def create_new_config_from_etl(self, etl_name: str) -> "DataSetData":
        # Return a new MyConnectorConfig for a materialized table
        return MyConnectorConfig(identifier=etl_name)

    def intersect_count(self, table1: "DataSet", column1_name: str, table2: "DataSet", column2_name: str) -> int:
        # Calculate the count of intersecting values between two columns
        raise NotImplementedError()

    # --- Other required methods ---

    def load(self, data: Any, table_name: str):
        # For database adapters, this is often a no-op
        pass

    def to_df(self, data: DataSetData, table_name: str):
        # Read an entire table into a pandas DataFrame
        config = MyConnectorConfig.model_validate(data)
        return self.to_df_from_query(f"SELECT * FROM {config.identifier}")

    def get_details(self, data: DataSetData):
        config = MyConnectorConfig.model_validate(data)
        return config.model_dump()
```

--------------------------------

### Run Profiling Stage - Python

Source: https://intugle.github.io/data-tools/docs/core-concepts/semantic-intelligence/semantic-model

Executes the profiling stage of the SemanticModel pipeline, which performs a deep analysis of each dataset, including structure, content, datatype identification, and key identification. Progress is saved to a .yml file for each dataset.
```python
# Run only the profiling and key identification stage
sm.profile()
```

--------------------------------

### Update DataSetData Type Hint with New Connector Model

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Integrates the newly defined `MyConnectorConfig` into the `DataSetData` type hint within `src/intugle/adapters/models.py`. This ensures that the `intugle` factory can recognize and handle configurations for the new connector type, allowing it to be used alongside other supported data sources.

```python
# src/intugle/adapters/models.py

# ... other imports
from intugle.adapters.types.myconnector.models import MyConnectorConfig

DataSetData = pd.DataFrame | DuckdbConfig | ... | MyConnectorConfig
```

--------------------------------

### Define Pydantic Models for Connector Configuration

Source: https://intugle.github.io/data-tools/docs/connectors/implementing-a-connector

Defines Pydantic models for connection parameters and data identification for a custom connector. The `MyConnectorConnectionConfig` model specifies connection details like host, port, user, and password, while `MyConnectorConfig` defines how to identify a specific table or asset. These models are essential for configuring and interacting with the data source.

```python
from typing import Optional

from intugle.common.schema import SchemaBase


class MyConnectorConnectionConfig(SchemaBase):
    host: str
    port: int
    user: str
    password: str
    schema: Optional[str] = None


class MyConnectorConfig(SchemaBase):
    identifier: str
    type: str = "myconnector"
```

--------------------------------

### DataProduct - Wildcard Filtering (LIKE) (Python)

Source: https://context7.com/context7/intugle_github_io_data-tools/llms.txt

Shows how to perform wildcard filtering on string fields, allowing for 'contains', 'starts_with', 'ends_with', and 'exactly_matches' matching. This is useful for flexible text-based searches. Requires the DataProduct class.
```python
from intugle import DataProduct

# Wildcard filtering (LIKE)
product_spec = {
    "name": "fracture_conditions",
    "fields": [
        {"id": "conditions.description", "name": "condition_description"},
    ],
    "filter": {
        "wildcards": [
            {
                "id": "conditions.description",
                "value": "fracture",
                "option": "contains",  # starts_with, ends_with, exactly_matches
            }
        ],
    },
}

dp = DataProduct()
result = dp.build(product_spec)
```
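
To make the four wildcard options concrete, the sketch below shows how each one could correspond to a SQL `LIKE` pattern. It is an illustration only: `like_pattern` is a hypothetical helper that is not part of Intugle, and the SQL that `DataProduct` actually generates may differ (for example in escaping or case handling).

```python
# Hypothetical helper, not part of Intugle: illustrates how each wildcard
# option could translate into a SQL LIKE pattern.
def like_pattern(value: str, option: str) -> str:
    patterns = {
        "contains": f"%{value}%",      # value anywhere in the string
        "starts_with": f"{value}%",    # value at the beginning
        "ends_with": f"%{value}",      # value at the end
        "exactly_matches": value,      # no wildcard characters added
    }
    if option not in patterns:
        raise ValueError(f"Unsupported wildcard option: {option!r}")
    return patterns[option]


for option in ("contains", "starts_with", "ends_with", "exactly_matches"):
    print(option, "->", like_pattern("fracture", option))
```

Under this assumed mapping, the `contains` filter above would behave like `conditions.description LIKE '%fracture%'`.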