### Example: Minimal Prompt Tuning Configuration Source: https://github.com/microsoft/graphrag/blob/main/docs/prompt_tuning/auto_prompt_tuning.md A simplified example of running the prompt tuning script with only essential configurations. This is the suggested approach when starting with default values. ```bash python -m graphrag prompt-tune --root /path/to/project --no-discover-entity-types ``` -------------------------------- ### Install Dependencies with UV Source: https://github.com/microsoft/graphrag/blob/main/unified-search-app/README.md Install all project dependencies using the `uv sync` command. ```bash uv sync ``` -------------------------------- ### Example: Full Prompt Tuning Configuration Source: https://github.com/microsoft/graphrag/blob/main/docs/prompt_tuning/auto_prompt_tuning.md An example demonstrating how to run the prompt tuning script with a comprehensive set of command-line arguments. This includes specifying paths, domain, selection method, limits, language, token counts, chunk size, minimum examples, and disabling entity type discovery. ```bash python -m graphrag prompt-tune --root /path/to/project --domain "environmental news" \ --selection-method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \ --no-discover-entity-types --output /path/to/output ``` -------------------------------- ### Start Azurite Source: https://github.com/microsoft/graphrag/blob/main/docs/developing.md Launch the Azurite emulator for testing Azure resources. ```sh ./scripts/start-azurite.sh ``` -------------------------------- ### Install Dependencies Source: https://github.com/microsoft/graphrag/blob/main/docs/developing.md Synchronize project dependencies using uv. 
```sh # install python dependencies uv sync --all-packages ``` -------------------------------- ### Install dependencies with uv Source: https://github.com/microsoft/graphrag/blob/main/DEVELOPING.md Synchronize the project environment with the defined dependencies. ```shell # install python dependencies uv sync ``` -------------------------------- ### Install GraphRAG Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/custom_vector_store.ipynb Install the GraphRAG library using pip. This is the first step to using GraphRAG's features. ```bash pip install graphrag ``` -------------------------------- ### Troubleshoot LLVM Configuration Source: https://github.com/microsoft/graphrag/blob/main/docs/developing.md Commands to install LLVM and configure the environment variable for uv. ```sh sudo apt-get install llvm-9 llvm-9-dev ``` ```sh export LLVM_CONFIG=/usr/bin/llvm-config-9 ``` -------------------------------- ### Configure Input Metadata in YAML Source: https://github.com/microsoft/graphrag/blob/main/docs/index/inputs.md Example of specifying metadata columns in the settings.yaml file. ```yaml input: metadata: [title,tag] ``` -------------------------------- ### Initialize and Use File Storage in Python Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-storage/example_notebooks/basic_storage_example.ipynb Configures a file-based storage system in a specified directory and performs basic set and get operations. 
```python
import asyncio

from graphrag_storage import StorageConfig, StorageType, create_storage


async def run():
    """Demonstrate basic storage operations."""
    storage = create_storage(StorageConfig(type=StorageType.File, base_dir="output"))

    print("Saving and retrieving a value from storage...")
    print("Setting key 'my_key' to 'value'")
    await storage.set("my_key", "value")
    print("Getting key 'my_key':")
    print(await storage.get("my_key"))


if __name__ == "__main__":
    # A plain script needs an event loop; in a notebook cell use `await run()` instead.
    asyncio.run(run())
```

--------------------------------

### Custom Metrics Processor Example

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/04_metrics.ipynb

Implement a custom metrics processor by inheriting from `DefaultMetricsProcessor` to add custom metric tracking. This example tracks whether the `temperature` argument was used in a request, in addition to the default metrics.

```python
import json
import os
from collections.abc import AsyncIterator, Iterator
from typing import Any

from dotenv import load_dotenv
from graphrag_llm.completion import LLMCompletion, create_completion
from graphrag_llm.config import MetricsConfig, MetricsWriterType, ModelConfig
from graphrag_llm.metrics import metrics_aggregator, register_metrics_processor
from graphrag_llm.metrics.default_metrics_processor import DefaultMetricsProcessor
from graphrag_llm.types import (
    LLMCompletionChunk,
    LLMCompletionResponse,
    LLMEmbeddingResponse,
    Metrics,
)

load_dotenv()


class MyCustomMetricsProcessor(DefaultMetricsProcessor):
    """Custom metrics processor.

    Inheriting from DefaultMetricsProcessor to add to the default metrics being tracked
    instead of implementing the interface from scratch.

    Metrics = dict[str, float]. The metrics passed to process_metrics method represent
    the metrics for a single request. Typically, you will count/flag metrics of interest
    per request and then aggregate them in the metrics_aggregator.
    """

    def __init__(self, some_custom_option: str, **kwargs: Any) -> None:
        """Initialize the custom metrics processor."""
        super().__init__(**kwargs)
        self._some_custom_option = some_custom_option  # Not actually used

    def process_metrics(
        self,
        *,
        model_config: ModelConfig,
        metrics: Metrics,
        input_args: dict[str, Any],
        response: LLMCompletionResponse
        | Iterator[LLMCompletionChunk]
        | AsyncIterator[LLMCompletionChunk]
        | LLMEmbeddingResponse,
    ) -> None:
        """On top of the default metrics, track if temperature argument was used.

        Expected to mutate the metrics dict in place with metrics you want to track.
        process_metrics is only called for successful requests and will be passed in
        the response from either a completion or embedding call.

        Args
        ----
            model_config: ModelConfig
                The model config used for the request.
            metrics: Metrics
                The metrics dict to be mutated in place.
            input_args: dict[str, Any]
                The input arguments passed to completion or embedding.
            response: LLMCompletionResponse | Iterator[LLMCompletionChunk] | AsyncIterator[LLMCompletionChunk] | LLMEmbeddingResponse
                Either a completion or embedding response from the LLM.
        """
        # Track default metrics first
        super().process_metrics(
            model_config=model_config,
            metrics=metrics,
            input_args=input_args,
            response=response,
        )

        # Flag (1 or 0) whether this request passed a temperature argument
        metrics["responses_with_temperature"] = 1 if "temperature" in input_args else 0
```

--------------------------------

### Prepare local repository for release

Source: https://github.com/microsoft/graphrag/blob/main/RELEASE.md

Commands to ensure the local main branch is up to date before starting the release process.

```sh
git checkout main
git pull
```

--------------------------------

### Custom Storage Execution Output

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-storage/example_notebooks/custom_storage_example.ipynb

Expected console output after running the custom storage implementation example.

```text
Saving and retrieving a value from storage...
Setting key 'my_key' to 'value' Getting key 'my_key': value ``` -------------------------------- ### Asynchronous Completion Example Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/01_basic.ipynb Shows how to perform a text completion asynchronously. This is useful for non-blocking operations in applications. ```python response: LLMCompletionResponse = await llm_completion.completion_async( messages="What is the capital of France?", ) # type: ignore print(response.content) ``` -------------------------------- ### Synchronous Completion Example Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/01_basic.ipynb Demonstrates a basic synchronous call to the LLM for text completion. Handles both streaming and non-streaming responses. ```python # Copyright (c) 2024 Microsoft Corporation. # Licensed under the MIT License import os from collections.abc import AsyncIterator, Iterator from dotenv import load_dotenv from graphrag_llm.completion import LLMCompletion, create_completion from graphrag_llm.config import AuthMethod, ModelConfig from graphrag_llm.types import LLMCompletionChunk, LLMCompletionResponse load_dotenv() api_key = os.getenv("GRAPHRAG_API_KEY") model_config = ModelConfig( model_provider="azure", model=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), azure_deployment_name=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), api_base=os.getenv("GRAPHRAG_API_BASE"), api_version=os.getenv("GRAPHRAG_API_VERSION", "2025-04-01-preview"), api_key=api_key, auth_method=AuthMethod.AzureManagedIdentity if not api_key else AuthMethod.ApiKey, ) llm_completion: LLMCompletion = create_completion(model_config) response: LLMCompletionResponse | Iterator[LLMCompletionChunk] = ( llm_completion.completion( messages="What is the capital of France?", ) ) if isinstance(response, Iterator): # Streaming response for chunk in response: print(chunk.choices[0].delta.content or "", end="", flush=True) else: # Non-streaming response 
print(response.choices[0].message.content) # Or alternatively, access via the content property # This is equivalent to the above line, getting the content of the first choice print(response.content) print("Full Response:") print(response.model_dump_json(indent=2)) # type: ignore ``` -------------------------------- ### Environment Variable and YAML Configuration Example Source: https://github.com/microsoft/graphrag/blob/main/docs/config/yaml.md Demonstrates how to use environment variables for sensitive information like API keys within a YAML configuration file. The `${ENV_VAR}` syntax allows for dynamic replacement. ```bash # .env GRAPHRAG_API_KEY=some_api_key ``` ```yaml # settings.yml default_chat_model: api_key: ${GRAPHRAG_API_KEY} ``` -------------------------------- ### LLM Execution Logs Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/04_metrics.ipynb Example output showing LiteLLM and Azure identity logging during an LLM completion call. ```text INFO:azure.identity._credentials.environment:No environment configuration found. INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS 22:45:27 - LiteLLM:INFO: utils.py:3373 - LiteLLM completion() model= gpt-4o; provider = azure INFO:LiteLLM: LiteLLM completion() model= gpt-4o; provider = azure 22:45:28 - LiteLLM:INFO: utils.py:1286 - Wrapper: Completed Call, calling success_handler INFO:LiteLLM:Wrapper: Completed Call, calling success_handler ``` -------------------------------- ### Example Output of Function Tool Execution Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/10_tool_calling.ipynb This output shows the print statements from the executed functions and the final response from the LLM after processing the tool results. ```text Adding numbers: 3 8 Multiplying numbers: 9 5 Reversing text: GraphRAG 3 + 8 is 11, 9 * 5 is 45, and the reversed string 'GraphRAG' is 'GARhparG'. 
``` -------------------------------- ### Cosmos Document Schema Example Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-storage/COSMOS_TABLE_PROVIDER_DESIGN.md Illustrates the structure of a document stored in Cosmos DB, including the partition key and table name. ```json { "id": "entities:42", "namespace": "output", "table_name": "entities", "name": "JOHN DOE", "type": "PERSON", "description": "A character in ...", "human_readable_id": 42 } ``` -------------------------------- ### Configure and Use LLM Completion with Retries Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/06_retries.ipynb Enable retries for LLM completions by configuring `RetryConfig` with `ExponentialBackoff`. This example sets up the model configuration with retry parameters and then makes a completion request, printing the metrics which may include retry information. ```python # Copyright (c) 2024 Microsoft Corporation. # Licensed under the MIT License import json import logging import os from dotenv import load_dotenv from graphrag_llm.completion import LLMCompletion, create_completion from graphrag_llm.config import AuthMethod, ModelConfig, RetryConfig, RetryType load_dotenv() logging.basicConfig(level=logging.CRITICAL) api_key = os.getenv("GRAPHRAG_API_KEY") model_config = ModelConfig( model_provider="azure", model=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), azure_deployment_name=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), api_base=os.getenv("GRAPHRAG_API_BASE"), api_version=os.getenv("GRAPHRAG_API_VERSION", "2025-04-01-preview"), api_key=api_key, auth_method=AuthMethod.AzureManagedIdentity if not api_key else AuthMethod.ApiKey, retry=RetryConfig( type=RetryType.ExponentialBackoff, max_retries=7, base_delay=2.0, jitter=True ), # Internal option to test error handling and retries failure_rate_for_testing=0.5, # type: ignore ) llm_completion: LLMCompletion = create_completion(model_config) response = llm_completion.completion( 
messages="What is the capital of France?", ) print(f"Metrics for: {llm_completion.metrics_store.id}") print(json.dumps(llm_completion.metrics_store.get_metrics(), indent=2)) ``` -------------------------------- ### Example GraphRAG Settings YAML Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/custom_vector_store.ipynb Demonstrates how to configure GraphRAG to use the custom vector store in a settings file. This includes specifying the vector store type and any custom configuration options. ```python # Example GraphRAG yaml settings example_settings = { "vector_store": { "type": CUSTOM_VECTOR_STORE_TYPE, # "simple_memory" # Add any custom parameters your vector store needs "custom_config_option": "example_value", }, # Other GraphRAG configuration... "models": { "default_embedding_model": { "type": "embedding", "model_provider": "openai", "model": "text-embedding-3-small", } }, } # Convert to YAML format for settings.yml yaml_config = yaml.dump(example_settings, default_flow_style=False, indent=2) print("📄 Example settings.yml configuration:") print("=" * 40) print(yaml_config) ``` -------------------------------- ### Configure and Demonstrate Request Rate Limiting Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/07_rate_limiting.ipynb This example shows how to set up rate limiting for LLM requests, specifically limiting to 3 requests per minute. It verifies that the rate limiting is effective by measuring the time taken for two consecutive requests, which should be at least 20 seconds apart. Ensure you have the necessary environment variables like GRAPHRAG_API_KEY set. ```python # Copyright (c) 2024 Microsoft Corporation. 
# Licensed under the MIT License import json import os import time from dotenv import load_dotenv from graphrag_llm.completion import LLMCompletion, create_completion from graphrag_llm.config import AuthMethod, ModelConfig, RateLimitConfig, RateLimitType load_dotenv() api_key = os.getenv("GRAPHRAG_API_KEY") model_config = ModelConfig( model_provider="azure", model=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), azure_deployment_name=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), api_base=os.getenv("GRAPHRAG_API_BASE"), api_version=os.getenv("GRAPHRAG_API_VERSION", "2025-04-01-preview"), api_key=api_key, auth_method=AuthMethod.AzureManagedIdentity if not api_key else AuthMethod.ApiKey, rate_limit=RateLimitConfig( type=RateLimitType.SlidingWindow, period_in_seconds=60, # limit requests per minute requests_per_period=3, # max 3 requests per minute. Fire one off every 20 seconds ), ) llm_completion: LLMCompletion = create_completion(model_config) start_time = time.time() response = llm_completion.completion( messages="What is the capital of France?", ) response = llm_completion.completion( messages="What is the capital of France?", ) end_time = time.time() total_time = end_time - start_time assert total_time >= 20, "Rate limiting did not work as expected." print(f"Time taken for two requests: {total_time:.2f} seconds") print(f"Metrics for: {llm_completion.metrics_store.id}") print(json.dumps(llm_completion.metrics_store.get_metrics(), indent=2)) ``` -------------------------------- ### Install SpaCy Model Source: https://github.com/microsoft/graphrag/blob/main/docs/index/methods.md Manually install a SpaCy model for FastGraphRAG. The package will attempt to download it automatically if not found, but manual installation ensures it's available. 
```bash
python -m spacy download en_core_web_md
```

--------------------------------

### Install MarkItDown PDF dependency

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-input/README.md

Install the necessary package to enable PDF processing via MarkItDown.

```bash
pip install 'markitdown[pdf]' # required dependency for pdf processing
```

--------------------------------

### Register and Create File Storage with Factory

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-storage/README.md

Demonstrates how to register a custom key for `FileStorage` and then create an instance using the `storage_factory`. The factory ships with no preregistered providers, so every storage type must be registered explicitly before it can be created.

```python
from graphrag_storage.storage_factory import storage_factory
from graphrag_storage.file_storage import FileStorage

# storage_factory has no preregistered providers so you must register any
# providers you plan on using.
# May also register a custom implementation, see above for example.
storage_factory.register("my_storage_key", FileStorage)

storage = storage_factory.create(strategy="my_storage_key", init_args={"base_dir": "...", "other_settings": "..."})

...
```

--------------------------------

### Run GraphRAG App with Poe

Source: https://github.com/microsoft/graphrag/blob/main/unified-search-app/README.md

Start the GraphRAG project using Streamlit via the `poe start` command, executed with `uv run`.

```bash
uv run poe start
```

--------------------------------

### Create Project Space and Virtual Environment

Source: https://github.com/microsoft/graphrag/blob/main/docs/get_started.md

Set up a new directory for your project and create a Python virtual environment.
```bash mkdir graphrag_quickstart cd graphrag_quickstart python -m venv .venv ``` -------------------------------- ### Configure LLM and Template Engine Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/11_templating.ipynb Initializes the LLM completion client with Azure configuration and sets up the Jinja2 template engine for rendering. ```python # Copyright (c) 2024 Microsoft Corporation. # Licensed under the MIT License import os from dotenv import load_dotenv from graphrag_llm.completion import LLMCompletion, create_completion from graphrag_llm.config import ( AuthMethod, ModelConfig, TemplateEngineConfig, TemplateEngineType, TemplateManagerType, ) from graphrag_llm.templating import create_template_engine from graphrag_llm.types import LLMCompletionResponse from pydantic import BaseModel, Field load_dotenv() api_key = os.getenv("GRAPHRAG_API_KEY") model_config = ModelConfig( model_provider="azure", model=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), azure_deployment_name=os.getenv("GRAPHRAG_MODEL", "gpt-4o"), api_base=os.getenv("GRAPHRAG_API_BASE"), api_version=os.getenv("GRAPHRAG_API_VERSION", "2025-04-01-preview"), api_key=api_key, auth_method=AuthMethod.AzureManagedIdentity if not api_key else AuthMethod.ApiKey, ) llm_completion: LLMCompletion = create_completion(model_config) template_engine = create_template_engine() # The above default is the same as the following configuration: template_engine = create_template_engine( TemplateEngineConfig( type=TemplateEngineType.Jinja, template_manager=TemplateManagerType.File, base_dir="templates", template_extension=".jinja", encoding="utf-8", ) ) msg = template_engine.render( # Name of the template file without extension template_name="weather_listings", # Values to fill in the template context={ "weather_reports": [ {"city": "Seattle", "temperature_f": 52, "condition": "sunny"}, {"city": "San Francisco", "temperature_f": 75, "condition": "cloudy"}, ] }, ) print(f"The rendered message 
to parse: {msg}") # Structured response parsing using pydantic class LocalWeather(BaseModel): """City weather information model.""" city: str = Field(description="The name of the city") temperature: float = Field(description="The temperature in Celsius") condition: str = Field(description="The weather condition description") class WeatherReports(BaseModel): """Weather information model.""" reports: list[LocalWeather] = Field( description="The weather reports for multiple cities" ) response: LLMCompletionResponse[WeatherReports] = llm_completion.completion( messages=msg, response_format=WeatherReports, ) # type: ignore local_weather_reports: WeatherReports = response.formatted_response # type: ignore for report in local_weather_reports.reports: print(f"City: {report.city}") print(f" Temperature: {report.temperature} °C") print(f" Condition: {report.condition}") ``` -------------------------------- ### Connect and Load Documents Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/custom_vector_store.ipynb Connect to the vector store, create the index, and load the sample documents. This prepares the vector store for search operations. ```python # Connect and load documents vector_store.connect() vector_store.create_index() vector_store.load_documents(sample_documents) ``` -------------------------------- ### Initialize LLM Completion and Model Configuration Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/10_tool_calling.ipynb Sets up the LLM completion client using Azure as the model provider. Ensure environment variables for API key, model name, and base URL are set. 
```python
import os

from dotenv import load_dotenv
from graphrag_llm.completion import LLMCompletion, create_completion
from graphrag_llm.config import AuthMethod, ModelConfig
from graphrag_llm.types import LLMCompletionResponse
from graphrag_llm.utils import (
    CompletionMessagesBuilder,
    FunctionToolManager,
)
from pydantic import BaseModel, ConfigDict, Field

load_dotenv()

api_key = os.getenv("GRAPHRAG_API_KEY")
model_config = ModelConfig(
    model_provider="azure",
    model=os.getenv("GRAPHRAG_MODEL", "gpt-4o"),
    azure_deployment_name=os.getenv("GRAPHRAG_MODEL", "gpt-4o"),
    api_base=os.getenv("GRAPHRAG_API_BASE"),
    api_version=os.getenv("GRAPHRAG_API_VERSION", "2025-04-01-preview"),
    api_key=api_key,
    auth_method=AuthMethod.AzureManagedIdentity if not api_key else AuthMethod.ApiKey,
)
llm_completion: LLMCompletion = create_completion(model_config)
```

--------------------------------

### Initialize and Connect to CosmosDB Vector Store

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/cosmosdb.ipynb

Sets up a connection to a CosmosDB instance, creates an index if it doesn't exist, and loads documents. Requires a CosmosDB connection string, read from the `COSMOSDB_CONNECTION_STRING` environment variable with the local emulator's connection string as the fallback. The `fields` parameter defines metadata fields and their types for indexing.
```python # Create and connect to a CosmosDB vector store # Local emulator connection string (Docker must be running with the emulator) EMULATOR_CONNECTION_STRING = "AccountEndpoint=http://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw==;" connection_string = os.environ.get( "COSMOSDB_CONNECTION_STRING", EMULATOR_CONNECTION_STRING ) store = CosmosDBVectorStore( connection_string=connection_string, database_name="graphrag_vectors", index_name="text_units", fields={ "os": "str", "category": "str", "timestamp": "date", }, ) store.connect() store.create_index() # Load documents docs = [ VectorStoreDocument( id=row["id"], vector=row["embedding"].tolist(), data=row.to_dict(), create_date=row.get("timestamp"), ) for _, row in text_units.iterrows() ] store.load_documents(docs) print(f"Loaded {len(docs)} documents into store") ``` -------------------------------- ### Execute Queries Source: https://github.com/microsoft/graphrag/blob/main/docs/developing.md Run the query CLI using poethepoet. ```sh uv run poe query <...args> ``` -------------------------------- ### Timestamp Filtering Setup Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/lancedb.ipynb Import datetime utilities for working with date-based metadata fields. ```python from datetime import datetime, timedelta ``` -------------------------------- ### LLM Initialization and Call Logs Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/04_metrics.ipynb Logs showing the initialization and completion calls for the LLM, including model provider and configuration details. ```text INFO:azure.identity._credentials.environment:No environment configuration found. 
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
22:45:28 - LiteLLM:INFO: utils.py:3373 - LiteLLM completion() model= gpt-4o; provider = azure
INFO:LiteLLM: LiteLLM completion() model= gpt-4o; provider = azure
22:45:28 - LiteLLM:INFO: utils.py:1286 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
22:45:28 - LiteLLM:INFO: utils.py:3373 - LiteLLM completion() model= gpt-4o; provider = azure
INFO:LiteLLM: LiteLLM completion() model= gpt-4o; provider = azure
22:45:29 - LiteLLM:INFO: utils.py:1286 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
```

--------------------------------

### Define Input Metadata in CSV

Source: https://github.com/microsoft/graphrag/blob/main/docs/index/inputs.md

Example of a CSV file structure used for document input.

```csv
text,title,tag
My first program,Hello World,tutorial
An early space shooter game,Space Invaders,arcade
```

--------------------------------

### Initialize Input Readers with Factory

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-input/example_notebooks/input_example.ipynb

Demonstrates configuring input readers for CSV and MarkItDown formats using the factory pattern.
```python
from graphrag_input import InputConfig, InputType, create_input_reader
from graphrag_storage import StorageConfig, create_storage

config = InputConfig(
    type=InputType.Csv,
    text_column="content",
    title_column="title",
)
storage = create_storage(StorageConfig(base_dir="./input"))
reader = create_input_reader(config, storage)
documents = await reader.read_files()
```

```python
from graphrag_input import InputConfig, InputType, create_input_reader
from graphrag_storage import StorageConfig, create_storage

config = InputConfig(type=InputType.MarkItDown, file_pattern=".*\\.pdf$")
storage = create_storage(StorageConfig(base_dir="./input"))
reader = create_input_reader(config, storage)
documents = await reader.read_files()
```

--------------------------------

### Execute Indexing Engine

Source: https://github.com/microsoft/graphrag/blob/main/docs/developing.md

Run the indexing CLI using poethepoet.

```sh
uv run poe index <...args>
```

--------------------------------

### Configure Azure Managed Identity Authentication

Source: https://github.com/microsoft/graphrag/blob/main/docs/get_started.md

Example configuration for using managed identity authentication with Azure OpenAI.

```yaml
auth_method: azure_managed_identity # Default auth_method is api_key
```

--------------------------------

### Configure Azure OpenAI Chat Model

Source: https://github.com/microsoft/graphrag/blob/main/docs/get_started.md

Example configuration for using Azure OpenAI as the chat model provider.

```yaml
type: chat
model_provider: azure
model: gpt-4.1
azure_deployment_name:
api_base: https://.openai.azure.com
api_version: 2024-02-15-preview # You can customize this for other versions
```

--------------------------------

### Initialize and Connect to Azure AI Search

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/azure_ai_search.ipynb

Creates an AzureAISearchVectorStore instance, connects to the specified URL, and creates an index.
Requires AZURE_AI_SEARCH_URL and optionally AZURE_AI_SEARCH_API_KEY to be set in the environment. Documents are loaded into the store after index creation. ```python # Create and connect to an Azure AI Search vector store url = os.environ["AZURE_AI_SEARCH_URL"] api_key = os.environ.get("AZURE_AI_SEARCH_API_KEY") store = AzureAISearchVectorStore( url=url, api_key=api_key, index_name="text_units", fields={ "os": "str", "category": "str", "timestamp": "date", }, ) store.connect() store.create_index() # Load documents docs = [ VectorStoreDocument( id=row["id"], vector=row["embedding"].tolist(), data=row.to_dict(), create_date=row.get("timestamp"), ) for _, row in text_units.iterrows() ] store.load_documents(docs) print(f"Loaded {len(docs)} documents into store") # Allow time for Azure AI Search to propagate time.sleep(5) ``` -------------------------------- ### Create Vector Store with Utility Function Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/basic_usage_with_utility_function_example.ipynb Use the `create_vector_store` function for a simplified way to set up a vector store. Specify the store type, database URI, and index schema details. ```python from graphrag_vectors import ( IndexSchema, VectorStoreConfig, create_vector_store, ) # Create a vector store using the convenience function store_config = VectorStoreConfig(type="lancedb", db_uri="lance") schema_config = IndexSchema( index_name="my_index", vector_size=1536, ) vector_store = create_vector_store( config=store_config, index_schema=schema_config, ) vector_store.connect() vector_store.create_index() ``` -------------------------------- ### Configure Vector Store Schema in YAML Source: https://github.com/microsoft/graphrag/blob/main/docs/config/yaml.md Example configuration for a LanceDB vector store with customized index schema fields. 
```yaml
vector_store:
  type: lancedb
  db_uri: output/lancedb
  index_schema:
    text_unit_text:
      index_name: "text-unit-embeddings"
      id_field: "id_custom"
      vector_field: "vector_custom"
      vector_size: 3072
    entity_description:
      id_field: "id_custom"
```

--------------------------------

### Legacy Cosmos DB Configuration

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-storage/COSMOS_TABLE_PROVIDER_DESIGN.md

This is an example of the previous configuration for Cosmos DB storage, typically using a single container for output.

```yaml
# Before (legacy)
output_storage:
  type: cosmosdb
  account_url: https://myaccount.documents.azure.com:443/
  database_name: graphrag
  container_name: graphrag-output
```

--------------------------------

### Create and Use a LiteLLM Tokenizer

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/02_encoding_decoding.ipynb

Shows how to create a tokenizer instance explicitly using `TokenizerType.LiteLLM` and then use it to encode and decode text. This is useful when you need a tokenizer independent of an LLM completion or embedding model.

```python
from graphrag_llm.config import TokenizerConfig, TokenizerType
from graphrag_llm.tokenizer import create_tokenizer

tokenizer = create_tokenizer(
    TokenizerConfig(
        type=TokenizerType.LiteLLM,
        model_id="openai/text-embedding-3-small",
    )
)

encoded = tokenizer.encode("Hello, world!")
print(f"Encoded tokens: {encoded}")
print(f"Number of tokens: {len(encoded)}")

decoded = tokenizer.decode(encoded)
print(f"Decoded text: {decoded}")
```

--------------------------------

### Initialize Language Models and Tokenizer

Source: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/local_search.ipynb

Sets up the chat and embedding language models using configurations for OpenAI's `gpt-4.1` and `text-embedding-3-small`, respectively. It also retrieves a tokenizer based on the chat model configuration.
```python
import os

from graphrag.config.enums import ModelType
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager
from graphrag.tokenizer.get_tokenizer import get_tokenizer

api_key = os.environ["GRAPHRAG_API_KEY"]

chat_config = LanguageModelConfig(
    api_key=api_key,
    type=ModelType.Chat,
    model_provider="openai",
    model="gpt-4.1",
    max_retries=20,
)
chat_model = ModelManager().get_or_create_chat_model(
    name="local_search",
    model_type=ModelType.Chat,
    config=chat_config,
)

embedding_config = LanguageModelConfig(
    api_key=api_key,
    type=ModelType.Embedding,
    model_provider="openai",
    model="text-embedding-3-small",
    max_retries=20,
)
text_embedder = ModelManager().get_or_create_embedding_model(
    name="local_search_embedding",
    model_type=ModelType.Embedding,
    config=embedding_config,
)

tokenizer = get_tokenizer(chat_config)
```

--------------------------------

### Initialize GlobalSearch Engine

Source: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/global_search.ipynb

Configure the GlobalSearch engine with the previously defined parameters and model settings.

```python
search_engine = GlobalSearch(
    model=model,
    context_builder=context_builder,
    tokenizer=tokenizer,
    max_data_tokens=12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    map_llm_params=map_llm_params,
    reduce_llm_params=reduce_llm_params,
    allow_general_knowledge=False,  # setting this to True adds an instruction encouraging the LLM to incorporate general knowledge in the response, which may increase hallucinations but could be useful in some use cases.
    json_mode=True,  # set this to False if your model does not support JSON mode.
    context_builder_params=context_builder_params,
    concurrent_coroutines=32,
    response_type="multiple paragraphs",  # free-form text describing the response type and format, can be anything, e.g.
    # prioritized list, single paragraph, multiple paragraphs, multiple-page report
)
```

--------------------------------

### LLM Completion Call

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-llm/notebooks/04_metrics.ipynb

Make a completion request to the configured LLM. This example shows a basic call without specifying the temperature parameter.

```python
response = llm_completion.completion(
    messages="What is the capital of France?",
)
```

--------------------------------

### Filter Documents by Quarter

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/lancedb.ipynb

Retrieve documents created in a specific quarter using the `create_date_quarter` filter. This example searches for documents from the 4th quarter.

```python
print("=== Filter: create_date_quarter == 4 (Q4) ===")
filtered = store.similarity_search_by_vector(
    query_vector,
    k=5,
    filters=F.create_date_quarter == 4,
)
print(f"Found {len(filtered)} results:")
for r in filtered:
    print(f" - {r.document.id}: quarter={r.document.data.get('create_date_quarter')}")
```

--------------------------------

### Run Standard GraphRAG Indexing

Source: https://github.com/microsoft/graphrag/blob/main/docs/index/methods.md

Use this command to initiate the standard indexing method. This is the default, so the `--method` parameter can be omitted.

```bash
graphrag index --method standard
```

--------------------------------

### Filter Documents by Month

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/lancedb.ipynb

Use the `create_date_month` filter to retrieve documents created in a specific month. This example filters for documents created in December.
```python
print("=== Filter: create_date_month == 12 (December) ===")
filtered = store.similarity_search_by_vector(
    query_vector,
    k=5,
    filters=F.create_date_month == 12,
)
print(f"Found {len(filtered)} results:")
for r in filtered:
    print(
        f" - {r.document.id}: create_date={r.document.create_date}, month={r.document.data.get('create_date_month')}"
    )
```

--------------------------------

### Define Input Paths and Parameters

Source: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/global_search_with_dynamic_community_selection.ipynb

Set the directory paths and community level parameters for loading indexed data.

```python
# parquet files generated from indexing pipeline
INPUT_DIR = "./inputs/operation dulce"
COMMUNITY_TABLE = "communities"
COMMUNITY_REPORT_TABLE = "community_reports"
ENTITY_TABLE = "entities"

# we don't fix a specific community level but instead use an agent to dynamically
# search through all the community reports to check if they are relevant.
COMMUNITY_LEVEL = None
```

--------------------------------

### Define Project Directory for Settings

Source: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/index_migration_to_v2.ipynb

Specify the directory containing your settings.yaml file. This is crucial for loading the correct configuration.

```python
# This is the directory that has your settings.yaml
PROJECT_DIRECTORY = ""
```

--------------------------------

### Initialize GraphRAG DRIFT Search

Source: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/drift_search.ipynb

Configures DRIFT search parameters and initializes the `DRIFTSearchContextBuilder` and `DRIFTSearch` objects. This setup is necessary before performing DRIFT-based queries.
```python
drift_params = DRIFTSearchConfig(
    primer_folds=1,
    drift_k_followups=3,
    n_depth=3,
)

context_builder = DRIFTSearchContextBuilder(
    model=chat_model,
    text_embedder=text_embedder,
    entities=entities,
    relationships=relationships,
    reports=reports,
    entity_text_embeddings=description_embedding_store,
    text_units=text_units,
    tokenizer=tokenizer,
    config=drift_params,
)

search = DRIFTSearch(
    model=chat_model, context_builder=context_builder, tokenizer=tokenizer
)
```

--------------------------------

### Remove Documents by ID

Source: https://github.com/microsoft/graphrag/blob/main/packages/graphrag-vectors/example_notebooks/lancedb.ipynb

Delete one or more documents from the vector store by providing a list of their IDs to the `remove()` method. This example removes the first 5 documents.

```python
# Remove documents
ids_to_delete = text_units["id"].head(5).tolist()
print(f"Deleting {len(ids_to_delete)} documents...")
store.remove(ids_to_delete)

new_count = store.count()
print(f"Document count after delete: {new_count}")
# The expected count of 37 assumes the store held 42 documents before the delete
assert new_count == 37, f"Expected 37, got {new_count}"
print("Remove confirmed.")
```

--------------------------------

### Define Project Directory

Source: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/index_migration_to_v1.ipynb

Set the path to the project directory containing the settings.yaml file.

```python
# This is the directory that has your settings.yaml
# NOTE: much older indexes may have been output with a timestamped directory
# if this is the case, you will need to make sure the storage.base_dir in settings.yaml points to it correctly
PROJECT_DIRECTORY = "