### Install TypeAgent-py

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/getting-started.md

Installs the TypeAgent-py library using pip. It's recommended to use a virtual environment or a package manager like poetry or uv for managing dependencies.

```shell
pip install typeagent
```

--------------------------------

### Example Text Data File

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/getting-started.md

A sample text file (`testdata.txt`) containing lines that represent conversation messages, with each line starting with a speaker's name followed by their utterance. This format is used for ingestion by the `ingest.py` script.

```text
STEVE We should really make a Python library for Structured RAG.
UMESH Who would be a good person to do the Python library?
GUIDO I volunteer to do the Python library. Give me a few months.
```

--------------------------------

### Query Ingested Data with TypeAgent-py (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/getting-started.md

Creates a conversation object, defines a question, and queries the ingested data using TypeAgent-py. Requires the same environment setup as the ingestion program. Prints the question and the retrieved answer.

```python
from typeagent import create_conversation
from typeagent.transcripts.transcript import TranscriptMessage


async def main():
    conversation = await create_conversation("demo.db", TranscriptMessage)
    question = "Who volunteered to do the python library?"
    print("Q:", question)
    answer = await conversation.query(question)
    print("A:", answer)


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
```

--------------------------------

### Set OpenAI Environment Variables (Shell)

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/getting-started.md

Sets the necessary environment variables for using OpenAI's API with TypeAgent-py. This includes the API key and the desired model.
Additional variables might be needed for specific setups, including Azure OpenAI.

```shell
export OPENAI_API_KEY=your-very-secret-openai-api-key
export OPENAI_MODEL=gpt-4o
```

--------------------------------

### Ingest Text Data with TypeAgent-py (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/getting-started.md

Reads messages from a text file, parses them into TranscriptMessage objects, and indexes them using TypeAgent-py. Requires OpenAI API key and model to be set as environment variables. Outputs the number of messages indexed and semantic references created.

```python
from typeagent import create_conversation
from typeagent.transcripts.transcript import (
    TranscriptMessage,
    TranscriptMessageMeta,
)


def read_messages(filename) -> list[TranscriptMessage]:
    messages: list[TranscriptMessage] = []
    with open(filename, "r") as f:
        for line in f:
            # Parse each line into a TranscriptMessage
            speaker, text_chunk = line.split(None, 1)
            message = TranscriptMessage(
                text_chunks=[text_chunk],
                metadata=TranscriptMessageMeta(speaker=speaker),
            )
            messages.append(message)
    return messages


async def main():
    conversation = await create_conversation("demo.db", TranscriptMessage)
    messages = read_messages("testdata.txt")
    print(f"Indexing {len(messages)} messages...")
    results = await conversation.add_messages_with_indexing(messages)
    print(f"Indexed {results.messages_added} messages.")
    print(f"Got {results.semrefs_added} semantic refs.")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
```

--------------------------------

### Query Conversation with TypeAgent-Py

Source: https://context7.com/microsoft/typeagent-py/llms.txt

This example shows how to create a conversation instance, ask a single question, and then enter an interactive loop for continuous querying. It handles basic user input and potential errors.

```python
from typeagent import create_conversation
from typeagent.transcripts.transcript import TranscriptMessage
import asyncio


async def query_example():
    # Connect to existing conversation
    conversation = await create_conversation("demo.db", TranscriptMessage)

    # Single query
    question = "Who volunteered to work on authentication?"
    print(f"Q: {question}")
    answer = await conversation.query(question)
    print(f"A: {answer}")

    # Interactive query loop
    print("\nInteractive mode (type 'quit' to exit):")
    while True:
        try:
            user_question = input("typeagent> ")
            if not user_question.strip():
                continue
            if user_question.lower() in ('quit', 'exit', 'q'):
                break

            response = await conversation.query(user_question)
            print(response)

            # Check if no answer was found
            if response.startswith("No answer found:"):
                print("(Insufficient information in conversation)")
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"Error: {e}")


asyncio.run(query_example())
```

--------------------------------

### Example Main Function and Execution in Python

Source: https://context7.com/microsoft/typeagent-py/llms.txt

Demonstrates the main execution flow for the TypeAgent project. It includes creating a conversation, generating a large dataset of messages, performing batch ingestion, and verifying the final state of the conversation. The script uses asyncio for asynchronous operations and basic logging.

```python
import logging
import asyncio

# Assume TranscriptMessage, TranscriptMessageMeta, create_conversation, and batch_ingest are defined elsewhere

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


# Placeholder classes and functions for demonstration purposes
class TranscriptMessageMeta:
    def __init__(self, speaker):
        self.speaker = speaker


class TranscriptMessage:
    def __init__(self, text_chunks, metadata):
        self.text_chunks = text_chunks
        self.metadata = metadata


async def create_conversation(db_path, message_type):
    # This is a mock implementation.
    # Replace with actual conversation creation.
    logger.info(f"Creating conversation with db: {db_path} and type: {message_type.__name__}")

    class MockConversation:
        async def add_messages_with_indexing(self, messages):
            # Mocking the result of adding messages
            return type('obj', (object,), {'messages_added': len(messages), 'semrefs_added': len(messages) // 2})()

        @property
        def messages(self):
            # Mocking the messages attribute and its size method.
            # This is a plain (non-async) property: an async property would
            # return a coroutine, so `conversation.messages.size()` would fail.
            class MockMessages:
                async def size(self):
                    return 500  # Mock size

            return MockMessages()

    return MockConversation()


async def batch_ingest(conversation, messages, batch_size=50):
    # This is a mock implementation of batch_ingest. Replace with actual implementation.
    total_messages = 0
    total_semrefs = 0

    for i in range(0, len(messages), batch_size):
        batch = messages[i:i + batch_size]
        batch_num = i // batch_size + 1

        try:
            logger.info(f"Processing batch {batch_num} ({len(batch)} messages)...")
            result = await conversation.add_messages_with_indexing(batch)
            total_messages += result.messages_added
            total_semrefs += result.semrefs_added
            logger.info(f"Batch {batch_num} complete: {result.messages_added} messages, {result.semrefs_added} semantic refs")
        except Exception as e:
            logger.error(f"Error in batch {batch_num}: {e}")
            continue

    return total_messages, total_semrefs


async def main():
    conversation = await create_conversation("large_dataset.db", TranscriptMessage)

    # Generate large dataset
    messages = [
        TranscriptMessage(
            text_chunks=[f"This is message number {i} about topic {i % 10}."],
            metadata=TranscriptMessageMeta(speaker=f"Speaker{i % 5}")
        )
        for i in range(500)
    ]

    logger.info(f"Starting ingestion of {len(messages)} messages...")
    total_msgs, total_refs = await batch_ingest(conversation, messages, batch_size=100)
    logger.info(f"Ingestion complete: {total_msgs} messages, {total_refs} semantic refs")

    # Verify conversation state
    size = await conversation.messages.size()
    logger.info(f"Total messages in conversation: {size}")


if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### TypeAgent-Py Environment Configuration

Source: https://context7.com/microsoft/typeagent-py/llms.txt

This example shows various ways to configure API credentials for OpenAI and Azure OpenAI services. It covers setting environment variables directly, using a .env file, and specifying custom endpoints for OpenAI-compatible services and embedding servers.

```python
from typeagent.aitools.utils import load_dotenv
import os

# Option 1: Set environment variables directly
os.environ['OPENAI_API_KEY'] = 'sk-...'
os.environ['OPENAI_MODEL'] = 'gpt-4o'

# Option 2: Use .env file (recommended)
# Create a .env file in your project directory:
# OPENAI_API_KEY=sk-your-secret-key
# OPENAI_MODEL=gpt-4o
load_dotenv()

# Option 3: Azure OpenAI configuration
os.environ['AZURE_OPENAI_API_KEY'] = 'your-azure-key'
os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://YOUR_RESOURCE.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT/chat/completions?api-version=2023-05-15'

# Option 4: Custom OpenAI-compatible endpoint
os.environ['OPENAI_API_KEY'] = 'dummy-key'
os.environ['OPENAI_MODEL'] = 'llama:3.2:1b'
os.environ['OPENAI_ENDPOINT'] = 'http://localhost:11434/v1'  # Ollama example

# Option 5: Custom embedding server
os.environ['OPENAI_API_KEY'] = 'dummy-key'
os.environ['OPENAI_BASE_URL'] = 'http://localhost:7997'  # Infinity embedding server
```

--------------------------------

### WebVTT Transcript Format Example

Source: https://github.com/microsoft/typeagent-py/blob/main/typeagent/transcripts/README.md

Illustrates the standard WebVTT file format for captions and subtitles, including timestamps and speaker information. This format is recognized and parsed by the `ingest_vtt_transcript` function.

```webvtt
WEBVTT
Kind: captions
Language: en

00:00:07.599 --> 00:00:10.559
SPEAKER: Hello, this is a test.

00:00:10.560 --> 00:00:15.000
[Another Speaker] This is another line.
```

--------------------------------

### Read Messages from File and Index with TypeAgent-Py

Source: https://context7.com/microsoft/typeagent-py/llms.txt

This example demonstrates how to read messages from a text file, parse them with speaker attribution, and then add them to a conversation for indexing. It includes error handling for file not found and malformed lines.

```python
from typeagent import create_conversation
from typeagent.transcripts.transcript import (
    TranscriptMessage,
    TranscriptMessageMeta
)
import asyncio


def read_messages_from_file(filename):
    """
    Parse messages from a text file.

    Expected format: SPEAKER message text

    Example file (testdata.txt):
        ALICE We should add a new feature.
        BOB I agree, let's start next week.
        CHARLIE I can help with the design.
    """
    messages = []
    try:
        with open(filename, 'r') as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line or line.startswith('#'):
                    continue  # Skip empty lines and comments

                try:
                    # Split on first whitespace
                    speaker, text_chunk = line.split(None, 1)
                    message = TranscriptMessage(
                        text_chunks=[text_chunk],
                        metadata=TranscriptMessageMeta(speaker=speaker)
                    )
                    messages.append(message)
                except ValueError:
                    print(f"Warning: Skipping malformed line {line_num}")
                    continue
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found")
        return []

    return messages


async def main():
    conversation = await create_conversation("demo.db", TranscriptMessage)
    messages = read_messages_from_file("testdata.txt")

    if messages:
        print(f"Indexing {len(messages)} messages...")
        results = await conversation.add_messages_with_indexing(messages)
        print(f"Indexed {results.messages_added} messages.")
        print(f"Got {results.semrefs_added} semantic refs.")
    else:
        print("No messages to index")


asyncio.run(main())
```

--------------------------------

### Python Type Hinting for Interfaces

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

Interfaces in Python, specifically those starting with 'I' followed by
a capital letter, should be defined using the 'Protocol' class from the 'typing' module. This facilitates structural subtyping.

```python
from typing import Protocol


class IListService(Protocol):  # The interface itself derives from Protocol
    ...


class SomeClass(IListService):
    ...
```

--------------------------------

### Test VTT Transcript Ingestion - Python

Source: https://github.com/microsoft/typeagent-py/blob/main/typeagent/transcripts/README.md

An example test case for the `ingest_vtt_transcript` function using pytest. It demonstrates setting up `ConversationSettings` with an embedding model and asserting that the ingested transcript contains messages. Dependencies include pytest and fixtures for authentication and embedding models.

```python
import pytest
from fixtures import needs_auth, embedding_model
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.transcripts.transcript_ingest import ingest_vtt_transcript


@pytest.mark.asyncio
async def test_my_transcript(needs_auth, embedding_model):
    settings = ConversationSettings(embedding_model)
    transcript = await ingest_vtt_transcript(
        "test.vtt",
        settings,
        dbname="test.db",
    )
    assert await transcript.messages.size() > 0
```

--------------------------------

### Python Conversation Metadata Usage Examples

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_spec.md

Demonstrates how to interact with conversation metadata using an instance of IStorageProvider. It covers retrieving metadata (with default creation if none exists), setting metadata with various options (updating specific fields, providing explicit timestamps, overriding all fields), handling schema version validation errors, and using None to retain existing values. This snippet requires an asynchronous context to run.

```python
# Assume 'provider' is an initialized instance of IStorageProvider

# Get metadata (always returns a valid object, creates defaults if needed)
# metadata = await provider.get_conversation_metadata()
# print(f"Name: {metadata.name_tag}")  # Never None, could be empty string
# print(f"Tags: {metadata.tags}")  # Never None, could be empty list []
# print(f"Extra: {metadata.extra}")  # Never None, could be empty dict {}

# Create new conversation with defaults (timestamps auto-set to now UTC)
# await provider.set_conversation_metadata(name_tag="my_conversation")

# Update just the tags, timestamp gets updated automatically
# await provider.set_conversation_metadata(tags=["important", "work"])

# Update timestamp only (equivalent to refresh/touch)
# await provider.set_conversation_metadata()

# Override everything explicitly with timezone handling
# from datetime import datetime, timezone
# await provider.set_conversation_metadata(
#     name_tag="conversation",
#     created_at=datetime(2025, 1, 1, 12, 0, 0),  # Assumes local TZ, converted to UTC
#     updated_at=datetime.now(timezone.utc),  # Explicit UTC
#     tags=["tag1", "tag2"],  # Never None
#     extra={"custom_field": "value"}  # Never None
# )

# Schema version validation (raises ValueError if mismatch)
# try:
#     await provider.set_conversation_metadata(schema_version="1.0")  # Would raise error
# except ValueError as e:
#     print(f"Schema mismatch: {e}")

# Baseline behavior - use None to keep existing values
# await provider.set_conversation_metadata(
#     name_tag="new_name",  # Update name
#     tags=None,  # Keep existing tags from baseline
#     extra=None,  # Keep existing extra from baseline
#     # created_at not specified -> keeps existing, updated_at -> current time
# )
```

--------------------------------

### Configure Embedding Models in TypeAgent Python

Source: https://context7.com/microsoft/typeagent-py/llms.txt

This example shows how to configure and use custom embedding models with TypeAgent Python for semantic search.
It covers using default OpenAI embeddings, specifying custom models with different sizes, and generating embeddings for text. Dependencies include 'typeagent.aitools.embeddings', 'asyncio', and 'numpy'.

```python
from typeagent.aitools.embeddings import AsyncEmbeddingModel
import asyncio
import numpy as np


async def embedding_example():
    # Option 1: Default OpenAI embeddings (text-embedding-ada-002)
    default_model = AsyncEmbeddingModel()

    # Option 2: Specify model explicitly
    small_model = AsyncEmbeddingModel(
        model_name="text-embedding-small",
        embedding_size=512  # Smaller, faster embeddings
    )

    # Option 3: Large model for better quality
    large_model = AsyncEmbeddingModel(
        model_name="text-embedding-large",
        embedding_size=3072
    )

    # Generate embeddings for text
    texts = [
        "Machine learning is a subset of artificial intelligence.",
        "Python is a popular programming language.",
        "Deep learning uses neural networks."
    ]

    try:
        embeddings = await default_model.create_embeddings(texts)
        print(f"Generated {len(embeddings)} embeddings")
        print(f"Embedding shape: {embeddings[0].shape}")
        print(f"Embedding size: {default_model.embedding_size}")

        # Calculate similarity between first two texts
        similarity = np.dot(embeddings[0], embeddings[1])
        print(f"Similarity between texts: {similarity:.4f}")
    except Exception as e:
        print(f"Error generating embeddings: {e}")


asyncio.run(embedding_example())
```

--------------------------------

### Run TypeAgent demo UI with default podcast data

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/demos.md

This command runs the TypeAgent demo UI without specifying a database, which defaults to using the provided podcast index files. Users can then interactively ask questions about the podcast content. The demo utilizes Azure OpenAI for processing.

```sh
python tools/query.py
```

--------------------------------

### Activating Virtual Environment

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

To activate the project's virtual environment, first create it using 'make venv' and then source the activate script located in '.venv/bin/activate'. This ensures that the correct Python interpreter and packages are used.

```bash
make venv
source .venv/bin/activate
```

--------------------------------

### Ingest WebVTT files into SQLite DB

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/demos.md

This tool ingests WebVTT format files into a SQLite database for querying. It requires one or more .vtt files and an output database file name. The process can take a significant amount of time depending on the number and size of the input files.

```sh
python tools/ingest_vtt.py FILE1.vtt ... FILEN.vtt -d mp.db
```

--------------------------------

### Formatting Code with Black

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

Run the 'make format' command to automatically format all files in the project using the Black code formatter. This ensures consistent code style across the codebase.

```bash
make format
```

--------------------------------

### Get Nearest Indexes (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/indexes_overview.md

Retrieves indexes of nearest neighbors based on a given embedding. This method supports optional parameters for maximum matches, minimum score, and a predicate for filtering results. It returns a list of scored integer identifiers.

```python
def get_indexes_of_nearest(
    self,
    embedding: NormalizedEmbedding,
    max_matches: int | None = None,
    min_score: float | None = None,
    predicate: Callable[[int], bool] | None = None,
) -> list[ScoredInt]
```

--------------------------------

### Query data from SQLite DB

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/demos.md

This tool allows querying a SQLite database. It can be used to ask questions about data that has been previously ingested. The database file path is provided as an argument. This tool is used for both the Monty Python and GMail demos.

```sh
python tools/query.py -d mp.db
```

--------------------------------

### Analyze VTT Transcript - Python

Source: https://github.com/microsoft/typeagent-py/blob/main/typeagent/transcripts/README.md

Provides functions to analyze WebVTT files, including getting the total duration and extracting speaker information. It also includes a utility to extract speaker names from text lines. Dependencies include typeagent.transcripts.transcript_ingest.

```python
from typeagent.transcripts.transcript_ingest import (
    get_transcript_duration,
    get_transcript_speakers,
    extract_speaker_from_text,
)

# Get basic information
duration = get_transcript_duration("transcript.vtt")
speakers = get_transcript_speakers("transcript.vtt")
print(f"Duration: {duration/60:.1f} minutes")
print(f"Speakers: {speakers}")

# Test speaker extraction
speaker, text = extract_speaker_from_text("NARRATOR: Once upon a time...")
print(f"Speaker: {speaker}, Text: {text}")
```

--------------------------------

### Package Management with uv: Adding Dependencies

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

Use the 'uv add' command to incorporate new packages into the project's dependencies. uv will automatically update the 'pyproject.toml' file to reflect the changes.

```bash
uv add
```

--------------------------------

### Download GMail messages using GMail API

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/demos.md

A tool to download GMail messages using the GMail API. It requires a Google Cloud app to be created and configured. The tool can download a specified number of messages, with a default of 50. Instructions for setting up the Google Cloud app are provided via a GeeksForGeeks link.

```python
from gmail import gmail_dump

# Example usage (assuming configuration is done)
# gmail_dump.download_messages(num_messages=50)
```

--------------------------------

### Memory Storage Provider Implementation (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

Provides a concrete implementation of the `IStorageProvider` interface using in-memory storage. The `MemoryStorageProvider` class ensures that all required index getter methods are implemented, returning the corresponding in-memory index instances. This serves as a basic storage solution for testing and development purposes.

```python
class MemoryStorageProvider[TMessage: IMessage](IStorageProvider[TMessage]):
    async def get_conversation_index(self) -> ITermToSemanticRefIndex:
        return self._conversation_index

    async def get_property_index(self) -> IPropertyToSemanticRefIndex:
        return self._property_index

    # ... all other index getters implemented
```

--------------------------------

### SQL: Create SemanticRefs Table

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_spec.md

Defines the schema for the SemanticRefs table, used to store semantic references. It decomposes text ranges into start and end message IDs and chunk orders, specifying the knowledge type and the knowledge content. Foreign key constraints ensure referential integrity with the Messages table.

```sql
CREATE TABLE SemanticRefs (
    semref_id INTEGER PRIMARY KEY AUTOINCREMENT,

    -- TextRange decomposed into separate columns for efficient querying
    -- Forms a half-open interval [start, end)
    -- If in-memory TextRange has no end, defaults to: end_msg_id = start_msg_id, end_chunk_ord = start_chunk_ord + 1
    start_msg_id INTEGER NOT NULL,
    start_chunk_ord INTEGER NOT NULL,
    end_msg_id INTEGER NOT NULL,
    end_chunk_ord INTEGER NOT NULL,  -- Points past the last included chunk

    ktype TEXT NOT NULL CHECK (ktype IN ('entity', 'action', 'topic', 'tag')),
    knowledge JSON NOT NULL,

    FOREIGN KEY (start_msg_id) REFERENCES Messages(msg_id) ON DELETE RESTRICT,
    FOREIGN KEY (end_msg_id) REFERENCES Messages(msg_id) ON DELETE RESTRICT
);

CREATE INDEX idx_semantic_refs_start_msg ON SemanticRefs(start_msg_id);
CREATE INDEX idx_semantic_refs_end_msg ON SemanticRefs(end_msg_id);
CREATE INDEX idx_semantic_refs_ktype ON SemanticRefs(ktype);
```

--------------------------------

### Running Project Tests

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

To execute the tests within the project, use the 'make test' command or run 'pytest test' directly. Either way, the pytest framework discovers and runs all defined tests.

```bash
make test
```

--------------------------------

### Running Type Checking with Pyright

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

To perform type checking on the project's code, utilize the 'pyright' command or the 'make check' command. This helps in identifying type-related errors before runtime.

```bash
pyright
```

```bash
make check
```

--------------------------------

### Package Management with uv: Upgrading Dependencies

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

To upgrade existing packages to their latest compatible versions, use 'uv add --upgrade'. This command ensures that packages are updated and 'pyproject.toml' is synchronized.

```bash
uv add --upgrade
```

--------------------------------

### Running All Checks and Tests

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

Execute 'make check test' to first run the type checker ('make check') and, if it passes, subsequently run all tests ('make test'). This is a comprehensive validation step.

```bash
make check test
```

--------------------------------

### Implement MemoryStorageProvider Index Management (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This Python code snippet implements index management within the MemoryStorageProvider class. It initializes various dictionaries to hold different types of indexes and provides methods to asynchronously retrieve or create these indexes for a given conversation ID. It also includes methods to ensure all necessary indexes are created or dropped for a conversation.

```python
class MemoryStorageProvider[TMessage: IMessage](IStorageProvider[TMessage]):
    def __init__(self):
        # ... existing init ...
        self._conversation_indexes: dict[str, SemanticRefIndex] = {}
        self._property_indexes: dict[str, PropertyIndex] = {}
        self._timestamp_indexes: dict[str, TimestampToTextRangeIndex] = {}
        self._message_text_indexes: dict[str, MessageTextIndex] = {}
        self._related_terms_indexes: dict[str, RelatedTermsIndex] = {}
        self._conversation_threads: dict[str, ConversationThreads] = {}

    async def get_conversation_index(
        self, conversation_id: str
    ) -> ITermToSemanticRefIndex:
        if conversation_id not in self._conversation_indexes:
            self._conversation_indexes[conversation_id] = SemanticRefIndex()
        return self._conversation_indexes[conversation_id]

    async def get_related_terms_index(
        self, conversation_id: str
    ) -> ITermToRelatedTermsIndex:
        if conversation_id not in self._related_terms_indexes:
            # Use default settings for now
            from .reltermsindex import RelatedTermsIndex, RelatedTermIndexSettings
            settings = RelatedTermIndexSettings()
            self._related_terms_indexes[conversation_id] = RelatedTermsIndex(settings)
        return self._related_terms_indexes[conversation_id]

    async def get_conversation_threads(
        self, conversation_id: str
    ) -> IConversationThreads:
        if conversation_id not in self._conversation_threads:
            self._conversation_threads[conversation_id] = ConversationThreads()
        return self._conversation_threads[conversation_id]

    # ... similar methods for other index types ...
    async def create_indexes_for_conversation(
        self, conversation_id: str
    ) -> None:
        # Ensure all indexes exist for this conversation
        await self.get_conversation_index(conversation_id)
        await self.get_property_index(conversation_id)
        await self.get_timestamp_index(conversation_id)
        await self.get_message_text_index(conversation_id)
        await self.get_related_terms_index(conversation_id)
        await self.get_conversation_threads(conversation_id)

    async def drop_indexes_for_conversation(
        self, conversation_id: str
    ) -> None:
        self._conversation_indexes.pop(conversation_id, None)
        self._property_indexes.pop(conversation_id, None)
        self._timestamp_indexes.pop(conversation_id, None)
        self._message_text_indexes.pop(conversation_id, None)
        self._related_terms_indexes.pop(conversation_id, None)
        self._conversation_threads.pop(conversation_id, None)
```

--------------------------------

### Python Copyright Header

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

When creating a new Python file, a standard copyright and license header must be included at the top of the file. This ensures proper attribution and licensing for the code.

```python
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
```

--------------------------------

### Ingest EML email files into SQLite DB

Source: https://github.com/microsoft/typeagent-py/blob/main/docs/demos.md

This tool ingests email messages from .eml files in a specified directory into a SQLite database named gmail.db. It is an interactive tool, and the primary command is to add messages from a given path. The process can be time-consuming and may encounter errors with large files or timeouts.

```sh
python tools/test_email.py .
```

```text
@add_messages --path "email-folder"
```

--------------------------------

### Interface: IStorageProvider Index Methods

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This Python code snippet demonstrates the extended IStorageProvider interface, which includes methods for retrieving various types of indexes. These methods are crucial for accessing and managing indexes within the storage layer of the TypeAgent-Py project.

```python
from typing import Dict, List, Optional, Tuple
from datatypes import ScoredSemanticRefOrdinal, TextRange, Thread
from datatypes import SemanticRef
from vectorbase import VectorBase


class IStorageProvider:
    def get_semantic_ref_index(self) -> ITermToSemanticRefIndex:
        ...  # Placeholder for implementation

    def get_property_index(self) -> IPropertyToSemanticRefIndex:
        ...  # Placeholder for implementation

    def get_timestamp_to_text_range_index(self) -> ITimestampToTextRangeIndex:
        ...  # Placeholder for implementation

    def get_message_text_index(self) -> IMessageTextEmbeddingIndex:
        ...  # Placeholder for implementation

    def get_related_terms_index(self) -> ITermToRelatedTermsIndex:
        ...  # Placeholder for implementation

    def get_conversation_threads(self) -> IConversationThreads:
        ...  # Placeholder for implementation

    def get_embedding_index(self) -> EmbeddingIndex:
        ...  # Placeholder for implementation
```

--------------------------------

### Accessing Secondary Indexes via Coordinator

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/indexes_overview.md

Demonstrates the recommended way to access secondary indexes in TypeAgent-Py. It shows how to obtain the index coordinator and subsequently access specific indexes like `property_to_semantic_ref_index`. This pattern abstracts away the underlying storage provider details.

```python
# Access via the coordinator (recommended)
idx = conversation.secondary_indexes
prop_idx = idx.property_to_semantic_ref_index
# … use prop_idx, timestamp_index, message_index, etc.
```

--------------------------------

### Build Semantic Reference Index using Storage Provider (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This Python asynchronous function builds a semantic reference index by retrieving the conversation index from a storage provider. It maintains existing building logic but utilizes the index obtained from storage.

```python
async def build_semantic_ref[TMessage: IMessage](
    conversation: IConversation[TMessage, SemanticRefIndex],
    conversation_settings: importing.ConversationSettings,
    event_handler: IndexingEventHandlers | None = None,
) -> IndexingResults:
    # Get indexes from storage provider instead of conversation properties
    storage_provider = conversation.storage_provider
    conversation_index = await storage_provider.get_conversation_index(conversation.conversation_id)

    # Keep existing building logic, just use storage provider index
    result = IndexingResults()
    result.semantic_refs = await build_semantic_ref_index(
        conversation,
        conversation_settings.semantic_ref_index_settings,
        event_handler,
    )
    # ... rest of building logic stays the same ...
```

--------------------------------

### Loading Environment Variables

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

To load environment variables, particularly API keys, for ad-hoc code execution, call the 'typeagent.aitools.utils.load_dotenv()' function. This is useful for local development and testing.
```python
from typeagent.aitools.utils import load_dotenv

load_dotenv()
```

--------------------------------

### Test All Index Creation with Storage Provider (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This Python test function, using pytest and asyncio, verifies that all six index types can be created and accessed via the storage provider. The excerpt asserts that the conversation and property indexes are not `None`; the checks for the remaining index types are elided.

```python
# ✅ IMPLEMENTED in test/test_storage_indexes.py
@pytest.mark.asyncio
async def test_all_index_creation(storage, needs_auth):
    """Test that all 6 index types are created and accessible."""
    conv_index = await storage.get_conversation_index()
    assert conv_index is not None

    prop_index = await storage.get_property_index()
    assert prop_index is not None

    # ... tests for all index types
```

--------------------------------

### SQL: Create Usage Metrics and Query Performance Tables

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_future_extensions.md

Defines database schemas for tracking usage metrics and query performance. The UsageMetrics table stores named metrics with values and timestamps, while QueryPerformance tracks details about executed queries.
```sql
-- Usage metrics table
CREATE TABLE UsageMetrics (
    metric_id INTEGER PRIMARY KEY AUTOINCREMENT,
    metric_name TEXT NOT NULL,
    metric_value REAL NOT NULL,
    timestamp TEXT NOT NULL,
    metadata JSON
);

CREATE INDEX idx_usage_metrics_name_time ON UsageMetrics(metric_name, timestamp);

-- Query performance tracking
CREATE TABLE QueryPerformance (
    query_id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_type TEXT NOT NULL,
    duration_ms INTEGER NOT NULL,
    result_count INTEGER,
    timestamp TEXT NOT NULL,
    query_params JSON
);
```

--------------------------------

### Parameterizing Storage Provider Tests with Pytest

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This fixture demonstrates how to parameterize tests to run against multiple storage provider implementations (Memory and SQLite). Each test that uses the fixture runs once per provider, which allows cross-provider validation. Dependencies include pytest_asyncio and the embedding model.

```python
@pytest_asyncio.fixture(params=["memory", "sqlite"])
async def storage_provider_type(request, embedding_model, temp_db_path):
    # Returns both provider types, tests run twice - once per provider
    ...  # Fixture body is elided in the source
```

--------------------------------

### Build Timestamp Index using Storage Provider (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This Python asynchronous function constructs a timestamp index by fetching it from the storage provider. It then uses existing logic to add new messages to this index.
```python
async def build_timestamp_index(conversation: IConversation) -> ListIndexingResult:
    if conversation.messages:
        # Get timestamp index from storage provider
        storage_provider = conversation.storage_provider
        timestamp_index = await storage_provider.get_timestamp_index(conversation.conversation_id)

        # Use existing logic with storage provider index
        return await add_to_timestamp_index(
            timestamp_index,
            conversation.messages,
            0,
        )
    return ListIndexingResult(0)
```

--------------------------------

### Update Conversation Access Pattern (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

Demonstrates the refactoring of how to access conversation indexes, moving from direct property access to using new asynchronous getter methods on the conversation object.

```python
# Before:
# index = conversation.semantic_ref_index

# After:
index = await conversation.get_conversation_index()
```

--------------------------------

### Email Memory Integration in Python

Source: https://context7.com/microsoft/typeagent-py/llms.txt

Illustrates how to set up a specialized conversation for indexing email messages using TypeAgent's EmailMemory. This includes configuring conversation settings, storage providers, and enabling features like noise term filtering and verb synonyms. It requires the `typeagent` library and its email-related submodules.
```python
from typeagent.emails.email_memory import EmailMemory, EmailMemorySettings
from typeagent.emails.email_message import EmailMessage
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.storage.utils import create_storage_provider
import asyncio

async def create_email_conversation():
    # Create settings
    conversation_settings = ConversationSettings()
    email_settings = EmailMemorySettings(conversation_settings)

    # Create storage provider
    storage_provider = await create_storage_provider(
        message_text_settings=conversation_settings.message_text_index_settings,
        related_terms_settings=conversation_settings.related_term_index_settings,
        dbname="emails.db",
        message_type=EmailMessage
    )
    email_settings.conversation_settings.storage_provider = storage_provider

    # Create email memory (includes noise term filtering and verb synonyms)
    email_memory = await EmailMemory.create(
        settings=email_settings.conversation_settings,
        name="Corporate Inbox",
        tags=["work-email"]
    )

    # Add email messages
    email_messages = [
        EmailMessage(
            text_chunks=["Please review the quarterly report by EOD."],
            metadata={
                'sender': 'boss@company.com',
                'recipients': ['team@company.com'],
                'subject': 'Q1 Report Review'
            },
            timestamp="2025-01-15T09:00:00z"
        )
    ]

    result = await email_memory.add_messages_with_indexing(email_messages)
    print(f"Indexed {result.messages_added} emails, {result.semrefs_added} semantic refs")

    # Query emails (noise filtering applied automatically)
    answer = await email_memory.query("Who asked for the quarterly report?")
    print(answer)

asyncio.run(create_email_conversation())
```

--------------------------------

### Test All Index Creation (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

Tests the lazy creation of all seven index types within the MemoryStorageProvider. It asserts that the correct interface types are instantiated when accessed for a given conversation.
```python
import pytest
from typeagent.storage.memorystore import MemoryStorageProvider
from typeagent.knowpro.interfaces import (
    ITermToSemanticRefIndex,
    IPropertyToSemanticRefIndex,
    ITimestampToTextRangeIndex,
    IMessageTextIndex,
    ITermToRelatedTermsIndex,
    IConversationThreads
)

@pytest.mark.asyncio
async def test_all_index_creation():
    """Test that all 7 index types are created lazily."""
    storage = MemoryStorageProvider()

    # Test all index types
    conv_index = await storage.get_conversation_index("conv1")
    assert isinstance(conv_index, ITermToSemanticRefIndex)

    prop_index = await storage.get_property_index("conv1")
    assert isinstance(prop_index, IPropertyToSemanticRefIndex)

    time_index = await storage.get_timestamp_index("conv1")
    assert isinstance(time_index, ITimestampToTextRangeIndex)

    msg_index = await storage.get_message_text_index("conv1")
    assert isinstance(msg_index, IMessageTextIndex)

    rel_index = await storage.get_related_terms_index("conv1")
    assert isinstance(rel_index, ITermToRelatedTermsIndex)

    threads = await storage.get_conversation_threads("conv1")
    assert isinstance(threads, IConversationThreads)
```

--------------------------------

### Advanced Query with Custom Options in Python

Source: https://context7.com/microsoft/typeagent-py/llms.txt

Demonstrates how to perform advanced queries using TypeAgent with customizable search and answer generation options. This includes configuring search parameters like exact scope, verb scope, and semantic similarity, as well as answer generation parameters such as entity and topic limits. It requires the `typeagent` library and its submodules.
```python
from typeagent import create_conversation
from typeagent.transcripts.transcript import TranscriptMessage
from typeagent.knowpro.searchlang import (
    LanguageSearchOptions,
    LanguageQueryCompileOptions
)
from typeagent.knowpro.answers import AnswerContextOptions
import asyncio

async def advanced_query():
    conversation = await create_conversation("demo.db", TranscriptMessage)

    # Configure search options
    search_options = LanguageSearchOptions(
        compile_options=LanguageQueryCompileOptions(
            exact_scope=False,  # Allow fuzzy entity matching
            verb_scope=True,  # Match action verbs
            term_filter=None,  # No term filtering
            apply_scope=True  # Apply scoping rules
        ),
        exact_match=False,  # Enable semantic similarity
        max_message_matches=50,  # Maximum messages to retrieve
        max_knowledge_matches=100  # Maximum knowledge items
    )

    # Configure answer generation options
    answer_options = AnswerContextOptions(
        entities_top_k=50,  # Top entities to include
        topics_top_k=50,  # Top topics to include
        messages_top_k=None,  # No message limit
        chunking=None  # No text chunking
    )

    question = "What security features were discussed?"
    answer = await conversation.query(
        question=question,
        search_options=search_options,
        answer_options=answer_options
    )

    print(f"Q: {question}")
    print(f"A: {answer}")

asyncio.run(advanced_query())
```

--------------------------------

### Test Index Persistence Per Conversation (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

Ensures that accessing indexes for the same conversation multiple times returns the same instance, demonstrating persistence within the provider for a given conversation.
```python
@pytest.mark.asyncio
async def test_index_persistence_per_conversation():
    """Test that same index instance is returned for same conversation."""
    storage = MemoryStorageProvider()

    # All index types should return same instance for same conversation
    conv1_1 = await storage.get_conversation_index("conv1")
    conv1_2 = await storage.get_conversation_index("conv1")
    assert conv1_1 is conv1_2

    prop1_1 = await storage.get_property_index("conv1")
    prop1_2 = await storage.get_property_index("conv1")
    assert prop1_1 is prop1_2
```

--------------------------------

### Create Conversation with TypeAgent Python

Source: https://context7.com/microsoft/typeagent-py/llms.txt

Initializes a TypeAgent conversation, allowing for message storage and indexing. Supports both persistent SQLite databases and in-memory storage. Requires API keys and model configuration for LLM integration.

```python
from typeagent import create_conversation
from typeagent.transcripts.transcript import TranscriptMessage
import asyncio
import os

# Set up environment variables for OpenAI
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
os.environ['OPENAI_MODEL'] = 'gpt-4o'

async def main():
    # Create conversation with SQLite storage
    conversation = await create_conversation(
        dbname="my_conversation.db",
        message_type=TranscriptMessage,
        name="Team Meeting",
        tags=["project-discussion", "2025-q1"]
    )

    # Create in-memory conversation (no persistence)
    temp_conversation = await create_conversation(
        dbname=None,
        message_type=TranscriptMessage,
        name="Temporary Session"
    )

    print(f"Conversation created with {await conversation.messages.size()} messages")

asyncio.run(main())
```

--------------------------------

### Build Secondary Indexes (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/indexes_overview.md

Coordinates the building of all secondary indexes, handling dependencies between them and managing their lifecycle.
This function is responsible for initializing the complete set of secondary indexes for a conversation.

```python
async def build_secondary_indexes(
    conversation: IConversation,
    conversation_settings: ConversationSettings,
) -> SecondaryIndexingResults:
    # Controls building of all secondary indexes
    # Handles dependencies between indexes
    ...  # Function body is elided in the source
```

--------------------------------

### Python Type Hinting for String Literals

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

Use the 'Literal' type from the 'typing' module for unions of string literals in Python type hints. This provides precise type information for string constants.

```python
from typing import Literal

status: Literal['pending', 'completed', 'failed']
```

--------------------------------

### Route ConversationSecondaryIndexes Through Storage Provider (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

This Python code snippet demonstrates how to modify the ConversationSecondaryIndexes class to leverage a storage provider for retrieving indexes. It uses async properties to lazily load indexes from the provided storage_provider instance, ensuring that indexes are fetched only when needed and are managed centrally.
```python
class ConversationSecondaryIndexes[TMessage: IMessage](IConversationSecondaryIndexes[TMessage]):
    def __init__(self, storage_provider: IStorageProvider[TMessage], conversation_id: str):
        self._storage_provider = storage_provider
        self._conversation_id = conversation_id
        # Initialize all indexes through storage provider
        self._property_index: IPropertyToSemanticRefIndex | None = None
        self._timestamp_index: ITimestampToTextRangeIndex | None = None
        self._related_terms_index: ITermToRelatedTermsIndex | None = None
        self._threads: IConversationThreads | None = None
        self._message_index: IMessageTextIndex[TMessage] | None = None

    @property
    async def property_to_semantic_ref_index(self) -> IPropertyToSemanticRefIndex | None:
        if self._property_index is None:
            self._property_index = await self._storage_provider.get_property_index(self._conversation_id)
        return self._property_index

    @property
    async def timestamp_index(self) -> ITimestampToTextRangeIndex | None:
        if self._timestamp_index is None:
            self._timestamp_index = await self._storage_provider.get_timestamp_index(self._conversation_id)
        return self._timestamp_index

    # ... similar async properties for other indexes ...
```

--------------------------------

### Python Type Hinting for Structured Types

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

For classes that primarily serve as structured data containers (other than interfaces), use the 'dataclass' decorator from the 'dataclasses' module. This simplifies the creation of data-holding classes.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:
    user_id: str
    display_name: str | None = None
```

--------------------------------

### Storage Provider Interface Definition (Python)

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_immediate_implementation.md

Defines the `IStorageProvider` protocol for managing different types of indexes within a conversation.
It outlines methods for retrieving various index types, such as conversation index, property index, timestamp index, message text index, related terms index, and conversation threads. This interface serves as a contract for concrete storage implementations.

```python
class IStorageProvider[TMessage: IMessage](Protocol):
    # ... existing methods ...

    # Index getters - ALL 6 index types for this conversation
    async def get_conversation_index(self) -> ITermToSemanticRefIndex: ...
    async def get_property_index(self) -> IPropertyToSemanticRefIndex: ...
    async def get_timestamp_index(self) -> ITimestampToTextRangeIndex: ...
    async def get_message_text_index(self) -> IMessageTextIndex[TMessage]: ...
    async def get_related_terms_index(self) -> ITermToRelatedTermsIndex: ...
    async def get_conversation_threads(self) -> IConversationThreads: ...

    # ❌ TODO: Multi-conversation support when needed
    # async def create_indexes_for_conversation(
    #     self, conversation_id: str
    # ) -> None: ...
    # async def drop_indexes_for_conversation(
    #     self, conversation_id: str
    # ) -> None: ...
```

--------------------------------

### Implement Python Interface for Advanced Term Search

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_future_extensions.md

Defines a protocol for advanced term indexing and searching. Features include adding terms with variants, fuzzy searching, semantic similarity searches, and term suggestions.

```python
class IAdvancedTermIndex(Protocol):
    async def add_term_with_variants(
        self, term: str, semref_id: int, relevance_score: float = 1.0
    ) -> None: ...

    async def search_fuzzy(
        self, query: str, max_distance: int = 2
    ) -> list[tuple[int, float]]: ...  # (semref_id, relevance_score)

    async def search_semantic_similar(
        self, term: str, threshold: float = 0.8
    ) -> list[tuple[int, float]]: ...

    async def get_term_suggestions(
        self, partial_term: str, limit: int = 10
    ) -> list[str]: ...
```

--------------------------------

### SQL: Create SemanticRefIndex Table

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_spec.md

Defines the schema for the SemanticRefIndex table, used for indexing semantic references. It maps a normalized, lowercased term to a semantic reference ID, enabling efficient searching of semantic information. A composite primary key prevents duplicate entries for the same term and semref_id.

```sql
CREATE TABLE SemanticRefIndex (
    term TEXT NOT NULL,  -- lowercased, not-unique/normalized
    semref_id INTEGER NOT NULL,

    PRIMARY KEY (term, semref_id),
    FOREIGN KEY (semref_id) REFERENCES SemanticRefs(semref_id) ON DELETE CASCADE
);

CREATE INDEX idx_semantic_ref_index_term ON SemanticRefIndex(term);
```

--------------------------------

### Python Type Hinting for Aliased Types

Source: https://github.com/microsoft/typeagent-py/blob/main/AGENTS.md

For type aliases in Python, use the 'type' keyword. Type aliases should follow PascalCase naming conventions, similar to class names, for consistency.

```python
# Uses the `type` statement (Python 3.12+), as the guideline above prescribes
type UserIdentifier = str
```

--------------------------------

### Python: Storage Analytics Interface

Source: https://github.com/microsoft/typeagent-py/blob/main/spec/storage_future_extensions.md

Defines a protocol for storage analytics, enabling the recording of query performance and retrieval of usage statistics and performance metrics. Usage statistics are requested for a time window; performance metrics can optionally be filtered by query type.

```python
from typing import Any, Protocol

class IStorageAnalytics(Protocol):
    async def record_query_performance(
        self, query_type: str, duration_ms: int, result_count: int
    ) -> None: ...

    async def get_usage_stats(
        self, start_time: str, end_time: str
    ) -> dict[str, Any]: ...

    async def get_performance_metrics(
        self, query_type: str | None = None
    ) -> dict[str, Any]: ...
```