### Install ContextGem from Source

Source: https://contextgem.dev/installation

Clones the ContextGem repository from GitHub and installs it in editable mode, suitable for local development or custom builds.

```Shell
git clone https://github.com/shcherbak-ai/contextgem.git
cd contextgem
pip install -e .
```

--------------------------------

### Verify ContextGem installation and version

Source: https://contextgem.dev/_sources/installation

After installation, run this command to confirm that ContextGem is correctly installed and to display its version number.

```bash
python -c "import contextgem; print(contextgem.__version__)"
```

--------------------------------

### Install ContextGem from source via Git

Source: https://contextgem.dev/_sources/installation

To install ContextGem directly from its source code, clone the GitHub repository and then install it in editable mode using pip.

```bash
git clone https://github.com/shcherbak-ai/contextgem.git
cd contextgem
pip install -e .
```

--------------------------------

### Set up ContextGem for development using Poetry

Source: https://contextgem.dev/_sources/installation

For development purposes, ContextGem utilizes Poetry. This setup involves installing Poetry, installing project dependencies including development extras, and activating the virtual environment.

```bash
# Install poetry if you don't have it
pip install poetry

# Install dependencies including development extras
poetry install --with dev

# Activate the virtual environment
poetry shell
```

--------------------------------

### Development Installation with Poetry

Source: https://contextgem.dev/installation

Sets up ContextGem for development using Poetry, including installing Poetry itself, resolving and installing project dependencies with development extras, and activating the project's virtual environment.
```Shell
# Install poetry if you don't have it
pip install poetry

# Install dependencies including development extras
poetry install --with dev

# Activate the virtual environment
poetry shell
```

--------------------------------

### Install ContextGem from PyPI

Source: https://contextgem.dev/installation

Installs or upgrades the ContextGem library using the Python package installer, pip, directly from the Python Package Index (PyPI).

```Shell
pip install -U contextgem
```

--------------------------------

### Verify ContextGem Installation

Source: https://contextgem.dev/installation

Executes a Python command to import the ContextGem library and print its version, confirming that the installation was successful and the library is accessible.

```Python
import contextgem; print(contextgem.__version__)
```

--------------------------------

### Create JsonObjectExample for LLM Guidance

Source: https://contextgem.dev/api/examples

Demonstrates how to create an instance of `JsonObjectExample` using `contextgem` classes. This example shows the basic initialization for guiding LLM extraction tasks.

```Python
from contextgem import JsonObjectConcept, JsonObjectExample

# Create a JSON object example
json_example = JsonObjectExample(
```

--------------------------------

### Python Example: Creating and Attaching String Examples to a StringConcept

Source: https://contextgem.dev/api/examples

Demonstrates how to create `StringExample` instances with specific content and attach them to a `StringConcept` object. This illustrates how examples guide LLM extraction by providing concrete illustrations of expected information for a concept.
```python
from contextgem import StringConcept, StringExample

# Create string examples
string_examples = [
    StringExample(content="X (Client)"),
    StringExample(content="Y (Supplier)"),
]

# Attach string examples to a StringConcept
string_concept = StringConcept(
    name="Contract party name and role",
    description="The name and role of the contract party",
    examples=string_examples,  # Attach the example to the concept (optional)
)
```

--------------------------------

### Install ContextGem from PyPI using pip

Source: https://contextgem.dev/_sources/installation

The simplest way to install ContextGem is via pip. This command installs or upgrades ContextGem from the Python Package Index.

```bash
pip install -U contextgem
```

--------------------------------

### Python Example: Extracting Concepts from Documents

Source: https://contextgem.dev/llms/llm_extraction_methods

An example demonstrating the initial setup and imports required to use ContextGem for extracting concepts directly from documents.

```Python
# ContextGem: Extracting Concepts Directly from Documents

import os

from contextgem import Document, DocumentLLM, NumericalConcept, StringConcept
```

--------------------------------

### Extraction Pipeline Example (Instructor)

Source: https://contextgem.dev/_sources/vs_other_frameworks

Shows an extraction pipeline using Instructor, a library focused on structured outputs with Pydantic. This example highlights its strength in structured data extraction but also the need for manual work in building complex pipelines, including comprehensive prompt engineering, Pydantic model definition, custom assembly of components, manual reference mapping, and additional setup for concurrency and cost tracking.
```python
# See file: ../../dev/usage_examples/vs_other_frameworks/advanced/instructor.py
```

--------------------------------

### Initialize Document and Import Classes for StringConcept with Examples in Python

Source: https://contextgem.dev/concepts/string_concept

Illustrates the initial setup for using `StringConcept` with examples, including importing required ContextGem classes and creating a `Document` object from a sample contract text. This forms the basis for defining and applying string concepts.

```Python
# ContextGem: StringConcept Extraction with Examples

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Create a Document object from text
contract_text = """
SERVICE AGREEMENT
This Service Agreement (the "Agreement") is entered as of January 15, 2025 by and between:
XYZ Innovations Inc., a Delaware corporation with offices at 123 Tech Avenue, San Francisco, CA
("Provider"), and
Omega Enterprises LLC, a New York limited liability company with offices at 456 Business Plaza,
New York, NY ("Customer").
"""
doc = Document(raw_text=contract_text)
```

--------------------------------

### API: StringExample Class Definition

Source: https://contextgem.dev/_modules/contextgem/public/examples

Defines the StringExample class, a Pydantic model for representing string-based examples used to guide LLM extraction tasks. It contains a 'content' field for the example text, which must be a non-empty string. This class can be attached to a StringConcept.

```python
class StringExample(_Example):
    """
    Represents a string example that can be provided by users for certain extraction tasks.

    :ivar content: A non-empty string that holds the text content of the example.
    :type content: NonEmptyStr

    Note:
        Examples are optional and can be used to guide LLM extraction tasks. They serve as reference
        points for the model to understand the expected format and content of extracted information.
        StringExample can be attached to a :class:`~contextgem.public.concepts.StringConcept`.
    """

    content: NonEmptyStr
```

--------------------------------

### Extraction Pipeline Example (LangChain)

Source: https://contextgem.dev/_sources/vs_other_frameworks

Illustrates an extraction pipeline using LangChain, a flexible framework for LLM applications. While powerful, this example highlights the development overhead for complex extraction workflows, including manual prompt engineering, Pydantic model definition, complex chain configuration, custom reference mapping, and additional setup for concurrency and cost tracking.

```python
# See file: ../../dev/usage_examples/vs_other_frameworks/advanced/langchain.py
```

--------------------------------

### API Documentation: StringExample Class Definition

Source: https://contextgem.dev/api/examples

Documents the `StringExample` class, which represents a string-based example for LLM extraction tasks. It details its inheritance, variables, parameters, and notes on its usage, including its `content` property.

```APIDOC
class contextgem.public.examples.StringExample(**data)
  Bases: _Example
  Variables:
    content: A non-empty string that holds the text content of the example.
  Parameters:
    custom_data (dict)
    content (NonEmptyStr)
  Note:
    Examples are optional and can be used to guide LLM extraction tasks. They serve as reference points for the model to understand the expected format and content of extracted information. StringExample can be attached to a StringConcept.
  Properties:
    content: NonEmptyStr
```

--------------------------------

### API Documentation for JsonObjectExample Class

Source: https://contextgem.dev/api/examples

Detailed API documentation for the `contextgem.public.examples.JsonObjectExample` class, which represents a JSON object example for LLM extraction tasks.
```APIDOC
class contextgem.public.examples.JsonObjectExample(**data)
  Bases: _Example
  Description: Represents a JSON object example that can be provided by users for certain extraction tasks.
  Variables:
    content: A JSON-serializable dict with the minimum length of 1 that holds the content of the example.
  Parameters:
    custom_data (dict)
    content (dict[str, Any])
  Note:
    Examples are optional and can be used to guide LLM extraction tasks. They serve as reference points for the model to understand the expected format and content of extracted information. JsonObjectExample can be attached to a JsonObjectConcept.
```

--------------------------------

### API Documentation: StringExample.clone() Method

Source: https://contextgem.dev/api/examples

Documents the `clone` method of the `StringExample` class, which creates and returns a deep copy of the current instance. This method is useful for duplicating example objects.

```APIDOC
contextgem.public.examples.StringExample.clone()
  Description: Creates and returns a deep copy of the current instance.
  Returns: A deep copy of the current instance.
  Return type: typing.Self
```

--------------------------------

### ContextGem DocumentLLMGroup Workflow Example

Source: https://contextgem.dev/_sources/how_it_works

An example demonstrating the configuration of a `DocumentLLMGroup` with three distinct LLMs (LLM 1, LLM 2, LLM 3), each assigned a specific role (extractor_text, reasoner_text, extractor_vision), model, task, and optional fallback LLM, illustrating a practical multi-LLM extraction setup.
```APIDOC
LLM Group Workflow Example:

LLM 1:
  Role: extractor_text
  Model: gpt-4o-mini
  Task: Extract payment terms from a contract
  Fallback LLM (optional): gpt-3.5-turbo

LLM 2:
  Role: reasoner_text
  Model: gpt-4o
  Task: Detect anomalies in the payment terms
  Fallback LLM (optional): claude-3-5-sonnet

LLM 3:
  Role: extractor_vision
  Model: gpt-4o-mini
  Task: Extract invoice amounts
  Fallback LLM (optional): gpt-4o
```

--------------------------------

### NumericalConcept Extraction with References and Justifications Setup

Source: https://contextgem.dev/concepts/numerical_concept

This Python code snippet provides the initial setup for demonstrating advanced usage of `NumericalConcept` extraction, specifically focusing on how to enable and configure justifications and references. It includes the necessary imports from the `contextgem` library.

```python
import os

from contextgem import Document, DocumentLLM, NumericalConcept
```

--------------------------------

### Extraction Pipeline Example (ContextGem)

Source: https://contextgem.dev/_sources/vs_other_frameworks

Demonstrates ContextGem's simplified, declarative syntax for defining multi-LLM extraction pipelines. It highlights features like automated token counting, cost calculation, built-in concurrency, easy example definition, and unified result aggregation, reducing development overhead for complex workflows.

```python
# See file: ../../dev/usage_examples/docs/advanced/advanced_multiple_docs_pipeline.py
```

--------------------------------

### API Reference: Concept examples attribute

Source: https://contextgem.dev/genindex

Documents the 'examples' attribute for JsonObjectConcept and StringConcept, providing sample data or usage examples relevant to these concept types.
```APIDOC
contextgem.public.concepts.JsonObjectConcept.examples (attribute)
contextgem.public.concepts.StringConcept.examples (attribute)
```

--------------------------------

### Extracting Concepts from Documents using Vision Capabilities with ContextGem

Source: https://contextgem.dev/_sources/quickstart

This Python example illustrates ContextGem's vision capabilities for extracting structured data from documents with complex layouts or images. It shows how to process scanned contracts or analyze information from charts and graphs by providing an image path and a target schema.

```python
from contextgem import ContextGem

# Initialize ContextGem for vision-based concept extraction
gem = ContextGem(api_key="YOUR_API_KEY")

# Example: Extract data from a scanned contract image
# Assume 'scanned_contract.png' is a path to an image file
image_path = "path/to/scanned_contract.png"
contract_schema = {
    "contract_id": "string",
    "party_names": "array",
    "effective_date": "string"
}
extracted_contract_data = gem.extract_concepts(image_path=image_path, schema=contract_schema)
print("Extracted Vision Concepts (Contract):", extracted_contract_data)

# Example: Identify information from a chart in a report image
# Assume 'report_chart.jpg' is a path to an image file
chart_path = "path/to/report_chart.jpg"
chart_schema = {
    "chart_title": "string",
    "data_points": "array"
}
extracted_chart_info = gem.extract_concepts(image_path=chart_path, schema=chart_schema)
print("Extracted Vision Concepts (Chart):", extracted_chart_info)
```

--------------------------------

### Initialize ContextGem for JsonObjectConcept with Examples

Source: https://contextgem.dev/concepts/json_object_concept

Partial code snippet showing initial imports for using `JsonObjectConcept` and `JsonObjectExample` within ContextGem, typically for providing examples to improve extraction accuracy for complex schemas.
```python
# ContextGem: JsonObjectConcept Extraction with Examples

import os
from pprint import pprint

from contextgem import Document, DocumentLLM, JsonObjectConcept, JsonObjectExample
```

--------------------------------

### Sphinx Automodule Directive for ContextGem Examples API

Source: https://contextgem.dev/_sources/api/examples

This Sphinx `automodule` directive is used to automatically generate comprehensive API documentation for the `contextgem.public.examples` Python module. It includes all public and undocumented members, shows inheritance relationships, and excludes specific Pydantic model configuration attributes (`model_config`, `model_post_init`) to keep the documentation focused on core API functionality.

```APIDOC
.. automodule:: contextgem.public.examples
   :members:
   :undoc-members:
   :show-inheritance:
   :inherited-members:
   :exclude-members: model_config, model_post_init
```

--------------------------------

### Example of Optimizing LLM Extraction for Cost in Python

Source: https://contextgem.dev/optimizations/optimization_cost

This Python example demonstrates how to initialize a DocumentLLM instance with custom pricing details using `LLMPricing` to enable cost tracking. It shows how to retrieve and print the usage and cost details after performing extractions, allowing developers to monitor token consumption and overall expenses.

```Python
# Example of optimizing extraction for cost

import os

from contextgem import DocumentLLM, LLMPricing

llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
    pricing_details=LLMPricing(
        input_per_1m_tokens=0.150,
        output_per_1m_tokens=0.600,
    ),  # add pricing details to track costs
)

# ... use the LLM for extraction ...

# ... monitor usage and cost ...
usage = llm.get_usage()  # get the usage details, including tokens and calls' details.
cost = llm.get_cost()  # get the cost details, including input, output, and total costs.
print(usage)
print(cost)
```

--------------------------------

### LangChain: Initial Setup for Anomaly Extraction

Source: https://contextgem.dev/vs_other_frameworks

This Python snippet provides the initial imports and class definitions for implementing anomaly extraction using LangChain. It highlights the need for manual setup, including defining Pydantic models for structured output and importing various components for prompt engineering, output parsing, and runnable chains, which contrasts with ContextGem's more integrated approach.

```python
# LangChain implementation for extracting anomalies from a document, with source references and justifications

import os
from textwrap import dedent
from typing import Optional

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
```

--------------------------------

### Extraction Pipeline Example (LlamaIndex)

Source: https://contextgem.dev/_sources/vs_other_frameworks

Presents an extraction pipeline built with LlamaIndex, a robust data framework for LLM applications. This example demonstrates its capabilities while pointing out the manual effort required for complex extraction, such as crafting prompts, defining Pydantic models, configuring pipeline components, and custom solutions for fine-grained reference tracking, concurrency, and cost tracking.

```python
# See file: ../../dev/usage_examples/vs_other_frameworks/advanced/llama_index.py
```

--------------------------------

### ContextGem Advanced Extraction Pipeline Example

Source: https://contextgem.dev/vs_other_frameworks

This Python example demonstrates an advanced extraction workflow using ContextGem. It shows how to analyze multiple documents concurrently within a single pipeline, leveraging different LLMs, and includes built-in cost tracking.
ContextGem handles boilerplate code automatically, simplifying complex LLM extraction tasks.

```Python
# Advanced Usage Example - analyzing multiple documents with a single pipeline,
# with different LLMs, concurrency and cost tracking

import os

from contextgem import (
    Aspect,
    DateConcept,
    Document,
    DocumentLLM,
    DocumentLLMGroup,
    DocumentPipeline,
    JsonObjectConcept,
    JsonObjectExample,
    LLMPricing,
    NumericalConcept,
    RatingConcept,
    RatingScale,
    StringConcept,
    StringExample,
)

# Construct documents

# Document 1 - Consultancy Agreement (shortened for brevity)
doc1 = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "All intellectual property created during the provision of services shall belong to the Customer...\n"
        "This agreement is governed by the laws of Norway...\n"
        "Annex 1: Data processing agreement...\n"
        "Annex 2: Statement of Work...\n"
        "Annex 3: Service Level Agreement...\n"
    ),
)
```

--------------------------------

### Define JSON Structure and Concept with ContextGem

Source: https://contextgem.dev/api/examples

This snippet demonstrates how to create a JsonObjectExample with sample data, define a Python class (PersonInfo) to represent the expected JSON structure, and then use JsonObjectConcept to associate the structure with the example for validation and conceptualization.
```python
from contextgem import JsonObjectConcept, JsonObjectExample

# Create a JSON object example
json_example = JsonObjectExample(
    content={
        "name": "John Doe",
        "education": "Bachelor's degree in Computer Science",
        "skills": ["Python", "Machine Learning", "Data Analysis"],
        "hobbies": ["Reading", "Traveling", "Gaming"],
    }
)


# Define a structure for JSON object concept
class PersonInfo:
    name: str
    education: str
    skills: list[str]
    hobbies: list[str]


# Also works as a dict with type hints, e.g.
# PersonInfo = {
#     "name": str,
#     "education": str,
#     "skills": list[str],
#     "hobbies": list[str],
# }

# Attach JSON example to a JsonObjectConcept
json_concept = JsonObjectConcept(
    name="Candidate info",
    description="Structured information about a job candidate",
    structure=PersonInfo,  # Define the expected structure
    examples=[json_example],  # Attach the example to the concept (optional)
)
```

--------------------------------

### Python Example for Processing Documents with Concurrency and Cost Tracking

Source: https://contextgem.dev/vs_other_frameworks

This Python example demonstrates how to process multiple contract documents using a 'process_document' function with concurrency enabled. It initializes a 'CostTracker' to monitor API costs and then prints the detailed analysis results for each document, followed by a summary of the processing costs per model.
```python
# Example usage
# Sample contract texts (shortened for brevity)
doc1_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "All intellectual property created during the provision of services shall belong to the Customer...\n"
    "This agreement is governed by the laws of Norway...\n"
    "Annex 1: Data processing agreement...\n"
    "Annex 2: Statement of Work...\n"
    "Annex 3: Service Level Agreement...\n"
)

doc2_text = (
    "Service Level Agreement\n"
    "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
    "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
    "The Provider shall deliver IT support services as outlined in Schedule A...\n"
    "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
    "The Provider guarantees [99.9%] uptime for all critical systems...\n"
    "Either party may terminate with 60 days written notice...\n"
    "This agreement is governed by the laws of California...\n"
    "Schedule A: Service Descriptions...\n"
    "Schedule B: Response Time Requirements...\n"
)

# Create cost tracker
cost_tracker = CostTracker()

# Process documents
print("Processing document 1 with concurrency...")
doc1_results = process_document(doc1_text, cost_tracker, use_concurrency=True)

print("Processing document 2 with concurrency...")
doc2_results = process_document(doc2_text, cost_tracker, use_concurrency=True)

# Print results
print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
print_document_results("Document 2 (Service Level Agreement)", doc2_results)

# Print cost information
print("\nProcessing costs:")
costs = cost_tracker.get_costs()
for model, model_data in costs["model_costs"].items():
    print(f"\n{model}:")
    print(f"  Input cost: ${model_data['input_cost']:.4f}")
    print(f"  Output cost: ${model_data['output_cost']:.4f}")
    print(f"  Total cost: ${model_data['total_cost']:.4f}")

print(f"\nTotal across all models: ${costs['total_cost']:.4f}")
```

--------------------------------

### ContextGem: Advanced Multi-Document Extraction with LLM Pipelines

Source: https://contextgem.dev/advanced_usage

This advanced Python example demonstrates how to configure and use a `DocumentPipeline` for efficient data extraction from multiple documents. It highlights the use of `DocumentLLMGroup` for managing different LLMs, `LLMPricing` for cost tracking, and various concept types for defining extraction targets, enabling scalable and concurrent processing.

```python
# Advanced Usage Example - analyzing multiple documents with a single pipeline,
# with different LLMs, concurrency and cost tracking

import os

from contextgem import (
    Aspect,
    DateConcept,
    Document,
    DocumentLLM,
    DocumentLLMGroup,
    DocumentPipeline,
    JsonObjectConcept,
    JsonObjectExample,
    LLMPricing,
    NumericalConcept,
    RatingConcept,
    RatingScale,
    StringConcept,
    StringExample,
)

# Construct documents
```

--------------------------------

### API Documentation: StringExample.from_disk() Class Method

Source: https://contextgem.dev/api/examples

Documents the `from_disk` class method for `StringExample`, which loads an instance from a JSON file stored on disk. It specifies parameters, return type, and potential exceptions during file loading and deserialization.

```APIDOC
classmethod contextgem.public.examples.StringExample.from_disk(file_path)
  Description: Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.
  Parameters:
    file_path (str): Path to the JSON file to load (must end with ‘.json’).
  Returns: An instance of the class populated with the data from the file.
  Return type: Self
  Raises:
    ValueError: If the file path doesn’t end with ‘.json’.
    OSError: If there’s an error reading the file.
    RuntimeError: If deserialization fails.
```

--------------------------------

### API Documentation for StringExample Class Methods

Source: https://contextgem.dev/api/examples

Detailed API documentation for methods and properties of a class, likely `StringExample`, covering serialization, deserialization, and data transformation.

```APIDOC
classmethod from_json(json_string)
  Description: Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.
  Parameters:
    json_string (str): JSON string containing the serialized object data.
  Returns: A new instance of the class with restored state. (Type: Self)
  Raises:
    TypeError: If the class name in the serialized data doesn’t match.
```

```APIDOC
to_dict()
  Description: Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes:
    - All public attributes
    - Special handling for specific public and private attributes.
    When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.
  Returns: A dictionary representation of the current object with all necessary data for serialization (Type: dict[str, Any])
```

```APIDOC
to_disk(file_path)
  Description: Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.
  Parameters:
    file_path (str): Path where the JSON file should be saved (must end with ‘.json’).
  Returns: None
  Raises:
    ValueError: If the file path doesn’t end with ‘.json’.
    IOError: If there’s an error during the file writing process.
```

```APIDOC
to_json()
  Description: Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.
  Returns: A JSON string representation of the object. (Type: str)
```

```APIDOC
property unique_id (str)
  Description: Returns the ULID of the instance.
```

```APIDOC
custom_data (dict)
```

--------------------------------

### Initialize OpenAI Clients with Instructor Integration

Source: https://contextgem.dev/vs_other_frameworks

These utility functions provide synchronous and asynchronous methods to initialize OpenAI API clients. They wrap each client with the `instructor` library for structured outputs and read the API key from the `CONTEXTGEM_OPENAI_API_KEY` environment variable if none is provided.

```Python
import os

import instructor
from openai import AsyncOpenAI, OpenAI


def get_client(api_key=None):
    """Get an OpenAI client with instructor integrated"""
    api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
    client = OpenAI(api_key=api_key)
    return instructor.from_openai(client)


async def get_async_client(api_key=None):
    """Get an AsyncOpenAI client with instructor integrated"""
    api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
    client = AsyncOpenAI(api_key=api_key)
    return instructor.from_openai(client)
```

--------------------------------

### Example of optimizing extraction for accuracy in ContextGem

Source: https://contextgem.dev/optimizations/optimization_accuracy

This Python example demonstrates how to configure a ContextGem Document for improved extraction accuracy. It shows how to specify a larger SAT segmentation model, enable SAT-based paragraph segmentation, and define a StringConcept with justifications and examples to guide the LLM's extraction process.
```Python
# Example of optimizing extraction for accuracy

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Define document
doc = Document(
    raw_text="Non-Disclosure Agreement...",
    sat_model_id="sat-6l-sm",  # default is "sat-3l-sm"
    paragraph_segmentation_mode="sat",  # default is "newlines"
    # sentence segmentation mode is always "sat", as other approaches proved to be less accurate
)

# Define document concepts
doc.concepts = [
    StringConcept(
        name="Title",  # A very simple concept, just an example for testing purposes
        description="Title of the document",
        add_justifications=True,  # enable justifications
        justification_depth="brief",  # default
        examples=[
            StringExample(
                content="Supplier Agreement",
            )
        ],
    ),
    # ... add other concepts ...
]

# ... attach other aspects/concepts to the document ...
```

--------------------------------

### Instructor Library Overview and Development Overhead

Source: https://contextgem.dev/vs_other_frameworks

This section provides an overview of the Instructor library, highlighting its focus on structured outputs from LLMs with Pydantic typing. It also details the development overheads associated with using Instructor, such as manual prompt engineering, model definition, pipeline assembly, and cost tracking setup.

```APIDOC
Instructor

Instructor is a powerful library focused on structured outputs from LLMs with strong typing support through Pydantic. It excels at extracting structured data with validation, but requires additional work to build complex extraction pipelines.

Development overhead:
  * ⮚ Manual prompt engineering: Crafting comprehensive prompts that guide the LLM effectively
  * ⚙ Manual model definition: Developers must define Pydantic validation models for structured output
  * ⚔ Manual pipeline assembly: Requires custom code to connect extraction components involving multiple LLMs
  * 🔍 Manual reference mapping: Must implement custom logic to track source references
  * 📊 Embedding examples in prompts: Examples must be manually incorporated into prompts
  * 🔄 Complex concurrency setup: Implementing concurrent processing requires additional setup with asyncio
  * 💰 Cost tracking setup: Requires custom logic for cost tracking for each LLM
```

--------------------------------

### Anomaly Extraction with LlamaIndex RAG Setup

Source: https://contextgem.dev/_sources/vs_other_frameworks

This example illustrates anomaly extraction using LlamaIndex configured with a Retrieval-Augmented Generation (RAG) setup. While powerful for knowledge-intensive applications and complex document interactions, this approach requires more manual configuration and specialized setup for structured extraction tasks compared to streamlined alternatives.

```python
# See file: ../../dev/usage_examples/vs_other_frameworks/basic/llama_index_rag.py
```

--------------------------------

### JsonObjectExample Class API Reference

Source: https://contextgem.dev/api/examples

Comprehensive API documentation for the JsonObjectExample class, detailing its constructor and methods for cloning, serialization, and deserialization from various formats.

```APIDOC
Class: JsonObjectExample

Constructor: __init__(**kwargs)
  Description: Create a new model by parsing and validating input data from keyword arguments.
  Raises: ValidationError if the input data cannot be validated to form a valid model.
  Notes: self is explicitly positional-only to allow self as a field name.
Properties:
  content: dict[str, Any]

Method: clone()
  Description: Creates and returns a deep copy of the current instance.
  Returns: A deep copy of the current instance.
  Return Type: typing.Self

Method: from_dict(obj_dict: dict[str, Any]) (classmethod)
  Description: Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.
  Parameters:
    obj_dict (dict[str, Any]): Dictionary containing the serialized object data.
  Returns: A new instance of the class with restored attributes.
  Return Type: Self

Method: from_disk(file_path: str) (classmethod)
  Description: Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.
  Parameters:
    file_path (str): Path to the JSON file to load (must end with ‘.json’).
  Returns: An instance of the class populated with the data from the file.
  Return Type: Self
  Raises:
    ValueError: If the file path doesn’t end with ‘.json’.
    OSError: If there’s an error reading the file.
    RuntimeError: If deserialization fails.

Method: from_json(json_string: str) (classmethod)
  Description: Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.
  Parameters:
    json_string (str): JSON string containing the serialized object data.
  Returns: A new instance of the class with restored state.
  Return Type: Self
  Raises:
    TypeError: If the class name in the serialized data doesn’t match.
Method: to_dict()
  Description: Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes all public attributes.
```

--------------------------------

### Define String Concept for Party Names and Roles

Source: https://contextgem.dev/api/concepts

This Python example demonstrates how to define a `StringConcept` using the `contextgem` library. It creates a concept named 'Party names and roles' to identify contractual parties and their roles, including an example to guide the extraction format.

```python
from contextgem import StringConcept, StringExample

# Define a string concept for identifying contract party names
# and their roles in the contract
party_names_and_roles_concept = StringConcept(
    name="Party names and roles",
    description=(
        "Names of all parties entering into the agreement "
        "and their contractual roles"
    ),
    examples=[
        StringExample(
            content="X (Client)",  # guidance regarding format
        )
    ],
)
```

--------------------------------

### Initialize Document Objects with Raw Text

Source: https://contextgem.dev/advanced_usage

Demonstrates how to create `Document` instances, populating them with raw text content from legal agreements like Consultancy Agreements and Service Level Agreements. This sets up the initial data for processing within the document pipeline.
```Python
from contextgem import Document

doc1 = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "All intellectual property created during the provision of services shall belong to the Customer...\n"
        "This agreement is governed by the laws of Norway...\n"
        "Annex 1: Data processing agreement...\n"
        "Annex 2: Statement of Work...\n"
        "Annex 3: Service Level Agreement...\n"
    ),
)

doc2 = Document(
    raw_text=(
        "Service Level Agreement\n"
        "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
        "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
        "The Provider shall deliver IT support services as outlined in Schedule A...\n"
        "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
        "The Provider guarantees [99.9%] uptime for all critical systems...\n"
        "Either party may terminate with 60 days written notice...\n"
        "This agreement is governed by the laws of California...\n"
        "Schedule A: Service Descriptions...\n"
        "Schedule B: Response Time Requirements...\n"
    ),
)
```

--------------------------------

### Extracting Product Rating with RatingConcept (Partial)

Source: https://contextgem.dev/concepts/rating_concept

This example demonstrates the initial setup for using RatingConcept to extract a product rating, showing the necessary imports from the ContextGem library.

```Python
# ContextGem: RatingConcept Extraction

import os

from contextgem import Document, DocumentLLM, RatingConcept, RatingScale
```

--------------------------------

### Define String Concepts for Termination Details

Source: https://contextgem.dev/advanced_usage

Defines three `StringConcept` objects: 'Termination for Cause', 'Notice Period', and 'Severance Package'.
Each concept includes a description, optional examples to guide the LLM, and settings to add references at the sentence level, enabling precise extraction of specific clauses related to employment termination.

```python
from contextgem import StringConcept, StringExample

termination_for_cause = StringConcept(
    name="Termination for Cause",
    description="Conditions under which the company can terminate the employee for cause.",
    examples=[  # optional, examples help the LLM to understand the concept better
        StringExample(content="Employee may be terminated for misconduct"),
        StringExample(content="Termination for breach of contract"),
    ],
    add_references=True,
    reference_depth="sentences",
)

notice_period = StringConcept(
    name="Notice Period",
    description="Required notification period before employment termination.",
    add_references=True,
    reference_depth="sentences",
)

severance_terms = StringConcept(
    name="Severance Package",
    description="Compensation and benefits provided upon termination.",
    add_references=True,
    reference_depth="sentences",
)
```

--------------------------------

### API: JsonObjectExample Class Definition

Source: https://contextgem.dev/_modules/contextgem/public/examples

Defines the JsonObjectExample class, a Pydantic model for representing JSON object examples used to guide LLM extraction tasks. It includes a 'content' field for the JSON-serializable dictionary (minimum length 1) and a validator to ensure its serializability. This class can be attached to a JsonObjectConcept.

```python
class JsonObjectExample(_Example):
    """
    Represents a JSON object example that can be provided by users for certain extraction tasks.

    :ivar content: A JSON-serializable dict with the minimum length of 1 that holds the content
        of the example.
    :type content: dict[str, Any]

    Note:
        Examples are optional and can be used to guide LLM extraction tasks. They serve as reference
        points for the model to understand the expected format and content of extracted information.
        JsonObjectExample can be attached to a :class:`~contextgem.public.concepts.JsonObjectConcept`.
    """

    content: dict[str, Any] = Field(default_factory=dict, min_length=1)

    @field_validator("content")
    @classmethod
    def _validate_content_serializable(cls, value: dict[str, Any]) -> dict[str, Any]:
        """
        Validates that the `content` field is serializable to JSON.

        :param value: The value of the `content` field to validate.
        :type value: dict[str, Any]
        :return: The validated `content` value.
        :rtype: dict[str, Any]
        :raises ValueError: If the `content` value is not serializable.
        """
        if not _is_json_serializable(value):
            raise ValueError(f"`content` must be JSON serializable.")
        return value
```

--------------------------------

### LlamaIndex: Structured Anomaly Extraction Program

Source: https://contextgem.dev/_sources/vs_other_frameworks

This LlamaIndex example demonstrates structured data extraction, specifically for anomalies, outside of a RAG setup. It necessitates manual definition of Pydantic models, explicit prompt construction, and the use of an output parser. While powerful for data indexing, it requires more manual setup for direct structured extraction tasks.

```python
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from llama_index.llms.openai import OpenAI


# 1. Define Pydantic model for structured output
class Anomaly(BaseModel):
    type: str = Field(description="Type of anomaly (e.g., symptom, observation)")
    description: str = Field(description="Detailed description of the anomaly")
    location: str = Field(description="Approximate location or context in the document")


class Anomalies(BaseModel):
    anomalies: list[Anomaly] = Field(description="List of anomalies found in the document")


# 2. Initialize the LLM
llm = OpenAI(model="gpt-3.5-turbo")

# 3. Initialize output parser with the Pydantic model
parser = PydanticOutputParser(output_cls=Anomalies)

# 4. Define the prompt template string
# (the output parser appends its own format instructions to the prompt,
# so the template needs no placeholder for them)
prompt_template_str = """
Extract all anomalies from the following document.

Document: {document_text}
"""

# 5. Create the LLMTextCompletionProgram
program = LLMTextCompletionProgram.from_defaults(
    output_parser=parser,
    prompt_template_str=prompt_template_str,
    llm=llm,
    verbose=True,
)

# 6. Define the document text
document_text = "The patient presented with unusual symptoms: high fever, persistent cough, and severe fatigue. No rash was observed."

# 7. Execute the program to extract anomalies
result = program(document_text=document_text)
print(result)
```

--------------------------------

### JsonObjectExample Class API Reference

Source: https://contextgem.dev/api/examples

Provides a comprehensive reference for the JsonObjectExample class, detailing its methods and properties for object serialization, file I/O, and unique identification.

```APIDOC
JsonObjectExample Class:

to_dict()
  Description: Special handling for specific public and private attributes. When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.
  Returns: A dictionary representation of the current object with all necessary data for serialization.
  Return Type: dict[str, Any]

to_disk(file_path: str)
  Description: Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.
  Parameters:
    file_path (str): Path where the JSON file should be saved (must end with ‘.json’).
  Returns: None
  Return Type: None
  Raises:
    ValueError: If the file path doesn’t end with ‘.json’.
    IOError: If there’s an error during the file writing process.

to_json()
  Description: Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.
  Returns: A JSON string representation of the object.
  Return Type: str

Property: unique_id
  Type: str
  Description: Returns the ULID of the instance.

Property: custom_data
  Type: dict
  Description: Custom data associated with the instance.
```

--------------------------------

### Optimizing ContextGem Extraction for Speed with Concurrency and Fallback LLM

Source: https://contextgem.dev/optimizations/optimization_speed

This Python example demonstrates how to configure a ContextGem DocumentLLM for speed optimization. It shows the setup of an AsyncLimiter for concurrent processing, the inclusion of a fallback LLM to handle rate limits, and the use of the `extract_all` method with concurrency enabled. This setup helps manage API call rates and ensures robustness for faster extractions.

```Python
# Example of optimizing extraction for speed

import os

from aiolimiter import AsyncLimiter

from contextgem import Document, DocumentLLM

# Define document
document = Document(
    raw_text="document_text",
    # aspects=[Aspect(...), ...],
    # concepts=[Concept(...), ...],
)

# Define LLM with a fallback model
llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
    async_limiter=AsyncLimiter(
        10, 5
    ),  # e.g. 10 acquisitions per 5-second period; adjust to your LLM API setup
    fallback_llm=DocumentLLM(
        model="openai/gpt-3.5-turbo",
        api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
        is_fallback=True,
        async_limiter=AsyncLimiter(
            20, 5
        ),  # e.g. 20 acquisitions per 5-second period; adjust to your LLM API setup
    ),
)

# Use the LLM for extraction with concurrency enabled
llm.extract_all(document, use_concurrency=True)

# ... use the extracted data ...
```

--------------------------------

### Interacting with LLMs via ContextGem's Lightweight Chat Interface

Source: https://contextgem.dev/_sources/quickstart

This Python snippet showcases ContextGem's unified interface for natural language interaction with Large Language Models.
It demonstrates both text-based and vision-based chat through a `DocumentLLM`, with an optional fallback LLM for robust communication.

```python
import os

from contextgem import DocumentLLM, Image, image_to_base64

# NOTE: `DocumentLLM`, `fallback_llm`, and `is_fallback` appear elsewhere in these docs.
# The `chat()` signature and the `Image` / `image_to_base64` helpers are assumed here;
# check them against the API reference for your ContextGem version.
llm = DocumentLLM(
    model="openai/gpt-4o",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
    # Optional fallback model
    fallback_llm=DocumentLLM(
        model="openai/gpt-4o-mini",
        api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
        is_fallback=True,
    ),
)

# Example: Simple text-based chat
response_text = llm.chat("What is the capital of France?")
print("Text Chat Response:", response_text)

# Example: Vision-based chat (the image path is a placeholder)
image = Image(
    mime_type="image/jpeg",
    base64_data=image_to_base64("path/to/image_of_eiffel.jpg"),
)
response_vision = llm.chat("Describe this image.", images=[image])
print("Vision Chat Response:", response_vision)
```

--------------------------------

### Setting Up LLM Cost Tracking in ContextGem

Source: https://contextgem.dev/llms/llm_config

Shows how to configure pricing details for a DocumentLLM using LLMPricing to track input and output token costs, and how to retrieve the total cost.

```python
from contextgem import DocumentLLM, LLMPricing

llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key="<your-api-key>",  # placeholder
    pricing_details=LLMPricing(
        input_per_1m_tokens=0.150,  # Cost per 1M input tokens
        output_per_1m_tokens=0.600,  # Cost per 1M output tokens
    ),
)

# Perform some extraction tasks

# Later, you can check the cost
cost_info = llm.get_cost()
```

--------------------------------

### Python: Defining Document and Concepts for Extraction with ContextGem

Source: https://contextgem.dev/advanced_usage

This Python example demonstrates how to define and prepare document-level concepts for extraction using the `contextgem` library.
It initializes a `Document` object with sample text and defines various concept types such as `BooleanConcept`, `DateConcept`, and `StringConcept`, specifying their names, descriptions, and optional properties like `singular_occurrence` and `add_references`. This setup is a prerequisite for extracting structured information from documents.

```Python
# Advanced Usage Example - Extracting aspects and concepts from a document, with references,
# using concurrency

import os

from aiolimiter import AsyncLimiter

from contextgem import (
    Aspect,
    BooleanConcept,
    DateConcept,
    Document,
    DocumentLLM,
    JsonObjectConcept,
    StringConcept,
)

# Example privacy policy document (shortened for brevity)
doc = Document(
    raw_text=(
        "Privacy Policy\n\n"
        "Last Updated: March 15, 2024\n\n"
        "1. Data Collection\n"
        "We collect various types of information from our users, including:\n"
        "- Personal information (name, email address, phone number)\n"
        "- Device information (IP address, browser type, operating system)\n"
        "- Usage data (pages visited, time spent on site)\n"
        "- Location data (with your consent)\n\n"
        "2. Data Usage\n"
        "We use your information to:\n"
        "- Provide and improve our services\n"
        "- Send you marketing communications (if you opt-in)\n"
        "- Analyze website performance\n"
        "- Comply with legal obligations\n\n"
        "3. Data Sharing\n"
        "We may share your information with:\n"
        "- Service providers (for processing payments and analytics)\n"
        "- Law enforcement (when legally required)\n"
        "- Business partners (with your explicit consent)\n\n"
        "4. Data Retention\n"
        "We retain personal data for 24 months after your last interaction with our services. "
        "Analytics data is kept for 36 months.\n\n"
        "5. User Rights\n"
        "You have the right to:\n"
        "- Access your personal data\n"
        "- Request data deletion\n"
        "- Opt-out of marketing communications\n"
        "- Lodge a complaint with supervisory authorities\n\n"
        "6. Contact Information\n"
        "For privacy-related inquiries, contact our Data Protection Officer at privacy@example.com\n"
    ),
)

# Define all document-level concepts in a single declaration
document_concepts = [
    BooleanConcept(
        name="Is Privacy Policy",
        description="Verify if this document is a privacy policy",
        singular_occurrence=True,  # explicitly enforce singular extracted item (optional)
    ),
    DateConcept(
        name="Last Updated Date",
        description="The date when the privacy policy was last updated",
        singular_occurrence=True,  # explicitly enforce singular extracted item (optional)
    ),
    StringConcept(
        name="Contact Information",
        description="Contact details for privacy-related inquiries",
        add_references=True,
        reference_depth="sentences",
    ),
]
```
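
One way to sanity-check what the three concepts above should yield is to derive reference values by hand from the sample text and compare them with the extracted items afterwards. The sketch below is plain standard-library Python and is not part of ContextGem: `expected_values` is a hypothetical helper, the excerpt reproduces only the lines it needs from the sample policy, and the regular expressions are tuned to this one document.

```python
import re
from datetime import date, datetime

# Excerpt of the sample privacy policy (only the lines this check needs)
policy_excerpt = (
    "Privacy Policy\n\n"
    "Last Updated: March 15, 2024\n\n"
    "6. Contact Information\n"
    "For privacy-related inquiries, contact our Data Protection Officer at privacy@example.com\n"
)


def expected_values(text: str) -> dict:
    """Hand-derive reference values keyed by concept name (hypothetical helper)."""
    # Date written as "<Month> <day>, <year>" after the "Last Updated:" label
    date_match = re.search(r"Last Updated:\s*(\w+ \d{1,2}, \d{4})", text)
    # First email-like token in the text
    email_match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    return {
        # Crude title check: the text opens with "Privacy Policy"
        "Is Privacy Policy": text.lstrip().lower().startswith("privacy policy"),
        "Last Updated Date": (
            datetime.strptime(date_match.group(1), "%B %d, %Y").date()
            if date_match
            else None
        ),
        "Contact Information": email_match.group(0) if email_match else None,
    }


reference = expected_values(policy_excerpt)
print(reference)
```

The dictionary keys match the concept names, so the reference values can be looked up by name when reviewing the items the LLM extracts. Note that the `Contact Information` concept may return a full sentence rather than a bare address, so a containment check is likely a better fit there than strict equality.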