### Initialize ThemeFinder Pipeline

Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb

Set up the environment by importing the necessary libraries, loading survey data from JSON, and configuring the AzureChatOpenAI LLM instance.

```python
import pandas as pd
import themefinder
from langchain_openai import AzureChatOpenAI

question = "What improvements would you most like to see in local public transportation?"
responses = pd.read_json("./example_data.json")

llm = AzureChatOpenAI(
    model_name="gpt-4o",
    temperature=0
)
```

--------------------------------

### Install ThemeFinder in Editable Mode

Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md

Commands to install the local development version of the package into another project for testing purposes.

```bash
pip install -e <FILE_PATH>
```

```bash
poetry add -e <FILE_PATH>
```

--------------------------------

### Example Configuration in YAML

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md

A sample configuration file for generating a synthetic ThemeFinder evaluation dataset. It specifies the topic, size, questions, theme count per question, noise level, position distribution, and demographic field distributions.

```yaml
# config.yaml
dataset_name: "transport_M"
topic: "public transport improvements"
size: "M"  # 1000 responses
n_questions: 3

questions:
  - text: "What improvements would you like to see to local bus services?"
    multi_choice: ["Support more buses", "Oppose changes"]
  - text: "How can we make public transport more accessible?"
  - text: "What role should cycling infrastructure play in transport planning?"

n_themes_per_question: 12
noise_level: "medium"

position_distribution:
  agree: 0.45
  disagree: 0.35
  unclear: 0.20

demographic_fields:
  - name: "region"
    values: ["England", "Scotland", "Wales", "Northern Ireland"]
    distribution: [0.84, 0.08, 0.05, 0.03]
  - name: "transport_user"
    values: ["Daily", "Weekly", "Monthly", "Rarely", "Never"]
    distribution: [0.25, 0.30, 0.20, 0.15, 0.10]
```
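
A configuration like this can be sanity-checked before any data is generated. The sketch below is a hypothetical helper, not part of ThemeFinder: it assumes the config has already been parsed into a plain Python dict and checks that each distribution sums to 1 and lines up with its list of values.

```python
import math


def check_distributions(config: dict) -> list[str]:
    """Return a list of problems found in the config's distributions.

    Hypothetical helper for illustration; not part of ThemeFinder.
    """
    problems = []

    # position_distribution maps each position to a probability.
    position_total = sum(config.get("position_distribution", {}).values())
    if not math.isclose(position_total, 1.0, abs_tol=1e-9):
        problems.append(f"position_distribution sums to {position_total}")

    # Each demographic field pairs a list of values with a list of weights.
    for field in config.get("demographic_fields", []):
        values, weights = field["values"], field["distribution"]
        if len(values) != len(weights):
            problems.append(f"{field['name']}: values and distribution differ in length")
        elif not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
            problems.append(f"{field['name']}: distribution sums to {sum(weights)}")
    return problems


config = {
    "position_distribution": {"agree": 0.45, "disagree": 0.35, "unclear": 0.20},
    "demographic_fields": [
        {
            "name": "region",
            "values": ["England", "Scotland", "Wales", "Northern Ireland"],
            "distribution": [0.84, 0.08, 0.05, 0.03],
        },
    ],
}
problems = check_distributions(config)
print(problems)
```

An empty list means every distribution passed; each string in a non-empty list names the field that needs attention.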

--------------------------------

### Initialize Langfuse Callback Handler

Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md

Example of initializing the Langfuse callback handler for Langchain tracing to monitor LLM calls.

```python
from langfuse import Langfuse
from langfuse.callback import CallbackHandler
import dotenv

dotenv.load_dotenv()

# Initialize Langfuse CallbackHandler for Langchain (tracing)
# Use the session id to group calls
langfuse_callback_handler = CallbackHandler(session_id="run_1")
```

--------------------------------

### Langfuse LLM-as-Judge Integration in Python

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Demonstrates integrating Langfuse's native LLM-as-Judge capabilities into the project. This approach leverages Langfuse's built-in templates, UI configuration, and features like sampling and batch evaluation for cost control and improved observability. The code snippet shows basic setup for attaching traces and allowing Langfuse to manage evaluators.

```python
from langfuse import Langfuse

def setup_langfuse_evaluators(client: Langfuse):
    """Configure Langfuse native evaluators for theme evaluation."""

    # Use Langfuse's built-in evaluator for relevance
    # Configure via Langfuse UI with custom prompt

    # For custom theme evaluators, create via API when supported
    # Currently UI-only for custom evaluators

    # Attach to traces
    with client.trace(name="theme_generation") as trace:
        # ... run task ...

        # Langfuse auto-runs configured evaluators
        # Results visible in Langfuse dashboard
        pass
```

--------------------------------

### Initialize AzureChatOpenAI LLM with Langchain and Langfuse

Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md

This snippet shows how to initialize an AzureChatOpenAI language model using Langchain. It configures the model for JSON output and integrates a Langfuse callback handler to log LLM calls, including inputs, outputs, and model details. This setup is useful for tracking and analyzing LLM interactions within an application.

```python
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    model="gpt-4o",
    temperature=0,
    callbacks=[langfuse_callback_handler],
    model_kwargs={"response_format": {"type": "json_object"}},
)
```

--------------------------------

### GET /theme_validation

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Validates theme sets against predefined quality rules including count limits, coverage, and overlap.

```APIDOC
## GET /theme_validation

### Description
Provides four validation rules to ensure theme quality. These functions check theme count limits, response coverage, semantic similarity, and theme overlap.

### Method
GET

### Endpoint
themefinder.rules

### Parameters
#### Request Body
- **themes** (List[ThemeNode]) - Required - List of theme nodes to validate.
- **mapping** (List[dict]) - Required - Mapping data for coverage checks.

### Response
#### Success Response (200)
- **slack_messages** (List[str]) - List of validation error messages.
- **failed** (bool) - Status indicating if the validation rule failed.
```
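
Because each rule is described as returning a list of messages together with a failure flag, several rules can be run and summarized in one pass. The following is a minimal sketch of that pattern; `run_rules`, `count_rule`, and `overlap_rule` are hypothetical stand-ins, not functions from `themefinder.rules`.

```python
from typing import Callable

# A rule takes no arguments here and returns (messages, failed),
# mirroring the return shape described above.
Rule = Callable[[], tuple[list[str], bool]]


def run_rules(rules: dict[str, Rule]) -> tuple[list[str], bool]:
    """Run every rule, collecting all messages and whether any rule failed."""
    all_messages: list[str] = []
    any_failed = False
    for name, rule in rules.items():
        messages, failed = rule()
        all_messages.extend(f"[{name}] {m}" for m in messages)
        any_failed = any_failed or failed
    return all_messages, any_failed


# Stand-in rules (hypothetical): one passes, one fails.
def count_rule() -> tuple[list[str], bool]:
    return [], False


def overlap_rule() -> tuple[list[str], bool]:
    return ["themes A and B share most of their responses"], True


messages, failed = run_rules({"count": count_rule, "overlap": overlap_rule})
print(messages, failed)
```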

--------------------------------

### Detect Evidence-Rich Responses

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Uses the detail_detection function to flag survey responses that contain specific facts, data, or concrete examples. This is useful for identifying high-value feedback for deeper analysis.

```python
from themefinder import detail_detection

async def detect_details():
    detail_df, unprocessables = await detail_detection(
        responses_df,
        llm,
        question="What are your views on the proposed cycling infrastructure expansion?",
        batch_size=20,
        system_prompt="You are an AI evaluation tool analyzing public consultation responses.",
        concurrency=10
    )
    evidence_rich = detail_df[detail_df['evidence_rich'] == 'YES']
    return evidence_rich
```

--------------------------------

### Build and Serve Documentation Locally

Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md

Commands to build and serve the MkDocs documentation site locally to preview changes.

```bash
poetry run mkdocs build
poetry run mkdocs serve
```

--------------------------------

### Create Open Question Inputs

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

Prepares input files for open-ended questions, creating a directory for each question and saving responses in JSONL format. It also generates a question metadata JSON file. Handles character removal and optional sampling. Dependencies include os and json.

```python
import json
import os
from typing import Optional

import pandas as pd


def create_open_question_inputs(
    df: pd.DataFrame,
    open_questions: list[dict],
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],
    sample_size: Optional[int] = None
) -> None:
    for question in open_questions:
        q_num = question['question_number']
        question_col = question['column_name']
        q_dir = f"inputs/question_part_{q_num}"
        os.makedirs(q_dir, exist_ok=True)

        question_string = question['question_text']

        question_answers = df[['themefinder_id', question_col]].dropna()
        if sample_size is not None and sample_size < len(question_answers):
            question_answers = question_answers.sample(sample_size)

        for bad_string in characters_to_remove:
            question_answers[question_col] = question_answers[question_col].apply(lambda x: x.replace(bad_string, " "))
        
        question_answers[question_col] = question_answers[question_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")

        question_answers.columns = ['themefinder_id', 'text']

        question_answers[['themefinder_id', 'text']].to_json(os.path.join(q_dir, 'responses.jsonl'), orient='records', lines=True)

        question_data = {
            "question_number": q_num,
            "question_text": question_string,
            "has_free_text": True
        }

        with open(os.path.join(q_dir, 'question.json'), 'w') as f:
            json.dump(question_data, f, indent=4)
```
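
The cleaning step can be easier to follow when applied to a single string. This sketch uses a hypothetical helper, `clean_text`, that mirrors the two operations described above: replacing unwanted substrings with spaces, then dropping non-ASCII characters.

```python
def clean_text(text: str, characters_to_remove: list[str]) -> str:
    """Replace unwanted substrings with spaces, then drop non-ASCII characters."""
    for bad_string in characters_to_remove:
        text = text.replace(bad_string, " ")
    # Same effect as .str.encode("ascii", "ignore").str.decode("ascii") in pandas.
    return text.encode("ascii", "ignore").decode("ascii")


cleaned = clean_text("Caf\u00e9 and/or bus_x000D_stop", ["/", "\\", "- Text", "_x000D_"])
print(cleaned)
```

Note that dropping non-ASCII characters is lossy: accented letters are removed rather than transliterated.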

--------------------------------

### Define Structured Data Models with Pydantic

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Illustrates the initialization of various Pydantic models used in ThemeFinder, such as Theme, CondensedTheme, and ThemeMappingOutput. These models ensure that data structures for theme generation and refinement are strictly validated.

```python
from themefinder.models import Theme, CondensedTheme, RefinedTheme, ThemeMappingOutput, DetailDetectionOutput, Position, EvidenceRich

theme = Theme(topic_label="traffic reduction", topic_description="The proposal will help reduce road congestion significantly", position=Position.AGREEMENT)
condensed = CondensedTheme(topic_label="traffic concerns", topic_description="Combined themes about traffic and congestion issues", source_topic_count=5)
refined = RefinedTheme(topic="Traffic Reduction: The cycling infrastructure will significantly reduce road congestion", source_topic_count=5)
mapping = ThemeMappingOutput(response_id=1, labels=["A", "C", "D"])
detail = DetailDetectionOutput(response_id=1, evidence_rich=EvidenceRich.YES)
```

--------------------------------

### JSONL Schema for Detail Detection

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md

Defines a flag indicating whether a response is considered 'evidence-rich', meaning it contains specific details or examples. Stored in JSON Lines format with 'YES' or 'NO' values.

```jsonl
{"response_id": 1, "evidence_rich": "YES"}
{"response_id": 2, "evidence_rich": "NO"}
```
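
A file in this shape can be read with the standard library alone. The sketch below parses the two example lines from an in-memory string (an assumption made to keep the example self-contained) and collects the IDs flagged as evidence-rich.

```python
import json

jsonl_text = """\
{"response_id": 1, "evidence_rich": "YES"}
{"response_id": 2, "evidence_rich": "NO"}
"""

# Each non-empty line is an independent JSON object.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Keep the IDs of responses flagged as evidence-rich.
evidence_rich_ids = [r["response_id"] for r in records if r["evidence_rich"] == "YES"]
print(evidence_rich_ids)
```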

--------------------------------

### Analyze survey responses with ThemeFinder

Source: https://github.com/i-dot-ai/themefinder/blob/main/README.md

This snippet demonstrates how to initialize a LangChain LLM, prepare survey data in a pandas DataFrame, and execute the find_themes pipeline asynchronously. It requires environment variables for LLM authentication and returns structured thematic analysis.

```python
import asyncio
from dotenv import load_dotenv
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import find_themes

load_dotenv()

llm = AzureChatOpenAI(
    model="gpt-4o",
    temperature=0,
)

responses_df = pd.DataFrame({
    "response_id": ["1", "2", "3", "4", "5"],
    "response": [
        "I think it's awesome, I can use it for consultation analysis.",
        "It's great.",
        "It's a good approach to topic modelling.",
        "I'm not sure, I need to trial it more.",
        "I don't like it so much.",
    ],
})

question = "What do you think of ThemeFinder?"
system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."

async def main():
    result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### Execute Theme Analysis Pipeline with find_themes

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

This snippet demonstrates how to initialize an Azure OpenAI LLM, prepare survey data in a pandas DataFrame, and execute the full ThemeFinder pipeline. The find_themes function processes responses to return identified themes, response-theme mappings, and evidence-rich insights.

```python
import asyncio
import pandas as pd
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from themefinder import find_themes

load_dotenv()

llm = AzureChatOpenAI(
    model="gpt-4o",
    temperature=0,
)

responses_df = pd.DataFrame({
    "response_id": ["1", "2", "3", "4", "5"],
    "response": [
        "Buses need to run more frequently, especially during rush hour.",
        "The schedule says every 15 minutes but I've been waiting for 35 minutes.",
        "Better lighting at bus stops - some areas feel unsafe at night.",
        "Monthly passes are too expensive for low-income families.",
        "Electric buses would reduce noise and air pollution."
    ]
})

question = "What improvements would you like to see in public transit?"
system_prompt = "You are an AI evaluation tool analyzing responses to a UK Government public consultation on public transit improvements."

async def main():
    result = await find_themes(
        responses_df,
        llm,
        question,
        system_prompt=system_prompt,
        verbose=True,
        concurrency=10
    )

    print("Question:", result["question"])
    print("\nIdentified Themes:")
    print(result["themes"])
    print("\nResponse-Theme Mapping:")
    print(result["mapping"])

if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### Create Hybrid Question Inputs

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

Prepares input files for hybrid questions, which combine closed and open-ended responses. It creates directories and processes data similarly to open questions, handling combined columns and optional sampling. Dependencies include os.

```python
def create_hybrid_question_inputs(
    df: pd.DataFrame,
    hybrid_questions: list[dict],
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],
    sample_size: Optional[int] = None
) -> None:
    for question in hybrid_questions:
        q_num = question['question_number']
        q_dir = f"inputs/question_part_{q_num}"
        closed_col = question['closed_column']
        open_col = question['open_column']
        question_string = question['question_text']
        os.makedirs(q_dir, exist_ok=True)

        question_answers = df[['themefinder_id'] + [closed_col, open_col]].dropna(subset=[closed_col, open_col], how='all')

        if sample_size is not None and sample_size < len(question_answers):
            
```

--------------------------------

### Interact with OpenAI LLM using OpenAILLM Interface

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

The OpenAILLM class offers a direct interface to OpenAI's SDK, bypassing LangChain when not needed. It supports both synchronous and asynchronous API calls and allows for structured output using Pydantic models. Initialization requires model name and optionally API keys or environment variables.

```python
import asyncio
from themefinder import OpenAILLM, LLMResponse
from pydantic import BaseModel, Field
from typing import List

# Define structured output model
class ThemeList(BaseModel):
    themes: List[str] = Field(description="List of identified themes")

# Initialize OpenAI LLM with custom settings
llm = OpenAILLM(
    model="gpt-4o",
    request_kwargs={"temperature": 0, "max_tokens": 1000},
    api_key="your-api-key",  # Or use OPENAI_API_KEY env var
)

# Async call with structured output
async def async_example():
    response: LLMResponse = await llm.ainvoke(
        prompt="List 3 main themes from: 'Better bus service, lower fares, more routes'",
        output_model=ThemeList
    )
    print("Parsed themes:", response.parsed.themes)

# Sync call without structured output
def sync_example():
    response: LLMResponse = llm.invoke(
        prompt="Summarize the main concern: 'Buses are always late'"
    )
    print("Response:", response.parsed)

if __name__ == "__main__":
    asyncio.run(async_example())
    sync_example()

```

--------------------------------

### Run Theme Extraction

Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb

Execute the theme extraction pipeline using the loaded responses and the configured LLM.

```python
results = await themefinder.find_themes(
    responses, 
    llm=llm, 
    question=question
)
```

--------------------------------

### Refine Themes with ThemeFinder

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Demonstrates how to use the theme_refinement function to consolidate survey topics into a refined DataFrame. It requires a DataFrame of condensed themes and an LLM instance to process the data asynchronously.

```python
import asyncio
import pandas as pd
from themefinder import theme_refinement

condensed_themes_df = pd.DataFrame({
    "topic_label": ["traffic and congestion reduction", "cost and financial concerns", "environmental and air quality benefits"],
    "topic_description": ["The proposal will significantly reduce traffic congestion on roads", "Concerns about the financial impact on taxpayers and project costs", "Positive environmental impact including better air quality from cycling"],
    "source_topic_count": [2, 2, 2]
})

async def refine_themes():
    refined_df, unprocessables = await theme_refinement(
        condensed_themes_df,
        llm,
        question="What are your views on the proposed cycling infrastructure expansion?",
        batch_size=10000,
        system_prompt="You are an AI evaluation tool analyzing public consultation responses.",
        concurrency=10
    )
    return refined_df
```

--------------------------------

### POST /llm_invoke

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Interface for interacting with OpenAI models, supporting both synchronous and asynchronous execution with structured output.

```APIDOC
## POST /llm_invoke

### Description
Provides a direct OpenAI SDK implementation for use cases where LangChain is not needed. It supports both synchronous and asynchronous calls with Pydantic-based structured output support.

### Method
POST

### Endpoint
OpenAILLM.ainvoke / OpenAILLM.invoke

### Parameters
#### Request Body
- **prompt** (string) - Required - The input text for the LLM.
- **output_model** (BaseModel) - Optional - Pydantic model for structured output parsing.

### Request Example
{
  "prompt": "List 3 main themes from: 'Better bus service'",
  "output_model": "ThemeList"
}

### Response
#### Success Response (200)
- **response** (LLMResponse) - Contains the parsed structured output.
```

--------------------------------

### Shuffle Theme Order for Evaluation (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvements.md

Implements a function to shuffle the order of themes before evaluation. This is a zero-cost reliability improvement to mitigate positional bias in LLM judgments.

```python
import random

def evaluate_with_shuffle(themes, judge_prompt, llm):
    shuffled = themes.copy()
    random.shuffle(shuffled)
    return llm.invoke(judge_prompt.format(themes=shuffled))
```
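
When evaluation runs need to be repeatable, the shuffle can be driven by a seed. The variant below is a sketch under that assumption: `evaluate_with_seeded_shuffle` and the `EchoJudge` stand-in are hypothetical names, and the stand-in simply returns the prompt so the example runs without an LLM.

```python
import random


class EchoJudge:
    """Stand-in for an LLM client: returns the prompt it was given."""

    def invoke(self, prompt: str) -> str:
        return prompt


def evaluate_with_seeded_shuffle(themes, judge_prompt, llm, seed):
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    shuffled = list(themes)
    rng.shuffle(shuffled)
    return llm.invoke(judge_prompt.format(themes=shuffled))


themes = ["cost", "safety", "frequency", "accessibility"]
first = evaluate_with_seeded_shuffle(themes, "Judge these themes: {themes}", EchoJudge(), seed=7)
second = evaluate_with_seeded_shuffle(themes, "Judge these themes: {themes}", EchoJudge(), seed=7)
print(first == second)
```

Using a different seed per run still varies the order across runs while letting any single run be replayed.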

--------------------------------

### Enforce Theme Quality Rules

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Demonstrates how to execute semantic similarity and overlap checks on theme data using ThemeFinder's rule-based functions. These functions return a list of Slack notifications and a failure status flag.

```python
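from openai import OpenAI
from themefinder.rules import rule_3_semantic_similarity_must_be_less_than_90pc_slack, rule_4_themes_should_not_overlap_slack

# `themes` and `mapping` are assumed to be defined already
client = OpenAI()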
slack_messages, failed = rule_3_semantic_similarity_must_be_less_than_90pc_slack(themes, client)
print(f"Rule 3 - Similarity check: {'FAILED' if failed else 'PASSED'}")

slack_messages, failed = rule_4_themes_should_not_overlap_slack(mapping)
print(f"Rule 4 - Overlap check: {'FAILED' if failed else 'PASSED'}")
```

--------------------------------

### Create Inputs for Hybrid Questions (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

A helper function to create input files for hybrid questions. It processes responses by cleaning and encoding data, removing specified characters, and splitting options. It then saves multi-choice options and free text responses into separate JSONL files and generates a question.json file containing question details and all unique multi-choice options.

```python
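import json
import os
from typing import Optional


def create_hybrid_question_inputs(
    responses_df,
    question_info: list[dict],
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],  # assumed default, matching the open-question helper
    sample_size: Optional[int] = None,
):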
    for question in question_info:
        q_num = question['question_number']
        closed_col = question['closed_column']
        open_col = question['open_column']
        q_dir = f"inputs/question_part_{q_num}"
        os.makedirs(q_dir, exist_ok=True)

        question_string = question['question_text']

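        # Keep rows that answered at least one of the two columns; gaps are filled below
        question_answers = responses_df[['themefinder_id', closed_col, open_col]].dropna(subset=[closed_col, open_col], how='all')
        if sample_size is not None and sample_size < len(question_answers):
            question_answers = question_answers.sample(sample_size)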

        question_answers[closed_col] = question_answers[closed_col].fillna('Not Provided')
        question_answers[open_col] = question_answers[open_col].fillna('Not Provided')

        question_answers[closed_col] = question_answers[closed_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")
        question_answers[open_col] = question_answers[open_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")

        for bad_string in characters_to_remove:
            question_answers[closed_col] = question_answers[closed_col].apply(lambda x: x.replace(bad_string, " "))
            question_answers[open_col] = question_answers[open_col].apply(lambda x: x.replace(bad_string, " "))

        question_answers[closed_col] = question_answers[closed_col].apply(lambda x: x.split(","))

        question_answers.rename(columns={closed_col: 'options', open_col: 'text'}, inplace=True)

        question_answers[['themefinder_id','options']].to_json(os.path.join(q_dir, 'multi_choice.jsonl'), orient='records', lines=True)
        question_answers[['themefinder_id', 'text']].to_json(os.path.join(q_dir, 'responses.jsonl'), orient='records', lines=True)

        question_data = {
            "question_number": q_num,
            "question_text": question_string,
            "has_free_text": True,
            "multi_choice_options": list(set([item for sublist in question_answers['options'] for item in sublist])),
        }

        with open(os.path.join(q_dir, 'question.json'), 'w') as f:
            json.dump(question_data, f, indent=4)
```
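
The last step, collecting every unique multi-choice option, can be shown in isolation. This is a sketch with a hypothetical helper, `unique_options`; unlike the function above, it strips surrounding whitespace and sorts the result so the output is deterministic, which is a design choice of this example rather than the notebook's behavior.

```python
def unique_options(option_lists: list[list[str]]) -> list[str]:
    """Flatten per-response option lists into a sorted list of unique options."""
    return sorted({item.strip() for sublist in option_lists for item in sublist if item.strip()})


# Closed answers are comma-separated strings before splitting.
raw_answers = ["Support more buses, Oppose changes", "Support more buses"]
option_lists = [answer.split(",") for answer in raw_answers]
options = unique_options(option_lists)
print(options)
```

Without the strip, `" Oppose changes"` (with its leading space) and `"Oppose changes"` would be counted as two different options.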

--------------------------------

### Refine Themes into Actionable Statements

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Standardizes condensed themes into clear, actionable statements. It reformulates topics to express definitive stances and assigns sequential alphabetic IDs.

```python
import asyncio
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import theme_refinement

llm = AzureChatOpenAI(model="gpt-4o", temperature=0)
```

--------------------------------

### Theme Mapping and Export

Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb

This snippet covers the theme mapping process and how to export the results to an Excel file.

````APIDOC
## Theme Mapping and Export

### Description
This section describes the process of mapping responses to the refined themes and exporting the mapping results to an Excel file.

### Method
Asynchronous function call and DataFrame method

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
import pandas as pd
from themefinder import theme_mapping

# Assuming 'responses', 'llm', 'question', and 'themes' (refined themes DataFrame) are defined
# mapping, unprocessed = await theme_mapping(
#     responses, 
#     llm=llm, 
#     refined_themes_df=themes,
#     question=question
# )

# Export the mapping to an Excel file
# mapping.to_excel("mapping.xlsx")
```

### Response
#### Success Response (200)
- **mapping** (DataFrame) - A DataFrame containing the mapping of responses to themes.
- **unprocessed** (DataFrame) - A DataFrame of responses that could not be processed.

#### Response Example
```json
{
  "mapping": [
    {"response_id": 1, "topic_id": "A"},
    {"response_id": 2, "topic_id": "B"}
  ],
  "unprocessed": []
}
```
````

--------------------------------

### Implement Chain-of-Thought Prompting for Theme Evaluation

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Defines a structured prompt template that forces the LLM to perform step-by-step reasoning before outputting a judgment. This improves auditability and reliability of theme matching.

```text
For theme "{theme_label}":

Step 1: Identify the core concept
- What is this theme fundamentally about?

Step 2: Search for matches
- Which theme(s) in the comparison list address the same concept?

Step 3: Assess alignment
- What aligns between them?
- What differs?

Step 4: Judgment
- Decision: MATCH / NO_MATCH
- If MATCH, strength: STRONG / PARTIAL
- Reasoning summary: <1-2 sentences>

Output JSON:
{
  "theme_label": {
    "reasoning": "This theme about X matches theme Y because...",
    "decision": "MATCH",
    "strength": "STRONG"
  }
}
```
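
Output in this shape can be checked mechanically before it is used. The sketch below is one possible validator, assuming the judge returns exactly the JSON structure shown in the template; `parse_judgment` is a hypothetical helper and the sample output is made up for illustration.

```python
import json

VALID_DECISIONS = {"MATCH", "NO_MATCH"}
VALID_STRENGTHS = {"STRONG", "PARTIAL"}


def parse_judgment(raw: str) -> dict[str, dict]:
    """Parse the judge's JSON output and check each entry's fields."""
    judgments = json.loads(raw)
    for label, entry in judgments.items():
        if entry.get("decision") not in VALID_DECISIONS:
            raise ValueError(f"{label}: unexpected decision {entry.get('decision')!r}")
        # Strength is only required when the decision is MATCH.
        if entry["decision"] == "MATCH" and entry.get("strength") not in VALID_STRENGTHS:
            raise ValueError(f"{label}: unexpected strength {entry.get('strength')!r}")
    return judgments


raw_output = """
{
  "bus frequency": {
    "reasoning": "Both themes concern how often buses run.",
    "decision": "MATCH",
    "strength": "STRONG"
  }
}
"""
judgments = parse_judgment(raw_output)
print(judgments["bus frequency"]["decision"])
```

Rejecting malformed judgments early keeps a single bad response from silently skewing match counts.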

--------------------------------

### Configure Separate Judge Model in BenchmarkRunner

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Demonstrates how to decouple the judge model from the task model by updating the BenchmarkConfig dataclass and CLI arguments. This allows for flexible model selection for evaluations.

```python
# benchmark.py - Add judge_model to BenchmarkRunner
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    models: list[ModelConfig]
    datasets: list[str]
    eval_types: list[str]
    # Fields with defaults must come after fields without them
    judge_model: ModelConfig | None = None  # If None, uses GPT-4o default
    runs_per_model: int = 3

# evaluators.py - Accept judge as parameter
def create_groundedness_evaluator(judge_llm: BaseLLM):
    """Judge LLM is now explicitly passed, not same as task LLM."""
    ...

# CLI addition
parser.add_argument(
    "--judge-model",
    default="gpt-4o-mini",
    help="Model to use for LLM-as-judge evaluations"
)
```
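
To see the pieces working together, here is a self-contained sketch that wires a `--judge-model` argument into a config object. The `ModelConfig` stand-in, the trimmed-down `BenchmarkConfig`, and the `--model` flag are assumptions made for illustration; the project's real classes and CLI may differ.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    """Minimal stand-in for the project's model configuration."""
    name: str


@dataclass
class BenchmarkConfig:
    models: list[ModelConfig]
    datasets: list[str]
    # Fields with defaults must follow fields without them.
    judge_model: Optional[ModelConfig] = None


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", action="append", required=True, help="Task model (repeatable)")
    parser.add_argument("--judge-model", default="gpt-4o-mini", help="Model for LLM-as-judge evaluations")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--model", "gpt-4o", "--judge-model", "gpt-4o-mini"])
config = BenchmarkConfig(
    models=[ModelConfig(m) for m in args.model],
    datasets=["transport_M"],
    judge_model=ModelConfig(args.judge_model),
)
print(config.judge_model.name)
```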

--------------------------------

### ThemeFinder Rules and Pydantic Models

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Demonstrates the application of rules for semantic similarity and theme overlap, and showcases the usage of Pydantic models for structured data representation.

````APIDOC
## Rule Application and Pydantic Models

### Description
This section illustrates how to apply predefined rules for semantic similarity and theme overlap checks, and demonstrates the structure and usage of Pydantic models for data validation and organization within the ThemeFinder package.

### Rule 3: Semantic Similarity Check
This rule ensures that the semantic similarity between themes is below a specified threshold (e.g., 90%).

```python
from themefinder.rules import rule_3_semantic_similarity_must_be_less_than_90pc_slack
from openai import OpenAI

client = OpenAI()
# Assuming 'themes' is a predefined list of themes
# slack_messages, failed = rule_3_semantic_similarity_must_be_less_than_90pc_slack(themes, client)
# print(f"Rule 3 - Similarity check: {'FAILED' if failed else 'PASSED'}")
```

### Rule 4: Theme Overlap Check
This rule checks if the response overlap between themes is below a specified threshold (e.g., 70%).

```python
from themefinder.rules import rule_4_themes_should_not_overlap_slack

# Assuming 'mapping' is a predefined theme mapping object
# slack_messages, failed = rule_4_themes_should_not_overlap_slack(mapping)
# print(f"Rule 4 - Overlap check: {'FAILED' if failed else 'PASSED'}")
```

### Pydantic Models for Structured Output
ThemeFinder utilizes Pydantic models for robust data validation and structured representation of analysis outputs.

#### Theme Model
Represents a single theme with its topic label, description, and sentiment position.

```python
from themefinder.models import Theme, Position

theme = Theme(
    topic_label="traffic reduction",
    topic_description="The proposal will help reduce road congestion significantly",
    position=Position.AGREEMENT  # Options: AGREEMENT, DISAGREEMENT, UNCLEAR
)

print(f"Theme: {theme.topic_label} - {theme.position.value}")
```

#### CondensedTheme Model
Represents a condensed theme, summarizing multiple related topics and including the count of source topics.

```python
from themefinder.models import CondensedTheme

condensed = CondensedTheme(
    topic_label="traffic concerns",
    topic_description="Combined themes about traffic and congestion issues",
    source_topic_count=5
)
```

#### RefinedTheme Model
Represents a refined theme, typically in a colon-separated format, along with the source topic count.

```python
from themefinder.models import RefinedTheme

refined = RefinedTheme(
    topic="Traffic Reduction: The cycling infrastructure will significantly reduce road congestion",
    source_topic_count=5
)

print(f"Refined topic format: {refined.topic}")
```

#### ThemeMappingOutput Model
Represents the output of a theme mapping process, linking response IDs to theme labels.

```python
from themefinder.models import ThemeMappingOutput

mapping = ThemeMappingOutput(
    response_id=1,
    labels=["A", "C", "D"]  # Labels must be unique
)

print(f"Mapping labels: {mapping.labels}")
```

#### DetailDetectionOutput Model
Represents the output of a detail detection process, indicating whether rich evidence is present.

```python
from themefinder.models import DetailDetectionOutput, EvidenceRich

detail = DetailDetectionOutput(
    response_id=1,
    evidence_rich=EvidenceRich.YES  # Options: YES, NO
)
```

#### Container Models
These models aggregate multiple responses for batch processing and validation.

```python
from themefinder.models import ThemeGenerationResponses, ThemeMappingResponses

# Example for ThemeGenerationResponses
# theme_responses = ThemeGenerationResponses(responses=[theme])

# Example for ThemeMappingResponses
# mapping_responses = ThemeMappingResponses(responses=[mapping])
```
````

--------------------------------

### Save Open Questions from Excel

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

Reads open question configurations from an Excel file and uses create_open_question_inputs to generate necessary input files. It filters out questions with no answers and validates unique question numbers. Dependencies include pandas.

```python
def save_open_questions(responses_df, question_understanding_path: str):
    question_info = pd.read_excel(question_understanding_path, sheet_name="Open questions", skiprows=3)
    
    question_info.columns = ["column_name", 'question_number', "question_text"]

    # remove questions with no answers
    only_nans = responses_df[question_info['column_name'].tolist()].isna().all()
    column_names_with_only_nans = only_nans[only_nans].index.tolist()
    question_info = question_info[~question_info['column_name'].isin(column_names_with_only_nans)]

    # Ensure question numbers are ints
    question_info['question_number'] = question_info['question_number'].astype(str).str.replace(r'\D', '', regex=True).astype(int)
    if not question_info['question_number'].is_unique:
        raise AssertionError("Non-unique values found in 'question_number' column")

    create_open_question_inputs(responses_df, question_info.to_dict(orient="records"))
```
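
The question-number cleanup is worth seeing on plain values, because stripping non-digits can create collisions. The sketch below reproduces that step with the standard library only; `normalize_question_numbers` is a hypothetical helper and the sample labels are made up.

```python
import re


def normalize_question_numbers(raw_numbers: list[str]) -> list[int]:
    """Strip non-digit characters and require the resulting numbers to be unique."""
    numbers = [int(re.sub(r"\D", "", str(raw))) for raw in raw_numbers]
    if len(set(numbers)) != len(numbers):
        raise ValueError("Non-unique values found in question numbers")
    return numbers


numbers = normalize_question_numbers(["Q1", "Q2a", "3"])
print(numbers)
```

Labels such as `"Q2a"` and `"Q2b"` both reduce to 2, so the uniqueness check would reject a sheet that contains both.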

--------------------------------

### Process and Save Response Data (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

This script reads response data from a CSV file, renames the columns with the get_excel_column_name utility, creates the 'inputs' directory if it does not exist, and adds a sequential 'themefinder_id' column. It then passes the DataFrame to four helpers that save the demographic, open-question, hybrid-question, and closed-question subsets.

```python
responses_df = pd.read_csv("raw_data/responses_output_cleaned.csv", header=0)
responses_df.columns = [get_excel_column_name(i) for i in range(len(responses_df.columns))]
os.makedirs("inputs", exist_ok=True)
responses_df['themefinder_id'] = range(1, len(responses_df) + 1)
```

```python
save_demographic_data(responses_df, "question_understanding_path")
```

```python
save_open_questions(responses_df, "question_understanding_path")
```

```python
save_hybrid_questions(responses_df, "question_understanding_path")
```

```python
save_closed_questions(responses_df, "question_understanding_path")
```
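
`get_excel_column_name` is not shown in this section. The following is a minimal sketch of what such a helper might look like, assuming it maps a zero-based column index to spreadsheet-style letters (`A`, `B`, ..., `Z`, `AA`, ...). The name `excel_column_name` and the exact behaviour are assumptions, not the notebook's implementation.

```python
import string


def excel_column_name(index: int) -> str:
    """Convert a zero-based column index to spreadsheet letters (assumed behaviour)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    name = ""
    n = index + 1  # bijective base-26 is easiest to compute on 1-based numbers
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        name = string.ascii_uppercase[remainder] + name
    return name


print([excel_column_name(i) for i in (0, 25, 26, 27, 701, 702)])
# ['A', 'Z', 'AA', 'AB', 'ZZ', 'AAA']
```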

--------------------------------

### Generate Themes from Survey Responses

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Extracts initial themes from survey responses using exploratory LLM prompts. It processes data in batches to identify viewpoints and sentiment positions.

```python
import asyncio
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import theme_generation

llm = AzureChatOpenAI(model="gpt-4o", temperature=0)

responses_df = pd.DataFrame({
    "response_id": ["1", "2", "3", "4", "5", "6"],
    "response": [
        "I think the proposal will help reduce traffic congestion significantly.",
        "This change will cost taxpayers too much money.",
        "The environmental benefits outweigh any short-term costs.",
        "I'm worried about the impact on local businesses.",
        "Great initiative for improving air quality in the city.",
        "The timeline is unrealistic and needs more planning."
    ]
})

question = "What are your views on the proposed cycling infrastructure expansion?"

async def generate_themes():
    theme_df, unprocessables = await theme_generation(
        responses_df,
        llm,
        question=question,
        batch_size=50,
        partition_key=None,
        system_prompt="You are an AI evaluation tool analyzing public consultation responses.",
        concurrency=10
    )

    print("Generated Themes:")
    print(theme_df)

    return theme_df

if __name__ == "__main__":
    asyncio.run(generate_themes())
```

--------------------------------

### POST /theme_clustering

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

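Performs hierarchical clustering on a set of themes to reduce them to a target number, creating parent-child relationships. Despite the endpoint-style heading, `theme_clustering` is an async Python function imported from the `themefinder` package, not an HTTP route; the fields below describe its arguments and return value.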

```APIDOC
## POST /theme_clustering

### Description
Performs advanced hierarchical clustering using an agentic approach. It iteratively merges semantically similar themes to reach a target number, creating parent-child relationships useful for drill-down analysis.

### Method
POST

### Endpoint
theme_clustering

### Parameters
#### Request Body
- **themes_df** (DataFrame) - Required - DataFrame containing topic_id, topic_label, topic_description, and source_topic_count.
- **llm** (AzureChatOpenAI) - Required - LangChain LLM instance.
- **max_iterations** (int) - Optional - Maximum clustering iterations.
- **target_themes** (int) - Optional - Target number of final themes.
- **significance_percentage** (float) - Optional - Minimum percentage for significance.

### Request Example
{
  "themes_df": "...",
  "max_iterations": 5,
  "target_themes": 4
}

### Response
#### Success Response (200)
- **clustered_df** (DataFrame) - The resulting themes with hierarchical parent_id and children columns.
```

--------------------------------

### ThemeFinder Pipeline

Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb

This snippet demonstrates the end-to-end pipeline for finding themes in a set of responses using the ThemeFinder library and an LLM.

````APIDOC
## ThemeFinder Pipeline

### Description
This section shows how to run the entire ThemeFinder pipeline, from loading responses to identifying themes and mapping them.

### Method
Asynchronous function call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
import pandas as pd
from langchain_openai import AzureChatOpenAI
import themefinder

# Define the question and load responses
question = "What improvements would you most like to see in local public transportation?"
responses = pd.read_json("./example_data.json")

# Initialize the LLM
llm = AzureChatOpenAI(
    model_name="gpt-4o",
    temperature=0
)

# Run the pipeline
results = await themefinder.find_themes(
    responses,
    llm=llm,
    question=question,
)

# Access the identified themes
print(results["themes"])
```

### Response
#### Success Response (200)
- **themes** (DataFrame) - A DataFrame containing the identified themes.
- **mapping** (DataFrame) - A DataFrame mapping responses to themes.
- **unprocessed** (DataFrame) - A DataFrame of responses that could not be processed.

#### Response Example
```json
{
  "themes": [
    {"topic_id": "A", "topic": "Improved Bus Routes"},
    {"topic_id": "B", "topic": "More Frequent Service"}
  ],
  "mapping": [
    {"response_id": 1, "topic_id": "A"},
    {"response_id": 2, "topic_id": "B"}
  ],
  "unprocessed": []
}
```
````

--------------------------------

### Create Inputs for Closed Questions (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

A helper function to create input files for closed-ended questions. It iterates through provided question details, cleans the relevant DataFrame column by encoding to ASCII, removing specified characters, and splitting options by comma. It then saves the processed data into 'multi_choice.jsonl' and 'question.json' files within a dedicated directory for each question.

```python
def create_closed_question_inputs(
    df: pd.DataFrame,
    closed_questions: list[dict],
    characters_to_remove: tuple[str, ...] = ("/", "\\", "- Text", "_x000D_"),  # immutable default
    sample_size: Optional[int] = None
) -> None:
    for question in closed_questions:
        q_num = question['question_number']
        question_col = question['column_name']
        q_dir = f"inputs/question_part_{q_num}"
        os.makedirs(q_dir, exist_ok=True)

        question_string = question['question_text']

        question_answers = df[['themefinder_id', question_col]].dropna()
        if sample_size is not None:
            question_answers = question_answers.sample(sample_size)

        question_answers[question_col] = question_answers[question_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")
        # Replace each unwanted substring with a space (literal match, not a regex)
        for bad_string in characters_to_remove:
            question_answers[question_col] = question_answers[question_col].str.replace(bad_string, " ", regex=False)

        question_answers[question_col] = question_answers[question_col].apply(lambda x: x.split(","))

        question_answers.columns = ['themefinder_id', 'options']

        question_answers[['themefinder_id', 'options']].to_json(os.path.join(q_dir, 'multi_choice.jsonl'), orient='records', lines=True)

        question_data = {
            "question_number": q_num,
            "question_text": question_string,
            "has_free_text": False,
            "multi_choice_options": list(set([item for sublist in question_answers['options'] for item in sublist])),
        }

        with open(os.path.join(q_dir, 'question.json'), 'w') as f:
            json.dump(question_data, f, indent=4)
```
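
To make the cleaning steps concrete, the sketch below applies the same sequence (drop non-ASCII characters, replace unwanted substrings with a space, split on commas) to a single hypothetical answer string using plain `str` methods. The helper name `clean_options` and the sample value are illustrative and not part of the notebook.

```python
CHARACTERS_TO_REMOVE = ("/", "\\", "- Text", "_x000D_")


def clean_options(raw: str) -> list[str]:
    """Mirror the per-value cleaning performed on the question column."""
    # Drop any character that cannot be represented in ASCII
    text = raw.encode("ascii", "ignore").decode("ascii")
    # Replace each unwanted substring with a single space
    for bad_string in CHARACTERS_TO_REMOVE:
        text = text.replace(bad_string, " ")
    # One list entry per comma-separated option (whitespace is kept as-is)
    return text.split(",")


# "Caf\u00e9" contains a non-ASCII character that is dropped by the encoding step
print(clean_options("Bus,Train - Text,Caf\u00e9_x000D_"))
# ['Bus', 'Train  ', 'Caf ']
```

Because the options are not stripped, values that differ only in surrounding whitespace end up as distinct entries in `multi_choice_options`. Calling `.strip()` on each option would be one way to avoid that, if it matters for your data.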

--------------------------------

### Hierarchical Evaluation Implementation in Python

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Implements a multi-stage evaluation process that prioritizes cheaper methods like semantic caching and heuristic checks before resorting to expensive LLM judgments. This reduces overall evaluation costs by handling simple cases efficiently. Dependencies include a semantic cache, a heuristic function, and an LLM judge.

```python
class HierarchicalEvaluator:
    def __init__(
        self,
        cache: SemanticCache,
        heuristic_fn: Callable,
        llm_judge: BaseLLM,
        cache_threshold: float = 0.92,
        heuristic_confidence: float = 0.8
    ):
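        self.cache = cache
        self.heuristic_fn = heuristic_fn
        self.llm_judge = llm_judge
        # Store the thresholds; evaluate() reads them from self
        self.cache_threshold = cache_threshold
        self.heuristic_confidence = heuristic_confidence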

    def evaluate(self, themes: list[dict], expected: list[dict]) -> EvalResult:
        # Stage 1: Cache check
        cache_key = self._make_cache_key(themes, expected)
        cached = self.cache.get(cache_key, threshold=self.cache_threshold)
        if cached is not None:  # a cached score of 0 is still a hit (assumes get() returns None on a miss)
            return EvalResult(score=cached, source="cache", cost=0)

        # Stage 2: Heuristic check
        heuristic_result = self.heuristic_fn(themes, expected)
        if heuristic_result.confidence > self.heuristic_confidence:
            self.cache.set(cache_key, heuristic_result.score)
            return EvalResult(
                score=heuristic_result.score,
                source="heuristic",
                cost=0.0001
            )

        # Stage 3: LLM judge
        llm_result = self._run_llm_judge(themes, expected)
        self.cache.set(cache_key, llm_result.score)
        return EvalResult(
            score=llm_result.score,
            source="llm",
            cost=0.02,
            reasoning=llm_result.reasoning
        )
```
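
The class above relies on names that the snippet does not define (`SemanticCache`, `EvalResult`, `BaseLLM`, `_make_cache_key`, `_run_llm_judge`). The following self-contained sketch swaps in hypothetical stand-ins to show the cheapest-first control flow end to end: an exact-match dictionary replaces the semantic cache, and plain callables replace the heuristic and the LLM judge. The cost figures are copied from the snippet above.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:  # hypothetical stand-in for the undefined EvalResult
    score: float
    source: str
    cost: float


@dataclass
class HeuristicResult:  # hypothetical: a score plus how much to trust it
    score: float
    confidence: float


def cascade_evaluate(themes, expected, cache, heuristic_fn, judge_fn,
                     heuristic_confidence=0.8):
    """Try the cache, then the heuristic, and only then the expensive judge."""
    key = json.dumps([themes, expected], sort_keys=True)  # exact-match key
    if key in cache:
        return EvalResult(cache[key], "cache", 0.0)

    heuristic = heuristic_fn(themes, expected)
    if heuristic.confidence > heuristic_confidence:
        cache[key] = heuristic.score
        return EvalResult(heuristic.score, "heuristic", 0.0001)

    score = judge_fn(themes, expected)
    cache[key] = score
    return EvalResult(score, "llm", 0.02)


def label_match_heuristic(themes, expected):
    """Confident only when both sides have identical label sets."""
    same = {t["topic_label"] for t in themes} == {t["topic_label"] for t in expected}
    return HeuristicResult(score=1.0 if same else 0.0, confidence=1.0 if same else 0.0)


def fake_judge(themes, expected):
    return 0.5  # placeholder for a real LLM call


cache = {}
a = [{"topic_label": "Funding Concerns"}]
b = [{"topic_label": "Budget Worries"}]
first = cascade_evaluate(a, a, cache, label_match_heuristic, fake_judge)
second = cascade_evaluate(a, b, cache, label_match_heuristic, fake_judge)
third = cascade_evaluate(a, b, cache, label_match_heuristic, fake_judge)
print(first.source, second.source, third.source)  # heuristic llm cache
```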

--------------------------------

### JSON Schema for Theme Definitions

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md

Defines the structure for categorizing responses into themes. This includes a unique topic ID, label, description, and a combined topic string. Special themes 'X' (None of the Above) and 'Y' (No Reason Given) are always included.

```json
[
    {
        "topic_id": "A",
        "topic_label": "Funding Concerns",
        "topic_description": "Concerns about adequate funding for the proposed initiative.",
        "topic": "Funding Concerns: Concerns about adequate funding for the proposed initiative."
    }
]
```
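
A short sketch of how a theme list matching this schema might be assembled and checked. The `topic` field is built as `"<label>: <description>"`, as in the example above. The descriptions given to the special `X` and `Y` themes, and the helper names `make_theme` and `validate_themes`, are assumptions for illustration only.

```python
import json


def make_theme(topic_id: str, label: str, description: str) -> dict:
    """Build one theme record, deriving `topic` from label and description."""
    return {
        "topic_id": topic_id,
        "topic_label": label,
        "topic_description": description,
        "topic": f"{label}: {description}",
    }


# Labels come from the spec; the descriptions are placeholders (assumed).
SPECIAL_THEMES = [
    make_theme("X", "None of the Above", "The response does not fit any listed theme."),
    make_theme("Y", "No Reason Given", "The response gives no reason for its position."),
]


def validate_themes(themes: list[dict]) -> None:
    """Raise ValueError if IDs repeat, X/Y are missing, or `topic` is inconsistent."""
    ids = [t["topic_id"] for t in themes]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate topic_id")
    missing = {"X", "Y"} - set(ids)
    if missing:
        raise ValueError(f"missing special themes: {sorted(missing)}")
    for t in themes:
        if t["topic"] != f'{t["topic_label"]}: {t["topic_description"]}':
            raise ValueError(f"inconsistent topic string for {t['topic_id']}")


themes = [
    make_theme(
        "A",
        "Funding Concerns",
        "Concerns about adequate funding for the proposed initiative.",
    )
] + SPECIAL_THEMES
validate_themes(themes)
print(json.dumps(themes[0], indent=4))
```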

--------------------------------

### JSONL Schema for Response-Theme Mapping

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md

Defines the mapping between responses and identified themes, including associated stances. This schema ensures that each response is linked to one or more valid topic IDs and their corresponding stances (POSITIVE, NEGATIVE, NEUTRAL). Stored in JSON Lines format.

```jsonl
{"response_id": 1, "labels": ["A", "C"], "stances": ["POSITIVE", "NEGATIVE"]}
{"response_id": 2, "labels": ["B"], "stances": ["POSITIVE"]}
```
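
The sketch below parses mapping records in this format and checks them against the constraints described above. It assumes that `labels` and `stances` are parallel lists (one stance per label), which the example rows suggest but the description does not state outright; the function name `parse_mapping` is illustrative.

```python
import json

VALID_STANCES = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def parse_mapping(jsonl_text: str, valid_topic_ids: set[str]) -> list[dict]:
    """Parse JSON Lines text and validate each response-theme record."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        labels, stances = record["labels"], record["stances"]
        if not labels:
            raise ValueError(f"line {line_no}: at least one label is required")
        if len(labels) != len(stances):  # assumption: one stance per label
            raise ValueError(f"line {line_no}: labels and stances differ in length")
        unknown = set(labels) - valid_topic_ids
        if unknown:
            raise ValueError(f"line {line_no}: unknown topic IDs {sorted(unknown)}")
        invalid = set(stances) - VALID_STANCES
        if invalid:
            raise ValueError(f"line {line_no}: invalid stances {sorted(invalid)}")
        records.append(record)
    return records


sample = (
    '{"response_id": 1, "labels": ["A", "C"], "stances": ["POSITIVE", "NEGATIVE"]}\n'
    '{"response_id": 2, "labels": ["B"], "stances": ["POSITIVE"]}\n'
)
records = parse_mapping(sample, {"A", "B", "C", "X", "Y"})
print(len(records))  # 2
```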

--------------------------------

### Create Respondent Demographics JSONL

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

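Generates a JSONL file containing respondents' demographic data. It strips `_x000D_` artifacts and non-ASCII characters from the demographic columns, splits each value on commas, renames the columns to their labels, and saves the result to 'inputs/respondents.jsonl'. Of the imports shown, the function itself only needs pandas.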

```python
import pandas as pd
import numpy as np
import os
import string
import json
from typing import Optional

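def create_respondents_jsonl(df: pd.DataFrame, demographic_columns: list[str], demographic_labels: list[str]) -> None:
    """Create a JSONL file of respondents with their demographic data.

    Note: `df` is modified in place (columns are cleaned and renamed, and a
    `demographic_data` column is added).

    Args:
        df (pd.DataFrame): DataFrame containing the respondents' data
        demographic_columns (list[str]): Columns in `df` that hold demographic data
        demographic_labels (list[str]): Output names for those columns, in the same order
    """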
    for c in demographic_columns:
        df[c] = df[c].astype(str).str.replace('_x000D_', '', regex=False).str.encode("ascii", "ignore").str.decode("ascii")
    for c in demographic_columns:
        df[c] = df[c].apply(lambda x: x.split(","))
    df.rename(columns=dict(zip(demographic_columns, demographic_labels)), inplace=True)
    df["demographic_data"] = df[demographic_labels].to_dict(orient="records")
    df[['themefinder_id',"demographic_data"]].to_json("inputs/respondents.jsonl", orient="records", lines=True)
```

--------------------------------

### Perform Hierarchical Theme Clustering with ThemeFinder

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

This Python function utilizes an agentic approach for hierarchical clustering of themes. It iteratively merges semantically similar themes to achieve a target number of clusters, establishing parent-child relationships for detailed analysis. The function requires a pandas DataFrame with specific columns and an initialized LLM object.

```python
import asyncio
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import theme_clustering

llm = AzureChatOpenAI(model="gpt-4o", temperature=0)

# Input: DataFrame of condensed themes with required columns
themes_df = pd.DataFrame({
    "topic_id": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "topic_label": [
        "traffic reduction",
        "congestion improvement",
        "cost concerns",
        "budget worries",
        "air quality",
        "environmental benefits",
        "safety improvements",
        "timeline concerns"
    ],
    "topic_description": [
        "Less traffic on roads",
        "Reduced congestion at junctions",
        "Project costs too high",
        "Financial burden on citizens",
        "Better air quality",
        "Positive environmental impact",
        "Safer roads for cyclists",
        "Unrealistic project timeline"
    ],
    "source_topic_count": [10, 8, 15, 12, 20, 18, 5, 7]
})

async def cluster_themes():
    # Cluster themes down to target number
    clustered_df, _ = await theme_clustering(
        themes_df,
        llm,
        max_iterations=5,  # Maximum clustering iterations
        target_themes=4,  # Target number of final themes
        significance_percentage=10.0,  # Minimum percentage for significance
        return_all_themes=False,  # Return only significant themes
        system_prompt="You are an AI evaluation tool analyzing public consultation responses."
    )

    print("Clustered Themes:")
    print(clustered_df)
    # Output includes hierarchical relationships via parent_id and children columns

    return clustered_df

if __name__ == "__main__":
    asyncio.run(cluster_themes())
```

--------------------------------

### Refine and Map Themes

Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb

Manually modify generated themes to include custom categories and perform the mapping stage to associate responses with themes, including handling unprocessed items.

```python
import string
from themefinder import theme_mapping

themes = results["themes"][["topic_id", "topic"]].copy()
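# Append a catch-all theme. Its ID is the uppercase letter at position len(themes),
# so this indexing only works while there are fewer than 26 themes.
themes.loc[len(themes)] = {
    "topic_id": string.ascii_uppercase[len(themes)],
    "topic": "Other: The response does not match any of the listed themes",
}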

mapping, unprocessed = await theme_mapping(
    responses,
    llm=llm,
    refined_themes_df=themes,
    question=question
)
mapping.to_excel("mapping.xlsx")
```

--------------------------------

### Process and Save Hybrid Question Data (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

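Processes responses for hybrid questions (containing both closed and open-ended answers). The function shown reads the "Hybrid questions" sheet of the question-understanding workbook, names its columns, converts question numbers to integers, and raises an error if they are not unique. It then hands the records to `create_hybrid_question_inputs`, which is not shown here and is responsible for the remaining steps: cleaning the data, handling ASCII encoding, removing specified characters, splitting options, saving separate JSONL files for multi-choice options and free-text responses, and generating a question.json file with question details.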

```python
def save_hybrid_questions(responses_df, question_understanding_path: str):
    question_info = pd.read_excel(question_understanding_path, sheet_name="Hybrid questions", skiprows=3)

    question_info.columns = ["closed_column",  "question_number", "question_text", "open_column"]

    # Ensure question numbers are ints
    question_info['question_number'] = question_info['question_number'].astype(str).str.replace(r'\D', '', regex=True).astype(int)
    if not question_info['question_number'].is_unique:
        raise AssertionError("Non-unique values found in 'question_number' column")

    create_hybrid_question_inputs(responses_df, question_info.to_dict(orient="records"))
```