### Initialize Themefinder Pipeline Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb Setup the environment by importing necessary libraries, loading survey data from JSON, and configuring the AzureChatOpenAI LLM instance. ```python import pandas as pd import themefinder from langchain_openai import AzureChatOpenAI question = "What improvements would you most like to see in local public transportation?" responses = pd.read_json("./example_data.json") llm = AzureChatOpenAI( model_name="gpt-4o", temperature=0 ) ``` -------------------------------- ### Install ThemeFinder in Editable Mode Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md Commands to install the local development version of the package into another project for testing purposes. ```bash pip install -e ``` ```bash poetry add -e ``` -------------------------------- ### Example Configuration in YAML Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md Demonstrates a sample configuration file for the ThemeFinder project, specifying dataset parameters like topic, size, questions, theme counts, noise levels, and demographic field distributions. ```yaml # config.yaml dataset_name: "transport_M" topic: "public transport improvements" size: "M" # 1000 responses n_questions: 3 questions: - text: "What improvements would you like to see to local bus services?" multi_choice: ["Support more buses", "Oppose changes"] - text: "How can we make public transport more accessible?" - text: "What role should cycling infrastructure play in transport planning?" 
n_themes_per_question: 12 noise_level: "medium" position_distribution: agree: 0.45 disagree: 0.35 unclear: 0.20 demographic_fields: - name: "region" values: ["England", "Scotland", "Wales", "Northern Ireland"] distribution: [0.84, 0.08, 0.05, 0.03] - name: "transport_user" values: ["Daily", "Weekly", "Monthly", "Rarely", "Never"] distribution: [0.25, 0.30, 0.20, 0.15, 0.10] ``` -------------------------------- ### Initialize Langfuse Callback Handler Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md Example of initializing the Langfuse callback handler for Langchain tracing to monitor LLM calls. ```python from langfuse import Langfuse from langfuse.callback import CallbackHandler import dotenv dotenv.load_dotenv() # Initialize Langfuse CallbackHandler for Langchain (tracing) # Use the session id to group calls langfuse_callback_handler = CallbackHandler(session_id="run_1") ``` -------------------------------- ### Langfuse LLM-as-Judge Integration in Python Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md Demonstrates integrating Langfuse's native LLM-as-Judge capabilities into the project. This approach leverages Langfuse's built-in templates, UI configuration, and features like sampling and batch evaluation for cost control and improved observability. The code snippet shows basic setup for attaching traces and allowing Langfuse to manage evaluators. ```python from langfuse import Langfuse def setup_langfuse_evaluators(client: Langfuse): """Configure Langfuse native evaluators for theme evaluation.""" # Use Langfuse's built-in evaluator for relevance # Configure via Langfuse UI with custom prompt # For custom theme evaluators, create via API when supported # Currently UI-only for custom evaluators # Attach to traces with client.trace(name="theme_generation") as trace: # ... run task ... 
        # Langfuse auto-runs configured evaluators
        # Results visible in Langfuse dashboard
        pass
```

--------------------------------

### Initialize AzureChatOpenAI LLM with Langchain and Langfuse

Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md

This snippet shows how to initialize an AzureChatOpenAI language model using Langchain. It configures the model for JSON output and integrates a Langfuse callback handler to log LLM calls, including inputs, outputs, and model details. This setup is useful for tracking and analyzing LLM interactions within an application.

```python
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    model="gpt-4o",
    temperature=0,
    callbacks=[langfuse_callback_handler],
    model_kwargs={"response_format": {"type": "json_object"}},
)
```

--------------------------------

### GET /theme_validation

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Validates theme sets against predefined quality rules including count limits, coverage, and overlap.

```APIDOC
## GET /theme_validation

### Description
Provides four validation rules to ensure theme quality. These functions check theme count limits, response coverage, semantic similarity, and theme overlap.

### Method
GET

### Endpoint
themefinder.rules

### Parameters
#### Request Body
- **themes** (List[ThemeNode]) - Required - List of theme nodes to validate.
- **mapping** (List[dict]) - Required - Mapping data for coverage checks.

### Response
#### Success Response (200)
- **slack_messages** (List[str]) - List of validation error messages.
- **failed** (bool) - Status indicating if the validation rule failed.
```

--------------------------------

### Detect Evidence-Rich Responses

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Uses the detail_detection function to flag survey responses that contain specific facts, data, or concrete examples. This is useful for identifying high-value feedback for deeper analysis.
```python from themefinder import detail_detection async def detect_details(): detail_df, unprocessables = await detail_detection( responses_df, llm, question="What are your views on the proposed cycling infrastructure expansion?", batch_size=20, system_prompt="You are an AI evaluation tool analyzing public consultation responses.", concurrency=10 ) evidence_rich = detail_df[detail_df['evidence_rich'] == 'YES'] return evidence_rich ``` -------------------------------- ### Build and Serve Documentation Locally Source: https://github.com/i-dot-ai/themefinder/blob/main/docs/internal_contributors.md Commands to build and serve the MkDocs documentation site locally to preview changes. ```bash poetry run mkdocs build poetry run mkdocs serve ``` -------------------------------- ### Create Open Question Inputs Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb Prepares input files for open-ended questions, creating a directory for each question and saving responses in JSONL format. It also generates a question metadata JSON file. Handles character removal and optional sampling. Dependencies include os and json. 
```python
import json
import os
from typing import Optional

import pandas as pd


def create_open_question_inputs(
    df: pd.DataFrame,
    open_questions: list[dict],
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],
    sample_size: Optional[int] = None
) -> None:
    for question in open_questions:
        q_num = question['question_number']
        question_col = question['column_name']
        q_dir = f"inputs/question_part_{q_num}"
        os.makedirs(q_dir, exist_ok=True)
        question_string = question['question_text']

        # Keep only respondents who answered this question
        question_answers = df[['themefinder_id', question_col]].dropna()
        if sample_size is not None and sample_size < len(question_answers):
            question_answers = question_answers.sample(sample_size)

        # Cast to str and drop non-ASCII characters before the string replacements,
        # so that non-string answers (e.g. numbers) do not break .replace()
        question_answers[question_col] = question_answers[question_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")
        for bad_string in characters_to_remove:
            question_answers[question_col] = question_answers[question_col].apply(lambda x: x.replace(bad_string, " "))

        question_answers.columns = ['themefinder_id', 'text']
        question_answers[['themefinder_id', 'text']].to_json(os.path.join(q_dir, 'responses.jsonl'), orient='records', lines=True)

        # Question metadata stored alongside the responses
        question_data = {
            "question_number": q_num,
            "question_text": question_string,
            "has_free_text": True
        }
        with open(os.path.join(q_dir, 'question.json'), 'w') as f:
            json.dump(question_data, f, indent=4)
```

--------------------------------

### Define Structured Data Models with Pydantic

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Illustrates the initialization of various Pydantic models used in ThemeFinder, such as Theme, CondensedTheme, and ThemeMappingOutput. These models ensure that data structures for theme generation and refinement are strictly validated.
```python from themefinder.models import Theme, CondensedTheme, RefinedTheme, ThemeMappingOutput, DetailDetectionOutput, Position, EvidenceRich theme = Theme(topic_label="traffic reduction", topic_description="The proposal will help reduce road congestion significantly", position=Position.AGREEMENT) condensed = CondensedTheme(topic_label="traffic concerns", topic_description="Combined themes about traffic and congestion issues", source_topic_count=5) refined = RefinedTheme(topic="Traffic Reduction: The cycling infrastructure will significantly reduce road congestion", source_topic_count=5) mapping = ThemeMappingOutput(response_id=1, labels=["A", "C", "D"]) detail = DetailDetectionOutput(response_id=1, evidence_rich=EvidenceRich.YES) ``` -------------------------------- ### JSONL Schema for Detail Detection Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md Defines a flag indicating whether a response is considered 'evidence-rich', meaning it contains specific details or examples. Stored in JSON Lines format with 'YES' or 'NO' values. ```jsonl {"response_id": 1, "evidence_rich": "YES"} {"response_id": 2, "evidence_rich": "NO"} ``` -------------------------------- ### Analyze survey responses with ThemeFinder Source: https://github.com/i-dot-ai/themefinder/blob/main/README.md This snippet demonstrates how to initialize a LangChain LLM, prepare survey data in a pandas DataFrame, and execute the find_themes pipeline asynchronously. It requires environment variables for LLM authentication and returns structured thematic analysis. 
```python import asyncio from dotenv import load_dotenv import pandas as pd from langchain_openai import AzureChatOpenAI from themefinder import find_themes load_dotenv() llm = AzureChatOpenAI( model="gpt-4o", temperature=0, ) responses_df = pd.DataFrame({ "response_id": ["1", "2", "3", "4", "5"], "response": ["I think it's awesome, I can use it for consultation analysis.", "It's great.", "It's a good approach to topic modelling.", "I'm not sure, I need to trial it more.", "I don't like it so much."] }) question = "What do you think of ThemeFinder?" system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package." async def main(): result = await find_themes(responses_df, llm, question, system_prompt=system_prompt) print(result) if __name__ == "__main__": asyncio.run(main()) ``` -------------------------------- ### Execute Theme Analysis Pipeline with find_themes Source: https://context7.com/i-dot-ai/themefinder/llms.txt This snippet demonstrates how to initialize an Azure OpenAI LLM, prepare survey data in a pandas DataFrame, and execute the full ThemeFinder pipeline. The find_themes function processes responses to return identified themes, response-theme mappings, and evidence-rich insights. ```python import asyncio import pandas as pd from dotenv import load_dotenv from langchain_openai import AzureChatOpenAI from themefinder import find_themes load_dotenv() llm = AzureChatOpenAI( model="gpt-4o", temperature=0, ) responses_df = pd.DataFrame({ "response_id": ["1", "2", "3", "4", "5"], "response": [ "Buses need to run more frequently, especially during rush hour.", "The schedule says every 15 minutes but I've been waiting for 35 minutes.", "Better lighting at bus stops - some areas feel unsafe at night.", "Monthly passes are too expensive for low-income families.", "Electric buses would reduce noise and air pollution." ] }) question = "What improvements would you like to see in public transit?" 
system_prompt = "You are an AI evaluation tool analyzing responses to a UK Government public consultation on public transit improvements."

async def main():
    result = await find_themes(
        responses_df,
        llm,
        question,
        system_prompt=system_prompt,
        verbose=True,
        concurrency=10
    )
    print("Question:", result["question"])
    print("\nIdentified Themes:")
    print(result["themes"])
    print("\nResponse-Theme Mapping:")
    print(result["mapping"])

if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### Create Hybrid Question Inputs

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

Prepares input files for hybrid questions, which combine closed and open-ended responses. It creates directories and processes data similarly to open questions, handling combined columns and optional sampling. Dependencies include os.

```python
def create_hybrid_question_inputs(
    df: pd.DataFrame,
    hybrid_questions: list[dict],
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],
    sample_size: Optional[int] = None
) -> None:
    for question in hybrid_questions:
        q_num = question['question_number']
        q_dir = f"inputs/question_part_{q_num}"
        closed_col = question['closed_column']
        open_col = question['open_column']
        question_string = question['question_text']
        os.makedirs(q_dir, exist_ok=True)
        # Drop only rows where both the closed and the open answer are missing
        question_answers = df[['themefinder_id'] + [closed_col, open_col]].dropna(subset=[closed_col, open_col], how='all')
        if sample_size is not None and sample_size < len(question_answers):
```

--------------------------------

### Interact with OpenAI LLM using OpenAILLM Interface

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

The OpenAILLM class offers a direct interface to OpenAI's SDK, bypassing LangChain when not needed. It supports both synchronous and asynchronous API calls and allows for structured output using Pydantic models. Initialization requires a model name and, optionally, an API key passed directly or read from environment variables.
```python import asyncio from themefinder import OpenAILLM, LLMResponse from pydantic import BaseModel, Field from typing import List # Define structured output model class ThemeList(BaseModel): themes: List[str] = Field(description="List of identified themes") # Initialize OpenAI LLM with custom settings llm = OpenAILLM( model="gpt-4o", request_kwargs={"temperature": 0, "max_tokens": 1000}, api_key="your-api-key", # Or use OPENAI_API_KEY env var ) # Async call with structured output async def async_example(): response: LLMResponse = await llm.ainvoke( prompt="List 3 main themes from: 'Better bus service, lower fares, more routes'", output_model=ThemeList ) print("Parsed themes:", response.parsed.themes) # Sync call without structured output def sync_example(): response: LLMResponse = llm.invoke( prompt="Summarize the main concern: 'Buses are always late'" ) print("Response:", response.parsed) if __name__ == "__main__": asyncio.run(async_example()) sync_example() ``` -------------------------------- ### Run Theme Extraction Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb Execute the theme extraction pipeline using the loaded responses and the configured LLM. ```python results = await themefinder.find_themes( responses, llm=llm, question=question ) ``` -------------------------------- ### Refine Themes with Themefinder Source: https://context7.com/i-dot-ai/themefinder/llms.txt Demonstrates how to use the theme_refinement function to consolidate survey topics into a refined DataFrame. It requires a DataFrame of condensed themes and an LLM instance to process the data asynchronously. 
```python import asyncio import pandas as pd from themefinder import theme_refinement condensed_themes_df = pd.DataFrame({ "topic_label": ["traffic and congestion reduction", "cost and financial concerns", "environmental and air quality benefits"], "topic_description": ["The proposal will significantly reduce traffic congestion on roads", "Concerns about the financial impact on taxpayers and project costs", "Positive environmental impact including better air quality from cycling"], "source_topic_count": [2, 2, 2] }) async def refine_themes(): refined_df, unprocessables = await theme_refinement( condensed_themes_df, llm, question="What are your views on the proposed cycling infrastructure expansion?", batch_size=10000, system_prompt="You are an AI evaluation tool analyzing public consultation responses.", concurrency=10 ) return refined_df ``` -------------------------------- ### POST /llm_invoke Source: https://context7.com/i-dot-ai/themefinder/llms.txt Interface for interacting with OpenAI models, supporting both synchronous and asynchronous execution with structured output. ```APIDOC ## POST /llm_invoke ### Description Provides a direct OpenAI SDK implementation for use cases where LangChain is not needed. It supports both synchronous and asynchronous calls with Pydantic-based structured output support. ### Method POST ### Endpoint OpenAILLM.ainvoke / OpenAILLM.invoke ### Parameters #### Request Body - **prompt** (string) - Required - The input text for the LLM. - **output_model** (BaseModel) - Optional - Pydantic model for structured output parsing. ### Request Example { "prompt": "List 3 main themes from: 'Better bus service'", "output_model": "ThemeList" } ### Response #### Success Response (200) - **response** (LLMResponse) - Contains the parsed structured output. 
```

--------------------------------

### Shuffle Theme Order for Evaluation (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvements.md

Implements a function to shuffle the order of themes before evaluation. This is a zero-cost reliability improvement to mitigate positional bias in LLM judgments.

```python
import random

def evaluate_with_shuffle(themes, judge_prompt, llm):
    # Shuffle a copy so the caller's list keeps its original order
    shuffled = themes.copy()
    random.shuffle(shuffled)
    return llm.invoke(judge_prompt.format(themes=shuffled))
```

--------------------------------

### Enforce Theme Quality Rules

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

Demonstrates how to execute semantic similarity and overlap checks on theme data using ThemeFinder's rule-based functions. Each function returns a list of Slack messages and a flag indicating whether the rule failed.

```python
from openai import OpenAI
from themefinder.rules import (
    rule_3_semantic_similarity_must_be_less_than_90pc_slack,
    rule_4_themes_should_not_overlap_slack,
)

# `themes` and `mapping` are assumed to have been produced earlier in the pipeline
client = OpenAI()

slack_messages, failed = rule_3_semantic_similarity_must_be_less_than_90pc_slack(themes, client)
print(f"Rule 3 - Similarity check: {'FAILED' if failed else 'PASSED'}")

slack_messages, failed = rule_4_themes_should_not_overlap_slack(mapping)
print(f"Rule 4 - Overlap check: {'FAILED' if failed else 'PASSED'}")
```

--------------------------------

### Create Inputs for Hybrid Questions (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

A helper function to create input files for hybrid questions. It processes responses by cleaning and encoding data, removing specified characters, and splitting options. It then saves multi-choice options and free text responses into separate JSONL files and generates a question.json file containing question details and all unique multi-choice options.
```python
import json
import os
from typing import Optional


def create_hybrid_question_inputs(
    responses_df,
    question_info: list[dict],
    # The original snippet read these two names from the enclosing notebook scope;
    # they are made explicit keyword arguments here so the function is self-contained.
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],
    sample_size: Optional[int] = None,
):
    for question in question_info:
        q_num = question['question_number']
        closed_col = question['closed_column']
        open_col = question['open_column']
        q_dir = f"inputs/question_part_{q_num}"
        os.makedirs(q_dir, exist_ok=True)
        question_string = question['question_text']

        # Drop only rows where both parts are missing, so the fillna calls below
        # can mark the part that was left blank
        question_answers = responses_df[['themefinder_id', closed_col, open_col]].dropna(subset=[closed_col, open_col], how='all')
        if sample_size is not None and sample_size < len(question_answers):
            question_answers = question_answers.sample(sample_size)

        question_answers[closed_col] = question_answers[closed_col].fillna('Not Provided')
        question_answers[open_col] = question_answers[open_col].fillna('Not Provided')

        # Cast to str and strip non-ASCII characters
        question_answers[closed_col] = question_answers[closed_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")
        question_answers[open_col] = question_answers[open_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")

        for bad_string in characters_to_remove:
            question_answers[closed_col] = question_answers[closed_col].apply(lambda x: x.replace(bad_string, " "))
            question_answers[open_col] = question_answers[open_col].apply(lambda x: x.replace(bad_string, " "))

        # Closed answers are comma-separated lists of selected options
        question_answers[closed_col] = question_answers[closed_col].apply(lambda x: x.split(","))
        question_answers.rename(columns={closed_col: 'options', open_col: 'text'}, inplace=True)

        question_answers[['themefinder_id','options']].to_json(os.path.join(q_dir, 'multi_choice.jsonl'), orient='records', lines=True)
        question_answers[['themefinder_id', 'text']].to_json(os.path.join(q_dir, 'responses.jsonl'), orient='records', lines=True)

        question_data = {
            "question_number": q_num,
            "question_text": question_string,
            "has_free_text": True,
            "multi_choice_options": list(set([item for sublist in question_answers['options'] for item in sublist])),
        }
        with open(os.path.join(q_dir, 'question.json'), 'w') as f:
            json.dump(question_data, f, indent=4)
```

--------------------------------

### Refine Themes into Actionable Statements Source:
https://context7.com/i-dot-ai/themefinder/llms.txt Standardizes condensed themes into clear, actionable statements. It reformulates topics to express definitive stances and assigns sequential alphabetic IDs. ```python import asyncio import pandas as pd from langchain_openai import AzureChatOpenAI from themefinder import theme_refinement llm = AzureChatOpenAI(model="gpt-4o", temperature=0) ``` -------------------------------- ### Theme Mapping and Export Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb This snippet covers the theme mapping process and how to export the results to an Excel file. ```APIDOC ## Theme Mapping and Export ### Description This section describes the process of mapping responses to the refined themes and exporting the mapping results to an Excel file. ### Method Asynchronous function call and DataFrame method ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ```python import pandas as pd from themefinder import theme_mapping # Assuming 'responses', 'llm', 'question', and 'themes' (refined themes DataFrame) are defined # mapping, unprocessed = await theme_mapping( # responses, # llm=llm, # refined_themes_df=themes, # question=question # ) # Export the mapping to an Excel file # mapping.to_excel("mapping.xlsx") ``` ### Response #### Success Response (200) - **mapping** (DataFrame) - A DataFrame containing the mapping of responses to themes. - **unprocessed** (DataFrame) - A DataFrame of responses that could not be processed. 
#### Response Example
```json
{
  "mapping": [
    {"response_id": 1, "topic_id": "A"},
    {"response_id": 2, "topic_id": "B"}
  ],
  "unprocessed": []
}
```
```

--------------------------------

### Implement Chain-of-Thought Prompting for Theme Evaluation

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Defines a structured prompt template that forces the LLM to perform step-by-step reasoning before outputting a judgment. This improves auditability and reliability of theme matching.

```text
For theme "{theme_label}":

Step 1: Identify the core concept
- What is this theme fundamentally about?

Step 2: Search for matches
- Which theme(s) in the comparison list address the same concept?

Step 3: Assess alignment
- What aligns between them?
- What differs?

Step 4: Judgment
- Decision: MATCH / NO_MATCH
- If MATCH, strength: STRONG / PARTIAL
- Reasoning summary: <1-2 sentences>

Output JSON:
{
  "theme_label": {
    "reasoning": "This theme about X matches theme Y because...",
    "decision": "MATCH",
    "strength": "STRONG"
  }
}
```

--------------------------------

### Configure Separate Judge Model in BenchmarkRunner

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Demonstrates how to decouple the judge model from the task model by updating the BenchmarkConfig dataclass and CLI arguments. This allows for flexible model selection for evaluations.

```python
# benchmark.py - Add judge_model to BenchmarkRunner
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    models: list[ModelConfig]
    datasets: list[str]
    eval_types: list[str]
    runs_per_model: int = 3
    # Fields with defaults must follow the required fields in a dataclass
    judge_model: ModelConfig | None = None  # If None, uses GPT-4o default

# evaluators.py - Accept judge as parameter
def create_groundedness_evaluator(judge_llm: BaseLLM):
    """Judge LLM is now explicitly passed, not same as task LLM."""
    ...
# CLI addition parser.add_argument( "--judge-model", default="gpt-4o-mini", help="Model to use for LLM-as-judge evaluations" ) ``` -------------------------------- ### ThemeFinder Rules and Pydantic Models Source: https://context7.com/i-dot-ai/themefinder/llms.txt Demonstrates the application of rules for semantic similarity and theme overlap, and showcases the usage of Pydantic models for structured data representation. ```APIDOC ## Rule Application and Pydantic Models ### Description This section illustrates how to apply predefined rules for semantic similarity and theme overlap checks, and demonstrates the structure and usage of Pydantic models for data validation and organization within the ThemeFinder package. ### Rule 3: Semantic Similarity Check This rule ensures that the semantic similarity between themes is below a specified threshold (e.g., 90%). ```python from themefinder.rules import rule_3_semantic_similarity_must_be_less_than_90pc_slack from openai import OpenAI client = OpenAI() # Assuming 'themes' is a predefined list of themes # slack_messages, failed = rule_3_semantic_similarity_must_be_less_than_90pc_slack(themes, client) # print(f"Rule 3 - Similarity check: {'FAILED' if failed else 'PASSED'}") ``` ### Rule 4: Theme Overlap Check This rule checks if the response overlap between themes is below a specified threshold (e.g., 70%). ```python from themefinder.rules import rule_4_themes_should_not_overlap_slack # Assuming 'mapping' is a predefined theme mapping object # slack_messages, failed = rule_4_themes_should_not_overlap_slack(mapping) # print(f"Rule 4 - Overlap check: {'FAILED' if failed else 'PASSED'}") ``` ### Pydantic Models for Structured Output ThemeFinder utilizes Pydantic models for robust data validation and structured representation of analysis outputs. #### Theme Model Represents a single theme with its topic label, description, and sentiment position. 
```python from themefinder.models import Theme, Position theme = Theme( topic_label="traffic reduction", topic_description="The proposal will help reduce road congestion significantly", position=Position.AGREEMENT # Options: AGREEMENT, DISAGREEMENT, UNCLEAR ) print(f"Theme: {theme.topic_label} - {theme.position.value}") ``` #### CondensedTheme Model Represents a condensed theme, summarizing multiple related topics and including the count of source topics. ```python from themefinder.models import CondensedTheme condensed = CondensedTheme( topic_label="traffic concerns", topic_description="Combined themes about traffic and congestion issues", source_topic_count=5 ) ``` #### RefinedTheme Model Represents a refined theme, typically in a colon-separated format, along with the source topic count. ```python from themefinder.models import RefinedTheme refined = RefinedTheme( topic="Traffic Reduction: The cycling infrastructure will significantly reduce road congestion", source_topic_count=5 ) print(f"Refined topic format: {refined.topic}") ``` #### ThemeMappingOutput Model Represents the output of a theme mapping process, linking response IDs to theme labels. ```python from themefinder.models import ThemeMappingOutput mapping = ThemeMappingOutput( response_id=1, labels=["A", "C", "D"] # Labels must be unique ) print(f"Mapping labels: {mapping.labels}") ``` #### DetailDetectionOutput Model Represents the output of a detail detection process, indicating whether rich evidence is present. ```python from themefinder.models import DetailDetectionOutput, EvidenceRich detail = DetailDetectionOutput( response_id=1, evidence_rich=EvidenceRich.YES # Options: YES, NO ) ``` #### Container Models These models aggregate multiple responses for batch processing and validation. 
```python from themefinder.models import ThemeGenerationResponses, ThemeMappingResponses # Example for ThemeGenerationResponses # theme_responses = ThemeGenerationResponses(responses=[theme]) # Example for ThemeMappingResponses # mapping_responses = ThemeMappingResponses(responses=[mapping]) ``` ``` -------------------------------- ### Save Open Questions from Excel Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb Reads open question configurations from an Excel file and uses create_open_question_inputs to generate necessary input files. It filters out questions with no answers and validates unique question numbers. Dependencies include pandas. ```python def save_open_questions(responses_df, question_understanding_path: str): question_info = pd.read_excel(question_understanding_path, sheet_name="Open questions", skiprows=3) question_info.columns = ["column_name", 'question_number', "question_text"] # remove questions with no answers only_nans = responses_df[question_info['column_name'].tolist()].isna().all() column_names_with_only_nans = only_nans[only_nans].index.tolist() question_info = question_info[~question_info['column_name'].isin(column_names_with_only_nans)] # Ensure question numbers are ints question_info['question_number'] = question_info['question_number'].astype(str).str.replace(r'\D', '', regex=True).astype(int) if not question_info['question_number'].is_unique: raise AssertionError("Non-unique values found in 'question_number' column") create_open_question_inputs(responses_df, question_info.to_dict(orient="records")) ``` -------------------------------- ### Process and Save Response Data (Python) Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb This script reads response data from a CSV file, renames its columns using the get_excel_column_name utility, creates an 'inputs' directory if it doesn't exist, adds a 'themefinder_id' column, and then saves different subsets of the data (demographic, open 
questions, hybrid questions, closed questions) to specified paths. ```python responses_df = pd.read_csv("raw_data/responses_output_cleaned.csv", header=0) responses_df.columns = [get_excel_column_name(i) for i in range(len(responses_df.columns))] os.makedirs("inputs", exist_ok=True) responses_df['themefinder_id'] = range(1, len(responses_df) + 1) ``` ```python save_demographic_data(responses_df, "question_understanding_path") ``` ```python save_open_questions(responses_df, "question_understanding_path") ``` ```python save_hybrid_questions(responses_df, "question_understanding_path") ``` ```python save_closed_questions(responses_df, "question_understanding_path") ``` -------------------------------- ### Generate Themes from Survey Responses Source: https://context7.com/i-dot-ai/themefinder/llms.txt Extracts initial themes from survey responses using exploratory LLM prompts. It processes data in batches to identify viewpoints and sentiment positions. ```python import asyncio import pandas as pd from langchain_openai import AzureChatOpenAI from themefinder import theme_generation llm = AzureChatOpenAI(model="gpt-4o", temperature=0) responses_df = pd.DataFrame({ "response_id": ["1", "2", "3", "4", "5", "6"], "response": [ "I think the proposal will help reduce traffic congestion significantly.", "This change will cost taxpayers too much money.", "The environmental benefits outweigh any short-term costs.", "I'm worried about the impact on local businesses.", "Great initiative for improving air quality in the city.", "The timeline is unrealistic and needs more planning." ] }) question = "What are your views on the proposed cycling infrastructure expansion?" 
async def generate_themes(): theme_df, unprocessables = await theme_generation( responses_df, llm, question=question, batch_size=50, partition_key=None, system_prompt="You are an AI evaluation tool analyzing public consultation responses.", concurrency=10 ) print("Generated Themes:") print(theme_df) return theme_df if __name__ == "__main__": asyncio.run(generate_themes()) ``` -------------------------------- ### POST /theme_clustering Source: https://context7.com/i-dot-ai/themefinder/llms.txt Performs hierarchical clustering on a set of themes to reduce them to a target number, creating parent-child relationships. ```APIDOC ## POST /theme_clustering ### Description Performs advanced hierarchical clustering using an agentic approach. It iteratively merges semantically similar themes to reach a target number, creating parent-child relationships useful for drill-down analysis. ### Method POST ### Endpoint theme_clustering ### Parameters #### Request Body - **themes_df** (DataFrame) - Required - DataFrame containing topic_id, topic_label, topic_description, and source_topic_count. - **llm** (AzureChatOpenAI) - Required - LangChain LLM instance. - **max_iterations** (int) - Optional - Maximum clustering iterations. - **target_themes** (int) - Optional - Target number of final themes. - **significance_percentage** (float) - Optional - Minimum percentage for significance. ### Request Example { "themes_df": "...", "max_iterations": 5, "target_themes": 4 } ### Response #### Success Response (200) - **clustered_df** (DataFrame) - The resulting themes with hierarchical parent_id and children columns. ``` -------------------------------- ### ThemeFinder Pipeline Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb This snippet demonstrates the end-to-end pipeline for finding themes in a set of responses using the ThemeFinder library and an LLM. 
```APIDOC
## ThemeFinder Pipeline

### Description
This section shows how to run the entire ThemeFinder pipeline, from loading responses to identifying themes and mapping them.

### Method
Asynchronous function call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
import pandas as pd
from langchain_openai import AzureChatOpenAI
import themefinder

# Define the question and load responses
question = "What improvements would you most like to see in local public transportation?"
responses = pd.read_json("./example_data.json")

# Initialize the LLM
llm = AzureChatOpenAI(
    model_name="gpt-4o",
    temperature=0
)

# Run the pipeline
results = await themefinder.find_themes(
    responses,
    llm=llm,
    question=question,
)

# Access the identified themes
print(results["themes"])
```

### Response
#### Success Response (200)
- **themes** (DataFrame) - A DataFrame containing the identified themes.
- **mapping** (DataFrame) - A DataFrame mapping responses to themes.
- **unprocessed** (DataFrame) - A DataFrame of responses that could not be processed.

#### Response Example
```json
{
  "themes": [
    {"topic_id": "A", "topic": "Improved Bus Routes"},
    {"topic_id": "B", "topic": "More Frequent Service"}
  ],
  "mapping": [
    {"response_id": 1, "topic_id": "A"},
    {"response_id": 2, "topic_id": "B"}
  ],
  "unprocessed": []
}
```
```

--------------------------------

### Create Inputs for Closed Questions (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

A helper function to create input files for closed-ended questions. It iterates through provided question details, cleans the relevant DataFrame column by encoding to ASCII, removing specified characters, and splitting options by comma. It then saves the processed data into 'multi_choice.jsonl' and 'question.json' files within a dedicated directory for each question.
```python
import json
import os
from typing import Optional

import pandas as pd


def create_closed_question_inputs(
    df: pd.DataFrame,
    closed_questions: list[dict],
    characters_to_remove = ["/", "\\", '- Text', '_x000D_'],
    sample_size: Optional[int] = None
) -> None:
    for question in closed_questions:
        q_num = question['question_number']
        question_col = question['column_name']
        q_dir = f"inputs/question_part_{q_num}"
        os.makedirs(q_dir, exist_ok=True)
        question_string = question['question_text']

        # Keep only respondents who answered this question
        question_answers = df[['themefinder_id', question_col]].dropna()
        if sample_size is not None:
            question_answers = question_answers.sample(sample_size)

        # Drop non-ASCII characters, then replace unwanted substrings with a space
        question_answers[question_col] = question_answers[question_col].astype(str).str.encode("ascii", "ignore").str.decode("ascii")
        for bad_string in characters_to_remove:
            question_answers[question_col] = question_answers[question_col].apply(lambda x: x.replace(bad_string, " "))

        # Split the comma-separated selections into a list of options
        question_answers[question_col] = question_answers[question_col].apply(lambda x: x.split(","))
        question_answers.columns = ['themefinder_id', 'options']

        question_answers[['themefinder_id', 'options']].to_json(os.path.join(q_dir, 'multi_choice.jsonl'), orient='records', lines=True)

        # Collect the distinct options seen across all responses
        question_data = {
            "question_number": q_num,
            "question_text": question_string,
            "has_free_text": False,
            "multi_choice_options": list(set([item for sublist in question_answers['options'] for item in sublist])),
        }

        with open(os.path.join(q_dir, 'question.json'), 'w') as f:
            json.dump(question_data, f, indent=4)
```

--------------------------------

### Hierarchical Evaluation Implementation in Python

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/docs/llm-judge-improvement-plan.md

Implements a multi-stage evaluation process that prioritizes cheaper methods like semantic caching and heuristic checks before resorting to expensive LLM judgments. This reduces overall evaluation costs by handling simple cases efficiently. Dependencies include a semantic cache, a heuristic function, and an LLM judge.
```python
from typing import Callable

# SemanticCache, BaseLLM and EvalResult are assumed to be defined elsewhere.


class HierarchicalEvaluator:
    def __init__(
        self,
        cache: SemanticCache,
        heuristic_fn: Callable,
        llm_judge: BaseLLM,
        cache_threshold: float = 0.92,
        heuristic_confidence: float = 0.8
    ):
        self.cache = cache
        self.heuristic_fn = heuristic_fn
        self.llm_judge = llm_judge
        # Store the thresholds; evaluate() reads them from self
        self.cache_threshold = cache_threshold
        self.heuristic_confidence = heuristic_confidence

    def evaluate(self, themes: list[dict], expected: list[dict]) -> EvalResult:
        # Stage 1: Cache check
        cache_key = self._make_cache_key(themes, expected)
        cached = self.cache.get(cache_key, threshold=self.cache_threshold)
        # Assumes the cache returns None on a miss, so a cached score of 0 still counts as a hit
        if cached is not None:
            return EvalResult(score=cached, source="cache", cost=0)

        # Stage 2: Heuristic check
        heuristic_result = self.heuristic_fn(themes, expected)
        if heuristic_result.confidence > self.heuristic_confidence:
            self.cache.set(cache_key, heuristic_result.score)
            return EvalResult(
                score=heuristic_result.score,
                source="heuristic",
                cost=0.0001
            )

        # Stage 3: LLM judge
        llm_result = self._run_llm_judge(themes, expected)
        self.cache.set(cache_key, llm_result.score)
        return EvalResult(
            score=llm_result.score,
            source="llm",
            cost=0.02,
            reasoning=llm_result.reasoning
        )
```

--------------------------------

### JSON Schema for Theme Definitions

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md

Defines the structure for categorizing responses into themes. This includes a unique topic ID, label, description, and a combined topic string. Special themes 'X' (None of the Above) and 'Y' (No Reason Given) are always included.

```json
[
  {
    "topic_id": "A",
    "topic_label": "Funding Concerns",
    "topic_description": "Concerns about adequate funding for the proposed initiative.",
    "topic": "Funding Concerns: Concerns about adequate funding for the proposed initiative."
  }
]
```

--------------------------------

### JSONL Schema for Response-Theme Mapping

Source: https://github.com/i-dot-ai/themefinder/blob/main/evals/SYNTHETIC_DATA_SPEC.md

Defines the mapping between responses and identified themes, including associated stances.
This schema ensures that each response is linked to one or more valid topic IDs and their corresponding stances (POSITIVE, NEGATIVE, NEUTRAL). Stored in JSON Lines format.

```jsonl
{"response_id": 1, "labels": ["A", "C"], "stances": ["POSITIVE", "NEGATIVE"]}
{"response_id": 2, "labels": ["B"], "stances": ["POSITIVE"]}
```

--------------------------------

### Create Respondent Demographics JSONL

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

Generates a JSONL file containing respondents' demographic data. It cleans and formats demographic columns before saving them to 'inputs/respondents.jsonl'. Dependencies include pandas and numpy.

```python
import pandas as pd
import numpy as np
import os
import string
import json
from typing import Optional

def create_respondents_jsonl(df: pd.DataFrame, demographic_columns: list[str], demographic_labels: list[str]) -> None:
    """Create a JSONL file of respondents with their demographic data.

    Note: df is modified in place (columns are cleaned and renamed).

    Args:
        df (pd.DataFrame): DataFrame containing the respondents' data
        demographic_columns (list[str]): Demographic columns to include in the JSONL file
        demographic_labels (list[str]): Output labels for those columns, in the same order
    """
    # Remove carriage-return markers and drop non-ASCII characters
    for c in demographic_columns:
        df[c] = df[c].astype(str).str.replace('_x000D_', '', regex=False).str.encode("ascii", "ignore").str.decode("ascii")

    # Split comma-separated values into lists
    for c in demographic_columns:
        df[c] = df[c].apply(lambda x: x.split(","))

    df.rename(columns=dict(zip(demographic_columns, demographic_labels)), inplace=True)
    df["demographic_data"] = df[demographic_labels].to_dict(orient="records")
    df[['themefinder_id',"demographic_data"]].to_json("inputs/respondents.jsonl", orient="records", lines=True)
```

--------------------------------

### Perform Hierarchical Theme Clustering with ThemeFinder

Source: https://context7.com/i-dot-ai/themefinder/llms.txt

This Python function uses an agentic approach for hierarchical clustering of themes.
It iteratively merges semantically similar themes to achieve a target number of clusters, establishing parent-child relationships for detailed analysis. The function requires a pandas DataFrame with specific columns and an initialized LLM object.

```python
import asyncio
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import theme_clustering

llm = AzureChatOpenAI(model="gpt-4o", temperature=0)

# Input: DataFrame of condensed themes with required columns
themes_df = pd.DataFrame({
    "topic_id": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "topic_label": [
        "traffic reduction",
        "congestion improvement",
        "cost concerns",
        "budget worries",
        "air quality",
        "environmental benefits",
        "safety improvements",
        "timeline concerns"
    ],
    "topic_description": [
        "Less traffic on roads",
        "Reduced congestion at junctions",
        "Project costs too high",
        "Financial burden on citizens",
        "Better air quality",
        "Positive environmental impact",
        "Safer roads for cyclists",
        "Unrealistic project timeline"
    ],
    "source_topic_count": [10, 8, 15, 12, 20, 18, 5, 7]
})

async def cluster_themes():
    # Cluster themes down to target number
    clustered_df, _ = await theme_clustering(
        themes_df,
        llm,
        max_iterations=5,               # Maximum clustering iterations
        target_themes=4,                # Target number of final themes
        significance_percentage=10.0,   # Minimum percentage for significance
        return_all_themes=False,        # Return only significant themes
        system_prompt="You are an AI evaluation tool analyzing public consultation responses."
    )

    print("Clustered Themes:")
    print(clustered_df)
    # Output includes hierarchical relationships via parent_id and children columns
    return clustered_df

if __name__ == "__main__":
    asyncio.run(cluster_themes())
```

--------------------------------

### Refine and Map Themes

Source: https://github.com/i-dot-ai/themefinder/blob/main/examples/example_notebook.ipynb

Manually modify generated themes to include custom categories and perform the mapping stage to associate responses with themes, including handling unprocessed items.

```python
import string
from themefinder import theme_mapping

themes = results["themes"][["topic_id", "topic"]].copy()
themes.loc[len(themes)] = {"topic_id": string.ascii_uppercase[len(themes)], "topic": "Other: The response does not match any of the listed themes"}

mapping, unprocessed = await theme_mapping(
    responses,
    llm=llm,
    refined_themes_df=themes,
    question=question
)
mapping.to_excel("mapping.xlsx")
```

--------------------------------

### Process and Save Hybrid Question Data (Python)

Source: https://github.com/i-dot-ai/themefinder/blob/main/ingestion.ipynb

Processes responses for hybrid questions (containing both closed and open-ended answers). It cleans the data, handles ASCII encoding, removes specified characters, splits options, renames columns, and saves them into separate JSONL files for multi-choice options and free text responses. Finally, it generates a question.json file with question details.
```python
def save_hybrid_questions(responses_df, question_understanding_path: str):
    question_info = pd.read_excel(question_understanding_path, sheet_name="Hybrid questions", skiprows=3)
    question_info.columns = ["closed_column", "question_number", "question_text", "open_column"]

    # Ensure question numbers are ints
    question_info['question_number'] = question_info['question_number'].astype(str).str.replace(r'\D', '', regex=True).astype(int)
    if not question_info['question_number'].is_unique:
        raise AssertionError("Non-unique values found in 'question_number' column")

    create_hybrid_question_inputs(responses_df, question_info.to_dict(orient="records"))
```
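The question-number normalization in `save_hybrid_questions` can be tried in isolation. The following is a minimal sketch that mirrors the same two steps (strip non-digits, then check uniqueness) with the standard library's `re` module rather than pandas. The helper name `normalize_question_numbers` and the sample labels are hypothetical and do not come from the ingestion notebook.

```python
import re


def normalize_question_numbers(labels):
    """Strip non-digit characters from each label and return unique ints."""
    # Same pattern as the pandas call: remove every character that is not a digit
    numbers = [int(re.sub(r"\D", "", str(label))) for label in labels]
    # Mirror the uniqueness check from save_hybrid_questions
    if len(set(numbers)) != len(numbers):
        raise AssertionError("Non-unique values found in 'question_number' column")
    return numbers


# Hypothetical spreadsheet labels, for illustration only
print(normalize_question_numbers(["Q1", "Question 2", "3b"]))  # [1, 2, 3]
print(normalize_question_numbers(["1.0"]))  # [10]
```

The second call shows a caveat of this approach: a float-formatted label such as "1.0" becomes 10, because the decimal point is removed along with every other non-digit. A label containing no digits at all raises `ValueError` at the `int` conversion, so it may be worth checking how the question-number column is typed before relying on this step.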