### Local Development Setup for LLM Comparator (Shell)

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Provides the shell commands needed to clone the LLM Comparator repository, install dependencies, build the project, and start the development server locally. This is for users who want to run the tool on their own machines.

```shell
git clone https://github.com/PAIR-code/llm-comparator.git
cd llm-comparator
npm install
npm run build
npm run serve
```

--------------------------------

### Serve LLM Comparator Web App Locally

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Instructions to install dependencies, build, and serve the LLM Comparator web application locally. Access the application at http://localhost:8000. This setup is useful for local development and testing.

```bash
# Serve the web application locally
npm install
npm run build
npm run serve
# Access at http://localhost:8000

# Load specific results file via URL parameter
# http://localhost:8000/?results_path=https://example.com/results.json
# http://localhost:8000/?results_path=/local/path/to/results.json

# Or use the hosted version
# https://pair-code.github.io/llm-comparator/?results_path=https://example.com/results.json
```

--------------------------------

### Install LLM Comparator from PyPI

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Installs the LLM Comparator library using pip, the recommended method for most users. If you use a Python virtual environment, activate it first.

```shell
pip install llm_comparator
```

--------------------------------

### Install and Run LLM Comparator Python Library

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Commands for installing the LLM Comparator Python library, either via pip or from source. Also includes a sample script that performs an end-to-end evaluation workflow using Vertex AI models and writes the results to a JSON file.
```bash
# Install Python library
pip install llm_comparator

# Or install from source for development
git clone https://github.com/PAIR-code/llm-comparator.git
cd llm-comparator/python
pip install -e .

# Create evaluation script (save as evaluate.py)
cat > evaluate.py << 'EOF'
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

inputs = [
    {
        'prompt': 'Your prompt here',
        'response_a': 'Response from model A',
        'response_b': 'Response from model B'
    }
]

generator = model_helper.VertexGenerationModelHelper()
embedder = model_helper.VertexEmbeddingModelHelper()
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

result = comparison.run(inputs, judge, bulletizer, clusterer)
comparison.write(result, './output.json')
EOF

# Run evaluation (requires Google Cloud authentication for Vertex AI)
python evaluate.py

# View results in web app
# Option 1: Use hosted version
# Open: https://pair-code.github.io/llm-comparator/?results_path=file:///path/to/output.json

# Option 2: Serve locally
cd ../  # Back to repo root
npm install
npm run build
npm run serve
# Open: http://localhost:8000/?results_path=/path/to/output.json
```

--------------------------------

### Install LLM Comparator from Source

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Installs the LLM Comparator library by cloning the GitHub repository and installing from the local source. This method is suitable for developers contributing to the library.

```shell
git clone https://github.com/PAIR-code/llm-comparator.git
cd llm-comparator/python
pip install -e .
```

--------------------------------

### Setup and Authenticate with Google Vertex AI

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Configures and authenticates the environment with Google Vertex AI. Users need to provide their project ID and select a region. This step ensures secure access to Vertex AI services.

```python
#@title Setup and authenticate with Google Vertex AI.
# These imports come from the notebook's import cell; repeated here so the
# cell reads on its own.
import vertexai
from google.colab import auth

PROJECT_ID = 'your_project_id'  #@param {type: "string"}
REGION = 'us-central1'  #@param {type: "string"}

auth.authenticate_user()
! gcloud config set project {PROJECT_ID}
vertexai.init(project=PROJECT_ID, location=REGION)
```

--------------------------------

### LLM Comparator Custom Fields Schema Example (JSON)

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Demonstrates the structure for defining custom fields and their types within the metadata of LLM Comparator. This allows for richer data analysis and visualization.

```json
{
  "metadata": {
    "source_path": "...",
    "custom_fields_schema": [
      {"name": "prompt_word_count", "type": "number"},
      {"name": "word_overlap_rate_between_a_b", "type": "number"},
      {"name": "data_source", "type": "category"},
      {"name": "unique_id", "type": "string"},
      {"name": "is_over_max_token", "type": "per_model_boolean"},
      {"name": "TF-IDF_between_prompt_and_response", "type": "per_model_number"},
      {"name": "writing_style", "type": "per_model_category"}
    ]
  },
  "models": [{...}, {...}],
  "examples": [
    {
      "input_text": "Which city should I visit in South Korea?",
      "tags": ["Travel"],
      "output_text_a": "You can visit Seoul, the capital of South Korea.",
      "output_text_b": "You can visit Seoul, Busan, and Jeju.",
      "score": 0.5,
      "individual_rater_scores": [],
      "custom_fields": {
        "prompt_word_count": 8,
        "word_overlap_rate_between_a_b": 0.61,
        "data_source": "XYZ",
        "unique_id": "abc000",
        "is_over_max_token": [true, false],
        "TF-IDF_between_prompt_and_response": [0.31, 0.15],
        "writing_style": ["Verbose", "Neutral"]
      }
    },
    {
      "input_text": "How to draw bar charts using Python?",
      ...
    }
  ]
}
```

--------------------------------

### Install LLM Comparator Package

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Installs the LLM Comparator Python package using pip. This package is essential for running comparative evaluations and generating the required JSON outputs for the web application.

```python
#@title Install the LLM Comparator package
! pip install llm_comparator
```

--------------------------------

### Initialize Rationale Cluster Generator and Embedder

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Initializes Vertex Generation Model Helper, Vertex Embedding Model Helper, and Rationale Cluster Generator. This setup is used to embed rationale bullets, cluster them by similarity, and generate descriptive labels for these clusters. The output includes cluster titles and bullet points augmented with similarity scores to each cluster.

```python
from llm_comparator import rationale_cluster_generator
from llm_comparator import model_helper

generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
embedder = model_helper.VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

# One list of rationale bullets per example; an example may have none.
rationale_bullets_for_examples = [
    ['More detailed information', 'Better structure with bullet points'],
    ['Includes specific examples', 'More comprehensive coverage'],
    ['Concise and clear', 'Direct answer to the question'],
    []
]

clusters, bullets_with_similarities = clusterer.run(
    rationale_bullets_for_examples,
    num_clusters=8
)

print("Cluster titles:")
for i, cluster in enumerate(clusters):
    print(f" {i}: {cluster['title']}")

for example_idx, example_bullets in enumerate(bullets_with_similarities):
    print(f"\nExample {example_idx}:")
    for bullet in example_bullets:
        print(f" Bullet: {bullet['rationale']}")
        print(f" Similarities to clusters: {bullet['similarities']}")
```

--------------------------------

### LLM Comparator Individual Rater Scores Example (JSON)

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Illustrates how to embed detailed individual rater scores, including whether responses were flipped, the score awarded, and the rationale provided by the LLM judge. This aids in analyzing evaluation consistency.

```json
{
  "metadata": {
    "source_path": "...",
    "custom_fields_schema": []
  },
  "models": [
    {"name": "Gemma 1.1"},
    {"name": "Gemma 1.0"}
  ],
  "examples": [
    {
      "input_text": "Which city should I visit in South Korea?",
      "tags": ["Travel"],
      "output_text_a": "You can visit Seoul, the capital of South Korea.",
      "output_text_b": "You can visit Seoul, Busan, and Jeju.",
      "score": 0.5,
      "individual_rater_scores": [
        {
          "is_flipped": false,
          "score": 1.5,
          "rationale": "A describes more information about ..."
        },
        {
          "is_flipped": false,
          "score": -0.5,
          "rationale": "While A provides one option, B gives ..."
        }
      ],
      "custom_fields": {}
    },
    {
      "input_text": "How to draw bar charts using Python?",
      ...
    }
  ]
}
```

--------------------------------

### LLM Comparator JSON Data Format

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Defines the structure for JSON files used by the LLM Comparator tool. It requires metadata, model names, and a list of examples, where each example includes prompt text, model outputs (A and B), a score, and optional tags and custom fields. This format is essential for visualizing comparative LLM evaluation results.
```json
{
  "metadata": {
    "source_path": "Any string for your records (e.g., run id)",
    "custom_fields_schema": []
  },
  "models": [
    {"name": "Short name of your first model"},
    {"name": "Short name of your second model"}
  ],
  "examples": [
    {
      "input_text": "This is a prompt.",
      "tags": ["Math"],  # A list of keywords for categorizing prompts
      "output_text_a": "Response to the prompt from the first model (A)",
      "output_text_b": "Response to the prompt from the other model (B)",
      "score": -1.25,  # Score from the judge LLM
      "individual_rater_scores": [],
      "custom_fields": {}
    },
    {
      "input_text": "This is a next prompt.",
      "...": "..."
    }
  ]
}
```

--------------------------------

### JSON Data Format for LLM Comparator Input

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Defines the structure for JSON files used as input by the LLM Comparator web application. It includes metadata, model information, example data with outputs and scores, and rationale clusters.

```json
{
  "metadata": {
    "source_path": "evaluation_run_2024_01_04",
    "custom_fields_schema": [
      {"name": "prompt_word_count", "type": "number"},
      {"name": "data_source", "type": "category"},
      {"name": "is_structured", "type": "per_model_boolean"},
      {"name": "response_length", "type": "per_model_number"}
    ]
  },
  "models": [
    {"name": "Gemini 1.1"},
    {"name": "Gemini 1.0"}
  ],
  "examples": [
    {
      "input_text": "What is LLM Comparator?",
      "tags": ["Technology", "AI"],
      "output_text_a": "LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation.",
      "output_text_b": "LLM Comparator is a tool for comparing LLM responses.",
      "score": 0.75,
      "individual_rater_scores": [
        {
          "score": 1.0,
          "rating_label": "A is better",
          "is_flipped": false,
          "rationale": "Response A is more detailed and informative."
        },
        {
          "score": 0.5,
          "rating_label": "A is slightly better",
          "is_flipped": false,
          "rationale": "Response A provides better context."
        }
      ],
      "rationale_list": [
        {
          "rationale": "More detailed information",
          "similarities": [0.85, 0.42, 0.31, 0.28, 0.19, 0.12, 0.08, 0.05]
        },
        {
          "rationale": "Better technical accuracy",
          "similarities": [0.51, 0.89, 0.23, 0.18, 0.15, 0.09, 0.06, 0.03]
        }
      ],
      "custom_fields": {
        "prompt_word_count": 4,
        "data_source": "Internal Dataset",
        "is_structured": [false, false],
        "response_length": [89, 54]
      }
    }
  ],
  "rationale_clusters": [
    {"title": "Detail and Comprehensiveness"},
    {"title": "Technical Accuracy"},
    {"title": "Clarity and Structure"},
    {"title": "Conciseness"},
    {"title": "Engagement"},
    {"title": "Relevance"},
    {"title": "Examples and Specificity"},
    {"title": "Tone and Style"}
  ]
}
```

--------------------------------

### Initialize and Run Rationale Clusterer in Python

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Demonstrates a minimal Python script that initializes the necessary components, including the RationaleClusterGenerator, and runs the comparative evaluation. It requires instances of `GenerationModelHelper` and `EmbeddingModelHelper` subclasses and defines the input format and output file path.

```python
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

inputs = [
    # Provide your inputs here.
    # They must conform to llm_comparator.types.LLMJudgeInput
]

# Initialize the models-calling classes.
# Replace each `...` placeholder with a real instance before running.
generator = ...  # Initialize a model_helper.GenerationModelHelper() subclass
embedder = ...  # Initialize a model_helper.EmbeddingModelHelper() subclass

# Initialize the instances that run work on the models.
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(
    generator, embedder
)

# Configure and run the comparative evaluation.
comparison_result = comparison.run(inputs, judge, bulletizer, clusterer)

# Write the results to a JSON file that can be loaded in
# https://pair-code.github.io/llm-comparator
file_path = "path/to/file.json"
comparison.write(comparison_result, file_path)
```

--------------------------------

### Initialize Vertex AI Generation Model Helper

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes a helper to interface with Google Vertex AI generative language models. Defaults to 'gemini-pro', but can be configured with a different 'model_name'.

```python
from llm_comparator.model_helper import VertexGenerationModelHelper

# Using default model 'gemini-pro'
vertex_gen_helper = VertexGenerationModelHelper()

# Configuring with a different model name
# vertex_gen_helper = VertexGenerationModelHelper(model_name='your-model-name')
```

--------------------------------

### Initialize LLM Comparator Components (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Initializes the core components for the LLM Comparator: LLMJudgeRunner, RationaleBulletGenerator, and RationaleClusterGenerator. These components are responsible for judging, bulletizing, and clustering the results of LLM comparisons. Dependencies include 'generator' and 'embedder' objects, which are assumed to be pre-configured.
```python
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(
    generator, embedder
)
```

--------------------------------

### View Comparison Results in Colab (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Displays the comparative evaluation results directly within a Google Colab environment. The `comparison.show_in_colab()` function takes the `file_path` of the saved results and renders them in an interactive format, facilitating immediate analysis and review.

```python
comparison.show_in_colab(file_path)
```

--------------------------------

### Initialize Vertex AI Embedding Model Helper

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes a helper to interface with Google Vertex AI text embedding models. Defaults to 'textembedding-gecko@003', but can be configured with a different 'model_name'.

```python
from llm_comparator.model_helper import VertexEmbeddingModelHelper

# Using default model 'textembedding-gecko@003'
vertex_emb_helper = VertexEmbeddingModelHelper()

# Configuring with a different model name
# vertex_emb_helper = VertexEmbeddingModelHelper(model_name='your-embedding-model-name')
```

--------------------------------

### Initialize Vertex AI Models for LLM Comparator

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Initializes the generative and embedding models from Google Vertex AI to be used within the LLM Comparator evaluation. It defaults to 'gemini-pro' for generation and 'textembedding-gecko@003' for embedding, but these can be customized. These models are crucial for the judge, bulletizer, and clusterer functionalities.

```python
#@title Initialize models used in the LLM Comparator evaluation.

# The generator model can be any Text-to-Text LLM provided by Vertex AI. This
# model will be asked to do a series of tasks---judge, bulletize, and cluster---
# and it is often beneficial to use a larger model for this reason.
#
# We default to 'gemini-pro' but you can change this with the `model_name=`
# param. For a full list of models available via the Model Garden, check out
# https://console.cloud.google.com/vertex-ai/model-garden?pageState=(%22galleryStateKey%22:(%22f%22:(%22g%22:%5B%22supportedTasks%22,%22inputTypes%22%5D,%22o%22:%5B%22GENERATION%22,%22LANGUAGE%22%5D),%22s%22:%22%22)).
#
# Since we're using Gemini Pro, a very competent and flexible foundation model,
# we are sharing the same generator across all downstream tasks. However, you
# could use different models for each task if desired.
generator = model_helper.VertexGenerationModelHelper()

# The embedding model can be any text embedder provided by Vertex AI. We default
# to 'textembedding-gecko@003' but you can change this with the `model_name=`
# param. For a full list of models available via the Model Garden, check out
# https://console.cloud.google.com/vertex-ai/model-garden?pageState=(%22galleryStateKey%22:(%22f%22:(%22g%22:%5B%22supportedTasks%22,%22inputTypes%22%5D,%22o%22:%5B%22EMBEDDING%22,%22LANGUAGE%22%5D),%22s%22:%22%22))
embedder = model_helper.VertexEmbeddingModelHelper()

# The following models do the core work of a Comparative Evaluation: judge,
# bulletize, and cluster. Each class provides a `.run()` function, and the
# `llm_comparator.comparison.run()` API orchestrates configuring and calling
# these APIs on the instances you pass in. More on how to configure these below.

# The `judge` is the model responsible for actually doing the comparison between
# the two models. The same judge is run multiple times to get a diversity of
# perspectives, more on how to configure this below.
#
# A judge must phrase its responses in a simple XML format that includes the
# verdict and an explanation of the results, to enable downstream processing by
# the bulletizer and clusterer.
#
# <result>
#   <explanation>YOUR EXPLANATION GOES HERE.</explanation>
#   <verdict>A is slightly better</verdict>
# </result>
#
# We provide a default "judge" prompt in
# llm_comparator.llm_judge_runner.DEFAULT_LLM_JUDGE_PROMPT_TEMPLATE, and you can
# use the `llm_judge_prompt_template=` parameter to provide a custom prompt that
```

--------------------------------

### Import LLM Comparator and Vertex AI Packages

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Imports necessary libraries for the LLM Comparator, including Vertex AI for model interaction and Google Colab authentication. It also imports modules for comparison, model helpers, and specific runner functionalities.

```python
#@title Import relevant packages
import vertexai
from google.colab import auth

# The comparison library provides the primary API for running Comparative
# Evaluations and generating the JSON files required by the LLM Comparator web
# app.
from llm_comparator import comparison

# The model_helper library is used to initialize API wrappers to interface with
# models. For this demo we focus on models served by Google Vertex AI, but you
# can extend the llm_comparator.model_helper.GenerationModelHelper and
# llm_comparator.model_helper.EmbeddingModelHelper classes to work with other
# providers or models you host yourself.
from llm_comparator import model_helper

# The following libraries contain wrappers that implement the core functionality
# of the Comparative Evaluation workflow. More on these below.
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator
```

--------------------------------

### Run Comparative Evaluation

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Coordinates the comparative evaluation process, including judging, bulletizing, and clustering model responses. Requires initialized judge, bulletizer, and clusterer components, the model helpers behind them, and inputs that pair each prompt with the two responses to compare.

```python
from llm_comparator import comparison
from llm_comparator.model_helper import VertexGenerationModelHelper, VertexEmbeddingModelHelper
from llm_comparator.llm_judge_runner import LLMJudgeRunner
from llm_comparator.rationale_bullet_generator import RationaleBulletGenerator
from llm_comparator.rationale_cluster_generator import RationaleClusterGenerator

# Initialize model helpers
vertex_gen_helper = VertexGenerationModelHelper(model_name='gemini-pro')
vertex_emb_helper = VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')

# Initialize judge, bulletizer, and clusterer
judge_runner = LLMJudgeRunner(vertex_gen_helper)
bullet_generator = RationaleBulletGenerator(vertex_gen_helper)
cluster_generator = RationaleClusterGenerator(vertex_gen_helper, vertex_emb_helper)

# Define the inputs: each prompt paired with the two responses to compare
inputs = [
    {
        'prompt': 'What is the capital of France?',
        'response_a': 'Response from model A',
        'response_b': 'Response from model B'
    },
    {
        'prompt': "Translate 'hello' to Spanish.",
        'response_a': 'Response from model A',
        'response_b': 'Response from model B'
    }
]

# Run the comparative evaluation (same call shape as the notebook demo)
results = comparison.run(inputs, judge_runner, bullet_generator, cluster_generator)

# Write the results to a JSON file for the web app
comparison.write(results, './results.json')
```

--------------------------------

### Implement Custom Generation Model Helper in Python

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Extends the abstract GenerationModelHelper class to integrate with a custom LLM provider.
This implementation includes `predict` and `predict_batch` methods to handle single and batch text generation requests, respectively. It allows for customization of parameters like temperature and max output tokens.

```python
from llm_comparator import model_helper
from typing import Sequence, Optional


class CustomGenerationModelHelper(model_helper.GenerationModelHelper):
    def __init__(self, api_key: str, model_name: str):
        self.api_key = api_key
        self.model_name = model_name
        # Initialize your model client here

    def predict(self, prompt: str, **kwargs) -> str:
        temperature = kwargs.get('temperature', 0.7)
        max_tokens = kwargs.get('max_output_tokens', 256)
        # Call your LLM API
        # response = your_api.generate(
        #     prompt=prompt,
        #     temperature=temperature,
        #     max_tokens=max_tokens
        # )
        # return response.text
        return "Generated response"

    def predict_batch(self, prompts: Sequence[str], **kwargs) -> Sequence[str]:
        # Implement batch prediction
        return [self.predict(prompt, **kwargs) for prompt in prompts]
```

--------------------------------

### Run Comparative Evaluation (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Executes the comparative evaluation using the `comparison.run()` function. This function takes prepared inputs, a judge, a bulletizer, and a clusterer to produce a structured dictionary output suitable for the LLM Comparator web app. Optional parameters like `judge_opts`, `bulletizer_opts`, and `clusterer_opts` allow for customization of component behaviors.

```python
# The comparison.run() function is the primary interface for running a
# Comparative Evaluation. It takes your prepared inputs, a judge, a bulletizer,
# and a clusterer and returns a Python dictionary in the required format for use
# in the LLM Comparator web app. You can inspect this dictionary in Python if
# you like, but it's more useful once written to a file.
#
# The example below is basic, but you can use the judge_opts=, bulletizer_opts=,
# and/or clusterer_opts= parameters (all of which are optional dictionaries that
# are converted to keyword options) to further customize the behaviors. See the
# docstrings for more.
comparison_result = comparison.run(
    llm_judge_inputs,
    judge,
    bulletizer,
    clusterer,
)
```

--------------------------------

### Python: Run Complete LLM Comparative Evaluation

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Orchestrates the complete comparative evaluation pipeline including judging, bulletizing, and clustering phases. It takes model responses, initializes evaluation components like judges and clusterers, and runs the comparison, saving results to a JSON file. Dependencies include the llm_comparator library and model_helper for LLM integrations.

```python
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

# Define inputs with prompts and responses from two models
inputs = [
    {
        'prompt': 'What is LLM Comparator?',
        'response_a': 'LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation.',
        'response_b': 'LLM Comparator is a tool for comparing LLM responses.'
    },
    {
        'prompt': 'How to draw bar charts using Python?',
        'response_a': 'Bar charts can be created using data visualization libraries.',
        'response_b': 'You can use Matplotlib, Plotly, or Altair to draw bar charts.'
    }
]

# Initialize model helpers (example using Vertex AI)
generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
embedder = model_helper.VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')

# Initialize evaluation components
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

# Run comparative evaluation with custom options
comparison_result = comparison.run(
    inputs,
    judge,
    bulletizer,
    clusterer,
    model_names=('Model A', 'Model B'),
    judge_opts={'num_repeats': 6},
    bulletizer_opts={'win_rate_threshold': 0.25},
    clusterer_opts={'num_clusters': 8}
)

# Write results to JSON file
output_path = comparison.write(comparison_result, './results.json')
print(f"Results written to {output_path}")
```

--------------------------------

### Initialize Rationale Bullet Generator

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes the Rationale Bullet Generator, which uses a language model to summarize judge results into bullet points for easier UI consumption. It can be configured with a win rate threshold.
```python
from llm_comparator.rationale_bullet_generator import RationaleBulletGenerator
from llm_comparator.model_helper import VertexGenerationModelHelper

# Initialize a generation model helper
model_helper = VertexGenerationModelHelper()

# Initialize the bullet generator
bullet_generator = RationaleBulletGenerator(model_helper)

# Optional: Configure bulletizer options like win rate threshold
# bulletizer_opts = {
#     "win_rate_threshold": 0.3
# }
# Pass them when running the evaluation:
# comparison.run(inputs, judge, bullet_generator, clusterer, bulletizer_opts=bulletizer_opts)
```

--------------------------------

### Initialize LLM Judge Runner with Custom Prompt and Scoring

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes the LLM Judge Runner with a custom prompt template and a mapping for verdict scores. This allows for tailored comparison logic and ensures judge verdicts can be numerically evaluated.

```python
from llm_comparator.llm_judge_runner import LLMJudgeRunner
from llm_comparator.model_helper import VertexGenerationModelHelper

# Define a custom prompt template.
# Note: the judge's reply still has to follow the XML format the runner parses.
custom_prompt = """Compare these two responses and tell me which one is better:\nResponse A: {response_a}\nResponse B: {response_b}\n\nProvide your verdict (e.g., 'A is better', 'B is better', 'tie') and an explanation."""

# Define a mapping from verdicts to scores
rating_map = {
    "A is better": 1.0,
    "B is better": -1.0,
    "tie": 0.0
}

# Initialize a generation model helper
model_helper = VertexGenerationModelHelper()

# Initialize the judge runner with custom prompt and rating map
judge_runner = LLMJudgeRunner(
    model_helper,
    llm_judge_prompt_template=custom_prompt,
    rating_to_score_map=rating_map
)
```

--------------------------------

### Initialize Generation Model and Rationale Bullet Generator

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Initializes the Vertex Generation Model Helper and the Rationale Bullet Generator.
The generator is used by the bulletizer to process judgements and generate bullet points based on a win rate threshold. Judgements are expected to be a list of dictionaries, each containing scores and individual rater scores with rationales.

```python
from llm_comparator import model_helper
from llm_comparator import rationale_bullet_generator

generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)

# One judgement per example, each with its per-rater scores and rationales
judgements = [
    {
        'score': 0.75,
        'individual_rater_scores': [
            {
                'score': 1.0,
                'is_flipped': False,
                'rationale': 'Response A provides more detailed information about multiple cities.'
            },
            {
                'score': 1.5,
                'is_flipped': False,
                'rationale': 'Response A includes specific details about each location.'
            },
            {
                'score': 0.0,
                'is_flipped': False,
                'rationale': 'Both responses are helpful.'
            }
        ]
    }
]

bullets = bulletizer.run(
    judgements,
    win_rate_threshold=0.25
)

for example_bullets in bullets:
    print(f"Generated {len(example_bullets)} bullets:")
    for bullet in example_bullets:
        print(f" - {bullet}")
```

--------------------------------

### Initialize LLM Judge Runner with Default Prompt

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes the LLM Judge Runner, which uses a language model to compare responses. It utilizes a default prompt template for judging and can be configured to run multiple times per comparison.
```python
from llm_comparator.llm_judge_runner import LLMJudgeRunner
from llm_comparator.model_helper import VertexGenerationModelHelper

# Initialize a generation model helper (e.g., Vertex AI)
model_helper = VertexGenerationModelHelper()

# Initialize the judge runner with the model helper
judge_runner = LLMJudgeRunner(model_helper)

# Optional: Configure judge options like number of repeats
# judge_opts = {
#     "num_repeats": 10
# }
# Pass them when running the evaluation:
# comparison.run(inputs, judge_runner, bulletizer, clusterer, judge_opts=judge_opts)
```

--------------------------------

### Prepare Input Data for LLM Judge

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Defines the input data structure for the LLM Judge, which consists of a list of dictionaries. Each dictionary contains a 'prompt' and two responses ('response_a', 'response_b') to be compared. This format is defined by the `llm_comparator.llm_judge_runner.LLMJudgeInput` type.

```python
#@title Prepare Your Inputs
# See llm_comparator.llm_judge_runner.LLMJudgeInput for the required input type.
llm_judge_inputs = [
    {'prompt': 'how are you?', 'response_a': 'good', 'response_b': 'bad'},
    {'prompt': 'hello?', 'response_a': 'hello', 'response_b': 'hi'},
    {'prompt': 'what is the capital of korea?', 'response_a': 'Seoul', 'response_b': 'Vancouver'}
]
```

--------------------------------

### Python: Rationale Bullet Generator for Condensing Rationales

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Condenses judge rationales into concise bullet points, improving readability and facilitating visualization. It requires an initialized generation model and potentially other components from the llm_comparator library. The input is typically a set of rationales from the judging phase, and the output is a structured list of bullets.
```python
from llm_comparator import rationale_bullet_generator
from llm_comparator import model_helper
from llm_comparator import types

# Assuming generator and judgements are already initialized from previous steps
# generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
# judgements = [...]  # Output from judge.run()

# Initialize the bullet generator
# bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)

# Example usage (requires actual judgements data):
# bullet_summary = bulletizer.run(judgements)
# print(bullet_summary)
```

--------------------------------

### Save Comparison Results to File (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Saves the results of a comparative evaluation to a JSON file. The `comparison.write()` function takes the `comparison_result` dictionary and a `file_path` as arguments, so the evaluation data can be stored and loaded again later.

```python
file_path = 'json_for_llm_comparator.json'  # @param {type: "string"}
comparison.write(comparison_result, file_path)
```

--------------------------------

### Python Library - Running Comparative Evaluation

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Orchestrates the complete comparative evaluation pipeline including judging, bulletizing, and clustering phases. This function takes model inputs, evaluation components, and configuration options to produce a comparison result that can be written to a JSON file.

```APIDOC
## Python Library - Running Comparative Evaluation

### Description
Orchestrates the complete comparative evaluation pipeline including judging, bulletizing, and clustering phases. This function takes model inputs, evaluation components, and configuration options to produce a comparison result that can be written to a JSON file.

### Method
`comparison.run()`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

# Define inputs with prompts and responses from two models
inputs = [
    {
        'prompt': 'What is LLM Comparator?',
        'response_a': 'LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation.',
        'response_b': 'LLM Comparator is a tool for comparing LLM responses.'
    },
    {
        'prompt': 'How to draw bar charts using Python?',
        'response_a': 'Bar charts can be created using data visualization libraries.',
        'response_b': 'You can use Matplotlib, Plotly, or Altair to draw bar charts.'
    }
]

# Initialize model helpers (example using Vertex AI)
generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
embedder = model_helper.VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')

# Initialize evaluation components
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

# Run comparative evaluation with custom options
comparison_result = comparison.run(
    inputs,
    judge,
    bulletizer,
    clusterer,
    model_names=('Model A', 'Model B'),
    judge_opts={'num_repeats': 6},
    bulletizer_opts={'win_rate_threshold': 0.25},
    clusterer_opts={'num_clusters': 8}
)

# Write results to JSON file
output_path = comparison.write(comparison_result, './results.json')
print(f"Results written to {output_path}")
```

### Response
#### Success Response (200)
- **comparison_result**: object - The result of the comparative evaluation.
- **output_path**: string - The path to the saved JSON file.

#### Response Example
```json
{
  "metadata": {...},
  "models": [...],
  "examples": [...],
  "rationale_clusters": [...]
}
```
```

--------------------------------

### Implement Custom Embedding Model Helper in Python

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Extends the abstract EmbeddingModelHelper class to integrate with a custom embedding model provider. This implementation includes `embed` and `embed_batch` methods for generating embeddings for single texts and batches of texts, respectively. It demonstrates basic batching logic for API calls.

```python
from typing import Sequence

from llm_comparator import model_helper


# Implement custom embedding model helper
class CustomEmbeddingModelHelper(model_helper.EmbeddingModelHelper):

    def __init__(self, api_key: str, model_name: str):
        self.api_key = api_key
        self.model_name = model_name

    def embed(self, text: str) -> Sequence[float]:
        # Call your embedding API
        # embedding = your_api.embed(text)
        # return embedding.values
        return [0.1, 0.2, 0.3]  # Example 3D embedding

    def embed_batch(self, texts: Sequence[str]) -> Sequence[Sequence[float]]:
        # Split the texts into batches of at most batch_size items
        results = []
        batch_size = 100
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            # batch_embeddings = your_api.embed_batch(batch)
            results.extend([self.embed(text) for text in batch])
        return results


# Use custom helpers in comparison pipeline
# CustomGenerationModelHelper is not defined in this snippet; it is assumed
# to be a custom generation helper implemented in the same way.
# generator = CustomGenerationModelHelper(api_key='your-key', model_name='custom-llm')
embedder = CustomEmbeddingModelHelper(api_key='your-key', model_name='custom-embedding')
```

--------------------------------

### Python: LLM Judge Runner for Evaluating Model Response Pairs

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Evaluates pairs of model responses using an LLM judge, with a configurable number of repeated runs to mitigate position bias. It takes input prompts and responses, initializes a generation model, and can be configured with a custom rating scale.
The output includes aggregated scores and individual ratings with rationales.

```python
from llm_comparator import llm_judge_runner
from llm_comparator import model_helper

# Initialize generation model
generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')

# Create judge with custom rating scale
custom_rating_map = {
    'A is much better': 1.5,
    'A is better': 1.0,
    'A is slightly better': 0.5,
    'same': 0.0,
    'B is slightly better': -0.5,
    'B is better': -1.0,
    'B is much better': -1.5
}

judge = llm_judge_runner.LLMJudgeRunner(
    generation_model_helper=generator,
    rating_to_score_map=custom_rating_map
)

# Prepare inputs
inputs = [
    {
        'prompt': 'Which city should I visit in South Korea?',
        'response_a': 'You can visit Seoul, the capital of South Korea.',
        'response_b': 'You can visit Seoul, Busan, and Jeju with their unique attractions.'
    }
]

# Run judge with 6 repeats (3 flipped, 3 non-flipped to handle position bias)
judgements = judge.run(inputs, num_repeats=6)

# Output contains aggregated scores and individual ratings
for judgement in judgements:
    print(f"Average score: {judgement['score']}")
    print(f"Individual ratings: {len(judgement['individual_rater_scores'])}")
    for rating in judgement['individual_rater_scores']:
        print(f" Score: {rating['score']}, Flipped: {rating['is_flipped']}")
        print(f" Rationale: {rating['rationale']}")
```

--------------------------------

### Rationale Bullet Generator - Condensing Judgement Rationales

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Condenses judge rationales into concise bulleted summaries. This helps in quickly understanding the key points of the judge's reasoning and facilitates easier analysis and visualization of differences between model responses.

```APIDOC
## Rationale Bullet Generator - Condensing Judgement Rationales

### Description
Condenses judge rationales into concise bulleted summaries.
This helps in quickly understanding the key points of the judge's reasoning and facilitates easier analysis and visualization of differences between model responses.

### Method
`rationale_bullet_generator.RationaleBulletGenerator`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
from llm_comparator import rationale_bullet_generator
from llm_comparator import model_helper
from llm_comparator import types

# Assume 'generator' and 'judgements' are initialized as in the previous example
# generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
# judgements = [...]  # Output from LLM Judge Runner

bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)

# Example: condense rationales for the first judgement only.
# run() takes a list of judgements, as in the other examples in this document.
if judgements:
    condensed_rationales = bulletizer.run(judgements[:1])[0]
    print(condensed_rationales)
```

### Response
#### Success Response (200)
- **condensed_rationales**: list - A list of strings, where each string is a bulleted summary of a rationale.

#### Response Example
```json
[
  "- Response B includes more locations.",
  "- Response A is more concise."
]
```
```

--------------------------------

### Display LLM Comparator Results in Google Colab

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Python code to display LLM Comparator evaluation results within a Google Colab notebook. It saves the comparison results to a file and then uses an iframe to embed the web application, allowing interactive viewing.

```python
# Display in Google Colab
from llm_comparator import comparison

# After creating comparison_result
file_path = '/content/evaluation_results.json'
comparison.write(comparison_result, file_path)

# Display in Colab iframe with custom height and port
comparison.show_in_colab(
    file_path=file_path,
    height=800,
    port=8888
)

# The function automatically:
# 1. Copies website files to /content/llm_comparator
# 2. Starts HTTP server on specified port
# 3. Creates iframe displaying the results
```

--------------------------------

### BibTeX Citation for ACM CHI Publication

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

This BibTeX entry provides the citation for the preliminary version of LLM Comparator presented at ACM CHI 2024 in the Late-Breaking Work track. It includes title, authors, booktitle, year, publisher, DOI, and arXiv URL.

```bibtex
@inproceedings{kahng2024comparator,
  title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
  author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
  booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
  year={2024},
  publisher={ACM},
  doi={10.1145/3613905.3650755},
  url={https://arxiv.org/abs/2402.10524}
}
```

--------------------------------

### BibTeX Citation for IEEE TVCG Publication

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

This BibTeX entry provides the full citation details for the LLM Comparator paper published in IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG) for the IEEE VIS 2024 conference. It includes title, authors, journal, year, volume, number, publisher, and DOI.

```bibtex
@article{kahng2025comparator,
  title={{LLM Comparator}: Interactive Analysis of Side-by-Side Evaluation of Large Language Models},
  author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2025},
  volume={31},
  number={1},
  publisher={IEEE},
  doi={10.1109/TVCG.2024.3456354}
}
```
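
--------------------------------

### Sketch: Aggregating Individual Rater Scores (Python)

The judge runner examples earlier in this document report an average score alongside individual ratings that carry an `is_flipped` flag. The following is a minimal, library-independent sketch of how such an aggregate could be computed. It is not the library's implementation: the helper names `score_rating` and `aggregate` are hypothetical, and the sign flip for swapped response order is an assumption about how position bias is handled. The rating labels and scores are copied from the custom rating map shown earlier.

```python
from statistics import mean

# Rating labels and scores copied from the custom rating map in the judge
# runner example earlier in this document.
RATING_TO_SCORE = {
    'A is much better': 1.5,
    'A is better': 1.0,
    'A is slightly better': 0.5,
    'same': 0.0,
    'B is slightly better': -0.5,
    'B is better': -1.0,
    'B is much better': -1.5,
}


def score_rating(label, is_flipped):
    """Convert one rating label into a score from model A's point of view.

    Assumption (not confirmed by the library): when the responses were shown
    to the judge in swapped order, the label refers to the swapped positions,
    so the sign of the score is negated.
    """
    score = RATING_TO_SCORE[label]
    return -score if is_flipped else score


def aggregate(ratings):
    """Build a judgement-like dict from (label, is_flipped, rationale) tuples."""
    individual = [
        {'score': score_rating(label, flipped),
         'is_flipped': flipped,
         'rationale': rationale}
        for label, flipped, rationale in ratings
    ]
    return {
        'score': mean(r['score'] for r in individual),
        'individual_rater_scores': individual,
    }


# Hypothetical ratings for a single prompt, one of them with swapped order.
ratings = [
    ('A is better', False, 'Response A is more detailed.'),
    ('B is better', True, 'The second response shown is more detailed.'),
    ('same', False, 'Both responses are helpful.'),
    ('A is slightly better', False, 'Response A is a little clearer.'),
]
judgement = aggregate(ratings)
print(judgement['score'])  # 0.625, the mean of [1.0, 1.0, 0.0, 0.5]
```

The resulting dictionary has the same shape as the judgements used in the bullet generator example (`score` plus `individual_rater_scores`), which makes it convenient for checking by hand that an aggregate score agrees with its individual ratings.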