### Local Development Setup for LLM Comparator (Shell)

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Provides the shell commands necessary to clone the LLM Comparator repository, install dependencies, build the project, and start the development server locally. This is for users who want to run the tool on their own machines.

```shell
git clone https://github.com/PAIR-code/llm-comparator.git
cd llm-comparator
npm install
npm run build
npm run serve
```

--------------------------------

### Serve LLM Comparator Web App Locally

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Instructions to install dependencies, build, and serve the LLM Comparator web application locally. Access the application at http://localhost:8000. This setup is useful for local development and testing.

```bash
# Serve the web application locally
npm install
npm run build
npm run serve

# Access at http://localhost:8000

# Load specific results file via URL parameter
# http://localhost:8000/?results_path=https://example.com/results.json
# http://localhost:8000/?results_path=/local/path/to/results.json

# Or use the hosted version
# https://pair-code.github.io/llm-comparator/?results_path=https://example.com/results.json
```
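A results path that contains characters such as `:`, `/`, or `&` may need percent-encoding before it is appended as the `results_path` query parameter. The following is a minimal sketch using only the Python standard library. `build_viewer_url` is a hypothetical helper, not part of LLM Comparator, and it assumes the web app decodes query parameters in the usual way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

LOCAL_BASE_URL = "http://localhost:8000/"


def build_viewer_url(results_path: str, base_url: str = LOCAL_BASE_URL) -> str:
    """Return a viewer URL with results_path percent-encoded as a query parameter."""
    return f"{base_url}?{urlencode({'results_path': results_path})}"


url = build_viewer_url("https://example.com/results.json")
print(url)

# Round-trip check: decoding the query string recovers the original path.
decoded = parse_qs(urlsplit(url).query)["results_path"][0]
print(decoded)
```

Passing `https://pair-code.github.io/llm-comparator/` as `base_url` builds the equivalent link for the hosted version.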

--------------------------------

### Install LLM Comparator from PyPI

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Installs the LLM Comparator library with pip, the recommended method for most users. If you use a Python virtual environment, activate it before installing.

```shell
pip install llm_comparator
```

--------------------------------

### Install and Run LLM Comparator Python Library

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Commands for installing the LLM Comparator Python library, either via pip or from source. It also includes a sample script to perform an end-to-end evaluation workflow using Vertex AI models and writing results to a JSON file.

```bash
# Install Python library
pip install llm_comparator

# Or install from source for development
git clone https://github.com/PAIR-code/llm-comparator.git
cd llm-comparator/python
pip install -e .

# Create evaluation script (save as evaluate.py)
cat > evaluate.py << 'EOF'
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

inputs = [
    {
        'prompt': 'Your prompt here',
        'response_a': 'Response from model A',
        'response_b': 'Response from model B'
    }
]

generator = model_helper.VertexGenerationModelHelper()
embedder = model_helper.VertexEmbeddingModelHelper()

judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

result = comparison.run(inputs, judge, bulletizer, clusterer)
comparison.write(result, './output.json')
EOF

# Run evaluation (requires Google Cloud authentication for Vertex AI)
python evaluate.py

# View results in web app
# Option 1: Use hosted version
# Open: https://pair-code.github.io/llm-comparator/?results_path=file:///path/to/output.json

# Option 2: Serve locally
cd ../  # Back to repo root
npm install
npm run build
npm run serve
# Open: http://localhost:8000/?results_path=/path/to/output.json
```

--------------------------------

### Install LLM Comparator from Source

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Installs the LLM Comparator library by cloning the GitHub repository and installing from the local source. This method is suitable for developers contributing to the library.

```shell
git clone https://github.com/PAIR-code/llm-comparator.git
cd llm-comparator/python
pip install -e .
```

--------------------------------

### Setup and Authenticate with Google Vertex AI

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Configures and authenticates the environment with Google Vertex AI. Users need to provide their project ID and select a region. This step ensures secure access to Vertex AI services.

```python
#@title Setup and authenticate with Google Vertex AI.
import vertexai
from google.colab import auth

PROJECT_ID = 'your_project_id'  #@param {type: "string"}
REGION = 'us-central1'  #@param {type: "string"}

auth.authenticate_user()
! gcloud config set project {PROJECT_ID}
vertexai.init(project=PROJECT_ID, location=REGION)
```

--------------------------------

### LLM Comparator Custom Fields Schema Example (JSON)

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Demonstrates the structure for defining custom fields and their types within the metadata of LLM Comparator. This allows for richer data analysis and visualization.

```json
{
	"metadata": {
		"source_path": "...",
		"custom_fields_schema": [
			{"name": "prompt_word_count", "type": "number"},
			{"name": "word_overlap_rate_between_a_b", "type": "number"},
			{"name": "data_source", "type": "category"},
			{"name": "unique_id", "type": "string"},
			{"name": "is_over_max_token", "type": "per_model_boolean"},
			{"name": "TF-IDF_between_prompt_and_response", "type": "per_model_number"},
			{"name": "writing_style", "type": "per_model_category"}
		]
	},
	"models": [{...}, {...}],
	"examples": [
		{
			"input_text": "Which city should I visit in South Korea?",
			"tags": ["Travel"],
			"output_text_a": "You can visit Seoul, the capital of South Korea.",
			"output_text_b": "You can visit Seoul, Busan, and Jeju.",
			"score": 0.5,
			"individual_rater_scores": [],
			"custom_fields": {
				"prompt_word_count": 8,
				"word_overlap_rate_between_a_b": 0.61,
				"data_source": "XYZ",
				"unique_id": "abc000",
				"is_over_max_token": [true, false],
				"TF-IDF_between_prompt_and_response": [0.31, 0.15],
				"writing_style": ["Verbose", "Neutral"]
			}
		},
		{
			"input_text": "How to draw bar charts using Python?",
			...
		}
	]
}
```
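Before loading a file, it can be useful to check that each example's `custom_fields` agree with `custom_fields_schema`. The sketch below is one way to do that in plain Python. The type names come from the schema above, but the checks are assumptions inferred from the sample values: `per_model_*` fields are taken to be two-element lists, and a missing field is treated as a problem, which may be stricter than the web app itself. `find_custom_field_errors` is a hypothetical helper.

```python
from numbers import Number


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


# Assumed value checks for each schema type name.
SCALAR_CHECKS = {
    "number": _is_number,
    "category": lambda v: isinstance(v, str),
    "string": lambda v: isinstance(v, str),
}
PER_MODEL_CHECKS = {
    "per_model_boolean": lambda v: isinstance(v, bool),
    "per_model_number": _is_number,
    "per_model_category": lambda v: isinstance(v, str),
}


def find_custom_field_errors(schema, custom_fields):
    """Return human-readable problems; an empty list means the example passes."""
    errors = []
    for field in schema:
        name, type_name = field["name"], field["type"]
        if name not in custom_fields:
            errors.append(f"{name}: missing")
            continue
        value = custom_fields[name]
        if type_name in SCALAR_CHECKS:
            ok = SCALAR_CHECKS[type_name](value)
        elif type_name in PER_MODEL_CHECKS:
            check = PER_MODEL_CHECKS[type_name]
            # Assumed: one entry per model, i.e. [value_for_a, value_for_b].
            ok = isinstance(value, list) and len(value) == 2 and all(map(check, value))
        else:
            ok = False  # Unknown type name.
        if not ok:
            errors.append(f"{name}: expected {type_name}, got {value!r}")
    return errors


schema = [
    {"name": "prompt_word_count", "type": "number"},
    {"name": "data_source", "type": "category"},
    {"name": "is_over_max_token", "type": "per_model_boolean"},
]
good = {"prompt_word_count": 8, "data_source": "XYZ", "is_over_max_token": [True, False]}
bad = {"prompt_word_count": "8", "data_source": "XYZ", "is_over_max_token": [True]}

print(find_custom_field_errors(schema, good))
print(find_custom_field_errors(schema, bad))
```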

--------------------------------

### Install LLM Comparator Package

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Installs the LLM Comparator Python package using pip. This package is essential for running comparative evaluations and generating the required JSON outputs for the web application.

```python
#@title Install the LLM Comparator package
! pip install llm_comparator
```

--------------------------------

### Initialize Rationale Cluster Generator and Embedder

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Initializes Vertex Generation Model Helper, Vertex Embedding Model Helper, and Rationale Cluster Generator. This setup is used to embed rationale bullets, cluster them by similarity, and generate descriptive labels for these clusters. The output includes cluster titles and bullet points augmented with similarity scores to each cluster.

```python
from llm_comparator import rationale_cluster_generator
from llm_comparator import model_helper

generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
embedder = model_helper.VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

rationale_bullets_for_examples = [
    ['More detailed information', 'Better structure with bullet points'],
    ['Includes specific examples', 'More comprehensive coverage'],
    ['Concise and clear', 'Direct answer to the question'],
    []  
]

clusters, bullets_with_similarities = clusterer.run(
    rationale_bullets_for_examples,
    num_clusters=8
)

print("Cluster titles:")
for i, cluster in enumerate(clusters):
    print(f"  {i}: {cluster['title']}")

for example_idx, example_bullets in enumerate(bullets_with_similarities):
    print(f"\nExample {example_idx}:")
    for bullet in example_bullets:
        print(f"  Bullet: {bullet['rationale']}")
        print(f"  Similarities to clusters: {bullet['similarities']}")
```

--------------------------------

### LLM Comparator Individual Rater Scores Example (JSON)

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Illustrates how to embed detailed individual rater scores, including whether responses were flipped, the score awarded, and the rationale provided by the LLM judge. This aids in analyzing evaluation consistency.

```json
{
    "metadata": {
        "source_path": "...",
        "custom_fields_schema": []
    },
    "models": [
        {"name": "Gemma 1.1"},
        {"name": "Gemma 1.0"}
    ],
    "examples": [
        {
            "input_text": "Which city should I visit in South Korea?",
            "tags": ["Travel"],
            "output_text_a": "You can visit Seoul, the capital of South Korea.",
            "output_text_b": "You can visit Seoul, Busan, and Jeju.",
            "score": 0.5,
            "individual_rater_scores": [
                {
                    "is_flipped": false,
                    "score": 1.5,
                    "rationale": "A describes more information about ..."
                },
                {
                    "is_flipped": false,
                    "score": -0.5,
                    "rationale": "While A provides one option, B gives ..."
                }
            ],
            "custom_fields": {}
        },
        {
            "input_text": "How to draw bar charts using Python?",
            ...
        }
    ]
}
```
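In the example above, the top-level `score` of 0.5 equals the mean of the two rater scores (1.5 and -0.5). The sketch below recomputes that mean as a consistency check. Whether the tool always derives `score` this way is an assumption based on this one example, and `mean_rater_score` is a hypothetical helper.

```python
from statistics import fmean

example = {
    "score": 0.5,
    "individual_rater_scores": [
        {"is_flipped": False, "score": 1.5, "rationale": "A describes more information about ..."},
        {"is_flipped": False, "score": -0.5, "rationale": "While A provides one option, B gives ..."},
    ],
}


def mean_rater_score(example: dict) -> float:
    """Average the per-rater scores; returns 0.0 when there are no raters."""
    scores = [rater["score"] for rater in example["individual_rater_scores"]]
    return fmean(scores) if scores else 0.0


print(mean_rater_score(example))
print(mean_rater_score(example) == example["score"])
```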

--------------------------------

### LLM Comparator JSON Data Format

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

Defines the structure for JSON files used by the LLM Comparator tool. It requires metadata, model names, and a list of examples, where each example includes prompt text, model outputs (A and B), a score, and optional tags and custom fields. This format is essential for visualizing comparative LLM evaluation results.

```json
{
    "metadata": {
        "source_path": "Any string for your records (e.g., run id)",
        "custom_fields_schema": []
    },
    "models": [
        {"name": "Short name of your first model"},
        {"name": "Short name of your second model"}
    ],
    "examples": [
        {
            "input_text": "This is a prompt.",
            "tags": ["Math"],  # A list of keywords for categorizing prompts
            "output_text_a": "Response to the prompt from the first model (A)",
            "output_text_b": "Response to the prompt from the other model (B)",
            "score": -1.25,  # Score from the judge LLM
            "individual_rater_scores": [],
            "custom_fields": {}
        },
        {
            "input_text": "This is a next prompt.",
            "...": "..."
        }
    ]
}
```
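When scores come from somewhere other than the Python library (for example, an existing evaluation run), a file in this format can be assembled by hand. The sketch below builds the dictionary and writes it with the standard `json` module. `build_results` is a hypothetical helper. Note that JSON itself has no comment syntax, so the `#` annotations shown in the example above must be left out of real files.

```python
import json
import os
import tempfile


def build_results(source_path, model_names, rows):
    """Assemble a results dictionary from (prompt, output_a, output_b, score) rows."""
    name_a, name_b = model_names
    return {
        "metadata": {"source_path": source_path, "custom_fields_schema": []},
        "models": [{"name": name_a}, {"name": name_b}],
        "examples": [
            {
                "input_text": prompt,
                "tags": [],
                "output_text_a": output_a,
                "output_text_b": output_b,
                "score": score,
                "individual_rater_scores": [],
                "custom_fields": {},
            }
            for prompt, output_a, output_b, score in rows
        ],
    }


results = build_results(
    "demo-run",
    ("Model A", "Model B"),
    [("This is a prompt.", "Response from A", "Response from B", -1.25)],
)

# Write to a temporary directory and read back to confirm it is valid JSON.
out_path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(sorted(reloaded))
```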

--------------------------------

### JSON Data Format for LLM Comparator Input

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Defines the structure for JSON files used as input by the LLM Comparator web application. It includes metadata, model information, example data with outputs and scores, and rationale clusters.

```json
{
  "metadata": {
    "source_path": "evaluation_run_2024_01_04",
    "custom_fields_schema": [
      {"name": "prompt_word_count", "type": "number"},
      {"name": "data_source", "type": "category"},
      {"name": "is_structured", "type": "per_model_boolean"},
      {"name": "response_length", "type": "per_model_number"}
    ]
  },
  "models": [
    {"name": "Gemini 1.1"},
    {"name": "Gemini 1.0"}
  ],
  "examples": [
    {
      "input_text": "What is LLM Comparator?",
      "tags": ["Technology", "AI"],
      "output_text_a": "LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation.",
      "output_text_b": "LLM Comparator is a tool for comparing LLM responses.",
      "score": 0.75,
      "individual_rater_scores": [
        {
          "score": 1.0,
          "rating_label": "A is better",
          "is_flipped": false,
          "rationale": "Response A is more detailed and informative."
        },
        {
          "score": 0.5,
          "rating_label": "A is slightly better",
          "is_flipped": false,
          "rationale": "Response A provides better context."
        }
      ],
      "rationale_list": [
        {
          "rationale": "More detailed information",
          "similarities": [0.85, 0.42, 0.31, 0.28, 0.19, 0.12, 0.08, 0.05]
        },
        {
          "rationale": "Better technical accuracy",
          "similarities": [0.51, 0.89, 0.23, 0.18, 0.15, 0.09, 0.06, 0.03]
        }
      ],
      "custom_fields": {
        "prompt_word_count": 4,
        "data_source": "Internal Dataset",
        "is_structured": [false, false],
        "response_length": [89, 54]
      }
    }
  ],
  "rationale_clusters": [
    {"title": "Detail and Comprehensiveness"},
    {"title": "Technical Accuracy"},
    {"title": "Clarity and Structure"},
    {"title": "Conciseness"},
    {"title": "Engagement"},
    {"title": "Relevance"},
    {"title": "Examples and Specificity"},
    {"title": "Tone and Style"}
  ]
}
```
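Each entry in `rationale_list` carries one similarity value per cluster. Assuming those values are ordered the same way as `rationale_clusters` (the lengths match in the example above, but the ordering is not stated), a rationale can be assigned to its closest cluster by taking the index of the largest similarity. The sketch below counts rationales per cluster this way; `count_rationales_per_cluster` is a hypothetical helper, and the sample data is trimmed to three clusters for brevity.

```python
from collections import Counter


def count_rationales_per_cluster(results: dict) -> Counter:
    """Count rationales by the cluster title with the highest similarity."""
    titles = [cluster["title"] for cluster in results["rationale_clusters"]]
    counts = Counter()
    for example in results["examples"]:
        for item in example.get("rationale_list", []):
            sims = item["similarities"]
            if not sims:
                continue  # Nothing to assign for an empty similarity list.
            best = max(range(len(sims)), key=sims.__getitem__)
            counts[titles[best]] += 1
    return counts


results = {
    "examples": [
        {
            "rationale_list": [
                {"rationale": "More detailed information",
                 "similarities": [0.85, 0.42, 0.31]},
                {"rationale": "Better technical accuracy",
                 "similarities": [0.51, 0.89, 0.23]},
            ]
        }
    ],
    "rationale_clusters": [
        {"title": "Detail and Comprehensiveness"},
        {"title": "Technical Accuracy"},
        {"title": "Clarity and Structure"},
    ],
}
counts = count_rationales_per_cluster(results)
print(counts)
```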

--------------------------------

### Initialize and Run Rationale Clusterer in Python

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Demonstrates the minimal Python script for initializing the necessary components, including the RationaleClusterGenerator, and running the comparative evaluation. It requires the instantiation of `GenerationModelHelper` and `EmbeddingModelHelper` subclasses and defines the input format and output file path.

```python
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

inputs = [
  # Provide your inputs here.
  # They must conform to llm_comparator.types.LLMJudgeInput
]

# Initialize the model-calling classes.
generator = ...  # Replace with an instance of a model_helper.GenerationModelHelper subclass
embedder = ...  # Replace with an instance of a model_helper.EmbeddingModelHelper subclass

# Initialize the instances that run work on the models.
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(
    generator, embedder
)

# Configure and run the comparative evaluation.
comparison_result = comparison.run(inputs, judge, bulletizer, clusterer)

# Write the results to a JSON file that can be loaded in
# https://pair-code.github.io/llm-comparator
file_path = "path/to/file.json"
comparison.write(comparison_result, file_path)

```

--------------------------------

### Initialize Vertex AI Generation Model Helper

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes a helper to interface with Google Vertex AI generative language models. Defaults to 'gemini-pro', but can be configured with a different 'model_name'.

```python
from llm_comparator.model_helper import VertexGenerationModelHelper

# Using default model 'gemini-pro'
vertex_gen_helper = VertexGenerationModelHelper()

# Configuring with a different model name
# vertex_gen_helper = VertexGenerationModelHelper(model_name='your-model-name')
```

--------------------------------

### Initialize LLM Comparator Components (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Initializes the core components for the LLM Comparator: LLMJudgeRunner, RationaleBulletGenerator, and RationaleClusterGenerator. These components are responsible for judging, bulletizing, and clustering the results of LLM comparisons. Dependencies include 'generator' and 'embedder' objects, which are assumed to be pre-configured.

```python
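from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

# `generator` and `embedder` are the model helpers initialized beforehand.
judge = llm_judge_runner.LLMJudgeRunner(generator)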
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(
    generator,
    embedder
)
```

--------------------------------

### View Comparison Results in Colab (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Displays the comparative evaluation results directly within a Google Colab environment. The `comparison.show_in_colab()` function takes the `file_path` of the saved results and renders them in an interactive format, facilitating immediate analysis and review.

```python
comparison.show_in_colab(file_path)
```

--------------------------------

### Initialize Vertex AI Embedding Model Helper

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes a helper to interface with Google Vertex AI text embedding models. Defaults to 'textembedding-gecko@003', but can be configured with a different 'model_name'.

```python
from llm_comparator.model_helper import VertexEmbeddingModelHelper

# Using default model 'textembedding-gecko@003'
vertex_emb_helper = VertexEmbeddingModelHelper()

# Configuring with a different model name
# vertex_emb_helper = VertexEmbeddingModelHelper(model_name='your-embedding-model-name')
```

--------------------------------

### Initialize Vertex AI Models for LLM Comparator

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Initializes the generative and embedding models from Google Vertex AI to be used within the LLM Comparator evaluation. It defaults to 'gemini-pro' for generation and 'textembedding-gecko@003' for embedding, but these can be customized. These models are crucial for the judge, bulletizer, and clusterer functionalities.

```python
#@title Initialize models used in the LLM Comparator evaluation.

# The generator model can be any Text-to-Text LLM provided by Vertex AI. This
# model will be asked to do a series of tasks---judge, bulletize, and cluster---
# and it is often beneficial to use a larger model for this reason.
#
# We default to 'gemini-pro' but you can change this with the `model_name=`
# param. For a full list of models available via the Model Garden, check out
# https://console.cloud.google.com/vertex-ai/model-garden?pageState=(%22galleryStateKey%22:(%22f%22:(%22g%22:%5B%22supportedTasks%22,%22inputTypes%22%5D,%22o%22:%5B%22GENERATION%22,%22LANGUAGE%22%5D),%22s%22:%22%22)).
#
# Since we're using Gemini Pro, a very competent and flexible foundation model,
# we are sharing the same generator across all downstream tasks. However, you
# could use different models for each task if desired.
generator = model_helper.VertexGenerationModelHelper()

# The embedding model can be any text embedder provided by Vertex AI. We default
# to 'textembedding-gecko@003' but you can change this with the `model_name=`
# param. For a full list of models available via the Model Garden, check out
# https://console.cloud.google.com/vertex-ai/model-garden?pageState=(%22galleryStateKey%22:(%22f%22:(%22g%22:%5B%22supportedTasks%22,%22inputTypes%22%5D,%22o%22:%5B%22EMBEDDING%22,%22LANGUAGE%22%5D),%22s%22:%22%22))
embedder = model_helper.VertexEmbeddingModelHelper()

# The following models do the core work of a Comparative Evaluation: judge,
# bulletize, and cluster. Each class provides a `.run()` function, and the
# `llm_comparator.comparison.run()` API orchestrates configuring and calling
# these APIs on the instances you pass in. More on how to configure these below.

# The `judge` is the model responsible for actually doing the comparison between
# the two models. The same judge is run multiple times to get a diversity of
# perspectives, more on how to configure this below.
#
# A judge must phrase its responses in a simple XML format that includes the
# verdict and an explanation of the results, to enable downstream processing by
# the bulletizer and clusterer.
#
#     <result>
#       <explanation>YOUR EXPLANATION GOES HERE.</explanation>
#       <verdict>A is slightly better</verdict>
#     </result>
#
# We provide a default "judge" prompt in
# llm_comparator.llm_judge_runner.DEFAULT_LLM_JUDGE_PROMPT_TEMPLATE, and you can
# use the `llm_judge_prompt_template=` parameter to provide a custom prompt that

```
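The `<result>` format described in the comments above can be read back with a few lines of standard-library Python. The sketch below uses a regular expression rather than an XML parser, on the assumption that model output may include extra text around the block. It is an illustration only: the library's own parsing may differ, and `parse_judge_output` is a hypothetical helper that expects the explanation to precede the verdict, as in the sample.

```python
import re

RESULT_PATTERN = re.compile(
    r"<result>.*?<explanation>(?P<explanation>.*?)</explanation>"
    r".*?<verdict>(?P<verdict>.*?)</verdict>.*?</result>",
    re.DOTALL,
)


def parse_judge_output(text: str):
    """Return (verdict, explanation), or None if no <result> block is found."""
    match = RESULT_PATTERN.search(text)
    if match is None:
        return None
    return match["verdict"].strip(), match["explanation"].strip()


raw = """
<result>
  <explanation>Response A names more cities.</explanation>
  <verdict>A is slightly better</verdict>
</result>
"""
print(parse_judge_output(raw))
print(parse_judge_output("no structured output here"))
```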

--------------------------------

### Import LLM Comparator and Vertex AI Packages

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Imports necessary libraries for the LLM Comparator, including Vertex AI for model interaction and Google Colab authentication. It also imports modules for comparison, model helpers, and specific runner functionalities.

```python
#@title Import relevant packages
import vertexai
from google.colab import auth

# The comparison library provides the primary API for running Comparative
# Evaluations and generating the JSON files required by the LLM Comparator web
# app.
from llm_comparator import comparison

# The model_helper library is used to initialize API wrapper to interface with
# models. For this demo we focus on models served by Google Vertex AI, but you
# can extend the llm_comparator.model_helper.GenerationModelHelper and
# llm_comparator.model_helper.EmbeddingModelHelper classes to work with other
# providers or models you host yourself.
from llm_comparator import model_helper

# The following libraries contain wrappers that implement the core functionality
# of the Comparative Evaluation workflow. More on these below.
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator
```

--------------------------------

### Run Comparative Evaluation

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Coordinates the comparative evaluation process, including judging, bulletizing, and clustering model responses. Requires initialized judge, bulletizer, and clusterer components, along with the model helpers and the inputs to compare.

```python
from llm_comparator import comparison
from llm_comparator.model_helper import VertexGenerationModelHelper, VertexEmbeddingModelHelper
from llm_comparator.llm_judge_runner import LLMJudgeRunner
from llm_comparator.rationale_bullet_generator import RationaleBulletGenerator
from llm_comparator.rationale_cluster_generator import RationaleClusterGenerator

# Initialize model helpers
vertex_gen_helper = VertexGenerationModelHelper(model_name='gemini-pro')
vertex_emb_helper = VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')

# Initialize judge, bulletizer, and clusterer
judge_runner = LLMJudgeRunner(vertex_gen_helper)
bullet_generator = RationaleBulletGenerator(vertex_gen_helper)
cluster_generator = RationaleClusterGenerator(vertex_gen_helper, vertex_emb_helper)

# Each input pairs a prompt with the two responses to compare
# (the responses here are illustrative placeholders)
inputs = [
    {
        'prompt': 'What is the capital of France?',
        'response_a': 'Paris',
        'response_b': 'The capital of France is Paris.'
    },
    {
        'prompt': "Translate 'hello' to Spanish.",
        'response_a': 'Hola',
        'response_b': 'hola'
    }
]

# Run the comparative evaluation
results = comparison.run(inputs, judge_runner, bullet_generator, cluster_generator)

# Write the results to a JSON file for the web app
comparison.write(results, './results.json')
```

--------------------------------

### Implement Custom Generation Model Helper in Python

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Extends the abstract GenerationModelHelper class to integrate with a custom LLM provider. This implementation includes `predict` and `predict_batch` methods to handle single and batch text generation requests, respectively. It allows for customization of parameters like temperature and max output tokens.

```python
from llm_comparator import model_helper
from typing import Sequence, Optional

class CustomGenerationModelHelper(model_helper.GenerationModelHelper):
    def __init__(self, api_key: str, model_name: str):
        self.api_key = api_key
        self.model_name = model_name
        # Initialize your model client here

    def predict(self, prompt: str, **kwargs) -> str:
        temperature = kwargs.get('temperature', 0.7)
        max_tokens = kwargs.get('max_output_tokens', 256)

        # Call your LLM API
        # response = your_api.generate(
        #     prompt=prompt,
        #     temperature=temperature,
        #     max_tokens=max_tokens
        # )
        # return response.text

        return "Generated response"

    def predict_batch(self, prompts: Sequence[str], **kwargs) -> Sequence[str]:
        # Implement batch prediction
        return [self.predict(prompt, **kwargs) for prompt in prompts]
```

--------------------------------

### Run Comparative Evaluation (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Executes the comparative evaluation using the `comparison.run()` function. This function takes prepared inputs, a judge, a bulletizer, and a clusterer to produce a structured dictionary output suitable for the LLM Comparator web app. Optional parameters like `judge_opts`, `bulletizer_opts`, and `clusterer_opts` allow for customization of component behaviors.

```python
# The comparison.run() function is the primary interface for running a
# Comparative Evaluation. It takes your prepared inputs, a judge, a bulletizer,
# and a clusterer and returns a Python dictionary in the required format for use
# in the LLM Comparator web app. You can inspect this dictionary in Python if
# you like, but it's more useful once written to a file.
#
# The example below is basic, but you can use the judge_opts=, bulletizer_opts=,
# and/or clusterer_opts= parameters (all of which are optional dictionaries that
# are converted to keyword options) to further customize the behaviors. See the
# docstrings for more.
comparison_result = comparison.run(
    llm_judge_inputs,
    judge,
    bulletizer,
    clusterer,
)
```

--------------------------------

### Python: Run Complete LLM Comparative Evaluation

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Orchestrates the complete comparative evaluation pipeline including judging, bulletizing, and clustering phases. It takes model responses, initializes evaluation components like judges and clusterers, and runs the comparison, saving results to a JSON file. Dependencies include the llm_comparator library and model_helper for LLM integrations.

```python
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

# Define inputs with prompts and responses from two models
inputs = [
    {
        'prompt': 'What is LLM Comparator?',
        'response_a': 'LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation.',
        'response_b': 'LLM Comparator is a tool for comparing LLM responses.'
    },
    {
        'prompt': 'How to draw bar charts using Python?',
        'response_a': 'Bar charts can be created using data visualization libraries.',
        'response_b': 'You can use Matplotlib, Plotly, or Altair to draw bar charts.'
    }
]

# Initialize model helpers (example using Vertex AI)
generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
embedder = model_helper.VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')

# Initialize evaluation components
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

# Run comparative evaluation with custom options
comparison_result = comparison.run(
    inputs,
    judge,
    bulletizer,
    clusterer,
    model_names=('Model A', 'Model B'),
    judge_opts={'num_repeats': 6},
    bulletizer_opts={'win_rate_threshold': 0.25},
    clusterer_opts={'num_clusters': 8}
)

# Write results to JSON file
output_path = comparison.write(comparison_result, './results.json')
print(f"Results written to {output_path}")
```
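Once the results are written, a quick tally of the per-example scores gives a first impression before opening the web app. The sketch below assumes that positive scores favor model A and negative scores favor model B, which matches the sample labels elsewhere in these docs ("A is better" paired with a positive score). The web app may apply its own thresholds, and `summarize_scores` is a hypothetical helper.

```python
def summarize_scores(results: dict) -> dict:
    """Count examples where A wins (score > 0), B wins (score < 0), or they tie."""
    summary = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for example in results["examples"]:
        score = example["score"]
        if score > 0:
            summary["a_wins"] += 1
        elif score < 0:
            summary["b_wins"] += 1
        else:
            summary["ties"] += 1
    return summary


# In practice, load this dictionary with json.load() from the written file.
sample = {"examples": [{"score": 0.75}, {"score": -1.25}, {"score": 0.0}, {"score": 1.5}]}
print(summarize_scores(sample))
```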

--------------------------------

### Initialize Rationale Bullet Generator

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes the Rationale Bullet Generator, which uses a language model to summarize judge results into bullet points for easier UI consumption. It can be configured with a win rate threshold.

```python
from llm_comparator.rationale_bullet_generator import RationaleBulletGenerator
from llm_comparator.model_helper import VertexGenerationModelHelper

# Initialize a generation model helper
model_helper = VertexGenerationModelHelper()

# Initialize the bullet generator
bullet_generator = RationaleBulletGenerator(model_helper)

# Optional: set options such as the win rate threshold through the
# bulletizer_opts parameter of comparison.run()
# bulletizer_opts = {
#     "win_rate_threshold": 0.3
# }
# comparison.run(inputs, judge, bullet_generator, clusterer, bulletizer_opts=bulletizer_opts)
```

--------------------------------

### Initialize LLM Judge Runner with Custom Prompt and Scoring

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes the LLM Judge Runner with a custom prompt template and a mapping for verdict scores. This allows for tailored comparison logic and ensures judge verdicts can be numerically evaluated.

```python
from llm_comparator.llm_judge_runner import LLMJudgeRunner
from llm_comparator.model_helper import VertexGenerationModelHelper

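# Define a custom prompt template. The judge's reply must use the XML format
# below so that the verdict and explanation can be processed downstream.
custom_prompt = """Compare these two responses and tell me which one is better:
Response A: {response_a}
Response B: {response_b}

Reply in this XML format, with a verdict of 'A is better', 'B is better', or 'tie':
<result>
  <explanation>YOUR EXPLANATION GOES HERE.</explanation>
  <verdict>A is better</verdict>
</result>"""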

# Define a mapping from verdicts to scores
rating_map = {
    "A is better": 1.0,
    "B is better": -1.0,
    "tie": 0.0
}

# Initialize a generation model helper
model_helper = VertexGenerationModelHelper()

# Initialize the judge runner with the custom prompt and rating map
judge_runner = LLMJudgeRunner(
    model_helper,
    llm_judge_prompt_template=custom_prompt,
    rating_to_score_map=rating_map
)
```
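To make the role of the rating map concrete, the sketch below converts a verdict label into a numeric score. It rests on an assumption that the source does not confirm: that a rater entry with `is_flipped` set to true was judged with the two responses in swapped order, so the sign is inverted to express the score from A's point of view. `verdict_to_score` is a hypothetical helper, not a library function.

```python
rating_map = {"A is better": 1.0, "B is better": -1.0, "tie": 0.0}


def verdict_to_score(verdict: str, rating_map: dict, is_flipped: bool = False):
    """Map a verdict label to a score; return None for unrecognized labels."""
    score = rating_map.get(verdict.strip())
    if score is None:
        return None
    # Assumption: a flipped run showed the responses in swapped order, so the
    # sign is inverted to express the score from A's point of view again.
    return -score if is_flipped else score


print(verdict_to_score("A is better", rating_map))
print(verdict_to_score("A is better", rating_map, is_flipped=True))
print(verdict_to_score("unclear", rating_map))
```

Returning `None` for an unknown label keeps malformed judge output from being silently counted as a tie.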

--------------------------------

### Initialize Generation Model and Rationale Bullet Generator

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Initializes the Vertex Generation Model Helper and the Rationale Bullet Generator. The generator is used by the bulletizer to process judgements and generate bullet points based on a win rate threshold. Judgements are expected to be a list of dictionaries, each containing scores and individual rater scores with rationales.

```python
from llm_comparator import model_helper
from llm_comparator import rationale_bullet_generator

generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)

judgements = [
    {
        'score': 0.75,
        'individual_rater_scores': [
            {
                'score': 1.0,
                'is_flipped': False,
                'rationale': 'Response A provides more detailed information about multiple cities.'
            },
            {
                'score': 1.5,
                'is_flipped': False,
                'rationale': 'Response A includes specific details about each location.'
            },
            {
                'score': 0.0,
                'is_flipped': False,
                'rationale': 'Both responses are helpful.'
            }
        ]
    }
]

bullets = bulletizer.run(
    judgements,
    win_rate_threshold=0.25
)

for example_bullets in bullets:
    print(f"Generated {len(example_bullets)} bullets:")
    for bullet in example_bullets:
        print(f"  - {bullet}")
```

--------------------------------

### Initialize LLM Judge Runner with Default Prompt

Source: https://github.com/pair-code/llm-comparator/blob/main/python/README.md

Initializes the LLM Judge Runner, which uses a language model to compare responses. It utilizes a default prompt template for judging and can be configured to run multiple times per comparison.

```python
from llm_comparator.llm_judge_runner import LLMJudgeRunner
from llm_comparator.model_helper import VertexGenerationModelHelper

# Initialize a generation model helper (e.g., Vertex AI)
model_helper = VertexGenerationModelHelper()

# Initialize the judge runner with the model helper
judge_runner = LLMJudgeRunner(model_helper)

# Optional: set the number of repeats when running the judge, e.g.
# judgements = judge_runner.run(llm_judge_inputs, num_repeats=10)
```

--------------------------------

### Prepare Input Data for LLM Judge

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Defines the input data for the LLM judge as a list of dictionaries. Each dictionary holds a `prompt` and the two responses to compare, `response_a` and `response_b`. This is the format described by `llm_comparator.llm_judge_runner.LLMJudgeInput`.

```python
#@title Prepare Your Inputs

# See llm_comparator.llm_judge_runner.LLMJudgeInput for the required input type.
llm_judge_inputs = [
    {'prompt': 'how are you?', 'response_a': 'good', 'response_b': 'bad'},
    {'prompt': 'hello?', 'response_a': 'hello', 'response_b': 'hi'},
    {'prompt': 'what is the capital of korea?', 'response_a': 'Seoul', 'response_b': 'Vancouver'}
]
```
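
When prompts and responses live in separate lists, the list-of-dicts format can be assembled and sanity-checked before judging. The sketch below is one possible way to do that; `build_judge_inputs` and `check_judge_inputs` are hypothetical helpers, not functions provided by the library.

```python
from typing import Sequence

REQUIRED_KEYS = ('prompt', 'response_a', 'response_b')


def build_judge_inputs(
    prompts: Sequence[str],
    responses_a: Sequence[str],
    responses_b: Sequence[str],
) -> list[dict[str, str]]:
    """Zip three parallel sequences into the list-of-dicts input format."""
    if not (len(prompts) == len(responses_a) == len(responses_b)):
        raise ValueError('prompts and responses must have the same length')
    return [
        {'prompt': p, 'response_a': a, 'response_b': b}
        for p, a, b in zip(prompts, responses_a, responses_b)
    ]


def check_judge_inputs(inputs: Sequence[dict]) -> None:
    """Raise ValueError if any entry lacks one of the required keys."""
    for i, entry in enumerate(inputs):
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f'input {i} is missing keys: {missing}')


llm_judge_inputs = build_judge_inputs(
    ['how are you?', 'hello?'],
    ['good', 'hello'],
    ['bad', 'hi'],
)
check_judge_inputs(llm_judge_inputs)  # passes silently when all keys exist
```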

--------------------------------

### Python: Rationale Bullet Generator for Condensing Rationales

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Condenses judge rationales into concise bullet points, which are easier to read and to visualize. It requires an initialized generation model helper. The input is the list of judgements produced by the judging phase, and the output is a list of bullets for each example.

```python
from llm_comparator import rationale_bullet_generator
from llm_comparator import model_helper
from llm_comparator import types

# Assuming generator and judgements are already initialized from previous steps
# generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
# judgements = [...] # Output from judge.run()

# Initialize the bullet generator
# bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)

# Example usage (requires actual judgements data):
# bullet_summary = bulletizer.run(judgements)
# print(bullet_summary)
```

--------------------------------

### Save Comparison Results to File (Python)

Source: https://github.com/pair-code/llm-comparator/blob/main/python/notebooks/basic_demo.ipynb

Saves the results of a comparative evaluation to a JSON file. `comparison.write()` takes the `comparison_result` dictionary and a `file_path`, so the evaluation data can be stored and loaded again later.

```python
from llm_comparator import comparison

# comparison_result is the output of comparison.run()
file_path = 'json_for_llm_comparator.json' # @param {type: "string"}
comparison.write(comparison_result, file_path)
```
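
Because the output is plain JSON, the saved file can be read back with the standard library. The sketch below uses a small stand-in dictionary and `json.dump` in place of `comparison.write()` so that it runs without the library installed. The dictionary's contents are placeholders and do not reflect the real result schema.

```python
import json
import os
import tempfile

# Placeholder standing in for comparison_result; the real structure differs.
stand_in_result = {'note': 'placeholder for comparison_result'}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'json_for_llm_comparator.json')

    # Stand-in for comparison.write(comparison_result, file_path).
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(stand_in_result, f)

    # Reload the saved file for later inspection.
    with open(file_path, encoding='utf-8') as f:
        reloaded = json.load(f)

print(reloaded == stand_in_result)
```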

--------------------------------

### Python Library - Running Comparative Evaluation

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Orchestrates the complete comparative evaluation pipeline including judging, bulletizing, and clustering phases. This function takes model inputs, evaluation components, and configuration options to produce a comparison result that can be written to a JSON file.

**Method:** `comparison.run()`

```python
from llm_comparator import comparison
from llm_comparator import model_helper
from llm_comparator import llm_judge_runner
from llm_comparator import rationale_bullet_generator
from llm_comparator import rationale_cluster_generator

# Define inputs with prompts and responses from two models
inputs = [
    {
        'prompt': 'What is LLM Comparator?',
        'response_a': 'LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation.',
        'response_b': 'LLM Comparator is a tool for comparing LLM responses.'
    },
    {
        'prompt': 'How to draw bar charts using Python?',
        'response_a': 'Bar charts can be created using data visualization libraries.',
        'response_b': 'You can use Matplotlib, Plotly, or Altair to draw bar charts.'
    }
]

# Initialize model helpers (example using Vertex AI)
generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
embedder = model_helper.VertexEmbeddingModelHelper(model_name='textembedding-gecko@003')

# Initialize evaluation components
judge = llm_judge_runner.LLMJudgeRunner(generator)
bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)
clusterer = rationale_cluster_generator.RationaleClusterGenerator(generator, embedder)

# Run comparative evaluation with custom options
comparison_result = comparison.run(
    inputs,
    judge,
    bulletizer,
    clusterer,
    model_names=('Model A', 'Model B'),
    judge_opts={'num_repeats': 6},
    bulletizer_opts={'win_rate_threshold': 0.25},
    clusterer_opts={'num_clusters': 8}
)

# Write results to JSON file
output_path = comparison.write(comparison_result, './results.json')
print(f"Results written to {output_path}")
```

**Returns:**
- **comparison_result**: object - The result of the comparative evaluation.
- **output_path**: string - The path to the saved JSON file, as returned by `comparison.write()`.

**Result example:**
```json
{
  "results": [...],
  "summary": {...}
}
```

--------------------------------

### Implement Custom Embedding Model Helper in Python

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Extends the abstract EmbeddingModelHelper class to integrate with a custom embedding model provider. This implementation includes `embed` and `embed_batch` methods for generating embeddings for single texts and batches of texts, respectively. It demonstrates basic batching logic for API calls.

```python
from llm_comparator import model_helper
from typing import Sequence

# Implement custom embedding model helper
class CustomEmbeddingModelHelper(model_helper.EmbeddingModelHelper):
    def __init__(self, api_key: str, model_name: str):
        self.api_key = api_key
        self.model_name = model_name

    def embed(self, text: str) -> Sequence[float]:
        # Call your embedding API
        # embedding = your_api.embed(text)
        # return embedding.values

        return [0.1, 0.2, 0.3]  # Example 3D embedding

    def embed_batch(self, texts: Sequence[str]) -> Sequence[Sequence[float]]:
        # Implement batch embedding with batching logic
        results = []
        batch_size = 100
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            # batch_embeddings = your_api.embed_batch(batch)
            results.extend([self.embed(text) for text in batch])
        return results

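# Use the custom helper in the comparison pipeline.
# CustomGenerationModelHelper is not defined in this snippet; it is assumed
# to be a separately implemented generation model helper subclass.
# generator = CustomGenerationModelHelper(api_key='your-key', model_name='custom-llm')
embedder = CustomEmbeddingModelHelper(api_key='your-key', model_name='custom-embedding')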
```
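
The batching loop above falls back to per-text `embed` calls. A sketch of how a real batch endpoint could be slotted in is shown below, using only the standard library. `chunked`, `embed_in_batches`, and `fake_batch_api` are hypothetical names introduced for illustration, and `fake_batch_api` merely stands in for a provider's batch call.

```python
from typing import Callable, Iterator, Sequence


def chunked(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    batch_fn: Callable[[Sequence[str]], Sequence[Sequence[float]]],
    batch_size: int = 100,
) -> list[Sequence[float]]:
    """Call batch_fn once per chunk and concatenate the results in order."""
    results: list[Sequence[float]] = []
    for batch in chunked(texts, batch_size):
        results.extend(batch_fn(batch))
    return results


# Hypothetical stand-in for a provider's batch endpoint; records each call size.
call_sizes: list[int] = []


def fake_batch_api(batch: Sequence[str]) -> list[list[float]]:
    call_sizes.append(len(batch))
    # One-dimensional dummy embedding: the length of each text.
    return [[float(len(text))] for text in batch]


embeddings = embed_in_batches(['a'] * 250, fake_batch_api, batch_size=100)
print(len(embeddings), call_sizes)
```

Inside `embed_batch`, the same pattern would replace the per-text list comprehension with a single API call per chunk, which usually reduces the number of network round trips.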

--------------------------------

### Python: LLM Judge Runner for Evaluating Model Response Pairs

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Evaluates pairs of model responses with an LLM judge. Runs are repeated a configurable number of times, with the response order flipped in half of them, to mitigate position bias. The judge is built from a generation model helper and can take a custom rating scale. The output includes an aggregated score per example and the individual ratings with their rationales.

```python
from llm_comparator import llm_judge_runner
from llm_comparator import model_helper

# Initialize generation model
generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')

# Create judge with custom rating scale
custom_rating_map = {
    'A is much better': 1.5,
    'A is better': 1.0,
    'A is slightly better': 0.5,
    'same': 0.0,
    'B is slightly better': -0.5,
    'B is better': -1.0,
    'B is much better': -1.5
}

judge = llm_judge_runner.LLMJudgeRunner(
    generation_model_helper=generator,
    rating_to_score_map=custom_rating_map
)

# Prepare inputs
inputs = [
    {
        'prompt': 'Which city should I visit in South Korea?',
        'response_a': 'You can visit Seoul, the capital of South Korea.',
        'response_b': 'You can visit Seoul, Busan, and Jeju with their unique attractions.'
    }
]

# Run judge with 6 repeats (3 flipped, 3 non-flipped to handle position bias)
judgements = judge.run(inputs, num_repeats=6)

# Output contains aggregated scores and individual ratings
for judgement in judgements:
    print(f"Average score: {judgement['score']}")
    print(f"Individual ratings: {len(judgement['individual_rater_scores'])}")
    for rating in judgement['individual_rater_scores']:
        print(f"  Score: {rating['score']}, Flipped: {rating['is_flipped']}")
        print(f"  Rationale: {rating['rationale']}")
```
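
The flipping idea can be sketched without calling a model. The snippet below assumes that in a flipped run the judge sees the two responses in swapped order, so its verdict is negated before averaging. The function name `to_score`, the sample verdicts, and that sign convention are assumptions for illustration, not the library's confirmed internals.

```python
from statistics import mean

rating_map = {
    'A is much better': 1.5,
    'A is better': 1.0,
    'A is slightly better': 0.5,
    'same': 0.0,
    'B is slightly better': -0.5,
    'B is better': -1.0,
    'B is much better': -1.5,
}


def to_score(label: str, is_flipped: bool) -> float:
    """Map a verdict label to a score, undoing the swap for flipped runs."""
    score = rating_map[label]
    return -score if is_flipped else score


# Hypothetical raw verdicts as (label, is_flipped) pairs for one example.
raw_verdicts = [
    ('B is better', False),
    ('B is slightly better', False),
    ('same', False),
    ('A is better', True),           # slot "A" held response_b in this run
    ('A is slightly better', True),
    ('same', True),
]

# Normalize every verdict to the original A/B orientation, then average.
scores = [to_score(label, flipped) for label, flipped in raw_verdicts]
average = mean(scores)
print(scores, average)
```

With three flipped and three non-flipped runs, a judge that always prefers whichever response is shown first would average out to roughly zero, which is the motivation for balancing the two orders.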

--------------------------------

### Rationale Bullet Generator - Condensing Judgement Rationales

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Condenses judge rationales into concise bulleted summaries. This helps in quickly understanding the key points of the judge's reasoning and facilitates easier analysis and visualization of differences between model responses.

**Class:** `rationale_bullet_generator.RationaleBulletGenerator`

```python
from llm_comparator import rationale_bullet_generator
from llm_comparator import model_helper

# Assume 'generator' and 'judgements' are initialized as in the previous example
# generator = model_helper.VertexGenerationModelHelper(model_name='gemini-pro')
# judgements = [...] # Output from LLM Judge Runner

bulletizer = rationale_bullet_generator.RationaleBulletGenerator(generator)

# Condense the rationales; run() takes the full list of judgements
if judgements:
    condensed_rationales = bulletizer.run(judgements)
    print(condensed_rationales)
```

**Returns:**
- **condensed_rationales**: list - Bulleted summaries of the rationales, grouped per example.

**Result example (bullets for one example):**
```json
[
  "- Response B includes more locations.",
  "- Response A is more concise."
]
```

--------------------------------

### Display LLM Comparator Results in Google Colab

Source: https://context7.com/pair-code/llm-comparator/llms.txt

Python code to display LLM Comparator evaluation results within a Google Colab notebook. It saves the comparison results to a file and then uses an iframe to embed the web application, allowing interactive viewing.

```python
# Display in Google Colab
from llm_comparator import comparison

# After creating comparison_result
file_path = '/content/evaluation_results.json'
comparison.write(comparison_result, file_path)

# Display in Colab iframe with custom height and port
comparison.show_in_colab(
    file_path=file_path,
    height=800,
    port=8888
)

# The function automatically:
# 1. Copies website files to /content/llm_comparator
# 2. Starts HTTP server on specified port
# 3. Creates iframe displaying the results
```

--------------------------------

### BibTeX Citation for ACM CHI Publication

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

This BibTeX entry provides the citation for the preliminary version of LLM Comparator presented at ACM CHI 2024 in the Late-Breaking Work track. It includes title, authors, booktitle, year, publisher, DOI, and arXiv URL.

```bibtex
@inproceedings{kahng2024comparator,
    title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
    author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
    booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
    year={2024},
    publisher={ACM},
    doi={10.1145/3613905.3650755},
    url={https://arxiv.org/abs/2402.10524}
}
```

--------------------------------

### BibTeX Citation for IEEE TVCG Publication

Source: https://github.com/pair-code/llm-comparator/blob/main/README.md

This BibTeX entry provides the full citation details for the LLM Comparator paper published in IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG) for the IEEE VIS 2024 conference. It includes title, authors, journal, year, volume, number, publisher, and DOI.

```bibtex
@article{kahng2025comparator,
    title={{LLM Comparator}: Interactive Analysis of Side-by-Side Evaluation of Large Language Models},
    author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
    journal={IEEE Transactions on Visualization and Computer Graphics},
    year={2025},
    volume={31},
    number={1},
    publisher={IEEE},
    doi={10.1109/TVCG.2024.3456354}
}
```