Repository: https://github.com/arize-ai/prompt-learning
# Prompt Learning SDK

Prompt Learning is a novel approach to optimizing LLM prompts using natural language feedback instead of numerical scores. This system enables continuous prompt improvement through feedback-driven edits that are low-cost, interpretable, and effective even post-deployment. Rather than tuning model weights or relying on scalar metrics, Prompt Learning uses a three-model loop: an Agent executes tasks, an Evaluator identifies failures with textual feedback, and an Optimizer revises the prompt based on that feedback. This paradigm shift allows AI systems to self-improve through failure, learning through instruction adjustment rather than behavioral rewiring.

The repository provides a complete SDK for implementing prompt-learning workflows, including support for Big Bench Hard benchmarks, SWE-bench coding-agent experiments, and custom evaluation tasks. It features token-aware batch processing, Phoenix integration for observability, and multiple experiment runners for different optimization scenarios, including JSON webpage generation, business normalization, and support-query classification.

## API Reference

### Initialize Prompt Learning Optimizer

Create an optimizer instance that refines prompts using natural language feedback from LLM evaluations.

```python
import os

import pandas as pd

from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer

os.environ["OPENAI_API_KEY"] = "sk-..."

# Initialize with a string prompt
optimizer = PromptLearningOptimizer(
    prompt="You are a helpful assistant. Answer this question: {question}",
    model_choice="gpt-4",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Or initialize with a message list
messages = [
    {"role": "user", "content": "Generate a JSON webpage for: {input}"}
]
optimizer = PromptLearningOptimizer(
    prompt=messages,
    model_choice="gpt-4o",
)

# Or initialize with a Phoenix PromptVersion
from phoenix.client.types import PromptVersion

prompt_version = PromptVersion(
    messages=[{"role": "user", "content": "Classify this query: {query}"}],
    model_name="gpt-4",
    model_provider="openai",
)
optimizer = PromptLearningOptimizer(prompt=prompt_version)
```

### Run Evaluators on Dataset

Execute Phoenix evaluators to generate feedback annotations for prompt optimization.

```python
import re

import pandas as pd
from phoenix.evals import OpenAIModel, llm_generate


# Create a custom evaluator function
def evaluate_correctness(dataset):
    eval_model = OpenAIModel(
        model="gpt-4o",
        model_kwargs={"temperature": 0},
    )
    eval_template = """
Evaluate if this answer is correct:
Question: {question}
Answer: {output}
Return JSON with "correctness" (correct/incorrect) and "explanation".
"""

    def parser(response: str, row_index: int) -> dict:
        correctness = re.search(
            r'"correctness":\s*"?(correct|incorrect)"?', response, re.I
        )
        explanation = re.search(r'"explanation":\s*"([^"]*)"', response, re.I)
        return {
            "correctness": correctness.group(1).lower() if correctness else None,
            "explanation": explanation.group(1) if explanation else None,
        }

    results = llm_generate(
        dataframe=dataset,
        template=eval_template,
        model=eval_model,
        output_parser=parser,
        concurrency=10,
    )
    feedback = ["correctness", "explanation"]
    return dataset.assign(**{col: results[col] for col in feedback}), feedback


# Prepare a dataset with LLM outputs
dataset = pd.DataFrame({
    "question": ["What is 2+2?", "What is the capital of France?"],
    "output": ["4", "Paris"],
})

# Run evaluators
dataset_with_feedback, feedback_columns = optimizer.run_evaluators(
    dataset=dataset,
    evaluators=[evaluate_correctness],
    feedback_columns=[],
)
print(dataset_with_feedback)
# Output:
#                          question output correctness                            explanation
# 0                    What is 2+2?      4     correct   The answer is mathematically correct
# 1  What is the capital of France?  Paris     correct  Paris is indeed the capital of France
```

### Optimize Prompt with Feedback

Generate an improved prompt using training data with feedback annotations.
```python
# Create a training dataset with outputs and feedback
train_data = pd.DataFrame({
    "query": [
        "Generate tech company career page",
        "Create restaurant menu page",
        "Build product showcase page",
    ],
    "output": [
        '{"title": "Careers"}',                        # Missing required fields
        '{"page": {"sections": [{"type": "menu"}]}}',  # Correct
        '{"content": "Products"}',                     # Wrong top-level key
    ],
    "correctness": ["incorrect", "correct", "incorrect"],
    "explanation": [
        "Missing 'updatedAt' field and top-level key should be 'page'",
        "Correct structure with proper fields",
        "Top-level key should be 'page', not 'content'",
    ],
})

# Optimize the prompt
optimized_prompt = optimizer.optimize(
    dataset=train_data,
    output_column="output",
    feedback_columns=["correctness", "explanation"],
    context_size_k=128000,  # Token limit for the context window
)

print(optimized_prompt)
# Output (string format):
# You are an expert in JSON webpage creation. Follow these rules:
# 1. Always use 'page' as the top-level key
# 2. Include 'updatedAt' field with ISO timestamp
# 3. Structure sections with proper 'type' values
# Generate: {input}
```

### Create Annotations for Optimization

Generate structured annotations from feedback to provide targeted guidance for prompt improvement.

```python
from optimizer_sdk.annotator import Annotator

# Create an annotator with a custom template
annotator_template = """
Analyze the following examples and provide specific guidance for improving the prompt.

Baseline Prompt: {baseline prompt}

{examples}

Generate 3-5 specific rules or guidelines that would prevent these errors.
"""

# Prepare a dataset with feedback
dataset = pd.DataFrame({
    "input": ["Create a homepage", "Build a contact page"],
    "output": ['{"title": "Home"}', '{"page": "Contact"}'],
    "ground_truth": [
        '{"page": {"title": "Home", "updatedAt": "..."}}',
        '{"page": {"title": "Contact", "updatedAt": "..."}}',
    ],
    "error_type": ["Missing page wrapper", "Incorrect structure"],
})

# Generate annotations
annotations = optimizer.create_annotation(
    prompt="Generate JSON for: {input}",
    template_variables=["input"],
    dataset=dataset,
    feedback_columns=["error_type"],
    annotator_prompts=[annotator_template],
    output_column="output",
    ground_truth_column="ground_truth",
)

print(annotations[0])
# Output:
# Rule 1: Always wrap content in a 'page' object at the top level
# Rule 2: Include 'updatedAt' field with ISO 8601 timestamp
# Rule 3: Use proper nested structure for sections and components
# Rule 4: Validate all required fields are present before returning
# Rule 5: Follow the schema strictly with no deviations
```

### Optimize with Annotations and Rulesets

Use pre-generated annotations and dynamic rulesets for coding-agent optimization.
```python
# Prepare annotations from previous analysis
annotations = [
    "Rule 1: Always validate JSON structure before submission",
    "Rule 2: Include required fields: page, updatedAt, sections",
    "Rule 3: Use allowed section types: hero, content, footer",
]

# Define a dynamic ruleset for iterative improvement
dynamic_ruleset = """
- Validate all JSON outputs against schema
- Check for required fields before generating
- Use proper error handling for edge cases
"""

# Optimize with annotations
optimized_prompt = optimizer.optimize(
    dataset=train_data,
    output_column="output",
    feedback_columns=["correctness", "explanation"],
    annotations=annotations,
    ruleset=dynamic_ruleset,  # If provided, optimizes the ruleset instead of the prompt
    context_size_k=128000,
)

# When a ruleset is provided, the optimized ruleset is returned
print(optimized_prompt)
# Output:
# - Validate all JSON outputs against schema with explicit checks
# - Check for required fields (page, updatedAt, sections) before generating
# - Use proper error handling for malformed inputs and edge cases
# - Ensure 'page' is always the top-level key
# - Include ISO 8601 formatted 'updatedAt' timestamp in all responses
# - Restrict section types to allowed vocabulary: hero, content, footer, header
```

### Token-Aware Batch Processing

Split large datasets into token-limited batches for efficient processing within LLM context windows.
```python
import pandas as pd

from optimizer_sdk.tiktoken_splitter import TiktokenSplitter

# Initialize a splitter for a specific model
splitter = TiktokenSplitter(model="gpt-4o")

# Create a dataset
dataset = pd.DataFrame({
    "input": ["Generate page " + str(i) for i in range(100)],
    "output": ["Output " + str(i) for i in range(100)],
    "feedback": ["Feedback " + str(i) for i in range(100)],
})

# Split into batches that fit within the context window
batches = splitter.get_batch_dataframes(
    df=dataset,
    columns=["input", "output", "feedback"],
    context_size_tokens=8000,  # 8k-token limit per batch
)

print(f"Split {len(dataset)} rows into {len(batches)} batches")
# Output: Split 100 rows into 3 batches

for i, batch in enumerate(batches):
    print(f"Batch {i}: {len(batch)} rows")
    # Process each batch independently
# Output:
# Batch 0: 40 rows
# Batch 1: 35 rows
# Batch 2: 25 rows
```

### Run Big Bench Hard Experiments

Execute prompt learning on Big Bench Hard benchmarks with automatic dataset download and evaluation.

```python
from big_bench_hard.run_files.pl_multidataset import run_bbh_experiments

# Run the complete Big Bench Hard experiment suite
results_df, ground_truth_comparisons, summary_df = run_bbh_experiments()
# Output:
# 🔬 Running experiment for boolean_expressions...
# 📊 Processing 50 examples in 1 batches
# ✅ Batch 1/1: Optimized
# ✅ Completed experiment
#    Initial metric: 0.600
#    Final test accuracy: 0.880
#
# 🔬 Running experiment for web_of_lies...
# ...
#
# 📊 EXPERIMENT SUMMARY TABLE
# ==================================================================================
# Task                 Init GT  Final GT  Init LLM  Final LLM  GT Δ   LLM Δ  Type
# ----------------------------------------------------------------------------------
# boolean_expressions  0.600    0.880     0.620     0.900      0.280  0.280  boolean
# web_of_lies          0.500    0.820     0.480     0.860      0.320  0.380  general
# word_sorting         0.400    0.760     0.420     0.800      0.360  0.380  sorting
# ...

print(f"Average improvement: {summary_df['Ground_Truth_Improvement'].mean():.3f}")
# Output: Average improvement: 0.287

# Access detailed results
print(results_df.columns)
# Output: ['initial metric', 'train', 'test', 'prompt', 'file', 'raw']

# Examine a specific task's performance
task_results = results_df.iloc[0]
print(f"Task: {task_results['file']}")
print(f"Initial prompt: {task_results['prompt'][0]}")
print(f"Optimized prompt: {task_results['prompt'][-1]}")
print(f"Test scores per iteration: {task_results['test']}")
```

### Run SWE-bench Coding Agent Experiments

Execute the Cline agent on SWE-bench instances with prompt optimization for code-generation tasks.

```python
import pandas as pd

from cline.act_mode.run_act import run_act

# Run Cline on specific SWE-bench instances
results = run_act(
    run_id="experiment_v1",  # Unique identifier for this run
    dataset_name="SWE-bench/SWE-bench_Lite",
    instance_ids=["django__django-12345", "sympy__sympy-14308"],
    ruleset="",    # Optional: dynamic ruleset for agent guidance
    workers=4,     # Parallel workers for concurrent execution
    count=None,    # Or set to N to run N random instances
)

# Output DataFrame with results
print(results)
#             instance_id              problem_statement  ...  cline_patch  pass_or_fail
# 0  django__django-12345     Fix QuerySet.defer() with...  ...   diff --git          pass
# 1    sympy__sympy-14308    Simplify should handle fra...  ...   diff --git          fail

# Analyze results
pass_rate = (results['pass_or_fail'] == 'pass').sum() / len(results)
print(f"Pass rate: {pass_rate:.1%}")
# Output: Pass rate: 50.0%

# Save results for analysis
results.to_csv("cline_results.csv", index=False)

# Run on a random subset
random_results = run_act(
    run_id="random_experiment",
    dataset_name="SWE-bench/SWE-bench_Lite",
    count=10,   # Run on 10 random instances
    workers=8,
)
```

### Optimize JSON Webpage Generation

End-to-end example of optimizing prompts for structured JSON generation with rule validation.
```python
import os

import pandas as pd

from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer

# Initial prompt
system_prompt = "You are an expert in JSON webpage creation. Generate: {input}"

# Training data with feedback
train_data = pd.DataFrame({
    "input": [
        "Create a homepage for a tech startup",
        "Build a product catalog page",
        "Generate an about us page",
    ],
    "output": [
        '{"title": "Homepage"}',  # Missing structure
        '{"page": {"sections": [{"type": "catalog"}], "updatedAt": "2024-01-01"}}',  # Good
        '{"content": {"title": "About"}}',  # Wrong key
    ],
    "correctness": ["incorrect", "correct", "incorrect"],
    "explanation": [
        "Missing 'page' wrapper, 'sections' array, and 'updatedAt' field",
        "Correct structure with all required fields",
        "Top-level key must be 'page', not 'content'",
    ],
    "rule_violations": [
        "Rules violated: 1, 3, 5",
        "No violations",
        "Rules violated: 1",
    ],
})

# Initialize the optimizer
optimizer = PromptLearningOptimizer(
    prompt=system_prompt,
    model_choice="gpt-4o",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Optimize with all feedback columns
optimized_prompt = optimizer.optimize(
    dataset=train_data,
    output_column="output",
    feedback_columns=["correctness", "explanation", "rule_violations"],
    context_size_k=128000,
)

print("Original:", system_prompt)
print("\nOptimized:", optimized_prompt)
# Output:
# Original: You are an expert in JSON webpage creation. Generate: {input}
#
# Optimized: You are an expert in JSON webpage creation. When generating webpages, follow these rules:
# 1. Always use 'page' as the top-level key (never 'content', 'title', or other keys)
# 2. Include a 'sections' array containing the page sections
# 3. Each section must have a 'type' field using allowed vocabulary
# 4. Always include an 'updatedAt' field with ISO 8601 timestamp
# 5. Validate the complete structure before returning the JSON
#
# Generate: {input}

# Test the optimized prompt
test_data = pd.DataFrame({"input": ["Create a contact page"]})

from phoenix.evals import OpenAIModel, llm_generate

output_model = OpenAIModel(model="gpt-4o")
test_data["output"] = llm_generate(
    dataframe=test_data,
    template=optimized_prompt,
    model=output_model,
)["output"]

print(test_data["output"][0])
# Output:
# {"page": {"title": "Contact Us", "sections": [{"type": "contact_form"}], "updatedAt": "2024-10-15T12:00:00Z"}}
```

### Compare Model Outputs with Ground Truth

Evaluate optimization effectiveness by comparing generated outputs against ground-truth labels with task-specific comparison logic.

```python
import pandas as pd

from big_bench_hard.run_files.pl_multidataset import compare_results_with_targets

# Results from an optimization loop
results = {
    'raw': [  # List of test DataFrames, one per iteration
        pd.DataFrame({
            'input': ['Is this statement true?', 'Solve: (True AND False)'],
            'output': ['{"result": "yes"}', '{"result": "False"}'],
        }),
        pd.DataFrame({
            'input': ['Is this statement true?', 'Solve: (True AND False)'],
            'output': ['{"result": "Yes"}', '{"result": "False"}'],
        }),
    ],
    'test': [0.500, 0.950],  # LLM evaluator scores per iteration
}

# Ground-truth targets
test_targets = pd.Series(['Yes', 'False'])

# Compare with the appropriate task type
comparison = compare_results_with_targets(
    results=results,
    test_targets=test_targets,
    task_type="boolean",  # Options: "general", "boolean", "sorting", "counting"
)

print(f"Initial accuracy: {comparison['initial_accuracy']:.3f}")
print(f"Final accuracy: {comparison['final_accuracy']:.3f}")
print(f"Improvement: {comparison['improvement']:.3f}")
print(f"Best accuracy: {comparison['best_accuracy']:.3f}")
# Output:
# Initial accuracy: 0.500
# Final accuracy: 1.000
# Improvement: 0.500
# Best accuracy: 1.000

# Detailed per-iteration analysis
for detail in comparison['iteration_details']:
    print(f"Iteration {detail['iteration']}:")
    print(f"  Ground truth accuracy: {detail['ground_truth_accuracy']:.3f}")
    print(f"  LLM evaluator score: {detail['llm_evaluator_score']:.3f}")
    print(f"  Difference: {detail['difference']:.3f}")
# Output:
# Iteration 0:
#   Ground truth accuracy: 0.500
#   LLM evaluator score: 0.500
#   Difference: 0.000
# Iteration 1:
#   Ground truth accuracy: 1.000
#   LLM evaluator score: 0.950
#   Difference: 0.050
```

## Summary

The Prompt Learning SDK provides a complete framework for continuous prompt improvement through natural language feedback. The core workflow involves generating outputs with a baseline prompt, collecting textual feedback from LLM evaluators or annotators, and using a meta-prompt approach to synthesize improvements. This creates a virtuous cycle in which prompts become increasingly effective through iterative refinement based on concrete failure examples.

The system integrates with Phoenix for observability, supports both simple classification tasks and complex code-generation scenarios, and includes specialized tooling for benchmark evaluation (Big Bench Hard, SWE-bench). Token-aware batching ensures scalability, while a flexible evaluator architecture enables custom feedback mechanisms. Use cases span JSON generation, business-rule validation, support-ticket classification, and automated code repair, all optimized through natural-language error feedback rather than opaque numerical metrics.
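
The Agent / Evaluator / Optimizer cycle at the heart of this workflow can be sketched as a minimal control-flow skeleton. This is a conceptual illustration, not the SDK's API: all three roles are stubbed with plain functions (every name below is hypothetical) so the loop runs without an LLM, whereas in the real SDK each role is an LLM call and the optimizer step uses a meta-prompt.

```python
# Minimal sketch of the feedback-driven optimization loop.
# Assumption: all functions here are illustrative stand-ins for LLM calls.

def agent(prompt, task):
    """Stub agent: executes a task under the current prompt."""
    return task.upper() if "UPPERCASE" in prompt else task

def evaluator(output, expected):
    """Stub evaluator: returns natural-language feedback on failure, None on success."""
    if output != expected:
        return "Output should be uppercase; add an explicit UPPERCASE rule to the prompt."
    return None

def optimizer(prompt, feedback):
    """Stub optimizer: folds textual feedback back into the prompt."""
    if any("UPPERCASE" in f for f in feedback):
        return prompt + " Always return the answer in UPPERCASE."
    return prompt

def optimization_loop(prompt, tasks, targets, iterations=3):
    for _ in range(iterations):
        # Agent runs every task; evaluator emits feedback only for failures
        feedback = [f for task, target in zip(tasks, targets)
                    if (f := evaluator(agent(prompt, task), target)) is not None]
        if not feedback:
            break  # every task passes: the prompt has converged
        prompt = optimizer(prompt, feedback)  # revise prompt from textual feedback
    return prompt

final_prompt = optimization_loop("Echo the input.", ["hello"], ["HELLO"])
print(final_prompt)
# Output: Echo the input. Always return the answer in UPPERCASE.
```

The key property the sketch preserves is that the loop's signal is English feedback, not a scalar score: the optimizer edits instructions based on what the evaluator said went wrong, which is why the resulting prompt changes are interpretable.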