# ARC-AGI Program Synthesis System

This project implements a DreamCoder-inspired, LLM-assisted program synthesis system designed to solve Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) tasks. ARC-AGI is a benchmark that measures an AI system's ability to adapt to novel scenarios through abstract reasoning rather than memorization. Each task consists of training examples (input/output grid pairs) that encode a transformation rule, which the system must discover and apply to the test inputs.

The system achieves state-of-the-art results by combining evolutionary test-time compute with an expanding library of learned programs. Starting from an empty library, it iteratively generates Python transformation functions using frontier LLMs (Grok-4, GPT-5, Claude), evaluates them on the training examples, and adds promising solutions to the library. On subsequent tasks, the system includes relevant primitives from the library in its prompts, enabling knowledge transfer and compositional reasoning. This approach scored 77.1% on ARC-AGI-1 and 26.0% on ARC-AGI-2, beating both frontier models and previous bespoke systems while maintaining superior cost efficiency ($2.56 per task vs. $29-$400 for comparable systems).

## Core Functions and APIs

### Generate Submission File

```bash
# Run the main submission script to generate solutions
# -e flag uses pre-trained library from ARC-AGI-2 training set
# -v1 flag targets ARC-AGI-1 public eval set
# --path flag specifies custom input dataset
python -m src.submission -e -v1

# For ARC-AGI-2 evaluation (default)
python -m src.submission -e

# For custom dataset
python -m src.submission --path "path/to/challenges.json"

# Output: Creates submission.json with two attempts per test case
# {
#   "challenge_id": [
#     {"attempt_1": [[0,1],[2,3]], "attempt_2": [[0,1],[2,3]]},
#     ...
#   ]
# }

# The system runs for 2 rounds, generating 5 programs per task per round
# Library grows as successful programs are added after each task
```

### Solve Challenge with Accuracy Tracking

```python
import asyncio

from src.logic import solve_challenge_with_accuracy
from src.models import Challenge, Library, Primitive, RootAttemptConfig
from src.data import eval_challenges
from src.trees.experiments import grok_dreamcoder_tree


async def solve_task():
    # Initialize library with primitives
    library = Library(primitives=[
        Primitive(id="0", python_code_str="""
def transform(grid):
    import numpy as np
    return np.rot90(grid).tolist()
""")
    ])

    # Track costs and accuracy
    total_cost_in_cents = [0.0]
    challenge_primitive_accuracy_scores = {}

    # Solve challenge
    challenge = eval_challenges["00d62c1b"]
    results = await solve_challenge_with_accuracy(
        challenge=challenge,
        tree=grok_dreamcoder_tree,
        library=library,
        use_primitives_weighed_by_score=True,
        challenge_primitive_accuracy_scores=challenge_primitive_accuracy_scores,
        aggregate_cost_in_cents=total_cost_in_cents,
    )

    # Returns: [(grids_1, accuracy_1), (grids_2, accuracy_2)]
    # grids_1: list of output grids for test inputs
    # accuracy_1: float between 0.0 and 1.0 representing training accuracy
    first_solutions, first_accuracy = results[0]
    second_solutions, second_accuracy = results[1]

    print(f"First solution accuracy: {first_accuracy}")
    print(f"Second solution accuracy: {second_accuracy}")
    print(f"Total cost: ${total_cost_in_cents[0]/100:.2f}")

    # Library automatically updated with best program
    print(f"Library size after solve: {len(library.primitives)}")


asyncio.run(solve_task())
```

### Batch Challenge Processing

```python
import asyncio
from pathlib import Path

from src.data import build_challenges
from run import run_from_json
from src.trees.o3 import small_tree


async def batch_solve():
    # Process challenges with concurrency control
    await run_from_json(
        challenges_path="arc-prize-2024/arc-agi_evaluation_challenges.json",
        solutions_path="output/evaluation_solutions.json",
        temp_solutions_dir_path="output/tmp_solutions",
        truth_solutions_path="arc-prize-2024/arc-agi_evaluation_solutions.json",
        tree=small_tree,
        limit=10,           # Process first 10 challenges
        offset=0,           # Start from beginning
        max_concurrent=20,  # Run 20 challenges in parallel
    )

    # Evaluate results
    from run import evaluate_solutions
    evaluate_solutions(
        attempts_solutions_path="output/evaluation_solutions.json",
        truth_solutions_path="arc-prize-2024/arc-agi_evaluation_solutions.json"
    )
    # Output: Prints "total count X correct count Y"
    # Creates JSON file with solutions and intermediate attempt files


asyncio.run(batch_solve())
```

### Library Management with Score-Weighted Selection

```python
import asyncio
from collections import defaultdict

from src.logic import get_best_primitives_weighed_by_score_async
from src.models import Library, Primitive, Challenge
from src.data import training_challenges


async def select_primitives():
    # Create library with transformation primitives
    library = Library(primitives=[
        Primitive(id="0", python_code_str="def transform(grid): return grid"),
        Primitive(id="1", python_code_str="""
def transform(grid):
    import numpy as np
    arr = np.array(grid)
    return np.rot90(arr, k=1).tolist()
"""),
        Primitive(id="2", python_code_str="""
def transform(grid):
    import numpy as np
    arr = np.array(grid)
    return np.flipud(arr).tolist()
"""),
    ])

    # Cache for primitive scores across challenges
    challenge_primitive_scores = defaultdict(dict)

    challenge = training_challenges["007bbfb7"]

    # Select top 2 primitives weighted by accuracy scores
    # Uses softmax over (primary_score + secondary_score) for sampling
    selected = await get_best_primitives_weighed_by_score_async(
        library=library,
        challenge=challenge,
        k_top=2,
        challenge_primitive_scores=challenge_primitive_scores
    )

    print(f"Selected {len(selected)} primitives")
    for prim in selected:
        scores = challenge_primitive_scores[challenge.id].get(prim.id)
        if scores:
            num_correct, avg_accuracy = scores
            print(f"Primitive {prim.id}: {num_correct} correct examples, "
                  f"{avg_accuracy:.2%} average cell accuracy")


asyncio.run(select_primitives())
```

### Configure Attempt Tree for LLM Generation

```python
from src.models import (
    RootAttemptConfig, RootPromptConfig, LLMConfig, Model, Prompt
)

# Define attempt configuration for program generation
attempt_config = RootAttemptConfig(
    attempts=5,  # Generate 5 programs per attempt
    llm_config=LLMConfig(
        model=Model.grok_4,
        temperature=0.7
    ),
    prompt_config=RootPromptConfig(
        base_prompt=Prompt.REASONING,
        use_examples=True,  # Include example demonstrations
        use_diffs=True,     # Show input/output differences
        use_images=True,    # Include PNG visualizations
        use_ascii=True,     # Include ASCII grid representation
        use_array=True,     # Include Python list representation
    ),
    fixes=[],  # No fix attempts in this tree
    include_all_attempts_in_fixes=False
)

# Use in tree definition
grok_dreamcoder_tree = [attempt_config]

# Example with multiple models and fix stages
from src.models import FixAttemptConfig, FixPromptConfig, AttemptEdge, KTopConfig

fix_config = FixAttemptConfig(
    attempts=3,
    llm_config=LLMConfig(model=Model.gpt_5, temperature=0.8),
    prompt_config=FixPromptConfig(
        base_prompt=Prompt.REASONING,
        use_ascii=True,
        use_array=True,
        use_image=False,
        use_fix_reasoning_tags=True,
        use_fix_fail_line=True,
        use_typical_issue_text=True,
        include_diffs=True
    ),
    fixes=[],
    include_all_attempts_in_fixes=True
)

# Create tree with fix stage
root_with_fixes = RootAttemptConfig(
    attempts=5,
    llm_config=LLMConfig(model=Model.grok_4, temperature=0.7),
    prompt_config=RootPromptConfig(
        base_prompt=Prompt.REASONING,
        use_examples=True,
        use_diffs=True,
        use_images=True,
        use_ascii=True,
        use_array=True,
    ),
    fixes=[
        AttemptEdge(
            k_top_config=KTopConfig(k_top=2, unique_code=True, unique_output=True),
            configs=[fix_config],
            pooling=None
        )
    ],
    include_all_attempts_in_fixes=False
)

tree_with_fixes = [root_with_fixes]
```

### Transform Grid with Python Code

```python
from src.run_python import run_python_transform_sync, run_python_transform_async
from copy import deepcopy

# Synchronous execution
code = """
def transform(grid):
    import numpy as np
    arr = np.array(grid)
    # Rotate 90 degrees clockwise
    return np.rot90(arr, k=-1).tolist()
"""

input_grids = [
    [[0, 1], [2, 3]],
    [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
]

result = run_python_transform_sync(
    code=code,
    grid_lists=[deepcopy(g) for g in input_grids],
    timeout=5,
    raise_exception=False
)

if result.transform_results:
    print(f"Execution took {result.latency_ms:.2f}ms")
    for i, transformed in enumerate(result.transform_results):
        print(f"Output {i}: {transformed}")
else:
    print(f"Error: {result.error}")

# Async execution for better performance
import asyncio


async def async_transform():
    result = await run_python_transform_async(
        code=code,
        grid_lists=[deepcopy(g) for g in input_grids],
        timeout=20,
        raise_exception=True  # Raise exception on failure
    )
    return result.transform_results


outputs = asyncio.run(async_transform())
print(f"Async outputs: {outputs}")
```

### Build and Load Challenges

```python
from pathlib import Path

from src.data import build_challenges, build_challenges_v2
import json

# Load ARC-AGI-1 format (single JSON file)
challenges = build_challenges(
    challenges_path=Path("arc-prize-2024/arc-agi_training_challenges.json"),
    solutions_path=Path("arc-prize-2024/arc-agi_training_solutions.json")
)

# Access challenge by ID
challenge = challenges["007bbfb7"]
print(f"Challenge {challenge.id}:")
print(f"  Train examples: {len(challenge.train)}")
print(f"  Test examples: {len(challenge.test)}")
print(f"  First train input shape: {len(challenge.train[0].input)}x{len(challenge.train[0].input[0])}")

# Load ARC-AGI-2 format (directory of JSON files)
v2_challenges = build_challenges_v2(
    challenges_path=Path("arc-agi-2/training")
)

# Without solutions (creates dummy [[0]] outputs)
eval_challenges_no_sol = build_challenges(
    challenges_path=Path("arc-prize-2024/arc-agi_evaluation_challenges.json"),
    solutions_path=None
)

# Iterate through challenges
for challenge_id, challenge in list(challenges.items())[:3]:
    print(f"\n{challenge_id}: {len(challenge.train)} train, {len(challenge.test)} test")
    for i, example in enumerate(challenge.train):
        print(f"  Example {i}: input {len(example.input)}x{len(example.input[0])} "
              f"-> output {len(example.output)}x{len(example.output[0])}")
```

### Calculate Costs and Track Usage

```python
from src.models import Attempt, Model, ModelUsage

# Calculate cost from token usage
usage = ModelUsage(
    cache_creation_input_tokens=5000,
    cache_read_input_tokens=15000,
    input_tokens=1000,
    output_tokens=2000
)

cost_cents = Attempt.cost_cents_from_usage(
    model=Model.grok_4,
    usage=usage
)

print(f"Cost: ${cost_cents/100:.4f}")
print(f"  Cache creation: 5000 tokens @ $15/M = ${5000 * 15 / 1_000_000:.4f}")
print(f"  Cache read: 15000 tokens @ $0.75/M = ${15000 * 0.75 / 1_000_000:.4f}")
print(f"  Input: 1000 tokens @ $3/M = ${1000 * 3 / 1_000_000:.4f}")
print(f"  Output: 2000 tokens @ $15/M = ${2000 * 15 / 1_000_000:.4f}")

# Track aggregate costs across multiple attempts
total_cost_in_cents = [0.0]
# After each solve_challenge call:
# total_cost_in_cents[0] += cost_cents

# Print summary
print(f"\nTotal cost for batch: ${total_cost_in_cents[0]/100:.2f}")
print(f"Average cost per task: ${total_cost_in_cents[0]/100/len(challenges):.2f}")
```

### LPN Model Integration (Optional)

```python
import pickle
from collections import defaultdict

from src.submission import load_lpn_model
from src.logic import get_best_primitives_by_lpn_vmap

# Load pre-trained Latent Program Network
lpn_model, evaluator, key = load_lpn_model(
    artifact_path="wandb-artifact-path"
)

# Use LPN for primitive selection
challenge_primitive_lpn_scores = defaultdict(dict)
primitives = get_best_primitives_by_lpn_vmap(
    library=library,
    challenge=challenge,
    k_top=2,
    lpn_model=lpn_model,
    evaluator=evaluator,
    key=key,
    challenge_primitive_scores=challenge_primitive_lpn_scores,
    batch_size=50
)

# Scores cached in challenge_primitive_lpn_scores
# Format: {challenge_id: {primitive_id: cosine_similarity}}

# Use in solve_challenge
results = await solve_challenge_with_accuracy(
    challenge=challenge,
    tree=tree,
    library=library,
    use_primitives_weighed_by_score=False,  # Set to False when using LPN
    lpn_model=lpn_model,
    evaluator=evaluator,
    key=key,
    challenge_primitive_lpn_scores=challenge_primitive_lpn_scores,
    aggregate_cost_in_cents=total_cost_in_cents
)
```

## Usage and Integration Patterns

The system is designed for both research experimentation and production deployment on the ARC-AGI challenge.
The primary usage pattern involves seeding a library by running the system on the ARC-AGI-2 training set (1,000 tasks, 1 round, 1 program per task), which produces a library of ~500 primitives. This library is then loaded during evaluation with the `-e` flag when running `src.submission`, and the system performs 2 rounds with 5 programs per task, transferring knowledge through primitive selection. For cost optimization, primitives are selected using score-weighted sampling based on training-example accuracy (illustrated in the sketch below); an optional LPN-based selection using latent-space similarity is available for advanced users.

Integration is straightforward through command-line execution or the Python API. The `run.py` module provides batch processing with concurrency control, automatically managing parallel challenge solving and solution aggregation. Error handling is built in: failed LLM responses are gracefully skipped, Python execution timeouts are enforced, and empty solutions default to `[[0]]` arrays. The system supports multiple LLM backends (OpenAI, Anthropic, X.AI, DeepSeek) through environment variable configuration, with automatic retry logic and rate limiting. Cost tracking aggregates expenses across all API calls, and the library persists between runs using pickle serialization for efficient knowledge transfer across sessions.
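
The score-weighted sampling mentioned above can be pictured with a minimal, self-contained sketch. The helper below is hypothetical and only stands in for `src.logic.get_best_primitives_weighed_by_score_async`; it assumes each primitive already has a combined score on the task's training examples and draws `k_top` primitive ids with softmax probabilities.

```python
# Minimal sketch of score-weighted primitive sampling (hypothetical helper,
# not the library's get_best_primitives_weighed_by_score_async).
import math
import random


def sample_primitives_by_score(primitive_scores: dict[str, float], k_top: int) -> list[str]:
    """Sample up to k_top primitive ids without replacement, weighted by softmax(score)."""
    selected: list[str] = []
    for _ in range(min(k_top, len(primitive_scores))):
        remaining = [pid for pid in primitive_scores if pid not in selected]
        # Softmax over the remaining primitives' combined scores
        max_score = max(primitive_scores[pid] for pid in remaining)
        weights = [math.exp(primitive_scores[pid] - max_score) for pid in remaining]
        selected.append(random.choices(remaining, weights=weights, k=1)[0])
    return selected


# Example: combined score = fraction of training examples solved + average cell accuracy
scores = {"0": 0.2, "1": 1.7, "2": 0.9}
print(sample_primitives_by_score(scores, k_top=2))
```

Sampling with softmax weights rather than taking a hard top-k keeps lower-scoring primitives in occasional circulation; the exact weighting used in `src.logic` may differ.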
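
Library persistence can likewise be sketched in a few lines. The file path below is hypothetical (the submission script manages its own library file); the sketch only assumes that `Library` and its `Primitive` entries are picklable.

```python
# Minimal sketch of library persistence with pickle; the path is hypothetical.
import pickle
from pathlib import Path

from src.models import Library, Primitive

library = Library(primitives=[
    Primitive(id="0", python_code_str="def transform(grid): return grid"),
])

library_path = Path("output/library.pkl")
library_path.parent.mkdir(parents=True, exist_ok=True)

# Save the grown library at the end of a run
with library_path.open("wb") as f:
    pickle.dump(library, f)

# Reload it to seed the next session (conceptually what the -e flag relies on)
with library_path.open("rb") as f:
    seeded_library: Library = pickle.load(f)

print(f"Loaded {len(seeded_library.primitives)} primitives")
```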