# ARC-AGI Program Synthesis System

This project implements a DreamCoder-inspired, LLM-assisted program synthesis system designed to solve Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) tasks. ARC-AGI is a benchmark that measures an AI system's ability to adapt to novel scenarios through abstract reasoning rather than memorization. Each task consists of training examples (input/output grid pairs) that encode a transformation rule, which the system must discover and apply to the test inputs.

The system achieves state-of-the-art results by combining evolutionary test-time compute with an expanding library of learned programs. Starting from an empty library, it iteratively generates Python transformation functions using frontier LLMs (Grok-4, GPT-5, Claude), evaluates them on the training examples, and adds promising solutions to the library. On subsequent tasks, the system includes relevant primitives from the library in its prompts, enabling knowledge transfer and compositional reasoning. This approach scored 77.1% on ARC-AGI-1 and 26.0% on ARC-AGI-2, beating both frontier models and previous bespoke systems while maintaining superior cost efficiency ($2.56 per task vs. $29-$400 for comparable systems).

## Core Functions and APIs

### Generate Submission File

```bash
# Run the main submission script to generate solutions
# -e flag uses pre-trained library from ARC-AGI-2 training set
# -v1 flag targets ARC-AGI-1 public eval set
# --path flag specifies custom input dataset
python -m src.submission -e -v1

# For ARC-AGI-2 evaluation (default)
python -m src.submission -e

# For custom dataset
python -m src.submission --path "path/to/challenges.json"

# Output: Creates submission.json with two attempts per test case
# {
#   "challenge_id": [
#     {"attempt_1": [[0,1],[2,3]], "attempt_2": [[0,1],[2,3]]},
#     ...
#   ]
# }

# The system runs for 2 rounds, generating 5 programs per task per round
# Library grows as successful programs are added after each task
```

### Solve Challenge with Accuracy Tracking

```python
import asyncio

from src.logic import solve_challenge_with_accuracy
from src.models import Challenge, Library, Primitive, RootAttemptConfig
from src.data import eval_challenges
from src.trees.experiments import grok_dreamcoder_tree


async def solve_task():
    # Initialize library with primitives
    library = Library(primitives=[
        Primitive(id="0", python_code_str="""
def transform(grid):
    import numpy as np
    return np.rot90(grid).tolist()
""")
    ])

    # Track costs and accuracy
    total_cost_in_cents = [0.0]
    challenge_primitive_accuracy_scores = {}

    # Solve challenge
    challenge = eval_challenges["00d62c1b"]
    results = await solve_challenge_with_accuracy(
        challenge=challenge,
        tree=grok_dreamcoder_tree,
        library=library,
        use_primitives_weighed_by_score=True,
        challenge_primitive_accuracy_scores=challenge_primitive_accuracy_scores,
        aggregate_cost_in_cents=total_cost_in_cents,
    )

    # Returns: [(grids_1, accuracy_1), (grids_2, accuracy_2)]
    # grids_1: list of output grids for test inputs
    # accuracy_1: float between 0.0 and 1.0 representing training accuracy
    first_solutions, first_accuracy = results[0]
    second_solutions, second_accuracy = results[1]

    print(f"First solution accuracy: {first_accuracy}")
    print(f"Second solution accuracy: {second_accuracy}")
    print(f"Total cost: ${total_cost_in_cents[0]/100:.2f}")

    # Library automatically updated with best program
    print(f"Library size after solve: {len(library.primitives)}")


asyncio.run(solve_task())
```

### Batch Challenge Processing

```python
import asyncio
from pathlib import Path

from src.data import build_challenges
from run import run_from_json
from src.trees.o3 import small_tree


async def batch_solve():
    # Process challenges with concurrency control
    await run_from_json(
        challenges_path="arc-prize-2024/arc-agi_evaluation_challenges.json",
        solutions_path="output/evaluation_solutions.json",
        temp_solutions_dir_path="output/tmp_solutions",
        truth_solutions_path="arc-prize-2024/arc-agi_evaluation_solutions.json",
        tree=small_tree,
        limit=10,           # Process first 10 challenges
        offset=0,           # Start from beginning
        max_concurrent=20,  # Run 20 challenges in parallel
    )

    # Evaluate results
    from run import evaluate_solutions
    evaluate_solutions(
        attempts_solutions_path="output/evaluation_solutions.json",
        truth_solutions_path="arc-prize-2024/arc-agi_evaluation_solutions.json"
    )
    # Output: Prints "total count X correct count Y"
    # Creates JSON file with solutions and intermediate attempt files


asyncio.run(batch_solve())
```

### Library Management with Score-Weighted Selection

```python
import asyncio
from collections import defaultdict

from src.logic import get_best_primitives_weighed_by_score_async
from src.models import Library, Primitive, Challenge
from src.data import training_challenges


async def select_primitives():
    # Create library with transformation primitives
    library = Library(primitives=[
        Primitive(id="0", python_code_str="def transform(grid): return grid"),
        Primitive(id="1", python_code_str="""
def transform(grid):
    import numpy as np
    arr = np.array(grid)
    return np.rot90(arr, k=1).tolist()
"""),
        Primitive(id="2", python_code_str="""
def transform(grid):
    import numpy as np
    arr = np.array(grid)
    return np.flipud(arr).tolist()
"""),
    ])

    # Cache for primitive scores across challenges
    challenge_primitive_scores = defaultdict(dict)

    challenge = training_challenges["007bbfb7"]

    # Select top 2 primitives weighted by accuracy scores
    # Uses softmax over (primary_score + secondary_score) for sampling
    selected = await get_best_primitives_weighed_by_score_async(
        library=library,
        challenge=challenge,
        k_top=2,
        challenge_primitive_scores=challenge_primitive_scores
    )

    print(f"Selected {len(selected)} primitives")
    for prim in selected:
        scores = challenge_primitive_scores[challenge.id].get(prim.id)
        if scores:
            num_correct, avg_accuracy = scores
            print(f"Primitive {prim.id}: {num_correct} correct examples, "
                  f"{avg_accuracy:.2%} average cell accuracy")


asyncio.run(select_primitives())
```

### Configure Attempt Tree for LLM Generation

```python
from src.models import (
    RootAttemptConfig, RootPromptConfig, LLMConfig, Model, Prompt
)

# Define attempt configuration for program generation
attempt_config = RootAttemptConfig(
    attempts=5,  # Generate 5 programs per attempt
    llm_config=LLMConfig(
        model=Model.grok_4,
        temperature=0.7
    ),
    prompt_config=RootPromptConfig(
        base_prompt=Prompt.REASONING,
        use_examples=True,  # Include example demonstrations
        use_diffs=True,     # Show input/output differences
        use_images=True,    # Include PNG visualizations
        use_ascii=True,     # Include ASCII grid representation
        use_array=True,     # Include Python list representation
    ),
    fixes=[],  # No fix attempts in this tree
    include_all_attempts_in_fixes=False
)

# Use in tree definition
grok_dreamcoder_tree = [attempt_config]

# Example with multiple models and fix stages
from src.models import FixAttemptConfig, FixPromptConfig, AttemptEdge, KTopConfig

fix_config = FixAttemptConfig(
    attempts=3,
    llm_config=LLMConfig(model=Model.gpt_5, temperature=0.8),
    prompt_config=FixPromptConfig(
        base_prompt=Prompt.REASONING,
        use_ascii=True,
        use_array=True,
        use_image=False,
        use_fix_reasoning_tags=True,
        use_fix_fail_line=True,
        use_typical_issue_text=True,
        include_diffs=True
    ),
    fixes=[],
    include_all_attempts_in_fixes=True
)

# Create tree with fix stage
root_with_fixes = RootAttemptConfig(
    attempts=5,
    llm_config=LLMConfig(model=Model.grok_4, temperature=0.7),
    prompt_config=RootPromptConfig(
        base_prompt=Prompt.REASONING,
        use_examples=True,
        use_diffs=True,
        use_images=True,
        use_ascii=True,
        use_array=True,
    ),
    fixes=[
        AttemptEdge(
            k_top_config=KTopConfig(k_top=2, unique_code=True, unique_output=True),
            configs=[fix_config],
            pooling=None
        )
    ],
    include_all_attempts_in_fixes=False
)

tree_with_fixes = [root_with_fixes]
```

### Transform Grid with Python Code

```python
from src.run_python import run_python_transform_sync, run_python_transform_async
from copy import deepcopy

# Synchronous execution
code = """
def transform(grid):
    import numpy as np
    arr = np.array(grid)
    # Rotate 90 degrees clockwise
    return np.rot90(arr, k=-1).tolist()
"""

input_grids = [
    [[0, 1], [2, 3]],
    [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
]

result = run_python_transform_sync(
    code=code,
    grid_lists=[deepcopy(g) for g in input_grids],
    timeout=5,
    raise_exception=False
)

if result.transform_results:
    print(f"Execution took {result.latency_ms:.2f}ms")
    for i, transformed in enumerate(result.transform_results):
        print(f"Output {i}: {transformed}")
else:
    print(f"Error: {result.error}")

# Async execution for better performance
import asyncio


async def async_transform():
    result = await run_python_transform_async(
        code=code,
        grid_lists=[deepcopy(g) for g in input_grids],
        timeout=20,
        raise_exception=True  # Raise exception on failure
    )
    return result.transform_results


outputs = asyncio.run(async_transform())
print(f"Async outputs: {outputs}")
```

### Build and Load Challenges

```python
from pathlib import Path

from src.data import build_challenges, build_challenges_v2
import json

# Load ARC-AGI-1 format (single JSON file)
challenges = build_challenges(
    challenges_path=Path("arc-prize-2024/arc-agi_training_challenges.json"),
    solutions_path=Path("arc-prize-2024/arc-agi_training_solutions.json")
)

# Access challenge by ID
challenge = challenges["007bbfb7"]
print(f"Challenge {challenge.id}:")
print(f"  Train examples: {len(challenge.train)}")
print(f"  Test examples: {len(challenge.test)}")
print(f"  First train input shape: {len(challenge.train[0].input)}x{len(challenge.train[0].input[0])}")

# Load ARC-AGI-2 format (directory of JSON files)
v2_challenges = build_challenges_v2(
    challenges_path=Path("arc-agi-2/training")
)

# Without solutions (creates dummy [[0]] outputs)
eval_challenges_no_sol = build_challenges(
    challenges_path=Path("arc-prize-2024/arc-agi_evaluation_challenges.json"),
    solutions_path=None
)

# Iterate through challenges
for challenge_id, challenge in list(challenges.items())[:3]:
    print(f"\n{challenge_id}: {len(challenge.train)} train, {len(challenge.test)} test")
    for i, example in enumerate(challenge.train):
        print(f"  Example {i}: input {len(example.input)}x{len(example.input[0])} "
              f"-> output {len(example.output)}x{len(example.output[0])}")
```

### Calculate Costs and Track Usage

```python
from src.models import Attempt, Model, ModelUsage

# Calculate cost from token usage
usage = ModelUsage(
    cache_creation_input_tokens=5000,
    cache_read_input_tokens=15000,
    input_tokens=1000,
    output_tokens=2000
)

cost_cents = Attempt.cost_cents_from_usage(
    model=Model.grok_4,
    usage=usage
)

print(f"Cost: ${cost_cents/100:.4f}")
print(f"  Cache creation: 5000 tokens @ $15/M = ${5000 * 15 / 1_000_000:.4f}")
print(f"  Cache read: 15000 tokens @ $0.75/M = ${15000 * 0.75 / 1_000_000:.4f}")
print(f"  Input: 1000 tokens @ $3/M = ${1000 * 3 / 1_000_000:.4f}")
print(f"  Output: 2000 tokens @ $15/M = ${2000 * 15 / 1_000_000:.4f}")

# Track aggregate costs across multiple attempts
total_cost_in_cents = [0.0]
# After each solve_challenge call:
# total_cost_in_cents[0] += cost_cents

# Print summary
print(f"\nTotal cost for batch: ${total_cost_in_cents[0]/100:.2f}")
print(f"Average cost per task: ${total_cost_in_cents[0]/100/len(challenges):.2f}")
```

### LPN Model Integration (Optional)

```python
import pickle
from collections import defaultdict

from src.submission import load_lpn_model
from src.logic import get_best_primitives_by_lpn_vmap

# Load pre-trained Latent Program Network
lpn_model, evaluator, key = load_lpn_model(
    artifact_path="wandb-artifact-path"
)

# Use LPN for primitive selection
challenge_primitive_lpn_scores = defaultdict(dict)
primitives = get_best_primitives_by_lpn_vmap(
    library=library,
    challenge=challenge,
    k_top=2,
    lpn_model=lpn_model,
    evaluator=evaluator,
    key=key,
    challenge_primitive_scores=challenge_primitive_lpn_scores,
    batch_size=50
)

# Scores cached in challenge_primitive_lpn_scores
# Format: {challenge_id: {primitive_id: cosine_similarity}}

# Use in solve_challenge
results = await solve_challenge_with_accuracy(
    challenge=challenge,
    tree=tree,
    library=library,
    use_primitives_weighed_by_score=False,  # Set to False when using LPN
    lpn_model=lpn_model,
    evaluator=evaluator,
    key=key,
    challenge_primitive_lpn_scores=challenge_primitive_lpn_scores,
    aggregate_cost_in_cents=total_cost_in_cents
)
```

## Usage and Integration Patterns

The system is designed for both research experimentation and production deployment on the ARC-AGI challenge.
The primary usage pattern involves seeding a library by running the system on the ARC-AGI-2 training set (1,000 tasks, 1 round, 1 program per task), which produces a library of ~500 primitives. This library is then loaded during evaluation with the `-e` flag when running `src.submission`, and the system performs 2 rounds with 5 programs per task, transferring knowledge through primitive selection. For cost optimization, primitives are selected using score-weighted sampling based on training-example accuracy (illustrated in the sketch below); an optional LPN-based selection using latent-space similarity is available for advanced users.

Integration is straightforward through command-line execution or the Python API. The `run.py` module provides batch processing with concurrency control, automatically managing parallel challenge solving and solution aggregation. Error handling is built in: failed LLM responses are gracefully skipped, Python execution timeouts are enforced, and empty solutions default to `[[0]]` arrays. The system supports multiple LLM backends (OpenAI, Anthropic, X.AI, DeepSeek) through environment variable configuration, with automatic retry logic and rate limiting. Cost tracking aggregates expenses across all API calls, and the library persists between runs using pickle serialization for efficient knowledge transfer across sessions.
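
The score-weighted sampling mentioned above can be pictured with a minimal, self-contained sketch. The helper below is hypothetical and only stands in for `src.logic.get_best_primitives_weighed_by_score_async`; it assumes each primitive already has a combined score on the task's training examples and draws `k_top` primitive ids with softmax probabilities.

```python
# Minimal sketch of score-weighted primitive sampling (hypothetical helper,
# not the library's get_best_primitives_weighed_by_score_async).
import math
import random


def sample_primitives_by_score(primitive_scores: dict[str, float], k_top: int) -> list[str]:
    """Sample up to k_top primitive ids without replacement, weighted by softmax(score)."""
    selected: list[str] = []
    for _ in range(min(k_top, len(primitive_scores))):
        remaining = [pid for pid in primitive_scores if pid not in selected]
        # Softmax over the remaining primitives' combined scores
        max_score = max(primitive_scores[pid] for pid in remaining)
        weights = [math.exp(primitive_scores[pid] - max_score) for pid in remaining]
        selected.append(random.choices(remaining, weights=weights, k=1)[0])
    return selected


# Example: combined score = fraction of training examples solved + average cell accuracy
scores = {"0": 0.2, "1": 1.7, "2": 0.9}
print(sample_primitives_by_score(scores, k_top=2))
```

Sampling with softmax weights rather than taking a hard top-k keeps lower-scoring primitives in occasional circulation; the exact weighting used in `src.logic` may differ.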
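
Library persistence can likewise be sketched in a few lines. The file path below is hypothetical (the submission script manages its own library file); the sketch only assumes that `Library` and its `Primitive` entries are picklable.

```python
# Minimal sketch of library persistence with pickle; the path is hypothetical.
import pickle
from pathlib import Path

from src.models import Library, Primitive

library = Library(primitives=[
    Primitive(id="0", python_code_str="def transform(grid): return grid"),
])

library_path = Path("output/library.pkl")
library_path.parent.mkdir(parents=True, exist_ok=True)

# Save the grown library at the end of a run
with library_path.open("wb") as f:
    pickle.dump(library, f)

# Reload it to seed the next session (conceptually what the -e flag relies on)
with library_path.open("rb") as f:
    seeded_library: Library = pickle.load(f)

print(f"Loaded {len(seeded_library.primitives)} primitives")
```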