# ast-doc

ast-doc is a Rust CLI tool that generates optimized `llms.txt` documentation files from codebases using AST-based semantic parsing. It employs a four-stage pipeline (Ingestion, Parser, Scheduler, Renderer) to process source files, extract meaningful code structures via tree-sitter, and fit the output within configurable token budgets while preserving the most important content.

The tool supports multiple programming languages: Rust, Python, TypeScript/JavaScript, Go, and C. It offers three output strategies (Full, NoTests, Summary) that can be applied automatically based on token constraints, with the ability to protect critical "core" files from degradation. Additional features include git context capture, directory tree generation with language annotations, and anti-bloat rules that compress blank lines and trim whitespace.

## CLI Usage

The `ast-doc` command-line interface provides options for generating optimized documentation from any codebase, with configurable token budgets and output strategies.

```bash
# Basic usage - generate llms.txt to stdout from current directory
ast-doc .

# Generate with output file and token budget
ast-doc /path/to/project --output llms.txt --max-tokens 64000

# Use summary strategy for smaller output (signatures only)
ast-doc . --strategy summary --output summary.md

# Protect core files from degradation while using summary for others
ast-doc . --core "src/main.rs" --core "src/lib.rs" --strategy summary --max-tokens 50000

# Filter files with include/exclude patterns
ast-doc . --include "*.rs" --exclude "*test*" --output rust-only.txt

# Skip git context and directory tree for pure code output
ast-doc . --no-git --no-tree --output pure-code.txt

# Enable verbose logging for debugging
ast-doc . --verbose --output debug.txt
```

## run_pipeline Function

The main entry point for the ast-doc pipeline. It orchestrates all four stages: file discovery, AST parsing, token budget optimization, and markdown rendering. Returns both the rendered output and scheduling metadata.

```rust
use ast_doc_core::{AstDocConfig, OutputStrategy, run_pipeline};
use std::path::PathBuf;

fn main() -> eyre::Result<()> {
    let config = AstDocConfig {
        path: PathBuf::from("."),
        output: Some(PathBuf::from("llms.txt")),
        max_tokens: 128_000,
        core_patterns: vec!["src/lib.rs".to_string()],
        default_strategy: OutputStrategy::Full,
        include_patterns: vec!["*.rs".to_string()],
        exclude_patterns: vec!["*test*".to_string()],
        no_git: false,
        no_tree: false,
        copy: false,
        verbose: false,
    };

    let result = run_pipeline(&config)?;

    // Access the rendered output
    println!("Generated {} bytes", result.output.len());

    // Access scheduling metadata
    println!("Total tokens: {}", result.schedule.total_tokens);
    println!("Raw tokens (before optimization): {}", result.schedule.raw_tokens);
    println!("Files processed: {}", result.schedule.files.len());

    // Write to file
    std::fs::write(&config.output.unwrap(), &result.output)?;

    Ok(())
}
```

## AstDocConfig Structure

The configuration structure that controls all aspects of the ast-doc pipeline, including file discovery, token budgets, output strategies, and filtering patterns.
```rust
use ast_doc_core::{AstDocConfig, OutputStrategy};
use std::path::PathBuf;

// Minimal configuration for quick documentation
let minimal_config = AstDocConfig {
    path: PathBuf::from("/path/to/project"),
    output: None, // stdout
    max_tokens: 128_000,
    core_patterns: vec![],
    default_strategy: OutputStrategy::Full,
    include_patterns: vec![],
    exclude_patterns: vec![],
    no_git: false,
    no_tree: false,
    copy: false,
    verbose: false,
};

// Full configuration with all options
let full_config = AstDocConfig {
    path: PathBuf::from("/path/to/project"),
    output: Some(PathBuf::from("docs/llms.txt")),
    max_tokens: 50_000, // Strict token budget
    core_patterns: vec![
        "src/lib.rs".to_string(),
        "src/main.rs".to_string(),
        "src/core/**".to_string(), // Glob patterns supported
    ],
    default_strategy: OutputStrategy::NoTests, // Strip tests by default
    include_patterns: vec!["*.rs".to_string(), "*.py".to_string()],
    exclude_patterns: vec!["*_test.rs".to_string(), "benches/**".to_string()],
    no_git: false,  // Include git context
    no_tree: false, // Include directory tree
    copy: false,    // Don't copy to clipboard
    verbose: true,  // Enable debug logging
};
```

## OutputStrategy Enum

Defines the three levels of code extraction, which control how much detail is preserved in the output, from full source code down to signatures only.
```rust
use ast_doc_core::OutputStrategy;

// Full: include all source code verbatim (default)
let full = OutputStrategy::Full;

// NoTests: strip test modules and test functions
let no_tests = OutputStrategy::NoTests;

// Summary: extract signatures only, omit implementations
let summary = OutputStrategy::Summary;

// Strategies can be degraded in order: Full -> NoTests -> Summary
assert_eq!(OutputStrategy::Full.degrade(), Some(OutputStrategy::NoTests));
assert_eq!(OutputStrategy::NoTests.degrade(), Some(OutputStrategy::Summary));
assert_eq!(OutputStrategy::Summary.degrade(), None); // Already at minimum

// Strategies are ordered for comparison
assert!(OutputStrategy::Full < OutputStrategy::NoTests);
assert!(OutputStrategy::NoTests < OutputStrategy::Summary);
```

## run_ingestion Function

Phase 1 of the pipeline: discovers source files, reads their contents, detects languages, captures git metadata, and generates the directory tree structure.

```rust
use ast_doc_core::{AstDocConfig, OutputStrategy, ingestion::run_ingestion};
use std::path::PathBuf;

fn main() -> Result<(), ast_doc_core::AstDocError> {
    let config = AstDocConfig {
        path: PathBuf::from("."),
        output: None,
        max_tokens: 100_000,
        core_patterns: vec![],
        default_strategy: OutputStrategy::Full,
        include_patterns: vec!["*.rs".to_string()],
        exclude_patterns: vec!["target/**".to_string()],
        no_git: false,
        no_tree: false,
        copy: false,
        verbose: false,
    };

    let result = run_ingestion(&config)?;

    // Access discovered files
    for file in &result.files {
        println!("Found: {} ({:?}) - {} tokens",
            file.path.display(),
            file.language,
            file.raw_token_count
        );
    }

    // Access the directory tree
    println!("Directory tree:\n{}", result.directory_tree);

    // Access git context if available
    if let Some(git) = &result.git_context {
        println!("Branch: {}", git.branch);
        println!("Latest commit: {}", git.latest_commit);
        if let Some(diff) = &git.diff {
            println!("Uncommitted changes:\n{}", diff);
        }
    }

    Ok(())
}
```

## parse_file Function

Phase 2 function that uses tree-sitter to parse a source file and pre-compute all three strategy variants (Full, NoTests, Summary) with their token counts.

```rust
use ast_doc_core::{
    OutputStrategy,
    parser::{parse_file, detect_language, Language},
    ingestion::DiscoveredFile,
};
use std::path::PathBuf;

fn main() -> Result<(), ast_doc_core::AstDocError> {
    // Create a discovered file (normally produced by the ingestion phase)
    let file = DiscoveredFile {
        path: PathBuf::from("src/lib.rs"),
        content: r#"
/// A simple greeting function.
pub fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_greet() {
        assert_eq!(greet("World"), "Hello, World!");
    }
}
"#.to_string(),
        language: Some(Language::Rust),
        raw_token_count: 50,
    };

    // Parse the file with tree-sitter
    let parsed = parse_file(&file, Language::Rust)?;

    // Access pre-computed strategy data
    for strategy in [OutputStrategy::Full, OutputStrategy::NoTests, OutputStrategy::Summary] {
        if let Some(data) = parsed.strategies_data.get(&strategy) {
            println!("{:?}: {} tokens", strategy, data.token_count);
            println!("Content:\n{}\n", data.content);
        }
    }

    // Detect language from a file path
    assert_eq!(detect_language(&PathBuf::from("app.py")), Some(Language::Python));
    assert_eq!(detect_language(&PathBuf::from("main.go")), Some(Language::Go));
    assert_eq!(detect_language(&PathBuf::from("index.ts")), Some(Language::TypeScript));

    Ok(())
}
```

## run_scheduler Function

Phase 3 optimizer that selects the best output strategy for each file to fit within the token budget while protecting core files from degradation.

```rust
use ast_doc_core::{
    AstDocConfig, OutputStrategy,
    parser::ParsedFile,
    scheduler::run_scheduler,
};
use std::path::PathBuf;

fn main() -> Result<(), ast_doc_core::AstDocError> {
    // Assume parsed_files comes from the parser phase
    let parsed_files: Vec<ParsedFile> = vec![/* ... */];

    let config = AstDocConfig {
        path: PathBuf::from("."),
        output: None,
        max_tokens: 50_000,
        core_patterns: vec!["src/lib.rs".to_string()], // Never degrade lib.rs
        default_strategy: OutputStrategy::Full,
        include_patterns: vec![],
        exclude_patterns: vec![],
        no_git: true,
        no_tree: true,
        copy: false,
        verbose: false,
    };

    // Base overhead from directory tree and git context (computed during ingestion)
    let base_overhead_tokens = 500;

    let schedule = run_scheduler(&parsed_files, &config, base_overhead_tokens)?;

    // Access scheduling results
    println!("Total tokens: {} (raw: {})", schedule.total_tokens, schedule.raw_tokens);
    println!("Strategy breakdown:");
    for (strategy, count) in &schedule.strategy_counts {
        println!("  {:?}: {} files", strategy, count);
    }

    // Access individual file assignments
    for file in &schedule.files {
        println!("{}: {:?} ({} tokens, saved {})",
            file.parsed.path.display(),
            file.strategy,
            file.rendered_tokens,
            file.saved_tokens
        );
    }

    Ok(())
}
```

## render_llms_txt Function

Phase 4 renderer that assembles the final markdown output from the scheduled files, directory tree, and git context, with anti-bloat rules applied.

```rust
use ast_doc_core::{
    AstDocConfig, OutputStrategy,
    ingestion::IngestionResult,
    scheduler::ScheduleResult,
    renderer::render_llms_txt,
};
use std::path::PathBuf;

// `scheduled` and `ingestion` come from the scheduler and ingestion phases
fn render(scheduled: &ScheduleResult, ingestion: &IngestionResult) -> Result<(), ast_doc_core::AstDocError> {
    let config = AstDocConfig {
        path: PathBuf::from("/home/user/my-project"),
        output: Some(PathBuf::from("llms.txt")),
        max_tokens: 100_000,
        core_patterns: vec![],
        default_strategy: OutputStrategy::Full,
        include_patterns: vec![],
        exclude_patterns: vec![],
        no_git: false,
        no_tree: false,
        copy: false,
        verbose: false,
    };

    // Render the final output
    let output = render_llms_txt(scheduled, ingestion, &config)?;

    // Output format example:
    // # Repository: my-project
    //
    // > Note: This codebase has been optimized using AST trimming to fit token limits.
    // > Files are presented in Full, NoTests, Summary modes.
    //
    // ## Structure & Symbol Index
    //
    // ### Directory Tree
    // my-project
    // └── src
    //     ├── main.rs [Full]
    //     └── lib.rs [Full]
    //
    // ### Git Context
    // - **Branch**: main
    // - **Latest Commit**: abc123 feat: add feature
    //
    // ---
    //
    // ## Source Files
    //
    // ### File: src/main.rs
    // *Strategy: Full | Tokens: 150 (Saved: 0)*
    // ```rust
    // fn main() { ... }
    // ```

    println!("{}", output);
    Ok(())
}
```

## AstDocError Enum

Error types covering all pipeline stages: budget exceeded, unsupported language, file read errors, parse errors, and git/glob failures.

```rust
use ast_doc_core::AstDocError;

fn handle_error(err: AstDocError) {
    match err {
        AstDocError::BudgetExceeded { message } => {
            eprintln!("Token budget exceeded: {}", message);
            // Consider increasing --max-tokens or using --strategy summary
        }
        AstDocError::UnsupportedLanguage { language } => {
            eprintln!("Language not supported: {}", language);
            // Supported: Rust, Python, TypeScript, Go, C
        }
        AstDocError::FileRead { path, source } => {
            eprintln!("Failed to read {}: {}", path.display(), source);
        }
        AstDocError::Parse { path, message } => {
            eprintln!("Parse error in {}: {}", path.display(), message);
        }
        AstDocError::Git(e) => {
            eprintln!("Git error: {}", e);
            // Use --no-git to skip git context
        }
        AstDocError::InvalidGlob(e) => {
            eprintln!("Invalid glob pattern: {}", e);
        }
        AstDocError::Io(e) => {
            eprintln!("I/O error: {}", e);
        }
        AstDocError::Json(e) => {
            eprintln!("JSON error: {}", e);
        }
    }
}

// Example: handling a budget-exceeded error gracefully
fn generate_with_fallback(config: &mut ast_doc_core::AstDocConfig) -> eyre::Result<String> {
    match ast_doc_core::run_pipeline(config) {
        Ok(result) => Ok(result.output),
        Err(err) => {
            if let Some(AstDocError::BudgetExceeded { .. }) = err.downcast_ref() {
                // Retry with the summary strategy
                config.default_strategy = ast_doc_core::OutputStrategy::Summary;
                let result = ast_doc_core::run_pipeline(config)?;
                Ok(result.output)
            } else {
                Err(err)
            }
        }
    }
}
```

## Language Detection

Utility function that detects the programming language from a file's extension, used during ingestion to determine which tree-sitter parser to use.

```rust
use ast_doc_core::parser::{detect_language, Language};
use std::path::Path;

// Rust files
assert_eq!(detect_language(Path::new("src/main.rs")), Some(Language::Rust));
assert_eq!(detect_language(Path::new("lib.rs")), Some(Language::Rust));

// Python files
assert_eq!(detect_language(Path::new("app.py")), Some(Language::Python));
assert_eq!(detect_language(Path::new("scripts/run.py")), Some(Language::Python));

// TypeScript/JavaScript files
assert_eq!(detect_language(Path::new("index.ts")), Some(Language::TypeScript));
assert_eq!(detect_language(Path::new("App.tsx")), Some(Language::TypeScript));
assert_eq!(detect_language(Path::new("script.js")), Some(Language::TypeScript));
assert_eq!(detect_language(Path::new("Component.jsx")), Some(Language::TypeScript));

// Go files
assert_eq!(detect_language(Path::new("main.go")), Some(Language::Go));

// C files
assert_eq!(detect_language(Path::new("main.c")), Some(Language::C));
assert_eq!(detect_language(Path::new("header.h")), Some(Language::C));

// Unsupported files return None
assert_eq!(detect_language(Path::new("README.md")), None);
assert_eq!(detect_language(Path::new("config.json")), None);
assert_eq!(detect_language(Path::new("Makefile")), None);
```

## Cargo Feature Flags

The ast-doc-core library uses feature flags to control which language parsers are compiled, letting you minimize binary size by including only the languages you need.
```toml
# Cargo.toml - include only Rust support (default)
[dependencies]
ast-doc-core = "0.1.0"

# Include all language support
[dependencies]
ast-doc-core = { version = "0.1.0", features = ["all-languages"] }

# Include specific languages
[dependencies]
ast-doc-core = { version = "0.1.0", features = ["lang-rust", "lang-python"] }

# Available features:
# - lang-rust (default): Rust (.rs) support
# - lang-python: Python (.py) support
# - lang-typescript: TypeScript/JavaScript (.ts, .tsx, .js, .jsx) support
# - lang-go: Go (.go) support
# - lang-c: C (.c, .h) support
# - all-languages: Enable all language parsers
```

## Summary

ast-doc is designed for developers and AI-assisted development workflows that need to feed code context to large language models efficiently. Primary use cases include generating optimized documentation for AI code assistants, creating architectural overviews of large codebases, preparing code context for code review or refactoring discussions, and reducing token costs when using paid LLM APIs. The tool balances comprehensiveness with token efficiency through its degradation algorithm.

Integration patterns typically involve either direct CLI usage in CI/CD pipelines and shell scripts, or programmatic usage via the Rust library API for custom tooling. The four-stage pipeline design allows for extension at each phase, whether adding support for new languages, implementing custom scheduling strategies, or modifying the output format. The tool fits standard Cargo workflows and can be installed globally via `cargo install ast-doc` for use across projects.
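The degradation algorithm referenced above can be pictured as a greedy loop over the pre-computed per-strategy token counts from the parser phase. The sketch below is a simplified illustration under stated assumptions, not ast-doc's actual scheduler: the `Strategy`, `FilePlan`, `fit_budget` names and the token numbers are hypothetical, and the real scheduler may pick candidates differently.

```rust
// Simplified, hypothetical sketch of greedy strategy degradation:
// repeatedly degrade the non-core file whose next downgrade saves the
// most tokens, until the total fits the budget or nothing can degrade.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strategy {
    Full,
    NoTests,
    Summary,
}

impl Strategy {
    // Mirrors the degradation order Full -> NoTests -> Summary.
    fn degrade(self) -> Option<Strategy> {
        match self {
            Strategy::Full => Some(Strategy::NoTests),
            Strategy::NoTests => Some(Strategy::Summary),
            Strategy::Summary => None,
        }
    }
}

struct FilePlan {
    core: bool,
    strategy: Strategy,
    // Pre-computed token counts per strategy: [Full, NoTests, Summary].
    tokens: [usize; 3],
}

impl FilePlan {
    fn current_tokens(&self) -> usize {
        self.tokens[self.strategy as usize]
    }
}

/// Degrade non-core files until the total fits `budget`.
/// Returns true if the plan fits, false if the budget is unreachable.
fn fit_budget(files: &mut [FilePlan], budget: usize) -> bool {
    loop {
        let total: usize = files.iter().map(|f| f.current_tokens()).sum();
        if total <= budget {
            return true;
        }
        // Choose the non-core file whose next degradation saves the most.
        let candidate = files
            .iter_mut()
            .filter(|f| !f.core)
            .filter_map(|f| {
                let next = f.strategy.degrade()?;
                let saved = f.current_tokens().saturating_sub(f.tokens[next as usize]);
                Some((saved, f))
            })
            .max_by_key(|(saved, _)| *saved);
        match candidate {
            Some((saved, f)) if saved > 0 => {
                f.strategy = f.strategy.degrade().unwrap();
            }
            _ => return false, // nothing left to degrade
        }
    }
}

fn main() {
    let mut files = vec![
        FilePlan { core: true,  strategy: Strategy::Full, tokens: [4000, 3000, 800] },
        FilePlan { core: false, strategy: Strategy::Full, tokens: [6000, 4500, 900] },
        FilePlan { core: false, strategy: Strategy::Full, tokens: [2000, 1500, 400] },
    ];
    let fits = fit_budget(&mut files, 8_000);
    println!("fits budget: {fits}");
    for f in &files {
        println!("{:?} (core: {}) -> {} tokens", f.strategy, f.core, f.current_tokens());
    }
}
```

Note how the core file keeps `Full` even when it is the largest remaining contributor; only non-core files are eligible for degradation, which is the behavior `--core` patterns request.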