# ast-doc

ast-doc is a Rust CLI tool that generates optimized `llms.txt` documentation files from codebases using AST-based semantic parsing. It employs a four-stage pipeline (Ingestion, Parser, Scheduler, Renderer) to process source files, extract meaningful code structures via tree-sitter, and fit the output within configurable token budgets while preserving the most important content.

The tool supports multiple programming languages: Rust, Python, TypeScript/JavaScript, Go, and C. It offers three output strategies (Full, NoTests, Summary) that can be applied automatically based on token constraints, with the ability to protect critical "core" files from degradation. Additional features include git context capture, directory tree generation with language annotations, and anti-bloat rules that compress blank lines and trim whitespace.

## CLI Usage

The `ast-doc` command-line interface provides options for generating optimized documentation from any codebase, with configurable token budgets and output strategies.

```bash
# Basic usage - generate llms.txt to stdout from current directory
ast-doc .

# Generate with output file and token budget
ast-doc /path/to/project --output llms.txt --max-tokens 64000

# Use summary strategy for smaller output (signatures only)
ast-doc . --strategy summary --output summary.md

# Protect core files from degradation while using summary for others
ast-doc . --core "src/main.rs" --core "src/lib.rs" --strategy summary --max-tokens 50000

# Filter files with include/exclude patterns
ast-doc . --include "*.rs" --exclude "*test*" --output rust-only.txt

# Skip git context and directory tree for pure code output
ast-doc . --no-git --no-tree --output pure-code.txt

# Enable verbose logging for debugging
ast-doc . --verbose --output debug.txt
```

## run_pipeline Function

The main entry point for the ast-doc pipeline. It orchestrates all four stages: file discovery, AST parsing, token budget optimization, and markdown rendering. Returns both the rendered output and scheduling metadata.

```rust
use ast_doc_core::{AstDocConfig, OutputStrategy, run_pipeline};
use std::path::PathBuf;

fn main() -> eyre::Result<()> {
    let config = AstDocConfig {
        path: PathBuf::from("."),
        output: Some(PathBuf::from("llms.txt")),
        max_tokens: 128_000,
        core_patterns: vec!["src/lib.rs".to_string()],
        default_strategy: OutputStrategy::Full,
        include_patterns: vec!["*.rs".to_string()],
        exclude_patterns: vec!["*test*".to_string()],
        no_git: false,
        no_tree: false,
        copy: false,
        verbose: false,
    };

    let result = run_pipeline(&config)?;

    // Access the rendered output
    println!("Generated {} bytes", result.output.len());

    // Access scheduling metadata
    println!("Total tokens: {}", result.schedule.total_tokens);
    println!("Raw tokens (before optimization): {}", result.schedule.raw_tokens);
    println!("Files processed: {}", result.schedule.files.len());

    // Write to file
    std::fs::write(&config.output.unwrap(), &result.output)?;

    Ok(())
}
```

## AstDocConfig Structure

The configuration structure that controls all aspects of the ast-doc pipeline, including file discovery, token budgets, output strategies, and filtering patterns.
```rust
use ast_doc_core::{AstDocConfig, OutputStrategy};
use std::path::PathBuf;

// Minimal configuration for quick documentation
let minimal_config = AstDocConfig {
    path: PathBuf::from("/path/to/project"),
    output: None, // stdout
    max_tokens: 128_000,
    core_patterns: vec![],
    default_strategy: OutputStrategy::Full,
    include_patterns: vec![],
    exclude_patterns: vec![],
    no_git: false,
    no_tree: false,
    copy: false,
    verbose: false,
};

// Full configuration with all options
let full_config = AstDocConfig {
    path: PathBuf::from("/path/to/project"),
    output: Some(PathBuf::from("docs/llms.txt")),
    max_tokens: 50_000, // Strict token budget
    core_patterns: vec![
        "src/lib.rs".to_string(),
        "src/main.rs".to_string(),
        "src/core/**".to_string(), // Glob patterns supported
    ],
    default_strategy: OutputStrategy::NoTests, // Strip tests by default
    include_patterns: vec!["*.rs".to_string(), "*.py".to_string()],
    exclude_patterns: vec!["*_test.rs".to_string(), "benches/**".to_string()],
    no_git: false,  // Include git context
    no_tree: false, // Include directory tree
    copy: false,    // Don't copy to clipboard
    verbose: true,  // Enable debug logging
};
```

## OutputStrategy Enum

Defines the three levels of code extraction, which control how much detail is preserved in the output, from full source code down to signatures only.
```rust
use ast_doc_core::OutputStrategy;

// Full: include all source code verbatim (default)
let full = OutputStrategy::Full;

// NoTests: strip test modules and test functions
let no_tests = OutputStrategy::NoTests;

// Summary: extract signatures only, omit implementations
let summary = OutputStrategy::Summary;

// Strategies can be degraded in order: Full -> NoTests -> Summary
assert_eq!(OutputStrategy::Full.degrade(), Some(OutputStrategy::NoTests));
assert_eq!(OutputStrategy::NoTests.degrade(), Some(OutputStrategy::Summary));
assert_eq!(OutputStrategy::Summary.degrade(), None); // Already at minimum

// Strategies are ordered for comparison
assert!(OutputStrategy::Full < OutputStrategy::NoTests);
assert!(OutputStrategy::NoTests < OutputStrategy::Summary);
```

## run_ingestion Function

Phase 1 of the pipeline: discovers source files, reads their contents, detects languages, captures git metadata, and generates the directory tree structure.

```rust
use ast_doc_core::{AstDocConfig, OutputStrategy, ingestion::run_ingestion};
use std::path::PathBuf;

fn main() -> Result<(), ast_doc_core::AstDocError> {
    let config = AstDocConfig {
        path: PathBuf::from("."),
        output: None,
        max_tokens: 100_000,
        core_patterns: vec![],
        default_strategy: OutputStrategy::Full,
        include_patterns: vec!["*.rs".to_string()],
        exclude_patterns: vec!["target/**".to_string()],
        no_git: false,
        no_tree: false,
        copy: false,
        verbose: false,
    };

    let result = run_ingestion(&config)?;

    // Access discovered files
    for file in &result.files {
        println!("Found: {} ({:?}) - {} tokens",
            file.path.display(),
            file.language,
            file.raw_token_count
        );
    }

    // Access the directory tree
    println!("Directory tree:\n{}", result.directory_tree);

    // Access git context if available
    if let Some(git) = &result.git_context {
        println!("Branch: {}", git.branch);
        println!("Latest commit: {}", git.latest_commit);
        if let Some(diff) = &git.diff {
            println!("Uncommitted changes:\n{}", diff);
        }
    }

    Ok(())
}
```

## parse_file Function

Phase 2 function that uses tree-sitter to parse a source file and pre-compute all three strategy variants (Full, NoTests, Summary) with their token counts.

```rust
use ast_doc_core::{
    OutputStrategy,
    parser::{parse_file, detect_language, Language},
    ingestion::DiscoveredFile,
};
use std::path::PathBuf;

fn main() -> Result<(), ast_doc_core::AstDocError> {
    // Create a discovered file (normally produced by the ingestion phase)
    let file = DiscoveredFile {
        path: PathBuf::from("src/lib.rs"),
        content: r#"
/// A simple greeting function.
pub fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_greet() {
        assert_eq!(greet("World"), "Hello, World!");
    }
}
"#.to_string(),
        language: Some(Language::Rust),
        raw_token_count: 50,
    };

    // Parse the file with tree-sitter
    let parsed = parse_file(&file, Language::Rust)?;

    // Access pre-computed strategy data
    for strategy in [OutputStrategy::Full, OutputStrategy::NoTests, OutputStrategy::Summary] {
        if let Some(data) = parsed.strategies_data.get(&strategy) {
            println!("{:?}: {} tokens", strategy, data.token_count);
            println!("Content:\n{}\n", data.content);
        }
    }

    // Detect language from a file path
    assert_eq!(detect_language(&PathBuf::from("app.py")), Some(Language::Python));
    assert_eq!(detect_language(&PathBuf::from("main.go")), Some(Language::Go));
    assert_eq!(detect_language(&PathBuf::from("index.ts")), Some(Language::TypeScript));

    Ok(())
}
```

## run_scheduler Function

Phase 3 optimizer that selects the best output strategy for each file to fit within the token budget while protecting core files from degradation.

```rust
use ast_doc_core::{
    AstDocConfig, OutputStrategy,
    parser::ParsedFile,
    scheduler::run_scheduler,
};
use std::path::PathBuf;

fn main() -> Result<(), ast_doc_core::AstDocError> {
    // Assume parsed_files comes from the parser phase
    let parsed_files: Vec<ParsedFile> = vec![/* ... */];

    let config = AstDocConfig {
        path: PathBuf::from("."),
        output: None,
        max_tokens: 50_000,
        core_patterns: vec!["src/lib.rs".to_string()], // Never degrade lib.rs
        default_strategy: OutputStrategy::Full,
        include_patterns: vec![],
        exclude_patterns: vec![],
        no_git: true,
        no_tree: true,
        copy: false,
        verbose: false,
    };

    // Base overhead from directory tree and git context (computed during ingestion)
    let base_overhead_tokens = 500;

    let schedule = run_scheduler(&parsed_files, &config, base_overhead_tokens)?;

    // Access scheduling results
    println!("Total tokens: {} (raw: {})", schedule.total_tokens, schedule.raw_tokens);
    println!("Strategy breakdown:");
    for (strategy, count) in &schedule.strategy_counts {
        println!("  {:?}: {} files", strategy, count);
    }

    // Access individual file assignments
    for file in &schedule.files {
        println!("{}: {:?} ({} tokens, saved {})",
            file.parsed.path.display(),
            file.strategy,
            file.rendered_tokens,
            file.saved_tokens
        );
    }

    Ok(())
}
```

## render_llms_txt Function

Phase 4 renderer that assembles the final markdown output from the scheduled files, directory tree, and git context, with anti-bloat rules applied.

```rust
use ast_doc_core::{
    AstDocConfig, OutputStrategy,
    ingestion::IngestionResult,
    scheduler::ScheduleResult,
    renderer::render_llms_txt,
};
use std::path::PathBuf;

// `scheduled` and `ingestion` come from the scheduler and ingestion phases
fn render(scheduled: &ScheduleResult, ingestion: &IngestionResult) -> Result<(), ast_doc_core::AstDocError> {
    let config = AstDocConfig {
        path: PathBuf::from("/home/user/my-project"),
        output: Some(PathBuf::from("llms.txt")),
        max_tokens: 100_000,
        core_patterns: vec![],
        default_strategy: OutputStrategy::Full,
        include_patterns: vec![],
        exclude_patterns: vec![],
        no_git: false,
        no_tree: false,
        copy: false,
        verbose: false,
    };

    // Render the final output
    let output = render_llms_txt(scheduled, ingestion, &config)?;

    // Output format example:
    // # Repository: my-project
    //
    // > Note: This codebase has been optimized using AST trimming to fit token limits.
    // > Files are presented in Full, NoTests, Summary modes.
    //
    // ## Structure & Symbol Index
    //
    // ### Directory Tree
    // my-project
    // └── src
    //     ├── main.rs [Full]
    //     └── lib.rs [Full]
    //
    // ### Git Context
    // - **Branch**: main
    // - **Latest Commit**: abc123 feat: add feature
    //
    // ---
    //
    // ## Source Files
    //
    // ### File: src/main.rs
    // *Strategy: Full | Tokens: 150 (Saved: 0)*
    // ```rust
    // fn main() { ... }
    // ```

    println!("{}", output);
    Ok(())
}
```

## AstDocError Enum

Error types covering all pipeline stages: budget exceeded, unsupported language, file read errors, parse errors, and git/glob failures.

```rust
use ast_doc_core::AstDocError;

fn handle_error(err: AstDocError) {
    match err {
        AstDocError::BudgetExceeded { message } => {
            eprintln!("Token budget exceeded: {}", message);
            // Consider increasing --max-tokens or using --strategy summary
        }
        AstDocError::UnsupportedLanguage { language } => {
            eprintln!("Language not supported: {}", language);
            // Supported: Rust, Python, TypeScript, Go, C
        }
        AstDocError::FileRead { path, source } => {
            eprintln!("Failed to read {}: {}", path.display(), source);
        }
        AstDocError::Parse { path, message } => {
            eprintln!("Parse error in {}: {}", path.display(), message);
        }
        AstDocError::Git(e) => {
            eprintln!("Git error: {}", e);
            // Use --no-git to skip git context
        }
        AstDocError::InvalidGlob(e) => {
            eprintln!("Invalid glob pattern: {}", e);
        }
        AstDocError::Io(e) => {
            eprintln!("I/O error: {}", e);
        }
        AstDocError::Json(e) => {
            eprintln!("JSON error: {}", e);
        }
    }
}

// Example: handling a budget-exceeded error gracefully
fn generate_with_fallback(config: &mut ast_doc_core::AstDocConfig) -> eyre::Result<String> {
    match ast_doc_core::run_pipeline(config) {
        Ok(result) => Ok(result.output),
        Err(err) => {
            if let Some(AstDocError::BudgetExceeded { .. }) = err.downcast_ref() {
                // Retry with the summary strategy
                config.default_strategy = ast_doc_core::OutputStrategy::Summary;
                let result = ast_doc_core::run_pipeline(config)?;
                Ok(result.output)
            } else {
                Err(err)
            }
        }
    }
}
```

## Language Detection

Utility function that detects the programming language from a file's extension, used during ingestion to determine which tree-sitter parser to use.

```rust
use ast_doc_core::parser::{detect_language, Language};
use std::path::Path;

// Rust files
assert_eq!(detect_language(Path::new("src/main.rs")), Some(Language::Rust));
assert_eq!(detect_language(Path::new("lib.rs")), Some(Language::Rust));

// Python files
assert_eq!(detect_language(Path::new("app.py")), Some(Language::Python));
assert_eq!(detect_language(Path::new("scripts/run.py")), Some(Language::Python));

// TypeScript/JavaScript files
assert_eq!(detect_language(Path::new("index.ts")), Some(Language::TypeScript));
assert_eq!(detect_language(Path::new("App.tsx")), Some(Language::TypeScript));
assert_eq!(detect_language(Path::new("script.js")), Some(Language::TypeScript));
assert_eq!(detect_language(Path::new("Component.jsx")), Some(Language::TypeScript));

// Go files
assert_eq!(detect_language(Path::new("main.go")), Some(Language::Go));

// C files
assert_eq!(detect_language(Path::new("main.c")), Some(Language::C));
assert_eq!(detect_language(Path::new("header.h")), Some(Language::C));

// Unsupported files return None
assert_eq!(detect_language(Path::new("README.md")), None);
assert_eq!(detect_language(Path::new("config.json")), None);
assert_eq!(detect_language(Path::new("Makefile")), None);
```

## Cargo Feature Flags

The ast-doc-core library uses feature flags to control which language parsers are compiled, letting you minimize binary size by including only the languages you need.
```toml
# Cargo.toml - include only Rust support (default)
[dependencies]
ast-doc-core = "0.1.0"

# Include all language support
[dependencies]
ast-doc-core = { version = "0.1.0", features = ["all-languages"] }

# Include specific languages
[dependencies]
ast-doc-core = { version = "0.1.0", features = ["lang-rust", "lang-python"] }

# Available features:
# - lang-rust (default): Rust (.rs) support
# - lang-python: Python (.py) support
# - lang-typescript: TypeScript/JavaScript (.ts, .tsx, .js, .jsx) support
# - lang-go: Go (.go) support
# - lang-c: C (.c, .h) support
# - all-languages: Enable all language parsers
```

## Summary

ast-doc is designed for developers and AI-assisted development workflows that need to feed code context to large language models efficiently. Primary use cases include generating optimized documentation for AI code assistants, creating architectural overviews of large codebases, preparing code context for code review or refactoring discussions, and reducing token costs when using paid LLM APIs. The tool balances comprehensiveness with token efficiency through its degradation algorithm.

Integration patterns typically involve either direct CLI usage in CI/CD pipelines and shell scripts, or programmatic usage via the Rust library API for custom tooling. The four-stage pipeline design allows for extension at each phase, whether adding support for new languages, implementing custom scheduling strategies, or modifying the output format. The tool fits standard Cargo workflows and can be installed globally via `cargo install ast-doc` for use across projects.
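The degradation algorithm referenced above can be pictured as a greedy loop over the pre-computed per-strategy token counts from the parser phase. The sketch below is a simplified illustration under stated assumptions, not ast-doc's actual scheduler: the `Strategy`, `FilePlan`, `fit_budget` names and the token numbers are hypothetical, and the real scheduler may pick candidates differently.

```rust
// Simplified, hypothetical sketch of greedy strategy degradation:
// repeatedly degrade the non-core file whose next downgrade saves the
// most tokens, until the total fits the budget or nothing can degrade.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strategy {
    Full,
    NoTests,
    Summary,
}

impl Strategy {
    // Mirrors the degradation order Full -> NoTests -> Summary.
    fn degrade(self) -> Option<Strategy> {
        match self {
            Strategy::Full => Some(Strategy::NoTests),
            Strategy::NoTests => Some(Strategy::Summary),
            Strategy::Summary => None,
        }
    }
}

struct FilePlan {
    core: bool,
    strategy: Strategy,
    // Pre-computed token counts per strategy: [Full, NoTests, Summary].
    tokens: [usize; 3],
}

impl FilePlan {
    fn current_tokens(&self) -> usize {
        self.tokens[self.strategy as usize]
    }
}

/// Degrade non-core files until the total fits `budget`.
/// Returns true if the plan fits, false if the budget is unreachable.
fn fit_budget(files: &mut [FilePlan], budget: usize) -> bool {
    loop {
        let total: usize = files.iter().map(|f| f.current_tokens()).sum();
        if total <= budget {
            return true;
        }
        // Choose the non-core file whose next degradation saves the most.
        let candidate = files
            .iter_mut()
            .filter(|f| !f.core)
            .filter_map(|f| {
                let next = f.strategy.degrade()?;
                let saved = f.current_tokens().saturating_sub(f.tokens[next as usize]);
                Some((saved, f))
            })
            .max_by_key(|(saved, _)| *saved);
        match candidate {
            Some((saved, f)) if saved > 0 => {
                f.strategy = f.strategy.degrade().unwrap();
            }
            _ => return false, // nothing left to degrade
        }
    }
}

fn main() {
    let mut files = vec![
        FilePlan { core: true,  strategy: Strategy::Full, tokens: [4000, 3000, 800] },
        FilePlan { core: false, strategy: Strategy::Full, tokens: [6000, 4500, 900] },
        FilePlan { core: false, strategy: Strategy::Full, tokens: [2000, 1500, 400] },
    ];
    let fits = fit_budget(&mut files, 8_000);
    println!("fits budget: {fits}");
    for f in &files {
        println!("{:?} (core: {}) -> {} tokens", f.strategy, f.core, f.current_tokens());
    }
}
```

Note how the core file keeps `Full` even when it is the largest remaining contributor; only non-core files are eligible for degradation, which is the behavior `--core` patterns request.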