# llm-utils
llm-utils is a Rust workspace providing utilities for processing text and content for large language model (LLM) applications. The workspace contains two primary crates: `llm-text` for text cleaning, HTML extraction, URL parsing, and intelligent text splitting, and `opml-protocol` for parsing and generating OPML 2.0 (Outline Processor Markup Language) files used in RSS feed management.
The library focuses on preparing raw content for LLM consumption by normalizing whitespace, removing citations, extracting readable content from HTML, and splitting text into semantically meaningful chunks. The OPML crate provides robust XML parsing with serde integration, builder patterns for document construction, and utilities for extracting RSS feeds from hierarchical folder structures.
## llm-text Crate
### TextCleaner - Text Normalization and Cleaning
The `TextCleaner` provides configurable single-pass text cleaning with whitespace normalization, newline handling, citation removal, and control character stripping. It uses a builder pattern for flexible configuration.
```rust
use llm_text::text::TextCleaner;
// Basic whitespace normalization - collapse multiple spaces and newlines
let messy_text = "Hello world!\n\n\n\nMultiple spaces\tand\ttabs.";
let cleaned = TextCleaner::new()
.reduce_newlines_to_double_newline()
.run(messy_text);
assert_eq!(cleaned, "Hello world!\n\nMultiple spaces and tabs.");
// Remove academic citations like [1], [2, 3], [4-6]
let academic_text = "Studies show this [1, 2] and also [3-5] plus [6].";
let cleaned = TextCleaner::new()
.remove_citations()
.run(academic_text);
assert_eq!(cleaned, "Studies show this and also plus.");
// Collapse all newlines to single spaces for single-line output
let multiline = "First paragraph.\n\nSecond paragraph.\n\nThird.";
let single_line = TextCleaner::new()
.reduce_newlines_to_single_space()
.run(multiline);
assert_eq!(single_line, "First paragraph. Second paragraph. Third.");
// Preserve structure with single newlines
let structured = "Line one.\n\n\nLine two.\n\nLine three.";
let preserved = TextCleaner::new()
.reduce_newlines_to_single_newline()
.run(structured);
assert_eq!(preserved, "Line one.\nLine two.\nLine three.");
// Remove control characters while preserving multilingual text
let mixed = "Hello\x00World\x01 世界 Привет";
let clean = TextCleaner::new()
.remove_non_basic_ascii()
.run(mixed);
assert_eq!(clean, "HelloWorld 世界 Привет");
```
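As a rough illustration of the newline-reduction step, here is a std-only sketch that collapses runs of three or more newlines down to exactly two. This is an assumption about the behavior, not the crate's actual implementation; `TextCleaner` performs this together with its other cleaning steps in a single configurable pass.

```rust
// Simplified sketch of reduce_newlines_to_double_newline():
// collapse any run of 3+ newlines down to exactly two.
fn reduce_newlines(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut run = 0usize; // length of the current newline run
    for ch in input.chars() {
        if ch == '\n' {
            run += 1;
            // newlines beyond the second in a run are dropped
            if run <= 2 {
                out.push('\n');
            }
        } else {
            run = 0;
            out.push(ch);
        }
    }
    out
}

fn main() {
    let cleaned = reduce_newlines("A\n\n\n\nB\nC");
    assert_eq!(cleaned, "A\n\nB\nC");
    println!("{:?}", cleaned);
}
```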
### clean_html - HTML to Structured Text Extraction
Extracts readable content from HTML using the Readability algorithm and converts it to clean, structured text preserving headings, lists, and formatting.
```rust
use llm_text::html::clean_html;
let html = r#"
<html><body>
  <nav>Article</nav>
  <article>
    <h1>Main Title</h1>
    <p>This is the main content of the article.</p>
    <ul>
      <li>First point</li>
      <li>Second point</li>
    </ul>
    <p>Conclusion paragraph.</p>
  </article>
  <footer>Site footer</footer>
</body></html>
"#;
let text = clean_html(html).expect("Failed to parse HTML");
// Result preserves structure: headings, paragraphs, and lists
// Navigation and footer boilerplate are removed
println!("{}", text);
// Output:
// Main Title
//
// This is the main content of the article.
//
// * First point
// * Second point
//
// Conclusion paragraph.
```
### extract_urls - URL Extraction from Text
Extracts and deduplicates all URLs from text content, returning parsed `Url` objects.
```rust
use llm_text::links::extract_urls;
let text = "Check out https://example.com and also visit
https://rust-lang.org/learn. The first link https://example.com
appears twice but will be deduplicated.";
let urls = extract_urls(text);
assert_eq!(urls.len(), 2); // Deduplicated
for url in &urls {
println!("Found URL: {} (host: {:?})", url, url.host_str());
}
// Output:
// Found URL: https://example.com/ (host: Some("example.com"))
// Found URL: https://rust-lang.org/learn (host: Some("rust-lang.org"))
```
### TextSplitter - Intelligent Text Chunking
Splits text into semantically meaningful chunks using configurable separators with optional recursive splitting and token-based size limits.
```rust
use llm_text::splitting::{TextSplitter, CharRatioTokenizer, Tokenizer};
use std::sync::Arc;
let document = "Introduction to Rust.

Rust is a systems programming language focused on safety and performance.

Memory Safety

Rust prevents null pointer dereferences and data races at compile time.

Concurrency

Rust makes concurrent programming safer with its ownership system.";
// Basic splitting on double newlines (paragraphs)
let splitter = TextSplitter::new()
.on_two_plus_newline()
.recursive(true);
if let Some(splits) = splitter.split_text(document) {
println!("Split into {} chunks:", splits.len());
for (i, split) in splits.iter().enumerate() {
println!("Chunk {}: {:?}", i, split.text());
}
}
// Output:
// Split into 4 chunks:
// Chunk 0: "Introduction to Rust."
// Chunk 1: "Rust is a systems programming language..."
// Chunk 2: "Memory Safety\n\nRust prevents null pointer..."
// Chunk 3: "Concurrency\n\nRust makes concurrent programming..."
// Token-aware splitting with maximum chunk size
let tokenizer = Arc::new(CharRatioTokenizer::new().with_ratio(4.0));
let splitter = TextSplitter::new()
.on_two_plus_newline()
.recursive(true)
.max_token_size(50) // Max 50 tokens per chunk
.with_tokenizer(tokenizer);
if let Some(splits) = splitter.split_text(document) {
for split in &splits {
println!("Chunk ({} tokens): {}",
split.token_count.unwrap_or(0),
split.text().chars().take(50).collect::<String>());
}
}
```
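The char-ratio heuristic behind `CharRatioTokenizer` can be sketched in plain Rust: estimate the token count as the character count divided by an average chars-per-token ratio (4.0 is a common rule of thumb for English). This is a sketch of the idea only, not the crate's exact code.

```rust
// Simplified sketch of a char-ratio token estimate:
// tokens ≈ ceil(char_count / chars_per_token).
fn estimate_tokens(text: &str, chars_per_token: f64) -> u32 {
    (text.chars().count() as f64 / chars_per_token).ceil() as u32
}

fn main() {
    let text = "Rust is a systems programming language.";
    // 39 chars / 4.0 → ceil(9.75) = 10
    println!("{}", estimate_tokens(text, 4.0));
}
```

Such an estimate is cheap and tokenizer-free, which makes it a reasonable default when an exact BPE count is not required.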
### Separator - Text Splitting Strategies
Different separator types for various granularity levels, from paragraphs down to individual graphemes.
```rust
use llm_text::splitting::{Separator, TextSplitter};
let text = "First sentence. Second sentence. Third sentence.";
// Split on sentences using Unicode rules
let splitter = TextSplitter::new()
.on_sentences_unicode()
.recursive(false);
if let Some(splits) = splitter.split_text(text) {
for split in &splits {
println!("Sentence: {:?}", split.text());
}
}
// Output:
// Sentence: "First sentence."
// Sentence: "Second sentence."
// Sentence: "Third sentence."
// Split on words
let splitter = TextSplitter::new()
.on_words_unicode()
.recursive(false);
if let Some(splits) = splitter.split_text(text) {
let words: Vec<&str> = splits.iter().map(|s| s.text()).collect();
println!("Words: {:?}", words);
}
// Output: Words: ["First", "sentence", "Second", "sentence", "Third", "sentence"]
// Get byte ranges for text indices
let indices = Separator::SentencesUnicode.split_text_into_indices(text);
for range in indices {
println!("Range {:?}: {:?}", range.clone(), &text[range]);
}
```
### split_text_into_sentences - Sentence Boundary Detection
Unicode-aware sentence splitting for precise text segmentation.
```rust
use llm_text::splitting::split_text_into_sentences;
let text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was a long trip!";
// With separators preserved (whitespace included)
let sentences = split_text_into_sentences(text, true);
for s in &sentences {
println!("With sep: {:?}", s);
}
// Without separators (trimmed)
let sentences = split_text_into_sentences(text, false);
for s in &sentences {
println!("Trimmed: {:?}", s);
}
// Output:
// Trimmed: "Dr. Smith went to Washington."
// Trimmed: "He arrived at 3 p.m."
// Trimmed: "It was a long trip!"
```
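To see why Unicode-aware boundary rules matter, compare against a naive split on `". "`, which wrongly treats abbreviations like "Dr." as sentence ends. This std-only demonstration is for contrast only; `split_text_into_sentences` uses proper Unicode sentence-boundary rules and keeps "Dr. Smith went to Washington." intact, as shown above.

```rust
// Naive sentence splitting on ". " -- breaks on abbreviations.
fn naive_sentences(text: &str) -> Vec<&str> {
    text.split(". ").collect()
}

fn main() {
    let text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was late.";
    // "Dr." and "p.m." are wrongly treated as sentence ends:
    println!("{:?}", naive_sentences(text));
    // ["Dr", "Smith went to Washington", "He arrived at 3 p.m", "It was late."]
}
```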
### Custom Tokenizer Implementation
Implement the `Tokenizer` trait to integrate with real tokenizers like tiktoken.
```rust
use llm_text::splitting::{Tokenizer, TextSplitter};
use std::sync::Arc;
// Example: wrapper for tiktoken (pseudo-code)
struct TiktokenWrapper {
// bpe: tiktoken::CoreBPE,
}
impl Tokenizer for TiktokenWrapper {
fn count_tokens(&self, text: &str) -> u32 {
// self.bpe.encode_ordinary(text).len() as u32
(text.len() / 4) as u32 // Simplified example
}
}
// Use with TextSplitter
let tokenizer: Arc<dyn Tokenizer> = Arc::new(TiktokenWrapper {});
let splitter = TextSplitter::new()
.with_tokenizer(tokenizer)
.max_token_size(100);
```
## opml-protocol Crate
### Opml::from_str - Parse OPML from XML String
Parse OPML 2.0 documents from XML strings with full support for nested outlines and custom attributes.
```rust
use opml_protocol::Opml;
use std::str::FromStr;
let xml = r#"<opml version="2.0">
  <head>
    <title>My Subscription List</title>
    <dateCreated>Mon, 01 Jan 2024 00:00:00 GMT</dateCreated>
    <ownerName>John Doe</ownerName>
  </head>
  <body>
    <outline text="Tech" type="folder">
      <outline text="Rust Blog" type="rss" xmlUrl="https://blog.rust-lang.org/feed.xml"/>
    </outline>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).expect("Failed to parse OPML");
println!("OPML Version: {}", opml.version);
println!("Title: {:?}", opml.head.title);
println!("Owner: {:?}", opml.head.owner_name);
println!("Number of top-level outlines: {}", opml.body.outlines.len());
for outline in &opml.body.outlines {
println!("Outline: {} (type: {:?})", outline.text, outline.type_);
if let Some(children) = &outline.outlines {
for child in children {
println!(" - {}: {:?}", child.text, child.xml_url);
}
}
}
```
### Opml::to_string - Serialize OPML to XML
Convert OPML documents back to XML string format with proper attribute serialization.
```rust
use opml_protocol::{Opml, Head, Body, Outline, OutlineBuilder};
let opml = Opml {
version: "2.0".to_string(),
head: Head {
title: Some("My Feeds".to_string()),
owner_name: Some("Jane Doe".to_string()),
..Default::default()
},
body: Body {
outlines: vec![
Outline {
text: "Rust Blog".to_string(),
type_: Some("rss".to_string()),
xml_url: Some("https://blog.rust-lang.org/feed.xml".to_string()),
html_url: Some("https://blog.rust-lang.org".to_string()),
..Default::default()
},
],
},
};
let xml = opml.to_string().expect("Failed to serialize OPML");
println!("{}", xml);
// Output (whitespace and attribute order may vary):
// <opml version="2.0">
//   <head><title>My Feeds</title><ownerName>Jane Doe</ownerName></head>
//   <body>
//     <outline text="Rust Blog" type="rss"
//              xmlUrl="https://blog.rust-lang.org/feed.xml"
//              htmlUrl="https://blog.rust-lang.org"/>
//   </body>
// </opml>
```
### OutlineBuilder - Programmatic Outline Construction
Build OPML outlines using the builder pattern with validation.
```rust
use opml_protocol::{OutlineBuilder, Opml, Body, Head};
// Build a simple RSS feed outline
let feed = OutlineBuilder::default()
.text("Hacker News".to_string())
.type_(Some("rss".to_string()))
.xml_url(Some("https://news.ycombinator.com/rss".to_string()))
.html_url(Some("https://news.ycombinator.com".to_string()))
.description(Some("Hacker News RSS Feed".to_string()))
.build()
.expect("Failed to build outline");
// Build a folder with nested feeds
let tech_folder = OutlineBuilder::default()
.text("Technology".to_string())
.type_(Some("folder".to_string()))
.outlines(Some(vec![
OutlineBuilder::default()
.text("Ars Technica".to_string())
.type_(Some("rss".to_string()))
.xml_url(Some("https://feeds.arstechnica.com/arstechnica/index".to_string()))
.build()
.unwrap(),
OutlineBuilder::default()
.text("Wired".to_string())
.type_(Some("rss".to_string()))
.xml_url(Some("https://www.wired.com/feed/rss".to_string()))
.build()
.unwrap(),
]))
.build()
.expect("Failed to build folder");
let opml = Opml {
version: "2.0".to_string(),
head: Head { title: Some("My Tech Feeds".to_string()), ..Default::default() },
body: Body { outlines: vec![tech_folder, feed] },
};
println!("{}", opml.to_string().unwrap());
```
### Opml::get_group_feeds - Extract RSS Feeds by Folder
Extract all RSS feeds from an OPML document, grouped by their folder hierarchy with full path support for nested folders.
```rust
use opml_protocol::Opml;
use std::str::FromStr;
let xml = r#"<opml version="2.0">
  <head><title>Feeds</title></head>
  <body>
    <outline text="Hacker News" type="rss" xmlUrl="https://hnrss.org/frontpage"/>
    <outline text="Tech" type="folder">
      <outline text="News" type="folder">
        <outline text="TechCrunch" type="rss" xmlUrl="https://techcrunch.com/feed/"/>
        <outline text="Ars Technica" type="rss" xmlUrl="https://feeds.arstechnica.com/arstechnica/index"/>
      </outline>
      <outline text="Blogs" type="folder">
        <outline text="Rust Blog" type="rss" xmlUrl="https://blog.rust-lang.org/feed.xml"/>
      </outline>
    </outline>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).unwrap();
let groups = opml.get_group_feeds().expect("Failed to extract feeds");
for group in &groups {
let folder_name = if group.group.is_empty() {
"(Root Level)"
} else {
&group.group
};
println!("Folder: {}", folder_name);
for feed in &group.feeds {
println!(" - {} -> {}",
feed.text.as_deref().unwrap_or("Untitled"),
feed.xml_url);
}
}
// Output:
// Folder: (Root Level)
// - Hacker News -> https://hnrss.org/frontpage
// Folder: Tech/News
// - TechCrunch -> https://techcrunch.com/feed/
// - Ars Technica -> https://feeds.arstechnica.com/arstechnica/index
// Folder: Tech/Blogs
// - Rust Blog -> https://blog.rust-lang.org/feed.xml
```
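The path-joining logic behind this grouping can be sketched independently of the crate: walk the nested folders recursively, joining folder names with `/` to build each feed's group path. The minimal `Node` type below is hypothetical; the real `Outline` carries many more fields.

```rust
// Simplified sketch of hierarchical feed extraction with folder paths.
// Hypothetical minimal type: Some(xml_url) marks a feed, None a folder.
struct Node {
    text: String,
    xml_url: Option<String>,
    children: Vec<Node>,
}

// Collect (folder_path, feed_url) pairs, joining folder names with "/".
fn collect_feeds(node: &Node, path: &str, out: &mut Vec<(String, String)>) {
    match &node.xml_url {
        Some(url) => out.push((path.to_string(), url.clone())),
        None => {
            let sub = if path.is_empty() {
                node.text.clone()
            } else {
                format!("{}/{}", path, node.text)
            };
            for child in &node.children {
                collect_feeds(child, &sub, out);
            }
        }
    }
}

fn main() {
    let root = Node {
        text: "Tech".into(),
        xml_url: None,
        children: vec![Node {
            text: "News".into(),
            xml_url: None,
            children: vec![Node {
                text: "TechCrunch".into(),
                xml_url: Some("https://techcrunch.com/feed/".into()),
                children: vec![],
            }],
        }],
    };
    let mut feeds = Vec::new();
    collect_feeds(&root, "", &mut feeds);
    println!("{:?}", feeds);
    // [("Tech/News", "https://techcrunch.com/feed/")]
}
```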
### Custom Attributes Handling
OPML outlines can contain arbitrary custom attributes which are preserved during parsing and serialization.
```rust
use opml_protocol::Opml;
use std::str::FromStr;
use std::collections::HashMap;
let xml = r#"<opml version="2.0">
  <head><title>Custom Attrs</title></head>
  <body>
    <outline text="My Feed" type="rss" xmlUrl="https://example.com/feed.xml"
             customPriority="high" refreshRate="60"/>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).unwrap();
let outline = &opml.body.outlines[0];
// Access custom attributes from the extra HashMap
// Note: attributes are stored with @ prefix
println!("Custom priority: {:?}", outline.extra.get("@customPriority"));
println!("Refresh rate: {:?}", outline.extra.get("@refreshRate"));
// Roundtrip preserves custom attributes
let xml_out = opml.to_string().unwrap();
let reparsed = Opml::from_str(&xml_out).unwrap();
assert_eq!(
reparsed.body.outlines[0].extra.get("@customPriority"),
Some(&"high".to_string())
);
```
### Error Handling
The `OpmlError` type provides detailed error information for parsing and serialization failures.
```rust
use opml_protocol::{Opml, errors::OpmlError};
use std::str::FromStr;
// Handle parsing errors
let invalid_xml = "<opml><head>not well-formed";
match Opml::from_str(invalid_xml) {
Ok(_) => println!("Parsed successfully"),
Err(OpmlError::XmlDeError(e)) => println!("XML parse error: {}", e),
Err(e) => println!("Other error: {}", e),
}
// Handle feed extraction errors (RSS without xmlUrl)
let xml = r#"<opml version="2.0">
  <head><title>Test</title></head>
  <body>
    <outline text="Broken Feed" type="rss"/>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).unwrap();
match opml.get_group_feeds() {
Ok(feeds) => println!("Got {} groups", feeds.len()),
Err(OpmlError::BadFeed(msg)) => println!("Invalid feed: {}", msg),
Err(e) => println!("Error: {}", e),
}
// Output: Invalid feed: RSS outline missing xmlUrl
```
## Summary
The llm-utils workspace provides essential text processing utilities for LLM applications. The `llm-text` crate handles the complete pipeline from raw HTML to clean, chunked text suitable for embedding or prompt construction: HTML content extraction with Readability, configurable text cleaning with whitespace normalization and citation removal, URL extraction, and intelligent text splitting with Unicode-aware sentence boundaries and token-based size limits.
The `opml-protocol` crate offers a complete OPML 2.0 implementation for RSS feed management applications, featuring bidirectional XML parsing and generation with serde, builder patterns for programmatic document construction, hierarchical feed extraction with folder path support, and preservation of custom attributes. Both crates are designed for high performance with single-pass processing, Arc-based text sharing for zero-copy splitting, and efficient memory usage patterns suitable for production workloads.