# llm-utils
llm-utils is a Rust workspace providing utilities for processing text and content for large language model (LLM) applications. The workspace contains two primary crates: `llm-text` for text cleaning, HTML extraction, URL parsing, and intelligent text splitting, and `opml-protocol` for parsing and generating OPML 2.0 (Outline Processor Markup Language) files used in RSS feed management.
The library focuses on preparing raw content for LLM consumption by normalizing whitespace, removing citations, extracting readable content from HTML, and splitting text into semantically meaningful chunks. The OPML crate provides robust XML parsing with serde integration, builder patterns for document construction, and utilities for extracting RSS feeds from hierarchical folder structures.
## llm-text Crate
### TextCleaner - Text Normalization and Cleaning
The `TextCleaner` provides configurable single-pass text cleaning with whitespace normalization, newline handling, citation removal, and control character stripping. It uses a builder pattern for flexible configuration.
```rust
use llm_text::text::TextCleaner;
// Basic whitespace normalization - collapse multiple spaces and newlines
let messy_text = "Hello world!\n\n\n\nMultiple spaces\tand\ttabs.";
let cleaned = TextCleaner::new()
.reduce_newlines_to_double_newline()
.run(messy_text);
assert_eq!(cleaned, "Hello world!\n\nMultiple spaces and tabs.");
// Remove academic citations like [1], [2, 3], [4-6]
let academic_text = "Studies show this [1, 2] and also [3-5] plus [6].";
let cleaned = TextCleaner::new()
.remove_citations()
.run(academic_text);
assert_eq!(cleaned, "Studies show this and also plus.");
// Collapse all newlines to single spaces for single-line output
let multiline = "First paragraph.\n\nSecond paragraph.\n\nThird.";
let single_line = TextCleaner::new()
.reduce_newlines_to_single_space()
.run(multiline);
assert_eq!(single_line, "First paragraph. Second paragraph. Third.");
// Preserve structure with single newlines
let structured = "Line one.\n\n\nLine two.\n\nLine three.";
let preserved = TextCleaner::new()
.reduce_newlines_to_single_newline()
.run(structured);
assert_eq!(preserved, "Line one.\nLine two.\nLine three.");
// Remove control characters while preserving multilingual text
let mixed = "Hello\x00World\x01 世界 Привет";
let clean = TextCleaner::new()
.remove_non_basic_ascii()
.run(mixed);
assert_eq!(clean, "HelloWorld 世界 Привет");
```
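As a rough illustration of the newline-reduction step, here is a std-only sketch that collapses runs of three or more newlines down to exactly two. This is an assumption about the behavior, not the crate's actual implementation; `TextCleaner` performs this together with its other cleaning steps in a single configurable pass.

```rust
// Simplified sketch of reduce_newlines_to_double_newline():
// collapse any run of 3+ newlines down to exactly two.
fn reduce_newlines(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut run = 0usize; // length of the current newline run
    for ch in input.chars() {
        if ch == '\n' {
            run += 1;
            // newlines beyond the second in a run are dropped
            if run <= 2 {
                out.push('\n');
            }
        } else {
            run = 0;
            out.push(ch);
        }
    }
    out
}

fn main() {
    let cleaned = reduce_newlines("A\n\n\n\nB\nC");
    assert_eq!(cleaned, "A\n\nB\nC");
    println!("{:?}", cleaned);
}
```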
### clean_html - HTML to Structured Text Extraction
Extracts readable content from HTML using the Readability algorithm and converts it to clean, structured text preserving headings, lists, and formatting.
```rust
use llm_text::html::clean_html;
let html = r#"
<html><body>
  <nav>Article</nav>
  <article>
    <h1>Main Title</h1>
    <p>This is the main content of the article.</p>
    <ul>
      <li>First point</li>
      <li>Second point</li>
    </ul>
    <p>Conclusion paragraph.</p>
  </article>
  <footer>Site footer</footer>
</body></html>
"#;
let text = clean_html(html).expect("Failed to parse HTML");
// Result preserves structure: headings, paragraphs, and lists
// Navigation and footer boilerplate are removed
println!("{}", text);
// Output:
// Main Title
//
// This is the main content of the article.
//
// * First point
// * Second point
//
// Conclusion paragraph.
```
### extract_urls - URL Extraction from Text
Extracts and deduplicates all URLs from text content, returning parsed `Url` objects.
```rust
use llm_text::links::extract_urls;
let text = "Check out https://example.com and also visit
https://rust-lang.org/learn. The first link https://example.com
appears twice but will be deduplicated.";
let urls = extract_urls(text);
assert_eq!(urls.len(), 2); // Deduplicated
for url in &urls {
println!("Found URL: {} (host: {:?})", url, url.host_str());
}
// Output:
// Found URL: https://example.com/ (host: Some("example.com"))
// Found URL: https://rust-lang.org/learn (host: Some("rust-lang.org"))
```
### TextSplitter - Intelligent Text Chunking
Splits text into semantically meaningful chunks using configurable separators with optional recursive splitting and token-based size limits.
```rust
use llm_text::splitting::{TextSplitter, CharRatioTokenizer, Tokenizer};
use std::sync::Arc;
let document = "Introduction to Rust.

Rust is a systems programming language focused on safety and performance.

Memory Safety

Rust prevents null pointer dereferences and data races at compile time.

Concurrency

Rust makes concurrent programming safer with its ownership system.";
// Basic splitting on double newlines (paragraphs)
let splitter = TextSplitter::new()
.on_two_plus_newline()
.recursive(true);
if let Some(splits) = splitter.split_text(document) {
println!("Split into {} chunks:", splits.len());
for (i, split) in splits.iter().enumerate() {
println!("Chunk {}: {:?}", i, split.text());
}
}
// Output:
// Split into 4 chunks:
// Chunk 0: "Introduction to Rust."
// Chunk 1: "Rust is a systems programming language..."
// Chunk 2: "Memory Safety\n\nRust prevents null pointer..."
// Chunk 3: "Concurrency\n\nRust makes concurrent programming..."
// Token-aware splitting with maximum chunk size
let tokenizer = Arc::new(CharRatioTokenizer::new().with_ratio(4.0));
let splitter = TextSplitter::new()
.on_two_plus_newline()
.recursive(true)
.max_token_size(50) // Max 50 tokens per chunk
.with_tokenizer(tokenizer);
if let Some(splits) = splitter.split_text(document) {
for split in &splits {
println!("Chunk ({} tokens): {}",
split.token_count.unwrap_or(0),
split.text().chars().take(50).collect::<String>());
}
}
```
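The char-ratio heuristic behind `CharRatioTokenizer` can be sketched in plain Rust: estimate the token count as the character count divided by an average chars-per-token ratio (4.0 is a common rule of thumb for English). This is a sketch of the idea only, not the crate's exact code.

```rust
// Simplified sketch of a char-ratio token estimate:
// tokens ≈ ceil(char_count / chars_per_token).
fn estimate_tokens(text: &str, chars_per_token: f64) -> u32 {
    (text.chars().count() as f64 / chars_per_token).ceil() as u32
}

fn main() {
    let text = "Rust is a systems programming language.";
    // 39 chars / 4.0 → ceil(9.75) = 10
    println!("{}", estimate_tokens(text, 4.0));
}
```

Such an estimate is cheap and tokenizer-free, which makes it a reasonable default when an exact BPE count is not required.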
### Separator - Text Splitting Strategies
Different separator types for various granularity levels, from paragraphs down to individual graphemes.
```rust
use llm_text::splitting::{Separator, TextSplitter};
let text = "First sentence. Second sentence. Third sentence.";
// Split on sentences using Unicode rules
let splitter = TextSplitter::new()
.on_sentences_unicode()
.recursive(false);
if let Some(splits) = splitter.split_text(text) {
for split in &splits {
println!("Sentence: {:?}", split.text());
}
}
// Output:
// Sentence: "First sentence."
// Sentence: "Second sentence."
// Sentence: "Third sentence."
// Split on words
let splitter = TextSplitter::new()
.on_words_unicode()
.recursive(false);
if let Some(splits) = splitter.split_text(text) {
let words: Vec<&str> = splits.iter().map(|s| s.text()).collect();
println!("Words: {:?}", words);
}
// Output: Words: ["First", "sentence", "Second", "sentence", "Third", "sentence"]
// Get byte ranges for text indices
let indices = Separator::SentencesUnicode.split_text_into_indices(text);
for range in indices {
println!("Range {:?}: {:?}", range.clone(), &text[range]);
}
```
### split_text_into_sentences - Sentence Boundary Detection
Unicode-aware sentence splitting for precise text segmentation.
```rust
use llm_text::splitting::split_text_into_sentences;
let text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was a long trip!";
// With separators preserved (whitespace included)
let sentences = split_text_into_sentences(text, true);
for s in &sentences {
println!("With sep: {:?}", s);
}
// Without separators (trimmed)
let sentences = split_text_into_sentences(text, false);
for s in &sentences {
println!("Trimmed: {:?}", s);
}
// Output:
// Trimmed: "Dr. Smith went to Washington."
// Trimmed: "He arrived at 3 p.m."
// Trimmed: "It was a long trip!"
```
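To see why Unicode-aware boundary rules matter, compare against a naive split on `". "`, which wrongly treats abbreviations like "Dr." as sentence ends. This std-only demonstration is for contrast only; `split_text_into_sentences` uses proper Unicode sentence-boundary rules and keeps "Dr. Smith went to Washington." intact, as shown above.

```rust
// Naive sentence splitting on ". " -- breaks on abbreviations.
fn naive_sentences(text: &str) -> Vec<&str> {
    text.split(". ").collect()
}

fn main() {
    let text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was late.";
    // "Dr." and "p.m." are wrongly treated as sentence ends:
    println!("{:?}", naive_sentences(text));
    // ["Dr", "Smith went to Washington", "He arrived at 3 p.m", "It was late."]
}
```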
### Custom Tokenizer Implementation
Implement the `Tokenizer` trait to integrate with real tokenizers like tiktoken.
```rust
use llm_text::splitting::{Tokenizer, TextSplitter};
use std::sync::Arc;
// Example: wrapper for tiktoken (pseudo-code)
struct TiktokenWrapper {
// bpe: tiktoken::CoreBPE,
}
impl Tokenizer for TiktokenWrapper {
fn count_tokens(&self, text: &str) -> u32 {
// self.bpe.encode_ordinary(text).len() as u32
(text.len() / 4) as u32 // Simplified example
}
}
// Use with TextSplitter
let tokenizer: Arc<dyn Tokenizer> = Arc::new(TiktokenWrapper {});
let splitter = TextSplitter::new()
.with_tokenizer(tokenizer)
.max_token_size(100);
```
## opml-protocol Crate
### Opml::from_str - Parse OPML from XML String
Parse OPML 2.0 documents from XML strings with full support for nested outlines and custom attributes.
```rust
use opml_protocol::Opml;
use std::str::FromStr;
let xml = r#"<opml version="2.0">
  <head>
    <title>My Subscription List</title>
    <dateCreated>Mon, 01 Jan 2024 00:00:00 GMT</dateCreated>
    <ownerName>John Doe</ownerName>
  </head>
  <body>
    <outline text="Tech" type="folder">
      <outline text="Rust Blog" type="rss" xmlUrl="https://blog.rust-lang.org/feed.xml"/>
    </outline>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).expect("Failed to parse OPML");
println!("OPML Version: {}", opml.version);
println!("Title: {:?}", opml.head.title);
println!("Owner: {:?}", opml.head.owner_name);
println!("Number of top-level outlines: {}", opml.body.outlines.len());
for outline in &opml.body.outlines {
println!("Outline: {} (type: {:?})", outline.text, outline.type_);
if let Some(children) = &outline.outlines {
for child in children {
println!(" - {}: {:?}", child.text, child.xml_url);
}
}
}
```
### Opml::to_string - Serialize OPML to XML
Convert OPML documents back to XML string format with proper attribute serialization.
```rust
use opml_protocol::{Opml, Head, Body, Outline, OutlineBuilder};
let opml = Opml {
version: "2.0".to_string(),
head: Head {
title: Some("My Feeds".to_string()),
owner_name: Some("Jane Doe".to_string()),
..Default::default()
},
body: Body {
outlines: vec![
Outline {
text: "Rust Blog".to_string(),
type_: Some("rss".to_string()),
xml_url: Some("https://blog.rust-lang.org/feed.xml".to_string()),
html_url: Some("https://blog.rust-lang.org".to_string()),
..Default::default()
},
],
},
};
let xml = opml.to_string().expect("Failed to serialize OPML");
println!("{}", xml);
// Output (whitespace and attribute order may vary):
// <opml version="2.0">
//   <head><title>My Feeds</title><ownerName>Jane Doe</ownerName></head>
//   <body>
//     <outline text="Rust Blog" type="rss"
//              xmlUrl="https://blog.rust-lang.org/feed.xml"
//              htmlUrl="https://blog.rust-lang.org"/>
//   </body>
// </opml>
```
### OutlineBuilder - Programmatic Outline Construction
Build OPML outlines using the builder pattern with validation.
```rust
use opml_protocol::{OutlineBuilder, Opml, Body, Head};
// Build a simple RSS feed outline
let feed = OutlineBuilder::default()
.text("Hacker News".to_string())
.type_(Some("rss".to_string()))
.xml_url(Some("https://news.ycombinator.com/rss".to_string()))
.html_url(Some("https://news.ycombinator.com".to_string()))
.description(Some("Hacker News RSS Feed".to_string()))
.build()
.expect("Failed to build outline");
// Build a folder with nested feeds
let tech_folder = OutlineBuilder::default()
.text("Technology".to_string())
.type_(Some("folder".to_string()))
.outlines(Some(vec![
OutlineBuilder::default()
.text("Ars Technica".to_string())
.type_(Some("rss".to_string()))
.xml_url(Some("https://feeds.arstechnica.com/arstechnica/index".to_string()))
.build()
.unwrap(),
OutlineBuilder::default()
.text("Wired".to_string())
.type_(Some("rss".to_string()))
.xml_url(Some("https://www.wired.com/feed/rss".to_string()))
.build()
.unwrap(),
]))
.build()
.expect("Failed to build folder");
let opml = Opml {
version: "2.0".to_string(),
head: Head { title: Some("My Tech Feeds".to_string()), ..Default::default() },
body: Body { outlines: vec![tech_folder, feed] },
};
println!("{}", opml.to_string().unwrap());
```
### Opml::get_group_feeds - Extract RSS Feeds by Folder
Extract all RSS feeds from an OPML document, grouped by their folder hierarchy with full path support for nested folders.
```rust
use opml_protocol::Opml;
use std::str::FromStr;
let xml = r#"<opml version="2.0">
  <head><title>Feeds</title></head>
  <body>
    <outline text="Hacker News" type="rss" xmlUrl="https://hnrss.org/frontpage"/>
    <outline text="Tech" type="folder">
      <outline text="News" type="folder">
        <outline text="TechCrunch" type="rss" xmlUrl="https://techcrunch.com/feed/"/>
        <outline text="Ars Technica" type="rss" xmlUrl="https://feeds.arstechnica.com/arstechnica/index"/>
      </outline>
      <outline text="Blogs" type="folder">
        <outline text="Rust Blog" type="rss" xmlUrl="https://blog.rust-lang.org/feed.xml"/>
      </outline>
    </outline>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).unwrap();
let groups = opml.get_group_feeds().expect("Failed to extract feeds");
for group in &groups {
let folder_name = if group.group.is_empty() {
"(Root Level)"
} else {
&group.group
};
println!("Folder: {}", folder_name);
for feed in &group.feeds {
println!(" - {} -> {}",
feed.text.as_deref().unwrap_or("Untitled"),
feed.xml_url);
}
}
// Output:
// Folder: (Root Level)
// - Hacker News -> https://hnrss.org/frontpage
// Folder: Tech/News
// - TechCrunch -> https://techcrunch.com/feed/
// - Ars Technica -> https://feeds.arstechnica.com/arstechnica/index
// Folder: Tech/Blogs
// - Rust Blog -> https://blog.rust-lang.org/feed.xml
```
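The path-joining logic behind this grouping can be sketched independently of the crate: walk the nested folders recursively, joining folder names with `/` to build each feed's group path. The minimal `Node` type below is hypothetical; the real `Outline` carries many more fields.

```rust
// Simplified sketch of hierarchical feed extraction with folder paths.
// Hypothetical minimal type: Some(xml_url) marks a feed, None a folder.
struct Node {
    text: String,
    xml_url: Option<String>,
    children: Vec<Node>,
}

// Collect (folder_path, feed_url) pairs, joining folder names with "/".
fn collect_feeds(node: &Node, path: &str, out: &mut Vec<(String, String)>) {
    match &node.xml_url {
        Some(url) => out.push((path.to_string(), url.clone())),
        None => {
            let sub = if path.is_empty() {
                node.text.clone()
            } else {
                format!("{}/{}", path, node.text)
            };
            for child in &node.children {
                collect_feeds(child, &sub, out);
            }
        }
    }
}

fn main() {
    let root = Node {
        text: "Tech".into(),
        xml_url: None,
        children: vec![Node {
            text: "News".into(),
            xml_url: None,
            children: vec![Node {
                text: "TechCrunch".into(),
                xml_url: Some("https://techcrunch.com/feed/".into()),
                children: vec![],
            }],
        }],
    };
    let mut feeds = Vec::new();
    collect_feeds(&root, "", &mut feeds);
    println!("{:?}", feeds);
    // [("Tech/News", "https://techcrunch.com/feed/")]
}
```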
### Custom Attributes Handling
OPML outlines can contain arbitrary custom attributes which are preserved during parsing and serialization.
```rust
use opml_protocol::Opml;
use std::str::FromStr;
use std::collections::HashMap;
let xml = r#"<opml version="2.0">
  <head><title>Custom Attrs</title></head>
  <body>
    <outline text="My Feed" type="rss" xmlUrl="https://example.com/feed.xml"
             customPriority="high" refreshRate="60"/>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).unwrap();
let outline = &opml.body.outlines[0];
// Access custom attributes from the extra HashMap
// Note: attributes are stored with @ prefix
println!("Custom priority: {:?}", outline.extra.get("@customPriority"));
println!("Refresh rate: {:?}", outline.extra.get("@refreshRate"));
// Roundtrip preserves custom attributes
let xml_out = opml.to_string().unwrap();
let reparsed = Opml::from_str(&xml_out).unwrap();
assert_eq!(
reparsed.body.outlines[0].extra.get("@customPriority"),
Some(&"high".to_string())
);
```
### Error Handling
The `OpmlError` type provides detailed error information for parsing and serialization failures.
```rust
use opml_protocol::{Opml, errors::OpmlError};
use std::str::FromStr;
// Handle parsing errors
let invalid_xml = "<opml><head>not well-formed";
match Opml::from_str(invalid_xml) {
Ok(_) => println!("Parsed successfully"),
Err(OpmlError::XmlDeError(e)) => println!("XML parse error: {}", e),
Err(e) => println!("Other error: {}", e),
}
// Handle feed extraction errors (RSS without xmlUrl)
let xml = r#"<opml version="2.0">
  <head><title>Test</title></head>
  <body>
    <outline text="Broken Feed" type="rss"/>
  </body>
</opml>"#;
let opml = Opml::from_str(xml).unwrap();
match opml.get_group_feeds() {
Ok(feeds) => println!("Got {} groups", feeds.len()),
Err(OpmlError::BadFeed(msg)) => println!("Invalid feed: {}", msg),
Err(e) => println!("Error: {}", e),
}
// Output: Invalid feed: RSS outline missing xmlUrl
```
## Summary
The llm-utils workspace provides essential text processing utilities for LLM applications. The `llm-text` crate handles the complete pipeline from raw HTML to clean, chunked text suitable for embedding or prompt construction: HTML content extraction with Readability, configurable text cleaning with whitespace normalization and citation removal, URL extraction, and intelligent text splitting with Unicode-aware sentence boundaries and token-based size limits.
The `opml-protocol` crate offers a complete OPML 2.0 implementation for RSS feed management applications, featuring bidirectional XML parsing and generation with serde, builder patterns for programmatic document construction, hierarchical feed extraction with folder path support, and preservation of custom attributes. Both crates are designed for high performance with single-pass processing, Arc-based text sharing for zero-copy splitting, and efficient memory usage patterns suitable for production workloads.