Try Live
Add Docs
Rankings
Pricing
Docs
Install
Theme
Install
Docs
Pricing
More...
More...
Try Live
Rankings
Enterprise
Create API Key
Add Docs
MinerU SDK for Rust
https://github.com/longcipher/mineru-sdk-rs
Admin
A Rust SDK for interacting with the MinerU API to extract text and data from PDFs, DOCX, PPTX, and
...
Tokens:
11,434
Snippets:
67
Trust Score:
5.9
Update:
11 hours ago
Context
Skills
Chat
Benchmark
85.3
Suggestions
Latest
Show doc for...
Code
Info
Show Results
Context Summary (auto-generated)
Raw
Copy
Link
# MinerU SDK for Rust MinerU SDK for Rust is an async Rust library that provides a type-safe interface to the MinerU document extraction API. The SDK enables developers to extract text, tables, formulas, and other content from various document formats including PDF, DOCX, PPTX, PNG, JPG, and HTML files. Built on Tokio for asynchronous operations, the library offers both single-file extraction and batch processing capabilities. The SDK consists of a core library (`mineru-sdk`) and a command-line interface (`mineru-cli`). It handles authentication via Bearer tokens, provides structured request/response types with Serde serialization, and supports advanced extraction options such as OCR processing, formula recognition, table detection, and multiple output formats (Markdown, JSON, DOCX, HTML, LaTeX). The API supports both synchronous polling and callback-based result notification. ## MineruClient::new - Create Client Instance Creates a new MinerU API client with the provided authentication token. The client uses `https://mineru.net` as the base URL and manages all HTTP communication with the API. ```rust use mineru_sdk::MineruClient; // Create client with API token let client = MineruClient::new("your-api-token".to_string()); // Client is Clone + Debug safe for concurrent use println!("{:?}", client); // Token is masked in debug output ``` ## MineruClient::create_extract_task - Single File Extraction Creates an extraction task for a single document from a URL. Supports PDF, DOCX, PPTX, PNG, JPG, JPEG, and HTML formats. Returns a task ID for tracking the extraction progress. ```rust use mineru_sdk::{MineruClient, ExtractTaskRequest}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = MineruClient::new("your-api-token".to_string()); // Create extraction request with full options let request = ExtractTaskRequest { url: "https://example.com/document.pdf".to_string(), is_ocr: false, // Enable OCR for scanned docs enable_formula: true, // Recognize math formulas enable_table: true, // Detect tables language: "ch".to_string(), // Document language data_id: Some("my-custom-id".to_string()), // Custom identifier callback: None, // Optional webhook URL seed: None, // Callback signature seed extra_formats: Some(vec!["docx".to_string(), "html".to_string()]), page_ranges: Some("1-10".to_string()), // Extract pages 1-10 only model_version: "vlm".to_string(), // "pipeline" or "vlm" }; let response = client.create_extract_task(request).await?; // Response structure: // { "code": 0, "msg": "ok", "trace_id": "...", "data": { "task_id": "uuid" } } if response.code == 0 { println!("Task created: {}", response.data.task_id); } Ok(()) } ``` **cURL equivalent:** ```bash curl -X POST 'https://mineru.net/api/v4/extract/task' \ -H 'Authorization: Bearer your-api-token' \ -H 'Content-Type: application/json' \ -d '{ "url": "https://example.com/document.pdf", "model_version": "vlm", "enable_formula": true, "enable_table": true, "extra_formats": ["docx", "html"] }' ``` ## MineruClient::get_task_result - Get Task Status and Result Queries the status and result of an extraction task. Returns the current state (pending, running, done, failed, converting) and the download URL when completed. ```rust use mineru_sdk::MineruClient; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = MineruClient::new("your-api-token".to_string()); let task_id = "a90e6ab6-44f3-4554-b459-b62fe4c6b436"; let result = client.get_task_result(task_id).await?; match result.data.state.as_str() { "done" => { // Extraction complete - download results if let Some(zip_url) = result.data.full_zip_url { println!("Download results: {}", zip_url); } } "running" => { // Check progress if let Some(progress) = result.data.extract_progress { println!("Progress: {}/{} pages (started: {})", progress.extracted_pages, progress.total_pages, progress.start_time); } } "pending" => println!("Task queued, waiting to start"), "converting" => println!("Converting to requested formats"), "failed" => { if let Some(err) = result.data.err_msg { eprintln!("Extraction failed: {}", err); } } _ => println!("Unknown state: {}", result.data.state), } Ok(()) } ``` **cURL equivalent:** ```bash curl -X GET 'https://mineru.net/api/v4/extract/task/a90e6ab6-44f3-4554-b459-b62fe4c6b436' \ -H 'Authorization: Bearer your-api-token' ``` ## MineruClient::batch_file_upload_urls - Get Upload URLs for Batch Processing Requests presigned upload URLs for batch file processing. Upload files via PUT request, then the system automatically submits extraction tasks. Upload links are valid for 24 hours. ```rust use mineru_sdk::{MineruClient, BatchFileRequest, BatchFileItem}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = MineruClient::new("your-api-token".to_string()); // Request upload URLs for multiple files let request = BatchFileRequest { files: vec![ BatchFileItem { name: "document1.pdf".to_string(), data_id: Some("doc-001".to_string()), }, BatchFileItem { name: "document2.docx".to_string(), data_id: Some("doc-002".to_string()), }, ], model_version: "vlm".to_string(), }; let response = client.batch_file_upload_urls(request).await?; if response.code == 0 { let batch_id = &response.data.batch_id; let upload_urls = &response.data.files; println!("Batch ID: {}", batch_id); for (i, url) in upload_urls.iter().enumerate() { println!("Upload URL for file {}: {}", i + 1, url); // Upload file via PUT request (no Content-Type header needed) // let file_bytes = std::fs::read(&file_paths[i])?; // reqwest::Client::new().put(url).body(file_bytes).send().await?; } } Ok(()) } ``` **cURL equivalent:** ```bash # Step 1: Get upload URLs curl -X POST 'https://mineru.net/api/v4/file-urls/batch' \ -H 'Authorization: Bearer your-api-token' \ -H 'Content-Type: application/json' \ -d '{ "files": [{"name": "document.pdf", "data_id": "doc-001"}], "model_version": "vlm" }' # Step 2: Upload file to presigned URL curl -X PUT -T /path/to/document.pdf 'https://presigned-upload-url...' ``` ## MineruClient::batch_url_upload - Batch Extract from URLs Creates batch extraction tasks for multiple documents specified by URL. Supports up to 200 files per batch request with file size limit of 200MB and 600 pages per file. ```rust use mineru_sdk::{MineruClient, BatchUrlRequest, BatchUrlItem}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = MineruClient::new("your-api-token".to_string()); // Create batch extraction from URLs let request = BatchUrlRequest { files: vec![ BatchUrlItem { url: "https://example.com/report1.pdf".to_string(), data_id: Some("report-001".to_string()), }, BatchUrlItem { url: "https://example.com/report2.pdf".to_string(), data_id: Some("report-002".to_string()), }, BatchUrlItem { url: "https://example.com/presentation.pptx".to_string(), data_id: Some("pres-001".to_string()), }, ], model_version: "pipeline".to_string(), }; let response = client.batch_url_upload(request).await?; if response.code == 0 { println!("Batch ID: {}", response.data.batch_id); // Use batch_id to poll for results } Ok(()) } ``` **cURL equivalent:** ```bash curl -X POST 'https://mineru.net/api/v4/extract/task/batch' \ -H 'Authorization: Bearer your-api-token' \ -H 'Content-Type: application/json' \ -d '{ "files": [ {"url": "https://example.com/report1.pdf", "data_id": "report-001"}, {"url": "https://example.com/report2.pdf", "data_id": "report-002"} ], "model_version": "pipeline" }' ``` ## MineruClient::get_batch_results - Get Batch Processing Results Retrieves the status and results of all tasks in a batch. Returns individual file states and download URLs for completed extractions. ```rust use mineru_sdk::MineruClient; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = MineruClient::new("your-api-token".to_string()); let batch_id = "2bb2f0ec-a336-4a0a-b61a-241afaf9cc87"; let result = client.get_batch_results(batch_id).await?; if result.code == 0 { println!("Batch: {}", result.data.batch_id); for item in &result.data.extract_result { let file_name = item.file_name.as_deref().unwrap_or("unknown"); match item.state.as_str() { "done" => { if let Some(url) = &item.full_zip_url { println!("[DONE] {}: {}", file_name, url); } } "running" => { if let Some(p) = &item.extract_progress { println!("[RUNNING] {}: {}/{} pages", file_name, p.extracted_pages, p.total_pages); } } "waiting-file" => println!("[WAITING] {}: awaiting file upload", file_name), "pending" => println!("[PENDING] {}: queued", file_name), "failed" => { let err = item.err_msg.as_deref().unwrap_or("unknown error"); println!("[FAILED] {}: {}", file_name, err); } _ => println!("[{}] {}", item.state, file_name), } } } Ok(()) } ``` **cURL equivalent:** ```bash curl -X GET 'https://mineru.net/api/v4/extract-results/batch/2bb2f0ec-a336-4a0a-b61a-241afaf9cc87' \ -H 'Authorization: Bearer your-api-token' ``` ## CLI Tool - Command Line Interface The `mineru-cli` provides command-line access to all SDK functions. Set the token via environment variable or pass as argument. ```bash # Set token via environment variable export MINERU_TOKEN="your-api-token" # Create single extraction task mineru-cli extract --url "https://example.com/document.pdf" --model-version vlm # With all options mineru-cli extract \ --url "https://example.com/document.pdf" \ --model-version pipeline \ --is-ocr \ --enable-formula \ --enable-table \ --language en \ --data-id "my-doc-001" \ --page-ranges "1-50" # Get task result mineru-cli get-task --task-id "a90e6ab6-44f3-4554-b459-b62fe4c6b436" # Batch file upload (get presigned URL) mineru-cli batch-file --name "document.pdf" --model-version vlm # Batch URL extraction mineru-cli batch-url --url "https://example.com/document.pdf" --model-version pipeline # Get batch results mineru-cli get-batch --batch-id "2bb2f0ec-a336-4a0a-b61a-241afaf9cc87" # Pass token as argument mineru-cli --token "your-api-token" extract --url "https://example.com/doc.pdf" ``` ## Error Handling The SDK uses a custom `MineruError` enum for error handling that wraps HTTP errors, JSON serialization errors, and API-specific errors. ```rust use mineru_sdk::{MineruClient, ExtractTaskRequest, MineruError}; #[tokio::main] async fn main() { let client = MineruClient::new("your-api-token".to_string()); let request = ExtractTaskRequest { url: "https://example.com/document.pdf".to_string(), is_ocr: false, enable_formula: true, enable_table: true, language: "ch".to_string(), data_id: None, callback: None, seed: None, extra_formats: None, page_ranges: None, model_version: "pipeline".to_string(), }; match client.create_extract_task(request).await { Ok(response) => { // Check API-level errors (code != 0) match response.code { 0 => println!("Success: task_id = {}", response.data.task_id), -500 => eprintln!("Parameter error: check request format"), -60005 => eprintln!("File too large: max 200MB"), -60006 => eprintln!("Too many pages: max 600 pages"), code => eprintln!("API error {}: {}", code, response.msg), } } Err(MineruError::Http(e)) => eprintln!("HTTP error: {}", e), Err(MineruError::Json(e)) => eprintln!("JSON parse error: {}", e), Err(MineruError::Api(msg)) => eprintln!("API error: {}", msg), } } // Common error codes: // A0202 - Invalid token (check Bearer prefix) // A0211 - Token expired // -60005 - File size exceeds 200MB limit // -60006 - File exceeds 600 page limit // -60012 - Task not found // -60013 - No permission to access task ``` ## Summary The MinerU SDK for Rust is ideal for applications requiring automated document processing at scale. Common use cases include building document ingestion pipelines for RAG (Retrieval-Augmented Generation) systems, converting legacy documents to searchable formats, extracting structured data from forms and reports, and processing academic papers with complex formulas and tables. The async design makes it suitable for high-throughput services handling concurrent extraction requests. Integration typically follows a create-poll pattern: submit extraction tasks via `create_extract_task` or batch methods, then poll `get_task_result` or `get_batch_results` until completion. For production systems, use the callback mechanism to receive webhook notifications when extraction completes, eliminating the need for polling. The SDK's type-safe request/response structures ensure compile-time validation, while the CLI tool provides quick testing and scripting capabilities without writing Rust code.