https://github.com/longcipher/mineru-sdk-rs
# MinerU SDK for Rust

MinerU SDK for Rust is an async Rust library that provides a type-safe interface to the MinerU document extraction API. The SDK enables developers to extract text, tables, formulas, and other content from various document formats including PDF, DOCX, PPTX, PNG, JPG, and HTML files. Built on Tokio for asynchronous operations, the library offers both single-file extraction and batch processing capabilities.

The SDK consists of a core library (`mineru-sdk`) and a command-line interface (`mineru-cli`). It handles authentication via Bearer tokens, provides structured request/response types with Serde serialization, and supports advanced extraction options such as OCR processing, formula recognition, table detection, and multiple output formats (Markdown, JSON, DOCX, HTML, LaTeX). The API supports both synchronous polling and callback-based result notification.

## MineruClient::new - Create Client Instance

Creates a new MinerU API client with the provided authentication token. The client uses `https://mineru.net` as the base URL and manages all HTTP communication with the API.

```rust
use mineru_sdk::MineruClient;

// Create client with API token
let client = MineruClient::new("your-api-token".to_string());

// The client implements Clone and Debug, so it can be cloned into concurrent tasks
println!("{:?}", client); // Token is masked in debug output
```
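
The masking behaviour mentioned in the comment above can be illustrated with a hand-written `Debug` implementation. The following is only a sketch of one common way to do it: `ApiClient`, its fields, and `auth_header` are hypothetical stand-ins and are not taken from the SDK's source.

```rust
use std::fmt;

/// Hypothetical stand-in for a client that holds a secret token.
/// This is NOT the SDK's real struct; it only illustrates the masking idea.
#[derive(Clone)]
pub struct ApiClient {
    base_url: String,
    token: String,
}

impl ApiClient {
    pub fn new(token: String) -> Self {
        Self {
            base_url: "https://mineru.net".to_string(),
            token,
        }
    }

    /// Builds the value of the `Authorization` header (Bearer scheme).
    pub fn auth_header(&self) -> String {
        format!("Bearer {}", self.token)
    }
}

// A manual Debug impl prints a placeholder instead of the secret.
impl fmt::Debug for ApiClient {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("ApiClient")
            .field("base_url", &self.base_url)
            .field("token", &"***")
            .finish()
    }
}
```

With this shape, formatting the value with `{:?}` shows the base URL but never the token, which keeps secrets out of logs.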

## MineruClient::create_extract_task - Single File Extraction

Creates an extraction task for a single document from a URL. Supports PDF, DOCX, PPTX, PNG, JPG, JPEG, and HTML formats. Returns a task ID for tracking the extraction progress.

```rust
use mineru_sdk::{MineruClient, ExtractTaskRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = MineruClient::new("your-api-token".to_string());

    // Create extraction request with full options
    let request = ExtractTaskRequest {
        url: "https://example.com/document.pdf".to_string(),
        is_ocr: false,                              // Set to true to enable OCR for scanned docs
        enable_formula: true,                       // Recognize math formulas
        enable_table: true,                         // Detect tables
        language: "ch".to_string(),                 // Document language
        data_id: Some("my-custom-id".to_string()),  // Custom identifier
        callback: None,                             // Optional webhook URL
        seed: None,                                 // Callback signature seed
        extra_formats: Some(vec!["docx".to_string(), "html".to_string()]),
        page_ranges: Some("1-10".to_string()),      // Extract pages 1-10 only
        model_version: "vlm".to_string(),           // "pipeline" or "vlm"
    };

    let response = client.create_extract_task(request).await?;

    // Response structure:
    // { "code": 0, "msg": "ok", "trace_id": "...", "data": { "task_id": "uuid" } }
    if response.code == 0 {
        println!("Task created: {}", response.data.task_id);
    }
    Ok(())
}
```

**cURL equivalent:**
```bash
curl -X POST 'https://mineru.net/api/v4/extract/task' \
  -H 'Authorization: Bearer your-api-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/document.pdf",
    "model_version": "vlm",
    "enable_formula": true,
    "enable_table": true,
    "extra_formats": ["docx", "html"]
  }'
```

## MineruClient::get_task_result - Get Task Status and Result

Queries the status and result of an extraction task. Returns the current state (pending, running, done, failed, converting) and the download URL when completed.

```rust
use mineru_sdk::MineruClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = MineruClient::new("your-api-token".to_string());
    let task_id = "a90e6ab6-44f3-4554-b459-b62fe4c6b436";

    let result = client.get_task_result(task_id).await?;

    match result.data.state.as_str() {
        "done" => {
            // Extraction complete - download results
            if let Some(zip_url) = result.data.full_zip_url {
                println!("Download results: {}", zip_url);
            }
        }
        "running" => {
            // Check progress
            if let Some(progress) = result.data.extract_progress {
                println!("Progress: {}/{} pages (started: {})",
                    progress.extracted_pages,
                    progress.total_pages,
                    progress.start_time);
            }
        }
        "pending" => println!("Task queued, waiting to start"),
        "converting" => println!("Converting to requested formats"),
        "failed" => {
            if let Some(err) = result.data.err_msg {
                eprintln!("Extraction failed: {}", err);
            }
        }
        _ => println!("Unknown state: {}", result.data.state),
    }
    Ok(())
}
```

**cURL equivalent:**
```bash
curl -X GET 'https://mineru.net/api/v4/extract/task/a90e6ab6-44f3-4554-b459-b62fe4c6b436' \
  -H 'Authorization: Bearer your-api-token'
```
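
Matching on raw state strings works, but a small enum can make the handling harder to get wrong. The sketch below assumes the state names listed in this document (including `waiting-file`, which appears only in the batch results section); `TaskState` and its methods are hypothetical helpers, not part of the SDK.

```rust
/// Task states named in this document, plus a catch-all for anything else.
/// Hypothetical helper type; the SDK itself exposes the state as a string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TaskState {
    WaitingFile,
    Pending,
    Running,
    Converting,
    Done,
    Failed,
    Unknown(String),
}

impl TaskState {
    /// Converts the raw `state` string from a response into a typed value.
    pub fn parse(raw: &str) -> Self {
        match raw {
            "waiting-file" => Self::WaitingFile,
            "pending" => Self::Pending,
            "running" => Self::Running,
            "converting" => Self::Converting,
            "done" => Self::Done,
            "failed" => Self::Failed,
            other => Self::Unknown(other.to_string()),
        }
    }

    /// True when polling can stop: the task either finished or failed.
    pub fn is_terminal(&self) -> bool {
        matches!(self, Self::Done | Self::Failed)
    }
}
```

A caller could write `TaskState::parse(&result.data.state)` and branch on the enum, keeping unrecognized states in `Unknown` rather than dropping them.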

## MineruClient::batch_file_upload_urls - Get Upload URLs for Batch Processing

Requests presigned upload URLs for batch file processing. Upload each file with a PUT request to its URL; the system then submits the extraction tasks automatically. Upload links are valid for 24 hours.

```rust
use mineru_sdk::{MineruClient, BatchFileRequest, BatchFileItem};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = MineruClient::new("your-api-token".to_string());

    // Request upload URLs for multiple files
    let request = BatchFileRequest {
        files: vec![
            BatchFileItem {
                name: "document1.pdf".to_string(),
                data_id: Some("doc-001".to_string()),
            },
            BatchFileItem {
                name: "document2.docx".to_string(),
                data_id: Some("doc-002".to_string()),
            },
        ],
        model_version: "vlm".to_string(),
    };

    let response = client.batch_file_upload_urls(request).await?;

    if response.code == 0 {
        let batch_id = &response.data.batch_id;
        let upload_urls = &response.data.files;

        println!("Batch ID: {}", batch_id);
        for (i, url) in upload_urls.iter().enumerate() {
            println!("Upload URL for file {}: {}", i + 1, url);
            // Upload file via PUT request (no Content-Type header needed)
            // let file_bytes = std::fs::read(&file_paths[i])?;
            // reqwest::Client::new().put(url).body(file_bytes).send().await?;
        }
    }
    Ok(())
}
```

**cURL equivalent:**
```bash
# Step 1: Get upload URLs
curl -X POST 'https://mineru.net/api/v4/file-urls/batch' \
  -H 'Authorization: Bearer your-api-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "files": [{"name": "document.pdf", "data_id": "doc-001"}],
    "model_version": "vlm"
  }'

# Step 2: Upload file to presigned URL
curl -X PUT -T /path/to/document.pdf 'https://presigned-upload-url...'
```
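
Before uploading, each local file has to be matched with its presigned URL. The sketch below assumes the URLs come back in the same order as the files in the request, which the index-based loop above implies but this document does not state outright. `pair_uploads` is a hypothetical helper, not an SDK function.

```rust
use std::path::{Path, PathBuf};

/// Pairs each local file with the upload URL at the same index.
/// ASSUMPTION: the API returns URLs in the same order as the requested files.
pub fn pair_uploads<'a>(
    paths: &'a [PathBuf],
    urls: &'a [String],
) -> Result<Vec<(&'a Path, &'a str)>, String> {
    // A length mismatch means the pairing cannot be trusted, so fail early.
    if paths.len() != urls.len() {
        return Err(format!(
            "expected {} upload URLs, got {}",
            paths.len(),
            urls.len()
        ));
    }
    Ok(paths
        .iter()
        .map(PathBuf::as_path)
        .zip(urls.iter().map(String::as_str))
        .collect())
}
```

Each resulting pair can then be handed to whatever HTTP client performs the PUT request.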

## MineruClient::batch_url_upload - Batch Extract from URLs

Creates batch extraction tasks for multiple documents specified by URL. Each batch request supports up to 200 files, and each file is limited to 200 MB and 600 pages.

```rust
use mineru_sdk::{MineruClient, BatchUrlRequest, BatchUrlItem};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = MineruClient::new("your-api-token".to_string());

    // Create batch extraction from URLs
    let request = BatchUrlRequest {
        files: vec![
            BatchUrlItem {
                url: "https://example.com/report1.pdf".to_string(),
                data_id: Some("report-001".to_string()),
            },
            BatchUrlItem {
                url: "https://example.com/report2.pdf".to_string(),
                data_id: Some("report-002".to_string()),
            },
            BatchUrlItem {
                url: "https://example.com/presentation.pptx".to_string(),
                data_id: Some("pres-001".to_string()),
            },
        ],
        model_version: "pipeline".to_string(),
    };

    let response = client.batch_url_upload(request).await?;

    if response.code == 0 {
        println!("Batch ID: {}", response.data.batch_id);
        // Use batch_id to poll for results
    }
    Ok(())
}
```

**cURL equivalent:**
```bash
curl -X POST 'https://mineru.net/api/v4/extract/task/batch' \
  -H 'Authorization: Bearer your-api-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "files": [
      {"url": "https://example.com/report1.pdf", "data_id": "report-001"},
      {"url": "https://example.com/report2.pdf", "data_id": "report-002"}
    ],
    "model_version": "pipeline"
  }'
```
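
When a job has more documents than one request allows, the list needs to be split first. The following sketch uses the 200-file limit stated above; `MAX_FILES_PER_BATCH` and `split_into_batches` are hypothetical names, not part of the SDK.

```rust
/// Per-request file limit stated in this section.
pub const MAX_FILES_PER_BATCH: usize = 200;

/// Splits a list of document URLs into groups that each fit in one batch request.
/// The last group holds the remainder and may be smaller than the limit.
pub fn split_into_batches(urls: &[String]) -> Vec<Vec<String>> {
    urls.chunks(MAX_FILES_PER_BATCH)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Each inner vector can then be mapped to `BatchUrlItem` values and sent as its own `BatchUrlRequest`, yielding one batch ID per group.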

## MineruClient::get_batch_results - Get Batch Processing Results

Retrieves the status and results of all tasks in a batch. Returns individual file states and download URLs for completed extractions.

```rust
use mineru_sdk::MineruClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = MineruClient::new("your-api-token".to_string());
    let batch_id = "2bb2f0ec-a336-4a0a-b61a-241afaf9cc87";

    let result = client.get_batch_results(batch_id).await?;

    if result.code == 0 {
        println!("Batch: {}", result.data.batch_id);

        for item in &result.data.extract_result {
            let file_name = item.file_name.as_deref().unwrap_or("unknown");

            match item.state.as_str() {
                "done" => {
                    if let Some(url) = &item.full_zip_url {
                        println!("[DONE] {}: {}", file_name, url);
                    }
                }
                "running" => {
                    if let Some(p) = &item.extract_progress {
                        println!("[RUNNING] {}: {}/{} pages",
                            file_name, p.extracted_pages, p.total_pages);
                    }
                }
                "waiting-file" => println!("[WAITING] {}: awaiting file upload", file_name),
                "pending" => println!("[PENDING] {}: queued", file_name),
                "failed" => {
                    let err = item.err_msg.as_deref().unwrap_or("unknown error");
                    println!("[FAILED] {}: {}", file_name, err);
                }
                _ => println!("[{}] {}", item.state, file_name),
            }
        }
    }
    Ok(())
}
```

**cURL equivalent:**
```bash
curl -X GET 'https://mineru.net/api/v4/extract-results/batch/2bb2f0ec-a336-4a0a-b61a-241afaf9cc87' \
  -H 'Authorization: Bearer your-api-token'
```
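
For larger batches, a per-state summary is often more useful than a line per file. The sketch below works on plain state strings so it stays independent of the SDK types; `tally_states` and `batch_finished` are hypothetical helpers, and treating only `done` and `failed` as final states is an assumption based on the states described in this document.

```rust
use std::collections::BTreeMap;

/// Counts how many items are in each state.
pub fn tally_states<'a>(states: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for state in states {
        *counts.entry(*state).or_insert(0) += 1;
    }
    counts
}

/// True when every item has reached a final state.
/// ASSUMPTION: "done" and "failed" are the only final states.
/// An empty batch is reported as not finished.
pub fn batch_finished(states: &[&str]) -> bool {
    !states.is_empty() && states.iter().all(|s| matches!(*s, "done" | "failed"))
}
```

The input slice could be built from the response with something like `result.data.extract_result.iter().map(|item| item.state.as_str()).collect::<Vec<_>>()`.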

## CLI Tool - Command Line Interface

The `mineru-cli` tool provides command-line access to all SDK functions. Set the token via the `MINERU_TOKEN` environment variable or pass it with the `--token` argument.

```bash
# Set token via environment variable
export MINERU_TOKEN="your-api-token"

# Create single extraction task
mineru-cli extract --url "https://example.com/document.pdf" --model-version vlm

# With all options
mineru-cli extract \
  --url "https://example.com/document.pdf" \
  --model-version pipeline \
  --is-ocr \
  --enable-formula \
  --enable-table \
  --language en \
  --data-id "my-doc-001" \
  --page-ranges "1-50"

# Get task result
mineru-cli get-task --task-id "a90e6ab6-44f3-4554-b459-b62fe4c6b436"

# Batch file upload (get presigned URL)
mineru-cli batch-file --name "document.pdf" --model-version vlm

# Batch URL extraction
mineru-cli batch-url --url "https://example.com/document.pdf" --model-version pipeline

# Get batch results
mineru-cli get-batch --batch-id "2bb2f0ec-a336-4a0a-b61a-241afaf9cc87"

# Pass token as argument
mineru-cli --token "your-api-token" extract --url "https://example.com/doc.pdf"
```

## Error Handling

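The SDK reports failures through a custom `MineruError` enum that wraps HTTP errors, JSON serialization errors, and API-specific errors. API-level failures can also arrive in a successful response with a non-zero `code`, so check both.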

```rust
use mineru_sdk::{MineruClient, ExtractTaskRequest, MineruError};

#[tokio::main]
async fn main() {
    let client = MineruClient::new("your-api-token".to_string());

    let request = ExtractTaskRequest {
        url: "https://example.com/document.pdf".to_string(),
        is_ocr: false,
        enable_formula: true,
        enable_table: true,
        language: "ch".to_string(),
        data_id: None,
        callback: None,
        seed: None,
        extra_formats: None,
        page_ranges: None,
        model_version: "pipeline".to_string(),
    };

    match client.create_extract_task(request).await {
        Ok(response) => {
            // Check API-level errors (code != 0)
            match response.code {
                0 => println!("Success: task_id = {}", response.data.task_id),
                -500 => eprintln!("Parameter error: check request format"),
                -60005 => eprintln!("File too large: max 200MB"),
                -60006 => eprintln!("Too many pages: max 600 pages"),
                code => eprintln!("API error {}: {}", code, response.msg),
            }
        }
        Err(MineruError::Http(e)) => eprintln!("HTTP error: {}", e),
        Err(MineruError::Json(e)) => eprintln!("JSON parse error: {}", e),
        Err(MineruError::Api(msg)) => eprintln!("API error: {}", msg),
    }
}

// Common error codes:
// A0202  - Invalid token (check Bearer prefix)
// A0211  - Token expired
// -60005 - File size exceeds 200MB limit
// -60006 - File exceeds 600 page limit
// -60012 - Task not found
// -60013 - No permission to access task
```
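
The code list above mixes alphanumeric codes (`A0202`) with numeric ones (`-60005`), so a lookup keyed on strings can cover both. This is a sketch only: `describe_error_code` is a hypothetical helper, and how the SDK surfaces the alphanumeric codes is not confirmed by this document.

```rust
/// Maps the error codes listed in this document to short descriptions.
/// Codes are taken as strings because the list mixes alphanumeric and
/// numeric values; a numeric code can be passed via `code.to_string()`.
pub fn describe_error_code(code: &str) -> &'static str {
    match code {
        "0" => "success",
        "A0202" => "invalid token (check the Bearer prefix)",
        "A0211" => "token expired",
        "-500" => "parameter error",
        "-60005" => "file size exceeds the 200 MB limit",
        "-60006" => "file exceeds the 600-page limit",
        "-60012" => "task not found",
        "-60013" => "no permission to access task",
        _ => "unrecognized code",
    }
}
```

Keeping the table in one function avoids repeating the same `match` arms at every call site.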

## Summary

The MinerU SDK for Rust targets applications that need automated document processing at scale. Common use cases include building document ingestion pipelines for RAG (Retrieval-Augmented Generation) systems, converting legacy documents to searchable formats, extracting structured data from forms and reports, and processing academic papers with complex formulas and tables. Because the API is async, a single service can keep many extraction requests in flight concurrently.

Integration typically follows a create-poll pattern: submit extraction tasks via `create_extract_task` or the batch methods, then poll `get_task_result` or `get_batch_results` until each task reaches `done` or `failed`. For production systems, set the `callback` field to receive a webhook notification when extraction completes, which removes the need for polling. The typed request/response structures catch missing or mistyped fields at compile time, while the CLI tool allows quick testing and scripting without writing Rust code.
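
The create-poll pattern can be sketched as a loop with capped exponential backoff. The version below is synchronous and uses only the standard library so that it stays self-contained; `poll_until_terminal` and its parameters are hypothetical, and in async code built on the SDK the blocking `sleep` would be replaced by an async timer such as `tokio::time::sleep`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `fetch_state` until it reports "done" or "failed", or until
/// `max_attempts` calls have been made. The delay between calls doubles
/// after each attempt and is capped at `max_delay`.
/// Returns the final state, or `None` if the attempts ran out.
pub fn poll_until_terminal<F>(
    mut fetch_state: F,
    initial_delay: Duration,
    max_delay: Duration,
    max_attempts: u32,
) -> Option<String>
where
    F: FnMut() -> String,
{
    let mut delay = initial_delay;
    for attempt in 1..=max_attempts {
        let state = fetch_state();
        if state == "done" || state == "failed" {
            return Some(state);
        }
        // Skip the wait after the last attempt; there is nothing left to poll.
        if attempt < max_attempts {
            sleep(delay);
            delay = (delay * 2).min(max_delay);
        }
    }
    None
}
```

The closure is where a real call would go, for example one that fetches the task result and returns its `state` string. Capping both the delay and the attempt count keeps a stuck task from blocking the caller indefinitely.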