# MCP Server Fetch

MCP Server Fetch is a Model Context Protocol (MCP) server that provides advanced web content fetching capabilities for Large Language Models. It enables LLMs to retrieve and process content from web pages using browser automation, OCR, and multiple extraction methods, even for pages that require JavaScript rendering or employ anti-scraping techniques.

The server implements a multi-method extraction pipeline: browser automation with undetected-chromedriver, OCR using pytesseract with layout detection, HTML extraction via requests/BeautifulSoup, and document parsing for PDF, DOCX, and PPTX files. A scoring system automatically selects the best extraction result based on content length, structure quality, and error detection.

## MCP Tool: fetch

The `fetch` tool retrieves content from a URL using browser automation and multi-method extraction. It automatically handles cookie consent banners, captures full-page screenshots for OCR, and supports various document formats including HTML, PDF, DOCX, and PPTX.

```python
# Tool schema (as exposed via MCP)
{
    "name": "fetch",
    "description": "Fetches a URL from the internet using browser automation and multi-method extraction (including OCR).",
    "inputSchema": {
        "type": "object",
        "properties": {
            "url": {
                "type": "string",
                "format": "uri",
                "description": "URL to fetch"
            },
            "raw": {
                "type": "boolean",
                "default": false,
                "description": "Get the actual HTML content of the requested page, without simplification."
            }
        },
        "required": ["url"]
    }
}

# Example MCP tool call (JSON-RPC format)
{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
        "name": "fetch",
        "arguments": {
            "url": "https://example.com/article",
            "raw": false
        }
    },
    "id": 1
}

# Response format
{
    "jsonrpc": "2.0",
    "result": {
        "content": [
            {
                "type": "text",
                "text": "Content extracted using Browser (detected type: html):\n\nContents of https://example.com/article:\n\n[Extracted markdown content...]"
            }
        ]
    },
    "id": 1
}
```

## MCP Prompt: fetch

The `fetch` prompt allows users to request URL content extraction through the MCP prompt interface. It uses the same multi-method extraction pipeline as the tool but is initiated by an explicit user prompt rather than an autonomous tool call.

```python
# Prompt schema (as exposed via MCP)
{
    "name": "fetch",
    "description": "Fetch a URL and extract its contents as markdown using browser automation",
    "arguments": [
        {
            "name": "url",
            "description": "URL to fetch",
            "required": true
        }
    ]
}

# Example MCP prompt request (JSON-RPC format)
{
    "jsonrpc": "2.0",
    "method": "prompts/get",
    "params": {
        "name": "fetch",
        "arguments": {
            "url": "https://docs.python.org/3/library/asyncio.html"
        }
    },
    "id": 2
}

# Response format
{
    "jsonrpc": "2.0",
    "result": {
        "description": "Contents of https://docs.python.org/3/library/asyncio.html",
        "messages": [
            {
                "role": "user",
                "content": {
                    "type": "text",
                    "text": "Content extracted using HTML_Original (detected type: html):\n\n[Extracted content...]"
                }
            }
        ]
    },
    "id": 2
}
```

## Docker Installation and Configuration

The server is designed to run in a Docker container that includes Chrome, Tesseract OCR, and all required dependencies. Build and run the container, then configure your MCP client to use it.

```bash
# Build the Docker image
docker build -t mcp-server-fetch .
# Run the server (interactive mode for MCP stdio communication)
docker run --rm -i mcp-server-fetch

# Run with custom logging level
docker run --rm -i mcp-server-fetch mcp-server-fetch --log-level DEBUG

# Run with custom user agent
docker run --rm -i mcp-server-fetch mcp-server-fetch --user-agent "MyCustomAgent/1.0"

# Run with mounted volumes for logs and output
docker run --rm -i \
    -v $(pwd)/logs:/app/logs \
    -v $(pwd)/output:/app/output \
    mcp-server-fetch
```

## Claude/Roo Code MCP Configuration

Configure your MCP client (Claude Desktop, Roo Code, etc.) to use the fetch server by adding the appropriate configuration to your settings file.

```json
{
    "mcpServers": {
        "fetch": {
            "command": "docker",
            "args": [
                "run",
                "--rm",
                "-i",
                "mcp-server-fetch"
            ],
            "disabled": false,
            "alwaysAllow": []
        }
    }
}
```

## Content Extraction Functions

### extract_content_from_html

Converts raw HTML content to simplified Markdown format using readability algorithms for clean text extraction.

```python
from mcp_server_fetch.server import extract_content_from_html

html_content = """

<html>
<head><title>Article</title></head>
<body>
<h1>Main Article Title</h1>
<p>This is the main content of the article that will be extracted.</p>
<p>Additional paragraph with important information.</p>
</body>
</html>
"""

# Extract and convert to markdown
markdown_content = extract_content_from_html(html_content)
# Output: "# Main Article Title\n\nThis is the main content..."
```

### extract_html_with_requests

Extracts text content from a URL using the requests library and BeautifulSoup, removing scripts, styles, and navigation elements.

```python
from mcp_server_fetch.server import extract_html_with_requests

# Fetch and extract text from a URL
url = "https://example.com/documentation"
text_content = extract_html_with_requests(url)

# Returns cleaned text with scripts, styles, headers, footers, and nav removed
# Output: "Example Domain\nThis domain is for use in illustrative examples..."
```

### Document Parsing Functions

Parse various document formats and extract their text content for processing by LLMs.

```python
from mcp_server_fetch.server import _parse_pdf, _parse_docx, _parse_pptx

# Parse PDF content
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
pdf_text = _parse_pdf(pdf_bytes)
# Output: "Page 1 content...\n\nPage 2 content..."

# Parse DOCX content (includes text from paragraphs and tables)
with open("document.docx", "rb") as f:
    docx_bytes = f.read()
docx_text = _parse_docx(docx_bytes)
# Output: "Paragraph 1\n\nParagraph 2\n\nTable row 1 | Column 2 | Column 3"

# Parse PPTX content (includes slide content and speaker notes)
with open("presentation.pptx", "rb") as f:
    pptx_bytes = f.read()
pptx_text = _parse_pptx(pptx_bytes)
# Output: "Slide 1:\nTitle\nBullet point 1\nNotes: Speaker notes here\n\nSlide 2:..."
```

### fetch_url_with_multiple_methods

The main extraction function that tries multiple methods and returns the best result using a scoring system.
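Conceptually, the selection step can be sketched as a small standalone function. The following is a hypothetical, simplified re-implementation for illustration only (the names `score_result`, `pick_best`, and `ERROR_INDICATORS` are assumptions, not part of the server's actual API); it mirrors the scoring criteria documented for `choose_best_result`.

```python
# Hypothetical simplified sketch of the server's result scoring and selection.
# The real implementation in mcp_server_fetch.server may differ in detail.

ERROR_INDICATORS = ("failed to load", "access denied", "error")  # assumed markers


def score_result(text: str) -> float:
    """Score extracted text by length, paragraph structure, and error markers."""
    # Base score: 1 point per 100 characters, capped at 50 points
    score = min(len(text) / 100, 50)
    # Structure bonus: points for paragraph count, capped at 20 points
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    score += min(len(paragraphs), 20)
    # Penalty: 50% reduction for very short content
    if len(text) < 100:
        score *= 0.5
    # Penalty: 90% reduction when an error indicator appears
    if any(marker in text.lower() for marker in ERROR_INDICATORS):
        score *= 0.1
    return score


def pick_best(results: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (method, text) pair with the highest score."""
    return max(results, key=lambda pair: score_result(pair[1]))
```

With this heuristic, a long, well-structured OCR result beats a short browser result, and any result containing an error message is effectively eliminated.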
```python
import asyncio
from mcp_server_fetch.server import fetch_url_with_multiple_methods

async def fetch_content():
    url = "https://example.com/complex-page"
    user_agent = "ModelContextProtocol/1.0"

    # Fetches using: browser automation, OCR, requests/BeautifulSoup, and readability
    content, prefix = await fetch_url_with_multiple_methods(url, user_agent)

    # prefix contains extraction method info:
    # "Content extracted using Browser (detected type: html):\n\n"
    # content contains the extracted text/markdown
    print(f"{prefix}{content}")

asyncio.run(fetch_content())
```

### Content Scoring System

The `choose_best_result` function scores extracted content based on length, structure, and quality indicators.

```python
from mcp_server_fetch.server import choose_best_result

# Multiple extraction results from different methods
results = [
    ("Browser", "Short text"),  # Low score due to length
    ("OCR", "Extracted text with proper paragraphs.\n\nMultiple sections.\n\nWell structured content with good length."),  # Higher score
    ("HTML", "Failed to load page"),  # Penalized for error indicator
]

# Scoring criteria:
# - Base score: 1 point per 100 characters (max 50 points)
# - Structure bonus: points for paragraph count (max 20 points)
# - Penalty: 50% reduction for content under 100 characters
# - Penalty: 90% reduction for error indicators
best_method, best_text = choose_best_result(results)
# Returns: ("OCR", "Extracted text with proper paragraphs...")
```

## Command Line Interface

The server can be run directly with various configuration options for logging and user agent customization.
```bash
# Run with default settings
mcp-server-fetch

# Run with debug logging to stderr
mcp-server-fetch --log-level DEBUG

# Run with logging to a file
mcp-server-fetch --log-level INFO --log-file /path/to/mcp-fetch.log

# Run with custom user agent
mcp-server-fetch --user-agent "CustomBot/2.0 (+https://example.com/bot)"

# Combined options
mcp-server-fetch \
    --log-level DEBUG \
    --log-file ./fetch-debug.log \
    --user-agent "MyApp/1.0"
```

## Summary

MCP Server Fetch is designed for AI applications that need to access and process web content. Primary use cases include fetching documentation and reference materials for LLM context, extracting content from JavaScript-heavy websites that block simple HTTP requests, processing PDF/DOCX/PPTX documents from URLs, and bypassing common anti-bot measures through browser automation. The multi-method extraction approach keeps success rates high across diverse web technologies.

Integration follows the Model Context Protocol standard, making the server compatible with Claude Desktop, Roo Code, and other MCP clients. The Docker-based deployment bundles all dependencies, including Chrome, Tesseract OCR, and layout detection models, ensuring consistent behavior across environments. The server communicates via stdio using JSON-RPC, enabling seamless integration with MCP-compatible applications that need reliable web content fetching capabilities.
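The JSON-RPC exchange shown in the tool examples above can be driven from any language. As a rough illustration, the helpers below compose a `tools/call` request line and pull the extracted text out of a response; the function names are hypothetical, newline-delimited framing is an assumption about the stdio transport, and in practice an MCP client library handles this for you.

```python
import json


def make_fetch_request(url: str, request_id: int, raw: bool = False) -> str:
    """Build one newline-delimited JSON-RPC request invoking the fetch tool."""
    request = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {"name": "fetch", "arguments": {"url": url, "raw": raw}},
        "id": request_id,
    }
    return json.dumps(request) + "\n"


def extract_text(response_line: str) -> str:
    """Concatenate the text items from a fetch tool response."""
    response = json.loads(response_line)
    return "".join(
        item["text"]
        for item in response["result"]["content"]
        if item["type"] == "text"
    )
```

A caller would write the request line to the container's stdin and read the matching response line from its stdout.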