# MCP Server Fetch

MCP Server Fetch is a Model Context Protocol (MCP) server that provides advanced web content fetching capabilities for Large Language Models. It enables LLMs to retrieve and process content from web pages using browser automation, OCR, and multiple extraction methods, even for pages that require JavaScript rendering or employ anti-scraping techniques.

The server implements a multi-method extraction pipeline: browser automation with undetected-chromedriver, OCR using pytesseract with layout detection, HTML extraction via requests/BeautifulSoup, and document parsing for PDF, DOCX, and PPTX files. A scoring system automatically selects the best extraction result based on content length, structure quality, and error detection.

## MCP Tool: fetch

The `fetch` tool retrieves content from a URL using browser automation and multi-method extraction. It automatically handles cookie consent banners, captures full-page screenshots for OCR, and supports various document formats including HTML, PDF, DOCX, and PPTX.

```python
# Tool schema (as exposed via MCP)
{
    "name": "fetch",
    "description": "Fetches a URL from the internet using browser automation and multi-method extraction (including OCR).",
    "inputSchema": {
        "type": "object",
        "properties": {
            "url": {
                "type": "string",
                "format": "uri",
                "description": "URL to fetch"
            },
            "raw": {
                "type": "boolean",
                "default": false,
                "description": "Get the actual HTML content of the requested page, without simplification."
            }
        },
        "required": ["url"]
    }
}

# Example MCP tool call (JSON-RPC format)
{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
        "name": "fetch",
        "arguments": {
            "url": "https://example.com/article",
            "raw": false
        }
    },
    "id": 1
}

# Response format
{
    "jsonrpc": "2.0",
    "result": {
        "content": [
            {
                "type": "text",
                "text": "Content extracted using Browser (detected type: html):\n\nContents of https://example.com/article:\n\n[Extracted markdown content...]"
            }
        ]
    },
    "id": 1
}
```

## MCP Prompt: fetch

The `fetch` prompt allows users to request URL content extraction through the MCP prompt interface. It uses the same multi-method extraction pipeline as the tool but is initiated by an explicit user prompt rather than an autonomous tool call.

```python
# Prompt schema (as exposed via MCP)
{
    "name": "fetch",
    "description": "Fetch a URL and extract its contents as markdown using browser automation",
    "arguments": [
        {
            "name": "url",
            "description": "URL to fetch",
            "required": true
        }
    ]
}

# Example MCP prompt request (JSON-RPC format)
{
    "jsonrpc": "2.0",
    "method": "prompts/get",
    "params": {
        "name": "fetch",
        "arguments": {
            "url": "https://docs.python.org/3/library/asyncio.html"
        }
    },
    "id": 2
}

# Response format
{
    "jsonrpc": "2.0",
    "result": {
        "description": "Contents of https://docs.python.org/3/library/asyncio.html",
        "messages": [
            {
                "role": "user",
                "content": {
                    "type": "text",
                    "text": "Content extracted using HTML_Original (detected type: html):\n\n[Extracted content...]"
                }
            }
        ]
    },
    "id": 2
}
```

## Docker Installation and Configuration

The server is designed to run in a Docker container that includes Chrome, Tesseract OCR, and all required dependencies. Build and run the container, then configure your MCP client to use it.

```bash
# Build the Docker image
docker build -t mcp-server-fetch .
# Run the server (interactive mode for MCP stdio communication)
docker run --rm -i mcp-server-fetch

# Run with custom logging level
docker run --rm -i mcp-server-fetch mcp-server-fetch --log-level DEBUG

# Run with custom user agent
docker run --rm -i mcp-server-fetch mcp-server-fetch --user-agent "MyCustomAgent/1.0"

# Run with mounted volumes for logs and output
docker run --rm -i \
    -v $(pwd)/logs:/app/logs \
    -v $(pwd)/output:/app/output \
    mcp-server-fetch
```

## Claude/Roo Code MCP Configuration

Configure your MCP client (Claude Desktop, Roo Code, etc.) to use the fetch server by adding the appropriate configuration to your settings file.

```json
{
    "mcpServers": {
        "fetch": {
            "command": "docker",
            "args": [
                "run",
                "--rm",
                "-i",
                "mcp-server-fetch"
            ],
            "disabled": false,
            "alwaysAllow": []
        }
    }
}
```

## Content Extraction Functions

### extract_content_from_html

Converts raw HTML content to simplified Markdown format using readability algorithms for clean text extraction.

```python
from mcp_server_fetch.server import extract_content_from_html

html_content = """

<html>
<head><title>Article</title></head>
<body>
<h1>Main Article Title</h1>
<p>This is the main content of the article that will be extracted.</p>
<p>Additional paragraph with important information.</p>
</body>
</html>
"""

# Extract and convert to markdown
markdown_content = extract_content_from_html(html_content)
# Output: "# Main Article Title\n\nThis is the main content..."
```

### extract_html_with_requests

Extracts text content from a URL using the requests library and BeautifulSoup, removing scripts, styles, and navigation elements.

```python
from mcp_server_fetch.server import extract_html_with_requests

# Fetch and extract text from a URL
url = "https://example.com/documentation"
text_content = extract_html_with_requests(url)

# Returns cleaned text with scripts, styles, headers, footers, and nav removed
# Output: "Example Domain\nThis domain is for use in illustrative examples..."
```

### Document Parsing Functions

Parse various document formats and extract their text content for processing by LLMs.

```python
from mcp_server_fetch.server import _parse_pdf, _parse_docx, _parse_pptx

# Parse PDF content
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
pdf_text = _parse_pdf(pdf_bytes)
# Output: "Page 1 content...\n\nPage 2 content..."

# Parse DOCX content (includes text from paragraphs and tables)
with open("document.docx", "rb") as f:
    docx_bytes = f.read()
docx_text = _parse_docx(docx_bytes)
# Output: "Paragraph 1\n\nParagraph 2\n\nTable row 1 | Column 2 | Column 3"

# Parse PPTX content (includes slide content and speaker notes)
with open("presentation.pptx", "rb") as f:
    pptx_bytes = f.read()
pptx_text = _parse_pptx(pptx_bytes)
# Output: "Slide 1:\nTitle\nBullet point 1\nNotes: Speaker notes here\n\nSlide 2:..."
```

### fetch_url_with_multiple_methods

The main extraction function that tries multiple methods and returns the best result using a scoring system.
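Conceptually, the selection step can be sketched as a small standalone function. The following is a hypothetical, simplified re-implementation for illustration only (the names `score_result`, `pick_best`, and `ERROR_INDICATORS` are assumptions, not part of the server's actual API); it mirrors the scoring criteria documented for `choose_best_result`.

```python
# Hypothetical simplified sketch of the server's result scoring and selection.
# The real implementation in mcp_server_fetch.server may differ in detail.

ERROR_INDICATORS = ("failed to load", "access denied", "error")  # assumed markers


def score_result(text: str) -> float:
    """Score extracted text by length, paragraph structure, and error markers."""
    # Base score: 1 point per 100 characters, capped at 50 points
    score = min(len(text) / 100, 50)
    # Structure bonus: points for paragraph count, capped at 20 points
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    score += min(len(paragraphs), 20)
    # Penalty: 50% reduction for very short content
    if len(text) < 100:
        score *= 0.5
    # Penalty: 90% reduction when an error indicator appears
    if any(marker in text.lower() for marker in ERROR_INDICATORS):
        score *= 0.1
    return score


def pick_best(results: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (method, text) pair with the highest score."""
    return max(results, key=lambda pair: score_result(pair[1]))
```

With this heuristic, a long, well-structured OCR result beats a short browser result, and any result containing an error message is effectively eliminated.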
```python
import asyncio
from mcp_server_fetch.server import fetch_url_with_multiple_methods

async def fetch_content():
    url = "https://example.com/complex-page"
    user_agent = "ModelContextProtocol/1.0"

    # Fetches using: browser automation, OCR, requests/BeautifulSoup, and readability
    content, prefix = await fetch_url_with_multiple_methods(url, user_agent)

    # prefix contains extraction method info:
    # "Content extracted using Browser (detected type: html):\n\n"
    # content contains the extracted text/markdown
    print(f"{prefix}{content}")

asyncio.run(fetch_content())
```

### Content Scoring System

The `choose_best_result` function scores extracted content based on length, structure, and quality indicators.

```python
from mcp_server_fetch.server import choose_best_result

# Multiple extraction results from different methods
results = [
    ("Browser", "Short text"),  # Low score due to length
    ("OCR", "Extracted text with proper paragraphs.\n\nMultiple sections.\n\nWell structured content with good length."),  # Higher score
    ("HTML", "Failed to load page"),  # Penalized for error indicator
]

# Scoring criteria:
# - Base score: 1 point per 100 characters (max 50 points)
# - Structure bonus: points for paragraph count (max 20 points)
# - Penalty: 50% reduction for content under 100 characters
# - Penalty: 90% reduction for error indicators
best_method, best_text = choose_best_result(results)
# Returns: ("OCR", "Extracted text with proper paragraphs...")
```

## Command Line Interface

The server can be run directly with various configuration options for logging and user agent customization.
```bash
# Run with default settings
mcp-server-fetch

# Run with debug logging to stderr
mcp-server-fetch --log-level DEBUG

# Run with logging to a file
mcp-server-fetch --log-level INFO --log-file /path/to/mcp-fetch.log

# Run with custom user agent
mcp-server-fetch --user-agent "CustomBot/2.0 (+https://example.com/bot)"

# Combined options
mcp-server-fetch \
    --log-level DEBUG \
    --log-file ./fetch-debug.log \
    --user-agent "MyApp/1.0"
```

## Summary

MCP Server Fetch is designed for AI applications that need to access and process web content. Primary use cases include fetching documentation and reference materials for LLM context, extracting content from JavaScript-heavy websites that block simple HTTP requests, processing PDF/DOCX/PPTX documents from URLs, and bypassing common anti-bot measures through browser automation. The multi-method extraction approach keeps success rates high across diverse web technologies.

Integration follows the Model Context Protocol standard, making the server compatible with Claude Desktop, Roo Code, and other MCP clients. The Docker-based deployment bundles all dependencies, including Chrome, Tesseract OCR, and layout detection models, ensuring consistent behavior across environments. The server communicates via stdio using JSON-RPC, enabling seamless integration with MCP-compatible applications that need reliable web content fetching capabilities.
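The JSON-RPC exchange shown in the tool examples above can be driven from any language. As a rough illustration, the helpers below compose a `tools/call` request line and pull the extracted text out of a response; the function names are hypothetical, newline-delimited framing is an assumption about the stdio transport, and in practice an MCP client library handles this for you.

```python
import json


def make_fetch_request(url: str, request_id: int, raw: bool = False) -> str:
    """Build one newline-delimited JSON-RPC request invoking the fetch tool."""
    request = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {"name": "fetch", "arguments": {"url": url, "raw": raw}},
        "id": request_id,
    }
    return json.dumps(request) + "\n"


def extract_text(response_line: str) -> str:
    """Concatenate the text items from a fetch tool response."""
    response = json.loads(response_line)
    return "".join(
        item["text"]
        for item in response["result"]["content"]
        if item["type"] == "text"
    )
```

A caller would write the request line to the container's stdin and read the matching response line from its stdout.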