### Install Dependencies and Setup

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/CONTRIBUTING.md

Clone the repository, install project dependencies using uv, and set up Playwright browsers. This is the initial setup for development.

```bash
git clone https://github.com/potterdigital/crawl4ai-mcp.git
cd crawl4ai-mcp

# Install dependencies (uv manages the virtualenv)
uv sync

# Install Playwright browser (Chromium, required by crawl4ai)
uv run crawl4ai-setup

# Verify everything works
uv run pytest
uv run ruff check src/
```

--------------------------------

### Install Dependencies and Setup

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Installs project dependencies using uv and sets up Playwright browsers. Run `uv run crawl4ai-doctor` to verify.

```bash
git clone https://github.com/potterdigital/crawl4ai-mcp.git
cd crawl4ai-mcp

# Install dependencies (uv manages the virtualenv automatically)
uv sync

# Install Playwright browser (required by crawl4ai; downloads Chromium)
uv run crawl4ai-setup

# Verify the installation
uv run crawl4ai-doctor
```

--------------------------------

### Check Playwright/Chromium Setup

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/README.md

Verify the setup and configuration of Playwright and Chromium, which are dependencies for certain project functionalities.

```bash
# Check Playwright/Chromium
uv run crawl4ai-setup
```

--------------------------------

### API Reference File Example

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/00-START-HERE.md

Example of a specific API reference file for the 'crawl_url' tool, illustrating the type of documentation provided for each MCP tool.

```markdown
/api-reference/crawl_url.md
```

--------------------------------

### Install Playwright and Chromium

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Execute these commands to install Playwright and its necessary browser binaries, typically after a Playwright version upgrade or if they are missing.

```bash
uv sync
uv run crawl4ai-setup
```

--------------------------------

### Create Session with Initial URL

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/create_session.md

Create a session and immediately navigate to a specified URL. This is useful for starting workflows on login pages or specific entry points.

```python
response = await client.call_tool("create_session", {
    "session_id": "github-auth",
    "url": "https://github.com/login"
})
# Response includes login page HTML
```

--------------------------------

### Manifest JSON Format Example

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/crawl_many.md

Shows the structure of the manifest.json file generated when `output_dir` is specified. It lists each crawled URL, its corresponding file path, and the success status.

```json
[
  {
    "url": "https://example.com/page1",
    "file": "example_com_page1.md",
    "success": true
  },
  {
    "url": "https://example.com/page2",
    "success": false,
    "error": "HTTP 404 Not Found"
  }
]
```

--------------------------------

### Check Session Existence Example

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Before performing operations like destroying a session, verify its existence by calling `list_sessions` and checking the response.

```python
response = await client.call_tool("list_sessions")
# Check if session_id appears in response
```

--------------------------------

### Destroy and Create Session Example

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Use this snippet when a session already exists and you need to re-initialize it. It first destroys the existing session and then creates a new one with the same ID.

```python
await client.call_tool("destroy_session", {"session_id": "my-session"})
await client.call_tool("create_session", {"session_id": "my-session", "url": "..."})
```

--------------------------------

### Example Custom Profile Definition

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/list_profiles.md

Defines a custom crawl profile in YAML format. This example sets specific configurations for handling complex pages, including wait conditions, page timeout, and scan settings.

```yaml
wait_until: networkidle
page_timeout: 120000  # 120 seconds for complex pages
scan_full_page: true
word_count_threshold: 5
```

--------------------------------

### Custom Profile Example for SPAs

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

An example custom profile for heavy JavaScript SPA pages, with extended timeouts and full page scanning.

```yaml
# profiles/slow_spa.yaml
wait_until: networkidle
page_timeout: 120000  # 120 seconds for complex SPAs
scan_full_page: true
scroll_delay: 1.0
word_count_threshold: 5
```

--------------------------------

### Example Error Response Format

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

This is a general example of the structured string format returned by crawl4ai-mcp tools on error.

```text
Crawl failed
URL: https://example.com
HTTP status: 404
Error: Not Found
```

--------------------------------

### Setting LLM API Keys via Environment Variables

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

Configures API keys for various LLM providers using bash export commands before starting the MCP server.

```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."

# Then start the server
uv run python -m crawl4ai_mcp.server
```

--------------------------------

### Create Session with Pre-injected Cookies

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/create_session.md

Initialize a session with a set of cookies already provided. This is useful if you have obtained session cookies through other means and want to start an authenticated session.

```python
response = await client.call_tool("create_session", {
    "session_id": "authenticated",
    "cookies": [
        {
            "name": "sessionid",
            "value": "abc123def456",
            "domain": "example.com"
        },
        {
            "name": "user_pref",
            "value": "dark_mode",
            "domain": "example.com",
            "path": "/"
        }
    ]
})
```

--------------------------------

### Example Sitemap Fetch Error

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

A sample error message indicating a failure to fetch or parse a sitemap XML file.

```text
Sitemap fetch failed
URL: https://example.com/sitemap.xml
Error: 404 Client Error: Not Found
```

--------------------------------

### Run Crawl4AI MCP Server Locally

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/README.md

Provides instructions for running the crawl4ai-mcp server locally during development. It involves navigating to the project directory, syncing dependencies, and starting the server.

```bash
cd crawl4ai-mcp
uv sync
uv run python -m crawl4ai_mcp.server
```

--------------------------------

### Attempt to Destroy Non-Existent Session

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/destroy_session.md

This example demonstrates the expected output when attempting to destroy a session that does not exist. The API will return an informative error message.

```python
response = await client.call_tool("destroy_session", {
    "session_id": "unknown-session"
})
# Response: "Session not found: unknown-session"
```

--------------------------------

### Limited Depth Crawl

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Crawls a website up to a specified depth and page limit. This example limits the crawl to depth 2 (start page, first level links, second level links) and a maximum of 50 pages.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://example.com",
    "max_depth": 2,
    "max_pages": 50
})
```

--------------------------------

### Calling crawl_url with js_heavy profile

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Example of how to call the crawl_url function with a specific profile to handle slow-loading pages.

```python
crawl_url(url="https://slow-spa.example.com", profile="js_heavy")
```

--------------------------------

### Handle No Active Sessions

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/list_sessions.md

When no active sessions are found, the list_sessions tool returns a specific string indicating this state. This example shows how to handle that response.

```python
# When no sessions exist
response = await client.call_tool("list_sessions")
# Response: "No active sessions."
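
# A minimal sketch of acting on that string, assuming the tool result has
# already been reduced to plain text. `has_active_sessions` and
# NO_SESSIONS_MESSAGE are hypothetical names, not part of crawl4ai-mcp.
NO_SESSIONS_MESSAGE = "No active sessions."


def has_active_sessions(response_text: str) -> bool:
    """Return False only when the text equals the documented empty-state string."""
    return response_text.strip() != NO_SESSIONS_MESSAGE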
```

--------------------------------

### Fix Missing or Stale Playwright Chromium Binary

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Run crawl4ai-setup to download or update the Playwright Chromium binary, resolving issues with Chromium failing to start.

```bash
uv run crawl4ai-setup
```

--------------------------------

### Example Crawl with Profile and Override

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/list_profiles.md

Demonstrates how per-call parameters override profile settings. In this case, the 'page_timeout' parameter directly in the crawl_url call takes precedence over the timeout defined in the 'fast' profile.

```python
crawl_url(
    url="...",
    profile="fast",
    page_timeout=30
)
```

--------------------------------

### Deep Crawl with Path Filtering

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Crawls a website while applying include and exclude patterns to filter which URLs are followed. This example only follows links under '/docs/' and excludes pages under '/docs/admin/'.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://docs.example.com",
    "include_pattern": "/docs/*",
    "exclude_pattern": "/docs/admin/*"
})
```

--------------------------------

### Perform a Deep Crawl (BFS)

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/README.md

Initiates a Breadth-First Search crawl starting from a given URL, respecting maximum depth and page count limits. Results are saved to a specified directory.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://docs.example.com",
    "max_depth": 2,
    "max_pages": 50,
    "output_dir": "/tmp/crawl_results"
})
# Crawls start page, follows links to depth 2, saves to disk
```

--------------------------------

### Deep Crawl Saving to Disk

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Crawls a website and saves the crawled content as Markdown files and a manifest.json to a specified output directory. This example crawls up to depth 2.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://blog.example.com",
    "max_depth": 2,
    "output_dir": "/tmp/blog_crawl"
})
```

--------------------------------

### Extract Structured Data with Local Ollama

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/extract_structured.md

This example demonstrates how to use the extract_structured tool with a local Ollama provider to extract executive summary and key metrics from a given URL, based on a provided JSON schema.

```APIDOC
## call_tool extract_structured (Local Ollama)

### Description
Extracts structured data from a web page using a specified schema and instruction, leveraging a local Ollama provider.

### Method
`client.call_tool("extract_structured", { ... })`

### Parameters
#### Tool Arguments
- **url** (string) - Required - The URL of the web page to extract data from.
- **schema** (object) - Required - A JSON schema defining the structure of the data to be extracted.
- **instruction** (string) - Required - A natural language instruction detailing what data to extract.
- **provider** (string) - Required - The LLM provider to use. Example: "ollama/llama2".

### Request Example
```json
{
  "url": "https://internal.example.com/report",
  "schema": {
    "type": "object",
    "properties": {
      "summary": {"type": "string"},
      "metrics": {
        "type": "object",
        "properties": {
          "revenue": {"type": "number"},
          "growth": {"type": "number"}
        }
      }
    }
  },
  "instruction": "Extract executive summary and key metrics",
  "provider": "ollama/llama2"
}
```

### Response
#### Success Response
- The structured data extracted from the page, conforming to the provided schema.
```

--------------------------------

### Extract API Endpoints with Anthropic Claude Sonnet

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/extract_structured.md

This example demonstrates extracting API endpoint details from a documentation page using Anthropic's Claude Sonnet model. It specifies a schema for API paths, methods, and descriptions.

```python
response = await client.call_tool("extract_structured", {
    "url": "https://docs.example.com",
    "schema": {
        "type": "object",
        "properties": {
            "api_endpoints": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"},
                        "method": {"type": "string"},
                        "description": {"type": "string"}
                    }
                }
            }
        }
    },
    "instruction": "Extract all API endpoints with their HTTP methods and descriptions",
    "provider": "anthropic/claude-sonnet-4-20250514"
})
```

--------------------------------

### Deep Crawl with Politeness Delay

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Performs a deep crawl with a specified delay between page requests to be polite to the target server. This example sets a 1-second delay and limits the crawl to 30 pages.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://target.com",
    "delay": 1.0,  # 1 second between requests
    "max_pages": 30
})
```

--------------------------------

### Extract Structured Data with CSS Scoping and Wait Condition

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/extract_structured.md

This example shows how to extract structured data from a dynamic web page using CSS selectors for scoping and a wait condition to ensure content is loaded, utilizing an OpenAI provider.

```APIDOC
## call_tool extract_structured (with CSS Scoping)

### Description
Extracts structured data from a web page, allowing for specific element targeting using CSS selectors and defining a wait condition for dynamic content loading. Uses a specified LLM provider.

### Method
`client.call_tool("extract_structured", { ... })`

### Parameters
#### Tool Arguments
- **url** (string) - Required - The URL of the web page to extract data from.
- **css_selector** (string) - Optional - A CSS selector to scope the extraction to a specific part of the page.
- **wait_for** (string) - Optional - A condition to wait for before extraction (e.g., "css:table.data-table").
- **schema** (object) - Required - A JSON schema defining the structure of the data to be extracted.
- **instruction** (string) - Required - A natural language instruction detailing what data to extract.
- **provider** (string) - Required - The LLM provider to use. Example: "openai/gpt-4o-mini".

### Request Example
```json
{
  "url": "https://spa.example.com/table",
  "css_selector": "div.main-table",
  "wait_for": "css:table.data-table",
  "schema": {
    "type": "object",
    "properties": {
      "rows": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "id": {"type": "string"},
            "status": {"type": "string"}
          }
        }
      }
    }
  },
  "instruction": "Extract all rows with ID and status",
  "provider": "openai/gpt-4o-mini"
}
```

### Response
#### Success Response
- The structured data extracted from the specified page section, conforming to the provided schema.
```

--------------------------------

### Manifest JSON Output Example

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

This JSON structure represents the output of a deep crawl operation when an output directory is specified. It details crawled URLs, their corresponding files, success status, depth, and parent URL.

```json
[
  {
    "url": "https://example.com",
    "file": "example_com.md",
    "success": true,
    "depth": 0
  },
  {
    "url": "https://example.com/page1",
    "file": "example_com_page1.md",
    "success": true,
    "depth": 1,
    "parent_url": "https://example.com"
  },
  {
    "url": "https://example.com/broken",
    "success": false,
    "error": "HTTP 404 Not Found"
  }
]
```

--------------------------------

### Multi-step Authentication Flow - Step 1: Create Session and Load Login Page

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/create_session.md

Initiates a session and loads the initial login page. This is the first step in a multi-step authentication process, setting up the environment for subsequent interactions.

```python
# Step 1: Create session and load login page
response = await client.call_tool("create_session", {
    "session_id": "authenticated-user",
    "url": "https://app.example.com/login"
})
```

--------------------------------

### Crawl and Save to Disk with Profile

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/crawl_many.md

Illustrates crawling multiple URLs and saving the results to disk using a specific profile ('js_heavy'). The response will be a metadata summary, and the actual content will be in markdown files within the specified output directory.

```python
response = await client.call_tool("crawl_many", {
    "urls": [
        "https://docs.example.com/intro",
        "https://docs.example.com/api",
        "https://docs.example.com/examples"
    ],
    "profile": "js_heavy",
    "output_dir": "/tmp/crawl_results"
})
```

--------------------------------

### Diagnose Crawl4ai/Playwright Health

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Run the crawl4ai-doctor command to diagnose the health of crawl4ai and Playwright installations.

```bash
uv run crawl4ai-doctor
```

--------------------------------

### List All Profiles

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/list_profiles.md

This tool lists all available crawl profiles and their configuration settings. Profiles provide named starting-point configurations that can be referenced in crawl tools via the `profile` parameter.

```APIDOC
## list_profiles

### Description
Lists all available crawl profiles and their configuration settings.

### Method
`list_profiles` (Tool Call)

### Parameters
None

### Response Example
```
## default (base layer — applied to every crawl)
wait_until: domcontentloaded
page_timeout: 60000
word_count_threshold: 10

## fast
wait_until: domcontentloaded
page_timeout: 15000
word_count_threshold: 5

## js_heavy
delay_before_return_html: 1.0
page_timeout: 90000
remove_overlay_elements: true
scan_full_page: true
scroll_delay: 0.5
wait_until: networkidle

## stealth
delay_before_return_html: 2.0
magic: true
max_range: 2.0
mean_delay: 1.5
override_navigator: true
page_timeout: 90000
simulate_user: true
wait_until: networkidle
```
```

--------------------------------

### Initialize BrowserConfig for MCP Server

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

Configure the Chromium browser settings for the MCP server. Key settings include running in headless mode and ensuring verbose logging is disabled to maintain the integrity of the MCP transport.

```python
from crawl4ai import BrowserConfig

browser_cfg = BrowserConfig(
    headless=True,
    verbose=False,  # CRITICAL: must be False to protect MCP transport
    extra_args=[
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--no-sandbox",
    ],
)
```

--------------------------------

### Simple Product Extraction with CSS

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/extract_css.md

Extracts basic product information like title, price, and URL from a product listing page. Ensure the schema accurately reflects the page structure.

```python
# Simple product extraction
response = await client.call_tool("extract_css", {
    "url": "https://shop.example.com/products",
    "schema": {
        "name": "Products",
        "baseSelector": "div.product-card",
        "fields": [
            {
                "name": "title",
                "selector": "h2.product-title",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.price",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a.product-link",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }
})

# Response: JSON string
# [
#   {"title": "Product 1", "price": "$29.99", "url": "/products/1"},
#   {"title": "Product 2", "price": "$39.99", "url": "/products/2"}
# ]
```

--------------------------------

### Run Tests and Linting

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/CONTRIBUTING.md

Execute the test suite and run the linter to ensure code quality and correctness. All tests must pass and the code must be clean.

```bash
uv run pytest
```

```bash
uv run ruff check src/
```

--------------------------------

### Extract Product Data with Default Provider

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/extract_structured.md

Use this snippet to extract product information from a URL using the default LLM provider (OpenAI GPT-4o mini). The schema defines the expected structure for product name, price, and description.

```python
response = await client.call_tool("extract_structured", {
    "url": "https://shop.example.com/products",
    "schema": {
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "description": {"type": "string"}
                    }
                }
            }
        }
    },
    "instruction": "Extract all product names, prices, and descriptions from this page"
})
```

--------------------------------

### Basic Deep Crawl

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Performs a basic deep crawl starting from the given URL.
It follows links on the site up to a default depth of 3 and a maximum of 100 pages.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://docs.example.com"
})
```

--------------------------------

### Using a Custom Profile

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/list_profiles.md

Shows how to utilize a custom-defined crawl profile by specifying its name in the crawl_url function call.

```python
crawl_url(url="...", profile="heavy_js")
```

--------------------------------

### Fast Profile Configuration

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

Optimized for static pages and quick fetches, with a shorter page timeout and lower word count threshold.

```yaml
wait_until: domcontentloaded
page_timeout: 15000        # 15 seconds, fail fast
word_count_threshold: 5    # Retain short blocks
```

--------------------------------

### check_update

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/check_update.md

Checks if a newer version of crawl4ai is available on PyPI. It compares the installed version against the latest release and reports version information, including changelog highlights if an update is found.

```APIDOC
## check_update

### Description
Checks if a newer version of crawl4ai is available on PyPI. Compares the installed version against the latest release and reports version information with changelog highlights.

### Method
`check_update` (Tool Call)

### Parameters
None

### Request Example
```python
response = await client.call_tool("check_update")
```

### Response
#### Success Response
**Type:** `str`

Returns version comparison result:
- If up to date: "crawl4ai is up to date\nInstalled: X.Y.Z\nLatest: X.Y.Z"
- If update available: version info + release link + changelog highlights
- If check fails: error description with installed version and failure reason

#### Response Example
```
# Response (if up to date):
crawl4ai is up to date
Installed: 0.8.2
Latest: 0.8.2

# Response (if update available):
Update available
Installed: 0.8.1
Latest: 0.8.2
Release: https://github.com/unclecode/crawl4ai/releases/tag/v0.8.2
To upgrade: stop the server and run: scripts/update.sh

Changelog highlights:
### Bug Fixes
- **Fixed** headless browser detection bypass
- **Fixed** cookie handling in sessions
### Features
- **Added** support for custom extraction strategies

# Response (if check fails):
Version check failed
Installed: 0.8.1
Error: Could not reach PyPI (Connection timeout)
```

### Throws
Does not raise exceptions. Returns error as string if PyPI check fails.
```

--------------------------------

### Configuring API Key in MCP Client

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Illustrates how to configure an API key within the MCP client configuration for crawl4ai.

```json
{
  "mcpServers": {
    "crawl4ai": {
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

--------------------------------

### Run Server and Redirect Output for Logging

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Run the crawl4ai-mcp server with stderr redirected to the terminal and stdout discarded, so that only what the server writes to stderr (its log output) is shown.

```bash
uv run python -m crawl4ai_mcp.server 2>&1 1>/dev/null
```

--------------------------------

### Authenticated Multi-Step Workflow

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/README.md

This snippet demonstrates an authenticated multi-step workflow. It includes creating a session, interacting via JavaScript to log in, and then crawling authenticated pages.

```python
# 1. Create session and log in
await client.call_tool("create_session", {
    "session_id": "user-auth",
    "url": "https://app.example.com/login"
})

# 2. Interact via JavaScript
await client.call_tool("crawl_url", {
    "session_id": "user-auth",
    "url": "https://app.example.com/login",
    "js_code": "document.querySelector('button[type=submit]').click();"
})

# 3. Crawl authenticated pages
response = await client.call_tool("crawl_url", {
    "session_id": "user-auth",
    "url": "https://app.example.com/dashboard"
})
```

--------------------------------

### BFS Crawl Metadata Structure

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/types.md

Metadata included with results from deep_crawl, indicating the depth from the start URL and the parent URL. This metadata is preserved in manifest.json if an output directory is specified.

```python
{
    "depth": int,        # Distance from start URL (0 for root)
    "parent_url": str    # URL that linked to this page
}
```

--------------------------------

### deep_crawl

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Initiates a deep crawl of a website starting from a given URL. It follows links recursively up to a specified depth or page limit, with options for scope, filtering, delays, and output.

```APIDOC
## deep_crawl

### Description
Initiates a deep crawl of a website starting from a given URL. It follows links recursively up to a specified depth or page limit, with options for scope, filtering, delays, and output.

### Method
`deep_crawl` (Tool Call)

### Parameters
#### Tool Arguments
- **url** (str) - Required - Starting URL for the crawl
- **max_depth** (int) - Optional - Maximum link levels to follow (depth 0 is start page, 1 is linked pages, etc.). Default: 3
- **max_pages** (int) - Optional - Hard cap on total pages crawled. Stops when reached. Default: 100
- **scope** (str) - Optional - Domain scope: "same-domain" (include subdomains), "same-origin", or "any" (follow external links). Default: "same-domain"
- **include_pattern** (str) - Optional - Glob pattern to filter which URLs to follow (e.g., "/docs/*")
- **exclude_pattern** (str) - Optional - Glob pattern to exclude URLs (e.g., "/internal/*")
- **delay** (float) - Optional - Politeness delay in seconds between page fetches. Default: 0
- **output_dir** (str) - Optional - Directory for .md files and manifest.json. Returns metadata summary instead of content.
- **profile** (str) - Optional - Named profile for per-page configuration
- **cache_mode** (str) - Optional - Cache behavior: "enabled", "bypass", "disabled", "read_only", "write_only". Default: "enabled"
- **css_selector** (str) - Optional - CSS selector to restrict extraction on all pages
- **excluded_selector** (str) - Optional - CSS selector to exclude elements on all pages
- **wait_for** (str) - Optional - Wait condition before extracting each page
- **js_code** (str) - Optional - JavaScript to execute on each page
- **user_agent** (str) - Optional - Custom User-Agent for all requests
- **page_timeout** (int) - Optional - Page load timeout in seconds for each page. Default: 60
- **word_count_threshold** (int) - Optional - Minimum word count for content blocks.
Default: 10

### Request Example
```json
{
  "url": "https://docs.example.com",
  "max_depth": 3,
  "max_pages": 100,
  "scope": "same-domain",
  "include_pattern": null,
  "exclude_pattern": null,
  "delay": 0,
  "output_dir": null,
  "profile": null,
  "cache_mode": "enabled",
  "css_selector": null,
  "excluded_selector": null,
  "wait_for": null,
  "js_code": null,
  "user_agent": null,
  "page_timeout": 60,
  "word_count_threshold": 10
}
```

### Response
#### Success Response
**Type:** `str`

When `output_dir` is None: returns full markdown content for all crawled pages organized by depth, with parent URL info.

When `output_dir` is set: returns metadata-only summary listing file paths and manifest location.

Results include depth metadata (how far from start) and parent_url for each page.

#### Response Example
```json
"# Crawl Results\n\n## Depth 0\n### https://docs.example.com\n... markdown content ..."
```

### Error Handling
Does not raise exceptions. Partial failures are reported in the result string.
```

--------------------------------

### Set Environment Variables for API Keys

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/README.md

Configures API keys for various services (OpenAI, Anthropic, Groq) using environment variables. These are necessary for LLM-based extraction tools.

```bash
export OPENAI_API_KEY="sk-..."         # For OpenAI extraction
export ANTHROPIC_API_KEY="sk-ant-..."  # For Anthropic extraction
export GROQ_API_KEY="gsk_..."          # For Groq extraction
```

--------------------------------

### Call check_update Tool

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/check_update.md

Use this snippet to call the check_update tool from the client. The response will vary based on whether the crawl4ai installation is up to date, an update is available, or if the version check fails.

```python
# Check for updates
response = await client.call_tool("check_update")

# Response (if up to date):
# crawl4ai is up to date
# Installed: 0.8.2
# Latest: 0.8.2

# Response (if update available):
# Update available
# Installed: 0.8.1
# Latest: 0.8.2
# Release: https://github.com/unclecode/crawl4ai/releases/tag/v0.8.2
# To upgrade: stop the server and run: scripts/update.sh
#
# Changelog highlights:
# ### Bug Fixes
# - **Fixed** headless browser detection bypass
# - **Fixed** cookie handling in sessions
# ### Features
# - **Added** support for custom extraction strategies

# Response (if check fails):
# Version check failed
# Installed: 0.8.1
# Error: Could not reach PyPI (Connection timeout)
```

--------------------------------

### Setting API Key Environment Variable

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Shows how to set the required environment variable for an LLM provider before calling an extraction tool.

```bash
export OPENAI_API_KEY="sk-..."
```

--------------------------------

### Deep Crawl with JavaScript Execution

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Crawls a JavaScript-heavy website, executing custom JavaScript code and waiting for specific content to load. This example scrolls to the bottom of the page and waits for '#app-content' to be present.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://spa.example.com",
    "profile": "js_heavy",
    "js_code": "window.scrollTo(0, document.body.scrollHeight);",
    "wait_for": "css:#app-content"
})
```

--------------------------------

### Handling HTTP 404 Errors in Python

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/errors.md

Demonstrates how to check for specific HTTP status code errors in the crawl response and take action.

```python
result = await client.call_tool("crawl_url", {"url": "..."})

if result.startswith("Crawl failed"):
    # Parse error and decide next action
    if "404" in result:
        print("Page not found")
    elif "403" in result:
        print("Access denied — may need authentication")
```

--------------------------------

### Register crawl4ai-mcp with Other MCP Clients

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Configuration for adding crawl4ai-mcp as an MCP server in other clients. Ensure the `--directory` flag points to your local clone.

```json
{
  "crawl4ai": {
    "type": "stdio",
    "command": "uv",
    "args": [
      "run",
      "--directory",
      "/path/to/crawl4ai-mcp",
      "python",
      "-m",
      "crawl4ai_mcp.server"
    ]
  }
}
```

--------------------------------

### Deploy MCP Server using Claude CLI

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/README.md

Registers the crawl4ai MCP server for user scope using the Claude CLI, specifying the command and arguments to run the server. This is for integrating with Claude environments.

```bash
claude mcp add-json --scope user crawl4ai '{
  "type": "stdio",
  "command": "uv",
  "args": ["run", "--directory", "/path/to/crawl4ai-mcp", "python", "-m", "crawl4ai_mcp.server"]
}'
```

--------------------------------

### Configure PruningContentFilter with word_count_threshold

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

Demonstrates how the 'word_count_threshold' parameter is used to configure the PruningContentFilter. The threshold is popped from the merged configuration and used to initialize the filter, allowing per-call and profile-level control over content pruning.

```python
wct = merged.pop("word_count_threshold", 10)
merged["markdown_generator"] = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,
        threshold_type="fixed",
        min_word_threshold=wct,
    ),
)
```

--------------------------------

### Deep Crawl Following External Links

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Crawls a website and follows external links, expanding the crawl scope beyond the initial domain. This example follows external links up to depth 1 and limits the crawl to 50 pages.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://example.com",
    "scope": "any",  # Follow external links
    "max_depth": 1,
    "max_pages": 50
})
```

--------------------------------

### Run Server Directly for Debugging

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Execute the crawl4ai-mcp server directly for debugging purposes.

```bash
uv run python -m crawl4ai_mcp.server
```

--------------------------------

### Extract Structured Data with CSS Scoping and Wait Condition

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/extract_structured.md

Extract data from dynamic web pages by specifying a CSS selector for the target element and a wait condition for content to load. This example uses OpenAI's GPT-4o mini.

```python
response = await client.call_tool("extract_structured", {
    "url": "https://spa.example.com/table",
    "css_selector": "div.main-table",
    "wait_for": "css:table.data-table",
    "schema": {
        "type": "object",
        "properties": {
            "rows": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "status": {"type": "string"}
                    }
                }
            }
        }
    },
    "instruction": "Extract all rows with ID and status",
    "provider": "openai/gpt-4o-mini"
})
```

--------------------------------

### Basic Sitemap Crawl

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/crawl_sitemap.md

Initiates a crawl of up to 500 URLs from the provided sitemap using 10 concurrent requests.

```python
response = await client.call_tool("crawl_sitemap", {
    "sitemap_url": "https://example.com/sitemap.xml"
})
```

--------------------------------

### Deep Crawl with CSS Scoping

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/deep_crawl.md

Crawls a website and extracts content only from elements matching a specified CSS selector, while excluding elements matching another selector. This example targets content within 'div.main-content' and excludes 'aside' and 'nav' elements.

```python
response = await client.call_tool("deep_crawl", {
    "url": "https://catalog.example.com",
    "css_selector": "div.main-content",
    "excluded_selector": "aside, nav",
    "max_pages": 100
})
```

--------------------------------

### Main Documents Overview

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/00-START-HERE.md

Table outlining the main documentation files in the crawl4ai-mcp project, their purpose, and when to read them.

```markdown
| File | Purpose | Read When |
|------|---------|-----------|
| [INDEX.md](INDEX.md) | Navigation guide for all docs | You want an overview of what's available |
| [README.md](README.md) | Quick reference, architecture, patterns | You want examples and system overview |
| [configuration.md](configuration.md) | Profiles, env vars, per-call overrides | You need to customize behavior |
| [types.md](types.md) | Type definitions, data structures | You're building against the API |
| [errors.md](errors.md) | Error response formats and handling | Something failed, you need to debug |
```

--------------------------------

### Upgrade crawl4ai using uv

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/check_update.md

Manual upgrade process for crawl4ai using the `uv` package manager: upgrade the locked package version, synchronize dependencies, then update Playwright/Chromium.

```bash
# --upgrade takes no argument; --upgrade-package targets a single package
uv lock --upgrade-package crawl4ai
uv sync
uv run crawl4ai-setup
```

--------------------------------

### Deep Crawl with Delay and Output Directory

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/README.md

Perform a deep crawl of a site, respecting a politeness delay and saving results to disk.

```python
deep_crawl(url="...", delay=0.5, output_dir="/tmp/crawl")
```

--------------------------------

### AppContext Usage in a Tool

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/types.md

Demonstrates how to access the AppContext from the request context within an MCP tool to utilize the shared crawler instance and active sessions.

```python
@mcp.tool()
async def my_tool(ctx: Context[ServerSession, AppContext]) -> str:
    app: AppContext = ctx.request_context.lifespan_context
    crawler = app.crawler    # Use the shared crawler instance
    sessions = app.sessions  # Check active sessions
```

--------------------------------

### Python Tool Call with Custom Profile

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

Demonstrates how to use a custom profile named 'slow_spa' when calling the crawl_url tool.

```python
crawl_url(url="https://spa.example.com", profile="slow_spa")
```

--------------------------------

### Configure MCP Server in Claude Desktop Config

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/configuration.md

Add a 'crawl4ai' MCP server configuration to the Claude desktop client's config.json file. This includes specifying the server type, command, arguments, and environment variables for API keys.

```json
{
  "mcpServers": {
    "crawl4ai": {
      "type": "stdio",
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/crawl4ai-mcp",
        "python",
        "-m",
        "crawl4ai_mcp.server"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}
```

--------------------------------

### List All Profiles

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/list_profiles.md

Call the list_profiles tool to retrieve a formatted list of all loaded crawl profiles and their configurations. The 'default' profile is marked as the base layer.

```python
# List all profiles
response = await client.call_tool("list_profiles")

# Response:
# ## default (base layer — applied to every crawl)
# wait_until: domcontentloaded
# page_timeout: 60000
# word_count_threshold: 10
#
# ## fast
# wait_until: domcontentloaded
# page_timeout: 15000
# word_count_threshold: 5
#
# ## js_heavy
# delay_before_return_html: 1.0
# page_timeout: 90000
# remove_overlay_elements: true
# scan_full_page: true
# scroll_delay: 0.5
# wait_until: networkidle
#
# ## stealth
# delay_before_return_html: 2.0
# magic: true
# max_range: 2.0
# mean_delay: 1.5
# override_navigator: true
# page_timeout: 90000
# simulate_user: true
# wait_until: networkidle
```

--------------------------------

### Crawl with Politeness Delay

Source: https://github.com/potterdigital/crawl4ai-mcp/blob/main/_autodocs/api-reference/crawl_many.md

Shows how to implement a politeness delay between requests to avoid overwhelming the target server. This is useful for respecting rate limits.

```python
response = await client.call_tool("crawl_many", {
    "urls": [
        "https://target.com/api/page1",
        "https://target.com/api/page2"
    ],
    "delay": 1.5,  # 1.5 second delay between requests
    "max_concurrent": 2
})
```
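
--------------------------------

### Sketch: Concurrency Cap Combined with a Politeness Delay

The following is a minimal, client-side illustration of how a `delay` and a `max_concurrent` limit can work together. It is not the crawl4ai-mcp implementation. The names `polite_gather` and `fake_fetch` are hypothetical, and the choice to hold a concurrency slot during the pause is an assumption made for this sketch.

```python
import asyncio

# Hypothetical bookkeeping so the sketch can show the cap being respected.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    # Stand-in for a real page fetch (assumption: no network access here).
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return f"fetched {url}"


async def polite_gather(urls, fetch, delay=1.5, max_concurrent=2):
    """Call fetch(url) for each URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            result = await fetch(url)
            # Keep the slot occupied during the pause so the next request
            # through this slot starts at least `delay` seconds later.
            await asyncio.sleep(delay)
            return result

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


urls = [
    "https://target.com/api/page1",
    "https://target.com/api/page2",
    "https://target.com/api/page3",
]
results = asyncio.run(polite_gather(urls, fake_fetch, delay=0.05, max_concurrent=2))
```

With three URLs and `max_concurrent=2`, the third request only starts once one of the first two has finished both its fetch and its pause. Lowering `max_concurrent` or raising `delay` reduces the request rate seen by the target server.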