# Crawl4AI: LLM-Ready Web Crawler & Scraper Crawl4AI is an open-source Python library and Docker service that transforms web pages into clean, structured markdown and JSON data optimized for AI applications. Built on Playwright for reliable browser automation, it provides intelligent content extraction strategies including CSS selectors, XPath, LLM-powered extraction, and adaptive crawling that learns website patterns. The framework handles complex scenarios like JavaScript rendering, infinite scroll, authentication, proxy rotation, and anti-bot detection while maintaining high performance through async operations and browser pooling. The library excels at preparing web data for RAG systems, AI agents, and data pipelines with minimal code. It offers flexible deployment via pip installation for Python applications or Docker containers with REST APIs for language-agnostic integration. Built-in content filtering using BM25 and pruning algorithms removes boilerplate, while markdown generation with automatic citations creates LLM-friendly output. Crawl4AI supports multi-URL concurrent crawling, deep crawling strategies (BFS/DFS), URL seeding from sitemaps and Common Crawl, production features like caching, session management, real-time monitoring dashboards, and undetected browser modes for bypassing sophisticated bot detection systems. ## Core Python API ### AsyncWebCrawler - Main Crawler Interface Asynchronous web crawler with browser pooling, caching, and content extraction. Supports single URLs, batch processing, and streaming results. ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def main(): # Basic configuration browser_config = BrowserConfig( headless=True, browser_type="chromium", # or "firefox", "webkit" verbose=True ) run_config = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, # ENABLED, DISABLED, READ_ONLY, WRITE_ONLY word_count_threshold=10, page_timeout=60000, wait_until="domcontentloaded" # or "load", "networkidle" ) # Context manager handles browser lifecycle async with AsyncWebCrawler(config=browser_config) as crawler: # Single URL crawl result = await crawler.arun( url="https://www.nbcnews.com/business", config=run_config ) if result.success: print(f"Title: {result.metadata.get('title', 'N/A')}") print(f"Markdown length: {len(result.markdown.raw_markdown)}") print(f"Filtered markdown: {len(result.markdown.fit_markdown)}") print(f"Links found: {len(result.links['internal'])} internal, {len(result.links['external'])} external") print(f"Images: {len(result.media['images'])}") # Direct table access (new in v0.7.3) if result.tables: import pandas as pd df = pd.DataFrame(result.tables[0]['data']) print(f"First table: {len(df)} rows") else: print(f"Crawl failed: {result.error_message}") asyncio.run(main()) ``` ### Multiple URL Concurrent Crawling Process multiple URLs concurrently with automatic rate limiting and memory management. 
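Results from `arun_many` can either be collected as a list, as in the batch example below, or consumed one by one as each URL finishes. A minimal streaming sketch; the `stream=True` flag on `CrawlerRunConfig` is an assumption from recent releases and is not shown elsewhere in this section:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def stream_results():
    urls = [f"https://example.com/page{i}" for i in range(1, 4)]
    # stream=True is assumed: results are yielded as each URL completes
    run_config = CrawlerRunConfig(word_count_threshold=5, stream=True)

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        async for result in await crawler.arun_many(urls=urls, config=run_config):
            status = "ok" if result.success else f"failed: {result.error_message}"
            print(f"{result.url}: {status}")

asyncio.run(stream_results())
```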
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai import MemoryAdaptiveDispatcher

async def crawl_multiple():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=5,
        screenshot=True  # Capture screenshots
    )

    # Memory-adaptive dispatcher automatically manages concurrency
    dispatcher = MemoryAdaptiveDispatcher(
        max_session_permit=5,  # Max concurrent sessions
        max_session_permit_percent=0.8,  # 80% memory threshold
        monitor_interval=1.0
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        )

        for i, result in enumerate(results):
            if result.success:
                print(f"URL {i+1}: {len(result.markdown.raw_markdown)} chars")
                if result.screenshot:
                    # Screenshot is base64 encoded
                    print(f"Screenshot captured: {len(result.screenshot)} base64 chars")
            else:
                print(f"URL {i+1} failed: {result.error_message}")

asyncio.run(crawl_multiple())
```

### Multi-URL Configuration System

Apply different configurations to different URL patterns in batch processing (new in v0.7.3).

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, MatchMode
from crawl4ai import JsonCssExtractionStrategy

async def multi_config_crawl():
    browser_config = BrowserConfig(headless=True)

    # Define URL-specific configurations
    url_configs = [
        {
            "url_matcher": "*.pdf",  # String pattern with wildcards
            "config": CrawlerRunConfig(
                screenshot=False,
                pdf_export=True
            )
        },
        {
            "url_matcher": "*/blog/*",  # Match blog URLs
            "config": CrawlerRunConfig(
                word_count_threshold=50,
                extraction_strategy=JsonCssExtractionStrategy(schema={
                    "name": "Blog Posts",
                    "baseSelector": "article",
                    "fields": [
                        {"name": "title", "selector": "h1", "type": "text"},
                        {"name": "content", "selector": ".content", "type": "text"}
                    ]
                })
            )
        },
        {
            # Lambda function matcher for complex logic
            "url_matcher": lambda url: "docs" in url and url.endswith(".html"),
            "config": CrawlerRunConfig(
                word_count_threshold=100,
                css_selector=".documentation-content"
            )
        }
    ]

    # Fallback configuration when no patterns match
    fallback_config = CrawlerRunConfig(word_count_threshold=10)

    urls = [
        "https://example.com/doc.pdf",
        "https://example.com/blog/post-1",
        "https://example.com/docs/guide.html"
    ]

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=url_configs,
            fallback_config=fallback_config,
            match_mode=MatchMode.OR  # Use OR for multiple matchers
        )

        for result in results:
            print(f"{result.url}: {result.success}")

asyncio.run(multi_config_crawl())
```

### Real-Time Crawler Monitoring

Monitor crawl operations in real-time with detailed metrics and statistics (new in v0.7.3).
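Depending on the release, the monitor may attach to the dispatcher rather than being passed to `arun_many`. A short sketch under that assumption (the `monitor=` parameter on `MemoryAdaptiveDispatcher` is not shown elsewhere in this document); the full example below passes the monitor to `arun_many` directly:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai import CrawlerMonitor, DisplayMode, MemoryAdaptiveDispatcher

async def dispatcher_with_monitor():
    urls = [f"https://example.com/page{i}" for i in range(5)]

    # Assumed wiring: the dispatcher owns the monitor and renders progress
    dispatcher = MemoryAdaptiveDispatcher(
        max_session_permit=5,
        monitor=CrawlerMonitor(display_mode=DisplayMode.AGGREGATED),
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(word_count_threshold=10),
            dispatcher=dispatcher,
        )
        print(f"Completed {sum(r.success for r in results)}/{len(results)} URLs")

asyncio.run(dispatcher_with_monitor())
```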
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import CrawlerMonitor, DisplayMode async def monitored_crawl(): browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig(word_count_threshold=10) # Create monitor with display configuration monitor = CrawlerMonitor( max_visible_tasks=10, display_mode=DisplayMode.DETAILED, # or AGGREGATED show_details=True ) urls = [f"https://example.com/page{i}" for i in range(20)] async with AsyncWebCrawler(config=browser_config) as crawler: results = await crawler.arun_many( urls=urls, config=run_config, monitor=monitor # Attach monitor ) # Get final statistics stats = monitor.get_stats() print(f"\nCrawl Statistics:") print(f" Total URLs: {stats['total']}") print(f" Successful: {stats['success']}") print(f" Failed: {stats['failed']}") print(f" Success Rate: {stats['success_rate']:.1f}%") print(f" Avg Duration: {stats['avg_duration']:.2f}s") print(f" Peak Memory: {stats['peak_memory_mb']:.1f}MB") asyncio.run(monitored_crawl()) ``` ## Content Extraction Strategies ### LLM-Based Structured Extraction Extract structured data using any LLM provider with schema validation. ```python import asyncio import os from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import LLMExtractionStrategy, LLMConfig from pydantic import BaseModel, Field # Define extraction schema class Product(BaseModel): name: str = Field(..., description="Product name") price: str = Field(..., description="Product price") rating: float = Field(..., description="Product rating out of 5") availability: str = Field(..., description="In stock or out of stock") async def llm_extraction(): # Configure LLM - supports OpenAI, Anthropic, Ollama, etc. llm_config = LLMConfig( provider="openai/gpt-4o-mini", # or "anthropic/claude-3-sonnet", "ollama/llama2" api_token=os.getenv("OPENAI_API_KEY"), temperature=0.2, max_tokens=4000 ) extraction_strategy = LLMExtractionStrategy( llm_config=llm_config, schema=Product.schema(), extraction_type="schema", # or "block", "markdown" instruction="Extract all products with their details. Focus on name, price, rating, and availability.", chunk_token_threshold=4000, # Split large pages overlap_rate=0.1 ) browser_config = BrowserConfig(headless=True, verbose=True) run_config = CrawlerRunConfig( extraction_strategy=extraction_strategy, word_count_threshold=1 ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://www.amazon.com/s?k=laptop", config=run_config ) if result.success and result.extracted_content: import json products = json.loads(result.extracted_content) print(f"Extracted {len(products)} products:") for product in products[:3]: print(f" - {product['name']}: {product['price']} ({product['rating']}★)") # Show token usage extraction_strategy.show_usage() asyncio.run(llm_extraction()) ``` ### CSS Selector Extraction (Zero-LLM) Fast structured extraction using CSS selectors without LLM costs. 
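The schema is plain JSON: a `baseSelector` that matches each repeated element, plus a list of `fields` to pull from it. A minimal sketch against the public quotes.toscrape.com sandbox (selectors assumed from that site's markup); the full example below adds JavaScript scrolling and wait conditions for dynamic pages:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

async def minimal_css_extraction():
    schema = {
        "name": "Quotes",
        "baseSelector": "div.quote",  # One entry per quote block
        "fields": [
            {"name": "text", "selector": "span.text", "type": "text"},
            {"name": "author", "selector": "small.author", "type": "text"},
        ],
    }
    run_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema=schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://quotes.toscrape.com", config=run_config)
        if result.success and result.extracted_content:
            quotes = json.loads(result.extracted_content)
            print(f"Extracted {len(quotes)} quotes")

asyncio.run(minimal_css_extraction())
```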
```python import asyncio import json from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import JsonCssExtractionStrategy async def css_extraction(): # Define extraction schema with CSS selectors schema = { "name": "Course Catalog", "baseSelector": "div.course-card", # Container for each item "fields": [ { "name": "title", "selector": "h3.course-title", "type": "text" }, { "name": "instructor", "selector": "span.instructor-name", "type": "text" }, { "name": "price", "selector": "div.price", "type": "text" }, { "name": "thumbnail", "selector": "img.course-thumb", "type": "attribute", "attribute": "src" }, { "name": "rating", "selector": "span.rating-value", "type": "text" }, { "name": "course_link", "selector": "a.course-link", "type": "attribute", "attribute": "href" } ] } extraction_strategy = JsonCssExtractionStrategy(schema=schema, verbose=True) browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig( extraction_strategy=extraction_strategy, # Execute JavaScript to load dynamic content js_code=[ """ (async () => { // Scroll to load lazy content await new Promise(resolve => { let scrolls = 0; const interval = setInterval(() => { window.scrollBy(0, 500); scrolls++; if (scrolls >= 5) { clearInterval(interval); resolve(); } }, 200); }); })(); """ ], wait_for="div.course-card", # Wait for elements to appear wait_for_timeout=10000 ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://www.udemy.com/courses/search/?q=python", config=run_config ) if result.success: courses = json.loads(result.extracted_content) print(f"Extracted {len(courses)} courses:") for course in courses[:5]: print(f" - {course['title']} by {course['instructor']} - {course['price']}") asyncio.run(css_extraction()) ``` ### Regex Pattern Extraction Extract common entities (emails, phones, URLs, etc.) using built-in or custom regex patterns. 
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import RegexExtractionStrategy, _B async def regex_extraction(): # Use built-in patterns extraction_strategy = RegexExtractionStrategy( pattern=_B.EMAIL | _B.PHONE | _B.URL | _B.PRICE, # Combine patterns with | input_format="fit_html" # or "raw_html", "cleaned_html", "markdown" ) browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig(extraction_strategy=extraction_strategy) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://www.example.com/contact", config=run_config ) if result.success: import json matches = json.loads(result.extracted_content) print(f"Found {len(matches.get('EMAIL', []))} emails:") for email in matches.get('EMAIL', [])[:5]: print(f" - {email}") print(f"\nFound {len(matches.get('PHONE', []))} phone numbers:") for phone in matches.get('PHONE', [])[:5]: print(f" - {phone}") print(f"\nFound {len(matches.get('URL', []))} URLs") print(f"Found {len(matches.get('PRICE', []))} prices") asyncio.run(regex_extraction()) # Custom regex patterns async def custom_regex(): extraction_strategy = RegexExtractionStrategy( custom={ "product_id": r"SKU:\s*([A-Z0-9-]+)", "reference_number": r"Ref#\s*(\d{6,})" } ) run_config = CrawlerRunConfig(extraction_strategy=extraction_strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://example.com/product", config=run_config ) if result.success: import json matches = json.loads(result.extracted_content) print(f"Product IDs: {matches.get('product_id', [])}") print(f"Reference Numbers: {matches.get('reference_number', [])}") asyncio.run(custom_regex()) ``` ## Intelligent Markdown Generation ### Heuristic Content Filtering Generate clean markdown with automatic noise removal and content prioritization. 
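The overview mentions markdown generation with automatic citations: links can be rewritten as numbered references with a reference list appended at the end. A sketch, assuming the `markdown_with_citations` and `references_markdown` fields that recent releases expose on the markdown result; the content filters below work independently of this:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def markdown_citations():
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        if result.success:
            md = result.markdown
            # Assumed fields: citation-style markdown plus the collected reference list
            print(md.markdown_with_citations[:300])
            print("\nReferences:\n", md.references_markdown[:300])

asyncio.run(markdown_citations())
```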
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator async def filtered_markdown(): # Pruning filter - removes low-quality content automatically pruning_filter = PruningContentFilter( threshold=0.48, # Content quality threshold (0-1) threshold_type="fixed", # or "dynamic" min_word_threshold=5 # Minimum words per block ) markdown_generator = DefaultMarkdownGenerator( content_filter=pruning_filter, options={ "include_links": True, "include_images": True, "body_width": 0 # No wrapping } ) browser_config = BrowserConfig(headless=True, verbose=True) run_config = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, markdown_generator=markdown_generator ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://en.wikipedia.org/wiki/Web_scraping", config=run_config ) if result.success: raw_length = len(result.markdown.raw_markdown) fit_length = len(result.markdown.fit_markdown) reduction = ((raw_length - fit_length) / raw_length) * 100 print(f"Raw markdown: {raw_length} chars") print(f"Filtered markdown: {fit_length} chars") print(f"Reduction: {reduction:.1f}%") print("\nFirst 500 chars of filtered content:") print(result.markdown.fit_markdown[:500]) asyncio.run(filtered_markdown()) # BM25 filter - query-based relevance filtering async def bm25_filtering(): bm25_filter = BM25ContentFilter( user_query="machine learning algorithms neural networks", bm25_threshold=1.0, language="english", use_stemming=True ) markdown_generator = DefaultMarkdownGenerator(content_filter=bm25_filter) run_config = CrawlerRunConfig(markdown_generator=markdown_generator) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://en.wikipedia.org/wiki/Artificial_intelligence", config=run_config ) if result.success: # Only content relevant to the query is retained print(f"Query-filtered markdown: {len(result.markdown.fit_markdown)} chars") print(result.markdown.fit_markdown[:1000]) asyncio.run(bm25_filtering()) ``` ## Browser Configuration & Anti-Detection ### Stealth Mode and User Profiles Bypass bot detection with stealth mode and persistent browser profiles. 
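For moderately protected sites, `enable_stealth` plus a realistic user agent can be enough on its own. A minimal sketch using only options that appear in the full example below, which additionally adds a persistent profile, custom headers, cookies, and simulated user behavior:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def minimal_stealth():
    browser_config = BrowserConfig(
        headless=True,
        enable_stealth=True,  # Patch common automation fingerprints
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    )
    run_config = CrawlerRunConfig(simulate_user=True, magic=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print("ok" if result.success else result.error_message)

asyncio.run(minimal_stealth())
```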
```python import asyncio import os from pathlib import Path from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def stealth_crawling(): # Create persistent user data directory user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile") os.makedirs(user_data_dir, exist_ok=True) browser_config = BrowserConfig( # Browser type browser_type="chromium", # "undetected" for enhanced stealth headless=True, # Stealth features enable_stealth=True, # Persistent profile use_persistent_context=True, user_data_dir=user_data_dir, # User agent user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", # Viewport viewport_width=1920, viewport_height=1080, # Extra arguments to avoid detection extra_args=[ "--disable-blink-features=AutomationControlled", "--disable-dev-shm-usage", "--no-sandbox", "--disable-web-security" ], # Custom headers headers={ "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate, br", "Referer": "https://www.google.com/" }, # Cookies cookies=[ { "name": "session_id", "value": "your_session_value", "domain": ".example.com", "path": "/" } ], verbose=True ) run_config = CrawlerRunConfig( # Override navigator properties override_navigator=True, # Simulate human behavior simulate_user=True, delay_before_return_html=2.0, # Magic mode - combines multiple anti-detection techniques magic=True ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://example.com/protected", config=run_config ) if result.success: print("Successfully crawled protected page") print(f"Content length: {len(result.markdown.raw_markdown)}") asyncio.run(stealth_crawling()) ``` ### Undetected Browser Mode Use undetected Chrome for bypassing sophisticated bot detection systems (new in v0.7.3). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def undetected_crawling(): # Use undetected browser adapter browser_config = BrowserConfig( browser_type="undetected", # Special undetected Chrome headless=True, # Stealth headless mode # Human-like behavior viewport_width=1920, viewport_height=1080, user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", verbose=True ) run_config = CrawlerRunConfig( # Enable magic mode for additional anti-detection magic=True, simulate_user=True, override_navigator=True, page_timeout=60000 ) async with AsyncWebCrawler(config=browser_config) as crawler: # Crawl sites with Cloudflare or other bot protection result = await crawler.arun( url="https://example.com/protected-by-cloudflare", config=run_config ) if result.success: print("Bypassed bot detection successfully!") print(f"Content: {result.markdown.raw_markdown[:500]}") else: print(f"Failed: {result.error_message}") asyncio.run(undetected_crawling()) ``` ### Proxy Configuration Configure proxies with authentication and rotation strategies. 
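A single unauthenticated proxy only needs a `ProxyConfig` on the browser config. A minimal sketch with a placeholder proxy address; the full example below adds credentials and a rotation strategy across a proxy pool:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.async_configs import ProxyConfig

async def single_proxy():
    browser_config = BrowserConfig(
        headless=True,
        proxy_config=ProxyConfig(server="http://proxy.example.com:8080"),  # Placeholder proxy
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # httpbin echoes the caller's IP, making it easy to verify the proxy is in use
        result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig())
        if result.success:
            print(result.html[:200])

asyncio.run(single_proxy())
```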
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai.async_configs import ProxyConfig, ProxyRotationStrategy async def proxy_crawling(): # Single proxy proxy_config = ProxyConfig( server="http://proxy.example.com:8080", username="proxy_user", password="proxy_pass" ) browser_config = BrowserConfig( headless=True, proxy_config=proxy_config ) # Proxy rotation strategy proxy_list = [ ProxyConfig(server="http://proxy1.example.com:8080", username="user1", password="pass1"), ProxyConfig(server="http://proxy2.example.com:8080", username="user2", password="pass2"), ProxyConfig(server="http://proxy3.example.com:8080", username="user3", password="pass3"), ] rotation_strategy = ProxyRotationStrategy( proxies=proxy_list, rotation_type="round_robin" # or "random", "least_used" ) run_config = CrawlerRunConfig( proxy_rotation_strategy=rotation_strategy ) async with AsyncWebCrawler(config=browser_config) as crawler: urls = [f"https://httpbin.org/ip" for _ in range(5)] results = await crawler.arun_many(urls=urls, config=run_config) for i, result in enumerate(results): if result.success: print(f"Request {i+1} IP: {result.html[:200]}") asyncio.run(proxy_crawling()) ``` ## Advanced Features ### URL Discovery and Seeding Discover URLs from sitemaps and Common Crawl index (new feature). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig from crawl4ai import AsyncUrlSeeder, SeedingConfig async def discover_urls(): # Configure URL seeding seeding_config = SeedingConfig( source="sitemap+cc", # "sitemap", "cc", or "sitemap+cc" pattern="*/blog/*", # URL pattern to match max_urls=100, live_check=True, # Verify URLs are accessible extract_head=True, # Extract title and meta from HEAD concurrency=10, scoring_method="bm25", # or "cosine" query="machine learning artificial intelligence", # For relevance scoring score_threshold=0.3, # Minimum relevance score filter_nonsense_urls=True ) # Discover URLs using AsyncUrlSeeder async with AsyncUrlSeeder() as seeder: url_data = await seeder.seed( domain="example.com", config=seeding_config ) print(f"Discovered {len(url_data)} URLs:") for data in url_data[:10]: print(f" - {data['url']} (score: {data.get('score', 0):.2f})") if data.get('title'): print(f" Title: {data['title']}") # Now crawl the discovered URLs browser_config = BrowserConfig(headless=True) async with AsyncWebCrawler(config=browser_config) as crawler: # Extract just the URLs for crawling urls = [data['url'] for data in url_data[:5]] results = await crawler.arun_many(urls=urls) for result in results: if result.success: print(f"Crawled: {result.url} - {len(result.markdown.raw_markdown)} chars") asyncio.run(discover_urls()) ``` ### Virtual Scroll for Modern Websites Handle virtualized scrolling on modern websites like Twitter and Instagram (new feature). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import VirtualScrollConfig async def virtual_scroll_crawl(): # Configure virtual scrolling virtual_scroll = VirtualScrollConfig( scroll_amount=500, # Pixels to scroll each time scroll_count=10, # Number of scroll operations wait_time=2.0, # Wait between scrolls (seconds) # Automatically handles three scenarios: # 1. Content unchanged (continue scrolling) # 2. Content appended (traditional infinite scroll) # 3. 
Content replaced (true virtual scroll - Twitter/Instagram) ) browser_config = BrowserConfig(headless=True, viewport_height=1080) run_config = CrawlerRunConfig( virtual_scroll=virtual_scroll, word_count_threshold=10 ) async with AsyncWebCrawler(config=browser_config) as crawler: # Works with Twitter timelines, Instagram grids, etc. result = await crawler.arun( url="https://twitter.com/user/timeline", config=run_config ) if result.success: print(f"Captured all content through virtual scroll") print(f"Total content: {len(result.markdown.raw_markdown)} chars") print(f"Content chunks captured: {len(result.virtual_scroll_chunks)}") asyncio.run(virtual_scroll_crawl()) ``` ### Deep Crawling Strategies Automatically discover and crawl related pages using intelligent strategies. ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai.async_configs import DeepCrawlStrategy, DeepCrawlRule async def deep_crawl(): # Define crawling rules rules = [ DeepCrawlRule( match_pattern="/docs/*", # Only follow documentation links max_depth=3, priority=10 ), DeepCrawlRule( match_pattern="/blog/*", max_depth=2, priority=5 ) ] deep_strategy = DeepCrawlStrategy( strategy="bfs", # or "dfs", "best-first" max_pages=50, max_depth=3, rules=rules, same_domain_only=True, exclude_patterns=[ "/login", "/signup", "/logout", "*.pdf", "*.zip" ] ) browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig( deep_crawl_strategy=deep_strategy, word_count_threshold=50 ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://docs.example.com", config=run_config ) if result.success: print(f"Pages crawled: {len(result.deep_crawl_results)}") for page in result.deep_crawl_results[:10]: print(f" - {page['url']} (depth: {page['depth']})") asyncio.run(deep_crawl()) ``` ### Session Management and Caching Reuse browser sessions and cache results for faster subsequent crawls. 
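Caching on its own is worth a quick illustration before the combined session-and-cache example below: with `CacheMode.ENABLED`, the first request populates the cache and an identical second request is served from it, which normally shows up as a much shorter wall-clock time. A minimal sketch:

```python
import asyncio
import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def cache_demo():
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

    async with AsyncWebCrawler() as crawler:
        t0 = time.perf_counter()
        await crawler.arun(url="https://example.com", config=config)  # Hits the network, writes cache
        t1 = time.perf_counter()
        await crawler.arun(url="https://example.com", config=config)  # Served from cache
        t2 = time.perf_counter()

        print(f"First crawl:  {t1 - t0:.2f}s")
        print(f"Second crawl: {t2 - t1:.2f}s (cached)")

asyncio.run(cache_demo())
```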
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def session_and_cache(): browser_config = BrowserConfig( headless=True, use_persistent_context=True, user_data_dir="/tmp/my_browser_profile" ) # First crawl - login and save session login_config = CrawlerRunConfig( cache_mode=CacheMode.WRITE_ONLY, session_id="authenticated_session", js_code=[ """ // Simulate login document.querySelector('#username').value = 'user@example.com'; document.querySelector('#password').value = 'password123'; document.querySelector('#login-button').click(); await new Promise(r => setTimeout(r, 3000)); """ ] ) async with AsyncWebCrawler(config=browser_config) as crawler: # Login login_result = await crawler.arun( url="https://example.com/login", config=login_config ) # Subsequent crawls reuse session protected_config = CrawlerRunConfig( cache_mode=CacheMode.ENABLED, session_id="authenticated_session" ) result = await crawler.arun( url="https://example.com/dashboard", config=protected_config ) if result.success: print("Accessed protected page using saved session") print(f"Content: {result.markdown.raw_markdown[:500]}") # Read from cache on subsequent runs cached_result = await crawler.arun( url="https://example.com/dashboard", config=CrawlerRunConfig( cache_mode=CacheMode.READ_ONLY, session_id="authenticated_session" ) ) print(f"Loaded from cache: {cached_result.success}") asyncio.run(session_and_cache()) ``` ## Docker REST API ### Basic Crawling via HTTP Submit crawl jobs via REST API with synchronous or asynchronous endpoints. ```python import requests import time # Docker container should be running: docker run -p 11235:11235 unclecode/crawl4ai:latest BASE_URL = "http://localhost:11235" # Synchronous crawl - wait for result def sync_crawl(): response = requests.post( f"{BASE_URL}/crawl", json={ "urls": ["https://www.nbcnews.com/business"], "browser_config": { "headless": True, "viewport_width": 1920 }, "crawler_config": { "word_count_threshold": 10, "screenshot": True } }, timeout=60 ) response.raise_for_status() data = response.json() if data["success"]: result = data["results"][0] print(f"Title: {result['metadata']['title']}") print(f"Markdown length: {len(result['markdown']['raw_markdown'])}") print(f"Screenshot available: {bool(result.get('screenshot'))}") return result else: print(f"Crawl failed: {data.get('error')}") # Asynchronous crawl with job queue def async_crawl(): # Submit job response = requests.post( f"{BASE_URL}/crawl/job", json={ "urls": ["https://example.com"], "priority": 8, "crawler_config": { "js_code": ["window.scrollTo(0, document.body.scrollHeight);"], "wait_for": ".content-loaded" } } ) response.raise_for_status() task_id = response.json()["task_id"] print(f"Job submitted: {task_id}") # Poll for result while True: status_response = requests.get(f"{BASE_URL}/crawl/job/{task_id}") status_data = status_response.json() if status_data["status"] == "completed": print("Job completed!") return status_data["result"] elif status_data["status"] == "failed": print(f"Job failed: {status_data.get('error')}") return None print(f"Status: {status_data['status']}") time.sleep(2) if __name__ == "__main__": result = sync_crawl() # result = async_crawl() ``` ### LLM Extraction via API Perform LLM-based extraction through Docker API. 
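Before submitting extraction jobs, it helps to confirm the container is reachable. A short sketch, assuming the `/health` endpoint exposed by the Docker image:

```python
import requests

BASE_URL = "http://localhost:11235"

def check_server():
    # Assumed endpoint: a lightweight liveness check on the Crawl4AI container
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    response.raise_for_status()
    print(f"Server is up: {response.json()}")

if __name__ == "__main__":
    check_server()
```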
```python import requests import time import json BASE_URL = "http://localhost:11235" def llm_extraction_job(): # Submit LLM extraction job response = requests.post( f"{BASE_URL}/llm/job", json={ "urls": ["https://openai.com/api/pricing/"], "extraction_config": { "provider": "openai/gpt-4o-mini", "api_token": "your_openai_api_key", "schema": { "type": "object", "properties": { "model_name": { "type": "string", "description": "Name of the AI model" }, "input_price": { "type": "string", "description": "Price per input token" }, "output_price": { "type": "string", "description": "Price per output token" } }, "required": ["model_name", "input_price", "output_price"] }, "instruction": "Extract all AI models with their pricing information", "extraction_type": "schema" } } ) response.raise_for_status() task_id = response.json()["task_id"] print(f"LLM extraction job submitted: {task_id}") # Poll for completion while True: status = requests.get(f"{BASE_URL}/llm/job/{task_id}").json() if status["status"] == "completed": results = status["result"] extracted = json.loads(results["extracted_content"]) print(f"\nExtracted {len(extracted)} models:") for model in extracted[:3]: print(f" - {model['model_name']}") print(f" Input: {model['input_price']}") print(f" Output: {model['output_price']}") return extracted elif status["status"] == "failed": print(f"Job failed: {status.get('error')}") return None time.sleep(2) # Markdown with content filtering def get_filtered_markdown(): response = requests.post( f"{BASE_URL}/md", json={ "url": "https://en.wikipedia.org/wiki/Machine_learning", "f": "bm25", # Filter type: "fit", "raw", "bm25", "llm" "q": "supervised learning neural networks algorithms", "c": "0" # Cache version } ) response.raise_for_status() data = response.json() print(f"Filtered markdown length: {len(data['markdown'])}") print("\nFirst 500 characters:") print(data['markdown'][:500]) return data if __name__ == "__main__": # llm_extraction_job() get_filtered_markdown() ``` ### Webhooks for Async Jobs Configure webhooks to receive notifications when jobs complete. ```python import requests BASE_URL = "http://localhost:11235" def crawl_with_webhook(): response = requests.post( f"{BASE_URL}/crawl/job", json={ "urls": ["https://example.com/page1", "https://example.com/page2"], "webhook_url": "https://your-server.com/webhook/crawl-complete", "webhook_data_in_payload": True, # Include results in webhook "webhook_headers": { "X-API-Key": "your-secret-key", "Content-Type": "application/json" }, "crawler_config": { "word_count_threshold": 10 } } ) response.raise_for_status() task_id = response.json()["task_id"] print(f"Job submitted with webhook: {task_id}") print("You will receive notification at your webhook URL when complete") return task_id # Webhook endpoint receives payload: # { # "task_id": "abc-123", # "task_type": "crawl", # "status": "completed", # "timestamp": "2024-01-15T10:30:00Z", # "urls": ["https://example.com/page1", "https://example.com/page2"], # "data": { # "results": [...], # Full crawl results if webhook_data_in_payload=True # "success": true # } # } if __name__ == "__main__": crawl_with_webhook() ``` ### Docker Monitoring Dashboard Access real-time monitoring and control (new in v0.7.7). 
```python
import httpx
import asyncio
import json

BASE_URL = "http://localhost:11235"

async def monitor_system_health():
    """Access monitoring dashboard at http://localhost:11235/dashboard"""
    async with httpx.AsyncClient() as client:
        # Get system health
        response = await client.get(f"{BASE_URL}/monitor/health")
        health = response.json()

        print(f"Container Metrics:")
        print(f"  CPU: {health['container']['cpu_percent']:.1f}%")
        print(f"  Memory: {health['container']['memory_percent']:.1f}%")
        print(f"  Uptime: {health['container']['uptime_seconds']}s")

        print(f"\nBrowser Pool:")
        print(f"  Permanent: {health['pool']['permanent']['active']} active")
        print(f"  Hot Pool: {health['pool']['hot']['count']} browsers")
        print(f"  Cold Pool: {health['pool']['cold']['count']} browsers")

        print(f"\nStatistics:")
        print(f"  Total Requests: {health['stats']['total_requests']}")
        print(f"  Success Rate: {health['stats']['success_rate_percent']:.1f}%")
        print(f"  Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")

async def track_requests():
    """Track active and completed requests"""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{BASE_URL}/monitor/requests")
        requests_data = response.json()

        print(f"Active Requests: {len(requests_data['active'])}")
        print(f"Completed Requests: {len(requests_data['completed'])}")

        # See details of recent requests
        for req in requests_data['completed'][:5]:
            status_icon = "✅" if req['success'] else "❌"
            print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")

async def manage_browsers():
    """Manage browser pool - kill/restart browsers"""
    async with httpx.AsyncClient() as client:
        # Get browser list
        response = await client.get(f"{BASE_URL}/monitor/browsers")
        data = response.json()

        print(f"Total Browsers: {data['summary']['total_count']}")
        print(f"Total Memory: {data['summary']['total_memory_mb']:.1f}MB")
        print(f"Reuse Rate: {data['summary']['reuse_rate_percent']:.1f}%")

        # Manual control actions
        for browser in data['browsers']:
            if browser['tier'] == 'cold' and browser['idle_time_seconds'] > 300:
                # Kill idle cold browsers
                kill_response = await client.post(
                    f"{BASE_URL}/monitor/browsers/{browser['id']}/kill"
                )
                print(f"Killed idle browser: {browser['id']}")

# WebSocket streaming for real-time updates
async def stream_metrics():
    """Stream real-time metrics via WebSocket"""
    import websockets

    uri = "ws://localhost:11235/monitor/stream"
    async with websockets.connect(uri) as websocket:
        print("Connected to monitoring stream...")
        while True:
            message = await websocket.recv()
            data = json.loads(message)
            print(f"CPU: {data['cpu']:.1f}% | Memory: {data['memory']:.1f}% | Active: {data['active_requests']}")

if __name__ == "__main__":
    # Run monitoring examples
    asyncio.run(monitor_system_health())
    # asyncio.run(track_requests())
    # asyncio.run(manage_browsers())
    # asyncio.run(stream_metrics())
```

## Command Line Interface

### CLI Basic Usage

Command-line interface for quick crawling and testing.

```bash
# Install Crawl4AI
pip install -U crawl4ai
crawl4ai-setup

# Basic crawl with markdown output
crwl https://www.example.com -o markdown

# Output formats: all, json, markdown, md, markdown-fit, md-fit
crwl https://www.example.com -o json

# Deep crawl with BFS strategy
crwl https://docs.example.com --deep-crawl bfs --max-pages 20

# CSS selector extraction
crwl https://news.example.com -c css_selector=".article-title"

# LLM-based extraction with question
crwl https://example.com/products -q "What are the main products and their prices?"
# Use browser profile for authentication crwl https://linkedin.com/in/profile --profile linkedin-session # Screenshot capture crwl https://example.com -c screenshot=true -c screenshot_wait_for=2 # Custom browser settings crwl https://example.com -b headless=false -b viewport_width=1920 # Load configuration from files crwl https://example.com -B browser_config.yaml -C crawler_config.yaml # Structured extraction with schema crwl https://example.com/products -s product_schema.json -o json ``` ### CLI Advanced Configuration Use YAML configuration files for complex crawling scenarios. ```yaml # browser_config.yaml browser_type: chromium headless: true viewport_width: 1920 viewport_height: 1080 enable_stealth: true extra_args: - --disable-blink-features=AutomationControlled - --disable-dev-shm-usage headers: Accept-Language: en-US,en;q=0.9 Referer: https://www.google.com/ # crawler_config.yaml word_count_threshold: 10 cache_mode: BYPASS page_timeout: 60000 wait_until: domcontentloaded screenshot: true js_code: - window.scrollTo(0, document.body.scrollHeight); - await new Promise(r => setTimeout(r, 2000)); css_selector: ".main-content" excluded_tags: - nav - footer - aside # extraction_config.yaml provider: openai/gpt-4o-mini api_token: ${OPENAI_API_KEY} schema: type: object properties: title: type: string author: type: string date: type: string content: type: string extraction_type: schema instruction: Extract the article title, author, publication date, and main content # product_schema.json { "name": "Product Catalog", "baseSelector": "div.product-card", "fields": [ {"name": "title", "selector": "h3.title", "type": "text"}, {"name": "price", "selector": "span.price", "type": "text"}, {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"} ] } ``` ```bash # Use the configuration files crwl https://example.com \ -B browser_config.yaml \ -C crawler_config.yaml \ -e extraction_config.yaml \ -o json # Quick JSON extraction without config files crwl https://example.com/api-docs \ -j "Extract all API endpoints with their methods and descriptions" \ -o json # Manage browser profiles crwl profiles list crwl profiles create my-session crwl profiles delete my-session # Launch CDP-enabled browser for manual testing crwl cdp --port 9222 --profile test-session # Built-in browser management crwl browser start crwl browser status crwl browser stop ``` Crawl4AI provides a comprehensive toolkit for web data extraction ranging from simple one-liners to enterprise-grade crawling infrastructure. The library's modular design allows mixing and matching components - use simple CSS selectors for structured sites, LLM extraction for complex content, or adaptive crawling to automatically learn patterns. Content filtering ensures high-quality output for AI applications while caching and session management optimize performance. Deploy locally via pip for Python integration or use Docker containers with REST APIs for polyglot applications. The framework handles the complexity of modern web scraping including JavaScript rendering, authentication, proxies, virtual scrolling, and anti-bot measures with undetected browser support, letting developers focus on extracting value from web data rather than wrestling with browser automation details. Integration patterns span the spectrum from ad-hoc scripts to production pipelines. Use the Python API for RAG systems, data preprocessing, competitive intelligence, and AI agent tooling. 
Deploy Docker containers behind load balancers for high-throughput web data APIs serving multiple teams or customers. The CLI enables rapid prototyping and integration with shell scripts, cron jobs, and CI/CD pipelines. Real-time monitoring dashboards with WebSocket streaming provide complete visibility into crawler health, browser pool management, performance metrics, and resource utilization for production deployments. Advanced features like URL seeding from sitemaps and Common Crawl, multi-URL configuration patterns, virtual scroll support for modern SPAs, and undetected browser modes for bypassing sophisticated bot detection make Crawl4AI suitable for any web data extraction challenge. Whether extracting pricing data, building knowledge bases, monitoring competitors, or feeding AI models, Crawl4AI delivers reliable, scalable web data extraction with minimal code and maximum flexibility.