# Crawl4AI: LLM-Ready Web Crawler & Scraper Crawl4AI is an open-source Python library and Docker service that transforms web pages into clean, structured markdown and JSON data optimized for AI applications. Built on Playwright for reliable browser automation, it provides intelligent content extraction strategies including CSS selectors, XPath, LLM-powered extraction, and adaptive crawling that learns website patterns. The framework handles complex scenarios like JavaScript rendering, infinite scroll, authentication, proxy rotation, and anti-bot detection while maintaining high performance through async operations and browser pooling. The library excels at preparing web data for RAG systems, AI agents, and data pipelines with minimal code. It offers flexible deployment via pip installation for Python applications or Docker containers with REST APIs for language-agnostic integration. Built-in content filtering using BM25 and pruning algorithms removes boilerplate, while markdown generation with automatic citations creates LLM-friendly output. Crawl4AI supports multi-URL concurrent crawling, deep crawling strategies (BFS/DFS), URL seeding from sitemaps and Common Crawl, production features like caching, session management, real-time monitoring dashboards, and undetected browser modes for bypassing sophisticated bot detection systems. ## Core Python API ### AsyncWebCrawler - Main Crawler Interface Asynchronous web crawler with browser pooling, caching, and content extraction. Supports single URLs, batch processing, and streaming results. ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def main(): # Basic configuration browser_config = BrowserConfig( headless=True, browser_type="chromium", # or "firefox", "webkit" verbose=True ) run_config = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, # ENABLED, DISABLED, READ_ONLY, WRITE_ONLY word_count_threshold=10, page_timeout=60000, wait_until="domcontentloaded" # or "load", "networkidle" ) # Context manager handles browser lifecycle async with AsyncWebCrawler(config=browser_config) as crawler: # Single URL crawl result = await crawler.arun( url="https://www.nbcnews.com/business", config=run_config ) if result.success: print(f"Title: {result.metadata.get('title', 'N/A')}") print(f"Markdown length: {len(result.markdown.raw_markdown)}") print(f"Filtered markdown: {len(result.markdown.fit_markdown)}") print(f"Links found: {len(result.links['internal'])} internal, {len(result.links['external'])} external") print(f"Images: {len(result.media['images'])}") # Direct table access (new in v0.7.3) if result.tables: import pandas as pd df = pd.DataFrame(result.tables[0]['data']) print(f"First table: {len(df)} rows") else: print(f"Crawl failed: {result.error_message}") asyncio.run(main()) ``` ### Multiple URL Concurrent Crawling Process multiple URLs concurrently with automatic rate limiting and memory management. 
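Results from `arun_many` can either be collected as a list, as in the batch example below, or consumed one by one as each URL finishes. A minimal streaming sketch; the `stream=True` flag on `CrawlerRunConfig` is an assumption from recent releases and is not shown elsewhere in this section:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def stream_results():
    urls = [f"https://example.com/page{i}" for i in range(1, 4)]
    # stream=True is assumed: results are yielded as each URL completes
    run_config = CrawlerRunConfig(word_count_threshold=5, stream=True)

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        async for result in await crawler.arun_many(urls=urls, config=run_config):
            status = "ok" if result.success else f"failed: {result.error_message}"
            print(f"{result.url}: {status}")

asyncio.run(stream_results())
```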
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai import MemoryAdaptiveDispatcher

async def crawl_multiple():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=5,
        screenshot=True  # Capture screenshots
    )

    # Memory-adaptive dispatcher automatically manages concurrency
    dispatcher = MemoryAdaptiveDispatcher(
        max_session_permit=5,  # Max concurrent sessions
        max_session_permit_percent=0.8,  # 80% memory threshold
        monitor_interval=1.0
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        )

        for i, result in enumerate(results):
            if result.success:
                print(f"URL {i+1}: {len(result.markdown.raw_markdown)} chars")
                if result.screenshot:
                    # Screenshot is base64 encoded
                    print(f"Screenshot captured: {len(result.screenshot)} base64 chars")
            else:
                print(f"URL {i+1} failed: {result.error_message}")

asyncio.run(crawl_multiple())
```

### Multi-URL Configuration System

Apply different configurations to different URL patterns in batch processing (new in v0.7.3).

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, MatchMode
from crawl4ai import JsonCssExtractionStrategy

async def multi_config_crawl():
    browser_config = BrowserConfig(headless=True)

    # Define URL-specific configurations
    url_configs = [
        {
            "url_matcher": "*.pdf",  # String pattern with wildcards
            "config": CrawlerRunConfig(
                screenshot=False,
                pdf_export=True
            )
        },
        {
            "url_matcher": "*/blog/*",  # Match blog URLs
            "config": CrawlerRunConfig(
                word_count_threshold=50,
                extraction_strategy=JsonCssExtractionStrategy(schema={
                    "name": "Blog Posts",
                    "baseSelector": "article",
                    "fields": [
                        {"name": "title", "selector": "h1", "type": "text"},
                        {"name": "content", "selector": ".content", "type": "text"}
                    ]
                })
            )
        },
        {
            # Lambda function matcher for complex logic
            "url_matcher": lambda url: "docs" in url and url.endswith(".html"),
            "config": CrawlerRunConfig(
                word_count_threshold=100,
                css_selector=".documentation-content"
            )
        }
    ]

    # Fallback configuration when no patterns match
    fallback_config = CrawlerRunConfig(word_count_threshold=10)

    urls = [
        "https://example.com/doc.pdf",
        "https://example.com/blog/post-1",
        "https://example.com/docs/guide.html"
    ]

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=url_configs,
            fallback_config=fallback_config,
            match_mode=MatchMode.OR  # Use OR for multiple matchers
        )

        for result in results:
            print(f"{result.url}: {result.success}")

asyncio.run(multi_config_crawl())
```

### Real-Time Crawler Monitoring

Monitor crawl operations in real-time with detailed metrics and statistics (new in v0.7.3).
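Depending on the release, the monitor may attach to the dispatcher rather than being passed to `arun_many`. A short sketch under that assumption (the `monitor=` parameter on `MemoryAdaptiveDispatcher` is not shown elsewhere in this document); the full example below passes the monitor to `arun_many` directly:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai import CrawlerMonitor, DisplayMode, MemoryAdaptiveDispatcher

async def dispatcher_with_monitor():
    urls = [f"https://example.com/page{i}" for i in range(5)]

    # Assumed wiring: the dispatcher owns the monitor and renders progress
    dispatcher = MemoryAdaptiveDispatcher(
        max_session_permit=5,
        monitor=CrawlerMonitor(display_mode=DisplayMode.AGGREGATED),
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(word_count_threshold=10),
            dispatcher=dispatcher,
        )
        print(f"Completed {sum(r.success for r in results)}/{len(results)} URLs")

asyncio.run(dispatcher_with_monitor())
```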
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import CrawlerMonitor, DisplayMode async def monitored_crawl(): browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig(word_count_threshold=10) # Create monitor with display configuration monitor = CrawlerMonitor( max_visible_tasks=10, display_mode=DisplayMode.DETAILED, # or AGGREGATED show_details=True ) urls = [f"https://example.com/page{i}" for i in range(20)] async with AsyncWebCrawler(config=browser_config) as crawler: results = await crawler.arun_many( urls=urls, config=run_config, monitor=monitor # Attach monitor ) # Get final statistics stats = monitor.get_stats() print(f"\nCrawl Statistics:") print(f" Total URLs: {stats['total']}") print(f" Successful: {stats['success']}") print(f" Failed: {stats['failed']}") print(f" Success Rate: {stats['success_rate']:.1f}%") print(f" Avg Duration: {stats['avg_duration']:.2f}s") print(f" Peak Memory: {stats['peak_memory_mb']:.1f}MB") asyncio.run(monitored_crawl()) ``` ## Content Extraction Strategies ### LLM-Based Structured Extraction Extract structured data using any LLM provider with schema validation. ```python import asyncio import os from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import LLMExtractionStrategy, LLMConfig from pydantic import BaseModel, Field # Define extraction schema class Product(BaseModel): name: str = Field(..., description="Product name") price: str = Field(..., description="Product price") rating: float = Field(..., description="Product rating out of 5") availability: str = Field(..., description="In stock or out of stock") async def llm_extraction(): # Configure LLM - supports OpenAI, Anthropic, Ollama, etc. llm_config = LLMConfig( provider="openai/gpt-4o-mini", # or "anthropic/claude-3-sonnet", "ollama/llama2" api_token=os.getenv("OPENAI_API_KEY"), temperature=0.2, max_tokens=4000 ) extraction_strategy = LLMExtractionStrategy( llm_config=llm_config, schema=Product.schema(), extraction_type="schema", # or "block", "markdown" instruction="Extract all products with their details. Focus on name, price, rating, and availability.", chunk_token_threshold=4000, # Split large pages overlap_rate=0.1 ) browser_config = BrowserConfig(headless=True, verbose=True) run_config = CrawlerRunConfig( extraction_strategy=extraction_strategy, word_count_threshold=1 ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://www.amazon.com/s?k=laptop", config=run_config ) if result.success and result.extracted_content: import json products = json.loads(result.extracted_content) print(f"Extracted {len(products)} products:") for product in products[:3]: print(f" - {product['name']}: {product['price']} ({product['rating']}★)") # Show token usage extraction_strategy.show_usage() asyncio.run(llm_extraction()) ``` ### CSS Selector Extraction (Zero-LLM) Fast structured extraction using CSS selectors without LLM costs. 
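The schema is plain JSON: a `baseSelector` that matches each repeated element, plus a list of `fields` to pull from it. A minimal sketch against the public quotes.toscrape.com sandbox (selectors assumed from that site's markup); the full example below adds JavaScript scrolling and wait conditions for dynamic pages:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

async def minimal_css_extraction():
    schema = {
        "name": "Quotes",
        "baseSelector": "div.quote",  # One entry per quote block
        "fields": [
            {"name": "text", "selector": "span.text", "type": "text"},
            {"name": "author", "selector": "small.author", "type": "text"},
        ],
    }
    run_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema=schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://quotes.toscrape.com", config=run_config)
        if result.success and result.extracted_content:
            quotes = json.loads(result.extracted_content)
            print(f"Extracted {len(quotes)} quotes")

asyncio.run(minimal_css_extraction())
```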
```python import asyncio import json from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import JsonCssExtractionStrategy async def css_extraction(): # Define extraction schema with CSS selectors schema = { "name": "Course Catalog", "baseSelector": "div.course-card", # Container for each item "fields": [ { "name": "title", "selector": "h3.course-title", "type": "text" }, { "name": "instructor", "selector": "span.instructor-name", "type": "text" }, { "name": "price", "selector": "div.price", "type": "text" }, { "name": "thumbnail", "selector": "img.course-thumb", "type": "attribute", "attribute": "src" }, { "name": "rating", "selector": "span.rating-value", "type": "text" }, { "name": "course_link", "selector": "a.course-link", "type": "attribute", "attribute": "href" } ] } extraction_strategy = JsonCssExtractionStrategy(schema=schema, verbose=True) browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig( extraction_strategy=extraction_strategy, # Execute JavaScript to load dynamic content js_code=[ """ (async () => { // Scroll to load lazy content await new Promise(resolve => { let scrolls = 0; const interval = setInterval(() => { window.scrollBy(0, 500); scrolls++; if (scrolls >= 5) { clearInterval(interval); resolve(); } }, 200); }); })(); """ ], wait_for="div.course-card", # Wait for elements to appear wait_for_timeout=10000 ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://www.udemy.com/courses/search/?q=python", config=run_config ) if result.success: courses = json.loads(result.extracted_content) print(f"Extracted {len(courses)} courses:") for course in courses[:5]: print(f" - {course['title']} by {course['instructor']} - {course['price']}") asyncio.run(css_extraction()) ``` ### Regex Pattern Extraction Extract common entities (emails, phones, URLs, etc.) using built-in or custom regex patterns. 
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import RegexExtractionStrategy, _B async def regex_extraction(): # Use built-in patterns extraction_strategy = RegexExtractionStrategy( pattern=_B.EMAIL | _B.PHONE | _B.URL | _B.PRICE, # Combine patterns with | input_format="fit_html" # or "raw_html", "cleaned_html", "markdown" ) browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig(extraction_strategy=extraction_strategy) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://www.example.com/contact", config=run_config ) if result.success: import json matches = json.loads(result.extracted_content) print(f"Found {len(matches.get('EMAIL', []))} emails:") for email in matches.get('EMAIL', [])[:5]: print(f" - {email}") print(f"\nFound {len(matches.get('PHONE', []))} phone numbers:") for phone in matches.get('PHONE', [])[:5]: print(f" - {phone}") print(f"\nFound {len(matches.get('URL', []))} URLs") print(f"Found {len(matches.get('PRICE', []))} prices") asyncio.run(regex_extraction()) # Custom regex patterns async def custom_regex(): extraction_strategy = RegexExtractionStrategy( custom={ "product_id": r"SKU:\s*([A-Z0-9-]+)", "reference_number": r"Ref#\s*(\d{6,})" } ) run_config = CrawlerRunConfig(extraction_strategy=extraction_strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://example.com/product", config=run_config ) if result.success: import json matches = json.loads(result.extracted_content) print(f"Product IDs: {matches.get('product_id', [])}") print(f"Reference Numbers: {matches.get('reference_number', [])}") asyncio.run(custom_regex()) ``` ## Intelligent Markdown Generation ### Heuristic Content Filtering Generate clean markdown with automatic noise removal and content prioritization. 
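The overview mentions markdown generation with automatic citations: links can be rewritten as numbered references with a reference list appended at the end. A sketch, assuming the `markdown_with_citations` and `references_markdown` fields that recent releases expose on the markdown result; the content filters below work independently of this:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def markdown_citations():
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        if result.success:
            md = result.markdown
            # Assumed fields: citation-style markdown plus the collected reference list
            print(md.markdown_with_citations[:300])
            print("\nReferences:\n", md.references_markdown[:300])

asyncio.run(markdown_citations())
```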
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator async def filtered_markdown(): # Pruning filter - removes low-quality content automatically pruning_filter = PruningContentFilter( threshold=0.48, # Content quality threshold (0-1) threshold_type="fixed", # or "dynamic" min_word_threshold=5 # Minimum words per block ) markdown_generator = DefaultMarkdownGenerator( content_filter=pruning_filter, options={ "include_links": True, "include_images": True, "body_width": 0 # No wrapping } ) browser_config = BrowserConfig(headless=True, verbose=True) run_config = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, markdown_generator=markdown_generator ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://en.wikipedia.org/wiki/Web_scraping", config=run_config ) if result.success: raw_length = len(result.markdown.raw_markdown) fit_length = len(result.markdown.fit_markdown) reduction = ((raw_length - fit_length) / raw_length) * 100 print(f"Raw markdown: {raw_length} chars") print(f"Filtered markdown: {fit_length} chars") print(f"Reduction: {reduction:.1f}%") print("\nFirst 500 chars of filtered content:") print(result.markdown.fit_markdown[:500]) asyncio.run(filtered_markdown()) # BM25 filter - query-based relevance filtering async def bm25_filtering(): bm25_filter = BM25ContentFilter( user_query="machine learning algorithms neural networks", bm25_threshold=1.0, language="english", use_stemming=True ) markdown_generator = DefaultMarkdownGenerator(content_filter=bm25_filter) run_config = CrawlerRunConfig(markdown_generator=markdown_generator) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://en.wikipedia.org/wiki/Artificial_intelligence", config=run_config ) if result.success: # Only content relevant to the query is retained print(f"Query-filtered markdown: {len(result.markdown.fit_markdown)} chars") print(result.markdown.fit_markdown[:1000]) asyncio.run(bm25_filtering()) ``` ## Browser Configuration & Anti-Detection ### Stealth Mode and User Profiles Bypass bot detection with stealth mode and persistent browser profiles. 
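For moderately protected sites, `enable_stealth` plus a realistic user agent can be enough on its own. A minimal sketch using only options that appear in the full example below, which additionally adds a persistent profile, custom headers, cookies, and simulated user behavior:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def minimal_stealth():
    browser_config = BrowserConfig(
        headless=True,
        enable_stealth=True,  # Patch common automation fingerprints
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    )
    run_config = CrawlerRunConfig(simulate_user=True, magic=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print("ok" if result.success else result.error_message)

asyncio.run(minimal_stealth())
```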
```python import asyncio import os from pathlib import Path from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def stealth_crawling(): # Create persistent user data directory user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile") os.makedirs(user_data_dir, exist_ok=True) browser_config = BrowserConfig( # Browser type browser_type="chromium", # "undetected" for enhanced stealth headless=True, # Stealth features enable_stealth=True, # Persistent profile use_persistent_context=True, user_data_dir=user_data_dir, # User agent user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", # Viewport viewport_width=1920, viewport_height=1080, # Extra arguments to avoid detection extra_args=[ "--disable-blink-features=AutomationControlled", "--disable-dev-shm-usage", "--no-sandbox", "--disable-web-security" ], # Custom headers headers={ "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate, br", "Referer": "https://www.google.com/" }, # Cookies cookies=[ { "name": "session_id", "value": "your_session_value", "domain": ".example.com", "path": "/" } ], verbose=True ) run_config = CrawlerRunConfig( # Override navigator properties override_navigator=True, # Simulate human behavior simulate_user=True, delay_before_return_html=2.0, # Magic mode - combines multiple anti-detection techniques magic=True ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://example.com/protected", config=run_config ) if result.success: print("Successfully crawled protected page") print(f"Content length: {len(result.markdown.raw_markdown)}") asyncio.run(stealth_crawling()) ``` ### Undetected Browser Mode Use undetected Chrome for bypassing sophisticated bot detection systems (new in v0.7.3). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def undetected_crawling(): # Use undetected browser adapter browser_config = BrowserConfig( browser_type="undetected", # Special undetected Chrome headless=True, # Stealth headless mode # Human-like behavior viewport_width=1920, viewport_height=1080, user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", verbose=True ) run_config = CrawlerRunConfig( # Enable magic mode for additional anti-detection magic=True, simulate_user=True, override_navigator=True, page_timeout=60000 ) async with AsyncWebCrawler(config=browser_config) as crawler: # Crawl sites with Cloudflare or other bot protection result = await crawler.arun( url="https://example.com/protected-by-cloudflare", config=run_config ) if result.success: print("Bypassed bot detection successfully!") print(f"Content: {result.markdown.raw_markdown[:500]}") else: print(f"Failed: {result.error_message}") asyncio.run(undetected_crawling()) ``` ### Proxy Configuration Configure proxies with authentication and rotation strategies. 
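A single unauthenticated proxy only needs a `ProxyConfig` on the browser config. A minimal sketch with a placeholder proxy address; the full example below adds credentials and a rotation strategy across a proxy pool:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.async_configs import ProxyConfig

async def single_proxy():
    browser_config = BrowserConfig(
        headless=True,
        proxy_config=ProxyConfig(server="http://proxy.example.com:8080"),  # Placeholder proxy
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # httpbin echoes the caller's IP, making it easy to verify the proxy is in use
        result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig())
        if result.success:
            print(result.html[:200])

asyncio.run(single_proxy())
```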
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai.async_configs import ProxyConfig, ProxyRotationStrategy async def proxy_crawling(): # Single proxy proxy_config = ProxyConfig( server="http://proxy.example.com:8080", username="proxy_user", password="proxy_pass" ) browser_config = BrowserConfig( headless=True, proxy_config=proxy_config ) # Proxy rotation strategy proxy_list = [ ProxyConfig(server="http://proxy1.example.com:8080", username="user1", password="pass1"), ProxyConfig(server="http://proxy2.example.com:8080", username="user2", password="pass2"), ProxyConfig(server="http://proxy3.example.com:8080", username="user3", password="pass3"), ] rotation_strategy = ProxyRotationStrategy( proxies=proxy_list, rotation_type="round_robin" # or "random", "least_used" ) run_config = CrawlerRunConfig( proxy_rotation_strategy=rotation_strategy ) async with AsyncWebCrawler(config=browser_config) as crawler: urls = [f"https://httpbin.org/ip" for _ in range(5)] results = await crawler.arun_many(urls=urls, config=run_config) for i, result in enumerate(results): if result.success: print(f"Request {i+1} IP: {result.html[:200]}") asyncio.run(proxy_crawling()) ``` ## Advanced Features ### URL Discovery and Seeding Discover URLs from sitemaps and Common Crawl index (new feature). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig from crawl4ai import AsyncUrlSeeder, SeedingConfig async def discover_urls(): # Configure URL seeding seeding_config = SeedingConfig( source="sitemap+cc", # "sitemap", "cc", or "sitemap+cc" pattern="*/blog/*", # URL pattern to match max_urls=100, live_check=True, # Verify URLs are accessible extract_head=True, # Extract title and meta from HEAD concurrency=10, scoring_method="bm25", # or "cosine" query="machine learning artificial intelligence", # For relevance scoring score_threshold=0.3, # Minimum relevance score filter_nonsense_urls=True ) # Discover URLs using AsyncUrlSeeder async with AsyncUrlSeeder() as seeder: url_data = await seeder.seed( domain="example.com", config=seeding_config ) print(f"Discovered {len(url_data)} URLs:") for data in url_data[:10]: print(f" - {data['url']} (score: {data.get('score', 0):.2f})") if data.get('title'): print(f" Title: {data['title']}") # Now crawl the discovered URLs browser_config = BrowserConfig(headless=True) async with AsyncWebCrawler(config=browser_config) as crawler: # Extract just the URLs for crawling urls = [data['url'] for data in url_data[:5]] results = await crawler.arun_many(urls=urls) for result in results: if result.success: print(f"Crawled: {result.url} - {len(result.markdown.raw_markdown)} chars") asyncio.run(discover_urls()) ``` ### Virtual Scroll for Modern Websites Handle virtualized scrolling on modern websites like Twitter and Instagram (new feature). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import VirtualScrollConfig async def virtual_scroll_crawl(): # Configure virtual scrolling virtual_scroll = VirtualScrollConfig( scroll_amount=500, # Pixels to scroll each time scroll_count=10, # Number of scroll operations wait_time=2.0, # Wait between scrolls (seconds) # Automatically handles three scenarios: # 1. Content unchanged (continue scrolling) # 2. Content appended (traditional infinite scroll) # 3. 
Content replaced (true virtual scroll - Twitter/Instagram) ) browser_config = BrowserConfig(headless=True, viewport_height=1080) run_config = CrawlerRunConfig( virtual_scroll=virtual_scroll, word_count_threshold=10 ) async with AsyncWebCrawler(config=browser_config) as crawler: # Works with Twitter timelines, Instagram grids, etc. result = await crawler.arun( url="https://twitter.com/user/timeline", config=run_config ) if result.success: print(f"Captured all content through virtual scroll") print(f"Total content: {len(result.markdown.raw_markdown)} chars") print(f"Content chunks captured: {len(result.virtual_scroll_chunks)}") asyncio.run(virtual_scroll_crawl()) ``` ### Deep Crawling Strategies Automatically discover and crawl related pages using intelligent strategies. ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai.async_configs import DeepCrawlStrategy, DeepCrawlRule async def deep_crawl(): # Define crawling rules rules = [ DeepCrawlRule( match_pattern="/docs/*", # Only follow documentation links max_depth=3, priority=10 ), DeepCrawlRule( match_pattern="/blog/*", max_depth=2, priority=5 ) ] deep_strategy = DeepCrawlStrategy( strategy="bfs", # or "dfs", "best-first" max_pages=50, max_depth=3, rules=rules, same_domain_only=True, exclude_patterns=[ "/login", "/signup", "/logout", "*.pdf", "*.zip" ] ) browser_config = BrowserConfig(headless=True) run_config = CrawlerRunConfig( deep_crawl_strategy=deep_strategy, word_count_threshold=50 ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://docs.example.com", config=run_config ) if result.success: print(f"Pages crawled: {len(result.deep_crawl_results)}") for page in result.deep_crawl_results[:10]: print(f" - {page['url']} (depth: {page['depth']})") asyncio.run(deep_crawl()) ``` ### Session Management and Caching Reuse browser sessions and cache results for faster subsequent crawls. 
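Caching on its own is worth a quick illustration before the combined session-and-cache example below: with `CacheMode.ENABLED`, the first request populates the cache and an identical second request is served from it, which normally shows up as a much shorter wall-clock time. A minimal sketch:

```python
import asyncio
import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def cache_demo():
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

    async with AsyncWebCrawler() as crawler:
        t0 = time.perf_counter()
        await crawler.arun(url="https://example.com", config=config)  # Hits the network, writes cache
        t1 = time.perf_counter()
        await crawler.arun(url="https://example.com", config=config)  # Served from cache
        t2 = time.perf_counter()

        print(f"First crawl:  {t1 - t0:.2f}s")
        print(f"Second crawl: {t2 - t1:.2f}s (cached)")

asyncio.run(cache_demo())
```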
```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def session_and_cache(): browser_config = BrowserConfig( headless=True, use_persistent_context=True, user_data_dir="/tmp/my_browser_profile" ) # First crawl - login and save session login_config = CrawlerRunConfig( cache_mode=CacheMode.WRITE_ONLY, session_id="authenticated_session", js_code=[ """ // Simulate login document.querySelector('#username').value = 'user@example.com'; document.querySelector('#password').value = 'password123'; document.querySelector('#login-button').click(); await new Promise(r => setTimeout(r, 3000)); """ ] ) async with AsyncWebCrawler(config=browser_config) as crawler: # Login login_result = await crawler.arun( url="https://example.com/login", config=login_config ) # Subsequent crawls reuse session protected_config = CrawlerRunConfig( cache_mode=CacheMode.ENABLED, session_id="authenticated_session" ) result = await crawler.arun( url="https://example.com/dashboard", config=protected_config ) if result.success: print("Accessed protected page using saved session") print(f"Content: {result.markdown.raw_markdown[:500]}") # Read from cache on subsequent runs cached_result = await crawler.arun( url="https://example.com/dashboard", config=CrawlerRunConfig( cache_mode=CacheMode.READ_ONLY, session_id="authenticated_session" ) ) print(f"Loaded from cache: {cached_result.success}") asyncio.run(session_and_cache()) ``` ## Docker REST API ### Basic Crawling via HTTP Submit crawl jobs via REST API with synchronous or asynchronous endpoints. ```python import requests import time # Docker container should be running: docker run -p 11235:11235 unclecode/crawl4ai:latest BASE_URL = "http://localhost:11235" # Synchronous crawl - wait for result def sync_crawl(): response = requests.post( f"{BASE_URL}/crawl", json={ "urls": ["https://www.nbcnews.com/business"], "browser_config": { "headless": True, "viewport_width": 1920 }, "crawler_config": { "word_count_threshold": 10, "screenshot": True } }, timeout=60 ) response.raise_for_status() data = response.json() if data["success"]: result = data["results"][0] print(f"Title: {result['metadata']['title']}") print(f"Markdown length: {len(result['markdown']['raw_markdown'])}") print(f"Screenshot available: {bool(result.get('screenshot'))}") return result else: print(f"Crawl failed: {data.get('error')}") # Asynchronous crawl with job queue def async_crawl(): # Submit job response = requests.post( f"{BASE_URL}/crawl/job", json={ "urls": ["https://example.com"], "priority": 8, "crawler_config": { "js_code": ["window.scrollTo(0, document.body.scrollHeight);"], "wait_for": ".content-loaded" } } ) response.raise_for_status() task_id = response.json()["task_id"] print(f"Job submitted: {task_id}") # Poll for result while True: status_response = requests.get(f"{BASE_URL}/crawl/job/{task_id}") status_data = status_response.json() if status_data["status"] == "completed": print("Job completed!") return status_data["result"] elif status_data["status"] == "failed": print(f"Job failed: {status_data.get('error')}") return None print(f"Status: {status_data['status']}") time.sleep(2) if __name__ == "__main__": result = sync_crawl() # result = async_crawl() ``` ### LLM Extraction via API Perform LLM-based extraction through Docker API. 
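Before submitting extraction jobs, it helps to confirm the container is reachable. A short sketch, assuming the `/health` endpoint exposed by the Docker image:

```python
import requests

BASE_URL = "http://localhost:11235"

def check_server():
    # Assumed endpoint: a lightweight liveness check on the Crawl4AI container
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    response.raise_for_status()
    print(f"Server is up: {response.json()}")

if __name__ == "__main__":
    check_server()
```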
```python import requests import time import json BASE_URL = "http://localhost:11235" def llm_extraction_job(): # Submit LLM extraction job response = requests.post( f"{BASE_URL}/llm/job", json={ "urls": ["https://openai.com/api/pricing/"], "extraction_config": { "provider": "openai/gpt-4o-mini", "api_token": "your_openai_api_key", "schema": { "type": "object", "properties": { "model_name": { "type": "string", "description": "Name of the AI model" }, "input_price": { "type": "string", "description": "Price per input token" }, "output_price": { "type": "string", "description": "Price per output token" } }, "required": ["model_name", "input_price", "output_price"] }, "instruction": "Extract all AI models with their pricing information", "extraction_type": "schema" } } ) response.raise_for_status() task_id = response.json()["task_id"] print(f"LLM extraction job submitted: {task_id}") # Poll for completion while True: status = requests.get(f"{BASE_URL}/llm/job/{task_id}").json() if status["status"] == "completed": results = status["result"] extracted = json.loads(results["extracted_content"]) print(f"\nExtracted {len(extracted)} models:") for model in extracted[:3]: print(f" - {model['model_name']}") print(f" Input: {model['input_price']}") print(f" Output: {model['output_price']}") return extracted elif status["status"] == "failed": print(f"Job failed: {status.get('error')}") return None time.sleep(2) # Markdown with content filtering def get_filtered_markdown(): response = requests.post( f"{BASE_URL}/md", json={ "url": "https://en.wikipedia.org/wiki/Machine_learning", "f": "bm25", # Filter type: "fit", "raw", "bm25", "llm" "q": "supervised learning neural networks algorithms", "c": "0" # Cache version } ) response.raise_for_status() data = response.json() print(f"Filtered markdown length: {len(data['markdown'])}") print("\nFirst 500 characters:") print(data['markdown'][:500]) return data if __name__ == "__main__": # llm_extraction_job() get_filtered_markdown() ``` ### Webhooks for Async Jobs Configure webhooks to receive notifications when jobs complete. ```python import requests BASE_URL = "http://localhost:11235" def crawl_with_webhook(): response = requests.post( f"{BASE_URL}/crawl/job", json={ "urls": ["https://example.com/page1", "https://example.com/page2"], "webhook_url": "https://your-server.com/webhook/crawl-complete", "webhook_data_in_payload": True, # Include results in webhook "webhook_headers": { "X-API-Key": "your-secret-key", "Content-Type": "application/json" }, "crawler_config": { "word_count_threshold": 10 } } ) response.raise_for_status() task_id = response.json()["task_id"] print(f"Job submitted with webhook: {task_id}") print("You will receive notification at your webhook URL when complete") return task_id # Webhook endpoint receives payload: # { # "task_id": "abc-123", # "task_type": "crawl", # "status": "completed", # "timestamp": "2024-01-15T10:30:00Z", # "urls": ["https://example.com/page1", "https://example.com/page2"], # "data": { # "results": [...], # Full crawl results if webhook_data_in_payload=True # "success": true # } # } if __name__ == "__main__": crawl_with_webhook() ``` ### Docker Monitoring Dashboard Access real-time monitoring and control (new in v0.7.7). 
```python
import httpx
import asyncio
import json

BASE_URL = "http://localhost:11235"

async def monitor_system_health():
    """Access monitoring dashboard at http://localhost:11235/dashboard"""
    async with httpx.AsyncClient() as client:
        # Get system health
        response = await client.get(f"{BASE_URL}/monitor/health")
        health = response.json()

        print(f"Container Metrics:")
        print(f"  CPU: {health['container']['cpu_percent']:.1f}%")
        print(f"  Memory: {health['container']['memory_percent']:.1f}%")
        print(f"  Uptime: {health['container']['uptime_seconds']}s")

        print(f"\nBrowser Pool:")
        print(f"  Permanent: {health['pool']['permanent']['active']} active")
        print(f"  Hot Pool: {health['pool']['hot']['count']} browsers")
        print(f"  Cold Pool: {health['pool']['cold']['count']} browsers")

        print(f"\nStatistics:")
        print(f"  Total Requests: {health['stats']['total_requests']}")
        print(f"  Success Rate: {health['stats']['success_rate_percent']:.1f}%")
        print(f"  Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")

async def track_requests():
    """Track active and completed requests"""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{BASE_URL}/monitor/requests")
        requests_data = response.json()

        print(f"Active Requests: {len(requests_data['active'])}")
        print(f"Completed Requests: {len(requests_data['completed'])}")

        # See details of recent requests
        for req in requests_data['completed'][:5]:
            status_icon = "✅" if req['success'] else "❌"
            print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")

async def manage_browsers():
    """Manage browser pool - kill/restart browsers"""
    async with httpx.AsyncClient() as client:
        # Get browser list
        response = await client.get(f"{BASE_URL}/monitor/browsers")
        data = response.json()

        print(f"Total Browsers: {data['summary']['total_count']}")
        print(f"Total Memory: {data['summary']['total_memory_mb']:.1f}MB")
        print(f"Reuse Rate: {data['summary']['reuse_rate_percent']:.1f}%")

        # Manual control actions
        for browser in data['browsers']:
            if browser['tier'] == 'cold' and browser['idle_time_seconds'] > 300:
                # Kill idle cold browsers
                kill_response = await client.post(
                    f"{BASE_URL}/monitor/browsers/{browser['id']}/kill"
                )
                print(f"Killed idle browser: {browser['id']}")

# WebSocket streaming for real-time updates
async def stream_metrics():
    """Stream real-time metrics via WebSocket"""
    import websockets

    uri = "ws://localhost:11235/monitor/stream"
    async with websockets.connect(uri) as websocket:
        print("Connected to monitoring stream...")
        while True:
            message = await websocket.recv()
            data = json.loads(message)
            print(f"CPU: {data['cpu']:.1f}% | Memory: {data['memory']:.1f}% | Active: {data['active_requests']}")

if __name__ == "__main__":
    # Run monitoring examples
    asyncio.run(monitor_system_health())
    # asyncio.run(track_requests())
    # asyncio.run(manage_browsers())
    # asyncio.run(stream_metrics())
```

## Command Line Interface

### CLI Basic Usage

Command-line interface for quick crawling and testing.

```bash
# Install Crawl4AI
pip install -U crawl4ai
crawl4ai-setup

# Basic crawl with markdown output
crwl https://www.example.com -o markdown

# Output formats: all, json, markdown, md, markdown-fit, md-fit
crwl https://www.example.com -o json

# Deep crawl with BFS strategy
crwl https://docs.example.com --deep-crawl bfs --max-pages 20

# CSS selector extraction
crwl https://news.example.com -c css_selector=".article-title"

# LLM-based extraction with question
crwl https://example.com/products -q "What are the main products and their prices?"
# Use browser profile for authentication crwl https://linkedin.com/in/profile --profile linkedin-session # Screenshot capture crwl https://example.com -c screenshot=true -c screenshot_wait_for=2 # Custom browser settings crwl https://example.com -b headless=false -b viewport_width=1920 # Load configuration from files crwl https://example.com -B browser_config.yaml -C crawler_config.yaml # Structured extraction with schema crwl https://example.com/products -s product_schema.json -o json ``` ### CLI Advanced Configuration Use YAML configuration files for complex crawling scenarios. ```yaml # browser_config.yaml browser_type: chromium headless: true viewport_width: 1920 viewport_height: 1080 enable_stealth: true extra_args: - --disable-blink-features=AutomationControlled - --disable-dev-shm-usage headers: Accept-Language: en-US,en;q=0.9 Referer: https://www.google.com/ # crawler_config.yaml word_count_threshold: 10 cache_mode: BYPASS page_timeout: 60000 wait_until: domcontentloaded screenshot: true js_code: - window.scrollTo(0, document.body.scrollHeight); - await new Promise(r => setTimeout(r, 2000)); css_selector: ".main-content" excluded_tags: - nav - footer - aside # extraction_config.yaml provider: openai/gpt-4o-mini api_token: ${OPENAI_API_KEY} schema: type: object properties: title: type: string author: type: string date: type: string content: type: string extraction_type: schema instruction: Extract the article title, author, publication date, and main content # product_schema.json { "name": "Product Catalog", "baseSelector": "div.product-card", "fields": [ {"name": "title", "selector": "h3.title", "type": "text"}, {"name": "price", "selector": "span.price", "type": "text"}, {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"} ] } ``` ```bash # Use the configuration files crwl https://example.com \ -B browser_config.yaml \ -C crawler_config.yaml \ -e extraction_config.yaml \ -o json # Quick JSON extraction without config files crwl https://example.com/api-docs \ -j "Extract all API endpoints with their methods and descriptions" \ -o json # Manage browser profiles crwl profiles list crwl profiles create my-session crwl profiles delete my-session # Launch CDP-enabled browser for manual testing crwl cdp --port 9222 --profile test-session # Built-in browser management crwl browser start crwl browser status crwl browser stop ``` Crawl4AI provides a comprehensive toolkit for web data extraction ranging from simple one-liners to enterprise-grade crawling infrastructure. The library's modular design allows mixing and matching components - use simple CSS selectors for structured sites, LLM extraction for complex content, or adaptive crawling to automatically learn patterns. Content filtering ensures high-quality output for AI applications while caching and session management optimize performance. Deploy locally via pip for Python integration or use Docker containers with REST APIs for polyglot applications. The framework handles the complexity of modern web scraping including JavaScript rendering, authentication, proxies, virtual scrolling, and anti-bot measures with undetected browser support, letting developers focus on extracting value from web data rather than wrestling with browser automation details. Integration patterns span the spectrum from ad-hoc scripts to production pipelines. Use the Python API for RAG systems, data preprocessing, competitive intelligence, and AI agent tooling. 
Deploy Docker containers behind load balancers for high-throughput web data APIs serving multiple teams or customers. The CLI enables rapid prototyping and integration with shell scripts, cron jobs, and CI/CD pipelines. Real-time monitoring dashboards with WebSocket streaming provide complete visibility into crawler health, browser pool management, performance metrics, and resource utilization for production deployments. Advanced features like URL seeding from sitemaps and Common Crawl, multi-URL configuration patterns, virtual scroll support for modern SPAs, and undetected browser modes for bypassing sophisticated bot detection make Crawl4AI suitable for any web data extraction challenge. Whether extracting pricing data, building knowledge bases, monitoring competitors, or feeding AI models, Crawl4AI delivers reliable, scalable web data extraction with minimal code and maximum flexibility.