Firecrawl
https://github.com/firecrawl/firecrawl
Firecrawl is an API service that crawls websites and extracts clean markdown or structured data from them.
Tokens: 64,064 · Snippets: 496 · Trust Score: 9.4 · Updated: 2 weeks ago
Context Summary (auto-generated)
# Firecrawl

Firecrawl is an open-source web scraping and data extraction API that transforms any website into clean, LLM-ready data. It powers AI agents with reliable web access, handling the complexity of JavaScript rendering, proxy rotation, rate limiting, and content extraction. The platform supports multiple output formats including markdown, HTML, JSON (via AI extraction), and screenshots, covering 96% of the web including dynamic JavaScript-heavy pages.

The core architecture provides a unified API (v2) with SDKs for Python, Node.js, Java, Elixir, Go, and Rust. Key capabilities include single-page scraping, full-site crawling, web search with content extraction, AI-powered data extraction (Agent), batch processing, and interactive browser sessions. The system is designed for both cloud deployment at api.firecrawl.dev and self-hosted installations using Docker.

## Scrape API

Scrape a single URL and convert it to markdown, HTML, screenshots, or structured JSON data. Supports advanced options like custom headers, viewport settings, wait conditions, and browser actions (click, scroll, type).

```bash
# Basic scrape - returns markdown by default
curl -X POST 'https://api.firecrawl.dev/v2/scrape' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com"
  }'

# Advanced scrape with multiple formats and options
curl -X POST 'https://api.firecrawl.dev/v2/scrape' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html", "screenshot", "links"],
    "onlyMainContent": true,
    "waitFor": 2000,
    "timeout": 30000,
    "headers": {"User-Agent": "CustomBot/1.0"},
    "actions": [
      {"type": "wait", "milliseconds": 1000},
      {"type": "click", "selector": "#load-more"},
      {"type": "screenshot", "fullPage": true}
    ]
  }'

# Response
{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nContent here...",
    "html": "<h1>Page Title</h1>...",
    "screenshot": "data:image/png;base64,...",
    "links": ["https://example.com/page1", "https://example.com/page2"],
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "sourceURL": "https://example.com",
      "statusCode": 200
    }
  }
}
```

## Search API

Search the web and get full page content from results. Combines web search with automatic scraping of each result page.

```bash
# Basic web search
curl -X POST 'https://api.firecrawl.dev/v2/search' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "firecrawl web scraping",
    "limit": 5
  }'

# Advanced search with sources and scrape options
curl -X POST 'https://api.firecrawl.dev/v2/search' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "machine learning tutorials",
    "limit": 10,
    "sources": ["web", "news"],
    "lang": "en",
    "country": "us",
    "scrapeOptions": {
      "formats": ["markdown", "links"],
      "onlyMainContent": true
    }
  }'

# Response
{
  "success": true,
  "data": {
    "web": [
      {
        "url": "https://example.com/article",
        "title": "Article Title",
        "description": "Article description",
        "markdown": "# Full article content..."
      }
    ],
    "news": [
      {
        "url": "https://news.example.com/story",
        "title": "News Story",
        "markdown": "# News content..."
      }
    ]
  },
  "creditsUsed": 5
}
```
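For callers who hit the REST endpoints directly rather than through an SDK, the following minimal Python sketch mirrors the search request above and prints the scraped markdown of each web result. The endpoint path, request fields, and response shape are taken from the curl example; the `requests` plumbing and error handling are illustrative assumptions, not part of the official SDKs.

```python
import requests

API_KEY = "fc-YOUR_API_KEY"  # placeholder key, as in the curl examples above

# Search the web and scrape each result as markdown (mirrors the Search API curl example)
resp = requests.post(
    "https://api.firecrawl.dev/v2/search",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "query": "machine learning tutorials",
        "limit": 5,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

# Response envelope: {"success": true, "data": {"web": [...]}, "creditsUsed": ...}
for result in body.get("data", {}).get("web", []):
    print(result["title"], result["url"])
    print(result.get("markdown", "")[:200])  # first 200 characters of the scraped page
```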
## Interact API

Scrape a page and then interact with it using AI prompts or code. Enables multi-step browser automation.

```bash
# Step 1: Initial scrape to get session
curl -X POST 'https://api.firecrawl.dev/v2/scrape' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://amazon.com"}'

# Returns: {"success": true, "data": {..., "metadata": {"scrapeId": "abc123"}}}

# Step 2: Interact with the page using AI prompt
curl -X POST 'https://api.firecrawl.dev/v2/scrape/abc123/interact' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Search for mechanical keyboard and click the first result"
  }'

# Step 3: Continue interaction
curl -X POST 'https://api.firecrawl.dev/v2/scrape/abc123/interact' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Add the item to cart"
  }'

# Response
{
  "success": true,
  "output": "Added mechanical keyboard to cart",
  "liveViewUrl": "https://liveview.firecrawl.dev/session/abc123"
}
```

## Agent API

AI-powered autonomous data gathering. Describe what you need and the agent searches, navigates, and extracts data without requiring specific URLs.

```bash
# Basic agent request
curl -X POST 'https://api.firecrawl.dev/v2/agent' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Find the pricing plans for Notion"
  }'

# Agent with structured output schema
curl -X POST 'https://api.firecrawl.dev/v2/agent' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Find the founders of Firecrawl and their roles",
    "schema": {
      "type": "object",
      "properties": {
        "founders": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "role": {"type": "string"}
            }
          }
        }
      }
    },
    "model": "spark-1-pro"
  }'

# Agent with specific URLs to focus on
curl -X POST 'https://api.firecrawl.dev/v2/agent' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Compare features and pricing",
    "urls": ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"]
  }'

# Response (async - poll for status)
{
  "success": true,
  "id": "agent-job-123"
}

# Get agent status
curl -X GET 'https://api.firecrawl.dev/v2/agent/agent-job-123' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY'

# Completed response
{
  "success": true,
  "status": "completed",
  "data": {
    "founders": [
      {"name": "Eric Ciarla", "role": "Co-founder"},
      {"name": "Nicolas Camara", "role": "Co-founder"}
    ]
  },
  "creditsUsed": 15
}
```
## Crawl API

Crawl an entire website and extract content from all pages. Supports depth limits, path filtering, sitemap handling, and webhooks.

```bash
# Start a crawl job
curl -X POST 'https://api.firecrawl.dev/v2/crawl' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "limit": 100,
    "maxDiscoveryDepth": 3,
    "includePaths": ["/docs/*", "/guides/*"],
    "excludePaths": ["/blog/*"],
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    }
  }'

# Response - returns job ID
{
  "success": true,
  "id": "crawl-job-456",
  "url": "https://api.firecrawl.dev/v2/crawl/crawl-job-456"
}

# Check crawl status
curl -X GET 'https://api.firecrawl.dev/v2/crawl/crawl-job-456' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY'

# Status response
{
  "success": true,
  "status": "scraping",
  "completed": 25,
  "total": 100,
  "creditsUsed": 25,
  "expiresAt": "2024-01-15T12:00:00Z",
  "data": [
    {
      "markdown": "# Page content...",
      "metadata": {"sourceURL": "https://docs.firecrawl.dev/intro"}
    }
  ]
}

# Cancel a crawl
curl -X DELETE 'https://api.firecrawl.dev/v2/crawl/crawl-job-456' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY'
```

## Map API

Discover all URLs on a website instantly without scraping content. Uses sitemaps and intelligent crawling.

```bash
# Basic URL mapping
curl -X POST 'https://api.firecrawl.dev/v2/map' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://firecrawl.dev"
  }'

# Map with search filter
curl -X POST 'https://api.firecrawl.dev/v2/map' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://firecrawl.dev",
    "search": "pricing",
    "limit": 100,
    "includeSubdomains": true
  }'

# Response
{
  "success": true,
  "links": [
    {"url": "https://firecrawl.dev", "title": "Firecrawl", "description": "Turn websites into LLM-ready data"},
    {"url": "https://firecrawl.dev/pricing", "title": "Pricing", "description": "Firecrawl pricing plans"},
    {"url": "https://docs.firecrawl.dev", "title": "Documentation", "description": "API documentation"}
  ]
}
```

## Batch Scrape API

Scrape multiple URLs asynchronously with a single request. Ideal for processing large lists of pages.

```bash
# Start batch scrape
curl -X POST 'https://api.firecrawl.dev/v2/batch/scrape' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2",
      "https://example.com/page3"
    ],
    "formats": ["markdown", "links"],
    "onlyMainContent": true
  }'

# Response
{
  "success": true,
  "id": "batch-job-789",
  "url": "https://api.firecrawl.dev/v2/batch/scrape/batch-job-789"
}

# Check batch status
curl -X GET 'https://api.firecrawl.dev/v2/batch/scrape/batch-job-789' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY'

# Completed response
{
  "success": true,
  "status": "completed",
  "completed": 3,
  "total": 3,
  "data": [
    {"markdown": "# Page 1...", "metadata": {"sourceURL": "https://example.com/page1"}},
    {"markdown": "# Page 2...", "metadata": {"sourceURL": "https://example.com/page2"}},
    {"markdown": "# Page 3...", "metadata": {"sourceURL": "https://example.com/page3"}}
  ]
}
```
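A common pattern is to combine the two endpoints above: map a site to discover its URLs, filter the list, then submit one batch scrape job. The sketch below wires together the /v2/map and /v2/batch/scrape requests exactly as documented; the URL cap, filtering choice, and `requests` plumbing are illustrative assumptions.

```python
import requests

API_KEY = "fc-YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
BASE = "https://api.firecrawl.dev/v2"

# 1. Discover URLs on the site (Map API)
map_resp = requests.post(
    f"{BASE}/map", headers=HEADERS,
    json={"url": "https://firecrawl.dev", "search": "pricing"}, timeout=60,
)
map_resp.raise_for_status()
links = [link["url"] for link in map_resp.json().get("links", [])]

# 2. Submit the discovered URLs as a single asynchronous batch scrape job
batch_resp = requests.post(
    f"{BASE}/batch/scrape", headers=HEADERS,
    json={"urls": links[:50],  # arbitrary cap for illustration
          "formats": ["markdown"], "onlyMainContent": True},
    timeout=60,
)
batch_resp.raise_for_status()
job_id = batch_resp.json()["id"]
print("Batch job started:", job_id)  # poll GET /v2/batch/scrape/{job_id} for results
```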
## Browser Sessions API

Create persistent browser sessions for complex multi-step automation with code execution.

```bash
# Create browser session
curl -X POST 'https://api.firecrawl.dev/v2/browser' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "ttl": 300,
    "streamWebView": true
  }'

# Response
{
  "success": true,
  "sessionId": "browser-session-abc",
  "cdpUrl": "wss://browser.firecrawl.dev/session/abc",
  "liveViewUrl": "https://liveview.firecrawl.dev/session/abc",
  "expiresAt": "2024-01-15T12:05:00Z"
}

# Execute code in browser
curl -X POST 'https://api.firecrawl.dev/v2/browser/browser-session-abc/execute' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "code": "await page.goto(\"https://example.com\"); return await page.title();",
    "language": "node"
  }'

# Response
{
  "success": true,
  "output": "Example Domain",
  "stdout": "",
  "stderr": "",
  "exitCode": 0
}

# Delete browser session
curl -X DELETE 'https://api.firecrawl.dev/v2/browser/browser-session-abc' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY'
```

## Python SDK

The Python SDK provides a convenient wrapper around all Firecrawl APIs with automatic polling for async operations.

```python
from firecrawl import Firecrawl
from pydantic import BaseModel, Field
from typing import List, Optional

# Initialize client
app = Firecrawl(api_key="fc-YOUR_API_KEY")

# Scrape a single URL
doc = app.scrape("https://example.com", formats=["markdown", "links"])
print(doc.markdown)
print(doc.links)

# Search the web
results = app.search("best web scraping tools 2024", limit=10)
for result in results.data.web:
    print(f"{result.title}: {result.url}")

# Crawl a website (automatically polls until complete)
crawl_result = app.crawl(
    "https://docs.firecrawl.dev",
    limit=50,
    scrape_options={"formats": ["markdown"]}
)
for doc in crawl_result.data:
    print(doc.metadata.source_url, doc.markdown[:100])

# Map a website
map_result = app.map("https://firecrawl.dev", search="pricing")
for link in map_result.links:
    print(link.url, link.title)

# Batch scrape multiple URLs
batch_result = app.batch_scrape([
    "https://example.com/page1",
    "https://example.com/page2"
], formats=["markdown"])

# Agent with structured output
class Founder(BaseModel):
    name: str = Field(description="Full name")
    role: Optional[str] = Field(None, description="Role or position")

class FoundersSchema(BaseModel):
    founders: List[Founder]

agent_result = app.agent(
    prompt="Find the founders of Stripe",
    schema=FoundersSchema
)
print(agent_result.data)

# Interactive scraping
scrape_result = app.scrape("https://amazon.com")
scrape_id = scrape_result.metadata.scrape_id
app.interact(scrape_id, prompt="Search for laptops")
app.interact(scrape_id, prompt="Click the first result")
```
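Stepping back to the Browser Sessions API above: the create-session response includes a `cdpUrl`, which suggests the session can also be driven by an external Chrome DevTools Protocol client in addition to the /execute endpoint. Below is a hedged Playwright sketch using `connect_over_cdp`; whether Firecrawl's endpoint accepts outside CDP connections is an assumption not confirmed by this summary, and the URL is the placeholder value from the documented response.

```python
from playwright.sync_api import sync_playwright

# cdpUrl as returned by POST /v2/browser in the Browser Sessions example above
CDP_URL = "wss://browser.firecrawl.dev/session/abc"  # placeholder value

with sync_playwright() as p:
    # Attach to the remote browser over the Chrome DevTools Protocol
    browser = p.chromium.connect_over_cdp(CDP_URL)
    context = browser.contexts[0]  # the session's default context
    page = context.pages[0] if context.pages else context.new_page()

    page.goto("https://example.com")
    print(page.title())  # drive the session with ordinary Playwright calls

    browser.close()  # disconnects the client; the session itself expires per its ttl
```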
## Node.js SDK

The Node.js SDK provides TypeScript support and async/await patterns for all Firecrawl operations.

```javascript
import Firecrawl from '@mendable/firecrawl-js';
import { z } from 'zod';

// Initialize client
const app = new Firecrawl({ apiKey: 'fc-YOUR_API_KEY' });

// Scrape with typed JSON extraction
const ProductSchema = z.object({
  name: z.string(),
  price: z.number(),
  description: z.string()
});

const doc = await app.scrape('https://example.com/product', {
  formats: [{ type: 'json', schema: ProductSchema }]
});
console.log(doc.json); // Typed as { name: string, price: number, description: string }

// Search the web
const searchResults = await app.search('machine learning tutorials', {
  limit: 5,
  sources: ['web', 'news'],
  scrapeOptions: { formats: ['markdown'] }
});

// Crawl with real-time updates using watcher
const crawlResponse = await app.startCrawl('https://docs.example.com', {
  limit: 100,
  scrapeOptions: { formats: ['markdown'] }
});

const watcher = app.watcher(crawlResponse.id, { kind: 'crawl' });
watcher.addEventListener('document', (doc) => {
  console.log('New document:', doc.metadata.sourceURL);
});
watcher.addEventListener('done', (status) => {
  console.log('Crawl complete:', status.completed, 'pages');
});

// Agent for autonomous data gathering
const agentResult = await app.agent({
  prompt: 'Find and compare pricing for Notion, Coda, and Airtable',
  model: 'spark-1-pro'
});
console.log(agentResult.data);

// Interactive browser session
const result = await app.scrape('https://example.com');
await app.interact(result.metadata.scrapeId, {
  prompt: 'Click the login button and fill in test@example.com'
});
```

## Java SDK

The Java SDK provides type-safe access to all Firecrawl APIs with builder patterns.

```java
import dev.firecrawl.client.FirecrawlClient;
import dev.firecrawl.model.*;

// Initialize client
FirecrawlClient client = new FirecrawlClient(
    System.getenv("FIRECRAWL_API_KEY"), null, null
);

// Scrape a URL
ScrapeParams scrapeParams = new ScrapeParams();
scrapeParams.setFormats(new String[]{"markdown", "links"});
FirecrawlDocument doc = client.scrapeURL("https://example.com", scrapeParams);
System.out.println(doc.getMarkdown());

// Crawl a website
CrawlParams crawlParams = new CrawlParams();
crawlParams.setLimit(50);
crawlParams.setIncludePaths(new String[]{"/docs/*"});
CrawlStatusResponse crawl = client.crawlURL(
    "https://docs.example.com", crawlParams, null, 10
);
for (FirecrawlDocument page : crawl.getData()) {
    System.out.println(page.getMetadata().get("sourceURL"));
}

// Search the web
SearchParams searchParams = new SearchParams("web scraping tools");
searchParams.setLimit(10);
SearchResponse results = client.search(searchParams);
for (SearchResult r : results.getResults()) {
    System.out.println(r.getTitle() + ": " + r.getUrl());
}

// Map a website
MapData mapData = client.map("https://example.com");
for (MapLink link : mapData.getLinks()) {
    System.out.println(link.getUrl());
}

// Agent request
AgentParams agentParams = new AgentParams("Find pricing for Slack");
AgentResponse start = client.createAgent(agentParams);
AgentStatusResponse result = client.getAgentStatus(start.getId());
System.out.println(result.getData());
```
## Self-Hosting with Docker

Firecrawl can be self-hosted using Docker Compose for complete control over your scraping infrastructure.

```bash
# Clone the repository
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl

# Copy environment template
cp apps/api/.env.example apps/api/.env

# Configure environment (edit apps/api/.env)
# Required variables:
# - REDIS_URL=redis://redis:6379
# - POSTGRES_USER=firecrawl
# - POSTGRES_PASSWORD=your_secure_password
# - POSTGRES_DB=firecrawl

# Start services
docker-compose up -d

# Test the API (no API key required for self-hosted)
curl -X POST 'http://localhost:3002/v2/scrape' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://firecrawl.dev"}'

# View logs
docker-compose logs -f api
```

## MCP Integration

Connect Firecrawl to any MCP-compatible AI client (Claude Desktop, Cursor, etc.) for seamless web access.

```json
{
  "mcpServers": {
    "firecrawl-mcp": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "fc-YOUR_API_KEY"
      }
    }
  }
}
```

```bash
# Or install the CLI skill for agent integration
npx -y firecrawl-cli@latest init --all --browser

# CLI commands
firecrawl scrape https://example.com
firecrawl search "web scraping tools" --limit 5
firecrawl crawl https://docs.example.com --limit 50
firecrawl map https://example.com
```

## Summary

Firecrawl serves three primary use cases: (1) **AI/LLM Data Pipelines** - converting web pages to clean markdown or structured JSON for RAG systems, chatbots, and AI agents; (2) **Web Scraping at Scale** - batch processing thousands of URLs with automatic handling of JavaScript rendering, proxies, and rate limits; (3) **Browser Automation** - interactive sessions for complex workflows requiring multi-step navigation, form filling, and dynamic content extraction.

Integration patterns follow a consistent model across all SDKs: synchronous methods for single operations (scrape, search, map), async job patterns with polling for bulk operations (crawl, batch_scrape, agent), and event-driven watchers for real-time progress updates. The v2 API is the current stable version exposed directly on SDK clients, while v1 remains available under a `.v1` namespace for backward compatibility. Self-hosted deployments use the same API surface without requiring API keys, making it seamless to migrate between cloud and on-premise installations.
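To make the async job pattern concrete, here is a minimal sketch of the poll loop that the SDKs automate, written against the crawl endpoints and response fields shown earlier (`id`, `status`, `completed`, `total`, `data`). The sleep interval and the terminal-status check are assumptions for illustration.

```python
import time
import requests

API_KEY = "fc-YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
BASE = "https://api.firecrawl.dev/v2"

# Start an asynchronous crawl job (same request as the Crawl API example above)
start = requests.post(
    f"{BASE}/crawl", headers=HEADERS,
    json={"url": "https://docs.firecrawl.dev", "limit": 25,
          "scrapeOptions": {"formats": ["markdown"]}},
    timeout=60,
)
start.raise_for_status()
job_id = start.json()["id"]

# Poll the job until it leaves the "scraping" state
while True:
    status = requests.get(f"{BASE}/crawl/{job_id}", headers=HEADERS, timeout=60).json()
    print(f"{status.get('completed', 0)}/{status.get('total', '?')} pages scraped")
    if status.get("status") != "scraping":  # assumed terminal check; "completed" appears in the examples above
        break
    time.sleep(5)  # assumed polling interval

for page in status.get("data", []):
    print(page["metadata"]["sourceURL"])
```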