### Quick Start: Scrape a Single Page Source: https://docs.llmcrawl.dev/sdk-javascript A basic example demonstrating how to initialize the LLMCrawl client and scrape a single web page, logging the extracted markdown content. ```typescript import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; const client = new LLMCrawl({ apiKey: "your-api-key-here", }); // Scrape a single page const result = await client.scrape("https://example.com"); console.log(result.data?.markdown); ``` -------------------------------- ### Install LLMCrawl SDK Source: https://docs.llmcrawl.dev/sdk-javascript Instructions for installing the LLMCrawl JavaScript SDK using npm, yarn, or pnpm. ```bash npm install @llmcrawl/llmcrawl-js ``` ```bash yarn add @llmcrawl/llmcrawl-js ``` ```bash pnpm add @llmcrawl/llmcrawl-js ``` -------------------------------- ### Install LLMCrawl JavaScript SDK Source: https://docs.llmcrawl.dev/index Installs the official LLMCrawl JavaScript/TypeScript SDK using npm. This SDK allows for easy integration with Node.js, React, Next.js, and browser applications. ```bash npm install @llmcrawl/llmcrawl-js ``` -------------------------------- ### Initiate Crawl Request Source: https://docs.llmcrawl.dev/tags/Crawling Demonstrates how to send a POST request to the /v1/crawl endpoint to start a web crawl. Includes configuration for paths, depth, limits, scraping options, and webhooks. ```cURL curl https://api.llmcrawl.dev/v1/crawl \ --request POST \ --header 'Authorization: Token' \ --header 'Content-Type: application/json' \ --data '{ "includePaths": [ [ "/blog/*", "/articles/*", "/docs/*" ] ], "excludePaths": [ [ "/admin/*", "/private/*", "/api/*" ] ], "maxDepth": 3, "limit": 500, "allowBackwardLinks": false, "allowExternalLinks": false, "ignoreSitemap": true, "url": "string", "origin": "api", "scrapeOptions": { "formats": [ [ "markdown", "rawHtml" ] ], "headers": { "additionalProperties": "string" }, "includeTags": [ [ "h1", "h2", "p", "article" ] ], "excludeTags": [ [ "nav", "footer", "script", "style" ] ], "waitFor": 3000, "extract": { "mode": "string", "schema": { "type": "object", "properties": { "title": { "type": "string" }, "price": { "type": "number" }, "description": { "type": "string" } }, "required": [ "title", "price" ] }, "systemPrompt": "Based on the information on the page, extract all the information from the schema. Try to extract all the fields even those that might not be marked as required.", "prompt": "Extract the main article title and author from this page" } }, "webhookUrls": [ [ "https://your-webhook.com/crawl-status" ] ], "webhookMetadata": { "crawlId": "crawl_123", "userId": "user_456" } }' ``` ```JavaScript fetch('https://api.llmcrawl.dev/v1/crawl', { method: 'POST', headers: { Authorization: 'Token', 'Content-Type': 'application/json' }, body: JSON.stringify({ includePaths: [{ 0: '/blog/*', 1: '/articles/*', 2: '/docs/*' }], excludePaths: [{ 0: '/admin/*', 1: '/private/*', 2: '/api/*' }], maxDepth: 3, limit: 500, allowBackwardLinks: false, allowExternalLinks: false, ignoreSitemap: true, url: 'string', origin: 'api', scrapeOptions: { formats: [{ 0: 'markdown', 1: 'rawHtml' }], headers: { additionalProperties: 'string' }, includeTags: [{ 0: 'h1', 1: 'h2', 2: 'p', 3: 'article' }], excludeTags: [{ 0: 'nav', 1: 'footer', 2: 'script', 3: 'style' }], waitFor: 3000, extract: { mode: 'string', schema: { type: 'object', properties: { title: { type: 'string' }, price: { type: 'number' }, description: { type: 'string' } }, required: ['title', 'price'] }, systemPrompt: 'Based on the information on the page, extract all the information from the schema. Try to extract all the fields even those that might not be marked as required.', prompt: 'Extract the main article title and author from this page' } }, webhookUrls: [{ 0: 'https://your-webhook.com/crawl-status' }], webhookMetadata: { crawlId: 'crawl_123', userId: 'user_456' } }) }) ``` ```PHP $ch = curl_init("https://api.llmcrawl.dev/v1/crawl"); curl_setopt($ch, CURLOPT_POST, true); curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Token', 'Content-Type: application/json']); curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode([ 'includePaths' => [ [ '/blog/*', '/articles/*', '/docs/*' ] ], 'excludePaths' => [ [ '/admin/*', '/private/*', '/api/*' ] ], 'maxDepth' => 3, 'limit' => 500, 'allowBackwardLinks' => false, 'allowExternalLinks' => false, 'ignoreSitemap' => true, 'url' => 'string', 'origin' => 'api', 'scrapeOptions' => [ 'formats' => [ [ 'markdown', 'rawHtml' ] ], 'headers' => [ 'additionalProperties' => 'string' ], 'includeTags' => [ [ 'h1', 'h2', 'p', 'article' ] ], 'excludeTags' => [ [ 'nav', 'footer', 'script', 'style' ] ], 'waitFor' => 3000, 'extract' => [ 'mode' => 'string', 'schema' => [ 'type' => 'object', 'properties' => [ 'title' => [ 'type' => 'string' ], 'price' => [ 'type' => 'number' ], 'description' => [ 'type' => 'string' ] ], 'required' => [ 'title', ``` -------------------------------- ### Knowledge Base Creation Source: https://docs.llmcrawl.dev/examples Builds a searchable knowledge base from website content by crawling specified paths and extracting structured data. It includes logic to poll crawl status and process results into a defined interface. ```typescript interface KnowledgeEntry { id: string; title: string; content: string; url: string; section: string; keywords: string[]; } async function buildKnowledgeBase(baseUrl: string): Promise { const crawl = await client.crawl(baseUrl, { limit: 1000, includePaths: ["/docs/*", "/guides/*", "/tutorials/*"], scrapeOptions: { formats: ["markdown"], extract: { schema: { type: "object", properties: { title: { type: "string" }, section: { type: "string" }, content: { type: "string" }, keywords: { type: "array", items: { type: "string" } }, }, }, }, }, }); if (!crawl.success) return []; // Wait for completion let status = await client.getCrawlStatus(crawl.id); while (status.success && status.status === "scraping") { await new Promise((resolve) => setTimeout(resolve, 5000)); status = await client.getCrawlStatus(crawl.id); } if (!status.success || status.status !== "completed") return []; // Process results const knowledgeBase: KnowledgeEntry[] = status.data .filter((page) => page.extract && page.markdown) .map((page, index) => { const extracted = JSON.parse(page.extract); return { id: `kb_${index + 1}`, title: extracted.title || page.metadata?.title || "Untitled", content: page.markdown, url: page.metadata?.url || "", section: extracted.section || "General", keywords: extracted.keywords || [], }; }); return knowledgeBase; } // Usage const kb = await buildKnowledgeBase("https://docs.myapp.com"); console.log(`Knowledge base created with ${kb.length} entries`); ``` -------------------------------- ### API Documentation Scraping Source: https://docs.llmcrawl.dev/examples Extracts structured API documentation from a given URL using a defined schema. It specifies inclusion paths and scraping options, including the format and extraction schema. ```typescript const apiDocSchema = { type: "object", properties: { title: { type: "string" }, description: { type: "string" }, endpoint: { type: "string" }, method: { type: "string" }, parameters: { type: "array", items: { type: "object", properties: { name: { type: "string" }, type: { type: "string" }, required: { type: "boolean" }, description: { type: "string" }, }, }, }, responseExample: { type: "string" }, codeExamples: { type: "array", items: { type: "object", properties: { language: { type: "string" }, code: { type: "string" }, }, }, }, }, }; const docsCrawl = await client.crawl("https://docs.api.example.com", { limit: 500, includePaths: ["/reference/*", "/endpoints/*"], scrapeOptions: { formats: ["markdown"], extract: { schema: apiDocSchema }, }, }); ``` -------------------------------- ### LLMCrawl API Reference - Scraping, Crawling, and Mapping Source: https://docs.llmcrawl.dev/examples Provides an overview of the LLMCrawl API endpoints for scraping, crawling, and mapping operations. It details the HTTP methods, paths, and associated operations available for interacting with the LLMCrawl service. ```APIDOC Scraping: POST /v1/scrape - Initiates a web scraping task. - Related documentation: https://docs.llmcrawl.dev/operations/postV1Scrape.html Crawling: POST /v1/crawl - Starts a web crawling task. - Related documentation: https://docs.llmcrawl.dev/operations/postV1Crawl.html GET /v1/crawl/{id} - Retrieves the status or results of a specific crawling task. - Parameters: - id: The unique identifier of the crawl task. - Related documentation: https://docs.llmcrawl.dev/operations/getV1CrawlById.html DELETE /v1/crawl/{id}/cancel - Cancels an ongoing crawling task. - Parameters: - id: The unique identifier of the crawl task to cancel. - Related documentation: https://docs.llmcrawl.dev/operations/deleteV1CrawlByIdCancel.html Mapping: POST /v1/map - Performs a mapping operation, likely related to data transformation or analysis. - Related documentation: https://docs.llmcrawl.dev/operations/postV1Map.html ``` -------------------------------- ### Price Monitoring Source: https://docs.llmcrawl.dev/examples Monitors product prices across multiple websites by scraping product pages. It defines an interface for product data and uses a schema to extract name, price, and currency. Includes basic error handling and rate limiting. ```typescript interface Product { name: string; price: number; url: string; timestamp: Date; } async function monitorPrices(productUrls: string[]): Promise { const priceSchema = { type: "object", properties: { name: { type: "string" }, price: { type: "number" }, currency: { type: "string" }, }, required: ["name", "price"], }; const products: Product[] = []; for (const url of productUrls) { try { const result = await client.scrape(url, { extract: { schema: priceSchema }, }); if (result.success && result.data.extract) { const data = JSON.parse(result.data.extract); products.push({ name: data.name, price: data.price, url, timestamp: new Date(), }); } } catch (error) { console.error(`Failed to scrape ${url}:`, error); } // Rate limiting await new Promise((resolve) => setTimeout(resolve, 1000)); } return products; } // Usage const productUrls = [ "https://store1.example.com/product/123", "https://store2.example.com/item/456", "https://store3.example.com/product/789", ]; const prices = await monitorPrices(productUrls); console.log("Current prices:", prices); ``` -------------------------------- ### Install JavaScript/TypeScript SDK Source: https://docs.llmcrawl.dev/sdks Installs the official LLMCrawl JavaScript/TypeScript SDK using npm. This package provides full API coverage and TypeScript support for Node.js and browser environments. ```bash npm install @llmcrawl/llmcrawl-js ``` -------------------------------- ### Extract Product Information with TypeScript Source: https://docs.llmcrawl.dev/examples Demonstrates how to extract structured product data from an e-commerce website using the LLMCrawl JavaScript/TypeScript SDK. It defines a JSON schema for product details and uses the `scrape` method to extract and parse the information. ```typescript import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; const client = new LLMCrawl({ apiKey: "your-api-key" }); const productSchema = { type: "object", properties: { name: { type: "string" }, price: { type: "number" }, originalPrice: { type: "number" }, discount: { type: "number" }, rating: { type: "number" }, reviewCount: { type: "number" }, inStock: { type: "boolean" }, description: { type: "string" }, specifications: { type: "object", properties: { brand: { type: "string" }, model: { type: "string" }, color: { type: "string" }, size: { type: "string" }, }, }, images: { type: "array", items: { type: "string" }, }, }, required: ["name", "price", "inStock"], }; const result = await client.scrape("https://store.example.com/product/123", { formats: ["markdown"], extract: { schema: productSchema }, }); if (result.success) { const product = JSON.parse(result.data.extract); console.log(`Product: ${product.name}`); console.log(`Price: $${product.price}`); console.log(`In Stock: ${product.inStock}`); } ``` -------------------------------- ### Blog Content Crawling Source: https://docs.llmcrawl.dev/examples Crawls blog sites to extract content, allowing specification of crawl limits, included/excluded paths, and scraping options like formats and extraction schemas. It also demonstrates monitoring crawl progress and processing the crawled data. ```typescript const blogCrawl = await client.crawl("https://blog.example.com", { limit: 200, includePaths: ["/posts/*", "/articles/*"], excludePaths: ["/admin/*", "/author/*"], scrapeOptions: { formats: ["markdown"], extract: { schema: { type: "object", properties: { title: { type: "string" }, author: { type: "string" }, publishDate: { type: "string" }, content: { type: "string" }, tags: { type: "array", items: { type: "string" } }, summary: { type: "string" }, }, }, }, }, }); // Monitor progress if (blogCrawl.success) { let status = await client.getCrawlStatus(blogCrawl.id); while (status.success && status.status === "scraping") { console.log(`Progress: ${status.completed}/${status.total} articles`); await new Promise((resolve) => setTimeout(resolve, 10000)); status = await client.getCrawlStatus(blogCrawl.id); } if (status.success && status.status === "completed") { console.log(`Successfully crawled ${status.data.length} articles`); // Process articles const articles = status.data .filter((page) => page.extract) .map((page) => JSON.parse(page.extract)); // Create content database const contentDB = articles.map((article, index) => ({ id: index + 1, title: article.title, author: article.author, publishDate: article.publishDate, wordCount: article.content.split(" ").length, tags: article.tags || [], })); console.log("Content database created:", contentDB.length, "entries"); } } ``` -------------------------------- ### Market Analysis with LLMCrawl Source: https://docs.llmcrawl.dev/examples Analyzes property market trends by scraping property data from a list of URLs. It calculates metrics like average price, median price, and price range. Requires a 'client' object with a 'scrape' method and a 'propertySchema'. ```typescript async function analyzeMarket(searchUrls: string[]) { const properties = []; for (const url of searchUrls) { const result = await client.scrape(url, { extract: { schema: propertySchema }, }); if (result.success && result.data.extract) { const property = JSON.parse(result.data.extract); properties.push(property); } await new Promise((resolve) => setTimeout(resolve, 2000)); } // Calculate market metrics const prices = properties.map((p) => p.price).filter((p) => p > 0); const avgPrice = prices.reduce((a, b) => a + b, 0) / prices.length; const medianPrice = prices.sort((a, b) => a - b)[ Math.floor(prices.length / 2) ]; return { totalProperties: properties.length, averagePrice: avgPrice, medianPrice: medianPrice, priceRange: { min: Math.min(...prices), max: Math.max(...prices), }, propertyTypes: [...new Set(properties.map((p) => p.propertyType))], }; } ``` -------------------------------- ### Stock and Financial Data Extraction Schema Source: https://docs.llmcrawl.dev/examples Defines a JSON schema for extracting stock and financial information, such as current price, change percentage, market cap, and P/E ratio. This schema is used with LLMCrawl's 'scrape' method to get structured financial data from company pages. ```typescript const financialSchema = { type: "object", properties: { symbol: { type: "string" }, companyName: { type: "string" }, currentPrice: { type: "number" }, change: { type: "number" }, changePercent: { type: "number" }, volume: { type: "number" }, marketCap: { type: "string" }, peRatio: { type: "number" }, dividendYield: { type: "number" }, earningsDate: { type: "string" }, }, }; const stockData = await client.scrape( "https://finance.example.com/stock/AAPL", { extract: { schema: financialSchema }, } ); ``` -------------------------------- ### Authentication and Client Initialization Source: https://docs.llmcrawl.dev/sdk-javascript Demonstrates how to initialize the LLMCrawl client with an API key and an optional custom base URL for API requests. ```typescript const client = new LLMCrawl({ apiKey: "your-api-key", baseUrl: "https://api.llmcrawl.dev", // Optional custom base URL }); ``` -------------------------------- ### Initialize LLMCrawl Client and Scrape Source: https://docs.llmcrawl.dev/sdks Demonstrates how to initialize the LLMCrawl client with an API key and perform a basic scrape operation for a given URL. The result contains extracted data, including markdown. ```typescript import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; const client = new LLMCrawl({ apiKey: "your-api-key" }); const result = await client.scrape("https://example.com"); ``` -------------------------------- ### Data Validation and Cleaning with Ajv Source: https://docs.llmcrawl.dev/examples This example demonstrates how to validate and clean data against a JSON schema using the Ajv library. It includes functions to compile a schema, validate data, and attempt to clean data based on validation errors. ```typescript import Ajv from "ajv"; const ajv = new Ajv(); function validateAndCleanData(data: any, schema: any) { const validate = ajv.compile(schema); const valid = validate(data); if (!valid) { console.warn("Validation errors:", validate.errors); // Attempt to clean/fix data return cleanData(data, validate.errors); } return data; } function cleanData(data: any, errors: any[]) { // Implement data cleaning logic based on validation errors const cleaned = { ...data }; errors.forEach((error) => { if (error.keyword === "type" && error.params.type === "number") { const path = error.instancePath.replace("/", ""); if (typeof cleaned[path] === "string") { const num = parseFloat(cleaned[path].replace(/[^0-9.-]/g, "")); if (!isNaN(num)) cleaned[path] = num; } } }); return cleaned; } ``` -------------------------------- ### LLMCrawl API: Get Crawl Job Status Source: https://docs.llmcrawl.dev/operations/getV1CrawlById Retrieves the status of a specific crawl job using its unique ID. Requires Bearer token authorization. The response includes job progress, completion status, and data extracted from the crawl. ```APIDOC GET /v1/crawl/{id} Get crawl job status Authorizations: bearerAuth: TypeHTTP (bearer) Parameters: Path Parameters: id*: Type:string, Required Responses: 200: Successful response Content-Type: application/json Schema: JSON { "success": true, "status": "string", "completed": 0, "total": 0, "expiresAt": "string", "next": "string", "data": [ { "markdown": "string", "extract": "string", "html": "string", "rawHtml": "string", "links": [ "string" ], "screenshot": "string", "metadata": { "additionalProperties": {} } } ] } 404: Not Found 500: Internal Server Error Playground: Authorization: bearerAuth Variables: Key: id*, Value: Samples: cURL: curl 'https://api.llmcrawl.dev/v1/crawl/{id}' \ --header 'Authorization: Token' JavaScript: fetch('https://api.llmcrawl.dev/v1/crawl/{id}', { headers: { Authorization: 'Token' } }) PHP: $ch = curl_init("https://api.llmcrawl.dev/v1/crawl/{id}"); curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Token']); curl_exec($ch); curl_close($ch); Python: requests.get("https://api.llmcrawl.dev/v1/crawl/{id}", headers={ "Authorization": "Token" } ) ``` -------------------------------- ### Cancel Crawl Samples Source: https://docs.llmcrawl.dev/tags/Crawling Provides code samples for canceling a crawl using different programming languages and tools. These examples demonstrate how to make the DELETE request to the API endpoint. ```cURL curl 'https://api.llmcrawl.dev/v1/crawl/{id}/cancel' \ --request DELETE \ --header 'Authorization: Token' ``` ```JavaScript fetch('https://api.llmcrawl.dev/v1/crawl/{id}/cancel', { method: 'DELETE', headers: { Authorization: 'Token' } }) ``` ```PHP $ch = curl_init("https://api.llmcrawl.dev/v1/crawl/{id}/cancel"); curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Token']); curl_exec($ch); curl_close($ch); ``` ```Python requests.delete( "https://api.llmcrawl.dev/v1/crawl/{id}/cancel", headers={ "Authorization": "Token" } ) ``` -------------------------------- ### Property Listing Extraction Source: https://docs.llmcrawl.dev/examples Extracts detailed property information from a real estate listing URL using a predefined JSON schema. This allows for structured data retrieval of listings. ```typescript const propertySchema = { type: "object", properties: { title: { type: "string" }, price: { type: "number" }, address: { type: "string" }, bedrooms: { type: "number" }, bathrooms: { type: "number" }, sqft: { type: "number" }, propertyType: { type: "string" }, description: { type: "string" }, features: { type: "array", items: { type: "string" } }, images: { type: "array", items: { type: "string" } }, agent: { type: "object", properties: { name: { type: "string" }, phone: { type: "string" }, email: { type: "string" }, }, }, }, required: ["title", "price", "address"], }; const property = await client.scrape("https://realty.example.com/listing/123", { formats: ["markdown"], extract: { schema: propertySchema }, }); ``` -------------------------------- ### Initialize LLMCrawl Client and Scrape URL (JavaScript) Source: https://docs.llmcrawl.dev/index Initializes the LLMCrawl client with an API key and demonstrates how to scrape a single URL. The markdown content is accessed from the result's data property. ```typescript import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; const client = new LLMCrawl({ apiKey: "your-api-key-here", }); // Scrape a single page const result = await client.scrape("https://example.com"); console.log(result.data?.markdown); ``` -------------------------------- ### Scrape Options Source: https://docs.llmcrawl.dev/sdk-javascript Configuration options for the `scrape()` method, detailing available parameters for output formats, headers, timeouts, page loading delays, AI extraction schemas, and webhook configurations. ```APIDOC Scrape Options: `formats`: `string[]` | Output formats: `'markdown'`, `'html'`, `'rawHtml'`, `'links'`, `'screenshot'`, `'screenshot@fullPage'` `headers`: `object` | Custom HTTP headers `includeTags`: `string[]` | HTML tags to include in output `excludeTags`: `string[]` | HTML tags to exclude from output `timeout`: `number` | Request timeout in milliseconds (1000-90000) `waitFor`: `number` | Delay before capturing content (0-60000ms) `extract`: `ExtractOptions` | AI extraction configuration `webhookUrls`: `string[]` | URLs to send results to `metadata`: `object` | Additional metadata to include ``` -------------------------------- ### Get Crawl Job Status Source: https://docs.llmcrawl.dev/api-reference Retrieves the status and results of a specific web crawl job using its unique ID. Includes details on completion progress, data extracted, and expiration time. ```APIDOC GET /v1/crawl/{id} Get crawl job status. **Authorizations:** - bearerAuth (HTTP Bearer Token) **Parameters:** - **id** (string, required): The unique identifier of the crawl job. **Responses:** - **200 OK**: Successful response. Content-Type: application/json Schema: ```json { "success": true, "status": "string", "completed": integer, "total": integer, "expiresAt": string, "next": string, "data": [ { "markdown": string, "extract": string, "html": string, "rawHtml": string, "links": [string], "screenshot": string, "metadata": { "additionalProperties": {} } } ] } ``` - **404 Not Found**: If the crawl job ID does not exist. - **500 Internal Server Error**: If there was an error on the server. ``` ```cURL curl 'https://api.llmcrawl.dev/v1/crawl/{id}' \ --header 'Authorization: Token' ``` ```JavaScript fetch('https://api.llmcrawl.dev/v1/crawl/{id}', { headers: { Authorization: 'Token' } }) ``` ```PHP $ch = curl_init("https://api.llmcrawl.dev/v1/crawl/{id}"); curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Token']); curl_exec($ch); curl_close($ch); ``` ```Python import requests response = requests.get("https://api.llmcrawl.dev/v1/crawl/{id}", headers={ "Authorization": "Token" } ) print(response.json()) ``` -------------------------------- ### LLMCrawl Crawling API Operations Source: https://docs.llmcrawl.dev/operations/postV1Crawl This section details the API endpoints for managing website crawls. It includes starting a new crawl, checking the status of an ongoing crawl, and canceling a crawl. ```APIDOC POST /v1/crawl Crawl a website. Authorizations: bearerAuth (Type: HTTP (bearer)) Request Body (application/json): Schema: JSON Properties: includePaths: array (string) excludePaths: array (string) maxDepth: integer limit: integer allowBackwardLinks: boolean allowExternalLinks: boolean ignoreSitemap: boolean url: string (required) origin: string scrapeOptions: object formats: array (string) headers: object (additionalProperties: string) includeTags: array (string) excludeTags: array (string) waitFor: integer extract: object mode: string schema: object type: object properties: { title: { type: string }, price: { type: number }, description: { type: string } } required: [title, price] systemPrompt: string prompt: string webhookUrls: array (string) webhookMetadata: object (crawlId: string, userId: string) Responses: 200: Successful response (Content-Type: application/json) Schema: JSON Body: { success: boolean, id: string, url: string } 400, 403, 500: Error responses ``` ```APIDOC GET /v1/crawl/{id} Retrieve the status of a specific crawl. Parameters: id: string (path parameter, required) - The ID of the crawl to retrieve. Responses: 200: Successful response (Content-Type: application/json) Schema: JSON Body: (Details for crawl status response would be here) 404: Crawl not found. ``` ```APIDOC DELETE /v1/crawl/{id}/cancel Cancel an ongoing crawl. Parameters: id: string (path parameter, required) - The ID of the crawl to cancel. Responses: 200: Successful response (Content-Type: application/json) Schema: JSON Body: { success: boolean, message: string } 404: Crawl not found or already completed/canceled. ``` -------------------------------- ### Data Processing with Pandas (Python) Source: https://docs.llmcrawl.dev/sdk-python Shows how to integrate LLMCrawl scraping with Pandas for efficient data processing. This example defines a helper function to scrape and extract data, then process it in a structured manner. ```python import pandas as pd from concurrent.futures import ThreadPoolExecutor, as_completed def scrape_and_extract(url: str, schema: dict) -> dict: """Scrape a URL and extract structured data""" result = client.scrape(url, extract={"schema": schema}) if result["success"]: return { "url": url, "success": True, "data": json.loads(result["data"]["extract"]) } return {"url": url, "success": False, "error": result.get("error")} # Example usage with Pandas (assuming client is initialized) # urls_to_scrape = ["url1", "url2", ...] # extraction_schema = {...} # # with ThreadPoolExecutor(max_workers=5) as executor: # futures = [executor.submit(scrape_and_extract, url, extraction_schema) for url in urls_to_scrape] # results = [] # for future in as_completed(futures): # results.append(future.result()) # # df = pd.DataFrame(results) # print(df) ``` -------------------------------- ### LLMCrawl Python SDK Preview API Source: https://docs.llmcrawl.dev/sdk-python This snippet demonstrates the expected API design for the LLMCrawl Python SDK. It shows how to initialize the client, scrape a single page, and perform AI-powered data extraction using Pydantic models. The SDK supports both synchronous and asynchronous operations. ```python from llmcrawl import LLMCrawl client = LLMCrawl(api_key="your-api-key") # Scrape a single page result = await client.scrape("https://example.com") print(result.data.markdown) # AI-powered extraction with Pydantic from pydantic import BaseModel class Product(BaseModel): name: str price: float in_stock: bool result = await client.scrape( "https://store.example.com/product/123", extract_model=Product ) product = result.data.extract # Type: Product ``` -------------------------------- ### Get Crawl Job Status (API) Source: https://docs.llmcrawl.dev/tags/Crawling Retrieves the status of a specific crawl job using its ID. This endpoint returns information about the crawl's progress, completion status, and any extracted data. ```APIDOC GET /v1/crawl/{id} Get crawl job status Authorizations: bearerAuth Type: HTTP (bearer) Parameters: Path Parameters: id*: Type: string Required Responses: 200: Successful response Content-Type: application/json Schema: JSON { "success": true, "status": "string", "completed": 0, "total": 0, "expiresAt": "string", "next": "string", "data": [ { "markdown": "string", "extract": "string", "html": "string", "rawHtml": "string", "links": [ "string" ], "screenshot": "string", "metadata": { "additionalProperties": {} } } ] } Playground: Authorization: bearerAuth Variables: Key: id* Value: Try it out Samples: cURL: curl 'https://api.llmcrawl.dev/v1/crawl/{id}' \ --header 'Authorization: Token' JavaScript: fetch('https://api.llmcrawl.dev/v1/crawl/{id}', { headers: { Authorization: 'Token' } }) PHP: $ch = curl_init("https://api.llmcrawl.dev/v1/crawl/{id}"); curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Token']); curl_exec($ch); curl_close($ch); Python: requests.get("https://api.llmcrawl.dev/v1/crawl/{id}", headers={ "Authorization": "Token" } ) ``` -------------------------------- ### LLMCrawl Client Initialization: v0.x vs v1.0.0 Source: https://docs.llmcrawl.dev/sdk-javascript Illustrates the difference in initializing the LLMCrawl client between version 0.x and version 1.0.0 of the SDK. The migration involves changing the import statement and the way the API key is passed during client instantiation. ```javascript // Before (v0.x): import LLMCrawl from "@llmcrawl/llmcrawl-js"; const client = new LLMCrawl("your-api-key"); ``` ```javascript // After (v1.0.0): import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; const client = new LLMCrawl({ apiKey: "your-api-key", }); ``` -------------------------------- ### Node.js Scraping and File Saving Source: https://docs.llmcrawl.dev/sdks An example for Node.js applications demonstrating how to scrape a website using the LLMCrawl SDK and save the extracted markdown content to a local file. It includes error handling for the scrape operation. ```typescript import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; import fs from "fs/promises"; const client = new LLMCrawl({ apiKey: process.env.LLMCRAWL_API_KEY }); // Scrape and save to file const result = await client.scrape("https://docs.example.com"); if (result.success) { await fs.writeFile("content.md", result.data.markdown); } ``` -------------------------------- ### News Article Extraction Source: https://docs.llmcrawl.dev/examples Extracts structured data from news articles using a predefined schema. It specifies properties like headline, author, content, and tags, and demonstrates how to parse the extracted JSON data. ```typescript const articleSchema = { type: "object", properties: { headline: { type: "string" }, subheadline: { type: "string" }, author: { type: "string" }, publishDate: { type: "string" }, content: { type: "string" }, tags: { type: "array", items: { type: "string" } }, category: { type: "string" }, readTime: { type: "number" }, relatedArticles: { type: "array", items: { type: "object", properties: { title: { type: "string" }, url: { type: "string" }, }, }, }, }, required: ["headline", "content"], }; const newsResult = await client.scrape("https://news.example.com/article/123", { formats: ["markdown"], extract: { schema: articleSchema }, }); if (newsResult.success) { const article = JSON.parse(newsResult.data.extract); console.log(`Article: ${article.headline}`); console.log(`Author: ${article.author}`); console.log(`Published: ${article.publishDate}`); } ``` -------------------------------- ### Node.js Environment Usage Source: https://docs.llmcrawl.dev/sdk-javascript Shows how to initialize and use the LLMCrawl SDK within a Node.js environment. It includes importing the client, setting the API key from environment variables, and writing scraped content to a file. ```typescript import { LLMCrawl } from "@llmcrawl/llmcrawl-js"; import fs from "fs/promises"; const client = new LLMCrawl({ apiKey: process.env.LLMCRAWL_API_KEY, }); const result = await client.scrape("https://example.com"); if (result.success) { await fs.writeFile("output.md", result.data.markdown); } ```