### WebCrawler API SDK Installation Source: https://webcrawlerapi.com/docs/sdk/python Instructions on how to install the WebCrawler API Python SDK using pip. ```APIDOC ## WebCrawler API SDK Installation ### Description Install the official Python SDK for the WebCrawler API to easily integrate web crawling functionalities into your projects. ### Method `pip` ### Command ```bash pip install webcrawlerapi ``` ``` -------------------------------- ### Installation Source: https://webcrawlerapi.com/docs/sdk/js Install the WebCrawler API JavaScript SDK using npm. ```APIDOC ## Installation ``` npm i webcrawlerapi-js ``` ``` -------------------------------- ### Install WebCrawler API Python SDK Source: https://webcrawlerapi.com/docs/sdk/python This snippet shows how to install the WebCrawler API Python SDK using pip. Ensure you have Python and pip installed on your system. ```bash pip install webcrawlerapi ``` -------------------------------- ### Install PHP WebCrawler API SDK Source: https://webcrawlerapi.com/docs/sdk/php This command uses Composer to install the WebCrawler API SDK for PHP. Ensure you have Composer installed and meet the PHP version requirements (8.1+). ```bash composer require webcrawlerapi/sdk ``` -------------------------------- ### POST /v1/crawl Source: https://webcrawlerapi.com/docs/getting-started Initiates a new web crawl job. This endpoint requires authentication and accepts a JSON payload with crawl parameters. ```APIDOC ## POST /v1/crawl ### Description Initiates a new web crawl job to extract data from a specified URL. ### Method POST ### Endpoint https://api.webcrawlerapi.com/v1/crawl ### Parameters #### Query Parameters None #### Request Body - **url** (string) - Required - The URL of the website to crawl. - **items_limit** (integer) - Optional - The maximum number of items to extract. - **scrape_type** (string) - Optional - The format for the scraped data (e.g., 'markdown'). ### Request Example ```json { "items_limit": 5, "url": "https://stripe.com/", "scrape_type": "markdown" } ``` ### Headers - **Authorization**: Bearer ### Response #### Success Response (200) - **id** (string) - The ID of the crawl job. #### Response Example ```json { "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b" } ``` ``` -------------------------------- ### Install WebCrawler API SDK for Node.js Source: https://webcrawlerapi.com/docs/sdk/js Installs the official WebCrawler API JavaScript/TypeScript SDK using npm. This is the first step before using any of the SDK's functionalities in your Node.js project. ```bash npm i webcrawlerapi-js ``` -------------------------------- ### Initiate a Webcrawl Request with Curl Source: https://webcrawlerapi.com/docs/getting-started This command initiates a web crawl job using the Webcrawler API. It sends an HTTP POST request to the API endpoint with an API key for authentication and specifies the URL to crawl and the desired scrape type (markdown). The response contains a crawl job ID for tracking. ```curl curl --request POST \ --url https://api.webcrawlerapi.com/v1/crawl \ --header 'Authorization: Bearer ' \ --data '{ "items_limit": 5, "url": "https://stripe.com/", "scrape_type": "markdown" }' ``` -------------------------------- ### GET /v1/job/{CRAWL_JOB_ID} Source: https://webcrawlerapi.com/docs/getting-started Retrieves the status and results of a specific web crawl job using its ID. ```APIDOC ## GET /v1/job/{CRAWL_JOB_ID} ### Description Retrieves the status and results of a web crawl job using its unique ID. ### Method GET ### Endpoint https://api.webcrawlerapi.com/v1/job/ ### Parameters #### Path Parameters - **CRAWL_JOB_ID** (string) - Required - The unique identifier for the crawl job. #### Query Parameters None #### Request Body None ### Headers - **Authorization**: Bearer ### Response #### Success Response (200) - **id** (string) - The ID of the crawl job. - **url** (string) - The URL that was crawled. - **status** (string) - The current status of the crawl job (e.g., 'done'). - **job_items** (array) - A list of items extracted during the crawl. - **id** (string) - ID of the job item. - **job_id** (string) - ID of the associated job. - **original_url** (string) - The original URL of the page. - **page_status_code** (integer) - The HTTP status code of the page. - **raw_content_url** (string) - URL to the raw content. - **clean_content_url** (string) - URL to the cleaned content. #### Response Example ```json { "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b", "url": "https://stripe.com/", "status": "done", "job_items": [ { "id": "be0c2ae2-8545-4c4a-8728-5dd122878098", "job_id": "be0c2ae2-8545-4c4a-8728-5dd122878098", "original_url": "https://stripe.com", "page_status_code": 200, "raw_content_url": "https://data.webcrawlerapi.com/raw/clrgcx48g0001ozloz9ficivc/be0c2ae2-8545-4c4a-8728-5dd122878098/https:__stripe_com", "clean_content_url": "https://data.webcrawlerapi.com/clean/clrgcx48g0001ozloz9ficivc/be0c2ae2-8545-4c4a-8728-5dd122878098/https:__stripe_com" } ] } ``` ``` -------------------------------- ### Example API GET Request for a Specific Job Source: https://webcrawlerapi.com/docs/api/job This example demonstrates a practical GET request to the API, specifying a concrete job ID to fetch the crawling job's information. This is useful for monitoring the progress or retrieving results of a past crawl. ```http https://api.webcrawlerapi.com/v1/job/6c391693-e566-4b99-97ca-5fa00032e281 ``` -------------------------------- ### Clone and Develop Webcrawler MCP Server Source: https://webcrawlerapi.com/docs/sdk/mcp Clones the WebcrawlerAPI MCP server repository from GitHub, installs dependencies, and starts the development server. This is for users who want to contribute to or modify the server code. ```bash git clone https://github.com/WebCrawlerAPI/webcrawlerapi-mcp cd webcrawlerapi-mcp npm install npm run dev ``` -------------------------------- ### Crawl a Website Source: https://webcrawlerapi.com/docs/index Initiates a web crawl to extract data from a specified URL. Requires an API key for authentication. ```APIDOC ## POST /v1/crawl ### Description Initiates a new crawl job to extract data from a given URL. The request is asynchronous. ### Method POST ### Endpoint https://api.webcrawlerapi.com/v1/crawl ### Parameters #### Headers - **Authorization** (string) - Required - Bearer token with your API key. #### Request Body - **url** (string) - Required - The URL of the website to crawl. - **items_limit** (integer) - Optional - The maximum number of items to extract. - **scrape_type** (string) - Optional - The desired format for the scraped data (e.g., 'markdown', 'html'). ### Request Example ```json { "items_limit": 5, "url": "https://stripe.com/", "scrape_type": "markdown" } ``` ### Response #### Success Response (200) - **id** (string) - The unique identifier for the crawl job. #### Response Example ```json { "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b" } ``` ``` -------------------------------- ### Build and Start Webcrawler MCP Server for Production Source: https://webcrawlerapi.com/docs/sdk/mcp Builds the WebcrawlerAPI MCP server for production deployment and then starts the production server. This process optimizes the code for performance and prepares it for live environments. ```bash npm run build npm start ``` -------------------------------- ### Install .NET WebCrawler API SDK Source: https://webcrawlerapi.com/docs/sdk/dotnet Installs the WebCrawler API SDK for .NET applications using the dotnet CLI. Ensure you have .NET 7.0 or higher installed. ```bash dotnet add package WebCrawlerApi ``` -------------------------------- ### Install WebcrawlerAPI LangChain Package Source: https://webcrawlerapi.com/docs/sdk/langchain Install the WebCrawlerAPI LangChain integration package using pip. This is the first step before using the integration in your project. ```bash pip install webcrawlerapi-langchain ``` -------------------------------- ### Install Webcrawler MCP Server using npm Source: https://webcrawlerapi.com/docs/sdk/mcp Installs the Webcrawler MCP server globally using npm. This command requires Node.js and npm to be installed on the system. ```bash npm install -g webcrawler-mcp ``` -------------------------------- ### Make a Webcrawler API Crawl Request (cURL) Source: https://webcrawlerapi.com/docs/index This command initiates a web crawl job using the Webcrawler API. It requires an API key for authentication and specifies the target URL and desired scrape type. The request is made via HTTP POST to the /v1/crawl endpoint. ```shell curl --request POST \ --url https://api.webcrawlerapi.com/v1/crawl \ --header 'Authorization: Bearer ' \ --data '{ \ "items_limit": 5, \ "url": "https://stripe.com/", \ "scrape_type": "markdown" \ }' ``` -------------------------------- ### Start Crawl Job Request Example (JSON) Source: https://webcrawlerapi.com/docs/api/crawl An example JSON payload to initiate a website crawling job. It includes the seed URL, webhook URL, item limit, scrape type, subdomain crawling preference, and robots.txt respect setting. ```json { "url": "https://stripe.com/", "webhook_url": "https://yourserver.com/webhook", "items_limit": 10, "scrape_type": "cleaned", "allow_subdomains": false, "respect_robots_txt": true } ``` -------------------------------- ### Get Crawl Job Status and Result Source: https://webcrawlerapi.com/docs/index Retrieves the status and extracted data for a specific crawl job using its ID. ```APIDOC ## GET /v1/job/{CRAWL_JOB_ID} ### Description Retrieves the status and results of a previously initiated crawl job. ### Method GET ### Endpoint https://api.webcrawlerapi.com/v1/job/{CRAWL_JOB_ID} ### Parameters #### Path Parameters - **CRAWL_JOB_ID** (string) - Required - The ID of the crawl job to retrieve. #### Headers - **Authorization** (string) - Required - Bearer token with your API key. ### Response #### Success Response (200) - **id** (string) - The crawl job ID. - **url** (string) - The original URL that was crawled. - **status** (string) - The current status of the crawl job (e.g., 'done', 'processing'). - **job_items** (array) - A list of items extracted during the crawl. - **id** (string) - The ID of the job item. - **job_id** (string) - The ID of the associated crawl job. - **original_url** (string) - The URL of the page from which the item was extracted. - **page_status_code** (integer) - The HTTP status code of the crawled page. - **raw_content_url** (string) - URL to the raw content of the crawled page. - **clean_content_url** (string) - URL to the cleaned content of the crawled page. #### Response Example ```json { "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b", "url": "https://stripe.com/", "status": "done", "job_items": [ { "id": "be0c2ae2-8545-4c4a-8728-5dd122878098", "job_id": "be0c2ae2-8545-4c4a-8728-5dd122878098", "original_url": "https://stripe.com", "page_status_code": 200, "raw_content_url": "https://data.webcrawlerapi.com/raw/clrgcx48g0001ozloz9ficivc/be0c2ae2-8545-4c4a-8728-5dd122878098/https:__stripe_com", "clean_content_url": "https://data.webcrawlerapi.com/clean/clrgcx48g0001ozloz9ficivc/be0c2ae2-8545-4c4a-8728-5dd122878098/https:__stripe_com" } ] } ``` ``` -------------------------------- ### Get Job URLs Request Example (cURL) Source: https://webcrawlerapi.com/docs/api/urls Example of how to make a GET request to retrieve all URLs for a specific job using cURL. This includes the API endpoint, the job ID, and the necessary authorization header. ```curl curl --request GET \ --url https://api.webcrawlerapi.com/v1/job/46c7b8ff-eb5e-4ebb-96f1-2685334c07d7/urls \ --header 'Authorization: Bearer ' ``` -------------------------------- ### Retrieve Webcrawler API Crawl Result (cURL) Source: https://webcrawlerapi.com/docs/index This command fetches the results of a completed web crawl job. It requires the crawl job ID and the API key for authentication. The request is made via HTTP GET to the /v1/job/ endpoint. ```shell curl --request GET \ --url https://api.webcrawlerapi.com/v1/job/ \ --header 'Authorization : Bearer ' ``` -------------------------------- ### Get Job URLs Response Example (JSON) Source: https://webcrawlerapi.com/docs/api/urls Example JSON response structure for the '/job/:id/urls' endpoint. It includes a 'clusters' array for path-based URL groupings and a 'urls' array for a flat list of all discovered URLs. ```json { "clusters": [ { "path": "/scrapers", "size": 36 }, { "path": "/scrapers/webcrawler", "size": 29 }, { "path": "/blog", "size": 21 }, { "path": "/docs", "size": 20 }, { "path": "/docs/API", "size": 7 } // ... more clusters ], "urls": [ "https://webcrawlerapi.com/privacy", "https://webcrawlerapi.com/changelog", "https://webcrawlerapi.com/docs/API/cancel", "https://webcrawlerapi.com/scrapers/webcrawler/google-search-result/api", "https://webcrawlerapi.com/docs/sdk/python" // ... more URLs ] } ``` -------------------------------- ### WebCrawlerAPI LangChain Integration - Document QA System Example Source: https://webcrawlerapi.com/docs/sdk/langchain Provides a practical example of using WebCrawlerAPILoader to build a Document Question Answering system with LangChain. ```APIDOC ## WebCrawlerAPILoader - Document QA System Example ### Description This example demonstrates how to load documents from a website using `WebCrawlerAPILoader` and then use them to build a Question Answering system with LangChain. ### Methods 1. Initialize `WebCrawlerAPILoader`. 2. Load documents using `loader.load()`. 3. Create embeddings (e.g., `OpenAIEmbeddings`). 4. Store documents in a vector store (e.g., `Chroma`). 5. Initialize a `RetrievalQA` chain. ### Endpoint N/A (SDK usage example) ### Request Example ```python from langchain.chains import RetrievalQA from langchain.embeddings import OpenAIEmbeddings from langchain.vectorstores import Chroma from webcrawlerapi_langchain import WebCrawlerAPILoader from langchain_openai import OpenAI # Assuming OpenAI is imported from langchain_openai # Load documents from a website loader = WebCrawlerAPILoader( url="https://docs.example.com", api_key="your-api-key", scrape_type="markdown" ) documents = loader.load() # Create vector store embeddings = OpenAIEmbeddings() vectorstore = Chroma.from_documents(documents, embeddings) # Create QA chain qa_chain = RetrievalQA.from_chain_type( llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever() ) # Now you can use qa_chain to ask questions about the loaded documents. # For example: # query = "What is the main purpose of this documentation?" # result = qa_chain.invoke({"query": query}) # print(result["result"]) ``` ### Response #### Success Response - **qa_chain** (RetrievalQA object) - An initialized LangChain QA chain ready to answer questions based on the loaded documents. ``` -------------------------------- ### Synchronous Crawling Source: https://webcrawlerapi.com/docs/sdk/python Example of how to perform a synchronous web crawl using the WebCrawler API Python SDK. The crawl waits for completion and returns all data. ```APIDOC ## Synchronous Crawling ### Description Perform a synchronous crawl where the SDK waits for the entire crawling process to finish and returns all collected data at once. This method is suitable for simpler use cases where immediate results are desired. ### Method `WebCrawlerAPI.crawl()` ### Endpoint N/A (SDK Method) ### Parameters #### Request Body Parameters (passed to `crawl` method) - **url** (string) - Required - The target URL to crawl. - **scrape_type** (string) - Optional - Type of content to extract ('markdown', 'html', 'cleaned'). - **items_limit** (integer) - Optional - Maximum number of pages to crawl (default: 10). - **allow_subdomains** (boolean) - Optional - Whether to crawl subdomains (default: False). - **whitelist_regexp** (string) - Optional - Regular expression for allowed URLs. - **blacklist_regexp** (string) - Optional - Regular expression for blocked URLs. - **webhook_url** (string) - Optional - URL to receive notifications when the job completes. - **max_polls** (integer) - Optional - Maximum number of status checks (default: 100). ### Request Example ```python from webcrawlerapi import WebCrawlerAPI # Initialize the client crawler = WebCrawlerAPI(api_key="YOUR_API_KEY") # Synchronous crawling result = crawler.crawl( url="https://example.com", scrape_type="markdown", items_limit=10 ) print(f"Job completed with status: {result.status}") print(f"Number of items crawled: {len(result.job_items)}") ``` ### Response #### Success Response (200) Returns a `Job` object containing crawl results. #### Job Object Attributes - **id** (string) - Unique job identifier. - **status** (string) - Job status ('new', 'in_progress', 'done', 'error'). - **url** (string) - Original crawl URL. - **created_at** (string) - Job creation timestamp. - **finished_at** (string) - Job completion timestamp. - **job_items** (list) - List of crawled items (`JobItem` objects). - **recommended_pull_delay_ms** (integer) - Recommended delay between status checks. #### JobItem Object Attributes - **id** (string) - Unique item identifier. - **original_url** (string) - URL of the crawled page. - **title** (string) - Page title. - **status** (string) - Item status. - **page_status_code** (integer) - HTTP status code. - **markdown_content_url** (string) - URL to markdown content (if applicable). - **raw_content_url** (string) - URL to raw content. - **cleaned_content_url** (string) - URL to cleaned content. ``` -------------------------------- ### n8n WebcrawlerAPI Node Configuration Source: https://webcrawlerapi.com/docs/sdk/n8n This section describes the steps to configure and use the official WebcrawlerAPI node within n8n. It involves installation, adding the node to a workflow, and setting up API credentials and scraping parameters like URL and output format. No specific code example is provided, as it's a UI-driven configuration. -------------------------------- ### Scrape Webpage Request Example (JSON) Source: https://webcrawlerapi.com/docs/api/scrape This example demonstrates how to structure a POST request to the /scrape endpoint to retrieve webpage content. It includes parameters for the target URL, output format, elements to exclude, and robots.txt respect. The `prompt` parameter can be used for advanced content extraction or formatting. ```json { "url": "https://www.example.com", "output_format": "markdown", "clean_selectors": ".advertisement,.footer", "respect_robots_txt": true } ``` -------------------------------- ### Run Webcrawler MCP Server using npx Source: https://webcrawlerapi.com/docs/sdk/mcp Executes the Webcrawler MCP server directly using npx without requiring a global installation. This is a convenient way to run the server for quick use or testing. ```bash npx webcrawler-mcp ``` -------------------------------- ### GetContent Method Source: https://webcrawlerapi.com/docs/sdk/js Retrieve the full content associated with a job item using the `getContent()` method. ```APIDOC ## GetContent The job item contains a link to its content. For convenience, there is a `getContent()` method that allows you to easily access this content. Here's an example: ```javascript const result = await client.crawl({ "url": "https://stripe.com/", "scrape_type": "markdown", "items_limit": 10 }); for (const item of syncJob.job_items) { item.getContent().then((content) => { console.log(content.slice(0, 100)); }) } ``` This method retrieves the full content associated with the job, which can be useful for processing or displaying the job's data. ``` -------------------------------- ### Synchronous Crawling Source: https://webcrawlerapi.com/docs/sdk/js Perform a synchronous crawl that waits for completion and returns all data at once. ```APIDOC ## Synchronous Crawling The synchronous method waits for the crawl to complete and returns all data at once. ```javascript import webcrawlerapi from "webcrawlerapi-js"; const client = new webcrawlerapi.WebcrawlerClient("YOUR_API_KEY"); // Synchronous crawling const result = await client.crawl({ "url": "https://stripe.com/", "scrape_type": "markdown", "items_limit": 10 }); for (const item of syncJob.job_items) { item.getContent().then((content) => { console.log(content.slice(0, 100)); }) } console.log(result); ``` ``` -------------------------------- ### Document QA System with WebcrawlerAPI and LangChain Source: https://webcrawlerapi.com/docs/sdk/langchain An example of building a Document QA system using WebCrawlerAPILoader, OpenAI embeddings, and Chroma vector store. This demonstrates how to load web content and make it searchable for question answering. ```python from langchain.chains import RetrievalQA from langchain.embeddings import OpenAIEmbeddings from langchain.vectorstores import Chroma from webcrawlerapi_langchain import WebCrawlerAPILoader # Load documents from a website loader = WebCrawlerAPILoader( url="https://docs.example.com", api_key="your-api-key", scrape_type="markdown" ) documents = loader.load() # Create vector store embeddings = OpenAIEmbeddings() vectorstore = Chroma.from_documents(documents, embeddings) # Create QA chain qa_chain = RetrievalQA.from_chain_type( llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever() ) ``` -------------------------------- ### Crawl API Response Example (JSON) Source: https://webcrawlerapi.com/docs/api/crawl An example JSON response from the Crawl API, containing the unique identifier for the initiated crawling task. ```json { "id": "23b81e21-c672-4402-a886-303f18de9555" } ``` -------------------------------- ### Asynchronous Crawling Source: https://webcrawlerapi.com/docs/sdk/python Example of how to perform an asynchronous web crawl using the WebCrawler API Python SDK. This method returns a job ID immediately, allowing for status checking. ```APIDOC ## Asynchronous Crawling ### Description Initiate an asynchronous crawl which returns a job ID immediately, allowing you to check the crawl status and retrieve results later. This method is suitable for long-running crawls or when you need to integrate the crawl into a background process. ### Method `WebCrawlerAPI.crawl_async()` and `WebCrawlerAPI.get_job()` ### Endpoint N/A (SDK Methods) ### Parameters #### Request Body Parameters (passed to `crawl_async` method) - **url** (string) - Required - The target URL to crawl. - **scrape_type** (string) - Optional - Type of content to extract ('markdown', 'html', 'cleaned'). - **items_limit** (integer) - Optional - Maximum number of pages to crawl (default: 10). - **allow_subdomains** (boolean) - Optional - Whether to crawl subdomains (default: False). - **whitelist_regexp** (string) - Optional - Regular expression for allowed URLs. - **blacklist_regexp** (string) - Optional - Regular expression for blocked URLs. - **webhook_url** (string) - Optional - URL to receive notifications when the job completes. #### Parameters for `get_job()` - **job_id** (string) - Required - The ID of the job to retrieve status for. ### Request Example ```python from webcrawlerapi import WebCrawlerAPI import time # Initialize the client crawler = WebCrawlerAPI(api_key="YOUR_API_KEY") # Start async crawl job job = crawler.crawl_async( url="https://example.com", scrape_type="markdown", items_limit=10 ) # Get the job ID job_id = job.id # Check job status job_status = crawler.get_job(job_id) # Poll until job is complete while job_status.status == 'in_progress': time.sleep(job_status.recommended_pull_delay_ms / 1000) # Convert ms to seconds job_status = crawler.get_job(job_id) # Process results if job_status.status == 'done': for item in job_status.job_items: print(f"Page title: {item.title}") print(f"Original URL: {item.original_url}") print(f"Markdown content URL: {item.markdown_content_url}") ``` ### Response #### Success Response (200) - **Job Object**: Returned by `crawl_async` initially containing job details. - **Job Object**: Returned by `get_job` with updated status and results. #### Job Object Attributes - **id** (string) - Unique job identifier. - **status** (string) - Job status ('new', 'in_progress', 'done', 'error'). - **url** (string) - Original crawl URL. - **created_at** (string) - Job creation timestamp. - **finished_at** (string) - Job completion timestamp. - **job_items** (list) - List of crawled items (`JobItem` objects). - **recommended_pull_delay_ms** (integer) - Recommended delay between status checks. #### JobItem Object Attributes - **id** (string) - Unique item identifier. - **original_url** (string) - URL of the crawled page. - **title** (string) - Page title. - **status** (string) - Item status. - **page_status_code** (integer) - HTTP status code. - **markdown_content_url** (string) - URL to markdown content (if applicable). - **raw_content_url** (string) - URL to raw content. - **cleaned_content_url** (string) - URL to cleaned content. ``` -------------------------------- ### Webhook Request Example - JSON Source: https://webcrawlerapi.com/docs/async-requests Example JSON payload for initiating an asynchronous request with a webhook URL. The 'webhook_url' parameter specifies where the results will be sent upon completion. ```json { "url": "https://stripe.com/", "webhook_url": "https://yourserver.com/webhook" } ``` -------------------------------- ### Webhook Response Example - JSON Source: https://webcrawlerapi.com/docs/async-requests Example JSON payload received by the webhook URL once a job is completed. It includes the job ID and other relevant data. ```json { "id": "b1b1b1b1-b1b1-b1b1-b1b1-b1b1b1b1b1b1", ... } ``` -------------------------------- ### Asynchronous Crawling Source: https://webcrawlerapi.com/docs/sdk/js Initiate an asynchronous crawl that returns a job ID immediately for later status checking. ```APIDOC ## Asynchronous Crawling The asynchronous method returns a job ID immediately and allows you to check the status later. ```javascript import webcrawlerapi from "webcrawlerapi-js"; const client = new webcrawlerapi.WebcrawlerClient("YOUR_API_KEY"); // Start the async crawl job const job = await client.crawlAsync({ "url": "https://stripe.com/", "scrape_type": "markdown", "items_limit": 10 }); // Get the job ID const jobId = job.id; // Check job status let jobStatus = await client.getJob(jobId); console.log(jobStatus); // You can poll the job status until it's complete while (jobStatus.status === 'in_progress') { await new Promise(resolve => setTimeout(resolve, jobStatus.recommended_pull_delay_ms)); jobStatus = await client.getJob(jobId); } console.log('Final result:', jobStatus); ``` ``` -------------------------------- ### POST /crawl - Start Crawling Job Source: https://webcrawlerapi.com/docs/api/crawl Initiates a web crawling job for a given URL. The API returns a job ID for asynchronous monitoring. ```APIDOC ## POST /crawl ### Description Basic API endpoint to start crawling a website. This is an asynchronous operation. ### Method POST ### Endpoint https://api.webcrawlerapi.com/v1/crawl ### Parameters #### Query Parameters - **url** (string) - Required - The seed URL where the crawler starts. Can be any valid URL. - **scrape_type** (string) - Optional - The type of scraping to perform. Can be `html`, `cleaned`, or `markdown`. Defaults to `markdown`. - **items_limit** (integer) - Required - The maximum number of pages the crawler will process for this job. - **webhook_url** (string) - Optional - The URL where a POST request will be sent upon task completion. - **allow_subdomains** (boolean) - Optional - If `true`, the crawler will also crawl subdomains. Defaults to `false`. - **whitelist_regexp** (string) - Optional - A regular expression to whitelist URLs. Only URLs matching the pattern will be crawled. - **blacklist_regexp** (string) - Optional - A regular expression to blacklist URLs. URLs matching the pattern will be skipped. - **respect_robots_txt** (boolean) - Optional - If `true`, the crawler will respect the `robots.txt` file. Defaults to `false`. ### Request Example ```json { "url": "https://stripe.com/", "webhook_url": "https://yourserver.com/webhook", "items_limit": 10, "scrape_type": "cleaned", "allow_subdomains": false, "respect_robots_txt": true } ``` ### Response #### Success Response (200) - **id** (string) - The unique identifier for the crawling job. #### Response Example ```json { "id": "23b81e21-c672-4402-a886-303f18de9555" } ``` Crawling requests are processed asynchronously. You will receive a response with a task ID that can be used to check the status of the scraping task. ``` -------------------------------- ### Synchronous Crawling Source: https://webcrawlerapi.com/docs/sdk/php The synchronous method waits for the crawl to complete and returns all data at once. This is useful when you need the results immediately after initiating the crawl. ```APIDOC ## POST /crawl ### Description Initiates a web crawl and waits for the results to be returned. ### Method POST ### Endpoint `/crawl` ### Parameters #### Query Parameters - **apiKey** (string) - Required - Your WebCrawler API key. - **url** (string) - Required - The target URL to crawl. - **scrapeType** (string) - Optional - Type of content to extract ('markdown', 'html', 'cleaned'). Defaults to 'markdown'. - **itemsLimit** (integer) - Optional - Maximum number of pages to crawl. Defaults to 10. - **allowSubdomains** (boolean) - Optional - Whether to crawl subdomains. Defaults to false. - **whitelistRegexp** (string) - Optional - Regular expression for allowed URLs. - **blacklistRegexp** (string) - Optional - Regular expression for blocked URLs. - **webhookUrl** (string) - Optional - URL to receive notifications when the job completes. - **maxPolls** (integer) - Optional - Maximum number of status checks. Defaults to 100. ### Request Example ```php crawl( url: 'https://example.com', scrapeType: 'markdown', itemsLimit: 10 ); echo "Job completed with status: {$job->status}\\n"; foreach ($job->jobItems as $item) { echo "Page title: {$item->title}\\n"; echo "Original URL: {$item->originalUrl}\\n"; $content = $item->getContent(); if ($content) { echo "Content preview: " . substr($content, 0, 200) . "...\\n"; } else { echo "Content not available (item not done)\\n"; } } ``` ### Response #### Success Response (200) - **job** (object) - The completed job object containing crawl results. #### Response Example ```json { "id": "job_abc123", "status": "done", "url": "https://example.com", "createdAt": "2023-09-25T10:00:00Z", "finishedAt": "2023-09-25T10:05:00Z", "jobItems": [ { "id": "item_def456", "originalUrl": "https://example.com", "title": "Example Domain", "status": "done", "pageStatusCode": 200, "markdownContentUrl": "/path/to/markdown/content", "rawContentUrl": "/path/to/raw/content", "cleanedContentUrl": "/path/to/cleaned/content" } ], "recommendedPullDelayMs": 5000 } ``` ``` -------------------------------- ### Markdown Output Example - WebcrawlerAPI Source: https://webcrawlerapi.com/docs/crawling-types Demonstrates the markdown output format for web scraping. This format preserves structural elements like headings and links, making it suitable for AI and LLM processing. ```markdown # Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. [More information...](https://www.iana.org/domains/example) ``` -------------------------------- ### Create Crawling Job Request (JSON) Source: https://webcrawlerapi.com/docs/job Example JSON payload for initiating a crawling job. This includes the target URL, webhook URL for notifications, the maximum number of pages to crawl, and the desired scrape type. ```json { "url": "https://stripe.com/", "webhook_url": "https://yourserver.com/webhook", "items_limit": 10, "scrape_type": "markdown", "allow_subdomains": false } ``` -------------------------------- ### Raw HTML Output Example - WebcrawlerAPI Source: https://webcrawlerapi.com/docs/crawling-types Shows the raw HTML output format, which returns the complete, unmanipulated source code of a web page. This is the default format if `scrape_type` is not specified or set to `html`. ```html Example Domain

Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

``` -------------------------------- ### GET /job/:id Source: https://webcrawlerapi.com/docs/api/job Retrieves the status and details of a specific web crawling job using its unique identifier. This is a basic API endpoint to start crawling a website. ```APIDOC ## GET /job/:id ### Description Retrieves the status and details of a specific web crawling job using its unique identifier. This endpoint is used to get information about a previously initiated crawling task. ### Method GET ### Endpoint `https://api.webcrawlerapi.com/v1/job/:id` ### Parameters #### Path Parameters - **id** (string) - Required - The unique identifier of the job. ### Request Example ``` https://api.webcrawlerapi.com/v1/job/6c391693-e566-4b99-97ca-5fa00032e281 ``` ### Response #### Success Response (200) - **id** (string) - The unique identifier of the job. - **org_id** (string) - Your organization identifier. - **url** (string) - The seed URL where the crawler started. - **status** (string) - The status of the job. Can be `new`, `in_progress`, `done`, `error`. - **scrape_type** (string) - The type of scraping you want to perform (`html`, `cleaned` or `markdown`). - **whitelist_regexp** (string) - A regular expression to whitelist URLs. - **blacklist_regexp** (string) - A regular expression to blacklist URLs. - **allow_subdomains** (boolean) - Indicates if the crawler will also crawl subdomains. - **items_limit** (integer) - The limit of pages for this job. - **created_at** (string) - The date when the job was created. - **finished_at** (string) - The date when the job was finished. - **webhook_url** (string) - The URL where the server will send a POST request once the task is completed. - **webhook_status** (string) - The status of the webhook request. - **webhook_error** (string) - The error message if the webhook request failed. - **job_items** (array) - An array of items that were extracted from the pages. #### Job Item Details - **id** (string) - The unique identifier of the item. - **status** (string) - The status of the item. Can be `new`, `in_progress`, `done`, `error`. - **job_id** (string) - The job identifier. - **original_url** (string) - The URL of the page. - **page_status_code** (integer) - The status code of the page request. - **raw_content_url** (string) - The URL to the raw content of the page. - **cleaned_content_url** (string) - The URL to the cleaned content of the page (if `scrape_type` is `cleaned`). - **markdown_content_url** (string) - The URL to the markdown content of the page (if `scrape_type` is `markdown`). - **title** (string) - The title of the page (`` tag content). - **created_at** (string) - The date when the item was created. - **cost** (number) - The cost of the item in $. - **referred_url** (string) - The URL where the page was referred from. - **last_error** (string) - The last error message if the item failed. #### Response Example ```json { "id": "abb39f29-087e-4714-aa05-15537be12f90", "org_id": "cm48ww9kw00019rv7bsyfko1d", "url": "https://books.toscrape.com/", "scrape_type": "markdown", "whitelist_regexp": ".*category.*", "blacklist_regexp": "", "allow_subdomains": false, "items_limit": 10, "created_at": "2024-12-15T10:26:13.893Z", "finished_at": "2024-12-15T10:26:37.118Z", "updated_at": "2024-12-15T10:26:37.118Z", "webhook_url": "", "status": "done", "job_items": [ { "id": "a46f3117-f97a-4ca2-a434-6cfdcd022b72", "job_id": "abb39f29-087e-4714-aa05-15537be12f90", "original_url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html", "page_status_code": 200, "markdown_content_url": "https://data.webcrawlerapi.com/markdown/books.toscrape.com/https___books_toscrape_com_catalogue_category_books_travel_2_index_html", "status": "done", "title": "All products | Books to Scrape - Sandbox", "last_error": "", "created_at": "2024-12-15T10:26:17.941Z", "updated_at": "2024-12-15T10:26:23.915Z", "cost": 2000, "referred_url": "https://books.toscrape.com/" } ] } ``` **Note**: Crawling requests are processed asynchronously. You will receive a response with a task ID, which can be used to check the status of the scraping task. ``` -------------------------------- ### Synchronous Crawling with PHP WebCrawler API SDK Source: https://webcrawlerapi.com/docs/sdk/php Demonstrates how to perform a synchronous web crawl using the PHP SDK. The SDK fetches all data upon completion. Requires a valid API key and the 'vendor/autoload.php' file. Outputs job status and details of crawled items. ```php <?php require_once('vendor/autoload.php'); use WebCrawlerAPI\WebCrawlerAPI; // Initialize the client $crawler = new WebCrawlerAPI('YOUR_API_KEY'); // Synchronous crawling $job = $crawler->crawl( url: 'https://example.com', scrapeType: 'markdown', itemsLimit: 10 ); echo "Job completed with status: {$job->status}\n"; // Access job items and their content foreach ($job->jobItems as $item) { echo "Page title: {$item->title}\n"; echo "Original URL: {$item->originalUrl}\n"; // Get the content based on job's scrape_type // Returns null if item is not in "done" status $content = $item->getContent(); if ($content) { echo "Content preview: " . substr($content, 0, 200) . "...\n"; } else { echo "Content not available (item not done)\n"; } } ```