### Install Crawlbase Python Library

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/QUICK_START.md

Install the library using pip.

```bash
pip install crawlbase
```

--------------------------------

### Initialize and Use Scraper API

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize the Scraper API with your token and make a GET request to scrape a URL. This example prints the product name from an Amazon product page.

```python
scraper_api = ScraperAPI({ 'token': 'YOUR_NORMAL_TOKEN' })

response = scraper_api.get('https://www.amazon.com/DualSense-Wireless-Controller-PlayStation-5/dp/B08FC6C75Y/')
if response['status_code'] == 200:
    print(response['json']['name'])  # Will print the name of the Amazon product
```

--------------------------------

### Make a Basic GET Request

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/QUICK_START.md

Make a GET request to a URL and process the response. Handles successful responses and errors.

```python
response = api.get('https://www.example.com')

if response['status_code'] == 200:
    content = response['body'].decode('utf-8')
    print(content)
else:
    print(f"Error: {response['status_code']}")
```

--------------------------------

### Internal URL Building Example

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/base_api.md

This example shows how `buildURL` is used internally to construct a full API URL with parameters. The output includes the base URL, token, and encoded query parameters.

```python
# Internal use; called by request()
url = api.buildURL({'url': 'https://example.com', 'format': 'json'})
# Returns: https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fexample.com&format=json
```

--------------------------------

### CrawlingAPI GET and POST Requests

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/QUICK_START.md

Perform GET and POST requests using the CrawlingAPI.
Includes an example for JavaScript rendering with a specified wait time.

```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})

# GET request
response = api.get('https://example.com')

# POST request
response = api.post('https://example.com/search', {
    'query': 'python'
})

# JavaScript rendering
js_api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})
response = js_api.get('https://spa.example.com', {
    'page_wait': 5000  # 5 seconds
})
```

--------------------------------

### Internal Request Example

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/base_api.md

This example demonstrates how the `request` method is called internally by a subclass method like `get`. Developers typically use the public API methods instead of calling `request` directly.

```python
# This is called internally; developers use subclass methods
response = api.get('https://example.com')  # Calls request() internally
```

--------------------------------

### Boolean String Parameter Example

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md

For option parameters, boolean values are represented as 'true' or 'false' strings. This example sets the 'store' option.

```python
# Option parameter
response = api.get('https://example.com', {'store': 'true'})
```

--------------------------------

### Get Leads with Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/leads_api.md

Use the `get_from_domain` method with query parameters to specify response format and country. The response is parsed into a JSON object.

```python
response = api.get_from_domain('example.com', {
    'format': 'json',
    'country': 'US'
})

if response['status_code'] == 200:
    leads_data = response['json']
    print(f"Found {len(leads_data['leads'])} leads")
```

--------------------------------

### GET Request with Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Perform a GET request with custom options like user agent and format. Ensure the response status code is 200 before processing the body.

```python
response = api.get('https://www.reddit.com/r/pics/comments/5bx4bx/thanks_obama/', {
    'user_agent': 'Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/30.0',
    'format': 'json'
})
if response['status_code'] == 200:
    print(response['body'])
```

--------------------------------

### CrawlingAPI - Simple GET Request

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/PACKAGE_OVERVIEW.md

Demonstrates how to perform a simple GET request to scrape HTML and JavaScript content from a URL using the CrawlingAPI.

```APIDOC
## CrawlingAPI.get

### Description
Performs a GET request to scrape HTML and JavaScript content from a given URL.

### Method
GET

### Endpoint
(root) - relative to https://api.crawlbase.com/

### Parameters
#### Query Parameters
- **url** (string) - Required - The URL to scrape.
- **options** (object) - Optional - Additional options for the request, such as `page_wait` for JavaScript rendering.

### Request Example
```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
response = api.get('https://example.com')
```

### Response
#### Success Response (200)
- **status_code** (integer) - The HTTP status code of the response.
- **body** (string) - The decoded content of the scraped page.

#### Response Example
```json
{
  "status_code": 200,
  "body": "..."
}
```
```

--------------------------------

### get(url, options)

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md

Scrapes a given URL using an HTTP GET request. Returns the page content and associated metadata.

```APIDOC
## get(url, options)

Scrapes a URL using GET request, returning the page content.

### Method
GET

### Endpoint
[This is a client library method, not a direct HTTP endpoint]

### Parameters
- **url** (str) - Required - The URL to scrape.
- **options** (dict) - Optional - Query parameters from Crawlbase API documentation. Defaults to {}.
  - **user_agent** (str) - Custom User-Agent header.
  - **format** (str) - Response format: 'html' (default) or 'json'.
  - **callback** (str) - JSONP callback function name.
  - **page_wait** (int) - JavaScript: time to wait for page load in milliseconds.
  - **device** (str) - JavaScript: device type ('mobile' or 'desktop').
  - **country** (str) - Geographic location for the request.
  - **proxy_type** (str) - Proxy type: 'datacenter' or 'residential'.

### Return Type
dict

### Response
- **status_code** (int) - HTTP status code.
- **headers** (dict) - Response headers including `original_status`, `pc_status`, `url`.
- **body** (bytes) - Raw HTML/content response body.
- **json** (dict or list) - Parsed JSON if `format='json'` option was used.

### Request Example

Simple GET request:
```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_TOKEN' })

response = api.get('https://www.example.com')
if response['status_code'] == 200:
    print(response['body'].decode('utf-8'))
```

GET with custom User-Agent:
```python
response = api.get('https://www.example.com', {
    'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
```

GET with JSON response format:
```python
response = api.get('https://www.example.com', {
    'format': 'json'
})
if response['status_code'] == 200:
    data = response['json']
    print(data['body'])  # JSON contains the HTML in 'body' field
```

JavaScript rendering with page wait:
```python
js_api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN' })
response = js_api.get('https://www.angular-app.com', {
    'page_wait': 5000  # Wait 5 seconds for JavaScript to execute
})
```
```

--------------------------------

### get(url, options)

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/screenshots_api.md

Captures a screenshot of a given URL and saves it as a JPEG file. Supports various options for customization.

```APIDOC
## get(url, options)

### Description
Captures a screenshot of a URL and saves it as a JPEG file. Supports various options for customization.

### Method
GET

### Endpoint
/screenshots

### Parameters
#### Path Parameters
None

#### Query Parameters
- **url** (str) - Required - The URL to capture screenshot from
- **options** (dict) - Optional - Query parameters from Crawlbase API documentation
  - **save_to_path** (str) - Optional - File path to save screenshot (must end with .jpg or .jpeg); auto-generated if not specified
  - **store** (str) - Optional - Set to 'true' to store screenshot and get a shareable URL
  - **user_agent** (str) - Optional - Custom User-Agent header
  - **viewport** (str) - Optional - Viewport size in format 'WIDTHxHEIGHT' (e.g., '1920x1080')
  - **country** (str) - Optional - Geographic location for the request

### Request Example
```python
# Basic usage - capture and save screenshot:
response = api.get('https://www.apple.com')

# Save to specific path:
response = api.get('https://www.apple.com', {
    'save_to_path': '/tmp/apple_screenshot.jpg'
})

# Store and get shareable URL:
response = api.get('https://www.apple.com', {
    'store': 'true'
})

# With custom viewport:
response = api.get('https://www.responsive-site.com', {
    'save_to_path': 'mobile_view.jpg',
    'viewport': '375x667'  # iPhone size
})
```

### Response
#### Success Response (200)
- **status_code** (int) - HTTP status code
- **headers** (dict) - Response headers including `success`, `url`, `remaining_requests`, `screenshot_url` (if stored)
  - **success** (str) - Whether screenshot was successful ('true' or 'false')
  - **url** (str) - The URL that was screenshotted
  - **remaining_requests** (str) - Number of remaining requests in current plan
  - **screenshot_url** (str) - Shareable public URL if `store='true'` was used (None otherwise)
- **body** (bytes) - Binary screenshot image data
- **file** (str) - Path to the saved screenshot file

#### Response Example
```json
{
  "status_code": 200,
  "headers": {
    "success": "true",
    "url": "https://www.apple.com",
    "remaining_requests": "4999",
    "screenshot_url": null
  },
  "body": "",
  "file":
    "/tmp/screenshots/20231027103000-apple-com.jpg"
}
```

### Raises
- Exception: If `save_to_path` does not end with .jpg or .jpeg extension
```

--------------------------------

### Perform a GET Request

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Make a GET request to scrape a URL. The response contains status code and body. Pass an empty dictionary for default options.

```python
api.get(url, options = {})
```

```python
response = api.get('https://www.facebook.com/britneyspears')
if response['status_code'] == 200:
    print(response['body'])
```

--------------------------------

### Get by URL or RID

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md

The `get()` method automatically detects whether a URL or a Request ID (RID) is provided. Use this for direct retrieval of crawled pages.

```python
api.get('https://www.example.com')  # By URL
api.get('RID_xyz123')  # By RID
```

--------------------------------

### StorageAPI: Retrieving and Counting Stored Pages

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md

Demonstrates using StorageAPI to retrieve all stored page RIDs, fetch bulk page data, and get the total count of stored pages.

```python
from crawlbase import StorageAPI

api = StorageAPI({
    'token': 'your_normal_token',
    'timeout': 60
})

# Get all RIDs
all_rids = api.rids(limit=10)

# Retrieve pages
response = api.bulk(all_rids)

# Get total count
total = api.totalCount()
```

--------------------------------

### Simple GET Request with CrawlingAPI

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/PACKAGE_OVERVIEW.md

Perform a basic HTTP GET request to scrape HTML content. Ensure you have your API token and handle the response status code.

```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
response = api.get('https://example.com')

if response['status_code'] == 200:
    content = response['body'].decode('utf-8')
```

--------------------------------

### Perform a Simple GET Request

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md

Scrape a URL using a GET request and retrieve the raw HTML content. The response is a dictionary containing the status code, headers, and body.

```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_TOKEN' })

response = api.get('https://www.example.com')
if response['status_code'] == 200:
    print(response['body'].decode('utf-8'))
```

--------------------------------

### Scrape URL with Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/scraper_api.md

Pass additional options to the `get` method to customize the request, such as setting a User-Agent or specifying a country.

```python
response = api.get('https://www.example.com/product', {
    'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'country': 'US'
})

if response['status_code'] == 200:
    structured_data = response['json']
    # Access extracted fields like 'name', 'price', 'description', etc.
```

--------------------------------

### Set Viewport for Screenshots

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md

Define the viewport dimensions for screenshots in 'WIDTHxHEIGHT' format when making a GET request.

```python
response = api.get('https://example.com', {'viewport': '1920x1080'})
```

--------------------------------

### Initialize CrawlingAPI for JavaScript Rendering

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize CrawlingAPI with a JavaScript token for scraping dynamic websites. Only GET requests are supported for JavaScript scraping.

```python
api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN' })
```

--------------------------------

### Import Crawlbase API Classes

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Import the necessary API classes from the crawlbase library. Ensure the library is installed via pip.

```python
from crawlbase import CrawlingAPI, ScraperAPI, LeadsAPI, ScreenshotsAPI, StorageAPI
```

--------------------------------

### Handle Unsupported POST Request

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/screenshots_api.md

The Screenshots API only supports GET requests. Attempting to use POST will raise an exception.

```python
api = ScreenshotsAPI({ 'token': 'YOUR_TOKEN' })

try:
    response = api.post('https://example.com', {})
except Exception as e:
    print(e)  # "Only GET is allowed on the Screenshots API"
```

--------------------------------

### GET Request with Custom User-Agent

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md

Scrape a URL while specifying a custom User-Agent header in the request options.

```python
response = api.get('https://www.example.com', {
    'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
```

--------------------------------

### Boolean Return Value Example

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md

Method return values are Python booleans (True/False). This shows checking the success of a delete operation.

```python
# Method return
success: bool = api.delete('RID_xyz')  # True or False
```

--------------------------------

### StorageAPI - Access and Manage Stored Pages

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/QUICK_START.md

Interact with StorageAPI to retrieve, delete, and manage previously crawled pages. Demonstrates methods for getting RIDs, retrieving by URL or RID, bulk retrieval, and counting total stored pages.

```python
from crawlbase import StorageAPI

api = StorageAPI({'token': 'YOUR_TOKEN'})

# Get list of stored pages
rids = api.rids(limit=10)

# Retrieve by URL
response = api.get('https://example.com')

# Retrieve by RID
response = api.get('RID_xyz123')

# Bulk retrieve
response = api.bulk(rids)

# Delete stored page
api.delete('RID_xyz123')

# Get total count
total = api.totalCount()
```

--------------------------------

### Get Data by URL

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Retrieve a stored item using its URL. Check the status code for success and print relevant headers and the body.

```python
response = storage_api.get('https://www.apple.com')
if response['status_code'] == 200:
    print(response['headers']['original_status'])
    print(response['headers']['pc_status'])
    print(response['headers']['url'])
    print(response['headers']['rid'])
    print(response['headers']['stored_at'])
    print(response['body'])
```

--------------------------------

### Initialize LeadsAPI with Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md

Instantiate the LeadsAPI client with an authentication token and an optional request timeout.

```python
from crawlbase import LeadsAPI

api = LeadsAPI({
    'token': 'YOUR_TOKEN'
})
```

```python
# With timeout
api = LeadsAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 60
})
```

--------------------------------

### Store Screenshot and Get Shareable URL

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/screenshots_api.md

Use the 'store' option to save the screenshot on Crawlbase servers and obtain a public, shareable URL.

```python
response = api.get('https://www.apple.com', {
    'store': 'true'
})

if response['status_code'] == 200:
    if response['headers']['success'] == 'true':
        print(f"Screenshot saved to: {response['file']}")
        print(f"Public URL: {response['headers']['screenshot_url']}")
```

--------------------------------

### Initialize StorageAPI with Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md

Instantiate the StorageAPI client with an authentication token and an optional request timeout.

```python
from crawlbase import StorageAPI

api = StorageAPI({
    'token': 'YOUR_TOKEN'
})
```

```python
# With timeout
api = StorageAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 90
})
```

--------------------------------

### Constructor

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/leads_api.md

Initializes the LeadsAPI with a required API token and optional timeout configuration.

```APIDOC
## Constructor

```python
LeadsAPI(options)
```

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| options | dict | Yes | — | Configuration object with required `token` key and optional `timeout` |

The `options` dictionary must contain:
- `token` (string, required): The API authentication token
- `timeout` (int, optional): Request timeout in seconds; defaults to 120

**Raises:**
- `Exception`: If `token` is None or empty string

**Example:**
```python
from crawlbase import LeadsAPI

api = LeadsAPI({ 'token': 'YOUR_TOKEN' })

# With custom timeout
api = LeadsAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 90
})
```
```

--------------------------------

### Initialize CrawlingAPI with Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md

Instantiate the CrawlingAPI client with an authentication token and an optional request timeout.

```python
api = CrawlingAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 120
})
```

--------------------------------

### GET Request for JavaScript Websites

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Make a GET request to scrape JavaScript-rendered websites. The process is the same as standard GET requests, but uses a JavaScript token.

```python
response = api.get('https://www.nfl.com')
if response['status_code'] == 200:
    print(response['body'])
```

--------------------------------

### Initialize ScreenshotsAPI Client

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/screenshots_api.md

Instantiate the ScreenshotsAPI client with your API token. An optional timeout can also be provided.

```python
from crawlbase import ScreenshotsAPI

api = ScreenshotsAPI({ 'token': 'YOUR_TOKEN' })

# With custom timeout
api = ScreenshotsAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 120
})
```

--------------------------------

### GET Requests

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Perform GET requests to scrape URLs with optional parameters.

```APIDOC
## GET Requests

### Description
Pass the url that you want to scrape plus any options from the ones available in the [API documentation](https://crawlbase.com/docs).

### Method
```python
api.get(url, options = {})
```

### Example
```python
response = api.get('https://www.facebook.com/britneyspears')
if response['status_code'] == 200:
    print(response['body'])
```

### Options Example
```python
response = api.get('https://www.reddit.com/r/pics/comments/5bx4bx/thanks_obama/', {
    'user_agent': 'Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/30.0',
    'format': 'json'
})
if response['status_code'] == 200:
    print(response['body'])
```
```

--------------------------------

### Catch Missing URL or RID Error in StorageAPI.get()

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/errors.md

Handle the 'Either URL or RID is required' error when calling get() without a URL or RID. Provide a valid URL or RID to resolve this issue.

```python
from crawlbase import StorageAPI

api = StorageAPI({'token': 'YOUR_TOKEN'})

try:
    response = api.get('')
except Exception as e:
    print(f"Error: {e}")
    # Output: Error: Either URL or RID is required
```

```python
api = StorageAPI({'token': 'YOUR_TOKEN'})

# By URL
response = api.get('https://www.example.com')

# By RID
response = api.get('RID_xyz123')
```

--------------------------------

### Initialize and Use Leads API

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize the Leads API with your token and retrieve email leads from a given domain. The response contains a 'leads' key if successful.

```python
leads_api = LeadsAPI({ 'token': 'YOUR_NORMAL_TOKEN' })

response = leads_api.get_from_domain('microsoft.com')

if response['status_code'] == 200:
    print(response['json']['leads'])
```

--------------------------------

### Initialize LeadsAPI

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/leads_api.md

Instantiate the LeadsAPI client with your API token. You can also set a custom request timeout.

```python
from crawlbase import LeadsAPI

api = LeadsAPI({ 'token': 'YOUR_TOKEN' })

# With custom timeout
api = LeadsAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 90
})
```

--------------------------------

### Initialize StorageAPI

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md

Instantiate the StorageAPI with your authentication token. Optionally, set a custom request timeout.

```python
from crawlbase import StorageAPI

api = StorageAPI({ 'token': 'YOUR_TOKEN' })

# With custom timeout
api = StorageAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 90
})
```

--------------------------------

### StorageAPI Initialization

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize the Storage API client with your private token.

```APIDOC
## StorageAPI Initialization

Initialize the Storage API using your private token.

```python
from crawlbase import StorageAPI

storage_api = StorageAPI({ 'token': 'YOUR_NORMAL_TOKEN' })
```
```

--------------------------------

### GET Request with JavaScript Options

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Pass additional options, such as 'page_wait', to the GET request when scraping JavaScript-rendered content. This allows for controlling page load times.

```python
response = api.get('https://www.freelancer.com', {
    'page_wait': 5000
})
if response['status_code'] == 200:
    print(response['body'])
```

--------------------------------

### Instantiate CrawlingAPI with Token

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/base_api.md

Use subclasses like CrawlingAPI to create an instance, providing your API token in the options dictionary. This is the standard way to initialize API clients.

```python
from crawlbase import CrawlingAPI

# Create instance with token (use subclasses, not BaseAPI directly)
api = CrawlingAPI({ 'token': 'YOUR_TOKEN' })
```

--------------------------------

### Initialize CrawlingAPI Instance

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/QUICK_START.md

Import and initialize the CrawlingAPI with your API token.

```python
from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
```

--------------------------------

### Catch ScreenshotsAPI POST Not Supported Error

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/errors.md

Handle errors when attempting to use the POST method on the ScreenshotsAPI, which only supports GET requests. Use the get() method for ScreenshotsAPI.

```python
from crawlbase import ScreenshotsAPI

api = ScreenshotsAPI({'token': 'YOUR_TOKEN'})

try:
    response = api.post('https://example.com', {})
except Exception as e:
    print(f"Error: {e}")
    # Output: Error: Only GET is allowed on the Screenshots API
```

--------------------------------

### ScreenshotsAPI Constructor

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/screenshots_api.md

Initializes the ScreenshotsAPI client. Requires an API token and allows for an optional timeout setting.

```APIDOC
## ScreenshotsAPI(options)

### Description
Initializes the ScreenshotsAPI client. Requires an API token and allows for an optional timeout setting.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Parameters
- **options** (dict) - Required - Configuration object with required `token` key and optional `timeout`
  - **token** (string) - Required - The API authentication token
  - **timeout** (int) - Optional - Request timeout in seconds; defaults to 120

### Raises
- Exception: If `token` is None or empty string

### Example
```python
from crawlbase import ScreenshotsAPI

# Basic initialization
api = ScreenshotsAPI({ 'token': 'YOUR_TOKEN' })

# With custom timeout
api = ScreenshotsAPI({
    'token': 'YOUR_TOKEN',
    'timeout': 120
})
```
```

--------------------------------

### Catch ScraperAPI POST Not Supported Error

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/errors.md

Handle errors when attempting to use the POST method on the ScraperAPI, which only supports GET requests. Always use the get() method for ScraperAPI.

```python
from crawlbase import ScraperAPI

api = ScraperAPI({'token': 'YOUR_TOKEN'})

try:
    response = api.post('https://example.com', {'data': 'test'})
except Exception as e:
    print(f"Error: {e}")
    # Output: Error: Only GET is allowed on the Scraper API
```

--------------------------------

### Initialize CrawlingAPI

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize the CrawlingAPI with your Crawlbase API token. Replace 'YOUR_CRAWLBASE_TOKEN' with your actual token.

```python
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })
```

--------------------------------

### Basic CrawlingAPI Initialization

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md

Initialize the CrawlingAPI client with only the required authentication token.

```python
from crawlbase import CrawlingAPI

# Basic initialization
api = CrawlingAPI({
    'token': 'YOUR_NORMAL_TOKEN'
})
```

--------------------------------

### Initialize and Use Screenshots API

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize the Screenshots API with your token and capture a screenshot of a given URL. The response includes headers like 'success', 'url', and 'remaining_requests', as well as the image file.

```python
screenshots_api = ScreenshotsAPI({ 'token': 'YOUR_NORMAL_TOKEN' })

response = screenshots_api.get('https://www.apple.com')
if response['status_code'] == 200:
    print(response['headers']['success'])
    print(response['headers']['url'])
    print(response['headers']['remaining_requests'])
    print(response['file'])
```

--------------------------------

### ScraperAPI.get

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/scraper_api.md

Extracts structured data from a URL using a GET request.

```APIDOC
## ScraperAPI.get(url, options)

Extracts structured data from a URL using a GET request.

### Signature
```python
get(url, options={})
```

### Parameters

#### url (str) - Required
The URL to scrape.

#### options (dict) - Optional
Query parameters for the Crawlbase API.

### Return Type
dict

### Response Structure
- `status_code` (int): HTTP status code.
- `headers` (dict): Response headers including `original_status`, `pc_status`, `url`.
- `body` (bytes): Raw response body.
- `json` (dict): Parsed structured data (product info, contacts, etc.).

### Common Options

#### format (str) - Optional
Response format: 'html' (default) or 'json'.

#### user_agent (str) - Optional
Custom User-Agent header.

#### country (str) - Optional
Geographic location for the request.

#### proxy_type (str) - Optional
Proxy type: 'datacenter' or 'residential'.

### Example

#### Basic Usage
```python
from crawlbase import ScraperAPI

api = ScraperAPI({ 'token': 'YOUR_TOKEN' })

response = api.get('https://www.amazon.com/DualSense-Wireless-Controller-PlayStation-5/dp/B08FC6C75Y/')
if response['status_code'] == 200:
    data = response['json']
    print(f"Product name: {data['name']}")
    print(f"Price: {data['price']}")
    print(f"Rating: {data['rating']}")
```

#### With Options
```python
response = api.get('https://www.example.com/product', {
    'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'country': 'US'
})
if response['status_code'] == 200:
    structured_data = response['json']
```

#### JSON Format Response
```python
response = api.get('https://www.example.com', {
    'format': 'json'
})
if response['status_code'] == 200:
    data = response['json']
```
```

--------------------------------

### Initialize Storage API

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Initialize the Storage API client with your private token. This is the first step before making any requests.

```python
storage_api = StorageAPI({ 'token': 'YOUR_NORMAL_TOKEN' })
```

--------------------------------

### Total Count

Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md

Get the total number of documents currently stored in your Crawlbase Storage area.

```APIDOC
## Total Count

To get the total number of documents in your storage area

```python
total_count = storage_api.totalCount()
print(total_count)
```
```

--------------------------------

### Handle Initialization Errors

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/QUICK_START.md

Catch exceptions during API initialization, such as missing authentication tokens. Ensure you provide a valid token when creating the API client.

```python
try:
    api = CrawlingAPI({'token': ''})
except Exception as e:
    print(f"Error: {e}")  # "You need to specify the token"
```

--------------------------------

### Get List of RIDs

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md

The `rids()` method retrieves a list of all Request IDs (RIDs) available in your storage.

```APIDOC
## rids()

### Description
Retrieves a list of all Request IDs (RIDs) stored.

### Method
`rids`

### Parameters
None

### Request Example
```python
# Assuming api is an instance of StorageAPI
all_rids = api.rids()
print(all_rids)
```

### Response
#### Success Response (200)
- **(array of strings)** - A list containing all available RIDs.

#### Response Example
```json
[
  "RID_abc123",
  "RID_def456",
  "RID_ghi789"
]
```
```

--------------------------------

### Bulk Operations with StorageAPI

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md

Initializes the StorageAPI with a token and performs bulk operations, such as retrieving a list of RIDs or fetching multiple pages efficiently. Handles responses for bulk requests.

```python
api = StorageAPI({ 'token': 'YOUR_TOKEN' })

# Get list of RIDs
all_rids = api.rids()

# Retrieve first 10 pages in one request
response = api.bulk(all_rids[:10])

if response['status_code'] == 200:
    for item in response['json']:
        print(f"Processing: {item['url']}")
```

--------------------------------

### Fix Initialization Error: Missing Token

Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/errors.md

Correctly initialize API clients by providing a valid, non-empty token. Avoid passing None or an empty string for the token.

```python # Wrong api = CrawlingAPI({'token': None}) # Raises api = CrawlingAPI({'token': ''}) # Raises # Correct api = CrawlingAPI({'token': 'YOUR_TOKEN'}) ``` -------------------------------- ### get(url_or_rid, options) Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md Retrieves a previously crawled page from storage by its URL or Request ID (RID). ```APIDOC ## get(url_or_rid, options) ### Description Retrieves a previously crawled page from storage by URL or RID. Supports retrieving the raw HTML or a JSON representation. ### Parameters #### url_or_rid (str) - Required The URL that was crawled or the Request ID (RID) of a stored page. #### options (dict) - Optional Query parameters to modify the retrieval. Defaults to an empty dictionary. - `format` (str, optional): Response format. Accepts 'html' (default) or 'json'. ### Return Type dict ### Response Structure - `status_code` (int): HTTP status code of the retrieval request. - `headers` (dict): Response headers including `original_status`, `pc_status`, `url`, `rid`, `stored_at`. - `body` (bytes): Raw page content (if format is 'html'). - `json_body` (dict): Parsed JSON content (if format is 'json'). ### Response Headers - `original_status` (str): HTTP status code from the original crawl. - `pc_status` (str): Proxy crawl status code. - `url` (str): The URL that was stored. - `rid` (str): Request ID of the stored page. - `stored_at` (str): Timestamp when the page was stored. ### Raises - `Exception`: If `url_or_rid` is None or an empty string. 
### Example Retrieve by URL: ```python response = api.get('https://www.example.com') if response['status_code'] == 200: print(f"Original status: {response['headers']['original_status']}") print(f"Stored at: {response['headers']['stored_at']}") content = response['body'].decode('utf-8') ``` Retrieve by RID: ```python response = api.get('RID_xyz123') if response['status_code'] == 200: print(f"Retrieved page: {response['headers']['url']}") print(f"RID: {response['headers']['rid']}") ``` Get as JSON format: ```python response = api.get('https://www.example.com', { 'format': 'json' }) if response['status_code'] == 200: data = response['json_body'] print(f"Body: {data['body']}") print(f"Status: {data['original_status']}") ``` ``` -------------------------------- ### StorageAPI Constructor Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md Initializes the StorageAPI with authentication token and optional timeout. ```APIDOC ## StorageAPI Constructor ### Description Initializes the StorageAPI with a configuration object containing the authentication token and an optional timeout. ### Parameters #### options (dict) - Required Configuration object with required `token` key and optional `timeout`. - `token` (string, required): The API authentication token. - `timeout` (int, optional): Request timeout in seconds; defaults to 120. ### Raises - `Exception`: If `token` is None or empty string. ### Example ```python from crawlbase import StorageAPI # With default timeout api = StorageAPI({ 'token': 'YOUR_TOKEN' }) # With custom timeout api = StorageAPI({ 'token': 'YOUR_TOKEN', 'timeout': 90 }) ``` ``` -------------------------------- ### CrawlingAPI Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/INDEX.md The CrawlingAPI allows for HTML scraping and JavaScript-rendered content retrieval using GET and POST requests. 
```APIDOC ## CrawlingAPI ### Description Use for HTML scraping and JavaScript-rendered content. ### Methods - `get(url, options)`: Performs a GET request to crawl a URL. - `post(url, data, options)`: Performs a POST request to crawl a URL with provided data. ### Key Features - GET and POST requests - JavaScript rendering support - Custom headers and user agents - Format conversion (HTML to JSON) ``` -------------------------------- ### Get Total Count Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md Retrieve the total count of items using the `totalCount()` method from the storage API. ```python total: int = api.totalCount() # e.g. 1234 ``` -------------------------------- ### Screenshots with Storage and Viewport Configuration Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md Configure ScreenshotsAPI to save screenshots locally, store them for a public URL, and define the viewport size. ```python from crawlbase import ScreenshotsAPI api = ScreenshotsAPI({ 'token': 'your_normal_token', 'timeout': 120 }) response = api.get('https://example.com', { 'save_to_path': 'screenshot.jpg', 'store': 'true', 'viewport': '1920x1080' }) ``` -------------------------------- ### Initialize ScraperAPI Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/scraper_api.md Instantiate the ScraperAPI client with your API token. Optionally, set a custom request timeout. ```python from crawlbase import ScraperAPI api = ScraperAPI({ 'token': 'YOUR_TOKEN' }) # With custom timeout api = ScraperAPI({ 'token': 'YOUR_TOKEN', 'timeout': 60 }) ``` -------------------------------- ### totalCount() Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/storage_api.md Gets the total count of pages currently stored in your storage area. This provides a quick overview of your storage usage. 
```APIDOC ## totalCount() ### Description Gets the total count of pages in your storage area. ### Method GET ### Endpoint /storage/count ### Parameters #### Path Parameters None #### Query Parameters None ### Request Example ```python api = StorageAPI({ 'token': 'YOUR_TOKEN' }) total = api.totalCount() ``` ### Response #### Success Response (200) - **int** - The total number of stored pages as an integer. #### Response Example ```json 1500 ``` ``` -------------------------------- ### Get Storage Item by RID Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md Retrieve a stored item from Crawlbase Storage using its unique Request ID (RID). ```APIDOC ## Get Storage Item by RID Use the [RID](https://crawlbase.com/docs/storage-api/parameters/#rid) to retrieve a specific item. ```python response = storage_api.get('RID_REPLACE') if response['status_code'] == 200: print(response['headers']['original_status']) print(response['headers']['pc_status']) print(response['headers']['url']) print(response['headers']['rid']) print(response['headers']['stored_at']) print(response['body']) ``` Note: You must send either the RID or the URL. Neither parameter is required on its own, but at least one of them must be provided. ``` -------------------------------- ### ScraperAPI Constructor Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/scraper_api.md Initializes the ScraperAPI client with your API token and optional configuration. ```APIDOC ## ScraperAPI Constructor Initializes the ScraperAPI client. ### Signature ```python ScraperAPI(options) ``` ### Parameters #### options (dict) - Required Configuration object with required `token` key and optional `timeout`. - **token** (string) - Required - The API authentication token. - **timeout** (int) - Optional - Request timeout in seconds; defaults to 120. ### Raises - `Exception`: If `token` is None or an empty string.
### Example ```python from crawlbase import ScraperAPI # Basic initialization api = ScraperAPI({ 'token': 'YOUR_TOKEN' }) # With custom timeout api = ScraperAPI({ 'token': 'YOUR_TOKEN', 'timeout': 60 }) ``` ``` -------------------------------- ### Get Status Code from Headers Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md Status codes in response headers are strings. This demonstrates retrieving the original status code. ```python status = response['headers']['original_status'] # str, e.g. '200' ``` -------------------------------- ### CrawlingAPI Initialization with Timeout Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md Initialize the CrawlingAPI client with a custom request timeout. ```python # With timeout api = CrawlingAPI({ 'token': 'YOUR_TOKEN', 'timeout': 180 }) ``` -------------------------------- ### Initialize CrawlingAPI Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md Instantiate the CrawlingAPI client with your API token. Use a normal token for HTML content and a JavaScript token for JS-rendered content. A custom timeout can also be specified. ```python from crawlbase import CrawlingAPI # For HTML content api = CrawlingAPI({ 'token': 'YOUR_NORMAL_TOKEN' }) # For JavaScript-rendered content js_api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN' }) # With custom timeout api = CrawlingAPI({ 'token': 'YOUR_TOKEN', 'timeout': 180 }) ``` -------------------------------- ### Catch Initialization Error: Missing Token Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/errors.md Catch exceptions during API initialization when the token is missing or empty. Ensure the token is provided as a non-empty string. 
```python try: api = CrawlingAPI({'token': ''}) except Exception as e: print(f"Initialization error: {e}") # Output: Initialization error: You need to specify the token ``` -------------------------------- ### Catch Initialization Errors Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/errors.md Use a try-except block to catch potential exceptions during API client initialization, such as when configuration details like the API token are missing or invalid. ```python from crawlbase import CrawlingAPI try: api = CrawlingAPI(config) # config might have missing/empty token except Exception as e: print(f"Failed to initialize API: {e}") # Fallback to alternative configuration ``` -------------------------------- ### Check Status Code 200 Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md Status codes in the response dictionary are integers. This example shows how to check for a successful response. ```python if response['status_code'] == 200: pass # Success ``` -------------------------------- ### General Exception Handling Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/types.md Illustrates how to catch general exceptions raised by the SDK, which are instances of Python's `Exception` class with a string message. ```APIDOC ## Exception Type All errors are raised as `Exception` with a string message: ```python try: api = CrawlingAPI({'token': None}) except Exception as e: print(str(e)) # "You need to specify the token" ``` ``` -------------------------------- ### CrawlingAPI: HTML and JavaScript Scraping Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/INDEX.md Use CrawlingAPI for general HTML scraping and JavaScript-rendered content. It supports both GET and POST requests.
```python from crawlbase import CrawlingAPI api = CrawlingAPI({'token': 'YOUR_TOKEN'}) response = api.get(url, options) response = api.post(url, data, options) ``` -------------------------------- ### Get Data by RID Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md Retrieve a stored item using its unique Request ID (RID). This method is an alternative to using the URL. ```python response = storage_api.get('RID_REPLACE') if response['status_code'] == 200: print(response['headers']['original_status']) print(response['headers']['pc_status']) print(response['headers']['url']) print(response['headers']['rid']) print(response['headers']['stored_at']) print(response['body']) ``` -------------------------------- ### CrawlingAPI Initialization Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md Initialize the CrawlingAPI class with your Crawlbase API token. ```APIDOC ## Initialization ### Description Initialize the CrawlingAPI class with your Crawlbase API token. ### Method ```python api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' }) ``` ``` -------------------------------- ### Crawling API Usage Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md Demonstrates how to initialize and use the Crawling API with both normal and JavaScript tokens. It highlights the limitations of the POST method with JavaScript tokens. ```APIDOC ## Crawling API Usage ### Normal Token Uses fast HTML scraping, suitable for static HTML pages. ```python api = CrawlingAPI({ 'token': 'normal_token_xyz' }) response = api.post('https://example.com', { 'data': 'test' }) # Works ``` ### JavaScript Token Executes JavaScript in a real browser, suitable for single-page applications (React, Angular, Vue, etc.). Supports only GET requests, via `get()`. POST is not available with JavaScript tokens.
```python js_api = CrawlingAPI({ 'token': 'js_token_abc' }) response = js_api.get('https://spa.example.com') # Works response = js_api.post('https://spa.example.com', {}) # Raises Exception ``` ``` -------------------------------- ### CrawlingAPI Initialization for JavaScript Rendering Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/configuration.md Initialize the CrawlingAPI client for tasks requiring JavaScript rendering. ```python # JavaScript rendering js_api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN' }) ``` -------------------------------- ### Get Total Document Count Source: https://github.com/crawlbase/crawlbase-python/blob/main/README.md Retrieve the total number of documents stored in your Crawlbase Storage area. This provides an overview of your storage usage. ```python total_count = storage_api.totalCount() print(total_count) ``` -------------------------------- ### POST Request with Options Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md Perform a POST request with additional options, such as specifying the response format or a custom User-Agent. ```python response = api.post( 'https://www.example.com/search', { 'query': 'test' }, { 'format': 'json', 'user_agent': 'MyBot/1.0' } ) ``` -------------------------------- ### CrawlingAPI Constructor Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/api-reference/crawling_api.md Initializes the CrawlingAPI client. Requires an API token and optionally accepts a timeout setting. ```APIDOC ## Constructor ```python CrawlingAPI(options) ``` ### Parameters - **options** (dict) - Required - Configuration object with required `token` key and optional `timeout`. - **token** (string) - Required - The API authentication token (normal token for HTML scraping, JavaScript token for JS-rendered content). - **timeout** (int) - Optional - Request timeout in seconds; defaults to 120. 
### Raises - `Exception`: If `token` is None or an empty string. ### Example ```python from crawlbase import CrawlingAPI # For HTML content api = CrawlingAPI({ 'token': 'YOUR_NORMAL_TOKEN' }) # For JavaScript-rendered content js_api = CrawlingAPI({ 'token': 'YOUR_JAVASCRIPT_TOKEN' }) # With custom timeout api = CrawlingAPI({ 'token': 'YOUR_TOKEN', 'timeout': 180 }) ``` ``` -------------------------------- ### LeadsAPI: Email Lead Discovery Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/INDEX.md Use LeadsAPI to discover email leads and contact information from a given domain. This API supports GET requests. ```python from crawlbase import LeadsAPI api = LeadsAPI({'token': 'YOUR_TOKEN'}) response = api.get_from_domain(domain, options) ``` -------------------------------- ### CrawlingAPI - JavaScript Rendering Source: https://github.com/crawlbase/crawlbase-python/blob/main/_autodocs/PACKAGE_OVERVIEW.md Illustrates how to enable JavaScript rendering for dynamic websites by specifying the `page_wait` option. ```APIDOC ## CrawlingAPI.get with JavaScript Rendering ### Description Performs a GET request with options to render JavaScript content on the page. ### Method GET ### Endpoint (root) - relative to https://api.crawlbase.com/ ### Parameters #### Query Parameters - **url** (string) - Required - The URL to scrape. - **options** (object) - Required - Options for the request. Must include `page_wait` (integer) in milliseconds to specify how long to wait for JavaScript execution. ### Request Example ```python js_api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'}) response = js_api.get('https://spa.example.com', { 'page_wait': 5000 }) ``` ### Response #### Success Response (200) - **status_code** (integer) - The HTTP status code of the response. - **body** (bytes) - The rendered HTML content of the page, as raw bytes. ```
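A minimal sketch of how the rendered page might be read out of the response dictionary described above. It relies only on the `status_code` and `body` keys; the helper name `rendered_html` and the `sample` dictionary are hypothetical and are not part of the crawlbase library. Other examples in these docs call `.decode('utf-8')` on `body`, so the helper assumes `body` may arrive as bytes and also tolerates a plain string.

```python
def rendered_html(response, encoding="utf-8"):
    """Return the page HTML as text, or None if the request did not succeed.

    `response` is assumed to be a dict with an integer `status_code`
    and a `body` that is either bytes or str.
    """
    if response.get("status_code") != 200:
        return None
    body = response.get("body", b"")
    # Decode raw bytes; pass text through unchanged.
    if isinstance(body, bytes):
        return body.decode(encoding, errors="replace")
    return body


# Hypothetical stand-in for the dict returned by js_api.get(...); not real API output.
sample = {"status_code": 200, "body": b"<html><body>rendered</body></html>"}
html = rendered_html(sample)
print(html)
```

In real use, the dictionary returned by `js_api.get('https://spa.example.com', { 'page_wait': 5000 })` would take the place of `sample`.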