### Install project dependencies Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/README.md Run this command to install all required dependencies for the project. ```bash $ yarn ``` -------------------------------- ### Install Botasaurus Environment on VM Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Initializes the VM environment by running the Botasaurus installation script. ```bash curl -sL https://raw.githubusercontent.com/omkarcloud/botasaurus/master/vm-scripts/install-bota.sh | bash ``` -------------------------------- ### Define S3 Debian Installer URL Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-aws.md Example URL format for a Debian installer hosted on an S3 bucket. ```text https://your-bucket.s3.amazonaws.com/Your-App-amd64.deb ``` -------------------------------- ### Comprehensive API Configuration Example Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md A detailed example demonstrating multiple API configurations including enabling the API, setting port and base path, adding scraper aliases, and defining custom routes with middleware. 
```typescript import ApiConfig from 'botasaurus-server/api-config'; import { hotelsSearchScraper } from "../src/scrapers"; // Enable API functionality ApiConfig.enableApi(); // Production configuration ApiConfig.setApiPort(3000); ApiConfig.setApiBasePath("/v1"); // Add scraper aliases for direct access ApiConfig.addScraperAlias(hotelsSearchScraper, '/hotels/search'); // Add custom routes ApiConfig.addCustomRoutes((server) => { // Health check for monitoring server.get('/health', (request, reply) => { return reply.send({ status: 'OK'}); }); // Authentication middleware server.addHook('onRequest', (request, reply, done) => { // Check for secret const secret = request.headers['x-secret'] as string; if (secret === '49cb1de3-419b-4647-bf06-22c9e1110313') { // Valid secret, proceed return done(); } else { return reply.status(401).send({ message: 'Unauthorized: Invalid secret.', }); } }); }); ``` -------------------------------- ### Install Node.js Packages Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/quick-start.md Install all necessary npm packages for the project. ```bash npm install ``` -------------------------------- ### Start local development server Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/README.md Launches a local server with live reloading for development purposes. ```bash $ yarn start ``` -------------------------------- ### Install Scraper Dependencies Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Commands to install project requirements and initialize the environment. ```bash python -m pip install -r requirements.txt python run.py install ``` -------------------------------- ### Install Desktop Application Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-aws.md Installs a desktop application on the EC2 instance using a Debian installer URL. 
```bash
python3 -m bota install-desktop-app --debian-installer-url https://yahoo-finance-extractor.s3.us-east-1.amazonaws.com/Yahoo+Finance+Extractor-amd64.deb
```

--------------------------------

### Install pg-cache-storage

Source: https://github.com/omkarcloud/botasaurus/blob/master/pg-cache-storage/README.md

Install the library using pip. This is the first step before using the cache storage.

```bash
pip install pg-cache-storage
```

--------------------------------

### Install UI Scraper

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Deploy the scraper from a repository URL to the VM.

```bash
python3 -m bota install-ui-scraper --repo-url https://github.com/omkarcloud/botasaurus-starter
```

--------------------------------

### Install Desktop Application

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-google-cloud.md

Installs the desktop application using a Debian installer URL. Supports optional configuration flags for port and base path.

```bash
python3 -m bota install-desktop-app --debian-installer-url https://yahoo-finance-extractor.s3.us-east-1.amazonaws.com/Yahoo+Finance+Extractor-amd64.deb
```

```sh
python3 -m bota install-desktop-app \
  --debian-installer-url https://amazon-invoice-extractor.s3.us-east-1.amazonaws.com/Amazon+Invoice+Extractor-amd64.deb \
  --port 8001 \
  --api-base-path /amazon-invoices
```

--------------------------------

### Initialize Botasaurus and Create Static IP

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Install the bota package and generate a static IP address for the virtual machine.
```bash python -m pip install bota python -m bota create-ip ``` -------------------------------- ### Example Usage of Botasaurus API Client Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_api/README.md A basic example demonstrating how to import the Api class and create an instance of the Botasaurus API client. ```python from botasaurus_api import Api # Create an instance of the API client api = Api() ``` -------------------------------- ### Install Botasaurus Humancursor Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_humancursor/README.md Use pip to install the library in your Python environment. ```bash pip install botasaurus-humancursor ``` -------------------------------- ### Launch Botasaurus Desktop App Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/quick-start.md Start the development server to launch the desktop application. ```bash npm run dev ``` -------------------------------- ### Install Botasaurus API Client Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_api/README.md Install the Botasaurus API client using pip. This command upgrades the package if it's already installed. ```bash python -m pip install --upgrade botasaurus-api ``` -------------------------------- ### Scrape Product Heading with Botasaurus Source: https://github.com/omkarcloud/botasaurus/blob/master/advanced.md This example demonstrates how to use Botasaurus to navigate to a specific URL, extract text from an element, and save a screenshot. It utilizes the `@browser` decorator for easy setup. 
```python from botasaurus.browser import browser, Driver @browser def scrape_heading_task(driver: Driver, data): driver.google_get("https://www.g2.com/products/jenkins/reviews?page=5", bypass_cloudflare=True) heading = driver.get_text('.product-head__title [itemprop="name"]') driver.save_screenshot() return heading scrape_heading_task() ``` -------------------------------- ### Install Scraper Repository Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Clones and installs a specific scraper repository onto the VM. ```bash python3 -m bota install-scraper --repo-url https://github.com/omkarcloud/botasaurus-starter ``` -------------------------------- ### Initialize and Run Botasaurus Project Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/getting-started.md Commands to navigate to the project directory, install requirements, and execute the main script. ```bash cd my-botasaurus-project python -m pip install -r requirements.txt code . # Optionally, open the project in VSCode python main.py ``` -------------------------------- ### Install Botasaurus CLI and Create Static IP Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-google-cloud.md Installs the Botasaurus CLI and creates a static IP address for your VM. You will be prompted for a name and region. ```bash python -m pip install bota --upgrade python -m bota create-ip # Create a static IP address for your VM ``` -------------------------------- ### Install Multiple Desktop APIs Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-aws.md Configures an additional application on the same instance using unique ports and API base paths to avoid conflicts. 
```sh
python3 -m bota install-desktop-app \
  --debian-installer-url https://amazon-invoice-extractor.s3.us-east-1.amazonaws.com/Amazon+Invoice+Extractor-amd64.deb \
  --port 8001 \
  --api-base-path /amazon-invoices
```

--------------------------------

### Run the Scraper

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Command to start the scraper application.

```bash
python run.py
```

--------------------------------

### Create Kubernetes Cluster with Bota

Source: https://github.com/omkarcloud/botasaurus/blob/master/run-scraper-in-kubernetes.md

Installs the bota package and initializes a new Kubernetes cluster in Google Cloud.

```bash
python -m pip install bota
python -m bota create-cluster
```

--------------------------------

### Install Botasaurus CLI and Apache

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-aws.md

Executes a script to install the Botasaurus CLI and Apache web server on an EC2 instance.

```bash
curl -sL https://raw.githubusercontent.com/omkarcloud/botasaurus/master/vm-scripts/install-bota-desktop.sh | bash
```

--------------------------------

### Package Application for Current OS

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/packaging-publishing.md

Executes the build process to generate an installer for the host operating system.

```bash
npm run package
```

--------------------------------

### Install SQLite Cache Storage

Source: https://github.com/omkarcloud/botasaurus/blob/master/sqlite-cache-storage/README.md

Install the package via pip to enable SQLite caching.

```bash
pip install sqlite-cache-storage
```

--------------------------------

### Install Botasaurus

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/what-is-botasaurus.md

Install the Botasaurus package using pip. Ensure you have the latest version for all features.
```shell python -m pip install --upgrade botasaurus ``` -------------------------------- ### Install Certbot for Apache Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-domain-and-ssl.md Installs the Certbot package and the Apache plugin on Debian-based systems. ```bash sudo apt install certbot python3-certbot-apache -y ``` -------------------------------- ### Install Chrome and Botasaurus in Google Colab Source: https://github.com/omkarcloud/botasaurus/blob/master/advanced.md Run these commands in a Google Colab notebook to install Google Chrome and the Botasaurus library. Ensure all dependencies are met before proceeding. ```python ! apt-get update ! wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb ! apt-get install -y lsof wget gnupg2 apt-transport-https ca-certificates software-properties-common adwaita-icon-theme alsa-topology-conf alsa-ucm-conf at-spi2-core dbus-user-session dconf-gsettings-backend dconf-service fontconfig fonts-liberation glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gtk-update-icon-cache hicolor-icon-theme libasound2 libasound2-data libatk-bridge2.0-0 libatk1.0-0 libatk1.0-data libatspi2.0-0 libauthen-sasl-perl libavahi-client3 libavahi-common-data libavahi-common3 libcairo-gobject2 libcairo2 libclone-perl libcolord2 libcups2 libdata-dump-perl libdatrie1 libdconf1 libdrm-amdgpu1 libdrm-common libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libdrm2 libencode-locale-perl libepoxy0 libfile-basedir-perl libfile-desktopentry-perl libfile-listing-perl libfile-mimeinfo-perl libfont-afm-perl libfontenc1 libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-mesa-dri libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgtk-3-0 libgtk-3-bin libgtk-3-common libharfbuzz0b libhtml-form-perl libhtml-format-perl libhtml-parser-perl libhtml-tagset-perl libhtml-tree-perl 
libhttp-cookies-perl libhttp-daemon-perl libhttp-date-perl libhttp-message-perl libhttp-negotiate-perl libice6 libio-html-perl libio-socket-ssl-perl libio-stringy-perl libipc-system-simple-perl libjson-glib-1.0-0 libjson-glib-1.0-common liblcms2-2 libllvm11 liblwp-mediatypes-perl liblwp-protocol-https-perl libmailtools-perl libnet-dbus-perl libnet-http-perl libnet-smtp-ssl-perl libnet-ssleay-perl libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libpciaccess0 libpixman-1-0 libproxy1v5 librest-0.7-0 librsvg2-2 librsvg2-common libsensors-config libsensors5 libsm6 libsoup-gnome2.4-1 libsoup2.4-1 libtext-iconv-perl libthai-data libthai0 libtie-ixhash-perl libtimedate-perl libtry-tiny-perl libu2f-udev liburi-perl libvte-2.91-0 libvte-2.91-common libvulkan1 libwayland-client0 libwayland-cursor0 libwayland-egl1 libwayland-server0 libwww-perl libwww-robotrules-perl libx11-protocol-perl libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-present0 libxcb-randr0 libxcb-render0 libxcb-shape0 libxcb-shm0 libxcb-sync1 libxcb-xfixes0 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxft2 libxi6 libxinerama1 libxkbcommon0 libxkbfile1 libxml-parser-perl libxml-twig-perl libxml-xpathengine-perl libxmu6 libxmuu1 libxrandr2 libxrender1 libxshmfence1 libxt6 libxtst6 libxv1 libxxf86dga1 libxxf86vm1 libz3-4 mesa-vulkan-drivers perl-openssl-defaults shared-mime-info termit x11-common x11-utils xdg-utils xvfb ! dpkg -i google-chrome-stable_current_amd64.deb ! python -m pip install botasaurus ``` -------------------------------- ### Install PDF Parsing Package Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/your-first-extractors/amazon-pdf-invoice-extractor.md Install the electron-pdf-parse package via npm to enable PDF reading capabilities in an Electron environment. 
```bash npm install electron-pdf-parse ``` -------------------------------- ### Restart the Scraper VM Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Command to reboot the virtual machine after starting it from the Google Cloud Console. ```bash shutdown -r now ``` -------------------------------- ### Complex API Configuration Example Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md A comprehensive configuration demonstrating multiple Botasaurus API settings including enabling the API, setting port and base path, adding scraper aliases, and defining custom routes with middleware. ```APIDOC ## Complex API Configuration ### Description This example shows how to configure the Botasaurus API with various options, including enabling the API, setting the port and base path, registering custom scraper aliases, and adding custom routes with authentication middleware. ### Configuration Steps 1. **Enable API**: `ApiConfig.enableApi();` 2. **Set API Port**: `ApiConfig.setApiPort(3000);` 3. **Set API Base Path**: `ApiConfig.setApiBasePath("/v1");` 4. **Add Scraper Alias**: `ApiConfig.addScraperAlias(hotelsSearchScraper, '/hotels/search');` 5. **Add Custom Routes**: Use `ApiConfig.addCustomRoutes()` to add endpoints like health checks and middleware. 
### Example Code ```ts title="src/scraper/backend/server.ts" import ApiConfig from 'botasaurus-server/api-config'; import { hotelsSearchScraper } from "../src/scrapers"; // Enable API functionality ApiConfig.enableApi(); // Production configuration ApiConfig.setApiPort(3000); ApiConfig.setApiBasePath("/v1"); // Add scraper aliases for direct access ApiConfig.addScraperAlias(hotelsSearchScraper, '/hotels/search'); // Add custom routes ApiConfig.addCustomRoutes((server) => { // Health check for monitoring server.get('/health', (request, reply) => { return reply.send({ status: 'OK'}); }); // Authentication middleware server.addHook('onRequest', (request, reply, done) => { // Check for secret const secret = request.headers['x-secret'] as string; if (secret === '49cb1de3-419b-4647-bf06-22c9e1110313') { // Valid secret, proceed return done(); } else { return reply.status(401).send({ message: 'Unauthorized: Invalid secret.', }); } }); }); ``` ### Resulting Endpoints and Behavior: - The API runs on port `3000`. - All routes are prefixed with `/v1`. - Hotel search is available at `GET /v1/hotels/search`. - Health check is available at `GET /v1/health`. - All requests require authentication via the `x-secret` header. ``` -------------------------------- ### Dynamically Configure Browser Profile and Proxy with Functions Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Use functions to extract configuration values from data parameters and pass them to the `@browser` decorator for dynamic browser setup. This is useful when each data item requires a different profile or proxy. 
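The idea of passing functions instead of fixed values can be sketched in plain Python. This is not Botasaurus's implementation: `with_dynamic_config` and `resolve` are hypothetical names, and the sketch only illustrates how a callable option might be evaluated once per data item while fixed options pass through unchanged.

```python
from functools import wraps


def resolve(option, data):
    """Return option(data) when the option is callable, else the option itself."""
    return option(data) if callable(option) else option


def with_dynamic_config(**options):
    """Hypothetical decorator: resolves every option against each data item."""
    def decorator(func):
        @wraps(func)
        def wrapper(items):
            results = []
            for item in items:
                # Build a per-item config from fixed values and callables.
                config = {name: resolve(value, item) for name, value in options.items()}
                results.append(func(config, item))
            return results
        return wrapper
    return decorator


@with_dynamic_config(profile=lambda data: data["profile"], headless=True)
def show_config(config, data):
    return config


configs = show_config([
    {"profile": "pikachu"},
    {"profile": "greyninja"},
])
print(configs)
```

Each item gets its own `profile`, while `headless` stays the same for all of them. The Botasaurus example of the same pattern follows.
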
```python
from botasaurus.browser import browser, Driver

def get_profile(data):
    return data["profile"]

def get_proxy(data):
    return data["proxy"]

@browser(profile=get_profile, proxy=get_proxy)
def scrape_heading_task(driver: Driver, data):
    profile, proxy = driver.config.profile, driver.config.proxy
    print(profile, proxy)
    return profile, proxy

data = [
    {"profile": "pikachu", "proxy": "http://142.250.77.228:8000"},
    {"profile": "greyninja", "proxy": "http://142.250.77.229:8000"},
]

scrape_heading_task(data)
```

--------------------------------

### Accessing Botasaurus Storage

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/understanding-app-features.md

Demonstrates how to import and use the Botasaurus storage utility to set and get item values. This is useful for persisting user settings.

```javascript
import { getBotasaurusStorage } from 'botasaurus/botasaurus-storage';

const storage = getBotasaurusStorage();
storage.setItem('userId', 10);
const userId = storage.getItem('userId');
```

--------------------------------

### Configure Request Decorator with Proxy

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

This example shows how to configure a proxy for the Request Decorator. The decorator enhances requests with browser-like headers and connections.

```python
from botasaurus.request import request, Request

@request(
    proxy="http://username:password@proxy-provider-domain:port"  # Replace with your own proxy
)
def scrape_heading_task(request: Request, data):
    # Placeholder body; add your request logic here
    pass
```

--------------------------------

### Configure Authenticated Proxies with Selenium Wire

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Set up authenticated proxies using selenium-wire. Note that this method may be susceptible to bot detection.
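Proxy URLs of the form `http://username:password@proxy-provider-domain:port` can break when the credentials contain characters such as `@` or `:`. A minimal sketch of percent-encoding them with the standard library is shown here. `build_proxy_url` is a hypothetical helper, not part of Botasaurus or selenium-wire, and it assumes your proxy client decodes percent-encoded credentials, which is worth verifying for your setup.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Hypothetical helper: percent-encodes the credentials so that
    characters such as '@' or ':' do not break the URL structure."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


proxy_url = build_proxy_url("username", "p@ss:word", "proxy-provider-domain", 8000)
print(proxy_url)
```

The resulting string can be used wherever a proxy URL is expected. The selenium-wire installation command and setup follow.
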
```bash python -m pip install selenium_wire ``` ```python from seleniumwire import webdriver # Import from seleniumwire # Define the proxy proxy_options = { 'proxy': { 'http': 'http://username:password@proxy-provider-domain:port', # TODO: Replace with your own proxy 'https': 'http://username:password@proxy-provider-domain:port', # TODO: Replace with your own proxy } } # Install and set up the driver driver = webdriver.Chrome(seleniumwire_options=proxy_options) # Visit the desired URL link = 'https://fingerprint.com/products/bot-detection/' driver.get("https://www.google.com/") driver.execute_script(f'window.location.href = "{link}"') # Prompt for user input input("Press Enter to exit...") # Clean up driver.quit() ``` -------------------------------- ### Field Definition Example Source: https://github.com/omkarcloud/botasaurus/blob/master/advanced.md Illustrates the basic usage of the `Field` class for displaying a single data field. It shows how to alias the output key. ```python # value is the reviews_per_rating dictionary ``` -------------------------------- ### Enable API in Botasaurus Server Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md Enable the API functionality by calling `ApiConfig.enableApi()` in your `src/scraper/backend/api-config.ts` file. This starts an API server at `http://localhost:8000` by default. ```typescript import ApiConfig from "botasaurus-server/api-config"; // Enable the API ApiConfig.enableApi(); ``` -------------------------------- ### GET /hotels/search (Example Scraper Alias) Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md This endpoint allows direct GET requests to execute the `hotelsSearchScraper` immediately, bypassing task management overhead. It validates input data and respects rate limits. 
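A client calling a scraper alias needs a correctly encoded query string. The sketch below builds such a URL with the standard library. `build_alias_url` is a hypothetical helper, the parameter names are placeholders, and the base URL assumes the default `http://localhost:8000` with no base path configured.

```python
from urllib.parse import urlencode


def build_alias_url(base_url, alias_path, params):
    """Joins the base URL, alias path, and URL-encoded query parameters."""
    query = urlencode(params)
    url = base_url.rstrip("/") + "/" + alias_path.lstrip("/")
    return f"{url}?{query}" if query else url


url = build_alias_url(
    "http://localhost:8000",
    "/hotels/search",
    {"param1": "value1", "param2": "value 2"},
)
print(url)
```

`urlencode` joins parameters with `&` and escapes spaces and other reserved characters, so the values arrive at the scraper unchanged. The endpoint reference follows.
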
```APIDOC
## GET /hotels/search

### Description
This endpoint allows direct GET requests to execute the `hotelsSearchScraper` immediately, bypassing task management overhead. It validates input data and respects rate limits.

### Method
GET

### Endpoint
`/hotels/search`

### Parameters
#### Query Parameters
- **(type)** - Required/Optional - Description of parameters the `hotelsSearchScraper` expects.

### Request Example
```
GET /hotels/search?param1=value1&param2=value2
```

### Response
#### Success Response (200)
- **(type)** - Description of the data returned by `hotelsSearchScraper`.

#### Response Example
```json
{
  "example": "response data from hotelsSearchScraper"
}
```
```

--------------------------------

### Manage All Profiles with Profiles Utility

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Utilize the `Profiles` utility to manage all browser profiles persistently. This includes setting, getting, and deleting profiles, with data stored in `profiles.json`.

```python
from botasaurus.profiles import Profiles

# Set profiles
Profiles.set_profile('amit', {'name': 'Amit Sharma', 'age': 30})
Profiles.set_profile('rahul', {'name': 'Rahul Verma', 'age': 30})

# Get a profile
profile = Profiles.get_profile('amit')
print(profile)  # Output: {'name': 'Amit Sharma', 'age': 30}

# Get all profiles
all_profiles = Profiles.get_profiles()
print(all_profiles)  # Output: [{'name': 'Amit Sharma', 'age': 30}, {'name': 'Rahul Verma', 'age': 30}]

# Get all profiles in random order
random_profiles = Profiles.get_profiles(random=True)
print(random_profiles)  # Output: [{'name': 'Rahul Verma', 'age': 30}, {'name': 'Amit Sharma', 'age': 30}] in random order

# Delete a profile
Profiles.delete_profile('amit')
```

--------------------------------

### Manage Cache with Botasaurus Cache Module

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/what-is-botasaurus.md

Use Botasaurus's Cache Module to manage cached data efficiently.
This example demonstrates putting, checking, getting, removing, and clearing cached data for a scraping function. ```python from botasaurus import * from botasaurus.cache import Cache # Example scraping function @request def scrape_data(data): # Your scraping logic here return {"processed": data} # Sample data for scraping input_data = {"key": "value"} # Adding data to the cache Cache.put(scrape_data, input_data, scrape_data(input_data)) # Checking if data is in the cache if Cache.has(scrape_data, input_data): # Retrieving data from the cache cached_data = Cache.get(scrape_data, input_data) # Removing specific data from the cache Cache.remove(scrape_data, input_data) # Clearing the complete cache for the scrape_data function Cache.clear(scrape_data) ``` -------------------------------- ### Clone Botasaurus Desktop Starter Project Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/quick-start.md Clone the starter project repository and navigate into the new directory. ```bash git clone https://github.com/omkarcloud/botasaurus-desktop-starter my-botasaurus-app cd my-botasaurus-app ``` -------------------------------- ### Configure PostgreSQL Instance Settings Source: https://github.com/omkarcloud/botasaurus/blob/master/run-postgres-cloud-sql-instance.md Settings for creating a cost-effective PostgreSQL instance on Google Cloud for testing purposes. ```text Instance ID: pikachu # Choose any name for your instance. Password: pikachu # For testing purposes, we're using a simple password "pikachu". In a production environment, use a strong password. Choose a Cloud SQL edition: Enterprise # Opt for the Enterprise edition as it is cheaper. Preset for this edition: Sandbox # Select the Sandbox preset as it is also cheaper. Region: us-central1 # For testing, we're using the default region. In production, select the region that is closest to your server for best performance. 
Machine shapes: Shared Core/1 vCPU, 0.614 GB # Choose the cheapest instance as we don't need a high-end machine for storing web scraping data. Storage Capacity: 10 GB Enable automatic storage increases: Yes # Enable this feature so you don't have to worry about running out of storage. (Awesome feature!) Enable deletion protection: No # Disable this feature, otherwise you'll need to change this setting later to delete the instance. ``` -------------------------------- ### Initialize Botasaurus Project Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Create a new directory for your project and open it in your code editor. ```shell mkdir my-botasaurus-project cd my-botasaurus-project code . # This will open the project in VSCode if you have it installed ``` -------------------------------- ### Clone Botasaurus Starter Template Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/getting-started.md Use this command to download the official starter template repository. ```bash git clone https://github.com/omkarcloud/botasaurus-starter my-botasaurus-project ``` -------------------------------- ### Configure VM Deployment Settings Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Recommended configuration parameters for the Google Cloud Click to Deploy interface. ```text Zone: us-central1-a # Use the zone from the region you selected in the previous step. Series: N1 Machine Type: n1-standard-2 (2 vCPU, 1 core, 7.5 GB memory) Boot Disk Type: Standard persistent disk # This is the most cost-effective disk option. Boot disk size in GB: 20 GB # Adjust based on storage needs Network Interface [External IP]: pikachu-ip # Use the IP name you created in the previous step. 
```

--------------------------------

### Disable API Autostart

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md

Prevent the API server from starting automatically on application launch by calling `ApiConfig.disableApiAutostart()`. Users will need to manually start the API from the desktop GUI.

```typescript
// API will not run until manually started from the desktop GUI
ApiConfig.disableApiAutostart();
```

--------------------------------

### Create a Simple 'Overview' View

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/enhancing-scrapers/custom-views.md

Defines an 'Overview' view with 'name' and 'price' fields and registers it with a scraper. Ensure 'yourScraper' is imported correctly.

```typescript
import { Server } from 'botasaurus-server/server';
import { View, Field } from 'botasaurus-server/ui';
import { yourScraper } from '../src/yourScraper'; // your scraper function

/* 1. Define the view */
// highlight-start
const overviewView = new View('Overview', [
  new Field('name'),
  new Field('price'),
]);
// highlight-end

/* 2. Register scraper + view */
Server.addScraper(yourScraper, { views: [overviewView] });
```

--------------------------------

### Botasaurus API Client Initialization

Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_api/README.md

Demonstrates how to import and initialize the Botasaurus API client, with options for specifying the API URL and controlling response file creation.

```APIDOC
## Botasaurus API Client Initialization

### Description
Initialize the Botasaurus API client. You can optionally specify the `api_url` and `create_response_files`.

### Method
`Api(api_url='http://localhost:8000', create_response_files=True)`

### Parameters
#### Optional Parameters
- **api_url** (string) - The base URL for the API server. Defaults to `http://localhost:8000`.
- **create_response_files** (boolean) - Whether to create response JSON files for debugging. Defaults to `True`. ### Request Example ```python from botasaurus_api import Api # Default initialization api = Api() # With custom API URL api_custom_url = Api('https://example.com/') # Disable response file creation api_no_files = Api(create_response_files=False) ``` ``` -------------------------------- ### Fetch Task Results Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_api/README.md Get the results associated with a specific task ID. ```python results = api.get_task_results(task['id']) ``` -------------------------------- ### Solve CAPTCHAs with Capsolver Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Install and integrate the capsolver extension to handle CAPTCHAs automatically. ```bash python -m pip install capsolver_extension_python ``` ```python from botasaurus.browser import browser, Driver from capsolver_extension_python import Capsolver # Replace "CAP-MY_KEY" with your actual CapSolver API key @browser(extensions=[Capsolver(api_key="CAP-MY_KEY")]) def solve_captcha(driver: Driver, data): driver.get("https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php") driver.prompt() solve_captcha() ``` -------------------------------- ### Run Botasaurus in Docker Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/what-is-botasaurus.md Commands to clone the starter template and launch the project using Docker Compose. ```bash git clone https://github.com/omkarcloud/botasaurus-starter my-botasaurus-project cd my-botasaurus-project docker-compose build && docker-compose up ``` -------------------------------- ### Retrieving Element Properties Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md Methods for getting text content, attributes, and other properties of elements. 
```APIDOC ## Retrieving Element Properties ### Description Methods for extracting information from web elements, such as their text content or attribute values. ### Methods - `driver.get_text(selector)`: Gets the text content of the element matching the selector. - `driver.get_element_containing_text(text)`: Finds an element that contains the specified text. - `element.get_attribute(attribute_name)`: Gets the value of a specified attribute for an element. ### Request Example ```python # Example usage: header_text = driver.get_text("h1") error_message = driver.get_element_containing_text("Error: Invalid input") image_url = driver.select("img.logo").get_attribute("src") ``` ``` -------------------------------- ### Define Scraper Task Data Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/enhancing-scrapers/custom-views.md Example structure of product data returned by a scraper task. ```ts const taskScraper = task({ name: "taskScraper", run: () => { // highlight-start return [ { id: 1, name: "T-Shirt", price: 16, // in US Dollar reviews: 1000, reviews_per_rating: { 1: 0, 2: 0, 3: 0, 4: 100, 5: 900, }, featured_reviews: [ { id: 1, rating: 5, content: "Awesome t-shirt!", }, { id: 2, rating: 5, content: "Amazing t-shirt!", }, ], }, { id: 2, name: "Laptop", price: 700, reviews: 500, reviews_per_rating: { 1: 0, 2: 0, 3: 0, 4: 100, 5: 400, }, featured_reviews: [ { id: 1, rating: 5, content: "Best laptop ever!", }, { id: 2, rating: 5, content: "Great laptop!", }, ], }, ]; // highlight-end }, }) ``` -------------------------------- ### Configure Supabase project settings Source: https://github.com/omkarcloud/botasaurus/blob/master/run-supabase-postgres-instance.md YAML configuration settings for initializing a new Supabase project. ```yaml Name: Pikachu # Choose any name Database Password: greyninja1234_A # For testing, use "greyninja1234_A". In production, use a strong password. 
Region: West US (North California) # Select the region closest to your server for best performance.
```

--------------------------------

### Build and Run Botasaurus Scraper in Docker

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Commands to clone the Botasaurus Starter Template, build the Docker image, and run the scraper within a Docker environment.

```bash
git clone https://github.com/omkarcloud/botasaurus-starter my-botasaurus-project
cd my-botasaurus-project
docker-compose build
docker-compose up
```

--------------------------------

### Build static site

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/README.md

Compiles the project into static files located in the build directory for deployment.

```bash
$ yarn build
```

--------------------------------

### Configure Google Cloud VM Deployment

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Recommended hardware and disk configuration for cost-effective Botasaurus VM deployments.

```text
Zone: us-central1-a # Use us-central1 (Iowa) for the lowest-cost VMs
Series: N1
Machine Type: n1-standard-2 (2 vCPU, 1 core, 7.5 GB memory)
Boot Disk Type: Standard persistent disk # This is the most cost-effective disk option.
Boot disk size in GB: 20 GB # Adjust based on storage needs
```

--------------------------------

### Upgrade Botasaurus Packages

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Run this command to update all Botasaurus packages to their latest versions. Ensure you have pip installed.

```bash
python -m pip install --upgrade bota botasaurus botasaurus-api botasaurus-requests botasaurus-driver botasaurus-proxy-authentication botasaurus-server botasaurus-humancursor
```

--------------------------------

### Load pages organically

Source: https://github.com/omkarcloud/botasaurus/blob/master/anti-detect-driver.md

Simulate a search engine referral by visiting Google before the target URL.
```python
driver.google_get("https://www.omkar.cloud/auth/sign-up/")
```

--------------------------------

### Uninstall Desktop Application

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-google-cloud.md

Removes the application using either the Debian installer URL or the package name defined in package.json.

```bash
python3 -m bota uninstall-desktop-app --debian-installer-url https://yahoo-finance-extractor.s3.us-east-1.amazonaws.com/Yahoo+Finance+Extractor-amd64.deb
```

```bash
python3 -m bota uninstall-desktop-app --package-name yahoo-finance-extractor
```

--------------------------------

### Create Botasaurus Project Directory

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/what-is-botasaurus.md

Set up a new directory for your Botasaurus project and navigate into it. The 'code .' command opens the directory in VSCode.

```shell
mkdir my-botasaurus-project
cd my-botasaurus-project
code .
```

--------------------------------

### Uninstall Desktop App via URL

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/deploying-on-aws.md

Removes a desktop application from the EC2 instance using its Debian installer URL.

```bash
python3 -m bota uninstall-desktop-app --debian-installer-url https://yahoo-finance-extractor.s3.us-east-1.amazonaws.com/Yahoo+Finance+Extractor-amd64.deb
```

--------------------------------

### Define a Simple Link Input Control

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

A basic example of defining a required link input control with a default value.

```javascript
/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls
        // Render a Link Input, which is required, defaults to "https://stackoverflow.blog/open-source".
        .link('link', { isRequired: true, defaultValue: "https://stackoverflow.blog/open-source" })
}
```

--------------------------------

### Import and Initialize Api Class

Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_api/README.md

Import the Api class from the botasaurus-api library and create an instance. The default API URL is http://localhost:8000.

```python
from botasaurus_api import Api

api = Api()
```

--------------------------------

### Initialize Api with Custom URL

Source: https://github.com/omkarcloud/botasaurus/blob/master/botasaurus_api/README.md

Create an instance of the Api class, specifying a custom base URL for the API server using the api_url parameter.

```python
api = Api('https://example.com/')
```

--------------------------------

### Manage Scraper Cache

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Perform basic cache operations like put, get, has, remove, and clear for scraping tasks.

```python
from botasaurus.task import task
from botasaurus.cache import Cache

# Example scraping function
@task
def scrape_data(data):
    # Your scraping logic here
    return {"processed": data}

# Sample data for scraping
input_data = {"key": "value"}

# Adding data to the cache
Cache.put('scrape_data', input_data, scrape_data(input_data))

# Checking if data is in the cache
if Cache.has('scrape_data', input_data):
    # Retrieving data from the cache
    cached_data = Cache.get('scrape_data', input_data)
    print(f"Cached data: {cached_data}")

# Removing specific data from the cache
Cache.remove('scrape_data', input_data)

# Clearing the complete cache for the scrape_data function
Cache.clear('scrape_data')
```

--------------------------------

### Define Complex Input Controls

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

A comprehensive example demonstrating multi-text, sections, switches, selects, and conditional logic for input controls.
```javascript
/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls
        .listOfTexts('queries', {
            defaultValue: ["Web Developers in Bangalore"],
            placeholder: "Web Developers in Bangalore",
            label: 'Search Queries',
            isRequired: true
        })
        .section("Email and Social Links Extraction", (section) => {
            section.text('api_key', {
                placeholder: "2e5d346ap4db8mce4fj7fc112s9h26s61e1192b6a526af51n9",
                label: 'Email and Social Links Extraction API Key',
                helpText: 'Enter your API key to extract email addresses and social media links.',
            })
        })
        .section("Reviews Extraction", (section) => {
            section
                .switch('enable_reviews_extraction', {
                    label: "Enable Reviews Extraction"
                })
                .numberGreaterThanOrEqualToZero('max_reviews', {
                    label: 'Max Reviews per Place (Leave empty to extract all reviews)',
                    placeholder: 20,
                    isShown: (data) => data['enable_reviews_extraction'],
                    defaultValue: 20,
                })
                .choose('reviews_sort', {
                    label: "Sort Reviews By",
                    isRequired: true,
                    isShown: (data) => data['enable_reviews_extraction'],
                    defaultValue: 'newest',
                    options: [
                        { value: 'newest', label: 'Newest' },
                        { value: 'most_relevant', label: 'Most Relevant' },
                        { value: 'highest_rating', label: 'Highest Rating' },
                        { value: 'lowest_rating', label: 'Lowest Rating' }
                    ]
                })
        })
        .section("Language and Max Results", (section) => {
            section
                .addLangSelect()
                .numberGreaterThanOrEqualToOne('max_results', {
                    placeholder: 100,
                    label: 'Max Results per Search Query (Leave empty to extract all places)'
                })
        })
        .section("Geo Location", (section) => {
            section
                .text('coordinates', {
                    placeholder: '12.900490, 77.571466'
                })
                .numberGreaterThanOrEqualToOne('zoom_level', {
                    label: 'Zoom Level (1-21)',
                    defaultValue: 14,
                    placeholder: 14
                })
        })
}
```

--------------------------------

### Extract Product Links from G2 Sitemap

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Fetches product links from a gzipped sitemap
index by filtering for segments starting with 'products'.

```python
from botasaurus import bt
from botasaurus.sitemap import Sitemap, Filters, Extractors

links = (
    Sitemap("https://www.g2.com/sitemaps/sitemap_index.xml.gz")
    .filter(Filters.first_segment_equals("products"))
    .extract(Extractors.extract_link_upto_second_segment())
    .write_links('g2-products')
)
```

--------------------------------

### Choose Control (Button Options)

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/enhancing-scrapers/input-controls.md

Displays options as clickable buttons, suitable for a small number of choices (fewer than 3). It requires an `options` array.

```APIDOC
## choose

Displays options as clickable buttons (an alternative to `select`). It requires an `options` array. Use `choose` instead of `select` when you have fewer than 3 options for better user experience.

### Parameters
- **options** (array) - Required. An array of objects, each with `value` and `label` properties, defining the selectable options.
- **defaultValue** (string) - Optional. The default selected option's value.

### Request Example
```ts
.choose("theme", {
  options: [
    { value: "light", label: "Light" },
    { value: "dark", label: "Dark" },
  ],
  defaultValue: "light",
})
```
```

--------------------------------

### Generate Deployment Manifests

Source: https://github.com/omkarcloud/botasaurus/blob/master/run-scraper-in-kubernetes.md

Creates the necessary GitHub Actions workflow and Kubernetes deployment YAML files.

```bash
python -m bota create-manifests
```

--------------------------------

### Visit URLs with Botasaurus Driver

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Use `driver.get` for standard URL navigation. `driver.google_get` is recommended for using Google as a referer. `driver.get_via` allows specifying a custom referer, and `driver.get_via_this_page` uses the current page as the referer.
```python
driver.get("https://www.example.com")
```

```python
driver.google_get("https://www.example.com")  # Use Google as the referer [Recommended]
```

```python
driver.get_via("https://www.example.com", referer="https://duckduckgo.com/")  # Use custom referer
```

```python
driver.get_via_this_page("https://www.example.com")  # Use current page as referer
```

--------------------------------

### Push Code to GitHub Repository

Source: https://github.com/omkarcloud/botasaurus/blob/master/run-scraper-in-kubernetes.md

Initializes a new git repository and pushes the local project files to a remote GitHub repository.

```bash
rm -rf .git # remove the existing git repository
git init
git add .
git commit -m "Initial Commit"
git remote add origin https://github.com/USERNAME/kubernetes-scraper # TODO: replace USERNAME with your GitHub username
git branch -M main
git push -u origin main
```

--------------------------------

### Add Custom Health Check Endpoint

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md

Use this to add a custom GET endpoint for health checks to your API. It requires the Fastify instance provided by `addCustomRoutes`.

```typescript
ApiConfig.addCustomRoutes((server) => {
  server.get('/health', (request, reply) => {
    return reply.send({ status: 'OK'});
  });
});
```

--------------------------------

### Field with Options: outputKey, map, and showIf

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/enhancing-scrapers/custom-views.md

Demonstrates a Field configuration with options to rename the output key, map values using a function, and conditionally show the column based on input data.

```typescript
new Field("reviews_per_rating", {
  outputKey: "average_rating",
  map: (value, record) => {
    // ...
    // Logic to calculate average rating from the 'value' object
  },
  // Only show this column if the user checked "scrape_prices"
  showIf: (inputData) => inputData.scrape_prices === true
})
```

--------------------------------

### SqliteCacheStorage Constructor

Source: https://github.com/omkarcloud/botasaurus/blob/master/sqlite-cache-storage/README.md

Initialize the storage backend with a custom database path and table name.

```python
SqliteCacheStorage(
    db_path: str = 'cache.db',
    table_name: str = 'botasaurus_cache'
)
```

--------------------------------

### Data Structure for Product Records

Source: https://github.com/omkarcloud/botasaurus/blob/master/advanced.md

Example data structure representing product information, including nested dictionaries and lists, used for demonstrating Botasaurus field types.

```python
products = [
    {
        "id": 1,
        "name": "T-Shirt",
        "price": 16,  # in US Dollar
        "reviews": 1000,
        "reviews_per_rating": {
            "1": 0,
            "2": 0,
            "3": 0,
            "4": 100,
            "5": 900,
        },
        "featured_reviews": [
            {
                "id": 1,
                "rating": 5,
                "content": "Awesome t-shirt!",
            },
            {
                "id": 2,
                "rating": 5,
                "content": "Amazing t-shirt!",
            },
        ],
    },
    {
        "id": 2,
        "name": "Laptop",
        "price": 700,
        "reviews": 500,
        "reviews_per_rating": {
            "1": 0,
            "2": 0,
            "3": 0,
            "4": 100,
            "5": 400,
        },
        "featured_reviews": [
            {
                "id": 1,
                "rating": 5,
                "content": "Best laptop ever!",
            },
            {
                "id": 2,
                "rating": 5,
                "content": "Great laptop!",
            },
        ],
    },
]
```

--------------------------------

### Enable Headless Mode

Source: https://github.com/omkarcloud/botasaurus/blob/master/README.md

Run the browser in headless mode. Note that this may trigger anti-bot detection services.

```python
@browser(
    headless=True
)
```

--------------------------------

### Add Custom Authentication Middleware

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md

Implement custom authentication logic using Fastify's `onRequest` hook. This example checks for a specific secret in the request headers.
```typescript
ApiConfig.addCustomRoutes((server) => {
  server.addHook('onRequest', (request, reply, done) => {
    // Check for secret
    const secret = request.headers['x-secret'] as string;
    if (secret === '49cb1de3-419b-4647-bf06-22c9e1110313') {
      // Valid secret, proceed
      return done();
    } else {
      return reply.status(401).send({
        message: 'Unauthorized: Invalid secret.',
      });
    }
  });
});
```

--------------------------------

### Add Direct Scraper Endpoint

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md

Create a direct GET endpoint for a scraper using `ApiConfig.addScraperAlias()`. This bypasses task creation and scheduling overhead, allowing immediate execution.

```typescript
import { hotelsSearchScraper } from "../src/scrapers";

// Creates direct GET endpoint at /hotels/search
ApiConfig.addScraperAlias(hotelsSearchScraper, "/hotels/search");
```

--------------------------------

### addCustomRoutes((server: FastifyInstance) => void)

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/botasaurus-desktop-api/adding-api.md

Extends the API with custom endpoints and middleware using Fastify's routing system. This method receives a Fastify instance, allowing you to define any route you need.

```APIDOC
## `addCustomRoutes((server: FastifyInstance) => void)`

### Description
Extends the API with custom endpoints and middleware using Fastify's routing system. This method receives a Fastify instance, allowing you to define any route you need.

### Usage
```ts
ApiConfig.addCustomRoutes((server) => {
  // Define custom routes or middleware here
});
```

### Examples

#### Adding a custom health check endpoint
```ts
ApiConfig.addCustomRoutes((server) => {
  server.get('/health', (request, reply) => {
    return reply.send({ status: 'OK'});
  });
});
```

#### Adding validation middleware
```ts
ApiConfig.addCustomRoutes((server) => {
  server.addHook('onRequest', (request, reply, done) => {
    const secret = request.headers['x-secret'] as string;
    if (secret === '49cb1de3-419b-4647-bf06-22c9e1110313') {
      return done();
    } else {
      return reply.status(401).send({
        message: 'Unauthorized: Invalid secret.',
      });
    }
  });
});
```

### When to use:
- Adding authentication middleware
- Creating custom endpoints
- Implementing webhook receivers
```

--------------------------------

### Create PDF File Picker Input with Botasaurus

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/your-first-extractors/amazon-pdf-invoice-extractor.md

Use this JavaScript code to create a drag-and-drop file picker that accepts only PDF files. Ensure the 'botasaurus-controls' library is installed.

```javascript
/**
 * @typedef {import('botasaurus-controls').Controls} Controls
 * @typedef {import('botasaurus-controls').FileTypes} FileTypes
 */
const { FileTypes } = require('botasaurus-controls');

/**
 * Renders the form users see on the Home page.
 * @param {Controls} controls
 */
function getInput(controls) {
  // Render a File Input for uploading PDFs
  controls.filePicker('files', {
    label: 'Invoice PDFs',
    accept: FileTypes.PDF,
    isRequired: true,
    helpText: 'Drag one or more Amazon invoice PDFs here',
  });
}
```

--------------------------------

### Add Help Text

Source: https://github.com/omkarcloud/botasaurus/blob/master/docs/docs/botasaurus-desktop/enhancing-scrapers/input-controls.md

Displays a help icon with descriptive text when hovered.
```ts
.text("api_key", {
  label: "API Key",
  // highlight-next-line
  helpText: "Find API key in Dashboard → Settings → API"
})
```
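
The fragment above is a chained call, so it does not run on its own. As a rough, self-contained sketch of how `helpText` travels alongside a control's other options, the block below uses `SketchControls`, a hypothetical stand-in written only for this illustration. It is not the real `botasaurus-controls` class, and `helpTexts()`, `TextOptions`, and `ControlSpec` are likewise assumptions; only the `.text(id, { label, helpText })` call shape comes from the snippet above.

```ts
// Hypothetical stand-in for illustration only. The real Controls class
// ships with botasaurus-controls and is not reproduced here.
interface TextOptions {
  label?: string;
  placeholder?: string;
  helpText?: string;
}

interface ControlSpec {
  type: "text";
  id: string;
  options: TextOptions;
}

class SketchControls {
  readonly specs: ControlSpec[] = [];

  // Records a text control and returns `this` so calls can be chained.
  text(id: string, options: TextOptions = {}): this {
    this.specs.push({ type: "text", id, options });
    return this;
  }

  // Collects the help text of every control that defines one, keyed by id.
  helpTexts(): Record<string, string> {
    const result: Record<string, string> = {};
    for (const spec of this.specs) {
      if (spec.options.helpText !== undefined) {
        result[spec.id] = spec.options.helpText;
      }
    }
    return result;
  }
}

const controls = new SketchControls()
  .text("api_key", {
    label: "API Key",
    helpText: "Find API key in Dashboard → Settings → API",
  })
  .text("query", { label: "Search Query" }); // no helpText, so no help entry

const help = controls.helpTexts();
console.log(help);
```

In this sketch, only controls that set `helpText` end up in the collected map, which mirrors the idea that the help icon is opt-in per control.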