### Install and Start Express Server

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/README.md

Commands to install dependencies and start the Express Server example for Reader.

```bash
cd express-server && npm install && npm start
```

--------------------------------

### Install and Start Browser Pool Scaling

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/README.md

Commands to install dependencies and start the Browser Pool Scaling example for Reader.

```bash
cd browser-pool-scaling && npm install && npm start
```

--------------------------------

### Install and Start Job Queue (BullMQ)

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/README.md

Commands to install dependencies and start the API server and worker process for the Job Queue example using BullMQ.

```bash
cd job-queue-bullmq && npm install
npm run start    # API server
npm run worker   # Worker process
```

--------------------------------

### Run Production Server Example

Source: https://github.com/vakra-dev/reader/blob/main/CONTRIBUTING.md

This command starts a production-ready Express.js server example from the 'examples/' folder.

```bash
npx tsx production/express-server/src/index.ts
```

--------------------------------

### Install Reader and Hero Core Dependencies

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

Installs the necessary packages for Vakra Reader, Express, and the shared Hero Core components. This is the initial setup step for the production server.

```bash
npm install @vakra-dev/reader express
npm install @ulixee/hero-core @ulixee/net  # For shared Core
```

--------------------------------

### Provider Examples

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Code examples for configuring proxies with popular providers like IPRoyal, Bright Data, Oxylabs, and SmartProxy.

```APIDOC
## Provider Examples

### Description
Illustrative proxy configurations for various popular proxy providers.

### Method
N/A (Configuration Snippets)

### Endpoint
N/A (Configuration Snippets)

### Parameters
#### IPRoyal Example
- **type**: "residential"
- **host**: "geo.iproyal.com"
- **port**: 12321
- **username**: "customer-username"
- **password**: "password"
- **country**: "us"

#### Bright Data (Luminati) Example
- **type**: "residential"
- **host**: "brd.superproxy.io"
- **port**: 22225
- **username**: "customer-zone-residential"
- **password**: "password"
- **country**: "us"

#### Oxylabs Example
- **type**: "residential"
- **host**: "pr.oxylabs.io"
- **port**: 7777
- **username**: "customer-username"
- **password**: "password"
- **country**: "us"

#### SmartProxy Example
- **type**: "residential"
- **host**: "gate.smartproxy.com"
- **port**: 7000
- **username**: "user"
- **password**: "pass"
- **country**: "us"

### Request Example (IPRoyal)
```json
{
  "type": "residential",
  "host": "geo.iproyal.com",
  "port": 12321,
  "username": "customer-username",
  "password": "password",
  "country": "us"
}
```
```

--------------------------------

### Install BullMQ and Dependencies

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/job-queues.md

Installs the necessary packages for BullMQ, ioredis, and the @vakra-dev/reader library using npm.

```bash
npm install bullmq ioredis @vakra-dev/reader
```

--------------------------------

### Install Dependencies with npm

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/job-queue-bullmq/README.md

Installs the necessary Node.js packages for the BullMQ job queue example. Ensure you are in the correct directory before running.

```bash
cd examples/production/job-queue-bullmq
npm install
```

--------------------------------

### BrowserPool Configuration Example (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/browser-pool.md

Provides a comprehensive example of configuring a BrowserPool with various options such as size, retirement policies, and queue limits.

```typescript
const pool = new BrowserPool({
  size: 5, // Number of browser instances
  retireAfterPages: 100, // Recycle after N pages
  retireAfterMinutes: 30, // Recycle after N minutes
  maxQueueSize: 100, // Max pending requests
  healthCheckIntervalMs: 300000, // Health check interval (5 min)
});
```

--------------------------------

### Bright Data Provider Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for using Bright Data (formerly Luminati) as a residential proxy provider. This setup details the necessary parameters for connecting to their service.

```typescript
proxy: {
  type: "residential",
  host: "brd.superproxy.io",
  port: 22225,
  username: "customer-zone-residential",
  password: "password",
  country: "us",
}
```

--------------------------------

### Start API Server with npm

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/job-queue-bullmq/README.md

Starts the API server that handles job submissions and status checks. This command assumes Node.js is installed and dependencies are met.

```bash
npm run start
```

--------------------------------

### AWS Lambda Container Dockerfile Setup

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/serverless.md

Sets up a Dockerfile for AWS Lambda to include Chrome and its dependencies. It installs necessary packages, copies application code, and defines the entry point for the Lambda function. This allows running Chrome within a containerized Lambda environment.

```dockerfile
FROM public.ecr.aws/lambda/nodejs:20

# Install Chrome dependencies
RUN yum install -y \
    chromium \
    nss \
    freetype \
    freetype-devel \
    fontconfig \
    pango \
    --skip-broken

ENV CHROME_PATH=/usr/bin/chromium-browser
ENV FONTCONFIG_PATH=/etc/fonts

COPY package*.json ./
RUN npm ci --only=production

COPY . .

CMD ["dist/handler.handler"]
```

--------------------------------

### Basic Dockerfile for Reader Application

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

A foundational Dockerfile to build and run the Reader application. It installs Node.js, necessary Chrome dependencies, copies application files, installs production dependencies, and exposes the application port.

```dockerfile
# Dockerfile
FROM node:22-slim

# Install Chrome dependencies
RUN apt-get update && apt-get install -y \
    chromium \
    fonts-liberation \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libcups2 \
    libdbus-1-3 \
    libdrm2 \
    libgbm1 \
    libgtk-3-0 \
    libnspr4 \
    libnss3 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    xdg-utils \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# Set Chrome path for Hero
ENV CHROME_PATH=/usr/bin/chromium

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci --only=production

# Copy application
COPY . .

# Build if TypeScript
RUN npm run build 2>/dev/null || true

EXPOSE 3000

CMD ["node", "dist/server.js"]
```

--------------------------------

### Start Application with PM2

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

This command initiates the application using PM2, applying the configuration defined in the `ecosystem.config.js` file. PM2 will manage the application lifecycle, including clustering and restarts.

```bash
pm2 start ecosystem.config.js
```

--------------------------------

### Datacenter Proxy Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for using datacenter proxies.

```APIDOC
## Datacenter Proxies

### Description
Configuration object for datacenter proxies.

### Method
N/A (Configuration Snippet)

### Endpoint
N/A (Configuration Snippet)

### Parameters
#### Request Body (within `proxy` object)
- **type** (string) - Must be "datacenter".
- **host** (string) - Proxy server hostname.
- **port** (number) - Proxy server port.
- **username** (string) - Authentication username.
- **password** (string) - Authentication password.

### Request Example
```json
{
  "type": "datacenter",
  "host": "proxy.example.com",
  "port": 8080,
  "username": "username",
  "password": "password"
}
```
```

--------------------------------

### Docker Compose Quick Start (Bash)

Source: https://github.com/vakra-dev/reader/blob/main/examples/deployment/docker/README.md

This command initiates the Reader Docker container using Docker Compose. It's the quickest way to get the Reader REST API server running locally.

```bash
cd examples/deployment/docker
docker-compose up -d
```

--------------------------------

### Run AI Integration Examples

Source: https://github.com/vakra-dev/reader/blob/main/CONTRIBUTING.md

These commands demonstrate AI integration examples, specifically using OpenAI for summarization. They require the OPENAI_API_KEY environment variable to be set.

```bash
# AI integration examples (requires API keys)
export OPENAI_API_KEY="sk-..."
npx tsx ai-tools/openai-summary.ts https://example.com
```

--------------------------------

### Run Basic Examples

Source: https://github.com/vakra-dev/reader/blob/main/CONTRIBUTING.md

These commands execute basic examples from the 'examples/' folder, covering simple scraping, batch scraping, and website crawling.

```bash
cd examples
npm install

# Basic examples
npx tsx basic/basic-scrape.ts
npx tsx basic/batch-scrape.ts
npx tsx basic/crawl-website.ts
```

--------------------------------

### Basic Docker Compose Setup for Reader

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

A simple Docker Compose configuration to define and run the Reader service. It specifies the build context, port mapping, environment variables, and restart policy.

```yaml
# docker-compose.yml
version: "3.8"

services:
  reader:
    build: .
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - LOG_LEVEL=info
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
```

--------------------------------

### SmartProxy Provider Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for using SmartProxy residential proxies. This demonstrates how to set the host, port, username, password, and country for the proxy connection.

```typescript
proxy: {
  type: "residential",
  host: "gate.smartproxy.com",
  port: 7000,
  username: "user",
  password: "pass",
  country: "us",
}
```

--------------------------------

### Manually Installing Chromium on Ubuntu/Debian

Source: https://github.com/vakra-dev/reader/blob/main/docs/troubleshooting.md

Install the Chromium browser manually on Ubuntu or Debian systems using the apt package manager.

```bash
sudo apt-get update
sudo apt-get install -y chromium-browser
```

--------------------------------

### Verify Development Setup (Bash)

Source: https://github.com/vakra-dev/reader/blob/main/CONTRIBUTING.md

Commands to verify the development environment setup by running type checking and building the project. These commands ensure Node.js and npm are correctly configured.

```bash
npm run typecheck
npm run build
```

--------------------------------

### Docker Compose Setup with Redis for Reader

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

A Docker Compose configuration for a multi-service setup including an API service (Reader), a worker service, and a Redis instance for job queues. It defines dependencies, environment variables, and resource limits.

```yaml
# docker-compose.yml
version: "3.8"

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    depends_on:
      - redis
    restart: unless-stopped

  worker:
    build:
      context: .
      dockerfile: Dockerfile.worker
    environment:
      - NODE_ENV=production
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    depends_on:
      - redis
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 2G
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:
```

--------------------------------

### Datacenter Proxy Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for using a datacenter proxy with the Reader client. Datacenter proxies are fast and cheap but can be easily detected.

```typescript
proxy: {
  type: "datacenter",
  host: "proxy.example.com",
  port: 8080,
  username: "username",
  password: "password"
}
```

--------------------------------

### Reduce Cold Starts with Connection Warm-up

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/serverless.md

This TypeScript code snippet shows a pattern for reducing cold starts by keeping a connection warm. It initializes the connection only once and reuses the promise for subsequent calls.

```typescript
// Keep connection warm: the promise is cached at module scope,
// so warm invocations reuse it instead of reconnecting.
// The type is derived from initializeConnection(), which is defined elsewhere.
let connectionPromise: ReturnType<typeof initializeConnection> | undefined;

function getConnection() {
  if (!connectionPromise) {
    connectionPromise = initializeConnection();
  }
  return connectionPromise;
}
```

--------------------------------

### Docker Compose Management Commands

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

Essential commands for managing Docker Compose services, including starting, scaling, viewing logs, and stopping all services defined in the docker-compose.yml file.

```bash
# Start all services
docker-compose up -d

# Scale workers
docker-compose up -d --scale worker=5

# View logs
docker-compose logs -f worker

# Stop services
docker-compose down
```

--------------------------------

### Initialize BrowserPool (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/browser-pool.md

Shows the basic steps to initialize a BrowserPool instance with a specified size. Initialization includes creating Hero instances and starting background health checking.

```typescript
import { BrowserPool } from "@vakra-dev/reader";

const pool = new BrowserPool({ size: 5 });
await pool.initialize();
```

--------------------------------

### Install Reader from npm

Source: https://github.com/vakra-dev/reader/blob/main/docs/getting-started.md

Installs the Reader package using npm. This is the recommended method for most users. Ensure Node.js and npm are installed and up to date.

```bash
npm install @vakra-dev/reader
```

--------------------------------

### Troubleshoot Chrome Startup (Bash)

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

These bash commands are used to troubleshoot Chrome startup issues within a Docker container. The first checks the Chrome installation version, and the second performs a manual headless test.

```bash
# Check Chrome installation
docker exec -it container_name chromium --version

# Test Chrome manually
docker exec -it container_name chromium --headless --no-sandbox --dump-dom https://example.com
```

--------------------------------

### Start API and Worker with npm

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/job-queue-bullmq/README.md

Starts both the API server and the worker process simultaneously for development purposes. This is a convenient command for local testing.

```bash
npm run dev
```

--------------------------------

### Install Reader from Source

Source: https://github.com/vakra-dev/reader/blob/main/docs/getting-started.md

Installs the Reader package by cloning the source repository. This method is useful for developers who want to contribute to the project or use the latest unreleased features. It requires Git, Node.js, and npm.

```bash
git clone https://github.com/vakra-dev/reader.git
cd reader
npm install
npm run build
```

--------------------------------

### Request Output Formats via CLI

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/output-formats.md

Shows how to use the Reader CLI to scrape a URL and specify output formats. Examples include requesting a single format and multiple comma-separated formats.

```bash
# Single format
npx reader scrape https://example.com -f markdown

# Multiple formats
npx reader scrape https://example.com -f markdown,text,json
```

--------------------------------

### Start Redis Server with Docker

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/job-queue-bullmq/README.md

Starts a Redis server instance using a Docker container. This is a prerequisite for the BullMQ job queue.

```bash
docker run -d -p 6379:6379 redis:alpine
```

--------------------------------

### Vercel Configuration for Functions

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/serverless.md

A `vercel.json` configuration file specifying settings for Vercel serverless functions. This example sets the memory to 1024MB and the maximum duration to 60 seconds for the `api/scrape.ts` function.

```json
{
  "functions": {
    "api/scrape.ts": {
      "memory": 1024,
      "maxDuration": 60
    }
  }
}
```

--------------------------------

### Run Vakra Reader Server

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

Executes the TypeScript server file using tsx, which allows running TypeScript directly without prior compilation. This command starts the production server.

```bash
npx tsx server.ts
```

--------------------------------

### Monitoring Network Resources with Hero

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/cloudflare-bypass.md

Illustrates how to monitor network resources within a browser instance managed by Hero. This example specifically logs Cloudflare-related resources encountered during navigation. It utilizes the `on('resource')` event handler provided by Hero.

```typescript
await pool.withBrowser(async (hero) => {
  hero.on("resource", (resource) => {
    if (resource.url.includes("cdn-cgi")) {
      console.log("Cloudflare resource:", resource.url);
    }
  });

  await hero.goto("https://protected-site.com");
});
```

--------------------------------

### Test Vakra Reader API Endpoints with cURL

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

Provides example cURL commands to test the `/scrape` and `/crawl` endpoints of the Vakra Reader API server. These demonstrate how to send JSON payloads for scraping specific URLs and initiating crawls.

```bash
# Scrape
curl -X POST http://localhost:3000/scrape \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "formats": ["markdown"]}'

# Crawl
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "depth": 2, "scrape": true}'
```

--------------------------------

### Use Case: Search Indexing with Text Format and TypeScript

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/output-formats.md

Illustrates a practical use case for the plain text output format: indexing content for search engines. This TypeScript example shows scraping content as text and then adding it to a hypothetical search index.

```typescript
const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["text"],
});

// Index plain text
await searchIndex.add({
  url: result.data[0].metadata.baseUrl,
  content: result.data[0].text,
});

await reader.close();
```

--------------------------------

### Manual Acquire and Release of Browser Instance (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/browser-pool.md

Provides an example of manually acquiring and releasing a browser instance from the pool. This advanced method requires careful handling of exceptions to ensure the browser is always released.

```typescript
const hero = await pool.acquire();
try {
  await hero.goto("https://example.com");
  // ... do work
} finally {
  await pool.release(hero);
}
```

--------------------------------

### Vercel Environment Variable Setup

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/serverless.md

A bash command to add the `BROWSERLESS_URL` environment variable using the Vercel CLI. This is used to configure the connection string for the remote browser service within Vercel Functions.

```bash
vercel env add BROWSERLESS_URL
```

--------------------------------

### Clone Repository and Install Dependencies (Bash)

Source: https://github.com/vakra-dev/reader/blob/main/CONTRIBUTING.md

Steps to clone the Reader repository from GitHub and install project dependencies using npm. This is a prerequisite for local development.

```bash
git clone https://github.com/YOUR_USERNAME/reader.git
cd reader
npm install
```

--------------------------------

### Install supermarkdown Package (Bash)

Source: https://github.com/vakra-dev/reader/blob/main/README.md

Provides commands for installing the supermarkdown package, used for HTML to Markdown conversion, via npm for Node.js projects.

```bash
# npm
npm install @vakra-dev/supermarkdown
```

--------------------------------

### Start Worker with npm

Source: https://github.com/vakra-dev/reader/blob/main/examples/production/job-queue-bullmq/README.md

Starts a worker process that consumes and processes jobs from the queue. This should typically be run in a separate terminal from the API server.

```bash
npm run worker
```

--------------------------------

### Verify CLI Installation

Source: https://github.com/vakra-dev/reader/blob/main/docs/getting-started.md

Tests the command-line interface (CLI) of the Reader package by scraping a sample URL. This command should output the content of example.com in markdown format, confirming the CLI is working correctly.

```bash
npx reader scrape https://example.com
```

--------------------------------

### Manually Installing Chromium on macOS

Source: https://github.com/vakra-dev/reader/blob/main/docs/troubleshooting.md

Install the Chromium browser on macOS using the Homebrew package manager.

```bash
brew install --cask chromium
```

--------------------------------

### Resolving Chrome/Chromium Not Found Error

Source: https://github.com/vakra-dev/reader/blob/main/docs/troubleshooting.md

Troubleshoot the 'Could not find Chrome installation' error.
Solutions involve letting Reader download Chrome, manual installation on Ubuntu/Debian or macOS, or pointing to an existing Chrome installation via an environment variable.

```bash
# Clear cache and retry download
rm -rf ~/.cache/ulixee
npx reader scrape https://example.com

# Manual install (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y chromium-browser

# Manual install (macOS)
brew install --cask chromium

# Point to existing Chrome
export CHROME_PATH=/usr/bin/chromium-browser
# or on macOS
export CHROME_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
```

--------------------------------

### Build and Run Reader Docker Image

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

Commands to build a Docker image for the Reader application and then run it as a container, mapping the application's port.

```bash
# Build image
docker build -t reader .

# Run container
docker run -p 3000:3000 reader
```

--------------------------------

### Async Scraping with BullMQ and Express.js (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/job-queues.md

This TypeScript code sets up an Express.js server with API endpoints for initiating and monitoring scraping jobs using BullMQ. It defines a worker that processes scraping tasks, leveraging Hero Core for browser automation. The server handles POST requests to start scraping and GET requests to check job status and retrieve results.

```typescript
// complete-example.ts
import { Queue, Worker, Job } from "bullmq";
import express from "express";
import HeroCore from "@ulixee/hero-core";
import { scrape, ScrapeResult } from "@vakra-dev/reader";

const app = express();
app.use(express.json());

// Redis connection
const connection = { host: "localhost", port: 6379 };

// Queue
const scrapeQueue = new Queue("scrape", { connection });

// Shared Hero Core
let heroCore: HeroCore;

// Worker
const worker = new Worker(
  "scrape",
  async (job: Job) => {
    // createConnection() is not defined in this snippet; it is assumed to
    // return a connection to the shared heroCore instance.
    const result = await scrape({
      ...job.data,
      connectionToCore: await createConnection(),
    });
    return result;
  },
  { connection, concurrency: 3 }
);

// API endpoints
app.post("/scrape/async", async (req, res) => {
  const job = await scrapeQueue.add("scrape", req.body);
  res.json({ jobId: job.id });
});

app.get("/scrape/:jobId", async (req, res) => {
  const job = await scrapeQueue.getJob(req.params.jobId);
  if (!job) return res.status(404).json({ error: "Not found" });

  const state = await job.getState();
  res.json({
    state,
    progress: job.progress,
    result: state === "completed" ? job.returnvalue : null,
  });
});

// Start
async function start() {
  heroCore = new HeroCore();
  await heroCore.start();
  app.listen(3000, () => console.log("Server running"));
}

start();
```

--------------------------------

### Residential Proxy Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for using residential proxies, including country targeting.

```APIDOC
## Residential Proxies

### Description
Configuration object for residential proxies, allowing for country-specific targeting.

### Method
N/A (Configuration Snippet)

### Endpoint
N/A (Configuration Snippet)

### Parameters
#### Request Body (within `proxy` object)
- **type** (string) - Must be "residential".
- **host** (string) - Proxy server hostname.
- **port** (number) - Proxy server port.
- **username** (string) - Authentication username.
- **password** (string) - Authentication password.
- **country** (string) - Optional - Country code for geo-targeting (e.g., "us", "uk").

### Request Example
```json
{
  "type": "residential",
  "host": "proxy.example.com",
  "port": 8080,
  "username": "username",
  "password": "password",
  "country": "us"
}
```
```

--------------------------------

### Rotate Proxies Manually

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Demonstrates how to cycle through a predefined list of proxies for sequential requests. This helps distribute load and avoid IP-based blocking. It requires a list of proxy configurations and a counter to track the current proxy.

```typescript
const proxies = [
  { host: "proxy1.example.com", port: 8080 },
  { host: "proxy2.example.com", port: 8080 },
  { host: "proxy3.example.com", port: 8080 },
];

let proxyIndex = 0;
const reader = new ReaderClient();

async function scrapeWithRotation(url: string) {
  const proxy = proxies[proxyIndex % proxies.length];
  proxyIndex++;

  return await reader.scrape({
    urls: [url],
    proxy: {
      ...proxy,
      username: "username",
      password: "password",
    },
  });
}

// Don't forget to close when done
// await reader.close();
```

--------------------------------

### Scaling Strategies

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

Documentation on how to scale the Vakra Reader service horizontally and manage memory limits using PM2.

```APIDOC
## Scaling Strategies

### Horizontal Scaling
Run multiple server instances behind a load balancer.

```bash
# Start multiple instances
PORT=3001 npx tsx server.ts &
PORT=3002 npx tsx server.ts &
PORT=3003 npx tsx server.ts &
```

### PM2 Cluster Mode
Use PM2 to manage and scale Node.js applications.

```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: "reader",
    script: "server.ts",
    interpreter: "npx",
    interpreter_args: "tsx",
    instances: "max",
    exec_mode: "cluster",
    env: {
      NODE_ENV: "production",
      PORT: 3000,
    },
  }],
};
```

```bash
pm2 start ecosystem.config.js
```

### Memory Limits
Configure memory limits for Node.js processes using PM2.

```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: "reader",
    script: "server.ts",
    max_memory_restart: "2G",
    node_args: "--max-old-space-size=2048",
  }],
};
```
```

--------------------------------

### GET /job/:id

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/job-queues.md

Retrieves the current status and progress of a specific scraping job using its ID.

```APIDOC
## GET /job/:id

### Description
Retrieves the status and progress of a scraping job identified by its ID.

### Method
GET

### Endpoint
/job/:id

### Parameters
#### Path Parameters
- **id** (string) - Required - The ID of the job to retrieve.

### Response
#### Success Response (200)
- **id** (string) - The job ID.
- **state** (string) - The current state of the job (e.g., "queued", "active", "completed", "failed").
- **progress** (number) - The completion progress of the job (0-100).
- **data** (object) - The original data submitted for the job.
- **result** (any) - The return value of the job if completed.
- **failedReason** (string) - The reason for failure if the job failed.

#### Response Example
```json
{
  "id": "some-unique-job-id",
  "state": "active",
  "progress": 50,
  "data": {
    "urls": ["https://example.com"],
    "formats": ["markdown"]
  },
  "result": null,
  "failedReason": null
}
```
```

--------------------------------

### Setting Chrome Path Environment Variable

Source: https://github.com/vakra-dev/reader/blob/main/docs/troubleshooting.md

Configure the CHROME_PATH environment variable to point Reader to a specific Chrome or Chromium installation.

```bash
# For Linux
export CHROME_PATH=/usr/bin/chromium-browser

# For macOS
export CHROME_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
```

--------------------------------

### CLI Scrape with Proxy

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Demonstrates how to perform a scrape operation using the Reader CLI, specifying a proxy server directly in the command line arguments.

```bash
npx reader scrape https://example.com --proxy http://user:pass@host:port
```

--------------------------------

### GET /job/:id/result

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/job-queues.md

Retrieves the final result of a completed scraping job. Returns a 202 Accepted status if the job is not yet completed.

```APIDOC
## GET /job/:id/result

### Description
Retrieves the final result of a completed scraping job. If the job is still in progress, it returns the current status and progress.

### Method
GET

### Endpoint
/job/:id/result

### Parameters
#### Path Parameters
- **id** (string) - Required - The ID of the job whose result is requested.

### Response
#### Success Response (200)
- The response body will contain the scraped data in the format(s) requested when the job was enqueued.

#### Accepted Response (202)
- Returned if the job is not yet completed. Contains the current status and progress.
- **status** (string) - The current state of the job.
- **progress** (number) - The completion progress of the job (0-100).

#### Response Example (Completed Job)
```json
{
  "content": "# Scraped Content\nThis is the scraped markdown content."
}
```

#### Response Example (Job in Progress)
```json
{
  "status": "active",
  "progress": 75
}
```
```

--------------------------------

### CLI Usage with Proxy

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Demonstrates how to use the Reader CLI to scrape a URL with a specified proxy.

```APIDOC
## CLI Usage

### Description
Command-line interface command to scrape a URL using a proxy.

### Method
N/A (CLI Command)

### Endpoint
N/A (CLI Command)

### Parameters
- **scrape**: Command to initiate scraping.
- **[URL]**: The URL to scrape (e.g., `https://example.com`).
- **--proxy**: Optional flag to specify the proxy URL (e.g., `http://user:pass@host:port`).

### Request Example
```bash
npx reader scrape https://example.com --proxy http://user:pass@host:port
```

### Response
Output will be the scraped content or an error message.
```

--------------------------------

### Reader Client Initialization with Proxy URL

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Initialize the Reader client and configure a proxy using a direct URL string.

```APIDOC
## POST /vakra-dev/reader

### Description
Initializes the Reader client with proxy configuration via a URL.

### Method
POST

### Endpoint
/vakra-dev/reader

### Parameters
#### Request Body
- **urls** (array) - Required - List of URLs to scrape.
- **proxy** (object) - Optional - Proxy configuration.
  - **url** (string) - Required - Full proxy URL (e.g., "http://username:password@proxy.example.com:8080").

### Request Example
```json
{
  "urls": ["https://example.com"],
  "proxy": {
    "url": "http://username:password@proxy.example.com:8080"
  }
}
```

### Response
#### Success Response (200)
- **data** (array) - Array of scraped data.
  - **metadata** (object) - Metadata about the scrape.
    - **baseUrl** (string) - The base URL that was scraped.
    - **proxy** (object) - Information about the proxy used.
      - **host** (string) - Proxy host.
      - **port** (number) - Proxy port.
      - **country** (string) - Optional - Country code if geo-targeting was used.


#### Response Example
```json
{
  "data": [
    {
      "metadata": {
        "baseUrl": "https://example.com",
        "proxy": {
          "host": "proxy.example.com",
          "port": 8080
        }
      }
    }
  ]
}
```
```

--------------------------------

### GET /health

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

Provides health status and performance metrics for the reader service, including active and total requests, failed requests, and queue status.

```APIDOC
## GET /health

### Description
Provides health status and performance metrics for the reader service, including active and total requests, failed requests, and queue status.

### Method
GET

### Endpoint
/health

### Response
#### Success Response (200)
- **status** (string) - The overall status of the service ('ok').
- **heroCore** (string) - Status of the heroCore service ('running' or 'stopped').
- **stats** (object) - Performance statistics:
  - **activeRequests** (number) - Number of currently active requests.
  - **totalRequests** (number) - Total number of requests processed.
  - **failedRequests** (number) - Total number of failed requests (status code >= 500).
  - **queueSize** (number) - Current size of the request queue.
  - **queuePending** (number) - Number of requests pending in the queue.

#### Response Example
```json
{
  "status": "ok",
  "heroCore": "running",
  "stats": {
    "activeRequests": 5,
    "totalRequests": 1000,
    "failedRequests": 10,
    "queueSize": 2,
    "queuePending": 1
  }
}
```
```

--------------------------------

### Oxylabs Provider Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for integrating Oxylabs residential proxies with the Reader client. This snippet shows the specific host, port, and authentication details required.
```typescript
proxy: {
  type: "residential",
  host: "pr.oxylabs.io",
  port: 7777,
  username: "customer-username",
  password: "password",
  country: "us",
}
```

--------------------------------

### Manual Docker Image Build (Bash)

Source: https://github.com/vakra-dev/reader/blob/main/examples/deployment/docker/README.md

This command builds the Docker image for the Reader project. It uses the Dockerfile located in the examples/deployment/docker directory and tags the image as 'reader'.

```bash
docker build -t reader -f examples/deployment/docker/Dockerfile .
```

--------------------------------

### IPRoyal Provider Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example of configuring the Reader client to use IPRoyal as a residential proxy provider. This includes specifying the host, port, username, password, and target country.

```typescript
proxy: {
  type: "residential",
  host: "geo.iproyal.com",
  port: 12321,
  username: "customer-username",
  password: "password",
  country: "us",
}
```

--------------------------------

### Crawl Request Flow Example

Source: https://github.com/vakra-dev/reader/blob/main/docs/architecture.md

Illustrates the step-by-step process of a crawl request, from initialization to result return. It details the BFS loop, page fetching, link extraction and filtering, rate limiting, and optional scraping.
```text
crawl({ url: "https://example.com", depth: 2, scrape: true })
    │
    ├─► Crawler.crawl()
    │       │
    │       ├─► Initialize queue with seed URL at depth 0
    │       │
    │       ├─► BFS loop (while queue not empty && pages < maxPages):
    │       │       │
    │       │       ├─► Dequeue next URL
    │       │       │
    │       │       ├─► Fetch page with Hero
    │       │       │
    │       │       ├─► Extract links via regex
    │       │       │
    │       │       ├─► Filter links:
    │       │       │       ├─► Same domain only
    │       │       │       ├─► Match includePatterns
    │       │       │       └─► Exclude excludePatterns
    │       │       │
    │       │       ├─► Add new links to queue with depth + 1
    │       │       │
    │       │       ├─► Rate limit (delay between requests)
    │       │       │
    │       │       └─► Add to discovered URLs
    │       │
    │       ├─► If scrape=true:
    │       │       └─► scrape({ urls: discoveredUrls })
    │       │
    │       └─► Return CrawlResult { urls[], scraped?, metadata }
    │
    └─► Result returned to caller
```

--------------------------------

### Reader Environment Variables Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/docker.md

Example of environment variables that can be configured for the Reader service in Docker Compose. These variables control application behavior, port, logging level, and Chrome path.

```yaml
services:
  reader:
    environment:
      - NODE_ENV=production
      - PORT=3000
      - LOG_LEVEL=info
      - CHROME_PATH=/usr/bin/chromium
      - MAX_CONCURRENT_REQUESTS=10
      - REQUEST_TIMEOUT_MS=60000
```

--------------------------------

### Residential Proxy Configuration

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Example configuration for using a residential proxy with the Reader client. Residential proxies use real IP addresses, making them harder to detect and suitable for sensitive scraping.
```typescript
proxy: {
  type: "residential",
  host: "proxy.example.com",
  port: 8080,
  username: "username",
  password: "password",
  country: "us",
}
```

--------------------------------

### Set Up Bull Board Queue Dashboard (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/job-queues.md

Integrates Bull Board to provide a web-based dashboard for monitoring and managing BullMQ queues. It allows visualization of job statuses, queue metrics, and manual job operations.

```typescript
import { createBullBoard } from "@bull-board/api";
import { BullMQAdapter } from "@bull-board/api/bullMQAdapter";
import { ExpressAdapter } from "@bull-board/express";

const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath("/admin/queues");

createBullBoard({
  queues: [new BullMQAdapter(scrapeQueue)],
  serverAdapter,
});

app.use("/admin/queues", serverAdapter.getRouter());
```

--------------------------------

### Get BrowserPool Statistics (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/browser-pool.md

Shows how to retrieve statistics about the current state of the browser pool, including the total number of instances, available instances, instances in use, and queue size.

```typescript
const stats = pool.getStats();
console.log(stats);
// {
//   total: 5,
//   available: 3,
//   inUse: 2,
//   queueSize: 0,
//   totalAcquired: 150,
//   totalRecycled: 3
// }
```

--------------------------------

### Reader Client Initialization with Structured Proxy Config

Source: https://github.com/vakra-dev/reader/blob/main/docs/guides/proxy-configuration.md

Initialize the Reader client and configure a proxy using a structured object with type, host, port, and credentials.

```APIDOC
## POST /vakra-dev/reader

### Description
Initializes the Reader client with proxy configuration using a structured object.

### Method
POST

### Endpoint
/vakra-dev/reader

### Parameters
#### Request Body
- **urls** (array) - Required - List of URLs to scrape.
- **proxy** (object) - Optional - Proxy configuration.
  - **type** (string) - Required - Proxy type (e.g., "residential", "datacenter").
  - **host** (string) - Required - Proxy server hostname.
  - **port** (number) - Required - Proxy server port.
  - **username** (string) - Optional - Authentication username.
  - **password** (string) - Optional - Authentication password.
  - **country** (string) - Optional - Country code for geo-targeting (e.g., "us").

### Request Example
```json
{
  "urls": ["https://example.com"],
  "proxy": {
    "type": "residential",
    "host": "proxy.example.com",
    "port": 8080,
    "username": "username",
    "password": "password",
    "country": "us"
  }
}
```

### Response
#### Success Response (200)
- **data** (array) - Array of scraped data.
  - **metadata** (object) - Metadata about the scrape.
    - **baseUrl** (string) - The base URL that was scraped.
    - **proxy** (object) - Information about the proxy used.
      - **host** (string) - Proxy host.
      - **port** (number) - Proxy port.
      - **country** (string) - Optional - Country code if geo-targeting was used.

#### Response Example
```json
{
  "data": [
    {
      "metadata": {
        "baseUrl": "https://example.com",
        "proxy": {
          "host": "proxy.example.com",
          "port": 8080,
          "country": "us"
        }
      }
    }
  ]
}
```
```

--------------------------------

### Horizontal Scaling with Multiple Instances

Source: https://github.com/vakra-dev/reader/blob/main/docs/deployment/production-server.md

This bash script demonstrates how to horizontally scale the application by running multiple instances on different ports. This approach requires a load balancer to distribute traffic across the instances.
```bash
# Start multiple instances
PORT=3001 npx tsx server.ts &
PORT=3002 npx tsx server.ts &
PORT=3003 npx tsx server.ts &
```

--------------------------------

### Initialize and Use Shared Hero Core (TypeScript)

Source: https://github.com/vakra-dev/reader/blob/main/README.md

Demonstrates how to initialize a shared Hero Core instance for production environments to reuse browser instances across requests. It includes setting up connections and using the scrape function.

```typescript
import HeroCore from "@ulixee/hero-core";
import { TransportBridge } from "@ulixee/net";
import { ConnectionToHeroCore } from "@ulixee/hero";
import { scrape } from "@vakra-dev/reader";

// Initialize once at startup
const heroCore = new HeroCore();
await heroCore.start();

// Create connection for each request
function createConnection() {
  const bridge = new TransportBridge();
  heroCore.addConnection(bridge.transportToClient);
  return new ConnectionToHeroCore(bridge.transportToCore);
}

// Use in requests
const result = await scrape({
  urls: ["https://example.com"],
  connectionToCore: createConnection(),
});

// Shutdown on exit
await heroCore.close();
```
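
The "initialize once at startup" step matters under load: if the Core is started lazily, several requests arriving at the same moment could each try to start it. One way to guard against that is sketched below. The `once` helper and the `FakeCore` stand-in are hypothetical names used only for illustration; neither is part of Reader or Hero. In a real server the initializer would construct and start `HeroCore` as in the snippet above.

```typescript
// Hypothetical helper (not a Reader or Hero API): wraps an async initializer
// so that it runs at most once, even when called concurrently.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = init().catch((err) => {
        // Clear the cached promise so a later call can retry a failed start.
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in for the real Core so the sketch runs without a browser.
class FakeCore {
  static starts = 0;

  async start(): Promise<void> {
    FakeCore.starts += 1;
  }
}

// Every caller shares the same in-flight (or completed) startup.
const getCore = once(async () => {
  const core = new FakeCore();
  await core.start();
  return core;
});
```

Request handlers would then `await getCore()` before creating their per-request connection. Caching the promise rather than the resolved value is what lets callers that arrive during startup wait on the same start instead of triggering a second one.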