### Plugin Registration Example

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

An example of how to register a new plugin with LibreCrawl, including essential configuration like ID, name, and tab settings.

```APIDOC
## POST /api/plugins/register

### Description
Registers a new custom plugin with LibreCrawl.

### Method
POST

### Endpoint
/api/plugins/register

### Request Body
- **id** (string) - Required - Unique ID for the plugin.
- **name** (string) - Required - Display name of the plugin.
- **tab** (object) - Required - Configuration for the plugin's tab in the UI.
  - **label** (string) - Required - Text for the tab button.
  - **icon** (string) - Optional - Emoji or icon for the tab.
  - **position** (number) - Optional - Position of the tab.

### Request Example
```json
{
  "id": "my-plugin",
  "name": "My Plugin",
  "tab": {
    "label": "My Tab",
    "icon": "🔥"
  }
}
```

### Response
#### Success Response (200)
- **message** (string) - Confirmation message.

#### Response Example
```json
{
  "message": "Plugin registered successfully."
}
```
```

--------------------------------

### Install Python Dependencies

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Commands to install required Python packages and Playwright browser binaries.

```bash
pip install -r requirements.txt
```

```bash
playwright install chromium
```

--------------------------------

### Deploy with Docker

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Commands to clone the repository and start the application using Docker Compose.
```bash
# Clone the repository
git clone https://github.com/PhialsBasement/LibreCrawl.git
cd LibreCrawl

# Copy environment file
cp .env.example .env

# Start LibreCrawl
docker-compose up -d

# Open browser to http://localhost:5000
```

--------------------------------

### Run Application via Python

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Commands to start the application in standard or local mode.

```bash
# Standard mode (with authentication and tier system)
python main.py

# Local mode (all users get admin tier, no rate limits)
python main.py --local
# or
python main.py -l
```

--------------------------------

### Run Automatic Startup Scripts

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Use these scripts to automatically check for dependencies, install requirements, and launch the application.

```batch
start-librecrawl.bat
```

```bash
chmod +x start-librecrawl.sh
./start-librecrawl.sh
```

--------------------------------

### Start a new crawl

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Initiates a crawl for a specified URL. Requires a JSON payload containing the target URL.

```bash
# Start a crawl
curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -H "Cookie: session=your_session_cookie" \
  -d '{"url": "https://example.com"}'

# Response
{
  "success": true,
  "message": "Crawl started successfully",
  "crawl_id": 123
}
```

--------------------------------

### GET /api/get_settings

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Retrieves the current crawler configuration settings.

```APIDOC
## GET /api/get_settings

### Description
Fetches the current configuration settings for the crawler.

### Method
GET

### Endpoint
http://localhost:5000/api/get_settings

### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **settings** (object) - The current crawler configuration

#### Response Example
{
  "success": true,
  "settings": {
    "maxDepth": 3,
    "maxUrls": 5000000,
    "crawlDelay": 1,
    "enableJavaScript": false,
    "userAgent": "LibreCrawl/1.0 (Web Crawler)",
    "respectRobotsTxt": true,
    "discoverSitemaps": true
  }
}
```

--------------------------------

### Configure Crawler Settings

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Initialize the SettingsManager for a specific user and tier to get and update crawler settings. Settings are filtered by tier permissions. The manager can also return a crawler-ready configuration and reset settings to their defaults.

```python
from src.settings_manager import SettingsManager

settings_manager = SettingsManager(
    session_id='unique-session',
    user_id=1,
    tier='admin'  # guest, user, extra, admin
)

# Read the current settings (filtered by tier permissions)
settings = settings_manager.get_settings()
print(f"Max Depth: {settings['maxDepth']}")
print(f"Max URLs: {settings['maxUrls']}")
print(f"JavaScript Enabled: {settings['enableJavaScript']}")

# Update settings; returns a (success, message) pair
success, message = settings_manager.save_settings({
    'maxDepth': 5,
    'maxUrls': 10000,
    'crawlDelay': 0.5,
    'enableJavaScript': True,
    'jsWaitTime': 3,
    'jsTimeout': 30,
    'userAgent': 'MyCustomBot/1.0',
    'includePatterns': '/blog/*\n/products/*',
    'excludePatterns': '/admin/*',
    'issueExclusionPatterns': '/login*\n/checkout/*'
})

# Crawler-ready configuration
crawler_config = settings_manager.get_crawler_config()

# Restore defaults
settings_manager.reset_settings()
```

--------------------------------

### GET /api/user/info

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Retrieves information about the currently authenticated user.

```APIDOC
## GET /api/user/info

### Description
Returns details about the current user, including crawl limits.

### Method
GET

### Endpoint
http://localhost:5000/api/user/info

### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **user** (object) - User details including id, username, tier, and crawl stats

#### Response Example
{
  "success": true,
  "user": {
    "id": 1,
    "username": "newuser",
    "tier": "user",
    "crawls_today": 2,
    "crawls_remaining": -1
  }
}
```

--------------------------------

### GET /api/visualization_data

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Retrieves graph data for site structure visualization.

```APIDOC
## GET /api/visualization_data

### Description
Fetches nodes and edges representing the crawled site structure.

### Method
GET

### Endpoint
http://localhost:5000/api/visualization_data

### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **nodes** (array) - List of site nodes
- **edges** (array) - List of connections between nodes
- **total_pages** (integer) - Total pages found

#### Response Example
{
  "success": true,
  "nodes": [{"data": {"id": "node-0", "url": "https://example.com", "status_code": 200, "title": "Home"}}],
  "edges": [{"data": {"id": "edge-0-1", "source": "node-0", "target": "node-1"}}],
  "total_pages": 500
}
```

--------------------------------

### GET /links

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Retrieves all discovered links from the link manager, including source, target, anchor text, internal status, and placement.

```APIDOC
## GET /links

### Description
Retrieves all discovered links from the link manager.

### Method
GET

### Response
#### Success Response (200)
- **source_url** (string) - The URL where the link was found
- **target_url** (string) - The destination URL
- **anchor_text** (string) - The text content of the link
- **is_internal** (boolean) - Whether the link is internal to the domain
- **placement** (string) - The location of the link (e.g., navigation, body, footer)
```

--------------------------------

### Calculate Crawl Duration

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/dashboard.html

Calculates the duration of a crawl from its start and end times. The result is formatted as `Xh Ym` for crawls of an hour or more, `Ym Zs` for crawls of a minute or more, and `Zs` otherwise.

```javascript
function calculateDuration(start, end) {
  const startTime = new Date(start);
  const endTime = new Date(end);
  const diff = endTime - startTime; // milliseconds

  const hours = Math.floor(diff / 3600000);
  const minutes = Math.floor((diff % 3600000) / 60000);
  const seconds = Math.floor((diff % 60000) / 1000);

  if (hours > 0) return `${hours}h ${minutes}m`;
  if (minutes > 0) return `${minutes}m ${seconds}s`;
  return `${seconds}s`;
}
```

--------------------------------

### Get crawl status

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Retrieves real-time progress and statistics. Supports incremental updates using query parameters to filter by index.
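For clients that poll outside the browser, the incremental update can be sketched in Python. The `StatusCursor` class and its method names are hypothetical. The sketch assumes that each `*_since` value is the count of items already received and that the response lists contain only new items; the source confirms neither assumption.

```python
from urllib.parse import urlencode


class StatusCursor:
    """Hypothetical helper that tracks how many items a client has already seen."""

    def __init__(self):
        self.url_since = 0
        self.link_since = 0
        self.issue_since = 0

    def query_url(self, base="http://localhost:5000/api/crawl_status"):
        # Build the status URL with the current indices as query parameters.
        params = {
            "url_since": self.url_since,
            "link_since": self.link_since,
            "issue_since": self.issue_since,
        }
        return f"{base}?{urlencode(params)}"

    def advance(self, status):
        # Assumption: each list in the response holds only items newer than the index sent.
        self.url_since += len(status.get("urls", []))
        self.link_since += len(status.get("links", []))
        self.issue_since += len(status.get("issues", []))


cursor = StatusCursor()
cursor.advance({"urls": ["a", "b"], "links": ["x"], "issues": []})
next_url = cursor.query_url()
```

After each response, pass the parsed JSON to `advance` and request `query_url()` on the next poll.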
```bash
# Get full status
curl http://localhost:5000/api/crawl_status \
  -H "Cookie: session=your_session_cookie"

# Incremental update (fetch only new data since index)
curl "http://localhost:5000/api/crawl_status?url_since=50&link_since=100&issue_since=10" \
  -H "Cookie: session=your_session_cookie"

# Response
{
  "status": "running",
  "stats": {
    "discovered": 150,
    "crawled": 75,
    "depth": 3,
    "speed": 2.5
  },
  "urls": [...],
  "links": [...],
  "issues": [...],
  "progress": 50.0,
  "is_running_pagespeed": false,
  "memory": {"current_mb": 256, "peak_mb": 300}
}
```

--------------------------------

### Configure Application Running Modes

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Execute main.py with various flags to toggle authentication, registration, guest access, and demo constraints.

```bash
# Standard mode (with authentication)
python main.py

# Local mode (no auth, all users get admin tier)
python main.py --local
# or
python main.py -l

# Disable new registrations
python main.py --disable-register
# or
python main.py -dr

# Disable guest access
python main.py --disable-guest
# or
python main.py -dg

# Demo mode (1.5GB memory limit per user)
python main.py --demo
# or
python main.py -dm

# Combined flags
python main.py --local --disable-guest --demo
```

--------------------------------

### Deploy via Docker

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Prepare the environment configuration file before launching the containerized service.

```bash
# Docker deployment
cp .env.example .env

# Edit .env for production
# LOCAL_MODE=false
# HOST_BINDING=0.0.0.0
# REGISTRATION_DISABLED=false

docker-compose up -d
```

--------------------------------

### Configure Environment Variables

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Settings for the .env file to toggle between local mode and production deployment.
```bash
# .env file (local mode)
LOCAL_MODE=true
HOST_BINDING=127.0.0.1
REGISTRATION_DISABLED=false
```

```bash
# .env file (production deployment)
LOCAL_MODE=false
HOST_BINDING=0.0.0.0
REGISTRATION_DISABLED=false
```

--------------------------------

### Manage User Authentication via API

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Perform user registration, login, guest login, and account information retrieval.

```bash
curl -X POST http://localhost:5000/api/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "newuser",
    "email": "user@example.com",
    "password": "securepassword123"
  }'
```

```bash
curl -X POST http://localhost:5000/api/login \
  -H "Content-Type: application/json" \
  -d '{
    "username": "newuser",
    "password": "securepassword123"
  }'
```

```bash
curl -X POST http://localhost:5000/api/guest-login
```

```bash
curl http://localhost:5000/api/user/info \
  -H "Cookie: session=your_session_cookie"
```

--------------------------------

### GET /api/crawl_status

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Returns real-time crawl progress including discovered URLs, links, issues, and statistics.

```APIDOC
## GET /api/crawl_status

### Description
Returns real-time crawl progress including discovered URLs, links, issues, and statistics. Supports incremental polling to fetch only new data.

### Method
GET

### Endpoint
/api/crawl_status

### Parameters
#### Query Parameters
- **url_since** (integer) - Optional - Fetch URLs since this index
- **link_since** (integer) - Optional - Fetch links since this index
- **issue_since** (integer) - Optional - Fetch issues since this index

### Response
#### Success Response (200)
- **status** (string) - Current status of the crawl
- **stats** (object) - Crawl statistics
- **progress** (float) - Percentage completion

#### Response Example
{
  "status": "running",
  "stats": {
    "discovered": 150,
    "crawled": 75,
    "depth": 3,
    "speed": 2.5
  },
  "progress": 50.0
}
```

--------------------------------

### Initial Load and Interval Refresh

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/dashboard.html

Loads the statistics and the crawl list once on page load, then refreshes both every 30 seconds.

```javascript
// Initial load
loadStats();
loadCrawls();

// Refresh every 30 seconds
setInterval(() => {
  loadStats();
  loadCrawls();
}, 30000);
```

--------------------------------

### POST /api/start_crawl

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Initiates a new crawl from a given URL, respecting robots.txt and discovering sitemaps.

```APIDOC
## POST /api/start_crawl

### Description
Initiates a new crawl from a given URL. The crawler automatically discovers sitemaps, respects robots.txt, and extracts comprehensive SEO data from each page.

### Method
POST

### Endpoint
/api/start_crawl

### Request Body
- **url** (string) - Required - The URL to start the crawl from.

### Request Example
{
  "url": "https://example.com"
}

### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **message** (string) - Confirmation message
- **crawl_id** (integer) - Unique identifier for the crawl

#### Response Example
{
  "success": true,
  "message": "Crawl started successfully",
  "crawl_id": 123
}
```

--------------------------------

### Styling Guidelines

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Information on using CSS classes provided by LibreCrawl for consistent UI styling within plugins.

```APIDOC
## Plugin Styling

### Description
LibreCrawl provides a set of CSS classes to ensure your plugin's UI matches the application's design. Apply these classes to your HTML elements.

### Available CSS Classes
- **`.plugin-content`**: The main container for your plugin's UI. Apply padding and scrolling styles here.
- **`.plugin-header`**: Use for header sections within your plugin's content.
- **`.data-table`**: Automatically styles HTML tables to match LibreCrawl's appearance.
- **`.stat-card`**: Styles elements used for displaying statistics.
- **`.score-good`**, **`.score-needs-improvement`**, **`.score-poor`**: Classes for indicating different levels of quality or status.

### Scrolling Implementation
To ensure proper scrolling within the tab pane, wrap your plugin's content in a div with the following styles:

```html
```

The `max-height` calculation ensures the content area respects the overall UI layout and provides a scrollable experience when content exceeds the available space.
```

--------------------------------

### POST /crawls

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Creates a new crawl record in the database with initial configuration.

```APIDOC
## POST /crawls

### Description
Creates a new crawl record in the database with initial configuration.

### Method
POST

### Request Body
- **user_id** (integer) - Required - ID of the user
- **session_id** (string) - Required - Unique session identifier
- **base_url** (string) - Required - Starting URL
- **base_domain** (string) - Required - Base domain
- **config_snapshot** (object) - Required - Snapshot of the configuration

### Response
#### Success Response (200)
- **crawl_id** (integer) - The ID of the created crawl
```

--------------------------------

### Extract SEO Data from HTML

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Use SEOExtractor to parse HTML and extract various SEO-related data points. Ensure BeautifulSoup is installed and the HTML content is properly parsed.

```python
from bs4 import BeautifulSoup
from src.core.seo_extractor import SEOExtractor

# Minimal sample page; the tags and attributes (including the image src)
# are assumed from the expected output at the end of this example
html = """
<html lang="en">
<head>
  <title>Example Page Title</title>
  <link rel="canonical" href="https://example.com/page">
</head>
<body>
  <h1>Main Heading</h1>
  <h2>Subheading 1</h2>
  <img src="/image.jpg" alt="Alt text">
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
extractor = SEOExtractor()

# Initialize result structure
result = {
    'url': 'https://example.com/page',
    'meta_tags': {},
    'og_tags': {},
    'twitter_tags': {},
    'json_ld': [],
    'images': [],
    'hreflang': [],
    'schema_org': [],
    'analytics': {'google_analytics': False, 'gtag': False, 'ga4_id': '', 'gtm_id': ''},
    'internal_links': 0,
    'external_links': 0
}

# Extract all SEO data
extractor.extract_basic_seo_data(soup, result)
extractor.extract_meta_tags(soup, result)
extractor.extract_opengraph_tags(soup, result)
extractor.extract_twitter_tags(soup, result)
extractor.extract_json_ld(soup, result)
extractor.extract_images(soup, result['url'], result)
extractor.extract_link_counts(soup, result, 'example.com')

print(f"Title: {result['title']}")              # "Example Page Title"
print(f"H1: {result['h1']}")                    # "Main Heading"
print(f"Canonical: {result['canonical_url']}")  # "https://example.com/page"
print(f"Language: {result['lang']}")            # "en"
```

--------------------------------

### Guest Login Functionality with JavaScript

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/login.html

Handles the guest login button click event. It disables the button, shows a loading state, and makes a POST request to the /api/guest-login endpoint. It reports the outcome through the page's alert element and redirects to the home page on success.
```javascript
// Guest login
const guestBtn = document.getElementById('guestBtn');
if (guestBtn) guestBtn.addEventListener('click', async () => {
  hideAlert();
  guestBtn.disabled = true;
  guestBtn.textContent = 'Entering as guest...';

  try {
    const response = await fetch('/api/guest-login', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }
    });
    const data = await response.json();

    if (data.success) {
      showAlert('Entering as guest...', 'success');
      setTimeout(() => { window.location.href = '/'; }, 500);
    } else {
      showAlert(data.message || 'Failed to enter as guest', 'error');
      guestBtn.disabled = false;
      guestBtn.textContent = 'Continue as Guest (3 crawls/24h)';
    }
  } catch (error) {
    showAlert('An error occurred. Please try again.', 'error');
    guestBtn.disabled = false;
    guestBtn.textContent = 'Continue as Guest (3 crawls/24h)';
  }
});
```

--------------------------------

### Retrieve Visualization Data via API

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Fetch graph data representing the site structure for visualization purposes.

```bash
curl http://localhost:5000/api/visualization_data \
  -H "Cookie: session=your_session_cookie"
```

--------------------------------

### Configuration Settings

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Details on the various settings that can be configured within LibreCrawl's UI.

```APIDOC
## LibreCrawl Configuration Settings

### Description
LibreCrawl allows extensive customization through its Settings interface, covering crawler behavior, request handling, rendering, filtering, and more.

### Configurable Areas
- **Crawler Settings**: Control crawl depth, the URL limit (up to 5 million URLs), delays between requests, and handling of external links.
- **Request Settings**: Configure the User-Agent string, request timeouts, proxy settings, and adherence to `robots.txt`.
- **JavaScript Rendering**: Adjust settings for the browser engine, wait times after page load, and viewport size for rendering dynamic content.
- **Filters**: Define patterns for URL inclusion or exclusion, and specify file types to ignore during crawling.
- **Export Options**: Select the desired formats (CSV, JSON, XML) and specify which data fields to include in exports.
- **Custom CSS**: Personalize the LibreCrawl user interface by applying custom CSS rules.
- **Issue Exclusion**: Set up patterns to exclude specific SEO issues from being reported.

### PageSpeed Analysis
To raise the PageSpeed analysis rate limit from the limited default to 25,000 requests per day, add a Google API key in the `Settings > Requests` section.
```

--------------------------------

### POST /api/login

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Authenticates a user and returns a session.

```APIDOC
## POST /api/login

### Description
Authenticates a user with username and password.

### Method
POST

### Endpoint
http://localhost:5000/api/login

### Request Body
- **username** (string) - Required - Username
- **password** (string) - Required - Password

### Request Example
{
  "username": "newuser",
  "password": "securepassword123"
}
```

--------------------------------

### POST /api/register

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Registers a new user account.

```APIDOC
## POST /api/register

### Description
Creates a new user account.

### Method
POST

### Endpoint
http://localhost:5000/api/register

### Request Body
- **username** (string) - Required - Unique username
- **email** (string) - Required - User email address
- **password** (string) - Required - User password

### Request Example
{
  "username": "newuser",
  "email": "user@example.com",
  "password": "securepassword123"
}
```

--------------------------------

### Running Modes

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Explanation of the different modes LibreCrawl can run in: Standard and Local.

```APIDOC
## LibreCrawl Running Modes

### Description
LibreCrawl offers two primary running modes, each with different access control and limitations, suitable for various use cases.

### Standard Mode
- **Default mode**.
- Features a full authentication system (login/register).
- Implements tier-based access control (Guest, User, Extra, Admin).
- **Guest users** are limited to 3 crawls per 24 hours, enforced via IP.
- Suitable for public demonstrations or shared environments.

### Local Mode (`--local` or `-l` flag)
- **Activated by** running LibreCrawl with the `--local` or `-l` command-line flag.
- **All users** are automatically granted admin-tier access.
- **No rate limits** or tier-based restrictions are enforced.
- Ideal for personal use, local development, and self-hosted instances where strict access control is not required.
- **Recommended** for testing and development purposes.
```

--------------------------------

### Lifecycle Hooks and Utilities

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Overview of the available lifecycle hooks and utility functions for plugin development.

```APIDOC
## Plugin Lifecycle Hooks and Utilities

### Description
Details the functions that LibreCrawl calls at various points in its operation, and the utility functions available to plugins.

### Lifecycle Hooks
- **`onLoad()`**: Called once when the plugin is initially loaded by LibreCrawl.
- **`onTabActivate(container, data)`**: Triggered when the user navigates to the plugin's tab. Receives the tab's DOM container and crawl data.
- **`onTabDeactivate()`**: Called when the user switches away from the plugin's tab.
- **`onDataUpdate(data)`**: Executed during live crawls whenever new crawl data becomes available. Useful for real-time UI updates.
- **`onCrawlComplete(data)`**: Called after a crawl has finished, providing the final crawl data.

### Utilities
Access built-in helper functions through `this.utils`:
- **`this.utils.showNotification(message, type)`**: Displays a notification to the user. `type` can be 'success', 'error', or 'info'.
- **`this.utils.formatUrl(url)`**: Formats a given URL according to LibreCrawl's standards.
- **`this.utils.escapeHtml(text)`**: Escapes HTML special characters in a string to prevent XSS attacks.
```

--------------------------------

### Plugin Configuration Reference

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Details on the configuration object required for registering a LibreCrawl plugin.

```APIDOC
## Plugin Configuration Object

### Description
Defines the structure and properties for configuring a LibreCrawl plugin.

### Fields
- **id** (string) - Required - Unique identifier for the plugin. Used for internal referencing and tab identification.
- **name** (string) - Required - The display name of the plugin, shown in the UI.
- **version** (string) - Optional - The version number of the plugin.
- **author** (string) - Optional - The name of the plugin's author.
- **description** (string) - Optional - A brief description of the plugin's functionality.
- **tab** (object) - Required - Configuration for how the plugin appears as a tab.
  - **label** (string) - Required - The text displayed on the tab button.
  - **icon** (string) - Optional - An emoji or icon to display on the tab button.
  - **position** (number) - Optional - Specifies the order of the tab. Defaults to appending to the end.
```

--------------------------------

### Manage Crawler Settings via API

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Retrieve or update crawler configuration settings using the get_settings and save_settings endpoints.

```bash
curl http://localhost:5000/api/get_settings \
  -H "Cookie: session=your_session_cookie"
```

```bash
curl -X POST http://localhost:5000/api/save_settings \
  -H "Content-Type: application/json" \
  -H "Cookie: session=your_session_cookie" \
  -d '{
    "maxDepth": 5,
    "maxUrls": 10000,
    "crawlDelay": 0.5,
    "enableJavaScript": true,
    "jsWaitTime": 3,
    "includePatterns": "/blog/*\n/products/*",
    "excludePatterns": "/admin/*\n/cart/*"
  }'
```

--------------------------------

### Multi-tenancy and Session Management

Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md

Explanation of how LibreCrawl handles multiple users and manages their sessions.

```APIDOC
## LibreCrawl Multi-tenancy and Sessions

### Description
LibreCrawl is designed to support multiple concurrent users, providing isolated environments and managing session data effectively.

### Key Features
- **Isolated Instances**: Each browser session operates with its own independent crawler instance and crawl data.
- **Persistent Settings**: User-specific settings are stored in the browser's `localStorage`, ensuring persistence across restarts.
- **Per-Browser Themes**: Custom CSS themes applied by users are specific to their browser.
- **Session Expiration**: User sessions automatically expire after 1 hour of inactivity.
- **Data Isolation**: Crawl data is kept separate between different users/sessions.
```

--------------------------------

### Register a Custom Analysis Plugin

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Use LibreCrawlPlugin.register to define a new UI tab with custom analysis logic. Ensure the file is saved in web/static/plugins/.
```javascript
// Save as web/static/plugins/my-plugin.js
LibreCrawlPlugin.register({
  // Required configuration
  id: 'word-count-analyzer',
  name: 'Word Count Analyzer',
  tab: { label: 'Word Count', icon: '📊', position: 'end' },

  // Optional metadata
  version: '1.0.0',
  author: 'Your Name',

  // Called when plugin loads
  onLoad() {
    console.log('Word Count Analyzer loaded');
    this.thresholds = { thin: 300, good: 1000 };
  },

  // Called when tab is activated
  onTabActivate(container, data) {
    const analysis = this.analyzeWordCounts(data.urls);
    container.innerHTML = `

Word Count Analysis

Total Pages: ${data.urls.length}

Thin Content (< 300 words): ${analysis.thin}

Good Content (300-1000 words): ${analysis.moderate}

Rich Content (> 1000 words): ${analysis.rich}

Average Words: ${analysis.average.toFixed(0)}

Pages by Word Count

${analysis.pages.map(p => ` `).join('')}
URLWordsStatus
${this.utils.escapeHtml(p.url)} ${p.word_count} ${p.statusLabel}
`;
  },

  // Called during live crawls
  onDataUpdate(data) {
    if (this.isActive && this.container) {
      this.onTabActivate(this.container, data);
    }
  },

  // Custom analysis method
  analyzeWordCounts(urls) {
    let thin = 0, moderate = 0, rich = 0, total = 0;
    const pages = [];

    urls.forEach(url => {
      const wc = url.word_count || 0;
      total += wc;

      let status, statusLabel;
      if (wc < this.thresholds.thin) {
        thin++;
        status = 'score-poor';
        statusLabel = 'Thin';
      } else if (wc < this.thresholds.good) {
        moderate++;
        status = 'score-needs-improvement';
        statusLabel = 'Moderate';
      } else {
        rich++;
        status = 'score-good';
        statusLabel = 'Rich';
      }

      pages.push({ url: url.url, word_count: wc, status, statusLabel });
    });

    // Lowest word counts first; only the first 50 pages are returned
    pages.sort((a, b) => a.word_count - b.word_count);

    return {
      thin,
      moderate,
      rich,
      average: urls.length > 0 ? total / urls.length : 0,
      pages: pages.slice(0, 50)
    };
  }
});
```

--------------------------------

### Define Plugin Configuration Object

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/static/plugins/README.md

The configuration object passed to the register method defines metadata and tab appearance.

```javascript
{
  id: string,          // Unique identifier
  name: string,        // Display name
  version: string,     // Optional version
  author: string,      // Optional author
  description: string, // Optional description
  tab: {
    label: string,     // Tab button text
    icon: string,      // Optional emoji/icon
    position: number   // Optional position (default: append to end)
  }
}
```

--------------------------------

### Login Form Handling with JavaScript

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/login.html

Handles user login submission by preventing default form behavior, showing/hiding alerts, and making an asynchronous POST request to the /api/login endpoint. It updates the UI based on the API response and redirects on success.
```javascript
const form = document.getElementById('loginForm');
const loginBtn = document.getElementById('loginBtn');
const alert = document.getElementById('alert');

function showAlert(message, type) {
  alert.textContent = message;
  alert.className = `alert alert-${type} show`;
}

function hideAlert() {
  alert.className = 'alert';
}

form.addEventListener('submit', async (e) => {
  e.preventDefault();
  hideAlert();

  const username = document.getElementById('username').value;
  const password = document.getElementById('password').value;

  loginBtn.disabled = true;
  loginBtn.textContent = 'Logging in...';

  try {
    const response = await fetch('/api/login', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ username, password })
    });
    const data = await response.json();

    if (data.success) {
      showAlert(data.message, 'success');
      setTimeout(() => { window.location.href = '/'; }, 500);
    } else {
      showAlert(data.message, 'error');
      loginBtn.disabled = false;
      loginBtn.textContent = 'Login';
    }
  } catch (error) {
    showAlert('An error occurred. Please try again.', 'error');
    loginBtn.disabled = false;
    loginBtn.textContent = 'Login';
  }
});
```

--------------------------------

### Manage Link Discovery and Tracking

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

The LinkManager class helps in discovering and tracking URLs. Initialize it with the base domain. Use `collect_all_links` to gather links from HTML content, providing the parsed HTML, source URL, and a list to store results.
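As a rough illustration of the `is_internal` distinction that link records carry, the sketch below resolves an `href` against its source page and compares hosts using only the standard library. It is not `LinkManager`'s implementation: `classify_link` is a hypothetical helper, and treating subdomains of the base domain as internal is an assumption.

```python
from urllib.parse import urljoin, urlparse


def classify_link(href, source_url, base_domain):
    """Hypothetical helper: resolve href and flag whether it stays on base_domain."""
    # Relative hrefs are resolved against the page they were found on.
    target_url = urljoin(source_url, href)
    host = (urlparse(target_url).hostname or "").lower()
    base = base_domain.lower()
    # Assumption: the base domain and any of its subdomains count as internal.
    is_internal = host == base or host.endswith("." + base)
    return {
        "source_url": source_url,
        "target_url": target_url,
        "is_internal": is_internal,
    }


internal = classify_link("/about", "https://example.com", "example.com")
external = classify_link("https://example.org/", "https://example.com", "example.com")
```

A stricter policy, such as exact host matching only, needs a change to the single `is_internal` expression.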
```python
from bs4 import BeautifulSoup
from src.core.link_manager import LinkManager

# Initialize with base domain
link_manager = LinkManager('example.com')

# Sample markup; the href values are assumed for illustration
html = """
<a href="/about">About Us</a>
<a href="https://example.org/">External Link</a>
"""

soup = BeautifulSoup(html, 'html.parser')
source_url = 'https://example.com'
crawl_results = []

# Collect all links for the Links tab
link_manager.collect_all_links(soup, source_url, crawl_results)
```

--------------------------------

### Initialize and Control WebCrawler

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Configure and manage the lifecycle of a crawl operation using the WebCrawler class.

```python
from src.crawler import WebCrawler

# Create crawler instance
crawler = WebCrawler()

# Configure crawler
crawler.update_config({
    'max_depth': 3,
    'max_urls': 1000,
    'delay': 0.5,
    'user_agent': 'MyBot/1.0',
    'timeout': 10,
    'enable_javascript': False,
    'respect_robots': True,
    'crawl_external': False,
    'include_extensions': ['html', 'htm', 'php'],
    'exclude_patterns': ['/admin/*', '/cart/*']
})

# Start crawl (runs in background thread)
success, message = crawler.start_crawl(
    'https://example.com',
    user_id=1,
    session_id='unique-session-id'
)
print(f"Started: {success}, {message}")

# Get status during crawl
status = crawler.get_status()
print(f"Crawled: {status['stats']['crawled']}/{status['stats']['discovered']}")
print(f"Issues found: {len(status['issues'])}")

# Stop crawl
crawler.stop_crawl()
```

--------------------------------

### Plugin Lifecycle Hooks

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/static/plugins/README.md

Overview of the lifecycle methods available for plugin management during the crawl process.

```APIDOC
## Lifecycle Hooks

### Description
Methods that allow plugins to respond to application events and crawl states.

### Hooks
- **onLoad()** - Called when the plugin is initially loaded.
- **onTabActivate(container, data)** - Called when the user switches to the plugin tab.
- **onTabDeactivate()** - Called when the user switches away from the plugin tab.
- **onDataUpdate(data)** - Called during live crawls when new data is available.
- **onCrawlComplete(data)** - Called when the crawl process finishes.
```

--------------------------------

### Create and Manage Crawls

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Functions for creating new crawl records, saving batches of crawled data (URLs, links, issues), updating crawl statistics, setting crawl status, loading existing crawl data, and deleting crawls.

```python
from src.crawl_db import (
    create_crawl, set_crawl_status, update_crawl_stats,
    save_url_batch, save_links_batch, save_issues_batch,
    get_crawl_by_id, get_user_crawls,
    load_crawled_urls, load_crawl_links, load_crawl_issues,
    delete_crawl
)

# Create a new crawl record
crawl_id = create_crawl(
    user_id=1,
    session_id='unique-session',
    base_url='https://example.com',
    base_domain='example.com',
    config_snapshot={'max_depth': 3, 'max_urls': 1000}
)

# Save crawled data in batches
urls = [
    {'url': 'https://example.com', 'status_code': 200, 'title': 'Home', ...},
    {'url': 'https://example.com/about', 'status_code': 200, 'title': 'About', ...}
]
save_url_batch(crawl_id, urls)

links = [
    {'source_url': 'https://example.com', 'target_url': 'https://example.com/about',
     'anchor_text': 'About Us', 'is_internal': True}
]
save_links_batch(crawl_id, links)

issues = [
    {'url': 'https://example.com/about', 'type': 'warning', 'category': 'SEO',
     'issue': 'Title Too Short', 'details': '...'}
]
save_issues_batch(crawl_id, issues)

# Update statistics and status
update_crawl_stats(crawl_id, discovered=100, crawled=50, max_depth=3)
set_crawl_status(crawl_id, 'completed')  # running, paused, completed, failed, stopped

# Load existing crawl data
crawl = get_crawl_by_id(crawl_id)
urls = load_crawled_urls(crawl_id)
links = load_crawl_links(crawl_id)
issues = load_crawl_issues(crawl_id)
user_crawls = get_user_crawls(user_id=1, limit=50, status_filter='completed')

# Delete the crawl
delete_crawl(crawl_id)
```

--------------------------------

### Load Crawl List and Render Table

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/dashboard.html

Fetches the list of crawls from the API and dynamically generates table rows for display. Handles the cases where no crawls exist or fetching fails. Includes buttons for actions such as resume, load, and delete.

```javascript
async function loadCrawls() {
    try {
        const response = await fetch('/api/crawls/list');
        const data = await response.json();
        const tbody = document.getElementById('crawls-tbody');

        if (data.success && data.crawls.length > 0) {
            tbody.innerHTML = data.crawls.map(crawl => {
                const started = new Date(crawl.started_at).toLocaleString();
                const duration = crawl.completed_at
                    ? calculateDuration(crawl.started_at, crawl.completed_at)
                    : 'In progress';
                return `
                    ${started}
                    ${crawl.base_url}
                    ${crawl.status}
                    ${crawl.urls_crawled || 0}
                    ${duration}
                    ${(crawl.status === 'paused' || crawl.status === 'failed') && crawl.urls_crawled > 0 ? `` : ''}
                    ${crawl.urls_crawled > 0 ? `` : ''}
                `;
            }).join('');
        } else {
            tbody.innerHTML = `

No crawls found

Start a new crawl from the main application

            `;
        }
    } catch (error) {
        console.error('Error loading crawls:', error);
        document.getElementById('crawls-tbody').innerHTML = `

Error loading crawls

${error.message}

        `;
    }
}
```

--------------------------------

### POST /settings

Source: https://context7.com/phialsbasement/librecrawl/llms.txt

Updates crawler settings for a specific user, subject to tier-based permission filtering.

```APIDOC
## POST /settings

### Description
Updates crawler settings for a specific user, subject to tier-based permission filtering.

### Method
POST

### Request Body
- **maxDepth** (integer) - Optional - Maximum crawl depth
- **maxUrls** (integer) - Optional - Maximum number of URLs to crawl
- **crawlDelay** (float) - Optional - Delay between requests
- **enableJavaScript** (boolean) - Optional - Whether to enable JS rendering
- **jsWaitTime** (integer) - Optional - Time to wait for JS
- **jsTimeout** (integer) - Optional - Timeout for JS rendering
- **userAgent** (string) - Optional - User agent string
- **includePatterns** (string) - Optional - Patterns to include
- **excludePatterns** (string) - Optional - Patterns to exclude
- **issueExclusionPatterns** (string) - Optional - Patterns to exclude for issues

### Response
#### Success Response (200)
- **success** (boolean) - Whether the update was successful
- **message** (string) - Status message
```

--------------------------------

### Set Plugin Container HTML

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/static/plugins/README.md

Use the provided container element to inject your UI, ensuring proper styling and scrolling behavior.

```javascript
container.innerHTML = `
`;
```

--------------------------------

### Access Plugin Utilities

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/static/plugins/README.md

Built-in helper functions are available via the this.utils context within your plugin.

```javascript
this.utils.showNotification(message, type) // 'success', 'error', 'info'
this.utils.formatUrl(url)
this.utils.escapeHtml(text)
```

--------------------------------

### Load Crawl Action

Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/dashboard.html

Sends a POST request to load a specific crawl. Prompts the user for confirmation, warning about potential data loss. On success, it sets session storage items and redirects to the main page.

```javascript
async function viewCrawl(crawlId) {
    if (!confirm('Load this crawl? Any unsaved current data will be lost.')) return;

    try {
        const response = await fetch(`/api/crawls/${crawlId}/load`, { method: 'POST' });
        const data = await response.json();

        if (data.success) {
            sessionStorage.setItem('force_ui_refresh', 'true');
            sessionStorage.setItem('loaded_urls', data.urls_count);
            sessionStorage.setItem('loaded_links', data.links_count);
            sessionStorage.setItem('loaded_issues', data.issues_count);
            window.location.href = '/';
        } else {
            alert('Error: ' + (data.error || data.message));
        }
    } catch (error) {
        alert('Error loading crawl: ' + error.message);
    }
}
```
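
--------------------------------

### Read the Loaded-Crawl Flags After Redirect (Illustrative Sketch)

`viewCrawl` writes four keys to session storage before redirecting to `/`, but the snippet above does not show how the main page reads them. The following is a minimal sketch of one way a page could consume those keys. It assumes only the key names seen in `viewCrawl`; the helper names `consumeLoadedCrawlSummary` and `createMemoryStorage` are hypothetical and not part of LibreCrawl.

```javascript
// Hypothetical helper: the key names match those written by viewCrawl(),
// but this function itself is an illustration, not LibreCrawl code.
function consumeLoadedCrawlSummary(storage) {
    // viewCrawl() stores the string 'true' under this key before redirecting.
    if (storage.getItem('force_ui_refresh') !== 'true') {
        return null;
    }

    // Storage values are always strings, so convert the counts back to numbers.
    // Missing or non-numeric values fall back to 0.
    const toCount = (key) => {
        const value = Number.parseInt(storage.getItem(key), 10);
        return Number.isNaN(value) ? 0 : value;
    };

    const summary = {
        urls: toCount('loaded_urls'),
        links: toCount('loaded_links'),
        issues: toCount('loaded_issues')
    };

    // Remove the keys so a later page reload does not repeat the refresh.
    for (const key of ['force_ui_refresh', 'loaded_urls', 'loaded_links', 'loaded_issues']) {
        storage.removeItem(key);
    }

    return summary;
}

// Minimal in-memory stand-in for sessionStorage, so the sketch also runs outside a browser.
function createMemoryStorage() {
    const items = new Map();
    return {
        getItem: (key) => (items.has(key) ? items.get(key) : null),
        setItem: (key, value) => { items.set(key, String(value)); },
        removeItem: (key) => { items.delete(key); }
    };
}
```

In a browser, the helper would be called once on page load as `consumeLoadedCrawlSummary(window.sessionStorage)`. A `null` result means no crawl was just loaded, so the page can skip the forced refresh.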