### Plugin Registration Example
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
An example of how to register a new plugin with LibreCrawl, including essential configuration like ID, name, and tab settings.
```APIDOC
## POST /api/plugins/register
### Description
Registers a new custom plugin with LibreCrawl.
### Method
POST
### Endpoint
/api/plugins/register
### Request Body
- **id** (string) - Required - Unique ID for the plugin.
- **name** (string) - Required - Display name of the plugin.
- **tab** (object) - Required - Configuration for the plugin's tab in the UI.
  - **label** (string) - Required - Text for the tab button.
  - **icon** (string) - Optional - Emoji or icon for the tab.
  - **position** (number) - Optional - Position of the tab.
### Request Example
```json
{
  "id": "my-plugin",
  "name": "My Plugin",
  "tab": {
    "label": "My Tab",
    "icon": "🔥"
  }
}
```
### Response
#### Success Response (200)
- **message** (string) - Confirmation message.
#### Response Example
```json
{
  "message": "Plugin registered successfully."
}
```
```
--------------------------------
### Install Python Dependencies
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Commands to install required Python packages and Playwright browser binaries.
```bash
pip install -r requirements.txt
```
```bash
playwright install chromium
```
--------------------------------
### Deploy with Docker
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Commands to clone the repository and start the application using Docker Compose.
```bash
# Clone the repository
git clone https://github.com/PhialsBasement/LibreCrawl.git
cd LibreCrawl

# Copy environment file
cp .env.example .env

# Start LibreCrawl
docker-compose up -d

# Open browser to http://localhost:5000
```
--------------------------------
### Run Application via Python
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Commands to start the application in standard or local mode.
```bash
# Standard mode (with authentication and tier system)
python main.py

# Local mode (all users get admin tier, no rate limits)
python main.py --local
# or
python main.py -l
```
--------------------------------
### Run Automatic Startup Scripts
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Use these scripts to automatically check for dependencies, install requirements, and launch the application.
```batch
start-librecrawl.bat
```
```bash
chmod +x start-librecrawl.sh
./start-librecrawl.sh
```
--------------------------------
### Start a new crawl
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Initiates a crawl for a specified URL. Requires a JSON payload containing the target URL.
```bash
# Start a crawl
curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -H "Cookie: session=your_session_cookie" \
  -d '{"url": "https://example.com"}'

# Response
{
  "success": true,
  "message": "Crawl started successfully",
  "crawl_id": 123
}
```
--------------------------------
### GET /api/get_settings
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Retrieves the current crawler configuration settings.
```APIDOC
## GET /api/get_settings
### Description
Fetches the current configuration settings for the crawler.
### Method
GET
### Endpoint
http://localhost:5000/api/get_settings
### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **settings** (object) - The current crawler configuration
#### Response Example
{
  "success": true,
  "settings": {
    "maxDepth": 3,
    "maxUrls": 5000000,
    "crawlDelay": 1,
    "enableJavaScript": false,
    "userAgent": "LibreCrawl/1.0 (Web Crawler)",
    "respectRobotsTxt": true,
    "discoverSitemaps": true
  }
}
```
--------------------------------
### Configure Crawler Settings
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Initialize the SettingsManager for a specific user and tier to get and update crawler settings. Settings are filtered by tier permissions. Obtain crawler-ready configuration and reset settings to defaults.
```python
from src.settings_manager import SettingsManager
settings_manager = SettingsManager(
    session_id='unique-session',
    user_id=1,
    tier='admin'  # guest, user, extra, admin
)
settings = settings_manager.get_settings()
print(f"Max Depth: {settings['maxDepth']}")
print(f"Max URLs: {settings['maxUrls']}")
print(f"JavaScript Enabled: {settings['enableJavaScript']}")
success, message = settings_manager.save_settings({
    'maxDepth': 5,
    'maxUrls': 10000,
    'crawlDelay': 0.5,
    'enableJavaScript': True,
    'jsWaitTime': 3,
    'jsTimeout': 30,
    'userAgent': 'MyCustomBot/1.0',
    'includePatterns': '/blog/*\n/products/*',
    'excludePatterns': '/admin/*',
    'issueExclusionPatterns': '/login*\n/checkout/*'
})
crawler_config = settings_manager.get_crawler_config()
settings_manager.reset_settings()
```
--------------------------------
### GET /api/user/info
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Retrieves information about the currently authenticated user.
```APIDOC
## GET /api/user/info
### Description
Returns details about the current user, including crawl limits.
### Method
GET
### Endpoint
http://localhost:5000/api/user/info
### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **user** (object) - User details including id, username, tier, and crawl stats
#### Response Example
{
  "success": true,
  "user": {
    "id": 1,
    "username": "newuser",
    "tier": "user",
    "crawls_today": 2,
    "crawls_remaining": -1
  }
}
```
--------------------------------
### GET /api/visualization_data
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Retrieves graph data for site structure visualization.
```APIDOC
## GET /api/visualization_data
### Description
Fetches nodes and edges representing the crawled site structure.
### Method
GET
### Endpoint
http://localhost:5000/api/visualization_data
### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **nodes** (array) - List of site nodes
- **edges** (array) - List of connections between nodes
- **total_pages** (integer) - Total pages found
#### Response Example
{
  "success": true,
  "nodes": [{"data": {"id": "node-0", "url": "https://example.com", "status_code": 200, "title": "Home"}}],
  "edges": [{"data": {"id": "edge-0-1", "source": "node-0", "target": "node-1"}}],
  "total_pages": 500
}
```
--------------------------------
### GET /links
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Retrieves all discovered links from the link manager, including source, target, anchor text, internal status, and placement.
```APIDOC
## GET /links
### Description
Retrieves all discovered links from the link manager.
### Method
GET
### Response
#### Success Response (200)
- **source_url** (string) - The URL where the link was found
- **target_url** (string) - The destination URL
- **anchor_text** (string) - The text content of the link
- **is_internal** (boolean) - Whether the link is internal to the domain
- **placement** (string) - The location of the link (e.g., navigation, body, footer)
```
--------------------------------
### Calculate Crawl Duration
Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/dashboard.html
Calculates the duration of a crawl given its start and end times. Formats the output into hours, minutes, and seconds.
```javascript
function calculateDuration(start, end) {
    const startTime = new Date(start);
    const endTime = new Date(end);
    const diff = endTime - startTime;
    const hours = Math.floor(diff / 3600000);
    const minutes = Math.floor((diff % 3600000) / 60000);
    const seconds = Math.floor((diff % 60000) / 1000);
    if (hours > 0) return `${hours}h ${minutes}m`;
    if (minutes > 0) return `${minutes}m ${seconds}s`;
    return `${seconds}s`;
}
```
--------------------------------
### Get crawl status
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Retrieves real-time progress and statistics. Supports incremental updates using query parameters to filter by index.
```bash
# Get full status
curl http://localhost:5000/api/crawl_status \
  -H "Cookie: session=your_session_cookie"

# Incremental update (fetch only new data since index)
curl "http://localhost:5000/api/crawl_status?url_since=50&link_since=100&issue_since=10" \
  -H "Cookie: session=your_session_cookie"

# Response
{
  "status": "running",
  "stats": {
    "discovered": 150,
    "crawled": 75,
    "depth": 3,
    "speed": 2.5
  },
  "urls": [...],
  "links": [...],
  "issues": [...],
  "progress": 50.0,
  "is_running_pagespeed": false,
  "memory": {"current_mb": 256, "peak_mb": 300}
}
```
--------------------------------
### Configure Application Running Modes
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Execute main.py with various flags to toggle authentication, registration, guest access, and demo constraints.
```bash
# Standard mode (with authentication)
python main.py

# Local mode (no auth, all users get admin tier)
python main.py --local
# or
python main.py -l

# Disable new registrations
python main.py --disable-register
# or
python main.py -dr

# Disable guest access
python main.py --disable-guest
# or
python main.py -dg

# Demo mode (1.5GB memory limit per user)
python main.py --demo
# or
python main.py -dm

# Combined flags
python main.py --local --disable-guest --demo
```
--------------------------------
### Deploy via Docker
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Prepare the environment configuration file before launching the containerized service.
```bash
# Docker deployment
cp .env.example .env

# Edit .env for production
# LOCAL_MODE=false
# HOST_BINDING=0.0.0.0
# REGISTRATION_DISABLED=false
docker-compose up -d
```
--------------------------------
### Configure Environment Variables
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Settings for the .env file to toggle between local mode and production deployment.
```bash
# .env file
LOCAL_MODE=true
HOST_BINDING=127.0.0.1
REGISTRATION_DISABLED=false
```
```bash
# .env file
LOCAL_MODE=false
HOST_BINDING=0.0.0.0
REGISTRATION_DISABLED=false
```
--------------------------------
### Manage User Authentication via API
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Perform user registration, login, and account information retrieval.
```bash
curl -X POST http://localhost:5000/api/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "newuser",
    "email": "user@example.com",
    "password": "securepassword123"
  }'
```
```bash
curl -X POST http://localhost:5000/api/login \
  -H "Content-Type: application/json" \
  -d '{
    "username": "newuser",
    "password": "securepassword123"
  }'
```
```bash
curl -X POST http://localhost:5000/api/guest-login
```
```bash
curl http://localhost:5000/api/user/info \
  -H "Cookie: session=your_session_cookie"
```
--------------------------------
### GET /api/crawl_status
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Returns real-time crawl progress including discovered URLs, links, issues, and statistics.
```APIDOC
## GET /api/crawl_status
### Description
Returns real-time crawl progress including discovered URLs, links, issues, and statistics. Supports incremental polling to fetch only new data.
### Method
GET
### Endpoint
/api/crawl_status
### Parameters
#### Query Parameters
- **url_since** (integer) - Optional - Fetch URLs since this index
- **link_since** (integer) - Optional - Fetch links since this index
- **issue_since** (integer) - Optional - Fetch issues since this index
### Response
#### Success Response (200)
- **status** (string) - Current status of the crawl
- **stats** (object) - Crawl statistics
- **progress** (float) - Percentage completion
#### Response Example
{
  "status": "running",
  "stats": {
    "discovered": 150,
    "crawled": 75,
    "depth": 3,
    "speed": 2.5
  },
  "progress": 50.0
}
```
--------------------------------
### Initial Load and Interval Refresh
Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/dashboard.html
Initializes the application by loading statistics and the crawl list on page load. Sets up an interval to periodically refresh both stats and the crawl list every 30 seconds.
```javascript
loadStats();
loadCrawls();
setInterval(() => {
    loadStats();
    loadCrawls();
}, 30000);
```
--------------------------------
### POST /api/start_crawl
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Initiates a new crawl from a given URL, respecting robots.txt and discovering sitemaps.
```APIDOC
## POST /api/start_crawl
### Description
Initiates a new crawl from a given URL. The crawler automatically discovers sitemaps, respects robots.txt, and extracts comprehensive SEO data from each page.
### Method
POST
### Endpoint
/api/start_crawl
### Request Body
- **url** (string) - Required - The URL to start the crawl from.
### Request Example
{
  "url": "https://example.com"
}
### Response
#### Success Response (200)
- **success** (boolean) - Status of the request
- **message** (string) - Confirmation message
- **crawl_id** (integer) - Unique identifier for the crawl
#### Response Example
{
  "success": true,
  "message": "Crawl started successfully",
  "crawl_id": 123
}
```
--------------------------------
### Styling Guidelines
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Information on using CSS classes provided by LibreCrawl for consistent UI styling within plugins.
```APIDOC
## Plugin Styling
### Description
LibreCrawl provides a set of CSS classes to ensure your plugin's UI matches the application's design. Apply these classes to your HTML elements.
### Available CSS Classes
- **`.plugin-content`**: The main container for your plugin's UI. Apply padding and scrolling styles here.
- **`.plugin-header`**: Use for header sections within your plugin's content.
- **`.data-table`**: Automatically styles HTML tables to match LibreCrawl's appearance.
- **`.stat-card`**: Styles elements used for displaying statistics.
- **`.score-good`**, **`.score-needs-improvement`**, **`.score-poor`**: Classes for indicating different levels of quality or status.
### Scrolling Implementation
To ensure proper scrolling within the tab pane, wrap your plugin's content in a div with the following styles:
```html
"""
soup = BeautifulSoup(html, 'html.parser')
extractor = SEOExtractor()
# Initialize result structure
result = {
    'url': 'https://example.com/page',
    'meta_tags': {}, 'og_tags': {}, 'twitter_tags': {},
    'json_ld': [], 'images': [], 'hreflang': [], 'schema_org': [],
    'analytics': {'google_analytics': False, 'gtag': False, 'ga4_id': '', 'gtm_id': ''},
    'internal_links': 0, 'external_links': 0
}
# Extract all SEO data
extractor.extract_basic_seo_data(soup, result)
extractor.extract_meta_tags(soup, result)
extractor.extract_opengraph_tags(soup, result)
extractor.extract_twitter_tags(soup, result)
extractor.extract_json_ld(soup, result)
extractor.extract_images(soup, result['url'], result)
extractor.extract_link_counts(soup, result, 'example.com')
print(f"Title: {result['title']}") # "Example Page Title"
print(f"H1: {result['h1']}") # "Main Heading"
print(f"Canonical: {result['canonical_url']}") # "https://example.com/page"
print(f"Language: {result['lang']}") # "en"
```
--------------------------------
### Guest Login Functionality with JavaScript
Source: https://github.com/phialsbasement/librecrawl/blob/main/web/templates/login.html
Handles the guest login button click event. It disables the button, shows a loading state, and makes a POST request to the /api/guest-login endpoint. It provides user feedback via alerts and redirects on successful guest entry.
```javascript
// Guest login
const guestBtn = document.getElementById('guestBtn');
if (guestBtn) guestBtn.addEventListener('click', async () => {
    hideAlert();
    guestBtn.disabled = true;
    guestBtn.textContent = 'Entering as guest...';
    try {
        const response = await fetch('/api/guest-login', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json'
            }
        });
        const data = await response.json();
        if (data.success) {
            showAlert('Entering as guest...', 'success');
            setTimeout(() => {
                window.location.href = '/';
            }, 500);
        } else {
            showAlert(data.message || 'Failed to enter as guest', 'error');
            guestBtn.disabled = false;
            guestBtn.textContent = 'Continue as Guest (3 crawls/24h)';
        }
    } catch (error) {
        showAlert('An error occurred. Please try again.', 'error');
        guestBtn.disabled = false;
        guestBtn.textContent = 'Continue as Guest (3 crawls/24h)';
    }
});
```
--------------------------------
### Retrieve Visualization Data via API
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Fetch graph data representing the site structure for visualization purposes.
```bash
curl http://localhost:5000/api/visualization_data \
  -H "Cookie: session=your_session_cookie"
```
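The documented response for this endpoint carries `nodes` and `edges` arrays, with each item wrapped in a `data` object. As a minimal sketch, assuming that shape, the Python below turns an already-parsed payload into an adjacency map keyed by URL. The function name `build_adjacency` and the sample payload are illustrative and not part of LibreCrawl.

```python
from collections import defaultdict


def build_adjacency(payload):
    """Map each node URL to the list of URLs it links to.

    Assumes nodes and edges are lists of {"data": {...}} objects,
    as in the documented response example.
    """
    # Resolve node ids (e.g. "node-0") to their URLs
    url_by_id = {n["data"]["id"]: n["data"]["url"] for n in payload.get("nodes", [])}
    adjacency = defaultdict(list)
    for edge in payload.get("edges", []):
        source = url_by_id.get(edge["data"]["source"])
        target = url_by_id.get(edge["data"]["target"])
        # Skip edges that reference nodes missing from this payload
        if source is not None and target is not None:
            adjacency[source].append(target)
    return dict(adjacency)


# Hypothetical sample payload, for illustration only
sample = {
    "success": True,
    "nodes": [
        {"data": {"id": "node-0", "url": "https://example.com", "status_code": 200, "title": "Home"}},
        {"data": {"id": "node-1", "url": "https://example.com/about", "status_code": 200, "title": "About"}},
    ],
    "edges": [{"data": {"id": "edge-0-1", "source": "node-0", "target": "node-1"}}],
    "total_pages": 2,
}
graph = build_adjacency(sample)
```

Edges whose endpoints are absent from `nodes` are dropped rather than raising, which keeps the sketch tolerant of partial payloads.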
--------------------------------
### Configuration Settings
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Details on the various settings that can be configured within LibreCrawl's UI.
```APIDOC
## LibreCrawl Configuration Settings
### Description
LibreCrawl allows extensive customization through its Settings interface, covering crawler behavior, request handling, rendering, filtering, and more.
### Configurable Areas
- **Crawler Settings**: Control crawl depth, the URL limit (up to 5 million URLs), delays between requests, and handling of external links.
- **Request Settings**: Configure the User-Agent string, request timeouts, proxy settings, and adherence to `robots.txt`.
- **JavaScript Rendering**: Adjust settings for the browser engine, wait times after page load, and viewport size for rendering dynamic content.
- **Filters**: Define patterns for URL inclusion or exclusion, and specify file types to ignore during crawling.
- **Export Options**: Select the desired formats (CSV, JSON, XML) and specify which data fields to include in exports.
- **Custom CSS**: Personalize the LibreCrawl user interface by applying custom CSS rules.
- **Issue Exclusion**: Set up patterns to exclude specific SEO issues from being reported.
### PageSpeed Analysis
To raise the PageSpeed analysis rate limit to 25,000 requests per day, add a Google API key in the `Settings > Requests` section. Without a key, analysis runs under a more restrictive limit.
```
--------------------------------
### POST /api/login
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Authenticates a user and returns a session.
```APIDOC
## POST /api/login
### Description
Authenticates a user with username and password.
### Method
POST
### Endpoint
http://localhost:5000/api/login
### Request Body
- **username** (string) - Required - Username
- **password** (string) - Required - Password
### Request Example
{
  "username": "newuser",
  "password": "securepassword123"
}
```
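For callers working from Python rather than curl, the request described above can be assembled with the standard library. This is a sketch only: `build_login_request` is a hypothetical helper, and the base URL is the local address used elsewhere in this document. The request is built but not sent, so it runs without a server.

```python
import json
import urllib.request


def build_login_request(base_url, username, password):
    """Build (but do not send) a JSON POST request for /api/login."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_login_request("http://localhost:5000", "newuser", "securepassword123")
```

Passing `request` to `urllib.request.urlopen` would send it to a running instance. The other examples in this document pass the resulting session back in a `Cookie` header.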
--------------------------------
### POST /api/register
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Registers a new user account.
```APIDOC
## POST /api/register
### Description
Creates a new user account.
### Method
POST
### Endpoint
http://localhost:5000/api/register
### Request Body
- **username** (string) - Required - Unique username
- **email** (string) - Required - User email address
- **password** (string) - Required - User password
### Request Example
{
  "username": "newuser",
  "email": "user@example.com",
  "password": "securepassword123"
}
```
--------------------------------
### Running Modes
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Explanation of the different modes LibreCrawl can run in: Standard and Local.
```APIDOC
## LibreCrawl Running Modes
### Description
LibreCrawl offers two primary running modes, each with different access control and limitations, suitable for various use cases.
### Standard Mode
- **Default mode**.
- Features a full authentication system (login/register).
- Implements tier-based access control (Guest, User, Extra, Admin).
- **Guest users** are limited to 3 crawls per 24 hours, enforced per IP address.
- Suitable for public demonstrations or shared environments.
### Local Mode (`--local` or `-l` flag)
- **Activated by** running LibreCrawl with the `--local` or `-l` command-line flag.
- **All users** are automatically granted admin-tier access.
- **No rate limits** or tier-based restrictions are enforced.
- Ideal for personal use, local development, and self-hosted instances where strict access control is not required.
- **Recommended** for testing and development purposes.
```
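The guest limit in Standard mode (3 crawls per 24 hours, keyed by IP) can be pictured as a sliding-window counter. The sketch below is an assumption about how such a limit might work, not LibreCrawl's actual implementation: the source does not say whether the window is sliding or fixed, and `GuestCrawlLimiter` is a hypothetical name. A fake clock is injected so the example runs instantly.

```python
import time
from collections import defaultdict, deque

GUEST_LIMIT = 3                # crawls allowed per window (from the text above)
WINDOW_SECONDS = 24 * 60 * 60  # 24 hours


class GuestCrawlLimiter:
    """Hypothetical sliding-window limiter keyed by client IP."""

    def __init__(self, limit=GUEST_LIMIT, window=WINDOW_SECONDS, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._starts = defaultdict(deque)  # ip -> timestamps of recent crawls

    def try_start_crawl(self, ip):
        """Record a crawl and return True if the IP is under its limit."""
        now = self.clock()
        recent = self._starts[ip]
        # Drop timestamps that have aged out of the window
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Demonstration with a fake clock so no real waiting is needed
fake_now = [0.0]
limiter = GuestCrawlLimiter(clock=lambda: fake_now[0])
results = [limiter.try_start_crawl("203.0.113.7") for _ in range(4)]
fake_now[0] = WINDOW_SECONDS  # 24 hours later
after_window = limiter.try_start_crawl("203.0.113.7")
```

In Local mode no such check would apply, since all users receive admin-tier access without rate limits.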
--------------------------------
### Lifecycle Hooks and Utilities
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Overview of the available lifecycle hooks and utility functions for plugin development.
```APIDOC
## Plugin Lifecycle Hooks and Utilities
### Description
Details the functions that LibreCrawl calls at various points in its operation, and the utility functions available to plugins.
### Lifecycle Hooks
- **`onLoad()`**: Called once when the plugin is initially loaded by LibreCrawl.
- **`onTabActivate(container, data)`**: Triggered when the user navigates to the plugin's tab. Receives the tab's DOM container and crawl data.
- **`onTabDeactivate()`**: Called when the user switches away from the plugin's tab.
- **`onDataUpdate(data)`**: Executed during live crawls whenever new crawl data becomes available. Useful for real-time UI updates.
- **`onCrawlComplete(data)`**: Called after a crawl has finished, providing the final crawl data.
### Utilities
Access built-in helper functions through `this.utils`:
- **`this.utils.showNotification(message, type)`**: Displays a notification to the user. `type` can be 'success', 'error', or 'info'.
- **`this.utils.formatUrl(url)`**: Formats a given URL according to LibreCrawl's standards.
- **`this.utils.escapeHtml(text)`**: Escapes HTML special characters in a string to prevent XSS attacks.
```
--------------------------------
### Plugin Configuration Reference
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Details on the configuration object required for registering a LibreCrawl plugin.
```APIDOC
## Plugin Configuration Object
### Description
Defines the structure and properties for configuring a LibreCrawl plugin.
### Fields
- **id** (string) - Required - Unique identifier for the plugin. Used for internal referencing and tab identification.
- **name** (string) - Required - The display name of the plugin, shown in the UI.
- **version** (string) - Optional - The version number of the plugin.
- **author** (string) - Optional - The name of the plugin's author.
- **description** (string) - Optional - A brief description of the plugin's functionality.
- **tab** (object) - Required - Configuration for how the plugin appears as a tab.
- **label** (string) - Required - The text displayed on the tab button.
- **icon** (string) - Optional - An emoji or icon to display on the tab button.
- **position** (number) - Optional - Specifies the order of the tab. Defaults to appending to the end.
```
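The required and optional fields listed above lend themselves to a simple pre-registration check. The following is a minimal sketch in Python, assuming the configuration is available as a plain dictionary. `validate_plugin_config` is a hypothetical helper and does not reflect how LibreCrawl itself validates plugins.

```python
def validate_plugin_config(config):
    """Return a list of problems found in a plugin config dict (empty if none)."""
    problems = []
    # Top-level required string fields
    for field in ("id", "name"):
        if not isinstance(config.get(field), str) or not config[field]:
            problems.append(f"'{field}' is required and must be a non-empty string")
    # The tab object and its required label
    tab = config.get("tab")
    if not isinstance(tab, dict):
        problems.append("'tab' is required and must be an object")
    elif not isinstance(tab.get("label"), str) or not tab["label"]:
        problems.append("'tab.label' is required and must be a non-empty string")
    return problems


valid_config = {
    "id": "my-plugin",
    "name": "My Plugin",
    "tab": {"label": "My Tab", "icon": "🔥"},
}
invalid_config = {"id": "my-plugin", "tab": {"icon": "🔥"}}  # missing name and tab.label

valid_problems = validate_plugin_config(valid_config)
invalid_problems = validate_plugin_config(invalid_config)
```

Optional fields such as `version`, `author`, `description`, `tab.icon`, and `tab.position` are deliberately left unchecked here.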
--------------------------------
### Manage Crawler Settings via API
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Retrieve or update crawler configuration settings using the get_settings and save_settings endpoints.
```bash
curl http://localhost:5000/api/get_settings \
  -H "Cookie: session=your_session_cookie"
```
```bash
curl -X POST http://localhost:5000/api/save_settings \
  -H "Content-Type: application/json" \
  -H "Cookie: session=your_session_cookie" \
  -d '{
    "maxDepth": 5,
    "maxUrls": 10000,
    "crawlDelay": 0.5,
    "enableJavaScript": true,
    "jsWaitTime": 3,
    "includePatterns": "/blog/*\n/products/*",
    "excludePatterns": "/admin/*\n/cart/*"
  }'
```
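The `includePatterns` and `excludePatterns` values above are newline-separated strings. How LibreCrawl interprets each pattern is not stated here, so the sketch below makes two assumptions: patterns are shell-style globs matched against the URL path, and excludes take priority over includes. `parse_patterns` and `should_crawl` are hypothetical helpers.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def parse_patterns(raw):
    """Split a newline-separated pattern string, dropping blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def should_crawl(url, include_patterns="", exclude_patterns=""):
    """Decide whether a URL passes the filters (assumed glob semantics)."""
    path = urlparse(url).path or "/"
    includes = parse_patterns(include_patterns)
    excludes = parse_patterns(exclude_patterns)
    # Assumption: any matching exclude pattern rejects the URL outright
    if any(fnmatchcase(path, pattern) for pattern in excludes):
        return False
    # Assumption: an empty include list means "allow everything"
    return not includes or any(fnmatchcase(path, pattern) for pattern in includes)


include = "/blog/*\n/products/*"
exclude = "/admin/*\n/cart/*"
blog_allowed = should_crawl("https://example.com/blog/post-1", include, exclude)
admin_allowed = should_crawl("https://example.com/admin/users", include, exclude)
about_allowed = should_crawl("https://example.com/about", include, exclude)
```

Under these assumptions, `/about` is rejected because an include list is present and nothing in it matches.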
--------------------------------
### Multi-tenancy and Session Management
Source: https://github.com/phialsbasement/librecrawl/blob/main/README.md
Explanation of how LibreCrawl handles multiple users and manages their sessions.
```APIDOC
## LibreCrawl Multi-tenancy and Sessions
### Description
LibreCrawl is designed to support multiple concurrent users, providing isolated environments and managing session data effectively.
### Key Features
- **Isolated Instances**: Each browser session operates with its own independent crawler instance and crawl data.
- **Persistent Settings**: User-specific settings are stored in the browser's `localStorage`, ensuring persistence across restarts.
- **Per-Browser Themes**: Custom CSS themes applied by users are specific to their browser.
- **Session Expiration**: User sessions automatically expire after 1 hour of inactivity.
- **Data Isolation**: Crawl data is kept separate between different users/sessions.
```
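The one-hour inactivity expiry can be illustrated with a small registry that records the last activity time per session and prunes idle ones. This is a sketch under assumptions, not LibreCrawl's code: `SessionRegistry` and its methods are hypothetical, and a fake clock stands in for real time.

```python
import time

SESSION_TIMEOUT_SECONDS = 60 * 60  # 1 hour of inactivity, per the text above


class SessionRegistry:
    """Hypothetical tracker mapping session ids to last-activity timestamps."""

    def __init__(self, timeout=SESSION_TIMEOUT_SECONDS, clock=time.time):
        self.timeout = timeout
        self.clock = clock
        self._last_seen = {}

    def touch(self, session_id):
        """Record activity for a session, creating it if needed."""
        self._last_seen[session_id] = self.clock()

    def expire_inactive(self):
        """Remove and return the ids of sessions idle for the full timeout."""
        now = self.clock()
        expired = [sid for sid, seen in self._last_seen.items() if now - seen >= self.timeout]
        for sid in expired:
            del self._last_seen[sid]
        return expired

    def active_sessions(self):
        """Return the ids of sessions that have not been expired."""
        return sorted(self._last_seen)


# Demonstration with a fake clock
fake_now = [0.0]
registry = SessionRegistry(clock=lambda: fake_now[0])
registry.touch("session-a")   # active at t = 0
fake_now[0] = 30 * 60
registry.touch("session-b")   # active at t = 30 min
fake_now[0] = 60 * 60
expired = registry.expire_inactive()  # evaluated at t = 60 min
remaining = registry.active_sessions()
```

In a multi-tenant setup each expired id would also be the point at which that session's crawler instance and crawl data are released.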
--------------------------------
### Register a Custom Analysis Plugin
Source: https://context7.com/phialsbasement/librecrawl/llms.txt
Use LibreCrawlPlugin.register to define a new UI tab with custom analysis logic. Ensure the file is saved in web/static/plugins/.
```javascript
// Save as web/static/plugins/my-plugin.js
LibreCrawlPlugin.register({
    // Required configuration
    id: 'word-count-analyzer',
    name: 'Word Count Analyzer',
    tab: {
        label: 'Word Count',
        icon: '📊',
        position: 'end'
    },
    // Optional metadata
    version: '1.0.0',
    author: 'Your Name',
    // Called when plugin loads
    onLoad() {
        console.log('Word Count Analyzer loaded');
        this.thresholds = { thin: 300, good: 1000 };
    },
    // Called when tab is activated
    onTabActivate(container, data) {
        const analysis = this.analyzeWordCounts(data.urls);
        container.innerHTML = `
Total Pages: ${data.urls.length}
Thin Content (< 300 words): ${analysis.thin}
Good Content (300-1000 words): ${analysis.moderate}
Rich Content (> 1000 words): ${analysis.rich}
Average Words: ${analysis.average.toFixed(0)}
| URL | Words | Status |
|---|---|---|
| ${this.utils.escapeHtml(p.url)} | ${p.word_count} | ${p.statusLabel} |
Start a new crawl from the main application
${error.message}