### Basic SQLStorageClient Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Demonstrates basic usage of the `SqlStorageClient`. Install the required extra first, e.g. `pip install "crawlee[sql_sqlite]"`.

```python
import asyncio

from crawlee.storage_clients import SqlStorageClient
from crawlee.storages import Dataset, KeyValueStore, RequestQueue


async def main() -> None:
    # Use the default SQLite database. For another database, pass a
    # connection string instead (an async driver is assumed), e.g.
    # SqlStorageClient(connection_string='postgresql+asyncpg://user:password@host:port/database')
    async with SqlStorageClient() as storage_client:
        # Example: saving data to a dataset
        dataset = await Dataset.open(storage_client=storage_client)
        await dataset.push_data({'key': 'value'})

        # Example: getting data from a key-value store
        kvs = await KeyValueStore.open(storage_client=storage_client)
        value = await kvs.get_value('my_key')

        # Example: enqueuing a request
        request_queue = await RequestQueue.open(storage_client=storage_client)
        await request_queue.add_request('http://example.com')


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Launch Crawler with uv

Source: https://github.com/apify/crawlee-python/blob/master/src/crawlee/project_template/{{cookiecutter.project_name}}/README.md

Execute this command to start the crawler after installing dependencies with uv.

```sh
uv run python -m {{cookiecutter.__package_name}}
```

--------------------------------

### Initialize Apify Project

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/09_running_in_cloud.mdx

Use this command to initialize your project for Apify. It checks the project structure and guides you through the setup process.

```bash
apify init
```

--------------------------------

### Launch Crawler with pip

Source: https://github.com/apify/crawlee-python/blob/master/src/crawlee/project_template/{{cookiecutter.project_name}}/README.md

Start the crawler after installing dependencies with pip.
```sh
python -m {{cookiecutter.__package_name}}
```

--------------------------------

### Launch Crawler with Poetry

Source: https://github.com/apify/crawlee-python/blob/master/src/crawlee/project_template/{{cookiecutter.project_name}}/README.md

Execute this command to start the crawler after installing dependencies with Poetry.

```sh
poetry run python -m {{cookiecutter.__package_name}}
```

--------------------------------

### FastAPI Web Server Setup

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/running_in_web_server.mdx

Sets up a FastAPI application with endpoints for serving scraped data. Requires `fastapi[standard]`. Run with `fastapi dev server.py`.

```python
from fastapi import FastAPI, Request
from contextlib import asynccontextmanager
from apify_client import ApifyClient
from apify_client.models import RequestQueue
from pydantic import BaseModel


class ScrapeUrl(BaseModel):
    url: str


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Initialize ApifyClient
    client = ApifyClient("YOUR_APIFY_API_TOKEN")

    # Get a request queue
    request_queue: RequestQueue = await client.request_queues.get_or_create(
        name="my-queue"
    )

    # Save client and queue to app state
    app.state.client = client
    app.state.request_queue = request_queue

    # Save dictionary for mapping requests to results
    app.state.requests_to_results = {}

    yield


app = FastAPI(lifespan=lifespan)


@app.get("/")
def index():
    return {
        "message": "Welcome to the Crawlee web server! Use the /scrape endpoint to get a page title.",
        "example": {
            "url": "/scrape?url=https://example.com"
        },
    }


@app.post("/scrape")
async def scrape(request: Request, scrape_url: ScrapeUrl):
    # Add the URL to the request queue
    await request.app.state.request_queue.add_request({
        "url": scrape_url.url,
        "method": "GET",
    })

    # Store the URL in the dictionary to retrieve the result later
    request_id = len(request.app.state.requests_to_results)
    request.app.state.requests_to_results[request_id] = None

    # Wait for the result to be available
    while request.app.state.requests_to_results[request_id] is None:
        # Check for new items in the queue
        items = await request.app.state.request_queue.list_items(limit=10)
        for item in items.items:
            if item["url"] == scrape_url.url:
                # Store the result and remove it from the queue
                request.app.state.requests_to_results[request_id] = item["metadata"]["page_title"]
                await request.app.state.request_queue.delete_item(item["id"])
                break
        if request.app.state.requests_to_results[request_id] is not None:
            break

    # Return the page title
    return {"url": scrape_url.url, "title": request.app.state.requests_to_results[request_id]}
```

--------------------------------

### Quick Start with Custom Proxies

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/proxy_management.mdx

Demonstrates how to quickly start using your own proxy URLs with Crawlee.

```python
from crawlee.proxy_configuration import ProxyConfiguration

# Use your own proxy URLs
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://user:password@your-proxy.com:8080'],
)

# Pass this object as `proxy_configuration` when initializing your crawler.
```

--------------------------------

### FileSystemStorageClient Configuration Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Shows how to configure the FileSystemStorageClient with a custom storage directory.
```python
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient

# The storage directory is set through Configuration, not on the client itself.
configuration = Configuration(storage_dir='./my_custom_storage')
custom_dir_storage = FileSystemStorageClient()

# Pass both to the crawler; its storages are then kept under ./my_custom_storage.
crawler = ParselCrawler(
    storage_client=custom_dir_storage,
    configuration=configuration,
)
print('FileSystemStorageClient initialized with custom directory.')
```

--------------------------------

### Registering Storage Clients Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Shows how to register a storage client globally through the `service_locator`.

```python
from crawlee import service_locator
from crawlee.storage_clients import MemoryStorageClient

# Any StorageClient implementation can be registered;
# MemoryStorageClient is used here as a stand-in for a custom client.
storage_client = MemoryStorageClient()

# Register the client globally
service_locator.set_storage_client(storage_client)

# Retrieve the registered client
retrieved_client = service_locator.get_storage_client()
print(f'Registered and retrieved client: {retrieved_client}')
```

--------------------------------

### SQLStorageClient Configuration Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Shows how to configure the SqlStorageClient with a specific database connection string.
```python
from crawlee.storage_clients import SqlStorageClient

# Initialize the SqlStorageClient with a PostgreSQL connection string.
# An async driver such as asyncpg is assumed here.
postgres_url = 'postgresql+asyncpg://user:password@host:port/database'
postgres_storage = SqlStorageClient(connection_string=postgres_url)

# Pass `postgres_storage` as the `storage_client` argument of a crawler,
# or of `Dataset.open()`, to store data in PostgreSQL.
print('SqlStorageClient initialized with PostgreSQL connection.')
```

--------------------------------

### Install Crawlee with httpx extra

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/http_clients.mdx

Install Crawlee with the `httpx` extra to enable the HttpxHttpClient.

```sh
python -m pip install 'crawlee[httpx]'
```

--------------------------------

### Install Crawlee with Extras

Source: https://context7.com/apify/crawlee-python/llms.txt

Install Crawlee with all extras, or install only the packages needed for specific functionality such as BeautifulSoup or Playwright. Playwright also requires a separate browser installation step.

```bash
pip install 'crawlee[all]'
playwright install
```

```bash
pip install 'crawlee[beautifulsoup]'
```

```bash
pip install 'crawlee[parsel]'
```

```bash
pip install 'crawlee[playwright]'
```

```bash
pip install 'crawlee[cli]'
```

--------------------------------

### Basic SQLStorageClient Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Demonstrates the basic usage of the SqlStorageClient for persistent storage using a SQL database.
```python
from crawlee.storage_clients import SqlStorageClient

# Initialize the SqlStorageClient (defaults to SQLite).
# For other databases, pass a connection string, e.g.
# SqlStorageClient(connection_string='postgresql+asyncpg://user:password@host:port/database')
sql_storage = SqlStorageClient()

# Pass `sql_storage` as the `storage_client` argument of a crawler,
# or of `Dataset.open()`, to persist data in the SQL database.
print('SqlStorageClient initialized.')
```

--------------------------------

### Efficient Request Addition with BeautifulSoupCrawler

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/02_first_crawler.mdx

This example shows a more concise way to start a BeautifulSoupCrawler by passing requests directly to the `crawler.run()` method. This approach internally uses batched request additions for better performance, allowing crawling to start almost instantly.

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # The parsed page is available as a BeautifulSoup object in context.soup.
    print(f'The title of "{context.request.url}" is "{context.soup.title.string}".')


async def main() -> None:
    crawler = BeautifulSoupCrawler(request_handler=request_handler)
    # Start URLs passed to run() are added to the queue in batches.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Basic FileSystemStorageClient Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Demonstrates the basic usage of the FileSystemStorageClient for persistent file system storage with in-memory caching.
```python
from crawlee.storage_clients import FileSystemStorageClient

# Initialize the FileSystemStorageClient (data is stored under './storage' by default)
file_system_storage = FileSystemStorageClient()

# Pass `file_system_storage` as the `storage_client` argument of a crawler,
# or of `Dataset.open()`, to save data to the file system.
print('FileSystemStorageClient initialized.')
```

--------------------------------

### Basic MemoryStorageClient Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Demonstrates the basic usage of the MemoryStorageClient for in-memory data storage. Data is not persisted and is lost when the process exits.

```python
from crawlee.storage_clients import MemoryStorageClient

# Initialize the MemoryStorageClient
memory_storage = MemoryStorageClient()

# Pass `memory_storage` as the `storage_client` argument of a crawler,
# or of `Dataset.open()`, to keep data in memory only.
print('MemoryStorageClient initialized.')
```

--------------------------------

### Install Crawlee with All Features

Source: https://github.com/apify/crawlee-python/blob/master/README.md

Installs the crawlee package with all optional features. The Playwright browser binaries must be installed separately.

```sh
python -m pip install 'crawlee[all]'
```

--------------------------------

### Basic RedisStorageClient Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Demonstrates the basic usage of the RedisStorageClient for persistent storage using a Redis database.
```python
from crawlee.storage_clients import RedisStorageClient

# Initialize the RedisStorageClient for a local Redis server.
# The connection string is passed explicitly here.
redis_storage = RedisStorageClient(connection_string='redis://localhost:6379')

# Pass `redis_storage` as the `storage_client` argument of a crawler,
# or of `Dataset.open()`, to save data to Redis.
print('RedisStorageClient initialized.')
```

--------------------------------

### Install Crawlee CLI with uv

Source: https://github.com/apify/crawlee-python/blob/master/README.md

Runs the Crawlee CLI through uvx, uv's runner for Python tools, to create a new project named `my-crawler`. Ensure uv is installed first.

```sh
uvx 'crawlee[cli]' create my-crawler
```

--------------------------------

### Custom Storage Client Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Illustrates how to create and use a custom storage client by extending the base StorageClient class.

```python
from typing import Any

from crawlee.storage_clients import StorageClient


class MyCustomStorageClient(StorageClient):
    # One factory method per storage type. The signatures are abbreviated
    # with **kwargs; check the StorageClient base class for the exact parameters.
    async def create_dataset_client(self, **kwargs: Any):
        raise NotImplementedError

    async def create_kvs_client(self, **kwargs: Any):
        raise NotImplementedError

    async def create_rq_client(self, **kwargs: Any):
        raise NotImplementedError


# Instantiate the custom client
custom_client = MyCustomStorageClient()
print('Custom storage client created.')
```

--------------------------------

### Install Dependencies with uv and poe

Source: https://github.com/apify/crawlee-python/blob/master/AGENTS.md

Installs all project dependencies, including development, extras, pre-commit hooks, and Playwright. Use this command to set up your development environment.
```bash
uv run poe install-dev
```

--------------------------------

### RedisStorageClient Configuration Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Shows how to configure the RedisStorageClient with a specific Redis host and port through a connection string.

```python
from crawlee.storage_clients import RedisStorageClient

# Initialize the RedisStorageClient with a custom host and port
custom_redis_storage = RedisStorageClient(
    connection_string='redis://my-redis-host:16379',
)

# Pass `custom_redis_storage` as the `storage_client` argument of a crawler,
# or of `Dataset.open()`, to save data to this Redis instance.
print('RedisStorageClient initialized with custom host and port.')
```

--------------------------------

### Full Scraping Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/06_scraping.mdx

A complete example demonstrating how to integrate the scraping logic into a request handler.

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Find the SKU element using the selector and get its text content.
    sku = await context.page.locator('span.product-meta__sku-number').text_content()

    # Locate the price element and filter out the visually hidden elements.
    price_element = context.page.locator('span.price', has_text='$').first

    # Extract the text content of the price element.
    current_price_string = await price_element.text_content() or ''
    # current_price_string: 'Sale price$1,398.00'

    # Split the string by the '$' sign to get the numeric part.
    raw_price = current_price_string.split('$')[1]
    # raw_price: '1,398.00'

    # Convert the raw price string to a float after removing commas.
    price = float(raw_price.replace(',', ''))
    # price: 1398.00

    # Locate the element that contains the text 'In stock' and filter out other elements.
    in_stock_element = context.page.locator(
        selector='span.product-form__inventory',
        has_text='In stock',
    ).first

    # Check if the element exists by counting the matching elements.
    in_stock = await in_stock_element.count() > 0

    # Print the scraped data.
    print(
        {
            'url': context.page.url,
            'manufacturer': 'sony',
            'title': 'Sony STR-ZA810ES 7.2-Ch Hi-Res Wi-Fi Network A/V Receiver',
            'sku': sku,
            'price': price,
            'in_stock': in_stock,
        }
    )


async def main() -> None:
    crawler = PlaywrightCrawler(request_handler=request_handler)
    await crawler.run([
        'https://warehouse-theme-metal.myshopify.com/products/sony-str-za810es-7-2-channel-hi-res-wi-fi-network-av-receiver',
    ])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Install Dependencies with pip

Source: https://github.com/apify/crawlee-python/blob/master/src/crawlee/project_template/{{cookiecutter.project_name}}/README.md

Run this command to install project dependencies using pip.

```sh
python -m pip install .
```

--------------------------------

### Basic BeautifulSoupCrawler Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/02_first_crawler.mdx

This snippet demonstrates the fundamental setup of a BeautifulSoupCrawler. It initializes the crawler, defines a request handler to process the HTML content, and runs the crawler on a target URL. Use this for simple crawling tasks where JavaScript rendering is not required.
```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # The parsed HTML is available as a BeautifulSoup object via context.soup.
    print(f'The title of "{context.request.url}" is "{context.soup.title.string}".')


async def main() -> None:
    crawler = BeautifulSoupCrawler(request_handler=request_handler)
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Initialize RedisStorageClient

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Instantiate the RedisStorageClient using a connection string. Ensure `crawlee[redis]` is installed and Redis is running.

```python
from crawlee.storage_clients import RedisStorageClient

# Use a connection string
client = RedisStorageClient(connection_string='redis://localhost:6379/0')
```

--------------------------------

### Registering a Storage Client

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/upgrading/upgrading_to_v1.md

This example demonstrates how to register a custom storage client globally, for a single crawler, or for a single storage instance.

```python
import asyncio

from crawlee import service_locator
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    # Create custom storage client
    storage_client = MemoryStorageClient()

    # Then register it globally
    service_locator.set_storage_client(storage_client)

    # Or use it for a single crawler only
    crawler = ParselCrawler(storage_client=storage_client)

    # Or use it for a single storage only
    dataset = await Dataset.open(
        name='my-dataset',
        storage_client=storage_client,
    )


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Install Apify SDK

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/deployment/apify_platform.mdx

Install the Apify SDK for Python using pip.
This is a prerequisite for running Crawlee code on the Apify platform.

```bash
pip install apify
```

--------------------------------

### Install Crawlee with Multiple Extras

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Install Crawlee with multiple optional features simultaneously by separating them with commas.

```sh
python -m pip install 'crawlee[beautifulsoup,curl-impersonate]'
```

--------------------------------

### Install Crawlee with curl-impersonate extra

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/http_clients.mdx

Install Crawlee with the `curl-impersonate` extra to enable the CurlImpersonateHttpClient.

```sh
python -m pip install 'crawlee[curl-impersonate]'
```

--------------------------------

### Install Dependencies with uv

Source: https://github.com/apify/crawlee-python/blob/master/src/crawlee/project_template/{{cookiecutter.project_name}}/README.md

Use this command to install project dependencies when using uv as your package manager.

```sh
uv sync
```

--------------------------------

### Install Crawlee Core Package

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Install the essential Crawlee package using pip. This command installs the core functionality.

```sh
python -m pip install crawlee
```

--------------------------------

### Install Apify CLI with npm

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/09_running_in_cloud.mdx

Installs the Apify CLI globally, a command-line tool for authentication and deployment to the Apify platform. Requires Node.js.
```sh
npm install -g apify-cli
```

--------------------------------

### Basic RequestList Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/request_loaders.mdx

Demonstrates the fundamental usage of `RequestList`: requests are fetched one at a time and marked as handled after processing.

```python
import asyncio

from crawlee.request_loaders import RequestList


async def main() -> None:
    # Requests can be given as plain URL strings or as Request objects.
    request_list = RequestList(
        requests=['http://example.com', 'http://example.org'],
    )

    # Fetch requests until the list is exhausted.
    while request := await request_list.fetch_next_request():
        print(f'Processing: {request.url}')
        # Process the request here, then mark it as handled.
        await request_list.mark_request_as_handled(request)

    print('Finished processing requests.')


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Install Dependencies with Poetry

Source: https://github.com/apify/crawlee-python/blob/master/src/crawlee/project_template/{{cookiecutter.project_name}}/README.md

Use this command to install project dependencies when using Poetry as your package manager.

```sh
poetry install
```

--------------------------------

### Router Handler Examples

Source: https://github.com/apify/crawlee-python/blob/master/GEMINI.md

Shows how to define default and labeled request handlers for a crawler using decorators.

```python
@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext): ...
```

```python
@crawler.router.handler(label='detail')
async def detail(context: BeautifulSoupCrawlingContext): ...
```

--------------------------------

### Setup OpenTelemetry Tracing for Crawlers

Source: https://context7.com/apify/crawlee-python/llms.txt

Integrates OpenTelemetry for tracing storage operations. Ensure the OTLP exporter endpoint is correctly configured.
```python
import asyncio

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.trace import set_tracer_provider

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.otel import CrawlerInstrumentor
from crawlee.storages import Dataset, KeyValueStore, RequestQueue


def setup_tracing() -> None:
    resource = Resource.create({'service.name': 'MyCrawler', 'service.version': '1.0.0'})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        SimpleSpanProcessor(OTLPSpanExporter(endpoint='localhost:4317', insecure=True))
    )
    set_tracer_provider(provider)
    CrawlerInstrumentor(
        instrument_classes=[RequestQueue, KeyValueStore, Dataset]
    ).instrument()


async def main() -> None:
    setup_tracing()

    crawler = ParselCrawler(max_requests_per_crawl=100)
    kvs = await KeyValueStore.open()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})
        await kvs.set_value(key='last-url', value=context.request.url)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Register Storage Clients

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Demonstrates how to register custom storage clients. This example shows registering clients per storage instance when opening them.
```python
from crawlee.storages import Dataset, KeyValueStore, RequestQueue

# Assuming 'custom_dataset_client', 'custom_kv_store_client' and 'custom_rq_client'
# are instances of custom storage clients. Run these calls inside an async function.
dataset = await Dataset.open(name='my-dataset', storage_client=custom_dataset_client)
key_value_store = await KeyValueStore.open(name='my-kv-store', storage_client=custom_kv_store_client)
request_queue = await RequestQueue.open(name='my-request-queue', storage_client=custom_rq_client)
```

--------------------------------

### Start Jaeger Docker Container

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/trace_and_monitor_crawlers.mdx

Use this command to run a preconfigured Jaeger Docker container locally. Ensure Docker is installed and running.

```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

--------------------------------

### Python PlaywrightCrawler Sanity Check

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/04_real_world_project.mdx

Creates a PlaywrightCrawler to visit a start URL and print the text content of category elements. Useful for verifying initial setup and selectors.

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Run the browser in headless mode.
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # The '.collection-block-item' selector is an assumption about the
        # demo store's markup; adjust it if the category cards differ.
        await context.page.wait_for_selector('.collection-block-item')
        cards = context.page.locator('.collection-block-item')
        for text in await cards.all_text_contents():
            print(text)

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### PlaywrightCrawler Example

Source: https://context7.com/apify/crawlee-python/llms.txt

Demonstrates using PlaywrightCrawler for JavaScript-rendered content.
It uses a headless browser and provides the full Playwright `Page` API. Requires `crawlee[playwright]` and `playwright install`.

```python
import asyncio

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        headless=True,
        browser_type='chromium',  # 'firefox' or 'webkit' also supported
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        posts = await context.page.query_selector_all('.athing')
        data = []
        for post in posts:
            title_el = await post.query_selector('.title a')
            rank_el = await post.query_selector('.rank')
            data.append({
                'title': await title_el.inner_text() if title_el else None,
                'rank': await rank_el.inner_text() if rank_el else None,
                'href': await title_el.get_attribute('href') if title_el else None,
            })
        await context.push_data(data)
        await context.enqueue_links(selector='.morelink')  # paginate

    @crawler.pre_navigation_hook
    async def log_nav(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} ...')

    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Adaptive Crawler Handlers

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/request_router.mdx

Illustrates an adaptive approach to crawler handlers, potentially using different strategies based on request characteristics. This example shows a flexible setup for handling various request types.
```python
from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


# Register the same handler for two labels by stacking the decorator.
@router.handler('product')
@router.handler('category')
async def handle_products_and_categories(context: ParselCrawlingContext) -> None:
    context.log.info(f'Handling product or category: {context.request.url}')


@router.default_handler
async def handle_other(context: ParselCrawlingContext) -> None:
    context.log.info(f'Handling other types: {context.request.url}')
```

--------------------------------

### Verify Python and Pip Installation

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Check if Python and pip are installed on your system. These are required for Crawlee installation.

```sh
python --version
```

```sh
python -m pip --version
```

--------------------------------

### Install Playwright Dependencies

Source: https://github.com/apify/crawlee-python/blob/master/README.md

Installs the necessary Playwright browser binaries. This step is required after installing crawlee with Playwright support.

```sh
playwright install
```

--------------------------------

### Configuring FileSystemStorageClient

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Shows how to configure the FileSystemStorageClient using environment variables or the Configuration class. Key options include `storage_dir` and `purge_on_start`.

```python
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient

# Option 1: Using environment variables (e.g., CRAWLEE_STORAGE_DIR='my_custom_storage')

# Option 2: Using the Configuration class
configuration = Configuration(
    storage_dir='./my_custom_storage',
    purge_on_start=False,
)
storage_client = FileSystemStorageClient()

# Pass both the storage client and the configuration to the crawler.
crawler = ParselCrawler(
    storage_client=storage_client,
    configuration=configuration,
)
```

--------------------------------

### Install uv with pip

Source: https://context7.com/apify/crawlee-python/llms.txt

Installs the uv package manager using pip. uv is a fast Python package installer.
```bash
# Install uv first (https://docs.astral.sh/uv/)
pip install uv
```

--------------------------------

### Install Crawlee with Playwright Extra

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Install Crawlee with the `playwright` extra for PlaywrightCrawler support. This also requires installing Playwright dependencies separately.

```sh
python -m pip install 'crawlee[playwright]'
```

```sh
playwright install
```

--------------------------------

### Configuring Storage Clients

Source: https://context7.com/apify/crawlee-python/llms.txt

Demonstrates how to replace the default filesystem storage with alternative backends like in-memory or Redis by passing a storage client to the crawler. Useful for testing or persistent storage needs.

```python
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient, RedisStorageClient

# In-memory (no disk I/O, ideal for tests)
memory_crawler = ParselCrawler(
    storage_client=MemoryStorageClient(),
    max_requests_per_crawl=5,
)

# Redis-backed (persistent, shareable across processes)
redis_crawler = ParselCrawler(
    storage_client=RedisStorageClient(connection_string='redis://localhost:6379'),
    max_requests_per_crawl=5,
)
```

--------------------------------

### Install Crawlee with Parsel Extra

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Install Crawlee with the `parsel` extra, required for using the ParselCrawler.

```sh
python -m pip install 'crawlee[parsel]'
```

--------------------------------

### Implement Custom Storage Client

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storage_clients.mdx

Example of a custom storage client implementing the `StorageClient` interface. This serves as a template for integrating custom storage logic.
```python
from typing import Any

from crawlee.storage_clients import StorageClient


class CustomStorageClientExample(StorageClient):
    # The signatures are abbreviated with **kwargs; check the StorageClient
    # base class for the exact parameters of each factory method.

    async def create_dataset_client(self, **kwargs: Any):
        # Implementation for creating a Dataset client
        raise NotImplementedError

    async def create_kvs_client(self, **kwargs: Any):
        # Implementation for creating a KeyValueStore client
        raise NotImplementedError

    async def create_rq_client(self, **kwargs: Any):
        # Implementation for creating a RequestQueue client
        raise NotImplementedError
```

--------------------------------

### Install Crawlee with BeautifulSoup Extra

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Install Crawlee with the `beautifulsoup` extra, required for using the BeautifulSoupCrawler.

```sh
python -m pip install 'crawlee[beautifulsoup]'
```

--------------------------------

### Basic Request Handlers with Router

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/request_router.mdx

Demonstrates how to set up basic request handlers using the Router class. This example shows how to define handlers for different labels and a default handler for unmatched requests.

```python
from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.handler('product')
async def handle_product(context: ParselCrawlingContext) -> None:
    context.log.info(f'Handling product page: {context.request.url}')


@router.handler('category')
async def handle_category(context: ParselCrawlingContext) -> None:
    context.log.info(f'Handling category page: {context.request.url}')


@router.default_handler
async def handle_default(context: ParselCrawlingContext) -> None:
    context.log.info(f'Handling default page: {context.request.url}')
```

--------------------------------

### Run Documentation Locally

Source: https://github.com/apify/crawlee-python/blob/master/CONTRIBUTING.md

Builds and runs the documentation website locally. Requires Node.js 20+.
```sh
uv run poe run-docs
```

--------------------------------

### Verify Crawlee Installation

Source: https://github.com/apify/crawlee-python/blob/master/README.md

Checks that the crawlee library is installed correctly by printing its version number.

```python
import crawlee; print(crawlee.__version__)
```

--------------------------------

### Log in to Apify CLI

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/deployment/apify_platform.mdx

Install the Apify CLI and log in using your API token. This allows the CLI to authenticate with the Apify platform for subsequent commands.

```bash
npm install -g apify-cli
apify login -t YOUR_API_TOKEN
```

--------------------------------

### Example Crawl Result Data

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/quick-start/index.mdx

This is an example of the JSON data structure that Crawlee saves for each crawled page.

```json
{
  "url": "https://crawlee.dev/",
  "title": "Crawlee · Build reliable crawlers. Fast. | Crawlee"
}
```

--------------------------------

### Check uv Installation

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Verify that the uv package manager is installed on your system. uv is recommended for managing Python environments and dependencies.

```sh
uv --version
```

--------------------------------

### Create New Crawlee Project Directly

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/01_setting_up.mdx

Initialize a new Crawlee project named 'my_crawler' using the Crawlee CLI. This method is suitable if Crawlee is already installed.

```sh
crawlee create my_crawler
```

--------------------------------

### Initialize Dataset and Push Data

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/introduction/07_saving_data.mdx

Import the Dataset class and open an instance during your crawler's setup. Then push the extracted data to this dataset instance from the request handler.

```python
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.storages import Dataset

# ... crawler setup (inside your async main function) ...

# Open the default dataset once, before the crawl starts.
dataset = await Dataset.open()

# ...

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # ... extract manufacturer, title, sku, price and in_stock from the page ...

    data = {
        'manufacturer': manufacturer,
        'title': title,
        'sku': sku,
        'price': price,
        'in_stock': in_stock,
    }

    # Push the data to the dataset.
    await dataset.push_data(data)

    # ...
```

--------------------------------

### Adaptive Playwright Crawler Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/examples/playwright_crawler_adaptive.mdx

This example demonstrates how to use AdaptivePlaywrightCrawler. It combines Playwright and HTTP-based crawling and switches between them for performance. Pre-navigation hooks run before the crawler navigates to a URL and can be restricted to browser-based navigation only.

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    # The factory method creates a crawler that parses static (HTTP-only) pages with Parsel.
    crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
        # Options forwarded to the Playwright sub-crawler; show the browser window.
        playwright_crawler_specific_kwargs={'headless': False},
    )

    # Default handler, called for every request regardless of the crawling method used.
    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # This hook runs only before browser-based (Playwright) navigation.
    @crawler.pre_navigation_hook(playwright_only=True)
    async def log_navigation(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} with Playwright')

    # Start the crawler with the initial URL.
    await crawler.run(['https://www.example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Basic Key-Value Store Operations in Python

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/storages.mdx

Demonstrates fundamental operations for the Key-Value Store, including saving and retrieving data using keys. The store must be opened before it can be used.

```python
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    kvs = await KeyValueStore.open()

    # Save data to the default key-value store
    await kvs.set_value('my-key', 'my-value')

    # Retrieve data from the default key-value store
    retrieved_value = await kvs.get_value('my-key')
    print(f'Retrieved value: {retrieved_value}')


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Playwright Crawler with Fingerprint Generator Example

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/examples/playwright_crawler_with_fingerprint_generator.mdx

Use this example to configure PlaywrightCrawler with a fingerprint generator. Initialize the generator with the desired fingerprint options to mimic real browser fingerprints. Unspecified options are selected automatically.

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.fingerprint_suite import (
    DefaultFingerprintGenerator,
    HeaderGeneratorOptions,
    ScreenOptions,
)
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    # Initialize the fingerprint generator with the desired options.
    # Options left unspecified are selected automatically from reasonable values.
    # If an option matters to you, define it explicitly instead of relying on the default.
    fingerprint_generator = DefaultFingerprintGenerator(
        header_options=HeaderGeneratorOptions(browsers=['chrome']),
        screen_options=ScreenOptions(min_width=400),
    )

    # Configure PlaywrightCrawler with the fingerprint generator
    crawler = PlaywrightCrawler(
        storage_client=MemoryStorageClient(),
        fingerprint_generator=fingerprint_generator,
    )

    # Define the request handler
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Visiting {context.request.url}...')
        await context.page.wait_for_timeout(1000)  # Wait for 1000 ms

    await crawler.run(['https://apify.com'])
    print('Crawling finished.')


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Register StorageClient via Service Locator

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/service_locator.mdx

Demonstrates how to register a storage client with the global `service_locator`. All components that do not receive an explicit storage client will then use the registered instance.

```python
from crawlee import service_locator
from crawlee.storage_clients import MemoryStorageClient

# Any StorageClient implementation works here, including your own subclass.
# Register it early, before any storage or crawler is created.
service_locator.set_storage_client(MemoryStorageClient())
```

--------------------------------

### Provide Storage Client to Storage

Source: https://github.com/apify/crawlee-python/blob/master/website/versioned_docs/version-1.6/guides/service_locator.mdx

Instantiate a storage client and pass it directly to the storage's `open()` method to use it for that specific instance only.

```python
from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset

# Inside an async function: provide a storage client to a specific storage instance
dataset = await Dataset.open(storage_client=MemoryStorageClient())
```
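
The two approaches above, global registration and per-instance injection, follow a simple precedence rule: an explicitly passed client wins, otherwise the globally registered one is used. The sketch below illustrates that rule without any library; `ServiceLocator`, `Storage`, `MemoryClient`, and `FileSystemClient` are hypothetical stand-ins written for this illustration and are not Crawlee's actual implementation.

```python
from __future__ import annotations


class StorageClient:
    """Hypothetical minimal storage backend (illustration only)."""

    name = 'base'


class FileSystemClient(StorageClient):
    name = 'filesystem'


class MemoryClient(StorageClient):
    name = 'memory'


class ServiceLocator:
    """Hypothetical global registry that holds the default storage client."""

    def __init__(self) -> None:
        self._storage_client: StorageClient | None = None

    def set_storage_client(self, client: StorageClient) -> None:
        self._storage_client = client

    def get_storage_client(self) -> StorageClient:
        # Assumed fallback: a filesystem client when nothing was registered.
        if self._storage_client is None:
            self._storage_client = FileSystemClient()
        return self._storage_client


locator = ServiceLocator()


class Storage:
    """Hypothetical storage that resolves its client at construction time."""

    def __init__(self, storage_client: StorageClient | None = None) -> None:
        # An explicitly passed client takes precedence over the global one.
        if storage_client is not None:
            self.client = storage_client
        else:
            self.client = locator.get_storage_client()


def demo() -> tuple[str, str]:
    # Register a global default, then override it for one instance only.
    locator.set_storage_client(MemoryClient())
    global_storage = Storage()
    local_storage = Storage(storage_client=FileSystemClient())
    return global_storage.client.name, local_storage.client.name
```

Calling `demo()` shows that the first storage picks up the globally registered client while the second keeps the client it was given, which is the behavior the two snippets above rely on.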