### Initialize Apify Project

Source: https://crawlee.dev/python/docs/introduction/deployment

Use the Apify CLI to initialize your project. This command guides you through the setup process and creates the necessary configuration files.

```bash
apify init
```

--------------------------------

### Basic KeyValueStore Usage

Source: https://crawlee.dev/python/docs/guides/storages

Demonstrates the fundamental operations of opening, setting, getting, and deleting values in a KeyValueStore. Ensure you have the 'crawlee' library installed. The example opens a named store; if no name is provided, the default KeyValueStore is used.

```python
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store, if it does not exist, it will be created.
    # Leave name empty to use the default KVS.
    kvs = await KeyValueStore.open(name='my-key-value-store')

    # Set a value associated with 'some-key'.
    await kvs.set_value(key='some-key', value={'foo': 'bar'})

    # Get the value associated with 'some-key' (get_value is a coroutine).
    value = await kvs.get_value('some-key')
    # Do something with it...

    # Delete the value associated with 'some-key' by setting it to None.
    await kvs.set_value(key='some-key', value=None)

    # Remove the key-value store.
    await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Router Usage Example

Source: https://crawlee.dev/python/api/class/Router

Demonstrates how to set up a Router with middleware, a default handler, and specific handlers for 'category' and 'product' labels. This setup is then used with an HttpCrawler.
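Before the Crawlee code, it may help to see label-based dispatch in isolation. The following is a minimal sketch in plain Python, not Crawlee's implementation: `MiniRouter` and `FakeRequest` are hypothetical names invented here, middleware is left out, and the fallback to the default handler is modeled on the description above.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a crawl request (not a Crawlee class)."""

    url: str
    label: Optional[str] = None


Handler = Callable[[FakeRequest], Awaitable[str]]


class MiniRouter:
    """Toy label-based dispatcher that only illustrates the routing idea."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        # Register the fallback used when no label-specific handler matches.
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        # Return a decorator that registers `func` under `label`.
        def decorator(func: Handler) -> Handler:
            self._handlers[label] = func
            return func

        return decorator

    async def __call__(self, request: FakeRequest) -> str:
        # Prefer the handler registered for the request's label, if any.
        chosen = self._handlers.get(request.label) if request.label else None
        chosen = chosen or self._default
        if chosen is None:
            raise RuntimeError(f'No handler for label {request.label!r}')
        return await chosen(request)


router = MiniRouter()


@router.default_handler
async def handle_default(request: FakeRequest) -> str:
    return f'default: {request.url}'


@router.handler('product')
async def handle_product(request: FakeRequest) -> str:
    return f'product: {request.url}'


async def demo() -> list:
    # One labeled request and one unlabeled request.
    return [
        await router(FakeRequest('https://example.com/p/1', label='product')),
        await router(FakeRequest('https://example.com/')),
    ]


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

The decorators return the original function unchanged, so registration has no effect on how the handlers can be called directly. The Crawlee version of this setup, including middleware, follows.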
```python from crawlee.crawlers import HttpCrawler, HttpCrawlingContext from crawlee.router import Router router = Router[HttpCrawlingContext]() # Middleware executed for every request before the handlers @router.use async def logging_middleware(context: HttpCrawlingContext) -> None: context.log.info(f'Processing request: {context.request.url} label={context.request.label}') # Handler for requests without a matching label handler @router.default_handler async def default_handler(context: HttpCrawlingContext) -> None: context.log.info(f'Request without label {context.request.url} ...') # Handler for category requests @router.handler(label='category') async def category_handler(context: HttpCrawlingContext) -> None: context.log.info(f'Category request {context.request.url} ...') # Handler for product requests @router.handler(label='product') async def product_handler(context: HttpCrawlingContext) -> None: context.log.info(f'Product {context.request.url} ...') async def main() -> None: crawler = HttpCrawler(request_handler=router) await crawler.run() ``` -------------------------------- ### Browser Launch Hooks Example Source: https://crawlee.dev/python/docs/guides/playwright-crawler Demonstrates how to use pre_launch_hook and post_launch_hook with BrowserPool to log browser launch events. This setup is useful for monitoring and debugging browser instance lifecycles. 
```python from __future__ import annotations import asyncio import logging from typing import TYPE_CHECKING from crawlee.browsers import BrowserPool from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext if TYPE_CHECKING: from crawlee.browsers._browser_controller import BrowserController from crawlee.browsers._browser_plugin import BrowserPlugin logger = logging.getLogger(__name__) async def main() -> None: async with BrowserPool() as browser_pool: @browser_pool.pre_launch_hook async def log_browser_launch(page_id: str, plugin: BrowserPlugin) -> None: """Log before a new browser instance is launched.""" logger.info(f'Launching {plugin.browser_type} browser for page {page_id}...') @browser_pool.post_launch_hook async def log_browser_launched( page_id: str, controller: BrowserController ) -> None: """Log after a new browser instance has been launched.""" logger.info(f'Browser launched for page {page_id}, controller: {controller}') crawler = PlaywrightCrawler( browser_pool=browser_pool, max_requests_per_crawl=5, ) @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') await context.enqueue_links() # Run the crawler with the initial list of URLs. await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` -------------------------------- ### Install Crawlee with Multiple Extras Source: https://crawlee.dev/python/docs/introduction/setting-up Install Crawlee with several optional extras simultaneously by separating them with commas. This example installs 'beautifulsoup' and 'curl-impersonate' extras. ```bash python -m pip install 'crawlee[beautifulsoup,curl-impersonate]' ``` -------------------------------- ### Check uv Installation Source: https://crawlee.dev/python/docs/introduction/setting-up Verify if uv is installed on your system. If not, follow the official installation guide. 
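If you would rather run this check from a Python setup script, a minimal standard-library sketch could look like the following. The helper name `find_uv_version` is made up for illustration; it only looks for a `uv` executable on `PATH` and asks it for its version.

```python
import shutil
import subprocess
from typing import Optional


def find_uv_version() -> Optional[str]:
    """Return the output of `uv --version`, or None if uv is not usable."""
    uv_path = shutil.which('uv')
    if uv_path is None:
        # No uv executable found on PATH.
        return None
    try:
        result = subprocess.run(
            [uv_path, '--version'],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The executable exists but could not be started.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


if __name__ == '__main__':
    version = find_uv_version()
    print(version if version else 'uv not found; see the official installation guide')
```

The shell equivalent is the one-liner below.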
```bash
uv --version
```

--------------------------------

### Custom Request Router Setup

Source: https://crawlee.dev/python/docs/guides/request-router

Define a custom router instance and set up default and label-specific handlers for different request types. This example shows how to handle home pages, categories, and products.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router


async def main():
    # Create a custom router instance
    router = Router[ParselCrawlingContext]()

    # Define the default handler (fallback for requests without specific labels)
    @router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing homepage: {context.request.url}')

        # Extract page title
        title = context.selector.css('title::text').get() or 'No title found'

        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
                'page_type': 'homepage',
            }
        )

        # Find and enqueue collection/category links
        await context.enqueue_links(selector='a[href*="/collections/"]', label='CATEGORY')

    # Define a handler for category pages
    @router.handler('CATEGORY')
    async def category_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing category page: {context.request.url}')

        # Extract category information
        category_title = context.selector.css('h1::text').get() or 'Unknown Category'
        product_count = len(context.selector.css('.product-item').getall())

        await context.push_data(
            {
                'url': context.request.url,
                'type': 'category',
                'category_title': category_title,
                'product_count': product_count,
                'handler': 'category',
            }
        )

        # Enqueue product links from this category
        await context.enqueue_links(selector='a[href*="/products/"]', label='PRODUCT')

    # Define a handler for product detail pages
    @router.handler('PRODUCT')
    async def product_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing product page: {context.request.url}')

        # Extract detailed product information
        product_data = {
            'url': context.request.url,
            'name': context.selector.css('h1::text').get(),
            'price': context.selector.css('.price::text').get(),
            'description': context.selector.css('.product-description p::text').get(),
            'images': context.selector.css('.product-gallery img::attr(src)').getall(),
            'in_stock': bool(context.selector.css('.add-to-cart-button').get()),
            'handler': 'product',
        }

        await context.push_data(product_data)

    # Create crawler with the router
    crawler = ParselCrawler(
        request_handler=router,
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    # Start crawling with some initial requests
    await crawler.run(
        [
            # Will use default handler
            'https://warehouse-theme-metal.myshopify.com/',
            # Will use category handler
            Request(
                url='https://warehouse-theme-metal.myshopify.com/collections/all',
                label='CATEGORY',
            ),
        ]
    )


if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### start

Source: https://crawlee.dev/python/api/class/SitemapRequestLoader

Starts the sitemap loading process. This method is typically called automatically when entering the async context manager.

```APIDOC
## start

### Description
Starts the sitemap loading process. This method is typically called automatically when entering the async context manager.

### Method
POST (conceptual, as it's an SDK method)

### Endpoint
N/A (SDK method)

### Parameters
None

### Request Example
```python
await sitemap_loader.start()
```

### Response
#### Success Response
This method does not return a value.
```

--------------------------------

### Install Apify CLI and Log In

Source: https://crawlee.dev/python/docs/deployment/apify-platform

Installs the Apify CLI globally and logs in using an API token. This is useful for managing Apify platform access from your local machine.
```bash npm install -g apify-cli apify login -t YOUR_API_TOKEN ``` -------------------------------- ### start Source: https://crawlee.dev/python/api/class/SitemapRequestLoader Initiates the sitemap loading process. This method is specific to the SitemapRequestLoader. ```APIDOC ## start ### Description Starts the sitemap loading process. ### Method async ### Returns None ``` -------------------------------- ### Install Crawlee with all extras Source: https://crawlee.dev/python/docs/guides/http-clients Install Crawlee with all available extras to enable all HTTP clients and features. This is a convenient option for accessing the full range of Crawlee's capabilities. ```bash python -m pip install 'crawlee[all]' ``` -------------------------------- ### Install Crawlee with httpx extra Source: https://crawlee.dev/python/docs/guides/http-clients Install Crawlee with the `httpx` extra to use the HttpxHttpClient. This client is built on the popular httpx library. ```bash python -m pip install 'crawlee[httpx]' ``` -------------------------------- ### __init__ Source: https://crawlee.dev/python/api/class/KeyValueStore Initializes a new instance of the KeyValueStore. It's recommended to use the KeyValueStore.open constructor instead. ```APIDOC ## __init__ ### Description Initializes a new instance of the KeyValueStore. It's recommended to use the `KeyValueStore.open` constructor instead. ### Parameters * **client**: KeyValueStoreClient - An instance of a storage client. * **id**: str - The unique identifier of the storage. * **name**: str | None - The name of the storage, if available. ### Returns None ``` -------------------------------- ### Open FileSystemKeyValueStoreClient Source: https://crawlee.dev/python/api/class/FileSystemKeyValueStoreClient Use the `open` class method to initialize a new instance. It attempts to open an existing store or creates a new one if none is found. 
```python client = await FileSystemKeyValueStoreClient.open( id='my-store-id', name='my-store-name', alias='my-store-alias', configuration=Configuration() ) ``` -------------------------------- ### FileSystemKeyValueStoreClient.__init__ Source: https://crawlee.dev/python/api/class/FileSystemKeyValueStoreClient Initializes a new instance of FileSystemKeyValueStoreClient. It is recommended to use the `FileSystemKeyValueStoreClient.open` class method instead of this constructor. ```APIDOC ## __init__ ### Description Initialize a new instance. Preferably use the `FileSystemKeyValueStoreClient.open` class method to create a new instance. ### Parameters * `metadata` (KeyValueStoreMetadata) - Keyword-only. The metadata for the key-value store. * `path_to_kvs` (Path) - Keyword-only. The path to the key-value store directory. * `lock` (asyncio.Lock) - Keyword-only. An asyncio lock for synchronization. ### Returns None ``` -------------------------------- ### Adaptive Playwright Crawler Example Source: https://crawlee.dev/python/docs/guides/request-router This example shows how to initialize and configure an AdaptivePlaywrightCrawler with pre-navigation hooks for common and Playwright-specific setups. It also includes a default handler for extracting page titles and links. ```python import asyncio from crawlee import HttpHeaders from crawlee.crawlers import ( AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext, AdaptivePlaywrightPreNavCrawlingContext, ) async def main() -> None: crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser( max_requests_per_crawl=10, # Limit the max requests per crawl. ) @crawler.pre_navigation_hook async def common_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None: # Common pre-navigation hook - runs for both HTTP and browser requests. 
context.request.headers |= HttpHeaders( {'Accept': 'text/html,application/xhtml+xml'}, ) @crawler.pre_navigation_hook(playwright_only=True) async def browser_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None: # Playwright-specific pre-navigation hook - runs only when browser is used. await context.page.set_viewport_size({'width': 1280, 'height': 720}) if context.block_requests: await context.block_requests(extra_url_patterns=['*.css', '*.js']) @crawler.router.default_handler async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None: # Extract title using the unified context interface. title_tag = context.parsed_content.find('title') title = title_tag.get_text() if title_tag else None # Extract other data consistently across both modes. links = [a.get('href') for a in context.parsed_content.find_all('a', href=True)] await context.push_data( { 'url': context.request.url, 'title': title, 'links': links, } ) await crawler.run(['https://crawlee.dev/']) if __name__ == '__main__': asyncio.run(main()) ``` -------------------------------- ### Initialize ThrottlingRequestManager Source: https://crawlee.dev/python/api/class/ThrottlingRequestManager Example of initializing the ThrottlingRequestManager with an inner RequestQueue, specified domains, and a request manager opener callback. This setup is used when creating a BasicCrawler. ```python from crawlee.crawlers import BasicCrawler from crawlee.request_loaders import ThrottlingRequestManager from crawlee.storages import RequestQueue queue = await RequestQueue.open() throttler = ThrottlingRequestManager( inner=queue, domains=['api.example.com', 'slow-site.org'], request_manager_opener=RequestQueue.open, ) crawler = BasicCrawler(request_manager=throttler) ``` -------------------------------- ### __init__ Source: https://crawlee.dev/python/api/class/RedisClientMixin Initializes the RedisClientMixin with storage details and a Redis client instance. 
```APIDOC ## __init__ ### Description Initializes the RedisClientMixin with storage details and a Redis client instance. ### Method __init__ ### Parameters #### Path Parameters * **storage_name** (str) - Description not available * **storage_id** (str) - Description not available * **redis** (Redis) - Description not available ### Returns None ``` -------------------------------- ### __init__ Source: https://crawlee.dev/python/api/class/SqlClientMixin Initializes the SqlClientMixin with a unique ID and a SqlStorageClient instance. ```APIDOC ## __init__ ### Description Initializes the SqlClientMixin with a unique ID and a SqlStorageClient instance. ### Method __init__ ### Parameters #### Keyword-only Parameters * **id** (str) - The unique identifier for the client. * **storage_client** (SqlStorageClient) - The SQL storage client instance. ### Returns None ``` -------------------------------- ### Initialize ImpitHttpClient and HttpCrawler Source: https://crawlee.dev/python/api/class/ImpitHttpClient Demonstrates how to initialize the ImpitHttpClient and integrate it with an HttpCrawler. This setup is typical for starting a crawling process that utilizes this specific HTTP client. ```python from crawlee.crawlers import HttpCrawler # or any other HTTP client-based crawler from crawlee.http_clients import ImpitHttpClient http_client = ImpitHttpClient() crawler = HttpCrawler(http_client=http_client) ``` -------------------------------- ### SqlKeyValueStoreClient.__init__ Source: https://crawlee.dev/python/api/class/SqlKeyValueStoreClient Initializes a new instance of SqlKeyValueStoreClient. It is recommended to use the `SqlKeyValueStoreClient.open` class method for instantiation. ```APIDOC ## __init__ ### Description Initializes a new instance. Preferably use the `SqlKeyValueStoreClient.open` class method to create a new instance. ### Parameters * **storage_client** (SqlStorageClient) - Keyword-only. The SQL storage client. * **id** (str) - Keyword-only. 
The ID of the key-value store. ### Returns None ``` -------------------------------- ### __init__ Source: https://crawlee.dev/python/api/class/ContextPipeline Initializes a new instance of the ContextPipeline. ```APIDOC ## __init__ ### Description Initializes a new instance of the ContextPipeline. ### Method CONSTRUCTOR ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Parameters * **_middleware**: Callable[[TCrawlingContext], AsyncGenerator[TMiddlewareCrawlingContext, Exception | None]] | None = None * **_parent**: [ContextPipeline](https://crawlee.dev/python/python/api/class/ContextPipeline.md)[BasicCrawlingContext] | None = None ### Returns None ``` -------------------------------- ### Scraping and Storing Data with BeautifulSoupCrawler Source: https://crawlee.dev/python/docs/guides/storages This example shows how to use BeautifulSoupCrawler to scrape data from a website and push it to a named dataset. The dataset is then exported as a CSV file. Ensure you have the necessary libraries installed. ```python import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext from crawlee.storages import Dataset async def main() -> None: # Open the dataset, if it does not exist, it will be created. # Leave name empty to use the default dataset. dataset = await Dataset.open(name='my-dataset') # Create a new crawler (it can be any subclass of BasicCrawler). crawler = BeautifulSoupCrawler() # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Extract data from the page. data = { 'url': context.request.url, 'title': context.soup.title.string if context.soup.title else None, } # Push the extracted data to the dataset. await dataset.push_data(data) # Run the crawler with the initial URLs. 
await crawler.run(['https://crawlee.dev']) # Export the dataset to the key-value store. await dataset.export_to(key='dataset', content_type='csv') if __name__ == '__main__': asyncio.run(main()) ``` -------------------------------- ### __init__ Source: https://crawlee.dev/python/api/class/Session Initializes a new Session instance with various configuration options. ```APIDOC ## __init__ ### Description Initializes a new instance of the Session class. ### Method __init__ ### Parameters #### Keyword-Only Parameters - **id** (str | None) - Optional. Unique identifier for the session, autogenerated if not provided. - **max_age** (timedelta) - Optional. Time duration after which the session expires. Defaults to timedelta(minutes=50). - **user_data** (Mapping[str, JsonSerializable] | None) - Optional. Custom user data associated with the session. - **max_error_score** (float) - Optional. Threshold score beyond which the session is considered blocked. Defaults to 3.0. - **error_score_decrement** (float) - Optional. Value by which the error score is decremented on successful operations. Defaults to 0.5. - **created_at** (datetime | None) - Optional. Timestamp when the session was created, defaults to current UTC time if not provided. - **usage_count** (int) - Optional. Number of times the session has been used. Defaults to 0. - **max_usage_count** (int) - Optional. Maximum allowable uses of the session before it is considered expired. Defaults to 50. - **error_score** (float) - Optional. Current error score of the session. Defaults to 0.0. - **cookies** (SessionCookies | CookieJar | dict[str, str] | list[CookieParam] | None) - Optional. Cookies associated with the session. - **blocked_status_codes** (list | None) - Optional. HTTP status codes that indicate a session should be blocked. 
### Returns - **None** ``` -------------------------------- ### PlaywrightCrawler Setup and Request Handling Source: https://crawlee.dev/python/docs/introduction/scraping Initializes PlaywrightCrawler and defines handlers for different request labels (start, category, detail). Use this for setting up the crawler and defining the logic for processing various page types. ```python import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext async def main() -> None: crawler = PlaywrightCrawler( # Let's limit our crawls to make our tests shorter and safer. max_requests_per_crawl=10, ) @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: context.log.info(f'Processing {context.request.url}') # We're not processing detail pages yet, so we just pass. if context.request.label == 'DETAIL': # Split the URL and get the last part to extract the manufacturer. url_part = context.request.url.split('/').pop() manufacturer = url_part.split('-')[0] # Extract the title using the combined selector. title = await context.page.locator('.product-meta h1').text_content() # Extract the SKU using its selector. sku = await context.page.locator( 'span.product-meta__sku-number' ).text_content() # Locate the price element that contains the '$' sign and filter out # the visually hidden elements. price_element = context.page.locator('span.price', has_text='$').first current_price_string = await price_element.text_content() or '' raw_price = current_price_string.split('$')[1] price = float(raw_price.replace(',', '')) # Locate the element that contains the text 'In stock' # and filter out other elements. in_stock_element = context.page.locator( selector='span.product-form__inventory', has_text='In stock', ).first in_stock = await in_stock_element.count() > 0 # Put it all together in a dictionary. 
data = { 'manufacturer': manufacturer, 'title': title, 'sku': sku, 'price': price, 'in_stock': in_stock, } # Print the extracted data. context.log.info(data) # We are now on a category page. We can use this to paginate through and # enqueue all products, as well as any subsequent pages we find. elif context.request.label == 'CATEGORY': # Wait for the product items to render. await context.page.wait_for_selector('.product-item > a') # Enqueue links found within elements matching the provided selector. # These links will be added to the crawling queue with the label DETAIL. await context.enqueue_links( selector='.product-item > a', label='DETAIL', ) # Find the "Next" button to paginate through the category pages. next_button = await context.page.query_selector('a.pagination__next') # If a "Next" button is found, enqueue the next page of results. if next_button: await context.enqueue_links( selector='a.pagination__next', label='CATEGORY', ) # This indicates we're on the start page with no specific label. # On the start page, we want to enqueue all the category pages. else: # Wait for the collection cards to render. await context.page.wait_for_selector('.collection-block-item') # Enqueue links found within elements matching the provided selector. # These links will be added to the crawling queue with the label CATEGORY. await context.enqueue_links( selector='.collection-block-item', label='CATEGORY', ) await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']) if __name__ == '__main__': asyncio.run(main()) ``` -------------------------------- ### __init__ Source: https://crawlee.dev/python/api/class/SqlStorageClient Initializes the SqlStorageClient with a database connection string or a pre-configured engine. ```APIDOC ## __init__ ### Description Initializes the SQL storage client. ### Parameters #### Path Parameters * **connection_string** (str) - Optional - Database connection string (e.g., "sqlite+aiosqlite:///crawlee.db"). 
Defaults to SQLite database in the storage directory if not provided.
* **engine** (AsyncEngine) - Optional - Pre-configured AsyncEngine instance. If provided, connection_string is ignored.

### Returns
None
```

--------------------------------

### Camoufox Integration with PlaywrightCrawler

Source: https://crawlee.dev/python/docs/guides/avoid-blocking

This example demonstrates how to create a custom Playwright browser plugin that uses Camoufox. It overrides the default browser launch to improve anti-detection capabilities. Camoufox is an external package and must be installed separately.

```python
import asyncio

# Camoufox is an external package and needs to be installed separately.
from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    """Example browser plugin that uses Camoufox browser,
    but otherwise keeps the functionality of PlaywrightBrowserPlugin.
    """

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            # Forward the plugin's launch options to Camoufox as keyword arguments.
            browser=await AsyncNewBrowser(
                self._playwright, **self._browser_launch_options
            ),
            # Increase, if camoufox can handle it in your use case.
            max_open_pages_per_browser=1,
            # This turns off the crawler's header generation. Camoufox has its own.
            header_generator=None,
        )


async def main():
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Custom browser pool. Gives users full control over browsers used by the crawler.
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract some data from the page using Playwright's API.
        posts = await context.page.query_selector_all('.athing')
        for post in posts:
            # Get the title link element within each post.
            title_element = await post.query_selector('.title a')

            # Extract the data we want from the elements.
            title = await title_element.inner_text() if title_element else None

            # Push the extracted data to the default dataset.
            await context.push_data({'title': title})

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/_ProxyTierTracker

Initializes the _ProxyTierTracker with a list of proxy URLs for each tier.

```APIDOC
## __init__

### Description
Initializes the _ProxyTierTracker with a list of proxy URLs for each tier.

### Parameters
* **tiered_proxy_urls** (list[list[URL | None]]) - The list of proxy URLs, where each inner list represents a tier.

### Returns
None
```

--------------------------------

### CloakBrowser Plugin for PlaywrightCrawler

Source: https://crawlee.dev/python/docs/guides/avoid-blocking

Example of a custom Playwright browser plugin that uses CloakBrowser's patched Chromium. It keeps the functionality of PlaywrightBrowserPlugin while applying CloakBrowser's fingerprinting defenses. CloakBrowser must be installed separately.
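Apart from pointing Playwright at the patched binary, the plugin's main job is merging its stealth arguments with whatever launch options the user supplied. A minimal sketch of that merge in plain Python is shown here, independent of CloakBrowser and Crawlee: `merge_launch_options` is a hypothetical helper, and the flags `--a` and `--b` in the usage lines are placeholders rather than real Chromium switches.

```python
from typing import Any, Dict, Mapping, Sequence


def merge_launch_options(
    user_options: Mapping[str, Any],
    stealth_args: Sequence[str],
    overridden_keys: Sequence[str] = ('executable_path', 'chromium_sandbox'),
) -> Dict[str, Any]:
    """Combine user launch options with extra args without mutating the input."""
    merged = dict(user_options)

    # Drop options that the plugin sets itself, so they cannot conflict.
    for key in overridden_keys:
        merged.pop(key, None)

    # Keep the user's args first, then append the stealth args as a flat list.
    existing_args = list(merged.pop('args', []))
    merged['args'] = [*existing_args, *stealth_args]
    return merged


if __name__ == '__main__':
    options = {'headless': True, 'args': ['--a'], 'executable_path': '/tmp/chrome'}
    print(merge_launch_options(options, ['--b']))
```

Unpacking both lists separately (`[*existing_args, *stealth_args]`) matters: wrapping them in a tuple first would nest the user's list inside the result instead of flattening it. The full plugin below applies the same merge inside `new_browser`.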
```python
import asyncio

from cloakbrowser.config import IGNORE_DEFAULT_ARGS, get_default_stealth_args
from cloakbrowser.download import ensure_binary
from typing_extensions import override

from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


class CloakBrowserPlugin(PlaywrightBrowserPlugin):
    """Example browser plugin that uses CloakBrowser's patched Chromium,
    but otherwise keeps the functionality of PlaywrightBrowserPlugin.
    """

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        binary_path = ensure_binary()
        stealth_args = get_default_stealth_args()

        # Merge CloakBrowser stealth args with any user-provided launch options.
        launch_options = dict(self._browser_launch_options)
        launch_options.pop('executable_path', None)
        launch_options.pop('chromium_sandbox', None)
        existing_args = list(launch_options.pop('args', []))
        # Flatten both lists into one list of arguments.
        launch_options['args'] = [*existing_args, *stealth_args]

        return PlaywrightBrowserController(
            browser=await self._playwright.chromium.launch(
                executable_path=binary_path,
                ignore_default_args=IGNORE_DEFAULT_ARGS,
                **launch_options,
            ),
            max_open_pages_per_browser=1,
            # CloakBrowser handles fingerprinting at the binary level.
            header_generator=None,
        )


async def main():
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Custom browser pool. Gives users full control over browsers used by the crawler.
        browser_pool=BrowserPool(plugins=[CloakBrowserPlugin()]),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract some data from the page using Playwright's API.
        posts = await context.page.query_selector_all('.athing')
        for post in posts:
            # Get the title link element within each post.
            title_element = await post.query_selector('.title a')

            # Extract the data we want from the elements.
            title = await title_element.inner_text() if title_element else None

            # Push the extracted data to the default dataset.
            await context.push_data({'title': title})

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == "__main__":
    asyncio.run(main())
```

--------------------------------

### AdaptivePlaywrightCrawler Usage

Source: https://crawlee.dev/python/api/class/AdaptivePlaywrightCrawler

Demonstrates how to initialize and use the AdaptivePlaywrightCrawler with a default request handler. This example shows setting crawler options, defining a handler for processing page content and enqueuing links, and running the crawler on a starting URL.

```APIDOC
## AdaptivePlaywrightCrawler Usage

### Description
This example demonstrates the basic setup and usage of the `AdaptivePlaywrightCrawler`. It shows how to instantiate the crawler with specific configurations, define a handler function to process crawled pages, and initiate the crawling process.

### Initialization
```python
from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    max_requests_per_crawl=10,  # Limit the max requests per crawl.
    playwright_crawler_specific_kwargs={'browser_type': 'chromium'},
)
```

### Request Handler
```python
@crawler.router.default_handler
async def request_handler_for_label(context: AdaptivePlaywrightCrawlingContext) -> None:
    # Do some processing using `parsed_content`
    context.log.info(context.parsed_content.title)

    # Locate element h2 within 5 seconds
    h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
    # Do stuff with element found by the selector
    context.log.info(h2)

    # Find more links and enqueue them.
    await context.enqueue_links()
    # Save some data.
    await context.push_data({'Visited url': context.request.url})
```

### Running the Crawler
```python
await crawler.run(['https://crawlee.dev/'])
```
```

--------------------------------

### ServiceLocator.__init__

Source: https://crawlee.dev/python/api/class/ServiceLocator

Initializes the ServiceLocator with optional configuration, event manager, and storage client.

```APIDOC
## ServiceLocator.__init__

### Description
Initializes the ServiceLocator with optional configuration, event manager, and storage client.

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Method Signature
`__init__(configuration: Configuration | None = None, event_manager: EventManager | None = None, storage_client: StorageClient | None = None) -> None`

### Parameters
* **configuration** (Configuration | None) - Optional: The configuration instance to set. Defaults to None.
* **event_manager** (EventManager | None) - Optional: The event manager instance to set. Defaults to None.
* **storage_client** (StorageClient | None) - Optional: The storage client instance to set. Defaults to None.

### Returns
None
```

--------------------------------

### Adaptive Playwright Crawler Example

Source: https://crawlee.dev/python/docs/guides/request-router

Demonstrates the setup and usage of the AdaptivePlaywrightCrawler, including common and Playwright-specific pre-navigation hooks for handling both static and dynamic content. It shows how to set viewport size and block certain resource types.

```python
import asyncio

from crawlee import HttpHeaders
from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)


async def main():
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.pre_navigation_hook
    async def common_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Common pre-navigation hook - runs for both HTTP and browser requests.
        context.request.headers |= HttpHeaders({
            'Accept': 'text/html,application/xhtml+xml'
        })

    @crawler.pre_navigation_hook(playwright_only=True)
    async def browser_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Playwright-specific pre-navigation hook - runs only when browser is used.
        await context.page.set_viewport_size({'width': 1280, 'height': 720})
        if context.block_requests:
            await context.block_requests(extra_url_patterns=['*.css', '*.js'])

    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Extract title using the unified context interface.
        title_tag = context.parsed_content.find('title')
        title = title_tag.get_text() if title_tag else None

        # Extract other data consistently across both modes.
        links = [a.get('href') for a in context.parsed_content.find_all('a', href=True)]

        await context.push_data({
            'url': context.request.url,
            'title': title,
            'links': links,
        })

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Sanity Check with PlaywrightCrawler

Source: https://crawlee.dev/python/docs/introduction/real-world-project

Use this snippet to verify your PlaywrightCrawler setup by visiting a start URL and extracting specific text content from rendered elements. It waits for elements to load and then evaluates JavaScript to extract data, logging the results.

```python
import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for the collection cards to render on the page. This ensures that
        # the elements we want to interact with are present in the DOM.
        await context.page.wait_for_selector('.collection-block-item')

        # Execute a function within the browser context to target the collection
        # card elements and extract their text content, trimming any leading or
        # trailing whitespace.
        category_texts = await context.page.eval_on_selector_all(
            '.collection-block-item',
            '(els) => els.map(el => el.textContent.trim())',
        )

        # Log the extracted texts.
        for i, text in enumerate(category_texts):
            context.log.info(f'CATEGORY_{i + 1}: {text}')

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/StagehandBrowserController

Initializes a new instance of StagehandBrowserController. It sets up the connection to the browser and configures Stagehand options.

```APIDOC
## __init__

### Description
Initialize a new instance.

### Parameters
#### Keyword-Only Parameters
- **playwright** (Playwright) - Required - Active Playwright instance used to connect to the browser via CDP.
- **stagehand_client** (AsyncStagehand) - Required - Active Stagehand REST client used to start and end sessions.
- **stagehand_options** (StagehandOptions) - Required - Stagehand-specific configuration (model, env, self-heal, etc.).
- **max_open_pages_per_browser** (int) - Optional - Maximum number of pages that can be open at the same time. Defaults to 20.
- **header_generator** (HeaderGenerator | None) - Optional - An optional `HeaderGenerator` instance used to generate and manage HTTP headers for requests made by the browser. By default, a predefined header generator is used. Set to `None` to disable automatic header modifications. Defaults to `_DEFAULT_HEADER_GENERATOR`.

### Returns
None
```

--------------------------------

### RedisKeyValueStoreClient.__init__

Source: https://crawlee.dev/python/api/class/RedisKeyValueStoreClient

Initializes a new instance of the RedisKeyValueStoreClient. It's recommended to use the `RedisKeyValueStoreClient.open` class method for instantiation.

```APIDOC
## __init__

### Description
Initializes a new instance. Preferably use the `RedisKeyValueStoreClient.open` class method to create a new instance.

### Parameters
* **storage_name**: str
* **storage_id**: str
* **redis**: Redis

### Returns
None
```

--------------------------------

### Basic PlaywrightCrawler Usage

Source: https://crawlee.dev/python/api/class/PlaywrightCrawler

Demonstrates the basic setup and usage of PlaywrightCrawler. It defines a default request handler to process URLs, extract page title and response text, and push the data to a dataset. The crawler is then run with a starting URL.

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

# Define the default request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url} ...')

    # Extract data from the page.
    data = {
        'url': context.request.url,
        'title': await context.page.title(),
        'response': (await context.response.text())[:100],
    }

    # Push the extracted data to the default dataset.
    await context.push_data(data)

await crawler.run(['https://crawlee.dev/'])
```

--------------------------------

### _TxtSitemapParser.__init__

Source: https://crawlee.dev/python/api/class/_TxtSitemapParser

Initializes the _TxtSitemapParser.

```APIDOC
## _TxtSitemapParser.__init__

### Description
Initializes the _TxtSitemapParser.

### Method
__init__

### Returns
None
```

--------------------------------

### FastAPI Web Server Setup for Crawlee

Source: https://crawlee.dev/python/docs/guides/running-in-web-server

This code sets up a FastAPI application to serve an HTML index page and an endpoint for scraping URLs. It integrates with Crawlee to handle scraping requests asynchronously. Ensure FastAPI is installed with 'fastapi[standard]'.

```python
from __future__ import annotations

import asyncio
from uuid import uuid4

from fastapi import FastAPI
from starlette.requests import Request
from starlette.responses import HTMLResponse

import crawlee

from .crawler import lifespan

app = FastAPI(lifespan=lifespan, title='Crawler app')


@app.get('/', response_class=HTMLResponse)
def index() -> str:
    return """
To scrape some page, visit "scrape" endpoint with url parameter. For example: /scrape?url=https://www.example.com
"""


@app.get('/scrape')
async def scrape_url(request: Request, url: str | None = None) -> dict:
    if not url:
        return {'url': 'missing', 'scrape result': 'no results'}

    # Generate random unique key for the request
    unique_key = str(uuid4())

    # Set the result future in the result dictionary so that it can be awaited
    request.state.requests_to_results[unique_key] = asyncio.Future[dict[str, str]]()

    # Add the request to the crawler queue
    await request.state.crawler.add_requests(
        [crawlee.Request.from_url(url, unique_key=unique_key)]
    )

    # Wait for the result future to be finished
    result = await request.state.requests_to_results[unique_key]

    # Clean the result from the result dictionary to free up memory
    request.state.requests_to_results.pop(unique_key)

    # Return the result
    return {'url': url, 'scrape result': result}
```

--------------------------------

### Basic BeautifulSoup Crawler Setup and Execution

Source: https://crawlee.dev/python/docs/examples/beautifulsoup-crawler

This snippet shows how to initialize BeautifulSoupCrawler with custom settings like retries and timeouts. It defines a default request handler to extract page title and headings, and then runs the crawler starting from a given URL.

```python
import asyncio
from datetime import timedelta

from crawlee.crawlers import (
    BasicCrawlingContext,
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)


async def main() -> None:
    # Create an instance of the BeautifulSoupCrawler class, a crawler that automatically
    # loads the URLs and parses their HTML using the BeautifulSoup library.
    crawler = BeautifulSoupCrawler(
        # On error, retry each page at most once.
        max_request_retries=1,
        # Increase the timeout for processing each page to 30 seconds.
        request_handler_timeout=timedelta(seconds=30),
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    # The handler receives a context parameter, providing various properties and
    # helper methods. Here are a few key ones we use for demonstration:
    # - request: an instance of the Request class containing details such as the URL
    # being crawled and the HTTP method used.
    # - soup: the BeautifulSoup object containing the parsed HTML of the response.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'h1s': [h1.text for h1 in context.soup.find_all('h1')],
            'h2s': [h2.text for h2 in context.soup.find_all('h2')],
            'h3s': [h3.text for h3 in context.soup.find_all('h3')],
        }

        # Push the extracted data to the default dataset. In local configuration,
        # the data will be stored as JSON files in ./storage/datasets/default.
        await context.push_data(data)

    # Register pre navigation hook which will be called before each request.
    # This hook is optional and does not need to be defined at all.
    @crawler.pre_navigation_hook
    async def some_hook(context: BasicCrawlingContext) -> None:
        pass

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### FileSystemRequestQueueClient.__init__

Source: https://crawlee.dev/python/api/class/FileSystemRequestQueueClient

Initializes a new instance of FileSystemRequestQueueClient. It's recommended to use the `open` class method instead of this constructor.

```APIDOC
## __init__

### Description
Initialize a new instance. Preferably use the `FileSystemRequestQueueClient.open` class method to create a new instance.

### Parameters
* **metadata** (RequestQueueMetadata) - Keyword-only
* **path_to_rq** (Path) - Keyword-only
* **lock** (asyncio.Lock) - Keyword-only
* **recoverable_state** (RecoverableState[RequestQueueState]) - Keyword-only

### Returns
None
```
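
The recommendation to prefer `open` over the constructor follows a common factory pattern: the constructor only stores ready-made collaborators, while an async class method prepares them. The sketch below illustrates that pattern in plain Python. It does not use Crawlee; `SketchQueueClient`, `SketchMetadata`, and the `storage_dir` parameter are hypothetical names invented for illustration, and the real `FileSystemRequestQueueClient.open` signature may differ.

```python
from __future__ import annotations

import asyncio
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for RequestQueueMetadata."""

    name: str


class SketchQueueClient:
    """Hypothetical client showing the constructor/factory split; not part of Crawlee."""

    def __init__(self, *, metadata: SketchMetadata, path_to_rq: Path, lock: asyncio.Lock) -> None:
        # The constructor only stores collaborators that were already prepared.
        self.metadata = metadata
        self.path_to_rq = path_to_rq
        self.lock = lock

    @classmethod
    async def open(cls, *, name: str, storage_dir: Path) -> SketchQueueClient:
        # The factory derives the path, creates the directory, and builds the lock.
        path_to_rq = storage_dir / 'request_queues' / name
        # Run the blocking mkdir call in a worker thread to keep the event loop free.
        await asyncio.to_thread(path_to_rq.mkdir, parents=True, exist_ok=True)
        return cls(
            metadata=SketchMetadata(name=name),
            path_to_rq=path_to_rq,
            lock=asyncio.Lock(),
        )


async def demo() -> SketchQueueClient:
    # Use a throwaway directory so the sketch leaves the working directory untouched.
    storage_dir = Path(tempfile.mkdtemp())
    return await SketchQueueClient.open(name='default', storage_dir=storage_dir)


client = asyncio.run(demo())
```

Callers of such a factory never assemble the metadata, path, or lock themselves, which is presumably why the documentation above steers users toward `open`.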