### Initialize Apify Project

Source: https://crawlee.dev/python/docs/introduction/deployment

Use the Apify CLI to initialize your project. This command guides you through the setup process and creates necessary configuration files.

```bash
apify init
```

--------------------------------

### Basic KeyValueStore Usage

Source: https://crawlee.dev/python/docs/guides/storages

Demonstrates the fundamental KeyValueStore operations: opening a store, then setting, getting, and deleting values. Requires the `crawlee` library. The example opens a named store; omit `name` to use the default KeyValueStore.

```python
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store, if it does not exist, it will be created.
    # Leave name empty to use the default KVS.
    kvs = await KeyValueStore.open(name='my-key-value-store')

    # Set a value associated with 'some-key'.
    await kvs.set_value(key='some-key', value={'foo': 'bar'})

    # Get the value associated with 'some-key'.
    value = await kvs.get_value('some-key')
    # Do something with it...

    # Delete the value associated with 'some-key' by setting it to None.
    await kvs.set_value(key='some-key', value=None)

    # Remove the key-value store.
    await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())
```
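The delete-by-setting-`None` convention above can be tried without Crawlee installed. The sketch below is a toy in-memory stand-in: `ToyKeyValueStore` and `demo` are hypothetical names, and the class only mirrors the `set_value` / `get_value` call shapes from the example. It is not Crawlee's implementation.

```python
import asyncio
from typing import Any, Dict, Tuple


class ToyKeyValueStore:
    """Hypothetical in-memory stand-in; not Crawlee's KeyValueStore."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    async def set_value(self, key: str, value: Any) -> None:
        # Mirror the convention from the example: a value of None deletes the key.
        if value is None:
            self._data.pop(key, None)
        else:
            self._data[key] = value

    async def get_value(self, key: str, default_value: Any = None) -> Any:
        return self._data.get(key, default_value)


async def demo() -> Tuple[Any, Any]:
    kvs = ToyKeyValueStore()
    await kvs.set_value('some-key', {'foo': 'bar'})
    before = await kvs.get_value('some-key')
    await kvs.set_value('some-key', None)
    after = await kvs.get_value('some-key')
    return before, after


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

Both methods are coroutines here, so forgetting `await` on `get_value` would hand back a coroutine object rather than the stored value.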

--------------------------------

### Router Usage Example

Source: https://crawlee.dev/python/api/class/Router

Demonstrates how to set up a Router with middleware, a default handler, and specific handlers for 'category' and 'product' labels. This setup is then used with an HttpCrawler.

```python
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.router import Router

router = Router[HttpCrawlingContext]()


# Middleware executed for every request before the handlers
@router.use
async def logging_middleware(context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing request: {context.request.url} label={context.request.label}')


# Handler for requests without a matching label handler
@router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'Request without label {context.request.url} ...')


# Handler for category requests
@router.handler(label='category')
async def category_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'Category request {context.request.url} ...')


# Handler for product requests
@router.handler(label='product')
async def product_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'Product {context.request.url} ...')


async def main() -> None:
    crawler = HttpCrawler(request_handler=router)
    await crawler.run()
```
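The label-based dispatch described above can be sketched with the standard library alone. This is a simplified illustration, not Crawlee's `Router`: `ToyRouter` is a hypothetical class, requests are plain dicts, and handlers return strings so the routing decision is easy to inspect.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# A handler takes a request-like dict and returns a short description.
Handler = Callable[[dict], Awaitable[str]]


class ToyRouter:
    """Hypothetical label-based dispatcher; not Crawlee's Router."""

    def __init__(self) -> None:
        self._default: Optional[Handler] = None
        self._by_label: Dict[str, Handler] = {}

    def default_handler(self, func: Handler) -> Handler:
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        def register(func: Handler) -> Handler:
            self._by_label[label] = func
            return func

        return register

    async def __call__(self, request: dict) -> str:
        # Use the handler registered for the label, else fall back to the default.
        func = self._by_label.get(request.get('label'), self._default)
        if func is None:
            raise RuntimeError('No handler registered for this request.')
        return await func(request)


router = ToyRouter()


@router.default_handler
async def default_handler(request: dict) -> str:
    return f"default: {request['url']}"


@router.handler(label='product')
async def product_handler(request: dict) -> str:
    return f"product: {request['url']}"


if __name__ == '__main__':
    print(asyncio.run(router({'url': 'https://example.com/p/1', 'label': 'product'})))
```

A request whose label has no registered handler falls through to the default handler in this sketch, which matches the "requests without a matching label handler" comment in the example.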

--------------------------------

### Browser Launch Hooks Example

Source: https://crawlee.dev/python/docs/guides/playwright-crawler

Demonstrates how to use pre_launch_hook and post_launch_hook with BrowserPool to log browser launch events. This setup is useful for monitoring and debugging browser instance lifecycles.

```python
from __future__ import annotations

import asyncio
import logging
from typing import TYPE_CHECKING

from crawlee.browsers import BrowserPool
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


if TYPE_CHECKING:
    from crawlee.browsers._browser_controller import BrowserController
    from crawlee.browsers._browser_plugin import BrowserPlugin


logger = logging.getLogger(__name__)


async def main() -> None:
    async with BrowserPool() as browser_pool:

        @browser_pool.pre_launch_hook
        async def log_browser_launch(page_id: str, plugin: BrowserPlugin) -> None:
            """Log before a new browser instance is launched."""
            logger.info(f'Launching {plugin.browser_type} browser for page {page_id}...')


        @browser_pool.post_launch_hook
        async def log_browser_launched(
            page_id: str, controller: BrowserController
        ) -> None:
            """Log after a new browser instance has been launched."""
            logger.info(f'Browser launched for page {page_id}, controller: {controller}')


        crawler = PlaywrightCrawler(
            browser_pool=browser_pool,
            max_requests_per_crawl=5,
        )


        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url} ...')

            await context.enqueue_links()


        # Run the crawler with the initial list of URLs.
        await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

```

--------------------------------

### Install Crawlee with Multiple Extras

Source: https://crawlee.dev/python/docs/introduction/setting-up

Install Crawlee with several optional extras simultaneously by separating them with commas. This example installs 'beautifulsoup' and 'curl-impersonate' extras.

```bash
python -m pip install 'crawlee[beautifulsoup,curl-impersonate]'
```
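After installing, one way to confirm that optional dependencies are importable is to probe for their modules. This is a minimal sketch: the helper `available_modules` is hypothetical, and the module names `bs4` and `curl_cffi` are assumptions about what these extras install, so adjust them if your environment differs.

```python
import importlib.util
from typing import Dict, List


def available_modules(names: List[str]) -> Dict[str, bool]:
    """Report, for each top-level module name, whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == '__main__':
    # Assumed module names for the 'beautifulsoup' and 'curl-impersonate' extras.
    print(available_modules(['bs4', 'curl_cffi']))
```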

--------------------------------

### Check uv Installation

Source: https://crawlee.dev/python/docs/introduction/setting-up

Verify if uv is installed on your system. If not, follow the official installation guide.

```bash
uv --version
```
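The same check can be scripted, for example in a setup helper. The sketch below uses only the standard library; `find_tool_version` is a hypothetical helper name, and it assumes the tool prints its version to standard output when called with `--version`.

```python
import shutil
import subprocess
from typing import Optional


def find_tool_version(tool: str) -> Optional[str]:
    """Return the tool's `--version` output, or None if it is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run(
        [path, '--version'], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


if __name__ == '__main__':
    print(find_tool_version('uv') or 'uv not found; see the official installation guide.')
```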

--------------------------------

### Custom Request Router Setup

Source: https://crawlee.dev/python/docs/guides/request-router

Define a custom router instance and set up default and specific handlers for different request types. This example shows how to handle home pages, categories, and products.

```python
import asyncio
from crawlee import Request
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

async def main():
    # Create a custom router instance
    router = Router[ParselCrawlingContext]()

    # Define the default handler (fallback for requests without specific labels)
    @router.default_handler
    async def default_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing homepage: {context.request.url}')

        # Extract page title
        title = context.selector.css('title::text').get() or 'No title found'

        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
                'page_type': 'homepage',
            }
        )

        # Find and enqueue collection/category links
        await context.enqueue_links(selector='a[href*="/collections/"]', label='CATEGORY')

    # Define a handler for category pages
    @router.handler('CATEGORY')
    async def category_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing category page: {context.request.url}')

        # Extract category information
        category_title = context.selector.css('h1::text').get() or 'Unknown Category'
        product_count = len(context.selector.css('.product-item').getall())

        await context.push_data(
            {
                'url': context.request.url,
                'type': 'category',
                'category_title': category_title,
                'product_count': product_count,
                'handler': 'category',
            }
        )

        # Enqueue product links from this category
        await context.enqueue_links(selector='a[href*="/products/"]', label='PRODUCT')

    # Define a handler for product detail pages
    @router.handler('PRODUCT')
    async def product_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing product page: {context.request.url}')

        # Extract detailed product information
        product_data = {
            'url': context.request.url,
            'name': context.selector.css('h1::text').get(),
            'price': context.selector.css('.price::text').get(),
            'description': context.selector.css('.product-description p::text').get(),
            'images': context.selector.css('.product-gallery img::attr(src)').getall(),
            'in_stock': bool(context.selector.css('.add-to-cart-button').get()),
            'handler': 'product',
        }

        await context.push_data(product_data)

    # Create crawler with the router
    crawler = ParselCrawler(
        request_handler=router,
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    # Start crawling with some initial requests
    await crawler.run(
        [
            # Will use default handler
            'https://warehouse-theme-metal.myshopify.com/',
            # Will use category handler
            Request(
                url='https://warehouse-theme-metal.myshopify.com/collections/all',
                label='CATEGORY',
            ),
        ]
    )

if __name__ == "__main__":
    asyncio.run(main())

```

--------------------------------

### start

Source: https://crawlee.dev/python/api/class/SitemapRequestLoader

Starts the sitemap loading process. This method is typically called automatically when entering the async context manager.

```APIDOC
## start

### Description
Starts the sitemap loading process. This method is typically called automatically when entering the async context manager.

### Method
async

### Parameters
None

### Returns
None

### Example
await sitemap_loader.start()
```

--------------------------------

### Install Apify CLI and Log In

Source: https://crawlee.dev/python/docs/deployment/apify-platform

Installs the Apify CLI globally and logs in using an API token. This is useful for managing Apify platform access from your local machine.

```bash
npm install -g apify-cli

apify login -t YOUR_API_TOKEN
```

--------------------------------

### start

Source: https://crawlee.dev/python/api/class/SitemapRequestLoader

Initiates the sitemap loading process. This method is specific to the SitemapRequestLoader.

```APIDOC
## start

### Description
Starts the sitemap loading process.

### Method
async

### Returns
None
```
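The reference notes that `start` is typically called automatically when entering the async context manager. The sketch below shows that general pattern with a toy class: `ToyLoader` and its `close` method are hypothetical and are not Crawlee's `SitemapRequestLoader`.

```python
import asyncio
from typing import Tuple


class ToyLoader:
    """Hypothetical loader illustrating start-on-enter; not Crawlee's class."""

    def __init__(self) -> None:
        self.started = False
        self.closed = False

    async def start(self) -> None:
        # Setting a flag stands in for kicking off background loading.
        self.started = True

    async def close(self) -> None:
        self.closed = True

    async def __aenter__(self) -> 'ToyLoader':
        # Entering the context starts the loader, so callers need not call start().
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()


async def demo() -> Tuple[bool, bool]:
    async with ToyLoader() as loader:
        started_inside = loader.started
    return started_inside, loader.closed


if __name__ == '__main__':
    print(asyncio.run(demo()))
```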

--------------------------------

### Install Crawlee with all extras

Source: https://crawlee.dev/python/docs/guides/http-clients

Install Crawlee with all available extras to enable all HTTP clients and features. This is a convenient option for accessing the full range of Crawlee's capabilities.

```bash
python -m pip install 'crawlee[all]'
```

--------------------------------

### Install Crawlee with httpx extra

Source: https://crawlee.dev/python/docs/guides/http-clients

Install Crawlee with the `httpx` extra to use the HttpxHttpClient. This client is built on the popular httpx library.

```bash
python -m pip install 'crawlee[httpx]'
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/KeyValueStore

Initializes a new instance of the KeyValueStore. It's recommended to use the KeyValueStore.open constructor instead.

```APIDOC
## __init__

### Description
Initializes a new instance of the KeyValueStore. It's recommended to use the `KeyValueStore.open` constructor instead.

### Parameters
* **client**: KeyValueStoreClient - An instance of a storage client.
* **id**: str - The unique identifier of the storage.
* **name**: str | None - The name of the storage, if available.

### Returns
None
```

--------------------------------

### Open FileSystemKeyValueStoreClient

Source: https://crawlee.dev/python/api/class/FileSystemKeyValueStoreClient

Use the `open` class method to initialize a new instance. It attempts to open an existing store or creates a new one if none is found.

```python
client = await FileSystemKeyValueStoreClient.open(
    id='my-store-id',
    name='my-store-name',
    alias='my-store-alias',
    configuration=Configuration()
)
```

--------------------------------

### FileSystemKeyValueStoreClient.__init__

Source: https://crawlee.dev/python/api/class/FileSystemKeyValueStoreClient

Initializes a new instance of FileSystemKeyValueStoreClient. It is recommended to use the `FileSystemKeyValueStoreClient.open` class method instead of this constructor.

```APIDOC
## __init__

### Description
Initialize a new instance.
Preferably use the `FileSystemKeyValueStoreClient.open` class method to create a new instance.

### Parameters
* `metadata` (KeyValueStoreMetadata) - Keyword-only. The metadata for the key-value store.
* `path_to_kvs` (Path) - Keyword-only. The path to the key-value store directory.
* `lock` (asyncio.Lock) - Keyword-only. An asyncio lock for synchronization.

### Returns
None
```

--------------------------------

### Adaptive Playwright Crawler Example

Source: https://crawlee.dev/python/docs/guides/request-router

This example shows how to initialize and configure an AdaptivePlaywrightCrawler with pre-navigation hooks for common and Playwright-specific setups. It also includes a default handler for extracting page titles and links.

```python
import asyncio

from crawlee import HttpHeaders
from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.pre_navigation_hook
    async def common_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Common pre-navigation hook - runs for both HTTP and browser requests.
        context.request.headers |= HttpHeaders(
            {'Accept': 'text/html,application/xhtml+xml'},
        )

    @crawler.pre_navigation_hook(playwright_only=True)
    async def browser_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Playwright-specific pre-navigation hook - runs only when browser is used.
        await context.page.set_viewport_size({'width': 1280, 'height': 720})
        if context.block_requests:
            await context.block_requests(extra_url_patterns=['*.css', '*.js'])

    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Extract title using the unified context interface.
        title_tag = context.parsed_content.find('title')
        title = title_tag.get_text() if title_tag else None

        # Extract other data consistently across both modes.
        links = [a.get('href') for a in context.parsed_content.find_all('a', href=True)]

        await context.push_data(
            {
                'url': context.request.url,
                'title': title,
                'links': links,
            }
        )

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### Initialize ThrottlingRequestManager

Source: https://crawlee.dev/python/api/class/ThrottlingRequestManager

Example of initializing the ThrottlingRequestManager with an inner RequestQueue, specified domains, and a request manager opener callback. This setup is used when creating a BasicCrawler.

```python
from crawlee.crawlers import BasicCrawler
from crawlee.request_loaders import ThrottlingRequestManager
from crawlee.storages import RequestQueue

queue = await RequestQueue.open()
throttler = ThrottlingRequestManager(
    inner=queue,
    domains=['api.example.com', 'slow-site.org'],
    request_manager_opener=RequestQueue.open,
)

crawler = BasicCrawler(request_manager=throttler)
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/RedisClientMixin

Initializes the RedisClientMixin with storage details and a Redis client instance.

```APIDOC
## __init__

### Description
Initializes the RedisClientMixin with storage details and a Redis client instance.

### Method
__init__

### Parameters
* **storage_name** (str) - Description not available
* **storage_id** (str) - Description not available
* **redis** (Redis) - Description not available

### Returns
None
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/SqlClientMixin

Initializes the SqlClientMixin with a unique ID and a SqlStorageClient instance.

```APIDOC
## __init__

### Description
Initializes the SqlClientMixin with a unique ID and a SqlStorageClient instance.

### Method
__init__

### Parameters
#### Keyword-only Parameters
* **id** (str) - The unique identifier for the client.
* **storage_client** (SqlStorageClient) - The SQL storage client instance.

### Returns
None
```

--------------------------------

### Initialize ImpitHttpClient and HttpCrawler

Source: https://crawlee.dev/python/api/class/ImpitHttpClient

Demonstrates how to initialize the ImpitHttpClient and integrate it with an HttpCrawler. This setup is typical for starting a crawling process that utilizes this specific HTTP client.

```python
from crawlee.crawlers import HttpCrawler  # or any other HTTP client-based crawler
from crawlee.http_clients import ImpitHttpClient

http_client = ImpitHttpClient()
crawler = HttpCrawler(http_client=http_client)
```

--------------------------------

### SqlKeyValueStoreClient.__init__

Source: https://crawlee.dev/python/api/class/SqlKeyValueStoreClient

Initializes a new instance of SqlKeyValueStoreClient. It is recommended to use the `SqlKeyValueStoreClient.open` class method for instantiation.

```APIDOC
## __init__

### Description
Initializes a new instance. Preferably use the `SqlKeyValueStoreClient.open` class method to create a new instance.

### Parameters
* **storage_client** (SqlStorageClient) - Keyword-only. The SQL storage client.
* **id** (str) - Keyword-only. The ID of the key-value store.

### Returns
None
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/ContextPipeline

Initializes a new instance of the ContextPipeline.

```APIDOC
## __init__

### Description
Initializes a new instance of the ContextPipeline.

### Parameters
* **_middleware**: Callable[[TCrawlingContext], AsyncGenerator[TMiddlewareCrawlingContext, Exception | None]] | None = None
* **_parent**: ContextPipeline[BasicCrawlingContext] | None = None

### Returns
None
```

--------------------------------

### Scraping and Storing Data with BeautifulSoupCrawler

Source: https://crawlee.dev/python/docs/guides/storages

This example shows how to use BeautifulSoupCrawler to scrape data from a website and push it to a named dataset. The dataset is then exported in CSV format to the key-value store. BeautifulSoupCrawler requires the `beautifulsoup` extra.

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset. If it does not exist, it will be created.
    # Leave name empty to use the default dataset.
    dataset = await Dataset.open(name='my-dataset')

    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the dataset.
        await dataset.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to the key-value store.
    await dataset.export_to(key='dataset', content_type='csv')


if __name__ == '__main__':
    asyncio.run(main())
```
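To make the CSV export step concrete, the sketch below shows how rows shaped like the pushed items map to CSV text, using only the standard library. It is an illustration of the output shape under simple assumptions (flat dicts, header taken from the first row); `rows_to_csv` is a hypothetical helper and does not reflect how Crawlee implements `export_to`.

```python
import csv
import io
from typing import List


def rows_to_csv(rows: List[dict]) -> str:
    """Serialize flat dicts to CSV text, using the first row's keys as the header."""
    if not rows:
        return ''
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator='\n'
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample items with the same keys as the data pushed in the example above.
items = [
    {'url': 'https://crawlee.dev', 'title': 'Crawlee'},
    {'url': 'https://crawlee.dev/python', 'title': None},
]

if __name__ == '__main__':
    print(rows_to_csv(items))
```

A `None` title is written as an empty field, which is worth keeping in mind when a page has no `<title>` element.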

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/Session

Initializes a new Session instance with various configuration options.

```APIDOC
## __init__

### Description
Initializes a new instance of the Session class.

### Method
__init__

### Parameters
#### Keyword-Only Parameters
- **id** (str | None) - Optional. Unique identifier for the session, autogenerated if not provided.
- **max_age** (timedelta) - Optional. Time duration after which the session expires. Defaults to timedelta(minutes=50).
- **user_data** (Mapping[str, JsonSerializable] | None) - Optional. Custom user data associated with the session.
- **max_error_score** (float) - Optional. Threshold score beyond which the session is considered blocked. Defaults to 3.0.
- **error_score_decrement** (float) - Optional. Value by which the error score is decremented on successful operations. Defaults to 0.5.
- **created_at** (datetime | None) - Optional. Timestamp when the session was created, defaults to current UTC time if not provided.
- **usage_count** (int) - Optional. Number of times the session has been used. Defaults to 0.
- **max_usage_count** (int) - Optional. Maximum allowable uses of the session before it is considered expired. Defaults to 50.
- **error_score** (float) - Optional. Current error score of the session. Defaults to 0.0.
- **cookies** (SessionCookies | CookieJar | dict[str, str] | list[CookieParam] | None) - Optional. Cookies associated with the session.
- **blocked_status_codes** (list | None) - Optional. HTTP status codes that indicate a session should be blocked.

### Returns
- **None**
```
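To see how these parameters could interact, here is a toy model built on the documented defaults. `ToySession` and its methods are hypothetical, and the rules are assumptions for illustration: each failure adds 1.0 to the error score, each success subtracts `error_score_decrement`, and the session stops being usable once it is too old, too error-prone, or used too often. This is not Crawlee's `Session` implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ToySession:
    """Hypothetical model of the documented defaults; not Crawlee's Session."""

    max_age: timedelta = timedelta(minutes=50)
    max_error_score: float = 3.0
    error_score_decrement: float = 0.5
    max_usage_count: int = 50
    usage_count: int = 0
    error_score: float = 0.0
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def mark_good(self) -> None:
        # Assumed rule: a success counts as a use and lowers the error score.
        self.usage_count += 1
        self.error_score = max(0.0, self.error_score - self.error_score_decrement)

    def mark_bad(self) -> None:
        # Assumed rule: a failure counts as a use and raises the error score by 1.0.
        self.usage_count += 1
        self.error_score += 1.0

    @property
    def is_usable(self) -> bool:
        expired = datetime.now(timezone.utc) - self.created_at >= self.max_age
        blocked = self.error_score >= self.max_error_score
        worn_out = self.usage_count >= self.max_usage_count
        return not (expired or blocked or worn_out)


if __name__ == '__main__':
    session = ToySession()
    session.mark_bad()
    session.mark_good()
    print(session.error_score, session.is_usable)
```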

--------------------------------

### PlaywrightCrawler Setup and Request Handling

Source: https://crawlee.dev/python/docs/introduction/scraping

Initializes a PlaywrightCrawler with a single default handler that branches on the request label: `DETAIL` pages are scraped for product data, `CATEGORY` pages are paginated and their product links enqueued, and the unlabeled start page enqueues the category pages.

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Let's limit our crawls to make our tests shorter and safer.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # We are on a product detail page, so extract the product data.
        if context.request.label == 'DETAIL':
            # Split the URL and get the last part to extract the manufacturer.
            url_part = context.request.url.split('/').pop()
            manufacturer = url_part.split('-')[0]

            # Extract the title using the combined selector.
            title = await context.page.locator('.product-meta h1').text_content()

            # Extract the SKU using its selector.
            sku = await context.page.locator(
                'span.product-meta__sku-number'
            ).text_content()

            # Locate the price element that contains the '$' sign and filter out
            # the visually hidden elements.
            price_element = context.page.locator('span.price', has_text='$').first
            current_price_string = await price_element.text_content() or ''
            raw_price = current_price_string.split('$')[1]
            price = float(raw_price.replace(',', ''))

            # Locate the element that contains the text 'In stock'
            # and filter out other elements.
            in_stock_element = context.page.locator(
                selector='span.product-form__inventory',
                has_text='In stock',
            ).first
            in_stock = await in_stock_element.count() > 0

            # Put it all together in a dictionary.
            data = {
                'manufacturer': manufacturer,
                'title': title,
                'sku': sku,
                'price': price,
                'in_stock': in_stock,
            }

            # Log the extracted data.
            context.log.info(data)

        # We are now on a category page. We can use this to paginate through and
        # enqueue all products, as well as any subsequent pages we find.
        elif context.request.label == 'CATEGORY':
            # Wait for the product items to render.
            await context.page.wait_for_selector('.product-item > a')

            # Enqueue links found within elements matching the provided selector.
            # These links will be added to the crawling queue with the label DETAIL.
            await context.enqueue_links(
                selector='.product-item > a',
                label='DETAIL',
            )

            # Find the "Next" button to paginate through the category pages.
            next_button = await context.page.query_selector('a.pagination__next')

            # If a "Next" button is found, enqueue the next page of results.
            if next_button:
                await context.enqueue_links(
                    selector='a.pagination__next',
                    label='CATEGORY',
                )

        # This indicates we're on the start page with no specific label.
        # On the start page, we want to enqueue all the category pages.
        else:
            # Wait for the collection cards to render.
            await context.page.wait_for_selector('.collection-block-item')

            # Enqueue links found within elements matching the provided selector.
            # These links will be added to the crawling queue with the label CATEGORY.
            await context.enqueue_links(
                selector='.collection-block-item',
                label='CATEGORY',
            )

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())

```
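The string handling in the `DETAIL` branch can be tested separately from the browser. The sketch below lifts the two parsing steps into plain functions; the function names, the sample URL, and the sample price text are hypothetical and chosen only to exercise the same `split` / `replace` logic.

```python
def extract_manufacturer(url: str) -> str:
    """Take the last path segment and return the part before the first hyphen."""
    url_part = url.split('/').pop()
    return url_part.split('-')[0]


def parse_price(text: str) -> float:
    """Return the number after the first '$', ignoring thousands separators."""
    raw_price = text.split('$')[1]
    return float(raw_price.replace(',', ''))


if __name__ == '__main__':
    # Hypothetical inputs shaped like the values the handler works with.
    print(extract_manufacturer('https://example.com/products/sony-xbr-950g'))
    print(parse_price('Sale price$1,398.00'))
```

Note that `parse_price` raises `IndexError` when the text contains no `$`; in the handler, the `has_text='$'` filter on the locator is what makes that case unlikely.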

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/SqlStorageClient

Initializes the SqlStorageClient with a database connection string or a pre-configured engine.

```APIDOC
## __init__

### Description
Initializes the SQL storage client.

### Parameters
* **connection_string** (str) - Optional - Database connection string (e.g., "sqlite+aiosqlite:///crawlee.db"). Defaults to SQLite database in the storage directory if not provided.
* **engine** (AsyncEngine) - Optional - Pre-configured AsyncEngine instance. If provided, connection_string is ignored.

### Returns
None

```

--------------------------------

### Camoufox Integration with PlaywrightCrawler

Source: https://crawlee.dev/python/docs/guides/avoid-blocking

This example demonstrates how to create a custom Playwright browser plugin that utilizes Camoufox. It overrides the default browser behavior to enhance anti-detection capabilities. Ensure Camoufox is installed as an external package.

```python
import asyncio

# Camoufox is an external package and needs to be installed separately.
from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    """Example browser plugin that uses Camoufox browser,
    but otherwise keeps the functionality of PlaywrightBrowserPlugin."""

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(
                self._playwright, **self._browser_launch_options
            ),
            # Increase, if camoufox can handle it in your use case.
            max_open_pages_per_browser=1,
            # This turns off the crawler header_generation. Camoufox has its own.
            header_generator=None,
        )


async def main():
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Custom browser pool. Gives users full control over browsers used by the crawler.
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()])
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract some data from the page using Playwright's API.
        posts = await context.page.query_selector_all('.athing')
        for post in posts:
            # Get the HTML element for the title within each post.
            title_element = await post.query_selector('.title a')

            # Extract the data we want from the elements.
            title = await title_element.inner_text() if title_element else None

            # Push the extracted data to the default dataset.
            await context.push_data({'title': title})

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == "__main__":
    asyncio.run(main())

```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/_ProxyTierTracker

Initializes the _ProxyTierTracker with a list of proxy URLs for each tier.

```APIDOC
## __init__

### Description
Initializes the _ProxyTierTracker with a list of proxy URLs for each tier.

### Parameters
* **tiered_proxy_urls** (list[list[URL | None]]) - The list of proxy URLs, where each inner list represents a tier.

### Returns
None
```

--------------------------------

### CloakBrowser Plugin for PlaywrightCrawler

Source: https://crawlee.dev/python/docs/guides/avoid-blocking

Example of a custom Playwright browser plugin that uses CloakBrowser's patched Chromium. It maintains the functionality of PlaywrightBrowserPlugin while applying CloakBrowser's fingerprinting defenses. Ensure CloakBrowser is installed separately.

```python
import asyncio

from cloakbrowser.config import IGNORE_DEFAULT_ARGS, get_default_stealth_args
from cloakbrowser.download import ensure_binary
from typing_extensions import override

from crawlee.browsers import (
    BrowserPool,
    PlaywrightBrowserController,
    PlaywrightBrowserPlugin,
)
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


class CloakBrowserPlugin(PlaywrightBrowserPlugin):
    """Example browser plugin that uses CloakBrowser's patched Chromium,
    but otherwise keeps the functionality of PlaywrightBrowserPlugin.
    """

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        binary_path = ensure_binary()
        stealth_args = get_default_stealth_args()

        # Merge CloakBrowser stealth args with any user-provided launch options.
        launch_options = dict(self._browser_launch_options)
        launch_options.pop('executable_path', None)
        launch_options.pop('chromium_sandbox', None)
        existing_args = list(launch_options.pop('args', []))
        launch_options['args'] = [*existing_args, *stealth_args]

        return PlaywrightBrowserController(
            browser=await self._playwright.chromium.launch(
                executable_path=binary_path,
                ignore_default_args=IGNORE_DEFAULT_ARGS,
                **launch_options,
            ),
            max_open_pages_per_browser=1,
            # CloakBrowser handles fingerprinting at the binary level.
            header_generator=None,
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Custom browser pool. Gives users full control over browsers used by the crawler.
        browser_pool=BrowserPool(plugins=[CloakBrowserPlugin()])
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract some data from the page using Playwright's API.
        posts = await context.page.query_selector_all('.athing')
        for post in posts:
            # Get the HTML element for the title within each post.
            title_element = await post.query_selector('a.title')

            # Extract the data we want from the elements.
            title = await title_element.inner_text() if title_element else None

            # Push the extracted data to the default dataset.
            await context.push_data({'title': title})

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == "__main__":
    asyncio.run(main())

```
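
The option-merging step inside `new_browser` can be exercised on its own, without launching a browser. The following is a minimal sketch under stated assumptions: `merge_launch_options` is a hypothetical helper (not part of Crawlee or CloakBrowser), and the flag values are placeholders rather than real stealth arguments.

```python
from typing import Any


def merge_launch_options(user_options: dict[str, Any], stealth_args: list[str]) -> dict[str, Any]:
    """Return a copy of user_options with stealth args appended to any user-supplied args."""
    merged = dict(user_options)  # Copy so the caller's dict is not mutated.
    # Mirror the plugin above: these keys are removed before launch.
    merged.pop('executable_path', None)
    merged.pop('chromium_sandbox', None)
    existing_args = list(merged.pop('args', []))
    # Unpack both lists so the result is one flat list of strings.
    merged['args'] = [*existing_args, *stealth_args]
    return merged


options = merge_launch_options(
    {'headless': True, 'executable_path': '/tmp/other', 'args': ['--lang=en-US']},
    ['--placeholder-stealth-flag'],
)
print(options)
```

Keeping the merge in a small pure function makes it easy to confirm that `args` stays a flat list of strings, which is what a browser launcher expects.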

--------------------------------

### AdaptivePlaywrightCrawler Usage

Source: https://crawlee.dev/python/api/class/AdaptivePlaywrightCrawler

Demonstrates how to initialize and use the AdaptivePlaywrightCrawler with a default request handler. This example shows setting crawler options, defining a handler for processing page content and enqueuing links, and running the crawler on a starting URL.

```python
import asyncio
from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
        playwright_crawler_specific_kwargs={'browser_type': 'chromium'},
    )

    @crawler.router.default_handler
    async def request_handler_for_label(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Do some processing using `parsed_content`.
        context.log.info(context.parsed_content.title)

        # Locate element h2 within 5 seconds.
        h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
        # Do stuff with the element found by the selector.
        context.log.info(h2)

        # Find more links and enqueue them.
        await context.enqueue_links()
        # Save some data.
        await context.push_data({'Visited url': context.request.url})

    # Run the crawler on the starting URL.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### ServiceLocator.__init__

Source: https://crawlee.dev/python/api/class/ServiceLocator

Initializes the ServiceLocator with optional configuration, event manager, and storage client.

```APIDOC
## ServiceLocator.__init__

### Description
Initializes the ServiceLocator with optional configuration, event manager, and storage client.

### Method Signature
`__init__(configuration: Configuration | None = None, event_manager: EventManager | None = None, storage_client: StorageClient | None = None) -> None`

### Parameters
* **configuration** (Configuration | None) - Optional: The configuration instance to set. Defaults to None.
* **event_manager** (EventManager | None) - Optional: The event manager instance to set. Defaults to None.
* **storage_client** (StorageClient | None) - Optional: The storage client instance to set. Defaults to None.

### Returns
None
```
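
To illustrate the general pattern of a constructor whose dependencies are all optional, here is a generic sketch of a service locator. `MiniServiceLocator`, `FakeConfiguration`, and the lazy-default behaviour are assumptions made for illustration; they do not describe how Crawlee's `ServiceLocator` resolves its defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeConfiguration:
    """Hypothetical stand-in for a configuration object."""

    log_level: str = 'INFO'


class MiniServiceLocator:
    """Generic service locator: every dependency is optional at construction time."""

    def __init__(self, configuration: Optional[FakeConfiguration] = None) -> None:
        self._configuration = configuration

    def get_configuration(self) -> FakeConfiguration:
        # Create a default only when a caller asks and none was supplied.
        if self._configuration is None:
            self._configuration = FakeConfiguration()
        return self._configuration


default_locator = MiniServiceLocator()
custom_locator = MiniServiceLocator(configuration=FakeConfiguration(log_level='DEBUG'))
print(default_locator.get_configuration().log_level)
print(custom_locator.get_configuration().log_level)
```

Accepting `None` for each dependency lets callers override only the pieces they care about while the rest fall back to defaults.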

--------------------------------

### Adaptive Playwright Crawler Example

Source: https://crawlee.dev/python/docs/guides/request-router

Demonstrates the setup and usage of the AdaptivePlaywrightCrawler, including common and Playwright-specific pre-navigation hooks for handling both static and dynamic content. It shows how to set viewport size and block certain resource types.

```python
import asyncio
from crawlee import HttpHeaders
from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)

async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,  # Limit the max requests per crawl.
    )

    @crawler.pre_navigation_hook
    async def common_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Common pre-navigation hook - runs for both HTTP and browser requests.
        context.request.headers |= HttpHeaders({
            'Accept': 'text/html,application/xhtml+xml'
        })

    @crawler.pre_navigation_hook(playwright_only=True)
    async def browser_setup(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Playwright-specific pre-navigation hook - runs only when browser is used.
        await context.page.set_viewport_size({'width': 1280, 'height': 720})
        if context.block_requests:
            await context.block_requests(extra_url_patterns=['*.css', '*.js'])

    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Extract title using the unified context interface.
        title_tag = context.parsed_content.find('title')
        title = title_tag.get_text() if title_tag else None

        # Extract other data consistently across both modes.
        links = [a.get('href') for a in context.parsed_content.find_all('a', href=True)]

        await context.push_data({
            'url': context.request.url,
            'title': title,
            'links': links,
        })

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

```

--------------------------------

### Sanity Check with PlaywrightCrawler

Source: https://crawlee.dev/python/docs/introduction/real-world-project

Use this snippet to verify your PlaywrightCrawler setup by visiting a start URL and extracting specific text content from rendered elements. It waits for elements to load and then evaluates JavaScript to extract data, logging the results.

```python
import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for the collection cards to render on the page. This ensures that
        # the elements we want to interact with are present in the DOM.
        await context.page.wait_for_selector('.collection-block-item')

        # Execute a function within the browser context to target the collection
        # card elements and extract their text content, trimming any leading or
        # trailing whitespace.
        category_texts = await context.page.eval_on_selector_all(
            '.collection-block-item',
            '(els) => els.map(el => el.textContent.trim())',
        )

        # Log the extracted texts.
        for i, text in enumerate(category_texts):
            context.log.info(f'CATEGORY_{i + 1}: {text}')

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### __init__

Source: https://crawlee.dev/python/api/class/StagehandBrowserController

Initializes a new instance of StagehandBrowserController. It sets up the connection to the browser and configures Stagehand options.

```APIDOC
## __init__

### Description
Initialize a new instance.

### Parameters
#### Keyword-Only Parameters
- **playwright** (Playwright) - Required - Active Playwright instance used to connect to the browser via CDP.
- **stagehand_client** (AsyncStagehand) - Required - Active Stagehand REST client used to start and end sessions.
- **stagehand_options** (StagehandOptions) - Required - Stagehand-specific configuration (model, env, self-heal, etc.).
- **max_open_pages_per_browser** (int) - Optional - Maximum number of pages that can be open at the same time. Defaults to 20.
- **header_generator** (HeaderGenerator | None) - Optional - An optional `HeaderGenerator` instance used to generate and manage HTTP headers for requests made by the browser. By default, a predefined header generator is used. Set to `None` to disable automatic header modifications. Defaults to `_DEFAULT_HEADER_GENERATOR`.

### Returns
None
```

--------------------------------

### RedisKeyValueStoreClient.__init__

Source: https://crawlee.dev/python/api/class/RedisKeyValueStoreClient

Initializes a new instance of the RedisKeyValueStoreClient. It's recommended to use the `RedisKeyValueStoreClient.open` class method for instantiation.

```APIDOC
## __init__

### Description
Initializes a new instance.
Preferably use the `RedisKeyValueStoreClient.open` class method to create a new instance.

### Parameters
* **storage_name**: str
* **storage_id**: str
* **redis**: Redis

### Returns
None
```

--------------------------------

### Basic PlaywrightCrawler Usage

Source: https://crawlee.dev/python/api/class/PlaywrightCrawler

Demonstrates the basic setup and usage of PlaywrightCrawler. It defines a default request handler to process URLs, extract page title and response text, and push the data to a dataset. The crawler is then run with a starting URL.

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'response': (await context.response.text())[:100],
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the starting URL.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### _TxtSitemapParser.__init__

Source: https://crawlee.dev/python/api/class/_TxtSitemapParser

Initializes the _TxtSitemapParser.

```APIDOC
## _TxtSitemapParser.__init__

### Description
Initializes the _TxtSitemapParser.

### Method
__init__

### Returns
None
```

--------------------------------

### FastAPI Web Server Setup for Crawlee

Source: https://crawlee.dev/python/docs/guides/running-in-web-server

This code sets up a FastAPI application to serve an HTML index page and an endpoint for scraping URLs. It integrates with Crawlee to handle scraping requests asynchronously. Ensure FastAPI is installed with 'fastapi[standard]'.

```python
from __future__ import annotations

import asyncio
from uuid import uuid4

from fastapi import FastAPI
from starlette.requests import Request
from starlette.responses import HTMLResponse

import crawlee

from .crawler import lifespan

app = FastAPI(lifespan=lifespan, title='Crawler app')


@app.get('/', response_class=HTMLResponse)
def index() -> str:
    return """
<!DOCTYPE html>
<html>
<body>
    <h1>Scraper server</h1>
    <p>To scrape some page, visit "scrape" endpoint with url parameter.
        For example:
        <a href="/scrape?url=https://www.example.com">
            /scrape?url=https://www.example.com
        </a>
    </p>
</body>
</html>
"""


@app.get('/scrape')
async def scrape_url(request: Request, url: str | None = None) -> dict:
    if not url:
        return {'url': 'missing', 'scrape result': 'no results'}

    # Generate random unique key for the request
    unique_key = str(uuid4())

    # Set the result future in the result dictionary so that it can be awaited
    request.state.requests_to_results[unique_key] = asyncio.Future[dict[str, str]]()

    # Add the request to the crawler queue
    await request.state.crawler.add_requests(
        [crawlee.Request.from_url(url, unique_key=unique_key)]
    )

    # Wait for the result future to be finished
    result = await request.state.requests_to_results[unique_key]

    # Clean the result from the result dictionary to free up memory
    request.state.requests_to_results.pop(unique_key)

    # Return the result
    return {'url': url, 'scrape result': result}
```
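
The handoff between the endpoint and the crawler relies on one `asyncio.Future` per request, keyed by `unique_key`. A stdlib-only sketch of that pattern follows; `fake_crawler` and `handle` are hypothetical stand-ins for the crawler's request handler and the FastAPI endpoint, and no real scraping happens.

```python
import asyncio
from uuid import uuid4

# Maps each request's unique key to the future that will carry its result.
requests_to_results: dict[str, asyncio.Future] = {}


async def fake_crawler(queue: asyncio.Queue) -> None:
    """Hypothetical worker: resolves the future registered for each queued request."""
    while True:
        unique_key, url = await queue.get()
        requests_to_results[unique_key].set_result({'title': f'Title of {url}'})
        queue.task_done()


async def handle(queue: asyncio.Queue, url: str) -> dict:
    """Hypothetical endpoint body: register a future, enqueue, await, clean up."""
    unique_key = str(uuid4())
    requests_to_results[unique_key] = asyncio.get_running_loop().create_future()
    await queue.put((unique_key, url))
    result = await requests_to_results[unique_key]
    requests_to_results.pop(unique_key)  # Free the entry once the result is read.
    return {'url': url, 'scrape result': result}


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(fake_crawler(queue))
    try:
        return await handle(queue, 'https://www.example.com')
    finally:
        worker.cancel()  # Stop the background worker once the demo is done.


response = asyncio.run(demo())
print(response)
```

Registering the future before enqueueing the request matters: it guarantees the worker always finds an entry to resolve, however quickly it picks the request up.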

--------------------------------

### Basic BeautifulSoup Crawler Setup and Execution

Source: https://crawlee.dev/python/docs/examples/beautifulsoup-crawler

This snippet shows how to initialize BeautifulSoupCrawler with custom settings like retries and timeouts. It defines a default request handler to extract page title and headings, and then runs the crawler starting from a given URL.

```python
import asyncio
from datetime import timedelta

from crawlee.crawlers import (
    BasicCrawlingContext,
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)


async def main() -> None:
    # Create an instance of the BeautifulSoupCrawler class, a crawler that automatically
    # loads the URLs and parses their HTML using the BeautifulSoup library.
    crawler = BeautifulSoupCrawler(
        # On error, retry each page at most once.
        max_request_retries=1,
        # Increase the timeout for processing each page to 30 seconds.
        request_handler_timeout=timedelta(seconds=30),
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    # The handler receives a context parameter, providing various properties and
    # helper methods. Here are a few key ones we use for demonstration:
    # - request: an instance of the Request class containing details such as the URL
    #   being crawled and the HTTP method used.
    # - soup: the BeautifulSoup object containing the parsed HTML of the response.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'h1s': [h1.text for h1 in context.soup.find_all('h1')],
            'h2s': [h2.text for h2 in context.soup.find_all('h2')],
            'h3s': [h3.text for h3 in context.soup.find_all('h3')],
        }

        # Push the extracted data to the default dataset. In local configuration,
        # the data will be stored as JSON files in ./storage/datasets/default.
        await context.push_data(data)

    # Register pre navigation hook which will be called before each request.
    # This hook is optional and does not need to be defined at all.
    @crawler.pre_navigation_hook
    async def some_hook(context: BasicCrawlingContext) -> None:
        pass

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

--------------------------------

### FileSystemRequestQueueClient.__init__

Source: https://crawlee.dev/python/api/class/FileSystemRequestQueueClient

Initializes a new instance of FileSystemRequestQueueClient. It's recommended to use the `open` class method instead of this constructor.

```APIDOC
## __init__

### Description
Initialize a new instance.
Preferably use the `FileSystemRequestQueueClient.open` class method to create a new instance.

### Parameters
* **metadata** (RequestQueueMetadata) - Keyword-only
* **path_to_rq** (Path) - Keyword-only
* **lock** (asyncio.Lock) - Keyword-only
* **recoverable_state** (RecoverableState[RequestQueueState]) - Keyword-only

### Returns
None
```
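
The advice to prefer `open` over the constructor reflects a common pattern: setup that needs `await` or file-system access lives in an async class method, while `__init__` only stores already-prepared values. Below is a generic sketch of that pattern; `MiniQueueClient`, its fields, and the directory layout are hypothetical and do not mirror the internals of `FileSystemRequestQueueClient`.

```python
import asyncio
import tempfile
from pathlib import Path


class MiniQueueClient:
    """Hypothetical client whose constructor only stores prepared values."""

    def __init__(self, *, name: str, path_to_rq: Path, lock: asyncio.Lock) -> None:
        self.name = name
        self.path_to_rq = path_to_rq
        self.lock = lock

    @classmethod
    async def open(cls, *, name: str, storage_dir: Path) -> 'MiniQueueClient':
        """Prepare the on-disk directory, then build the instance."""
        path_to_rq = storage_dir / 'request_queues' / name
        # Run the blocking mkdir in a thread so the event loop stays responsive.
        await asyncio.to_thread(path_to_rq.mkdir, parents=True, exist_ok=True)
        return cls(name=name, path_to_rq=path_to_rq, lock=asyncio.Lock())


async def demo() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        client = await MiniQueueClient.open(name='default', storage_dir=Path(tmp))
        # The lock guards access to the queue directory.
        async with client.lock:
            return client.path_to_rq.is_dir()


directory_created = asyncio.run(demo())
print(directory_created)
```

Because `__init__` cannot be awaited, routing creation through an async factory keeps I/O out of the constructor and lets every argument stay keyword-only.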