### Installing Spatie Crawler via Composer

Source: https://github.com/spatie/crawler/blob/main/README.md

This command installs the spatie/crawler package into your PHP project using Composer, the dependency manager for PHP.

```bash
composer require spatie/crawler
```

--------------------------------

### Instantiating and Starting Spatie Crawler in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This PHP code snippet demonstrates how to create a new Crawler instance, set a required CrawlObserver, and start the crawling process for a specified URL. The CrawlObserver handles events during the crawl.

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

--------------------------------

### Example robots.txt Rule for Custom User Agent

Source: https://github.com/spatie/crawler/blob/main/README.md

An example `robots.txt` entry demonstrating how to define specific disallow rules for a custom User Agent string used by the crawler. Note that comments in `robots.txt` start with `#`.

```txt
# Disallow crawling for my-agent
User-agent: my-agent
Disallow: /
```

--------------------------------

### Starting a Multi-Request Crawl with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows the initial step for crawling across multiple requests using `setCurrentCrawlLimit`. It starts the crawl, processes the first batch of URLs, and then serializes the crawl queue so it can be stored and reused in subsequent requests. Requires a queue implementation, such as the package's built-in `ArrayCrawlQueue`.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);
```

--------------------------------

### Continuing a Multi-Request Crawl with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet demonstrates how to continue a crawl started in a previous request.
It unserializes the stored queue, passes it to the crawler with `setCurrentCrawlLimit`, and then serializes the updated queue for the next request. Requires the queue state from the previous request.

```php
// Unserialize the stored queue
$queue = unserialize($serializedQueue);

// Crawl the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue again
$serializedQueue = serialize($queue);
```

--------------------------------

### Combining Total and Current Crawl Limits with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This example illustrates combining `setTotalCrawlLimit` and `setCurrentCrawlLimit`. The crawler processes up to the current limit per execution, but stops entirely once the total limit is reached across all executions that share the same queue. Requires a queue implementation.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further, as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

--------------------------------

### Setting Maximum Crawl Depth with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows how to limit the maximum depth of links the crawler will follow using the `setMaximumDepth` method. The crawler will only visit pages up to the specified link distance from the starting URL. By default, it crawls all pages.
```php
Crawler::create()
    ->setMaximumDepth(2)
```

--------------------------------

### Adding Multiple CrawlObservers with addCrawlObserver in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This PHP snippet shows an alternative way to add multiple CrawlObserver instances to the crawler by chaining the `addCrawlObserver` method for each observer.

```php
Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

--------------------------------

### Enable JavaScript Execution with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Configure the Spatie Crawler to execute JavaScript on crawled pages. This requires the spatie/browsershot package and its system dependencies (Puppeteer).

```php
Crawler::create()
    ->executeJavaScript()
    ...
```

--------------------------------

### Setting Maximum Response Size with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet demonstrates setting a maximum size for responses using `setMaximumResponseSize`. Responses larger than this limit will be truncated, preventing excessive memory usage when encountering large files such as PDFs or media. The size is specified in bytes.

```php
// Let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)
```

--------------------------------

### Setting Multiple CrawlObservers with setCrawlObservers in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This PHP snippet demonstrates how to assign an array of CrawlObserver instances to the crawler using the `setCrawlObservers` method, allowing multiple observers to process crawl events.

```php
Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
    ])
    ->startCrawling($url);
```

--------------------------------

### Defining a Custom CrawlObserver in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This abstract class definition shows the methods available on a custom CrawlObserver in spatie/crawler. You must extend this class and implement the `crawled` and `crawlFailed` methods to handle successful crawls and errors.

```php
namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /**
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
```

--------------------------------

### Limiting Crawled URLs Per Execution with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows how `setCurrentCrawlLimit` limits the number of URLs crawled in a single execution. Each call to `startCrawling` will process up to the specified limit, allowing the crawl to be resumed later. Requires a queue implementation.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

--------------------------------

### Set Custom Browsershot Instance for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Provide a pre-configured Browsershot instance to the Spatie Crawler when enabling JavaScript execution. This allows for more control over the rendering process.

```php
Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...
```

--------------------------------

### Use Sitemap URL Parser with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Configure the Spatie Crawler to use the built-in `SitemapUrlParser`. This parser extracts and crawls links found within a sitemap, including sitemap index files.

```php
Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...
```

--------------------------------

### Define shouldCrawl Method for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Signature of the `shouldCrawl` method required by the abstract class `Spatie\Crawler\CrawlProfiles\CrawlProfile`. Implement this method in a custom class to define rules for which URLs the crawler should visit.

```php
/**
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;
```

--------------------------------

### Accept Nofollow Links with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Override the default behavior of rejecting links with the `rel="nofollow"` attribute. This allows the crawler to follow links that are typically ignored by search engine bots.

```php
Crawler::create()
    ->acceptNofollowLinks()
    ...
```

--------------------------------

### Set Concurrency for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Adjust the number of URLs the crawler processes concurrently. The default is 10.
Setting it to 1 makes the crawler process URLs one by one.

```php
Crawler::create()
    ->setConcurrency(1) // Now all URLs will be crawled one by one.
```

--------------------------------

### Ignore Robots.txt and Meta Tags with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Disable the default behavior of respecting `robots.txt` files, robots meta tags, and response headers. This forces the crawler to visit pages that would otherwise be disallowed by robots data.

```php
Crawler::create()
    ->ignoreRobots()
    ...
```

--------------------------------

### Adding Delay Between Requests with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows how to add a pause between each HTTP request made by the crawler using `setDelayBetweenRequests`. This is useful for avoiding rate limiting by servers. The delay is specified in milliseconds.

```php
Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150 ms.
```

--------------------------------

### Set Custom User Agent for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Configure the crawler to use a specific User Agent string. This is useful for respecting `robots.txt` rules that are defined for a particular agent.

```php
Crawler::create()
    ->setUserAgent('my-agent')
```

--------------------------------

### Limiting Total Crawled URLs with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet demonstrates how to use `setTotalCrawlLimit` to restrict the total number of URLs crawled across multiple executions. Once the limit is reached, subsequent calls with the same queue will not crawl further. Requires a queue implementation.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further, as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);
```

--------------------------------

### Set Custom URL Parser for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Specify a custom class that implements `Spatie\Crawler\UrlParsers\UrlParser` to control how links are extracted from crawled pages. The default is `LinkUrlParser`.

```php
Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...
```
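
--------------------------------

### Example: Custom CrawlProfile Restricting the Crawl to One Host

The `shouldCrawl` signature shown above is easiest to grasp with a concrete implementation. The following is a minimal sketch, not part of the package itself: it assumes spatie/crawler is installed, and the class name `SameHostCrawlProfile` is hypothetical. It extends `Spatie\Crawler\CrawlProfiles\CrawlProfile` and attaches the profile with the crawler's `setCrawlProfile` method.

```php
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Hypothetical profile: only crawl URLs whose host matches the given host.
class SameHostCrawlProfile extends CrawlProfile
{
    public function __construct(protected string $host)
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // Compare the candidate URL's host against the allowed host.
        return $url->getHost() === $this->host;
    }
}

Crawler::create()
    ->setCrawlProfile(new SameHostCrawlProfile('example.com'))
    ->startCrawling('https://example.com');
```

For common cases you may not need a custom class: the package ships ready-made profiles such as `CrawlAllUrls`, `CrawlInternalUrls`, and `CrawlSubdomains` in the `Spatie\Crawler\CrawlProfiles` namespace.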