### Installing Spatie Crawler via Composer

Source: https://github.com/spatie/crawler/blob/main/README.md

This command installs the spatie/crawler package into your PHP project using Composer, the dependency manager for PHP.

```bash
composer require spatie/crawler
```

--------------------------------

### Instantiating and Starting Spatie Crawler in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This PHP code snippet demonstrates how to create a new Crawler instance, set a required CrawlObserver, and start the crawling process for a specified URL. The CrawlObserver handles events during the crawl.

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

--------------------------------

### Example robots.txt Rule for Custom User Agent

Source: https://github.com/spatie/crawler/blob/main/README.md

An example `robots.txt` entry demonstrating how to define specific disallow rules for a custom User Agent string used by the crawler. Note that comments in `robots.txt` start with `#`.

```txt
# Disallow crawling for my-agent
User-agent: my-agent
Disallow: /
```

--------------------------------

### Starting a Multi-Request Crawl with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows the initial step for crawling across multiple requests using `setCurrentCrawlLimit`. It starts the crawl, processes the first batch of URLs, and then serializes the crawl queue so it can be stored and reused in subsequent requests. Requires a queue implementation, such as the package's built-in `ArrayCrawlQueue`.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);
```

--------------------------------

### Continuing a Multi-Request Crawl with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet demonstrates how to continue a crawl started in a previous request.
It unserializes the stored queue, passes it to the crawler with `setCurrentCrawlLimit`, and then serializes the updated queue for the next request. Requires the queue state from the previous request.

```php
// Unserialize the stored queue
$queue = unserialize($serializedQueue);

// Crawl the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue again
$serializedQueue = serialize($queue);
```

--------------------------------

### Combining Total and Current Crawl Limits with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This example illustrates combining `setTotalCrawlLimit` and `setCurrentCrawlLimit`. The crawler processes up to the current limit per execution, but stops entirely once the total limit is reached across all executions that share the same queue. Requires a queue implementation.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further, as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

--------------------------------

### Setting Maximum Crawl Depth with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows how to limit the maximum depth of links the crawler will follow using the `setMaximumDepth` method. The crawler will only visit pages up to the specified link distance from the starting URL. By default, it crawls all pages.
```php
Crawler::create()
    ->setMaximumDepth(2)
```

--------------------------------

### Adding Multiple CrawlObservers with addCrawlObserver in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This PHP snippet shows an alternative way to add multiple CrawlObserver instances to the crawler by chaining the `addCrawlObserver` method for each observer.

```php
Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

--------------------------------

### Enable JavaScript Execution with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Configure the Spatie Crawler to execute JavaScript on crawled pages. This requires the spatie/browsershot package and its system dependencies (Puppeteer).

```php
Crawler::create()
    ->executeJavaScript()
    ...
```

--------------------------------

### Setting Maximum Response Size with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet demonstrates setting a maximum size for responses using `setMaximumResponseSize`. Responses larger than this limit will be truncated, preventing excessive memory usage when encountering large files such as PDFs or media. The size is specified in bytes.

```php
// Let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)
```

--------------------------------

### Setting Multiple CrawlObservers with setCrawlObservers in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This PHP snippet demonstrates how to assign an array of CrawlObserver instances to the crawler using the `setCrawlObservers` method, allowing multiple observers to process crawl events.

```php
Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
    ])
    ->startCrawling($url);
```

--------------------------------

### Defining a Custom CrawlObserver in PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This abstract class definition shows the methods available on a custom CrawlObserver in spatie/crawler. You must extend this class and implement the `crawled` and `crawlFailed` methods to handle successful crawls and errors.

```php
namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /**
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
```

--------------------------------

### Limiting Crawled URLs Per Execution with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows how `setCurrentCrawlLimit` limits the number of URLs crawled in a single execution. Each call to `startCrawling` will process up to the specified limit, allowing the crawl to be resumed later. Requires a queue implementation.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

--------------------------------

### Set Custom Browsershot Instance for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Provide a pre-configured Browsershot instance to the Spatie Crawler when enabling JavaScript execution. This allows for more control over the rendering process.

```php
Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...
```

--------------------------------

### Use Sitemap URL Parser with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Configure the Spatie Crawler to use the built-in `SitemapUrlParser`. This parser extracts and crawls links found within a sitemap, including sitemap index files.

```php
Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...
```

--------------------------------

### Define shouldCrawl Method for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Signature of the `shouldCrawl` method required by the abstract class `Spatie\Crawler\CrawlProfiles\CrawlProfile`. Implement this method in a custom class to define rules for which URLs the crawler should visit.

```php
/**
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;
```

--------------------------------

### Accept Nofollow Links with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Override the default behavior of rejecting links with the `rel="nofollow"` attribute. This allows the crawler to follow links that are typically ignored by search engine bots.

```php
Crawler::create()
    ->acceptNofollowLinks()
    ...
```

--------------------------------

### Set Concurrency for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Adjust the number of URLs the crawler processes concurrently. The default is 10.
Setting it to 1 makes the crawler process URLs one by one.

```php
Crawler::create()
    ->setConcurrency(1) // Now all URLs will be crawled one by one.
```

--------------------------------

### Ignore Robots.txt and Meta Tags with Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Disable the default behavior of respecting `robots.txt` files, robots meta tags, and response headers. This forces the crawler to visit pages that would otherwise be disallowed by robots data.

```php
Crawler::create()
    ->ignoreRobots()
    ...
```

--------------------------------

### Adding Delay Between Requests with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet shows how to add a pause between each HTTP request made by the crawler using `setDelayBetweenRequests`. This is useful for avoiding rate limiting by servers. The delay is specified in milliseconds.

```php
Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150 ms.
```

--------------------------------

### Set Custom User Agent for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Configure the crawler to use a specific User Agent string. This is useful for respecting `robots.txt` rules that are defined for a particular agent.

```php
Crawler::create()
    ->setUserAgent('my-agent')
```

--------------------------------

### Limiting Total Crawled URLs with PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

This snippet demonstrates how to use `setTotalCrawlLimit` to restrict the total number of URLs crawled across multiple executions. Once the limit is reached, subsequent calls with the same queue will not crawl further. Requires a queue implementation.

```php
// Create a queue using your queue driver,
// e.g. the package's built-in ArrayCrawlQueue.
$queue = new \Spatie\Crawler\CrawlQueues\ArrayCrawlQueue();

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further, as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);
```

--------------------------------

### Set Custom URL Parser for Spatie Crawler PHP

Source: https://github.com/spatie/crawler/blob/main/README.md

Specify a custom class that implements `Spatie\Crawler\UrlParsers\UrlParser` to control how links are extracted from crawled pages. The default is `LinkUrlParser`.

```php
Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...
```
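
--------------------------------

### Example: Custom CrawlProfile Restricting the Crawl to One Host

The `shouldCrawl` signature shown above is easiest to grasp with a concrete implementation. The following is a minimal sketch, not part of the package itself: it assumes spatie/crawler is installed, and the class name `SameHostCrawlProfile` is hypothetical. It extends `Spatie\Crawler\CrawlProfiles\CrawlProfile` and attaches the profile with the crawler's `setCrawlProfile` method.

```php
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Hypothetical profile: only crawl URLs whose host matches the given host.
class SameHostCrawlProfile extends CrawlProfile
{
    public function __construct(protected string $host)
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // Compare the candidate URL's host against the allowed host.
        return $url->getHost() === $this->host;
    }
}

Crawler::create()
    ->setCrawlProfile(new SameHostCrawlProfile('example.com'))
    ->startCrawling('https://example.com');
```

For common cases you may not need a custom class: the package ships ready-made profiles such as `CrawlAllUrls`, `CrawlInternalUrls`, and `CrawlSubdomains` in the `Spatie\Crawler\CrawlProfiles` namespace.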