### Add Quickstart Dependency to Mix.exs

Source: https://github.com/elixir-crawly/crawly/blob/master/examples/quickstart/README.md

Add the quickstart package to your project's dependencies in mix.exs. This is the standard way to include external libraries in Elixir projects.

```elixir
def deps do
  [
    {:quickstart, "~> 0.1.0"}
  ]
end
```

--------------------------------

### Start Crawl Engine

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Start the Crawly engine and run a specific spider using the iex shell.

```bash
iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
```

--------------------------------

### Basic Crawly Configuration

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

A fundamental configuration example showing how to set up pipelines and middlewares.

```elixir
config :crawly,
  pipelines: [
    # my pipelines
  ],
  middlewares: [
    # my middlewares
  ]
```

--------------------------------

### Start Splash Docker Image

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Start the Splash Docker image to enable browser rendering. This command maps port 8050 and sets a maximum timeout of 300 seconds.

```bash
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300
```

--------------------------------

### Request Options and Auto Cookies Manager Middleware Example

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

This example shows how to configure request options like timeouts and enable automatic cookie management using built-in middlewares.

```elixir
{Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15_000]},
Crawly.Middlewares.AutoCookiesManager
```

--------------------------------

### Configuring Parsers in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of setting up custom parsers for processing fetcher responses. Note the warning about global configuration.

```elixir
config :crawly,
  parsers: [
    {Crawly.Parsers.ExtractRequests, selector: "button"}
  ]
```

--------------------------------

### Start a Crawly Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Use this endpoint to schedule and start a specific Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/schedule
```

--------------------------------

### Start a Spider

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Starts a specified Crawly spider. This endpoint triggers the engine to begin crawling according to the spider's configuration.

```APIDOC
## GET /spiders/<spider_name>/schedule

### Description
Starts a given Crawly spider.

### Method
GET

### Endpoint
/spiders/<spider_name>/schedule

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to start.
```

--------------------------------

### Configuring Middlewares in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of defining a list of middlewares for pre-processing requests. Includes built-in middlewares and custom options.

```elixir
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    # With options
    {Crawly.Middlewares.UserAgent, user_agents: ["My Bot"]},
    {Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15_000]}
  ]
```

--------------------------------

### Generate Config with Mix Task

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Use the mix crawly.gen.config task to generate an example configuration file.

```bash
mix crawly.gen.config
```

--------------------------------

### Configuring Pipelines in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of defining a list of pipelines for pre-processing scraped items. Includes validation, deduplication, encoding, and file writing.

```elixir
config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:id, :date]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :id},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp", include_timestamp: true}
  ]
```

--------------------------------

### Configuring Retry Logic in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of configuring the retry mechanism, specifying retry codes, maximum retries, and ignored middlewares.

```elixir
retry: [
  retry_codes: [400],
  max_retries: 3,
  ignored_middlewares: [Crawly.Middlewares.UniqueRequest]
]
```
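To make the options concrete, the following sketch shows one way a retry decision could be derived from such a keyword list. It is an illustration only and not Crawly's internal implementation: the `RetrySketch` module and its `retry?/3` function are hypothetical names, and `ignored_middlewares` is not modelled here.

```elixir
defmodule RetrySketch do
  # Hypothetical helper: decides whether a response should be retried,
  # using the same keys as the `retry` configuration above.
  def retry?(status_code, attempts_so_far, retry_opts) do
    retry_codes = Keyword.get(retry_opts, :retry_codes, [])
    max_retries = Keyword.get(retry_opts, :max_retries, 0)

    # Retry only listed status codes, and only while attempts remain.
    status_code in retry_codes and attempts_so_far < max_retries
  end
end

retry_opts = [retry_codes: [400], max_retries: 3]

IO.inspect(RetrySketch.retry?(400, 0, retry_opts))
IO.inspect(RetrySketch.retry?(400, 3, retry_opts))
IO.inspect(RetrySketch.retry?(200, 0, retry_opts))
```

The first call qualifies for a retry; the second has exhausted `max_retries`, and the third has a status code that is not in `retry_codes`.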

--------------------------------

### Crawly Configuration File

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md

An example Erlang configuration file for Crawly, specifying item count limits, timeouts, middleware, and pipelines for data processing and storage.

```erlang
[{crawly, [
            {closespider_itemcount, 500},
            {closespider_timeout, 20},
            {concurrent_requests_per_domain, 2},

            {middlewares, [
                'Elixir.Crawly.Middlewares.DomainFilter',
                'Elixir.Crawly.Middlewares.UniqueRequest',
                'Elixir.Crawly.Middlewares.RobotsTxt',
                {'Elixir.Crawly.Middlewares.UserAgent', [
                    {user_agents, [<<"Crawly BOT">>]} 
                ]}
            ]},

            {pipelines, [
                {'Elixir.Crawly.Pipelines.Validate', [{fields, [title, url]}]},
                {'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, title}]},
                {'Elixir.Crawly.Pipelines.JSONEncoder'},
                {'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
                ]
            }]
        }].
```

--------------------------------

### Ecto Storage Pipeline Example

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

This pipeline inserts scraped items into a database using Ecto. It delegates the insertion to an application context function and handles success or error responses.

```elixir
defmodule MyApp.MyEctoPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    case MyApp.insert_with_ecto(item) do
      {:ok, _} ->
        # insert successful, carry on with pipeline
        {item, state}
      {:error, _} ->
        # insert not successful, drop from pipeline
        {false, state}
    end
  end
end
```
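The return contract of such a pipeline (`{item, state}` to continue, `{false, state}` to drop the item) can be tried without a database by injecting the insert function. This is a minimal sketch under that assumption: `InsertPipelineSketch` and the `:insert_fun` option are hypothetical, and a real pipeline would declare `@behaviour Crawly.Pipeline` and call into Ecto instead.

```elixir
defmodule InsertPipelineSketch do
  # Hypothetical stand-in for the pipeline above: the storage call is
  # passed in through opts so the control flow can be tried in isolation.
  def run(item, state, opts \\ []) do
    insert_fun = Keyword.fetch!(opts, :insert_fun)

    case insert_fun.(item) do
      # Insert succeeded: hand the item to the next pipeline.
      {:ok, _stored} -> {item, state}
      # Insert failed: drop the item from the pipeline.
      {:error, _reason} -> {false, state}
    end
  end
end

ok_insert = fn item -> {:ok, item} end
failing_insert = fn _item -> {:error, :db_down} end

item = %{title: "A Light in the Attic"}

IO.inspect(InsertPipelineSketch.run(item, %{}, insert_fun: ok_insert))
IO.inspect(InsertPipelineSketch.run(item, %{}, insert_fun: failing_insert))
```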

--------------------------------

### Get Currently Running Spiders via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

This endpoint retrieves a list of all currently running Crawly spiders.

```bash
curl -v localhost:4001/spiders
```

--------------------------------

### Get Spider Stats

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieves statistics for a specific spider, including scheduled requests and scraped items.

```APIDOC
## GET /spiders/<spider_name>/scheduled-requests

### Description
Gets the number of scheduled requests for a given spider.

### Method
GET

### Endpoint
/spiders/<spider_name>/scheduled-requests

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to get stats for.
```

```APIDOC
## GET /spiders/<spider_name>/scraped-items

### Description
Gets the number of scraped items for a given spider.

### Method
GET

### Endpoint
/spiders/<spider_name>/scraped-items

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to get stats for.
```

--------------------------------

### Create a Crawly Spider

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Define a spider module using Crawly.Spider to specify base URL, starting URLs, and parsing logic for extracted items and next requests.

```elixir
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse response body to document
    {:ok, document} = Floki.parse_document(response.body)

    # Create item (for pages where items exists)
    items = 
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{ 
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    next_requests = 
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %Crawly.ParsedItem{items: items, requests: next_requests}
  end
end
```
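The pagination step turns a relative `href` into an absolute URL before a request is built. A minimal sketch of that resolution using only the standard library's `URI.merge/2` follows; it makes no claim about how `Crawly.Utils.build_absolute_url/2` is implemented, and the page paths are illustrative.

```elixir
# Resolve a relative "next page" link against the URL of the current page.
current_page = "https://books.toscrape.com/catalogue/page-1.html"
next_href = "page-2.html"

absolute =
  current_page
  |> URI.merge(next_href)
  |> URI.to_string()

IO.puts(absolute)
```

Because the relative link is resolved against the page it was found on, the last path segment of the current URL is replaced rather than appended to.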

--------------------------------

### Spider Parsing for Multi-Item Pipelines

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Example of a spider's `parse_item` callback returning multiple types of parsed items, including a blog post and weather data, structured with distinct keys.

```elixir
# in MyApp.CustomSpider.ex
def parse_item(response) do
  # parse my item
  %{parsed_items: [
    %{blog_post: blog_post},
    %{weather: [january_weather, february_weather]}
  ]}
end
```

--------------------------------

### Custom Request Middleware to Add Proxy

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

An example of a request middleware that adds proxy configuration to a request. It checks for `proxy` and `proxy_auth` options and updates the request's options accordingly. This middleware follows the `Crawly.Pipeline` behavior.

```elixir
defmodule MyApp.MyProxyMiddleware do
  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # Set default proxy and proxy_auth to nil
    opts = Enum.into(opts, %{proxy: nil, proxy_auth: nil})

    case opts.proxy do
      nil ->
        # do nothing
        {request, state}
      _proxy ->
        old_options = request.options
        new_options = [proxy: opts.proxy, proxy_auth: opts.proxy_auth]
        new_request = Map.put(request, :options, old_options ++ new_options)
        {new_request, state}
    end
  end
end
```
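The default-filling idiom `Enum.into(opts, %{proxy: nil, proxy_auth: nil})` is the core of this middleware. The sketch below isolates it, assuming a plain map with an `:options` keyword list in place of a `%Crawly.Request{}` struct; the `ProxyOptionsSketch` module and the proxy address are hypothetical.

```elixir
defmodule ProxyOptionsSketch do
  # Hypothetical helper mirroring the option handling of the middleware above.
  # `request` is a plain map standing in for a %Crawly.Request{} struct.
  def add_proxy(request, opts) do
    # Keys present in opts overwrite the nil defaults in the map.
    opts = Enum.into(opts, %{proxy: nil, proxy_auth: nil})

    case opts.proxy do
      nil ->
        # No proxy configured: leave the request untouched.
        request

      proxy ->
        new_options = [proxy: proxy, proxy_auth: opts.proxy_auth]
        Map.update!(request, :options, &(&1 ++ new_options))
    end
  end
end

request = %{url: "https://books.toscrape.com/", options: [timeout: 30_000]}

IO.inspect(ProxyOptionsSketch.add_proxy(request, []))
IO.inspect(ProxyOptionsSketch.add_proxy(request, proxy: "127.0.0.1:8080"))
```

Appending with `++` keeps the request's existing options in front, so earlier entries such as `timeout` are preserved.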

--------------------------------

### Example Spider for Books to Scrape

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md

Defines a Crawly spider to scrape book titles, prices, and URLs from books.toscrape.com. It includes logic to find the next page link and build absolute URLs.

```elixir
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse response body to document
    {:ok, document} = Floki.parse_document(response.body)

    # Create item (for pages where items exists)
    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    next_requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %{items: items, requests: next_requests}
  end
end
```

--------------------------------

### Basic YML Spider Definition

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/spiders_in_yml.md

Define a spider's name, base URL, starting points, fields to extract, and links to follow using this YML structure.

```yml
name: BooksSpiderForTest
base_url: "https://books.toscrape.com/"
start_urls:
    - "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
fields:
    - name: title
      selector: ".product_main"
    - name: price
      selector: ".product_main .price_color"
links_to_follow:
    - selector: "a"
      attribute: "href"
```

--------------------------------

### Get Scraped Items for a Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieve the count of scraped items for a specific Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/scraped-items
```

--------------------------------

### Get Scheduled Requests for a Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieve the count of scheduled requests for a specific Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/scheduled-requests
```
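When these endpoints are called from Elixir code rather than from `curl`, the paths can be collected in one small helper. The sketch below only builds URLs; the `CrawlyApiPaths` module is hypothetical, and the host and port are taken from the `curl` examples in this document.

```elixir
defmodule CrawlyApiPaths do
  # Hypothetical helper that builds URLs for the HTTP API endpoints
  # listed in this document. It performs no HTTP requests itself.
  @base "http://localhost:4001"

  def spiders, do: @base <> "/spiders"

  def spider(name, action)
      when action in [:schedule, :stop, :scheduled_requests, :scraped_items] do
    # Endpoint names use dashes, e.g. "scraped-items".
    segment = action |> Atom.to_string() |> String.replace("_", "-")
    "#{@base}/spiders/#{name}/#{segment}"
  end
end

IO.puts(CrawlyApiPaths.spiders())
IO.puts(CrawlyApiPaths.spider("BooksToScrape", :scheduled_requests))
IO.puts(CrawlyApiPaths.spider("BooksToScrape", :scraped_items))
```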

--------------------------------

### Passing Configuration Options to a Custom Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Demonstrates how to declare a custom pipeline with configuration options in the `pipelines` list. The options are then received in the `opts` argument of the `run` callback.

```elixir
pipelines: [
  {MyCustomPipeline, my_option: "value"}
]
```

```elixir
defmodule MyCustomPipeline do
  @impl Crawly.Pipeline
  def run(item, state, opts) do
    IO.inspect(opts)        # shows keyword list of [my_option: "value"]
    # Do something with the item, then pass it on
    {item, state}
  end
end
```
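Inside `run/3`, the options arrive as an ordinary keyword list, so they can be read with `Keyword.get/3` and given a fallback. The following is a minimal sketch that runs without Crawly; the `OptionsPipelineSketch` module, the `:label` field, and the `"default"` fallback are hypothetical.

```elixir
defmodule OptionsPipelineSketch do
  # Hypothetical pipeline: tags each item with the value passed as :my_option.
  def run(item, state, opts \\ []) do
    # Fall back to "default" when the option was not configured.
    label = Keyword.get(opts, :my_option, "default")
    {Map.put(item, :label, label), state}
  end
end

IO.inspect(OptionsPipelineSketch.run(%{id: 1}, %{}, my_option: "value"))
IO.inspect(OptionsPipelineSketch.run(%{id: 1}, %{}))
```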

--------------------------------

### Configure File Logging with LoggerFileBackend

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Enables file logging and separates logs from different spiders into multiple files. Requires adding the `:logger_file_backend` dependency.

```elixir
config :logger,
  backends: [{LoggerFileBackend, :info_log}]

config :crawly,
  log_dir: "/tmp/spider_logs",
  log_to_file: true
  # ... other configurations
```

--------------------------------

### Configure erlang-node-discovery Hosts and Ports

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Configure the erlang-node-discovery application in config.exs to specify the hosts and node ports for UI communication.

```elixir
config :erlang_node_discovery,
  hosts: ["127.0.0.1", "crawlyui.com"],
  node_ports: [
    {:ui, 0}
  ]
```

--------------------------------

### Add erlang-node-discovery Dependency

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Add the erlang-node-discovery library as a dependency in your project's mix.exs file to facilitate Erlang node discovery.

```elixir
{:erlang_node_discovery, git: "https://github.com/oltarasenko/erlang-node-discovery"}
```

--------------------------------

### Configure Crawly with Crawly Render Server

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Configure your Crawly project to use the crawly-render-server fetcher for browser rendering. Ensure the render server is running on http://localhost:3000.

```elixir
config :crawly,
  fetcher: {Crawly.Fetchers.CrawlyRenderServer, [base_url: "http://localhost:3000/render"]}
```

--------------------------------

### Configure Crawly with Splash Fetcher

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Configure your Crawly project to use the Splash fetcher for browser rendering. Ensure Splash is running and accessible at http://localhost:8050.

```elixir
fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```

--------------------------------

### Add Crawly Dependency

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Add Crawly and Floki as dependencies in your mix.exs file.

```elixir
defp deps do
  [
    {:crawly, "~> 0.17.2"},
    {:floki, "~> 0.33.0"}
  ]
end
```

--------------------------------

### Configure Crawly Settings

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Configure Crawly's behavior, including timeouts, concurrency, middlewares, and pipelines, in config/config.exs.

```elixir
import Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 100,

  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```

--------------------------------

### Forwarding to Crawly API Router in Plug.Router

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

This snippet shows how to integrate the Crawly API router into your Plug.Router configuration to enable the management UI. Ensure the `start_http_api` option in Crawly configuration is set to `true` (default) for this to work.

```elixir
defmodule MyApp.Router do
  use Plug.Router

  ...
  forward "/admin", Crawly.API.Router
  ...
end
```

--------------------------------

### Generate Spider with Mix Task

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Use the mix crawly.gen.spider task to generate a new spider file with all necessary callbacks.

```bash
mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape
```

--------------------------------

### Running Crawly Docker Container

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md

Command to run the Crawly Docker container, mounting local spider and configuration directories, and exposing the management interface.

```bash
docker run --name crawly -e "SPIDERS_DIR=/app/spiders" \
 -it -p 4001:4001 -v $(pwd)/spiders:/app/spiders \
 -v $(pwd)/crawly.config:/app/config/crawly.config \
 crawly
```

--------------------------------

### Key-Based Blog Post Processing Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

This pipeline selectively processes only blog post items using key-based pattern matching on the `:blog_post` key. It updates the blog post data within the item.

```elixir
defmodule MyApp.BlogPostPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%{blog_post: _old_blog_post} = item, state, _opts) do
    # process the blog post
    updated_item = Map.put(item, :blog_post, %{my: "data"})
    {updated_item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
```
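To see how a key-matching pipeline behaves when items of different shapes flow through a chain, the sketch below pairs one with a simplified runner. The runner is a hypothetical stand-in written for this example, not Crawly's own implementation; `BlogPostPipelineSketch`, `PipelineRunnerSketch`, and the upcasing step are assumptions.

```elixir
defmodule BlogPostPipelineSketch do
  # Matches only items that carry a :blog_post key.
  def run(%{blog_post: post} = item, state, _opts) do
    {Map.put(item, :blog_post, String.upcase(post)), state}
  end

  # Every other item passes through untouched.
  def run(item, state, _opts), do: {item, state}
end

defmodule PipelineRunnerSketch do
  # Simplified, hypothetical runner: feeds the item through each
  # {module, opts} pair and stops as soon as one returns false.
  def run(pipelines, item, state) do
    Enum.reduce_while(pipelines, {item, state}, fn {mod, opts}, {item, state} ->
      case mod.run(item, state, opts) do
        {false, new_state} -> {:halt, {false, new_state}}
        {new_item, new_state} -> {:cont, {new_item, new_state}}
      end
    end)
  end
end

pipelines = [{BlogPostPipelineSketch, []}]

IO.inspect(PipelineRunnerSketch.run(pipelines, %{blog_post: "hello"}, %{}))
IO.inspect(PipelineRunnerSketch.run(pipelines, %{weather: [3, 5]}, %{}))
```

Only the first item is modified; the weather item falls through to the catch-all clause and leaves the chain unchanged.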

--------------------------------

### Configure Logger to Send Logs to CrawlyUI

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Configure the logger to use the SendToUiBackend and specify the destination for sending logs to the CrawlyUI node.

```elixir
config :logger,
  backends: [
    :console,
    {Crawly.Loggers.SendToUiBackend, :send_log_to_ui}
  ],
  level: :debug
```

```elixir
config :logger, :send_log_to_ui, destination: {:"ui@127.0.0.1", CrawlyUI, :store_log}
```

--------------------------------

### Key-Based Pattern Matching Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

A custom item pipeline that uses key-based pattern matching to selectively process items based on map keys (e.g., %{my_item: my_item}). Items not matching the specified key are passed through unchanged.

```elixir
defmodule MyApp.MyCustomPipeline do
  @behaviour Crawly.Pipeline
  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%{my_item: _my_item} = item, state, _opts) do
    # do something with the item, then return it
    {item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
```

--------------------------------

### View Crawl Results

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Access the crawl results, which are typically saved to a JSON Lines file in the /tmp directory.

```bash
$ cat /tmp/BooksToScrape_<timestamp>.jl
```

--------------------------------

### List Running Spiders

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieves a list of all currently running Crawly spiders.

```APIDOC
## GET /spiders

### Description
Gets a list of currently running spiders.

### Method
GET

### Endpoint
/spiders
```

--------------------------------

### Add SendToUI Pipeline to Item Pipelines

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Add the SendToUI pipeline to your item pipelines configuration. Ensure it is placed before any encoder pipelines.

```elixir
{Crawly.Pipelines.Experimental.SendToUI, ui_node: :'ui@127.0.0.1'}
```

--------------------------------

### Integrate Crawly API Router with Phoenix

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

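Adds the Crawly management interface to an existing router, such as the one in a Phoenix application, by forwarding a path to `Crawly.API.Router`. The example below uses `Plug.Router`. This allows operating Crawly through an existing server.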

```elixir
defmodule MyApp.Router do
  use Plug.Router

  ...

  forward "/crawlers", Crawly.API.Router

  ...
end
```

--------------------------------

### Struct-Based Pattern Matching Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

A custom item pipeline that uses struct-based pattern matching to selectively process items of a specific struct type (e.g., %MyItem{}). Items not matching the struct are passed through unchanged.

```elixir
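defmodule MyApp.MyCustomPipeline do
  @behaviour Crawly.Pipeline
  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%MyItem{} = item, state, _opts) do
    # do something with the item, then return it
    {item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end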
```

--------------------------------

### Override Global Settings on Spider Level

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Defines a callback function within a spider to override global configuration settings. This allows for spider-specific behavior.

```elixir
def override_settings() do
  [
    concurrent_requests_per_domain: 5,
    closespider_timeout: 6
  ]
end
```
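The effect of such overrides can be pictured as a keyword-list merge in which spider-level values win over global ones. The sketch below assumes exactly that precedence and uses only `Keyword.merge/2`; how Crawly resolves settings internally may differ, and the global values are illustrative.

```elixir
# Application-wide values (illustrative numbers).
global_settings = [
  concurrent_requests_per_domain: 8,
  closespider_timeout: 10,
  closespider_itemcount: 100
]

# What a spider's override_settings/0 might return.
spider_overrides = [concurrent_requests_per_domain: 5, closespider_timeout: 6]

# Keys in the second list replace the same keys from the first.
effective = Keyword.merge(global_settings, spider_overrides)

IO.inspect(Keyword.get(effective, :concurrent_requests_per_domain))
IO.inspect(Keyword.get(effective, :closespider_timeout))
IO.inspect(Keyword.get(effective, :closespider_itemcount))
```

Settings the spider does not mention, such as `closespider_itemcount` here, keep their global value.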

--------------------------------

### Crawly Request Structure

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Defines the structure of a Request object in Crawly, including URL, headers, and options. This is used to initiate web requests.

```elixir
@type t :: %Crawly.Request{
  url: binary(),
  headers: [header()],
  prev_response: %{},
  options: [option()]
}

@type header() :: {key(), value()}
```

--------------------------------

### Stop a Crawly Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Use this endpoint to stop a running Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/stop
```

--------------------------------

### Stop a Spider

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Stops a specified Crawly spider that is currently running.

```APIDOC
## GET /spiders/<spider_name>/stop

### Description
Stops a given Crawly spider.

### Method
GET

### Endpoint
/spiders/<spider_name>/stop

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to stop.
```

--------------------------------

### Crawly ParsedItem Structure

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Defines the structure of a ParsedItem in Crawly, used to hold scraped items and new requests generated during parsing. This item is processed by the Crawly.Worker.

```elixir
@type item() :: %{}
@type t :: %__MODULE__{
  items: [item()],
  requests: [Crawly.Request.t()]
}
```
