### Add Quickstart Dependency to Mix.exs

Source: https://github.com/elixir-crawly/crawly/blob/master/examples/quickstart/README.md

Add the quickstart package to your project's dependencies in mix.exs. This is the standard way to include external libraries in Elixir projects.

```elixir
def deps do
  [
    {:quickstart, "~> 0.1.0"}
  ]
end
```

--------------------------------

### Start Crawl Engine

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Start the Crawly engine and run a specific spider using the iex shell.

```bash
iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
```

--------------------------------

### Basic Crawly Configuration

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

A fundamental configuration example showing how to set up pipelines and middlewares.

```elixir
config :crawly,
  pipelines: [
    # my pipelines
  ],
  middlewares: [
    # my middlewares
  ]
```

--------------------------------

### Start Splash Docker Image

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Start the Splash Docker image to enable browser rendering. This command maps port 8050 and sets a maximum timeout of 300 seconds.

```bash
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300
```

--------------------------------

### Request Options and Auto Cookies Manager Middleware Example

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

This example shows how to configure request options like timeouts and enable automatic cookie management using built-in middlewares.

```elixir
{Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15000]},
Crawly.Middlewares.AutoCookiesManager
```

--------------------------------

### Configuring Parsers in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of setting up custom parsers for processing fetcher responses.
Note the warning about global configuration.

```elixir
config :crawly,
  parsers: [
    {Crawly.Parsers.ExtractRequests, selector: "button"}
  ]
```

--------------------------------

### Start a Crawly Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Use this endpoint to schedule and start a specific Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/schedule
```

--------------------------------

### Start a Spider

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Starts a specified Crawly spider. This endpoint triggers the engine to begin crawling according to the spider's configuration.

```APIDOC
## POST /spiders/<spider_name>/schedule

### Description
Starts a given Crawly spider.

### Method
POST

### Endpoint
/spiders/<spider_name>/schedule

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to start.
```

--------------------------------

### Configuring Middlewares in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of defining a list of middlewares for pre-processing requests. Includes built-in middlewares and custom options.

```elixir
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    # With options
    {Crawly.Middlewares.UserAgent, user_agents: ["My Bot"]},
    {Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15000]}
  ]
```

--------------------------------

### Generate Config with Mix Task

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Use the `mix crawly.gen.config` task to generate an example configuration file.
```bash
mix crawly.gen.config
```

--------------------------------

### Configuring Pipelines in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of defining a list of pipelines for pre-processing scraped items. Includes validation, deduplication, encoding, and file writing.

```elixir
config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:id, :date]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :id},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp", include_timestamp: true}
  ]
```

--------------------------------

### Configuring Retry Logic in Crawly

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Example of configuring the retry mechanism, specifying retry codes, maximum retries, and ignored middlewares.

```elixir
retry: [
  retry_codes: [400],
  max_retries: 3,
  ignored_middlewares: [Crawly.Middlewares.UniqueRequest]
]
```

--------------------------------

### Crawly Configuration File

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md

An example Erlang configuration file for Crawly, specifying item count limits, timeouts, middleware, and pipelines for data processing and storage.

```erlang
[{crawly, [
    {closespider_itemcount, 500},
    {closespider_timeout, 20},
    {concurrent_requests_per_domain, 2},

    {middlewares, [
        'Elixir.Crawly.Middlewares.DomainFilter',
        'Elixir.Crawly.Middlewares.UniqueRequest',
        'Elixir.Crawly.Middlewares.RobotsTxt',
        {'Elixir.Crawly.Middlewares.UserAgent', [
            {user_agents, [<<"Crawly BOT">>]}
        ]}
    ]},

    {pipelines, [
        {'Elixir.Crawly.Pipelines.Validate', [{fields, [title, url]}]},
        {'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, title}]},
        {'Elixir.Crawly.Pipelines.JSONEncoder'},
        {'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
    ]
    }]
}].
```

--------------------------------

### Ecto Storage Pipeline Example

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

This pipeline inserts scraped items into a database using Ecto. It delegates the insertion to an application context function and handles success or error responses.

```elixir
defmodule MyApp.MyEctoPipeline do
  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    case MyApp.insert_with_ecto(item) do
      {:ok, _} ->
        # insert successful, carry on with pipeline
        {item, state}

      {:error, _} ->
        # insert not successful, drop from pipeline
        {false, state}
    end
  end
end
```

--------------------------------

### Get Currently Running Spiders via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

This endpoint retrieves a list of all currently running Crawly spiders.

```bash
curl -v localhost:4001/spiders
```

--------------------------------

### Get Spider Stats

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieves statistics for a specific spider, including scheduled requests and scraped items.

```APIDOC
## GET /spiders/<spider_name>/scheduled-requests

### Description
Gets the number of scheduled requests for a given spider.

### Method
GET

### Endpoint
/spiders/<spider_name>/scheduled-requests

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to get stats for.
```

```APIDOC
## GET /spiders/<spider_name>/scraped-items

### Description
Gets the number of scraped items for a given spider.

### Method
GET

### Endpoint
/spiders/<spider_name>/scraped-items

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to get stats for.
```

--------------------------------

### Create a Crawly Spider

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Define a spider module using `Crawly.Spider` to specify the base URL, the starting URLs, and the parsing logic that extracts items and follow-up requests.
```elixir
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse response body to document
    {:ok, document} = Floki.parse_document(response.body)

    # Create item (for pages where items exists)
    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    next_requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %Crawly.ParsedItem{items: items, requests: next_requests}
  end
end
```

--------------------------------

### Spider Parsing for Multi-Item Pipelines

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Example of a spider's `parse_item` callback returning multiple types of parsed items, including a blog post and weather data, structured with distinct keys.

```elixir
# in MyApp.CustomSpider.ex
def parse_item(response) do
  # parse my item (binds blog_post, january_weather and february_weather)
  %{
    parsed_items: [
      %{blog_post: blog_post},
      %{weather: [january_weather, february_weather]}
    ]
  }
end
```

--------------------------------

### Custom Request Middleware to Add Proxy

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

An example of a request middleware that adds proxy configuration to a request. It reads the `proxy` and `proxy_auth` options and, when a proxy is set, appends both to the request's options; otherwise the request passes through unchanged. This middleware follows the `Crawly.Pipeline` behaviour.
```elixir
defmodule MyApp.MyProxyMiddleware do
  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # Set default proxy and proxy_auth to nil
    opts = Enum.into(opts, %{proxy: nil, proxy_auth: nil})

    case opts.proxy do
      nil ->
        # do nothing
        {request, state}

      _value ->
        # append the proxy settings to the existing request options
        old_options = request.options
        new_options = [proxy: opts.proxy, proxy_auth: opts.proxy_auth]
        new_request = Map.put(request, :options, old_options ++ new_options)
        {new_request, state}
    end
  end
end
```

--------------------------------

### Example Spider for Books to Scrape

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md

Defines a Crawly spider to scrape book titles, prices, and URLs from books.toscrape.com. It includes logic to find the next page link and build absolute URLs.

```elixir
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse response body to document
    {:ok, document} = Floki.parse_document(response.body)

    # Create item (for pages where items exists)
    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    next_requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %{items: items, requests: next_requests}
  end
end
```

--------------------------------

### Basic YML Spider Definition

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/spiders_in_yml.md

Define a spider's name, base URL, starting points, fields to extract, and links to follow using this YML structure.
```yml
name: BooksSpiderForTest
base_url: "https://books.toscrape.com/"
start_urls:
  - "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
fields:
  - name: title
    selector: ".product_main"
  - name: price
    selector: ".product_main .price_color"
links_to_follow:
  - selector: "a"
    attribute: "href"
```

--------------------------------

### Get Scraped Items for a Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieve the count of scraped items for a specific Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/scraped-items
```

--------------------------------

### Get Scheduled Requests for a Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieve the count of scheduled requests for a specific Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/scheduled-requests
```

--------------------------------

### Passing Configuration Options to a Custom Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Demonstrates how to declare a custom pipeline with configuration options in the `pipelines` list. The options are then received in the `opts` argument of the `run` callback.

```elixir
pipelines: [
  {MyCustomPipeline, my_option: "value"}
]
```

```elixir
defmodule MyCustomPipeline do
  @impl Crawly.Pipeline
  def run(item, state, opts) do
    IO.inspect(opts)
    # shows keyword list of [ my_option: "value" ]

    # Do something, then hand the item on to the next pipeline
    {item, state}
  end
end
```

--------------------------------

### Configure File Logging with LoggerFileBackend

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Enables file logging and separates logs from different spiders into multiple files. Requires adding the `:logger_file_backend` dependency.
```elixir
config :logger, backends: [{LoggerFileBackend, :info_log}]

config :crawly,
  log_dir: "/tmp/spider_logs",
  log_to_file: true
  # ... other configurations
```

--------------------------------

### Configure erlang-node-discovery Hosts and Ports

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Configure the erlang-node-discovery application in config.exs to specify the hosts and node ports for UI communication.

```elixir
config :erlang_node_discovery,
  hosts: ["127.0.0.1", "crawlyui.com"],
  node_ports: [
    {:ui, 0}
  ]
```

--------------------------------

### Add erlang-node-discovery Dependency

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Add the erlang-node-discovery library as a dependency in your project's mix.exs file to facilitate Erlang node discovery.

```elixir
{:erlang_node_discovery, git: "https://github.com/oltarasenko/erlang-node-discovery"}
```

--------------------------------

### Configure Crawly with Crawly Render Server

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Configure your Crawly project to use the crawly-render-server fetcher for browser rendering. Ensure the render server is running on http://localhost:3000.

```elixir
config :crawly,
  fetcher: {Crawly.Fetchers.CrawlyRenderServer, [base_url: "http://localhost:3000/render"]}
```

--------------------------------

### Configure Crawly with Splash Fetcher

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Configure your Crawly project to use the Splash fetcher for browser rendering. Ensure Splash is running and accessible at http://localhost:8050.
```elixir
fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```

--------------------------------

### Add Crawly Dependency

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Add Crawly and Floki as dependencies in your mix.exs file.

```elixir
defp deps do
  [
    {:crawly, "~> 0.17.2"},
    {:floki, "~> 0.33.0"}
  ]
end
```

--------------------------------

### Configure Crawly Settings

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Configure Crawly's behavior, including timeouts, concurrency, middlewares, and pipelines, in config/config.exs.

```elixir
import Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 100,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```

--------------------------------

### Forwarding to Crawly API Router in Plug.Router

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

This snippet shows how to integrate the Crawly API router into your Plug.Router configuration to enable the management UI. Ensure the `start_http_api` option in Crawly configuration is set to `true` (default) for this to work.

```elixir
defmodule MyApp.Router do
  use Plug.Router
  ...
  forward "/admin", Crawly.API.Router
  ...
end
```

--------------------------------

### Generate Spider with Mix Task

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Use the mix crawly.gen.spider task to generate a new spider file with all necessary callbacks.
```bash
mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape
```

--------------------------------

### Running Crawly Docker Container

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/standalone_crawly.md

Command to run the Crawly Docker container, mounting local spider and configuration directories, and exposing the management interface.

```bash
docker run --name crawly -e "SPIDERS_DIR=/app/spiders" \
  -it -p 4001:4001 -v $(pwd)/spiders:/app/spiders \
  -v $(pwd)/crawly.config:/app/config/crawly.config \
  crawly
```

--------------------------------

### Key-Based Blog Post Processing Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

This pipeline selectively processes only blog post items using key-based pattern matching on the `:blog_post` key. It updates the blog post data within the item.

```elixir
defmodule MyApp.BlogPostPipeline do
  @impl Crawly.Pipeline
  # Function head: declares the default for `opts` once for both clauses
  def run(item, state, opts \\ [])

  def run(%{blog_post: _old_blog_post} = item, state, _opts) do
    # process the blog post
    updated_item = Map.put(item, :blog_post, %{my: "data"})
    {updated_item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
```

--------------------------------

### Configure Logger to Send Logs to CrawlyUI

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Configure the logger to use the SendToUiBackend and specify the destination for sending logs to the CrawlyUI node.
```elixir
config :logger,
  backends: [
    :console,
    {Crawly.Loggers.SendToUiBackend, :send_log_to_ui}
  ],
  level: :debug
```

```elixir
config :logger, :send_log_to_ui, destination: {:"ui@127.0.0.1", CrawlyUI, :store_log}
```

--------------------------------

### Key-Based Pattern Matching Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

A custom item pipeline that uses key-based pattern matching to selectively process items based on map keys (e.g., `%{my_item: my_item}`). Items not matching the specified key are passed through unchanged.

```elixir
defmodule MyApp.MyCustomPipeline do
  @impl Crawly.Pipeline
  # Function head: declares the default for `opts` once for both clauses
  def run(item, state, opts \\ [])

  def run(%{my_item: my_item} = item, state, _opts) do
    # do something
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
```

--------------------------------

### View Crawl Results

Source: https://github.com/elixir-crawly/crawly/blob/master/README.md

Access the crawl results, which are typically saved to a JSON Lines file in the /tmp directory.

```bash
$ cat /tmp/BooksToScrape_.jl
```

--------------------------------

### List Running Spiders

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Retrieves a list of all currently running Crawly spiders.

```APIDOC
## GET /spiders

### Description
Gets a list of currently running spiders.

### Method
GET

### Endpoint
/spiders
```

--------------------------------

### Add SendToUI Pipeline to Item Pipelines

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/experimental_ui.md

Add the SendToUI pipeline to your item pipelines configuration. Ensure it is placed before any encoder pipelines.
```elixir
{Crawly.Pipelines.Experimental.SendToUI, ui_node: :'ui@127.0.0.1'}
```

--------------------------------

### Integrate Crawly API Router with Phoenix

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Adds the Crawly management interface to a Phoenix application's router. This allows operating Crawly through an existing server.

```elixir
defmodule MyApp.Router do
  use Plug.Router
  ...
  forward "/crawlers", Crawly.API.Router
  ...
end
```

--------------------------------

### Struct-Based Pattern Matching Pipeline

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

A custom item pipeline that uses struct-based pattern matching to selectively process items of a specific struct type (e.g., `%MyItem{}`). Items not matching the struct are passed through unchanged.

```elixir
defmodule MyApp.MyCustomPipeline do
  @impl Crawly.Pipeline
  # Function head: declares the default for `opts` once for both clauses
  def run(item, state, opts \\ [])

  def run(%MyItem{} = item, state, _opts) do
    # do something
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
```

--------------------------------

### Override Global Settings on Spider Level

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/configuration.md

Defines a callback function within a spider to override global configuration settings. This allows for spider-specific behavior.

```elixir
def override_settings() do
  [
    concurrent_requests_per_domain: 5,
    closespider_timeout: 6
  ]
end
```

--------------------------------

### Crawly Request Structure

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Defines the structure of a Request object in Crawly, including URL, headers, and options. This is used to initiate web requests.
```elixir
@type t :: %Crawly.Request{
        url: binary(),
        headers: [header()],
        prev_response: %{},
        options: [option()]
      }

@type header() :: {key(), value()}
```

--------------------------------

### Stop a Crawly Spider via HTTP API

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Use this endpoint to stop a running Crawly spider. Replace `<spider_name>` with the actual name of the spider.

```bash
curl -v localhost:4001/spiders/<spider_name>/stop
```

--------------------------------

### Stop a Spider

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/http_api.md

Stops a specified Crawly spider that is currently running.

```APIDOC
## POST /spiders/<spider_name>/stop

### Description
Stops a given Crawly spider.

### Method
POST

### Endpoint
/spiders/<spider_name>/stop

### Parameters
#### Path Parameters
- **spider_name** (string) - Required - The name of the spider to stop.
```

--------------------------------

### Crawly ParsedItem Structure

Source: https://github.com/elixir-crawly/crawly/blob/master/documentation/basic_concepts.md

Defines the structure of a ParsedItem in Crawly, used to hold scraped items and new requests generated during parsing. This item is processed by the Crawly.Worker.

```elixir
@type item() :: %{}

@type t :: %__MODULE__{
        items: [item()],
        requests: [Crawly.Request.t()]
      }
```
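
--------------------------------

### Sketch: Building a ParsedItem-Shaped Result

To make the `Crawly.ParsedItem` and `Crawly.Request` typespecs more concrete, here is a minimal sketch of how a parse result pairs scraped items with follow-up requests. The `Sketch.Request` and `Sketch.ParsedItem` modules and the `build/2` helper are hypothetical stand-ins that only mirror the field names in the typespecs; they are not part of Crawly, and the item and URL values are placeholders. A real spider would return `%Crawly.ParsedItem{}` and build requests with `Crawly.Utils.request_from_url/1`, as in the spider examples.

```elixir
# Hypothetical stand-ins that mirror the typespec field names.
# They are NOT Crawly modules, so this sketch runs without the dependency.
defmodule Sketch.Request do
  defstruct url: nil, headers: [], prev_response: nil, options: []
end

defmodule Sketch.ParsedItem do
  defstruct items: [], requests: []

  # Pair a list of scraped item maps with the URLs that should be crawled next.
  def build(items, urls) when is_list(items) and is_list(urls) do
    requests = Enum.map(urls, fn url -> %Sketch.Request{url: url} end)
    %__MODULE__{items: items, requests: requests}
  end
end

# Placeholder data for illustration only.
parsed =
  Sketch.ParsedItem.build(
    [%{title: "Example title", url: "https://example.com/item/1"}],
    ["https://example.com/page/2"]
  )

IO.inspect(length(parsed.items), label: "items")
IO.inspect(Enum.map(parsed.requests, & &1.url), label: "next urls")
```

Keeping items and requests in one value means a single `parse_item` call can both emit data for the pipelines and schedule the next pages to crawl.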