### Install Docker and Docker Compose on Ubuntu

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Installs Docker and Docker Compose on Ubuntu systems. Ensure Docker is enabled and started after installation.

```shell
sudo apt update
sudo apt install docker.io docker-compose
```

--------------------------------

### Example: Spider with Start and End Callbacks

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/BaseParser.md

Demonstrates how to implement `start_callback` and `end_callback` in a custom Feapder spider to execute code at the beginning and end of the spider's run.

```python
import feapder


class TestSpider(feapder.Spider):
    def start_callback(self):
        print("Spider started")

    def end_callback(self):
        print("Spider finished")
```

--------------------------------

### TaskSpider Command-Line Argument Parsing

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/TaskSpider.md

Sets up command-line arguments for starting the TaskSpider with either MySQL or Redis as the task source. Use `--start 1` to start the task monitor or `--start 2` to start crawling.

```python
from feapder import ArgumentParser

# `start` and `start2` are the launcher functions defined alongside the spider
if __name__ == "__main__":
    parser = ArgumentParser(description="TaskSpider test")

    parser.add_argument(
        "--start", type=int, nargs=1, help="use MySQL as the seed table (1|2)", function=start
    )
    parser.add_argument(
        "--start2", type=int, nargs=1, help="use Redis as the seed table (1|2)", function=start2
    )

    parser.start()

    # Dispatch tasks: python3 task_spider_test.py --start 1
    # Crawl:          python3 task_spider_test.py --start 2
```

--------------------------------

### Example Custom Docker Image Build

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

An example of building a custom Feapder Docker image named 'my_feapder' with tag '1.0'.

```shell
docker build -f feapder_dockerfile -t my_feapder:1.0 .
```

--------------------------------

### Install Docker Compose (if not included)

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Installs Docker Compose if it is not already included with your Docker installation. The command downloads release 1.29.2 and makes it executable. Use this if the 'docker compose' command is not found. The second block fetches the same release from the DaoCloud mirror.

```shell
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```

```shell
sudo curl -L "https://get.daocloud.io/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```

--------------------------------

### Install Python Build Tools

Source: https://github.com/boris-code/feapder/blob/master/docs/question/安装问题.md

On Windows, if you encounter build errors during pip installation, you may need to install Microsoft Visual C++ build tools. Download and run the provided executable.

```shell
python setup.py install
```

--------------------------------

### Standard Spider Start

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/Spider.md

The standard way to start a spider, contrasting with the debug mode. It initializes the spider with a redis_key.

```python
spider = SpiderTest(redis_key="feapder:spider_test")
spider.start()
```

--------------------------------

### Request Initialization and Usage

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/Request.md

Demonstrates how to initialize a Request object and get a response.

```APIDOC
## POST /api/users

### Description
This endpoint allows for the creation of new user resources.
### Method
POST

### Endpoint
/api/users

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **username** (string) - Required - The desired username for the new user.
- **email** (string) - Required - The email address of the new user.
- **password** (string) - Required - The password for the new user.

### Request Example
```json
{
  "username": "johndoe",
  "email": "john.doe@example.com",
  "password": "securepassword123"
}
```

### Response
#### Success Response (201 Created)
- **id** (string) - The unique identifier for the newly created user.
- **username** (string) - The username of the created user.
- **email** (string) - The email address of the created user.

#### Response Example
```json
{
  "id": "user-12345",
  "username": "johndoe",
  "email": "john.doe@example.com"
}
```
```

--------------------------------

### Example Usage of Callbacks

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/BaseParser.md

Demonstrates how to implement `start_callback` and `end_callback` in a custom spider.

```APIDOC
## Example Usage

This example shows how to use the `start_callback` and `end_callback` methods.

```python
import feapder


class TestSpider(feapder.Spider):
    def start_callback(self):
        print("Spider started")

    def end_callback(self):
        print("Spider finished")
```

### Explanation
In this example, `TestSpider` inherits from `feapder.Spider` and overrides the `start_callback` and `end_callback` methods to print messages when the spider starts and ends, respectively.
```

--------------------------------

### Basic AirSpider Example

Source: https://github.com/boris-code/feapder/blob/master/docs/README.md

A simple FEAPDER AirSpider example that yields a request to Baidu and prints the response. The `start_requests` method generates tasks, and the `parse` method handles data parsing.
```python
import feapder


class FirstSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    FirstSpider().start()
```

--------------------------------

### Install Git on CentOS

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Installs Git on CentOS systems. Version 1.8.3 or higher is sufficient for FEAPLAT deployment.

```shell
yum -y install git
```

--------------------------------

### Start TaskSpider with MySQL Task Table

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/TaskSpider.md

Function to start TaskSpider using MySQL as the task table. Requires a 'task_table' and 'task_keys' to be specified.

```python
def start(args):
    """
    Use MySQL as the seed table
    """
    spider = TaskSpiderTest(
        task_table="spider_task",  # task table name
        task_keys=["id", "url"],  # fields to query from the table
        redis_key="test:task_spider",  # Redis key used for the task queue
        keep_alive=True,  # keep the spider running (resident)
    )
    if args == 1:
        spider.start_monitor_task()
    else:
        spider.start()
```

--------------------------------

### Feapder Spider Full Code Example

Source: https://github.com/boris-code/feapder/wiki/使用feapder开发爬虫是怎样的体验

A complete example of a Feapder spider that starts requests, parses job listings, and yields items for automatic data ingestion. It includes necessary imports and the spider's main execution block.
```python
import feapder
from items import *


class ListSpider(feapder.Spider):
    def start_requests(self):
        yield feapder.Request(
            "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput=",
            render=True,
        )

    def parse(self, request, response):
        job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
        for job in job_list:
            job_name = job.xpath("./@data-positionname").extract_first()
            company = job.xpath("./@data-company").extract_first()
            salary = job.xpath("./@data-salary").extract_first()
            job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()

            # List data
            list_item = lagou_job_list_item.LagouJobListItem()
            list_item.job_name = job_name
            list_item.company = company
            list_item.salary = salary
            list_item.job_url = job_url
            yield list_item  # yield directly; the framework batches the database insert

            # Detail task
            detail_task_item = lagou_job_detail_task_item.LagouJobDetailTaskItem()
            detail_task_item.url = job_url
            yield detail_task_item  # yield directly; the framework batches the database insert


if __name__ == "__main__":
    spider = ListSpider(redis_key="feapder:lagou_list")
    spider.start()
```

--------------------------------

### FEAPDER Spider Output Example

Source: https://github.com/boris-code/feapder/blob/master/docs/README.md

Example output from running a basic FEAPDER spider, showing request details and the response object. It also indicates when the parser is waiting for tasks and when the spider finishes due to no pending tasks.

```shell
Thread-2|2021-02-09 14:55:11,373|request.py|get_response|line:283|DEBUG|
    -------------- FirstSpider.parse request for ----------------
    url = https://www.baidu.com
    method = GET
    body = {'timeout': 22, 'stream': True, 'verify': False, 'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}}

Thread-2|2021-02-09 14:55:11,610|parser_control.py|run|line:415|DEBUG| parser 等待任务...
FirstSpider|2021-02-09 14:55:14,620|air_spider.py|run|line:80|INFO| 无任务,爬虫结束
```

--------------------------------

### Example Validation Logic

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/AirSpider.md

This example shows how to implement validation logic within the `validate` function. It checks the status code and can be extended to check response content.

```python
def validate(self, request, response):
    if response.status_code != 200:
        raise Exception("response code not 200")  # raising triggers a retry

    # if "哈哈" not in response.text:
    #     return False  # discard the current request
```

--------------------------------

### Run Feapder with Docker Compose

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Navigate to the project directory and use Docker Compose to start the Feapder services. The first run may take time to pull images and might show errors, but subsequent runs should be stable.

```shell
cd feaplat
docker compose up -d
```

```shell
docker-compose up -d
```

--------------------------------

### Feapder Spider with Min Task Count Example

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/Spider进阶.md

Demonstrates how to configure a Feapder spider to use `min_task_count` for managing task queue size when fetching tasks from a database. This example requires `start_monitor_task` to be active.
```python
import feapder
from feapder.db.mysqldb import MysqlDB


class SpiderTest(feapder.Spider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.db = MysqlDB()

    def start_requests(self):
        sql = "select url from feapder_test where state = 0 limit 1000"
        result = self.db.find(sql)
        for url, in result:
            yield feapder.Request(url)

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    spider = SpiderTest(
        redis_key="feapder:spider_test", min_task_count=100
    )
    # Monitor tasks: when the queue holds fewer than min_task_count tasks,
    # start_requests is called to dispatch another batch. All tasks produced by
    # start_requests are dispatched at once: here 1000 tasks are dispatched, and
    # another 1000 follow once fewer than 100 remain in the queue.
    spider.start_monitor_task()
    # Crawl
    # spider.start()
```

--------------------------------

### Start TaskSpider with Redis Task Table

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/TaskSpider.md

Function to start TaskSpider using Redis as the task table. Specify 'task_table_type="redis"' and a 'redis_key'. 'use_mysql=False' can disable MySQL usage.

```python
def start2(args):
    """
    Use Redis as the seed table
    """
    spider = TaskSpiderTest(
        task_table="spider_task2",  # task table name
        task_table_type="redis",  # the task table is stored in Redis
        redis_key="test:task_spider",  # Redis key used for the task queue
        keep_alive=True,  # keep the spider running (resident)
        use_mysql=False,  # MySQL can be disabled if it is not needed
    )
    if args == 1:
        spider.start_monitor_task()
    else:
        spider.start()
```

--------------------------------

### TaskSpider Implementation with Custom Settings

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/TaskSpider.md

Example of a TaskSpider implementation. Customize database settings like Redis and MySQL connection details. The add_task method seeds the task queue.
```python
import feapder
from feapder import ArgumentParser


class TaskSpiderTest(feapder.TaskSpider):
    # Custom database settings; if the project has a setting.py file, this block can be removed
    # Redis is required, MySQL is optional
    __custom_setting__ = dict(
        REDISDB_IP_PORTS="localhost:6379",
        REDISDB_USER_PASS="",
        REDISDB_DB=0,
        MYSQL_IP="localhost",
        MYSQL_PORT=3306,
        MYSQL_DB="feapder",
        MYSQL_USER_NAME="feapder",
        MYSQL_USER_PASS="feapder123",
    )

    def add_task(self):
        # Add seed tasks. The framework calls this function so tasks can be pushed
        # into Redis; it must not be an infinite loop. In real projects you can
        # write your own script to push tasks into Redis.
        self._redisdb.zadd(self._task_table, {"id": 1, "url": "https://www.baidu.com"})

    def start_requests(self, task):
        task_id, url = task
        yield feapder.Request(url, task_id=task_id)

    def parse(self, request, response):
        # Extract the page title
        print(response.xpath("//title/text()").extract_first())
        # Extract the page description
        print(response.xpath("//meta[@name='description']/@content").extract_first())
        print("Site URL: ", response.url)

        # With MySQL, mark the task as done, i.e. state=1
        # yield self.update_task_batch(request.task_id)
```

--------------------------------

### Start Request with Browser Rendering

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/浏览器渲染-Selenium.md

To enable browser rendering for a request, pass `render=True` to the `feapder.Request` object. This is useful for fetching content from dynamically loaded pages.

```python
yield feapder.Request("https://news.qq.com/", render=True)
```

--------------------------------

### Default Downloader Configuration

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/custom_downloader.md

Configure the default downloader for your Feapder project using the DOWNLOADER setting. This example shows the default settings for RequestsDownloader, RequestsSessionDownloader, and SeleniumDownloader.
```python
DOWNLOADER = "feapder.network.downloader.RequestsDownloader"
SESSION_DOWNLOADER = "feapder.network.downloader.RequestsSessionDownloader"
RENDER_DOWNLOADER = "feapder.network.downloader.SeleniumDownloader"
```

--------------------------------

### Configure Volume Mount for Custom Python Version

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Example of how to add a volume mount to the SPIDER_RUN_ARGS in the docker-compose.yaml file. This is necessary to persist dependencies when using a custom Python version.

```yaml
SPIDER_RUN_ARGS=["--mount type=volume,source=feapder_python3.10,destination=/usr/local/python-3.10.8"]
```

--------------------------------

### Emit Multiple Custom Metrics

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/监控打点.md

This example shows emitting multiple counters with different keys and classifiers. It illustrates how multiple calls to `emit_counter` can result in different charts and lines in the monitoring system.

```python
from feapder.utils import metrics

# Initialize the metrics system
metrics.init()

metrics.emit_counter("key", count=1, classify="test")
metrics.emit_counter("key2", count=1, classify="test")
metrics.emit_counter("key3", count=1, classify="test")

metrics.emit_counter("哈哈", count=1, classify="test2")

metrics.close()
```

--------------------------------

### Connect to MysqlDB with Parameters

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/MysqlDB.md

Instantiate MysqlDB by providing connection details such as IP, port, database name, username, and password.
```python
from feapder.db.mysqldb import MysqlDB

db = MysqlDB(
    ip="localhost",
    port=3306,
    db="feapder",
    user_name="feapder",
    user_pass="feapder123"
)
```

--------------------------------

### Initialize MysqlDB with Direct Connection Info

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/AirSpider.md

Instantiate MysqlDB by directly passing connection parameters.

```python
from feapder.db.mysqldb import MysqlDB

db = MysqlDB(
    ip="localhost",
    port=3306,
    user_name="feapder",
    user_pass="feapder123",
    db="feapder"
)
```

--------------------------------

### Initialize MysqlDB with Custom Settings

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/AirSpider.md

Instantiate MysqlDB using custom connection settings defined in __custom_setting__.

```python
import feapder
from feapder.db.mysqldb import MysqlDB


class AirSpiderTest(feapder.AirSpider):
    __custom_setting__ = dict(
        MYSQL_IP="localhost",
        MYSQL_PORT=3306,
        MYSQL_DB="feapder",
        MYSQL_USER_NAME="feapder",
        MYSQL_USER_PASS="feapder123"
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # No arguments: connection details come from __custom_setting__
        self.db = MysqlDB()
```

--------------------------------

### BatchSpider Crawl Test Function

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/BatchSpider.md

Example function to initialize and run a BatchSpider for testing purposes, with options to start monitoring or crawling.
```python
def crawl_test(args):
    spider = test_spider.TestSpider(
        redis_key="feapder:test_batch_spider",  # where distributed scheduling info is stored
        task_table="batch_spider_task",  # task table in MySQL
        task_keys=["id", "url"],  # field names to fetch from the task table; several may be listed
        task_state="state",  # task state field in MySQL
        batch_record_table="batch_spider_batch_record",  # batch record table in MySQL
        batch_name="Batch spider test (weekly full crawl)",  # batch name
        batch_interval=7,  # batch interval in days; for hours, write 1 / 24
    )

    if args == 1:
        spider.start_monitor_task()  # dispatch and monitor tasks
    else:
        spider.start()  # crawl
```

--------------------------------

### Build Custom Feapder Spider Image

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Dockerfile content for building a custom Feapder spider image. This example shows how to install a specific Python version (3.10.8) and common libraries.

```dockerfile
# Based on the latest image; a custom Python version requires feapder version >= 2.4
FROM registry.cn-hangzhou.aliyuncs.com/feapderd/feapder:2.4

# Install the custom Python version, 3.10.8
RUN set -ex \
    && wget https://www.python.org/ftp/python/3.10.8/Python-3.10.8.tgz \
    && tar -zxvf Python-3.10.8.tgz \
    && cd Python-3.10.8 \
    && ./configure prefix=/usr/local/python-3.10.8 \
    && make \
    && make install \
    && make clean \
    && rm -rf /Python-3.10.8* \
    # Create symlinks
    && ln -s /usr/local/python-3.10.8/bin/python3 /usr/bin/python3.10.8 \
    && ln -s /usr/local/python-3.10.8/bin/pip3 /usr/bin/pip3.10.8

# Remove the previous default Python
RUN set -ex \
    && rm -rf /usr/bin/python3 \
    && rm -rf /usr/bin/pip \
    && rm -rf /usr/bin/python \
    && rm -rf /usr/bin/pip

# Make Python 3.10.8 the default
RUN set -ex \
    && ln -s /usr/local/python-3.10.8/bin/python3 /usr/bin/python \
    && ln -s /usr/local/python-3.10.8/bin/python3 /usr/bin/python3 \
    && ln -s /usr/local/python-3.10.8/bin/pip3 /usr/bin/pip \
    && ln -s /usr/local/python-3.10.8/bin/pip3 /usr/bin/pip3

# Add Python 3.10.8 to PATH
ENV PATH=$PATH:/usr/local/python-3.10.8/bin/

# Install dependencies
RUN pip3 install feapder \
    && pip3 install scrapy

# Install Node packages; the bundled Node is v10.15.3
# RUN npm install packageName -g
```

--------------------------------

### Feapder Create Command Help
Source: https://github.com/boris-code/feapder/blob/master/docs/command/cmdline.md

Use `feapder create -h` to view detailed usage information and all available options for the create command.

```bash
feapder create -h
```

--------------------------------

### Install Docker CE on CentOS

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Installs Docker CE on CentOS systems. It's recommended to use version 20.10.12 or higher. This command adds the Docker repository and installs the necessary packages.

```shell
yum remove docker docker-common docker-selinux docker-engine
yum install -y yum-utils device-mapper-persistent-data lvm2 && python2 /usr/bin/yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && yum install docker-ce -y
```

```shell
yum install -y yum-utils device-mapper-persistent-data lvm2 && python2 /usr/bin/yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo && yum install docker-ce -y
```

```shell
curl -sSL https://get.daocloud.io/docker | sh
```

--------------------------------

### Start BatchSpider Worker

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/BatchSpider.md

Starts the worker process for BatchSpider, responsible for consuming tasks and crawling data.

```python
spider.start()
```

--------------------------------

### Connect to MysqlDB using Environment Variables or Settings

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/MysqlDB.md

If database connection details are configured in environment variables or settings, MysqlDB can be instantiated without arguments.
```python
db = MysqlDB()
```

--------------------------------

### Create Feapder Project

Source: https://github.com/boris-code/feapder/blob/master/docs/command/cmdline.md

Use the `-p` flag with `feapder create` to generate a new project structure, including directories for items and spiders, a main entry point, and a settings file.

```bash
feapder create -p
```

```bash
feapder create -p first-project
```

--------------------------------

### Start BatchSpider Monitor Task

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/BatchSpider.md

Starts the master process for BatchSpider, responsible for task distribution and monitoring.

```python
spider.start_monitor_task()
```

--------------------------------

### Start Spider with Thread Count

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/AirSpider.md

Increase the scraping speed by specifying the `thread_count` when starting the spider. This overrides the default single-threaded execution.

```python
if __name__ == "__main__":
    AirSpiderTest(thread_count=10).start()
```

--------------------------------

### View Feapder Supported Commands

Source: https://github.com/boris-code/feapder/blob/master/docs/command/cmdline.md

Run `feapder` in the command line to see a list of available commands and their basic usage.

```bash
feapder
```

--------------------------------

### Install urllib3 v1.25.8

Source: https://github.com/boris-code/feapder/blob/master/docs/question/请求问题.md

To resolve the `ValueError: check_hostname requires server_hostname`, install version 1.25.8 of the `urllib3` library. This is a known workaround for compatibility issues.

```bash
pip install urllib3==1.25.8
```

--------------------------------

### View Feapder Shell Help

Source: https://github.com/boris-code/feapder/blob/master/docs/command/cmdline.md

Use this command to view the help information for the feapder shell, which outlines its available options and usage.
```bash
feapder shell -h
```

--------------------------------

### Proxy Extract API Example

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/Request.md

An example of setting the `PROXY_EXTRACT_API` to fetch proxies from a given URL. The API is expected to return proxies separated by newline characters (`\r\n`).

```python
PROXY_EXTRACT_API="http://xxxx"
```

--------------------------------

### Import GuestUserPool

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/UserPool.md

Import necessary classes for GuestUserPool. Requires redis environment.

```python
from typing import Optional

from feapder.network.user_pool import GuestUser
from feapder.network.user_pool import GuestUserPool
```

--------------------------------

### BatchSpider Task Condition Example

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/BatchSpider进阶.md

Specifies conditions for fetching tasks from the database, similar to a SQL WHERE clause. This example retrieves tasks where the URL is not null and the ID is greater than 10.

```python
task_condition="url is not null and id > 10"
```

--------------------------------

### Initialize and Emit Custom Metrics

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/监控打点.md

This snippet demonstrates how to initialize the metrics system, emit a counter with a key and classify, and then close the system. This is for standalone Python scripts.

```python
from feapder.utils import metrics

# Initialize the metrics system
metrics.init()

metrics.emit_counter("key", count=1, classify="test")

metrics.close()
```

--------------------------------

### Connect to MysqlDB using URL

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/MysqlDB.md

Connect to MysqlDB by providing a connection URL. Ensure the URL includes necessary parameters like charset.
```python
db = MysqlDB.from_url("mysql://username:password@ip:port/db?charset=utf8mb4")
```

--------------------------------

### Run BatchSpider with ArgumentParser

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/BatchSpider.md

Use ArgumentParser to manage the execution of BatchSpider. This allows differentiating between starting the task monitor (`--crawl_test 1`) and starting the collector (`--crawl_test 2`) via command-line arguments.

```python
from spiders import *
from feapder import ArgumentParser


def crawl_test(args):
    spider = test_spider.TestSpider(
        redis_key="feapder:test_batch_spider",  # where distributed scheduling info is stored
        task_table="batch_spider_task",  # task table in MySQL
        task_keys=["id", "url"],  # field names to fetch from the task table; several may be listed
        task_state="state",  # task state field in MySQL
        batch_record_table="batch_spider_batch_record",  # batch record table in MySQL
        batch_name="Batch spider test (weekly full crawl)",  # batch name
        batch_interval=7,  # batch interval in days; for hours, write 1 / 24
    )

    if args == 1:
        spider.start_monitor_task()  # dispatch and monitor tasks
    else:
        spider.start()  # crawl


if __name__ == "__main__":
    parser = ArgumentParser(description="Batch spider test")

    parser.add_argument(
        "--crawl_test", type=int, nargs=1, help="BatchSpider demo(1|2)", function=crawl_test
    )

    parser.start()
```

--------------------------------

### Create Feapder Project

Source: https://github.com/boris-code/feapder/wiki/使用feapder开发爬虫是怎样的体验

Use the 'feapder create -p' command to generate a new Feapder project structure. This command initializes the necessary files and directories for your spider.

```shell
> feapder create -p lagou-spider

lagou-spider 项目生成成功
```

--------------------------------

### Connect to Redis in Sentinel Mode

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/RedisDB.md

Instantiate RedisDB for Redis Sentinel mode. Requires a list of Sentinel addresses and the service name.
```python
from feapder.db.redisdb import RedisDB

db = RedisDB(ip_ports="172.25.21.4:26379,172.25.21.5:26379,172.25.21.6:26379", db=0, user_pass=None, service_name="my_master")
```

--------------------------------

### Specify Database Connection for Item Creation

Source: https://github.com/boris-code/feapder/blob/master/docs/command/cmdline.md

When creating an item using `feapder create -i`, you can specify database connection details directly via command-line arguments if not configured in `setting.py`.

```bash
feapder create -i spider_data --host localhost --db feapder --username feapder --password feapder123
```

--------------------------------

### Run FEAPLAT

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/question.md

Use this command to start FEAPLAT services in detached mode.

```bash
docker-compose up -d
```

--------------------------------

### Create TaskSpider Project

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/TaskSpider.md

Command to create a new TaskSpider project. Select 'TaskSpider' from the template options.

```bash
feapder create -s task_spider_test

请选择爬虫模板
  AirSpider
  Spider
> TaskSpider
  BatchSpider
```

--------------------------------

### FEAPLAT Network Configuration

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/question.md

Example configuration for the FEAPLAT overlay network in docker-compose.yaml, specifying a subnet and gateway.

```yaml
networks:
  default:
    name: feaplat
    driver: overlay
    attachable: true
    ipam:
      config:
        - subnet: 11.0.0.0/8
          gateway: 11.0.0.1
```

--------------------------------

### Get User from NormalUserPool

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/UserPool.md

Retrieve a user from the NormalUserPool. If no users are available, it will attempt to log in and produce a new user.
```python
user = user_pool.get_user()
print("Got user:", user)
print("cookie:", user.cookies)
print("user_agent:", user.user_agent)
print("proxies:", user.proxies)
```

--------------------------------

### Proxy and User-Agent Configuration

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/Request.md

Explains how proxies and User-Agents are handled, prioritizing parameters over settings.

```APIDOC
## Proxy and User-Agent Configuration

### Priority
- Parameters passed directly to the `Request` object take precedence.
- If parameters are not specified, the configuration from `setting.py` is used.

### Default Settings in `setting.py`:
```python
# Proxy settings
PROXY_EXTRACT_API = None  # API to extract proxies, delimiter is \r\n
PROXY_ENABLE = True

# Random headers
RANDOM_HEADERS = True

# Use session for requests
USE_SESSION = False
```

### Proxy Extraction API
- If `PROXY_EXTRACT_API` is set (e.g., `PROXY_EXTRACT_API="http://your-proxy-api.com"`), feapder will fetch proxies from this URL.
- The expected format for the returned proxies is one IP:port per line, separated by `\r\n`.
```

--------------------------------

### Get User from GuestUserPool

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/UserPool.md

Retrieve a user from the GuestUserPool. If no users are available, it will attempt to log in and produce a new user.

```python
user = user_pool.get_user(block=True)
print("Got user:", user)
print("cookie:", user.cookies)
print("user_agent:", user.user_agent)
print("proxies:", user.proxies)
```

--------------------------------

### Custom Downloader with httpx for HTTP/2

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/AirSpider.md

Implement a custom downloader using httpx to leverage HTTP/2 support.
```python
import feapder
import httpx


class AirSpeedTest(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("http://www.baidu.com")

    def download_midware(self, request):
        with httpx.Client(http2=True) as client:
            response = client.get(request.url)
        return request, response

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    AirSpeedTest(thread_count=1).start()
```

--------------------------------

### Configure SPIDER_IMAGE in .env File

Source: https://github.com/boris-code/feapder/blob/master/docs/feapder_platform/feaplat.md

Example of how to update the SPIDER_IMAGE variable in the .env file to use your custom-built Docker image.

```dotenv
SPIDER_IMAGE=my_feapder:1.0
```

--------------------------------

### Custom Standard Downloader Implementation

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/custom_downloader.md

Implement a custom standard downloader by inheriting from feapder.network.downloader.base.Downloader. The download method should handle the request and return a Response object. It's recommended to return a feapder.network.response.Response object for better compatibility.

```python
import requests

from feapder.network.downloader.base import Downloader
from feapder.network.response import Response


class RequestsDownloader(Downloader):
    def download(self, request) -> Response:
        response = requests.request(
            request.method, request.url, **request.requests_kwargs
        )
        # Convert the requests response into a feapder Response so that
        # xpath, re and similar helpers are available when parsing
        response = Response(response)
        return response
```

--------------------------------

### Get Specific User from GoldUserPool

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/UserPool.md

Retrieve a specific user from the GoldUserPool by username. If the user is not available, it will attempt to log in and produce them.
```python
user = user_pool.get_user(username="username")
print("Got user:", user)
print("cookie:", user.cookies)
print("user_agent:", user.user_agent)
print("proxies:", user.proxies)
```

--------------------------------

### Create __init__.py File using Feapder CLI

Source: https://github.com/boris-code/feapder/blob/master/docs/command/cmdline.md

This command generates an __init__.py file in the current directory and automatically imports all other Python files within that directory into the `__all__` variable.

```bash
feapder create -init
```

--------------------------------

### Debug Spider

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/Spider.md

This method allows debugging a specific request within a spider. It converts the spider into a DebugSpider and starts it with a provided request.

```python
if __name__ == "__main__":
    spider = SpiderTest.to_DebugSpider(
        redis_key="feapder:spider_test", request=feapder.Request("http://www.baidu.com")
    )
    spider.start()
```

--------------------------------

### MysqlDB Connection

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/MysqlDB.md

Demonstrates various ways to establish a connection to a MySQL database using the MysqlDB class.

```APIDOC
## MysqlDB Connection

### Description
Connect to a MySQL database using the `MysqlDB` class. It supports automatic reconnection, multi-thread operations, and has a built-in connection pool with a maximum of 100 connections.

### Connection Methods

**1. Using explicit parameters:**
```python
from feapder.db.mysqldb import MysqlDB

db = MysqlDB(
    ip="localhost",
    port=3306,
    db="feapder",
    user_name="feapder",
    user_pass="feapder123"
)
```

**2. Using environment variables or settings:**
If database connection details are configured in environment variables or settings, you can instantiate `MysqlDB` without parameters.
```python
db = MysqlDB()
```

**3. Using a connection URL:**
Alternatively, you can connect using a database URL.
```python
db = MysqlDB.from_url("mysql://username:password@ip:port/db?charset=utf8mb4")
```
```

--------------------------------

### Run GuestUserPool Maintenance

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/UserPool.md

Start a separate process that continuously maintains a sufficient number of users in the GuestUserPool. This process runs in the background.

```python
user_pool.run()
```

--------------------------------

### Connect to Redis in Standalone Mode

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/RedisDB.md

Instantiate `RedisDB` for a single Redis instance. Connection details can be passed directly.

```python
from feapder.db.redisdb import RedisDB

db = RedisDB(ip_ports="localhost:6379", db=0, user_pass=None)
```

--------------------------------

### BatchParser - add_task Method

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/BatchParser.md

The `add_task` method is called before `init_task` during the `start_monitor` process. It is used to add tasks to the database before the batch spider starts.

```APIDOC
## BatchParser - add_task

### Description
Adds tasks to the database before the batch spider initializes.

### Method
A method of the BatchParser class, typically overridden in user-defined spiders.

### Endpoint
N/A (internal method)

### Parameters
None documented.

### Request Example
```python
class TestSpider(feapder.BatchSpider):
    def add_task(self):
        # Add your task logic here
        pass
```

### Response
N/A
```

--------------------------------

### Define add_task Method in BatchSpider

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/BatchParser.md

Implement the `add_task` method to add tasks to the database before a batch spider starts. This method is called by `start_monitor` before `init_task`.
```python
import feapder


class TestSpider(feapder.BatchSpider):
    def add_task(self):
        pass
```

--------------------------------

### Configure Feapder Logging Settings

Source: https://github.com/boris-code/feapder/blob/master/docs/source_code/logger.md

Set logging parameters such as the log file path, level, color, and console/file output. You can also configure the maximum size, backup count, and encoding of log files, as well as the log level for third-party libraries.

```python
import os

LOG_NAME = os.path.basename(os.getcwd())
LOG_PATH = "log/%s.log" % LOG_NAME  # log file path
LOG_LEVEL = "DEBUG"
LOG_COLOR = True  # whether to colorize output
LOG_IS_WRITE_TO_CONSOLE = True  # whether to print to the console
LOG_IS_WRITE_TO_FILE = False  # whether to write to a file
LOG_MODE = "w"  # file write mode
LOG_MAX_BYTES = 10 * 1024 * 1024  # maximum size of each log file, in bytes
LOG_BACKUP_COUNT = 20  # number of log files to keep
LOG_ENCODING = "utf8"  # log file encoding
OTHERS_LOG_LEVAL = "ERROR"  # log level for third-party libraries
```

--------------------------------

### Create Batch Spider Project

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/BatchSpider.md

Use the feapder command-line tool to create a new batch spider. Select `BatchSpider` from the template options.

```bash
feapder create -s batch_spider_test

# The CLI then prompts "请选择爬虫模板" (select a spider template):
  AirSpider
  Spider
  TaskSpider
> BatchSpider
```

--------------------------------

### Retry Mechanism Example

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/AirSpider.md

Raising an exception in the `parse` function, here for a non-200 status code, triggers a retry of the request. The default maximum retry count is 100.

```python
def parse(self, request, response):
    if response.status_code != 200:
        raise Exception("Invalid page")
```

--------------------------------

### Add Task to Redis

Source: https://github.com/boris-code/feapder/blob/master/docs/usage/TaskSpider.md

Example of adding a seed task to a Redis key. This function is called automatically by `start_monitor_task`. The task is a dictionary with `id` and `url` fields.
```python
# This example adds a seed with the value `{"id": 1, "url": "https://www.baidu.com"}` to the Redis key `spider_task2`
```
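
The snippet above only describes the seed, so here is a minimal sketch of what writing such a seed could look like. It rests on several assumptions that are not confirmed by the source: that the seed is stored as a JSON string in a sorted set with score 0, and that the store exposes a `zadd(key, mapping)` call shaped like the one in redis-py. `InMemorySortedSet`, `build_seed`, and the standalone `add_task` function are hypothetical names used for illustration; they are not part of feapder's API.

```python
import json


class InMemorySortedSet:
    """Hypothetical stand-in for a Redis sorted set, so the sketch runs without a server."""

    def __init__(self):
        self.keys = {}

    def zadd(self, key, mapping):
        # Same call shape as redis-py's zadd(name, mapping): member -> score.
        members = self.keys.setdefault(key, {})
        added = sum(1 for member in mapping if member not in members)
        members.update(mapping)
        return added  # number of newly added members


def build_seed(task_id, url):
    """Serialize a task dictionary with 'id' and 'url' into a JSON string."""
    return json.dumps({"id": task_id, "url": url})


def add_task(store, key="spider_task2"):
    """Add one seed task under the given key (assumed storage layout)."""
    seed = build_seed(1, "https://www.baidu.com")
    store.zadd(key, {seed: 0})
    return seed


if __name__ == "__main__":
    store = InMemorySortedSet()
    add_task(store)
    print(store.keys)
```

In a real spider the store would be the spider's Redis connection rather than the in-memory class, and the key would be whatever task key the spider was started with.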