### Start WebUI via Module

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README.md

Alternatively, start the WebUI API server by running the module directly with uv.

```shell
# Or start using module mode
uv run python -m api.main
```

--------------------------------

### Install Playwright Browser Drivers with uv

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/原生环境管理文档.md

From the project root, run `uv run playwright install` to install the browser drivers that Playwright requires. The project supports using Playwright to connect to a local Chrome, and the CDP mode can also be adjusted in the configuration file.

```shell
uv run playwright install
```

--------------------------------

### Tieba Exception Handling Setup

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/note_detail.html

Sets up global error listeners for JavaScript, resource loading, and API exceptions. Reports custom exceptions via `reportException`.

```javascript
!function(){try{window.__tieba__weirwood__={jsExceptions:[],resourceExceptions:[],apiExceptions:[],customExceptions:[],weirwoodResourceListener:null,jsListener:function(){for(var e=arguments.length,t=new Array(e),n=0;n
```

--------------------------------

Ensure Node.js (>= 16) is installed if crawling Douyin and Zhihu. The provided requirements are based on Python 3.9.6; compatibility with other versions may vary.

```shell
# Enter project root directory
cd MediaCrawler

# Create virtual environment
# My python version is: 3.9.6, the libraries in requirements.txt are based on this version
# If using other python versions, the libraries in requirements.txt may not be compatible, please resolve on your own
python -m venv venv
```

```shell
# macOS & Linux activate virtual environment
source venv/bin/activate
```

```shell
# Windows activate virtual environment
venv\Scripts\activate
```

--------------------------------

### Start WebUI API Server

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README.md

Launch the API server for the WebUI using uvicorn. The server runs on port 8080 by default and supports hot-reloading.
```shell
# Start API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
```

--------------------------------

### Install openpyxl for Excel Export

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/excel_export_guide.md

Install the `openpyxl` library, which is required for Excel export. Use `uv sync` (recommended) to install the project dependencies, or `pip install openpyxl` to install it directly.

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```

--------------------------------

### Get Crawler Help with uv

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/原生环境管理文档.md

Run `uv run main.py --help` to list every command-line argument and option the crawler supports.

```shell
uv run main.py --help
```

--------------------------------

### Initialize Tieba Platform with Parameters

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/note_comments.html

Initializes the Tieba platform with a given set of parameters. It conditionally applies fixes for the `pc_tieba_list` and `pc_tieba_detail` page IDs and handles image-related initialization.
```javascript var q={init:function(){new N({container:"[data-nvk]",url:"https:\/\/ada.baidu.com\/udpl\/exp",params:{exp:"data-nvk"}})}};t.PARAMS={},t.init=function(e){t.PARAMS=e;var n="."+t.PARAMS.resultClass;E.getInstance({}).support(function(){t.PARAMS.flags.tiebaCkFix&&-1!==["pc_tieba_list","pc_tieba_detail"].indexOf(t.PARAMS.pageid)?A.init(n,t.PARAMS.imTimeSign):o.init(n,t.PARAMS.imTimeSign),q.init(),t.PARAMS.flags.tiebaCkFix&&-1!==["pc_tieba_list","pc_tieba_detail"].indexOf(t.PARAMS.pageid)&&j.init(n,t.PARAMS)})},t.request=e}(this.ecomNsPcGlobal=this.ecomNsPcGlobal||{});\n\n (function (variable) {\n window.ecomNsPcGlobal.init(variable);\n })({"searchid":"000000006ff4131f","eid":"120201_120702_7869012_300003","bwsid":0,"osid":0,"pageid":"pc_tieba_detail","baiduid":"00E51832EC6DAA0D10E1C0699C5E6670","ovlid":"129424-dz#129771-dz#127633-9#119767-dz#125784-6#119725-17#124778-17#167nj-0","wpt":0,"netType":0,"cuid":"","feedCuid":"","query":"%E7%BD%91%E7%90%83%E9%A3%8E%E4%BA%91","imTimeSign":108,"asynMode":0,"asynUrl":"","isWiseDropDown":false,"asynsid":"","aspTime":1722965478873,"sourceAdNum":{"ads_2327":1},"asynQuery":"","jFieldLinkMap":{"7R_NR2Ar5Od66EzgSNerQ0BK-qjgKZLqWAZ1vmIMo9vxgj4e_5o33IOM9tSMjld3xg4mx5u9qx-9qEH9tOZjexZjES8Z1vmxgI9vNqIT1VQDpyuCp88EDkYvIMWgvX5WElkYP77BOoZBmoLNvNtTMNM-eRlrKYdvFWG_LU81k_tIhOubl_t8aFqjDk_3tILZkzXPMZJ1z3lQnyFWWuE_ooLeVOySe-SWDk7SOYOohzsOwSUqEHISUtVjVOPSLjqAJEklcELecqNxKYxhoOdkOhzqPL4EvqxuPSOOolltHDgeOHS8UQjqZ4EvqPL4EzYrLtLeT5YpJO-d9JmerVqLQxYwYA5Pe5O6ObI8eO_xenI---MO6kOSjGSgz_9BOBRU57-OHk_______OxdTgHgyyyyyyyyyyyyf4UcSjqOOgv1um8Z6OdkxwsxQkOtDgzdq_gzOdq8881lFELtSgtLSEOgqZ4EvOYcOusO-ItEd5kexSkOw_Ot7-----IO3Otj6SN--MO6qEdYXx_OU_____dvpVIbgjePyWqxlZSzM-HJdTXgSwOs51OxveqMSzevOBhpletx1OSjqXxSW3OBeO6XEU4e2OSQeqZoZqS1C3mElrWqiQOuoOejEzqZQSpolrZter1-4p81uYvyyX5qA1lry6GyAp7WW_3e826":"https:\/\/b2b.baidu.com\/aitf\/s?q=%E8%8B%8F%E5%B7%9E%E6%B1%BD%E8%BD%A6%E7%A7%9F%E8%B5%81&from=search&fid=604714779&styl=b&sid=1510132&a_keywordid={keywordid}&a_u
nitid={unitid}&a_planid={planid}"},"upAdNum":0,"middleAdNum":0,"downAdNum":0,"flags":{"fixPlusSign":true},"variable":{},"rsContent":[],"ecomData204":"","ecomData213":"","ecomData217":"","ad204Num":0,"ad213Num":0,"ad217Num":0,"isHasImlp":false,"adsInfo":{"000000006ff4131f_1626_0":{"ideaId":75048564995,"docId":"0","mts":[2410,2051]}},"bdCid":93,"bdPid":4,"queryWordEnc":"%CD%F8%C7%F2%B7%E7%D4%C6","wiseSt":"","requestIpV4":3604745399,"fnizebrab":"","nsVerticalKdomainList":[],"wiseExposureAds":[],"is_rm_asyn":true,"mod":"","app_verison":"","os_version":"","bd_version":"","ios_version":0,"passportId":0,"hasYunyingCard":false,"wiseFrom":"","resultClass":"fc-000000006ff4131f-2327"});
```

--------------------------------

### Install Playwright Browser Drivers

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README_en.md

Installs the browser drivers required by Playwright for web automation.

```shell
playwright install
```

--------------------------------

### Initialize SQLite Database

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/index.md

Initialize the SQLite database for data storage. Use this option on its own, without any other optional parameters.

```shell
--init_db sqlite
```

--------------------------------

### Run the Crawler with uv (Search)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/原生环境管理文档.md

Run the crawler with `uv run main.py`, specifying the platform and crawl type. This example reads the search keywords from the configuration and crawls the matching posts and their comments.

```shell
uv run main.py --platform xhs --lt qrcode --type search
```

--------------------------------

### Start MediaCrawler WebUI Service

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README_en.md

Starts the API server for the WebUI. The default port is 8080. Visit http://localhost:8080 in your browser to access the interface.
```shell
# Start API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
```

```shell
# Or start using module method
uv run python -m api.main
```

--------------------------------

### Initialize MySQL Database

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/index.md

Initialize the MySQL database for data storage. This command requires a pre-configured MySQL server and database.

```shell
--init_db mysql
```

--------------------------------

### Run the Crawler with uv (Detail)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/原生环境管理文档.md

Run the crawler with `uv run main.py`, specifying the platform and crawl type. This example reads the list of post IDs from the configuration and crawls those posts and their comments.

```shell
uv run main.py --platform xhs --lt qrcode --type detail
```

--------------------------------

### Configure CDP Mode (Launch a New Browser)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/CDP模式使用指南.md

In `config/base_config.py`, enable CDP mode and configure it to launch a new browser instance instead of connecting to an existing one.

```python
ENABLE_CDP_MODE = True
CDP_CONNECT_EXISTING = False  # Do not connect to an existing browser; launch a new one instead
```

--------------------------------

### Baidu Tieba Analytics Logging Initialization

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/tieba_note_list.html

This snippet initializes Baidu Tieba's analytics logging using `window.alog`. It sets a speed marker for `c_widget_search_show` and then fires a general `mark` event. This is typically used for performance monitoring and event tracking.
```javascript
if (window.alog && window.alog.fire) {
  alog('speed.set', 'c_widget_search_show', +new Date);
  alog.fire("mark");
}
```

--------------------------------

### Configure CDP Mode (Connect to an Existing Browser)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/CDP模式使用指南.md

In `config/base_config.py`, enable CDP mode and configure it to connect to an existing browser through its debugging port.

```python
# Enable CDP mode
ENABLE_CDP_MODE = True

# Connect to an existing browser (enabled by default)
CDP_CONNECT_EXISTING = True

# CDP debugging port (must match the port shown on the chrome://inspect page)
CDP_DEBUG_PORT = 9222
```

--------------------------------

### Get Help for Crawler Commands (pip)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README.md

Run the main script with the `--help` flag using Python to view available commands and options for other platforms.

```shell
python main.py --help
```

--------------------------------

### Initialize Tieba Widgets

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/note_comments.html

This code initializes various Tieba widgets using the `Module.use` function. It includes widgets for creative platforms, previews, HTTP transformations, suggestions, search highlighting, sign-in modules, forum cards, focus buttons, navigation, and more. Ensure all necessary dependencies are loaded before using these widgets.
```javascript
_.Module.use('creativeplatform/widget/aopApp', [[]]);
_.Module.use('pcommon/widget/preview', [], function(){});
_.Module.use('common/widget/httpTransform', [], function(){});
_.Module.use('common/widget/suggestion', [], function(){});
_.Module.use('common/widget/searchBright',[$('#head'),{ style:'bright', theme:'bright_pb', forumName:'网球风云', searchFixed:'', sugOn:'1'}]);
```

```javascript
PageData.user.is_like = 0;
PageData.user.is_block = 0;
PageData.is_sign_in = 0;
PageData.is_star = 0;
PageData.sign_forum_info = {"is_on":true,"is_filter":false,"forum_info":{"forum_id":4513750,"level_1_dir_name":"\u7efc\u5408\u4f53\u80b2"},"current_rank_info":{"sign_count":444,"member_count":48481,"sign_rank":8,"dir_rate":"0.1"},"level_1_dir_name":"\u4f53\u80b2","level_2_dir_name":"\u7efc\u5408\u4f53\u80b2","yesterday_rank_info":{"sign_count":3452,"member_count":47919,"sign_rank":11,"dir_rate":"0.1"},"weekly_rank_info":{"sign_count":3405,"member_count":42491,"sign_rank":10},"monthly_rank_info":{"sign_count":2359,"member_count":41063,"sign_rank":16}};
PageData.memberTitle = "Ace";
PageData.memberNumber = "";
PageData.is_activity_sign = '';
PageData.annualMemberMode = false;
```

```javascript
_.Module.use('ucenter/widget/sign_mod',[$('#sign_mod'),{'hasClass': 1, 'page': ''}]);
_.Module.use('frs/widget/forum_card');
_.Module.use('frs/widget/forum_card/focus_btn',[{ "islike":"0", "isCategoryOfGame": true, "forum_name":"网球风云" , "fr":"", "userForumList": []}]);
_.Module.use('common/widget/tbnav_bright', [$('#tb_nav'),{jq_search:$('#tb_nav').find('.j_search_internal_forum'),forumName:'网球风云'},{promotion_setting: [[]]}]);
```

```javascript
_.Module.use("pb/widget/ForumTitle",[{ 'is_pic_act_underway': false}], function(forumTitle){ window.forumTitle = forumTitle; });
_.Module.use('pb/widget/saveFace', [{"isLogin": "1", "props": 
{"all_level":{"2":{"end_time":"1421113470","level":2,"score_limit":8000}},"level":{"end_time":"1421113470","expired_notify":1,"expiring_notify":1,"left_num":0,"max_free_score":8000,"open_status":null,"pic_url":null,"props_category":105,"props_id":2,"props_type":0,"update_time":null,"used_status":1}}, "forumId": "4513750", "threadId": "9119688421" }], function(){});
_.Module.use("encourage/widget/pb_marry", {});
_.Module.use('user/widget/SingleIcons');
```

```javascript
if ($.trim($('.d_pb_icons').html()).length === 0) {
  $('.d_pb_icons').hide();
}
_.Module.use('tbmall/widget/pb_post_foot_send_gift', { container: '.post-foot-send-gift-container', box: '.post-foot-send-gift', authorId: 1635505954, postId: 150726491368, presentNum: 0});
```

--------------------------------

### Create Python Virtual Environment

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/index.md

Create a virtual environment using Python's built-in `venv` module. This isolates project dependencies. Ensure Node.js is installed if crawling Douyin or Zhihu.

```shell
cd MediaCrawler
python -m venv venv
```

--------------------------------

### Initialize Tieba Features

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/note_detail.html

Initializes Tieba features based on provided parameters. Supports fixing click tracking for list and detail pages.
```javascript
var L="observer";function C(){}function N(t){var e=t.container,n=t.url,r=t.params;this.opts={container:e,url:n,params:void 0===r?{}:r},this.init()}N.prototype={constructor:N,init:function(){var t=this,e=t.opts.container;t.observer=new IntersectionObserver(t.observeCB.bind(t)),e&&Array.prototype.slice.call(document.querySelectorAll(e)).forEach(function(e){t.observer.observe(e)})},observe:function(t){this.observer.observe(t)},observeCB:function(t){var e=this;t.forEach(function(t){if(t.isIntersecting){var n=t.target;if(!n.getAttribute(L)){var r=e.combineData(n);e.log(r),e.observer.unobserve(n),n.setAttribute(L,1)}}})},combineData:function(t){var e=this.opts.params,n={data:{}};for(var r in e)if(e.hasOwnProperty(r)){var o=t.getAttribute(e[r]);""!==o&&(n.data[r]=o)}return n},log:function(t){this.nclick(t)},nclick:function(t){var e=this.opts.url;t.rand=this.addRand();var n="".concat(e,"?").concat(this.encodeSearchParams(t));this.imgRequest(n,t)},imgRequest:function(t,e){try{var n=e.rand,r=new Image;window["--IMAGE"+n]=r,r.onload=r.onerror=r.onabort=function(){r.onload=r.onerror=r.onabort=null,r=null,window["--IMAGE"+n]=C},r.src=t}catch(t){}},addRand:function(){return Math.random().toString(16).slice(2,8)+Math.random()},encodeSearchParams:function(t){var e=[];for(var n in t)if(t.hasOwnProperty(n)){var r=t[n];"object"==typeof r&&(r=JSON.stringify(r)),e.push([n,encodeURIComponent(r)].join("="));}return e.join("&")}};var q={init:function(){new N({container:"[data-nvk]",url:"https:\/\/ada.baidu.com\/udpl\/exp",params:{exp:"data-nvk"}})}};t.PARAMS={},t.init=function(e){t.PARAMS=e;var 
n="."+t.PARAMS.resultClass;E.getInstance({}).support(function(){t.PARAMS.flags.tiebaCkFix&&-1!==["pc_tieba_list","pc_tieba_detail"].indexOf(t.PARAMS.pageid)?A.init(n,t.PARAMS.imTimeSign):o.init(n,t.PARAMS.imTimeSign),q.init(),t.PARAMS.flags.tiebaCkFix&&-1!==["pc_tieba_list","pc_tieba_detail"].indexOf(t.PARAMS.pageid)&&j.init(n,t.PARAMS)})},t.request=e}(this.ecomNsPcGlobal=this.ecomNsPcGlobal||{});
```

--------------------------------

### Sync Python Dependencies with uv

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/原生环境管理文档.md

From the project root, run `uv sync` to keep the Python version and dependencies consistent. The command installs the required Python packages according to the project configuration.

```shell
cd MediaCrawler
uv sync
```

--------------------------------

### Initialize UserVisitCard Widget

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/note_detail.html

Initializes the UserVisitCard widget, which displays user information. It takes the username and login status as parameters, along with a `tbs` value.

```javascript
_.Module.use('ihome/widget/UserVisitCard',{'uname':'抗压吧吧务666','is_login':0,'tbs':''});
```

--------------------------------

### Get Crawler Help with venv

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/原生环境管理文档.md

With the venv environment activated, run `python main.py --help` to list every command-line argument and option the crawler supports.

```shell
python main.py --help
```

--------------------------------

### Run Crawler for Keyword Search (uv)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README.md

Execute the main script using uv to search for posts based on keywords from the configuration file. This command runs a search on the `xhs` platform using QR code login.
```shell
uv run main.py --platform xhs --lt qrcode --type search
```

--------------------------------

### Enable CDP Mode in the Crawler

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/CDP模式使用指南.md

Set `ENABLE_CDP_MODE` to `True` in `config/base_config.py` to enable CDP mode for the crawlers of all platforms.

```python
# In config/base_config.py
ENABLE_CDP_MODE = True

# Then run the crawler as usual from a shell:
#   python main.py
```

--------------------------------

### Run Crawler for Specific Post Details (uv)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README.md

Execute the main script using uv to fetch details for specific posts listed in the configuration file. This command targets the `xhs` platform and uses QR code login.

```shell
uv run main.py --platform xhs --lt qrcode --type detail
```

--------------------------------

### Initialize Tieba Widgets and Event Handlers

Source: https://github.com/nanmicoder/mediacrawler/blob/main/media_platform/tieba/test_data/note_detail.html

Loads various Tieba widgets and sets up event delegation for user interactions such as clicking authorization buttons. It also handles cookie-based logic for specific features.
```javascript
_.Module.use('encourage/widget/meizhi_vote');
_.Module.use('encourage/widget/WelfareIcon');
_.Module.use('encourage/widget/achieveCard');
_.Module.use('fanclub/widget/fancard');
$("body").delegate('#j_meizhi_auth_btn, #j_meizhi_ysjg', 'click', function (e) {
  _.Module.use('postor/widget/MeizhiPostor');
  e.preventDefault();
});
if ($.cookie('zt2meizhi') == '1') {
  _.Module.use('postor/widget/MeizhiPostor');
}
$.cookie('zt2meizhi',null);
//$.cookie('zt2meizhi', '1', {empires: 365, path: '/'});
_.Module.use('frs/widget/frs_stamp_notice',[ {} ]);
_.Module.use('comforum/widget/GamePopWindow', [ null ]);
/* URL safety level check, by tanjiawei */
_.Module.use('pb/widget/UrlCheck');
_.Module.use("props/widget/Feedback",[ {"appraise":null} ]);
_.Module.use("tbmall/widget/NameplateRecast", [ 0, null ]);
_.Module.use("pb/widget/PbTrack");
_.Module.use('adsense/widget/data_handler_async', [], function (instance){
  instance.addData({ 'forum_vdir': null });
} );
_.Module.use('tbmall/widget/grab_treasure_dialog_success',[ { diamondData: []} ]);
/* Load the no-refresh component */
_.Module.use('pb/widget/NoRefresh', [ "\/p\/9117905169?pn=", false ] );
_.Module.use('spage/widget/fixed_bar', [], function(){} );
if ('' === 'showBar'){
  $.stats.track('底部', '新用户红包', 'spage', 'show');
}
_.Module.use('creativeplatform/widget/normalApp', [ [[]] ]);
```

--------------------------------

### Check Browser Processes

Source: https://github.com/nanmicoder/mediacrawler/blob/main/docs/CDP模式使用指南.md

Use system commands to list running Chrome processes. This helps troubleshoot a browser that failed to start or did not shut down cleanly.

```bash
# Windows
tasklist | findstr chrome

# macOS/Linux
ps aux | grep chrome
```

--------------------------------

### Run Crawler for Specific Post Details (pip)

Source: https://github.com/nanmicoder/mediacrawler/blob/main/README.md

Execute the main script using Python to fetch details for specific posts listed in the configuration file. This command targets the `xhs` platform and uses QR code login.

```shell
python main.py --platform xhs --lt qrcode --type detail
```
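
The uv and pip invocations shown in this document differ only in the launcher prefix (`uv run main.py` versus `python main.py`); the `--platform`, `--lt`, and `--type` flags are the same. As a minimal sketch of scripting these runs, the helper below assembles the argument list from those documented flags. The function name `build_crawler_command` and its parameters are hypothetical and are not part of MediaCrawler.

```python
import shlex
from typing import List


def build_crawler_command(
    platform: str,
    login_type: str,
    crawl_type: str,
    use_uv: bool = False,
) -> List[str]:
    """Assemble the argv list for one crawler run.

    Hypothetical helper: it only combines the flags that appear in the
    commands above (--platform, --lt, --type) and does not validate them.
    """
    # The launcher prefix is the only difference between the uv and pip setups.
    launcher = ["uv", "run", "main.py"] if use_uv else ["python", "main.py"]
    return launcher + [
        "--platform", platform,
        "--lt", login_type,
        "--type", crawl_type,
    ]


if __name__ == "__main__":
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(build_crawler_command("xhs", "qrcode", "detail")))
```

To actually launch the crawler, the returned list could be passed to `subprocess.run(cmd, check=True)` from the project root. Building a list instead of a single string avoids shell quoting problems if a value ever contains spaces.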