### First Crawler with PageProcessor

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

Implement the PageProcessor interface to define how downloaded pages are processed. This example crawls GitHub repository information.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Retry failed downloads up to 3 times and sleep 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every repository-like link found on the page.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // The first path segment of the URL is the repository owner.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```

--------------------------------

### Basic GitHub Repository Crawler with PageProcessor

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-core/src/test/resources/html/mock-github.html

Implements the PageProcessor interface to define how downloaded pages are processed. This example extracts the repository author and name from GitHub pages and adds new target URLs for crawling.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every repository-like link found on the page.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*?").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```

--------------------------------

### Annotation-based Crawler Configuration

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

Use annotations to define URL patterns, extraction rules, and processing pipelines for a crawler. This example also crawls GitHub repository information.

```java
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    // Pages without a repository name are dropped (notNull = true).
    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```

--------------------------------

### Ruby Web Scraping Script for GitHub

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-scripts/README.md

This Ruby snippet shows how to extract GitHub repository details such as name, readme, star count, and fork count using XPath.
It is a more concise DSL example for WebMagic scripts.

```ruby
name = xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url = $page.getUrl().toString()
# Print the extracted fields only when a repository name was found.
puts name, readme, star, fork, url unless name == nil
# Backslashes are doubled so they survive Ruby's double-quoted string escaping.
urls "(https://github\\.com/\\w+/\\w+)"
urls "(https://github\\.com/\\w+)"
```

--------------------------------

### Annotation-based Crawler Configuration

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-extension/src/test/resources/html/mock-github.html

Use annotations such as @TargetUrl, @HelpUrl, and @ExtractBy on a POJO to define crawler behavior and data extraction rules.

```java
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```

--------------------------------

### Add WebMagic Dependencies to pom.xml

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

Include these dependencies in your project's pom.xml to use WebMagic core and extensions.

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>
```

--------------------------------

### JavaScript Web Scraping Script for GitHub

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-scripts/README.md

This JavaScript snippet demonstrates how to extract the repository name, readme content, star count, fork count, and URL from a GitHub page using XPath selectors. It is designed to be executed with WebMagic scripts.

```javascript
var name = xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme = xpath("//div[@id='readme']/tidyText()")
var star = xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
var fork = xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
var url = page.getUrl().toString()
// Print the extracted fields only when a repository name was found.
if (name != null) {
    println(name)
    println(readme)
    println(star)
    println(url)
}
// Backslashes are doubled so they survive JavaScript string escaping.
urls("(https://github\\.com/\\w+/\\w+)")
urls("(https://github\\.com/\\w+)")
```

--------------------------------

### Exclude slf4j-log4j12 for Custom SLF4J Implementation

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

If you are using a custom SLF4J implementation, exclude the default slf4j-log4j12 binding.

```xml
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
```
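
--------------------------------

### Checking the URL Patterns with java.util.regex

The crawler examples above rely on two regular expressions: one that picks repository links out of a page and one that reads the author from a repository URL. The sketch below is not taken from the WebMagic repository; it uses only `java.util.regex` to show what those two patterns match, and it assumes WebMagic's `regex(...)` selector behaves like a plain `Matcher.find()` with the first capture group, which may differ in detail. The class name `GithubUrlPatterns` and its helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only; it is not part of WebMagic.
public class GithubUrlPatterns {

    // Same patterns as in the crawler examples. In Java source, each regex
    // backslash has to be written as a double backslash.
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    /** Returns the first capture group of every repository-like URL found in the text. */
    static List<String> repoLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = REPO_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Returns the owner segment of a repository URL, or null when the URL does not match. */
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://github.com/code4craft/webmagic\">webmagic</a>"
                + "<a href=\"https://github.com/code4craft\">profile</a>";
        // prints [https://github.com/code4craft/webmagic]
        System.out.println(repoLinks(html));
        // prints code4craft
        System.out.println(author("https://github.com/code4craft/webmagic"));
        // prints null: a profile URL has no second path segment
        System.out.println(author("https://github.com/code4craft"));
    }
}
```

Because `\w` does not match a hyphen, a repository name such as `my-repo` would be matched only up to `my` by these patterns; a crawler that needs full names would likely want a wider character class.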