### First Crawler with PageProcessor
Source: https://github.com/code4craft/webmagic/blob/develop/README.md
Implement the PageProcessor interface to define how to process downloaded pages. This example crawls GitHub repository information.
```java
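import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Retry failed downloads up to 3 times and wait 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every repository link found on the page for crawling.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // The first path segment of the URL is the repository owner.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // Not a repository page: skip it so no result is passed to the pipelines.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}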
```
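The two regular expressions in `process` decide which links are followed and how the author is derived from the URL. The sketch below reproduces them with plain `java.util.regex`, so it runs without WebMagic. The class `GithubUrlRegexSketch` and its methods are hypothetical names made up for this illustration, and using `find()` plus `group(1)` is only an approximation of how WebMagic's regex selector behaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of WebMagic.
class GithubUrlRegexSketch {

    // In Java source, "\\." and "\\w" produce the regex tokens \. and \w.
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Collects group 1 of the first match in each link; links without a match are dropped.
    static List<String> repoLinks(List<String> links) {
        List<String> out = new ArrayList<>();
        for (String link : links) {
            Matcher m = REPO_LINK.matcher(link);
            if (m.find()) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    // Returns the owner segment, or null when the URL has no path after the owner.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "https://github.com/code4craft/webmagic",
                "https://github.com/code4craft",
                "https://example.com/about");
        System.out.println(repoLinks(links)); // prints [https://github.com/code4craft/webmagic]
        System.out.println(author("https://github.com/code4craft/webmagic")); // prints code4craft
        System.out.println(author("https://github.com/code4craft")); // prints null
    }
}
```

Because `\w` does not match a hyphen, a link such as `https://github.com/some-user/repo` is not matched by either pattern; widening the character class would be needed to follow such repositories.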
--------------------------------
### Basic GitHub Repository Crawler with PageProcessor
Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-core/src/test/resources/html/mock-github.html
Implements the PageProcessor interface to define how to process downloaded pages. This example extracts repository author and name from GitHub pages and adds new target URLs for crawling.
```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Add newly discovered repository URLs to the crawl queue.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*?").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // No repository name found: skip this page.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```
--------------------------------
### Annotation-based Crawler Configuration
Source: https://github.com/code4craft/webmagic/blob/develop/README.md
Use annotations to define URL patterns, extraction rules, and processing pipelines for a crawler. This example also crawls GitHub repository information.
```java
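import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    // notNull = true: pages without a repository name are discarded.
    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}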
```
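To see how the two URL annotations divide pages into roles, the sketch below treats the `@TargetUrl` and `@HelpUrl` values as ordinary `java.util.regex` patterns and requires a full match. This is an assumption made for illustration: WebMagic may preprocess these patterns differently, and `UrlRoleSketch` with its `role` method is a hypothetical helper, not a WebMagic class.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of WebMagic.
class UrlRoleSketch {

    // Pattern text copied from the annotations above.
    static final Pattern TARGET = Pattern.compile("https://github.com/\\w+/\\w+");
    static final Pattern HELP = Pattern.compile("https://github.com/\\w+");

    // Classifies a URL by the first pattern that matches the whole string.
    static String role(String url) {
        if (TARGET.matcher(url).matches()) {
            return "target"; // repository page: fields would be extracted here
        }
        if (HELP.matcher(url).matches()) {
            return "help"; // owner page: only used to discover more links
        }
        return "ignored";
    }

    public static void main(String[] args) {
        System.out.println(role("https://github.com/code4craft/webmagic")); // prints target
        System.out.println(role("https://github.com/code4craft")); // prints help
        System.out.println(role("https://github.com/code4craft/webmagic/issues")); // prints ignored
    }
}
```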
--------------------------------
### Ruby Web Scraping Script for GitHub
Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-scripts/README.md
This Ruby snippet shows how to extract GitHub repository details like name, readme, star count, and fork count using XPath. It's a more concise DSL example for WebMagic scripts.
```ruby
name = xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url = $page.getUrl().toString()
# Print the extracted fields only when a repository name was found.
puts name, readme, star, fork, url unless name.nil?
# Single quotes keep the backslashes, so \. and \w reach the regex engine intact.
urls '(https://github\.com/\w+/\w+)'
urls '(https://github\.com/\w+)'
```
--------------------------------
### Annotation-based Crawler Configuration
Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-extension/src/test/resources/html/mock-github.html
Use annotations like @TargetUrl, @HelpUrl, and @ExtractBy to define crawler behavior and data extraction rules using POJOs.
```java
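import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }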
}
```
--------------------------------
### Add WebMagic Dependencies to pom.xml
Source: https://github.com/code4craft/webmagic/blob/develop/README.md
Include these dependencies in your project's pom.xml to use WebMagic core and extensions.
```xml
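<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>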
```
--------------------------------
### JavaScript Web Scraping Script for GitHub
Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-scripts/README.md
This JavaScript snippet demonstrates how to extract repository name, readme content, star count, fork count, and URL from a GitHub page using XPath selectors. It's designed to be executed with WebMagic scripts.
```javascript
var name = xpath("//h1[@class='entry-title public']/strong/a/text()");
var readme = xpath("//div[@id='readme']/tidyText()");
var star = xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()");
var fork = xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()");
var url = page.getUrl().toString();
// Print the extracted fields only when a repository name was found.
if (name != null) {
    println(name);
    println(readme);
    println(star);
    println(fork);
    println(url);
}
// Backslashes are doubled so that \. and \w survive JavaScript string parsing.
urls("(https://github\\.com/\\w+/\\w+)");
urls("(https://github\\.com/\\w+)");
```
--------------------------------
### Exclude slf4j-log4j12 for Custom SLF4J Implementation
Source: https://github.com/code4craft/webmagic/blob/develop/README.md
If you are using a custom SLF4J implementation, exclude the default slf4j-log4j12 binding.
```xml
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
```