### First Crawler with PageProcessor

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

Implement the `PageProcessor` interface to define how downloaded pages are processed. This example starts from a GitHub user page, queues the repository links it finds, and extracts each repository's author, name, and README.

```java
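import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every repository link found on the page for crawling.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // The first path segment of the URL is the repository owner.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());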
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name")==null){
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```
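
The two regular expressions used in `process` can be tried in isolation with the JDK's `java.util.regex`. This is a minimal sketch, assuming the selector's `regex()` call returns the first capture group, as the usage above suggests; `GithubRegexSketch`, `repoLinks`, and `author` are hypothetical names and not part of WebMagic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GithubRegexSketch {

    // Same patterns as in the processor above, written as Java string literals.
    private static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    private static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Collects every owner/repository URL that appears in the given text.
    static List<String> repoLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = REPO_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the owner segment of a repository URL, or null if the URL has no second segment.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints [https://github.com/code4craft/webmagic]
        System.out.println(repoLinks("see https://github.com/code4craft/webmagic and https://github.com/code4craft"));
        // prints code4craft
        System.out.println(author("https://github.com/code4craft/webmagic"));
        // prints null
        System.out.println(author("https://github.com/code4craft"));
    }
}
```

Note that `\w` does not match `-`, so owner or repository names containing hyphens are skipped or truncated by these patterns.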

--------------------------------

### Basic GitHub Repository Crawler with PageProcessor

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-core/src/test/resources/html/mock-github.html

Implement the `PageProcessor` interface to define how downloaded pages are processed. This example extracts the repository author and name from GitHub pages and queues newly discovered repository URLs for crawling.

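```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {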

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*?").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```
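
The `name` selector here uses `@class='entry-title public'`. In XPath, `@class='...'` compares the whole attribute value, so a shorter value such as `public` alone does not match the same element. The sketch below illustrates that rule with the JDK's `javax.xml.xpath` on a hypothetical, well-formed sample fragment. It does not use WebMagic's own selector, and `XPathClassSketch`, `SAMPLE`, and `evaluate` are names made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathClassSketch {

    // Hypothetical, well-formed sample markup; real GitHub pages differ.
    static final String SAMPLE =
        "<html><body><h1 class=\"entry-title public\">"
        + "<strong><a href=\"/code4craft/webmagic\">webmagic</a></strong>"
        + "</h1></body></html>";

    // Evaluates the expression against SAMPLE and returns the string value of the result.
    static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Full attribute value: prints [webmagic]
        System.out.println("[" + evaluate("//h1[@class='entry-title public']/strong/a/text()") + "]");
        // Partial attribute value: no node matches, prints []
        System.out.println("[" + evaluate("//h1[@class='public']/strong/a/text()") + "]");
    }
}
```

If the target page changes the `class` attribute of its heading, the selector has to be updated to the new full value.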

--------------------------------

### Annotation-based Crawler Configuration

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

Use annotations to define URL patterns, extraction rules, and processing pipelines for a crawler. This example also crawls GitHub repository information.

```java
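import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")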
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```

--------------------------------

### Ruby Web Scraping Script for GitHub

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-scripts/README.md

This Ruby snippet shows how to extract GitHub repository details like name, readme, star count, and fork count using XPath. It's a more concise DSL example for WebMagic scripts.

```ruby
name= xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url=$page.getUrl().toString()

puts name,readme,star,fork,url unless name==nil

urls '(https://github\.com/\w+/\w+)'
urls '(https://github\.com/\w+)'
```

--------------------------------

### Annotation-based Crawler Configuration

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-extension/src/test/resources/html/mock-github.html

Use annotations like @TargetUrl, @HelpUrl, and @ExtractBy to define crawler behavior and data extraction rules using POJOs.

```java
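import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("https://github.com/\\w+/\\w+")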
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```
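
To show the general idea behind annotation-driven extraction, the sketch below reads a field-level annotation through reflection and fills the field from a URL. It is an illustration only, not WebMagic's implementation: `@UrlRegex` is a hypothetical stand-in for an annotation such as `@ExtractByUrl`, and `AnnotationSketch`, `Repo`, and `fromUrl` are made-up names. Returning `null` when a pattern fails is an assumed simplification, loosely analogous to discarding a result whose required field is missing.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotationSketch {

    // Hypothetical annotation; not one of WebMagic's own types.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface UrlRegex {
        String value();
    }

    static class Repo {
        @UrlRegex("https://github\\.com/(\\w+)/.*")
        String author;

        @UrlRegex("https://github\\.com/\\w+/(\\w+)")
        String name;
    }

    // Fills every @UrlRegex field from the URL; returns null if any pattern fails to match.
    static Repo fromUrl(String url) {
        Repo repo = new Repo();
        for (Field field : Repo.class.getDeclaredFields()) {
            UrlRegex rule = field.getAnnotation(UrlRegex.class);
            if (rule == null) {
                continue; // not an extraction field
            }
            Matcher m = Pattern.compile(rule.value()).matcher(url);
            if (!m.find()) {
                return null;
            }
            try {
                field.setAccessible(true);
                field.set(repo, m.group(1));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return repo;
    }

    public static void main(String[] args) {
        Repo repo = fromUrl("https://github.com/code4craft/webmagic");
        // prints code4craft/webmagic
        System.out.println(repo.author + "/" + repo.name);
    }
}
```

The appeal of this style is that the extraction rules sit next to the fields they populate, so the model class doubles as the crawler's configuration.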

--------------------------------

### Add WebMagic Dependencies to pom.xml

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

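Add these dependencies to your project's pom.xml to use the WebMagic core and extension modules. `${webmagic.version}` is a Maven property placeholder: define it in your `<properties>` section or replace it with the release you want to use.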

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>
```

--------------------------------

### JavaScript Web Scraping Script for GitHub

Source: https://github.com/code4craft/webmagic/blob/develop/webmagic-scripts/README.md

This JavaScript snippet demonstrates how to extract repository name, readme content, star count, fork count, and URL from a GitHub page using XPath selectors. It's designed to be executed with WebMagic scripts.

```javascript
var name=xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme=xpath("//div[@id='readme']/tidyText()")
var star=xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
var fork=xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
var url=page.getUrl().toString()
if (name!=null){
    println(name)
    println(readme)
    println(star)
    println(fork)
    println(url)
}

urls("(https://github\\.com/\\w+/\\w+)")
urls("(https://github\\.com/\\w+)")
```

--------------------------------

### Exclude slf4j-log4j12 for Custom SLF4J Implementation

Source: https://github.com/code4craft/webmagic/blob/develop/README.md

If you use your own SLF4J binding, exclude the default slf4j-log4j12 binding by placing this `<exclusions>` element inside the WebMagic `<dependency>` element that pulls it in.

```xml
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
```
