### Install GoScrapy CLI

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Installs the GoScrapy CLI tool. Ensure you have Go version 1.22 or higher.

```sh
go install github.com/tech-engine/goscrapy@latest
```

--------------------------------

### Install GoScrapy CLI and Create Project

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Installs the GoScrapy CLI tool and scaffolds a new spider project. Use 'gos' or 'goscrapy' for commands. It also shows how to create a custom pipeline.

```bash
# Install GoScrapy CLI (installs both 'goscrapy' and 'gos' alias)
go install github.com/tech-engine/goscrapy/cmd/...@latest

# Verify installation
gos -v
# or
goscrapy -v

# Create a new spider project
goscrapy startproject books_to_scrape

# Output:
# 🚀 GoScrapy generating project files. Please wait!
# 📦 Initializing Go module: books_to_scrape...
# ✔️ books_to_scrape/base.go
# ✔️ books_to_scrape/constants.go
# ✔️ books_to_scrape/errors.go
# ✔️ books_to_scrape/job.go
# ✔️ main.go
# ✔️ books_to_scrape/record.go
# ✔️ books_to_scrape/spider.go
# 📦 Do you want to resolve dependencies now (go mod tidy)? [Y/n]: Y
# ✨ Congrats, books_to_scrape created successfully.

# Create a custom pipeline
gos pipeline export_2_DB
```

--------------------------------

### Install GoScrapy CLI

Source: https://github.com/tech-engine/goscrapy/blob/main/README.md

Installs the GoScrapy CLI tool and its 'gos' alias. Ensure your Go environment is set up correctly.

```sh
go install github.com/tech-engine/goscrapy/cmd/...@latest
```

--------------------------------

### Initialize GoScrapy Spider with Middlewares and Pipelines

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Set up a new GoScrapy spider, configuring it with the defined middlewares and pipelines. The spider is started in a goroutine, and its Close method is called once the engine stops.
```go
package scrapejsp

import (
	"context"

	"github.com/tech-engine/goscrapy/cmd/gos"
)

type Spider struct {
	gos.ICoreSpider[*Record]
}

func New(ctx context.Context) *Spider {
	core := gos.New[*Record]().
		WithMiddlewares(MIDDLEWARES...).
		WithPipelines(PIPELINES...)

	spider := &Spider{
		core,
	}

	go func() {
		_ = core.Start(ctx)
		spider.Close(ctx)
	}()

	return spider
}
```

--------------------------------

### Initialize GoScrapy Spider Base

Source: https://github.com/tech-engine/goscrapy/blob/main/README.md

Sets up the core engine and configures middlewares and pipelines in the base.go file. The application starts in a goroutine.

```go
package myspider

import (
	"context"

	"github.com/tech-engine/goscrapy/pkg/gos"
)

type Spider struct {
	gos.ICoreSpider[*Record]
}

func New(ctx context.Context) *Spider {
	// Initialize and configure everything in one go
	app := gos.NewApp[*Record]().
		WithMiddlewares(MIDDLEWARES...).
		WithPipelines(PIPELINES...)

	spider := &Spider{app}

	go func() {
		_ = app.Start(ctx)
		spider.Close(ctx)
	}()

	// Only the spider is returned, matching the declared return type
	return spider
}
```

--------------------------------

### GoScrapy Spider Application Entry Point

Source: https://context7.com/tech-engine/goscrapy/llms.txt

The main function initializes and runs a GoScrapy spider with context management and graceful shutdown. It starts the spider and waits for its completion or cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"

	"books_to_scrape/books_to_scrape"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Initialize and start the spider
	spider := books_to_scrape.New(ctx)

	// Start scraping with a job (nil for no specific job input)
	spider.StartRequest(ctx, nil)

	fmt.Println("🕷️ GoScrapy spider is running. Press Ctrl+C to stop.")

	// Wait for completion with auto-exit when work is done
	if err := spider.Wait(true); err != nil && !errors.Is(err, context.Canceled) {
		fmt.Fprintf(os.Stderr, "❌ Engine finished with error: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("✨ Engine finished successfully.")
}
```

--------------------------------

### Configure GoScrapy Settings

Source: https://github.com/tech-engine/goscrapy/blob/main/README.md

Defines middlewares and export pipelines in the centralized settings.go file. This example includes Azure TLS client, Retry functionality, and CSV export.

```go
package myspider

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/csv"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

// Add Azure TLS client and Retry functionality seamlessly.
// azureTLSOpts is not declared in this snippet; it is assumed to be
// defined elsewhere in the package.
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.AzureTLS(azureTLSOpts),
	middlewares.Retry(), // 3 retries, 5s back-off
}

// Prepare CSV export pipeline
var export2CSV = csv.New[*Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})

// Export to CSV instantly
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
}
```

--------------------------------

### Verify GoScrapy Installation

Source: https://github.com/tech-engine/goscrapy/blob/main/README.md

Checks if the GoScrapy CLI was installed successfully by running the version command. You can use either 'gos' or 'goscrapy'.

```sh
gos -v
# or
goscrapy -v
```

--------------------------------

### Custom Database Pipeline Implementation

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Implement a custom pipeline to store scraped data in a database. This example uses PostgreSQL and demonstrates the Open, Close, and ProcessItem methods.
```go
package pipelines

import (
	"context"
	"database/sql"

	// Assumed driver: any database/sql driver that registers "postgres" works.
	_ "github.com/lib/pq"

	"github.com/tech-engine/goscrapy/pkg/core"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

type DatabasePipeline[OUT any] struct {
	connectionString string
	db               *sql.DB
}

func NewDatabasePipeline[OUT any](connectionString string) *DatabasePipeline[OUT] {
	// Keep the connection string so Open can use it later
	return &DatabasePipeline[OUT]{connectionString: connectionString}
}

// Open is called when the spider starts
func (p *DatabasePipeline[OUT]) Open(ctx context.Context) error {
	db, err := sql.Open("postgres", p.connectionString)
	if err != nil {
		return err
	}
	p.db = db
	return nil
}

// Close is called when the spider shuts down
func (p *DatabasePipeline[OUT]) Close() {
	if p.db != nil {
		_ = p.db.Close()
	}
}

// ProcessItem is called for each yielded record
func (p *DatabasePipeline[OUT]) ProcessItem(item pm.IPipelineItem, original core.IOutput[OUT]) error {
	record := original.Record()

	// Insert record into database
	if _, err := p.db.Exec(
		"INSERT INTO scraped_data (data) VALUES ($1)",
		record,
	); err != nil {
		return err
	}

	// Pass data to next pipeline via item, only after a successful insert
	item.Set("db_inserted", true)
	return nil
}
```

--------------------------------

### Generate Custom Pipeline

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Example of using the goscrapy CLI to generate a custom pipeline named 'export_2_DB'.

```sh
abc\go\go-test-scrapy>scrapejsp> gos pipeline export_2_DB

✔️ pipelines\export_2_DB.go

✨ Congrates, export_2_DB created successfully.
```

--------------------------------

### Get All and First Matching Nodes

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Shows how to retrieve all matching HTML nodes or just the first matching node using CSS selectors.
```go
var productUrlNodes []*html.Node
productUrlNodes = resp.Css("article.product_pod h3 a").GetAll()

var firstProductUrlNode *html.Node
firstProductUrlNode = resp.Css("article.product_pod h3 a").Get()
```

--------------------------------

### Run GoScrapy Spider and Handle Completion

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Main function to initialize and run a GoScrapy spider. It starts the spider, sends an initial request, prints status messages, and waits for the spider to complete or be canceled.

```go
func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// start spider
	spider := books_to_scrape.New(ctx)

	// start the scraper with a job, currently nil is passed but you can pass your job here
	spider.StartRequest(ctx, nil)

	fmt.Println("🕷️ GoScrapy spider is running. Press Ctrl+C to stop.")

	// wait for completion
	if err := spider.Wait(true); err != nil && !errors.Is(err, context.Canceled) {
		fmt.Fprintf(os.Stderr, "❌ Engine finished with error: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("✨ Engine finished successfully.")
}
```

--------------------------------

### Build HTTP GET Request

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Constructs a GET request using the Request Builder API. Allows setting URL, headers, metadata, and cookie jar keys before scheduling the request.
```go
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Create a new request (must not be reused between calls)
	req := s.Request(ctx)

	// GET request (default method)
	req.Url("https://example.com/api/data")

	// Add custom headers
	headers := http.Header{
		"User-Agent": []string{"GoScrapy/1.0"},
		"Accept":     []string{"application/json"},
	}
	req.Header(headers)

	// Store metadata to pass to callback
	req.Meta("category", "books")
	req.Meta("page", 1)

	// Set cookie jar key for session isolation
	req.CookieJar("session_1")

	// Schedule request with callback
	s.Parse(req, s.parseResponse)
}
```

--------------------------------

### GoScrapy Spider Request and Response Handling

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Implements the core scraping logic, including starting requests, parsing responses using CSS selectors, and extracting product details and pagination links.

```go
package books_to_scrape

import (
	"context"
	"fmt"
	"regexp"
	"strings"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// stockRe captures the number in "(19 available)"; compiled once at package level
var stockRe = regexp.MustCompile(`\((\d+) available\)`)

// StartRequest is the entry point for the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Create a new request (must not be reused)
	req := s.Request(ctx)

	// GET is the default method
	req.Url(s.baseUrl)

	// Schedule request with callback
	s.Parse(req, s.parse)
}

// Close is called when the spider is about to shut down
func (s *Spider) Close(ctx context.Context) {
	s.Logger().Info("closing")
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	s.Logger().Infof("GET: %d %s", resp.StatusCode(), resp.Request().URL.String())

	// Extract product URLs using CSS selector
	for _, productUrl := range resp.Css("article.product_pod h3 a").Attr("href") {
		req := s.Request(ctx)
		if strings.HasPrefix(productUrl, "catalogue/") {
			productUrl = fmt.Sprintf("%s/%s", s.baseUrl, productUrl)
		} else {
			productUrl = fmt.Sprintf("%s/catalogue/%s", s.baseUrl, productUrl)
		}
		req.Url(productUrl)
		s.Parse(req, s.parseProduct)
	}

	// Handle pagination. The href is relative, so "catalogue/" is added
	// only when it is missing, mirroring the product URL handling above.
	nextUrls := resp.Css("li.next a").Attr("href")
	if len(nextUrls) > 0 {
		nextUrl := nextUrls[0]
		if !strings.HasPrefix(nextUrl, "catalogue/") {
			nextUrl = "catalogue/" + nextUrl
		}
		req := s.Request(ctx)
		req.Url(fmt.Sprintf("%s/%s", s.baseUrl, nextUrl))
		s.Parse(req, s.parse)
	}
}

func (s *Spider) parseProduct(ctx context.Context, resp core.IResponseReader) {
	product := resp.Css("article.product_page")

	var title string
	if titles := product.Css(".product_main h1").Text(); len(titles) > 0 {
		title = titles[0]
	}

	var price string
	if prices := product.Css(".price_color").Text(); len(prices) > 0 {
		price = prices[0]
	}

	var stock string
	if stocks := product.Css(".availability").Text(); len(stocks) > 0 {
		match := stockRe.FindStringSubmatch(strings.TrimSpace(stocks[0]))
		if len(match) > 1 {
			stock = match[1]
		}
	}

	var rating string
	if ratingClassAttrs := product.Css(".star-rating").Attr("class"); len(ratingClassAttrs) > 0 {
		// The class attribute looks like "star-rating Three"; guard against a single class
		if classes := strings.Fields(ratingClassAttrs[0]); len(classes) > 1 {
			rating = classes[1]
		}
	}

	// Yield the scraped data to pipelines
	s.Yield(&Record{
		Title:  title,
		Price:  price,
		Stock:  stock,
		Rating: rating,
	})
}
```

--------------------------------

### Customize GoScrapy Client

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Demonstrates how to create a new GoScrapy client with custom HTTP client options like proxies and timeouts.

```go
func New(ctx context.Context) (*Spider, <-chan error) {

	// default client options
	// proxies := gos.WithProxies("proxy_url1", "proxy_url2", ...)
	// core := gos.New[*Record]().WithClient(
	// 	gos.DefaultClient(proxies),
	// )

	// we can also provide in our custom client
	// core := gos.New[*Record]().WithClient(myCustomHTTPClient)
}
```

--------------------------------

### Initialize CSV Export Pipeline

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Sets up the CSV export pipeline with a specified filename. Ensure the 'csv' package is imported.
```go
// use export 2 csv pipeline
export2Csv := csv.New[*scrapejsp.Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})
```

--------------------------------

### GoScrapy Spider StartRequest Method

Source: https://github.com/tech-engine/goscrapy/wiki/Home

The entrypoint for a GoScrapy spider. It initiates requests using the Request() method and calls the Parse() method to handle responses.

```go
// This is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {

	// for each request we must call Request() and never reuse the result
	req := s.Request(ctx)

	/* GET is the default request method, method chaining possible
	var headers http.Header
	req.Url("").
		Meta("MY_KEY1", "MY_VALUE").
		Meta("MY_KEY2", true).
		Header(headers)
	*/

	/* POST
	req.Url()
	req.Method("POST")
	req.Body()
	*/

	// call the next parse method
	s.Parse(req, s.parse)
}
```

--------------------------------

### Configure Default HTTP Client with Proxies and Timeouts

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Configure the default HTTP client with proxy rotation, timeouts, and connection limits. Use gos.WithProxies for round-robin rotation or gos.WithProxyFn for custom proxy selection logic.

```go
package myspider

import (
	"context"
	"net/http"
	"net/url"
	"time"

	"github.com/tech-engine/goscrapy/pkg/gos"
)

func New(ctx context.Context) *Spider {
	// Configure default client with proxy rotation
	client := gos.DefaultClient(
		gos.WithProxies("http://proxy1:8080", "http://proxy2:8080"), // Round-robin rotation
		gos.WithTimeout(30*time.Second),
		gos.WithMaxIdleConns(200),
		gos.WithMaxConnsPerHost(100),
		gos.WithMaxIdleConnsPerHost(100),
	)

	// Or with custom proxy function
	customClient := gos.DefaultClient(
		gos.WithProxyFn(func(req *http.Request) (*url.URL, error) {
			// Custom proxy selection logic
			return url.Parse("http://myproxy:8080")
		}),
	)
	// Shown as an alternative only; pass it to WithClient in place of client to use it
	_ = customClient

	app := gos.New[*Record](gos.WithClient(client)).
		WithMiddlewares(MIDDLEWARES...).
		WithPipelines(PIPELINES...)

	spider := &Spider{ICoreSpider: app}

	go func() {
		_ = app.Start(ctx)
		spider.Close(ctx)
	}()

	return spider
}
```

--------------------------------

### Create New GoScrapy Project

Source: https://github.com/tech-engine/goscrapy/blob/main/README.md

Scaffolds a new GoScrapy project, initializing a Go module and generating necessary files. It prompts to resolve dependencies using 'go mod tidy'.

```sh
goscrapy startproject books_to_scrape
```

```sh
\tech-engine\go\go-test-scrapy> goscrapy startproject books_to_scrape

🚀 GoScrapy generating project files. Please wait!

📦 Initializing Go module: books_to_scrape...
✔️ books_to_scrape\base.go
✔️ books_to_scrape\constants.go
✔️ books_to_scrape\errors.go
✔️ books_to_scrape\job.go
✔️ main.go
✔️ books_to_scrape\record.go
✔️ books_to_scrape\spider.go

📦 Do you want to resolve dependencies now (go mod tidy)? [Y/n]: Y
📦 Resolving dependencies...

✨ Congrats, books_to_scrape created successfully.
```

--------------------------------

### Initialize JSON Export Pipeline

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Configures the JSON export pipeline, enabling immediate writing to a file. Ensure the 'json' package is imported.

```go
// use export 2 json pipeline
export2Json := json.New[*scrapejsp.Record](json.Options{
	Filename:  "itstimeitsnowornever.json",
	Immediate: true,
})
```

--------------------------------

### Add Middlewares to Manager

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Demonstrates how to add multiple built-in middlewares to the GoScrapy middleware manager.

```go
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}
```

--------------------------------

### Configure Scheduler and Pipeline Manager Settings

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Define constants for scheduler queue sizes, concurrency, and pipeline manager buffer sizes and concurrency limits. Defaults are provided.
```go
// Default: 50000
const SCHEDULER_REQ_RES_POOL_SIZE = ""

// Default: num. of CPU * 30
const SCHEDULER_CONCURRENCY = ""

// Default: 50000
const SCHEDULER_WORK_QUEUE_SIZE = ""

// Pipeline Manager settings

// Default: 10000
const PIPELINEMANAGER_ITEMPOOL_SIZE = ""

// Default: 24
const PIPELINEMANAGER_ITEM_SIZE = ""

// Default: 5000
const PIPELINEMANAGER_OUTPUT_QUEUE_BUF_SIZE = ""

// Default: 150
const PIPELINEMANAGER_MAX_PROCESS_ITEM_CONCURRENCY = ""
```

--------------------------------

### Create New GoScrapy Project

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Scaffolds a new GoScrapy project named 'scrapejsp'. This command initializes a Go module and generates necessary files, prompting for dependency resolution.

```sh
gos startproject scrapejsp
```

--------------------------------

### GoScrapy Spider Base Configuration

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Defines the Spider struct, embedding the core framework and setting up middlewares and pipelines. The New function initializes the application with specified components.

```go
package books_to_scrape

import (
	"context"

	"github.com/tech-engine/goscrapy/pkg/gos"
)

type Spider struct {
	gos.ICoreSpider[*Record]
	baseUrl string
}

// New initializes the spider with middlewares and pipelines
func New(ctx context.Context) *Spider {
	app := gos.NewApp[*Record]().
		WithMiddlewares(MIDDLEWARES...).
		WithPipelines(PIPELINES...)

	spider := &Spider{
		ICoreSpider: app,
		baseUrl:     "https://books.toscrape.com",
	}

	go func() {
		_ = app.Start(ctx)
		spider.Close(ctx)
	}()

	return spider
}
```

--------------------------------

### Configure GoScrapy Settings

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Defines constants and variables for configuring GoScrapy's behavior, such as HTTP timeouts, retry mechanisms, scheduler concurrency, and logging levels. It also initializes middlewares and pipelines.
```go
package books_to_scrape

import (
	"os"

	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/csv"
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/json"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

// HTTP Transport settings
const MIDDLEWARE_HTTP_TIMEOUT_MS = ""              // Default: 10000
const MIDDLEWARE_HTTP_MAX_IDLE_CONN = ""           // Default: 1000
const MIDDLEWARE_HTTP_MAX_CONN_PER_HOST = ""       // Default: 1000
const MIDDLEWARE_HTTP_MAX_IDLE_CONN_PER_HOST = ""  // Default: 1000

// Retry middleware settings
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = "" // Default: 3
const MIDDLEWARE_HTTP_RETRY_CODES = ""       // Default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = ""  // Default: 1s

// Scheduler settings
const SCHEDULER_REQ_RES_POOL_SIZE = "" // Default: 50000
const SCHEDULER_CONCURRENCY = ""       // Default: num of CPU * 30
const SCHEDULER_WORK_QUEUE_SIZE = ""   // Default: 100000

// Pipeline Manager settings
const PIPELINEMANAGER_ITEMPOOL_SIZE = ""                // Default: 10000
const PIPELINEMANAGER_MAX_PROCESS_ITEM_CONCURRENCY = "" // Default: 150

// Log level: DEBUG, INFO, WARN, ERROR, NONE
const GOS_LOG_LEVEL = "INFO"

// Middlewares (executed in reverse order, bottom to top)
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}

// CSV export pipeline
var export2CSV = csv.New[*Record](csv.Options{
	Filename: "output.csv",
})

// JSON export pipeline
var export2Json = json.New[*Record](json.Options{
	Filename:  "output.json",
	Immediate: true, // Flush immediately after each write
})

// Pipelines (executed in order)
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
	// export2Json,
}

// init exports every non-empty setting as an environment variable
func init() {
	settings := map[string]string{
		"GOS_LOG_LEVEL":                     GOS_LOG_LEVEL,
		"MIDDLEWARE_HTTP_TIMEOUT_MS":        MIDDLEWARE_HTTP_TIMEOUT_MS,
		"MIDDLEWARE_HTTP_RETRY_MAX_RETRIES": MIDDLEWARE_HTTP_RETRY_MAX_RETRIES,
		"SCHEDULER_CONCURRENCY":             SCHEDULER_CONCURRENCY,
	}

	for key, value := range settings {
		if value != "" {
			os.Setenv(key, value)
		}
	}
}
```

--------------------------------

### Configure Retry Middleware

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Use default settings for retries or customize max retries, HTTP codes, base delay, and add a callback for retry attempts.

```go
package settings

import (
	"log"
	"net/http"
	"time"

	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
)

// Retry with default settings (3 retries, 1s base delay)
var defaultRetry = middlewares.Retry()

// Retry with custom options
var customRetry = middlewares.Retry(middlewares.RetryOpts{
	MaxRetries: 5,                                   // Number of additional retries
	Codes:      []uint16{500, 502, 503, 504, 429}, // HTTP codes to trigger retry
	BaseDelay:  2 * time.Second,                     // Exponential backoff base
	Cb: func(req *http.Request, attempt uint8) bool {
		// Optional callback after each retry
		log.Printf("Retry attempt %d for %s", attempt, req.URL.String())
		return true // Return false to stop retrying
	},
})

var MIDDLEWARES = []middlewaremanager.Middleware{
	customRetry,
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}
```

--------------------------------

### Define Middlewares and Pipelines for GoScrapy

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Declare and initialize middleware and pipeline variables. This includes retry, cookie jar, duplicate filter middlewares, and a CSV exporter pipeline.
```go
// Middlewares here
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}

var export2CSV = csv.New[*Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})

// Pipelines here
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
	// export2Json,
}
```

--------------------------------

### JSON Export Pipeline (Buffered)

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Configure a JSON export pipeline with buffered writes for efficiency. Specify the output filename.

```go
package settings

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/json"
)

// JSON export with buffered writes
var export2Json = json.New[*Record](json.Options{
	Filename: "scraped_data.json",
})
```

--------------------------------

### Create Custom Pipeline Group

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Defines a group of pipelines to be executed concurrently. Add individual pipelines to this group using the 'Add' method. Ensure 'pm' and the relevant pipeline types are imported.

```go
func myCustomPipelineGroup() *pm.Group[*Record] {
	// create a group
	pipelineGroup := pm.NewGroup[*Record]()

	pipelineGroup.Add(export2CSV)
	// pipelineGroup.Add(export2Json)

	return pipelineGroup
}
```

--------------------------------

### GoScrapy Job Implementation

Source: https://github.com/tech-engine/goscrapy/wiki/Home

A basic implementation of the IJob interface for GoScrapy. Includes an 'id' field and the required Id() method.

```go
type Job struct {
	id string
	// add your own fields here
}

func (j *Job) Id() string {
	return j.id
}
```

--------------------------------

### GoScrapy Record Implementation

Source: https://github.com/tech-engine/goscrapy/wiki/Home

An implementation of the IOutput interface for GoScrapy Records. It holds a pointer to its Job in the 'J' field and provides methods for accessing record data.
```go
type Record struct {
	J *Job `json:"-" csv:"-"`
}

func (r *Record) Record() *Record {
	return r
}

func (r *Record) RecordKeys() []string {
	....
	keys := make([]string, numFields)
	....
	return keys
}

func (r *Record) RecordFlat() []any {
	....
	return slice
}

func (r *Record) Job() core.IJob {
	return r.J
}
```

--------------------------------

### Configure HTTP Transport and Retry Middleware Settings

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Set constants for HTTP transport timeouts, connection limits, and retry middleware parameters. Defaults are provided for each setting.

```go
// HTTP Transport settings

// Default: 10000
const MIDDLEWARE_HTTP_TIMEOUT_MS = ""

// Default: 1000
const MIDDLEWARE_HTTP_MAX_IDLE_CONN = ""

// Default: 1000
const MIDDLEWARE_HTTP_MAX_CONN_PER_HOST = ""

// Default: 1000
const MIDDLEWARE_HTTP_MAX_IDLE_CONN_PER_HOST = ""

// Inbuilt Retry middleware settings

// Default: 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = ""

// Default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_CODES = ""

// Default: 1s
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = ""
```

--------------------------------

### Configure Logging Levels and Use Logger

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Set log level via environment variable GOS_LOG_LEVEL (DEBUG, INFO, WARN, ERROR, NONE). Use the logger within your spider methods for different log levels.
```go
package myspider

import (
	"context"
	"errors"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// Set log level via environment variable
// GOS_LOG_LEVEL=DEBUG  // Detailed execution trace
// GOS_LOG_LEVEL=INFO   // Basic startup/shutdown info (Default)
// GOS_LOG_LEVEL=WARN   // Warnings and retry notifications
// GOS_LOG_LEVEL=ERROR  // Fatal errors only
// GOS_LOG_LEVEL=NONE   // Disable all framework logging

// Using logger in spider
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	// Placeholder error so the Errorf call below has something to print
	err := errors.New("example failure")

	s.Logger().Debug("Debug message")
	s.Logger().Info("Info message")
	s.Logger().Infof("Formatted: status=%d", resp.StatusCode())
	s.Logger().Warn("Warning message")
	s.Logger().Error("Error message")
	s.Logger().Errorf("Error: %v", err)
}
```

--------------------------------

### Build HTTP POST Request

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Constructs a POST request with a JSON body using the Request Builder API. Sets the URL, method, and body, then schedules the request.

```go
func (s *Spider) postRequest(ctx context.Context) {
	req := s.Request(ctx)

	// POST request with body
	req.Url("https://example.com/api/submit")
	req.Method("POST")
	req.Body(map[string]string{"key": "value"})

	s.Parse(req, s.handlePostResponse)
}
```

--------------------------------

### Configure CSV Export Pipeline

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Export scraped data to CSV files with automatic header generation. Customize the filename or provide an existing file handle.
```go
package settings

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/csv"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

// CSV export with custom filename
var export2CSV = csv.New[*Record](csv.Options{
	Filename: "scraped_data.csv",
})

// CSV export with existing file handle
// file, _ := os.OpenFile("custom.csv", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0640)
// var export2CSVFile = csv.New[*Record](csv.Options{
// 	File: file,
// })

var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
}
```

--------------------------------

### Define GoScrapy Spider Parsing Logic

Source: https://github.com/tech-engine/goscrapy/blob/main/README.md

Implements the spider's parsing logic in spider.go. This includes defining the entry point for requests and handling responses.

```go
package myspider

import (
	"context"
	"encoding/json"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// StartRequest is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Create a new request. This request must not be reused.
	req := s.Request(ctx)
	req.Url("https://httpbin.org/get")
	s.Parse(req, s.parse)
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	s.Logger().Infof("status: %d", resp.StatusCode())

	var data Record
	if err := json.Unmarshal(resp.Bytes(), &data); err != nil {
		s.Logger().Errorf("failed to unmarshal record: %v", err)
		return
	}

	// Yield sends the data securely to your configured pipelines
	s.Yield(&data)
}

func (s *Spider) Close(ctx context.Context) {
}
```

--------------------------------

### Implement and Use Custom Logger

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Implement the logger interface for custom logging behavior. Then, use gos.WithLogger to integrate your custom logger with the GoScrapy application.
```go
// Custom logger implementation
type MyLogger struct{}

func (l *MyLogger) Debug(msg string) { /* custom implementation */ }
func (l *MyLogger) Info(msg string)  { /* custom implementation */ }
func (l *MyLogger) Warn(msg string)  { /* custom implementation */ }
func (l *MyLogger) Error(msg string) { /* custom implementation */ }

func (l *MyLogger) Debugf(format string, args ...any) { /* custom implementation */ }
func (l *MyLogger) Infof(format string, args ...any)  { /* custom implementation */ }
func (l *MyLogger) Warnf(format string, args ...any)  { /* custom implementation */ }
func (l *MyLogger) Errorf(format string, args ...any) { /* custom implementation */ }

// Use custom logger
app := gos.New[*Record]().
	WithLogger(&MyLogger{}).
	WithMiddlewares(MIDDLEWARES...).
	WithPipelines(PIPELINES...)
```

--------------------------------

### Extract Data with CSS and XPath Selectors

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Demonstrates using CSS and XPath selectors to extract data from HTML responses. Supports chaining selectors and retrieving attributes or text content.
```go
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	// CSS selector - extract href attributes from all matching elements
	productUrls := resp.Css("article.product_pod h3 a").Attr("href")

	// Extract text content from elements
	productNames := resp.Css("article.product_pod h3 a").Text()

	// Selector chaining
	productUrls = resp.Css("article.product_pod").Css("h3 a").Attr("href")

	// XPath selector
	productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]//h3//a").Attr("href")

	// Chain XPath and CSS selectors
	productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]").Css("h3 a").Attr("href")

	// Get all matching DOM nodes
	nodes := resp.Css("article.product_pod h3 a").GetAll()

	// Get first matching DOM node
	firstNode := resp.Css("article.product_pod h3 a").Get()

	// The results are only demonstrated here; discard them so the snippet compiles
	_, _, _, _ = productUrls, productNames, nodes, firstNode
}
```

--------------------------------

### Create Custom GoScrapy Pipeline

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Generates a custom pipeline for GoScrapy, named 'export_2_DB'.

```sh
gos pipeline export_2_DB
```

--------------------------------

### JSON Export Pipeline (Immediate Flush)

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Configure a JSON export pipeline to immediately flush the buffer after each record is written. This is useful for real-time data processing.

```go
package settings

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/json"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

// JSON export with immediate flush after each record
var export2JsonImmediate = json.New[*Record](json.Options{
	Filename:  "scraped_data.json",
	Immediate: true, // Flush buffer immediately after each write
})

var PIPELINES = []pm.IPipeline[*Record]{
	export2JsonImmediate,
}
```

--------------------------------

### Enable Multi-Cookie Jar Middleware

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Isolate cookie sessions per scraping target by assigning a unique key to requests.
```go
package myspider

import (
	"context"

	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
)

var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar, // Enable per-session cookie isolation
	middlewares.DupeFilter,
}

// In spider.go - use different cookie jars for different sessions
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Request for user A
	reqA := s.Request(ctx)
	reqA.Url("https://example.com/login")
	reqA.CookieJar("user_a_session") // Cookies isolated to this key
	s.Parse(reqA, s.parseUserA)

	// Request for user B
	reqB := s.Request(ctx)
	reqB.Url("https://example.com/login")
	reqB.CookieJar("user_b_session") // Different cookie jar
	s.Parse(reqB, s.parseUserB)
}
```

--------------------------------

### Configure AzureTLS Middleware for Fingerprint Spoofing

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Bypass bot detection by spoofing TLS fingerprints. Configure global options for browser emulation and per-request overrides via context.
```go
package fingerprint_spoofing

import (
	"context"

	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
)

// Global AzureTLS options
var azureTLSOpts = &middlewares.AzureTLSOptions{
	Browser: middlewares.BrowserFirefox, // Chrome, Firefox, Safari, Edge
}

// Middlewares with AzureTLS (disable MultiCookieJar as AzureTLS handles cookies)
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.AzureTLS(azureTLSOpts),
	middlewares.DupeFilter,
}

// Override options per-request using context
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Set per-request AzureTLS options
	ctx = middlewares.WithAzureTLSOptions(ctx, &middlewares.AzureTLSOptions{
		Browser:    middlewares.BrowserChrome,
		Proxy:      "http://user:pass@myproxy.com:8080",
		SessionKey: "my_session",
		JA3:        "custom_ja3_fingerprint",
	})

	req := s.Request(ctx)
	req.Url("https://tls.peet.ws/api/all") // TLS fingerprint test endpoint
	s.Parse(req, s.parse)
}
```

--------------------------------

### Use CSS Selectors for Data Extraction

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Illustrates how to use CSS selectors to extract attribute values (like 'href') and text content from HTML elements. Selector chaining is also shown.

```go
var productUrls []string
productUrls = resp.Css("article.product_pod h3 a").Attr("href")

var productNames []string
productNames = resp.Css("article.product_pod h3 a").Text()

productUrls = resp.Css("article.product_pod").Css("h3 a").Attr("href")
```

--------------------------------

### Define Pipeline Execution Order

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Declares the main list of pipelines to be executed by the spider. Pipelines are executed sequentially as listed. Custom pipeline groups can also be included.

```go
// Pipelines here
// Executed in the order they appear.
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
	// export2Json,
	// myCustomPipelineGroup(), // use group as if it were a single pipeline
}
```

--------------------------------

### Use XPath Selectors for Data Extraction

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Demonstrates using XPath selectors to extract attribute values from HTML. Chaining XPath with CSS selectors is also supported.

```go
productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]//h3//a").Attr("href")

productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]").Css("h3 a").Attr("href")
```

--------------------------------

### Custom Header Injection Middleware

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Create a middleware factory that injects custom headers into HTTP requests. This factory takes a map of headers and returns a middleware function.

```go
package middlewares

import (
	"net/http"

	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
)

// Custom header injection middleware
func CustomHeaders(headers map[string]string) func(http.RoundTripper) http.RoundTripper {
	return func(next http.RoundTripper) http.RoundTripper {
		return middlewaremanager.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
			// Set (overwrite) each configured header before passing the request on
			for key, value := range headers {
				req.Header.Set(key, value)
			}
			return next.RoundTrip(req)
		})
	}
}
```

```go
// Usage in settings.go
var MIDDLEWARES = []middlewaremanager.Middleware{
	RequestLogger,
	CustomHeaders(map[string]string{
		"X-Custom-Header": "MyValue",
	}),
	middlewares.Retry(),
}
```

--------------------------------

### GoScrapy Output Interface

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Defines the core interface for an output Record in GoScrapy. It includes methods to access the Record itself, its keys, flattened data, and the associated Job.
```go
type IOutput interface {
	Record() *Record
	RecordKeys() []string
	RecordFlat() []any
	Job() IJob
}
```

--------------------------------

### Process HTTP Response

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Handles the response from an HTTP request. Allows access to status code, headers, body, cookies, and the original request. Also demonstrates retrieving metadata passed during request construction.

```go
func (s *Spider) parseResponse(ctx context.Context, resp core.IResponseReader) {
	// Access response data
	statusCode := resp.StatusCode()
	headers := resp.Header()
	body := resp.Bytes()
	cookies := resp.Cookies()
	originalRequest := resp.Request()

	// Blank assignments keep this illustrative snippet compiling;
	// Go rejects variables that are declared but never used.
	_, _, _, _, _ = statusCode, headers, body, cookies, originalRequest

	// Retrieve metadata set on the request
	if category, ok := resp.Meta("category"); ok {
		s.Logger().Infof("Category: %v", category)
	}
}
```

--------------------------------

### Create Custom Middleware Function

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Defines the required function signature for creating a custom HTTP middleware in GoScrapy. The middleware wraps the next http.RoundTripper in the chain.

```go
func MultiCookieJar(next http.RoundTripper) http.RoundTripper {
	return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		// your custom middleware code here

		// Hand the request to the next middleware or transport in the chain
		return next.RoundTrip(req)
	})
}
```

--------------------------------

### GoScrapy Record Definition for Scraped Data

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Defines the structure for outputting scraped data, implementing core.IOutput for pipeline processing. Fields such as title, price, stock, and rating carry JSON and CSV tags.
```go
package books_to_scrape

import (
	"reflect"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// Record represents scraped data output
// json and csv struct tags are required for builtin pipelines
type Record struct {
	J           *Job   `json:"-" csv:"-"` // Job reference (excluded from export)
	Title       string `json:"title" csv:"title"`
	Price       string `json:"price" csv:"price"`
	Stock       string `json:"stock" csv:"stock"`
	Rating      string `json:"rating" csv:"rating"`
	Description string `json:"description" csv:"description"`
	Upc         string `json:"upc" csv:"upc"`
	ProductType string `json:"product_type" csv:"product_type"`
	Reviews     string `json:"reviews" csv:"reviews"`
}

func (r *Record) Record() *Record {
	return r
}

func (r *Record) RecordKeys() []string {
	dataType := reflect.TypeOf(*r)
	numFields := dataType.NumField()
	keys := make([]string, numFields)
	for i := 0; i < numFields; i++ {
		field := dataType.Field(i)
		keys[i] = field.Tag.Get("csv")
	}
	return keys
}

func (r *Record) RecordFlat() []any {
	inputType := reflect.TypeOf(*r)
	inputValue := reflect.ValueOf(*r)
	slice := make([]any, inputType.NumField())
	for i := 0; i < inputType.NumField(); i++ {
		slice[i] = inputValue.Field(i).Interface()
	}
	return slice
}

func (r *Record) Job() core.IJob {
	return r.J
}
```

--------------------------------

### GoScrapy Job Definition

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Defines the structure for job input to a spider, implementing the core.IJob interface. Includes an ID and allows for custom fields.
```go
package books_to_scrape

// Job represents input to the spider
type Job struct {
	id string
	// Add custom fields here
}

func NewJob(id string) *Job {
	return &Job{
		id: id,
	}
}

func (j *Job) Id() string {
	return j.id
}

func (j *Job) Reset() {
	j.id = ""
}
```

--------------------------------

### Custom Request Logging Middleware

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Implement a custom middleware to log HTTP requests and responses, including method, URL, status code, and duration. This middleware uses the http.RoundTripper interface.

```go
package middlewares

import (
	"log"
	"net/http"
	"time"

	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
)

// Custom logging middleware
func RequestLogger(next http.RoundTripper) http.RoundTripper {
	return middlewaremanager.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		start := time.Now()
		log.Printf("Request: %s %s", req.Method, req.URL.String())

		// Call next middleware/transport
		resp, err := next.RoundTrip(req)

		duration := time.Since(start)
		if resp != nil {
			log.Printf("Response: %d %s (%v)", resp.StatusCode, req.URL.String(), duration)
		}
		return resp, err
	})
}
```

--------------------------------

### Pipeline Group for Concurrent Exports

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Create a pipeline group to execute multiple pipelines, such as CSV and JSON exports, concurrently. This enhances performance for independent tasks.
```go
package settings

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/csv"
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/json"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

var export2CSV = csv.New[*Record](csv.Options{
	Filename: "output.csv",
})

var export2Json = json.New[*Record](json.Options{
	Filename:  "output.json",
	Immediate: true,
})
```

```go
// Create a pipeline group for concurrent execution
func myExportGroup() *pm.Group[*Record] {
	group := pm.NewGroup[*Record]()
	group.Add(export2CSV)
	group.Add(export2Json)
	// group.WithIgnoreError() // Optionally ignore errors from individual pipelines
	return group
}
```

```go
// Pipelines executed in order - group runs both exports concurrently
var PIPELINES = []pm.IPipeline[*Record]{
	myExportGroup(), // CSV and JSON export run in parallel
}
```

--------------------------------

### GoScrapy Spider Parse Method

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Handles the response from a request in GoScrapy. It unmarshals the response body into a Record, which can then be yielded for pipeline processing.

```go
// Requires the "encoding/json" and "log" imports.
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	// resp.Body()
	// resp.StatusCode()
	// resp.Header()
	// resp.Bytes()
	// resp.Meta("MY_KEY1")

	// yielding output pushes output to be processed by pipelines, also check output.go for the fields
	var data Record
	err := json.Unmarshal(resp.Bytes(), &data)
	if err != nil {
		log.Panicln(err)
	}

	// s.Yield(&data)
}
```

--------------------------------

### Enable Duplicate Filter Middleware

Source: https://context7.com/tech-engine/goscrapy/llms.txt

Automatically block duplicate requests based on URL, method, headers, and body fingerprint. A "dupefilter.go:DupeFilter: duplicate request" error is logged for blocked requests.
```go
package settings

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
)

// DupeFilter generates a Blake2b hash of each request
// and blocks requests that have already been seen
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter, // Must be added to enable duplicate filtering
}

// Duplicate requests are automatically blocked with error:
// "dupefilter.go:DupeFilter: duplicate request"
```

--------------------------------

### GoScrapy Job Interface

Source: https://github.com/tech-engine/goscrapy/wiki/Home

Defines the core interface for a Job in GoScrapy, representing an input to a spider. It requires an Id() method.

```go
type IJob interface {
	Id() string
}
```
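
To illustrate how a type can satisfy the job interface, here is a minimal sketch. It assumes a locally declared copy of `IJob` so that the file compiles without GoScrapy installed; `urlJob`, its `url` field, and `describe` are hypothetical names used only for illustration and are not part of the library.

```go
package main

import "fmt"

// IJob is a local copy of the interface, declared here only so this
// sketch is self-contained. In a real project it comes from GoScrapy.
type IJob interface {
	Id() string
}

// urlJob is a hypothetical job type carrying a start URL next to its id.
type urlJob struct {
	id  string
	url string
}

// Id satisfies IJob.
func (j *urlJob) Id() string { return j.id }

// Compile-time check that *urlJob implements IJob.
var _ IJob = (*urlJob)(nil)

// describe accepts any IJob, so callers depend only on Id().
func describe(j IJob) string {
	return fmt.Sprintf("job %s", j.Id())
}

func main() {
	fmt.Println(describe(&urlJob{id: "books-1", url: "https://example.com"})) // prints "job books-1"
}
```

The blank-identifier assignment makes the compiler reject the build if `*urlJob` ever stops implementing the interface, which catches a mismatch earlier than a failing call site would.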