### Install Scrapinghub library

Source: https://python-scrapinghub.readthedocs.io/en/latest/quickstart.html

Commands to install the scrapinghub package via pip, including an optional version with MessagePack support for improved performance.

```bash
pip install scrapinghub
pip install scrapinghub[msgpack]
```

--------------------------------

### Install Scrapinghub Python Library

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/quickstart.rst.txt

Installs the Scrapinghub Python library. It is recommended to install with MessagePack support for better performance and bandwidth usage.

```bash
pip install scrapinghub
```

```bash
pip install scrapinghub[msgpack]
```

--------------------------------

### Get a Specific Project Setting Value (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Illustrates how to retrieve the value of a specific project setting by its key. The `get` method takes the setting key as a string parameter.

```python
>>> project = client.get_project(123)
>>> project.settings.get('default_job_units')
2
```

--------------------------------

### Initialize Scrapinghub Client and Manage Projects

Source: https://python-scrapinghub.readthedocs.io/en/latest/quickstart.html

Demonstrates how to instantiate the ScrapinghubClient using an API key and how to list available projects.

```python
from scrapinghub import ScrapinghubClient

apikey = '84c87545607a4bc0****************'
client = ScrapinghubClient(apikey)
client.projects.list()
```

--------------------------------

### Access Project Settings Instance (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to get a Settings instance for a specific project. This is accessed via the `settings` attribute of a Project object.

```python
>>> project = client.get_project(123)
>>> project.settings
```

--------------------------------

### Python: Manage Collection Items with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates common operations on a Scrapinghub collection using the Python client: setting an item with a key, counting all items, retrieving a specific item by key, iterating over all items, iterating over item keys only, filtering by multiple keys, deleting items by key, and truncating the entire collection. Ensure the Scrapinghub client is installed and authenticated.

```python
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7',
...                'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{'value': '1447221694537'}
>>> foo_store.iter()
>>> for elem in foo_store.iter(count=1):
...     print(elem)
[{'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}]
>>> keys = foo_store.iter(nodata=True, meta=['_key'])
>>> next(keys)
{'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}
>>> foo_store.list(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah'])
[{'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}]
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.truncate()
```

--------------------------------

### Get a Project Instance using Scrapinghub Client (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to obtain a Project instance using the ScrapinghubClient. This is the recommended way to interact with individual projects.

```python
>>> client = ScrapinghubClient()
>>> project = client.get_project(123)
>>> project
>>> project.key
'123'
```

--------------------------------

### Run a New Job with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/quickstart.rst.txt

Illustrates how to run a new job for a specific project, providing the spider name and optional job arguments.

```python
project = client.get_project(123)
project.jobs.run('spider1', job_args={'arg1': 'val1'})
```

--------------------------------

### Instantiate Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/quickstart.rst.txt

Demonstrates how to instantiate the Scrapinghub client using an API key. The API key can be found in the Zyte account settings.

```python
from scrapinghub import ScrapinghubClient

apikey = '84c87545607a4bc0****************'  # your API key as a string
client = ScrapinghubClient(apikey)
```

--------------------------------

### Run Integration Tests

Source: https://python-scrapinghub.readthedocs.io/en/latest/quickstart.html

Commands to execute integration tests using pytest, including flags to ignore or update existing VCR.py cassettes.

```bash
py.test --ignore-cassettes
py.test --update-cassettes
```

--------------------------------

### GET /projects/summary

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Get a summary of job statuses across all projects.

```APIDOC
## GET /projects/summary

### Description
Returns a list of dictionaries containing job counts (pending, running, finished) and capacity status for user projects.

### Method
GET

### Endpoint
/projects/summary

### Parameters
#### Query Parameters
- **state** (string/list) - Optional - Filter projects by specific job state.

### Response
#### Success Response (200)
- **summaries** (list[dict]) - List of project status summaries.

#### Response Example
[
  {
    "project": 123,
    "pending": 0,
    "running": 1,
    "finished": 674,
    "has_capacity": true
  }
]
```

--------------------------------

### List Deployed Projects with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/quickstart.rst.txt

Shows how to list all deployed projects associated with the provided API key using the Scrapinghub client.

```python
client.projects.list()
```

--------------------------------

### Run Jobs and Retrieve Data

Source: https://python-scrapinghub.readthedocs.io/en/latest/quickstart.html

Shows how to trigger a spider job for a specific project and iterate through the resulting items collected by a job.

```python
project = client.get_project(123)
project.jobs.run('spider1', job_args={'arg1': 'val1'})

job = client.get_job('123/1/2')
for item in job.items.iter():
    print(item)
```

--------------------------------

### Start a Scraping Job (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Illustrates how to move a job to a running state. This method can accept optional keyword meta parameters and returns the previous string state of the job.

```python
>>> job.start()
'pending'
```

--------------------------------

### GET /projects

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Retrieve a list of project IDs available to the current user.

```APIDOC
## GET /projects

### Description
Returns a list of project IDs associated with the current user account.

### Method
GET

### Endpoint
/projects

### Response
#### Success Response (200)
- **projects** (list[int]) - A list of numeric project IDs.

#### Response Example
[123, 456]
```

--------------------------------

### Retrieving a Spider Instance

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates the usage of the get method to fetch a specific spider object by its name.

```python
spider = project.spiders.get('spider2')
```

--------------------------------

### Get a Specific Project by ID using Projects Collection (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Illustrates how to retrieve a specific project object using its ID from the Projects collection. The `get` method takes an integer or string numeric project ID.

```python
>>> client = ScrapinghubClient()
>>> project = client.projects.get(123)
>>> project
```

--------------------------------

### Access Job Output Data with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/quickstart.rst.txt

Demonstrates how to retrieve and iterate over the output data of a specific job using its ID.

```python
job = client.get_job('123/1/2')
for item in job.items.iter():
    print(item)
```

--------------------------------

### List All Project IDs using Projects Collection (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to get a list of all project IDs available to the current user. The `list` method returns a list of integers representing project IDs.

```python
>>> client = ScrapinghubClient()
>>> client.projects.list()
[123, 456]
```

--------------------------------

### GET /projects/{project_id}

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Retrieve a specific project object to access its resources like jobs, spiders, and settings.

```APIDOC
## GET /projects/{project_id}

### Description
Retrieves a project object for a given project ID, providing access to nested resources.

### Method
GET

### Endpoint
/projects/{project_id}

### Parameters
#### Path Parameters
- **project_id** (string/int) - Required - The unique identifier for the project.

### Response
#### Success Response (200)
- **project** (Object) - A project instance containing activity, collections, frontiers, jobs, settings, and spiders.

#### Response Example
{
  "key": "123",
  "has_capacity": true
}
```

--------------------------------

### Get Project Summaries using Projects Collection (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Shows how to retrieve summaries for all available user projects. The `summary` method can optionally filter by project state and returns a list of dictionaries, each containing project status details.

```python
>>> client = ScrapinghubClient()
>>> client.projects.summary()
[{'finished': 674,
  'has_capacity': True,
  'pending': 0,
  'project': 123,
  'running': 1},
 {'finished': 33079,
  'has_capacity': True,
  'pending': 0,
  'project': 456,
  'running': 2}]
```

--------------------------------

### List Job Metadata (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to get a list of all job metadata key/value pairs using the `list()` method of the JobMeta class. Be aware that this can consume significant memory for large datasets.

```python
>>> job.metadata.list()
[('project', 123), ('units', 1), ('state', 'finished'), ...]
```

--------------------------------

### Manage Collections with Scrapinghub Client (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html

Demonstrates the workflow for using Scrapinghub Collections to store and retrieve records. It covers getting a store, setting key-value pairs, counting items, getting specific items, iterating, filtering, and deleting.

```python
>>> collections = project.collections
>>> foo_store = collections.get_store('foo_store')
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{u'value': u'1447221694537'}
>>> # iterate over _key & value pair
... list(foo_store.iter())
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> # filter by multiple keys - only values for keys that exist will be returned
... list(foo_store.iter(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']))
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.count()
0
```

--------------------------------

### GET /jobs

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Methods for managing and retrieving information about scraping jobs within a project.

```APIDOC
## GET /jobs

### Description
Retrieves a list or summary of scraping jobs associated with a project.

### Method
GET

### Endpoint
/jobs

### Parameters
#### Query Parameters
- **project_id** (string) - Required - The ID of the project to query.

### Response
#### Success Response (200)
- **jobs** (array) - List of job objects.

### Response Example
{
  "jobs": [{"id": "123/45/67", "state": "finished"}]
}
```

--------------------------------

### Get and Access Job Information (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to retrieve a specific job using its key and access its properties like key and metadata. It requires an initialized Scrapinghub client and a project object.

```python
>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'
>>> job.metadata.get('state')
'finished'
```

--------------------------------

### Get Project Instance by ID

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Fetches a `Project` instance associated with a given project ID. This method serves as a shortcut for accessing projects via the `client.projects.get()` method. The project ID can be provided as an integer or a numeric string.

```python
>>> project = client.get_project(123)
>>> project
```

--------------------------------

### Get Job Collection Instance (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to access the Jobs collection object associated with a project or a spider. This object is used to manage multiple jobs.

```python
>>> project.jobs
>>> spider = project.spiders.get('spider1')
>>> spider.jobs
```

--------------------------------

### Iterate Project Frontiers

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Shows how to get an iterator for all frontiers within a project. This allows for sequential processing of all frontiers associated with the project.

```python
>>> project.frontiers.iter()
```

--------------------------------

### Iterate Frontier Slots

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to get an iterator for all slots within a frontier. This is useful for processing all available frontier slots sequentially.

```python
>>> frontier.iter()
```

--------------------------------

### Getting a Versioned Store Collection (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Retrieves a collection that retains up to three copies of each item. This is suitable for scenarios where historical data or rollbacks are needed.

```python
versioned_store = collections.get_versioned_store('my_versioned_data')
```

--------------------------------

### Iterate Through Job Metadata (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Provides an example of iterating through job metadata key/value pairs using the `iter()` method of the JobMeta class. This is recommended for large amounts of metadata to avoid memory issues.

```python
>>> job.metadata.iter()
```

--------------------------------

### Getting a Versioned Cached Store Collection (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Retrieves a collection that retains multiple copies of items, with each copy expiring after a month. This offers a balance between versioning and caching.

```python
versioned_cached_store = collections.get_versioned_cached_store('my_versioned_cached_data')
```

--------------------------------

### Retrieve and Filter Job Logs using Python

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to access job logs using the iter() and list() methods. Includes examples of iterating through logs, limiting results, and applying filters based on log levels and message content.

```python
# Retrieve all logs from a job
job.logs.iter()

# Iterate through first 100 log entries
for log in job.logs.iter(count=100):
    print(log)

# Retrieve a single log entry
job.logs.list(count=1)

# Retrieve logs with a specific level and filter by keyword
filters = [("message", "contains", ["mymessage"])]
job.logs.list(level='WARNING', filter=filters)
```

--------------------------------

### Getting a Collection by Type and Name (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Provides the base method for retrieving a specific collection using its type and name. The type and name are passed as the first and second arguments. This method is fundamental for accessing any type of collection.

```python
# 's' is the collection type used in this example, 'my_store' the collection name
collection = collections.get('s', 'my_store')
```

--------------------------------

### Filter and Paginate Jobs

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html

Shows how to filter jobs by state, tags, and custom metadata fields, as well as how to handle pagination using the start parameter.

```python
>>> jobs_summary = project.jobs.iter(has_tag=['new', 'verified'], lacks_tag='obsolete')
>>> jobs_summary = spider.jobs.iter(spider='foo', state='finished', count=3)
>>> jobs_summary = spider.jobs.iter(start=1000)
>>> jobs_summary = project.jobs.iter(jobmeta=['scheduled_by'])
```

--------------------------------

### Project Settings Management

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html

Endpoints for interacting with project settings, including listing all settings, getting specific values, and updating single or multiple settings.

```APIDOC
## GET /project/settings

### Description
Retrieves a list of all current project settings and their values.

### Method
GET

### Endpoint
/project/settings

### Response
#### Success Response (200)
- **settings** (list) - A list of tuples containing setting names and their values.

#### Response Example
[('default_job_units', 2), ('job_runtime_limit', 24)]

---

## GET /project/settings/{name}

### Description
Retrieves the value of a specific project setting by its name.

### Method
GET

### Endpoint
/project/settings/{name}

### Parameters
#### Path Parameters
- **name** (string) - Required - The name of the setting to retrieve.

### Response
#### Success Response (200)
- **value** (any) - The value associated with the setting.

---

## POST /project/settings/{name}

### Description
Updates the value of a specific project setting.

### Method
POST

### Endpoint
/project/settings/{name}

### Parameters
#### Path Parameters
- **name** (string) - Required - The name of the setting to update.

#### Request Body
- **value** (any) - Required - The new value for the setting.

---

## PATCH /project/settings

### Description
Updates multiple project settings simultaneously using a dictionary of key-value pairs.

### Method
PATCH

### Endpoint
/project/settings

### Request Body
- **settings** (object) - Required - A dictionary containing the settings to update.

### Request Example
{
  "default_job_units": 1,
  "job_runtime_limit": 20
}
```

--------------------------------

### List Job Samples by Timestamp with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Shows how to list job samples filtered by a timestamp using the `list()` method. This example retrieves samples with a timestamp greater than or equal to the provided value. The output is a list of lists, where each inner list contains sample data.

```python
>>> job.samples.list(startts=1484570043851)
[[1484570043851, 554, 576, 1777, 821, 0],
 [1484570046673, 561, 583, 1782, 821, 0]]
```

--------------------------------

### List Project Settings (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to retrieve all project settings as a list. The `list` method is a convenient shortcut but may consume significant memory for large numbers of settings.

```python
>>> project = client.get_project(123)
>>> project.settings.list()
[(u'default_job_units', 2), (u'job_runtime_limit', 20)]
```

--------------------------------

### Python: List Collection Items with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Provides an example of using the `list()` method to retrieve all items from a Scrapinghub collection as a list. This method shares the same parameters as `iter()` for filtering. For large collections, `list()` can consume a substantial amount of memory, so `iter()` is recommended there.

```python
all_items = foo_store.list(key=['key1', 'key2'], prefix='data_')
print(all_items)
```

--------------------------------

### Get Job Summaries by State (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Retrieves a summary of jobs, optionally filtered by state or spider.
The returned data is a list of dictionaries, grouped by job state. If a specific state is provided, it returns a single dictionary for that state.

```python
>>> spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
 {'count': 0, 'name': 'running', 'summary': []},
 {'count': 5, 'name': 'finished', 'summary': [...]}]
>>> project.jobs.summary('pending')
{'count': 0, 'name': 'pending', 'summary': []}
```

--------------------------------

### Manage Data with Hubstorage in Python

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/legacy/hubstorage.rst.txt

Demonstrates basic CRUD operations on a key-value store using a `foo_store` collection object from the legacy Hubstorage client. It shows how to set, count, get, iterate, and delete entries. Requires the `scrapinghub` library.

```python
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{u'value': u'1447221694537'}
>>> # iterate over _key & value pair
... list(foo_store.iter_values())
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> # filter by multiple keys - only values for keys that exist will be returned
... list(foo_store.iter_values(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']))
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.count()
0
```

--------------------------------

### Get job information in Python

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/legacy/connection.rst.txt

Retrieves metadata and information about a specific job. This includes details like the spider name, start time, tags, and field counts.

```python
print(job.info['spider'])
print(job.info['started_time'])
print(job.info['tags'])
print(job.info['fields_count']['description'])
```

--------------------------------

### Get Specific Frontier Slot

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Illustrates how to get a specific FrontierSlot object by its name. This allows for targeted operations on a particular slot.

```python
>>> frontier.get('example.com')
```

--------------------------------

### Get Specific Job Metadata Field (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Shows how to retrieve the value of a specific metadata field by its name using the `get()` method of the JobMeta class.

```python
>>> job.metadata.get('version')
'test'
```

--------------------------------

### Manage Frontiers with Scrapinghub Client (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html

Explains how to manage frontiers using the Scrapinghub client. It covers iterating through all frontiers, listing them, getting a specific frontier, iterating through frontier slots, listing slots, getting a specific slot, and adding/deleting requests and fingerprints.

```python
>>> frontiers = project.frontiers
>>> frontiers.iter()
>>> frontiers.list()
['test', 'test1', 'test2']
>>> frontier = frontiers.get('test')
>>> frontier
>>> frontier.iter()
>>> frontier.list()
['example.com', 'example.com2']
>>> slot = frontier.get('example.com')
>>> slot
>>> slot.queue.add([{'fp': '/some/path.html'}])
>>> slot.flush()
>>> slot.newcount
1
>>> frontier.newcount
1
>>> frontiers.newcount
3
>>> slot.fingerprints.add(['fp1', 'fp2'])
>>> slot.flush()
>>> slot.q.add([{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}])
>>> slot.flush()
>>> reqs = slot.q.iter()
>>> fps = slot.f.iter()
>>> reqs = slot.q.list()
>>> slot.q.delete('00013967d8af7b0001')
>>> slot.delete()
>>> frontier.flush()
>>> frontiers.flush()
>>> frontiers.close()
```

--------------------------------

### Jobs Summary API

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Get a summary of jobs, optionally filtered by state or spider.

```APIDOC
## GET /jobs/summary

### Description
Retrieves a summary of jobs, optionally filtered by state or spider.

### Method
GET

### Endpoint
/jobs/summary

### Parameters
#### Query Parameters
- **state** (str) - Optional - Filter jobs by a specific state.
- **spider** (str) - Optional - Filter jobs by spider name (not needed if instantiated with `Spider`).
- **params** (dict) - Optional - Additional keyword arguments.

### Response
#### Success Response (200)
- **list[dict]** - A list of dictionaries containing job summaries, grouped by job state.

#### Response Example
# Example for spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
 {'count': 0, 'name': 'running', 'summary': []},
 {'count': 5, 'name': 'finished', 'summary': [...]}]

# Example for project.jobs.summary('pending')
{'count': 0, 'name': 'pending', 'summary': []}
```

--------------------------------

### Initialize HubstorageClient and Project

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/legacy/hubstorage.rst.txt

Demonstrates how to authenticate with the HubstorageClient using an API key and retrieve project-level information such as settings and job summaries.

```python
from scrapinghub import HubstorageClient

hc = HubstorageClient(auth='apikey')
hc.server_timestamp()

project = hc.get_project('1111111')
print(project.settings['botgroups'])
print(project.jobsummary())
```

--------------------------------

### Initialize Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to create an instance of the ScrapinghubClient. Authentication can be provided via an API key or environment variables. Optional parameters for the API endpoint and additional HubstorageClient arguments can also be passed.

```python
>>> from scrapinghub import ScrapinghubClient
>>> client = ScrapinghubClient('APIKEY')
>>> client
```

--------------------------------

### GET /jobs/{job_key}

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Retrieves a specific job instance using its unique job key.

```APIDOC
## GET /jobs/{job_key}

### Description
Retrieves a Job object associated with the provided job key.

### Method
GET

### Endpoint
client.get_job(job_key)

### Parameters
#### Path Parameters
- **job_key** (string) - Required - Format: 'project_id/spider_id/job_id'

### Response
#### Success Response (200)
- **Job** (object) - A job instance object.
```

--------------------------------

### Update Multiple Project Settings at Once (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to update multiple project settings simultaneously using a dictionary. The `update` method allows for partial updates to settings.

```python
>>> project = client.get_project(123)
>>> project.settings.update({'default_job_units': 1,
...                          'job_runtime_limit': 20})
```

--------------------------------

### GET /collections

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Methods for interacting with data collections, including CRUD operations on stored items.

```APIDOC
## GET /collections/{collection_name}

### Description
Retrieves items or metadata from a specific data collection.

### Method
GET

### Endpoint
/collections/{collection_name}

### Parameters
#### Path Parameters
- **collection_name** (string) - Required - The name of the collection to access.

### Response
#### Success Response (200)
- **items** (array) - List of items stored in the collection.

### Response Example
{
  "items": [{"key": "value"}]
}
```

--------------------------------

### GET /logs

Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/hubstorage.html

Retrieve logs associated with a specific job.

```APIDOC
## GET /jobs/{job_id}/logs

### Description
Iterate through logs for a specific job.

### Method
GET

### Parameters
#### Query Parameters
- **count** (integer) - Optional - Number of log entries to retrieve.

### Response
#### Success Response (200)
- **logs** (list) - A list of dictionaries containing log level, message, and timestamp.

### Response Example
{
  "logs": [{"level": "INFO", "message": "Started", "time": 1447221694537}]
}
```

--------------------------------

### Iterate Through Project Settings (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Shows how to iterate through the key/value pairs of a project's settings.
The `iter` method returns an iterator over these pairs.

```python
>>> project = client.get_project(123)
>>> project.settings.iter()
```

--------------------------------

### Initialize Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/client/overview.rst.txt

Demonstrates how to instantiate the ScrapinghubClient using a valid API key. This client serves as the entry point for all platform interactions.

```python
from scrapinghub import ScrapinghubClient

apikey = '84c87545607a4bc0****************'
client = ScrapinghubClient(apikey)
```

--------------------------------

### Write Sample Item with Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Illustrates how to write a new sample item to the collection using the `write()` method. The `item` parameter should be a dictionary containing the sample data.

```python
>>> job.samples.write({'data': 'sample_data'})
```

--------------------------------

### GET /jobs/count

Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/connection.html

Returns the total count of jobs for a project based on applied filters.

```APIDOC
## GET /jobs/count

### Description
Returns the total number of jobs matching the specified criteria.

### Method
GET

### Endpoint
/jobs/count

### Parameters
#### Query Parameters
- **project** (string) - Required - The project ID.

### Request Example
GET /jobs/count?project=12345

### Response
#### Success Response (200)
- **count** (integer) - Total number of jobs.

#### Response Example
{
  "count": 42
}
```

--------------------------------

### Python: Get Item from Collection by Key using Scrapinghub Client

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Shows how to retrieve a single item from a Scrapinghub collection using its unique key. The `get()` method takes the item's key as a string and can optionally accept additional query parameters.
If the item exists, it returns a dictionary representing the item; otherwise, it might raise an error or return None depending on the client's implementation.

```python
item = foo_store.get('item_key', param1='value1')
```

--------------------------------

### Delete a Project Setting by Key (Python)

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Illustrates how to delete a specific project setting using its key. The `delete` method takes the setting key as a string parameter.

```python
>>> project = client.get_project(123)
>>> project.settings.delete('job_runtime_limit')
```

--------------------------------

### Accessing and Managing Spiders

Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html

Demonstrates how to retrieve a spider instance from a project and access its attributes like key and name.

```python
spider = project.spiders.get('spider1')
print(spider.key)
print(spider.name)
```

--------------------------------

### GET /jobs/list

Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/connection.html

Retrieves a list of jobs associated with a specific project based on provided filters.

```APIDOC
## GET /jobs/list

### Description
Retrieves a list of jobs for a given project. Supports filtering via parameters.

### Method
GET

### Endpoint
/jobs/list

### Parameters
#### Query Parameters
- **project** (string) - Required - The project ID.
- **state** (string) - Optional - Filter jobs by state (e.g., running, finished).

### Request Example
GET /jobs/list?project=12345

### Response
#### Success Response (200)
- **jobs** (array) - List of job objects.

#### Response Example
{
  "jobs": [{"id": "12345/1/1", "state": "finished"}]
}
```

--------------------------------

### Retrieve Job Summaries

Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/client/overview.rst.txt

Provides methods to get a summary of job states or the most recent jobs for each spider.

```python >>> spider.jobs.summary() >>> list(spider.jobs.iter_last()) ``` -------------------------------- ### Initialize HubstorageClient Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/hubstorage.html Demonstrates how to authenticate and initialize the HubstorageClient using an API key. ```python from scrapinghub import HubstorageClient hc = HubstorageClient(auth='apikey') hc.server_timestamp() ``` -------------------------------- ### List Project Frontiers Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Demonstrates how to list all available frontiers within a project. This provides a simple list of frontier names. ```python >>> project.frontiers.list() ['test', 'test1', 'test2'] ``` -------------------------------- ### Manage Projects and Spiders Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/connection.html Demonstrates how to list available projects, select a specific project, schedule a new spider run, and list spiders within that project. ```python # List projects conn.project_ids() # Select project project = conn[123] # Schedule spider project.schedule('myspider', arg1='val1') # List spiders project.spiders() ``` -------------------------------- ### GET /jobs/{job_id}/logs Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Retrieves log entries for a specific job. Supports filtering by log level and content, and provides both list and iterator interfaces. ```APIDOC ## GET /jobs/{job_id}/logs ### Description Retrieves log entries associated with a specific job. Use `iter()` for large datasets to optimize memory usage or `list()` for smaller, immediate results. ### Method GET ### Endpoint /jobs/{job_id}/logs ### Parameters #### Query Parameters - **count** (integer) - Optional - Limit the number of log entries returned. - **level** (string) - Optional - Filter logs by severity level (e.g., WARNING, INFO).
- **filter** (list) - Optional - List of tuples for advanced filtering (e.g., [("message", "contains", ["text"])]). ### Request Example job.logs.list(level='WARNING', count=10) ### Response #### Success Response (200) - **level** (integer) - The log severity level. - **message** (string) - The log content. - **time** (integer) - UNIX timestamp in milliseconds. #### Response Example [ { "level": 30, "message": "Some warning: mymessage", "time": 1486375511188 } ] ``` -------------------------------- ### PUT /projects/{project_id}/settings Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Update project-level configuration settings. ```APIDOC ## PUT /projects/{project_id}/settings ### Description Updates one or more configuration settings for a specific project. ### Method PUT ### Endpoint /projects/{project_id}/settings ### Request Body - **values** (dict) - Required - Key-value pairs of settings to update. ### Request Example { "default_job_units": 1, "job_runtime_limit": 20 } ### Response #### Success Response (200) - **status** (string) - Confirmation of update. ``` -------------------------------- ### Getting a Cached Store Collection (Python) Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Retrieves a collection that caches items for a month. This is useful for frequently accessed data where slightly older versions are acceptable. ```python cached_store = collections.get_cached_store('my_cached_data') ``` -------------------------------- ### Manage Projects Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html Shows how to list available projects, retrieve a summary of project activity, and access a specific project instance by its ID.
```python client.projects.list() client.projects.summary() project = client.get_project(123) ``` -------------------------------- ### Get Frontier New Request Count Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Shows how to access the `newcount` property of a frontier to see the number of new requests that have been added. This is useful for monitoring frontier activity. ```python >>> frontier.newcount 3 ``` -------------------------------- ### Accessing and Listing Collections (Python) Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Demonstrates how to access the collections object from a project instance and list available collections. This is the primary way to interact with collections in the Scrapinghub client. ```python >>> collections = project.collections >>> collections.list() [{'name': 'Pages', 'type': 's'}] >>> foo_store = collections.get_store('foo_store') ``` -------------------------------- ### Manage Project Settings in Python Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/client/overview.rst.txt Shows how to retrieve, update, and list project settings using the Settings class. This allows for dynamic configuration of project parameters like job units and runtime limits. ```python project.settings.list() project.settings.get('job_runtime_limit') project.settings.set('job_runtime_limit', 20) project.settings.update({'default_job_units': 1, 'job_runtime_limit': 20}) ``` -------------------------------- ### Initialize HubStorage Client Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/hubstorage.html Shows how to initialize the HubStorage client with optional parameters for authentication, endpoint, timeouts, retries, and user agent. The default values are used if not specified. 
```python class scrapinghub.hubstorage.HubstorageClient(auth=None, endpoint=None, connection_timeout=None, max_retries=None, max_retry_time=None, user_agent=None, use_msgpack=True) ``` -------------------------------- ### Get all finished jobs for a project in Python Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/legacy/connection.rst.txt Retrieves a set of all jobs that have completed execution within a project. The result is a JobSet object, which is iterable. ```python jobs = project.jobs(state='finished') print([job.id for job in jobs]) ``` -------------------------------- ### POST /activity Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Adds new activity events to a project. ```APIDOC ## POST /activity ### Description Adds one or more activity events to the project's activity log. ### Method POST ### Endpoint project.activity.add(values) ### Request Body - **values** (dict or list) - Required - A dictionary or list of dictionaries containing 'event', 'job', and 'user' keys. ### Request Example { "event": "job:completed", "job": "123/2/4", "user": "jobrunner" } ``` -------------------------------- ### Update a Single Project Setting Value (Python) Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Shows how to update the value of a single project setting. The `set` method takes the setting key and the new value as parameters. Note that some settings are read-only. ```python >>> project = client.get_project(123) >>> project.settings.set('default_job_units', 2) ``` -------------------------------- ### ScrapinghubClient Initialization Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Initializes the main client to interact with the Scrapy Cloud API using authentication credentials. ```APIDOC ## ScrapinghubClient Initialization ### Description Initializes the connection to the Scrapy Cloud API.
If credentials are not provided, the client reads them from the `SH_APIKEY` or `SHUB_JOBAUTH` environment variables. ### Parameters - **auth** (string) - Optional - Scrapy Cloud API key or credentials. - **dash_endpoint** (string) - Optional - The API URL (defaults to https://app.zyte.com/api/). ### Request Example from scrapinghub import ScrapinghubClient client = ScrapinghubClient('YOUR_API_KEY') ``` -------------------------------- ### Listing Project Spiders Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Shows how to retrieve a list of all spiders within a project, returning their metadata as dictionaries. ```python spiders_list = project.spiders.list() for spider in spiders_list: print(spider) ``` -------------------------------- ### Get Slot New Request Count Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Demonstrates accessing the `newcount` property of a slot to retrieve the number of new requests added to that specific slot. This helps in monitoring slot-specific activity. ```python >>> slot.newcount 2 ``` -------------------------------- ### List Slot Requests Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Shows how to list all request batches currently in a slot's queue. The output is a list of dictionaries, each representing a batch with its ID and requests. ```python >>> slot.q.list() [{'id': '0115a8579633600006', 'requests': [['page1.html', {'depth': 1}]]}] ``` -------------------------------- ### Manage Projects and Spiders Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/hubstorage.html Shows how to retrieve project settings, job summaries, and spider IDs.
```python project = hc.get_project('1111111') project.settings['botgroups'] project.jobsummary() project.ids.spider('foo') summaries = project.spiders.lastjobsummary(count=3) ``` -------------------------------- ### Query and Filter Jobs Source: https://python-scrapinghub.readthedocs.io/en/latest/_sources/legacy/hubstorage.rst.txt Demonstrates how to list job metadata using filters like tags, state, and pagination to manage large sets of job data. ```python jobs_metadata = project.jobq.list(has_tag=['new', 'verified'], lacks_tag='obsolete') jobs_metadata_filtered = project.jobq.list(spider='foo', state='finished', count=3) jobs_paginated = project.jobq.list(start=1000) ``` -------------------------------- ### Update Job State (Python) Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Demonstrates how to update the state of a job to a new value, optionally with additional meta parameters. It returns the previous state of the job as a string. ```python >>> job.update('finished') 'running' ``` -------------------------------- ### List Job Summaries with Filters (Python) Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Provides a convenient shortcut to list job results based on various filter parameters. It returns a list of dictionaries, each representing a job summary. For large datasets, using `iter()` is recommended to avoid high memory consumption. ```python list(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params) ``` -------------------------------- ### Retrieve and iterate job items using Python Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Demonstrates how to retrieve items from a job using iterators, list methods, and chunked processing. These methods support filtering by count, timestamp, and custom criteria to manage memory usage efficiently.
```python # Retrieve all items as a generator job.items.iter() # Iterate through first 100 items for item in job.items.iter(count=100): print(item) # Retrieve items with timestamp filter job.items.list(startts=1447221694537) # Retrieve items in chunks gen = job.items.list_iter(chunksize=2) next(gen) # Retrieve items with complex filters filters = [("size", ">", [30000]), ("size", "<", [40000])] job.items.list(count=1, filter=filters) ``` -------------------------------- ### Update Job Tags Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html Illustrates how to add tags to a job. The `update_tags()` method is used, and tags can be added to an existing list of tags. For example, to add the tag 'consumed': ```python >>> job.update_tags(add=['consumed']) ``` -------------------------------- ### Get a Specific Job by Key using Scrapinghub API Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Retrieves a single `Job` object using its unique job key. The job key must match the project and spider context. Returns a `Job` object. ```python >>> job = project.jobs.get('123/1/2') >>> job.key '123/1/2' ``` -------------------------------- ### Manage Project Settings in Scrapinghub Source: https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html Demonstrates how to interact with the project settings object to list all settings, retrieve a specific value, update a single setting, or perform bulk updates. These methods allow for dynamic configuration of project parameters via the Scrapinghub API. 
```python # List all project settings project.settings.list() # Get a specific setting value project.settings.get('job_runtime_limit') # Update a single setting project.settings.set('job_runtime_limit', 20) # Update multiple settings at once project.settings.update({'default_job_units': 1, 'job_runtime_limit': 20}) ``` -------------------------------- ### Get a Specific Job by Key Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Retrieves a `Job` object using its unique job key. The job key must be in the format 'project_id/spider_id/job_id', where all components are integers. This method is essential for accessing and managing individual job data. ```python >>> job = client.get_job('123/1/1') >>> job ``` -------------------------------- ### Iterate Through Project Activity Events Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Provides a way to iterate over all activity events for a given project. It's recommended to use the `iter()` method for large amounts of activity to avoid excessive memory usage. The `list()` method is a convenient shortcut but may consume more memory. ```python >>> project.activity.iter() ``` -------------------------------- ### Python: Create a Collection Writer with Scrapinghub Client Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Shows how to create a writer for a Scrapinghub collection using the Python client. This method allows for efficient batch uploading of items. It accepts several optional parameters to configure the writer's behavior, such as initial offset, authentication, queue size, and content encoding. The function returns a batch writer object. 
```python writer = foo_store.create_writer(start=0, size=1000, interval=15, qsize=None, content_encoding='identity', maxitemsize=1048576, callback=None) ``` -------------------------------- ### Iterate Last Job Summaries (Python) Source: https://python-scrapinghub.readthedocs.io/en/latest/client/apidocs.html Retrieves a generator object yielding dictionaries of job summaries for a given filter. This is useful for fetching recent job data efficiently. It can be used to get all last job summaries for a project or for a specific spider. ```python >>> project.jobs.iter_last() >>> list(spider.jobs.iter_last()) [{'close_reason': 'success', 'elapsed': 3062444, 'errors': 1, 'finished_time': 1482911633089, 'key': '123/1/3', 'logs': 8, 'pending_time': 1482911596566, 'running_time': 1482911598909, 'spider': 'spider1', 'state': 'finished', 'ts': 1482911615830, 'version': 'some-version'}] ``` -------------------------------- ### Connect to Scrapinghub API Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/connection.html Initializes a connection to the Scrapinghub service using an API key. This is the entry point for all subsequent project and job operations. ```python from scrapinghub import Connection conn = Connection('APIKEY') conn ``` -------------------------------- ### HubStorage Client Methods Source: https://python-scrapinghub.readthedocs.io/en/latest/legacy/hubstorage.html Lists common methods available on the HubStorage client, including closing the client, getting job or project objects, pushing new jobs, and executing HTTP requests with retry policies. ```python close(timeout=None) get_job(*args, **kwargs) get_project(*args, **kwargs) push_job(projectid, spidername, auth=None, **jobparams) request(is_idempotent=False, **kwargs) server_timestamp() ```
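--------------------------------

### Closing a HubStorage Client Reliably (Sketch)

The method list above includes `close(timeout=None)`, which suggests the client should be released once work is done. As a minimal sketch, and assuming `close()` can be called with no arguments as that signature indicates, the standard-library `contextlib.closing` helper can make sure the call happens even when an error is raised. `FakeHubstorageClient` and `read_server_timestamp` are hypothetical names introduced only for illustration; the fake class stands in for the real client so the example runs without network access or an API key.

```python
from contextlib import closing


class FakeHubstorageClient:
    """Hypothetical stand-in mimicking two method names from the list above."""

    def __init__(self):
        self.closed = False

    def server_timestamp(self):
        # Fixed placeholder value; the real client would query the server.
        return 1486375511188

    def close(self, timeout=None):
        # Record that the client was released.
        self.closed = True


def read_server_timestamp(client):
    """Return the server timestamp and close the client afterwards."""
    # closing() calls client.close() on exit, even if an exception is raised.
    with closing(client) as hc:
        return hc.server_timestamp()


fake = FakeHubstorageClient()
timestamp = read_server_timestamp(fake)
print(timestamp, fake.closed)
```

With the real library, the same helper would presumably be given a `HubstorageClient(auth=...)` instance in place of the fake one.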