### Install scrapinghub

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/README.rst

Install the scrapinghub library using pip.

```bash
pip install scrapinghub
```

--------------------------------

### Install scrapinghub with MessagePack support

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/README.rst

Install the scrapinghub library with MessagePack support for improved performance.

```bash
pip install scrapinghub[msgpack]
```

--------------------------------

### List job items in chunks with start and count

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves a subset of job items using the `chunksize`, `start`, and `count` parameters, so a specific range of items can be fetched in fixed-size chunks.

```default
>>> gen = job.items.list_iter(chunksize=2, start=5, count=3)
>>> next(gen)
[{'name': 'Item #5'}, {'name': 'Item #6'}]
>>> next(gen)
[{'name': 'Item #7'}]
>>> next(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```

--------------------------------

### Getting a Project Setting

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Retrieve the value of a specific project setting by its name.

```python
>>> project.settings.get('job_runtime_limit')
24
```

--------------------------------

### Get Job Key and Metadata

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Demonstrates how to retrieve a job object and access its key and metadata, such as its state.

```python
>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'
>>> job.metadata.get('state')
'finished'
```

--------------------------------

### Get Specific Project Setting

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieve the value of a single project setting by its key. The key must be a string.
```python
>>> project.settings.get('default_job_units')
2
```

--------------------------------

### Settings.list

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves a list of all project setting key/value pairs. This is a simple way to get every setting at once, but it loads the whole result into memory.

```APIDOC
## Settings.list(*args, **kwargs)

### Description
Convenient shortcut to list iter results.

Please note that [`list()`](#scrapinghub.client.projects.Settings.list) method can use a lot of memory and for a large amount of elements it’s recommended to iterate through it via [`iter()`](#scrapinghub.client.projects.Settings.iter) method (all params and available filters are same for both methods).

### Returns
- A list of key/value pairs for project settings.
```

--------------------------------

### Start a Job

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Moves a job to the running state. This operation returns the previous state of the job.

```python
>>> job.start()
'pending'
```

--------------------------------

### Get Project Instance

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Get a specific Project instance using its numeric ID. This instance can then be used to access spiders and jobs within that project.

```python
>>> project = client.get_project(123)
>>> project
>>> project.key
'123'
```

--------------------------------

### Get Specific Job with Client Instance

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Retrieve a specific job using its unique key directly from the client instance. This is a shortcut for getting job information.
```python
>>> job = client.get_job('123/1/2')
```

--------------------------------

### Get a Project by ID

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieve a project instance using its ID. This is a shortcut for accessing the project through the client's projects collection.

```python
project = client.get_project(123)
project
```

--------------------------------

### Get Item from Collection by Key

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Use the `get` method to retrieve a specific item from a collection using its key. Optional query parameters can be passed.

```python
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{'value': '1447221694537'}
```

--------------------------------

### Iterate through first 100 job items

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Iterates through a limited number of items from a job and prints each one. The `count` parameter limits the iteration.

```default
>>> for item in job.items.iter(count=100):
...     print(item)
```

--------------------------------

### Get Project Instance

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Obtains a Project instance from the ScrapinghubClient by project ID.

```python
>>> project = client.get_project(123)
>>> project
>>> project.key
'123'
```

--------------------------------

### Paginate Job Iteration

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

To retrieve more than the default 1000 jobs, use the `start` parameter with `.jobs.iter()` to paginate through results in batches.
```python
>>> jobs_summary = spider.jobs.iter(start=1000)
```

--------------------------------

### Access Project Collections

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get a Collections instance from a Project instance to manage collections. This is the entry point for collection-related operations.

```python
>>> collections = project.collections
>>> collections.list()
[{'name': 'Pages', 'type': 's'}]
>>> foo_store = collections.get_store('foo_store')
```

--------------------------------

### List a single request from a job

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

The `list()` method can retrieve a specific number of requests. This example retrieves a single request and displays its details.

```python
>>> job.requests.list(count=1)
[{
    'duration': 354,
    'fp': '6d748741a927b10454c83ac285b002cd239964ea',
    'method': 'GET',
    'rs': 1270,
    'status': 200,
    'time': 1482233733870,
    'url': 'https://example.com'
}]
```

--------------------------------

### Get Generator Over Item Keys

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Use `iter` with `nodata=True` and `meta=['_key']` to efficiently get a generator of item keys without fetching the full item data.

```python
>>> keys = foo_store.iter(nodata=True, meta=['_key'])
>>> next(keys)
{'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}
```

--------------------------------

### Iterating Through Frontiers

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Get an iterator to loop through all frontiers within a project.

```python
>>> frontiers.iter()
```

--------------------------------

### Access Project Settings Instance

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get an instance of the Settings class for a specific project. This is accessed via the 'settings' attribute of a Project object.
```python
>>> project.settings
```

--------------------------------

### Get a Job by Key

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieve a specific job using its unique key, which is formatted as 'project_id/spider_id/job_id'.

```python
job = client.get_job('123/1/1')
job
```

--------------------------------

### Collections.get_store(name)

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Gets a store collection by its name. Returns a Collection object.

```APIDOC
## Collections.get_store(name)

### Description
Method to get a store collection by name.

### Parameters
#### Path Parameters
- **name** (string) - Required - a collection name string.

### Returns
a collection object.

### Return type
[`Collection`](#scrapinghub.client.collections.Collection)
```

--------------------------------

### Access Job Metadata Instance

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get a JobMeta instance from a Job object. This is the entry point for interacting with job metadata.

```default
>>> job.metadata
```

--------------------------------

### Projects.list

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves a list of all project IDs available to the current user.

```APIDOC
## Projects.list()

### Description
Get list of projects available to current user.

### Returns
- A list of project IDs.
- **Return type:** `list[int]`

### Usage
```python
client.projects.list()
```
```

--------------------------------

### Get Project Summary

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Retrieve a summary of all your projects, including the count of finished, pending, and running jobs for each project. This provides an overview of project activity.
```python
>>> client.projects.summary()
[{'finished': 674,
  'has_capacity': True,
  'pending': 0,
  'project': 123,
  'running': 1},
 {'finished': 33079,
  'has_capacity': True,
  'pending': 0,
  'project': 456,
  'running': 2}]
```

--------------------------------

### Get Specific Job by Key

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Retrieve a specific job using its unique key from a project instance. This returns a Job instance.

```python
>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'
```

--------------------------------

### List samples with a timestamp filter

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

The `list()` method for samples can accept a `startts` parameter to filter samples with a timestamp greater than or equal to the provided value. This example shows two such samples.

```python
>>> job.samples.list(startts=1484570043851)
[[1484570043851, 554, 576, 1777, 821, 0],
 [1484570046673, 561, 583, 1782, 821, 0]]
```

--------------------------------

### Iterating Through Slot Fingerprints

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Get an iterator to retrieve all fingerprints for a given slot.

```python
>>> fps = slot.f.iter()
```

--------------------------------

### Get All Job Summaries

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves a summary of all jobs, grouped by their state. This is useful for understanding the current status distribution of jobs.
```python
>>> spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
 {'count': 0, 'name': 'running', 'summary': []},
 {'count': 5, 'name': 'finished', 'summary': [...]}]
```

--------------------------------

### Get Job Summaries

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Use `.jobs.summary()` to retrieve a summary of job counts for different states (pending, running, finished).

```python
>>> spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
 {'count': 0, 'name': 'running', 'summary': []},
 {'count': 5, 'name': 'finished', 'summary': [...]}]
```

--------------------------------

### List job items with start timestamp

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves job items that have a timestamp greater than or equal to the specified `startts`. The returned list contains dictionaries representing the items.

```default
>>> job.items.list(startts=1447221694537)
[{
    'name': ['Some custom item'],
    'url': 'http://some-url/item.html',
    'size': 100000,
}]
```

--------------------------------

### Access Projects Collection

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get an instance of the Projects collection from the Scrapinghub client. This is the entry point for managing user projects.

```python
>>> client.projects
```

--------------------------------

### summary

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get a summary of jobs, optionally filtered by state or spider. Returns a list of dictionaries, grouped by job state.

```APIDOC
## summary(state=None, spider=None, **params)

### Description
Get jobs summary (optionally by state).

### Parameters
#### Query Parameters
- **state** (optional) - a string state to filter jobs.
- **spider** (optional) - a spider name (not needed if instantiated with [`Spider`](#scrapinghub.client.spiders.Spider)).
- **params** (optional) - additional keyword args.

### Returns
a list of dictionaries with job summaries for the given filter params, grouped by job state.

### Return type
`list[dict]`
```

--------------------------------

### List all request batches in a frontier slot queue

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Use the list() method of the 'q' property on a FrontierSlot object to get a list of all request batches. Each batch is a dictionary containing 'id' and 'requests'.

```python
>>> slot.q.list()
[{'id': '0115a8579633600006', 'requests': [['page1.html', {'depth': 1}]]}]
```

--------------------------------

### Listing Project Settings

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Retrieve a list of all project settings as key/value pairs.

```python
>>> project.settings.list()
[(u'default_job_units', 2), (u'job_runtime_limit', 24)]
```

--------------------------------

### Get the count of new requests in a frontier

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Access the newcount property of a Frontier object to get the number of new entries added to the frontier, as an integer.

```python
>>> frontier.newcount
3
```

--------------------------------

### Job.start

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Move a job to the running state. Optionally updates meta parameters.

```APIDOC
## Job.start

### Description
Move a job to the running state. Optionally updates meta parameters.

### Method
POST (assumed)

### Endpoint
/jobs/{job_key}/start (assumed)

### Parameters
#### Query Parameters
- **params** (dict) - Optional - keyword meta parameters to update.

### Response
#### Success Response (200)
- **previous_state** (str) - The previous job state, as a string.
```

--------------------------------

### Job Dictionary Structure

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Dictionaries returned by `.jobs.iter()` contain job summary information. This example shows a typical structure.

```python
{
    u'close_reason': u'finished',
    u'elapsed': 201815620,
    u'finished_time': 1492843577852,
    u'items': 2,
    u'key': u'123320/3/155',
    u'logs': 21,
    u'pages': 2,
    u'pending_time': 1492843520319,
    u'running_time': 1492843526622,
    u'spider': u'spider001',
    u'state': u'finished',
    u'ts': 1492843563720,
    u'version': u'792458b-master'
}
```

--------------------------------

### Access Jobs Collection from Project

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get the Jobs collection associated with a Project instance. This is used to manage jobs within that project.

```default
>>> project.jobs
```

--------------------------------

### Get Job Summary by State

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves a summary of jobs for a specific state. If no state is provided, it returns a summary grouped by all job states.

```python
>>> project.jobs.summary('pending')
{'count': 0, 'name': 'pending', 'summary': []}
```

--------------------------------

### List Project Settings

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieve a list of all key-value pairs for a project's settings. For large numbers of settings, consider using iter() to conserve memory.

```python
>>> project.settings.list()
[(u'default_job_units', 2), (u'job_runtime_limit', 20)]
```

--------------------------------

### Get the count of new requests in a frontier slot

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Access the newcount property of a FrontierSlot object to get the number of new entries added to that slot, as an integer.
```python
>>> slot.newcount
2
```

--------------------------------

### Run a New Job for a Spider

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Run a new job for a specific spider. This is the basic way to start a scraping job on the Scrapinghub platform.

```python
>>> job = spider.jobs.run()
```

--------------------------------

### Get Project by ID

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieve a specific project object using its unique ID. Ensure the project ID is a valid integer or numeric string.

```python
>>> project = client.projects.get(123)
>>> project
```

--------------------------------

### Settings.get

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieves the value of a specific project setting by its key. This allows you to access individual configuration values.

```APIDOC
## Settings.get(key)

### Description
Get element value by key.

### Parameters
#### Path Parameters
- **key** (string) - Required - The key of the setting to retrieve.
```

--------------------------------

### Filter Logs by Level and Message Content

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

This example shows how to filter job logs by a specific log level (e.g., WARNING) and by messages containing a particular word. It uses a list of tuples for filters.

```python
>>> filters = [("message", "contains", ["mymessage"])]
>>> job.logs.list(level='WARNING', filter=filters)
[{
    'level': 30,
    'message': 'Some warning: mymessage',
    'time': 1486375511188,
}]
```

--------------------------------

### Get Specific Spider Instance

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Get a specific Spider instance by its name. This instance provides access to spider-specific job collections and attributes like 'key', 'name', etc.
```python
>>> spider = project.spiders.get('spider2')
>>> spider
>>> spider.key
'123/2'
>>> spider.name
'spider2'
```

--------------------------------

### Projects.summary

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Fetches short summaries for all user projects, including job status and capacity information. This is useful for a quick overview of project activity.

```APIDOC
## Projects.summary(**params)

### Description
Get short summaries for all available user projects.

### Parameters
#### Query Parameters
- **state** (string or list) - Optional - A string state or a list of states to filter summaries by.

### Returns
- A list of dictionaries, where each dictionary represents a project summary including job counts (pending, running, finished) and capacity status.
- **Return type:** `list[dict]`

### Usage
```python
client.projects.summary()
```
```

--------------------------------

### Items.stats

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Get resource stats for items.

```APIDOC
## Items.stats

### Description
Get resource stats for items.

### Method
GET (assumed)

### Endpoint
/items/stats (assumed)

### Response
#### Success Response
- **stats_data** (dict) - A dictionary with stats data.
```

--------------------------------

### Iterate through all samples from a job

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Use the `iter()` method on the job's samples attribute to get a generator for all samples. This is memory-efficient for large numbers of samples.

```python
>>> job.samples.iter()
```

--------------------------------

### List all fingerprints in a frontier slot

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Use the list() method of the 'f' property on a FrontierSlot object to get a list of all fingerprints. Optional query parameters can be passed.
```python
>>> slot.f.list()
['page1.html']
```

--------------------------------

### Frontier Class Methods

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Provides methods to manage and interact with a frontier, including iterating through slots, listing all slots, getting a specific slot, and flushing frontier data.

```APIDOC
## Frontier Methods

### flush()
#### Description
Flush data for a whole frontier.

### get(slot)
#### Description
Get a slot by name.
* **Returns:** a frontier slot instance.
* **Return type:** [`FrontierSlot`](#scrapinghub.client.frontiers.FrontierSlot)

### iter()
#### Description
Iterate through slots.
* **Returns:** an iterator over frontier slots names.
* **Return type:** `collections.abc.Iterable[str]`

### list()
#### Description
List all slots.
* **Returns:** a list of frontier slots names.
* **Return type:** `list[str]`

### *property* newcount
#### Description
Integer amount of new entries added to frontier.
```

--------------------------------

### Getting a Specific Frontier

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Obtain a Frontier instance by its name.

```python
>>> frontier = frontiers.get('test')
>>> frontier
```

--------------------------------

### Getting a Specific Frontier Slot

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Obtain a FrontierSlot instance by its name.

```python
>>> slot = frontier.get('example.com')
>>> slot
```

--------------------------------

### Iterate through requests in a frontier slot queue

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Use the iter() method of the 'q' property on a FrontierSlot object to get a generator yielding request batches. Each batch is a dictionary containing 'id' and 'requests'.
```python
>>> slot.q.iter()
```

--------------------------------

### Getting a Spider Object

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Illustrates how to obtain a specific Spider object using its name. This method is used to interact with individual spiders, such as retrieving their metadata or updating their tags.

```default
>>> project.spiders.get('spider2')
```

--------------------------------

### Settings.iter

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Iterates through all key/value pairs of project settings. This is useful for processing all settings sequentially.

```APIDOC
## Settings.iter()

### Description
Iterate through key/value pairs.

### Returns
- An iterator over key/value pairs.
- **Return type:** `collections.abc.Iterable`
```

--------------------------------

### Add a Single Project Activity Event

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Post a new activity event to a project using `.activity.add()`. The event should be a dictionary.

```default
>>> event = {'event': 'job:completed', 'job': '123/2/4', 'user': 'john'}
>>> project.activity.add(event)
```

--------------------------------

### Updating Multiple Project Settings

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Update several project settings simultaneously by providing a dictionary of key-value pairs.

```python
>>> project.settings.update({'default_job_units': 1,
...                          'job_runtime_limit': 20})
```

--------------------------------

### Iterating Through Slot Requests

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Get an iterator to retrieve all requests for a given slot.
```python
>>> reqs = slot.q.iter()
```

--------------------------------

### iter(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params)

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Iterates over a collection of jobs, allowing for filtering by various parameters such as count, start offset, spider name, job state, tags, timestamps, and meta fields. Useful for batch processing or searching for specific jobs.

```APIDOC
## iter(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params)

### Description
Iterate over jobs collection for a given set of params.

### Parameters
#### Query Parameters
* **count** (integer) - Optional - limit amount of returned jobs.
* **start** (integer) - Optional - number of jobs to skip in the beginning.
* **spider** (string) - Optional - filter by spider name.
* **state** (string or list of strings) - Optional - a job state.
* **has_tag** (string or list of strings) - Optional - filter results by existing tag(s).
* **lacks_tag** (string or list of strings) - Optional - filter results by missing tag(s).
* **startts** (integer) - Optional - UNIX timestamp at which to begin results, in milliseconds.
* **endts** (integer) - Optional - UNIX timestamp at which to end results, in milliseconds.
* **meta** (string or list of strings) - Optional - request for additional fields, a single field name or a list of field names to return.
* **params** (any) - Optional - other filter params.

### Returns
a generator object over a list of dictionaries of jobs summary for a given filter params.

### Return type
types.GeneratorType[dict]

### Notes
The endpoint used by the method returns only finished jobs by default, use `state` parameter to return jobs in other states.
### Usage
- retrieve all jobs for a spider:

```default
>>> spider.jobs.iter()
```

- get all job keys for a spider:

```default
>>> jobs_summary = spider.jobs.iter()
>>> [job['key'] for job in jobs_summary]
['123/1/3', '123/1/2', '123/1/1']
```

- job summary fieldset is less detailed than [`JobMeta`](#scrapinghub.client.jobs.JobMeta) but contains a few new fields as well. Additional fields can be requested using `meta` parameter. If it’s used, then it’s up to the user to list all the required fields, so only few default fields would be added except requested ones:

```default
>>> jobs_summary = project.jobs.iter(meta=['scheduled_by', ])
```

- by default [`Jobs.iter()`](#scrapinghub.client.jobs.Jobs.iter) returns maximum last 1000 results. Pagination is available using start parameter:

```default
>>> jobs_summary = spider.jobs.iter(start=1000)
```

- get jobs filtered by tags (list of tags has `OR` power):

```default
>>> jobs_summary = project.jobs.iter(
...     has_tag=['new', 'verified'], lacks_tag='obsolete')
```

- get certain number of last finished jobs per some spider:

```default
>>> jobs_summary = project.jobs.iter(
...     spider='spider2', state='finished', count=3)
```
```

--------------------------------

### Iterating Through Frontier Slots

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Get an iterator to loop through all slots within a specific frontier.

```python
>>> frontier.iter()
```

--------------------------------

### Add Multiple Project Activity Events

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Post multiple activity events to a project at once using `.activity.add()`. Pass a list of event dictionaries.

```default
>>> events = [
...     {'event': 'job:completed', 'job': '123/2/5', 'user': 'john'},
...     {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'},
...
]
>>> project.activity.add(events)
```

--------------------------------

### Activity.add

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Adds one or more new activity events to the project. Events are represented as dictionaries with 'event', 'job', and 'user' keys.

```APIDOC
#### add(values, **kwargs)

### Description
Add new event to the project activity.

### Parameters
* **values** – a single event or a list of events, where event is represented with a dictionary of (‘event’, ‘job’, ‘user’) keys.
```

--------------------------------

### Get Specific Job Meta Field

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Retrieve the value of a specific job metadata field by its key.

```default
>>> job.metadata.get('version')
'test'
```

--------------------------------

### Instantiate Scrapinghub Client

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/quickstart.md

Instantiate a new Scrapinghub client using your API key. Ensure your API key is kept secret.

```python
from scrapinghub import ScrapinghubClient

apikey = '84c87545607a4bc0****************'  # your API key as a string
client = ScrapinghubClient(apikey)
```

--------------------------------

### Accessing Frontiers

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md

Initialize the frontiers object to interact with project frontiers.

```python
>>> frontiers = project.frontiers
```

--------------------------------

### Settings.update

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Updates multiple project settings at once using a dictionary of key/value pairs. This provides a convenient way to perform partial updates.

```APIDOC
## Settings.update(values)

### Description
Update multiple elements at once.

The method provides convenient interface for partial updates.
### Parameters
#### Request Body
- **values** (dict) - Required - A dictionary with key/values to update.
```

--------------------------------

### Collections.get_versioned_store(name)

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Gets a versioned-store collection by name. Retains up to 3 copies of each item. Returns a Collection object.

```APIDOC
## Collections.get_versioned_store(name)

### Description
Method to get a versioned-store collection by name.

The collection type retains up to 3 copies of each item.

### Parameters
#### Path Parameters
- **name** (string) - Required - a collection name string.

### Returns
a collection object.

### Return type
[`Collection`](#scrapinghub.client.collections.Collection)
```

--------------------------------

### Collections.get_cached_store(name)

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Gets a cached-store collection by name. Items in this collection expire after a month. Returns a Collection object.

```APIDOC
## Collections.get_cached_store(name)

### Description
Method to get a cached-store collection by name.

The collection type means that items expire after a month.

### Parameters
#### Path Parameters
- **name** (string) - Required - a collection name string.

### Returns
a collection object.

### Return type
[`Collection`](#scrapinghub.client.collections.Collection)
```

--------------------------------

### run

Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md

Schedule a new job and return a job instance representing it. This method allows specifying spider name, units, priority, metadata, tags, arguments, settings, command arguments, and environment variables.

```APIDOC
## run(spider=None, units=None, priority=None, meta=None, add_tag=None, job_args=None, job_settings=None, cmd_args=None, environment=None, **params)

### Description
Schedule a new job and return the scheduled job.
### Parameters #### Query Parameters - **spider** - a spider name string (not needed if job is scheduled via `Spider.jobs`). - **units** (optional) - amount of units for the job. - **priority** (optional) - integer priority value. - **meta** (optional) - a dictionary with metadata. - **add_tag** (optional) - a string tag or a list of tags to add. - **job_args** (optional) - a dictionary with job arguments. - **job_settings** (optional) - a dictionary with job settings. - **cmd_args** (optional) - a string with script command args. - **environment** (optional) - a dictionary with custom environment. - **params** (optional) - additional keyword args. ### Returns a job instance, representing the scheduled job. ### Return type [`Job`](#scrapinghub.client.jobs.Job) ``` -------------------------------- ### Run Tests Updating Cassettes Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/quickstart.md Run integration tests using py.test and update or recreate all VCR.py cassettes from scratch. This erases existing cassettes. ```bash py.test --update-cassettes ``` -------------------------------- ### Items.list Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Convenient shortcut to list iter results. It's recommended to use `iter()` for large amounts of elements due to memory constraints. ```APIDOC ## Items.list ### Description Convenient shortcut to list iter results. It's recommended to use `iter()` for large amounts of elements due to memory constraints. ### Method GET (assumed) ### Endpoint /items (assumed) ### Parameters All parameters and available filters are the same as for the `iter()` method. ### Response #### Success Response - **items** (list) - A list of elements. 
``` -------------------------------- ### Get new request count for all frontiers Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Retrieves the total count of new requests that have been added across all frontiers for the project. ```default >>> project.frontiers.newcount 3 ``` -------------------------------- ### Collections.get_versioned_cached_store(name) Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Gets a versioned-cached-store collection by name. Multiple copies are retained, and each expires after a month. Returns a Collection object. ```APIDOC ## Collections.get_versioned_cached_store(name) ### Description Method to get a versioned-cached-store collection by name. Multiple copies are retained, and each one expires after a month. ### Parameters #### Path Parameters - **name** (string) - Required - a collection name string. ### Returns a collection object. ### Return type [`Collection`](#scrapinghub.client.collections.Collection) ``` -------------------------------- ### Collection.list(key=None, prefix=None, prefixcount=None, startts=None, endts=None, requests_params=None, **params) Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Convenient shortcut to list iter results for a collection. It can consume a lot of memory for large datasets; consider using iter() for large amounts of data. ```APIDOC ## Collection.list(key=None, prefix=None, prefixcount=None, startts=None, endts=None, requests_params=None, **params) ### Description Convenient shortcut to list iter results. Please note that the [`list()`](#scrapinghub.client.collections.Collection.list) method can use a lot of memory; for a large number of items it’s recommended to iterate via the [`iter()`](#scrapinghub.client.collections.Collection.iter) method instead (all parameters and available filters are the same for both methods).
### Parameters #### Query Parameters - **key** (string) - Optional - a string key or a list of keys to filter with. - **prefix** (string) - Optional - a string prefix to filter items. - **prefixcount** (integer) - Optional - maximum number of values to return per prefix. - **startts** (integer) - Optional - UNIX timestamp at which to begin results. - **endts** (integer) - Optional - UNIX timestamp at which to end results. - **requests_params** (dict) - Optional - a dict with optional requests params. - **params** (dict) - Optional - additional query params for the request. ### Returns a list of items where each item is represented with a dict. ### Return type `list[dict]` ``` -------------------------------- ### Settings.set Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Updates the value of a specific project setting by its key. Note that some settings are read-only. ```APIDOC ## Settings.set(key, value) ### Description Update project setting value by key. ### Parameters #### Path Parameters - **key** (string) - Required - The key of the setting to update. - **value** (any) - Required - The new value for the setting. ``` -------------------------------- ### scrapinghub.client.samples.Samples.list Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md A convenient shortcut that lists all elements from the iterator. Be aware that this method can consume significant memory for large collections. ```APIDOC ## scrapinghub.client.samples.Samples.list ### Description Convenient shortcut to list iter results. This method can use a lot of memory for a large amount of elements and it’s recommended to iterate through it via `iter()` method. ### Parameters * **args** - Positional arguments. * **kwargs** - Keyword arguments. 
``` -------------------------------- ### Adding Requests with Parameters Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Add multiple requests to a slot's queue, including requests with additional parameters like page number and query data. Flush changes afterwards. ```python >>> slot.q.add([{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}]) >>> slot.flush() ``` -------------------------------- ### Add a Single Activity Event Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Post a new activity event to a project. The event must be a dictionary containing 'event', 'job', and 'user' keys. ```python event = {'event': 'job:completed', 'job': '123/2/4', 'user': 'jobrunner'} project.activity.add(event) ``` -------------------------------- ### Setting a Project Setting Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Update the value of a specific project setting by its name. ```python >>> project.settings.set('job_runtime_limit', 20) ``` -------------------------------- ### Retrieve All Project Activity Events Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Fetch all available project activity events using `.activity.list()`. ```pycon >>> project.activity.list() [{'event': 'job:completed', 'job': '123/2/3', 'user': 'jobrunner'}, {'event': 'job:cancelled', 'job': '123/2/3', 'user': 'john'}] ``` -------------------------------- ### Count Items in Collection Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Use the `count` method to get the total number of items in a collection. This method accepts optional filters. ```python >>> foo_store.count() 1 ``` -------------------------------- ### Projects.get Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Retrieves a specific project by its ID. 
This method allows you to fetch detailed information about a single project. ```APIDOC ## Projects.get(project_id) ### Description Get project for a given project id. ### Parameters #### Path Parameters - **project_id** (integer or string) - Required - The ID of the project to retrieve. ### Returns - A project object. - **Return type:** [`Project`] ### Usage ```python project = client.projects.get(123) ``` ``` -------------------------------- ### Get Job Metadata Value Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Retrieve a specific metadata value for a job using its key. The job instance must be available. ```python >>> job.metadata.get('version') ``` -------------------------------- ### Iterate and Print First 100 Log Entries Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md This snippet demonstrates how to iterate through a limited number of log entries and print each one. It uses the count parameter to limit the results. ```python >>> for log in job.logs.iter(count=100): ... print(log) ``` -------------------------------- ### Add Multiple Activity Events Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Post multiple activity events to a project simultaneously by providing a list of event dictionaries to the add() method. ```python events = [ {'event': 'job:completed', 'job': '123/2/5', 'user': 'jobrunner'}, {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'}, ] project.activity.add(events) ``` -------------------------------- ### Iterate Jobs with Meta Fields Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Request additional fields in the job summary using the 'meta' parameter. You must specify all desired fields, as only default fields will be added alongside requested ones. 
```python >>> jobs_summary = project.jobs.iter(meta=['scheduled_by', ]) ``` -------------------------------- ### Get All Job Keys for a Spider Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Iterate through job summaries and extract only the 'key' field for each job. This is useful for obtaining a list of job identifiers. ```python >>> jobs_summary = spider.jobs.iter() >>> [job['key'] for job in jobs_summary] ['123/1/3', '123/1/2', '123/1/1'] ``` -------------------------------- ### Frontiers.newcount Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Gets the integer amount of new entries that have been added to all frontiers. This property provides a count of newly added requests across all frontiers. ```APIDOC ## Frontiers.newcount ### Description Integer amount of new entries added to all frontiers. ### Property newcount ### Parameters None ### Returns Integer amount of new entries added to all frontiers. ``` -------------------------------- ### Get a specific job item Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Retrieves a single job item by its key. The method returns a dictionary containing the item's data. ```default >>> job.items.get(key) ``` -------------------------------- ### Accessing Spiders Collection Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Demonstrates how to access the `Spiders` collection associated with a project. This collection provides methods for managing and retrieving spider information. ```default >>> project.spiders ``` -------------------------------- ### Store Data in a Collection Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Store a key-value pair in a Scrapinghub collection. This involves getting a store, setting data, and optionally counting or retrieving. 
```default >>> collections = project.collections >>> foo_store = collections.get_store('foo_store') >>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}) >>> foo_store.count() 1 >>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7') {u'value': u'1447221694537'} >>> # iterate over _key & value pair ... list(foo_store.iter()) [{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}] >>> # filter by multiple keys - only values for keys that exist will be returned ... list(foo_store.iter(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah'])) [{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}] >>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7') >>> foo_store.count() 0 ``` -------------------------------- ### Count Spider Jobs Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Use `.jobs.count()` to get the number of jobs for a specific spider. This method supports various filters for precise counting. ```python >>> spider.jobs.count() 5 ``` -------------------------------- ### Listing All Slot Requests Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Retrieve a list of all requests for a given slot. The `list()` method works similarly for fingerprints. ```python >>> fps = slot.q.list() ``` -------------------------------- ### Items.list Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Retrieves a list of scraped items from a job, with support for various filters and parameters like start timestamp, count, and custom filters. ```APIDOC ## Items.list ### Description Retrieve items with timestamp greater or equal to given timestamp or with filters. ### Method list(startts=None, count=None, filter=None) ### Parameters #### Query Parameters - **startts** (integer) - Optional - retrieve items with timestamp greater or equal to this value. 
- **count** (integer) - Optional - the number of items to retrieve. - **filter** (list of tuples) - Optional - a list of filters to apply. Each filter is a tuple of (field, operator, value). ### Returns a list of dictionaries, where each dictionary represents an item. ### Return type `list[dict]` ``` -------------------------------- ### Run a New Job Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/quickstart.md Run a new job for a specific project, specifying the spider name and optional job arguments. ```python project = client.get_project(123) project.jobs.run('spider1', job_args={'arg1': 'val1'}) ``` -------------------------------- ### Collections.list() Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Lists all collections of a project. Each collection is represented by a dictionary with 'name' and 'type' fields. ```APIDOC ## Collections.list() ### Description List collections of a project. ### Returns a list of collections where each collection is represented by a dictionary with (‘name’,’type’) fields. ### Return type `list[dict]` ``` -------------------------------- ### scrapinghub.client.samples.Samples.iter Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Provides an iterator to go through all sample elements in the collection. This method is memory-efficient for large datasets, recommended over `list()`. ```APIDOC ## scrapinghub.client.samples.Samples.iter ### Description Iterate over elements in collection. ### Parameters * **_key** (string) - Internal key parameter. * **count** (int) - Limit amount of elements to retrieve. * **params** (dict) - Additional parameters for the iteration. ### Returns A generator object over a list of element dictionaries. 
### Return type `types.GeneratorType[dict]` ``` -------------------------------- ### Run a New Job for a Project Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Run a new job for a project, specifying the spider name. This is an alternative to running a job directly from a spider instance. ```python >>> job = project.jobs.run('spider1') ``` -------------------------------- ### Iterate through all requests from a job Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Use the `iter()` method on the job's requests attribute to get a generator for all requests. This is memory-efficient for large numbers of requests. ```python >>> job.requests.iter() ``` -------------------------------- ### List Spiders in a Project Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md List all spiders within a specific project. This returns a list of dictionaries, where each dictionary contains details like 'id', 'tags', 'type', and 'version'. ```python >>> project.spiders.list() [ {'id': 'spider1', 'tags': [], 'type': 'manual', 'version': '123'}, {'id': 'spider2', 'tags': [], 'type': 'manual', 'version': '123'} ] ``` -------------------------------- ### get(job_key) Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Retrieves a specific Job object using its unique job key. This method is useful for accessing detailed information about a single job. ```APIDOC ## get(job_key) ### Description Get a [`Job`](#scrapinghub.client.jobs.Job) with a given job_key. ### Parameters #### Path Parameters * **job_key** (string) - Required - A string job key. 
The job_key's project component should match the project used to get [`Jobs`](#scrapinghub.client.jobs.Jobs) instance, and job_key's spider component should match the spider (if [`Spider`](#scrapinghub.client.spiders.Spider) was used to get [`Jobs`](#scrapinghub.client.jobs.Jobs) instance). ### Returns a job object. ### Return type [`Job`](#scrapinghub.client.jobs.Job) ### Usage ```default >>> job = project.jobs.get('123/1/2') >>> job.key '123/1/2' ``` ``` -------------------------------- ### Create Batch Writer for Collection Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md The `create_writer` method initializes a batch writer for efficiently uploading multiple items to a collection. It allows configuration of parameters like queue size and upload interval. ```python >>> writer = foo_store.create_writer(size=1000, interval=15) >>> # Use writer to add items... ``` -------------------------------- ### Projects.iter Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Iterates through all projects available to the current user. This is useful for processing projects sequentially. ```APIDOC ## Projects.iter() ### Description Iterate through list of projects available to current user. Provided for the sake of API consistency. ### Returns - An iterator over project IDs. - **Return type:** `collections.abc.Iterable[int]` ``` -------------------------------- ### Access Jobs Collection from Spider Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Get the Jobs collection associated with a Spider instance. This allows managing jobs for a specific spider within a project. ```default >>> spider = project.spiders.get('spider1') >>> spider.jobs ``` -------------------------------- ### list(mincount=None, **params) Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md List request batches in the queue. 
This method allows you to retrieve information about pending request batches. ```APIDOC ## list(mincount=None, **params) ### Description List request batches in the queue. ### Method list(mincount=None, **params) ### Parameters #### Query Parameters - **mincount** (integer) - Optional - limit results to batches with at least this many requests. - **params** (dict) - Optional - additional query params for the request. ### Returns a list of request batches in the queue where each batch is represented with a dict with (‘id’, ‘requests’) fields. ### Return type `list[dict]` ``` -------------------------------- ### Set Project Setting Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Update the value of a specific project setting using its key and new value. Note that some settings may be read-only. ```python >>> project.settings.set('default_job_units', 2) ``` -------------------------------- ### Add requests to a frontier slot queue Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Use the add() method of the 'q' property on a FrontierSlot object to add new requests to the slot's queue. Each request should be a dictionary. ```python >>> data = [{'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}] >>> slot.q.add(data) ``` -------------------------------- ### Iterate Over Spider Jobs Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Use `.jobs.iter()` to get an iterator for spider jobs, ordered by most recently finished. The iterator yields dictionaries containing job summaries. ```python >>> jobs_summary = spider.jobs.iter() >>> [j['key'] for j in jobs_summary] ['123/1/3', '123/1/2', '123/1/1'] ``` ```python >>> for job in jobs_summary: ...
# do something with job data ``` ```python >>> [x['key'] for x in jobs_summary] ['123/1/3', '123/1/2', '123/1/1'] ``` -------------------------------- ### Iterate Through All Job Metadata Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/overview.md Iterate through all available metadata keys and values for a job. The job instance must be available. The output is typically a dictionary. ```python >>> dict(job.metadata.iter()) ``` -------------------------------- ### Accessing Spider Properties Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Demonstrates how to retrieve a Spider object and access its key and name properties. This is useful for identifying and referencing specific spiders within a project. ```default >>> spider = project.spiders.get('spider1') >>> spider.key '123/1' >>> spider.name 'spider1' ``` -------------------------------- ### Iterate Over Project Activity Events Source: https://github.com/scrapinghub/python-scrapinghub/blob/master/docs/client/apidocs.md Use the iter() method to efficiently loop through all activity events for a project, which is recommended for large numbers of events. ```python project.activity.iter() ```
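As a rough illustration of consuming such an iterator without loading every event into memory, the sketch below tallies events by type. The helper name `summarize_events` and the sample data are hypothetical and not part of the library; the only assumption about real events is that each one is a dictionary with 'event', 'job', and 'user' keys, as described for `Activity.add`.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_events(events: Iterable[Dict[str, str]]) -> Counter:
    """Count activity events by their 'event' type, one entry at a time."""
    counts = Counter()
    for entry in events:
        # Each entry is assumed to be a dict with 'event', 'job' and 'user' keys.
        counts[entry.get('event', 'unknown')] += 1
    return counts


# Stand-in data shaped like the events shown in this document (hypothetical
# values); with a real project you would pass project.activity.iter() instead.
sample_events = [
    {'event': 'job:completed', 'job': '123/2/5', 'user': 'jobrunner'},
    {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'},
    {'event': 'job:completed', 'job': '123/2/7', 'user': 'jobrunner'},
]

summary = summarize_events(iter(sample_events))
print(dict(summary))
```

Because the helper accepts any iterable, it should work the same way on a plain list returned by `project.activity.list()`, at the cost of holding all events in memory at once.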