### Example Prompt for Data Extraction

Source: https://github.com/ai92-github/parseextract/blob/main/output_examples.md

This prompt instructs the model to extract specific product details from a given URL.

```plaintext
extract the product name, product link, image link and price for all the products
```

--------------------------------

### Asynchronous PDF/DOCX Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet demonstrates asynchronous PDF/DOCX parsing using 'httpx' and 'aiofiles'. It's suitable for non-blocking operations. Ensure 'httpx' and 'aiofiles' are installed.

```python
import httpx, aiofiles
import os
import asyncio

# API URL
api_url = "https://api.parseextract.com/v1/pdf-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx file path or url
pdf_file_path = 'your-pdf-or-docx-file-path.pdf or url'

# Other Form Data Parameters (refer table below for all available parameters)
pdf_option = 'option_b'
inline_images = False
get_base64_images = True

# Payload
payload = {"pdf_option":pdf_option, "inline_images":inline_images, "get_base64_images":get_base64_images} # add all other parameters

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def parse_pdf_async(api_url, pdf_file_path, pdf_option, inline_images, get_base64_images):
    async with aiofiles.open(pdf_file_path, 'rb') as file:
        file_content = await file.read()
    files = {'file': (pdf_file_path, file_content)}
    timeout = httpx.Timeout(10, read=300)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
    return response

async def get_response_async():
    response = await parse_pdf_async(api_url, pdf_file_path, pdf_option, inline_images, get_base64_images)
    print(response.json().get('text',''))
    print(response.json().get('images',''))
    print(response.json().get('job_id',''))

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Async Image API Call

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet to perform asynchronous image parsing. Ensure you have httpx and aiofiles installed. The API key should be set as an environment variable.

```python
import httpx, aiofiles
import os

# API URL
api_url = "https://api.parseextract.com/v1/image-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The image file path or url
image_file_path = 'your-image-file-path or url'

# Other Form Data Parameters (refer table below for all available parameters)
image_option = 'option_b'

# Payload
payload = {"image_option":image_option} # add all other parameters

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def parse_image_async(api_url, image_file_path, image_option):
    async with aiofiles.open(image_file_path, 'rb') as file:
        file_content = await file.read()
    files = {'file': (image_file_path, file_content)}
    timeout = httpx.Timeout(10, read=60)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
    return response

# response = await parse_image_async(api_url, image_file_path, image_option)

# or use asyncio
import asyncio

async def get_response_async():
    response = await parse_image_async(api_url, image_file_path, image_option)
    print(response.json().get('text',''))

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Synchronous PDF/DOCX Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet for synchronous PDF or DOCX file parsing. Ensure you have the 'requests' library installed. The timeout parameter configures connection and read timeouts.
```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/pdf-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx file path or url
pdf_file_path = 'your-pdf-or-docx-file-path.pdf or url'

# Other Form Data Parameters (refer table below for all available parameters)
pdf_option = 'option_b'
inline_images = False
get_base64_images = True

# Timeouts
timeout = (10, 300) # 10 seconds connect, 300 seconds read

# Payload
payload = {"pdf_option":pdf_option, "inline_images":inline_images, "get_base64_images":get_base64_images} # add all other parameters

# POST Request (Sync)
# We send the file and all parameters as multi-form data
with open(pdf_file_path, 'rb') as f:
    files = {'file': (pdf_file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

parsed_text = response.json().get('text','')
extracted_images = response.json().get('images','')
job_id = response.json().get('job_id','')
print(parsed_text)
print(extracted_images)
print(job_id)
```

--------------------------------

### GET /v1/fetchcrawloutput

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Retrieves the results of a previously initiated crawl or parse job using the job_id.

```APIDOC
## GET /v1/fetchcrawloutput

### Description
Fetches the output of a crawl or parse job using the job_id returned from the initial request.

### Method
GET

### Endpoint
https://api.parseextract.com/v1/fetchcrawloutput

### Parameters
#### Query Parameters
- **job_id** (string) - Required - The unique identifier for the job.

### Response
#### Success Response (200)
- **output** (string) - The parsed content or crawl results.
```

--------------------------------

### Save Tables as Excel/CSV

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Parses the API response to get base64-encoded file data and saves it as Excel or CSV files. Allows configuration to download only specific file types.

```python
import json, base64

# `response` is the result of an earlier API call
api_response = response.json()

# get excel and csv files to download from the response
try:
    file_to_download = json.loads(api_response.get('file_to_download', '[]'))
except (TypeError, ValueError):
    # field missing or not valid JSON
    file_to_download = []

# You can set download_excel or download_csv = False if you do not need any one of them
download_excel = True
download_csv = True

# saving excel / csv files
if file_to_download:
    for table_data in file_to_download:
        output_filename = table_data['id']
        if not download_excel and table_data['id'].endswith('.xlsx'):
            continue
        if not download_csv and table_data['id'].endswith('.csv'):
            continue
        decoded_bytes = base64.b64decode(table_data['base64_string'])
        with open(output_filename, "wb") as f:
            f.write(decoded_bytes)
        print(f"{output_filename} created successfully from Base64 data.")
else:
    print('No files to download')
```

--------------------------------

### Fetch Results using Job ID

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Retrieve parsing results for large documents (over 5 pages) using the job ID obtained from the initial PDF/DOCX parse request. This uses a simple GET request.

```python
import requests
import os

# Job ID
job_id = 'the-job-id-from-the-pdf-parse-endpoint'

# API URL
# Add the job id as the query string parameter
api_url = f"https://api.parseextract.com/v1/fetchoutput?job_id={job_id}"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# GET Request
response = requests.get(api_url, headers=headers)
print(response.json().get('text',''))
print(response.json().get('images',''))
```

--------------------------------

### Parse PDF and DOCX Documents with Python

Source: https://context7.com/ai92-github/parseextract/llms.txt

Uploads a document for parsing into structured text. Small documents return results immediately, while larger ones require asynchronous job handling.
```python
import requests
import os

api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}
api_url = "https://api.parseextract.com/v1/pdf-parse"

# Parse PDF with inline images
pdf_path = "document.pdf"
payload = {
    "pdf_option": "option_b",       # Use option_b for better accuracy
    "inline_images": True,          # Insert [Image_X_Y] placeholders
    "get_base64_images": True       # Return images as base64
}

with open(pdf_path, 'rb') as f:
    files = {'file': (pdf_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=(10, 300))

result = response.json()

# For small documents (<=5 pages), get immediate results
if 'text' in result:
    parsed_text = result.get('text', '')
    images = result.get('images', [])
    print(parsed_text)
    # Output includes [Image_1_1], [Image_1_2] placeholders inline with text
```

--------------------------------

### Asynchronous PDF Parsing with httpx

Source: https://context7.com/ai92-github/parseextract/llms.txt

Uses httpx and aiofiles to perform non-blocking document uploads and concurrent processing of multiple files.
```python
import httpx
import aiofiles
import asyncio
import os

api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

async def parse_pdf_async(pdf_path: str) -> dict:
    """Asynchronously parse a PDF document."""
    api_url = "https://api.parseextract.com/v1/pdf-parse"
    payload = {"pdf_option": "option_b", "inline_images": True, "get_base64_images": True}

    async with aiofiles.open(pdf_path, 'rb') as file:
        file_content = await file.read()

    files = {'file': (pdf_path, file_content)}
    timeout = httpx.Timeout(10, read=300)

    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
    return response.json()

async def batch_parse_documents(pdf_paths: list) -> list:
    """Parse multiple PDFs concurrently."""
    tasks = [parse_pdf_async(path) for path in pdf_paths]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Process multiple documents
documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = asyncio.run(batch_parse_documents(documents))

for path, result in zip(documents, results):
    if isinstance(result, Exception):
        print(f"Error processing {path}: {result}")
    else:
        print(f"Parsed {path}: {len(result.get('text', ''))} characters")
```

--------------------------------

### Synchronous PDF/Docx Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiate a document parsing request using the requests library.

```python
import requests
```

--------------------------------

### Perform Synchronous Webpage Crawling in Python

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiate a crawl job synchronously. The response will contain a job_id for fetching results later.
```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/url-crawl"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}
```

--------------------------------

### Async Table Extraction API Call

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet shows how to perform asynchronous table extraction. It utilizes httpx and aiofiles. The API key must be configured as an environment variable. This method is suitable for large documents where immediate results are not required.

```python
import httpx, aiofiles
import os
import asyncio

# API URL
api_url = "https://api.parseextract.com/v1/table-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx file path
file_path = 'your-file-path'
extraction_option = "option_b"
page_no = None # use None if you want all pages

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def extract_table_async(api_url, file_path, extraction_option=extraction_option, page_no=page_no):
    async with aiofiles.open(file_path, 'rb') as file:
        file_content = await file.read()
    files = {'file': (file_path, file_content)}
    data={"extraction_option":extraction_option, "page_no":page_no}
    timeout = httpx.Timeout(10, read=60)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, files=files, data=data, headers=headers, timeout=timeout)
    return response

# Where an event loop is already running (e.g. a notebook), you can await directly:
# response = await extract_table_async(api_url, file_path)

# In a script, use asyncio
async def get_response_async():
    response = await extract_table_async(api_url, file_path)
    print(response.json())
    if response.json().get('job_id','')!='':
        print(f"Extraction in process. {response.json().get('tables','')}")

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Synchronous Image Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet shows how to perform synchronous parsing of image files. It requires the 'requests' library and configures connection and read timeouts.

```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/image-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The image file path or url
image_file_path = 'your-image-file-path or url'

# Other Form Data Parameters (refer table below for all available parameters)
image_option = 'option_b'

# Timeouts
timeout = (10, 60) # 10 seconds connect, 60 seconds read

# Payload
payload = {"image_option":image_option} # add all other parameters

# POST Request (Sync)
# We send the file and all parameters as multi-form data
with open(image_file_path, 'rb') as f:
    files = {'file': (image_file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

parsed_text = response.json().get('text','')
print(parsed_text)
```

--------------------------------

### Perform Synchronous Webpage Scraping in Python

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet to scrape a single URL synchronously. Ensure the PARSEEXTRACT_API_KEY environment variable is set before execution.
```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/url-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The URL to scrape (fill in before running)
url = ""

# Other Parameters (refer table below for all available parameters)
wait = 1.5

# Timeouts
timeout = (10, 60) # 10 seconds connect, 60 seconds read

# Payload
payload = {"url":url, "wait":wait} # add all other parameters

# POST Request (Sync)
response = requests.post(api_url, json=payload, headers=headers, timeout=timeout)

scraped_text = response.json().get('text','')
print(scraped_text)
```

--------------------------------

### Structured Data Extraction (Sync)

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This endpoint performs structured data extraction from documents (PDF, DOCX, image). It supports sync API calls and requires an API key for authorization. You can specify extraction prompts and provide either a URL or a file.

```APIDOC
## POST /v1/data-extract (Sync)

### Description
Extracts structured data from uploaded documents (PDF, DOCX, image) or from a given URL. Supports synchronous API calls.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/data-extract

### Headers
- **Authorization** (string) - Required - Bearer token for authentication. Example: Bearer YOUR_API_KEY

### Form Data Parameters
- **prompt** (string) - Required - Your extraction prompt, optionally including a schema for the output format.
- **url** (string) - Optional - The URL from which to extract data. If provided, the 'file' parameter should be omitted.
- **file** (file) - Optional - The document file (PDF, DOCX, image) to extract data from. If provided, the 'url' parameter should be omitted.
- **extraction_option** (string) - Optional - Either 'option_a' or 'option_b'. Defaults to 'option_b'.
- **page_no** (integer) - Optional - Extracts data from a specific page number (starts from 1). Defaults to None (extracts all pages).

### Request Example (File Upload)
```python
import requests
import os

api_url = "https://api.parseextract.com/v1/data-extract"
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

file_path = 'your-document.pdf'
prompt = 'Extract all names and email addresses in a JSON format.'
payload = {"prompt": prompt}
timeout = (10, 120)

with open(file_path, 'rb') as f:
    files = {'file': (file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

print(response.json())
```

### Request Example (URL Input)
```python
import requests
import os

api_url = "https://api.parseextract.com/v1/data-extract"
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

url = 'http://example.com/document.pdf'
prompt = 'Extract the main title and author.'
payload = {"prompt": prompt, "url": url}
timeout = (10, 120)

response = requests.post(api_url, data=payload, headers=headers, timeout=timeout)

print(response.json())
```

### Response
#### Success Response (200)
- **extracted_data** (object/array) - The structured data extracted based on the prompt.

### Note
Currently, multi-page PDF/DOCX files are only processed for the first page. Contact support for multi-page support.
```

--------------------------------

### Async API Call for Structured Data Extraction

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Performs an asynchronous POST request to extract structured data using httpx and aiofiles. Handles file uploads or URL inputs with configurable prompts and timeouts. Requires PARSEEXTRACT_API_KEY.
```python
import httpx, aiofiles
import os

# API URL
api_url = "https://api.parseextract.com/v1/data-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx/image file path
file_path = 'your-file-path' # if you have a url then pass the url as a separate form data and ignore the files input

# Other Form Data Parameters (refer table below for all available parameters)
prompt = 'your-extraction-prompt-with-an-optional-schema'
url = 'the-url-from-which-data-is-to-be-extracted'

# Payload
payload = {"prompt":prompt} # add url to the payload if your input is a url

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def extract_data_async(api_url, file_path, payload):
    async with aiofiles.open(file_path, 'rb') as file:
        file_content = await file.read()
    files = {'file': (file_path, file_content)}
    timeout = httpx.Timeout(10, read=120)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
    return response

# response = await extract_data_async(api_url, file_path, payload)

# or use asyncio
import asyncio

async def get_response_async():
    response = await extract_data_async(api_url, file_path, payload)
    print(response.json())

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Crawl Website Pages Asynchronously with Python

Source: https://context7.com/ai92-github/parseextract/llms.txt

Initiates a crawl job for multiple pages and polls the status endpoint to retrieve results once completed. Supports URL filtering via regex patterns.
```python
import requests
import os
import time

api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# Start crawl job
crawl_url = "https://api.parseextract.com/v1/url-crawl"
payload = {
    "url": "https://docs.example.com",
    "wait": 1.5,
    "include_parent_url": True,             # Crawl subdirectories
    "include_path": [".*\\/docs\\/.*"],     # Only include /docs/ paths
    "exclude_path": [".*\\/archive\\/.*"],  # Skip archive pages
    "max_depth": 2,                         # Crawl depth
    "crawl_limit": 100,                     # Max pages to scrape
    "keep_images": False,                   # Exclude images for token savings
    "keep_webpage_links": True
}

response = requests.post(crawl_url, json=payload, headers=headers, timeout=(10, 10))
job_id = response.json().get('job_id')
print(f"Crawl job started: {job_id}")

# Poll for results
fetch_url = f"https://api.parseextract.com/v1/fetchcrawloutput?job_id={job_id}"
while True:
    time.sleep(10)
    result = requests.get(fetch_url, headers=headers).json()
    if result.get('status') == 'completed':
        crawled_content = result.get('output', [])
        print(f"Crawled {len(crawled_content)} pages")
        break
```

--------------------------------

### Extracted Product Data (JSON)

Source: https://github.com/ai92-github/parseextract/blob/main/output_examples.md

This JSON output represents the structured data extracted from the IKEA product page based on the provided prompt. Each object contains details for a single product.
```json
{
  "product_name": "EKTORP 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/ektorp-3-seat-sofa-hakebo-dark-grey-s19508999/",
  "image_link": "https://www.ikea.com/ext/ingkadam/m/18cac20c564a2bb3/original/PE902096-crop001.JPG?f=s",
  "price": "Rs.33,990"
}
```

```json
{
  "product_name": "LANDSKRONA 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/landskrona-3-seat-sofa-gunnared-light-green-wood-s19270327/",
  "image_link": "https://www.ikea.com/ext/ingkadam/m/720cb948eae89d88/original/PE923432-crop001.jpg?f=s",
  "price": "Rs.69,990"
}
```

```json
{
  "product_name": "GAMMALBYN 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/gammalbyn-3-seat-sofa-kilanda-light-beige-10615526/",
  "image_link": "https://www.ikea.com/in/en/images/products/gammalbyn-3-seat-sofa-kilanda-light-beige__1449308_pe989846_s5.jpg?f=xxs",
  "price": "Rs.22,990"
}
```

```json
{
  "product_name": "EKTORP 3-seat sofa with chaise longue",
  "product_link": "https://www.ikea.com/in/en/p/ektorp-3-seat-sofa-with-chaise-longue-kilanda-light-beige-s19509041/",
  "image_link": "https://www.ikea.com/in/en/images/products/ektorp-3-seat-sofa-with-chaise-longue-kilanda-light-beige__1194849_pe902099_s5.jpg?f=xxs",
  "price": "Rs.37,990"
}
```

```json
{
  "product_name": "ORMARYD 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/ormaryd-3-seat-sofa-dark-blue-60477594/",
  "image_link": "https://www.ikea.com/in/en/images/products/ormaryd-3-seat-sofa-dark-blue__0919663_pe786703_s5.jpg?f=xxs",
  "price": "Rs.20,990"
}
```

```json
{
  "product_name": "GLOSTAD 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/glostad-3-seat-sofa-knisa-dark-grey-40595937/",
  "image_link": "https://www.ikea.com/in/en/images/products/glostad-3-seat-sofa-knisa-dark-grey__1234948_pe917261_s5.jpg?f=xxs",
  "price": "Rs.12,990"
}
```

```json
{
  "product_name": "GLOSTAD 2-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/glostad-2-seat-sofa-knisa-dark-grey-10489009/",
  "image_link": "https://www.ikea.com/in/en/images/products/glostad-2-seat-sofa-knisa-dark-grey__0950864_pe800736_s5.jpg?f=xxs",
  "price": "Rs.9,990"
}
```

```json
{
  "product_name": "LINANÄS 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/linanaes-3-seat-sofa-vissle-beige-90512237/",
  "image_link": "https://www.ikea.com/in/en/images/products/linanaes-3-seat-sofa-vissle-beige__1013894_pe829446_s5.jpg?f=xxs",
  "price": "Rs.24,990"
}
```

```json
{
  "product_name": "JÄTTEBO U-shaped sofa",
  "product_link": "https://www.ikea.com/in/en/p/jaettebo-u-shaped-sofa-7-seat-with-chaise-longue-right-with-headrests-tonerud-grey-s39510618/",
  "image_link": "https://www.ikea.com/in/en/images/products/jaettebo-u-shaped-sofa-7-seat-with-chaise-longue-right-with-headrests-tonerud-grey__1179836_pe896109_s5.jpg?f=xxs",
  "price": "Rs.2,93,500"
}
```

```json
{
  "product_name": "EKTORP 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/ektorp-3-seat-sofa-kilanda-light-beige-s49509011/",
  "image_link": "https://www.ikea.com/in/en/images/products/ektorp-3-seat-sofa-kilanda-light-beige__1194853_pe902103_s5.jpg?f=xxs",
  "price": "Rs.29,990"
}
```

```json
{
  "product_name": "LANDSKRONA 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/landskrona-3-seat-sofa-gunnared-light-green-wood-s19270327/",
  "image_link": "https://www.ikea.com/in/en/images/products/landskrona-3-seat-sofa-gunnared-light-green-wood__0602122_pe680191_s5.jpg?f=xxs",
  "price": "Rs.69,990"
}
```

```json
{
  "product_name": "SÖDERHAMN Corner sofa",
  "product_link": "https://www.ikea.com/in/en/p/soederhamn-corner-sofa-6-seat-tonerud-grey-s89452079/",
  "image_link": "https://www.ikea.com/in/en/images/products/soederhamn-corner-sofa-6-seat-tonerud-grey__1057827_pe849007_s5.jpg?f=xxs",
  "price": "Rs.1,19,080"
}
```

```json
{
  "product_name": "LINANÄS 3-seat sofa with chaise longue",
  "product_link": "https://www.ikea.com/in/en/p/linanaes-3-seat-sofa-with-chaise-longue-vissle-dark-grey-60512248/",
  "image_link": "https://www.ikea.com/in/en/images/products/linanaes-3-seat-sofa-with-chaise-longue-vissle-dark-grey__1013908_pe829460_s5.jpg?f=xxs",
  "price": "Rs.35,990"
}
```

```json
{
  "product_name": "SÖDERHAMN 3-seat sofa",
  "product_link": "https://www.ikea.com/in/en/p/soederhamn-3-seat-sofa-gransel-natural-colour-s39442158/",
  "image_link": "https://www.ikea.com/images/a-soederhamn-sofa-in-a-living-room-with-different-cushions-l-9ea3ab1b5308e61f33cab2df355f59f9.jpg?f=s",
  "price": "Rs.52,990"
}
```

```json
{
  "product_name": "SÖDERHAMN 3-seat sofa",
  "product_link": null,
  "image_link": null,
  "price": null
}
```

--------------------------------

### POST /v1/url-parse

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Scrapes a single URL and returns the extracted text content.

```APIDOC
## POST /v1/url-parse

### Description
Scrapes a single URL and returns the extracted text content. Supports both synchronous and asynchronous requests.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/url-parse

### Parameters
#### Request Body
- **url** (string) - Required - The URL to scrape.
- **wait** (float) - Optional - Time to wait for the page to load in seconds. Default: 1.5.
- **keep_images** (boolean) - Optional - Keep image links. Default: True.
- **remove_svg_image** (boolean) - Optional - Remove .svg images. Default: True.
- **remove_gif_image** (boolean) - Optional - Remove .gif images. Default: True.
- **remove_image_types** (list) - Optional - List of image extensions to remove. Default: [].
- **keep_webpage_links** (boolean) - Optional - Keep webpage links. Default: True.
- **remove_script_tag** (boolean) - Optional - Remove script tags. Default: True.
- **remove_style_tag** (boolean) - Optional - Remove style tags. Default: True.
- **remove_tags** (list) - Optional - List of tags to remove. Default: [].
### Request Example
{
  "url": "https://example.com",
  "wait": 1.5
}

### Response
#### Success Response (200)
- **text** (string) - The extracted text content from the webpage.

#### Response Example
{
  "text": "Extracted content..."
}
```

--------------------------------

### Sync Table Extraction API Call

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet demonstrates a synchronous API call for table extraction. It requires the 'requests' library. Ensure the API key is set as an environment variable. The 'page_no' parameter can be set to None to process all pages.

```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/table-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx/image file path
file_path = 'your-file-path'
extraction_option = "option_b"
page_no = None # use None if you want all pages

# Timeouts
timeout = (10, 60) # 10 seconds connect, 60 seconds read

# POST Request (Sync)
# We send the file as multi-form data
with open(file_path, 'rb') as f:
    files = {'file': (file_path, f)}
    data={"extraction_option":extraction_option, "page_no":page_no}
    response = requests.post(api_url, files=files, data=data, headers=headers, timeout=timeout)

print(response.json())
if response.json().get('job_id','')!='':
    print(f"Extraction in process. {response.json().get('tables','')}")
```

--------------------------------

### Structured Data Extraction (Async)

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This endpoint performs structured data extraction from documents (PDF, DOCX, image) using asynchronous API calls. It requires an API key for authorization and allows for non-blocking requests.

```APIDOC
## POST /v1/data-extract (Async)

### Description
Extracts structured data from uploaded documents (PDF, DOCX, image) or from a given URL using asynchronous API calls.
### Method
POST

### Endpoint
https://api.parseextract.com/v1/data-extract

### Headers
- **Authorization** (string) - Required - Bearer token for authentication. Example: Bearer YOUR_API_KEY

### Form Data Parameters
- **prompt** (string) - Required - Your extraction prompt, optionally including a schema for the output format.
- **url** (string) - Optional - The URL from which to extract data. If provided, the 'file' parameter should be omitted.
- **file** (file) - Optional - The document file (PDF, DOCX, image) to extract data from. If provided, the 'url' parameter should be omitted.
- **extraction_option** (string) - Optional - Either 'option_a' or 'option_b'. Defaults to 'option_b'.
- **page_no** (integer) - Optional - Extracts data from a specific page number (starts from 1). Defaults to None (extracts all pages).

### Request Example (Async File Upload)
```python
import httpx, aiofiles, asyncio
import os

async def extract_data_async(api_url, file_path, payload, headers):
    async with aiofiles.open(file_path, 'rb') as file:
        file_content = await file.read()
    files = {'file': (file_path, file_content)}
    timeout = httpx.Timeout(10, read=120)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
    return response

async def main():
    api_url = "https://api.parseextract.com/v1/data-extract"
    api_key = os.environ["PARSEEXTRACT_API_KEY"]
    headers = {"Authorization":f"Bearer {api_key}"}
    file_path = 'your-document.pdf'
    prompt = 'Extract all names and email addresses in a JSON format.'
    payload = {"prompt": prompt}

    response = await extract_data_async(api_url, file_path, payload, headers)
    print(response.json())

asyncio.run(main())
```

### Response
#### Success Response (200)
- **extracted_data** (object/array) - The structured data extracted based on the prompt.

### Note
Currently, multi-page PDF/DOCX files are only processed for the first page. Contact support for multi-page support.
```

--------------------------------

### Sync API Call for Structured Data Extraction

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Performs a synchronous POST request to extract structured data from a file. Supports file uploads or URL inputs, with configurable prompts and timeouts. Requires PARSEEXTRACT_API_KEY.

```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/data-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx/image file path
file_path = 'your-file-path' # if you have a url then pass the url as a separate form data and ignore the files input

# Other Form Data Parameters (refer table below for all available parameters)
prompt = 'your-extraction-prompt-with-an-optional-schema'
url = 'the-url-from-which-data-is-to-be-extracted'

# Timeouts
timeout = (10, 120) # 10 seconds connect, 120 seconds read

# Payload
payload = {"prompt":prompt} # add url to the payload if your input is a url

# POST Request (Sync)
# We send the file and all parameters as multi-form data
with open(file_path, 'rb') as f:
    files = {'file': (file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

print(response.json())
```

--------------------------------

### POST /v1/url-crawl

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiates a crawling job for a URL and returns a job_id for result retrieval.

```APIDOC
## POST /v1/url-crawl

### Description
Initiates a crawling job for a specified URL. Returns a job_id which is used to fetch the crawling results later.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/url-crawl

### Response
#### Success Response (200)
- **job_id** (string) - The unique identifier for the crawling job.
```

--------------------------------

### Fetch asynchronous job results

Source: https://context7.com/ai92-github/parseextract/llms.txt

Use this pattern to retrieve results for large documents processed via job IDs.

```python
# Fragment: assumes `requests` is imported, `headers` carries the Authorization
# header, and `result` is the parsed JSON of an earlier API response.
if 'job_id' in result:
    job_id = result['job_id']
    fetch_url = f"https://api.parseextract.com/v1/fetchoutput?job_id={job_id}"
    final_result = requests.get(fetch_url, headers=headers).json()
    print(final_result.get('text', ''))
```

--------------------------------

### POST /v1/url-crawl

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiates a URL crawling job. For large crawls, this returns a job_id to be used with the fetch endpoint.

```APIDOC
## POST /v1/url-crawl

### Description
Initiates a crawl job for a specified URL with various configuration options for depth, filtering, and content removal.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/url-crawl

### Parameters
#### Request Body
- **url** (string) - Required - The URL to crawl.
- **wait** (float) - Optional - Time to wait for the page to load in seconds. Default: 1.5.
- **include_parent_url** (boolean) - Optional - Crawl all subdirectories. Default: True.
- **include_keyword** (list) - Optional - Keywords to include in crawling. Default: [].
- **exclude_keyword** (list) - Optional - Keywords to exclude. Default: [].
- **include_path** (list) - Optional - Regex patterns to include. Default: [].
- **exclude_path** (list) - Optional - Regex patterns to exclude. Default: [].
- **max_depth** (integer) - Optional - Depth of crawling. Default: 2.
- **crawl_limit** (integer) - Optional - Maximum number of URLs to scrape. Default: 1000.
- **keep_images** (boolean) - Optional - Keep image links. Default: True.
- **remove_svg_image** (boolean) - Optional - Remove .svg images. Default: True.
- **remove_gif_image** (boolean) - Optional - Remove .gif images. Default: True.
- **remove_image_types** (list) - Optional - List of image extensions to remove.
Default: [].
- **keep_webpage_links** (boolean) - Optional - Keep webpage links. Default: True.
- **remove_script_tag** (boolean) - Optional - Remove script tags. Default: True.
- **remove_style_tag** (boolean) - Optional - Remove style tags. Default: True.
- **remove_tags** (list) - Optional - List of tags to remove. Default: [].

### Request Example
{
  "url": "https://example.com",
  "wait": 1.5,
  "crawl_limit": 1000
}
```

--------------------------------

### Scrape Webpage Content with Python

Source: https://context7.com/ai92-github/parseextract/llms.txt

Uses the URL Scraping API to convert a webpage into clean, LLM-optimized text. Configure wait times and filtering options to manage token usage.

```python
import requests
import os

# API configuration
api_url = "https://api.parseextract.com/v1/url-parse"
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# Scrape a webpage
payload = {
    "url": "https://example.com/sample-page",
    "wait": 2.0,                 # Wait 2 seconds for JS to load
    "keep_images": True,         # Keep image links in output
    "remove_svg_image": True,    # Remove SVG images
    "remove_gif_image": True,    # Remove GIF images
    "keep_webpage_links": True,  # Keep hyperlinks
    "remove_script_tag": True,   # Remove script tags
    "remove_style_tag": True     # Remove style tags
}

response = requests.post(api_url, json=payload, headers=headers, timeout=(10, 60))
result = response.json()
scraped_text = result.get('text', '')
print(scraped_text)
# Output: Clean markdown-formatted text with preserved links and structure
```

--------------------------------

### Asynchronous URL Crawl Request

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Perform an asynchronous POST request using httpx and asyncio for non-blocking operations.

```python
import asyncio
import os

import httpx

# API URL
api_url = "https://api.parseextract.com/v1/url-crawl"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# The URL to crawl
url = ""

# Other Parameters (refer table below for all available parameters)
wait = 1.5
crawl_limit = 1000

# POST Request (Async)
async def crawl_url_async(api_url, url=url, wait=wait, crawl_limit=crawl_limit):
    # Payload is built from the arguments, so callers can override the defaults;
    # add all other parameters here
    payload = {"url": url, "wait": wait, "crawl_limit": crawl_limit}
    timeout = httpx.Timeout(10.0, read=60)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, json=payload, headers=headers, timeout=timeout)
    return response

# response = await crawl_url_async(api_url, url, wait, crawl_limit)
# or use asyncio
async def get_response_async():
    response = await crawl_url_async(api_url, url, wait, crawl_limit)
    print(response.json())

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### POST /v1/table-extract

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Extracts tables and tabular data from PDF, DOCX, or image files.

```APIDOC
## POST /v1/table-extract

### Description
Extracts tables from documents. Returns base64 encoded strings for Excel and CSV files. For multi-page documents, a job_id is returned for asynchronous processing.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/table-extract

### Parameters
#### Request Body
- **file** (binary) - Required - The document file to extract tables from.
- **extraction_option** (string) - Optional - Extraction configuration option.
- **page_no** (integer) - Optional - Specific page number to extract; defaults to all pages.
```

--------------------------------

### Perform Asynchronous Webpage Scraping in Python

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet to scrape a single URL asynchronously using httpx.
This is suitable for non-blocking operations.

```python
import asyncio
import os

import httpx

# API URL
api_url = "https://api.parseextract.com/v1/url-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# The URL to scrape
url = ""

# Other Parameters (refer table below for all available parameters)
wait = 1.5

# POST Request (Async)
async def parse_url_async(api_url, url=url, wait=wait):
    # Payload is built from the arguments, so callers can override the defaults;
    # add all other parameters here
    payload = {"url": url, "wait": wait}
    timeout = httpx.Timeout(10.0, read=60)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, json=payload, headers=headers, timeout=timeout)
    return response

# response = await parse_url_async(api_url, url, wait)
# or use asyncio
async def get_response_async():
    response = await parse_url_async(api_url, url, wait)
    print(response.json().get('text', ''))

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### POST /extract

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Endpoint to process a document or URL and extract data based on a provided prompt.

```APIDOC
## POST /extract

### Description
Extracts information from a document or URL based on a user-defined prompt.

### Method
POST

### Parameters
#### Request Body (Form Data)
- **files** (file) - Optional - The document file to be processed.
- **url** (string) - Optional - The URL of the document (use either file upload or URL).
- **prompt** (string) - Required - The extraction prompt describing what to extract, optionally including a JSON schema example.
- **json_out** (integer) - Optional - Set to 1 if you do not want JSON output format.
```
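
--------------------------------

### Validate Extraction Inputs Before Sending (Illustrative Sketch)

The extraction parameters above require a prompt and allow either a file upload or a URL, but not both. The following is a minimal client-side sketch of that rule, not part of the ParseExtract API: `build_extract_form` is a hypothetical helper name, and it only prepares the form fields without making any network call.

```python
from typing import Optional, Tuple


def build_extract_form(
    prompt: str,
    file_path: Optional[str] = None,
    url: Optional[str] = None,
) -> Tuple[dict, Optional[str]]:
    """Return (form_data, file_to_upload) for an extraction request.

    Hypothetical helper: enforces the "either file upload or URL" rule
    on the client before any request is sent.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    # Exactly one of file_path / url must be supplied
    if (file_path is None) == (url is None):
        raise ValueError("provide exactly one of file_path or url")

    data = {"prompt": prompt}
    if url is not None:
        data["url"] = url  # URL input travels as a form field, with no file part
    return data, file_path


# URL input: the second element is None, so no file needs to be opened
data, upload = build_extract_form(
    "Extract all names and email addresses.", url="https://example.com"
)
print(data, upload)
```

When `upload` is not None, the file can be opened in binary mode and passed through the `files` argument of `requests.post`, with `data` passed as the form fields, in the same way as the synchronous snippets in this document. The upload field name should be taken from the endpoint's parameter list.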