### Example Prompt for Data Extraction

Source: https://github.com/ai92-github/parseextract/blob/main/output_examples.md

This prompt is used to instruct the model to extract specific product details from a given URL.

```plaintext
extract the product name, product link, image link and price for all the products
```
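The endpoint reference in these docs says a prompt may optionally include a schema for the output format. The following is a minimal sketch of one way to append such a schema to the prompt above. The helper name `prompt_with_schema`, the wording of the added instruction, and the schema layout are all assumptions made for illustration; the service may expect a different format.

```python
import json

BASE_PROMPT = (
    "extract the product name, product link, image link and price "
    "for all the products"
)

# Hypothetical schema: the keys mirror the fields named in the prompt.
PRODUCT_SCHEMA = {
    "product_name": "string",
    "product_link": "string",
    "image_link": "string",
    "price": "string",
}

def prompt_with_schema(prompt: str, schema: dict) -> str:
    """Append a JSON description of the desired output shape to a prompt."""
    return (
        f"{prompt}\n"
        "Return a JSON array of objects with this shape:\n"
        f"{json.dumps(schema, indent=2)}"
    )

full_prompt = prompt_with_schema(BASE_PROMPT, PRODUCT_SCHEMA)
print(full_prompt)
```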

--------------------------------

### Asynchronous PDF/DOCX Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet demonstrates asynchronous PDF/DOCX parsing using 'httpx' and 'aiofiles'. It's suitable for non-blocking operations. Ensure 'httpx' and 'aiofiles' are installed.

```python
import httpx, aiofiles
import os
import asyncio

# API URL
api_url = "https://api.parseextract.com/v1/pdf-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx file path or url
pdf_file_path = 'your-pdf-or-docx-file-path.pdf or url'

# Other Form Data Parameters (refer table below for all available parameters)
pdf_option = 'option_b'
inline_images = False
get_base64_images = True

# Payload
payload = {"pdf_option":pdf_option, "inline_images":inline_images, "get_base64_images":get_base64_images}  # add all other parameters

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def parse_pdf_async(api_url, pdf_file_path, pdf_option, inline_images, get_base64_images):
    async with aiofiles.open(pdf_file_path, 'rb') as file:
        file_content = await file.read()
        files = {'file': (pdf_file_path, file_content)}
        timeout = httpx.Timeout(10, read=300)
        async with httpx.AsyncClient() as client:
            response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
            return response

async def get_response_async():
    response = await parse_pdf_async(api_url, pdf_file_path, pdf_option, inline_images, get_base64_images)
    print(response.json().get('text',''))
    print(response.json().get('images',''))
    print(response.json().get('job_id',''))

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Async Image API Call

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet to perform asynchronous image parsing. Ensure you have httpx and aiofiles installed. The API key should be set as an environment variable.

```python
import os
import httpx, aiofiles

# API URL
api_url = "https://api.parseextract.com/v1/image-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The image file path or url
image_file_path = 'your-image-file-path or url'

# Other Form Data Parameters (refer table below for all available parameters)
image_option = 'option_b'

# Payload
payload = {"image_option":image_option}  # add all other parameters

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def parse_image_async(api_url, image_file_path, image_option):
    async with aiofiles.open(image_file_path, 'rb') as file:
        file_content = await file.read()
        files = {'file': (image_file_path, file_content)}
        timeout = httpx.Timeout(10, read=60)
        async with httpx.AsyncClient() as client:
            response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
            return response

# response = await parse_image_async(api_url, image_file_path, image_option)
# or use asyncio
import asyncio
async def get_response_async():
    response = await parse_image_async(api_url, image_file_path, image_option)
    print(response.json().get('text',''))

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Synchronous PDF/DOCX Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet for synchronous PDF or DOCX file parsing. Ensure you have the 'requests' library installed. The timeout parameter configures connection and read timeouts.

```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/pdf-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx file path or url
pdf_file_path = 'your-pdf-or-docx-file-path.pdf or url'

# Other Form Data Parameters (refer table below for all available parameters)
pdf_option = 'option_b'
inline_images = False
get_base64_images = True

# Timeouts
timeout = (10, 300) # 10 seconds connect, 300 seconds read

# Payload
payload = {"pdf_option":pdf_option, "inline_images":inline_images, "get_base64_images":get_base64_images}  # add all other parameters

# POST Request (Sync)
# We send the file and all parameters as multi-form data
with open(pdf_file_path, 'rb') as f:
    files = {'file': (pdf_file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

parsed_text = response.json().get('text','')
extracted_images = response.json().get('images','')
job_id = response.json().get('job_id','')
print(parsed_text)
print(extracted_images)
print(job_id)
```

--------------------------------

### GET /v1/fetchcrawloutput

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Retrieves the results of a previously initiated crawl or parse job using the job_id.

```APIDOC
## GET /v1/fetchcrawloutput

### Description
Fetches the output of a crawl or parse job using the job_id returned from the initial request.

### Method
GET

### Endpoint
https://api.parseextract.com/v1/fetchcrawloutput

### Parameters
#### Query Parameters
- **job_id** (string) - Required - The unique identifier for the job.

### Response
#### Success Response (200)
- **output** (string) - The parsed content or crawl results.
```
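As a small sketch of the query-parameter handling described above, the request URL can be built with the standard library so that the `job_id` value is URL-encoded. The helper name `build_fetch_url` is hypothetical, and no request is sent here.

```python
from urllib.parse import urlencode

FETCH_CRAWL_OUTPUT_URL = "https://api.parseextract.com/v1/fetchcrawloutput"

def build_fetch_url(job_id: str) -> str:
    """Return the fetch URL with job_id encoded as a query parameter."""
    if not job_id:
        raise ValueError("job_id is required")
    return f"{FETCH_CRAWL_OUTPUT_URL}?{urlencode({'job_id': job_id})}"

# The job id below is a placeholder, not a real identifier.
print(build_fetch_url("example-job-id"))
```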

--------------------------------

### Save Tables as Excel/CSV

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Parses the API response to get base64 encoded file data and saves them as Excel or CSV files. Allows configuration to download only specific file types.

```python
import json, base64

api_response = response.json()

# get excel and csv files to download from the response
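try:
  # 'file_to_download' is expected to be a JSON string; default to an empty JSON list
  file_to_download = json.loads(api_response.get('file_to_download', '[]'))
except (TypeError, ValueError):
  file_to_download = []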

# You can set download_excel or download_csv = False if you do not need any one of them
download_excel = True
download_csv = True

# saving excel / csv files
if file_to_download!=[]:
  for table_data in file_to_download:
    output_filename =  table_data['id']
    if not download_excel and table_data['id'].endswith('.xlsx'):
      continue
    if not download_csv and table_data['id'].endswith('.csv'):
      continue
    decoded_bytes = base64.b64decode(table_data['base64_string'])
    with open(output_filename, "wb") as f:
        f.write(decoded_bytes)
    print(f"{output_filename} created successfully from Base64 data.")
else:
  print('No files to download')
```
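The same saving logic can be sketched as a reusable function that can be exercised without calling the API. The `id` and `base64_string` keys come from the snippet above; the function name `save_tables`, the `allowed_ext` parameter, the sample entries, and the choice to strip directory components from `id` are assumptions for illustration.

```python
import base64
import os
import tempfile

def save_tables(files, out_dir, allowed_ext=(".xlsx", ".csv")):
    """Decode base64 table files into out_dir and return the written paths."""
    saved = []
    for entry in files:
        # Keep only the file name so an unexpected id cannot escape out_dir.
        name = os.path.basename(entry["id"])
        if not name.lower().endswith(tuple(allowed_ext)):
            continue
        path = os.path.join(out_dir, name)
        with open(path, "wb") as f:
            f.write(base64.b64decode(entry["base64_string"]))
        saved.append(path)
    return saved

# Made-up sample entries standing in for a decoded 'file_to_download' list.
sample = [
    {"id": "table_1.csv", "base64_string": base64.b64encode(b"a,b\n1,2\n").decode()},
    {"id": "table_1.xlsx", "base64_string": base64.b64encode(b"placeholder bytes").decode()},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_only = save_tables(sample, tmp, allowed_ext=(".csv",))
    print([os.path.basename(p) for p in csv_only])
```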

--------------------------------

### Fetch Results using Job ID

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Retrieve parsing results for large documents (over 5 pages) using the job ID obtained from the initial PDF/DOCX parse request. This uses a simple GET request.

```python
import requests
import os

# Job ID
job_id = 'the-job-id-from-the-pdf-parse-endpoint'

# API URL
# Add the job id as the query string parameter
api_url = f"https://api.parseextract.com/v1/fetchoutput?job_id={job_id}"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# GET Request
response = requests.get(api_url, headers=headers)
print(response.json().get('text',''))
print(response.json().get('images',''))
```

--------------------------------

### Parse PDF and DOCX Documents with Python

Source: https://context7.com/ai92-github/parseextract/llms.txt

Uploads a document for parsing into structured text. Small documents return results immediately, while larger ones require asynchronous job handling.

```python
import requests
import os

api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}
api_url = "https://api.parseextract.com/v1/pdf-parse"

# Parse PDF with inline images
pdf_path = "document.pdf"
payload = {
    "pdf_option": "option_b",       # Use option_b for better accuracy
    "inline_images": True,          # Insert [Image_X_Y] placeholders
    "get_base64_images": True       # Return images as base64
}

with open(pdf_path, 'rb') as f:
    files = {'file': (pdf_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=(10, 300))

result = response.json()

# For small documents (<=5 pages), get immediate results
if 'text' in result:
    parsed_text = result.get('text', '')
    images = result.get('images', [])
    print(parsed_text)
    # Output includes [Image_1_1], [Image_1_2] placeholders inline with text
```
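The split between immediate results and job-based results can be sketched as a small dispatch helper. It assumes, based on the snippets in these docs, that an immediate response carries a `text` key and a deferred one carries a `job_id` key; the helper name and the `"done"`/`"pending"`/`"unknown"` labels are hypothetical.

```python
def classify_parse_response(result: dict) -> tuple:
    """Return (state, value) for a parsed JSON response body."""
    if result.get("text"):
        # Small document: the parsed text is already in the response.
        return ("done", result["text"])
    if result.get("job_id"):
        # Larger document: keep the job id and fetch the output later.
        return ("pending", result["job_id"])
    return ("unknown", "")

print(classify_parse_response({"text": "Page 1 content"}))
print(classify_parse_response({"job_id": "example-job-id"}))
```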

--------------------------------

### Asynchronous PDF Parsing with httpx

Source: https://context7.com/ai92-github/parseextract/llms.txt

Uses httpx and aiofiles to perform non-blocking document uploads and concurrent processing of multiple files.

```python
import httpx
import aiofiles
import asyncio
import os

api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

async def parse_pdf_async(pdf_path: str) -> dict:
    """Asynchronously parse a PDF document."""
    api_url = "https://api.parseextract.com/v1/pdf-parse"
    payload = {"pdf_option": "option_b", "inline_images": True, "get_base64_images": True}

    async with aiofiles.open(pdf_path, 'rb') as file:
        file_content = await file.read()
        files = {'file': (pdf_path, file_content)}
        timeout = httpx.Timeout(10, read=300)

        async with httpx.AsyncClient() as client:
            response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
            return response.json()

async def batch_parse_documents(pdf_paths: list) -> list:
    """Parse multiple PDFs concurrently."""
    tasks = [parse_pdf_async(path) for path in pdf_paths]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Process multiple documents
documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = asyncio.run(batch_parse_documents(documents))
for path, result in zip(documents, results):
    if isinstance(result, Exception):
        print(f"Error processing {path}: {result}")
    else:
        print(f"Parsed {path}: {len(result.get('text', ''))} characters")
```
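When a batch is large, it may be useful to cap how many uploads run at once. The sketch below shows one way to do that with `asyncio.Semaphore`; it uses a stand-in coroutine (`fake_parse`) instead of the real API call, and the helper name `gather_limited` and the limit value are assumptions, not documented service limits.

```python
import asyncio

async def gather_limited(coros, limit: int = 3):
    """Await the coroutines with at most `limit` running at any one time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros), return_exceptions=True)

# Stand-in for parse_pdf_async that records how many calls overlap.
stats = {"active": 0, "peak": 0}

async def fake_parse(path: str) -> dict:
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return {"text": f"parsed {path}"}

paths = ["doc1.pdf", "doc2.bad", "doc3.pdf", "doc4.pdf"]
results = asyncio.run(gather_limited([fake_parse(p) for p in paths], limit=2))
for path, result in zip(paths, results):
    print(path, "->", result)
```

Because `return_exceptions=True` is kept, a failure on one document is returned in place of its result rather than cancelling the rest of the batch.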

--------------------------------

### Synchronous PDF/DOCX Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiate a document parsing request using the requests library.

```python
import requests
```

--------------------------------

### Perform Synchronous Webpage Crawling in Python

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiate a crawl job synchronously. The response will contain a job_id for fetching results later.

```python
import os
import requests

# API URL
api_url = "https://api.parseextract.com/v1/url-crawl"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}
```

--------------------------------

### Async Table Extraction API Call

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet shows how to perform asynchronous table extraction. It utilizes httpx and aiofiles. The API key must be configured as an environment variable. This method is suitable for large documents where immediate results are not required.

```python
import os
import httpx, aiofiles

# API URL
api_url = "https://api.parseextract.com/v1/table-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx file path
file_path = 'your-file-path'
extraction_option = "option_b"
page_no = None  # use None if you want all pages

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def extract_table_async(api_url, file_path, extraction_option=extraction_option, page_no=page_no):
    async with aiofiles.open(file_path, 'rb') as file:
        file_content = await file.read()
        files = {'file': (file_path, file_content)}
        data={"extraction_option":extraction_option, "page_no":page_no}
        timeout = httpx.Timeout(10, read=60)
        async with httpx.AsyncClient() as client:
            response = await client.post(api_url, files=files, data=data, headers=headers, timeout=timeout)
            return response

# In a notebook (where top-level await is allowed) you can call the coroutine directly:
# response = await extract_table_async(api_url, file_path)

# In a regular script, run it with asyncio
import asyncio
async def get_response_async():
    response = await extract_table_async(api_url, file_path)
    print(response.json())
    if response.json().get('job_id','')!='':
        print(f"Extraction in process. {response.json().get('tables','')}")

# Run the async function
asyncio.run(get_response_async())
```
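How a `None` form value is serialized can depend on the HTTP client, so one cautious option is to drop unset parameters such as `page_no` before sending the request. This is a minimal sketch; the helper name `drop_unset` is hypothetical, and whether the service treats a missing `page_no` as "all pages" is assumed from the comment in the snippet above.

```python
def drop_unset(form: dict) -> dict:
    """Return a copy of the form data without keys whose value is None."""
    return {key: value for key, value in form.items() if value is not None}

data = drop_unset({"extraction_option": "option_b", "page_no": None})
print(data)
```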

--------------------------------

### Synchronous Image Parsing

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet shows how to perform synchronous parsing of image files. It requires the 'requests' library and configures connection and read timeouts.

```python
import requests
import os

# API URL
api_url = "https://api.parseextract.com/v1/image-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The image file path or url
image_file_path = 'your-image-file-path or url'

# Other Form Data Parameters (refer table below for all available parameters)
image_option = 'option_b'

# Timeouts
timeout = (10, 60) # 10 seconds connect, 60 seconds read

# Payload
payload = {"image_option":image_option}  # add all other parameters

# POST Request (Sync)
# We send the file and all parameters as multi-form data
with open(image_file_path, 'rb') as f:
    files = {'file': (image_file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

parsed_text = response.json().get('text','')
print(parsed_text)
```

--------------------------------

### Perform Synchronous Webpage Scraping in Python

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet to scrape a single URL synchronously. Ensure the PARSEEXTRACT_API_KEY environment variable is set before execution.

```python
import os
import requests

# API URL
api_url = "https://api.parseextract.com/v1/url-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The URL to scrape
url = ""

# Other Parameters (refer table below for all available parameters)
wait = 1.5

# Timeouts
timeout = (10, 60) # 10 seconds connect, 60 seconds read

# Payload
payload = {"url":url, "wait":wait}  # add all other parameters

# POST Request (Sync)
response = requests.post(api_url, json=payload, headers=headers, timeout=timeout)

scraped_text = response.json().get('text','')
print(scraped_text)
```

--------------------------------

### Structured Data Extraction (Sync)

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This endpoint performs structured data extraction from documents (PDF, DOCX, image). It supports sync API calls and requires an API key for authorization. You can specify extraction prompts and optionally provide a URL or file.

````APIDOC
## POST /v1/data-extract (Sync)

### Description
Extracts structured data from uploaded documents (PDF, DOCX, image) or from a given URL. Supports synchronous API calls.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/data-extract

### Headers
- **Authorization** (string) - Required - Bearer token for authentication. Example: Bearer YOUR_API_KEY

### Form Data Parameters
- **prompt** (string) - Required - Your extraction prompt, optionally including a schema for the output format.
- **url** (string) - Optional - The URL from which to extract data. If provided, the 'file' parameter should be omitted.
- **file** (file) - Optional - The document file (PDF, DOCX, image) to extract data from. If provided, the 'url' parameter should be omitted.
- **extraction_option** (string) - Optional - Either 'option_a' or 'option_b'. Defaults to 'option_b'.
- **page_no** (integer) - Optional - Extracts data from a specific page number (starts from 1). Defaults to None (extracts all pages).

### Request Example (File Upload)
```python
import os
import requests

api_url = "https://api.parseextract.com/v1/data-extract"
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}
file_path = 'your-document.pdf'
prompt = 'Extract all names and email addresses in a JSON format.'
payload = {"prompt": prompt}
timeout = (10, 120)

with open(file_path, 'rb') as f:
    files = {'file': (file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
    print(response.json())
```

### Request Example (URL Input)
```python
import os
import requests

api_url = "https://api.parseextract.com/v1/data-extract"
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}
url = 'http://example.com/document.pdf'
prompt = 'Extract the main title and author.'
payload = {"prompt": prompt, "url": url}
timeout = (10, 120)

response = requests.post(api_url, data=payload, headers=headers, timeout=timeout)
print(response.json())
```

### Response
#### Success Response (200)
- **extracted_data** (object/array) - The structured data extracted based on the prompt.

### Note
Currently, multi-page PDF/DOCX files are only processed for the first page. Contact support for multi-page support.
````

--------------------------------

### Async API Call for Structured Data Extraction

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Performs an asynchronous POST request to extract structured data using httpx and aiofiles. Handles file uploads or URL inputs with configurable prompts and timeouts. Requires PARSEEXTRACT_API_KEY.

```python
import os
import httpx, aiofiles

# API URL
api_url = "https://api.parseextract.com/v1/data-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx/image file path
file_path = 'your-file-path'  # if you have a url then pass the url as a separate form data and ignore the files input

# Other Form Data Parameters (refer table below for all available parameters)
prompt = 'your-extraction-prompt-with-an-optional-schema'
url = 'the-url-from-which-data-is-to-be-extracted'

# Payload
payload = {"prompt":prompt}  # add url to the payload if your input is a URL

# POST Request (Async)
# We send the file and all parameters as multi-form data
async def extract_data_async(api_url, file_path, payload):
    async with aiofiles.open(file_path, 'rb') as file:
        file_content = await file.read()
        files = {'file': (file_path, file_content)}
        timeout = httpx.Timeout(10, read=120)
        async with httpx.AsyncClient() as client:
            response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
            return response

# response = await extract_data_async(api_url, file_path, payload)
# or use asyncio
import asyncio
async def get_response_async():
    response = await extract_data_async(api_url, file_path, payload)
    print(response.json())

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### Crawl Website Pages Asynchronously with Python

Source: https://context7.com/ai92-github/parseextract/llms.txt

Initiates a crawl job for multiple pages and polls the status endpoint to retrieve results once completed. Supports URL filtering via regex patterns.

```python
import requests
import os
import time

api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# Start crawl job
crawl_url = "https://api.parseextract.com/v1/url-crawl"
payload = {
    "url": "https://docs.example.com",
    "wait": 1.5,
    "include_parent_url": True,           # Crawl subdirectories
    "include_path": [".*\\/docs\\/.*"],     # Only include /docs/ paths
    "exclude_path": [".*\\/archive\\/.*"],  # Skip archive pages
    "max_depth": 2,                       # Crawl depth
    "crawl_limit": 100,                   # Max pages to scrape
    "keep_images": False,                 # Exclude images for token savings
    "keep_webpage_links": True
}

response = requests.post(crawl_url, json=payload, headers=headers, timeout=(10, 10))
job_id = response.json().get('job_id')
print(f"Crawl job started: {job_id}")

# Poll for results
fetch_url = f"https://api.parseextract.com/v1/fetchcrawloutput?job_id={job_id}"
while True:
    time.sleep(10)
    result = requests.get(fetch_url, headers=headers).json()
    if result.get('status') == 'completed':
        crawled_content = result.get('output', [])
        print(f"Crawled {len(crawled_content)} pages")
        break
```
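The loop above waits indefinitely if the job never reports completion. The following is a sketch of a bounded variant that gives up after a fixed number of attempts. The `status == 'completed'` check is taken from the snippet above; the helper name `poll_until_complete`, the injected `fetch` callable, the attempt limit, and the `"processing"` status used in the demo are assumptions for illustration.

```python
import time

def poll_until_complete(fetch, interval=10.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until it reports status 'completed' or attempts run out."""
    for _ in range(max_attempts):
        result = fetch()
        if result.get("status") == "completed":
            return result
        sleep(interval)
    raise TimeoutError(f"job not completed after {max_attempts} attempts")

# Demo with canned responses instead of real HTTP calls.
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "output": ["page 1", "page 2"]},
])
result = poll_until_complete(lambda: next(responses), interval=0, max_attempts=5)
print(f"Crawled {len(result['output'])} pages")
```

In real use, `fetch` could wrap the `requests.get(fetch_url, headers=headers).json()` call from the snippet above.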

--------------------------------

### Extracted Product Data (JSON)

Source: https://github.com/ai92-github/parseextract/blob/main/output_examples.md

This JSON output represents the structured data extracted from the IKEA product page based on the provided prompt. Each object contains details for a single product.

```json
{
    "product_name": "EKTORP 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/ektorp-3-seat-sofa-hakebo-dark-grey-s19508999/",
    "image_link": "https://www.ikea.com/ext/ingkadam/m/18cac20c564a2bb3/original/PE902096-crop001.JPG?f=s",
    "price": "Rs.33,990"
  }
```

```json
{
    "product_name": "LANDSKRONA 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/landskrona-3-seat-sofa-gunnared-light-green-wood-s19270327/",
    "image_link": "https://www.ikea.com/ext/ingkadam/m/720cb948eae89d88/original/PE923432-crop001.jpg?f=s",
    "price": "Rs.69,990"
  }
```

```json
{
    "product_name": "GAMMALBYN 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/gammalbyn-3-seat-sofa-kilanda-light-beige-10615526/",
    "image_link": "https://www.ikea.com/in/en/images/products/gammalbyn-3-seat-sofa-kilanda-light-beige__1449308_pe989846_s5.jpg?f=xxs",
    "price": "Rs.22,990"
  }
```

```json
{
    "product_name": "EKTORP 3-seat sofa with chaise longue",
    "product_link": "https://www.ikea.com/in/en/p/ektorp-3-seat-sofa-with-chaise-longue-kilanda-light-beige-s19509041/",
    "image_link": "https://www.ikea.com/in/en/images/products/ektorp-3-seat-sofa-with-chaise-longue-kilanda-light-beige__1194849_pe902099_s5.jpg?f=xxs",
    "price": "Rs.37,990"
  }
```

```json
{
    "product_name": "ORMARYD 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/ormaryd-3-seat-sofa-dark-blue-60477594/",
    "image_link": "https://www.ikea.com/in/en/images/products/ormaryd-3-seat-sofa-dark-blue__0919663_pe786703_s5.jpg?f=xxs",
    "price": "Rs.20,990"
  }
```

```json
{
    "product_name": "GLOSTAD 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/glostad-3-seat-sofa-knisa-dark-grey-40595937/",
    "image_link": "https://www.ikea.com/in/en/images/products/glostad-3-seat-sofa-knisa-dark-grey__1234948_pe917261_s5.jpg?f=xxs",
    "price": "Rs.12,990"
  }
```

```json
{
    "product_name": "GLOSTAD 2-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/glostad-2-seat-sofa-knisa-dark-grey-10489009/",
    "image_link": "https://www.ikea.com/in/en/images/products/glostad-2-seat-sofa-knisa-dark-grey__0950864_pe800736_s5.jpg?f=xxs",
    "price": "Rs.9,990"
  }
```

```json
{
    "product_name": "LINANÄS 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/linanaes-3-seat-sofa-vissle-beige-90512237/",
    "image_link": "https://www.ikea.com/in/en/images/products/linanaes-3-seat-sofa-vissle-beige__1013894_pe829446_s5.jpg?f=xxs",
    "price": "Rs.24,990"
  }
```

```json
{
    "product_name": "JÄTTEBO U-shaped sofa",
    "product_link": "https://www.ikea.com/in/en/p/jaettebo-u-shaped-sofa-7-seat-with-chaise-longue-right-with-headrests-tonerud-grey-s39510618/",
    "image_link": "https://www.ikea.com/in/en/images/products/jaettebo-u-shaped-sofa-7-seat-with-chaise-longue-right-with-headrests-tonerud-grey__1179836_pe896109_s5.jpg?f=xxs",
    "price": "Rs.2,93,500"
  }
```

```json
{
    "product_name": "EKTORP 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/ektorp-3-seat-sofa-kilanda-light-beige-s49509011/",
    "image_link": "https://www.ikea.com/in/en/images/products/ektorp-3-seat-sofa-kilanda-light-beige__1194853_pe902103_s5.jpg?f=xxs",
    "price": "Rs.29,990"
  }
```

```json
{
    "product_name": "LANDSKRONA 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/landskrona-3-seat-sofa-gunnared-light-green-wood-s19270327/",
    "image_link": "https://www.ikea.com/in/en/images/products/landskrona-3-seat-sofa-gunnared-light-green-wood__0602122_pe680191_s5.jpg?f=xxs",
    "price": "Rs.69,990"
  }
```

```json
{
    "product_name": "SÖDERHAMN Corner sofa",
    "product_link": "https://www.ikea.com/in/en/p/soederhamn-corner-sofa-6-seat-tonerud-grey-s89452079/",
    "image_link": "https://www.ikea.com/in/en/images/products/soederhamn-corner-sofa-6-seat-tonerud-grey__1057827_pe849007_s5.jpg?f=xxs",
    "price": "Rs.1,19,080"
  }
```

```json
{
    "product_name": "LINANÄS 3-seat sofa with chaise longue",
    "product_link": "https://www.ikea.com/in/en/p/linanaes-3-seat-sofa-with-chaise-longue-vissle-dark-grey-60512248/",
    "image_link": "https://www.ikea.com/in/en/images/products/linanaes-3-seat-sofa-with-chaise-longue-vissle-dark-grey__1013908_pe829460_s5.jpg?f=xxs",
    "price": "Rs.35,990"
  }
```

```json
{
    "product_name": "SÖDERHAMN 3-seat sofa",
    "product_link": "https://www.ikea.com/in/en/p/soederhamn-3-seat-sofa-gransel-natural-colour-s39442158/",
    "image_link": "https://www.ikea.com/images/a-soederhamn-sofa-in-a-living-room-with-different-cushions-l-9ea3ab1b5308e61f33cab2df355f59f9.jpg?f=s",
    "price": "Rs.52,990"
  }
```

```json
{
    "product_name": "SÖDERHAMN 3-seat sofa",
    "product_link": null,
    "image_link": null,
    "price": null
  }
```
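The output above contains a repeated product link and one entry whose fields are `null`, and prices are strings such as `Rs.2,93,500`. A possible post-processing step is sketched below. The function names and the `price_value` field are hypothetical, the sample list reuses a few of the records above with `image_link` omitted for brevity, and the price parser assumes whole-rupee amounts like those shown.

```python
import re

def parse_price(price):
    """Convert a string such as 'Rs.2,93,500' to an integer, or None."""
    if price is None:
        return None
    digits = re.sub(r"\D", "", price)
    return int(digits) if digits else None

def clean_products(products):
    """Drop entries without a link, remove repeated links, add a numeric price."""
    seen = set()
    cleaned = []
    for item in products:
        link = item.get("product_link")
        if link is None or link in seen:
            continue
        seen.add(link)
        cleaned.append({**item, "price_value": parse_price(item.get("price"))})
    return cleaned

landskrona = "https://www.ikea.com/in/en/p/landskrona-3-seat-sofa-gunnared-light-green-wood-s19270327/"
jaettebo = "https://www.ikea.com/in/en/p/jaettebo-u-shaped-sofa-7-seat-with-chaise-longue-right-with-headrests-tonerud-grey-s39510618/"
products = [
    {"product_name": "LANDSKRONA 3-seat sofa", "product_link": landskrona, "price": "Rs.69,990"},
    {"product_name": "JÄTTEBO U-shaped sofa", "product_link": jaettebo, "price": "Rs.2,93,500"},
    {"product_name": "LANDSKRONA 3-seat sofa", "product_link": landskrona, "price": "Rs.69,990"},
    {"product_name": "SÖDERHAMN 3-seat sofa", "product_link": None, "price": None},
]

cleaned = clean_products(products)
for item in cleaned:
    print(item["product_name"], item["price_value"])
```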

--------------------------------

### POST /v1/url-parse

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Scrapes a single URL and returns the extracted text content.

```APIDOC
## POST /v1/url-parse

### Description
Scrapes a single URL and returns the extracted text content. Supports both synchronous and asynchronous requests.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/url-parse

### Parameters
#### Request Body
- **url** (string) - Required - The URL to scrape.
- **wait** (float) - Optional - Time to wait for the page to load in seconds. Default: 1.5.
- **keep_images** (boolean) - Optional - Keep image links. Default: True.
- **remove_svg_image** (boolean) - Optional - Remove .svg images. Default: True.
- **remove_gif_image** (boolean) - Optional - Remove .gif images. Default: True.
- **remove_image_types** (list) - Optional - List of image extensions to remove. Default: [].
- **keep_webpage_links** (boolean) - Optional - Keep webpage links. Default: True.
- **remove_script_tag** (boolean) - Optional - Remove script tags. Default: True.
- **remove_style_tag** (boolean) - Optional - Remove style tags. Default: True.
- **remove_tags** (list) - Optional - List of tags to remove. Default: [].

### Request Example
{
  "url": "https://example.com",
  "wait": 1.5
}

### Response
#### Success Response (200)
- **text** (string) - The extracted text content from the webpage.

#### Response Example
{
  "text": "Extracted content..."
}
```
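The parameter list above can be turned into a small request-body builder that rejects misspelled option names before a request is made. This is a sketch: the parameter names and defaults are copied from the list above, while the helper name `build_url_parse_body` and the choice to send only explicitly overridden options are assumptions.

```python
# Documented optional parameters and their defaults, as listed above.
URL_PARSE_DEFAULTS = {
    "wait": 1.5,
    "keep_images": True,
    "remove_svg_image": True,
    "remove_gif_image": True,
    "remove_image_types": [],
    "keep_webpage_links": True,
    "remove_script_tag": True,
    "remove_style_tag": True,
    "remove_tags": [],
}

def build_url_parse_body(url: str, **overrides) -> dict:
    """Return a JSON body containing the URL plus any overridden options."""
    if not url:
        raise ValueError("url is required")
    unknown = set(overrides) - set(URL_PARSE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"url": url, **overrides}

body = build_url_parse_body("https://example.com", wait=3.0, keep_images=False)
print(body)
```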

--------------------------------

### Sync Table Extraction API Call

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This snippet demonstrates a synchronous API call for table extraction. It requires the 'requests' library. Ensure the API key is set as an environment variable. The 'page_no' parameter can be set to None to process all pages.

```python
import os
import requests

# API URL
api_url = "https://api.parseextract.com/v1/table-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx/image file path
file_path = 'your-file-path'
extraction_option = "option_b"
page_no = None  # use None if you want all pages

# Timeouts
timeout = (10, 60) # 10 seconds connect, 60 seconds read

# POST Request (Sync)
# We send the file as multi-form data
with open(file_path, 'rb') as f:
    files = {'file': (file_path, f)}
    data={"extraction_option":extraction_option, "page_no":page_no}
    response = requests.post(api_url, files=files, data=data, headers=headers, timeout=timeout)

print(response.json())
if response.json().get('job_id','')!='':
  print(f"Extraction in process. {response.json().get('tables','')}")
```

--------------------------------

### Structured Data Extraction (Async)

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

This endpoint performs structured data extraction from documents (PDF, DOCX, image) using asynchronous API calls. It requires an API key for authorization and allows for non-blocking requests.

````APIDOC
## POST /v1/data-extract (Async)

### Description
Extracts structured data from uploaded documents (PDF, DOCX, image) or from a given URL using asynchronous API calls.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/data-extract

### Headers
- **Authorization** (string) - Required - Bearer token for authentication. Example: Bearer YOUR_API_KEY

### Form Data Parameters
- **prompt** (string) - Required - Your extraction prompt, optionally including a schema for the output format.
- **url** (string) - Optional - The URL from which to extract data. If provided, the 'file' parameter should be omitted.
- **file** (file) - Optional - The document file (PDF, DOCX, image) to extract data from. If provided, the 'url' parameter should be omitted.
- **extraction_option** (string) - Optional - Either 'option_a' or 'option_b'. Defaults to 'option_b'.
- **page_no** (integer) - Optional - Extracts data from a specific page number (starts from 1). Defaults to None (extracts all pages).

### Request Example (Async File Upload)
```python
import os
import httpx, aiofiles, asyncio

async def extract_data_async(api_url, file_path, payload, headers):
    async with aiofiles.open(file_path, 'rb') as file:
        file_content = await file.read()
        files = {'file': (file_path, file_content)}
        timeout = httpx.Timeout(10, read=120)
        async with httpx.AsyncClient() as client:
            response = await client.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)
            return response

async def main():
    api_url = "https://api.parseextract.com/v1/data-extract"
    api_key = os.environ["PARSEEXTRACT_API_KEY"]
    headers = {"Authorization":f"Bearer {api_key}"}
    file_path = 'your-document.pdf'
    prompt = 'Extract all names and email addresses in a JSON format.'
    payload = {"prompt": prompt}

    response = await extract_data_async(api_url, file_path, payload, headers)
    print(response.json())

asyncio.run(main())
```

### Response
#### Success Response (200)
- **extracted_data** (object/array) - The structured data extracted based on the prompt.

### Note
Currently, multi-page PDF/DOCX files are only processed for the first page. Contact support for multi-page support.
````

--------------------------------

### Sync API Call for Structured Data Extraction

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Performs a synchronous POST request to extract structured data from a file or URL, with a configurable prompt and timeouts. Requires the PARSEEXTRACT_API_KEY environment variable.

```python
import os
import requests

# API URL
api_url = "https://api.parseextract.com/v1/data-extract"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The pdf/docx/image file path
file_path = 'your-file-path'  # if your input is a URL, pass it as the 'url' form field instead and omit the files argument

# Other Form Data Parameters (refer table below for all available parameters)
prompt = 'your-extraction-prompt-with-an-optional-schema'
url = 'the-url-from-which-data-is-to-be-extracted'

# Timeouts
timeout = (10, 120) # 10 seconds connect, 120 seconds read

# Payload
payload = {"prompt":prompt}  # add "url": url to the payload if your input is a URL

# POST Request (Sync)
# We send the file and all parameters as multipart form data
with open(file_path, 'rb') as f:
    files = {'file': (file_path, f)}
    response = requests.post(api_url, files=files, data=payload, headers=headers, timeout=timeout)

print(response.json())
```

--------------------------------

### POST /v1/url-crawl

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiates a crawling job for a URL and returns a job_id for result retrieval.

```APIDOC
## POST /v1/url-crawl

### Description
Initiates a crawling job for a specified URL. Returns a job_id which is used to fetch the crawling results later.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/url-crawl

### Response
#### Success Response (200)
- **job_id** (string) - The unique identifier for the crawling job.
```

--------------------------------

### Fetch asynchronous job results

Source: https://context7.com/ai92-github/parseextract/llms.txt

Use this pattern to retrieve results for large documents processed via job IDs.

```python
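# Assumes `result` is the parsed JSON from an earlier request and that
# `requests` and `headers` are set up as in the other examples.
if 'job_id' in result:
    job_id = result['job_id']
    fetch_url = f"https://api.parseextract.com/v1/fetchoutput?job_id={job_id}"
    final_result = requests.get(fetch_url, headers=headers, timeout=(10, 60)).json()
    print(final_result.get('text', ''))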
```
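
A single fetch may arrive before the job has finished, so in practice the call is often repeated. The following is a minimal polling sketch. `poll_job` and its `is_done` callback are hypothetical: the source does not describe what an unfinished job's response looks like, so the completion check is left to the caller.

```python
import time
from typing import Callable


def poll_job(fetch: Callable[[], dict],
             is_done: Callable[[dict], bool],
             interval: float = 5.0,
             max_attempts: int = 60) -> dict:
    """Call fetch() until is_done(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if is_done(result):
            return result
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(interval)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")


# Example wiring (assumes `requests`, `fetch_url` and `headers` from the snippet above,
# and assumes a finished job returns a non-empty 'text' field):
# fetch = lambda: requests.get(fetch_url, headers=headers, timeout=(10, 60)).json()
# final_result = poll_job(fetch, is_done=lambda r: bool(r.get("text")))
```

Passing the fetch and completion check as callables keeps the loop independent of the HTTP client and of any particular response shape.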

--------------------------------

### POST /v1/url-crawl

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Initiates a URL crawling job. For large crawls, this returns a job_id to be used with the fetch endpoint.

```APIDOC
## POST /v1/url-crawl

### Description
Initiates a crawl job for a specified URL with various configuration options for depth, filtering, and content removal.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/url-crawl

### Parameters
#### Request Body
- **url** (string) - Required - The url to crawl.
- **wait** (float) - Optional - Time to wait for the page to load in seconds. Default: 1.5.
- **include_parent_url** (boolean) - Optional - Crawl all sub directories. Default: True.
- **include_keyword** (list) - Optional - Keywords to include in crawling. Default: [].
- **exclude_keyword** (list) - Optional - Keywords to exclude. Default: [].
- **include_path** (list) - Optional - Regex patterns to include. Default: [].
- **exclude_path** (list) - Optional - Regex patterns to exclude. Default: [].
- **max_depth** (integer) - Optional - Depth of crawling. Default: 2.
- **crawl_limit** (integer) - Optional - Maximum number of urls to scrape. Default: 1000.
- **keep_images** (boolean) - Optional - Keep image links. Default: True.
- **remove_svg_image** (boolean) - Optional - Remove .svg images. Default: True.
- **remove_gif_image** (boolean) - Optional - Remove .gif images. Default: True.
- **remove_image_types** (list) - Optional - List of image extensions to remove. Default: [].
- **keep_webpage_links** (boolean) - Optional - Keep webpage links. Default: True.
- **remove_script_tag** (boolean) - Optional - Remove script tags. Default: True.
- **remove_style_tag** (boolean) - Optional - Remove style tags. Default: True.
- **remove_tags** (list) - Optional - List of tags to remove. Default: [].

### Request Example
{
  "url": "https://example.com",
  "wait": 1.5,
  "crawl_limit": 1000
}
```
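
With this many optional parameters, a misspelled key is easy to miss. The sketch below builds a crawl payload and rejects option names that are not in the list above. `build_crawl_payload` is a hypothetical helper; the regex check uses Python's `re` module, which is only an approximation, since the source does not state which regex dialect the service applies.

```python
import re

# Optional parameter names copied from the request body list above.
CRAWL_PARAMS = {
    "wait", "include_parent_url", "include_keyword", "exclude_keyword",
    "include_path", "exclude_path", "max_depth", "crawl_limit",
    "keep_images", "remove_svg_image", "remove_gif_image",
    "remove_image_types", "keep_webpage_links", "remove_script_tag",
    "remove_style_tag", "remove_tags",
}


def build_crawl_payload(url: str, **options) -> dict:
    """Return a JSON-ready crawl payload, rejecting unknown option names."""
    if not url:
        raise ValueError("url is required")
    unknown = set(options) - CRAWL_PARAMS
    if unknown:
        raise ValueError(f"unknown crawl options: {sorted(unknown)}")
    # include_path / exclude_path are documented as regex patterns,
    # so fail early on patterns that do not compile (raises re.error).
    for key in ("include_path", "exclude_path"):
        for pattern in options.get(key, []):
            re.compile(pattern)
    # Only explicitly passed options are sent; the rest keep their defaults.
    return {"url": url, **options}
```

For example, `build_crawl_payload("https://example.com", max_depth=1, include_path=[r"^/docs/.*"])` yields a dict that can be sent as the `json=` body of the POST request.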

--------------------------------

### Scrape Webpage Content with Python

Source: https://context7.com/ai92-github/parseextract/llms.txt

Uses the URL Scraping API to convert a webpage into clean, LLM-optimized text. Configure wait times and filtering options to manage token usage.

```python
import requests
import os

# API configuration
api_url = "https://api.parseextract.com/v1/url-parse"
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# Scrape a webpage
payload = {
    "url": "https://example.com/sample-page",
    "wait": 2.0,                    # Wait 2 seconds for JS to load
    "keep_images": True,            # Keep image links in output
    "remove_svg_image": True,       # Remove SVG images
    "remove_gif_image": True,       # Remove GIF images
    "keep_webpage_links": True,     # Keep hyperlinks
    "remove_script_tag": True,      # Remove script tags
    "remove_style_tag": True        # Remove style tags
}

response = requests.post(api_url, json=payload, headers=headers, timeout=(10, 60))
result = response.json()

scraped_text = result.get('text', '')
print(scraped_text)
# Output: Clean markdown-formatted text with preserved links and structure
```

--------------------------------

### Asynchronous URL Crawl Request

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Perform an asynchronous POST request using httpx and asyncio for non-blocking operations.

```python
import os

import httpx

# API URL
api_url = "https://api.parseextract.com/v1/url-crawl"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The URL to crawl
url = ""

# Other Parameters (refer table below for all available parameters)
wait = 1.5
crawl_limit = 1000

# POST Request (Async)
async def crawl_url_async(api_url, url=url, wait=wait, crawl_limit=crawl_limit):
    # Build the payload from the arguments so that values passed by the caller are actually sent
    payload = {"url":url, "wait":wait, "crawl_limit":crawl_limit}  # add all other parameters
    timeout = httpx.Timeout(10.0, read=60)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, json=payload, headers=headers, timeout=timeout)
        return response

# response = await crawl_url_async(api_url, url, wait, crawl_limit)
# or use asyncio
import asyncio
async def get_response_async():
    response = await crawl_url_async(api_url, url, wait, crawl_limit)
    print(response.json())

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### POST /v1/table-extract

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Extracts tables and tabular data from PDF, DOCX, or image files.

```APIDOC
## POST /v1/table-extract

### Description
Extracts tables from documents. Returns base64 encoded strings for Excel and CSV files. For multi-page documents, a job_id is returned for asynchronous processing.

### Method
POST

### Endpoint
https://api.parseextract.com/v1/table-extract

### Parameters
#### Request Body
- **file** (binary) - Required - The document file to extract tables from.
- **extraction_option** (string) - Optional - Extraction configuration option.
- **page_no** (integer) - Optional - Specific page number to extract; defaults to all pages.
```
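
Because the tables come back as base64-encoded strings, they need to be decoded before they can be opened as Excel or CSV files. A minimal decoding sketch follows. `save_base64_file` is a hypothetical helper, and the response field names that hold the encoded strings are not given in the source, so the encoded string is passed in directly.

```python
import base64
from pathlib import Path


def save_base64_file(encoded: str, out_path: str) -> int:
    """Decode a base64 string, write the bytes to out_path, return the byte count."""
    data = base64.b64decode(encoded)
    Path(out_path).write_bytes(data)
    return len(data)


# Example wiring (the "csv" key is an assumption, check the actual response):
# save_base64_file(result["csv"], "tables.csv")
```

Writing raw bytes rather than text avoids corrupting the Excel output, which is a binary format.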

--------------------------------

### Perform Asynchronous Webpage Scraping in Python

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Use this snippet to scrape a single URL asynchronously using httpx. This is suitable for non-blocking operations.

```python
import os

import httpx

# API URL
api_url = "https://api.parseextract.com/v1/url-parse"

# Authorization
api_key = os.environ["PARSEEXTRACT_API_KEY"]
headers = {"Authorization":f"Bearer {api_key}"}

# The URL to scrape
url = ""

# Other Parameters (refer table below for all available parameters)
wait = 1.5

# POST Request (Async)
async def parse_url_async(api_url, url=url, wait=wait):
    # Build the payload from the arguments so that values passed by the caller are actually sent
    payload = {"url":url, "wait":wait}  # add all other parameters
    timeout = httpx.Timeout(10.0, read=60)
    async with httpx.AsyncClient() as client:
        response = await client.post(api_url, json=payload, headers=headers, timeout=timeout)
        return response

# response = await parse_url_async(api_url, url, wait)
# or use asyncio
import asyncio
async def get_response_async():
    response = await parse_url_async(api_url, url, wait)
    print(response.json().get('text',''))

# Run the async function
asyncio.run(get_response_async())
```

--------------------------------

### POST /extract

Source: https://github.com/ai92-github/parseextract/blob/main/api_docs.md

Endpoint to process a document or URL and extract data based on a provided prompt.

```APIDOC
## POST /extract

### Description
Extracts information from a document or URL based on a user-defined prompt.

### Method
POST

### Parameters
#### Request Body (Form Data)
- **files** (file) - Optional - The document file to be processed.
- **url** (string) - Optional - The URL of the document (use either file upload or URL).
- **prompt** (string) - Required - The extraction prompt describing what to extract, optionally including a JSON schema example.
- **json_out** (integer) - Optional - Set to 1 if you do not want JSON output format.
```