### Local Development Setup

Source: https://github.com/jina-ai/reader/blob/main/README.md

Commands to install dependencies, start services, and initialize the database for local development.

```bash
npm install
docker compose up -d
npm run init-db
```

--------------------------------

### Local Development Setup Commands

Source: https://context7.com/jina-ai/reader/llms.txt

Commands for setting up the local development environment, including cloning the repository, installing dependencies, starting services via Docker, initializing the database, and running the application in development or production mode.

```bash
# Prerequisites: Node.js >=22, Docker
git clone git@github.com:jina-ai/reader.git
cd reader
npm install

# Start MongoDB + MinIO (S3-compatible) via Docker Compose
docker compose up -d

# Initialise the database (creates collections/indexes)
npm run init-db

# Start in development mode with hot-reload
npm run dev
# → Server listening on http://localhost:3000 (HTTP/1.1)
# → Alternative HTTP/2 cleartext on port 3001

# Build TypeScript and start in production mode
npm run serve

# Run unit tests
npm run test:unit

# Run e2e tests
npm run test:e2e

# With coverage report
npm run test:coverage
```

--------------------------------

### Local Development with Environment Variables

Source: https://github.com/jina-ai/reader/blob/main/README.md

Commands to start services and run development mode after setting up environment variables.

```bash
docker compose up -d
npm run dev
```

--------------------------------

### Start Crawl Server (Node.js)

Source: https://context7.com/jina-ai/reader/llms.txt

Import and start the crawl server singleton for embedding in Node.js applications. The service can be configured to listen on a specific port.

```typescript
// package.json exports: ".", "./crawl", "./search", "./serp"
import crawlServer from 'reader/crawl';   // CrawlStandAloneServer singleton
import searchServer from 'reader/search'; // SearchStandAloneServer singleton

// Start the crawl server on port 3000
crawlServer.serviceReady().then((s) => {
    s.h2c().listen(3000);
});

// Or dry-run mode (initialise, then shut down; useful for testing).
// `finalizer` must already be in scope; its import is not shown in the source snippet.
if (process.env.NODE_ENV === 'dry-run') {
    crawlServer.serviceReady().then(() => finalizer.terminate());
}
```

--------------------------------

### Clone Repository and Install Dependencies

Source: https://github.com/jina-ai/reader/blob/main/README.md

Instructions for cloning the Jina Reader repository and installing the necessary Node.js dependencies using npm. Ensure you are using Node.js v18.

```bash
git clone git@github.com:jina-ai/reader.git
cd reader
npm install
```

--------------------------------

### Start Web Crawler

Source: https://github.com/jina-ai/reader/blob/main/fixtures/sample.html

Initiates a web crawler to start fetching pages from a given URL. Ensure the Crawler class is properly defined and imported.

```javascript
const crawler = new Crawler();
crawler.start('https://example.com');
```

--------------------------------

### GET / - URL to Markdown

Source: https://context7.com/jina-ai/reader/llms.txt

Prepend `https://r.jina.ai/` to any URL to convert it to markdown. The service auto-selects rendering engines and applies cleanup. Behavior can be controlled via request headers.

```APIDOC
## GET /

### Description
Converts a given URL to markdown text. The service automatically selects the appropriate rendering engine (Curl or Browser) and applies cleanup using Readability. Output format and behavior can be customized using request headers.

### Method
GET

### Endpoint
`https://r.jina.ai/`

### Headers
- `Accept`: `application/json` to receive JSON output with title, url, and content fields.
- `X-No-Cache`: `true` to force bypass cache.
- `X-Timeout`: Specifies timeout in seconds.
- `X-Target-Selector`: Focuses rendering on a specific CSS selector.
- `X-Remove-Selector`: Removes specified CSS elements (e.g., cookie banners).
- `X-Respond-With`: `screenshot` to receive a raw screenshot (PNG redirect).
- `X-Set-Cookie`: Forwards session cookies.
- `X-Proxy-Url`: Specifies a custom proxy URL (e.g., `socks5://user:pass@proxy.example.com:1080`).

### Request Example
```bash
# Basic usage
curl https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence

# JSON output
curl -H "Accept: application/json" https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence

# Screenshot output
curl -L -H "X-Respond-With: screenshot" https://r.jina.ai/https://jina.ai -o screenshot.png
```

### Response
#### Success Response (200)
- `content` (string): Markdown representation of the URL content.
- `title` (string): The title of the page.
- `url` (string): The original URL.

#### Response Example
```json
{
  "title": "Artificial intelligence",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "content": "# Artificial intelligence\n..."
}
```
```

--------------------------------

### Markdown output: ATX headings

Source: https://context7.com/jina-ai/reader/llms.txt

Control the markdown heading style. This example forces ATX-style headings (e.g., # Heading) instead of setext underline style.

```bash
curl -H "X-Md-Heading-Style: atx" \
  https://r.jina.ai/https://docs.example.com
```

--------------------------------

### Get Screenshot with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Retrieve a raw screenshot (PNG redirect) by setting the 'X-Respond-With' header to 'screenshot'. Use '-L' to follow redirects.

```bash
# Return raw screenshot (PNG redirect)
curl -L -H "X-Respond-With: screenshot" \
  https://r.jina.ai/https://jina.ai -o screenshot.png
```

--------------------------------

### Submit URL via POST with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Submit a URL for processing via a POST request, essential for hash-routed SPAs where fragments cannot be sent in GET paths.

```bash
# Hash-routed SPA (# fragment cannot be sent in a GET path)
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/#/route/to/page"}'
```

--------------------------------

### Inject custom JavaScript

Source: https://context7.com/jina-ai/reader/llms.txt

Inject custom JavaScript into the page after it has loaded. This example removes a modal element using JavaScript.

```bash
curl -H "Accept: application/json" \
  -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "injectPageScript": ["document.querySelector(\".modal\").remove();"]}'
```

--------------------------------

### Convert URL to Markdown with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Use this to convert a Wikipedia page to markdown. Specify 'Accept: application/json' to get JSON output with title, url, and content fields.

```bash
# Basic: convert a Wikipedia page to markdown
curl https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence

# Return JSON with title, url, content fields
curl -H "Accept: application/json" \
  https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
```

--------------------------------

### Token budget: Summary of links

Source: https://context7.com/jina-ai/reader/llms.txt

Include a summary section listing all found links in the response. Set 'X-With-Links-Summary: true' to enable this feature.

```bash
curl -H "X-With-Links-Summary: true" \
  https://r.jina.ai/https://news.ycombinator.com
```

--------------------------------

### Enable JSON Mode with Accept Header

Source: https://github.com/jina-ai/reader/blob/main/README.md

Use the `Accept: application/json` header to retrieve content in JSON format. Currently, this mode returns a JSON object with `url`, `title`, and `content` fields.

```bash
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```

--------------------------------

### Submit Binary File as Base64 with POST

Source: https://context7.com/jina-ai/reader/llms.txt

Submit any binary file (e.g., Word, Excel, PowerPoint) encoded in Base64 in the JSON body of a POST request.

```bash
# Submit any binary file (Word, Excel, PowerPoint) as base64
# macOS syntax shown; with GNU coreutils use `base64 -w 0 slides.pptx` to avoid line wrapping
FILE_B64=$(base64 -i slides.pptx)
# Inner double quotes are escaped so the shell still expands ${FILE_B64} inside valid JSON
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d "{\"file\": \"${FILE_B64}\"}"
```

--------------------------------

### In-Site Search with Jina Reader

Source: https://github.com/jina-ai/reader/blob/main/README.md

Perform an in-site search by specifying the `site` parameter in the query. Multiple `site` parameters can be used to search across different domains.

```bash
curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
```

--------------------------------

### SPA Fetching with Timeout Header

Source: https://github.com/jina-ai/reader/blob/main/README.md

When dealing with SPAs that dynamically load content, use the `x-timeout` header to instruct Reader to wait until the network is idle or the timeout is reached, ensuring content is fully loaded.

```bash
# The request goes to Reader (r.jina.ai), with the target URL appended
curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'
```

--------------------------------

### Enable Image Captioning

Source: https://github.com/jina-ai/reader/blob/main/README.md

To enable image captioning for better latency, set the `x-with-generated-alt: true` header in your request.

```http
x-with-generated-alt: true
```

--------------------------------

### Control Output Format with X-Respond-With Header

Source: https://context7.com/jina-ai/reader/llms.txt

Demonstrates how to control the output format of the Reader API using the X-Respond-With header. Supports various formats like markdown, html, text, pageshot, and readerlm-v2. Requires Accept: text/event-stream or application/json for compound formats.

```bash
# Return only the cleaned markdown (no Readability processing)
curl -H "X-Respond-With: markdown" \
  https://r.jina.ai/https://docs.python.org/3/library/os.html
```

```bash
# Return raw outer HTML
curl -H "X-Respond-With: html" \
  https://r.jina.ai/https://example.com
```

```bash
# Return innerText only
curl -H "X-Respond-With: text" \
  https://r.jina.ai/https://example.com
```

```bash
# Full-page screenshot (PNG)
curl -H "X-Respond-With: pageshot" \
  https://r.jina.ai/https://jina.ai -o pageshot.png
```

```bash
# Use ReaderLM-v2 (small LM converts HTML → markdown)
curl -H "X-Respond-With: readerlm-v2" \
  https://r.jina.ai/https://arxiv.org/abs/2309.10305
```

```bash
# Compound: content + screenshot (requires SSE or JSON Accept)
curl -H "Accept: application/json" \
  -H "X-Respond-With: content,screenshot" \
  https://r.jina.ai/https://example.com
```

--------------------------------

### Compare Standard vs. Streaming Mode

Source: https://github.com/jina-ai/reader/blob/main/README.md

Demonstrates the difference between standard and streaming mode for content extraction. Streaming mode is beneficial when content is loaded dynamically after the initial page load.

```bash
# Standard mode (request sent to Reader, cache bypassed)
curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```

```bash
# Streaming mode
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```

--------------------------------

### Read URL with Jina Reader

Source: https://github.com/jina-ai/reader/blob/main/README.md

Prepend `https://r.jina.ai/` to any URL to convert it into an LLM-friendly input. This is useful for agents and RAG systems.

```url
https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
```

--------------------------------

### Cache control: Robots.txt compliance

Source: https://context7.com/jina-ai/reader/llms.txt

Ensure compliance with robots.txt rules for specific bots. 'X-Robots-Txt: Googlebot' checks against Googlebot's rules.

```bash
curl -H "X-Robots-Txt: Googlebot" \
  https://r.jina.ai/https://example.com/page
```

--------------------------------

### SPA Fetching with Timeout

Source: https://github.com/jina-ai/reader/blob/main/README.md

For SPAs or websites with dynamic content loading, use the `x-timeout` header to specify a waiting period.

```APIDOC
## SPA Fetching (Dynamic Content)

### Description
Fetches content from SPAs or websites with dynamic content loading, waiting until the network is idle or the specified timeout is reached.

### Method
GET

### Endpoint
`https://r.jina.ai/{URL}`

### Parameters
#### Headers
- **x-timeout** (integer) - Optional - The timeout in seconds to wait for network idle.

### Request Example
```bash
curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'
```
```

--------------------------------

### Token budget: Inspect usage (curl)

Source: https://context7.com/jina-ai/reader/llms.txt

Inspect token usage from response headers using curl. The 'X-Usage-Tokens' header provides an estimate of the GPT-compatible token count.

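
The same header can also be read in code. The following is a minimal Python sketch, assuming the response headers are available as a plain name-to-value mapping and that `X-Usage-Tokens` carries a bare integer; the helper name `usage_tokens` is hypothetical and not part of Reader.

```python
from typing import Mapping, Optional


def usage_tokens(headers: Mapping[str, str]) -> Optional[int]:
    """Return the X-Usage-Tokens value as an int, or None if absent or malformed.

    Assumption: the header value is a plain integer string.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == "x-usage-tokens":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


# Example with a hand-written header mapping (no network access needed).
print(usage_tokens({"Content-Type": "text/plain", "x-usage-tokens": "1532"}))
# → 1532
```

Returning `None` instead of raising keeps the caller in control when the header is missing, which is a design choice of this sketch rather than documented Reader behavior. The curl form of the same check follows.
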
```bash
curl -sI "https://r.jina.ai/https://en.wikipedia.org/wiki/Python_(programming_language)" \
  | grep -i x-usage-tokens
```

--------------------------------

### SPA Fetching with POST Request

Source: https://github.com/jina-ai/reader/blob/main/README.md

For SPAs with hash-based routing, use the POST method with the `url` parameter in the request body to correctly handle URLs containing `#`.

```bash
curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
```

--------------------------------

### Enable Streaming Mode with Accept Header

Source: https://github.com/jina-ai/reader/blob/main/README.md

Toggle streaming mode using the `Accept: text/event-stream` header. This mode waits longer for the page to stabilize, providing more complete results, especially for pages with dynamic content loaded by JavaScript.

```bash
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```

--------------------------------

### Token budget: Reject if exceeded

Source: https://context7.com/jina-ai/reader/llms.txt

Set a token budget for requests. If the estimated token count exceeds the budget (e.g., 5000), the request will be rejected with an HTTP 400 error.

```bash
curl -v -H "X-Token-Budget: 5000" \
  https://r.jina.ai/https://very-long-document.example.com
```

--------------------------------

### Handle SPA and Dynamic Content with Browser Engine

Source: https://context7.com/jina-ai/reader/llms.txt

Manages Single Page Applications (SPAs) with hash routing, lazy-loaded content, and dynamic elements. Uses Puppeteer headless Chrome. Options include waiting for specific selectors, setting explicit timeouts, and forcing the browser engine.

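
Because a `#` fragment is never sent to the server in a GET path, a client has to pick POST for hash-routed pages. The following is a minimal Python sketch of that decision, assuming the header names used elsewhere in this document (`X-Timeout`, `X-Wait-For-Selector`); the function `plan_reader_request` and its return shape are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlsplit

READER_BASE = "https://r.jina.ai/"


def plan_reader_request(target: str,
                        timeout_s: Optional[int] = None,
                        wait_for_selector: Optional[str] = None) -> Dict[str, Any]:
    """Describe the request to send to Reader for `target` (hypothetical helper).

    Returns a dict with method, url, headers, and an optional JSON body.
    """
    headers: Dict[str, str] = {}
    if timeout_s is not None:
        headers["X-Timeout"] = str(timeout_s)
    if wait_for_selector:
        headers["X-Wait-For-Selector"] = wait_for_selector

    if urlsplit(target).fragment:
        # Hash-routed URL: put the full URL in a JSON body and use POST.
        return {"method": "POST", "url": READER_BASE,
                "headers": headers, "json": {"url": target}}
    # No fragment: the plain "prepend r.jina.ai" GET form is enough.
    return {"method": "GET", "url": READER_BASE + target,
            "headers": headers, "json": None}


plan = plan_reader_request("https://example.com/#/route/to/page", timeout_s=45)
print(plan["method"], plan["json"])
```

The returned dict can be passed to any HTTP client that sends `json` with a `Content-Type: application/json` header; keeping the planning step free of network calls makes it easy to test.
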
```bash
# Wait for a specific element to appear before returning
curl -H "X-Wait-For-Selector: #main-content" \
  https://r.jina.ai/https://react-app.example.com/products
```

```bash
# Explicit timeout to capture fully-rendered heavy pages (max 180s)
curl -H "X-Timeout: 45" \
  https://r.jina.ai/https://dashboard.example.com/analytics
```

```bash
# Force browser engine (always use headless Chrome, never curl)
curl -H "X-Engine: browser" \
  https://r.jina.ai/https://js-heavy-app.example.com
```

--------------------------------

### POST / - URL or file via POST body

Source: https://context7.com/jina-ai/reader/llms.txt

Submit content via a POST request body to `https://r.jina.ai/`. Supports `url`, `html`, `pdf` (base64), or `file` (base64) for processing.

```APIDOC
## POST /

### Description
Allows submitting content via a POST request body for processing. This is useful for handling hash-routed SPAs, raw HTML conversion, or direct file uploads.

### Method
POST

### Endpoint
`https://r.jina.ai/`

### Headers
- `Content-Type`: `application/json`

### Request Body
- `url` (string): The URL to process.
- `html` (string): Raw HTML content to convert to markdown.
- `pdf` (string): Base64 encoded PDF content.
- `file` (string): Base64 encoded content of any binary file (e.g., Word, Excel, PowerPoint).

### Request Example
```bash
# Process a URL with a hash fragment
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/#/route/to/page"}'

# Convert raw HTML string to markdown
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d '{"html": "<h1>Hello</h1><p>World <strong>example</strong></p>"}'

# Submit a base64 encoded PDF
# macOS syntax shown; with GNU coreutils use `base64 -w 0 report.pdf` to avoid line wrapping
PDF_B64=$(base64 -i report.pdf)
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d "{\"pdf\": \"${PDF_B64}\"}"

# Submit a base64 encoded Office file (here a PowerPoint deck)
FILE_B64=$(base64 -i slides.pptx)
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d "{\"file\": \"${FILE_B64}\"}"
```

### Response
#### Success Response (200)
- `content` (string): Markdown representation of the processed content.

#### Response Example
```
# Hello

World **example**
```
```

--------------------------------

### SPA Fetching with POST

Source: https://github.com/jina-ai/reader/blob/main/README.md

For Single Page Applications (SPAs) with hash-based routing, use the POST method with the URL in the request body.

```APIDOC
## SPA Fetching (Hash-based Routing)

### Description
Fetches content from SPAs that use hash-based routing by sending a POST request.

### Method
POST

### Endpoint
`https://r.jina.ai/`

### Parameters
#### Request Body
- **url** (string) - Required - The URL of the SPA.

### Request Example
```bash
curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
```
```

--------------------------------

### Web Search with s.jina.ai

Source: https://context7.com/jina-ai/reader/llms.txt

Performs web searches using the s.jina.ai endpoint. Supports basic search, domain restriction, JSON output, streaming results, image search, news search, geo-targeting, and controlling result count. Requires an Authorization header for some features.

```bash
# Basic web search, returns top 5 results in markdown
curl "https://s.jina.ai/What%20is%20the%20best%20vector%20database%20for%20RAG%3F"
```

```bash
# Restrict search to specific domains (in-site search)
curl "https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com"
```

```bash
# Return structured JSON (array of {title, url, content})
curl -H "Accept: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/latest%20AI%20models%202024"
```

```bash
# Stream search results progressively via SSE
curl -H "Accept: text/event-stream" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/quantum%20computing%20breakthroughs"
```

```bash
# Search images
curl -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/golden%20retriever?type=images"
```

```bash
# Search news
curl -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/AI%20regulation%202024?type=news"
```

```bash
# Geo-targeted search (country + language)
curl -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/local%20weather?gl=de&hl=de"
```

```bash
# Request only URLs/titles without fetching full content (faster)
curl -H "X-Respond-With: no-content" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/machine%20learning%20tutorials"
```

```bash
# Control result count (up to 20); the query string starts with "?", not "&"
curl -H "Authorization: Bearer YOUR_API_KEY" \
  "https://s.jina.ai/climate%20change?num=3"
```

--------------------------------

### Web Search

Source: https://github.com/jina-ai/reader/blob/main/README.md

Perform a web search by prepending `https://s.jina.ai/` to your search query. The API fetches and processes the top 5 search results.

```APIDOC
## Web Search

### Description
Performs a web search and fetches content from the top 5 results.

### Method
GET

### Endpoint
`https://s.jina.ai/{search_query}`

### Parameters
#### Path Parameters
- **search_query** (string) - Required - The URL-encoded search query.

### Request Example
```bash
curl 'https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F'
```
```

--------------------------------

### In-site Search with Jina Reader

Source: https://github.com/jina-ai/reader/blob/main/README.md

To restrict search results to a specific domain, append `site=yourdomain.com` to the query parameters.

```url
https://s.jina.ai/your+query?site=jina.ai
```

--------------------------------

### Use Custom Proxy with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Utilize a custom proxy for requests by specifying the proxy URL in the 'X-Proxy-Url' header.

```bash
# Use a custom proxy
curl -H "X-Proxy-Url: socks5://user:pass@proxy.example.com:1080" \
  https://r.jina.ai/https://geo-restricted-site.com
```

--------------------------------

### Generate Alt Text for Images with X-With-Generated-Alt

Source: https://github.com/jina-ai/reader/blob/main/README.md

Enable automatic captioning of images lacking alt tags by using the `X-With-Generated-Alt: true` header. The captions are formatted to assist downstream LLMs in understanding image content.

```bash
curl -H "X-With-Generated-Alt: true" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```

--------------------------------

### Submit PDF as Base64 with POST

Source: https://context7.com/jina-ai/reader/llms.txt

Submit a PDF file encoded in Base64 within the JSON body of a POST request to convert it to markdown.

```bash
# Submit a PDF as base64 and get markdown back
# macOS syntax shown; with GNU coreutils use `base64 -w 0 report.pdf` to avoid line wrapping
PDF_B64=$(base64 -i report.pdf)
# Inner double quotes are escaped so the shell still expands ${PDF_B64} inside valid JSON
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d "{\"pdf\": \"${PDF_B64}\"}"
```

--------------------------------

### Streaming Mode - Accept: text/event-stream

Source: https://context7.com/jina-ai/reader/llms.txt

Enables Server-Sent Events (SSE) streaming for content that loads dynamically. Each SSE chunk provides increasingly complete page content, allowing immediate processing.

```APIDOC
## Streaming Mode

### Description
Enables Server-Sent Events (SSE) streaming by setting the `Accept` header to `text/event-stream`. This is useful for websites that load content dynamically via JavaScript or when immediate processing of partial content is desired. Each subsequent SSE chunk contains more complete page content.

### Method
GET

### Endpoint
`https://r.jina.ai/`

### Headers
- `Accept`: `text/event-stream`
- `X-No-Cache`: `true` (optional, to ensure fresh content)

### Request Example
```bash
# Stream Wikipedia main page
curl -H "Accept: text/event-stream" \
  https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

# Stream a site with dynamic content, bypassing cache
curl -H "Accept: text/event-stream" \
  -H "X-No-Cache: true" \
  https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```

### Response
#### Success Response (200)
- Server-Sent Events stream where each event chunk contains progressively more complete page content.
```

--------------------------------

### Programmatic Crawling with CrawlerHost (Node.js)

Source: https://context7.com/jina-ai/reader/llms.txt

Directly use the CrawlerHost for programmatic crawling. This involves resolving the host from the container, configuring crawl options, and iterating through snapshots.

```typescript
// Direct use of CrawlerHost for programmatic crawling
import { container } from 'tsyringe';
import { CrawlerHost } from './src/api/crawler';
import { CrawlerOptions } from './src/dto/crawler-options';

const host = container.resolve(CrawlerHost);
await host.serviceReady();

const url = new URL('https://en.wikipedia.org/wiki/Artificial_intelligence');
const opts = CrawlerOptions.from({ respondWith: 'markdown', withGeneratedAlt: false });
const crawlOpts = await host.configure(opts);

for await (const snapshot of host.iterSnapshots(url, crawlOpts, opts)) {
    if (!snapshot) continue;
    const formatted = await host.simpleCrawl('content', url, crawlOpts);
    console.log(formatted.content);
    break;
}
```

--------------------------------

### Convert Raw HTML to Markdown with POST

Source: https://context7.com/jina-ai/reader/llms.txt

Convert a raw HTML string to markdown by including the HTML content in the JSON body of a POST request.

```bash
# Convert raw HTML string to markdown
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d '{"html": "<h1>Hello</h1><p>World <strong>example</strong></p>"}'
# → "# Hello\n\nWorld **example**\n"
```

--------------------------------

### Cache control: Cache tolerance

Source: https://context7.com/jina-ai/reader/llms.txt

Specify the maximum age of cached content to accept. 'X-Cache-Tolerance: 86400' allows cached content up to 24 hours old.

```bash
curl -H "X-Cache-Tolerance: 86400" \
  https://r.jina.ai/https://example.com/static-page
```

--------------------------------

### Force curl engine

Source: https://context7.com/jina-ai/reader/llms.txt

Use the 'curl' engine for lightweight, no-JavaScript execution. This is useful for static content or when JavaScript execution is not desired.

```bash
curl -H "X-Engine: curl" \
  https://r.jina.ai/https://static-site.example.com
```

--------------------------------

### Cache control: Bypass cache

Source: https://context7.com/jina-ai/reader/llms.txt

Bypass the cache entirely for a request. Use 'X-No-Cache: true' to ensure the content is always re-fetched from the source.

```bash
curl -H "X-No-Cache: true" \
  https://r.jina.ai/https://example.com/live-data
```

--------------------------------

### Search Web with Jina Reader

Source: https://github.com/jina-ai/reader/blob/main/README.md

Use `https://s.jina.ai/` followed by a URL-encoded query to search the web. This allows LLMs to access current world knowledge.

```url
https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
```

--------------------------------

### Set viewport for responsive rendering

Source: https://context7.com/jina-ai/reader/llms.txt

Define a specific viewport for rendering responsive web pages. This is useful for testing how a page appears on different devices.

```bash
curl -X POST https://r.jina.ai/ \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","viewport":{"width":375,"height":812,"isMobile":true}}'
```

--------------------------------

### Token budget: Max tokens

Source: https://context7.com/jina-ai/reader/llms.txt

Limit the response size by setting a maximum token count. 'X-Max-Tokens: 2000' will trim the response to approximately 2000 tokens.

```bash
curl -H "X-Max-Tokens: 2000" \
  https://r.jina.ai/https://wikipedia.org/wiki/Large_language_model
```

--------------------------------

### Control Caching and Timeout with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Force bypass of cache and wait for full network idle for heavy JS apps by setting 'X-No-Cache: true' and 'X-Timeout: 30'.

```bash
# Force bypass cache and wait for full network idle (heavy JS apps)
curl -H "X-No-Cache: true" \
  -H "X-Timeout: 30" \
  https://r.jina.ai/https://app.example.com/dashboard
```

--------------------------------

### Markdown output: Strip images

Source: https://context7.com/jina-ai/reader/llms.txt

Configure the markdown output to strip all images. Use 'X-Retain-Images: none' to remove all image elements.

```bash
curl -H "X-Retain-Images: none" \
  https://r.jina.ai/https://example.com/article
```

--------------------------------

### Generate Image Alt-Text with X-With-Generated-Alt

Source: https://context7.com/jina-ai/reader/llms.txt

Automatically generates captions for images lacking alt attributes using the jina-vlm model. Captions are embedded in markdown. Can be combined with X-With-Images-Summary for a full image summary section. Requires an Authorization header for JSON output.

```bash
# Auto-caption all images on a Wikipedia page
curl -H "X-With-Generated-Alt: true" \
  https://r.jina.ai/https://en.wikipedia.org/wiki/Hubble_Space_Telescope
```

```bash
# With JSON output to inspect image captions programmatically
curl -H "Accept: application/json" \
  -H "X-With-Generated-Alt: true" \
  https://r.jina.ai/https://en.wikipedia.org/wiki/Hubble_Space_Telescope
```

```bash
# Combined: with alt text + full image summary section
curl -H "X-With-Generated-Alt: true" \
  -H "X-With-Images-Summary: true" \
  https://r.jina.ai/https://www.nasa.gov/missions/
```

--------------------------------

### In-Site Search

Source: https://github.com/jina-ai/reader/blob/main/README.md

Perform an in-site search by specifying the `site` query parameter. You can target multiple sites.

```APIDOC
## In-Site Search

### Description
Searches within specified websites.

### Method
GET

### Endpoint
`https://s.jina.ai/{search_query}`

### Parameters
#### Path Parameters
- **search_query** (string) - Required - The URL-encoded search query.

#### Query Parameters
- **site** (string) - Optional - The domain to search within. Can be specified multiple times.

### Request Example
```bash
curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
```
```

--------------------------------

### Wait for CSS Selector with x-wait-for-selector

Source: https://github.com/jina-ai/reader/blob/main/README.md

Use the `x-wait-for-selector` header to make the Reader wait for a specific CSS selector to appear on the page before extracting content. This is useful when you know the exact element to target.

```bash
# The request goes to Reader (r.jina.ai), with the target URL appended
curl 'https://r.jina.ai/https://example.com/' -H 'x-wait-for-selector: #content'
```

--------------------------------

### Read PDF from URL with Jina Reader

Source: https://github.com/jina-ai/reader/blob/main/README.md

Jina Reader can now process PDF files directly from a URL. The output is an LLM-friendly format.

```url
https://r.jina.ai/https://www.nasa.gov/wp-content/uploads/2023/01/55583main_vision_space_exploration2.pdf
```

--------------------------------

### Markdown output: Contextual chunking

Source: https://context7.com/jina-ai/reader/llms.txt

Apply contextual chunking to the markdown output. 'X-Markdown-Chunking: s2' enables structured chunking with a depth of 2.

```bash
curl -H "X-Markdown-Chunking: s2" \
  https://r.jina.ai/https://long-documentation-page.example.com
```

--------------------------------

### Markdown output: Strip links

Source: https://context7.com/jina-ai/reader/llms.txt

Configure the markdown output to strip all links. Use 'X-Retain-Links: none' to remove all hyperlink elements.

```bash
curl -H "X-Retain-Links: none" \
  https://r.jina.ai/https://example.com/article
```

--------------------------------

### Markdown output: Chunking by heading level

Source: https://context7.com/jina-ai/reader/llms.txt

Enable markdown chunking based on heading levels. 'X-Markdown-Chunking: h2' injects a separator before each H2 heading.

```bash
curl -H "X-Markdown-Chunking: h2" \
  https://r.jina.ai/https://long-documentation-page.example.com
```

--------------------------------

### Remove cookie/GDPR overlays

Source: https://context7.com/jina-ai/reader/llms.txt

Automatically remove common overlay elements like cookie or GDPR banners. Set 'X-Remove-Overlay' to 'true' to enable this feature.

```bash
curl -H "X-Remove-Overlay: true" \
  https://r.jina.ai/https://news.site.example.com
```

--------------------------------

### Control response timing

Source: https://context7.com/jina-ai/reader/llms.txt

Explicitly control when the response is considered ready using the 'X-Respond-Timing' header. 'network-idle' waits until network activity has ceased.

```bash
curl -H "X-Respond-Timing: network-idle" \
  https://r.jina.ai/https://example.com/heavy-page
```

--------------------------------

### Forward Session Cookie with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Forward a session cookie with the 'X-Set-Cookie' header so the page is fetched within the user's session. Results of requests that carry cookies are not cached.

```bash
# Forward session cookie (result not cached)
curl -H "X-Set-Cookie: session=abc123; Domain=example.com; Path=/" \
  https://r.jina.ai/https://example.com/profile
```

--------------------------------

### Markdown output: GPT-OSS citation links

Source: https://context7.com/jina-ai/reader/llms.txt

Set the link retention policy to 'gpt-oss' for a specific citation link format: 【{id}†.*】. This is useful for academic or technical documentation.

```bash
curl -H "X-Retain-Links: gpt-oss" \
  https://r.jina.ai/https://docs.openai.com/api-reference
```

--------------------------------

### Token budget: Parse usage (Python)

Source: https://context7.com/jina-ai/reader/llms.txt

Parse token usage from a JSON response using Python's httpx library. The 'usage.tokens' field in the JSON response contains the token count.

```python
import httpx

r = httpx.get(
    "https://r.jina.ai/https://en.wikipedia.org/wiki/Python_(programming_language)",
    headers={"Accept": "application/json", "Authorization": "Bearer YOUR_API_KEY"},
)
r.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
data = r.json()
print(f"Tokens used: {data.get('usage', {}).get('tokens')}")
```

--------------------------------

### Enable Server-Sent Events Streaming with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Enable Server-Sent Events (SSE) streaming by setting the 'Accept: text/event-stream' header. This is useful for sites that load content dynamically via JavaScript or when immediate processing is needed.

```bash
# Stream Wikipedia main page; the last event contains the most complete result
curl -H "Accept: text/event-stream" \
  https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

# Real example: site that lazy-loads after full load
# Standard mode returns incomplete page; streaming waits longer:
curl -H "Accept: text/event-stream" \
  -H "X-No-Cache: true" \
  https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```

--------------------------------

### Parse in Python - Use Final Data Event

Source: https://context7.com/jina-ai/reader/llms.txt

Extracts the last data event from a stream, useful for processing the final content of a web page. Requires the third-party httpx package; json is part of the standard library.

```python
import json

import httpx

with httpx.stream(
    "GET",
    "https://r.jina.ai/https://example.com",
    headers={"Accept": "text/event-stream", "Authorization": "Bearer YOUR_API_KEY"},
) as r:
    last_data = None
    for line in r.iter_lines():
        # Each SSE payload line starts with "data: "; keep only the latest one
        if line.startswith("data: "):
            last_data = json.loads(line[6:])

# Guard against a stream that produced no data events
if last_data is not None:
    print(last_data["content"])
```

--------------------------------

### Target Specific Sections and Remove Elements with curl

Source: https://context7.com/jina-ai/reader/llms.txt

Focus on a specific CSS section using 'X-Target-Selector' and remove elements like cookie banners with 'X-Remove-Selector'.

```bash
# Focus on one CSS section, remove cookie banners
curl -H "X-Target-Selector: article.main-content" \
  -H "X-Remove-Selector: .cookie-banner, #newsletter-popup" \
  https://r.jina.ai/https://www.nytimes.com/2024/01/01/technology/ai.html
```

--------------------------------

### Request Headers for Reader API

Source: https://github.com/jina-ai/reader/blob/main/README.md

Control the behavior of the Reader API using various request headers for features like image captioning, cookie forwarding, response format, proxying, caching, and element selection.

```APIDOC
## Reader API Request Headers

### Description
Customize the Reader API's behavior using the following request headers.

### Headers
- **x-with-generated-alt**: `true` - Enable image caption feature.
- **x-set-cookie**: (string) - Cookie settings to forward. Requests with cookies are not cached.
- **x-respond-with**: `markdown` | `html` | `text` | `screenshot` - Specify response format. `markdown` bypasses readability, `html` returns outerHTML, `text` returns innerText, `screenshot` returns screenshot URL.
- **x-proxy-url**: (string) - Specify a proxy server URL.
- **x-cache-tolerance**: (integer) - Customize cache tolerance in seconds.
- **x-no-cache**: `true` - Bypass the cached page (equivalent to `x-cache-tolerance: 0`).
- **x-target-selector**: (string) - CSS selector to target a specific element for content extraction.
- **x-wait-for-selector**: (string) - CSS selector to wait for until the element is rendered.
```

--------------------------------

### Cache control: Prevent caching

Source: https://context7.com/jina-ai/reader/llms.txt

Prevent the result of a specific request from being cached by sending the `DNT: 1` header. This is useful for sensitive or frequently changing data.

```bash
curl -H "DNT: 1" \
  https://r.jina.ai/https://example.com/sensitive-page
```
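
--------------------------------

### Build Search URLs and Base64 POST Bodies (Python sketch)

Several snippets in this document hand-assemble percent-encoded search URLs and JSON bodies that embed base64 data, which is easy to get wrong in a shell. The following is a minimal Python sketch that lets the standard library do the encoding. It assumes the `site` and `num` query parameters and the `file` / `pdf` body fields shown in the snippets of this document; the helper names `build_search_url` and `build_file_body` are hypothetical and not part of Reader.

```python
import base64
import json
from typing import Iterable, Optional
from urllib.parse import quote, urlencode

SEARCH_BASE = "https://s.jina.ai/"


def build_search_url(query: str, sites: Iterable[str] = (),
                     num: Optional[int] = None) -> str:
    """Percent-encode the query into the path and append optional parameters."""
    params = [("site", s) for s in sites]  # `site` may repeat
    if num is not None:
        params.append(("num", str(num)))
    url = SEARCH_BASE + quote(query, safe="")
    if params:
        url += "?" + urlencode(params)  # query string always starts with "?"
    return url


def build_file_body(data: bytes, field: str = "file") -> str:
    """Return a JSON string with `data` base64-encoded under `field`."""
    if field not in ("file", "pdf"):
        raise ValueError("field must be 'file' or 'pdf'")
    encoded = base64.b64encode(data).decode("ascii")  # no line wrapping
    return json.dumps({field: encoded})


print(build_search_url("When was Jina AI founded?", sites=["jina.ai", "github.com"]))
# → https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com
print(build_file_body(b"hello", field="pdf"))
# → {"pdf": "aGVsbG8="}
```

Using `json.dumps` avoids the nested-quote problem of building JSON inside a double-quoted shell string, and `base64.b64encode` never inserts newlines, so the body stays valid JSON regardless of file size.
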