### Start Local Web Server

Source: https://github.com/scribeocr/scribe.js/blob/master/scribe-ui/README.md

Use this command to serve the basic viewer locally. Ensure you are in the parent directory where `scribe-ui/` and `scribe.js/` are siblings.

```bash
npx http-server
```

--------------------------------

### Run the Textract Proxy Server

Source: https://github.com/scribeocr/scribe.js/blob/master/examples/server-textract-proxy/README.md

Execute this command in the terminal to start the proxy server. Ensure Node.js 20+ and AWS credentials are set up.

```sh
cd scribe.js/examples/server-textract-proxy
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... \
TEXTRACT_REGIONS=us-east-1 \
node server.js
```

--------------------------------

### Install Scribe.js via npm

Source: https://github.com/scribeocr/scribe.js/blob/master/README.md

Install the Scribe.js library using npm. This command is used for setting up the project.

```sh
npm i scribe.js-ocr
```

--------------------------------

### Clone and Set Up Scribe.js Locally

Source: https://github.com/scribeocr/scribe.js/blob/master/README.md

Clone the Scribe.js repository, including submodules, and install dependencies. Run automated tests before making a Pull Request.

```sh
## Clone the repo, including recursively cloning submodules
git clone --recurse-submodules git@github.com:scribeocr/scribe.js.git
cd scribe.js

## Install dependencies
npm i

## Make changes
## [...]

## Run automated tests before making PR
npm run test
```

--------------------------------

### Programmatic Integration with scribe.js

Source: https://github.com/scribeocr/scribe.js/blob/master/examples/server-textract-proxy/README.md

Example of how to integrate the Textract proxy into your application using the scribe.js library.

```APIDOC
## Wire it into your own app

```js
import scribe from 'scribe.js';
import { RecognitionModelServerProxy } from './RecognitionModelServerProxy.js';

await scribe.init();
await scribe.importFiles([pdfFileFromInput]);

const ac = new AbortController();
cancelButton.addEventListener('click', () => ac.abort());

try {
  await scribe.recognize({
    model: RecognitionModelServerProxy,
    modelOptions: { serverUrl: 'https://your-server.example/ocr' },
    signal: ac.signal,
  });
  // scribe.data.ocr.active is populated; scribe.exportData('pdf') returns a searchable PDF
} catch (err) {
  if (err.name !== 'AbortError') throw err;
  // Partial results in scribe.data.ocr.active for pages that arrived before abort.
}
```
```

--------------------------------

### Scribe.js CLI Examples

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Command-line interface for batch OCR, PDF processing, and text extraction. Supports various formats and operations like recognition, checking confidence, and overlaying text.

```bash
# Extract text from an image (outputs .txt file)
scribe extract scan.png
```

```bash
# Extract text from a PDF, output as searchable PDF
scribe extract document.pdf --format pdf
```

```bash
# Process all supported files in a directory
scribe extract ./scans/ --dir --format txt
```

```bash
# Recognize and create a searchable PDF with invisible text layer
scribe recognize scan.pdf --output ./output/
```

```bash
# Check OCR confidence of a file
scribe check document.pdf
```

```bash
# Overlay OCR text on a PDF with a visual proof overlay
scribe overlay scan.pdf --output ./output/ --vis
```

```bash
# Detect whether a PDF is text-native or image-based
scribe detect-pdf-type document.pdf
```

--------------------------------

### Accessing OCR Data in Scribe.js

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Iterate through OCR words on a specific page to access their text and confidence scores.
Also shows how to get the total page count and calculate an overall accuracy estimate.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.png']);
await scribe.recognize();

// Iterate OCR words on page 0
const page0 = scribe.data.ocr.active[0];
for (const line of page0.lines) {
  for (const word of line.words) {
    console.log(word.text, word.conf); // text and confidence score
  }
}

// Total page count
console.log(scribe.inputData.pageCount);

// Confidence summary across all pages
const { highConf, total } = scribe.utils.calcConf(scribe.data.ocr.active);
console.log(`Accuracy estimate: ${((highConf / total) * 100).toFixed(1)}%`);

await scribe.terminate();
```

--------------------------------

### init

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Initialize the program and optionally pre-load resources like the PDF renderer and OCR engine.

```APIDOC
## init

Initialize the program and optionally pre-load resources. The PDF renderer and OCR engine are loaded automatically when needed, so the only reason to set `pdf` or `ocr` to `true` is to pre-load them.

### Parameters

* `params` **[Object]?**
  * `params.pdf` **[boolean]** Load PDF renderer. (optional, default `false`)
  * `params.ocr` **[boolean]** Load OCR engine. (optional, default `false`)
  * `params.font` **[boolean]** Load built-in fonts. (optional, default `false`)
```

--------------------------------

### Serve Demo HTML

Source: https://github.com/scribeocr/scribe.js/blob/master/examples/server-textract-proxy/README.md

Run this command from the scribe.js checkout root to serve the demo HTML file. This is used to test the proxy server from a browser.

```sh
# In a second terminal, from the scribe.js checkout root (NOT this folder):
cd scribe.js
npx http-server -p 8081 --cors
```

--------------------------------

### Run Standalone Tauri Viewer

Source: https://github.com/scribeocr/scribe.js/blob/master/scribe-ui/README.md

Launch the built Tauri application.
This command requires the path to a PDF file. Additional flags can be used to specify the initial page, action, or highlights.

```bash
./basic-viewer/tauri/target/release/scribe-viewer-tauri -f /path/to/file.pdf
```

--------------------------------

### Build Standalone Tauri Viewer

Source: https://github.com/scribeocr/scribe.js/blob/master/scribe-ui/README.md

Execute this script to build the standalone desktop viewer. It auto-detects the environment and uses `cargo` if available, otherwise falling back to Docker. Ensure Rust and Tauri dependencies are met if not using the dev container.

```bash
bash basic-viewer/tauri/build.sh
```

--------------------------------

### scribe.init(params?)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Initializes the Scribe.js library and pre-loads necessary resources like the PDF renderer, OCR engine, and fonts. This is useful for avoiding startup delays during the first user interaction.

```APIDOC
## scribe.init(params?)

### Description
Pre-loads the PDF renderer, OCR engine, and/or built-in fonts. All three are loaded on-demand automatically; calling `init` is only necessary to avoid a startup delay when the user first interacts with the page.

### Method
`init`

### Parameters
- **params** (object) - Optional - An object to specify which resources to pre-load (e.g., `{ pdf: true, ocr: true, font: true }`). Can also include `ocrParams` for custom Tesseract configurations.

### Request Example
```javascript
import scribe from 'scribe.js-ocr';

// Pre-load everything upfront
await scribe.init({ pdf: true, ocr: true, font: true });

// Pre-load OCR only with custom Tesseract parameters
await scribe.init({
  ocr: true,
  ocrParams: { corePath: '/custom/tesseract-core.wasm.js' },
});
```
```

--------------------------------

### Import PDF as ArrayBuffer with SortedInputFiles

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Import a PDF file provided as an ArrayBuffer using the `SortedInputFiles` object.
This is necessary when passing raw buffer data.

```javascript
import scribe from 'scribe.js-ocr';
import fs from 'node:fs';

const pdfFile = fs.readFileSync('scan.pdf');
// Slice out exactly this file's bytes; `.buffer` alone can expose a larger shared ArrayBuffer
const pdfBuffer = pdfFile.buffer.slice(pdfFile.byteOffset, pdfFile.byteOffset + pdfFile.byteLength);
await scribe.importFiles({
  pdfFiles: [pdfBuffer], // ArrayBuffer
});
```

--------------------------------

### recognize

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Recognize all pages in the active document. Files must be imported first using `importFiles`.

```APIDOC
## recognize

Recognize all pages in the active document. Files for recognition should already be imported using `importFiles` before calling this function. The results of recognition can be exported by calling `exportData` after this function.

### Parameters

* `langs` **[Array]<[string]>** (optional, default `['eng']`)
```

--------------------------------

### Initialize Scribe.js with Custom Tesseract Parameters

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Initialize Scribe.js with OCR enabled and custom Tesseract parameters, such as specifying the core path for the WebAssembly file.

```javascript
import scribe from 'scribe.js-ocr';

// Pre-load OCR only with custom Tesseract parameters
await scribe.init({
  ocr: true,
  ocrParams: { corePath: '/custom/tesseract-core.wasm.js' },
});
```

--------------------------------

### Initialize Scribe.js with Default Resources

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Pre-load the PDF renderer, OCR engine, and built-in fonts for Scribe.js. This is optional and useful for avoiding startup delays on user interaction.

```javascript
import scribe from 'scribe.js-ocr';

// Pre-load everything upfront
await scribe.init({ pdf: true, ocr: true, font: true });
```

--------------------------------

### Import Supplemental OCR Data for Comparison

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Loads additional OCR data, such as ground-truth, to enable comparison with the primary OCR results.
Requires importing primary files and recognizing them first.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.png']);
await scribe.recognize();

// Load ground-truth hOCR for accuracy evaluation
await scribe.importFilesSupp(['ground-truth.hocr'], 'Ground Truth');

// Compare primary OCR vs ground truth
const comparison = await scribe.compareOCR(
  scribe.data.ocr.active,
  scribe.data.ocr['Ground Truth'],
);
console.log(comparison.metrics); // per-page word-level metrics

await scribe.terminate();
```

--------------------------------

### Cloud Adapter - AWS Textract

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Drop-in recognition model that sends pages to AWS Textract instead of the built-in Tesseract engine. Requires AWS credentials in the environment.

```APIDOC
## Cloud Adapter - AWS Textract

Drop-in recognition model that sends pages to AWS Textract instead of the built-in Tesseract engine. Requires AWS credentials in the environment.

```js
import scribe from 'scribe.js-ocr';
import { RecognitionModelTextract } from '@scribe.js/aws-textract';

// AWS credentials via env: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION

scribe.opt.progressHandler = ({ n, type, info }) => {
  if (type === 'convert') console.log(`Page ${n}: ${info.engineName}`);
};

await scribe.importFiles(['invoice.pdf']);
await scribe.recognize({
  model: RecognitionModelTextract,
  modelOptions: {
    analyzeLayout: true,
    // Multi-region for higher throughput (optional):
    region: ['us-east-1', 'us-west-2', 'eu-west-1'],
  },
});

const text = await scribe.exportData('text');
console.log(text);

await scribe.terminate();
```
```

--------------------------------

### Integrate with Your Application

Source: https://github.com/scribeocr/scribe.js/blob/master/examples/server-textract-proxy/README.md

Use this JavaScript code to integrate the Textract proxy into your own application. It initializes scribe.js, imports PDF files, and calls the recognition service.

```js
import scribe from 'scribe.js';
import { RecognitionModelServerProxy } from './RecognitionModelServerProxy.js';

await scribe.init();
await scribe.importFiles([pdfFileFromInput]);

const ac = new AbortController();
cancelButton.addEventListener('click', () => ac.abort());

try {
  await scribe.recognize({
    model: RecognitionModelServerProxy,
    modelOptions: { serverUrl: 'https://your-server.example/ocr' },
    signal: ac.signal,
  });
  // scribe.data.ocr.active is populated; scribe.exportData('pdf') returns a searchable PDF
} catch (err) {
  if (err.name !== 'AbortError') throw err;
  // Partial results in scribe.data.ocr.active for pages that arrived before abort.
}
```

--------------------------------

### download

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Runs `exportData` and saves the result as a download (browser) or local file (Node.js).

```APIDOC
## download

Runs `exportData` and saves the result as a download (browser) or local file (Node.js).

### Parameters

* `format` **(`"pdf"` | `"hocr"` | `"docx"` | `"xlsx"` | `"txt"` | `"text"`)**
* `fileName` **[string]**
* `minPage` **[number]** First page to export. (optional, default `0`)
* `maxPage` **[number]** Last page to export (inclusive). -1 exports through the last page. (optional, default `-1`)
```

--------------------------------

### Recognize Text with Default Settings

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Run OCR on imported files using the default recognition mode, which combines LSTM and legacy Tesseract models for the English language.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.png']);

// Default: combined LSTM + legacy, English
await scribe.recognize({ langs: ['eng'] });
```

--------------------------------

### Using AWS Textract Cloud Adapter with Scribe.js

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Integrates AWS Textract as a recognition model. Requires AWS credentials to be set in the environment.
The progress handler can be customized to log engine information during processing.

```javascript
import scribe from 'scribe.js-ocr';
import { RecognitionModelTextract } from '@scribe.js/aws-textract';

// AWS credentials via env: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION

scribe.opt.progressHandler = ({ n, type, info }) => {
  if (type === 'convert') console.log(`Page ${n}: ${info.engineName}`);
};

await scribe.importFiles(['invoice.pdf']);
await scribe.recognize({
  model: RecognitionModelTextract,
  modelOptions: {
    analyzeLayout: true,
    // Multi-region for higher throughput (optional):
    region: ['us-east-1', 'us-west-2', 'eu-west-1'],
  },
});

const text = await scribe.exportData('text');
console.log(text);

await scribe.terminate();
```

--------------------------------

### Extract Text from User-Uploaded Files (Browser)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Extract text from files uploaded by a user in a browser environment. Pre-loading resources with `scribe.init` can improve the initial response time.

```javascript
// Relative path: browsers reject bare specifiers unless an import map is defined
import scribe from './node_modules/scribe.js-ocr/scribe.js';

await scribe.init({ ocr: true, font: true }); // pre-load for faster response

document.getElementById('uploader').addEventListener('change', async (e) => {
  const text = await scribe.extractText(e.target.files);
  console.log(text);
});
```

--------------------------------

### scribe.opt

Source: https://context7.com/scribeocr/scribe.js/llms.txt

A static class providing global configuration options that affect recognition, export, and performance. These options must be set before the relevant operation is invoked.

```APIDOC
## scribe.opt

### Description
Global options object that affects recognition, export, and performance. Must be set before the relevant operation is invoked.

### Properties
- `confThreshHigh` (number): High confidence threshold for OCR results. Defaults to 85.
- `confThreshMed` (number): Medium confidence threshold for OCR results. Defaults to 75.
- `workerN` (number): Number of worker threads to use. Set before `init`.
- `langPath` (string): Path to Tesseract language data for offline/sandboxed environments.
- `displayMode` (string): Controls PDF output mode. Options: 'invis' (invisible text), 'ebook' (text only), 'proof' (semi-transparent overlay), 'eval' (debug color-coded overlay).
- `reflow` (boolean): Enables reflow for reconstructing reading order from layout during export.
- `progressHandler` (function): A custom handler for progress updates. Receives an object with `n` (page number) and `type`.
- `warningHandler` (function): A custom handler for warning messages. Receives the warning message string.
- `errorHandler` (function): A custom handler for error messages. Receives the error message string.
```

--------------------------------

### Scribe CLI Commands

Source: https://context7.com/scribeocr/scribe.js/llms.txt

The `scribe` CLI provides batch OCR, PDF overlay, and text extraction from the terminal.

```APIDOC
## CLI - `scribe` command-line interface

The `scribe` CLI (installed via `npm i -g scribe.js-ocr` or run with `npx`) provides batch OCR, PDF overlay, and text extraction from the terminal.

```bash
# Extract text from an image (outputs .txt file)
scribe extract scan.png

# Extract text from a PDF, output as searchable PDF
scribe extract document.pdf --format pdf

# Process all supported files in a directory
scribe extract ./scans/ --dir --format txt

# Recognize and create a searchable PDF with invisible text layer
scribe recognize scan.pdf --output ./output/

# Check OCR confidence of a file
scribe check document.pdf

# Overlay OCR text on a PDF with a visual proof overlay
scribe overlay scan.pdf --output ./output/ --vis

# Detect whether a PDF is text-native or image-based
scribe detect-pdf-type document.pdf
```
```

--------------------------------

### importFiles

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Import files for processing.
Supports various file types and input formats, including structured objects for explicit file type definition.

```APIDOC
## importFiles

Import files for processing. An object with `pdfFiles`, `imageFiles`, and `ocrFiles` arrays can be provided to import multiple types of files. Alternatively, for `File` objects (browser) and file paths (Node.js), a single array can be provided, which is sorted based on extension.

### Parameters

* `files` **([Array]<[File]> | FileList | [Array]<[string]> | [SortedInputFiles])**
```

--------------------------------

### Import Multiple Image Files

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Import multiple image files into the Scribe.js processing pipeline. The library sorts them alphabetically internally.

```javascript
import scribe from 'scribe.js-ocr';

// Multiple images (sorted alphabetically internally)
await scribe.importFiles(['page_01.png', 'page_02.png', 'page_03.png']);
```

--------------------------------

### Import and Use Scribe.js

Source: https://github.com/scribeocr/scribe.js/blob/master/README.md

Import Scribe.js in your JavaScript code for browser or Node.js environments. Use the `extractText` method to process image URLs and log the results.

```javascript
// Import statement in browser (use this instead of the Node.js import below):
// import scribe from './node_modules/scribe.js-ocr/scribe.js';
// Import statement for Node.js:
import scribe from 'scribe.js-ocr';

// Basic usage
scribe.extractText(['https://tesseract.projectnaptha.com/img/eng_bw.png'])
  .then((res) => console.log(res));
```

--------------------------------

### Configure Global Scribe.js Options

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Sets global configuration options that affect recognition, export, and performance. These must be set before the relevant operation is invoked.

```javascript
import scribe from 'scribe.js-ocr';

// Tune confidence thresholds
scribe.opt.confThreshHigh = 85; // default
scribe.opt.confThreshMed = 75;

// Control number of workers (set before init)
scribe.opt.workerN = 4;

// Use a local mirror for Tesseract language data (offline/sandboxed environments)
scribe.opt.langPath = '/static/tessdata';

// Control PDF output mode: 'invis' (invisible text), 'ebook' (text only),
// 'proof' (semi-transparent overlay), 'eval' (debug color-coded overlay)
scribe.opt.displayMode = 'invis';

// Export reflow: reconstruct reading order from layout
scribe.opt.reflow = true;

// Custom progress handler
scribe.opt.progressHandler = ({ n, type }) => {
  console.log(`Progress: page=${n} type=${type}`);
};

// Custom warning / error handlers
scribe.opt.warningHandler = (msg) => console.warn('[scribe warn]', msg);
scribe.opt.errorHandler = (msg) => console.error('[scribe error]', msg);

await scribe.init({ ocr: true });
await scribe.importFiles(['scan.png']);
await scribe.recognize({ langs: ['eng'] });
const pdf = await scribe.exportData('pdf');
await scribe.terminate();
```

--------------------------------

### scribe.recognize(options?)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Executes the Optical Character Recognition (OCR) process on files previously imported using `importFiles`. It supports various recognition modes and language options, including custom cloud-based models.

```APIDOC
## scribe.recognize(options?)

### Description
Runs optical character recognition on all pages currently loaded via `importFiles`. Supports built-in Tesseract (combined LSTM + legacy), speed-only mode, and pluggable cloud/custom recognition models.

### Method
`recognize`

### Parameters
- **options** (object) - Optional - Configuration options for the recognition process, including `langs`, `mode`, `modeAdv`, `config`, and `model`.
  - **langs** (Array) - Optional - Languages to use for OCR.
  - **mode** (string) - Optional - Recognition mode ('speed' for faster processing).
  - **modeAdv** (string) - Optional - Advanced recognition mode (e.g., 'lstm').
  - **config** (object) - Optional - Custom Tesseract configuration parameters.
  - **model** (object) - Optional - A custom recognition model (e.g., `RecognitionModelTextract`).
  - **modelOptions** (object) - Optional - Options specific to the chosen custom model.

### Request Example
```javascript
import scribe from 'scribe.js-ocr';
import { RecognitionModelTextract } from '@scribe.js/aws-textract';

await scribe.importFiles(['scan.png']);

// Default: combined LSTM + legacy, English
await scribe.recognize({ langs: ['eng'] });

// Speed mode (faster, similar to raw Tesseract.js)
await scribe.recognize({ mode: 'speed', langs: ['eng'] });

// Advanced: use only LSTM model, custom Tesseract config
await scribe.recognize({
  modeAdv: 'lstm',
  langs: ['deu'],
  config: { tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' },
});

// Custom cloud model: AWS Textract
await scribe.importFiles(['document.pdf']);
await scribe.recognize({
  model: RecognitionModelTextract,
  modelOptions: { analyzeLayout: true },
});

const text = await scribe.exportData('text');
console.log(text);

await scribe.terminate();
```
```

--------------------------------

### writeDebugImages

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Writes debug images for visual inspection of processing steps.

```APIDOC
## writeDebugImages

### Parameters

* `ctx`
* `compDebugArrArr` **[Array]<[Array]>**
* `filePath` **[string]**
```

--------------------------------

### Parameters for Recognition

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

This section details the options available for configuring the recognition process in Scribe.js.

```APIDOC
## Parameters for Recognition

### Description
Configuration options for the recognition process.

### Parameters
#### options (Object) - Optional, default `{}`
- **options.mode** (`"speed"` | `"quality"`) - Optional, default `'quality'`: Recognition mode.
- **options.langs** (Array) - Optional, default `['eng']`: Language(s) in the document.
- **options.modeAdv** (`"lstm"` | `"legacy"` | `"combined"`) - Optional, default `'combined'`: Alternative method of setting recognition mode.
- **options.combineMode** (`"conf"` | `"data"` | `"none"`) - Optional, default `'data'`: Method of combining OCR results. Used if OCR data already exists.
- **options.vanillaMode** (boolean) - Optional, default `false`: Whether to use the vanilla Tesseract.js model.
- **options.config** (Object) - Optional, default `{}`: Config parameters to pass to Tesseract.js.
```

--------------------------------

### scribe.importFiles(files)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Loads various file types (PDF, images, hOCR/XML, .scribe sessions) into the Scribe.js processing pipeline. It accepts different input formats depending on the environment (Node.js vs. Browser).

```APIDOC
## scribe.importFiles(files)

### Description
Loads PDF, image (PNG/JPEG), hOCR/XML, or `.scribe` session files into the internal pipeline. Accepts an array of paths (Node.js), URLs or `File` objects (browser), a `FileList`, or a pre-sorted `SortedInputFiles` object. Must be called before `recognize` or `exportData`.

### Method
`importFiles`

### Parameters
- **files** (Array) - Required - An array of file paths, URLs, File objects, FileList, or a `SortedInputFiles` object containing file data.

### Request Example
```javascript
import scribe from 'scribe.js-ocr';
import fs from 'node:fs';

// Single PDF
await scribe.importFiles(['invoice.pdf']);

// Multiple images (sorted alphabetically internally)
await scribe.importFiles(['page_01.png', 'page_02.png', 'page_03.png']);

// Pre-sorted SortedInputFiles object: required when passing ArrayBuffers
const pdfFile = fs.readFileSync('scan.pdf');
// Slice out exactly this file's bytes; `.buffer` alone can expose a larger shared ArrayBuffer
const pdfBuffer = pdfFile.buffer.slice(pdfFile.byteOffset, pdfFile.byteOffset + pdfFile.byteLength);
await scribe.importFiles({
  pdfFiles: [pdfBuffer], // ArrayBuffer
});

// Combined: image pages + existing hOCR data
await scribe.importFiles({
  imageFiles: ['scan.png'],
  ocrFiles: ['scan.hocr'],
});

console.log(scribe.inputData.pageCount); // => number of pages imported
```
```

--------------------------------

### terminate

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Terminates the program and releases all allocated resources.

```APIDOC
## terminate

Terminates the program and releases resources.
```

--------------------------------

### Import Single PDF File

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Import a single PDF file into the Scribe.js processing pipeline using its file path.

```javascript
import scribe from 'scribe.js-ocr';

// Single PDF
await scribe.importFiles(['invoice.pdf']);
```

--------------------------------

### Download OCR Data to File

Source: https://context7.com/scribeocr/scribe.js/llms.txt

A convenience function that exports OCR data and saves it as a file. Supports various formats and page range options.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['report.pdf']);
await scribe.recognize();

// Save searchable PDF to disk (Node.js) or trigger download (browser)
await scribe.download('pdf', 'report.pdf');

// Save plain text output
await scribe.download('txt', 'report.txt');

// Save only first 3 pages as DOCX
await scribe.download('docx', 'report.docx', { minPage: 0, maxPage: 2 });

await scribe.terminate();
```

--------------------------------

### Recognize Text with Advanced LSTM and Custom Config

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Utilize advanced recognition options, such as using only the LSTM model and providing a custom Tesseract configuration for character whitelisting.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.png']);

// Advanced: use only LSTM model, custom Tesseract config
await scribe.recognize({
  modeAdv: 'lstm',
  langs: ['deu'],
  config: { tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' },
});
```

--------------------------------

### Import Mixed Image and OCR Files

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Import both image files and pre-existing OCR data (e.g., hOCR) into the Scribe.js pipeline. The `pageCount` can be checked after import.

```javascript
import scribe from 'scribe.js-ocr';

// Combined: image pages + existing hOCR data
await scribe.importFiles({
  imageFiles: ['scan.png'],
  ocrFiles: ['scan.hocr'],
});

console.log(scribe.inputData.pageCount); // => number of pages imported
```

--------------------------------

### Recognize Text with AWS Textract Cloud Model

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Integrate with AWS Textract for OCR by specifying the `RecognitionModelTextract` and its options. The extracted text can then be exported.

```javascript
import scribe from 'scribe.js-ocr';
import { RecognitionModelTextract } from '@scribe.js/aws-textract';

await scribe.importFiles(['document.pdf']);
await scribe.recognize({
  model: RecognitionModelTextract,
  modelOptions: { analyzeLayout: true },
});

const text = await scribe.exportData('text');
console.log(text);

await scribe.terminate();
```

--------------------------------

### Export OCR Data in Various Formats

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Exports OCR data to formats like 'txt', 'pdf', 'hocr', 'alto', 'html', 'md', 'docx', 'xlsx', or 'scribe'. Options allow specifying page ranges or arrays of pages.

```javascript
import scribe from 'scribe.js-ocr';
import fs from 'node:fs';

await scribe.importFiles(['multi-page.pdf']);
await scribe.recognize({ langs: ['eng'] });

// Plain text: all pages
const txt = await scribe.exportData('txt');
console.log(txt);

// Searchable PDF (invisible text layer overlaid on original)
const pdfBytes = await scribe.exportData('pdf');
fs.writeFileSync('output.pdf', Buffer.from(pdfBytes));

// hOCR: pages 2–4 only (0-based indices 1–3)
const hocr = await scribe.exportData('hocr', { minPage: 1, maxPage: 3 });

// Specific pages via array
const partial = await scribe.exportData('txt', { pageArr: [0, 2, 4] });

// Markdown with table preservation
const md = await scribe.exportData('md');

// Scribe session format (compressed, for later restore)
const sessionBuf = await scribe.exportData('scribe');
fs.writeFileSync('session.scribe', Buffer.from(sessionBuf));

await scribe.terminate();
```

--------------------------------

### scribe.importFilesSupp(files, ocrName)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Loads supplemental OCR data, such as ground-truth or alternate engine outputs, to be used alongside the primary OCR data for comparison and evaluation.

```APIDOC
## scribe.importFilesSupp(files, ocrName)

### Description
Loads an additional OCR version (e.g., ground-truth or an alternate engine's output) alongside the primary OCR data, enabling comparison and evaluation workflows.

### Parameters
#### `files` (array of strings, required)
- An array of file paths or URLs to the supplemental OCR data files.

#### `ocrName` (string, required)
- A name to identify this supplemental OCR data, used for referencing it later (e.g., 'Ground Truth').

### Returns
- `Promise`: This function performs an import operation and does not return a value.
```

--------------------------------

### Recognize Text in Speed Mode

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Perform OCR in speed mode for faster processing, similar to raw Tesseract.js. Specify the desired languages for recognition.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.png']);

// Speed mode (faster, similar to raw Tesseract.js)
await scribe.recognize({ mode: 'speed', langs: ['eng'] });
```

--------------------------------

### clear

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Clears all document-specific data from the current session.

```APIDOC
## clear

Clears all document-specific data.
```

--------------------------------

### Extract Text from Image (Node.js)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Use this to extract text from an image file in a Node.js environment. Remember to terminate the process to release workers.

```javascript
import scribe from 'scribe.js-ocr';

const text = await scribe.extractText(['scan.png']);
console.log(text); // => "Hello World\n..."

await scribe.terminate(); // release workers
```

--------------------------------

### API Endpoint: POST /ocr

Source: https://github.com/scribeocr/scribe.js/blob/master/examples/server-textract-proxy/README.md

This endpoint accepts raw PDF bytes and returns NDJSON results streamed per page, processed by AWS Textract.

```APIDOC
## POST /ocr

### Description
Accepts raw PDF bytes and returns NDJSON results streamed per page, processed by AWS Textract. Credentials remain on the server.

### Method
POST

### Endpoint
/ocr

### Parameters
#### Request Body
- **body** (binary) - Required - Raw PDF bytes, `Content-Type: application/pdf`

### Response
#### Success Response (200)
- **application/x-ndjson** - Streamed NDJSON lines, one per page.
  - `{"pageNum": 0, "rawData": ""}`
  - Lines are flushed as each page completes.

#### Failure Response
- **application/x-ndjson** - Streamed NDJSON lines, one per page.
  - `{"pageNum": 0, "error": {"message": "..."}}`
```

--------------------------------

### scribe.download(format, fileName, options?)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

A convenience function that exports OCR data and saves it as a file. It acts as a wrapper around `exportData` and handles file saving in both browser and Node.js environments.

```APIDOC
## scribe.download(format, fileName, options?)

### Description
Exports the active OCR data and saves it as a file download (browser) or writes to the local filesystem (Node.js). This is a convenience wrapper around `exportData`.

### Parameters
#### `format` (string, required)
- The desired output format (e.g., 'pdf', 'txt', 'docx').

#### `fileName` (string, required)
- The name of the file to save the output as.

#### `options` (object, optional)
- `minPage` (number): The starting page index (0-based) for export.
- `maxPage` (number): The ending page index (0-based) for export.
- `pageArr` (array of numbers): An array of specific page indices to export.

### Returns
- `Promise`: This function does not return a value but performs a file save operation.
```

--------------------------------

### `scribe.utils.calcConf(ocrPages)` - Compute confidence statistics

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Returns aggregate high-confidence and total word counts across an array of OCR pages, useful for estimating recognition quality.

```APIDOC
## `scribe.utils.calcConf(ocrPages)` - Compute confidence statistics

Returns aggregate high-confidence and total word counts across an array of OCR pages, useful for estimating recognition quality.

```js
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.pdf']);
await scribe.recognize();

const { highConf, total } = scribe.utils.calcConf(scribe.data.ocr.active);
console.log(`High-confidence words: ${highConf} / ${total}`);
console.log(`Estimated accuracy: ${((highConf / total) * 100).toFixed(1)}%`);

await scribe.terminate();
```
```

--------------------------------

### scribe.extractText(files, langs?, outputFormat?, options?)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Performs text extraction from provided files in a single asynchronous call. It handles both text-native PDFs and image-based files, automatically performing OCR when necessary. Supports various input types like URLs, file paths, and File/FileList objects.

```APIDOC
## scribe.extractText(files, langs?, outputFormat?, options?)

### Description
Imports files, runs OCR when needed, and returns extracted text in a single async call. For text-native PDFs the existing text is returned directly; for image-based files OCR is performed automatically. Accepts URLs (browser), file paths (Node.js), or `File`/`FileList` objects (browser).

### Method
`extractText`

### Parameters
- **files** (Array) - Required - An array of file paths, URLs, or File objects to process.
- **langs** (Array) - Optional - An array of language codes for OCR (e.g., ['eng', 'fra']).
- **outputFormat** (string) - Optional - The desired output format (e.g., 'txt').
- **options** (object) - Optional - Configuration options, such as `{ skipRecPDFTextNative: true }`.

### Request Example
```javascript
// Node.js: extract text from an image
import scribe from 'scribe.js-ocr';

const text = await scribe.extractText(['scan.png']);
console.log(text); // => "Hello World\n..."

await scribe.terminate(); // release workers

// Node.js: extract text from a PDF, specifying languages
// (named `pdfText` so it does not redeclare `text` from the example above)
const pdfText = await scribe.extractText(
  ['document.pdf'],
  ['eng', 'fra'],                  // languages
  'txt',                           // output format
  { skipRecPDFTextNative: true }   // skip OCR for text-native PDFs (default)
);
await scribe.terminate();

// Browser: extract from user-uploaded files
// (a separate module from the Node.js examples above; adjust the import path to your setup)
import scribe from 'node_modules/scribe.js-ocr/scribe.js';

await scribe.init({ ocr: true, font: true }); // pre-load for faster response

document.getElementById('uploader').addEventListener('change', async (e) => {
  const text = await scribe.extractText(e.target.files);
  console.log(text);
});
```
```

--------------------------------

### scribe.exportData(format?, options?)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Exports the active OCR data to the requested format. Supported formats include 'txt', 'pdf', 'hocr', 'alto', 'html', 'md', 'docx', 'xlsx', and 'scribe'. Options can specify page ranges or specific pages.

```APIDOC
## scribe.exportData(format?, options?)

### Description
Exports the active OCR data to the requested format and returns the content as a `string` or `ArrayBuffer`. Supported formats: `'txt'`, `'pdf'`, `'hocr'`, `'alto'`, `'html'`, `'md'`, `'docx'`, `'xlsx'`, `'scribe'`.

### Parameters
#### `format` (string, optional)
- The desired output format. Defaults to 'txt' if not specified.

#### `options` (object, optional)
- `minPage` (number): The starting page index (0-based) for export.
- `maxPage` (number): The ending page index (0-based) for export.
- `pageArr` (array of numbers): An array of specific page indices to export.
### Returns
- `Promise` resolving to a `string` or `ArrayBuffer`: The exported OCR data in the specified format.
```

--------------------------------

### Computing Confidence Statistics with Scribe.js

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Calculates aggregate high-confidence and total word counts from OCR pages to estimate recognition quality. This utility function is useful for assessing the accuracy of the OCR process.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.pdf']);
await scribe.recognize();

const { highConf, total } = scribe.utils.calcConf(scribe.data.ocr.active);
console.log(`High-confidence words: ${highConf} / ${total}`);
console.log(`Estimated accuracy: ${((highConf / total) * 100).toFixed(1)}%`);

await scribe.terminate();
```

--------------------------------

### Accessing Internal OCR and Image Data

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Provides access to raw OCR page objects, font state, image cache, layout regions, and per-page metrics after processing.

```APIDOC
## `scribe.data` - Accessing internal OCR and image data

Provides access to raw OCR page objects, font state, image cache, layout regions, and per-page metrics after processing.

```js
import scribe from 'scribe.js-ocr';

await scribe.importFiles(['scan.png']);
await scribe.recognize();

// Iterate OCR words on page 0
const page0 = scribe.data.ocr.active[0];
for (const line of page0.lines) {
  for (const word of line.words) {
    console.log(word.text, word.conf); // text and confidence score
  }
}

// Total page count
console.log(scribe.inputData.pageCount);

// Confidence summary across all pages
const { highConf, total } = scribe.utils.calcConf(scribe.data.ocr.active);
console.log(`Accuracy estimate: ${((highConf / total) * 100).toFixed(1)}%`);

await scribe.terminate();
```
```

--------------------------------

### exportData

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Export active OCR data to a specified format, with options for page range.

```APIDOC
## exportData

Export active OCR data to specified format.

### Parameters

* `format` **(`"pdf"` | `"hocr"` | `"docx"` | `"xlsx"` | `"txt"` | `"text"`)** (optional, default `'txt'`)
* `minPage` **[number]** First page to export. (optional, default `0`)
* `maxPage` **[number]** Last page to export (inclusive). -1 exports through the last page. (optional, default `-1`)

Returns **[Promise]<([string] | [ArrayBuffer])>**
```

--------------------------------

### Extract Text from PDF with Language and Format (Node.js)

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Extract text from a PDF file, specifying the desired languages and output format. The `skipRecPDFTextNative` option can be used to skip OCR for text-native PDFs.

```javascript
import scribe from 'scribe.js-ocr';

const text = await scribe.extractText(
  ['document.pdf'],
  ['eng', 'fra'],                  // languages
  'txt',                           // output format
  { skipRecPDFTextNative: true }   // skip OCR for text-native PDFs (default)
);

await scribe.terminate();
```

--------------------------------

### scribe.terminate()

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Releases all resources used by Scribe.js, including worker threads for OCR, PDF rendering, and font engines. This should always be called when processing is complete to prevent resource leaks.

```APIDOC
## scribe.terminate()

### Description
Terminates the underlying worker threads (Tesseract, PDF renderer, font engine) and frees memory. Always call this when you are done processing to avoid resource leaks, especially in Node.js.

### Returns
- `Promise`: This function performs a termination operation and does not return a value.
```

--------------------------------

### Terminate Scribe.js Resources

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Frees all resources by terminating worker threads. Essential for preventing memory leaks, especially in Node.js environments.

```javascript
import scribe from 'scribe.js-ocr';

try {
  const text = await scribe.extractText(['scan.pdf']);
  console.log(text);
} finally {
  await scribe.terminate();
}
```

--------------------------------

### extractText

Source: https://github.com/scribeocr/scribe.js/blob/master/docs/API.md

Function for extracting text from image and PDF files with a single function call. Handles PDF text extraction or OCR based on file type and options.

```APIDOC
## extractText

Function for extracting text from image and PDF files with a single function call.
By default, existing text content is extracted for text-native PDF files; otherwise text is extracted using OCR.
To control how text from PDF files is handled, set the options in the `opt.usePDFText` object.
For more control, use `init`, `importFiles`, `recognize`, and `exportData` separately.

### Parameters

* `files`
* `langs` **[Array]<[string]>** (optional, default `['eng']`)
* `outputFormat` (optional, default `'txt'`)
* `options` (optional, default `{}`)
```

--------------------------------

### Clear Document Data for New Processing

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Clears all OCR results and related data for the current document, allowing a new document to be processed within the same session without re-initializing workers.

```javascript
import scribe from 'scribe.js-ocr';

await scribe.init({ ocr: true });

// Process first document
await scribe.importFiles(['doc1.pdf']);
await scribe.recognize();
const text1 = await scribe.exportData('txt');

// Clear and process second document
await scribe.clear();
await scribe.importFiles(['doc2.pdf']);
await scribe.recognize();
const text2 = await scribe.exportData('txt');

await scribe.terminate();
```

--------------------------------

### scribe.clear()

Source: https://context7.com/scribeocr/scribe.js/llms.txt

Clears all document-specific data, including OCR results and images, without releasing the underlying worker resources. This is useful for processing multiple documents within the same session.

```APIDOC
## scribe.clear()

### Description
Clears all document-specific data (OCR results, images, page metrics) without releasing the underlying workers. Use this to process a new document in the same session without the overhead of re-initializing.

### Returns
- `Promise`: This function performs a clear operation and does not return a value.
```
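--------------------------------

### Batch Processing Sketch: `clear()` per Document, `terminate()` Once

The `clear()` and `terminate()` entries above suggest a session pattern: initialize workers once, clear document data between files, and release the workers at the end. The following is a minimal sketch of that pattern, not part of scribe.js itself. The helper name `processDocuments` and the shape of its return value are hypothetical. It calls only the methods documented above (`init`, `importFiles`, `recognize`, `exportData`, `clear`, `terminate`) and takes the `scribe` object as a parameter so it can be tried against a stub.

```javascript
// Hypothetical helper (not a scribe.js API): OCR several documents in one
// session, reusing the same workers for every file.
// `scribe` is expected to expose the methods documented above.
async function processDocuments(scribe, paths, format = 'txt') {
  const results = [];

  // Load the OCR workers once, before the first document.
  await scribe.init({ ocr: true });

  try {
    for (const path of paths) {
      await scribe.importFiles([path]);
      await scribe.recognize();

      // Collect the exported content alongside the input path.
      results.push({ path, output: await scribe.exportData(format) });

      // Drop document-specific data but keep the workers alive.
      await scribe.clear();
    }
  } finally {
    // Release the workers even if one of the documents failed.
    await scribe.terminate();
  }

  return results;
}
```

With the real library this would presumably be invoked as `await processDocuments(scribe, ['doc1.pdf', 'doc2.pdf'])` after `import scribe from 'scribe.js-ocr'`. Placing `terminate()` in `finally` mirrors the try/finally example above, so workers are released even when a document fails partway through.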