### UAT Script Execution Against Unpacked Distribution

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/run-uat-script.adoc

This example demonstrates how to set up and run the UAT script against a Tika server unpacked from a standard distribution zip. It includes starting the server in the background and then executing the script.

```bash
unzip tika-server-standard-.zip -d /tmp/tika-server-dist
cd /tmp/tika-server-dist
java -jar tika-server-standard-.jar -p 9998 -h localhost &
sleep 12
~/path/to/tika/release-tools/uat/run-uat.sh
```

--------------------------------

### Tika gRPC Server Development Mode Startup Logs

Source: https://github.com/apache/tika/blob/main/tika-grpc/README.md

Example output observed when the Tika gRPC server starts in development mode, indicating successful plugin resolution and server startup.

```text
INFO TikaPluginManager running in DEVELOPMENT mode
INFO PF4J version 3.14.0 in 'development' mode
INFO Plugin 'tika-pipes-file-system-plugin@4.0.0-SNAPSHOT' resolved
INFO Plugin 'tika-pipes-s3-plugin@4.0.0-SNAPSHOT' resolved
...
INFO Server started, listening on 50052
```

--------------------------------

### Basic Parsing Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

A basic example demonstrating how to perform document parsing using the Tika server. This snippet is intended as a starting point for simple parsing tasks.

```bash
curl -T document.pdf http://localhost:9998/tika/text
```

--------------------------------

### Complete S3 Pipeline Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/s3.adoc

This example demonstrates a complete pipeline configuration using S3 fetcher, emitter, and iterator. It lists files from an input bucket and writes results to an output bucket.
```json
{
  "fetcher": {
    "type": "s3",
    "bucket": "my-tika-input",
    "path": "incoming/",
    "fetcherId": "s3-fetcher"
  },
  "emitter": {
    "type": "s3",
    "bucket": "my-tika-output",
    "path": "results/",
    "emitterId": "s3-emitter"
  },
  "iterator": {
    "type": "s3",
    "fetcherId": "s3-fetcher",
    "emitterId": "s3-emitter"
  }
}
```

--------------------------------

### Install Tika Server as a Linux Service

Source: https://github.com/apache/tika/blob/main/tika-server/README.md

Installs the Tika Server as a service on Linux using the provided installation script and a binary distribution zip file. Requires root privileges.

```bash
unzip -j tika-server-standard-.zip bin/install_tika_service.sh
```

```bash
./install_tika_service.sh ./tika-server-standard-.zip
```

--------------------------------

### Setup Tika-App Integration Test Environment

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-app.adoc

Prepare the environment for Tika-App integration tests by creating a dedicated directory, copying the distribution ZIP, and extracting it. This ensures tests run in an isolated and reproducible setup.

```bash
# Create test directory
mkdir -p /tmp/tika-app-test
cd /tmp/tika-app-test

# Copy and extract distribution
cp /path/to/tika-app-4.0.0-SNAPSHOT.zip .
unzip tika-app-4.0.0-SNAPSHOT.zip
cd tika-app-4.0.0-SNAPSHOT
```

--------------------------------

### Start Tika Server with Config File

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Starts the Tika server using a custom configuration file named 'tika-config.json'. Ensure this file exists in the same directory or provide a full path.
```bash
java -jar tika-server-standard-4.0.0-SNAPSHOT.jar -c tika-config.json &
```

--------------------------------

### Start Jina VLM Server

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/local-vlm-server.adoc

Activate the virtual environment and start the Jina VLM server. You can specify a different model or port. The first run downloads model weights.

```bash
source .venv/bin/activate
python server.py                                    # default: jina-vlm on port 8000
python server.py --model Qwen/Qwen2-VL-2B-Instruct  # use Qwen instead
python server.py --port 9000                        # custom port
```

--------------------------------

### Install einops Dependency

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/local-vlm-server.adoc

Install the 'einops' library if you encounter related import errors.

```bash
pip install einops
```

--------------------------------

### Preview Site with Node.js HTTP Server

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/maintainers/site.adoc

Starts a Node.js http-server to preview the documentation locally. Access the site at http://localhost:8000.

```bash
npx http-server docs/target/site -p 8000
```

--------------------------------

### Development Configuration Example (JSON)

Source: https://github.com/apache/tika/blob/main/tika-grpc/README.md

Example JSON configuration file for Tika GRPC server in development mode. It specifies plugin roots using relative paths to 'target/classes' and configures a file system fetcher.
```json
{
  "plugin-roots": [
    "../tika-pipes/tika-pipes-plugins/tika-pipes-file-system/target/classes",
    "../tika-pipes/tika-pipes-plugins/tika-pipes-http/target/classes",
    "../tika-pipes/tika-pipes-plugins/tika-pipes-s3/target/classes"
  ],
  "fetchers": [
    {
      "fs": {
        "myFetcher": {
          "basePath": "/tmp/input"
        }
      }
    }
  ]
}
```

--------------------------------

### Start Tika Server on Custom Host

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Starts the Tika server and binds it to all available network interfaces (0.0.0.0) on port 9998. This allows connections from any host.

```bash
java -jar tika-server-standard-4.0.0-SNAPSHOT.jar --host 0.0.0.0 --port 9998 &
```

--------------------------------

### Install Helm and Artifactory Plugin

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/maintainers/release-guides/helm.adoc

Install Helm and the Artifactory push plugin required for publishing charts. Ensure you are using the correct version for the plugin.

```bash
# Install Helm (macOS)
brew install helm

# Install the Artifactory push plugin
helm plugin install https://github.com/belitre/helm-push-artifactory-plugin --version 1.0.2
```

--------------------------------

### Full Tika 4.x JSON Configuration Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc

This example demonstrates a complete Tika 4.x JSON configuration file, showcasing commonly used parser settings. Refer to individual parser documentation for all available options.
```json
{
  "tika-config": {
    "parsers": {
      "application/pdf": [
        {
          "class": "org.apache.tika.parser.pdf.PDFParser",
          " திக": { "extractInlineImages": true }
        }
      ],
      "image/jpeg": [ { "class": "org.apache.tika.parser.jpeg.JpegParser" } ],
      "image/png": [ { "class": "org.apache.tika.parser.png.PngParser" } ],
      "message/rfc822": [ { "class": "org.apache.tika.parser.mail.RFC822Parser" } ],
      "message/x-msg": [ { "class": "org.apache.tika.parser.microsoft.msg.OutlookMessageParser" } ],
      "application/vnd.ms-outlook": [ { "class": "org.apache.tika.parser.microsoft.msg.OutlookMessageParser" } ],
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document": [ { "class": "org.apache.tika.parser.microsoft.docx.DocxParser" } ],
      "application/vnd.openxmlformats-officedocument.presentationml.presentation": [ { "class": "org.apache.tika.parser.microsoft.xslf.XSLFParser" } ],
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": [ { "class": "org.apache.tika.parser.microsoft.xlsx.XLSXParser" } ],
      "application/msword": [ { "class": "org.apache.tika.parser.microsoft.OldWordParser" } ],
      "application/vnd.ms-excel": [ { "class": "org.apache.tika.parser.microsoft.OldExcelParser" } ],
      "application/vnd.ms-powerpoint": [ { "class": "org.apache.tika.parser.microsoft.OldPowerPointParser" } ],
      "text/html": [ { "class": "org.apache.tika.parser.html.HtmlParser" } ],
      "text/plain": [ { "class": "org.apache.tika.parser.PlaintextParser" } ],
      "application/xml": [ { "class": "org.apache.tika.parser.XMLParser" } ],
      "application/json": [ { "class": "org.apache.tika.parser.JSONParser" } ],
      "application/octet-stream": [ { "class": "org.apache.tika.parser.EmptyParser" } ]
    },
    "detectors": {
      "application/pdf": [ { "class": "org.apache.tika.parser.pdf.PDFParser" } ],
      "image/jpeg": [ { "class": "org.apache.tika.parser.jpeg.JpegParser" } ],
      "image/png": [ { "class": "org.apache.tika.parser.png.PngParser" } ],
      "message/rfc822": [ { "class": "org.apache.tika.parser.mail.RFC822Parser" } ],
      "message/x-msg": [ { "class": "org.apache.tika.parser.microsoft.msg.OutlookMessageParser" } ],
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document": [ { "class": "org.apache.tika.parser.microsoft.docx.DocxParser" } ],
      "application/vnd.openxmlformats-officedocument.presentationml.presentation": [ { "class": "org.apache.tika.parser.microsoft.xslf.XSLFParser" } ],
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": [ { "class": "org.apache.tika.parser.microsoft.xlsx.XLSXParser" } ],
      "application/msword": [ { "class": "org.apache.tika.parser.microsoft.OldWordParser" } ],
      "application/vnd.ms-excel": [ { "class": "org.apache.tika.parser.microsoft.OldExcelParser" } ],
      "application/vnd.ms-powerpoint": [ { "class": "org.apache.tika.parser.microsoft.OldPowerPointParser" } ],
      "text/html": [ { "class": "org.apache.tika.parser.html.HtmlParser" } ],
      "text/plain": [ { "class": "org.apache.tika.parser.PlaintextParser" } ],
      "application/xml": [ { "class": "org.apache.tika.parser.XMLParser" } ],
      "application/json": [ { "class": "org.apache.tika.parser.JSONParser" } ],
      "application/octet-stream": [ { "class": "org.apache.tika.parser.EmptyParser" } ]
    }
  }
}
```

--------------------------------

### Tika Server 4.x PUT Endpoint Examples

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/migration-to-4x/migrating-tika-server-4x.adoc

Demonstrates how to use the PUT endpoints of the Tika Server 4.x for simple content extraction. These examples show how to retrieve plain text, JSON with metadata and text, and JSON with HTML content.
```bash
# Get plain text
curl -T document.pdf http://localhost:9998/tika/text

# Get JSON with metadata and text
curl -T document.pdf http://localhost:9998/tika/json

# Get JSON with HTML content
curl -T document.pdf http://localhost:9998/tika/json/html
```

--------------------------------

### Start Tika Server

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/using-tika/server/index.adoc

Run the Tika Server using the standard JAR file. The server defaults to running on localhost:9998.

```bash
java -jar tika-server-standard-X.Y.Z.jar
```

--------------------------------

### Self-Configuring PDF Parser Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/developers/serialization.adoc

Demonstrates a PDFParser implementing SelfConfiguring, which handles its own configuration via ParseContext.

```java
package com.example.tika;

import org.apache.tika.config.TikaComponent;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.filter.MetadataFilter;

@TikaComponent
public class UpperCaseFilter implements MetadataFilter {

    // Name of the metadata field whose value is upper-cased
    private String fieldName = "title";

    public void setFieldName(String fieldName) {
        this.fieldName = fieldName;
    }

    public String getFieldName() {
        return fieldName;
    }

    @Override
    public void filter(Metadata metadata) throws TikaException {
        String value = metadata.get(fieldName);
        if (value != null) {
            metadata.set(fieldName, value.toUpperCase());
        }
    }
}
```

--------------------------------

### Complete File System Pipeline Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/filesystem.adoc

This is a canonical filesystem-to-filesystem integration test configuration. Replace path tokens with real paths and EMIT_INTERMEDIATE_RESULTS with a boolean value.
```json
{
  "tika-config": "tika-config-basic.json",
  "fetcher": {
    "basePath": "FETCHER_BASE_PATH",
    "extractFileSystemMetadata": true
  },
  "emitter": {
    "basePath": "EMITTER_BASE_PATH",
    "emitIntermediateResults": EMIT_INTERMEDIATE_RESULTS
  },
  "pipesIterator": {
    "fetcherId": "fsf",
    "emitterId": "fse"
  },
  "plugins": {
    "paths": [ "PLUGINS_PATHS" ]
  }
}
```

--------------------------------

### HTTP Fetcher Configuration Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/http.adoc

This JSON snippet shows a sample configuration for the HTTP fetcher, demonstrating how to set up authentication, proxy, timeouts, and other parameters.

```json
{
  "tika-pipes-http": {
    "userName": "user",
    "password": "password",
    "proxyHost": "proxy.example.com",
    "proxyPort": 8080,
    "userAgent": "MyTikaClient/1.0",
    "maxConnections": 100,
    "maxConnectionsPerRoute": 10,
    "connectTimeoutMillis": 30000,
    "socketTimeoutMillis": 60000,
    "requestTimeoutMillis": 60000,
    "overallTimeoutMillis": 120000,
    "maxRedirects": 5,
    "maxSpoolSize": 10485760,
    "maxErrMsgSize": 100000,
    "httpHeaders": [
      "Accept: application/json",
      "X-Custom-Header: my-value"
    ],
    "httpRequestHeaders": {
      "Content-Type": ["application/xml", "text/xml"]
    },
    "jwtIssuer": "https://example.com/auth",
    "jwtSubject": "tika-client",
    "jwtExpiresInSeconds": 3600,
    "jwtSecret": "my-super-secret-key"
  }
}
```

--------------------------------

### Java Usage Example for Ignite ConfigStore

Source: https://github.com/apache/tika/blob/main/tika-pipes/tika-pipes-config-store-ignite/README.md

This Java code snippet indicates that configuration is automatically loaded from tika-config.json and the Ignite cluster forms automatically across nodes with the same igniteInstanceName. No explicit setup is required in the code itself.
```java
// Configuration is automatically loaded from tika-config.json
// The Ignite cluster will form automatically across all nodes
// with the same igniteInstanceName
```

--------------------------------

### Preview Site with Python HTTP Server

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/maintainers/site.adoc

Starts a Python HTTP server in the generated site directory to preview the documentation locally. Access the site at http://localhost:8000.

```bash
cd docs/target/site
python3 -m http.server 8000
```

--------------------------------

### Filesystem-to-Filesystem Pipeline Configuration

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/configuration.adoc

An example configuration for a filesystem-to-filesystem pipeline. Tokens like FETCHER_BASE_PATH and EMITTER_BASE_PATH need to be replaced with actual paths in production.

```json
{
  "project": {
    "pipes": {
      "emitStrategy": {
        "type": "DYNAMIC",
        "thresholdBytes": 100000
      }
    }
  }
}
```

--------------------------------

### Create Project Directory and Virtual Environment

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/local-vlm-server.adoc

Set up a new project directory for the VLM server, install Python 3.12.10 locally for the project, create a virtual environment, and activate it.

```bash
mkdir ~/vlm-server && cd ~/vlm-server
pyenv local 3.12.10
python3 -m venv .venv
source .venv/bin/activate
```

--------------------------------

### Start Tika Server with Docker Compose

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/using-tika/server/tls.adoc

Command to start the Tika Server in detached mode using Docker Compose.

```bash
docker-compose up -d
```

--------------------------------

### Install Python 3.12 with pyenv

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/local-vlm-server.adoc

Install Python version 3.12.10 using pyenv.
This version is recommended due to compatibility with the ML ecosystem.

```bash
pyenv install 3.12.10
```

--------------------------------

### Setup Tika-Server Integration Test Directory

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Create a dedicated directory for integration tests and navigate into it. This is the first step before setting up the tika-server for testing.

```bash
# Create test directory
mkdir -p /tmp/tika-server-test
cd /tmp/tika-server-test
```

--------------------------------

### GET /version

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Retrieves the Tika Server version. This is a simple GET request to check the server's operational status and version information.

```APIDOC
## GET /version

### Description
Retrieves the Tika Server version.

### Method
GET

### Endpoint
/version

### Response
#### Success Response (200)
- **version** (string) - The Apache Tika version string (e.g., `Apache Tika X.X.X`)
```

--------------------------------

### Main Server Startup Function

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/local-vlm-server.adoc

Initializes the VLM model, processor, and device, then starts the FastAPI server using Uvicorn. It supports specifying the model and port via command-line arguments.
```python
import argparse

import torch
import uvicorn
from transformers import AutoModelForCausalLM, AutoProcessor

# `app` is the FastAPI application; it is assumed to be defined earlier in server.py.


def main():
    global model, processor, device

    parser = argparse.ArgumentParser(
        description="VLM OpenAI-compatible server (localhost only)"
    )
    parser.add_argument(
        "--model", type=str, default="jinaai/jina-vlm",
        help="Hugging Face model name or local path"
    )
    parser.add_argument(
        "--port", type=int, default=8000,
        help="Port (default: 8000)"
    )
    args = parser.parse_args()

    # Prefer the Apple Silicon GPU when available, otherwise fall back to CPU
    if torch.backends.mps.is_available():
        device = "mps"
        print("Using Apple Silicon GPU (MPS)")
    else:
        device = "cpu"
        print("MPS not available, using CPU (this will be slow)")

    print(f"Loading model {args.model}...")
    processor = AutoProcessor.from_pretrained(
        args.model, use_fast=False, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        args.model,
        device_map=device,
        torch_dtype=torch.float16,
        trust_remote_code=True,
    )

    print(f"Model loaded. Starting server on http://127.0.0.1:{args.port}")
    uvicorn.run(app, host="127.0.0.1", port=args.port)


if __name__ == "__main__":
    main()
```

--------------------------------

### Tika Batch Processing Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/robustness.adoc

Use this command-line option for desktop/VM-scale batch processing with Tika 3.x and earlier. Specify input and output directories.

```bash
java -jar tika-app.jar -i -o
```

--------------------------------

### JSON Iterator Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/json.adoc

This example demonstrates the structure of a JSONL file that can be processed by the JSON iterator. Each line represents a distinct work item.
```json
{ "fetchKey": "/path/to/document.pdf", "emitKey": "/path/to/document.pdf.json" }
{ "fetchKey": "/path/to/another_document.docx", "emitKey": "/path/to/another_document.docx.json" }
```

--------------------------------

### Set up Tika Runtime Environment

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc

Unzips the built tika-app and tika-eval distribution packages into a dedicated runtime directory. Ensure these are run from their respective unzipped directories.

```bash
mkdir -p ~/tika-runtime && cd ~/tika-runtime
unzip -q tika-app/target/tika-app-{tika-version}.zip -d tika-app
unzip -q tika-eval/tika-eval-app/target/tika-eval-app-{tika-version}.zip -d tika-eval
```

--------------------------------

### Complete GCS Pipeline Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/gcs.adoc

Example of a complete Tika Pipes pipeline using GCS fetcher, emitter, and iterator for bucket-to-bucket processing.

```json
{
  "tika-pipes": {
    "fetchers": {
      "gcs-fetcher": {
        "class": "org.apache.tika.pipes.fetchers.gcs.GCSFetcher",
        "projectId": "your-gcp-project-id",
        "bucket": "your-gcs-bucket-name"
      }
    },
    "emitters": {
      "gcs-emitter": {
        "class": "org.apache.tika.pipes.emitters.gcs.GCSEmitter",
        "projectId": "your-gcp-project-id",
        "bucket": "your-gcs-bucket-name",
        "prefix": "output/"
      }
    },
    "iterators": {
      "gcs-iterator": {
        "class": "org.apache.tika.pipes.iterators.gcs.GCSPipesIterator",
        "bucket": "your-gcs-bucket-name",
        "fetcherId": "gcs-fetcher",
        "emitterId": "gcs-emitter"
      }
    }
  }
}
```

--------------------------------

### Get All MIME Types

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Retrieves a JSON object containing all known MIME types supported by the Tika server via the GET /mime-types endpoint.
```bash
curl -s -H "Accept: application/json" http://localhost:9998/mime-types
```

--------------------------------

### Prepare and Publish Release Documentation

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/maintainers/site.adoc

Steps to publish documentation for a new release version, including tagging, creating a docs branch, updating the version in antora.yml, building, and publishing to SVN.

```bash
# 1. Tag the release as usual
git tag v4.0.0

# 2. Create docs branch from tag
git checkout -b docs/4.0.0 v4.0.0

# 3. Update version in antora.yml
sed -i "s/4.0.0-SNAPSHOT/4.0.0/" docs/antora.yml
git commit -am "Set docs version to 4.0.0"
git push origin docs/4.0.0

# 4. Build and publish
./mvnw package -Papache-release -pl :tika-docs -DskipTests
cd docs
./publish-docs.sh /path/to/tika-site/publish

# 5. Commit to SVN
cd /path/to/tika-site
svn add publish/docs publish/_ --force
svn commit -m "Publish 4.0.0 docs"
```

--------------------------------

### Start Tika Server in Default Mode

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Starts the Tika server in its default configuration, where config endpoints are disabled. Waits for the server to become available before proceeding.

```bash
java -jar tika-server-standard-4.0.0-SNAPSHOT.jar --port 9998 &
sleep 8
curl -s http://localhost:9998/version
```

--------------------------------

### Tika 3.x XML Parser Configuration Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc

This is an example of how parsers were configured in Tika 3.x using XML. It shows how to specify a parser class and its parameters.

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
        <param name="maxMainMemoryBytes" type="long">1000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

--------------------------------

### Apply Formatting and Build Tika

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/migration-to-4x/design-notes-4x.adoc

Execute this command to apply formatting and then build the project.

```bash
mvn clean spotless:apply install
```

--------------------------------

### Start Tika Server on Custom Port

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Starts the Tika server on a specified custom port (9999) and then checks its version to confirm it's running on the new port.

```bash
java -jar tika-server-standard-4.0.0-SNAPSHOT.jar --port 9999 &
sleep 8
curl -s http://localhost:9999/version
```

--------------------------------

### Quick Start Development Mode

Source: https://github.com/apache/tika/blob/main/tika-grpc/README.md

Build Tika and plugins, then run the Tika GRPC server in development mode for hot-reloading. This automatically enables development mode and loads plugins from local target directories.

```bash
# 1. Build Tika and all plugins (from tika project root)
./mvnw clean install -DskipTests

# 2. Run in development mode (from tika-grpc directory)
cd tika-grpc
./run-dev.sh
```

--------------------------------

### Run Tika App Help

Source: https://github.com/apache/tika/blob/main/README.md

Display the help information for the Tika standalone application jar. This shows available commands and options.

```bash
java -jar tika-app/target/tika-app-*.jar --help
```

--------------------------------

### Fast Build with mvnd

Source: https://github.com/apache/tika/blob/main/README.md

Use mvnd for faster builds by leveraging daemon processes. Combine clean, install, and test commands for maximum development speed.
```bash
mvnd clean install -Pfast
```

```bash
mvnd test -pl :tika-core
```

```bash
mvnd clean install -Pfast -T1C
```

--------------------------------

### Tika 4.x JSON Parser Configuration Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc

This is an example of how parsers are configured in Tika 4.x using JSON. It demonstrates the new kebab-case naming for parsers and direct parameter mapping.

```json
{
  "parsers": [
    {
      "pdf-parser": {
        "sortByPosition": true,
        "maxMainMemoryBytes": 1000000
      }
    },
    {
      "default-parser": {}
    }
  ]
}
```

--------------------------------

### Pipes CPU Sizing Diagnostic Log

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc

Example log output from PipesParser startup showing the decided CPU sizing parameters for diagnostics.

```log
INFO pipes-cpu-sizing: hostCores=16, numClients=4, parentReserved=2, autoCap=slice=3
```

--------------------------------

### Prepare Tika Server with Unsecure Features Enabled

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc

Stops the default Tika server, creates a JSON configuration file to enable unsecure features, and then starts the server with this configuration.
```bash
pkill -f "tika-server-standard-4.0.0-SNAPSHOT.jar"

cat > tika-config-unsecure.json << 'EOF'
{
  "server": {
    "port": 9998,
    "host": "localhost",
    "enableUnsecureFeatures": true
  },
  "parsers": [
    {"default-parser": {}}
  ],
  "plugin-roots": "/tmp/tika-server-test/plugins"
}
EOF

java -jar tika-server-standard-4.0.0-SNAPSHOT.jar -c tika-config-unsecure.json &
sleep 10
curl -s http://localhost:9998/version
```

--------------------------------

### JDBC Reporter Configuration Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/jdbc.adoc

Example JSON configuration for the JDBC reporter, used to write per-document processing status to a SQL table. It defines connection parameters and the SQL statement for recording status updates.

```json
{
  "tika-pipes": {
    "reporter": {
      "type": "jdbc-reporter",
      "connection": "jdbc:postgresql://db.example.com:5432/tika",
      "insert": "INSERT INTO status (id, status) VALUES (?, ?)"
    }
  }
}
```

--------------------------------

### Complete Kafka Pipeline Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/kafka.adoc

This JSON configuration demonstrates a complete pipeline using the Kafka iterator to consume fetch requests and the Kafka emitter to publish parsed results. It integrates a filesystem fetcher for document retrieval.
```json
{
  "fetcher": {
    "type": "fs",
    "path": "/data/docs"
  },
  "iterator": {
    "type": "kafka-pipes-iterator",
    "topic": "fetch-requests",
    "bootstrapServers": "kafka1:9092,kafka2:9092",
    "groupId": "tika-fetcher-group",
    "fetcherId": "my-fetcher",
    "emitterId": "my-emitter"
  },
  "emitter": {
    "type": "kafka-emitter",
    "topic": "parsed-docs",
    "bootstrapServers": "kafka1:9092,kafka2:9092",
    "acks": "all"
  }
}
```

--------------------------------

### JDBC Iterator Configuration Example

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/pipes/plugins/jdbc.adoc

Example JSON configuration for the JDBC iterator, which walks rows from a SELECT statement to emit FetchEmitTuples. It includes connection details, the SELECT statement, and optional column mappings for identifiers and keys.

```json
{
  "tika-pipes": {
    "iterator": {
      "type": "jdbc-pipes-iterator",
      "connection": "jdbc:postgresql://db.example.com:5432/tika",
      "select": "SELECT id, title, body FROM documents",
      "idColumn": "id",
      "fetchKeyColumn": "id",
      "emitKeyColumn": "id",
      "fetcherId": "fetcher-id",
      "emitterId": "emitter-id"
    }
  }
}
```

--------------------------------

### Maven Install Plugin for Local Repository Zips

Source: https://github.com/apache/tika/blob/main/docs/modules/ROOT/pages/maintainers/release-guides/release-artifacts.adoc

Use the maven-install-plugin to install the assembly zip into the local Maven repository. This makes the zip artifact available as a dependency for sibling modules or integration tests without publishing it to a remote repository.

```xml
<execution>
  <id>install-zip</id>
  <phase>install</phase>
  <goals>
    <goal>install-file</goal>
  </goals>
  <configuration>
    <file>${project.build.directory}/${project.build.finalName}.zip</file>
    <packaging>zip</packaging>
  </configuration>
</execution>
```
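
--------------------------------

### Locate an Installed Zip in the Local Maven Repository

To confirm that the install step above placed the zip where sibling modules can resolve it, the expected file location can be computed from the artifact coordinates. The following is a minimal sketch, not part of the Tika tooling: the helper name `local_repo_path` is hypothetical, the coordinates `org.apache.tika:tika-app:4.0.0-SNAPSHOT` are used only for illustration, and the repository root is assumed to be the default `~/.m2/repository` (a custom `settings.xml` may move it).

```python
from pathlib import Path
from typing import Optional


def local_repo_path(group_id: str, artifact_id: str, version: str,
                    packaging: str = "zip",
                    classifier: Optional[str] = None,
                    repo_root: Optional[Path] = None) -> Path:
    """Return the path where Maven's default local-repository layout stores an artifact.

    Hypothetical helper for illustration; it only builds a path and does not call Maven.
    """
    # Assumption: default local repository location unless repo_root is given.
    root = repo_root if repo_root is not None else Path.home() / ".m2" / "repository"

    # File name follows artifactId-version[-classifier].packaging
    file_name = f"{artifact_id}-{version}"
    if classifier:
        file_name += f"-{classifier}"
    file_name += f".{packaging}"

    # Directory follows groupId (dots become directories) / artifactId / version
    return root.joinpath(*group_id.split("."), artifact_id, version, file_name)


if __name__ == "__main__":
    # Illustrative coordinates; substitute those of the module being installed.
    path = local_repo_path("org.apache.tika", "tika-app", "4.0.0-SNAPSHOT")
    print(path, "exists" if path.is_file() else "missing")
```

Running the script after `mvn install` prints the expected path together with `exists` or `missing`, which gives a quick check before a dependent module or integration test tries to resolve the zip.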