### Install Dependencies for SentenceVectors Conversion

Source: https://github.com/apache/opennlp/blob/main/opennlp-core/opennlp-ml/opennlp-dl/README.md

Install the necessary Python packages for converting sentence vector models to ONNX.

```bash
python3 -m pip install optimum onnx onnxruntime
```

--------------------------------

### SentenceDetectorME Training

Source: https://context7.com/apache/opennlp/llms.txt

Provides a guide on how to train a custom `SentenceModel` using a stream of sentence samples and save the trained model to a file.

```APIDOC
## SentenceDetectorME Training

### Description
This section details the process of training a custom `SentenceModel` using the `SentenceDetectorME.train()` method. It involves preparing a stream of `SentenceSample` objects from a training file, configuring training parameters, and then serializing the resulting model.

### Method
Java API calls

### Endpoint
N/A (Java Library)

### Parameters
- **language** (String) - Required - The language code for the model being trained.
- **sampleStream** (ObjectStream<SentenceSample>) - Required - A stream of sentence training samples.
- **factory** (SentenceDetectorFactory) - Required - A factory to create the `SentenceDetectorME`.
- **params** (TrainingParameters) - Required - Parameters for the training process.

### Request Example
```java
// Train a custom sentence model
try (ObjectStream<String> lineStream = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("sentences.train")),
        java.nio.charset.StandardCharsets.UTF_8)) {
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
    SentenceDetectorFactory factory = new SentenceDetectorFactory("en", true, null, null);
    TrainingParameters params = TrainingParameters.defaultParams();
    SentenceModel trainedModel = SentenceDetectorME.train("en", sampleStream, factory, params);
    try (OutputStream modelOut = new FileOutputStream("custom-sent.bin")) {
        trainedModel.serialize(modelOut);
    }
}
```

### Response
#### Success Response
- `SentenceModel`: The trained sentence model object.

#### Response Example
(The trained model is serialized to `custom-sent.bin`)
```

--------------------------------

### Configure Training Parameters for ML Algorithms

Source: https://context7.com/apache/opennlp/llms.txt

Illustrates how to set up TrainingParameters for different ML algorithms like MaxEnt (GIS, Perceptron, QN) and Naive Bayes. It also shows how to use default parameters and serialize/deserialize them to a properties file.

```java
import opennlp.tools.util.TrainingParameters;

// Default parameters (MaxEnt, 100 iterations, cutoff 5)
TrainingParameters params = TrainingParameters.defaultParams();

// MaxEnt / GIS
TrainingParameters maxent = new TrainingParameters();
maxent.put("Algorithm", "MAXENT");
maxent.put("Iterations", 100);
maxent.put("Cutoff", 5);
maxent.put("Threads", 4);

// Perceptron (good default for NER)
TrainingParameters perceptron = new TrainingParameters();
perceptron.put("Algorithm", "PERCEPTRON");
perceptron.put("Iterations", 300);
perceptron.put("Cutoff", 0);

// Quasi-Newton (fast convergence for large feature sets)
TrainingParameters qn = new TrainingParameters();
qn.put("Algorithm", "MAXENT_QN");
qn.put("Iterations", 100);
qn.put("Cutoff", 2);
qn.put("l1Cost", 0.1);
qn.put("l2Cost", 0.1);

// Naive Bayes
TrainingParameters naiveBayes = new TrainingParameters();
naiveBayes.put("Algorithm", "NAIVEBAYES");

// Read/write parameters from properties file
try (java.io.OutputStream out = new java.io.FileOutputStream("train.properties")) {
    params.serialize(out);
}
try (java.io.InputStream in = new java.io.FileInputStream("train.properties")) {
    TrainingParameters loaded = new TrainingParameters(in);
}
```

--------------------------------

### Detect Language with LanguageDetectorME

Source: https://context7.com/apache/opennlp/llms.txt

Use LanguageDetectorME to identify the language of a given text. This example shows how to get the single best prediction and all ranked language predictions.

```java
import opennlp.tools.langdetect.*;
import java.io.*;

try (InputStream modelIn = new FileInputStream("langdetect-183.bin")) {
    LanguageDetectorModel model = new LanguageDetectorModel(modelIn);
    LanguageDetectorME langDetector = new LanguageDetectorME(model);

    // Single best prediction
    Language best = langDetector.predictLanguage("Das ist ein schöner Tag.");
    System.out.printf("Language: %s Confidence: %.4f%n", best.getLang(), best.getConfidence());
    // Language: deu Confidence: 0.9921

    // All languages ranked by confidence
    Language[] all = langDetector.predictLanguages("Ceci est une belle journée.");
    for (Language lang : all) {
        if (lang.getConfidence() > 0.01)
            System.out.printf("%s %.4f%n", lang.getLang(), lang.getConfidence());
    }
}
```

--------------------------------

### NGramLanguageModel Training and Probability Calculation

Source: https://context7.com/apache/opennlp/llms.txt

Shows how to build an NGramLanguageModel from scratch, add training data, apply cutoff, and calculate probabilities using Stupid Backoff. Requires importing necessary classes.

```java
import opennlp.tools.languagemodel.NGramLanguageModel;
import opennlp.tools.util.StringList;
import java.io.*;

// --- Build from scratch ---
NGramLanguageModel lm = new NGramLanguageModel(3); // trigram model

// Add training sentences
lm.add(new StringList("the", "cat", "sat"), 1, 3);
lm.add(new StringList("the", "dog", "ran"), 1, 3);
lm.add(new StringList("a", "cat", "sat"), 1, 3);
lm.cutoff(1, Integer.MAX_VALUE); // remove ngrams appearing < 1 time

// Calculate probability (Stupid Backoff)
double prob = lm.calculateProbability(new StringList("the", "cat", "sat"));
System.out.printf("P(sat | the cat) = %.6f%n", prob);
```

--------------------------------

### Initialize and Use DocumentCategorizerDL with ONNX Model

Source: https://context7.com/apache/opennlp/llms.txt

Shows how to initialize DocumentCategorizerDL with an ONNX model, vocabulary, category mapping, scoring strategy, and inference options.
It then categorizes a sample token array and displays the best category and all scores.

```java
import opennlp.dl.doccat.*;
import opennlp.dl.doccat.scoring.AverageClassificationScoringStrategy;
import opennlp.dl.InferenceOptions;
import java.io.*;
import java.util.*;

Map<Integer, String> categories = new HashMap<>();
categories.put(0, "negative");
categories.put(1, "positive");

InferenceOptions opts = new InferenceOptions();
opts.setDocumentSplitSize(200);

DocumentCategorizerDL categorizer = new DocumentCategorizerDL(
    new File("sentiment-model.onnx"),
    new File("vocab.txt"),
    categories,
    new AverageClassificationScoringStrategy(),
    opts
);

String[] tokens = {"This", "movie", "was", "absolutely", "fantastic"};
double[] probs = categorizer.categorize(tokens);
String best = categorizer.getBestCategory(probs);
System.out.println("Sentiment: " + best); // positive

Map<String, Double> scores = categorizer.scoreMap(tokens);
scores.forEach((cat, score) -> System.out.printf("%s: %.4f%n", cat, score));
```

--------------------------------

### Initialize and Use NameFinderDL with ONNX Model

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates initializing NameFinderDL with an ONNX model, vocabulary, label mapping, inference options, and a sentence detector. It then finds named entities in a sample token array.

```java
import opennlp.dl.namefinder.NameFinderDL;
import opennlp.dl.InferenceOptions;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;
import java.io.*;
import java.util.*;

// Label mapping from model output index to NER label
Map<Integer, String> ids2Labels = new HashMap<>();
ids2Labels.put(0, "O");
ids2Labels.put(1, "B-PER");
ids2Labels.put(2, "I-PER");
ids2Labels.put(3, "B-LOC");
ids2Labels.put(4, "I-LOC");

// InferenceOptions for controlling ONNX session
InferenceOptions inferenceOptions = new InferenceOptions();
inferenceOptions.setIncludeAttentionMask(true);
inferenceOptions.setIncludeTokenTypeIds(true);
inferenceOptions.setDocumentSplitSize(250); // max tokens per chunk
inferenceOptions.setSplitOverlapSize(50);   // overlap between chunks
// inferenceOptions.setGpu(true);           // enable GPU (requires opennlp-dl-gpu)
// inferenceOptions.setGpuDeviceId(0);

try (InputStream sentModelIn = new FileInputStream("en-sent.bin")) {
    SentenceModel sentModel = new SentenceModel(sentModelIn);
    SentenceDetectorME sentDetector = new SentenceDetectorME(sentModel);

    NameFinderDL nameFinder = new NameFinderDL(
        new File("ner-model.onnx"),
        new File("vocab.txt"),
        ids2Labels,
        inferenceOptions,
        sentDetector
    );

    String[] tokens = {"John", "Smith", "lives", "in", "New", "York"};
    Span[] names = nameFinder.find(tokens);
    for (Span span : names) {
        System.out.printf("[%d-%d] %s (%.3f)%n",
            span.getStart(), span.getEnd(), span.getType(), span.getProb());
    }
}
```

--------------------------------

### Train a Tokenizer Model

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates the process of training a new TokenizerME model using a stream of token samples and specified training parameters.

```java
// --- Train a tokenizer model ---
ObjectStream<TokenSample> sampleStream = /* ...
*/;
TokenizerFactory factory = new TokenizerFactory("en", null, true, null);
TrainingParameters params = TrainingParameters.defaultParams();
TokenizerModel trainedModel = TokenizerME.train(sampleStream, factory, params);
```

--------------------------------

### Parser - Create and use a parser

Source: https://context7.com/apache/opennlp/llms.txt

Shows how to create a constituency parser using `ParserFactory.create` with a `ParserModel` and then use it to parse a pre-tokenized sentence.

```APIDOC
## Parser - Create and use a parser

### Description
Creates a constituency parser (chunking or tree-insert style) and uses it to parse a pre-tokenized sentence, returning top-k `Parse` objects.

### Method
`ParserFactory.create(ParserModel)`
`ParserTool.parseLine(String, opennlp.tools.parser.Parser, int)`

### Parameters
- `model` (ParserModel) - Required - The model to create the parser from.
- `sentence` (String) - Required - The pre-tokenized sentence to parse.
- `parser` (opennlp.tools.parser.Parser) - Required - The parser instance to use.
- `k` (int) - Required - The number of top parses to return.

### Response
- `topParses` (Parse[]) - An array of `Parse` objects representing the top-k parses of the sentence.

### Request Example
```java
import opennlp.tools.parser.*;
import opennlp.tools.cmdline.parser.ParserTool;
import java.io.*;

try (InputStream modelIn = new FileInputStream("en-parser-chunking.bin")) {
    ParserModel model = new ParserModel(modelIn);
    opennlp.tools.parser.Parser parser = ParserFactory.create(model);
    opennlp.tools.parser.Parser customParser = ParserFactory.create(model, 20, 0.95);

    String sentence = "Pierre Vinken , 61 years old , will join the board .";
    Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);
    for (Parse parse : topParses) {
        parse.show();
    }

    Parse root = topParses[0];
    System.out.println("Type: " + root.getType());
    System.out.println("Prob: " + root.getProb());
    for (Parse child : root.getChildren()) {
        System.out.println(" Child: " + child.getType() + " -> " + child.getCoveredText());
    }
}
```
```

--------------------------------

### Sentence Detection with OpenNLP

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates loading a pre-trained sentence detection model from disk, detecting sentences in text, and retrieving span positions with confidence scores. Also shows auto-downloading a model and training a custom sentence model.

```java
import opennlp.tools.sentdetect.*;
import opennlp.tools.util.*;
import java.io.*;

// --- Load from disk and detect sentences ---
try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME sentDetector = new SentenceDetectorME(model);

    String text = "Hello Mr. Smith. How are you today? The weather is great!";
    String[] sentences = sentDetector.sentDetect(text);
    // Output: ["Hello Mr.
    // Smith.", "How are you today?", "The weather is great!"]
    for (String s : sentences) System.out.println(s);

    // With span positions and confidence scores
    Span[] spans = sentDetector.sentPosDetect(text);
    double[] probs = sentDetector.probs();
    for (int i = 0; i < spans.length; i++) {
        System.out.printf("[%d-%d] %.4f: %s%n",
            spans[i].getStart(), spans[i].getEnd(), probs[i], spans[i].getCoveredText(text));
    }
}

// --- Auto-download model (OpenNLP 2.x / 3.x) ---
SentenceDetectorME sdAuto = new SentenceDetectorME("en"); // downloads en model

// --- Train a custom sentence model ---
try (ObjectStream<String> lineStream = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("sentences.train")),
        java.nio.charset.StandardCharsets.UTF_8)) {
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
    SentenceDetectorFactory factory = new SentenceDetectorFactory("en", true, null, null);
    TrainingParameters params = TrainingParameters.defaultParams();
    SentenceModel trainedModel = SentenceDetectorME.train("en", sampleStream, factory, params);
    try (OutputStream modelOut = new FileOutputStream("custom-sent.bin")) {
        trainedModel.serialize(modelOut);
    }
}
```

--------------------------------

### Process Training Data with ObjectStream

Source: https://context7.com/apache/opennlp/llms.txt

Shows how to read training data from a plain-text file using PlainTextByLineStream and convert it into NameSample objects with NameSampleDataStream. Also demonstrates using DirectorySampleStream to read from multiple files.

```java
import opennlp.tools.util.*;
import opennlp.tools.namefind.*;
import java.io.*;
import java.nio.charset.StandardCharsets;

// Sentence samples from a plain-text file (one sentence per line, OpenNLP format)
try (ObjectStream<String> lineStream = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("ner-train.bio")),
        StandardCharsets.UTF_8)) {
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
    NameSample sample;
    while ((sample = sampleStream.read()) != null) {
        System.out.println(java.util.Arrays.toString(sample.getSentence()));
        System.out.println(java.util.Arrays.toString(sample.getNames()));
    }
    sampleStream.reset(); // rewind for training
}

// Reading from multiple files with DirectorySampleStream
// (opennlp-formats module)
import opennlp.tools.formats.DirectorySampleStream;
ObjectStream<File> fileStream = new DirectorySampleStream(
    new File("data/"), f -> f.getName().endsWith(".txt"), false);
```

--------------------------------

### Load OpenNLP Models from Various Sources

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates loading SentenceModel from classpath InputStream, auto-downloading a Tokenizer model, loading from a File, and using ClassPathModelLoader with ClassgraphModelFinder. Also shows how to serialize a trained model.

```java
import opennlp.tools.sentdetect.*;
import opennlp.tools.tokenize.*;
import opennlp.tools.util.DownloadUtil;
import opennlp.tools.models.*;
import java.io.*;

// 1. Load from classpath InputStream
try (InputStream in = SentenceDetectorME.class
        .getResourceAsStream("/opennlp/en-sent.bin")) {
    SentenceModel model = new SentenceModel(in);
    SentenceDetectorME detector = new SentenceDetectorME(model);
}

// 2. Auto-download model (cached locally in OPENNLP_DOWNLOAD_HOME or ~/.opennlp)
TokenizerME tokenizerAuto = new TokenizerME("en");

// 3. Load from File
SentenceModel fromFile = new SentenceModel(new File("en-sent.bin"));

// 4.
// ClassPath model resolver (opennlp-model-resolver + classgraph)
import opennlp.tools.models.classgraph.ClassgraphModelFinder;
ClassgraphModelFinder finder = new ClassgraphModelFinder();
Set<ClassPathModelEntry> entries = finder.findModels(false);
ClassPathModelLoader loader = new ClassPathModelLoader();
SentenceModel cpModel = loader.load(entries, "en",
    opennlp.tools.models.ModelType.SENTENCE_DETECTOR, SentenceModel.class);

// 5. Serialize a trained model back to disk
SentenceModel trained = /* ... */;
try (OutputStream out = new FileOutputStream("my-sent.bin")) {
    trained.serialize(out);
}
```

--------------------------------

### Train a POS Model

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates training a new POS tagger model with specified training parameters and a factory.

```java
// --- Train a POS model ---
ObjectStream<POSSample> sampleStream = /* ... */;
TrainingParameters params = TrainingParameters.defaultParams();
params.put("Iterations", 100);
params.put("Cutoff", 5);
POSModel trainedModel = POSTaggerME.train("en", sampleStream, params, new POSTaggerFactory());
```

--------------------------------

### Build OpenNLP with Maven

Source: https://github.com/apache/opennlp/blob/main/README.md

Use this command to build the OpenNLP library from the main branch. Requires JDK 21 and Maven 3.9.x.

```bash
mvn install
```

--------------------------------

### Clone and Build Snowball Compiler

Source: https://github.com/apache/opennlp/blob/main/dev/Snowball-Stemmer.md

Clone the Snowball repository and build the compiler using make. This command generates the snowball compiler in the root directory of the repository.

```bash
git clone https://github.com/snowball/snowball.git
cd snowball
make
```

--------------------------------

### Load and Tokenize Text with TokenizerME

Source: https://context7.com/apache/opennlp/llms.txt

Loads a pre-trained tokenizer model and tokenizes a given sentence.
It also demonstrates how to retrieve token positions and probabilities, and provides an alternative for simple whitespace tokenization.

```java
import opennlp.tools.tokenize.*;
import opennlp.tools.util.*;
import java.io.*;

// --- Load and tokenize ---
try (InputStream modelIn = new FileInputStream("en-token.bin")) {
    TokenizerModel model = new TokenizerModel(modelIn);
    TokenizerME tokenizer = new TokenizerME(model);

    String[] tokens = tokenizer.tokenize("Pierre Vinken, 61 years old, will join the board.");
    // ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join", "the", "board", "."]
    System.out.println(java.util.Arrays.toString(tokens));

    // With positions and probabilities (print the tokens of this text, not of the one above)
    String text = "Don't stop.";
    Span[] spans = tokenizer.tokenizePos(text);
    double[] probs = tokenizer.probs();
    for (int i = 0; i < spans.length; i++) {
        System.out.printf("%-10s %.4f%n", spans[i].getCoveredText(text), probs[i]);
    }
}

// --- Whitespace tokenizer (no model needed) ---
String[] wsTokens = WhitespaceTokenizer.INSTANCE.tokenize("Hello World !");
```

--------------------------------

### DocumentCategorizerME - Train a model

Source: https://context7.com/apache/opennlp/llms.txt

Illustrates the process of training a new DocumentCategorizerME model using a stream of document samples and specified training parameters.

```APIDOC
## DocumentCategorizerME - Train a model

### Description
Trains a new DocumentCategorizerME model using a stream of document samples, training parameters, and a feature generator.

### Method
`DocumentCategorizerME.train(String, ObjectStream, TrainingParameters, DoccatFactory)`

### Parameters
- `language` (String) - Required - The language of the documents.
- `sampleStream` (ObjectStream<DocumentSample>) - Required - A stream of document samples.
- `params` (TrainingParameters) - Required - Training parameters.
- `factory` (DoccatFactory) - Required - A factory for creating feature generators.

### Request Example
```java
import opennlp.tools.doccat.*;
import opennlp.tools.util.*;
import java.io.*;
import java.util.*;

ObjectStream<DocumentSample> sampleStream = /* ... */;
DoccatFactory factory = new DoccatFactory(
    new FeatureGenerator[]{ new BagOfWordsFeatureGenerator() });
TrainingParameters params = TrainingParameters.defaultParams();
DoccatModel trainedModel = DocumentCategorizerME.train("en", sampleStream, params, factory);
try (OutputStream out = new FileOutputStream("en-doccat.bin")) {
    trainedModel.serialize(out);
}
```
```

--------------------------------

### Run POSTaggerME Benchmark with JMH

Source: https://github.com/apache/opennlp/blob/main/opennlp-core/opennlp-runtime/BENCHMARKS.md

Executes only the POSTaggerME JMH benchmarks, including tests for different cache sizes. Requires the JMH profile.

```bash
mvn exec:java -pl opennlp-core/opennlp-runtime \
  -Pjmh -Dexec.classpathScope=test \
  -Dexec.mainClass=org.openjdk.jmh.Main \
  -Dexec.args="POSTaggerMEBenchmark"
```

--------------------------------

### Load and Tag Tokens with POSTaggerME

Source: https://context7.com/apache/opennlp/llms.txt

Loads a pre-trained POS tagger model and assigns part-of-speech tags to a sequence of tokens. It shows how to configure the tag format and retrieve tag probabilities and top sequences.

```java
import opennlp.tools.postag.*;
import opennlp.tools.util.*;
import java.io.*;

// --- Load and tag ---
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
    POSModel model = new POSModel(modelIn);
    POSTaggerME tagger = new POSTaggerME(model); // defaults to UD format
    // POSTaggerME tagger = new POSTaggerME(model, POSTagFormat.PENN); // Penn format

    String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old", "."};
    String[] tags = tagger.tag(tokens);
    // e.g.
    // ["PROPN", "PROPN", "PUNCT", "NUM", "NOUN", "ADJ", "PUNCT"] (UD)

    // Probability for each token's best tag
    double[] probs = tagger.probs();
    for (int i = 0; i < tokens.length; i++) {
        System.out.printf("%-12s %-6s %.4f%n", tokens[i], tags[i], probs[i]);
    }

    // Top-3 tagging sequences
    opennlp.tools.util.Sequence[] topSeqs = tagger.topKSequences(tokens);
    for (opennlp.tools.util.Sequence seq : topSeqs) {
        System.out.println(seq.getOutcomes() + " score=" + seq.getScore());
    }
}
```

--------------------------------

### MorfologikLemmatizer Usage

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates how to use MorfologikLemmatizer for lemmatizing tokens with corresponding POS tags. Requires Morfologik binary dictionaries.

```java
import opennlp.morfologik.lemmatizer.MorfologikLemmatizer;
import opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory;
import opennlp.tools.postag.*;
import java.nio.file.Paths;
import java.io.*;

// --- Morfologik lemmatizer ---
MorfologikLemmatizer lemmatizer = new MorfologikLemmatizer(
    Paths.get("polish.dict") // morfologik binary dictionary
);
String[] tokens = {"kotów", "biegających"};
String[] tags = {"subst:gen:pl:m2", "pact:gen:pl:m1:imperf"};
String[] lemmas = lemmatizer.lemmatize(tokens, tags);
System.out.println(java.util.Arrays.toString(lemmas)); // [kot, biegać]

// Multiple lemma candidates per token
java.util.List<java.util.List<String>> allLemmas =
    lemmatizer.lemmatize(java.util.Arrays.asList(tokens), java.util.Arrays.asList(tags));
```

```java
// --- Morfologik tag dictionary for POSTagger ---
// Build with CLI: opennlp-morfologik MorfologikDictionaryBuilder
// Then use the factory:
try (InputStream modelIn = new FileInputStream("pl-pos.bin");
     InputStream dictIn = new FileInputStream("polish.dict")) {
    POSModel posModel = new POSModel(modelIn);
    POSTaggerME tagger = new POSTaggerME(posModel);
    // The MorfologikPOSTaggerFactory is configured at model-training time
}
```

--------------------------------

### OpenNLP CLI: Train Sentence Detector Model

Source: https://context7.com/apache/opennlp/llms.txt

Trains a new sentence detector model using the OpenNLP command-line interface. Requires training data in OpenNLP format.

```bash
# Train a sentence detector model
opennlp SentenceDetectorTrainer -model en-sent.bin -lang en \
  -data sentences.train -encoding UTF-8
```

--------------------------------

### SentenceDetectorME Auto-Download Model

Source: https://context7.com/apache/opennlp/llms.txt

Shows how to initialize `SentenceDetectorME` with a language code to automatically download the corresponding pre-trained model.

```APIDOC
## SentenceDetectorME Auto-Download Model

### Description
This snippet demonstrates the convenience of initializing `SentenceDetectorME` with a language code (e.g., "en"), which triggers an automatic download of the appropriate pre-trained model if it's not already available locally.

### Method
Java API call

### Endpoint
N/A (Java Library)

### Parameters
- **language** (String) - Required - The language code for which to download the model (e.g., "en").

### Request Example
```java
// Auto-download model (OpenNLP 2.x / 3.x)
SentenceDetectorME sdAuto = new SentenceDetectorME("en"); // downloads en model
```

### Response
N/A (Initialization)

#### Response Example
N/A
```

--------------------------------

### SentenceDetectorME Usage

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates how to load a pre-trained sentence detection model, detect sentences in text, and retrieve sentence spans with confidence scores.

```APIDOC
## SentenceDetectorME Usage

### Description
This section shows how to use the `SentenceDetectorME` class to detect sentences in a given text. It covers loading a model from disk, performing sentence detection, and obtaining sentence spans with associated probabilities.

### Method
Java API calls

### Endpoint
N/A (Java Library)

### Parameters
N/A (Java API)

### Request Example
```java
import opennlp.tools.sentdetect.*;
import opennlp.tools.util.*;
import java.io.*;

// Load from disk and detect sentences
try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME sentDetector = new SentenceDetectorME(model);

    String text = "Hello Mr. Smith. How are you today? The weather is great!";
    String[] sentences = sentDetector.sentDetect(text);
    // Output: ["Hello Mr. Smith.", "How are you today?", "The weather is great!"]
    for (String s : sentences) System.out.println(s);

    // With span positions and confidence scores
    Span[] spans = sentDetector.sentPosDetect(text);
    double[] probs = sentDetector.probs();
    for (int i = 0; i < spans.length; i++) {
        System.out.printf("[%d-%d] %.4f: %s\n",
            spans[i].getStart(), spans[i].getEnd(), probs[i], spans[i].getCoveredText(text));
    }
}
```

### Response
#### Success Response
- `String[]`: An array of strings, where each string is a detected sentence.
- `Span[]`: An array of `Span` objects representing the start and end character offsets of each sentence (the end offset is exclusive).
- `double[]`: An array of probabilities corresponding to each detected sentence.

#### Response Example
```
Hello Mr. Smith.
How are you today?
The weather is great!
[0-16] 0.9987: Hello Mr. Smith.
[17-35] 0.9976: How are you today?
[36-57] 0.9991: The weather is great!
```
```

--------------------------------

### OpenNLP CLI: Tokenization

Source: https://context7.com/apache/opennlp/llms.txt

Tokenizes a given text using the OpenNLP command-line interface. Requires a pre-trained tokenizer model.

```bash
# Tokenize
echo "Pierre Vinken, 61 years old." \
  | opennlp TokenizerME en-token.bin
```

--------------------------------

### Build OpenNLP with JMH Profile

Source: https://github.com/apache/opennlp/blob/main/opennlp-core/opennlp-runtime/BENCHMARKS.md

Builds the OpenNLP core and runtime modules with the JMH profile enabled. Skips forbidden API checks and checkstyle.

```bash
mvn test-compile -Pjmh \
  -pl opennlp-core/opennlp-runtime -am \
  -Dforbiddenapis.skip=true -Dcheckstyle.skip=true
```

--------------------------------

### Create and Use Constituency Parser

Source: https://context7.com/apache/opennlp/llms.txt

Creates a constituency parser using a pre-trained model and parses sentences. Supports chunking or tree-insert styles and allows custom beam size.

```java
import opennlp.tools.parser.*;
import opennlp.tools.cmdline.parser.ParserTool;
import java.io.*;

try (InputStream modelIn = new FileInputStream("en-parser-chunking.bin")) {
    ParserModel model = new ParserModel(modelIn);

    // Create parser with defaults (beamSize=20, advancePercentage=0.95)
    opennlp.tools.parser.Parser parser = ParserFactory.create(model);

    // Create parser with custom beam size
    opennlp.tools.parser.Parser customParser = ParserFactory.create(model, 20, 0.95);

    // Parse a sentence (text must be pre-tokenized with spaces)
    String sentence = "Pierre Vinken , 61 years old , will join the board .";
    Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);
    for (Parse parse : topParses) {
        parse.show(); // prints Penn Treebank-style tree
        // (S (NP (NNP Pierre) (NNP Vinken) (, ,) ...)
        // (VP (MD will) ...))
    }

    // Access parse tree nodes
    Parse root = topParses[0];
    System.out.println("Type: " + root.getType());
    System.out.println("Prob: " + root.getProb());
    for (Parse child : root.getChildren()) {
        System.out.println(" Child: " + child.getType() + " -> " + child.getCoveredText());
    }
}
```

--------------------------------

### NGramLanguageModel Serialization and Deserialization

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates how to serialize an NGramLanguageModel to a file and deserialize it back into an object. Ensure proper file handling with try-with-resources.

```java
// --- Serialize / deserialize ---
try (OutputStream out = new FileOutputStream("trigram.lm")) {
    lm.serialize(out);
}
try (InputStream in = new FileInputStream("trigram.lm")) {
    NGramLanguageModel loaded = new NGramLanguageModel(in);
}
```

--------------------------------

### Run JUnit Correctness Test for Thread Safety

Source: https://github.com/apache/opennlp/blob/main/opennlp-core/opennlp-runtime/BENCHMARKS.md

Executes the Failsafe integration test `ThreadSafetyBenchmarkIT` to verify thread-safe behavior of shared ME instances. Run using `mvn verify`.

```bash
mvn verify -pl opennlp-core/opennlp-runtime -am \
  -Dforbiddenapis.skip=true \
  -Dit.test=ThreadSafetyBenchmarkIT
```

--------------------------------

### Train DocumentCategorizerME Model

Source: https://context7.com/apache/opennlp/llms.txt

Trains a new DocumentCategorizerME model. Requires a stream of DocumentSample objects and specifies training parameters and feature generators.

```java
// Training file format (one sample per line): <category>\t<text> ...
ObjectStream<DocumentSample> sampleStream = /* ...
*/;
DoccatFactory factory = new DoccatFactory(
    new FeatureGenerator[]{ new BagOfWordsFeatureGenerator() });
TrainingParameters params = TrainingParameters.defaultParams();
DoccatModel trainedModel = DocumentCategorizerME.train("en", sampleStream, params, factory);
try (OutputStream out = new FileOutputStream("en-doccat.bin")) {
    trainedModel.serialize(out);
}
```

--------------------------------

### DocumentCategorizerME - Classify a document

Source: https://context7.com/apache/opennlp/llms.txt

Demonstrates how to load a pre-trained DocumentCategorizerME model and use it to classify a given set of tokens. It also shows how to retrieve the best category and score maps.

```APIDOC
## DocumentCategorizerME - Classify a document

### Description
Classifies pre-tokenized documents into user-defined categories using a Maximum Entropy model. Returns a probability array and the best category.

### Method
`categorize(String[])`

### Parameters
- `tokens` (String[]) - Required - An array of tokens representing the pre-tokenized document.

### Response
- `outcomes` (double[]) - An array of probabilities for each category.
- `bestCategory` (String) - The category with the highest probability.

### Request Example
```java
import opennlp.tools.doccat.*;
import opennlp.tools.util.*;
import java.io.*;
import java.util.*;

try (InputStream modelIn = new FileInputStream("en-doccat.bin")) {
    DoccatModel model = new DoccatModel(modelIn);
    DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

    String[] tokens = {"I", "love", "this", "product", "it", "is", "amazing"};
    double[] outcomes = categorizer.categorize(tokens);
    String best = categorizer.getBestCategory(outcomes);
    System.out.println("Category: " + best);

    Map<String, Double> scores = categorizer.scoreMap(tokens);
    scores.forEach((cat, prob) -> System.out.printf("%-20s %.4f%n", cat, prob));

    SortedMap<Double, Set<String>> sorted = categorizer.sortedScoreMap(tokens);
    sorted.forEach((prob, cats) -> System.out.printf("%.4f %s%n", prob, cats));
}
```
```

--------------------------------

### OpenNLP CLI: Train NER Model (Perceptron)

Source: https://context7.com/apache/opennlp/llms.txt

Trains a Named Entity Recognition model using the Perceptron algorithm with the OpenNLP command-line interface. Requires training data in BIO format.

```bash
# Train a NER model (Perceptron)
opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en \
  -type person -data ner-train.bio -encoding UTF-8 \
  -algorithm PERCEPTRON -iterations 300
```

--------------------------------

### Train a Chunker Model with ChunkerME

Source: https://context7.com/apache/opennlp/llms.txt

Trains a new ChunkerModel for shallow phrase chunking. Requires a stream of ChunkSample objects and training parameters. The trained model is then ready for use.

```java
// --- Train ---
ObjectStream<ChunkSample> sampleStream = /* ... */;
TrainingParameters params = TrainingParameters.defaultParams();
ChunkerModel trainedModel = ChunkerME.train("en", sampleStream, params, new ChunkerFactory());
```

--------------------------------

### OpenNLP CLI: Sentence Detection

Source: https://context7.com/apache/opennlp/llms.txt

Detects sentences in a given text using the OpenNLP command-line interface.
Requires a pre-trained sentence detector model.

```bash
# Detect sentences
echo "Hello world. How are you?" | opennlp SentenceDetector en-sent.bin
```

--------------------------------

### OpenNLP CLI: Language Detection

Source: https://context7.com/apache/opennlp/llms.txt

Detects the language of a given text using the OpenNLP command-line interface. Requires a pre-trained language detector model.

```bash
# Language detection
echo "Das ist ein schöner Tag." | opennlp LanguageDetector langdetect-183.bin
```
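
--------------------------------

### OpenNLP CLI: Call a Tool from Java (sketch)

The CLI snippets above all follow the same pattern: text is piped on stdin to `opennlp <Tool> <model>`, and the result is read from stdout. The following is a minimal sketch of driving that pattern from Java using only the JDK's `ProcessBuilder`. The class `OpenNlpCli` and its methods are hypothetical helpers, not part of OpenNLP, and the sketch assumes the `opennlp` launcher is on the `PATH`, that UTF-8 is an acceptable stdin encoding, and that the input is short.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper for illustration; not part of the OpenNLP API.
final class OpenNlpCli {

    // Builds the argument vector for: opennlp <tool> <modelFile>
    static List<String> command(String tool, String modelFile) {
        return List.of("opennlp", tool, modelFile);
    }

    // Sends text to the tool on stdin and returns what it prints on stdout.
    // Assumes a short input: stdin is written completely before stdout is read.
    static String run(String tool, String modelFile, String text)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tool, modelFile))
                // Let the tool's log output go to this process's stderr
                // so an unread stderr pipe cannot block the child.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write((text + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        }
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("opennlp " + tool + " exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) {
        // Only prints the command line; call run(...) once opennlp and the model exist.
        System.out.println(String.join(" ", command("LanguageDetector", "langdetect-183.bin")));
    }
}
```

A call such as `OpenNlpCli.run("SentenceDetector", "en-sent.bin", "Hello world. How are you?")` would mirror the sentence detection command above. When the OpenNLP jars are already on the classpath, the Java API shown in the earlier snippets avoids the process start-up cost and is usually the better fit.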