### Example sense2vec.eval Usage

Source: https://github.com/explosion/sense2vec/blob/master/README.md

An example of how to run the sense2vec.eval recipe with specific senses and a similarity threshold.

```bash
prodigy sense2vec.eval vectors_eval /path/to/s2v_reddit_2015_md --senses NOUN,ORG,PRODUCT --threshold 0.5
```

--------------------------------

### Install and Run Streamlit Sense2Vec Demo

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Install the Streamlit library and run the provided demo script to explore pretrained sense2vec vectors. The script requires paths to pretrained vectors as command-line arguments.

```bash
pip install streamlit
streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors
```

--------------------------------

### Install sense2vec

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Install the sense2vec library using pip. This command fetches and installs the package from the Python Package Index.

```bash
pip install sense2vec
```

--------------------------------

### Run sense2vec.teach Recipe

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use this recipe to bootstrap a terminology list. Prodigy suggests similar terms based on sense2vec vectors, adjusting suggestions as you annotate. Ensure sense2vec is installed in the same environment as Prodigy.

```bash
prodigy sense2vec.teach tech_phrases /path/to/s2v_reddit_2015_md --seeds "natural language processing, machine learning, artificial intelligence"
```

--------------------------------

### Initialize Sense2Vec with senses

Source: https://github.com/explosion/sense2vec/blob/master/README.md

When initializing Sense2Vec, you can specify the available senses. This example shows how to check if a sense is present after initialization.
```python
s2v = Sense2Vec(senses=["VERB", "NOUN"])
assert "VERB" in s2v.senses
```

--------------------------------

### Initialize and Use Sense2Vec Vectors

Source: https://context7.com/explosion/sense2vec/llms.txt

Demonstrates initializing Sense2Vec with pretrained vectors, checking for keys, retrieving vectors and frequencies, finding most similar terms, calculating similarity between terms, and getting other senses of a word.

```python
from sense2vec import Sense2Vec
import numpy as np

# Initialize and load pretrained vectors
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

# Check if key exists and get vector
query = "natural_language_processing|NOUN"
if query in s2v:
    vector = s2v[query]  # Returns numpy.ndarray
    freq = s2v.get_freq(query)  # Returns frequency count
    print(f"Vector shape: {vector.shape}, Frequency: {freq}")

# Find most similar terms
most_similar = s2v.most_similar(query, n=5)
for key, score in most_similar:
    print(f"{key}: {score:.4f}")
# Output:
# machine_learning|NOUN: 0.8987
# computer_vision|NOUN: 0.8636
# deep_learning|NOUN: 0.8573
# artificial_intelligence|NOUN: 0.8321
# data_mining|NOUN: 0.8156

# Calculate similarity between terms
sim = s2v.similarity("python|NOUN", "javascript|NOUN")
print(f"Similarity: {sim:.4f}")

# Find other senses of the same word
other_senses = s2v.get_other_senses("duck|NOUN")
print(other_senses)  # ['duck|VERB', 'Duck|ORG', 'Duck|PERSON']

# Get best sense for ambiguous word
best = s2v.get_best_sense("apple")  # Returns highest frequency sense
print(best)  # "apple|NOUN" or "Apple|ORG" depending on corpus
```

--------------------------------

### Programmatic Usage of sense2vec.teach Logic

Source: https://context7.com/explosion/sense2vec/llms.txt

Demonstrates the internal logic of the sense2vec.teach recipe for finding seed keys, getting suggestions, and filtering them by a similarity threshold.
```python
# Programmatic usage (internal structure)
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/vectors")

# Find seed keys
seeds = ["machine learning", "deep learning"]
seed_keys = []
for seed in seeds:
    key = s2v.get_best_sense(seed)
    if key:
        seed_keys.append(key)
        print(f"Seed: {seed} -> {key}")

# Get suggestions above threshold
threshold = 0.85
suggestions = s2v.most_similar(seed_keys, n=100)
filtered = [(key, score) for key, score in suggestions if score > threshold]
for key, score in filtered[:10]:
    word, sense = s2v.split_key(key)
    print(f"{word} ({sense}): {score:.4f}")
```

--------------------------------

### Get all keys from Sense2Vec

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Convert the keys iterator returned by `Sense2Vec.keys()` into a list to get all string keys present in the table.

```python
all_keys = list(s2v.keys())
```

--------------------------------

### Sense2Vec Patterns Output Format

Source: https://context7.com/explosion/sense2vec/llms.txt

Example of the JSONL output format generated by the sense2vec.to-patterns recipe, suitable for use with spaCy's EntityRuler.

```json
# Output format (patterns.jsonl):
# {"label": "TECHNOLOGY", "pattern": [{"lower": "machine"}, {"lower": "learning"}]}
# {"label": "TECHNOLOGY", "pattern": [{"lower": "neural"}, {"lower": "network"}]}
# {"label": "TECHNOLOGY", "pattern": [{"lower": "deep"}, {"lower": "learning"}]}
```

--------------------------------

### Sense2Vec Get Other Senses

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Find other entries for the same word but with a different sense.

```APIDOC
## Sense2Vec.get_other_senses

### Description
Find other entries for the same word with a different sense, e.g. "duck|VERB" for "duck|NOUN".

### Method
`get_other_senses`

### Endpoint
N/A (Instance method)

### Parameters
- **key** (unicode / int) - The key to check.
- **ignore_case** (bool) - Check for uppercase, lowercase and titlecase.
Defaults to `True`.

### RETURNS
- **list** - The string keys of other entries with different senses.

### Request Example
```python
other_senses = s2v.get_other_senses("duck|NOUN")
# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']
```
```

--------------------------------

### Sense2Vec Get Item

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Retrieves a vector for a given key. Returns None if the key is not found.

```APIDOC
## Sense2Vec.__getitem__

### Description
Retrieve a vector for a given key. Returns None if the key is not in the table.

### Method
`__getitem__`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **key** (unicode / int) - Required - The key to look up.

### Request Example
```python
vec = s2v["avocado|NOUN"]
```

### Response
#### Success Response (200)
- **vector** (`numpy.ndarray`) - The vector or `None`.

#### Response Example
```json
{ "vector": [4.0, 2.0, 2.0, 2.0] }
```
```

--------------------------------

### sense2vec.eval

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Evaluate a sense2vec model by asking about phrase triples: is word A more similar to word B, or to word C? The recipe will only ask about vectors with the same sense and supports different example selection strategies.

```APIDOC
## sense2vec.eval

### Description
Evaluate a sense2vec model by asking about phrase triples: is word A more similar to word B, or to word C? If the human mostly agrees with the model, the vectors model is good. The recipe will only ask about vectors with the same sense and supports different example selection strategies.

### Method
PRODIGY COMMAND

### Endpoint
`sense2vec.eval`

### Parameters
#### Positional Arguments
- **dataset** (string) - Required - Dataset to save annotations to.
- **vectors_path** (string) - Required - Path to pretrained sense2vec vectors.

#### Options
- **--strategy** (`-st`) (string) - Optional - Example selection strategy.
`most_similar` (default), `most_least_similar` or `random`.
- **--senses** (`-s`) (string) - Optional - Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
- **--exclude-senses** (`-es`) (string) - Optional - Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.
- **--n-freq** (`-f`) (integer) - Optional - Number of most frequent entries to limit to.
- **--threshold** (`-t`) (float) - Optional - Minimum similarity threshold to consider examples.
- **--batch-size** (`-b`) (integer) - Optional - Batch size to use.
- **--eval-whole** (`-E`) (flag) - Optional - Evaluate the whole dataset instead of the current session.
- **--eval-only** (`-O`) (flag) - Optional - Don't annotate, only evaluate the current dataset.
- **--show-scores** (`-S`) (flag) - Optional - Show all scores for debugging.

### Strategies
#### `most_similar`
Pick a random word from a random sense and get its most similar entries of the same sense. Ask about the similarity to the last and middle entry from that selection.

#### `most_least_similar`
Pick a random word from a random sense and get the least similar entry from its most similar entries, and then the last most similar entry of that.

#### `random`
Pick a random sample of 3 words from the same random sense.

### Example
```bash
prodigy sense2vec.eval vectors_eval /path/to/s2v_reddit_2015_md --senses NOUN,ORG,PRODUCT --threshold 0.5
```
```

--------------------------------

### Get Best Sense for a Word

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Find the best-matching sense for a given word. Optionally limit the search to specific senses. Case-insensitive matching is enabled by default.
```python
assert s2v.get_best_sense("duck") == "duck|NOUN"
```

```python
assert s2v.get_best_sense("duck", ["VERB", "ADJ"]) == "duck|VERB"
```

--------------------------------

### Integrate Sense2Vec as a spaCy Pipeline Component

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Add sense2vec as a pipeline component to a spaCy model and access word vector information via extension attributes. Requires spaCy v3 and the sense2vec library installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
#  (('computer vision', 'NOUN'), 0.8636297),
#  (('deep learning', 'NOUN'), 0.8573361)]
```

--------------------------------

### Run Script with Help Option

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use the --help flag to view command-line arguments for any script. This is useful for understanding script options and parameters.

```bash
python scripts/01_parse.py --help
```

--------------------------------

### Sense2Vec Get Frequency

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Retrieves the frequency count for a given key, with an optional default value.

```APIDOC
## Sense2Vec.get_freq

### Description
Get the frequency count for a given key.

### Method
`get_freq`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **key** (unicode / int) - Required - The key to look up.
- **default** - Optional - Default value to return if no frequency is found.

### Request Example
```python
vec = s2v["avocado|NOUN"]
s2v.add("🥑|NOUN", vec, 1234)
assert s2v.get_freq("🥑|NOUN") == 1234
```

### Response
#### Success Response (200)
- **frequency** (int) - The frequency count.

#### Response Example
```json
{ "frequency": 1234 }
```
```

--------------------------------

### Prodigy Recipe: sense2vec.teach for Terminology Bootstrapping

Source: https://context7.com/explosion/sense2vec/llms.txt

Bootstrap terminology lists by suggesting similar terms based on seed phrases. Use the command line for basic and advanced usage, including resuming interrupted processes.

```bash
# Basic usage with seed terms
prodigy sense2vec.teach tech_terms /path/to/s2v_reddit_2015_md \
    --seeds "machine learning, deep learning, neural network" \
    --threshold 0.85

# With additional options
prodigy sense2vec.teach medical_terms /path/to/vectors \
    --seeds "diabetes, hypertension, cardiovascular disease" \
    --threshold 0.80 \
    --n-similar 200 \
    --batch-size 10 \
    --resume  # Continue from existing dataset
```

--------------------------------

### Sense2VecComponent.from_disk

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Loads a Sense2VecComponent from a directory.

```APIDOC
## Sense2VecComponent.from_disk

### Description
Load a `Sense2Vec` object from a directory. Also called when you run `nlp.from_disk`.

### Method
`from_disk`

### Parameters
#### Path Parameters
- **path** (unicode / `Path`) - Required - The path to load from.

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
# from_disk is an instance method, so construct the component first
loaded_component = Sense2VecComponent(nlp.vocab).from_disk("/path/to/model")
```

### Response
#### Success Response (200)
- **Sense2VecComponent** (Sense2VecComponent) - The loaded object.

#### Response Example
```json
{ "example": "Sense2VecComponent object" }
```
```

--------------------------------

### Get Frequency of Sense2Vec Key

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Retrieve the frequency count for a given key. A default value can be provided if the key is not found.

```python
vec = s2v["avocado|NOUN"]
s2v.add("🥑|NOUN", vec, 1234)
assert s2v.get_freq("🥑|NOUN") == 1234
```

--------------------------------

### Get Sense2Vec Vector Count

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Retrieve the number of rows in the vectors table. Asserts the length matches the specified shape.

```python
s2v = Sense2Vec(shape=(300, 128))
assert len(s2v) == 300
```

--------------------------------

### Sense2Vec Serialization - Save and Load Models

Source: https://context7.com/explosion/sense2vec/llms.txt

Illustrates how to save and load Sense2Vec models using `to_disk`, `from_disk`, `to_bytes`, and `from_bytes`. Also shows how to exclude specific fields during disk serialization.

```python
from sense2vec import Sense2Vec
import numpy as np

# Create and populate model
s2v = Sense2Vec(shape=(100, 64), senses=["NOUN", "VERB", "ADJ"])
for i in range(50):
    vec = np.random.rand(64).astype(np.float32)
    s2v.add(f"word_{i}|NOUN", vec, freq=i * 10)

# Save to directory
s2v.to_disk("/path/to/my_vectors")

# Load from directory
loaded_s2v = Sense2Vec().from_disk("/path/to/my_vectors")
assert len(loaded_s2v) == len(s2v)

# Serialize to bytes (useful for network transfer)
bytes_data = s2v.to_bytes()
restored_s2v = Sense2Vec().from_bytes(bytes_data)

# Exclude specific fields during serialization
s2v.to_disk("/path/to/vectors_no_cache", exclude=["cache", "strings"])
```

--------------------------------

### Get other senses for a word with Sense2Vec

Source: https://github.com/explosion/sense2vec/blob/master/README.md

The `Sense2Vec.get_other_senses` method finds entries for the same word but with different senses.
By default, it ignores case when searching.

```python
other_senses = s2v.get_other_senses("duck|NOUN")
# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']
```

--------------------------------

### Get all vectors from Sense2Vec

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Convert the vectors iterator returned by `Sense2Vec.values()` into a list to retrieve all numpy ndarray vectors stored in the table.

```python
all_vecs = list(s2v.values())
```

--------------------------------

### Load pretrained vectors with Sense2Vec

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initialize Sense2Vec and load pretrained vectors from a specified directory. Ensure the directory contains the unpacked vector data.

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
```

--------------------------------

### Sense2VecComponent.__init__

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initializes the Sense2VecComponent with a vocabulary, vector shape, and configuration for phrase merging and lemmatization.

```APIDOC
## Sense2VecComponent.__init__

### Description
Initialize the pipeline component.

### Method
`__init__`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **vocab** (Vocab) - Required - The shared `Vocab`. Mostly used for the shared `StringStore`.
- **shape** (tuple) - Optional - The vector shape.
- **merge_phrases** (bool) - Optional - Whether to merge sense2vec phrases into one token. Defaults to `False`.
- **lemmatize** (bool) - Optional - Always look up lemmas if available in the vectors, otherwise default to original word. Defaults to `False`.
- **overrides** (dict) - Optional - Custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`.

### Request Example
```python
s2v = Sense2VecComponent(nlp.vocab)
```

### Response
#### Success Response (200)
- **Sense2VecComponent** (Sense2VecComponent) - The newly constructed object.

#### Response Example
```json
{ "example": "Sense2VecComponent object" }
```
```

--------------------------------

### Initialize Sense2VecComponent from NLP object

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initialize the component using an existing nlp object. This method is often used as a factory for the component entry point.

```python
s2v = Sense2VecComponent.from_nlp(nlp)
```

--------------------------------

### Serialize Sense2VecComponent to Disk

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Serialize the component to a directory. This is also called when the component is added to the pipeline and nlp.to_disk is run.

```python
s2v.to_disk(path)
```

--------------------------------

### sense2vec.teach

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Bootstrap a terminology list using sense2vec. Prodigy suggests similar terms based on sense2vec vectors, adjusting suggestions as you annotate.

```APIDOC
## sense2vec.teach

### Description
Bootstrap a terminology list using sense2vec. Prodigy will suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.

### Method
PRODIGY COMMAND

### Endpoint
sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold] [--n-similar] [--batch-size] [--resume]

### Parameters
#### Positional Arguments
- **dataset** (positional) - Dataset to save annotations to.
- **vectors_path** (positional) - Path to pretrained sense2vec vectors.

#### Options
- **--seeds, -s** (option) - One or more comma-separated seed phrases.
- **--threshold, -t** (option) - Similarity threshold.
Defaults to `0.85`.
- **--n-similar, -n** (option) - Number of similar items to get at once.
- **--batch-size, -b** (option) - Batch size for submitting annotations.
- **--resume, -R** (flag) - Resume from an existing phrases dataset.

### Request Example
```bash
prodigy sense2vec.teach tech_phrases /path/to/s2v_reddit_2015_md \
    --seeds "natural language processing, machine learning, artificial intelligence"
```

### Response
#### Success Response (200)
- **Annotations** (list) - Saved annotations to the specified dataset.

#### Response Example
(No specific response example provided, output is saved to dataset)
```

--------------------------------

### Sense2VecComponent.to_disk

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Serializes the Sense2VecComponent to a directory.

```APIDOC
## Sense2VecComponent.to_disk

### Description
Serialize the component to a directory. Also called when the component is added to the pipeline and you run `nlp.to_disk`.

### Method
`to_disk`

### Parameters
#### Path Parameters
- **path** (unicode / `Path`) - Required - The path.

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
nlp.to_disk("/path/to/model")
```

### Response
#### Success Response (200)
None

#### Response Example
None
```

--------------------------------

### Sense2Vec Initialization

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initializes a Sense2Vec object with specified parameters.

```APIDOC
## Sense2Vec.__init__

### Description
Initialize the `Sense2Vec` object.

### Method
`__init__`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **shape** (tuple) - Optional - The vector shape. Defaults to `(1000, 128)`.
- **strings** (`spacy.strings.StringStore`) - Optional - Optional string store. Will be created if it doesn't exist.
- **senses** (list) - Optional - Optional list of all available senses.
Used in methods that generate the best sense or other senses.
- **vectors_name** (unicode) - Optional - Optional name to assign to the `Vectors` table, to prevent clashes. Defaults to `"sense2vec"`.
- **overrides** (dict) - Optional - Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`.

### Request Example
```python
s2v = Sense2Vec(shape=(300, 128), senses=["VERB", "NOUN"])
```

### Response
#### Success Response (200)
- **object** (`Sense2Vec`) - The newly constructed object.

#### Response Example
```json
{ "message": "Sense2Vec object created successfully" }
```
```

--------------------------------

### Load and Query Sense2Vec Model Standalone

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Load a pretrained sense2vec model from disk and query for vectors, frequencies, and most similar terms. Ensure the model path is correct and the query term exists in the model.

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]
```

--------------------------------

### Run sense2vec.eval Command

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use this command to evaluate a sense2vec model. Specify the dataset to save annotations, the path to the pretrained vectors, and optional strategies or filters.
```bash
prodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses] [--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole] [--eval-only] [--show-scores]
```

--------------------------------

### Sense2VecComponent.from_nlp

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initializes the Sense2VecComponent from an nlp object, commonly used as a component factory.

```APIDOC
## Sense2VecComponent.from_nlp

### Description
Initialize the component from an nlp object. Mostly used as the component factory for the entry point (see setup.cfg) and to auto-register via the `@spacy.component` decorator.

### Method
`from_nlp` (classmethod)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **nlp** (Language) - Required - The `nlp` object.
- **\*\*cfg** (-) - Optional - Optional config parameters.

### Request Example
```python
s2v = Sense2VecComponent.from_nlp(nlp)
```

### Response
#### Success Response (200)
- **Sense2VecComponent** (Sense2VecComponent) - The newly constructed object.

#### Response Example
```json
{ "example": "Sense2VecComponent object" }
```
```

--------------------------------

### Initialize Sense2Vec with Custom Overrides

Source: https://github.com/explosion/sense2vec/blob/master/README.md

When initializing Sense2Vec, pass a dictionary to the 'overrides' argument to use your custom registered functions for 'make_key' and 'split_key'.

```python
overrides = {"make_key": "custom", "split_key": "custom"}
s2v = Sense2Vec(overrides=overrides)
```

--------------------------------

### Train GloVe Vectors

Source: https://github.com/explosion/sense2vec/blob/master/README.md

This script uses the GloVe library to train word vectors. Ensure GloVe is cloned and built before running.
```bash
python scripts/04_glove_train_vectors.py
```

--------------------------------

### Compare two sense2vec models side-by-side

Source: https://context7.com/explosion/sense2vec/llms.txt

Utilize the sense2vec.eval-ab Prodigy recipe for A/B comparison of two vector models. Options include specifying senses, frequency, batch size, and debugging flags.

```bash
prodigy sense2vec.eval-ab comparison_results \
    /path/to/s2v_reddit_2015_md \
    /path/to/s2v_reddit_2019_lg \
    --senses NOUN,ORG,PRODUCT \
    --n-freq 100000 \
    --batch-size 5
```

```bash
prodigy sense2vec.eval-ab comparison /path/to/model_a /path/to/model_b \
    --show-mapping
```

```bash
prodigy sense2vec.eval-ab comparison /path/to/model_a /path/to/model_b \
    --eval-only --eval-whole
```

--------------------------------

### Prodigy Recipe: sense2vec.to-patterns for EntityRuler

Source: https://context7.com/explosion/sense2vec/llms.txt

Convert accepted phrases from sense2vec.teach into spaCy EntityRuler patterns. Supports basic usage, case-sensitive matching, and dry runs to preview patterns.

```bash
# Generate patterns for entity matching
prodigy sense2vec.to-patterns tech_terms en_core_web_sm TECHNOLOGY \
    --output-file ./patterns.jsonl

# Case-sensitive patterns
prodigy sense2vec.to-patterns brand_names en_core_web_sm BRAND \
    --output-file ./brand_patterns.jsonl \
    --case-sensitive

# Dry run to preview patterns
prodigy sense2vec.to-patterns medical_terms en_core_web_sm MEDICAL --dry
```

--------------------------------

### Train custom sense2vec vectors: Build vocabulary for GloVe

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 3 of training custom vectors: Build word count statistics required for GloVe training. Specify input and output directories, and the path to the GloVe build scripts.
```bash
python scripts/03_glove_build_counts.py ./preprocessed/ ./glove/ \
    /path/to/GloVe/build/
```

--------------------------------

### Initialize Sense2VecComponent

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initialize the pipeline component with a shared Vocab object. This is the primary constructor for the Sense2VecComponent.

```python
s2v = Sense2VecComponent(nlp.vocab)
```

--------------------------------

### Find Similar Terms with Sense2Vec.most_similar

Source: https://context7.com/explosion/sense2vec/llms.txt

Demonstrates using the `most_similar` method to find terms with high cosine similarity to a single key or an average of multiple keys. Also shows how to access frequency rankings.

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/vectors")

# Single key query
similar = s2v.most_similar("machine_learning|NOUN", n=10)
for term, score in similar:
    word, sense = s2v.split_key(term)
    print(f"{word} ({sense}): {score:.4f}")

# Multiple keys - uses average vector
combined_similar = s2v.most_similar(
    ["artificial_intelligence|NOUN", "deep_learning|NOUN"],
    n=5,
    batch_size=32
)
print("\nSimilar to AI + Deep Learning combined:")
for term, score in combined_similar:
    print(f"  {term}: {score:.4f}")

# Using frequency rankings
top_terms = s2v.frequencies[:100]  # Most frequent (key, freq) tuples
for key, freq in top_terms[:5]:
    print(f"{key}: {freq} occurrences")
```

--------------------------------

### Sense2Vec Registry Customization

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Demonstrates how to register custom functions for key generation and splitting within the Sense2Vec registry.

```APIDOC
## Sense2Vec Registry Customization

### Description
This section explains how to customize the functions used by Sense2Vec for generating and splitting keys, and how to apply these customizations when initializing the Sense2Vec model.

### Registry Functions
- `registry.make_key`: Given a `word` and `sense`, return a string of the key, e.g. `"word|sense"`.
- `registry.split_key`: Given a string key, return a `(word, sense)` tuple.
- `registry.make_spacy_key`: Given a spaCy object (`Token` or `Span`) and a boolean `prefer_ents` keyword argument, return a `(word, sense)` tuple.
- `registry.get_phrases`: Given a spaCy `Doc`, return a list of `Span` objects used for sense2vec phrases.
- `registry.merge_phrases`: Given a spaCy `Doc`, get all sense2vec phrases and merge them into single tokens.

### Registering Custom Functions
Use the `register` method as a decorator to add custom functions to the registry.

```python
from sense2vec import registry

@registry.make_key.register("custom")
def custom_make_key(word, sense):
    return f"{word}###{sense}"

@registry.split_key.register("custom")
def custom_split_key(key):
    word, sense = key.split("###")
    return word, sense
```

### Applying Customizations
Pass a dictionary of overrides to the `Sense2Vec` constructor to use your registered custom functions.

```python
overrides = {"make_key": "custom", "split_key": "custom"}
s2v = Sense2Vec(overrides=overrides)
```
```

--------------------------------

### Deserialize Sense2VecComponent from Disk

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Load a Sense2Vec object from a directory. This is also called when nlp.from_disk is run.

```python
# from_disk is an instance method, so construct the component first
loaded_s2v = Sense2VecComponent(nlp.vocab).from_disk(path)
```

--------------------------------

### Train FastText Vectors

Source: https://github.com/explosion/sense2vec/blob/master/README.md

This script uses the FastText library to train word vectors. Ensure FastText is cloned and built before running.
```bash
python scripts/04_fasttext_train_vectors.py
```

--------------------------------

### Train custom sense2vec vectors: Preprocess to sense2vec format

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 2 of training custom vectors: Convert parsed text into the sense2vec format. Specify the input directory containing parsed files and the output directory for preprocessed files.

```bash
python scripts/02_preprocess.py ./parsed/ ./preprocessed/
```

--------------------------------

### Initialize Sense2Vec Object

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Initialize the Sense2Vec object with a specified shape and optional senses. Defaults are used if not provided.

```python
s2v = Sense2Vec(shape=(300, 128), senses=["VERB", "NOUN"])
```

--------------------------------

### Standalone Sense2Vec Usage

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Instantiate the Sense2Vec class directly and load vectors using from_disk. Keys for lookup must follow the 'phrase_text|SENSE' format, and the table is case-sensitive.

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/reddit_vectors-1.1.0")
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=10)
```

--------------------------------

### Evaluate sense2vec model quality

Source: https://context7.com/explosion/sense2vec/llms.txt

Use the sense2vec.eval Prodigy recipe to assess model quality by comparing phrase triples. Specify senses and a similarity threshold.
```bash
prodigy sense2vec.eval eval_results /path/to/vectors \
    --senses NOUN,VERB,ORG \
    --threshold 0.5
```

```bash
prodigy sense2vec.eval eval_data /path/to/vectors \
    --strategy most_similar \
    --n-freq 50000 \
    --batch-size 10
```

```bash
prodigy sense2vec.eval eval_data /path/to/vectors --eval-only
```

--------------------------------

### Customize Sense2Vec Key Functions with Registry

Source: https://context7.com/explosion/sense2vec/llms.txt

Register custom functions for creating and splitting keys using a different delimiter. This allows for flexible encoding schemes.

```python
import numpy as np
from sense2vec import Sense2Vec, registry

# Register custom key format using ":::" instead of "|"
@registry.make_key.register("custom_format")
def custom_make_key(word, sense):
    return f"{word.replace(' ', '_')}:::{sense}"

@registry.split_key.register("custom_format")
def custom_split_key(key):
    if ":::" not in key:
        raise ValueError(f"Invalid key format: {key}")
    word, sense = key.rsplit(":::", 1)
    return word.replace("_", " "), sense

# Use custom functions with Sense2Vec
s2v = Sense2Vec(
    shape=(100, 64),
    overrides={"make_key": "custom_format", "split_key": "custom_format"}
)

# Keys now use custom format
vec = np.random.rand(64).astype(np.float32)
s2v.add("hello_world:::NOUN", vec, freq=100)

# Verify custom format works
assert "hello_world:::NOUN" in s2v
word, sense = s2v.split_key("hello_world:::NOUN")
print(f"Word: {word}, Sense: {sense}")  # Word: hello world, Sense: NOUN
```

--------------------------------

### Register Custom Key Generation Function

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use the registry.make_key.register decorator to define a custom function for generating keys. This function takes a word and sense, returning a string key.
```python
from sense2vec import registry


@registry.make_key.register("custom")
def custom_make_key(word, sense):
    return f"{word}###{sense}"
```

--------------------------------

### Train custom sense2vec vectors: Export to sense2vec format

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 5 of training custom vectors: Export the trained vectors (e.g., from GloVe) into the sense2vec format. Specify the input vector location and the output directory.

```bash
python scripts/05_export.py ./glove/ ./s2v_output/ --vectors-loc ./glove/vectors.txt
```

--------------------------------

### Load and use trained sense2vec vectors

Source: https://context7.com/explosion/sense2vec/llms.txt

Load custom trained sense2vec vectors from disk and find the most similar terms to a given input. The input term should include its sense (e.g., 'term|NOUN').

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("./s2v_output")
print(f"Loaded {len(s2v)} vectors")
print(f"Available senses: {s2v.senses}")

# Test with domain-specific terms
results = s2v.most_similar("your_domain_term|NOUN", n=10)
for term, score in results:
    print(f"{term}: {score:.4f}")
```

--------------------------------

### Convert Phrases to Patterns with sense2vec.to-patterns

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Convert a dataset of phrases into token-based match patterns for spaCy's EntityRuler or other NER recipes. Patterns are written to stdout by default if no output file is specified. Tokenization ensures multi-token terms are handled correctly.

```bash
prodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY --output-file /path/to/patterns.jsonl
```

--------------------------------

### Deserialize Sense2Vec from Disk

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Load a Sense2Vec object from a directory path. This is the counterpart to `to_disk` for restoring models. Fields can be excluded during loading.
```python
s2v.to_disk("/path/to/sense2vec")
new_s2v = Sense2Vec().from_disk("/path/to/sense2vec")
```

--------------------------------

### Serialize Sense2VecComponent to Bytes

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Serialize the component to a bytestring. This is also called when the component is added to the pipeline and nlp.to_bytes is run.

```python
component_bytes = s2v.to_bytes()
```

--------------------------------

### Train custom sense2vec vectors: Train vectors with GloVe

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 4a of training custom vectors: Train word vectors using the GloVe algorithm. Configure the number of threads and training iterations.

```bash
python scripts/04_glove_train_vectors.py ./glove/ /path/to/GloVe/build/ \
  --n-threads 8 --n-iter 15
```

--------------------------------

### Sense2VecComponent.from_bytes

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Loads a Sense2VecComponent from a bytestring.

```APIDOC
## Sense2VecComponent.from_bytes

### Description
Load a component from a bytestring. Also called when you run `nlp.from_bytes`.

### Method
`from_bytes`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **bytes_data** (bytes) - Required - The data to load.

### Request Example
```python
# from_bytes is an instance method, so create the component first
loaded_component = Sense2VecComponent(nlp.vocab).from_bytes(bytes_data)
```

### Response
#### Success Response (200)
- **Sense2VecComponent** (Sense2VecComponent) - The loaded object.

#### Response Example
```json
{ "example": "Sense2VecComponent object" }
```
```

--------------------------------

### Add EntityRuler with patterns from disk

Source: https://context7.com/explosion/sense2vec/llms.txt

Load custom entity patterns from a JSONL file into a spaCy EntityRuler. Ensure the 'entity_ruler' pipe is added before 'ner'.
```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
# Add the ruler before "ner" so its patterns take precedence
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.from_disk("./patterns.jsonl")

doc = nlp("We use machine learning and neural networks.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

--------------------------------

### Deserialize Sense2VecComponent from Bytes

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Load a component from a bytestring. This is also called when nlp.from_bytes is run.

```python
# from_bytes is an instance method, so create the component first
loaded_s2v = Sense2VecComponent(nlp.vocab).from_bytes(bytes_data)
```

--------------------------------

### Merge multi-part archives

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use the 'cat' command to merge split tar.gz files into a single archive. Ensure all parts are in the same directory.

```bash
cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
```

--------------------------------

### Evaluate Sense2Vec Most Similar Entries

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use this command to evaluate a sense2vec model by checking its most similar entries for a given phrase. Specify the dataset, path to vectors, and optionally filter by senses.

```bash
prodigy sense2vec.eval-most-similar vectors_eval_sim /path/to/s2v_reddit_2015_md --senses NOUN,ORG,PRODUCT
```

--------------------------------

### Train custom sense2vec vectors: Train vectors with FastText

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 4b of training custom vectors: Alternatively, train word vectors using the FastText algorithm. Specify input and output directories, and the path to the FastText executable.
```bash
python scripts/04_fasttext_train_vectors.py ./preprocessed/ ./fasttext/ \
  /path/to/fasttext --n-threads 8
```

--------------------------------

### Training Custom Sense2Vec Vectors

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Outlines the requirements for training your own Sense2Vec vectors.

```APIDOC
## Training Custom Sense2Vec Vectors

### Description
This section details the necessary components and tools required to train your own sense2vec vectors from scratch.

### Requirements
- **Large Text Corpus**: A very large source of raw text (ideally more than 1 billion words) is recommended due to the sparsity introduced by senses.
- **Pretrained spaCy Model**: A spaCy model that provides part-of-speech tags, dependencies, named entities, and populates `doc.noun_chunks`. If noun phrase extraction is not built-in for your language, you will need to implement a custom syntax iterator.
- **Vector Training Library**: GloVe or fastText installed and built. You should be able to clone their respective repositories and run `make`.
```

--------------------------------

### Train custom sense2vec vectors: Parsing raw text

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 1 of training custom vectors: Parse raw text using a spaCy model. Specify input and output directories, and the spaCy model to use. Multiprocessing is supported.

```bash
python scripts/01_parse.py ./raw_text/ ./parsed/ en_core_web_lg \
  --n-process 4
```

--------------------------------

### Train custom sense2vec vectors: Precompute nearest neighbors cache

Source: https://context7.com/explosion/sense2vec/llms.txt

Step 6 (optional) of training custom vectors: Precompute a cache of nearest neighbors for faster lookups. Specify the output directory and the number of neighbors to cache.
```bash
python scripts/06_precompute_cache.py ./s2v_output/ --n-neighbors 100
```

--------------------------------

### sense2vec.to-patterns

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Convert a dataset of phrases to token-based match patterns for spaCy's EntityRuler or other NER recipes.

```APIDOC
## sense2vec.to-patterns

### Description
Convert a dataset of phrases collected with `sense2vec.teach` to token-based match patterns that can be used with [spaCy's `EntityRuler`](https://spacy.io/usage/rule-based-matching#entityruler) or recipes like `ner.match`. If no output file is specified, the patterns are written to stdout. The examples are tokenized so that multi-token terms are represented correctly, e.g.: `{"label": "SHOE_BRAND", "pattern": [{ "LOWER": "new" }, { "LOWER": "balance" }]}`.

### Method
PRODIGY COMMAND

### Endpoint
sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file] [--case-sensitive] [--dry]

### Parameters
#### Positional Arguments
- **dataset** (positional) - Phrase dataset to convert.
- **spacy_model** (positional) - spaCy model for tokenization.
- **label** (positional) - Label to apply to all patterns.

#### Options
- **--output-file, -o** (option) - Optional output file. Defaults to stdout.
- **--case-sensitive, -CS** (flag) - Make patterns case-sensitive.
- **--dry, -D** (flag) - Perform a dry run and don't output anything.

### Request Example
```bash
prodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY --output-file /path/to/patterns.jsonl
```

### Response
#### Success Response (200)
- **Patterns** (JSONL) - Token-based match patterns written to stdout or specified file.

#### Response Example
```json
{
  "label": "TECHNOLOGY",
  "pattern": [
    { "LOWER": "natural" },
    { "LOWER": "language" },
    { "LOWER": "processing" }
  ]
}
```
```

--------------------------------

### Serialize Sense2Vec to Disk

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Save a Sense2Vec object to a specified directory path. This allows for persistent storage of the model. Fields can be excluded from saving.

```python
s2v.to_disk("/path/to/sense2vec")
```

--------------------------------

### Sense2VecComponent.__call__

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Processes a Doc object with the Sense2VecComponent, typically as part of the spaCy pipeline.

```APIDOC
## Sense2VecComponent.__call__

### Description
Process a `Doc` object with the component. Typically only called as part of the spaCy pipeline and not directly.

### Method
`__call__`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **doc** (Doc) - Required - The document to process.

### Request Example
```python
processed_doc = s2v(doc)
```

### Response
#### Success Response (200)
- **Doc** (Doc) - the processed document.

#### Response Example
```json
{ "example": "Processed Doc object" }
```
```

--------------------------------

### Register Custom Key Splitting Function

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Use the registry.split_key.register decorator to define a custom function for splitting keys back into word and sense. This function takes a key string and returns a (word, sense) tuple.
```python
from sense2vec import registry


@registry.split_key.register("custom")
def custom_split_key(key):
    word, sense = key.split("###")
    return word, sense
```

--------------------------------

### Add Sense2Vec to spaCy Pipeline Config

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Configure a spaCy pipeline to include a Sense2Vec component by specifying the data path in the [initialize.components.sense2vec] section of the training config.

```ini
[initialize.components]

[initialize.components.sense2vec]
data_path = "/path/to/s2v_reddit_2015_md"
```

--------------------------------

### Process Document with Sense2VecComponent

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Process a Doc object with the component. This method is typically invoked as part of the spaCy pipeline.

```python
doc = s2v(doc)
```

--------------------------------

### sense2vec.eval

Source: https://github.com/explosion/sense2vec/blob/master/README.md

Evaluate a sense2vec model by asking about phrase triples.

```APIDOC
## sense2vec.eval

### Description
Evaluate a sense2vec model by asking about phrase triples.

### Method
PRODIGY COMMAND

### Endpoint
sense2vec.eval [dataset] [vectors_path]

### Parameters
#### Positional Arguments
- **dataset** (positional) - Dataset to save annotations to.
- **vectors_path** (positional) - Path to pretrained sense2vec vectors.

### Request Example
```bash
prodigy sense2vec.eval eval_phrases /path/to/s2v_reddit_2015_md
```

### Response
#### Success Response (200)
- **Evaluation Results** (dict) - Results of the model evaluation.

#### Response Example
(No specific response example provided, output is evaluation results)
```
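
The phrase-triple evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's implementation. It assumes each triple asks whether phrase A is closer to B or to C, and it uses hand-made toy vectors in place of a loaded `Sense2Vec` table. The helper names (`cosine`, `model_choice`, `agreement`), the vector values, and the annotator picks are all hypothetical.

```python
import math

# Toy vectors standing in for a loaded sense2vec table (hypothetical values).
VECTORS = {
    "machine_learning|NOUN": [0.9, 0.1, 0.0],
    "deep_learning|NOUN": [0.8, 0.2, 0.1],
    "banana|NOUN": [0.0, 0.1, 0.9],
    "apple|NOUN": [0.1, 0.2, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def model_choice(a, b, c):
    """Return whichever of b or c the vectors place closer to a."""
    sim_b = cosine(VECTORS[a], VECTORS[b])
    sim_c = cosine(VECTORS[a], VECTORS[c])
    return b if sim_b >= sim_c else c


def agreement(annotations):
    """Fraction of (a, b, c, human_pick) triples where the model matches the human."""
    hits = sum(model_choice(a, b, c) == pick for a, b, c, pick in annotations)
    return hits / len(annotations)


# Hypothetical annotations: the last element is the phrase the annotator picked.
ANNOTATIONS = [
    ("machine_learning|NOUN", "deep_learning|NOUN", "banana|NOUN", "deep_learning|NOUN"),
    ("apple|NOUN", "machine_learning|NOUN", "banana|NOUN", "banana|NOUN"),
    ("banana|NOUN", "apple|NOUN", "deep_learning|NOUN", "deep_learning|NOUN"),
]

score = agreement(ANNOTATIONS)
print(f"Model agrees with the annotator on {score:.0%} of triples")
```

With real vectors, the `cosine(VECTORS[a], VECTORS[b])` lookups would presumably be replaced by `s2v.similarity(a, b)`. A high agreement rate suggests the vectors rank phrases the way a human would, while a low rate points to sparse or noisy senses.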