### Install Turftopic

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/index.md

Installs the turftopic library using pip. This is the first step to using Turftopic in your Python projects.

```bash
pip install turftopic
```

--------------------------------

### Basic KeyNMF Model Usage with Turftopic

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/index.md

Demonstrates how to use the KeyNMF model from Turftopic. It fetches the 20 Newsgroups dataset, fits a KeyNMF model with 20 topics, and prints the topics. This example assumes familiarity with scikit-learn.

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
corpus = newsgroups.data

model = KeyNMF(20).fit(corpus)
model.print_topics()
```

--------------------------------

### Install Development Dependencies with Pip

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/CONTRIBUTING.md

Installs all necessary development dependencies for the Turftopic project. This command assumes you have pip installed and are in the project's root directory.

```console
pip install turftopic[dev]
```

--------------------------------

### Install Turftopic and Dependencies

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/ideologies.md

Installs the Turftopic library along with Plotly for visualization, Pandas, and the `datasets` library for fetching data from Hugging Face Hub. This is a prerequisite for running the tutorial.

```bash
pip install datasets plotly pandas turftopic
```

--------------------------------

### Install Turftopic and Dependencies

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/reviews.md

Installs the turftopic library with its SpaCy extra, along with Plotly and Pandas. It also downloads a small English language model for SpaCy.
```shell
pip install turftopic[spacy] plotly pandas
python -m spacy download en_core_web_sm
```

--------------------------------

### Install Turftopic and Dependencies

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/arxiv_ml.md

Installs the necessary Python libraries for the project, including `datasets`, `plotly`, and `turftopic` with optional dependencies for UMAP-based clustering and datamapplot. The extras are quoted and written without a space so that the shell passes them to pip as a single argument.

```bash
pip install datasets plotly "turftopic[umap-learn,datamapplot]"
```

--------------------------------

### Run Full Test Suite with Pytest

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/CONTRIBUTING.md

Executes the entire test suite for the Turftopic project using pytest. Ensure you have pytest installed and are in the project's root directory.

```console
pytest tests/
```

--------------------------------

### Install Turftopic

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Installs the turftopic library from PyPI. Optional extras add dependencies for specific functionality: `pyro-ppl` for CTMs (using Pyro) and `umap-learn` for clustering models (using UMAP).

```bash
pip install turftopic
```

```bash
pip install "turftopic[pyro-ppl]"
```

```bash
pip install "turftopic[umap-learn]"
```

--------------------------------

### Install MTEB and Initialize KeyNMF with MTEB Encoder

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/multimodal.md

Installs the MTEB library and initializes the KeyNMF model using an encoder compatible with the MTEB multimodal encoder interface. This enables topic modeling on multimodal data. Requires installation of MTEB.
```bash
pip install "mteb<2.0.0"
```

```python
import mteb

from turftopic import KeyNMF

encoder = mteb.get_model("kakaobrain/align-base")
# Pass the MTEB encoder object to the model
multimodal_keynmf = KeyNMF(10, encoder=encoder)
```

--------------------------------

### Install Turftopic with Datamapplot (Bash)

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_interpretation.md

Installs the turftopic library with the datamapplot extra, enabling interactive cluster visualizations. This command is run in the terminal.

```bash
pip install turftopic[datamapplot]
```

--------------------------------

### Asymmetric Example with e5-large-v2

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/encoders.md

An example demonstrating the setup for an asymmetric encoding scenario using the 'intfloat/e5-large-v2' model with KeyNMF. It configures prompts for 'query' and 'passage' and sets 'query' as the default prompt name.

```python
from sentence_transformers import SentenceTransformer

from turftopic import KeyNMF

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

--------------------------------

### Install Turftopic with Jieba for Chinese Tokenization

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/chinese.md

Installs the Turftopic library with the necessary dependencies for Chinese text processing, specifically the jieba tokenizer.

```bash
pip install turftopic[jieba]
```

--------------------------------

### Install Turftopic with datamapplot and OpenAI support

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Installs the Turftopic library with the dependencies needed for datamapplot visualizations and for topic naming through the OpenAI API.
```bash
pip install "turftopic[datamapplot, openai]"
```

--------------------------------

### Install turftopic with Topic Wizard

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/religious.md

Installs the turftopic library along with the topic-wizard extra, which is used for interactive exploration of topic models. This command is executed in a bash environment.

```bash
pip install turftopic[topic-wizard]
```

--------------------------------

### Install Turftopic with UMAP-learn

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/clustering.md

Installs the Turftopic library along with the umap-learn dependency, which clustering topic models can use for dimensionality reduction.

```bash
pip install turftopic[umap-learn]
```

--------------------------------

### Install Turftopic with topic-wizard support

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Installs the Turftopic library with the 'topic-wizard' extra, enabling integration with the topicwizard library for interactive topic model visualization. This is a simple way to explore topic models.

```bash
pip install "turftopic[topic-wizard]"
```

--------------------------------

### Install topic-wizard for Interactive Visualization - Bash

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_interpretation.md

Installs the topic-wizard library using pip. This library is used in conjunction with Turftopic for interactive exploration and visualization of topic models.

```bash
pip install topic-wizard
```

--------------------------------

### Install Turftopic with Pyro Support

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/ctm.md

Installs the Turftopic library with the `pyro-ppl` extra (Pyro), which is required for using Autoencoding Topic Models.
```bash
pip install turftopic[pyro-ppl]
```

--------------------------------

### Install turftopic with pyro-ppl support

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb

Installs or upgrades the turftopic Python package to the latest version, including the pyro-ppl backend for probabilistic modeling. The `%pip` magic runs the command from within a notebook cell.

```python
%pip install --upgrade turftopic[pyro-ppl]
```

--------------------------------

### Install Plotly for Dynamic Topic Modeling

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/dynamic.md

This command installs the Plotly library, which is a required dependency for visualizations when working with dynamic topic models in Turftopic. Ensure you have pip installed and configured correctly.

```bash
pip install plotly
```

--------------------------------

### Load and Prepare ML Paper Dataset

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/arxiv_ml.md

Loads a dataset of machine learning papers from HuggingFace Hub using the `datasets` library. It then subsamples the dataset to 10,000 examples for faster processing and extracts the abstracts.

```python
from datasets import load_dataset

ds = load_dataset("CShorten/ML-ArXiv-Papers", split="train")
# Subsampling dataset
ds = ds.train_test_split(seed=42, test_size=10_000)["test"]
abstracts = ds["abstract"]
```

--------------------------------

### Install Keyphrase-Vectorizers

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/keyphrase.md

Installs the Keyphrase-Vectorizers library, which is required for extracting keyphrases from text. This library relies on SpaCy for POS-tagging to identify noun phrases.
```bash
pip install keyphrase-vectorizers
```

--------------------------------

### Visualization with datamapplot

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/clustering.md

Explains how to install datamapplot and use it for interactive cluster visualization within Turftopic.

````APIDOC
## Visualization

You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
You will first have to install `datamapplot` for this to work:

```bash
pip install turftopic[datamapplot]
```

```python
from turftopic import ClusteringTopicModel
from turftopic.analyzers import OpenAIAnalyzer

model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

analyzer = OpenAIAnalyzer("gpt-5-nano")
analysis_res = model.analyze_topics(analyzer)

fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig
```

_See Figure 1_

!!! info
    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
````

--------------------------------

### Speed Up Models with ONNX Backend

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/encoders.md

Illustrates how to use the ONNX backend, available in sentence-transformers from version 3.2.0, for significantly faster model inference. This requires installing sentence-transformers with the `onnx` or `onnx-gpu` extra and specifying `backend="onnx"` when initializing the SentenceTransformer.
```bash
pip install "sentence-transformers[onnx,onnx-gpu]"
```

```python
from sentence_transformers import SentenceTransformer

from turftopic import SemanticSignalSeparation

encoder = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

model = SemanticSignalSeparation(10, encoder=encoder)
```

--------------------------------

### Initialize OpenAI Analyzer

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/analyzers.md

Initializes an OpenAIAnalyzer for topic analysis using the OpenAI API. Requires the 'turftopic[openai]' package to be installed and the OPENAI_API_KEY environment variable to be set. The default model used is 'gpt-5-nano'.

```bash
pip install turftopic[openai]
export OPENAI_API_KEY="sk-"
```

```python
from turftopic.analyzers import OpenAIAnalyzer

analyzer = OpenAIAnalyzer('gpt-5-nano')
```

--------------------------------

### Python Example for File Copying

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

This Python snippet illustrates how to copy a file from one location to another. It utilizes the `shutil` module, a standard library for high-level file operations. This is useful for tasks like creating backups or staging files for further processing.
```python
import os
import shutil


def copy_file(source_path, destination_path):
    """Copies a file from source to destination."""
    try:
        if not os.path.exists(source_path):
            print(f"Error: Source file not found at {source_path}")
            return
        shutil.copy2(source_path, destination_path)  # copy2 preserves metadata
        print(f"File copied from {source_path} to {destination_path}")
    except Exception as e:
        print(f"An error occurred during file copy: {e}")


# Example usage:
# source = 'path/to/original/file.txt'
# destination = 'path/to/destination/file.txt'
# copy_file(source, destination)
```

--------------------------------

### Run Code Formatting Checks

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/CONTRIBUTING.md

Performs code formatting checks using either Black or Ruff. These commands should be run from the project's root directory to ensure consistency.

```console
black .
## or
ruff .
```

--------------------------------

### Python: Dynamic Online Topic Modeling Setup

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/KeyNMF.md

Sets up the time bins required for dynamic online topic modeling. This involves defining a list of datetime objects that represent the boundaries for the time periods over which the corpus is to be analyzed. The model cannot infer these bins automatically.

```python
from datetime import datetime

# We will bin by years in a period of 2020-2030
bins = [datetime(year=y, month=1, day=1) for y in range(2020, 2030 + 2, 1)]
```

--------------------------------

### Install and Use LemmaCountVectorizer with SpaCy for Lemmatization

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/vectorizers.md

Demonstrates the installation of Turftopic with SpaCy support and a SpaCy model, followed by the usage of 'LemmaCountVectorizer' for lemmatizing words before topic modeling. This method relies on a SpaCy pipeline for accurate lemmatization.
```bash
pip install turftopic[spacy]
python -m spacy download en_core_web_sm
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import LemmaCountVectorizer

model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
```

--------------------------------

### Initialize KeyNMF with SentenceTransformers Encoder

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/multimodal.md

Initializes the KeyNMF model with a specified multimodal encoder from SentenceTransformers. This allows for topic modeling on corpora containing both text and images. Ensure SentenceTransformers is installed.

```python
from turftopic import KeyNMF

multimodal_keynmf = KeyNMF(10, encoder="clip-ViT-B-32")
```

--------------------------------

### KeyNMF with Lemma Extraction

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Trains a KeyNMF model using LemmaCountVectorizer for extracting lemmas as features. This requires spaCy installation and a downloaded model. The output includes topics with their highest-ranking lemmas.

```bash
pip install turftopic[spacy]
python -m spacy download "en_core_web_sm"
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import LemmaCountVectorizer

model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
```

--------------------------------

### Initialize T5 Analyzer

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/analyzers.md

Initializes a T5Analyzer for topic analysis. T5 models are generally less resource-intensive than causal language models but may produce lower-quality results, requiring potential tuning. This example uses the 'google/flan-t5-large' model.
```python
# Imported from turftopic.analyzers, like the other analyzers in these docs
from turftopic.analyzers import T5Analyzer

analyzer = T5Analyzer("google/flan-t5-large")
```

--------------------------------

### Interactive Cluster Visualization with datamapplot

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/clustering.md

Visualizes topic clusters interactively using datamapplot. Requires the `datamapplot` package to be installed. This snippet first fits a model, analyzes topics using an OpenAI analyzer, and then generates an HTML visualization of the clusters.

```bash
pip install turftopic[datamapplot]
```

```python
from turftopic import ClusteringTopicModel
from turftopic.analyzers import OpenAIAnalyzer

model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

analyzer = OpenAIAnalyzer("gpt-5-nano")
analysis_res = model.analyze_topics(analyzer)

fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig
```

--------------------------------

### Print Representative Documents for a Topic

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Shows documents that are most representative of a specific topic. This requires the fitted model, the corpus, and the document-topic matrix. It helps to see real-world examples of content related to a topic.

```python
# Print highest ranking documents for topic 0
model.print_representative_documents(0, corpus, document_topic_matrix)
```

--------------------------------

### Automated Topic Naming with OpenAI

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Assigns human-readable names to topics automatically using an OpenAI language model. Requires `turftopic[openai]` installation and an OpenAI API key. It fits a model and then renames topics.
```python
from turftopic import KeyNMF
from turftopic.analyzers import OpenAIAnalyzer

model = KeyNMF(10).fit(corpus)

namer = OpenAIAnalyzer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
```

--------------------------------

### Python Script for Data Processing

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

This Python script demonstrates a common pattern for processing data, likely involving file I/O and data manipulation. It serves as a foundational example for data-related tasks within the turftopic project. Specific input/output formats and error handling might vary.

```python
import pandas as pd


def process_data(input_file, output_file):
    """Reads data from a CSV, performs some transformations, and saves to a new CSV."""
    try:
        df = pd.read_csv(input_file)
        # Example transformation: Add a new column based on existing ones
        if 'col1' in df.columns and 'col2' in df.columns:
            df['new_col'] = df['col1'] * df['col2']
        else:
            print("Warning: 'col1' or 'col2' not found for transformation.")
        df.to_csv(output_file, index=False)
        print(f"Data processed successfully and saved to {output_file}")
    except FileNotFoundError:
        print(f"Error: Input file not found at {input_file}")
    except Exception as e:
        print(f"An error occurred during data processing: {e}")


# Example usage:
# process_data('input_data.csv', 'processed_data.csv')
```

--------------------------------

### Initialize KeyNMF Model and Prepare Topic Data - Python

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_interpretation.md

Initializes a KeyNMF model with a specified number of topics and prepares the topic data from a corpus. This is a prerequisite for most interpretation and visualization functionalities.
```python
from turftopic import KeyNMF

model = KeyNMF(10)
topic_data = model.prepare_topic_data(corpus)
```

--------------------------------

### Set up and Use OpenAI Analyzer for Topic Analysis

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Configures and utilizes the OpenAIAnalyzer for topic analysis, requiring the `openai` package and an API key. This method allows for topic analysis using specified OpenAI models, with an option to enable document summaries. It also includes a sample of how topic results might be presented.

```bash
pip install openai
export OPENAI_API_KEY="sk-"
```

```python
from turftopic.analyzers import OpenAIAnalyzer

# We enable document summaries for topic analysis
analyzer = OpenAIAnalyzer("gpt-5-nano", use_summaries=True)
analysis_res = model.analyze_topics(analyzer)

model.print_topics()
```

--------------------------------

### Launch topicwizard Web App with Model and Documents (Python)

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_interpretation.md

Launches the topicwizard web app for interactive topic model exploration using a model object and a representative sample of documents. Requires the topicwizard library and the documents and model objects.

```python
import topicwizard

topicwizard.visualize(corpus=documents, model=model)
```

--------------------------------

### Install and Use KeyphraseCountVectorizer for Keyphrase Extraction

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/vectorizers.md

This snippet shows how to install the 'keyphrase-vectorizers' package and use 'KeyphraseCountVectorizer' with 'KeyNMF' for extracting keyphrases from a corpus. It bypasses the need for SpaCy's dependency parser for potentially faster processing.
```bash
pip install keyphrase-vectorizers
```

```python
from keyphrase_vectorizers import KeyphraseCountVectorizer

from turftopic import KeyNMF

vectorizer = KeyphraseCountVectorizer()
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)
```

--------------------------------

### Initialize KeyNMF Topic Model

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Initializes a KeyNMF topic model with a specified number of components and top N words per topic. This is a foundational step for topic modeling in Turftopic.

```python
from turftopic import KeyNMF

model = KeyNMF(n_components=10, top_n=15)
```

--------------------------------

### Install and Use StemmingCountVectorizer with Snowball for Stemming

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/vectorizers.md

This snippet covers the installation of Turftopic with Snowball Stemmer support and its subsequent use with 'StemmingCountVectorizer' for aggressive word stemming. It's an alternative to lemmatization, useful for speed or specific stemming needs, using the Snowball Stemmer library.

```bash
pip install turftopic[snowball]
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.snowball import StemmingCountVectorizer

model = KeyNMF(10, vectorizer=StemmingCountVectorizer(language="english"))
model.fit(corpus)
model.print_topics()
```

--------------------------------

### Dimensionality Reduction with UMAP in Python

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/clustering.md

Sets up Turftopic's ClusteringTopicModel to utilize UMAP for dimensionality reduction. UMAP is a versatile non-linear technique, often preferred for topic discovery due to its speed and better preservation of global data structures compared to TSNE. Installation via `pip install umap-learn` is required.
```python
from umap import UMAP

from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(
    dimensionality_reduction=UMAP(n_components=2, metric="cosine")
)
```

--------------------------------

### Get Selected Indices

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

Retrieves the set of currently selected data indices from the DataSelectionManager.

```javascript
getSelectedIndices() {
  return this.dataSelectionManager.getSelectedIndices();
}
```

--------------------------------

### Prepare Topic Data for Interpretation with prepare_topic_data()

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Shows how to use the `prepare_topic_data()` method to fit a model (if not already fitted) and prepare data structures essential for model interpretation and visualization. It returns a `TopicData` object containing various attributes like corpus, vocabulary, and topic distributions.

```python
corpus: list[str] = ["this is a document", "this is yet another document", ...]

topic_data = model.prepare_topic_data(corpus)
# print to see what attributes you can access.
print(topic_data)
```

--------------------------------

### Fit and Print Topics with turftopic KeyNMF Model (Python)

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb

This snippet shows how to initialize and train a KeyNMF model using a provided corpus and precomputed embeddings. After fitting, it calls the print_topics() method to display the extracted topics. It depends on the turftopic library and assumes that an encoder (named `trf` here) and the `embeddings` array are already defined earlier in the notebook.
```python
from turftopic import KeyNMF

model = KeyNMF(20, encoder=trf).fit(corpus, embeddings=embeddings)
model.print_topics()
```

--------------------------------

### Fitting Multimodal Models

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/multimodal.md

Provides examples of how to use the `fit_multimodal` method with various Turftopic models, including SemanticSignalSeparation, KeyNMF, Clustering Models, GMM, and AutoEncodingTopicModel.

````APIDOC
## Basic Usage: Fitting Multimodal Models

### Description
All multimodal models in Turftopic provide a `fit_multimodal()` or `fit_transform_multimodal()` method to discover topics within multimodal corpora (text and images). After fitting, `plot_multimodal_topics()` can be used for visualization.

### Method
`fit_multimodal(texts: list[str], images: list[PIL.Image.Image])` or `fit_transform_multimodal(texts: list[str], images: list[PIL.Image.Image])`

### Endpoint
N/A (Client-side Python code)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
from PIL import Image

from turftopic import (
    SemanticSignalSeparation,
    KeyNMF,
    ClusteringTopicModel,
    GMM,
    AutoEncodingTopicModel,
)

# Sample data (replace with your actual data)
texts = ["text 1", "text 2"]
images = [
    Image.new("RGB", (60, 30), color="red"),
    Image.new("RGB", (60, 30), color="blue"),
]

# The number of topics is passed positionally, as in the other KeyNMF examples

# SemanticSignalSeparation
model_sss = SemanticSignalSeparation(12, encoder="clip-ViT-B-32")
model_sss.fit_multimodal(texts, images=images)
# model_sss.plot_multimodal_topics()

# KeyNMF
model_knmf = KeyNMF(12, encoder="clip-ViT-B-32")
model_knmf.fit_multimodal(texts, images=images)
# model_knmf.plot_multimodal_topics()

# Clustering Models (BERTopic-style)
model_cluster_bt = ClusteringTopicModel(encoder="clip-ViT-B-32", feature_importance="c-tf-idf")
model_cluster_bt.fit_multimodal(texts, images=images)
# model_cluster_bt.plot_multimodal_topics()

# Clustering Models (Top2Vec-style)
model_cluster_t2v = ClusteringTopicModel(encoder="clip-ViT-B-32", feature_importance="centroid")
model_cluster_t2v.fit_multimodal(texts, images=images)
# model_cluster_t2v.plot_multimodal_topics()

# GMM
model_gmm = GMM(12, encoder="clip-ViT-B-32")
model_gmm.fit_multimodal(texts, images=images)
# model_gmm.plot_multimodal_topics()

# AutoEncodingTopicModel (CombinedTM)
model_aetm_combined = AutoEncodingTopicModel(12, combined=True, encoder="clip-ViT-B-32")
model_aetm_combined.fit_multimodal(texts, images=images)
# model_aetm_combined.plot_multimodal_topics()

# AutoEncodingTopicModel (ZeroShotTM)
model_aetm_zs = AutoEncodingTopicModel(12, combined=False, encoder="clip-ViT-B-32")
model_aetm_zs.fit_multimodal(texts, images=images)
# model_aetm_zs.plot_multimodal_topics()
```

### Response
#### Success Response (200)
N/A (This is a client-side method call, not an API endpoint. Success is indicated by the method completing without errors.)

#### Response Example
N/A
````

--------------------------------

### Plotting Topics Over Time with KeyNMF

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/KeyNMF.md

Generates an interactive HTML figure to visualize topic trends over time. Requires the 'plotly' library to be installed. Hovering over terms reveals their importance.

```bash
pip install plotly
```

```python
model.plot_topics_over_time()
```

--------------------------------

### Load 20 Newsgroups Dataset

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/keyphrase.md

Loads a subset of the 20 Newsgroups dataset using scikit-learn. It removes headers, footers, and quotes, and filters for specific categories relevant to the demonstration.
```python
from sklearn.datasets import fetch_20newsgroups

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "talk.religion.misc",
        "alt.atheism",
    ],
).data
```

--------------------------------

### Fit Dynamic Topic Model with KeyNMF

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/dynamic.md

Demonstrates how to fit a dynamic topic model using the KeyNMF algorithm. This involves initializing the model and then calling the `fit_transform_dynamic` method with a corpus and a list of timestamps. The `bins` parameter controls the number of time slices for analysis.

```python
from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = []
timestamps: list[datetime] = []

model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
```

--------------------------------

### Get Initial Viewport Size in JavaScript

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/images/cluster_datamapplot.html

Retrieves the initial viewport size from the document's client width and height. This function is used to determine the initial dimensions for the DeckGL map.

```javascript
function getInitialViewportSize() {
  const width = document.documentElement.clientWidth;
  const height = document.documentElement.clientHeight;
  return { viewportWidth: width, viewportHeight: height };
}
```

--------------------------------

### Finetune KeyNMF Models on New Corpus

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/finetuning.md

Enables finetuning of pre-trained KeyNMF models on new, unseen text data using the `partial_fit()` method. The finetuned model can then be saved to disk.

```python
from turftopic import load_model

model = load_model("pretrained_keynmf_model")

print(type(model))
# turftopic.models.keynmf.KeyNMF

new_corpus: list[str] = [...]
# Finetune the model to the new corpus
model.partial_fit(new_corpus)

model.to_disk("finetuned_model/")
```

--------------------------------

### Finetuning KeyNMF Model with New Data

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/online.md

Shows how to finetune a pre-trained KeyNMF model on a novel corpus. This allows the model's topics to adapt to new data without retraining from scratch. The process involves loading a saved model and then calling `partial_fit` with the new data.

```python
from turftopic import load_model

model = load_model("pretrained_keynmf_model")

new_corpus: list[str] = [...]  # New data
# Finetune the model to the new corpus
model.partial_fit(new_corpus)

model.to_disk("finetuned_model/")
```

--------------------------------

### Configure LLMAnalyzer with Context

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/analyzers.md

Initializes an `LLMAnalyzer` with custom context to guide the analysis process. This allows the LLM to focus on specific domains or tasks, such as analyzing financial documents. Dependencies include `turftopic.analyzers.LLMAnalyzer`.

```python
from turftopic.analyzers import LLMAnalyzer

analyzer = LLMAnalyzer(
    context="Analyze topical content in financial documents published by the central bank."
)
```

--------------------------------

### Map Layer Order Management

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

Defines the order of map layers and provides a function to get a layer's index based on its ID. This helps in managing the z-index of different map components.
```javascript
LAYER_ORDER = ['dataPointLayer', 'boundaryLayer', 'LabelLayer'];

function getLayerIndex(object) {
  return LAYER_ORDER.indexOf(object.id);
}
```

--------------------------------

### Fit and Interpret KeyNMF Model in Python

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/KeyNMF.md

Demonstrates the basic usage of KeyNMF for topic modeling. It initializes the model with a specified number of topics and an encoder, fits it to a corpus, and then prints the discovered topics. Requires the turftopic library.

```python
from turftopic import KeyNMF

model = KeyNMF(10, encoder="paraphrase-MiniLM-L3-v2")
model.fit(corpus)
model.print_topics()
```

--------------------------------

### KeyNMF with Multilingual Tokenization (Arabic)

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Configures and trains a KeyNMF model for Arabic text using TokenCountVectorizer and a multilingual sentence transformer encoder. This setup allows for topic modeling on non-English corpora.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import TokenCountVectorizer

# CountVectorizer for Arabic
vectorizer = TokenCountVectorizer("ar", min_df=10)

model = KeyNMF(
    n_components=10,
    vectorizer=vectorizer,
    encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet",
)
model.fit(corpus)
```

--------------------------------

### Launch topicwizard Web App with TopicData (Python)

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_interpretation.md

Launches the topicwizard web app for interactive topic model exploration using a TopicData object. This method requires a pre-existing TopicData object.
```python
topic_data.visualize_topicwizard()
```

--------------------------------

### Accessing TopicData Attributes

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/topic_data.md

Demonstrates that TopicData objects are dict-like and allow attribute access using dot notation. This example shows the equivalence of accessing the shape of the document_term_matrix using both dictionary key access and attribute access.

```python
# They are the same
assert topic_data["document_term_matrix"].shape == topic_data.document_term_matrix.shape
```

--------------------------------

### Use Instruct Models with KeyNMF for Keyword Retrieval

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/encoders.md

Shows how to use instruction-tuned embedding models like Microsoft's E5 with KeyNMF for keyword retrieval. These models expect prompts, so the encoder is given a prompt for each role and the default prompt name is set to 'query'. Documents act as queries and words as passages.

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: "
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

--------------------------------

### Load and Prepare Chinese Text Data

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/chinese.md

Loads the ThuNews corpus from the Chinese MTEB dataset and subsamples it for faster processing. Requires the `datasets` library.

```python
import itertools
import random

from datasets import load_dataset

# Loads the dataset
ds = load_dataset("C-MTEB/ThuNewsClusteringP2P", split="test")

# Wrangles the dataset from a list of lists to a single list
corpus = list(itertools.chain.from_iterable(ds["sentences"]))

# Subsampling the corpus so that the script runs faster
random.seed(42)
corpus = random.sample(corpus, 10000)
```

--------------------------------

### Viewport and Map Calculation Utilities

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

Contains functions to get the initial viewport size and calculate the appropriate zoom level and center coordinates for the map based on provided bounds. These are essential for initializing the map view.

```javascript
function getInitialViewportSize() {
  const width = document.documentElement.clientWidth;
  const height = document.documentElement.clientHeight;
  return { viewportWidth: width, viewportHeight: height };
}

function calculateZoomLevel(bounds, viewportWidth, viewportHeight, padding = 0) {
  const lngRange = bounds[1] - bounds[0];
  const latRange = bounds[3] - bounds[2];
  const centerLng = (bounds[0] + bounds[1]) / 2;
  const centerLat = (bounds[2] + bounds[3]) / 2;
  const zoomX = Math.log2(360 / (lngRange / (viewportWidth / 256)));
  const zoomY = Math.log2(180 / (latRange / (viewportHeight / 256)));
  const zoom = Math.min(zoomX, zoomY) - padding;
  return { zoomLevel: zoom, dataCenter: [centerLng, centerLat] };
}
```

--------------------------------

### Visualize Topics using Turftopic and OpenAI Analyzer

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Demonstrates fitting a ClusteringTopicModel, renaming topics using OpenAIAnalyzer, and visualizing the model with datamapplot. Requires the 'turftopic' library and optionally 'openai'. Outputs an interactive figure.

```python
from turftopic import ClusteringTopicModel
from turftopic.analyzers import OpenAIAnalyzer

model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

namer = OpenAIAnalyzer("gpt-5-nano")
model.rename_topics(namer)

fig = model.plot_clusters_datamapplot()
fig.show()
```

--------------------------------

### Automated Topic Naming with OpenAIAnalyzer

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_interpretation.md

Analyzes topics using an OpenAI language model to generate meaningful names and descriptions. It requires a fitted topic model and an initialized OpenAIAnalyzer. The topics are then printed.

```python
from turftopic import KeyNMF
from turftopic.analyzers import OpenAIAnalyzer

# Any fitted topic model works here; `corpus` is assumed to be a list of documents
model = KeyNMF(10).fit(corpus)

analyzer = OpenAIAnalyzer("gpt-5-nano")
analysis_res = model.analyze_topics(analyzer)

model.print_topics()
```

--------------------------------

### Use Local LLM Analyzer for Topic Analysis

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Integrates a local LLM analyzer to generate topic names and descriptions. It requires the `turftopic.analyzers.LLMAnalyzer` and enables document summaries for richer analysis. The output provides topic names derived from the model's analysis.

```python
from turftopic.analyzers import LLMAnalyzer

# We enable document summaries for topic analysis
analyzer = LLMAnalyzer(use_summaries=True)
analysis_res = model.analyze_topics(analyzer)

print(analysis_res.topic_names)
```

--------------------------------

### Load 20 Newsgroups Dataset using Scikit-learn

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb

This snippet demonstrates how to load the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents, suitable for topic modeling and text classification tasks. It utilizes the `fetch_20newsgroups` function from Scikit-learn to retrieve the data.

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data
```

--------------------------------

### Precomputing Embeddings for Large Corpora

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/online.md

An example script to precompute sentence embeddings for a large corpus using `sentence-transformers`. These embeddings can then be used with `partial_fit` to speed up the KeyNMF model training process, especially when dealing with very large text datasets.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assuming utils.py contains load_corpus function
from utils import load_corpus

corpus = load_corpus()

trf = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = trf.encode(corpus)

np.save("embeddings.npy", embeddings)
```

--------------------------------

### Add Metadata and Tooltip Functionality

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

Adds metadata to the visualization, enabling tooltips and click event handling. It configures the deck.gl instance with a tooltip function and an optional click handler. It also preprocesses metadata for searching.

```javascript
addMetaData(metaData, {
  tooltipFunction = ({ index }) => this.metaData.hover_text[index],
  onClickFunction = null,
  searchField = null,
}) {
  this.metaData = metaData;
  this.tooltipFunction = tooltipFunction;
  this.onClickFunction = onClickFunction;
  this.searchField = searchField;

  if (this.metaData.hasOwnProperty('hover_text')) {
    this.deckgl.setProps({
      getTooltip: this.tooltipFunction,
    });
  }

  if (this.onClickFunction) {
    this.deckgl.setProps({
      onClick: this.onClickFunction,
    });
  }

  if (this.searchField) {
    this.searchArray = this.metaData[this.searchField].map(d => d.toLowerCase());
  }
}
```

--------------------------------

### KeyNMF with Noun Phrase Extraction

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/model_definition_and_training.md

Trains a KeyNMF model using NounPhraseCountVectorizer for extracting noun phrases as features. Requires installing Turftopic with spaCy support and downloading a spaCy model. Outputs a table of topics with their highest-ranking terms.

```bash
pip install turftopic[spacy]
python -m spacy download "en_core_web_sm"
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
```

--------------------------------

### Estimate word importance with Clustering model in Python

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/finetuning.md

Shows how to fit a ClusteringTopicModel with a specified feature importance method and then print the topics. This illustrates the process of obtaining topic representations with different importance estimations.

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=5, feature_importance="soft-c-tf-idf").fit(corpus)
model.print_topics()
```

--------------------------------

### Use TokenCountVectorizer with SpaCy for Non-English Languages

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/vectorizers.md

Shows how to install Turftopic with SpaCy and use 'TokenCountVectorizer' for vectorizing text in non-English languages, leveraging SpaCy's language-specific tokenization without requiring a full SpaCy pipeline. This is beneficial for languages where default tokenization is insufficient.

```bash
pip install turftopic[spacy]
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import TokenCountVectorizer

# CountVectorizer for Arabic
vectorizer = TokenCountVectorizer("ar", min_df=10)

model = KeyNMF(
    n_components=10,
    vectorizer=vectorizer,
    encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
)
model.fit(corpus)
```

--------------------------------

### Fit AutoEncodingTopicModel and Print Topics

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb

This Python snippet demonstrates how to initialize an AutoEncodingTopicModel with a specified number of topics and an encoder. It then fits the model to a given corpus and embeddings, and finally prints the extracted topics. Dependencies include the AutoEncodingTopicModel class from turftopic and a suitable encoder (e.g., 'trf'). Inputs are the corpus and embeddings, and the output is the printed representation of topics.

```python
from turftopic import AutoEncodingTopicModel

model = AutoEncodingTopicModel(20, encoder=trf).fit(corpus, embeddings=embeddings)
model.print_topics()
```

--------------------------------

### Define Custom Prompts for LLMAnalyzer in Python

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/analyzers.md

This snippet shows how to define custom system, namer, description, and summary prompts for the LLMAnalyzer. Prompts are formatted using Python's `str.format()`, expecting templated content within curly brackets. The analyzer is then instantiated with these custom prompts.

```python
from turftopic.analyzers import LLMAnalyzer

system_prompt = """
You are a topic analyzer. Follow instructions closely and exactly.
"""

namer_prompt = """
Please provide a human-readable name for a topic.
The topic is described by the following set of keywords: {keywords}.
"""

description_prompt = """
Describe the following topic in a couple of sentences.
The topic is described by the following set of keywords: {keywords}.
"""

summary_prompt = """
Summarize the following document: {document}
"""

namer = LLMAnalyzer(
    system_prompt=system_prompt,
    namer_prompt=namer_prompt,
    description_prompt=description_prompt,
    summary_prompt=summary_prompt
)
```

--------------------------------

### DataMap Initialization and Configuration (JavaScript)

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/images/cluster_datamapplot.html

Initializes a DataMap instance with a container, geographical bounds, and item IDs for search and selection. This sets up the primary visualization component for the project.

```javascript
const container = document.getElementById('deck-container');

const datamap = new DataMap({
  container: container,
  bounds: [-8.976655139923096, 8.372218265533448, -8.90875467300415, 9.608116474151611],
  searchItemId: searchItemId,
  lassoSelectionItemId: selectionItemId,
});
```

--------------------------------

### Plot Concept Compass with Plotly

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/s3.md

Generates a concept compass plot to visualize the relationship between terms and semantic axes using Plotly. Requires the 'plotly' library to be installed. This function plots terms based on their scores along two specified topics (axes).

```bash
pip install plotly
```

```python
fig = model.plot_concept_compass(topic_x=1, topic_y=4)
fig.show()
```

--------------------------------

### Topic Modeling with Noun Phrase Vectorization using spaCy

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/README.md

Performs topic modeling using BERTopic with a custom `NounPhraseCountVectorizer` from spaCy. Requires `turftopic[spacy]` installation and a spaCy language model. This vectorizer focuses on noun phrases for topic representation.

```python
from turftopic import BERTopic
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = BERTopic(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
```

--------------------------------

### JavaScript: Initialize Web Workers for Data Processing

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

This snippet demonstrates the initialization of two Web Workers using a pre-defined `parsingWorkerBlob`. These workers are intended for handling 'label data' and 'point data' asynchronously, suggesting a pipeline for processing large datasets for visualization or analysis.

```javascript
const searchItemId = "text-search";
const histogramItemId = "d3histogram-container";
const selectionItemId = "lasso-select";

const searchItem = document.getElementById(searchItemId);
let histogramItem = null;

const container = document.getElementById('deck-container');

const labelDataWorker = new Worker(workerUrl);
const pointDataWorker = new Worker(workerUrl);
```

--------------------------------

### Shell Command for Directory Operations

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

This entry showcases a common shell command used for managing directories, specifically for creating a new directory. This is a fundamental operation often required during project setup or in build scripts. It does not involve complex logic but is essential for file system management.

```bash
mkdir my_new_directory
```

--------------------------------

### Initiate Data Layer Loading - JavaScript

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/tutorials/images/arxiv_ml_datamapplot.html

Calls the functions to load the point, label, and meta-data layers sequentially. These functions are responsible for initiating the data fetching and rendering process within the Turftopic project.

```javascript
loadPointDataLayer();
loadLabelDataLayer();
loadMetaData();
```

--------------------------------

### Noun Phrase Vectorization with SpaCy

Source: https://github.com/x-tabdeveloping/turftopic/blob/main/docs/vectorizers.md

This snippet shows how to use Turftopic's NounPhraseCountVectorizer, which leverages SpaCy for extracting noun phrases as features. It requires `turftopic[spacy]` and a SpaCy language model (e.g., 'en_core_web_sm'). Installation instructions are provided. Model fitting can be slower but may yield higher quality results.

```bash
pip install turftopic[spacy]
```

```bash
python -m spacy download en_core_web_sm
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
```
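
Because the noun phrase example depends on both the `turftopic[spacy]` extra and a downloaded spaCy pipeline, it can be convenient to confirm that everything is importable before starting a slow fit. The following is a minimal sketch using only the standard library; the helper name `missing_requirements` and the `REQUIRED` list are illustrative assumptions, not part of Turftopic. It relies on the fact that pipelines installed with `python -m spacy download` are regular Python packages.

```python
import importlib.util


def missing_requirements(packages):
    """Return the names in `packages` that cannot be found in this environment.

    Hypothetical helper for illustration; it only looks up top-level modules
    and does not import them.
    """
    return [name for name in packages if importlib.util.find_spec(name) is None]


# Top-level import names assumed to be needed by the noun phrase example.
REQUIRED = ["turftopic", "spacy", "en_core_web_sm"]

missing = missing_requirements(REQUIRED)
if missing:
    print("Install before fitting:", ", ".join(missing))
else:
    print("All requirements found.")
```

If the check reports a missing name, the install commands shown in the entry above should cover it.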