### Run Dash Application Server Manually

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.deployment.rst

This Python code demonstrates how to run a Dash application server manually. It is typically used within a main script to start the development server.

```python
# main.py
if __name__ == "__main__":
    app.run_server(debug=False, port=8050)
```

--------------------------------

### Install topic-wizard Package

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Installs the topic-wizard library from PyPI using pip. This is the first step to using the library for topic model visualization.

```python
%pip install topic-wizard
```

--------------------------------

### Create Dash App with Topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.deployment.rst

This snippet shows how to create a Dash application using the Topicwizard library. It requires a TopicData object as input and initializes the Dash app.

```python
# main.py
import topicwizard

app = topicwizard.get_dash_app(topic_data)
```

--------------------------------

### Install topicwizard with pip

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/index.rst

Installs the topicwizard Python package using pip. This is the primary method for acquiring the library.

```shell
pip install topic-wizard
```

--------------------------------

### Precompute UMAP Projections for Faster Cold Starts (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.deployment.html

This snippet shows how to precompute UMAP projections using `topicwizard.precompute_positions`. This optimization can significantly reduce cold start times for deployed topicwizard applications.

```python
topic_data_w_positions = topicwizard.precompute_positions(topic_data)
```

--------------------------------

### Clone HuggingFace Spaces Repository

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.deployment.rst

This bash command shows how to clone a HuggingFace Spaces repository using Git. This is the first step in deploying a Topicwizard application to HuggingFace Spaces.

```bash
git clone
```

--------------------------------

### Deploy Topicwizard to HuggingFace Spaces

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.deployment.rst

These bash commands illustrate the process of deploying a Topicwizard application to HuggingFace Spaces after creating a Docker Space. It involves moving the deployment folder contents, staging, committing, and pushing the changes.

```bash
mv deployment/* /path/to/space_repo
cd /path/to/space_repo
git add -A
git commit -m "Added deployment"
git push
```

--------------------------------

### Implement Turftopic Model Similar to Top2Vec for TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.compatibility.rst

Provides an example of creating a Turftopic ClusteringTopicModel configured to replicate the behavior of Top2Vec models. This model can be directly used with TopicWizard. Requires installation of turftopic, umap-learn, and scikit-learn.

```bash
pip install turftopic
pip install umap-learn
# Quote the requirement so the shell does not treat ">=" as a redirection
pip install "scikit-learn>=1.3.0"
```

```python
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap
import topicwizard

# This has the exact same behaviour as Top2Vec models.
top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15, n_components=5, metric="cosine"
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
)
topicwizard.visualize(corpus, model=top2vec)
```

--------------------------------

### Create Rule-Based Classification Pipeline with Human-Learn

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

This example shows how to build a rule-based classification pipeline using the human-learn library. It first trains a topic pipeline, then defines a custom rule function to classify documents about 'corona', and finally combines the topic pipeline and classifier into a single scikit-learn pipeline. This is useful for labeling data when labeled examples are scarce.

```python
# Install human-learn from PyPI
# pip install human-learn
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline

import topicwizard
from topicwizard.pipeline import make_topic_pipeline

# 'vectorizer', 'model' and 'texts' are assumed to be defined already
topic_pipeline = make_topic_pipeline(vectorizer, model).fit(texts)

# Investigate topics
topicwizard.visualize(texts, model=topic_pipeline)

# Creating rule for classifying something as a corona document
def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)

# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)
```

--------------------------------

### Create Turftopic Model Similar to Top2Vec for topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.compatibility.html

This snippet illustrates how to construct a Turftopic `ClusteringTopicModel` that mimics the behavior of Top2Vec models. It requires installing `turftopic`, `umap-learn`, and `scikit-learn`. This model can then be used directly with topicwizard.

```python
from turftopic import ClusteringTopicModel
from sklearn.cluster import HDBSCAN
import umap
import topicwizard

# This has the exact same behaviour as Top2Vec models.
top2vec = ClusteringTopicModel(
    dimensionality_reduction=umap.UMAP(
        n_neighbors=15, n_components=5, metric="cosine"
    ),
    clustering=HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    ),
    feature_importance="centroid",
)

corpus = ["Some example text", "More text here"]
topicwizard.visualize(corpus, model=top2vec)
```

--------------------------------

### Load 20newsgroups Corpus

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Loads the 20newsgroups dataset from scikit-learn, commonly used for topic modeling examples. It fetches the data and extracts the corpus content.

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data
```

--------------------------------

### Integrate BERTopic Model with topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.compatibility.html

This example shows how to wrap a BERTopic model using `BERTopicWrapper` from topicwizard for direct use in the web app or for producing `TopicData` objects. The BERTopic model can be fitted automatically if not pre-trained.

```python
from bertopic import BERTopic
from topicwizard.compatibility import BERTopicWrapper
import topicwizard

model = BERTopic(language="english")
wrapped_model = BERTopicWrapper(model)

corpus = ["Some example text", "More text here"]

# Start the web app immediately
topicwizard.visualize(corpus, model=wrapped_model)

# Or produce a TopicData object for persistence or figures.
topic_data = wrapped_model.prepare_topic_data(corpus)
```

--------------------------------

### Get Feature Names from TopicPipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Illustrates how to retrieve feature names (topic names) after fitting a TopicPipeline. The inferred topic names can then be used in subsequent steps of a pipeline.

```python
topic_pipeline.fit(texts)
print(topic_pipeline.get_feature_names_out())
```

--------------------------------

### Exclude Pages from Topicwizard Visualization (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/application.html

This example shows how to customize the topicwizard visualization by excluding specific pages, such as 'documents' and 'words'. This is useful for performance optimization or when only specific views are needed, like a PyLDAvis replacement focusing on word importances.

```python
import topicwizard

# Assuming 'texts' and 'pipeline' are already defined
topicwizard.visualize(texts, model=pipeline, exclude_pages=["documents", "words"])
```

--------------------------------

### Create and Run a Dash App with Topicwizard (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.deployment.html

This snippet shows how to create a Dash application using `topicwizard.get_dash_app` and then run the server manually. It's a foundational step for deploying topicwizard applications.

```python
import topicwizard

app = topicwizard.get_dash_app(topic_data)

if __name__ == "__main__":
    app.run_server(debug=False, port=8050)
```

--------------------------------

### Easy Deployment with Topicwizard (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.deployment.html

This Python code utilizes `topicwizard.easy_deploy` to create a Docker deployment folder.
This function simplifies the process of packaging a topicwizard app with a fitted topic model, including a Dockerfile, main.py, and the topic_data.joblib file.

```python
import joblib
import topicwizard

# Load previously produced topic_data object
topic_data = joblib.load("topic_data.joblib")
topicwizard.easy_deploy(topic_data, dest_dir="deployment", port=7860)
```

--------------------------------

### Deploy Dash App with Gunicorn (Bash)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.deployment.html

This command demonstrates how to run a Dash application using Gunicorn, a production-ready WSGI server. This is recommended for deploying topicwizard in a production environment. Note that Gunicorn's `-b` option expects an address of the form `HOST:PORT`.

```bash
gunicorn main:app.server -b 0.0.0.0:8050
```

--------------------------------

### Visualize topic models with topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/index.rst

Launches the topicwizard web application to visualize a topic model. It requires the text data and the trained model as input.

```python
import topicwizard

topicwizard.visualize(texts, model=model)
```

--------------------------------

### Prepare TopicData with Contextual Models

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Illustrates preparing a TopicData object directly from contextually sensitive models like those from turftopic. The prepare_topic_data method on the model handles the necessary data transformations for visualization.

```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10)
topic_data = model.prepare_topic_data(corpus)
```

--------------------------------

### Deploy to HuggingFace Spaces (Bash)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.deployment.html

This series of bash commands outlines the process for deploying a topicwizard application to HuggingFace Spaces using a Docker Space.
It involves cloning the space repository, moving the deployment files into it, committing the changes, and pushing them to the remote repository.

```bash
git clone
mv deployment/* /path/to/space_repo
cd /path/to/space_repo
git add -A
git commit -m "Added deployment"
git push
```

--------------------------------

### Prepare TopicData with a TopicPipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Demonstrates how to prepare a TopicData object using a TopicPipeline, which encapsulates the vectorization and topic modeling steps. This object contains all information needed for TopicWizard visualization.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from topicwizard.pipeline import make_topic_pipeline

# 'corpus' is assumed to be a list of documents defined earlier
pipeline = make_topic_pipeline(CountVectorizer(), NMF(10))
topic_data = pipeline.prepare_topic_data(corpus)
```

--------------------------------

### Create Scikit-learn NMF Model

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Initializes a scikit-learn Non-negative Matrix Factorization (NMF) model with 10 components. NMF is a fast algorithm often used for topic modeling.

```python
from sklearn.decomposition import NMF

model = NMF(n_components=10)
```

--------------------------------

### Build contextually sensitive topic model with Turftopic

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/index.rst

Initializes a contextually sensitive topic model using Turftopic's KeyNMF.
This approach is suitable for capturing nuanced relationships in text data.

```python
from turftopic import KeyNMF

model = KeyNMF(n_components=10)
```

--------------------------------

### Initialize Topic Models (LDA, NMF, DMM)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Shows how to initialize various topic models compatible with Topicwizard. This includes Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) from scikit-learn, and the Dirichlet Multinomial Mixture (DMM) model from tweetopic for short texts. These models require a '.components_' attribute for topic-term importance.

```python
# LDA for long texts
from sklearn.decomposition import LatentDirichletAllocation

model = LatentDirichletAllocation(n_components=10)

# You can use NMF too
from sklearn.decomposition import NMF

model = NMF(n_components=10)

# Or tweetopic's DMM for short texts
# pip install tweetopic
from tweetopic import DMM

model = DMM(n_components=10)
```

--------------------------------

### Visualize Topic Model with topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/application.rst

Launches the topicwizard web application to visualize a trained topic model. It takes the corpus ('texts') and the fitted model ('pipeline') as input. The web app provides an interactive overview of the topic model.

```python
import topicwizard

topicwizard.visualize(texts, model=pipeline)
```

--------------------------------

### Visualize Topic Model with TopicWizard Web App

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Launches the topicwizard web application for interactive exploration of a fitted topic model. This function can optionally exclude the 'documents' page for faster loading and can incorporate predefined group labels for richer analysis.

```python
import topicwizard

topicwizard.visualize(corpus, model=pipeline)
```

```python
topicwizard.visualize(corpus, model=pipeline, exclude_pages=["documents"])
```

```python
topicwizard.visualize(corpus, model=pipeline, exclude_pages=["documents"], group_labels=group_labels)
```

--------------------------------

### Initialize AutoEncodingTopicModel

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.compatibility.html

Initializes AutoEncodingTopicModel for zero-shot or combined topic modeling. This model replicates CTM behavior and is part of the Turftopic library.

```python
from turftopic import AutoEncodingTopicModel

zeroshot_tm = AutoEncodingTopicModel(10, combined=False)
combined_tm = AutoEncodingTopicModel(10, combined=True)
```

--------------------------------

### Build scikit-learn compatible topic pipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/index.rst

Constructs a topic modeling pipeline compatible with scikit-learn conventions. It utilizes CountVectorizer for text processing and NMF for topic decomposition.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from topicwizard.pipeline import make_topic_pipeline

bow_vectorizer = CountVectorizer()
nmf = NMF(n_components=10)
model = make_topic_pipeline(bow_vectorizer, nmf)
```

--------------------------------

### Fit Topic Pipeline and Visualize Model with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Demonstrates visualizing a topic pipeline on a corpus using topicwizard.visualize. This is a core function for interpreting topic models.

```python
import topicwizard

topicwizard.visualize(corpus, model=topic_pipeline)
```

--------------------------------

### Create a Topicwizard TopicPipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Demonstrates the creation of a TopicPipeline using Topicwizard's `make_topic_pipeline` function. TopicPipelines offer enhanced convenience for downstream tasks and model interpretation compared to standard Scikit-learn pipelines.

```python
from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model)
```

--------------------------------

### Fit NMF Topic Model Pipeline with Scikit-learn

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Sets up and fits a scikit-learn pipeline for Nonnegative Matrix Factorization (NMF) topic modeling. The pipeline includes a CountVectorizer and the NMF model, configured with specified parameters for document processing and topic discovery.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Setting up topic modelling pipeline
vectorizer = CountVectorizer(max_df=0.8, min_df=10, stop_words="english")

# NMF topic model with 20 topics
nmf = NMF(n_components=20)

# Build a pipeline from the two components
pipeline = make_pipeline(vectorizer, nmf)

# Fit the pipeline to the data
pipeline.fit(corpus)
```

--------------------------------

### Initialize CountVectorizer for Text Vectorization

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Demonstrates how to initialize a CountVectorizer from scikit-learn, a common component for converting texts into bag-of-words vectors. This is a foundational step in many natural language processing pipelines.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
```

--------------------------------

### Integrate BERTopic Model with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Demonstrates how to use BERTopic models with TopicWizard by employing the BERTopicWrapper compatibility layer. This enables the visualization of BERTopic's contextual topic models.

```python
from bertopic import BERTopic

import topicwizard
from topicwizard.compatibility import BERTopicWrapper

model = BERTopicWrapper(BERTopic(language="english"))
topicwizard.visualize(corpus, model=model)
```

--------------------------------

### Visualize SemanticSignalSeparation Model with topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/application.rst

Demonstrates visualizing a topic model trained with turftopic's SemanticSignalSeparation. It shows two methods: directly passing the model, or preparing TopicData first and then visualizing.

```python
import topicwizard
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10)
topicwizard.visualize(texts, model=model)

# OR
topic_data = model.prepare_topic_data(texts)
topicwizard.visualize(topic_data=topic_data)
```

--------------------------------

### Visualize Turftopic Semantic Signal Separation Model - Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.contextual_models.html

Demonstrates how to prepare topic data from a corpus using Turftopic's SemanticSignalSeparation model and then visualize it with topicwizard. Alternatively, the web application can be run directly with the corpus and model.

```python
import topicwizard
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10)

# You can produce the topic data from a corpus before running the app
# This option should be preferred as the data can be saved and the app can be restarted
# Or you can use it for producing individual figures later.
topic_data = model.prepare_topic_data(corpus)
topicwizard.visualize(topic_data=topic_data)

# Or you can run the app directly with the model and a corpus
topicwizard.visualize(corpus, model=model)
```

--------------------------------

### Create a Scikit-learn Pipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Illustrates how to combine a vectorizer and a topic model into a standard Scikit-learn pipeline using `make_pipeline`. This pipeline can include additional transformations.

```python
from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)
```

--------------------------------

### Interpret turftopic Semantic Signal Separation Model with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Illustrates how to visualize a contextually sensitive topic model from turftopic using TopicWizard. The SemanticSignalSeparation model is prepared and then passed to topicwizard.visualize.

```python
import topicwizard
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10)
topicwizard.visualize(corpus, model=model)
```

--------------------------------

### Train NMF Topic Model with topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/application.rst

Trains a Non-negative Matrix Factorization (NMF) topic model using scikit-learn's CountVectorizer and NMF, then prepares it for topicwizard visualization. Assumes 'texts' is a pre-defined corpus.

```python
# Training a compatible topic model
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from topicwizard.pipeline import make_topic_pipeline

bow_vectorizer = CountVectorizer()
nmf = NMF(n_components=10)
pipeline = make_topic_pipeline(bow_vectorizer, nmf)
pipeline.fit(texts)
```

--------------------------------

### Visualize Topic Model with Group Labels using topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/application.rst

Trains an NMF topic model on the 20Newsgroups dataset and visualizes it using topicwizard, including custom group labels derived from the dataset's target names. Requires 'numpy' for label mapping.

```python
import topicwizard
from topicwizard.pipeline import make_topic_pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = np.array(newsgroups.target_names)[newsgroups.target]

# Here we fit a topic model to the corpus
pipeline = make_topic_pipeline(
    CountVectorizer(stop_words="english"),
    NMF(n_components=30),
).fit(corpus)

# Notice that I'm passing the labels as the group_labels argument
topicwizard.visualize(corpus, model=pipeline, group_labels=group_labels)
```

--------------------------------

### Visualize Topic Model with Group Labels (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/application.html

This snippet demonstrates how to use the topicwizard library to visualize a topic model, incorporating optional group labels from a dataset like 20Newsgroups. It involves fetching data, creating a topic pipeline, and then calling the visualize function with corpus, model, and group labels.

```python
import topicwizard
from topicwizard.pipeline import make_topic_pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
import numpy as np

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = np.array(newsgroups.target_names)[newsgroups.target]

# Here we fit a topic model to the corpus
pipeline = make_topic_pipeline(
    CountVectorizer(stop_words="english"),
    NMF(n_components=30),
).fit(corpus)

# Notice that I'm passing the labels as the group_labels argument
topicwizard.visualize(corpus, model=pipeline, group_labels=group_labels)
```

--------------------------------

### Create Gensim Pipeline for topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.compatibility.html

This snippet demonstrates how to create a scikit-learn compatible pipeline for Gensim models (LSI, LDA, NMF) to be used with topicwizard. It requires a pre-trained Gensim dictionary and topic model object.

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
import topicwizard
from topicwizard.compatibility import gensim_pipeline

texts: list[list[str]] = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer'],
    ...
]

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(bow_corpus, num_topics=10)

pipeline = gensim_pipeline(dictionary, model=lda)

# Then you can use the pipeline as usual
corpus = [" ".join(text) for text in texts]
topicwizard.visualize(corpus, model=pipeline)
```

--------------------------------

### Load 20newsgroups Dataset with Scikit-learn

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Loads the 20newsgroups dataset from scikit-learn, removing headers, footers, and quotes. It also maps integer labels back to their textual names. This data serves as the corpus for topic modeling.

```python
from sklearn.datasets import fetch_20newsgroups
import numpy as np

newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = np.array(newsgroups.target_names)[newsgroups.target]
```

--------------------------------

### Integrate Gensim LDA Model with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Shows how to use Gensim's LDA models with TopicWizard by wrapping them in a TopicPipeline. This allows visualization of topic distributions derived from Gensim's corpus and dictionary.

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
import topicwizard
from topicwizard.compatibility import gensim_pipeline

texts: list[list[str]] = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer'],
    ...
]

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(bow_corpus, num_topics=10)

pipeline = gensim_pipeline(dictionary, model=lda)

# Then you can use the pipeline as usual
corpus = [" ".join(text) for text in texts]
topicwizard.visualize(corpus, model=pipeline)
```

--------------------------------

### Load TopicData with joblib - Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.persistence.html

This snippet shows how to deserialize a TopicData object from a joblib file. It uses the topicwizard library for visualization and joblib for loading the data. The input is a 'topic_data.joblib' file, and the output is a TopicData object ready for visualization.

```python
import topicwizard
# We import this only for type checking
from topicwizard.data import TopicData
import joblib

topic_data: TopicData = joblib.load("topic_data.joblib")

topicwizard.visualize(topic_data=topic_data)
```

--------------------------------

### Implement Turftopic AutoEncodingTopicModel for TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.compatibility.rst

Demonstrates how to use Turftopic's AutoEncodingTopicModel, which can replicate the behavior of CTM models, for use with TopicWizard. Supports both zero-shot and combined modes.

```python
from turftopic import AutoEncodingTopicModel
import topicwizard

zeroshot_tm = AutoEncodingTopicModel(10, combined=False)
combined_tm = AutoEncodingTopicModel(10, combined=True)

topicwizard.visualize(corpus, model=zeroshot_tm)
```

--------------------------------

### Convert Existing Pipeline to TopicPipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Shows how to convert an existing Scikit-learn Pipeline into a TopicPipeline using the `TopicPipeline.from_pipeline()` class method. This allows leveraging TopicPipeline's features with previously created pipelines.

```python
from topicwizard.pipeline import TopicPipeline

topic_pipeline = TopicPipeline.from_pipeline(pipeline)
```

--------------------------------

### Theme Toggling JavaScript for topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/search.html

This JavaScript code snippet dynamically sets the theme for the topicwizard documentation based on user preference or system settings. It reads the 'theme' from local storage or defaults to 'auto', and applies it to the document's body dataset. This allows for light, dark, or automatic theme switching.

```javascript
document.body.dataset.theme = localStorage.getItem("theme") || "auto";
```

--------------------------------

### Save TopicData with joblib - Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.persistence.html

This snippet demonstrates how to serialize a TopicData object into a joblib file. It requires the turftopic library for model preparation and joblib for dumping the data. The input is a corpus, and the output is a file named 'topic_data.joblib'.

```python
from turftopic import KeyNMF
import joblib

model = KeyNMF(10)
topic_data = model.prepare_topic_data(corpus)

joblib.dump(topic_data, "topic_data.joblib")
```

--------------------------------

### Visualize BERTopic Model with topicwizard Compatibility Layer - Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.contextual_models.html

Shows how to wrap a BERTopic model using topicwizard's BERTopicWrapper to make it compatible for visualization with the topicwizard web application. The BERTopic model can be fitted or unfitted before wrapping.

```python
from bertopic import BERTopic

import topicwizard
from topicwizard.compatibility import BERTopicWrapper

# The model can be fitted or not.
model = BERTopic()
wrapped_model = BERTopicWrapper(model)

topicwizard.visualize(corpus, model=wrapped_model)
```

--------------------------------

### Serialize and Deserialize Topic Data with Joblib

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

This section illustrates how to save (serialize) and load (deserialize) topic data using the joblib library. This is crucial for sharing or persisting topic modeling results across sessions or machines. It highlights the importance of version compatibility between the saving and loading environments.

```python
import joblib
from topicwizard.data import TopicData

# Assuming 'topic_data' is already prepared
# Save the data
joblib.dump(topic_data, "topic_data.joblib")

# Load the data
# (The type annotation is just for type checking, it doesn't do anything)
topic_data: TopicData = joblib.load("topic_data.joblib")
```

--------------------------------

### Visualize Topic Data using topicwizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/topic_data.rst

Demonstrates how to use the TopicData object with topicwizard's visualization utilities, including topic maps and the web application. The TopicData object allows for the reproduction of interpretive visualizations.

```python
import topicwizard
from topicwizard.figures import topic_map

# Usage with figures
topic_map(topic_data)

# Usage with web app
# Beware that topic_data is a keyword argument
topicwizard.visualize(topic_data=topic_data)
```

--------------------------------

### Display Document Topic Timeline (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/figures.rst

Generates a line chart illustrating how the topic distribution changes over the course of a single document. It accepts topic data and the document text. Window and step sizes can be adjusted for resolution control. This is useful for tracking topic evolution.

```python
from topicwizard.figures import document_topic_timeline

document_topic_timeline(
    topic_data,
    "New cure against type 2 diabetes in development.",
)
```

--------------------------------

### Generate Group-Topic Barcharts with TopicWizard Figures API

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Generates interactive barchart plots illustrating the relevance of topics to predefined group labels using the topicwizard figures API. This visualization helps in understanding how topics align with external categorical data.

```python
from topicwizard.figures import group_topic_barcharts

group_topic_barcharts(corpus, group_labels, pipeline=pipeline, top_n=5)
```

--------------------------------

### Create Scikit-learn CountVectorizer

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Initializes a scikit-learn CountVectorizer for text processing. It is configured to ignore terms that appear in fewer than 5 documents or in more than 80% of the documents, and to remove English stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")
```

--------------------------------

### Visualize Topic Data with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

This snippet demonstrates how to prepare topic data using BERTopicWrapper and then visualize it using the topicwizard library. It assumes 'corpus' is a pre-defined variable containing the text data. The 'topic_data' object is central to both visualization and serialization.

```python
from bertopic import BERTopic
from topicwizard.wrappers import BERTopicWrapper

# Assuming 'corpus' is your list of documents
# corpus = ["document 1", "document 2", ...]
model = BERTopicWrapper(BERTopic())
topic_data = model.prepare_topic_data(corpus)

import topicwizard

topicwizard.visualize(topic_data=topic_data)
```

--------------------------------

### Generate Group Word Clouds with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Creates word clouds for each group label, considering only word counts and not relevance. It requires the corpus, group labels, and a pipeline.

```python
from topicwizard.figures import group_wordclouds

group_wordclouds(corpus, group_labels, pipeline=pipeline)
```

--------------------------------

### Visualize Embeddings with Topicwizard (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/application.html

This snippet illustrates how to use topicwizard to visualize embeddings, for instance, generated by LSI. It demonstrates disabling other pages like 'documents' and 'topics' to focus solely on the embedding visualization.

```python
import topicwizard

# Assuming 'texts' and 'pipeline' are already defined
# topicwizard.visualize(texts, model=pipeline, exclude_pages=["documents", "topics"])
```

--------------------------------

### Generate Word Map with TopicWizard Figures API

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Creates an interactive word map visualization using the topicwizard figures API. This plot illustrates the relationships and proximity between different words based on their co-occurrence within topics.

```python
from topicwizard.figures import word_map

word_map(corpus, pipeline=pipeline)
```

--------------------------------

### Implement Custom Contextual Model Interface

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.compatibility.html

Implements the interface for custom contextual topic models in TopicWizard. Models must be able to produce TopicData objects and follow the TopicModel protocol.
```python
from topicwizard.model_interface import TopicModel
from topicwizard.data import TopicData

# TopicModel is only a Protocol, the model inherits no behaviour,
# it just provides static checks
class CustomTopicModel(TopicModel):
    def prepare_topic_data(
        self,
        corpus: list[str],
    ) -> TopicData:
        pass
```

--------------------------------

### Wrap BERTopic Model for TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.compatibility.rst

Shows how to wrap a BERTopic model using BERTopicWrapper to make it compatible with TopicWizard. This allows direct usage in the web app or preparation of a TopicData object for persistence and figures.

```python
from bertopic import BERTopic
from topicwizard.compatibility import BERTopicWrapper
import topicwizard

model = BERTopic(language="english")
wrapped_model = BERTopicWrapper(model)

# Start the web app immediately
topicwizard.visualize(corpus, model=wrapped_model)

# Or produce a TopicData object for persistence or figures.
topic_data = wrapped_model.prepare_topic_data(corpus)
```

--------------------------------

### Visualize Document Topic Timeline with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Displays topic distribution over time within a single document using a line chart. It can also analyze an entire corpus if texts are joined. Users can specify window and step sizes for token analysis.

```python
from topicwizard.figures import document_topic_timeline

document_topic_timeline(
    topic_data,
    "New cure against type 2 diabetes in development."
)
```

--------------------------------

### Generate Word Clouds - topicwizard Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Produces a joint word cloud plot for all topics, visualizing word relevance. The 'alpha' parameter can be used to specify the relevance metric for word sizing.
```python
from topicwizard.figures import topic_wordclouds

topic_wordclouds(topic_data)
```

--------------------------------

### Display Word Map - topicwizard Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Visualizes word relationships in a 2D space, offering an alternative to the interactive app. Words can be labeled based on a Z-value cutoff, and coloring indicates the most relevant topic. UMAP can be used for automatic axis discovery, or specific topics can be defined as axes.

```python
from topicwizard.figures import word_map

word_map(topic_data)

word_map(
    topic_data,
    topic_axes=(
        "9_api_apis_register_automatedsarcasmgenerator",
        "4_study_studying_assessments_exams"
    )
)
```

--------------------------------

### Display Word Barplots - topicwizard Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Creates a joint plot displaying word importances across all topics as a bar chart. Allows customization of relevance metric via the 'alpha' parameter and controls the number of words displayed using 'top_n'.

```python
from topicwizard.figures import topic_barcharts

topic_barcharts(topic_data)

topic_barcharts(topic_data, top_n=5)
```

--------------------------------

### Python Base Topic Model Structure with BaseEstimator

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.compatibility.rst

This Python code defines the basic structure for a custom topic model inheriting from `BaseEstimator`. It mandates the implementation of a `transform` method to process vectorized documents and return topic distributions, and a `components_` property to expose topic-word distributions.
```python
from sklearn.base import BaseEstimator
import numpy as np

# Same thing, BaseEstimator is a good thing to have
class CustomTopicModel(BaseEstimator):
    # All topic models should have a transform method, that takes
    # the vectorized documents and returns a sparse or dense array of
    # topic distributions with shape (n_docs, n_topics)
    def transform(self, X):
        pass

    # All topic models should have a property or attribute named
    # components_, that should be a dense or sparse array of topic-word
    # distributions of shape (n_topics, n_features)
    @property
    def components_(self) -> np.ndarray:
        pass
```

--------------------------------

### Generate Topic Barcharts with TopicWizard Figures API

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/examples/basic_usage.ipynb

Generates interactive barchart plots showing the most important words for each discovered topic using the topicwizard figures API. This plot helps in understanding the thematic content of each topic.

```python
from topicwizard.figures import topic_barcharts

topic_barcharts(corpus, pipeline=pipeline, top_n=5)
```

--------------------------------

### Configure TopicPipeline for Pandas DataFrame Output

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Demonstrates two methods for configuring a TopicPipeline to output pandas DataFrames. This is beneficial for analyzing topic content at the document level, especially when dealing with sparse outputs from vectorizers that pandas cannot directly handle.

```python
from topicwizard.pipeline import make_topic_pipeline

# Assuming 'vectorizer' and 'model' are already defined

# Set a parameter
pipeline = make_topic_pipeline(vectorizer, model, pandas_out=True)

# Or use set_output API
pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
```

--------------------------------

### Create Group Topic Barcharts with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Generates a joint plot displaying the topic content of all groups as bar charts.
This function requires the corpus, group labels, and optionally a pipeline and the number of top topics to display.

```python
from topicwizard.figures import group_topic_barcharts

group_topic_barcharts(corpus, group_labels, pipeline=pipeline, top_n=5)
```

--------------------------------

### Visualize Word Association Barchart with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Generates a bar chart visualizing the most relevant topics for a given set of words. It takes topic data and a list of words as input. Associations are not selected by default.

```python
from topicwizard.figures import word_association_barchart

word_association_barchart(topic_data, ["supreme", "court"])
```

--------------------------------

### Visualize Document Topics with Plotly

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

This snippet demonstrates how to use Plotly Express to visualize document-topic relationships as a heatmap. It assumes an existing, fitted `pipeline` object, which is used to transform a list of texts. The output is a heatmap displayed interactively.

```python
import plotly.express as px

texts = [
    "Coronavirus killed 50000 people today.",
    "Donald Trump's presidential campaign is going very well",
    "Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()
```

--------------------------------

### Define Custom Topic Model for TopicPipeline

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/usage.compatibility.html

Defines a custom topic model component for TopicWizard's TopicPipeline. It must inherit from BaseEstimator and implement a `transform` method and a `components_` property.
```python
from sklearn.base import BaseEstimator
import numpy as np

# Same thing, BaseEstimator is a good thing to have
class CustomTopicModel(BaseEstimator):
    # All topic models should have a transform method, that takes
    # the vectorized documents and returns a sparse or dense array of
    # topic distributions with shape (n_docs, n_topics)
    def transform(self, X):
        pass

    # All topic models should have a property or attribute named
    # components_, that should be a dense or sparse array of topic-word
    # distributions of shape (n_topics, n_features)
    @property
    def components_(self) -> np.ndarray:
        pass
```

--------------------------------

### Visualize word relationships with topicwizard word_map

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/index.rst

Generates an interactive visualization of word relationships within a topic model using the `word_map` function from `topicwizard.figures`. This requires pre-processed topic data.

```python
from topicwizard.figures import word_map

topic_data = topic_pipeline.prepare_topic_data(corpus)
word_map(topic_data)
```

--------------------------------

### Generate Word Map (UMAP Discovery) - Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/figures.rst

Displays a word map where UMAP discovers the axes and projects words into 2D space. This is useful for exploring word distances, relations, and potential clusters. It takes a TopicData object as input.

```python
from topicwizard.figures import word_map

word_map(topic_data)
```

--------------------------------

### Display Topic Map - topicwizard Python

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Generates a semantic map of topics within your topic model. Requires a TopicData object as input. This function visualizes the relationships between topics in a semantic space.
```python
from topicwizard.figures import topic_map

topic_map(topic_data)
```

--------------------------------

### Show Document Topic Distribution with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/figures.html

Displays topic distributions for a given document or list of documents as a bar chart. It requires topic data and the document content as input.

```python
from topicwizard.figures import document_topic_distribution

document_topic_distribution(
    topic_data,
    "New cure against type 2 diabetes in development."
)
```

--------------------------------

### Freeze Topic Pipeline Components for Downstream Training

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/usage.pipelines.rst

Explains how to freeze the vectorizer and topic model components within a TopicPipeline. Freezing prevents these components from being retrained when `fit()` or `partial_fit()` is called on an outer pipeline, which is useful for multi-stage training processes.

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

import topicwizard
from topicwizard.pipeline import make_topic_pipeline

# Assuming 'vectorizer', 'model', 'texts', 'X' and 'y' are already defined
topic_pipeline = make_topic_pipeline(vectorizer, model).fit(texts)

# Investigate topics
topicwizard.visualize(topic_pipeline)

# Freezing topic pipeline
topic_pipeline.freeze = True

# Constructing classification pipeline
cls_pipeline = make_pipeline(topic_pipeline, LogisticRegression())
cls_pipeline.fit(X, y)
```

--------------------------------

### Generate Interactive Topic Figures with TopicWizard

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

This code imports various plotting functions from the topicwizard.figures module, allowing for the creation of customizable and interactive plots such as word maps, document-topic timelines, topic wordclouds, and word association bar charts. These figures are generated from a TopicData object.
```python
from topicwizard.figures import (
    word_map,
    document_topic_timeline,
    topic_wordclouds,
    word_association_barchart
)

# Assuming 'topic_data' is loaded or prepared

# Example usage:
word_map(topic_data)
document_topic_timeline(topic_data, "Joe Biden takes over presidential office from Donald Trump.")
topic_wordclouds(topic_data)
word_association_barchart(topic_data, ["supreme", "court"])
```

--------------------------------

### TopicData Class

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/_build/html/topic_data.html

The TopicData type is the main abstraction in topicwizard. It's a dictionary-like object at runtime, providing static type checking for interoperability. It holds all necessary information to reproduce visualizations and inference results.

```APIDOC
## TopicData Class

### Description
Inference data used to produce visualizations in the application and figures. This type is a Python TypedDict, behaving like a dictionary at runtime while offering static type checking.

### Method
N/A (Class definition)

### Endpoint
N/A (Class definition)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```json
{
  "topic_data": {
    "corpus": [
      "This is the first document.",
      "This is the second document."
    ],
    "vocab": ["this", "is", "the", "first", "document", "second"],
    "document_term_matrix": [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 1, 1]],
    "document_topic_matrix": [[0.8, 0.2], [0.3, 0.7]],
    "topic_term_matrix": [[0.6, 0.4, 0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.2, 0.3, 0.3, 0.0]],
    "document_representation": [[0.1, 0.2], [0.3, 0.4]],
    "topic_names": ["Topic A", "Topic B"]
  }
}
```

### Response
#### Success Response (200)
N/A (Class definition)

#### Response Example
N/A (Class definition)

### Attributes
- **corpus** (`list` of `str`) - The corpus on which inference was run.
- **vocab** (`ndarray` of `shape (n_vocab,)`) - Array of all words in the vocabulary of the topic model.
- **document_term_matrix** (`ndarray` of `shape (n_documents, n_vocab)`) - Bag-of-words document representations. Elements are word importances/frequencies for given documents.
- **document_topic_matrix** (`ndarray` of `shape (n_documents, n_topics)`) - Topic importances for each document.
- **topic_term_matrix** (`ndarray` of `shape (n_topics, n_vocab)`) - Importances of each term for each topic in a matrix.
- **document_representation** (`ndarray` of `shape (n_documents, n_dimensions)`) - Embedded representations for documents. Can also be a sparse BoW matrix for classical models.
- **transform** (`(list[str]) -> ndarray`, optional) - Function that transforms documents to document-topic matrices. Can be `None` for transductive models.
- **topic_names** (`list` of `str`) - Names or topic descriptions inferred for topics by the model.
```

--------------------------------

### Visualize Topics with Excluded Pages (Python)

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/docs/application.rst

Demonstrates how to use the topicwizard.visualize function to display topic visualizations while excluding specific pages such as 'documents' and 'words'. This is useful for customizing the output based on the data or analysis method.

```python
topicwizard.visualize(texts, model=pipeline, exclude_pages=["documents", "words"])
```

```python
topicwizard.visualize(texts, model=pipeline, exclude_pages=["documents", "topics"])
```

--------------------------------

### Customize TopicWizard Visualization by Excluding Pages

Source: https://github.com/x-tabdeveloping/topicwizard/blob/main/README.md

Shows how to exclude specific pages (e.g., 'documents') from the TopicWizard visualization to speed up preprocessing, especially for large corpora. This allows for focused visualization of desired components.

```python
# A large corpus takes a long time to compute 2D projections for,
# so you can speed up preprocessing by disabling it altogether.
topicwizard.visualize(corpus, pipeline=topic_pipeline, exclude_pages=["documents"])
```
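Looking back at the TopicData attribute list earlier in this section, the shapes of the matrices have to agree with one another: rows of `document_term_matrix` match the corpus, its columns match `vocab`, and so on. The following is a minimal sketch of that consistency check. It assumes plain nested lists instead of arrays, and `check_topic_data_shapes` is a hypothetical helper written for illustration, not part of topicwizard. The sample values are the ones from the TopicData request example.

```python
from typing import Any


def check_topic_data_shapes(data: dict[str, Any]) -> dict[str, int]:
    """Check that the matrices of a TopicData-like dict agree on dimensions.

    Hypothetical helper for illustration only, not a topicwizard function.
    Matrices are nested lists here; real TopicData objects hold arrays.
    """
    n_documents = len(data["corpus"])
    n_vocab = len(data["vocab"])
    n_topics = len(data["topic_names"])

    def shape(matrix: list) -> tuple[int, int]:
        # (rows, columns) of a rectangular nested list
        return len(matrix), len(matrix[0]) if matrix else 0

    expected = {
        "document_term_matrix": (n_documents, n_vocab),
        "document_topic_matrix": (n_documents, n_topics),
        "topic_term_matrix": (n_topics, n_vocab),
    }
    for key, want in expected.items():
        got = shape(data[key])
        if got != want:
            raise ValueError(f"{key} has shape {got}, expected {want}")

    # document_representation only needs one row per document;
    # its number of columns (n_dimensions) is free.
    if len(data["document_representation"]) != n_documents:
        raise ValueError("document_representation needs one row per document")

    return {"n_documents": n_documents, "n_vocab": n_vocab, "n_topics": n_topics}


# Values copied from the TopicData request example
example = {
    "corpus": ["This is the first document.", "This is the second document."],
    "vocab": ["this", "is", "the", "first", "document", "second"],
    "document_term_matrix": [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 1, 1]],
    "document_topic_matrix": [[0.8, 0.2], [0.3, 0.7]],
    "topic_term_matrix": [
        [0.6, 0.4, 0.0, 0.0, 0.0, 0.0],
        [0.1, 0.1, 0.2, 0.3, 0.3, 0.0],
    ],
    "document_representation": [[0.1, 0.2], [0.3, 0.4]],
    "topic_names": ["Topic A", "Topic B"],
}

dims = check_topic_data_shapes(example)
print(dims)  # {'n_documents': 2, 'n_vocab': 6, 'n_topics': 2}
```

A check along these lines may be handy right after loading a persisted `topic_data.joblib` file, before handing the object to the figures or the web app; with real arrays the `.shape` attribute would replace the nested-list helper.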