### Install Lilac and Start Project

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/quickstart.md

Install Lilac with all dependencies and initialize a new project in the specified directory.

```bash
pip install lilac[all]
lilac start ~/my_project
```

--------------------------------

### Start Svelte development server

Source: https://github.com/databricks/lilac/blob/main/web/blueprint/README.md

Run this command after installing dependencies to start the development server. The `--open` flag will automatically open the application in your default browser.

```bash
npm run dev
```

```bash
npm run dev -- --open
```

--------------------------------

### Install and Start Lilac CLI

Source: https://github.com/databricks/lilac/blob/main/docs/deployment/self_hosted.md

Install Lilac with all features and start the service on port 5432. Ensure you have Python and pip installed.

```sh
pip install lilac[all]
lilac start /data
```

--------------------------------

### Install Project Dependencies

Source: https://github.com/databricks/lilac/blob/main/development.md

Run this script to install all necessary project dependencies.

```sh
./scripts/setup.sh
```

--------------------------------

### Install Firebase CLI

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Install the Firebase command-line interface globally. This is a one-time setup step for deployment.

```bash
npm install -g firebase-tools
```

--------------------------------

### Install Lilac with all features

Source: https://github.com/databricks/lilac/blob/main/README.md

Install Lilac including all optional dependencies for full functionality. This is the recommended installation for most users.

```sh
pip install lilac[all]
```

--------------------------------

### Initialize Concept and Example Objects

Source: https://github.com/databricks/lilac/blob/main/notebooks/Toxicity.ipynb

Initializes an empty dictionary for data and imports necessary classes for defining concepts and examples.

```python
from lilac.concepts.concept import Concept, Example

data = {}
```

--------------------------------

### Start Lilac webserver from CLI

Source: https://github.com/databricks/lilac/blob/main/README.md

Start a Lilac webserver using the command-line interface. This command initializes a project directory and launches the server.

```sh
lilac start ~/my_project
```

--------------------------------

### Run Development Server

Source: https://github.com/databricks/lilac/blob/main/development.md

Execute this command to start the web server in development mode, enabling fast edit-refresh.

```sh
./run_server_dev.sh
```

--------------------------------

### Install Miniforge for M1/M2 TensorFlow

Source: https://github.com/databricks/lilac/wiki/Troubleshooting

Steps to install Miniforge and activate the environment for TensorFlow on M1/M2 chips.

```sh
chmod +x ~/Downloads/Miniforge3-MacOSX-arm64.sh
sh ~/Downloads/Miniforge3-MacOSX-arm64.sh
source ~/miniforge3/bin/activate
```

--------------------------------

### Start Lilac Server and Create Project

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_load.md

Starts the Lilac web server and automatically creates a project directory with an empty `lilac.yml` file. This is useful for managing project configurations.

```python
import lilac as ll

ll.start_server(project_dir='~/my_lilac')
```

--------------------------------

### Watch Documentation Locally

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Run this script to start a local server that automatically refreshes documentation as you make changes. Execute from the project root.

```bash
./scripts/watch_docs.sh
```

--------------------------------

### Add Examples to a Concept in Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_tuning.md

Insert new labeled examples into an existing concept. Ensure examples are formatted as `ll.ExampleIn` objects with a boolean label and text.

```python
train_data = [
  ll.ExampleIn(label=False, text='The weather is beautiful today'),
]
db.edit(
  'local', 'positive-product-reviews', ll.ConceptUpdate(insert=train_data))
```

--------------------------------

### Start Lilac Server

Source: https://github.com/databricks/lilac/blob/main/notebooks/LlamaIndexLoader.ipynb

Use this command to start a Lilac server. Specify the project directory where your data is located.

```python
ll.start_server(project_dir='./data')
```

--------------------------------

### Create Docker Buildx Builder

Source: https://github.com/databricks/lilac/blob/main/development.md

One-time setup command to create and bootstrap a Docker buildx builder instance named 'mybuilder'.

```sh
docker buildx create --name mybuilder --node mybuilder0 --bootstrap --use
```

--------------------------------

### Add Examples to a Concept

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_create.md

Add positive and negative examples to an existing concept. Examples are defined using ExampleIn objects with a label and text.

```python
examples = [
  ll.concepts.ExampleIn(label=False, text='The quick brown fox jumps over the lazy dog.'),
  ll.concepts.ExampleIn(label=True, text='This product is amazing!'),
  ll.concepts.ExampleIn(label=True, text='Thank you for your awesome work on this UI.')
]
db.edit('local', 'positive-product-reviews', ll.concepts.ConceptUpdate(insert=examples))
```

--------------------------------

### List Concept Examples in Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_tuning.md

Retrieve and display existing examples associated with a concept.
This is useful for inspecting the current training data of a concept.

```python
concept = db.get('local', 'positive-product-reviews')
print(concept.data)
```

--------------------------------

### Retrieve and Print Concept Examples

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Fetch a concept using `db.get` and print its data. This is useful for inspecting the current training examples within a concept.

```python
concept = db.get('local', 'positive-product-reviews')
if concept:
  print(concept.data)
```

--------------------------------

### Get a Concept by Name

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Retrieves a specific concept using its namespace and name. This action loads the concept, which is a collection of positive and negative examples.

```python
# Get the `language-model-reference` concept. This is just a collection of positive and negative
```

--------------------------------

### Start Lilac Server

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Starts the Lilac server to enable data visualization and interaction. Ensure this is run before other server-dependent operations.

```python
ll.start_server()
```

--------------------------------

### Start Lilac Web Server

Source: https://github.com/databricks/lilac/blob/main/docs/blog/curate-coding-dataset.md

Starts the Lilac web server to allow for visual inspection and interaction with the loaded dataset through a web interface.

```python
ll.start_server()
```

--------------------------------

### Add Training Examples to a Concept

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Use `ll.ExampleIn` to define training data and `db.edit` to insert it into a specified concept. Ensure the concept and namespace are correctly identified.

```python
train_data = [
  ll.ExampleIn(label=False, text='The quick brown fox jumps over the lazy dog.'),
  ll.ExampleIn(label=False, text='This is a random sentence.'),
  ll.ExampleIn(label=True, text='This product is amazing!'),
  ll.ExampleIn(label=True, text='Thank you for your awesome work on this UI.'),
]
db.edit('local', 'positive-product-reviews', ll.ConceptUpdate(insert=train_data))
```

--------------------------------

### Get Dataset Handle

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_query.md

Set the project path and retrieve a dataset instance. Ensure the project path is set before accessing datasets.

```python
import lilac as ll

# Set the project path globally. For more information, see the Projects guide.
ll.set_project_path('~/my_project')

dataset = ll.get_dataset('local', 'imdb')
```

--------------------------------

### Install Dependencies for LlamaIndex and pypdf

Source: https://github.com/databricks/lilac/blob/main/notebooks/LlamaIndexLoader.ipynb

Install the necessary libraries, pypdf and llama_index, to use LlamaIndex loaders. This is a prerequisite for loading data.

```python
!pip install pypdf llama_index
```

--------------------------------

### Save Concept with Sentiment Examples

Source: https://github.com/databricks/lilac/blob/main/notebooks/Sentiment.ipynb

Initializes a data structure to save concept examples, iterating through the training data to create `Example` objects with sentiment labels and text. This snippet is incomplete and likely part of a larger concept definition process.

```python
from lilac.concepts.concept import Concept, Example


def save_concept(positive_sentiment):
  data = {}
  for index, (label, text) in enumerate(zip(labels, list(train_df['text']))):
    id = str(index)
    ex = Example(label=bool(label), text=text, id=str(index))
    if not positive_sentiment:
```

--------------------------------

### Start Lilac webserver from Python

Source: https://github.com/databricks/lilac/blob/main/README.md

Start a Lilac webserver programmatically using the Python API. This is useful for integrating Lilac into existing Python workflows.

```python
import lilac as ll

ll.start_server(project_dir='~/my_project')
```

--------------------------------

### Start a New Lilac Project

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/installation.md

Initiate a new Lilac project in a specified directory. The command will prompt for confirmation before proceeding.

```bash
❯ lilac start ~/my_project
Lilac will create a project in `/Users/me/my-project`. Do you want to continue? (y/n): y

INFO: Started server process [33100]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)
```

--------------------------------

### Reinstall Xcode Command Line Tools

Source: https://github.com/databricks/lilac/wiki/Troubleshooting

Use this command if pyenv installation fails on M1 machines after installing Xcode.

```sh
$ sudo rm -rf /Library/Developer/CommandLineTools
$ xcode-select --install
```

--------------------------------

### Initialize Lilac Project from CLI

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

Use this command to initialize a Lilac project directory without starting the webserver. This is useful for setting up a project structure.

```sh
lilac init ~/my_project
```

--------------------------------

### Check Lilac Version

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/installation.md

Verify the installed version of Lilac.

```bash
lilac version
```

--------------------------------

### Example Output of Queried Row

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_labels.md

Illustrates the structure of a row returned after querying for existing labels, showing the row ID, text, label, and other metadata.

```python
{
  '__rowid__': '0003076800f1471f8f4c8a1b2deda742',
  'text': 'If you want to truly experience the magic (?) of Don Dohler, then check out Alien Factor or maybe Fiend...',
  'label': 'neg',
  '__hfsplit__': 'test',
  'good': {
    'label': 'true',
    'created': datetime.datetime(2023, 9, 20, 10, 16, 15, 545277)
  }
}
```

--------------------------------

### Load Dataset from Hugging Face

Source: https://github.com/databricks/lilac/blob/main/notebooks/Clustering.ipynb

Loads a dataset from the Hugging Face Hub. Ensure the 'lilac' library is installed.

```python
import lilac as ll

ds = ll.from_huggingface('LDJnr/Capybara')
```

--------------------------------

### UI Settings for Media Paths in Lilac Project

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

Configures which fields are displayed as media paths in the Lilac UI. This example sets 'premise' as the sole media path.

```yaml
settings:
  ui:
    media_paths:
      - premise
```

--------------------------------

### Create and Search by Concept

Source: https://github.com/databricks/lilac/blob/main/README.md

Define custom concepts by providing positive and negative examples. This allows for more controllable and powerful search than basic semantic search. Concepts can be stored in a `DiskConceptDB` and then used for searching.

```python
import lilac as ll

# Use one name for the concept database throughout the snippet.
db = ll.DiskConceptDB()
db.create(namespace='local', name='spam')
# Add examples of spam and not-spam.
db.edit('local', 'spam', ll.concepts.ConceptUpdate(
  insert=[
    ll.concepts.ExampleIn(label=False, text='This is normal text.'),
    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),
    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')
  ]
))

# Search by the spam concept, using the same 'local' namespace it was created in.
rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.ConceptSearch(
      path='text',
      concept_namespace='local',
      concept_name='spam',
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
```

--------------------------------

### Balance Dataset for Training

Source: https://github.com/databricks/lilac/blob/main/notebooks/Toxicity.ipynb

Creates a balanced dataset by sampling an equal number of positive and negative examples for a specified label type. Returns embeddings, labels, and text.

```python
import numpy as np


def make_balanced_data(data, embeddings, sample_size_per_group, label_type):
  df = data.to_pandas()
  # Map each label value (0 or 1) to the row indices that carry it.
  groups = df[label_type].groupby(df[label_type]).groups
  positive_examples = np.random.choice(groups[1], sample_size_per_group, replace=False)
  negative_examples = np.random.choice(groups[0], sample_size_per_group, replace=False)
  positive_embeddings = embeddings[positive_examples]
  negative_embeddings = embeddings[negative_examples]
  positive_labels = np.ones(len(positive_embeddings))
  negative_labels = np.zeros(len(negative_embeddings))
  positive_text = df.loc[positive_examples]['comment_text']
  negative_text = df.loc[negative_examples]['comment_text']
  embeddings = np.concatenate([positive_embeddings, negative_embeddings])
  labels = np.concatenate([positive_labels, negative_labels])
  text = np.concatenate([positive_text, negative_text])
  return embeddings, labels, text
```

--------------------------------

### Example Lilac Project Configuration

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

This YAML configuration defines a Lilac project, specifying datasets, embeddings, signals, and UI settings.
It includes a dataset from HuggingFace and configures PII signal detection.

```yaml
# Lilac project config.
# See https://docs.lilacml.com/api_reference/index.html#lilac.Config for details.

datasets:
  - namespace: local
    name: glue
    source:
      dataset_name: glue
      config_name: ax
      source_name: huggingface
    embeddings:
      - path: premise
        embedding: gte-small
    signals:
      - path: premise
        signal:
          signal_name: pii
      - path: hypothesis
        signal:
          signal_name: pii
    settings:
      ui:
        media_paths:
          - premise
```

--------------------------------

### Build Documentation

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Use this script to generate the static documentation files. Execute from the project root.

```bash
./scripts/build_docs.sh
```

--------------------------------

### Remove Examples from a Concept in Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_tuning.md

Remove specific examples from a concept using their unique IDs. This is useful for cleaning up or correcting erroneous examples.

```python
db.edit(
  'local',
  'positive-product-reviews',
  ll.ConceptUpdate(remove=['d86e4cb53c70443b8d8782a6847f4752']))
```

--------------------------------

### Remove Examples from a Concept

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Use `db.edit` with `ll.ConceptUpdate` and the `remove` argument to delete specific examples from a concept by their IDs. Ensure you have the correct IDs for the examples to be removed.

```python
db.edit(
  'local', 'positive-product-reviews', ll.ConceptUpdate(remove=['d86e4cb53c70443b8d8782a6847f4752'])
)
```

--------------------------------

### Initialize Lilac Project and Load Dataset

Source: https://github.com/databricks/lilac/blob/main/notebooks/DatasetMap.ipynb

Sets up the Lilac project directory and retrieves a dataset. Handles dataset creation if it doesn't exist.

```python
%load_ext autoreload
%autoreload 2

import lilac as ll

ll.set_project_dir('./data')

try:
  glue = ll.get_dataset('local', 'glue_ax_map')
except Exception as e:
  glue = ll.create_dataset(
    ll.DatasetConfig(
      namespace='local',
      name='glue_ax_map',
      source=ll.HuggingFaceSource(dataset_name='glue', config_name='ax', sample_size=100),
    )
  )

# ll.start_server()
```

--------------------------------

### Deploy Website

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Deploy the website using the provided script. Append the `--staging` flag to deploy to the staging site instead of production.

```bash
poetry run python -m scripts.deploy_website
```

--------------------------------

### Install TensorFlow Dependencies on M1/M2

Source: https://github.com/databricks/lilac/wiki/Troubleshooting

Install specific TensorFlow dependencies required for M1/M2 chips using conda.

```sh
conda install -c apple tensorflow-deps=2.9.0
```

--------------------------------

### Initialize DiskConceptDB

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_create.md

Instantiate the DiskConceptDB to manage concepts on disk. This is the first step for programmatic concept creation.

```python
import lilac as ll

db = ll.DiskConceptDB()
```

--------------------------------

### Create a Local Concept Database Entry

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Initializes a disk-based concept database and creates a 'positive-product-reviews' concept under the 'local' namespace if it doesn't already exist.

```python
db = ll.DiskConceptDB()
concepts = db.list()
# Don't create the concept twice.
if not list(
  filter(lambda c: c.namespace == 'local' and c.name == 'positive-product-reviews', concepts)
):
  db.create('local', 'positive-product-reviews')
```

--------------------------------

### Publish HuggingFace Public Demo

Source: https://github.com/databricks/lilac/blob/main/development.md

Publish the demo to your HuggingFace Space.
This command syncs data, loads data, uploads data, and deploys to HuggingFace. Use flags like `--skip_sync`, `--skip_load`, `--skip_data_upload`, and `--skip_deploy` to customize the process.

```sh
poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac

# Add:
#   --skip_sync to skip syncing data from the HuggingFace space data.
#   --skip_load to skip loading the data.
#   --load_overwrite to run all data from scratch, overwriting existing data.
#   --skip_data_upload to skip uploading data. This will use the datasets already on the space.
#   --skip_deploy to skip deploying to HuggingFace. Useful to test locally.
```

```sh
poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac \
  --skip_sync \
  --skip_load \
  --skip_data_upload
```

--------------------------------

### Create a new Svelte project

Source: https://github.com/databricks/lilac/blob/main/web/blueprint/README.md

Use this command to initialize a new Svelte project. Specify a directory name to create the project in a new folder, or run without arguments to create it in the current directory.

```bash
npm create svelte@latest
```

```bash
npm create svelte@latest my-app
```

--------------------------------

### Deploy to HuggingFace Space

Source: https://github.com/databricks/lilac/blob/main/development.md

Deploy the dataset to your HuggingFace Space. Use the `--create_space` flag if this is the first time deploying.

```sh
poetry run python -m scripts.deploy_staging \
  --dataset=$DATASET_NAMESPACE/$DATASET_NAME

# --create_space if this is the first time running the command so it will create the space for you.
```

--------------------------------

### Configure HuggingFace Demo Environment

Source: https://github.com/databricks/lilac/blob/main/development.md

Set these environment variables in a `.env.local` file to configure the HuggingFace demo repository and authentication token.

```sh
# The repo to use for the huggingface demo. This does not have to exist when you set the flag, the deploy script will create it if it doesn't exist.
HF_STAGING_DEMO_REPO='lilacai/your-space'
# To authenticate with HuggingFace for uploading to the space.
HF_ACCESS_TOKEN='hf_abcdefghijklmnop'
```

--------------------------------

### Set up dataset and embedding paths

Source: https://github.com/databricks/lilac/blob/main/notebooks/MigrateEmbedding.ipynb

Defines the namespace, dataset name, path, and embedding type to locate the signal directory.

```python
import os

import lilac as ll

namespace = 'local'
dataset_name = 'twitter-support'
path = 'text'
embedding = 'cohere'

signal_dir = os.path.join('data', 'datasets', namespace, dataset_name, path, embedding)
```

--------------------------------

### Score Text with ConceptSignal

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Use ConceptSignal to score text against a specified concept and embedding. Provides high and low score examples.

```python
concept_scorer = ll.signals.ConceptSignal(
  namespace=concept_namespace, concept_name=concept_name, embedding=embedding_name
)

# Should be a high-score.
results = concept_scorer.compute(['As a language model I cannot talk about politics.'])
print(list(results))

# Should be a low-score.
results = concept_scorer.compute(['How are you doing today?'])
print(list(results))
```

--------------------------------

### Get Linear Model Coefficients

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Retrieve the coefficients of a linear model associated with a concept and embedding. Note that weights are tied to the embedding.

```python
# Get the `language-model-reference` concept model which predicts whether text is an LLM
```

--------------------------------

### Set Project Directory and Get/Create Dataset

Source: https://github.com/databricks/lilac/blob/main/notebooks/CurateCodingDataset.ipynb

Configures the Lilac project directory and retrieves an existing dataset or creates a new one from Hugging Face if it doesn't exist. Requires the lilac library.

```python
import lilac as ll

ll.set_project_dir('./demo_data')

try:
  ds = ll.get_dataset('lilac', 'glaive')
except Exception:
  # Create the dataset.
  config = ll.DatasetConfig(
    namespace='lilac',
    name='glaive',
    source=ll.HuggingFaceSource(dataset_name='glaiveai/glaive-code-assistant'),
  )
  ds = ll.create_dataset(config)
```

--------------------------------

### Prepare Dummy Vector Store for Loading Embeddings

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_embeddings.md

Create a dictionary to act as a dummy vector store by encoding sample text items using a pre-defined embedding function. This simulates an external vector store.

```python
items = [
  {'id': '0_', 'text': 'This is some fake data'},
  {'id': '1_', 'text': 'This is some more fake data'},
  {'id': '2_', 'text': 'This is even more fake data'},
  {'id': '3_', 'text': 'I love plants'},
]

vector_store = {}
for item in items:
  vector_store[item['id']] = _embed(item['text'])
```

--------------------------------

### Register Custom Embedding Function

Source: https://github.com/databricks/lilac/blob/main/notebooks/CustomEmbeddings.ipynb

Registers a custom embedding function using the SentenceTransformer library. Ensures the 'sentence_transformers' package is installed before proceeding.

```python
import numpy as np

import lilac as ll

try:
  from sentence_transformers import SentenceTransformer
except ImportError:
  raise ImportError(
    'Could not import the "sentence_transformers" python package. '
    'Please install it with `pip install sentence_transformers`.'
  )

embedding_model = SentenceTransformer('thenlper/gte-small')


def _embed(text):
  # Call the gte-small embedding model.
  return np.array(embedding_model.encode(text))


# Make an embedding class.
class MyEmbedding(ll.TextEmbeddingSignal):
  name = 'my_embedding'

  def compute(self, data):
    for text in data:
      embedding = _embed(text)
      # Yield a full chunk embedding. If you want to chunk your text, yield an array here.
      yield [ll.chunk_embedding(0, len(text), embedding)]


print('Testing the embedding on a single item...')
print(next(MyEmbedding().compute(['This is some text'])))
```

--------------------------------

### Hugging Face Deployment Output

Source: https://github.com/databricks/lilac/blob/main/notebooks/DeployToHuggingFace.ipynb

This output shows the process of creating a Hugging Face space and deploying the project files and datasets.

```text
Creating huggingface space https://huggingface.co/spaces/nsthorat-lilac/nikhil-project-demo
The space will be created as private. You can change this from the UI.
Created: https://huggingface.co/spaces/nsthorat-lilac/nikhil-project-demo
Deploying project: ./data
Copying root files...
Uploading datasets: ['local/glue_ax']
Uploading "local/glue_ax" to HuggingFace dataset repo https://huggingface.co/datasets/nsthorat-lilac/nikhil-project-demo-local-glue_ax
```

```text
data-00000-of-00001.parquet: 0%| | 0.00/116k [00:00