### Install Lilac and Start Project

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/quickstart.md

Install Lilac with all dependencies and initialize a new project in the specified directory.

```bash
pip install "lilac[all]"

lilac start ~/my_project
```

--------------------------------

### Start Svelte development server

Source: https://github.com/databricks/lilac/blob/main/web/blueprint/README.md

Run this command after installing dependencies to start the development server. The `--open` flag will automatically open the application in your default browser.

```bash
npm run dev
```

```bash
npm run dev -- --open
```

--------------------------------

### Install and Start Lilac CLI

Source: https://github.com/databricks/lilac/blob/main/docs/deployment/self_hosted.md

Install Lilac with all optional dependencies and start the server with `/data` as the project directory. The server listens on port 5432 by default. Python and pip must already be installed.

```sh
pip install "lilac[all]"
lilac start /data
```

--------------------------------

### Install Project Dependencies

Source: https://github.com/databricks/lilac/blob/main/development.md

Run this script to install all necessary project dependencies.

```sh
./scripts/setup.sh
```

--------------------------------

### Install Firebase CLI

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Install the Firebase command-line interface globally. This is a one-time setup step for deployment.

```bash
npm install -g firebase-tools
```

--------------------------------

### Install Lilac with all features

Source: https://github.com/databricks/lilac/blob/main/README.md

Install Lilac including all optional dependencies for full functionality. This is the recommended installation for most users.

```sh
pip install "lilac[all]"
```

--------------------------------

### Initialize Concept and Example Objects

Source: https://github.com/databricks/lilac/blob/main/notebooks/Toxicity.ipynb

Initializes an empty dictionary for data and imports necessary classes for defining concepts and examples.

```python
from lilac.concepts.concept import Concept, Example

data = {}

```

--------------------------------

### Start Lilac webserver from CLI

Source: https://github.com/databricks/lilac/blob/main/README.md

Start a Lilac webserver using the command-line interface. This command initializes a project directory and launches the server.

```sh
lilac start ~/my_project
```

--------------------------------

### Run Development Server

Source: https://github.com/databricks/lilac/blob/main/development.md

Execute this command to start the web server in development mode, enabling fast edit-refresh.

```sh
./run_server_dev.sh
```

--------------------------------

### Install Miniforge for M1/M2 TensorFlow

Source: https://github.com/databricks/lilac/wiki/Troubleshooting

Steps to install Miniforge and activate the environment for TensorFlow on M1/M2 chips.

```sh
chmod +x ~/Downloads/Miniforge3-MacOSX-arm64.sh
sh ~/Downloads/Miniforge3-MacOSX-arm64.sh
source ~/miniforge3/bin/activate
```

--------------------------------

### Start Lilac Server and Create Project

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_load.md

Starts the Lilac web server and automatically creates a project directory with an empty `lilac.yml` file. This is useful for managing project configurations.

```python
import lilac as ll

ll.start_server(project_dir='~/my_lilac')
```

--------------------------------

### Watch Documentation Locally

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Run this script to start a local server that automatically refreshes documentation as you make changes. Execute from the project root.

```bash
./scripts/watch_docs.sh
```

--------------------------------

### Add Examples to a Concept in Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_tuning.md

Insert new labeled examples into an existing concept. Ensure examples are formatted as `ll.ExampleIn` objects with a boolean label and text.

```python
train_data = [
  ll.ExampleIn(label=False, text='The weather is beautiful today'),
]
db.edit(
  'local', 'positive-product-reviews',
  ll.ConceptUpdate(insert=train_data))
```

--------------------------------

### Start Lilac Server

Source: https://github.com/databricks/lilac/blob/main/notebooks/LlamaIndexLoader.ipynb

Use this command to start a Lilac server. Specify the project directory where your data is located.

```python
ll.start_server(project_dir='./data')
```

--------------------------------

### Create Docker Buildx Builder

Source: https://github.com/databricks/lilac/blob/main/development.md

One-time setup command to create and bootstrap a Docker buildx builder instance named 'mybuilder'.

```sh
docker buildx create --name mybuilder --node mybuilder0 --bootstrap --use
```

--------------------------------

### Add Examples to a Concept

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_create.md

Add positive and negative examples to an existing concept. Examples are defined using ExampleIn objects with a label and text.

```python
examples = [
  ll.concepts.ExampleIn(label=False, text='The quick brown fox jumps over the lazy dog.'),
  ll.concepts.ExampleIn(label=True, text='This product is amazing!'),
  ll.concepts.ExampleIn(label=True, text='Thank you for your awesome work on this UI.')
]
db.edit('local', 'positive-product-reviews', ll.concepts.ConceptUpdate(insert=examples))
```

--------------------------------

### List Concept Examples in Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_tuning.md

Retrieve and display existing examples associated with a concept. This is useful for inspecting the current training data of a concept.

```python
concept = db.get('local', 'positive-product-reviews')

print(concept.data)
```

--------------------------------

### Retrieve and Print Concept Examples

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Fetch a concept using `db.get` and print its data. This is useful for inspecting the current training examples within a concept.

```python
concept = db.get('local', 'positive-product-reviews')

if concept:
  print(concept.data)
```

--------------------------------

### Get a Concept by Name

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Retrieves a specific concept using its namespace and name. This action loads the concept, which is a collection of positive and negative examples.

```python
# Get the `language-model-reference` concept. This is just a collection of positive and negative
```

--------------------------------

### Start Lilac Server

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Starts the Lilac server to enable data visualization and interaction. Ensure this is run before other server-dependent operations.

```python
ll.start_server()

```

--------------------------------

### Start Lilac Web Server

Source: https://github.com/databricks/lilac/blob/main/docs/blog/curate-coding-dataset.md

Starts the Lilac web server to allow for visual inspection and interaction with the loaded dataset through a web interface.

```python
ll.start_server()
```

--------------------------------

### Add Training Examples to a Concept

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Use `ll.ExampleIn` to define training data and `db.edit` to insert it into a specified concept. Ensure the concept and namespace are correctly identified.

```python
train_data = [
  ll.ExampleIn(label=False, text='The quick brown fox jumps over the lazy dog.'),
  ll.ExampleIn(label=False, text='This is a random sentence.'),
  ll.ExampleIn(label=True, text='This product is amazing!'),
  ll.ExampleIn(label=True, text='Thank you for your awesome work on this UI.'),
]
db.edit('local', 'positive-product-reviews', ll.ConceptUpdate(insert=train_data))
```

--------------------------------

### Get Dataset Handle

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_query.md

Set the project path and retrieve a dataset instance. Ensure the project path is set before accessing datasets.

```python
import lilac as ll

# Set the project directory globally. For more information, see the Projects guide.
ll.set_project_dir('~/my_project')

dataset = ll.get_dataset('local', 'imdb')
```

--------------------------------

### Install Dependencies for LlamaIndex and pypdf

Source: https://github.com/databricks/lilac/blob/main/notebooks/LlamaIndexLoader.ipynb

Install the necessary libraries, pypdf and llama_index, to use LlamaIndex loaders. This is a prerequisite for loading data.

```python
!pip install pypdf llama_index
```

--------------------------------

### Save Concept with Sentiment Examples

Source: https://github.com/databricks/lilac/blob/main/notebooks/Sentiment.ipynb

Initializes a data structure to save concept examples, iterating through the training data to create `Example` objects with sentiment labels and text. This snippet is incomplete and likely part of a larger concept definition process.

```python
from lilac.concepts.concept import Concept, Example


def save_concept(positive_sentiment):
  data = {}

  for index, (label, text) in enumerate(zip(labels, list(train_df['text']))):
    id = str(index)
    ex = Example(label=bool(label), text=text, id=str(index))
    if not positive_sentiment:

```

--------------------------------

### Start Lilac webserver from Python

Source: https://github.com/databricks/lilac/blob/main/README.md

Start a Lilac webserver programmatically using the Python API. This is useful for integrating Lilac into existing Python workflows.

```python
import lilac as ll

ll.start_server(project_dir='~/my_project')
```

--------------------------------

### Start a New Lilac Project

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/installation.md

Initiate a new Lilac project in a specified directory. The command will prompt for confirmation before proceeding.

```bash
❯ lilac start ~/my_project
Lilac will create a project in `/Users/me/my_project`. Do you want to continue? (y/n): y

INFO:     Started server process [33100]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)
```

--------------------------------

### Reinstall Xcode Command Line Tools

Source: https://github.com/databricks/lilac/wiki/Troubleshooting

Run these commands if the pyenv installation fails on M1 machines after installing Xcode. They remove the existing Command Line Tools and reinstall them.

```sh
sudo rm -rf /Library/Developer/CommandLineTools
xcode-select --install
```

--------------------------------

### Initialize Lilac Project from CLI

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

Use this command to initialize a Lilac project directory without starting the webserver. This is useful for setting up a project structure.

```sh
lilac init ~/my_project
```

--------------------------------

### Check Lilac Version

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/installation.md

Verify the installed version of Lilac.

```bash
lilac version
```

--------------------------------

### Example Output of Queried Row

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_labels.md

Illustrates the structure of a row returned after querying for existing labels, showing the row ID, text, label, and other metadata.

```python
{
  '__rowid__': '0003076800f1471f8f4c8a1b2deda742',
  'text': 'If you want to truly experience the magic (?) of Don Dohler, then check out Alien Factor or maybe Fiend...', 
  'label': 'neg',
  '__hfsplit__': 'test',
  'good': {
    'label': 'true',
    'created': datetime.datetime(2023, 9, 20, 10, 16, 15, 545277)
  }
}
```
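Rows of this shape are ordinary dictionaries, so they can be post-processed in plain Python. The sketch below is only an illustration: the two sample rows are made up, `has_label` is a hypothetical helper, and it assumes an unlabeled row either lacks the label key or carries `None` for it.

```python
import datetime

# Made-up rows mirroring the structure shown above.
rows = [
  {
    '__rowid__': 'a1',
    'text': 'Great film.',
    'label': 'pos',
    'good': {'label': 'true', 'created': datetime.datetime(2023, 9, 20, 10, 16, 15)},
  },
  {'__rowid__': 'b2', 'text': 'Not for me.', 'label': 'neg', 'good': None},
]


def has_label(row, label_name):
  """True if the row carries the given user label with the value 'true'."""
  value = row.get(label_name)
  return isinstance(value, dict) and value.get('label') == 'true'


good_ids = [row['__rowid__'] for row in rows if has_label(row, 'good')]
print(good_ids)
```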

--------------------------------

### Load Dataset from Hugging Face

Source: https://github.com/databricks/lilac/blob/main/notebooks/Clustering.ipynb

Loads a dataset from the Hugging Face Hub. Ensure the 'lilac' library is installed.

```python
import lilac as ll

ds = ll.from_huggingface('LDJnr/Capybara')
```

--------------------------------

### UI Settings for Media Paths in Lilac Project

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

Configures which fields are displayed as media paths in the Lilac UI. This example sets 'premise' as the sole media path.

```yaml
settings:
  ui:
    media_paths:
      - premise
```

--------------------------------

### Create and Search by Concept

Source: https://github.com/databricks/lilac/blob/main/README.md

Define custom concepts by providing positive and negative examples. This allows for more controllable and powerful search than basic semantic search. Concepts can be stored in a `DiskConceptDB` and then used for searching.

```python
db = ll.DiskConceptDB()
db.create(namespace='local', name='spam')
# Add examples of spam and not-spam.
db.edit('local', 'spam', ll.concepts.ConceptUpdate(
  insert=[
    ll.concepts.ExampleIn(label=False, text='This is normal text.'),
    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),
    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')
  ]
))

# Search by the spam concept.
rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.ConceptSearch(
      path='text',
      concept_namespace='local',
      concept_name='spam',
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
```

--------------------------------

### Balance Dataset for Training

Source: https://github.com/databricks/lilac/blob/main/notebooks/Toxicity.ipynb

Creates a balanced dataset by sampling an equal number of positive and negative examples for a specified label type. Returns embeddings, labels, and text.

```python
import numpy as np


def make_balanced_data(data, embeddings, sample_size_per_group, label_type):
  df = data.to_pandas()
  groups = df[label_type].groupby(df[label_type]).groups
  positive_examples = np.random.choice(groups[1], sample_size_per_group, replace=False)
  negative_examples = np.random.choice(groups[0], sample_size_per_group, replace=False)
  positive_embeddings = embeddings[positive_examples]
  negative_embeddings = embeddings[negative_examples]
  positive_labels = np.ones(len(positive_embeddings))
  negative_labels = np.zeros(len(negative_embeddings))
  positive_text = df.loc[positive_examples]['comment_text']
  negative_text = df.loc[negative_examples]['comment_text']
  embeddings = np.concatenate([positive_embeddings, negative_embeddings])
  labels = np.concatenate([positive_labels, negative_labels])
  text = np.concatenate([positive_text, negative_text])
  return embeddings, labels, text

```
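The core of the balancing step, drawing the same number of positive and negative row indices without replacement, can be tried without pandas or NumPy. This is a minimal standard-library sketch; `balanced_indices`, its `seed` parameter, and the sample labels are hypothetical and not taken from the notebook.

```python
import random


def balanced_indices(labels, sample_size_per_group, seed=0):
  """Return equally many indices of positive (1) and negative (0) labels."""
  rng = random.Random(seed)
  positives = [i for i, label in enumerate(labels) if label == 1]
  negatives = [i for i, label in enumerate(labels) if label == 0]
  # random.sample draws without replacement, like np.random.choice(..., replace=False).
  return (
    rng.sample(positives, sample_size_per_group),
    rng.sample(negatives, sample_size_per_group),
  )


labels = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
pos, neg = balanced_indices(labels, sample_size_per_group=2)
print(pos, neg)
```

As with `np.random.choice(..., replace=False)`, asking for more samples than a group contains raises a `ValueError`.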

--------------------------------

### Example Lilac Project Configuration

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

This YAML configuration defines a Lilac project, specifying datasets, embeddings, signals, and UI settings. It includes a dataset from HuggingFace and configures PII signal detection.

```yaml
# Lilac project config.
# See https://docs.lilacml.com/api_reference/index.html#lilac.Config for details.

datasets:
  - namespace: local
    name: glue
    source:
      dataset_name: glue
      config_name: ax
      source_name: huggingface
    embeddings:
      - path: premise
        embedding: gte-small
    signals:
      - path: premise
        signal:
          signal_name: pii
      - path: hypothesis
        signal:
          signal_name: pii
    settings:
      ui:
        media_paths:
          - premise
```

--------------------------------

### Build Documentation

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Use this script to generate the static documentation files. Execute from the project root.

```bash
./scripts/build_docs.sh
```

--------------------------------

### Remove Examples from a Concept in Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_tuning.md

Remove specific examples from a concept using their unique IDs. This is useful for cleaning up or correcting erroneous examples.

```python
db.edit(
  'local', 'positive-product-reviews',
  ll.ConceptUpdate(remove=['d86e4cb53c70443b8d8782a6847f4752']))
```

--------------------------------

### Remove Examples from a Concept

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Use `db.edit` with `ll.ConceptUpdate` and the `remove` argument to delete specific examples from a concept by their IDs. Ensure you have the correct IDs for the examples to be removed.

```python
db.edit(
  'local', 'positive-product-reviews', ll.ConceptUpdate(remove=['d86e4cb53c70443b8d8782a6847f4752'])
)
```

--------------------------------

### Initialize Lilac Project and Load Dataset

Source: https://github.com/databricks/lilac/blob/main/notebooks/DatasetMap.ipynb

Sets up the Lilac project directory and retrieves a dataset. Handles dataset creation if it doesn't exist.

```python
%load_ext autoreload
%autoreload 2
import lilac as ll

ll.set_project_dir('./data')

try:
  glue = ll.get_dataset('local', 'glue_ax_map')
except Exception as e:
  glue = ll.create_dataset(
    ll.DatasetConfig(
      namespace='local',
      name='glue_ax_map',
      source=ll.HuggingFaceSource(dataset_name='glue', config_name='ax', sample_size=100),
    )
  )

# ll.start_server()
```

--------------------------------

### Deploy Website

Source: https://github.com/databricks/lilac/blob/main/docs/README.md

Deploy the website using the provided script. Append the `--staging` flag to deploy to the staging site instead of production.

```bash
poetry run python -m scripts.deploy_website
```

--------------------------------

### Install TensorFlow Dependencies on M1/M2

Source: https://github.com/databricks/lilac/wiki/Troubleshooting

Install specific TensorFlow dependencies required for M1/M2 chips using conda.

```sh
conda install -c apple tensorflow-deps=2.9.0
```

--------------------------------

### Initialize DiskConceptDB

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_create.md

Instantiate the DiskConceptDB to manage concepts on disk. This is the first step for programmatic concept creation.

```python
import lilac as ll

db = ll.DiskConceptDB()
```

--------------------------------

### Create a Local Concept Database Entry

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Initializes a disk-based concept database and creates a 'positive-product-reviews' concept under the 'local' namespace if it doesn't already exist.

```python
db = ll.DiskConceptDB()

concepts = db.list()
# Don't create the concept twice.
if not list(
  filter(lambda c: c.namespace == 'local' and c.name == 'positive-product-reviews', concepts)
):
  db.create('local', 'positive-product-reviews')
```
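The existence check can also be written with `any()`, which stops at the first match. The sketch below runs without Lilac: `ConceptInfo` is a hypothetical stand-in for the entries returned by `db.list()`, assumed only to expose `namespace` and `name` attributes as the snippet above does.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by `db.list()`.
ConceptInfo = namedtuple('ConceptInfo', ['namespace', 'name'])


def concept_exists(concepts, namespace, name):
  """True if any listed concept matches both namespace and name."""
  return any(c.namespace == namespace and c.name == name for c in concepts)


concepts = [ConceptInfo('local', 'spam'), ConceptInfo('lilac', 'toxicity')]
print(concept_exists(concepts, 'local', 'positive-product-reviews'))
print(concept_exists(concepts, 'local', 'spam'))
```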

--------------------------------

### Publish HuggingFace Public Demo

Source: https://github.com/databricks/lilac/blob/main/development.md

Publish the demo to your HuggingFace Space. This command syncs data, loads data, uploads data, and deploys to HuggingFace. Use flags like `--skip_sync`, `--skip_load`, `--skip_data_upload`, and `--skip_deploy` to customize the process.

```sh
poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac

# Add:
#   --skip_sync to skip syncing data from the HuggingFace space data.
#   --skip_load to skip loading the data.
#   --load_overwrite to run all data from scratch, overwriting existing data.
#   --skip_data_upload to skip uploading data. This will use the datasets already on the space.
#   --skip_deploy to skip deploying to HuggingFace. Useful to test locally.
```

```sh
poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac \
  --skip_sync \
  --skip_load \
  --skip_data_upload
```
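When the deploy step is driven from a script, the optional flags can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository; it only builds and prints the argument list using the flags named above, and does not run anything.

```python
import shlex


def deploy_demo_command(skip_sync=False, skip_load=False,
                        skip_data_upload=False, skip_deploy=False):
  """Assemble the deploy_demo invocation; each True option appends its flag."""
  cmd = [
    'poetry', 'run', 'python', '-m', 'scripts.deploy_demo',
    '--project_dir=./demo_data',
    '--config=./lilac_hf_space.yml',
    '--hf_space=lilacai/lilac',
  ]
  options = {
    '--skip_sync': skip_sync,
    '--skip_load': skip_load,
    '--skip_data_upload': skip_data_upload,
    '--skip_deploy': skip_deploy,
  }
  cmd += [flag for flag, enabled in options.items() if enabled]
  return cmd


cmd = deploy_demo_command(skip_sync=True, skip_load=True, skip_data_upload=True)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root.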

--------------------------------

### Create a new Svelte project

Source: https://github.com/databricks/lilac/blob/main/web/blueprint/README.md

Use this command to initialize a new Svelte project. Specify a directory name to create the project in a new folder, or run without arguments to create it in the current directory.

```bash
npm create svelte@latest
```

```bash
npm create svelte@latest my-app
```

--------------------------------

### Deploy to HuggingFace Space

Source: https://github.com/databricks/lilac/blob/main/development.md

Deploy the dataset to your HuggingFace Space. Use the `--create_space` flag if this is the first time deploying.

```sh
poetry run python -m scripts.deploy_staging \
  --dataset=$DATASET_NAMESPACE/$DATASET_NAME

# --create_space if this is the first time running the command so it will create the space for you.
```

--------------------------------

### Configure HuggingFace Demo Environment

Source: https://github.com/databricks/lilac/blob/main/development.md

Set these environment variables in a `.env.local` file to configure the HuggingFace demo repository and authentication token.

```sh
# The repo to use for the HuggingFace demo. It does not have to exist when you set this variable; the deploy script will create it if needed.
HF_STAGING_DEMO_REPO='lilacai/your-space'
# To authenticate with HuggingFace for uploading to the space.
HF_ACCESS_TOKEN='hf_abcdefghijklmnop'
```

--------------------------------

### Set up dataset and embedding paths

Source: https://github.com/databricks/lilac/blob/main/notebooks/MigrateEmbedding.ipynb

Defines the namespace, dataset name, path, and embedding type to locate the signal directory.

```python
import os
import lilac as ll

namespace = 'local'
dataset_name = 'twitter-support'
path = 'text'
embedding = 'cohere'
signal_dir = os.path.join('data', 'datasets', namespace, dataset_name, path, embedding)
```

--------------------------------

### Score Text with ConceptSignal

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Use ConceptSignal to score text against a specified concept and embedding. Provides high and low score examples.

```python
concept_scorer = ll.signals.ConceptSignal(
  namespace=concept_namespace, concept_name=concept_name, embedding=embedding_name
)

# Should be a high-score.
results = concept_scorer.compute(['As a language model I cannot talk about politics.'])
print(list(results))

# Should be a low-score.
results = concept_scorer.compute(['How are you doing today?'])
print(list(results))
```

--------------------------------

### Get Linear Model Coefficients

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Retrieve the coefficients of a linear model associated with a concept and embedding. Note that weights are tied to the embedding.

```python
# Get the `language-model-reference` concept model which predicts whether text is an LLM

```
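As a rough illustration of why the weights are tied to the embedding: a linear model holds one coefficient per embedding dimension, so coefficients fit on one embedding cannot be applied to vectors from another. The sketch below assumes a logistic-regression-style scorer; `linear_score` and the numbers are made up for illustration and are not Lilac's implementation.

```python
import math


def linear_score(embedding, coefficients, intercept=0.0):
  """Sigmoid of the dot product: one coefficient per embedding dimension."""
  if len(embedding) != len(coefficients):
    raise ValueError('coefficients were fit for a different embedding size')
  z = sum(x * w for x, w in zip(embedding, coefficients)) + intercept
  return 1.0 / (1.0 + math.exp(-z))


coefficients = [2.0, -1.0, 0.5]  # made-up weights for a 3-dimensional embedding
print(linear_score([1.0, 0.0, 1.0], coefficients))   # z = 2.5, score above 0.5
print(linear_score([-1.0, 1.0, 0.0], coefficients))  # z = -3.0, score below 0.5
```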

--------------------------------

### Set Project Directory and Get/Create Dataset

Source: https://github.com/databricks/lilac/blob/main/notebooks/CurateCodingDataset.ipynb

Configures the Lilac project directory and retrieves an existing dataset or creates a new one from Hugging Face if it doesn't exist. Requires lilac library.

```python
import lilac as ll

ll.set_project_dir('./demo_data')

try:
  ds = ll.get_dataset('lilac', 'glaive')
except Exception:
  # Create the dataset.
  config = ll.DatasetConfig(
    namespace='lilac',
    name='glaive',
    source=ll.HuggingFaceSource(dataset_name='glaiveai/glaive-code-assistant'),
  )
  ds = ll.create_dataset(config)
```

--------------------------------

### Prepare Dummy Vector Store for Loading Embeddings

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_embeddings.md

Create a dictionary to act as a dummy vector store by encoding sample text items using a pre-defined embedding function. This simulates an external vector store.

```python
items = [
  {'id': '0_', 'text': 'This is some fake data'},
  {'id': '1_', 'text': 'This is some more fake data'},
  {'id': '2_', 'text': 'This is even more fake data'},
  {'id': '3_', 'text': 'I love plants'},
]

vector_store = {}
for item in items:
  vector_store[item['id']] = _embed(item['text'])
```

--------------------------------

### Register Custom Embedding Function

Source: https://github.com/databricks/lilac/blob/main/notebooks/CustomEmbeddings.ipynb

Registers a custom embedding function using the SentenceTransformer library. Ensures the 'sentence_transformers' package is installed before proceeding.

```python
import lilac as ll
import numpy as np

try:
  from sentence_transformers import SentenceTransformer
except ImportError:
  raise ImportError(
    'Could not import the "sentence_transformers" python package. '
    'Please install it with `pip install "sentence_transformers"`.'
  )

embedding_model = SentenceTransformer('thenlper/gte-small')


def _embed(text):
  # Call the gte-small embedding model.
  return np.array(embedding_model.encode(text))


# Make an embedding class.
class MyEmbedding(ll.TextEmbeddingSignal):
  name = 'my_embedding'

  def compute(self, data):
    for text in data:
      embedding = _embed(text)
      # Yield a full chunk embedding. If you want to chunk your text, yield an array here.
      yield [ll.chunk_embedding(0, len(text), embedding)]


print('Testing the embedding on a single item...')
print(next(MyEmbedding().compute(['This is some text'])))

```
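The comment in `compute` mentions yielding several chunk embeddings instead of one. One possible way to derive the spans is fixed-size character windows, sketched below. `chunk_spans` is a hypothetical helper, and the sketch assumes the first two arguments of `ll.chunk_embedding` are start and end character offsets, as the `0, len(text)` call above suggests.

```python
def chunk_spans(text, chunk_size):
  """Return (start, end) character offsets covering `text` in fixed-size windows."""
  if chunk_size <= 0:
    raise ValueError('chunk_size must be positive')
  return [
    (start, min(start + chunk_size, len(text)))
    for start in range(0, len(text), chunk_size)
  ]


text = 'This is some text'
for start, end in chunk_spans(text, chunk_size=8):
  print(start, end, repr(text[start:end]))
```

Inside `compute`, each span would then be embedded separately, for example `ll.chunk_embedding(start, end, _embed(text[start:end]))`.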

--------------------------------

### Hugging Face Deployment Output

Source: https://github.com/databricks/lilac/blob/main/notebooks/DeployToHuggingFace.ipynb

This output shows the process of creating a Hugging Face space and deploying the project files and datasets.

```text
Creating huggingface space https://huggingface.co/spaces/nsthorat-lilac/nikhil-project-demo
The space will be created as private. You can change this from the UI.
Created: https://huggingface.co/spaces/nsthorat-lilac/nikhil-project-demo
Deploying project: ./data

Copying root files...

Uploading datasets:  ['local/glue_ax']
Uploading "local/glue_ax" to HuggingFace dataset repo https://huggingface.co/datasets/nsthorat-lilac/nikhil-project-demo-local-glue_ax


```

```text
data-00000-of-00001.parquet:   0%|          | 0.00/116k [00:00<?, ?B/s]
data-00000-of-00001.parquet:   7%|▋         | 8.19k/116k [00:00<00:01, 67.6kB/s]
spans.pkl: 100%|██████████| 53.0k/53.0k [00:00<00:00, 165kB/s]
hnsw.lookup.pkl: 100%|██████████| 53.4k/53.4k [00:00<00:00, 149kB/s]
data-00000-of-00001.parquet: 100%|██████████| 46.8k/46.8k [00:00<00:00, 123kB/s]
data-00000-of-00001.parquet: 100%|██████████| 116k/116k [00:00<00:00, 283kB/s]
data-00000-of-00001.parquet:   0%|          | 0.00/939 [00:00<?, ?B/s]
data-00000-of-00001.parquet: 100%|██████████| 939/939 [00:00<00:00, 16.8kB/s]
data-00000-of-00001.parquet: 100%|██████████| 47.4k/47.4k [00:00<00:00, 435kB/s]
data-00000-of-00001.parquet: 100%|██████████| 43.5k/43.5k [00:00<00:00, 227kB/s]
data-00000-of-00001.parquet: 100%|██████████| 41.2k/41.2k [00:00<00:00, 196kB/s]
hnsw.hnswlib.bin: 100%|██████████| 1.86M/1.86M [00:01<00:00, 1.25MB/s]
Upload 9 LFS files: 100%|██████████| 9/9 [00:01<00:00,  5.55it/s]
```

```text
Uploading concepts:  ['local/aliens', '100712716653593140239/aliens', '100712716653593140239/private_aliens']

Uploading cache files: ['concept/local/aliens/gte-small.pkl', 'concept/100712716653593140239/aliens/gte-small.pkl', 'concept/100712716653593140239/private_aliens/gte-small.pkl']


```

```text
Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]
gte-small.pkl: 100%|██████████| 21.8k/21.8k [00:00<00:00, 266kB/s]
gte-small.pkl: 100%|██████████| 10.8k/10.8k [00:00<00:00, 111kB/s]
gte-small.pkl: 100%|██████████| 28.4k/28.4k [00:00<00:00, 231kB/s]
Upload 3 LFS files: 100%|██████████| 3/3 [00:00<00:00,  8.47it/s]
```

```text
Done! View your space at https://huggingface.co/spaces/nsthorat-lilac/nikhil-project-demo


```

--------------------------------

### Limit and Offset Dataset Rows

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_query.md

Use `limit` and `offset` to control the number of rows returned and the starting point of the results, similar to SQL.

```python
rows = dataset.select_rows(columns=['text'], limit=1, offset=100)
print(list(rows))
```
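The paging semantics can be sketched without Lilac. The snippet below is a plain-Python illustration, assuming `limit` and `offset` behave like SQL's `LIMIT` and `OFFSET`; the `page` helper and the sample rows are hypothetical and not part of the Lilac API.

```python
from itertools import islice


def page(rows, limit, offset=0):
  """Skip `offset` rows, then return at most `limit` rows (SQL-style paging)."""
  return list(islice(rows, offset, offset + limit))


# Hypothetical rows standing in for a dataset with a single 'text' column.
rows = [{'text': f'review {i}'} for i in range(250)]

first_page = page(rows, limit=100, offset=0)
second_page = page(rows, limit=100, offset=100)
last_page = page(rows, limit=100, offset=200)
print(len(first_page), len(second_page), len(last_page))
```

Stepping `offset` by `limit` walks the rows page by page; the final page may be shorter than `limit`.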

--------------------------------

### Load Concepts from Directory

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Load all concepts defined in a specified directory. Ensure concepts are defined using the `Concept` class.

```python
from lilac.concepts import load_concepts

concepts = load_concepts("path/to/concepts")
```

--------------------------------

### Create Dataset from Pandas DataFrame

Source: https://github.com/databricks/lilac/blob/main/notebooks/API.ipynb

Creates a Lilac dataset from a Pandas DataFrame. Requires `lilac` and `pandas`. The DataFrame is read from a CSV URL in this example.

```python
import pandas as pd

url = 'https://storage.googleapis.com/lilac-data-us-east1/datasets/csv_datasets/the_movies_dataset/the_movies_dataset.csv'
df = pd.read_csv(url, low_memory=False)

config = ll.DatasetConfig(namespace='local', name='the_movies_dataset2', source=ll.PandasSource(df))

dataset = ll.create_dataset(config)
```

--------------------------------

### Signals Configuration in Lilac Project

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

Defines signals to be computed on dataset fields. This example configures the 'pii' signal to run on both 'premise' and 'hypothesis' fields.

```yaml
signals:
  - path: premise
    signal:
      signal_name: pii
  - path: hypothesis
    signal:
      signal_name: pii
```

--------------------------------

### Build and Push Docker Images

Source: https://github.com/databricks/lilac/blob/main/development.md

Build and push Docker images for both 'linux/amd64' and 'linux/arm64' platforms, tagging them with 'lilacai/lilac' and a version-specific tag. Ensure Docker Desktop is running and you are logged in as the 'lilacai' account.

```sh
docker buildx build --platform linux/amd64,linux/arm64 \
  -t lilacai/lilac \
  -t lilacai/lilac:$(poetry version -s) \
  --push .
```

--------------------------------

### Create a Dataset with Custom Settings

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_configure.md

Use `DatasetConfig` to define dataset properties and `DatasetSettings` to specify UI and embedding preferences when creating a new dataset.

```python
import lilac as ll

# 'text' is our media path
settings = ll.DatasetSettings(
  ui=ll.DatasetUISettings(
    media_paths=[('text',)],
    view_type='single-item'
  ),
  preferred_embedding='gte-small'
)
config = ll.DatasetConfig(
  namespace='test_namespace',
  name='test_dataset',
  source=ll.JSONSource(),
  settings=settings,
)
dataset = ll.create_dataset(config)
```

--------------------------------

### Create a New Concept

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_create.md

Create a new concept with a specified namespace and name using the DiskConceptDB instance.

```python
db.create(namespace='test', name='test_concept')
```

--------------------------------

### Load and Print Dataset Rows with LlamaIndex

Source: https://github.com/databricks/lilac/blob/main/notebooks/LlamaIndexLoader.ipynb

Retrieve the dataset that was loaded through the LlamaIndex loader and print its first five rows. This is useful for initial data inspection.

```python
dataset = ll.get_dataset('local', 'arxiv-karpathy')
for row in dataset.select_rows(['*'], limit=5):
  print(row)
```

--------------------------------

### Get IMDB Dataset in Python

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_delete_rows.md

Initializes the Lilac project directory and retrieves the IMDB dataset. Ensure the project directory is set before accessing datasets.

```python
import lilac as ll

ll.set_project_dir('~/my_project')

dataset = ll.get_dataset('local', 'imdb')
```

--------------------------------

### Embeddings Configuration in Lilac Project

Source: https://github.com/databricks/lilac/blob/main/docs/projects/projects.md

Configures embeddings to be computed on specific fields within a dataset. This example shows computing the 'gte-small' embedding on the 'premise' field.

```yaml
embeddings:
  - path: premise
    embedding: gte-small
```

--------------------------------

### Create a Concept with a Prompt and Parameters

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Define a concept with a prompt and additional parameters for the LLM call. This allows for fine-tuning the generation process.

```python
from lilac.concepts import Concept

concept = Concept(
    name="my-prompt-params-concept",
    description="My prompt-based concept with parameters",
    prompt="Extract entities from the following text: {text}",
    model_params={"temperature": 0.5, "max_tokens": 100},
)
```

--------------------------------

### Load Lilac Project from YAML

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_load.md

Explicitly loads the `lilac.yml` configuration without starting the web server. This allows you to apply changes made to the configuration file.

```python
ll.load(project_dir='~/my_lilac')
```

--------------------------------

### Load Lilac Project from CLI

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_load.md

Loads the `lilac.yml` configuration using the command-line interface. This is an alternative to loading via Python code.

```sh
lilac load --project_dir=~/my_lilac
```

--------------------------------

### Build Svelte production application

Source: https://github.com/databricks/lilac/blob/main/web/blueprint/README.md

Execute this command to generate a production-ready build of your Svelte application. This optimizes your code for deployment.

```bash
npm run build
```

--------------------------------

### Delete Rows by Filter in Python

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_delete_rows.md

Deletes multiple rows that match a specified filter. This example deletes rows where the number of characters in the 'text' field is less than 1000.

```python
dataset.delete_rows(
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ]
)
```
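
To make the filter tuple concrete, here is a minimal plain-Python sketch of what `(path, 'less', value)` selects. It does not call Lilac: the `rows` list, `get_path`, and `matches` are hypothetical, and the nested layout is a simplified assumption about how signal output sits under the field it enriches.

```python
from typing import Any

# Hypothetical rows, with signal output nested under the field it enriches.
rows = [
  {'id': 1, 'text': {'text_statistics': {'num_characters': 120}}},
  {'id': 2, 'text': {'text_statistics': {'num_characters': 4500}}},
  {'id': 3, 'text': {'text_statistics': {'num_characters': 999}}},
]


def get_path(row: dict, path: tuple[str, ...]) -> Any:
  """Walk a nested dict along `path`, one key per path component."""
  value: Any = row
  for key in path:
    value = value[key]
  return value


def matches(row: dict, path: tuple[str, ...], op: str, threshold: float) -> bool:
  """Evaluate a single filter; only the 'less' op is sketched here."""
  if op != 'less':
    raise ValueError(f'Unsupported op in this sketch: {op}')
  return get_path(row, path) < threshold


path = ('text', 'text_statistics', 'num_characters')
to_delete = [row['id'] for row in rows if matches(row, path, 'less', 1000)]
kept = [row['id'] for row in rows if not matches(row, path, 'less', 1000)]
print(to_delete, kept)
```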

--------------------------------

### Retrieve and Print Concept Information

Source: https://github.com/databricks/lilac/blob/main/notebooks/UsingConcepts.ipynb

Fetches a concept from the database by namespace and name, then prints its version and data. Asserts that the concept was found.

```python
llm_self_ref_concept = ll.concepts.db_concept.DISK_CONCEPT_DB.get(concept_namespace, concept_name)
assert llm_self_ref_concept is not None

print('concept version:', llm_self_ref_concept.version)
print('concept data:', llm_self_ref_concept.data.values())
```

--------------------------------

### Filter dataset rows before mapping

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_edit.md

Apply filters to a dataset before running a map function to process only a subset of rows. This example restricts the map operation to rows where the 'source' column equals 'bar' and, with `limit=1`, processes only one of the matching rows.

```python
items = [
    {'question': 'A', 'source': 'foo'},
    {'question': 'B', 'source': 'bar'},
    {'question': 'C', 'source': 'bar'}
]
dataset = ll.from_dicts('local', 'questions', items, overwrite=True)

result = dataset.map(
  lambda x: x['question'].lower(),
  filters=[ll.Filter(path=('source',), op='equals', value='bar')],
  limit=1)

print(list(result))
```
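
The interaction of `filters` and `limit` can be pictured in plain Python: the filter narrows the rows first, the limit caps how many matching rows are processed, and only then is the function applied. This sketch illustrates that ordering as an assumption about the semantics, not Lilac's implementation; `filter_limit_map` is a hypothetical helper.

```python
from itertools import islice
from typing import Any, Callable, Iterator, Optional


def filter_limit_map(
  rows: list[dict],
  fn: Callable[[dict], Any],
  predicate: Callable[[dict], bool],
  limit: Optional[int] = None,
) -> Iterator[Any]:
  """Apply `fn` to at most `limit` rows that satisfy `predicate`."""
  matching = (row for row in rows if predicate(row))
  # islice with a stop of None applies no cap.
  for row in islice(matching, limit):
    yield fn(row)


items = [
  {'question': 'A', 'source': 'foo'},
  {'question': 'B', 'source': 'bar'},
  {'question': 'C', 'source': 'bar'},
]

result = list(
  filter_limit_map(
    items,
    fn=lambda x: x['question'].lower(),
    predicate=lambda x: x['source'] == 'bar',
    limit=1,
  ))
print(result)  # ['b']
```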

--------------------------------

### Deploy Single Dataset with Configuration

Source: https://github.com/databricks/lilac/blob/main/notebooks/DeployToHuggingFace.ipynb

Deploy a single Lilac dataset to a HuggingFace space using a configuration object. This script automatically creates the space if it doesn't exist and can compute embeddings and signals on boot. By default, it creates a private space.

```python
ll.deploy_config(
  hf_space='nsthorat-lilac/nikhil-demo',
  # Create the space if it doesn't exist.
  create_space=True,
  config=ll.Config(datasets=[
    ll.DatasetConfig(
      namespace='local',
      name='glue_ax',
      source=ll.HuggingFaceSource(dataset_name='glue', config_name='ax'),
      # NOTE: Remove embeddings and signals if you just want to visualize the dataset without any
      # enrichments.
      embeddings=[
        # Compute gte-small over 'hypothesis'.
        ll.EmbeddingConfig(path='hypothesis', embedding='gte-small'),
      ],
      signals=[ll.SignalConfig(path='hypothesis', signal=ll.TextStatisticsSignal())])
  ]),
  # No persistent storage for HuggingFace. If you want to use persistent storage,
  # set this to 'small', 'medium', or 'large'.
  # NOTE: Persistent storage is not free. See https://huggingface.co/docs/hub/spaces-storage
  hf_space_storage=None)
```

--------------------------------

### Login to HuggingFace CLI

Source: https://github.com/databricks/lilac/blob/main/development.md

Authenticate with the HuggingFace CLI using Poetry. Follow the provided instructions to set up SSH keys for Git interaction.

```sh
poetry run huggingface-cli login
```

--------------------------------

### Custom Named Bins for Continuous Features

Source: https://github.com/databricks/lilac/blob/main/docs/datasets/dataset_query.md

Assign custom names to your bins for better readability. Provide a list of tuples, each containing a name followed by the bin's start and end values. Use `None` for an open-ended bound.

```python
groups = dataset.select_groups(
  leaf_path=('text', 'text_statistics', 'readability'),
  bins=[
    ('LOW', None, 100),
    ('MEDIUM', 100, 200),
    ('HIGH', 200, None)
  ]
)
print(groups)
```
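
As an illustration of how named bins partition a continuous value, the following plain-Python sketch assigns values to the bins above. It assumes the start of each range is inclusive, the end is exclusive, and `None` means unbounded; these boundary rules are assumptions for the sketch, so check them against Lilac's actual behavior. `assign_bin` is a hypothetical helper.

```python
from collections import Counter
from typing import Optional

Bin = tuple[str, Optional[float], Optional[float]]

bins: list[Bin] = [
  ('LOW', None, 100),
  ('MEDIUM', 100, 200),
  ('HIGH', 200, None),
]


def assign_bin(value: float, bins: list[Bin]) -> Optional[str]:
  """Return the name of the first bin whose [start, end) range contains `value`."""
  for name, start, end in bins:
    above_start = start is None or value >= start
    below_end = end is None or value < end
    if above_start and below_end:
      return name
  return None


# Made-up readability scores, used only to exercise the bins.
scores = [12.5, 99.9, 100, 150, 200, 340]
counts = Counter(assign_bin(score, bins) for score in scores)
print(counts)
```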

--------------------------------

### Initialize Firebase App and Analytics

Source: https://github.com/databricks/lilac/blob/main/docs/_templates/base.html

Import Firebase SDKs, configure your app with your Firebase project credentials, and initialize the Firebase app and analytics services. Ensure you have the correct SDK versions and your Firebase configuration details.

```javascript
import {initializeApp} from 'https://www.gstatic.com/firebasejs/10.1.0/firebase-app.js';
import {getAnalytics} from 'https://www.gstatic.com/firebasejs/10.1.0/firebase-analytics.js';

// TODO: Add SDKs for Firebase products that you want to use
// https://firebase.google.com/docs/web/setup#available-libraries

// Your web app's Firebase configuration
// For Firebase JS SDK v7.20.0 and later, measurementId is optional
const firebaseConfig = {
  apiKey: 'AIzaSyC_1E688jeyJ2wXdCIEPEulG3a4jrzKej8',
  authDomain: 'lilac-386213.firebaseapp.com',
  projectId: 'lilac-386213',
  storageBucket: 'lilac-386213.appspot.com',
  messagingSenderId: '279475920249',
  appId: '1:279475920249:web:4680f6f21f8baf900c63a8',
  measurementId: 'G-LX8JBKFTT3'
};

// Initialize Firebase
const app = initializeApp(firebaseConfig);
const analytics = getAnalytics(app);
```

--------------------------------

### Register Custom Embedding in Python

Source: https://github.com/databricks/lilac/blob/main/docs/embeddings/embeddings.md

Define and register a custom text embedding signal in Python. Implement `setup` for one-time initialization and `compute` for embedding generation using a provided `embed_fn` and `split_fn`.

```python
import lilac as ll


class MyEmbedding(ll.TextEmbeddingSignal):
  name = 'my_embedding'

  def setup(self):
    # Do your one-time setup here.
    pass

  def compute(self, docs):
    def embed_fn(texts: list[str]):
      # Compute your embedding matrix for the batch of text here. This returns a matrix with
      # dimensions [batch_size, embedding_dims].
      # `your_embedding` is a placeholder for your own model call.
      return your_embedding(texts)

    # Split the text, and compute embeddings for each split.
    yield from ll.compute_split_embeddings(
      docs=docs,
      batch_size=64,
      embed_fn=embed_fn,
      # Use the lilac chunk splitter.
      split_fn=ll.split_text,
      # How many batches to request as a single unit.
      num_parallel_requests=1)

ll.register_signal(MyEmbedding)
```
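
The `embed_fn` contract (a batch of texts in, one vector per text out) can be exercised without Lilac. The sketch below batches already-split chunks and calls a toy `embed_fn`; `toy_embed_fn` and `embed_in_batches` are hypothetical stand-ins, and the two-dimensional "embedding" exists only to make the shapes visible.

```python
from typing import Callable, Iterable, Iterator


def toy_embed_fn(texts: list[str]) -> list[list[float]]:
  """Return one fake 2-d vector per text: [character count, word count]."""
  return [[float(len(text)), float(len(text.split()))] for text in texts]


def embed_in_batches(
  chunks: Iterable[str],
  embed_fn: Callable[[list[str]], list[list[float]]],
  batch_size: int,
) -> Iterator[list[float]]:
  """Yield one vector per chunk, calling `embed_fn` on batches of `batch_size`."""
  batch: list[str] = []
  for chunk in chunks:
    batch.append(chunk)
    if len(batch) == batch_size:
      yield from embed_fn(batch)
      batch = []
  if batch:  # Flush the final, possibly smaller, batch.
    yield from embed_fn(batch)


chunks = ['hello world', 'lilac', 'custom embeddings are signals']
vectors = list(embed_in_batches(chunks, toy_embed_fn, batch_size=2))
print(vectors)
```

Each output row lines up with one input chunk, which is the shape `[batch_size, embedding_dims]` described in the comment on `embed_fn`.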

--------------------------------

### Debug Lilac Server with PDB

Source: https://github.com/databricks/lilac/blob/main/development.md

Execute this command to start the Lilac webserver in single-threaded mode, enabling PDB breakpoints for debugging. You can trigger breakpoints by interacting with the Lilac UI or replaying cURL commands.

```sh
./run_server_pdb.sh
```

--------------------------------

### Add Label Based on Filter

Source: https://github.com/databricks/lilac/blob/main/README.md

Add labels to data points based on specified filters. This example adds the label 'short' to all text entries with fewer than 1000 characters, utilizing a field generated by the `text_statistics` signal.

```python
dataset.add_labels(
  'short',
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ]
)
```

--------------------------------

### Build Custom Docker Image

Source: https://github.com/databricks/lilac/blob/main/docs/getting_started/installation.md

Build a custom Docker image for Lilac from the root of the repository.

```bash
docker build -t lilac .
```

--------------------------------

### Compute Concept Score for Text with Python

Source: https://github.com/databricks/lilac/blob/main/docs/concepts/concept_use.md

Use ConceptSignal to compute a concept score for a given text input. The 'gte-small' embedding runs on-device. The output includes start and end positions of matching spans and their scores.

```python
import lilac as ll

signal = ll.signals.ConceptSignal(
  namespace='lilac',
  concept_name='positive-sentiment',
  embedding='gte-small')

# Signals take an iterable of inputs, and return a list of items that match the shape of the input.
result = list(signal.compute(['This product is amazing, thank you!']))

print(result)
```