### Install ExllamaV2 Environment

Source: https://github.com/lm-sys/fastchat/blob/main/docs/exllama_v2.md

Commands to clone the ExllamaV2 repository and install the necessary Python dependencies for the custom kernel.

```bash
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -e .
```

--------------------------------

### Install xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Installs the xFasterTransformer library using pip. Ensure your environment is set up according to the xFasterTransformer documentation.

```bash
pip install xfastertransformer
```

--------------------------------

### Install AWQ and FastChat Environment

Source: https://github.com/lm-sys/fastchat/blob/main/docs/awq.md

Sets up a Conda environment, installs FastChat from the repository root, clones the AWQ repository, and compiles the CUDA kernels needed for 4-bit inference.

```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
pip install --upgrade pip
pip install -e .
git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .
cd awq/kernels
python setup.py install
```

--------------------------------

### Download Text File for LangChain Example

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Downloads a sample text file from a URL using wget. This file will be used as input for the LangChain question-answering example.

```bash
wget https://raw.githubusercontent.com/hwchase17/langchain/v0.0.200/docs/modules/state_of_the_union.txt
```

--------------------------------

### Install Training Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Installs the dependencies needed to train FastChat models. Run the command from the root of a FastChat repository checkout, since it performs an editable install of the current directory with the `train` extra.

```bash
pip3 install -e ".[train]"
```

--------------------------------

### vLLM Worker for High-Throughput Serving

Source: https://context7.com/lm-sys/fastchat/llms.txt

Shows how to start a vLLM worker for high-throughput inference, leveraging continuous batching and PagedAttention. Examples include basic setup, tensor parallelism, and custom GPU memory utilization.

```bash
# Start vLLM worker
python3 -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002

# vLLM worker with tensor parallelism
python3 -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --tensor-parallel-size 4

# vLLM worker with custom GPU memory utilization
python3 -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --gpu-memory-utilization 0.85
```

--------------------------------

### Starting the OpenAI-Compatible API Server

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts the API server that provides OpenAI-compatible endpoints for chat completions, text completions, and embeddings.

```APIDOC
## Starting the OpenAI-Compatible API Server

### Description
Starts the API server that provides OpenAI-compatible endpoints for chat completions, text completions, and embeddings. This allows integration with tools and SDKs designed for OpenAI's API.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
#### Command Line Arguments
- `--host` (string) - Optional - The host address to bind the API server to. Use `0.0.0.0` to make it accessible externally. Defaults to localhost.
- `--port` (integer) - Optional - The port number for the API server. Defaults to 8000.

### Request Example
```bash
# Start the API server on localhost, port 8000
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Full deployment example (run in separate terminals):
# Terminal 1: Controller
# python3 -m fastchat.serve.controller

# Terminal 2: Model Worker
# python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5

# Terminal 3: API Server
# python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

### Response
N/A (Server Output)
```
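
Because the full deployment runs three processes in separate terminals, it can help to check which of them are already accepting connections. The following is a minimal sketch using only the Python standard library; the helper names `is_port_open` and `report` are hypothetical, and the ports are the ones used in this document (21001 for the controller, 8000 for the API server).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the commands in this document: 21001 is the controller
# default, 8000 is the API server port used in the example above.
SERVICES = {"controller": 21001, "openai_api_server": 8000}


def report(host: str = "localhost") -> dict:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

Calling `report()` after starting each terminal shows which services are reachable. An open port only indicates that the process is listening, not that a model has finished loading.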

--------------------------------

### Install FastChat and Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/webserver.md

Installs necessary system packages, Anaconda for environment management, and clones the FastChat repository. It then creates and activates a dedicated Conda environment and installs FastChat using pip.

```bash
sudo apt update
sudo apt install tmux htop

wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh

conda create -n fastchat python=3.9
conda activate fastchat

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install -e .
```

--------------------------------

### Launch FastChat Serving Components

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Commands to start the controller and a model worker, send a test message to verify the worker, and start the Gradio web server. The controller, worker, and web server are long-running processes, so run each in its own terminal.

```bash
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5
python3 -m fastchat.serve.gradio_web_server
```

--------------------------------

### Configure and Start Model Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/exllama_v2.md

Commands to download a quantized model and start the FastChat model worker, including advanced configurations like sequence length and multi-GPU memory allocation.

```bash
# Download quantized model
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

# Start worker with default config
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --enable-exllama

# Start worker with custom sequence length and GPU split
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --enable-exllama \
    --exllama-max-seq-len 2048 \
    --exllama-gpu-split 18,24
```

--------------------------------

### Launch FastChat Web Server with API Registration

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

Command-line instruction to start the Gradio web server while registering custom API endpoints defined in a JSON configuration file.

```bash
python3 -m fastchat.serve.gradio_web_server --controller "" --share --register api_endpoints.json
```

--------------------------------

### Install DashInfer Package

Source: https://github.com/lm-sys/fastchat/blob/main/docs/dashinfer_integration.md

Installs the DashInfer Python package using pip. This is the first step to enable DashInfer's capabilities within your environment.

```bash
pip install dashinfer
```

--------------------------------

### cURL Examples

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Examples of how to use cURL to interact with the FastChat API server for various endpoints.

```APIDOC
## cURL Examples

List Models:

```bash
curl http://localhost:8000/v1/models
```

Chat Completions:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
```

Text Completions:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'
```

Embeddings:

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello world!"
  }'
```
```
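
The chat completions call above can also be made without cURL. This is a minimal sketch using only the Python standard library; `build_chat_request` and `send` are hypothetical helper names, and the base URL is assumed to match the cURL examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same address as the cURL examples


def build_chat_request(model: str, content: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request("vicuna-7b-v1.5", "Hello! What is your name?")
# With the API server running: reply = send(request)
```

Separating request construction from sending makes the payload easy to inspect before any network traffic occurs.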

--------------------------------

### Install FastChat and Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/llm_judge/README.md

Clones the FastChat repository and installs the necessary packages for model workers and LLM evaluation.

```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
```

--------------------------------

### Install Nginx on Debian/Ubuntu

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/serve/gateway/README.md

Installs the Nginx web server on Debian-based systems like Ubuntu using the apt package manager. This is a prerequisite for deploying the Nginx gateway.

```bash
sudo apt update
sudo apt install nginx
```

--------------------------------

### Launch vLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/llm_judge/README.md

Starts a vLLM server instance to serve model weights. This backend provides faster inference for supported models.

```bash
vllm serve [MODEL-PATH] --dtype auto
```

--------------------------------

### Install vLLM Dependency

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vllm_integration.md

Installs the vLLM library required for running the optimized model worker in FastChat.

```bash
pip install vllm
```

--------------------------------

### Fine-Tuning Vicuna Models

Source: https://context7.com/lm-sys/fastchat/llms.txt

Provides a command to fine-tune models using the supervised fine-tuning script, specifically for Vicuna. This example uses FSDP and details various training parameters such as model path, data path, epochs, batch size, learning rate, and FSDP configurations.

```bash
# Fine-tune Vicuna with FSDP
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ./data/conversations.json \
    --bf16 True \
    --output_dir ./output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --learning_rate 2e-5 \
    --weight_decay 0.0 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```
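
The batch-related flags interact: the number of sequences that contribute to one optimizer step is the product of the process count, the per-device batch size, and the gradient accumulation steps. A small worked example, assuming one data-parallel process per GPU (the function name is illustrative):

```python
def effective_batch_size(num_processes: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Sequences contributing to one optimizer step across all processes."""
    return num_processes * per_device_batch_size * grad_accum_steps


# Values from the torchrun command above:
# --nproc_per_node=4, --per_device_train_batch_size 2, --gradient_accumulation_steps 8
batch = effective_batch_size(num_processes=4, per_device_batch_size=2, grad_accum_steps=8)
print(batch)  # 64
```

When changing the GPU count, adjusting `--gradient_accumulation_steps` so the product stays the same keeps the effective batch size unchanged.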

--------------------------------

### Multi-Model Worker Configuration

Source: https://context7.com/lm-sys/fastchat/llms.txt

Demonstrates how to serve multiple models from a single worker process, which is particularly useful for PEFT/LoRA models that share base weights. The example shows how to specify multiple model paths and names.

```bash
# Serve multiple models in one process
python3 -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-names vicuna-7b-v1.5 \
    --model-path lmsys/longchat-7b-16k \
    --model-names longchat-7b-16k
```
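
When the model list is long, the repeated `--model-path` / `--model-names` pairs can be generated rather than typed. This is a minimal sketch; the helper `multi_model_worker_argv` is hypothetical and uses only the flags shown in the command above.

```python
import shlex


def multi_model_worker_argv(models: dict) -> list:
    """Build argv for fastchat.serve.multi_model_worker from {model_path: model_name}."""
    argv = ["python3", "-m", "fastchat.serve.multi_model_worker"]
    for model_path, model_name in models.items():
        # Each model contributes one --model-path / --model-names pair, in order.
        argv += ["--model-path", model_path, "--model-names", model_name]
    return argv


argv = multi_model_worker_argv({
    "lmsys/vicuna-7b-v1.5": "vicuna-7b-v1.5",
    "lmsys/longchat-7b-16k": "longchat-7b-16k",
})
print(shlex.join(argv))  # shell-quoted command line, ready to paste
```

The resulting list can be passed to `subprocess.Popen(argv)` to launch the worker from a script.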

--------------------------------

### Launch Quantized LightLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/lightllm_integration.md

Starts the LightLLM worker with support for quantized weights and KV cache. This is useful for reducing memory footprint during inference.

```bash
python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
```

--------------------------------

### Run Vicuna-7B on Ascend NPU with FastChat CLI

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to enable acceleration on Ascend NPUs. Requires installation of the Ascend PyTorch Adapter and setting CANN environment variables.

```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --device npu
```

--------------------------------

### Vicuna Prompt Template (v1.1, v1.3, v1.5)

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vicuna_weights_version.md

This is an example of the prompt template used for Vicuna weights versions 1.1, 1.3, and 1.5. It demonstrates the conversational structure between a user and an AI assistant, using 'USER:' and 'ASSISTANT:' tags, and the EOS token '</s>' for separation.

```text
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: Hello!
ASSISTANT: Hello!</s>
USER: How are you?
ASSISTANT: I am good.</s>
```
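
The template can be reproduced programmatically. The sketch below rebuilds the exact layout of the example above from a list of turns; `build_vicuna_v1_prompt` is a hypothetical helper written for illustration, and FastChat's own conversation code may join turns with different separators.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_vicuna_v1_prompt(turns, system: str = SYSTEM, eos: str = "</s>") -> str:
    """Render (user_text, assistant_text) pairs in the layout shown above.

    Pass assistant_text=None for the final turn to leave the prompt open
    for the model to complete.
    """
    lines = [system, ""]
    for user_text, assistant_text in turns:
        lines.append(f"USER: {user_text}")
        if assistant_text is None:
            lines.append("ASSISTANT:")
        else:
            # Completed assistant turns are terminated with the EOS token.
            lines.append(f"ASSISTANT: {assistant_text}{eos}")
    return "\n".join(lines)


prompt = build_vicuna_v1_prompt([("Hello!", "Hello!"), ("How are you?", "I am good.")])
print(prompt)
```

For inference, the last turn would typically be passed as `("How are you?", None)` so the prompt ends with `ASSISTANT:`.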

--------------------------------

### Starting a Model Worker

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts a model worker that loads and serves LLM models. Each worker registers with the controller and handles inference requests for its loaded model.

```APIDOC
## Starting a Model Worker

### Description
Starts a model worker that loads and serves LLM models. Each worker registers with the controller and handles inference requests for the loaded model.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
#### Command Line Arguments
- `--model-path` (string) - Required - The path to the pre-trained model weights (e.g., Hugging Face model ID).
- `--controller-address` (string) - Optional - The address of the controller. Defaults to http://localhost:21001.
- `--worker-address` (string) - Optional - The address at which the controller can reach this worker. Defaults to http://localhost:21002.
- `--load-8bit` (boolean) - Optional - Whether to load the model in 8-bit precision to reduce memory usage.
- `--num-gpus` (integer) - Optional - The number of GPUs to split the model across (model parallelism). If not specified, defaults to 1.
- `--model-names` (string) - Optional - A comma-separated list of model names to expose via the API. Useful for API compatibility.

### Request Example
```bash
# Start a model worker with Vicuna model
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002

# Start with 8-bit quantization
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-13b-v1.5 \
    --load-8bit

# Start with the model split across multiple GPUs
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4

# Start with custom model names for API compatibility
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-names "gpt-3.5-turbo,vicuna-7b"
```

### Response
N/A (Server Output)
```

--------------------------------

### Run Vicuna-7B on Intel XPU with FastChat CLI

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to enable GPU acceleration on Intel Data Center and Arc A-Series GPUs using the XPU device. Requires installation of Intel Extension for PyTorch and setting OneAPI environment variables.

```bash
source /opt/intel/oneapi/setvars.sh
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --device xpu
```

--------------------------------

### Start Model Worker with xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Starts the FastChat model worker process with xFasterTransformer enabled. Allows configuration of data types and optimized CPU utilization using numactl and MPI.

```bash
# Load model with default configuration (max sequence length 4096, no GPU split setting).
python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 
```

```bash
# Run inference on NUMA node 0 with data type bf16_fp16 (the first token uses bfloat16; the remaining tokens use float16)
numactl -N 0  --localalloc python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 
```

```bash
# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16 (the first token uses bfloat16; the remaining tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0  --localalloc  python -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
-n 1 numactl -N 1  --localalloc  python -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16
```

--------------------------------

### Install MLX for FastChat

Source: https://github.com/lm-sys/fastchat/blob/main/docs/mlx_integration.md

Installs the MLX-LM library, which is required for using MLX as an optimized worker in FastChat. This command ensures you have the necessary package for MLX integration.

```bash
pip install "mlx-lm>=0.0.6"
```

--------------------------------

### Install FastChat Data Cleaning Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/data_cleaning.md

Installs the necessary Python packages for HTML parsing, markdown conversion, and language detection required for data preprocessing.

```bash
pip3 install bs4 markdownify
pip3 install polyglot pyicu pycld2
```

--------------------------------

### Install Nginx on Red Hat/CentOS/Fedora

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/serve/gateway/README.md

Installs the Nginx web server on Red Hat-based systems like CentOS and Fedora using the yum package manager. This is a prerequisite for deploying the Nginx gateway.

```bash
sudo yum install epel-release
sudo yum install nginx
```

--------------------------------

### Check Server Launch Times

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/webserver.md

This loop scans the logs of Gradio web server instances 0 through 11 and prints the last 'Running on local URL' entry from each, which marks that instance's most recent launch.

```bash
for i in $(seq 0 11); do cat fastchat_logs/server$i/gradio_web_server.log | grep "Running on local URL" | tail -n 1; done
```
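
The same check can be scripted in Python when the results need further processing. This is a minimal sketch, assuming the `fastchat_logs/server<i>/gradio_web_server.log` layout used in the shell loop above; the function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

MARKER = "Running on local URL"


def last_launch_line(log_text: str) -> Optional[str]:
    """Return the last line containing MARKER, or None if there is none."""
    matches = [line for line in log_text.splitlines() if MARKER in line]
    return matches[-1] if matches else None


def launch_lines(log_root: str = "fastchat_logs", count: int = 12) -> dict:
    """Map server index -> last launch line for server0..server{count-1}."""
    results = {}
    for i in range(count):
        log_file = Path(log_root) / f"server{i}" / "gradio_web_server.log"
        if log_file.exists():  # skip instances that have not written a log yet
            results[i] = last_launch_line(log_file.read_text(errors="replace"))
    return results
```

Unlike the shell loop, missing log files are skipped silently instead of producing an error message.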

--------------------------------

### Run CLI with RWKV-4-Raven model

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

This command shows how to run the FastChat CLI with the BlinkDL/RWKV-4-Raven model. Ensure Python 3 and FastChat are installed. The model path should point to the downloaded weights.

```bash
python3 -m fastchat.serve.cli --model-path ~/model_weights/RWKV-4-Raven-7B-v11x-Eng99%-Other1%-20230429-ctx8192.pth
```

--------------------------------

### Launch FastChat Controller

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Starts the FastChat controller, which manages workers and routes requests. This is the first step in setting up the FastChat API server.

```bash
python3 -m fastchat.serve.controller
```

--------------------------------

### Vicuna Prompt Template (v0)

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vicuna_weights_version.md

This is an example of the prompt template used for Vicuna weights version 0. It uses a different format with '### Human:' and '### Assistant:' tags for structuring the conversation.

```text
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: Hello!
### Assistant: Hello!
### Human: How are you?
### Assistant: I am good.
```

--------------------------------

### CLI Chat Interface for Models

Source: https://context7.com/lm-sys/fastchat/llms.txt

Provides command-line interface examples to interact with models directly in the terminal. Supports streaming output, custom conversation templates, 8-bit quantization, and multi-GPU configurations. Includes special commands for chat management.

```bash
# Start CLI chat with a model
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5

# CLI with custom conversation template
python3 -m fastchat.serve.cli \
    --model-path lmsys/vicuna-7b-v1.5 \
    --conv-template vicuna_v1.1

# CLI with 8-bit quantization
python3 -m fastchat.serve.cli \
    --model-path lmsys/vicuna-13b-v1.5 \
    --load-8bit

# CLI with multiple GPUs
python3 -m fastchat.serve.cli \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 2

# CLI special commands:
# !!exit - Exit the chat
# !!reset - Reset conversation
# !!remove - Remove last message
# !!regen - Regenerate last response
# !!save <filename> - Save conversation
# !!load <filename> - Load conversation
```

--------------------------------

### OpenAI Python SDK Integration

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Example of how to interact with the FastChat API server using the OpenAI Python SDK (version >= 1.0).

```APIDOC
## OpenAI Python SDK Integration

Install the OpenAI Python package:

```bash
pip install --upgrade openai
```

Interact with the API server:

```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

model = "vicuna-7b-v1.5"
prompt = "Once upon a time"

# Create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
print(prompt + completion.choices[0].text)

# Create a chat completion
completion = openai.chat.completions.create(
  model=model,
  messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
print(completion.choices[0].message.content)
```
```

--------------------------------

### Launch Gradio Web Server for Vision Arena

Source: https://github.com/lm-sys/fastchat/blob/main/docs/arena.md

Starts the Gradio web server with specific flags for the vision arena. It can register API endpoints, enable remote storage, and load random questions for VQA.

```bash
python3 -m fastchat.serve.gradio_web_server_multi --share --register-api-endpoint-file api_endpoints.json --vision-arena --use-remote-storage --random-questions metadata_sampled.json
```

--------------------------------

### Chat with FastChat-T5-3B using FastChat CLI

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to start a chat with the FastChat-T5-3B model. The model weights are automatically fetched from Hugging Face.

```bash
python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0
```

--------------------------------

### Generate VQA Examples Directory

Source: https://github.com/lm-sys/fastchat/blob/main/docs/arena.md

A Python script to generate a random questions file by downloading images from various VQA datasets. This is used for the vision arena to provide sample questions.

```bash
python fastchat/serve/vision/create_vqa_examples_dir.py
```

--------------------------------

### Launch FastChat OpenAI API Server

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Starts the RESTful API server that is compatible with OpenAI's API. This server allows applications like LangChain to interact with the locally hosted models through a familiar OpenAI interface.

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

--------------------------------

### Train Vicuna-7B with xFormers

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

This command is used to train Vicuna-7B when xFormers is installed, offering memory-efficient attention. It replaces the standard training script with one optimized for xFormers.

```bash
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_xformers.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/dummy_conversation.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```

--------------------------------

### LangChain Question Answering Example

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

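Demonstrates how to use LangChain with a local LLM via the FastChat API. It loads a text document, creates an index with embeddings, and then queries the index using a language model to answer questions. The OpenAI model names in the code are aliases: the model worker must expose them through `--model-names`, and the `OPENAI_API_BASE` and `OPENAI_API_KEY` environment variables must point the OpenAI client at the FastChat server (for example `http://localhost:8000/v1` and `EMPTY`).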

```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
loader = TextLoader("state_of_the_union.txt")
index = VectorstoreIndexCreator(embedding=embedding).from_loaders([loader])
llm = ChatOpenAI(model="gpt-3.5-turbo")

questions = [
    "Who is the speaker",
    "What did the president say about Ketanji Brown Jackson",
    "What are the threats to America",
    "Who are mentioned in the speech",
    "Who is the vice president",
    "How many projects were announced",
]

for query in questions:
    print("Query:", query)
    print("Answer:", index.query(query, llm=llm))
```

--------------------------------

### Launch FastChat Servers and Workers

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/webserver.md

Initiates the FastChat controller, registers a Hugging Face API worker, and starts the Gradio web server with support for multiple models (ChatGPT, Claude, PaLM). It also includes environment variable exports for API keys and project IDs.

```bash
cd fastchat_logs/controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
python3 -m fastchat.serve.register_worker --controller http://localhost:21001 --worker-name https://
python3 -m fastchat.serve.test_message --model vicuna-13b --controller http://localhost:21001

cd fastchat_logs/server0

python3 -m fastchat.serve.huggingface_api_worker --model-info-file ~/elo_results/register_hf_api_models.json

export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
export GCP_PROJECT_ID=

python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 50 --add-chatgpt --add-claude --add-palm --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms

python3 backup_logs.py
```

--------------------------------

### Launch Multi-Model Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/local_cluster.md

Starts a multi-model worker capable of serving multiple models simultaneously. This is useful for efficient resource utilization, especially when dealing with models of varying sizes or types. It specifies the models, controller address, host, port, and worker address.

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.multi_model_worker --model-path ~/model_weights/RWKV-4-Raven-14B-v12-Eng98%25-Other2%25-20230523-ctx8192.pth --model-name RWKV-4-Raven-14B --model-path lmsys/fastchat-t5-3b-v1.0 --model-name fastchat-t5-3b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker http://$(hostname):31000 --limit 4
```

```bash
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.multi_model_worker --model-path OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 --model-name oasst-pythia-12b --model-path mosaicml/mpt-7b-chat --model-name mpt-7b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker http://$(hostname):31001 --limit 4
```

```bash
CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.multi_model_worker --model-path lmsys/vicuna-7b-v1.5 --model-name vicuna-7b --model-path THUDM/chatglm-6b --model-name chatglm-6b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker http://$(hostname):31002 --limit 4
```

--------------------------------

### Starting the Controller

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts the FastChat controller, which manages distributed model workers and routes requests. It's the central coordinator for the serving infrastructure.

```APIDOC
## Starting the Controller

### Description
Starts the FastChat controller, which manages distributed model workers and routes requests. It is the central coordinator for the FastChat serving infrastructure.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
#### Command Line Arguments
- `--host` (string) - Optional - The host address to bind the controller to. Defaults to localhost.
- `--port` (integer) - Optional - The port number for the controller. Defaults to 21001.

### Request Example
```bash
python3 -m fastchat.serve.controller

# With custom host and port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
```

### Response
N/A (Server Output)
```

--------------------------------

### Launch Multi-Model and Advanced Serving

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Commands for launching the multi-model Gradio interface and configuring multiple workers on specific GPUs for scalability.

```bash
python3 -m fastchat.serve.gradio_web_server_multi --register-api-endpoint-file api_endpoint.json

# worker 0
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5 --controller http://localhost:21001 --port 31000 --worker http://localhost:31000
# worker 1
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker --model-path lmsys/fastchat-t5-3b-v1.0 --controller http://localhost:21001 --port 31001 --worker http://localhost:31001

python3 -m fastchat.serve.gradio_web_server_multi
```

--------------------------------

### Download Benchmark Dataset

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/serve/monitor/classify/README.md

Uses git-lfs to clone the category benchmark dataset from Hugging Face and prepares the local directory structure for evaluation.

```console
git clone https://huggingface.co/datasets/lmarena-ai/categories-benchmark-eval
cp -r categories-benchmark-eval/label_bench .
```

--------------------------------

### Launch Gradio Web Servers for FastChat

Source: https://context7.com/lm-sys/fastchat/llms.txt

These commands launch Gradio web servers for interacting with FastChat models. Options include a single-model server, a multi-model server for side-by-side comparisons, and a server with vision capabilities for image support, which can also register API endpoints.

```bash
python3 -m fastchat.serve.gradio_web_server --share
```

```bash
python3 -m fastchat.serve.gradio_web_server_multi --share
```

```bash
python3 -m fastchat.serve.gradio_web_server_multi \
    --share \
    --vision-arena \
    --register-api-endpoint-file api_endpoints.json
```

--------------------------------

### Start FastChat API Server Thread

Source: https://github.com/lm-sys/fastchat/blob/main/playground/FastChat_API_GoogleColab.ipynb

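This Python snippet launches the FastChat OpenAI-compatible API server as a subprocess from a background thread, so the notebook cell returns immediately. It assumes the controller and a model worker are already running. The server will be accessible at http://127.0.0.1:8000/v1/.

```python
import subprocess
import threading

def run_api_server():
    # Launch the OpenAI-compatible API server; this call blocks until the server exits.
    subprocess.run(
        ["python3", "-m", "fastchat.serve.openai_api_server",
         "--host", "127.0.0.1", "--port", "8000"]
    )

# Run the server in a background thread so the notebook stays responsive.
api_server_thread = threading.Thread(target=run_api_server)
api_server_thread.start()
```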

--------------------------------

### Fine-Tune Models with QLoRA using DeepSpeed

Source: https://context7.com/lm-sys/fastchat/llms.txt

This script demonstrates how to fine-tune a model using QLoRA for memory-efficient training with 4-bit quantization. It utilizes DeepSpeed for distributed training and requires specifying model paths, LoRA parameters, data paths, and training configurations.

```bash
deepspeed fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ./data/conversations.json \
    --bf16 True \
    --output_dir ./checkpoints_lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json
```

--------------------------------

### Clone FastChat and Install Dependencies (Python)

Source: https://github.com/lm-sys/fastchat/blob/main/playground/FastChat_API_GoogleColab.ipynb

This snippet clones the FastChat repository from GitHub and installs the necessary dependencies for running the model worker and web UI. It's designed to be executed in a Google Colab environment.

```python
%cd /content/

# clone FastChat
!git clone https://github.com/lm-sys/FastChat.git

# install dependencies
%cd FastChat
!python3 -m pip install -e ".[model_worker,webui]" --quiet
```

--------------------------------

### GET /v1/models

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Retrieves a list of available models that the API server supports.

```APIDOC
## GET /v1/models

### Description
Lists all the models available through the API server.

### Method
GET

### Endpoint
/v1/models

### Parameters
None

### Response
#### Success Response (200)
- **object** (string) - Type of object returned, e.g., 'list'.
- **data** (array) - A list of model objects.
  - **id** (string) - The unique identifier for the model.
  - **object** (string) - Type of object, e.g., 'model'.
  - **created** (integer) - Unix timestamp of when the model was created.
  - **owned_by** (string) - The owner of the model.

#### Response Example
~~~json
{
  "object": "list",
  "data": [
    {
      "id": "vicuna-7b-v1.5",
      "object": "model",
      "created": 1677652288,
      "owned_by": "lmsys"
    }
  ]
}
~~~
```
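
As a rough illustration of consuming this response, the sketch below pulls the model identifiers out of a payload with the shape documented above. `extract_model_ids` is a hypothetical helper name, and the sample body is the response example from this section rather than live server output.

```python
import json
from typing import List


def extract_model_ids(payload: dict) -> List[str]:
    """Return the `id` of every model entry in a /v1/models-style response."""
    if payload.get("object") != "list":
        raise ValueError("expected a response whose 'object' field is 'list'")
    return [
        entry["id"]
        for entry in payload.get("data", [])
        if entry.get("object") == "model"
    ]


# Sample body taken from the response example in this section.
sample = json.loads(
    """
    {
      "object": "list",
      "data": [
        {
          "id": "vicuna-7b-v1.5",
          "object": "model",
          "created": 1677652288,
          "owned_by": "lmsys"
        }
      ]
    }
    """
)

print(extract_model_ids(sample))  # ['vicuna-7b-v1.5']
```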

--------------------------------

### Prepare Models for xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Converts models to a format compatible with xFasterTransformer. The script takes an input directory (`-i`, here `${HF_DATASET_DIR}`) and an output directory (`-o`, here `${OUTPUT_DIR}`).

```bash
python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
```

--------------------------------

### Launch LightLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/lightllm_integration.md

Starts the LightLLM worker process for a specified model. This replaces the standard fastchat.serve.model_worker and requires configuration for total token capacity.

```bash
python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000
```

--------------------------------

### Run AWQ Quantized Model via CLI

Source: https://github.com/lm-sys/fastchat/blob/main/docs/awq.md

Downloads a quantized model using git-lfs and runs the FastChat CLI with the AWQ parameters for 4-bit inference.

```bash
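git lfs install
# Clone into models/ so the directory matches the --model-path used below
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq
python3 -m fastchat.serve.cli --model-path models/vicuna-7b-v1.3-4bit-g128-awq --awq-wbits 4 --awq-groupsize 128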
```

--------------------------------

### Run CLI with Llama-2-7b-chat-hf

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

This command launches the FastChat command-line interface (CLI) to interact with the meta-llama/Llama-2-7b-chat-hf model. It requires Python 3 and the FastChat package to be installed.

```bash
python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf
```

--------------------------------

### Train Vicuna-7B with Local GPUs

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to initiate the training of Vicuna-7B using PyTorch distributed training on multiple GPUs. It requires specifying model and data paths, and configures various training parameters like batch size, learning rate, and saving strategies.

```bash
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/dummy_conversation.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```
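
The batch-related flags combine into one effective (global) batch size per optimizer step. The sketch below works through that arithmetic for the values in the command above, assuming standard data-parallel training in which every process contributes one micro-batch per accumulation step. `effective_batch_size` is a hypothetical helper, not a FastChat function.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_processes: int) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_processes


# Values from the command above:
#   --per_device_train_batch_size 2, --gradient_accumulation_steps 16, --nproc_per_node=4
print(effective_batch_size(2, 16, 4))  # 128
```

When the GPU count changes, adjusting `--gradient_accumulation_steps` so that this product stays constant keeps the optimization schedule comparable.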

--------------------------------

### Run FastChat CLI with Exllama

Source: https://github.com/lm-sys/fastchat/blob/main/docs/exllama_v2.md

Runs the FastChat CLI with a GPTQ model path and the Exllama backend enabled.

```bash
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --enable-exllama
```

--------------------------------

### Initialize FastChat Controller

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts the central controller service responsible for managing model workers and routing requests. It can be configured with custom host and port settings.

```bash
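# Start the controller with the default settings (localhost:21001)
python3 -m fastchat.serve.controller
# Or bind to all interfaces with an explicit port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001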
```

--------------------------------

### Download and View MT-Bench Data

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/llm_judge/README.md

Downloads pre-generated model answers and judgments, and launches a local QA browser to inspect the results.

```bash
python3 download_mt_bench_pregenerated.py
python3 qa_browser.py --share
```

--------------------------------

### Launch FastChat vLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vllm_integration.md

Commands to start the FastChat vLLM worker. Includes variations for standard models, custom tokenizers, and AWQ quantized models.

```bash
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5
```

```bash
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer hf-internal-testing/llama-tokenizer
```

```bash
python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
```

--------------------------------

### Programmatic Model Loading and Inference

Source: https://context7.com/lm-sys/fastchat/llms.txt

Demonstrates how to load models and conversation templates programmatically for custom inference pipelines. It shows model loading, setting up a conversation, generating a prompt, and producing a response.

```python
from fastchat.model import load_model, get_conversation_template
import torch

# Load model and tokenizer
model, tokenizer = load_model(
    model_path="lmsys/vicuna-7b-v1.5",
    device="cuda",
    num_gpus=1,
    load_8bit=False,
    dtype=torch.float16
)

# Get conversation template for the model
conv = get_conversation_template("vicuna-7b-v1.5")
conv.append_message(conv.roles[0], "What is artificial intelligence?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True
)
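# Decode only the newly generated tokens; for decoder-only models the output also contains the prompt
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)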
print(response)
```

--------------------------------

### Launch FastChat Controller

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/local_cluster.md

Starts the FastChat controller, which manages distributed workers. It listens on a specified host and port. Ensure this is run on a central node (e.g., node-01).

```bash
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 10002
```

--------------------------------

### Run Vicuna-7B with 8-bit Compression for Reduced Memory

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to enable 8-bit compression for the Vicuna-7B model, significantly reducing memory usage with a slight impact on quality. Compatible with CPU, GPU, and Metal backends.

```bash
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --load-8bit
```

--------------------------------

### Run CLI Inference with xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Launches the FastChat CLI for inference using xFasterTransformer. Supports various configurations for CPU usage and data types like fp16 and bf16_fp16.

```bash
# Run inference on all CPUs using float16
python3 -m fastchat.serve.cli \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype fp16
```

```bash
# Run inference on NUMA node 0 with data type bf16_fp16 (the first token uses bfloat16; the remaining tokens use float16)
numactl -N 0  --localalloc \
python3 -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
```

```bash
# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16 (the first token uses bfloat16; the remaining tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0  --localalloc \
python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
-n 1 numactl -N 1  --localalloc \
python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
```

--------------------------------

### Test FastChat Message Interface

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/local_cluster.md

Runs a test command to interact with the FastChat system, specifically targeting the 'vicuna-13b' model through the controller at localhost:10002. This is useful for verifying the setup and basic functionality.

```bash
python3 -m fastchat.serve.test_message --model vicuna-13b --controller http://localhost:10002
```

--------------------------------

### Generate Embeddings using curl

Source: https://github.com/lm-sys/fastchat/blob/main/playground/FastChat_API_GoogleColab.ipynb

This example demonstrates how to generate embeddings for a given text input using curl and the FastChat API. It targets the /v1/embeddings endpoint and specifies the model and the input text.

```bash
curl http://127.0.0.1:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello, can you tell me a joke for me?"
  }'
```
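
Embedding vectors are usually compared rather than read directly. The sketch below shows one common comparison, cosine similarity, using only the standard library. It assumes the response follows the OpenAI-style shape in which each vector sits under `data[i].embedding`. Both helper names are hypothetical, and the vectors are toy values, not real model output.

```python
import math
from typing import List, Sequence


def first_embedding(response: dict) -> List[float]:
    """Extract the first vector from an OpenAI-style embeddings response (assumed shape)."""
    return response["data"][0]["embedding"]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy responses standing in for two calls to /v1/embeddings.
resp_a = {"data": [{"embedding": [1.0, 0.0]}]}
resp_b = {"data": [{"embedding": [0.0, 1.0]}]}

print(cosine_similarity(first_embedding(resp_a), first_embedding(resp_a)))  # 1.0
print(cosine_similarity(first_embedding(resp_a), first_embedding(resp_b)))  # 0.0
```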

--------------------------------

### Interact with FastChat API using cURL

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Examples of using cURL commands to interact with the FastChat OpenAI-compatible API server for various endpoints like listing models, chat completions, text completions, and embeddings.

```bash
# List the available models
curl http://localhost:8000/v1/models

# Chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'

# Text completions
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'

# Embeddings
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello world!"
  }'
```
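
The same chat completion request can be issued from Python without extra dependencies. This is a minimal sketch using only the standard library, assuming the server address from the cURL examples. `build_chat_request` and `send` are hypothetical helper names. The request is only constructed here, so the block runs without a live server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same address as in the cURL examples


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request equivalent to the chat completions cURL call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON reply (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_chat_request("vicuna-7b-v1.5", "Hello! What is your name?")
print(req.get_method(), req.full_url)
# With the API server running: reply = send(req)
```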

--------------------------------

### Run CLI with Vicuna, Alpaca, LLaMA, Koala models

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

This command demonstrates how to use the FastChat CLI with models like Vicuna, Alpaca, LLaMA, and Koala. It requires Python 3 and the FastChat package. The `--model-path` argument specifies the model to load.

```bash
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
```