### Install ExllamaV2 Environment

Source: https://github.com/lm-sys/fastchat/blob/main/docs/exllama_v2.md

Commands to clone the ExllamaV2 repository and install the necessary Python dependencies for the custom kernel.

```bash
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -e .
```

--------------------------------

### Install xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Installs the xFasterTransformer library using pip. Ensure your environment is set up according to the xFasterTransformer documentation.

```bash
pip install xfastertransformer
```

--------------------------------

### Install AWQ and FastChat Environment

Source: https://github.com/lm-sys/fastchat/blob/main/docs/awq.md

Sets up a Conda environment, installs FastChat, clones the AWQ repository, and compiles the necessary CUDA kernels for 4-bit inference.

```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
pip install --upgrade pip
pip install -e .    # install FastChat (run from the FastChat repository root)

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .    # install the AWQ package

cd awq/kernels
python setup.py install    # build the AWQ CUDA kernels
```

--------------------------------

### Download Text File for LangChain Example

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Downloads a sample text file from a URL using wget. This file will be used as input for the LangChain question-answering example.

```bash
wget https://raw.githubusercontent.com/hwchase17/langchain/v0.0.200/docs/modules/state_of_the_union.txt
```

--------------------------------

### Install Training Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Installs the necessary dependencies for training FastChat models. Run this command from the root of the FastChat repository, inside your Python environment.
```bash
pip3 install -e ".[train]"
```

--------------------------------

### vLLM Worker for High-Throughput Serving

Source: https://context7.com/lm-sys/fastchat/llms.txt

Shows how to start a vLLM worker for high-throughput inference, leveraging continuous batching and PagedAttention. Examples include basic setup, tensor parallelism, and custom GPU memory utilization.

```bash
# Start vLLM worker
python3 -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002

# vLLM worker with tensor parallelism
python3 -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --tensor-parallel-size 4

# vLLM worker with custom GPU memory utilization
python3 -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --gpu-memory-utilization 0.85
```

--------------------------------

### Starting the OpenAI-Compatible API Server

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts the API server that provides OpenAI-compatible endpoints for chat completions, text completions, and embeddings.

```APIDOC
## Starting the OpenAI-Compatible API Server

### Description
Starts the API server that provides OpenAI-compatible endpoints for chat completions, text completions, and embeddings. This allows integration with tools and SDKs designed for OpenAI's API.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
#### Command Line Arguments
- `--host` (string) - Optional - The host address to bind the API server to. Defaults to `localhost`; use `0.0.0.0` to make it accessible externally.
- `--port` (integer) - Optional - The port number for the API server. Defaults to 8000.

### Request Example
```bash
# Start the API server on localhost, port 8000
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Full deployment example (run in separate terminals):
# Terminal 1: Controller
# python3 -m fastchat.serve.controller
# Terminal 2: Model Worker
# python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
# Terminal 3: API Server
# python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

### Response
N/A (Server Output)
```

--------------------------------

### Install FastChat and Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/webserver.md

Installs the necessary system packages and Anaconda for environment management, creates and activates a dedicated Conda environment, then clones the FastChat repository and installs FastChat using pip.

```bash
sudo apt update
sudo apt install tmux htop

wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh

conda create -n fastchat python=3.9
conda activate fastchat

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install -e .
```

--------------------------------

### Launch FastChat Serving Components

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Commands to start the controller, a model worker, and the Gradio web server, plus a test message that checks the worker is reachable through the controller. The controller, worker, and web server are required to host and interact with LLMs locally.

```bash
# The controller, worker, and web server are long-running; start each in its own terminal.
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5
python3 -m fastchat.serve.gradio_web_server
```

--------------------------------

### Configure and Start Model Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/exllama_v2.md

Commands to download a quantized model and start the FastChat model worker, including advanced configurations like sequence length and multi-GPU memory allocation.
```bash
# Download quantized model
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

# Start worker with default config
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --enable-exllama

# Start worker with custom sequence length and GPU split
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --enable-exllama \
    --exllama-max-seq-len 2048 \
    --exllama-gpu-split 18,24
```

--------------------------------

### Launch FastChat Web Server with API Registration

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

Command-line instruction to start the Gradio web server while registering custom API endpoints defined in a JSON configuration file.

```bash
python3 -m fastchat.serve.gradio_web_server --controller "" --share --register api_endpoints.json
```

--------------------------------

### Install DashInfer Package

Source: https://github.com/lm-sys/fastchat/blob/main/docs/dashinfer_integration.md

Installs the DashInfer Python package using pip. This is the first step to enable DashInfer's capabilities within your environment.

```bash
pip install dashinfer
```

--------------------------------

### cURL Examples

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Examples of how to use cURL to interact with the FastChat API server for various endpoints.

```APIDOC
## cURL Examples

List Models:
```bash
curl http://localhost:8000/v1/models
```

Chat Completions:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
```

Text Completions:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'
```

Embeddings:
```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello world!"
  }'
```
```

--------------------------------

### Install FastChat and Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/llm_judge/README.md

Clones the FastChat repository and installs the necessary packages for model workers and LLM evaluation.

```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
```

--------------------------------

### Install Nginx on Debian/Ubuntu

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/serve/gateway/README.md

Installs the Nginx web server on Debian-based systems like Ubuntu using the apt package manager. This is a prerequisite for deploying the Nginx gateway.

```bash
sudo apt update
sudo apt install nginx
```

--------------------------------

### Launch vLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/llm_judge/README.md

Starts a vLLM server instance to serve model weights. This backend provides faster inference for supported models.

```bash
vllm serve [MODEL-PATH] --dtype auto
```

--------------------------------

### Install vLLM Dependency

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vllm_integration.md

Installs the vLLM library required for running the optimized model worker in FastChat.

```bash
pip install vllm
```

--------------------------------

### Fine-Tuning Vicuna Models

Source: https://context7.com/lm-sys/fastchat/llms.txt

Provides a command to fine-tune models using the supervised fine-tuning script, specifically for Vicuna.
This example uses FSDP and details various training parameters such as model path, data path, epochs, batch size, learning rate, and FSDP configurations.

```bash
# Fine-tune Vicuna with FSDP
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ./data/conversations.json \
    --bf16 True \
    --output_dir ./output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --learning_rate 2e-5 \
    --weight_decay 0.0 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```

--------------------------------

### Multi-Model Worker Configuration

Source: https://context7.com/lm-sys/fastchat/llms.txt

Demonstrates how to serve multiple models from a single worker process, which is particularly useful for PEFT/LoRA models that share base weights. The example shows how to specify multiple model paths and names.

```bash
# Serve multiple models in one process
python3 -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-names vicuna-7b-v1.5 \
    --model-path lmsys/longchat-7b-16k \
    --model-names longchat-7b-16k
```

--------------------------------

### Launch Quantized LightLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/lightllm_integration.md

Starts the LightLLM worker with support for quantized weights and KV cache. This is useful for reducing memory footprint during inference.
```bash
python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
```

--------------------------------

### Run Vicuna-7B on Ascend NPU with FastChat CLI

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to enable acceleration on Ascend NPUs. Requires installation of the Ascend PyTorch Adapter and setting CANN environment variables.

```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --device npu
```

--------------------------------

### Vicuna Prompt Template (v1.1, v1.3, v1.5)

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vicuna_weights_version.md

This is an example of the prompt template used for Vicuna weights versions 1.1, 1.3, and 1.5. It demonstrates the conversational structure between a user and an AI assistant, using 'USER:' and 'ASSISTANT:' tags, with the EOS token `</s>` closing each assistant turn.

```text
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: Hello!
ASSISTANT: Hello!</s>
USER: How are you?
ASSISTANT: I am good.</s>
```

--------------------------------

### Starting a Model Worker

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts a model worker that loads and serves LLM models. Each worker registers with the controller and handles inference requests for its loaded model.

```APIDOC
## Starting a Model Worker

### Description
Starts a model worker that loads and serves LLM models. Each worker registers with the controller and handles inference requests for the loaded model.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
#### Command Line Arguments
- `--model-path` (string) - Required - The path to the pre-trained model weights (e.g., Hugging Face model ID).
- `--controller-address` (string) - Optional - The address of the controller. Defaults to http://localhost:21001.
- `--worker-address` (string) - Optional - The address at which the controller can reach this worker. Defaults to http://localhost:21002.
- `--load-8bit` (boolean) - Optional - Whether to load the model in 8-bit precision to reduce memory usage.
- `--num-gpus` (integer) - Optional - The number of GPUs to split the model across (model parallelism). If not specified, defaults to 1.
- `--model-names` (string) - Optional - A comma-separated list of model names to expose via the API. Useful for API compatibility.

### Request Example
```bash
# Start a model worker with Vicuna model
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002

# Start with 8-bit quantization
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-13b-v1.5 \
    --load-8bit

# Start with the model split across multiple GPUs
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4

# Start with custom model names for API compatibility
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-names "gpt-3.5-turbo,vicuna-7b"
```

### Response
N/A (Server Output)
```

--------------------------------

### Run Vicuna-7B on Intel XPU with FastChat CLI

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to enable GPU acceleration on Intel Data Center and Arc A-Series GPUs using the XPU device. Requires installation of Intel Extension for PyTorch and setting OneAPI environment variables.

```bash
source /opt/intel/oneapi/setvars.sh
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --device xpu
```

--------------------------------

### Start Model Worker with xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Starts the FastChat model worker process with xFasterTransformer enabled.
The data type is configurable, and CPU utilization can be optimized by pinning workers to NUMA nodes with numactl and MPI.

```bash
# Load model with default configuration (max sequence length 4096, no GPU split setting).
python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16
```

```bash
# Run inference on NUMA node 0 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16)
numactl -N 0 --localalloc python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16
```

```bash
# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
    -n 1 numactl -N 0 --localalloc python -m fastchat.serve.model_worker \
        --model-path /path/to/models \
        --enable-xft \
        --xft-dtype bf16_fp16 : \
    -n 1 numactl -N 1 --localalloc python -m fastchat.serve.model_worker \
        --model-path /path/to/models \
        --enable-xft \
        --xft-dtype bf16_fp16
```

--------------------------------

### Install MLX for FastChat

Source: https://github.com/lm-sys/fastchat/blob/main/docs/mlx_integration.md

Installs the MLX-LM library, which is required for using MLX as an optimized worker in FastChat.

```bash
pip install "mlx-lm>=0.0.6"
```

--------------------------------

### Install FastChat Data Cleaning Dependencies

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/data_cleaning.md

Installs the necessary Python packages for HTML parsing, markdown conversion, and language detection required for data preprocessing.
```bash
pip3 install bs4 markdownify
pip3 install polyglot pyicu pycld2
```

--------------------------------

### Install Nginx on Red Hat/CentOS/Fedora

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/serve/gateway/README.md

Installs the Nginx web server on Red Hat-based systems like CentOS and Fedora using the yum package manager. This is a prerequisite for deploying the Nginx gateway.

```bash
sudo yum install epel-release
sudo yum install nginx
```

--------------------------------

### Check Server Launch Times

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/webserver.md

This script iterates through server logs to find and display the 'Running on local URL' message, indicating when each Gradio web server instance started.

```bash
for i in $(seq 0 11); do cat fastchat_logs/server$i/gradio_web_server.log | grep "Running on local URL" | tail -n 1; done
```

--------------------------------

### Run CLI with RWKV-4-Raven model

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

This command shows how to run the FastChat CLI with the BlinkDL/RWKV-4-Raven model. Ensure Python 3 and FastChat are installed. The model path should point to the downloaded weights.

```bash
python3 -m fastchat.serve.cli --model-path ~/model_weights/RWKV-4-Raven-7B-v11x-Eng99%-Other1%-20230429-ctx8192.pth
```

--------------------------------

### Launch FastChat Controller

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Starts the FastChat controller, which manages workers and routes requests. This is the first step in setting up the FastChat API server.

```bash
python3 -m fastchat.serve.controller
```

--------------------------------

### Vicuna Prompt Template (v0)

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vicuna_weights_version.md

This is an example of the prompt template used for Vicuna weights version 0.
It uses a different format with '### Human:' and '### Assistant:' tags for structuring the conversation.

```text
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: Hello!
### Assistant: Hello!
### Human: How are you?
### Assistant: I am good.
```

--------------------------------

### CLI Chat Interface for Models

Source: https://context7.com/lm-sys/fastchat/llms.txt

Provides command-line interface examples to interact with models directly in the terminal. Supports streaming output, custom conversation templates, 8-bit quantization, and multi-GPU configurations. Includes special commands for chat management.

```bash
# Start CLI chat with a model
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5

# CLI with custom conversation template
python3 -m fastchat.serve.cli \
    --model-path lmsys/vicuna-7b-v1.5 \
    --conv-template vicuna_v1.1

# CLI with 8-bit quantization
python3 -m fastchat.serve.cli \
    --model-path lmsys/vicuna-13b-v1.5 \
    --load-8bit

# CLI with multiple GPUs
python3 -m fastchat.serve.cli \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 2

# CLI special commands:
# !!exit - Exit the chat
# !!reset - Reset conversation
# !!remove - Remove last message
# !!regen - Regenerate last response
# !!save - Save conversation
# !!load - Load conversation
```

--------------------------------

### OpenAI Python SDK Integration

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Example of how to interact with the FastChat API server using the OpenAI Python SDK (version >= 1.0).
```APIDOC
## OpenAI Python SDK Integration

Install the OpenAI Python package:
```bash
pip install --upgrade openai
```

Interact with the API server:
```python
import openai

# The local server does not check the key, but the SDK requires one to be set.
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

model = "vicuna-7b-v1.5"
prompt = "Once upon a time"

# Create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
print(prompt + completion.choices[0].text)

# Create a chat completion
completion = openai.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
print(completion.choices[0].message.content)
```
```

--------------------------------

### Launch Gradio Web Server for Vision Arena

Source: https://github.com/lm-sys/fastchat/blob/main/docs/arena.md

Starts the Gradio web server with specific flags for the vision arena. It can register API endpoints, enable remote storage, and load random questions for VQA.

```bash
python3 -m fastchat.serve.gradio_web_server_multi --share --register-api-endpoint-file api_endpoints.json --vision-arena --use-remote-storage --random-questions metadata_sampled.json
```

--------------------------------

### Chat with FastChat-T5-3B using FastChat CLI

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Command to start a chat with the FastChat-T5-3B model. The model weights are automatically fetched from Hugging Face.

```bash
python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0
```

--------------------------------

### Generate VQA Examples Directory

Source: https://github.com/lm-sys/fastchat/blob/main/docs/arena.md

A Python script to generate a random questions file by downloading images from various VQA datasets. This is used for the vision arena to provide sample questions.
```bash
python fastchat/serve/vision/create_vqa_examples_dir.py
```

--------------------------------

### Launch FastChat OpenAI API Server

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Starts the RESTful API server that is compatible with OpenAI's API. This server allows applications like LangChain to interact with the locally hosted models through a familiar OpenAI interface.

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

--------------------------------

### Train Vicuna-7B with xFormers

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

This command trains Vicuna-7B when xFormers is installed, which provides memory-efficient attention. It replaces the standard training script with one optimized for xFormers.

```bash
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_xformers.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/dummy_conversation.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```

--------------------------------

### LangChain Question Answering Example

Source: https://github.com/lm-sys/fastchat/blob/main/docs/langchain_integration.md

Demonstrates how to use LangChain with a local LLM via the FastChat API. It loads a text document, creates an index with embeddings, and then queries the index using a language model to answer questions.
```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
loader = TextLoader("state_of_the_union.txt")
index = VectorstoreIndexCreator(embedding=embedding).from_loaders([loader])
llm = ChatOpenAI(model="gpt-3.5-turbo")

questions = [
    "Who is the speaker",
    "What did the president say about Ketanji Brown Jackson",
    "What are the threats to America",
    "Who are mentioned in the speech",
    "Who is the vice president",
    "How many projects were announced",
]

for query in questions:
    print("Query:", query)
    print("Answer:", index.query(query, llm=llm))
```

--------------------------------

### Launch FastChat Servers and Workers

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/webserver.md

Initiates the FastChat controller, registers a Hugging Face API worker, and starts the Gradio web server with support for multiple models (ChatGPT, Claude, PaLM). It also includes environment variable exports for API keys and project IDs.
```bash
cd fastchat_logs/controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
python3 -m fastchat.serve.register_worker --controller http://localhost:21001 --worker-name https://
python3 -m fastchat.serve.test_message --model vicuna-13b --controller http://localhost:21001

cd fastchat_logs/server0

python3 -m fastchat.serve.huggingface_api_worker --model-info-file ~/elo_results/register_hf_api_models.json

export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
export GCP_PROJECT_ID=

python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 50 --add-chatgpt --add-claude --add-palm --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms

python3 backup_logs.py
```

--------------------------------

### Launch Multi-Model Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/local_cluster.md

Starts a multi-model worker capable of serving multiple models simultaneously. This is useful for efficient resource utilization, especially when dealing with models of varying sizes or types. It specifies the models, controller address, host, port, and worker address.
```bash
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.multi_model_worker --model-path ~/model_weights/RWKV-4-Raven-14B-v12-Eng98%25-Other2%25-20230523-ctx8192.pth --model-name RWKV-4-Raven-14B --model-path lmsys/fastchat-t5-3b-v1.0 --model-name fastchat-t5-3b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker http://$(hostname):31000 --limit 4
```

```bash
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.multi_model_worker --model-path OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 --model-name oasst-pythia-12b --model-path mosaicml/mpt-7b-chat --model-name mpt-7b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker http://$(hostname):31001 --limit 4
```

```bash
CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.multi_model_worker --model-path lmsys/vicuna-7b-v1.5 --model-name vicuna-7b --model-path THUDM/chatglm-6b --model-name chatglm-6b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker http://$(hostname):31002 --limit 4
```

--------------------------------

### Starting the Controller

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts the FastChat controller, which manages distributed model workers and routes requests. It's the central coordinator for the serving infrastructure.

```APIDOC
## Starting the Controller

### Description
Starts the FastChat controller, which manages distributed model workers and routes requests. It is the central coordinator for the FastChat serving infrastructure.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
#### Command Line Arguments
- `--host` (string) - Optional - The host address to bind the controller to. Defaults to localhost.
- `--port` (integer) - Optional - The port number for the controller. Defaults to 21001.

### Request Example
```bash
python3 -m fastchat.serve.controller

# With custom host and port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
```

### Response
N/A (Server Output)
```

--------------------------------

### Launch Multi-Model and Advanced Serving

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Commands for launching the multi-model Gradio interface and configuring multiple workers on specific GPUs for scalability.

```bash
python3 -m fastchat.serve.gradio_web_server_multi --register-api-endpoint-file api_endpoint.json

# worker 0
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5 --controller http://localhost:21001 --port 31000 --worker http://localhost:31000
# worker 1
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker --model-path lmsys/fastchat-t5-3b-v1.0 --controller http://localhost:21001 --port 31001 --worker http://localhost:31001

python3 -m fastchat.serve.gradio_web_server_multi
```

--------------------------------

### Download Benchmark Dataset

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/serve/monitor/classify/README.md

Uses git-lfs to clone the category benchmark dataset from Hugging Face and prepares the local directory structure for evaluation.

```console
git clone https://huggingface.co/datasets/lmarena-ai/categories-benchmark-eval
cp -r categories-benchmark-eval/label_bench .
```

--------------------------------

### Launch Gradio Web Servers for FastChat

Source: https://context7.com/lm-sys/fastchat/llms.txt

These commands launch Gradio web servers for interacting with FastChat models. Options include a single-model server, a multi-model server for side-by-side comparisons, and a server with vision capabilities for image support, which can also register API endpoints.
```bash
python3 -m fastchat.serve.gradio_web_server --share
```

```bash
python3 -m fastchat.serve.gradio_web_server_multi --share
```

```bash
python3 -m fastchat.serve.gradio_web_server_multi \
    --share \
    --vision-arena \
    --register-api-endpoint-file api_endpoints.json
```

--------------------------------

### Start FastChat API Server Thread

Source: https://github.com/lm-sys/fastchat/blob/main/playground/FastChat_API_GoogleColab.ipynb

This Python code snippet starts the FastChat API server in a separate thread. The `run_api_server` function shown here is a placeholder for the actual server entry point. The server will be accessible at http://127.0.0.1:8000/v1/.

```python
import threading

def run_api_server():
    # Placeholder for the actual API server run function
    print("API server is running...")

# Run the server function in a background thread so the notebook stays responsive
api_server_thread = threading.Thread(target=run_api_server)
api_server_thread.start()
```

--------------------------------

### Fine-Tune Models with QLoRA using DeepSpeed

Source: https://context7.com/lm-sys/fastchat/llms.txt

This script demonstrates how to fine-tune a model using QLoRA for memory-efficient training with 4-bit quantization. It utilizes DeepSpeed for distributed training and requires specifying model paths, LoRA parameters, data paths, and training configurations.
```bash
deepspeed fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ./data/conversations.json \
    --bf16 True \
    --output_dir ./checkpoints_lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json
```

--------------------------------

### Clone FastChat and Install Dependencies (Python)

Source: https://github.com/lm-sys/fastchat/blob/main/playground/FastChat_API_GoogleColab.ipynb

This snippet clones the FastChat repository from GitHub and installs the necessary dependencies for running the model worker and web UI. It's designed to be executed in a Google Colab environment.

```python
%cd /content/
# clone FastChat
!git clone https://github.com/lm-sys/FastChat.git

# install dependencies
%cd FastChat
!python3 -m pip install -e ".[model_worker,webui]" --quiet
```

--------------------------------

### GET /v1/models

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

Retrieves a list of available models that the API server supports.

```APIDOC
## GET /v1/models

### Description
Lists all the models available through the API server.

### Method
GET

### Endpoint
/v1/models

### Parameters
None

### Response
#### Success Response (200)
- **object** (string) - Type of object returned, e.g., 'list'.
- **data** (array) - A list of model objects.
  - **id** (string) - The unique identifier for the model.
  - **object** (string) - Type of object, e.g., 'model'.
  - **created** (integer) - Unix timestamp of when the model was created.
  - **owned_by** (string) - The owner of the model.
#### Response Example
```json
{
  "object": "list",
  "data": [
    {
      "id": "vicuna-7b-v1.5",
      "object": "model",
      "created": 1677652288,
      "owned_by": "lmsys"
    }
  ]
}
```
```

--------------------------------

### Prepare Models for xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Converts a model into a format compatible with xFasterTransformer. The example uses the ChatGLM conversion script, which takes an input directory (`-i`, `${HF_DATASET_DIR}` here) and an output directory (`-o`).

```bash
python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
```

--------------------------------

### Launch LightLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/lightllm_integration.md

Starts the LightLLM worker process for a specified model. It replaces the standard `fastchat.serve.model_worker` and requires the total token capacity to be set with `--max_total_token_num`.

```bash
python3 -m fastchat.serve.lightllm_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --tokenizer_mode "auto" \
    --max_total_token_num 154000
```

--------------------------------

### Run AWQ Quantized Model via CLI

Source: https://github.com/lm-sys/fastchat/blob/main/docs/awq.md

Downloads a quantized model using git-lfs and runs the FastChat CLI with the AWQ parameters for 4-bit inference. The model is cloned into `models/` so that the checkout matches the `--model-path` passed to the CLI.

```bash
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
```

--------------------------------

### Run CLI with Llama-2-7b-chat-hf

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

Launches the FastChat command-line interface (CLI) to chat with the meta-llama/Llama-2-7b-chat-hf model. It requires Python 3 and the FastChat package to be installed.
```bash
python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf
```

--------------------------------

### Train Vicuna-7B with Local GPUs

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Trains Vicuna-7B with PyTorch distributed training (`torchrun`) on multiple local GPUs. The command takes the model and data paths and configures training parameters such as batch size, learning rate, and checkpoint saving strategy.

```bash
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/dummy_conversation.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```

--------------------------------

### Run FastChat CLI with Exllama

Source: https://github.com/lm-sys/fastchat/blob/main/docs/exllama_v2.md

Runs the FastChat CLI on a GPTQ model path with the Exllama backend enabled.

```bash
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --enable-exllama
```

--------------------------------

### Initialize FastChat Controller

Source: https://context7.com/lm-sys/fastchat/llms.txt

Starts the central controller service, which manages model workers and routes requests to them. The host and port can be overridden on the command line.
```bash
# Start the controller with default settings
python3 -m fastchat.serve.controller

# Start the controller with an explicit host and port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
```

--------------------------------

### Download and View MT-Bench Data

Source: https://github.com/lm-sys/fastchat/blob/main/fastchat/llm_judge/README.md

Downloads pre-generated model answers and judgments, then launches a local QA browser to inspect the results.

```bash
python3 download_mt_bench_pregenerated.py
python3 qa_browser.py --share
```

--------------------------------

### Launch FastChat vLLM Worker

Source: https://github.com/lm-sys/fastchat/blob/main/docs/vllm_integration.md

Commands to start the FastChat vLLM worker, with variations for a standard model, a custom tokenizer, and an AWQ-quantized model.

```bash
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5
```

```bash
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer hf-internal-testing/llama-tokenizer
```

```bash
python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
```

--------------------------------

### Programmatic Model Loading and Inference

Source: https://context7.com/lm-sys/fastchat/llms.txt

Demonstrates how to load a model and its conversation template programmatically for custom inference pipelines. The example loads the model, builds a conversation, renders it to a prompt, and generates a response.
```python
from fastchat.model import load_model, get_conversation_template
import torch

# Load model and tokenizer
model, tokenizer = load_model(
    model_path="lmsys/vicuna-7b-v1.5",
    device="cuda",
    num_gpus=1,
    load_8bit=False,
    dtype=torch.float16
)

# Get conversation template for the model
conv = get_conversation_template("vicuna-7b-v1.5")
conv.append_message(conv.roles[0], "What is artificial intelligence?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
```

--------------------------------

### Launch FastChat Controller

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/local_cluster.md

Starts the FastChat controller, which manages distributed workers and listens on the given host and port. Run it on a central node (e.g., node-01).

```bash
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 10002
```

--------------------------------

### Run Vicuna-7B with 8-bit Compression for Reduced Memory

Source: https://github.com/lm-sys/fastchat/blob/main/README.md

Enables 8-bit compression for the Vicuna-7B model with `--load-8bit`, significantly reducing memory usage at a slight cost in quality. Compatible with the CPU, GPU, and Metal backends.

```bash
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --load-8bit
```

--------------------------------

### Run CLI Inference with xFasterTransformer

Source: https://github.com/lm-sys/fastchat/blob/main/docs/xFasterTransformer.md

Launches the FastChat CLI for inference with xFasterTransformer. The examples cover different CPU/NUMA placements and the data types fp16 and bf16_fp16.
```bash
# Run inference on all CPUs using float16
python3 -m fastchat.serve.cli \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype fp16
```

```bash
# Run inference on NUMA node 0 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16)
numactl -N 0 --localalloc \
python3 -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
```

```bash
# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
    -n 1 numactl -N 0 --localalloc \
    python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
    -n 1 numactl -N 1 --localalloc \
    python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
```

--------------------------------

### Test FastChat Message Interface

Source: https://github.com/lm-sys/fastchat/blob/main/docs/commands/local_cluster.md

Sends a test message to the 'vicuna-13b' model through the controller at localhost:10002. This is useful for verifying that the controller and workers are set up correctly.

```bash
python3 -m fastchat.serve.test_message --model vicuna-13b --controller http://localhost:10002
```

--------------------------------

### Generate Embeddings using curl

Source: https://github.com/lm-sys/fastchat/blob/main/playground/FastChat_API_GoogleColab.ipynb

Generates embeddings for a text input by calling the /v1/embeddings endpoint of the FastChat API with curl, specifying the model and the input text. The leading `!` runs the command from a notebook cell.

```bash
!curl http://127.0.0.1:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{ \
    "model": "vicuna-7b-v1.5", \
    "input": "Hello, can you tell me a joke for me?" \
  }'
```

--------------------------------

### Interact with FastChat API using cURL

Source: https://github.com/lm-sys/fastchat/blob/main/docs/openai_api.md

cURL commands for the FastChat OpenAI-compatible API server, covering model listing, chat completions, text completions, and embeddings.

```bash
# List available models
curl http://localhost:8000/v1/models

# Chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'

# Text completions
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'

# Embeddings
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello world!"
  }'
```

--------------------------------

### Run CLI with Vicuna, Alpaca, LLaMA, Koala models

Source: https://github.com/lm-sys/fastchat/blob/main/docs/model_support.md

Runs the FastChat CLI with models such as Vicuna, Alpaca, LLaMA, and Koala. It requires Python 3 and the FastChat package. The `--model-path` argument specifies the model to load.

```bash
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
```
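
--------------------------------

### Build a Chat Completions Request in Python (sketch)

The chat completions request from the cURL examples can also be assembled with the Python standard library. The following is a minimal sketch, not part of the FastChat documentation: the helper names `build_chat_request` and `send` are hypothetical, and the base URL assumes the API server is listening on `localhost:8000` as in the cURL examples. The commented-out response access assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

# Assumption: the OpenAI-compatible API server is reachable at this address.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model, user_message, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_chat_request("vicuna-7b-v1.5", "Hello! What is your name?")
    print(req.full_url)
    # Uncomment once a controller, a model worker, and the API server are running:
    # reply = send(req)
    # print(reply["choices"][0]["message"]["content"])
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.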