### Start Local JetStream Mock Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Use this command to start a local mock server for JetStream.

```bash
python -m jetstream.core.implementations.mock.server
```

--------------------------------

### Start Simple Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Command to start the simple inference server in the background.

```bash
python inference/entrypoint/run_simple_server.py &
```

--------------------------------

### Setup MaxText and JetStream Environment

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure a Python virtual environment and install dependencies for MaxText and JetStream. This involves creating a virtual environment, activating it, and running setup scripts for both repositories.

```bash
# Create a python virtual environment for the demo.
sudo apt install python3.10-venv
python -m venv .env
source .env/bin/activate

# Setup MaxText.
cd maxtext/
bash setup.sh

# Setup JetStream
cd JetStream
pip install -e .
cd benchmarks
pip install -r requirements.in
```

--------------------------------

### Install JAX with TPU Support

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Installs the JAX library with TPU support. Ensure you are in an activated virtual environment.

```bash
pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
```

--------------------------------

### Install Benchmark Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Installs necessary Python packages for running benchmarks. Navigate to the benchmarks directory first.
```bash
cd ~/JetStream/benchmarks
pip install -r requirements.in
```

--------------------------------

### Install JetStream Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to install all necessary dependencies for JetStream.

```bash
make install-deps
```

--------------------------------

### Install Python and Create Virtual Environment

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Installs Python development headers and venv, then creates and activates a virtual environment named 'jetstream'.

```bash
sudo apt-get install python3-dev python3-venv -y
sudo apt-get install build-essential -y

python -m venv ~/venv/jetstream
source ~/venv/jetstream/bin/activate
```

--------------------------------

### Start TensorBoard Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md

Initiates a TensorBoard server to visualize profiling data. Ensure the log directory exists. TensorBoard can be accessed at http://localhost:6006/.

```bash
tensorboard --logdir /tmp/tensorboard/
```

--------------------------------

### Start Jetstream Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Starts the Jetstream server with specified configurations for model loading and performance tuning. This command is typically run from the ~/maxtext directory.
```bash
cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=assets/tokenizer.llama2 \
  load_parameters_path="gs://msingh-bkt/checkpoints/quant_llama2-70b-chat/mlperf_070924/int8_" \
  max_prefill_predict_length=1024 \
  max_target_length=2048 \
  model_name=llama2-70b \
  ici_fsdp_parallelism=1 \
  ici_autoregressive_parallelism=1 \
  ici_tensor_parallelism=-1 \
  scan_layers=false \
  weight_dtype=bfloat16 \
  checkpoint_is_quantized=True \
  quantization=int8 \
  quantize_kvcache=True \
  compute_axis_order=0,2,1,3 \
  ar_cache_axis_order=0,2,1,3 \
  enable_jax_profiler=True \
  per_device_batch_size=50 \
  optimize_mesh_for_tpu_v6e=True
```

--------------------------------

### Clone Repository and Install Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Commands to clone the JetStream repository and install required Python packages from the requirements.txt file.

```bash
git clone https://github.com/AI-Hypercomputer/JetStream.git
cd JetStream/experimental/jax
pip install -r requirements.txt
```

--------------------------------

### Start JetStream MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Command to start the JetStream MaxText server using a base configuration file and environment variables. Ensure you are in the maxtext directory.
```bash
cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE}
```

--------------------------------

### Install Evaluation Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Installs a list of Python packages required for evaluation, specifying exact versions for reproducibility.

```bash
pip install \
  transformers==4.31.0 \
  nltk==3.8.1 \
  evaluate==0.4.0 \
  absl-py==1.4.0 \
  rouge-score==0.1.2 \
  sentencepiece==0.1.99 \
  accelerate==0.21.0
```

--------------------------------

### Download MLPerf Inference Benchmark Suite and Install Loadgen

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Clones the MLPerf inference repository and installs the loadgen package from its source directory.

```bash
cd ~
git clone https://github.com/mlcommons/inference.git
pushd inference/loadgen
pip install .
```

--------------------------------

### Example Prometheus Metrics Output

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md

This is an example of the metrics output you can expect when accessing the Prometheus endpoint. Metrics like prefill backlog size and decode slot usage are exposed.
```text
# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="<SOME-HOSTNAME-HERE>"} 0.0
# HELP jetstream_slots_used_percentage The percentage of decode slots currently being used
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="",idx="0"} 0.04166666666666663
```

--------------------------------

### Start JetStream MaxText Server with JAX Profiler Enabled

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md

Launches the JetStream MaxText server with JAX profiler enabled. Set `ENABLE_JAX_PROFILER` to `true` and optionally configure `JAX_PROFILER_PORT`. This server will expose profiling data on the specified port (default 9999).

```bash
# Refer to JetStream MaxText User Guide for the following server config.
export TOKENIZER_PATH=assets/tokenizer.gemma
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=gemma-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=-1
export ICI_TENSOR_PARALLELISM=1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
# Set ENABLE_JAX_PROFILER to enable JAX profiler server at port 9999.
export ENABLE_JAX_PROFILER=true
# Set JAX_PROFILER_PORT to customize JAX profiler server port.
export JAX_PROFILER_PORT=9999

cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  enable_jax_profiler=${ENABLE_JAX_PROFILER} \
  jax_profiler_port=${JAX_PROFILER_PORT}
```

--------------------------------

### Run JetStream MaxText Server with Prometheus Metrics

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md

Set the PROMETHEUS_PORT environment variable to enable Prometheus metrics. This example shows how to launch the JetStream MaxText server with metrics observability enabled on port 9090.

```bash
# Refer to JetStream MaxText User Guide for the following server config.
export TOKENIZER_PATH=assets/tokenizer.gemma
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=gemma-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=-1
export ICI_TENSOR_PARALLELISM=1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
# Set PROMETHEUS_PORT to enable Prometheus metrics.
export PROMETHEUS_PORT=9090

cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  prometheus_port=${PROMETHEUS_PORT}
```

--------------------------------

### Send Generation Request

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Example curl command to send a JSON request to the running inference server for text generation.

```bash
curl --no-buffer -H 'Content-Type: application/json' \
  -d '{ "prompt": "Today is a good day" }' \
  -X POST \
  localhost:8000/generate
```

--------------------------------

### Set up Python Environment

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Commands to create and activate a new Python virtual environment for JAX inference.

```bash
virtualenv jax-inference
source jax-inference/bin/activate
```

--------------------------------

### Benchmark with Full Warmup Mode

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Executes the benchmark using the 'full' warmup mode, which warms up the server with all input requests. Requires specifying tokenizer, dataset, and output paths.
```bash
python JetStream/benchmarks/benchmark_serving.py \
  --tokenizer ~/maxtext/assets/tokenizer.llama2 \
  --warmup-mode full \
  --save-result \
  --save-request-outputs \
  --request-outputs-file-path outputs.json \
  --num-prompts 1000 \
  --max-output-length 1024 \
  --dataset openorca
```

--------------------------------

### Download Llama2-70B Data Files

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Creates a directory for data and downloads two specific processed data files for the Llama2-70B model using gsutil.

```bash
export DATA_DISK_DIR=~/loadgen_run_data
mkdir -p ${DATA_DISK_DIR}
gsutil cp gs://cloud-tpu-inference-public/mlcommons/inference/language/llama2-70b/data/processed-openorca/open_orca_gpt4_tokenized_llama.calibration_1000.pkl ${DATA_DISK_DIR}/processed-calibration-data.pkl
gsutil cp gs://cloud-tpu-inference-public/mlcommons/inference/language/llama2-70b/data/processed-openorca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl ${DATA_DISK_DIR}/processed-data.pkl
```

--------------------------------

### Run Offline Benchmarking

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Commands to set up environment variables and run the offline benchmarking script. This uses 8-way TP for experimental comparison.

```bash
export PYTHONPATH=$(pwd)
export JAX_COMPILATION_CACHE_DIR="/tmp/jax_cache"
python inference/entrypoint/mini_offline_benchmarking.py
```

--------------------------------

### Test JetStream LoRA Adapter TensorStore

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the JetStream LoRA adapter tensorstore.
```bash
python -m unittest -v jetstream.tests.core.lora.test_adapter_tensorstore
```

--------------------------------

### Run MLPerf Server Accuracy Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Navigate to the specified directory for Llama2 70b TPU v5e 8 JetStream maxtext and execute the server accuracy benchmark script.

```bash
cd Google/code/llama2-70b/tpu_v5e_8_jetstream_maxtext/scripts/
bash ./generate_server_accuracy_run.sh
```

--------------------------------

### Test JetStream Core Server Library

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the JetStream core server library.

```bash
python -m unittest -v jetstream.tests.core.test_server
```

--------------------------------

### Run MLPerf Server Performance Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Navigate to the MLPerf benchmarks scripts directory and execute the server performance benchmark script.

```bash
cd ~/JetStream/benchmarks/mlperf/scripts
bash ./generate_server_performance_run.sh
```

--------------------------------

### Benchmark Gemma-7b with ShareGPT Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Run the JetStream benchmark for Gemma-7b using the ShareGPT dataset and the Gemma tokenizer. Control QPS with `--request-rate` and warm up the server with `--warmup-mode`.

```bash
# Activate the python virtual environment we created in Step 2.
cd ~
source .env/bin/activate

# download dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# run benchmark with the downloaded dataset and the tokenizer in maxtext
# You can control the qps by setting "--request-rate", the default value is inf.
python JetStream/benchmarks/benchmark_serving.py \
  --tokenizer maxtext/assets/tokenizer.gemma \
  --num-prompts 1000 \
  --dataset sharegpt \
  --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-output-length 1024 \
  --request-rate 5 \
  --warmup-mode sampled
```

--------------------------------

### Build and Upload Docker Image

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jetstream-maxtext-stable-stack/README.md

Builds, tests, and uploads the Docker image using the pipeline script. The UPLOAD_IMAGE_TAG is set to the nightly build with the current date.

```bash
./pipeline.sh UPLOAD_IMAGE_TAG=gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly-$(date +"%Y%m%d")
```

--------------------------------

### Test Mock JetStream Token Utils

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the mock JetStream token utilities.

```bash
python -m unittest -v jetstream.tests.engine.test_token_utils
```

--------------------------------

### Define Paths for MaxText Checkpoint Conversion

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Set environment variables for the source MaxText checkpoint and the destination for the quantized checkpoint. Ensure these paths are accessible and correctly formatted for Google Cloud Storage.

```bash
export LOAD_PARAMS_PATH=gs://${USER}-bkt/llama2-70b-chat/param-only-decode-ckpt-maxtext/checkpoints/0/items
export SAVE_QUANT_PARAMS_PATH=gs://${USER}-bkt/quantized/llama2-70b-chat
```

--------------------------------

### Run Benchmark with MaxText Tokenizer

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Executes the benchmark script using the MaxText tokenizer and the ShareGPT dataset. Adjust the number of prompts and output length as needed.
```bash
python benchmark_serving.py \
  --tokenizer /home/{username}/maxtext/assets/tokenizer \
  --num-prompts 10 \
  --dataset sharegpt \
  --dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-output-length 1024
```

--------------------------------

### Test Local JetStream Mock Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Execute this command to test the local mock server of JetStream.

```bash
python -m jetstream.tools.requester
```

--------------------------------

### Load Test Local JetStream Mock Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Use this command to perform load testing on the local mock server of JetStream.

```bash
python -m jetstream.tools.load_tester
```

--------------------------------

### Benchmark Llama2 with ShareGPT Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Execute the JetStream benchmark for Llama2 using the ShareGPT dataset and the Llama2 tokenizer. Similar to Gemma-7b, control QPS and use warmup mode.

```bash
# The command is the same as that for the Gemma-7b, except for the tokenizer. Since we need to use a tokenizer that matches the model, it should now be tokenizer.llama2.
python JetStream/benchmarks/benchmark_serving.py \
  --tokenizer maxtext/assets/tokenizer.llama2 \
  --num-prompts 1000 \
  --dataset sharegpt \
  --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-output-length 1024 \
  --request-rate 5 \
  --warmup-mode sampled
```

--------------------------------

### Format Code with Make

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/CONTRIBUTING.md

Run this command to ensure your code adheres to the project's formatting standards before submitting a pull request.
```bash
make format
```

--------------------------------

### Test Mock JetStream Engine Implementation

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the mock JetStream engine implementation.

```bash
python -m unittest -v jetstream.tests.engine.test_mock_engine
```

--------------------------------

### Enable KV-Cache Quantization

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Set the QUANTIZE_KVCACHE environment variable to True to enable quantization of the KV-cache.

```bash
export QUANTIZE_KVCACHE=True
```

--------------------------------

### Log in to Hugging Face CLI

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Command to log in to the Hugging Face CLI. Ensure your account has permission to access the specified model.

```bash
huggingface-cli login
```

--------------------------------

### Benchmark with OpenOrca Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Runs the benchmark using the OpenOrca dataset, commonly used for MLPerf inference benchmarks with LLaMA2 models. Includes options for saving results and outputs.

```bash
python JetStream/benchmarks/benchmark_serving.py \
  --tokenizer ~/maxtext/assets/tokenizer.llama2 \
  --warmup-mode sampled \
  --save-result \
  --save-request-outputs \
  --request-outputs-file-path outputs.json \
  --num-prompts 1000 \
  --max-output-length 1024 \
  --dataset openorca
```

--------------------------------

### Test JetStream Utils

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for JetStream utilities.
```bash
python -m unittest -v jetstream.tests.engine.test_utils
```

--------------------------------

### Run MLPerf Server Audit Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Navigate to the specified directory for Llama2 70b TPU v5e 8 JetStream maxtext and execute the server audit benchmark script.

```bash
cd Google/code/llama2-70b/tpu_v5e_8_jetstream_maxtext/scripts/
bash ./generate_server_audit_run.sh
```

--------------------------------

### Download Maxtext and Jetstream Repositories

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Clones the Maxtext and JetStream repositories from their respective Git repositories.

```bash
cd ~
git clone git@github.com:AI-Hypercomputer/maxtext.git
git clone git@github.com:AI-Hypercomputer/JetStream.git
```

--------------------------------

### Run Benchmark for Llama 3

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Runs the benchmark script specifically for Llama 3 models, requiring the Llama 3 tokenizer path. Uses the ShareGPT dataset.

```bash
python benchmark_serving.py \
  --tokenizer \
  --num-prompts 10 \
  --dataset sharegpt \
  --dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-output-length 1024 \
  --model llama-3
```

--------------------------------

### Download ShareGPT Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Downloads the ShareGPT V3 unfiltered cleaned dataset, required for certain benchmarks. Ensure you are in the data directory.
```bash
cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

--------------------------------

### Restart MaxText Engine Server with Mixed Precision

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Restart the MaxText engine server for a mixed precision quantized model, including the quant_cfg_path parameter.

```bash
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  quantization=${QUANTIZATION} \
  quantize_kvcache=${QUANTIZE_KVCACHE} \
  checkpoint_is_quantized=${CHECKPOINT_IS_QUANTIZED} \
  quant_cfg_path=${QUANT_CFG_PATH}
```

--------------------------------

### Set Quantization Flags for Mixed-Precision Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables to load a mixed-precision quantized checkpoint. Ensure CHECKPOINT_IS_QUANTIZED is set to True and specify the quantization config path.

```bash
export QUANTIZATION=intmp
export LOAD_PARAMETERS_PATH=${SAVE_QUANT_PARAMS_PATH}
export CHECKPOINT_IS_QUANTIZED=True
export QUANT_CFG_PATH=configs/quantization/mp_scale.json
```

--------------------------------

### Test JetStream Core Orchestrator

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the JetStream core orchestrator module.
```bash
python -m unittest -v jetstream.tests.core.test_orchestrator
```

--------------------------------

### Jetstream Server Ready Logs

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

These log messages indicate that the Jetstream server has successfully initialized and is ready to process requests. Pay attention to memory usage and initialization warnings.

```log
Memstats: After load_params:
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_0(process=0,(0,0,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_1(process=0,(1,0,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_2(process=0,(0,1,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_3(process=0,(1,1,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_4(process=0,(0,2,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_5(process=0,(1,2,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_6(process=0,(0,3,0,0))
Using (GB) 8.1 / 31.25 (25.920000%) on TPU_7(process=0,(1,3,0,0))
WARNING:root:Initialising driver with 1 prefill engines and 1 generate engines.
2025-02-10 22:10:34,122 - root - WARNING - Initialising driver with 1 prefill engines and 1 generate engines.
WARNING:absl:T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
2025-02-10 22:10:34,152 - absl - WARNING - T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
WARNING:absl:T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
2025-02-10 22:10:34,260 - absl - WARNING - T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
WARNING:absl:T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
2025-02-10 22:10:34,326 - absl - WARNING - T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
GC tweaked (allocs, gen1, gen2): 60000 20 30
2025-02-10 22:10:36.360296: I external/xla/xla/tsl/profiler/rpc/profiler_server.cc:46] Profiler server listening on [::]:9999 selected port:9999
```

--------------------------------

### Restart MaxText Engine Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Restart the MaxText engine server with specified configuration parameters, including quantization settings. Adjust PER_DEVICE_BATCH_SIZE for Gemma 7b model.

```bash
# For Gemma 7b model, change per_device_batch_size to 12 to optimize performance.
export PER_DEVICE_BATCH_SIZE=12

cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  quantization=${QUANTIZATION} \
  quantize_kvcache=${QUANTIZE_KVCACHE} \
  checkpoint_is_quantized=${CHECKPOINT_IS_QUANTIZED}
```

--------------------------------

### Set Quantization Flags for Weight-Only Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables to load an int8 weight-only checkpoint. Ensure CHECKPOINT_IS_QUANTIZED is set to True.

```bash
export QUANTIZATION=int8w
export LOAD_PARAMETERS_PATH=${SAVE_QUANT_PARAMS_PATH}
export CHECKPOINT_IS_QUANTIZED=True
```

--------------------------------

### Send Test Request to JetStream MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Use this script to send a test prompt to the running JetStream MaxText server. Specify the tokenizer path based on the model being used (Gemma or Llama2).
```bash
cd ~
# For Gemma model
python JetStream/jetstream/tools/requester.py --tokenizer maxtext/assets/tokenizer.gemma
# For Llama2 model
python JetStream/jetstream/tools/requester.py --tokenizer maxtext/assets/tokenizer.llama2
```

--------------------------------

### Benchmark Prefix Cache

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Tests JetStream's prefix caching mechanism by running benchmarks with mock input requests that share a common prefix. Configurable with common prefix length and number of prompts.

```bash
python JetStream/benchmarks/benchmark_serving.py \
  --tokenizer prefix_cache_test \
  --dataset prefix_cache_test \
  --warmup-mode full \
  --num-prompts 100 \
  --max-input-length 16000 \
  --prefix-cache-test-common-len 9000 \
  --max-output-length 50
```

--------------------------------

### Gemma-7b Environment Variables for MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables for running the JetStream MaxText server with the Gemma-7b model. Ensure UNSCANNED_CKPT_PATH is set.

```bash
export TOKENIZER_PATH=assets/tokenizer.gemma
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=gemma-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=1
export ICI_TENSOR_PARALLELISM=-1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
```

--------------------------------

### Create Cloud TPU v5e-8

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Command to create a Cloud TPU v5e-8 resource using gcloud alpha. Ensure you have the necessary project and zone information.
```bash
gcloud alpha compute tpus queued-resources create ${QR_NAME} \
  --node-id ${NODE_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --accelerator-type v5litepod-8 \
  --runtime-version v2-alpha-tpuv5-lite
```

--------------------------------

### Set Quantization Flags for DRQ Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables to load an int8 DRQ checkpoint. Ensure CHECKPOINT_IS_QUANTIZED is set to True.

```bash
export QUANTIZATION=int8
export LOAD_PARAMETERS_PATH=${SAVE_QUANT_PARAMS_PATH}
export CHECKPOINT_IS_QUANTIZED=True
```

--------------------------------

### Llama2-7b Environment Variables for MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables for running the JetStream MaxText server with the Llama2-7b model. Ensure UNSCANNED_CKPT_PATH is set.

```bash
export TOKENIZER_PATH=assets/tokenizer.llama2
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=llama2-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=1
export ICI_TENSOR_PARALLELISM=-1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
```

--------------------------------

### Generate Int8 Quantized Checkpoint with MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Execute the MaxText script to convert a Llama2-70B checkpoint into an int8 quantized format. This command specifies tokenizer path, model configuration, parallelism settings, and quantization type.
```bash
export TOKENIZER_PATH=maxtext/assets/tokenizer.llama2
cd maxtext && \
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=${TOKENIZER_PATH} load_parameters_path=${LOAD_PARAMS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-70b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=int8 save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Run Benchmark and Evaluation

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Executes the benchmark and automatically runs ROUGE evaluation afterward using `--run-eval true`. Scores are saved if `--save-result` is also used.

```bash
python benchmark_serving.py \
  --tokenizer /home/{username}/maxtext/assets/tokenizer \
  --num-prompts 10 \
  --dataset sharegpt \
  --dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-output-length 1024 \
  --save-request-outputs \
  --run-eval true
```

--------------------------------

### Export Quantized Checkpoint Save Path

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Set an environment variable to define the GCS bucket path where quantized checkpoint parameters will be saved. This is a prerequisite for generating quantized checkpoints.

```bash
export SAVE_QUANT_PARAMS_PATH=gs://${USER}-bkt/quantized/llama2-7b-chat
```

--------------------------------

### Configure PodMonitoring for GKE Clusters

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md

Apply this PodMonitoring resource to your GKE cluster to enable Google Cloud Managed Service for Prometheus to scrape JetStream metrics. Ensure you replace `` with the actual port configured.
```json
{
  "apiVersion": "monitoring.googleapis.com/v1",
  "kind": "PodMonitoring",
  "metadata": {
    "name": "jetstream-podmonitoring"
  },
  "spec": {
    "endpoints": [
      {
        "interval": "1s",
        "path": "/",
        "port":
      }
    ],
    "targetLabels": {
      "metadata": [
        "pod",
        "container",
        "node"
      ]
    }
  }
}
```

--------------------------------

### Run JetStream MaxText Container

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jetstream-maxtext-stable-stack/README.md

Runs the JetStream-MaxText stable stack Docker container on a TPU VM. Ensure necessary volume mounts, TPU device access, and network ports are configured.

```bash
# Add necessary volume mounts, TPU device access, network ports, etc.
# as extra options before the image name.
docker run --net=host --privileged --rm -it \
  gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly \
  bash
```

--------------------------------

### Generate Mixed Precision Weight-Only Quantized Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

This command generates a quantized checkpoint using mixed precision for weights. It requires updating a specific configuration file (`mp_scale.json`) with desired quantization settings before execution.
```json
{
  ".*/query": {"bits": 4, "scale": 0.8},
  ".*/key": {"bits": 4, "scale": 0.9},
  ".*/value": {"bits": 8},
  ".*/out": {"bits": 4},
  ".*/wi_0": {"bits": 4},
  ".*/wo": {"bits": 8}
}
```

```bash
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 load_parameters_path=${LOAD_PARAMETERS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=intmp quant_cfg_path=configs/quantization/mp_scale.json save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Llama2-13b Environment Variables for MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables for running the JetStream MaxText server with the Llama2-13b model. Ensure UNSCANNED_CKPT_PATH is set.

```bash
export TOKENIZER_PATH=assets/tokenizer.llama2
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=llama2-13b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=1
export ICI_TENSOR_PARALLELISM=-1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=4
```

--------------------------------

### Generate Weights-Only int8 Quantized Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

This command generates a quantized checkpoint focusing only on weights using int8 quantization. This can be useful for reducing model size and memory footprint.
```bash
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 load_parameters_path=${LOAD_PARAMETERS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=int8w save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Convert Gemma Checkpoint for MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Convert a downloaded Gemma checkpoint into a MaxText compatible unscanned checkpoint. This script requires the model name, variation, checkpoint bucket, and paths for scanned and unscanned MaxText checkpoints.

```bash
# For gemma-7b
bash ../JetStream/jetstream/tools/maxtext/model_ckpt_conversion.sh gemma 7b ${CHKPT_BUCKET} ${MAXTEXT_BUCKET_SCANNED} ${MAXTEXT_BUCKET_UNSCANNED}
```

--------------------------------

### Clean Up GCS Buckets and Local Files

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Use these bash commands to clean up Google Cloud Storage buckets and local directories after running MaxText experiments. Ensure environment variables like MODEL_BUCKET and BASE_OUTPUT_DIRECTORY are set.

```bash
# Clean up gcs buckets.
gcloud storage buckets delete ${MODEL_BUCKET}
gcloud storage buckets delete ${BASE_OUTPUT_DIRECTORY}

# Clean up repositories.
rm -rf maxtext
rm -rf JetStream

# Clean up python virtual environment
rm -rf .env
```

--------------------------------

### Set up SSH Tunnel for Remote TensorBoard Access

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md

Establishes an SSH tunnel to forward the TensorBoard port (6006) from a remote machine to your local machine.
This is necessary when TensorBoard runs on a remote machine and you cannot access it directly.

```bash
gcloud compute ssh -- -L 6006:127.0.0.1:6006
```

--------------------------------

### Save Request Outputs in Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Enables saving the benchmark's prediction outputs to a file using the `--save-request-outputs` flag. This is useful for subsequent evaluation.

```bash
python benchmark_serving.py \
  --tokenizer /home/{username}/maxtext/assets/tokenizer \
  --num-prompts 10 \
  --dataset sharegpt \
  --dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-output-length 1024 \
  --save-request-outputs
```

--------------------------------

### Verify JAX TPU Access

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Verifies that JAX can access the TPU and reports the number of available TPU devices.

```python
import jax
jax.device_count()
```

--------------------------------

### Convert Llama2 7b Checkpoint for MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Convert a Llama2 7b checkpoint into a MaxText compatible unscanned checkpoint. This script requires the model name, variation, checkpoint bucket, and paths for scanned and unscanned MaxText checkpoints.

```bash
# For llama2-7b
bash ../JetStream/jetstream/tools/maxtext/model_ckpt_conversion.sh llama2 7b ${CHKPT_BUCKET} ${MAXTEXT_BUCKET_SCANNED} ${MAXTEXT_BUCKET_UNSCANNED}
```

--------------------------------

### Convert Llama2 13b Checkpoint for MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Convert a Llama2 13b checkpoint into a MaxText compatible unscanned checkpoint. This script requires the model name, variation, checkpoint bucket, and paths for scanned and unscanned MaxText checkpoints.
```bash
# For llama2-13b
bash ../JetStream/jetstream/tools/maxtext/model_ckpt_conversion.sh llama2 13b ${CHKPT_BUCKET} ${MAXTEXT_BUCKET_SCANNED} ${MAXTEXT_BUCKET_UNSCANNED}
```

--------------------------------

### Generate int8 DRQ Quantized Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

This command generates a quantized checkpoint using int8 DRQ (Dynamic Range Quantization). Ensure the necessary configuration files and model paths are correctly set.

```bash
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 load_parameters_path=${LOAD_PARAMETERS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=int8 save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Pull Nightly Docker Image

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jetstream-maxtext-stable-stack/README.md

Pulls the latest nightly Docker image for the JetStream-MaxText stable stack. Replace YYYYMMDD with the desired date or use the 'nightly' tag for the absolute latest build.

```bash
# Replace YYYYMMDD with the specific date, e.g., 20231027
export NIGHTLY_DATE=$(date +"%Y%m%d") # Or set manually, e.g., export NIGHTLY_DATE=20231027
docker pull gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly-${NIGHTLY_DATE}

# Or the last nightly build
docker pull gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly
```

--------------------------------

### Disable KV-Cache Quantization

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Set the QUANTIZE_KVCACHE environment variable to False to disable quantization of the KV-cache.
```bash
export QUANTIZE_KVCACHE=False
```

--------------------------------

### Standalone Evaluation Run

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Performs a standalone evaluation of saved request outputs using the `eval_accuracy.py` script. Requires the path to the output file.

```bash
python eval_accuracy.py outputs.json
```

--------------------------------

### Reference Accuracy Numbers for Llama2

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Provides reference accuracy metrics (ROUGE scores) for llama2-7b-chat and llama2-70b-chat models when evaluated on the OpenOrca dataset.

```text
llama2-7b-chat {'rouge1': 42.0706, 'rouge2': 19.8021, 'rougeL': 26.8474, 'rougeLsum': 39.5952, 'gen_len': 1146679, 'gen_num': 998}
llama2-70b-chat {'rouge1': 44.4312, 'rouge2': 22.0352, 'rougeL': 28.6162}
```
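
As a rough sketch of how the reference numbers above might be used, the following Python snippet checks measured ROUGE scores against them. The `compare_to_reference` helper, the layout of the `REFERENCE` table, and the 1% relative tolerance are assumptions made for illustration; they are not part of JetStream or `eval_accuracy.py`. Only the three metrics listed for both models are compared.

```python
# Reference ROUGE scores copied from the numbers listed above.
# Only the metrics reported for both models are included.
REFERENCE = {
    "llama2-7b-chat": {"rouge1": 42.0706, "rouge2": 19.8021, "rougeL": 26.8474},
    "llama2-70b-chat": {"rouge1": 44.4312, "rouge2": 22.0352, "rougeL": 28.6162},
}


def compare_to_reference(model, scores, rel_tol=0.01):
    """Compare measured ROUGE scores with the reference values.

    Hypothetical helper: a metric passes when the measured score is no more
    than `rel_tol` (relative) below the reference. The 1% default is an
    assumed threshold, not one taken from the JetStream docs.

    Returns a dict mapping metric -> (measured, reference, passed).
    """
    report = {}
    for metric, ref_value in REFERENCE[model].items():
        measured = scores[metric]
        passed = measured >= ref_value * (1.0 - rel_tol)
        report[metric] = (measured, ref_value, passed)
    return report
```

Calling `compare_to_reference("llama2-7b-chat", scores)` with the score dictionary from your own evaluation run returns one entry per metric, so a regression in a single metric is easy to spot. Choose the tolerance to match your own accuracy requirements.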