### Start Local JetStream Mock Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Use this command to start a local mock server for JetStream.

```bash
python -m jetstream.core.implementations.mock.server
```

--------------------------------

### Start Simple Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Command to start the simple inference server in the background.

```bash
python inference/entrypoint/run_simple_server.py &
```

--------------------------------

### Setup MaxText and JetStream Environment

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure a Python virtual environment and install dependencies for MaxText and JetStream. This involves creating a virtual environment, activating it, and running setup scripts for both repositories.

```bash
# Create a python virtual environment for the demo.
sudo apt install python3.10-venv
python -m venv .env
source .env/bin/activate

# Setup MaxText.
cd ~/maxtext/
bash setup.sh

# Setup JetStream.
cd ~/JetStream
pip install -e .
cd benchmarks
pip install -r requirements.in
```

--------------------------------

### Install JAX with TPU Support

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Installs the JAX library with TPU support. Ensure you are in an activated virtual environment.

```bash
pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
```

--------------------------------

### Install Benchmark Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Installs necessary Python packages for running benchmarks. Navigate to the benchmarks directory first.

```bash
cd ~/JetStream/benchmarks
pip install -r requirements.in
```

--------------------------------

### Install JetStream Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to install all necessary dependencies for JetStream.

```bash
make install-deps
```

--------------------------------

### Install Python and Create Virtual Environment

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Installs the Python development headers, `venv`, and `build-essential`, then creates and activates a virtual environment at `~/venv/jetstream`.

```bash
sudo apt-get install python3-dev python3-venv -y
sudo apt-get install build-essential -y
python -m venv ~/venv/jetstream
source ~/venv/jetstream/bin/activate
```

--------------------------------

### Start TensorBoard Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md

Initiates a TensorBoard server to visualize profiling data. Ensure the log directory exists. TensorBoard can be accessed at http://localhost:6006/.

```bash
tensorboard --logdir /tmp/tensorboard/
```

--------------------------------

### Start JetStream Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Starts the JetStream server (the MaxText engine server) with a quantized Llama2-70B checkpoint and performance-tuning flags. Run this command from the `~/maxtext` directory.

```bash
cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=assets/tokenizer.llama2 \
  load_parameters_path="gs://msingh-bkt/checkpoints/quant_llama2-70b-chat/mlperf_070924/int8_" \
  max_prefill_predict_length=1024 \
  max_target_length=2048 \
  model_name=llama2-70b \
  ici_fsdp_parallelism=1 \
  ici_autoregressive_parallelism=1 \
  ici_tensor_parallelism=-1 \
  scan_layers=false \
  weight_dtype=bfloat16 \
  checkpoint_is_quantized=True \
  quantization=int8 \
  quantize_kvcache=True \
  compute_axis_order=0,2,1,3 \
  ar_cache_axis_order=0,2,1,3 \
  enable_jax_profiler=True \
  per_device_batch_size=50 \
  optimize_mesh_for_tpu_v6e=True
```

--------------------------------

### Clone Repository and Install Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Commands to clone the JetStream repository and install required Python packages from the requirements.txt file.

```bash
git clone https://github.com/AI-Hypercomputer/JetStream.git

cd JetStream/experimental/jax

pip install -r requirements.txt
```

--------------------------------

### Start JetStream MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Command to start the JetStream MaxText server using a base configuration file and environment variables. Ensure you are in the maxtext directory.

```bash
cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE}
```
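
The command above takes every setting from the environment, so an unset variable turns into an empty `key=` argument. A minimal sketch of a pre-flight check follows; `require_vars` is a hypothetical helper, not part of JetStream or MaxText.

```bash
# require_vars NAME...: hypothetical helper, not part of JetStream or MaxText.
# Prints each listed variable that is unset or empty; returns non-zero if any is.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the variables used by the server command before launching it.
if require_vars TOKENIZER_PATH LOAD_PARAMETERS_PATH MAX_PREFILL_PREDICT_LENGTH \
    MAX_TARGET_LENGTH MODEL_NAME ICI_FSDP_PARALLELISM \
    ICI_AUTOREGRESSIVE_PARALLELISM ICI_TENSOR_PARALLELISM \
    SCAN_LAYERS WEIGHT_DTYPE PER_DEVICE_BATCH_SIZE; then
  echo "all server variables are set"
else
  echo "export the variables listed above, then retry" >&2
fi
```

The check only tests that each variable is non-empty; it does not validate the values.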

--------------------------------

### Install Evaluation Dependencies

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Installs a list of Python packages required for evaluation, specifying exact versions for reproducibility.

```bash
pip install \
transformers==4.31.0 \
nltk==3.8.1 \
evaluate==0.4.0 \
absl-py==1.4.0 \
rouge-score==0.1.2 \
sentencepiece==0.1.99 \
accelerate==0.21.0
```

--------------------------------

### Download MLPerf Inference Benchmark Suite and Install Loadgen

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Clones the MLPerf inference repository and installs the loadgen package from its source directory.

```bash
cd ~
git clone https://github.com/mlcommons/inference.git
pushd inference/loadgen
pip install .
popd
```

--------------------------------

### Example Prometheus Metrics Output

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md

This is an example of the metrics output you can expect when accessing the Prometheus endpoint. Metrics like prefill backlog size and decode slot usage are exposed.

```text
# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="<SOME-HOSTNAME-HERE>"} 0.0
# HELP jetstream_slots_used_percentage The percentage of decode slots currently being used
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="<SOME-HOSTNAME-HERE>",idx="0"} 0.04166666666666663
```
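
To read a single value out of this text format from a script, a small `awk` filter is usually enough. The sketch below works on sample text modeled on the output above; `metric_value` is a hypothetical helper name, and in practice the text would come from the server's metrics endpoint rather than a variable.

```bash
# Sample text in the same shape as the example output.
sample_metrics='# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="<SOME-HOSTNAME-HERE>"} 0.0
# HELP jetstream_slots_used_percentage The percentage of decode slots currently being used
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="<SOME-HOSTNAME-HERE>",idx="0"} 0.04166666666666663'

# metric_value NAME: hypothetical helper. Reads metrics text on stdin and prints
# the value of the first sample whose metric name equals NAME.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)       # drop the {label="..."} part
      if (metric == name) { print $NF; exit }
    }'
}

printf '%s\n' "$sample_metrics" | metric_value jetstream_slots_used_percentage
# prints 0.04166666666666663
```

The filter assumes label values contain no spaces, which holds for the sample shown here.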

--------------------------------

### Start JetStream MaxText Server with JAX Profiler Enabled

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md

Launches the JetStream MaxText server with JAX profiler enabled. Set `ENABLE_JAX_PROFILER` to `true` and optionally configure `JAX_PROFILER_PORT`. This server will expose profiling data on the specified port (default 9999).

```bash
# Refer to JetStream MaxText User Guide for the following server config.
export TOKENIZER_PATH=assets/tokenizer.gemma
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=gemma-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=-1
export ICI_TENSOR_PARALLELISM=1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
# Set ENABLE_JAX_PROFILER to enable JAX profiler server at port 9999.
export ENABLE_JAX_PROFILER=true
# Set JAX_PROFILER_PORT to customize JAX profiler server port.
export JAX_PROFILER_PORT=9999

cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  enable_jax_profiler=${ENABLE_JAX_PROFILER} \
  jax_profiler_port=${JAX_PROFILER_PORT}
```

--------------------------------

### Run JetStream MaxText Server with Prometheus Metrics

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md

Set the PROMETHEUS_PORT environment variable to enable Prometheus metrics. This example shows how to launch the JetStream MaxText server with metrics observability enabled on port 9090.

```bash
# Refer to JetStream MaxText User Guide for the following server config.
export TOKENIZER_PATH=assets/tokenizer.gemma
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=gemma-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=-1
export ICI_TENSOR_PARALLELISM=1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
# Set PROMETHEUS_PORT to enable Prometheus metrics.
export PROMETHEUS_PORT=9090

cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  prometheus_port=${PROMETHEUS_PORT}
```

--------------------------------

### Send Generation Request

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Example curl command to send a JSON request to the running inference server for text generation.

```bash
curl --no-buffer -H 'Content-Type: application/json' \
  -d '{ "prompt": "Today is a good day" }' \
  -X POST \
  localhost:8000/generate
```
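
If the server was started in the background, the first request can arrive before it is listening. One way to handle this is a small retry wrapper. The sketch below is an assumption-laden example: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...: hypothetical helper.
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Sleep only when another attempt will follow.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Demonstration with a command that always succeeds.
retry 3 0 true && echo "command succeeded"
```

To apply it to the request above, prefix the `curl` command with, for example, `retry 30 2` and add `--fail` so that HTTP errors count as failures. Each attempt sends a real generation request.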

--------------------------------

### Set up Python Environment

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Commands to create and activate a new Python virtual environment for JAX inference.

```bash
virtualenv jax-inference
source jax-inference/bin/activate
```

--------------------------------

### Benchmark with Full Warmup Mode

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Executes the benchmark using the 'full' warmup mode, which warms up the server with all input requests. Requires specifying tokenizer, dataset, and output paths.

```bash
python JetStream/benchmarks/benchmark_serving.py   \
--tokenizer ~/maxtext/assets/tokenizer.llama2  \
--warmup-mode full   \
--save-result   \
--save-request-outputs   \
--request-outputs-file-path outputs.json   \
--num-prompts 1000   \
--max-output-length 1024   \
--dataset openorca
```

--------------------------------

### Download Llama2-70B Data Files

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Creates a directory for data and downloads two specific processed data files for the Llama2-70B model using gsutil.

```bash
export DATA_DISK_DIR=~/loadgen_run_data
mkdir -p ${DATA_DISK_DIR}
gsutil cp gs://cloud-tpu-inference-public/mlcommons/inference/language/llama2-70b/data/processed-openorca/open_orca_gpt4_tokenized_llama.calibration_1000.pkl ${DATA_DISK_DIR}/processed-calibration-data.pkl
gsutil cp gs://cloud-tpu-inference-public/mlcommons/inference/language/llama2-70b/data/processed-openorca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl ${DATA_DISK_DIR}/processed-data.pkl
```

--------------------------------

### Run Offline Benchmarking

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Commands to set up environment variables and run the offline benchmarking script. This uses 8-way TP for experimental comparison.

```bash
export PYTHONPATH=$(pwd)
export JAX_COMPILATION_CACHE_DIR="/tmp/jax_cache"
python inference/entrypoint/mini_offline_benchmarking.py
```

--------------------------------

### Test JetStream LoRA Adapter TensorStore

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the JetStream LoRA adapter tensorstore.

```bash
python -m unittest -v jetstream.tests.core.lora.test_adapter_tensorstore
```

--------------------------------

### Run MLPerf Server Accuracy Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Navigate to the specified directory for Llama2 70b TPU v5e 8 JetStream maxtext and execute the server accuracy benchmark script.

```bash
cd Google/code/llama2-70b/tpu_v5e_8_jetstream_maxtext/scripts/
bash ./generate_server_accuracy_run.sh
```

--------------------------------

### Test JetStream Core Server Library

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the JetStream core server library.

```bash
python -m unittest -v jetstream.tests.core.test_server
```

--------------------------------

### Run MLPerf Server Performance Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Navigate to the MLPerf benchmarks scripts directory and execute the server performance benchmark script.

```bash
cd ~/JetStream/benchmarks/mlperf/scripts
bash ./generate_server_performance_run.sh
```

--------------------------------

### Benchmark Gemma-7b with ShareGPT Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Run the JetStream benchmark for Gemma-7b using the ShareGPT dataset and the Gemma tokenizer. Control QPS with `--request-rate` (the default is `inf`) and warm up the server with `--warmup-mode sampled`.

```bash
# Activate the python virtual environment we created in Step 2.
cd ~
source .env/bin/activate

# download dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# run benchmark with the downloaded dataset and the tokenizer in maxtext
# You can control the qps by setting "--request-rate", the default value is inf.
python JetStream/benchmarks/benchmark_serving.py \
--tokenizer maxtext/assets/tokenizer.gemma \
--num-prompts 1000 \
--dataset sharegpt \
--dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--request-rate 5 \
--warmup-mode sampled
```

--------------------------------

### Build and Upload Docker Image

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jetstream-maxtext-stable-stack/README.md

Builds, tests, and uploads the Docker image using the pipeline script. The UPLOAD_IMAGE_TAG is set to the nightly build with the current date.

```bash
./pipeline.sh UPLOAD_IMAGE_TAG=gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly-$(date +"%Y%m%d")
```
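
To see what the tag expands to before invoking the pipeline, the same substitution can be evaluated on its own. This is only a preview sketch; it does not build or upload anything.

```bash
# Expand the tag the same way the command above does, without running the pipeline.
UPLOAD_IMAGE_TAG="gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly-$(date +"%Y%m%d")"
echo "${UPLOAD_IMAGE_TAG}"
```

`date` uses the local time zone here, so the suffix follows the calendar date of the machine that runs the command.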

--------------------------------

### Test Mock JetStream Token Utils

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the mock JetStream token utilities.

```bash
python -m unittest -v jetstream.tests.engine.test_token_utils
```

--------------------------------

### Define Paths for MaxText Checkpoint Conversion

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Set environment variables for the source MaxText checkpoint and the destination for the quantized checkpoint. Ensure these paths are accessible and correctly formatted for Google Cloud Storage.

```bash
export LOAD_PARAMS_PATH=gs://${USER}-bkt/llama2-70b-chat/param-only-decode-ckpt-maxtext/checkpoints/0/items

export SAVE_QUANT_PARAMS_PATH=gs://${USER}-bkt/quantized/llama2-70b-chat
```
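
A quick format check can catch a missing `gs://` prefix before a long conversion job starts. The sketch below assumes a hypothetical helper named `is_gcs_path`; it checks only the shape of the path, not whether the bucket exists.

```bash
# is_gcs_path PATH: hypothetical helper. Succeeds when PATH looks like
# gs://<bucket>/... and fails otherwise.
is_gcs_path() {
  case "$1" in
    gs:///*) return 1 ;;   # empty bucket name
    gs://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report on the two variables defined above (empty if they are not set).
for path in "${LOAD_PARAMS_PATH:-}" "${SAVE_QUANT_PARAMS_PATH:-}"; do
  if is_gcs_path "$path"; then
    echo "ok: $path"
  else
    echo "not a gs:// path: '$path'" >&2
  fi
done
```

Accessibility is a separate question; listing the source path with `gsutil ls "${LOAD_PARAMS_PATH}"` is one way to confirm that the checkpoint is readable.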

--------------------------------

### Run Benchmark with MaxText Tokenizer

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Executes the benchmark script using the MaxText tokenizer and the ShareGPT dataset. Adjust the number of prompts and output length as needed.

```bash
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10  \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024
```

--------------------------------

### Test Local JetStream Mock Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Execute this command to test the local mock server of JetStream.

```bash
python -m jetstream.tools.requester
```

--------------------------------

### Load Test Local JetStream Mock Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Use this command to perform load testing on the local mock server of JetStream.

```bash
python -m jetstream.tools.load_tester
```

--------------------------------

### Benchmark Llama2 with ShareGPT Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Execute the JetStream benchmark for Llama2 using the ShareGPT dataset and the Llama2 tokenizer. As with Gemma-7b, `--request-rate` controls QPS and `--warmup-mode` selects the warmup behavior.

```bash
# The command is the same as that for the Gemma-7b, except for the tokenizer. Since we need to use a tokenizer that matches the model, it should now be tokenizer.llama2. 

python JetStream/benchmarks/benchmark_serving.py \
--tokenizer maxtext/assets/tokenizer.llama2 \
--num-prompts 1000  \
--dataset sharegpt \
--dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--request-rate 5 \
--warmup-mode sampled
```

--------------------------------

### Format Code with Make

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/CONTRIBUTING.md

Run this command to ensure your code adheres to the project's formatting standards before submitting a pull request.

```bash
make format
```

--------------------------------

### Test Mock JetStream Engine Implementation

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the mock JetStream engine implementation.

```bash
python -m unittest -v jetstream.tests.engine.test_mock_engine
```

--------------------------------

### Enable KV-Cache Quantization

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Set the QUANTIZE_KVCACHE environment variable to True to enable quantization of the KV-cache.

```bash
export QUANTIZE_KVCACHE=True
```

--------------------------------

### Log in to Hugging Face CLI

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Command to log in to the Hugging Face CLI. Ensure your account has permission to access the specified model.

```bash
huggingface-cli login
```

--------------------------------

### Benchmark with OpenOrca Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Runs the benchmark using the OpenOrca dataset, commonly used for MLPerf inference benchmarks with LLaMA2 models. Includes options for saving results and outputs.

```bash
python JetStream/benchmarks/benchmark_serving.py   \
--tokenizer ~/maxtext/assets/tokenizer.llama2  \
--warmup-mode sampled   \
--save-result   \
--save-request-outputs   \
--request-outputs-file-path outputs.json   \
--num-prompts 1000   \
--max-output-length 1024   \
--dataset openorca
```

--------------------------------

### Test JetStream Utils

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for JetStream utilities.

```bash
python -m unittest -v jetstream.tests.engine.test_utils
```

--------------------------------

### Run MLPerf Server Audit Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Navigate to the specified directory for Llama2 70b TPU v5e 8 JetStream maxtext and execute the server audit benchmark script.

```bash
cd Google/code/llama2-70b/tpu_v5e_8_jetstream_maxtext/scripts/
bash ./generate_server_audit_run.sh
```

--------------------------------

### Download MaxText and JetStream Repositories

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Clones the MaxText and JetStream repositories from GitHub over SSH into the home directory.

```bash
cd ~
git clone git@github.com:AI-Hypercomputer/maxtext.git
git clone git@github.com:AI-Hypercomputer/JetStream.git
```

--------------------------------

### Run Benchmark for Llama 3

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Runs the benchmark script specifically for Llama 3 models, requiring the Llama 3 tokenizer path. Uses the ShareGPT dataset.

```bash
python benchmark_serving.py \
--tokenizer <llama3 tokenizer path> \
--num-prompts 10  \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--model llama-3
```

--------------------------------

### Download ShareGPT Dataset

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Downloads the ShareGPT V3 unfiltered cleaned dataset, required for certain benchmarks. Ensure you are in the data directory.

```bash
cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

--------------------------------

### Restart MaxText Engine Server with Mixed Precision

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Restart the MaxText engine server for a mixed precision quantized model, including the quant_cfg_path parameter.

```bash
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  quantization=${QUANTIZATION} \
  quantize_kvcache=${QUANTIZE_KVCACHE} \
  checkpoint_is_quantized=${CHECKPOINT_IS_QUANTIZED} \
  quant_cfg_path=${QUANT_CFG_PATH}
```

--------------------------------

### Set Quantization Flags for Mixed-Precision Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables to load a mixed-precision quantized checkpoint. Ensure CHECKPOINT_IS_QUANTIZED is set to True and specify the quantization config path.

```bash
export QUANTIZATION=intmp
export LOAD_PARAMETERS_PATH=${SAVE_QUANT_PARAMS_PATH}
export CHECKPOINT_IS_QUANTIZED=True
export QUANT_CFG_PATH=configs/quantization/mp_scale.json
```

--------------------------------

### Test JetStream Core Orchestrator

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/README.md

Run this command to execute verbose tests for the JetStream core orchestrator module.

```bash
python -m unittest -v jetstream.tests.core.test_orchestrator
```

--------------------------------

### JetStream Server Ready Logs

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

These log messages indicate that the JetStream server has initialized and is ready to process requests. Pay attention to the per-device memory usage and any initialization warnings.

```log
Memstats: After load_params:
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_0(process=0,(0,0,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_1(process=0,(1,0,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_2(process=0,(0,1,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_3(process=0,(1,1,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_4(process=0,(0,2,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_5(process=0,(1,2,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_6(process=0,(0,3,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_7(process=0,(1,3,0,0))
WARNING:root:Initialising driver with 1 prefill engines and 1 generate engines.
2025-02-10 22:10:34,122 - root - WARNING - Initialising driver with 1 prefill engines and 1 generate engines.
WARNING:absl:T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
2025-02-10 22:10:34,152 - absl - WARNING - T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
WARNING:absl:T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
2025-02-10 22:10:34,260 - absl - WARNING - T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
WARNING:absl:T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
2025-02-10 22:10:34,326 - absl - WARNING - T5 library uses PAD_ID=0, which is different from the sentencepiece vocabulary, which defines pad_id=-1
GC tweaked (allocs, gen1, gen2):  60000 20 30
2025-02-10 22:10:36.360296: I external/xla/xla/tsl/profiler/rpc/profiler_server.cc:46] Profiler server listening on [::]:9999 selected port:9999
```
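
The per-device `Using (GB)` lines can be totaled with `awk` to get a quick view of memory use after parameter loading. This is a rough sketch that assumes the line layout shown above stays the same; `memstats_summary` is a hypothetical helper name.

```bash
# Two lines in the same shape as the Memstats output above.
sample_log='        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_0(process=0,(0,0,0,0))
        Using (GB) 8.1 / 31.25 (25.920000%) on TPU_1(process=0,(1,0,0,0))'

# memstats_summary: hypothetical helper. Reads log text on stdin and totals
# the used and available figures from every "Using (GB) X / Y" line.
memstats_summary() {
  awk '
    /Using \(GB\)/ { used += $3; total += $5; devices++ }
    END { printf "%d devices, %.1f / %.1f GB in use\n", devices, used, total }'
}

printf '%s\n' "$sample_log" | memstats_summary
# prints: 2 devices, 16.2 / 62.5 GB in use
```

Run against a saved server log, the same function can be fed with input redirection from that file.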

--------------------------------

### Restart MaxText Engine Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Restart the MaxText engine server with specified configuration parameters, including quantization settings. Adjust PER_DEVICE_BATCH_SIZE for Gemma 7b model.

```bash
# For Gemma 7b model, change per_device_batch_size to 12 to optimize performance. 
export PER_DEVICE_BATCH_SIZE=12

cd ~/maxtext
python3 -m MaxText.maxengine_server \
  MaxText/configs/base.yml \
  tokenizer_path=${TOKENIZER_PATH} \
  load_parameters_path=${LOAD_PARAMETERS_PATH} \
  max_prefill_predict_length=${MAX_PREFILL_PREDICT_LENGTH} \
  max_target_length=${MAX_TARGET_LENGTH} \
  model_name=${MODEL_NAME} \
  ici_fsdp_parallelism=${ICI_FSDP_PARALLELISM} \
  ici_autoregressive_parallelism=${ICI_AUTOREGRESSIVE_PARALLELISM} \
  ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
  scan_layers=${SCAN_LAYERS} \
  weight_dtype=${WEIGHT_DTYPE} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  quantization=${QUANTIZATION} \
  quantize_kvcache=${QUANTIZE_KVCACHE} \
  checkpoint_is_quantized=${CHECKPOINT_IS_QUANTIZED}
```

--------------------------------

### Set Quantization Flags for Weight-Only Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables to load an int8 weight-only checkpoint. Ensure CHECKPOINT_IS_QUANTIZED is set to True.

```bash
export QUANTIZATION=int8w
export LOAD_PARAMETERS_PATH=${SAVE_QUANT_PARAMS_PATH}
export CHECKPOINT_IS_QUANTIZED=True
```

--------------------------------

### Send Test Request to JetStream MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Use this script to send a test prompt to the running JetStream MaxText server. Specify the tokenizer path based on the model being used (Gemma or Llama2).

```bash
cd ~
# For Gemma model
python JetStream/jetstream/tools/requester.py --tokenizer maxtext/assets/tokenizer.gemma
# For Llama2 model
python JetStream/jetstream/tools/requester.py --tokenizer maxtext/assets/tokenizer.llama2
```

--------------------------------

### Benchmark Prefix Cache

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Tests JetStream's prefix caching mechanism by running benchmarks with mock input requests that share a common prefix. Configurable with common prefix length and number of prompts.

```bash
python JetStream/benchmarks/benchmark_serving.py \
--tokenizer prefix_cache_test \
--dataset prefix_cache_test \
--warmup-mode full \
--num-prompts 100 \
--max-input-length 16000 \
--prefix-cache-test-common-len 9000 \
--max-output-length 50
```

--------------------------------

### Gemma-7b Environment Variables for MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables for running the JetStream MaxText server with the Gemma-7b model. Ensure UNSCANNED_CKPT_PATH is set.

```bash
export TOKENIZER_PATH=assets/tokenizer.gemma
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=gemma-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=1
export ICI_TENSOR_PARALLELISM=-1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
```

--------------------------------

### Create Cloud TPU v5e-8

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jax/README.md

Command to create a Cloud TPU v5e-8 resource using gcloud alpha. Ensure you have the necessary project and zone information.

```bash
gcloud alpha compute tpus queued-resources create ${QR_NAME} \
    --node-id ${NODE_NAME} \
    --project ${PROJECT_ID} \
    --zone ${ZONE} \
    --accelerator-type v5litepod-8 \
    --runtime-version v2-alpha-tpuv5-lite 
```

--------------------------------

### Set Quantization Flags for DRQ Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables to load an int8 DRQ checkpoint. Ensure CHECKPOINT_IS_QUANTIZED is set to True.

```bash
export QUANTIZATION=int8
export LOAD_PARAMETERS_PATH=${SAVE_QUANT_PARAMS_PATH}
export CHECKPOINT_IS_QUANTIZED=True
```
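
This guide uses three quantization modes: `int8` (DRQ), `int8w` (weight-only), and `intmp` (mixed precision, which also needs `QUANT_CFG_PATH`). The sketch below gathers them into one function so that switching modes does not leave stale flags behind. `select_quantization` is a hypothetical helper, not part of JetStream or MaxText, and it assumes `SAVE_QUANT_PARAMS_PATH` already points at the quantized checkpoint.

```bash
# select_quantization MODE: hypothetical helper. Exports the flags for one of
# the modes shown in this guide (int8, int8w, intmp).
select_quantization() {
  if [ -z "${SAVE_QUANT_PARAMS_PATH:-}" ]; then
    echo "SAVE_QUANT_PARAMS_PATH is not set" >&2
    return 1
  fi
  case "$1" in
    int8|int8w)
      export QUANTIZATION="$1"
      unset QUANT_CFG_PATH           # only the mixed-precision mode uses it
      ;;
    intmp)
      export QUANTIZATION=intmp
      export QUANT_CFG_PATH=configs/quantization/mp_scale.json
      ;;
    *)
      echo "unknown quantization mode: $1" >&2
      return 1
      ;;
  esac
  export LOAD_PARAMETERS_PATH="${SAVE_QUANT_PARAMS_PATH}"
  export CHECKPOINT_IS_QUANTIZED=True
}

# Select the DRQ mode; nothing is printed if SAVE_QUANT_PARAMS_PATH is missing.
if select_quantization int8; then
  echo "QUANTIZATION=${QUANTIZATION} CHECKPOINT_IS_QUANTIZED=${CHECKPOINT_IS_QUANTIZED}"
fi
```

The function must be defined in, or sourced into, the shell that later starts the server; otherwise the exported variables do not reach that process.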

--------------------------------

### Llama2-7b Environment Variables for MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables for running the JetStream MaxText server with the Llama2-7b model. Ensure UNSCANNED_CKPT_PATH is set.

```bash
export TOKENIZER_PATH=assets/tokenizer.llama2
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=llama2-7b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=1
export ICI_TENSOR_PARALLELISM=-1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=11
```

--------------------------------

### Generate Int8 Quantized Checkpoint with MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Execute the MaxText script to convert a Llama2-70B checkpoint into an int8 quantized format. This command specifies tokenizer path, model configuration, parallelism settings, and quantization type.

```bash
export TOKENIZER_PATH=maxtext/assets/tokenizer.llama2
cd maxtext && \
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=${TOKENIZER_PATH} load_parameters_path=${LOAD_PARAMS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-70b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=int8 save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Run Benchmark and Evaluation

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Executes the benchmark and automatically runs ROUGE evaluation afterward using `--run-eval true`. Scores are saved if `--save-result` is also used.

```bash
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024  \
--save-request-outputs \
--run-eval true
```
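
To make the reported scores easier to interpret, here is a simplified sketch of what ROUGE-1 measures: the unigram overlap between a reference text and a prediction, combined into an F1 score. It is an illustration only. It splits on whitespace and skips the tokenization and stemming that ROUGE libraries typically apply, so it will not reproduce the benchmark's numbers. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1 F1 in [0, 1] based on whitespace-split unigrams."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat sat"))
```

In this example every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5).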

--------------------------------

### Export Quantized Checkpoint Save Path

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Set an environment variable to define the GCS bucket path where quantized checkpoint parameters will be saved. This is a prerequisite for generating quantized checkpoints.

```bash
export SAVE_QUANT_PARAMS_PATH=gs://${USER}-bkt/quantized/llama2-7b-chat
```

--------------------------------

### Configure PodMonitoring for GKE Clusters

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md

Apply this PodMonitoring resource to your GKE cluster to enable Google Cloud Managed Service for Prometheus to scrape JetStream metrics. Ensure you replace `<your-prometheus-port>` with the actual port configured.

```json
{
  "apiVersion": "monitoring.googleapis.com/v1",
  "kind": "PodMonitoring",
  "metadata": {
    "name": "jetstream-podmonitoring"
  },
  "spec": {
    "endpoints": [
      {
        "interval": "1s",
        "path": "/",
        "port": <your-prometheus-port>
      }
    ],
    "targetLabels": {
      "metadata": [
        "pod",
        "container",
        "node"
      ]
    }
  }
}
```
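
Because the placeholder makes the snippet above invalid JSON until it is replaced, one option is to generate the manifest programmatically. The following is a minimal sketch: the function name `build_pod_monitoring` is hypothetical, and the port `9090` is only an example value, not a JetStream default.

```python
import json


def build_pod_monitoring(port, name="jetstream-podmonitoring"):
    """Return the PodMonitoring manifest shown above with the port filled in."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": name},
        "spec": {
            "endpoints": [{"interval": "1s", "path": "/", "port": port}],
            "targetLabels": {"metadata": ["pod", "container", "node"]},
        },
    }


# 9090 is a placeholder; use the port your JetStream server exposes metrics on.
print(json.dumps(build_pod_monitoring(9090), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f <file>`.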

--------------------------------

### Run JetStream MaxText Container

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jetstream-maxtext-stable-stack/README.md

Runs the JetStream-MaxText stable stack Docker container on a TPU VM. Ensure necessary volume mounts, TPU device access, and network ports are configured.

```bash
# Add any volume mounts, TPU device access, and network ports you need
# as extra flags before the image name.
docker run --net=host --privileged --rm -it \
  gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly \
  bash
```

--------------------------------

### Generate Mixed Precision Weight-Only Quantized Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

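This command generates a weight-only quantized checkpoint with mixed precision (`quantization=intmp`). Before running it, update the configuration file passed as `quant_cfg_path` (`mp_scale.json`) with the desired bit widths and scales for each layer pattern. Example file contents are shown first, followed by the command.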

```json
{
  ".*/query": {"bits": 4, "scale": 0.8},
  ".*/key": {"bits": 4, "scale": 0.9},
  ".*/value": {"bits": 8},
  ".*/out": {"bits": 4},
  ".*/wi_0": {"bits": 4},
  ".*/wo": {"bits": 8}
}
```

```bash
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 load_parameters_path=${LOAD_PARAMETERS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=intmp \
  quant_cfg_path=configs/quantization/mp_scale.json save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Llama2-13b Environment Variables for MaxText Server

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Configure environment variables for running the JetStream MaxText server with the Llama2-13b model. Ensure UNSCANNED_CKPT_PATH is set.

```bash
export TOKENIZER_PATH=assets/tokenizer.llama2
export LOAD_PARAMETERS_PATH=${UNSCANNED_CKPT_PATH}
export MAX_PREFILL_PREDICT_LENGTH=1024
export MAX_TARGET_LENGTH=2048
export MODEL_NAME=llama2-13b
export ICI_FSDP_PARALLELISM=1
export ICI_AUTOREGRESSIVE_PARALLELISM=1
export ICI_TENSOR_PARALLELISM=-1
export SCAN_LAYERS=false
export WEIGHT_DTYPE=bfloat16
export PER_DEVICE_BATCH_SIZE=4
```

--------------------------------

### Generate Weights-Only int8 Quantized Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

This command generates a checkpoint in which only the weights are quantized to int8 (`quantization=int8w`). This can reduce model size and memory footprint.

```bash
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 load_parameters_path=${LOAD_PARAMETERS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=int8w save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Convert Gemma Checkpoint for MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Convert a downloaded Gemma checkpoint into a MaxText compatible unscanned checkpoint. This script requires the model name, variation, checkpoint bucket, and paths for scanned and unscanned MaxText checkpoints.

```bash
# For gemma-7b
bash ../JetStream/jetstream/tools/maxtext/model_ckpt_conversion.sh gemma 7b ${CHKPT_BUCKET} ${MAXTEXT_BUCKET_SCANNED} ${MAXTEXT_BUCKET_UNSCANNED}
```

--------------------------------

### Clean Up GCS Buckets and Local Files

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Use these bash commands to clean up Google Cloud Storage buckets and local directories after running MaxText experiments. Ensure environment variables like MODEL_BUCKET and BASE_OUTPUT_DIRECTORY are set.

```bash
# Clean up gcs buckets.
gcloud storage buckets delete ${MODEL_BUCKET}
gcloud storage buckets delete ${BASE_OUTPUT_DIRECTORY}

# Clean up repositories.
rm -rf maxtext
rm -rf JetStream

# Clean up python virtual environment
rm -rf .env
```

--------------------------------

### Set up SSH Tunnel for Remote TensorBoard Access

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md

Establishes an SSH tunnel that forwards the TensorBoard port (6006) from a remote machine to your local machine. Use it when TensorBoard runs on a remote machine that you cannot reach directly.

```bash
gcloud compute ssh <machine-name> -- -L 6006:127.0.0.1:6006
```
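
Once the tunnel is up, a quick way to confirm that something is listening on the forwarded port is to attempt a TCP connection from your local machine. This is a minimal sketch using only the standard library; the helper name `port_is_open` is hypothetical, and the default port `6006` comes from the command above.

```python
import socket


def port_is_open(host="127.0.0.1", port=6006, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("TensorBoard port reachable:", port_is_open())
```

A `False` result usually means the tunnel is not running or TensorBoard has not started on the remote machine.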

--------------------------------

### Save Request Outputs in Benchmark

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Enables saving the benchmark's prediction outputs to a file using the `--save-request-outputs` flag. This is useful for subsequent evaluation.

```bash
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024  \
--save-request-outputs
```

--------------------------------

### Verify JAX TPU Access

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/mlperf/README.md

Verifies that JAX can access the TPU and reports the number of available TPU devices.

```python
import jax

# Print the number of accelerator devices visible to JAX.
print(jax.device_count())
```

--------------------------------

### Convert Llama2 7b Checkpoint for MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Convert a Llama2 7b checkpoint into a MaxText compatible unscanned checkpoint. This script requires the model name, variation, checkpoint bucket, and paths for scanned and unscanned MaxText checkpoints.

```bash
# For llama2-7b
bash ../JetStream/jetstream/tools/maxtext/model_ckpt_conversion.sh llama2 7b ${CHKPT_BUCKET} ${MAXTEXT_BUCKET_SCANNED} ${MAXTEXT_BUCKET_UNSCANNED}
```

--------------------------------

### Convert Llama2 13b Checkpoint for MaxText

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Convert a Llama2 13b checkpoint into a MaxText compatible unscanned checkpoint. This script requires the model name, variation, checkpoint bucket, and paths for scanned and unscanned MaxText checkpoints.

```bash
# For llama2-13b
bash ../JetStream/jetstream/tools/maxtext/model_ckpt_conversion.sh llama2 13b ${CHKPT_BUCKET} ${MAXTEXT_BUCKET_SCANNED} ${MAXTEXT_BUCKET_UNSCANNED}
```

--------------------------------

### Generate int8 DRQ Quantized Checkpoint

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

This command generates a quantized checkpoint using int8 DRQ (Dynamic Range Quantization). Ensure the necessary configuration files and model paths are correctly set.

```bash
python3 -m MaxText.decode MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 load_parameters_path=${LOAD_PARAMETERS_PATH} max_prefill_predict_length=1024 max_target_length=2048 model_name=llama2-7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 scan_layers=false weight_dtype=bfloat16 per_device_batch_size=11 attention=dot_product quantization=int8 save_quantized_params_path=${SAVE_QUANT_PARAMS_PATH}
```

--------------------------------

### Pull Nightly Docker Image

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/experimental/jetstream-maxtext-stable-stack/README.md

Pulls a nightly Docker image for the JetStream-MaxText stable stack. Set `NIGHTLY_DATE` to a date in YYYYMMDD format to pull a specific build, or use the `nightly` tag for the most recent build.

```bash
# Defaults to today's date in YYYYMMDD format.
# To pull a specific build, set it manually, e.g., export NIGHTLY_DATE=20231027
export NIGHTLY_DATE=$(date +"%Y%m%d")

docker pull gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly-${NIGHTLY_DATE}

# Or the last nightly build
docker pull gcr.io/cloud-tpu-inference-test/jetstream-maxtext-stable-stack/tpu:nightly
```

--------------------------------

### Disable KV-Cache Quantization

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/docs/online-inference-with-maxtext-engine.md

Set the QUANTIZE_KVCACHE environment variable to False to disable quantization of the KV-cache.

```bash
export QUANTIZE_KVCACHE=False
```

--------------------------------

### Standalone Evaluation Run

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Performs a standalone evaluation of saved request outputs using the `eval_accuracy.py` script. Requires the path to the output file.

```bash
python eval_accuracy.py outputs.json
```

--------------------------------

### Reference Accuracy Numbers for Llama2

Source: https://github.com/ai-hypercomputer/jetstream/blob/main/benchmarks/README.md

Provides reference accuracy metrics (ROUGE scores) for llama2-7b-chat and llama2-70b-chat models when evaluated on the OpenOrca dataset.

```text
llama2-7b-chat {'rouge1': 42.0706, 'rouge2': 19.8021, 'rougeL': 26.8474, 'rougeLsum': 39.5952, 'gen_len': 1146679, 'gen_num': 998}
llama2-70b-chat {'rouge1': 44.4312, 'rouge2': 22.0352, 'rougeL': 28.6162}
```
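
One way to use these reference numbers is to flag a run whose scores fall noticeably below them. The following is a minimal sketch: the helper name `below_reference` is hypothetical, and the 1% relative tolerance is an assumption chosen for illustration, not a threshold defined by the benchmark.

```python
# Reference ROUGE scores copied from the table above.
REFERENCE = {
    "llama2-7b-chat": {"rouge1": 42.0706, "rouge2": 19.8021, "rougeL": 26.8474},
    "llama2-70b-chat": {"rouge1": 44.4312, "rouge2": 22.0352, "rougeL": 28.6162},
}


def below_reference(measured, model, rel_tol=0.01):
    """Return metrics scoring more than rel_tol below the reference.

    The result maps each failing metric to (measured, reference).
    A metric missing from `measured` is also reported.
    """
    failures = {}
    for metric, ref in REFERENCE[model].items():
        score = measured.get(metric)
        if score is None or score < ref * (1.0 - rel_tol):
            failures[metric] = (score, ref)
    return failures


run = {"rouge1": 40.0, "rouge2": 19.8, "rougeL": 26.8}
print(below_reference(run, "llama2-7b-chat"))
```

Here only `rouge1` is reported, because 40.0 is more than 1% below the reference value of 42.0706.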
