### Install Dependencies and torch-npu

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/install.rst

Install necessary Python dependencies and then torch and torch-npu version 2.1.0. Use the provided mirror for installation.

```shell
# install the dependencies
pip3 install attrs numpy==1.26.4 decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions -i https://pypi.tuna.tsinghua.edu.cn/simple
# install torch and torch-npu
pip install torch==2.1.0 torch-npu==2.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

--------------------------------

### Install Project Dependencies

Source: https://github.com/ascend/docs/blob/main/README.md

Install the required Python dependencies for building the documentation. This command includes specific index URLs for PyPI and PyTorch.

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple --extra-index-url https://download.pytorch.org/whl/cpu
```

--------------------------------

### Install Necessary Libraries

Source: https://github.com/ascend/docs/blob/main/sources/transformers/fine-tune.rst

Install the required Python libraries for fine-tuning models, including transformers, datasets, evaluate, accelerate, and scikit-learn.

```shell
pip install transformers datasets evaluate accelerate scikit-learn
```

--------------------------------

### Install OpenCompass

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/install.rst

Install OpenCompass using pip. The command includes a mirror for faster downloads. Uncomment other lines for full installation or specific framework support.

```shell
pip install -U opencompass -i https://pypi.tuna.tsinghua.edu.cn/simple

## Full installation (with support for more datasets)
# pip install "opencompass[full]"

## Environment with model acceleration frameworks
## Manage different acceleration frameworks using virtual environments
## since they usually have dependency conflicts with each other.
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"

## API evaluation (e.g. OpenAI, Qwen)
# pip install "opencompass[api]"
```

--------------------------------

### Configure Agentic Pipeline with DeepSpeed

Source: https://github.com/ascend/docs/blob/main/sources/roll/quick_start.rst

Update the main configuration file to use DeepSpeed for training. This example shows a typical setup for the qwen2.5-0.5B-agentic model, including environment settings, checkpointing, and training parameters.

```yaml
# vim examples/qwen2.5-0.5B-agentic/agentic_val_sokoban.yaml
defaults:
  - ../config/traj_envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
render_save_dir: ./output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: ./data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban


checkpoint_config:
  type: file_system
  output_dir: ./data/cpfs_0/rl_examples/models/${exp_name}

num_gpus_per_node: 4

max_steps: 128
save_steps: 10000
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

rollout_batch_size: 16
val_batch_size: 16
sequence_length: 1024

advantage_clip: 0.2
ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0

pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~ 
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 2
    gradient_accumulation_steps: 64
    warmup_steps: 10
    lr_scheduler_type: cosine
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
    # strategy_name: megatron_train
    # strategy_config:
    #   tensor_model_parallel_size: 1
    #   pipeline_model_parallel_size: 1
    #   expert_model_parallel_size: 1
    #   use_distributed_optimizer: true
    #   recompute_granularity: full
  device_mapping: list(range(0,2))
  infer_batch_size: 2

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128 # single-turn response length
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      block_size: 16
      load_format: auto
  device_mapping: list(range(2,3))

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~ 
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(3,4))
  infer_batch_size: 2

reward_normalization:
  grouping: traj_group_id # can be tags (env_type) / traj_group_id (group) / batch (rollout_batch) ...; reward/adv are computed per group
  method: mean_std # asym_clip / identity / mean_std

train_env_manager:
  format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
  max_env_num_per_worker: 4
  num_env_groups: 8
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 1
  tags: [SimpleSokoban]
  num_groups_partition: [8] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  max_env_num_per_worker: 32
  num_env_groups: 64
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output

```
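
The three `device_mapping` entries above split the four devices declared by `num_gpus_per_node` across training, inference, and the reference model. The sketch below works through that split. It assumes the mapping strings are evaluated as ordinary Python expressions, which is an illustration of how to read the config rather than a statement of how ROLL parses it.

```python
# Values copied from the configuration above.
num_gpus_per_node = 4
device_mapping = {
    "actor_train": "list(range(0,2))",
    "actor_infer": "list(range(2,3))",
    "reference": "list(range(3,4))",
}

# Assumption: each mapping string is a Python expression yielding device indices.
devices = {role: eval(expr) for role, expr in device_mapping.items()}

# Flatten the assignments and check that no device is used by two roles
# and that every index is within the declared device count.
assigned = [d for ids in devices.values() for d in ids]
assert len(assigned) == len(set(assigned)), "device assigned to more than one role"
assert all(0 <= d < num_gpus_per_node for d in assigned)

for role, ids in devices.items():
    print(f"{role}: {ids}")
```

Under that reading, `actor_train` uses devices 0 and 1, `actor_infer` uses device 2, and `reference` uses device 3, so all four devices are allocated without overlap.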

--------------------------------

### Start Training with Debug Model

Source: https://github.com/ascend/docs/blob/main/sources/torchtitan/quick_start.rst

Execute the run_train.sh script to start pre-training using the configuration specified in debug_model.toml.

```shell
./run_train.sh
```

--------------------------------

### Clone Accelerate Repository

Source: https://github.com/ascend/docs/blob/main/sources/accelerate/quick_start.rst

Download the Accelerate official example code from GitHub. This is required to access the example scripts.

```bash
git clone https://github.com/huggingface/accelerate.git
```

--------------------------------

### Install timm Package

Source: https://github.com/ascend/docs/blob/main/sources/timm/install.rst

Install the timm library using pip, specifying a mirror for faster downloads. Ensure your virtual environment is activated.

```shell
pip install timm -i https://pypi.tuna.tsinghua.edu.cn/simple
```

--------------------------------

### Start Ray with Dashboard Access

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Start a Ray head node and make the dashboard reachable from external browsers by binding it to `0.0.0.0`. The dashboard port is set explicitly to 8265, which is also Ray's default.

```bash
ray start --head --dashboard-host=0.0.0.0 --dashboard-port=8265
```

--------------------------------

### Install Ray

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Install Ray version 2.46.0 or higher using pip. This is the recommended installation method for Ascend NPU.

```bash
pip install "ray>=2.46.0"
```

--------------------------------

### Navigate to Example Directory

Source: https://github.com/ascend/docs/blob/main/sources/wenet/quick_start.rst

Change the current directory to the location of the NPU experiment script.

```shell
cd example/aishell/s0
```

--------------------------------

### Install Hugging Face Hub CLI

Source: https://github.com/ascend/docs/blob/main/sources/transformers/modeldownload.rst

Install the Hugging Face Hub command-line interface tool. This is required for downloading models from Hugging Face.

```shell
pip install huggingface-hub
```

--------------------------------

### Serve Local Documentation

Source: https://github.com/ascend/docs/blob/main/README.md

Start a local HTTP server to view the built documentation. The server will be accessible at localhost:4000.

```bash
python -m http.server -d _build/html 4000
```
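
The same server can be started from Python, which is convenient when the preview step is part of a larger script. This is a minimal sketch using only the standard library; the function name `make_docs_server` is hypothetical, and unlike the command above it binds to `127.0.0.1` by default so the preview is only reachable from the local machine.

```python
import functools
import http.server


def make_docs_server(directory="_build/html", port=4000, host="127.0.0.1"):
    """Return an HTTP server that serves static files from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   with make_docs_server() as server:
#       server.serve_forever()
```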

--------------------------------

### Verify timm and NPU Installation

Source: https://github.com/ascend/docs/blob/main/sources/timm/install.rst

Run this Python script to check the installed timm version and confirm NPU device availability. Successful execution indicates a correct setup.

```python
import torch
import torch_npu
import timm

print("timm version:", timm.version.__version__)
print("NPU devices:", torch.npu.current_device())
```
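
If the script above fails with an `ImportError`, it can be useful to see which package is missing before debugging further. The following sketch checks whether each module can be located without importing it; the helper name `check_modules` is hypothetical.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Module names taken from the verification script above.
status = check_modules(["torch", "torch_npu", "timm"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A module reported as found can still fail at import time (for example, if the Ascend environment variables are not sourced), so this only narrows down installation problems.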

--------------------------------

### Install torch and torch-npu

Source: https://github.com/ascend/docs/blob/main/sources/timm/install.rst

Install specific versions of PyTorch (2.2.0) and torch-npu (2.2.0) for Ascend compatibility. This command also installs necessary dependencies.

```shell
pip3 install attrs numpy==1.26.4 decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install torch==2.2.0 torch-npu==2.2.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

--------------------------------

### Install Required Libraries

Source: https://github.com/ascend/docs/blob/main/sources/accelerate/quick_start.rst

Install the Hugging Face libraries (`datasets`, `evaluate`, `transformers`) and scikit-learn using pip. The command uses the Tsinghua PyPI mirror as the index URL.

```bash
pip install datasets evaluate transformers scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple
```

--------------------------------

### Expected Output for Installation Verification

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/install.rst

This is the expected output when OpenCompass and the NPU stack are installed correctly in a single-NPU environment. The `0` is the index of the current device, not a device count.

```shell
opencompass version:  0.3.3
NPU devices:  0
```

--------------------------------

### Start Ray Head Node

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Start a Ray node on a single machine, which will act as the head node. This is the default behavior when no address is specified.

```bash
ray start --head
```

--------------------------------

### Start Ray with Custom Temporary Directory

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Start a Ray head node and specify a custom directory for temporary files and logs. This is recommended for long-running, heavy-load tasks to avoid filling up the default `/tmp` directory.

```bash
ray start --head --temp-dir=/data/ray_tmp
```

--------------------------------

### Launch Training Script with Accelerate

Source: https://github.com/ascend/docs/blob/main/sources/pytorch/quick_start.rst

Use the `accelerate launch` command to start a training script with specified configurations like mixed precision and model paths. Ensure environment variables like MODEL_NAME and DATASET_NAME are set.

```bash
accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_NAME \
    --use_ema \
    --resolution=512 --center_crop --random_flip \
    --train_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --max_train_steps=15000 \
    --learning_rate=1e-05 \
    --max_grad_norm=1 \
    --lr_scheduler="constant" --lr_warmup_steps=0 \
    --output_dir="sd-pokemon-model"
```
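
Because the command reads `MODEL_NAME` and `DATASET_NAME` from the environment, an unset variable only surfaces as an error after the launcher has started. A small pre-flight check can catch this earlier. The sketch below is one way to do it; the helper name `missing_env_vars` is hypothetical.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Variable names taken from the launch command above.
missing = missing_env_vars(["MODEL_NAME", "DATASET_NAME"])
if missing:
    print("Set these variables before launching:", ", ".join(missing))
else:
    print("Environment looks ready for accelerate launch.")
```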

--------------------------------

### Verify Kernels Installation

Source: https://github.com/ascend/docs/blob/main/sources/kernels/quick_start.rst

This snippet demonstrates how to download an optimized kernel from the Hugging Face hub and use it for a fast GELU computation on a CUDA device. Ensure you have the 'kernels' library installed.

```python
import torch
from kernels import get_kernel

# Download optimized kernels from the Hugging Face hub
activation = get_kernel("kernels-community/activation")

# Create a random tensor
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")

# Run the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
```

--------------------------------

### Start Training with Llama3-8B Model

Source: https://github.com/ascend/docs/blob/main/sources/torchtitan/quick_start.rst

Execute the run_train.sh script, specifying the configuration file for the Llama3-8B model.

```shell
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

--------------------------------

### Start Ray Worker Node

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Start a Ray worker node and connect it to the head node. Replace `<head_node_ip>` with the actual IP address of the head node.

```bash
ray start --address='<head_node_ip>:6379'
```

--------------------------------

### Verify OpenCompass and NPU Installation

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/install.rst

Run this Python script to verify the installation of OpenCompass and check NPU device availability. Successful execution prints the OpenCompass version and NPU device information.

```python
import torch
import torch_npu  # registers the NPU backend so that torch.npu is available
import opencompass

print("opencompass version: ", opencompass.__version__)
print("NPU devices: ", torch.npu.current_device())
```

--------------------------------

### Start Ray with Explicit Resource Declaration

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Start a Ray head node and explicitly declare the number of CPUs and GPUs available to the Ray cluster. This is useful when you want to allocate only a portion of the machine's resources to Ray.

```bash
ray start --head --num-cpus=4 --num-gpus=1
```

--------------------------------

### Start Ray with Custom Port

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Start a Ray head node and specify a custom communication port (default is 6379). This is useful when running multiple Ray clusters on the same host.

```bash
ray start --head --port=6380
```
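
The `ray start` variants in this document differ only in their flags, so a script that manages several nodes may prefer to assemble the command programmatically. The sketch below builds the argument list from the flags that appear in this document; the function name `ray_start_command` is hypothetical, and the function only constructs the command without running it.

```python
import shlex


def ray_start_command(head=True, address=None, port=None, dashboard_host=None,
                      dashboard_port=None, temp_dir=None, num_cpus=None,
                      num_gpus=None):
    """Build a `ray start` argument list from the flags used in this document."""
    cmd = ["ray", "start"]
    if head:
        cmd.append("--head")
    options = [
        ("--address", address),
        ("--port", port),
        ("--dashboard-host", dashboard_host),
        ("--dashboard-port", dashboard_port),
        ("--temp-dir", temp_dir),
        ("--num-cpus", num_cpus),
        ("--num-gpus", num_gpus),
    ]
    # Only emit the flags that were explicitly provided.
    for flag, value in options:
        if value is not None:
            cmd.append(f"{flag}={value}")
    return cmd


cmd = ray_start_command(port=6380, temp_dir="/data/ray_tmp", num_cpus=4, num_gpus=1)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For a worker node, call it with `head=False` and an `address` such as `"<head_node_ip>:6379"`.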

--------------------------------

### Verify Ascend Operator Installation

Source: https://github.com/ascend/docs/blob/main/sources/opencv/install.rst

Execute the Ascend operator unit tests to confirm that OpenCV has been successfully compiled and installed with Ascend support. Successful completion indicates proper integration.

```shell
cd path/to/opencv/build/bin
./opencv_test_cannops
```

--------------------------------

### Example JSON Data Format

Source: https://github.com/ascend/docs/blob/main/sources/wenet/quick_start.rst

Illustrates the JSON format for each line in the data.list file, containing key, wav path, and text transcription.

```json
{"key": "BAC009S0002W0122", "wav": "/export/data/asr-data/OpenSLR/33//data_aishell/wav/train/S0002/BAC009S0002W0122.wav", "txt": "而对楼市成交抑制作用最大的限购"}
```
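
A malformed line in `data.list` is easier to diagnose before training starts. The sketch below parses one line and checks that the three fields named above are present; the helper name `parse_data_list_line` is hypothetical and is not part of WeNet.

```python
import json

REQUIRED_KEYS = ("key", "wav", "txt")


def parse_data_list_line(line):
    """Parse one data.list line and check that the expected fields are present."""
    entry = json.loads(line)
    missing = [k for k in REQUIRED_KEYS if k not in entry]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return entry


# The sample line shown above.
line = (
    '{"key": "BAC009S0002W0122", '
    '"wav": "/export/data/asr-data/OpenSLR/33//data_aishell/wav/train/S0002/BAC009S0002W0122.wav", '
    '"txt": "而对楼市成交抑制作用最大的限购"}'
)
entry = parse_data_list_line(line)
print(entry["key"], entry["wav"])
```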

--------------------------------

### Expected Output for Verification

Source: https://github.com/ascend/docs/blob/main/sources/timm/install.rst

This is the expected output when timm and the NPU stack are installed correctly in a single-NPU environment. The `0` is the index of the current device, not a device count.

```shell
timm version: 1.0.8.dev0
NPU devices: 0
```

--------------------------------

### Transformers with Local Kernels

Source: https://github.com/ascend/docs/blob/main/sources/kernels/quick_start.rst

This example demonstrates loading a causal language model using Transformers with local kernels. Performance can be compared by commenting out the `kernel_config` argument. Note: this requires Transformers to be installed from source.

```python
import time
import logging
from transformers import AutoModelForCausalLM, AutoTokenizer, KernelConfig


# Set the level to `DEBUG` to see which kernels are being called.
logging.basicConfig(level=logging.DEBUG)

model_name = "/root/Qwen3"

kernel_mapping = {
    "RMSNorm":
        "/kernels-ext-npu/rmsnorm:rmsnorm",
}

kernel_config = KernelConfig(kernel_mapping, use_local_kernel=True)

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    kernel_config=kernel_config
)

# Prepare the model input
prompt = "What is the result of 100 + 100?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Warm_up
for _ in range(2):
    generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()

# Print Runtime
for _ in range(5):
    start_time = time.time()
    generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
    print("runtime: ", time.time() - start_time)
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
    print("content:", content)
```

--------------------------------

### Create Small Datasets for Training and Evaluation

Source: https://github.com/ascend/docs/blob/main/sources/transformers/fine-tune.rst

Create smaller subsets of the tokenized dataset for training and evaluation by shuffling and selecting a specified number of examples. This is useful for faster iteration during development. The commented-out lines show how to select the full datasets.

```python
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# The lines below load the full training and evaluation sets instead
# full_train_dataset = tokenized_datasets["train"]
# full_eval_dataset = tokenized_datasets["test"]
```

--------------------------------

### Configure Ray Environment Variable Before Start

Source: https://github.com/ascend/docs/blob/main/sources/Ray/quick_start.rst

Set the `RAY_DEDUP_LOGS` environment variable to 0 before starting the Ray head node to disable log deduplication. This is recommended for debugging NPU issues.

```bash
export RAY_DEDUP_LOGS=0
ray start --head
```

--------------------------------

### llama.cpp Inference Log Output

Source: https://github.com/ascend/docs/blob/main/sources/llama_cpp/quick_start.rst

Example output from a successful llama.cpp inference run, detailing model loading, metadata, and vocabulary information.

```shell
Log start
main: build = 3520 (8e707118)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1728907816
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/jiahao/models/llama3-8b-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "..."
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0

```

--------------------------------

### Transformers with Remote Kernels

Source: https://github.com/ascend/docs/blob/main/sources/kernels/quick_start.rst

This example shows how to load a causal language model using Transformers with remote kernels enabled. Set logging to DEBUG to see which kernels are called. Performance can be compared by commenting out `use_kernels=True`.

```python
import time
import logging
from transformers import AutoModelForCausalLM, AutoTokenizer


# Set the level to `DEBUG` to see which kernels are being called.
logging.basicConfig(level=logging.DEBUG)

model_name = "Qwen/Qwen3-0.6B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    use_kernels=True,
)

# Prepare the model input
prompt = "What is the result of 100 + 100?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Warm_up
for _ in range(2):
    generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()

# Print Runtime
for _ in range(5):
    start_time = time.time()
    generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
    print("runtime: ", time.time() - start_time)
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
    print("content:", content)
```
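
The warm-up and timing loops above follow a pattern that can be factored into a helper, so the run with kernels and the run without them are measured the same way. This is a minimal sketch; the name `benchmark` and the stand-in workload are hypothetical, and `time.perf_counter` is used here instead of `time.time` because it is intended for measuring short intervals.

```python
import time


def benchmark(fn, warmup=2, runs=5):
    """Call `fn` `warmup` times untimed, then return `runs` wall-clock durations."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; in the example above this would wrap model.generate(...).
durations = benchmark(lambda: sum(range(100_000)))
print(f"mean runtime: {sum(durations) / len(durations):.6f} s")
```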

--------------------------------

### Interactive Chat with Model

Source: https://github.com/ascend/docs/blob/main/sources/torchchat/quick_start.rst

Engage in an interactive chat session with a downloaded model. Specify the model name, e.g., `llama3.1`, to start the chat.

```shell
# Interactive chat
python torchchat.py chat llama3.1
```

--------------------------------

### Train NLP Model on Ascend NPU

Source: https://github.com/ascend/docs/blob/main/sources/accelerate/quick_start.rst

Set the HF_ENDPOINT to mirror HuggingFace domains for faster downloads in China. Navigate to the Accelerate examples directory and run the NLP training script. Successful training is indicated by specific log messages.

```bash
# Point the HF endpoint at a mirror so users in mainland China can download data and models
export HF_ENDPOINT=https://hf-mirror.com
# Enter the project directory
cd accelerate/examples
# Train the model
python nlp_example.py
```

--------------------------------

### Model Inference Output and Performance Analysis

Source: https://github.com/ascend/docs/blob/main/sources/torchchat/quick_start.rst

This output shows an example of text generation, including device information, model loading time, generated text, and detailed performance metrics like token throughput and memory usage.

```text
Using device=npu Ascend910B3
Loading model...
Time to load model: 4.42 seconds
-----------------------------------------------------------
write me a story about a boy and his bear friend
Once upon a time, in a dense forest, there lived a young boy named Timmy. Timmy was a curious and adventurous boy who loved exploring the woods behind his village. One day, while wandering deeper into the forest than he had ever gone before, Timmy stumbled upon a magnificent brown bear. The bear was enormous, with a thick coat of fur and piercing yellow eyes. At first, Timmy was frightened, but to his surprise, the bear didn't seem to be threatening him. Instead, the bear gently approached Timmy and began to sniff him.

As the days passed, Timmy and the bear, whom he named Boris, became inseparable friends. Boris was unlike any bear Timmy had ever seen before. He was incredibly intelligent and could understand human language. Boris would often sit by Timmy's side as he read books or helped with his chores. The villagers were initially wary of Boris, but as they saw how kind and gentle he was, they grew
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 199 tokens                 
Time for inference 1: 13.3118 sec total                 
Time to first token: 0.6189 sec with parallel prefill.                

        Total throughput: 15.0242 tokens/sec, 0.0666 s/token                 
First token throughput: 1.6157 tokens/sec, 0.6189 s/token                 
Next token throughput: 15.6781 tokens/sec, 0.0638 s/token                     

Bandwidth achieved: 241.30 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


Warning: Excluding compile in calculations                 
        Average tokens/sec (total): 15.02                 
    Average tokens/sec (first token): 1.62                 
    Average tokens/sec (next tokens): 15.68 
                    
    Memory used: 17.23 GB
```
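
The throughput figures in the log can be reproduced from the token count and the two timings. The sketch below does that arithmetic. The assumption that the total throughput counts one extra token (the first one) on top of the 199 reported is inferred from the numbers in this log, not taken from the torchchat source.

```python
generated_tokens = 199          # "Generated 199 tokens"
total_time = 13.3118            # "Time for inference 1", in seconds
time_to_first_token = 0.6189    # "Time to first token", in seconds

# Assumption: the total counts the first token in addition to the 199 reported.
total_throughput = (generated_tokens + 1) / total_time
# The first token is produced during prefill.
first_token_throughput = 1 / time_to_first_token
# The remaining tokens are produced in the time left after prefill.
next_token_throughput = generated_tokens / (total_time - time_to_first_token)

print(f"total:       {total_throughput:.2f} tokens/sec")
print(f"first token: {first_token_throughput:.2f} tokens/sec")
print(f"next tokens: {next_token_throughput:.2f} tokens/sec")
```

Under that assumption the results agree with the logged values (15.0242, 1.6157, and 15.6781 tokens/sec) to about three decimal places; the small differences are consistent with the log rounding its timings.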

--------------------------------

### Train Model on Ascend NPU

Source: https://github.com/ascend/docs/blob/main/sources/wenet/quick_start.rst

Start model training on Ascend NPU. The script automatically acquires NPU card numbers and sets environment variables. Specify NPU card numbers by modifying ASCEND_RT_VISIBLE_DEVICES.

```shell
bash run_npu.sh --stage 4 --stop_stage 4
```

--------------------------------

### SGLang Inference Verification Script

Source: https://github.com/ascend/docs/blob/main/sources/sglang/quick_start.rst

This Python script verifies SGLang installation by performing inference with multiple prompts on an Ascend NPU. Ensure the model path is correct.

```python
# example.py
import torch

import sglang as sgl

def main():
    
    prompts = [
        "Hello, my name is",
        "The Independence Day of the United States is",
        "The capital of Germany is",
        "The full form of AI is",
    ] * 1

    llm = sgl.Engine(model_path="/Qwen2.5/Qwen2.5-0.5B-Instruct", device="npu", attention_backend="ascend")

    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}
    
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == '__main__':
    main()
```

--------------------------------

### Build llama.cpp with CANN Support

Source: https://github.com/ascend/docs/blob/main/sources/llama_cpp/install.rst

Compile llama.cpp using CMake, enabling CANN support and setting the build type to release. This step requires the CANN-toolkit to be installed and configured.

```shell
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release
```

--------------------------------

### Import torch-npu

Source: https://github.com/ascend/docs/blob/main/sources/timm/quick_start.rst

Import `torch` and then `torch_npu` (the Python module provided by the torch-npu package) in your entry script. Ensure your Ascend environment and timm are set up.

```python
import torch
import torch_npu
```

--------------------------------

### Prepare Training Data

Source: https://github.com/ascend/docs/blob/main/sources/wenet/quick_start.rst

Organize training data into wav.scp and text formats using the local/aishell_data_prep.sh script. wav.scp maps wav_id to wav_path, and text maps wav_id to text_label.

```shell
bash run_npu.sh --stage 0 --stop_stage 0
```

--------------------------------

### Build HTML Documentation

Source: https://github.com/ascend/docs/blob/main/README.md

Execute this command to build the HTML version of the documentation locally. This command is part of the Sphinx build process.

```bash
make html
```

--------------------------------

### Launch SGLang Server on NPU

Source: https://github.com/ascend/docs/blob/main/sources/sglang/quick_start.rst

Use this command to launch the SGLang server, specifying the model, device, port, and attention backend for Ascend NPU.

```shell
# Launch the SGLang server on NPU
python -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct \
    --device npu --port 8000 --attention-backend ascend \
    --host 0.0.0.0 --trust-remote-code
```
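
Once the server is running, it can be queried over HTTP. The sketch below builds a chat request using only the standard library. It assumes the server exposes an OpenAI-style `/v1/chat/completions` endpoint on the port chosen above; the helper name `build_chat_request` and the `max_tokens` value are hypothetical choices for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, host="127.0.0.1", port=8000,
                       model="Qwen/Qwen2.5-0.5B-Instruct"):
    """Build a POST request for an OpenAI-style chat completion endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        # Assumption: the server accepts OpenAI-style requests at this path.
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("The capital of Germany is")
print(request.full_url)

# With the server running, send the request with:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["choices"][0]["message"]["content"])
```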

--------------------------------

### Image Processing with Ascend NPU in Python

Source: https://github.com/ascend/docs/blob/main/sources/opencv/quick_start.rst

This Python snippet demonstrates image processing on Ascend NPUs using OpenCV. It covers CANN initialization, device selection, applying add, rotate, and flip operations, and finalization. Input and output image paths are specified via command-line arguments.

```python
# This file is part of OpenCV project.
# It is subject to the license terms in the LICENSE file found in the top-level directory
# of this distribution and at http://opencv.org/license.html.

import numpy as np
import cv2
import argparse

parser = argparse.ArgumentParser(description='This is a sample for image processing with Ascend NPU.')
parser.add_argument('image', help='path to input image')
parser.add_argument('output', help='path to output image')
args = parser.parse_args()

# Read the input image
img = cv2.imread(args.image)
# Generate Gaussian noise
gaussNoise = np.random.normal(0, 25, (img.shape[0], img.shape[1], img.shape[2])).astype(img.dtype)

# Initialize CANN and select the device
cv2.cann.initAcl()
cv2.cann.setDevice(0)

# Add the Gaussian noise to the input image
output = cv2.cann.add(img, gaussNoise)
# Rotate the image (0, 1, 2 mean a rotation of 90°, 180°, 270° respectively)
output = cv2.cann.rotate(output, 0)
# Flip the image (0, a positive value, a negative value mean flipping around the x axis, the y axis, and both axes respectively)
output = cv2.cann.flip(output, 0)
# Write the output image
cv2.imwrite(args.output, output)

# Finalize CANN
cv2.cann.finalizeAcl()
```

--------------------------------

### Configure Docker for Ascend Devices

Source: https://github.com/ascend/docs/blob/main/sources/sglang/install.rst

Use `drun` with specific device and volume mounts to prepare the Docker environment for Ascend hardware. Ensure necessary Ascend drivers and configurations are accessible within the container.

```bash
--device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
        --device=/dev/davinci_manager --device=/dev/hisi_hdc \
        --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
        --volume /etc/ascend_install.info:/etc/ascend_install.info \
        --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/
```
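
The `--device` flags follow a regular pattern, so they can be generated for whichever NPUs a host exposes. A small sketch, assuming device nodes are named `/dev/davinci<N>` as in the fragment above; `ascend_device_flags` is a hypothetical helper, not part of any Ascend tooling.

```python
def ascend_device_flags(device_ids):
    """Return docker --device flags for the given NPU indices plus the shared nodes."""
    flags = [f"--device=/dev/davinci{index}" for index in device_ids]
    # Management nodes that appear alongside the per-NPU devices in the fragment.
    flags += ["--device=/dev/davinci_manager", "--device=/dev/hisi_hdc"]
    return flags


# NPUs 12 through 15, matching the fragment above.
print(" ".join(ascend_device_flags(range(12, 16))))
```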

--------------------------------

### View TorchChat Help

Source: https://github.com/ascend/docs/blob/main/sources/torchchat/quick_start.rst

Use the --help flag to see available commands and their usage.

```shell
# Show help
python torchchat.py --help
```

--------------------------------

### Prepare WeNet Data Format

Source: https://github.com/ascend/docs/blob/main/sources/wenet/quick_start.rst

Generate the data.list file in the format required by WeNet. Each line is a JSON object containing key, wav file path, and text content.

```shell
bash run_npu.sh --stage 3 --stop_stage 3
```
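
To make the `data.list` layout concrete, here is a hedged sketch that writes and validates entries of that shape. The field names `key`, `wav`, and `txt` are an assumption based on the description above, and the helper names and sample path are hypothetical.

```python
import json

REQUIRED_FIELDS = {"key", "wav", "txt"}  # assumed field names


def make_data_list_line(key, wav_path, text):
    """Serialize one utterance as a single JSON line."""
    entry = {"key": key, "wav": wav_path, "txt": text}
    return json.dumps(entry, ensure_ascii=False)


def parse_data_list(lines):
    """Parse data.list lines, skipping blanks and rejecting incomplete entries."""
    entries = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        entries.append(entry)
    return entries


line = make_data_list_line("utt_0001", "/data/wav/utt_0001.wav", "hello world")
print(line)
print(parse_data_list([line]))
```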

--------------------------------

### Text Generation Pipeline

Source: https://github.com/ascend/docs/blob/main/sources/transformers/inference.rst

Generate conversational responses or text completions. This example uses a conversational model and is designed for chat-like interactions.

```python
from transformers import pipeline

generator = pipeline(model="HuggingFaceH4/zephyr-7b-beta")
# Zephyr-beta is a conversational model, so let's pass it a chat instead of a single string
generator([{"role": "user", "content": "What is the capital of France? Answer in one word."}], do_sample=False, max_new_tokens=2)
```

--------------------------------

### Text Classification Pipeline

Source: https://github.com/ascend/docs/blob/main/sources/transformers/inference.rst

Classify text based on provided candidate labels. This example uses a large language model for classification.

```python
from transformers import pipeline
classifier = pipeline(model="meta-llama/Meta-Llama-3-8B-Instruct")
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)
```

--------------------------------

### Evaluate Transformers Model on MMLU with Ascend

Source: https://github.com/ascend/docs/blob/main/sources/lm_evaluation/quick_start.rst

Use this command to evaluate transformers models like Qwen2-0.5B-Instruct on the MMLU task using Ascend NPU. Ensure the HF_ENDPOINT is set for efficient data and model downloads.

```shell
# Replace HF domain name for domestic users to download data and models
export HF_ENDPOINT=https://hf-mirror.com

lm_eval --model hf \
    --model_args pretrained=Qwen2-0.5B-Instruct \
    --tasks mmlu \
    --device npu:0 \
    --batch_size 8
```

--------------------------------

### Create Conda Environment for OpenCV

Source: https://github.com/ascend/docs/blob/main/sources/opencv/install.rst

Use this command to create a new conda environment named 'opencv' with Python 3.10, which is required for the OpenCV installation.

```shell
# Create a Python 3.10 virtual environment named opencv
conda create -y -n opencv python=3.10
# Activate the virtual environment
conda activate opencv
```

--------------------------------

### Image-to-Image Upscaling Pipeline

Source: https://github.com/ascend/docs/blob/main/sources/transformers/inference.rst

Enhance image details and resolution using an image-to-image pipeline. This example uses a super-resolution model to upscale a low-resolution image.

```python
from PIL import Image
import requests
from transformers import pipeline

upscaler = pipeline("image-to-image", model="caidas/swin2SR-classical-sr-x2-64")
img = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
img = img.resize((64, 64))
upscaled_img = upscaler(img)  # super-resolution processing
print(img.size)
print(upscaled_img.size)
```

--------------------------------

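### Configure Chat Model Evaluation Task

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/quick_start.rst

Configure and launch an evaluation task for chat models from the command line. Make sure 'opencompass' and 'torch-npu' are installed. In --debug mode, tasks run sequentially and output is printed in real time.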

```shell
python run.py \
    --models hf_internlm2_chat_1_8b hf_qwen2_1_5b_instruct \
    --datasets demo_gsm8k_chat_gen demo_math_chat_gen \
    --debug
```

--------------------------------

### Create Python Virtual Environment

Source: https://github.com/ascend/docs/blob/main/sources/timm/install.rst

Use conda to create a new virtual environment named 'timm' with Python 3.10. Activate the environment before proceeding with installations.

```shell
conda create -y -n timm python=3.10
conda activate timm
```

--------------------------------

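### Configure Base Model Evaluation Task

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/quick_start.rst

Configure and launch an evaluation task for base models from the command line. Make sure 'opencompass' and 'torch-npu' are installed. In --debug mode, tasks run sequentially and output is printed in real time.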

```shell
python run.py \
    --models hf_internlm2_1_8b hf_qwen2_1_5b \
    --datasets demo_gsm8k_base_gen demo_math_base_gen \
    --debug
```

--------------------------------

### Create Python 3.10 Virtual Environment

Source: https://github.com/ascend/docs/blob/main/sources/opencompass/install.rst

Use this command to create a conda virtual environment named 'opencompass' with Python 3.10. Activate it before proceeding with installations.

```shell
# Create a Python 3.10 virtual environment
conda create -y -n opencompass python=3.10
# Activate the virtual environment
conda activate opencompass
```

--------------------------------

### Single-Device Inference with llama.cpp

Source: https://github.com/ascend/docs/blob/main/sources/llama_cpp/quick_start.rst

Execute LLM inference on a single Ascend NPU. Specify the model path, prompt, and other parameters like number of tokens and GPU layers.

```shell
./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
```
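
The short flags are terse, so the sketch below assembles the same command from named parameters. The flag meanings in the comments reflect my reading of the llama.cpp CLI (`-n` tokens to predict, `-e` escape processing in the prompt, `-ngl` layers to offload, `-sm none` no splitting across devices, `-mg` main device index); `build_llama_cli_args` is a hypothetical helper.

```python
import shlex


def build_llama_cli_args(model_path, prompt, n_predict=400, offload_layers=33,
                         device_index=0, binary="./build/bin/llama-cli"):
    """Return the argv list for a single-device llama-cli run."""
    return [
        binary,
        "-m", model_path,             # path to the GGUF model file
        "-p", prompt,                 # prompt text
        "-n", str(n_predict),         # number of tokens to generate
        "-e",                         # process escape sequences in the prompt
        "-ngl", str(offload_layers),  # layers offloaded to the accelerator
        "-sm", "none",                # do not split the model across devices
        "-mg", str(device_index),     # index of the device to use
    ]


args = build_llama_cli_args(
    "path_to_model", "Building a website can be done in 10 simple steps:"
)
print(shlex.join(args))
```

The resulting list can be passed directly to `subprocess.run(args, check=True)`, which avoids shell quoting issues in the prompt.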

--------------------------------

### Visual Question Answering Pipeline

Source: https://github.com/ascend/docs/blob/main/sources/transformers/inference.rst

Answer questions about an image. Provide the image URL or local path and the question. This example uses a large language model for VQA.

```python
from transformers import pipeline
vqa = pipeline(model="meta-llama/Meta-Llama-3-8B-Instruct")
output = vqa(
    image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png",
    question="What is the invoice number?",
)
output[0]["score"] = round(output[0]["score"], 3)
```

--------------------------------

### Convert Hugging Face Model to GGUF

Source: https://github.com/ascend/docs/blob/main/sources/llama_cpp/quick_start.rst

Use this script to convert Hugging Face models to the GGUF format required by llama.cpp. Ensure you have the necessary environment setup.

```shell
python convert_hf_to_gguf.py path/to/model
```

--------------------------------

### Initialize Tokenizer

Source: https://github.com/ascend/docs/blob/main/sources/transformers/fine-tune.rst

Initialize an AutoTokenizer for the specified model (e.g., Meta-Llama-3-8B-Instruct). This tokenizer will be used to convert text into tokens that the model can understand. The example shows how to encode a sample sentence.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Process text with the tokenizer
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
```

--------------------------------

### Download Aishell-1 Dataset

Source: https://github.com/ascend/docs/blob/main/sources/wenet/quick_start.rst

Execute the script to download the aishell-1 dataset. If the data is already downloaded, adjust the `$data` variable in the script and start from the next stage.

```shell
bash run_npu.sh --stage -1 --stop_stage -1
```

--------------------------------

### Download Entire Model Snapshot with Hugging Face Hub

Source: https://github.com/ascend/docs/blob/main/sources/transformers/modeldownload.rst

Use the `snapshot_download` function from the huggingface_hub library to download all files for a model repository. Specify the repository ID and a local cache directory.

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Qwen/Qwen2-7B-Instruct", cache_dir="./your/path/Qwen")
```

--------------------------------

### Clone Model Repository with Git LFS

Source: https://github.com/ascend/docs/blob/main/sources/transformers/modeldownload.rst

Clone a model repository using Git LFS. Ensure Git LFS is installed. This command downloads all model files, including large ones.

```shell
git lfs install
git clone https://hf-mirror.com/Qwen/Qwen2-7B-Instruct
```

--------------------------------

### Run llama.cpp Docker Container

Source: https://github.com/ascend/docs/blob/main/sources/llama_cpp/install.rst

Launch a Docker container for llama.cpp, mapping Ascend devices and necessary host directories. Mount your models directory to '/app/models' and specify the model path and GPU layers.

```shell
docker run --name llamacpp \
    --device /dev/davinci0  \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /PATH_TO_YOUR_MODELS/:/app/models \
    -it llama-cpp-cann -m /app/models/MODEL_PATH -ngl 32 \
    -p "Building a website can be done in 10 simple steps:"
```

--------------------------------

### Download Model from Hugging Face CLI

Source: https://github.com/ascend/docs/blob/main/sources/transformers/modeldownload.rst

Use the Hugging Face CLI to download a specific model, including its original files. Ensure you have the necessary permissions after accepting the license.

```shell
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct
```

--------------------------------

### Launch SGLang Server with Ascend Backend

Source: https://github.com/ascend/docs/blob/main/sources/sglang/install.rst

Execute the SGLang server within a Docker container, specifying the image, environment variables, model path, and Ascend attention backend. The server listens on all interfaces.

```bash
drun --env "HF_TOKEN=<secret>" \
        <image_name> \
        python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
```

--------------------------------

### Tokenize Dataset Function

Source: https://github.com/ascend/docs/blob/main/sources/transformers/fine-tune.rst

Define a function that tokenizes the dataset with the initialized tokenizer. It pads every example to the maximum length and truncates longer ones, so the model receives inputs of a consistent shape. The function is meant to be passed to `Dataset.map`; with `batched=True`, `map` hands it multiple examples per call for efficiency.

```python
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
```
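
To see what padding to a fixed length and truncation do to a batch, here is a toy sketch that needs no model download. `toy_tokenize` is a hypothetical whitespace tokenizer written for illustration only; it is not the transformers API and its token IDs are arbitrary.

```python
def toy_tokenize(texts, max_length=4, pad_id=0):
    """Mimic padding="max_length" and truncation=True on a batch of strings."""
    vocab = {}  # word -> id, built on the fly; id 0 is reserved for padding
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        ids = ids[:max_length]                      # truncation
        padding = max_length - len(ids)
        attention_mask.append([1] * len(ids) + [0] * padding)
        input_ids.append(ids + [pad_id] * padding)  # padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_tokenize(["a b c", "a b c d e f"])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with exactly `max_length` entries, and the attention mask marks which positions hold real tokens rather than padding.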

--------------------------------

### Single/Distributed Training with timm

Source: https://github.com/ascend/docs/blob/main/sources/timm/quick_start.rst

Launch training for timm-based image classification models on single or multi-NPU devices. Specify the number of NPUs, model, dataset path, and training parameters.

```shell
num_npus=1
./distributed_train.sh $num_npus path/to/dataset/ImageNet-1000 \
    --device npu \
    --model seresnet34 \
    --sched cosine \
    --epochs 150 \
    --warmup-epochs 5 \
    --lr 0.4 \
    --reprob 0.5 \
    --remode pixel \
    --batch-size 256 \
    --amp -j 4
```

--------------------------------

### Check Ascend Device Information

Source: https://github.com/ascend/docs/blob/main/sources/llama_cpp/install.rst

Run this command within the Ascend environment or container to display information about the available NPUs.

```shell
npu-smi info
```

--------------------------------

### FSDP Training Configuration

Source: https://github.com/ascend/docs/blob/main/sources/torchtitan/quick_start.rst

Configure for Fully Sharded Data Parallel (FSDP) training. data_parallel_shard_degree = -1 indicates default FSDP.

```toml
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1  # -1 selects default FSDP
tensor_parallel_degree = 1
pipeline_parallel_degree = 1
context_parallel_degree = 1
```
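
A shard degree of -1 is a placeholder that has to be resolved against the number of devices. The sketch below shows one plausible resolution rule: the shard degree takes whatever factor of the world size is left after the other parallel degrees. This is an assumption about the intended semantics, not code taken from TorchTitan, and `resolve_shard_degree` is a hypothetical name.

```python
def resolve_shard_degree(world_size, replicate=1, shard=-1, tensor=1,
                         pipeline=1, context=1):
    """Resolve shard=-1 to the remaining factor of world_size and validate the product."""
    other = replicate * tensor * pipeline * context
    if shard == -1:
        if world_size % other != 0:
            raise ValueError(
                f"world_size {world_size} is not divisible by the other degrees ({other})"
            )
        shard = world_size // other
    if other * shard != world_size:
        raise ValueError(
            f"product of degrees ({other * shard}) does not equal world_size {world_size}"
        )
    return shard


# With the configuration above on 8 NPUs, all devices are used for sharding.
print(resolve_shard_degree(world_size=8))
```

Under this rule, raising `tensor_parallel_degree` to 2 on the same 8 devices would leave a shard degree of 4.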