### Install msit Tool from Source

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Install the msit tool by cloning the repository and using pip. This is an example of source installation.

```bash
git clone https://gitcode.com/ascend/msit.git
cd msit/msit
pip install .
```

--------------------------------

### Eager Mode Device Limit Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/appendix/core_limit.md

Example demonstrating how to set and then retrieve device limits for a specific device in Eager mode.

```python
import torch
import torch_npu

torch.npu.set_device(0)
torch.npu.set_device_limit(0, 12, 20)
print(torch.npu.get_device_limit(0))
```

--------------------------------

### Check msit Installation

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Verify that the msit components, including the `msit-llm` package, are installed by running `msit check all`. The log output indicates whether the installation succeeded.

```bash
msit check all
```

--------------------------------

### Get NPU Compiler with Custom Backend

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/torchair/get_compiler.md

Demonstrates how to obtain an NPU compiler using `torchair.get_compiler` and integrate it with a custom backend for graph compilation. This example defines a custom backend that prints the graph module and uses the retrieved compiler.

```python
import os
import torch
from torch._functorch.aot_autograd import aot_module_simplified
import torch_npu
import torchair
from torchair.configs.compiler_config import CompilerConfig

class MM(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        x = x + y
        return x


def custom_backend(gm: torch.fx.GraphModule, example_inputs):
    compiler_config = CompilerConfig()
    compiler = torchair.get_compiler(compiler_config)
    print(gm)
    return aot_module_simplified(gm, example_inputs, fw_compiler=compiler)

torch.npu.set_device(0)
x = torch.ones([2, 2], dtype=torch.int32).npu()
y = torch.ones([2, 2], dtype=torch.int32).npu()
model = torch.compile(MM().npu(), backend=custom_backend, dynamic=False)
ret = model(x, y)
print(ret)
```

--------------------------------

### Install llm Component Package

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Install the llm component package with `msit install`. The `--find-links` flag points to the directory that contains the package files (`${llm_dir}` here).

```bash
msit install llm --find-links ${llm_dir}
```

--------------------------------

### Platform Configuration Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/scope/limit_core_num.md

This example shows how to view the maximum AI Core and Vector Core counts supported by the AI processor from a platform configuration file. This helps in understanding the constraints for the limit_core_num API.

```txt
[SoCInfo]
ai_core_cnt=24
cube_core_cnt=24
vector_core_cnt=48
```
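
As a rough illustration of reading these values programmatically, the sketch below parses the fragment above with Python's standard `configparser`. The helper name `read_core_counts` is hypothetical, and the text is embedded as a string because the location of the real platform configuration file is not given here; a real file may contain other sections that need extra handling.

```python
import configparser

# Sample text copied from the fragment above. A real platform configuration
# file would be read from disk instead (its path is not stated in this section).
PLATFORM_CONFIG_TEXT = """
[SoCInfo]
ai_core_cnt=24
cube_core_cnt=24
vector_core_cnt=48
"""


def read_core_counts(text: str) -> dict:
    """Return every `*_cnt` entry of the [SoCInfo] section as an integer."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["SoCInfo"]
    return {key: int(value) for key, value in section.items() if key.endswith("_cnt")}


core_counts = read_core_counts(PLATFORM_CONFIG_TEXT)
print(core_counts["ai_core_cnt"], core_counts["vector_core_cnt"])
```

The resulting counts can then serve as upper bounds when choosing arguments for the `limit_core_num` API.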

--------------------------------

### Example: Using RefData Type Conversion in Online Inference

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/ref_data.md

Demonstrates how to enable RefData type conversion for a PyTorch model using torch.compile with TorchAir's NPU backend. This setup is for online inference scenarios.

```python
import torch
import torch_npu
import torchair
from torch import nn
from torchair.configs.compiler_config import CompilerConfig

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
    def forward(self, x):
        return x.add_(1)

device = torch.device("npu:0")
config = CompilerConfig()
config.experimental_config.enable_ref_data = True
input0 = torch.ones((3,3), dtype=torch.float32)
input0 = input0.to(device)
model = Network()
npu_backend = torchair.get_npu_backend(compiler_config=config)
model = torch.compile(model, fullgraph=True, backend=npu_backend, dynamic=True)
```

--------------------------------

### Accuracy Comparison Output Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Example output from the msit llm compare command, showing the comparison process and the location where the results are saved.

```bash
msit_llm_logger - INFO - Comparing GE with FX
msit_llm_logger - INFO - All token ids in my_dump_data: dict_keys([0])
......
msit_llm_logger - INFO - Saved comparing results: ./msit_cmp_report_${timestamp}.csv
```

--------------------------------

### Example Fusion Switch Configuration (.cfg)

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/fusion_switch_file.md

Examples of the `.cfg` file that configures operator fusion rules: `"on"` enables a rule and `"off"` disables it. The three variants below enable rules individually, disable every rule through the `"ALL"` key, and disable everything except a few named rules.

```txt
{
    "Switch":{
        "GraphFusion":{
            "ConvToFullyConnectionFusionPass":"on",
            "SoftmaxFusionPass":"on",
            "ConvConcatFusionPass":"on",
            "MatMulBiasAddFusionPass":"on",
            "PoolingFusionPass":"on",
            "ZConcatv2dFusionPass":"on",
            "ZConcatExt2FusionPass":"on",
            "TfMergeSubFusionPass":"on"
        },
        "UBFusion":{
            "FusionVirtualOpSetSwitch":"on"
        }
    }
}
```

```txt
{
    "Switch":{
        "GraphFusion":{
            "ALL":"off"
        },
        "UBFusion":{
            "ALL":"off"
        }
    }
}
```

```txt
{
    "Switch":{
        "GraphFusion":{
            "ALL":"off",
            "SoftmaxFusionPass":"on"
        },
        "UBFusion":{
            "ALL":"off",
            "TbePool2dQuantFusionPass":"on"
        }
    }
}
```
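
A file of the third kind (everything off except a few rules) can also be generated rather than written by hand. The following is a minimal sketch using only the standard `json` module; the helper `build_fusion_switch` is a hypothetical name introduced for illustration, and the rule names are the ones shown above.

```python
import json


def build_fusion_switch(graph_on=(), ub_on=()):
    """Build a config that turns every rule off except the listed ones."""

    def section(enabled):
        rules = {"ALL": "off"}
        rules.update({name: "on" for name in enabled})
        return rules

    return {
        "Switch": {
            "GraphFusion": section(graph_on),
            "UBFusion": section(ub_on),
        }
    }


config = build_fusion_switch(
    graph_on=["SoftmaxFusionPass"],
    ub_on=["TbePool2dQuantFusionPass"],
)
config_text = json.dumps(config, indent=4)
print(config_text)
```

The printed text can be saved under whatever file name the fusion switch option is pointed at.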

--------------------------------

### Ascend IR record API Usage Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/ops/record.md

Demonstrates how to use the torchair.ops.record() function within a PyTorch NPU context. This example shows recording a task and then waiting for it using torchair.ops.wait().

```python
import torch
import torch_npu
import torchair
from torchair import CompilerConfig

def demo(x, y):
    with torchair.scope.npu_stream_switch('1'):
        mm = torch.mm(x, x)
        abs = torch.abs(mm)
        record = torchair.ops.record()
    add = torch.add(abs, 1)
    torchair.ops.wait([record])
    sub = torch.sub(x, mm)
    return add, sub

config = CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)
func = torch.compile(demo, backend=npu_backend, dynamic=False, fullgraph=True)
input1 = torch.ones(2, 2).npu()
input2 = torch.ones(2, 2).npu()
func(input1, input2)
```

--------------------------------

### Example Graph Description Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/data_dump.md

This is an example of the ge_proto*.txt file generated after setting the DUMP_GE_GRAPH environment variable. The 'name' field in the 'op' section represents the operator name.

```txt
graph {
  name: "online_0"
  input: "Add1_in_0:0"
  input: "Add1_in_1:0"
  op {
    name: "Add1_in_0"
    type: "Data"
    input: ""
    attr {
      key: "OUTPUT_IS_VAR"
      value {
        list {
          b: false
          val_type: VT_LIST_BOOL
        }
      }
    }
    ......
  }
  op{
    name: "Add2"
    type: "Data"
    ......
  }
}
```
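
To list the operator names in such a dump, a regular expression is often enough. The sketch below is an assumption-laden shortcut rather than a real parser: it assumes that `name` is the first field inside each `op` block, as in the excerpt above, and it runs on an abbreviated copy of that excerpt.

```python
import re

# Abbreviated copy of the dump shown above, including its "......" elisions.
GE_PROTO_TEXT = '''
graph {
  name: "online_0"
  input: "Add1_in_0:0"
  op {
    name: "Add1_in_0"
    type: "Data"
    ......
  }
  op{
    name: "Add2"
    type: "Data"
    ......
  }
}
'''

# Match `op {` or `op{` followed directly by its name field.
OP_NAME = re.compile(r'\bop\s*\{\s*name:\s*"([^"]+)"')

op_names = OP_NAME.findall(GE_PROTO_TEXT)
print(op_names)
```

The graph-level `name` field is skipped because it follows `graph {` rather than `op {`.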

--------------------------------

### Eager Mode Stream Limit Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/appendix/core_limit.md

Example showing how to set stream-specific core limits and then execute an operation within that stream's context.

```python
import torch
import torch_npu

batch_size = 2
hidden_size = 16
x = torch.randn(batch_size, hidden_size).npu()
stream = torch.npu.current_stream()

torch.npu.set_stream_limit(stream, 3, 8)
with torch.npu.stream(stream):
    output = torch_npu.npu_swiglu(x, dim=-1)
```

--------------------------------

### Product Mode Gear Configuration Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/dynamic_gears_merge_policy.md

In product mode, gears are configured using a Cartesian product, allowing for simpler configuration when many combinations are needed. This example configures gears for a tensor with two dimensions.

```python
torchair.inference.set_dim_gears(input3, dim_gears={0:[2, 3, 4], 1:[10, 20]})
```

--------------------------------

### Text Graph Dump Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/graph_dump.md

Example of a graph dump in .txt format, illustrating the structure of graph operations, attributes, and input/output descriptions.

```text
name: "graph_1"
op {
  name: "arg0_1"
  type: "Data"
  attr {
    key: "_input_name_key"
    value {
      list {
        s: "x"
      }
    }
  }
  input_desc {
    name: "x"
    dtype: DT_FLOAT
    shape {
      dim: 2
      dim: 2
      dim: 2
    }
    layout: "ND"
    device_type: "NPU"
  }
  output_desc {
    name: "y"
    dtype: DT_FLOAT
    shape {
      dim: 2
      dim: 2
      dim: 2
    }
    layout: "ND"
    attr {
      key: "_meta"
      value {
        s: "Tensor(dtype=torch.float32, shape=torch.Size([2, 2, 2]))"
      }
    }
    attr {
      key: "format_for_int"
      value {
        i: 2
      }
    }
    device_type: "NPU"
  }
}
.........
op {
  name: "arg1_1"
  type: "Data"
  attr {
    key: "_input_name_key"
    value {
      list {
        s: "x"
      }
    }
  }
  ......
}
```

--------------------------------

### Zip Mode Gear Configuration Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/dynamic_gears_merge_policy.md

In zip mode, gears are matched positionally: the i-th entries of each dimension's list together form one gear combination. This example lists six combinations for a tensor with two dimensions.

```python
torchair.inference.set_dim_gears(input3, dim_gears={0:[2, 3, 4, 2, 3, 4], 1:[10, 10, 10, 20, 20, 20]})
```
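
To make the difference between the two modes concrete, the plain-Python sketch below expands the `dim_gears` dictionaries from the product-mode and zip-mode examples into explicit gear combinations. It only illustrates the semantics described in this document (Cartesian product versus positional matching); it does not call torchair and is not how torchair implements the feature. The function names are hypothetical.

```python
from itertools import product


def expand_product(dim_gears):
    """Cartesian product of the per-dimension gear lists."""
    dims = sorted(dim_gears)
    return list(product(*(dim_gears[d] for d in dims)))


def expand_zip(dim_gears):
    """Positional pairing: the i-th entries of every list form one combination."""
    dims = sorted(dim_gears)
    return list(zip(*(dim_gears[d] for d in dims)))


product_combos = expand_product({0: [2, 3, 4], 1: [10, 20]})
zip_combos = expand_zip({0: [2, 3, 4, 2, 3, 4], 1: [10, 10, 10, 20, 20, 20]})

print(product_combos)
print(zip_combos)
print(set(product_combos) == set(zip_combos))
```

Both configurations describe the same six (dim 0, dim 1) combinations; product mode simply states them more compactly.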

--------------------------------

### Ascend IR Cache Index File Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/compile_cache.md

This is an example of the content found in an Ascend IR cache index file (`.idx`). It lists the available cache files and their corresponding graph keys and variable descriptor files.

```json
{
    "cache_file_list":[
        {
            "cache_file_name":"./cache_dir/graph_$key1_20230117202307.om",
            "graph_key":"graph_$key1",
            "var_desc_file_name":"./cache_dir/graph_$key1_20230117202307.rdcpkt"
        },
        {
            "cache_file_name":"./cache_dir/graph_$key1_20230117203007.om",
            "graph_key":"graph_$key1",
            "var_desc_file_name":"./cache_dir/graph_$key1_20230117203007.rdcpkt"
        }
    ]
}

```
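
Because the index is plain JSON, it can be inspected with the standard library. The sketch below groups cache files by `graph_key`, using the field names from the example above; the helper name is hypothetical and the index content is embedded as a string rather than read from an actual `.idx` file.

```python
import json
from collections import defaultdict

# Index content copied from the example above.
INDEX_TEXT = """
{
    "cache_file_list":[
        {
            "cache_file_name":"./cache_dir/graph_$key1_20230117202307.om",
            "graph_key":"graph_$key1",
            "var_desc_file_name":"./cache_dir/graph_$key1_20230117202307.rdcpkt"
        },
        {
            "cache_file_name":"./cache_dir/graph_$key1_20230117203007.om",
            "graph_key":"graph_$key1",
            "var_desc_file_name":"./cache_dir/graph_$key1_20230117203007.rdcpkt"
        }
    ]
}
"""


def group_cache_files(index_text):
    """Map each graph_key to the list of cache file names recorded for it."""
    grouped = defaultdict(list)
    for entry in json.loads(index_text)["cache_file_list"]:
        grouped[entry["graph_key"]].append(entry["cache_file_name"])
    return dict(grouped)


cache_files = group_cache_files(INDEX_TEXT)
for graph_key, files in cache_files.items():
    print(graph_key, len(files))
```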

--------------------------------

### Full Example: Enabling Debug Logging and Graph Dumps

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/debug_save.md

This example demonstrates how to enable debug information saving, configure graph dump types, and perform forward and backward passes with different input shapes. It includes enabling Dynamo logs and setting the compiler backend.

```python
import os
# Set the environment variable
os.environ["TORCH_COMPILE_DEBUG"] = "1"
import torch
import torch_npu
import torchair
import logging
# Enable Dynamo logs
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, output_code=True, graph_code=True)

config = torchair.CompilerConfig()
config.debug.graph_dump.type = "pbtxt"
npu_backend = torchair.get_npu_backend(compiler_config=config)
device = "npu:0"

class Model(torch.nn.Module):
    def forward(self, x):
        return 2 * x

model = Model().to(device)
model = torch.compile(model, backend=npu_backend, dynamic=False)

x = torch.randn(10, 10, requires_grad=True, device=device)
out = model(x)
loss_fn = torch.nn.MSELoss()
target = torch.randn(10, 10, device=device)
loss = loss_fn(out, target)
loss.backward()

x = torch.randn(20, 20, requires_grad=False, device=device)
out = model(x)
```

--------------------------------

### Compiling and Installing Custom Operator Package

Source: https://github.com/ascend/torchair/blob/master/feature/原地操作算子入图指导.md

Commands to compile the custom operator package and install it. Note that custom operator packages with the same vendor name will overwrite each other.

```bash
cd ./MyInplace
bash build.sh
bash build_out/custom*.run
cd ..
```

--------------------------------

### Example of Python Debug Logs during Ascend IR Graph Compilation

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/python_log_print.md

This example displays the detailed debug logs generated by TorchAir during the Ascend IR graph compilation process. It shows the state of the graph before and after optimization, along with input and output details.

```text
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.250.813 [npu_fx_compiler.py:242]2250956 before sym input optimization, graph is graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%scatter_update : [num_users=1] = call_function[target=torch.ops.npu.scatter_update.default](args = (%arg0_1, %arg1_1, %arg2_1, -2), kwargs = {})
return (scatter_update,)
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.251.624 [npu_fx_compiler.py:238]2250956 after sym input optimization, graph is graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%scatter_update : [num_users=1] = call_function[target=torch.ops.npu.scatter_update.default](args = (%arg0_1, %arg1_1, %arg2_1, -2), kwargs = {})
return (scatter_update,)
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.252.076 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.252.187 [npu_fx_compiler.py:113]2250956 target: arg0_1
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.253.160 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 2, 8] npu:Tensor(arg0_1:0, dtype=DT_FLOAT, size=[1, 1, 2, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.253.478 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.253.601 [npu_fx_compiler.py:113]2250956 target: arg1_1
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.254.196 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.int64, size=[1] npu:Tensor(arg1_1:0, dtype=DT_INT64, size=[1])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.254.503 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.254.609 [npu_fx_compiler.py:113]2250956 target: arg2_1
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.063 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 1, 8] npu:Tensor(arg2_1:0, dtype=DT_FLOAT, size=[1, 1, 1, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.330 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.433 [npu_fx_compiler.py:113]2250956 target: npu.scatter_update.default
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.676 [npu_fx_compiler.py:115]2250956 input 0: Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 2, 8] npu:Tensor(arg0_1:0, dtype=DT_FLOAT, size=[1, 1, 2, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.830 [npu_fx_compiler.py:115]2250956 input 1: Pack(meta:FakeTensor(dtype=torch.int64, size=[1] npu:Tensor(arg1_1:0, dtype=DT_INT64, size=[1])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.955 [npu_fx_compiler.py:115]2250956 input 2: Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 1, 8] npu:Tensor(arg2_1:0, dtype=DT_FLOAT, size=[1, 1, 1, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.256.042 [npu_fx_compiler.py:115]2250956 input 3: -2
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.297.158 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 2, 8] npu:Tensor(Scatter:0, dtype=DT_FLOAT, size=[1, 1, 2, 8])))
```
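
When filtering logs like these, it can help to split each prefixed line into fields. The regular expression below is inferred solely from the sample lines above and is therefore an assumption about the format, not a documented specification; continuation lines (such as the printed graph body) carry no prefix and are reported as `None`.

```python
import re

# Pattern inferred from the sample output above; not an official format definition.
LOG_LINE = re.compile(
    r"^\[(?P<level>\w+)\] TORCHAIR\((?P<pid>\d+),(?P<process>\w+)\):"
    r"(?P<timestamp>\S+ \S+) \[(?P<file>[^:\]]+):(?P<line>\d+)\]\d+ (?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of a prefixed log line, or None for continuation lines."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


sample = (
    "[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.252.187 "
    "[npu_fx_compiler.py:113]2250956 target: arg0_1"
)
record = parse_log_line(sample)
print(record["file"], record["line"], record["message"])
```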

--------------------------------

### Multi-Stream Expression Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/npugraph_ex/advanced/multi_stream.md

This example demonstrates how to use multiple streams and events to achieve parallel computation and control execution order for NPU operations. It includes creating custom streams, recording events for synchronization, and using `record_stream` to manage tensor memory lifecycle.

```python
import torch
import torch_npu


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, in1, in2, in3, in4):
        stream1 = torch.npu.Stream()
        stream2 = torch.npu.Stream()
        event1 = torch.npu.Event()
        event2 = torch.npu.Event()

        add_result = torch.add(in1, in2)
        # B is created on the default stream
        B = in3 + in4
        # Insert a record for synchronization; tasks after event1.wait(stream1) run only once this record has completed
        event1.record()
        with torch.npu.stream(stream1):
            # torch.mm (mm_result) runs only after torch.add (add_result) and the computation of B have finished
            event1.wait(stream1)
            # B is used on stream1
            mm_result = torch.mm(B, in4)
            # Insert a record for synchronization; tasks after event2.wait(stream2) run only once this record has completed
            event2.record()
            # record_stream: B is used on stream1, so extend the lifetime of the memory backing tensor B
            B.record_stream(stream1)
        mm1 = torch.mm(in3, in4)
        with torch.npu.stream(stream2):
            # torch.add (add2) runs only after torch.mm (mm_result) has finished
            event2.wait(stream2)
            add2 = torch.add(in3, in4)
        return add_result, mm_result, mm1, add2

model = Model().to("npu")
model = torch.compile(model, backend="npugraph_ex", fullgraph=False, dynamic=False)

in1 = torch.randn(1000, 1000, dtype = torch.float16).npu()
in2 = torch.randn(1000, 1000, dtype = torch.float16).npu()
in3 = torch.randn(1000, 1000, dtype = torch.float16).npu()
in4 = torch.randn(1000, 1000, dtype = torch.float16).npu()
result = model(in1, in2, in3, in4)

```

--------------------------------

### Example Model Definition

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/post_grad_custom_pass.md

Defines a sample PyTorch model with multiple operations, serving as the target for custom FX graph pass modifications.

```python
import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        mm = torch.mm(x, x)
        abs = torch.abs(mm)
        add = torch.add(abs, 1)
        sub = torch.sub(x, mm)
        return add, sub
```

--------------------------------

### Dynamo Native Log Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/python_log_print.md

Example of Dynamo's native log output during graph compilation, showing compilation reasons, traced graphs, and module details.

```txt
[2025-02-06 16:46:56,297] [0/0] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='return_value', user_stack=[<FrameSummary file /home/torchair_example/tests/examples/test_scatter_update.py, line 17 in forward>], graph_break=False)
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  ===== __compiled_fn_0 =====
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_var_ : torch.Tensor, L_indices_ : torch.Tensor, L_updates_ : torch.Tensor):
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_var_ = L_var_
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_indices_ = L_indices_
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_updates_ = L_updates_
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] 
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/torchair_example/tests/examples/test_scatter_update.py:16, code: output = torch_npu.scatter_update(var, indices, updates, -2)
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         scatter_update = torch.ops.npu.scatter_update(l_var_, l_indices_, l_updates_, -2);  l_var_ = l_indices_ = l_updates_ = None
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (scatter_update,)
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] 
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] 
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] Tabulate module missing, please install tabulate to log the graph in tabular format, logging code instead:
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] TRACED GRAPH
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]  ===== __compiled_fn_0 =====
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]     def forward(self, L_var_ : torch.Tensor, L_indices_ : torch.Tensor, L_updates_ : torch.Tensor):
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         l_var_ = L_var_
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         l_indices_ = L_indices_
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         l_updates_ = L_updates_
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] 
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/torchair_example/tests/examples/test_scatter_update.py:16, code: output = torch_npu.scatter_update(var, indices, updates, -2)
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         scatter_update = torch.ops.npu.scatter_update(l_var_, l_indices_, l_updates_, -2);  l_var_ = l_indices_ = l_updates_ = None
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         return (scatter_update,)
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] 
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] 
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] TRACED GRAPH TENSOR SIZES
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] ===== __compiled_fn_0 =====
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] l_var_: (1, 1, 2, 8)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] l_indices_: (1,)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] l_updates_: (1, 1, 1, 8)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] scatter_update: (1, 1, 2, 8)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] 
```

--------------------------------

### Example Fusion Log Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/fusion_switch_file.md

This log snippet shows the fusion switch file path being loaded during graph initialization. It confirms that the specified fusion configuration is being used.

```txt
concrete_graph/session.cpp:28   ge.fusionSwitchFile: /home/test/fusion_switch.cfg
```

--------------------------------

### Compile and Install Custom Operator Package

Source: https://github.com/ascend/torchair/blob/master/docs/zh/custom_op_graph/in_place_op_cases.md

Build and install the custom operator package for torch_npu. Ensure the Python version matches your environment. This command recompiles and installs the package, forcing a reinstall.

```bash
bash ci/build.sh --python=3.9
pip3 install dist/torch*.whl --force-reinstall --no-deps
```

--------------------------------

### Example C++ Debug Log Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/cplus_log_print.md

This is an example of the detailed debug logs generated by the C++ layer of TorchAir during graph execution.

```text
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.205 [static_npu_graph_executor.cpp:46]2250956 Assemble aten device input 0 at::Tensor(shape=[1, 1, 2, 8], dtype='float', device=npu:0, addr=0x12c041200000) to ge::Tensor(storage shape=[1, 1, 2, 8], origin shape=[1, 1, 2, 8], storage format=ND, origin format=ND, dtype=DT_FLOAT, device=NPU, addr=0x12c041200000)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.323 [static_npu_graph_executor.cpp:46]2250956 Assemble aten device input 1 at::Tensor(shape=[1], dtype='long int', device=npu:0, addr=0x12c041200200) to ge::Tensor(storage shape=[1], origin shape=[1], storage format=ND, origin format=ND, dtype=DT_INT64, device=NPU, addr=0x12c041200200)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.379 [static_npu_graph_executor.cpp:46]2250956 Assemble aten device input 2 at::Tensor(shape=[1, 1, 1, 8], dtype='float', device=npu:0, addr=0x12c041200400) to ge::Tensor(storage shape=[1, 1, 1, 8], origin shape=[1, 1, 1, 8], storage format=ND, origin format=ND, dtype=DT_FLOAT, device=NPU, addr=0x12c041200400)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.487 [static_npu_graph_executor.cpp:130]2250956 Create empty output 0 at::Tensor(shape=[1, 1, 2, 8], dtype='float', device=npu:0, addr=0x12c041201000)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.527 [static_npu_graph_executor.cpp:138]2250956 Assemble torch output 0 at::Tensor(shape=[1, 1, 2, 8], dtype='float', device=npu:0, addr=0x12c041201000) to ge::Tensor(storage shape=[1, 1, 2, 8], origin shape=[1, 1, 2, 8], storage format=ND, origin format=ND, dtype=DT_FLOAT, device=NPU, addr=0x12c041201000)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.591 [concrete_graph/session.cpp:238]2250956 Start to session load graph 0
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.090.305 [concrete_graph/session.cpp:250]2250956 Start to session execute graph 0
[INFO] TORCHAIR(2250956,python):2025-02-06-15:44:53.090.459 [static_npu_graph_executor.cpp:256]2250956 Static npu graph executor run graph 0 on stream 0x3345cca0 successfully.
```

--------------------------------

### Python Graph Dump Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/graph_dump.md

Example of a graph dump in Python format, showing data, shape, broadcast, expand, scatter, multiply, and cast operations.

```python
from torch import tensor
from torchair._ge_concrete_graph import ge_apis as ge
from torchair.ge._ge_graph import get_default_ge_graph

arg0_1_0 = ge.Data(index=0, dtype=0, shape=[2, 2, 2], placement="NPU", node_name="arg0_1")
arg1_1_0 = ge.Data(index=1, dtype=0, shape=[2], placement="NPU", node_name="arg1_1")

# File "/home/torchair_example/tests/examples/example_select_scatter.py", line 23, in forward output = torch.ops.aten.select_scatter(x, y, 0,1)
## FX Code: select_scatter = torch.ops.aten.select_scatter.default(arg0_1, arg1_1, 0, 1)
Shape_0 = ge.Shape(arg0_1_0, node_name="Shape")
BroadcastTo_0 = ge.BroadcastTo(1, Shape_0, node_name="BroadcastTo")
ExpandDims_0 = ge.ExpandDims(arg1_1_0, 0, node_name="ExpandDims")
BroadcastTo_1_0 = ge.BroadcastTo(ExpandDims_0, Shape_0, node_name="BroadcastTo_1")
ScatterElements_0 = ge.ScatterElements(arg0_1_0, BroadcastTo_0, BroadcastTo_1_0, axis=0, node_name="ScatterElements")

# File "/home/torchair_example/tests/examples/example_select_scatter.py", line 24, in forward return output*10
## FX Code: mul = torch.ops.aten.mul.Tensor(select_scatter, 10)
Mul_0 = ge.Mul(ScatterElements_0, ge.Const(10, dtype=0), node_name="Mul")
Cast_0 = ge.Cast(Mul_0, dst_type=0, node_name="Cast")

_ = ge.NetOutput([Cast_0], dependencies=[])
```

--------------------------------

### Installing Custom Operator Package

Source: https://github.com/ascend/torchair/blob/master/docs/zh/custom_op_graph/in_place_op_cases.md

Command to install the compiled custom operator package. The package is typically named custom_opp_<target os>_<target architecture>.run.

```bash
bash build_out/custom_opp_<target os>_<target architecture>.run
cd ..
```

--------------------------------

### Example Fused Operator in FX Graph Dump

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/pattern_fusion_pass.md

This text output shows an example of how a fused operator, npu_add_rms_norm_dynamic_quant, might appear in an FX graph dump after the pattern fusion pass has been applied.

```text
# No stacktrace found for following nodes
npu_add_rms_norm_dynamic_quant_default = torch.ops.npu.npu_add_rms_norm_dynamic_quant.default(arg2_1, arg1_1, arg0_1, output_mask = [True, True]);  arg2_1 = arg1_1 = arg0_1 = None
getitem_5: "i8[2, 3, 4]" = npu_add_rms_norm_dynamic_quant_default[0]
getitem_6: "f16[2, 3, 4]" = npu_add_rms_norm_dynamic_quant_default[2]
getitem_7: "f32[2, 3]" = npu_add_rms_norm_dynamic_quant_default[3];  npu_add_rms_norm_dynamic_quant_default = None
view_default: "i8[6, 4]" = torch.ops.aten.reshape.default(getitem_5, [6, 4]);  getitem_5 = None
view_default_1: "f32[6, 1]" = torch.ops.aten.reshape.default(getitem_7, [-1, 1]);  getitem_7 = None
return (view_default, view_default_1, getitem_6)
```

--------------------------------

### Prepare the PyTorch Model Script

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/compile_cache.md

Defines a PyTorch model that takes a custom input type (`InputMeta`) and a list-typed input, then compiles and runs it with `torch.compile`.

```python
import torch
import dataclasses
from typing import List
import torch_npu
import torchair
from torchair.configs.compiler_config import CompilerConfig

config = CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)

# InputMeta mimics the input-argument structure of the VLLM (Versatile Large Language Model) framework
@dataclasses.dataclass
class InputMeta:
    data: torch.Tensor
    is_prompt: bool

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 1)
        self.linear2 = torch.nn.Linear(2, 1)
        for param in self.parameters():
            torch.nn.init.ones_(param)

    def forward(self, x: InputMeta, kv: List[torch.Tensor]):
        return self.linear2(x.data) + self.linear2(kv[0])

x = InputMeta(data=torch.randn(2, 2).npu(), is_prompt=True)
kv = [torch.randn(2, 2).npu()]
model = Model().npu()
# Compile with torch.compile
compiled_model = torch.compile(model, backend=npu_backend)
# Run the prompt phase
res_prompt = compiled_model(x, kv)
x.is_prompt = False
# Run the decode phase
res_decode = compiled_model(x, kv)
```

--------------------------------

### TorchAir CompilerConfig Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/dynamo_export.md

Example of configuring CompilerConfig for graph export. This snippet shows how to enable automatic generation of ATC JSON configuration templates and include nn_module_stack information in the exported graph.

```python
import torch_npu, torchair
config = torchair.CompilerConfig()
# Enable automatic generation of the ATC JSON configuration file template
config.export.experimental.auto_atc_config_generated = True
# Carry nn_module_stack information in the exported graph
config.export.experimental.enable_record_nn_module_stack = True
```

--------------------------------

### Configure Online Operator Compilation

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/jit_compile.md

Set the jit_compile parameter through CompilerConfig to control how operators are compiled online. This applies to GE graph mode scenarios.

```python
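import torch
import torch_npu, torchair
config = torchair.CompilerConfig()
# Configure the online operator compilation option
config.experimental_config.jit_compile = "auto"
npu_backend = torchair.get_npu_backend(compiler_config=config)
# `model` is assumed to be an existing torch.nn.Module defined elsewhere
opt_model = torch.compile(model, backend=npu_backend)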
```

--------------------------------

### Enable SuperKernel Fusion Optimization

Source: https://github.com/ascend/torchair/blob/master/docs/zh/npugraph_ex/advanced/superkernel.md

Enable SuperKernel fusion optimization by setting `super_kernel_optimize` to `True` in `torch.compile` options. This example demonstrates how to configure the optimization, but it is for reference only and not directly runnable. Refer to the documentation for detailed parameter explanations.

```python
import torch
import torch_npu

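# `model` and the two `dict` values are placeholders; supply a real module and real option dictionaries
opt_model = torch.compile(model, backend="npugraph_ex", options={"super_kernel_optimize": True, "super_kernel_optimize_options": dict, "super_kernel_debug_options": dict}, dynamic=False, fullgraph=True)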
```

--------------------------------

### Add Operator Custom Decompositions Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/inference/cache_compile.md

This example demonstrates how to use custom_decompositions with cache_compile to handle operator decompositions, specifically for the torch.ops.aten.add.default operator. It involves registering a decomposition function and passing it during the cache_compile call.

```python
# Register the operator decomposition function
import torch, torch_npu, torchair
from torch._decomp import get_decompositions, register_decomposition
@register_decomposition(torch.ops.aten.add.default)
def test_add_decomp(t1, t2):
    return t1 + t2

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Pass the list of operators to decompose through custom_decompositions
        self.cached = torchair.inference.cache_compile(self.inner_forward,
            custom_decompositions=get_decompositions([torch.ops.aten.add.default]))

    def inner_forward(self, t1, t2):
        return torch.ops.aten.add(t1, t2)

    def forward(self, t1, t2):
        return self.cached(t1, t2)

# ...
```

--------------------------------

### Adapt the PyTorch Model Script to Support Compile Caching

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/compile_cache.md

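Split the model's original forward function into _forward and forward, and wrap the _forward logic (through the prompt and decode methods) with torchair.inference.cache_compile to enable compile caching. This approach suits functions that need to be traced multiple times.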

```python
import dataclasses
import logging
from typing import List

import torch
import torch_npu
import torchair
from torchair import logger
from torchair.configs.compiler_config import CompilerConfig

config = CompilerConfig()

logger.setLevel(logging.INFO)


# InputMeta mimics the input-argument structure of the vLLM framework
@dataclasses.dataclass
class InputMeta:
    data: torch.Tensor
    is_prompt: bool


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 1)
        self.linear2 = torch.nn.Linear(2, 1)
        for param in self.parameters():
            torch.nn.init.ones_(param)

        # Enable compile caching through torchair.inference.cache_compile
        self.cached_prompt = torchair.inference.cache_compile(self.prompt, config=config)
        self.cached_decode = torchair.inference.cache_compile(self.decode, config=config)

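    @torch.inference_mode()
    def forward(self, x: InputMeta, kv: List[torch.Tensor]):
        # Dispatch to the cached function that matches the current phase
        if x.is_prompt:
            return self.cached_prompt(x, kv)
        return self.cached_decode(x, kv)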

    def _forward(self, x, kv):
        return self.linear2(x.data) + self.linear2(kv[0])

    def prompt(self, x: InputMeta, kv: List[torch.Tensor]):
        return self._forward(x, kv)

    def decode(self, x: InputMeta, kv: List[torch.Tensor]):
        return self._forward(x, kv)
```
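The prompt/decode split can be tried without an NPU using the rough sketch below. It swaps `torchair.inference.cache_compile` for a hypothetical stand-in, `fake_cache_compile`, which only records the first call of each wrapped function, and it uses plain floats instead of tensors. None of this reflects how TorchAir implements caching. It only illustrates that each wrapped function is prepared once and then reused.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for torchair.inference.cache_compile (assumption):
# it records the first call of each wrapped function and nothing more.
compiled_names: List[str] = []


def fake_cache_compile(func: Callable) -> Callable:
    state = {"compiled": False}

    def wrapper(*args, **kwargs):
        if not state["compiled"]:
            compiled_names.append(func.__name__)
            state["compiled"] = True
        return func(*args, **kwargs)

    return wrapper


@dataclass
class InputMeta:
    data: float  # a float stands in for a tensor in this sketch
    is_prompt: bool


class Model:
    def __init__(self):
        self.cached_prompt = fake_cache_compile(self.prompt)
        self.cached_decode = fake_cache_compile(self.decode)

    def forward(self, x: InputMeta, kv: List[float]):
        # Pick the wrapped function that matches the current phase
        if x.is_prompt:
            return self.cached_prompt(x, kv)
        return self.cached_decode(x, kv)

    def _forward(self, x: InputMeta, kv: List[float]):
        return x.data + kv[0]

    def prompt(self, x: InputMeta, kv: List[float]):
        return self._forward(x, kv)

    def decode(self, x: InputMeta, kv: List[float]):
        return self._forward(x, kv)


model = Model()
x = InputMeta(data=1.0, is_prompt=True)
kv = [2.0]
res_prompt = model.forward(x, kv)        # first prompt call
x.is_prompt = False
res_decode = model.forward(x, kv)        # first decode call
res_decode_again = model.forward(x, kv)  # reuses the decode wrapper
print(compiled_names)
```

Keeping prompt and decode as separate wrapped functions gives each phase its own entry, so switching `is_prompt` selects an already prepared function instead of repeating the work.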

--------------------------------

### RefData Conversion Log Message Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/ref_data.md

An example log message indicating that a RefData type has been replaced with a Data type during graph optimization. This message appears when the RefData type conversion feature is enabled and active.

```text
[DEBUG] TORCHAIR 20240607 02:06:15 Replace RefData_5_3_20_20_1200_400_20_1_0_140251860631280:RefData with arg0_1:Data in graph graph_1
```
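When scanning debug logs for these conversions, a small parser can help. The sketch below uses only the Python standard library; the regular expression is inferred from the single sample line above, so other TorchAir versions may format the message differently. The function name `parse_refdata_replacement` is hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the sample log line only (assumption).
REPLACE_RE = re.compile(
    r"Replace (?P<ref_name>\S+):RefData with (?P<data_name>\S+):Data "
    r"in graph (?P<graph>\S+)"
)


def parse_refdata_replacement(line: str) -> Optional[Dict[str, str]]:
    """Return the replaced node names and graph name, or None if no match."""
    match = REPLACE_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "[DEBUG] TORCHAIR 20240607 02:06:15 Replace "
    "RefData_5_3_20_20_1200_400_20_1_0_140251860631280:RefData "
    "with arg0_1:Data in graph graph_1"
)
parsed = parse_refdata_replacement(sample)
print(parsed)
```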