### Install msit Tool from Source

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Install the msit tool by cloning the repository and running pip. This is an example of a source installation.

```bash
git clone https://gitcode.com/ascend/msit.git
cd msit/msit
pip install .
```

--------------------------------

### Eager Mode Device Limit Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/appendix/core_limit.md

Example demonstrating how to set and then retrieve device limits for a specific device in Eager mode.

```python
import torch
import torch_npu

torch.npu.set_device(0)
torch.npu.set_device_limit(0, 12, 20)
print(torch.npu.get_device_limit(0))
```

--------------------------------

### Check msit Installation

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Verify the installation of msit components, including the 'msit-llm' package, using the `msit check all` command. A successful installation is reported in the logs.

```bash
msit check all
```

--------------------------------

### Get NPU Compiler with Custom Backend

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/torchair/get_compiler.md

Demonstrates how to obtain an NPU compiler using `torchair.get_compiler` and integrate it with a custom backend for graph compilation. This example defines a custom backend that prints the graph module and uses the retrieved compiler.

```python
import os
import torch
from torch._functorch.aot_autograd import aot_module_simplified
import torch_npu
import torchair
from torchair.configs.compiler_config import CompilerConfig

class MM(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        x = x + y
        return x

def custom_backend(gm: torch.fx.GraphModule, example_inputs):
    compiler_config = CompilerConfig()
    compiler = torchair.get_compiler(compiler_config)
    print(gm)
    return aot_module_simplified(gm, example_inputs, fw_compiler=compiler)

torch.npu.set_device(0)
x = torch.ones([2, 2], dtype=torch.int32).npu()
y = torch.ones([2, 2], dtype=torch.int32).npu()
model = torch.compile(MM().npu(), backend=custom_backend, dynamic=False)
ret = model(x, y)
print(ret)
```

--------------------------------

### Install llm Component Package

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Install the llm component package using the `msit install` command. The `--find-links` flag points to the directory containing the package files.

```bash
msit install llm --find-links ${llm_dir}
```

--------------------------------

### Platform Configuration Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/scope/limit_core_num.md

This example shows how to view the maximum AI Core and Vector Core counts supported by the AI processor from a platform configuration file. These values are the upper bounds for the limit_core_num API.

```txt
[SoCInfo]
ai_core_cnt=24
cube_core_cnt=24
vector_core_cnt=48
```

--------------------------------

### Example: Using RefData Type Conversion in Online Inference

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/ref_data.md

Demonstrates how to enable RefData type conversion for a PyTorch model using torch.compile with TorchAir's NPU backend. This setup is for online inference scenarios.

```python
import torch
import torch_npu
import torchair
from torch import nn
from torchair.configs.compiler_config import CompilerConfig

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()

    def forward(self, x):
        return x.add_(1)

device = torch.device("npu:0")
config = CompilerConfig()
config.experimental_config.enable_ref_data = True
input0 = torch.ones((3,3), dtype=torch.float32)
input0 = input0.to(device)
model = Network()
npu_backend = torchair.get_npu_backend(compiler_config=config)
model = torch.compile(model, fullgraph=True, backend=npu_backend, dynamic=True)
```

--------------------------------

### Accuracy Comparison Output Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/cases/accuracy_cases.md

Example output from the `msit llm compare` command, showing the comparison process and the location where the results are saved.

```bash
msit_llm_logger - INFO - Comparing GE with FX
msit_llm_logger - INFO - All token ids in my_dump_data: dict_keys([0])
......
msit_llm_logger - INFO - Saved comparing results: ./msit_cmp_report_${timestamp}.csv
```

--------------------------------

### Example Fusion Switch Configuration (.cfg)

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/fusion_switch_file.md

This is an example of a .cfg file used to configure operator fusion rules. 'on' enables a rule, and 'off' disables it. Specific rules can be enabled or disabled individually.

```txt
{
    "Switch":{
        "GraphFusion":{
            "ConvToFullyConnectionFusionPass":"on",
            "SoftmaxFusionPass":"on",
            "ConvConcatFusionPass":"on",
            "MatMulBiasAddFusionPass":"on",
            "PoolingFusionPass":"on",
            "ZConcatv2dFusionPass":"on",
            "ZConcatExt2FusionPass":"on",
            "TfMergeSubFusionPass":"on"
        },
        "UBFusion":{
            "FusionVirtualOpSetSwitch":"on"
        }
    }
}
```

```txt
{
    "Switch":{
        "GraphFusion":{
            "ALL":"off"
        },
        "UBFusion":{
            "ALL":"off"
        }
    }
}
```

```txt
{
    "Switch":{
        "GraphFusion":{
            "ALL":"off",
            "SoftmaxFusionPass":"on"
        },
        "UBFusion":{
            "ALL":"off",
            "TbePool2dQuantFusionPass":"on"
        }
    }
}
```

--------------------------------

### Ascend IR record API Usage Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/ops/record.md

Demonstrates how to use the torchair.ops.record() function within a PyTorch NPU context. This example shows recording a task and then waiting for it using torchair.ops.wait().

```python
import torch
import torch_npu
import torchair
from torchair import CompilerConfig

def demo(x, y):
    with torchair.scope.npu_stream_switch('1'):
        mm = torch.mm(x, x)
        abs = torch.abs(mm)
        record = torchair.ops.record()
    add = torch.add(abs, 1)
    torchair.ops.wait([record])
    sub = torch.sub(x, mm)
    return add, sub

config = CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)
func = torch.compile(demo, backend=npu_backend, dynamic=False, fullgraph=True)
input1 = torch.ones(2, 2).npu()
input2 = torch.ones(2, 2).npu()
func(input1, input2)
```

--------------------------------

### Example Graph Description Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/data_dump.md

This is an example of the ge_proto*.txt file generated after setting the DUMP_GE_GRAPH environment variable. The 'name' field in each 'op' section is the operator name.

```txt
graph {
  name: "online_0"
  input: "Add1_in_0:0"
  input: "Add1_in_1:0"
  op {
    name: "Add1_in_0"
    type: "Data"
    input: ""
    attr {
      key: "OUTPUT_IS_VAR"
      value {
        list {
          b: false
          val_type: VT_LIST_BOOL
        }
      }
    }
    ......
  }
  op{
    name: "Add2"
    type: "Data"
    ......
  }
}
```

--------------------------------

### Eager Mode Stream Limit Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/appendix/appendix/core_limit.md

Example showing how to set stream-specific core limits and then execute an operation within that stream's context.

```python
import torch
import torch_npu

batch_size = 2
hidden_size = 16
x = torch.randn(batch_size, hidden_size).npu()
stream = torch.npu.current_stream()
torch.npu.set_stream_limit(stream, 3, 8)
with torch.npu.stream(stream):
    output = torch_npu.npu_swiglu(x, dim=-1)
```

--------------------------------

### Product Mode Gear Configuration Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/dynamic_gears_merge_policy.md

In product mode, gears are configured as a Cartesian product of the per-dimension values, which keeps the configuration short when many combinations are needed. This example configures gears for a tensor with two dimensions.

```python
torchair.inference.set_dim_gears(input3, dim_gears={0:[2, 3, 4], 1:[10, 20]})
```

--------------------------------

### Text Graph Dump Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/graph_dump.md

Example of a graph dump in .txt format, illustrating the structure of graph operations, attributes, and input/output descriptions.

```text
name: "graph_1"
op {
  name: "arg0_1"
  type: "Data"
  attr {
    key: "_input_name_key"
    value {
      list {
        s: "x"
      }
    }
  }
  input_desc {
    name: "x"
    dtype: DT_FLOAT
    shape {
      dim: 2
      dim: 2
      dim: 2
    }
    layout: "ND"
    device_type: "NPU"
  }
  output_desc {
    name: "y"
    dtype: DT_FLOAT
    shape {
      dim: 2
      dim: 2
      dim: 2
    }
    layout: "ND"
    attr {
      key: "_meta"
      value {
        s: "Tensor(dtype=torch.float32, shape=torch.Size([2, 2, 2]))"
      }
    }
    attr {
      key: "format_for_int"
      value {
        i: 2
      }
    }
    device_type: "NPU"
  }
}
.........
op {
  name: "arg1_1"
  type: "Data"
  attr {
    key: "_input_name_key"
    value {
      list {
        s: "x"
      }
    }
  }
  ......
}
```

--------------------------------

### Zip Mode Gear Configuration Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/dynamic_gears_merge_policy.md

In zip mode, gears are configured by matching the per-dimension lists positionally, so the i-th entries of each list form one gear. This example spells out all six combinations for a tensor with two dimensions.

```python
torchair.inference.set_dim_gears(input3, dim_gears={0:[2, 3, 4, 2, 3, 4], 1:[10, 10, 10, 20, 20, 20]})
```

--------------------------------

### Ascend IR Cache Index File Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/compile_cache.md

This is an example of the content found in an Ascend IR cache index file (`.idx`). It lists the available cache files and their corresponding graph keys and variable descriptor files.

```json
{
    "cache_file_list":[
        {
            "cache_file_name":"./cache_dir/graph_$key1_20230117202307.om",
            "graph_key":"graph_$key1",
            "var_desc_file_name":"./cache_dir/graph_$key1_20230117202307.rdcpkt"
        },
        {
            "cache_file_name":"./cache_dir/graph_$key1_20230117203007.om",
            "graph_key":"graph_$key1",
            "var_desc_file_name":"./cache_dir/graph_$key1_20230117203007.rdcpkt"
        }
    ]
}
```

--------------------------------

### Full Example: Enabling Debug Logging and Graph Dumps

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/debug_save.md

This example demonstrates how to enable debug information saving, configure the graph dump type, and run forward and backward passes with different input shapes. It also enables Dynamo logs and sets the compiler backend.

```python
import os
# Set the environment variable
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch
import torch_npu
import torchair
import logging

# Enable Dynamo logging
torch._logging.set_logs(dynamo=logging.DEBUG,aot=logging.DEBUG,output_code=True,graph_code=True)

config = torchair.CompilerConfig()
config.debug.graph_dump.type = "pbtxt"
npu_backend = torchair.get_npu_backend(compiler_config=config)
device = "npu:0"

class Model(torch.nn.Module):
    def forward(self, x):
        return 2 * x

model = Model().to(device)
model = torch.compile(model, backend=npu_backend, dynamic=False)

x = torch.randn(10, 10, requires_grad=True, device=device)
out = model(x)
loss_fn = torch.nn.MSELoss()
target = torch.randn(10, 10, device=device)
loss = loss_fn(out, target)
loss.backward()

x = torch.randn(20, 20, requires_grad=False, device=device)
out = model(x)
```

--------------------------------

### Compiling and Installing Custom Operator Package

Source: https://github.com/ascend/torchair/blob/master/feature/原地操作算子入图指导.md

Commands to compile the custom operator package and install it. Note that custom operator packages with the same vendor name will overwrite each other.

```bash
cd ./MyInplace
bash build.sh
bash build_out/custom*.run
cd ..
```

--------------------------------

### Example of Python Debug Logs during Ascend IR Graph Compilation

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/python_log_print.md

This example displays the detailed debug logs generated by TorchAir during the Ascend IR graph compilation process. It shows the state of the graph before and after optimization, along with input and output details.

```text
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.250.813 [npu_fx_compiler.py:242]2250956 before sym input optimization, graph is graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %scatter_update : [num_users=1] = call_function[target=torch.ops.npu.scatter_update.default](args = (%arg0_1, %arg1_1, %arg2_1, -2), kwargs = {})
    return (scatter_update,)
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.251.624 [npu_fx_compiler.py:238]2250956 after sym input optimization, graph is graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %scatter_update : [num_users=1] = call_function[target=torch.ops.npu.scatter_update.default](args = (%arg0_1, %arg1_1, %arg2_1, -2), kwargs = {})
    return (scatter_update,)
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.252.076 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.252.187 [npu_fx_compiler.py:113]2250956 target: arg0_1
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.253.160 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 2, 8] npu:Tensor(arg0_1:0, dtype=DT_FLOAT, size=[1, 1, 2, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.253.478 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.253.601 [npu_fx_compiler.py:113]2250956 target: arg1_1
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.254.196 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.int64, size=[1] npu:Tensor(arg1_1:0, dtype=DT_INT64, size=[1])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.254.503 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.254.609 [npu_fx_compiler.py:113]2250956 target: arg2_1
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.063 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 1, 8] npu:Tensor(arg2_1:0, dtype=DT_FLOAT, size=[1, 1, 1, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.330 [npu_fx_compiler.py:112]2250956 -------------------
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.433 [npu_fx_compiler.py:113]2250956 target: npu.scatter_update.default
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.676 [npu_fx_compiler.py:115]2250956 input 0: Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 2, 8] npu:Tensor(arg0_1:0, dtype=DT_FLOAT, size=[1, 1, 2, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.830 [npu_fx_compiler.py:115]2250956 input 1: Pack(meta:FakeTensor(dtype=torch.int64, size=[1] npu:Tensor(arg1_1:0, dtype=DT_INT64, size=[1])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.255.955 [npu_fx_compiler.py:115]2250956 input 2: Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 1, 8] npu:Tensor(arg2_1:0, dtype=DT_FLOAT, size=[1, 1, 1, 8])))
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.256.042 [npu_fx_compiler.py:115]2250956 input 3: -2
[DEBUG] TORCHAIR(2250956,python):2025-02-06 15:44:44.297.158 [npu_fx_compiler.py:119]2250956 output Pack(meta:FakeTensor(dtype=torch.float32, size=[1, 1, 2, 8] npu:Tensor(Scatter:0, dtype=DT_FLOAT, size=[1, 1, 2, 8])))
```

--------------------------------

### Multi-Stream Expression Example

Source:
https://github.com/ascend/torchair/blob/master/docs/zh/npugraph_ex/advanced/multi_stream.md

This example demonstrates how to use multiple streams and events to achieve parallel computation and control execution order for NPU operations. It includes creating custom streams, recording events for synchronization, and using `record_stream` to manage tensor memory lifecycle.

```python
import torch
import torch_npu

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, in1, in2, in3, in4):
        stream1 = torch.npu.Stream()
        stream2 = torch.npu.Stream()
        event1 = torch.npu.Event()
        event2 = torch.npu.Event()
        add_result = torch.add(in1, in2)
        # B is created on the default stream
        B = in3 + in4
        # Insert a record for synchronization: tasks after event1.wait(stream1) run only once this record has completed
        event1.record()
        with torch.npu.stream(stream1):
            # torch.mm (mm_result) runs only after torch.add (add_result) and the computation of B have finished
            event1.wait(stream1)
            # B is used on stream1
            mm_result = torch.mm(B, in4)
            # Insert a record for synchronization: tasks after event2.wait(stream2) run only once this record has completed
            event2.record()
            # record_stream: B is used on stream1, so extend the lifetime of the memory backing tensor B
            B.record_stream(stream1)
        mm1 = torch.mm(in3, in4)
        with torch.npu.stream(stream2):
            # torch.add (add2) runs only after torch.mm (mm_result) has finished
            event2.wait(stream2)
            add2 = torch.add(in3, in4)
        return add_result, mm_result, mm1, add2

model = Model().to("npu")
model = torch.compile(model, backend="npugraph_ex", fullgraph=False, dynamic=False)
in1 = torch.randn(1000, 1000, dtype=torch.float16).npu()
in2 = torch.randn(1000, 1000, dtype=torch.float16).npu()
in3 = torch.randn(1000, 1000, dtype=torch.float16).npu()
in4 = torch.randn(1000, 1000, dtype=torch.float16).npu()
result = model(in1, in2, in3, in4)
```

--------------------------------

### Example Model Definition

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/post_grad_custom_pass.md

Defines a sample PyTorch model with multiple operations, serving as the target for custom FX graph pass modifications.

```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        mm = torch.mm(x, x)
        abs = torch.abs(mm)
        add = torch.add(abs, 1)
        sub = torch.sub(x, mm)
        return add, sub
```

--------------------------------

### Dynamo Native Log Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/python_log_print.md

Example of Dynamo's native log output during graph compilation, showing compilation reasons, traced graphs, and module details.

```txt
[2025-02-06 16:46:56,297] [0/0] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='return_value', user_stack=[], graph_break=False)
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] ===== __compiled_fn_0 =====
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] .0 class GraphModule(torch.nn.Module):
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] def forward(self, L_var_ : torch.Tensor, L_indices_ : torch.Tensor, L_updates_ : torch.Tensor):
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_var_ = L_var_
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_indices_ = L_indices_
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_updates_ = L_updates_
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: /home/torchair_example/tests/examples/test_scatter_update.py:16, code: output = torch_npu.scatter_update(var, indices, updates, -2)
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] scatter_update = torch.ops.npu.scatter_update(l_var_, l_indices_, l_updates_, -2); l_var_ = l_indices_ = l_updates_ = None
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] return (scatter_update,)
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2025-02-06 16:46:56,300] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] Tabulate module missing, please install tabulate to log the graph in tabular format, logging code instead:
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] TRACED GRAPH
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] ===== __compiled_fn_0 =====
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] .0 class GraphModule(torch.nn.Module):
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] def forward(self, L_var_ : torch.Tensor, L_indices_ : torch.Tensor, L_updates_ : torch.Tensor):
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] l_var_ = L_var_
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] l_indices_ = L_indices_
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] l_updates_ = L_updates_
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] # File: /home/torchair_example/tests/examples/test_scatter_update.py:16, code: output = torch_npu.scatter_update(var, indices, updates, -2)
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] scatter_update = torch.ops.npu.scatter_update(l_var_, l_indices_, l_updates_, -2); l_var_ = l_indices_ = l_updates_ = None
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] return (scatter_update,)
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2025-02-06 16:46:56,301] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] TRACED GRAPH TENSOR SIZES
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] ===== __compiled_fn_0 =====
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] l_var_: (1, 1, 2, 8)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] l_indices_: (1,)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] l_updates_: (1, 1, 1, 8)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG] scatter_update: (1, 1, 2, 8)
[2025-02-06 16:46:56,302] [0/0] torch._dynamo.output_graph.__graph_sizes: [DEBUG]
```

--------------------------------

### Example Fusion Log Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/fusion_switch_file.md

This log snippet shows the fusion switch file path being loaded during graph initialization. It confirms that the specified fusion configuration is being used.

```txt
concrete_graph/session.cpp:28 ge.fusionSwitchFile: /home/test/fusion_switch.cfg
```

--------------------------------

### Compile and Install Custom Operator Package

Source: https://github.com/ascend/torchair/blob/master/docs/zh/custom_op_graph/in_place_op_cases.md

Build and install the custom operator package for torch_npu. Ensure the Python version matches your environment. The second command force-reinstalls the built wheel without touching its dependencies.

```bash
bash ci/build.sh --python=3.9
pip3 install dist/torch*.whl --force-reinstall --no-deps
```

--------------------------------

### Example C++ Debug Log Output

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/cplus_log_print.md

This is an example of the detailed debug logs generated by the C++ layer of TorchAir during graph execution.

```text
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.205 [static_npu_graph_executor.cpp:46]2250956 Assemble aten device input 0 at::Tensor(shape=[1, 1, 2, 8], dtype='float', device=npu:0, addr=0x12c041200000) to ge::Tensor(storage shape=[1, 1, 2, 8], origin shape=[1, 1, 2, 8], storage format=ND, origin format=ND, dtype=DT_FLOAT, device=NPU, addr=0x12c041200000)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.323 [static_npu_graph_executor.cpp:46]2250956 Assemble aten device input 1 at::Tensor(shape=[1], dtype='long int', device=npu:0, addr=0x12c041200200) to ge::Tensor(storage shape=[1], origin shape=[1], storage format=ND, origin format=ND, dtype=DT_INT64, device=NPU, addr=0x12c041200200)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.379 [static_npu_graph_executor.cpp:46]2250956 Assemble aten device input 2 at::Tensor(shape=[1, 1, 1, 8], dtype='float', device=npu:0, addr=0x12c041200400) to ge::Tensor(storage shape=[1, 1, 1, 8], origin shape=[1, 1, 1, 8], storage format=ND, origin format=ND, dtype=DT_FLOAT, device=NPU, addr=0x12c041200400)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.487 [static_npu_graph_executor.cpp:130]2250956 Create empty output 0 at::Tensor(shape=[1, 1, 2, 8], dtype='float', device=npu:0, addr=0x12c041201000)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.527 [static_npu_graph_executor.cpp:138]2250956 Assemble torch output 0 at::Tensor(shape=[1, 1, 2, 8], dtype='float', device=npu:0, addr=0x12c041201000) to ge::Tensor(storage shape=[1, 1, 2, 8], origin shape=[1, 1, 2, 8], storage format=ND, origin format=ND, dtype=DT_FLOAT, device=NPU, addr=0x12c041201000)
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.084.591 [concrete_graph/session.cpp:238]2250956 Start to session load graph 0
[DEBUG] TORCHAIR(2250956,python):2025-02-06-15:44:53.090.305 [concrete_graph/session.cpp:250]2250956 Start to session execute graph 0
[INFO] TORCHAIR(2250956,python):2025-02-06-15:44:53.090.459 [static_npu_graph_executor.cpp:256]2250956 Static npu graph executor run graph 0 on stream 0x3345cca0 successfully.
```

--------------------------------

### Python Graph Dump Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/graph_dump.md

Example of a graph dump in Python format, showing data, shape, broadcast, expand, scatter, multiply, and cast operations.

```python
from torch import tensor
from torchair._ge_concrete_graph import ge_apis as ge
from torchair.ge._ge_graph import get_default_ge_graph

arg0_1_0 = ge.Data(index=0, dtype=0, shape=[2, 2, 2], placement="NPU", node_name="arg0_1")
arg1_1_0 = ge.Data(index=1, dtype=0, shape=[2], placement="NPU", node_name="arg1_1")

# File "/home/torchair_example/tests/examples/example_select_scatter.py", line 23, in forward output = torch.ops.aten.select_scatter(x, y, 0,1)
## FX Code: select_scatter = torch.ops.aten.select_scatter.default(arg0_1, arg1_1, 0, 1)
Shape_0 = ge.Shape(arg0_1_0, node_name="Shape")
BroadcastTo_0 = ge.BroadcastTo(1, Shape_0, node_name="BroadcastTo")
ExpandDims_0 = ge.ExpandDims(arg1_1_0, 0, node_name="ExpandDims")
BroadcastTo_1_0 = ge.BroadcastTo(ExpandDims_0, Shape_0, node_name="BroadcastTo_1")
ScatterElements_0 = ge.ScatterElements(arg0_1_0, BroadcastTo_0, BroadcastTo_1_0, axis=0, node_name="ScatterElements")

# File "/home/torchair_example/tests/examples/example_select_scatter.py", line 24, in forward return output*10
## FX Code: mul = torch.ops.aten.mul.Tensor(select_scatter, 10)
Mul_0 = ge.Mul(ScatterElements_0, ge.Const(10, dtype=0), node_name="Mul")
Cast_0 = ge.Cast(Mul_0, dst_type=0, node_name="Cast")

_ = ge.NetOutput([Cast_0], dependencies=[])
```

--------------------------------

### Installing Custom Operator Package

Source: https://github.com/ascend/torchair/blob/master/docs/zh/custom_op_graph/in_place_op_cases.md

Command to install the compiled custom operator package. The package is typically named custom_opp__.run.

```bash
bash build_out/custom_opp__.run
cd ..
```

--------------------------------

### Example Fused Operator in FX Graph Dump

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/basic/pattern_fusion_pass.md

This text output shows an example of how a fused operator, npu_add_rms_norm_dynamic_quant, might appear in an FX graph dump after the pattern fusion pass has been applied.

```text
# No stacktrace found for following nodes
npu_add_rms_norm_dynamic_quant_default = torch.ops.npu.npu_add_rms_norm_dynamic_quant.default(arg2_1, arg1_1, arg0_1, output_mask = [True, True]); arg2_1 = arg1_1 = arg0_1 = None
getitem_5: "i8[2, 3, 4]" = npu_add_rms_norm_dynamic_quant_default[0]
getitem_6: "f16[2, 3, 4]" = npu_add_rms_norm_dynamic_quant_default[2]
getitem_7: "f32[2, 3]" = npu_add_rms_norm_dynamic_quant_default[3]; npu_add_rms_norm_dynamic_quant_default = None
view_default: "i8[6, 4]" = torch.ops.aten.reshape.default(getitem_5, [6, 4]); getitem_5 = None
view_default_1: "f32[6, 1]" = torch.ops.aten.reshape.default(getitem_7, [-1, 1]); getitem_7 = None
return (view_default, view_default_1, getitem_6)
```

--------------------------------

### Prepare the PyTorch Model Script

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/compile_cache.md

Define a PyTorch model that takes a custom input type (InputMeta) and a list-type input, then compile and run it with torch.compile.

```python
import torch
import dataclasses
from typing import List
import torch_npu
import torchair
from torchair.configs.compiler_config import CompilerConfig

config = CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)

# InputMeta mimics the input-argument structure of the VLLM (Versatile Large Language Model) framework
@dataclasses.dataclass
class InputMeta:
    data: torch.Tensor
    is_prompt: bool

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 1)
        self.linear2 = torch.nn.Linear(2, 1)
        for param in self.parameters():
            torch.nn.init.ones_(param)

    def forward(self, x: InputMeta, kv: List[torch.Tensor]):
        return self.linear2(x.data) + self.linear2(kv[0])

x = InputMeta(data=torch.randn(2, 2).npu(), is_prompt=True)
kv = [torch.randn(2, 2).npu()]
model = Model().npu()

# Compile with torch.compile
compiled_model = torch.compile(model, backend=npu_backend)

# Run prompt
res_prompt = compiled_model(x, kv)
x.is_prompt = False
# Run decode
res_decode = compiled_model(x, kv)
```

--------------------------------

### TorchAir CompilerConfig Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/dynamo_export.md

Example of configuring CompilerConfig for graph export. This snippet shows how to enable automatic generation of ATC JSON configuration templates and include nn_module_stack information in the exported graph.

```python
import torch_npu, torchair

config = torchair.CompilerConfig()
# Enable automatic generation of the ATC JSON configuration file template
config.export.experimental.auto_atc_config_generated = True
# Carry nn_module_stack information
config.export.experimental.enable_record_nn_module_stack = True
```

--------------------------------

### Configure Operator Online Compilation Options

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/jit_compile.md

Set the jit_compile parameter through CompilerConfig to control how operators are compiled online. Applies to GE graph mode scenarios.

```python
import torch_npu, torchair

config = torchair.CompilerConfig()
# Configure the operator online compilation option
config.experimental_config.jit_compile = "auto"
npu_backend = torchair.get_npu_backend(compiler_config=config)
opt_model = torch.compile(model, backend=npu_backend)
```

--------------------------------

### Enable SuperKernel Fusion Optimization

Source: https://github.com/ascend/torchair/blob/master/docs/zh/npugraph_ex/advanced/superkernel.md

Enable SuperKernel fusion optimization by setting `super_kernel_optimize` to `True` in the `torch.compile` options. This example shows how to configure the optimization; it is for reference only and is not directly runnable. Refer to the documentation for detailed parameter explanations.

```python
import torch
import torch_npu

opt_model = torch.compile(model, backend="npugraph_ex", options={"super_kernel_optimize": True, "super_kernel_optimize_options": dict, "super_kernel_debug_options": dict}, dynamic=False, fullgraph=True)
```

--------------------------------

### Add Operator Custom Decompositions Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/api/inference/cache_compile.md

This example demonstrates how to use custom_decompositions with cache_compile to handle operator decompositions, specifically for the torch.ops.aten.add.default operator. It registers a decomposition function and passes it in the cache_compile call.

```python
# Register the operator decomposition function
import torch, torch_npu, torchair
from torch._decomp import get_decompositions, register_decomposition

@register_decomposition(torch.ops.aten.add.default)
def test_add_decomp(t1, t2):
    return t1 + t2

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Pass the list of decomposed operators in through custom_decompositions
        self.cached = torchair.inference.cache_compile(self.inner_forward, custom_decompositions=get_decompositions([torch.ops.aten.add.default]))

    def inner_forward(self, t1, t2):
        return torch.ops.aten.add(t1, t2)

    def forward(self, t1, t2):
        return self.cached(t1, t2)

# ...
```

--------------------------------

### Adapt the PyTorch Model Script to Support Compile Caching

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/compile_cache.md

Split the model's original forward function into _forward and forward, and wrap the _forward function with torchair.inference.cache_compile to enable compile caching. This approach suits functions that need to be traced multiple times.

```python
import dataclasses
import logging
from typing import List

import torch
import torch_npu
import torchair
from torchair import logger
from torchair.configs.compiler_config import CompilerConfig

config = CompilerConfig()
logger.setLevel(logging.INFO)

# InputMeta mimics the input-argument structure of the VLLM (Versatile Large Language Model) framework
@dataclasses.dataclass
class InputMeta:
    data: torch.Tensor
    is_prompt: bool

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 1)
        self.linear2 = torch.nn.Linear(2, 1)
        for param in self.parameters():
            torch.nn.init.ones_(param)
        # Enable compile caching through torchair.inference.cache_compile
        self.cached_prompt = torchair.inference.cache_compile(self.prompt, config=config)
        self.cached_decode = torchair.inference.cache_compile(self.decode, config=config)

    @torch.inference_mode()
    def forward(self, x: InputMeta, kv: List[torch.Tensor]):
        return self._forward(x, kv)

    def _forward(self, x, kv):
        return self.linear2(x.data) + self.linear2(kv[0])

    def prompt(self, x: InputMeta, kv: List[torch.Tensor]):
        return self._forward(x, kv)

    def decode(self, x: InputMeta, kv: List[torch.Tensor]):
        return self._forward(x, kv)
```

--------------------------------

### RefData Conversion Log Message Example

Source: https://github.com/ascend/torchair/blob/master/docs/zh/ascend_ir/features/advanced/ref_data.md

An example log message indicating that a RefData node has been replaced with a Data node during graph optimization. This message appears when the RefData type conversion feature is enabled and takes effect.

```text
[DEBUG] TORCHAIR 20240607 02:06:15 Replace RefData_5_3_20_20_1200_400_20_1_0_140251860631280:RefData with arg0_1:Data in graph graph_1
```
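
When checking whether RefData conversion took effect, it can help to pull the replaced node names out of the log programmatically. The following is a minimal sketch that uses only the Python standard library. The regular expression is inferred from the single sample message above, so the exact log layout is an assumption and may differ between TorchAir versions; `Replacement` and `parse_replacement` are hypothetical helper names, not part of TorchAir.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one sample line above (an assumption, not a
# documented format): "Replace <name>:<type> with <name>:<type> in graph <graph>"
_REPLACE_RE = re.compile(
    r"Replace (?P<old_name>\S+):(?P<old_type>\w+) "
    r"with (?P<new_name>\S+):(?P<new_type>\w+) "
    r"in graph (?P<graph>\S+)"
)


class Replacement(NamedTuple):
    """Hypothetical container for one parsed replacement message."""
    old_name: str
    old_type: str
    new_name: str
    new_type: str
    graph: str


def parse_replacement(line: str) -> Optional[Replacement]:
    """Return the parsed fields, or None if the line is not a replacement message."""
    match = _REPLACE_RE.search(line)
    return Replacement(**match.groupdict()) if match else None


sample = (
    "[DEBUG] TORCHAIR 20240607 02:06:15 Replace "
    "RefData_5_3_20_20_1200_400_20_1_0_140251860631280:RefData "
    "with arg0_1:Data in graph graph_1"
)
print(parse_replacement(sample))
```

Applying `parse_replacement` to every line of a captured log and dropping the `None` results gives the list of converted inputs per graph.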