### Install CANN Package (Example)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Install the CANN toolkit and ops packages. Ensure you replace {version} with the correct CANN version and specify the installation path if needed. This is a prerequisite for end-to-end runs.

```bash
# Example: x86 A3, {version} is CANN version, e.g. 9.0.0
chmod +x Ascend-cann_{version}_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_{version}_linux-x86_64.run
./Ascend-cann_{version}_linux-x86_64.run --full [--install-path=${PATH-TO-CANN}]
./Ascend-cann-A3-ops_{version}_linux-x86_64.run --install [--install-path=${PATH-TO-CANN}]
```

--------------------------------

### Install Ascend Toolkit

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Make the Ascend Toolkit installation script executable and run it to install the toolkit.

```bash
chmod +x Ascend-cann-toolkit_{ascend-cann-toolkit version}_linux-aarch64.run
./Ascend-cann-toolkit_{ascend-cann-toolkit version}_linux-aarch64.run --install
```

--------------------------------

### Fast Build (Skip Installation)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/introduction/quick_start/installing_guide.md

Perform a quick build that skips the installation step. Useful for development cycles where only compilation is needed.

```bash
# Fast build (skip install)
./build-tools/build.sh -o ./build --fast-build
```

--------------------------------

### Install CANN Package

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/introduction/quick_start/installing_guide.md

Install the CANN toolkit and ops packages. Ensure the execute permissions are set and use the --full or --install flags as appropriate. Specify the installation path if needed.
```bash
# x86 system, A3 environment; {version} is the CANN version, e.g. 9.0.0
chmod +x Ascend-cann_{version}_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_{version}_linux-x86_64.run
./Ascend-cann_{version}_linux-x86_64.run --full [--install-path=${PATH-TO-CANN}]
./Ascend-cann-A3-ops_{version}_linux-x86_64.run --install [--install-path=${PATH-TO-CANN}]
```

--------------------------------

### Install Ascend CANN and Ops Packages

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Installs the Ascend CANN toolkit and ops packages in an x86 A3 environment. Replace {version} with the correct CANN version and specify the installation path if needed.

```bash
# In the x86 A3 environment, {version} indicates the CANN version, for example, 9.0.0.
chmod +x Ascend-cann_{version}_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_{version}_linux-x86_64.run
./Ascend-cann_{version}_linux-x86_64.run --full [--install-path=${PATH-TO-CANN}]
./Ascend-cann-A3-ops_{version}_linux-x86_64.run --install [--install-path=${PATH-TO-CANN}]
```

--------------------------------

### Install PyTorch NPU Support

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Install the torch_npu package, which provides NPU backend support for PyTorch.

```bash
pip install pybind11 torch_npu
```

--------------------------------

### Fast Build AscendNPU IR (Skip Install)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Perform a fast build of AscendNPU IR, skipping the installation step. This is useful when only the build artifacts are needed.

```bash
# Fast build (skip install)
./build-tools/build.sh -o ./build --fast-build
```

--------------------------------

### Install ascendnpu-ir Package

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/python/wheel/README.md

Install the ascendnpu-ir Python package using pip.
Ensure you have Python 3.9 or higher. ```bash pip install ascendnpu-ir ``` -------------------------------- ### msprof Command-Line Usage Examples Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/debug_option.md These examples show how to use the msprof command-line tool for full-network and single-operator profiling, including options for specifying output directories, kernel names, and performance metrics. ```bash # Full-network on-device profiling # --output: directory for profiling data (default: current dir) # --application: command to run msprof --output=xxx --application="" ``` ```bash # Single-operator on-device profiling # --kernel-name: kernel name (supports prefix match) # --aic-metrics: enable metrics (Roofline, Occupancy, MemoryDetail, etc.) msprof op --output=xxx --application="" --kernel-name=xxx --aic-metrics=xxx ``` ```bash # Single Operator Simulation Tuning # --core-id - Specify the IDs of partial logical cores to parse simulation data of designated cores # --kernel-name - Specify the name of the operator to be collected; fuzzy matching by operator name prefix is supported # --soc-version - Specify the simulator type # --output - Storage path for collected performance data; data is saved in the current directory by default msprof op simulator --core-id=xxx --kernel-name=xxx --soc-version=Ascendxxx --output=xxx ``` -------------------------------- ### Build AscendNPU-IR Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md Build the AscendNPU-IR component using the provided installation script. An alternative allows specifying a local path for AscendNPU-IR. 
```bash cd tilelang-mlir-ascend bash install_npuir.sh # Alternative way of building with local AscendNPU-IR bash install_npuir.sh --bishengir-path=/path/to/AscendNPU-IR/build/install ``` -------------------------------- ### Install torch_npu and triton-ascend Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md Installs the specified versions of torch_npu and triton-ascend using pip. Ensure compatibility with your Python environment. ```bash pip install torch_npu==2.7.1 pip install triton-ascend ``` -------------------------------- ### Install BishengIR Headers Source: https://github.com/ascend/ascendnpu-ir/blob/master/CMakeLists.txt Installs BishengIR header files to the system include directory. Excludes specific patterns like 'CMakeFiles' and 'config.h'. ```cmake install( DIRECTORY bishengir/include/bishengir DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}" COMPONENT bishengir-headers FILES_MATCHING PATTERN "*.def" PATTERN "*.h" PATTERN "*.inc" PATTERN "*.td" PATTERN "LICENSE.TXT") install( DIRECTORY ${BISHENGIR_BINARY_DIR}/bishengir/include/bishengir DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}" COMPONENT bishengir-headers FILES_MATCHING PATTERN "*.def" PATTERN "*.h" PATTERN "*.gen" PATTERN "*.inc" PATTERN "*.td" PATTERN "CMakeFiles" EXCLUDE PATTERN "config.h" EXCLUDE) ``` -------------------------------- ### TargetDeviceSpecAttr Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HACCDialect.md An example of TargetDeviceSpecAttr, which represents NPU target device specifications. Each specification describes a device and its hardware properties, such as UB_SIZE. 
```mlir
#hacc.target_device_spec<
  #dlti.dl_entry<"UB_SIZE", 196608 : i32>>
```

--------------------------------

### Setup Host Tool

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/tools/bishengir-hfusion-ods-gen/CMakeLists.txt

Configures 'bishengir-hfusion-ods-yaml-gen' as a host tool, defining its executable name and target. This is crucial for build system integration.

```cmake
setup_host_tool(bishengir-hfusion-ods-yaml-gen BISHENGIR_HFUSION_ODS_YAML_GEN
  BISHENGIR_HFUSION_ODS_YAML_GEN_EXE BISHENGIR_HFUSION_ODS_YAML_GEN_TARGET)
```

--------------------------------

### Example Test Command for a Specific Pass

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/CV/CVOptimization.md

Illustrates how to run a specific MLIR pass using bishengir-opt and verify its output with FileCheck.

```bash
// RUN: bishengir-opt -hivm-normalize-matmul %s -split-input-file -verify-diagnostics -allow-unregistered-dialect | FileCheck %s
```

--------------------------------

### Cube Loop Tiling Example (Before)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/CV/TileCubeAndVectorLoop.md

Illustrates the structure of a Cube loop before the tiling pass is applied. This serves as a baseline for understanding the transformation.

```mlir
scf.for {
  hivm.load A
  hivm.load B
  hivm.hir.mmadL1
  hivm.hir.fixpipe
} {cube_loop}
```

--------------------------------

### MLIR Kernel Entry Function Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/interface_api.md

Example of an MLIR function marked as a kernel entry point for the device, utilizing specific Ascend attributes.

```mlir
func.func @kernel(...) attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  ...
}
```

--------------------------------

### Vector Cumulative Product Example (Memref)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Example of using `hivm.hir.vcumprod` with memref types for calculating cumulative product along dimension 0 in reverse.

```mlir
hivm.hir.vcumprod ins(%src : memref) outs(%dst : memref) cum_dims : [0] reverse = true
```

--------------------------------

### HIVM VPad Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

An example demonstrating the usage of the hivm.hir.vpad operation to pad a tensor. It specifies the source and destination tensors, padding dimensions, and the pad value.

```mlir
hivm.hir.vpad ins(%src : tensor<2x16xf32>) outs(%dst: tensor) low[%first_dim_low, 0] high[%first_dim_high, 0] pad_value %pad_value : f32 -> tensor
```

--------------------------------

### Configure Ascend Toolkit Environment

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Source the environment setup script provided by the Ascend Toolkit to configure necessary environment variables.

```bash
source /path/to/install/Ascend/ascend-toolkit/set_env.sh
```

--------------------------------

### Link Libraries for bishengir-capi-ir-test

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/CAPI/CMakeLists.txt

Example of using `_add_capi_test_executable` to create the 'bishengir-capi-ir-test' executable, linking BiShengIRCAPIRegisterEverything and MLIRCAPIIR libraries.
```cmake _add_capi_test_executable(bishengir-capi-ir-test ir.c LINK_LIBS PRIVATE BiShengIRCAPIRegisterEverything MLIRCAPIIR ) ``` -------------------------------- ### Example of Ninja Build Error Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md This error indicates that the `build.ninja` file is missing. Rerunning CMake with the `-r` option can regenerate it. ```bash ninja: error: loading 'build.ninja': No such file or directory ``` -------------------------------- ### HIVM VArangeOp Examples Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md Illustrates the usage of the hivm.hir.varange operation with different output types (memref and tensor) and demonstrates how to specify offsets and strides. ```mlir hivm.hir.varange offset[%o] strides[%s0, %s1] outs(%dst : memref<32xf32>) ``` ```mlir %result = hivm.hir.varange offset[%o] strides[%s0, %s1] outs(%dst : tensor<32xf32>) -> tensor<32xf32> ``` -------------------------------- ### Original Kernel Invocation (GDN Example) Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/best_practice.md Demonstrates the standard way of invoking the `chunk_gated_delta_rule_fwd_kernel_h_blockdim64` operator in the GDN network. ```python chunk_gated_delta_rule_fwd_kernel_h_blockdim64[ grid ]( k=k, v=u, w=w, v_new=v_new, g=g, gk=gk, h=h, h0=initial_state, ht=final_state, cu_seqlens=cu_seqlens, chunk_offsets=chunk_offsets, T=T, H=H, K=K, V=V, BT=BT, ) ``` -------------------------------- ### TargetDeviceSpecAttr Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/developer_guide/dialects/HACCDialect.md Represents NPU target device specifications, including hardware properties like UB_SIZE. Each specification describes a single device. 
```mlir
#hacc.target_device_spec<
  #dlti.dl_entry<"UB_SIZE", 196608 : i32>>
```

--------------------------------

### Link Libraries for bishengir-capi-pass-test

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/CAPI/CMakeLists.txt

Example of using `_add_capi_test_executable` to create the 'bishengir-capi-pass-test' executable, linking multiple BiShengIR and MLIR C API libraries.

```cmake
_add_capi_test_executable(bishengir-capi-pass-test
  pass.c
  LINK_LIBS PRIVATE
  BiShengIRCAPIRegisterEverything
  BiShengIRCAPIAnnotation
  MLIRCAPIFunc
  BiShengIRCAPIHFusion
  BiShengIRCAPIHIVM
  MLIRCAPIIR
  MLIRCAPIRegisterEverything
  MLIRCAPITransforms
)
```

--------------------------------

### Install CANN Python Dependencies

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Installs the required Python dependencies for the Ascend CANN environment. Ensure these versions are compatible with your setup.

```bash
pip install attrs==24.2.0 numpy==1.26.4 scipy==1.13.1 decorator==5.1.1 psutil==6.0.0 pyyaml
```

--------------------------------

### Set Ascend Home Path

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/Integration/HIVM/VecAdd/README.md

Set the ASCEND_HOME_PATH environment variable to the installed path of the CANN software package before running the example on Ascend NPU.

```bash
export ASCEND_HOME_PATH=/usr/local/ascend-toolkit/latest
```

--------------------------------

### Build Documentation with Make

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/README.md

Use the make command from the repository root to build documentation in HTML format. Specify 'html' for English, 'html-zh' for Chinese, or 'html-all' for both.
```bash
make -C docs html
make -C docs html-zh
make -C docs html-all
```

--------------------------------

### Build Documentation from Docs Directory

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/README.md

Alternatively, navigate to the 'docs/' directory and use the make command to build documentation. 'html' builds English, 'html-zh' builds Chinese, and 'html-all' builds both.

```bash
make html
make html-zh
make html-all
```

--------------------------------

### Build Documentation

Source: https://github.com/ascend/ascendnpu-ir/blob/master/README.md

Build the documentation for AscendNPU IR. Specify the target language for the output.

```bash
make -C docs html      # English only → docs/_build/en
make -C docs html-zh   # Chinese only → docs/_build/zh_cn
make -C docs html-all  # Both languages
```

--------------------------------

### Build AscendNPU IR from Source (Recommended)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/introduction/quick_start/installing_guide.md

Use the provided build script for a streamlined build process. This command configures, builds, and installs the project. The --apply-patches flag is required for the first build.

```bash
# From the project root directory
./build-tools/build.sh -o ./build --build-type Release --apply-patches
```

--------------------------------

### Preview Documentation Locally

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/README.md

Open the generated HTML files to preview the documentation. Use 'open' for direct access or 'python3 -m http.server' to serve the files via HTTP.

```bash
# English
open docs/_build/en/index.html
# Chinese
open docs/_build/zh_cn/index.html
# Or serve with HTTP (e.g. port 8080 for English, 8081 for Chinese)
cd docs/_build/en && python3 -m http.server 8080
cd docs/_build/zh_cn && python3 -m http.server 8081
```

--------------------------------

### Add LLVM Install Target for Headers

Source: https://github.com/ascend/ascendnpu-ir/blob/master/CMakeLists.txt

Adds a custom install target for BishengIR headers if IDE support is disabled.

```cmake
if(NOT LLVM_ENABLE_IDE)
  add_llvm_install_targets(install-bishengir-headers
                           DEPENDS bishengir-headers
                           COMPONENT bishengir-headers)
endif()
```

--------------------------------

### Install Python Dependencies for Ascend Toolkit

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Install required Python packages for the Ascend Toolkit, ensuring compatibility with specified NumPy versions.

```bash
pip3 install attrs cython 'numpy>=1.19.2,<=1.24.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf==3.20.0 scipy requests absl-py
```

--------------------------------

### Build Wheel Package from Source

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/python/wheel/README.md

Build the bishengir-compile binary and then the wheel package from source. The wheel will be located in the dist directory.

```bash
# Build the compiler
cd build-tools
./build.sh
# Build the wheel package
./build_wheel.sh
```

--------------------------------

### Vector Sort Example (Memref)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Example of using the `hivm.hir.vsort` operation with memref types for sorting in descending order along axis 0.
```mlir hivm.hir.vsort ins(%src : memref) outs(%dst : memref) descending = true sort_axis = 0 ``` -------------------------------- ### Torch IR to Linalg/HFusion Conversion Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/developer_guide/conversion/framework_interface.md Example of a Torch IR function for element-wise multiplication, demonstrating the input and output tensor types. ```mlir func.func @torch_mul(%arg0: !torch.vtensor<[4096],f16>, %arg1: !torch.vtensor<[1,56,4096],f16>) -> !torch.vtensor<[1,56,4096],f16> attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} { %0 = torch.aten.mul.Tensor %arg0, %arg1 : !torch.vtensor<[4096],f16>, !torch.vtensor<[1,56,4096],f16> -> !torch.vtensor<[1,56,4096],f16> return %0 : !torch.vtensor<[1,56,4096],f16> } ``` -------------------------------- ### Build AscendNPU IR (First Build) Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md Perform the initial build of AscendNPU IR using the build script. This command configures CMake, builds the project with Ninja, and applies necessary patches to submodules. ```bash # From the repo root ./build-tools/build.sh -o ./build --build-type Release --apply-patches ``` -------------------------------- ### HIVM Load Operation Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md Demonstrates loading data from global memory to a local unified buffer. Supports padding and initialization options. Ensure src and dst have the same element type and shape if padding is not used. 
```mlir hivm.load ins(%src : memref<16x16xf16, #hivm.address_space>) outs(%dst : memref<16x16xf16, #hivm.address_space>) ``` -------------------------------- ### Compile MLIR to Device Binary Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/Integration/HIVM/VecAdd/README.md Use `bishengir-compile` to compile the `.mlir` file into an executable binary for the Ascend NPU. The `-enable-hivm-compile` flag is crucial for this process. ```bash bishengir-compile add.mlir -enable-hivm-compile -o kernel.o ``` -------------------------------- ### HFusion PrintOp Usage Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/debug_option.md Example of inserting hfusion.print into HFusion IR to print the result of a load operation. Ensure the 'hex' attribute is correctly set. ```mlir func.func @vector_kernel(%arg0: memref {hacc.arg_type = #hacc.arg_type}, %arg1: memref {hacc.arg_type = #hacc.arg_type}, %arg2: memref {tt.divisibility = 16 : i32, tt.tensor_kind = 0 : i32}, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: i32) attributes {SyncBlockLockArgIdx = 0 : i64, WorkspaceArgIdx = 1 : i64, hacc.entry, hacc.function_kind = #hacc.function_kind, mix_mode = "aiv", parallel_mode = "simd"} { %reinterpret_cast = memref.reinterpret_cast %arg2 to offset: [0], sizes: [8], strides: [1] : memref to memref<8xi64, strided<[1]>> %alloc = memref.alloc() : memref<8xi64> memref.copy %reinterpret_cast, %alloc : memref<8xi64, strided<[1]>> to memref<8xi64> %0 = bufferization.to_tensor %alloc restrict writable : memref<8xi64> hfusion.print " x: " {hex = false} %0 : tensor<8xi64> return } ``` -------------------------------- ### Cube Loop Tiling Example (After) Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/CV/TileCubeAndVectorLoop.md Shows the Cube loop structure after the tiling pass. 
The original large iteration is split into smaller iterations, with operations like `hivm.load` now operating on slices. ```mlir scf.for { for { hivm.load slice_A hivm.load slice_B hivm.hir.mmadL1 hivm.hir.fixpipe } {sub_tile} } {cube_loop} ``` -------------------------------- ### Build AscendNPU IR with Specific Compiler and Threads Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md Configure the build process to use specific C and C++ compilers and set the number of parallel build threads. This allows for fine-tuning the build environment. ```bash # Specify compiler and thread count ./build-tools/build.sh -o ./build --c-compiler /usr/bin/clang-15 --cxx-compiler /usr/bin/clang++-15 -j 256 ``` -------------------------------- ### Converted Linalg/HFusion IR Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/framework_interface.md Example of MLIR code after conversion from Torch IR to Linalg/HFusion. It shows the `mul` operation represented using `linalg.broadcast` and `linalg.elemwise_binary`. 
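To make the data flow of the converted IR concrete, here is a plain-Python sketch (not taken from the source) of the same two steps on tiny shapes. The 1-D operand is first broadcast by adding the leading output dimensions, mirroring `dimensions = [0, 1]`, and the result is then multiplied elementwise. The helper names `broadcast_leading` and `elemwise_mul` and the small shapes are illustrative assumptions, not AscendNPU IR APIs.

```python
from typing import List

Tensor3D = List[List[List[float]]]


def broadcast_leading(vec: List[float], d0: int, d1: int) -> Tensor3D:
    """Replicate a length-N vector into a d0 x d1 x N nested list.

    Mirrors linalg.broadcast with dimensions = [0, 1]: output dims 0 and 1
    are the added (broadcast) dimensions. Hypothetical helper.
    """
    return [[list(vec) for _ in range(d1)] for _ in range(d0)]


def elemwise_mul(lhs: Tensor3D, rhs: Tensor3D) -> Tensor3D:
    """Multiply two equally shaped 3-D nested lists element by element."""
    return [[[a * b for a, b in zip(row_l, row_r)]
             for row_l, row_r in zip(mat_l, mat_r)]
            for mat_l, mat_r in zip(lhs, rhs)]


# Tiny stand-ins for tensor<4096xf16> and tensor<1x56x4096xf16>.
arg0 = [1.0, 2.0, 3.0]                            # shape [3]
arg1 = [[[10.0, 10.0, 10.0], [0.5, 0.5, 0.5]]]    # shape [1, 2, 3]

broadcasted = broadcast_leading(arg0, d0=1, d1=2)  # shape [1, 2, 3]
result = elemwise_mul(broadcasted, arg1)
print(result)  # [[[10.0, 20.0, 30.0], [0.5, 1.0, 1.5]]]
```

The sketch only models the value semantics; it says nothing about how the compiler tiles or schedules the operation.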
```mlir func.func @torch.aten.mul_tensor(%arg0: tensor<4096xf16>, %arg1: tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16> attributes { hacc.entry, hacc.function_kind = #hacc.function_kind } { %0 = tensor.empty() : tensor<1x56x4096xf16> %broadcasted = linalg.broadcast ins(%arg0 : tensor<4096xf16>) outs(%0 : tensor<1x56x4096xf16>) dimensions = [0, 1] %1 = linalg.elemwise_binary {fun = #linalg.binary_fn} ins(%broadcasted, %arg1 : tensor<1x56x4096xf16>, tensor<1x56x4096xf16>) outs(%0 : tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16> return %1 : tensor<1x56x4096xf16> } ``` -------------------------------- ### Vector Sort Example (Tensor) Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md Example of using the `hivm.hir.vsort` operation with tensor types for sorting in descending order along axis 0, specifying the result type. ```mlir %result = hivm.hir.vsort ins(%src : tensor) outs(%dst : tensor) descending = true sort_axis = 0 -> tensor ``` -------------------------------- ### Run LIT test suite Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/faq/faq.md Execute the LIT test suite by running './bin/llvm-lit' with the path to your test directory. Adjust paths as necessary. ```bash ./bin/llvm-lit ../bishengir/test ``` -------------------------------- ### Vector Cumulative Product Example (Tensor) Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md Example of using `hivm.hir.vcumprod` with tensor types for calculating cumulative product along dimension 0 in reverse, specifying the result type. 
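As a reference for what "cumulative product along dimension 0 in reverse" computes, the following plain-Python sketch may help. It assumes that `reverse = true` means the scan runs from the last index of the dimension toward the first, so each output row is the product of that row and all rows after it; this reading is an assumption, and `reverse_cumprod_dim0` is a hypothetical helper, not part of the HIVM dialect.

```python
from typing import List


def reverse_cumprod_dim0(src: List[List[float]]) -> List[List[float]]:
    """Cumulative product along dim 0, scanning from the last row to the first.

    Assumed meaning of `cum_dims : [0] reverse = true`:
    out[i][j] = src[i][j] * src[i+1][j] * ... * src[-1][j].
    """
    if not src:
        return []
    out = [list(row) for row in src]
    # Walk upward from the second-to-last row, reusing the row below.
    for i in range(len(src) - 2, -1, -1):
        out[i] = [a * b for a, b in zip(src[i], out[i + 1])]
    return out


print(reverse_cumprod_dim0([[1, 2], [3, 4], [5, 6]]))
# [[15, 48], [15, 24], [5, 6]]
```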
```mlir %result = hivm.hir.vcumprod ins(%src : tensor) outs(%dst : tensor) cum_dims : [0] reverse = true -> tensor ``` -------------------------------- ### Triton index_put Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md Demonstrates how to use the index_put function to place a value tensor into a target tensor at specified indices. This example shows a 2D index placement scenario. ```python import triton import triton.language as tl @triton.jit def put_index(): # Placeholder for actual tensor definitions and kernel logic # Example tensor pointers and tiles would be defined here # For demonstration, assume dst_ptr, index_tile, and value_tile are pre-defined dst_ptr = tl.make_block_ptr(0, (1,), (1,), (0,), (1,), (1,)) index_tile = tl.arange(0, 4).to(tl.int32).reshape(4, 1) value_tile = tl.arange(0, 8).to(tl.float16).reshape(4, 2) tmp_buf = tl.index_put( ptr=dst_ptr, index=index_tile, value=value_tile, dim=0, index_boundary=4, end_offset=(2, 2), start_offset=(0, 0), dst_stride=(2, 1) ) # Placeholder for further kernel operations or return return tmp_buf ``` -------------------------------- ### Configure and Build AscendNPU IR with LLVM Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md Use this CMake command to configure the build for AscendNPU IR, specifying compilers, build type, and external projects. Optional parameters can be uncommented and modified as needed. 
```bash export LLVM_SOURCE_DIR="$(realpath ../third-party/llvm-project)" cmake ${LLVM_SOURCE_DIR}/llvm -G Ninja \ -DCMAKE_C_COMPILER=clang \ -DCMAKE_CXX_COMPILER=clang++ \ -DCMAKE_BUILD_TYPE=Release \ -DLLVM_ENABLE_PROJECTS="mlir" \ -DLLVM_EXTERNAL_PROJECTS="bishengir" \ -DLLVM_EXTERNAL_BISHENGIR_SOURCE_DIR="$(realpath ..)" \ -DBSPUB_DAVINCI_BISHENGIR=ON \ # [-DCMAKE_INSTALL_PREFIX="${PWD}/install"] \ # [-DLLVM_MAJOR_VERSION_21_COMPATIBLE=ON] \ # [-DLLVM_ENABLE_ASSERTIONS=ON] \ # [-DMLIR_ENABLE_BINDINGS_PYTHON=ON] \ # [-DLLVM_TARGETS_TO_BUILD="host;Native"] \ # [-DBISHENGIR_PUBLISH=OFF] \ # [-DBISHENGIR_BUILD_TEMPLATE=ON -DBISHENG_COMPILER_PATH=/path/to/bisheng-compiler] \ # [-Dother-options=value] ``` ```bash ninja -j32 ``` -------------------------------- ### Manual Build Prerequisites Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md Prepare for a manual build by initializing submodules and applying patches. These are essential prerequisites before proceeding with manual CMake configuration. ```bash # From the repo root # Prerequisites: submodule init (git submodule update --init --recursive) and patch application (source build-tools/apply_patches.sh). mkdir -p build cd build ``` -------------------------------- ### Triton Kernel Example with Bitwise Mask Hint Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/best_practice.md An example Triton kernel demonstrating the use of `tl.compile_hint` for a bitwise mask on a condition tensor loaded as i8. This snippet is intended for testing the bitmask feature. 
```python
# test_bitmask.py
import triton
import triton.language as tl
import torch
import torch_npu
import test_common


@triton.jit
def triton_where_lt_case1(in_ptr0, in_ptr1, cond_ptr, out_ptr0, xnumel,
                          XBLOCK: tl.constexpr, XBLOCK_SUB: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    for xoffset_sub in range(0, XBLOCK, XBLOCK_SUB):
        xindex = xoffset + xoffset_sub + tl.arange(0, XBLOCK_SUB)[:]
        xmask = xindex < xnumel
        in0 = tl.load(in_ptr0 + xindex, xmask)
        in1 = tl.load(in_ptr1 + xindex, xmask)
        cond = tl.load(cond_ptr + xindex, xmask)
        res = tl.where(cond, in1, in0)
        # versions after triton-ascend 3.4.0
        # tl.extra.cann.extension.compile_hint(cond, "bitwise_mask")
        # versions before triton-ascend 3.2.0
        tl.compile_hint(cond, "bitwise_mask")
        tl.store(out_ptr0 + (xindex), res, xmask)


def test_where_lt_case1():
    dtype = "float32"
    shape = (1, 1024, 8)
    ncore = 1
    xblock = 8192
    xblock_sub = 1024
    if shape[-1] % 8 != 0:
        raise ValueError("The last dimension should be a multiple of 8")
    x0 = test_common.generate_tensor(shape, dtype).npu()
    x1 = test_common.generate_tensor(shape, dtype).npu()
    # Run triton with i8 bitwise mask
    cond_i8 = test_common.generate_tensor(shape, 'uint8').npu()
    y_cal = test_common.generate_tensor(shape, dtype).npu()
    triton_where_lt_case1[ncore, 1, 1](x0, x1, cond_i8, y_cal, x0.numel(), xblock, xblock_sub)


test_where_lt_case1()
```

--------------------------------

### Triton gather_out_to_ub Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Example usage of the gather_out_to_ub function within a Triton kernel. This snippet demonstrates how to call the function with source tensor pointer, index tensor, boundary, dimension, strides, and offsets.
```python tmp_buf = al.gather_out_to_ub( src=src_ptr, index=index, index_boundary=4, dim=0, src_stride=(2, 1), end_offset=(2, 2), start_offset=(0, 0) ) ``` -------------------------------- ### HIVM DebugOp Usage Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/debug_option.md Example of adding hivm.hir.debug to HIVM IR for runtime debugging. This snippet demonstrates printing a tensor with specified debug type, hex format, prefix, and core type. ```mlir func.func @vector_kernel(%arg0: i64 {hacc.arg_type = #hacc.arg_type}, %arg1: memref {hacc.arg_type = #hacc.arg_type}, %arg2: memref {hacc.arg_type = #hacc.arg_type}, %arg3: memref {tt.divisibility = 16 : i32, tt.tensor_kind = 0 : i32}, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32) attributes {SyncBlockLockArgIdx = 0 : i64, WorkspaceArgIdx = 1 : i64, func_dyn_memref_args = dense<[false, true, true, true, false, false, false, false]> : vector<8xi1>, hacc.entry, hacc.function_kind = #hacc.function_kind, mix_mode = "aiv", parallel_mode = "simd"} { %0 = arith.muli %arg5, %arg6 : i32 %1 = arith.muli %0, %arg7 : i32 annotation.mark %1 {logical_block_num} : i32 %reinterpret_cast = memref.reinterpret_cast %arg3 to offset: [0], sizes: [8], strides: [1] : memref to memref<8xi64, strided<[1]>> %alloc = memref.alloc() : memref<8xi64> hivm.hir.load ins(%reinterpret_cast : memref<8xi64, strided<[1]>>) outs(%alloc : memref<8xi64>) init_out_buffer = false may_implicit_transpose_with_last_axis = false %2 = bufferization.to_tensor %alloc restrict writable : memref<8xi64> hivm.hir.debug {debugtype = "print", hex = false, prefix = " x: ", tcoretype = #hivm.tcore_type} %2 : tensor<8xi64> return } ``` -------------------------------- ### Schedule API Usage Example Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/developer_guide/features/AutoSchedule/HFusion_AutoSchedule.md Demonstrates the use of various Schedule APIs within createScheduleImpl for operations like IO 
caching, tiling, loop fusion, and multicore binding. These primitives help in constructing the Transform Dialect program.

```cpp
// IO caching and buffer management
cacheRead
cacheWrite
setBufferSize
// Tiling and loop-structure control
tileUsingFor
tileUsingForAll
tileReductionUsingFor
// Loop fusion and coalescing
fuseLoops
fuseIntoContaining
coalesceLoops
// Multicore binding
bindLoopToMulticore
```

--------------------------------

### Setting up Double Buffering with 'multibuffer'

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Demonstrates the use of `al.multibuffer` to set up double buffering for tensors. This optimization technique improves data flow and computational overlap by creating buffered copies of a tensor.

```python
@triton.jit
def triton_compile_hint():
    #...
    tmp0 = tl.load(in_ptr0 + xindex, xmask)
    al.multibuffer(tmp0, 2)
    tl.store(out_ptr0 + (xindex), tmp0, xmask)
    #...
```

--------------------------------

### HFusion MLIR Example for Reduction

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/framework_interface.md

An example MLIR function demonstrating the use of Linalg and HFusion dialects for a reduction operation involving elementwise multiplication and summation. This code is intended to be compiled for the Ascend NPU.
```mlir
func.func @hfusion_reduce_mul(%arg0: tensor<40960xf32>,
                              %arg1: tensor<40960x1024xf32>,
                              %arg2: tensor<40960x1024xf32>,
                              %arg3: tensor<40960x1024xf32>) -> tensor<40960xf32>
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %1 = tensor.empty() : tensor<40960x1024xf32>
  %3 = linalg.elemwise_binary {fun = #linalg.binary_fn} ins(%arg1, %arg2 : tensor<40960x1024xf32>, tensor<40960x1024xf32>) outs(%arg3: tensor<40960x1024xf32>) -> tensor<40960x1024xf32>
  %4 = tensor.empty() : tensor<40960xf32>
  %sum = linalg.reduce {arith.addf} ins(%3 : tensor<40960x1024xf32>) outs(%4 : tensor<40960xf32>) dimensions = [1]
  %5 = tensor.empty() : tensor<40960xf32>
  %6 = linalg.elemwise_binary {fun = #linalg.binary_fn} ins(%arg0, %sum : tensor<40960xf32>, tensor<40960xf32>) outs(%5: tensor<40960xf32>) -> tensor<40960xf32>
  return %6 : tensor<40960xf32>
}
```

--------------------------------

### Unfold Symbolic Int: Invalid Bind Symbolic Shape Examples

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/passes/SymbolPasses.md

These examples illustrate invalid uses of symbol.bind_symbolic_shape that would violate the constraints of the unfold-symbolic-int pass. They show incorrect affine maps or binding to tensor.empty.

```mlir
symbol.bind_symbolic_shape %arg0, [%S0, %S1], affine_map<()[s0, s1] -> (s0/2, s0/s1, s1+1)> : tensor

%empty = tensor.empty(%S2, %S3) : tensor
symbol.bind_symbolic_shape %empty, [%S2, %S3], affine_map<()[s0, s1] -> (s0, s1)> : tensor
```

--------------------------------

### Build AscendNPU IR (Subsequent Builds)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Execute subsequent builds of AscendNPU IR when the build directory already exists. This command leverages the existing build configuration.
```bash
./build-tools/build.sh -o ./build --build-type Release
```

--------------------------------

### Scope Overflow Error Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/PlanMemory/PlanMemory.md

This example demonstrates a typical scope overflow error message encountered when memory constraints are violated. It indicates that the required memory bits exceed the available bits, often due to large tiling basic blocks or excessive local buffers enabled by multi-buffering.

```bash
loc("/tmp/tmp0h121237/kernel.ttadapter.mlir":2:3): error: ub overflow, requires 3219456 bits while 1572864 bits available! (possible reason: tiling basic block is too large or block number is more than what user expect due to multi-buffer feature is enabled and some ops need extra local buffer.)
```

--------------------------------

### Reassociation Generation from Mask

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/AutoFlatten/AutoFlatten.md

Example of generating reassociation groups based on a given mask.

```APIDOC
## Reassociation Generation from Mask

Input Mask: [U, C, U, N, U, C, U]

Processing:
  Segment 1: [U, C, U] → Group units with collapsible → [[0, 1, 2]]
  Segment 2: [N] → Isolated non-collapsible → [[3]]
  Segment 3: [U, C, U] → Group units with collapsible → [[4, 5, 6]]

Result: [[0, 1, 2], [3], [4, 5, 6]]
```

--------------------------------

### Level 2 Memory Reuse Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/PlanMemory/PlanMemory.md

Demonstrates Level 2 memory reuse, preferring reuse within the same pipeline type (e.g., Vector with Vector) to avoid cross-pipeline dependencies and maintain throughput. This example shows how shared memory C (Vector OP) can reuse shared memory B (Vector OP) without performance degradation.
```text
Shared A [A0, A1]
Shared B [B]
Shared C [C]
Shared D [D0, D1]

Loop i:
  // sync
  op1(A0, A1) // DMA OP, Double Buffer
  op2(B)      // Vector OP
  op3(C)      // Vector OP
  op4(D0, D1) // DMA OP, Double Buffer
```

--------------------------------

### Set Ascend CANN Environment Variables

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Sources the environment setup script for Ascend CANN. The path to the script may vary depending on the CANN version.

```bash
# If the version is earlier than 8.5.0, the path is ${PATH-TO-CANN}/ascend-toolkit/set_env.sh.
source ${PATH-TO-CANN}/cann/set_env.sh
```

--------------------------------

### MLIR Broadcast Operation Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/AutoFlatten/AutoFlatten.md

This MLIR snippet demonstrates a basic broadcast operation before any transformations are applied.

```mlir
%0 = hivm.vbrc %input broadcast_dims = [3] : memref<1x64x1x128x256xf32> -> memref<1x64x16x128x256xf32>
```

--------------------------------

### hivm.hir.get_sys_cnt

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Retrieves the system count of the current device. This operation is used to get a system-level counter value.

```APIDOC
## `hivm.hir.get_sys_cnt` (hivm::GetSysCntOp)

### Description
Get sys cnt of the current device.

### Syntax
```mlir
operation ::= `hivm.hir.get_sys_cnt` attr-dict `->` type($result)
```

### Results
| Result | Description |
| :----: | ----------- |
| `result` | 64-bit signless integer |
```

--------------------------------

### Configure Target Include Directories

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/Integration/HIVM/VecAdd/CMakeLists.txt

Adds public include directories for the 'bishengir-npu-hivm-vec-add' target, including the Ascend NPU's general include path and the experiment/msprof directory.

```cmake
target_include_directories(bishengir-npu-hivm-vec-add PUBLIC
    ${ASCEND_HOME_PATH}/include
    ${ASCEND_HOME_PATH}/include/experiment/msprof
)
```
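
--------------------------------

### Reassociation Generation from Mask: Python Sketch

A minimal Python sketch of the grouping shown in the "Reassociation Generation from Mask" entry above. It is not taken from the AutoFlatten implementation: the function name `reassociation_from_mask` is hypothetical, and the rule it applies (each `N` dimension is isolated, each maximal run of `U` and `C` dimensions forms one group) is an assumption chosen only because it reproduces that entry's example. How the actual pass treats other mask patterns is not covered here.

```python
from typing import List


def reassociation_from_mask(mask: List[str]) -> List[List[int]]:
    """Group dimension indices by mask label (hypothetical helper).

    Assumed rule: 'N' (non-collapsible) dimensions get a group of their own;
    each maximal run of 'U' (unit) and 'C' (collapsible) dimensions is merged
    into a single group.
    """
    groups: List[List[int]] = []
    current: List[int] = []  # indices of the U/C run being accumulated
    for idx, label in enumerate(mask):
        if label == "N":
            # Close the pending U/C run, then isolate the N dimension.
            if current:
                groups.append(current)
                current = []
            groups.append([idx])
        elif label in ("U", "C"):
            current.append(idx)
        else:
            raise ValueError(f"unknown mask label: {label!r}")
    if current:
        groups.append(current)
    return groups


mask = ["U", "C", "U", "N", "U", "C", "U"]
print(reassociation_from_mask(mask))  # [[0, 1, 2], [3], [4, 5, 6]]
```

The sketch makes a single pass over the mask, so the groups come out in dimension order and every index appears exactly once.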