### Install CANN Package (Example)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Install the CANN toolkit and ops packages. Ensure you replace {version} with the correct CANN version and specify the installation path if needed. This is a prerequisite for end-to-end runs.

```bash
# Example: x86 A3, {version} is CANN version, e.g. 9.0.0
chmod +x Ascend-cann_{version}_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_{version}_linux-x86_64.run
./Ascend-cann_{version}_linux-x86_64.run --full [--install-path=${PATH-TO-CANN}]
./Ascend-cann-A3-ops_{version}_linux-x86_64.run --install [--install-path=${PATH-TO-CANN}]
```
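
The placeholder substitution can be sketched in Python. This is a minimal illustration, not part of the CANN tooling: `cann_install_commands` is a hypothetical helper, and the file name patterns and flags are copied from the snippet above.

```python
import shlex


def cann_install_commands(version, install_path=None):
    """Return the four shell commands above with {version} filled in.

    Hypothetical helper for illustration; it only formats strings and
    does not run anything.
    """
    toolkit = f"Ascend-cann_{version}_linux-x86_64.run"
    ops = f"Ascend-cann-A3-ops_{version}_linux-x86_64.run"
    # --install-path is optional in the original commands (shown in brackets).
    path_flag = [f"--install-path={install_path}"] if install_path else []
    return [
        shlex.join(["chmod", "+x", toolkit]),
        shlex.join(["chmod", "+x", ops]),
        shlex.join([f"./{toolkit}", "--full", *path_flag]),
        shlex.join([f"./{ops}", "--install", *path_flag]),
    ]


for command in cann_install_commands("9.0.0", "/opt/cann"):
    print(command)
```

Printing the commands before running them makes it easy to confirm that the version string matches the downloaded `.run` files.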

--------------------------------

### Install Ascend Toolkit

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Make the Ascend Toolkit installation script executable and run it to install the toolkit.

```bash
chmod +x Ascend-cann-toolkit_{ascend-cann-toolkit version}_linux-aarch64.run
./Ascend-cann-toolkit_{ascend-cann-toolkit version}_linux-aarch64.run --install
```

--------------------------------

### Fast Build (Skip Installation)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/introduction/quick_start/installing_guide.md

Perform a quick build that skips the installation step. Useful for development cycles where only compilation is needed.

```bash
# Fast build (skip install)
./build-tools/build.sh -o ./build --fast-build
```

--------------------------------

### Install CANN Package

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/introduction/quick_start/installing_guide.md

Install the CANN toolkit and ops packages. Ensure the execute permissions are set and use the --full or --install flags as appropriate. Specify the installation path if needed.

```bash
# Example: x86 system, A3 environment; {version} is the CANN version, e.g. 9.0.0
chmod +x Ascend-cann_{version}_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_{version}_linux-x86_64.run
./Ascend-cann_{version}_linux-x86_64.run --full [--install-path=${PATH-TO-CANN}]
./Ascend-cann-A3-ops_{version}_linux-x86_64.run --install [--install-path=${PATH-TO-CANN}]
```

--------------------------------

### Install Ascend CANN and Ops Packages

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Installs the Ascend CANN toolkit and ops packages in an x86 A3 environment. Replace {version} with the correct CANN version and, if needed, specify the installation path.

```bash
# In the x86 A3 environment, {version} indicates the CANN version, for example, 9.0.0.
chmod +x Ascend-cann_{version}_linux-x86_64.run
chmod +x Ascend-cann-A3-ops_{version}_linux-x86_64.run
./Ascend-cann_{version}_linux-x86_64.run --full [--install-path=${PATH-TO-CANN}]
./Ascend-cann-A3-ops_{version}_linux-x86_64.run --install [--install-path=${PATH-TO-CANN}]
```

--------------------------------

### Install PyTorch NPU Support

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Install `torch_npu`, which provides NPU backend support for PyTorch, together with `pybind11`.

```bash
pip install pybind11 torch_npu
```

--------------------------------

### Fast Build AscendNPU IR (Skip Install)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Perform a fast build of AscendNPU IR, skipping the installation step. This is useful when only the build artifacts are needed.

```bash
# Fast build (skip install)
./build-tools/build.sh -o ./build --fast-build
```

--------------------------------

### Install ascendnpu-ir Package

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/python/wheel/README.md

Install the ascendnpu-ir Python package using pip. Ensure you have Python 3.9 or higher.

```bash
pip install ascendnpu-ir
```
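
The Python version requirement can be checked before running `pip`. The sketch below uses only the standard library; `python_is_supported` is a hypothetical helper name, and the 3.9 minimum is taken from the description above.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the description above


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter's major.minor meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print("Python 3.9 or higher is required for ascendnpu-ir")
```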

--------------------------------

### msprof Command-Line Usage Examples

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/debug_option.md

These examples show how to use the msprof command-line tool for full-network and single-operator profiling, including options for specifying output directories, kernel names, and performance metrics.

```bash
# Full-network on-device profiling
# --output: directory for profiling data (default: current dir)
# --application: command to run
msprof --output=xxx --application=""
```

```bash
# Single-operator on-device profiling
# --kernel-name: kernel name (supports prefix match)
# --aic-metrics: enable metrics (Roofline, Occupancy, MemoryDetail, etc.)
msprof op --output=xxx --application="" --kernel-name=xxx --aic-metrics=xxx
```

```bash
# Single Operator Simulation Tuning
# --core-id - Specify the IDs of partial logical cores to parse simulation data of designated cores
# --kernel-name - Specify the name of the operator to be collected; fuzzy matching by operator name prefix is supported
# --soc-version - Specify the simulator type
# --output - Storage path for collected performance data; data is saved in the current directory by default
msprof op simulator --core-id=xxx --kernel-name=xxx --soc-version=Ascendxxx --output=xxx
```
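
When profiling is scripted, the single-operator command line can be assembled programmatically. The following is a minimal sketch: `msprof_op_command` is a hypothetical helper, it uses only the flags shown above, and it assumes that every flag other than `--application` may be omitted.

```python
import shlex


def msprof_op_command(application, output=None, kernel_name=None, aic_metrics=None):
    """Build the argument list for `msprof op` from the flags shown above."""
    args = ["msprof", "op"]
    if output:
        args.append(f"--output={output}")
    args.append(f"--application={application}")
    if kernel_name:
        args.append(f"--kernel-name={kernel_name}")
    if aic_metrics:
        args.append(f"--aic-metrics={aic_metrics}")
    return args


# "python3 run.py" is a placeholder application command.
args = msprof_op_command("python3 run.py", output="./prof", kernel_name="add")
print(shlex.join(args))  # quoted form, safe to paste into a shell
```

Passing the list to `subprocess.run` avoids shell quoting problems when the application command contains spaces.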

--------------------------------

### Build AscendNPU-IR

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Build the AscendNPU-IR component using the provided installation script. An alternative allows specifying a local path for AscendNPU-IR.

```bash
cd tilelang-mlir-ascend
bash install_npuir.sh
# Alternative way of building with local AscendNPU-IR
bash install_npuir.sh --bishengir-path=/path/to/AscendNPU-IR/build/install
```

--------------------------------

### Install torch_npu and triton-ascend

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Installs torch_npu pinned to version 2.7.1 and the latest available triton-ascend using pip. Ensure both are compatible with your Python environment.

```bash
pip install torch_npu==2.7.1
pip install triton-ascend
```
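
After installation, the resolved versions can be inspected with the standard library. This is a sketch using `importlib.metadata`; `installed_version` and `report` are hypothetical helper names, and the package names are the ones from the pip commands above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("torch_npu", "triton-ascend")):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


for name, version in report().items():
    print(f"{name}: {version or 'not installed'}")
```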

--------------------------------

### Install BishengIR Headers

Source: https://github.com/ascend/ascendnpu-ir/blob/master/CMakeLists.txt

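Installs BishengIR header files under `${CMAKE_INSTALL_INCLUDEDIR}` as the `bishengir-headers` component. The first rule copies headers from the source tree; the second copies generated headers from the build tree and excludes `CMakeFiles` directories and `config.h`.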

```cmake
install(
    DIRECTORY bishengir/include/bishengir
    DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}"
    COMPONENT bishengir-headers
    FILES_MATCHING
    PATTERN "*.def"
    PATTERN "*.h"
    PATTERN "*.inc"
    PATTERN "*.td"
    PATTERN "LICENSE.TXT")

install(
    DIRECTORY ${BISHENGIR_BINARY_DIR}/bishengir/include/bishengir
    DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}"
    COMPONENT bishengir-headers
    FILES_MATCHING
    PATTERN "*.def"
    PATTERN "*.h"
    PATTERN "*.gen"
    PATTERN "*.inc"
    PATTERN "*.td"
    PATTERN "CMakeFiles" EXCLUDE
    PATTERN "config.h" EXCLUDE)
```

--------------------------------

### TargetDeviceSpecAttr Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HACCDialect.md

An example of TargetDeviceSpecAttr, which represents NPU target device specifications. Each specification describes a device and its hardware properties, such as UB_SIZE.

```mlir
#hacc.target_device_spec<
  #dlti.dl_entry<"UB_SIZE", 196608 : i32>>
```
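
To get a feel for the `UB_SIZE` value, the arithmetic below converts it to KiB and to element counts. It is a sketch that assumes the value is expressed in bytes, which the source does not state explicitly; `elements_that_fit` is a hypothetical helper.

```python
UB_SIZE = 196608  # value of the "UB_SIZE" entry above; byte unit is an assumption


def elements_that_fit(bytes_per_element, ub_size=UB_SIZE):
    """Number of elements of the given width that fit in a buffer of ub_size bytes."""
    return ub_size // bytes_per_element


print("KiB:", UB_SIZE // 1024)
print("f16 elements:", elements_that_fit(2))
print("f32 elements:", elements_that_fit(4))
```

Under that assumption the buffer is 192 KiB, which bounds the tile sizes a kernel can keep resident at once.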

--------------------------------

### Setup Host Tool

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/tools/bishengir-hfusion-ods-gen/CMakeLists.txt

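Registers `bishengir-hfusion-ods-yaml-gen` as a host tool through `setup_host_tool`. The call populates the `BISHENGIR_HFUSION_ODS_YAML_GEN_EXE` and `BISHENGIR_HFUSION_ODS_YAML_GEN_TARGET` variables so that later build rules can invoke the tool and depend on it.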

```cmake
setup_host_tool(bishengir-hfusion-ods-yaml-gen BISHENGIR_HFUSION_ODS_YAML_GEN BISHENGIR_HFUSION_ODS_YAML_GEN_EXE BISHENGIR_HFUSION_ODS_YAML_GEN_TARGET)
```

--------------------------------

### Example Test Command for a Specific Pass

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/CV/CVOptimization.md

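A lit `RUN` directive that runs the `-hivm-normalize-matmul` pass with `bishengir-opt` and pipes the result to FileCheck for verification. It belongs in a comment at the top of the `.mlir` test file, where `%s` expands to that file's path; it is not meant to be pasted into a shell as-is.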

```bash
// RUN: bishengir-opt -hivm-normalize-matmul %s -split-input-file -verify-diagnostics -allow-unregistered-dialect | FileCheck %s
```

--------------------------------

### Cube Loop Tiling Example (Before)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/CV/TileCubeAndVectorLoop.md

Illustrates the structure of a Cube loop before the tiling pass is applied. This serves as a baseline for understanding the transformation.

```mlir
scf.for {
  hivm.load A
  hivm.load B
  hivm.hir.mmadL1
  hivm.hir.fixpipe
} {cube_loop}
```

--------------------------------

### MLIR Kernel Entry Function Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/interface_api.md

Example of an MLIR function marked as a kernel entry point for the device, utilizing specific Ascend attributes.

```mlir
func.func @kernel(...) attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  ...
}
```

--------------------------------

### Vector Cumulative Product Example (Memref)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Example of using `hivm.hir.vcumprod` with memref types for calculating cumulative product along dimension 0 in reverse.

```mlir
hivm.hir.vcumprod ins(%src : memref<?xf32>) outs(%dst : memref<?xf32>) cum_dims : [0] reverse = true
```
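
A plain-Python reference can help when checking results on small inputs. The sketch below assumes that `reverse = true` means the product is accumulated from the last element toward the first (so each output is the product of the element and everything after it); this reading is an assumption, not something the source spells out.

```python
from itertools import accumulate
from operator import mul


def cumprod(values, reverse=False):
    """Cumulative product of a 1-D sequence.

    With reverse=True the accumulation starts at the last element
    (assumed meaning of `reverse = true` in the op above).
    """
    if reverse:
        return list(accumulate(reversed(values), mul))[::-1]
    return list(accumulate(values, mul))


print(cumprod([1.0, 2.0, 3.0, 4.0]))
print(cumprod([1.0, 2.0, 3.0, 4.0], reverse=True))
```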

--------------------------------

### HIVM VPad Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

An example demonstrating the usage of the hivm.hir.vpad operation to pad a tensor. It specifies the source and destination tensors, padding dimensions, and the pad value.

```mlir
hivm.hir.vpad ins(%src : tensor<2x16xf32>) outs(%dst: tensor<?x16xf32>)
              low[%first_dim_low, 0] high[%first_dim_high, 0]
              pad_value %pad_value : f32
                -> tensor<?x16xf32>
```
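
The shape relationship in this example can be illustrated with a small reference in plain Python. It is a sketch of padding along the first dimension only (the second dimension uses `low` and `high` of 0 in the example); `pad_rows` is a hypothetical helper and not the op's implementation.

```python
def pad_rows(src, low, high, pad_value):
    """Pad a 2-D list with `low` rows before and `high` rows after the data."""
    width = len(src[0])
    before = [[pad_value] * width for _ in range(low)]
    after = [[pad_value] * width for _ in range(high)]
    return before + [list(row) for row in src] + after


src = [[1.0] * 16, [2.0] * 16]            # stands in for tensor<2x16xf32>
dst = pad_rows(src, low=1, high=2, pad_value=0.0)
print(len(dst), len(dst[0]))              # rows = 2 + low + high, columns unchanged
```

Because `low` and `high` for the first dimension are runtime values in the MLIR example, the result's first dimension is dynamic (`?`), while the second stays at 16.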

--------------------------------

### Configure Ascend Toolkit Environment

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Source the environment setup script provided by the Ascend Toolkit to configure necessary environment variables.

```bash
source /path/to/install/Ascend/ascend-toolkit/set_env.sh
```

--------------------------------

### Link Libraries for bishengir-capi-ir-test

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/CAPI/CMakeLists.txt

Example of using `_add_capi_test_executable` to create the 'bishengir-capi-ir-test' executable, linking BiShengIRCAPIRegisterEverything and MLIRCAPIIR libraries.

```cmake
_add_capi_test_executable(bishengir-capi-ir-test
  ir.c
  LINK_LIBS PRIVATE
    BiShengIRCAPIRegisterEverything
    MLIRCAPIIR
)
```

--------------------------------

### Example of Ninja Build Error

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

This error means Ninja cannot find `build.ninja` in the build directory, typically because the configure step has not run or did not finish. Rerunning CMake with the `-r` option regenerates the file.

```bash
ninja: error: loading 'build.ninja': No such file or directory
```

--------------------------------

### HIVM VArangeOp Examples

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Illustrates the usage of the hivm.hir.varange operation with different output types (memref and tensor) and demonstrates how to specify offsets and strides.

```mlir
hivm.hir.varange offset[%o] strides[%s0, %s1] outs(%dst : memref<32xf32>)
```

```mlir
%result = hivm.hir.varange offset[%o] strides[%s0, %s1] outs(%dst : tensor<32xf32>)
                            -> tensor<32xf32>
```

--------------------------------

### Original Kernel Invocation (GDN Example)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/best_practice.md

Demonstrates the standard way of invoking the `chunk_gated_delta_rule_fwd_kernel_h_blockdim64` operator in the GDN network.

```python
chunk_gated_delta_rule_fwd_kernel_h_blockdim64[
    grid
](
    k=k,
    v=u,
    w=w,
    v_new=v_new,
    g=g,
    gk=gk,
    h=h,
    h0=initial_state,
    ht=final_state,
    cu_seqlens=cu_seqlens,
    chunk_offsets=chunk_offsets,
    T=T,
    H=H,
    K=K,
    V=V,
    BT=BT,
)
```

--------------------------------

### TargetDeviceSpecAttr Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/developer_guide/dialects/HACCDialect.md

Represents NPU target device specifications, including hardware properties like UB_SIZE. Each specification describes a single device.

```mlir
#hacc.target_device_spec<
  #dlti.dl_entry<"UB_SIZE", 196608 : i32>>
```

--------------------------------

### Link Libraries for bishengir-capi-pass-test

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/CAPI/CMakeLists.txt

Example of using `_add_capi_test_executable` to create the 'bishengir-capi-pass-test' executable, linking multiple BiShengIR and MLIR C API libraries.

```cmake
_add_capi_test_executable(bishengir-capi-pass-test
  pass.c
  LINK_LIBS PRIVATE
    BiShengIRCAPIRegisterEverything
    BiShengIRCAPIAnnotation
    MLIRCAPIFunc
    BiShengIRCAPIHFusion
    BiShengIRCAPIHIVM
    MLIRCAPIIR
    MLIRCAPIRegisterEverything
    MLIRCAPITransforms
)
```

--------------------------------

### Install CANN Python Dependencies

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Installs the required Python dependencies for the Ascend CANN environment. Ensure these versions are compatible with your setup.

```bash
pip install attrs==24.2.0 numpy==1.26.4 scipy==1.13.1 decorator==5.1.1 psutil==6.0.0 pyyaml
```

--------------------------------

### Set Ascend Home Path

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/Integration/HIVM/VecAdd/README.md

Set the ASCEND_HOME_PATH environment variable to the installed path of the CANN software package before running the example on Ascend NPU.

```bash
export ASCEND_HOME_PATH=/usr/local/ascend-toolkit/latest
```
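
A launcher script can verify the variable before starting a run. This is a minimal sketch using the standard library; `resolve_ascend_home` is a hypothetical helper, and the only name taken from the source is `ASCEND_HOME_PATH`.

```python
import os
from pathlib import Path


def resolve_ascend_home(environ=os.environ):
    """Return ASCEND_HOME_PATH as a Path, or raise if it is unset or not a directory."""
    value = environ.get("ASCEND_HOME_PATH")
    if not value:
        raise RuntimeError("ASCEND_HOME_PATH is not set")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"ASCEND_HOME_PATH does not point to a directory: {path}")
    return path
```

Failing early with a clear message is usually easier to diagnose than a missing-library error later in the run.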

--------------------------------

### Build Documentation with Make

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/README.md

Use the make command from the repository root to build documentation in HTML format. Specify 'html' for English, 'html-zh' for Chinese, or 'html-all' for both.

```bash
make -C docs html
make -C docs html-zh
make -C docs html-all
```

--------------------------------

### Build Documentation from Docs Directory

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/README.md

Alternatively, navigate to the 'docs/' directory and use the make command to build documentation. 'html' builds English, 'html-zh' builds Chinese, and 'html-all' builds both.

```bash
make html
make html-zh
make html-all
```

--------------------------------

### Build Documentation

Source: https://github.com/ascend/ascendnpu-ir/blob/master/README.md

Build the documentation for AscendNPU IR. Specify the target language for the output.

```bash
make -C docs html      # English only → docs/_build/en
make -C docs html-zh   # Chinese only → docs/_build/zh_cn
make -C docs html-all  # Both languages
```

--------------------------------

### Build AscendNPU IR from Source (Recommended)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/introduction/quick_start/installing_guide.md

Use the provided build script for a streamlined build process. This command configures, builds, and installs the project. The --apply-patches flag is required for the first build.

```bash
# From the repo root
./build-tools/build.sh -o ./build --build-type Release --apply-patches
```

--------------------------------

### Preview Documentation Locally

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/README.md

Open the generated HTML files to preview the documentation. `open` launches the file in the default browser on macOS (use `xdg-open` on most Linux desktops); alternatively, serve the output directory over HTTP with `python3 -m http.server`.

```bash
# English
open docs/_build/en/index.html

# Chinese
open docs/_build/zh_cn/index.html

# Or serve with HTTP (e.g. port 8080 for English, 8081 for Chinese)
cd docs/_build/en && python3 -m http.server 8080
cd docs/_build/zh_cn && python3 -m http.server 8081
```

--------------------------------

### Add LLVM Install Target for Headers

Source: https://github.com/ascend/ascendnpu-ir/blob/master/CMakeLists.txt

Adds a custom install target for BishengIR headers if IDE support is disabled.

```cmake
if(NOT LLVM_ENABLE_IDE)
  add_llvm_install_targets(install-bishengir-headers DEPENDS
                           bishengir-headers COMPONENT bishengir-headers)
endif()
```

--------------------------------

### Install Python Dependencies for Ascend Toolkit

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/tile_lang_interface.md

Install the Python packages required by the Ascend Toolkit. NumPy is constrained to `>=1.19.2,<=1.24.0` and protobuf is pinned to `3.20.0`.

```bash
pip3 install attrs cython 'numpy>=1.19.2,<=1.24.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf==3.20.0 scipy requests absl-py
```

--------------------------------

### Build Wheel Package from Source

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/python/wheel/README.md

Build the bishengir-compile binary and then the wheel package from source. The wheel will be located in the dist directory.

```bash
# Build the compiler
cd build-tools
./build.sh

# Build the wheel package
./build_wheel.sh
```

--------------------------------

### Vector Sort Example (Memref)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Example of using the `hivm.hir.vsort` operation with memref types for sorting in descending order along axis 0.

```mlir
hivm.hir.vsort ins(%src : memref<?xf32>) outs(%dst : memref<?xf32>) descending = true sort_axis = 0
```

--------------------------------

### Torch IR to Linalg/HFusion Conversion Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/developer_guide/conversion/framework_interface.md

Example of a Torch IR function for element-wise multiplication, demonstrating the input and output tensor types.

```mlir
func.func @torch_mul(%arg0: !torch.vtensor<[4096],f16>, %arg1: !torch.vtensor<[1,56,4096],f16>) -> !torch.vtensor<[1,56,4096],f16>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %0 = torch.aten.mul.Tensor %arg0, %arg1 : !torch.vtensor<[4096],f16>, !torch.vtensor<[1,56,4096],f16> -> !torch.vtensor<[1,56,4096],f16>
  return %0 : !torch.vtensor<[1,56,4096],f16>
}
```

--------------------------------

### Build AscendNPU IR (First Build)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Perform the initial build of AscendNPU IR using the build script. This command configures CMake, builds the project with Ninja, and applies necessary patches to submodules.

```bash
# From the repo root
./build-tools/build.sh -o ./build --build-type Release --apply-patches
```

--------------------------------

### HIVM Load Operation Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Demonstrates loading data from global memory to a local unified buffer. Supports padding and initialization options. Ensure src and dst have the same element type and shape if padding is not used.

```mlir
hivm.load ins(%src : memref<16x16xf16, #hivm.address_space<gm>>) outs(%dst : memref<16x16xf16, #hivm.address_space<ub>>)
```

--------------------------------

### Compile MLIR to Device Binary

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/Integration/HIVM/VecAdd/README.md

Use `bishengir-compile` to compile the `.mlir` file into an executable binary for the Ascend NPU. The `-enable-hivm-compile` flag is crucial for this process.

```bash
bishengir-compile add.mlir -enable-hivm-compile -o kernel.o
```
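
In a test harness the same invocation can be wrapped in Python. The sketch below is illustrative only: `compile_command` and `compile_kernel` are hypothetical helpers, and the flags are exactly those in the command above.

```python
import shutil
import subprocess


def compile_command(mlir_file, output):
    """Argument list equivalent to the shell command above."""
    return ["bishengir-compile", mlir_file, "-enable-hivm-compile", "-o", output]


def compile_kernel(mlir_file, output):
    """Run bishengir-compile, failing early if the tool is not on PATH."""
    if shutil.which("bishengir-compile") is None:
        raise FileNotFoundError("bishengir-compile is not on PATH")
    subprocess.run(compile_command(mlir_file, output), check=True)
    return output
```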

--------------------------------

### HFusion PrintOp Usage Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/debug_option.md

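Example of inserting `hfusion.print` into HFusion IR to print the result of a load operation. The op takes a prefix string (`" x: "` here), a `hex` attribute, and the value to print; set `hex` explicitly, as the example does with `hex = false`.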

```mlir
func.func @vector_kernel(%arg0: memref<?xi8> {hacc.arg_type = #hacc.arg_type<sync_block_lock>}, %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>}, %arg2: memref<?xi64> {tt.divisibility = 16 : i32, tt.tensor_kind = 0 : i32}, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: i32) attributes {SyncBlockLockArgIdx = 0 : i64, WorkspaceArgIdx = 1 : i64, hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>, mix_mode = "aiv", parallel_mode = "simd"} {
  %reinterpret_cast = memref.reinterpret_cast %arg2 to offset: [0], sizes: [8], strides: [1] : memref<?xi64> to memref<8xi64, strided<[1]>>
  %alloc = memref.alloc() : memref<8xi64>
  memref.copy %reinterpret_cast, %alloc : memref<8xi64, strided<[1]>> to memref<8xi64>
  %0 = bufferization.to_tensor %alloc restrict writable : memref<8xi64>
  hfusion.print " x: " {hex = false} %0 : tensor<8xi64>
  return
}
```

--------------------------------

### Cube Loop Tiling Example (After)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/CV/TileCubeAndVectorLoop.md

Shows the Cube loop structure after the tiling pass. The original large iteration is split into smaller iterations, with operations like `hivm.load` now operating on slices.

```mlir
scf.for {
  for {
    hivm.load slice_A
    hivm.load slice_B
    hivm.hir.mmadL1
    hivm.hir.fixpipe
  } {sub_tile}
} {cube_loop}
```
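
The effect of the split can be illustrated with a small matrix multiply in plain Python. This is a conceptual sketch, not the pass itself: it assumes the sub-tile loop walks over row blocks of `A`, which the source does not specify, and the function names are hypothetical.

```python
def matmul(a, b):
    """Reference matrix multiply on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def matmul_row_tiled(a, b, tile_rows):
    """Same result, computed one block of rows at a time (the sub_tile loop)."""
    out = []
    for start in range(0, len(a), tile_rows):
        slice_a = a[start:start + tile_rows]   # analogous to loading slice_A
        out.extend(matmul(slice_a, b))         # compute and write back this slice
    return out


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0], [0, 1]]
print(matmul_row_tiled(a, b, tile_rows=2) == matmul(a, b))
```

Each inner iteration touches only a slice of the input, so the working set per iteration shrinks while the overall result is unchanged.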

--------------------------------

### Build AscendNPU IR with Specific Compiler and Threads

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Configure the build process to use specific C and C++ compilers and set the number of parallel build threads. This allows for fine-tuning the build environment.

```bash
# Specify compiler and thread count
./build-tools/build.sh -o ./build --c-compiler /usr/bin/clang-15 --cxx-compiler /usr/bin/clang++-15 -j 256
```

--------------------------------

### Converted Linalg/HFusion IR Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/framework_interface.md

Example of MLIR code after conversion from Torch IR to Linalg/HFusion. It shows the `mul` operation represented using `linalg.broadcast` and `linalg.elemwise_binary`.

```mlir
func.func @torch.aten.mul_tensor(%arg0: tensor<4096xf16>, %arg1: tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %0 = tensor.empty() : tensor<1x56x4096xf16>
  %broadcasted = linalg.broadcast ins(%arg0 : tensor<4096xf16>) outs(%0 : tensor<1x56x4096xf16>) dimensions = [0, 1]
  %1 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%broadcasted, %arg1 : tensor<1x56x4096xf16>, tensor<1x56x4096xf16>) outs(%0 : tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
  return %1 : tensor<1x56x4096xf16>
}
```
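
The broadcast-then-multiply pattern can be mirrored on a tiny input in plain Python. This sketch uses a 1x2x3 shape instead of 1x56x4096 and hypothetical helper names; it illustrates that `dimensions = [0, 1]` names the two leading dimensions added by the broadcast.

```python
def broadcast_vector(vec, rows):
    """Replicate a length-N vector into shape 1 x rows x N (new dims 0 and 1)."""
    return [[list(vec) for _ in range(rows)]]


def elementwise_mul(x, y):
    """Multiply two equally shaped 3-D nested lists element by element."""
    return [[[a * b for a, b in zip(row_x, row_y)]
             for row_x, row_y in zip(plane_x, plane_y)]
            for plane_x, plane_y in zip(x, y)]


arg0 = [1, 2, 3]                       # stands in for tensor<4096xf16>
arg1 = [[[1, 1, 1], [2, 2, 2]]]        # stands in for tensor<1x56x4096xf16>
broadcasted = broadcast_vector(arg0, rows=2)
print(elementwise_mul(broadcasted, arg1))
```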

--------------------------------

### Vector Sort Example (Tensor)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Example of using the `hivm.hir.vsort` operation with tensor types for sorting in descending order along axis 0, specifying the result type.

```mlir
%result = hivm.hir.vsort ins(%src : tensor<?xf32>) outs(%dst : tensor<?xf32>) descending = true sort_axis = 0 -> tensor<?xf32>
```

--------------------------------

### Run LIT test suite

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/faq/faq.md

Execute the LIT test suite by running './bin/llvm-lit' with the path to your test directory. Adjust paths as necessary.

```bash
./bin/llvm-lit ../bishengir/test
```

--------------------------------

### Vector Cumulative Product Example (Tensor)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Example of using `hivm.hir.vcumprod` with tensor types for calculating cumulative product along dimension 0 in reverse, specifying the result type.

```mlir
%result = hivm.hir.vcumprod ins(%src : tensor<?xf32>) outs(%dst : tensor<?xf32>) cum_dims : [0] reverse = true -> tensor<?xf32>
```

--------------------------------

### Triton index_put Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Demonstrates how to use the index_put function to place a value tensor into a target tensor at specified indices. This example shows a 2D index placement scenario.

```python
import triton
import triton.language as tl

@triton.jit
def put_index():
    # Placeholder for actual tensor definitions and kernel logic
    # Example tensor pointers and tiles would be defined here
    # For demonstration, assume dst_ptr, index_tile, and value_tile are pre-defined
    dst_ptr = tl.make_block_ptr(0, (1,), (1,), (0,), (1,), (1,))
    index_tile = tl.arange(0, 4).to(tl.int32).reshape(4, 1)
    value_tile = tl.arange(0, 8).to(tl.float16).reshape(4, 2)

    tmp_buf = tl.index_put(
        ptr=dst_ptr,
        index=index_tile,
        value=value_tile,
        dim=0,
        index_boundary=4,
        end_offset=(2, 2),
        start_offset=(0, 0),
        dst_stride=(2, 1)
    )
    # Placeholder for further kernel operations or return
    return tmp_buf
```

--------------------------------

### Configure and Build AscendNPU IR with LLVM

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Use this CMake command to configure the build for AscendNPU IR, specifying compilers, build type, and external projects. Optional parameters can be uncommented and modified as needed.

```bash
export LLVM_SOURCE_DIR="$(realpath ../third-party/llvm-project)"
cmake ${LLVM_SOURCE_DIR}/llvm -G Ninja \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="mlir" \
    -DLLVM_EXTERNAL_PROJECTS="bishengir" \
    -DLLVM_EXTERNAL_BISHENGIR_SOURCE_DIR="$(realpath ..)" \
    -DBSPUB_DAVINCI_BISHENGIR=ON \
    # [-DCMAKE_INSTALL_PREFIX="${PWD}/install"] \
    # [-DLLVM_MAJOR_VERSION_21_COMPATIBLE=ON] \
    # [-DLLVM_ENABLE_ASSERTIONS=ON] \
    # [-DMLIR_ENABLE_BINDINGS_PYTHON=ON] \
    # [-DLLVM_TARGETS_TO_BUILD="host;Native"] \
    # [-DBISHENGIR_PUBLISH=OFF] \
    # [-DBISHENGIR_BUILD_TEMPLATE=ON -DBISHENG_COMPILER_PATH=/path/to/bisheng-compiler] \
    # [-Dother-options=value]
```

```bash
ninja -j32
```

--------------------------------

### Manual Build Prerequisites

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Prepare for a manual build by initializing submodules and applying patches. These are essential prerequisites before proceeding with manual CMake configuration.

```bash
# From the repo root
# Prerequisites: submodule init (git submodule update --init --recursive) and patch application (source build-tools/apply_patches.sh).

mkdir -p build
cd build
```

--------------------------------

### Triton Kernel Example with Bitwise Mask Hint

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/best_practice.md

An example Triton kernel demonstrating the use of `tl.compile_hint` for a bitwise mask on a condition tensor loaded as i8. This snippet is intended for testing the bitmask feature.

```python
# test_bitmask.py
import triton
import triton.language as tl
import torch 
import torch_npu
import test_common

@triton.jit
def triton_where_lt_case1(in_ptr0, in_ptr1, cond_ptr, out_ptr0, xnumel, XBLOCK: tl.constexpr, XBLOCK_SUB: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    for xoffset_sub in range(0, XBLOCK, XBLOCK_SUB):
        xindex = xoffset + xoffset_sub + tl.arange(0, XBLOCK_SUB)[:]
        xmask = xindex < xnumel
        in0 = tl.load(in_ptr0 + xindex, xmask)
        in1 = tl.load(in_ptr1 + xindex, xmask)
        cond = tl.load(cond_ptr + xindex, xmask)
        res = tl.where(cond, in1, in0)
        # versions after triton-ascend 3.4.0
        # tl.extra.cann.extension.compile_hint(cond, "bitwise_mask")
        # versions before triton-ascend 3.2.0
        tl.compile_hint(cond, "bitwise_mask")
        tl.store(out_ptr0 + (xindex), res, xmask)

def test_where_lt_case1():
    dtype = "float32"
    shape = (1, 1024, 8)
    ncore = 1
    xblock = 8192
    xblock_sub = 1024
    if shape[-1] % 8 != 0:
        raise ValueError("The last dimension should be a multiple of 8")
    x0 = test_common.generate_tensor(shape, dtype).npu()
    x1 = test_common.generate_tensor(shape, dtype).npu()
    # Run triton with i8 bitwise mask
    cond_i8 = test_common.generate_tensor(shape, 'uint8').npu()
    y_cal = test_common.generate_tensor(shape, dtype).npu()
    triton_where_lt_case1[ncore, 1, 1](x0, x1, cond_i8, y_cal, x0.numel(), xblock, xblock_sub)

test_where_lt_case1()
```

--------------------------------

### Triton gather_out_to_ub Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Example usage of the gather_out_to_ub function within a Triton kernel. This snippet demonstrates how to call the function with source tensor pointer, index tensor, boundary, dimension, strides, and offsets.

```python
tmp_buf = al.gather_out_to_ub(
    src=src_ptr,
    index=index,
    index_boundary=4,
    dim=0,
    src_stride=(2, 1),
    end_offset=(2, 2),
    start_offset=(0, 0)
)
```

--------------------------------

### HIVM DebugOp Usage Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/user_guide/debug_option.md

Example of adding hivm.hir.debug to HIVM IR for runtime debugging. This snippet demonstrates printing a tensor with specified debug type, hex format, prefix, and core type.

```mlir
func.func @vector_kernel(%arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>}, %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<sync_block_lock>}, %arg2: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>}, %arg3: memref<?xi64> {tt.divisibility = 16 : i32, tt.tensor_kind = 0 : i32}, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32) attributes {SyncBlockLockArgIdx = 0 : i64, WorkspaceArgIdx = 1 : i64, func_dyn_memref_args = dense<[false, true, true, true, false, false, false, false]> : vector<8xi1>, hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>, mix_mode = "aiv", parallel_mode = "simd"} {
  %0 = arith.muli %arg5, %arg6 : i32
  %1 = arith.muli %0, %arg7 : i32
  annotation.mark %1 {logical_block_num} : i32
  %reinterpret_cast = memref.reinterpret_cast %arg3 to offset: [0], sizes: [8], strides: [1] : memref<?xi64> to memref<8xi64, strided<[1]>>
  %alloc = memref.alloc() : memref<8xi64>
  hivm.hir.load ins(%reinterpret_cast : memref<8xi64, strided<[1]>>) outs(%alloc : memref<8xi64>) init_out_buffer = false may_implicit_transpose_with_last_axis = false
  %2 = bufferization.to_tensor %alloc restrict writable : memref<8xi64>
  hivm.hir.debug {debugtype = "print", hex = false, prefix = " x: ", tcoretype = #hivm.tcore_type<CUBE_OR_VECTOR>} %2 : tensor<8xi64>
  return
}
```

--------------------------------

### Schedule API Usage Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/zh_cn/developer_guide/features/AutoSchedule/HFusion_AutoSchedule.md

Lists the Schedule API primitives available within `createScheduleImpl`, grouped by purpose: IO caching and buffer management, tiling, loop fusion, and multicore binding. These primitives are used to construct the Transform Dialect program.

```cpp
// IO caching and buffer management
  cacheRead
  cacheWrite
  setBufferSize
// Tiling and loop structure control
  tileUsingFor
  tileUsingForAll
  tileReductionUsingFor
// Loop fusion and coalescing
  fuseLoops
  fuseIntoContaining
  coalesceLoops
// Multicore binding
  bindLoopToMulticore
```

--------------------------------

### Setting up Double Buffering with 'multibuffer'

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Uses `al.multibuffer(tmp0, 2)` to set up double buffering for the loaded tensor `tmp0`. Keeping buffered copies of the tensor lets data movement overlap with computation.

```python
@triton.jit
def triton_compile_hint():
    #...
    tmp0 = tl.load(in_ptr0 + xindex, xmask)
    al.multibuffer(tmp0, 2)
    tl.store(out_ptr0 + (xindex), tmp0, xmask)
    #...
```

--------------------------------

### HFusion MLIR Example for Reduction

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/framework_interface.md

An example MLIR function demonstrating the use of Linalg and HFusion dialects for a reduction operation involving elementwise multiplication and summation. This code is intended to be compiled for the Ascend NPU.

```mlir
func.func @hfusion_reduce_mul(%arg0: tensor<40960xf32>, %arg1: tensor<40960x1024xf32>, %arg2: tensor<40960x1024xf32>, %arg3: tensor<40960x1024xf32>) -> tensor<40960xf32>
attributes {hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>} {
  %1 = tensor.empty() : tensor<40960x1024xf32>
  %3 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg1, %arg2 : tensor<40960x1024xf32>, tensor<40960x1024xf32>) outs(%arg3: tensor<40960x1024xf32>) -> tensor<40960x1024xf32>
  %4 = tensor.empty() : tensor<40960xf32>
  %sum = linalg.reduce {arith.addf} ins(%3 : tensor<40960x1024xf32>)
                                    outs(%4 : tensor<40960xf32>) dimensions = [1]
  %5 = tensor.empty() : tensor<40960xf32>
  %6 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg0, %sum : tensor<40960xf32>, tensor<40960xf32>)
                                                                  outs(%5: tensor<40960xf32>) -> tensor<40960xf32>
  return %6 : tensor<40960xf32>
}
```
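As a cross-check on what the function above computes, here is a minimal pure-Python sketch of the same arithmetic on tiny inputs. It is not part of the AscendNPU IR toolchain. The function name simply mirrors the MLIR symbol, and the sketch assumes the reduction accumulator starts at zero.

```python
def hfusion_reduce_mul(arg0, arg1, arg2):
    """Reference semantics: arg0[i] * sum_j(arg1[i][j] * arg2[i][j])."""
    # Elementwise multiply of the two 2-D inputs (first linalg.elemwise_binary).
    product = [[a * b for a, b in zip(row1, row2)]
               for row1, row2 in zip(arg1, arg2)]
    # Sum over dimension 1 (linalg.reduce with arith.addf, dimensions = [1]).
    row_sums = [sum(row) for row in product]
    # Elementwise multiply with the 1-D input (second linalg.elemwise_binary).
    return [w * s for w, s in zip(arg0, row_sums)]


result = hfusion_reduce_mul(
    [2.0, 0.5],
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
)
print(result)  # [34.0, 26.5]
```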

--------------------------------

### Unfold Symbolic Int: Invalid Bind Symbolic Shape Examples

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/passes/SymbolPasses.md

These examples illustrate invalid uses of symbol.bind_symbolic_shape that would violate the constraints of the unfold-symbolic-int pass. They show incorrect affine maps or binding to tensor.empty.

```mlir
symbol.bind_symbolic_shape %arg0, [%S0, %S1], affine_map<()[s0, s1] -> (s0/2, s0/s1, s1+1)> : tensor<?x640x?xf16>
%empty = tensor.empty(%S2, %S3) : tensor<?x?xf16>
symbol.bind_symbolic_shape %empty, [%S2, %S3], affine_map<()[s0, s1] -> (s0, s1)> : tensor<?x?xf16>
```

--------------------------------

### Build AscendNPU IR (Subsequent Builds)

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/introduction/quick_start/installing_guide.md

Execute subsequent builds of AscendNPU IR when the build directory already exists. This command leverages the existing build configuration.

```bash
./build-tools/build.sh -o ./build --build-type Release
```

--------------------------------

### Scope Overflow Error Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/PlanMemory/PlanMemory.md

A typical scope overflow error, reported when a kernel's buffers do not fit in the target memory scope. In this case the `ub` scope requires 3219456 bits but only 1572864 bits are available. Common causes are a tiling basic block that is too large, or extra local buffers introduced when multi-buffering is enabled.

```text
loc("/tmp/tmp0h121237/kernel.ttadapter.mlir":2:3): error: ub overflow,
requires 3219456 bits while 1572864 bits available! (possible reason:
tiling basic block is too large or block number is more than what user
expect due to multi-buffer feature is enabled and some ops need extra local buffer.)
```
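To make the numbers in the message easier to read, the sketch below pulls out the two bit counts and converts them to KiB. The regular expression is an assumption based only on the wording of this one message, and `parse_overflow` is a hypothetical helper, not a tool shipped with the compiler.

```python
import re

# Shortened copy of the message above (location prefix and hint omitted).
MESSAGE = "error: ub overflow,\nrequires 3219456 bits while 1572864 bits available!"


def parse_overflow(message):
    """Return (scope, required_bits, available_bits) from an overflow message."""
    match = re.search(
        r"(\w+) overflow,\s*requires (\d+) bits while (\d+) bits available",
        message,
    )
    if match is None:
        raise ValueError("not a scope overflow message")
    return match.group(1), int(match.group(2)), int(match.group(3))


scope, required, available = parse_overflow(MESSAGE)
required_kib = required / 8 / 1024    # bits -> bytes -> KiB
available_kib = available / 8 / 1024
print(f"{scope}: needs {required_kib:.0f} KiB, has {available_kib:.0f} KiB "
      f"({required / available:.2f}x the budget)")
# ub: needs 393 KiB, has 192 KiB (2.05x the budget)
```

A request of roughly twice the available space is consistent with the double-buffering cause named in the message, although the ratio alone does not prove it.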

--------------------------------

### Reassociation Generation from Mask

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/AutoFlatten/AutoFlatten.md

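Shows how a dimension mask is turned into reassociation groups. Adjacent unit (U) and collapsible (C) dimensions are grouped together, while each non-collapsible (N) dimension forms its own group.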

```APIDOC
## Reassociation Generation from Mask

Input Mask: [U, C, U, N, U, C, U]

Processing:
  Segment 1: [U, C, U] → Group units with collapsible → [[0, 1, 2]]
  Segment 2: [N]       → Isolated non-collapsible    → [[3]]
  Segment 3: [U, C, U] → Group units with collapsible → [[4, 5, 6]]

Result: [[0, 1, 2], [3], [4, 5, 6]]
```
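The grouping rule in the example can be sketched in a few lines of Python. This is a minimal illustration that reproduces the result shown above. The function name and the single-letter mask encoding are assumptions, and the actual AutoFlatten pass may treat edge cases differently (for instance, unit dimensions next to a non-collapsible one).

```python
def reassociation_from_mask(mask):
    """Group dimension indices: runs of U/C merge, each N stands alone."""
    groups = []
    current = []  # indices of the U/C run being accumulated
    for dim, kind in enumerate(mask):
        if kind == "N":
            # A non-collapsible dimension closes the current run
            # and becomes a group of its own.
            if current:
                groups.append(current)
                current = []
            groups.append([dim])
        elif kind in ("U", "C"):
            current.append(dim)
        else:
            raise ValueError(f"unknown mask entry {kind!r} at index {dim}")
    if current:
        groups.append(current)
    return groups


print(reassociation_from_mask(["U", "C", "U", "N", "U", "C", "U"]))
# [[0, 1, 2], [3], [4, 5, 6]]
```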

--------------------------------

### Level 2 Memory Reuse Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/PlanMemory/PlanMemory.md

Demonstrates Level 2 memory reuse, preferring reuse within the same pipeline type (e.g., Vector with Vector) to avoid cross-pipeline dependencies and maintain throughput. This example shows how shared memory C (Vector OP) can reuse shared memory B (Vector OP) without performance degradation.

```text
Shared A [A0, A1]
Shared B [B]
Shared C [C]
Shared D [D0, D1]
Loop i:
  // sync
  op1(A0, A1) // DMA OP, Double Buffer
  op2(B)      // Vector OP
  op3(C)      // Vector OP
  op4(D0, D1) // DMA OP, Double Buffer
```
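The same-pipeline preference can be illustrated with a small Python sketch. The `Buffer` class, the `pick_reuse_candidate` helper, and the list of free buffers are all hypothetical and exist only for illustration. The real PlanMemory pass also tracks lifetimes and sizes, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    pipeline: str  # pipeline of the op that uses the buffer, e.g. "DMA" or "Vector"


def pick_reuse_candidate(new_buffer, free_buffers):
    """Return the first free buffer used by the same pipeline, else None."""
    for candidate in free_buffers:
        if candidate.pipeline == new_buffer.pipeline:
            return candidate
    # No same-pipeline candidate: this sketch declines to reuse at this level.
    return None


# Hypothetical state: buffers A and B are no longer live when C is planned.
free = [Buffer("A", "DMA"), Buffer("B", "Vector")]
chosen = pick_reuse_candidate(Buffer("C", "Vector"), free)
print(chosen.name)  # B
```

Skipping `A` even though it appears first reflects the stated goal: reusing a DMA buffer for a Vector op would introduce a cross-pipeline dependency.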

--------------------------------

### Set Ascend CANN Environment Variables

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/conversion/triton_interface.md

Sources the environment setup script for Ascend CANN. The path to the script may vary depending on the CANN version.

```bash
# For CANN versions earlier than 8.5.0, the path is ${PATH-TO-CANN}/ascend-toolkit/set_env.sh.
source ${PATH-TO-CANN}/cann/set_env.sh
```
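When the setup is scripted, the version rule in the comment above can be captured in a small helper. The sketch below is hypothetical (`set_env_script` is not part of CANN) and assumes purely numeric version strings such as `8.3.0` or `9.0.0`.

```python
import posixpath


def set_env_script(cann_root, version):
    """Return the set_env.sh path for a given CANN install root and version."""
    parts = tuple(int(p) for p in version.split("."))
    parts = (parts + (0, 0, 0))[:3]  # pad "8.5" to (8, 5, 0) before comparing
    # Versions earlier than 8.5.0 keep the script under ascend-toolkit/.
    subdir = "ascend-toolkit" if parts < (8, 5, 0) else "cann"
    return posixpath.join(cann_root, subdir, "set_env.sh")


print(set_env_script("/opt/cann", "8.3.0"))  # /opt/cann/ascend-toolkit/set_env.sh
print(set_env_script("/opt/cann", "9.0.0"))  # /opt/cann/cann/set_env.sh
```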

--------------------------------

### MLIR Broadcast Operation Example

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/features/AutoFlatten/AutoFlatten.md

This MLIR snippet demonstrates a basic broadcast operation before any transformations are applied.

```mlir
%0 = hivm.vbrc %input broadcast_dims = [3] 
     : memref<1x64x1x128x256xf32> -> memref<1x64x16x128x256xf32>
```

--------------------------------

### hivm.hir.get_sys_cnt

Source: https://github.com/ascend/ascendnpu-ir/blob/master/docs/source/en/developer_guide/dialects/HIVMDialect.md

Retrieves the system counter value (sys cnt) of the current device. The result is a 64-bit signless integer.

````APIDOC
## `hivm.hir.get_sys_cnt` (hivm::GetSysCntOp)

### Description
Get sys cnt of the current device.

### Syntax
```mlir
operation ::= `hivm.hir.get_sys_cnt` attr-dict `->` type($result)
```

### Results
| Result | Description |
| :----: | ----------- |
| `result` | 64-bit signless integer |
````

--------------------------------

### Configure Target Include Directories

Source: https://github.com/ascend/ascendnpu-ir/blob/master/bishengir/test/Integration/HIVM/VecAdd/CMakeLists.txt

Adds public include directories for the 'bishengir-npu-hivm-vec-add' target, including the Ascend NPU's general include path and the experiment/msprof directory.

```cmake
target_include_directories(bishengir-npu-hivm-vec-add PUBLIC
  ${ASCEND_HOME_PATH}/include
  ${ASCEND_HOME_PATH}/include/experiment/msprof
)
```