Enabling a Fused Operator
openMind Library supports fused operators provided by the Ascend Extension for PyTorch plugin torch_npu, enabling developers to fully unleash the computing power of Ascend AI Processors under the PyTorch framework.
You can modify the from_pretrained input parameter or the config.json file of the model to enable a fused operator without modifying the model code. Fused operators can greatly improve model performance. However, the effect varies depending on the model structure and hyperparameters.
openMind Library supports the following fused operators (☑ = supported, ☐ = not supported):
| Model | npu_rms_norm | npu_fusion_attention |
|---|---|---|
| Llama2 | ☑ | ☑ |
| Qwen2 | ☑ | ☑ |
| Internlm2 | ☐ | ☑ |
| Mistral | ☐ | ☑ |
More models and fused operators are being adapted and developed.
The following describes how to use these operators.
npu_fusion_attention
The fused operator implements the fused computation of the Transformer attention score. The standard (unfused) formula is:

Attention(Q, K, V) = Softmax(QKᵀ / √d) · V

For details, see the torch_npu documentation for the `npu_fusion_attention` operator. In static-shape scenarios, performance improves significantly.
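To make the formula concrete, the following is a minimal NumPy sketch of the unfused scaled dot-product attention computation that the fused operator replaces. The shapes and names are illustrative only; this is not the torch_npu API.

```python
# Reference (unfused) attention: Softmax(QK^T / sqrt(d)) V,
# shown with NumPy for illustration only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # QK^T / sqrt(d)
    return softmax(scores) @ v                    # Softmax(scores) V

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))  # (batch, seq_len, head_dim)
k = rng.standard_normal((2, 4, 8))
v = rng.standard_normal((2, 4, 8))
out = attention(q, k, v)
print(out.shape)  # (2, 4, 8)
```

The fused operator computes the same result in a single kernel, avoiding the intermediate score and softmax tensors that the step-by-step version materializes.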
Example
You can use the fused operator in either of the following ways:
Method 1: Use openMind Auto Classes to load the model: instantiate it with AutoModelForCausalLM and pass the `_attn_implementation="npu_fusion_attention"` parameter.

```python
from openmind import AutoModelForCausalLM

model_id = "/your/model/path"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    _attn_implementation="npu_fusion_attention",
)
```

Method 2: Pass `"_attn_implementation": "npu_fusion_attention"` through the config.json file of the model.

Take the AI-Research/internlm2-base-7b model as an example. Download the model to the local host and add the `"_attn_implementation": "npu_fusion_attention"` field to the config.json file in the model folder:

```json
{
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "_attn_implementation": "npu_fusion_attention",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.0",
  "use_cache": true,
  "vocab_size": 92544,
  "pretraining_tp": 1
}
```

Save the modification and use Auto Classes to load the model. The `_attn_implementation` parameter then no longer needs to be passed to the `from_pretrained` method. The following is an example:

```python
from openmind import AutoModelForCausalLM

model_id = "/your/local/model/path"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)
```
npu_rms_norm
The RmsNorm operator is a common normalization operation in large models. Compared with the LayerNorm operator, RmsNorm does not subtract the mean. The calculation formula is:

RmsNorm(x) = x / √(mean(x²) + ε) · γ

For details, see the torch_npu documentation for the `npu_rms_norm` operator. This fused operator is applicable to the Llama2 and Qwen2 models.
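The following is a minimal NumPy sketch of the unfused computation that the operator fuses, to make the "no mean subtraction" point concrete. It is illustrative only; the argument names are not the torch_npu API.

```python
# Reference (unfused) RmsNorm: x / sqrt(mean(x^2) + eps) * gamma.
# Unlike LayerNorm, the mean is NOT subtracted before normalizing.
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma = np.ones(4)  # learnable scale, initialized to 1
print(rms_norm(x, gamma))  # ≈ [[0.3651 0.7303 1.0954 1.4606]]
```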
Example
You can enable the fused operator by passing `"use_npu_rms_norm": true` through the config.json file of the model.

Take the AI-Research/Llama-3.2-3B-Instruct model as an example. Download the model to the local host and add the `"use_npu_rms_norm": true` field to the config.json file in the model folder:
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.0.dev0",
  "use_cache": true,
  "use_npu_rms_norm": true,
  "vocab_size": 128256
}
```
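Instead of editing config.json by hand, the field can be added with a short script. This is a sketch using only the Python standard library; the helper name and the model path are placeholders, not part of openMind Library.

```python
# Sketch: add "use_npu_rms_norm": true to a model's config.json.
# The function name and path below are illustrative placeholders.
import json
from pathlib import Path

def enable_npu_rms_norm(model_dir: str) -> dict:
    """Set "use_npu_rms_norm": true in <model_dir>/config.json and return the config."""
    config_path = Path(model_dir) / "config.json"
    cfg = json.loads(config_path.read_text(encoding="utf-8"))
    cfg["use_npu_rms_norm"] = True
    config_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8")
    return cfg

# Usage (path is a placeholder):
# enable_npu_rms_norm("/your/local/model/path")
```

The same approach works for the `"_attn_implementation": "npu_fusion_attention"` field described earlier.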