Enabling a Fused Operator
openMind Library supports fused operators provided by the Ascend Extension for PyTorch plugin torch_npu, enabling developers to fully unleash the computing power of Ascend AI Processors under the PyTorch framework.
You can modify the from_pretrained input parameter or the config.json file of the model to enable a fused operator without modifying the model code. Fused operators can greatly improve model performance. However, the effect varies depending on the model structure and hyperparameters.
openMind Library supports the following fused operators (☑ = supported, ☐ = not supported):
| Model | npu_rms_norm | npu_fusion_attention |
|---|---|---|
| Llama2 | ☑ | ☑ |
| Qwen2 | ☑ | ☑ |
| Internlm2 | ☐ | ☑ |
| Mistral | ☐ | ☑ |
More models and fused operators are being adapted and developed.
The following describes how to use these operators.
npu_fusion_attention
The fused operator implements the fused computation of the Transformer attention score. The standard (unfused) formula is:

Attention(Q, K, V) = Softmax(QKᵀ / √d) · V

For details, see the torch_npu documentation for the `npu_fusion_attention` operator. In static-shape scenarios, performance improves significantly.
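To make the formula concrete, the following is a minimal NumPy sketch of the unfused scaled dot-product attention computation that the fused operator replaces. The shapes and names are illustrative only; this is not the torch_npu API.

```python
# Reference (unfused) attention: Softmax(QK^T / sqrt(d)) V,
# shown with NumPy for illustration only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # QK^T / sqrt(d)
    return softmax(scores) @ v                    # Softmax(scores) V

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))  # (batch, seq_len, head_dim)
k = rng.standard_normal((2, 4, 8))
v = rng.standard_normal((2, 4, 8))
out = attention(q, k, v)
print(out.shape)  # (2, 4, 8)
```

The fused operator computes the same result in a single kernel, avoiding the intermediate score and softmax tensors that the step-by-step version materializes.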
Example
You can use the fused operator in either of the following ways:
Method 1: Use openMind Auto Classes to load the model: instantiate it with AutoModelForCausalLM and pass the `_attn_implementation="npu_fusion_attention"` parameter.

```python
from openmind import AutoModelForCausalLM

model_id = "/your/model/path"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    _attn_implementation="npu_fusion_attention",
)
```

Method 2: Pass `"_attn_implementation": "npu_fusion_attention"` through the config.json file of the model.

Take the AI-Research/internlm2-base-7b model as an example. Download the model to the local host and add the `"_attn_implementation": "npu_fusion_attention"` field to the config.json file in the model folder:

```json
{
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "_attn_implementation": "npu_fusion_attention",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.0",
  "use_cache": true,
  "vocab_size": 92544,
  "pretraining_tp": 1
}
```

Save the modification and use Auto Classes to load the model. The `_attn_implementation` parameter then no longer needs to be passed to the `from_pretrained` method. The following is an example:

```python
from openmind import AutoModelForCausalLM

model_id = "/your/local/model/path"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)
```
npu_rms_norm
The RmsNorm operator is a common normalization operation in large models. Compared with the LayerNorm operator, RmsNorm does not subtract the mean. The calculation formula is:

RmsNorm(x) = x / √(mean(x²) + ε) · γ

For details, see the torch_npu documentation for the `npu_rms_norm` operator. This fused operator is applicable to the Llama2 and Qwen2 models.
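The following is a minimal NumPy sketch of the unfused computation that the operator fuses, to make the "no mean subtraction" point concrete. It is illustrative only; the argument names are not the torch_npu API.

```python
# Reference (unfused) RmsNorm: x / sqrt(mean(x^2) + eps) * gamma.
# Unlike LayerNorm, the mean is NOT subtracted before normalizing.
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma = np.ones(4)  # learnable scale, initialized to 1
print(rms_norm(x, gamma))  # ≈ [[0.3651 0.7303 1.0954 1.4606]]
```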
Example
You can enable the fused operator by passing `"use_npu_rms_norm": true` through the config.json file of the model.

Take the AI-Research/Llama-3.2-3B-Instruct model as an example. Download the model to the local host and add the `"use_npu_rms_norm": true` field to the config.json file in the model folder:
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.0.dev0",
  "use_cache": true,
  "use_npu_rms_norm": true,
  "vocab_size": 128256
}
```
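Instead of editing config.json by hand, the field can be added with a short script. This is a sketch using only the Python standard library; the helper name and the model path are placeholders, not part of openMind Library.

```python
# Sketch: add "use_npu_rms_norm": true to a model's config.json.
# The function name and path below are illustrative placeholders.
import json
from pathlib import Path

def enable_npu_rms_norm(model_dir: str) -> dict:
    """Set "use_npu_rms_norm": true in <model_dir>/config.json and return the config."""
    config_path = Path(model_dir) / "config.json"
    cfg = json.loads(config_path.read_text(encoding="utf-8"))
    cfg["use_npu_rms_norm"] = True
    config_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8")
    return cfg

# Usage (path is a placeholder):
# enable_npu_rms_norm("/your/local/model/path")
```

The same approach works for the `"_attn_implementation": "npu_fusion_attention"` field described earlier.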