DeepSpeed

DeepSpeed is a deep learning optimization library for PyTorch. It significantly reduces GPU memory usage and accelerates distributed training. At its heart is the Zero Redundancy Optimizer (ZeRO), which partitions model states across devices in progressively aggressive stages:

  • ZeRO-1: partitions optimizer states
  • ZeRO-2: partitions optimizer states and gradients
  • ZeRO-3: partitions optimizer states, gradients, and parameters
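
The memory savings can be estimated from the per-parameter costs in the ZeRO paper: with mixed-precision Adam, each parameter accounts for 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of optimizer state (fp32 master weights, momentum, and variance). The helper below is an illustrative sketch of that arithmetic, not part of DeepSpeed:

```python
def zero_memory_per_gpu(num_params, num_gpus, stage):
    """Estimate per-GPU model-state memory (bytes) for mixed-precision Adam.

    Per parameter: 2 B fp16 weights, 2 B fp16 gradients,
    12 B optimizer state (fp32 master weights, momentum, variance).
    """
    weights, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:      # ZeRO-1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:      # ZeRO-2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:      # ZeRO-3 also partitions parameters
        weights /= num_gpus
    return weights + grads + optim

# A 7B-parameter model on 8 cards: 112 GB replicated vs 14 GB under ZeRO-3.
replicated = zero_memory_per_gpu(7_000_000_000, 8, 0)
sharded = zero_memory_per_gpu(7_000_000_000, 8, 3)
```

Activation memory and temporary buffers come on top of these model-state figures.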

This guide provides a training script and several ds_config.json variants to help you use DeepSpeed for distributed training.

When used with the openMind Library, DeepSpeed supports only the PyTorch framework.

Environment and Configuration

shell
pip install deepspeed==0.13.1

Weight, Dataset, and Fine-tuning Script

  • Weight: PyTorch-NPU/qwen1.5_7b_chat
  • Dataset: alpaca_data
  • Fine-tuning script: finetune.py

Refer to the following commands to run the script:

shell
# Recreate the output directory.
rm -rf ./test/output
mkdir -p ./test/output

# master_port needs to be configured according to the actual situation.
torchrun --nproc_per_node=8 --master_port=xxx finetune.py \
    --model_name_or_path "PyTorch-NPU/qwen1.5_7b_chat" \
    --data_path alpaca_data.json \
    --deepspeed ds_config.json \
    --bf16 True \
    --output_dir ./test/output \
    --max_steps 2000 \
    --per_device_train_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 1 \
    --learning_rate 1e-6 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --seed 1234 \
    --logging_steps 1  > ./test/output/train.log 2>&1 &

Refer to the following "Basic DeepSpeed Settings" and "Advanced DeepSpeed Settings" sections to configure ds_config.json.

Basic DeepSpeed Settings

ZeRO-1

json
{
    "zero_optimization": {
        "stage": 1
    }
}

ZeRO-2

  • allgather_partitions: chooses between an allgather collective and a series of broadcast collectives to gather the updated parameters from all compute cards at the end of each step.
  • allgather_bucket_size: number of elements gathered at a time. It limits the memory required for allgather with large models.
  • overlap_comm: attempts to overlap communication with computation.
  • reduce_scatter: uses reduce or reduce-scatter instead of allreduce to average gradients.
  • reduce_bucket_size: number of elements reduced/allreduced at a time. It limits the memory required for reduce/allreduce with large models.
  • contiguous_gradients: copies gradients to a contiguous buffer as they are produced, avoiding memory fragmentation during backward propagation.
  • round_robin_gradients: optimizes CPU offloading performance. The gain grows with the number of gradient accumulation steps and NPUs.

json
{
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    }
}
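
Note that allgather_bucket_size and reduce_bucket_size are element counts, not bytes; the buffer memory each bucket pins depends on the data type. A quick back-of-the-envelope check (hypothetical helper):

```python
def bucket_bytes(num_elements, bytes_per_element):
    """Memory pinned by one communication bucket."""
    return num_elements * bytes_per_element

# The 2e8-element buckets configured above:
fp32_mib = bucket_bytes(2e8, 4) / 2**20   # fp32 gradients: ~763 MiB
bf16_mib = bucket_bytes(2e8, 2) / 2**20   # bf16 gradients: ~381 MiB
```

Larger buckets mean fewer, larger collectives (better bandwidth utilization) at the cost of this extra buffer memory.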

ZeRO-3

  • overlap_comm: attempts to overlap communication with computation.
  • contiguous_gradients: copies gradients to a contiguous buffer as they are produced, avoiding memory fragmentation during backward propagation.
  • sub_group_size: controls the granularity at which parameters are updated during the optimizer step; parameters are processed in tiles of this size.
  • reduce_bucket_size: controls the bucket size used for allreduce.
  • stage3_prefetch_bucket_size: fixed-size buffer for prefetching parameters. Smaller values use less memory but may cause more stalls waiting for communication.
  • stage3_param_persistence_threshold: parameters smaller than this threshold (in number of elements) are not partitioned. Smaller values use less memory but can greatly increase communication traffic.
  • stage3_max_live_parameters: maximum number of parameters resident on each compute card before they are released. Smaller values use less memory but require more communication.
  • stage3_max_reuse_distance: a parameter is not released if it will be reused within this distance. Smaller values use less memory but require more communication.

json
{
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9
    }
}
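
When these fields are set to auto, the Trainer derives them from the model configuration. The Transformers DeepSpeed integration documents the derivation roughly as follows; the sketch below assumes hidden_size = 4096 (the value used by qwen1.5_7b_chat) and is not the Trainer's actual code:

```python
def resolve_zero3_auto(hidden_size):
    """Approximate the 'auto' values the HF Trainer derives from model config."""
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }

cfg = resolve_zero3_auto(4096)   # reduce_bucket_size -> 16777216 elements
```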

Advanced DeepSpeed Settings

Optimizer and scheduler

DeepSpeed supports multiple optimizers, including Adam, AdamW, OneBitAdam, and LAMB. Additionally, it supports other optimizers imported from PyTorch to meet a broader set of requirements. If you do not specify an optimizer in the configuration file, DeepSpeed uses AdamW by default.

json
{
   "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
   }
}
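
The params entries map directly onto the AdamW update rule. The scalar sketch below is pure Python for illustration only; the default lr mirrors the fine-tuning script above:

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-6, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter (decoupled weight decay)."""
    m = betas[0] * m + (1 - betas[0]) * grad        # first-moment estimate
    v = betas[1] * v + (1 - betas[1]) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - betas[0] ** t)                 # bias correction
    v_hat = v / (1 - betas[1] ** t)
    p -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

# The first step from p = 1.0 with gradient 0.5 moves p by roughly lr.
p, m, v = adamw_step(1.0, 0.5, m=0.0, v=0.0, t=1)
```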

Moreover, DeepSpeed is compatible with various learning rate schedulers, such as LRRangeTest, OneCycle, WarmupLR, and WarmupDecayLR. If you do not specify a scheduler in the configuration file, DeepSpeed uses WarmupDecayLR by default.

json
{
   "scheduler": {
         "type": "WarmupDecayLR",
         "params": {
             "total_num_steps": "auto",
             "warmup_min_lr": "auto",
             "warmup_max_lr": "auto",
             "warmup_num_steps": "auto"
         }
     }
}
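
WarmupDecayLR raises the learning rate linearly from warmup_min_lr to warmup_max_lr over warmup_num_steps, then decays it to zero by total_num_steps. A simplified sketch of the shape (linear decay assumed), using the fine-tuning script's values (max_steps 2000, warmup_ratio 0.03, hence 60 warmup steps, peak 1e-6):

```python
def warmup_decay_lr(step, total_num_steps, warmup_min_lr,
                    warmup_max_lr, warmup_num_steps):
    """Simplified WarmupDecayLR: linear warmup, then linear decay to zero."""
    if step < warmup_num_steps:
        frac = step / warmup_num_steps
        return warmup_min_lr + frac * (warmup_max_lr - warmup_min_lr)
    remaining = max(0, total_num_steps - step)
    return warmup_max_lr * remaining / (total_num_steps - warmup_num_steps)

peak = warmup_decay_lr(60, 2000, 0.0, 1e-6, 60)     # ~1e-6 at end of warmup
final = warmup_decay_lr(2000, 2000, 0.0, 1e-6, 60)  # 0.0 at the last step
```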

Precision

DeepSpeed supports FP32, FP16, and BF16. The loss-scaling parameters (loss_scale, loss_scale_window, initial_scale_power, hysteresis, min_loss_scale) apply only to FP16; BF16 has the same dynamic range as FP32 and needs no loss scaling, so its section takes only enabled.

json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    }
}

Batch size

The data batch size can be automatically configured or explicitly set. If auto is enabled, Trainer sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to the value of args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps.

json
{
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto"
}
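
In other words, the global batch size is the product of the three launch-time quantities. For the torchrun command above (8 cards, micro-batch 1, default accumulation 1) it comes out to 8:

```python
def effective_train_batch_size(world_size, per_device_batch, grad_accum_steps):
    """Global batch size DeepSpeed expects: cards x micro-batch x accumulation."""
    return world_size * per_device_batch * grad_accum_steps

global_batch = effective_train_batch_size(8, 1, 1)
```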

Gradient accumulation

Gradient accumulation can be automatically configured or explicitly set. If auto is enabled, Trainer sets this parameter to the value of args.gradient_accumulation_steps.

json
{
    "gradient_accumulation_steps": "auto"
}
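
Accumulation trades steps for memory: for a mean-reduced loss, averaging the gradients of several equal-sized micro-batches reproduces the gradient of the combined batch exactly. A numeric sketch with a scalar least-squares model (illustrative only):

```python
def grad(w, xs, ys):
    """d/dw of the mean squared error of a scalar linear model w*x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]

full = grad(0.5, xs, ys)                    # one batch of 4 samples
micro = (grad(0.5, xs[:2], ys[:2])
         + grad(0.5, xs[2:], ys[2:])) / 2   # two accumulation steps of 2

# full == micro: averaged micro-batch gradients match the full-batch gradient.
```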

Gradient clipping

Gradient clipping can be automatically configured or explicitly set. If auto is enabled, Trainer sets this parameter to the value of args.max_grad_norm. Clipping is a common technique to prevent gradient explosion (numerical instability caused by excessively large gradients) during training: capping the global gradient norm keeps gradients within a proper range during backward propagation and stabilizes training.

json
{
    "gradient_clipping": "auto"
}
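
Clipping rescales the whole gradient vector when its global L2 norm exceeds max_grad_norm, preserving the gradient's direction. A pure-Python sketch (hypothetical helper):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Uniformly scale gradients down if their global L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_global_norm([3.0, 4.0], 1.0)  # norm 5 scaled down to norm 1
small = clip_by_global_norm([0.1, 0.1], 1.0)    # under the limit: unchanged
```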

Communication data type

Communication operations such as reduction, gathering, and scattering can use a data type different from the training precision. You can select it by setting communication_data_type in the configuration file. Reducing in fp32 avoids precision loss when averaging many low-precision gradients, at the cost of higher communication volume.

json
{
    "communication_data_type": "fp32"
}

Example

The following provides ZeRO-2 and ZeRO-3 configuration examples for your reference.

ZeRO-2

json
{   
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "communication_data_type": "fp32",
    
    "bf16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
  
    "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
   }, 
  
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    }
}

ZeRO-3

json
{   
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "communication_data_type": "fp32",
    
    "bf16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
  
    "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
   }, 
  
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}