DeepSpeed
DeepSpeed is a specialized acceleration library for PyTorch. It significantly reduces GPU memory usage and speeds up distributed training. At the heart of DeepSpeed is the Zero Redundancy Optimizer (ZeRO), which operates in several stages:
- ZeRO-1: partitions optimizer states
- ZeRO-2: partitions optimizer states and gradients
- ZeRO-3: partitions optimizer states, gradients, and parameters
This guide provides training scripts and ds_config.json files with different configurations to help you use DeepSpeed for distributed training.
As an ecosystem library of openMind Library, DeepSpeed supports only the PyTorch framework.
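To see why partitioning helps, the per-device memory of mixed-precision Adam training can be estimated with the byte counts commonly cited for ZeRO (2 bytes each for fp16 parameters and gradients, 12 bytes of optimizer state per parameter). The model size (7B) and device count (8) below are illustrative assumptions, and activations/buffers are ignored:

```python
def zero_memory_per_device(num_params, num_devices, stage):
    """Rough per-device memory (GB) for mixed-precision Adam under ZeRO.

    Byte counts: 2 B fp16 params, 2 B fp16 grads, 12 B optimizer state
    (fp32 master copy + Adam momentum + variance). Activations ignored.
    """
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_devices      # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_devices      # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_devices     # ZeRO-3: also partition parameters
    return (params + grads + optim) / 1e9

# A 7B-parameter model on 8 devices (illustrative numbers):
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_per_device(7e9, 8, stage):.2f} GB")
```

With these assumptions, the estimate drops from 112 GB per device without ZeRO to 14 GB under ZeRO-3.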
Environment and Configuration
pip install deepspeed==0.13.1
Weight, Dataset, and Fine-tuning Script
| Item | URL |
|---|---|
| Weight | PyTorch-NPU/qwen1.5_7b_chat |
| Dataset | alpaca_data |
| Fine-tuning script | finetune.py |
Refer to the following commands to run the script:
if [ -d ./test/output ]; then
    rm -rf ./test/output
fi
mkdir -p ./test/output
# Set master_port according to your environment.
torchrun --nproc_per_node=8 --master_port=xxx finetune.py \
--model_name_or_path "PyTorch-NPU/qwen1.5_7b_chat" \
--data_path alpaca_data.json \
--deepspeed ds_config.json \
--bf16 True \
--output_dir ./test/output \
--max_steps 2000 \
--per_device_train_batch_size 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 3000 \
--save_total_limit 1 \
--learning_rate 1e-6 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--seed 1234 \
--logging_steps 1 > ./test/output/train.log 2>&1 &
Refer to the following "Basic DeepSpeed Settings" and "Advanced DeepSpeed Settings" sections to configure ds_config.json.
Basic DeepSpeed Settings
ZeRO-1
{
"zero_optimization": {
"stage": 1
}
}
ZeRO-2
- allgather_partitions: chooses between an allgather collective or a series of broadcast collectives to gather the updated parameters from all compute cards at the end of each step.
- allgather_bucket_size: number of elements allgathered at a time. It limits the memory required by allgather for large models.
- overlap_comm: attempts to overlap communication with computation.
- reduce_scatter: uses reduce or reduce-scatter instead of allreduce to average gradients.
- reduce_bucket_size: number of elements reduced/allreduced at a time. It limits the memory required by the reduction for large models.
- contiguous_gradients: copies gradients to a contiguous buffer as they are produced. This avoids memory fragmentation during backward propagation.
- round_robin_gradients: optimizes CPU offloading performance. The benefit grows with the number of gradient accumulation steps and NPUs.
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"round_robin_gradients": true
}
}
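Note that the bucket sizes above are element counts, not bytes, so the actual buffer memory depends on the width of the communication data type. A quick back-of-the-envelope check (2 bytes per element for bf16 and 4 for fp32 are the assumed widths):

```python
# Bucket sizes in ds_config.json are element counts; buffer memory
# depends on the element width of the communication data type.
def bucket_bytes_gb(num_elements, bytes_per_element):
    return num_elements * bytes_per_element / 1024**3

print(f"{bucket_bytes_gb(2e8, 2):.2f} GB")  # 2e8 bf16 elements
print(f"{bucket_bytes_gb(2e8, 4):.2f} GB")  # 2e8 fp32 elements
```

So the default 2e8 buckets translate to a few hundred megabytes of communication buffer per collective.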
ZeRO-3
- overlap_comm: attempts to overlap communication with computation.
- contiguous_gradients: copies gradients to a contiguous buffer as they are produced. This avoids memory fragmentation during backward propagation.
- sub_group_size: controls the granularity at which parameters are updated during the optimizer step.
- reduce_bucket_size: controls the bucket size for allreduce.
- stage3_prefetch_bucket_size: fixed buffer size for prefetching parameters. Smaller values use less memory but can increase stalls caused by communication.
- stage3_param_persistence_threshold: parameters smaller than this threshold are not partitioned. Smaller values use less memory but can greatly increase communication traffic.
- stage3_max_live_parameters: maximum number of parameters kept on each compute card before being released. Smaller values use less memory but require more communication.
- stage3_max_reuse_distance: a parameter is not released if it will be reused within this distance. Smaller values use less memory but require more communication.
{
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9
}
}
Advanced DeepSpeed Settings
Optimizer and scheduler
DeepSpeed supports multiple optimizers, including Adam, AdamW, OneBitAdam, and LAMB. Additionally, it supports other optimizers imported from PyTorch to meet a broader set of requirements. If you do not specify an optimizer in the configuration file, DeepSpeed uses AdamW by default.
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
}
}
Moreover, DeepSpeed is compatible with various learning rate schedulers such as LRRangeTest, OneCycle, WarmupLR, and WarmupDecayLR. If you do not specify a scheduler in the configuration file, DeepSpeed uses WarmupDecayLR by default.
{
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}
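The shape of the default schedule can be sketched in a few lines: linear warmup from warmup_min_lr to warmup_max_lr over warmup_num_steps, then linear decay to 0 at total_num_steps. This is an approximation of WarmupDecayLR's documented behavior; DeepSpeed's exact interpolation may differ slightly, and the step values below are illustrative:

```python
def warmup_decay_lr(step, warmup_min_lr, warmup_max_lr,
                    warmup_num_steps, total_num_steps):
    """Sketch of a warmup-then-linear-decay schedule.

    Approximates WarmupDecayLR: linear warmup to warmup_max_lr,
    then linear decay to 0 at total_num_steps. DeepSpeed's exact
    interpolation may differ slightly.
    """
    if step < warmup_num_steps:
        frac = step / warmup_num_steps
        return warmup_min_lr + frac * (warmup_max_lr - warmup_min_lr)
    remaining = max(0.0, total_num_steps - step)
    return warmup_max_lr * remaining / (total_num_steps - warmup_num_steps)

# Illustrative values: peak lr 1e-6, 60 warmup steps, 2000 total steps
print(warmup_decay_lr(0, 0.0, 1e-6, 60, 2000))     # start of warmup
print(warmup_decay_lr(60, 0.0, 1e-6, 60, 2000))    # peak
print(warmup_decay_lr(2000, 0.0, 1e-6, 60, 2000))  # decayed to zero
```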
Precision
DeepSpeed supports FP32, FP16, and BF16 training. Note that the loss-scaling fields in the block below (loss_scale, loss_scale_window, and so on) are primarily meaningful for FP16 training; for BF16, which has the same dynamic range as FP32, enabled is typically the only field that matters.
{
"bf16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}
Batch size
The data batch size can be automatically configured or explicitly set. If auto is enabled, Trainer sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to the value of args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps.
{
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto"
}
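The relationship above reduces to simple arithmetic; the world size, per-device batch size, and accumulation steps below are illustrative assumptions:

```python
def effective_train_batch_size(world_size, per_device_train_batch_size,
                               gradient_accumulation_steps):
    """Global batch size implied by the "auto" settings above."""
    return (world_size * per_device_train_batch_size
            * gradient_accumulation_steps)

# 8 NPUs, micro-batch of 1 per device, 4 accumulation steps (assumed values):
print(effective_train_batch_size(8, 1, 4))  # -> 32
```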
Gradient accumulation
Gradient accumulation can be automatically configured or explicitly set. If auto is enabled, Trainer sets this parameter to the value of args.gradient_accumulation_steps.
{
"gradient_accumulation_steps": "auto"
}
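What gradient accumulation does can be sketched in plain Python (standing in for a training framework; scaling each micro-batch gradient by 1/steps mirrors common Trainer behavior so the accumulated gradient equals the large-batch average):

```python
def sgd_step_with_accumulation(w, micro_batch_grads, lr):
    """One optimizer step after accumulating gradients over micro-batches.

    Each micro-batch gradient is scaled by 1/num_steps, so the
    accumulated gradient equals the average over the effective batch.
    """
    steps = len(micro_batch_grads)
    accumulated = 0.0
    for g in micro_batch_grads:
        accumulated += g / steps   # accumulate instead of stepping
    return w - lr * accumulated    # single parameter update at the end

# Accumulating over 4 micro-batches approximates one large-batch step:
print(sgd_step_with_accumulation(1.0, [0.4, 0.2, 0.6, 0.8], 0.1))
```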
Gradient clipping
Gradient clipping can be automatically configured or explicitly set. If auto is enabled, Trainer sets this parameter to the value of args.max_grad_norm. Gradient clipping is a common technique for preventing gradient explosion (numerical instability caused by excessively large gradient values) during training. By capping the global gradient norm, you keep gradients within a reasonable range during backward propagation, which helps stabilize training.
{
"gradient_clipping": "auto"
}
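The clipping itself is sketched below in plain Python (math only, no framework) to show what capping the global norm means; the max_norm value is an assumption:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by max_norm/global_norm when the norm exceeds it."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                        # global norm = 5.0
clipped = clip_by_global_norm(grads, 1.0)
print(clipped)                            # scaled down by 1/5
```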
Communication data type
Communication operations such as reduction, gathering, and scattering can use a data type separate from the compute data type. You can select it by setting communication_data_type in the configuration file.
{
"communication_data_type": "fp32"
}
Example
The following provides ZeRO-2 and ZeRO-3 configuration examples for your reference.
ZeRO-2
{
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"communication_data_type": "fp32",
"bf16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
}
}
ZeRO-3
{
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"communication_data_type": "fp32",
"bf16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}