Bagua-Net
Bagua-Net is a low-level communication acceleration feature provided by Bagua. It can greatly improve the throughput of AllReduce on TCP networks.
Technically, Bagua-Net is a plugin for the NVIDIA NCCL communication library, currently (2021) the fastest generally available GPU communication implementation. It replaces NCCL's TCP communication logic to greatly improve communication performance, by improving fairness between different streams and reducing contention between sockets.
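Because the plugin sits below NCCL, you can check whether it is actually in use without touching your training script. A minimal sketch, assuming a standard NCCL build (the exact log wording varies across NCCL versions):

# Sketch: raise NCCL's log level to see which network backend it selected.
# NCCL_DEBUG is a standard NCCL environment variable, not a Bagua-specific one.
export NCCL_DEBUG=INFO
# Then launch the job as usual and look for a startup line mentioning the
# selected network, similar to:
#   NCCL INFO Using network ...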
By enabling Bagua-Net, communication efficiency can be increased by 83% (code), and end-to-end training throughput can be increased by 35%:
# VGG16 on 4x8xV100 NCCL default implementation
Running benchmark...
Iter #0: 2620.2 img/sec GPU
Iter #1: 2771.9 img/sec GPU
Iter #2: 2772.6 img/sec GPU
Iter #3: 2794.5 img/sec GPU
Iter #4: 2627.9 img/sec GPU
Iter #5: 2787.8 img/sec GPU
Iter #6: 2775.9 img/sec GPU
Iter #7: 2741.6 img/sec GPU
Iter #8: 2760.0 img/sec GPU
Iter #9: 2796.6 img/sec GPU
Img/sec per GPU: 85.8 +-3.8
Total img/sec on 32 GPU(s): 2744.9 +-122.3
# VGG16 on 4x8xV100 Bagua-Net enabled
Running benchmark...
Iter #0: 4081.0 img/sec GPU
Iter #1: 4072.0 img/sec GPU
Iter #2: 4106.4 img/sec GPU
Iter #3: 4081.7 img/sec GPU
Iter #4: 4064.8 img/sec GPU
Iter #5: 4122.1 img/sec GPU
Iter #6: 3857.7 img/sec GPU
Iter #7: 4128.3 img/sec GPU
Iter #8: 4125.5 img/sec GPU
Iter #9: 3826.6 img/sec GPU
Img/sec per GPU: 126.5 +-6.4
Total img/sec on 32 GPU(s): 4046.6 +-205.2
To enable Bagua-Net, you only need to pass the --enable_bagua_net argument to bagua.distributed.launch or bagua.distributed.run; no changes to your training script are required.
For example, with this distributed training example, you can launch the job with:
python3 -m bagua.distributed.launch --enable_bagua_net \
--nproc_per_node=8 synthetic_benchmark.py --algorithm gradient_allreduce
Note that if you do not need to modify the source code of bagua-core and recompile it, it is strongly recommended that you install bagua with bagua-core included from the pre-built release, following the installation guide.
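As a rough illustration (the exact command and any extra index options depend on your Python and CUDA setup; the installation guide is the authoritative reference):

# Sketch: install the pre-built release, which bundles bagua-core.
python3 -m pip install bagua --upgrade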
It is worth noting that you can even use bagua.distributed.launch or bagua.distributed.run with the --enable_bagua_net argument to launch PyTorch-DDP jobs, improving training throughput without migrating your code to Bagua.
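For instance, a vanilla PyTorch-DDP script (the script name below is only a placeholder) can be launched like this, with no Bagua-specific code inside it:

# Sketch: launch an unmodified PyTorch-DDP script through Bagua's launcher
# so that it benefits from Bagua-Net.
python3 -m bagua.distributed.run --enable_bagua_net \
    --nproc_per_node=8 your_ddp_training_script.py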