A research team is deploying a deep learning model on an NVIDIA DGX A100 system. The model is computationally demanding and requires efficient use of all available GPUs. During deployment, the team notices that the GPUs are underutilized and that inter-GPU communication appears to be a bottleneck. The software stack includes TensorFlow, CUDA, NCCL, and cuDNN. Which of the following actions would most likely optimize inter-GPU communication and improve overall GPU utilization?
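For context on the kind of fix this scenario is probing, below is a minimal sketch, assuming TensorFlow 2.x, of how NCCL-backed gradient synchronization is typically enabled on a multi-GPU node like a DGX A100. The model shape and layer sizes are illustrative placeholders, not part of the question; the relevant piece is routing the all-reduce through `tf.distribute.NcclAllReduce` so gradient traffic uses NVLink/NVSwitch rather than staging through the host.

```python
# Minimal sketch (assumes TensorFlow 2.x on a multi-GPU machine):
# use NCCL for cross-replica gradient all-reduce.
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# synchronizes gradients after each step; passing NcclAllReduce as
# the cross-device op makes that synchronization use NCCL.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce()
)

with strategy.scope():
    # Any Keras model built inside the scope is replicated across GPUs.
    # The architecture here is a hypothetical stand-in.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu", input_shape=(512,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

A common symptom of the misconfiguration described in the question is falling back to a reduction path that copies tensors through host memory, which both underutilizes the GPUs and makes communication the bottleneck.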