In pipeline parallelism, a large batch of training data is divided into smaller units called micro-batches to improve hardware utilization. If a single large batch were processed as one unit, each pipeline stage (typically placed on its own GPU) would have to wait for the previous stage to finish its entire computation before starting work. This results i....
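The waiting described above can be made concrete with a small simulation. The following is a minimal sketch, not a real training schedule: it assumes a forward-only, GPipe-style pipeline in which every stage takes exactly one time step per micro-batch, and the names `pipeline_schedule` and `idle_fraction` are hypothetical helpers introduced here for illustration.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Build a stage-by-time grid for a forward-only pipeline.

    Assumption: every stage needs exactly one time step per micro-batch.
    grid[stage][t] holds the micro-batch index processed at step t,
    or None if that stage is idle at step t.
    """
    total_steps = num_microbatches + num_stages - 1
    grid = [[None] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # Micro-batch `mb` enters stage 0 at step `mb` and then
            # advances by one stage per step.
            grid[stage][mb + stage] = mb
    return grid


def idle_fraction(grid):
    """Fraction of (stage, time step) slots in which a stage is idle."""
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots


if __name__ == "__main__":
    stages = 4
    for microbatches in (1, 4, 16):
        grid = pipeline_schedule(stages, microbatches)
        print(microbatches, round(idle_fraction(grid), 3))
```

Under these assumptions the idle fraction equals (p - 1) / (m + p - 1) for p stages and m micro-batches. With 4 stages, one undivided batch leaves the stages idle 75% of the time, while 16 micro-batches bring that down to about 16%. A step here is the time one stage spends on one micro-batch, so steps get shorter as the batch is split more finely; the idle fraction is unitless and is not affected by that.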