Question

In a pipeline parallel training setup, what is the specific purpose of 'micro-batching' in minimizing the time GPUs spend in a stalled, waiting state?

Accepted Answer

In pipeline parallelism, the model is split into sequential stages, each placed on its own GPU, and every training batch is divided into smaller units called micro-batches to improve hardware utilization. If a batch were processed as one unit, each stage would have to wait for the previous stage to finish the entire batch before starting its own work, so only one GPU would be busy at any moment. The resulting idle time is called the pipeline bubble.

Micro-batching lets several GPUs work on different parts of the same batch at the same time. For example, if a model is split across four GPUs, the first GPU finishes the first micro-batch, passes its output to the second GPU, and immediately starts on the second micro-batch. After a short warm-up period, while the pipeline fills, every GPU is working on some micro-batch. The bubble does not disappear: stages still idle while the pipeline fills at the start of a batch and drains at the end. That idle time shrinks as the number of micro-batches grows relative to the number of stages, so more micro-batches per batch keep the hardware busy for a larger share of each training step.
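The effect can be illustrated with a small simulation. This is a minimal sketch, not a model of any particular framework: it assumes a forward-only pipeline in which every stage takes exactly one time unit per micro-batch and communication is free. The function names `idle_fraction` and `schedule` are made up for this illustration.

```python
def idle_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Share of GPU time slots spent idle in an idealized pipeline.

    Assumption: each stage takes one time unit per micro-batch, so
    stage s can start micro-batch b at time step s + b.
    """
    total_steps = num_micro_batches + num_stages - 1  # fill + steady state + drain
    busy_slots = num_stages * num_micro_batches       # every stage sees every micro-batch once
    total_slots = num_stages * total_steps
    return 1 - busy_slots / total_slots


def schedule(num_stages: int, num_micro_batches: int) -> list[str]:
    """One text row per GPU: a digit is a micro-batch index, '.' is idle.

    Readable only for fewer than 10 micro-batches (single-digit indices).
    """
    total_steps = num_micro_batches + num_stages - 1
    rows = []
    for stage in range(num_stages):
        cells = []
        for t in range(total_steps):
            b = t - stage  # micro-batch that reaches this stage at step t
            cells.append(str(b) if 0 <= b < num_micro_batches else ".")
        rows.append("".join(cells))
    return rows


if __name__ == "__main__":
    for row in schedule(num_stages=4, num_micro_batches=4):
        print(row)
    for m in (1, 4, 16):
        print(m, round(idle_fraction(4, m), 3))
```

Under these assumptions the idle share equals (stages - 1) / (micro-batches + stages - 1). With four GPUs, a single undivided batch leaves 75% of the time slots idle, four micro-batches leave about 43%, and sixteen leave about 16%. Real schedules also include backward passes and communication costs, so actual numbers will differ.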
