If a deep learning expert needs a recurrent network that runs a bit faster and has fewer parts but still handles long sequences well, what might they choose instead of an LSTM?
The deep learning expert would likely choose a Gated Recurrent Unit, or GRU. A recurrent network is a type of artificial neural network designed to process sequences of data by maintaining an internal state, or memory, carried over from previous steps. Standard recurrent networks, however, struggle with long sequences because of vanishing or exploding gradients: the learning signals either shrink toward zero or blow up as they propagate back through many time steps.

The Long Short-Term Memory, or LSTM, network was developed to overcome these challenges. It uses a more elaborate internal structure with a dedicated "cell state" to store information and three distinct gates: an input gate, a forget gate, and an output gate. These gates are small neural network components that regulate the flow of information into, out of, and within the cell state, allowing LSTMs to learn and remember long-term dependencies effectively.

A GRU simplifies this architecture. Instead of three gates and a separate cell state, it uses only two gates, an update gate and a reset gate, and merges the cell state into the hidden state. The update gate controls how much of the previous hidden state is retained and how much of the new candidate information is incorporated into the current hidden state. The reset gate determines how much of the previous hidden state is "forgotten" before the new candidate hidden state is computed.

Fewer gates and no separate cell state mean a GRU has fewer parameters and a simpler computational graph than an LSTM. Because an LSTM computes four gated transformations per step (the input, forget, and output gates plus the candidate cell update) while a GRU computes three (the update and reset gates plus the candidate hidden state), a GRU of the same hidden size has roughly three-quarters as many weights. Fewer parameters translate directly into fewer calculations per time step, so training and inference run faster. Despite its reduced complexity, the GRU still handles long sequences well: its gates manage the flow of information and mitigate the vanishing gradient problem much as an LSTM's do, so it retains the ability to capture long-term dependencies efficiently.
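To make the gate arithmetic concrete, here is a minimal NumPy sketch of a single GRU step. The parameter names (W_z, U_z, b_z, and so on), the weight initialization, and the toy dimensions are illustrative assumptions for this sketch, not the API of any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step. `params` holds input-to-hidden weights W_*,
    hidden-to-hidden weights U_*, and biases b_* for the update (z),
    reset (r), and candidate (h) computations."""
    # Update gate: how much of the previous hidden state to keep
    # versus how much of the new candidate to let in.
    z = sigmoid(x_t @ params["W_z"] + h_prev @ params["U_z"] + params["b_z"])
    # Reset gate: how much of the previous hidden state feeds the candidate.
    r = sigmoid(x_t @ params["W_r"] + h_prev @ params["U_r"] + params["b_r"])
    # Candidate hidden state, computed from the reset-scaled previous state.
    h_tilde = np.tanh(x_t @ params["W_h"] + (r * h_prev) @ params["U_h"] + params["b_h"])
    # Blend old state and candidate; no separate cell state is needed.
    return (1 - z) * h_prev + z * h_tilde

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_in, n_hid = 8, 16  # toy sizes chosen for illustration
    params = {}
    for gate in ("z", "r", "h"):
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_in, n_hid))
        params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        params[f"b_{gate}"] = np.zeros(n_hid)

    h = np.zeros(n_hid)
    for x_t in rng.normal(size=(5, n_in)):  # a toy sequence of 5 steps
        h = gru_cell(x_t, h, params)
    print(h.shape)  # (16,)
```

Note that the final blending step is written here as (1 - z) * h_prev + z * h_tilde; some write-ups swap the roles of z and 1 - z, which is purely a sign convention and does not change how the update gate works.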