What distinguishes L1 regularization from L2 regularization in terms of the sparsity of the resulting model?
L1 regularization, also known as Lasso regularization, adds a penalty term to the loss function that is proportional to the sum of the absolute values of the weights. L2 regularization, also known as Ridge regularization, adds a penalty term proportional to the sum of the squared weights. The key difference in terms of sparsity is that L1 regularization tends to produce sparse models: it drives many of the weights to exactly zero, effectively removing those features from the model. L2 regularization, on the other hand, shrinks the weights towards zero but rarely makes them exactly zero.

The reason lies in the gradients of the two penalties. The gradient of the L1 penalty has constant magnitude regardless of how small a weight is, so the penalty keeps pushing small weights all the way to zero. The gradient of the L2 penalty is proportional to the weight itself, so the shrinkage force weakens as a weight approaches zero, and the weight typically settles at a small nonzero value instead of reaching zero.

For example, if a model has 100 features, L1 regularization might set the weights of 70 of them to exactly zero, leaving a model that uses only 30 features. L2 regularization would shrink the weights of all 100 features, but none of them would be exactly zero. This makes L1 regularization useful for feature selection, since it identifies the most important features and discards the rest, yielding a simpler and more interpretable model. L2 regularization is often preferred when all features are believed to be relevant to some extent.
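As a minimal sketch of this behavior (assuming scikit-learn is available; the dataset, `alpha` values, and counts are illustrative, not definitive), you can fit Lasso and Ridge on the same synthetic data and count how many coefficients land at exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, only 10 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 typically zeros out most of the uninformative weights;
# L2 typically leaves every weight small but nonzero.
print("L1 (Lasso) zero weights:", int(np.sum(lasso.coef_ == 0)))
print("L2 (Ridge) zero weights:", int(np.sum(ridge.coef_ == 0)))
```

Raising `alpha` strengthens the penalty, which in the Lasso case usually increases the number of exactly-zero coefficients, while in the Ridge case it only shrinks them further without eliminating them.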