$\color{black}\rule{365px}{3px}$
The names L1 and L2 regularization come from the concept of Lp norms, which are used to measure distance from a point to the origin in a space.
I like to think of it this way (spoiler!):
Regularization puts a limit on the combined size of the weights (coefficients), where each weight represents how significant its feature is in the data. The optimization process (backprop in deep learning) then adjusts the weights within that budget. Since the model learns the importance of each weight (i.e., the importance of each feature) on its own, it ends up shrinking the weights of the features that are not important, i.e., the weights of features tied to the noise in the training data, which in turn reduces overfitting.
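To see that intuition in action, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data and the `alpha` value are made up for illustration) comparing an unregularized linear fit with scikit-learn's `Lasso`, the L1-regularized linear model covered below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Toy data: only the first two of ten features actually drive the target;
# the other eight are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda

print("unregularized  :", np.round(plain.coef_, 3))
print("L1-regularized :", np.round(lasso.coef_, 3))
# The L1-regularized coefficients for the noise features collapse to (or near)
# zero, while the two informative features keep most of their magnitude.
```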
$\color{black}\rule{365px}{3px}$
<aside> <img src="/icons/chart-line_orange.svg" alt="/icons/chart-line_orange.svg" width="40px" /> L1 Norm (Manhattan Distance)
<aside> <img src="/icons/bookmark_orange.svg" alt="/icons/bookmark_orange.svg" width="40px" /> Definition:
The L1 norm, also known as the Manhattan distance or Taxicab norm, is the sum of the absolute values of the vector components.
</aside>

Manhattan Distance between two points $\mathbf{x}, \mathbf{ y}$:
$$ d(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{m}{|x_i - y_i|} $$
(Summing the absolute differences along each dimension)
From the origin $\mathbf{0}$ to a point $\mathbf{w}$, this becomes:
$$ d(\mathbf{w},\mathbf{0}) = \sum_{i=1}^{m}{|w_i - 0|}=\sum_{i=1}^{m}{|w_i|}= |w_1|+|w_2|+ \dots + |w_m| $$
</aside>
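A quick NumPy sketch of the two formulas above (the vectors here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([4.0,  0.0, 1.0])

# Manhattan distance between x and y: sum of absolute differences per dimension
manhattan = np.sum(np.abs(x - y))      # |1-4| + |-2-0| + |3-1| = 7.0

# L1 norm of a weight vector w = its Manhattan distance from the origin
w = np.array([0.5, -1.5, 2.0])
l1_norm = np.sum(np.abs(w))            # 0.5 + 1.5 + 2.0 = 4.0
# Equivalent: np.linalg.norm(w, ord=1)

print(manhattan, l1_norm)
```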
<aside> <img src="/icons/chart-line_orange.svg" alt="/icons/chart-line_orange.svg" width="40px" /> L1 Regularization (Lasso Regression)
<aside> <img src="/icons/bookmark_orange.svg" alt="/icons/bookmark_orange.svg" width="40px" /> Definition: L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the loss function.
</aside>
Mathematical Formulation:
The L1 regularization term is added to the loss function as follows:
$$ \text{Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{m} |w_i| $$
where $\lambda$ is the regularization strength (a hyperparameter controlling how heavily large weights are penalized) and $w_1, \dots, w_m$ are the model's weights (coefficients).
Notice the similarity between the penalty term and the L1 Norm above:
$$ \text{L1 Norm}=\sum_{i=1}^{m}{|w_i|}= |w_1|+|w_2|+ \dots + |w_m| $$
</aside>
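As a sketch of how that penalty attaches to a loss, here is a hypothetical `l1_regularized_mse` helper (assuming NumPy and a plain mean-squared-error loss; the data, weights, and `lam` value are made up for illustration):

```python
import numpy as np

def l1_regularized_mse(w, X, y, lam):
    """Mean squared error plus the L1 penalty: lam * sum(|w_i|)."""
    residuals = X @ w - y
    original_loss = np.mean(residuals ** 2)   # "Original Loss" in the formula
    l1_penalty = lam * np.sum(np.abs(w))      # lambda * sum of |w_i|
    return original_loss + l1_penalty

# Tiny illustration with made-up numbers
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.0])
print(l1_regularized_mse(w, X, y, lam=0.1))   # 0.25 (MSE) + 0.05 (penalty) ≈ 0.3
```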
$\color{black}\rule{365px}{3px}$
<aside> <img src="/icons/chart-line_orange.svg" alt="/icons/chart-line_orange.svg" width="40px" /> L2 Norm (Euclidean Distance)
<aside> <img src="/icons/bookmark_orange.svg" alt="/icons/bookmark_orange.svg" width="40px" /> Definition:
The L2 norm, also known as the Euclidean distance, is the square root of the sum of the squared values of the vector components.
</aside>

Euclidean Distance between two points $\mathbf{x}, \mathbf{ y}$:
$$ d(\mathbf{x},\mathbf{y}) =\sqrt{\sum_{i=1}^{m}{(x_i - y_i)^2}} $$
(Taking the square root of the sum of the squared differences along each dimension)
From the origin $\mathbf{0}$ to a point $\mathbf{w}$, this becomes:
$$ d(\mathbf{w},\mathbf{0}) = \sqrt{\sum_{i=1}^{m}{(w_i - 0)^2}}=\sqrt{\sum_{i=1}^{m}{w_i^2}}= \sqrt{w_1^2+w_2^2+ \dots + w_m^2} $$
</aside>
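And the same kind of NumPy sketch for the L2 formulas (again with made-up vectors):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([4.0,  0.0, 1.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt(9 + 4 + 4) = sqrt(17)

# L2 norm of a weight vector w = its Euclidean distance from the origin
w = np.array([3.0, 4.0])
l2_norm = np.sqrt(np.sum(w ** 2))           # sqrt(9 + 16) = 5.0
# Equivalent: np.linalg.norm(w)  (ord=2 is the default)

print(euclidean, l2_norm)
```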