$\color{black}\rule{365px}{3px}$
“Attention Is All You Need” - 2017
Link to Paper:
arxiv.org
Introduction
$\color{black}\rule{365px}{3px}$
- In these notes, I will cover the parts of the architecture not discussed in the previous series (Positional Encoding, Attention):
- Regularization
- Layer Normalization
- Feed-Forward Network
Regularization
$\color{black}\rule{365px}{3px}$
The Transformer model employs three regularization techniques:
- Residual Dropout: Dropout is applied to the output of each sub-layer before it is added to the input and normalized, with $P_{drop}=0.1$ for the base model.
- Embedding Dropout: Dropout is applied to the sums of the embeddings and positional encodings, regularizing the input to both the encoder and decoder stacks (both dropout sites are sketched after this list).
- Label Smoothing: A smoothing value of $ϵ_{ls}=0.1$ is employed during training. This reduces model confidence (hurting perplexity) but improves accuracy and BLEU scores.
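To make the two dropout sites concrete, here is a minimal PyTorch-style sketch. The class names (`SublayerConnection`, `TransformerInput`) and the learned positional table are my own illustrative choices, not prescribed by the paper; the residual path follows the post-norm ordering $\text{LayerNorm}(x + \text{Dropout}(\text{Sublayer}(x)))$, with $P_{drop}=0.1$ as the default.

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    # Residual dropout: dropout on the sub-layer output, which is then
    # added to the input and layer-normalized (post-norm ordering).
    def __init__(self, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))


class TransformerInput(nn.Module):
    # Embedding dropout: dropout on the sum of token embeddings and
    # positional encodings, before the first encoder/decoder layer.
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned table used here as a stand-in for the sinusoidal positional encodings.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.dropout = nn.Dropout(p_drop)
        self.scale = d_model ** 0.5

    def forward(self, tokens):
        x = self.embed(tokens) * self.scale + self.pos[:, : tokens.size(1)]
        return self.dropout(x)
```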
<aside>
Recall: How Label Smoothing Modifies the Targets
- Instead of using a one-hot vector, label smoothing adjusts the target probabilities slightly. For a smoothing parameter $ϵ_{ls}$, the adjusted target probabilities are:
$$
\text{Modified Target}=(1−ϵ_{ls})⋅\text{one-hot vector}+\frac{ϵ_{ls}}{K}
$$
where $K$ is the total number of classes.
- For example, with $K=3$ classes and $ϵ_{ls}=0.1$, the modified target probabilities for class 2 would be:
$$
\text{Target}=[0.033,\ 0.933,\ 0.033]
$$
- This means the model doesn't assign full confidence to the true label but spreads some probability mass across the incorrect labels.
</aside>
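As a quick sanity check of the numbers in the callout, here is a small sketch that builds the smoothed targets directly (the helper name `smooth_targets` is mine, not from the paper):

```python
import torch
import torch.nn.functional as F


def smooth_targets(labels: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    # (1 - eps) * one-hot + eps / K, applied row-wise to integer class labels.
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes


# K = 3 classes, true class index 1, eps = 0.1
print(smooth_targets(torch.tensor([1]), num_classes=3))
# tensor([[0.0333, 0.9333, 0.0333]])
```

In recent PyTorch versions, `nn.CrossEntropyLoss(label_smoothing=0.1)` applies the same uniform smoothing internally, so the targets rarely need to be built by hand.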