$\color{black}\rule{365px}{3px}$
“Attention Is All You Need” - 2017
Link to Paper:
arxiv.org
Introduction
$\color{black}\rule{365px}{3px}$
- In these notes, I will cover the parts of the architecture not discussed in the previous series (Positional Encoding, Attention):
- Regularization
- Layer Normalization
- Feed-Forward Network
Regularization
$\color{black}\rule{365px}{3px}$
The Transformer model employs three regularization techniques:
- Residual Dropout: Dropout is applied to the output of each sub-layer before it is added to the input and normalized, with $P_{drop}=0.1$ for the base model.
- Embedding Dropout: Dropout is applied to the sums of the embeddings and positional encodings, regularizing the input to both the encoder and decoder stacks (both dropout sites are sketched after this list).
- Label Smoothing: A smoothing value of $ϵ_{ls}=0.1$ is employed during training. This reduces model confidence (hurting perplexity) but improves accuracy and BLEU scores.
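To make the two dropout sites concrete, here is a minimal PyTorch-style sketch. The class names (`SublayerConnection`, `TransformerInput`) and the learned positional table are my own illustrative choices, not prescribed by the paper; the residual path follows the post-norm ordering $\text{LayerNorm}(x + \text{Dropout}(\text{Sublayer}(x)))$, with $P_{drop}=0.1$ as the default.

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    # Residual dropout: dropout on the sub-layer output, which is then
    # added to the input and layer-normalized (post-norm ordering).
    def __init__(self, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))


class TransformerInput(nn.Module):
    # Embedding dropout: dropout on the sum of token embeddings and
    # positional encodings, before the first encoder/decoder layer.
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned table used here as a stand-in for the sinusoidal positional encodings.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.dropout = nn.Dropout(p_drop)
        self.scale = d_model ** 0.5

    def forward(self, tokens):
        x = self.embed(tokens) * self.scale + self.pos[:, : tokens.size(1)]
        return self.dropout(x)
```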
<aside>
Recall: How Label Smoothing Modifies the Targets
- Instead of using a one-hot vector, label smoothing adjusts the target probabilities slightly. For a smoothing parameter $ϵ_{ls}$, the adjusted target probabilities are:
$$
\text{Modified Target}=(1−ϵ_{ls})⋅\text{one-hot vector}+\frac{ϵ_{ls}}{K}
$$
where $K$ is the total number of classes.
- For example, with $K=3$ classes and $ϵ_{ls}=0.1$, the modified target probabilities for class 2 would be:
$$
\text{Target}=[0.033,\ 0.933,\ 0.033]
$$
- This means the model doesn't assign full confidence to the true label but spreads some probability mass across the incorrect labels.
</aside>
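As a quick sanity check of the numbers in the callout, here is a small sketch that builds the smoothed targets directly (the helper name `smooth_targets` is mine, not from the paper):

```python
import torch
import torch.nn.functional as F


def smooth_targets(labels: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    # (1 - eps) * one-hot + eps / K, applied row-wise to integer class labels.
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes


# K = 3 classes, true class index 1, eps = 0.1
print(smooth_targets(torch.tensor([1]), num_classes=3))
# tensor([[0.0333, 0.9333, 0.0333]])
```

In recent PyTorch versions, `nn.CrossEntropyLoss(label_smoothing=0.1)` applies the same uniform smoothing internally, so the targets rarely need to be built by hand.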