TF-IDF

$\color{black}\rule{365px}{3px}$

Term Frequency - Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus).

<aside> <img src="/icons/light-bulb_red.svg" alt="/icons/light-bulb_red.svg" width="40px" /> The main idea behind TF-IDF is to quantify the importance of a word based on how frequently it appears in a document (term frequency) and how unique or rare it is across all documents (inverse document frequency).

</aside>

Components of TF-IDF

<aside> <img src="/icons/triangle-one-third_red.svg" alt="/icons/triangle-one-third_red.svg" width="40px" /> Term Frequency (TF):

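Measures how frequently a term $t$ appears in a document $d$.
The simplest form is the raw count of a term in a document, but it can also be normalized, for example by the document's length.
Formula (length-normalized form):

$$ TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$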

</aside>
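A minimal sketch of term frequency in Python, assuming the length-normalized variant (raw count divided by the number of terms in the document) and naive whitespace tokenization. The function name `term_frequency` and the sample sentence are illustrative, not part of the original notes.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Length-normalized TF: occurrences of `term` / number of tokens."""
    if not document_tokens:
        return 0.0  # assumption: an empty document gives TF = 0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Hypothetical sample sentence, split on whitespace (an assumption).
tokens = "to be or not to be".split()

print(term_frequency("to", tokens))   # "to" is 2 of the 6 tokens
print(term_frequency("or", tokens))   # "or" is 1 of the 6 tokens
print(term_frequency("cat", tokens))  # absent terms get 0.0
```

Dropping the division by `len(document_tokens)` gives the raw-count form instead.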

<aside> <img src="/icons/triangle-two-thirds_red.svg" alt="/icons/triangle-two-thirds_red.svg" width="40px" /> Inverse Document Frequency (IDF):

Measures how informative a term is by counting how many documents in the corpus contain it.
Terms that are common across many documents (e.g., “the”, “is”) are less informative, so their IDF is low.
Formula:

$$ IDF(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$

Here, $D$ is the corpus (the set of all documents), so the numerator is its size, $|D|$.

</aside>
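The IDF formula can be sketched in Python as follows. The natural logarithm is an assumption, since the notes do not fix a base; the function name `inverse_document_frequency` and the toy corpus are hypothetical as well.

```python
import math


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t, D) = log(N / df): N documents, df of them containing `term`."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    if df == 0:
        # The plain formula is undefined for terms that never occur.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / df)  # natural log; the base is an assumption


# Hypothetical toy corpus, already tokenized.
toy_corpus = [["red", "apple"], ["green", "apple"], ["apple", "pie"]]

print(inverse_document_frequency("apple", toy_corpus))  # in every document: log(3/3) = 0.0
print(inverse_document_frequency("pie", toy_corpus))    # in one document: log(3/1)
```

A term that occurs in every document gets an IDF of exactly zero, which is how common words such as “the” end up with no weight.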

<aside> <img src="/icons/triangle-alternate_red.svg" alt="/icons/triangle-alternate_red.svg" width="40px" /> TF-IDF Calculation:

Combines TF and IDF to assign a weight to each term in a document.
Formula:

$$ \text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D) $$

</aside>
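Putting the two components together, one possible sketch computes a weight for every term in every document. It assumes length-normalized TF and a natural-log IDF; the name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(corpus_tokens):
    """Return one {term: weight} dict per document, with weight = TF * IDF."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    weights = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy documents, split on whitespace.
docs = [text.split() for text in ["red apple", "green apple", "apple pie pie"]]
scores = tf_idf(docs)

print(scores[0])  # "apple" is in every document, so its weight is 0.0
print(scores[2])  # "pie" is frequent here and rare elsewhere, so it scores highest
```

Library implementations often use smoothed or otherwise adjusted variants of these formulas, so their numbers may not match this sketch.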

Example Calculation

Let’s walk through an example using a small corpus.

Example Corpus:

•	Document 1: “the cat sat on the mat”
•	Document 2: “the dog sat on the log”
•	Document 3: “the cat chased the dog”
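
As a quick check on the raw inputs, the document lengths and document frequencies for this corpus can be counted with a short Python sketch. Whitespace tokenization is an assumption, and the variable names are illustrative.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Split each document into tokens on whitespace (an assumption).
tokenized = [doc.split() for doc in corpus]

# Number of terms per document: the denominator of a length-normalized TF.
doc_lengths = [len(tokens) for tokens in tokenized]

# Number of documents containing each term: the denominator inside the IDF log.
document_frequency = Counter(
    term for tokens in tokenized for term in set(tokens)
)

print(doc_lengths)
for term, df in sorted(document_frequency.items()):
    print(term, df)
```

Because “the” occurs in all three documents, its IDF under the formula above is $\log(3/3) = 0$, whatever the logarithm base.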

Step-by-Step Calculation: