Last Updated : 13 Aug, 2024
Language models predict the probability of a sequence of words and generate coherent text. These models are used in various applications, including chatbots, translators, and more. However, one of the challenges in building language models is handling zero probabilities for events that never appear in the training data. Smoothing techniques are employed to address this problem, ensuring that the model can still assign sensible, non-zero probabilities to previously unseen words.
In this article, we will focus on two advanced smoothing techniques: Witten-Bell Smoothing and Jelinek-Mercer Smoothing.
Understanding Language Models
Large Language Models (LLMs) are designed to understand and generate human-like text. They work by calculating the probability of a sequence of words and predicting the next word in a sentence. These models are typically trained on vast corpora of text, enabling them to grasp the patterns and structures of language.
Modern language models often use deep learning architectures such as Recurrent Neural Networks (RNNs) and Transformers. These architectures can capture long-range dependencies and complex linguistic patterns, making them highly effective for a wide range of NLP tasks.
Pre-requisite: Additive Smoothing Techniques in Language Models
Advanced Smoothing Techniques in Language Models
Smoothing techniques are crucial in large language models to tackle the problem of zero probabilities for unseen words. These techniques adjust the probability distribution, ensuring that even unseen events have a non-zero probability. By doing so, they significantly enhance the performance of LLMs.
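To make the problem concrete, the short sketch below is a minimal illustration (not part of any particular library) that builds an unsmoothed maximum likelihood bigram model on the same toy three-sentence corpus used in the implementations later in this article. Any bigram that never occurs in training receives a probability of exactly zero, which is precisely what smoothing is designed to prevent.

from collections import defaultdict

# Build an unsmoothed maximum likelihood bigram model from a toy corpus.
corpus = ["the cat is black", "the dog is white", "the cat is white"]
unigrams, bigrams = defaultdict(int), defaultdict(int)

for sentence in corpus:
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        unigrams[token] += 1
        if i > 0:
            bigrams[(tokens[i - 1], token)] += 1

def mle_bigram_prob(w1, w2):
    # P(w2 | w1) = C(w1, w2) / C(w1), with no smoothing applied.
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(mle_bigram_prob("the", "cat"))    # 0.666... (seen bigram)
print(mle_bigram_prob("the", "mouse"))  # 0.0 (unseen bigram: the problem smoothing solves)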
1. Witten-Bell Smoothing
Witten-Bell smoothing is a technique used in statistical language modeling to estimate the probability of unseen events, particularly in the context of n-grams in natural language processing (NLP). This method was introduced by Ian H. Witten and Timothy C. Bell in 1991 as part of their work on data compression.
How Does Witten-Bell Smoothing Work?
- Witten-Bell smoothing is based on the idea of interpolating between the maximum likelihood estimate (MLE) and a fallback estimate. It assumes that the likelihood of encountering an unseen event is proportional to the number of unique events that have been observed so far.
- The probability of an unseen event is estimated using the count of unique n-grams observed in the training data. The idea is that if a particular context has produced many different n-grams, it is likely that it will also produce new, unseen ones.
- The smoothed probability for an n-gram is calculated as: [Tex]P(w_n | w_{n-1}, \dots, w_1) = \frac{C(w_1, \dots, w_n)}{C(w_1, \dots, w_{n-1}) + N(w_1, \dots, w_{n-1})}[/Tex]
- Here:
- [Tex]C(w_1, \dots, w_n)[/Tex] is the count of the n-gram [Tex](w_1, \dots, w_n)[/Tex].
- [Tex]C(w_1, \dots, w_{n-1})[/Tex] is the count of the (n-1)-gram prefix.
- [Tex]N(w_1, \dots, w_{n-1})[/Tex] is the number of unique words that follow the (n-1)-gram prefix.
Witten-Bell smoothing is particularly effective for handling sparse data, which is common in NLP tasks. It smooths the probability distribution by considering both the frequency of occurrence and the diversity of possible continuations in the data.
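As a quick worked illustration with small example counts: suppose the context word “the” has been seen 3 times, has been followed by 2 distinct words, and the bigram “the cat” occurs 2 times. Then the smoothed estimate is [Tex]P(\text{cat} \mid \text{the}) = \frac{2}{3 + 2} = 0.4[/Tex], and a total probability mass of [Tex]\frac{2}{3 + 2} = 0.4[/Tex] is reserved for words that have never been seen after “the”.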
Now, let’s implement this technique using the following steps:
- Import Required Function: Import defaultdict from the collections module.
- Define the Class: Create a class called WittenBellSmoothing that:
  - Initializes unigrams and bigrams as default dictionaries, along with a mapping from each word to the set of distinct words observed after it.
  - Sets up a counter variable for the total number of unigrams.
- Model Training: Train the model on a sample text corpus:
  - Split each sentence into tokens.
  - Update the unigram, bigram, and continuation counts based on the tokens.
- Define Probability Calculation Function: Implement a method named bigram_prob within the class that calculates the probability of a bigram using the Witten-Bell formula shown above.
- Create and Use Class Object:
  - Define a sample text corpus.
  - Create an object of the WittenBellSmoothing class.
  - Call the bigram_prob method to calculate the probability of the bigram ‘the cat’.
from collections import defaultdict

class WittenBellSmoothing:
    def __init__(self, corpus):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        # Distinct words observed after each context word: N(w1)
        self.followers = defaultdict(set)
        self.total_unigrams = 0
        self.train(corpus)

    def train(self, corpus):
        # Count unigrams and bigrams, and record the continuations of each word
        for sentence in corpus:
            tokens = sentence.split()
            for i in range(len(tokens)):
                self.unigrams[tokens[i]] += 1
                self.total_unigrams += 1
                if i > 0:
                    self.bigrams[(tokens[i - 1], tokens[i])] += 1
                    self.followers[tokens[i - 1]].add(tokens[i])

    def bigram_prob(self, word1, word2):
        V = len(self.unigrams)            # vocabulary size
        C = self.unigrams[word1]          # count of the context word
        N = len(self.followers[word1])    # unique continuations of word1
        if C == 0:
            # Unseen context: fall back to a uniform estimate
            return 1 / V
        bigram_count = self.bigrams[(word1, word2)]
        if bigram_count > 0:
            # Seen bigram: C(w1, w2) / (C(w1) + N(w1))
            return bigram_count / (C + N)
        # Unseen bigram: the reserved mass N / (C + N) is shared among the
        # V - N vocabulary words not yet seen after word1
        return N / ((V - N) * (C + N)) if V > N else 0.0

corpus = ["the cat is black", "the dog is white", "the cat is white"]
model = WittenBellSmoothing(corpus)
print(f"P(cat | the): {model.bigram_prob('the', 'cat'):.3f}")
Output:
P(cat | the): 0.400

The output P(cat | the): 0.400 is the probability of the word “cat” occurring after the word “the” in the given corpus, calculated using the Witten-Bell Smoothing technique. It matches the hand calculation: “the” occurs 3 times, it is followed by 2 distinct words (cat and dog), and the bigram “the cat” occurs 2 times, so the estimate is 2 / (3 + 2) = 0.4.
2. Jelinek-Mercer Smoothing
Jelinek-Mercer smoothing is a widely used technique in statistical language modeling, particularly in the context of n-gram models. It is a form of linear interpolation that aims to address the problem of estimating probabilities for n-grams that may not appear in the training data.
How Does Jelinek-Mercer Smoothing Work?
- The core idea of Jelinek-Mercer smoothing is to combine the probability estimates from higher-order n-grams with those from lower-order n-grams (like unigram, bigram, etc.). This combination is done using a fixed weight (λ) that determines the influence of each level of n-grams.
- The probability of an n-gram [Tex](w_n | w_{n-1}, \dots, w_1)[/Tex] is calculated as: [Tex]P(w_n | w_{n-1}, \dots, w_1) = \lambda \cdot P_{\text{MLE}}(w_n | w_{n-1}, \dots, w_1) + (1 - \lambda) \cdot P(w_n | w_{n-1}, \dots, w_2)[/Tex]
- Here:
- [Tex]P_{\text{MLE}}(w_n | w_{n-1}, \dots, w_1)[/Tex] is the maximum likelihood estimate of the n-gram.
- [Tex]\lambda[/Tex] is the smoothing parameter (0 ≤ λ ≤ 1) that controls the balance between the n-gram and the fallback model (lower-order n-gram).
- [Tex]P(w_n | w_{n-1}, \dots, w_2)[/Tex] is the probability estimate from the lower-order model.
- The process can be recursively applied to lower-order models until it reaches the unigram model.
- Interpretation:
- If λ = 1, the model fully relies on the maximum likelihood estimate of the higher-order n-gram.
- If λ = 0, it relies entirely on the lower-order model.
- By choosing an intermediate value for λ, the model effectively interpolates between these two extremes, allowing it to handle sparse data more robustly.
- Parameter Tuning: The value of λ can be set using cross-validation or other optimization techniques to find the balance that best fits the data. In practice, different values of λ may be used for different levels of n-grams; a short sketch of such tuning on held-out data is given at the end of this section.
Jelinek-Mercer smoothing is straightforward to implement and computationally efficient. It avoids the zero-probability problem for unseen n-grams by always falling back on lower-order estimates. The fixed λ makes it relatively simple to control the interpolation between higher and lower-order n-grams.
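As a quick worked example with illustrative numbers (they happen to match the small corpus used below): if the bigram MLE gives [Tex]P_{\text{MLE}}(\text{cat} \mid \text{the}) = \frac{2}{3}[/Tex], the unigram model gives [Tex]P(\text{cat}) = \frac{2}{12}[/Tex], and λ = 0.7, then the interpolated estimate is [Tex]0.7 \cdot \frac{2}{3} + 0.3 \cdot \frac{2}{12} \approx 0.517[/Tex].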
To implement this, we follow the same process as before, except that the bigram_prob function now uses a λ value to combine the bigram probability with the overall (unigram) probability as a weighted average.
from collections import defaultdict

class JelinekMercerSmoothing:
    def __init__(self, corpus, lambda_=0.7):
        self.lambda_ = lambda_
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.total_unigrams = 0
        self.train(corpus)

    def train(self, corpus):
        # Count unigrams and bigrams in the training corpus
        for sentence in corpus:
            tokens = sentence.split()
            for i in range(len(tokens)):
                self.unigrams[tokens[i]] += 1
                self.total_unigrams += 1
                if i > 0:
                    self.bigrams[(tokens[i - 1], tokens[i])] += 1

    def bigram_prob(self, word1, word2):
        # Lower-order (unigram) estimate: P(word2)
        unigram_prob = self.unigrams[word2] / self.total_unigrams
        # Higher-order (bigram) maximum likelihood estimate: P(word2 | word1)
        bigram_prob = (self.bigrams[(word1, word2)] / self.unigrams[word1]
                       if self.unigrams[word1] > 0 else 0)
        # Linear interpolation of the two estimates
        return self.lambda_ * bigram_prob + (1 - self.lambda_) * unigram_prob

corpus = ["the cat is black", "the dog is white", "the cat is white"]
model = JelinekMercerSmoothing(corpus, lambda_=0.7)
print(f"P(cat | the): {model.bigram_prob('the', 'cat'):.3f}")
Output:
P(cat | the): 0.517
The output "P(cat | the): 0.517"
represents the probability of the word “cat” occurring after the word “the” in the given corpus, calculated using the Jelinek-Mercer Smoothing technique.
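To illustrate the parameter-tuning point mentioned earlier, here is a minimal sketch of selecting λ on held-out data by maximizing log-likelihood. It reuses the JelinekMercerSmoothing class defined above; the held-out split, the candidate λ values, and the heldout_log_likelihood and tune_lambda helpers are illustrative choices, not a standard API.

import math

def heldout_log_likelihood(model, heldout):
    # Sum of log P(w_i | w_{i-1}) over every bigram in the held-out sentences.
    total = 0.0
    for sentence in heldout:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            prob = model.bigram_prob(tokens[i - 1], tokens[i])
            total += math.log(prob) if prob > 0 else float("-inf")
    return total

def tune_lambda(train, heldout, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # Pick the lambda whose model assigns the highest likelihood to held-out data.
    return max(candidates,
               key=lambda lam: heldout_log_likelihood(
                   JelinekMercerSmoothing(train, lambda_=lam), heldout))

train = ["the cat is black", "the dog is white"]
heldout = ["the cat is white"]
print(tune_lambda(train, heldout))

On real data one would use a larger held-out set and typically report perplexity rather than raw log-likelihood, but the selection logic stays the same.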