Hanah

Word Vectors

Class Notes

Word Representation#

How do we represent words in NLP models?

N-gram#

Definition: An n-gram is a sequence of n elements taken consecutively from a given text or speech sequence; in natural language processing these elements are usually words. The n-gram model is a probabilistic language model used to estimate the probability of a text sequence. It is based on the assumption that the probability of a word depends only on the n−1 words that precede it. For example, in a bigram model (n=2), the sentence "I love natural language processing" is broken into the bigrams "I love", "love natural", "natural language", and "language processing". The model learns the probability of each bigram from its frequency in the corpus (e.g., how often "love natural" occurs) and uses these probabilities to score text or to generate language.

$$P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha|V|}$$

The first formula represents the joint probability of a sequence of words $w_1, w_2, ..., w_n$. It computes the probability of the entire sequence by multiplying the probabilities of each word $w_i$ conditioned on the previous word $w_{i-1}$. This decomposition follows from the chain rule together with the bigram Markov assumption.
The second formula gives the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$. Its parts are:

  • $C(w_{i-1}, w_i)$: the count of the word pair $(w_{i-1}, w_i)$ in the training data.
  • $C(w_{i-1})$: the count of the word $w_{i-1}$ in the training data.
  • $|V|$: the size of the vocabulary, i.e., the number of distinct words.
  • $\alpha$: a smoothing parameter used to avoid the zero-probability problem. It is a value between 0 and 1, used to adjust the probabilities of unseen word pairs (a small sketch of this smoothed estimate follows below).
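The sketch below assumes a whitespace-tokenized toy corpus; the function names, sentences, and the value of α are illustrative and not part of the original notes.

```python
from collections import Counter

def train_bigram_counts(corpus_sentences):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for tokens in corpus_sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts

def bigram_prob(w_prev, w, unigram_counts, bigram_counts, vocab_size, alpha=0.1):
    """P(w | w_prev) with add-alpha smoothing, as in the formula above."""
    return (bigram_counts[(w_prev, w)] + alpha) / (unigram_counts[w_prev] + alpha * vocab_size)

# Toy usage (illustrative corpus):
sentences = [["I", "love", "natural", "language", "processing"],
             ["students", "love", "language", "models"]]
uni, bi = train_bigram_counts(sentences)
V = len(uni)
print(bigram_prob("love", "natural", uni, bi, V))   # seen bigram
print(bigram_prob("love", "models", uni, bi, V))    # unseen bigram, still non-zero
```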

Specific Details#

Definition: An n-gram is a chunk of n consecutive words.
• unigrams: “the”, “students”, “opened”, ”their”
• bigrams: “the students”, “students opened”, “opened their”
• trigrams: “the students opened”, “students opened their”
• four-grams: “the students opened their”
Concept: Collect the frequencies of different n-grams and use this data to predict the next word.

  • First we make a Markov assumption: $x^{(n)}$ depends only on the preceding $n-1$ words
  • How do we obtain the probabilities of these n-grams and (n-1)-grams? Answer: By counting them in some large text corpus!

image

Example#

Suppose we are learning a 4-gram language model:

image

Sparsity Problems#

The n-gram model is a method used in natural language processing to predict text sequences, based on the assumption that the occurrence of a word depends only on the n-1 preceding words. However, as the value of n increases, the sparsity problem becomes more severe, as there may be many unseen n-gram combinations.

  • Sparsity Problem 1:
    Problem: If a specific n-gram (like "students opened their w") never appeared in the training data, the n-gram model assigns it probability 0, so the model cannot assign any probability to these unseen words or phrases.
    Partial Solution (Smoothing): Add a small value δ to the count of each word. Even if a word never appeared after this context, its probability is no longer 0; smoothing ensures that every word receives a non-zero probability.
  • Sparsity Problem 2:
    Problem: If the context itself (like "students opened their") never appeared in the data, the model cannot compute the probability of any word w following it.
    Partial Solution (Backoff): Back off to a shorter context (like "opened their") to estimate the probability. This lets the model use shorter n-grams when it encounters unseen long n-grams (see the sketch after this list).
  • Considerations:
    Impact of increasing n: Increasing n (the length of the n-gram) exacerbates the sparsity problem; typically n is kept at 5 or below, because the number of unseen n-gram combinations grows dramatically with larger n.
    Through these methods, n-gram language models can alleviate the sparsity problem to some extent, improving generalization and prediction accuracy. However, the methods have limitations: smoothing can introduce noise, and backoff can lose contextual information. Choosing an appropriate n and smoothing technique therefore matters in practice.
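A rough sketch of the backoff idea is shown below. It assumes a hypothetical `counts` structure that maps each k-gram tuple to its count; this is a simplified stand-in for illustration, not full Katz backoff with discounting and backoff weights.

```python
def backoff_prob(context, w, counts, vocab_size, alpha=0.1):
    """Estimate P(w | context) by backing off to shorter contexts when the longer
    context was never observed. `counts[k]` maps a k-gram tuple to its count."""
    while context:
        history_count = counts[len(context)].get(context, 0)
        if history_count > 0:
            ngram_count = counts[len(context) + 1].get(context + (w,), 0)
            return (ngram_count + alpha) / (history_count + alpha * vocab_size)
        context = context[1:]          # drop the earliest word and back off
    # Fall back to a smoothed unigram estimate.
    total = sum(counts[1].values())
    return (counts[1].get((w,), 0) + alpha) / (total + alpha * vocab_size)
```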

Storage Problems#

Storage Requirements: The n-gram model requires storing the counts of all n-grams observed in the training corpus. This means the size of the model is proportional to the number of distinct n-grams in the training data.
Factors Affecting Model Size:

  • Increasing n: As the n-gram length n increases, the model must store counts for many more n-grams, since the number of possible longer n-grams grows dramatically.
  • Increasing corpus size: A larger training corpus also increases the model size, since more text means more distinct n-gram combinations.

Solutions and Challenges:

  • Storage Optimization: Since the storage requirements grow with n and with the corpus, effective storage optimization techniques, such as compression and hash tables, are needed to reduce storage space.
  • Model Simplification: The model can be simplified by limiting the n-gram length and by using more efficient data structures or algorithms to reduce storage requirements.
  • Sparsity Issues: As n increases, sparsity (many n-grams never appearing in the training data) becomes more severe, requiring smoothing techniques.
  • Alternative Models: Consider more advanced models, such as neural network models (like Transformers), which are typically more compact and can learn more complex language patterns with fewer parameters.

Naive Bayes#

Naive Bayes is a simple probabilistic classifier based on Bayes' theorem, which assumes independence among features. In text classification tasks, the Naive Bayes model can be used to estimate the probability of a word $w_i$ occurring given a category $c_j$.
The formula is as follows:
$$\hat{P}(w_i \mid c_j) = \frac{Count(w_i, c_j) + \alpha}{\sum_{w \in V} Count(w, c_j) + \alpha|V|}$$
Formula Explanation:

  • $\hat{P}(w_i \mid c_j)$: the estimated probability of the word $w_i$ occurring given the category $c_j$.
  • $Count(w_i, c_j)$: the count of the word $w_i$ in category $c_j$.
  • $\sum_{w \in V} Count(w, c_j)$: the total count of all words in category $c_j$; here V is the vocabulary of all possible words.
  • $\alpha$: a smoothing parameter used to address data sparsity and avoid zero probabilities. It is a value between 0 and 1.
  • $|V|$: the size of the vocabulary, i.e., the number of distinct words.
    Smoothing Techniques:
    As in the n-gram model, smoothing is used to handle data sparsity. In the formula, it is achieved by adding $\alpha$ to the numerator and $\alpha|V|$ to the denominator:
  • Adding $\alpha$ to the numerator: $Count(w_i, c_j) + \alpha$, so even if the word $w_i$ never appeared in category $c_j$, its smoothed count is $\alpha$ rather than 0 and its probability stays non-zero.
  • Adding $\alpha|V|$ to the denominator: $\sum_{w \in V} Count(w, c_j) + \alpha|V|$, which keeps the distribution normalized, so the probability mass added for unseen words is spread evenly across the vocabulary (a small sketch follows below).
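The sketch below assumes documents arrive as (tokens, label) pairs; the helper names, toy data, and α value are illustrative.

```python
from collections import Counter, defaultdict

def train_nb_word_probs(labeled_docs, alpha=1.0):
    """Estimate P(w | c) with add-alpha smoothing from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)   # word_counts[c][w] = Count(w, c)
    vocab = set()
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    V = len(vocab)
    def prob(w, c):
        total = sum(word_counts[c].values())
        return (word_counts[c][w] + alpha) / (total + alpha * V)
    return prob

# Toy usage:
docs = [(["great", "movie", "loved", "it"], "pos"),
        (["terrible", "boring", "movie"], "neg")]
p = train_nb_word_probs(docs, alpha=0.5)
print(p("loved", "pos"), p("loved", "neg"))   # unseen in "neg" but still non-zero
```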

Why Focus on Semantics in NLP Models?#

image
In words: a feature is a word identity (= string)
For example, if the previous word is ‘terrible’, it needs to be exactly the same ‘terrible’ in both the test and training sets.
But if we can convert semantics into vectors:

  • previous word was vector [35, 22, 17, …]
  • Now in the test set we might see a similar vector [34, 21, 14, …]
  • We can generalize to similar but unseen words!!!
  • In traditional NLP, we treat words as discrete symbols represented by one-hot vectors, where the vector dimension equals the number of words in the vocabulary. But this representation provides no natural way to express similarity (see the short sketch after this list).
  • Distributional Semantics: The meaning of a word is given by the words that frequently appear nearby.
  • When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window). We use the many contexts of w to build up a representation of w.
  • We can represent a word's context using vectors!
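As a quick illustration of why dense vectors generalize while one-hot vectors do not, here is a small cosine-similarity sketch; the vectors contain made-up numbers, not learned embeddings.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors: every pair of distinct words has similarity 0.
terrible = np.zeros(10000); terrible[1234] = 1.0
horrible = np.zeros(10000); horrible[5678] = 1.0
print(cosine(terrible, horrible))        # 0.0 -- no notion of similarity

# Dense vectors (illustrative values): similar words get high similarity.
v_train = np.array([35.0, 22.0, 17.0])   # "previous word" feature seen in training
v_test  = np.array([34.0, 21.0, 14.0])   # similar but unseen word at test time
print(cosine(v_train, v_test))           # close to 1.0 -- the model can generalize
```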

What Do Words Mean?#

  • Synonyms: couch/sofa, car/automobile, filbert/hazelnut
  • Antonyms: dark/light, rise/fall, up/down
  • Some words are not synonyms, but they share some meaning elements, such as: cat/dog, car/bicycle, cow/horse
  • Some words are not similar, but they are related: coffee/cup, house/door, chef/menu

The big idea: a model of meaning focusing on similarity. Similar words are "nearby in vector space".

image

Word Embedding Process#

Goal: represent words as short (50-300 dimensional) & dense (real-valued) vectors!
image

  • Count-based Approaches:
    Using history: This method has been in use since the 1990s.
    Co-occurrence Matrix: Build a sparse word-word co-occurrence (PPMI, Positive Pointwise Mutual Information) matrix that records the frequency of different words appearing together in the text.
    SVD Decomposition: Use Singular Value Decomposition (SVD) to decompose the co-occurrence matrix to obtain low-dimensional vector representations of words.
  • Prediction-based Approaches:
    Machine Learning Problem: Frame the word embedding problem as a machine learning problem by predicting words in context to learn word representations.
    Word2vec: Proposed by Mikolov et al. in 2013, Word2vec learns word vectors by predicting context words given a target word or predicting the target word given context words.
    GloVe: Proposed by Pennington et al. in 2014, GloVe (Global Vectors for Word Representation) learns word vectors using global word-word co-occurrence information.

Word Embeddings: The Learning Problem#

Learn vectors from text to represent words.
Input:
A large text corpus and vocabulary V.
Vector dimension d (e.g., 300 dimensions).
Output:
A function $f: V \rightarrow \mathbb{R}^d$ that maps each word in the vocabulary to a d-dimensional real-valued vector.
Learning Process:
The learning process of word embeddings typically involves optimizing an objective function that measures the model's performance on prediction tasks (such as predicting words in context).
Through training, the learned word vectors can capture relationships between words, such as synonyms, antonyms, and categories of words.
Basic Properties:

  • Similar words have similar vectors: $\arg\max_{w} \cos(e(w), e(w^*))$
  • The relationship between "man" and "woman" parallels the relationship between "king" and "queen". In the word embedding space these two relationships are similar, i.e., $v_{man} - v_{woman} \approx v_{king} - v_{queen}$: the vector from "man" to "woman" is similar to the vector from "king" to "queen".
  • Verb tense: e.g., "walk"/"walked" and "swim"/"swam". These relationships are also similar in the embedding space, i.e., $v_{walking} - v_{walked} \approx v_{swimming} - v_{swam}$.
  • Country-Capital: e.g., "France"/"Paris" and "Italy"/"Rome". These relationships are also similar in the embedding space, i.e., $v_{Paris} - v_{France} \approx v_{Rome} - v_{Italy}$.
  • Solving analogy problems: find analogy words by computing vector differences and cosine similarities (a sketch follows after this list). The specific steps are:
    Define the analogy relationship: given $a : a^* :: b : b^*$, where $a$, $a^*$, and $b$ are known words and $b^*$ is the analogy word to be found.
    Compute the vector difference: compute $e(a^*) - e(a) + e(b)$, where $e(w)$ is the vector representation of word $w$.
    Find the most similar word: find the word $b^*$ in the vocabulary V with the highest cosine similarity to $e(a^*) - e(a) + e(b)$, i.e., $b^* = \arg\max_{w \in V} \cos(e(w), e(a^*) - e(a) + e(b))$.
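The sketch below assumes a pre-loaded `embeddings` dictionary mapping words to NumPy vectors; the dictionary and the commented usage are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, a_star, b, embeddings):
    """Return b* maximizing cos(e(w), e(a*) - e(a) + e(b)), excluding the query words."""
    target = embeddings[a_star] - embeddings[a] + embeddings[b]
    candidates = (w for w in embeddings if w not in {a, a_star, b})
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Usage with a hypothetical pre-loaded {word: np.ndarray} dictionary:
# solve_analogy("man", "woman", "king", embeddings)   # ideally returns "queen"
```

Excluding the query words from the candidate set matters in practice, because the nearest vector to $e(a^*) - e(a) + e(b)$ is often one of the input words themselves.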

image

This image illustrates the process of learning language models (LMs) through neural networks, particularly how the concept of word embeddings is introduced. The model described in the image is the Neural Probabilistic Language Model, proposed by Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin in 2003.
Elements in the image explained:

  • Input Layer (indices for $w_{t-n+1}, ..., w_{t-2}, w_{t-1}$):
    These are the indices of the previous n−1 words, representing the context. Each word index is mapped to a vector representation through a lookup table (table look-up), i.e., a word embedding.
  • Word Embedding Layer ($C(w_{t-n+1})$, ..., $C(w_{t-2})$, $C(w_{t-1})$):
    Each word's index is mapped to a vector through the lookup table C; these vectors are shared parameters across words and constitute the word embeddings.
  • Hidden Layer (tanh):
    The word embedding vectors are concatenated and passed through a nonlinear activation function (tanh). This step is the most computationally intensive part of the model.
  • Output Layer (softmax):
    The output of the hidden layer is transformed into a probability distribution by the softmax function, representing the probability of each possible next word given the context.
  • Output (t-th output = $P(w_t = i \mid \text{context})$):
    The final output is the probability that the t-th word is a specific word i given the context (a minimal sketch of this forward pass follows below).
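To make the forward pass concrete, here is a minimal NumPy sketch of a fixed-window neural language model in the spirit of Bengio et al.; all sizes and random parameters are illustrative, and the direct input-to-output connections of the original model are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10000, 100, 4, 128      # vocab size, embedding dim, window size, hidden size

C = rng.normal(scale=0.1, size=(V, d))       # shared word embedding table (lookup)
W = rng.normal(scale=0.1, size=(n * d, h))   # hidden layer weights
b1 = np.zeros(h)
U = rng.normal(scale=0.1, size=(h, V))       # output layer weights
b2 = np.zeros(V)

def forward(context_indices):
    """P(w_t = . | context) for a window of n previous word indices."""
    x = C[context_indices].reshape(-1)        # concatenate the n embeddings
    hidden = np.tanh(x @ W + b1)              # nonlinear hidden layer
    logits = hidden @ U + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                # softmax over the vocabulary

p = forward([12, 57, 3041, 9])                # indices of the previous words
print(p.shape, p.sum())                       # (10000,) 1.0
```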

Word2vec#

image

Skip-gram#

image

The goal of the Skip-gram model is to use each word to predict other words in its context.
Assumption:
We have a large text corpus $w_1, w_2, ..., w_T \in V$
Key Idea:
Use each word to predict other words in its context. This is a classification problem, as the model needs to select the correct context word from the vocabulary.
Context:
The context is defined as a fixed-size window of size 2m (in the example in the image, m=2). This means that for each center word, the model considers m words before and after it as context.
Probability Calculation:
Given the center word a, the model needs to calculate the probability P(b|a) of every other word b being a context word.
This probability distribution $P(\cdot \mid a)$ is normalized over the vocabulary: $\sum_{w \in V} P(w \mid a) = 1$.
The image shows the center word "into" with a context window of size 2, i.e., two words before and two words after: "problems", "turning", "banking", and "crises".
The model needs to learn how to predict these context words based on the center word.
Principle of the Skip-gram Model:
Goal: For each center word, the model's objective is to maximize the probability of its context words.
Loss Function: Typically, the cross-entropy loss function is used to train the model, minimizing the difference between the predicted probability distribution and the actual context word distribution.
Optimization: Adjust the model parameters through gradient descent or other optimization algorithms to minimize the loss function.
image
This image further explains the training process of the Skip-gram model, showing how to convert text data into a format that the model can process and illustrating the training objective of the model.
Context Window:
The image shows a fixed window size of 2, meaning that for each center word (marked in red in the image), the model considers two words before and after as context.
Probability Calculation:
For each center word, the model needs to calculate the probabilities of its context words. For example, given the center word "into", the model needs to calculate the probabilities of "problems", "turning", "banking", and "crises" being context words.
Training Data Transformation:
The right side of the image shows how to convert the original text data into the format required for model training. For example, for the center word "into", the model generates training samples like (into, problems), (into, turning), (into, banking), (into, crises), etc.
Training Objective:
The model's training objective is to find a set of parameters that can maximize the probabilities of context words. In other words, the word vectors that the model attempts to learn should best predict the context words given the center word.
Objective Function:

image

image

How is $P(w_{t+j} \mid w_t; \theta)$ defined?
This is achieved using word vectors and the softmax function.
Two Sets of Vectors:
For each word in the vocabulary V, two vectors are used:
$u_a \in \mathbb{R}^d$: the vector for word a when it is the center word, for all $a \in V$.
$v_b \in \mathbb{R}^d$: the vector for word b when it is a context word, for all $b \in V$.
Dot Product:
The dot product $u_a \cdot v_b$ measures how likely the center word a is to appear with the context word b.
Softmax Function:
The softmax function converts the dot products into a probability distribution, by normalizing the exponential of each dot product by the sum of the exponentials over all possible context words.
The probability distribution:
$P(\cdot \mid w_t)$ is a probability distribution over the vocabulary V, giving the probability of each possible context word given the center word $w_t$ (a small sketch of this computation follows below).
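The sketch below uses random vectors standing in for learned parameters; names and sizes are illustrative.

```python
import numpy as np

def skipgram_softmax_prob(u_center, V_context):
    """P(. | center word): softmax over dot products u_a . v_b for every b in the vocabulary.
    u_center: (d,) center-word vector; V_context: (|V|, d) matrix of context-word vectors."""
    scores = V_context @ u_center                 # one dot product per vocabulary word
    scores -= scores.max()                        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()          # sums to 1 over the vocabulary

# Toy usage (illustrative only):
rng = np.random.default_rng(0)
d, vocab_size = 50, 1000
u_into = rng.normal(size=d)
V_ctx = rng.normal(size=(vocab_size, d))
p = skipgram_softmax_prob(u_into, V_ctx)
print(p.shape, p.sum())                           # (1000,) 1.0
```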

vs Multinomial Logistic Regression#

  • Multinomial Logistic Regression:
    Formula:
    Multinomial logistic regression is used for multi-class problems, and its formula is:
    $$P(y = c \mid x) = \frac{\exp(w_c \cdot x + b_c)}{\sum_{j=1}^{m} \exp(w_j \cdot x + b_j)}$$
    where y is the class label, c is one of the classes, x is the input feature vector, $w_c$ and $b_c$ are the weight vector and bias term for class c, and m is the total number of classes.
    Explanation:
    The numerator is the exponential of the dot product of the input feature vector x with the weight vector $w_c$ for class c, plus the bias term $b_c$.
    The denominator is the sum of these exponentials over all classes, used for normalization, ensuring that the class probabilities sum to 1.
  • Skip-gram Model:
    Formula:
    The probability calculation in the Skip-gram model is:
    $$P(w_{t+j} \mid w_t) = \frac{\exp(u_{w_t} \cdot v_{w_{t+j}})}{\sum_{k \in V} \exp(u_{w_t} \cdot v_k)}$$
    where $w_t$ is the center word, $w_{t+j}$ is the context word, $u_{w_t}$ and $v_{w_{t+j}}$ are the vector representations of the center and context words, and V is the vocabulary.
    Explanation:
    The numerator is the exponential of the dot product between the center word vector $u_{w_t}$ and the context word vector $v_{w_{t+j}}$.
    The denominator is the sum of the exponentials of the dot products of $u_{w_t}$ with all context word vectors in the vocabulary, used for normalization.
  • Comparison:
    Essentially a |V|-way classification problem: the Skip-gram model can be viewed as a multi-class problem where |V| is the size of the vocabulary.
    Fixing $u_{w_t}$: if the center word vector $u_{w_t}$ were fixed, the problem would reduce to a multinomial logistic regression.
    Non-convex optimization problem: since the vectors for both the center word and the context words must be learned simultaneously, the training objective is non-convex, so the optimization may have multiple local optima.

vs Multinomial Logistic Regression#

image

image

Practice#

image

The answer is (b).
Each word has two d-dimensional vectors, so the total number of parameters is $2 \times |V| \times d$.

Question: Why does each word need two vectors instead of one?
Answer: Because a word is unlikely to appear in its own context window. For example, given the word "dog", P(dog|dog) should be low. With a single set of vectors, the model would have to make $u_{dog} \cdot u_{dog}$ small, but that dot product is the squared norm of the vector itself, so the objective becomes awkward to optimize; using two sets of vectors avoids this problem.
Question: Which set of vectors is used as the word embeddings?
Answer: This is an empirical question. Typically only $u_w$ is used as the word embedding, but you can also concatenate the two sets of vectors.

Skip-gram with Negative Sampling (SGNS) and Other Variants#

image

Problem Description:
In the original Skip-gram model, every time a (center word, context word) pair (t, c) is processed, computing the update involves every word in the vocabulary, because the softmax normalization sums over all context vectors $v_k$. This is computationally expensive.
Negative Sampling Method:
Negative sampling does not consider all words in the vocabulary; instead it randomly samples K negative examples (K is usually between 5 and 20). That is, only K words are drawn from the vocabulary as negative samples instead of using every word.
Softmax and Negative Sampling Formulas:
Softmax: the original Skip-gram model uses the softmax function, giving the per-pair loss
$$y = -\log\frac{\exp(u_t \cdot v_c)}{\sum_{k \in V}\exp(u_t \cdot v_k)}$$
Negative Sampling: negative sampling replaces the softmax with a simpler objective, defined as
$$y = -\log(\sigma(u_t \cdot v_c)) - \sum_{i=1}^{K} \mathbb{E}_{j \sim P(w)}\left[\log(\sigma(-u_t \cdot v_j))\right]$$
where $\sigma(x) = \frac{1}{1+\exp(-x)}$ is the sigmoid function used to convert dot products into probabilities.

image
Key Idea:
Transform the original |V|-way classification problem (where |V| is the size of the vocabulary) into a set of binary classification tasks.
For each observed pair of words (t, c), the model predicts that (t, c) is a positive pair, while (t, c′) is a negative pair, where c′ is randomly drawn from a small sampled set.
Positive and Negative Samples:
Positive sample: e.g., the center word "apricot" with the context word "tablespoon" is a positive pair.
Negative sample: e.g., the center word "apricot" with a randomly chosen word "aardvark" is a negative pair.
Loss Function:
The loss function y is defined as:
$$y = -\log(\sigma(u_t \cdot v_c)) - \sum_{i=1}^{K} \mathbb{E}_{j \sim P(w)}\left[\log(\sigma(-u_t \cdot v_j))\right]$$
where $\sigma(x) = \frac{1}{1+\exp(-x)}$ is the sigmoid function, K is the number of negative samples, and P(w) is a sampling distribution based on word frequency.
Probability Calculation:
The probability $P(y=1 \mid t, c)$ for a given center word t and context word c is computed as $\sigma(u_t \cdot v_c)$.
The probability $P(y=0 \mid t, c')$ for a given center word t and negative sample c′ is computed as $1 - \sigma(u_t \cdot v_{c'}) = \sigma(-u_t \cdot v_{c'})$.
Optimization:
Similar to binary logistic regression, but it requires simultaneously optimizing the center word vector $u_t$ and the context word vectors $v_c$ (a small sketch of this loss follows below).
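The sketch below computes the negative-sampling loss for a single (t, c) pair, approximating the expectation over P(w) with the K sampled negatives; the vectors are random placeholders and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(u_t, v_c, v_negs):
    """Negative-sampling loss for one (center, context) pair.
    u_t: (d,) center vector; v_c: (d,) positive context vector;
    v_negs: (K, d) vectors of the K sampled negative words."""
    pos_term = -np.log(sigmoid(u_t @ v_c))                 # pull the true pair together
    neg_term = -np.log(sigmoid(-(v_negs @ u_t))).sum()     # push the K negatives apart
    return pos_term + neg_term

rng = np.random.default_rng(0)
d, K = 50, 5
loss = sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(K, d)))
print(loss)
```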

Practice#

image
d.

  • The vector for the center word t, $u_t$ (dimension d).
  • The vector for the positive context word c, $v_c$ (dimension d).
  • The vectors for the K negative sample words (each of dimension d).

Continuous Bag of Words (CBOW)#

image

GloVe: Global Vectors#

image
This image introduces the GloVe (Global Vectors for Word Representation) model, which is an algorithm for generating word embeddings. GloVe learns word vectors by leveraging the global co-occurrence statistics of words, unlike window-based methods like Skip-gram and CBOW, which directly utilize the co-occurrence matrix of the entire corpus to learn word vectors.
Key Idea:
Directly use the co-occurrence counts of words to approximate the dot product between word vectors ($u_i \cdot v_j$).
Global Co-occurrence Statistics:
The model uses global co-occurrence statistics $X_{ij}$, the number of times words i and j appear together in the corpus.
Loss Function J(θ):
The loss function for GloVe is defined as:
$$J(\theta) = \sum_{i,j \in V} f(X_{ij})\left(u_i \cdot v_j + b_i + b_j - \log X_{ij}\right)^2$$
where $f(X_{ij})$ is a weighting function used to adjust the influence of low-frequency word pairs; $u_i$ and $v_j$ are the vector representations of words i and j; $b_i$ and $b_j$ are bias terms; and $X_{ij}$ is the co-occurrence count of words i and j.
Training Speed and Scalability:
The GloVe model trains faster and can scale to very large corpora.
Weighting Function f:
The weighting function f is a smooth, non-decreasing function that is capped at a maximum value; it down-weights rare co-occurrences while preventing very frequent word pairs from dominating the loss.
Advantages of GloVe:
Global Information: GloVe utilizes co-occurrence information from the entire corpus, allowing it to capture broader semantic relationships.
Training Efficiency: Due to its matrix factorization form, GloVe is more efficient in training compared to window-based methods.
Scalability: GloVe can handle very large corpora, making it perform well on large-scale datasets.
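The sketch below evaluates the GloVe objective over the observed (non-zero) entries of a dense co-occurrence matrix X; the weighting function uses the commonly cited values $x_{max}=100$ and exponent 3/4 from the GloVe paper, and all array names are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, power=0.75):
    """Weighting function f(X_ij): grows with the co-occurrence count and is capped at 1."""
    return np.where(x < x_max, (x / x_max) ** power, 1.0)

def glove_loss(U, Vc, b_u, b_v, X):
    """J(theta) = sum_ij f(X_ij) (u_i . v_j + b_i + b_j - log X_ij)^2 over observed pairs.
    U, Vc: (|V|, d) embedding matrices; b_u, b_v: (|V|,) biases; X: (|V|, |V|) co-occurrence counts."""
    i_idx, j_idx = np.nonzero(X)                      # only pairs that actually co-occur
    dots = np.sum(U[i_idx] * Vc[j_idx], axis=1)
    residual = dots + b_u[i_idx] + b_v[j_idx] - np.log(X[i_idx, j_idx])
    return np.sum(glove_weight(X[i_idx, j_idx]) * residual ** 2)
```

In practice this loss is minimized with stochastic updates over the non-zero entries of X rather than evaluated in full.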

FastText#

image
This image introduces subword embeddings in the FastText model, which is an improved word embedding method that captures finer-grained semantic information by breaking words down into subwords (n-grams).
Subword Embeddings:
The FastText model is similar to the Skip-gram model, but it breaks words down into n-grams (subwords), where n ranges from 3 to 6.
This method can capture semantic information within words, for example, the word “where” can be broken down into subwords “wh”, “her”, “ere”, etc.
Example:
The image provides an example of the breakdown of the word “where”:
3-grams: <wh, whe, her, ere, re>
4-grams: <whe, wher, here, ere>
5-grams: <wher, where, here>
6-grams: <where, where>
Replacement Operation:
When calculating the dot product of the center word and context word vectors, the FastText model replaces the original word vector dot product with the sum of the subword vectors.
Specifically, if $u_i \cdot v_j$ is the dot product of the original word vectors, then in FastText this dot product is replaced by the sum of the subword-vector dot products:
$$\sum_{g \in \text{n-grams}(w_i)} u_g \cdot v_j$$
where g is a subword of word $w_i$, and n-grams($w_i$) denotes the set of all subwords of word $w_i$.
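A minimal sketch of the subword extraction step, using '<' and '>' as boundary markers as in the example above; in the full FastText model the whole word token (e.g., "<where>") is typically kept as an additional unit as well.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract the character n-grams of a word, using '<' and '>' as boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', ...]
```

The word's score against a context vector is then the sum of the dot products of these subword vectors with that context vector, as in the formula above.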
Advantages of the FastText Model:
Capturing Internal Structure:
By breaking words down into subwords, FastText can capture internal structural information of words, which is very helpful for understanding the semantics of words.
Handling Rare and Unknown Words:
Subword embeddings can better handle rare and unknown words, as even if a word has not appeared in the training data, its subwords may have.
Improving Generalization Ability:
Subword embeddings give the model better generalization ability when facing new words, as it can use known subword information to infer the semantics of new words.

Pre-trained Usable Word Embeddings#

image

To Contextualized Word Vectors Using LMs#

image

This image illustrates the structure of the ELMo (Embeddings from Language Models) model, which is a deep learning model used to generate word embeddings. ELMo was proposed by Matthew E. Peters et al. in 2018, and its paper "Deep Contextualized Word Representations" details the principles and implementation of the model.
Elements in the image explained:

  • Input Layer ($E_1, E_2, ..., E_N$):
    These are the input embeddings of words, usually one-hot encoded or frequency encoded.

  • Bidirectional LSTM Layer:
    The image shows two layers of bidirectional LSTM (Long Short-Term Memory), each consisting of a forward and backward LSTM. Each LSTM unit processes sequential data and can capture long-distance dependencies between words.
    Bidirectional LSTM can simultaneously consider the contextual information of words from both directions, thus better understanding the contextual meaning of words.

    • Additional Notes on the Bidirectional LSTM Layer:
      • Bidirectional Long Short-Term Memory (Bi-LSTM) is a special type of recurrent neural network (RNN) that processes sequential data using two LSTM layers, one layer processing data in the forward direction (from the start to the end of the sequence) and the other layer processing data in the backward direction (from the end to the start of the sequence). This structure allows the network to consider the contextual information of each element in the sequence from both directions.
      • Structure: In a bidirectional LSTM, for each time step t in the sequence, there are two LSTM units working:
        Forward LSTM: This LSTM unit starts from the first element of the sequence and processes the sequence in the forward direction until the last element. For each time step t, it only considers information from the start of the sequence to the current time step.
        Backward LSTM: This LSTM unit starts from the last element of the sequence and processes the sequence in the backward direction until the first element. For each time step t, it only considers information from the end of the sequence to the current time step.
      • Information Flow: At each time step t, both the forward and backward LSTMs produce a hidden state. These two hidden states contain contextual information about the element at that position in the sequence, one coming from the front of the sequence and the other from the back.
      • Output: The output of the bidirectional LSTM can be combined in several different ways (a short sketch follows after this list):
        • Concatenation: Concatenate the outputs of the forward and backward LSTM at each time step to form a longer vector. This method retains the bidirectional contextual information for each position in the sequence.
        • Summation: Sum the output vectors of the forward and backward LSTM at each time step. This method merges the bidirectional information but may lose some details.
        • Averaging: Average the output vectors of the forward and backward LSTM at each time step. This method also merges the bidirectional information but may reduce the model's sensitivity to specific directional information.
        • Separate Use: In some cases, the outputs of the forward and backward LSTM may be used separately, especially when different parts of the model require information from different directions.
  • Output Layer ($T_1, T_2, ..., T_N$):
    These are the representations of the words after processing by the LSTM layers. Each word's representation is a weighted sum of its outputs from the different LSTM layers.
    Principle:

  • Contextualizing Word Embeddings: Traditional word embeddings (like Word2Vec or GloVe) are static and do not consider the context of words. ELMo, by using bidirectional LSTMs, can generate contextualized word embeddings, meaning the same word can have different representations in different contexts.

  • Capturing Long-Distance Dependencies: LSTMs are particularly suited for handling sequential data, capturing long-distance dependencies between words. This is crucial for understanding complex structures in language (like syntax and semantics).

  • Bidirectional Information Flow: By considering the contextual information of words from both directions, ELMo can understand the meaning of words more comprehensively. This is important for handling ambiguous words and understanding context.
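To make the combination options listed above concrete, here is a minimal PyTorch sketch of a two-layer bidirectional LSTM whose forward and backward outputs are concatenated, summed, or averaged; the shapes and random inputs are illustrative and this is not the full ELMo architecture.

```python
import torch
import torch.nn as nn

# Toy batch: 2 sentences of length 5, each token as a 100-dim input embedding.
x = torch.randn(2, 5, 100)

bilstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=2,
                 batch_first=True, bidirectional=True)
outputs, _ = bilstm(x)                   # (2, 5, 128): forward and backward states concatenated

# The concatenated output can also be split and merged in other ways:
forward_h, backward_h = outputs[..., :64], outputs[..., 64:]
summed = forward_h + backward_h          # summation
averaged = (forward_h + backward_h) / 2  # averaging
print(outputs.shape, summed.shape, averaged.shape)
```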

Evaluating Word Vectors#

  • Extrinsic Evaluation
    • Plug the word embeddings into real NLP systems and see whether they improve performance. This can take a long time, but it remains the most important evaluation metric.

image

  • Intrinsic Evaluation
    • Evaluate specific/intermediate sub-tasks
    • Quick computation
    • Unclear if this actually helps downstream tasks

image

image

image
Vocabulary Assumption: Assume there exists a fixed vocabulary built from the training set containing tens of thousands of words. All new words encountered during testing will be mapped to a single "UNK" (unknown word).
Example of Vocabulary Mapping:
Common Words: For example, “hat” maps to “pizza” (index), “learn” maps to “tasty” (index).
Variants, Misspellings, New Terms: “taaaaaasty” (variant), “laern” (misspelling), “Transformerify” (new term) are all mapped to “UNK” (index).
Limitations of the Finite Vocabulary Assumption: The finite vocabulary assumption makes even less sense in many other languages, because many languages have rich morphology or complex word structure, producing many more word types, each of which occurs less often.
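The sketch below shows the vocabulary mapping with a single "UNK" fallback; the words and helper names are illustrative.

```python
def build_word_to_index(train_tokens):
    """Build a fixed vocabulary from the training set, reserving index 0 for 'UNK'."""
    vocab = {"UNK": 0}
    for tok in train_tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def lookup(word, vocab):
    """Map any word not seen in training to the single 'UNK' index."""
    return vocab.get(word, vocab["UNK"])

vocab = build_word_to_index(["hat", "learn", "tasty", "pizza"])
print(lookup("hat", vocab), lookup("taaaaaasty", vocab), lookup("Transformerify", vocab))
# known index, 0 (UNK), 0 (UNK)
```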

Language Models#

Narrow Sense#

A probabilistic model that assigns a probability to every finite sequence (grammatical or not).
GPT-3 still acts in this way but the model is implemented as a very large neural network of 175 billion parameters!

image

Broad Sense#

image
The image details three main architectures of pre-trained language models: decoder-only models, encoder-only models, and encoder-decoder models, along with their typical applications.

  • Decoder-only Models:
    Representative Models: GPT-x models (like GPT-2, GPT-3). These models are primarily used for generation tasks, such as text generation and question answering. They typically use an autoregressive method to generate text from left to right.
  • Encoder-only Models:
    Representative Models: BERT, RoBERTa, ELECTRA. These models process input text through an encoder to generate representations of the text but do not perform text generation. They are mainly used for understanding tasks, such as text classification and named entity recognition. The BERT model uses masked language modeling (Mask LM) and next sentence prediction (NSP) as pre-training objectives to learn contextual representations of words.
  • Encoder-Decoder Models:
    Representative Models: T5, BART. These models combine encoders and decoders, capable of handling both generation and understanding tasks. The encoder generates text representations, and the decoder generates output text based on these representations. This structure allows the model to handle tasks like translation and summarization.
    Explanation of Examples in the Image:
  • BERT:
    BERT uses masked language modeling (Mask LM) and next sentence prediction (NSP) as pre-training objectives. The image shows how BERT processes two masked sentences (Masked Sentence A and Masked Sentence B) and an unlabeled sentence pair (Unlabeled Sentence A and B Pair).
  • T5:
    T5 is an encoder-decoder model that uses a different pre-training objective. The image illustrates T5's applications in various tasks, including translation (translating English to German), summarization (summarizing text), and text evaluation (judging the acceptability of text).
  • Principle:
    Masked Language Modeling (Mask LM): In BERT, some words in the input text are randomly replaced with a special [MASK] token, and the model needs to predict these masked words. This method allows the model to learn contextual representations of words.
    Next Sentence Prediction (NSP): BERT also uses the NSP task to learn relationships between sentences. The model needs to predict whether two input sentences are consecutive text.
    Encoder-Decoder Structure: In T5 and BART, the encoder first processes the input text to generate representations. Then, the decoder generates output text based on these representations. This structure allows the model to handle both generation and understanding tasks.

Building Neural Language Models#

  • image
  • Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model
    • Improvements to Fixed Window Neural Language Models:
      • No Sparsity Issues:
        Neural language models do not produce sparsity issues because they do not need to calculate the probabilities of each specific n-gram but predict the next word by learning word vectors and context.
      • No Need to Store All Observed n-grams:
        Neural models do not need to store all n-grams observed in the corpus along with their counts, thus reducing storage requirements.
    • Existing Problems:
      • Fixed Window Too Small:
        The size of the fixed window limits the range of context the model can consider.
      • Enlarging the Window Increases Parameter Count:
        If we try to enlarge the window to include more contextual information, the number of parameters (size of the weight matrix W) will also increase, which may lead to overfitting and increased computational costs.
      • The Window Is Never Big Enough:
        No matter how large the window is, there will always be some long-distance dependencies that cannot be captured.
      • Input Processing Lacks Symmetry:
        In fixed window models, words at different positions in the sequence are treated with different weights, lacking symmetry.
      • Solution:
        Recurrent Neural Networks (RNNs):
        The image points out the need for a neural network architecture capable of handling inputs of arbitrary length. RNNs are one solution, as they can process sequential data through recurrent connections, regardless of the sequence length.

More On Word Vectors#

image
This image illustrates the workflow of MorphTE (a method for injecting morphology into tensor embeddings) from the paper "MorphTE: Injecting Morphology in Tensorized Embeddings" (NeurIPS 2022), authored by Guobing Gan, Peng Zhang, and others. Below is a detailed explanation of the content in the image:

  1. Left Side - Vocabulary:
    Displays a vocabulary containing multiple words, such as “kindness”, “unkindly”, “unfeelingly”. These words will serve as inputs for subsequent processing steps.

  2. Middle Left - Morpheme Segmentation:
    Performs morpheme segmentation on each word in the vocabulary. For example, “kindness” is segmented into “kind” and “ness”, “unkindly” is segmented into “un”, “kind”, and “ly”, and “unfeelingly” is segmented into “un”, “feel”, “ing”, and “ly”. The segmented morphemes are arranged in a matrix, with the number of rows equal to the vocabulary size |V| and the number of columns equal to the number of morphemes in the words n.

  3. Middle Right - Indexing:
    Indexes the segmented morphemes, mapping each morpheme to a unique identifier. The indexed results are used for subsequent embedding operations.

  4. Right Side - Morpheme Embedding Matrices:
    Contains two morpheme embedding matrices $f_l$ and $f_r$, used to process the left and right parts of the morphemes, respectively. These matrices convert morpheme indices into low-dimensional vector representations.

  5. Far Right - Word Embedding Matrix:
    Combines the results of the morpheme embedding matrices (shown in the image as an addition operation) to generate the final word embedding vectors. These vectors represent the semantic and morphological information of the words.

Symbols and parameters in the image explained:

  • $n$: the number of morphemes in a word (morpheme order).
  • $q$: the dimensionality of the morpheme vectors.
  • $|V|$: the size of the vocabulary.
  • $|M|$: the size of the morpheme vocabulary.

Overall, this image illustrates how MorphTE converts words into vector representations containing morphological information through morpheme segmentation, indexing, and embedding operations.

image

image

Training Word Vectors#

How to train?

image

Practice#

image
c

Compute Gradients for Word2vec#

image

image

image

Overall Algorithm#

image

This image illustrates an overall algorithm primarily used for tasks related to word embeddings, with the following detailed explanation:

Input Section#

  • Text Corpus: The text data source that the algorithm processes.
  • Embedding Size d: The size of the embedding dimension, determining the dimensionality of the final representation vector for each word.
  • Vocabulary V: The vocabulary containing all possible words.
  • Context Size m: The size of the context window, defining the range of context considered in the text.

Initialization Section#

For each word i in the vocabulary V, randomly initialize two vectors $\mathbf{u}_i$ and $\mathbf{v}_i$.

Training Section#

Iterate through the training corpus; for each training instance $(t, c)$ (where $t$ is the target/center word and $c$ is the context word):

  1. Update the target word vector $\mathbf{u}_t$:
    • $\mathbf{u}_t \leftarrow \mathbf{u}_t - \eta \frac{\partial y}{\partial \mathbf{u}_t}$, where $\frac{\partial y}{\partial \mathbf{u}_t} = -\mathbf{v}_c + \sum_{k \in V} P(k \mid t)\,\mathbf{v}_k$. Here $\eta$ is the learning rate, controlling the step size of each update.
  2. Update the context word vectors $\mathbf{v}_k$:
    • For each word $k$ in the vocabulary V: $\mathbf{v}_k \leftarrow \mathbf{v}_k - \eta \frac{\partial y}{\partial \mathbf{v}_k}$.
    • When $k = c$ (the current context word), $\frac{\partial y}{\partial \mathbf{v}_k} = (P(k \mid t) - 1)\,\mathbf{u}_t$; when $k \neq c$, $\frac{\partial y}{\partial \mathbf{v}_k} = P(k \mid t)\,\mathbf{u}_t$. $P(k \mid t)$ is the probability of word $k$ given the target word $t$.
The right side also shows an example of converting training data into a specific format, such as (into, problems), reflecting the combination of target and context words.
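A minimal NumPy sketch of one stochastic gradient step implementing the updates above; the matrix names, sizes, and learning rate are illustrative.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def sgd_step(U, Vc, t, c, eta=0.05):
    """One update for a training pair (t, c) using the gradients above.
    U, Vc: (|V|, d) matrices of center-word and context-word vectors, updated in place."""
    p = softmax(Vc @ U[t])                      # P(k | t) for every k in the vocabulary
    grad_u = -Vc[c] + p @ Vc                    # dy/du_t = -v_c + sum_k P(k|t) v_k
    grad_V = np.outer(p, U[t])                  # dy/dv_k = P(k|t) u_t   (for k != c)
    grad_V[c] -= U[t]                           # dy/dv_c = (P(c|t) - 1) u_t
    U[t] -= eta * grad_u
    Vc   -= eta * grad_V

# Toy usage with random parameters:
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(1000, 50))
Vc = rng.normal(scale=0.1, size=(1000, 50))
sgd_step(U, Vc, t=3, c=17)                      # e.g., the pair (into, problems) as indices
```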
