Embeddings in NLP: Turning Words into Numbers

The intersection of linguistics and machine learning has led to significant advances in Natural Language Processing (NLP) over the past decade. NLP applies machine learning techniques to understand, interpret, and generate human language. One of the key breakthroughs behind this revolution is embeddings: the transformation of words, phrases, or even sentences into numerical representations, typically vectors.

In this detailed blog, we will explore embeddings from the ground up: understanding what they are, how they work, the algorithms used to generate them, and their critical role in modern NLP tasks. By the end, you will have a deeper understanding of how embeddings allow machines to understand human language and their applications in real-world scenarios.

What Are Embeddings?

At a high level, embeddings are vector representations of words, phrases, or sentences in a continuous vector space. These vectors, or embeddings, are constructed in such a way that words with similar meanings or contexts are closer to each other in the vector space. The magic behind embeddings lies in their ability to capture semantic and syntactic relationships between words.

For example, consider the words “king” and “queen.” Though they are distinct words, they share many similarities. Both are titles of royalty and are often used in similar contexts. A well-trained embedding model will map these two words to vectors that are close to one another in the embedding space. Here’s a simplified view of what the vectors might look like:

  • “king”: [0.45, -0.23, 0.87, …]
  • “queen”: [0.44, -0.22, 0.86, …]

This vector space representation enables machines to understand the relationship between words at a deeper level, allowing NLP systems to perform tasks like semantic similarity, translation, sentiment analysis, and much more.
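
To make this concrete, here is a minimal sketch in Python. The vectors are toy values invented for illustration (not output from a real model); it simply shows how cosine similarity is used to measure how close two word vectors are:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar directions, lower otherwise."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors for illustration only; real embeddings
# typically have hundreds of dimensions and are learned from data.
king = np.array([0.45, -0.23, 0.87])
queen = np.array([0.44, -0.22, 0.86])
car = np.array([-0.60, 0.75, 0.10])

print(cosine_similarity(king, queen))  # close to 1.0 -> similar meanings
print(cosine_similarity(king, car))    # much lower -> dissimilar meanings
```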

Why Are Embeddings Important?

Language is inherently complex, with numerous layers of meaning, context, and nuance. Traditional NLP models used simple methods like one-hot encoding to represent words. In one-hot encoding, each word in the vocabulary is mapped to a high-dimensional vector with all zeros except for a single one at the position corresponding to that word. This method, however, has two major shortcomings:

  1. Sparsity: The vector is very sparse, and it doesn’t effectively capture relationships between words. For example, there is no way to tell that “dog” and “cat” are semantically similar just by looking at their one-hot representations.
  2. High Dimensionality: As the vocabulary size grows, the one-hot vectors become increasingly large, making computations expensive and inefficient.

Embeddings, on the other hand, represent words in a dense vector space, where each word is represented by a low-dimensional vector. These vectors have meaningful values and are learned in such a way that words with similar meanings are placed closer together in the space.
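
The contrast is easy to see in a quick sketch (the tiny vocabulary and the dense values below are made up for illustration): one-hot vectors for different words are always orthogonal, so they carry no similarity information, while dense embeddings can encode it.

```python
import numpy as np

vocab = ["dog", "cat", "car"]  # tiny illustrative vocabulary

# One-hot encoding: dimension = vocabulary size, and every pair of distinct
# words has dot product 0, so no notion of similarity survives.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}
print(np.dot(one_hot["dog"], one_hot["cat"]))  # 0.0

# Dense embeddings (toy values): low-dimensional, real-valued, and learned
# so that related words end up close together.
dense = {
    "dog": np.array([0.8, 0.1, 0.3]),
    "cat": np.array([0.7, 0.2, 0.4]),
    "car": np.array([-0.5, 0.9, -0.2]),
}
print(np.dot(dense["dog"], dense["cat"]))  # noticeably larger than dog/car
```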

Embeddings allow models to capture semantic relationships (such as similarity between words) and syntactic relationships (such as grammatical structure), both of which are essential for tasks like machine translation, question answering, and text summarization.

The Mechanics of Embeddings: How Do They Work?

Initialization and Randomization

When embeddings are first initialized, they start as random vectors. These vectors have no meaningful relationships at this point. For instance, the initial vectors for “dog” and “cat” might look like this:

  • “dog”: [0.23, -0.57, 0.41, …]
  • “cat”: [-0.19, 0.42, -0.72, …]

Since they are initialized randomly, there is no real semantic meaning behind these vectors. However, during training, the vectors will adjust based on the context in which words appear. This is where the power of embeddings lies — they learn from the context and semantic patterns in large datasets, gradually adjusting to reflect the relationships between words.
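
A minimal sketch of this starting point, assuming a toy vocabulary and an arbitrarily small embedding dimension:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab = ["dog", "cat", "bank", "river"]
embedding_dim = 4  # real models typically use 100 to 300 dimensions
word_to_index = {word: i for i, word in enumerate(vocab)}

# Random initialization: one row per word, no semantic structure yet.
embedding_matrix = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

print(embedding_matrix[word_to_index["dog"]])  # meaningless until training
```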

Training Process: Context and Optimization

The core idea behind word embeddings is that the meaning of a word can be inferred from the words around it (its context). Words that appear in similar contexts tend to have similar meanings. For example, “dog” and “cat” frequently appear in contexts related to pets, animals, and so on. Hence, their embeddings should become more similar as they are trained on a large corpus of text.

The training of embeddings typically happens in a self-supervised fashion (often described as unsupervised): no hand-labelled data is provided; instead, the model learns from the vast amounts of raw text it processes. Two of the most widely used approaches to training embeddings are Word2Vec and GloVe. Let’s take a closer look at each.

Word2Vec

Word2Vec is one of the most popular algorithms for generating word embeddings. It works by either predicting a target word from its surrounding context or predicting the context from a target word. There are two architectures within Word2Vec:

  1. Continuous Bag of Words (CBOW): In CBOW, the model predicts a target word based on a given context. For instance, given the context words “the” and “dog,” the model tries to predict the target word “barks.”
  2. Skip-Gram: Skip-Gram does the reverse — it uses a target word to predict the surrounding context. For example, given the word “dog,” it will predict words like “barks,” “tail,” and “animal.”

Both CBOW and Skip-Gram use a sliding window approach to process a given text corpus. The window shifts across the text, and the embeddings are updated as the model learns from the surrounding words.
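
Here is a minimal training sketch, assuming the gensim library is installed. The toy corpus is made up and far too small to produce useful vectors; a real corpus would contain millions of sentences:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "dog", "barks", "at", "the", "cat"],
    ["the", "cat", "chases", "the", "dog"],
    ["dogs", "and", "cats", "are", "popular", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # sliding-window size around each target word
    min_count=1,      # keep every word in this tiny example
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
)

print(model.wv["dog"][:5])           # first few dimensions of the vector
print(model.wv.most_similar("dog"))  # nearest neighbours in the space
```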

GloVe (Global Vectors for Word Representation)

Unlike Word2Vec, which learns from local context windows, GloVe is based on global word co-occurrence statistics. In other words, it captures how frequently pairs of words co-occur across the entire corpus. It builds a co-occurrence matrix and then factorizes it to produce the word embeddings.

The idea behind GloVe is that the meaning of a word is influenced by its relationship to all other words in the corpus, not just the immediate neighbors. Thus, GloVe’s embeddings tend to capture more global relationships, while Word2Vec captures more local context.
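
As a rough illustration of the first step, here is a toy sketch that builds a word-to-word co-occurrence count table with a symmetric context window. The actual GloVe algorithm then fits word vectors to the logarithms of these counts with a weighted least-squares objective; that fitting step is omitted here:

```python
from collections import defaultdict

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"]]
window = 2  # symmetric context window size

# cooccurrence[(w1, w2)] counts how often w2 appears within the window of w1.
cooccurrence = defaultdict(float)
for sentence in corpus:
    for i, word in enumerate(sentence):
        start, end = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                cooccurrence[(word, sentence[j])] += 1.0

print(cooccurrence[("dog", "the")])  # 1.0: "the" occurs once near "dog"
```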

Vector Space: What Do Embeddings Represent?

Once embeddings are trained, each word in the vocabulary is mapped to a vector in a multi-dimensional space, typically with 100 to 300 dimensions. In this vector space, words with similar meanings are located closer together, while words with dissimilar meanings are placed farther apart. This is akin to a semantic space where relationships between words can be represented by distances or directions in the vector space.

For example:

  • “king” and “queen”: As mentioned earlier, these words would have vectors that are close to each other because they share similar meanings and contexts.
  • “man” and “woman”: These words would also have vectors that are close but might differ in certain dimensions that capture gender.

A fascinating property of word embeddings is that they capture analogies. For instance, the difference between “king” and “queen” is similar to the difference between “man” and “woman.” The vectors for “king” and “queen” differ in much the same way as the vectors for “man” and “woman.” This property allows embeddings to be used for tasks like solving analogies (e.g., “man” is to “woman” as “king” is to “queen”).
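
With pretrained vectors, this arithmetic can be checked directly. A minimal sketch, assuming gensim is installed and using its downloader to fetch the pretrained word2vec-google-news-300 vectors (a large download that is cached locally):

```python
import gensim.downloader as api

# Pretrained Google News vectors; the first call downloads roughly 1.6 GB.
wv = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" should land near "queen".
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to return 'queen' as the top match
```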

Challenges and Limitations of Word Embeddings

While embeddings have been revolutionary in NLP, they are not without their challenges. Here are some of the key limitations:

  1. Bias in Embeddings: Word embeddings can inherit biases present in the training data. For example, if a corpus has biased associations between certain words (e.g., “doctor” with “male” and “nurse” with “female”), the embeddings will reflect these biases. This has been a topic of active research, and methods are being developed to reduce bias in embeddings.
  2. Out-of-Vocabulary (OOV) Words: Embeddings are limited to the vocabulary seen during training. If the model encounters a word it has never seen before (an out-of-vocabulary word), it won’t have a direct vector representation for it. Subword approaches (such as Byte Pair Encoding) and character-level embeddings help alleviate this issue by breaking words down into smaller components.
  3. Context Sensitivity: Traditional word embeddings are static, meaning the vector representation of a word is the same regardless of the context in which it appears. For instance, “bank” would have the same embedding whether it refers to a financial institution or the side of a river. To address this, models like ELMo, BERT, and GPT were developed to create contextual embeddings that vary depending on the surrounding text (see the sketch after this list).
  4. Dimensionality and Interpretability: While embeddings are typically low-dimensional compared to one-hot encodings, they still represent very high-dimensional spaces. These spaces can be difficult to visualize and interpret. Various techniques like t-SNE or PCA are used to reduce the dimensionality and visualize embeddings, but they can only provide approximations of the true relationships between words.
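
A minimal sketch of the “bank” example from point 3 above, assuming the Hugging Face transformers and torch packages are installed. It extracts BERT's contextual vector for “bank” in two different sentences and shows that, unlike a static embedding, the two vectors differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return BERT's contextual vector for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1 = bank_vector("she deposited money at the bank")
v2 = bank_vector("they sat on the bank of the river")

# A static embedding would give identical vectors; BERT does not.
print(torch.cosine_similarity(v1, v2, dim=0).item())  # noticeably below 1.0
```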

Types of Embeddings

Embeddings are not limited to word-level representations. Depending on the application and the complexity of the task, different types of embeddings have been developed:

1. Word Embeddings

Word embeddings map each word in the vocabulary to a dense vector. This is the most common and basic form of embedding, and algorithms like Word2Vec and GloVe are designed to generate them. Word embeddings are widely used in tasks like text classification, semantic similarity, and sentiment analysis.

2. Subword Embeddings

Subword embeddings break words down into smaller units like subwords or character n-grams. These embeddings are useful for handling morphologically rich languages (such as Finnish or Turkish) or languages with complex character structures (like Chinese or Japanese). Subword embeddings can also help with handling rare or unseen words by representing them as combinations of known subwords.

  • Byte Pair Encoding (BPE): This method iteratively merges the most frequent pair of characters or subwords into a new token (a toy illustration follows this list).
  • WordPiece: Used in models like BERT, WordPiece splits words into smaller units and assigns embeddings to these subword units.
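
A toy illustration of the BPE merging step mentioned above. It finds the most frequent adjacent symbol pair in a tiny word-frequency table and merges it; real BPE repeats this step thousands of times over a large corpus:

```python
from collections import Counter

# Each word is a sequence of symbols, paired with its corpus frequency.
vocab = {("l", "o", "w"): 5, ("l", "o", "g"): 3, ("n", "e", "w"): 6}

# Count adjacent symbol pairs, weighted by word frequency.
pair_counts = Counter()
for symbols, freq in vocab.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

best = max(pair_counts, key=pair_counts.get)  # ('l', 'o') in this toy example

def merge(symbols, pair):
    """Merge every occurrence of `pair` in `symbols` into a single new symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

vocab = {merge(symbols, best): freq for symbols, freq in vocab.items()}
print(vocab)  # {('lo', 'w'): 5, ('lo', 'g'): 3, ('n', 'e', 'w'): 6}
```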

3. Character-Level Embeddings

Character-level embeddings represent individual characters rather than whole words. This is particularly useful for languages with rich morphology or when dealing with noisy text (such as social media). Character-level embeddings allow the model to generalize better, particularly for out-of-vocabulary words.

For example, the word “unhappiness” could be broken down into its constituent characters: [‘u’, ‘n’, ‘h’, ‘a’, ‘p’, ‘p’, ‘i’, ‘n’, ‘e’, ‘s’, ‘s’]. Each character has its own embedding, and the final word representation is a combination of these individual character embeddings.
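
A minimal sketch of one simple way to build a word vector from character vectors, assuming randomly initialized character embeddings and mean pooling (real models typically run a CNN or RNN over the character sequence rather than a plain average):

```python
import string

import numpy as np

rng = np.random.default_rng(seed=0)
char_dim = 8  # arbitrary small dimension for illustration

# One embedding per lowercase letter; these would be learned in practice.
char_embeddings = {c: rng.normal(size=char_dim) for c in string.ascii_lowercase}

def word_vector(word: str) -> np.ndarray:
    """Compose a word vector by mean-pooling its character embeddings."""
    return np.mean([char_embeddings[c] for c in word.lower()], axis=0)

print(word_vector("unhappiness").shape)  # (8,), built even for unseen words
```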

Applications of Embeddings

Word embeddings are at the core of numerous NLP applications. Their ability to capture the meaning and relationships between words enables a variety of tasks in modern NLP systems.

1. Semantic Search

By converting both queries and documents into vector representations, embeddings enable semantic search. Unlike traditional keyword-based search, semantic search compares the vector representation of the query with the vector representations of candidate documents and ranks documents by how close they are in the embedding space. This enables a more accurate and context-aware search experience.

For example, a semantic search system can find relevant results for the query “How to train a dog?” by matching it with documents that contain similar meanings, even if the exact words don’t appear.
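
A minimal sketch of the ranking step, assuming the query and each document have already been turned into embeddings (the vectors below are toy values; in practice they would come from a sentence-embedding model or from averaging word embeddings):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy precomputed embeddings for three documents and one query.
documents = {
    "puppy obedience training tips": np.array([0.9, 0.1, 0.2]),
    "best pasta recipes": np.array([0.1, 0.8, 0.3]),
    "stock market basics": np.array([0.2, 0.3, 0.9]),
}
query = np.array([0.85, 0.15, 0.25])  # embedding of "How to train a dog?"

# Rank documents by similarity to the query vector.
ranked = sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)
print(ranked[0])  # "puppy obedience training tips" comes out on top
```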

2. Sentiment Analysis

Embeddings are widely used in sentiment analysis, where the model classifies text as positive, negative, or neutral based on its semantic meaning. By representing the text as vectors, the model can identify sentiment-related patterns more effectively.

3. Machine Translation

Embeddings are also fundamental for machine translation. By converting words or phrases into vector representations, embeddings allow models to map words in one language to their counterparts in another language. Neural Machine Translation (NMT) systems like Google Translate rely on embeddings to handle multiple languages.

4. Named Entity Recognition (NER)

Embeddings are useful in tasks like Named Entity Recognition, where the goal is to identify entities like names, locations, and dates in text. The model can leverage the relationships captured in the embeddings to identify entities even when they appear in unfamiliar contexts.

5. Text Classification

Text classification tasks, such as spam detection, topic categorization, or news classification, can be powered by embeddings. By converting the entire document into an embedding (or aggregating word embeddings), machine learning models can classify text into predefined categories.
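
A minimal sketch of this pipeline, assuming pretrained word embeddings are available as a Python dict and scikit-learn is installed; the tiny spam-detection dataset is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume these word embeddings were pretrained elsewhere (toy 3-d values here).
word_vectors = {
    "win": np.array([0.9, 0.1, 0.0]), "prize": np.array([0.8, 0.2, 0.1]),
    "free": np.array([0.7, 0.3, 0.0]), "meeting": np.array([0.1, 0.8, 0.2]),
    "report": np.array([0.0, 0.9, 0.3]), "tomorrow": np.array([0.2, 0.7, 0.1]),
}

def doc_embedding(text: str) -> np.ndarray:
    """Average the embeddings of known words; crude but widely used."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

texts = ["win free prize", "free prize win", "meeting report tomorrow", "report meeting"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

clf = LogisticRegression().fit([doc_embedding(t) for t in texts], labels)
print(clf.predict([doc_embedding("win a free prize tomorrow")]))  # likely [1]
```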

Conclusion

Embeddings have become a foundational technique in NLP, transforming how machines understand and process human language. They provide a dense, meaningful, and computationally efficient representation of words, allowing machines to capture semantic and syntactic relationships.

From Word2Vec and GloVe to modern contextualized embeddings like BERT and GPT, the field continues to evolve, opening up new possibilities in NLP. Embeddings have led to breakthroughs in tasks such as machine translation, sentiment analysis, and text classification, among others. However, challenges such as bias, out-of-vocabulary words, and interpretability remain, and continued research is necessary to overcome them.

Embeddings are a crucial building block for many advanced NLP models, and they will remain an essential part of the evolving landscape of AI and language understanding. As the field progresses, it’s likely we’ll see even more sophisticated and nuanced embeddings that will push the boundaries of what’s possible with NLP.
