
Natural Language Processing & Word Embeddings

Natural language processing (NLP) combined with deep learning has led to significant advancements in machine understanding of language. Word embeddings are a critical component of these developments, providing nuanced representations of words that capture their meanings, relationships, and context within a language.

Introduction to Word Embeddings

Word embeddings provide a way to convert words into a numerical form that deep learning models can understand. Unlike sparse one-hot encoded vectors, word embeddings map words into a continuous vector space where semantically similar words are mapped to nearby points.

Word Representations

Word embeddings help address the shortcomings of one-hot encoding by capturing the semantic relationships between words. This means that words like "king" and "queen," which have similar semantic meanings, will have similar representations in the embedding space.

To understand word embeddings, consider the contrast between one-hot representations and an embedding:

One-hot representation

In one-hot encoding, every word is represented as a vector with a '1' in the position corresponding to that word in the vocabulary, and '0's everywhere else. We will use the notation $O_{\text{idx}}$ for a word represented as a one-hot vector.

This representation does not capture any semantic information about the words. It treats a word as a thing in itself and it doesn't allow an algorithm to generalize across words. For example, consider the following sentence:

"I want a glass of orange [...]"

A model should predict the next word to be "juice". But for a similar sentence:

"I want a glass of apple [...]"

A model won't predict "juice" as easily here if it wasn't trained on such examples. However, if the model could tell that the two examples are related because orange and apple are similar, this would become achievable. This is what word embeddings do.

Notice that the inner product between any two one-hot vectors is zero, so one-hot vectors don't do a good job of capturing the similarity between words; every one-hot vector also has the same Euclidean distance from every other one-hot vector. So, instead of a one-hot representation, wouldn't it be nice if we could learn a featurized representation for each of these words: man, woman, king, queen, apple, and orange? Embedding vectors, such as GloVe vectors, provide a much denser representation in which words with similar meanings have similar vectors:

Word embedding representation

Each word is represented by a feature vector, which helps models to generalize from one word to another based on their semantic similarity.
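As a quick illustration of this contrast, here is a small NumPy sketch with made-up 4-dimensional feature vectors: distinct one-hot vectors are always orthogonal, while dense embeddings can encode similarity.

```python
import numpy as np

# Toy vocabulary with made-up 4-dimensional feature vectors (illustrative only):
# rows of E are features (royalty, gender, fruit, food), columns are words.
vocab = ["king", "queen", "apple", "orange"]
E = np.array([[ 0.95,  0.97,  0.00, -0.01],   # royalty
              [-0.95,  0.93,  0.10,  0.02],   # gender
              [ 0.01,  0.00,  0.95,  0.97],   # fruit
              [ 0.02,  0.01,  0.60,  0.65]])  # food
one_hot = np.eye(len(vocab))                  # one one-hot row per word
embeddings = {w: E[:, i] for i, w in enumerate(vocab)}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Distinct one-hot vectors are always orthogonal: inner product 0, same distance apart.
print(one_hot[2] @ one_hot[3])                            # 0.0 (apple vs orange)

# Dense embeddings reflect meaning: apple and orange are close, king and apple are not.
print(cosine(embeddings["apple"], embeddings["orange"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))    # close to 0
```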

Each word will have, for example, 300 features, each a floating-point number. Each word's column is then a 300-dimensional vector that serves as its representation. We will use the notation $e_{\text{idx}}$ for the feature vector of a particular word (the 300-dimensional vector in this example). Now, let's return to the earlier examples:

"I want a glass of orange [...]"

"I want a glass of apple [...]"

Orange and apple now share many similar features, which makes it easier for an algorithm to generalize between them. We call this representation a word embedding. To visualize word embeddings, we can use t-SNE to reduce the 300 features to 2 dimensions:

t-SNE visualizations of word embeddings. Left: Number Region; Right: Jobs Region.

This is a t-SNE visualization of word embeddings taken from Turian et al. (2010), showing the Number Region on the left and the Jobs Region on the right. You can see the complete image here.

You will get a sense that more related words are closer to each other.
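A minimal sketch of how such a plot can be produced with scikit-learn's t-SNE and matplotlib, assuming you already have an embedding matrix and a word list (random placeholders are used here instead of real embeddings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed inputs: `words` is a list of N words and `E` is an (N, 300) array
# holding one embedding vector per word (e.g. loaded from pre-trained GloVe).
words = ["one", "two", "three", "teacher", "engineer", "lawyer"]
E = np.random.randn(len(words), 300)   # placeholder; use real embeddings in practice

# Reduce 300 dimensions to 2 for plotting (perplexity must be < number of points).
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(E)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```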

Using word embeddings

Let's see how we can take the feature representation we have extracted for each word and apply it to the Named Entity Recognition (NER) problem:

Sally Johnson is a person's name. After training on this sentence, the model should work out that the sentence "Robert Lin is an apple farmer" also contains a name, Robert Lin, because apple and orange have similar representations. Now if you test your model on the sentence "Shav Vimalendiran is a durian cultivator", the network should still recognize the name even if it never saw the word durian during training. That's the power of word representations.

The algorithms used to learn word embeddings can examine billions of words of unlabeled text, for example 100 billion words, and learn the representation from them. The best way to approach this is transfer learning of the word embeddings:

  1. Learn word embeddings from a large text corpus (1-100 billion words), or download a pre-trained embedding from online.
    • If you're transferring from some task A to some task B, transfer learning is most useful when you have a ton of data for A and a relatively smaller dataset for B. This isn't always the case, e.g., machine translation, where task B itself has plenty of data.
  2. Transfer the embedding to the new task, which has a smaller training set (say, 100k words).
    • You can now use relatively lower-dimensional feature vectors.
  3. Optional: continue to fine-tune the word embeddings with the new data.
    • This is only worth doing if the smaller training set (from step 2) is big enough.

Word embeddings tend to make the biggest difference when the task you're trying to carry out has a relatively small training set. They have been shown to be useful for named entity recognition, text summarization, co-reference resolution, parsing, etc. Also, one of the advantages of using word embeddings is that they reduce the size of the input: here we may only need a 300-dimensional feature vector as opposed to a 10,000-dimensional one-hot vector.
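As a concrete sketch of steps 1-2, pre-trained vectors are commonly distributed as plain-text files with one word and its values per line; the file name below is just an example of that format:

```python
import numpy as np

# Example file in the common "<word> <val_1> ... <val_300>" format
# (e.g. a GloVe release); the exact file name is an assumption here.
embeddings = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# The downstream task (say NER with a 100k-word training set) can now
# represent each input word by its 300-dimensional vector instead of a
# 10,000-dimensional one-hot vector.
e_orange = embeddings.get("orange")
```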

Word embeddings also have an interesting relationship to the face recognition task from earlier:

Face Recognition

In that problem, we encoded each face into a vector and then checked how similar those vectors were. The terms encoding and embedding have a similar meaning here. The difference is that in the word embeddings task we learn a representation for each word in a fixed vocabulary, unlike in image encoding, where the network must map each new image to some $n$-dimensional vector.

We'll have a fixed vocabulary of, say, 10,000 words, and we'll learn vectors $e_{1}$ through $e_{10000}$: a fixed encoding/embedding for each word in our vocabulary, rather than an encoding for arbitrary new, unencountered inputs.

Properties of word embeddings

Word embeddings can be used to perform analogical reasoning, which is the task of completing the sentence "A is to B as C is to ___". This is possible because embeddings capture the relationships between words in their geometry.

While analogy reasoning may not be by itself the most important NLP application, it helps convey a sense of what these word embeddings can do. For example, here is a word embeddings table, which includes featurized representations of a set of words that you might hope a word embedding could capture.

Table of featurized word representations

Can we complete this analogy:

  • Man ➡️ Woman
  • King ➡️ ❓

One way to carry out the analogy reasoning "man is to woman as king is to what?" is to compute $e_{\text{Man}} - e_{\text{Woman}}$ and then find the word whose vector makes $e_{\text{King}} - e_{\text{?}}$ close to that difference:

$$e_{\text{Man}} - e_{\text{Woman}} \approx e_{\text{King}} - e_{\text{?}}$$

Subtracting $e_{\text{Woman}}$ from $e_{\text{Man}}$ gives (approximately) the vector $[-2, 0, 0, 0]$. Similarly, $e_{\text{King}} - e_{\text{Queen}} \approx [-2, 0, 0, 0]$. So in both cases the difference is about gender.

$$e_{\text{Man}} - e_{\text{Woman}} = \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} \approx e_{\text{King}} - e_{\text{Queen}} = \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Difference

This difference vector represents gender. It captures that the main difference between man and woman is gender, and we look for a word whose difference from $e_{\text{King}}$ is similar. The drawing is a 2D visualization, produced by t-SNE, of the higher-dimensional vectors; it is just for illustration. Keep in mind not to rely on t-SNE plots for finding these parallels: all t-SNE does is map the 300-dimensional data, in a very non-linear way, to a 2D space, so only in the original dimensionality can you expect the parallelogram relationship to hold.

We can reformulate the problem to find:

$$e_{\text{Man}} - e_{\text{Woman}} \approx e_{\text{King}} - e_{\text{?}}$$

$$\text{argmax}_{w}\left[ \text{sim}(e_{w},\ e_{\text{King}} - e_{\text{Man}} + e_{\text{Woman}})\right]$$

where $\text{sim}(u, v)$ is the cosine similarity:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\left\| u \right\|_{2} \left\| v \right\|_{2}} = \cos(\theta)$$

Where:

  • $u \cdot v$ is the dot product (or inner product) of the two vectors
  • $\left\| u \right\|_{2}$ is the norm (or length) of the vector $u$
    • Reminder: $\left\| u \right\|_{2} = \sqrt{\sum_{i=1}^{n} u_{i}^{2}}$
  • $\theta$ is the angle between $u$ and $v$.
  • The cosine similarity depends on the angle between $u$ and $v$.
  • If $u$ and $v$ are very similar, say $\theta = 0^\circ$, their cosine similarity will be close to 1.
  • If they are dissimilar, say $\theta = 180^\circ$, the cosine similarity will be smaller, reaching $-1$ when the vectors point in exactly opposite directions.

Cosine Similarity

It turns out that $e_{\text{Queen}}$ is the best solution here, i.e. the most similar vector. Cosine similarity is the most commonly used similarity function; some other distance measures that may be used were explored earlier.

You can also use the Euclidean distance as a similarity function (but it actually measures dissimilarity, so you should take the negative).
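Putting the cosine similarity and the argmax search together, here is a small NumPy sketch, assuming `embeddings` is a dict mapping words to their vectors (as in the earlier loading example):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Return the word w maximizing sim(e_w, e_b - e_a + e_c),
    i.e. 'a is to b as c is to w'."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for w, e_w in embeddings.items():
        if w in (a, b, c):            # exclude the input words themselves
            continue
        sim = cosine_similarity(e_w, target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# e.g. complete_analogy("man", "woman", "king", embeddings) should return "queen"
```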

Embedding matrix

When you implement an algorithm to learn a word embedding, what you end up learning is actually an embedding matrix $E$: a learned parameter matrix that contains the embeddings for all words in the vocabulary.

Suppose we are using 10,000 words as our vocabulary (plus any special tokens). The algorithm then learns a matrix $E$ of shape $(300, 10000)$ in the case where we are extracting 300 features.

Let's look again at the one-hot encoded vector:

$$O_{\text{orange}} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}$$

If $O_{6257}$ is the one-hot encoding of the word orange, of shape $(10000, 1)$, then np.dot(E, O_6257) = e_6257, which has shape $(300, 1)$. Notice this works because the one-hot vector is as tall as the embedding matrix is wide. Generally:

e_j = np.dot(E,O_j)

The embedding for a word can be obtained by multiplying its one-hot encoded vector with the embedding matrix EE, or more efficiently, by directly indexing the matrix:

e_j = E[:, j]

The dimension of word vectors is usually smaller than the size of the vocabulary. Most common sizes for word feature vectors range between 50 and 400.
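A quick NumPy check that the matrix product and the direct column lookup give the same vector (toy sizes are used instead of 300 × 10,000):

```python
import numpy as np

vocab_size, embed_dim = 10, 4                 # toy sizes instead of 10,000 x 300
E = np.random.randn(embed_dim, vocab_size)    # embedding matrix, one column per word

j = 6                                         # index of some word, e.g. "orange"
O_j = np.zeros((vocab_size, 1))
O_j[j] = 1                                    # one-hot column vector for word j

e_via_matmul = np.dot(E, O_j)                 # shape (embed_dim, 1)
e_via_lookup = E[:, [j]]                      # same column, selected directly

assert np.allclose(e_via_matmul, e_via_lookup)
```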

Learning Word Embeddings: Word2vec & GloVe

The vocabulary of any language is large and cannot be labeled by hand, so we need unsupervised learning techniques that can learn the context of any word on their own. Skip-gram is one such technique, used to find the words most related to a given word.

Learning embeddings can be approached through various algorithms. Initially complex, these algorithms have been simplified over time. The general idea is to predict the probability of a target word given its context, and through this process, learn the embeddings.

We will start with the more complex approaches to develop intuition.

When learning word embeddings, we create an artificial task of estimating $P(\text{target} \mid \text{context})$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.

Neural Language Model

Building a neural language model is a classic approach in NLP to generate word embeddings. Here, the model is trained to predict the probability of a target word given a sequence of context words. For example, in the table below, each word is shown with its index in the vocabulary. This is a reasonable way of arriving at word embeddings.

| I | want | a | glass | of | orange | ? |
|------|------|---|------|------|------|---|
| 4343 | 9655 | 1 | 3852 | 6163 | 6257 |   |

Model Architecture

We use a neural network to learn the language model. The network typically has an embedding layer, one or more hidden layers, and a softmax output layer:

  • Embedding Layer: The first layer of the model, where each word in the input sequence is transformed into a dense vector representation, denoted $e_j$. The embedding for each word is retrieved by multiplying the embedding matrix $E$ with the word's one-hot encoded vector $O_j$, i.e. np.dot(E, O_j).

    • Each of the $e_j$'s is a 300-dimensional embedding vector.
    • These six 300-dimensional vectors (for a six-word context) are stacked and fed into the hidden layer.
  • Hidden Layers: The dense vectors are then fed into one or more hidden layers of the network, which learn to capture the relationships and structure within the input sequence.

  • Softmax Layer: The final layer is a softmax classification layer that outputs a probability distribution over the entire vocabulary for the next word.

Parameters and Optimization

The neural language model has several key parameters:

  • Embedding Matrix ($E$): A matrix where each column corresponds to a word in the vocabulary and contains its embedding vector.

  • Weights ($W^{[1]}, W^{[2]}$): The weights of the hidden layer and the softmax layer, respectively.

  • Biases ($b^{[1]}, b^{[2]}$): The biases associated with the hidden layer and the softmax layer.

The model uses backpropagation to optimize these parameters, aiming to maximize the likelihood of the correct target word given the context. This training process inherently leads to the learning of word embeddings within the embedding matrix EE.

Note that we also, inherently, had:

  • An 1800-dimensional input:
    • Using 300-dimensional word embeddings with a window of 6 previous words gives an input of dimension $(300 \cdot 6, 1) = (1800, 1)$ after stacking the vectors.

Although the primary goal of the neural language model is to predict the next word, the embeddings learned through this process are valuable. These embeddings capture semantic and syntactic properties of words and can be utilized for various other NLP tasks.
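A compact PyTorch sketch of this architecture, with layer sizes chosen as assumptions for illustration (the exact hyperparameters are not specified in the text above):

```python
import torch
import torch.nn as nn

class NeuralLanguageModel(nn.Module):
    """Predicts the next word from a fixed window of previous words
    (a sketch of the architecture described above, not a reference implementation)."""
    def __init__(self, vocab_size=10000, embed_dim=300, context_size=6, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)           # plays the role of E
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)  # W[1], b[1]
        self.output = nn.Linear(hidden_dim, vocab_size)                # W[2], b[2] (softmax layer)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        e = self.embedding(context_ids)        # (batch, context_size, embed_dim)
        e = e.flatten(start_dim=1)             # stack into (batch, context_size * embed_dim)
        h = torch.relu(self.hidden(e))
        return self.output(h)                  # logits over the whole vocabulary

model = NeuralLanguageModel()
loss_fn = nn.CrossEntropyLoss()                # softmax + negative log-likelihood
context = torch.randint(0, 10000, (32, 6))     # a batch of 6-word contexts (random indices)
target = torch.randint(0, 10000, (32,))        # the word that actually came next
loss = loss_fn(model(context), target)
loss.backward()                                # gradients flow into E, W[1], b[1], W[2], b[2]
```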

Context Types for Embeddings

The type of context used can influence the embeddings learned. We are free to choose here because our goal isn't just to build a language model:

  • Fixed Window: Using the last $n$ words as context (e.g., the last 4 words).
    • Having a fixed-size history enables us to deal with arbitrarily long sentences.
    • e.g. take "a glass of orange" and try to predict the next word from it.
  • Variable Window: Taking words from both sides of the target word within a specified window.
    • e.g. 4 words on the left and 4 words on the right.
    • "a glass of orange" and "to go along with"
  • Single Word: Using just the immediate previous word or a nearby word as context.
    • e.g. use just the nearby word "glass" to predict "juice".
    • This is the idea behind the skip-gram model.
    • The idea is much simpler and works remarkably well.

Researchers found that if you really want to build a language model, it's natural to use the last few words as the context. But if your main goal is to learn word embeddings, then you can use any of these other contexts, and they will also result in very meaningful word embeddings.

To summarize, the language modeling problem poses a machine learning problem where you input the context (like the last four words) and predict some target words. And posing that problem allows you to learn good word embeddings.

Skip-Grams

  • Concept: In the Skip-Grams model, the aim is to predict target words given a context word within a specific window size in a sentence. This is a type of supervised learning task.
    • We will choose context and target pairs to be our supervised learning problem.
    • Rather than the context being the last $n$ words, randomly pick a word to be the context word, then randomly pick another word within some window around it to be the target word.
  • Example: Using the sentence "I want a glass of orange juice to go along with my cereal", you might select "orange" as the context word and then select "juice", "glass", and "my" as target words within a specified window (e.g., -10 to +10 words around the context word).
| Context | Target | How far |
|---------|--------|---------|
| orange  | juice  | 1       |
| orange  | glass  | -2      |
| orange  | my     | 6       |
  • Goal: The main objective is not just to perform well on this prediction task, but to use this framework to learn word embeddings: dense vector representations of words that capture semantic meanings and relationships. (A sketch of how such pairs can be sampled follows below.)
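A short sketch of how such context-target pairs could be sampled from a sentence (the window size and sentence are illustrative):

```python
import random

sentence = "I want a glass of orange juice to go along with my cereal".split()
window = 10   # look up to 10 words before or after the context word

def sample_pair(tokens, window):
    c = random.randrange(len(tokens))                        # pick a context position
    lo, hi = max(0, c - window), min(len(tokens) - 1, c + window)
    t = c
    while t == c:                                            # pick a different nearby position
        t = random.randint(lo, hi)
    return tokens[c], tokens[t]

pairs = [sample_pair(sentence, window) for _ in range(5)]
# e.g. [('orange', 'juice'), ('orange', 'glass'), ('glass', 'my'), ...]
```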

Word2Vec

Word2Vec is a simpler and computationally efficient model for learning word embeddings. It includes two main architectures: the skip-gram model (as described above) and Continuous Bag of Words (CBOW).

Say we continue to use a vocabulary of 10,000 words.

  • Input: The context word $c$.
  • Output: The target word $t$.
  • Supervised Problem: We want to learn a mapping from some context $c$ to some target $t$.
  • Embedding Retrieval: For a context word $c$, its embedding $e_c$ is obtained from the embedding matrix $E$, i.e. $e_c = E \cdot O_c$.
  • Probability Prediction: The embedding $e_c$ is then fed into a softmax unit to compute the probability $P(t \mid c)$ of the target word given the context. This is $\hat{y}$.
  • Loss Function: The model uses a cross-entropy loss function to learn the embeddings.

Overall:

$$p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000} e^{\theta_j^T e_c}}$$

Where $\theta_{t}$ is the parameter vector associated with output $t$. This essentially asks: what is the probability that word $t$ is the label, given the context?

Suppose you have a 10,000-word vocabulary and are learning 500-dimensional word embeddings. Then $\theta_{t}$ and $e_c$ are both 500-dimensional vectors, and both are trained with an optimization algorithm such as Adam or gradient descent.

Here we are summing over 10,000 terms, one per word in the vocabulary. If the vocabulary is larger, say 1 million words, this computation becomes very slow.

Finally, the loss function for the softmax is the usual negative log-likelihood:

$$\mathcal{L}(y,\hat{y}) = -\sum_{i=1}^{10,000} y_i \log \hat{y}_i$$

We have parameters for the softmax unit ($\theta_{t}$) and the matrix $E$. Optimizing all of these parameters gives a pretty good model.
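In NumPy, the forward pass and loss for a single (context, target) pair look roughly like this (toy initialization; the word indices are illustrative):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300
E = 0.01 * np.random.randn(embed_dim, vocab_size)       # embedding matrix
theta = 0.01 * np.random.randn(vocab_size, embed_dim)   # softmax parameters, one row per word

c, t = 6257, 4834            # indices of the context and target words (illustrative)

e_c = E[:, c]                                # embedding of the context word
logits = theta @ e_c                         # theta_j^T e_c for every word j
logits -= logits.max()                       # numerical stability
p = np.exp(logits) / np.exp(logits).sum()    # p(t|c) over all 10,000 words

loss = -np.log(p[t])                         # cross-entropy / negative log-likelihood
```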

Issues and Solutions

  • Computational Challenge: The softmax calculation in Word2Vec can be computationally expensive because the denominator sums over the entire vocabulary (e.g., 10,000 words or more).
  • Hierarchical Softmax Classifier: To address this, a hierarchical softmax classifier can be used. It works like a tree of binary classifiers, so the cost scales with roughly $\log(\text{vocab size})$ rather than the vocabulary size itself, making it faster with large vocabularies.
    • In practice, the hierarchical softmax classifier doesn't use a balanced tree like the one drawn.
    • The tree can be built so that common words (like "the" and "of") are near the top and less common words are deeper, so only a few traversals are needed to reach common words.
    • There are heuristics available for constructing such a tree.

Hierarchical Softmax

Word2Vec Variants

The word2vec paper includes two approaches to learning word embeddings:

  • Skip-Gram Model: Focuses on predicting the surrounding words given a specific word. It is effective with small datasets and represents rare words well.
  • CBOW (Continuous Bag of Words): Predicts a word based on its context. It is faster and more suitable for larger datasets, but less effective with rare words.

Sampling Context

How do we sample the context $c$?

  • Uniform Sampling Issue: Uniform sampling of context words can lead to overrepresentation of frequent words (like "the", "of", "and").
  • Balanced Approach: In practice, a balanced sampling approach is used to mitigate this issue, ensuring a good mix of common and rare words in the training process.

The Word2Vec model, especially its skip-gram variant, has been a fundamental development in NLP, enabling machines to understand and process human language in a more nuanced and effective way. It paved the way for more advanced models and applications in the field.

Negative Sampling

Negative sampling is a technique that makes learning word embeddings more computationally efficient, which is particularly beneficial when dealing with large datasets and allows us to learn much better word embeddings. It achieves something similar to the skip-gram model, but with a much more efficient learning algorithm. It modifies the training objective to predict whether a given pair of words (context, target) is likely to appear together (positive example) or not (negative example), i.e. given a pair of words like orange and juice, predict: is this a context-target pair?

"I want a glass of orange juice to go along with my cereal"

The sampling will look like this:

| Context (c) | Word (t) | Target (y) |
|-------------|----------|------------|
| orange      | juice    | 1          |
| orange      | king     | 0          |
| orange      | book     | 0          |
| orange      | the      | 0          |
| orange      | of       | 0          |
  • Positive Examples: Generated using the Skip-Gram technique within a fixed window in a sentence.
  • Negative Examples: Randomly chosen words from the vocabulary are assumed not to be related to the context word (labeled as 0).
    • Notice that we got the word "of" as a negative example even though it appears in the same sentence.

Overall:

  1. Pick a positive context-target pair (as in the skip-gram model).
  2. Pick $k$ negative words from the dictionary to pair with the same context.

$k$ is recommended to be 5 to 20 for small datasets and 2 to 5 for larger ones. So we will have a ratio of $k$ negative examples to 1 positive example in the data we are collecting.

A simple logistic regression model is used to learn the probability $p(y = 1 \mid c, t)$ that a context word $c$ and a target word $t$ are a valid context-target pair.

$$p(y = 1 \mid c, t) = \sigma(\theta_{t}^{T} e_{c})$$

The logistic regression model can be drawn like this:

Logistic Regression

This is essentially 10,000 binary classification problems, but we only train $k+1$ of these classifiers in each iteration.
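A sketch of the per-example objective: one sigmoid per (context, word) pair, trained on the positive pair plus the $k$ sampled negatives (shapes follow the earlier notation; this is illustrative, not a reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(E, theta, c, pos_t, neg_ts):
    """Binary cross-entropy over 1 positive and k negative (c, t) pairs.
    E: (embed_dim, vocab) embedding matrix; theta: (vocab, embed_dim) output weights."""
    e_c = E[:, c]
    loss = -np.log(sigmoid(theta[pos_t] @ e_c))          # positive pair, label y = 1
    for t in neg_ts:
        loss += -np.log(1.0 - sigmoid(theta[t] @ e_c))   # negative pairs, label y = 0
    return loss
```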

How do we select negative samples? We could sample according to the empirical word frequencies in the corpus, i.e. according to how often different words appear, but then very frequent words like "the", "of", and "and" would be heavily over-represented. The other extreme is to sample the negative examples uniformly at random (probability 1 over the vocabulary size), but that is also very unrepresentative of the distribution of English words.

The best way, according to the authors, is to sample somewhere in between the two extremes, using this equation:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{10,000} f(w_j)^{3/4}}$$

Note that this is an empirical formula, and is not theoretically justified per se.
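Given unigram counts, this heuristic distribution is straightforward to implement; the counts below are made up for illustration:

```python
import numpy as np

# Toy unigram counts; in practice these come from the training corpus.
counts = {"the": 50000, "of": 30000, "orange": 120, "juice": 90, "durian": 3}
words = list(counts)
f = np.array([counts[w] for w in words], dtype=np.float64)

p = f ** 0.75
p /= p.sum()                     # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

k = 5
negatives = np.random.choice(words, size=k, p=p)   # k negative samples for one positive pair
```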

GloVe word vectors

GloVe (Global Vectors for word representation) is another method for learning word embeddings, admired for its simplicity. It is not used as often as the Word2Vec/skip-gram models, but it has its enthusiasts because of that simplicity.

Let's use our previous example:

"I want a glass of orange juice to go along with my cereal".

The GloVe model uses a count matrix $X$, where $X_{ct}$ is the number of times a target word $t$ appears in the context of word $c$ within a specified window; i.e. $X_{\text{orange},\,\text{juice}}$ is the number of times the word juice appears in the context of the word orange.

GloVe seeks to relate the word vectors for context and target words so that their dot product approximates the logarithm of their co-occurrence count, adjusted by a weighting function $f(x)$ to handle different word frequencies appropriately. The model is defined as:

$$\text{Minimize} \sum_{i=1}^{10,000} \sum_{j=1}^{10,000} f(X_{ij}) \left(\theta_i^T e_j + b_i + b_j - \log X_{ij}\right)^2$$

Where:

  • $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$.
  • $\theta_i$ is the word vector for word $i$ (the context word).
  • $e_j$ is the word vector for word $j$ (the target word).
  • $b_i$ and $b_j$ are the bias terms for the context word and the target word, respectively.
  • $f(X_{ij})$ is a weighting function that assigns lower weights to very frequent words like "the", "of", and "and".

The weighting factor $f(X_{ij})$ gives a meaningful amount of weight even to less frequent words like durian, while giving more weight, but not an unduly large amount, to very frequent words like this, is, of, and a. It also solves the $\log(0)$ problem that would arise when a given target and context never co-occur, since by convention $f(0) = 0$ and those terms drop out of the sum.

$\theta$ and $e$ are symmetric, which helps in getting the final word embedding. Because $\theta$ and $e$ play symmetric roles in this particular formulation, unlike in the earlier models, you can take the average $\frac{e_w + \theta_w}{2}$ as the final vector for each word $w$.
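A naive (unvectorized) sketch of this objective; the weighting function shown is the one proposed in the GloVe paper, and the array shapes are assumptions for illustration:

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    # Weighting from the GloVe paper: caps the influence of very frequent pairs
    # and gives zero weight when X_ij = 0 (avoiding log(0)).
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, theta, e, b_theta, b_e):
    """X: (V, V) co-occurrence counts; theta, e: (V, d) word vectors;
    b_theta, b_e: (V,) bias terms."""
    V = X.shape[0]
    loss = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                diff = theta[i] @ e[j] + b_theta[i] + b_e[j] - np.log(X[i, j])
                loss += f_weight(X[i, j]) * diff ** 2
    return loss
```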

Conclusions on Word Embeddings

  • Interpretability: Word embeddings do not guarantee interpretable features; the axes of the embedding space may not align with human-understandable concepts.
  • Pre-trained Models: It is often practical to use pre-trained word embeddings due to the high computational cost of training them from scratch.
  • Implementation: If the data is sufficient, one can attempt to implement and train embeddings using one of the described algorithms.

Applications using Word Embeddings

Sentiment Classification

Word embeddings are utilized in sentiment analysis to determine whether a piece of text has a positive or negative sentiment. They are particularly useful when there is a lack of large labeled datasets. For example:

| Text | Rating |
|------|--------|
| The dessert is excellent. | ⭐⭐⭐⭐⭐ |
| Service was quite slow. | ⭐⭐ |
| Good for a quick meal, but nothing special. | ⭐⭐⭐ |
| Completely lacking in good taste, good service, and good ambience. | |

A simple sentiment classification model could look like this:

It involves looking up the embedding for each word, summing or averaging them, and then passing the result to a softmax classifier. That makes the classifier work for both short and long sentences.
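A minimal sketch of this averaging model, assuming `embeddings` maps words to 300-dimensional vectors and that `W` (shape (5, 300)) and `b` (shape (5,)) are already-trained softmax parameters for the five star ratings:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def predict_rating(review, embeddings, W, b):
    """Average the word embeddings, then apply a softmax over the 5 ratings.
    Works for short or long reviews, but ignores word order."""
    vectors = [embeddings[w] for w in review.lower().split() if w in embeddings]
    avg = np.mean(vectors, axis=0)          # (300,)
    return softmax(W @ avg + b)             # probabilities over 1-5 stars
```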

The useful thing here is that if the embedding was trained on a very large dataset, say a hundred billion words, it lets you transfer a lot of knowledge even from infrequent words, including words that weren't in your labeled training set.

One of the problems with this simple model is that it ignores word order. For example, "Completely lacking in good taste, good service, and good ambience" contains the word good three times, yet it is a negative review.

A more advanced model uses Recurrent Neural Networks (RNNs), which consider the sequence of words, allowing for more accurate sentiment predictions, especially in complex sentences.

If you train this algorithm, you end up with a pretty decent sentiment classifier that also generalizes better to words outside your labeled dataset. For example, given "Completely absent of good taste, good service, and good ambience", even if the word "absent" never appears in your labeled training set, as long as it appeared in the 1-billion or 100-billion-word corpus used to train the word embeddings, the classifier can still get this right.

Debiasing Word Embeddings

Machine learning and AI algorithms are increasingly trusted to help with, or to make, extremely important decisions. At the same time, word embeddings can inadvertently capture and perpetuate societal biases, such as gender or racial bias, present in their training data.

Example results from analogy queries on trained word embeddings:

  • Man : Computer Programmer as Woman : Homemaker
  • Father : Doctor as Mother : Nurse

Word embeddings can reflect the gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model. Since learning algorithms are increasingly making important decisions, they mustn't be biased. Andrew thinks we actually have better ideas for quickly reducing bias in AI than for quickly reducing bias in the human race, although a lot of work remains to be done.

The following steps for addressing bias in word embeddings are taken from this paper (Bolukbasi et al., 2016). The proposed solution involves identifying the bias direction and then neutralizing and equalizing embeddings to remove the bias.

  1. Identify Bias Direction: Compute and average the differences between pairs of words that should be equivalent except for the bias aspect (e.g., "he" - "she", "male" - "female").
    • More commonly, PCA is used.
    • This identifies the bias direction, a 1-dimensional subspace, and the non-bias directions, the remaining 299 dimensions.

  2. Neutralize: For non-definitional terms, remove the component along the bias direction.

    • Babysitter and doctor need to be gender-neutral, so we project them onto the non-bias axis, removing their component along the bias direction.
    • After that they are equal in terms of gender.
    • To decide which words to neutralize, the authors of the paper trained a classifier.

  3. Equalize Pairs: For definitional pairs (e.g., "grandfather" - "grandmother"), ensure they differ only with respect to the bias direction and are equidistant from the neutralized words.
    • We want to do this because otherwise the distance between grandfather and babysitter would be bigger than the distance between grandmother and babysitter.
    • The number of words needing this step is relatively small.
    • Which words to neutralize? The authors trained a classifier to figure out which words are definitional, i.e. which words should be gender-specific and which should not. It turns out that most words in the English language are not definitional, meaning gender is not part of their definition. (A sketch of the neutralize and equalize steps follows below.)
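A minimal sketch of the neutralize step, plus a simplified equalize step, expressed as vector projections; `g` is assumed to be the learned 300-dimensional bias direction and the embeddings roughly unit length (this follows standard projection formulas, not the paper's exact implementation):

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g
    (the projection idea behind the 'neutralize' step)."""
    e_bias = (e @ g) / (g @ g) * g      # projection of e onto g
    return e - e_bias                   # result is orthogonal to the bias direction

def equalize(e_w1, e_w2, g):
    """Make a definitional pair (e.g. grandmother/grandfather) differ only along g
    and sit symmetrically around the non-bias subspace. Simplified sketch."""
    mu = (e_w1 + e_w2) / 2
    mu_orth = mu - (mu @ g) / (g @ g) * g            # shared, bias-free component
    scale = np.sqrt(abs(1 - mu_orth @ mu_orth))      # keeps the results roughly unit length

    def corrected_bias_part(e):
        d = (e @ g) / (g @ g) * g - (mu @ g) / (g @ g) * g
        return scale * d / np.linalg.norm(d)

    return mu_orth + corrected_bias_part(e_w1), mu_orth + corrected_bias_part(e_w2)
```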

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

