Learning Word Embeddings with AWS BlazingText
What is Word Embedding?
Machine learning algorithms deal with numbers and cannot consume text directly. Traditionally, one way to consume text data was to use TF-IDF or a count vectorizer to convert text into numbers that ML algorithms can consume. With recent developments in NLP, there is a newer way to represent text based on language semantics.
For example: The ____ sat on the mat. If we were asked to fill in the blank with one of the options (Cat, Air, Cloud, Donkey), most of us would pick Cat.
"You shall know a word by the company it keeps" (J. R. Firth)
The meaning of a word can be learned from the context in which it appears. This idea was introduced in the paper https://arxiv.org/pdf/1301.3781.pdf (Mikolov et al.).
Word2Vec supports two architectures, SkipGram and CBOW, both of which are explained clearly in the paper mentioned above. I also found another blog that gives a clear explanation of <Link href="https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html" color="blue.500" fontWeight="bold"> how to learn word embeddings </Link>
tl;dr
- CBOW: Predicts the word given a context
- SkipGram: Predicts the context given a word
The Word2Vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a word embedding. In the embedding space, semantically similar words correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words.
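To make the skip-gram idea concrete, here is a minimal sketch (not BlazingText code, just an illustration) that enumerates the (center, context) training pairs for a toy sentence, assuming a context window of 2. CBOW simply flips the direction: the context words predict the center word.

import itertools

# Illustrative only: the (center, context) pairs skip-gram would train on
sentence = "the cat sat on the mat".split()
window = 2  # assumed context window size for this toy example

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]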
There has been significant development in word embeddings since then, and there are many other word embedding algorithms, such as:
- doc2vec (Le and Mikolov (2014))
- GloVe (Pennington, Socher, and Manning (2014))
- fastText (Bojanowski, Grave, Joulin, and Mikolov (2016))
- Latent semantic analysis (Deerwester, Dumais, Furnas, Harshman, Landauer, Lochbaum, and Streeter (1988))
- BERT (Devlin, Chang, Lee, and Toutanova (2018))
If you want to understand a bit more about other text transformation techniques, take a look at <Link href="https://shrikar.com/blog/word-embeddings-word2vec-and-latent-semantic-analysis" color="blue.500" fontWeight="bold">Word2Vec and LSA embeddings</Link>
AWS SageMaker BlazingText Algorithm
In this blog post, we will talk about AWS SageMaker BlazingText, an improved version of the Word2Vec algorithm that also supports generating embeddings for subwords.
Generating word representations by assigning a distinct vector to each word has certain limitations. It cannot deal with out-of-vocabulary (OOV) words, that is, words that have not been seen during training. Typically, such words are set to the unknown (UNK) token and are assigned the same vector, which is an ineffective choice if the number of OOV words is large.
The BlazingText algorithm also adds a feature for generating vectors for character n-grams, based on <Link href="https://arxiv.org/pdf/1607.04606.pdf" color="blue.500" fontWeight="bold"> Enriching Word Vectors with Subword Information</Link>
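To see what "subword information" means in practice, here is a small illustrative sketch that extracts character n-grams from a word in the spirit of the paper above (the word is wrapped in < and > boundary markers). This is only an approximation of what BlazingText does internally; the min_char/max_char values mirror the hyperparameters we set later.

def char_ngrams(word, min_char=2, max_char=6):
    """Illustrative only: character n-grams in the spirit of the subword paper."""
    wrapped = "<" + word + ">"   # boundary markers, as described in the paper
    grams = set()
    for n in range(min_char, max_char + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

print(sorted(char_ngrams("seo", 2, 3)))  # ['<s', '<se', 'eo', 'eo>', 'o>', 'se', 'seo']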
Applications of Word Embeddings
- Similar Entities. Entities, in this case, could be Search Queries, SEO Titles, Products, and more.
- Similar Search Queries: Let us say we are building a search engine for an e-commerce store. Having an understanding of similar searches can help guide users through their sessions.
- Retrieval Tasks: Let us say we are building an e-commerce search engine and want to retrieve candidates to re-rank. Embedding-based retrieval is a pretty good baseline.
- Features for downstream models like NER, Text Classification, and Ranking.
BlazingText Example in Python
Here is what cleaned_dataset.csv looks like:
keyword # header
seo analyzer
seo keyword research
...
...
Upload the file with search queries to "s3://bucketname/cleaned_dataset.csv", replacing bucketname with your S3 bucket name.
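If you prefer to do the upload from the notebook itself, a quick way is to use boto3 (assuming the bucket already exists and your role can write to it; bucketname below is just a placeholder):

import boto3

# Assumes the bucket already exists and your role has s3:PutObject permission on it
s3_client = boto3.client("s3")
s3_client.upload_file("cleaned_dataset.csv", "bucketname", "cleaned_dataset.csv")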
Now let's look at the boilerplate code to set up a SageMaker training job.
import sagemaker
from sagemaker import get_execution_role
import boto3
import json
import pandas as pd
import numpy as np
sess = sagemaker.Session()
role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf
bucket = 'bucketname' # Replace with your own bucket name if needed
s3_train_data = "s3://bucketname/cleaned_dataset.csv" # Cleaned CSV in case of query similarity would be all the search queries
s3_output_location = "s3://bucketname/output/"
region_name = boto3.Session().region_name
container = sagemaker.image_uris.retrieve("blazingtext", region_name, "latest")
print("Using SageMaker BlazingText container: {} ({})".format(container, region_name))
AWS provides prebuilt containers for its built-in algorithms, which makes it easy to run and scale a training job.
bt_model = sagemaker.estimator.Estimator(container,
                                         role,
                                         instance_count=1,
                                         instance_type='ml.c4.2xlarge',  # ml.p3.2xlarge is highly recommended for the best speed and cost efficiency
                                         volume_size=100,
                                         max_run=360000,
                                         input_mode='File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)
bt_model.set_hyperparameters(mode="skipgram",
                             epochs=20,
                             min_count=5,
                             sampling_threshold=0.0001,
                             learning_rate=0.05,
                             window_size=3,
                             vector_dim=100,
                             negative_samples=5,
                             subwords=True,  # Enables learning of subword embeddings for OOV word vector generation
                             min_char=2,  # min length of char n-grams
                             max_char=6,  # max length of char n-grams
                             batch_size=11,  # Used only if mode is batch_skipgram (suggested value: 2*window_size + 1)
                             evaluation=True)  # Perform similarity evaluation on the WS-353 dataset at the end of training
train_data = sagemaker.session.TrainingInput(s3_train_data, distribution='FullyReplicated',
                                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data}
bt_model.fit(inputs=data_channels, logs=True)
s3 = boto3.resource('s3')
key = bt_model.model_data[bt_model.model_data.find("/", 5)+1:]
s3.Bucket(bucket).download_file(key, 'model_dim_100.tar.gz')
!tar -xvf model_dim_100.tar.gz
vectors.bin
eval.json
vectors.txt
Let's look at what the above code is doing
- We set instance_count to 1; it can be increased for distributed training. Along with the node count, we provide the instance type to be used for training.
- Note that s3_output_location is where the model artifacts will be stored.
- For this example we are using the skipgram mode.
- To ensure the embeddings are meaningful, we remove tokens that appear fewer than 5 times (min_count=5).
- The learning rate and sampling threshold are hyperparameters that can be tuned.
- 100 is the size of the embedding vectors (vector_dim).
- subwords=True ensures that we also generate embeddings for character n-grams with lengths between min_char and max_char.
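Since we set evaluation=True, the extracted archive also contains eval.json with the WS-353 word-similarity evaluation results. A quick way to peek at it (the exact structure of the file may vary across BlazingText versions):

import json

# Inspect the similarity evaluation written because evaluation=True was set.
# The exact keys in eval.json may differ between algorithm versions.
with open("eval.json") as f:
    print(json.dumps(json.load(f), indent=2))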
At this point you have word/n-gram embeddings that can be used by the downstream tasks mentioned above.
Before we start consuming the vectors, let's normalize them as shown below.
import numpy as np
from sklearn.preprocessing import normalize

first_line = True
word_to_vec = {}
num_points = 106055  # vocabulary size, taken from the first line of vectors.txt
with open("vectors.txt", "r") as f:
    for line_num, line in enumerate(f):
        if first_line:
            # The header line of vectors.txt is "<vocabulary size> <vector dimension>"
            dim = int(line.strip().split()[1])
            word_vecs = np.zeros((num_points, dim), dtype=float)
            first_line = False
            continue
        line = line.strip()
        word = line.split()[0]
        vec = word_vecs[line_num - 1]
        word_to_vec[word] = vec
        try:
            for index, vec_val in enumerate(line.split()[1:]):
                vec[index] = float(vec_val)
        except Exception:
            pass  # skip malformed lines
        # index_to_word.append(word)
        if line_num >= num_points:
            break

# L2-normalize the vectors so that cosine similarity becomes a plain dot product
word_vecs = normalize(word_vecs, copy=False, return_norm=False)
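Because the rows are now unit length, cosine similarity is just a dot product. A quick sanity check, assuming the loading code above has run and both words below actually appear in your vocabulary:

# Sanity check: cosine similarity is a plain dot product on normalized vectors.
# 'seo' and 'keyword' are assumed to be in the vocabulary of your dataset.
if "seo" in word_to_vec and "keyword" in word_to_vec:
    sim = float(np.dot(word_to_vec["seo"], word_to_vec["keyword"]))
    print("cosine(seo, keyword) =", sim)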
Let's take the task of finding similar search queries. You might be wondering how we get a vector representation for multi-token queries. We could try different approaches (average, min, max, etc.), but one common approach in industry is to average the word vectors to get the representation of any entity (a search query, a product name, etc.).
Converting word vectors to entity vector representations
import joblib

df = pd.read_csv('cleaned_dataset.csv')
search_terms = df.keyword.unique()
ids = range(0, len(search_terms))
search_vec = [np.mean([word_to_vec[word] if word in word_to_vec else np.zeros(100) \
                       for word in str(x).split()], axis=0) for x in search_terms]
data_dict = {'search_terms': search_terms, 'search_vectors': search_vec, 'ids': ids}
joblib.dump(data_dict, 'search_terms_vectors_dim_100.npy')
At this point we have a vector representation for each search query.
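When a downstream job needs these vectors, the saved dictionary can be loaded back with joblib (the filename simply matches the one used in the dump above):

import joblib

# Reload the saved search-term vectors in a downstream job
data_dict = joblib.load("search_terms_vectors_dim_100.npy")
search_terms = data_dict["search_terms"]
search_vec = data_dict["search_vectors"]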
Approximate Nearest Neighbor Search using Embeddings
For ANN (approximate nearest neighbor) search we have many libraries, such as faiss, annoy, and scann. In this blog we will use annoy, as it's the simplest to use and something I have played around with before. In subsequent blog posts we will look into faiss/scann.
Let's start with building an annoy index.
from annoy import AnnoyIndex

f = 100
t = AnnoyIndex(f, 'euclidean')  # Length of item vector that will be indexed
for i in range(len(search_vec)):
    v = search_vec[i]
    t.add_item(i, v)
t.build(10)  # 10 trees
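Annoy indexes can also be persisted to disk and memory-mapped later, which is handy if you build the index in a notebook and query it from another process. The filename below is just an example:

# Persist the built index to disk; the filename is arbitrary.
t.save("search_terms_dim_100.ann")

# Later (or in another process), load it back with the same dimension and metric.
u = AnnoyIndex(f, 'euclidean')
u.load("search_terms_dim_100.ann")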
Now that the tree is built, we will look at finding the nearest search queries given an input query.
import random

def similar_search(term=None, n=20):
    terms = []
    if not term:
        # Pick a random query from the dataset if no term is given
        idx = random.randint(0, len(search_terms) - 1)
        term = search_terms[idx]
    else:
        idx = list(search_terms).index(term)
    for id in t.get_nns_by_item(idx, n):
        terms.append(search_terms[id])
    return terms
Let's look at some examples to see how the algorithm works.
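For instance, you could query with one of the terms from the sample dataset (the query below is just an example; the results will depend entirely on your training data):

# Example queries; 'seo analyzer' comes from the sample dataset above.
print(similar_search("seo analyzer", n=10))
print(similar_search())  # a random query from the dataset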
Here are a few other useful posts that might be of interest to you.
- MultiClass Classification using Keras : Python Code for Multiclass classification
- PySpark Real time predictions
- AWS Serverless Lambda: Serverless Framework and AWS Lambda Tutorial
- Data mining Reddit for Travel Recommendations: Reddit Data Mining