Word Embeddings: Word2Vec and Latent Semantic Analysis
In this post, we will look at two different approaches to generating corpus-based semantic embeddings. Corpus-based semantic embeddings exploit statistical properties of the text to embed words in a vector space. We will be using Gensim, which provides algorithms for both LSA and Word2vec.
Basic difference: Word2vec is a prediction-based model, i.e. given a word it predicts the vectors of its context words (Skip-gram). LSA/LSI is a count-based model: it starts from a term-document count matrix in which similar terms have similar count patterns across documents, and the dimensions of this matrix are then reduced using SVD. For both models, similarity can be computed with cosine similarity, as in the sketch below.
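For example, cosine similarity between two embedding vectors takes only a couple of lines of NumPy. This is a minimal sketch; the vectors here are made-up placeholders, not real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings", purely to illustrate the computation
print(cosine_similarity(np.array([0.2, 0.1, 0.7]), np.array([0.25, 0.05, 0.6])))
```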
Is Word2vec really better? The Word2vec algorithm has been shown to capture similarity well, and prediction-based models are generally believed to capture similarity better than count-based ones. Should we always use Word2vec, then? It depends. LSA/LSI tends to perform better when your training data is small, while Word2vec, being a prediction-based method, performs really well when you have a lot of training data. Because Word2vec has many parameters to train, it produces poor embeddings when the dataset is small.
Latent Semantic Analysis: Latent semantic analysis (or latent semantic indexing) literally means analyzing documents to find their underlying meaning or concepts. In this approach we pass in a set of training documents and define the number of concepts that might exist in them. The output of LSA is essentially a matrix mapping terms to concepts.
We basically start with a word-by-document co-occurrence matrix and normalize the weights of uninformative words (think tf-idf). Finally we apply SVD (singular value decomposition) to this matrix to reduce the number of features from roughly 10,000 down to around 100 to 300, condensing the important information into a small vector space.
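As a quick illustration, here is roughly what that pipeline looks like in Gensim. This is a minimal sketch on a toy corpus; the real code for the recipe dataset follows later in the post:

```python
from gensim import corpora, models

# Toy corpus: each document is a list of tokens
docs = [["red", "wine", "braised", "ribs"],
        ["roast", "chicken", "lemon", "herbs"],
        ["red", "wine", "vegetables"]]

dictionary = corpora.Dictionary(docs)                    # term <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]   # word-by-document counts
tfidf = models.TfidfModel(bow_corpus)                    # down-weight uninformative terms
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)  # SVD step

# Project a new query into the 2-dimensional concept space
query_bow = dictionary.doc2bow(["red", "wine"])
print(lsi[tfidf[query_bow]])
```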
<Link href='https://google.com' color="blue.300" fontWeight="bold" isExternal="true">Information Retrieval Book</Link>
Word2vec
Word2vec consists of two neural network language models, Continuous Bag of Words (CBOW) and Skip-gram. In both models, a window of predefined length is moved along the corpus, and at each step the network is trained on the words inside the window. The CBOW model is trained to predict the word in the center of the window from the surrounding words, whereas the Skip-gram model is trained to predict the context words from the central word. Once the neural network has been trained, the learned linear transformation in the hidden layer is taken as the word representation.
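In Gensim this corresponds to something like the following. It is a minimal sketch on a toy corpus; `sg=1` selects Skip-gram and `sg=0` (the default) selects CBOW, and the parameter is called `size` instead of `vector_size` in Gensim releases before 4.0:

```python
from gensim.models import Word2Vec

sentences = [["red", "wine", "braised", "short", "ribs"],
             ["roast", "chicken", "with", "lemon", "and", "herbs"],
             ["braised", "lamb", "with", "red", "wine"]]

# window controls the context size; min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)
print(model.wv.most_similar("wine", topn=3))   # nearest words by cosine similarity
```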
Let's take the example of identifying similar recipes. You can find the dataset here.
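Before running the script below you will need a few libraries plus the NLTK stopword list. A rough setup sketch (package names assumed, versions not pinned):

```python
# Assumed setup, not from the original post: pip install gensim nltk annoy pandas numpy
import nltk
nltk.download('stopwords')   # the script below uses NLTK's English stopword list
```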
```python
import gensim
from gensim import corpora, models, similarities
import pandas as pd
from nltk.corpus import stopwords
from nltk import FreqDist
import logging
import os
import numpy as np
from gensim.models import Word2Vec
from annoy import AnnoyIndex

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
stopwords = stopwords.words('english')
class Similarity(object):

    def __init__(self, data, num_topics=10):
        self.WORD2VEC_LEN = 300
        self.num_topics = num_topics
        self.data = data
        self.tokenized_data = self._tokens()
        self.freqdist = FreqDist([x for y in self.tokenized_data for x in y])

    def _tokens(self):
        tokens = [[word for word in str(document).lower().split() if word not in stopwords]
                  for document in self.data]
        return tokens

    def filter_tokens(self):
        """
        Filter out tokens which have occurred only once
        """
        return [[tk for tk in entry if self.freqdist[tk] > 1] for entry in self.tokenized_data]

    def build_dictionary(self):
        logging.info("In building dictionary")
        self.dictionary = corpora.Dictionary(self.tokenized_data)
        self.dictionary.save('similarity_dictionary.dict')

    def build_corpus(self):
        self.corpus = [self.dictionary.doc2bow(text) for text in self.tokenized_data]
        corpora.MmCorpus.serialize('similarity.mm', self.corpus)
    def build_lsi(self):
        logging.info("Building lsi model")
        self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary, num_topics=self.num_topics)
        # self.lsi.print_topics(self.num_topics)
        self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
        self.index.save('similarity.index')
        random_samples = np.random.choice(self.data, 10)
        for t in random_samples:
            logging.info("Which of the recipes are more similar to : {}".format(t))
            vec_bow = self.dictionary.doc2bow(t.lower().split())
            vec_lsi = self.lsi[vec_bow]
            sims = self.index[vec_lsi]
            sims = sorted(enumerate(sims), key=lambda item: -item[1])
            tmp = set()
            tmp.add(t)
            for (x, y) in sims:
                if self.data[x] not in tmp:
                    logging.info(self.data[x])
                    tmp.add(self.data[x])
                if len(tmp) > 5:
                    break
            logging.info("*" * 10)
    def get_vector(self, data):
        # Max-pool the word vectors of all in-vocabulary tokens into a single document vector
        data = str(data).lower()
        return np.max([self.w2v_model.wv[x] for x in data.split() if x in self.w2v_model.wv], axis=0)

    def build_word2vec(self):
        # vector_size is called `size` in gensim versions before 4.0
        self.w2v_model = Word2Vec(self.tokenized_data, vector_size=self.WORD2VEC_LEN, window=5, negative=10)
        self.annoy_index = AnnoyIndex(self.WORD2VEC_LEN, 'angular')
        for i, rname in enumerate(self.data):
            try:
                v = self.get_vector(rname.lower())
                self.annoy_index.add_item(i, v)
            except:
                pass
        self.annoy_index.build(50)
        names = np.random.choice(self.data, 10)
        for name in names:
            try:
                logging.info("*" * 50)
                logging.info("Source : {}".format(name))
                v = self.get_vector(name)
                res = self.annoy_index.get_nns_by_vector(v, 5, include_distances=True)
                logging.info(res)
                for i, rec in enumerate([self.data[x] for x in res[0]]):
                    logging.info("Recipe {} : {}, score: {}".format(i + 1, rec, res[1][i]))
            except:
                pass
    def validate(self):
        names = np.random.choice(self.data, 20)
        for name in names:
            try:
                logging.info("-" * 50)
                logging.info("Which of the recipes are more similar to : {}".format(name))
                v = self.get_vector(name)
                res = self.annoy_index.get_nns_by_vector(v, 5, include_distances=True)
                logging.info("************ Word2vec Engine **************")
                for i, rec in enumerate([self.data[x] for x in res[0]]):
                    # logging.info("Recipe {} : {}, score: {}".format(i + 1, rec, res[1][i]))
                    logging.info("Recipe {} : {}".format(i + 1, rec))
                logging.info("************ LSA Engine **************")
                vec_bow = self.dictionary.doc2bow(name.lower().split())
                vec_lsi = self.lsi[vec_bow]
                sims = self.index[vec_lsi]
                sims = sorted(enumerate(sims), key=lambda item: -item[1])
                tmp = set()
                tmp.add(name)
                for (x, y) in sims:
                    if self.data[x] not in tmp and len(tmp) <= 5:
                        logging.info("Recipe {} : {}".format(len(tmp), self.data[x]))
                        tmp.add(self.data[x])
            except:
                pass
    def build_lda(self):
        logging.info("Building lda model")
        logging.info("*" * 50)
        lda = models.LdaModel(self.corpus_tfidf, id2word=self.dictionary, num_topics=self.num_topics)
        lda.print_topics(self.num_topics)
        logging.info("*" * 50)

    def build(self):
        logging.info("In build")
        self.build_dictionary()
        self.build_corpus()
        self.tfidf = models.TfidfModel(self.corpus)
        self.corpus_tfidf = self.tfidf[self.corpus]
        self.build_lsi()
        self.build_word2vec()
        # self.build_lda()
        self.validate()


if __name__ == "__main__":
    df = pd.read_csv('epi_r.csv')
    sim = Similarity(df.title.values, num_topics=100)
    sim.build()
```
And here are the results for both methods. Let me know which one you think is doing a better job.
Which of the recipes are more similar to: Red Wine-Braised Short Ribs with Vegetables

**Word2vec Engine**

- Recipe 1 : Wine-Braised Red Cabbage
- Recipe 2 : Calamari with Roasted Tomato Sauce
- Recipe 3 : Black Bean, Jícama, and Grilled Corn Salad
- Recipe 4 : Grilled Corn on the Cob with Garlic Butter, Fresh Lime, and Queso Fresco
- Recipe 5 : Green Goddess Spinach Dip

**LSA Engine**

- Recipe 1 : Red Wine Brasato with Glazed Root Vegetables
- Recipe 2 : Braised Short Ribs with Red Wine and Pureed Vegetables
- Recipe 3 : Oxtail Soup with Red Wine and Root Vegetables
- Recipe 4 : Red Wine–Braised Short Ribs
- Recipe 5 : Red Snapper a la Nicoise

Which of the recipes are more similar to: Pan-Seared Salmon Over Red Cabbage and Onions with Merlot Gastrique

**Word2vec Engine**

- Recipe 1 : Oxtail Soup with Red Wine and Root Vegetables
- Recipe 2 : Celery Root and Potato Puree with Roasted Jerusalem Artichoke "Croutons"
- Recipe 3 : Green Goddess Spinach Dip
- Recipe 4 : Grilled Tuna with Provençal Vegetables and Easy Aioli
- Recipe 5 : Slow-Braised Lamb Shanks with Guajillo-Pineapple Sauce, Roasted Vegetables, and Coconut Tamales

**LSA Engine**

- Recipe 1 : Red Cabbage and Onions
- Recipe 2 : Red Cabbage with Raspberries, Onions and Apples
- Recipe 3 : Pickled Red Onions
- Recipe 4 : Pickled Red Onions with Cilantro
- Recipe 5 : Lime-Pickled Red Onions

Which of the recipes are more similar to: Julia's Roast Chicken with Lemon and Herbs

**Word2vec Engine**

- Recipe 1 : Crispy Roast Duck with Blackberry Sauce
- Recipe 2 : Roast Cod with Potatoes, Onions, and Olives
- Recipe 3 : Lemon Garlic Mayonnaise
- Recipe 4 : Grilled Spiced Chicken Breasts
- Recipe 5 : Grilled Lobster with Ginger, Garlic, and Soy Sauce

**LSA Engine**

- Recipe 1 : Roast Chicken with Lemon and Thyme
- Recipe 2 : Roast Chicken Legs with Lemon and Thyme
- Recipe 3 : Tarragon and Lemon Roast Chicken
- Recipe 4 : Roast Chicken with Lemon and Fresh Herbs
- Recipe 5 : Roast Chicken With Lemon and Butter
Let me know if you have any feedback or want me to write about any other topics. Also, if you want to learn more about training word embeddings with BlazingText, take a look at <Link href="https://shrikar.com/blog/aws-blazingtext-word-embeddings" color="blue.500" fontWeight="bold">Learn word embeddings using BlazingText</Link>.