
Word Embeddings: Word2Vec and Latent Semantic Analysis

In this post, we will see two different approaches to generating corpus-based semantic embeddings. Corpus-based semantic embeddings exploit statistical properties of the text to embed words in a vector space. We will be using Gensim, which provides implementations of both LSA and Word2vec.

Basic difference
Word2vec is a prediction-based model: in its skip-gram form, given the vector of a word, it is trained to predict the vectors of the surrounding context words.
LSA/LSI is a count-based model: terms that occur in the same documents with similar counts are treated as similar. The dimensionality of this count matrix is then reduced using SVD.
For both models, the similarity between two vectors can be calculated using cosine similarity.
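
As a minimal sketch of that last point, cosine similarity can be written in a few lines of plain Python. The three vectors below are made-up 3-dimensional embeddings for illustration only; real embeddings from either model would have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example.
braised = [0.9, 0.1, 0.3]
roasted = [0.8, 0.2, 0.4]
pickled = [0.1, 0.9, 0.2]

print(cosine_similarity(braised, roasted))  # vectors pointing in a similar direction
print(cosine_similarity(braised, pickled))  # vectors pointing in different directions
```

Because the score depends only on the angle between the vectors, it ignores their lengths, which is convenient when some words or documents are much more frequent than others.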

Is Word2vec really better?
Word2vec has been shown to capture similarity well, and prediction-based models are widely believed to capture similarity better than count-based ones. Should we always use Word2vec?
The answer is that it depends. LSA/LSI tends to perform better when your training data is small. Word2vec, on the other hand, performs really well when you have a lot of training data. Since Word2vec has a lot of parameters to train, it provides poor embeddings when the dataset is small.

Latent Semantic Analysis
Latent semantic analysis (LSA), also called latent semantic indexing (LSI), literally means analyzing documents to find their underlying meaning or concepts. In this approach we pass in a set of training documents and specify the number of concepts that might exist in them. The output of LSA is essentially a matrix of terms to concepts.

We start with a word-by-document co-occurrence matrix and apply a normalization that lowers the weights of uninformative words (think tf-idf). Finally we apply SVD (singular value decomposition) to this matrix to reduce the number of features from ~10,000 to around 100 to 300, which condenses the important information into a small vector space.

Information Retrieval Book
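
The LSA steps described above (count matrix, tf-idf style weighting, truncated SVD) can be sketched on a toy corpus. This is an illustration in plain NumPy, not what Gensim does internally; the four documents, the particular idf formula, and k = 2 are all assumptions made for the example.

```python
import numpy as np

# Toy corpus, invented for illustration.
docs = [
    "roast chicken lemon",
    "roast chicken thyme",
    "pickled red onions",
    "red cabbage onions",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix: rows are terms, columns are documents.
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for w in doc:
        counts[index[w], j] += 1

# One common tf-idf variant: down-weight terms that appear in many documents.
doc_freq = (counts > 0).sum(axis=1)
idf = np.log(len(docs) / doc_freq)
weighted = counts * idf[:, None]

# Truncated SVD: keep only the k largest singular values ("concepts").
k = 2
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
term_vectors = U[:, :k] * s[:k]     # one k-dimensional row per term
doc_vectors = Vt[:k, :].T * s[:k]   # one k-dimensional row per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(term_vectors.shape, doc_vectors.shape)
print(cosine(doc_vectors[0], doc_vectors[1]))  # the two roast chicken documents
print(cosine(doc_vectors[0], doc_vectors[2]))  # roast chicken vs. pickled onions
```

On a real corpus the matrix is large and sparse, so a library such as Gensim computes the decomposition incrementally instead of calling a dense SVD.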

Word2vec

Word2vec consists of two neural network language models, Continuous Bag of Words (CBOW) and Skip-gram. In both models, a window of predefined length is moved along the corpus, and in each step the network is trained with the words inside the window.
Whereas the CBOW model is trained to predict the word in the center of the window based on the surrounding words, the Skip-gram model is trained to predict the contexts based on the central word. Once the neural network has been trained, the learned linear transformation in the hidden layer is taken as the word representation.
Word2vec
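
To make the difference between the two models concrete, here is a small sketch of the training examples each one would see as the window slides over a sentence. The helper names are hypothetical and the fixed window is a simplification; actual Word2vec implementations add details such as subsampling of frequent words that are left out here.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for each position in a token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (input=center, target=context word) pair per context word."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

def cbow_pairs(tokens, window=2):
    """CBOW: one (input=all context words, target=center) pair per position."""
    return [(context, center)
            for center, context in context_windows(tokens, window)
            if context]

tokens = "braised short ribs with vegetables".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces many small examples per position, while CBOW produces one example per position with the whole context as input.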

Let’s take an example of identifying similar recipes. You can find the dataset here: https://www.kaggle.com/hugodarwood/epirecipes

import logging

import numpy as np
import pandas as pd
from annoy import AnnoyIndex
from gensim import corpora, models, similarities
from gensim.models import Word2Vec
from nltk import FreqDist
from nltk.corpus import stopwords as nltk_stopwords


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Requires the NLTK stopword list: run nltk.download('stopwords') once beforehand.
stopwords = set(nltk_stopwords.words('english'))

class Similarity(object):
    def __init__(self, data, num_topics = 10):
        self.WORD2VEC_LEN = 300
        self.num_topics = num_topics
        self.data = data
        self.tokenized_data = self._tokens()
        self.freqdist = FreqDist([x for y in self.tokenized_data for x in y])

    def _tokens(self):
        # Lowercase each title, split on whitespace, and drop English stopwords.
        tokens = [[word for word in str(document).lower().split() if word not in stopwords]
                  for document in self.data]
        return tokens

    def filter_tokens(self):
        """
            Filter out tokens which have occurred only once
        """
        return [[tk for tk in entry if self.freqdist[tk] > 1] for entry in self.tokenized_data]

    def build_dictionary(self):
        logging.info("In building dictionary")
        self.dictionary = corpora.Dictionary(self.tokenized_data)
        self.dictionary.save('similarity_dictionary.dict')

    def build_corpus(self):
        self.corpus = [self.dictionary.doc2bow(text) for text in self.tokenized_data]
        corpora.MmCorpus.serialize('similarity.mm', self.corpus)

    def build_lsi(self):
        logging.info("Building lsi model")
        self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary, num_topics=self.num_topics)
        # self.lsi.print_topics(self.num_topics)
        self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
        self.index.save('similarity.index')
        random_samples = np.random.choice(self.data, 10)
        for t in random_samples:
            logging.info("Which of the recipes are more similar to  : {}".format(t))
            # str() guards against non-string titles (e.g. NaN from pandas)
            vec_bow = self.dictionary.doc2bow(str(t).lower().split())
            vec_lsi = self.lsi[vec_bow]
            sims = self.index[vec_lsi]
            # Rank every recipe by similarity to the query, highest first
            sims = sorted(enumerate(sims), key=lambda item: -item[1])
            seen = {t}  # the query itself plus titles already printed
            for (x, _) in sims:
                if self.data[x] not in seen:
                    logging.info(self.data[x])
                    seen.add(self.data[x])
                if len(seen) > 5:
                    break
            logging.info("*" * 10)

    def get_vector(self, data):
        # Element-wise max over the vectors of all in-vocabulary words.
        # np.max raises ValueError when no word is in the vocabulary.
        data = str(data).lower()
        return np.max([self.w2v_model.wv[x] for x in data.split() if x in self.w2v_model.wv], axis=0)

    def build_word2vec(self):
        # gensim >= 4.0 names this parameter vector_size; older releases call it size.
        self.w2v_model = Word2Vec(self.tokenized_data, vector_size=self.WORD2VEC_LEN, window=5, negative=10)
        # Recent annoy releases expect the metric to be given explicitly.
        self.annoy_index = AnnoyIndex(self.WORD2VEC_LEN, 'angular')
        for i, rname in enumerate(self.data):
            try:
                v = self.get_vector(rname)
                self.annoy_index.add_item(i, v)
            except ValueError:
                # No word of this title is in the Word2vec vocabulary; skip it
                pass
        self.annoy_index.build(50)

        names = np.random.choice(self.data, 10)
        for name in names:
            try:
                logging.info("*" * 50)
                logging.info("Source : {}".format(name))
                v = self.get_vector(name)
                res = self.annoy_index.get_nns_by_vector(v, 5, include_distances=True)
                logging.info(res)
                for i,rec in enumerate([self.data[x] for x in res[0]]):
                    logging.info("Recipe {} : {}, score: {}".format(i+1,rec, res[1][i]))
            except ValueError:
                pass

    def validate(self):
        names = np.random.choice(self.data, 20)
        for name in names:
            try:
                logging.info("-" * 50)
                logging.info("Which of the recipes are more similar to  : {}".format(name))
                v = self.get_vector(name)
                res = self.annoy_index.get_nns_by_vector(v, 5, include_distances=True)
                logging.info("************ Word2vec Engine **************")
                for i,rec in enumerate([self.data[x] for x in res[0]]):
                    # logging.info("Recipe {} : {}, score: {}".format(i+1,rec, res[1][i]))
                    logging.info("Recipe {} : {}".format(i+1,rec))
                logging.info("************ LSA Engine**************")

                vec_bow = self.dictionary.doc2bow(str(name).lower().split())
                vec_lsi = self.lsi[vec_bow]
                sims = self.index[vec_lsi]
                sims = sorted(enumerate(sims), key=lambda item: -item[1])
                seen = {name}
                rank = 0  # separate counter so LSA results are numbered 1 to 5
                for (x, _) in sims:
                    if self.data[x] in seen:
                        continue
                    rank += 1
                    logging.info("Recipe {} : {}".format(rank, self.data[x]))
                    seen.add(self.data[x])
                    if rank == 5:
                        break

            except Exception as exc:
                # Typically a title with no in-vocabulary word; log it and move on
                logging.warning("Skipping {} : {}".format(name, exc))


    def build_lda(self):
        logging.info("Building lda model")
        logging.info("*" * 50)
        lda = models.LdaModel(self.corpus_tfidf, id2word=self.dictionary, num_topics=self.num_topics)
        lda.print_topics(self.num_topics)
        logging.info("*" * 50)


    def build(self):
        logging.info("In build")
        self.build_dictionary()
        self.build_corpus()
        self.tfidf = models.TfidfModel(self.corpus)
        self.corpus_tfidf = self.tfidf[self.corpus]
        self.build_lsi()
        self.build_word2vec()
        # self.build_lda()
        self.validate()


if __name__ == "__main__":
    df = pd.read_csv('epi_r.csv')
    sim = Similarity(df.title.values, num_topics=100)
    sim.build()
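
The get_vector method above turns a whole recipe title into a single vector by taking the element-wise maximum of its word vectors. The sketch below shows that pooling step in isolation, next to the more common mean pooling, so the two can be compared. The 4-dimensional vectors and the title_vector helper are hypothetical stand-ins for a trained model, not part of the code above.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "roast":   np.array([0.2, 0.7, -0.1, 0.0]),
    "chicken": np.array([0.5, 0.1, 0.3, -0.4]),
    "lemon":   np.array([-0.3, 0.4, 0.6, 0.2]),
}

def title_vector(title, vectors, pool="max"):
    """Pool the vectors of known words in a title; return None if none are known."""
    known = [vectors[w] for w in str(title).lower().split() if w in vectors]
    if not known:
        return None
    stacked = np.vstack(known)
    return stacked.max(axis=0) if pool == "max" else stacked.mean(axis=0)

print(title_vector("Roast Chicken with Lemon", toy_vectors, pool="max"))
print(title_vector("Roast Chicken with Lemon", toy_vectors, pool="mean"))
print(title_vector("Tiramisu", toy_vectors))  # no known words
```

Returning None for titles without any known word makes the out-of-vocabulary case explicit instead of relying on an exception. Which pooling works better for recipe titles is an empirical question that this sketch does not answer.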

And here are the results for both methods. Let me know which one you think is doing a better job.

1) Which of the recipes are more similar to : Red Wine-Braised Short Ribs with Vegetables
************ Word2vec Engine **************
Recipe 1 : Wine-Braised Red Cabbage
Recipe 2 : Calamari with Roasted Tomato Sauce
Recipe 3 : Black Bean, Jícama, and Grilled Corn Salad
Recipe 4 : Grilled Corn on the Cob with Garlic Butter, Fresh Lime, and Queso Fresco
Recipe 5 : Green Goddess Spinach Dip
************ LSA Engine**************
Recipe 1 : Red Wine Brasato with Glazed Root Vegetables
Recipe 2 : Braised Short Ribs with Red Wine and Pureed Vegetables
Recipe 3 : Oxtail Soup with Red Wine and Root Vegetables
Recipe 4 : Red Wine–Braised Short Ribs
Recipe 5 : Red Snapper à la Niçoise

2) Which of the recipes are more similar to : Pan-Seared Salmon Over Red Cabbage and Onions with Merlot Gastrique
************ Word2vec Engine **************
Recipe 1 : Oxtail Soup with Red Wine and Root Vegetables
Recipe 2 : Celery Root and Potato Puree with Roasted Jerusalem Artichoke “Croutons”
Recipe 3 : Green Goddess Spinach Dip
Recipe 4 : Grilled Tuna with Provençal Vegetables and Easy Aioli
Recipe 5 : Slow-Braised Lamb Shanks with Guajillo-Pineapple Sauce, Roasted Vegetables, and Coconut Tamales
************ LSA Engine**************
Recipe 1 : Red Cabbage and Onions
Recipe 2 : Red Cabbage with Raspberries, Onions and Apples
Recipe 3 : Pickled Red Onions
Recipe 4 : Pickled Red Onions with Cilantro
Recipe 5 : Lime-Pickled Red Onions

3) Which of the recipes are more similar to : Julia’s Roast Chicken with Lemon and Herbs
************ Word2vec Engine **************
Recipe 1 : Crispy Roast Duck with Blackberry Sauce
Recipe 2 : Roast Cod with Potatoes, Onions, and Olives
Recipe 3 : Lemon Garlic Mayonnaise
Recipe 4 : Grilled Spiced Chicken Breasts
Recipe 5 : Grilled Lobster with Ginger, Garlic, and Soy Sauce
************ LSA Engine**************
Recipe 1 : Roast Chicken with Lemon and Thyme
Recipe 2 : Roast Chicken Legs with Lemon and Thyme
Recipe 3 : Tarragon and Lemon Roast Chicken
Recipe 4 : Roast Chicken with Lemon and Fresh Herbs
Recipe 5 : Roast Chicken With Lemon and Butter

Let me know if you have any feedback or want me to write about any other topics.

About the author

Shrikar

Backend/Infrastructure Engineer by Day. iOS Developer for the rest of the time.
