Reddit is a goldmine of user-generated content, and some subreddits have proven really useful to me. I travel a lot, and a few of them have helped me discover new places that suit my interests. Here are some subreddits I follow: /r/programming, /r/python, /r/machinelearning, /r/Entrepreneur, /r/travel, /r/hiking, etc.
The recommendations I have seen in /r/travel are really good and make it a great platform for discovering new places. However, there are some problems with this data:
- Data is scattered.
- Not all comments are useful and many are quite spammy.
I wanted to see if there is a reasonable way to build a Reddit travel recommendation app. I haven't decided whether it will be a bot or a simple search interface like Google. I will be blogging about the process as I build it.
Exploring /r/travel data using Word2Vec
I won't go into the details of word2vec here, but these are the resources I referred to in order to understand it.
I used PRAW to fetch the data from Reddit and stored the results in MongoDB.
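Roughly, the fetch-and-store step looks like the sketch below. The credentials, the collection name, and the `thread_to_doc` helper are placeholders of my own, not finished code:

```python
def thread_to_doc(submission):
    """Flatten a PRAW submission into a MongoDB document.
    Duck-typed: works with any object exposing these attributes."""
    return {
        "_id": submission.id,
        "title": submission.title,
        "selftext": submission.selftext,
        "comments": [c.body for c in submission.comments],
    }

def fetch_travel_threads(limit=500):
    # Imported lazily: only needed when actually hitting Reddit/MongoDB.
    import praw
    import pymongo

    reddit = praw.Reddit(client_id="...", client_secret="...",
                         user_agent="travel-w2v-explorer")
    coll = pymongo.MongoClient()["reddit"]["travel"]
    for submission in reddit.subreddit("travel").top(limit=limit):
        submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
        coll.replace_one({"_id": submission.id},
                         thread_to_doc(submission), upsert=True)
```

Storing one document per thread (title + selftext + flat comment list) keeps the later tokenization step simple.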
Once we have the data, we can build a word2vec model on all the submissions and their corresponding comments. The model-building part is pretty straightforward:
import gensim

# texts: a list of tokenized sentences built from the submissions and comments
model = gensim.models.Word2Vec(texts, size=100, window=5, min_count=10, negative=10)
In : model.most_similar('yosemite')
Out: [('yellowstone', 0.9365677833557129), ('zion', 0.8894634246826172), ('arches', 0.8825746178627014), ('glacier', 0.8800368309020996), ('sequoia', 0.8710728287696838), ('bryce', 0.8620302677154541), ('tetons', 0.8497499823570251), ('teton', 0.849648118019104), ('np', 0.834464967250824), ('moab', 0.8296624422073364)]

In : model.most_similar('tahoe')
Out: [('crater', 0.9356250762939453), ('gorge', 0.899262547492981), ('louise', 0.8927288055419922), ('shore', 0.8808829188346863), ('titicaca', 0.8807282447814941), ('michigan', 0.8767774701118469), ('redwoods', 0.875566840171814), ('glen', 0.8734359741210938), ('boulder', 0.8698586225509644), ('peninsula', 0.8691876530647278)]

In : model.most_similar('honduras')
Out: [('macedonia', 0.9377617835998535), ('sask', 0.9181386232376099), ('territories', 0.9177947640419006), ('scotia', 0.9174924492835999), ('manitoba', 0.9173691272735596), ('tu', 0.9149880409240723), ('pueblo', 0.9129607677459717), ('serengeti', 0.91243976354599), ('lviv', 0.9115496873855591), ('toured', 0.9108266830444336)]

In : model.most_similar('recommend')
Out: [('suggest', 0.8981289267539978), ('recommended', 0.8258309364318848), ('consider', 0.6939650774002075), ('advise', 0.6786701083183289), ('worth', 0.4964148998260498), ('interested', 0.4776420593261719), ('liked', 0.4723684787750244), ('love', 0.4663149118423462), ('mostar', 0.46572262048721313), ('skip', 0.4583231210708618)]

In : model.most_similar('rainier')
Out: [('glen', 0.9346678853034973), ('whitney', 0.9302600026130676), ('escalante', 0.9264642000198364), ('tuolumne', 0.9260645508766174), ('dam', 0.9213648438453674), ('needles', 0.9160547852516174), ('rainbow', 0.9156197905540466), ('olympic', 0.9152690768241882), ('meadows', 0.9149000644683838), ('verde', 0.9130691885948181)]
- In this initial exploration phase, I didn't do anything special beyond cleaning and normalizing the data.
- It will be interesting to run these comments/posts through NER and build a word2vec model only on the extracted location data.
- Problem: even the best NER models do not recognize these locations reliably.
- Can regex help here?
- Use Snorkel or DeepDive to build a knowledge graph. (This will only be possible if the NER works with high accuracy.)
- Build phrase collocations.
- Any other methods/algorithms I should try? Please comment.
Features I have in mind for the app:
- Given a location, suggest similar places.
- If there are multiple places within a given area, break them down into small itineraries.
- For a given location, summarize a set of things to do or must-see attractions.
Please comment if you have any suggestions or feedback. Also, let me know if there are any other features you would like to see implemented.