var jekyll = Object.create({ mpq : { track : function(eventName, properties, callback) { properties = properties || {}; callback = callback || function(){}; if (!/www.shrikar.com/.test(document.location.hostname)) { mpq.track(eventName, properties, callback); }; } } }); /** * Track Event "Post Some-Post" */ jekyll.mpq.track('EventName', { some : 'data' });

Shrikar Archak

It does not matter how slow you go so long as you do not stop. ~Confucius

Designing API’s for Aggregation of User Events

Requirements: Design a REST API service from the ground up (assume it can publicly be accessed via HTTP). The REST API has two functions:

  1. saveEvent (user_id, key, value, [{subkey,subvalue},{subkey,subvalue},…]) The saveEvent call will accept a user_id, a key, a value and an optional array of subkeys and subvalues.

  2. getEvent (user_id, key || subkey, date_min, date_max) The getEvent call will return the sum of all the values for that user / key cominbation for that date range (and will support both key and subkey).

Solution:

Possible Candidates for DataStore. Given the type of data the most obvious candidates are any generic Key/Value store or a NoSQL database.

1) For our case I would go with MongoDB a NoSQL database which can Scale horizontally. Since we are doing some kind of aggregation with respect to user_id and key/subkeys having a db with rich query interface would make our job easier as compared to Key Value stores.

Given that the system need to scale we can use the Sharding feature of MongoDB so as to distribute the data across different machines and hence load balance the system.

2) Choice of MongoClient : Mongoose. Provides a simple apis for querying db.

3) Express framework for providing the REST type API’s.

4) Depending on the number of Cores we could create those many Node.js process fronted by Nginx server. All the node.js process talk to a single Mongo backend.

5 ) Demo of the app available on Heroku. Using the shared instance of MongoLab for the demo.

Example:

  1. FOR saveEvent: curl http://hollow-meadow-3903.herokuapp.com/saveEvent -XPOST -d “user_id=shrikar&key=shrek&value=100&optional=[{shrek,50},{subkey1,33 } ]” Event Saved.

  2. For getEvent: curl “http://hollow-meadow-3903.herokuapp.com/getEvent?user_id=shrikar&key=shrek&start=2012-04-29T02:50:42.749Z&end=2012-04-29T010:11:16.262Z” {“sum”:150}

Running Variance

Variance : In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value)

Challenge: Calculating the variance of a large set of number is not trivial. It becomes more challenging when we are talking in term of millions of numbers. Using the naive method for calculating variance would not be efficient. There is a better way of computing variance goes back to a 1962 paper by B. P. Welford and is presented in Donald Knuth’s Art of Computer Programming.

John Cook in his blog mentions about this method. http://www.johndcook.com/standard_deviation.html. A java implementation is give below

One of the application of this is in RTB(Real time bidding) world for calculating ECPM .

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
<http://http://www.johndcook.com/standard_deviation.htmluckage com.shrikar.library;


public class ECPMStats {
  private int events;
  double old_mean, new_mean, old_variance, new_variance, sum;

  public ECPMStats(){
      events=0;
      sum = old_mean = new_mean=old_variance=new_variance = 0;
  }

  void add(double ecpm){
      events++;
      sum += ecpm;
      if(events == 1){
          old_mean = new_mean = ecpm;
          old_variance = 0;
      } else {
          new_mean = old_mean + (ecpm - old_mean)/events;
          new_variance = old_variance + (ecpm - old_mean) * (ecpm - new_mean);
          old_mean = new_mean;
          old_variance = new_variance;
      }
  }

  int numEvents(){
      return events;
  }

  double ecpmMean(){
      if(events > 0)
          return new_mean;
      else
          return 0;
  }

  double ecpmVariance(){
      if(events>1){
          return new_variance/(events-1);
      } else {
          return 0;
      }
  }

  double sum(){
      return sum;
  }

}

Personalized Web Index

Problem: While browsing the internet we find many things which are useful. As of today we mainly bookmark and what happens is if we have a lot of bookmarks its hard to find the content we are looking for from the set of bookmarks we have. Also it so happens that not all the content in the link is useful to us.Current tools are lacking this features which can allow a user to look for content/keywords from his browsing history. Existing tool provide search by title or by tags.

Solution: Create a unique personalized web index.

Approach: We should allow users to selectively upload data from the page or upload the whole page. At the server we can index the content per user. With this infrastructure we can provide a search capability which can help them get the exact information from what they have stored. This is currently missing. Also with this approach we are getting curated content from the user which can be used in many ways. Since we are creating the index per user the result wont be affected by what other users store. This is like a pin and search mechanism for content instead of photos like(Pinterest).

Demo : http://hashedout.info

Identifying the Best Point to Meet in a N X N Grid

Problem :

We have a set of m friends who stay at position (x1,y1)(x2,y2)….(xm,ym). Now we need to find a best location (xi,yi) where all friends decide to meet. Assume we have only one person staying at any location (x,y) . We need to minimize the sum of the distance to the location we identified.

This problem can be solved in O(n2) where we calculate the distance between every point xi,yi and accumulate over all the points. And the best solution is where the sum is least. Assume the person can move in all the direction in 1 unit. We can solve the same problem in O(n) by applying Centroid of set of points idea.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
int64_t dist(pair p1, pair p2){
  return (max(abs(p1.first-p2.first), abs(p1.second-p2.second)));
}
void mindist(vector< pair > &pts){
  int64_t min = pow(2,63);
  int64_t sum=0;
  for(int64_t i = 0; i < pts.size(); i++) {
      sum = 0;
      for(int64_t j = 0; j < pts.size(); j++){
          pair x = pts[i];
          pair y = pts[j];
          sum += dist(x,y);
      }
      if(sum < min)
          min = sum;
  }
  cout << min << endl;
}
int main()
{
  int64_t n,x,y ;
  vector< pair > points;
  cin >> n;
  int64_t xcent=0,ycent=0;
  for(int64_t i = 0; i < n ;i++){

      cin >> x;
      cin >> y;
      points.push_back(pair(x,y));
  }

  mindist(points);

}

Configuring Apache for PUT

I tried this on Ubuntu 10.04.

  1. sudo apt-get install apache2 php5-cli php
  2. Add the directive : Script PUT /put.php in /etc/apache2/sites-enabled/000-default
  3. sudo ln -s /etc/apache2/mods-available/actions.load /etc/apache2/mods-enabled/actions.load
  4. sudo ln -s /etc/apache2/mods-available/actions.conf /etc/apache2/mods-enabled/actions.conf
  5. sudo /etc/init.d/apache2 restart

Create the file put.php in the document root which in my case is /var/www

put.php

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
<?php
echo "Creating file : ";
echo basename($_SERVER['REQUEST_URI']);
/* PUT data comes in on the stdin stream */
echo $ARGV[0];
$putdata = fopen("php://input", "r");

/* Open a file for writing */
$fp = fopen(basename($_SERVER['REQUEST_URI']), w);

/* Read the data 1 KB at a time
and write to the file */
while ($data = fread($putdata, 1024))
fwrite($fp, $data);

/* Close the streams */
fclose($fp);
fclose($putdata);

?>

This post doesn’t take into account any of the security issues. Also if you are planning to upload big files then you might need to modify these variable(max_execution_time and max_input_time) in /etc/php5/apache/php.ini.

Evaluating a Recommender

In the previous post we created a recommender. Its time to see how well our recommender performs given a training and a test dataset. I am using the modifed version of grouplen data. Given a dataset we can simulate the training and test dataset by taking some portion of the actual dataset and treating it as the training dataset and the remaining portion as the test dataset. In which case we can train the recommender and evaluate how it performs against the test dataset. Since we know the actual value of the test dataset its easier to see how well we recommended the item by taking the difference. There are two types of evaluator in Mahout

Average Absolute Difference evaluator Root Mean Square Evaluator. With a score of this type, lower is better, because that would mean the estimates differed from the actual value by less.

Applying the above two evaluator to our recommender gave a value of

1
2
3
4
shrikar@shrikar-laptop:~/proj/shri-mahout$ java -cp .:target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar:home/shrikar/proj/mahout-distribution-0.5/utils/target/dependency/slf4j*.jar shri.mahout.Evaluator

Absolute Average Score : 0.9810512490344762
Root Mean Square Score : 1.2754938650252956

In our case the Root Mean Square Score was penalizing the recommendation that were way off.

Experiments With Mahout

I assume you have installed Mahout and compiled it. In this post I will show how we can create our own recommender. Here are the steps.

  1. mvn archetype:create -DgroupId=shri.mahout -DartifactId=shri-mahout
  2. mvn compile
  3. mvn eclipse:eclipse
  4. Now import the project in eclipse . File > Import > Path to [shri-mahout dir created by the first command ]

To resolve the dependencies you will need to add themahout-core-0.5.jar,mahout-utils-0.5.jar,mahout-math-0.5.jar,mahout-core-0.5-job.jar ( This has all the dependencies required by mahout). We can add the dependencies by Right click on the project folder > configure build path > Libraries > Add External jar’s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
package shri.mahout;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderIntro {
  public static void main(String[] args) throws Exception{
      DataModel model = new FileDataModel(new File("intro"));
      UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
      UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
      Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
      for(int i = 0; i < 5; i++) {
          System.out.println("For user : " + i);
          List recommendations = recommender.recommend(i,3);
          for(RecommendedItem recommendation :recommendations) {

              System.out.println(recommendation);
          }
      }
  }
}

How To Run the program from command line.

java -cp .:target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar:home/shrikar/proj/mahout-distribution-0.5/utils/target/dependency/slf4j*.jar shri.mahout.RecommenderIntro

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
For user : 0
For user : 1
RecommendedItem[item:748, value:4.5]
RecommendedItem[item:313, value:4.5]
RecommendedItem[item:300, value:4.0]
For user : 2
RecommendedItem[item:513, value:4.5]
RecommendedItem[item:83, value:4.5]
RecommendedItem[item:603, value:4.5]
For user : 3
RecommendedItem[item:64, value:5.0]
RecommendedItem[item:50, value:5.0]
RecommendedItem[item:1, value:5.0]
For user : 4
RecommendedItem[item:181, value:5.0]
RecommendedItem[item:275, value:4.5]
RecommendedItem[item:25, value:4.5]