Shrikar Archak

It does not matter how slow you go so long as you do not stop. ~Confucius

Kiji on CDH4.2.1

In this post I will be talking about how to make kiji work with CDH4.2.1.

My assumption is that you have installed CDH 4.2.1 and that services such as Hadoop, HBase and ZooKeeper are running. I used Cloudera Manager to install Hadoop and all the necessary components. I find Cloudera Manager a very convenient tool for managing a Hadoop cluster.

First you need to set HADOOP_HOME, HBASE_HOME, HADOOP_CONF_DIR and HBASE_CONF_DIR. For an installation done with Cloudera Manager the path will be something like /opt/cloudera/parcels …

  • Set these variables in ~/.bashrc

    export HADOOP_HOME=/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hadoop
    export HBASE_HOME=/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hbase
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HBASE_CONF_DIR=/etc/hbase/conf
    

  • Download Kiji (see Install Kiji) and extract it with tar xzf kiji-bento-*.tar.gz. (I downloaded it on the HBase server.)

  • cd kiji-bento-albacore
  • pwd
  • Add kiji home folder to the PATH. Add the path you got from pwd to ~/.bashrc
     export PATH=$PATH:/home/shrikar/kiji-bento-albacore/bin
     
  • source ~/.bashrc
  • bin/kiji install
  • Comment out the line which tries to configure the cluster
     #source "${KIJI_HOME}/cluster/bin/bento-env.sh"
     
  • source bin/kiji-env.sh
  • At this point you can continue with the remaining steps from Quick_Start_Guide
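Before running bin/kiji install it can help to confirm that the four variables from the first step are really set in the current shell. Here is a minimal sketch of such a check; the helper name missing_vars is my own and is not part of Kiji or CDH.

```python
import os

# The four variables this post asks you to export in ~/.bashrc.
REQUIRED = ("HADOOP_HOME", "HBASE_HOME", "HADOOP_CONF_DIR", "HBASE_CONF_DIR")


def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("All Hadoop/HBase variables are set.")
```

If anything is reported as not set, re-run source ~/.bashrc and check the export lines.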

FontAwesome With Meteorjs

Meteorjs

Meteorjs is a new JavaScript framework for building realtime applications. More about Meteor can be found here: Meteorjs. One of the cool features of Meteorjs is its package manager. Many open source libraries, like Twitter's Bootstrap, are provided as packages. In our application we will be using Bootstrap. Twitter Bootstrap provides a basic set of icons, but in this example we will use Font Awesome. Font Awesome is an iconic font library designed for Twitter Bootstrap (Font Awesome).

Existing third party meteor packages didn’t work

There are two Meteor packages which can be installed to integrate Font Awesome into a Meteor app, but for some reason neither of them worked for me.

  • bootstrap-fontawesome

/usr/local/lib/node_modules/meteorite/lib/sources/git.js:108
        throw "There was a problem cloning repo: " + self.url;
                                                   ^
There was a problem cloning repo: https://github.com/alexnotov/meteor-bootstrap-and-font-awesome
  • font-awesome

Errors prevented startup:
Exception while bundling application:
Error: The package named font-awesome does not exist.
    at _.extend.init_from_library (/usr/local/meteor/app/lib/packages.js:91:13)
    at Object.module.exports.get (/usr/local/meteor/app/lib/packages.js:225:11)
    at self.api.use (/usr/local/meteor/app/lib/bundler.js:94:28)
    at Array.forEach (native)
    at Function._.each._.forEach (/usr/local/meteor/lib/node_modules/underscore/underscore.js:79:11)
    at Object.self.api.use (/usr/local/meteor/app/lib/bundler.js:93:9)
    at _.extend.init_from_app_dir [as on_use_handler] (/usr/local/meteor/app/lib/packages.js:136:11)
    at _.extend.use (/usr/local/meteor/app/lib/bundler.js:382:11)
    at Object.exports.bundle (/usr/local/meteor/app/lib/bundler.js:707:12)
    at /usr/local/meteor/app/meteor/run.js:613:26
    at exports.inFiber (/usr/local/meteor/app/lib/fiber-helpers.js:22:12)
Your application is crashing. Waiting for file change.

“Necessity is the mother of all inventions.”

Structure of our Meteor Application

The default structure of a newly created Meteor app is different from what we will be using.

Things to be done:

  • meteor create awesomeapp
  • cd awesomeapp
  • meteor add bootstrap
  • mkdir -p public/img
  • mkdir -p css
  • mkdir -p client
  • mkdir -p server
  • mv awesomeapp.css css/
  • Download Font Awesome (Download here).
  • unzip the folder
  • move the font folder to public/
  • move all css files from the unzipped folder's css/ directory to css/
  • discard all other downloaded content (remove it from the root folder)

/RootFolder
     |
     |____ public
     |         |____ font
     |         |____ robots.txt
     |         |____ other static assets
     |____ css
     |      |____ awesomeapp.css
     |      |____ font-awesome.css ( all font awesome css files)
     |____ server
     |        |____ appserver.js ( Loaded only on the server side)
     |____ client
     |        |____ appclient.js ( Loaded only on the client side)
     |_ models.js (Loaded on both client and server)
          

Note: appserver.js, appclient.js and models.js are not created by default. Custom logic which needs to be executed only on the server or only on the client can go into those files.

Modifying the fontawesome.css

Since we have put font in the public directory of the meteor app we need to change the path in font-awesome*.css as below.


@font-face {
  font-family: 'FontAwesome';
  src: url('/font/fontawesome-webfont.eot?v=3.0.1');
  src: url('/font/fontawesome-webfont.eot?#iefix&v=3.0.1') format('embedded-opentype'),
    url('/font/fontawesome-webfont.woff?v=3.0.1') format('woff'),
    url('/font/fontawesome-webfont.ttf?v=3.0.1') format('truetype');
  font-weight: normal;
  font-style: normal;
}

You should now be able to use any of the Font Awesome icons in your app. For integrating them with your app code, see Integration.
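If you prefer not to edit the stylesheet by hand, the path change can be scripted. The sketch below assumes the downloaded CSS refers to the fonts with the relative prefix '../font/' (that is how my copy looked, yours may differ) and rewrites it to the absolute '/font/' path served from public/. The function name is made up for this example.

```python
def absolutize_font_urls(css_text):
    """Point Font Awesome's font URLs at /font/ instead of ../font/.

    Assumes the stylesheet uses url('../font/...'); other quoting styles
    are left untouched.
    """
    return css_text.replace("url('../font/", "url('/font/")


if __name__ == "__main__":
    sample = "src: url('../font/fontawesome-webfont.eot?v=3.0.1');"
    print(absolutize_font_urls(sample))
```

To apply it for real, read css/font-awesome.css, pass the text through the function and write the result back.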

Easier Deployment/automation With Fabric


Fabric is a tool for running commands on remote machines, including sudo commands. Most distributions do not allow sudo commands to run unless a tty is associated with the session. I was initially using Rye for Ruby; it handled most of the work but had problems executing sudo commands, and that is where Fabric shines.

Installation

sudo easy_install fabric

Usecase

Installing dependencies for Riack.


fabfile.py :

from fabric.api import cd, env, run, sudo

# Placeholder credentials; replace them with your own.
env.password = 'password'
env.user = 'username'


def riak_dep():
    """Install git, cmake, protobuf, protobuf-c and riack on the remote host."""
    print("Executing on %s as %s" % (env.host, env.user))
    # sudo() already elevates, so the command needs no extra "sudo" prefix.
    sudo('apt-get install git --yes')
    sudo('apt-get install cmake --yes')

    # Make sure the working directory exists before cd'ing into it.
    run('mkdir -p ~/downloads')

    with cd('~/downloads'):
        run('/usr/bin/wget http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.bz2')
        run('bzip2 -d protobuf-2.4.1.tar.bz2')
        run('tar -xvf protobuf-2.4.1.tar')
        with cd('protobuf-2.4.1'):
            print("Now configuring....")
            run('./configure')
            print("Now making ....")
            run('make')
            print("Running sudo install....")
            sudo('make install')
    print("Completed installing protobuf")

    with cd('~/downloads'):
        run('/usr/bin/wget http://protobuf-c.googlecode.com/files/protobuf-c-0.15.tar.gz')
        run('tar -zxvf protobuf-c-0.15.tar.gz')
        with cd('protobuf-c-0.15'):
            print("Now configuring....")
            run('./configure')
            print("Now making ....")
            # protobuf was installed under /usr/local/lib in the previous step.
            run('export LD_LIBRARY_PATH=/usr/local/lib && make')
            print("Running sudo install....")
            sudo('make install')
    print("Completed installing protobuf-c")

    with cd('~/downloads'):
        print("Cloning the repository..")
        run('git clone https://github.com/trifork/riack.git')
        with cd('riack'):
            print("Running cmake ...")
            run('cmake src')
            print("Running make ...")
            run('make')
            print("Running make install...")
            sudo('make install')

    print("Installing dependencies complete on %s as %s" % (env.host, env.user))

Fabric in action

shrikar-dev$ fab -H "192.168.1.100" riak_dep

Riak Installation and Configuration

Installation

  • Download Riak 1.2.1 (the newest version at the time of writing) from Riak Downloads
  • sudo dpkg -i <downloaded file>
  • Note: the above steps are for Ubuntu-based systems.

System Configuration

  • Modify /etc/security/limits.conf to have these entries at the bottom. The value you set here depends on the number of partitions in the Riak Cluster and also the resources available.

*               soft     nofile          32768  
*               hard     nofile          32768  

  • Create a file /etc/default/riak

ulimit -n 8192

  • Modify /etc/sysctl.conf to have these entries at the end

vm.swappiness = 0
net.ipv4.tcp_max_syn_backlog = 40000
net.core.somaxconn=4000
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_tw_reuse = 1

  • Modify /etc/fstab to add noatime to the mount options. Restart the machine for the change to take effect.

UUID=55aa50cd-6281-4558-8282-10ae3b6c90cd / ext4 noatime,errors=remount-ro 0 1


  • Make sure each machine in your Riak cluster has a different hostname. Modify /etc/hostname to give every machine its own name.
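After rebooting it is worth verifying that the open-file limit actually took effect for the user that runs Riak. The following is a small sketch using Python's standard resource module (Unix only); the helper name meets_limit and the 32768 threshold taken from limits.conf above are choices made for this example.

```python
import resource


def meets_limit(soft_limit, required):
    """True if the soft open-file limit is unlimited or at least `required`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft=%s hard=%s ok=%s" % (soft, hard, meets_limit(soft, 32768)))
```

Run it as the same user that starts Riak, since limits are applied per session.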

Riak Configuration

Configuration files for Riak are stored in the /etc/riak folder. There are two configuration files: app.config and vm.args.

app.config

  • In the riak_api section change the pb_backlog and pb_ip keys.

 {riak_api, [
            %% pb_backlog is the maximum length to which the queue of pending
            %% connections may grow. If set, it must be an integer >= 0.
            %% By default the value is 5. If you anticipate a huge number of
            %% connections being initialised *simultaneously*, set this number
            %% higher.
            {pb_backlog, 32},

            %% pb_ip is the IP address that the Riak Protocol Buffers interface
            %% will bind to.  If this is undefined, the interface will not run.
            {pb_ip,   "ipaddress of the host being configured" },

            %% pb_port is the TCP port that the Riak Protocol Buffers interface
            %% will bind to
            {pb_port, 8087 }
            ]},

  • In the riak_core section set ring_creation_size and the http binding.

              %% Default ring creation size.  Make sure it is a power of 2,
              %% e.g. 16, 32, 64, 128, 256, 512 etc
              {ring_creation_size, 512},

              %% http is a list of IP addresses and TCP ports that the Riak
              %% HTTP interface will bind.
              {http, [ {"ipaddress of the host being configured", 8098 } ]},
  • In riak_kv section set the storage backend to leveldb.

 {riak_kv, [
            %% Storage_backend specifies the Erlang module defining the storage
            %% mechanism that will be used on this node.
            {storage_backend, riak_kv_eleveldb_backend} ] }
  • In riak_search section enable the search feature.

 {riak_search, [
                %% To enable Search functionality set this 'true'.
                {enabled, true}
               ]},
  • In the eleveldb section add entries like the ones below. For more information on configuring LevelDB, see LevelDB Backend.

{eleveldb, [
             {data_root, "/var/lib/riak/leveldb"},
             {cache_size, 134217728},
             {expire_secs, 5184000}
            ]} 

vm.args

Modify the vm.args and set the -name key


## Name of the riak node
-name riak@ipaddress of the host being configured

Starting and joining to a cluster.

  • Through command line

shell$ riak start
shell$ riak-admin cluster join riak@ip address of the first host configured

Why those configurations?

Here are the reasons for choosing those configs.

  • pb_backlog is 5 by default. We expect to create a lot of connections, hence I set it to 32.
  • I chose a ring size of 512 because I wanted to grow the cluster at a later point and wanted enough partitions to balance the load. A second reason is that LevelDB with 2i works best with 512 partitions or fewer.
  • LevelDB backend: because it supports 2i (secondary indexes) and has compression.
  • A larger cache size gives better results, but the right value depends on the total number of partitions (ring size) and the available memory. I set a reasonable value based on the memory present per node.
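To make the cache-size trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes that cache_size applies to each partition (vnode) separately and that partitions are spread evenly over the nodes; both are assumptions you should check against the LevelDB backend documentation for your Riak version. The function names are made up for this example.

```python
def is_power_of_two(n):
    """Ring sizes must be a power of two (16, 32, 64, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def cache_bytes_per_node(ring_size, nodes, cache_size):
    """Estimated LevelDB cache memory per node.

    Assumes cache_size is allocated once per partition and that each node
    owns at most ceil(ring_size / nodes) partitions.
    """
    partitions_per_node = -(-ring_size // nodes)  # ceiling division
    return partitions_per_node * cache_size


if __name__ == "__main__":
    total = cache_bytes_per_node(512, 4, 134217728)
    print("Estimated cache per node: %.1f GiB" % (total / 2.0 ** 30))
```

Under these assumptions, a four-node cluster with the settings above would need about 16 GiB per node for cache alone, which is why the value has to be chosen with the ring size in mind.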

Designing API’s for Aggregation of User Events

Requirements: Design a REST API service from the ground up (assume it can publicly be accessed via HTTP). The REST API has two functions:

  1. saveEvent (user_id, key, value, [{subkey,subvalue},{subkey,subvalue},…]) The saveEvent call will accept a user_id, a key, a value and an optional array of subkeys and subvalues.

  2. getEvent (user_id, key || subkey, date_min, date_max) The getEvent call will return the sum of all the values for that user / key combination for that date range (and will support both key and subkey).

Solution:

Possible candidates for the data store: given the type of data, the most obvious candidates are a generic key/value store or a NoSQL database.

1) For our case I would go with MongoDB, a NoSQL database which can scale horizontally. Since we are aggregating by user_id and key/subkey, a database with a rich query interface makes our job easier than a plain key/value store would.

Given that the system needs to scale, we can use MongoDB's sharding feature to distribute the data across different machines and so balance the load on the system.

2) Choice of Mongo client: Mongoose. It provides a simple API for querying the database.

3) The Express framework for providing the REST-style API.

4) Depending on the number of cores, we can create that many Node.js processes fronted by an Nginx server. All the Node.js processes talk to a single Mongo backend.

5) A demo of the app is available on Heroku. It uses a shared MongoLab instance.

Example:

  1. For saveEvent: curl http://hollow-meadow-3903.herokuapp.com/saveEvent -XPOST -d "user_id=shrikar&key=shrek&value=100&optional=[{shrek,50},{subkey1,33 } ]" Response: Event Saved.

  2. For getEvent: curl "http://hollow-meadow-3903.herokuapp.com/getEvent?user_id=shrikar&key=shrek&start=2012-04-29T02:50:42.749Z&end=2012-04-29T10:11:16.262Z" Response: {"sum":150}
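The aggregation rule can be sketched without any database. The snippet below is a hypothetical in-memory stand-in, not the Node.js/Mongoose code behind the demo: the class EventStore and its method names are invented here, and it assumes (based on the example response above) that a lookup name is summed both where it appears as the main key and where it appears as a subkey.

```python
from datetime import datetime, timezone


class EventStore:
    """Hypothetical in-memory stand-in for the MongoDB collection."""

    def __init__(self):
        self._events = []

    def save_event(self, user_id, key, value, subs=(), when=None):
        """Store one event; subs is an optional list of (subkey, subvalue)."""
        when = when or datetime.now(timezone.utc)
        self._events.append((user_id, when, key, value, list(subs)))

    def get_event(self, user_id, name, date_min, date_max):
        """Sum values for `name` (key or subkey) within [date_min, date_max]."""
        total = 0
        for uid, when, key, value, subs in self._events:
            if uid != user_id or not (date_min <= when <= date_max):
                continue
            if key == name:
                total += value
            total += sum(v for k, v in subs if k == name)
        return total
```

With one saved event of value 100 for key shrek plus subkeys shrek=50 and subkey1=33, a query for shrek over a range containing the event yields 150, matching the demo response.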

Running Variance

Variance: In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value).

Challenge: Calculating the variance of a large set of numbers is not trivial, and it becomes more challenging when we are talking about millions of numbers. The naive method either needs two passes over the data or suffers from loss of precision. A better way of computing the variance goes back to a 1962 paper by B. P. Welford and is presented in Donald Knuth's Art of Computer Programming.

One application of this is in the RTB (real-time bidding) world, for calculating eCPM.

John Cook describes this method on his blog: http://www.johndcook.com/standard_deviation.html. A Java implementation is given below.

package com.shrikar.library;


public class ECPMStats {
  private int events;
  double old_mean, new_mean, old_variance, new_variance, sum;

  public ECPMStats(){
      events=0;
      sum = old_mean = new_mean=old_variance=new_variance = 0;
  }

  void add(double ecpm){
      events++;
      sum += ecpm;
      if(events == 1){
          old_mean = new_mean = ecpm;
          old_variance = 0;
      } else {
          new_mean = old_mean + (ecpm - old_mean)/events;
          new_variance = old_variance + (ecpm - old_mean) * (ecpm - new_mean);
          old_mean = new_mean;
          old_variance = new_variance;
      }
  }

  int numEvents(){
      return events;
  }

  double ecpmMean(){
      if(events > 0)
          return new_mean;
      else
          return 0;
  }

  double ecpmVariance(){
      if(events>1){
          return new_variance/(events-1);
      } else {
          return 0;
      }
  }

  double sum(){
      return sum;
  }

}
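To see why the running update matters, here is a small Python sketch that puts the same recurrence next to the textbook single-pass formula (sum of squares minus square of sum). The function names are mine, and the data set is just an illustration: four values with a large common offset, where the true sample variance is 30.

```python
def welford_variance(values):
    """Sample variance via Welford's running update (one pass)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1) if n > 1 else 0.0


def naive_variance(values):
    """Single-pass textbook formula; can lose precision on large offsets."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total * total / n) / (n - 1) if n > 1 else 0.0


if __name__ == "__main__":
    data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
    print("welford:", welford_variance(data))
    print("naive:  ", naive_variance(data))
```

On data like this the naive formula subtracts two huge, nearly equal numbers, so its result can drift away from 30, while the running update stays accurate.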

Personalized Web Index

Problem: While browsing the internet we find many things which are useful. Today we mainly bookmark them, but once we have a lot of bookmarks it is hard to find the content we are looking for. It also happens that not all the content behind a link is useful to us. Current tools lack a feature that lets a user search for content or keywords from their browsing history; existing tools only provide search by title or by tags.

Solution: Create a unique personalized web index.

Approach: We should allow users to selectively upload data from a page or upload the whole page. On the server we can index the content per user. With this infrastructure we can provide a search capability which helps users get the exact information back from what they have stored; this is currently missing. With this approach we also get curated content from the user, which can be used in many ways. Since we create the index per user, the results won't be affected by what other users store. This is like a pin-and-search mechanism for content instead of photos (as on Pinterest).

Demo : http://hashedout.info

Identifying the Best Point to Meet in an N x N Grid

Problem :

We have a set of m friends who live at positions (x1,y1), (x2,y2), …, (xm,ym). We need to find the best location (xi,yi) for all the friends to meet. Assume only one person lives at any location (x,y). We need to minimize the sum of the distances from every friend to the location we pick.

This problem can be solved in O(m^2): for every point (xi,yi) we accumulate the distance to all the other points, and the best solution is the point with the smallest sum. Assume a person can move one unit in any direction per step. The same problem can be approached in O(m) by applying the centroid-of-a-set-of-points idea.

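#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <utility>
#include <vector>
using namespace std;

typedef pair<int64_t, int64_t> Point;

// Number of steps between two points when one step may go in any direction.
int64_t dist(const Point &p1, const Point &p2){
  return max(abs(p1.first-p2.first), abs(p1.second-p2.second));
}

// O(m^2): try every friend's position as the meeting point.
void mindist(const vector<Point> &pts){
  int64_t best = numeric_limits<int64_t>::max();
  for(size_t i = 0; i < pts.size(); i++) {
      int64_t sum = 0;
      for(size_t j = 0; j < pts.size(); j++){
          sum += dist(pts[i], pts[j]);
      }
      if(sum < best)
          best = sum;
  }
  cout << best << endl;
}

int main()
{
  int64_t n, x, y;
  vector<Point> points;
  cin >> n;
  for(int64_t i = 0; i < n; i++){
      cin >> x;
      cin >> y;
      points.push_back(Point(x, y));
  }

  mindist(points);
}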
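The program above is the O(m^2) brute force. For the centroid idea, here is one possible reading as a Python sketch: compute the centroid, pick the friend closest to it, and use that position as the meeting point. I am treating this as a heuristic, not a proven optimum; it can differ from the brute-force answer on some inputs, so the brute force is kept alongside for comparison. All function names are invented for this example.

```python
def step_distance(p, q):
    """Steps between p and q when one step may go in any direction."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def total_distance(center, points):
    return sum(step_distance(center, p) for p in points)


def best_meeting_brute(points):
    """O(m^2): try every friend's position as the meeting point."""
    return min(total_distance(c, points) for c in points)


def best_meeting_centroid(points):
    """O(m) heuristic: meet at the friend nearest to the centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    nearest = min(points, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return total_distance(nearest, points)


if __name__ == "__main__":
    friends = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
    print(best_meeting_brute(friends), best_meeting_centroid(friends))
```

For this small input both approaches pick (2,2), with a total distance of 6.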

Configuring Apache for PUT

I tried this on Ubuntu 10.04.

  1. sudo apt-get install apache2 php5-cli php
  2. Add the directive : Script PUT /put.php in /etc/apache2/sites-enabled/000-default
  3. sudo ln -s /etc/apache2/mods-available/actions.load /etc/apache2/mods-enabled/actions.load
  4. sudo ln -s /etc/apache2/mods-available/actions.conf /etc/apache2/mods-enabled/actions.conf
  5. sudo /etc/init.d/apache2 restart

Create the file put.php in the document root which in my case is /var/www

put.php

<?php
echo "Creating file : ";
echo basename($_SERVER['REQUEST_URI']);

/* PUT data comes in on the php://input stream */
$putdata = fopen("php://input", "r");

/* Open a file for writing */
$fp = fopen(basename($_SERVER['REQUEST_URI']), "w");

/* Read the data 1 KB at a time
and write to the file */
while ($data = fread($putdata, 1024)) {
    fwrite($fp, $data);
}

/* Close the streams */
fclose($fp);
fclose($putdata);

?>

This post doesn't take any security issues into account. Also, if you are planning to upload big files you might need to modify the max_execution_time and max_input_time variables in /etc/php5/apache2/php.ini.
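To try the setup from a client, any tool that can send a PUT will do. Below is a minimal Python sketch using the standard urllib.request module; the host name in the usage note is a placeholder, and the helper build_put_request is my own naming. It only builds the request so you can inspect it before sending.

```python
import urllib.request


def build_put_request(base_url, filename, payload):
    """Build (but do not send) a PUT request that uploads payload as filename."""
    url = base_url.rstrip("/") + "/" + filename
    return urllib.request.Request(url, data=payload, method="PUT")


def send(request):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, send(build_put_request("http://localhost", "hello.txt", b"hello")) should print the "Creating file" message from put.php, assuming Apache is running locally with the configuration above.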

Evaluating a Recommender

In the previous post we created a recommender. It's time to see how well our recommender performs given a training and a test dataset. I am using a modified version of the GroupLens data. Given a dataset, we can simulate the training and test datasets by treating one portion of the actual dataset as the training dataset and the remaining portion as the test dataset. We can then train the recommender and evaluate how it performs against the test dataset. Since we know the actual values in the test dataset, it is easy to see how well we recommended an item by taking the difference. There are two types of evaluator in Mahout:

  • Average Absolute Difference evaluator
  • Root Mean Square evaluator

With a score of this type, lower is better, because that means the estimates differed from the actual values by less.

Applying the above two evaluators to our recommender gave these values:

shrikar@shrikar-laptop:~/proj/shri-mahout$ java -cp .:target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar:/home/shrikar/proj/mahout-distribution-0.5/utils/target/dependency/slf4j*.jar shri.mahout.Evaluator

Absolute Average Score : 0.9810512490344762
Root Mean Square Score : 1.2754938650252956

In our case the Root Mean Square score is higher because it penalizes the recommendations that were way off.
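The difference between the two scores is easy to see on a toy example. The sketch below is plain Python, not Mahout code, and the ratings are made up: three estimates are off by 0.5 and one is off by 3.0.

```python
import math


def average_absolute_difference(actual, estimated):
    """Mean of |actual - estimated| over all test ratings."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def root_mean_square(actual, estimated):
    """Square root of the mean squared difference."""
    return math.sqrt(
        sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual)
    )


if __name__ == "__main__":
    actual = [4.0, 3.0, 5.0, 2.0]
    estimated = [3.5, 3.5, 4.5, 5.0]  # the last estimate is way off
    print(average_absolute_difference(actual, estimated))
    print(root_mean_square(actual, estimated))
```

Because the errors are squared before averaging, the single bad estimate pulls the root mean square score up much more than the average absolute difference.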