Shrikar Archak

Shrikar's Blog

Image Server in Node.js

  • January 5, 2012 7:43 am

I got curious about the much-hyped Node.js and what it can do, and I wanted to try out this new way of using JavaScript on the server side. I followed the Nodebeginner tutorial, which explains the process in detail.
I will be deploying this app on Heroku using the Cedar stack. A Node application requires a package.json file, which describes all of its package dependencies.

To run your web process, you need to declare what command to use. In this case, we simply need to execute our Node script. We’ll use a Procfile to declare how our web process type is run.

Here’s a Procfile for the sample app we’ve been working on:
web: node index.js

The demo of the app can be found here: Nodejs-ImageServer

I found Node.js to be interesting and certainly worth spending more time on.

Identifying the best point to meet in an N x N grid

  • December 31, 2011 10:20 pm

Problem :

We have a set of m friends who live at positions (x1,y1), (x2,y2), …, (xm,ym). We need to find the best location (xi,yi) for all the friends to meet. Assume only one person lives at any location (x,y). The goal is to minimize the sum of the distances everyone travels to the chosen location.

This problem can be solved in O(n^2), where n is the number of friends: take each point (xi,yi) as a candidate, accumulate the distance from it to every other point, and keep the candidate with the smallest sum. Assume a person can move 1 unit in any direction per step. We can solve the same problem in O(n) by applying the centroid-of-a-set-of-points idea.

 

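#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <utility>
#include <vector>
using namespace std;

typedef pair<int64_t, int64_t> Point;

// A step covers 1 unit in any of the 8 directions, so the number of steps
// between two points is the larger of the two coordinate differences.
int64_t dist(const Point &p1, const Point &p2) {
	return max(abs(p1.first - p2.first), abs(p1.second - p2.second));
}

// Brute force: try every friend's position as the meeting point and
// print the smallest total distance. O(n^2).
void mindist(const vector<Point> &pts) {
	int64_t best = numeric_limits<int64_t>::max();
	for (size_t i = 0; i < pts.size(); i++) {
		int64_t sum = 0;
		for (size_t j = 0; j < pts.size(); j++)
			sum += dist(pts[i], pts[j]);
		if (sum < best)
			best = sum;
	}
	cout << best << endl;
}

int main() {
	int64_t n, x, y;
	vector<Point> points;
	cin >> n;
	for (int64_t i = 0; i < n; i++) {
		cin >> x >> y;
		points.push_back(Point(x, y));
	}
	mindist(points);
	return 0;
}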
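
The post doesn’t spell out the centroid idea, so the following is only one possible reading of it, written as a separate Java sketch: compute the centroid of all homes, pick the home closest to it, and sum the distances to that home, all in O(n). This is a heuristic and is not guaranteed to match the brute-force answer on every input. The class and method names are made up for illustration.

```java
import java.util.Arrays;

final class MeetingPointSketch {
    // Same distance as in the brute-force version: one step moves
    // 1 unit in any of the 8 directions.
    static long dist(long[] a, long[] b) {
        return Math.max(Math.abs(a[0] - b[0]), Math.abs(a[1] - b[1]));
    }

    // Sum of the distances from every home to the candidate: O(n).
    static long totalDistance(long[][] pts, long[] candidate) {
        long sum = 0;
        for (long[] p : pts) {
            sum += dist(p, candidate);
        }
        return sum;
    }

    // Heuristic (an assumption, not a proven optimum): return the home
    // that lies closest to the centroid of all homes. O(n).
    static long[] nearestToCentroid(long[][] pts) {
        double cx = 0, cy = 0;
        for (long[] p : pts) {
            cx += p[0];
            cy += p[1];
        }
        cx /= pts.length;
        cy /= pts.length;

        long[] best = pts[0];
        double bestDist = Double.MAX_VALUE;
        for (long[] p : pts) {
            double d = Math.max(Math.abs(p[0] - cx), Math.abs(p[1] - cy));
            if (d < bestDist) {
                bestDist = d;
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        long[][] pts = {{0, 1}, {2, 5}, {3, 1}, {4, 0}};
        long[] meet = nearestToCentroid(pts);
        // prints "[3, 1] -> 8"
        System.out.println(Arrays.toString(meet) + " -> " + totalDistance(pts, meet));
    }
}
```

On these four sample points the heuristic picks (3,1) with a total of 8, which is also the minimum the brute force finds. One possible refinement is to evaluate a few of the homes nearest the centroid instead of just one and keep the best.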

Configuring Apache2 for PUT operation.

  • December 7, 2011 7:45 pm

I tried this on Ubuntu 10.04.

  • sudo apt-get install apache2 php5-cli php
  • Add the directive : Script PUT /put.php in /etc/apache2/sites-enabled/000-default
  • sudo ln -s /etc/apache2/mods-available/actions.load /etc/apache2/mods-enabled/actions.load
  • sudo ln -s /etc/apache2/mods-available/actions.conf /etc/apache2/mods-enabled/actions.conf
  • sudo /etc/init.d/apache2 restart

Create the file put.php in the document root, which in my case is /var/www.

put.php

<?php
echo "Creating file : ";
echo basename($_SERVER['REQUEST_URI']);

/* PUT data comes in on the php://input stream */
$putdata = fopen("php://input", "r");

/* Open a file for writing, named after the last segment of the request URI */
$fp = fopen(basename($_SERVER['REQUEST_URI']), "w");

/* Read the data 1 KB at a time
and write to the file */
while ($data = fread($putdata, 1024))
	fwrite($fp, $data);

/* Close the streams */
fclose($fp);
fclose($putdata);

?>

This post doesn’t take any of the security issues into account. Also, if you are planning to upload big files, you might need to increase the max_execution_time and max_input_time settings in /etc/php5/apache2/php.ini.

Evaluating a Recommender

  • December 5, 2011 7:50 am

In the previous post we created a recommender. It’s time to see how well it performs given a training and a test dataset. I am using a modified version of the GroupLens data. Given a dataset, we can simulate training and test datasets by treating one portion of the actual data as the training set and the remaining portion as the test set. We then train the recommender on the first and evaluate how it performs against the second. Since we know the actual values in the test dataset, we can see how well we recommended an item by taking the difference between the estimated and the actual rating. There are two types of evaluators in Mahout:

  • Average Absolute Difference Evaluator
  • Root Mean Square Evaluator

With a score of this type, lower is better, because that would mean the estimates differed from the actual value by less.
Rating Dataset download here

Applying the above two evaluators to our recommender gave the following values:

shrikar@shrikar-laptop:~/proj/shri-mahout$ java -cp .:target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar:/home/shrikar/proj/mahout-distribution-0.5/utils/target/dependency/slf4j*.jar shri.mahout.Evaluator

Absolute Average Score : 0.9810512490344762
Root Mean Square Score : 1.2754938650252956

In our case the Root Mean Square score is the higher of the two because squaring the differences penalizes the recommendations that were way off more heavily.
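
To see why the two scores differ, here is a minimal sketch that computes both from a handful of ratings. It is plain Java, not Mahout’s evaluator classes, and the ratings and names are hypothetical.

```java
final class EvaluationSketch {
    // Mean of |actual - estimated| over all test ratings.
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimated[i]);
        }
        return sum / actual.length;
    }

    // Square root of the mean of (actual - estimated)^2.
    static double rootMeanSquare(double[] actual, double[] estimated) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - estimated[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        // Hypothetical ratings: three perfect estimates and one that is off by 4.
        double[] actual = {4, 3, 5, 1};
        double[] estimated = {4, 3, 5, 5};
        // prints "Average absolute difference: 1.0"
        System.out.println("Average absolute difference: "
                + averageAbsoluteDifference(actual, estimated));
        // prints "Root mean square: 2.0"
        System.out.println("Root mean square: " + rootMeanSquare(actual, estimated));
    }
}
```

A single estimate that is off by 4 gives an average absolute difference of 1.0 but a root mean square of 2.0, which is the penalizing effect described above.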

 

Experiments with Mahout

  • December 5, 2011 4:27 am

I assume you have installed Mahout and compiled it. In this post I will show how we can create our own recommender. Here are the steps.

  • mvn archetype:create -DgroupId=shri.mahout -DartifactId=shri-mahout
  • mvn compile
  • mvn eclipse:eclipse
  • Now import the project into Eclipse: File > Import > path to the shri-mahout directory created by the first command
  • To resolve the dependencies you will need to add mahout-core-0.5.jar, mahout-utils-0.5.jar, mahout-math-0.5.jar and mahout-core-0.5-job.jar (this last one has all the dependencies required by Mahout). We can add them by right-clicking the project folder > Configure Build Path > Libraries > Add External JARs

package shri.mahout;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderIntro {
	public static void main(String[] args) throws Exception{
		DataModel model = new FileDataModel(new File("intro"));
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
		Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
		for(int i = 0; i < 5; i++) {
			System.out.println("For user : " + i);
			List<RecommendedItem> recommendations = recommender.recommend(i, 3);
			for (RecommendedItem recommendation : recommendations) {

				System.out.println(recommendation);
			}
		}
	}
}

How to run the program from the command line:

java -cp .:target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/classes/:/home/shrikar/proj/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar:/home/shrikar/proj/mahout-distribution-0.5/utils/target/dependency/slf4j*.jar shri.mahout.RecommenderIntro

For user : 0
For user : 1
RecommendedItem[item:748, value:4.5]
RecommendedItem[item:313, value:4.5]
RecommendedItem[item:300, value:4.0]
For user : 2
RecommendedItem[item:513, value:4.5]
RecommendedItem[item:83, value:4.5]
RecommendedItem[item:603, value:4.5]
For user : 3
RecommendedItem[item:64, value:5.0]
RecommendedItem[item:50, value:5.0]
RecommendedItem[item:1, value:5.0]
For user : 4
RecommendedItem[item:181, value:5.0]
RecommendedItem[item:275, value:4.5]
RecommendedItem[item:25, value:4.5]

Application of Log Parsing

  • December 5, 2011 3:56 am

Assume you have a log in this particular format:

EVENT:USERID:TIMESTAMP:DATA

A data dump in this format can tell us a lot about what a particular user is doing and how often they perform a particular event (the timestamp lets us calculate that). This data can be used to provide personalized information for that user, for example displaying ads the user might be interested in. How you use the data is completely application specific.
Here is one of my examples in Hadoop that performs the task described above.
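
As a rough illustration of the parsing step only, here is a minimal plain-Java sketch (not a Hadoop job) that splits each line on the first three colons and counts how often each user performs each event. The sample lines, the class name and the "user:event" key format are assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class LogParseSketch {
    // Counts occurrences of each (user, event) pair in lines of the form
    // EVENT:USERID:TIMESTAMP:DATA. The result is keyed as "USERID:EVENT".
    static Map<String, Integer> countEventsPerUser(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Limit of 4 keeps any colons inside DATA intact.
            String[] fields = line.split(":", 4);
            if (fields.length < 4) {
                continue; // skip malformed lines
            }
            String event = fields[0];
            String userId = fields[1];
            counts.merge(userId + ":" + event, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines.
        String[] lines = {
            "CLICK:u1:1323052800:ad42",
            "CLICK:u1:1323052860:ad7",
            "VIEW:u2:1323052900:home",
            "garbage"
        };
        // prints "{u1:CLICK=2, u2:VIEW=1}"
        System.out.println(countEventsPerUser(lines));
    }
}
```

In a MapReduce setting the split would typically happen in the map step and the summing in the reduce step, but that wiring is left out here.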

Hadoop Install and Configure

  • December 5, 2011 3:55 am

Steps for installing Hadoop
1) Download the latest version of Hadoop from here: http://newverhost.com/pub//hadoop/core/stable/hadoop-0.20.203.0rc1.tar.gz
2) Extract the tar archive.
3) Make sure you can log in to localhost without giving a passphrase. If you cannot, follow the steps below:
shrikar@localhost$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
shrikar@localhost$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
4) Create HADOOP_HOME=
5) Edit the file $HADOOP_HOME/conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.

Verify Hadoop is running fine
1. By default, Hadoop is configured to run in a non-distributed mode (standalone mode), as a single Java process. This is useful for debugging.
2. The following example copies the unpacked conf directory to use as input and then displays every match of the given regular expression. Output is written to the given output directory.

shrikar@localhost$ mkdir input
shrikar@localhost$ cp conf/*.xml input
shrikar@localhost$ bin/hadoop jar hadoop-*-examples.jar grep input output 'Put'
shrikar@localhost$ cat output/*

3. Clean up:

shrikar@localhost$ rm -rf input output

4. Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

5. Edit the conf/core-site.xml file:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

6. Edit the conf/hdfs-site.xml file:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

7. Edit the conf/mapred-site.xml file:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

8. Format a new distributed filesystem:

shrikar@localhost$ bin/hadoop namenode -format

9. Start the Hadoop daemons:

shrikar@localhost$ bin/start-all.sh

10. Browse the web interface for the NameNode and the JobTracker; by default they are available at:
NameNode – http://localhost:50070
JobTracker – http://localhost:50030
11. Copy the input files into the distributed filesystem:

shrikar@localhost$ bin/hadoop fs -put conf/*.xml input

12. Run some of the examples provided (you can create a directory with: bin/hadoop fs -mkdir /user/shrikar/input):
shrikar@localhost$ bin/hadoop jar hadoop-*-examples.jar grep /user/shrikar/input /user/shrikar/output 'Put'

13. Copy the output files from the distributed filesystem to the local filesystem and examine them:
shrikar@localhost$ bin/hadoop fs -get output output
shrikar@localhost$ cat output/*

14. Clean up:

shrikar@localhost$ rm -r output
shrikar@localhost$ bin/hadoop fs -rmr /user/shrikar/input /user/shrikar/output

15. When you’re done, stop the daemons with:

shrikar@localhost$ bin/stop-all.sh

Cassandra Data Model

  • December 5, 2011 3:52 am

Column
------

The column is the lowest/smallest increment of data. It’s a tuple (triplet) that contains a name, a value and a timestamp.
Here’s a column represented in JSON-ish notation:

{
name: "emailAddress",
value: "xxx@example.com",
timestamp: 123456789
}

That’s all it is. For simplicity’s sake let’s ignore the timestamp and just think of it as a name/value pair.

It’s also worth noting that the name and value are both binary (technically byte[]) and can be of any length.

SuperColumn
-----------

A SuperColumn is a tuple with a binary name and a value that is a map containing an unbounded number of Columns, keyed by the Column’s name. Keeping with the JSON-ish notation we get:

{ // this is a SuperColumn
name: "homeAddress",
// with an infinite list of Columns
value: {
// note that each key is the name of the Column
street: {name: "street", value: "1234 x street", timestamp: 123456789},
city: {name: "city", value: "san francisco", timestamp: 123456789},
zip: {name: "zip", value: "94107", timestamp: 123456789},
}
}

Column vs SuperColumn
---------------------

Columns and SuperColumns are both tuples with a name and a value. The key difference is that a standard Column’s value is a “string”, while a SuperColumn’s value is a map of Columns. That’s the main difference: their values contain different types of data. Another minor difference is that SuperColumns don’t have a timestamp component.
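
The two shapes can also be written down as plain Java types. This is only an illustrative sketch of the data model described above, not Cassandra’s own classes or client API. Keying the map by a decoded String rather than by byte[] is a simplification made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class DataModelSketch {
    // A Column: binary name, binary value and a timestamp.
    static final class Column {
        final byte[] name;
        final byte[] value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name.getBytes(StandardCharsets.UTF_8);
            this.value = value.getBytes(StandardCharsets.UTF_8);
            this.timestamp = timestamp;
        }

        String valueAsString() {
            return new String(value, StandardCharsets.UTF_8);
        }
    }

    // A SuperColumn: binary name and a map of Columns keyed by Column name.
    // Note that it has no timestamp of its own.
    static final class SuperColumn {
        final byte[] name;
        final Map<String, Column> columns = new LinkedHashMap<>();

        SuperColumn(String name) {
            this.name = name.getBytes(StandardCharsets.UTF_8);
        }

        void add(Column column) {
            columns.put(new String(column.name, StandardCharsets.UTF_8), column);
        }
    }

    // Builds the homeAddress example from the JSON-ish notation above.
    static SuperColumn homeAddress() {
        SuperColumn sc = new SuperColumn("homeAddress");
        sc.add(new Column("street", "1234 x street", 123456789L));
        sc.add(new Column("city", "san francisco", 123456789L));
        sc.add(new Column("zip", "94107", 123456789L));
        return sc;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Column> e : homeAddress().columns.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue().valueAsString());
        }
    }
}
```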

Thanks to Alex Popescu and Arin Sarkissian for their blogs, which helped me get to know more about the Cassandra data model.