Cluster-based Recommendation with Mahout

Mahout includes a few new experimental recommenders that are poorly documented at the moment. One of them is TreeClusteringRecommender, which clusters your model into a set of groups and makes recommendations based on distances between your users and items in these clusters.

A cluster-based recommender may be a good choice if your data is sparse and too weakly correlated to reveal obvious patterns. Another advantage is that clustering may help you provide recommendations even to users with very little data available. However, it decreases the level of personalization: the output is unique to clusters, not to individual users.

Here is a quick start to create and run one:

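import org.apache.mahout.cf.taste.impl.recommender.ClusterSimilarity;
import org.apache.mahout.cf.taste.impl.recommender.FarthestNeighborClusterSimilarity;
import org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

UserSimilarity similarity = new LogLikelihoodSimilarity(model); // model is your DataModel
ClusterSimilarity clusterSimilarity = new FarthestNeighborClusterSimilarity(similarity);
TreeClusteringRecommender rec = new TreeClusteringRecommender(model, clusterSimilarity, 20);

rec.getCluster(1); // gets the cluster of userId=1
rec.recommend(1, 10); // recommends 10 items to userId=1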

What about user similarity and cluster similarity?

  • You basically have to provide a similarity function to measure the distance between different users. You may represent each user as a vector and calculate a Euclidean, Pearson, cosine, etc. distance over these features. LogLikelihoodSimilarity may work as well. You may want to look at Surprise and Coincidence to understand what’s under the hood of this similarity.
  • Cluster similarity is newly introduced here. It’s the place to customize how the similarity between two clusters is measured. There are already two implementations: NearestNeighborClusterSimilarity and FarthestNeighborClusterSimilarity. Beware that clusters are dynamic. As new data comes in, old clusters may be merged and new ones may be introduced.

The rest is experimental work: finding a method that fits your data by analyzing the nature of your input, plotting the results and evaluating. The initial clustering takes a lot of time compared to Mahout’s item or user similarity based recommenders, yet it works fine online once you start working with precomputed data. Even though it’s not very mature, you may still like to consider this recommender as a starting point if what you want to achieve fits clustering.

The Rise of Open Science

Open science is opening up the way we make science. It stands for transparency and public accessibility of scientific data, collaboration, methods and results. It also supports public contribution to the current state of science, and giving the results back to the public domain.

Motivation

While we are making science, we rely on older publications and methods that are often published with no open access to the data. Years ago, the academic community started to question the credibility of the research in the existing literature. The way science is funded was one of the chief reasons behind this skepticism. Science made with non-open data could easily be led by politics and by other funding authorities, such as private companies, to misrepresent facts on topics such as global warming or the side effects of a new medicine. Firing up the openness discussion led to further ideas, such as opening the methods and the scientific source code.

Why open data, open tools and open results?

One of the core values of science was being open and accessible. Ironically, science today receives heavy financial support from private institutions and governments, where much of the budget is shaped by economic, industrial or military needs. Scientific institutions are mostly closed to people without PhDs for scientist roles because there is already huge competition among PhDs. Our credibility is measured by the number of papers we publish and the number of citations we receive. I wouldn’t want to slander scientists, but professional science, within its own closed ecosystem, has a few conflicts with the key foundations of science. Science’s route, subjects, people and results are controlled, or can potentially be controlled, by authority. In the next few decades, we have to rethink the way we sustain science.

We also have a verification problem with science that relies on data. Computational and statistical science often fails to make the final results advertised in publications reproducible. JASA (Journal of the American Statistical Association) reports that only 21% of the papers were published with open source code in 2011, still a positive number compared to 2006’s 9%. Without code or data, even if the work is published in an academic journal, there is no way to validate or iterate on the existing findings.

One of the key problems we can address is that scientific research is not economically sustainable because of the cost of scientific tools. I watched Eri Gentry, the founder of BioCurious, at OSCON last year. Her key points about opening up scientific tools, from the self-makers’ point of view, were motivating. According to her, at some point BioCurious needed a PCR machine, which cost several thousand dollars, to keep their garage-based research going. Since they couldn’t afford the machine, they decided to analyze how these machines actually work. Fortunately, they figured it out and created OpenPCR. And now you are able to copy some strawberry DNA sequence or do cancer research at home. An open repository of knowledge on making scientific tools will increase the level of collaboration from regular makers and DIY people who may never have had the chance to investigate or reverse engineer these tools.

Collaborative Science

With the radical changes in the means of communication, discovery and discussion will have to change radically as well. A few months ago, I saw a book by Michael Nielsen called Reinventing Discovery: The New Era of Networked Science in the new arrivals section. Nielsen opens the first chapter with a 2009 story about Tim Gowers’ Polymath Project. Tim Gowers is a very notable mathematician, a Fields medalist from Cambridge University. In 2009, instead of working alone or with his existing peers, he decided to discuss a mathematical problem on his blog and asked readers to share their ideas online. In 6 weeks, he received 800 comments from 27 people. Although the start had its pitfalls, 37 days later Gowers announced that they had solved not just his problem but a generalization of it that includes the original as a special case.

And what about citizen science? Citizen science used to be perceived as a more professional form of scientific crowdsourcing. But this perception seems to be changing. Very recently, I had a few discussions with friends who are total strangers to citizen science and its current initiatives. They initially questioned the need for citizen scientists. Our main talk was about the classification of galaxies on GalaxyZoo. GalaxyZoo is an online tool that shows you images of galaxies taken by the Hubble telescope and asks you to manually choose whether a galaxy is elliptical or spiral, or whether it has some set of features. Any programmer would initially ask why we are doing this classification manually in the 2010s. Honestly, we have the technology to pick up the features directly from the signal without any observation by a human eye. So? But discovery is not classification. We actually don’t know what we are looking at. Any anomaly or strange-looking object could be a new scientific discovery. By reviewing the existing images, GalaxyZoo members discovered a new type of galaxy, now called “pea galaxies”, and Hanny van Arkel, a Dutch school teacher, discovered a strange green nebula-like object the size of the Milky Way Galaxy, called Hanny’s Voorwerp, back in 2007.

So, why aren’t we taking it any further? There is an ongoing effort to make a cultural shift to increase awareness of and participation in science. It’s not only the Zooniverse projects: NASA opened code.nasa.gov very recently. Ariel Waldman has been keeping a directory of all citizen space exploration projects on spacehack.org for a while. LHC’s ongoing CMS project donated data to Science Hack Day participants to let data hackers come up with data visualization tools for CMS. DIYgenomics is crowdsourcing genomic data. The list goes on…

Conclusion

With the ongoing momentum in scientific communities, in the next few decades we’ll experience a tremendous change in the way we make and participate in science. For now, not by disrupting conventional means but by creating new possibilities, a new science is approaching with strong sympathy for making scientific results freely and universally accessible.

Android’s RTP implementation

Although it is still not really mature, Android has supported RTSP streaming for a long time. In theory, it’s trivial to play an RTSP link with the MediaPlayer controller.

MediaPlayer player = new MediaPlayer();
player.setDataSource("rtsp://..."); // MediaPlayer has no constructor that takes a URL
player.prepare();
player.start();

But in practice, the MediaPlayer implementation does not give you enough feedback, and you basically don’t know what’s going on when your media is not playing. I will mostly be talking about the network layer, so you will have a basic idea of how to configure your media servers.

RTSP and RTP

Generally we call it RTSP. But RTSP streaming has two phases: RTSP and, mostly, RTP to transport the actual media data. RTSP is a stateful protocol. While making the first connection, client and server agree on a bunch of details and exchange data about the media being served. This is done with a family of directives, which are sent over TCP port 554. The RTSP flow includes OPTIONS, DESCRIBE, SETUP, PLAY/PAUSE/etc. In the request made for the SETUP directive, the client specifies which transport protocol it supports (in this case, RTP), over which lower-level protocol and on which ports. Android clients choose UDP and a range starting from 15000 up to 65k. This range may change from phone to phone, manufacturer to manufacturer. Summary: there is absolutely no standard at all. If you look at the native MediaPlayer implementation in the Android codebase, you will see no specific range either. So, it’s very likely that you will have trouble. Another bad point is that RTP is usually supported on a port range between 9k-15k on TCP (e.g. Blackberry devices), so if you read the usual tips and tricks about configuring a server, you won’t catch the Android case.
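To make the SETUP step more concrete, here is a minimal sketch of what such a request looks like on the wire, following the RFC 2326 syntax. It is written in JavaScript only for brevity. The helper name buildSetupRequest, the example.com URL, the CSeq value and the port numbers are all assumptions for illustration; a real client sends whatever ports its own stack picked.

```javascript
// Builds the text of an RTSP SETUP request (RFC 2326 syntax).
// Hypothetical helper: the URL, CSeq and port passed in are illustrative only.
function buildSetupRequest(url, cseq, clientRtpPort) {
  // RTP conventionally uses an even port and RTCP the next odd one.
  var ports = clientRtpPort + '-' + (clientRtpPort + 1);
  return [
    'SETUP ' + url + ' RTSP/1.0',
    'CSeq: ' + cseq,
    // RTP/AVP without a lower-transport suffix means RTP over UDP.
    'Transport: RTP/AVP;unicast;client_port=' + ports,
    '',
    ''
  ].join('\r\n');
}

var request = buildSetupRequest('rtsp://example.com/stream/trackID=1', 3, 15000);
console.log(request);
```

The client_port pair in the Transport header is the part your server and firewall configuration has to allow for; with Android clients it may fall anywhere in the wide UDP range described above.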

Note: This post was a draft for about a year; I reviewed it and posted it. There’s nothing outdated as far as my practical knowledge goes. If you disagree, contact me for fixes.

Implementing Interfaces/Protocols in JavaScript?

Several years later, I decided to read Pro JavaScript Design Patterns by Harmes and Diaz again. This book has almost become controversial for its introductory section which presents a suggestion to implement interfaces in JavaScript.

Pro JS Design Patterns implements an Interface object simply by attaching several method names to an object during creation. The interface class is closer to a Factory class, since it creates a new class definition based on the input parameters.

// Constructor.
var Interface = function(name, methods) {
    this.name = name;
    this.methods = [];
    for(var i = 0, len = methods.length; i < len; i++) {
        this.methods.push(methods[i]);
    }
};
// create Widget class which implements an interface with two methods: addSubView and removeFromSuper.
var Widget = new Interface('Widget', ['addSubView', 'removeFromSuper']);

Once you initialize Widget, you ensure that Widget implements the addSubView and removeFromSuper methods. This is the methodology the book presents. In this blog post, I’ll try to explain why it conceptually makes no sense to implement an Interface object in JS.

There is no “before run-time”

Dynamic languages form their classes at run-time. Interfaces were invented to find failures at compile time, long before the code is run. If you develop control methodologies or tests against a structure, they might be pointless because they, too, are executed at run-time. If your situation is not critical, let it fail on the actual code during testing. Nevertheless, you can always check whether the required methods are implemented by an instance and raise flags if something has gone wrong, for critical operations or for untrusted objects that come from an AJAX request, for example.
If you truly want to make “before run-time” checks, your problem turns out to be a unit-testing one. You can simply write external test code that consumes your JS library and checks whether your object implements your set of methods. This is absolutely not a complete solution, since you are able to modify the object structure during execution (see next section). Again, you can’t be sure whether your code is going to break or not.
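The run-time check mentioned above can be as small as the sketch below. missingMethods is a hypothetical helper name of my own, not something from the book, and the JSON payload only stands in for data arriving from an AJAX request.

```javascript
// Hypothetical helper: returns the names of the required methods that obj lacks.
function missingMethods(obj, methodNames) {
  var missing = [];
  for (var i = 0; i < methodNames.length; i++) {
    // A property only counts if it exists and is callable.
    if (!obj || typeof obj[methodNames[i]] !== 'function') {
      missing.push(methodNames[i]);
    }
  }
  return missing;
}

var widgetMethods = ['addSubView', 'removeFromSuper'];

// An object we built ourselves.
var trusted = { addSubView: function () {}, removeFromSuper: function () {} };

// Stand-in for an untrusted object, e.g. parsed from an AJAX response.
var untrusted = JSON.parse('{"addSubView": "not a function"}');

console.log(missingMethods(trusted, widgetMethods).length);   // nothing is missing
console.log(missingMethods(untrusted, widgetMethods).length); // both methods are missing
```

Note that this only tells you what the object looks like at the moment of the call, which is exactly the limitation discussed in this post.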

Hierarchy or functionality changes at run-time

Almost any object is changeable during run-time, although there are some exceptions with the [global] object in browser implementations, since window is the [global] object and modifying it is a security problem. Hierarchy may change, properties may change. So, there is again no guarantee that a Factory-created object will conform to a certain protocol during run-time.
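A small sketch of how this plays out: the object below would pass a method check right after creation and fail the very same check a few statements later. Button and the variable names are made up for the illustration.

```javascript
// Hypothetical class that provides addSubView through its prototype.
function Button() {}
Button.prototype.addSubView = function (view) { return 'added ' + view; };

var button = new Button();
var okBefore = typeof button.addSubView === 'function';      // true: the method is there

// Any code holding a reference can change the prototype later on.
delete Button.prototype.addSubView;
var okAfterDelete = typeof button.addSubView === 'function'; // false: the method is gone

// A method can also be shadowed by a non-function value on the instance itself.
button.addSubView = 42;
var okAfterShadow = typeof button.addSubView === 'function'; // false: it is a number now

console.log(okBefore, okAfterDelete, okAfterShadow);
```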

The cost

Nothing comes for free. Factory-like initialization is an overhead; huge or small, it’s still overhead. It requires extra iterations over objects during creation. And you’re required to iterate over the same properties while implementing the methods.

Consequently, there is almost no practical point in creating classes from an Interface factory at all. If your code is getting large, write automated tests and find ways to share the schema among the developers more efficiently.

Twifighting: Compare tweets per hour

Yesterday, I put some code and graphics together and launched Twifighting: a simple tool which compares how trendy your phrases are in Twitter’s ecosystem. It’s an easy tool I always needed for doing some market research, gathering the latest interest metrics, etc. I figured I’m not the only one who needs such comparisons, and that’s how I decided to launch it. It’s fun to watch the metrics for Gordon Brown vs David Cameron, Obama vs BP, etc. :) I became a huge addict! I’m planning a minor update to add some useful features and fix some of the usability problems. If you have any feedback, please drop it here.

Setting bounds of a map to cover collection of POIs on Android

Lately, as I browse the web for maps-related questions on Android, a frequently requested example is setting the bounds of a map (zooming to a proper level and panning) so that all of the given pins are shown on the screen.

Most maps APIs, such as the Google Maps API, provide this functionality, yet developers seem to have problems implementing it on Android. The Google Maps API for Android does not provide functionality for setting bounds to a box. Instead, what’s provided is zooming to a span.

com.google.android.maps.MapController.zoomToSpan(int latSpanE6, int lonSpanE6)

latSpanE6 is the difference in latitudes * 10^6 and, similarly, lonSpanE6 is the difference in longitudes * 10^6. You may question how the map controller knows how far to zoom in just from the differences. For example, the distance in km between longitudes differs from the equator to the poles. Fortunately, the Google Maps projection draws them at the same length. This may remind you of the infamous South America versus Greenland syndrome: although Greenland is much, much smaller than South America, it doesn’t look so with this map projection.
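To put some numbers on both points, here is a rough sketch in JavaScript (kept in plain JS so it runs anywhere, it is not Android code). toE6, spanE6 and kmPerLongitudeDegree are hypothetical helper names, and the distance formula assumes a perfectly spherical Earth with an equatorial circumference of about 40075 km, so treat the results as approximations.

```javascript
// Converts decimal degrees to the E6 integer form used by the maps API.
function toE6(degrees) {
  return Math.round(degrees * 1e6);
}

// Span between two coordinates in E6 units, the kind of value zoomToSpan expects.
function spanE6(degreesA, degreesB) {
  return Math.abs(toE6(degreesA) - toE6(degreesB));
}

// Approximate ground length of one degree of longitude at a given latitude,
// assuming a spherical Earth (equatorial circumference of about 40075 km).
var KM_PER_DEGREE_AT_EQUATOR = 40075 / 360;
function kmPerLongitudeDegree(latitudeDegrees) {
  return KM_PER_DEGREE_AT_EQUATOR * Math.cos(latitudeDegrees * Math.PI / 180);
}

console.log(spanE6(40.5, 41.0));       // half a degree expressed in E6 units
console.log(kmPerLongitudeDegree(0));  // roughly 111 km at the equator
console.log(kmPerLongitudeDegree(60)); // roughly half of that at 60 degrees north
```

On the ground a degree of longitude shrinks towards the poles, while on the map it keeps the same width, which is why a span given only as differences is enough for zooming.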

Below, I implemented a boundary arranger method for MapView. The method takes three arguments: items, hpadding and vpadding. items, as you may guess, is a list of POIs. The other arguments are a little more interesting: hpadding and vpadding are the fractions of padding you would like to leave so that pins don’t appear right on the edges. In the code, hpadding is applied to the latitude span and vpadding to the longitude span. For instance, if you assign 0.1 for hpadding, 10% padding will be left at the top and bottom.

BTW, you’ll have to extend the existing MapView and add this method to your own MapView to use it properly.

public void setMapBoundsToPois(List<GeoPoint> items, double hpadding, double vpadding) {

    MapController mapController = this.getController();

    // Nothing to show for an empty list
    if (items == null || items.isEmpty()) {
        return;
    }

    // If there is only one result,
    // directly animate to that location
    if (items.size() == 1) {
        mapController.animateTo(items.get(0));
    } else {
        // find the lat, lon span
        int minLatitude = Integer.MAX_VALUE;
        int maxLatitude = Integer.MIN_VALUE;
        int minLongitude = Integer.MAX_VALUE;
        int maxLongitude = Integer.MIN_VALUE;

        // Find the boundaries of the item set
        for (GeoPoint item : items) {
            int lat = item.getLatitudeE6(); int lon = item.getLongitudeE6();

            maxLatitude = Math.max(lat, maxLatitude);
            minLatitude = Math.min(lat, minLatitude);
            maxLongitude = Math.max(lon, maxLongitude);
            minLongitude = Math.min(lon, minLongitude);
        }

        // leave some padding from the edges,
        // such as 0.1 for hpadding and 0.2 for vpadding.
        // Compute the paddings before touching min/max so both sides get the same amount.
        int latPadding = (int) ((maxLatitude - minLatitude) * hpadding);
        int lonPadding = (int) ((maxLongitude - minLongitude) * vpadding);

        maxLatitude = maxLatitude + latPadding;
        minLatitude = minLatitude - latPadding;
        maxLongitude = maxLongitude + lonPadding;
        minLongitude = minLongitude - lonPadding;

        // Calculate the lat, lon spans from the given pois and zoom
        mapController.zoomToSpan(Math.abs(maxLatitude - minLatitude),
                Math.abs(maxLongitude - minLongitude));

        // Animate to the center of the cluster of points
        mapController.animateTo(new GeoPoint(
              (maxLatitude + minLatitude) / 2, (maxLongitude + minLongitude) / 2));
    }
} // end of the method

W3C Widgets: The good, the bad and the ugly

It hasn’t been long since ppk wrote about a totally new W3C movement called “Widgets”. A Widget is a downloadable archive of HTML, JavaScript, CSS and a configuration file. It’s a downloadable web front-end. Basically, it’s designed for building mobile apps that avoid the extra network usage of downloading heavyweight pages, CSS and JS. With Widgets, you only consume network traffic for data transmission. Before getting into details, I have to share that, to my knowledge, Opera Mobile is the only browser around with Widgets support.

You can read Vodafone’s tutorial to make a Widget first to have an initial look.

The Good

For many years, I’ve been in a huge debate with people who use the workforce inefficiently with their 35k different platforms and SDKs. Half of all developers have written HTML at least once in their life, and JavaScript has a very large developer base. Every new mobile platform usually re-invents the wheel once again, and this default action is usually driven by business fears.

Widgets make software accessible anywhere you can run a browser. It’s definitely “Write once, run everywhere”. And the complaints about slow page transmission are addressed by running them from local resources.

Widgets will push mobile web browsers to behave more alike as the application base grows. Many of the extensions, such as geo-location APIs, don’t really match each other, and some mobile browsers provide totally non-standard features. If web applications dominate mobile, the community will push browsers to behave better.

It’s easy to get in. You don’t have to download SDKs, learn another language and read documentation/tutorials to learn something new.

The Bad

Performance. Native apps run fast. Even Dalvik-powered Android is horrible and not really responsive compared to other platforms’ applications because of Java. Heavy JS in web browsers does not scale and, just like most other browsers, Safari on iPhone has rendering issues even on local websites.

Forget the advantages of the Web when it comes to releasing software. No on-the-fly updates at all. Software has to be downloaded again and again as new versions are released. Access to the internal platform is questionable. Open platforms like Android provide access to internals such as contact lists, the file system and invoking other applications. If mobile operating system manufacturers can’t agree on providing similar APIs, this won’t work.

The Ugly

I find the old generation of the mobile development community very ill-minded. They use their know-how to make money, and this community is invested in its complex and closed environments.

On the other hand, the only contributor is Opera for now. I’m not really sure whether they are going for a larger market share or not. If an open standard ends up as just another platform for phones with the Opera browser, it’s the same story.