May 04, 2008

Google now crawling forms

A few days ago, I started receiving Google alerts about me (yes, I'm following what's said about me on the web) that were linking to search results pages on my own blog, with strange query terms such as "steve" or "near", "idea" or "known".

Why would such searches show up in Google? Who would link from their website to the search results for these weird words? I tried to find such pages, but failed.

Today I finally found the answer: Google is now crawling through forms, by filling inputs with words it finds on the page containing the form. The intent is to crawl the "deep web" that is not normally accessible through regular links.

I'm not sure how effective this new feature is, since every web site whose product catalog seems to be available to humans only through forms also has semi-hidden link pages to feed the search engines. But it will certainly bring to the surface lots of information that people thought was more or less hidden (not to say protected) behind a form.

Give a box full of keys to a monkey, and it will probably find one that opens your door if you have a weak lock! Inspecting my server logs shows that the current form crawler is not really different from a monkey, trying lots of very common words, 100 per day.

Now the nice thing is that, according to Google's explanation, my blog is a "high-quality site" :-)

April 29, 2008

Speeding up mobile web applications

It's been nearly one month since I started full speed at Goojet, and I have already learned a lot about the mobile web and J2ME.

One of the key issues in the mobile web is latency: connections take a long time to establish, and data transfer rates are low. Not everybody has a 3G phone or an unlimited data plan.

Connection establishment latency can be mitigated in two ways:
  • Use persistent HTTP/1.1 connections. By reusing the same TCP socket for several HTTP requests, you save the non-negligible connection setup overhead. To achieve this, every message that has a body must declare its length, with a Content-Length header or chunked transfer encoding, so that the receiver knows where it ends without closing the connection.
  • Open several connections in parallel when you have several items to load (e.g. images in a page). This might seem contradictory to the above, but a single connection doesn't use all the available bandwidth. There are lots of idle periods during connection establishment and data transfer, spent waiting for handshake packets on the slow link. So by opening several connections and handling them in parallel, more of the available bandwidth is actually used, leading to a shorter overall load time.
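
As a rough illustration of the parallel loading idea, here is a minimal sketch in Java SE (not J2ME, which lacks `java.util.concurrent`). The `ParallelLoader` class, the `loadAll` method and the `fetcher` function are hypothetical names made up for this example; the fetcher stands in for whatever actually performs the HTTP request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: loads several resources with a bounded number of
// simultaneous connections.
final class ParallelLoader {

    // Runs the fetcher for every URL with at most maxConnections requests in
    // flight, and returns the results in the same order as the input list.
    static List<byte[]> loadAll(List<String> urls, int maxConnections,
                                Function<String, byte[]> fetcher) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConnections);
        try {
            List<Callable<byte[]>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<byte[]> results = new ArrayList<>();
            // invokeAll blocks until all tasks are done and keeps task order.
            for (Future<byte[]> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A small pool size (two to four connections) is probably a sensible starting point on a slow link, but the right value is something to measure rather than assume.
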
Having responsive connections is good to speed up transfers, but even better is not having to transfer data at all, both for response time and for the user's phone bill. So HTTP cache headers must be handled with great care, which also means that the J2ME application has to implement an HTTP cache.

Cache headers for static resources like images or CSS files are most often handled by the web server itself, by means of Last-Modified and/or ETag headers. You have to be careful with ETags in server farms though, since some web servers include the file inode in the ETag, which makes its value different on every server for the same replicated resource, thus defeating the intended purpose.

Static resources can also benefit from the Expires header, or better, the HTTP/1.1 Cache-Control header, which avoids the date parsing and clock synchronization issues associated with Expires. Using versioned URLs even makes it possible to set an expiration date so far in the future that it is effectively infinite.

But cache headers for applications are much trickier. There is often no concept of "last modified", and computing an ETag that reflects the state of all the elements that contribute to the response is not always possible, or would lead this concern to creep into all application layers.
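
To make the versioned URL idea a bit more concrete, here is a minimal sketch. The class and method names, the `v` query parameter, the one-year lifetime and the one-hour fallback are all assumptions chosen for illustration, not something taken from a particular server.

```java
// Hypothetical helper for static resources served with far-future caching.
final class StaticCacheHeaders {

    // One year in seconds, used here as the "effectively infinite" lifetime.
    static final long ONE_YEAR_SECONDS = 365L * 24 * 60 * 60;

    // Appends a version marker so that a new release gets a new URL and the
    // old cached copy is simply never requested again.
    static String versionedUrl(String path, String version) {
        String separator = path.indexOf('?') >= 0 ? "&" : "?";
        return path + separator + "v=" + version;
    }

    // Cache-Control value: very long for versioned URLs, short otherwise
    // (the one-hour fallback is an arbitrary assumption).
    static String cacheControlFor(boolean versioned) {
        long maxAge = versioned ? ONE_YEAR_SECONDS : 3600;
        return "public, max-age=" + maxAge;
    }
}
```

Because max-age is a relative duration, the client does not need to parse a date or have a clock that agrees with the server's.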

So I've taken a different approach in the Goojet backend: when receiving the first byte of the response body, a servlet filter checks whether the application has set cache headers. If not, the response is buffered so that we can compute a hash code once the application has finished producing it. We could have used MD5 for the hash, but it's a bit costly to compute and we don't need something cryptographically secure. So we use a 64-bit FNV-1 hash, which is very fast to compute and has a low collision rate, even for small changes in the data.
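
For reference, a 64-bit FNV-1 hash fits in a few lines. The sketch below is not the actual Goojet filter code: the `Fnv1` class and the `etagFor` helper are hypothetical names, and turning the hash into a quoted hexadecimal ETag is an assumption about how such a value could be exposed.

```java
// Minimal 64-bit FNV-1 sketch (multiply first, then XOR each byte).
final class Fnv1 {

    // Standard FNV 64-bit parameters.
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    static long hash64(byte[] data) {
        long hash = OFFSET_BASIS;
        for (byte b : data) {
            hash *= PRIME;        // overflow wraps modulo 2^64, as FNV requires
            hash ^= (b & 0xff);   // treat the byte as unsigned
        }
        return hash;
    }

    // Hypothetical helper: formats the hash of a buffered response body
    // as a strong ETag value, i.e. a quoted string.
    static String etagFor(byte[] body) {
        return "\"" + Long.toHexString(hash64(body)) + "\"";
    }
}
```

In a filter, `etagFor` would be called on the buffered body once the application has finished writing, just before the headers are committed.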

The result is that even for highly dynamic responses, we are able to provide cache headers that allow the mobile application to issue conditional requests and download data only when actually needed.

All these techniques combined really make a difference: the application is more responsive, and the phone bill is lower!
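
The server side of such a conditional request boils down to comparing the ETag sent back by the client in If-None-Match with the one just computed. Here is a minimal sketch of that decision; the `ConditionalRequests` class and `statusFor` method are hypothetical names, and the comma splitting assumes ETags that contain no commas (true for hexadecimal hashes).

```java
// Hypothetical helper deciding between a full response and "304 Not Modified".
final class ConditionalRequests {

    static final int OK = 200;
    static final int NOT_MODIFIED = 304;

    // ifNoneMatch is the raw request header value (null if absent),
    // currentEtag is the quoted ETag computed for the current response.
    static int statusFor(String ifNoneMatch, String currentEtag) {
        if (ifNoneMatch == null) {
            return OK;
        }
        // The header may carry several comma-separated tags, or "*".
        for (String candidate : ifNoneMatch.split(",")) {
            String tag = candidate.trim();
            if (tag.equals("*") || stripWeak(tag).equals(stripWeak(currentEtag))) {
                return NOT_MODIFIED; // the client copy is current: send no body
            }
        }
        return OK;
    }

    // If-None-Match uses weak comparison, so the "W/" prefix is ignored.
    private static String stripWeak(String tag) {
        return tag.startsWith("W/") ? tag.substring(2) : tag;
    }
}
```

Note that the server still has to produce the response in order to hash it; what is saved is the transfer over the slow mobile link.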

April 05, 2008

Goodbye Joost, hello Goojet

I have been working at Joost for the last two years as architect of backend systems and tech leader of the backend development team. I started there at a time when the architecture of the system was a blank page. Exciting times, which allowed me to use my creative thinking and build amazing stuff with an amazing team. But also exhausting times, which left me quite burned out.

For the last 6 months or so, Joost has been undergoing many changes: organisational changes, strategy changes (not yet publicly visible), and geographical changes, concentrating the previously distributed teams and pushing west towards the US.

So all things considered, it was time for me to move on, and I was helped in that by a startup in Toulouse that I have known since its inception, because Anyware was part of its development team. And since that inception, they have wanted to have me on board.

So here it is: yesterday was my last day as a backend systems architect at Joost, and Monday will be my first day as the CTO of Goojet.

Goojet is a widget platform and social network targeting mobile phones. Contrary to other social networks, Goojet focuses on the collaboration between you and your contacts rather than on you exposing or broadcasting information to your contacts. The world is made of interactions, not only of yelling at the masses.

My role there will be what I do best and like most: being a think tank, architecting stuff, using my ability to synthesize and to teach to help the business and dev teams understand each other, developing some of the tricky parts and generally giving technical guidance and advice to the whole team. I also plan to participate in the nascent or ongoing standardization efforts in the widgets and mobile web domains, but I will first be heads down pushing our first public release out.

The really new technology for me at Goojet is mobile phones: I've been playing with J2ME in my free time for a few weeks, and it is kind of refreshing to work in a very constrained environment. Every line of code counts, no big framework, no high-level abstractions. And then there is device fragmentation, which requires careful engineering and testing. But it's fun!

So I'm pretty excited by this new job. I know it will be hard and demanding, but well, it seems quiet and easy jobs are not for me.

April 02, 2008

Cluster computing commoditization

I came across an interesting report on the first Hadoop Summit that happened last week. Hadoop (an open-source implementation of Google's distributed MapReduce infrastructure) is gaining a lot of steam, and there are now higher-level open source projects emerging on top of it that I wouldn't even have dreamed of a few years ago.

This is a perfect example of commoditization at work, and actually of multiple commoditization trends nurturing each other: the physical cluster infrastructure is being commoditized by people like Amazon with EC2 (now with static IP!). Setting up your own cluster, which previously required careful planning, good sys admins and a lot of money, is now a few mouse clicks away, and Hadoop provides the software infrastructure to easily use this computing power for complex processing of large datasets.

And now that the basic blocks are in place, people are looking at the higher levels: Mahout builds the machine learning tools (classification, clustering, etc.) that are so useful when you have to process large sets of user profiles and activity logs to give value to any social network website, HBase stores huge amounts of data in a cluster, and high-level query languages like Pig or Jaql are appearing. Which higher level will be addressed next?

When I introduced Hadoop at Joost more than 18 months ago to process the user activity logs, it was still very young and a bit shaky. It's nice to see it maturing quickly and getting so much interest.

For sure not every project needs that, but when you're building a social website you have to consider this kind of tool to face the explosive growth that you expect to see happen. It doesn't always happen of course, but if it does you have to be prepared for it!

March 03, 2008

Novillero, our new (big) pet

My wife Claire has been horse-riding for years, and wanted her own horse for her 40th birthday. That was 3 weeks ago, and the birthday gift arrived almost on time a few days ago. It takes some time to find a good match between the rider and the horse!

So we have a new pet (sort of): Novillero, a 7-year-old Lusitanian, a breed from Portugal. He is a very responsive horse, which will allow Claire to make a lot of progress, but is also very stable and forgiving.

Of course the horse won't live in our garden! Even if quite large by French standards, our 2,500 m² are way too small for him! He will live in a horse club 10 minutes away from our home, where he will also have a social life with other animals of his kind.

I'm not sure I will ever sit on his back though, since I started learning how to ride only a couple of months ago and this kind of horse requires an experienced driver :-)

February 26, 2008

A new Ant/Ivy committer!

My dear colleague and friend Nicolas Lalevée has been elected committer on Ant/Ivy, the dependency management tool that allows the use of Maven repositories without the pain of Maven, for his work on the Eclipse plugin for Ivy.

I've been working with Nico for two years, and he's one of the main guys behind the Joost search engine, powered by Lucene, another popular Apache project.

Congrats and welcome to the big Apache tribe!

February 08, 2008

The hidden gem in the Wii remote

Did you know that the Wii remote contains an infrared camera that reports the points it tracks at a 1024x768 resolution? With this knowledge and a bunch of IR-emitting LEDs and tape, Johnny Chung Lee, a student at Carnegie Mellon, has done some truly amazing things. Multi-touch screen, Minority Report-like interaction, 3D scenes controlled with your head movements, etc.

Oh, and of course Mac addicts should try Darwiin-remote to use the Wiimote as the Mac's mouse (or whatever you want). What, you didn't know that the Wiimote is also a Bluetooth device?

If you don't have a Wii, the remotes are sold separately and are quite cheap. Kudos to Nintendo for having invented such an amazing device!