Tuesday, September 29, 2009

The Web Knows

We all get this weird feeling sometimes that the Web knows a lot about us, but there's no way to know for sure because the Web never tells how much it knows. In my case, it knows what magazines I read, what friends I share my photos with, where I fly, what food and clothes I buy, where I work, what movies I watch, what songs I like, what blogs I write... the list goes on until suddenly I shout, "Jesus, the Web knows more about me than my mother!" But then I calm down, knowing that it's not going to tell anybody, at least not in the near future.

However, Sir Tim Berners-Lee thinks otherwise: he believes that the Web will reveal everything once we start to ask in a structured manner. Twenty years ago, as a frustrated software engineer, Tim Berners-Lee invented the World Wide Web. Now he is frustrated again, this time with how the Web has evolved. As the director of the W3C, he is evangelizing the idea of Linked Data and the Semantic Web. What we would have, then, is a "Web of data" rather than a "Web of documents". And I couldn't agree more with Tim: the current Web, however useful, is still a mesh of incoherent, incongruent and highly unstructured data that is bound to be replaced by Linked Data. If you are fumbling with the idea of how Linked Data is going to reframe the next Web, Sir Tim's talk on Linked Data (on TED.com) is a must-watch for everyone who wants to know where the Web is headed. He was pivotal in creating the W3C design notes for Linked Data, the key points of which are:


  • Use URIs to identify things that you expose to the Web as resources.
  • Use HTTP URIs so that people can locate and look up (dereference) these things.
  • Provide useful information about the resource when its URI is dereferenced.
  • Include links to other, related URIs in the exposed data as a means of improving information discovery on the Web.
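To make the "look up" step concrete, here is a minimal sketch in Python (standard library only) of what dereferencing a Linked Data URI involves. It uses DBpedia's URI for Berlin purely as an illustrative resource; the same pattern applies to any HTTP URI that serves RDF through content negotiation.

    import urllib.request

    # A Linked Data URI identifying a real-world thing (the city of Berlin).
    # DBpedia is used here purely as an illustrative example.
    uri = "http://dbpedia.org/resource/Berlin"

    # Ask for machine-readable data instead of an HTML page (content negotiation).
    request = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

    with urllib.request.urlopen(request) as response:
        rdf = response.read().decode("utf-8")

    # The returned RDF describes the resource and, crucially, links it to other
    # URIs, which a client (human or machine) can follow next.
    print(rdf[:500])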

Technically nothing new is going on here, but logically the world is changing. However, the transformation is easier said than done. There are countless websites with unstructured data which, no doubt, amass valuable information that can't be ignored. Currently there are two possible ways of integrating this data with the Semantic Web: one is for websites themselves to expose their data through web services; the other is to scrape those websites to collect and organize the data. The first option, besides being more reliable, is also easier to implement. Later in this article we will see how web scraping really works and what problems confront it. The image below depicts the many websites that have opened up web services and currently participate in the Semantic Web as datasets; the number of such datasets is growing rapidly.

[Image: the Linking Open Data cloud, showing the datasets currently published and interlinked on the Web]

So the "Web of data", as some call it Web 3.0, will eventually encourage web sites to expose themselves as Web Services. And we are now witnessing such services already surfacing on the horizon with giants like Google, Yahoo, Amazon and Thomson Reuters joining the bandwagon. Lets us take a brief look at some of these exciting webservices.


My personal favorite is OpenCalais, probably the best current example of Linked Data, the type of structured data recommended by Sir Tim. The OpenCalais API was launched in Feb '08 by the international business and financial news giant Thomson Reuters. The reason I favor OpenCalais is the ease with which Linked Data can be generated: users pass unstructured HTML (or plain text) to the API, and it comes back as semantically marked-up data. The linking is most profound in categories such as 'people', 'places' and 'companies', among a few others. This way, third-party applications and sites can build interesting new things from that data, which is one of the defining principles of Linked Data.
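For the curious, a call to OpenCalais is roughly as simple as the Python sketch below. The endpoint, header names and output format here are assumptions recalled from the OpenCalais REST documentation, so verify them against the current docs; you also need a free license key from opencalais.com.

    import urllib.request

    API_KEY = "your-opencalais-license-key"  # free registration required
    ENDPOINT = "http://api.opencalais.com/tag/rs/enrich"  # assumed endpoint

    text = "Thomson Reuters launched the OpenCalais API to mark up news text."

    request = urllib.request.Request(
        ENDPOINT,
        data=text.encode("utf-8"),
        headers={
            "x-calais-licenseID": API_KEY,   # assumed header name
            "Content-Type": "text/raw; charset=UTF-8",
            "outputformat": "xml/rdf",       # ask for semantically marked-up RDF
        },
    )

    with urllib.request.urlopen(request) as response:
        # The response is RDF in which the people, places and companies found
        # in the text are identified by resolvable URIs, i.e. Linked Data.
        print(response.read().decode("utf-8"))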

Thomson Reuters is not alone. In May '09, Wolfram Research launched a "computational knowledge engine" called Wolfram|Alpha, which is not the Google killer some predicted it would be. With its search-engine-like interface, Wolfram|Alpha serves natural language queries like Google does, but it also performs some interesting computation on the retrieved data. Wolfram|Alpha is more inclined towards consuming structured data than generating it, and it is one of the few existing products that mark the beginning of an era in which machines will consume human-generated content.

Not quite coincidentally, also in May '09, Google added a new feature to its core search called 'rich snippets', which is built on structured data. This feature shows a little more useful information about the pages in a result by reading structured data formats such as microformats and RDFa. Although this markup is not widespread yet, given the wide reach of Google it is surely good news for the development of the Semantic Web.
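To see why such markup matters, consider a page fragment that labels a review with the hReview microformat. The snippet and the few lines of Python below (using the BeautifulSoup library; the page content itself is made up) show how trivially a machine can then extract the facts.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # A hypothetical page fragment using the hReview microformat: the review
    # data is labeled with agreed-upon class names, not arbitrary layout markup.
    html = """
    <div class="hreview">
      <span class="item"><span class="fn">Hotel Seaview</span></span>
      Rating: <span class="rating">4.5</span> out of 5
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    review = soup.find(class_="hreview")
    item = review.find(class_="item").get_text(strip=True)
    rating = review.find(class_="rating").get_text(strip=True)

    # Because the semantics are explicit, no guessing about layout is needed.
    print(item, rating)  # -> Hotel Seaview 4.5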


The above three examples are a clear indication that structured data is rapidly becoming a feature of today's and tomorrow's Web. Players like Thomson Reuters and Google are encouraging the generation of structured data, and products like Wolfram|Alpha will make use of structured data in ways we perhaps can't imagine right now. Linked Data can also help businesses grow by expanding their user base and making their data more accessible. This is evident from Amazon's visionary WebOS strategy. Amazon has released a number of developer-friendly APIs to expose its infrastructure. One of the interesting web services opened up by Amazon is the E-Commerce Service, which allows access to Amazon's product catalog. Third-party developers can use this feature-rich API to search the catalog and manipulate wish lists and shopping carts. Making this API completely free makes perfect business sense for Amazon, as the applications developed on top of it drive user traffic back to Amazon: the web service returns items with Amazon URLs.
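As a flavor of what that looks like, here is a hedged sketch of an E-Commerce Service ItemSearch request in Python. The parameter names follow the REST style of the API as I recall it and should be treated as illustrative; a real request requires a registered access key (and, in later versions of the API, a request signature).

    import urllib.parse
    import urllib.request

    # Illustrative only: a real request needs a registered AWSAccessKeyId and,
    # in later versions of the API, a cryptographic signature.
    params = {
        "Service": "AWSECommerceService",
        "Operation": "ItemSearch",
        "SearchIndex": "Books",
        "Keywords": "semantic web",
        "AWSAccessKeyId": "your-access-key",  # placeholder
    }

    url = "http://webservices.amazon.com/onca/xml?" + urllib.parse.urlencode(params)

    with urllib.request.urlopen(url) as response:
        # The XML result lists matching items, each carrying an Amazon
        # detail-page URL, which is exactly how the API funnels traffic
        # back to Amazon.
        print(response.read().decode("utf-8")[:500])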

Despite the evident benefits of web services, some sites will choose not to expose their data through them, and this will force third-party developers to deploy scrapers to collect the data from such websites. Web scraping is more or less the reverse engineering of HTML pages, and it has the disadvantages of any other reverse engineering technique. It essentially parses chunks of information out of a page. The problem with scraping pages coded in HTML is that the actual data is mingled with layout and rendering information and is not readily available to a computer. For a scraper program to get the data back out of a given HTML page, it first has to learn the details of that particular markup and figure out where the actual data sits. With such a scraper it is possible, for example, to discover which URLs are tagged with a given tag, but the result may not be as accurate as what a web service would return.
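A toy scraper makes the fragility obvious. In the sketch below (Python with BeautifulSoup; the page fragment and its class names are made up), the interesting data has to be dug out of purely presentational markup, and the extraction logic breaks the moment the site changes its layout.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # A made-up page fragment: the interesting data (a product name and price)
    # is buried in markup that exists purely for layout and styling.
    html = """
    <table class="layoutGrid"><tr>
      <td><div style="font: bold 14px Arial">Acme Widget</div></td>
      <td><font color="#990000">$19.95</font></td>
    </tr></table>
    """

    soup = BeautifulSoup(html, "html.parser")

    # The scraper has to "know" that the name lives in the first <div> and the
    # price in a <font> tag: knowledge reverse-engineered by a human reading
    # this one site's HTML, and invalidated by any redesign.
    name = soup.find("div").get_text(strip=True)
    price = soup.find("font").get_text(strip=True)
    print(name, price)  # -> Acme Widget $19.95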

Compared to scrapers, web services offer numerous advantages. To name a few: websites keep control over their data and can track its usage, along with granular details such as how the data is used and by whom. Following Amazon's lead, other sites can do this in a way that encourages third-party developers to build applications which will eventually drive traffic back to their sites.


In the past, websites were very conservative about the data they owned, believing that closed data gave them a competitive advantage. However, people have started to realize that opening up their data can open new business possibilities. Amazon, a pioneer in this change, has already proved that giving data away, or charging very little for it, can indeed increase revenue, as more traffic is directed to its sites through non-Amazon applications.


In the future, websites will have to act as databases for other applications, though how they will do it is still unclear. More or less, websites will transform into web services. However, web service APIs may not be available for every site, and this will fuel the spread of scraper programs. Some sites will fail to notice this change and will pay the price for it. Only those who understand and appreciate the importance of the Semantic Web will survive to see the dawn of "Web 3.0".
