Hadoop on Azure and importing data from the Windows Azure Marketplace

In a previous post I briefly introduced Hadoop and gave an overview of its components.  In this post I’ll put my money where my mouth is and actually do something with Hadoop on Azure.  Again, one step at a time.

So I went to the Windows Azure Marketplace, found some free demographics data published by the UN to mess around with, and imported it into my Hadoop on Azure cluster via the portal.  Before you blindly go about importing data, you can first sample the goods: in the Marketplace you can subscribe to published data and build the query you are interested in.  The most important thing here is to take note of your primary account key and the query URL.

Take note of the query and passkey.  Click on Show.

Sample the goods, build your query.

Once we have the query we want, we go to the Hadoop on Azure portal, select Manage Cluster and then select DataMarket.  Here we have to enter our user name (from your email), the passkey obtained earlier, the query URL, and a name for the Hive table so we can access the data after the import is done.  Note: I’ve replaced the encoded spaces and quotation marks in the query URL with the plain characters to avoid Bad Request errors.  I ran into these because I copied and pasted the query straight from the Marketplace, and it took a couple of tries until I figured it out, oh well.  Run the import by selecting Import Data.
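If you would rather not fix the encoded characters by hand, here is a minimal sketch of the same cleanup in Python.  The URL below is a made-up placeholder, not the real query from the Marketplace, and the function name is mine; the only thing it does is turn sequences like %20 and %27 back into a space and a quotation mark.

```python
from urllib.parse import unquote


def clean_query_url(raw_url: str) -> str:
    """Decode percent-encoded characters (such as %20 and %27) in a copied query URL."""
    return unquote(raw_url)


# Hypothetical query URL, for illustration only.
raw = "https://example.invalid/Data/Values?$filter=CountryName%20eq%20%27Argentina%27&$top=100"
cleaned = clean_query_url(raw)
print(cleaned)
```

Paste the decoded string into the query field instead of the one copied from the browser.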

Now we can go to the Hive interactive console and take a look at the results.  We can type the show tables command for a list of the tables, and make sure ours is there.

Keep in mind that, although this looks like a table, has columns like a table, and can be queried as if the data were stored in a table, it is not one in the relational database sense.  When we create tables in HiveQL, partition them and load data, what actually gets created and stored are HDFS files (replicated and distributed across the nodes).
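To make the "it's really just files" point concrete, here is a rough sketch of what reading one row of such a file could look like.  It assumes the table is stored as plain text with Hive's default field separator (Ctrl-A, or \x01); I have not checked how the DataMarket import actually stores the data, and the column names and the sample line are made up.

```python
FIELD_DELIMITER = "\x01"  # Hive's default field separator for text tables (Ctrl-A)


def parse_row(line, columns):
    """Split one line of a delimited text file into a column-name -> value dict."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    return dict(zip(columns, values))


# Hypothetical column names and a made-up line, for illustration only.
columns = ["country", "year", "population"]
line = "A\x012000\x0110\n"
row = parse_row(line, columns)
print(row)
```

The column names and types live in Hive's metadata; the file itself only holds the delimited values, which is why every value comes back as a string here.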

Now we can go on and type our HiveQL query.  Remember what’s happening under the hood.  This query is creating and executing a MapReduce job.  I’m also a newbie in the MapReduce world, so I trust Gert Drapers when he says a simple join in HiveQL is equivalent to writing a much more complex bunch of code, and that Facebook’s jobs are mostly written in HiveQL.  That means something, doesn’t it?
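To get a feel for what "under the hood" means, here is a toy, single-machine sketch of the map, shuffle and reduce steps behind a query along the lines of SELECT country, SUM(population) ... GROUP BY country.  This is not what Hive generates, just the general idea; the rows, column names and numbers are made up, and a real job would run these phases in parallel across the cluster.

```python
from collections import defaultdict

# Toy rows standing in for an imported table: (country, year, population).
# The values are made up for illustration only.
rows = [("A", 2000, 10), ("B", 2000, 20), ("A", 2010, 15), ("B", 2010, 25)]


def map_phase(row):
    """Emit a (key, value) pair per row: the grouping column and the value to sum."""
    country, _year, population = row
    yield country, population


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single aggregate."""
    return key, sum(values)


mapped = [pair for row in rows for pair in map_phase(row)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(totals)
```

One line of HiveQL versus three functions plus the plumbing around them, and that is before adding a join.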

So we’ve executed our simple HiveQL query and seen the results.  We can always go back to the job history for that query, too (you can’t miss that big orange button on the main page).

Job History.

So, we’ve imported data from the marketplace and run a simple HiveQL query.  In a future post we can go through the samples that are included when you set up your cluster and mess around with the JavaScript console.  Refer to the previous post for additional links and resources.

Happy Learning.
