All posts by Alejandro Jezierski


Azure Event Hubs, the Thing and the Internet

I have things, and I have Internet. Now I want to Internet my things…

I recently started playing with Azure Event Hubs and wanted to do a basic IoT sample with a real device and learn lots of new stuff. If you haven’t heard of Azure Event Hubs before, Event Hubs is a “highly scalable publish-subscribe ingestor that can intake millions of events per second so that you can process and analyze the massive amounts of data produced by your connected devices and applications.”

In this post, the objective is to analyze Kinect sensor readings in real time using Azure Event Hubs and monitor meaningful events in a website.

The scenario

I want to know in real time(ish) when a security breach is produced. In my fictitious scenario, a security breach is produced when someone gets too close to a person (have you seen The Island?) or a guarded object (if movies teach us anything, it’s that a laser grid alone won’t cut it. I don’t have access to a laser grid, either). Ok, so basically, I need to walk around with a Kinect strapped to my chest, or strap a Kinect to a Fabergé egg.

I set out to do the following:

  • Dump the Kinect’s depth sensor readings in an event hub.
  • Analyze the stream of events using a worker role in real time (using the EventProcessorHost approach) and produce a proximity alert whenever I get too close to the sensor (say, 50cm).
  • Display the proximity alerts in a monitoring website as they occur.

You can check out the progress here. It will require that you have a Kinect, the Kinect SDK (1.8) and a Microsoft Azure account.

The big picture



The Sample Solution

I’ll give a brief overview of the projects in the sample.

The DepthRecorder project: It is actually the Depth-Basics sample of the Kinect SDK with a little twist. Every time a frame is received, I obtain the distance of the closest pixel in the frame, prepare a new KinectEvent object and send it to the event hub. At roughly 30fps, it’s fun to watch how events quickly start to accumulate. (Check out the Service Bus Explorer tool, too!)
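The per-frame logic is small enough to sketch. Here is a rough Python outline of what the twist amounts to (the actual sample is C#; the KinectEvent field names and payload shape below are my assumptions, not the sample’s):

```python
import json
from datetime import datetime, timezone

def closest_depth_mm(depth_frame):
    """Return the smallest valid depth reading in a frame, in millimeters.

    The Kinect reports 0 for pixels it cannot resolve, so those are skipped.
    """
    valid = [d for d in depth_frame if d > 0]
    return min(valid) if valid else None

def build_kinect_event(sensor_id, depth_frame):
    """Serialize a hypothetical KinectEvent as JSON, ready to post to the event hub."""
    return json.dumps({
        "sensorId": sensor_id,
        "minDepthMm": closest_depth_mm(depth_frame),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
```

At ~30 frames per second, each call to `build_kinect_event` corresponds to one message sent to the hub, which is why the events pile up so quickly.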

The Cloud project: This project contains two worker roles. The EventsConsumer worker is responsible for analyzing the stream of events and producing a proximity alert whenever an event reports a depth reading of 50 cm or less. As alerts appear, the worker role will dump them in a Service Bus queue. The AlertsListener worker is responsible for reading the alerts queue and pushing alert messages down to the monitoring website (using SignalR).
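Stripped of the EventProcessorHost plumbing, the EventsConsumer’s core check boils down to something like this (a Python sketch; the real worker is C#, and the event field names are my assumptions):

```python
ALERT_THRESHOLD_MM = 500  # 50 cm, the distance that triggers a proximity alert

def proximity_alerts(events):
    """Scan a stream of event dicts and yield an alert for every reading
    of 50 cm or less. In the real worker, `events` would be the batch
    handed over by EventProcessorHost, and each yielded alert would be
    dumped into the Service Bus alerts queue."""
    for event in events:
        depth = event.get("minDepthMm")
        if depth is not None and depth <= ALERT_THRESHOLD_MM:
            yield {"sensorId": event.get("sensorId"), "depthMm": depth}
```

Keeping the check this small is what makes it cheap to run over a high-volume stream.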

The Web project: It’s an MVC project that displays the alerts on screen as they occur. It receives SignalR messages and shows a big red alert. For the time being, it won’t distinguish alert sources, but we’ll work on that :)


Work in progress

This is probably not the simplest approach, but I wanted to separate and decouple components as much as possible, because I wanted to use queues and because it’s fun :). For simplicity’s sake, we could probably get rid of the alerts queue and the alerts listener worker, and push alerts directly to the website.


Further reading

Check out the following links for further reading:

Cloud Cover Episode 160

ScottGu’s blog post with links to more information

Clemens Vasters’ session at TechEd


So there you go. Happy Learning!


Patterns & Practices: Cloud Design Patterns

The Patterns & Practices team released “Cloud Design Patterns”, a guide that provides solutions to problems that are relevant to the cloud.  It discusses a set of patterns and guidance topics in their proper context, providing code samples or snippets to get a sense of a concrete implementation.

We’ve had the privilege of being part of this project, working with great, passionate people.  The guide is available on msdn.

Cloud Design Patterns Guide

Design Patterns and Guidance Index

Codeplex Project Site

Happy Reading!

Cloud Design Patterns, new drop on Codeplex

We made a new drop of the Cloud Design Patterns book on CodePlex.

The drop includes the following patterns and related guidance:


This is still work in progress, and in future releases we will include new patterns and guidance, as well as some code samples to get you started.

Happy reading!

Cloud Application Patterns on Windows Azure – get heard!

We are working together with Microsoft’s Patterns and Practices team creating a new guide in the Windows Azure Guidance Series called “Cloud Application Patterns on Windows Azure”.  The guide will discuss guidance and a wide range of patterns that are relevant to the cloud, dipped in Windows Azure sauce, and, contrary to popular belief, the secret is NOT in the sauce, the secret lies within YOU.

We are currently conducting an online survey, gauging interest and gathering feedback  on an initial list of patterns that we plan to tackle in the guide.  Feel free to go over the list and tell us what you think, get heard!

We’ll have:

  • Patterns for Improving Performance and Scalability
  • Patterns for Data Management
  • Patterns for Configuration and Deployment
  • Patterns for Managing Resilience
  • Patterns for Monitoring and Metering
  • Patterns for Security and Isolation
  • and most important, patterns suggested by you!

Thank you and we appreciate your feedback.

Happy voting!

The elephant, the cloud and the cmdlet

The Hadoop SDK for .NET is evolving rapidly.  I first checked it out a few months ago and the progress since then has been amazing.  It now contains several client libraries (WebHDFS, WebHCat (aka “Templeton”), submission and management of Oozie workflows, Ambari management, and more).

As part of our current project, while looking for alternatives when provisioning an HDInsight cluster, we found that the SDK team had recently released a set of cmdlets to help accomplish that task. Perfect timing.

I must confess, I was going to describe the steps I followed, but then I found this wiki page and decided to keep things simple.  These are really simple instructions on how to use the cmdlets.  The only difference is that I found this page AFTER I played with them, so I used the compiled source, not the direct download; no big deal.

The command is simple, and you can see the corresponding result in the portal.
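For reference, a provisioning call looked roughly like this with the CTP-era cmdlets (the values are placeholders, and the parameter names are from memory; double-check them against the wiki page — this is a sketch, not the exact command I ran):

```powershell
# Placeholders throughout; substitute your own names, keys and credentials.
$creds = Get-Credential
New-AzureHDInsightCluster -Name "mycluster" -Location "East US" `
    -DefaultStorageAccountName "mystorage.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName "mycontainer" `
    -Credential $creds -ClusterSizeInNodes 4
```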

There’s also an option to specify a Windows Azure SQL Database as your metastore.  This gives you the flexibility of persisting all your Hive and Oozie metadata across cluster lifetimes.  So, if you destroy your cluster and then decide you need a new one, you don’t have to recreate all your Hive table definitions; you simply specify your SQL Database as the metastore again.
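A sketch of what that could look like with the cmdlets (I believe the metastore is attached through a cluster config pipeline, but treat the cmdlet and parameter names below as my recollection, to be verified):

```powershell
# Hypothetical names; the SQL Database holds Hive/Oozie metadata across cluster lifetimes.
New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Add-AzureHDInsightMetastore -SqlAzureServerName "myserver.database.windows.net" `
        -DatabaseName "hivemetastore" -Credential $sqlCreds -MetastoreType HiveMetastore |
    New-AzureHDInsightCluster -Name "mycluster" -Location "East US" -Credential $creds
```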

In addition to the cmdlets, there’s also a NuGet package (Microsoft.WindowsAzure.Management.HDInsight) with what you need to provision, list and tear down your clusters.  You can find more information here.

So go and check out the awesome progress on the Hadoop SDK for .NET and get coding with the latest features.

Happy hdinsightcmdletting

Introduction to HCatalog, Pig scripts and heavy burdens

As an eager developer craving Big Data knowledge, you’ve probably come across many tutorials that explain how to create a simple Hive query or Pig script; I have.  Sounds simple enough, right?  Now imagine your solution has evolved from one ad-hoc query to a full library of Hive, Pig and MapReduce scripts, scheduled jobs, etc.  Cool!  The team has been growing too, so now there’s an admin in charge of organizing and maintaining the cluster and stored data.

The problem

One fine day, you find none of your jobs are working anymore.

Here’s a simple Pig script.  It’s only a few lines long, but there’s something wrong in almost every line of this script, considering the scenario I just described:

raw = load '/mydata/sourcedata/' using CustomLoader()
    as (userId:int, firstName:chararray, lastName:chararray, age:int);

filtered = filter raw by age > 21;

store filtered into '/mydata/output/';

Problem 1: you’ve hardcoded the location of your data. Why would you assume the data will always be in the same place?  The admin found a better way of organizing data and restructured the whole file system; didn’t you get the memo?

Problem 2: you’ve been blissfully unaware of the fact that you’ve given your script the responsibility of knowing the format of the raw data, too, by using a custom loader.  What if you are not in control of the data you consume?  What if the provider found an awesome new format and wants to share it with their consumers?

Problem 3: your script is also responsible for handling the schema for the data it consumes.  Now the data provider has decided to include a new field, remove a few others, and you wish you’d never gotten into this whole Big Data thing.

Problem 4: last but not least, several other scripts use the same data, so why would you have metadata-related logic duplicated everywhere?

The solution

So, the solution is to go back, change all your scripts to work with the new location, new format and new schema, and look for a new job.  Thanks for watching.

OR, you can use HCatalog.

HCatalog is a table and storage management layer for Hadoop.  So, where before you had something like this (note that you would need to write custom loaders for every new format you use):

now you have this:

See how Pig, Hive and MapReduce no longer need to know about data storage or format; they only care about querying, transforming, etc.  If we used HCatalog, the script would look something like this:

raw = load 'rawdatatable' using HCatLoader();

filtered = filter raw by age > 21;

store filtered into 'outputtable' using HCatStorer();

1. your script no longer knows (or cares) where the data is.  It just knows it wants to get data from the rawdatatable table.

2. your script no longer cares what format the raw data is in; that is handled by HCatalog, and we use the HCatLoader for that.

3. no more defining the schema for your raw data in your script, and no more schema duplication.

4. other scripts, for example Hive queries, can now share one common, general structure for the same data.  Sharing is good.
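For instance, a Hive query can read the very same table the Pig script loads, with no schema restated (using the hypothetical rawdatatable from the script above):

```sql
-- Hive resolves rawdatatable's location, format and schema through HCatalog,
-- just like Pig's HCatLoader does.
select firstName, lastName, age
from rawdatatable
where age > 21;
```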

Remember, when you set up your HDInsight cluster, remember HCatalog, and let that script do what it was meant to do: perform awesome transformations on your data, and nothing more.

For more information, you can visit HCatalog’s page.

Another cool post by Alan Gates

Happy learning!

Developing Big Data Solutions on Windows Azure, the blind and the elephant

What is a Big Data solution to you?  Whatever you are thinking of, I cannot think of a better example than the story of the blind and the elephant.  “I’m a dev, it’s about writing some Map/Reduce awesomeness”, or “I’m a business analyst, I just want to query data in Excel!”, “I’m a passer-by, but whatever this is, it’s yellow”… and so on.

I’m currently working with Microsoft’s Patterns and Practices team, developing a new guide in the Windows Azure Guidance Series called “Developing Big Data Solutions on Windows Azure”.

The guide will focus on discussing a wide range of scenarios, all of which have HDInsight on Windows Azure as a key player, the related technologies from the Hadoop ecosystem and Microsoft’s current BI solution stack.

We just made our first early preview drop on CodePlex, so you can take a peek and see how the guide is shaping up so far.

So go get it, have a read and tell us what you think, we appreciate your feedback!

Hadoop on Azure and importing data from the Windows Azure Marketplace

In a previous post I introduced Hadoop and gave a brief overview of its components.  In this post I’ll put my money where my mouth is and actually do stuff with Hadoop on Azure.  Again, one step at a time.

So I went to the market and found some free demographics data available from the UN to mess around with, and imported it to my Hadoop on Azure cluster via the portal.  Before you blindly go about importing data, you can first sample the goods.  In the Windows Azure Marketplace you can subscribe to published data and build a query of your interest.  The most important thing here is that you need to take note of your primary account key and the query URL.

Take note of the query and passkey.  Click on Show.

Sample the goods, build your query.

Once we have the query we want, we need to go to our Hadoop on Azure portal, select Manage Cluster and then select DataMarket.  Here we have to input our user name (from the email), the passkey obtained earlier, the query URL, and the name of the Hive table so we can access the data after the import is done.  Note: I’ve replaced the encoded space and quotation marks to avoid Bad Request errors.  This happened to me because I copied and pasted the query right out of the marketplace.  It took a couple of tries until I figured it out, oh well.  Run the query by selecting Import Data.
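If you want to sanity-check the pasted string rather than fix it by eye, un-escaping it shows what the query actually says (a stdlib Python sketch; in my case I simply replaced the offending characters by hand):

```python
from urllib.parse import unquote

def decoded_query(raw_query_url):
    """Turn percent-escapes copied from the marketplace (%20 for a space,
    %27 for a quotation mark, and so on) back into readable characters."""
    return unquote(raw_query_url)
```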

Now we can go to the Hive interactive console and take a look at the results.  We can type the show tables command for a list of the tables, and make sure ours is there.

Take into consideration that, although this looks like a table, has columns like a table, and we can query the data as if it were stored in a table, it’s not.  When we create tables in HiveQL, partition them and load data, HDFS files are actually created and stored (and replicated and distributed across the nodes).

Now we can go on and type our HiveQL query.  Remember what’s happening under the hood: this query is creating and executing a MapReduce job.  I’m also a newbie in the MapReduce world, so I trust Gert Drapers when he says a simple join in HiveQL is equivalent to writing a much more complex bunch of code, and that Facebook’s jobs are mostly written in HiveQL.  That means something, doesn’t it?
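To give a feel for the contrast, a join like the one Gert describes is a couple of lines of HiveQL (the table and column names here are hypothetical, not the actual UN dataset’s):

```sql
-- Two lines of HiveQL versus pages of hand-written MapReduce code.
select c.countryname, d.year, d.population
from demographics d
join countries c on (d.countrycode = c.countrycode);
```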

So we’ve executed our simple HiveQL query and seen the results.  We can always go back to the job history for that query, too (you can’t miss that big orange button in the main page).

Job History.

So, we’ve imported data from the marketplace and run a simple HiveQL query.  In a future post we can go through the samples that are included when you set up your cluster and mess around with the JavaScript console.  Refer to the previous post for additional links and resources.

Happy Learning.

Big Data, Hadoop on Azure and the elephant in the room

Seriously.  There’s an elephant in the room, so I’ve no choice but to talk about it.  I’m new to Big Data and newer to Hadoop on Azure, so this post (and future ones as well) will serve as an introduction to the underlying concepts of big data and my experience using Hadoop on Azure, one step at a time.

Big Data

So we generate massive amounts of data.  Massive.  Structured, unstructured, from devices, from sensors, feeds, tweets, blogs; everything we do in our daily lives generates data at some point.  What do we do with it besides store it?  Ignore it?  Throw it away?  We could.  But data is there for a reason.  We can extract valuable information from it.  We can discover new business insights, see interesting patterns emerge, and most importantly, we could save lives… so yes, it’s a big deal.  Ok, so we’ll leave the processing to multi-million dollar companies, they can afford it, right?  One misconception is that we need massive, state of the art infrastructure to be able to handle big data.  We can set up nodes on affordable, commodity hardware and achieve the same results.  Nice.  But I still need to maintain all these boxes, and they WILL fail eventually…

Hadoop is a scalable, highly fault-tolerant, open source MapReduce solution.  It runs on commodity hardware, so there is an economic advantage to it.

The main components of Hadoop are illustrated in the following diagram.

HDFS: Hadoop Distributed File System.  It’s the storage mechanism used by Hadoop applications.  Amongst other things, it stores replicas of data blocks on the nodes of your cluster, aiding availability, reliability, performance, etc.

MapReduce: a programming framework that allows you to create mappers and reducers.  The framework will construct a set of jobs, hand them over to the nodes for processing, and keep track of them.  The map operation lets you take the processing to where the data is stored in the distributed file system.  The reduce operation summarizes the results from the mappers.
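The model itself is easy to sketch outside Hadoop.  Here is a toy word count with an explicit map phase and reduce phase (plain Python, just to illustrate the two operations; this is not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    """Reduce phase: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(lines):
    """Run the map phase over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return reducer(pairs)
```

In a real cluster, the mapper runs in parallel next to each data block, and the framework shuffles each word’s pairs to a reducer; that data-locality trick is the whole point.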

Hive: it provides a few things, such as the possibility to create a structure for data through the use of tables.  It also defines a SQL-oriented language (QL, or HiveQL); in other words, MapReduce for mortals.  The magic behind this is that a Hive query can be translated to a MapReduce job, presenting the results back to the user.  The need for Hive appeared because creating a relatively simple query as plain MapReduce jobs resulted in a cumbersome coding experience.

Sqoop: a bridge between the Hadoop and relational worlds.  Because we also live in a relational world, right?  We can import data from SQL Server, let Sqoop store the data in HDFS, and make the data available for our MapReduce jobs as well.
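A typical Sqoop import from SQL Server looks something like this (the connection string, table and paths are placeholders; check the Sqoop documentation for the exact connector arguments for your setup):

```sh
# Pull the Orders table out of SQL Server and land it in HDFS,
# where MapReduce, Hive and Pig jobs can consume it.
sqoop import \
  --connect "jdbc:sqlserver://myserver;databaseName=sales" \
  --username myuser -P \
  --table Orders \
  --target-dir /data/orders
```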

Hadoop on Azure is Microsoft’s Hadoop distribution that runs on Windows, plus a hosting environment on Windows Azure.  I recently got invited to use the CTP version of Hadoop on Azure (I did ask for an invitation a few weeks ago) and started to get familiar with its features.  The huge benefit to this is that I don’t need to maintain all those nodes, I have my own cluster now, and I’m ready to handle massive amounts of data.  Tada!

In future posts I’ll be showing how to execute a simple MapReduce job, or how to get data from the Windows Azure Marketplace and query the data using Hive.  The following links lead you to useful resources if you are getting started with Hadoop on Azure.

Big Data, Big deal, video by Gert Drapers

Introduction to Hadoop on Azure, video by Wenming Ye

Hadoop on Azure portal, get invited!

Apache Hadoop

Happy learning

Service Bus Topics and Synchronizing Data

I am probably going to compare apples and oranges in this post, with a slight bias towards oranges.  Yes, I like fruit.  I also like SQL Azure Data Sync and how it makes data synchronization tasks a breeze.  You install your agent, register your database, go to the portal, set up your sync groups and conflict resolution, and you are ready to go.  Eventually (if you set things up right) the data you need will be synchronized from cloud to cloud, cloud to on premises…  Awesome.

As it turns out, I also like Service Bus, and the powerful features it provides to build distributed applications, aid transactional behavior, etc.

In our scenario we have a company that runs an online store in all six Windows Azure data centers (how romantic), and, through Traffic Manager, users can get routed anywhere.  How do we deal with the challenge of data consistency?  Well, we can use Data Sync, for sure, but what if we are in a more restrictive scenario?  What if we are not allowed to make changes to the database?  Well, the Data Sync option quickly fades away, doesn’t it?  Part of Data Sync’s magic consists of installing a Windows service and a number of tables in your database to keep track of changes.  Busted.

So, how do we go about solving this problem?  I will go down the path of Service Bus topics and subscriptions to comply with our restriction.  I choose Service Bus because we are already working with Windows Azure, I’ve seen it in action, been working with it for the last couple of months, and know that I can find a relatively simple way to manage this.  It would be worthwhile to discuss other options as well.

This is a high-level diagram of the idea:

Our beloved customer places an order through any of the data centers.  Because the user can see his own orders, and because he might get routed by Traffic Manager to another data center, we need to have that information available everywhere, eventually.

We create a topic (let’s call it SyncTopic) in our favorite data center.  Just one (let’s add a budget restriction while we are at it, and a little bit of KISS).  We create one subscription per data center, and we’ll have our SyncListener component running in each cloud subscribe to that topic.  The SyncListener will be responsible for writing the orders to the database.  Insert only; it makes the story easier (that would depend on how we designed our data model, of course).

Every time a customer places an order, the SyncSender component (a separate worker role or a new task running in a worker role – you choose) posts a message to the topic: a copy of the order.  Each SyncListener will receive a copy of the message and perform the corresponding write operation to the db.  That is the big picture.

Implementation considerations.

This is all very nice if we draw it on paper, but we need to be sure that we are tackling other concerns, for example, and in no particular order of importance…

Sync loops: because the data center where the order originated also contains a SyncListener, it could potentially receive a copy of the order it already has.  We can avoid this by implementing a filter on the subscription and filling the corresponding property in the BrokeredMessage.  The filter (a SqlFilter, why not) can be something like SenderDataCenter != ListeningDataCenter.
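The effect of that filter is simple to express as a predicate over the message properties (a Python sketch of the semantics only, not the Service Bus API; the property name matches the hypothetical filter above):

```python
def stamp_sender(order, data_center):
    """What the SyncSender does: attach the originating data center as a
    message property when posting the order copy to the topic."""
    return {"body": order, "properties": {"SenderDataCenter": data_center}}

def should_deliver(message_properties, listening_data_center):
    """Mimic the subscription's SqlFilter: deliver the message only when it
    originated in a different data center than the listener's own."""
    return message_properties.get("SenderDataCenter") != listening_data_center
```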

Conflict Resolution: we need to be covered here as well.  The path we choose will depend on our given context and might even take us back to square one.

Transactional Behavior: we would have to use the PeekLock mode to receive messages, for starters, and we would have to have a supervising mechanism to deal with transient failures; we can also use Topaz.  Consider using the dead letter queue as well if something goes wrong when receiving the message.

Message size: what if our BrokeredMessage exceeds the maximum message size?  We need to consider another mechanism to get the required data through.  Maybe storing it in blob storage, and then passing a reference in the message for the listener to pick up.  The story gets complicated, but it could happen.
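That workaround is essentially a claim check.  A sketch of the idea in Python (the size limit below is a placeholder, so check the current Service Bus quotas; the dict stands in for blob storage):

```python
import uuid

MAX_MESSAGE_BYTES = 256 * 1024  # placeholder broker limit; check the real quota

blob_store = {}  # stands in for blob storage

def to_message(order_bytes):
    """If the payload fits, send it inline; otherwise park it in blob
    storage and send only a reference for the listener to dereference."""
    if len(order_bytes) <= MAX_MESSAGE_BYTES:
        return {"inline": order_bytes}
    blob_id = str(uuid.uuid4())
    blob_store[blob_id] = order_bytes
    return {"blobRef": blob_id}

def read_message(message):
    """What the SyncListener does: use the inline payload, or fetch the blob."""
    if "inline" in message:
        return message["inline"]
    return blob_store[message["blobRef"]]
```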

Security: Although the topic and subscriptions are used internally, it is always a good idea to secure them.  With ACS we can easily secure the edges.

Hybridness: now we need all the orders data on premises (wait, don’t we have that already? oops).  By setting up a new subscription and a SyncListener on premises, we can now get a copy of the data.

and probably others…


We can go about this in many ways.  Data Sync is simple and quick to set up.  We don’t have 100% control over it, but it gets the job done.  As our context changes, so does the need to consider other possibilities where we need greater control over things.  I am aware that this idea can trigger LOTS of other considerations I haven’t mentioned.  Consider this post not as a definitive solution, but as a starting point to a fun discussion.

Happy Learning!