Category Archives: Cloud Computing

Netflix OSS – Security Tools

Netflix has released different security tools and solutions to the open source community. The security-related open source efforts generally fall into one of two categories:

  • Operational tools and systems to make security teams more efficient and effective when securing large and dynamic environments
  • Security infrastructure components that provide critical security services for modern distributed systems.

Below you can find further information about some of the security tools released by Netflix.

Security Monkey

Security Monkey monitors policy changes and alerts on insecure configurations in an AWS account. While Security Monkey’s main purpose is security, it also proves useful for tracking down potential problems, as it is essentially a change-tracking system.

A Docker image is available, but the functionality only works with AWS.

Scumblr

Scumblr is a web application that lets you perform periodic searches and store or take action on the identified results. Scumblr uses the Workflowable gem to set up flexible workflows for different types of results.

Workflowable is a gem that allows easy creation of workflows in Ruby on Rails applications. Workflows can contain any number of stages and transitions, and can trigger customizable automated actions as state transitions occur.

Scumblr searches utilize plugins called Search Providers. Each Search Provider knows how to perform a search via a certain site or API (Google, Bing, eBay, Pastebin, Twitter, etc.). Searches can be configured from within Scumblr based on the options available by the Search Provider. Examples of things you might want to look for are:

  • Compromised credentials
  • Vulnerability / hacking discussion
  • Attack discussion
  • Security relevant social media discussion
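The Search Provider mechanism is essentially a plugin pattern: each provider encapsulates how to query one site or API, and Scumblr simply iterates over the configured providers. The real project is Ruby on Rails; the following is a minimal, hypothetical sketch of the idea (the class and result-field names are assumptions, not Scumblr's actual API):

```python
# Sketch of the Search Provider plugin idea: each provider knows how to
# query one site or API; the application iterates over configured
# providers and collects the results they find.

class SearchProvider:
    """Base class: subclasses implement search() for one site or API."""
    def search(self, query):
        raise NotImplementedError

class PastebinProvider(SearchProvider):
    def search(self, query):
        # A real provider would call the Pastebin API here.
        return [{"source": "pastebin", "query": query}]

class TwitterProvider(SearchProvider):
    def search(self, query):
        # A real provider would call the Twitter search API here.
        return [{"source": "twitter", "query": query}]

def run_searches(providers, query):
    """Collect results from every configured provider for one search term."""
    results = []
    for provider in providers:
        results.extend(provider.search(query))
    return results

results = run_searches([PastebinProvider(), TwitterProvider()], "leaked credentials")
print([r["source"] for r in results])  # one result list per provider
```

New providers can then be added without touching the search loop, which matches how Scumblr lets searches be configured per provider.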

Message Security Layer

Message Security Layer (MSL) is an extensible and flexible secure messaging framework that can be used to transport data between two or more communicating entities. Data may also be associated with specific users, and treated as confidential or non-replayable if so desired.

MSL does not attempt to solve any specific use case or communication scenario. Rather, it is capable of supporting a wide variety of applications and leveraging external cryptographic resources. There is no one-size-fits-all implementation or configuration; proper use of MSL requires the application designer to understand their specific security requirements.

Netflix OSS – Insight, Reliability and Performance Tools

As part of the Netflix OSS platform, Netflix has released tools to gain operational insight into an application, take different kinds of metrics, and validate reliability by ensuring that the application can withstand different kinds of failures.

In this blog post we list and briefly describe some of these tools.

Atlas

Atlas was developed by Netflix to manage dimensional time series data for near real-time operational insight. Atlas features in-memory data storage, allowing it to gather and report very large numbers of metrics very quickly.

Atlas captures operational intelligence. Whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a system.

Edda

Edda is a service that polls your AWS resources via AWS APIs and records the results. It allows you to quickly search through your resources and shows you how they have changed over time.

Previously this project was known within Netflix as Entrypoints (and mentioned in some blog posts), but the name was changed as the scope of the project grew. Edda (meaning “a tale of Norse mythology”), seemed appropriate for the new name, as our application records the tales of Asgard.
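The core of Edda is change tracking: poll a resource's current description, store a new version only when something actually changed, and diff versions to answer "how did this resource change over time?". Here is a minimal, hypothetical sketch of that idea (not Edda's actual implementation, which is a Scala service backed by a datastore):

```python
# Sketch of Edda's change-tracking idea: record successive snapshots of
# a resource and diff the two latest versions.

class ChangeRecorder:
    def __init__(self):
        self.history = {}  # resource id -> list of recorded versions

    def record(self, resource_id, state):
        versions = self.history.setdefault(resource_id, [])
        # Only store a new version when the polled state actually changed.
        if not versions or versions[-1] != state:
            versions.append(state)

    def diff(self, resource_id):
        """Return the fields whose values differ between the two latest versions."""
        versions = self.history.get(resource_id, [])
        if len(versions) < 2:
            return {}
        old, new = versions[-2], versions[-1]
        return {k: (old.get(k), new.get(k))
                for k in set(old) | set(new) if old.get(k) != new.get(k)}

recorder = ChangeRecorder()
recorder.record("sg-123", {"name": "web", "port": 80})
recorder.record("sg-123", {"name": "web", "port": 80})   # unchanged: not stored again
recorder.record("sg-123", {"name": "web", "port": 443})  # changed: new version kept
print(recorder.diff("sg-123"))  # -> {'port': (80, 443)}
```

Running the `record` step on a schedule against the AWS APIs is essentially what the Edda poller does at a much larger scale.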

Spectator

Spectator is a simple library for instrumenting code to record dimensional time series. When running at Netflix with the standard platform, use the spectator-nflx-plugin library to get bindings for internal tools like Atlas and Chronos.
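"Dimensional" here means a metric is identified by a name plus a set of tag dimensions, so the same measurement can later be sliced by status code, method, and so on. Spectator itself is a Java library; the following is a hypothetical sketch of the concept only (the `Registry` and method names are assumptions):

```python
# Sketch of dimensional metrics: a counter is keyed by its name plus
# its tags, so "server.requests" with status=200 and status=500 are
# tracked as separate time series.

class Registry:
    def __init__(self):
        self.counters = {}

    def counter(self, name, **tags):
        # The identity of a counter is the name plus the sorted tag set.
        return (name, tuple(sorted(tags.items())))

    def increment(self, counter_id, amount=1):
        self.counters[counter_id] = self.counters.get(counter_id, 0) + amount

registry = Registry()
ok = registry.counter("server.requests", status="200", method="GET")
err = registry.counter("server.requests", status="500", method="GET")
registry.increment(ok)
registry.increment(ok)
registry.increment(err)
print(registry.counters[ok], registry.counters[err])  # -> 2 1
```

A backend like Atlas can then aggregate or filter across any of those dimensions.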

Vector

Vector is an open source on-host performance monitoring framework which exposes hand-picked, high-resolution system and application metrics to every engineer’s browser. Having the right metrics available on-demand and at a high resolution is key to understanding how a system behaves to correctly troubleshoot performance issues.

Vector provides a simple way for users to visualize and analyze system and application-level metrics in near real-time. It leverages the battle tested open source system monitoring framework, Performance Co-Pilot (PCP), layering on top a flexible and user-friendly UI.

Ice

Ice provides a bird’s-eye view of a large and complex cloud landscape from a usage and cost perspective. Cloud resources are dynamically provisioned by dozens of service teams within the organization, and any static snapshot of resource allocation has limited value.

Ice is a Grails project. It consists of three parts: a processor, a reader and a UI. The processor turns the Amazon detailed billing file into data readable by the reader. The reader reads the data generated by the processor and serves it to the UI. The UI queries the reader and renders interactive graphs and tables in the browser.

Ice communicates with AWS Programmatic Billing Access and maintains knowledge of the following key AWS entity categories:

  • Accounts
  • Regions
  • Services (e.g. EC2, S3, EBS)
  • Usage types (e.g. EC2 – m1.xlarge)
  • Cost and Usage Categories (On-Demand, Reserved, etc.)

The UI allows you to filter directly on the above categories to custom tailor your view.
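The kind of work the processor does can be pictured as grouping raw billing line items by those entity categories so the UI can filter on them. This is a hypothetical sketch of that aggregation step (the field names are assumptions, not the actual detailed billing file schema):

```python
# Sketch of grouping billing line items by entity category and summing
# cost per group, as Ice's processor does for the detailed billing file.

from collections import defaultdict

def aggregate_costs(line_items, group_by):
    """Sum cost per combination of the requested billing categories."""
    totals = defaultdict(float)
    for item in line_items:
        key = tuple(item[field] for field in group_by)
        totals[key] += item["cost"]
    return dict(totals)

billing = [
    {"account": "prod", "region": "us-east-1", "service": "EC2", "cost": 120.0},
    {"account": "prod", "region": "us-east-1", "service": "S3",  "cost": 30.0},
    {"account": "prod", "region": "eu-west-1", "service": "EC2", "cost": 45.0},
]
print(aggregate_costs(billing, ["service"]))
# -> {('EC2',): 165.0, ('S3',): 30.0}
```

Changing `group_by` to `["account", "region"]` gives the per-region view, which is the same filtering flexibility the Ice UI exposes.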

Simian Army

The Simian Army is a suite of tools for keeping your cloud operating in top form. Chaos Monkey, the first member, is a resiliency tool that helps ensure that your applications can tolerate random instance failures.

Simian Army consists of services (Monkeys) in the cloud for generating various kinds of failures, detecting abnormal conditions, and testing our ability to survive them. The goal is to keep our cloud safe, secure, and highly available.

Currently the simians include Chaos Monkey, Janitor Monkey, and Conformity Monkey.

Netflix OSS – Data Persistence Tools Overview

Handling a huge number of data operations per day required Netflix to extend existing open source software with their own tools. The scale at which Netflix consumes and manages data in the cloud has required them to build tools and services that enhance the datastores they use.

In this blog post we list and briefly describe some of the tools released by Netflix to store and serve data in the cloud.

EVCache

EVCache is a caching solution based on memcached and spymemcached that is mainly used on AWS EC2 infrastructure for caching frequently used data.

EVCache is an abbreviation for:

  • Ephemeral – The data stored is for a short duration as specified by its TTL (Time To Live)
  • Volatile – The data can disappear at any time (Evicted)
  • Cache – An in-memory key-value store

It offers the following features:

  • Distributed Key-Value store, i.e., the cache is spread across multiple instances
  • AWS Zone-Aware – Data can be replicated across zones
  • Registers and works with Eureka for automatic discovery of new nodes/services
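The "ephemeral" and "volatile" semantics spelled out above can be captured in a few lines. This is a minimal sketch of those semantics only, not EVCache's distributed implementation (the class and method names are assumptions):

```python
# Sketch of ephemeral/volatile cache semantics: an in-memory key-value
# store where entries expire after their TTL and may be evicted at any time.

import time

class TTLCache:
    def __init__(self):
        self.store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl_seconds):
        # Ephemeral: every entry carries an expiry time.
        self.store[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:  # TTL elapsed: the data is gone
            del self.store[key]
            return None
        return value

    def evict(self, key):
        # Volatile: the data can disappear at any time.
        self.store.pop(key, None)

cache = TTLCache()
cache.set("user:42", {"plan": "premium"}, ttl_seconds=0.05)
print(cache.get("user:42"))   # within the TTL: returns the value
time.sleep(0.06)
print(cache.get("user:42"))   # after the TTL: returns None
```

EVCache layers distribution, zone-aware replication, and Eureka-based discovery on top of exactly these semantics.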

Dynomite

Dynomite is a thin, distributed dynamo layer for different storages and protocols.

It is a generic dynamo implementation that can be used with many different key-value pair storage engines. Currently these include Redis and Memcached. Dynomite supports multi-datacenter replication and is designed for high availability.

The ultimate goal with Dynomite is to be able to implement high availability and cross-datacenter replication on storage engines that do not inherently provide that functionality. The implementation is efficient, not complex (few moving parts), and highly performant.

Astyanax

Astyanax is a high level Java client for Apache Cassandra. Apache Cassandra is a highly available, column-oriented database.

It borrows many concepts from Hector but diverges in the connection pool implementation as well as the client API. One of the main design considerations was to provide a clean abstraction between the connection pool and Cassandra API so that each may be customized and improved separately. Astyanax provides a fluent style API which guides the caller to narrow the query from key to column as well as providing queries for more complex use cases that we have encountered. The operational benefits of Astyanax over Hector include lower latency, reduced latency variance, and better error handling.

Some of the features provided by this client are:

  • High level, simple object oriented interface to Cassandra
  • Fail-over behavior on the client side
  • Connection pool abstraction. Implementation of a round robin connection pool
  • Monitoring abstraction to get event notification from the connection pool
  • Complete encapsulation of the underlying Thrift API and structs
  • Automatic retry of downed hosts
  • Automatic discovery of additional hosts in the cluster
  • Suspension of hosts for a short period of time after several timeouts
  • Annotations to simplify use of composite columns

Dyno

Dyno is Netflix’s home-grown Java client for Dynomite. Dynomite adds sharding and replication on top of Redis and Memcached as underlying datastores; the Dynomite server implements the underlying datastore protocol and presents that as its public interface. Hence, one can use popular Java clients like Jedis, Redisson and SpyMemcached to speak directly to Dynomite. Dyno encapsulates client-side complexity and best practices in one place instead of having every application repeat the same engineering effort, e.g., topology aware routing, effective failover, load shedding with exponential backoff, etc.

Dyno implements patterns inspired by Astyanax on top of popular clients like Jedis, Redisson and SpyMemcached.

Some of Dyno’s features are:

  • Connection pooling of persistent connections – this helps reduce connection churn on the Dynomite server with client connection reuse.
  • Topology aware load balancing (Token Aware) for avoiding any intermediate hops to a Dynomite coordinator node that is not the owner of the specified data
  • Application specific local rack affinity based request routing to Dynomite nodes
  • Application resilience by intelligently failing over to remote racks when local Dynomite rack nodes fail
  • Application resilience against network glitches by constantly monitoring connection health and recycling unhealthy connections
  • Capability of surgically routing traffic away from any nodes that need to be taken offline for maintenance
  • Flexible retry policies such as exponential backoff, etc.
  • Insight into connection pool metrics
  • Highly configurable and pluggable connection pool components for implementing your advanced features
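Token-aware routing, the second feature above, means the client hashes each key onto the cluster's token ring and talks directly to the node that owns it, skipping the extra hop through a coordinator. Dyno is a Java library; the following is a hypothetical sketch of the routing idea only (ring layout and names are assumptions):

```python
# Sketch of token-aware routing: hash a key onto a token ring and route
# it directly to the owning node.

import hashlib

class TokenRing:
    def __init__(self, nodes, ring_size=1000):
        # Assign each node an evenly spaced token on the ring.
        step = ring_size // len(nodes)
        self.ring_size = ring_size
        self.tokens = sorted((i * step, node) for i, node in enumerate(nodes))

    def owner(self, key):
        """Route a key to the first node whose token is >= the key's hash."""
        h = int(hashlib.md5(key.encode()).hexdigest(), 16) % self.ring_size
        for token, node in self.tokens:
            if h <= token:
                return node
        return self.tokens[0][1]  # wrap around the ring

ring = TokenRing(["dynomite-a", "dynomite-b", "dynomite-c"])
# The same key always hashes to the same owner, so no coordinator hop is needed.
print(ring.owner("session:9001") == ring.owner("session:9001"))  # -> True
```

The real client also layers rack affinity and failover on top: if the owning node in the local rack is down, the request is retried against the replica in a remote rack.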

Netflix OSS – Common Runtime Services & Libraries

Netflix has released as open source software several of the tools, libraries and services they use to power microservices. The cloud platform is the foundation and technology stack for the majority of the services within Netflix. This platform consists of cloud services, application libraries and application containers.

Below you can find information about the services and libraries used by Netflix that were released as open source software.

Eureka

Eureka is a REST (Representational State Transfer) based service that is primarily used in the AWS cloud for locating services for the purpose of load balancing and failover of middle-tier servers. We call this service the Eureka Server. Eureka also comes with a Java-based client component, the Eureka Client, which simplifies the interactions with the service. The client also has a built-in load balancer that does basic round-robin load balancing. At Netflix, a much more sophisticated load balancer wraps Eureka to provide weighted load balancing based on several factors (like traffic, resource usage, error conditions, etc.) to provide superior resiliency.
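The basic round-robin balancing built into the Eureka client simply rotates through the instances registered for a service, one per request. The client itself is Java; this is a minimal, hypothetical sketch of the behavior (names are assumptions):

```python
# Sketch of basic round-robin client-side load balancing over the
# instances registered for a service.

import itertools

class RoundRobinBalancer:
    def __init__(self, instances):
        # Cycle endlessly through the registered instances.
        self._cycle = itertools.cycle(instances)

    def next_instance(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
print([balancer.next_instance() for _ in range(4)])
# -> ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080', '10.0.0.1:8080']
```

The weighted balancer Netflix wraps around Eureka replaces this uniform rotation with weights derived from traffic, resource usage and error conditions.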

Apart from playing a critical part in mid-tier load balancing, at Netflix, Eureka is used for the following purposes.

  • For aiding Netflix Asgard in:
    • Fast rollback of versions in case of problems, avoiding the relaunch of hundreds of instances which could take a long time
    • Rolling pushes, to avoid propagation of a new version to all instances in case of problems
  • For Cassandra deployments to take instances out of traffic for maintenance
  • For Memcached caching services to identify the list of nodes in the ring
  • For carrying other additional application specific metadata about services for various other reasons

Archaius

Archaius is a configuration management library with a focus on Dynamic Properties sourced from multiple configuration stores. It includes a set of configuration management APIs used by Netflix. It is primarily implemented as an extension of Apache’s Commons Configuration Library.

It provides the following functionalities:

  • Dynamic, Typed Properties
  • High throughput and Thread Safe Configuration operations
  • A polling framework that allows obtaining property changes of a Configuration Source
  • A Callback mechanism that gets invoked on effective/”winning” property mutations (in the ordered hierarchy of Configurations)
  • A JMX MBean that can be accessed via JConsole to inspect and invoke operations on properties
  • Out of the box, Composite Configurations (With ordered hierarchy) for applications (and most web applications willing to use convention based property file locations)
  • Implementations of dynamic configuration sources for URLs, JDBC and Amazon DynamoDB
  • Scala dynamic property wrappers

At the heart of Archaius is the concept of a Composite Configuration which can hold one or more Configurations. Each Configuration can be sourced from a Configuration Source such as: JDBC, REST, .properties file, etc.
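The dynamic-property and callback mechanics described above can be sketched briefly. Archaius is a Java library; this hypothetical Python sketch only illustrates the pattern of a typed property that re-reads a configuration source and fires callbacks when the polled value changes (all names are assumptions):

```python
# Sketch of a dynamic typed property: it reads its value from a
# configuration source and invokes callbacks when a poll detects a change.

class DynamicIntProperty:
    def __init__(self, source, name, default):
        self.source, self.name, self.default = source, name, default
        self.callbacks = []
        self.value = self._read()

    def _read(self):
        raw = self.source.get(self.name)
        return int(raw) if raw is not None else self.default

    def on_change(self, callback):
        self.callbacks.append(callback)

    def poll(self):
        """Called by a polling framework; fires callbacks on a change."""
        new_value = self._read()
        if new_value != self.value:
            self.value = new_value
            for cb in self.callbacks:
                cb(new_value)

config = {"server.maxConnections": "100"}  # stand-in for a Configuration Source
prop = DynamicIntProperty(config, "server.maxConnections", default=50)
changes = []
prop.on_change(changes.append)

config["server.maxConnections"] = "200"  # mutation in the configuration store
prop.poll()
print(prop.value, changes)  # -> 200 [200]
```

In Archaius the poll is driven by the polling framework against the ordered hierarchy of Configurations, and only the effective ("winning") value triggers callbacks.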

Ribbon

Ribbon is an Inter Process Communication (remote procedure calls) library with built-in software load balancers. The primary usage model involves REST calls with various serialization scheme support.

Ribbon is a client-side IPC library that is battle-tested in cloud. It provides the following features:

  • Load balancing
  • Fault tolerance
  • Multiple protocol (HTTP, TCP, UDP) support in an asynchronous and reactive model
  • Caching and batching

There are three sub projects:

  • ribbon-core: includes load balancer and client interface definitions, common load balancer implementations, integration of client with load balancers and client factory
  • ribbon-eureka: includes load balancer implementations based on Eureka client, which is the library for service registration and discovery
  • ribbon-httpclient: includes the JSR-311 based implementation of REST client integrated with load balancers

Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Hystrix is designed to do the following:

  • Give protection from and control over latency and failure from dependencies accessed (typically over the network) via third-party client libraries
  • Stop cascading failures in a complex distributed system
  • Fail fast and rapidly recover
  • Fallback and gracefully degrade when possible
  • Enable near real-time monitoring, alerting, and operational control
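The fail-fast and fallback behaviors in that list come from the circuit breaker pattern. Hystrix itself is a Java library; the following is a deliberately simplified, hypothetical sketch of the pattern (real Hystrix uses a rolling error-percentage window, not a plain failure counter):

```python
# Sketch of the circuit breaker / fallback pattern: after repeated
# failures the circuit opens and calls fail fast to the fallback instead
# of waiting on a broken dependency.

class CircuitBreakerCommand:
    def __init__(self, run, fallback, failure_threshold=3):
        self.run, self.fallback = run, fallback
        self.failure_threshold = failure_threshold
        self.failures = 0

    def execute(self):
        if self.failures >= self.failure_threshold:
            return self.fallback()          # circuit open: fail fast
        try:
            result = self.run()
            self.failures = 0               # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback()          # graceful degradation

def broken_dependency():
    raise ConnectionError("remote service down")

command = CircuitBreakerCommand(broken_dependency, fallback=lambda: "cached response")
print([command.execute() for _ in range(5)])
# Every call degrades to the fallback; after the third failure the
# dependency is no longer even attempted.
```

This is what stops cascading failures: callers get an immediate degraded answer instead of tying up threads on a dependency that is known to be down.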

Karyon

Karyon is a framework and library that essentially contains the blueprint of what it means to implement a cloud-ready web service. All the other fine grained web services and applications that form our SOA graph can essentially be thought of as being cloned from this basic blueprint.

Karyon can be thought of as a nucleus that contains the following main ingredients:

  • Bootstrapping, dependency and lifecycle management (via Governator)
  • Runtime Insights and Diagnostics (via karyon-admin-web module)
  • Configuration Management (via Archaius)
  • Service discovery (via Eureka)
  • Powerful transport module (via RxNetty)

Governator

Governator is a library of extensions and utilities that enhance Google Guice to provide classpath scanning and automatic binding, lifecycle management, configuration to field mapping, field validation and parallelized object warmup.

Prana

Prana is a sidecar for your NetflixOSS based services. It simplifies integration with NetflixOSS services by exposing the Java-based client libraries of various services like Eureka, Ribbon, and Archaius over HTTP. It makes it easy for applications, especially those written in non-JVM languages, to exist in the NetflixOSS ecosystem.

Prana is a Karyon & RxNetty based application that exposes features of java-based client libraries of various NetflixOSS services over an HTTP API. It is conceptually “attached” to the main application and complements it by providing features that are otherwise available as libraries within a JVM-based application.

Prana is used extensively within Netflix alongside applications built in non-JVM programming languages like Python and Node.js, or services like Memcached, Spark, and Hadoop.

Among Prana’s features are:

  • Advertising applications via the Eureka Service Discovery Service
  • Discovery of hosts of an application via Eureka
  • Health Check of services
  • Load Balancing http requests via Ribbon
  • Fetching Dynamic Properties via Archaius

Zuul

Zuul is an edge service that provides dynamic routing, monitoring, resiliency and security, among other things. It is the front door for all requests from devices and web sites to the backend of the Netflix streaming application. It also has the ability to route requests to multiple Amazon Auto Scaling Groups.

Zuul uses a range of different filter types that make it possible to quickly and nimbly apply functionality to the edge service. These filters help Netflix perform the following functions:

  • Authentication and Security – identifying authentication requirements for each resource and rejecting requests that do not satisfy them
  • Insights and Monitoring – tracking meaningful data and statistics at the edge in order to give us an accurate view of production
  • Dynamic Routing – dynamically routing requests to different backend clusters as needed
  • Stress Testing – gradually increasing the traffic to a cluster in order to gauge performance
  • Load Shedding – allocating capacity for each type of request and dropping requests that exceed the limit
  • Static Response handling – building some responses directly at the edge instead of forwarding them to an internal cluster
  • Multiregion Resiliency – routing requests across AWS regions in order to diversify our ELB usage and move our edge closer to our members

At the center of Zuul is a series of Filters that are capable of performing a range of actions during the routing of HTTP requests and responses.
The following are the key characteristics of a Zuul Filter:

  • Type: most often defines the stage during the routing flow when the Filter will be applied (although it can be any custom string)
  • Execution Order: applied within the Type, defines the order of execution across multiple Filters
  • Criteria: the conditions required in order for the Filter to be executed
  • Action: the action to be executed if the Criteria is met

Zuul provides a framework to dynamically read, compile, and run these Filters. Filters do not communicate with each other directly – instead they share state through a RequestContext which is unique to each request.
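The four filter characteristics and the shared RequestContext can be sketched together. Zuul filters are actually written in Groovy or Java; this hypothetical Python sketch only illustrates the model (names are assumptions):

```python
# Sketch of the Zuul filter model: each filter has a type, an execution
# order, a criteria check, and an action; filters share state only
# through a per-request context.

class Filter:
    def __init__(self, filter_type, order, criteria, action):
        self.filter_type = filter_type  # stage, e.g. "pre", "route", "post"
        self.order = order              # execution order within the type
        self.criteria = criteria        # ctx -> bool: should this filter run?
        self.action = action            # ctx -> None: what it does when it runs

def run_filters(filters, phase, context):
    """Run all filters of one type in order, skipping those whose criteria fail."""
    for f in sorted((f for f in filters if f.filter_type == phase),
                    key=lambda f: f.order):
        if f.criteria(context):
            f.action(context)

filters = [
    Filter("pre", 2, lambda ctx: "token" not in ctx,       # reject unauthenticated
           lambda ctx: ctx.update(rejected=True)),
    Filter("pre", 1, lambda ctx: True,                     # always log
           lambda ctx: ctx.update(logged=True)),
]

context = {"path": "/api/titles"}  # RequestContext: unique to each request
run_filters(filters, "pre", context)
print(context)
# -> {'path': '/api/titles', 'logged': True, 'rejected': True}
```

Because filters never call each other directly, new behavior can be dropped in (or dynamically compiled in, as Zuul does) without touching existing filters.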

Zuul contains multiple components:

  • zuul-core – a library which contains the core functionality of compiling and executing Filters
  • zuul-simple-webapp – a webapp which shows a simple example of how to build an application with zuul-core
  • zuul-netflix – a library which adds other NetflixOSS components to Zuul – using Ribbon for routing requests, for example
  • zuul-netflix-webapp – a webapp which packages zuul-core and zuul-netflix together into an easy-to-use package

Netflix OSS – Build and Delivery Tools Overview

Among the build and delivery tools released by Netflix as part of the Netflix OSS platform, you can find build resources such as Nebula (which makes Gradle plugins easy to build, test and deploy) as well as tools to manage resources in AWS and to support deployments to this platform.

Below you can find a brief description of some of the build and delivery tools released by Netflix.

Nebula

The nebula-plugins organization was set up to facilitate the generation, governance, and release of Gradle plugins. It does so by providing hosting for plugins in SCM, CI, and a repository. A single GitHub organization is used, to which anyone or any plugin can be added. CloudBees jobs are created for every plugin to provide a standard set of jobs. Releases are posted to Bintray, proxied to JCenter, and synced to Maven Central.

Aminator

Aminator is a tool for creating EBS AMIs. This tool currently works for CentOS/RedHat Linux images and is intended to run on an EC2 instance.
It creates a custom AMI from just:

  • A base ami ID
  • A link to a deb or rpm package that installs your application.

This is useful for many AWS workflows, particularly ones that take advantage of auto-scaling groups.

Asgard

Asgard is a web-based tool for managing cloud-based applications and infrastructure. It offers a web interface for application deployments and cloud management in Amazon Web Services (AWS).

Netflix has been using Asgard for cloud deployments since early 2010. It was initially named the Netflix Application Console.

Netflix OSS – Big Data Tools Overview

Behind the scenes, Netflix has a rich ecosystem of big data technologies facilitating their algorithms and analytics. They use and contribute to broadly-adopted open source technologies including Hadoop, Hive, Pig, Parquet, Presto, and Spark. Additionally, they have developed and contributed some additional tools and services which have further elevated their data platform.

Below you can find information about some tools of the Netflix OSS platform that offer functionality associated with big data.

Genie

Genie is a federated job execution engine developed by Netflix. Genie provides RESTful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Presto, Sqoop and more. It also provides APIs for managing many distributed processing cluster configurations and the commands and applications which run on them.

From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) computational resources (like YARN, Spark, Mesos clusters etc.). It then provides APIs to submit and monitor jobs on these clusters without users having to install any clients themselves or know details of the clusters and commands.

A big advantage of this model is the scalability that it provides for client resources. This solves a very common problem where a single machine is configured as an entry point to submit jobs to large clusters and the machine gets overloaded. Genie allows the use of a group of machines which can increase and decrease in number to handle the increasing load, providing a very scalable solution.

Within Netflix it is primarily used to handle scaling out jobs for their big data platform in the cloud, but it doesn’t require AWS or the cloud to benefit users.

Inviso

Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

This tool is based on the following components:

  • REST API for Job History: REST endpoint to load an entire job history file as a json object
  • ElasticSearch: Search jobs and correlate Hadoop jobs for Pig and Hive scripts
  • Python Scripts: Scripts to index job configurations into ElasticSearch for querying. These scripts can accommodate a pub/sub model for use with SQS or another queuing service to better distribute the load or allow other systems to know about job events.
  • Web UI: Provides an interface to search and visualize jobs and cluster data

Lipstick

Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes, giving developers insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.

Aegisthus

Aegisthus is a Bulk Data Pipeline out of Cassandra. It implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.

Netflix OSS Overview

Netflix is considered one of the biggest cloud applications out there. As such, the people at Netflix have faced many different kinds of challenges trying to avoid failures in their service. Over time, they implemented different tools to support and improve their cloud environment, making the Netflix application more reliable, fault-tolerant and highly available.

The really good news is that Netflix has made some of these tools open source. There are now tools available to make a cloud environment more reliable coming from a company that is using them in a huge infrastructure. One thing to consider is that Netflix utilizes AWS for services and content delivery and, as a consequence, some of the implemented tools provide functionalities for this particular cloud environment. However, other tools offer more generic features that can be used in other environments.

It is important to consider that, although Netflix has embraced the open source concept, the shared code provides solutions for cloud computing infrastructure only. The company is not sharing its innovations and technology around streaming video.

Open Source tools

The different tools released as part of the Netflix OSS platform can be categorized according to the functionality they provide. In this section you can find a brief description of these categories.

Additionally, you can find further information about the different categories and the associated tools in the blog posts that are referenced at the end of each category description. It is important to keep in mind that, at this point, more than 50 projects can be found in the Netflix OSS GitHub repository. As a consequence, we list and describe only the main tools corresponding to each category.

Big Data Tools

Behind the scenes, Netflix has a rich ecosystem of big data technologies facilitating their algorithms and analytics. They use and contribute to broadly-adopted open source technologies including Hadoop, Hive, Pig, Parquet, Presto, and Spark. Additionally, they have developed and contributed some additional tools and services which have further elevated their data platform.

To learn more about the main tools of the Netflix OSS platform that offer functionality associated with big data, please check out the Netflix OSS – Big Data Tools Overview blog post.

Build and Delivery Tools

In this category you can find build resources such as Nebula, which makes Gradle plugins easy to build, test and deploy. Additionally, this category includes tools to manage resources in AWS and to support deployments to this platform.

In the Netflix OSS – Build and Delivery Tools Overview blog post you can find more information about the available tools.

Common Runtime Services & Libraries

In this category you can find tools, libraries and services to power microservices. The cloud platform is the foundation and technology stack for the majority of the services within Netflix. This platform consists of cloud services, application libraries and application containers.

Take a look at the Netflix OSS – Common Runtime Services & Libraries Overview blog post to know more about the services and libraries used by Netflix that were released as open source software.

Data Persistence Tools

Handling a huge number of data operations per day required Netflix to extend existing open source software with their own tools. The scale at which Netflix consumes and manages data in the cloud has required them to build tools and services that enhance the datastores they use.

In this category you will find tools to store and serve data in the cloud. Take a look at the Netflix OSS – Data Persistence Tools Overview blog post to read more about these tools.

Insight, Reliability and Performance Tools

In this category you can find tools to gain operational insight into an application, take different kinds of metrics, and validate reliability by ensuring that the application can withstand different kinds of failures.

In the Netflix OSS – Insight, Reliability and Performance Tools Overview blog post you can find more information about these tools.

Security Tools

Netflix has released different security tools and solutions to the open source community. The security-related open source efforts generally fall into one of two categories:

  • Operational tools and systems to make security teams more efficient and effective when securing large and dynamic environments
  • Security infrastructure components that provide critical security services for modern distributed systems.

Check out the Netflix OSS – Security Tools Overview blog post to find further information about some of the security tools released by Netflix.

Getting Started

There are different ways to start working with the Netflix OSS tools.

The Zero to Cloud workshop offers a tutorial focused on bringing up the Netflix OSS stack on a fresh AWS account, in a similar style to how Netflix does it internally. To try it, you would need to have an AWS account and the required resources to set up the infrastructure.

Another way to start playing with the Netflix OSS tools is to analyze sample applications such as IBM Acme Air and Flux Capacitor. These applications use several of the Netflix OSS tools, so they can be useful for understanding how the tools can be used outside Netflix. In this case, you may also need the proper cloud infrastructure to set up the tools and execute them.

Finally, the fastest way to test some of the Netflix OSS tools is to use ZeroToDocker. If you are familiar with Docker, you can use the Docker images provided by Netflix to get some of the tools up and running in just a few minutes. Additionally, since some of the tools do not require AWS to work, you can run and test them in other cloud environments or locally.

Introduction to Azure Data Factory

We live in a world where data is coming at us from everywhere. IoT is evolving so quickly that right now almost every device seems capable of producing valuable information (from water quality sensors to smartwatches). At the same time, the amount of data collected is growing exponentially in volume, variety, and complexity, making the process of extracting useful information from terabytes of data stored in different places (data sources in a variety of geographic locations) a complex scenario that requires the creation of custom logic that has to be maintained and updated over time.

There are several tools and services nowadays that are used to simplify the process of extracting, transforming and loading (ETL) data from different (and most likely) heterogeneous data sources into a single source: an Enterprise Data Warehouse (EDW). Their goal is to obtain meaningful business information (insights) that could help improve products and make decisions.

In this post, we are going to explore Azure Data Factory, the Microsoft cloud service for performing ETL operations to compose streamlined data pipelines that can be later consumed by BI tools or monitored to pinpoint issues and take corrective actions.

Azure Data Factory

Azure Data Factory is a fully managed service that merges traditional EDWs with modern Big Data scenarios such as social feeds (Twitter, Facebook), device information and other IoT data. This service lets you:

  • Easily work with diverse data storage and processing systems, meaning you can process both on-premises data (like a SQL Server) and cloud data sources such as Azure SQL Database, Blob, Tables, HDInsight, etc.
  • Transform data into trusted information, via Hive, Pig and custom C# code activities that can be fully managed by Data Factory on your behalf (meaning that, for instance, no manual Hadoop cluster setup or management is required).
  • Monitor data pipelines in one place. For this, you can use an up-to-the-moment monitoring dashboard to quickly assess end-to-end data pipeline health, pinpoint issues, and take corrective action if needed.
  • Get rich insights from transformed data. You can create data pipelines that produce trusted data, which can be later consumed by BI and analytic tools.


Now that we know the basics, let's see each of these features in a real scenario. For this, we are going to use the Gaming customer profiling sample pipeline provided in the Azure Preview Portal. You can easily deploy this Data Factory in your own Azure subscription by following this tutorial and explore it using the Azure Preview Portal, which also displays the Data Factory diagram for the sample (you can visualize it by clicking the Diagram tile inside the Data Factory blade).


The following is a brief description of the sample:

“Contoso is a gaming company that creates games for multiple platforms: game consoles, hand held devices, and personal computers (PCs). Each of these games produces tons of logs. Contoso’s goal is to collect and analyze the logs produced by these games to get usage information, identify up-sell and cross-sell opportunities, develop new compelling features, etc. to improve business and provide better experience to customers. This sample collects sample logs, processes and enriches them with reference data, and transforms the data to evaluate the effectiveness of a marketing campaign that Contoso has recently launched.”

Easily work with diverse data storage and processing systems

Azure Data Factory currently supports the following data sources: Azure Storage (Blob and Tables), Azure SQL, Azure DocumentDB, On-premises SQL Server, On-premises Oracle, On-premises File System, On-premises MySQL, On-premises DB2, On-premises Teradata, On-premises Sybase and On-premises PostgreSQL.
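Each of these data stores is attached to a Data Factory through a linked service definition. As an illustration, a linked service for an Azure Storage account is declared in JSON roughly along these lines (the service name, account name and key below are placeholders, and the exact schema may vary between service versions):

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```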

For instance, the Data Factory sample combines information from Azure Blob Storage and Azure SQL Database.


Transform data into trusted information

Azure Data Factory currently supports the following activities: Copy Activity (on-premises to cloud, and cloud to on-premises), HDInsight Activity (Pig, Hive, MapReduce, Hadoop Streaming transformations), Azure Machine Learning Batch Scoring Activity, Azure SQL Stored Procedure activity, Custom .NET activities.

In the Data Factory sample, one of the pipelines executes two activities: an HDInsight Hive Activity that combines data from two different blob storage tables into a single one, and a Copy Activity that copies the results of the previous activity (stored in an Azure Blob) to an Azure SQL Database.
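As a rough sketch, such a pipeline might be declared in Data Factory's JSON format as follows. The activity, dataset and script names here are hypothetical, and required properties such as schedules, policies and linked service references are omitted for brevity:

```json
{
  "name": "MarketingCampaignPipeline",
  "properties": {
    "activities": [
      {
        "name": "EnrichLogsHiveActivity",
        "type": "HDInsightHive",
        "inputs": [{ "name": "RawGameLogs" }, { "name": "ReferenceData" }],
        "outputs": [{ "name": "EnrichedLogsBlob" }],
        "typeProperties": { "scriptPath": "scripts/enrichlogs.hql" }
      },
      {
        "name": "CopyToSqlActivity",
        "type": "Copy",
        "inputs": [{ "name": "EnrichedLogsBlob" }],
        "outputs": [{ "name": "MarketingCampaignEffectiveness" }]
      }
    ]
  }
}
```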


Monitor data pipelines in one place

You can use the Azure Preview Portal to view details about the Data Factory resource, such as linked services, datasets and their details, the latest activity runs and their status, etc. You can also configure the resource to send notifications when an operation completes or fails (more details here).


Get rich insights from transformed data

You can use data pipelines to deliver transformed data from the cloud to on-premises sources like SQL Server, or keep it in your cloud storage sources for consumption by BI tools and other applications.

In this sample, the collected log information and reference data are transformed to evaluate the effectiveness of the marketing campaigns that Contoso has launched.


Docker Compose: Creating Multi-Container Applications

Introduction

Simply deploying apps to Docker is not an architectural shift that will bring the agility, isolation and automated DevOps capabilities that a microservices approach can offer: you can always deploy a monolithic application into a Docker container.

The microservices architecture is an approach that decomposes a single application into a suite of small, independently deployable services that communicate with each other, with a bare minimum of centralized management.

What is Docker Compose?

Docker Compose is an orchestration tool that makes spinning up multi-container applications effortless.

With Compose, you define a multi-container application in a single file, then spin your application up with a single command that takes care of everything needed to get it running.
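As an illustration, here is a minimal (hypothetical) `docker-compose.yml` for a two-container application: a web app built from the current directory plus a Redis backend. The service names, ports and images are assumptions for this sketch:

```yaml
web:
  build: .
  ports:
    - "5000:5000"
  links:
    - redis
redis:
  image: redis
```

Running `docker-compose up` then builds the web image, pulls Redis, and starts both containers with a single command.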


While Compose is not yet considered production-ready, it is great for dev environments, staging, learning and experimenting.


Introduction to Azure Machine Learning

Note: If you are already familiar with machine learning you can skip this post and jump directly to the Creating a Machine Learning Web Service post by Diego Poza, which explains how you can use Azure Machine Learning with a specific example.

Machine learning is a science that allows computer systems to learn and improve on their own based on past experience or human input. It might sound like a new technique, but in reality many of our most common interactions with apps and the Internet are driven by automatic suggestions or recommendations, and some companies even make decisions using predictions based on past data and machine learning algorithms.

This technology comes in handy especially when handling Big Data. Today, companies collect and accumulate data at massive, unmanageable rates (website clicks, credit card transactions, GPS trails, social media interactions, etc.), and it’s becoming a challenge to process all the valuable information and use it in a meaningful way. This is where rule-based algorithms fall short: machine learning algorithms use all the collected, “past” data to learn patterns and predict results (insights) that help make better business decisions.

Let’s take a look at these examples of machine learning. You may be familiar with some of them:

  • Online movie recommendations on Netflix, based on several indicators like recently watched titles, ratings, search results, movie similarity, etc. (see here)
  • Spam filtering, which uses text classification techniques to move potentially harmful emails to your Junk folder.
  • Credit scoring, which helps banks decide whether or not to grant loans to customers based on credit history, historical loan applications, customers’ data, etc.
  • Google’s self-driving cars, which use computer vision, image processing and machine learning algorithms to learn from actual drivers’ behavior.
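The spam-filtering example above can be sketched as a toy text classifier in plain Python. This is purely illustrative (real filters use far richer features and models); the training messages and labels below are made up for the sketch:

```python
# Toy spam classifier: count how often each word appears in spam vs. ham
# training messages, then score new messages with smoothed word frequencies.
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values()) or 1
        score = 1.0
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            score *= (counter[word] + 1) / (total + len(counter))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule tomorrow", "ham"),
    ("project status report", "ham"),
]
model = train(training)
print(classify(model, "win cheap money"))  # -> spam
```

Everything here is learned from the "past" data: nobody wrote a rule saying that "cheap" is suspicious; the frequencies in the training set did.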

As seen in the examples above, machine learning is a useful technique for building models from historical (or current) data in order to forecast future events with an acceptable level of reliability. This general concept is known as predictive analytics, and to get more accurate analyses you can also combine machine learning with other techniques such as data mining or statistical modeling.

In the next section, we will see how we can use machine learning in the real world, without needing to build a large infrastructure or reinvent the wheel.

What is Azure Machine Learning?

Azure Machine Learning is a cloud-based predictive analytics service for solving machine learning problems. It provides visual and collaborative tools to create predictive models that can be published as ready-to-consume web services, without worrying about the hardware or the VMs that perform the calculations.


Azure Machine Learning Studio

You can create predictive analysis models in Azure ML Studio, a collaborative, drag-and-drop tool for managing Experiments, which basically consist of datasets plus the algorithms used to analyze the data, “train” the model and evaluate how well it predicts values. All of this can be done with no programming, because the Studio provides a large library of state-of-the-art machine learning algorithms and modules, a gallery of experiments authored by the community, and ready-to-consume web services that can be purchased from the Microsoft Azure Marketplace.
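Once an experiment is published as a web service, it can be consumed from any language over REST. Below is a hedged Python sketch of the request format such a scoring endpoint typically expects; the endpoint URL, API key and column names are placeholders, so check your own service's API help page for the exact contract:

```python
# Sketch of consuming a published Azure ML web service over REST.
# The URL, API key and columns below are placeholders, not real values.
import json
import urllib.request

def build_request(column_names, rows):
    """Build the JSON body a batch/request-response scoring endpoint expects."""
    return {
        "Inputs": {
            "input1": {
                "ColumnNames": column_names,
                "Values": rows,  # one inner list per row to score
            }
        },
        "GlobalParameters": {},
    }

def score(url, api_key, body):
    """POST the body to the scoring endpoint and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read())

body = build_request(["age", "income"], [[34, 52000]])
# result = score("https://<region>.services.azureml.net/.../score", "<API_KEY>", body)
```

The actual call is commented out because it needs a deployed service; the point is that the published model is just an HTTP endpoint taking and returning JSON.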


Next steps

  • What is Azure Machine Learning Studio?
    Understand more about the Azure Machine Learning Studio workspace and what you can do with it.
  • Machine learning algorithm cheat sheet
    Investigate some state-of-the-art machine learning algorithms to help you choose the right one for your predictive analytics solution. There are three main categories of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. The Azure Machine Learning library contains algorithms of the first two, so it might be worth a look.
  • Azure Machine Learning Studio site
    Get started, read additional documentation and watch webinars about how to create your first experiment in the Azure Machine Learning Studio tool.