
Microsoft has announced a preview of Azure HDInsight 3.6, released to gather feedback on Apache Spark 2.1. You can try out all the features available in the open source release of Apache Spark 2.1, along with the rich experience of using notebooks on Azure HDInsight.

The open source release of Apache Spark 2.1 is now available and brings many improvements for developers. These improvements range from Structured Streaming to support for using Apache Kafka (version 0.10) with Spark Streaming.
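As a rough illustration of the Kafka integration mentioned above, the following PySpark sketch reads a Kafka topic through Structured Streaming's "kafka" source. It assumes the spark-sql-kafka-0-10 package is available on the cluster and that a SparkSession is passed in. The helper names (kafka_source_options, read_kafka_stream) are hypothetical and not part of Spark or HDInsight.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the options expected by Spark's "kafka" streaming source."""
    return {
        # Comma-separated host:port list of Kafka brokers.
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        # Topic name(s) to subscribe to.
        "subscribe": topic,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame of (key, value) strings from Kafka.

    `spark` is an existing SparkSession. Nothing is read until a
    streaming query is started on the returned DataFrame.
    """
    reader = spark.readStream.format("kafka")
    for name, value in options.items():
        reader = reader.option(name, value)
    # Kafka delivers keys and values as binary, so cast them for readability.
    return reader.load().selectExpr(
        "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"
    )
```

In a notebook, a call such as read_kafka_stream(spark, kafka_source_options(["broker1:9092"], "events")) would return the streaming DataFrame; the broker address and topic name here are placeholder values.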

Apache Spark is an open source processing framework that runs large-scale data analytics applications. Built on an in-memory compute engine, Spark enables high-performance querying on big data. It uses a parallel data processing framework that persists data in memory and on disk if needed. This allows Spark to run some workloads up to 100x faster than Hadoop MapReduce and to provide a common execution model for tasks such as extract, transform, load (ETL), batch processing, and interactive queries on data in a Hadoop Distributed File System (HDFS). Azure makes Apache Spark easy and cost-effective to deploy, with no hardware to buy, no software to configure, a full notebook experience to author compelling narratives, and integration with partner business intelligence tools.

How to get started with Apache Spark 2.1 on Azure HDInsight

Go to the Microsoft Azure portal and create a new Azure HDInsight service.

Create HDInsight service

Once you select HDInsight, you can pick the Spark cluster type with version Spark 2.1 (HDI 3.6 Preview).

Select Spark 2.1 version

After creating the cluster, you will have access to all the tools, services, and notebooks, including Jupyter. You can open the Jupyter notebook by clicking “Cluster dashboard”.
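Once a notebook is open, a quick check like the one sketched below can confirm that the session is running Spark 2.1 and can execute a small job. This assumes a PySpark notebook in which a SparkSession named spark is already defined. The function names are hypothetical and chosen only for this illustration.

```python
def is_spark_2_1(version):
    """Return True if a version string such as '2.1.0' belongs to the 2.1 line."""
    return version.split(".")[:2] == ["2", "1"]


def smoke_test(spark):
    """Run a tiny aggregation and return the number of result rows.

    `spark` is an existing SparkSession, assumed to be provided by
    the notebook kernel.
    """
    if not is_spark_2_1(spark.version):
        print("Unexpected Spark version:", spark.version)
    # Build 1000 rows and assign each id to one of ten buckets.
    df = spark.range(1000).selectExpr("id", "id % 10 AS bucket")
    # Keep the data cached so repeated queries can reuse it.
    df.cache()
    # Count the distinct buckets by grouping and counting result rows.
    return df.groupBy("bucket").count().count()
```

Calling smoke_test(spark) in a notebook cell runs the job on the cluster. The version check only compares the first two components, so longer distribution-specific version strings are still recognized.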

Open Jupyter notebook

Need more help? Here are some useful links for Apache Spark 2.1 and Azure HDInsight: