Hello to all my subscribers. Hope you all are having a great weekend.
In this week’s post, I’ll write about one of the AWS Analytics services - EMR.
EMR is like a Swiss Army knife for big data frameworks within the AWS Analytics category of services. Read on to understand this more…
What is Amazon EMR
Amazon EMR (Elastic MapReduce) is one of the core Analytics services offered by AWS.
EMR provides managed Hadoop capabilities within AWS. You get different big data frameworks like Apache Spark, Hive, HBase, Presto & TensorFlow as part of EMR.
Since it provides managed Hadoop capabilities, you can easily use EMR for creating big data solutions. You don’t need to worry about installing and managing Hadoop & Spark clusters yourself. When you provision an EMR cluster, the frameworks you select, like Spark or Hive, come pre-installed & pre-configured in the cluster.
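To make the provisioning step a little more concrete, here is a minimal Python sketch that builds the request you could pass to boto3's `run_job_flow` call. The helper name `build_cluster_request`, the release label, the instance type and the node counts are my own illustrative assumptions, and the default role names assume the standard EMR roles already exist in your account. The sketch only builds the request and does not call AWS.

```python
def build_cluster_request(name, applications, release_label="emr-6.15.0",
                          instance_type="m5.xlarge", core_count=2):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All default values here are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Frameworks listed here are installed and configured by EMR.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names; assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", ["Spark", "Hive"])
    print(request["Applications"])
```

With AWS credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`.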
Benefits
Amazon EMR is a managed Hadoop offering, so it comes with the usual benefits of a managed service.
No need to manually set up EC2 machines or install Hadoop packages
One-click installation of required frameworks during cluster provisioning
Auto Scaling of the cluster for varying workloads
EMR notebooks for developing Spark code
Option to store data on S3 so that it can be accessed even after the cluster is terminated
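As a rough illustration of the auto scaling benefit, the sketch below builds a managed scaling policy, which is one of the ways EMR can resize a cluster. The helper name `build_managed_scaling_policy` and the capacity numbers are assumptions for this example. The resulting dictionary follows the shape expected by boto3's `put_managed_scaling_policy` call.

```python
def build_managed_scaling_policy(min_units, max_units):
    """Build a ManagedScalingPolicy dict for an EMR cluster.

    The limits are counted in instances; the values passed in are up to you.
    """
    if min_units < 1 or max_units < min_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


if __name__ == "__main__":
    # Hypothetical limits: scale between 2 and 10 instances.
    print(build_managed_scaling_policy(2, 10))
```

The policy could then be attached to a running cluster with `boto3.client("emr").put_managed_scaling_policy(ClusterId=..., ManagedScalingPolicy=policy)`, where the cluster ID is whatever your own cluster returned at creation time.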
Use Cases
Since EMR offers various big data frameworks, it can be used for implementing different types of workloads. Some of these are listed below.
Creating Spark ETL applications for batch workloads
Implementing NoSQL-based solutions using HBase
Running interactive queries using Presto
Supporting streaming use cases using Spark Streaming
Implementing ML workloads using TensorFlow
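Taking the Spark ETL use case from the list above as an example, a batch job is typically submitted to an EMR on EC2 cluster as a "step". Here is a minimal sketch of such a step definition. The helper name `build_spark_step`, the bucket and the script path are hypothetical placeholders. The `command-runner.jar` entry is how EMR runs commands like `spark-submit` on the cluster.

```python
def build_spark_step(name, script_s3_uri, script_args=()):
    """Build one EMR step that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        # Leave the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_s3_uri,
                *script_args,
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical script location and argument, for illustration only.
    step = build_spark_step(
        "daily-etl",
        "s3://my-example-bucket/jobs/etl.py",
        ["--run-date", "2024-01-01"],
    )
    print(step["HadoopJarStep"]["Args"])
```

A step like this could be added to an existing cluster with `boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step])`.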
Deployment Options
While EMR on EC2 is the most popular & widely used way of running EMR, there are other deployment options as well.
EMR on EC2 - most popular & widely used
EMR on EKS - For Kubernetes-based applications
EMR on Outposts - For using EMR on-prem within the customer's data center
EMR Serverless - The newer serverless option
Reference Links
This post just gives you a very high-level overview of Amazon EMR. In case you would like to explore further, here are some useful links.
One of the best videos for understanding EMR
Glue vs EMR - A quick comparison I made between Amazon EMR & AWS Glue, which is another AWS service for running Spark jobs.
Thanks for reading. I hope this post helps you get started with Amazon EMR.
There is no right or wrong answer to this.
The choice should be based on your use cases, your tech stack and your overall strategy.
If you need a multi-cloud strategy, you can go for Databricks. If you want to use AWS native services, you can use EMR or Glue. This is just one angle to look at it from. There are many more aspects...