
AWS Elastic MapReduce (EMR) — 6 Caveats You Shouldn’t Ignore

Oct 28 · 7 min read


If you are in the data and analytics industry, you must have heard of the burgeoning "data lake" trend which, in simple terms, represents a storage strategy that allows organizations to store data from different sources and of different characteristics (size, format and velocity) in one place. The data lake then becomes an enabler for a number of use-cases, like advanced analytics or data warehousing to name a few, and data is generally moved to a more specialized store, e.g. MPP relational engines or NoSQL, to better serve the requirements of a specific use-case. If the platform is provisioned in a cloud environment like AWS, Azure or GCP, then object stores (e.g. S3 for AWS, Azure Data Lake Storage Gen2 for Azure) are usually the strongest candidates to provide the foundational physical layer of the data lake. Once the data is in the data lake, it passes through a series of zones/phases/layers to establish semantic consistency and to conform it for optimal consumption.

Usually, as per the reference architecture of data-lake platforms, which is agnostic of the cloud provider you choose, Hadoop (specifically Spark) is employed as the processing engine to process the data in the data lake as it progresses through the different layers. These processing frameworks are well integrated with data-lake services and provide capabilities like horizontal scalability, in-memory computation and unstructured data processing, which position them as viable options in this context. One generally has a number of choices for running Hadoop distributions in the cloud: for instance, one can provision IaaS-based infrastructure (i.e. AWS EC2 or Azure Virtual Machines) and install a Hadoop distribution on it (e.g. vanilla Hadoop, Cloudera/Hortonworks). Alternatively, almost all the cloud providers offer Hadoop natively as a managed service (e.g. Elastic MapReduce (EMR) in AWS, HDInsight/Databricks in Azure, Cloud Dataproc in GCP). Each option has its pros and cons. For example, with an IaaS-based implementation, the overhead of provisioning, configuring and maintaining the cluster yourself becomes a strong concern for many. Also, intrinsic propositions of the cloud like elasticity and scalability are harder to realize in an IaaS-based implementation.

On the other hand, managed offerings like EMR do provide strong propositions in terms of reduced support and administration overhead. But from a functional-capabilities point of view, there are still a number of caveats or gotchas that one has to be cognizant of while using managed Hadoop offerings from the leading cloud providers. In my experience I have worked with the managed Hadoop offerings of all three major cloud providers, but in this post I am going to highlight a few caveats specifically about EMR from AWS. The motive is to enable you, the reader, to be better equipped to leverage the potential of EMR while avoiding issues that you can experience if these implications are not factored into your development.

AWS Glue Data Catalog:

AWS Glue is a managed data catalog and ETL service. When used for data catalog purposes, it provides a replacement for the Hive metastore that traditional Hadoop clusters rely on for Hive table metadata management.

Conceptual view of how Glue integrates with the AWS services eco-system. (Source: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html)
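For illustration, here is a minimal sketch (Python with boto3, not part of the original article) of launching an EMR cluster configured to use the Glue Data Catalog as its Hive/Spark metastore. The cluster name, release label, instance settings and roles are placeholders you would adapt to your environment.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Point Hive and Spark SQL at the Glue Data Catalog instead of a local metastore.
glue_catalog_config = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
]

response = emr.run_job_flow(
    Name="demo-cluster",                    # placeholder name
    ReleaseLabel="emr-5.28.0",              # placeholder release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=glue_catalog_config,
    Instances={
        "MasterInstanceType": "m5.xlarge",  # placeholder sizing
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])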

1. Glue Database Location:

When working with the Glue Catalog, one generally creates databases, which provide a logical grouping of tables in the catalog. Now, if you create a Glue catalog database normally, with a command like:

CREATE DATABASE <database_name>

and then do either of the following:

  1. Compute statistics on a table in this database:
ALTER TABLE <database_name.table_name> COMPUTE STATISTICS
  2. Use the Spark SQL DataFrame's saveAsTable function:
df.write.saveAsTable("database_name.table_name")

you will most definitely encounter exceptions (one stating that it "can't create a path from an empty string" and the other complaining about "no route to host").

The reason for this caveat is that, by default, Glue sets the database location to an HDFS location that is only valid for the cluster from which you created the database. However, if you are using multiple clusters or using clusters in a transient fashion (i.e. you launch a cluster, use it and then terminate it), that HDFS location won't remain valid and thus poses issues.

The resolution is to either explicitly specify an S3 location while creating the database:

CREATE DATABASE <database_name> LOCATION 's3://<bucket_name>/<folder_name>'

Or you can edit the database location in the Glue Catalog after it has been created.
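If you prefer to do that programmatically, the following is a minimal sketch using boto3's Glue update_database call to point an existing database at an S3 location; the database name and bucket below are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

database_name = "my_database"                 # placeholder
new_location = "s3://my-bucket/my_database/"  # placeholder

# Re-register the database with an explicit S3 LocationUri.
glue.update_database(
    Name=database_name,
    DatabaseInput={
        "Name": database_name,
        "LocationUri": new_location,
    },
)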

2. Renaming Glue Table Columns:

If you have created a table and want to rename a column, you can do that via AWS Glue. However, what I've seen is that even though you can do it via Glue, it results in inconsistent metadata. For example, if you rename a column and then query the table via Athena and/or EMR, the two will show different views, i.e. one will show the renamed column and the other will show the original one. Thus it's suggested to avoid doing that and instead create a new table with the right column names.
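As an illustration, here is a hedged PySpark sketch of rebuilding the data under a new table with the desired column names, rather than renaming a column in place; all table and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the existing table, rename the column in the DataFrame,
# and write it out as a brand new table.
spark.table("my_database.old_table") \
    .withColumnRenamed("old_column", "new_column") \
    .write \
    .mode("overwrite") \
    .saveAsTable("my_database.new_table")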

3. Data Manipulation with External Tables:

This gotcha is not exclusive to AWS EMR, but it's something to be vigilant of. Glue tables projected onto S3 buckets are external tables. As a result, if you drop a table, the underlying data doesn't get deleted. If you drop a table, create it again and overwrite it (either via spark.sql() or via the DataFrame APIs), it will overwrite the contents as expected. However, if you drop the table, create it and then do an INSERT, the original data will still be there, so you will practically get an append result instead of an overwrite. If you want to drop the contents of a table as well, one way is to delete the files in S3 (via the AWS CLI/SDK or the console); just be sure to run emrfs sync after doing that (as highlighted below).
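For example, a minimal boto3 sketch for clearing the S3 data behind an external table could look like the following; the bucket name and prefix are placeholders, and remember to run emrfs sync afterwards, as discussed in the EMRFS section below.

import boto3

s3 = boto3.resource("s3")

# Delete every object under the table's S3 prefix (placeholder values).
bucket = s3.Bucket("my-bucket")
bucket.objects.filter(Prefix="warehouse/my_table/").delete()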

EMR File System (EMRFS):

In line with the recent architectural pattern where the compute and storage layers are segregated, one common way an EMR cluster is used is with S3 serving as the storage layer, the cluster reading and writing data from/to S3. As S3 is intrinsically an object store, and such object stores usually have consistency constraints (i.e. they are eventually consistent in certain aspects), this can pose challenges when used with compute engines like Hadoop. The approach taken by AWS in this context comes in the form of EMRFS, an implementation of the Hadoop file system that EMR clusters use for reading and writing files directly to and from S3. This provides the capability of storing persistent data in Amazon S3 for use with Hadoop while also providing features like consistent view.

Source: https://www.slideshare.net/AmazonWebServices/deep-dive-amazon-elastic-map-reduce

The caveat here is that EMRFS employs another store, DynamoDB (a NoSQL offering from AWS), to persist metadata about S3. EMR then leverages DynamoDB to check whether the S3 metadata (e.g. object keys) is consistent. This has a few implications:

1. Pricing:

As you'll use DynamoDB, there will be a cost associated with it. When a table in DynamoDB is created to store EMRFS metadata, it is created with provisioned read/write capacity. As DynamoDB is quite economical, the cost won't be huge, but it's still good to be mindful of it and make sure to factor that cost into your platform OPEX estimations.

2. Lots of Small Files Implications:

If you are operating on large volumes of data with a lot of small files (which you should avoid by all means if you want your Hadoop processing pipelines to perform) and multiple clusters are reading/writing from/to S3, the metadata to be managed by DynamoDB increases, and this at times impacts the provisioned read and write throughput. In such scenarios, you can either increase the provisioned read/write capacity of the DynamoDB table or set it to auto-scale. You can monitor the consumption in DynamoDB, set alarms if you run into such issues, and automate the whole process as well (in a number of ways, e.g. using AWS Lambda).
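As a rough illustration, a boto3 sketch for raising the provisioned throughput of the EMRFS metadata table could look like this; "EmrFSMetadata" is the default table name (adjust it if you changed it), and the capacity numbers are arbitrary examples rather than recommendations.

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Bump read/write capacity on the EMRFS metadata table.
dynamodb.update_table(
    TableName="EmrFSMetadata",
    ProvisionedThroughput={
        "ReadCapacityUnits": 600,   # example value, size for your workload
        "WriteCapacityUnits": 300,  # example value
    },
)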

3. Synchronizing Metadata:

If reads and writes of data from/to S3 happen via EMR, the EMRFS metadata remains consistent and all remains good. If, however, you manipulate the S3 data by any external mechanism, e.g. using the AWS CLI/SDK to delete or add files for any reason, the metadata tends to become inconsistent. As a result, EMR jobs can appear to be stuck. In such instances, the resolution is to synchronize the metadata, i.e. the changes that happened in S3 (e.g. addition/deletion of objects) need to be registered in DynamoDB where the EMRFS metadata is persisted. One way to do that is via the Linux bash shell: the emrfs utilities generally come installed on EMR nodes, so you can SSH into the EMR master node and run:

emrfs sync s3://<bucket_name>/<folder_where_you_manipulated_data>

Upon running that, the EMRFS metadata in DynamoDB will get updated, the inconsistencies will be eliminated and your EMR jobs should run smoothly.
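Alternatively, if SSH access to the master node is not convenient, a hedged sketch of submitting emrfs sync as an EMR step through command-runner.jar via boto3 could look like this; the cluster ID and S3 path are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Run "emrfs sync" on the cluster as a step instead of over SSH.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "emrfs-sync",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["emrfs", "sync", "s3://my-bucket/my-folder/"],
            },
        }
    ],
)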

Every product has a number of nuances, and so is the case with EMR. The purpose of this blog post is not to emphasize its negative aspects, but rather to educate you, the potential user of EMR, about the implications so that you can make the best use of this wonderful service from AWS. This is coming from someone who has used it successfully to process terabytes of data with minimal to no issues in enterprise-grade production environments.

If you are interested in up-skilling yourself in this set of technologies, do check out my best-selling course about Spark and my book about Scala programming for Big Data analytics.

