
Kerberizing Hadoop Clusters at Twitter

source link: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2023/kerberizing-hadoop-clusters-at-twitter

Authors

Ashwin Poojary, Head of Platform Services SRE,  @ash_win_p
Sampath Kumar, Senior Site Reliability Engineer,  @realsampat
Santosh Marella, Senior Staff Software Engineer, @santoshmarella

Background

Twitter runs one of the largest Hadoop installations on the planet. The use cases Hadoop serves range from general-purpose HDFS storage to search, AI/ML, metrics and reporting, spam/abuse prevention, ads, and internal tools. The diversity of use cases means Hadoop hosts several large datasets (including some sensitive ones) that are accessed by a variety of internal teams, such as engineering, product, business, finance, and legal.

As Twitter evolved over the years, so did the Hadoop footprint at Twitter. Tens of Hadoop clusters with thousands of nodes, hundreds of petabytes of logical storage (without replication), and tens of thousands of MapReduce/Spark jobs per day (among other such metrics) have been the norm for a few years now.

Such a battle-tested and stable Data Infrastructure Platform has a few challenges:

a) Making any change has a high amplification risk
b) Asking customers to make any code changes to their apps doesn’t scale
c) Cross cluster dependencies develop organically
d) Clusters can’t afford downtime, as it affects Twitter’s bottom line.

With the number of Hadoop datasets and jobs growing year-over-year, it became imperative to enforce strong Authentication, Authorization and Accountability for Hadoop data access. 

Like most Hadoop installations, Twitter adopted HDFS’ native Unix-like permission model for authorization from day one. Files and directories in HDFS are marked with LDAP user/group ownership and access permission bits. The ownership on nested directories is revisited from time to time to enforce least-privilege access.
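For illustration, here is a minimal sketch of what that model looks like through the public Hadoop FileSystem API; the dataset path, user, and group names are made up, and in practice ownership changes like these require HDFS superuser privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public final class DatasetPermissions {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path dataset = new Path("/user/ads/reports"); // illustrative dataset

    // Ownership maps to LDAP user/group; permission bits follow the
    // familiar Unix model (here: owner rwx, group r-x, others none).
    fs.setOwner(dataset, "ads-batch", "ads-eng");
    fs.setPermission(dataset, new FsPermission((short) 0750));
  }
}
```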

For accountability of Hadoop data access, Twitter uses HDFS audit logging, which is also a natively built-in HDFS feature. As users perform various HDFS operations (list directory, open, read, delete, etc.), the user’s LDAP name and IP address are logged along with the resource accessed and the operation performed.

Strong authentication, however, had been a tough puzzle to solve at Twitter, due to the challenges outlined in the previous section. While Hadoop natively supports Kerberos-based authentication, turning it on across several Hadoop clusters at Twitter without taking downtime or mounting a large coordination effort required some creative thinking, testing, and gradual execution.

Kerberization Building Blocks

Several systems at Twitter use Kerberos for authentication, so luckily a Kerberos Key Distribution Center (KDC) already existed when we set out to Kerberize Hadoop. However, the scale of Hadoop meant the KDC had to be ready to scale up as well:

  • Tens of thousands of additional service principals (for Hadoop services)
  • Thousands of client principals for service accounts (for Hadoop clients)
  • Hundreds of additional QPS (queries per second) to the KDC due to auth/service-ticket requests

Thanks to the KDC team at Twitter, these numbers (along with similar requests from other teams) were easily accommodated with some tuning of the master KDC and the addition of some replicas.

Keytab Generation & Distribution

In addition to the KDC, Hadoop kerberization required tooling to request keytabs and securely distribute them to the appropriate destination hosts. Twitter’s platform security team had received multiple similar requests from other teams that were trying to enforce strong authentication between services. A keytab generation service and a secure material distribution service were developed to help teams request keytabs and tag the appropriate destination hosts to distribute them to. The APIs were flexible enough to build additional tools on top, such as per-host keytab generation and a web UI that made this self-service for our internal teams.

NameNode Configuration

One of the design principles we chose while kerberizing Hadoop at Twitter was to use per-host service principals (of the form fooService/<host-fqdn>@REALM). The reason for this choice is simple: reduce the blast radius in the event a host gets compromised.

However, the Hadoop NameNodes are configured for High Availability via a ZooKeeper-based failover mechanism between the active and standby NameNode instances. So, when the active NN fails over and the standby is promoted to serve requests, the Hadoop clients start connecting to the now-active NameNode. If 1000s of clients are connected to HDFS and the NameNode fails over, they can easily flood the KDC with service ticket requests for the newly active NN. To prevent this, we relaxed the use of per-host principals for NNs and instead use the exact same service principal for both NNs (e.g. namenode/<shared-name>@REALM).
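As a rough illustration of that choice, the sketch below shows the relevant HDFS principal settings. The realm, keytab paths, and the shared NameNode principal name are assumptions; in practice these standard properties live in hdfs-site.xml rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;

public final class HdfsPrincipalConfig {
  public static Configuration principalConf() {
    Configuration conf = new Configuration();

    // DataNodes keep per-host principals: Hadoop substitutes "_HOST" with
    // each node's FQDN, limiting the blast radius of a single-host compromise.
    conf.set("dfs.datanode.kerberos.principal", "dn/_HOST@EXAMPLE.COM");
    conf.set("dfs.datanode.keytab.file", "/path/to/dn.keytab");

    // Both NameNodes share one principal, so clients keep reusing their
    // cached service ticket across an HA failover instead of stampeding
    // the KDC for a new one.
    conf.set("dfs.namenode.kerberos.principal", "namenode/shared@EXAMPLE.COM");
    conf.set("dfs.namenode.keytab.file", "/path/to/nn.keytab");
    return conf;
  }
}
```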

Constraints for Kerberizing Hadoop @ Twitter

All or None

A kerberized service instance expects all its clients to present valid kerberos service tickets for the service instance (this in turn requires the clients to perform a successful kinit and obtain a valid TGT from the KDC). Client connections that fail to prove their kerberos credentials are simply dropped.

With 100s of teams accessing Hadoop and 1000s of jobs running in the clusters every day, this simply meant ALL of them had to be kerberos-ready before the Hadoop services could be kerberized.
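For context, "kerberizing" a Hadoop service boils down to the cluster-wide security switch sketched below. These are standard Hadoop properties that normally live in core-site.xml on every node; setting them through the Configuration API here is purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public final class SecureModeConfig {
  public static Configuration secureConf() {
    Configuration conf = new Configuration();
    // Flip authentication from the default "simple" to "kerberos"; once a
    // service runs with this, clients without valid credentials are dropped.
    conf.set("hadoop.security.authentication", "kerberos");
    conf.set("hadoop.security.authorization", "true"); // service-level ACLs
    UserGroupInformation.setConfiguration(conf);
    return conf;
  }
}
```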

Another option was to modify Hadoop services to “fail-open” or to have a “whitelist” of users. Both of these options initially sounded promising, but each had a few cons:

  • Defeats the purpose of strong auth and just delays solving the problem for good
  • Can promote bad practices such as “switch to a whitelisted user” to circumvent auth
  • Not an in-built Hadoop feature

Scale of services embedding Hadoop clients

Microservices with 1000s of instances are common at Twitter. Several of them run with an embedded Hadoop client library in order to interact with Hadoop. Kerberization of Hadoop requires that all these microservices be kerberized, i.e.:

  • All instances of a service foo running as unix user bar need a kerberos keytab with a principal for bar on all the 1000s of hosts the service instances run on.
  • All the instances need to be restarted and kinit’ed from bar’s keytab in order to make successful calls to a kerberized Hadoop service (or accept a hit to the service’s QoS).

Restarting a service with thousands of instances is a large, coordinated effort that can sometimes take a few days (due to other dependencies on the service). A “Hadoop kerberization” event, where we say “Hadoop will be kerberized on this day and you need to restart all your service instances following it”, simply doesn’t fly when multiple services have thousands of instances.

Cross Cluster Dependencies

Another challenge at Twitter was that there are tens of Hadoop clusters, and by no means could we kerberize them all in a single major deployment, even by taking downtime. Therefore, we had to plan for the situation where some clusters are kerberized and some inevitably are not. So, what happens when a job that runs in a non-kerberized cluster wants to access data from a kerberized cluster? Obviously, those calls will fail (as kerberization is an all-or-none deal).

We tried to lay out a graph of these cross-cluster dependencies to see if there were any access patterns that could help us chalk out a kerberization order. However, it looked pretty bad, as there were several cyclic dependencies that were hard to disentangle. Moreover:

  • Our cross-cluster data replication services heavily rely on jobs that copy data from a source cluster to a destination cluster. The jobs are spun up in the destination cluster and they read the data from the source cluster. If the source cluster is kerberized and the destination is not, these jobs fail, which can ultimately impact Twitter’s bottom line.
  • ViewFS at Twitter makes it really easy for client applications to perform reads/writes that span multiple clusters (a mount-table sketch follows this list). It was tough to even figure out how many cross-cluster dependencies existed in order to develop a reasonable kerberization strategy for Hadoop.
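To make the ViewFS point concrete, here is a hedged sketch of a mount table; the mount-table name, cluster names, and paths are made up. With links like these, a single logical path space silently fans out to multiple physical clusters, which is why enumerating cross-cluster dependencies was hard.

```java
import org.apache.hadoop.conf.Configuration;

public final class ViewFsMounts {
  public static Configuration viewFsConf() {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs://example-cluster/");

    // Each link maps a logical path to a directory on a physical cluster;
    // a job reading /logs and writing /user touches two different clusters.
    conf.set("fs.viewfs.mounttable.example-cluster.link./logs",
             "hdfs://cluster-a-nn/logs");
    conf.set("fs.viewfs.mounttable.example-cluster.link./user",
             "hdfs://cluster-b-nn/user");
    return conf;
  }
}
```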

Strategies Considered

A new set of Kerberized Clusters

One possibility we considered (before we fully understood the cross-cluster dependencies) was to stand up a set of kerberized clusters and migrate the use cases from the non-kerberized to the kerberized clusters. The biggest pro of this approach is that we don’t need to worry about “what would break”, as we are starting with a clean slate. The downside is that we face a long tail of migration and additional CapEx and OpEx. Also, once we discovered that the cross-cluster dependencies were too many, this option quickly fizzled out, as no cross-cluster use case would work when one of the clusters is kerberized and the other is not.

Tunnel all Hadoop requests through a proxy

Hadoop has a component called HttpFS that exposes a REST proxy interface via which HTTP clients can interact with HDFS. Taking a cue from it, we thought about developing a proxy that enforces strong authentication on its clients (not necessarily kerberos), but talks to Hadoop using the end user’s kerberos credentials.

This approach solves the problem of not needing the Hadoop clients (1000s of them) to be kerberized, but shifts the problem to making them strongly authenticate with the proxy (via, say, s2s). The “cross-cluster dependencies” problem still exists, though. Another drawback is that a proxy host compromise has a very large blast radius, as it compromises the keytabs of every single HDFS user.
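A minimal sketch of what such a proxy could have looked like, using standard UserGroupInformation APIs; the realm, keytab layout, and class are hypothetical, and this approach was not what we ultimately shipped.

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public final class HdfsProxySketch {
  // List an HDFS path on behalf of an end user, using a keytab the proxy
  // holds for that user -- the concentration of keytabs that makes a proxy
  // host compromise so damaging.
  public static FileStatus[] listAs(String user, String path) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        user + "@EXAMPLE.COM", "/proxy/keytabs/" + user + ".keytab");
    return ugi.doAs((PrivilegedExceptionAction<FileStatus[]>) () ->
        FileSystem.get(new Configuration()).listStatus(new Path(path)));
  }
}
```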

Take downtime

Taking downtime on one cluster seemed reasonable to start with. But the cyclic dependencies meant we needed to take downtime on multiple clusters simultaneously, which was a huge risk to undertake.

Solution

After much deliberation, we took a multi-pronged approach to deal with the constraints, developing code changes and strategies as needed.

The following were our guiding principles:

  • Kerberize user applications without requiring changes to user code. 1000s of applications cannot be asked to recompile and redeploy.
  • Be able to identify which users are kerberized and which are not. No TPM wants to chase hundreds of teams. Much easier to specifically request a handful.
  • Kerberized client applications can talk to non-kerberized Hadoop services when ipc.client.fallback-to-simple-auth-allowed is set to true. Exploit this built-in Hadoop RPC feature, and kerberize all the user applications (hundreds of thousands) and all the DataNode and NodeManager processes (tens of thousands) before kerberizing the Hadoop NameNodes and ResourceManagers; doing it the other way around incurs huge downtime (see the configuration sketch after this list).
  • Don’t break cross-cluster use cases. No one wants to debug arbitrary cross cluster failures on a stressful production deployment day.
  • HAVE A ROLLBACK PLAN. Just always have it!
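A minimal sketch of the fallback setting referenced above; normally this lives in the client’s core-site.xml rather than being set programmatically.

```java
import org.apache.hadoop.conf.Configuration;

public final class FallbackAuthConfig {
  public static Configuration clientConf() {
    Configuration conf = new Configuration();
    // A Kerberos-capable client talking to a server still running SIMPLE
    // auth downgrades gracefully instead of failing. This is what let us
    // kerberize clients, DataNodes, and NodeManagers before the NameNodes
    // and ResourceManagers.
    conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
    return conf;
  }
}
```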

Auto-login from keytab

Typically, Hadoop requires any application that uses a kerberos keytab to execute UserGroupInformation.loginUserFromKeytab(keytabPrincipal, keytabPath) in order to authenticate with the KDC (i.e. perform kinit from keytab). When any Hadoop operation is later performed (e.g. listing files under a directory or submitting a MapReduce job), the Hadoop RPC layer picks up the TGT and obtains an appropriate service ticket in order to talk to one of the Hadoop services (NameNode/ResourceManager).
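A minimal sketch of that standard, explicit-login flow, with an illustrative principal, keytab path, and HDFS path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public final class ExplicitKeytabLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // core-site.xml, hdfs-site.xml, ...
    UserGroupInformation.setConfiguration(conf);

    // Explicit "kinit from keytab"; without this call a secure cluster
    // rejects the connection.
    UserGroupInformation.loginUserFromKeytab(
        "foo@EXAMPLE.COM", "/bar/standard/path/foo.keytab");

    // Subsequent RPCs reuse the TGT to fetch a NameNode service ticket.
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("/user/foo"))) {
      System.out.println(status.getPath());
    }
  }
}
```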

At Twitter, services that talk to Hadoop take a compile-time as well as a runtime dependency on the Hadoop client library. The services also pick up the Hadoop configurations (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) at runtime. Therefore, as long as the client library changes are API compatible, it is safe to deploy a new client library version to all the hosts at Twitter without breaking the Hadoop client applications.

With this in mind, we decided to enhance UserGroupInformation.getLoginUser() and introduce the notion of auto-logging-in the user from the user’s keytab, should one exist at the standard location on the host where the user application is running. The standard location is typically a path on the local filesystem at which the keytab distribution service drops the keytab, and it can easily be constructed from the application’s unix user name. The functionality simply performs the following (sketched in code after the list below):

  • For an application running as unix user foo, check if foo’s keytab exists at the standard path, such as /bar/standard/path/foo.keytab.
  • If the keytab doesn’t exist, the client can talk to a non-kerberized cluster, but can’t talk to a kerberized cluster.
  • If the keytab exists, then the client library automatically executes UserGroupInformation.loginUserFromKeytab(keytabPrincipal, keytabPath) for the application’s unix user and obtains a TGT from the KDC. With ipc.client.fallback-to-simple-auth-allowed set to true, the client can now talk to both kerberized and non-kerberized Hadoop NameNodes & Resource Managers.
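The sketch below captures the gist of that auto-login behavior. It is not the actual patch: the helper class, keytab directory, and realm are assumptions; only the UserGroupInformation calls are standard Hadoop APIs.

```java
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;

public final class KeytabAutoLogin {
  // Standard location where the keytab distribution service drops keytabs
  // (illustrative; the real path is internal to Twitter).
  private static final String KEYTAB_DIR = "/bar/standard/path";
  private static final String REALM = "EXAMPLE.COM"; // assumed realm

  // Called from the enhanced getLoginUser() path: if the application's unix
  // user has a keytab at the standard location, log in from it so subsequent
  // RPCs carry Kerberos credentials; otherwise fall through, and with
  // ipc.client.fallback-to-simple-auth-allowed=true the client can still
  // talk to non-kerberized clusters.
  public static void maybeLoginFromKeytab() throws IOException {
    String unixUser = System.getProperty("user.name"); // e.g. "foo"
    File keytab = new File(KEYTAB_DIR, unixUser + ".keytab");
    if (!keytab.exists()) {
      return; // no keytab: SIMPLE auth only
    }
    UserGroupInformation.loginUserFromKeytab(
        unixUser + "@" + REALM, keytab.getAbsolutePath());
  }
}
```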

HDFS Audit Logging

Additionally, the HDFS audit logs typically show the UGI of the calling user. E.g. if the unix user foo performs a directory listing and foo has kerberos credentials, the audit log entry could read something like (the timestamps, IPs, and realm shown here are illustrative):

  • 2023-02-18 19:00:05,251 INFO audit: allowed=true ugi=foo@REALM (auth:KERBEROS) ip=/192.168.1.123 cmd=listStatus src=/user/foo/my/path dst=null perm=null proto=rpc

Without kerberos credentials, the same call on a kerberized cluster’s HDFS audit log would read:

  • 2023-02-18 19:00:05,251 INFO audit: allowed=false ugi=foo (auth:SIMPLE) ip=/192.168.1.123 cmd=listStatus src=/user/foo/my/path dst=null perm=null proto=rpc

Without kerberos credentials, the same call on a non-kerberized cluster’s HDFS audit log would read:

  • 2023-02-18 19:00:05,251 INFO audit: allowed=true ugi=foo (auth:SIMPLE) ip=/192.168.1.123 cmd=listStatus src=/user/foo/my/path dst=null perm=null proto=rpc

The capability to auto-login from a keytab, combined with the ipc.client.fallback-to-simple-auth-allowed=true setting and HDFS audit logging, gave us high leverage in kerberizing 100s of 1000s of Hadoop client applications. Where applications were still not kerberized, it was easy to find them in the HDFS audit logs (the auth:SIMPLE entries) and reach out to the owners of those applications (usually fewer than 10 teams).

Delegation Tokens and cross cluster jobs

With the ipc.client.fallback-to-simple-auth-allowed=true setting, kerberized client applications can submit jobs to a kerberized cluster and access data from a non-kerberized cluster. For example, a DistCp job can be launched in a kerberized cluster and copy data from a non-kerberized cluster.

However, kerberized client applications cannot launch jobs in a non-kerberized cluster and read data from a kerberized cluster. This is because, beyond the initial kerberos-based authentication, HDFS uses Delegation Tokens: applications like MapReduce obtain them at job submission time so that their tasks can call into the HDFS NameNode as they execute, without authenticating with the KDC again (doing so would flood the KDC with auth requests). A job submitted to a non-kerberized cluster never obtains delegation tokens for the kerberized cluster’s NameNode, so its tasks have no credentials to present.
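The sketch below shows how a job submitted from a kerberized client typically picks up delegation tokens for every NameNode its tasks will touch; the cluster URIs and paths are illustrative. When the submitting side is not kerberized, this token-fetching step is effectively skipped, which is why tasks in a non-kerberized cluster cannot read from a kerberized one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.security.TokenCache;

public final class CrossClusterTokens {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cross-cluster-copy");

    // The kerberized client fetches HDFS delegation tokens up front for both
    // the source and destination NameNodes, so the job's tasks never need to
    // talk to the KDC themselves.
    Path[] paths = {
        new Path("hdfs://source-cluster-nn/logs/2023/02/18"),
        new Path("hdfs://dest-cluster-nn/user/foo/output")
    };
    TokenCache.obtainTokensForNamenodes(
        job.getCredentials(), paths, job.getConfiguration());

    // ... configure mapper/reducer and input/output, then submit the job.
  }
}
```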

