Lessons learned from deploying Cloudera Data Platform for IBM Cloud Pak for Data
source link: https://developer.ibm.com/blogs/lessons-learned-from-cloudera-data-platform-on-ibm-cloud/
Useful tips, snags we hit, and how we resolved them
In this last blog post in our series, we focus on lessons learned from installing, maintaining, and verifying the connectivity of Cloudera Data Platform and IBM Cloud Pak for Data. If you haven't read the first two posts, A technical deep-dive on integrating Cloudera Data Platform and IBM Cloud Pak for Data and Installing Cloudera's CDP Private Cloud Base on IBM Cloud with Ansible, I'd invite you to go back and read them for additional context.
In this installment, we'd like to share some useful tips and tricks and help you avoid common mistakes made by first-time installers.
Lesson 1: Use a bastion host
Our Cloudera cluster had a total of 8 VMs (3 master nodes, 3 worker nodes, and 2 edge nodes). We wanted easy access to each node and wanted to limit public network traffic to the Cloudera cluster as much as possible. Luckily, there’s already a well-known solution to this problem: using a bastion host.
We spun up a small VM on the same subnet as our Cloudera cluster and could then easily communicate over private network interfaces (10.x.y.z IP addresses). For the installation process, this choice offered the benefit of not dropping connections for long-running Ansible playbooks.
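For example, an SSH ProxyJump stanza lets you reach each node transparently through the bastion (the host patterns and bastion address below are illustrative placeholders, not our actual values):

```shell
# Ensure the SSH config directory exists, then append a ProxyJump stanza:
# connections to any cid-vm-* host are tunneled through the bastion's
# public address automatically.
mkdir -p "$HOME/.ssh"
cat >> "$HOME/.ssh/config" <<'EOF'
Host bastion
    HostName 169.48.0.10      # bastion public IP (placeholder)
    User root

Host cid-vm-*
    ProxyJump bastion         # hop through the bastion
    User root
EOF
```

With that in place, a plain `ssh cid-vm-01` from your workstation lands on the node's private interface without exposing the node to the public Internet.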
Figure 1. The architecture of our Cloudera for Cloud Pak for Data environment
Lesson 2: Use VS Code's Remote Development Extension Pack
When installing Cloudera Data Platform with Ansible playbooks, you're likely going to need to change a few config options and values in the playbooks. We're not against using Vim, but we opted for the Visual Studio Code Remote Development Extension Pack. This made searching through files, modifying values, and uploading and downloading files much easier.
Figure 2. VS Code's Remote Development extension was useful for editing files and running commands against our remote machines
Lesson 3: Stick to private networks
This point may seem obvious, but it’s more about being consistent. Anywhere an IP address was to be input, we always made sure to use the private network IP address. This ensured that any traffic would stay on the IBM Cloud network and not the public Internet.
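As a sketch of how that consistency could be enforced, here is a small guard function (entirely illustrative, not part of our playbooks) that rejects anything outside the RFC 1918 private ranges:

```shell
# Return success only for addresses in the RFC 1918 private ranges:
# 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16.
is_private_ip() {
  case "$1" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

is_private_ip "10.93.14.7"  && echo "private -- OK to use"
is_private_ip "52.117.8.21" || echo "public -- reject"
```

A check like this in a playbook's preflight step catches an accidentally pasted public address before it lands in a cluster config.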
Lesson 4: Eliminate all inbound traffic except RDP on the Windows Active Directory server
Here is a subtle lesson that might otherwise be a little tricky to pin down. After a few days of uptime, the health checks on our Cloudera Data Platform indicated that the hosts could not reach our Active Directory (AD) server. Indeed, we discovered that our AD server had hung. Rebooting it would return things to normal for a day or so, and then the hang would repeat.
We looked over the capacity and performance of the server. When we examined network utilization, we noticed a high level of traffic to and from the system on the Internet-facing interface. After reviewing the server configuration and the traffic, we determined that the vast majority of it was over the LDAP port.
Since our only use of LDAP is internal, the solution was to limit inbound traffic to the AD server with a rule that allowed only RDP, the protocol used for remote desktop management. On IBM Cloud, we created a custom security group permitting inbound TCP on port 3389 for RDP.
Lesson 5: Mount secondary drives to /data/dfs automatically
The storage requirements for installing Cloudera required us to purchase additional drives to go along with our virtual machines. These drives had to be mounted before running any playbooks. We used a little bit of bash and SSH to do it in an automated way. In our case, we chose to mount the drives to /data/dfs:
# For each of the eight nodes: format the secondary drive, mount it at
# /data/dfs, and append an fstab entry so the mount survives reboots.
for i in {1..8}
do
  ssh cid-vm-0$i mkfs.ext4 -m0 -O sparse_super,dir_index,extent,has_journal /dev/xvdc
  ssh cid-vm-0$i mkdir -p /data/dfs
  ssh cid-vm-0$i mount /dev/xvdc /data/dfs
  ssh cid-vm-0$i 'echo "/dev/xvdc /data/dfs ext4 defaults,noatime 1 2" | tee -a /etc/fstab'
done
Lesson 6: Update OpenShift DNS operator so it knows the Cloudera node hostnames
We wanted our IBM Cloud Pak for Data instance, which runs on OpenShift, to be able to communicate with our newly deployed Cloudera Data Platform cluster. We stuck to our "always use private network interfaces" rule, but that resulted in 404s, since OpenShift didn't know how to resolve those hostnames. To get around this, we needed to edit the DNS operator on our OpenShift instance. The procedure is documented in the OpenShift DNS documentation, but for brevity, we've included what worked for us.
Edit the DNS operator's default CR: oc edit dns.operator/default
Update it by adding the following to the spec section:
spec:
  servers:
  - forwardPlugin:
      upstreams:
      - <your private ip>
      - <your public ip>
    name: cdplab-server
    zones:
    - cdplab.local
Then verify the configmap for CoreDNS is updated: oc get configmap/dns-default -n openshift-dns -o yaml
apiVersion: v1
data:
  Corefile: |
    # cdplab-server
    cdplab.local:5353 {
        forward . <your private ip> <your public ip>
    }
Finally, create a pod and try to access CDP from it; HTML should be returned rather than a 404 error message.
bash-4.4$ curl -k https://cid-vm-01.cdplab.local:7183/cmf/home
Lesson 7: Ensure the AD self-signed certificate can be used as a certificate authority
This lesson can be broadly applied to other LDAP and AD scenarios. In our case, we could successfully connect to the Impala service running on Cloudera through Kerberos, but not through LDAP. After double-checking that our LDAP-specific Impala configuration was correct, we were still getting a not-so-helpful “Can’t contact LDAP server” error.
We slowly started to peel back the layers of the problem and managed to isolate it to our LDAP configuration: running ldapsearch in an attempt to bind as the user gave us the same error message. Ah-ha! Impala was using an OpenLDAP library under the covers.
$ ldapsearch -H ldaps://cid-adc.cdplab.local:636 -D "[email protected]" -b "dc=cdplab,dc=local" '(uid=stevemar)' -W
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
After double-checking that the Windows firewall wasn't the culprit, we narrowed the problem down to a missing bit of information in the self-signed certificate we had created for the AD server. We needed to add the -TextExtension "2.5.29.19={text}CA=true" flag to the Windows New-SelfSignedCertificate command. Our new command looked like this (before, it was missing the last parameter):
New-SelfSignedCertificate -Subject *.$dnsName `
-NotAfter $lifetime.AddDays(365) -KeyUsage DigitalSignature, KeyEncipherment `
-Type SSLServerAuthentication -DnsName *.$dnsName, $dnsName `
-TextExtension "2.5.29.19={text}CA=true"
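The OID 2.5.29.19 in that flag is the X.509 Basic Constraints extension. As a quick local illustration of the same fix (file paths and the subject name below are placeholders, and this uses openssl rather than the Windows tooling), you can generate a self-signed certificate with CA=true and confirm the extension is present:

```shell
# Generate a throwaway self-signed cert that carries the Basic Constraints
# extension (OID 2.5.29.19) with CA:TRUE -- the same extension the
# -TextExtension flag adds on Windows. Requires OpenSSL 1.1.1+ for -addext.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/ad.key -out /tmp/ad.crt -days 1 \
  -subj "/CN=cid-adc.cdplab.local" \
  -addext "basicConstraints=critical,CA:TRUE"

# Inspect the cert and show the Basic Constraints section.
openssl x509 -in /tmp/ad.crt -noout -text | grep -A1 "Basic Constraints"
```

Running the same inspection against the certificate your AD server presents is a quick way to verify whether it can act as its own certificate authority.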
Lesson 8: Get familiar with Kerberos concepts and tools
There's no single piece of advice here, other than: if you're going to use Kerberos to secure your Cloudera cluster, get familiar with Kerberos concepts, like principals and keytabs, and tools like ktutil and ktpass.
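As one small example of the kind of task you'll run into, here is how a keytab can be assembled non-interactively with MIT Kerberos' ktutil on a Linux edge node (the principal, encryption type, and password below are placeholders, not values from our environment):

```shell
# Feed ktutil a script: add an entry for the principal (the line after
# addent answers ktutil's password prompt), write the keytab, and quit.
ktutil <<'EOF'
addent -password -p stevemar@CDPLAB.LOCAL -k 1 -e aes256-cts-hmac-sha1-96
password123
wkt /tmp/stevemar.keytab
quit
EOF

# List the entries in the resulting keytab to confirm it was written.
klist -kt /tmp/stevemar.keytab
```

On the Windows side, ktpass plays the analogous role of generating keytabs for AD service accounts.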
Summary and next steps
We hope you enjoyed reading about some of the pitfalls we encountered, and that you remember some of the tips we shared the next time you're deploying a data and AI platform. You can learn more about the Cloudera Data Platform for IBM Cloud Pak for Data joint offering.