
Databricks Connect

The ease of your personal computer with the power of a Databricks Spark cluster

Photo by Bruce Dixon on Unsplash

Shilpa, our data scientist, fell in love with Spark after going through our earlier articles on Spark and PySpark.

While Spark, with its in-memory computation and real-time streaming capabilities, is making her life better, using an IDE against a Databricks cluster is not the same as using an IDE on her own computer. She wants to connect her own machine to the Databricks cluster and develop Spark applications in her own computer’s Jupyter notebook. This is where Databricks Connect can help her.

What is Databricks Connect?

Below is the explanation from the Databricks website:

Databricks Connect allows you to connect your favorite IDE (Eclipse, IntelliJ, PyCharm, RStudio, Visual Studio Code), notebook server (Jupyter Notebook, Zeppelin), and other custom applications to Databricks clusters.
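In practice, this means ordinary PySpark code written on your laptop executes on the remote cluster. Here is a minimal sketch of the idea, assuming Databricks Connect is already installed and configured (the setup is covered below):

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session
# backed by the remote Databricks cluster rather than a local Spark.
spark = SparkSession.builder.getOrCreate()

# The job below is defined locally but runs on the cluster.
df = spark.range(100)
print(df.count())  # prints 100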

Prerequisites

Below are the prerequisites for installing Databricks Connect.

1. Python: Make sure you have an appropriate version of Python installed on your computer. Your Python version must be compatible with the Databricks cluster runtime version. Below is the runtime-to-Python-version matrix from Databricks, followed by a quick way to check your local version.
[Table: Databricks runtime to Python version matrix — see https://docs.databricks.com/dev-tools/databricks-connect.html]
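For example, from a Python prompt:

import sys

# The client's minor Python version must match the cluster runtime's
# (per the Databricks docs, e.g. runtime 9.1 LTS pairs with Python 3.8).
print(sys.version_info[:2])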

2. Java: Spark runs in a Java virtual machine (JVM). Make sure you have an appropriate version of Java (1.8 or above is recommended) on your computer.

3. Winutils: If your computer runs the Windows operating system, download winutils and define the HADOOP_HOME environment variable. I downloaded hadoop-common-2.2.0-bin-master.zip from here. Unzip it; it contains the files below.

[Screenshot: contents of the unzipped winutils archive (Image by Author)]

Set the HADOOP_HOME environment variable to the folder just above the bin folder that contains the winutils file.

[Screenshot: Windows environment variable setup (Image by Author)]
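Setting it through the System Properties dialog above is the usual route; as a per-session alternative you can also set it from Python before starting Spark. A sketch, where the unzip location is an assumption — use your own path:

import os

# HADOOP_HOME must point at the folder that contains bin\winutils.exe,
# not at the bin folder itself. The path below is hypothetical.
os.environ["HADOOP_HOME"] = r"C:\hadoop-common-2.2.0-bin-master"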

4. Databricks Cluster: You will need access to a Databricks Standard (or better) workspace. Community Edition will not work, because it does not offer the option to create an access token, which is required for accessing the cluster from your computer. At the time of writing this article, Databricks was offering free access to its Standard tier for 14 days.

Create a Spark cluster in the workspace. Make sure to pick a runtime version in line with the Python version installed on your computer.


Installation & Configuration

Once you have taken care of the prerequisites, here are the steps for installing, configuring, and testing Databricks Connect on your computer.

a. Uninstall PySpark: If you have already installed PySpark, uninstall it. It will be installed again as a dependency of databricks-connect.
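For example, from a terminal:

pip uninstall pyspark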

[Screenshot: uninstalling PySpark (Image by Author)]

b. Install databricks-connect:

pip install -U "databricks-connect==9.1.*"
[Screenshot: installing databricks-connect (Image by Author)]

The version 9.1 matches the Databricks cluster runtime version. Put in the appropriate version based on your cluster’s runtime.
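To double-check which client version pip actually installed, you can run:

pip show databricks-connect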

c. Now that Databricks Connect is installed, it is time to configure it. The following information is required to configure it.

i) Cluster ID: The value after …clusters/ in the cluster URL is the cluster ID. In the example below, the cluster ID is 1015-041759-ui3itm88.

ii) Host: The value up to o= is the cluster host. In the example below, the host is https://dbc-08951d2d-3041.cloud.databricks.com (taken from the full URL https://dbc-08951d2d-3041.cloud.databricks.com/?o=3676650223046134).

iii) Org ID: The value after o= in the URL is the org ID. In this example, 3676650223046134 is the org ID.

[Screenshot: cluster URL showing host, org ID, and cluster ID (Image by Author)]

iv) Token: Go to User Settings, then Access Tokens, and click Generate New Token. Copy the token value.

[Screenshot: generating an access token (Image by Author)]

Once you have all the information listed above (i to iv), run the command below on your computer:

databricks-connect configure

Provide the requested information at each prompt.

[Screenshots: databricks-connect configure prompts and responses (Images by Author)]
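Behind the scenes, the configure command stores these values in a small JSON file in your home directory (~/.databricks-connect). A sketch of its shape, with the token redacted and the default port assumed:

{
  "host": "https://dbc-08951d2d-3041.cloud.databricks.com",
  "token": "dapiXXXXXXXXXXXXXXXXXXXX",
  "cluster_id": "1015-041759-ui3itm88",
  "org_id": "3676650223046134",
  "port": "15001"
}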

Congratulations, Databricks Connect is installed and configured on your computer. Run the command below to test it.

databricks-connect test
[Screenshot: databricks-connect test output (Image by Author)]

If you see the message “All tests passed”, you are all set: Databricks Connect is working. Now you can open a Jupyter notebook and start developing Spark applications.

Implementation

It’s time to put Databricks Connect to work.

Download the Pima Indians Diabetes Database file (diabetes.csv) from Kaggle.

Upload the diabetes.csv file to the cluster.

[Screenshot: uploading diabetes.csv to the cluster (Image by Author)]

Now open a Jupyter notebook and start developing your Spark application against the Databricks cluster.

[Screenshot: developing against the Databricks cluster from a Jupyter notebook (Image by Author)]
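For example, you can read the uploaded file from the local notebook. The /FileStore path below assumes the default upload location; check the path the upload dialog reports:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DBFS path where the upload dialog placed the file (assumed default).
df = spark.read.csv("/FileStore/tables/diabetes.csv",
                    header=True, inferSchema=True)

df.printSchema()
print(df.count())  # the standard Pima dataset has 768 rows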

You can take the PySpark SQL, machine learning, and other scripts from my previous articles and run them in your own Jupyter notebook.
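As a taste, here is a minimal sketch of training a logistic regression model on the diabetes data over Databricks Connect, continuing from the df loaded above. The column names follow the standard Kaggle file; adjust them if yours differ:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the eight predictor columns into a single feature vector.
feature_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(df).select("features", "Outcome")

# The model is fit on the cluster; only results come back to your machine.
lr = LogisticRegression(featuresCol="features", labelCol="Outcome")
model = lr.fit(train_df)
print(model.summary.accuracy)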

Conclusion

Spark is revolutionary, and Databricks clusters are in great demand for running Spark workloads. Databricks Connect offers the much-awaited flexibility of developing Spark applications from your own computer.

Happy developing Spark applications with Databricks Connect!


