
Databricks Connect

The ease of your personal computer with the power of a Databricks Spark cluster

Photo by Bruce Dixon on Unsplash

Shilpa, our data scientist, fell in love with Spark after going through our earlier articles on Spark and PySpark.

While Spark, with its in-memory computation and real-time streaming capabilities, is making her life better, using an IDE against a Databricks cluster is not the same as using an IDE on her own computer. She wants to connect her own machine to the Databricks cluster and develop Spark applications in her own computer’s Jupyter notebook. This is where Databricks Connect can help her.

What is Databricks Connect?

Below is the explanation from the Databricks website:

Databricks Connect allows you to connect your favorite IDE (Eclipse, IntelliJ, PyCharm, RStudio, Visual Studio Code), notebook server (Jupyter Notebook, Zeppelin), and other custom applications to Databricks clusters.
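In practice, this means ordinary PySpark code written on your laptop executes on the remote cluster. Here is a minimal sketch of the idea, assuming Databricks Connect is already installed and configured (the setup is covered below):

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session
# backed by the remote Databricks cluster rather than a local Spark.
spark = SparkSession.builder.getOrCreate()

# The job below is defined locally but runs on the cluster.
df = spark.range(100)
print(df.count())  # prints 100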

Prerequisites

Below are the prerequisites for installing Databricks Connect.

1. Python: Make sure you have an appropriate version of Python installed on your computer. Your Python version must be compatible with the Databricks cluster runtime version. Below is the runtime-to-Python-version matrix from Databricks, followed by a quick way to check your local version.
[Table: Databricks runtime to Python version matrix — see https://docs.databricks.com/dev-tools/databricks-connect.html]
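For example, from a Python prompt:

import sys

# The client's minor Python version must match the cluster runtime's
# (per the Databricks docs, e.g. runtime 9.1 LTS pairs with Python 3.8).
print(sys.version_info[:2])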

2. Java: Spark runs in a Java virtual machine (JVM). Make sure you have an appropriate version of Java (1.8 or above is recommended) on your computer.

3. Winutils: If your computer runs the Windows operating system, download winutils and define the HADOOP_HOME environment variable. I downloaded hadoop-common-2.2.0-bin-master.zip from here. Unzip it; it contains the files below.

[Screenshot: contents of the unzipped winutils archive (Image by Author)]

Set the HADOOP_HOME environment variable to the folder just above the bin folder that contains the winutils file.

[Screenshot: Windows environment variable setup (Image by Author)]
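Setting it through the System Properties dialog above is the usual route; as a per-session alternative you can also set it from Python before starting Spark. A sketch, where the unzip location is an assumption — use your own path:

import os

# HADOOP_HOME must point at the folder that contains bin\winutils.exe,
# not at the bin folder itself. The path below is hypothetical.
os.environ["HADOOP_HOME"] = r"C:\hadoop-common-2.2.0-bin-master"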

4. Databricks Cluster: You will need access to a Databricks Standard (or better) workspace. Community Edition will not work, because it does not offer the option to create an access token, which is required for accessing the cluster from your computer. At the time of writing this article, Databricks was offering free access to its Standard tier for 14 days.

Create a Spark cluster in the workspace. Make sure to pick a runtime version in line with the Python version installed on your computer.


Installation & Configuration

Once you have taken care of the prerequisites, here are the steps for installing, configuring, and testing Databricks Connect on your computer.

a. Uninstall PySpark: If you have already installed PySpark, uninstall it. It will be installed again as a dependency of databricks-connect.
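For example, from a terminal:

pip uninstall pyspark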

[Screenshot: uninstalling PySpark (Image by Author)]

b. Install databricks-connect:

pip install -U "databricks-connect==9.1.*"
[Screenshot: installing databricks-connect (Image by Author)]

The version 9.1 matches the Databricks cluster runtime version. Put in the appropriate version based on your cluster’s runtime.
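To double-check which client version pip actually installed, you can run:

pip show databricks-connect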

c. Now that Databricks Connect is installed, it is time to configure it. The following information is required to configure it.

i) Cluster ID: The value after …clusters/ in the cluster URL is the cluster ID. In the example below, the cluster ID is 1015-041759-ui3itm88.

ii) Host: The value up to o= is the cluster host. In the example below, the host is https://dbc-08951d2d-3041.cloud.databricks.com (taken from the full URL https://dbc-08951d2d-3041.cloud.databricks.com/?o=3676650223046134).

iii) Org ID: The value after o= in the URL is the org ID. In this example, 3676650223046134 is the org ID.

[Screenshot: cluster URL showing host, org ID, and cluster ID (Image by Author)]

iv) Token: Go to User Settings, then Access Tokens, and click Generate New Token. Copy the token value.

[Screenshot: generating an access token (Image by Author)]

Once you have all the information listed above (i to iv), run the command below on your computer:

databricks-connect configure

Provide the requested information at each prompt.

[Screenshots: databricks-connect configure prompts and responses (Images by Author)]
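Behind the scenes, the configure command stores these values in a small JSON file in your home directory (~/.databricks-connect). A sketch of its shape, with the token redacted and the default port assumed:

{
  "host": "https://dbc-08951d2d-3041.cloud.databricks.com",
  "token": "dapiXXXXXXXXXXXXXXXXXXXX",
  "cluster_id": "1015-041759-ui3itm88",
  "org_id": "3676650223046134",
  "port": "15001"
}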

Congratulations, Databricks Connect is installed and configured on your computer. Run the command below to test it.

databricks-connect test
[Screenshot: databricks-connect test output (Image by Author)]

If you see the message “All tests passed”, you are all set: Databricks Connect is working. Now you can open a Jupyter notebook and start developing Spark applications.

Implementation

It’s time to put Databricks Connect to work.

Download the Pima Indians Diabetes Database file (diabetes.csv) from Kaggle.

Upload the diabetes.csv file to the cluster.

[Screenshot: uploading diabetes.csv to the cluster (Image by Author)]

Now open a Jupyter notebook and start developing your Spark application against the Databricks cluster.

[Screenshot: developing against the Databricks cluster from a Jupyter notebook (Image by Author)]
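For example, you can read the uploaded file from the local notebook. The /FileStore path below assumes the default upload location; check the path the upload dialog reports:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DBFS path where the upload dialog placed the file (assumed default).
df = spark.read.csv("/FileStore/tables/diabetes.csv",
                    header=True, inferSchema=True)

df.printSchema()
print(df.count())  # the standard Pima dataset has 768 rows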

You can take the PySpark SQL, machine learning, and other scripts from my previous articles and run them in your own Jupyter notebook.
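As a taste, here is a minimal sketch of training a logistic regression model on the diabetes data over Databricks Connect, continuing from the df loaded above. The column names follow the standard Kaggle file; adjust them if yours differ:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the eight predictor columns into a single feature vector.
feature_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(df).select("features", "Outcome")

# The model is fit on the cluster; only results come back to your machine.
lr = LogisticRegression(featuresCol="features", labelCol="Outcome")
model = lr.fit(train_df)
print(model.summary.accuracy)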

Conclusion

Spark is revolutionary, and Databricks clusters are in great demand for running Spark workloads. Databricks Connect offers the much-awaited flexibility of developing Spark applications from your own computer.

Happy developing Spark applications with Databricks Connect!


