Spark and Oracle Database

Ease of structured data and efficiency of Spark

Photo by Sorin Sîrbu on Unsplash

Shilpa has become an expert in Spark and enjoys Big Data analysis. Everything was going well until her employer wanted to know what kind of insight they could get by combining their enterprise data from the Oracle database with Big Data.

Oracle Database is the best-selling enterprise database. Many enterprise applications, such as ERP and SCM systems, run on the Oracle database. Like Shilpa, most data scientists come across situations where they have to relate data coming from enterprise databases like Oracle with data coming from a Big Data source like Hadoop.

There are two approaches to address such requirements:

  1. Bring the enterprise data into a Big Data storage system like Hadoop HDFS and then access it through Spark SQL.
Image by Author

This approach has the following drawbacks:

  • Data duplication
  • Enterprise data has to be brought into Hadoop HDFS. This requires a data integration solution and is usually a batch operation, which introduces data latency.

2. Keep the operational enterprise data in the Oracle database and Big Data in Hadoop HDFS and access both through Spark SQL.

Image by Author
  • Only the required enterprise data is accessed through Spark SQL.
  • If required, the enterprise data can be stored in Hadoop HDFS through a Spark RDD.

I am elaborating on the second approach in this article. Let’s go through the basics first.

Spark

If you want to know about Spark and seek step-by-step instructions on how to download and install it along with Python, I highly recommend my article below.

Oracle Database

If you want to know about the Oracle database and seek step-by-step instructions on how to install a fully functional server-class Oracle database, I highly recommend my article below.

Before we take a deeper dive into Spark and Oracle database integration, you should know about Java Database Connectivity (JDBC).

A Java application can connect to the Oracle database through JDBC, which is a Java-based API. As Spark runs in a Java Virtual Machine (JVM), it can be connected to the Oracle database through JDBC.

You can download the latest JDBC jar file from Oracle's JDBC driver download page.

You should get the ojdbc7.jar file. Save this file in the …/spark/jars folder, where all the other Spark system jar files are stored.

Image by Author
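
As an alternative to copying the jar into the spark/jars folder, Spark can also be pointed at the driver when the session is created. Below is a minimal PySpark sketch of that option; the jar path is a placeholder, not a path from this setup.

from pyspark.sql import SparkSession

# Point Spark at the Oracle JDBC driver jar instead of copying it into spark/jars.
# The path below is a placeholder -- adjust it to where you saved ojdbc7.jar.
spark = (
    SparkSession.builder
    .appName("oracle-jdbc-demo")
    .config("spark.jars", "/path/to/ojdbc7.jar")
    .getOrCreate()
)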

Connecting Spark with Oracle Database

Now that you have installed the JDBC jar file on the machine where Spark is installed, and you know the access details (host, port, SID, login, password) for the Oracle database, let's begin the action.

I have installed Oracle Database as well as Spark (in local mode) on an AWS EC2 instance, as explained in the articles above.

  1. I can access my Oracle database sanrusha. The database is up and running.
Image by Author

2. The database listener is also up and running.

Image by Author

3. The database user is sparkuser1. This user has access to one table, test, which has only one column, A, and no data.

Image by Author

In the next steps, we are going to connect to this database and table through Spark.

4a. Log in to the Spark machine and start Spark through spark-shell or pyspark.

Image by Author

4b. The below command creates a Spark DataFrame df with the details of the Oracle database table test. Write this command at the Scala prompt.

val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:sparkuser1/oracle@<oracledbhost>:<oracle db access port, default is 1521>/<oracledbsid>")
  .option("dbtable", "test")
  .option("user", "sparkuser1")
  .option("password", "oracle")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()

4c. df.schema will show the details of the table. In this case, it is a simple test table with just one column A.

Image by Author

4d. Open a browser and enter the below address:

http://<public IP address of machine where Spark is running>:4040

Click on the SQL tab. You should see details such as when the query was submitted, how long the connection and data retrieval took, and the JDBC details.

Image by Author

Spark can also be initiated through the SparkSession.builder API available in Python. Open a Jupyter notebook and enter the below details to start the Spark application session and connect it to the Oracle database.
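
The exact notebook cells are shown in the snapshot below. As a rough guide, a minimal PySpark sketch of the same session setup and JDBC read looks like this; the host, port, and SID are placeholders, just as in the Scala command above.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session from the notebook.
spark = SparkSession.builder.appName("oracle-spark").getOrCreate()

# Read the Oracle table test over JDBC, mirroring the Scala command above.
# <oracledbhost>, <port> and <oracledbsid> are placeholders for your database.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@<oracledbhost>:<port, default is 1521>/<oracledbsid>")
    .option("dbtable", "test")
    .option("user", "sparkuser1")
    .option("password", "oracle")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)

df.printSchema()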

Here is a snapshot of my Jupyter notebook.

Image by Author
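
To close the loop on the second approach, the Oracle-backed DataFrame can be registered as a temporary view and combined with data that already sits in Hadoop HDFS, all through Spark SQL. The sketch below reuses the df read above and assumes a hypothetical Parquet dataset at hdfs:///data/events with a matching column A; the paths and names are illustrative only.

# Register the Oracle-backed DataFrame as a Spark SQL view.
df.createOrReplaceTempView("oracle_test")

# Load a (hypothetical) Big Data set already stored in HDFS as Parquet.
events = spark.read.parquet("hdfs:///data/events")
events.createOrReplaceTempView("events")

# Query both sources together through Spark SQL:
# keep only the events whose key A also exists in the Oracle table.
combined = spark.sql("""
    SELECT e.*
    FROM events e
    JOIN oracle_test o ON e.A = o.A
""")

# If required, persist the enterprise data (or the joined result) to HDFS.
combined.write.mode("overwrite").parquet("hdfs:///data/combined")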

Conclusion

This was a small article explaining the options when it comes to using Spark with an Oracle database. You can extend this knowledge to connect Spark with MySQL and other databases.

Looking forward to your feedback.

