Spark and Oracle Database

Ease of structured data and efficiency of Spark

Photo by Sorin Sîrbu on Unsplash

Shilpa has become an expert in Spark and enjoys Big Data analysis. Everything was going well until her employer wanted to know what kind of insight they could get by combining their enterprise data from the Oracle database with Big Data.

Oracle Database is the best-selling enterprise database. Many enterprise applications, such as ERP and SCM systems, run on the Oracle database. Like Shilpa, most data scientists come across situations where they have to relate data coming from enterprise databases like Oracle with data coming from a Big Data source like Hadoop.

There are two approaches to address such requirements:

  1. Bring the enterprise data into a Big Data storage system like Hadoop HDFS and then access it through Spark SQL.
Image by Author

This approach has the following drawbacks:

  • Data duplication
  • Enterprise data has to be brought into Hadoop HDFS. This requires a data integration solution and is usually a batch operation, which introduces data latency.

2. Keep the operational enterprise data in the Oracle database and Big Data in Hadoop HDFS and access both through Spark SQL.

Image by Author
  • Only the required enterprise data is accessed through Spark SQL.
  • If required, the enterprise data can be stored in Hadoop HDFS through a Spark RDD.

I am elaborating on the second approach in this article. Let’s go through the basics first.

Spark

If you want to know about Spark and seek step-by-step instructions on how to download and install it along with Python, I highly recommend my article below.

Oracle Database

If you want to know about the Oracle database and seek step-by-step instructions on how to install a fully functional server-class Oracle database, I highly recommend my article below.

Before we take a deeper dive into Spark and Oracle database integration, you should know about Java Database Connectivity (JDBC).

A Java application can connect to the Oracle database through JDBC, which is a Java-based API. As Spark runs in a Java Virtual Machine (JVM), it can be connected to the Oracle database through JDBC.

You can download the latest JDBC jar file from Oracle's JDBC driver download page.

You should get the ojdbc7.jar file. Save this file in the …/spark/jars folder, where all the other Spark system jar files are stored.

Image by Author
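
As an alternative to copying the jar into the spark/jars folder, Spark can also be pointed at the driver when the session is created. Below is a minimal PySpark sketch of that option; the jar path is a placeholder, not a path from this setup.

from pyspark.sql import SparkSession

# Point Spark at the Oracle JDBC driver jar instead of copying it into spark/jars.
# The path below is a placeholder -- adjust it to where you saved ojdbc7.jar.
spark = (
    SparkSession.builder
    .appName("oracle-jdbc-demo")
    .config("spark.jars", "/path/to/ojdbc7.jar")
    .getOrCreate()
)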

Connecting Spark with Oracle Database

Now that you have installed the JDBC jar file on the machine where Spark is installed, and you know the access details (host, port, SID, login, password) for the Oracle database, let's begin the action.

I have installed Oracle Database as well as Spark (in local mode) on an AWS EC2 instance, as explained in the articles above.

  1. I can access my Oracle database sanrusha. The database is up and running.
Image by Author

2. The database listener is also up and running.

Image by Author

3. The database user is sparkuser1. This user has access to one table, test, which has only one column, A, and no data.

Image by Author

In the next steps, we are going to connect to this database and table through Spark.

4a. Log in to the Spark machine and start Spark through spark-shell or pyspark.

Image by Author

4b. The below command creates a Spark DataFrame df with the details of the Oracle database table test. Write this command at the Scala prompt.

val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:sparkuser1/oracle@<oracledbhost>:<oracle db access port, default is 1521>/<oracledbsid>")
  .option("dbtable", "test")
  .option("user", "sparkuser1")
  .option("password", "oracle")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()

4c. df.schema will show the details of the table. In this case, it is a simple test table with just one column A.

Image by Author

4d. Open a browser and enter the below address:

http://<public IP address of machine where Spark is running>:4040

Click on the SQL tab. You should see details such as when the query was submitted, how long the connection and data retrieval took, and the JDBC details.

Image by Author

Spark can also be initiated through the SparkSession.builder API available in Python. Open a Jupyter notebook and enter the below details to start the Spark application session and connect it to the Oracle database.
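
The exact notebook cells are shown in the snapshot below. As a rough guide, a minimal PySpark sketch of the same session setup and JDBC read looks like this; the host, port, and SID are placeholders, just as in the Scala command above.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session from the notebook.
spark = SparkSession.builder.appName("oracle-spark").getOrCreate()

# Read the Oracle table test over JDBC, mirroring the Scala command above.
# <oracledbhost>, <port> and <oracledbsid> are placeholders for your database.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@<oracledbhost>:<port, default is 1521>/<oracledbsid>")
    .option("dbtable", "test")
    .option("user", "sparkuser1")
    .option("password", "oracle")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)

df.printSchema()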

Here is a snapshot of my Jupyter notebook.

Image by Author
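
To close the loop on the second approach, the Oracle-backed DataFrame can be registered as a temporary view and combined with data that already sits in Hadoop HDFS, all through Spark SQL. The sketch below reuses the df read above and assumes a hypothetical Parquet dataset at hdfs:///data/events with a matching column A; the paths and names are illustrative only.

# Register the Oracle-backed DataFrame as a Spark SQL view.
df.createOrReplaceTempView("oracle_test")

# Load a (hypothetical) Big Data set already stored in HDFS as Parquet.
events = spark.read.parquet("hdfs:///data/events")
events.createOrReplaceTempView("events")

# Query both sources together through Spark SQL:
# keep only the events whose key A also exists in the Oracle table.
combined = spark.sql("""
    SELECT e.*
    FROM events e
    JOIN oracle_test o ON e.A = o.A
""")

# If required, persist the enterprise data (or the joined result) to HDFS.
combined.write.mode("overwrite").parquet("hdfs:///data/combined")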

Conclusion

This was a small article explaining the options when it comes to using Spark with an Oracle database. You can extend this knowledge to connect Spark with MySQL and other databases.

Looking forward to your feedback.

