Spark and Oracle Database
source link: https://towardsdatascience.com/spark-and-oracle-database-6624abd2b632
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Spark and Oracle Database
Ease of structured data and efficiency of Spark
Shilpa has become an expert in Spark and enjoys Big data analysis. Everything was going well until her employer wanted to know the kind of insight they can get by combining their enterprise data from the Oracle database with Big Data.
Oracle database is the most sold enterprise database. Most of the enterprise applications, like ERP, SCM applications, are running on the Oracle database. Like Shilpa, most of the data scientists come across situations where they have to relate the data coming from enterprise databases like Oracle with the data coming from a Big Data source like Hadoop.
There are two approaches to address such requirements:
- Bring the enterprise data into the Big Data storage system like Hadoop HDFS and then access it through Spark SQL.
This approach has the following drawbacks:
- Data duplication
- Enterprise data has to be brought into Hadoop HDFS. This requires a data integration solution and will mostly be a batch operation, bringing in data latency issues.
2. Keep the operational enterprise data in the Oracle database and Big Data in Hadoop HDFS and access both through Spark SQL.
- Only the required enterprise data is accessed through Spark SQL.
- If required the enterprise data can be stored in Hadoop HDFS through Spark RDD.
I am elaborating on the second approach in this article. Let’s go through the basics first.
Spark
If you want to know about Spark and seek step-by-step instructions on how to download and install it along with Python, I highly recommend my below article.
Oracle Database
If you want to know about the Oracle database and seek step-by-step instructions on how to install a fully functional server-class Oracle database, I highly recommend my below article.
Before we taking a deeper dive into Spark and Oracle database integration, one shall know about Java Database Connection (JDBC).
A Java application can connect to the Oracle database through JDBC, which is a Java-based API. As Spark runs in a Java Virtual Machine (JVM), it can be connected to the Oracle database through JDBC.
You can download the latest JDBC jar file from the below link
You should get the ojdbc7.jar file. Save this file into the …/spark/jars folder, where all other spark system class files are stored.
Connecting Spark with Oracle Database
Now that you already have installed the JDBC jar file where Spark is installed, and you know access details (host, port, sid, login, password) to the Oracle database, let’s begin the action.
I have installed Oracle Database as well as Spark (in local mode) on AWS EC2 instance as explained in the above article.
- I can access my oracle database sanrusha. The database is up and running.
2. Database listener is also up and running
3. Database user is sparkuser1. This user has access to one table test, that has only on column A, but no data.
In the next step, going to connect to this database and table through Spark.
4a. Log in to the Spark machine and start Spark through Spark-shell or pyspark.
4b. Below command creates a spark dataframe df with details of the Oracle database table test. Write this command on Scala prompt.
val df= spark.read.format(“jdbc”).option(“url”,”jdbc:oracle:thin:sparkuser1/oracle@<oracledbhost>:<oracle db access port default is 1521>/<oracledbsid>”).option(“dbtable”,”test”).option(“user”,”sparkuser1").option(“password”,”oracle”).option(“driver”,”oracle.jdbc.driver.OracleDriver”).load()
4c. df.schema will show the details of the table. In this case, it is a simple test table with just one column A.
4d. Open a browser, enter the below address
http://<public IP address of machine where Spark is running>:4040
Click on the SQL tab. You should see the details like what time the connection request was submitted, how long connection and data retrieval activities took, and also the JDBC details.
Spark can also be initiated through a Spark session.builder API available in Python. Open Jypyter notebook and enter the below details to start the Spark application session and connect it with the Oracle database
Here is a snapshot of my Jupyter notebook.
Conclusion
This was a small article explaining options when it comes to using Spark with Oracle database. You can extend this knowledge for connecting Spark with MySQL and databases.
Looking forward to your feedback.
Recommend
-
63
In one of my previous posts I stated that Oracle does not care enough to maintain public MySQL bugs database properly. I t...
-
36
告诉你,现在可以下载Linux风格的Oracle Database 18,当然也有Windows风味的,但有个性的人不使用它。 根据一位健谈的高级Oracle小组的说法,Oracle Database 18现在可以在Linux上进行本地下载。
-
58
For one of our projects we store large amounts of timeseries data in an Oracle database. Sometimes we want to get an overview of how big the tables and related indexes are. Some database client tools like
-
31
Ifirst wrote about uploading and downloading files in 2015. Since then,
-
26
由于Oracle Database 19c会作为Oracle长期支持的版本,官方也建议大家选择19c这个版本。而最新推出的20c会作为一个全新的发布序列,当前版本是20.1,此版本的生命周期也会比较短,因此可以再观望一下20c的后续版本。我安装了在Oracle官网可以下载到的19.3的版本,...
-
20
Sometimes you want to automatically populate a database table with a number rows. Maybe you need a big table with lots of entries for a performance experiment or some dummy data for development. Unfortunately, there’s no...
-
8
``` docker pull oracle-database-enterprise-edition ``` Oracle Database Server 12c R2 is an industry leading relational database server. The Oracle Database Server Docker Image contains the Oracle Database Server 12.2.0.1 Enterprise Edition...
-
8
Python and Oracle DatabaseA combination no data scientist can ignore
-
4
Distributed database access with Spark and JDBC 10 Feb 2022 by dzlab
-
2
From Database to S3 with Apache Spark – Adam Gamboa G – Developer Skip to content ...
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK