Oracle loves Hadoop – integrate hadoop seamlessly into Oracle

Ever wanted to execute hadoop scripts from an Oracle environment and join the results to any Oracle table? It’s easier than you think: there is no special tool required, no transient (temporary) file created, and no ETL! It can all be executed on the fly by any user, anytime. Why would you want to do it? Well, there can be many reasons; for example, you have your customer data stored in Oracle and want to join it to very large call files or web logs generated every day.


Before I explain the trick, let me show you a sample screenshot of what I mean by “seamlessly”. The sample below is a query (in this case executed in DataStudio, but it can be pretty much any client you use to connect to Oracle – SQL*Plus, TOAD, SAS, MS Access, … you name it). As you can see, I run a very simple query which in the background goes to a remote server, executes a hadoop script, and returns the result back to Oracle. The user does not need to know that the data reside on a non-Oracle server (in this case hadoop). Please note the hadoop script is executed on the fly; this is not a pre-executed script (it executes every time you run the query).


What you need

I used a sample VMWARE Hadoop image, and I also got Oracle 11.2 Enterprise Edition on RedHat (this is really nothing special, just a standard Oracle install). That’s all you need. Yup, it’s that simple.


Unix setting

No, don’t worry, there is really no special setting required. The only thing you should do is exchange public keys between the Hadoop and RedHat servers so that ssh won’t require a password.
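A minimal sketch of that key exchange, assuming the hadoop box is reachable as guest@hadoop-host (a placeholder) and that the key belongs to the OS user that runs the Oracle services (typically oracle), since that is the user the preprocessor script will run as:

```shell
# One-time setup on the RedHat/Oracle server, run as the OS user
# that will execute the preprocessor script (typically "oracle"):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # key pair with no passphrase
ssh-copy-id guest@hadoop-host              # install the public key on the hadoop server
ssh guest@hadoop-host true                 # should now log in without a password prompt
```

This is environment setup, so adapt the user and host to your own servers.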

Then create 4 unix directories. You can create just one if you wish; I try to keep things organized.

/home/jiri/Desktop/sample/data    - this will hold dummy empty file
/home/jiri/Desktop/sample/log     - this will hold oracle logs
/home/jiri/Desktop/sample/bad     - this will hold oracle reject records
/home/jiri/Desktop/sample/execute - this will hold hadoop scripts (wrapped in sh)
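The four directories can be created in one shot; a quick sketch (BASE stands in for /home/jiri/Desktop/sample, so adjust it to your own layout):

```shell
# Create the data/log/bad/execute working directories in one command.
BASE=${BASE:-/tmp/sample}    # the post uses /home/jiri/Desktop/sample
mkdir -p "$BASE/data" "$BASE/log" "$BASE/bad" "$BASE/execute"
```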

I use the sample hadoop script for calculating PI; this script is part of the VMWARE image. To execute it, I create two unix scripts on my RedHat server. As you can see, one script is basically a simple wrapper, and the other calls the hadoop script over ssh. Take these as simple examples; there are many ways to improve them and make them production ready.


The wrapper script calls the second script, discarding stderr so that only stdout reaches Oracle:

/home/oracle/Desktop/sample/execute/ 2>/dev/null

The second script, /home/jiri/Desktop/sample/execute/ (please change the IP address to your hadoop server IP or host), runs the hadoop job remotely:

/usr/bin/ssh guest@ hadoop jar hadoop-examples.jar pi 4 10000
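As a sketch, the two scripts might look like this (run_pi.sh, pi_over_ssh.sh, and guest@hadoop-host are placeholder names of my own, and EXEC_DIR stands in for the execute directory); the important detail is that the wrapper emits only stdout, since whatever it prints becomes the external table’s rows:

```shell
# Hypothetical reconstruction of the two scripts; adjust EXEC_DIR and
# the remote user@host to your environment.
EXEC_DIR=${EXEC_DIR:-/tmp/sample/execute}
mkdir -p "$EXEC_DIR"

# Inner script: runs the sample PI job on the remote hadoop server over ssh.
cat > "$EXEC_DIR/pi_over_ssh.sh" <<'EOF'
#!/bin/sh
/usr/bin/ssh guest@hadoop-host hadoop jar hadoop-examples.jar pi 4 10000
EOF

# Wrapper: the script Oracle's PREPROCESSOR clause will point at.
# It must write the "rows" to stdout; stderr is discarded so that
# ssh/hadoop noise never reaches Oracle.
cat > "$EXEC_DIR/run_pi.sh" <<EOF
#!/bin/sh
$EXEC_DIR/pi_over_ssh.sh 2>/dev/null
EOF

chmod 755 "$EXEC_DIR/pi_over_ssh.sh" "$EXEC_DIR/run_pi.sh"
```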


The last step is to create a dummy empty file (it’s really just an empty file; you can reuse the same dummy file for all hadoop scripts).
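For example (using the data directory from the layout above; dummy.dat is the file name the external table’s LOCATION clause will point at):

```shell
# The external table needs its LOCATION file to exist, but the
# preprocessor never actually reads it, so an empty file is enough.
mkdir -p /tmp/sample/data          # the post uses /home/jiri/Desktop/sample/data
touch /tmp/sample/data/dummy.dat
```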



Oracle setting

There is really not much to set up. I will use an external table with a preprocessor (a new 11.2 feature). I wrote about this feature in two older posts.

First, let me set up the Oracle directories (log in to Oracle as a DBA or a user with grants to create directories):

 CREATE DIRECTORY dir_sample_data as '/home/jiri/Desktop/sample/data';
 CREATE DIRECTORY dir_sample_log  as '/home/jiri/Desktop/sample/log';
 CREATE DIRECTORY dir_sample_bad  as '/home/jiri/Desktop/sample/bad';
 CREATE DIRECTORY dir_sample_exec as '/home/jiri/Desktop/sample/execute';

Then I need to give grants on these directories to the user that will run the hadoop query (in my case, Oracle user JIRI):

 GRANT READ        on DIRECTORY dir_sample_data to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_log  to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_bad  to JIRI;
 GRANT EXECUTE     on DIRECTORY dir_sample_exec to JIRI;

Now log in as a normal Oracle user (in my case JIRI) and create the external table. The snippet below is completed into a full CREATE TABLE statement; the single VARCHAR2 column is only an illustration for the PI job’s stdout, and the name of the wrapper script created earlier goes between the quotes in the PREPROCESSOR clause:

   CREATE TABLE myhadoop_ext
   (
     result_line VARCHAR2(4000)
   )
   ORGANIZATION EXTERNAL
   (
     TYPE ORACLE_LOADER
     DEFAULT DIRECTORY dir_sample_data
     ACCESS PARAMETERS
     (
       RECORDS DELIMITED BY NEWLINE
       PREPROCESSOR dir_sample_exec:''
       SKIP 0
       LOGFILE dir_sample_log:'dummy.log'
       BADFILE dir_sample_bad:'dummy.bad'
     )
     LOCATION (dir_sample_data:'dummy.dat')
   )
   REJECT LIMIT UNLIMITED;

That’s it. Now every time user JIRI runs a query that uses the MYHADOOP_EXT table, Oracle executes the unix scripts, which run the hadoop script on the remote server and return its stdout back to the Oracle user.
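A query against the table is just an ordinary SELECT; the join below is a made-up illustration (customers and its column are hypothetical names):

```sql
-- Each execution triggers the preprocessor, i.e. runs the hadoop job
-- over ssh and streams its stdout back as rows.
SELECT * FROM myhadoop_ext;

-- Or join it straight to a regular Oracle table (customers is hypothetical):
SELECT c.customer_name, h.*
  FROM customers c
 CROSS JOIN myhadoop_ext h;
```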

As you can see, the Oracle external table preprocessor is a very powerful tool; you can use it not only during ETL (as the Oracle documentation suggests with its unzip example), but for pretty much any unix-level manipulation.


4 thoughts on “Oracle loves Hadoop – integrate hadoop seamlessly into Oracle”

  1. Jag says:

    Have you tried connecting using a SQL*NET connect string and executing the query from the external table?

    ssh in the preprocessor script is giving error
    “Host key verification failed”

  2. jiri says:

    The preprocessor script is executed using the same unix ID as the Oracle services use (typically the user ID is oracle); please make sure the known_hosts unix file for that user is correctly set up.

  3. Jag says:

    I have no issue when connecting as user/password.
    But when I connect as user/password@, I have this SSH error

  4. jiri says:

    If you use SSH in the scripts, then I would recommend not using a password at all. Simply set up password-less access using keys; there are great examples online.
