Ever wanted to execute hadoop scripts from an Oracle environment and join the results to any Oracle table? It’s easier than you think: no special tool is required, no transient (temporary) file is created, and there is no ETL! It can all be executed on the fly by any user at any time. Why would you want to do it? Well, there can be many reasons; for example, you have your customer data stored in Oracle and want to join it to very large call files or web logs generated every day.
Before I explain the trick, let me show you a sample screenshot of what I mean by “seamlessly”. The sample below is a query (in this case executed in DataStudio, but it can be pretty much any client you use to connect to Oracle – SQL*Plus, TOAD, SAS, MS Access, … you name it). As you can see, I run a very simple query which in the background goes to a remote server, executes a hadoop script, and returns the result back to Oracle. The user does not need to know that the data resides on a non-Oracle server (in this case hadoop). Please note the hadoop script is executed on the fly; it is not a pre-executed script (it runs every time you run the query).
What you need
I used a sample VMware Hadoop image, plus Oracle 11.2 Enterprise Edition on Red Hat (this is really nothing special, just a standard Oracle install). That’s all you need. Yup, it’s that simple.
Now don’t worry, there is really no special setting required. The only thing you should do is exchange public keys between the Hadoop and Red Hat servers so that ssh won’t prompt for a password.
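A minimal sketch of the key exchange, run on the Red Hat server as the OS user that Oracle runs under. HADOOP_HOST is a placeholder for your own user@host; nothing here is specific to this article’s setup.

```shell
# HADOOP_HOST is a placeholder -- set it to your hadoop user@host
HADOOP_HOST="${HADOOP_HOST:-}"

# Generate an RSA key pair if one does not exist yet (empty passphrase)
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N "" -q -f "$HOME/.ssh/id_rsa"

# Push the public key to the Hadoop server (you type the password once),
# then verify that ssh no longer prompts for a password
if [ -n "$HADOOP_HOST" ]; then
  ssh-copy-id "$HADOOP_HOST"
  ssh "$HADOOP_HOST" hostname
fi
```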
Then create four unix directories. You can create just one if you wish; I try to keep things organized.
/home/jiri/Desktop/sample/data    - this will hold the dummy empty file
/home/jiri/Desktop/sample/log     - this will hold oracle logs
/home/jiri/Desktop/sample/bad     - this will hold oracle reject records
/home/jiri/Desktop/sample/execute - this will hold hadoop scripts (wrapped in sh)
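The four directories above can be created in one go (paths as used throughout this article; adjust them to your environment):

```shell
# Create the four directories used in this article
mkdir -p /home/jiri/Desktop/sample/data     # dummy empty file
mkdir -p /home/jiri/Desktop/sample/log      # oracle log files
mkdir -p /home/jiri/Desktop/sample/bad      # oracle reject records
mkdir -p /home/jiri/Desktop/sample/execute  # hadoop wrapper scripts
```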
I use the sample hadoop script for calculating pi; this script is part of the VMware image. To execute it, I create two unix scripts on my Red Hat server. As you can see, one script is basically a simple wrapper; the other calls the hadoop script over ssh. Take these as simple examples; there are many ways to improve them and make them production ready.
/home/jiri/Desktop/sample/execute/run_hadoop.sh – this is the wrapper the external table will call:

#!/bin/bash
/home/jiri/Desktop/sample/execute/hadoop_script.sh 2>/dev/null
/home/jiri/Desktop/sample/execute/hadoop_script.sh – please change the address to your hadoop server’s IP or host name:
#!/bin/bash
/usr/bin/ssh firstname.lastname@example.org hadoop jar hadoop-examples.jar pi 4 10000
The last step is to create the dummy empty file (it really is just an empty file, and you can reuse the same one for all your hadoop scripts).
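Creating the dummy file is a one-liner; the name dummy.dat matches the LOCATION clause of the external table created later:

```shell
# Make sure the data directory exists, then create the zero-byte file;
# its contents are never read -- the preprocessor's stdout replaces it
mkdir -p /home/jiri/Desktop/sample/data
touch /home/jiri/Desktop/sample/data/dummy.dat
```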
First, let me set up the Oracle directories (log in to Oracle as a DBA or a user with grants to create directories):
CREATE DIRECTORY dir_sample_data as '/home/jiri/Desktop/sample/data';
CREATE DIRECTORY dir_sample_log  as '/home/jiri/Desktop/sample/log';
CREATE DIRECTORY dir_sample_bad  as '/home/jiri/Desktop/sample/bad';
CREATE DIRECTORY dir_sample_exec as '/home/jiri/Desktop/sample/execute';
Then I need to give grants on these directories to the user that will run the hadoop script (in my case the Oracle user JIRI):
GRANT READ on DIRECTORY dir_sample_data to JIRI;
GRANT READ, WRITE on DIRECTORY dir_sample_log to JIRI;
GRANT READ, WRITE on DIRECTORY dir_sample_bad to JIRI;
GRANT EXECUTE on DIRECTORY dir_sample_exec to JIRI;
Now log in as a normal Oracle user (in my case JIRI) and create the external table:
CREATE TABLE JIRI.MYHADOOP_EXT
(
  SCRIPT_OUTPUT VARCHAR2(2000)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dir_sample_data
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR dir_sample_exec:'run_hadoop.sh'
    SKIP 0
    LOGFILE dir_sample_log:'dummy.log'
    BADFILE dir_sample_bad:'dummy.bad'
    NODISCARDFILE
    FIELDS TERMINATED BY '~'
    OPTIONALLY ENCLOSED BY '~'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION (dir_sample_data:'dummy.dat')
)
REJECT LIMIT 0
NOPARALLEL
NOMONITORING;
That’s it. Now every time user JIRI runs a query that uses the MYHADOOP_EXT table, Oracle executes the unix wrapper script, which runs the hadoop script on the remote server and returns its stdout back to the Oracle user.
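For example, a plain SELECT is enough to trigger the whole chain; this is just a sketch, and the exact rows you get back depend on what your hadoop job prints:

```sql
-- Each stdout line of the hadoop job comes back as one row
SELECT script_output
  FROM jiri.myhadoop_ext;
```

Because it behaves like any other table, you can join it to your ordinary Oracle tables in the same query.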
As you can see, the Oracle external table preprocessor is a very powerful tool; you can use it not only during ETL (the Oracle documentation suggests an unzip example), but for pretty much any unix-level manipulation.