Category Archives: Oracle External Tables

Oracle loves Hadoop – integrate Hadoop seamlessly into Oracle

Have you ever wanted to execute Hadoop scripts from an Oracle environment and join the results to any Oracle table? It’s easier than you think: there is no special tool required, no transient (temporary) file created, and no ETL! It can all be executed on the fly by any user, anytime. Why would you want to do it? There can be many reasons; for example, you have your customer data stored in Oracle and want to join it to very large call files or web logs generated every day.


Before I explain the trick, let me show you a sample screenshot of what I mean by “seamlessly”. The sample below is a query (in this case executed in DataStudio, but it can be pretty much any client you use to connect to Oracle – SQL*Plus, TOAD, SAS, MS Access, … you name it). As you can see, I run a very simple query which in the background goes to a remote server, executes a Hadoop script and returns the result back to Oracle. The user does not need to know that the data resides on a non-Oracle server (in this case Hadoop). Please note the Hadoop script is executed on the fly; this is not a pre-executed script (it executes every time you run the query).
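
The query behind that screenshot is essentially the one below – a minimal sketch, assuming the MYHADOOP_EXT external table created later in this post:

-- every execution re-runs the hadoop job on the remote server
SELECT script_output
  FROM jiri.myhadoop_ext;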


What you need

I used a sample VMware Hadoop image, and I have Oracle 11.2 Enterprise Edition on RedHat (this is really nothing special, just a standard Oracle install). That’s all you need. Yup, it’s that simple.


Unix setting

No, don’t worry, there is really no special setting required. The only thing you should do is exchange public keys between the Hadoop and RedHat servers so your ssh won’t require a password. Here is a very simple how-to.
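
A minimal sketch of that key exchange, assuming the RedHat user connects to the Hadoop VM as guest@192.168.149.132 (adjust the user and host to your setup):

# on the RedHat server, as the user that will run the external table preprocessor
ssh-keygen -t rsa                      # accept the defaults, empty passphrase
ssh-copy-id guest@192.168.149.132      # copy the public key to the Hadoop server
ssh guest@192.168.149.132 hostname     # should now connect without a password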

Then create four unix directories (a one-liner to create them all follows the list). You can create just one if you wish; I try to keep things organized.

/home/jiri/Desktop/sample/data    - this will hold dummy empty file
/home/jiri/Desktop/sample/log     - this will hold oracle logs
/home/jiri/Desktop/sample/bad     - this will hold oracle reject records
/home/jiri/Desktop/sample/execute - this will hold hadoop scripts (wrapped in sh)
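
A one-liner to create all four (a minimal sketch):

mkdir -p /home/jiri/Desktop/sample/{data,log,bad,execute}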

I use the sample Hadoop script for calculating PI; this script is part of the VMware image. To execute it I create two unix scripts on my RedHat server. As you can see, one script is basically a simple wrapper, and the other calls the Hadoop script over ssh. Take these as simple examples; there are many ways to improve them and make them production ready.

/home/jiri/Desktop/sample/execute/run_hadoop.sh

#!/bin/bash
/home/jiri/Desktop/sample/execute/hadoop_script.sh 2>/dev/null

/home/jiri/Desktop/sample/execute/hadoop_script.sh – please change the IP address to your Hadoop server’s IP or hostname.

#!/bin/bash
/usr/bin/ssh guest@192.168.149.132 hadoop jar hadoop-examples.jar pi 4 10000
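
Both scripts need the execute bit set on the filesystem, for example:

chmod +x /home/jiri/Desktop/sample/execute/run_hadoop.sh
chmod +x /home/jiri/Desktop/sample/execute/hadoop_script.sh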


The last step is to create a dummy empty file (it’s really just an empty file, a dummy you can reuse for all Hadoop scripts).

/home/jiri/Desktop/sample/data/dummy.dat
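
One way to create it:

touch /home/jiri/Desktop/sample/data/dummy.dat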


Oracle setting

There is really not much to set up. I will use external tables with a preprocessor (a new 11.2 feature). I wrote about this feature in two older posts; you can find them here and here.

First let me set up the Oracle directories (log in to Oracle as a DBA or a user with grants to create directories).

 CREATE DIRECTORY dir_sample_data as '/home/jiri/Desktop/sample/data';
 CREATE DIRECTORY dir_sample_log  as '/home/jiri/Desktop/sample/log';
 CREATE DIRECTORY dir_sample_bad  as '/home/jiri/Desktop/sample/bad';
 CREATE DIRECTORY dir_sample_exec as '/home/jiri/Desktop/sample/execute';

Then I need to give grants on these directories to the user I will use to run the Hadoop script (in my case oracle user JIRI).

 GRANT READ        on DIRECTORY dir_sample_data to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_log  to JIRI;
 GRANT READ, WRITE on DIRECTORY dir_sample_bad  to JIRI;
 GRANT EXECUTE     on DIRECTORY dir_sample_exec to JIRI;

Now log in as a normal oracle user (in my case JIRI) and create the external table.

CREATE TABLE JIRI.MYHADOOP_EXT
 (
 SCRIPT_OUTPUT   VARCHAR2(2000)
 )
 ORGANIZATION EXTERNAL
 ( TYPE ORACLE_LOADER
   DEFAULT DIRECTORY dir_sample_data
   ACCESS PARAMETERS
   ( RECORDS DELIMITED BY NEWLINE
     PREPROCESSOR dir_sample_exec:'run_hadoop.sh'
     SKIP 0
     LOGFILE dir_sample_log:'dummy.log'
     BADFILE dir_sample_bad:'dummy.bad'
     NODISCARDFILE
     FIELDS TERMINATED BY '~' OPTIONALLY ENCLOSED BY '~'
     MISSING FIELD VALUES ARE NULL
   )
   LOCATION (dir_sample_data:'dummy.dat')
 )
 REJECT LIMIT 0
 NOPARALLEL
 NOMONITORING;

That’s it. Now every time user JIRI runs a query which uses the MYHADOOP_EXT table, Oracle executes the unix scripts, which run the Hadoop script on the remote server and return the stdout result back to the Oracle user.
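
And because MYHADOOP_EXT behaves like any other table, you can join the Hadoop output straight to your Oracle data. A minimal sketch – the MY_CUSTOMERS table and the join condition are hypothetical and depend entirely on what your Hadoop script prints:

-- each execution of this query re-runs the hadoop job
SELECT c.customer_id,
       h.script_output
  FROM my_customers c
  JOIN jiri.myhadoop_ext h
    ON h.script_output LIKE '%' || c.customer_id || '%';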

As you can see, the Oracle external table preprocessor is a very powerful tool; you can use it not only during ETL (the Oracle documentation suggests an unzip example), but for pretty much any unix-level manipulation.


Oracle External Tables by Examples part 5 – missing features

This is maybe a bit unusual post; I almost never post about missing features and rather focus on what does exist. Although I love Oracle external tables, I must admit there are a few missing features which could be easily implemented and would save a lot of time in some instances.

Please note most of the missing features below can be partially or fully patched by workarounds; I’ve done it a million times before. The point I am trying to make is to have these implemented directly, without any workarounds.


1. LOAD FIRST X

This is very simple: I would love to be able to specify loading only the first x records, for example the first 2. This would be great in cases where you want to load, for example, the header only.

Is there a workaround? Sure there is, but it’s a workaround and not a true solution:

a. if you are lucky and the header has a specific characteristic, you can use the LOAD WHEN filter (see the sketch after this list)

b. you can use REJECT LIMIT UNLIMITED – this is possible but I really do not like it, simply because you then expect bad records by definition. What if the header is 5 rows and row 3 has an issue and is rejected?

c. SELECT * FROM EXTERNAL_TABLE WHERE ROWNUM < X

This is certainly possible, but the SQL engine runs after the external table read (driven by the READSIZE cache), and although this somewhat works, you might still get rejects because Oracle might read 7 records instead of just 1.

d. use unix and separate the header and the body – this is a clean solution but it is an extra step
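
To illustrate workaround a, here is a minimal sketch of the access parameters, assuming every header record starts with the literal 'H' in the first position:

RECORDS DELIMITED BY NEWLINE
LOAD WHEN (1:1) = 'H'
FIELDS TERMINATED BY ','
MISSING FIELD VALUES ARE NULL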


2. LOAD LAST X

This is very similar to LOAD FIRST; it has, however, a bigger implication. Just imagine you have a 40GB file: if you use, for example, LOAD WHEN, you will be waiting a long time to get the trailing record. A unix file split is probably the most effective workaround here.
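
A minimal sketch of that workaround – the file names are hypothetical, the point is simply to cut the trailer off before the external table ever sees the file:

tail -1 /work/data/bigfile.dat > /work/data/bigfile_trailer.dat
head -n -1 /work/data/bigfile.dat > /work/data/bigfile_body.dat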


3. SKIP LAST X

External tables support SKIP X, meaning skip the first x records. This is great for skipping a header during the load. I would love to see a similar feature for skipping the last X records (skipping the trailing record). Again, there are possible workarounds, but why do we need them if this could be implemented directly in external tables?

Please share your ideas on what could be added to external table loads.


Oracle External Tables by Examples part 4 – column_transforms Clause (load CLOB BLOB or any constant using external tables)

Starting with Oracle 10g you can load any CLOB/BLOB (any file, including text, xml, jpeg, doc, pdf, … whatever format you like) which resides in separate file(s) into Oracle using external tables, and you no longer need SQL*Loader (unless you plan to load data over the network). It is done using the very interesting column_transforms clause.

let’s assume you have customer.dat file which looks like this

CUSTOMER_ID,CUSTOMER_ACQUIRE_DT, CUSTOMER_FILE_NM
123456,12/23/2001,jiri_smith.txt
234567,01/17/1999,scott_walker.txt
345678,05/16/2009,jim_brown.txt

Next to customer_id, the file includes a date when the customer was acquired and also a file name which contains more detailed information. In my example above, this means that next to the customer.dat file I also have the files jiri_smith.txt, scott_walker.txt and jim_brown.txt.

Here is the goal:

1. use external tables to load the customer.dat file as well as all the txt files
2. load the txt files into a CLOB field (you can load into a BLOB using the same technique)
3. present the date as a true date (parse it with the correct date mask)
4. add an additional static column which defines what vendor sent the file
5. add the windows drive name http://myserver/ to customer_file_nm so users know where the text files reside (this is a kind of silly requirement since we load them into a CLOB, but still …)

First let’s define the directories; this is very straightforward. Notice I define a separate directory for customer.dat and for the individual txt files (this can of course be one directory).

CREATE OR REPLACE DIRECTORY "DIR_USERS_JIRI_DATA" AS '/work/users/jiri/data';
CREATE OR REPLACE DIRECTORY "DIR_USERS_JIRI_FILES" AS '/work/users/jiri/files';
CREATE OR REPLACE DIRECTORY "DIR_USERS_JIRI_LOG" AS '/work/users/jiri/log';
CREATE OR REPLACE DIRECTORY "DIR_USERS_JIRI_BAD" AS '/work/users/jiri/bad';

GRANT READ ON DIRECTORY DIR_USERS_JIRI_DATA TO JIRI;
GRANT READ ON DIRECTORY DIR_USERS_JIRI_FILES TO JIRI;
GRANT READ, WRITE ON DIRECTORY DIR_USERS_JIRI_LOG TO JIRI;
GRANT READ, WRITE ON DIRECTORY DIR_USERS_JIRI_BAD TO JIRI;

The next step is to define the external table (yes, all steps 1-5 can be done in one single DDL statement).

CREATE TABLE customer_ext
(
 CUSTOMER_ID         number,
 CUSTOMER_ACQUIRE_DT date,
 SERVICER_NM         varchar2(60),
 FILE_NM             varchar2(120),
 CUSTOMER_PROFILE    clob
)
organization external
(
 type oracle_loader
 default directory dir_users_jiri_data
 access parameters
 (
 records delimited by NEWLINE
 skip 1
 logfile dir_users_jiri_log:'customer.log'
 badfile dir_users_jiri_bad:'customer.bad'
 nodiscardfile
 fields terminated by ','
 missing field values are null
 (
 CUSTOMER_ID          CHAR(10),
 CUSTOMER_ACQUIRE_DT  CHAR(11) DATE_FORMAT DATE MASK "MM/DD/YYYY",
 CUSTOMER_FILE_NM     CHAR(60)
 )
 column transforms
  (
   SERVICER_NM      FROM CONSTANT 'Web Data Delivery Inc.',
   FILE_NM          FROM CONCAT  (CONSTANT 'http://myserver/', CUSTOMER_FILE_NM),
   CUSTOMER_PROFILE FROM LOBFILE (CUSTOMER_FILE_NM) FROM (dir_users_jiri_files)
  )
 )
 location (dir_users_jiri_data:'customer.dat')
)
reject limit 0
noparallel
nomonitoring;

It cannot be easier than that – just query the table:

SELECT CUSTOMER_ID,
       to_char(CUSTOMER_ACQUIRE_DT, 'YYYYMMDD') ACQUIRE_DT,
       SERVICER_NM,
       FILE_NM,
       SUBSTR(CUSTOMER_PROFILE, 1, 10) CUSTOMER_PROFILE,
       DBMS_LOB.getlength(CUSTOMER_PROFILE) CLOB_LENGTH
  FROM CUSTOMER_EXT;

CUSTOMER_ID ACQUIRE_DT SERVICER_NM            FILE_NM                           CUSTOMER_PROFILE CLOB_LENGTH
----------------------------------------------------------------------------------------------------
123456     20011223    Web Data Delivery Inc. http://myserver/jiri_smith.txt    dfdfdfdfdf       55
234567     19990117    Web Data Delivery Inc. http://myserver/scott_walker.txt  rrrererere       564
345678     20090516    Web Data Delivery Inc. http://myserver/jim_brown.txt     gretgertgr       444
----------------------------------------------------------------------------------------------------

As you can see, we met all the requirements in one simple external table DDL:

1. we used an external table
2. we loaded the LOB using the COLUMN TRANSFORMS LOBFILE clause
3. we set the correct date format using DATE MASK
4. we created a new static SERVICER_NM column (virtual column) with a static description using COLUMN TRANSFORMS CONSTANT
5. we appended a constant to an existing field (merged the field with a static value) using COLUMN TRANSFORMS CONCAT

Please note COLUMN TRANSFORMS cannot be used for any other transformation – CONSTANT, CONCAT and LOBFILE are the only options. I wish there were more options and a chance to add custom functions, but there is currently no such option.
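
If you need anything beyond those three, the usual pattern is to do the transformation one layer up, for example in a view over the external table. A minimal sketch (the UPPER call is just an illustration):

CREATE OR REPLACE VIEW customer_ext_v AS
SELECT customer_id,
       customer_acquire_dt,
       UPPER(servicer_nm) AS servicer_nm,  -- any SQL function you like
       file_nm,
       customer_profile
  FROM customer_ext;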

You can find the complete documentation here (there are, however, no real examples of how to use the column transforms clause).
