query JSON in Hive

Let's prep a simple JSON file first.

This is a fairly simple JSON document; it includes one array:

{
  "myData": {
    "my_address_info": "UNV-ADDR",
    "transaction_number": "8007038190",
    "customer_id": "jiri2",
    "inventory": [
      {
        "product_category": "electronic",
        "amount": "5000.20"
      }
    ],
    "transaction_datetime": "06\/04\/2013 21:38:14"
  }
}

Flattened to a single line it looks like this (Hive's text input format reads one record per line, so each JSON document has to sit on its own line):

{ "myData":{ "my_address_info":"UNV-ADDR", "transaction_number":"8007038190", "customer_id":"jiri2", "inventory":[ { "product_category":"electronic", "amount":"5000.20" } ], "transaction_datetime":"06/04/2013 21:38:14" } }

Let's add a few more dummy records:

{ "myData":{ "my_address_info":"UNV-ADDR", "transaction_number":"8007038190", "customer_id":"jiri2", "inventory":[ { "product_category":"electronic", "amount":"5000.20" } ], "transaction_datetime":"06/04/2013 21:38:14" } }
{ "myData":{ "my_address_info":"UNV-AAAA", "transaction_number":"8007038191", "customer_id":"jiri1", "inventory":[ { "product_category":"electronic", "amount":"5000.30" } ], "transaction_datetime":"06/04/2013 21:38:14" } }
{ "myData":{ "my_address_info":"UNV-BBBB", "transaction_number":"8007038192", "customer_id":"jiri3", "inventory":[ { "product_category":"electronic", "amount":"5000.40" } ], "transaction_datetime":"06/04/2013 21:38:14" } }
{ "myData":{ "my_address_info":"UNV-CCCC", "transaction_number":"8007038193", "customer_id":"jiri1", "inventory":[ { "product_category":"electronic", "amount":"5000.50" } ], "transaction_datetime":"06/04/2013 21:38:14" } }

and save it to the Hadoop cluster (in my case Kerberos-enabled) as a file named sample.json:

kinit yourname@COMPANY.COM;
hadoop fs -put sample.json /projects/jiri/json_demo
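
a quick sanity check after the upload never hurts; a sketch, assuming the target directory already exists (if not, create it first with hadoop fs -mkdir -p):

hadoop fs -ls /projects/jiri/json_demo
hadoop fs -cat /projects/jiri/json_demo/sample.json | head -2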

 

method 1: custom serde Hive-JSON-Serde

prep the serde (one-time setup)

  1. download the serde from https://github.com/rcongiu/Hive-JSON-Serde
  2. compile it
    mvn package
    
  3. save it on Hadoop client e.g. to /home/jiri/hive/json-serde-1.1.7.jar
  4. run hive CLI and add the serde
    kinit yourname@COMPANY.COM
    hive
    add JAR /home/jiri/hive/json-serde-1.1.7.jar;
    

note: you can add it to the .hiverc file in your home directory if you don't want to add it manually every time you start hive
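
for example, a one-line ~/.hiverc entry doing just that (a sketch, using the same jar path as above):

ADD JAR /home/jiri/hive/json-serde-1.1.7.jar;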

create table

I used an external table, but a regular table works the same way.

CREATE EXTERNAL TABLE json_serde (
     myData struct <
                      my_address_info :string,
                      transaction_number :string,
                      customer_id :string, 
                      inventory :array<struct<product_category:string, amount:string>> ,
                      transaction_datetime :string
                   >
                                  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "/projects/jiri/json_demo";
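
a quick describe will show how Hive parsed the nested types:

describe json_serde;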

optionally, you can use this tool to generate the table DDL for you (it parses the JSON and creates the DDL based on the content): https://github.com/midpeter444/hive-json-schema

let’s run some queries

simple select statement

select * from json_serde;

{"my_address_info":"UNV-ADDR","transaction_number":"8007038190","customer_id":"jiri2","inventory":[{"product_category":"electronic","amount":"5000.20"}],"transaction_datetime":"06/04/2013 21:38:14"}
{"my_address_info":"UNV-AAAA","transaction_number":"8007038191","customer_id":"jiri1","inventory":[{"product_category":"electronic","amount":"5000.30"}],"transaction_datetime":"06/04/2013 21:38:14"}
{"my_address_info":"UNV-BBBB","transaction_number":"8007038192","customer_id":"jiri3","inventory":[{"product_category":"electronic","amount":"5000.40"}],"transaction_datetime":"06/04/2013 21:38:14"}
{"my_address_info":"UNV-CCCC","transaction_number":"8007038193","customer_id":"jiri1","inventory":[{"product_category":"electronic","amount":"5000.50"}],"transaction_datetime":"06/04/2013 21:38:14"}

simple group by query

  select myData.customer_id, count(*)
    from json_serde
group by myData.customer_id;

jiri1 2
jiri2 1
jiri3 1

let’s select data from the array

  select myData.customer_id, count(*), sum(myData.inventory[0].amount)
    from json_serde
group by myData.customer_id;

jiri1 2 10000.8
jiri2 1 5000.2
jiri3 1 5000.4

let’s do a join (in this case dummy to itself) and group by

  select a.myData.customer_id, count(*), sum(a.myData.inventory[0].amount)
    from json_serde a inner join json_serde b on (a.myData.customer_id = b.myData.customer_id)
group by a.myData.customer_id;

jiri1 4 20001.6
jiri2 1 5000.2
jiri3 1 5000.4

interesting twist using explode

the field "inventory" in the examples above is an array, and so far I have always referenced the first tuple directly instead of asking for all tuples in the array. The latter can be done with the "explode" function, which basically creates multiple rows (one row per tuple). This is the same example as above, except the first record now has two values in the array.

{ "myData":{ "my_address_info":"UNV-ADDR", "transaction_number":"8007038190", "customer_id":"jiri2", "inventory":[ { "product_category":"electronic", "amount":"5000.20" } , { "product_category":"electronic", "amount":"50000.20" } ], "transaction_datetime":"06/04/2013 21:38:14" } }
{ "myData":{ "my_address_info":"UNV-AAAA", "transaction_number":"8007038191", "customer_id":"jiri1", "inventory":[ { "product_category":"electronic", "amount":"5000.30" } ], "transaction_datetime":"06/04/2013 21:38:14" } }
{ "myData":{ "my_address_info":"UNV-BBBB", "transaction_number":"8007038192", "customer_id":"jiri3", "inventory":[ { "product_category":"electronic", "amount":"5000.40" } ], "transaction_datetime":"06/04/2013 21:38:14" } }
{ "myData":{ "my_address_info":"UNV-CCCC", "transaction_number":"8007038193", "customer_id":"jiri1", "inventory":[ { "product_category":"electronic", "amount":"5000.50" } ], "transaction_datetime":"06/04/2013 21:38:14" } }

I will use the explode function on the inventory array to ... well, explode the records, in this case producing 5 rows (the first row turns into two rows).
A "lateral view" is almost like a virtual view created from the source (in this case json_serde) and applied back to it: it takes the array, explodes it, and joins the result to each source row.

select myData.customer_id, inventory.product_category, inventory.amount
  from json_serde a
lateral view explode(a.myData.inventory) b as inventory;

jiri2 electronic 5000.20
jiri2 electronic 50000.20
jiri1 electronic 5000.30
jiri3 electronic 5000.40
jiri1 electronic 5000.50
..
..
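
with explode in hand we can also aggregate over the whole array instead of just inventory[0]; a sketch combining the lateral view with the earlier group-by (same table as above):

  select a.myData.customer_id, count(*), sum(inventory.amount)
    from json_serde a
lateral view explode(a.myData.inventory) b as inventory
group by a.myData.customer_id;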

 

method 2: using the built-in get_json_object function

as the name says, the main purpose of get_json_object is manipulating a JSON object. Probably the best use case is a file with multiple fields where JSON is just one of them; calling it several times on pure JSON is a bit clunky and inefficient. Let's look at why.

create table

USE MY_SCHEMA;
CREATE EXTERNAL TABLE json_table ( json string )
ROW FORMAT DELIMITED
STORED AS TextFile
LOCATION "/projects/jiri/json_demo";

run sample queries

set hive.cli.print.header=true;

select get_json_object(json_table.json, '$')
 from json_table;

select get_json_object(json_table.json, '$.myData.my_address_info') as my_address_info,
       get_json_object(json_table.json, '$.myData.customer_id') as customer_id
  from json_table;
..
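
the JSONPath accepted by get_json_object also supports array subscripts, so a value inside the inventory array can be pulled out directly; a small sketch against the same table:

select get_json_object(json_table.json, '$.myData.inventory[0].amount') as first_amount
  from json_table;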

 

method 3: using the built-in json_tuple function with lateral view

json_tuple is perhaps a bit more advanced and a faster approach, but it is still clunky and hard to work with on deeper JSON: each json_tuple call can go only one level deep, so nested fields require a lateral view on a lateral view.

create table

USE MY_SCHEMA;
CREATE EXTERNAL TABLE json_table ( json string )
ROW FORMAT DELIMITED
STORED AS TextFile
LOCATION "/projects/jiri/json_demo";

run sample queries

the first sample peels off the first level of the JSON

set hive.cli.print.header=true;
select b.*
from json_table a
lateral view json_tuple(a.json, 'myData') b as myData;

myData
{"my_address_info":"UNV-ADDR","transaction_number":"8007038190","customer_id":"jiri2","inventory":[{"product_category":"electronic","amount":"5000.20"}],"transaction_datetime":"06/04/2013 21:38:14"}
{"my_address_info":"UNV-AAAA","transaction_number":"8007038191","customer_id":"jiri1","inventory":[{"product_category":"electronic","amount":"5000.30"}],"transaction_datetime":"06/04/2013 21:38:14"}
{"my_address_info":"UNV-BBBB","transaction_number":"8007038192","customer_id":"jiri3","inventory":[{"product_category":"electronic","amount":"5000.40"}],"transaction_datetime":"06/04/2013 21:38:14"}
{"my_address_info":"UNV-CCCC","transaction_number":"8007038193","customer_id":"jiri1","inventory":[{"product_category":"electronic","amount":"5000.50"}],"transaction_datetime":"06/04/2013 21:38:14"}

second example goes two levels deep

set hive.cli.print.header=true;

select c.*
from json_table a
lateral view json_tuple(a.json, 'myData') b as myData
lateral view json_tuple(b.myData, 'my_address_info', 'transaction_number', 'customer_id', 'inventory') c as address, trasref, mid, inventory;

address trasref mid inventory
UNV-ADDR 8007038190 jiri2 [{"product_category":"electronic","amount":"5000.20"}]
UNV-AAAA 8007038191 jiri1 [{"product_category":"electronic","amount":"5000.30"}]
UNV-BBBB 8007038192 jiri3 [{"product_category":"electronic","amount":"5000.40"}]
UNV-CCCC 8007038193 jiri1 [{"product_category":"electronic","amount":"5000.50"}]
...
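
json_tuple itself cannot index into the inventory array, but its output can be handed to get_json_object; a sketch building on the query above (the alias names are mine):

select c.mid, get_json_object(c.inventory, '$[0].amount') as first_amount
from json_table a
lateral view json_tuple(a.json, 'myData') b as myData
lateral view json_tuple(b.myData, 'customer_id', 'inventory') c as mid, inventory;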

 

lessons learned

  • hive-json-schema (the tool I used to create the DDL from the JSON – a one-time step only) is buggy; it still got the job done reasonably well and saved a lot of time, but a few tweaks were needed to fix the DDL
  • Hadoop had no issue creating the sample JSON (less than a 10-minute task for a 13-million-record JSON file)
  • Hadoop has no issue querying JSON through the JSON serde, with reasonable times for a simple sum as well as various joins
  • the JSON serde deserializes escape characters. For example, if your JSON includes a \n character, the serde will translate (deserialize) it to a NEWLINE automatically (there is currently no way to suppress this deserialization)
  • by default, Hive with the JSON serde throws an error if the JSON is malformed (this is a good thing); you can suppress the error by setting "ignore.malformed.json" = "true":
    ALTER TABLE SAMPLE_TABLE SET SERDEPROPERTIES ("ignore.malformed.json" = "true");
  • Hive does not like "dot", so every JSON field that includes a dot in its name has to be either renamed, e.g. "score.reason_code" became "score_reason_code", or a mapping serde property has to be set in the table DDL, e.g. WITH SERDEPROPERTIES ("mapping.my_reason_code" = "my.reason_code") – see the DDL sketch after this list
  • if your JSON record spans several rows, you have to write your own custom InputFormat; you cannot use the default TextInputFormat, which splits records on NEWLINE. Tom White's excellent "Hadoop: The Definitive Guide" from O'Reilly goes into this in great depth.
  • a fast way to uncompress a file inside HDFS

    if I ignore the GET | UNZIP | PUT method, which basically reads the file from HDFS onto the local machine, then the easiest way is to use a Pig script

    a = load '/user/jiri/file.gz' using PigStorage();
    store a into '/user/jiri/file' using PigStorage();
    

    it cannot get easier than that
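
for the "dot" mapping note above, a minimal DDL sketch (hypothetical table and field names; the mapping key is the Hive column, the value is the JSON field):

CREATE EXTERNAL TABLE json_mapped ( my_reason_code string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.my_reason_code" = "my.reason_code")
LOCATION "/projects/jiri/json_demo";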
