PosgreSQL to Hadoop/Hive

Ever tried to get data from PostgreSQL to Hive? Came across CSV SerDe which is bundled in latest version of Apache Hive. But for all practical purposes it is useless. It treats every column as string. So wrote my own SerDe. You can find the source on GitHub. Dump your PostgreSQL table data using pg_dump or psql with COPY in plain text format.

Download pgdump-serde jar to your local machine. Open hive shell, add jar. Create external table and load the dump data. If you are using pg_dump file, this SerDe cannot handle schema, comments, column headers etc. So remove unnecessary header/footer that is not row data.

hive> add jar <path to>/pgdump-serde-1.0.4-1.2.0-all.jar;
hive> USE my_database;
hive> CREATE EXTERNAL TABLE `my_table_ext` (
  `id` string,
  `time` timestamp,
  `what` boolean,
  `size` int,
ROW FORMAT SERDE 'com.pasam.hive.serde.pg.PgDumpSerDe'
LOCATION '/tmp/my_table_ext.txt';
hive> LOAD DATA LOCAL INPATH '<path to dump directory>/my_table.dump' OVERWRITE INTO TABLE my_table_ext;
PosgreSQL to Hadoop/Hive