PostgreSQL to Hadoop/Hive

Ever tried to get data from PostgreSQL into Hive? I came across the CSV SerDe bundled with the latest version of Apache Hive, but for all practical purposes it is useless: it treats every column as a string. So I wrote my own SerDe; you can find the source on GitHub. Start by dumping your PostgreSQL table data using pg_dump or psql with COPY, in plain-text format.
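For example, assuming a database named mydb and a table named my_table (placeholder names, not from the original post), either of these produces a plain-text dump:

$ psql -d mydb -c "\copy my_table TO '/tmp/my_table.dump'"
$ pg_dump -d mydb -t my_table --data-only -f /tmp/my_table.dump

psql's \copy writes only tab-separated rows; pg_dump wraps the rows in a SQL script with header and footer lines, which matters below.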

Download the pgdump-serde jar to your local machine, open the Hive shell, and add the jar. Then create an external table and load the dump data. If you are using a pg_dump file, keep in mind that this SerDe cannot handle schema statements, comments, column headers, etc., so remove any header/footer lines that are not row data before loading.
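One rough way to do that cleanup is to keep only the lines between the COPY statement and the terminating \. marker (file names below are placeholders):

$ sed -n '/^COPY /,/^\\\.$/p' my_table.dump | sed '1d;$d' > my_table_clean.dump

The first sed prints everything from the COPY ... FROM stdin; line through the closing \. line; the second drops those two marker lines, leaving only the tab-separated data rows.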

hive> add jar <path to>/pgdump-serde-1.0.4-1.2.0-all.jar;
hive> USE my_database;
hive> CREATE EXTERNAL TABLE `my_table_ext` (
  `id` string,
  `time` timestamp,
  `what` boolean,
  `size` int,
  ...
)
ROW FORMAT SERDE 'com.pasam.hive.serde.pg.PgDumpSerDe'
LOCATION '/tmp/my_table_ext';
hive> LOAD DATA LOCAL INPATH '<path to dump directory>/my_table.dump' OVERWRITE INTO TABLE my_table_ext;
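After the load, a quick query is a handy sanity check that the columns come back with their declared types instead of plain strings (same placeholder table as above):

hive> SELECT `id`, `time`, `size` FROM my_table_ext LIMIT 10;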