Data Pipeline output deduplication? Hive or Pig

Last Update: Apr 04, 2013 | 01:45

Viewed 6310 times | Community Rating: 5

Originating Author: Alexander Berezovsky

What is the best approach to deduplicate Data Pipeline output?

My data looks like this:

extracted_at_utc (timestamp), id (int), attribute (int)
2013-03-29 22:02:44.0,40,0
2013-03-29 22:02:44.0,41,1
2013-03-30 22:03:19.0,40,1
2013-03-30 22:03:19.0,41,0

I'm currently using this Hive query, which keeps the most recent row for each id:

SELECT
  t.extracted_at_utc,
  t.id,
  t.attribute
FROM ${input1} t
JOIN (
  SELECT
    id,
    MAX(extracted_at_utc) AS latest_extract_utc
  FROM ${input1}
  GROUP BY id
) mx
ON t.id = mx.id AND t.extracted_at_utc = mx.latest_extract_utc;
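As an alternative to the self-join, the same "latest row per id" selection can be sketched with a windowing function. This is only a minimal sketch, assuming Hive 0.11 or later, where ROW_NUMBER() and OVER are available. The names ranked and rn are illustrative and not part of the original query.

```sql
-- Sketch only: assumes Hive 0.11+ (windowing functions).
-- ranked and rn are hypothetical names chosen for illustration.
SELECT
  extracted_at_utc,
  id,
  attribute
FROM (
  SELECT
    t.extracted_at_utc,
    t.id,
    t.attribute,
    -- Number the rows of each id from newest to oldest extract.
    ROW_NUMBER() OVER (PARTITION BY t.id ORDER BY t.extracted_at_utc DESC) AS rn
  FROM ${input1} t
) ranked
WHERE rn = 1;  -- keep only the newest row per id
```

Unlike the join, this returns exactly one row per id even when several rows share the same latest extracted_at_utc. Which of the tied rows is kept is arbitrary.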

I think a Pig script might be a better tool for this job.

Comments on 'Data Pipeline output deduplication? Hive or Pig'

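Equivalent Pig code:

-- Field order follows the sample data; comma-delimited input is assumed.
input1 = LOAD '$input1' USING PigStorage(',') AS (extracted_at_utc:chararray, id:int, attribute:int);
grouped = GROUP input1 BY id;
latest = FOREACH grouped {
    sorted = ORDER input1 BY extracted_at_utc DESC;
    newest = LIMIT sorted 1;  -- most recent row for this id
    GENERATE FLATTEN(newest);
};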

Posted By: Russell Jurney | Wed Apr 03, 2013 01:03

Thanks Russell

Posted By: David Vellante | Thu Apr 04, 2013 01:43


Revision ID	Author	Timestamp	Comment
46375	Dvellante	13 Apr 04 13:45:28
46374	Dvellante	13 Apr 04 13:45:06
46373	Dvellante	13 Apr 04 13:44:38
46345	A-b	13 Apr 02 18:46:45
46344	A-b	13 Apr 02 18:44:51
46343	A-b	13 Apr 02 18:44:13
46342	A-b	13 Apr 02 18:41:06
46341	A-b	13 Apr 02 18:40:37	Created page with 'What is the best approach to deduplicate Data Pipeline output? my data looks like: extracted_at_timestamp, id, attribute 2013-03-29 22:02:44.0,40,0 2013-03-29 22:02:44.0,41,1 ...'
