What is the best approach to deduplicate Data Pipeline output?
my data looks like:
extracted_at_utc (timestamp), id (int), attribute (int)
2013-03-29 22:02:44.0,40,0
2013-03-29 22:02:44.0,41,1
2013-03-30 22:03:19.0,40,1
2013-03-30 22:03:19.0,41,0
Now I'm using Hive query:
SELECT
t.extracted_at_utc, t.id, t.attribute
FROM ${input1} t JOIN (SELECT
id, MAX(extracted_at_utc) latest_extract_utc FROM ${input1} GROUP BY id) mx
ON t.id = mx.id AND t.extracted_at_utc = mx.latest_extract_utc;
I think that Pig script might be better tool for this job-- Here's the equivalent Pig code:
Equivalent Pig code:
input1 = load '$input1' as (id, extracted_at_utc, attribute); maxes = foreach (group input1 by id) generate id, attribute, MAX(extracted_at_utc) as latest_extract_utc; unique_maxes = DISTINCT maxes;