Wednesday 15 May 2013

Apache Pig - nested FOREACH statements


It seems that GROUP is not supported inside a nested FOREACH statement. I have the following schema:

  data2: {group: chararray, data1: {(lt: chararray, ln: chararray)}}

on which I want to do the following on data1: group all (lt, ln) pairs, count them, order by count DESC, and finally LIMIT 1.

The idea is to take the most probable (lt, ln) pair for each group. How would you recommend I do this?

A UDF would be best for the fastest execution. Otherwise, assuming the relation before grouping has the schema (id, lt, ln), it could be something like this (this is just a pseudo-script; some debugging may be required):

  -- Assumes the loaded schema is (id, lt, ln)
  inpt = LOAD '...' AS (id:chararray, lt:chararray, ln:chararray);
  -- Count how often each (id, lt, ln) combination occurs
  grp1 = GROUP inpt BY (id, lt, ln);
  data1 = FOREACH grp1 GENERATE FLATTEN(group), COUNT(inpt) AS cnt;
  data2 = GROUP data1 BY id;
  -- data2: {group: chararray, data1: {(id: chararray, lt: chararray, ln: chararray, cnt: long)}}
  most_prob_per_id = FOREACH data2 {
      ord = ORDER data1 BY cnt DESC;  -- highest count first
      top = LIMIT ord 1;
      GENERATE group, top.(ln, lt);
  };

Or you can flatten data2 back down to data1 and start from grp1.
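
To make the intended result concrete, here is a minimal plain-Python sketch of the same count / order DESC / limit 1 logic, which is roughly what a UDF would have to compute. The function name most_frequent_pair_per_id and the sample rows are made up for illustration, and wiring this into Pig as an actual UDF is not shown. When two pairs tie for the top count, which one is returned should be treated as arbitrary.

```python
from collections import Counter, defaultdict


def most_frequent_pair_per_id(rows):
    """Return {id: ((lt, ln), count)} for the most frequent (lt, ln) pair per id.

    rows is an iterable of (id, lt, ln) tuples (hypothetical input shape,
    matching the schema assumed in the script above).
    """
    counts = defaultdict(Counter)
    for id_, lt, ln in rows:
        # Same effect as GROUP BY (id, lt, ln) followed by COUNT
        counts[id_][(lt, ln)] += 1
    # most_common(1) plays the role of ORDER BY cnt DESC followed by LIMIT 1
    return {id_: pairs.most_common(1)[0] for id_, pairs in counts.items()}


# Made-up sample data, for illustration only
rows = [
    ("a", "x", "1"),
    ("a", "x", "1"),
    ("a", "y", "2"),
    ("b", "z", "3"),
]
result = most_frequent_pair_per_id(rows)
print(result)
```

Each id maps to its single most frequent pair together with that pair's count, which corresponds to the (group, top) output of the nested FOREACH.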
