Thursday 15 May 2014

hadoop - PIG - Best way to optimize various grouping structures from one large input


I am using Pig to process a large txt file of the form

Col A | Col B | Col C | Col D | Col E | Col F | Col G

My goal is to take this input and group by different combinations of the columns, to get something like

(Col A, Col B), COUNT(Col F), SUM(Col G)

(Col A, Col C), COUNT(Col F), SUM(Col G)

(Col A, Col D), COUNT(Col F), SUM(Col G)

(Col B, Col C), COUNT(Col F), SUM(Col G)

I'm wondering if there is a way to structure my Pig code so that the data only needs to be processed the minimum number of times, since the input is the same for all of these groupings.

Pig optimizes this automatically. If you simply implement each grouping, Pig's multi-query optimization will run the groupings in parallel, and all of them will be done in a single map-reduce job.

Given that you want to do the same thing for each grouping, you should define a macro to save yourself some typing, for example:

DEFINE do_stuff(input, grp1, grp2) RETURNS result {
    grouped = GROUP $input BY ($grp1, $grp2);
    $result = FOREACH grouped GENERATE FLATTEN(group), COUNT($input.F), SUM($input.G);
};

-- the input is pipe-delimited
data = LOAD '/path/to/text' USING PigStorage('|') AS (A, B, C, D, E, F, G:int);
W = do_stuff(data, A, B);
X = do_stuff(data, A, C);
Y = do_stuff(data, A, D);
Z = do_stuff(data, B, C);
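To make the macro's semantics concrete, here is a rough Python sketch of what each grouping computes: one output tuple per (grp1, grp2) key, carrying a count of column F and a sum of column G. The sample rows and the do_stuff name mirroring the Pig macro are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical rows mirroring the pipe-delimited input: columns A..G.
rows = [
    #  A,    B,    C,    D,    E,    F,   G
    ("a1", "b1", "c1", "d1", "e1", "f1", 10),
    ("a1", "b1", "c2", "d1", "e2", "f2", 20),
    ("a2", "b2", "c1", "d2", "e1", "f3", 5),
]

A, B, C, D, E, F, G = range(7)  # column positions

def do_stuff(data, grp1, grp2):
    """Mimic the Pig macro: GROUP BY (grp1, grp2), then COUNT(F), SUM(G)."""
    groups = defaultdict(list)
    for row in data:
        groups[(row[grp1], row[grp2])].append(row)
    # One result per group key: (count of rows with an F value, sum of G)
    return {key: (len(bag), sum(r[G] for r in bag)) for key, bag in groups.items()}

W = do_stuff(rows, A, B)  # group by (A, B)
X = do_stuff(rows, A, C)  # group by (A, C)
print(W[("a1", "b1")])    # two rows fall in this group; their G values sum to 30
```

In real Pig, each call to the macro expands into its own GROUP/FOREACH pair, but the optimizer still scans the input once.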
