Monday 15 April 2013

java - Pig: Group by ranges / binning data


I have a set of integer values that I want to group into a set of bins.

Example: say I have 1000 points between 1 and 1000, and I want to put them into 20 bins.

Is there any way to group them into bins or an array?

Also, I will not know ahead of time how wide the range will be, so I cannot hardcode any specific values.

If you have the minimum and maximum, you can divide the range by the number of bins. For example:

  -- foo.pig
  ids = load '$INPUT' as (id: int);
  ids_with_key = foreach ids generate (id - $MIN) * $BIN_COUNT / ($MAX - $MIN + 1) as bin_id, id;
  group_by_id = group ids_with_key by bin_id;
  bin_id = foreach group_by_id generate group, flatten(ids_with_key.id);
  dump bin_id;

Then you can use the following command to run it:

  pig -f foo.pig -p MIN=1 -p MAX=1000 -p BIN_COUNT=20 -p INPUT=your_input_path

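The idea behind the script is that we divide the size of the range [MIN, MAX] by BIN_COUNT to get the size of every bin: (MAX - MIN + 1) / BIN_COUNT, which we call BIN_SIZE. Then we map each id to its bin number, (id - MIN) / BIN_SIZE, and group by that number. The script multiplies by BIN_COUNT before dividing, which is algebraically the same but avoids truncating BIN_SIZE under integer division.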
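
To sanity-check the bin arithmetic outside Pig, here is a minimal Python sketch of the same mapping. The function names `bin_id` and `group_into_bins` are hypothetical, chosen for illustration. It also assumes the minimum and maximum are taken from the data itself, which is one way to handle a range that is not known in advance; the Pig script instead receives them as parameters.

```python
def bin_id(value, lo, hi, bin_count):
    """Map an integer in [lo, hi] to a bin number in [0, bin_count - 1]."""
    # Same integer arithmetic as the Pig expression:
    #   (id - MIN) * BIN_COUNT / (MAX - MIN + 1)
    # Multiplying before dividing avoids truncating the bin size.
    return (value - lo) * bin_count // (hi - lo + 1)


def group_into_bins(values, bin_count):
    """Group values by bin number, deriving lo and hi from the data."""
    values = list(values)
    lo, hi = min(values), max(values)  # assumption: range taken from the data
    bins = {}
    for v in values:
        bins.setdefault(bin_id(v, lo, hi, bin_count), []).append(v)
    return bins


# The example from the question: 1000 points between 1 and 1000, 20 bins.
bins = group_into_bins(range(1, 1001), 20)
```

With these inputs, ids 1 through 50 fall into bin 0, ids 51 through 100 into bin 1, and so on up to bin 19, so each bin holds 50 values.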
