Friday 15 February 2013

numpy - Pandas crosstab double counting when using two aggregate functions?


I'm not sure whether this is something I'm doing wrong or don't understand, or whether it is a bug.

I'm using a simple DataFrame from the pandas examples:

    >>> import numpy as np
    >>> from pandas import DataFrame, crosstab
    >>> df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
    ...                 'B': ['A', 'B', 'C'] * 8,
    ...                 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    ...                 'D': np.random.randn(24),
    ...                 'E': np.random.randn(24)})

Doing a simple crosstab with margins=True works as expected:

    >>> crosstab(rows=[df['A'], df['B']], cols=[df['C']], margins=True)
    C        bar  foo  All
    A     B
    one   A    2    2    4
          B    2    2    4
          C    2    2    4
    three A    2    0    2
          B    0    2    2
          C    2    0    2
    two   A    0    2    2
          B    2    0    2
          C    0    2    2
    All       12   12   24

Using the np.size function directly gives the same results:

    >>> crosstab(rows=[df['A'], df['B']], cols=[df['C']], margins=True,
    ...          aggfunc=[np.size], values=df['D'])
    C        bar  foo  All
    A     B
    one   A    2    2    4
          B    2    2    4
          C    2    2    4
    three A    2    0    2
          B    0    2    2
          C    2    0    2
    two   A    0    2    2
          B    2    0    2
          C    0    2    2
    All       12   12   24

Pandas lets you pass several aggregation functions at once, so I can get a count and a mean in a single crosstab. However, when I do this, the sizes for both foo and bar in the last row (All) are doubled, yet the grand total is correct.

    >>> crosstab(rows=[df['A'], df['B']], cols=[df['C']], margins=True,
    ...          aggfunc=[np.size, np.mean], values=df['D'])
             size                 mean
    C         bar  foo  All        bar        foo        All
    A     B
    one   A     2    2    4   0.245998   0.076366   0.161182
          B     2    2    4  -0.739757   0.137780  -0.300988
          C     2    2    4  -1.555759  -1.446554  -1.501157
    three A     2  NaN    2   1.216109        NaN   1.216109
          B   NaN    2    2        NaN   0.255482   0.255482
          C     2  NaN    2   0.732448        NaN   0.732448
    two   A   NaN    2    2        NaN  -0.273747  -0.273747
          B     2  NaN    2  -0.001649        NaN  -0.001649
          C   NaN    2    2        NaN   0.685422   0.685422
    All        24   24   24  -0.017102  -0.094208  -0.055655
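As a sanity check on which figure is right, the counts that belong in the All row can be computed independently of crosstab. The following is only a sketch: it rebuilds a frame with the same layout as the one in the question (the seed is an arbitrary assumption, so the values of D differ from those shown) and counts the D values under each value of C.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed (an assumption) so the sketch is repeatable

# Same layout as the frame in the question; the random values will differ.
df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 6,
    'B': ['A', 'B', 'C'] * 8,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': np.random.randn(24),
})

# Number of D values under each value of C. This is what the 'All' row
# of the size columns would be expected to show.
expected_margin = df.groupby('C')['D'].count()
print(expected_margin)
```

Each of bar and foo has 12 values of D, for a total of 24, which agrees with the single-function crosstab and with the grand total, but not with the 24 reported per column when two functions are passed.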

Am I missing something here? Why does it behave differently in the two cases?

OK, I figured out what it's doing. After digging through the source code in pandas/pandas/tools/pivot.py, this statement comes up:

    row_margin = data[cols + values].groupby(cols).agg(aggfunc)

Here cols is df['C'] and values is df['D']. We group those two columns by cols and then apply the aggregation function, which is np.size in this case. Each row looks like this:

    In [158]: data[cols + values].groupby(cols).nth(0)
    Out[158]:
         __dummy__    C
    bar  -1.823026  bar
    foo   0.465117  foo

When we call np.size() on such a row, we get 2, one for each of the two columns. Summing these over all the rows to get the margin gives 24, twice what you would expect if you only want the count of D.
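The effect can be reproduced outside of crosstab. This is a minimal sketch, not the actual pivot.py code path: it assumes a two-column frame that stands in for data[cols + values], with the column name D used in place of the internal values column, and compares np.size over both columns with np.size over the values column alone.

```python
import numpy as np
import pandas as pd

# Stand-in for data[cols + values]: the grouping column plus one values column.
# 'D' is used here in place of the internal values column (an assumption).
df = pd.DataFrame({
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': np.random.randn(24),
})

sizes = {}
for name, group in df.groupby('C'):
    # np.size counts every element, so a two-column group counts each row twice.
    both_columns = np.size(group[['C', 'D']].to_numpy())
    values_only = np.size(group['D'].to_numpy())
    sizes[name] = (both_columns, values_only)

print(sizes)
```

Each group has 12 rows, so np.size returns 24 when the grouping column is included and 12 when only the values column is counted. That matches the doubled margin described above.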

Perhaps someone else can tell us whether this is expected or not. I am still a bit confused by one part of the source code. If I figure out anything more, I will edit this answer.
