11 Plydata (dplyr for Python)

11.1 Sample Data

n = 200
comp = ['C' + i for i in np.random.randint( 1,4, size  = n).astype(str)] # 3x Company
dept = ['D' + i for i in np.random.randint( 1,6, size  = n).astype(str)] # 5x Department
grp =  ['G' + i for i in np.random.randint( 1,3, size  = n).astype(str)] # 2x Groups
value1 = np.random.normal( loc=50 , scale=5 , size = n)
value2 = np.random.normal( loc=20 , scale=3 , size = n)
#value3 = np.random.normal( loc=5 , scale=30 , size = n)

mydf = pd.DataFrame({
    'comp':comp, 
    'dept':dept, 
    'grp': grp,
    'value1':value1, 
    'value2':value2
    #'value3':value3 
})
mydf.head()
#:>   comp dept grp     value1     value2
#:> 0   C1   D5  G2  44.991357  18.376559
#:> 1   C1   D3  G1  42.629061  15.044993
#:> 2   C3   D1  G1  53.803512  18.496985
#:> 3   C2   D3  G1  51.387646  16.152419
#:> 4   C2   D3  G1  52.897552  19.107399

11.2 Column Manipulation

11.2.1 Copy Column

mydf >> define(newcol = 'value1')                 # simple method for one column
#:>     comp dept grp     value1     value2     newcol
#:> 0     C1   D5  G2  44.991357  18.376559  44.991357
#:> 1     C1   D3  G1  42.629061  15.044993  42.629061
#:> 2     C3   D1  G1  53.803512  18.496985  53.803512
#:> 3     C2   D3  G1  51.387646  16.152419  51.387646
#:> 4     C2   D3  G1  52.897552  19.107399  52.897552
#:> ..   ...  ...  ..        ...        ...        ...
#:> 195   C3   D3  G1  44.837077  23.383648  44.837077
#:> 196   C2   D2  G2  44.899655  17.546987  44.899655
#:> 197   C1   D1  G2  47.806362  21.936824  47.806362
#:> 198   C1   D5  G1  50.837563  17.498985  50.837563
#:> 199   C1   D4  G1  51.282241  20.841696  51.282241
#:> 
#:> [200 rows x 6 columns]
mydf >> define (('newcol1', 'value1'), newcol2='value2')  # method for muiltiple new columns
#:>     comp dept grp     value1     value2    newcol1    newcol2
#:> 0     C1   D5  G2  44.991357  18.376559  44.991357  18.376559
#:> 1     C1   D3  G1  42.629061  15.044993  42.629061  15.044993
#:> 2     C3   D1  G1  53.803512  18.496985  53.803512  18.496985
#:> 3     C2   D3  G1  51.387646  16.152419  51.387646  16.152419
#:> 4     C2   D3  G1  52.897552  19.107399  52.897552  19.107399
#:> ..   ...  ...  ..        ...        ...        ...        ...
#:> 195   C3   D3  G1  44.837077  23.383648  44.837077  23.383648
#:> 196   C2   D2  G2  44.899655  17.546987  44.899655  17.546987
#:> 197   C1   D1  G2  47.806362  21.936824  47.806362  21.936824
#:> 198   C1   D5  G1  50.837563  17.498985  50.837563  17.498985
#:> 199   C1   D4  G1  51.282241  20.841696  51.282241  20.841696
#:> 
#:> [200 rows x 7 columns]

11.2.2 New Column from existing Column

Without specify the new column name, it will be derived from expression

mydf >> define ('value1*2')
#:>     comp dept grp     value1     value2    value1*2
#:> 0     C1   D5  G2  44.991357  18.376559   89.982715
#:> 1     C1   D3  G1  42.629061  15.044993   85.258122
#:> 2     C3   D1  G1  53.803512  18.496985  107.607023
#:> 3     C2   D3  G1  51.387646  16.152419  102.775291
#:> 4     C2   D3  G1  52.897552  19.107399  105.795104
#:> ..   ...  ...  ..        ...        ...         ...
#:> 195   C3   D3  G1  44.837077  23.383648   89.674153
#:> 196   C2   D2  G2  44.899655  17.546987   89.799311
#:> 197   C1   D1  G2  47.806362  21.936824   95.612724
#:> 198   C1   D5  G1  50.837563  17.498985  101.675127
#:> 199   C1   D4  G1  51.282241  20.841696  102.564482
#:> 
#:> [200 rows x 6 columns]

Specify the new column name

mydf >> define(value3 = 'value1*2')
#:>     comp dept grp     value1     value2      value3
#:> 0     C1   D5  G2  44.991357  18.376559   89.982715
#:> 1     C1   D3  G1  42.629061  15.044993   85.258122
#:> 2     C3   D1  G1  53.803512  18.496985  107.607023
#:> 3     C2   D3  G1  51.387646  16.152419  102.775291
#:> 4     C2   D3  G1  52.897552  19.107399  105.795104
#:> ..   ...  ...  ..        ...        ...         ...
#:> 195   C3   D3  G1  44.837077  23.383648   89.674153
#:> 196   C2   D2  G2  44.899655  17.546987   89.799311
#:> 197   C1   D1  G2  47.806362  21.936824   95.612724
#:> 198   C1   D5  G1  50.837563  17.498985  101.675127
#:> 199   C1   D4  G1  51.282241  20.841696  102.564482
#:> 
#:> [200 rows x 6 columns]

Define multiple new columns in one go. Observe there are three ways to specify the new columns

mydf >> define('value1*2',('newcol2','value2*2'),newcol3='value2*3')
#:>     comp dept grp     value1     value2    value1*2    newcol2    newcol3
#:> 0     C1   D5  G2  44.991357  18.376559   89.982715  36.753118  55.129677
#:> 1     C1   D3  G1  42.629061  15.044993   85.258122  30.089985  45.134978
#:> 2     C3   D1  G1  53.803512  18.496985  107.607023  36.993970  55.490955
#:> 3     C2   D3  G1  51.387646  16.152419  102.775291  32.304838  48.457257
#:> 4     C2   D3  G1  52.897552  19.107399  105.795104  38.214799  57.322198
#:> ..   ...  ...  ..        ...        ...         ...        ...        ...
#:> 195   C3   D3  G1  44.837077  23.383648   89.674153  46.767296  70.150944
#:> 196   C2   D2  G2  44.899655  17.546987   89.799311  35.093974  52.640962
#:> 197   C1   D1  G2  47.806362  21.936824   95.612724  43.873649  65.810473
#:> 198   C1   D5  G1  50.837563  17.498985  101.675127  34.997969  52.496954
#:> 199   C1   D4  G1  51.282241  20.841696  102.564482  41.683391  62.525087
#:> 
#:> [200 rows x 8 columns]

11.2.3 Select Column(s)

mydf2 = mydf >> define(newcol1='value1',newcol2='value2')
mydf2.info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 200 entries, 0 to 199
#:> Data columns (total 7 columns):
#:>  #   Column   Non-Null Count  Dtype  
#:> ---  ------   --------------  -----  
#:>  0   comp     200 non-null    object 
#:>  1   dept     200 non-null    object 
#:>  2   grp      200 non-null    object 
#:>  3   value1   200 non-null    float64
#:>  4   value2   200 non-null    float64
#:>  5   newcol1  200 non-null    float64
#:>  6   newcol2  200 non-null    float64
#:> dtypes: float64(4), object(3)
#:> memory usage: 11.1+ KB

11.2.3.1 By Column Names

Exact Coumn Name

mydf2 >> select ('comp','dept','value1')
#:>     comp dept     value1
#:> 0     C1   D5  44.991357
#:> 1     C1   D3  42.629061
#:> 2     C3   D1  53.803512
#:> 3     C2   D3  51.387646
#:> 4     C2   D3  52.897552
#:> ..   ...  ...        ...
#:> 195   C3   D3  44.837077
#:> 196   C2   D2  44.899655
#:> 197   C1   D1  47.806362
#:> 198   C1   D5  50.837563
#:> 199   C1   D4  51.282241
#:> 
#:> [200 rows x 3 columns]

Column Name Starts With

mydf2 >> select ('comp', startswith='val')
#:>     comp     value1     value2
#:> 0     C1  44.991357  18.376559
#:> 1     C1  42.629061  15.044993
#:> 2     C3  53.803512  18.496985
#:> 3     C2  51.387646  16.152419
#:> 4     C2  52.897552  19.107399
#:> ..   ...        ...        ...
#:> 195   C3  44.837077  23.383648
#:> 196   C2  44.899655  17.546987
#:> 197   C1  47.806362  21.936824
#:> 198   C1  50.837563  17.498985
#:> 199   C1  51.282241  20.841696
#:> 
#:> [200 rows x 3 columns]

Column Name Ends With …

mydf2 >> select ('comp',endswith=('1','2','3'))
#:>     comp     value1     value2    newcol1    newcol2
#:> 0     C1  44.991357  18.376559  44.991357  18.376559
#:> 1     C1  42.629061  15.044993  42.629061  15.044993
#:> 2     C3  53.803512  18.496985  53.803512  18.496985
#:> 3     C2  51.387646  16.152419  51.387646  16.152419
#:> 4     C2  52.897552  19.107399  52.897552  19.107399
#:> ..   ...        ...        ...        ...        ...
#:> 195   C3  44.837077  23.383648  44.837077  23.383648
#:> 196   C2  44.899655  17.546987  44.899655  17.546987
#:> 197   C1  47.806362  21.936824  47.806362  21.936824
#:> 198   C1  50.837563  17.498985  50.837563  17.498985
#:> 199   C1  51.282241  20.841696  51.282241  20.841696
#:> 
#:> [200 rows x 5 columns]

Column Name Contains …

mydf2 >> select('comp', contains=('col','val'))
#:>     comp     value1     value2    newcol1    newcol2
#:> 0     C1  44.991357  18.376559  44.991357  18.376559
#:> 1     C1  42.629061  15.044993  42.629061  15.044993
#:> 2     C3  53.803512  18.496985  53.803512  18.496985
#:> 3     C2  51.387646  16.152419  51.387646  16.152419
#:> 4     C2  52.897552  19.107399  52.897552  19.107399
#:> ..   ...        ...        ...        ...        ...
#:> 195   C3  44.837077  23.383648  44.837077  23.383648
#:> 196   C2  44.899655  17.546987  44.899655  17.546987
#:> 197   C1  47.806362  21.936824  47.806362  21.936824
#:> 198   C1  50.837563  17.498985  50.837563  17.498985
#:> 199   C1  51.282241  20.841696  51.282241  20.841696
#:> 
#:> [200 rows x 5 columns]

11.2.3.2 Specify Column Range

mydf2 >> select ('comp', slice('value1','newcol2'))
#:>     comp     value1     value2    newcol1    newcol2
#:> 0     C1  44.991357  18.376559  44.991357  18.376559
#:> 1     C1  42.629061  15.044993  42.629061  15.044993
#:> 2     C3  53.803512  18.496985  53.803512  18.496985
#:> 3     C2  51.387646  16.152419  51.387646  16.152419
#:> 4     C2  52.897552  19.107399  52.897552  19.107399
#:> ..   ...        ...        ...        ...        ...
#:> 195   C3  44.837077  23.383648  44.837077  23.383648
#:> 196   C2  44.899655  17.546987  44.899655  17.546987
#:> 197   C1  47.806362  21.936824  47.806362  21.936824
#:> 198   C1  50.837563  17.498985  50.837563  17.498985
#:> 199   C1  51.282241  20.841696  51.282241  20.841696
#:> 
#:> [200 rows x 5 columns]

11.2.4 Drop Column(s)

mydf2 >> select('newcol1','newcol2',drop=True)
#:>     comp dept grp     value1     value2
#:> 0     C1   D5  G2  44.991357  18.376559
#:> 1     C1   D3  G1  42.629061  15.044993
#:> 2     C3   D1  G1  53.803512  18.496985
#:> 3     C2   D3  G1  51.387646  16.152419
#:> 4     C2   D3  G1  52.897552  19.107399
#:> ..   ...  ...  ..        ...        ...
#:> 195   C3   D3  G1  44.837077  23.383648
#:> 196   C2   D2  G2  44.899655  17.546987
#:> 197   C1   D1  G2  47.806362  21.936824
#:> 198   C1   D5  G1  50.837563  17.498985
#:> 199   C1   D4  G1  51.282241  20.841696
#:> 
#:> [200 rows x 5 columns]
mydf >> rename( {'val.1' : 'value1',
                 'val.2' : 'value2' })
#:>     comp dept grp      val.1      val.2
#:> 0     C1   D5  G2  44.991357  18.376559
#:> 1     C1   D3  G1  42.629061  15.044993
#:> 2     C3   D1  G1  53.803512  18.496985
#:> 3     C2   D3  G1  51.387646  16.152419
#:> 4     C2   D3  G1  52.897552  19.107399
#:> ..   ...  ...  ..        ...        ...
#:> 195   C3   D3  G1  44.837077  23.383648
#:> 196   C2   D2  G2  44.899655  17.546987
#:> 197   C1   D1  G2  47.806362  21.936824
#:> 198   C1   D5  G1  50.837563  17.498985
#:> 199   C1   D4  G1  51.282241  20.841696
#:> 
#:> [200 rows x 5 columns]

Combined Method
Combine both assignment and dictionary method

mydf >> rename( {'val.1' : 'value1',
                 'val.2' : 'value2'
              }, group = 'grp' )
#:>     comp dept group      val.1      val.2
#:> 0     C1   D5    G2  44.991357  18.376559
#:> 1     C1   D3    G1  42.629061  15.044993
#:> 2     C3   D1    G1  53.803512  18.496985
#:> 3     C2   D3    G1  51.387646  16.152419
#:> 4     C2   D3    G1  52.897552  19.107399
#:> ..   ...  ...   ...        ...        ...
#:> 195   C3   D3    G1  44.837077  23.383648
#:> 196   C2   D2    G2  44.899655  17.546987
#:> 197   C1   D1    G2  47.806362  21.936824
#:> 198   C1   D5    G1  50.837563  17.498985
#:> 199   C1   D4    G1  51.282241  20.841696
#:> 
#:> [200 rows x 5 columns]

11.3 Sorting (arrange)

Use ‘-colName’ for decending

mydf >> arrange('comp', '-value1')
#:>     comp dept grp     value1     value2
#:> 37    C1   D2  G2  61.082781  18.495389
#:> 75    C1   D5  G1  60.071749  17.671009
#:> 148   C1   D5  G2  59.864576  22.915047
#:> 48    C1   D4  G1  59.220959  16.136915
#:> 64    C1   D4  G2  58.752440  17.698505
#:> ..   ...  ...  ..        ...        ...
#:> 83    C3   D4  G1  42.457564  20.433678
#:> 161   C3   D2  G2  42.343989  18.665869
#:> 109   C3   D1  G1  41.992172  19.105493
#:> 138   C3   D1  G2  41.128355  19.948868
#:> 11    C3   D5  G1  39.306004  23.168880
#:> 
#:> [200 rows x 5 columns]

11.4 Grouping

mydf.info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 200 entries, 0 to 199
#:> Data columns (total 5 columns):
#:>  #   Column  Non-Null Count  Dtype  
#:> ---  ------  --------------  -----  
#:>  0   comp    200 non-null    object 
#:>  1   dept    200 non-null    object 
#:>  2   grp     200 non-null    object 
#:>  3   value1  200 non-null    float64
#:>  4   value2  200 non-null    float64
#:> dtypes: float64(2), object(3)
#:> memory usage: 7.9+ KB
gdf = mydf >> group_by('comp','dept')
type(gdf)
#:> <class 'plydata.types.GroupedDataFrame'>

11.5 Summarization

11.5.1 Simple Method

Passing Multiple Expressions

gdf >> summarize('n()','sum(value1)','mean(value2)')

11.5.2 Specify Summarized Column Name

Assignment Method
- Passing colName=‘expression’**
- Column name cannot contain special character

gdf >> summarize(count='n()',v1sum='sum(value1)',v2_mean='mean(value2)')

Tuple Method (‘colName,’‘expression’)
Use when the column name contain special character

gdf >> summarize(('count','n()'),('v1.sum','sum(value1)'),('s2.sum','sum(value2)'),v2mean=np.mean(value2))

11.5.3 Number of Rows in Group

  • n() : total rows in group
  • n_unique() : total of rows with unique value
gdf >> summarize(count='n()', va11_unique='n_unique(value1)')