11 Plydata
(dplyr
for Python)
11.1 Sample Data
= 200
n = ['C' + i for i in np.random.randint( 1,4, size = n).astype(str)] # 3x Company
comp = ['D' + i for i in np.random.randint( 1,6, size = n).astype(str)] # 5x Department
dept = ['G' + i for i in np.random.randint( 1,3, size = n).astype(str)] # 2x Groups
grp = np.random.normal( loc=50 , scale=5 , size = n)
value1 = np.random.normal( loc=20 , scale=3 , size = n)
value2 #value3 = np.random.normal( loc=5 , scale=30 , size = n)
= pd.DataFrame({
mydf 'comp':comp,
'dept':dept,
'grp': grp,
'value1':value1,
'value2':value2
#'value3':value3
}) mydf.head()
#:> comp dept grp value1 value2
#:> 0 C1 D5 G2 44.991357 18.376559
#:> 1 C1 D3 G1 42.629061 15.044993
#:> 2 C3 D1 G1 53.803512 18.496985
#:> 3 C2 D3 G1 51.387646 16.152419
#:> 4 C2 D3 G1 52.897552 19.107399
11.2 Column Manipulation
11.2.1 Copy Column
>> define(newcol = 'value1') # simple method for one column mydf
#:> comp dept grp value1 value2 newcol
#:> 0 C1 D5 G2 44.991357 18.376559 44.991357
#:> 1 C1 D3 G1 42.629061 15.044993 42.629061
#:> 2 C3 D1 G1 53.803512 18.496985 53.803512
#:> 3 C2 D3 G1 51.387646 16.152419 51.387646
#:> 4 C2 D3 G1 52.897552 19.107399 52.897552
#:> .. ... ... .. ... ... ...
#:> 195 C3 D3 G1 44.837077 23.383648 44.837077
#:> 196 C2 D2 G2 44.899655 17.546987 44.899655
#:> 197 C1 D1 G2 47.806362 21.936824 47.806362
#:> 198 C1 D5 G1 50.837563 17.498985 50.837563
#:> 199 C1 D4 G1 51.282241 20.841696 51.282241
#:>
#:> [200 rows x 6 columns]
>> define (('newcol1', 'value1'), newcol2='value2') # method for muiltiple new columns mydf
#:> comp dept grp value1 value2 newcol1 newcol2
#:> 0 C1 D5 G2 44.991357 18.376559 44.991357 18.376559
#:> 1 C1 D3 G1 42.629061 15.044993 42.629061 15.044993
#:> 2 C3 D1 G1 53.803512 18.496985 53.803512 18.496985
#:> 3 C2 D3 G1 51.387646 16.152419 51.387646 16.152419
#:> 4 C2 D3 G1 52.897552 19.107399 52.897552 19.107399
#:> .. ... ... .. ... ... ... ...
#:> 195 C3 D3 G1 44.837077 23.383648 44.837077 23.383648
#:> 196 C2 D2 G2 44.899655 17.546987 44.899655 17.546987
#:> 197 C1 D1 G2 47.806362 21.936824 47.806362 21.936824
#:> 198 C1 D5 G1 50.837563 17.498985 50.837563 17.498985
#:> 199 C1 D4 G1 51.282241 20.841696 51.282241 20.841696
#:>
#:> [200 rows x 7 columns]
11.2.2 New Column from existing Column
Without specify the new column name, it will be derived from expression
>> define ('value1*2') mydf
#:> comp dept grp value1 value2 value1*2
#:> 0 C1 D5 G2 44.991357 18.376559 89.982715
#:> 1 C1 D3 G1 42.629061 15.044993 85.258122
#:> 2 C3 D1 G1 53.803512 18.496985 107.607023
#:> 3 C2 D3 G1 51.387646 16.152419 102.775291
#:> 4 C2 D3 G1 52.897552 19.107399 105.795104
#:> .. ... ... .. ... ... ...
#:> 195 C3 D3 G1 44.837077 23.383648 89.674153
#:> 196 C2 D2 G2 44.899655 17.546987 89.799311
#:> 197 C1 D1 G2 47.806362 21.936824 95.612724
#:> 198 C1 D5 G1 50.837563 17.498985 101.675127
#:> 199 C1 D4 G1 51.282241 20.841696 102.564482
#:>
#:> [200 rows x 6 columns]
Specify the new column name
>> define(value3 = 'value1*2') mydf
#:> comp dept grp value1 value2 value3
#:> 0 C1 D5 G2 44.991357 18.376559 89.982715
#:> 1 C1 D3 G1 42.629061 15.044993 85.258122
#:> 2 C3 D1 G1 53.803512 18.496985 107.607023
#:> 3 C2 D3 G1 51.387646 16.152419 102.775291
#:> 4 C2 D3 G1 52.897552 19.107399 105.795104
#:> .. ... ... .. ... ... ...
#:> 195 C3 D3 G1 44.837077 23.383648 89.674153
#:> 196 C2 D2 G2 44.899655 17.546987 89.799311
#:> 197 C1 D1 G2 47.806362 21.936824 95.612724
#:> 198 C1 D5 G1 50.837563 17.498985 101.675127
#:> 199 C1 D4 G1 51.282241 20.841696 102.564482
#:>
#:> [200 rows x 6 columns]
Define multiple new columns in one go. Observe there are three ways to specify the new columns
>> define('value1*2',('newcol2','value2*2'),newcol3='value2*3') mydf
#:> comp dept grp value1 value2 value1*2 newcol2 newcol3
#:> 0 C1 D5 G2 44.991357 18.376559 89.982715 36.753118 55.129677
#:> 1 C1 D3 G1 42.629061 15.044993 85.258122 30.089985 45.134978
#:> 2 C3 D1 G1 53.803512 18.496985 107.607023 36.993970 55.490955
#:> 3 C2 D3 G1 51.387646 16.152419 102.775291 32.304838 48.457257
#:> 4 C2 D3 G1 52.897552 19.107399 105.795104 38.214799 57.322198
#:> .. ... ... .. ... ... ... ... ...
#:> 195 C3 D3 G1 44.837077 23.383648 89.674153 46.767296 70.150944
#:> 196 C2 D2 G2 44.899655 17.546987 89.799311 35.093974 52.640962
#:> 197 C1 D1 G2 47.806362 21.936824 95.612724 43.873649 65.810473
#:> 198 C1 D5 G1 50.837563 17.498985 101.675127 34.997969 52.496954
#:> 199 C1 D4 G1 51.282241 20.841696 102.564482 41.683391 62.525087
#:>
#:> [200 rows x 8 columns]
11.2.3 Select Column(s)
= mydf >> define(newcol1='value1',newcol2='value2')
mydf2 mydf2.info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 200 entries, 0 to 199
#:> Data columns (total 7 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 comp 200 non-null object
#:> 1 dept 200 non-null object
#:> 2 grp 200 non-null object
#:> 3 value1 200 non-null float64
#:> 4 value2 200 non-null float64
#:> 5 newcol1 200 non-null float64
#:> 6 newcol2 200 non-null float64
#:> dtypes: float64(4), object(3)
#:> memory usage: 11.1+ KB
11.2.3.1 By Column Names
Exact Coumn Name
>> select ('comp','dept','value1') mydf2
#:> comp dept value1
#:> 0 C1 D5 44.991357
#:> 1 C1 D3 42.629061
#:> 2 C3 D1 53.803512
#:> 3 C2 D3 51.387646
#:> 4 C2 D3 52.897552
#:> .. ... ... ...
#:> 195 C3 D3 44.837077
#:> 196 C2 D2 44.899655
#:> 197 C1 D1 47.806362
#:> 198 C1 D5 50.837563
#:> 199 C1 D4 51.282241
#:>
#:> [200 rows x 3 columns]
Column Name Starts With …
>> select ('comp', startswith='val') mydf2
#:> comp value1 value2
#:> 0 C1 44.991357 18.376559
#:> 1 C1 42.629061 15.044993
#:> 2 C3 53.803512 18.496985
#:> 3 C2 51.387646 16.152419
#:> 4 C2 52.897552 19.107399
#:> .. ... ... ...
#:> 195 C3 44.837077 23.383648
#:> 196 C2 44.899655 17.546987
#:> 197 C1 47.806362 21.936824
#:> 198 C1 50.837563 17.498985
#:> 199 C1 51.282241 20.841696
#:>
#:> [200 rows x 3 columns]
Column Name Ends With …
>> select ('comp',endswith=('1','2','3')) mydf2
#:> comp value1 value2 newcol1 newcol2
#:> 0 C1 44.991357 18.376559 44.991357 18.376559
#:> 1 C1 42.629061 15.044993 42.629061 15.044993
#:> 2 C3 53.803512 18.496985 53.803512 18.496985
#:> 3 C2 51.387646 16.152419 51.387646 16.152419
#:> 4 C2 52.897552 19.107399 52.897552 19.107399
#:> .. ... ... ... ... ...
#:> 195 C3 44.837077 23.383648 44.837077 23.383648
#:> 196 C2 44.899655 17.546987 44.899655 17.546987
#:> 197 C1 47.806362 21.936824 47.806362 21.936824
#:> 198 C1 50.837563 17.498985 50.837563 17.498985
#:> 199 C1 51.282241 20.841696 51.282241 20.841696
#:>
#:> [200 rows x 5 columns]
Column Name Contains …
>> select('comp', contains=('col','val')) mydf2
#:> comp value1 value2 newcol1 newcol2
#:> 0 C1 44.991357 18.376559 44.991357 18.376559
#:> 1 C1 42.629061 15.044993 42.629061 15.044993
#:> 2 C3 53.803512 18.496985 53.803512 18.496985
#:> 3 C2 51.387646 16.152419 51.387646 16.152419
#:> 4 C2 52.897552 19.107399 52.897552 19.107399
#:> .. ... ... ... ... ...
#:> 195 C3 44.837077 23.383648 44.837077 23.383648
#:> 196 C2 44.899655 17.546987 44.899655 17.546987
#:> 197 C1 47.806362 21.936824 47.806362 21.936824
#:> 198 C1 50.837563 17.498985 50.837563 17.498985
#:> 199 C1 51.282241 20.841696 51.282241 20.841696
#:>
#:> [200 rows x 5 columns]
11.2.3.2 Specify Column Range
>> select ('comp', slice('value1','newcol2')) mydf2
#:> comp value1 value2 newcol1 newcol2
#:> 0 C1 44.991357 18.376559 44.991357 18.376559
#:> 1 C1 42.629061 15.044993 42.629061 15.044993
#:> 2 C3 53.803512 18.496985 53.803512 18.496985
#:> 3 C2 51.387646 16.152419 51.387646 16.152419
#:> 4 C2 52.897552 19.107399 52.897552 19.107399
#:> .. ... ... ... ... ...
#:> 195 C3 44.837077 23.383648 44.837077 23.383648
#:> 196 C2 44.899655 17.546987 44.899655 17.546987
#:> 197 C1 47.806362 21.936824 47.806362 21.936824
#:> 198 C1 50.837563 17.498985 50.837563 17.498985
#:> 199 C1 51.282241 20.841696 51.282241 20.841696
#:>
#:> [200 rows x 5 columns]
11.2.4 Drop Column(s)
>> select('newcol1','newcol2',drop=True) mydf2
#:> comp dept grp value1 value2
#:> 0 C1 D5 G2 44.991357 18.376559
#:> 1 C1 D3 G1 42.629061 15.044993
#:> 2 C3 D1 G1 53.803512 18.496985
#:> 3 C2 D3 G1 51.387646 16.152419
#:> 4 C2 D3 G1 52.897552 19.107399
#:> .. ... ... .. ... ...
#:> 195 C3 D3 G1 44.837077 23.383648
#:> 196 C2 D2 G2 44.899655 17.546987
#:> 197 C1 D1 G2 47.806362 21.936824
#:> 198 C1 D5 G1 50.837563 17.498985
#:> 199 C1 D4 G1 51.282241 20.841696
#:>
#:> [200 rows x 5 columns]
>> rename( {'val.1' : 'value1',
mydf 'val.2' : 'value2' })
#:> comp dept grp val.1 val.2
#:> 0 C1 D5 G2 44.991357 18.376559
#:> 1 C1 D3 G1 42.629061 15.044993
#:> 2 C3 D1 G1 53.803512 18.496985
#:> 3 C2 D3 G1 51.387646 16.152419
#:> 4 C2 D3 G1 52.897552 19.107399
#:> .. ... ... .. ... ...
#:> 195 C3 D3 G1 44.837077 23.383648
#:> 196 C2 D2 G2 44.899655 17.546987
#:> 197 C1 D1 G2 47.806362 21.936824
#:> 198 C1 D5 G1 50.837563 17.498985
#:> 199 C1 D4 G1 51.282241 20.841696
#:>
#:> [200 rows x 5 columns]
Combined Method
Combine both assignment and dictionary method
>> rename( {'val.1' : 'value1',
mydf 'val.2' : 'value2'
= 'grp' ) }, group
#:> comp dept group val.1 val.2
#:> 0 C1 D5 G2 44.991357 18.376559
#:> 1 C1 D3 G1 42.629061 15.044993
#:> 2 C3 D1 G1 53.803512 18.496985
#:> 3 C2 D3 G1 51.387646 16.152419
#:> 4 C2 D3 G1 52.897552 19.107399
#:> .. ... ... ... ... ...
#:> 195 C3 D3 G1 44.837077 23.383648
#:> 196 C2 D2 G2 44.899655 17.546987
#:> 197 C1 D1 G2 47.806362 21.936824
#:> 198 C1 D5 G1 50.837563 17.498985
#:> 199 C1 D4 G1 51.282241 20.841696
#:>
#:> [200 rows x 5 columns]
11.3 Sorting (arrange)
Use ‘-colName’ for decending
>> arrange('comp', '-value1') mydf
#:> comp dept grp value1 value2
#:> 37 C1 D2 G2 61.082781 18.495389
#:> 75 C1 D5 G1 60.071749 17.671009
#:> 148 C1 D5 G2 59.864576 22.915047
#:> 48 C1 D4 G1 59.220959 16.136915
#:> 64 C1 D4 G2 58.752440 17.698505
#:> .. ... ... .. ... ...
#:> 83 C3 D4 G1 42.457564 20.433678
#:> 161 C3 D2 G2 42.343989 18.665869
#:> 109 C3 D1 G1 41.992172 19.105493
#:> 138 C3 D1 G2 41.128355 19.948868
#:> 11 C3 D5 G1 39.306004 23.168880
#:>
#:> [200 rows x 5 columns]
11.4 Grouping
mydf.info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 200 entries, 0 to 199
#:> Data columns (total 5 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 comp 200 non-null object
#:> 1 dept 200 non-null object
#:> 2 grp 200 non-null object
#:> 3 value1 200 non-null float64
#:> 4 value2 200 non-null float64
#:> dtypes: float64(2), object(3)
#:> memory usage: 7.9+ KB
= mydf >> group_by('comp','dept')
gdf type(gdf)
#:> <class 'plydata.types.GroupedDataFrame'>
11.5 Summarization
11.5.1 Simple Method
Passing Multiple Expressions
>> summarize('n()','sum(value1)','mean(value2)') gdf
11.5.2 Specify Summarized Column Name
Assignment Method
- Passing colName=‘expression’**
- Column name cannot contain special character
>> summarize(count='n()',v1sum='sum(value1)',v2_mean='mean(value2)') gdf
Tuple Method (‘colName,’‘expression’)
Use when the column name contain special character
>> summarize(('count','n()'),('v1.sum','sum(value1)'),('s2.sum','sum(value2)'),v2mean=np.mean(value2)) gdf