13 pandas
13.1 Module Imports
import pandas as pd
## Other Libraries
import numpy as np
import datetime as dt
from datetime import datetime
from datetime import date
13.2 Pandas Objects
13.2.1 Pandas Data Types
- pandas.Timestamp
- pandas.Timedelta
- pandas.Period
- pandas.Interval
- pandas.DatetimeIndex
13.2.2 Pandas Data Structure
Type | Dimension | Size | Value | Constructor |
---|---|---|---|---|
Series | 1 | Immutable | Mutable | pandas.Series( data, index, dtype, copy) |
DataFrame | 2 | Mutable | Mutable | pandas.DataFrame( data, index, columns, dtype, copy) |
Panel | 3 | Mutable | Mutable | (removed in pandas 0.25) |
- data: can be an ndarray, list, or constants
- index: must be unique and the same length as data; can be integer or string
- dtype: if None, it will be inferred
- copy: copy the data; default False
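A minimal sketch of both constructors using the parameters above (the values are illustrative, not from the source):
s  = pd.Series([1, 2, 3], index=['a','b','c'], dtype='int64')     # 1-D labelled array
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1','r2'])  # 2-D labelled table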
13.3 Class Method
13.3.1 Creating Timestamp Objects
Pandas to_datetime() can:
- Convert a list of dates to a DatetimeIndex
- Convert a list of dates to a Series of Timestamps
- Convert a single date into a Timestamp object
The source can be a string, date, or datetime object.
13.3.1.1 From List to DateTimeIndex
dti = pd.to_datetime(['2011-01-03',            # from string
                      date(2018,4,13),         # from date
                      datetime(2018,3,1,7,30)] # from datetime
                     )
print( dti,
       '\nObject Type: ', type(dti),
       '\nObject dtype: ', dti.dtype,
       '\nElement Type: ', type(dti[1]))
#:> DatetimeIndex(['2011-01-03 00:00:00', '2018-04-13 00:00:00', '2018-03-01 07:30:00'], dtype='datetime64[ns]', freq=None)
#:> Object Type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
#:> Object dtype: datetime64[ns]
#:> Element Type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
13.3.1.2 From List to Series of Timestamps
sdt = pd.to_datetime(pd.Series(['2011-01-03',            # from string
                                date(2018,4,13),         # from date
                                datetime(2018,3,1,7,30)] # from datetime
                               ))
print(sdt,
      '\nObject Type: ', type(sdt),
      '\nObject dtype: ', sdt.dtype,
      '\nElement Type: ', type(sdt[1]))
#:> 0 2011-01-03 00:00:00
#:> 1 2018-04-13 00:00:00
#:> 2 2018-03-01 07:30:00
#:> dtype: datetime64[ns]
#:> Object Type: <class 'pandas.core.series.Series'>
#:> Object dtype: datetime64[ns]
#:> Element Type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
13.3.1.3 From Scalar to Timestamp
print( pd.to_datetime('2011-01-03'), '\n',
       pd.to_datetime(date(2011,1,3)), '\n',
       pd.to_datetime(datetime(2011,1,3,5,30)), '\n',
       '\nElement Type: ', type(pd.to_datetime(datetime(2011,1,3,5,30))))
#:> 2011-01-03 00:00:00
#:> 2011-01-03 00:00:00
#:> 2011-01-03 05:30:00
#:>
#:> Element Type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
13.3.2 Generate Timestamp Sequence
The function date_range() returns a DatetimeIndex object. Use pd.Series() to convert it into a Series if desired.
13.3.2.1 Hourly
If the start time is not specified, it defaults to 00:00:00.
If a start time is specified, it is honored on all subsequent Timestamp elements.
If both start and end are specified, the sequence automatically distributes Timestamps according to the frequency.
print(
  pd.date_range('2018-01-01', periods=3, freq='H'),
  pd.date_range(datetime(2018,1,1,12,30), periods=3, freq='H'),
  pd.date_range(start='2018-01-03-1230', end='2018-01-03-18:30', freq='H'))
#:> DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00'], dtype='datetime64[ns]', freq='H') DatetimeIndex(['2018-01-01 12:30:00', '2018-01-01 13:30:00', '2018-01-01 14:30:00'], dtype='datetime64[ns]', freq='H') DatetimeIndex(['2018-01-03 12:30:00', '2018-01-03 13:30:00', '2018-01-03 14:30:00',
#:> '2018-01-03 15:30:00', '2018-01-03 16:30:00', '2018-01-03 17:30:00',
#:> '2018-01-03 18:30:00'],
#:> dtype='datetime64[ns]', freq='H')
13.3.2.2 Daily
When the frequency is daily and no time is specified, the output contains dates only.
When a time is specified, the output honors that time.
print(
  pd.date_range(date(2018,1,2), periods=3, freq='D'),
  pd.date_range('2018-01-01-1230', periods=4, freq='D'))
#:> DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D') DatetimeIndex(['2018-01-01 12:30:00', '2018-01-02 12:30:00', '2018-01-03 12:30:00',
#:> '2018-01-04 12:30:00'],
#:> dtype='datetime64[ns]', freq='D')
13.3.2.3 First Day Of Month
Use freq='MS': M stands for monthly, S stands for start. If a day is specified, the sequence starts from the first day of the following month.
print(
  pd.date_range('2018-01', periods=4, freq='MS'),
  pd.date_range('2018-01-09', periods=4, freq='MS'),
  pd.date_range('2018-01-09 12:30:00', periods=4, freq='MS') )
#:> DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'], dtype='datetime64[ns]', freq='MS') DatetimeIndex(['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01'], dtype='datetime64[ns]', freq='MS') DatetimeIndex(['2018-02-01 12:30:00', '2018-03-01 12:30:00', '2018-04-01 12:30:00',
#:> '2018-05-01 12:30:00'],
#:> dtype='datetime64[ns]', freq='MS')
13.3.2.4 Last Day of Month
Sequence always starts from the end of the specified month.
print(
  pd.date_range('2018-01', periods=4, freq='M'),
  pd.date_range('2018-01-09', periods=4, freq='M'),
  pd.date_range('2018-01-09 12:30:00', periods=4, freq='M'))
#:> DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M') DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M') DatetimeIndex(['2018-01-31 12:30:00', '2018-02-28 12:30:00', '2018-03-31 12:30:00',
#:> '2018-04-30 12:30:00'],
#:> dtype='datetime64[ns]', freq='M')
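Other frequency aliases follow the same pattern; a short sketch assuming the common aliases 'W' (weekly, anchored on Sunday) and 'Q' (quarter end):
print(
  pd.date_range('2018-01-01', periods=3, freq='W'),   ## weekly, anchored on Sunday
  pd.date_range('2018-01-01', periods=3, freq='Q'))   ## quarter end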
13.3.3 Frequency Table (crosstab)
crosstab returns a DataFrame object.
crosstab( index = <SeriesObj>, columns = <new_colName> ) # one dimension table
crosstab( index = <SeriesObj>, columns = <SeriesObj> ) # two dimension table
crosstab( index = <SeriesObj>, columns = [<SeriesObj1>, <SeriesObj2>] ) # multi dimension table
crosstab( index = <SeriesObj>, columns = <SeriesObj>, margins=True )    # add column and row margins
13.3.3.1 Sample Data
n = 200
comp = ['C' + i for i in np.random.randint( 1,4, size = n).astype(str)] # 3x Company
dept = ['D' + i for i in np.random.randint( 1,6, size = n).astype(str)] # 5x Department
grp  = ['G' + i for i in np.random.randint( 1,3, size = n).astype(str)] # 2x Groups
value1 = np.random.normal( loc=50 , scale=5 , size = n)
value2 = np.random.normal( loc=20 , scale=3 , size = n)
value3 = np.random.normal( loc=5  , scale=30, size = n)

mydf = pd.DataFrame({
         'comp': comp,
         'dept': dept,
         'grp': grp,
         'value1': value1,
         'value2': value2,
         'value3': value3 })
mydf.head()
#:> comp dept grp value1 value2 value3
#:> 0 C2 D4 G2 51.276501 23.944060 -19.453141
#:> 1 C3 D5 G1 48.241634 19.823981 7.926569
#:> 2 C1 D2 G2 52.025269 20.216081 -40.035628
#:> 3 C2 D1 G1 55.582803 19.893167 40.971304
#:> 4 C3 D5 G1 54.443123 23.286554 12.842095
13.3.3.2 One Dimension Table
## Frequency Count For Company, Department
print(
  pd.crosstab(index=mydf.comp, columns='counter'), '\n\n',
  pd.crosstab(index=mydf.dept, columns='counter'))
#:> col_0 counter
#:> comp
#:> C1 72
#:> C2 58
#:> C3 70
#:>
#:> col_0 counter
#:> dept
#:> D1 41
#:> D2 41
#:> D3 49
#:> D4 44
#:> D5 25
13.3.3.3 Two Dimension Table
pd.crosstab(index=mydf.comp, columns=mydf.dept)
#:> dept D1 D2 D3 D4 D5
#:> comp
#:> C1 11 15 17 19 10
#:> C2 16 14 15 10 3
#:> C3 14 12 17 15 12
13.3.3.4 Higher Dimension Table
The crosstab header is a multi-level index when more than one column is specified.
tb = pd.crosstab(index=mydf.comp, columns=[mydf.dept, mydf.grp])
print( tb, '\n\n',
       tb.columns )
#:> dept D1 D2 D3 D4 D5
#:> grp G1 G2 G1 G2 G1 G2 G1 G2 G1 G2
#:> comp
#:> C1 8 3 9 6 8 9 8 11 6 4
#:> C2 11 5 5 9 5 10 4 6 0 3
#:> C3 8 6 6 6 9 8 10 5 7 5
#:>
#:> MultiIndex([('D1', 'G1'),
#:> ('D1', 'G2'),
#:> ('D2', 'G1'),
#:> ('D2', 'G2'),
#:> ('D3', 'G1'),
#:> ('D3', 'G2'),
#:> ('D4', 'G1'),
#:> ('D4', 'G2'),
#:> ('D5', 'G1'),
#:> ('D5', 'G2')],
#:> names=['dept', 'grp'])
Select sub-dataframe using multi-level referencing.
print( 'Under D2:\n', tb['D2'], '\n\n',
       'Under D2-G1:\n', tb['D2','G1'])
#:> Under D2:
#:> grp G1 G2
#:> comp
#:> C1 9 6
#:> C2 5 9
#:> C3 6 6
#:>
#:> Under D2-G1:
#:> comp
#:> C1 9
#:> C2 5
#:> C3 6
#:> Name: (D2, G1), dtype: int64
13.3.3.5 Getting Margin
Extend the crosstab with margins=True to get sums of rows/columns, presented in a new column/row named 'All'.
tb = pd.crosstab(index=mydf.dept, columns=mydf.grp, margins=True)
tb
#:> grp G1 G2 All
#:> dept
#:> D1 27 14 41
#:> D2 20 21 41
#:> D3 22 27 49
#:> D4 22 22 44
#:> D5 13 12 25
#:> All 104 96 200
print(
'Row Sums: \n', tb.loc[:,'All'],
'\n\nColumn Sums:\n', tb.loc['All'])
#:> Row Sums:
#:> dept
#:> D1 41
#:> D2 41
#:> D3 49
#:> D4 44
#:> D5 25
#:> All 200
#:> Name: All, dtype: int64
#:>
#:> Column Sums:
#:> grp
#:> G1 104
#:> G2 96
#:> All 200
#:> Name: All, dtype: int64
13.3.3.6 Getting Proportion
Use a matrix operation to divide each row by its respective column sum.
tb/tb.loc['All']
#:> grp G1 G2 All
#:> dept
#:> D1 0.259615 0.145833 0.205
#:> D2 0.192308 0.218750 0.205
#:> D3 0.211538 0.281250 0.245
#:> D4 0.211538 0.229167 0.220
#:> D5 0.125000 0.125000 0.125
#:> All 1.000000 1.000000 1.000
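crosstab can also compute proportions directly through its normalize parameter; a minimal sketch that should match the column-wise division above:
pd.crosstab(index=mydf.dept, columns=mydf.grp, normalize='columns')   ## proportions per column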
13.3.4 Concatenation
13.3.4.1 Sample Data
s1 = pd.Series(['A1','A2','A3','A4'])
s2 = pd.Series(['B1','B2','B3','B4'], name='B')
s3 = pd.Series(['C1','C2','C3','C4'], name='C')
13.3.4.2 Column-Wise
Combining Multiple Series Into A New DataFrame
- Added series will have 0,1,2,… column names (if the Series were not named originally)
- None entries are ignored
- axis=1 means column-wise
pd.concat([s1,s2,s3,None], axis=1)
#:> 0 B C
#:> 0 A1 B1 C1
#:> 1 A2 B2 C2
#:> 2 A3 B3 C3
#:> 3 A4 B4 C4
Add Multiple Series Into An Existing DataFrame
- No change to the original data frame column names
- Columns added from unnamed series will have 0,1,2,3,… column names
df = pd.DataFrame({ 'A': s1, 'B': s2})
pd.concat([df,s3,s1,None], axis=1)
#:> A B C 0
#:> 0 A1 B1 C1 A1
#:> 1 A2 B2 C2 A2
#:> 2 A3 B3 C3 A3
#:> 3 A4 B4 C4 A4
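Row-Wise: with axis=0 (the default) concat stacks the values instead of placing them side by side; a short sketch:
pd.concat([s1, s2], ignore_index=True)   ## stack s1 on top of s2, renumber index 0..7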
13.3.5 External Data
13.3.5.1 html_table Parser
This method requires the html5lib library.
- Reads the web page and creates a list containing one or more dataframes that map to each html table found
- Scrapes all detectable html tables
- Auto detects column headers
- Auto creates an index using numbers starting from 0
read_html(url)    # return list of dataframe(s) that maps to web table(s) structure
df_list = pd.read_html('https://www.malaysiastock.biz/Listed-Companies.aspx?type=S&s1=18') ## read all tables
df = df_list[6]                                                                            ## get the specific table

print ('Total Table(s) Found : ', len(df_list), '\n',
       'First Table Found: ', df)
#:> Total Table(s) Found : 11
#:> First Table Found: 0 1
#:> 0 Sector: --- Filter by Sector --- BOND ISLAMIC CLOSED...
13.3.5.2 CSV Writing
Syntax
DataFrame.to_csv(
path_or_buf=None, ## if not provided, result is returned as string
sep=',',
na_rep='',
float_format=None,
columns=None, ## list of columns name to write, if not provided, all columns are written
header=True, ## write out column names
index=True, ## write row label
index_label=None,
mode='w',
encoding=None, ## if not provided, default to 'utf-8'
quoting=None, quotechar='"',
line_terminator=None,
chunksize=None,
date_format=None,
doublequote=True,
escapechar=None,
decimal='.')
The example below shows column values containing different special characters. Note that pandas handles these very well by default.
mydf = pd.DataFrame({'Id': [10,20,30,40],
                     'Name': ['Aaa','Bbb','Ccc','Ddd'],
                     'Funny': ["world's most \clever",
                               "Bloody, damn, good",
                               "many\nmany\nline",
                               'Quoting "is" tough']})
mydf.set_index('Id', inplace=True)
mydf.to_csv('data/csv_test.csv', index=True)
mydf
#:> Name Funny
#:> Id
#:> 10 Aaa world's most \clever
#:> 20 Bbb Bloody, damn, good
#:> 30 Ccc many\nmany\nline
#:> 40 Ddd Quoting "is" tough
This is the file saved
# system('more data\\csv_test.csv')
All content is retained when read back by pandas.
pd.read_csv('data/csv_test.csv', index_col='Id')
#:> Name Funny
#:> Id
#:> 10 Aaa world's most \clever
#:> 20 Bbb Bloody, damn, good
#:> 30 Ccc many\nmany\nline
#:> 40 Ddd Quoting "is" tough
13.3.5.3 CSV Reading
Syntax
pandas.read_csv(
  'url or filePath',                    # path to file or url
  encoding    = 'utf_8',                # optional: default is 'utf_8'
  index_col   = ['colName1', ...],      # optional: specify one or more index columns
  parse_dates = ['dateCol1', ...],      # optional: specify string column(s) to convert to date
  na_values   = ['.','na','NA','N/A'],  # optional: values that are considered NA
  names       = ['newColName1', ... ],  # optional: overwrite column names
  thousands   = '.',                    # optional: thousands separator symbol
  nrows       = n,                      # optional: load only first n rows
  skiprows    = 0,                      # optional: skip the first n rows
  infer_datetime_format = False         # optional: automatically infer the date format
)
Refer to the full codec list: Python Codecs.
Default Import
- the index is a sequence of integers 0,1,2,…
- only two data types are detected: number (float64/int64) and string (object)
- dates are not parsed, hence stay as strings
goo = pd.read_csv('data/goog.csv', encoding='utf_8')
print(goo.head(), '\n\n',
      goo.info())
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 61 entries, 0 to 60
#:> Data columns (total 6 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 Date 61 non-null object
#:> 1 Open 61 non-null float64
#:> 2 High 61 non-null float64
#:> 3 Low 61 non-null float64
#:> 4 Close 61 non-null float64
#:> 5 Volume 61 non-null int64
#:> dtypes: float64(4), int64(1), object(1)
#:> memory usage: 3.0+ KB
#:> Date Open High Low Close Volume
#:> 0 12/19/2016 790.219971 797.659973 786.270020 794.200012 1225900
#:> 1 12/20/2016 796.760010 798.650024 793.270020 796.419983 925100
#:> 2 12/21/2016 795.840027 796.676025 787.099976 794.559998 1208700
#:> 3 12/22/2016 792.359985 793.320007 788.580017 791.260010 969100
#:> 4 12/23/2016 790.900024 792.739990 787.280029 789.909973 623400
#:>
#:> None
Specify Data Types
- To customize the data types, use the dtype parameter with a dict of definitions.
d_types = {'Volume': str}
pd.read_csv('data/goog.csv', dtype=d_types).info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 61 entries, 0 to 60
#:> Data columns (total 6 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 Date 61 non-null object
#:> 1 Open 61 non-null float64
#:> 2 High 61 non-null float64
#:> 3 Low 61 non-null float64
#:> 4 Close 61 non-null float64
#:> 5 Volume 61 non-null object
#:> dtypes: float64(4), object(2)
#:> memory usage: 3.0+ KB
Parse Datetime
You can specify multiple date-like columns for parsing.
pd.read_csv('data/goog.csv', parse_dates=['Date']).info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 61 entries, 0 to 60
#:> Data columns (total 6 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 Date 61 non-null datetime64[ns]
#:> 1 Open 61 non-null float64
#:> 2 High 61 non-null float64
#:> 3 Low 61 non-null float64
#:> 4 Close 61 non-null float64
#:> 5 Volume 61 non-null int64
#:> dtypes: datetime64[ns](1), float64(4), int64(1)
#:> memory usage: 3.0 KB
Parse Datetime, Then Set as Index
- Specify the names of date columns in parse_dates=
- When a date is set as the index, its type is DatetimeIndex
goo3 = pd.read_csv('data/goog.csv', index_col='Date', parse_dates=['Date'])
goo3.info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> DatetimeIndex: 61 entries, 2016-12-19 to 2017-03-17
#:> Data columns (total 5 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 Open 61 non-null float64
#:> 1 High 61 non-null float64
#:> 2 Low 61 non-null float64
#:> 3 Close 61 non-null float64
#:> 4 Volume 61 non-null int64
#:> dtypes: float64(4), int64(1)
#:> memory usage: 2.9 KB
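read_csv can also limit what is loaded; a short sketch assuming the same data/goog.csv file:
pd.read_csv('data/goog.csv', nrows=5, usecols=['Date','Close'])   ## first 5 rows, two columns only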
13.3.6 Inspection
13.3.6.1 Structure info
info() prints information to the screen; it doesn't return any object.
dataframe.info()    # display columns and the number of rows that have no missing data
goo.info()
#:> <class 'pandas.core.frame.DataFrame'>
#:> RangeIndex: 61 entries, 0 to 60
#:> Data columns (total 6 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 Date 61 non-null object
#:> 1 Open 61 non-null float64
#:> 2 High 61 non-null float64
#:> 3 Low 61 non-null float64
#:> 4 Close 61 non-null float64
#:> 5 Volume 61 non-null int64
#:> dtypes: float64(4), int64(1), object(1)
#:> memory usage: 3.0+ KB
13.3.6.2 head
goo.head()
#:> Date Open High Low Close Volume
#:> 0 12/19/2016 790.219971 797.659973 786.270020 794.200012 1225900
#:> 1 12/20/2016 796.760010 798.650024 793.270020 796.419983 925100
#:> 2 12/21/2016 795.840027 796.676025 787.099976 794.559998 1208700
#:> 3 12/22/2016 792.359985 793.320007 788.580017 791.260010 969100
#:> 4 12/23/2016 790.900024 792.739990 787.280029 789.909973 623400
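Other quick inspections (a hedged sketch; both are standard DataFrame methods):
goo.tail(3)       ## last 3 rows
goo.describe()    ## summary statistics of numeric columns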
13.4 class: Timestamp
This is an enhanced version of the standard library datetime.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp
13.4.1 Constructor
13.4.1.1 From Number
print( pd.Timestamp(year=2017, month=1, day=1),'\n',   # date-like numbers
       pd.Timestamp(2017,1,1),'\n',                    # date-like numbers
       pd.Timestamp(2017,12,11,5,45),'\n',             # datetime-like numbers
       pd.Timestamp(2017,12,11,5,45,55,999),'\n',      # + microseconds
       pd.Timestamp(2017,12,11,5,45,55,999,8),'\n',    # + nanoseconds
       type(pd.Timestamp(2017,12,11,5,45,55,999,8)))
#:> 2017-01-01 00:00:00
#:> 2017-01-01 00:00:00
#:> 2017-12-11 05:45:00
#:> 2017-12-11 05:45:55.000999
#:> 2017-12-11 05:45:55.000999008
#:> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
13.4.1.2 From String
Observe that pandas supports many string input formats.
Year Month Day; by default there is no timezone.
print( pd.Timestamp('2017-12-11'),'\n',    # date-like string: year-month-day
       pd.Timestamp('2017 12 11'),'\n',    # date-like string: year-month-day
       pd.Timestamp('2017 Dec 11'),'\n',   # date-like string: year-month-day
       pd.Timestamp('Dec 11, 2017'))       # date-like string: year-month-day
#:> 2017-12-11 00:00:00
#:> 2017-12-11 00:00:00
#:> 2017-12-11 00:00:00
#:> 2017-12-11 00:00:00
YMD Hour Minute Second Ms
print( pd.Timestamp('2017-12-11 0545'),'\n',    ## hour minute
       pd.Timestamp('2017-12-11-05:45'),'\n',
       pd.Timestamp('2017-12-11T0545'),'\n',
       pd.Timestamp('2017-12-11 054533'),'\n',  ## hour minute seconds
       pd.Timestamp('2017-12-11 05:45:33'))
#:> 2017-12-11 05:45:00
#:> 2017-12-11 05:45:00
#:> 2017-12-11 05:45:00
#:> 2017-12-11 05:45:33
#:> 2017-12-11 05:45:33
A timezone can be included in various ways.
print( pd.Timestamp('2017-01-01T0545Z'),'\n',                    # GMT
       pd.Timestamp('2017-01-01T0545+9'),'\n',                   # GMT+9
       pd.Timestamp('2017-01-01T0545+0800'),'\n',                # GMT+0800
       pd.Timestamp('2017-01-01 0545', tz='Asia/Singapore'),'\n')
#:> 2017-01-01 05:45:00+00:00
#:> 2017-01-01 05:45:00+09:00
#:> 2017-01-01 05:45:00+08:00
#:> 2017-01-01 05:45:00+08:00
13.4.1.3 From Standard Library datetime and date Objects
print( pd.Timestamp(date(2017,3,5)),'\n',                             # from date
       pd.Timestamp(datetime(2017,3,5,4,30)),'\n',                    # from datetime
       pd.Timestamp(datetime(2017,3,5,4,30), tz='Asia/Kuala_Lumpur')) # from datetime, + tz
#:> 2017-03-05 00:00:00
#:> 2017-03-05 04:30:00
#:> 2017-03-05 04:30:00+08:00
13.4.2 Attributes
We can tell many things about a Timestamp object.
ts = pd.Timestamp('2017-01-01T054533+0800')  # GMT+0800
print( ts.month, '\n',
       ts.day, '\n',
       ts.year, '\n',
       ts.hour, '\n',
       ts.minute, '\n',
       ts.second, '\n',
       ts.microsecond, '\n',
       ts.nanosecond, '\n',
       ts.tz, '\n',
       ts.daysinmonth, '\n',
       ts.dayofyear, '\n',
       ts.is_leap_year, '\n',
       ts.is_month_end, '\n',
       ts.is_month_start, '\n',
       ts.dayofweek)
#:> 1
#:> 1
#:> 2017
#:> 5
#:> 45
#:> 33
#:> 0
#:> 0
#:> pytz.FixedOffset(480)
#:> 31
#:> 1
#:> False
#:> False
#:> True
#:> 6
Note that timezone (tz) is a pytz object.
ts1 = pd.Timestamp(datetime(2017,3,5,4,30), tz='Asia/Kuala_Lumpur') # from datetime, + tz
ts2 = pd.Timestamp('2017-01-01T054533+0800')                        # GMT+0800
ts3 = pd.Timestamp('2017-01-01T0545')

print( ts1.tz, 'Type:', type(ts1.tz), '\n',
       ts2.tz, 'Type:', type(ts2.tz), '\n',
       ts3.tz, 'Type:', type(ts3.tz) )
#:> Asia/Kuala_Lumpur Type: <class 'pytz.tzfile.Asia/Kuala_Lumpur'>
#:> pytz.FixedOffset(480) Type: <class 'pytz._FixedOffset'>
#:> None Type: <class 'NoneType'>
13.4.3 Instance Methods
13.4.3.1 Attribute-like Methods
ts = pd.Timestamp(2017,1,1)
print( ' Weekday: ', ts.weekday(), '\n',
       'ISO Weekday:', ts.isoweekday(), '\n',
       'Day Name: ', ts.day_name(), '\n',
       'ISO Calendar:', ts.isocalendar()
     )
#:> Weekday: 6
#:> ISO Weekday: 7
#:> Day Name: Sunday
#:> ISO Calendar: (2016, 52, 7)
13.4.3.2 Timezones
Adding Timezones and Clock Shifting
- tz_localize will add the timezone, but will not shift the clock.
- Once a timestamp has a timezone, you can easily shift the clock to another timezone using tz_convert().
ts  = pd.Timestamp(2017,1,10,10,34)        ## No timezone
ts1 = ts.tz_localize('Asia/Kuala_Lumpur')  ## Add timezone
ts2 = ts1.tz_convert('UTC')                ## Convert timezone
print(' Original Timestamp :', ts, '\n',
      'Localized Timestamp (added TZ):', ts1, '\n',
      'Converted Timestamp (shifted) :', ts2)
#:>  Original Timestamp : 2017-01-10 10:34:00
#:>  Localized Timestamp (added TZ): 2017-01-10 10:34:00+08:00
#:> Converted Timestamp (shifted) : 2017-01-10 02:34:00+00:00
Removing Timezone
Just apply None with tz_localize to remove the TZ information.
ts = pd.Timestamp(2017,1,10,10,34)        ## No timezone
ts = ts.tz_localize('Asia/Kuala_Lumpur')  ## Add timezone
ts = ts.tz_localize(None)                 ## Remove timezone
ts
#:> Timestamp('2017-01-10 10:34:00')
13.4.3.3 Formatting
strftime
Use strftime() to customize the string format. For the complete directives, see: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
ts = pd.Timestamp(2017,1,10,10,34)        ## No timezone
ts = ts.tz_localize('Asia/Kuala_Lumpur')  ## Add timezone
ts.strftime("%m/%d")
#:> '01/10'
isoformat
Use isoformat() to produce an ISO-format string (with or without timezone).
ts  = pd.Timestamp(2017,1,10,10,34)
ts1 = ts.tz_localize('Asia/Kuala_Lumpur')
print( ' ISO Format without TZ:', ts.isoformat(), '\n',
       'ISO Format with TZ :', ts1.isoformat())
#:> ISO Format without TZ: 2017-01-10T10:34:00
#:> ISO Format with TZ : 2017-01-10T10:34:00+08:00
13.4.3.4 Type Conversion
Convert To datetime.datetime/date
Use to_pydatetime() to convert into the standard library datetime.datetime. From the datetime object, apply date() to get datetime.date.
ts = pd.Timestamp(2017,1,10,7,30,52)
print(
  'Datetime: ', ts.to_pydatetime(), '\n',
  'Date Only:', ts.to_pydatetime().date())
#:> Datetime: 2017-01-10 07:30:52
#:> Date Only: 2017-01-10
Convert To numpy.datetime64
Use to_datetime64() to convert into numpy.datetime64.
ts = pd.Timestamp(2017,1,10,7,30,52)
ts.to_datetime64()
#:> numpy.datetime64('2017-01-10T07:30:52.000000000')
13.5 class: DateTimeIndex
13.5.2 Instance Method
13.5.2.1 Data Type Conversion
Convert To datetime.datetime
Use to_pydatetime() to convert into a numpy array of python standard datetime.datetime objects.
print('Converted to List:', dti.to_pydatetime(), '\n\n',
'Converted Type:', type(dti.to_pydatetime()))
#:> Converted to List: [datetime.datetime(2011, 1, 3, 0, 0) datetime.datetime(2018, 4, 13, 0, 0)
#:> datetime.datetime(2018, 3, 1, 7, 30)]
#:>
#:> Converted Type: <class 'numpy.ndarray'>
13.5.2.2 Structure Conversion
Convert To Series: to_series()
This creates a Series where the index and the data hold the same values.
#dti = pd.date_range('2018-02', periods=4, freq='M')
dti.to_series()
#:> 2011-01-03 00:00:00 2011-01-03 00:00:00
#:> 2018-04-13 00:00:00 2018-04-13 00:00:00
#:> 2018-03-01 07:30:00 2018-03-01 07:30:00
#:> dtype: datetime64[ns]
Convert To DataFrame: to_frame()
This converts to a single-column DataFrame, with the index holding the same values.
dti.to_frame()
#:> 0
#:> 2011-01-03 00:00:00 2011-01-03 00:00:00
#:> 2018-04-13 00:00:00 2018-04-13 00:00:00
#:> 2018-03-01 07:30:00 2018-03-01 07:30:00
13.6 class: Series
A Series allows different data types (object class) as its elements.
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
- data: array-like, iterable, dict, or scalar
- If dtype is not specified, it will be inferred from data.
13.6.1 Constructor
13.6.1.1 Empty Series
Passing no data to the constructor results in an empty series. By default, an empty series' dtype is float.
s = pd.Series(dtype='object')
print (s, '\n',
       type(s))
#:> Series([], dtype: object)
#:> <class 'pandas.core.series.Series'>
13.6.1.2 From Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.
pd.Series(99, index = ['a','b','c','d'])
#:> a 99
#:> b 99
#:> c 99
#:> d 99
#:> dtype: int64
13.6.1.3 From array-like
From list
pd.Series(['a','b','c','d','e'])   # from Python list
#:> 0 a
#:> 1 b
#:> 2 c
#:> 3 d
#:> 4 e
#:> dtype: object
From numpy.array
If index is not specified, it defaults to 0 and continues incrementally.
pd.Series(np.array(['a','b','c','d','e']))
#:> 0 a
#:> 1 b
#:> 2 c
#:> 3 d
#:> 4 e
#:> dtype: object
From DateTimeIndex
pd.Series(pd.date_range('2011-1-1','2011-1-3'))
#:> 0 2011-01-01
#:> 1 2011-01-02
#:> 2 2011-01-03
#:> dtype: datetime64[ns]
13.6.1.4 From Dictionary
The dictionary keys become the index. The order is not sorted.
pd.Series({'a' : 0., 'c' : 5., 'b' : 2.})
#:> a 0.0
#:> c 5.0
#:> b 2.0
#:> dtype: float64
If an index sequence is specified, the Series follows that index order.
Observe that missing data (an index without a value) is marked as NaN.
pd.Series({'a' : 0., 'c' : 1., 'b' : 2.}, index = ['a','b','c','d'])
#:> a 0.0
#:> b 2.0
#:> c 1.0
#:> d NaN
#:> dtype: float64
13.6.1.5 Specify Index
pd.Series(['a','b','c','d','e'], index=[10,20,30,40,50])
#:> 10 a
#:> 20 b
#:> 30 c
#:> 40 d
#:> 50 e
#:> dtype: object
13.6.1.6 Mix Element Types
The dtype will be 'object' when there is a mixture of classes.
ser = pd.Series(['a',1,2,3])
print('Object Type : ', type(ser),'\n',
      'Object dType: ', ser.dtype,'\n',
      'Element 1 Type: ', type(ser[0]),'\n',
      'Element 2 Type: ', type(ser[1]))
#:> Object Type : <class 'pandas.core.series.Series'>
#:> Object dType: object
#:> Element 1 Type: <class 'str'>
#:> Element 2 Type: <class 'int'>
13.6.1.7 Specify Data Types
By default, dtype is inferred from data.
ser1 = pd.Series([1,2,3])
ser2 = pd.Series([1,2,3], dtype="int8")
ser3 = pd.Series([1,2,3], dtype="object")

print(' Inferred: ', ser1.dtype, '\n',
      'Specified int8: ', ser2.dtype, '\n',
      'Specified object:', ser3.dtype)
#:> Inferred: int64
#:> Specified int8: int8
#:> Specified object: object
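To change the dtype after creation, astype() returns a converted copy; a minimal sketch:
ser1.astype('float64')   ## int64 -> float64, returns a new Series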
13.6.2 Accessing Series
series ( single/list/range_of_row_label/number ) # can cause confusion
series.loc ( single/list/range_of_row_label )
series.iloc( single/list/range_of_row_number )
13.6.2.1 Sample Data
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
s
#:> a 1
#:> b 2
#:> c 3
#:> d 4
#:> e 5
#:> dtype: int64
13.6.2.2 by Row Number(s)
Single Item. Notice that passing a number and a list of numbers give different results.
print( 'Referencing by number:', s.iloc[1], '\n\n',
       '\nReferencing by list of number:\n', s.iloc[[1]])
#:> Referencing by number: 2
#:>
#:>
#:> Referencing by list of number:
#:> b 2
#:> dtype: int64
Multiple Items
s.iloc[[1,3]]
#:> b 2
#:> d 4
#:> dtype: int64
Range (First 3)
s.iloc[:3]
#:> a 1
#:> b 2
#:> c 3
#:> dtype: int64
Range (Last 3)
s.iloc[-3:]
#:> c 3
#:> d 4
#:> e 5
#:> dtype: int64
Range (in between)
s.iloc[2:3]
#:> c 3
#:> dtype: int64
13.6.2.3 by Index(es)
Single Label. Notice the difference in referencing input: a single index versus a list of indexes.
Warning: if the index is invalid, this results in an error.
print( s.loc['c'], '\n',
       s[['c']])
#:> 3
#:> c 3
#:> dtype: int64
Multiple Labels
In older pandas, a label that was not found returned NaN; newer pandas no longer supports missing labels in a list and raises a KeyError.
# error: missing labels no longer supported
s.loc[['k','c']]
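A hedged alternative that reproduces the old NaN behaviour is reindex():
s.reindex(['k','c'])   ## missing label 'k' becomes NaN instead of raising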
Range of Labels
s.loc['b':'d']
#:> b 2
#:> c 3
#:> d 4
#:> dtype: int64
13.6.2.4 Filtering
Use a logical array to filter.
s = pd.Series(range(1,8))
s[s<5]
#:> 0 1
#:> 1 2
#:> 2 3
#:> 3 4
#:> dtype: int64
Use where
The where method is an application of the if-then idiom. For each element in the calling Series, if cond is True the element is used; otherwise other is used.
.where(cond, other=nan, inplace=False)
print(s.where(s<4), '\n\n',
      s.where(s<4, other=None) )
#:> 0 1.0
#:> 1 2.0
#:> 2 3.0
#:> 3 NaN
#:> 4 NaN
#:> 5 NaN
#:> 6 NaN
#:> dtype: float64
#:>
#:> 0 1
#:> 1 2
#:> 2 3
#:> 3 None
#:> 4 None
#:> 5 None
#:> 6 None
#:> dtype: object
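The inverse of where is mask(): elements where the condition is True are replaced; a short sketch:
s.mask(s<4)   ## opposite of where: values below 4 become NaN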
13.6.3 Updating Series
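Elements can be updated in place through the same label/position accessors used for reading; a minimal sketch:
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
s.loc['a'] = 10   ## update by label
s.iloc[1]  = 20   ## update by position
s[s<5]     = 0    ## update by boolean filter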
13.6.4 Series Attributes
13.6.4.1 The Data
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'], name='SuperHero')
s
#:> a 1
#:> b 2
#:> c 3
#:> d 4
#:> e 5
#:> Name: SuperHero, dtype: int64
13.6.4.2 The Attributes
print( ' Series Index: ',s.index, '\n',
'Series dType: ', s.dtype, '\n',
'Series Size: ', s.size, '\n',
'Series Shape: ', s.shape, '\n',
'Series Dimension:', s.ndim)
#:> Series Index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
#:> Series dType: int64
#:> Series Size: 5
#:> Series Shape: (5,)
#:> Series Dimension: 1
13.6.5 Instance Methods
13.6.5.1 Index Manipulation
.rename_axis()
s.rename_axis('haribulan')
#:> haribulan
#:> a 1
#:> b 2
#:> c 3
#:> d 4
#:> e 5
#:> Name: SuperHero, dtype: int64
.reset_index()
Resetting the index will:
- Convert the index to a normal column, named 'index'
- Renumber the index to 0,1,2,…
- Return a DataFrame (it becomes two columns)
s.reset_index()
#:> index SuperHero
#:> 0 a 1
#:> 1 b 2
#:> 2 c 3
#:> 3 d 4
#:> 4 e 5
13.6.5.2 Structure Conversion
- A series structure contains value (in a numpy array) and its dtype (the data type of the numpy array).
- Use values to retrieve the underlying numpy.ndarray. Use dtype to understand the data type.
s = pd.Series([1,2,3,4,5])
print(' Series value: ', s.values, '\n',
      'Series value type: ', type(s.values), '\n',
      'Series dtype: ', s.dtype)
#:> Series value: [1 2 3 4 5]
#:> Series value type: <class 'numpy.ndarray'>
#:> Series dtype: int64
Convert To List using .tolist()
pd.Series.tolist(s)
#:> [1, 2, 3, 4, 5]
13.6.6 Series Operators
Applying an operator (arithmetic or logic) to a Series object returns a new Series object.
13.6.6.1 Arithmetic Operator
s1 = pd.Series( [100,200,300,400,500] )
s2 = pd.Series( [10, 20, 30, 40, 50] )
Apply To One Series Object
s1 - 100
#:> 0 0
#:> 1 100
#:> 2 200
#:> 3 300
#:> 4 400
#:> dtype: int64
Apply To Two Series Objects
s1 - s2
#:> 0 90
#:> 1 180
#:> 2 270
#:> 3 360
#:> 4 450
#:> dtype: int64
13.6.6.2 Logic Operator
- Applying a logic operator to a Series returns a new Series of boolean results
- This can be used for Series or DataFrame filtering
bs = pd.Series(range(0,10))
bs>3
#:> 0 False
#:> 1 False
#:> 2 False
#:> 3 False
#:> 4 True
#:> 5 True
#:> 6 True
#:> 7 True
#:> 8 True
#:> 9 True
#:> dtype: bool
~((bs>3) & (bs<8) | (bs>7))
#:> 0 True
#:> 1 True
#:> 2 True
#:> 3 True
#:> 4 False
#:> 5 False
#:> 6 False
#:> 7 False
#:> 8 False
#:> 9 False
#:> dtype: bool
13.6.7 Series .str Accessor
If the underlying data is of str type, pandas exposes various properties and methods through the str accessor.
SeriesObj.str.operatorFunction()
Available Functions
Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:
len() lower() translate() islower() ljust() upper() startswith() isupper() rjust() find() endswith() isnumeric() center() rfind() isalnum() isdecimal() zfill() index() isalpha() split() strip() rindex() isdigit() rsplit() rstrip() capitalize() isspace() partition() lstrip() swapcase() istitle() rpartition()
13.6.7.1 Regex Extractor
Extract capture groups in the regex pattern, by default into a DataFrame (expand=True).
Series.str.extract(self, pat, flags=0, expand=True)
- expand=False: if the result is a single column, make it a Series instead of a DataFrame.
s = pd.Series(['a1', 'b2', 'c3'])
print(
  ' Extracted Dataframe:\n', s.str.extract(r'([ab])(\d)'), '\n\n',
  'Extracted Dataframe with Names:\n', s.str.extract(r'(?P<Letter>[ab])(\d)'))
#:> Extracted Dataframe:
#:> 0 1
#:> 0 a 1
#:> 1 b 2
#:> 2 NaN NaN
#:>
#:> Extracted Dataframe with Names:
#:> Letter 1
#:> 0 a 1
#:> 1 b 2
#:> 2 NaN NaN
The output below is a single column; use expand=False to make the result a Series instead of a DataFrame.
r = s.str.extract(r'[ab](\d)', expand=False)
print( r, '\n\n', type(r) )
#:> 0 1
#:> 1 2
#:> 2 NaN
#:> dtype: object
#:>
#:> <class 'pandas.core.series.Series'>
13.6.7.2 Character Extractor
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte
#:> 0 Graham Chapman
#:> 1 John Cleese
#:> 2 Terry Gilliam
#:> 3 Eric Idle
#:> 4 Terry Jones
#:> 5 Michael Palin
#:> dtype: object
startswith
monte.str.startswith('T')
#:> 0 False
#:> 1 False
#:> 2 True
#:> 3 False
#:> 4 True
#:> 5 False
#:> dtype: bool
Slicing
monte.str[0:3]
#:> 0 Gra
#:> 1 Joh
#:> 2 Ter
#:> 3 Eri
#:> 4 Ter
#:> 5 Mic
#:> dtype: object
13.6.7.3 Splitting
Split strings around given separator/delimiter in either string or regex.
Series.str.split(self, pat=None, n=-1, expand=False)
- pat: can be string or regex
s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h_i_j'])
s
#:> 0 a_b_c
#:> 1 c_d_e
#:> 2 NaN
#:> 3 f_g_h_i_j
#:> dtype: object
str.split()
By default, split will split each item into an array.
s.str.split('_')
#:> 0 [a, b, c]
#:> 1 [c, d, e]
#:> 2 NaN
#:> 3 [f, g, h, i, j]
#:> dtype: object
expand=True will return a dataframe instead of a series. By default, expand splits into all possible columns.
print( s.str.split('_', expand=True) )
#:> 0 1 2 3 4
#:> 0 a b c None None
#:> 1 c d e None None
#:> 2 NaN NaN NaN NaN NaN
#:> 3 f g h i j
It is possible to limit the number of columns split.
print( s.str.split('_', expand=True, n=1) )
#:> 0 1
#:> 0 a b_c
#:> 1 c d_e
#:> 2 NaN NaN
#:> 3 f g_h_i_j
str.rsplit()
rsplit stands for reverse split; it works the same way, except it is reversed.
print( s.str.rsplit('_', expand=True, n=1) )
#:> 0 1
#:> 0 a_b c
#:> 1 c_d e
#:> 2 NaN NaN
#:> 3 f_g_h_i j
13.6.7.4 Case Conversion
SeriesObj.str.upper()
SeriesObj.str.lower()
SeriesObj.str.capitalize()
s = pd.Series(['A', 'B', 'C', 'aAba', 'bBaca', np.nan, 'cCABA', 'dog', 'cat'])
print( s.str.upper(), '\n',
       s.str.capitalize())
#:> 0 A
#:> 1 B
#:> 2 C
#:> 3 AABA
#:> 4 BBACA
#:> 5 NaN
#:> 6 CCABA
#:> 7 DOG
#:> 8 CAT
#:> dtype: object
#:> 0 A
#:> 1 B
#:> 2 C
#:> 3 Aaba
#:> 4 Bbaca
#:> 5 NaN
#:> 6 Ccaba
#:> 7 Dog
#:> 8 Cat
#:> dtype: object
13.6.7.5 Number of Characters
s.str.len()
#:> 0 1.0
#:> 1 1.0
#:> 2 1.0
#:> 3 4.0
#:> 4 5.0
#:> 5 NaN
#:> 6 5.0
#:> 7 3.0
#:> 8 3.0
#:> dtype: float64
13.6.7.6 String Indexing
This returns the specified character(s) from each item.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str[0].values     # first char
#:> array(['A', 'B', 'C', 'A', 'B', nan, 'C', 'd', 'c'], dtype=object)
s.str[0:2].values   # first and second char
#:> array(['A', 'B', 'C', 'Aa', 'Ba', nan, 'CA', 'do', 'ca'], dtype=object)
13.6.8 Series .dt Accessor
If the underlying data is of datetime64 type, pandas exposes various properties and methods through the dt accessor.
13.6.8.1 Sample Data
s = pd.Series([
      datetime(2000,1,1,0,0,0),
      datetime(1999,12,15,12,34,55),
      datetime(2020,3,8,5,7,12),
      datetime(2018,1,1,0,0,0),
      datetime(2003,3,4,5,6,7)
    ])
s
#:> 0 2000-01-01 00:00:00
#:> 1 1999-12-15 12:34:55
#:> 2 2020-03-08 05:07:12
#:> 3 2018-01-01 00:00:00
#:> 4 2003-03-04 05:06:07
#:> dtype: datetime64[ns]
13.6.8.2 Convert To
datetime.datetime
Use to_pydatetime() to convert into a numpy array of standard library datetime.datetime.
pdt = s.dt.to_pydatetime()
print( type(pdt) )
#:> <class 'numpy.ndarray'>
pdt
#:> array([datetime.datetime(2000, 1, 1, 0, 0),
#:> datetime.datetime(1999, 12, 15, 12, 34, 55),
#:> datetime.datetime(2020, 3, 8, 5, 7, 12),
#:> datetime.datetime(2018, 1, 1, 0, 0),
#:> datetime.datetime(2003, 3, 4, 5, 6, 7)], dtype=object)
datetime.date
Use dt.date to convert into a pandas.Series of standard library datetime.date.
Is it possible to have a pandas.Series of datetime.datetime? No, because pandas wants it as its own Timestamp.
sdt = s.dt.date
print( type(sdt[1]) )
#:> <class 'datetime.date'>
print( type(sdt))
#:> <class 'pandas.core.series.Series'>
sdt
#:> 0 2000-01-01
#:> 1 1999-12-15
#:> 2 2020-03-08
#:> 3 2018-01-01
#:> 4 2003-03-04
#:> dtype: object
13.6.8.3 Timestamp Attributes
A datetime Series supports the properties below:
- date
- month
- day
- year
- dayofweek
- dayofyear
- weekday
- weekday_name
- quarter
- daysinmonth
- hour
- minute
Full list below:
https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties
s.dt.date
#:> 0 2000-01-01
#:> 1 1999-12-15
#:> 2 2020-03-08
#:> 3 2018-01-01
#:> 4 2003-03-04
#:> dtype: object
s.dt.month
#:> 0 1
#:> 1 12
#:> 2 3
#:> 3 1
#:> 4 3
#:> dtype: int64
s.dt.dayofweek
#:> 0 5
#:> 1 2
#:> 2 6
#:> 3 0
#:> 4 1
#:> dtype: int64
s.dt.weekday
#:> 0 5
#:> 1 2
#:> 2 6
#:> 3 0
#:> 4 1
#:> dtype: int64
# error no attribute weekday_name
s.dt.weekday_name
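weekday_name was removed in newer pandas; dt.day_name() is the replacement. A sketch:
s.dt.day_name()   ## e.g. 'Saturday' for 2000-01-01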
s.dt.quarter
#:> 0 1
#:> 1 4
#:> 2 1
#:> 3 1
#:> 4 1
#:> dtype: int64
s.dt.daysinmonth
#:> 0 31
#:> 1 31
#:> 2 31
#:> 3 31
#:> 4 31
#:> dtype: int64
s.dt.time     # extract time as time Object
#:> 0 00:00:00
#:> 1 12:34:55
#:> 2 05:07:12
#:> 3 00:00:00
#:> 4 05:06:07
#:> dtype: object
s.dt.hour     # extract hour as integer
#:> 0 0
#:> 1 12
#:> 2 5
#:> 3 0
#:> 4 5
#:> dtype: int64
s.dt.minute   # extract minute as integer
#:> 0 0
#:> 1 34
#:> 2 7
#:> 3 0
#:> 4 6
#:> dtype: int64
13.7 class: DataFrame
13.7.1 Constructor
13.7.1.1 Empty DataFrame
By default, an empty dataframe contains no columns and no index.
empty_df1 = pd.DataFrame()
empty_df2 = pd.DataFrame()
print(id(empty_df1), id(empty_df2), empty_df1)
#:> 140189694040336 140189694018000 Empty DataFrame
#:> Columns: []
#:> Index: []
However, you can also initialize an empty DataFrame with Index and/or Columns.
empty_df = pd.DataFrame(columns=['A','B','C'], index=[1,2,3])
print( empty_df )
#:> A B C
#:> 1 NaN NaN NaN
#:> 2 NaN NaN NaN
#:> 3 NaN NaN NaN
Take note that below, empty_df1 and empty_df2 refer to the same memory location, meaning they contain the same data.
empty_df1 = empty_df2 = pd.DataFrame()
print(id(empty_df1), id(empty_df2))
#:> 140189694328016 140189694328016
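To get independent objects instead of a shared reference, use .copy(); a minimal sketch:
empty_df1 = pd.DataFrame()
empty_df2 = empty_df1.copy()            ## independent copy
print(id(empty_df1) == id(empty_df2))   ## False: different objects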
13.7.1.2 From Row Oriented Data (List of Lists)
Create from List of Lists
DataFrame( [row_list1, row_list2, row_list3] )
DataFrame( [row_list1, row_list2, row_list3], column = columnName_list )
DataFrame( [row_list1, row_list2, row_list3], index = row_label_list )
Basic DataFrame with default Row Label and Column Header
pd.DataFrame ([[101,'Alice',40000,2017],
               [102,'Bob', 24000, 2017],
               [103,'Charles',31000,2017]] )
#:> 0 1 2 3
#:> 0 101 Alice 40000 2017
#:> 1 102 Bob 24000 2017
#:> 2 103 Charles 31000 2017
Specify Column Header during Creation
pd.DataFrame ([[101,'Alice',40000,2017],
               [102,'Bob', 24000, 2017],
               [103,'Charles',31000,2017]], columns = ['empID','name','salary','year'])
#:> empID name salary year
#:> 0 101 Alice 40000 2017
#:> 1 102 Bob 24000 2017
#:> 2 103 Charles 31000 2017
Specify Row Label during Creation
pd.DataFrame ([[101,'Alice',40000,2017],
               [102,'Bob', 24000, 2017],
               [103,'Charles',31000,2017]], index = ['r1','r2','r3'] )
#:> 0 1 2 3
#:> r1 101 Alice 40000 2017
#:> r2 102 Bob 24000 2017
#:> r3 103 Charles 31000 2017
13.7.1.3 From Row Oriented Data (List of Dictionary)
DataFrame( [dict1, dict2, dict3] )
DataFrame( [dict1, dict2, dict3], columns=column_list )
DataFrame( [dict1, dict2, dict3], index=row_label_list )
By default, keys become the column names (in order of first appearance).
Default Column Names Follow the Dictionary Keys
Note missing info appears as NaN
"name":"Yong", "id":1,"zkey":101},{"name":"Gan","id":2}]) pd.DataFrame ([{
#:> name id zkey
#:> 0 Yong 1 101.0
#:> 1 Gan 2 NaN
Specify Index
"name":"Yong", "id":'wd1'},{"name":"Gan","id":'wd2'}],
pd.DataFrame ([{= (1,2)) index
#:> name id
#:> 1 Yong wd1
#:> 2 Gan wd2
Specifying the Column Header during creation can act as a column filter and manual arrangement.
Note missing info appears as NaN
"name":"Yong", "id":1, "zkey":101},{"name":"Gan","id":2}],
pd.DataFrame ([{=("name","id","zkey")) columns
#:> name id zkey
#:> 0 Yong 1 101.0
#:> 1 Gan 2 NaN
13.7.1.4 From Column Oriented Data
Create from Dictionary of Lists
DataFrame( { 'column1': list1,
'column2': list2,
'column3': list3 } ,
index = row_label_list,
columns = column_list)
By default, the columns follow the dictionary's insertion order, unless columns is specified.
Default Row Label
data = {'empID': [100, 101, 102, 103, 104],
        'year': [2017, 2017, 2017, 2018, 2018],
        'salary': [40000, 24000, 31000, 20000, 30000],
        'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric']}
pd.DataFrame(data)
#:> empID year salary name
#:> 0 100 2017 40000 Alice
#:> 1 101 2017 24000 Bob
#:> 2 102 2017 31000 Charles
#:> 3 103 2018 20000 David
#:> 4 104 2018 30000 Eric
Specify Row Label during Creation
data = {'empID': [100, 101, 102, 103, 104],
        'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2018, 2018],
        'salary': [40000, 24000, 31000, 20000, 30000] }
pd.DataFrame (data, index=['r1','r2','r3','r4','r5'])
#:> empID name year salary
#:> r1 100 Alice 2017 40000
#:> r2 101 Bob 2017 24000
#:> r3 102 Charles 2017 31000
#:> r4 103 David 2018 20000
#:> r5 104 Eric 2018 30000
Manually Choose Columns and Arrangement
data = {'empID': [100, 101, 102, 103, 104],
        'name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
        'year': [2017, 2017, 2017, 2018, 2018],
        'salary': [40000, 24000, 31000, 20000, 30000] }
pd.DataFrame (data, columns=('empID','name','salary'), index=['r1','r2','r3','r4','r5'])
#:> empID name salary
#:> r1 100 Alice 40000
#:> r2 101 Bob 24000
#:> r3 102 Charles 31000
#:> r4 103 David 20000
#:> r5 104 Eric 30000
13.7.2 Operator
13.7.2.1 The Data
Two dataframes are created, each with 3 columns and 3 rows, but with only two matching column and row names. Notice that the operators work cell-wise, honoring the row/column names.
df1 = pd.DataFrame(data=
        {'idx': ['row1','row2','row3'],
         'x': [10, 20, 30],
         'y': [1,2,3],
         'z': [0.1, 0.2, 0.3]}).set_index('idx')

df2 = pd.DataFrame(data=
        {'idx': ['row1','row2','row4'],
         'x': [13, 23, 33],
         'z': [0.1, 0.2, 0.3],
         'k': [11,21,31]
        }).set_index('idx')

print( df1, '\n\n', df2)
#:> x y z
#:> idx
#:> row1 10 1 0.1
#:> row2 20 2 0.2
#:> row3 30 3 0.3
#:>
#:> x z k
#:> idx
#:> row1 13 0.1 11
#:> row2 23 0.2 21
#:> row4 33 0.3 31
13.7.2.2 Addition
Adding Two DataFrames
Using the + operator, non-matching row/column names result in NaN. However, when using the add function, non-matching cells can be filled with an assumed value via fill_value.
r1 = df1 + df2
r2 = df1.add(df2, fill_value=1000)

print( r1, '\n\n', r2)
#:> k x y z
#:> idx
#:> row1 NaN 23.0 NaN 0.2
#:> row2 NaN 43.0 NaN 0.4
#:> row3 NaN NaN NaN NaN
#:> row4 NaN NaN NaN NaN
#:>
#:> k x y z
#:> idx
#:> row1 1011.0 23.0 1001.0 0.2
#:> row2 1021.0 43.0 1002.0 0.4
#:> row3 NaN 1030.0 1003.0 1000.3
#:> row4 1031.0 1033.0 NaN 1000.3
Adding a Series and a DataFrame
Specify the appropriate axis depending on the orientation of the series data. Column and row names are respected in this operation. However, fill_value is not applicable when applied to a Series.
Note that labels in the Series that are not found in the dataframe will still be created in the result. This is similar behaviour to operating a DataFrame with a DataFrame.
s3 = pd.Series([1,1,1], index=['row1','row2','row4'])
s4 = pd.Series([3,3,3], index=['x','y','s'])

print('Original Data:\n', df1, '\n\n',
      'Add By Rows: \n', df1.add(s3, axis=0), '\n\n',
      'Add By Columns: \n', df1.add(s4, axis=1))
#:> Original Data:
#:> x y z
#:> idx
#:> row1 10 1 0.1
#:> row2 20 2 0.2
#:> row3 30 3 0.3
#:>
#:> Add By Rows:
#:> x y z
#:> row1 11.0 2.0 1.1
#:> row2 21.0 3.0 1.2
#:> row3 NaN NaN NaN
#:> row4 NaN NaN NaN
#:>
#:> Add By Columns:
#:> s x y z
#:> idx
#:> row1 NaN 13.0 4.0 NaN
#:> row2 NaN 23.0 5.0 NaN
#:> row3 NaN 33.0 6.0 NaN
13.7.2.3 Subtraction
r1 = df2 - df1
r2 = df2.sub(df1, fill_value=1000)

print( r1, '\n\n', r2)
#:> k x y z
#:> idx
#:> row1 NaN 3.0 NaN 0.0
#:> row2 NaN 3.0 NaN 0.0
#:> row3 NaN NaN NaN NaN
#:> row4 NaN NaN NaN NaN
#:>
#:> k x y z
#:> idx
#:> row1 -989.0 3.0 999.0 0.0
#:> row2 -979.0 3.0 998.0 0.0
#:> row3 NaN 970.0 997.0 999.7
#:> row4 -969.0 -967.0 NaN -999.7
r3 = (r2>0) & (r2<=3)
print( 'Original Data: \n', r2, '\n\n',
       'Logical Operator:\n', r3)
#:> Original Data:
#:> k x y z
#:> idx
#:> row1 -989.0 3.0 999.0 0.0
#:> row2 -979.0 3.0 998.0 0.0
#:> row3 NaN 970.0 997.0 999.7
#:> row4 -969.0 -967.0 NaN -999.7
#:>
#:> Logical Operator:
#:> k x y z
#:> idx
#:> row1 False True False False
#:> row2 False True False False
#:> row3 False False False False
#:> row4 False False False False
13.7.3 Attributes
df = pd.DataFrame(
      {'empID': [100, 101, 102, 103, 104],
       'year1': [2017, 2017, 2017, 2018, 2018],
       'name': ['Alice', 'Bob', 'Charles','David', 'Eric'],
       'year2': [2001, 1907, 2003, 1998, 2011],
       'salary': [40000, 24000, 31000, 20000, 30000]},
      columns = ['year1','salary','year2','empID','name'])
13.7.3.2 Index
df.index
#:> RangeIndex(start=0, stop=5, step=1)
The underlying index values are a numpy array.
df.index.values
#:> array([0, 1, 2, 3, 4])
13.7.4 Index Manipulation
index and row label are used interchangeably in this book
13.7.4.1 Sample Data
Columns are intentionally ordered in a messy way.
df = pd.DataFrame(
      {'empID': [100, 101, 102, 103, 104],
       'year1': [2017, 2017, 2017, 2018, 2018],
       'name': ['Alice', 'Bob', 'Charles','David', 'Eric'],
       'year2': [2001, 1907, 2003, 1998, 2011],
       'salary': [40000, 24000, 31000, 20000, 30000]},
      columns = ['year1','salary','year2','empID','name'])
print (df, '\n')
#:> year1 salary year2 empID name
#:> 0 2017 40000 2001 100 Alice
#:> 1 2017 24000 1907 101 Bob
#:> 2 2017 31000 2003 102 Charles
#:> 3 2018 20000 1998 103 David
#:> 4 2018 30000 2011 104 Eric
print (df.index)
#:> RangeIndex(start=0, stop=5, step=1)
13.7.4.2 Convert Column To Index
set_index('column_name', inplace=False)
inplace=True means don't create a new dataframe; modify the existing dataframe.
inplace=False means return a new dataframe.
print(df)
#:> year1 salary year2 empID name
#:> 0 2017 40000 2001 100 Alice
#:> 1 2017 24000 1907 101 Bob
#:> 2 2017 31000 2003 102 Charles
#:> 3 2018 20000 1998 103 David
#:> 4 2018 30000 2011 104 Eric
print(df.index,'\n')
#:> RangeIndex(start=0, stop=5, step=1)
df.set_index('empID', inplace=True)
print(df)
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
#:> 104 2018 30000 2011 Eric
print(df.index)   # the index is now empID
#:> Int64Index([100, 101, 102, 103, 104], dtype='int64', name='empID')
13.7.4.3 Convert Index Back To Column
- Resetting the index will resequence the index as 0,1,2 etc
- The old index column will be converted back to a normal column
- The operation supports the inplace option
df.reset_index(inplace=True)
print(df)
#:> empID year1 salary year2 name
#:> 0 100 2017 40000 2001 Alice
#:> 1 101 2017 24000 1907 Bob
#:> 2 102 2017 31000 2003 Charles
#:> 3 103 2018 20000 1998 David
#:> 4 104 2018 30000 2011 Eric
13.7.4.4 Updating Index ( .index= )
Warning:
- Updating the index doesn't reorder the data sequence
- The number of elements before and after must match, otherwise an error is raised
- The same label is allowed to repeat
- Not reversible
df.index = [101, 101, 101, 102, 103]
df
#:> empID year1 salary year2 name
#:> 101 100 2017 40000 2001 Alice
#:> 101 101 2017 24000 1907 Bob
#:> 101 102 2017 31000 2003 Charles
#:> 102 103 2018 20000 1998 David
#:> 103 104 2018 30000 2011 Eric
13.7.4.5 Reordering Index (.reindex)
- reindex will reorder the rows according to the new index
- The operation is not reversible
Start from this original dataframe:
df.index = [101,102,103,104,105]
df
#:> empID year1 salary year2 name
#:> 101 100 2017 40000 2001 Alice
#:> 102 101 2017 24000 1907 Bob
#:> 103 102 2017 31000 2003 Charles
#:> 104 103 2018 20000 1998 David
#:> 105 104 2018 30000 2011 Eric
Change the order of the index; this always returns a new dataframe.
df.reindex([103,102,101,104,105])
#:> empID year1 salary year2 name
#:> 103 102 2017 31000 2003 Charles
#:> 102 101 2017 24000 1907 Bob
#:> 101 100 2017 40000 2001 Alice
#:> 104 103 2018 20000 1998 David
#:> 105 104 2018 30000 2011 Eric
13.7.4.6 Rename Index
- The example below renames the axes of both columns and rows
- Use axis=0 for the row index, axis=1 for the column index
df.rename_axis('super_id').rename_axis('my_cols', axis=1)
#:> my_cols empID year1 salary year2 name
#:> super_id
#:> 101 100 2017 40000 2001 Alice
#:> 102 101 2017 24000 1907 Bob
#:> 103 102 2017 31000 2003 Charles
#:> 104 103 2018 20000 1998 David
#:> 105 104 2018 30000 2011 Eric
13.7.5 Subsetting Columns
Select Single Column Return Series
dataframe.columnName # single column, name based, return Series object
dataframe[ single_col_name ] # single column, name based, return Series object
dataframe[ [single_col_name] ] # single column, name based, return DataFrame object
Select Single/Multiple Columns Return DataFrame
dataframe[ single/list_of_col_names ] # name based, return Dataframe object
dataframe.loc[ : , single_col_name ] # single column, series
dataframe.loc[ : , col_name_list ] # multiple columns, dataframe
dataframe.loc[ : , col_name_range ]     # multiple columns, dataframe
dataframe.iloc[ : , col_number ] # single column, series
dataframe.iloc[ : , col_number_list ] # multiple columns, dataframe
dataframe.iloc[ : , number_range ] # multiple columns, dataframe
13.7.5.1 Select Single Column
Selecting a single column always returns a pandas Series.
print(
  df.name, '\n\n',
  df['name'], '\n\n',
  df.loc[:, 'name'], '\n\n',
  df.iloc[:, 3])
#:> 101 Alice
#:> 102 Bob
#:> 103 Charles
#:> 104 David
#:> 105 Eric
#:> Name: name, dtype: object
#:>
#:> 101 Alice
#:> 102 Bob
#:> 103 Charles
#:> 104 David
#:> 105 Eric
#:> Name: name, dtype: object
#:>
#:> 101 Alice
#:> 102 Bob
#:> 103 Charles
#:> 104 David
#:> 105 Eric
#:> Name: name, dtype: object
#:>
#:> 101 2001
#:> 102 1907
#:> 103 2003
#:> 104 1998
#:> 105 2011
#:> Name: year2, dtype: int64
13.7.5.2 Select Multiple Columns
Multiple columns are returned as a pandas DataFrame object.
The example below returns a DataFrame with a single column.
df[['name']]   # return one column dataframe
#:> name
#:> 101 Alice
#:> 102 Bob
#:> 103 Charles
#:> 104 David
#:> 105 Eric
print(
  df.loc[:, ['name','year1']], '\n\n',
  df[['name','year1']])
#:> name year1
#:> 101 Alice 2017
#:> 102 Bob 2017
#:> 103 Charles 2017
#:> 104 David 2018
#:> 105 Eric 2018
#:>
#:> name year1
#:> 101 Alice 2017
#:> 102 Bob 2017
#:> 103 Charles 2017
#:> 104 David 2018
#:> 105 Eric 2018
Select Range of Columns
print(
  df.loc [ : , 'year1':'year2'], '\n\n',
  df.iloc[ : , [0,3]], '\n\n',
  df.iloc[ : , 0:3] )
#:> year1 salary year2
#:> 101 2017 40000 2001
#:> 102 2017 24000 1907
#:> 103 2017 31000 2003
#:> 104 2018 20000 1998
#:> 105 2018 30000 2011
#:>
#:> empID year2
#:> 101 100 2001
#:> 102 101 1907
#:> 103 102 2003
#:> 104 103 1998
#:> 105 104 2011
#:>
#:> empID year1 salary
#:> 101 100 2017 40000
#:> 102 101 2017 24000
#:> 103 102 2017 31000
#:> 104 103 2018 20000
#:> 105 104 2018 30000
13.7.5.3 By Column Name (.filter)
.filter(items=None, like=None, regex=None, axis=1)
like = Substring Matches
df.filter( like='year', axis='columns')   ## or axis = 1
#:> year1 year2
#:> 101 2017 2001
#:> 102 2017 1907
#:> 103 2017 2003
#:> 104 2018 1998
#:> 105 2018 2011
items = list of column names
df.filter( items=('year1','year2'), axis=1)   ## or axis = 1
#:> year1 year2
#:> 101 2017 2001
#:> 102 2017 1907
#:> 103 2017 2003
#:> 104 2018 1998
#:> 105 2018 2011
regex = Regular Expression
Select column names that contain a digit.
df.filter(regex='\d')   ## default axis=1 if DataFrame
#:> year1 year2
#:> 101 2017 2001
#:> 102 2017 1907
#:> 103 2017 2003
#:> 104 2018 1998
#:> 105 2018 2011
13.7.5.4 Data Type (.select_dtypes)
df.select_dtypes(include=None, exclude=None)
Always return panda::DataFrame, even though only single column matches.
Allowed types are:
- number (integer and float)
- integer / float
- datetime
- timedelta
- category
# error: no attribute get_dtype_counts (removed; df.dtypes.value_counts() is the modern equivalent)
df.get_dtype_counts()
df.select_dtypes(exclude='number')
#:> name
#:> 101 Alice
#:> 102 Bob
#:> 103 Charles
#:> 104 David
#:> 105 Eric
df.select_dtypes(exclude=('number','object'))
#:> Empty DataFrame
#:> Columns: []
#:> Index: [101, 102, 103, 104, 105]
13.7.6 Column Manipulation
13.7.6.1 Sample Data
df
#:> empID year1 salary year2 name
#:> 101 100 2017 40000 2001 Alice
#:> 102 101 2017 24000 1907 Bob
#:> 103 102 2017 31000 2003 Charles
#:> 104 103 2018 20000 1998 David
#:> 105 104 2018 30000 2011 Eric
13.7.6.2 Renaming Columns
Method 1 : Rename All Columns (.columns =)
- Construct the new column names; ensure no column names are missing
- Missing columns will return an error
- Direct assignment to the columns property results in a change to the dataframe
new_columns = ['year.1','salary','year.2','empID','name']
df.columns = new_columns
df.head(2)
#:> year.1 salary year.2 empID name
#:> 101 100 2017 40000 2001 Alice
#:> 102 101 2017 24000 1907 Bob
Method 2 : Renaming Specific Columns (.rename(columns=))
- Change column names through the rename function
- Supports the inplace option to change the original dataframe
- Missing columns are OK
df.rename( columns={'year.1':'year1', 'year.2':'year2'}, inplace=True)
df.head(2)
#:> year1 salary year2 empID name
#:> 101 100 2017 40000 2001 Alice
#:> 102 101 2017 24000 1907 Bob
13.7.6.3 Reordering Columns
Always returns a new dataframe. There is no inplace option for reordering columns.
Method 1 - reindex(columns=)
- reindex may sound like an operation on row labels, but it works on columns too
- Mismatched column names will result in NA for the unfound column
new_colorder = [ 'empID', 'name', 'salary', 'year1', 'year2']
df.reindex(columns = new_colorder).head(2)
#:> empID name salary year1 year2
#:> 101 2001 Alice 2017 100 40000
#:> 102 1907 Bob 2017 101 24000
Method 2 - [ ] notation
- A mismatched column will result in an ERROR
new_colorder = [ 'empID', 'name', 'salary', 'year1', 'year2']
df[new_colorder]
#:> empID name salary year1 year2
#:> 101 2001 Alice 2017 100 40000
#:> 102 1907 Bob 2017 101 24000
#:> 103 2003 Charles 2017 102 31000
#:> 104 1998 David 2018 103 20000
#:> 105 2011 Eric 2018 104 30000
13.7.6.4 Duplicating or Replacing Column
- A new column is created instantly using [ ] notation
- DO NOT USE dot notation, because it is a view-only attribute
df['year3'] = df.year1
df
#:> year1 salary year2 empID name year3
#:> 101 100 2017 40000 2001 Alice 100
#:> 102 101 2017 24000 1907 Bob 101
#:> 103 102 2017 31000 2003 Charles 102
#:> 104 103 2018 20000 1998 David 103
#:> 105 104 2018 30000 2011 Eric 104
13.7.6.5 Dropping Columns (.drop)
dataframe.drop( columns='column_name', inplace=True/False) # delete single column
dataframe.drop( columns=list_of_colnames, inplace=True/False) # delete multiple column
dataframe.drop( index='row_label', inplace=True/False) # delete single row
dataframe.drop( index= list_of_row_labels, inplace=True/False) # delete multiple rows
inplace=True means the column will be deleted from the original dataframe. The default is False, which returns a copy of the dataframe.
By Column Name(s)
df.drop( columns='year1')   # drop single column
#:> salary year2 empID name year3
#:> 101 2017 40000 2001 Alice 100
#:> 102 2017 24000 1907 Bob 101
#:> 103 2017 31000 2003 Charles 102
#:> 104 2018 20000 1998 David 103
#:> 105 2018 30000 2011 Eric 104
df.drop(columns=['year2','year3'])   # drop multiple columns
#:> year1 salary empID name
#:> 101 100 2017 2001 Alice
#:> 102 101 2017 1907 Bob
#:> 103 102 2017 2003 Charles
#:> 104 103 2018 1998 David
#:> 105 104 2018 2011 Eric
By Column Number(s)
Use dataframe.columns to produce an interim list of column names.
df.drop( columns=df.columns[[3,4,5]] )   # delete columns by list of column numbers
#:> year1 salary year2
#:> 101 100 2017 40000
#:> 102 101 2017 24000
#:> 103 102 2017 31000
#:> 104 103 2018 20000
#:> 105 104 2018 30000
df.drop( columns=df.columns[3:6] )   # delete columns by range of column numbers
#:> year1 salary year2
#:> 101 100 2017 40000
#:> 102 101 2017 24000
#:> 103 102 2017 31000
#:> 104 103 2018 20000
#:> 105 104 2018 30000
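The same drop() removes rows through the index parameter (see the syntax block above); a short sketch:
df.drop( index=[101,102] )   # drop rows by their labels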
13.7.7 Subsetting Rows
dataframe.loc[ row_label ] # return series, single row
dataframe.loc[ row_label_list ] # multiple rows
dataframe.loc[ boolean_list ] # multiple rows
dataframe.iloc[ row_number ] # return series, single row
dataframe.iloc[ row_number_list ] # multiple rows
dataframe.iloc[ number_range ] # multiple rows
dataframe.sample(frac=) # frac = 0.6 means sampling 60% of rows randomly
13.7.7.1 Sample Data
df = pd.DataFrame(
      {'empID': [100, 101, 102, 103, 104],
       'year1': [2017, 2017, 2017, 2018, 2018],
       'name': ['Alice', 'Bob', 'Charles','David', 'Eric'],
       'year2': [2001, 1907, 2003, 1998, 2011],
       'salary': [40000, 24000, 31000, 20000, 30000]},
      columns = ['year1','salary','year2','empID','name']).set_index(['empID'])
df
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
#:> 104 2018 30000 2011 Eric
13.7.7.2 By Index or Boolean
A single index returns a Series.
df.loc[101]   # by single row label, return series
#:> year1 2017
#:> salary 24000
#:> year2 1907
#:> name Bob
#:> Name: 101, dtype: object
A list or range of indexes returns a DataFrame.
df.loc[ [100,103] ]   # by multiple row labels
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 103 2018 20000 1998 David
df.loc[ 100:103 ]   # by range of row labels
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
List of Boolean returns DataFrame
criteria = (df.salary > 30000) & (df.year1==2017)
print (criteria)
#:> empID
#:> 100 True
#:> 101 False
#:> 102 True
#:> 103 False
#:> 104 False
#:> dtype: bool
print (df.loc[criteria])
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 102 2017 31000 2003 Charles
13.7.7.3 By Row Number
A single row returns a Series
df.iloc[1]   # by single row number
#:> year1 2017
#:> salary 24000
#:> year2 1907
#:> name Bob
#:> Name: 101, dtype: object
Multiple rows are returned as a DataFrame object
df.iloc[ [0,3] ]   # by row numbers
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 103 2018 20000 1998 David
df.iloc[ 0:3 ]   # by row number range
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
13.7.8 Row Manipulation
13.7.8.2 Appending Rows
Appending rows is more computationally intensive than concatenation. Items can be appended one at a time or as multiple items (in list form).
Append From Another DataFrame
- When ignore_index=True, pandas will drop the original index and recreate it with 0,1,2,3…
- It is recommended to ignore the index IF the data source index is not unique.
- New columns will be added in the result, with NaN filled in for rows from the original dataframe.
my_df = pd.DataFrame(
    data = {'Id':   [10,20,30],
            'Name': ['Aaa','Bbb','Ccc']})
    # .set_index('Id')

my_df_new = pd.DataFrame(
    data = {'Id':   [40,50],
            'Name': ['Ddd','Eee'],
            'Age':  [12,13]})
    #.set_index('Id')

my_df_append  = my_df.append(my_df_new, ignore_index=False)
my_df_noindex = my_df.append(my_df_new, ignore_index=True)
print("Original DataFrame:\n", my_df,
"\n\nTo Be Appended DataFrame:\n", my_df_new,
"\n\nAppended DataFrame (index maintained):\n", my_df_append,
"\n\nAppended DataFrame (index ignored):\n", my_df_noindex)
#:> Original DataFrame:
#:> Id Name
#:> 0 10 Aaa
#:> 1 20 Bbb
#:> 2 30 Ccc
#:>
#:> To Be Appended DataFrame:
#:> Id Name Age
#:> 0 40 Ddd 12
#:> 1 50 Eee 13
#:>
#:> Appended DataFrame (index maintained):
#:> Id Name Age
#:> 0 10 Aaa NaN
#:> 1 20 Bbb NaN
#:> 2 30 Ccc NaN
#:> 0 40 Ddd 12.0
#:> 1 50 Eee 13.0
#:>
#:> Appended DataFrame (index ignored):
#:> Id Name Age
#:> 0 10 Aaa NaN
#:> 1 20 Bbb NaN
#:> 2 30 Ccc NaN
#:> 3 40 Ddd 12.0
#:> 4 50 Eee 13.0
Append From Dictionary
my_df = pd.DataFrame(
    data = {'Id':   [10,20,30],
            'Name': ['Aaa','Bbb','Ccc']}) \
    .set_index('Id')

new_item1 = {'Id':40, 'Name': 'Ddd'}
new_item2 = {'Id':50, 'Name': 'Eee'}
new_item3 = {'Id':60, 'Name': 'Fff'}

my_df_one   = my_df.append( new_item1, ignore_index=True )
my_df_multi = my_df.append( [new_item2, new_item3], ignore_index=True )
print("Original DataFrame:\n", my_df,
"\n\nAdd One Item (index ignored):\n", my_df_one,
"\n\nAdd Multi Item (index ignored):\n", my_df_multi)
#:> Original DataFrame:
#:> Name
#:> Id
#:> 10 Aaa
#:> 20 Bbb
#:> 30 Ccc
#:>
#:> Add One Item (index ignored):
#:> Name Id
#:> 0 Aaa NaN
#:> 1 Bbb NaN
#:> 2 Ccc NaN
#:> 3 Ddd 40.0
#:>
#:> Add Multi Item (index ignored):
#:> Name Id
#:> 0 Aaa NaN
#:> 1 Bbb NaN
#:> 2 Ccc NaN
#:> 3 Eee 50.0
#:> 4 Fff 60.0
Appending None item(s)
Adding a single None item has no effect (nothing is added). Adding None in list form (multiple items) creates rows with None values. ignore_index is not important here.
single_none = my_df.append( None )
multi_none  = my_df.append( [None] )
print("Original DataFrame:\n", my_df,
"\n\nAdd One None (index ignored):\n", single_none,
"\n\nAdd List of None (index ignored):\n", multi_none)
#:> Original DataFrame:
#:> Name
#:> Id
#:> 10 Aaa
#:> 20 Bbb
#:> 30 Ccc
#:>
#:> Add One None (index ignored):
#:> Name
#:> Id
#:> 10 Aaa
#:> 20 Bbb
#:> 30 Ccc
#:>
#:> Add List of None (index ignored):
#:> Name 0
#:> 10 Aaa NaN
#:> 20 Bbb NaN
#:> 30 Ccc NaN
#:> 0 NaN None
Appending items containing None results in an ERROR
my_df.append( [new_item1, None] )   # error
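Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the replacement. A minimal sketch of the equivalent call (variable names assumed):
left  = pd.DataFrame({'Id': [10,20,30], 'Name': ['Aaa','Bbb','Ccc']})
right = pd.DataFrame({'Id': [40,50],    'Name': ['Ddd','Eee'], 'Age': [12,13]})
pd.concat([left, right], ignore_index=True)   # same result as append(..., ignore_index=True)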
13.7.8.4 Dropping Rows (.drop)
.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
By Row Label(s)
df.drop(index=100)   # single row
#:> year1 salary year2 name
#:> empID
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
#:> 104 2018 30000 2011 Eric
df.drop(index=[100,103])   # multiple rows
#:> year1 salary year2 name
#:> empID
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 104 2018 30000 2011 Eric
13.7.9 Slicing
13.7.9.1 Sample Data
df
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
#:> 104 2018 30000 2011 Eric
13.7.9.2 Getting One Cell
By Row Label and Column Name (loc)
dataframe.loc [ row_label , col_name ] # by row label and column names
dataframe.loc [ bool_list , col_name ]   # by boolean list and column names
dataframe.iloc[ row_number, col_number ] # by row and column number
print (df.loc[100,'year1'])
#:> 2017
By Row Number and Column Number (iloc)
print (df.iloc[1,2])
#:> 1907
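For single-cell access, the specialized accessors .at (label-based) and .iat (position-based) are faster equivalents; a minimal sketch using the same df:
print(df.at[100, 'year1'])   # same cell as df.loc[100,'year1']
print(df.iat[1, 2])          # same cell as df.iloc[1,2]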
13.7.9.3 Getting Multiple Cells
Specify rows and columns (by individual or range)
dataframe.loc [ list/range_of_row_labels , list/range_col_names ] # by row label and column names
dataframe.iloc[ list/range_row_numbers, list/range_col_numbers ]   # by row and column numbers
By Index and Column Name (loc)
print (df.loc[ [101,103], ['name','year1'] ], '\n') # by list of row label and column names
#:> name year1
#:> empID
#:> 101 Bob 2017
#:> 103 David 2018
print (df.loc[ 101:104 , 'year1':'year2' ], '\n') # by range of row label and column names
#:> year1 salary year2
#:> empID
#:> 101 2017 24000 1907
#:> 102 2017 31000 2003
#:> 103 2018 20000 1998
#:> 104 2018 30000 2011
By Boolean Row and Column Names (loc)
df.loc[df.year1==2017, 'year1':'year2']
#:> year1 salary year2
#:> empID
#:> 100 2017 40000 2001
#:> 101 2017 24000 1907
#:> 102 2017 31000 2003
By Row and Column Number (iloc)
print (df.iloc[ [1,4], [0,3]],'\n' ) # by individual rows/columns
#:> year1 name
#:> empID
#:> 101 2017 Bob
#:> 104 2018 Eric
print (df.iloc[ 1:4 , 0:3], '\n') # by range
#:> year1 salary year2
#:> empID
#:> 101 2017 24000 1907
#:> 102 2017 31000 2003
#:> 103 2018 20000 1998
13.7.10 Chained Indexing
Chained indexing creates a copy of the dataframe; any modification made through the chain is applied to the copy and does not affect the original dataframe.
dataframe.loc [...] [...]
dataframe.iloc [...] [...]
Recommendation: never use chained indexing.
df = pd.DataFrame(
     {'empID':  [100, 101, 102, 103, 104],
      'year1':  [2017, 2017, 2017, 2018, 2018],
      'name':   ['Alice', 'Bob', 'Charles','David', 'Eric'],
      'year2':  [2001, 1907, 2003, 1998, 2011],
      'salary': [40000, 24000, 31000, 20000, 30000]},
     columns = ['year1','salary','year2','empID','name']).set_index(['empID'])
df
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
#:> 104 2018 30000 2011 Eric
df.loc[100]['year'] = 2000
#:> /home/msfz751/miniconda3/envs/python_book/bin/python:1: SettingWithCopyWarning:
#:> A value is trying to be set on a copy of a slice from a DataFrame
#:>
#:> See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
#:> /home/msfz751/miniconda3/envs/python_book/lib/python3.7/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
#:> A value is trying to be set on a copy of a slice from a DataFrame
#:>
#:> See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
#:> iloc._setitem_with_indexer(indexer, value)
df   ## notice row label 100 has not been updated, because the data was updated on a copy due to chained indexing
#:> year1 salary year2 name
#:> empID
#:> 100 2017 40000 2001 Alice
#:> 101 2017 24000 1907 Bob
#:> 102 2017 31000 2003 Charles
#:> 103 2018 20000 1998 David
#:> 104 2018 30000 2011 Eric
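The remedy is a single .loc call that addresses row and column in one indexing operation; a minimal sketch:
df.loc[100, 'year1'] = 2000   # one indexer: the original dataframe is updated
df.loc[100]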
13.7.11 Cell Value Replacement
Slicing deals with square cells selection. Use mask
or where
to select specific cell(s). These function respect column and row names.
13.7.11.1 mask()
mask() replaces values with other= where the condition is met. Column and row names are respected.
ori = pd.DataFrame(data={
          'x': [1,4,7],
          'y': [2,5,8],
          'z': [3,6,9]}, index=['row1','row2','row3'])

df_big = (ori > 4)[['y','x','z']]
resul1 = ori.mask(df_big, other=999)
print('Original DF: \n', ori, '\n\n',
'Big DF : \n', df_big, '\n\n',
'Result : \n', resul1)
#:> Original DF:
#:> x y z
#:> row1 1 2 3
#:> row2 4 5 6
#:> row3 7 8 9
#:>
#:> Big DF :
#:> y x z
#:> row1 False False False
#:> row2 True False True
#:> row3 True True True
#:>
#:> Result :
#:> x y z
#:> row1 1 2 3
#:> row2 4 999 999
#:> row3 999 999 999
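where() is mentioned above but not demonstrated; it is the complement of mask(), keeping values where the condition holds. A minimal sketch using the same ori:
resul2 = ori.where(ori > 4, other=0)   # keep values > 4, replace the rest with 0
print(resul2)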
13.7.12 Iteration
13.7.12.1 Loop Through Rows (.iterrows)
df = pd.DataFrame(data=
     {'empID': [100, 101, 102, 103, 104],
      'Name':  ['Alice', 'Bob', 'Charles','David', 'Eric'],
      'Year':  [1999, 1988, 2001, 2010, 2020]}).set_index(['empID'])

for idx, row in df.iterrows():
    print(idx, row.Name)
#:> 100 Alice
#:> 101 Bob
#:> 102 Charles
#:> 103 David
#:> 104 Eric
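itertuples() is generally faster than iterrows() and preserves dtypes; a minimal sketch over the same df:
for row in df.itertuples():
    print(row.Index, row.Name)   # row.Index holds the empID index value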
13.7.12.2 Loop Through Columns (.items)
for label, content in df.items():
    print('Label:', label, '\n\n',
          'Content (Series):\n', content, '\n\n')
#:> Label: Name
#:>
#:> Content (Series):
#:> empID
#:> 100 Alice
#:> 101 Bob
#:> 102 Charles
#:> 103 David
#:> 104 Eric
#:> Name: Name, dtype: object
#:>
#:>
#:> Label: Year
#:>
#:> Content (Series):
#:> empID
#:> 100 1999
#:> 101 1988
#:> 102 2001
#:> 103 2010
#:> 104 2020
#:> Name: Year, dtype: int64
13.7.13 Data Structure
13.7.13.1 Instance Methods - Structure
Find out the column names and data types in one summary. The output is for display only, not a data object.
df.info()   # returns text output
#:> <class 'pandas.core.frame.DataFrame'>
#:> Int64Index: 5 entries, 100 to 104
#:> Data columns (total 2 columns):
#:> # Column Non-Null Count Dtype
#:> --- ------ -------------- -----
#:> 0 Name 5 non-null object
#:> 1 Year 5 non-null int64
#:> dtypes: int64(1), object(1)
#:> memory usage: 120.0+ bytes
df.get_dtype_counts()   # returns a Series; note this method was removed in pandas 1.0, use df.dtypes.value_counts() instead
13.7.13.2 Conversion To Other Format
df.to_json()
#:> '{"Name":{"100":"Alice","101":"Bob","102":"Charles","103":"David","104":"Eric"},"Year":{"100":1999,"101":1988,"102":2001,"103":2010,"104":2020}}'
df.to_records()
#:> rec.array([(100, 'Alice', 1999), (101, 'Bob', 1988),
#:> (102, 'Charles', 2001), (103, 'David', 2010),
#:> (104, 'Eric', 2020)],
#:> dtype=[('empID', '<i8'), ('Name', 'O'), ('Year', '<i8')])
df.to_csv()
#:> 'empID,Name,Year\n100,Alice,1999\n101,Bob,1988\n102,Charles,2001\n103,David,2010\n104,Eric,2020\n'
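Another common conversion is to_dict(); a minimal sketch:
df.to_dict()                   # nested dict: {column: {index: value}}
df.to_dict(orient='records')   # list of one dict per row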
13.8 class: MultiIndex
MultiIndexing gives columns (or rows) several levels of headers.
13.8.1 The Data
df = pd.DataFrame({
     'myindex': [0, 1, 2],
     'One_X':   [1.1, 1.1, 1.1],
     'One_Y':   [1.2, 1.2, 1.2],
     'Two_X':   [1.11, 1.11, 1.11],
     'Two_Y':   [1.22, 1.22, 1.22]})
df.set_index('myindex', inplace=True)
df
#:> One_X One_Y Two_X Two_Y
#:> myindex
#:> 0 1.1 1.2 1.11 1.22
#:> 1 1.1 1.2 1.11 1.22
#:> 2 1.1 1.2 1.11 1.22
13.8.2 Creating MultiIndex Object
13.8.2.1 Create From Tuples
A MultiIndex can easily be created from tuples:
- Step 1: Create a MultiIndex object by splitting each column name into a tuple
- Step 2: Assign the MultiIndex object to the dataframe's columns property.
my_tuples  = [tuple(c.split('_')) for c in df.columns]
df.columns = pd.MultiIndex.from_tuples(my_tuples)
print(' Column Headers :\n\n', my_tuples,
'\n\nNew Columns: \n\n', df.columns,
'\n\nTwo Layers Header DF:\n\n', df)
#:> Column Headers :
#:>
#:> [('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')]
#:>
#:> New Columns:
#:>
#:> MultiIndex([('One', 'X'),
#:> ('One', 'Y'),
#:> ('Two', 'X'),
#:> ('Two', 'Y')],
#:> )
#:>
#:> Two Layers Header DF:
#:>
#:> One Two
#:> X Y X Y
#:> myindex
#:> 0 1.1 1.2 1.11 1.22
#:> 1 1.1 1.2 1.11 1.22
#:> 2 1.1 1.2 1.11 1.22
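When the levels form a full cartesian product, pd.MultiIndex.from_product() builds the same object without manual tuple-splitting; a minimal sketch:
mi = pd.MultiIndex.from_product([['One','Two'], ['X','Y']])
print(mi.equals(df.columns))   # True: same MultiIndex as created above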
13.8.3 MultiIndex Object
13.8.3.1 Levels
- A MultiIndex object contains multiple levels; each level (header) is an Index object.
- Use MultiIndex.get_level_values() to get the entire header for the desired level.
print(df.columns.get_level_values(0), '\n',
      df.columns.get_level_values(1))
#:> Index(['One', 'One', 'Two', 'Two'], dtype='object')
#:> Index(['X', 'Y', 'X', 'Y'], dtype='object')
MultiIndex.levels returns the unique values of each level.
print(df.columns.levels[0], '\n',
      df.columns.levels[1])
#:> Index(['One', 'Two'], dtype='object')
#:> Index(['X', 'Y'], dtype='object')
13.8.4 Selecting Column(s)
13.8.4.1 Sample Data
import itertools
test_df = pd.DataFrame()
max_age = 100

### Create The Columns Tuple
level0_sex = ['Male','Female','Pondan']
level1_age = ['Medium','High','Low']
my_columns = list(itertools.product(level0_sex, level1_age))

test_df = pd.DataFrame([
    [ 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [11,12,13,14,15,16,17,18,19],
    [21,22,23,24,25,26,27,28,29]], index=['row1','row2','row3'])

### Create MultiIndex From Tuple
test_df.columns = pd.MultiIndex.from_tuples(my_columns)
print( test_df )
#:> Male Female Pondan
#:> Medium High Low Medium High Low Medium High Low
#:> row1 1 2 3 4 5 6 7 8 9
#:> row2 11 12 13 14 15 16 17 18 19
#:> row3 21 22 23 24 25 26 27 28 29
13.8.4.2 Select Level0 Header(s)
Use [L0] notation, where L0 is a list of header names.
print( test_df[['Male','Pondan']] ,'\n\n',   ## Include multiple Level0 Headers
       test_df['Male'] , '\n\n',             ## Include single Level0 Header
       test_df.Male )                        ## Same as above
#:> Male Pondan
#:> Medium High Low Medium High Low
#:> row1 1 2 3 7 8 9
#:> row2 11 12 13 17 18 19
#:> row3 21 22 23 27 28 29
#:>
#:> Medium High Low
#:> row1 1 2 3
#:> row2 11 12 13
#:> row3 21 22 23
#:>
#:> Medium High Low
#:> row1 1 2 3
#:> row2 11 12 13
#:> row3 21 22 23
Using .loc[]: use .loc[ :, L0 ], where L0 is a list of header names.
print( test_df.loc[:, ['Male','Pondan']] , '\n\n',   ## Multiple Level0 Headers
       test_df.loc[:, 'Male'] )                      ## Single Level0 Header
#:> Male Pondan
#:> Medium High Low Medium High Low
#:> row1 1 2 3 7 8 9
#:> row2 11 12 13 17 18 19
#:> row3 21 22 23 27 28 29
#:>
#:> Medium High Low
#:> row1 1 2 3
#:> row2 11 12 13
#:> row3 21 22 23
13.8.4.3 Selecting Level 1 Header(s)
Use .loc[ :, (All, L1)], where L1 is a list of header names.
All = slice(None)
print( test_df.loc[ : , (All, 'High')], '\n\n',         ## Single L1 header
       test_df.loc[ : , (All, ['High','Low'])] )        ## Multiple L1 headers
#:> Male Female Pondan
#:> High High High
#:> row1 2 5 8
#:> row2 12 15 18
#:> row3 22 25 28
#:>
#:> Male Female Pondan
#:> High Low High Low High Low
#:> row1 2 3 5 6 8 9
#:> row2 12 13 15 16 18 19
#:> row3 22 23 25 26 28 29
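.xs() is an alternative for selecting on a single inner level; note that it drops the selected level from the result by default. A minimal sketch:
test_df.xs('High', axis=1, level=1)   # all Level0 headers; the 'High' level is dropped from the columns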
13.8.4.4 Select Level 0 and Level1 Headers
Use .loc[ :, (L0, L1)], where L0 and L1 are lists of header names.
test_df.loc[ : , (['Male','Pondan'], ['Medium','High'])]
#:> Male Pondan
#:> Medium High Medium High
#:> row1 1 2 7 8
#:> row2 11 12 17 18
#:> row3 21 22 27 28
13.8.4.5 Select single L0,L1 Header
Use .loc[:, (L0, L1) ]; the result is a Series.
Use .loc[:, (L0, [L1])]; the result is a DataFrame.
print( test_df.loc[ : , ('Female', 'High')], '\n\n',
       test_df.loc[ : , ('Female', ['High'])] )
#:> row1 5
#:> row2 15
#:> row3 25
#:> Name: (Female, High), dtype: int64
#:>
#:> Female
#:> High
#:> row1 5
#:> row2 15
#:> row3 25
13.8.5 Headers Ordering
Note that the column order specified in a [ ] selection is not always respected. This can be remediated by sorting or rearranging.
13.8.5.1 Sort Headers
Use .sort_index() on the DataFrame to sort the headers. Note that when level1 is sorted, it jumbles up the level0 headers.
test_df_sorted_l0 = test_df.sort_index(axis=1, level=0)
test_df_sorted_l1 = test_df.sort_index(axis=1, level=1, ascending=False)
print(test_df, '\n\n', test_df_sorted_l0, '\n\n', test_df_sorted_l1)
#:> Male Female Pondan
#:> Medium High Low Medium High Low Medium High Low
#:> row1 1 2 3 4 5 6 7 8 9
#:> row2 11 12 13 14 15 16 17 18 19
#:> row3 21 22 23 24 25 26 27 28 29
#:>
#:> Female Male Pondan
#:> High Low Medium High Low Medium High Low Medium
#:> row1 5 6 4 2 3 1 8 9 7
#:> row2 15 16 14 12 13 11 18 19 17
#:> row3 25 26 24 22 23 21 28 29 27
#:>
#:> Pondan Male Female Pondan Male Female Pondan Male Female
#:> Medium Medium Medium Low Low Low High High High
#:> row1 7 1 4 9 3 6 8 2 5
#:> row2 17 11 14 19 13 16 18 12 15
#:> row3 27 21 24 29 23 26 28 22 25
13.8.5.2 Rearranging Headers
Use .reindex() to arrange columns in a specific order. The example below shows how to control the specific order of the level1 headers.
cats = ['Low','Medium','High']
test_df.reindex(cats, level=1, axis=1)
#:> Male Female Pondan
#:> Low Medium High Low Medium High Low Medium High
#:> row1 3 1 2 6 4 5 9 7 8
#:> row2 13 11 12 16 14 15 19 17 18
#:> row3 23 21 22 26 24 25 29 27 28
13.8.6 Stacking and Unstacking
df.stack()
#:> One Two
#:> myindex
#:> 0 X 1.1 1.11
#:> Y 1.2 1.22
#:> 1 X 1.1 1.11
#:> Y 1.2 1.22
#:> 2 X 1.1 1.11
#:> Y 1.2 1.22
13.8.6.1 Stacking Columns to Rows
Stacking with DataFrame.stack(level_no) moves wide columns into rows.
print('Stacking Header Level 0: \n\n', df.stack(0),
'\n\nStacking Header Level 1: \n\n', df.stack(1))
#:> Stacking Header Level 0:
#:>
#:> X Y
#:> myindex
#:> 0 One 1.10 1.20
#:> Two 1.11 1.22
#:> 1 One 1.10 1.20
#:> Two 1.11 1.22
#:> 2 One 1.10 1.20
#:> Two 1.11 1.22
#:>
#:> Stacking Header Level 1:
#:>
#:> One Two
#:> myindex
#:> 0 X 1.1 1.11
#:> Y 1.2 1.22
#:> 1 X 1.1 1.11
#:> Y 1.2 1.22
#:> 2 X 1.1 1.11
#:> Y 1.2 1.22
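unstack() is the inverse of stack(); a minimal sketch that recovers the original frame:
stacked = df.stack(1)   # move header level 1 into rows
stacked.unstack()       # move the innermost row level back into columns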
13.8.7 Exploratory Analysis
13.8.7.1 Sample Data
df
#:> One Two
#:> X Y X Y
#:> myindex
#:> 0 1.1 1.2 1.11 1.22
#:> 1 1.1 1.2 1.11 1.22
#:> 2 1.1 1.2 1.11 1.22
13.8.7.2 All Stats in One - .describe()
df.describe(include='number') # default
df.describe(include='object') # display for non-numeric columns
df.describe(include='all') # display both numeric and non-numeric
When applied to a DataFrame object, describe() shows basic statistics for all numeric columns:
- Count (non-NA)
- Unique (for string)
- Top (for string)
- Frequency (for string)
- Percentile
- Mean
- Min / Max
- Standard Deviation
For Numeric Columns only
You can customize the percentiles required. Notice the 50% percentile is always included although not specified.
df.describe()
#:> One Two
#:> X Y X Y
#:> count 3.0 3.0 3.00 3.00
#:> mean 1.1 1.2 1.11 1.22
#:> std 0.0 0.0 0.00 0.00
#:> min 1.1 1.2 1.11 1.22
#:> 25% 1.1 1.2 1.11 1.22
#:> 50% 1.1 1.2 1.11 1.22
#:> 75% 1.1 1.2 1.11 1.22
#:> max 1.1 1.2 1.11 1.22
df.describe(percentiles=[0.9,0.3,0.2,0.1])
#:> One Two
#:> X Y X Y
#:> count 3.0 3.0 3.00 3.00
#:> mean 1.1 1.2 1.11 1.22
#:> std 0.0 0.0 0.00 0.00
#:> min 1.1 1.2 1.11 1.22
#:> 10% 1.1 1.2 1.11 1.22
#:> 20% 1.1 1.2 1.11 1.22
#:> 30% 1.1 1.2 1.11 1.22
#:> 50% 1.1 1.2 1.11 1.22
#:> 90% 1.1 1.2 1.11 1.22
#:> max 1.1 1.2 1.11 1.22
For both Numeric and Object
df.describe(include='all')
#:> One Two
#:> X Y X Y
#:> count 3.0 3.0 3.00 3.00
#:> mean 1.1 1.2 1.11 1.22
#:> std 0.0 0.0 0.00 0.00
#:> min 1.1 1.2 1.11 1.22
#:> 25% 1.1 1.2 1.11 1.22
#:> 50% 1.1 1.2 1.11 1.22
#:> 75% 1.1 1.2 1.11 1.22
#:> max 1.1 1.2 1.11 1.22
13.8.7.3 min/max/mean/median
df.min()   # default axis=0, column-wise
#:> One X 1.10
#:> Y 1.20
#:> Two X 1.11
#:> Y 1.22
#:> dtype: float64
df.min(axis=1)   # axis=1, row-wise
#:> myindex
#:> 0 1.1
#:> 1 1.1
#:> 2 1.1
#:> dtype: float64
Observe that sum on strings concatenates column-wise, whereas row-wise sums include only the numeric fields.
df.sum(0)
#:> One X 3.30
#:> Y 3.60
#:> Two X 3.33
#:> Y 3.66
#:> dtype: float64
df.sum(1)
#:> myindex
#:> 0 4.63
#:> 1 4.63
#:> 2 4.63
#:> dtype: float64
13.9 class: Categories
13.9.1 Creating
13.9.1.1 From List
Basic (Auto Category Mapping)
The basic syntax returns a Categorical with codes 0,1,2,… mapped to the lexically sorted categories.
In this case: high(0), low(1), medium(2).
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp)
temp_cat
#:> ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
#:> Categories (3, object): ['high', 'low', 'medium']
type( temp_cat )
#:> <class 'pandas.core.arrays.categorical.Categorical'>
Manual Category Mapping
During creation, we can specify the mapping of codes to categories: low(0), medium(1), high(2)
temp_cat = pd.Categorical(temp, categories=['low','medium','high'])
temp_cat
#:> ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
#:> Categories (3, object): ['low', 'medium', 'high']
13.9.1.2 From Series
- We can ‘add’ categorical structure into a Series. With these methods, an additional property (.cat) is added as a categorical accessor
- Through this accessor, you gain access to various properties of the category such as .codes and .categories, but not .get_values(), as the values live in the Series itself
- Can we manually map categories? (see the sketch at the end of this subsection)
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Series(temp, dtype='category')
print (type(temp_cat))       # Series object
#:> <class 'pandas.core.series.Series'>
print (type(temp_cat.cat)) # Categorical Accessor
#:> <class 'pandas.core.arrays.categorical.CategoricalAccessor'>
- The method below achieves the same result as above by using .astype('category')
- It is useful for adding category structure into an existing series.
temp_ser = pd.Series(temp)
temp_cat = pd.Series(temp).astype('category')
print (type(temp_cat))       # Series object
#:> <class 'pandas.core.series.Series'>
print (type(temp_cat.cat)) # Categorical Accessor
#:> <class 'pandas.core.arrays.categorical.CategoricalAccessor'>
temp_cat.cat.categories
#:> Index(['high', 'low', 'medium'], dtype='object')
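To answer the open question above: yes, a manual category mapping (and ordering) can be supplied when building a Series, via pd.CategoricalDtype. A minimal sketch:
cat_type = pd.CategoricalDtype(categories=['low','medium','high'], ordered=True)
temp_cat = pd.Series(temp).astype(cat_type)
print(temp_cat.cat.categories)   # Index(['low', 'medium', 'high'], dtype='object')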
13.9.1.3 Ordering Category
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp, categories=['low','medium','high'], ordered=True)
temp_cat
#:> ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
#:> Categories (3, object): ['low' < 'medium' < 'high']
temp_cat.get_values()   # error: .get_values() was removed in pandas 1.0; use np.asarray(temp_cat) instead
temp_cat.codes
#:> array([0, 2, 1, 2, 2, 0, 1, 1, 2], dtype=int8)
temp_cat[0] < temp_cat[3]
#:> False
13.9.2 Properties
13.9.3 Rename Category
13.9.3.1 Rename To New Category Object
The .rename_categories() method returns a new categorical object with the changed categories.
temp = ['low','high','medium','high','high','low','medium','medium','high']
new_temp_cat = temp_cat.rename_categories(['sejuk','sederhana','panas'])
new_temp_cat
#:> ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
#:> Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
temp_cat   # original categorical object's categories are not changed
#:> ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
#:> Categories (3, object): ['low' < 'medium' < 'high']
13.9.3.2 Rename Inplace
Observe that the original categories are changed by assigning to the .categories property
temp_cat.categories = ['sejuk','sederhana','panas']
temp_cat   # original categorical object's categories are changed
#:> ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
#:> Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
13.9.4 Adding New Category
This returns a new categorical object with the added categories
temp_cat_more = temp_cat.add_categories(['susah','senang'])
temp_cat_more
#:> ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
#:> Categories (5, object): ['sejuk' < 'sederhana' < 'panas' < 'susah' < 'senang']
13.9.5 Removing Category
This is not in place; it returns a new categorical object
13.9.5.1 Remove Specific Categor(ies)
Elements whose category is removed become NaN
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp)
temp_cat_removed = temp_cat.remove_categories('low')
temp_cat_removed
#:> [NaN, 'high', 'medium', 'high', 'high', NaN, 'medium', 'medium', 'high']
#:> Categories (2, object): ['high', 'medium']
13.9.5.2 Remove Unused Category
Since the removed categories are not used, there is no impact on the elements
print (temp_cat_more)
#:> ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
#:> Categories (5, object): ['sejuk' < 'sederhana' < 'panas' < 'susah' < 'senang']
temp_cat_more.remove_unused_categories()
#:> ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
#:> Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
13.9.6 Add and Remove Categories In One Step - set_categories()
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp, ordered=True)
temp_cat
#:> ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
#:> Categories (3, object): ['high' < 'low' < 'medium']
temp_cat.set_categories(['low','medium','sederhana','susah','senang'])
#:> ['low', NaN, 'medium', NaN, NaN, 'low', 'medium', 'medium', NaN]
#:> Categories (5, object): ['low' < 'medium' < 'sederhana' < 'susah' < 'senang']
13.10 Dummies
- get_dummies creates a column for each category
- The underlying data can be string or pd.Categorical
- It produces a new pd.DataFrame
13.10.1 Sample Data
df = pd.DataFrame (
    {'A': ['A1', 'A2', 'A3','A1','A3','A1'],
     'B': ['B1','B2','B3','B1','B1','B3'],
     'C': ['C1','C2','C3','C1',np.nan,np.nan]})
df
#:> A B C
#:> 0 A1 B1 C1
#:> 1 A2 B2 C2
#:> 2 A3 B3 C3
#:> 3 A1 B1 C1
#:> 4 A3 B1 NaN
#:> 5 A1 B3 NaN
13.10.2 Dummies on Array-Like Data
pd.get_dummies(df.A)
#:> A1 A2 A3
#:> 0 1 0 0
#:> 1 0 1 0
#:> 2 0 0 1
#:> 3 1 0 0
#:> 4 0 0 1
#:> 5 1 0 0
13.10.3 Dummies on DataFrame (multiple columns)
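The code for this subsection is missing from the source. Based on the outputs that follow, it presumably selected columns A and B; a minimal reconstruction:
cols = ['A','B']
pd.get_dummies(df[cols])   # one dummy column per category of each selected column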
13.10.4 Dummies with na
By default, NaN values are ignored
pd.get_dummies(df.C)
#:> C1 C2 C3
#:> 0 1 0 0
#:> 1 0 1 0
#:> 2 0 0 1
#:> 3 1 0 0
#:> 4 0 0 0
#:> 5 0 0 0
Make NaN a dummy variable
pd.get_dummies(df.C, dummy_na=True)
#:> C1 C2 C3 NaN
#:> 0 1 0 0 0
#:> 1 0 1 0 0
#:> 2 0 0 1 0
#:> 3 1 0 0 0
#:> 4 0 0 0 1
#:> 5 0 0 0 1
13.10.5 Specify Prefixes
pd.get_dummies(df.A, prefix='col')
#:> col_A1 col_A2 col_A3
#:> 0 1 0 0
#:> 1 0 1 0
#:> 2 0 0 1
#:> 3 1 0 0
#:> 4 0 0 1
#:> 5 1 0 0
pd.get_dummies(df[cols], prefix=['colA','colB'])
#:> colA_A1 colA_A2 colA_A3 colB_B1 colB_B2 colB_B3
#:> 0 1 0 0 1 0 0
#:> 1 0 1 0 0 1 0
#:> 2 0 0 1 0 0 1
#:> 3 1 0 0 1 0 0
#:> 4 0 0 1 1 0 0
#:> 5 1 0 0 0 0 1
13.10.6 Dropping First Column
- Dummies cause a collinearity issue for regression, as they contain a redundant column.
- Dropping a column technically does not lose any information.
pd.get_dummies(df[cols], drop_first=True)
#:> A_A2 A_A3 B_B2 B_B3
#:> 0 0 0 0 0
#:> 1 1 0 1 0
#:> 2 0 1 0 1
#:> 3 0 0 0 0
#:> 4 0 1 0 0
#:> 5 0 0 0 1
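As an aside, pd.from_dummies() (available from pandas 1.5) inverts the encoding; a minimal sketch:
dummies = pd.get_dummies(df[['A']])   # columns A_A1, A_A2, A_A3
pd.from_dummies(dummies, sep='_')     # recovers the original column 'A'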
13.11 DataFrameGroupBy
- groupby() is a DataFrame method; it returns a DataFrameGroupBy object
- The DataFrameGroupBy object opens the door to dataframe aggregation and summarization
- The DataFrameGroupBy object is a very flexible abstraction. In many ways, you can simply treat it as a collection of DataFrames, and it does the difficult things under the hood
13.11.1 Sample Data
company = pd.read_csv('data/company.csv')
company
#:> Company Department Name Age Salary Birthdate
#:> 0 C1 D1 Yong 45 15000 1/1/1970
#:> 1 C1 D1 Chew 35 12000 2/1/1980
#:> 2 C1 D2 Lim 34 8000 2/19/1977
#:> 3 C1 D3 Jessy 23 2500 3/15/1990
#:> 4 C1 D3 Hoi Ming 55 25000 4/15/1987
#:> .. ... ... ... ... ... ...
#:> 13 C3 D3 Chang 32 7900 7/26/1973
#:> 14 C3 D1 Ong 44 17500 8/21/1980
#:> 15 C3 D2 Lily 41 15300 7/17/1990
#:> 16 C3 D3 Sally 54 21000 7/19/1968
#:> 17 C3 D3 Esther 37 13500 3/16/1969
#:>
#:> [18 rows x 6 columns]
13.11.2 Creating Groups
Groups can be created on a single column or on multiple columns
com_grp     = company.groupby('Company')                 ## Single Column
com_dep_grp = company.groupby(['Company','Department'])  ## Multiple Columns
type(com_dep_grp)
#:> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
13.11.3 Properties
13.11.3.2 Row Numbers Associated For Each Group
The .groups property is a dictionary whose keys identify each group and whose values are the underlying row indexes of the group.
gdict = com_dep_grp.groups       # returns a Dictionary
print( gdict.keys() , '\n\n',    # group identifiers
       gdict.values() )          # group row indexes
#:> dict_keys([('C1', 'D1'), ('C1', 'D2'), ('C1', 'D3'), ('C2', 'D1'), ('C2', 'D2'), ('C2', 'D3'), ('C3', 'D1'), ('C3', 'D2'), ('C3', 'D3')])
#:>
#:> dict_values([Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64'), Int64Index([3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64'), Int64Index([7, 8, 9], dtype='int64'), Int64Index([10, 11, 12], dtype='int64'), Int64Index([14], dtype='int64'), Int64Index([15], dtype='int64'), Int64Index([13, 16, 17], dtype='int64')])
13.11.5 Retrieve Rows
13.11.5.1 Retrieve n-th Row Of Each Group
- Row number is 0-based
- For the first row, use .first() or .nth(0)
print( com_dep_grp.nth(0) , '\n',
com_dep_grp.first())
#:> Name Age Salary Birthdate
#:> Company Department
#:> C1 D1 Yong 45 15000 1/1/1970
#:> D2 Lim 34 8000 2/19/1977
#:> D3 Jessy 23 2500 3/15/1990
#:> C2 D1 Anne 18 400 7/15/1997
#:> D2 Deborah 30 8600 8/15/1984
#:> D3 Michael 38 17000 11/30/1997
#:> C3 D1 Ong 44 17500 8/21/1980
#:> D2 Lily 41 15300 7/17/1990
#:> D3 Chang 32 7900 7/26/1973
#:> Name Age Salary Birthdate
#:> Company Department
#:> C1 D1 Yong 45 15000 1/1/1970
#:> D2 Lim 34 8000 2/19/1977
#:> D3 Jessy 23 2500 3/15/1990
#:> C2 D1 Anne 18 400 7/15/1997
#:> D2 Deborah 30 8600 8/15/1984
#:> D3 Michael 38 17000 11/30/1997
#:> C3 D1 Ong 44 17500 8/21/1980
#:> D2 Lily 41 15300 7/17/1990
#:> D3 Chang 32 7900 7/26/1973
- For the last row, use .last() or .nth(-1)
print( com_dep_grp.nth(-1) , '\n',
com_dep_grp.last())
#:> Name Age Salary Birthdate
#:> Company Department
#:> C1 D1 Chew 35 12000 2/1/1980
#:> D2 Lim 34 8000 2/19/1977
#:> D3 Sui Wei 56 3000 6/15/1990
#:> C2 D1 Anne 18 400 7/15/1997
#:> D2 Jimmy 46 14000 10/31/1988
#:> D3 Bernard 29 9800 12/1/1963
#:> C3 D1 Ong 44 17500 8/21/1980
#:> D2 Lily 41 15300 7/17/1990
#:> D3 Esther 37 13500 3/16/1969
#:> Name Age Salary Birthdate
#:> Company Department
#:> C1 D1 Chew 35 12000 2/1/1980
#:> D2 Lim 34 8000 2/19/1977
#:> D3 Sui Wei 56 3000 6/15/1990
#:> C2 D1 Anne 18 400 7/15/1997
#:> D2 Jimmy 46 14000 10/31/1988
#:> D3 Bernard 29 9800 12/1/1963
#:> C3 D1 Ong 44 17500 8/21/1980
#:> D2 Lily 41 15300 7/17/1990
#:> D3 Esther 37 13500 3/16/1969
13.11.5.2 Retrieve N Rows Of Each Groups
The example below retrieves 2 rows from each group
com_dep_grp.head(2)
#:> Company Department Name Age Salary Birthdate
#:> 0 C1 D1 Yong 45 15000 1/1/1970
#:> 1 C1 D1 Chew 35 12000 2/1/1980
#:> 2 C1 D2 Lim 34 8000 2/19/1977
#:> 3 C1 D3 Jessy 23 2500 3/15/1990
#:> 4 C1 D3 Hoi Ming 55 25000 4/15/1987
#:> .. ... ... ... ... ... ...
#:> 11 C2 D3 Jeannie 30 12500 12/31/1980
#:> 13 C3 D3 Chang 32 7900 7/26/1973
#:> 14 C3 D1 Ong 44 17500 8/21/1980
#:> 15 C3 D2 Lily 41 15300 7/17/1990
#:> 16 C3 D3 Sally 54 21000 7/19/1968
#:>
#:> [14 rows x 6 columns]
13.11.5.3 Retrieve All Rows Of Specific Group
get_group() retrieves all rows within the specified group.
com_dep_grp.get_group(('C1','D3'))
#:> Company Department Name Age Salary Birthdate
#:> 3 C1 D3 Jessy 23 2500 3/15/1990
#:> 4 C1 D3 Hoi Ming 55 25000 4/15/1987
#:> 5 C1 D3 Sui Wei 56 3000 6/15/1990
13.11.6 Single Statistic Per Group
13.11.6.1 count()
count() counts valid (non-null) data for each field within the group
com_dep_grp.count()   # returns a pandas DataFrame object
#:> Name Age Salary Birthdate
#:> Company Department
#:> C1 D1 2 2 2 2
#:> D2 1 1 1 1
#:> D3 3 3 3 3
#:> C2 D1 1 1 1 1
#:> D2 3 3 3 3
#:> D3 3 3 3 3
#:> C3 D1 1 1 1 1
#:> D2 1 1 1 1
#:> D3 3 3 3 3
13.11.6.2 sum()
This sums up all numeric columns for each group
com_dep_grp.sum()
#:> Age Salary
#:> Company Department
#:> C1 D1 80 27000
#:> D2 34 8000
#:> D3 134 30500
#:> C2 D1 18 400
#:> D2 127 34600
#:> D3 97 39300
#:> C3 D1 44 17500
#:> D2 41 15300
#:> D3 123 42400
To sum specific columns of each group, use ['columnName'] to select the column.
When a single column is selected, the output is a Series.
com_dep_grp['Age'].sum()
#:> Company Department
#:> C1 D1 80
#:> D2 34
#:> D3 134
#:> C2 D1 18
#:> D2 127
#:> D3 97
#:> C3 D1 44
#:> D2 41
#:> D3 123
#:> Name: Age, dtype: int64
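To keep the result as a DataFrame rather than a Series, select with double brackets:
com_dep_grp[['Age']].sum()   # [['Age']] keeps a one-column DataFrame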
13.11.6.3 mean()
This averages all numeric columns for each group
com_dep_grp.mean()
#:> Age Salary
#:> Company Department
#:> C1 D1 40.000000 13500.000000
#:> D2 34.000000 8000.000000
#:> D3 44.666667 10166.666667
#:> C2 D1 18.000000 400.000000
#:> D2 42.333333 11533.333333
#:> D3 32.333333 13100.000000
#:> C3 D1 44.000000 17500.000000
#:> D2 41.000000 15300.000000
#:> D3 41.000000 14133.333333
To average specific columns of each group, use ['columnName'] to select the column.
When a single column is selected, the output is a Series.
com_dep_grp['Age'].mean()
#:> Company Department
#:> C1 D1 40.000000
#:> D2 34.000000
#:> D3 44.666667
#:> C2 D1 18.000000
#:> D2 42.333333
#:> D3 32.333333
#:> C3 D1 44.000000
#:> D2 41.000000
#:> D3 41.000000
#:> Name: Age, dtype: float64
13.11.7 Multi Statistic Per Group
13.11.7.1 Single Function To Column(s)
- Instructions for aggregation are provided in the form of a dictionary. The dictionary key specifies the column name, and the value the function to run
- Use lambda x: to customize the calculation on the entire column (x)
- Python built-in function names can be supplied without wrapping them in a string 'function'
com_dep_grp.agg({'Age': sum , ## Total age of the group
'Salary': lambda x: max(x), ## Highest salary of the group
'Birthdate': 'first' ## First birthday of the group
})
#:> Age Salary Birthdate
#:> Company Department
#:> C1 D1 80 15000 1/1/1970
#:> D2 34 8000 2/19/1977
#:> D3 134 25000 3/15/1990
#:> C2 D1 18 400 7/15/1997
#:> D2 127 14000 8/15/1984
#:> D3 97 17000 11/30/1997
#:> C3 D1 44 17500 8/21/1980
#:> D2 41 15300 7/17/1990
#:> D3 123 21000 7/26/1973
13.11.7.2 Multiple Function to Column(s)
- Use a list of function names to specify the functions to be applied to a particular column
- Notice that the output columns are a MultiIndex, with the names of the applied functions on level 1
ag = com_dep_grp.agg({
         'Age':       ['mean', sum ],     ## Average age of the group
         'Salary':    lambda x: max(x),   ## Highest salary of the group
         'Birthdate': 'first'             ## First birthday of the group
     })
print (ag, '\n\n', ag.columns)
#:> Age Salary Birthdate
#:> mean sum <lambda> first
#:> Company Department
#:> C1 D1 40.000000 80 15000 1/1/1970
#:> D2 34.000000 34 8000 2/19/1977
#:> D3 44.666667 134 25000 3/15/1990
#:> C2 D1 18.000000 18 400 7/15/1997
#:> D2 42.333333 127 14000 8/15/1984
#:> D3 32.333333 97 17000 11/30/1997
#:> C3 D1 44.000000 44 17500 8/21/1980
#:> D2 41.000000 41 15300 7/17/1990
#:> D3 41.000000 123 21000 7/26/1973
#:>
#:> MultiIndex([( 'Age', 'mean'),
#:> ( 'Age', 'sum'),
#:> ( 'Salary', '<lambda>'),
#:> ('Birthdate', 'first')],
#:> )
13.11.7.3 Column Relabeling
Introduced in pandas 0.25.0, groupby aggregation with relabeling is supported using “named aggregation” with simple tuples.
com_dep_grp.agg(
    max_age     = ('Age', max),
    salary_m100 = ('Salary', lambda x: max(x)+100),
    first_bd    = ('Birthdate', 'first')
)
#:> max_age salary_m100 first_bd
#:> Company Department
#:> C1 D1 45 15100 1/1/1970
#:> D2 34 8100 2/19/1977
#:> D3 56 25100 3/15/1990
#:> C2 D1 18 500 7/15/1997
#:> D2 51 14100 8/15/1984
#:> D3 38 17100 11/30/1997
#:> C3 D1 44 17600 8/21/1980
#:> D2 41 15400 7/17/1990
#:> D3 54 21100 7/26/1973
13.11.8 Iteration
The DataFrameGroupBy object can be thought of as a collection of named groups
def print_groups (g):
    for name, group in g:
        print (name)
        print (group[:2])

print_groups (com_grp)
#:> C1
#:> Company Department Name Age Salary Birthdate
#:> 0 C1 D1 Yong 45 15000 1/1/1970
#:> 1 C1 D1 Chew 35 12000 2/1/1980
#:> C2
#:> Company Department Name Age Salary Birthdate
#:> 6 C2 D1 Anne 18 400 7/15/1997
#:> 7 C2 D2 Deborah 30 8600 8/15/1984
#:> C3
#:> Company Department Name Age Salary Birthdate
#:> 13 C3 D3 Chang 32 7900 7/26/1973
#:> 14 C3 D1 Ong 44 17500 8/21/1980
com_grp
#:> <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8074bdd190>
13.11.9 Transform
- Transform is an operation used in combination with a DataFrameGroupBy object
- transform() returns a new DataFrame object
grp = company.groupby('Company')
grp.size()
#:> Company
#:> C1 6
#:> C2 7
#:> C3 5
#:> dtype: int64
transform() applies a function to each group, then expands and replicates the result to match the rows of the original DataFrame
grp[['Age','Salary']].transform('sum')
#:> Age Salary
#:> 0 248 65500
#:> 1 248 65500
#:> 2 248 65500
#:> 3 248 65500
#:> 4 248 65500
#:> .. ... ...
#:> 13 208 75200
#:> 14 208 75200
#:> 15 208 75200
#:> 16 208 75200
#:> 17 208 75200
#:>
#:> [18 rows x 2 columns]
grp.transform( lambda x: x+10 )
#:> Age Salary
#:> 0 55 15010
#:> 1 45 12010
#:> 2 44 8010
#:> 3 33 2510
#:> 4 65 25010
#:> .. ... ...
#:> 13 42 7910
#:> 14 54 17510
#:> 15 51 15310
#:> 16 64 21010
#:> 17 47 13510
#:>
#:> [18 rows x 2 columns]
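A typical use of transform (an assumed example, not from the original) is computing group-relative values, since the result aligns with the original rows:
company['SalaryVsCoMean'] = company['Salary'] - grp['Salary'].transform('mean')
company[['Company','Name','Salary','SalaryVsCoMean']].head()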
13.13 Missing Data
13.13.2 Sample Data
df = pd.DataFrame( np.random.randn(5, 3),
                   index   = ['a', 'c', 'e', 'f', 'h'],
                   columns = ['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
#df
df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
#:> one two three four five
#:> a -0.155909 -0.501790 0.235569 bar False
#:> b NaN NaN NaN NaN NaN
#:> c -1.763605 -1.095862 -1.087766 bar False
#:> d NaN NaN NaN NaN NaN
#:> e -0.305170 -0.473748 -0.200595 bar False
#:> f 0.355197 0.689518 0.410590 bar True
#:> g NaN NaN NaN NaN NaN
#:> h -0.564978 0.599391 -0.162936 bar False
How Much Missing Data Is There For Each Column?
Note: the reindexed dataframe above was not assigned back to df, so df itself still has no missing values; the counts below reflect that.
df.count()
#:> one 5
#:> two 5
#:> three 5
#:> four 5
#:> five 5
#:> dtype: int64
len(df.index) - df.count()
#:> one 0
#:> two 0
#:> three 0
#:> four 0
#:> five 0
#:> dtype: int64
df.isnull()
#:> one two three four five
#:> a False False False False False
#:> c False False False False False
#:> e False False False False False
#:> f False False False False False
#:> h False False False False False
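The idiomatic per-column missing count combines isnull() with sum():
df.isnull().sum()   # count of missing values in each column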
df.describe()
#:> one two three
#:> count 5.000000 5.000000 5.000000
#:> mean -0.486893 -0.156498 -0.161028
#:> std 0.788635 0.772882 0.579752
#:> min -1.763605 -1.095862 -1.087766
#:> 25% -0.564978 -0.501790 -0.200595
#:> 50% -0.305170 -0.473748 -0.162936
#:> 75% -0.155909 0.599391 0.235569
#:> max 0.355197 0.689518 0.410590