數據操作(Pandas) 完整
1.1. pandas 對象
引入 pandas 等包,DataFrame、Series 屬於常用的,所以直接引入
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
DataFrame 對象:Pandas DataFrame 是一個表格型的數據結構,有行索引也有列索引
from IPython.display import Image
Image(filename='../../image/DataFrame.png', width=400)
Series 對象:類似於一維數組的對象,由一組同樣 type 的數組和索引組成
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'
index 對象:即 Series 和 DataFrame 的索引
# 獲取索引
df = DataFrame(s1)
idx = s1.index
idx = df.columns # the column index
idx = df.index # the row index
# 索引的一些特性
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexe
# 索引的一些方法
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
# idx = idx.astype(dtype) # change data type
# b = idx.equals(other) # check for equality 看看是否是相同的索引
# union of two indexes 合併兩個索引
# idx = idx.union(other)
idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])
idx1.union(idx2)
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label
創建 Series 和 DataFrame
http://pandas.pydata.org/pandas-docs/stable/dsintro.html
1.2. DataFrame 入門
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
DataFrame 的一些實用查看方法
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A 10 non-null float64
B 10 non-null float64
C 10 non-null float64
D 10 non-null float64
dtypes: float64(4)
memory usage: 392.0 bytes
n=4
dfh = df.head(n) # 看前 n 行
dft = df.tail(n) # 看後 n 行
dfs = df.describe() # 各類統計信息
top_left_corner_df = df.iloc[:5, :5]
dfT = df.T # transpose rows and cols
DataFrame index 的一些特性
l = df.axes # list row and col indexes
l
[RangeIndex(start=0, stop=10, step=1),
Index([u'A', u'B', u'C', u'D'], dtype='object')]
(r, c) = df.axes # from above
s = df.dtypes # Series column data types
A float64
B float64
C float64
D float64
dtype: object
b = df.empty # True for empty DataFrame
b
False
i = df.ndim # number of axes (2)
i
2
t = df.shape # (row-count, column-count)
t
(10, 4)
(r, c) = df.shape # from above
(r, c)
(10, 4)
i = df.size # row-count * column-count
i
40
a = df.values # get a numpy array for df
實用方法
df = DataFrame([1, 23, 3, 5, 2])
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default) 把每個值的地位列出了
dfs = df.sort() # sort each col (default)
# dfc = df.astype(dtype) # type conversion
/Users/Scott/Library/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(....) is deprecated, use sort_index(.....)
app.launch_new_instance()
# 下面的兩個方法沒怎麼搞懂
df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
print('Col name: ' + str(name))
print('First value: ' +
str(series.iat[0]) + '\n')
Col name: 0
First value: 1
通用函數
method | ## |
---|---|
df = df.abs() | absolute values |
df = df.add(o) | add df, Series or value |
s = df.count() | non NA/null values |
df = df.cummax() | (cols default axis) |
df = df.cummin() | (cols default axis) |
df = df.cumsum() | (cols default axis) |
df = df.cumprod() | (cols default axis) |
df = df.diff() | 1st diff (col def axis) |
df = df.div(o) | div by df, Series, value |
df = df.dot(o) | matrix dot product |
s = df.max() | max of axis (col def) |
s = df.mean() | mean (col default axis) |
s = df.median() | median (col default) |
s = df.min() | min of axis (col def) |
df = df.mul(o) | mul by df Series val |
s = df.sum() | sum axis (cols default) |
1.3. DataFrame Columns 列處理
column 其實也是一個 Series
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list
lst
['A', 'B', 'C', 'D']
In [32]:
label
Out[32]:
'A'
In [33]:
idx
Out[33]:
Index([u'A', u'B', u'C', u'D'], dtype='object')
column 改名
# df.rename(columns={'old':'new'}, inplace=True)
# df = df.rename(columns={'a':1,'b':'x'})
選擇 columns, 也就是提取列
s = df['C'] # select col to Series
df = df[['C']] # select col to df
df = df[['A','B']] # select 2 or more
df = df[['C', 'B', 'A']]# change order 改變排序了
s = df[df.columns[0]] # select by number
f = df[df.columns[[0, 3, 4]] # by number
s = df.pop('C') # get col & drop from df == df['C']
用 python 特性提取列
s = df.A # same as s = df['A'],
# 但不能用 python 特性創建新的 columns
# df['new_col'] = df.a / df.b
添加新的 columns,添加一個 column 是極為方便的,只要能添加一組數據就行
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df.head(2)
A | B | C | D | new_col | random | index_as_col | |
---|---|---|---|---|---|---|---|
0 | 0.458326 | -1.402187 | 0.446208 | -0.459079 | NaN | 0.920599 | 0 |
1 | 0.366833 | 0.618661 | -0.727332 | 1.152775 | NaN | 0.503750 | 1 |
詳情參考 df1[[‘b’,’c’]] = df2[[‘e’,’f’]] df3 = df1.append(other=df2)
判定函數 pd.Series.where
# 符合 >0 條件的保持原值,其他 =0
df['A'] = df['A'].where(df['A']>0, other=0)
# df['d']=df['a'].where(df.b!=0,other=df.c)
數據格式 轉換一列的格式時非常有用。
s = df[‘col’].astype(str) # Series dtype na = df[‘col’].values # numpy array pl = df[‘col’].tolist() # python list
columns 的一些特性和方法
value = df[‘col’].dtype # type of data value = df[‘col’].size # col dimensions value = df[‘col’].count()# non-NA count value = df[‘col’].sum() value = df[‘col’].prod() value = df[‘col’].min() value = df[‘col’].max() value = df[‘col’].mean() value = df[‘col’].median() value = df[‘col’].cov(df[‘col2’]) s = df[‘col’].describe() s = df[‘col’].value_counts()
找出最小值和最大值的位置
df['B'].idxmax()
df['B'].idxmin()
7
元素級方法
s = df[‘col’].isnull() s = df[‘col’].notnull() # not isnull() s = df[‘col’].astype(float) s = df[‘col’].round(decimals=0) s = df[‘col’].diff(periods=1) s = df[‘col’].shift(periods=1) s = df[‘col’].to_datetime() s = df[‘col’].fillna(0) # replace NaN w 0 s = df[‘col’].cumsum() s = df[‘col’].cumprod() s = df[‘col’].pct_change(periods=4) s = df[‘col’].rolling_sum(periods=4, window=4)
df = df.mul(s, axis=0) # on matched rows,相當於 * other Series 每行都與之相乘
df.columns.get_loc('B')
0
df = df.iloc[:, 0:2] # exclusive
獲取 columns 的具體位置
df
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
two | 2001 | Ohino | 1.7 | NaN |
three | 2002 | Ohino | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
下面那個好像沒什麼軟用
In [167]:
for i in ['pop', 'state']:
print df.columns.get_loc(i)
2
1
In [152]:
Series(df.columns)
Out[152]:
0 year
1 state
2 pop
3 debt
dtype: object
1.4. DataFrame rows 行處理
獲取索引和標籤
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list
改變索引或行名
df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)
Drop row 刪除行
df = df.drop(‘row_label’) df = df.drop(row1) # multi-row df = df.drop([‘row1’,’row2’]) # multi-row
查找一些行
In [23]:
# fake up some data
data = {1:[1,2,3], 2:[4,1,9], 3:[1,8,27]}
df = pd.DataFrame(data)
In [4]:
# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] # 這裡看不太懂
對行做排序
df_obj.sort(columns = ‘’)#按列名進行排序
df_obj.sort_index(by=[‘’,’’])#多列排序,使用時報該函數已過時,請用sort_values
df_obj.sort_values(by=['',''])同上
索引前奏
df0 = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=[3, 2, 1])
df1 = DataFrame([[1, 2, 3,], [3, 4, 5], [6, 7, 8]], index=[3, 2, 1])
df0
x | y | |
---|---|---|
3 | 1 | 3 |
2 | 2 | 4 |
1 | 3 | 5 |
df0[1:2]
x | y | |
---|---|---|
2 | 2 | 4 |
df1
0 | 1 | 2 | |
---|---|---|---|
3 | 1 | 2 | 3 |
2 | 3 | 4 | 5 |
1 | 6 | 7 | 8 |
df1[0]
3 1
2 3
1 6
Name: 0, dtype: int64
df1[0:2]
0 | 1 | 2 | |
---|---|---|---|
3 | 1 | 2 | 3 |
2 | 3 | 4 | 5 |
df1.ix[:, 0:2]
0 | 1 | 2 | |
---|---|---|---|
3 | 1 | 2 | 3 |
2 | 3 | 4 | 5 |
1 | 6 | 7 | 8 |
df0[['x','y']]
x | y | |
---|---|---|
3 | 1 | 3 |
2 | 2 | 4 |
1 | 3 | 5 |
1.5. 索引和切片
整數時一般是不包含的,非整數則會包含尾巴(基於 label)
foo = DataFrame([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])
bar = DataFrame([4.5, 7.2, -5.3, 3.6], index=range(4))
print(foo)
print '------'
print(bar)
0
a 4.5
b 7.2
c -5.3
d 3.6
------
0
0 4.5
1 7.2
2 -5.3
3 3.6
print foo[:2]
print '------'
print bar[:2]
print '------'
print foo[:'c']
0
a 4.5
b 7.2
------
0
0 4.5
1 7.2
------
0
a 4.5
b 7.2
c -5.3
ix[::, ::] 可以接受兩套切片(axis=0)橫向,(axis=1)列向
data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}
df = DataFrame(data,index=['one','two','three','four','five'],
columns=['year','state','pop','debt'])
df
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
two | 2001 | Ohino | 1.7 | NaN |
three | 2002 | Ohino | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
df.ix[:, 'state':'pop']
state | pop | |
---|---|---|
one | Ohino | 1.5 |
two | Ohino | 1.7 |
three | Ohino | 3.6 |
four | Nevada | 2.4 |
five | Nevada | 2.9 |
df.ix[1] # 切的是行,所以說 ix 默認切的行, 也就是 axis=0
year 2001
state Ohino
pop 1.7
debt NaN
Name: two, dtype: object
非 ix
ix 可以說是 pandas 的標準切法,而沒有 ix 時,情況就略複雜些了,作者說:
索引時,選取的是列
切片時,選取的是行
記住一點,如果你想看單列或少數列的索引,那麼直接用 df[‘column’], 其他就
print(type(df['year']))
print(type(df[['year']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
# df['one'] # 會報錯,沒辦法這樣索引,這是行
df[['year', 'state']] # 可運行
year | state | |
---|---|---|
one | 2000 | Ohino |
two | 2001 | Ohino |
three | 2002 | Ohino |
four | 2001 | Nevada |
five | 2002 | Nevada |
df[0:1] # 切第一行,直接 df[0] 是會報錯的。而 ix 不會。
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
df['one':'two'] # 所以他也是可以整數切,也能標籤切
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohino | 1.5 | NaN |
two | 2001 | Ohino | 1.7 | NaN |
print(df.columns.tolist())
print(df.index.tolist())
['year', 'state', 'pop', 'debt']
['one', 'two', 'three', 'four', 'five']
df.loc[:, 'year':'state']
year | state | |
---|---|---|
one | 2000 | Ohino |
two | 2001 | Ohino |
three | 2002 | Ohino |
four | 2001 | Nevada |
five | 2002 | Nevada |
df.iloc[:, 1:2]
state | |
---|---|
one | Ohino |
two | Ohino |
three | Ohino |
four | Nevada |
five | Nevada |
.loc[label] 這是嚴格基於標籤的索引
.iloc[inte] 這是嚴格基於整數位置的索引
.ix[] 更像是這兩種嚴格方式的智能整合版。
# df.loc[1:2] 用 label 的去切整數,自然會出錯
# df.iloc['two':'three'] 也會出錯
df.loc['two':'three']
year | state | pop | debt | |
---|---|---|---|---|
two | 2001 | Ohino | 1.7 | NaN |
three | 2002 | Ohino | 3.6 | NaN |
df.iloc[1:2]
year | state | pop | debt | |
---|---|---|---|---|
two | 2001 | Ohino | 1.7 | NaN |
小結:
- 儘量寫兩套切片,除非是索引單列則用 df[column]
- 多用 iloc 和 loc, 除非你很清晰的基於標籤
df2.info() # 說明,ix在這種非整數的整數標籤上,他的切片跟loc一樣,是基於標籤的,而另外兩個剛好相反。