import pandas as pd
import numpy as np
概述
pandas 索引和选择的操作主要是以下五种:
操作 |
语法 |
Result |
选择列 |
df[col] |
Series |
通过标签选择行 |
df.loc[label] |
Series |
通过整数下标选择行 |
df.iloc[loc] |
Series |
行切片 |
df[5:10] |
DataFrame |
通过布尔向量切片 |
df[bool_vec] |
DataFrame |
df = pd.DataFrame(np.random.randn(4, 5), index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D', 'E'])
df
|
A |
B |
C |
D |
E |
a |
2.251822 |
0.508078 |
0.296828 |
-1.223630 |
0.523225 |
b |
1.169537 |
1.378697 |
0.956192 |
-0.761983 |
1.206156 |
c |
-1.071012 |
0.117104 |
-1.068820 |
0.443776 |
-1.185699 |
d |
0.276478 |
0.013506 |
-0.207837 |
-0.295314 |
0.402572 |
a 0.508078
b 1.378697
c 0.117104
d 0.013506
Name: B, dtype: float64
|
B |
C |
a |
0.508078 |
0.296828 |
b |
1.378697 |
0.956192 |
c |
0.117104 |
-1.068820 |
d |
0.013506 |
-0.207837 |
A 0.276478
B 0.013506
C -0.207837
D -0.295314
E 0.402572
Name: d, dtype: float64
A 2.251822
B 0.508078
C 0.296828
D -1.223630
E 0.523225
Name: a, dtype: float64
|
A |
B |
C |
D |
E |
a |
2.251822 |
0.508078 |
0.296828 |
-1.223630 |
0.523225 |
b |
1.169537 |
1.378697 |
0.956192 |
-0.761983 |
1.206156 |
c |
-1.071012 |
0.117104 |
-1.068820 |
0.443776 |
-1.185699 |
|
A |
B |
C |
D |
E |
a |
True |
True |
True |
False |
True |
b |
True |
True |
True |
False |
True |
|
A |
B |
C |
D |
E |
a |
2.251822 |
0.508078 |
0.296828 |
0.000000 |
0.523225 |
b |
1.169537 |
1.378697 |
0.956192 |
0.000000 |
1.206156 |
c |
0.000000 |
0.117104 |
0.000000 |
0.443776 |
0.000000 |
d |
0.276478 |
0.013506 |
0.000000 |
0.000000 |
0.402572 |
使用 [] 进行索引
使用 []
进行索引,实际上就是选择低维切片。对于 Series 对象和 DataFrame 对象,[]
返回的类型不同。
对象类型 |
调用方法 |
返回值类型 |
Series |
series[label] |
标量值 |
DataFrame |
frame[colname] |
对应 colname 的 Series |
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
index=dates, columns=['A', 'B', 'C', 'D'])
df
|
A |
B |
C |
D |
2000-01-01 |
-0.245083 |
-1.422116 |
0.607832 |
1.303857 |
2000-01-02 |
0.596950 |
-0.232929 |
-1.116721 |
-1.387045 |
2000-01-03 |
-0.096937 |
-1.016238 |
0.547761 |
-0.629623 |
2000-01-04 |
-0.538159 |
1.512869 |
0.694778 |
0.561185 |
2000-01-05 |
-1.338459 |
1.352126 |
1.203392 |
-0.431008 |
2000-01-06 |
-0.525117 |
0.735395 |
0.562011 |
-0.133187 |
2000-01-07 |
-0.076509 |
0.542937 |
0.047262 |
0.512984 |
2000-01-08 |
-1.047306 |
0.812662 |
1.716307 |
-0.060350 |
2000-01-01 -0.245083
2000-01-02 0.596950
2000-01-03 -0.096937
2000-01-04 -0.538159
2000-01-05 -1.338459
2000-01-06 -0.525117
2000-01-07 -0.076509
2000-01-08 -1.047306
Freq: D, Name: A, dtype: float64
-0.5251169095699432
交换 A
和 B
两列:
df[['B', 'A']] = df[['A', 'B']]
df
|
A |
B |
C |
D |
2000-01-01 |
-1.422116 |
-0.245083 |
0.607832 |
1.303857 |
2000-01-02 |
-0.232929 |
0.596950 |
-1.116721 |
-1.387045 |
2000-01-03 |
-1.016238 |
-0.096937 |
0.547761 |
-0.629623 |
2000-01-04 |
1.512869 |
-0.538159 |
0.694778 |
0.561185 |
2000-01-05 |
1.352126 |
-1.338459 |
1.203392 |
-0.431008 |
2000-01-06 |
0.735395 |
-0.525117 |
0.562011 |
-0.133187 |
2000-01-07 |
0.542937 |
-0.076509 |
0.047262 |
0.512984 |
2000-01-08 |
0.812662 |
-1.047306 |
1.716307 |
-0.060350 |
注意,如果使用了 loc
或者 iloc
,pandas 会先对齐所有 axes
。这不会修改 df
,因为列对齐在赋值之前。
df.loc[:, ['B', 'A']] = df[['A', 'B']]
df
|
A |
B |
C |
D |
2000-01-01 |
-1.422116 |
-0.245083 |
0.607832 |
1.303857 |
2000-01-02 |
-0.232929 |
0.596950 |
-1.116721 |
-1.387045 |
2000-01-03 |
-1.016238 |
-0.096937 |
0.547761 |
-0.629623 |
2000-01-04 |
1.512869 |
-0.538159 |
0.694778 |
0.561185 |
2000-01-05 |
1.352126 |
-1.338459 |
1.203392 |
-0.431008 |
2000-01-06 |
0.735395 |
-0.525117 |
0.562011 |
-0.133187 |
2000-01-07 |
0.542937 |
-0.076509 |
0.047262 |
0.512984 |
2000-01-08 |
0.812662 |
-1.047306 |
1.716307 |
-0.060350 |
可以强转成 ndarray
数组来实现这一操作:
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
df
|
A |
B |
C |
D |
2000-01-01 |
-0.245083 |
-1.422116 |
0.607832 |
1.303857 |
2000-01-02 |
0.596950 |
-0.232929 |
-1.116721 |
-1.387045 |
2000-01-03 |
-0.096937 |
-1.016238 |
0.547761 |
-0.629623 |
2000-01-04 |
-0.538159 |
1.512869 |
0.694778 |
0.561185 |
2000-01-05 |
-1.338459 |
1.352126 |
1.203392 |
-0.431008 |
2000-01-06 |
-0.525117 |
0.735395 |
0.562011 |
-0.133187 |
2000-01-07 |
-0.076509 |
0.542937 |
0.047262 |
0.512984 |
2000-01-08 |
-1.047306 |
0.812662 |
1.716307 |
-0.060350 |
切片
切片的规则和 Python 以及 NumPy 中的切片规则,一样。这里用 []
运算符说明切片的语义。
s = pd.Series(np.random.randn(7), index=list('abcdefg'))
s
a -0.430408
b 1.192773
c 0.110207
d 0.252352
e -0.126190
f 0.684152
g -0.330652
dtype: float64
a -0.430408
b 1.192773
c 0.110207
d 0.252352
e -0.126190
dtype: float64
a -0.430408
c 0.110207
e -0.126190
g -0.330652
dtype: float64
g -0.330652
f 0.684152
e -0.126190
d 0.252352
c 0.110207
b 1.192773
a -0.430408
dtype: float64
s2 = s.copy()
s2[:3] = 0
s2
a 0.000000
b 0.000000
c 0.000000
d 0.252352
e -0.126190
f 0.684152
g -0.330652
dtype: float64
df = pd.DataFrame(np.random.randn(3, 4), index=list('abc'), columns=list('ABCD'))
df
|
A |
B |
C |
D |
a |
0.967130 |
0.642469 |
-0.327055 |
-0.760568 |
b |
0.435906 |
-0.498920 |
0.411437 |
-1.726552 |
c |
0.793539 |
-0.507760 |
0.167797 |
-1.432030 |
|
A |
B |
C |
D |
a |
0.967130 |
0.642469 |
-0.327055 |
-0.760568 |
b |
0.435906 |
-0.498920 |
0.411437 |
-1.726552 |
|
A |
B |
C |
D |
c |
0.793539 |
-0.507760 |
0.167797 |
-1.432030 |
b |
0.435906 |
-0.498920 |
0.411437 |
-1.726552 |
a |
0.967130 |
0.642469 |
-0.327055 |
-0.760568 |
通过属性索引
sa = pd.Series([1, 2, 3], index=list('abc'))
dfa = df.copy()
a 1
b 2
c 3
dtype: int64
2
|
A |
B |
C |
D |
a |
0.967130 |
0.642469 |
-0.327055 |
-0.760568 |
b |
0.435906 |
-0.498920 |
0.411437 |
-1.726552 |
c |
0.793539 |
-0.507760 |
0.167797 |
-1.432030 |
a 0.967130
b 0.435906
c 0.793539
Name: A, dtype: float64
dfa.A = list(range(len(dfa.index)))
dfa
|
A |
B |
C |
D |
a |
0 |
0.642469 |
-0.327055 |
-0.760568 |
b |
1 |
-0.498920 |
0.411437 |
-1.726552 |
c |
2 |
-0.507760 |
0.167797 |
-1.432030 |
采用访问属性的方法必须确保该属性存在。如果要创建新的一列,仍然需要通过 []
。否则,会出现 UserWanring
的警告。
dfa['E'] = list(range(len(dfa.index)))
dfa
|
A |
B |
C |
D |
E |
a |
0 |
0.642469 |
-0.327055 |
-0.760568 |
0 |
b |
1 |
-0.498920 |
0.411437 |
-1.726552 |
1 |
c |
2 |
-0.507760 |
0.167797 |
-1.432030 |
2 |
通过标签索引
使用 loc
方法可以使用标签对行进行索引。
s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
s1
a -1.373748
b 0.505333
c 0.929577
d 0.336673
e 1.074367
f -1.375100
dtype: float64
0.5053325199102856
标签支持冒号表达式,进行切片运算。
c 0.929577
d 0.336673
e 1.074367
f -1.375100
dtype: float64
df1 = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
df1
|
A |
B |
C |
D |
a |
-0.100191 |
-0.129401 |
0.976985 |
1.839616 |
b |
1.158153 |
0.662432 |
1.278219 |
-0.159460 |
c |
0.650541 |
-0.073285 |
-1.377789 |
0.122472 |
d |
0.442271 |
0.452595 |
-1.274344 |
-0.494543 |
e |
0.536770 |
-0.597499 |
0.897752 |
1.104886 |
f |
1.488848 |
1.195444 |
1.394976 |
1.760736 |
DataFrame 对象可以行列同时索引。
df1.loc[['a', 'b', 'd'], ['A', 'C']]
|
A |
C |
a |
-0.100191 |
0.976985 |
b |
1.158153 |
1.278219 |
d |
0.442271 |
-1.274344 |
|
A |
B |
C |
d |
0.442271 |
0.452595 |
-1.274344 |
e |
0.536770 |
-0.597499 |
0.897752 |
f |
1.488848 |
1.195444 |
1.394976 |
注意:下面的切片不是下标切片,而是标签切片。因此,3:5
表示标签 3
和 5
之间的所有标签。3
和 5
之间还有一个标签 2
,因此返回结果为:
s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])
s.loc[3:5]
3 b
2 c
5 d
dtype: object
通过下标位置索引
使用 iloc
方法可以通过下标位置进行索引。
s2 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
s2
0 -0.235442
2 1.504205
4 0.917704
6 -0.268698
8 0.045323
dtype: float64
0 -0.235442
2 1.504205
4 0.917704
dtype: float64
0.04532252934053265
df2 = pd.DataFrame(np.random.randn(6, 4),
index=list(range(0, 12, 2)),
columns=list(range(0, 8, 2)))
df2
|
0 |
2 |
4 |
6 |
0 |
0.342468 |
0.198276 |
-1.195844 |
0.451696 |
2 |
-0.025717 |
0.077352 |
-0.123052 |
1.313708 |
4 |
-0.137973 |
0.373148 |
-0.220029 |
-0.683300 |
6 |
-0.554065 |
-0.518741 |
-0.839679 |
-0.599059 |
8 |
-0.585208 |
1.001620 |
0.162317 |
-1.386377 |
10 |
-0.229306 |
1.202411 |
-0.456811 |
0.110556 |
DataFrame 对象可以行列同时索引。
|
0 |
2 |
4 |
6 |
0 |
0.342468 |
0.198276 |
-1.195844 |
0.451696 |
2 |
-0.025717 |
0.077352 |
-0.123052 |
1.313708 |
4 |
-0.137973 |
0.373148 |
-0.220029 |
-0.683300 |
|
4 |
6 |
2 |
-0.123052 |
1.313708 |
4 |
-0.220029 |
-0.683300 |
6 |
-0.839679 |
-0.599059 |
8 |
0.162317 |
-1.386377 |
df2.iloc[[1, 3, 5], [1, 3]]
|
2 |
6 |
2 |
0.077352 |
1.313708 |
6 |
-0.518741 |
-0.599059 |
10 |
1.202411 |
0.110556 |
|
0 |
2 |
4 |
6 |
2 |
-0.025717 |
0.077352 |
-0.123052 |
1.313708 |
4 |
-0.137973 |
0.373148 |
-0.220029 |
-0.683300 |
可以超出索引范围,但是可能会返回空 DataFrame。
选择接受可调用对象
[]
、loc
和 iloc
都接受可调用对象进行索引。
df1 = pd.DataFrame(np.random.randn(6, 4),
index=list('abcdef'),
columns=list('ABCD'))
df1
|
A |
B |
C |
D |
a |
1.444943 |
-0.907930 |
1.038214 |
-2.246870 |
b |
-1.602533 |
-1.085308 |
-1.039485 |
0.574455 |
c |
-0.705120 |
0.563562 |
1.281975 |
1.127445 |
d |
-1.860152 |
-1.679263 |
-1.071518 |
0.127717 |
e |
0.472768 |
-0.563109 |
-0.331930 |
-1.279404 |
f |
0.759632 |
0.306537 |
-0.342015 |
-1.121230 |
df1.loc[lambda df: df['A'] > 0, :]
|
A |
B |
C |
D |
a |
1.444943 |
-0.907930 |
1.038214 |
-2.246870 |
e |
0.472768 |
-0.563109 |
-0.331930 |
-1.279404 |
f |
0.759632 |
0.306537 |
-0.342015 |
-1.121230 |
df1.loc[:, lambda df: ['A', 'B']]
|
A |
B |
a |
1.444943 |
-0.907930 |
b |
-1.602533 |
-1.085308 |
c |
-0.705120 |
0.563562 |
d |
-1.860152 |
-1.679263 |
e |
0.472768 |
-0.563109 |
f |
0.759632 |
0.306537 |
df1.iloc[:, lambda df: [0, 1]]
|
A |
B |
a |
1.444943 |
-0.907930 |
b |
-1.602533 |
-1.085308 |
c |
-0.705120 |
0.563562 |
d |
-1.860152 |
-1.679263 |
e |
0.472768 |
-0.563109 |
f |
0.759632 |
0.306537 |
df1[lambda df: df.columns[0]]
a 1.444943
b -1.602533
c -0.705120
d -1.860152
e 0.472768
f 0.759632
Name: A, dtype: float64
df1['A'].loc[lambda s: s > 0]
a 1.444943
e 0.472768
f 0.759632
Name: A, dtype: float64
快速访问
由于使用 []
进行索引必须处理很多情况(单标签访问、切片、布尔索引等),因此它需要一些开销才能确定您的要求。如果您只想访问一个标量值,最快的方法是使用 at
和 iat
方法,它们在所有数据结构上都实现了。
s = pd.Series(np.random.randint(0, 7, size=(7,)), index=list('abcdefg'))
s
a 1
b 4
c 5
d 6
e 0
f 3
g 3
dtype: int32
3
1
df = pd.DataFrame(np.random.randint(0, 7, size=(3, 4)), index=list('abc'), columns=list('ABCD'))
df
|
A |
B |
C |
D |
a |
3 |
5 |
4 |
4 |
b |
0 |
2 |
2 |
2 |
c |
1 |
2 |
0 |
3 |
5
0
布尔索引
Series 对象的布尔索引和 Python 以及 NumPy 类似。
s = pd.Series(range(-3, 4))
s
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
4 1
5 2
6 3
dtype: int64
0 -3
1 -2
4 1
5 2
6 3
dtype: int64
3 0
4 1
5 2
6 3
dtype: int64
df = pd.DataFrame(np.random.randn(7, 4),
index=pd.date_range('2022/02/22', periods=7),
columns=list('ABCD'))
df
|
A |
B |
C |
D |
2022-02-22 |
-1.322987 |
1.022325 |
0.304516 |
-0.281856 |
2022-02-23 |
-0.767053 |
0.644011 |
2.058071 |
-0.342653 |
2022-02-24 |
0.065301 |
2.530895 |
1.091807 |
1.865210 |
2022-02-25 |
-2.094778 |
-0.714121 |
0.319254 |
-1.102402 |
2022-02-26 |
1.543428 |
-1.073784 |
-0.585543 |
-1.479911 |
2022-02-27 |
-0.045216 |
-1.291289 |
0.145565 |
0.124343 |
2022-02-28 |
-1.352934 |
-1.934588 |
0.214902 |
-0.942982 |
|
A |
B |
C |
D |
2022-02-24 |
0.065301 |
2.530895 |
1.091807 |
1.865210 |
2022-02-26 |
1.543428 |
-1.073784 |
-0.585543 |
-1.479911 |