pandas 索引和选择

import pandas as pd
import numpy as np

概述

pandas 索引和选择的操作主要是以下五种:

操作 语法 Result
选择列 df[col] Series
通过标签选择行 df.loc[label] Series
通过整数下标选择行 df.iloc[loc] Series
行切片 df[5:10] DataFrame
通过布尔向量切片 df[bool_vec] DataFrame
df = pd.DataFrame(np.random.randn(4, 5), index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D', 'E'])
df

A B C D E
a 2.251822 0.508078 0.296828 -1.223630 0.523225
b 1.169537 1.378697 0.956192 -0.761983 1.206156
c -1.071012 0.117104 -1.068820 0.443776 -1.185699
d 0.276478 0.013506 -0.207837 -0.295314 0.402572
df['B']
a    0.508078
b    1.378697
c    0.117104
d    0.013506
Name: B, dtype: float64
df[['B', 'C']]

B C
a 0.508078 0.296828
b 1.378697 0.956192
c 0.117104 -1.068820
d 0.013506 -0.207837
df.loc['d']
A    0.276478
B    0.013506
C   -0.207837
D   -0.295314
E    0.402572
Name: d, dtype: float64
df.iloc[0]
A    2.251822
B    0.508078
C    0.296828
D   -1.223630
E    0.523225
Name: a, dtype: float64
df[:-1]

A B C D E
a 2.251822 0.508078 0.296828 -1.223630 0.523225
b 1.169537 1.378697 0.956192 -0.761983 1.206156
c -1.071012 0.117104 -1.068820 0.443776 -1.185699
df[:2] > 0

A B C D E
a True True True False True
b True True True False True
df[df < 0] = 0
df

A B C D E
a 2.251822 0.508078 0.296828 0.000000 0.523225
b 1.169537 1.378697 0.956192 0.000000 1.206156
c 0.000000 0.117104 0.000000 0.443776 0.000000
d 0.276478 0.013506 0.000000 0.000000 0.402572

使用 [] 进行索引

使用 [] 进行索引,实际上就是选择低维切片。对于 Series 对象和 DataFrame 对象,[] 返回的类型不同。

对象类型 调用方法 返回值类型
Series series[label] 标量值
DataFrame frame[colname] 对应 colnameSeries
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df

A B C D
2000-01-01 -0.245083 -1.422116 0.607832 1.303857
2000-01-02 0.596950 -0.232929 -1.116721 -1.387045
2000-01-03 -0.096937 -1.016238 0.547761 -0.629623
2000-01-04 -0.538159 1.512869 0.694778 0.561185
2000-01-05 -1.338459 1.352126 1.203392 -0.431008
2000-01-06 -0.525117 0.735395 0.562011 -0.133187
2000-01-07 -0.076509 0.542937 0.047262 0.512984
2000-01-08 -1.047306 0.812662 1.716307 -0.060350
s = df['A']
s
2000-01-01   -0.245083
2000-01-02    0.596950
2000-01-03   -0.096937
2000-01-04   -0.538159
2000-01-05   -1.338459
2000-01-06   -0.525117
2000-01-07   -0.076509
2000-01-08   -1.047306
Freq: D, Name: A, dtype: float64
s[dates[5]]
-0.5251169095699432

交换 AB 两列:

df[['B', 'A']] = df[['A', 'B']]
df

A B C D
2000-01-01 -1.422116 -0.245083 0.607832 1.303857
2000-01-02 -0.232929 0.596950 -1.116721 -1.387045
2000-01-03 -1.016238 -0.096937 0.547761 -0.629623
2000-01-04 1.512869 -0.538159 0.694778 0.561185
2000-01-05 1.352126 -1.338459 1.203392 -0.431008
2000-01-06 0.735395 -0.525117 0.562011 -0.133187
2000-01-07 0.542937 -0.076509 0.047262 0.512984
2000-01-08 0.812662 -1.047306 1.716307 -0.060350

注意,如果使用了 loc 或者 ilocpandas 会先对齐所有 axes。这不会修改 df,因为列对齐在赋值之前。

df.loc[:, ['B', 'A']] = df[['A', 'B']]
df

A B C D
2000-01-01 -1.422116 -0.245083 0.607832 1.303857
2000-01-02 -0.232929 0.596950 -1.116721 -1.387045
2000-01-03 -1.016238 -0.096937 0.547761 -0.629623
2000-01-04 1.512869 -0.538159 0.694778 0.561185
2000-01-05 1.352126 -1.338459 1.203392 -0.431008
2000-01-06 0.735395 -0.525117 0.562011 -0.133187
2000-01-07 0.542937 -0.076509 0.047262 0.512984
2000-01-08 0.812662 -1.047306 1.716307 -0.060350

可以强转成 ndarray 数组来实现这一操作:

df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
df

A B C D
2000-01-01 -0.245083 -1.422116 0.607832 1.303857
2000-01-02 0.596950 -0.232929 -1.116721 -1.387045
2000-01-03 -0.096937 -1.016238 0.547761 -0.629623
2000-01-04 -0.538159 1.512869 0.694778 0.561185
2000-01-05 -1.338459 1.352126 1.203392 -0.431008
2000-01-06 -0.525117 0.735395 0.562011 -0.133187
2000-01-07 -0.076509 0.542937 0.047262 0.512984
2000-01-08 -1.047306 0.812662 1.716307 -0.060350

切片

切片的规则和 Python 以及 NumPy 中的切片规则,一样。这里用 [] 运算符说明切片的语义。

s = pd.Series(np.random.randn(7), index=list('abcdefg'))
s
a   -0.430408
b    1.192773
c    0.110207
d    0.252352
e   -0.126190
f    0.684152
g   -0.330652
dtype: float64
s[:5]
a   -0.430408
b    1.192773
c    0.110207
d    0.252352
e   -0.126190
dtype: float64
s[::2]
a   -0.430408
c    0.110207
e   -0.126190
g   -0.330652
dtype: float64
s[::-1]
g   -0.330652
f    0.684152
e   -0.126190
d    0.252352
c    0.110207
b    1.192773
a   -0.430408
dtype: float64
s2 = s.copy()
s2[:3] = 0
s2
a    0.000000
b    0.000000
c    0.000000
d    0.252352
e   -0.126190
f    0.684152
g   -0.330652
dtype: float64
df = pd.DataFrame(np.random.randn(3, 4), index=list('abc'), columns=list('ABCD'))
df

A B C D
a 0.967130 0.642469 -0.327055 -0.760568
b 0.435906 -0.498920 0.411437 -1.726552
c 0.793539 -0.507760 0.167797 -1.432030
df[:2]

A B C D
a 0.967130 0.642469 -0.327055 -0.760568
b 0.435906 -0.498920 0.411437 -1.726552
df[::-1]

A B C D
c 0.793539 -0.507760 0.167797 -1.432030
b 0.435906 -0.498920 0.411437 -1.726552
a 0.967130 0.642469 -0.327055 -0.760568

通过属性索引

sa = pd.Series([1, 2, 3], index=list('abc'))
dfa = df.copy()
sa
a    1
b    2
c    3
dtype: int64
sa.b
2
dfa

A B C D
a 0.967130 0.642469 -0.327055 -0.760568
b 0.435906 -0.498920 0.411437 -1.726552
c 0.793539 -0.507760 0.167797 -1.432030
dfa.A
a    0.967130
b    0.435906
c    0.793539
Name: A, dtype: float64
dfa.A = list(range(len(dfa.index)))
dfa

A B C D
a 0 0.642469 -0.327055 -0.760568
b 1 -0.498920 0.411437 -1.726552
c 2 -0.507760 0.167797 -1.432030

采用访问属性的方法必须确保该属性存在。如果要创建新的一列,仍然需要通过 []。否则,会出现 UserWanring 的警告。

dfa['E'] = list(range(len(dfa.index)))
dfa

A B C D E
a 0 0.642469 -0.327055 -0.760568 0
b 1 -0.498920 0.411437 -1.726552 1
c 2 -0.507760 0.167797 -1.432030 2

通过标签索引

使用 loc 方法可以使用标签对行进行索引。

s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
s1
a   -1.373748
b    0.505333
c    0.929577
d    0.336673
e    1.074367
f   -1.375100
dtype: float64
s1.loc['b']
0.5053325199102856

标签支持冒号表达式,进行切片运算。

s1.loc['c':]
c    0.929577
d    0.336673
e    1.074367
f   -1.375100
dtype: float64
df1 = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
df1

A B C D
a -0.100191 -0.129401 0.976985 1.839616
b 1.158153 0.662432 1.278219 -0.159460
c 0.650541 -0.073285 -1.377789 0.122472
d 0.442271 0.452595 -1.274344 -0.494543
e 0.536770 -0.597499 0.897752 1.104886
f 1.488848 1.195444 1.394976 1.760736

DataFrame 对象可以行列同时索引。

df1.loc[['a', 'b', 'd'], ['A', 'C']]

A C
a -0.100191 0.976985
b 1.158153 1.278219
d 0.442271 -1.274344
df1.loc['d':, 'A':'C']

A B C
d 0.442271 0.452595 -1.274344
e 0.536770 -0.597499 0.897752
f 1.488848 1.195444 1.394976

注意:下面的切片不是下标切片,而是标签切片。因此,3:5 表示标签 35 之间的所有标签。35 之间还有一个标签 2,因此返回结果为:

s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])
s.loc[3:5]
3    b
2    c
5    d
dtype: object

通过下标位置索引

使用 iloc 方法可以通过下标位置进行索引。

s2 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
s2
0   -0.235442
2    1.504205
4    0.917704
6   -0.268698
8    0.045323
dtype: float64
s2.iloc[:3]
0   -0.235442
2    1.504205
4    0.917704
dtype: float64
s2.iloc[-1]
0.04532252934053265
df2 = pd.DataFrame(np.random.randn(6, 4),
                   index=list(range(0, 12, 2)),
                   columns=list(range(0, 8, 2)))
df2

0 2 4 6
0 0.342468 0.198276 -1.195844 0.451696
2 -0.025717 0.077352 -0.123052 1.313708
4 -0.137973 0.373148 -0.220029 -0.683300
6 -0.554065 -0.518741 -0.839679 -0.599059
8 -0.585208 1.001620 0.162317 -1.386377
10 -0.229306 1.202411 -0.456811 0.110556

DataFrame 对象可以行列同时索引。

df2.iloc[:3]

0 2 4 6
0 0.342468 0.198276 -1.195844 0.451696
2 -0.025717 0.077352 -0.123052 1.313708
4 -0.137973 0.373148 -0.220029 -0.683300
df2.iloc[1:5, 2:4]

4 6
2 -0.123052 1.313708
4 -0.220029 -0.683300
6 -0.839679 -0.599059
8 0.162317 -1.386377
df2.iloc[[1, 3, 5], [1, 3]]

2 6
2 0.077352 1.313708
6 -0.518741 -0.599059
10 1.202411 0.110556
df2.iloc[1:3, :]

0 2 4 6
2 -0.025717 0.077352 -0.123052 1.313708
4 -0.137973 0.373148 -0.220029 -0.683300

可以超出索引范围,但是可能会返回空 DataFrame。

df2.iloc[:, 4:]

0
2
4
6
8
10

选择接受可调用对象

[]lociloc 都接受可调用对象进行索引。

df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=list('abcdef'),
                   columns=list('ABCD'))
df1

A B C D
a 1.444943 -0.907930 1.038214 -2.246870
b -1.602533 -1.085308 -1.039485 0.574455
c -0.705120 0.563562 1.281975 1.127445
d -1.860152 -1.679263 -1.071518 0.127717
e 0.472768 -0.563109 -0.331930 -1.279404
f 0.759632 0.306537 -0.342015 -1.121230
df1.loc[lambda df: df['A'] > 0, :]

A B C D
a 1.444943 -0.907930 1.038214 -2.246870
e 0.472768 -0.563109 -0.331930 -1.279404
f 0.759632 0.306537 -0.342015 -1.121230
df1.loc[:, lambda df: ['A', 'B']]

A B
a 1.444943 -0.907930
b -1.602533 -1.085308
c -0.705120 0.563562
d -1.860152 -1.679263
e 0.472768 -0.563109
f 0.759632 0.306537
df1.iloc[:, lambda df: [0, 1]]

A B
a 1.444943 -0.907930
b -1.602533 -1.085308
c -0.705120 0.563562
d -1.860152 -1.679263
e 0.472768 -0.563109
f 0.759632 0.306537
df1[lambda df: df.columns[0]]
a    1.444943
b   -1.602533
c   -0.705120
d   -1.860152
e    0.472768
f    0.759632
Name: A, dtype: float64
df1['A'].loc[lambda s: s > 0]
a    1.444943
e    0.472768
f    0.759632
Name: A, dtype: float64

快速访问

由于使用 [] 进行索引必须处理很多情况(单标签访问、切片、布尔索引等),因此它需要一些开销才能确定您的要求。如果您只想访问一个标量值,最快的方法是使用 atiat 方法,它们在所有数据结构上都实现了。

s = pd.Series(np.random.randint(0, 7, size=(7,)), index=list('abcdefg'))
s
a    1
b    4
c    5
d    6
e    0
f    3
g    3
dtype: int32
s.iat[5]
3
s.at['a']
1
df = pd.DataFrame(np.random.randint(0, 7, size=(3, 4)), index=list('abc'), columns=list('ABCD'))
df

A B C D
a 3 5 4 4
b 0 2 2 2
c 1 2 0 3
df.iat[0, 1]
5
df.at['b', 'A']
0

布尔索引

Series 对象的布尔索引和 Python 以及 NumPy 类似。

s = pd.Series(range(-3, 4))
s
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64
s[s > 0]
4    1
5    2
6    3
dtype: int64
s[(s < -1) | (s > 0.5)]
0   -3
1   -2
4    1
5    2
6    3
dtype: int64
s[~(s < 0)]
3    0
4    1
5    2
6    3
dtype: int64
df = pd.DataFrame(np.random.randn(7, 4),
                  index=pd.date_range('2022/02/22', periods=7),
                  columns=list('ABCD'))
df

A B C D
2022-02-22 -1.322987 1.022325 0.304516 -0.281856
2022-02-23 -0.767053 0.644011 2.058071 -0.342653
2022-02-24 0.065301 2.530895 1.091807 1.865210
2022-02-25 -2.094778 -0.714121 0.319254 -1.102402
2022-02-26 1.543428 -1.073784 -0.585543 -1.479911
2022-02-27 -0.045216 -1.291289 0.145565 0.124343
2022-02-28 -1.352934 -1.934588 0.214902 -0.942982
df[df['A'] > 0]

A B C D
2022-02-24 0.065301 2.530895 1.091807 1.865210
2022-02-26 1.543428 -1.073784 -0.585543 -1.479911
Previous
Next