Pandas df.filter() Function

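filter() selects columns or rows of a DataFrame by their labels. It can match labels against an explicit list of names, a substring, or a regular expression. It never inspects the values stored in the data, which sets it apart from boolean indexing. Compared with loc[] and iloc[], which select by explicit labels or positions, filter() is convenient for pattern-based column selection on datasets with many columns.

In data analysis we often want to focus on a subset of columns, such as those that start with a given prefix or contain a given keyword. filter() is designed for these cases. Selecting columns by data type (for example, all numeric columns) is a different task, handled by select_dtypes().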
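
To make the label-versus-value distinction concrete, here is a minimal sketch. The frame and its column names are made up for illustration. It contrasts filter() (selects by label), select_dtypes() (selects by dtype), and boolean indexing (selects by value).

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score_math': [85, 92, 78],
    'score_english': [87, 94, 80],
})

# filter() matches column labels: keep names containing "score"
by_label = df.filter(like='score')

# select_dtypes() matches column dtypes: keep numeric columns
by_dtype = df.select_dtypes(include='number')

# Boolean indexing matches values: keep rows where score_math > 80
by_value = df[df['score_math'] > 80]

print(by_label.columns.tolist())
print(by_dtype.columns.tolist())
print(by_value['name'].tolist())
```

In this frame the first two calls happen to return the same columns, but for different reasons: one looks at the names, the other at the dtypes.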

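Basic Syntax and Parameters

filter() is a method of both DataFrame and Series. On a DataFrame it filters the column labels by default; on a Series it filters the index labels. Both versions accept items, like, and regex.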

Syntax

DataFrame.filter(items=None, like=None, regex=None, axis=None)

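Parameters

Parameter	Type	Required	Description	Default
items	list-like	Optional	Labels to keep, matched exactly. Labels that do not exist are ignored.	None
like	str	Optional	Keep labels that contain this substring.	None
regex	str	Optional	Keep labels in which the regular expression finds a match (re.search semantics).	None
axis	int or str	Optional	Axis to filter on: 0 or 'index' for rows, 1 or 'columns' for columns.	None (columns for a DataFrame, index for a Series)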

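Return Value

Return type: a new object of the same type as the input (a DataFrame for a DataFrame, a Series for a Series).
Column filtering: called on a DataFrame without axis, filter() acts on the columns (axis=1).
Rows are preserved: when filtering columns, every row is kept unchanged.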
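
The following sketch illustrates the return types and shows that a Series accepts all three matching modes, applied to its index. The Series and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical Series whose index labels look like column names
s = pd.Series([1, 2, 3, 4],
              index=['age_2020', 'age_2021', 'score_2020', 'city'])

by_items = s.filter(items=['city', 'age_2020'])   # exact labels
by_like = s.filter(like='age')                    # substring match
by_regex = s.filter(regex=r'_2020$')              # regular expression

print(type(by_like).__name__)
print(by_like.index.tolist())
print(by_regex.index.tolist())

# A DataFrame stays a DataFrame, even when one column is selected
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
one_col = df.filter(items=['a'])
print(type(one_col).__name__)
```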

Examples

The examples below cover each way of using filter().

Example 1: Basic usage with the items parameter

Pass a list of column names to select those columns.

Code:

import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [18, 19, 17, 18, 20],
    'score': [85, 92, 78, 90, 88],
    'grade': ['A', 'A', 'B', 'A', 'B'],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print()

# Select specific columns with items
print("Select the name and age columns:")
print(df.filter(items=['name', 'age']))
print()

# Select a single column (still returns a DataFrame)
print("Select only the score column:")
print(df.filter(items=['score']))

Output:

Original DataFrame:
      name  age  score grade       city
0    Alice   18     85     A    Beijing
1      Bob   19     92     A   Shanghai
2  Charlie   17     78     B    Beijing
3    David   18     90     A  Guangzhou
4      Eve   20     88     B   Shanghai

Select the name and age columns:
      name  age
0    Alice   18
1      Bob   19
2  Charlie   17
3    David   18
4      Eve   20

Select only the score column:
   score
0     85
1     92
2     78
3     90
4     88

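Code notes:

filter(items=['name', 'age']) selects the listed columns and returns a new DataFrame containing them.
All rows are kept; only the set of columns changes.
Names in items that do not exist in the DataFrame are silently ignored rather than raising a KeyError. If none of the names match, the result has no columns.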
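
The lenient handling of missing names is easy to overlook, so here is a small sketch of it. The column names 'height' and 'weight' are hypothetical and deliberately absent from the frame. For comparison, it also shows that bracket selection with the same list is strict.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'age': [18, 19],
                   'score': [85, 92]})

# 'height' is not a column; filter() drops it without complaint
partial = df.filter(items=['name', 'height'])
print(partial.columns.tolist())

# None of the requested names exist: rows are kept, but there are no columns
empty = df.filter(items=['height', 'weight'])
print(empty.shape)

# Bracket selection with a missing name raises KeyError instead
strict_failed = False
try:
    df[['name', 'height']]
except KeyError as exc:
    strict_failed = True
    print('KeyError:', exc)
```

If a missing column should be treated as an error, bracket selection or loc[] is the safer choice; filter(items=...) suits cases where some columns may legitimately be absent.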

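Example 2: Substring matching with the like parameter

like keeps every column whose name contains the given substring, which suits datasets with systematically named columns.

Code: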

import pandas as pd

# Create a DataFrame with systematically named columns
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age_2020': [18, 19, 17, 18, 20],
    'age_2021': [19, 20, 18, 19, 21],
    'age_2022': [20, 21, 19, 20, 22],
    'score_2020': [85, 92, 78, 90, 88],
    'score_2021': [87, 94, 80, 92, 90],
    'score_2022': [89, 96, 82, 94, 92],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df.to_string())  # to_string() prints all eight columns without wrapping or truncation
print()

# Select every column whose name contains "age"
print("Columns containing 'age':")
print(df.filter(like='age'))
print()

# Select every column whose name contains "score"
print("Columns containing 'score':")
print(df.filter(like='score'))
print()

# Select every column whose name contains "2021"
print("Columns containing '2021':")
print(df.filter(like='2021'))

Output:

Original DataFrame:
      name  age_2020  age_2021  age_2022  score_2020  score_2021  score_2022       city
0    Alice        18        19        20          85          87          89    Beijing
1      Bob        19        20        21          92          94          96   Shanghai
2  Charlie        17        18        19          78          80          82    Beijing
3    David        18        19        20          90          92          94  Guangzhou
4      Eve        20        21        22          88          90          92   Shanghai

Columns containing 'age':
   age_2020  age_2021  age_2022
0        18        19        20
1        19        20        21
2        17        18        19
3        18        19        20
4        20        21        22

Columns containing 'score':
   score_2020  score_2021  score_2022
0          85          87          89
1          92          94          96
2          78          80          82
3          90          92          94
4          88          90          92

Columns containing '2021':
   age_2021  score_2021
0        19          87
1        20          94
2        18          80
3        19          92
4        21          90

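Code notes:

filter(like='age') selects every column whose name contains "age".
The match is a plain, case-sensitive substring test, so only part of the column name is needed.
This is especially useful for columns that share a naming convention, such as columns organized by year.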
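
Because like is a substring test, it can match more than intended. The sketch below uses made-up column names to show this, along with a hand-written equivalent and an anchored regex as one possible tighter alternative.

```python
import pandas as pd

# Hypothetical column names chosen to show unintended matches
df = pd.DataFrame({
    'age': [30],
    'Age_group': ['30-39'],
    'stage': ['A'],
    'page_views': [12],
    'city': ['Beijing'],
})

# "stage" and "page_views" also contain "age"; "Age_group" does not (case differs)
matched = df.filter(like='age').columns.tolist()
print(matched)

# Roughly the same selection written out by hand
manual = [col for col in df.columns if 'age' in str(col)]
print(manual)

# An anchored regex keeps only names that start with "age"
anchored = df.filter(regex='^age').columns.tolist()
print(anchored)
```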

Example 3: Regular-expression matching with the regex parameter

For more complex column-name filters, use a regular expression.

Code:

import pandas as pd

# Create a DataFrame with more varied column names
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age_2020': [18, 19, 17, 18, 20],
    'age_2021': [19, 20, 18, 19, 21],
    'score_math': [85, 92, 78, 90, 88],
    'score_english': [87, 94, 80, 92, 90],
    'score_science': [89, 96, 82, 94, 92],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai']
}
df = pd.DataFrame(data)

print("Columns of the original DataFrame:")
print(df.columns.tolist())
print()

# Select columns that start with "score_"
print("Columns starting with 'score_':")
print(df.filter(regex='^score_'))
print()

# Select columns whose names contain a digit
print("Columns whose names contain a digit:")
print(df.filter(regex=r'\d'))  # \d matches any digit; the raw string keeps the backslash intact
print()

# Select columns that contain both "age" and "202"
print("Columns containing 'age' and '202':")
print(df.filter(regex='age.*202|202.*age'))
print()

# Select columns whose names consist of letters only (no digits or underscores)
print("Columns made up of letters only:")
print(df.filter(regex='^[a-zA-Z]+$'))

Output:

Columns of the original DataFrame:
['id', 'name', 'age_2020', 'age_2021', 'score_math', 'score_english', 'score_science', 'city']

Columns starting with 'score_':
   score_math  score_english  score_science
0          85             87             89
1          92             94             96
2          78             80             82
3          90             92             94
4          88             90             92

Columns whose names contain a digit:
   age_2020  age_2021
0        18        19
1        19        20
2        17        18
3        18        19
4        20        21

Columns containing 'age' and '202':
   age_2020  age_2021
0        18        19
1        19        20
2        17        18
3        18        19
4        20        21

Columns made up of letters only:
   id     name       city
0   1    Alice    Beijing
1   2      Bob   Shanghai
2   3  Charlie    Beijing
3   4    David  Guangzhou
4   5      Eve   Shanghai

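Code notes:

filter(regex='^score_') uses ^ to match columns that start with "score_".
The \d pattern matches column names that contain a digit. It is best written as a raw string, r'\d', so that Python does not treat the backslash as a string escape.
The pattern may match anywhere in the name unless it is anchored with ^ or $.
Regular expressions are the most flexible of the three options and can express complex matching rules.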
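
The effect of anchoring is easiest to see side by side. The sketch below uses hypothetical column names and compares an unanchored pattern with prefix- and suffix-anchored ones. It also includes a hand-written re.search loop that should select the same labels.

```python
import re
import pandas as pd

# Hypothetical column names for illustration
df = pd.DataFrame({
    'score_math': [85],
    'math_score': [90],
    'score_2021': [88],
    'id': [1],
})

anywhere = df.filter(regex='score').columns.tolist()      # "score" anywhere in the name
prefix = df.filter(regex='^score_').columns.tolist()      # name starts with "score_"
suffix = df.filter(regex='_score$').columns.tolist()      # name ends with "_score"
year = df.filter(regex=r'_\d{4}$').columns.tolist()       # name ends with "_" plus four digits

# Hand-written equivalent of the last selection
manual = [col for col in df.columns if re.search(r'_\d{4}$', col)]

print(anywhere)
print(prefix)
print(suffix)
print(year)
```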

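Example 4: Filtering rows with the axis parameter

Although filter() is mostly used on columns, the axis parameter lets it filter rows by their index labels.

Code: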

import pandas as pd

# Create a DataFrame with a labeled index
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [18, 19, 17, 18, 20],
    'score': [85, 92, 78, 90, 88]
}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print("DataFrame with a labeled index:")
print(df)
print()

# Filter rows with axis='index' (using like)
print("Rows whose index contains 'row':")
print(df.filter(like='row', axis='index'))
print()

# Filter rows with specific index labels using items
print("Rows with index 'row1' and 'row3':")
print(df.filter(items=['row1', 'row3'], axis='index'))
print()

# Filter rows with regex
print("Rows whose index matches the regex row[135]:")
print(df.filter(regex='row[135]', axis='index'))

Output:

DataFrame with a labeled index:
         name  age  score
row1    Alice   18     85
row2      Bob   19     92
row3  Charlie   17     78
row4    David   18     90
row5      Eve   20     88

Rows whose index contains 'row':
         name  age  score
row1    Alice   18     85
row2      Bob   19     92
row3  Charlie   17     78
row4    David   18     90
row5      Eve   20     88

Rows with index 'row1' and 'row3':
         name  age  score
row1    Alice   18     85
row3  Charlie   17     78

Rows whose index matches the regex row[135]:
         name  age  score
row1    Alice   18     85
row3  Charlie   17     78
row5      Eve   20     88

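Code notes:

axis='index' or axis=0 tells filter() to match against the row index labels.
like, items, and regex behave the same way for rows as they do for columns.
This is useful for selecting rows by index name. To select rows by their values, use boolean indexing instead.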
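
Row filtering also works on a non-string index. The sketch below assumes the behavior of current pandas versions, where like and regex appear to compare against the string form of each label while items compares the labels themselves. The integer index values are made up.

```python
import pandas as pd

# Hypothetical frame with an integer index
df = pd.DataFrame({'score': [85, 92, 78, 90]}, index=[1, 10, 21, 3])

# like='1' keeps every label whose string form contains "1"
like_rows = df.filter(like='1', axis=0).index.tolist()
print(like_rows)

# Anchoring the regex at both ends keeps the label 1 only
exact_rows = df.filter(regex='^1$', axis=0).index.tolist()
print(exact_rows)

# items compares the labels themselves
item_rows = df.filter(items=[1, 3], axis=0).index.tolist()
print(item_rows)

# Selecting rows by value is a job for boolean indexing, not filter()
value_rows = df[df['score'] >= 90].index.tolist()
print(value_rows)
```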

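Example 5: Combining filter() with other methods

filter() chains naturally with other Pandas methods, so a selection and a computation can be written in one expression.

Code: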

import pandas as pd
import numpy as np

# Create a wide DataFrame with deterministic values so the output is reproducible
df = pd.DataFrame(np.arange(50).reshape(5, 10), columns=[
    'A', 'B', 'C', 'D', 'E',
    'AA', 'BB', 'CC', 'DD', 'EE'
])

print("Columns of the original DataFrame:")
print(df.columns.tolist())
print()

# Select the single-letter columns and compute on them
single_letter_cols = df.filter(regex='^[A-E]$')
print("Mean of the single-letter columns (A-E):")
print(single_letter_cols.mean())
print()

# Select the double-letter columns and compute on them
double_letter_cols = df.filter(regex='^[A-E]{2}$')
print("Mean of the double-letter columns (AA-EE):")
print(double_letter_cols.mean())
print()

# Select columns first, then rows
print("First 3 rows of the double-letter columns:")
print(df.filter(regex='^[A-E]{2}$').head(3))

Output:

Columns of the original DataFrame:
['A', 'B', 'C', 'D', 'E', 'AA', 'BB', 'CC', 'DD', 'EE']

Mean of the single-letter columns (A-E):
A    20.0
B    21.0
C    22.0
D    23.0
E    24.0
dtype: float64

Mean of the double-letter columns (AA-EE):
AA    25.0
BB    26.0
CC    27.0
DD    28.0
EE    29.0
dtype: float64

First 3 rows of the double-letter columns:
   AA  BB  CC  DD  EE
0   5   6   7   8   9
1  15  16  17  18  19
2  25  26  27  28  29

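Code notes:

filter() can be chained with other DataFrame methods.
Selecting the columns first and then computing statistics keeps the code readable.
This is especially useful for wide tables.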
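
As a further illustration of chaining, the sketch below computes a per-row average across a group of columns selected by prefix and stores it in a new column. The data and the column name avg_score are hypothetical. The new column is deliberately named so that it does not itself match the ^score_ pattern.

```python
import pandas as pd

# Hypothetical exam data
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'score_math': [85, 92],
    'score_english': [87, 94],
    'score_science': [89, 96],
})

# Row-wise mean over the score_* columns only
df['avg_score'] = df.filter(regex='^score_').mean(axis=1)

# Column-wise maximum over the same group
best = df.filter(regex='^score_').max()

print(df[['name', 'avg_score']])
print(best)
```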

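Notes

items matches labels exactly, like matches a substring, and regex matches a regular expression. Choose whichever fits the task.
On large datasets, narrowing to the needed columns with filter() first reduces the amount of data that later steps have to process.
Called on a DataFrame, filter() acts on columns (axis=1) by default; pass axis='index' or axis=0 to filter rows instead.
items, like, and regex are mutually exclusive. Passing more than one raises a TypeError, as does passing none of them.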
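
The mutual-exclusion rule can be checked directly. The sketch below uses a made-up frame to trigger both error cases, then shows one possible way to combine two conditions by chaining two calls.

```python
import pandas as pd

df = pd.DataFrame({'age_2020': [18], 'age_2021': [19], 'city': ['Beijing']})

errors = []

# Two matching modes in one call are rejected
try:
    df.filter(like='age', regex='2021$')
except TypeError as exc:
    errors.append(str(exc))

# So is a call with no matching mode at all
try:
    df.filter()
except TypeError as exc:
    errors.append(str(exc))

for message in errors:
    print('TypeError:', message)

# To apply two conditions, chain two calls (or fold both into one regex)
combined = df.filter(like='age').filter(regex='2021$')
print(combined.columns.tolist())
```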

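Tip: filter() is a handy tool for wide tables (datasets with many columns). During data exploration it quickly narrows the view to one group of columns, such as all columns sharing a prefix. To select all numeric columns, use select_dtypes() instead, because filter() only looks at labels.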

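Summary

filter() selects columns (or rows) of a DataFrame by label. It offers three matching modes: exact names (items), substring matching (like), and regular expressions (regex).

Typical uses include picking specific columns, working with systematically named columns, and choosing the columns to analyze dynamically. Combined with other Pandas methods, it keeps data-processing code concise and readable.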
