Pandas pd.read_html() Function

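read_html() is the pandas function for parsing HTML tables. It reads table data from a web page or an HTML file and converts each table into a DataFrame.

Tables on web pages are an important data source: much public data (stock quotes, statistics, and so on) is published as HTML tables. read_html() parses HTML with lxml by default and can also use BeautifulSoup with html5lib. It can extract every table on a page or only the tables you select.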

Basic Syntax and Parameters

Syntax

pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None,
                skiprows=None, attrs=None, parse_dates=False, thousands=',',
                decimal='.', converters=None, ...)

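Parameters

Parameter	Type	Description	Default
io	str, path object, file-like object	HTML file path, URL, or file-like object. Passing a literal HTML string is deprecated since pandas 2.1; wrap it in io.StringIO instead	required
match	str, compiled regex	Regular expression matched against the text of each table; only tables containing a match are returned	'.+'
flavor	str, list-like	Parser: 'lxml', 'html5lib', 'bs4' ('bs4' and 'html5lib' are synonyms). None tries lxml first	None
header	int, list-like	Row or rows to use as the column names	None
index_col	int, list-like	Column or columns to use as the row index	None
skiprows	int, list-like, slice	Rows to skip	None
attrs	dict	HTML attributes used to select tables	None
parse_dates	bool, list	Whether to parse date columns	False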
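The sketch below shows two parameters from the signature that the table does not cover: thousands and converters. It is a minimal illustration, assuming lxml is installed. The table contents and the column names code and population are invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: a code column with leading zeros and a
# population column written with thousands separators.
html = """
<table>
  <tr><th>code</th><th>population</th></tr>
  <tr><td>00123</td><td>1,234,567</td></tr>
  <tr><td>04560</td><td>89,000</td></tr>
</table>
"""

# thousands=',' is the default, so "1,234,567" is parsed as an integer.
# converters={"code": str} keeps the code column as text, which
# preserves the leading zeros.
df = pd.read_html(StringIO(html), converters={"code": str})[0]

print(df)
print(df.dtypes)
```

Without the converter, the code column would be parsed as a number and the leading zeros would be lost.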

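Return Value

Return type: list of DataFrames
read_html() returns a list of DataFrames, one for each table found on the page.
If no table matches, it raises a ValueError instead of returning an empty list.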
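Because a failed match raises an exception, code that may legitimately find no tables usually wraps the call. The helper below, read_tables_or_empty, is a hypothetical name introduced for this sketch. It pins flavor="lxml" (assuming lxml is installed) so that only one parser is tried.

```python
from io import StringIO

import pandas as pd

HTML = "<table><tr><th>name</th></tr><tr><td>Tom</td></tr></table>"


def read_tables_or_empty(html_text, **kwargs):
    """Hypothetical helper: return [] instead of raising when nothing matches."""
    try:
        # flavor="lxml" pins a single parser (assumes lxml is installed)
        return pd.read_html(StringIO(html_text), flavor="lxml", **kwargs)
    except ValueError:
        # Raised when no table matches. Other parsing problems can also
        # raise ValueError, so a real script may want to log the message.
        return []


found = read_tables_or_empty(HTML, match="Tom")
missing = read_tables_or_empty(HTML, match="no-such-text")
print(len(found), len(missing))
```

Returning an empty list is a design choice for this sketch. Letting the ValueError propagate is just as reasonable when a missing table is a real error.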

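Examples

The examples below cover the main ways to use read_html().

Example 1: Reading Tables from a Local HTML File

First create an HTML file that contains tables, then read it with read_html().

Example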

import pandas as pd

# Build an HTML document that contains two tables
html_content = '''

<title>Employee information</title>

<h1>Pandas pd.read_html() function</h1>
<table border="1" class="employee-table">
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
<th>Salary</th>
</tr>
<tr>
<td>Tom</td>
<td>28</td>
<td>Beijing</td>
<td>8000</td>
</tr>
<tr>
<td>Jerry</td>
<td>35</td>
<td>Shanghai</td>
<td>12000</td>
</tr>
<tr>
<td>Mike</td>
<td>42</td>
<td>Guangzhou</td>
<td>15000</td>
</tr>
<tr>
<td>Lucy</td>
<td>26</td>
<td>Shenzhen</td>
<td>7000</td>
</tr>
</table>

<h2>Department list</h2>
<table border="1">
<tr>
<th>Department</th>
<th>Headcount</th>
</tr>
<tr>
<td>Engineering</td>
<td>50</td>
</tr>
<tr>
<td>Sales</td>
<td>30</td>
</tr>
</table>

'''

# Write the HTML to a file
with open('tables.html', 'w', encoding='utf-8') as f:
    f.write(html_content)

# Read every table in the file
# io: path to the HTML file (required)
tables = pd.read_html('tables.html')

# Inspect the result
print(f"Found {len(tables)} tables")
print()

# Loop over all tables
for i, df in enumerate(tables):
    print(f"--- Table {i+1} ---")
    print(df)
    print()

Expected output:

Found 2 tables

--- Table 1 ---
    Name  Age       City  Salary
0    Tom   28    Beijing    8000
1  Jerry   35   Shanghai   12000
2   Mike   42  Guangzhou   15000
3   Lucy   26   Shenzhen    7000

--- Table 2 ---
    Department  Headcount
0  Engineering         50
1        Sales         30

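Code notes:

read_html() returns a list of DataFrames, one per table.
By default, every table in the document is read.
The first row is used as the column names automatically because all of its cells are <th> elements.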
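Each element of the returned list is an ordinary DataFrame, so the usual pandas operations apply directly. The following sketch uses a small invented two-table document passed through io.StringIO and assumes lxml is installed. It unpacks the list and checks that numeric cells were parsed as numbers.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table document, kept small for the illustration
html = """
<table>
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Tom</td><td>8000</td></tr>
  <tr><td>Jerry</td><td>12000</td></tr>
</table>
<table>
  <tr><th>Department</th><th>Headcount</th></tr>
  <tr><td>Engineering</td><td>50</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))

# Tables come back in document order, so the list can be unpacked
employees, departments = tables

# Numeric cells are converted to numeric dtypes automatically
print(employees.dtypes)
print("Average salary:", employees["Salary"].mean())
print("Total headcount:", departments["Headcount"].sum())
```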

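Example 2: Filtering Tables with attrs and match

When a page contains several tables, you can select the ones you need by HTML attribute or by text content.

Example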

import pandas as pd

# Build an HTML file whose tables carry id and class attributes
html_with_attrs = '''

<table id="employees" class="data-table">
<tr><th>name</th><th>age</th></tr>
<tr><td>Tom</td><td>28</td></tr>
<tr><td>Jerry</td><td>35</td></tr>
</table>

<table id="products" class="data-table">
<tr><th>product</th><th>price</th></tr>
<tr><td>A</td><td>100</td></tr>
<tr><td>B</td><td>200</td></tr>
</table>

<table class="summary">
<tr><td>Total</td><td>2</td></tr>
</table>

'''

with open('tables_attrs.html', 'w', encoding='utf-8') as f:
    f.write(html_with_attrs)

# Example 2a: filter by the id attribute with attrs
# Read the table with id="employees"
tables_by_id = pd.read_html('tables_attrs.html', attrs={'id': 'employees'})
print("Filter by id:")
print(tables_by_id[0])
print()

# Example 2b: filter by the class attribute with attrs
# Read every table with class="data-table"
tables_by_class = pd.read_html('tables_attrs.html', attrs={'class': 'data-table'})
print("Filter by class (found {} tables):".format(len(tables_by_class)))
for i, df in enumerate(tables_by_class):
    print(f"Table {i+1}:")
    print(df)
    print()

# Example 2c: filter by text content with match
# match is a regular expression tested against the text inside each table
tables_by_text = pd.read_html('tables_attrs.html', match='Tom')
print("Tables containing 'Tom':")
print(tables_by_text[0])

Expected output:

Filter by id:
    name  age
0    Tom   28
1  Jerry   35

Filter by class (found 2 tables):
Table 1:
    name  age
0    Tom   28
1  Jerry   35

Table 2:
   product  price
0        A    100
1        B    200

Tables containing 'Tom':
    name  age
0    Tom   28
1  Jerry   35

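Code notes:

The attrs parameter selects tables by HTML attribute (id, class, style, and so on).
The match parameter is a regular expression tested against the text in each table; only tables that contain a match are returned.
When only one table is returned, use tables[0] to get its DataFrame.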
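The two filters can also be combined, and match accepts a compiled pattern as well as a string. The sketch below assumes lxml is installed and uses an invented document. The pattern price|cost is only an example of an alternation.

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical document: two tables share a class, one does not
html = """
<table class="data-table">
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Tom</td><td>28</td></tr>
</table>
<table class="data-table">
  <tr><th>product</th><th>price</th></tr>
  <tr><td>A</td><td>100</td></tr>
</table>
<table class="summary">
  <tr><td>Total price</td><td>100</td></tr>
</table>
"""

# attrs narrows the search to class="data-table";
# match then keeps only tables whose text matches the pattern.
pattern = re.compile(r"price|cost")
selected = pd.read_html(
    StringIO(html),
    match=pattern,
    attrs={"class": "data-table"},
)

print(len(selected))
print(selected[0])
```

The summary table also contains the word "price", but it is excluded because its class does not match attrs.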

Example 3: Handling Headers and Indexes

HTML tables can have irregular layouts, so headers and indexes sometimes need explicit handling.

Example

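import pandas as pd

# Build HTML with more complex table structures
html_complex = '''


<table id="no_header">
<tr><td>Tom</td><td>28</td></tr>
<tr><td>Jerry</td><td>35</td></tr>
</table>


<table id="multi_header">
<tr><th>Name</th><th colspan="2">Contact</th></tr>
<tr><th></th><th>Phone</th><th>Email</th></tr>
<tr><td>Tom</td><td>123456</td><td>tom@runoob.com</td></tr>
<tr><td>Jerry</td><td>789012</td><td>jerry@runoob.com</td></tr>
</table>


<table id="with_index" data-type="employee">
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><th></th><th></th><th></th></tr>
<tr><td>Tom</td><td>28</td><td>Beijing</td></tr>
</table>

'''

with open('tables_complex.html', 'w', encoding='utf-8') as f:
    f.write(html_complex)

# Example 3a: a table without a header row
# header=None is the default: pandas infers the header from <thead> or <th>-only rows.
# This table has neither, so the columns get integer labels.
df_no_header = pd.read_html('tables_complex.html', attrs={'id': 'no_header'}, header=None)[0]
print("No header:")
print(df_no_header)
print()

# Example 3b: multi-row header
# Rows 0 and 1 contain only <th> cells, so together they form the header
df_multi_header = pd.read_html('tables_complex.html', attrs={'id': 'multi_header'})[0]
print("Multi-row header:")
print(df_multi_header)
print()

# Example 3c: set the index column
# index_col=0 uses the first column as the row index
df_with_index = pd.read_html('tables_complex.html', attrs={'id': 'with_index'}, index_col=0)[0]
print("Set the index:")
print(df_with_index)

Expected output:

No header:
       0   1
0    Tom  28
1  Jerry  35

"Multi-row header:" prints a DataFrame whose columns form a two-level MultiIndex. The colspan cell is repeated, giving (Contact, Phone) and (Contact, Email), and the empty second-level cell under Name typically appears as a placeholder such as Unnamed: 0_level_1. The data rows are Tom and Jerry.

"Set the index:" prints a DataFrame indexed by Name with the columns Age and City. The all-empty <th> row in this table is not used as a header; it typically shows up as a row of NaN values ahead of the Tom row.

The exact layout of the last two results depends on the pandas version.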

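Code notes:

header=None is the default and lets pandas infer the header. A table with no <thead> and no <th>-only rows gets integer column labels.
A multi-row header is converted to multi-level columns (MultiIndex).
The index_col parameter sets a column as the row index.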
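Some pages put their column names in ordinary <td> cells, so nothing is inferred as a header. In that case, passing header explicitly promotes a row to column names. The following sketch assumes lxml is installed and uses an invented table to compare the inferred and explicit results.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose header row uses <td> instead of <th>
html = """
<table>
  <tr><td>name</td><td>age</td></tr>
  <tr><td>Tom</td><td>28</td></tr>
  <tr><td>Jerry</td><td>35</td></tr>
</table>
"""

# Default: no <th> rows, so the first row is treated as data
inferred = pd.read_html(StringIO(html))[0]

# header=0 uses the first row as the column names
explicit = pd.read_html(StringIO(html), header=0)[0]

# header and index_col can be combined
indexed = pd.read_html(StringIO(html), header=0, index_col=0)[0]

print(inferred)
print(explicit)
print(indexed)
```

In the inferred result every column holds text, because the header row is mixed in with the data. With header=0 the age column is parsed as integers.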

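Notes

read_html() needs a parser library: lxml by default (pip install lxml). flavor='bs4' needs beautifulsoup4 and html5lib instead.
It returns a list of DataFrames, so select the table you need by index.
When a page has several tables, filter them with attrs or match.
read_html() tries to parse every table, which can be slow on large pages; use match or attrs to narrow down which tables are returned.
Reading a remote page requires network access, and the site may restrict access or block scrapers.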
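When a site rejects the default request, one common workaround is to download the page yourself and hand the text to read_html(). The sketch below uses only the standard library for the download. The functions fetch_html and tables_from_html, the User-Agent string, and the URL in the usage comment are all hypothetical, and whether a given site accepts such a request is not guaranteed.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page with an explicit User-Agent."""
    request = Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-script)"},
    )
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def tables_from_html(html_text, **kwargs):
    """Hypothetical helper: parse already-downloaded HTML text."""
    return pd.read_html(StringIO(html_text), **kwargs)


# Usage with a placeholder URL (not executed here):
# tables = tables_from_html(fetch_html("https://example.com/stats.html"),
#                           match="Population")

# The parsing half can be tried offline with literal HTML:
sample = "<table><tr><th>year</th></tr><tr><td>2020</td></tr></table>"
offline_tables = tables_from_html(sample)
print(offline_tables[0])
```

Separating download from parsing also makes it easy to cache the HTML on disk and avoid repeated requests to the same site.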

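Summary

read_html() is a powerful pandas tool for reading HTML table data. It parses the tables in a web page automatically and converts them into structured DataFrames.

In practice, read_html() is a convenient choice whenever you need data from a web page. Knowing how to use attrs and match lets you extract the data you need from complex pages efficiently, so consider this function first when you need to scrape web tables.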
