Implementing a Simple Web Crawler in Python

We will use Python's requests library to send the HTTP request and the BeautifulSoup library to parse the HTML it returns. This simple web crawler extracts every link from a single web page.

Example

import requests
from bs4 import BeautifulSoup

def simple_web_crawler(url):
    # Send the HTTP request
    response = requests.get(url)

    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all <a> tags
        links = soup.find_all('a')

        # Extract and print each link
        for link in links:
            href = link.get('href')
            if href:
                print(href)
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

# Example usage
simple_web_crawler('https://www.example.com')

Code walkthrough:

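import requests: imports the requests library, which is used to send HTTP requests.
from bs4 import BeautifulSoup: imports the BeautifulSoup class, which is used to parse HTML content.
def simple_web_crawler(url): defines a function named simple_web_crawler that takes a URL as its only parameter.
response = requests.get(url): sends a GET request to the given URL and stores the response in the response variable.
if response.status_code == 200: checks whether the request succeeded (status code 200 means success).
soup = BeautifulSoup(response.text, 'html.parser'): parses the HTML content with BeautifulSoup, using Python's built-in html.parser.
links = soup.find_all('a'): finds all <a> tags, which are the elements that normally hold a page's links.
for link in links: iterates over every <a> tag that was found.
href = link.get('href'): reads the href attribute of each tag; the result is None if the tag has no href.
if href: checks that the href attribute exists and is not empty.
print(href): prints the link.
else: runs when the status code is not 200 and prints an error message that includes the status code.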
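The function above prints links directly and does not handle network failures such as timeouts or malformed URLs. The following is a minimal sketch of one possible variation, not part of the original example: it returns the links as a list and catches request errors. The names extract_links and fetch_links and the 10-second timeout are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the non-empty href values of all <a> tags in an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = []
    for link in soup.find_all('a'):
        href = link.get('href')  # None when the tag has no href attribute
        if href:
            hrefs.append(href)
    return hrefs


def fetch_links(url, timeout=10):
    """Fetch a page and return its links, or an empty list on failure.

    The timeout value is an illustrative assumption, not a recommendation.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print(f"Failed to retrieve the webpage: {exc}")
        return []
    return extract_links(response.text)
```

Keeping the parsing step in its own function means it can be tried on any HTML string without touching the network, while fetch_links('https://www.example.com') would perform the actual request.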

Output: running the code prints every link found on the https://www.example.com page. The exact output depends on the content of the target page. For example:

https://www.iana.org/domains/example

This is only a sample; the actual output may differ.
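One reason the output varies is that href values are printed exactly as they appear in the HTML, so a page may yield relative paths or fragment links rather than full URLs. As a hedged sketch, the standard library's urllib.parse.urljoin can resolve such values against the page's own URL. The helper name to_absolute, the page URL, and the sample href values below are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page URL and href values of the kinds a crawler may encounter
page_url = 'https://www.example.com/docs/index.html'
hrefs = ['page.html', '/about', '#top', 'https://www.iana.org/domains/example']

for link in to_absolute(page_url, hrefs):
    print(link)
# https://www.example.com/docs/page.html
# https://www.example.com/about
# https://www.example.com/docs/index.html#top
# https://www.iana.org/domains/example
```

An href that is already absolute passes through unchanged, so the same call can be applied to every link the crawler extracts.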
