Python urllib: A Beginner's Guide

The Complete Guide to Python's urllib Module

A detailed handbook for programming beginners: learn network requests in plain language

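What is the urllib module?

urllib is the Python standard library package for working with URLs (web addresses). Think of it as a networking toolbox: it ships with Python, so there is nothing extra to install.

💡 In short: just as a browser lets you visit websites, urllib lets a Python program visit them.

Why learn urllib?

Fetching data from web pages (the foundation of web scraping)
Downloading files (images, documents, and so on)
Exchanging data with API endpoints
Submitting form data (for example, logging in)
Automating web interactions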

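The four main urllib submodules:

Module	Purpose	Analogy
urllib.request	Opens and reads URLs	Like typing an address into a browser
urllib.parse	Parses and builds URLs	Like splitting a web address into its parts
urllib.error	Defines the exceptions raised during requests	Like handling a "404 Not Found" page
urllib.robotparser	Parses robots.txt files	Like reading a site's rulebook for crawlers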

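A simple example: opening a web page

The most basic use of urllib is fetching a page, here the Baidu home page:

    import urllib.request

    # Send the request and get the response
    response = urllib.request.urlopen('https://www.baidu.com')

    # Read the returned page content and decode it to text
    html_content = response.read().decode('utf-8')

    # Print the first 200 characters
    print(html_content[:200])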

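📝 What the code does:

Imports the urllib.request module
Opens the Baidu URL with urlopen()
Reads the returned content, which arrives as bytes
Converts the bytes to a string with decode('utf-8')
Prints the first 200 characters of the page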
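The steps above hard-code UTF-8 and never close the response. Here is a minimal sketch of a slightly more careful version, under two assumptions: `fetch_text` is a hypothetical helper name, and the `data:` URL is only a stand-in so the example runs without network access.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Hypothetical helper: fetch a URL and return its body as text."""
    # The with-block closes the response even if reading fails.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# A data: URL keeps the demo offline; a real page URL is passed the same way.
text = fetch_text("data:text/plain;charset=utf-8,Hello%20urllib")
print(text[:200])
```

Reading the charset from the response headers avoids a decoding error on pages that are not served as UTF-8, as long as the server declares the encoding.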

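urllib.request core features

This is the most frequently used urllib submodule. It is responsible for sending network requests.

1. Basic requests: urlopen()

    import urllib.request

    # The simplest GET request
    response = urllib.request.urlopen('https://www.example.com')
    content = response.read()  # Read the response body as bytes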

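Response object methods:

read() – read the entire body
read(n) – read at most n bytes
getcode() – get the HTTP status code (also available as the status attribute)
geturl() – get the URL actually retrieved, after any redirects (also available as the url attribute)
getheaders() – get the response headers as a list of (name, value) tuples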

Example: inspecting the response

    import urllib.request

    response = urllib.request.urlopen('https://www.baidu.com')

    print("Status code:", response.getcode())  # 200 on success
    print("URL:", response.geturl())

    headers = response.getheaders()
    print("Headers:")
    for header in headers[:5]:
        print(header)
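read(n) is useful when a response is too large to load in one go. Below is a minimal sketch of chunked reading. `read_in_chunks` is a hypothetical helper, the chunk size is arbitrary, and the `data:` URL is a stand-in so the example runs offline.

```python
import urllib.request


def read_in_chunks(response, chunk_size=4):
    """Hypothetical helper: yield the body in pieces of at most chunk_size bytes."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # an empty bytes object signals the end of the body
            break
        yield chunk


with urllib.request.urlopen("data:text/plain,abcdefghij") as response:
    chunks = list(read_in_chunks(response))

print(chunks)
```

When downloading a real file, each chunk could be written straight to disk instead of being collected in a list.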

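2. Setting request headers (imitating a browser)

Many websites block simple crawlers. Setting request headers makes a request look like it comes from a browser.

    import urllib.request

    url = 'https://www.example.com'

    # Build the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html'
    }

    # Create the request object
    req = urllib.request.Request(url, headers=headers)

    # Send the request
    response = urllib.request.urlopen(req)

⚠️ Important: by default urllib identifies itself as Python-urllib/3.x, and many sites check the User-Agent and reject that value, so set one explicitly.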
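A Request object can be inspected before it is sent, which is a handy way to confirm the headers are set as intended. The sketch below never touches the network. The URL and header values are placeholders.

```python
import urllib.request

# Placeholder values for illustration only.
url = "https://www.example.com"
headers = {"User-Agent": "Mozilla/5.0", "Accept": "text/html"}

req = urllib.request.Request(url, headers=headers)

# Request stores header names capitalized, so "User-Agent" becomes "User-agent".
user_agent = req.get_header("User-agent")
method = req.get_method()  # "GET", because no data is attached

print(method, req.full_url)
print(req.header_items())
```

Note the capitalization: looking up "User-Agent" with get_header() returns None, because the stored key is "User-agent".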

3. Sending a POST request (submitting a form)

To submit data to a server, such as a login form, use a POST request.

    import urllib.request
    import urllib.parse

    url = 'https://example.com/login'

    # Prepare the data to submit
    data = {
        'username': 'test_user',
        'password': 'test_pass'
    }

    # Convert the dictionary to a URL-encoded byte string
    data = urllib.parse.urlencode(data).encode('utf-8')

    # Create the request object and set the method to POST
    req = urllib.request.Request(url, data=data, method='POST')

    # Add a request header
    req.add_header('User-Agent', 'Mozilla/5.0')

    # Send the request
    response = urllib.request.urlopen(req)

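🔑 Key points:

Use urllib.parse.urlencode() to convert the dictionary to a query string
Use .encode('utf-8') to convert that string to bytes, because a str is not accepted as data
Set method='POST' on the Request object (a Request that carries data defaults to POST anyway)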
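The key points above can be checked without sending anything. This sketch builds the same form body and compares a Request with an explicit method against one without. The endpoint and credentials are the placeholder values from the example.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint and credentials.
url = "https://example.com/login"
form = {"username": "test_user", "password": "test_pass"}

# dict -> "key=value&key=value" string -> bytes
body = urllib.parse.urlencode(form).encode("utf-8")

explicit = urllib.request.Request(url, data=body, method="POST")
implicit = urllib.request.Request(url, data=body)  # no method given

print(body)
print(explicit.get_method(), implicit.get_method())
```

Both requests report POST. Passing method='POST' explicitly is still reasonable because it makes the intent obvious to the reader.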

urllib.parse – URL handling tools

This module parses, builds, and manipulates URLs, much like taking a web address apart and putting it back together.

1. Parsing a URL: urlparse()

    from urllib.parse import urlparse

    result = urlparse('https://www.example.com:8080/path/to/page?name=John&age=30#section1')

    print(result.scheme)    # https
    print(result.netloc)    # www.example.com:8080
    print(result.path)      # /path/to/page
    print(result.query)     # name=John&age=30
    print(result.fragment)  # section1

📌 The parsed result contains:

scheme – the protocol (http/https)
netloc – the network location (host plus port)
path – the path
query – the query string
fragment – the page anchor
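Beyond the fields listed above, the parsed result also exposes the host and port separately, and the query string can be split into a dictionary with parse_qs(). A short sketch using the same example URL:

```python
from urllib.parse import parse_qs, urlparse

result = urlparse("https://www.example.com:8080/path/to/page?name=John&age=30#section1")

host = result.hostname            # netloc without the port
port = result.port                # the port as an int, or None if absent
params = parse_qs(result.query)   # each value is a list, since keys can repeat

# The result is a named tuple, so _replace() returns a modified copy
# and geturl() reassembles it into a string.
page_two = result._replace(query="page=2", fragment="").geturl()

print(host, port, params)
print(page_two)
```

Editing a URL through _replace() and geturl() tends to be less error-prone than slicing and concatenating strings by hand.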

2. Building query parameters: urlencode()

urlencode() converts a Python dictionary to a URL query string.

    from urllib.parse import urlencode

    params = {
        'q': 'Python编程',
        'page': 2,
        'sort': 'relevance'
    }

    query_string = urlencode(params)
    print(query_string)  # q=Python%E7%BC%96%E7%A8%8B&page=2&sort=relevance

💡 Note: urlencode() automatically percent-encodes Chinese and other special characters.
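Two details of urlencode() are worth seeing in action: how it treats spaces, and how it handles a key with several values. The base URL below is a placeholder.

```python
from urllib.parse import urlencode

base_url = "https://example.com/search"  # placeholder endpoint

# Spaces are encoded as "+" in the resulting query string.
simple = urlencode({"q": "hello world", "page": 2})

# doseq=True expands a list into one key=value pair per item.
repeated = urlencode({"tag": ["python", "urllib"]}, doseq=True)

full_url = f"{base_url}?{simple}"
print(full_url)
print(repeated)
```

Without doseq=True, the list would be converted with str() and encoded as a single value, which is rarely what a server expects.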

3. URL encoding and decoding

A URL may contain only a limited set of characters, so Chinese and other special characters must be percent-encoded.

quote() – encode a single string

    from urllib.parse import quote

    url = "https://example.com/search?q=" + quote("Python教程")
    print(url)  # https://example.com/search?q=Python%E6%95%99%E7%A8%8B

unquote() – decode a percent-encoded string

    from urllib.parse import unquote

    encoded_str = "Python%E6%95%99%E7%A8%8B"
    decoded_str = unquote(encoded_str)
    print(decoded_str)  # Python教程
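quote() leaves "/" unencoded by default, which suits path segments but not query values. The sketch below compares the default with safe="" and with quote_plus(). The sample string is arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a/b c&d"

path_style = quote(text)           # "/" is left alone by default
strict = quote(text, safe="")      # safe="" encodes "/" as well
form_style = quote_plus(text)      # spaces become "+", and "/" is encoded

print(path_style)
print(strict)
print(form_style)

# Each encoding decodes back to the original text.
print(unquote(strict) == text, unquote_plus(form_style) == text)
```

As a rule of thumb, use quote_plus() or urlencode() for query-string values, and plain quote() for path components.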

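urllib.error – error handling

urllib.error defines the exceptions that network requests can raise.

    import urllib.request
    import urllib.error

    try:
        response = urllib.request.urlopen('https://example.com/nonexistent-page')
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404
        print('HTTP error code:', e.code)
        print('Reason:', e.reason)
    except urllib.error.URLError as e:
        # No valid response was received, for example a DNS failure or refused connection
        print('URL error:', e.reason)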
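The order of the except clauses matters: HTTPError is a subclass of URLError, so it must be caught first. Here is a minimal sketch that wraps the pattern in a function. `fetch` is a hypothetical helper, the timeout value is arbitrary, and the `foo://` and `data:` URLs are stand-ins chosen so the example runs offline.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=5):
    """Hypothetical helper: return (ok, detail) instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.read()
    except urllib.error.HTTPError as e:
        # More specific exception first: the server replied with an error status.
        return False, f"HTTP {e.code}: {e.reason}"
    except urllib.error.URLError as e:
        # Everything else urllib reports, such as an unsupported scheme or DNS failure.
        return False, f"URL error: {e.reason}"


# If URLError were listed first, it would swallow HTTPError as well.
is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)

good = fetch("data:text/plain,ok")       # succeeds without network access
bad = fetch("foo://example.invalid/")    # unsupported scheme raises URLError

print(is_subclass)
print(good)
print(bad)
```

Returning a tuple is only one possible design. Logging the error and re-raising it is an equally valid choice, depending on the caller.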

urllib.robotparser – crawling rules

This module parses a site's robots.txt file to determine whether a URL may be crawled.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # Download and parse robots.txt

    # Check whether the given crawler may fetch the given URL
    can_fetch = rp.can_fetch("MyBot", "https://www.example.com/private/")
    print("Allowed to crawl:", can_fetch)  # True or False
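To see how the rules are evaluated without downloading anything, the parser can be fed lines directly through parse(). The rules below are hypothetical and written inline. A real site's robots.txt will differ.

```python
import urllib.robotparser

# Hypothetical rules, written inline instead of downloaded.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes an iterable of lines

private_ok = rp.can_fetch("MyBot", "https://www.example.com/private/")
public_ok = rp.can_fetch("MyBot", "https://www.example.com/public/")

print(private_ok, public_ok)
```

A crawler would typically call can_fetch() before each request and skip any URL for which it returns False.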

urllib best practices

1. Use exception handling