The Ultimate Hands-On Guide to Fixing urllib.error.HTTPError: HTTP Error 403: Forbidden

CodeHorizon

Published 2025-05-19 10:07:19

Article link: https://blog.csdn.net/CodeHorizon/article/details/148056891

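I. Don't Panic When You Hit a 403! (A Classic Anti-Scraping Scenario)

When this bright red traceback shows up in your Python scraping project (especially if you're using the urllib library), resist the urge to smash your keyboard! HTTP status 403 means the server understood your request but refuses to serve it. Think of it as the website's security guard shouting: "I know what you're up to, and you're not getting in!"

A real example: last week I was scraping data from an e-commerce site with the requests library. The first 100 requests went through fine, and then the 403s started pouring in. It turned out the server had flagged my IP as a crawler (sob)...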
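
Before guessing at a cause, it helps to look at what the server actually sent back along with the 403. Here is a minimal sketch that catches urllib.error.HTTPError and pulls out the status code, the reason, a couple of response headers, and the start of the body. The helper names describe_http_error and fetch, and the choice of which headers to report, are my own illustration rather than anything urllib provides.

```python
import urllib.error
import urllib.request


def describe_http_error(err: urllib.error.HTTPError, body_limit: int = 200) -> dict:
    """Collect the parts of an HTTPError that usually hint at why access was refused."""
    # HTTPError doubles as a file-like response, so the error page can be read
    body = err.read(body_limit)
    return {
        "code": err.code,                              # e.g. 403
        "reason": err.reason,                          # e.g. "Forbidden"
        "server": err.headers.get("Server"),           # may name a CDN or WAF
        "retry_after": err.headers.get("Retry-After"), # set by some rate limiters
        "body_preview": body.decode("utf-8", errors="replace"),
    }


def fetch(url: str) -> bytes:
    """Fetch a URL; on an HTTP error, print diagnostics and re-raise."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        print(describe_http_error(err))
        raise
```

A body that mentions a CAPTCHA or a firewall product, or a Retry-After header, often narrows the cause down faster than trial and error.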

II. The Usual Suspects Behind a 403 (with a Self-Check List)

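1. Your User-Agent Gives You Away (the classic beginner trap)

Many sites inspect the User-Agent request header. urllib's default value is Python-urllib/3.x (where x is your Python minor version), which is like walking in with an "I'm a crawler" sticker on your forehead.

# Bad example (don't do this!): goes out with the default Python-urllib User-Agent
import urllib.request
response = urllib.request.urlopen('https://example.com')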
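
To see exactly what urllib announces by default, you can inspect a freshly built opener without sending anything over the network. This is a small sketch; the function name default_urllib_user_agent is made up for illustration.

```python
import urllib.request


def default_urllib_user_agent() -> str:
    """Return the User-Agent that a plain urllib opener attaches to requests."""
    opener = urllib.request.build_opener()
    # OpenerDirector keeps its default headers as a list of (name, value) pairs
    headers = dict(opener.addheaders)
    return headers["User-agent"]


# Prints a string of the form Python-urllib/<major>.<minor>
print(default_urllib_user_agent())
```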

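2. Requests Too Frequent (the server firewall is watching)

A rapid burst of back-to-back requests can look like a DoS attack to the server. I once fired requests at 0.1-second intervals and got my IP banned within 5 minutes (a lesson learned the hard way).

3. Login Required (insufficient permissions)

Like a VIP lounge that checks for a membership card, some pages are only served to requests that carry a valid cookie or token.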

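4. Anti-Scraping Defenses Triggered (the advanced level)

Including but not limited to:

Missing Referer header
JavaScript-encrypted parameters not handled
IP added to a blocklist
CAPTCHA verification required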

III. Eight Fixes That Actually Work (with Code)

Solution 1: Spoof a Browser User-Agent (the essential basics)

from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

req = request.Request(url='https://example.com', headers=headers)
response = request.urlopen(req)
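
Passing headers to every single Request gets repetitive. One option, sketched below, is to install a global opener so that plain request.urlopen(...) calls carry the browser UA too. The constant BROWSER_UA and the function install_browser_opener are names I made up for this example; the UA string is the same one used above.

```python
from urllib import request

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)


def install_browser_opener(user_agent: str = BROWSER_UA) -> request.OpenerDirector:
    """Build an opener with a browser User-Agent and make it the global default."""
    opener = request.build_opener()
    # Replace the default ('User-agent', 'Python-urllib/3.x') pair
    opener.addheaders = [("User-Agent", user_agent)]
    # From here on, request.urlopen(...) goes through this opener
    request.install_opener(opener)
    return opener
```

Headers set explicitly on a Request still win over the opener's defaults, so per-request overrides keep working.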

(Very important) Handy UA strings. Browser versions go stale, so swap in a current one from your own browser when you can:

Chrome on Windows 10: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Safari on macOS: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15

Solution 2: Add a Referer Header (for hotlink protection)

headers = {
    'Referer': 'https://www.google.com/',
    # other headers...
}
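
Hotlink protection typically checks whether the Referer points at a page the site trusts, which is often the site itself, so a Referer taken from the target's own domain may pass where a search-engine Referer does not. Here is a sketch that derives the Referer from the target URL and attaches it with urllib. The function name build_request_with_referer and the image URL are hypothetical, and whether the home page is an acceptable Referer depends on the site.

```python
from urllib import request
from urllib.parse import urlsplit


def build_request_with_referer(url: str, user_agent: str) -> request.Request:
    """Create a Request whose Referer is the home page of the target site."""
    parts = urlsplit(url)
    referer = f"{parts.scheme}://{parts.netloc}/"
    headers = {"User-Agent": user_agent, "Referer": referer}
    return request.Request(url=url, headers=headers)


# Hypothetical target URL, used only to show the derived header
req = build_request_with_referer(
    "https://example.com/images/photo.jpg", "Mozilla/5.0..."
)
print(req.get_header("Referer"))  # https://example.com/
```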

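Solution 3: Keep a Session with requests.Session (advanced technique)

import requests

# Placeholders for illustration: substitute the site's real endpoints and form fields
login_url = 'https://example.com/login'
target_url = 'https://example.com/api/data'
credentials = {'username': 'xxx', 'password': 'xxx'}

session = requests.Session()
session.headers.update({
    'User-Agent': '...',
    'Accept-Language': 'zh-CN,zh;q=0.9'
})

# Log in first to obtain cookies
login_response = session.post(login_url, data=credentials)
login_response.raise_for_status()  # stop early if the login itself failed
# Later requests automatically carry the session's cookies
data = session.get(target_url).json()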

Solution 4: Add Request Delays (behave like a human)

import time
import random

for page in range(1, 101):
    # Random delay of 1 to 3 seconds
    time.sleep(random.uniform(1, 3))
    # Send the request...
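
A fixed sleep before every request also waits when the previous request was already slow. A small throttle that only sleeps for whatever is left of the gap avoids that. This is a sketch under my own assumptions: the class name Throttle is invented, and the 2 to 5 second defaults are just a conservative starting point. The clock and sleep parameters exist so the class can be tested without really waiting.

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between consecutive requests."""

    def __init__(self, min_delay=2.0, max_delay=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first

    def wait(self) -> float:
        """Sleep just long enough to honor the gap; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            gap = random.uniform(self.min_delay, self.max_delay)
            elapsed = self._clock() - self._last
            slept = max(0.0, gap - elapsed)
            if slept:
                self._sleep(slept)
        self._last = self._clock()
        return slept
```

Usage is one call per iteration: create throttle = Throttle() once, then call throttle.wait() right before each request.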

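Solution 5: Use a Proxy IP Pool (the ultimate weapon)

import requests

url = 'https://example.com'  # placeholder target

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get(url, proxies=proxies, timeout=10)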

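(Pitfall guide) Three traps of free proxies:

Response times are painfully slow
Often fewer than 30% of them actually work
They may leak your request data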
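
The snippet above uses a single proxy; a "pool" means rotating through several and dropping the ones that stop responding. Here is one way that could look with only the standard library. The class ProxyRotator and its methods are my own sketch, and the proxy addresses in the usage note are placeholders.

```python
import itertools
from urllib import request


class ProxyRotator:
    """Cycle through a pool of proxy URLs, skipping ones marked as dead."""

    def __init__(self, proxy_urls):
        self._pool = list(proxy_urls)
        self._dead = set()
        self._cycle = itertools.cycle(self._pool)

    def next_proxy(self) -> str:
        """Return the next live proxy, or raise if none are left."""
        for _ in range(len(self._pool)):
            candidate = next(self._cycle)
            if candidate not in self._dead:
                return candidate
        raise RuntimeError("every proxy in the pool is marked dead")

    def mark_dead(self, proxy_url: str) -> None:
        """Exclude a proxy after it times out or gets blocked."""
        self._dead.add(proxy_url)

    def build_opener(self) -> request.OpenerDirector:
        """Return a urllib opener that routes HTTP and HTTPS via the next proxy."""
        proxy = self.next_proxy()
        handler = request.ProxyHandler({"http": proxy, "https": proxy})
        return request.build_opener(handler)
```

For example, rotator = ProxyRotator(['http://10.10.1.10:3128', 'http://10.10.1.11:3128']) followed by rotator.build_opener().open(url, timeout=10); when that call times out or returns 403, call mark_dead on the proxy and build a new opener.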

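Solution 6: Handle Cookies (when login is required)

from http.cookiejar import CookieJar
from urllib import request
from urllib.parse import urlencode

# Placeholders for illustration: substitute the site's real URLs
login_url = 'https://example.com/login'
target_url = 'https://example.com/profile'

# Create a cookie-aware opener
cookie_jar = CookieJar()
handler = request.HTTPCookieProcessor(cookie_jar)
opener = request.build_opener(handler)

# Simulate login (passing data makes this a POST request)
login_data = {'username': 'xxx', 'password': 'xxx'}
req = request.Request(login_url, data=urlencode(login_data).encode())
opener.open(req)

# Later requests automatically carry the cookies
response = opener.open(target_url)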
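
Logging in on every run is itself suspicious traffic. If the site's session cookies live long enough, they can be saved to disk and reloaded next time. The sketch below uses the standard library's MozillaCookieJar for that; the function names and the cookie file path are assumptions of mine.

```python
from http.cookiejar import MozillaCookieJar
from pathlib import Path
from urllib import request


def build_persistent_opener(cookie_file: str):
    """Return (opener, jar); the jar is preloaded from cookie_file if it exists."""
    jar = MozillaCookieJar(cookie_file)
    if Path(cookie_file).exists():
        # ignore_discard=True also restores session cookies that have no expiry
        jar.load(ignore_discard=True)
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    return opener, jar


def save_cookies(jar: MozillaCookieJar) -> None:
    """Write the jar back to the file it was created with."""
    jar.save(ignore_discard=True)
```

A typical flow would be: build the opener, log in only if the jar is empty, make your requests, then call save_cookies(jar) before exiting.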

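Solution 7: Drive a Real Browser with Selenium (the nuclear option)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://example.com'  # placeholder target

options = Options()
options.add_argument("--headless")  # run without a visible window
options.add_argument("user-agent=Mozilla/5.0...")

driver = webdriver.Chrome(options=options)
driver.get(url)
page_source = driver.page_source
driver.quit()  # release the browser process when done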

Solution 8: The Ultimate Combo (pro setup)

import requests
from fake_useragent import UserAgent

url = 'https://example.com'  # placeholder target

ua = UserAgent()
headers = {
    'User-Agent': ua.random,  # a different browser UA on each run
    'Referer': 'https://www.google.com/',
    # 'br' only gets decoded if the brotli package is installed
    'Accept-Encoding': 'gzip, deflate, br'
}

proxies = {'https': 'http://premium_proxy:port'}
cookies = {'session_id': 'xxxxxx'}

response = requests.get(
    url,
    headers=headers,
    proxies=proxies,
    cookies=cookies,
    timeout=10
)

IV. Debugging Tricks (every developer's toolkit)

1. Reproduce the Request with curl

curl -v -H "User-Agent: Mozilla/5.0..." -H "Referer: https://google.com" https://target-site.com

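2. Analyze with Chrome DevTools

Inspect the full request headers in the Network panel
Right-click the request → Copy → Copy as cURL

3. Test the Endpoint with Postman

Add request headers one at a time to pin down which field gets you blocked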
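
The same header hunt can be scripted. The sketch below runs it in the opposite direction from the Postman approach: start from a header set that is known to work (for example, one copied from the browser) and drop one header at a time to see which removals produce a 403. The function find_required_headers and its fetch_status callback are hypothetical names; fetch_status is whatever function you write to send the request and return the status code.

```python
from typing import Callable, Dict, List


def find_required_headers(
    full_headers: Dict[str, str],
    fetch_status: Callable[[Dict[str, str]], int],
) -> List[str]:
    """Drop one header at a time and report those whose absence yields a 403.

    full_headers must be a set that is known to work. fetch_status sends a
    request with the given headers and returns the HTTP status code.
    """
    required = []
    for name in full_headers:
        trimmed = {k: v for k, v in full_headers.items() if k != name}
        if fetch_status(trimmed) == 403:
            required.append(name)
    return required
```

Remember to pause between probes, otherwise the test itself may trip the rate limiter and muddy the result.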

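V. Three Golden Rules for Not Getting Banned (survival guide)

Respect robots.txt: like knocking before you walk into someone's home
Throttle your requests: 2 to 5 seconds apart is a sensible baseline; wait longer during peak hours
Set up a retry mechanism: pause and retry when you hit a 403

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

# Up to 3 attempts, with exponential backoff clamped between 4 and 10 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def safe_request(url):
    response = requests.get(url, timeout=10)
    if response.status_code == 403:
        raise Exception("Anti-scraping triggered")
    return response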
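
The first rule can be checked in code as well. Python ships urllib.robotparser for this; the sketch below feeds it robots.txt rules as text so it runs without touching the network. The helper is_allowed, the sample rules, and the my-crawler agent name are all made up for the example.

```python
from urllib import robotparser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/admin/users"))
```

Against a live site you would instead call set_url() with the site's robots.txt address and then read() on the parser before asking can_fetch().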