Python Web Crawling Fundamentals: A Summary
1. Crawler Overview
1.1 What Is a Web Crawler
A web crawler (Web Crawler) is a program or script that browses the World Wide Web automatically. Following a defined set of rules, it fetches information from the internet and stores it locally, typically in a database.
1.2 How a Crawler Works
- URL manager: keeps track of URLs waiting to be crawled and URLs already crawled
- Downloader: fetches the content of each page
- Parser: extracts the data you need from the page
- Storage: writes the extracted data to a database or file
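The sketch below ties these four stages together in a minimal single-file crawler. It is only an illustration of the flow, not a production design; the seed URL and the idea of pushing newly found links onto the queue are placeholders.

import requests
from bs4 import BeautifulSoup

seen, queue = set(), ['https://example.com']     # URL manager: pending and visited URLs
rows = []                                        # extracted records

while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    html = requests.get(url, timeout=10).text    # downloader
    soup = BeautifulSoup(html, 'html.parser')    # parser
    rows.append({'url': url, 'title': soup.title.string if soup.title else ''})
    # In a real crawler, new links found on the page would be appended to the queue here.

print(rows)                                      # storage would normally write to a file or database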
2. Common Python Libraries for Crawling
2.1 requests (HTTP request library)
import requests

# Basic GET request
response = requests.get('https://example.com')
print(response.text)  # page content

# GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://example.com', params=params)

# POST request
data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://example.com', data=data)

# Setting request headers
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com', headers=headers)

# Inspecting the response
print(response.status_code)  # status code
print(response.headers)      # response headers
print(response.cookies)      # cookies
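In practice it also pays to set a timeout and check for HTTP errors before using the response. The snippet below is a small sketch of that; the URL is a placeholder and it reuses the headers dict defined above.

try:
    response = requests.get('https://example.com/api', headers=headers, timeout=10)
    response.raise_for_status()          # raise an exception on 4xx/5xx status codes
    data = response.json()               # parse a JSON body, if the endpoint returns one
except requests.RequestException as e:
    print(f'Request failed: {e}')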
2.2 BeautifulSoup (HTML parsing library)
from bs4 import BeautifulSoup

html = """
<html>
<body>
<div class="content">
<p>This is a paragraph</p>
<a href="https://example.com">Link</a>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

# Accessing elements
print(soup.p.text)     # text of the first <p> tag
print(soup.a['href'])  # href attribute of the first <a> tag

# Finding all matching elements
for p in soup.find_all('p'):
    print(p.text)

# CSS selectors
print(soup.select_one('.content p').text)  # first <p> inside an element with class "content"
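Attribute-based lookups are also common; a short sketch, reusing the same soup object from above:

div = soup.find('div', class_='content')             # find by tag name plus class
links = [a.get('href') for a in soup.find_all('a')]  # collect every href on the page
print(div.get('class'), links)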
2.3 Scrapy (crawler framework)
# Install: pip install scrapy

# Create a project
scrapy startproject myproject

# Generate a spider
scrapy genspider example example.com

# Write the spider in spiders/example.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'title': response.css('title::text').get(),
            'links': response.css('a::attr(href)').getall()
        }
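Once the spider is written, it is run from the project root; the -o flag writes the yielded items straight to a feed file (the output filename here is just an example).

# Run the spider and export the yielded items to a JSON feed
scrapy crawl example -o output.json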
3. Practical Crawling Techniques
3.1 Handling Dynamically Loaded Content
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Start the browser
driver = webdriver.Chrome()

# Open the page
driver.get('https://example.com')

# Wait for the page to load
time.sleep(3)

# Read the dynamically rendered content
content = driver.find_element(By.CSS_SELECTOR, '.dynamic-content').text
print(content)

# Close the browser
driver.quit()
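A fixed time.sleep works, but an explicit wait is usually more reliable because it returns as soon as the element appears. A sketch, assuming the same placeholder URL and '.dynamic-content' selector and reusing the webdriver and By imports above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')
# Wait up to 10 seconds for the element to be present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.dynamic-content'))
)
print(element.text)
driver.quit()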
3.2 Dealing with Anti-Crawling Measures
# 1. Set realistic request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
    'Referer': 'https://www.google.com/',
    'Accept-Language': 'en-US,en;q=0.9'
}

# 2. Use a proxy
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port'
}
response = requests.get(url, headers=headers, proxies=proxies)

# 3. Throttle the request rate
import time
time.sleep(2)  # pause 2 seconds between requests

# 4. OCR-based CAPTCHA recognition
from PIL import Image
import pytesseract

# Load the CAPTCHA image and recognize its text
image = Image.open('captcha.png')
captcha_text = pytesseract.image_to_string(image)
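These measures are usually combined. The sketch below puts a few of them together: a persistent session, the headers defined above, and a randomized delay between requests. The list URL and page range are placeholders.

import random
import time
import requests

session = requests.Session()
session.headers.update(headers)          # reuse the realistic headers defined above

for page in range(1, 4):
    url = f'https://example.com/list?page={page}'
    resp = session.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(1, 3))     # randomized delay to avoid a fixed, robotic rhythm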
3.3 Data Storage
# 1. Save to a CSV file
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])
    writer.writerow([title, link])

# 2. Save to a JSON file
import json

data = {'title': title, 'link': link}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

# 3. Save to a MySQL database
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='123456', db='test')
cursor = conn.cursor()
sql = "INSERT INTO articles(title, link) VALUES(%s, %s)"
cursor.execute(sql, (title, link))
conn.commit()
cursor.close()
conn.close()
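When a crawl produces many records, inserting them one by one is slow; executemany sends them in a batch. A sketch, assuming the same articles table and a list of (title, link) tuples collected earlier:

rows = [('Title 1', 'https://example.com/1'), ('Title 2', 'https://example.com/2')]
conn = pymysql.connect(host='localhost', user='root', password='123456', db='test')
with conn.cursor() as cursor:
    cursor.executemany("INSERT INTO articles(title, link) VALUES(%s, %s)", rows)
conn.commit()
conn.close()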
4. Advanced Topics
4.1 Regular Expressions
import re

text = "Python is great, version 3.9.0"
pattern = r'\d+\.\d+\.\d+'  # match a version number
match = re.search(pattern, text)
if match:
    print(match.group())  # prints: 3.9.0
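re.search returns only the first match; findall with capture groups extracts every occurrence at once. A short sketch on made-up text:

text = "requests 2.31.0 and lxml 4.9.3 are installed"
for name, version in re.findall(r'(\w+) (\d+\.\d+\.\d+)', text):
    print(name, version)   # prints: requests 2.31.0, then lxml 4.9.3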
4.2 XPath Parsing
from lxml import etree

html = """
<html>
<body>
<div class="content">
<p>This is a paragraph</p>
</div>
</body>
</html>
"""
tree = etree.HTML(html)
result = tree.xpath('//div[@class="content"]/p/text()')
print(result)  # prints: ['This is a paragraph']
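XPath can also pull out attributes and all nested text directly. A short sketch on a separate snippet:

snippet = '<div class="content"><a href="https://example.com">Link</a></div>'
doc = etree.HTML(snippet)
print(doc.xpath('//a/@href'))                      # ['https://example.com']
print(doc.xpath('//div[@class="content"]//text()'))  # all text inside the div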
4.3 Multithreaded and Multiprocess Crawling
import threading

def crawl(url):
    print(f"Crawling: {url}")

urls = ['https://example.com/page1', 'https://example.com/page2']
threads = []
for url in urls:
    t = threading.Thread(target=crawl, args=(url,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
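The standard-library concurrent.futures module gives the same result with less bookkeeping, since the pool handles starting and joining the threads. A sketch reusing the crawl function and urls list from above:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(crawl, urls)   # runs crawl(url) for each URL on the worker threads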
5. Hands-On Crawler Projects
5.1 A Simple News Crawler
import requests
from bs4 import BeautifulSoup
import csv

def get_news():
    url = 'https://news.example.com'
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    news_list = []
    for item in soup.select('.news-item'):
        title = item.select_one('.title').text.strip()
        link = item.select_one('a')['href']
        news_list.append({'title': title, 'link': link})
    # Save to CSV
    with open('news.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link'])
        writer.writeheader()
        writer.writerows(news_list)

get_news()
5.2 A Product Price Monitor
import requests
from bs4 import BeautifulSoup
import time
import smtplib
from email.mime.text import MIMEText

def check_price():
    url = 'https://product.example.com'
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    price = float(soup.select_one('.price').text.strip().replace('$', ''))
    return price

def send_email(subject, content):
    sender = 'your_email@example.com'
    password = 'your_password'
    receiver = 'receiver@example.com'
    msg = MIMEText(content)
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = receiver
    with smtplib.SMTP_SSL('smtp.example.com', 465) as server:
        server.login(sender, password)
        server.sendmail(sender, receiver, msg.as_string())

# Monitor the price
target_price = 100.0
while True:
    current_price = check_price()
    if current_price <= target_price:
        send_email('Price alert', f'The price has dropped to ${current_price:.2f}')
        break
    time.sleep(3600)  # check once per hour
6. Things to Keep in Mind
- Respect robots.txt: check the target site's robots.txt file and honor its crawling rules (see the sketch after this list)
- Use a reasonable request rate: avoid putting excessive load on the server
- Handle failures gracefully: network errors, changes in page structure, and so on
- Deduplicate data: avoid crawling the same content more than once
- Stay legally compliant: make sure your crawling complies with applicable laws and regulations
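Checking robots.txt can be automated with the standard library. A small sketch; the target URL and the user-agent string are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()
if rp.can_fetch('MyCrawler', 'https://example.com/some/page'):
    print('Allowed to fetch this page')
else:
    print('Disallowed by robots.txt')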
7. Recommended Learning Resources
- Official documentation:
  - requests: https://docs.python-requests.org/
  - BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
  - Scrapy: https://scrapy.org/
- Online tutorials:
  - 菜鸟教程 (Runoob): https://www.runoob.com/
  - 廖雪峰's Python tutorial: https://www.liaoxuefeng.com/
- Books:
  - 《Python网络数据采集》
  - Web Scraping with Python
- Tools:
  - Postman: for testing APIs
  - Fiddler: for capturing and analyzing network traffic
  - XPath Helper: a Chrome extension that helps with writing XPath expressions
With a systematic grasp of these fundamentals and practical techniques, you will be able to build your own Python crawlers and collect the data you need from the web efficiently. Remember that crawling is a double-edged sword: always use it in compliance with the law and with basic ethical norms.