[Python Crawler 4] Concurrent and Parallel Downloading

Article link: https://blog.csdn.net/u014134180/article/details/55506994

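This article shows how to download pages concurrently and in parallel with Python: parsing the Alexa site list, speeding up the crawler with multiple threads and multiple processes, and comparing performance. It also discusses the difference between concurrency and parallelism and gives a MongoDB-based implementation of the crawl queue.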

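Contents

1 One million websites
- 1.1 Parsing the Alexa list the plain way
- 1.2 Parsing the Alexa list by reusing the crawler code
2 Serial crawler
3 Concurrent and parallel crawlers
4 Performance comparison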

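This article covers two ways to download web pages concurrently and in parallel, multithreading and multiprocessing, and compares their performance with serial downloading.

1 One million websites

Alexa, an Amazon subsidiary, publishes a list of the one million most popular websites (http://www.alexa.com/topsites ). The list can also be downloaded directly as a compressed file from http://s3.amazonaws.com/alexa-static/top-1m.csv.zip , so there is no need to scrape the data from the Alexa website.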

Rank	Domain
1	google.com
2	youtube.com
3	facebook.com
4	baidu.com
5	yahoo.com
6	wikipedia.org
7	google.co.in
8	amazon.com
9	qq.com
10	google.co.jp
11	live.com
12	taobao.com

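1.1 Parsing the Alexa list the plain way

Extracting the data takes four steps:

1. Download the .zip file;
2. Extract the CSV file from the .zip file;
3. Parse the CSV file;
4. Iterate over each row of the CSV file and extract the domain from it.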

# -*- coding: utf-8 -*-

import csv
from zipfile import ZipFile
from StringIO import StringIO
from downloader import Downloader

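def alexa():
    """Download the Alexa top-1m archive and return its sites as URLs."""
    D = Downloader()
    zipped_data = D('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip')
    urls = []  # the top 1 million URLs will be stored in this list
    # ZipFile expects a file-like object, so wrap the downloaded string
    with ZipFile(StringIO(zipped_data)) as zf:
        csv_filename = zf.namelist()[0]  # the archive contains a single CSV file
        # each row is (rank, domain); only the domain is needed
        for _, website in csv.reader(zf.open(csv_filename)):
            urls.append('http://' + website)
    return urls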

if __name__ == '__main__':
    print len(alexa())

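The downloaded compressed data is wrapped in StringIO before being passed to ZipFile, because ZipFile expects a file-like interface rather than a string. Since the zip file contains only one file, we simply select the first entry. The http:// scheme is then prepended to each domain, and the result is appended to the URL list.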
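The same wrapping idea can be tried without any network access. The following is a minimal sketch, assuming Python 3 (the listings in this article are Python 2), where the wrapper becomes io.BytesIO and the bytes returned by zf.open() must be decoded before csv.reader can read them. The function names parse_alexa_zip and make_sample_zip, and the three-row sample archive, are hypothetical and only stand in for the real download.

```python
import csv
import io
from zipfile import ZipFile


def parse_alexa_zip(zipped_bytes, scheme='http://'):
    """Return a list of URLs from a zipped 'rank,domain' CSV held in memory."""
    urls = []
    # ZipFile needs a seekable file-like object, so wrap the raw bytes.
    with ZipFile(io.BytesIO(zipped_bytes)) as zf:
        csv_filename = zf.namelist()[0]  # assume the archive holds a single CSV
        with zf.open(csv_filename) as raw:
            # zf.open() yields bytes; csv.reader needs text in Python 3.
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            for _, website in csv.reader(text):
                urls.append(scheme + website)
    return urls


def make_sample_zip():
    """Build a tiny stand-in archive (made-up sample, not the real list)."""
    buf = io.BytesIO()
    with ZipFile(buf, 'w') as zf:
        zf.writestr('top-1m.csv',
                    '1,google.com\n2,youtube.com\n3,facebook.com\n')
    return buf.getvalue()


if __name__ == '__main__':
    print(parse_alexa_zip(make_sample_zip()))
```

Building the archive in memory keeps the example self-contained; with real data, the bytes returned by the downloader would be passed to parse_alexa_zip instead.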

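1.2 Parsing the Alexa list by reusing the crawler code

To reuse the functionality above inside the crawler, the scrape_callback interface has to be modified.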

# -*- coding: utf-8 -*-

import csv
from zipfile import ZipFile
from StringIO import StringIO
from mongo_cache import MongoCache

class AlexaCallback:
    def __init__(self, max_urls=1000):
        self.max_urls = max_urls
        self.seed_url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

    def __call__(self, url, html):
        if url == self.seed_url:
            urls = []
            cache = MongoCache()  # used below to skip URLs that are already cached
            with ZipFile(StringIO(html)) as zf:
                csv_filename = zf.namelist()[0]
                for _, website in csv.reader(zf.open(csv_filename)):
                    if 'http://' + website not in cache:
                        urls.append('http://' + website)
                        if len(urls) == self.max_urls:
                            break
            return urls
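To see how a callback of this shape might plug into a crawl loop, here is a minimal Python 3 sketch that runs without MongoDB or network access. Everything in it is an assumption made for illustration: SeedExpander is a hypothetical stand-in for AlexaCallback that takes its domains from a list instead of a zip file, a plain set replaces MongoCache, and crawl is a simplified serial loop rather than the crawler used in this series.

```python
from collections import deque


class SeedExpander:
    """Hypothetical stand-in for AlexaCallback: expands one seed into URLs."""

    def __init__(self, seed_url, domains, max_urls=1000, cache=()):
        self.seed_url = seed_url
        self.domains = domains    # stands in for the rows of the CSV file
        self.max_urls = max_urls
        self.cache = set(cache)   # stands in for MongoCache

    def __call__(self, url, html):
        if url != self.seed_url:
            return []             # ordinary pages contribute no new URLs
        urls = []
        for website in self.domains:
            candidate = 'http://' + website
            if candidate not in self.cache:
                urls.append(candidate)
                if len(urls) == self.max_urls:
                    break         # stop once enough URLs were collected
        return urls


def crawl(seed_url, download, scrape_callback):
    """Serial loop: download each URL, then queue what the callback returns."""
    queue, seen, visited = deque([seed_url]), {seed_url}, []
    while queue:
        url = queue.popleft()
        html = download(url)
        visited.append(url)
        for link in scrape_callback(url, html) or []:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == '__main__':
    callback = SeedExpander('seed', ['a.com', 'b.com', 'c.com', 'd.com'],
                            max_urls=2, cache={'http://a.com'})
    # A dummy downloader that returns an empty page for every URL.
    print(crawl('seed', lambda url: '', callback))
```

In this sketch the seed is the only page that produces new links, so the crawl visits the seed plus at most max_urls uncached sites.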

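AlexaCallback takes a new input parameter, max_urls, which sets the number of URLs to extract from the Alexa file. Actually downloading all 1 million web pages would take about 11 days, so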