Scrapy-Redis: Fetching Data via POST Requests
If you've found this article, you presumably already know what Scrapy and Scrapy-Redis are, so I won't go over the basics again. By default, Scrapy-Redis fetches data with GET requests; for sites that require POST requests, all you need to do is override the make_request_from_data method. Strangely, I couldn't find a concise, clear answer to this anywhere online. Perhaps it's just too simple?
Here I'll use httpbin.org as the example site. First, add the required configuration to settings.py, adjusting the values to match your own environment:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # schedule requests through a Redis queue
SCHEDULER_PERSIST = True  # don't clear the Redis queue, so crawls can be paused/resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # deduplicate all spiders through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = "redis://127.0.0.1:6379"
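One note on REDIS_URL: if your Redis instance requires authentication, redis-py's URL scheme also carries the password and database number, along these lines (the password and db shown are placeholder values):

# Assumed values; adjust password, host, port, and db to your deployment.
REDIS_URL = "redis://:mypassword@127.0.0.1:6379/0"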
The spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider


class HpbSpider(RedisSpider):
    name = 'hpb'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata={"data": data},
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)
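In real projects the message pushed to Redis usually carries more than one form field. A common pattern is to push a JSON string and decode it inside make_request_from_data. The sketch below is one way to do that, not the only one; the field names ("keyword", "page") are made up for illustration:

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy_redis.spiders import RedisSpider


class JsonPostSpider(RedisSpider):
    name = 'json_post'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        # ``data`` arrives as bytes; decode it and parse the JSON payload.
        payload = json.loads(data.decode('utf-8'))
        # FormRequest form values should be strings, so stringify everything.
        formdata = {key: str(value) for key, value in payload.items()}
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata=formdata,
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)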
Here the response body is simply printed to keep things short; in real use you would yield items from parse and persist them to a database through a pipeline.
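As a rough illustration of that, a minimal pipeline might look like the following; the SQLite file name and table schema are assumptions, and parse would need to yield a dict such as {'body': response.body} for this to receive anything. It would be enabled via ITEM_PIPELINES in settings.py.

import sqlite3


class SaveResponsePipeline(object):
    """Minimal example pipeline: stores each scraped item in SQLite."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('results.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS responses (body TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO responses (body) VALUES (?)',
                          (item['body'],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()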
Then start the spider with scrapy crawl hpb. Since we haven't written anything to test_post_data yet, the program enters a waiting state after startup. Next, simulate writing data to the queue:
import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for _ in range(1000):
    rd.lpush('test_post_data', _)
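For the JSON variant sketched earlier, the producer side would push serialized objects instead of bare integers, along these lines (field names again hypothetical):

import json

import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for page in range(1000):
    # Each queue entry is a JSON object holding the form fields to POST.
    rd.lpush('test_post_data', json.dumps({"keyword": "scrapy", "page": page}))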
At this point you can see that the spider has started fetching data:
2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "0"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "1"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "3"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "2"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "4"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "5"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "6"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "7"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
As for duplicate data: if the POSTed data repeats, the duplicate request simply won't be sent, because the RFPDupeFilter fingerprints each request by its method, URL, and body.
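You can convince yourself of this with Scrapy's own fingerprint helper: two POST requests to the same URL with the same body produce identical fingerprints, while different bodies do not. (request_fingerprint lives in scrapy.utils.request in the Scrapy 1.x line used here; newer versions deprecate it in favor of a fingerprinter class.)

import scrapy
from scrapy.utils.request import request_fingerprint

r1 = scrapy.FormRequest("https://www.httpbin.org/post", formdata={"data": "0"})
r2 = scrapy.FormRequest("https://www.httpbin.org/post", formdata={"data": "0"})
r3 = scrapy.FormRequest("https://www.httpbin.org/post", formdata={"data": "1"})

print(request_fingerprint(r1) == request_fingerprint(r2))  # True: same body, so deduplicated
print(request_fingerprint(r1) == request_fingerprint(r3))  # False: different body, both are sent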