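Setting Up and Using the Splash Service
1. Introduction
Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python on top of Twisted and Qt, which give the service asynchronous processing so that it can take advantage of WebKit's concurrency.
scrapy-splash integrates Splash's JavaScript rendering with Scrapy, so that Scrapy can crawl dynamically rendered pages.
Reference: http://scrapy-cookbook.readthedocs.io/zh_CN/latest/scrapy-12.html#scrapy-splash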
2. Installing the service with Docker
Search for the related images and pick the one with the most stars, scrapinghub/splash:
docker search scrapinghub/splash
[root@VM-24-10-centos ~]# docker search Splash
NAME DESCRIPTION STARS OFFICIAL AUTOMATED
scrapinghub/splash Lightweight, scriptable browser as a service… 84 [OK]
scrapinghub/splash-jupyter 4 [OK]
vimagick/splash A javascript rendering service with an HTTP … 2 [OK]
novadata/splash scrapinghub splash for support direct proxy 1 [OK]
splashblot/docker-postgis Docker image for PostGIS, with CartoDB^H^H t… 1 [OK]
hrbrmstr/splashttpd A slightly modified version of the scrapingh… 1
alexnoddings/splash Simple splash screen for my website. 0
splashblot/docker-dronedb Dockerized Carto^H^H^H^H^HDroneDB 0 [OK]
emcees/splashsnapshot 0
chicksphotocracy/splash-landing 0
topiaruss/splash Snapshot of Scrapinghub's splash 0 [OK]
vfcosta/splash 0
npapapietro/splash-server A basic server that serves as a splash page … 0
splashsync/toolkit Splash Sync Toolkit for Connectors Developers 0
splashblot/metabase TBD. 0 [OK]
splashblot/webodm 0
rechberger/splash splash javascript rendering service 0 [OK]
wsdookadr/splash 0
hkjallbring/splash Splash 0 [OK]
splashsync/openapi-sandbox All-In-One container to serve Open Api Faker… 0
terrytz/splash 0
splashblot/opendronemap Automated build of ODM 0 [OK]
splashblot/node-opendronemap 0
splashblot/tileo-swarm 0
faddat/splash 0
[root@VM-24-10-centos ~]#
Pull the image with docker pull scrapinghub/splash, then start a container:
docker run -it --name mysplash -p 8050:8050 scrapinghub/splash
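When installing on a cloud server, open the corresponding port (8050 in the command above) in the firewall. After that, entering the server's address and port in a browser opens the Splash page.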
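Once the container is running, one way to confirm that the service responds is to request a rendered page through Splash's render.html endpoint. The sketch below is only an illustration: it assumes Splash is reachable at localhost:8050 (the port mapping used above), the helper names build_render_url and fetch_rendered_html are made up for this example, and the target URL is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable here; replace with your server's address.
SPLASH_BASE = "http://localhost:8050"


def build_render_url(target_url, wait=0.5, base=SPLASH_BASE):
    """Build a render.html request URL for the Splash HTTP API."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5, timeout=30):
    """Ask Splash to render the page and return the resulting HTML as text."""
    with urlopen(build_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (needs a running Splash container and network access):
#   print(fetch_rendered_html("http://example.com")[:200])
```

If the call returns HTML, the container and the firewall rule are both working; a connection error usually points at the port not being open.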
3. Using Splash in Scrapy
1. Install scrapy-splash
pip install scrapy-splash
2. Add the configuration
Add the Splash middlewares in settings.py through DOWNLOADER_MIDDLEWARES, and change the priority of HttpCompressionMiddleware:
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Set the Splash-aware duplicate filter:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
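If you use Scrapy's HTTP cache, you also need to specify a custom cache storage backend. scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage (located at scrapy.extensions.httpcache.FilesystemCacheStorage in Scrapy 1.0 and later):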
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
If you use a different cache storage, you need to subclass it and replace every call to scrapy.utils.request.request_fingerprint with scrapy_splash.splash_request_fingerprint.
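Putting the pieces together, a settings.py fragment might look like the sketch below. Two entries do not appear in the text above and are taken from the scrapy-splash README: SPLASH_URL, which tells the middleware where the Splash service is listening, and SplashDeduplicateArgsMiddleware. The address is an assumption based on the docker run command shown earlier; adjust it to your own host.

```python
# settings.py (sketch, not a complete Scrapy configuration)

# Assumption: Splash runs on the same host with port 8050 published.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Not mentioned in the text above; listed in the scrapy-splash README.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```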
3. Use scrapy-splash
The simplest way to issue a rendered request is scrapy_splash.SplashRequest, and it is usually the one you should choose:
import scrapy_splash
from scrapy_splash import SplashRequest

# Inside a spider method such as start_requests():
yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)
Notes on the Splash API: SplashRequest is a convenient utility for filling in the data under request.meta['splash'].
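- meta['splash']['args'] holds the arguments sent to Splash.
- meta['splash']['endpoint'] specifies which Splash endpoint to use; the default is render.html (http://splash.readthedocs.org/en/latest/api.html#render-html).
- meta['splash']['splash_url'] overrides the Splash URL configured in settings.py.
- meta['splash']['splash_headers'] lets you add or modify the HTTP headers sent to the Splash server. Note that this does not change the headers sent to the remote website.
- meta['splash']['dont_send_headers'] can be set to True if you do not want to pass headers to Splash.
- meta['splash']['slot_policy'] lets you customize how concurrency and politeness are handled for Splash requests.
- meta['splash']['dont_process_response'], when set to True, stops SplashMiddleware from replacing the default scrapy.Response. By default a Splash-specific response class such as SplashResponse or SplashTextResponse is returned.
- meta['splash']['magic_response'] defaults to True; some attributes of the response, such as response.headers and response.body, are then set automatically.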
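To make the keys above concrete, the sketch below builds the meta['splash'] dictionary by hand, which is roughly what SplashRequest fills in for you. The helper name build_splash_meta is hypothetical (it is not part of scrapy-splash), and the values are illustrative only.

```python
def build_splash_meta(wait=0.5, endpoint="render.html", splash_url=None):
    """Return a meta dict with a 'splash' entry for a plain scrapy.Request.

    Hypothetical helper for illustration; it is not part of scrapy-splash.
    """
    splash = {
        "args": {"wait": wait},  # parameters sent to the Splash HTTP API
        "endpoint": endpoint,    # which Splash endpoint to call
    }
    if splash_url is not None:
        # Overrides the SPLASH_URL configured in settings.py.
        splash["splash_url"] = splash_url
    return {"splash": splash}


meta = build_splash_meta(wait=1.0, endpoint="render.json")
print(meta)
# In a spider this could be passed along as:
#   yield scrapy.Request(url, self.parse_result, meta=meta)
```

In most spiders SplashRequest remains the more convenient choice; building the dictionary manually is mainly useful for seeing which keys the middleware reads.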
If you want to submit a form request through Splash, use scrapy_splash.SplashFormRequest; it is used in the same way as SplashRequest.