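Contents:
I. Preface
II. Prerequisites
III. Analysis
1. Examine the URLs of the listing page and of each laptop page
2. Find each laptop's id
3. Locate the price and comment-count data
4. Crawling plan
IV. Hands-on: crawling JD laptop data with urllib and visualizing it
V. Results
1. Program output
2. Charts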
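I. Preface
A laptop is indispensable for a programmer. In this post I crawl the comment counts, prices, and shop names of the laptops on the first 2 listing pages of JD.com and visualize them. The charts make it easy to compare offers and pick a good-value laptop. Two pages of data is of course far from enough; for a more reliable picture, change the page count in the code to fetch more pages.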
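II. Prerequisites
- Basic Python syntax
- Exception handling
- Familiarity with the urllib module or another crawling module
- Ability to capture and analyze network traffic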
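III. Analysis
1. Examine the URLs of the listing page and of each laptop page
(1) Looking at the URL of an individual laptop page, we can guess that every page has an id, and that constructing https://item.jd.com/[id].html yields the URL of each laptop page.
(2) Looking at the URL of the listing page, we find that the value of page= is the page number, so by setting page we can turn pages automatically while crawling. After removing the unnecessary parameters from the listing URL, the paging pattern is https://list.jd.com/list.html?cat=670,671,672&page=[page number]
From this analysis, the key to getting the data is the id of each laptop, so the next task is to find those ids.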
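The two URL patterns above can be captured in a couple of small helpers. This is only a sketch: the function names `list_page_url` and `item_page_url` are made up for illustration, and the patterns are the ones observed above, which JD may change at any time.

```python
# URL patterns observed in the analysis above
LIST_URL = "https://list.jd.com/list.html?cat=670,671,672&page={page}"
ITEM_URL = "https://item.jd.com/{sku_id}.html"


def list_page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    return LIST_URL.format(page=page)


def item_page_url(sku_id):
    """Return the product page URL for a laptop id (hypothetical helper)."""
    return ITEM_URL.format(sku_id=sku_id)


# Listing URLs for the first 2 pages
page_urls = [list_page_url(page) for page in range(1, 3)]
for url in page_urls:
    print(url)
print(item_page_url("100000323510"))
```

Crawling more pages then only means widening the `range`.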
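2. Find each laptop's id
(1) First, check whether the page source contains the id of each laptop.
(2) Opening the product page of the laptop whose name we just searched for confirms that the number found next to it is indeed its id.
(3) Use the attribute values near the id to pin down all laptop ids.
Locating elements by the attribute class="gl-i-wrap j-sku-item" matches exactly 60 ids. Counting the laptops on the page also gives 60 per page, so this gives us every laptop id.
(4) In the same way, the <div class="p-name"> element yields each laptop's URL and name, so we do not even need to construct the product URLs; we can read them directly.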
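To illustrate the idea of locating ids and links by class attribute, here is a minimal sketch that uses only the standard library's `html.parser` instead of lxml. The `SAMPLE_HTML` fragment is invented for the example, and it assumes the id sits in a `data-sku` attribute on the `gl-i-wrap j-sku-item` element; the real page may differ.

```python
from html.parser import HTMLParser


class SkuParser(HTMLParser):
    """Collect laptop ids and product links from listing-page markup."""

    def __init__(self):
        super().__init__()
        self.sku_ids = []
        self.hrefs = []
        self._in_name_div = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "j-sku-item" in classes and "data-sku" in attrs:
            # Assumption: the id is stored in the data-sku attribute
            self.sku_ids.append(attrs["data-sku"])
        elif tag == "div" and "p-name" in classes:
            self._in_name_div = True
        elif tag == "a" and self._in_name_div and "href" in attrs:
            # Links are protocol-relative, so prepend the scheme
            self.hrefs.append("https:" + attrs["href"])
            self._in_name_div = False


# Invented sample markup, shaped like the structure described above
SAMPLE_HTML = """
<li class="gl-item">
  <div class="gl-i-wrap j-sku-item" data-sku="100000323510">
    <div class="p-name"><a href="//item.jd.com/100000323510.html"><em>Laptop A</em></a></div>
  </div>
</li>
<li class="gl-item">
  <div class="gl-i-wrap j-sku-item" data-sku="100002368328">
    <div class="p-name"><a href="//item.jd.com/100002368328.html"><em>Laptop B</em></a></div>
  </div>
</li>
"""

parser = SkuParser()
parser.feed(SAMPLE_HTML)
print(parser.sku_ids)
print(parser.hrefs)
```

Comparing `len(parser.sku_ids)` with the number of laptops visible on the page is the same sanity check as counting the 60 matches by hand.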
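3. Locate the price and comment-count data
(1) These values are nowhere to be found in the page source, so I guessed that they are loaded through separate JS requests.
(2) Open the Fiddler capture tool and analyze the traffic.
The capture shows that the data is indeed in JS responses. Copy the URL of such a response and analyze it.
(3) The analysis leads to the following conclusions:
I also captured the JS response that holds the shop name, but part of its URL is randomly generated on every request, so the shop name of each laptop cannot be fetched that way. However, I already have the URL of each laptop page, and that page shows the laptop's shop, so I can visit each laptop page to get the shop name.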
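As a sketch of how the captured JS endpoints can be used, the helpers below build the two request URLs from a list of ids and unwrap a JSONP response with the standard library. The URL formats are the ones recorded in the crawler's comments; the helper names and the `sample_response` text are invented for illustration, and the exact fields of the real responses are an assumption apart from the `"p"` price field that the crawler reads.

```python
import json
import re


def build_comment_url(sku_ids):
    """URL of the JS response holding comment counts (hypothetical helper)."""
    return ("https://club.jd.com/comment/productCommentSummaries.action"
            "?my=pinglun&referenceIds=" + ",".join(sku_ids)
            + "&callback=jQuery5043746")


def build_price_url(sku_ids):
    """URL of the JS response holding prices; ids are joined with an encoded comma."""
    return ("https://p.3.cn/prices/mgets?callback=jQuery1702366&type=1&skuIds="
            + "%2C".join("J_" + sku_id for sku_id in sku_ids))


def strip_jsonp(text):
    """Return the JSON payload wrapped in callback(...)."""
    match = re.search(r"\((.*)\)", text, re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Invented sample response, shaped like a JSONP price reply
sample_response = 'jQuery1702366([{"id":"J_7512626","p":"4999.00"}]);'
payload = strip_jsonp(sample_response)
prices = {item["id"].replace("J_", "", 1): float(item["p"]) for item in payload}
print(prices)
```

Parsing the payload as JSON avoids hand-written regular expressions per field, at the cost of failing loudly if the response is not valid JSON.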
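4. Crawling plan
(1) Crawl each listing page first.
(2) For each listing page, crawl the price, name, and comment count of every laptop, as well as the URL of every laptop page.
(3) Crawl each laptop page to get the shop name.
(4) Once all pages have been collected, visualize the data.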
IV. Hands-on: crawling JD laptop data with urllib and visualizing it
Crawler file: (typing it out as you read is recommended; it helps the material stick)
# -*- coding: utf-8 -*-
import random
import re
import time
import urllib.request

from lxml import etree
# Bar and Pie are imported the pyecharts 0.5.x way
from pyecharts import Bar
from pyecharts import Pie


# Pool of User-Agent strings; one is picked at random for every request
headers = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
]


def fetch(url, encoding='utf-8'):
    # Download a URL with a random User-Agent and return the decoded text
    request = urllib.request.Request(url)
    request.add_header('User-Agent', random.choice(headers))
    return urllib.request.urlopen(request, timeout=1).read().decode(encoding, 'ignore')


def main():
    # Containers for the data of all laptops
    allNames = []
    allCommentNums = {}
    allPrices = {}
    allShops = {}

    # Crawl all laptops on the first 2 pages
    for i in range(0, 2):
        # URL pattern of each page: https://list.jd.com/list.html?cat=670,671,672&page=[page number]
        print('Crawling page ' + str(i + 1) + '...')
        url = 'https://list.jd.com/list.html?cat=670,671,672&page=' + str(i + 1)
        get_page_data(url, allNames, allCommentNums, allPrices, allShops)

    # Data collection ends here; visualization starts below
    names = allNames
    # Missing values are charted as 0; the scraped strings are converted with int() and float()
    commentNums = [int(allCommentNums[name]) if allCommentNums[name] else 0 for name in names]
    prices = [float(allPrices[name]) if allPrices[name] else 0 for name in names]
    # Laptops whose shop could not be read are left out of the pie chart
    shops = [allShops[name] for name in names if allShops[name] is not None]
    for i in range(0, len(names)):
        print(names[i])
        print(commentNums[i])
        print(prices[i])
        print(allShops[names[i]])
    # Bar chart of the price of each laptop
    # (pass commentNums instead of prices to chart the comment counts)
    tiaoxing(names, prices)

    # Pie chart of the shops: first count the laptops per shop
    shopNames = list(set(shops))
    nums = [shops.count(shopName) for shopName in shopNames]
    bingtu(shopNames, nums)


def get_page_data(url, allNames, allCommentNums, allPrices, allShops):
    # Crawl the id, name, and product URL of every laptop on this page
    data = etree.HTML(fetch(url))
    ids = data.xpath('//a[@class="p-o-btn contrast J_contrast contrast-hide"]/@data-sku')
    names = data.xpath('//div[@class="p-name"]/a/em/text()')
    hrefs = data.xpath('//div[@class="p-name"]/a/@href')
    # Remove duplicate URLs
    print(len(hrefs))
    hrefs = list(set(hrefs))
    print(len(hrefs))
    # Complete each laptop URL by prepending 'https:'
    hrefs = ['https:' + href for href in hrefs]

    # Build the URL of the JS response that holds the comment counts from the ids
    # URL format: https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds=100000323510,100002368328&callback=jQuery5043746
    commentJS_url = ('https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds='
                     + ','.join(ids) + '&callback=jQuery5043746')
    # Crawl that JS response and read the comment count of each laptop
    data = fetch(commentJS_url)
    pat = r'\{(.*?)\}'
    commentStr = re.compile(pat).findall(data)  # all comment-related info of each product
    comments = {}
    for sku_id in ids:
        for record in commentStr:
            if sku_id in record:
                found = re.compile('"CommentCount":(.*?),').findall(record)
                if found:
                    comments[sku_id] = found[0]
    print("ids:", len(ids), ids)
    print("names:", len(names), names)
    print("comment counts:", len(comments), comments)

    # Build the URL of the JS response that holds the prices from the ids
    # URL format: https://p.3.cn/prices/mgets?callback=jQuery1702366&type=1&skuIds=J_7512626%2CJ_44354035037%2CJ_100003302532
    priceJS_url = ('https://p.3.cn/prices/mgets?callback=jQuery1702366&type=1&skuIds='
                   + '%2C'.join('J_' + sku_id for sku_id in ids))
    # Crawl that JS response and read the price of each laptop
    data = fetch(priceJS_url)
    priceStr = re.compile(pat).findall(data)  # all price-related info of each product
    prices = {}
    for sku_id in ids:
        for record in priceStr:
            if sku_id in record:
                found = re.compile('"p":"(.*?)"').findall(record)
                if found:
                    prices[sku_id] = found[0]
    print("prices:", prices)

    # The shop name is only available on each laptop's own page, so visit every page
    shops = {}
    for sku_id in ids:
        for href in hrefs:
            if sku_id in href:
                try:
                    data = fetch(href, 'gbk')
                    shop = etree.HTML(data).xpath('//*[@id="crumb-wrap"]/div/div[2]/div[2]/div[1]/div/a/@title')
                    print(shop)
                    shops[sku_id] = shop[0] if shop else None
                    time.sleep(2)
                except Exception as e:
                    print(e)

    # Strip the whitespace and newlines around each laptop name
    names = [name.strip() for name in names]
    # Map each name to its comment count, price, and shop; missing or empty values become None
    for name, sku_id in zip(names, ids):
        allNames.append(name)
        allCommentNums[name] = comments.get(sku_id) or None
        allPrices[name] = prices.get(sku_id) or None
        allShops[name] = shops.get(sku_id)


def tiaoxing(names, prices):
    # Bar chart: laptop name on the X axis, price on the Y axis
    bar = Bar("Laptop prices", "X: laptop name, Y: price")
    bar.add("Laptops", names, prices)
    bar.show_config()
    bar.render("D:\\scrapy\\jingdong\\prices.html")


def bingtu(shopNames, nums):
    # Pie chart: share of laptops per shop
    attr = shopNames
    v1 = nums
    pie = Pie("Laptop shops")
    pie.add("", attr, v1, is_label_show=True)
    pie.show_config()
    pie.render("D:\\scrapy\\jingdong\\shops.html")


if __name__ == '__main__':
    main()
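The per-shop tally that main() computes before calling bingtu can also be expressed with `collections.Counter` from the standard library. This is just an alternative sketch; the helper name `count_shops` and the shop names in the example are invented.

```python
from collections import Counter


def count_shops(shops):
    """Return (shop_names, counts), most frequent shop first; None entries are skipped."""
    counts = Counter(shop for shop in shops if shop is not None)
    ordered = counts.most_common()
    shop_names = [name for name, _ in ordered]
    nums = [count for _, count in ordered]
    return shop_names, nums


# Invented example data: one laptop has no shop name
shop_names, nums = count_shops(["Shop A", "Shop B", "Shop A", None])
print(shop_names, nums)
```

The two returned lists can be passed straight to bingtu, and the largest slice then comes first.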
V. Results
1. Program output
2. Charts
Comment-count bar chart:
Shop pie chart:
The charts suggest that Lenovo laptops sell best.