Web Scraping Basics (2): The BeautifulSoup Library

Latest recommended article published on 2024-05-01 17:12:59

knock_me

Category: Web scraping. Tags: python, web scraping

Article link: https://blog.csdn.net/knock_me/article/details/108480638

Installing the BeautifulSoup package

pip3 install beautifulsoup4 --default-timeout=1000

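Introduction to BeautifulSoup
BeautifulSoup is an HTML/XML parser whose main job is to parse HTML/XML documents and extract data from them.

BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers. If nothing extra is installed, it uses Python's built-in parser (html.parser). The lxml parser is more capable and faster, so lxml is the recommended choice; it has to be installed separately (pip3 install lxml).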
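
As a minimal sketch of how the parser is chosen, the snippet below parses a small inline HTML string (made up for illustration) with the built-in html.parser. Passing 'lxml' as the second argument should work the same way once lxml has been installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document, used only for illustration
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; 'html.parser' ships with Python
soup = BeautifulSoup(html, "html.parser")
# After `pip3 install lxml`, the lxml parser can be requested instead:
# soup = BeautifulSoup(html, "lxml")

page_title = soup.title.string   # text inside <title>
first_paragraph = soup.p.string  # text inside the first <p>
print(page_title, first_paragraph)
```

Naming the parser explicitly keeps the result the same on every machine, regardless of which optional parsers happen to be installed.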

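1. Extracting the plain text of a web page

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.baidu.com')
bf = BeautifulSoup(r.text, features='html.parser')
# Return the HTML re-indented in a standard, readable layout
print(bf.prettify())
# Strip all HTML tags and return only the text
print(bf.get_text())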
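
The same two calls can be tried without a network connection. The sketch below runs them on an invented HTML fragment and also shows the optional separator and strip arguments of get_text, which help when the raw text runs together.

```python
from bs4 import BeautifulSoup

# Invented fragment, used only to show the behaviour of get_text and prettify
html = "<div><h1>Title</h1><p>First <b>bold</b> line</p><p>Second line</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Plain concatenation of every text node, with nothing in between
raw_text = soup.get_text()

# Put a newline between text nodes and strip the whitespace around each one
clean_text = soup.get_text(separator="\n", strip=True)

# prettify() returns the markup as an indented string
pretty = soup.prettify()

print(raw_text)
print(clean_text)
```

Note that with a separator every text node lands on its own line, including the text inside inline tags such as b.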

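2. Extracting the content of tags

A handy Chrome extension for this step is infolite. You can download it yourself (from mainland China this requires a proxy). It lets you find an element's id and class simply by clicking on the page, which is very convenient.

bf = BeautifulSoup(r.text, features='html.parser')
# Use select to extract all <a> elements; the result is a list
bf.select('a')
# Find all elements whose id is title (an id is prefixed with #)
bf.select('#title')
# Find all elements whose class is link (a class is prefixed with .)
bf.select('.link')
# Find all span elements with class=mask (an id can be matched the same way)
bf.select('span[class="mask"]')
# Find all <a> tags nested inside li elements
bf.select('li a')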
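
To make the selectors concrete, here is a small self-contained sketch that runs each of them against an invented HTML fragment. The ids, classes and hrefs are made up for the example; select_one is the variant that returns only the first match (or None).

```python
from bs4 import BeautifulSoup

# Invented markup; ids, classes and hrefs exist only for this example
html = """
<div id="title">Site title</div>
<ul>
  <li><a class="link" href="/a">A</a></li>
  <li><a class="link" href="/b">B</a></li>
</ul>
<span class="mask">hidden</span>
<a href="/c">C</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.select("a")                    # every <a> element
title_nodes = soup.select("#title")             # elements with id="title"
link_nodes = soup.select(".link")               # elements with class "link"
mask_spans = soup.select('span[class="mask"]')  # <span> whose class attribute is "mask"
list_links = soup.select("li a")                # <a> elements nested inside <li>

# select_one returns the first match instead of a list
first_link = soup.select_one("li a")

# Attributes are read like dictionary keys
hrefs = [a["href"] for a in list_links]
print(hrefs)
```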

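Trying to extract a page's text with BeautifulSoup

1. The compile function

First, a quick look at re.compile, which is used later on.

re.compile compiles a regular expression and returns a pattern object whose methods (split, findall, and so on) can then be called.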

>>> import re
>>> s = re.compile('[a]+')
>>> string = 'aaa1123sass'
>>> parts = s.split(string)
>>> parts
['', '1123s', 'ss']
>>> matches = s.findall(string)
>>> matches
['aaa', 'a']
# Methods such as findall or split are called on the compiled pattern object s as s.method(string)
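
The same idea applies to cleaning up scraped text. The sketch below, using an invented sample string, compiles a pattern for runs of newlines once and reuses it both to split the text into lines and to collapse the blank lines in place.

```python
import re

# Compile once, reuse many times: one or more consecutive newlines
newline_runs = re.compile(r"\n+")

# Invented sample text with blank lines, similar to what get_text() returns
text = "line one\n\n\nline two\nline three\n"

# split() cuts the text at every run of newlines
parts = newline_runs.split(text)

# A trailing newline leaves an empty string at the end, so filter empties out
non_empty = [p for p in parts if p]

# sub() replaces each run with a single newline instead of splitting
collapsed = newline_runs.sub("\n", text)

print(non_empty)
```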

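2. Extracting the plain text of a page with BeautifulSoup

Let's try to extract the syllabus for exam 815 from the ECUST (East China University of Science and Technology) graduate school website. With plain-text extraction alone, it is hard to separate the syllabus from all the other text on the page. Once you have learned to extract content by tag and selector, isolating it becomes easy. In the code below, all of the page's text ends up mixed together.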

import os
import re

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/75.0.3770.90 Safari/537.36'
}

r = requests.get('https://gschool.ecust.edu.cn/2018/0919/c8311a79576/page.htm', headers=headers)
r.encoding = 'utf-8'
if r.status_code == 200:
    bf = BeautifulSoup(r.text, features='html.parser')
    # Print the HTML with standard indentation
    # print(bf.prettify())
    # Strip the HTML tags and keep only the text
    text = bf.get_text()
    # Use a compiled pattern to split on runs of newlines so the output is tidier.
    # The pattern gets its own name so that it does not shadow the re module.
    newline_re = re.compile(r'\n+')
    lines = newline_re.split(text)

    # Make sure the output directory exists, then overwrite the file ('w' truncates it)
    os.makedirs('txt', exist_ok=True)
    with open('txt/815.txt', 'w', encoding='utf-8') as f:
        for line in lines:
            print(line)
            f.write(line + '\n')
else:
    print('Failed to fetch the page')
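
To illustrate how tag-based extraction separates the wanted content from the rest of the page, here is a minimal sketch on invented markup. The real page's structure is not shown in this article, so the class name article and the paragraph texts are assumptions made for the example only.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a real page; the class name "article"
# is an assumption and must be replaced by whatever the real page uses
html = """
<div class="nav">Home | News | Contact</div>
<div class="article">
  <p>1. Exam scope</p>
  <p>2. Reference books</p>
</div>
<div class="footer">Copyright</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the content container first
container = soup.select_one("div.article")

# Then read only the paragraphs inside it; navigation and footer are ignored
if container is not None:
    paragraphs = [p.get_text(strip=True) for p in container.select("p")]
else:
    paragraphs = []

print("\n".join(paragraphs))
```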

Scraping Sina News with BeautifulSoup
The main things to master here are select and findall.

# This script scrapes the title, publication time and link of Sina News articles
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/75.0.3770.90 Safari/537.36'
}
r = requests.get('https://news.sina.com.cn/china/', headers=headers)
r.encoding = 'utf-8'
bf = BeautifulSoup(r.text, features='html.parser')
for x in bf.select('.news-1 a') + bf.select('.news-2 a'):
    # Skip anchors that have no href attribute
    href = x.get('href')
    if not href:
        continue
    # The publication time is not on the listing page, so each article page has to be opened.
    # The drawback is that every loop iteration fetches one more page, which is very slow.
    rx = requests.get(href, headers=headers)
    rx.encoding = 'utf-8'
    bfx = BeautifulSoup(rx.text, features='html.parser')
    # Some pages may lack a .date element; print an empty time instead of crashing
    date = bfx.select_one('.date')
    pub_time = date.text if date else ''
    print(pub_time, x.text, href)
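
One way to make a script like this easier to check is to separate parsing from fetching, so the selectors can be tried on saved or hand-written HTML. The sketch below does that with two hypothetical helper functions, extract_headlines and extract_date, and invented sample markup. The selectors .news-1, .news-2 and .date come from the script above; everything else is an assumption.

```python
from bs4 import BeautifulSoup


def extract_headlines(listing_html):
    """Return (title, href) pairs for links inside .news-1 and .news-2 blocks."""
    soup = BeautifulSoup(listing_html, "html.parser")
    links = soup.select(".news-1 a") + soup.select(".news-2 a")
    # Anchors without an href attribute are skipped
    return [(a.get_text(strip=True), a["href"]) for a in links if a.get("href")]


def extract_date(article_html):
    """Return the text of the first .date element, or None if there is none."""
    soup = BeautifulSoup(article_html, "html.parser")
    node = soup.select_one(".date")
    return node.get_text(strip=True) if node else None


# Invented sample markup; the URLs and texts are placeholders
listing = """
<ul class="news-1"><li><a href="https://example.com/1.html">First story</a></li></ul>
<ul class="news-2"><li><a href="https://example.com/2.html">Second story</a></li>
<li><a>No link</a></li></ul>
"""
article = '<div><span class="date">2020-09-08 10:00</span></div>'

headlines = extract_headlines(listing)
published = extract_date(article)
missing = extract_date("<p>no date here</p>")
print(headlines, published, missing)
```

In a real run, the strings passed to these helpers would be the .text of the responses returned by requests.get.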