Latest Douban Top 250 Scraper Example: Code Analysis [Fully Commented]

Published on 2022-08-09 00:02 in category: Web Scraping, by 孤飞

# json module, used to serialize each parsed record
import json
# Regular expression module
import re
import requests
from requests import RequestException

Define the function that fetches the HTML

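# Function: fetch the HTML of one page
def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
        }
        # requests.get returns a Response object that wraps the server's reply.
        # Commonly used attributes of a Response object r:
        # 1. r.status_code: HTTP status of the response; 200 means success, 404 means the page was not found
        # 2. r.text: the response body as a string, i.e. the content of the page at the URL
        # 3. r.encoding: the encoding guessed from the HTTP headers
        # 4. r.apparent_encoding: the encoding detected from the content itself (fallback encoding)
        # 5. r.content: the response body as raw bytes
        response = requests.get(url, headers=headers, timeout=10)  # timeout in seconds
        if response.status_code == 200:
            return response.text
        # Any other status code: report it and signal failure to the caller
        print(f'Unexpected status code {response.status_code} for {url}')
    except RequestException as e:
        print(e)
    return None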
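
The difference between r.content, r.text and r.encoding described in the comments can be illustrated without any network access. The following is a minimal sketch using only the standard library; the sample string and the variable names are illustrative assumptions and are not part of the original script.

```python
# Stand-in for r.content: the raw bytes a server might send (sample data, assumed).
raw_bytes = '豆瓣电影 Top 250'.encode('utf-8')

# Decoding with the correct encoding plays the role of r.text
# when r.encoding has been guessed correctly.
decoded_ok = raw_bytes.decode('utf-8')

# Decoding with the wrong encoding does not raise for latin-1,
# but it produces garbled text (mojibake) instead of the original string.
decoded_wrong = raw_bytes.decode('latin-1')

# The byte length and the character length differ for non-ASCII text.
print(len(raw_bytes), len(decoded_ok))
print(decoded_ok == decoded_wrong)
```

With requests itself, a common remedy when r.text looks garbled is to set r.encoding = r.apparent_encoding before reading r.text.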

Define the function that parses the HTML [regex]

# Function: parse the HTML of one page
def parse_one_page(html):
    # re.compile precompiles a regular expression, given as a string, into a
    # Pattern object so that the same pattern can be reused efficiently.
    # A Pattern object does nothing on its own; call its findall(), search()
    # or match() method to apply it to text.
    # Raw strings (r'...') keep sequences such as \d from being read as string escapes.
    pattern = re.compile(
        r'<em class="">(\d+)</em>.*?<a href="(.*?)">.*?'
        r'<img width="100" alt=".*?" src="(.*?)" class=""'
        r'>.*?<span class="title">(.*?)</span>.*?<span '
        r'class="other"> / (.*?)</span>.*?<div '
        r'class="bd">.*?<p class="">.*?导演: (.*?)&nbsp.*?<br>'
        r'.*?(\d{4}) / (.*?) / (.*?)\n'
        r'.*?</p>.*?<span class="rating_num" property="v:'
        r'average">(.*?)</span>',
        re.S)  # re.S lets '.' match newlines, so one match can span several lines
    items = pattern.findall(html)
    # Each item is a tuple with one entry per capture group, in order
    for item in items:
        yield {
            'index': item[0],
            'page_src': item[1],
            'img_src': item[2],
            'title': item[3],
            'other_title': item[4],
            'director': item[5],
            'release_date': item[6],  # four-digit release year
            'country': item[7],
            'type': item[8],
            'rate': item[9],
        }
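
The roles of the non-greedy groups and the re.S flag are easier to see on a small input. The sketch below applies a shortened pattern to a hypothetical, heavily simplified HTML fragment; the fragment, the titles "Movie A" and "Movie B", and the helper name parse_sample are assumptions made for illustration, and the real Douban markup is more complex.

```python
import re

# Hypothetical, simplified markup (not the real Douban page).
sample_html = '''
<li>
  <em class="">1</em>
  <span class="title">Movie A</span>
  <span class="rating_num">9.7</span>
</li>
<li>
  <em class="">2</em>
  <span class="title">Movie B</span>
  <span class="rating_num">9.6</span>
</li>
'''

# Each (.*?) is a non-greedy capture group: it stops at the first
# closing tag that follows, so one match covers exactly one movie.
pattern = re.compile(
    r'<em class="">(\d+)</em>.*?'
    r'<span class="title">(.*?)</span>.*?'
    r'<span class="rating_num">(.*?)</span>',
    re.S)


def parse_sample(html):
    # findall returns one tuple per match, one entry per capture group.
    for index, title, rate in pattern.findall(html):
        yield {'index': index, 'title': title, 'rate': rate}


records = list(parse_sample(sample_html))
print(records)

# The same pattern compiled without re.S: '.' cannot cross the newlines
# between the tags, so nothing matches.
no_dotall = re.compile(pattern.pattern)
print(no_dotall.findall(sample_html))
```

A regex of this kind is tightly coupled to the exact markup, so it may stop matching whenever the page layout changes.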

Define the function that saves the content

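# Function: append one record to the output file as a JSON line
def write_to_file(content):
    with open('douban_movie_rankings.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        f.write(json.dumps(content, ensure_ascii=False) + '\n')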
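
Because each record is written as one JSON object per line, the file can be read back line by line. The following is a minimal sketch of that round trip; the helper names write_record and read_records, the temporary file name, and the sample records are assumptions for illustration only.

```python
import json
import os
import tempfile


def write_record(path, record):
    # Same approach as write_to_file, but with the path passed in.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    # Parse every non-empty line back into a dict.
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample record (made up for this example).
sample = {'index': '1', 'title': '示例电影'}

# ensure_ascii=False keeps the Chinese characters as they are;
# the default setting escapes them as \uXXXX sequences.
readable = json.dumps(sample, ensure_ascii=False)
escaped = json.dumps(sample)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_rankings.txt')
    write_record(path, sample)
    write_record(path, {'index': '2', 'title': '示例电影二'})
    loaded = read_records(path)

print(len(loaded))
print(loaded[0] == sample)
```

Note that the original function opens the file in append mode, so running the script twice adds a second copy of every record to the same file.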

Define the main function

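# Main function
def main():
    # Page through the list: 10 pages of 25 movies each
    for offset in range(10):
        # Build the URL of the current page
        url = f'https://movie.douban.com/top250?start={offset * 25}&filter='
        # Fetch the HTML of the page
        html = get_one_page(url)
        # get_one_page returns None when the request fails, so skip that page
        if html is None:
            continue
        for item in parse_one_page(html):
            print(item)
            write_to_file(item)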
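
The paging logic relies on the start query parameter advancing by 25 per page. The sketch below only builds and prints the URLs, without sending any request; the constant names and the helper build_page_urls are assumptions introduced for illustration.

```python
# Names below are illustrative; the original script inlines these values.
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25


def build_page_urls(pages=10):
    # Page n (counting from 0) starts at item n * PAGE_SIZE.
    return [
        f'{BASE_URL}?start={offset * PAGE_SIZE}&filter='
        for offset in range(pages)
    ]


urls = build_page_urls()
starts = [offset * PAGE_SIZE for offset in range(10)]

print(starts)
for url in urls[:2]:
    print(url)
```

In a real run it would also be considerate to pause briefly between pages, for example with time.sleep; that is a suggestion and not something the original script does.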

Define the script entry point

if __name__ == '__main__':
    main()

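Result: running the script prints each parsed record and appends it, as one JSON object per line, to douban_movie_rankings.txt.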

Original author: 孤飞 (博客园)
Original link: https://www.cnblogs.com/ranxi169/p/16564490.html
