A Few Words on Scraping Weibo Hot Search

source link: https://www.bboy.app/2022/07/13/%E8%AF%B4%E4%B8%80%E8%AF%B4%E5%BE%AE%E5%8D%9A%E7%83%AD%E6%90%9C%E7%9A%84%E7%88%AC%E5%8F%96/

2022-07-13 2022-07-14 python

At some point Weibo changed its hot search page, which makes scraping the list slightly harder: you now need a cookie before you can access it.

First, look at the requests. The page URL is still the following:

https://s.weibo.com/top/summary?cate=realtimehot

Without a cookie, however, the server answers with a 302 redirect to this URL:

https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter&url=https%3A%2F%2Fs.weibo.com%2Ftop%2Fsummary%3Fcate%3Drealtimehot&domain=.weibo.com&sudaref=&ua=php-sso_sdk_client-0.6.29&_rand=1657674728.5131
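As a small illustration, the redirect target can be decoded with Python's standard library, which shows that its `url` query parameter carries the original hot search page. The helper name `original_target` is hypothetical and not from the article.

```python
from urllib.parse import parse_qs, urlsplit

# The visitor URL quoted above, i.e. where the 302 points.
REDIRECT_URL = (
    "https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter"
    "&url=https%3A%2F%2Fs.weibo.com%2Ftop%2Fsummary%3Fcate%3Drealtimehot"
    "&domain=.weibo.com&sudaref=&ua=php-sso_sdk_client-0.6.29"
    "&_rand=1657674728.5131"
)


def original_target(redirect_url: str) -> str:
    """Return the percent-decoded `url` query parameter of a visitor redirect."""
    query = parse_qs(urlsplit(redirect_url).query)
    return query["url"][0]


print(original_target(REDIRECT_URL))
```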

That page then loads the following JavaScript file:

https://passport.weibo.com/js/visitor/mini_original.js?v=20161116

The script sends a POST request to this URL:

https://passport.weibo.com/visitor/genvisitor

with two parameters:

cb: gen_callback
fp: {"os":"2","browser":"Chrome103,0,0,0","fonts":"undefined","screenInfo":"1920*1080*24","plugins":"Portable Document Format::internal-pdf-viewer::PDF Viewer|Portable Document Format::internal-pdf-viewer::Chrome PDF Viewer|Portable Document Format::internal-pdf-viewer::Chromium PDF Viewer|Portable Document Format::internal-pdf-viewer::Microsoft Edge PDF Viewer|Portable Document Format::internal-pdf-viewer::WebKit built-in PDF"}
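A minimal sketch of how these two parameters could be packed into a form-encoded POST body, using only the standard library. The function name `build_genvisitor_body` is hypothetical, and the `plugins` value is shortened here for readability; whether the server accepts a shortened fingerprint is an assumption that the article does not confirm.

```python
import json
from urllib.parse import parse_qs, urlencode

GENVISITOR_URL = "https://passport.weibo.com/visitor/genvisitor"


def build_genvisitor_body(fingerprint: dict) -> bytes:
    """Encode cb and fp as an application/x-www-form-urlencoded body."""
    # fp is sent as a JSON string, as in the captured request.
    fp = json.dumps(fingerprint, separators=(",", ":"))
    return urlencode({"cb": "gen_callback", "fp": fp}).encode("ascii")


# Values copied from the captured request; plugins is shortened (assumption).
fingerprint = {
    "os": "2",
    "browser": "Chrome103,0,0,0",
    "fonts": "undefined",
    "screenInfo": "1920*1080*24",
    "plugins": "Portable Document Format::internal-pdf-viewer::PDF Viewer",
}

body = build_genvisitor_body(fingerprint)
print(body[:60])
```

The resulting bytes would be passed as the `data` argument of a POST request to `GENVISITOR_URL`.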

The response looks like this:

window.gen_callback && gen_callback({"retcode":20000000,"msg":"succ","data":{"tid":"yfUitgX0O4Qh4CWXpO7f+PkuCHWchnXFguyPWBySm1w=","new_tid":false,"confidence":95}});
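The response is JSONP rather than plain JSON, so the callback wrapper has to be stripped before parsing. The following is one possible way to do that with a regular expression; `parse_gen_callback` is a hypothetical helper name, and the sample string is the response shown above.

```python
import json
import re

SAMPLE = (
    'window.gen_callback && gen_callback({"retcode":20000000,"msg":"succ",'
    '"data":{"tid":"yfUitgX0O4Qh4CWXpO7f+PkuCHWchnXFguyPWBySm1w=",'
    '"new_tid":false,"confidence":95}});'
)


def parse_gen_callback(text: str) -> dict:
    """Extract the JSON object passed to gen_callback(...) and return its data."""
    match = re.search(r"gen_callback\((\{.*\})\)", text, re.S)
    if match is None:
        raise ValueError("no gen_callback(...) payload found")
    return json.loads(match.group(1))["data"]


data = parse_gen_callback(SAMPLE)
print(data["tid"])
```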

From this response we only need the tid. The browser is then redirected to the following URL:

https://passport.weibo.com/visitor/visitor?a=incarnate&t=9vxFhlRT0xasvUO711X1Jp5HM1Ol0pNMj8WiFn4PhuU%3D&w=2&c=095&gc=&cb=cross_domain&from=weibo&_rand=0.015638318399351148

This URL carries the following parameters:

a: incarnate
t: 9vxFhlRT0xasvUO711X1Jp5HM1Ol0pNMj8WiFn4PhuU=
w: 2
c: 095
gc:
cb: cross_domain
from: weibo
_rand: 0.015638318399351148

t is the tid obtained from the POST request, and _rand can simply be generated with random.random().
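The parameter list above can be turned into a URL roughly as follows. This is a sketch, not a verified implementation: `build_incarnate_url` is a hypothetical name, and the values of `w`, `c`, `gc`, `cb`, and `from` are simply copied from the captured request, on the assumption that they do not need to vary.

```python
import random
from urllib.parse import parse_qs, urlencode, urlsplit

VISITOR_URL = "https://passport.weibo.com/visitor/visitor"


def build_incarnate_url(tid: str) -> str:
    """Build the a=incarnate URL for a tid returned by genvisitor."""
    params = {
        "a": "incarnate",
        "t": tid,                   # tid from the POST response
        "w": "2",                   # copied from the captured request
        "c": "095",                 # copied from the captured request
        "gc": "",
        "cb": "cross_domain",
        "from": "weibo",
        "_rand": random.random(),   # any random float, as the article suggests
    }
    # urlencode percent-encodes the tid, so a trailing "=" becomes "%3D".
    return f"{VISITOR_URL}?{urlencode(params)}"


print(build_incarnate_url("9vxFhlRT0xasvUO711X1Jp5HM1Ol0pNMj8WiFn4PhuU="))
```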

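Once that request completes, you will see that the response headers now include a Set-Cookie header, which means you have the cookie. If you then request

https://s.weibo.com/top/summary?cate=realtimehot

again, you will get the page.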
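Putting the steps together, the whole flow might look like the sketch below, which uses `urllib.request` with a cookie jar so that the Set-Cookie values are stored and replayed automatically. It has not been verified against the live site, which may have changed since the article was written. The function names, the `User-Agent` header, and the fixed `w`/`c` values are assumptions, not something the article specifies.

```python
import json
import random
import re
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

HOT_URL = "https://s.weibo.com/top/summary?cate=realtimehot"
GENVISITOR_URL = "https://passport.weibo.com/visitor/genvisitor"
VISITOR_URL = "https://passport.weibo.com/visitor/visitor"
# Assumption: a browser-like User-Agent; the article does not say it is required.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def make_opener():
    """Return an opener that keeps cookies between requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def read_text(opener, url, data=None):
    """Send a GET (or a POST when data is given) and return the decoded body."""
    request = Request(url, data=data, headers=HEADERS)
    with opener.open(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_hot_search_html(opener, fingerprint):
    # Step 1: POST cb and fp to genvisitor and pull the tid out of the JSONP.
    body = urlencode(
        {"cb": "gen_callback", "fp": json.dumps(fingerprint)}
    ).encode("ascii")
    jsonp = read_text(opener, GENVISITOR_URL, data=body)
    match = re.search(r"gen_callback\((\{.*\})\)", jsonp, re.S)
    if match is None:
        raise ValueError("unexpected genvisitor response")
    tid = json.loads(match.group(1))["data"]["tid"]

    # Step 2: request the incarnate URL; the cookie jar stores Set-Cookie.
    query = urlencode({
        "a": "incarnate", "t": tid, "w": "2", "c": "095", "gc": "",
        "cb": "cross_domain", "from": "weibo", "_rand": random.random(),
    })
    read_text(opener, f"{VISITOR_URL}?{query}")

    # Step 3: request the hot search page again, now with the cookie attached.
    return read_text(opener, HOT_URL)
```

A call would look like `html = fetch_hot_search_html(make_opener(), {"os": "2", "browser": "Chrome103,0,0,0"})`, where the fingerprint dictionary is whatever browser description you choose to send. Passing the opener in as an argument keeps the network layer replaceable, which makes the flow easy to exercise offline with a stub.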

Feel free to follow my blog at www.bboy.app

Have Fun
