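Learning Python: Web Crawlers (Part 1)

0x00 Preface

I had heard about web crawlers for a long time, but what exactly is one? With that question in mind I started learning to write crawlers. These are my notes on the crawler-related topics I ran into along the way: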

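0x01 How Crawlers Work

1. Crawler overview

1.1 What is a web crawler?
A web crawler is an automated program that requests web sites and extracts data from them.
There are many kinds of crawlers; the common ones include general-purpose crawlers and focused crawlers.
1.2 Basic crawler workflow

Send the request: use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
Get the response: if the server responds normally, you get a Response whose body is the content you asked for. It may be HTML, a JSON string, or binary data (such as images or video).
Parse the content: HTML can be parsed with regular expressions or an HTML parsing library; JSON can be loaded directly into a JSON object; binary data can be saved or processed further.
Save the data: the data can be stored in many forms, for example as plain text, in a database, or as a file in a specific format.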
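
The four steps above can be sketched with the standard library alone. This is only a minimal illustration, not a complete crawler: the function names (fetch, parse_title, save, crawl), the browser-like User-Agent value, and the choice of extracting the page title are all assumptions made for the example.

```python
import json
import re
import urllib.request


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the decoded response body."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read().decode('utf-8')


def parse_title(html):
    """Step 3: pull the <title> text out of an HTML string with a regex."""
    match = re.search(r'<title>(.*?)</title>', html, re.S | re.I)
    return match.group(1).strip() if match else None


def save(record, path):
    """Step 4: store the extracted record as a JSON text file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False)


def crawl(url, path):
    """Run all four steps for one page (needs network access)."""
    html = fetch(url)
    save({'url': url, 'title': parse_title(html)}, path)
```

Calling crawl('http://www.baidu.com', 'result.json') would run the whole chain; the parsing and saving steps can also be tried on their own without any network access.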

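1.3 What kind of data can be crawled?

Web page text: HTML documents, JSON text, and so on.
Images: what you get is binary data; save it in an image format.
Video: also binary data; save it in a video format.
Other: anything you can request, you can fetch.
2. Implementing the basic workflow

2.1 Request

Request method: mainly GET and POST; there are also HEAD, PUT, DELETE, OPTIONS, and others.
Request URL: URL stands for Uniform Resource Locator. A web page, an image, or a video can each be uniquely identified by a URL.
Request headers: header information sent with the request, such as User-Agent, Host, and Cookie.
Request body: extra data carried by the request, for example the form data of a form submission.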
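
To make the four parts of a request concrete, here is a small sketch that builds a urllib Request object and inspects it without sending anything. The URL, header value, and form field are example values only.

```python
from urllib import parse, request

# Request body: form data must be encoded to bytes.
form = parse.urlencode({'word': 'hello'}).encode('utf-8')

req = request.Request(
    url='http://httpbin.org/post',            # request URL
    data=form,                                # request body
    headers={'User-Agent': 'Mozilla/5.0'},    # request headers
    method='POST',                            # request method
)

print(req.get_method())              # POST
print(req.full_url)                  # http://httpbin.org/post
# Request stores header names capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.data)                      # b'word=hello'
```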

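2.2 Response

Response status: there are many status codes, such as 200 (OK), 301 (Moved Permanently, a redirect), 404 (Not Found), and 502 (Bad Gateway, a server-side error).
Response headers: content type, content length, server information, Set-Cookie, and so on.
Response body: the most important part. It holds the content of the requested resource, such as the page HTML or the binary data of an image.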
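
As a quick way to look up what a status code means, the standard library's http.HTTPStatus enum carries the official phrase for each code. The describe helper below is a hypothetical name introduced for this sketch.

```python
from http import HTTPStatus


def describe(code):
    """Return 'code phrase (class)' for a standard HTTP status code."""
    classes = {1: 'informational', 2: 'success', 3: 'redirection',
               4: 'client error', 5: 'server error'}
    status = HTTPStatus(code)
    return f'{code} {status.phrase} ({classes[code // 100]})'


for code in (200, 301, 404, 502):
    print(describe(code))
```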

2.3.1 How to parse? (parsing methods)

JSON parsing
Regular expressions
Parsing libraries (BeautifulSoup, PyQuery, XPath via lxml)
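
The first two approaches need nothing beyond the standard library. The sketch below parses a hand-written JSON string and a hand-written HTML snippet; both inputs are made up for illustration, and a regex like this is only suitable for simple, regular markup.

```python
import json
import re

# JSON text: load it into Python objects, then index as usual.
json_text = '{"args": {"name": "qwzf", "age": "20"}, "url": "http://httpbin.org/get"}'
data = json.loads(json_text)
print(data['args']['name'])

# HTML text: a regex can pull out simple, regular patterns such as links.
html = '<ul><li><a href="/a.html">A</a></li><li><a href="/b.html">B</a></li></ul>'
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)
```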

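2.3.2 How to deal with JavaScript-rendered pages?

Analyze the Ajax requests
Selenium/WebDriver
Splash
PyV8, Ghost.py

2.4 How to save the data?

Text: plain text, JSON, XML, and so on.
Relational databases: MySQL, Oracle, SQL Server, and others that store data in structured tables.
Non-relational databases: such as MongoDB (document store) and Redis (key-value store).
Binary files: images, video, audio, and so on can be saved directly in their own formats.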
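
A minimal sketch of the text and relational options follows. It assumes the built-in sqlite3 module as a stand-in for a server database such as MySQL, and the table name, file name, and record are invented for the example.

```python
import json
import os
import sqlite3
import tempfile

records = [{'name': 'qwzf', 'age': 20}]

# Text: dump the records to a JSON file (in the temp directory for this demo).
json_path = os.path.join(tempfile.gettempdir(), 'records.json')
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False)

# Relational storage: sqlite3 stands in for MySQL and similar servers here.
conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
conn.execute('CREATE TABLE users (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO users VALUES (:name, :age)', records)
conn.commit()
rows = conn.execute('SELECT name, age FROM users').fetchall()
print(rows)
conn.close()
```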

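0x02 Common Python Libraries for Crawlers

1.1 urllib

Python's built-in HTTP request library, so it works without installing anything.

urllib.request: the request module. It is the most basic HTTP request module and is used to send requests.
urllib.error: the exception module. If a request fails we can catch these exceptions and then retry or take other action so the program does not terminate unexpectedly.
urllib.parse: the URL parsing module. It provides many URL helpers for splitting, parsing, joining, and so on.
urllib.robotparser: the robots.txt parsing module. It reads a site's robots.txt file and tells you which URLs may be crawled and which may not. It is used less often.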
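
Since urllib.robotparser gets no example later on, here is a small sketch of how it can be used. The robots.txt rules, the crawler name, and the example.com URLs are made up, and the rules are parsed from memory rather than downloaded so the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from memory instead of downloaded.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch('my-crawler', 'http://example.com/index.html')
blocked = parser.can_fetch('my-crawler', 'http://example.com/private/a')
print(allowed, blocked)
```

For a real site you would normally call set_url() with the address of its robots.txt and then read() instead of parse().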

Changes compared with Python 2

Python 2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')

Python 3
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

1.2 Basic urllib operations

1.2.1 urlopen

urllib.request.urlopen(url,data=None,[timeout,]*,cafile=None,capath=None,cadefault=False,context=None)

GET request:
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.read().decode('utf-8'))

POST request:
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

Timeout:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('time out!')

1.2.2 Response
Response type, status code, response headers, and response body:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response)) # response type
print(response.status) # status code
print(response.getheaders()) # all response headers
print(response.getheader('Server')) # one specific response header
print(response.read().decode('utf-8')) # response body

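1.2.3 Request

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url is the URL to request. It is the only required parameter; all the others are optional.
data, if given, must be bytes. If you have a dictionary, encode it first with urlencode() from the urllib.parse module.
headers is a dictionary of request headers. You can pass it through the headers parameter when constructing the request, or add headers later by calling add_header() on the request instance.
The most common reason to add a header is to change User-Agent so the request looks like it comes from a browser. The default User-Agent is Python-urllib followed by the Python version (for example Python-urllib/3.8). To pose as Firefox, for instance, you could set it to:
Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
origin_req_host is the host name or IP address of the party making the request.
unverifiable says whether the request is unverifiable, and defaults to False. A request is unverifiable when the user had no option to approve it. For example, if we request an image embedded in an HTML document and the user never had the chance to approve the automatic fetching of that image, unverifiable should be True.
method is a string naming the HTTP method to use, such as GET, POST, or PUT.

GET request: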
import urllib.request
request = urllib.request.Request('http://httpbin.org/get')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

POST request:
from urllib import request,parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Host':'httpbin.org'
}
form = {  # named form rather than dict, so the built-in dict is not shadowed
    'name':'qwzf'
}
data = bytes(parse.urlencode(form),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Another way to send a POST request, adding the header with add_header():
from urllib import request,parse
url = 'http://httpbin.org/post'
form = {  # named form rather than dict, so the built-in dict is not shadowed
    'name':'qwzf'
}
data = bytes(parse.urlencode(form),encoding='utf8')
req = request.Request(url=url,data=data,method='POST')
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

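1.3 Advanced urllib operations

1.3.1 Handler
(1) Proxy
Before using a proxy in code, look up the system proxy settings:

Open Control Panel -> Network and Internet -> Internet Options -> Connections -> LAN settings -> Proxy server

Here the system proxy port is 10809. Then start the proxy service in your proxy software.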

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:10809',
    'https': 'https://127.0.0.1:10809'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

When the proxy works, the request succeeds and the response body is printed.

(2) Cookie

# The cookie variable ends up holding the cookies set by the requested site
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()                 # create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)# build a handler that processes cookies
opener = urllib.request.build_opener(handler)       # pass the handler to build_opener
response = opener.open('http://www.baidu.com')
for item in cookie:                                 # print each cookie
    print(item.name+"="+item.value)

# Save the cookies from the requested site to a text file
import http.cookiejar, urllib.request
filename = "cookie.txt"                             # file the cookies are saved to
# Use a CookieJar subclass that can write to a file
#cookie = http.cookiejar.MozillaCookieJar(filename)  # MozillaCookieJar (Mozilla cookies.txt format)
cookie = http.cookiejar.LWPCookieJar(filename)      # LWPCookieJar (libwww-perl Set-Cookie3 format)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True) # save() writes the cookies to the file

# Load the cookies stored in the text file and send them with a new request
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

1.3.2 Exception handling
Exception types in urllib.error:

URLError attribute: reason
HTTPError attributes: code, reason, headers
ContentTooShortError(msg,content)

from urllib import request, error
try:
    response = request.urlopen('http://www.httpbin.org/qwzf')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
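
The order of the except clauses above matters because HTTPError is a subclass of URLError: if URLError were caught first, the HTTPError branch would never run. The sketch below shows the relationship offline using hand-built exception objects; the classify helper and the error messages are hypothetical, introduced only for this illustration.

```python
from email.message import Message
from urllib.error import HTTPError, URLError

print(issubclass(HTTPError, URLError))


def classify(exc):
    """Mimic the except-clause order: test the subclass before the parent."""
    if isinstance(exc, HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, URLError):
        return f'URL error: {exc.reason}'
    return 'other error'


# Hand-built exceptions, so no network access is needed.
http_exc = HTTPError('http://www.httpbin.org/qwzf', 404, 'NOT FOUND', Message(), None)
url_exc = URLError('connection refused')
print(classify(http_exc))
print(classify(url_exc))
```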

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

1.3.3 URL parsing
(1) urlparse
Splits a URL into its components

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

# split a URL
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')


# the URL has no scheme, so the scheme argument (https) is used
from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

# scheme is only a default; if the URL already has a scheme, the argument is ignored
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)


# with allow_fragments=False the fragment is not split off; it stays in the preceding component
from urllib.parse import urlparse
result1 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
result2 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result1,'\n',result2)
Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='') 
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

(2) urlunparse
Builds a URL from its six components

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
Output:
http://www.baidu.com/index.html;user?a=6#comment

(3) urljoin
Joins a base URL with another URL

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
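# Components of the second URL override those of the first.
# A component missing from the second URL is filled in from the first; one that is present replaces it.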
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
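
In a crawler, the typical use of urljoin is turning the relative links found on a page into absolute URLs. A small sketch, assuming a made-up page address and a made-up list of href values:

```python
from urllib.parse import urljoin

page_url = 'http://example.com/docs/index.html'   # made-up page address
hrefs = ['a.html', '/img/logo.png', '../up.html', 'https://other.example/x', '#top']

# Resolve every link against the URL of the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```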

(4) urlencode
Converts a dictionary into GET query parameters

from urllib.parse import urlencode
params = {
    'name': 'qwzf',
    'age': 20
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
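
For reference, the sketch below shows what urlencode produces for this dictionary and how the query string can be decoded again. Using urlsplit and parse_qs for the reverse direction is an addition of this example, not something the original code does, and the second dictionary is made up.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {'name': 'qwzf', 'age': 20}
url = 'http://www.baidu.com?' + urlencode(params)
print(url)  # http://www.baidu.com?name=qwzf&age=20

# Going the other way: split off the query string and decode it.
query = urlsplit(url).query
decoded = parse_qs(query)
print(decoded)  # values come back as lists of strings

# Spaces in values are encoded as '+'.
print(urlencode({'q': 'web crawler'}))  # q=web+crawler
```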

2.1 requests

An HTTP library built on urllib3 and released under the Apache 2.0 license.
Put simply, requests is an easy-to-use HTTP library implemented in Python.
2.1.1 Installation

pip3:
pip3 install requests

pip:
pip install requests
# install for a specific Python version: python3 -m pip install requests

2.2 Basic requests operations

2.2.1 Request methods

import requests
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

2.2.2 Requests

2.2.2.1 Basic GET requests

(1) Basic usage

import requests
response = requests.get('http://httpbin.org/get')
print(response.text)

(2) GET request with parameters

import requests
data = {
    'name': 'qwzf',
    'age': 20
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)

(3) Parsing JSON

import requests
import json
response = requests.get("http://httpbin.org/get")
print(response.text)
print(response.json())
print(json.loads(response.text))

(4) Getting binary data

import requests
response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content) # raw bytes of the response

Writing binary data to a binary file (such as an image or video):

import requests
response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)  # the with block closes the file automatically

(5) Adding headers

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

2.2.2.2 Basic POST requests

(1) POST request with parameters

import requests
data = {'name': 'qwzf', 'age': '20'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)

(2) POST request with parameters and headers, parsing the JSON response

import requests
data = {'name': 'qwzf', 'age': '20'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())

2.2.3 Response

2.2.3.1 Response attributes

import requests
response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code) # status code
print(type(response.headers), response.headers) # response headers
print(type(response.cookies), response.cookies) # cookies
print(type(response.url), response.url) # final URL
print(type(response.history), response.history) # redirect history

2.2.3.2 Checking the status code

# check for Not Found
import requests
response = requests.get('http://www.aliyun.com/qwzf.html')
if response.status_code == requests.codes.not_found:
    print('404 Not Found')

# check for success
import requests
response = requests.get('http://www.aliyun.com')
if response.status_code == 200:
    print('Request Successfully')

2.3 Advanced requests operations

2.3.1 File upload

import requests
files = {'file': open('favicon.ico', 'rb')}
response = requests.post("http://httpbin.org/post", files=files)
print(response.text)

2.3.2 Getting cookies

import requests
response = requests.get("http://www.baidu.com")
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)

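2.3.3 Session persistence
Useful for things like simulating a login. Two separate requests.get calls do not share cookies, so the cookie set by the first request is gone by the second: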

import requests
requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

To set and then read the cookie within one session, as a single browser would:

import requests
s = requests.Session() # a Session keeps cookies across the requests it sends
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)

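2.3.4 Certificate verification
Deals with certificate problems when requesting https URLs

import requests
response = requests.get('https://www.12306.cn') # the certificate of https://www.12306.cn is valid again; I have not yet found a site with an invalid certificate to test against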
print(response.status_code)

If the certificate is invalid, the code below skips verification and still returns a status code:

import requests
from requests.packages import urllib3
urllib3.disable_warnings() # suppress the warning
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

Specifying a client certificate manually:

import requests
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

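2.3.5 Proxy settings
Turn on the system proxy first; see the urllib proxy section above.

import requests
# Both entries point at the same local HTTP proxy. The proxy URL itself uses http://
# even for https traffic, because a local proxy normally does not speak TLS.
proxies = {
  "http": "http://127.0.0.1:10809",
  "https": "http://127.0.0.1:10809",
}
'''proxies = {# if the proxy needs a username and password
    "http": "http://user:[email protected]:10809/",
    "https": "https://user:[email protected]:10809/",
}'''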
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

If the proxy is not an http/https proxy, for example a socks proxy, install the extra dependency first:
pip3 install 'requests[socks]'
Turn on the socks proxy, then run the code below:

import requests
proxies = {
    'http': 'socks5://127.0.0.1:10809',
    'https': 'socks5://127.0.0.1:10809'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

2.3.6 Timeout

import requests
from requests.exceptions import Timeout
try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except Timeout: # Timeout is the parent of both ConnectTimeout and ReadTimeout
    print('Time out!')

2.3.7 Authentication
HTTP login authentication

import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://example.com/', auth=HTTPBasicAuth('qwzf', '123'))
print(r.status_code)

Shorthand with a tuple:

import requests
r = requests.get('http://example.com/', auth=('user', '123'))
print(r.status_code)

2.3.8 Exception handling
See requests.exceptions for the available exception classes

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException
try:
    response = requests.get("http://httpbin.org/get", timeout = 0.5)
    print(response.status_code)
except ReadTimeout: # read timeout
    print('Timeout')
except ConnectionError: # connection problem
    print('Connection error')
except RequestException: # parent class of all requests exceptions
    print('Error')

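0x03 Afterword

This post records what I learned and practiced about how crawlers work and about common operations in the urllib and requests libraries. More notes on Python web crawlers will follow.

Feel free to repost with attribution. You are welcome to check the sources cited in this post and to point out anything that is wrong or unclear, either in the comments below or by email to [email protected]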
