I wrote a script to download degree theses from the library. Is there a copyright problem?
source link: https://bbs.pku.edu.cn/v2/post-read.php?bid=134&threadid=18025798
Floor 1
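As the title says: I am reading the thesis of a senior labmate who graduated a few years ago, and dragging through it by hand one page at a time is too tedious. So I wrote a Python script that downloads the whole thing in one go. The pages come down as images, which I then merged into a PDF with Acrobat. The result has no table-of-contents structure, but at least it can be printed.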
Floor 2
It is a Python script; you only need to supply the thesis info page...
Git repository: https://github.com/rushsaker/PKULTD.git
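rushsaker (pkurs) wrote in their post:
As the title says: I am reading the thesis of a senior labmate who graduated a few years ago, and dragging through it by hand one page at a time is too tedious. So I wrote a Python script that downloads the whole thing in one go. The pages come down as images, which I then merged into a PDF with Acrobat. The result has no table-of-contents structure, but at least it can be printed.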
Floor 3
import os
import re
import time
from urllib.request import urlretrieve

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Thesis info page URL
thsis_url = "https://thesis.lib.pku.edu.cn/docinfo.action?id1=56b030af6d7a0e566009fa151cc9a83d&id2=KhhPjCdEGOQ%253D"
driver.get(thsis_url)
driver.refresh()
time.sleep(2)

# Click the '查看全文' ("View full text") link; the viewer opens in a new window
lookbut = driver.find_element(By.LINK_TEXT, '查看全文')
lookbut.click()
handles = driver.window_handles
driver.switch_to.window(handles[1])
time.sleep(2)

# Get the total page count (strip everything except digits from the label)
tpage = driver.find_element(By.CSS_SELECTOR, 'span#totalPages.toolbar-page-num')
total_pages = int(re.sub(r"\D", "", tpage.get_attribute("innerText")))
print('Total pages in this thesis: %d' % total_pages)

# Download the thesis one page image at a time
os.makedirs('./thsis_image/', exist_ok=True)
i = 0
while i < total_pages:
    div_name = 'div#loadingBg%d.loadingbg > img' % i
    try:
        pics = driver.find_element(By.CSS_SELECTOR, div_name)
        img_url = pics.get_attribute('src')
        find_page = True
    except WebDriverException:
        # The image for page i is not in the DOM yet
        find_page = False
    if find_page:
        print('Found page %d...' % (i + 1))
        # print(img_url)
        i = i + 1
        urlretrieve(img_url, './thsis_image/img%d.jpg' % i)
    else:
        # Click "next page" so the viewer loads more pages, then retry
        btnext = driver.find_element(By.CSS_SELECTOR, 'a#btnnext.toobar-btn.toobar-btn-next')
        btnext.click()
        time.sleep(0.5)
print('Thesis download complete!')
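
The first post describes merging the downloaded images into a PDF with Acrobat. As a rough alternative, the merge could be scripted too. This is only a sketch, not part of the original script: it assumes the third-party Pillow package is installed and that the images are named img1.jpg, img2.jpg, ... in ./thsis_image/ as above. The helper names page_number, sorted_pages and merge_to_pdf are made up for illustration.

```python
import re
from pathlib import Path


def page_number(path):
    """Return the integer page index embedded in a name like 'img12.jpg'."""
    match = re.search(r"(\d+)", Path(path).stem)
    if match is None:
        raise ValueError("no page number in %r" % str(path))
    return int(match.group(1))


def sorted_pages(paths):
    """Sort page images numerically, so img2.jpg comes before img10.jpg."""
    return sorted(paths, key=page_number)


def merge_to_pdf(image_dir, pdf_path):
    """Merge every img*.jpg in image_dir into one PDF, in page order."""
    # Assumption: Pillow is installed (pip install Pillow). It is imported
    # here so the sorting helpers above work without it.
    from PIL import Image

    files = sorted_pages(Path(image_dir).glob("img*.jpg"))
    if not files:
        raise FileNotFoundError("no img*.jpg files in %s" % image_dir)
    pages = [Image.open(f).convert("RGB") for f in files]
    # save_all/append_images write the remaining pages after the first one.
    pages[0].save(pdf_path, "PDF", save_all=True, append_images=pages[1:])
    return len(pages)
```

Calling merge_to_pdf('./thsis_image/', 'thesis.pdf') after the download finishes should produce a single file. The numeric sort matters because a plain alphabetical sort would put img10.jpg before img2.jpg. As with the Acrobat route, the resulting PDF is just a stack of page images with no table of contents.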