Backing Up Baidu Tieba Posts

In August 2020 I started growing roses to relieve stress. After getting into the hobby, I began following a blogger who wrote introductions to rose varieties. His posts were richly illustrated, detailed, and well written. Most of the varieties he recommended are highly disease-resistant, in sharp contrast to the delicate ones heavily promoted by nursery sellers. As a result, the discussion around him was rarely calm, and open or veiled mockery surfaced from time to time. In October 2020, a nursery seller who was then a moderator of a certain forum set a trap for him, intending to get those introduction posts deleted. To preserve the material, I backed up the vast majority of his posts and converted them into local files. In the process I found very few articles online that explain in detail how to back up Tieba posts, so I decided to write one myself for anyone who needs it later.

Here is a brief overview of what I did. I manually compiled the list of post links to back up, then used A's code to generate HTML files and B's code to batch-download the images in the posts. Next, I checked that the images and the HTML files matched one to one, rewrote the URLs of images and Tieba emoticons in the HTML to local paths, and cleaned the noise out of the data. A fellow user, S, had saved some of the blogger's posts with Evernote. I compared his files against mine and compiled a collection of the posts that had been backed up.
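The scripts in the later steps all read the hand-collected links from a plain urls.txt file, one post link per line, ending in the ten-digit post id that the code later slices out with url[-10:] (e.g. https://tieba.baidu.com/p/6100954692). Below is a minimal sketch of normalizing a hand-collected link list into that form; the file names raw_links.txt and urls.txt are only illustrative.

import re

# Illustrative file names: raw_links.txt holds links pasted by hand,
# urls.txt is the cleaned list consumed by the later scripts.
raw_path = "raw_links.txt"
clean_path = "urls.txt"

# The later scripts assume a ten-digit post id at the end of each url.
pid_pattern = re.compile(r"https://tieba\.baidu\.com/p/(\d{10})")

clean_urls = set()
with open(raw_path, "r", encoding="utf-8") as raw_file:
    for line in raw_file:
        match = pid_pattern.search(line)
        if match:
            # Drop query strings and other trailing junk, keep the canonical url.
            clean_urls.add("https://tieba.baidu.com/p/" + match.group(1))

with open(clean_path, "w", encoding="utf-8") as clean_file:
    for url in sorted(clean_urls):
        clean_file.write("%s\n" % url)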

1. Generating the HTML files

Use hjhee's tiebaSpider code. Because of network issues, you may need to download the source code's dependencies manually from their official sites and extract them into the expected directory.
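In case it helps, here is a minimal, generic sketch of fetching a dependency archive by hand and unpacking it into a target directory; the archive URL and paths are placeholders, not tiebaSpider's actual dependencies.

import urllib.request
import zipfile

# Placeholder values; substitute the real dependency archive and target directory.
dependency_url = "https://example.com/path/to/dependency.zip"
archive_path = "./dependency.zip"
target_dir = "./deps"

# Download the archive, then extract it where the build expects it.
urllib.request.urlretrieve(dependency_url, archive_path)
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall(target_dir)
print("Dependency extracted to %s" % target_dir)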

2. Downloading the images in the posts

Use zhaohui8969's tiebaImageGet code. By default, the original code downloads the images from only one link at a time. I modified it so that a single run downloads the images from multiple links.

# This main() replaces the one in zhaohui8969's tiebaImageGet script;
# ImageGet and time are imported/defined in that script.
def main():
    # usr_name = "relu"
    # txt_name = "urls.txt"
    txt_path = './backup//urls//202101//urls.txt'
    with open(txt_path, "rb") as file:
        lines = file.readlines()
    lines = [x.strip() for x in lines]
    # item in lines: https://tieba.baidu.com/p/6100954692
    # the last ten characters of each url are the post id (pid)
    pids = []
    for item in range(len(lines)):
        url = lines[item]
        pid = url[-10:]
        pids.append(int(pid))
    print(u"\nData has been processed")
    max_thread_num = 20
    save_directory = './backup//202101//img'
    try:
        image_get_obj = ImageGet(max_thread_num, save_directory)
        for id in range(len(pids)):
            print(u'\nStart downloading')
            image_get_obj(pids[id])
            print(u'\nSleeping for 5 seconds')
            time.sleep(5)
        print(u'\nImages from the links in this file have been downloaded. Switch to the next url file and change your IP address.')
    except Exception as exc:
        print(u'\nSomething went wrong (%s); tweak the try block in main() to debug it.' % exc)

3. Checking file integrity

Because the code used in the previous two steps does not write an error log, I need to check that the URLs, the HTML files, and the images correspond one to one. The code is as follows.

import codecs
from os import listdir
from os.path import isfile, join


def get_htmlPid(html_folders_path, html_file_name):
    # html_file_name = title + ".html" (".html" is 5 characters long)
    title_len = len(html_file_name) - 5
    # 447: plain marks in html file before pid in urls
    # length of file name is not included
    begin = 447 + (2 * title_len)
    end = begin + 10
    html_file_path = html_folders_path + "//" + html_file_name
    with open(html_file_path, 'r', encoding='utf-8') as HtmlFile:
        html_source_code = HtmlFile.read()
    html_pid = int(html_source_code[begin: end])
    return html_pid


def get_imgPid(img_folders_path):
    # get all folder names
    img_pid = listdir(img_folders_path)
    img_pid_int = []
    for id in range(len(img_pid)):
        img_pid_int.append(int(img_pid[id]))
    return img_pid_int


def get_urlPid(url_path):
    url_pid = []
    with open(url_path, "r") as load_url_file:
        plain_urls = load_url_file.readlines()
    plain_urls = [x.strip() for x in plain_urls]
    for url_id in range(len(plain_urls)):
        single_url = plain_urls[url_id]
        url_pid.append(int(single_url[-10:]))
    return url_pid


def check_integrity(url_pid, html_pid, img_pid):
    # remove duplicates
    final_url_pid = list(set(url_pid))
    final_html_pid = list(set(html_pid))
    final_img_pid = list(set(img_pid))
    missing_html = []
    missing_img = []
    # check html files
    for url_item in range(len(final_url_pid)):
        if final_url_pid[url_item] not in final_html_pid:
            missing_html.append(final_url_pid[url_item])
        if final_url_pid[url_item] not in final_img_pid:
            missing_img.append(final_url_pid[url_item])
    return missing_html, missing_img


def main():
    usr_name = "relu"
    base_path = "./2020-10-25-tieba-data-processing//rose-tieba-backup" + "//" + usr_name
    store_path = "./2020-10-25-tieba-data-processing//rose-tieba-backup" + "//z-missing-files"
    folders = listdir(base_path)
    html_pid = []
    # store missing_html and missing_img
    all_missing_html_pid = []
    all_missing_img_pid = []
    for folder_id in range(len(folders)):
        # initialize paths
        html_path = base_path + "//" + folders[folder_id]
        img_path = base_path + "//" + folders[folder_id] + "//img"
        url_path = base_path + "//" + folders[folder_id] + "//urls.txt"
        # store html names
        html_file_names = []
        # get all html file names in a folder
        file_names = listdir(html_path)
        for name in file_names:
            if name.endswith(".html"):
                html_file_names.append(name)
        for html_name in range(len(html_file_names)):
            html_pid_single = get_htmlPid(html_path, html_file_names[html_name])
            html_pid.append(html_pid_single)
        img_pid = get_imgPid(img_path)
        url_pid = get_urlPid(url_path)
        missing_html_pid, missing_img_pid = check_integrity(url_pid, html_pid, img_pid)
        all_missing_html_pid.extend(missing_html_pid)
        all_missing_img_pid.extend(missing_img_pid)
    store_html_path = store_path + "//" + usr_name + "-missing-html.txt"
    store_img_path = store_path + "//" + usr_name + "-missing-img.txt"
    with open(store_html_path, "w", encoding="utf-8") as store_html:
        for html in range(len(all_missing_html_pid)):
            complete_url_1 = "https://tieba.baidu.com/p/" + str(all_missing_html_pid[html])
            store_html.write("%s\n" % complete_url_1)
    with open(store_img_path, "w", encoding="utf-8") as store_img:
        for img in range(len(all_missing_img_pid)):
            complete_url_2 = "https://tieba.baidu.com/p/" + str(all_missing_img_pid[img])
            store_img.write("%s\n" % complete_url_2)
    print("\n Data integrity of %s has been checked." % usr_name)


if __name__ == "__main__":
    main()

4. Rewriting the image paths

The image URLs in the HTML files point to Baidu's image hosting, so they need to be rewritten to local paths.

from bs4 import BeautifulSoup
from os.path import basename, splitext
from os import listdir
import re


def modify_src(folder_path, file_name):
    file_path = folder_path + '//' + file_name
    soup = BeautifulSoup(open(file_path, encoding="utf-8"), "html.parser")
    # pid_link = soup.find_all("a", href=re.compile(r"^https://tieba.baidu.com/p/"))
    # t = soup.select('a[href^="https://tieba.baidu.com/p/"]')
    # below is correct
    url = [elm.get_text() for elm in soup.find_all("a", href=re.compile(r"^https://tieba.baidu.com/p/"))]
    # get pid
    pid = url[0][-10:]
    # modify image src
    # unmodified src: https://imgsa.baidu.com/forum/w%3D580/sign=4d3033fbbdde9c82a665f9875c8080d2/4417d558ccbf6c815f62fb2ab23eb13532fa4035.jpg
    # modified: ./img/6233150605/09d6a94bd11373f0a6c6bb5daa0f4bfbf9ed0488.jpg
    # pattern: ./img/pid/img_name
    # img_name: img["src"][-44:]
    # unmodified emoticon src: https://gsp0.baidu.com/5aAHeD3nKhI2p27j8IqW0jdnxx1xbK/tb/editor/images/client/image_emoticon72.png
    # modified: ../emoticon/image_emoticon72.png
    for img in soup.findAll('img', {"src": True}):
        if img["src"].endswith(".jpg"):
            modified = './img/' + pid + '/' + img['src'][-44:]
            img['src'] = modified
        if img['src'].endswith('.png'):
            splited = img['src'].split('/')
            emoticon_name = splited[-1]
            emoti_modified = '../tieba_emoticon/' + emoticon_name
            img['src'] = emoti_modified
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(str(soup))


def main():
    base_path = './rose_tieba_data_processing//data//tiezi_downloaded'
    # file_name = "鹅黄美人 Buff Beauty.html"
    # file_path = base_path + "//" + file_name
    folder_names = listdir(base_path)
    for folder_item in range(len(folder_names)):
        if folder_names[folder_item] == 'tieba_emoticon':
            continue
        print('Processing files in %s' % folder_names[folder_item])
        folder_path = base_path + '//' + folder_names[folder_item]
        all_files = listdir(folder_path)
        # get all html files in a folder
        file_name = []
        for item in range(len(all_files)):
            if all_files[item].endswith('.html'):
                file_name.append(all_files[item])
        # processing html files
        for file_id in range(len(file_name)):
            modify_src(folder_path, file_name[file_id])
            print('%s has been processed' % file_name[file_id])
        file_name.clear()


if __name__ == "__main__":
    main()

The titles in the HTML files contain "【图片】" and the forum-name suffix ("XX吧"), which need to be removed.

def modify_title(folder_path, file_name):
    file_path = folder_path + '//' + file_name
    soup = BeautifulSoup(open(file_path, encoding="utf-8"), "html.parser")
    new_title = str(soup.find('title').string)
    print(new_title)
    new_title = new_title.replace('【图片】', '')
    new_title = new_title.replace('【月季花吧】_百度贴吧', '')
    new_title = new_title.replace('【天狼月季吧】_百度贴吧', '')
    soup.title.string = new_title
    new_h1 = str(soup.find('h1').string)
    new_h1 = new_h1.replace('【图片】', '')
    new_h1 = new_h1.replace('【月季花吧】_百度贴吧', '')
    new_h1 = new_h1.replace('【天狼月季吧】_百度贴吧', '')
    soup.h1.string = new_h1
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(str(soup))

In addition, the sentence "希望各位吧友能支持魔吧月刊。" in the posts also needs to be removed:

def remove_noise(folder_path, file_name):
    file_path = folder_path + '//' + file_name
    soup = BeautifulSoup(open(file_path, encoding="utf-8"), "html.parser")
    for div in soup.find_all("img", {'class': 'nicknameEmoji'}):
        div.decompose()
    noise = '<div>\n<div>\n<div> #3: <b></b></div>\n<div>希望各位吧友能支持魔吧月刊。</div>\n</div>\n<hr/>\n</div>'
    cleaned = str(soup).replace(noise, '')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(cleaned)

I looked for differences between my backup and S's by comparing file titles. Because the file names generated by Evernote are rather messy, I used regular expressions to clean them up.

import os
from os import listdir
from os.path import isfile, join
import re

# collect spider data
spider_path = "./tieba-download//html-only"
spider_original_names = []
spider_names = []
spider_folders = listdir(spider_path)
for spider_folder_id in range(len(spider_folders)):
    spider_sub_path = spider_path + "//" + spider_folders[spider_folder_id]
    spider_files = listdir(spider_sub_path)
    spider_original_names.extend(spider_files)
# remove unnecessary suffix
for spider_item in range(len(spider_original_names)):
    spider_names.append(spider_original_names[spider_item].replace("【月季花吧】_百度贴吧", ""))
# remove duplicate names in spider data
spider_names = list(set(spider_names))
# collect evernote data
evernote_path = "G://ddd-data-evernote"
evernote_original_names = []
evernote_names = []
for file in os.listdir(evernote_path):
    if file.endswith(".html"):
        evernote_original_names.append(file)
# compile regex pattern
pattern_string = r"【月季花吧】_\w{1,4}\s\[\d{1}\]|【月季花吧】_\w{1,4}|_\w{4}_\w{1,4}\s\[\d{1}\]|_\w{4}_\w{0,4}|【月季花吧】"
pattern = re.compile(pattern_string)
# remove unnecessary suffix
for item in range(len(evernote_original_names)):
    evernote_names.append(pattern.sub("", evernote_original_names[item]))
# remove duplicate names in evernote data
evernote_names = list(set(evernote_names))
# double check files
spider_minus_evernote = []
evernote_minus_spider = []
for evernote_id in range(len(evernote_names)):
    if evernote_names[evernote_id] not in spider_names:
        evernote_minus_spider.append(evernote_names[evernote_id])
for spider_id in range(len(spider_names)):
    if spider_names[spider_id] not in evernote_names:
        spider_minus_evernote.append(spider_names[spider_id])
# set output paths
evernote_store_path = "./evernote_minus_spider.txt"
spider_store_path = "./spider_minus_evernote.txt"
# store data which is in evernote but not in spider
with open(evernote_store_path, "w", encoding='utf-8') as evernote_save:
    for evernote_save_item in evernote_minus_spider:
        evernote_save.write("%s\n" % evernote_save_item)
# store data which is in spider but not in evernote
with open(spider_store_path, "w", encoding='utf-8') as spider_save:
    for spider_save_item in spider_minus_evernote:
        spider_save.write("%s\n" % spider_save_item)
print("Missing files in evernote and spider have been checked.")

Finally, I sorted the posts by publication date and generated a table of contents.

import pickle

all_temp_data = pickle.load(open("ordered_temp_data.p", "rb"))
# data structure:
# [year, month, day, title, category, path]
# e.g. [2018, 10, 14, '巴黎七月的粉龙沙', '品种介绍-梅昂 (Meilland)', './品种介绍-梅昂 (Meilland)//巴黎七月的粉龙沙.html']
hrefs = []
# href:
# <p> 10月14日 <a href="./品种介绍-梅昂 (Meilland)//巴黎七月的粉龙沙.html">巴黎七月的粉龙沙</a></p>
for item in range(len(all_temp_data)):
    href = ('<p> ' + str(all_temp_data[item][1]) + '月' + str(all_temp_data[item][2]) + '日 '
            + '<a href="' + all_temp_data[item][5] + '">' + all_temp_data[item][3] + '</a></p>')
    hrefs.append(href)
save_path = 'G://rose_tieba_data_processing//codes//href-three.txt'
with open(save_path, "w", encoding="utf-8") as store_hrefs:
    for href_id in range(len(hrefs)):
        store_hrefs.write("%s\n" % hrefs[href_id])
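
The snippet above assumes ordered_temp_data.p already holds date-sorted records. A minimal sketch of how such a pickle could be produced from unsorted [year, month, day, title, category, path] records (the sample rows are only illustrative):

import pickle

# Illustrative, unsorted records in the same layout as above.
temp_data = [
    [2019, 3, 2, '示例帖子', '品种介绍-梅昂 (Meilland)', './品种介绍-梅昂 (Meilland)//示例帖子.html'],
    [2018, 10, 14, '巴黎七月的粉龙沙', '品种介绍-梅昂 (Meilland)', './品种介绍-梅昂 (Meilland)//巴黎七月的粉龙沙.html'],
]

# Sort chronologically by (year, month, day) so the generated contents read in order.
ordered_temp_data = sorted(temp_data, key=lambda row: (row[0], row[1], row[2]))

with open("ordered_temp_data.p", "wb") as temp_file:
    pickle.dump(ordered_temp_data, temp_file)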
