
ArchiveBox - Self-hosted website archiving service

 3 years ago
source link: https://azhuge233.com/archivebox-%e8%87%aa%e6%89%98%e7%ae%a1%e7%bd%91%e7%ab%99%e5%ad%98%e6%a1%a3%e6%9c%8d%e5%8a%a1/

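An open-source, self-hosted website archiving service. It automatically crawls each URL you give it and archives the HTML, media files, JS, PDF files and so on, so the page can be viewed offline.

The following is quoted from the Background & Motivation section of the official site: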

The aim of ArchiveBox is to enable more of the internet to be archived by empowering people to self-host their own archives. The intent is for all the web content you care about to be viewable with common software in 50 – 100 years without needing to run ArchiveBox or other specialized software to replay it.

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it’s to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010’s flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.

Image from WTF is Link Rot?

The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don’t think everything should be preserved in an automated fashion–making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

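The rest of this post shows how to set up ArchiveBox on Debian 10 using the package manager.

  • Debian 10

Unless otherwise noted, the commands below are run as root; add sudo where needed if you use a different user.

Install dependencies

The official installation instructions do not say that npm has to be installed separately, but ArchiveBox needs npm when it runs.

The default version from the Debian 10 package manager is enough, and installing it pulls in node as well.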

apt update
apt install npm
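
A quick way to confirm that the install pulled in everything is to check which tools are on PATH. This is only a sketch: have_cmd is a helper name made up for this example, and python3 is included because the pip step later in this post relies on it.

```shell
# Return success if the named command can be found on PATH.
# have_cmd is a hypothetical helper, not part of ArchiveBox.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the tools this guide relies on are present.
for tool in node npm python3; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

Any tool reported as missing should be installed before moving on.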

Add the ArchiveBox package source

echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | tee /etc/apt/sources.list.d/archivebox.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
apt update

Install ArchiveBox

The official docs offer two methods; installing with pip is recommended.

apt install archivebox
# or
python3 -m pip install --upgrade --ignore-installed archivebox

An install through the package manager may fail to run; if that happens, just run the pip install command.

The package manager install fails because the Django version it provides is too old.
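
To see whether the Django version is the culprit, you can ask Django for its version and compare it against a minimum. A rough sketch, assuming a sort that supports -V (GNU coreutils does): version_ge is a made-up helper, and the REQUIRED value is a placeholder, not the actual requirement of ArchiveBox.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Uses `sort -V` (version sort); version_ge is a hypothetical helper.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder minimum, assumed for illustration only;
# look up the real requirement for your ArchiveBox release.
REQUIRED="3.1"

# Ask Django for its version; the result is empty if Django is not installed.
INSTALLED="$(python3 -m django --version 2>/dev/null || true)"

if [ -z "$INSTALLED" ]; then
    echo "Django not found for python3"
elif version_ge "$INSTALLED" "$REQUIRED"; then
    echo "Django $INSTALLED is at least $REQUIRED"
else
    echo "Django $INSTALLED is older than $REQUIRED; try the pip install"
fi
```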

Set up ArchiveBox

  1. Switch to a non-root user (all commands in this step are run as the non-root user)
    • Run
      su - [Username]
  2. Create an empty directory for ArchiveBox
    • Run
      mkdir ~/archivebox
  3. Initialize ArchiveBox
    • Run
      cd ~/archivebox
      archivebox init --setup
    • During setup you are prompted to create an administrator account for the web interface; enter a password and an email address
    • Setup finishes
  4. Start the ArchiveBox WebUI
    • Once setup has finished, start the WebUI by running
      archivebox server 0.0.0.0:[port]

      Note that ArchiveBox, running as a non-root user, is not allowed to listen on privileged ports (those below 1024)
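
The port restriction can be checked before starting the server. A minimal sketch: is_privileged_port is a helper invented here, PORT=8000 is an arbitrary example value, and the 1024 boundary is the usual Linux default for unprivileged users. The helper assumes a numeric argument.

```shell
# Return success if binding the port normally requires root
# (ports below 1024 on a default Linux setup).
# is_privileged_port is a hypothetical helper, not part of ArchiveBox.
is_privileged_port() {
    [ "$1" -lt 1024 ]
}

PORT=8000   # example value, assumed for illustration

if is_privileged_port "$PORT"; then
    echo "port $PORT needs root; pick one from 1024 upward" >&2
else
    # Print the command rather than running it, so the check has no side effects.
    echo "ok, run: archivebox server 0.0.0.0:$PORT"
fi
```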
