Recovering markdown files from generated html
source link: https://drewsh.com/getting-source-back-from-blogposts.html
Setup
I hate having to write about my blog setup (see this comic), but it's important to set the stage.
At the time of writing, this blog is statically generated using Pelican. I write everything in markdown posts with a little bit of metadata, and the generator creates concrete pages by filling that information into templates.
I use a little VPS running a small server that just serves the rendered pages.
The issue
I wasn't backing up the original markdown files! I recently moved to using an old laptop full-time, which means the device I had been composing on was left at my cousin's so they could play Fortnite on it.
I realised I could not write any more blog posts without the original markdown, because Pelican will happily nuke all older posts if you give it a content directory with just one new source file.
EDIT: This is not actually true, you can pass DELETE_OUTPUT_DIRECTORY = False to suppress this.
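In case you're wondering where that switch goes, here's a minimal sketch of a Pelican config with it set; everything except DELETE_OUTPUT_DIRECTORY is a placeholder, not my actual config:

```python
# pelicanconf.py (sketch; only DELETE_OUTPUT_DIRECTORY is the point here)
AUTHOR = "drew"                   # placeholder
PATH = "content"                  # where the markdown sources live
DELETE_OUTPUT_DIRECTORY = False   # don't wipe output/ before regenerating
```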
But I want to write!
Somehow, I decided that the only way I could move forward was to reverse-engineer my blog in order to get my source files back so that I could write new posts (like this one). Here's a little step-by-step breakdown of the script I wrote to help.
Getting the original HTML
URL=$1
wget "$URL"
Good ol' wget. Don't need to consult a manpage for this one. We aren't getting fancy with this script, so we just assume the first argument is the URL we need to download from.
Extracting name of the downloaded file from URL
# get last path component and strip extension
filename=$(basename "$URL" | cut -d . -f1)
If you are not aware, basename gives you the final component of a path separated by / (I never remember whether this is called a forward slash or a backwards slash), so "drew.idktellme.com/article.html" becomes "article.html".
The cut command in the pipeline splits the input into columns using periods as delimiters and gives the first column back. Of course, this fails for names containing more than one period (e.g. "hello.world.html" becomes "hello"), but we are just hacking away for fun and the problem space is limited, so I don't particularly care.
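As an aside, the same extraction can be done without basename or cut at all, using shell parameter expansion; a small sketch (the URL is the example from above, not a real address):

```shell
# pure-shell alternative to basename + cut
URL="drew.idktellme.com/article.html"
name="${URL##*/}"       # drop everything up to the last "/"  -> "article.html"
filename="${name%.*}"   # drop only the final ".extension"    -> "article"
echo "$filename"
```

A nice side effect: `${name%.*}` strips only the last extension, so "hello.world.html" becomes "hello.world" rather than "hello".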
Converting HTML to Markdown
pandoc -t markdown_strict "${filename}.html" > "${filename}.md"
Really the actual bulk of the work is done with a simple invocation of everyone's favourite tool for converting markup, pandoc. You don't have to be a genius to get things done, just stand on the shoulders of giants.
I'm lazy and don't want to type things that I've already typed, thank you
The date information is already part of the generated markdown, but not in the right format to be read as metadata.
# get date from the markdown
posted_date=$(grep -m 1 "Posted on" "${filename}.md" | awk '{print $3, $4, $5}')
parsed_date=$(date -jf "%B %d, %Y" "$posted_date" "+%Y-%m-%d")
The grep+awk pipeline is very specific to my blog: it looks for the first line containing "Posted on" (assumed to be of the form "Posted on Month Day, Year") and pulls the date out of it. I've hardcoded it based on columns, but you could easily use a regex or something.
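A sketch of that regex route, if you prefer it; the sample line here is made up to match my post format:

```shell
# regex-based extraction with grep -o instead of awk columns
line="Posted on January 05, 2023 in misc"
posted_date=$(printf '%s\n' "$line" | grep -oE '[A-Z][a-z]+ [0-9]{1,2}, [0-9]{4}')
echo "$posted_date"   # -> January 05, 2023
```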
The date command parses the date and converts it into the format Pelican expects: yyyy-mm-dd. Apparently if you don't provide the -j flag, the command will try to change the system date, which is devious.
Remove unnecessary information
Since the page we're working with has not only the post contents but also the navbar, the sidebar, and so on, we remove them with a little sed command, again using our knowledge of the blog's layout.
# remove lines until the date line
sed '1,/^Posted on/d' "${filename}.md" > "${filename}-temp.md"
Add the parsed date back and clean up
# add date to beginning of file
echo "Date: ${parsed_date}" | cat - "${filename}-temp.md" > "${filename}.md"
rm "${filename}-temp.md"
rm "${filename}.html"
I'm not sure if there is a way to do this without echo, because cat expects files or things like files. Here we are printing the date to stdout and using that as a "file" for cat to use. This is one of the rare cases where the cat command (short for concatenate) is actually used to concatenate files.
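If the echo bothers you, GNU sed can insert the header line in one step; a sketch with stand-in values (my actual script keeps the cat version):

```shell
# echo-free alternative: GNU sed's "1i" inserts text before line 1
parsed_date="2023-01-05"                  # stand-in for the value parsed above
printf 'Post body here\n' > demo-temp.md  # stand-in for the pandoc output
sed "1i Date: ${parsed_date}" demo-temp.md > demo.md
head -n 1 demo.md                         # -> Date: 2023-01-05
rm demo-temp.md demo.md
```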
I have learnt my lesson
I will make backups from now on, but just in case I lose any more source files, I now have a little script to speed up my recovery process. I could go a bit further and parse the archive page and run a loop over the links therein, but this much was fine for now.
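That loop might look something like this; recover.sh (standing in for the script above), the archive URL, and the href pattern are all assumptions about my own setup, so treat it as a sketch:

```shell
# pull post links out of HTML on stdin (crude, but so is the rest of this)
extract_links() {
  grep -oE 'href="[^"]+\.html"' | sed 's/^href="//; s/"$//'
}

# the real loop would be something like:
#   wget -qO- "https://drewsh.com/archives.html" | extract_links \
#     | while read -r link; do ./recover.sh "https://drewsh.com/${link}"; done

# demonstrated on an inline snippet instead of the live page:
printf '%s\n' '<a href="first-post.html">one</a>' | extract_links
```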