
Recovering markdown files from generated html

 8 months ago
source link: https://drewsh.com/getting-source-back-from-blogposts.html

Setup

I hate having to write about my blog setup (see this comic), but it's important to set the stage.

At the time of writing, this blog is statically generated using Pelican. I write everything in markdown posts with a little bit of metadata, and the generator creates concrete pages by filling in information into some templates.

I use a little VPC with a server that just serves the rendered pages.

The issue

I wasn't backing up the original markdown files! I recently moved to using an old laptop full-time, which means the device I had been composing on was left at my cousin's so they could play Fortnite on it. I realised I could not write any more blog posts without the original markdown, because Pelican will happily nuke all older posts if you give it a content directory containing just one new source file.

EDIT: This is not actually true; you can set DELETE_OUTPUT_DIRECTORY = False in your Pelican config to suppress this.

But I want to write!

Somehow, I decided that the only way I could move forward was to reverse-engineer my blog in order to get my source files back so that I could write new posts (like this one). Here's a little step-by-step breakdown of the script I wrote to help.

Getting the original HTML

URL=$1
wget "$URL"

Good ol' wget. No need to consult a manpage for this one. We aren't getting fancy with this script, so we just assume the first argument is the URL we need to download from.

Extracting name of the downloaded file from URL

# get last path component and strip extension
filename=$(basename "$URL" | cut -d . -f1)

If you are not aware, basename gives you the final component of a path separated by / (I never remember whether this is called a forward slash or a backwards slash), so "drew.idktellme.com/article.html" becomes "article.html".

The cut command in the pipeline splits the input into columns using periods as delimiters and gives the first column back. Of course, this truncates any name that itself contains a period (e.g. "hello.world.html" becomes "hello", not "hello.world"), but we are just hacking away for fun and the problem space is limited, so I don't particularly care.
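As an aside, the same extraction can be done entirely with shell parameter expansion, which avoids spawning cut at all. A minimal sketch, using a made-up URL:

```shell
URL="drew.idktellme.com/article.html"

# basename + cut, as in the script above
filename=$(basename "$URL" | cut -d . -f1)
echo "$filename"   # article

# pure-shell alternative: strip everything up to the last /,
# then everything from the first period onwards
name=${URL##*/}    # article.html
name=${name%%.*}   # article
echo "$name"
```

Note that %%.* trims from the first period, matching the cut -f1 behaviour; a single %.* would trim only the final extension.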

Converting HTML to Markdown

pandoc -t markdown_strict "${filename}.html" > "${filename}.md"

Really, the bulk of the work is done by a single invocation of everyone's favourite tool for converting markup, pandoc. You don't have to be a genius to get things done; just stand on the shoulders of giants.

I'm lazy and don't want to type things that I've already typed, thank you

The date information is already part of the generated markdown, but not in the right format to be read as metadata.

# get date from the markdown
posted_date=$(grep -m 1 "Posted on" "${filename}.md" | awk '{print $3, $4, $5}')
parsed_date=$(date -jf "%B %d, %Y" "$posted_date" "+%Y-%m-%d")

The grep+awk pipeline is very specific to my blog: it looks for the first line containing "Posted on" (assumed to be of the form "Posted on Month Day, Year") and pulls the date out of it. I've hardcoded the columns, but you could easily use a regex or something.
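For instance, a sed capture instead of hardcoded awk columns could look like this (the sample line mimics what my blog's theme produces):

```shell
# a line as it appears in the converted markdown
line="Posted on June 05, 2021"

# capture everything after "Posted on" with a regex group
posted_date=$(printf '%s\n' "$line" | sed -n 's/.*Posted on \(.*\)/\1/p')
echo "$posted_date"   # June 05, 2021
```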

The date command parses the date and converts it into the format pelican expects: yyyy-mm-dd. Apparently if you don't provide the -j flag, the command will try to change the system date, which is devious.
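Worth noting: -j and -f belong to the BSD/macOS flavour of date. GNU date, the default on most Linux distributions, has no -j; its -d flag parses a date string and likewise never touches the system clock. A rough equivalent:

```shell
# BSD/macOS version from the script above:
#   parsed_date=$(date -jf "%B %d, %Y" "June 05, 2021" "+%Y-%m-%d")

# GNU date equivalent; -d parses the string without setting the clock
parsed_date=$(date -d "June 05, 2021" "+%Y-%m-%d")
echo "$parsed_date"   # 2021-06-05
```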

Remove unnecessary information

Since the page we're working with contains not only the post contents but also the navbar, the sidebar, and so on, we strip everything up to and including the date line with a little sed command, once again relying on our knowledge of the blog's layout.

# remove lines until the date line
sed '1,/^Posted on/d' "${filename}.md" > "${filename}-temp.md"

Add the parsed date back and clean up

# add date to beginning of file
echo "Date: ${parsed_date}" | cat - "${filename}-temp.md" > "${filename}.md"

rm "${filename}-temp.md"
rm "${filename}.html"

I'm not sure if there is a way to do this without echo, because cat expects files or things that behave like files. Here we print the date to stdout, and the - argument tells cat to read stdin as if it were a file. This is one of the rare cases where the cat command (short for concatenate) is actually used to concatenate files.
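For completeness, the same prepend also works without the pipe, by grouping two commands under a single redirection. A small sketch with stand-in file names:

```shell
# a stand-in body file
printf 'body line\n' > temp.md

# both writes go to post.md through one redirection
{
  echo "Date: 2021-06-05"
  cat temp.md
} > post.md

cat post.md
rm temp.md post.md
```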

I have learnt my lesson

I will make backups from now on, but just in case I lose any more source files, I now have a little script to speed up my recovery process. I could go a bit further and parse the archive page and run a loop over the links therein, but this much was fine for now.
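A rough sketch of what that loop might look like, assuming the archive page links each post with a plain href ending in .html. The file contents below are made up so the sketch is self-contained; the real version would first fetch the archive page with wget and call the recovery script instead of echo:

```shell
# stand-in for a downloaded archive page
cat > archive.html <<'EOF'
<a href="post-one.html">Post one</a>
<a href="post-two.html">Post two</a>
EOF

# crude href extraction; fine for a known, simple page
grep -o 'href="[^"]*\.html"' archive.html \
  | sed 's/href="//; s/"$//' \
  | while read -r link; do
      # here you would run the recovery script on each link
      echo "would recover: $link"
    done

rm archive.html
```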

