Recovering markdown files from generated html
source link: https://drewsh.com/getting-source-back-from-blogposts.html
Setup
I hate having to write about my blog setup (see this comic), but it's important to set the stage.
At the time of writing, this blog is statically generated using Pelican. I write everything in markdown posts with a little bit of metadata, and the generator creates concrete pages by filling that information into templates.
I use a little VPS running a small server that just serves the rendered pages.
The issue
I wasn't backing up the original markdown files! I recently moved to using an old laptop full-time, which means the device I had been composing on was left at my cousin's so they could play Fortnite on it.
I realised I could not write any more blog posts without the original markdown, because Pelican will happily nuke all older posts if you give it a content directory with just one new source file.
EDIT: This is not actually true, you can pass DELETE_OUTPUT_DIRECTORY = False to suppress this.
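In case you're wondering where that switch goes, here's a minimal sketch of a Pelican config with it set; everything except DELETE_OUTPUT_DIRECTORY is a placeholder, not my actual config:

```python
# pelicanconf.py (sketch; only DELETE_OUTPUT_DIRECTORY is the point here)
AUTHOR = "drew"                   # placeholder
PATH = "content"                  # where the markdown sources live
DELETE_OUTPUT_DIRECTORY = False   # don't wipe output/ before regenerating
```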
But I want to write!
Somehow, I decided that the only way I could move forward was to reverse-engineer my blog in order to get my source files back so that I could write new posts (like this one). Here's a little step-by-step breakdown of the script I wrote to help.
Getting the original HTML
URL=$1
wget "$URL"
Good ol' wget. Don't need to consult a manpage for this one. We aren't getting fancy with this script, so we just assume the first argument is the URL we need to download from.
Extracting name of the downloaded file from URL
# get last path component and strip extension
filename=$(basename "$URL" | cut -d . -f1)
If you are not aware, basename gives you the final component of a path separated by / (I never remember whether this is called a forward slash or a backwards slash), so "drew.idktellme.com/article.html" becomes "article.html".
The cut command in the pipeline splits the input into columns using periods as delimiters and gives the first column back. Of course, this fails for names containing more than one period (e.g. "hello.world.html" becomes "hello"), but we are just hacking away for fun and the problem space is limited, so I don't particularly care.
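As an aside, the same extraction can be done without basename or cut at all, using shell parameter expansion; a small sketch (the URL is the example from above, not a real address):

```shell
# pure-shell alternative to basename + cut
URL="drew.idktellme.com/article.html"
name="${URL##*/}"       # drop everything up to the last "/"  -> "article.html"
filename="${name%.*}"   # drop only the final ".extension"    -> "article"
echo "$filename"
```

A nice side effect: `${name%.*}` strips only the last extension, so "hello.world.html" becomes "hello.world" rather than "hello".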
Converting HTML to Markdown
pandoc -t markdown_strict "${filename}.html" > "${filename}.md"
Really the actual bulk of the work is done with a simple invocation of everyone's favourite tool for converting markup, pandoc. You don't have to be a genius to get things done, just stand on the shoulders of giants.
I'm lazy and don't want to type things that I've already typed, thank you
The date information is already part of the generated markdown, but not in the right format to be read as metadata.
# get date from the markdown
posted_date=$(grep -m 1 "Posted on" "${filename}.md" | awk '{print $3, $4, $5}')
parsed_date=$(date -jf "%B %d, %Y" "$posted_date" "+%Y-%m-%d")
The grep+awk pipeline is very specific to my blog: it looks for the first line containing "Posted on" (assumed to be of the form "Posted on Month Day, Year") and pulls the date out of it. I've hardcoded it based on columns, but you could easily use a regex or something.
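A sketch of that regex route, if you prefer it; the sample line here is made up to match my post format:

```shell
# regex-based extraction with grep -o instead of awk columns
line="Posted on January 05, 2023 in misc"
posted_date=$(printf '%s\n' "$line" | grep -oE '[A-Z][a-z]+ [0-9]{1,2}, [0-9]{4}')
echo "$posted_date"   # -> January 05, 2023
```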
The date command parses the date and converts it into the format Pelican expects: yyyy-mm-dd. Apparently if you don't provide the -j flag, the command will try to change the system date, which is devious.
Remove unnecessary information
Since the page we're working with has not only the post contents but also the navbar, the sidebar, and so on, we remove them with a little sed command, again using our knowledge of the blog's layout.
# remove lines until the date line
sed '1,/^Posted on/d' "${filename}.md" > "${filename}-temp.md"
Add the parsed date back and clean up
# add date to beginning of file
echo "Date: ${parsed_date}" | cat - "${filename}-temp.md" > "${filename}.md"
rm "${filename}-temp.md"
rm "${filename}.html"
I'm not sure if there is a way to do this without echo, because cat expects files or things like files. Here we are printing the date to stdout and using that as a "file" for cat to use. This is one of the rare cases where the cat command (short for concatenate) is actually used to concatenate files.
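If the echo bothers you, GNU sed can insert the header line in one step; a sketch with stand-in values (my actual script keeps the cat version):

```shell
# echo-free alternative: GNU sed's "1i" inserts text before line 1
parsed_date="2023-01-05"                  # stand-in for the value parsed above
printf 'Post body here\n' > demo-temp.md  # stand-in for the pandoc output
sed "1i Date: ${parsed_date}" demo-temp.md > demo.md
head -n 1 demo.md                         # -> Date: 2023-01-05
rm demo-temp.md demo.md
```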
I have learnt my lesson
I will make backups from now on, but just in case I lose any more source files, I now have a little script to speed up my recovery process. I could go a bit further and parse the archive page and run a loop over the links therein, but this much was fine for now.
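That loop might look something like this; recover.sh (standing in for the script above), the archive URL, and the href pattern are all assumptions about my own setup, so treat it as a sketch:

```shell
# pull post links out of HTML on stdin (crude, but so is the rest of this)
extract_links() {
  grep -oE 'href="[^"]+\.html"' | sed 's/^href="//; s/"$//'
}

# the real loop would be something like:
#   wget -qO- "https://drewsh.com/archives.html" | extract_links \
#     | while read -r link; do ./recover.sh "https://drewsh.com/${link}"; done

# demonstrated on an inline snippet instead of the live page:
printf '%s\n' '<a href="first-post.html">one</a>' | extract_links
```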