Extracting a TOC from Markup

February 5, 2017 · 2 min · Benjamin Bengfort

In today’s edition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from Markdown or AsciiDoc files:

#!/usr/bin/env python3
# Prints out the TOC from headings in a markup file

import os
import re
import argparse


HDRS = {
    'markdown': re.compile(r'^(#+)\s*(.*)\s*$'),
    'asciidoc': re.compile(r'^(=+)\s*(.*)\s*$'),
}


def guess_type(path):
    _, ext = os.path.splitext(path)
    if ext.lower() in {'.md', '.mdown', '.markdown', '.mkdn', '.rmd'}:
        return 'markdown'

    if ext.lower() in {'.asciidoc', '.adoc'}:
        return 'asciidoc'

    raise ValueError('Unknown extension type "{}"'.format(ext))


def extract_toc(path, indent=2, dtype=None):
    # Get the header regular expression from the markup format
    dtype = dtype or guess_type(path)
    if dtype not in HDRS:
        raise ValueError("{} is not a valid document type".format(dtype))

    hdr = HDRS[dtype]

    # Open up the file for reading
    with open(path, 'r') as f:
        for line in f:
            match = hdr.match(line)
            if match is not None:
                level, text = match.groups()
                tab = ((len(level) - 1) * indent) * " "
                print("{}- {}".format(tab, text))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Prints out the TOC from headings in a markdown file"
    )

    parser.add_argument(
        'path', metavar='src.md', type=str,
        help='path to markdown files to extract TOC from'
    )
    parser.add_argument(
        '-i', '--indent', type=int, metavar='S', default=4,
        help='number of spaces to indent levels by'
    )
    parser.add_argument(
        '-m', '--markup', type=str, choices=['markdown', 'asciidoc'],
        default=None, help='specify the markup type to use'
    )

    args = parser.parse_args()
    extract_toc(args.path, indent=args.indent, dtype=args.markup)

This is pretty simple: regular expressions look for lines that start with one or more "#" or "=" (for Markdown and AsciiDoc, respectively), and each match is printed with an indent according to its depth (e.g. a "## heading 2" line is indented one block). Because the script reads the file from top to bottom, you get a quick view of the document structure without building a nested data structure under the hood. I’ve also implemented some simple type detection that uses common file extensions to decide which regex to use.
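To make the depth-to-indent mapping concrete, here is a minimal sketch that applies the same Markdown pattern to a few lines held in memory. The helper name `toc_entry` and the sample lines are hypothetical, introduced only for illustration; the helper returns the formatted entry instead of printing it so that it is easy to inspect.

```python
import re

# Same Markdown heading pattern as in the script above
MD_HEADING = re.compile(r'^(#+)\s*(.*)\s*$')


def toc_entry(line, indent=4):
    """Return the TOC bullet for a heading line, or None for other lines.

    Hypothetical helper for illustration; not part of the original script.
    """
    match = MD_HEADING.match(line)
    if match is None:
        return None
    level, text = match.groups()
    # A level-1 heading gets no indent; each deeper level adds `indent` spaces
    return " " * ((len(level) - 1) * indent) + "- " + text


sample = ["# Title", "Some prose.", "## Section", "### Subsection"]
entries = [toc_entry(line) for line in sample]
for entry in entries:
    if entry is not None:
        print(entry)
```

With the default indent of 4, the level-2 heading is indented by four spaces and the level-3 heading by eight, while the prose line is skipped.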

The result is a quick view of the structure of a markup file, which is especially handy when files get overly large. From the Markdown of one of my longer blog posts:

- A Practical Guide to Anonymizing Datasets with Python
    - Anonymizing CSV Data
        - Generating Fake Data
        - Creating A Provider
    - Maintaining Data Quality
        - Domain Distribution
        - Realistic Profiles
        - Fuzzing Fake Names from Duplicates
    - Conclusion
        - Acknowledgments
        - Footnotes

And from the first chapter of Applied Text Analysis with Python:

- Language and Computation
        -
        -
    - What is Language?
        - Identifying the Basic Units of Language
        - Formal vs. Natural Languages
            - Formal Languages
            - Natural Languages
    - Language Models
        - Language Features
        - Contextual Features
        - Structural Features
        - The Academic State of the Art
    - Tools for Natural Language Processing
    - Language Aware Data Products
    - Conclusion

Ok, so clearly there are some bugs. Those two blank "-" bullet points come from a note callout, which has the form:

[NOTE]
====
Insert note text here.
====

The script misidentifies the first and second ==== as level 4 headings. I tried a couple of regular expression fixes for this, but couldn’t quite get it right. The next step is to add a simple loop over multiple paths so that I can print out the table of contents for an entire directory (e.g. to get the TOC for the entire book, where one chapter == one file).
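Both follow-ups can be sketched briefly. The code below is one possible approach, not a tested fix from the original script. The stricter pattern `ADOC_HEADING` assumes that an AsciiDoc heading always has whitespace and some title text after the "=" markers, so a bare ==== delimiter no longer matches. The helpers `toc_lines` and `toc_for_paths` are hypothetical names; they collect entries from several files in the order the paths are given.

```python
import re

# Assumption: an AsciiDoc heading has whitespace and title text after the
# "=" markers, so a bare "====" block delimiter is not treated as a heading.
ADOC_HEADING = re.compile(r'^(=+)\s+(\S.*?)\s*$')


def toc_lines(lines, pattern=ADOC_HEADING, indent=4):
    """Yield one indented TOC bullet per heading found in `lines`."""
    for line in lines:
        match = pattern.match(line)
        if match is not None:
            level, text = match.groups()
            yield " " * ((len(level) - 1) * indent) + "- " + text


def toc_for_paths(paths, pattern=ADOC_HEADING, indent=4):
    """Yield TOC bullets for several files, in the order the paths are given."""
    for path in paths:
        with open(path, 'r') as f:
            yield from toc_lines(f, pattern, indent)


# The note callout from above, surrounded by two real headings
note = ["= Chapter", "[NOTE]", "====", "Insert note text here.", "====", "== Section"]
print("\n".join(toc_lines(note)))
```

For a book with one chapter per file, something like `toc_for_paths(sorted(glob.glob('chapters/*.adoc')))` would walk the chapters in filename order; the directory name here is made up, and `glob` would need to be imported.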
