Extracting a TOC from Markup

source link: https://bbengfort.github.io/2017/02/extract-toc/

February 5, 2017 · 2 min · Benjamin Bengfort

In today’s edition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from Markdown or AsciiDoc files:

#!/usr/bin/env python3
# Prints out the TOC from headings in a markup file

import os
import re
import argparse

# One heading pattern per markup type: a run of marker characters, then the text
HDRS = {
    'markdown': re.compile(r'^(#+)\s*(.*)\s*$'),
    'asciidoc': re.compile(r'^(=+)\s*(.*)\s*$')
}


def guess_type(path):
    # Guess the markup type from the file extension
    _, ext = os.path.splitext(path)
    if ext.lower() in {'.md', '.mdown', '.markdown', '.mkdn', '.rmd'}:
        return 'markdown'

    if ext.lower() in {'.asciidoc', '.adoc'}:
        return 'asciidoc'

    raise ValueError('Unknown extension type "{}"'.format(ext))


def extract_toc(path, indent=2, dtype=None):
    # Get the header regular expression from the markup format
    dtype = dtype or guess_type(path)
    if dtype not in HDRS:
        raise ValueError("{} is not a valid document type".format(dtype))

    hdr = HDRS[dtype]

    # Open up the file for reading
    with open(path, 'r') as f:
        for line in f:
            match = hdr.match(line)
            if match is not None:
                level, text = match.groups()
                tab = ((len(level) - 1) * indent) * " "
                print("{}- {}".format(tab, text))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Prints out the TOC from headings in a markdown file"
    )

    parser.add_argument(
        'path', metavar='src.md', type=str,
        help='path to markdown files to extract TOC from'
    )
    parser.add_argument(
        '-i', '--indent', type=int, metavar='S', default=4,
        help='number of spaces to indent levels by'
    )
    parser.add_argument(
        '-m', '--markup', type=str, choices=['markdown', 'asciidoc'],
        default=None, help='specify the markup type to use'
    )

    args = parser.parse_args()
    extract_toc(args.path, indent=args.indent, dtype=args.markup)

The approach is pretty simple: use regular expressions to look for lines that start with one or more "#" or "=" (for Markdown and AsciiDoc, respectively) and print them out with an indent according to their depth (e.g. a "## heading 2" line is indented by one block). Because the script reads the file from top to bottom and prints as it goes, you get a quick view of the document structure without creating a nested data structure under the hood. I’ve also implemented some simple type detection that uses common file extensions to decide which regex to use.
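To make the depth-to-indent mapping concrete, here is a minimal sketch that applies the same two regular expressions to a few lines held in memory instead of a file. The `toc_entry` helper and the `sample` lines are names introduced purely for illustration; they are not part of the script.

```python
import re

# Same patterns as the script: a run of marker characters, then the heading text.
HDRS = {
    'markdown': re.compile(r'^(#+)\s*(.*)\s*$'),
    'asciidoc': re.compile(r'^(=+)\s*(.*)\s*$'),
}


def toc_entry(line, dtype='markdown', indent=4):
    """Return the TOC bullet for a heading line, or None for any other line.

    Hypothetical helper, for illustration only.
    """
    match = HDRS[dtype].match(line)
    if match is None:
        return None
    level, text = match.groups()
    # Level 1 gets no indent, level 2 gets one block, level 3 two blocks, ...
    return "{}- {}".format(" " * ((len(level) - 1) * indent), text)


sample = [
    "# Title\n",
    "Some body text.\n",
    "## Section\n",
    "### Subsection\n",
]

for line in sample:
    entry = toc_entry(line)
    if entry is not None:
        print(entry)
```

With the default indent of 4, the three headings print at 0, 4 and 8 spaces, and the body line is skipped because it does not start with a marker.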

The result is a quick view of the structure of a markup file, which is especially handy when the file gets very large. From the Markdown of one of my longer blog posts:

- A Practical Guide to Anonymizing Datasets with Python
    - Anonymizing CSV Data
        - Generating Fake Data
        - Creating A Provider
    - Maintaining Data Quality
        - Domain Distribution
        - Realistic Profiles
        - Fuzzing Fake Names from Duplicates
    - Conclusion
        - Acknowledgments
        - Footnotes

And from the first chapter of Applied Text Analysis with Python:

- Language and Computation
        -
        -
    - What is Language?
        - Identifying the Basic Units of Language
        - Formal vs. Natural Languages
            - Formal Languages
            - Natural Languages
    - Language Models
        - Language Features
        - Contextual Features
        - Structural Features
        - The Academic State of the Art
    - Tools for Natural Language Processing
    - Language Aware Data Products
    - Conclusion

OK, so clearly there are some bugs: those two blank "-" bullet points come from a note callout, which has the form:

[NOTE]
====
Insert note text here.
====

The script therefore misidentifies the opening and closing ==== delimiters as level 4 headings. I tried a couple of regular expression fixes for this, but couldn’t quite get it right. The next step is to add a simple loop to handle multiple paths so that I can print out the table of contents for an entire directory (e.g. to get the TOC for the entire book where one chapter == one file).
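One possible direction, offered as a sketch rather than a tested fix for the book's sources: require at least one space and some non-space text after the run of "=" characters, so that a bare ==== delimiter no longer matches. The names `STRICT_ASCIIDOC`, `toc_lines` and `toc_for_paths` are hypothetical and introduced only for this example; `toc_for_paths` illustrates the multi-path loop described above. The trade-off is that a heading written with no space after the marker (e.g. "==Title") would now be skipped.

```python
import re

# Assumed stricter pattern: marker run, at least one space, then non-empty text.
STRICT_ASCIIDOC = re.compile(r'^(=+)\s+(\S.*?)\s*$')


def toc_lines(lines, indent=4, hdr=STRICT_ASCIIDOC):
    """Yield TOC bullets for an iterable of lines (hypothetical helper)."""
    for line in lines:
        match = hdr.match(line)
        if match is not None:
            level, text = match.groups()
            yield "{}- {}".format(" " * ((len(level) - 1) * indent), text)


def toc_for_paths(paths, indent=4):
    """Print the TOC of several files in order, e.g. one file per chapter."""
    for path in paths:
        with open(path, 'r') as f:
            for entry in toc_lines(f, indent=indent):
                print(entry)


sample = [
    "= Language and Computation\n",
    "[NOTE]\n",
    "====\n",
    "Insert note text here.\n",
    "====\n",
    "== What is Language?\n",
]

for entry in toc_lines(sample):
    print(entry)
```

On the sample lines this prints only the two real headings; both ==== delimiters are ignored because nothing follows the marker run.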

