pre-commit and Pelican | Gaige's Pages

Putting pre-commit to use

In a previous post, I mentioned pre-commit, a tool for maintaining code consistency through simple management of pre-commit checks.

The first place I decided to give this a whirl was on my blog sites. As you may be aware, I moved my blog sites (both Gaige's Pages and The Cartographica Blog) to static sites some time back.

Pelican Markdown files begin with a preamble that is set apart from the content by a blank line: a set of colon-delimited key-value pairs that are rudimentarily parsed and then passed to the interpreter. It looks like this:

Title: My Blog Post
Date: 2021-03-28 07:48

# Some bloggy stuff

Content text is here... Oh, see my [previous post]({filename}previous-post.md)

In addition to the formatter, there are also some replacement items that can be used to reference generated data. For example: {filename} indicates that the path to the stored file should be substituted.

I had noticed there was a Markdown plugin for pre-commit using mdformat, and so I figured I'd give that a try. Initial results were good. It provided a lot of clean-up for free. On the downside: it also quoted all of the {filename} and similar references, such that they would no longer work as references. And, it also eliminated my footnotes.

My initial .pre-commit-config.yaml looked like this:

# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v3.2.0
  hooks:
  -   id: trailing-whitespace
      exclude: ^.*\.md$
  -   id: end-of-file-fixer
  -   id: check-yaml
  -   id: check-added-large-files
  -   id: check-json
- repo: https://github.com/executablebooks/mdformat
  rev: 0.5.7
  hooks:
  - id: mdformat
    # optional
    args:
    - '--number'
    additional_dependencies:
    - mdformat-tables

Note here a couple of items:

I have excluded ^.*\.md$ from trailing-whitespace. This was specifically to deal with the fact that I use two spaces at the end of a line to force a line break in some places. This is one of a few ways of forcing a line break, but it was required for use with the python-markdown module that Pelican uses by default.
I have added the mdformat plugin with a number of options and dependencies:
--number as an argument to mdformat forces it to number ordered list items consecutively. I prefer that for readability.
mdformat-tables adds table handling to mdformat. By default, mdformat uses a strict version of Markdown called CommonMark, so any extensions must be enabled with intention.

mdformat plugins

With things mostly working, I looked at the mdformat documentation to see if I could make changes to the way it operated. Fortunately, there was a plug-in architecture that allowed for the modification of both parsing and output behavior.

Footnotes

Although the underlying Markdown parser used by mdformat (markdown-it-py, based on the JavaScript markdown-it) supports footnotes, that support wasn't built into the mdformat code. So I decided to take a look at mdformat-tables and see if I could do something similar for footnotes, since the code for both tables and footnotes is included in the underlying package as options.

The result is the mdformat_footnote plugin, which uses the existing parser (the hard part) and formats the footnotes appropriately.

This plugin can be installed using pip install mdformat_footnote or by adding mdformat_footnote to the list of items in the additional_dependencies list in the .pre-commit-config.yaml file.

Pelican-specific items

In this initial case, all I needed to do was change the output so that it didn't replace the {} characters inside of links. The code was straightforward, and after some playing around, I created the mdformat_pelican plugin for use with mdformat and pelican.

You can look at the code above, or install it with pip install mdformat_pelican to get the latest version from pypi.org.

Implementing the initial code was straightforward. Effectively, the code hijacks the render_token function and modifies the token.attrs just before they're rendered, correcting any erroneously-quoted URLs.
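The URL fix-up itself can be illustrated outside of mdformat. The sketch below is not the plugin's code: it assumes the quoting in question is URL percent-encoding (so {filename} becomes %7Bfilename%7D), and the function name and the list of placeholders are illustrative assumptions.

```python
import re

# Assumption: the formatter percent-encodes braces in link targets.
# The placeholder names listed here are an illustrative set, not an
# authoritative list taken from the plugin.
_QUOTED_PLACEHOLDER = re.compile(
    r"%7B(filename|static|attach|category|tag|author|index)%7D",
    re.IGNORECASE,
)


def restore_placeholders(url: str) -> str:
    """Turn percent-encoded Pelican placeholders back into {name} form."""
    return _QUOTED_PLACEHOLDER.sub(lambda m: "{" + m.group(1) + "}", url)


fixed = restore_placeholders("%7Bfilename%7Dprevious-post.md")
```

Matching only known placeholder names leaves any other percent-encoded braces in a URL untouched.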

The plugin worked well across nearly all of my files, except for a couple that had square brackets in their metadata fields. For example, a post about Queen guitarist Brian May receiving his doctorate in Astrophysics had this front matter:

Date: 2007-08-03 07:26
Alias: /node/4836,/article.php?story=20070803092627654
Tags:
Category: general news
Title: [He's] a killer... astrophysicist?

which mdformat dutifully turned into \[He's\] a killer... astrophysicist?, which pelican didn't know how to interpret, so the backslashes ended up in my content pages...not desired.

Since I already had a Pelican plugin for mdformat, I decided to make it a bit more pelican-y by marking the front matter as off-limits. This was a little trickier, but had good results. As you can see in the plugin source, handling the front matter required adding a new block rule to parse it, plus code in render_token to render it back out unchanged.

Since the format is very rigid (basically, collect everything until you reach the first blank line), it was easy to implement.
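To make the "everything up to the first blank line" rule concrete, here is a minimal standalone sketch. It is not the plugin's implementation (which works as an mdformat block rule); the function name is hypothetical, and lowercasing the keys is my own choice for the example.

```python
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split Pelican-style front matter from the body at the first blank line."""
    head, _, body = text.partition("\n\n")
    metadata: Dict[str, str] = {}
    for line in head.splitlines():
        # Split on the first colon only, so values such as "07:26" survive.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip().lower()] = value.strip()
    return metadata, body


sample = (
    "Date: 2007-08-03 07:26\n"
    "Tags:\n"
    "Title: [He's] a killer... astrophysicist?\n"
    "\n"
    "Body text starts here.\n"
)
meta, body = split_front_matter(sample)
```

Because the header lines are split off before any Markdown processing, the square brackets in the title never reach the formatter's escaping logic.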

So my current working .pre-commit-config.yaml looks like this:

# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v3.2.0
  hooks:
  -   id: trailing-whitespace
      exclude: ^.*\.md$
  -   id: end-of-file-fixer
  -   id: check-yaml
  -   id: check-added-large-files
  -   id: check-json
- repo: https://github.com/executablebooks/mdformat
  rev: 0.5.7
  hooks:
  - id: mdformat
    # optional
    args:
    - '--number'
    additional_dependencies:
    - mdformat-tables
    - mdformat-black
    - mdformat_footnote
    - mdformat_pelican
exclude: |
    (?x)(
        ^output/|
        ^themes/|
        ^venv/|
        ^content/NewZealand/
    )

This adds my new plugins (both the mdformat_footnote and the mdformat_pelican) and also adds an exclusion for some files in my pre-commit hooks. The ones that aren't actually committed (output, venv) wouldn't be included, but I have a set of badly-formatted HTML files in content/NewZealand that I don't want to fix yet.

This turned out well, but there were a couple of things that the parser in Pelican and the parser in mdformat could not agree on, in particular the indentation requirements for ordered-list items that contain multiple paragraphs.

In the end, that would lead me to write a new plugin for Pelican to replace the Markdown parser.

Markdown parser plugin for Pelican

The plugin architecture for mdformat is pretty good, but the one for Pelican is very mature and well thought-out. I've created plugins for Pelican before, notably the Nginx alias maps plugin.

Also, there already existed plugins to replace the Markdown reader in Pelican. As such the lift was pretty light:

Get a base plugin working
Parse the metadata (simple : split of each line before the first blank line)
Load the MarkdownIt package and configure with a few settings (tables, footnotes, and definition lists)
Add hooks to rewrite the \{filename\} items back to {filename}
Finally, add a new fence formatter, to use Pygments to format code

The code is available on GitHub in the markdown-it-reader repository and can be installed using pip install pelican-markdown-it-reader.

This plugin must be enabled on your site by adding it to the list of PLUGINS in your Pelican settings file (pelicanconf.py by default).