Line Oriented Hyperdocuments in Haskell

I had a fascinating conversation with @johnnulls recently about creating a lightweight markup format for hyperdocuments, extending Subtext to add various affordances for metadata like key-value pairs and tags.

But I’m getting ahead of myself.

What is Subtext?

I’ll let Gordon explain:

Subtext is a text-based, line-oriented hypertext format, designed for note-taking.

Subtext markup is made up of ordinary lines of text, which are interpreted as a list of blocks. Lines that are prefixed with magic “sigil” characters are treated as special blocks. Lines without sigils are treated as text blocks. Empty lines are ignored.

This is a really interesting approach!

I’ve played around and tried to create new hypertext tools before, but nothing I’ve written has really gripped me.

Most of them have been built around some variant of a content-addressable store, so that we have persistent identifiers to prevent link rot (typically with addresses derived by hashing the content), but I could never get any of them working in a way that felt ergonomic to write and read. I think I’ll return to those ideas one day.

Subtext is a lot simpler, mostly because it’s not the same thing at all.

Rather than specifying the full system (with stores, servers, and whatnot), Subtext offers a markup language for a linear hypertext format with very easy parsing based on looking for the magic “sigil” characters.

Let’s write enough basic functionality to be able to say I’ve got some sort of minimum viable product for working with Subtext:

  1. A parser, so that we can parse Subtext documents into a structured datatype (that we can later work with in interesting ways)
  2. An unparser, so that we can serialize that datatype back into text
  3. An HTML renderer, since a full native client is going to take a lot of effort; instead we can whip up a quick and dirty HTML renderer that converts a corpus of Subtext documents into a collection of HTML pages we can serve with something like nginx

I’ll call this Haskell library Subtextual.

Parsing Subtext

I’ll take an example snippet of Subtext from Gordon’s guide:

# Heading

Plain text.

- List item
- List item

> Quoted text

URLs like https://example.com are automatically linked.

You can also link to local pages using short /slashlinks.

I’ll use attoparsec to build the parsing functionality, since I’ve used it before and I enjoy working with it.

Since Subtext does some sub-line parsing, I’ll start with an Inline type which will store that sub-line data. There are 4 different kinds of text we can see in a line:

  1. Plain text
  2. Bare URLs, recognised by the prefix “http://” or “https://”
  3. URLs delimited by angle-brackets
  4. Slashlinks

We then construct Blocks (which correspond to lines of text in a Subtext file) out of Inlines for the more complicated kinds of lines, or plain Text for headings (since we don’t want to parse for URLs in headings).

Blank represents a blank line. We’ll keep it in the Block datatype since we don’t want to throw away that information about the source Subtext file when parsing, although we’ll probably never do much with it.

I’ll also add Document as a type synonym for lists of Blocks.

That gives us a nice core of data types:

/src/Subtextual/Core.hs

module Subtextual.Core
    ( Inline(..), Block(..), Document
    ) where

import qualified Data.Text as T

data Inline =
    PlainText T.Text
    | BareUrl T.Text
    | AngledUrl T.Text
    | SlashLink T.Text
    deriving (Show, Eq)

data Block =
    Paragraph [Inline]
    | Heading T.Text
    | Bullet [Inline]
    | Quote [Inline]
    | Blank
    deriving (Show, Eq)

type Document = [Block]
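
To make these types concrete, here’s the example snippet from earlier written out by hand as a Document value. This is just a sketch of the shape we’re aiming for, not part of the library; how blank lines turn into Blanks is covered in the parsing section below.

example :: Document
example =
    [ Heading (T.pack "Heading")
    , Blank
    , Paragraph [PlainText (T.pack "Plain text.")]
    , Blank
    , Bullet [PlainText (T.pack "List item")]
    , Bullet [PlainText (T.pack "List item")]
    , Blank
    , Quote [PlainText (T.pack "Quoted text")]
    , Blank
    , Paragraph
        [ PlainText (T.pack "URLs like ")
        , BareUrl (T.pack "https://example.com")
        , PlainText (T.pack " are automatically linked.")
        ]
    ]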

Now we’re ready to parse some text. We’ll start with some imports:

/src/Subtextual/Parser.hs

module Subtextual.Parser
    (nonBlankBlock, document) where

import Control.Applicative
import Control.Monad
import Data.Char
import Data.Functor
import Data.Attoparsec.Text
import Data.Attoparsec.Combinator
import qualified Data.Text as T

import Subtextual.Core

With that out of the way, we can work on parsing.

Attoparsec is a parser combinator library, which means we build up complex parsers out of simple ones. So let’s start with those building blocks.

Since we want to start looking for URLs and slashlinks at word boundaries, we’ll need to parse in whitespace and non-whitespace characters separately. Every time we see a new word, we’ll first try to parse it with our more structured parsers for bare URLs, angle-delimited URLs, and slashlinks. If those fail, we’ll fall back to parsing them as plain text.

So we write the two text parsers:

  1. whitespace to parse spaces and tabs
  2. word to parse non-space characters

/src/Subtextual/Parser.hs

------------------------------------------------------------
--                      Text Parsing                      --
------------------------------------------------------------

whitespace :: Parser T.Text
whitespace = takeWhile1 isHorizontalSpace <?> "whitespace"

word :: Parser T.Text
word = takeWhile1 (not . isSpace) <?> "word"

After that we’ll want to work on parsing Inlines.

PlainText should get parsed whenever we have whitespace, or a string of characters that didn’t get parsed as a URL or slashlink:

/src/Subtextual/Parser.hs

------------------------------------------------------------
--                     Inline Parsing                     --
------------------------------------------------------------

plainText :: Parser Inline
plainText = PlainText <$> (word <|> whitespace) <?> "plainText"

BareUrl was an interesting challenge. According to the spec:

Subtext parsers MUST implement automatic linking for certain URLs that are not in brackets.

We could just keep parsing characters into a BareUrl until we hit whitespace, but there’s a problem with that approach—we want to be able to intelligently ignore punctuation like periods, semicolons, or commas at the end of URLs.

If we parse in e.g. Here's a link: https://google.com., we want the parsed link to be “https://google.com”, with the final period being parsed as plaintext.

To solve this issue, we do lookahead on the next bit of the input to see if we’ve reached the end of the URL, which we define as either:

  1. A punctuation boundary (a period, semicolon, or comma followed by either whitespace characters or the end of the line)
  2. A whitespace character (as parsed by space), which includes spaces, tabs, and newlines
  3. Or the end of input

/src/Subtextual/Parser.hs

string' :: String -> Parser T.Text
string' = string . T.pack

bareUrl :: Parser Inline
bareUrl = do
    schema <- string' "https://" <|> string' "http://"
    body <- manyTill anyChar $ lookAhead endOfUrl
    let url = schema <> T.pack body
    return $ BareUrl url
    <?> "bareUrl"

    where
        endOfUrl :: Parser ()
        endOfUrl =
            punctuationBoundary
            <|> space $> ()
            <|> endOfInput

        punctuationBoundary :: Parser ()
        punctuationBoundary = do
            _ <- char '.' <|> char ';' <|> char ','
            skip isSpace <|> endOfLine
            return ()
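
As a quick sanity check of that punctuation handling, here’s a hypothetical GHCi session, assuming the module’s internal parsers are in scope along with parseOnly from Data.Attoparsec.Text (note the trailing space after the period):

λ> parseOnly bareUrl (T.pack "https://google.com. ")
Right (BareUrl "https://google.com")

The trailing period is left unconsumed, ready to be picked up later as plain text.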

Parsing AngledUrls is easier, since all we have to do is look for the angle-bracket-delimited text. From the spec:

URLs are wrapped in angle brackets, and can appear anywhere within a text, link, or quote block

/src/Subtextual/Parser.hs

isAngledUrlChar :: Char -> Bool
isAngledUrlChar c = not $ c == '<' || c == '>' || isSpace c

angledUrl :: Parser Inline
angledUrl = do
    _ <- string' "<"
    url <- takeWhile1 isAngledUrlChar
    _ <- string' ">"
    return $ AngledUrl url
    <?> "angledUrl"

Slashlinks are also easy since we don’t need lookahead—the spec tells us that:

Generally, a slashlink is a / followed by any number of alphanumeric characters, dashes -, underscores _.

And so parsing a slashlink is as easy as looking for the initial forward-slash / character and then parsing in any alphanumeric characters, dashes, underscores, and slashes.

/src/Subtextual/Parser.hs

isSlashLinkChar :: Char -> Bool
isSlashLinkChar c =
    isAlpha c
    || isDigit c
    || c == '-'
    || c == '_'
    || c == '/'

slashLink :: Parser Inline
slashLink = do
    _ <- char '/'
    link <- takeWhile1 isSlashLinkChar
    return $ SlashLink link
    <?> "slashLink"
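
For example, in a hypothetical GHCi session:

λ> parseOnly slashLink (T.pack "/foo/bar-baz and more")
Right (SlashLink "foo/bar-baz")

Note that we store the link without its leading slash; the unparser will add it back later.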

Now that we’ve defined each of the individual Inline parsers, we stitch them together into one big Inline parser using the Alternative operator <|>, which tries each parser in order (backtracking on failure) and returns the first successful parse.

There’s only one gotcha here, which is that plainText needs to be the last parser in the list, since it’s indiscriminate about what text it’ll parse and it’ll happily consume the rest of the input until the end of the line. By being careful about the order, we give our more structured (and more discriminating) parsers—bareUrl, angledUrl, and slashLink—the first crack at parsing the text.

/src/Subtextual/Parser.hs

inline :: Parser Inline
inline =
    bareUrl
    <|> angledUrl
    <|> slashLink
    <|> plainText
    <?> "inline"

Of course, our Blocks aren’t built from single Inlines, they’re built from [Inline]s, so we need a parser for those too.

Now, we could just lift the parser using many1, which takes some parser and runs it at least once (returning a list of values), but there’s something that nags at me with that approach—since we’re parsing whitespace and non-space characters into separate elements, every single space between words would have its own PlainText entry in the Block, which seems excessive.

To fix that, I decided to do some post-processing over the list to smoosh together contiguous PlainTexts into one large PlainText by concatenating their text:

/src/Subtextual/Parser.hs

inlines :: Parser [Inline]
inlines = do
    parsed <- many1 inline
    let parsed' = smoosh parsed []
    return parsed'
    <?> "inlines"
    where
        smoosh :: [Inline] -> [Inline] -> [Inline]
        smoosh [] finished = reverse finished
        smoosh (PlainText p : todo) (PlainText p' : done) =
            smoosh todo $ PlainText (p' <> p) : done
        smoosh (i : todo) done = smoosh todo (i : done)
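
Without smooshing, an input like see https://example.com, thanks would produce five separate PlainText fragments around the URL. With it, a hypothetical GHCi session looks like:

λ> parseOnly inlines (T.pack "see https://example.com, thanks")
Right [PlainText "see ",BareUrl "https://example.com",PlainText ", thanks"]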

With [Inline] parsing out of the way, it’s time to start parsing Blocks.

Since Subtext is based on checking for magic sigil characters at the start of the line, I created a helper combinator prefixed which looks for a given character, skips any spaces, then runs a parser.

The Block parsers are fairly self-explanatory. Most of them look for some character at the start of the line before parsing either an [Inline] list or, in the case of Heading, plain Text (since we don’t want to search for slashlinks and other URLs in section headers).

The only exception to the “look for a magic sigil char” rule is paragraph, which, like plainText, will happily accept any input up until the end of the line. That means we again need to be careful about our ordering, leaving paragraph as the final subparser of nonBlankBlock.

/src/Subtextual/Parser.hs

------------------------------------------------------------
--                      Block Parsing                     --
------------------------------------------------------------

----------                 Helpers                ----------

prefixed :: Char -> Parser a -> Parser a
prefixed c parser = char c *> skipSpace *> parser

takeUntilEndOfLine :: Parser T.Text
takeUntilEndOfLine = takeWhile1 (not . isEndOfLine) <?> "takeUntilEndOfLine"

----------            Non-Blank Blocks            ----------

paragraph :: Parser Block
paragraph = Paragraph <$> inlines <?> "paragraph"

heading :: Parser Block
heading = Heading <$> prefixed '#' takeUntilEndOfLine <?> "heading"

bullet :: Parser Block
bullet = Bullet <$> prefixed '-' inlines <?> "bullet"

quote :: Parser Block
quote = Quote <$> prefixed '>' inlines <?> "quote"

nonBlankBlock :: Parser Block
nonBlankBlock =
    heading
    <|> bullet
    <|> quote
    <|> paragraph
    <?> "nonBlankBlock"

nonBlankBlocks :: Parser Document
nonBlankBlocks = many1 nonBlankBlock <?> "nonBlankBlocks"
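
A couple of quick examples of the sigil dispatch in action (a hypothetical GHCi session, with the internal parsers in scope):

λ> parseOnly nonBlankBlock (T.pack "# Heading")
Right (Heading "Heading")
λ> parseOnly nonBlankBlock (T.pack "- List item")
Right (Bullet [PlainText "List item"])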

Ahh, but we’re not yet parsing blank lines. Why not just add some blank parser to block (a hypothetical parser to parse any Block) that’ll look for a newline and spit out a Blank element?

Basically, it’s a counting problem. If we parse newlines as blanks, but our ultimate document :: Parser Document consists of repeatedly looking for a block followed by a terminating newline, we’ll get the number of Blanks wrong: we’d have to see 3 newline characters before we emitted one Blank.

Imagine we have some input like Line 1\n\nLine 2 that we want to parse with document.

  1. Line 1 would be parsed in as plain text
  2. The first \n would be parsed in as the end of the first line
  3. The second \n would be parsed in as a Blank

But now the parser would be expecting a third \n to mark the end of the second (blank) line, and it’s never going to find one. The parse gets stuck there and stops accepting input.

The fix is to instead count newlines and then emit a list of Blanks that is 1 item shorter than the number of newlines, which gets the behaviour we want:

/src/Subtextual/Parser.hs

----------              Blank Blocks              ----------

newLines :: Parser Document
newLines = do
    eols <- many1 (Data.Attoparsec.Text.takeWhile isHorizontalSpace *> endOfLine)
    let count = length eols
    return $ replicate (count - 1) Blank
    <?> "newLines"

This is also why we wrote the parser nonBlankBlocks: with two parsers of type Parser Document, each parsing a list of blocks, we can combine them into a single document parser.

With that complication figured out, parsing a whole Document is pretty easy. We alternately parse runs of non-blank blocks and runs of newlines, then concatenate the results from a [Document] into a single Document:

/src/Subtextual/Parser.hs

------------------------------------------------------------
--                    Document Parsing                    --
------------------------------------------------------------

document :: Parser Document
document = concat <$> many1 (nonBlankBlocks <|> newLines) <?> "document"
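
Running this over the Line 1\n\nLine 2 example from before gives exactly the counting behaviour we wanted (a hypothetical GHCi session):

λ> parseOnly document (T.pack "Line 1\n\nLine 2")
Right [Paragraph [PlainText "Line 1"],Blank,Paragraph [PlainText "Line 2"]]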

Unparsing Subtext

In comparison to parsing, unparsing is unbelievably easy.

To unparse Inline and Block, all we need to do is pattern match and provide the expected text formatting.

To unparse a Document, we map block over our list of Blocks to generate the text snippets, then concatenate them with newlines between each snippet.

/src/Subtextual/Unparser.hs

module Subtextual.Unparser
    (inline, block, document) where

import Subtextual.Core
import qualified Data.Text as T

inline :: Inline -> T.Text
inline (PlainText p) = p
inline (BareUrl url) = url
inline (AngledUrl url) = T.pack "<" <> url <> T.pack ">"
inline (SlashLink sl) = T.pack "/" <> sl

inlines :: [Inline] -> T.Text
inlines = mconcat . map inline

block :: Block -> T.Text
block (Paragraph p) = inlines p
block (Heading h) = T.pack "# " <> h
block (Bullet b) = T.pack "- " <> inlines b
block (Quote q) = T.pack "> " <> inlines q
block Blank = T.pack ""

document :: Document -> T.Text
document = T.intercalate (T.pack "\n") . map block
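
As a quick round trip (a hypothetical GHCi session, with the parser and unparser modules imported qualified as Parser and Unparser, and parseOnly from Data.Attoparsec.Text in scope):

λ> let source = T.pack "# Title\n\nHello"
λ> fmap Unparser.document (parseOnly Parser.document source)
Right "# Title\n\nHello"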

Rendering Subtext to HTML

Now that we can unparse Inlines, Blocks and Documents back into their original text form, let’s work on rendering Subtext to HTML so that we can serve a bundle of static HTML pages over the web.

I’ll use Lucid as the HTML templating DSL for this, since it’s a simple library that’s easy to read. I’m not looking to add unnecessary complexity just to output HTML.

Converting an Inline is as simple as pattern-matching on each data constructor, while converting the list [Inline] (as wrapped by our Blocks) only requires concatenating the HTML fragments together:

/src/Subtextual/Html.hs

{-# LANGUAGE ExtendedDefaultRules #-}
{-# LANGUAGE OverloadedStrings #-}
module Subtextual.Html (block, document) where

import Subtextual.Core
import Lucid

------------------------------------------------------------
--                     Inlines to HTML                    --
------------------------------------------------------------

inline :: Inline -> Html ()
inline (PlainText p) = (span_ . toHtml) p
inline (BareUrl url) = a_ [href_ url] $ toHtml url
inline (AngledUrl url) = a_ [href_ url] $ toHtml url
inline (SlashLink sl) = a_ [href_ sl, class_ "slashlink"] $ toHtml sl

inlines :: [Inline] -> Html ()
inlines = mconcat . map inline

Outputting HTML for Blocks is also just a pattern-match away:

/src/Subtextual/Html.hs

------------------------------------------------------------
--                      Block to HTML                     --
------------------------------------------------------------

block :: Block -> Html ()
block (Paragraph p) = (p_ . inlines) p
block (Heading h) = (h2_ . toHtml) h
block (Bullet b) = (li_ . inlines) b
block (Quote q) = (blockquote_ . inlines) q
block Blank = mempty

Since Document is really just a [Block], surely we can just map block over a [Block] to get a list of HTML fragments that we concatenate together, right?

Unfortunately, it’s not so easy.

HTML requires that we wrap our lists of Bullets in a list element, either an ordered ol or an unordered ul. If we were to just map block over the Document, we’d get bare list items sitting adjacent to paragraphs, something like this sketch:
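
<p><span>Some text</span></p>
<li><span>List item</span></li>
<li><span>List item</span></li>
<p><span>More text</span></p>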

To fix this, I’ll first pre-process the Document to group adjacent Bullets into a single list, then wrap each of those lists in a ul element when we output HTML:

/src/Subtextual/Html.hs

------------------------------------------------------------
--                    Document to HTML                    --
------------------------------------------------------------

data Group a =
    Single a
    | Bullets [a]

document :: Document -> Html ()
document = mconcat . map groupHtml . group' where
    groupHtml :: Group Block -> Html ()
    groupHtml (Single b) = block b
    groupHtml (Bullets bs) = ul_ $ (mconcat . map block) bs

    group' :: Document -> [Group Block]
    group' doc = group doc []

    group :: Document -> [Group Block] -> [Group Block]
    group [] done = (reverse . map reverseGroup) done
    group (Bullet b : todo) (Bullets bs : done) = group todo $ Bullets (Bullet b : bs) : done
    group (Bullet b : todo) done = group todo $ Bullets [Bullet b] : done
    group (b : todo) done = group todo $ Single b : done

    reverseGroup :: Group a -> Group a
    reverseGroup (Single s) = Single s
    reverseGroup (Bullets bs) = Bullets $ reverse bs
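
Here’s the grouping in action, rendered with Lucid’s renderText (a hypothetical GHCi session, with Subtextual.Core in scope and the renderer imported qualified as Html):

λ> let doc = [Paragraph [PlainText (T.pack "Intro")], Bullet [PlainText (T.pack "One")], Bullet [PlainText (T.pack "Two")]]
λ> renderText (Html.document doc)
"<p><span>Intro</span></p><ul><li><span>One</span></li><li><span>Two</span></li></ul>"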

Next steps

That wraps up my alpha version of Subtextual!

All the above functionality is also supported by a small battery of unit tests, and you can see the latest version of Subtextual at my GitHub.