I had a fascinating conversation with @johnnulls recently about creating a lightweight markup format for hyperdocuments, extending Subtext to add various affordances for metadata like key-value pairs and tags.
But I’m getting ahead of myself.
I’ll let Gordon explain:
Subtext is a text-based, line-oriented hypertext format, designed for note-taking.
Subtext markup is made up of ordinary lines of text, which are interpreted as a list of blocks. Lines that are prefixed with magic “sigil” characters are treated as special blocks. Lines without sigils are treated as text blocks. Empty lines are ignored.
This is a really interesting approach!
I’ve played around and tried to create new hypertext tools before, but nothing I’ve written has really gripped me.
Most of them have been built around some variant of a content-addressable store so that we have persistent identifiers to prevent link rot (typically with addresses provided by hashing the content), but I could never get any of them working in a way that felt ergonomic to write and read. I think I’ll return to those ideas one day.
Subtext is a lot simpler, mostly because it’s not the same thing at all.
Rather than specifying the full system (with stores, servers, and whatnot), Subtext offers a markup language for a linear hypertext format with very easy parsing based on looking for the magic “sigil” characters.
Let’s write enough basic functionality to be able to say I’ve got some sort of minimum viable product for working with Subtext:
I’ll call this Haskell library Subtextual.
I’ll take an example snippet of Subtext from Gordon’s guide:
# Heading
Plain text.
- List item
- List item
> Quoted text
URLs like https://example.com are automatically linked.
You can also link to local pages using short /slashlinks.
I’ll use attoparsec to build the parsing functionality, since I’ve used it before and I enjoy working with it.
Since Subtext does some sub-line parsing, I’ll start with an `Inline` type which will store that sub-line data. There are four different kinds of text we can see in a line: plain text, bare URLs, angle-bracketed URLs, and slashlinks.
We then construct `Block`s (which correspond to lines of text in a Subtext file) out of `Inline`s for the more complicated kinds of lines, or plain `T.Text` for headings (since we don’t want to parse for URLs in headings). `Blank`s are our blank lines, which we’ll keep in the `Block` data type since we don’t want to throw away that information about the source Subtext files when we’re parsing, although we’ll probably not do anything with them anyway. I’ll also add `Document` as a type synonym for lists of `Block`s.
That gives us a nice core of data types:
/src/Subtextual/Core.hs
module Subtextual.Core
    ( Inline(..), Block(..), Document
    ) where

import qualified Data.Text as T

data Inline =
    PlainText T.Text
    | BareUrl T.Text
    | AngledUrl T.Text
    | SlashLink T.Text
    deriving (Show, Eq)

data Block =
    Paragraph [Inline]
    | Heading T.Text
    | Bullet [Inline]
    | Quote [Inline]
    | Blank
    deriving (Show, Eq)

type Document = [Block]
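To make these types concrete, here’s a hand-written sketch (my own illustration, not library code) of roughly how the example snippet from Gordon’s guide should come out once parsing is in place. I’m assuming OverloadedStrings for the `T.Text` literals:

{-# LANGUAGE OverloadedStrings #-}
import Subtextual.Core

-- A hand-written sketch of the example snippet as a Document value.
example :: Document
example =
    [ Heading "Heading"
    , Paragraph [PlainText "Plain text."]
    , Bullet [PlainText "List item"]
    , Bullet [PlainText "List item"]
    , Quote [PlainText "Quoted text"]
    , Paragraph
        [ PlainText "URLs like "
        , BareUrl "https://example.com"
        , PlainText " are automatically linked."
        ]
    , Paragraph
        [ PlainText "You can also link to local pages using short "
        , SlashLink "slashlinks"
        , PlainText "."
        ]
    ]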
Now we’re ready to parse some text. We’ll start with some imports:
/src/Subtextual/Parser.hs
module Subtextual.Parser
    (nonBlankBlock, document) where

import Control.Applicative
import Control.Monad
import Data.Char
import Data.Functor
import Data.Attoparsec.Text
import Data.Attoparsec.Combinator
import qualified Data.Text as T

import Subtextual.Core
With that out of the way, we can work on parsing.
Attoparsec is a parser combinator library, which means we build up complex parsers out of simple ones. So let’s start with those building blocks.
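To illustrate the idea before we dive in, here’s a minimal, self-contained sketch that isn’t part of Subtextual: `key`, `value`, and `keyValue` are hypothetical names for a parser of `key=value` pairs, built by composing two smaller parsers.

import Data.Attoparsec.Text
import qualified Data.Text as T

-- A parser for the text before the '='.
key :: Parser T.Text
key = takeWhile1 (/= '=')

-- A parser for the rest of the line.
value :: Parser T.Text
value = takeTill (== '\n')

-- The combined parser: run `key`, expect an '=', then run `value`.
keyValue :: Parser (T.Text, T.Text)
keyValue = do
    k <- key
    _ <- char '='
    v <- value
    return (k, v)

-- parseOnly keyValue (T.pack "name=Subtextual") == Right (T.pack "name", T.pack "Subtextual")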
Since we want to start looking for URLs and slashlinks at word boundaries, we’ll need to parse in whitespace and non-whitespace characters separately. Every time we see a new word, we’ll first try to parse it with our more structured parsers for bare URLs, angle-delimited URLs, and slashlinks. If those fail, we’ll fall back to parsing them as plain text.
So we write the two text parsers:
So we write the two text parsers:

- `whitespace` to parse spaces and tabs
- `word` to parse non-space characters

/src/Subtextual/Parser.hs
------------------------------------------------------------
-- Text Parsing --
------------------------------------------------------------

whitespace :: Parser T.Text
whitespace = takeWhile1 isHorizontalSpace <?> "whitespace"

word :: Parser T.Text
word = takeWhile1 (not . isSpace) <?> "word"
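A couple of doctest-style checks, my own examples; since these helpers aren’t exported, they’d run inside the module:

-- >>> parseOnly word (T.pack "hello world")
-- Right "hello"
--
-- >>> parseOnly whitespace (T.pack "  \thello")
-- Right "  \t"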
After that we’ll want to work on parsing `Inline`s. `PlainText` should get parsed whenever we have whitespace, or a string of characters that didn’t get parsed as a URL or slashlink:
/src/Subtextual/Parser.hs
------------------------------------------------------------
-- Inline Parsing --
------------------------------------------------------------

plainText :: Parser Inline
plainText = PlainText <$> (word <|> whitespace) <?> "plainText"
`BareUrl` was an interesting challenge. According to the spec:

Subtext parsers MUST implement automatic linking for certain URLs that are not in brackets.

We could just keep parsing characters into a `BareUrl` until we hit whitespace, but there’s a problem with that approach: we want to be able to intelligently ignore punctuation like periods, semicolons, or commas at the end of URLs. If we parse in e.g. `Here's a link: https://google.com.`, we want the parsed link to be “https://google.com”, with the final period being parsed as plain text.

To solve this issue, we do lookahead on the next bit of the input to see if we’ve reached the end of the URL, which we define as either:

- a punctuation character (`.`, `;`, or `,`) followed by whitespace or the end of the line,
- a whitespace character, or
- the end of the input.
/src/Subtextual/Parser.hs
string' :: String -> Parser T.Text
string' = string . T.pack

bareUrl :: Parser Inline
bareUrl =
    -- parenthesise the whole do-block so the <?> label applies to it all
    ( do
        schema <- string' "https://" <|> string' "http://"
        body <- manyTill anyChar $ lookAhead endOfUrl
        let url = schema <> T.pack body
        return $ BareUrl url
    ) <?> "bareUrl"

    where
        endOfUrl :: Parser ()
        endOfUrl =
            punctuationBoundary
            <|> space $> ()
            <|> endOfInput

        punctuationBoundary :: Parser ()
        punctuationBoundary = do
            _ <- char '.' <|> char ';' <|> char ','
            skip isSpace <|> endOfLine
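A doctest-style check (my example) that the punctuation lookahead behaves as intended, leaving the trailing period out of the URL:

-- >>> parseOnly bareUrl (T.pack "https://google.com. And more text")
-- Right (BareUrl "https://google.com")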
Parsing `AngledUrl`s is easier, since all we have to do is look for the angle-bracket-delimited text. The spec says:

URLs are wrapped in angle brackets, and can appear anywhere within a text, link, or quote block
/src/Subtextual/Parser.hs
isAngledUrlChar :: Char -> Bool
isAngledUrlChar c = not $ c == '<' || c == '>' || isSpace c

angledUrl :: Parser Inline
angledUrl =
    ( do
        _ <- string' "<"
        url <- takeWhile1 isAngledUrlChar
        _ <- string' ">"
        return $ AngledUrl url
    ) <?> "angledUrl"
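Again, a quick doctest-style check of my own:

-- >>> parseOnly angledUrl (T.pack "<https://example.com/page>")
-- Right (AngledUrl "https://example.com/page")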
Slashlinks are also easy since we don’t need lookahead; the spec tells us that:

Generally, a slashlink is a / followed by any number of alphanumeric characters, dashes -, underscores _.

And so parsing a slashlink is as easy as looking for the initial forward-slash `/` character and then parsing in any alphanumeric characters, dashes, underscores, and slashes.
/src/Subtextual/Parser.hs
isSlashLinkChar :: Char -> Bool
isSlashLinkChar c =
    isAlpha c
    || isDigit c
    || c == '-'
    || c == '_'
    || c == '/'

slashLink :: Parser Inline
slashLink =
    ( do
        _ <- char '/'
        link <- takeWhile1 isSlashLinkChar
        return $ SlashLink link
    ) <?> "slashLink"
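One thing worth noticing (my example below): the leading slash is consumed but not stored, so the unparser will need to put it back later, which it does.

-- >>> parseOnly slashLink (T.pack "/foo/bar-baz")
-- Right (SlashLink "foo/bar-baz")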
Now that we’ve defined each of the individual `Inline` parsers, we stitch them together into one big `Inline` parser using the Alternative operator `<|>`, which tries each parser in order (backtracking on failure) until it returns the first successful parse.

There’s only one gotcha here, which is that `plainText` needs to be the last parser in the list, since it’s indiscriminate about what text it’ll parse and it’ll happily consume the rest of the input until the end of the line. By being careful about the order, we give our more structured (and more discriminating) parsers (`bareUrl`, `angledUrl`, and `slashLink`) the first crack at parsing the text.
/src/Subtextual/Parser.hs
inline :: Parser Inline
inline =
    bareUrl
    <|> angledUrl
    <|> slashLink
    <|> plainText
    <?> "inline"
Of course, our `Block`s don’t consume `Inline`s, they consume `[Inline]`s, so we need a parser for those too.

Now, we could just lift the parser using `many1`, which takes some parser and runs it at least once (returning a list of values), but there’s something that nags at me with that approach: since we’re parsing whitespace and non-space characters into separate elements, every single space between words would have its own `PlainText` entry in the `Block`, which seems excessive.

To fix that, I decided to do some post-processing over the list to smoosh together contiguous `PlainText`s into one large `PlainText` by concatenating their text:
/src/Subtextual/Parser.hs
inlines :: Parser [Inline]
inlines =
    ( do
        parsed <- many1 inline
        return $ smoosh parsed []
    ) <?> "inlines"

    where
        -- merge runs of adjacent PlainTexts, accumulating the processed
        -- list in reverse and flipping it back at the end
        smoosh :: [Inline] -> [Inline] -> [Inline]
        smoosh [] finished = reverse finished
        smoosh (PlainText p : todo) (PlainText p' : done) =
            smoosh todo $ PlainText (p' <> p) : done
        smoosh (i : todo) done = smoosh todo (i : done)
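Here’s the difference smooshing makes, as a doctest-style example of my own. Without it, this input would produce seven `Inline`s (every word and every space separately); with it we get three:

-- >>> parseOnly inlines (T.pack "see /wiki for more")
-- Right [PlainText "see ",SlashLink "wiki",PlainText " for more"]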
With `[Inline]` parsing out of the way, it’s time to start parsing `Block`s.

Since Subtext is based on checking magic sigil characters at the start of the line, I created a helper combinator `prefixed` which looks for a given character, skips any spaces, then runs a parser.

The `Block` parsers are fairly self-explanatory. Most of them look for some character at the start of the line before parsing in either an `[Inline]` list, or plain `T.Text` in the case of `Heading`, since we don’t want to search for slashlinks and other URLs in section headers.

The only exception to the “look for a magic sigil char” rule is `paragraph`, which like `plainText` will happily accept any input up until the end of the line, and that means that we again need to be careful about our ordering and leave `paragraph` as the final subparser of `nonBlankBlock`.
/src/Subtextual/Parser.hs
------------------------------------------------------------
-- Block Parsing --
------------------------------------------------------------

---------- Helpers ----------

prefixed :: Char -> Parser a -> Parser a
prefixed c parser = char c *> skipSpace *> parser

takeUntilEndOfLine :: Parser T.Text
takeUntilEndOfLine = takeWhile1 (not . isEndOfLine) <?> "takeUntilEndOfLine"

---------- Non-Blank Blocks ----------

paragraph :: Parser Block
paragraph = Paragraph <$> inlines <?> "paragraph"

heading :: Parser Block
heading = Heading <$> prefixed '#' takeUntilEndOfLine <?> "heading"

bullet :: Parser Block
bullet = Bullet <$> prefixed '-' inlines <?> "bullet"

quote :: Parser Block
quote = Quote <$> prefixed '>' inlines <?> "quote"

nonBlankBlock :: Parser Block
nonBlankBlock =
    heading
    <|> bullet
    <|> quote
    <|> paragraph
    <?> "nonBlankBlock"

nonBlankBlocks :: Parser Document
nonBlankBlocks = many1 nonBlankBlock <?> "nonBlankBlocks"
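A couple of doctest-style examples (mine) of the sigil handling, including the space-skipping that `prefixed` does:

-- >>> parseOnly nonBlankBlock (T.pack "# Heading")
-- Right (Heading "Heading")
--
-- >>> parseOnly nonBlankBlock (T.pack ">   Quoted text")
-- Right (Quote [PlainText "Quoted text"])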
Ahh, but we’re not yet parsing blank lines. Why not just add some `blank` parser to `block` (a hypothetical parser that parses any kind of `Block`) that looks for a newline and spits out a `Blank` element?

Basically, it’s a counting problem. If we parse newlines as `blank`s but have our ultimate `document :: Parser Document` consist of looking for multiple pairs of parsed text followed by newlines, we’ll get the number of `Blank`s wrong: we’d have to see 3 newline characters before we emitted one `Blank`.

Imagine we have some input like `Line 1\n\nLine 2` that we want to parse with `document`:
- `Line 1` would be parsed in as plain text
- the first `\n` would be parsed in as the end of the first line
- the second `\n` would be parsed in as a `Blank`

But now the parser would be expecting a third `\n` to be the end of the second (blank) line, and it’s never going to find one, so the whole parser stops accepting input.
The fix is to instead count newlines and then emit a list of `Blank`s which is 1 item shorter than the number of newlines, which gets the behaviour we want:

- `\n` gets parsed as an empty list `[]`
- `\n\n` gets parsed as one blank line `[Blank]`
- `\n\n\n` gets parsed as two blank lines `[Blank, Blank]`, etc.

/src/Subtextual/Parser.hs
---------- Blank Blocks ----------

newLines :: Parser Document
newLines =
    ( do
        eols <- many1 (Data.Attoparsec.Text.takeWhile isHorizontalSpace *> endOfLine)
        return $ replicate (length eols - 1) Blank
    ) <?> "newLines"
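Which gives us exactly the off-by-one counting we wanted (my doctest-style examples; `newLines` is internal, so these run inside the module):

-- >>> parseOnly newLines (T.pack "\n")
-- Right []
--
-- >>> parseOnly newLines (T.pack "\n\n\n")
-- Right [Blank,Blank]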
This is also why we wrote the parser `nonBlankBlocks`: with two `Parser Document` parsers that each produce lists of blocks, we can combine the two into a single `document` parser.

After we figure that complication out, parsing a whole `Document` is pretty easy: we repeatedly parse runs of non-blank blocks and runs of newlines, then concatenate them together from a `[Document]` into a `Document`:
/src/Subtextual/Parser.hs
------------------------------------------------------------
-- Document Parsing --
------------------------------------------------------------

document :: Parser Document
document = concat <$> many1 (nonBlankBlocks <|> newLines) <?> "document"
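Revisiting the `Line 1\n\nLine 2` example from before (doctest-style, my example), the blank line now comes out right:

-- >>> parseOnly document (T.pack "Line 1\n\nLine 2")
-- Right [Paragraph [PlainText "Line 1"],Blank,Paragraph [PlainText "Line 2"]]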
In comparison to parsing, unparsing is unbelievably easy.

To unparse `Inline`s and `Block`s, all we need to do is pattern match and provide the expected text formatting.

To unparse a `Document`, we map `block` over our list of `Block`s to generate the text snippets, then concatenate them with newlines in-between each snippet.
/src/Subtextual/Unparser.hs
module Subtextual.Unparser
    (inline, block, document) where

import Subtextual.Core
import qualified Data.Text as T

inline :: Inline -> T.Text
inline (PlainText p) = p
inline (BareUrl url) = url
inline (AngledUrl url) = T.pack "<" <> url <> T.pack ">"
inline (SlashLink sl) = T.pack "/" <> sl

inlines :: [Inline] -> T.Text
inlines = mconcat . map inline

block :: Block -> T.Text
block (Paragraph p) = inlines p
block (Heading h) = T.pack "# " <> h
block (Bullet b) = T.pack "- " <> inlines b
block (Quote q) = T.pack "> " <> inlines q
block Blank = T.pack ""

document :: Document -> T.Text
document = T.intercalate (T.pack "\n") . map block
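So parsing and then unparsing should give back the original text (up to normalisation of things like extra spaces after sigils). A sketch in GHCi, with hypothetical qualified imports since both modules export a `document`:

-- >>> import Data.Attoparsec.Text (parseOnly)
-- >>> import qualified Data.Text as T
-- >>> import qualified Subtextual.Parser as Parser
-- >>> import qualified Subtextual.Unparser as Unparser
-- >>> Unparser.document <$> parseOnly Parser.document (T.pack "# Title\n\nHello world")
-- Right "# Title\n\nHello world"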
Now that we can unparse `Inline`s, `Block`s and `Document`s back to the original text file, let’s work on rendering Subtext to HTML so that we can serve a bundle of static HTML pages over the web.
I’ll use Lucid as the HTML templating DSL for this, since it’s a simple library that’s easy to read. I’m not looking to add unnecessary complexity just to output HTML.
Converting an `Inline` is as simple as pattern-matching on each data constructor, while converting the list `[Inline]` (as wrapped by our `Block`s) only requires concatenating the HTML fragments together:
/src/Subtextual/Html.hs
{-# LANGUAGE ExtendedDefaultRules #-}
{-# LANGUAGE OverloadedStrings #-}
module Subtextual.Html (block, document) where

import Subtextual.Core
import Lucid

------------------------------------------------------------
-- Inlines to HTML --
------------------------------------------------------------

inline :: Inline -> Html ()
inline (PlainText p) = (span_ . toHtml) p
inline (BareUrl url) = a_ [href_ url] $ toHtml url
inline (AngledUrl url) = a_ [href_ url] $ toHtml url
inline (SlashLink sl) = a_ [href_ sl, class_ "slashlink"] $ toHtml sl

inlines :: [Inline] -> Html ()
inlines = mconcat . map inline
Outputting HTML for `Block`s is also just a pattern-match away:
/src/Subtextual/Html.hs
------------------------------------------------------------
-- Block to HTML --
------------------------------------------------------------

block :: Block -> Html ()
block (Paragraph p) = (p_ . inlines) p
block (Heading h) = (h2_ . toHtml) h
block (Bullet b) = (li_ . inlines) b
block (Quote q) = (blockquote_ . inlines) q
block Blank = mempty
Since `Document` is really just a `[Block]`, surely we can just map `block` over a `[Block]` to get a list of HTML fragments that we concatenate together, right?

Unfortunately, it’s not so easy.

HTML requires that we wrap our lists of `Bullet`s in a list element, either an ordered `ol` or an unordered `ul`; if we were to just map `block`, we’d have list items sitting directly adjacent to paragraphs, like:

<p>Plain text.</p>
<li>List item</li>
<li>List item</li>

To fix this, I’ll first pre-process the `Document` to group up adjacent `Bullet`s into a single list, then wrap those lists with a `ul` element when we output HTML:
/src/Subtextual/Html.hs
------------------------------------------------------------
-- Document to HTML --
------------------------------------------------------------

data Group a =
    Single a
    | Bullets [a]

document :: Document -> Html ()
document = mconcat . map groupHtml . group' where
    groupHtml :: Group Block -> Html ()
    groupHtml (Single b) = block b
    groupHtml (Bullets bs) = ul_ $ (mconcat . map block) bs

    group' :: Document -> [Group Block]
    group' doc = group doc []

    -- accumulate Groups in reverse, merging each Bullet into the
    -- Bullets group at the head of the accumulator
    group :: Document -> [Group Block] -> [Group Block]
    group [] done = (reverse . map reverseGroup) done
    group (Bullet b : todo) (Bullets bs : done) = group todo $ Bullets (Bullet b : bs) : done
    group (Bullet b : todo) done = group todo $ Bullets [Bullet b] : done
    group (b : todo) done = group todo $ Single b : done

    -- both the accumulator and each Bullets group were built backwards,
    -- so restore the original order
    reverseGroup :: Group a -> Group a
    reverseGroup (Single s) = Single s
    reverseGroup (Bullets bs) = Bullets $ reverse bs
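And rendering with Lucid’s renderText ties it all together (my example, in GHCi):

-- >>> import Lucid (renderText)
-- >>> import Subtextual.Core
-- >>> import Subtextual.Html (document)
-- >>> import qualified Data.Text as T
-- >>> renderText (document [Paragraph [PlainText (T.pack "Intro")], Bullet [PlainText (T.pack "one")], Bullet [PlainText (T.pack "two")]])
-- "<p><span>Intro</span></p><ul><li><span>one</span></li><li><span>two</span></li></ul>"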
That wraps up my alpha version of Subtextual!
All the above functionality is also supported by a small battery of unit tests, and you can see the latest version of Subtextual at my GitHub.