Many of PurpleWiki's features stem from its parser. Most Wikis use a search-and-replace strategy to convert Wiki text into HTML. This has several drawbacks, some of which we describe below. PurpleWiki instead parses Wiki text into an intermediate DataStructure, which can be easily converted into multiple output formats -- text, XML, etc. -- not just HTML. It also means that the parser can be replaced to support other input formats, not just Wiki text. In other words, PurpleWiki easily supports multiple input and output formats. (1X)
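As a rough illustration of why an intermediate structure helps, here is a minimal sketch in Python. It is not PurpleWiki's actual DataStructure: the Node class, its fields, and the two serializers are all hypothetical names invented for this example. The point is only that one tree can feed several independent output functions.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """Hypothetical tree node; not PurpleWiki's real DataStructure."""
    kind: str                      # "document", "p", "ul" or "li"
    text: str = ""
    children: list = field(default_factory=list)


def to_text(node):
    """Serialize the tree as plain text."""
    if node.kind == "li":
        return "* " + node.text + "\n"
    if node.kind == "p":
        return node.text + "\n\n"
    body = "".join(to_text(child) for child in node.children)
    # A list is followed by a blank line; the document adds nothing itself.
    return body + ("\n" if node.kind == "ul" else "")


def to_html(node):
    """Serialize the same tree as HTML."""
    if node.kind == "document":
        return "".join(to_html(child) for child in node.children)
    inner = escape(node.text) + "".join(to_html(c) for c in node.children)
    return "<{0}>{1}</{0}>".format(node.kind, inner)


doc = Node("document", children=[
    Node("p", "Hello & welcome"),
    Node("ul", children=[Node("li", "one"), Node("li", "two")]),
])
print(to_html(doc))
print(to_text(doc))
```

Adding another output format in this sketch means writing one more function over the same tree; the parsing step is untouched.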
Algorithm (1Y)
All markup is line-oriented except for paragraphs and preformatted text. Blank lines terminate structural nodes. (21)
New structural nodes terminate existing structural nodes. For example, if the parser is buffering lines for a paragraph, and it comes across a list element, it terminates the paragraph and creates a list structure. (22)
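The two termination rules above can be sketched as a small line-by-line loop. This is a simplified illustration in Python, not PurpleWiki's parser: it handles only paragraphs and one kind of list, and the "* " list-item syntax and the parse_blocks name are assumptions made for the example.

```python
import re

# Assumed list-item syntax for this sketch: a line starting with "* ".
LIST_ITEM = re.compile(r"^\*\s+(.*)$")


def parse_blocks(text):
    """Group lines into structural nodes: ("p", text) or ("ul", [items])."""
    nodes = []
    kind = None     # type of the node currently being buffered
    buffer = []     # lines or items collected for that node

    def flush():
        nonlocal kind, buffer
        if kind == "p":
            nodes.append(("p", " ".join(buffer)))
        elif kind == "ul":
            nodes.append(("ul", buffer))
        kind, buffer = None, []

    for line in text.splitlines():
        if not line.strip():
            flush()                 # a blank line terminates the open node
            continue
        match = LIST_ITEM.match(line)
        new_kind = "ul" if match else "p"
        if kind is not None and new_kind != kind:
            flush()                 # a new structural node terminates the old one
        kind = new_kind
        buffer.append(match.group(1) if match else line.strip())
    flush()                         # end of input closes whatever is still open
    return nodes


sample = "first line\nsecond line\n* item one\n* item two\n\nnext paragraph"
print(parse_blocks(sample))
```

In the sample, the list item on the third line ends the paragraph without any blank line in between, and the blank line after the list ends the list.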
The following syntax is used to support purple numbers: (23)
{nid n} -- NID of a node, where n is the NID. This should be at the end of the line.
WikiWord#n -- Granular link, where n is the NID. (24)
After parsing the document into a parse tree, the parser traverses the parse tree and adds NIDs to structural nodes that do not already have them. (25)
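The NID pass might look roughly like the following Python sketch. Several things here are assumptions, not PurpleWiki behavior: nodes are a flat list of lines rather than a parse tree, NIDs are treated as base-36 numbers written with digits and uppercase letters, and new NIDs simply continue from the largest one already present in the input. How PurpleWiki actually allocates new NIDs is not described on this page.

```python
import re

# "{nid n}" at the end of a line; the NID alphabet is an assumption.
NID_TAG = re.compile(r"\s*\{nid ([0-9A-Z]+)\}\s*$")
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(n):
    """Format a non-negative integer with digits 0-9 and A-Z."""
    out = ""
    while True:
        n, remainder = divmod(n, 36)
        out = DIGITS[remainder] + out
        if n == 0:
            return out


def split_nid(line):
    """Return (text, nid); nid is None when the line carries no NID tag."""
    match = NID_TAG.search(line)
    if match:
        return line[:match.start()], match.group(1)
    return line, None


def assign_nids(lines):
    """Keep existing NIDs and give a fresh one to every node without one."""
    parsed = [split_nid(line) for line in lines]
    used = [int(nid, 36) for _, nid in parsed if nid is not None]
    next_id = max(used, default=0) + 1      # assumption: continue after the maximum
    result = []
    for text, nid in parsed:
        if nid is None:
            nid = to_base36(next_id)
            next_id += 1
        result.append((text, nid))
    return result


print(assign_nids(["Intro {nid 1X}", "No id yet", "Other {nid 21}"]))
```

Existing NIDs are left untouched, so links of the form WikiWord#n that already point at a node keep working after the pass.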
UseModWiki Improvements (26)
Technical Notes (27)
Because PurpleWiki translates Wiki text into a DataStructure, and because the Wiki text format is so flexible, what you enter may not be exactly the same as what is saved. (28)
Although this parser is smarter than the average Wiki parser, it's still not as smart as it could be, and it has some funky behavior as a result. For example, "3BlindMice" is parsed to 3BlindMice, where "3" is plain text and BlindMice is a WikiWord. If we did proper tokenizing, rather than relying on some funky regexp hacks, we would not have this problem. (This does raise the question: Typically, WikiWords cannot start with numbers. Is there a good reason for this?) (H2)
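The difference between the two approaches can be shown with a small Python sketch. The WikiWord pattern below is a simplified stand-in, not the regexp PurpleWiki uses, and both function names are made up for the example. Square brackets mark whatever the function decided is a WikiWord.

```python
import re

# Simplified, hypothetical WikiWord pattern: two or more capitalized word parts.
WIKI_WORD = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+)+")


def link_by_search(text):
    """Search-and-replace style: mark any substring that looks like a WikiWord."""
    return WIKI_WORD.sub(lambda m: "[" + m.group(0) + "]", text)


def link_by_token(text):
    """Split on whitespace first, then mark only tokens that are WikiWords in full."""
    tokens = re.split(r"(\s+)", text)   # the capture group keeps the whitespace
    return "".join(
        "[" + token + "]" if WIKI_WORD.fullmatch(token) else token
        for token in tokens
    )


print(link_by_search("3BlindMice ran"))   # 3[BlindMice] ran
print(link_by_token("3BlindMice ran"))    # 3BlindMice ran
print(link_by_token("See FrontPage now")) # See [FrontPage] now
```

The token-based version treats "3BlindMice" as one unit, so the decision about leading digits becomes an explicit rule rather than a side effect of where the regexp happens to match. It is still naive in other ways; for instance, a WikiWord followed directly by punctuation would not be marked.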
Another problem with our parser: We don't handle balanced tags correctly. For example: (H4)
< tt >< nowiki >< tt >fixed font< /tt >< /nowiki >< /tt > (H5)
would be parsed into: (H6)
< nowiki >< tt > fixed font< /nowiki >< /tt > (H7)
(I had to add spaces between the angle brackets because our parser can't display this correctly!) (H8)
We should eventually replace the parser. One potential strategy would be to express WikiText as a BNF grammar (if that is possible) and use something like Parse::RecDescent. If we do this, we need to remember that everything in intermap should also be considered a token. This is because "InterMap:WikiWord" can be parsed two possible ways, depending on whether or not InterMap is in intermap. (H3)
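To make the ambiguity concrete, here is a hedged Python sketch of a tokenizer step that consults the intermap. The token names, the simplified WikiWord pattern, and the sample intermap entry are all assumptions for illustration; the sketch only covers input of the form Prefix:WikiWord where the prefix itself looks like a WikiWord.

```python
import re

# Simplified, hypothetical WikiWord pattern: two or more capitalized word parts.
WIKI_WORD = r"[A-Z][a-z]+(?:[A-Z][a-z]+)+"
PREFIXED = re.compile("(" + WIKI_WORD + "):(" + WIKI_WORD + ")")


def tokenize_link(text, intermap):
    """Tokenize "Prefix:WikiWord" differently depending on the intermap."""
    match = PREFIXED.fullmatch(text)
    if match is None:
        return [("text", text)]
    prefix, page = match.groups()
    if prefix in intermap:
        # Known prefix: the whole string is a single inter-wiki link token.
        return [("interlink", prefix, page)]
    # Unknown prefix: two ordinary WikiWords separated by a literal colon.
    return [("wikiword", prefix), ("text", ":"), ("wikiword", page)]


intermap = {"MeatBall"}   # hypothetical intermap contents
print(tokenize_link("MeatBall:FrontPage", intermap))
print(tokenize_link("InterMap:WikiWord", intermap))
```

Because the result depends on the contents of the intermap, a grammar alone cannot settle the parse; the tokenizer needs the intermap entries available when it runs.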