The widespread adoption of XML as an etext standard has reached a level where I think someone needs to shout a furious 'NO!'
XML violates the Internet etext-esthetic in many serious ways:
I'm proposing an alternative standard that relocates as much as possible of this parsing burden back onto the computer-reader. I believe there's a minimal standard, which I'll term 'first-cut parsing', that should extract a basic set of identifications from all three basic etext groups: email, netnews, and webpages. And having parsed them, it should of course be able to convert between them.
This minimal set is still extraordinarily complicated, unfortunately (which is why I offer this as a manifesto rather than an application).
The goal is to make at least one full pass over a document, and classify every character into some meaningful category, with a high level of robustness when faced with bad human editing.
It should recognize HTML tags, of course, but also etext styling like _italics_ and *bold* [1]. It should quietly correct simple HTML typos (with an error log available, ultimately).
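As a rough illustration of the styling half of this, here is a minimal sketch that spots _italics_ and *bold* runs on a single line. The names `STYLE_RE` and `find_styled_spans` are hypothetical, and the rule it assumes (a marker counts only when it sits at a word boundary and hugs non-space text) is one plausible heuristic, not a settled convention.

```python
import re

# Assumed heuristic: an opening marker must not follow a word character and
# must be followed by non-space; the closing marker mirrors that rule.
STYLE_RE = re.compile(r"(?<!\w)([_*])(?=\S)(.+?)(?<=\S)\1(?!\w)")

STYLE_NAMES = {"_": "italic", "*": "bold"}


def find_styled_spans(line):
    """Return (style, text, start, end) for each _italic_ or *bold* span."""
    return [
        (STYLE_NAMES[m.group(1)], m.group(2), m.start(), m.end())
        for m in STYLE_RE.finditer(line)
    ]
```

Under this rule an identifier like snake_case_name or arithmetic like 2 * 3 * 4 is left alone, which is the kind of robustness against false positives a first-cut parser would need far more of.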
It should recognize ascii-art signatures and simple ascii tables, columns, and lists. And smileys! ;^/
It should understand the Internet convention of 'e-quoting' with '>' characters at the start of each line, and the simple variations on this... and it should try to identify the author and message-ID of the quoted material where possible.
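The simplest layer of e-quote handling might look like the sketch below: count the leading '>' marks on a line and peel them off. `QUOTE_PREFIX_RE` and `split_quote` are hypothetical names, and the sketch assumes each line arrives without its trailing newline. Finding the author and message-ID of the quoted material is a harder problem that it does not attempt.

```python
import re

# One or more '>' marks, each optionally preceded by a space or tab,
# followed by at most one separating space or tab.
QUOTE_PREFIX_RE = re.compile(r"^((?:[ \t]?>)+)[ \t]?")


def split_quote(line):
    """Return (depth, text): the quoting depth and the unquoted remainder."""
    m = QUOTE_PREFIX_RE.match(line)
    if not m:
        return 0, line
    return m.group(1).count(">"), line[m.end():]
```

This covers the '>>' and '> >' variations; other prefixes in the wild (initials before the '>', or '|' and ':' as quote marks) would each need their own rule.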
It should understand header and footer metadata as well as attachments, and recognize the basic compression formats, along with public keys for PGP and the like.
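For well-formed email and netnews, the header part of this is already handled by Python's standard `email` package. A minimal sketch, assuming a single-part message held in a string (`split_headers` is a hypothetical helper, and the sample address is made up):

```python
from email import message_from_string


def split_headers(raw):
    """Return (headers, body) for a simple single-part message."""
    msg = message_from_string(raw)
    # Repeated headers (e.g. Received:) collapse to the last one here.
    return dict(msg.items()), msg.get_payload()


headers, body = split_headers(
    "From: a@example.com\nSubject: Hi\n\nBody text\n"
)
```

Footer metadata, signatures, and PGP blocks have no such ready-made parser and would need heuristics of their own.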
It should be very flexible at reading dates in human formats.
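One simple way to approach flexible date reading is to try a list of candidate formats in turn. The sketch below does that with `datetime.strptime`; the format list and the name `parse_human_date` are assumptions, month names are matched in the current locale (English is assumed here), and slash dates are read in US month/day order, which is a guess a real parser would have to check against context.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; extend as needed.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",    # 1999-03-03
    "%d %b %Y",    # 3 Mar 1999
    "%d %B %Y",    # 3 March 1999
    "%B %d, %Y",   # March 3, 1999
    "%b %d, %Y",   # Mar 3, 1999
    "%m/%d/%Y",    # 03/03/1999 (US order assumed)
]


def parse_human_date(text):
    """Return a datetime.date, or None if no candidate format matches."""
    cleaned = " ".join(text.split())  # collapse stray whitespace
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```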
It should recognize paragraph and section formats, and conventional variants for margin widths, and alignment like centering. It should be robust at fixing newbie errors with email linewidths (especially those caused by proportional-width fonts).
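The linewidth repair can be sketched as: treat blank lines as paragraph breaks, rejoin each paragraph, and refill it to a fixed width. This uses the standard `textwrap` module; `rewrap` and the default width of 72 are assumptions. Applied blindly it would also mangle quoted text, tables, and ascii-art, so a real parser would have to classify those first and skip them.

```python
import textwrap


def rewrap(text, width=72):
    """Refill each blank-line-separated paragraph to the given width."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(
        textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs
    )
```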
It should handle various footnoting formats, and convert HTML anchors to footnotes [2] in netnews and email.
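The anchor-to-footnote conversion might be sketched like this, using the same `<URL: ...>` footnote form as the notes at the end of this page. `ANCHOR_RE` and `anchors_to_footnotes` are hypothetical names, and a regular expression is only a stand-in here: it assumes quoted href values and no nested anchors, which bad human editing will not always respect.

```python
import re

# Assumes a quoted href attribute and no nested <a> elements.
ANCHOR_RE = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def anchors_to_footnotes(html):
    """Replace each anchor with 'text [n]' and return (body, footnotes)."""
    urls = []

    def replace(m):
        urls.append(m.group(1))
        return f"{m.group(2)} [{len(urls)}]"

    body = ANCHOR_RE.sub(replace, html)
    notes = [f"[{i}] <URL: {u}>" for i, u in enumerate(urls, 1)]
    return body, notes
```

Given 'See the <a href="http://example.com/faq">FAQ</a> first.' (an invented address), this returns the body 'See the FAQ [1] first.' and the single footnote '[1] <URL: http://example.com/faq>'.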
It should parse the normal structures of webpages, like navigation bars, tables of contents, banner ads, and footer metadata. It should try to identify 'continued on next page' links.
It should monitor non-ascii characters, and recognize the main sources of these: foreign character sets, Microsoft Hubris, etc. It should handle minor variations in punctuation styles like dashes, linebreaks, and quotation marks.
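A first pass at this could simply report where the non-ascii characters fall, and fold the commonest word-processor punctuation back to plain ascii. In the sketch below the mapping table `SMART_PUNCT` is a small assumed sample, not a complete list, and both function names are hypothetical.

```python
# Assumed sample of "smart" punctuation and its plain-ascii equivalent.
SMART_PUNCT = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en and em dashes
    "\u2026": "...",                 # ellipsis
}


def non_ascii_report(text):
    """Return (offset, character) for every non-ascii character."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


def normalize_punctuation(text):
    """Replace known smart punctuation; leave everything else untouched."""
    return "".join(SMART_PUNCT.get(ch, ch) for ch in text)
```

Characters outside the table, such as accented letters from a foreign character set, pass through unchanged, so the report can still be used to decide what they are.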
Any HTML markup (or XML, for that matter) that has no common etext equivalent can still be preserved in footnotes, keyed to line-number and char-number so the text won't be burdened with '[1]' flags.
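A minimal sketch of this keyed-footnote idea: strip every tag out of the text and record each one with the line and character position it would occupy in the cleaned text. `TAG_RE` and `strip_tags_with_offsets` are hypothetical names, positions are assumed to be 1-based, and the tag pattern is deliberately naive (it takes anything between '<' and '>' as a tag).

```python
import re

TAG_RE = re.compile(r"<[^<>]+>")  # naive: any <...> run counts as a tag


def strip_tags_with_offsets(text):
    """Return (clean_text, notes); each note is (line_no, char_no, tag)."""
    clean_lines, notes = [], []
    for line_no, line in enumerate(text.split("\n"), 1):
        pieces, kept, last = [], 0, 0
        for m in TAG_RE.finditer(line):
            piece = line[last:m.start()]
            pieces.append(piece)
            kept += len(piece)
            # char_no is where the tag would sit in the cleaned line.
            notes.append((line_no, kept + 1, m.group(0)))
            last = m.end()
        pieces.append(line[last:])
        clean_lines.append("".join(pieces))
    return "\n".join(clean_lines), notes
```

Since the notes carry their own coordinates, the cleaned text stays free of markers, and the original markup could in principle be reinserted from the notes alone.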
And only on top of this 'first cut' can real natural-language parsing begin.
[1] No-tags markup proposes etext equivalents for many HTML styles: <URL: http://www.prefab.com/ssl/notagsmarkup.html>
[2] Etext FAQ: <URL: http://www.robotwisdom.com/net/etextfaq.html>
NEW: Latte is comparable to no-tags markup.