Writing `nndoc` digest definitions

Introduction

One of the neatest features of the Gnus mail and news-reading package for Emacs is its ability to expand digests into individual messages that can be read with the full power of the newsreader. What's really cool about the feature is that it's extensible: you can write rules to describe new digest formats. That's especially handy in the modern world, where too many publishers think that RFC 1153 shouldn't apply to them.

The downside is that it's not at all easy to write these rules. The documentation is terse, to say the least, and when you screw up, the error messages are monumentally unhelpful. This Web page is an attempt to rectify that situation by teaching you how to write and debug digest rules. I also provide links to all the rulesets I've written.

Existing Rulesets

If you want to undigestify something, the easiest approach is to use somebody else's work. :-) The second easiest is to adapt something that already exists. If one of the following rulesets matches your needs, just slap it into your .gnus file and you're done. I like to put my rulesets into an nndoc subdirectory in my search path, and then use the following code in my .gnus file to pick them up:

;; Load the shared helper functions first.
(require 'nndoc-generic-functions "nndoc/nndoc-generic-functions.el")
;; Then load every ruleset file found in the nndoc subdirectory.
(dolist (file (directory-files "~/elisp/nndoc" nil "\\.el\\'"))
  (load-library (concat "nndoc/" (file-name-sans-extension file))))

If no ruleset fits, try adapting something that's close before you start from scratch. (Note that some of the following rulesets are early efforts that don't do as much as some later ones. See the rulesets for Yahoo! Groups and Crypto-Gram for some useful techniques.)

Some rulesets are no longer maintained. I apologize if they don't work; I've stopped receiving those lists and so I'm not able to fix them.

ACLU Online newsletters.
ACM Technews summaries.
AWADmail from A.Word.A.Day. (No longer maintained.)
The CPSR Compiler. (No longer maintained.)
Bruce Schneier's Crypto-Gram.
Edupage news summaries. (No longer maintained.)
EFFector digests.
Some generic functions usable by other rulesets.
Digests from Google Groups.
NSF New Document Summary. (No longer maintained.)
Photography-on-the.net forum digests. With minor modifications, this will work for a lot of online forums that use vbulletin. (No longer maintained.)
Two common SANS digests: SANS NewsBites and SANS PrivacyBits, which thankfully have a common format.
SANS security digests. SANS digests are a PITA to parse, by the way. Every one of them has a slightly different format, for no reason whatsoever. The parser doesn't work perfectly, but it comes fairly close.
Slashdot daily headline mailers.
Yahoo! Groups. All mailing lists from Yahoo! Groups use the same format, making them easy to parse. (No longer maintained.)

Anyone who wishes to contribute additional rulesets is welcome to e-mail them to me. Please name them nndoc-xxx.el and include only one ruleset per file.

Writing a Digest Ruleset

This section is intended as a supplement to the GNUS documentation. Before you read this, you should familiarize yourself with what the Texinfo files have to say about adding new digests. If there's something you don't understand there, I suggest you don't try to puzzle it out, because it may become clearer here. It may be useful to reread the documentation after reading this page.

The basic idea of a new ruleset is that you must describe to nndoc how to find the beginning and ending of each article in the digest. Ideally, this is done with a few regular expressions. Sometimes (all too often, it seems) you will also have to write code that converts a badly formatted article into a more mail-like layout.
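As a rough illustration of what such a description looks like, here is a minimal sketch of a ruleset for a made-up digest format. The type name example-digest, the banner and trailer text, and both patterns are assumptions invented for this sketch; only nndoc-add-type and the nndoc-NAME-type-p naming convention come from nndoc itself.

```elisp
(require 'nndoc)

;; Hypothetical digest layout assumed by this sketch:
;;   Example Weekly Digest        <- banner line identifying the format
;;   Subject: first article       <- every article starts with "Subject:"
;;   (blank line, then the body)
;;   === END OF DIGEST ===        <- trailer line

(defun nndoc-example-digest-type-p ()
  "Return t if the current buffer looks like the made-up example digest."
  (save-excursion
    (goto-char (point-min))
    (and (re-search-forward "^Example Weekly Digest$" nil t) t)))

;; Register the type.  head-end and body-begin are left at their
;; defaults, so a blank line separates each header from its body.
(nndoc-add-type
 '(example-digest
   (article-begin . "^Subject: ")
   (file-end . "^=== END OF DIGEST ===$")))
```

A real digest will almost certainly need more than two patterns, but the shape stays the same: a recognizer function plus an alist of options.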

How `nndoc` Parses a Digest

The most important part of writing a ruleset is understanding the exact way gnus (i.e., the nndoc package) goes about turning a digest into individual messages. This process is very complex because it has tons of options. You need to know about all of the options, though, because they are the key to getting your ruleset to work correctly.

Digest processing is divided into two parts: dissection and display. During dissection, nndoc figures out exactly where each message starts and ends in the digest. The output of this process is an association list ("alist") that describes each individual message as a set of offsets. See the comments about nndoc-dissection-alist in the nndoc.el code for more information. This step is usually the killer; it's very hard to get it exactly right.

The second processing step happens during display. Here, the message is extracted from the digest (which is easy because of the offsets generated in step 1) and then reformatted for display. This is where you can make things look nice.

Dissection

Dissection is performed by the function nndoc-dissect-buffer. Understanding this function is key to writing correct rulesets. If you have problems, this is also the function to step through in the debugger. The output of nndoc-dissect-buffer is the alist mentioned above.

The steps performed by nndoc-dissect-buffer are as follows:

Preparation is performed once per digest:

Remove blank lines from the beginning of the buffer.
If a dissection-function is defined, call it and return the result, skipping all the other steps listed below.
If the file-begin pattern is defined, search for it.

Dissection is performed in a loop, until there are no more messages (articles) in the digest. In all cases, the term "bol-search" means "Search for the given regular expression, and set point to the beginning of the line containing it. If the regular expression is not found, set point to the beginning of the current line." The dissection loop is:

1. Find the beginning of the article. This is a complex step:
   1. If this is the first time through the loop and first-article is defined, bol-search for first-article.
   2. If article-begin-function is defined, call it. Note that there is no first-article-function. However, the free variable first is available to article-begin-function and is t for the first article, so the effect of a first-article-function can be achieved by testing first.
   3. Otherwise, bol-search for article-begin.
   All of these functions should leave point unchanged or at the beginning of the article header. If they don't, you can fix it up in the next step.
2. If there is a head-begin-function, call it. Otherwise, if head-begin is defined, bol-search for it.
3. If we are now at the end of the buffer, or if file-end is defined and we are looking at file-end, terminate the loop. (Note that this means file-end must always match from the beginning of a line, no matter how the digest is formatted.)
4. Otherwise, assume that there is a new article. Save the current position as the beginning of the article header.
5. Bol-search for head-end (default is "^$", i.e., a blank line). Save this as the end of the article header.
6. If body-begin-function is defined, call it to find the beginning of the body. Otherwise, bol-search for body-begin (default "^\n"). Save the result as the beginning of the article body. Note that this step can potentially cause information to be ignored between the article header and body. Also note that because the pattern includes a newline instead of a dollar sign, the position saved is after the blank line rather than at it.
7. Find the end of the article body and save its position. This step is complex:
   1. If body-end-function is defined, call it and use the resulting value of point. body-end-function must return a non-nil value or the following steps will be executed.
   2. If body-end is defined, bol-search for it.
   3. Otherwise (including search failures for body-end), search for the beginning of the following article using the procedure in Step 1 above, subparts (2) and (3).
   4. If the beginning of the following article can't be found, go to the end of the file. If file-end is defined, search backwards for it and go to the beginning of that line.
8. Add the article number and the saved positions to the dissection alist.
9. If generate-head-function is defined, call it to generate fake headers for the article. Otherwise, simply grab the lines between the beginning and end of the article header and call them the headers. In either case, add a "Lines:" header with a calculated line count. (Note: the important header material depends on what you show in your summary buffer. Typically, "Subject:", "From:", and maybe "Date:" are useful things to generate.)
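The "bol-search" operation that the loop relies on can be sketched as a small function. This is only an approximation of the behavior described above, handy for experimenting in a scratch buffer; the name example-bol-search is made up, and nndoc's own internal helper may differ in detail.

```elisp
(defun example-bol-search (regexp)
  "Search forward for REGEXP, then move point to the beginning of the line.
Return non-nil if REGEXP was found.  If it was not found, point still
moves to the beginning of the current line."
  (prog1 (re-search-forward regexp nil t)
    ;; Runs whether or not the search succeeded.
    (beginning-of-line)))
```

Under this sketch, a pattern that ends in a newline (like the default body-begin, "^\n") leaves point on the line after the match, which is why the body position saved by the loop is after the blank line rather than at it.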

Whew! That's a complicated mess. Fortunately, you often don't need to understand it in detail. It's documented above in case you need to debug something. But the general summary is:

When both are defined, nndoc always prefers a -function over a pattern.
Find the header, using first-article as the pattern for article #1.
Find the end of the header.
Find the start of the body.
Find the end of the body, or the end of the file.
Save the headers, either real or generated.

That makes it much simpler, right?

Display

The second layer of processing comes when it's time to display the article. This is much simpler:

In an empty buffer, insert the header of the article (as noted during dissection).
Insert a blank line.
Insert the body.
Go to the beginning of the body.
If prepare-body-function is defined, call it.
If article-transform-function is defined, call it.
Process the result like a normal mail message. In particular, this means highlighting certain header fields, "washing" the body according to your preferences, etc.

I've found that the most important detail is that article-transform-function needs to produce "proper" headers. For example, the subject should be preceded by "Subject: " (including the blank). I also find it very useful to create "From:", "Cc:", and "Reply-To:" lines designed so that I can just use the "reply" and "wide reply" features to reply to article authors or the entire mailing list. Thus, for example, when I recognize Yahoo! Groups digests I save the group name in nndoc-yahoo-groups-cc, and the nndoc-transform-yahoo-groups-article function inserts a "Cc:" line to that group. The result is that I can reply to an individual or wide-reply to the entire group, as needed.
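A transform function along these lines might look like the following sketch. It is not taken from any of the rulesets listed earlier: the function name, the variable, and the list address are all hypothetical, and the optional argument is there on the assumption that nndoc passes the article number to the function.

```elisp
(defvar example-digest-list-address "list@example.invalid"
  "Hypothetical mailing-list address used for the generated Cc: line.")

(defun example-digest-transform-article (&optional _article)
  "Give the article in the current buffer mail-like headers.
Assumes the first line of the buffer is the article's subject."
  (goto-char (point-min))
  ;; Turn the bare first line into a real Subject: header.
  (unless (looking-at "Subject:")
    (insert "Subject: "))
  ;; Add a Cc: line so that a wide reply goes to the whole list.
  (forward-line 1)
  (insert "Cc: " example-digest-list-address "\n"))
```

It would be hooked into a ruleset with an entry such as (article-transform-function . example-digest-transform-article).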

Summary of `nndoc` Variables

Here's a summary of all the options you can set for an nndoc digest type. All "find" functions can leave point anywhere in the line found; nndoc will move to the beginning of that line before proceeding. Unless otherwise specified, all options are "if defined"; the default is to simply do nothing. Also, all patterns and functions are used during dissection, with the exception of article-transform-function and prepare-body-function.

article-begin-function: Called to find the beginning of each article. Must return t if an article is found, nil otherwise. If there are no more articles, should leave point at the end of the buffer or at a line matching nndoc-file-end.
article-begin: Pattern that matches the beginning of an article. After this pattern matches, point should be somewhere on the first meaningful line of the article. NOTE: it may be necessary for this pattern to also match nndoc-file-end, so that the EOF check in step 3 above can work.
article-transform-function: During display, arbitrarily transforms the article. Often used to generate RFC-compliant header lines (nonblank characters followed by colon) at the beginning of nonconforming articles. See also prepare-body-function. Note that if necessary, you can extract information from the original unparsed article; see the Google Groups code for an example.
body-begin-function: Called to find the beginning of an article body.
body-begin: Pattern that matches the beginning of an article body. Default is "^\n".
body-end-function: Called to find the end of an article body. Must return t if another article follows this one, nil otherwise.
body-end: Pattern that matches the end of an article body. Default is the beginning of the next article, or the end of the file.
file-begin: Pattern that matches the beginning of the digest. This feature is almost the same as first-article. The difference is that first-article can stand entirely alone, while file-begin is followed by a search for either first-article (if defined) or article-begin.
file-end: Pattern that matches the first meaningful line that marks the end of the digest. Note that file-end will only work properly if either (a) body-end-function and body-end are undefined, or (b) the body-end functions leave point at or before the file-end line.
first-article: Pattern that matches the beginning of the first article in the digest, in case the first article is distinguished differently. Often, this is a multi-line pattern (with embedded newlines). For many digest formats, however, it is better to leave first-article unset and use file-begin to skip past the garbage at the front of the file.
generate-head-function: Called after dissection to generate (possibly fake) headers that will be used to build the group summary buffer. Must switch to nndoc-current-buffer to extract relevant information, then return to the original buffer and insert generated headers there. This function must modify the article buffer. Use an existing one as a guide for writing your own.
head-begin-function: Called to position point at the beginning of the article header. If there are no more articles, should leave point at the end of the buffer or at a line matching nndoc-file-end.
head-begin: Pattern that matches the first line of the article header. NOTE: it may be necessary for this pattern to also match nndoc-file-end, so that the EOF check in step 3 above can work.
head-end: Pattern that matches the last line of the article header. Default is "^$".
prepare-body-function: During display, arbitrarily prepares the article body for display. Most commonly used to remove quoting in embedded articles (e.g., MIME digests), but can do whatever it wants. Called with point at the beginning of the body, but can go to point-min if it wants to muck with the article headers as well; in this sense it duplicates article-transform-function (q.v.).

Debugging a Digest Ruleset

Rulesets are hard to write correctly. No matter how hard you try, you'll make mistakes, and then you're stuck with figuring out what went wrong.

One thing to remember is that nndoc caches some information for speed. Whenever you change your rulesets, go to a different article than the one you're working on, and type "C-d" to enter it. It doesn't matter if it's a digest or not; the point is to get nndoc to clear its cache. Then return to the article in question and try it again.

Common Mistakes

Some mistakes happen over and over again. Here are some common problems and suggested solutions:

I just get a bell when I try to enter the digest. This is the most common symptom of a failed pattern set. Unfortunately, it's very hard to debug; you may have to step through the code (see below). First, though, go into the *Article* buffer and check your patterns. Every option listed above is saved in nndoc-option-name. For example, the head-begin pattern is in nndoc-head-begin. You can use ESC : to execute an Elisp expression that experiments with those patterns. For example, use ESC : (re-search-forward nndoc-first-article) RET to see if you're correctly finding the first article in the digest. Remember that point must wind up on the first line of the article header (unless head-begin-function is going to correct it).
Every second message is missing. Perhaps you have a head-begin pattern that skips past the article beginning found by article-begin. Usually, head-begin should be unset.
Every second message is missing. You have an article-begin pattern that matches multiple lines, but no body-end pattern. The result is that the end of the body extends into the beginning of the following article, so that a subsequent article-begin search won't find the beginning of that article. The solution is to define a body-end pattern that matches only the first line of the article-begin pattern, or to define a body-end-function that finds the beginning of the proper area. I often use the following body-end function:
```
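	(defun nndoc-generic-body-end ()
	  "Move point to the end of the current article body and return t.
	The body ends just before the next line matching `nndoc-article-begin'
	or `nndoc-file-end'; both patterns are assumed to be set."
	  (and (re-search-forward
		(concat nndoc-article-begin "\\|" nndoc-file-end)
		nil t)
	       ;; Back up to the start of the match ...
	       (goto-char (match-beginning 0))
	       ;; ... skip the whitespace that separates it from the body ...
	       (skip-chars-backward " \t\n")
	       ;; ... and keep the newline that ends the body's last line.
	       (if (eq (following-char) ?\n)
		   (forward-char 1)))
	  ;; Always return non-nil so that nndoc uses this position.
	  t)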
```
All articles are blank. Your head-end pattern takes you into the article body, so that the body-begin pattern matches the blank line at the end of the article. Then body-end matches that same place.
The group summary is incorrect. Your generate-head-function isn't creating plausible RFC-compliant headers.
Sometimes the first line of an article is missing. The article doesn't use RFC-compliant headers, and you didn't write an article-transform-function. In the absence of proper headers, gnus guesses that the first line of the article is a subject. But if the subject has a colon in it, gnus gets confused. The solution is simple: insert "Subject: " (with the blank) in front of the first line.

Serious Debugging

If the above hints don't get you going, you're kind of up a creek. It would be nice if there were some special functions to help debugging. For example, it would be really cool to be able to go into an article buffer, type M-x nndoc-show-markers RET, and see colorization that describes how nndoc parsed the buffer. Maybe someday.

Until then, you have two tools: experimenting with individual parameters, and stepping through the relevant code.

The very first thing to do is to verify that your nndoc-foo-type-p function works. Go to *Article* and type ESC : (nndoc-foo-type-p) RET where foo is the name of your added type (e.g., technews-summary). It should return t. If not, fix that function so that it correctly recognizes your digest. Be as selective as possible; you don't want your TechNews recognizer to try to parse RFC 1153-compliant digests.

If your type-recognition function seems to work, double-check it by looking at the contents of nndoc-article-type. If that's wrong, some other type may have beaten you to the punch. Use the second argument of nndoc-add-type to control this problem. Also, remember that if the type-recognition function returns a number, it's taken as a priority, so be sure it returns t if it's certain it's found the correct type.
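Both controls can be seen in the following sketch. The type name example-forum, the header line, and the banner text are invented for illustration, and the number 10 is an arbitrary value, not a recommendation.

```elisp
(require 'nndoc)

(defun nndoc-example-forum-type-p ()
  "Guess whether the current buffer is the made-up example-forum digest.
Return t when a header line proves it, a number when the banner only
suggests it, and nil when the buffer is clearly something else."
  (save-excursion
    (goto-char (point-min))
    (cond ((re-search-forward "^X-Example-Forum-Digest: " nil t) t)
          ((re-search-forward "^Example Forum" nil t) 10)
          (t nil))))

;; The second argument asks nndoc to check this type before the others.
(nndoc-add-type
 '(example-forum
   (article-begin . "^Subject: "))
 'first)
```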

The next step is to check all your patterns. In *Article*, search for each pattern you defined. If the type recognizer succeeded, each pattern will be saved in a variable with the same name, preceded by nndoc-. So, for example, start with ESC : (re-search-forward nndoc-first-article) RET. Make sure each pattern matches what it's supposed to, and that it leaves point somewhere in the line that's at the beginning or end of the header or body, as appropriate.
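Checking the patterns one at a time gets tedious, so a small helper can try them all at once. The following is a sketch, not part of nndoc: the command name is made up, and it only reports whether each pattern matches somewhere in the buffer, not whether it matches in the right place.

```elisp
(defun example-nndoc-check-patterns ()
  "Report where each nndoc pattern first matches in the current buffer.
Only pattern variables that currently hold a string are checked."
  (interactive)
  (let (report)
    (dolist (var '(nndoc-file-begin nndoc-first-article nndoc-article-begin
                   nndoc-head-begin nndoc-head-end nndoc-body-begin
                   nndoc-body-end nndoc-file-end))
      (let ((pattern (and (boundp var) (symbol-value var))))
        (when (stringp pattern)
          (push (format "%s: %s" var
                        (save-excursion
                          (goto-char (point-min))
                          (if (re-search-forward pattern nil t)
                              (format "first match on line %d"
                                      (line-number-at-pos (match-beginning 0)))
                            "NO MATCH")))
                report))))
    ;; Show one line per pattern, in the order listed above.
    (message "%s" (mapconcat #'identity (nreverse report) "\n"))))
```

Run it with M-x example-nndoc-check-patterns RET in *Article* after trying to enter the digest; any "NO MATCH" line is a good place to start looking.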

Using the Debugger

If none of this helps, you need the debugger. Before you start debugging, make sure you have non-compiled code by explicitly loading the file "nndoc.el" (use the locate command to find it). In the group summary buffer, select the digest and use C-u g to get the "raw" version that nndoc looks at. Then use M-x debug-on-entry RET nndoc-dissect-buffer RET to set a breakpoint. Type C-d to enter the digest, hit "d", then "c". At this point the buffer should have been dissected, and the results are available in the variable nndoc-dissection-alist. You can look at the values with ESC : nndoc-dissection-alist RET or (better) go into the *scratch* buffer to look at it. The alist will be too long to see all of it, but you can check some of the values to see if they look reasonable. Copy those values into another window (I like to copy and paste into "cat >/dev/null" in a shell window to record this sort of information). You can then go into the *Article* buffer and use M-x goto-char RET to go to the various places in the buffer and see if they seem reasonable.
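Because the alist is too long for the echo area, it can help to pretty-print it into a buffer of its own. The command below is a sketch with a made-up name; it assumes only that nndoc-dissection-alist holds the result of the last dissection.

```elisp
(require 'pp)

(defun example-nndoc-show-dissection ()
  "Pretty-print `nndoc-dissection-alist' into a buffer and display it."
  (interactive)
  (let ((buf (get-buffer-create "*nndoc-dissection*")))
    (with-current-buffer buf
      (erase-buffer)
      ;; Print nil rather than signalling an error if nndoc is not loaded.
      (pp (and (boundp 'nndoc-dissection-alist)
               (symbol-value 'nndoc-dissection-alist))
          buf))
    (display-buffer buf)
    buf))
```

The positions shown there can then be fed to M-x goto-char RET in the *Article* buffer as described above.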

If you have trouble generating the alist, or if it looks very wrong, you can step through your dissection functions (if any) or nndoc-dissect-buffer itself. While stepping, the command ESC : (switch-to-buffer nndoc-current-buffer) RET will put you into the buffer that is being dissected, so you can look at what the functions are seeing. Likewise, ESC : (switch-to-buffer nntp-server-buffer) RET will put you into the article buffer that is being built. Note that the latter includes special NNTP codes; those aren't a mistake.

If the alist looks OK and you can get a group summary, but can't see an individual article correctly, you probably have display-related problems. Use M-x cancel-debug-on-entry RET nndoc-dissect-buffer RET to turn off debugging, then M-x debug-on-entry RET nndoc-request-article RET to set a new breakpoint. This time, use only d to step through the function. After the second time insert-buffer-substring is called, you can use ESC : (switch-to-buffer buffer) RET to temporarily get into the scratch buffer where the article is being built. This will let you see what the transformation functions are about to work on. Use C-x b RET to return to the debugger buffer, and step through your own code with d. At any time, you can see the current state of the buffer (including point) by repeating the ESC : (switch-to-buffer buffer) RET command. (A handy shortcut is C-x ESC ESC, which repeats the last command; often, that's the switch-to-buffer command, or at least you can get there with a few M-p keystrokes.)

Finally, if you are getting inexplicable behavior (i.e., the changes you make don't seem to take effect, or you breakpoint on a function that you know is being called and the debugger isn't entered), try exiting GNUS and reentering. Sometimes, stuff gets cached in weird places.

It's fair to say that the debugging process is sometimes painful. However, the end result is well worth it: you type C-d on a big digest with tons of messages, and they're nicely broken up (and even threaded) for your reading convenience.

This text is explicitly placed in the public domain. Feel free to use it, extend it, modify it, abuse it, or destroy it as you wish.
