Creating an email archive with public-inbox

 2 years ago
source link: https://lwn.net/Articles/748184/

Keeping up with the free-software development community requires following a lot of mailing lists. For many years, the Gmane email archive has helped your editor to do that without going any crazier than he already is, but Gmane is becoming an increasingly unreliable resource. A recent incident increased the priority of a longstanding goal to find (or create) an alternative to Gmane. That, in turn, led to the discovery of public-inbox.

The decline of Gmane

At its peak, Gmane was by far the best way to follow many dozens of mailing lists. It holds archives of a vast number of lists — the front page currently claims over 15,000 — so most of the lists of interest can be found there. Crucially, Gmane offers an NNTP feed; newsreaders are the fastest way that your editor has found to quickly get through a day's email and pick out the interesting messages. Gmane also offered a web-based view into the archive that could be easily linked using a message ID; that made it easy to capture emails and link them back to the context in which they were sent.

Gmane was created by Lars Magne Ingebrigtsen, who operated it for many years before burning out and moving on in 2016. A company called Yomura picked up the archive and continued operating the NNTP feed, but that is where things stopped. The web interface disappeared, never to return, breaking thousands of links across the net. The front page still says "some things are very broken" and links to a blog page that was last updated in September 2016. Gmane has appeared to be on minimal life support for some time.

In mid-February, Gmane stopped receiving emails from every mailing list hosted at vger.kernel.org; those include most of the kernel-related lists, but also lists for other projects like Git. Your editor posted a query and learned that delivery problems had forced Gmane to be dropped from all lists hosted at vger. While this was happening, the main Gmane web page also ceased to work. Since then, a handful of vger lists have returned to Gmane, though the bulk of them remain unsubscribed.

Those lists could certainly be fixed too, if somebody were to find the right person to poke. But the fact that so many high-profile lists could disappear for a week or more without anybody even seeming to notice makes it clear that Gmane is not getting a lot of attention these days. The wait for the web interface to come back is in vain; it's not at all clear that even what's there now is going to last for much longer.

Gmane has served the community well for years, and we all owe the people who have worked to make that happen a huge round of thanks. But all things must end, and it may well be that Gmane's time is coming soon. So what is a frantic LWN editor to do to ensure his ability to keep up with the community?

public-inbox

In the same discussion mentioned above, Konstantin Ryabitsev mentioned that the Linux Foundation is working with a project called public-inbox to create a comprehensive archive for the linux-kernel list. That inspired your editor to go and take a look. The conclusion is that public-inbox may well be the tool for this job, but there are some rough edges to be smoothed out first. The first of those could be said to be the project's web site, which is an unadorned directory listing containing a handful of documentation files.

To summarize: public-inbox can be used to implement an archive for one or more mailing lists. There is a web interface (see the page for the project's own mailing list for an example); it is functional but not necessarily designed for aesthetic appeal. There is a search facility implemented with Xapian that can make it easy to find messages of interest, though it lacks notmuch-style tags. Public-inbox also, happily, implements an NNTP interface to the archive.

Public-inbox, created and almost exclusively developed by Eric Wong, does not appear to have the creation of a Gmane-style mailing-list archive as its primary use case. Instead, it is a tool allowing people to follow (and participate in) mailing lists without the hassle of actually subscribing to them. That shows up in various ways in the design of the system.

For example, there is an interesting design decision at the core of public-inbox: each mailing-list archive is stored in a Git repository. Every incoming message is added to the repository in its own file in a separate commit; the Git history is thus the history of incoming email. A bare Git repository is normally used, so there is no need to duplicate the emails themselves. Viewing an email requires locating its file and checking it out of the repository — though none of that activity is visible to users of the system.
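To make the "one file per message" layout more concrete, here is a minimal Python sketch of how a message might be mapped to a file path inside such a repository. The article does not describe the naming scheme; the sketch assumes the path is derived from the SHA-1 hash of the Message-ID, split into a two-character directory and a file name. The helper `message_path` is hypothetical and is not part of public-inbox.

```python
import hashlib
from email import message_from_string


def message_path(raw_message: str) -> str:
    """Return a repository-relative path for one message.

    Assumption (not from the article): the path is the SHA-1 of the
    Message-ID, without angle brackets, split as 2 + 38 hex digits.
    """
    msg = message_from_string(raw_message)
    message_id = msg["Message-ID"]
    if message_id is None:
        raise ValueError("message has no Message-ID header")
    # Strip surrounding whitespace and the enclosing angle brackets.
    mid = message_id.strip().lstrip("<").rstrip(">")
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"
```

With a scheme like this, finding a message by its ID needs no index beyond the Git tree itself: hash the ID, then look up the resulting path.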

This use of Git would appear to be driven by a desire to make it easy for others to duplicate a specific list archive. And, perhaps more to the point, readers can "subscribe" to the list by periodically pulling new messages from the archive repository. There is a tool (called ssoma) that can be used to feed messages from a public-inbox repository into an email client. When readers get tired of a specific mailing list, they need only stop pulling from the relevant repository; no "unsubscribe" operations are needed. Whether people really want to follow mailing lists in this manner is unclear, but the capability is there.
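The "subscribe by pulling" model can be sketched in a few lines of Python wrapped around ordinary Git commands. This is an illustration only: the repository URL and local path are placeholders, `sync_commands` and `sync` are hypothetical helpers, and feeding the fetched messages into a mail client (the job the article assigns to ssoma) is not shown.

```python
import subprocess
from pathlib import Path
from typing import List


def sync_commands(url: str, mirror: Path) -> List[List[str]]:
    """Return the git commands that create or refresh a local mirror."""
    if mirror.exists():
        # Already "subscribed": just fetch whatever has arrived since last time.
        return [["git", "--git-dir", str(mirror), "fetch", "--quiet", "origin"]]
    # First run: a bare mirror clone, so no working tree duplicates the mail.
    return [["git", "clone", "--quiet", "--mirror", url, str(mirror)]]


def sync(url: str, mirror: Path) -> None:
    """Run the commands; requires git on PATH and network access."""
    for argv in sync_commands(url, mirror):
        subprocess.run(argv, check=True)
```

Running `sync()` periodically (from cron, say) is the whole subscription. Unsubscribing amounts to no longer running it and deleting the mirror directory.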

There are various ways of feeding email into a public-inbox repository. The source comes with an import_maildir script that took many hours to import a 500,000-message linux-kernel archive. It is a somewhat fragile tool, crashing easily on email with malformed headers, but it worked well in the end and public-inbox is quite responsive with an archive of that size — at least, until it decides to run git prune on the repository. The public-inbox-mda utility will read a message from the standard input and inject it into an archive; it is meant to be used from a .forward or .procmailrc file. There is also public-inbox-watch, which will keep an eye on a maildir directory and feed new messages to the archive as they arrive. In general, setting up a new archive is a simple and easily scripted task once one understands how the utilities work.
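The maildir-watching approach can be illustrated with Python's standard `mailbox` module. This sketch is not how public-inbox-watch is implemented; it only shows the general idea of polling a maildir and handing each unseen message to an injection step. The function `deliver_new` and its `inject` callback are hypothetical names.

```python
import mailbox
from typing import Callable, Set


def deliver_new(maildir_path: str, seen: Set[str],
                inject: Callable[[bytes], None]) -> int:
    """Pass every not-yet-seen message in the maildir to inject().

    `seen` holds the maildir keys already handled and is updated in place.
    Returns the number of messages delivered on this pass.
    """
    box = mailbox.Maildir(maildir_path, create=False)
    delivered = 0
    for key in sorted(box.keys()):
        if key in seen:
            continue
        inject(box.get_bytes(key))  # raw message bytes, headers included
        seen.add(key)
        delivered += 1
    return delivered
```

In a real setup, `inject` could pipe the bytes to a delivery command's standard input, which is how the article describes public-inbox-mda receiving messages. A long-running watcher would also persist `seen` between runs.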

A young project

The initial commit to the public-inbox repository was made in January 2014, just over four years ago. Since then, some 1,300 commits have built it up to 11,000 lines of code or so. In many ways, though, public-inbox feels like a young project that is still working to get some of the basic functionality in place. It will certainly need some work before it can be used to create archives that run at any sort of scale.

The project's documentation can be accurately described as "spartan", leaving much for the user to figure out on their own. To keep that task from being too easy, many of the commands will just silently fail if something is not set up to their liking. For example, public-inbox-mda will silently drop messages on the floor if the given mailing-list name does not appear in the To or CC headers. Your editor has more than once had to resort to placing print statements in the code (which is all Perl 5, tragically) in order to figure out where things were going wrong.
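The recipient check described above is easy to reproduce, and doing so outside the archive can help explain why a message went missing. The following Python sketch is not taken from the public-inbox source (which is Perl); `addressed_to_list` is a hypothetical helper that performs the same kind of test and logs a warning where a silent drop would otherwise occur.

```python
import logging
from email import message_from_string
from email.utils import getaddresses

log = logging.getLogger("archive-filter")


def addressed_to_list(raw_message: str, list_address: str) -> bool:
    """Return True if list_address appears in the To or Cc headers."""
    msg = message_from_string(raw_message)
    # Header lookup is case-insensitive, so "Cc" also matches "CC".
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _name, addr in getaddresses(headers)}
    if list_address.lower() in recipients:
        return True
    log.warning("would drop %s: %s is not in To or Cc",
                msg.get("Message-ID", "<no message-id>"), list_address)
    return False
```

Messages that reached the list through a Bcc or an alias are the usual victims of a check like this, since the list address never appears in their visible headers.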

Other glitches abound. The web interface offers no customization or theming support. The NNTP server does not create proper Xref headers for messages that are cross-posted to more than one list, meaning that a reader of both lists will see a lot of duplicates. There are no tools for monitoring the flow of emails into the archive or troubleshooting problems. The Git-based design could make it interesting to remove an old email from the archive, should that become necessary — from looking at the code, it appears that rebasing the repository would break the archive, though your editor has not actually run this experiment. The X-No-Archive header is not honored. There are concerns about scalability to huge archives. There is also no word about what the project has done, if anything, to ensure the security of code that is exposed to the Internet via the email stream and the HTTP and NNTP ports.
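For readers unfamiliar with the Xref header mentioned above: it names the server followed by one `group:number` pair for every newsgroup in which the article appears, which lets a newsreader mark a cross-posted message as read everywhere once it has been seen in one group. The sketch below builds and parses such a header in Python. The server and group names are made up, and `build_xref` and `parse_xref` are hypothetical helpers, not public-inbox code.

```python
from typing import Dict, Tuple


def build_xref(server: str, locations: Dict[str, int]) -> str:
    """Format an Xref header from {group name: article number}."""
    if not locations:
        raise ValueError("an Xref header needs at least one location")
    parts = [f"{group}:{number}" for group, number in sorted(locations.items())]
    return f"Xref: {server} " + " ".join(parts)


def parse_xref(header: str) -> Tuple[str, Dict[str, int]]:
    """Split an Xref header back into (server, {group: number})."""
    value = header.split(":", 1)[1].strip()
    server, *locs = value.split()
    pairs = (loc.rsplit(":", 1) for loc in locs)
    return server, {group: int(number) for group, number in pairs}


# A message cross-posted to two (hypothetical) groups:
print(build_xref("news.example.org", {"list.b": 7, "list.a": 42}))
# → Xref: news.example.org list.a:42 list.b:7
```

A server that emits only the current group in this header gives the reader no way to connect the copies, which is the source of the duplicates described above.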

Still, it seems that public-inbox has the core features that are needed to set up a no-nonsense email archive without a huge amount of work. Its simplicity is a nice contrast to something like HyperKitty, which quickly leads a hopeful user into a morass of Django setup and dependencies — and which lacks an NNTP server. There is enough apparent potential here that the Linux Foundation is funding some work to improve the scalability of public-inbox for its linux-kernel archive project. If public-inbox can generate some more interest and grow beyond an essentially single-developer project, it may well come to fill an important niche in our community.

