django-galaxy: a reusable feed aggregator in Django

I’m flooding the site with Django stuff lately, but I’ve had a recent burst of productivity that I attribute almost solely to my reading of James Bennett’s new book, Practical Django Projects.

This particular project isn’t completely done yet, but it mostly works. I’m making it public now to force myself to get it to where I want it to be.

The idea is this: I maintain one or two sites that aggregate posts from other people’s blogs. Typically these are called “planet” sites after the first popular bit of software that performed this task. They’re very popular for aggregating blog posts from tight-knit communities.

There is a Django project I use on ArsLounge called FeedJack, but there are a lot of things I don’t find to be optimal. One of those is that it’s hard to just plug FeedJack into your existing projects and start overriding things in a Django-ish way.

With that, I’ve started a simple project I call django-galaxy that aims to fill that gap for me. It’s essentially some models to represent Blogs and Posts plus some basic templates for display, and it leans on generic views (and other such things) wherever it can so that it stays as extensible and configurable as possible.

The real magic comes in with the script you have to run on a regular basis to comb your feeds. It uses the venerable feedparser project by Mark Pilgrim and goes above and beyond to determine whether an RSS feed has bad date support and whether it supports tagging (using django-tagging when appropriate). There’s a bunch of other junk in there to handle janky feeds, which are more prevalent than you might imagine!
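To give a flavor of the date handling, here is a minimal sketch and not the actual django-galaxy code. It assumes each entry is a dict-like object of the kind feedparser returns, with optional `published_parsed` and `updated_parsed` values; the function name `entry_datetime` and the `fallback` parameter are made up for illustration.

```python
from datetime import datetime, timezone


def entry_datetime(entry, fallback=None):
    """Pick the best available timestamp for a parsed feed entry.

    `entry` is assumed to be dict-like (feedparser entries are).
    `fallback` is a hypothetical parameter: the value to use when the
    feed gives no usable date, e.g. the time the feed was fetched.
    """
    for key in ("published_parsed", "updated_parsed"):
        parsed = entry.get(key)
        if parsed:
            # feedparser exposes parsed dates as UTC struct_time values;
            # the first six fields are year..second.
            return datetime(*parsed[:6], tzinfo=timezone.utc)
    # No usable date in the feed: fall back rather than dropping the post.
    return fallback or datetime.now(timezone.utc)
```

In a real run you would call this on each item in `feedparser.parse(url).entries` and probably pass the fetch time as the fallback, so that date-less posts sort near the time you first saw them.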

Since many planet-esque sites are topical in nature and you can’t always come into possession of a category/tag feed for every site, I’ve also included a method of processing entries to determine if they are “on topic” for your site.

Right now I’ve got it a bit too coupled with my testing set of Django sites, but you can get an idea of how it works. Basically, it helps me filter out posts on xxxxx topic so I can exclude someone’s posts about their family vacations, which are not on topic for a topical aggregation site.

This works by performing a keyword search across a post’s content and subject. You can provide a dictionary of as many terms as you find necessary.
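Roughly, the keyword check could look like the sketch below. This is an illustration under my own assumptions rather than the code in the repository: the name `is_on_topic` is hypothetical, matching is case-insensitive on whole words, and the terms can be any iterable (iterating a dictionary yields its keys).

```python
import re


def is_on_topic(title, content, keywords):
    """Return True if any keyword appears as a whole word in the post.

    `keywords` may be a list, set, or dict of terms; only the terms
    themselves (dict keys) are used in this sketch.
    """
    text = f"{title}\n{content}".lower()
    for keyword in keywords:
        # Lookarounds act as word boundaries but also work for terms
        # that end in punctuation, such as "c++".
        pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False
```

Whole-word matching keeps a term like “orm” from matching “format”, at the cost of missing plurals unless you list them too.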

Another neat thing is a method to clean up what I call junky feeds. In my experience, some HTML markup that ends up in feeds does not lend itself well to re-aggregation. So I’ve written a process to strip out comments, div tags, header tags, weird paragraph tags, and such.
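Here is one way such a cleanup pass might be written using only the standard library. It is a sketch, not the process in django-galaxy: the names `FeedCleaner`, `clean_html`, and `STRIP_TAGS` are hypothetical, and I am assuming that div and header tags are removed while the text inside them is kept. It does not try to deal with the weird paragraph tags.

```python
from html.parser import HTMLParser

# Tags to remove; their inner text is kept (an assumption of this sketch).
STRIP_TAGS = {"div", "h1", "h2", "h3", "h4", "h5", "h6"}


class FeedCleaner(HTMLParser):
    """Re-emit markup, dropping comments and the tags in STRIP_TAGS."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in STRIP_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, as written.
        if tag not in STRIP_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in STRIP_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    # handle_comment is left as the default no-op, so comments vanish.


def clean_html(markup):
    parser = FeedCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

A parser-based pass like this tends to cope with sloppy markup better than a pile of regular expressions, which matters given how junky some feeds are.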

If you’re interested, please take a look at my first check-in on Google Code (which I’ve been working on in another repository for a while now) and let me know what you think!