Later this week I will begin placing a few limits on article retention within Feed Wrangler. This is something that I’ve always known I would have to address at some point, and it has finally grown into enough of a concern that I must begin now.
What I’m changing.
The simple version of what I’m changing is to only retain the 500 most recent articles for any given feed.
On a periodic basis I will go through the feed lists and remove any article that falls outside of that 500-article limit. Removed articles will no longer be indexed for search or visible on the site. For most sites this will mean removing articles that are several months if not years old.
This limit matches similar limits I found in most of the competing services I checked.
Why I’m changing it.
While I would theoretically love to be able to retain all the articles ever seen within the system, that is simply not practical. The main Feed Wrangler database is currently sitting around 1.1 Terabytes in size, with an additional 600 Gigabytes in search indexes. At that size I am running into some pretty serious constraints in terms of things like backup and performance.
The reality is that continuing without retention limits roughly equates to trying to index a not insignificant portion of the internet’s fresh daily content. This may have been practical for Google back with Reader but isn’t for a small operation like Feed Wrangler.
On the plus side, the actual impact this change will have on most users will be relatively minor. Only around 14% of feeds tracked within Feed Wrangler currently have more than 500 articles, so the vast majority of feeds will be untouched. Those 14% of feeds, however, account for a disproportionate share of storage needs, so truncating them will free up a tremendous amount of resources.
The popular sites that will be affected are mostly news aggregation outlets, sites like The Verge or TUAW which publish dozens of articles each day. These will now likely hold a few months of articles within the system rather than a year’s worth.
What I’m not changing.
Almost every other RSS service has a policy in place where it automatically marks articles as read after a certain period of time (typically around a week or a month). I’ve never really liked the way those policies feel. I think marking something as read is an action for the user to take, not for me to somewhat arbitrarily apply. So I won’t be doing anything like that at Feed Wrangler. Articles within your feed lists will remain unread for as long as you like, or until they eventually fall outside the 500 article limit for that feed.
I’m also not removing articles that have been starred within Feed Wrangler. If you’ve marked something as important to you I want to do my best to keep it around. So any article that has at least one star will be skipped over for retention purposes.
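To make the rule concrete, here is a rough sketch in Python of how the selection could work for a single feed. It is only an illustration, not the actual Feed Wrangler code: the `Article` class, its `star_count` field, and the `select_for_removal` function are names invented for this example. It also assumes that a starred article outside the newest 500 is simply kept in addition to them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RETENTION_LIMIT = 500  # per-feed limit described in this post


@dataclass
class Article:
    # Hypothetical shape of an article record, for illustration only.
    id: int
    published_at: datetime
    star_count: int = 0  # number of users who have starred the article


def select_for_removal(articles, limit=RETENTION_LIMIT):
    """Return the articles in one feed that the retention pass would remove.

    An article is removed only if it falls outside the `limit` most recent
    articles AND nobody has starred it.
    """
    newest_first = sorted(articles, key=lambda a: a.published_at, reverse=True)
    overflow = newest_first[limit:]  # everything older than the newest `limit`
    return [a for a in overflow if a.star_count == 0]


# Example: a feed with 503 articles, the oldest of which has been starred.
base = datetime(2014, 1, 1)
feed = [Article(id=i, published_at=base + timedelta(days=i)) for i in range(503)]
feed[0].star_count = 1
removed_ids = sorted(a.id for a in select_for_removal(feed))  # ids 1 and 2
```

In this example the three oldest articles fall outside the limit, but only two are selected for removal because the oldest one is starred. A feed with 500 or fewer articles would return an empty list and be left untouched.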
I’m not currently expecting to impose limits on podcasts tracked by Pod Wrangler. The nature of podcasts is such that the volume is fundamentally lower.
What this allows in the future.
I’m honestly rather relieved to finally be doing this. I knew that one day I would have to (otherwise the database would just grow without bound). This change will allow me to finally tackle a few aspects of the sync system that have lain a bit fallow over the last year. I’m planning to build better item change tracking to catch edits made to articles after their initial publication, as well as a variety of performance improvements.
The end result is that Feed Wrangler will be a better service once these limits are in place. I can move away from being so worried about keeping up with scaling the database and instead focus on polishing the core experience.