Earlier today I watched the presentation Jonathan Oxer gave at LCA about package caching solutions.
Although it was certainly an interesting presentation and although I very much
agree that my current local mirror is wasting a lot of diskspace and bandwidth,
I'm still not going to switch from
debmirror to any of the available
caching solutions, because (unless I'm really missing something) none of them
scratches my itch.
My local mirror currently consists of five architectures (i386, amd64, hppa, sparc and s390) and only has unstable and testing. I use it for:
- (fast and convenient) updating of my systems
- doing Debian Installer builds and installation tests
- test builds of installation CDs (using
debian-cd)
Now, the last one is somewhat hard (debian-cd uses hardlinks to packages on
a local mirror instead of retrieving packages), so let's concentrate on the
first two.
Caching is great if you have a large number of machines – all of the same architecture and all likely to need roughly the same packages – sitting behind the proxy: the first one triggers the download of a package and the rest get it almost instantaneously. It is a lot less great if you have only one, maybe two systems per architecture: most of the time you'll still end up going down that (relatively) slow ADSL connection.
An important reason why I have my mirror is so that when I do my daily updates for sid or run an installation test, the packages are already available locally. I really don't want to double or even treble the time needed for installation tests just because some required packages aren't yet available locally and need to be downloaded over that slow line.
So, I have my partial mirror. Somewhat tuned (I exclude some ridiculously large debug packages for example, saving about 10GB), but still with a lot of junk^Wpackages on it I'll never ever use, especially for hppa, sparc and s390 as those systems only have fairly basic installations. Getting rid of that would significantly reduce my daily sync and allow me, for example, to also have a mirror of stable and oldstable, keep old versions of packages and probably still save a lot of diskspace.
Wishlist
What we should have is a hybrid solution: a program that presents itself as a proxy to clients, but is smart enough to pre-fetch new packages that are likely to be needed during a sync run, based on usage data from the proxy and on configuration settings.
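To make the idea a bit more concrete, here is a rough Python sketch of what the selection step of such a sync run might look like. It is purely illustrative: the `Package` class, the `select_prefetch` function, the 30-day default and the shape of the usage log (a dict mapping name and architecture to the time of the last client request) are all assumptions, not part of any existing tool. For simplicity, "new version" just means the upstream version string differs from the local one; a real implementation would need proper Debian version comparison.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Package:
    """One entry from an upstream Packages file (hypothetical, simplified)."""
    name: str
    arch: str
    version: str
    priority: str = "optional"


def select_prefetch(index, local_versions, last_used, now,
                    max_age_days=30,
                    always_priorities=("required", "standard")):
    """Return the upstream packages worth fetching during a sync run.

    index          -- iterable of Package entries from the upstream index
    local_versions -- dict: (name, arch) -> version currently on the mirror
    last_used      -- dict: (name, arch) -> datetime of last client request
    """
    cutoff = now - timedelta(days=max_age_days)
    wanted = []
    for pkg in index:
        key = (pkg.name, pkg.arch)
        if local_versions.get(key) == pkg.version:
            continue  # this exact version is already on the mirror
        recently_used = last_used.get(key, datetime.min) >= cutoff
        if recently_used or pkg.priority in always_priorities:
            wanted.append(pkg)
    return wanted
```

Anything a client requests that this step did not select would still be fetched on demand, just like with a regular proxy; the selection only decides what gets downloaded ahead of time.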
Some ideas for features/configuration options it could have:
- should support both source and binaries (debs and udebs)
- transparently retrieve packages that are requested but not available locally (just like a regular proxy)
- options to always sync (per architecture):
  - all required and standard packages
  - packages listed in a certain config file
  - packages in a certain section (e.g. udebs in the debian-installer section)
- for packages that have been used (requested by clients) within the last X days: pre-fetch any new version (per architecture)
  - options to include security/volatile updates in that
- be smart about ABI changes: when a package name changes because the ABI version changes, also pre-fetch the new package; an alternative solution for that could be:
  - for recently used packages: ensure that their dependencies are also pre-fetched; this would also provide support when packages are renamed (dependencies in the transition packages would ensure the new packages are pre-fetched)
- expire packages from the mirror that have not been used for X days
  - allow faster expiration of certain types of packages (-dev or -dbg for example)
- options to keep X previous versions that are no longer referenced in the Packages files for Y days
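The expiration items in the list above could be sketched along these lines. Again this is only an illustration under assumptions: the function names, the 60 and 14 day defaults and the usage log format (a dict mapping name and architecture to the time of the last request) are made up, and keeping X previous versions for Y days is left out entirely.

```python
from datetime import datetime, timedelta


def expiry_days(name, default_days=60, fast_days=14,
                fast_suffixes=("-dev", "-dbg")):
    """Days a package may go unused before it is dropped from the mirror."""
    # str.endswith accepts a tuple, so one call checks every suffix
    return fast_days if name.endswith(fast_suffixes) else default_days


def select_expired(last_used, now, **policy):
    """Return the sorted (name, arch) pairs that have gone unused too long.

    last_used -- dict: (name, arch) -> datetime of last client request
    policy    -- optional overrides passed through to expiry_days()
    """
    expired = []
    for (name, arch), used in last_used.items():
        if now - used > timedelta(days=expiry_days(name, **policy)):
            expired.append((name, arch))
    return sorted(expired)
```

With these made-up defaults, a `-dev` package last requested 20 days ago would be expired, while a regular package with the same age would stay on the mirror.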
Possibly such an implementation could even be used on some of the lower tier Debian mirrors.
Unfortunately, unlike some of my esteemed colleagues, I'm not able to just whip up something like this, so I'm condemned to wait and see if there's someone else who'd like to pick up this idea. I am of course more than willing to help develop this idea further and to test it.
Now, if I've totally failed in my research and something like this already exists, a pointer in the right direction would be much appreciated.