Well, there are some ideas about yum in my mind which I would like to implement, but unfortunately I have not found enough time for them yet; and I will not be able to start working on them for at least the next 2 months. However, I hope to be able to start working on them afterwards (even if slowly). I’m writing them down here so that: 1. someone might decide to work on them! 2. I might receive some feedback/suggestions about my ideas to improve the design and fix its flaws.
For many users yum works great. But when it comes to low-bandwidth internet connections, it does not do so well. The most annoying part in this regard is downloading repository metadata, which is re-downloaded in its entirety at regular intervals. And until recently, yum was sometimes completely unable to download a repository’s metadata: the connection would time out during the download and yum would start again from the beginning… fortunately, this problem has been fixed.
Yum really wastes bandwidth by downloading lots of metadata, most of which is never used at all. For some people this is fine, as they prefer to waste bandwidth rather than CPU cycles (like those who don’t like delta RPMs); but if you don’t have good bandwidth, you won’t like it.
With that in mind, I’ve decided to add a “low bandwidth” mode to yum so that users can select which mode they prefer. In this mode, a new kind of repository metadata will be used, designed with the following goals:
- The metadata should be downloaded incrementally as much as possible: try to avoid downloading a single piece of metadata more than once.
- Only the data which is needed should be downloaded.
- Yum should not become (noticeably) slower.
- No server-side processing should be required: a repository is a set of files and directories, nothing more. Any plain HTTP/FTP server should be able to host a repository.
- Even if the bandwidth savings do not happen in all use cases, it is still worthwhile if they happen in the most common workflows (e.g. install/remove/update).
- Security: transferred data should be verifiable.
- Yum should still be able to do its job (e.g. resolving dependencies)!
Currently, a repository’s metadata is stored in a number of files; AFAIK the three biggest ones are the primary, filelists, and other (if available) databases, stored as sqlite files which apparently provide a fast way for yum to query the data. Among those, the primary database is always downloaded; the others are downloaded only if the need arises. Currently, the primary db of F12 is around 12MB and the primary db of F12 updates is around 5MB.
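To make the discussion concrete, here is a minimal sketch of how a client can query such a primary database. The schema below is a simplified stand-in I made up for illustration, not the exact schema createrepo generates:

```python
import sqlite3

# Simplified stand-in for a primary metadata database (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE packages (
    name TEXT, version TEXT, release TEXT, summary TEXT)""")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?, ?, ?)",
    [("foo", "1.0.0", "1.fc12", "An example package"),
     ("bar", "2.3.1", "4.fc12", "Another example package")])

def find_package(name):
    """Return (name, version, release, summary) for a package, or None."""
    cur = conn.execute(
        "SELECT name, version, release, summary FROM packages WHERE name = ?",
        (name,))
    return cur.fetchone()

print(find_package("foo"))  # ('foo', '1.0.0', '1.fc12', 'An example package')
```

This kind of indexed lookup is what makes the sqlite format fast on the client, and it is exactly what we would keep on the client side even if the server-side format changes.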
OK, in this post I’ll consider the primary database only. This database contains the following information: the list of package files in certain directories (e.g. bin), and package information: name, summary, description, conflicts, obsoletes, provides, requires, and some other such data. But does yum really need all this information about all packages to function? No. So what I’d like to do is split the metadata as far as possible (but not too far, as described later) in a way that lets yum avoid downloading data it doesn’t need. Some people have said that using such methods might make yum much slower compared to using sqlite databases. But there is no need to use the same format on both the server and the client side. My ideas are mainly about how the metadata is stored on the server side; yum could integrate any downloaded information into its local sqlite databases. The yum cache on the client side will have the same format in both low-bandwidth and high-bandwidth modes. The databases should contain a flag so that yum can tell which data is already available in its cache and which should be downloaded when needed. As a result, such a metadata split will not decrease yum’s performance when interacting with cached data (the only extra work is checking whether each needed piece of data is available or not).
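The availability-flag idea can be sketched in a few lines. This is only a toy model (a dict standing in for the local sqlite cache, and a stub standing in for the HTTP download), but it shows the intended behavior: each piece of metadata is fetched at most once, and later accesses are served locally:

```python
downloads = []  # records what was fetched, just to show each piece is fetched once
cache = {}      # package name -> {field name: value}; stands in for the local sqlite cache

def fetch_from_mirror(pkg, field):
    # Placeholder for downloading a per-package metadata file from a plain mirror.
    downloads.append((pkg, field))
    return f"<{field} of {pkg}>"

def get_metadata(pkg, field):
    entry = cache.setdefault(pkg, {})
    if field not in entry:                            # the "flag": data not cached yet
        entry[field] = fetch_from_mirror(pkg, field)  # download it exactly once
    return entry[field]                               # later calls are served locally

get_metadata("foo", "description")  # triggers one download
get_metadata("foo", "description")  # second call causes no download
```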
Now, let me be a bit more specific about how the primary database can be split. The minimum required information for yum is probably the list of packages (package summaries might be included here too). For each package, its summary and description can be stored in a separate file (e.g. foo-1.0.0-1.fc12-description)*. We can have separate directories for each locale, so that localized package summaries and descriptions can be provided too. If a user issues a yum info command, the summary and description of the package are downloaded (if not downloaded before) and displayed to the user. But certainly, if the user wants to do a yum search, yum needs the summaries and descriptions of all packages; in that case, you’ll download them for all packages**. It is still better than the current situation. Also notice that this information will never change for a specific package release, so it will be downloaded only once.
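Since the repository is just files and directories, "downloading a description" reduces to constructing a predictable URL. The layout below (a `descriptions/<locale>/` tree and the base URL) is entirely hypothetical, just to illustrate the per-locale directory idea:

```python
# Hypothetical mirror layout: one description file per package, one directory per locale.
BASE = "http://mirror.example.org/fedora/12/"

def description_url(name, version, release, locale="C"):
    """Build the URL of a per-package description file on a plain HTTP mirror."""
    return f"{BASE}descriptions/{locale}/{name}-{version}-{release}-description"

print(description_url("foo", "1.0.0", "1.fc12"))
# http://mirror.example.org/fedora/12/descriptions/C/foo-1.0.0-1.fc12-description
```

A localized client would simply pass e.g. `locale="fa"` and fall back to the default directory if the file is missing.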
Other package information (like its requirements) will be stored in a separate file too. Currently, package requirements take up a considerable amount of space in the primary db (IIRC, removing the requirements information from the primary db shrinks its compressed size to about half of the original), but in common workflows you only need a package’s requirements when you want to install that package. Again, this information will be downloaded once per package.
When it comes to packages’ provides and file list information, splitting becomes hard: when you want to satisfy a package’s requirements, you might face file-based or capability-based dependencies, so you should be able to figure out which package provides a specific capability or file. I’ll describe my ideas on this in another post, but for now you can assume that each package’s list of provided capabilities and files in specific directories (which are currently in the primary db) is downloaded for all packages by yum. Even in this case, such lists will be downloaded once per package (preserving the incremental metadata downloading). When new packages are added, their information is downloaded too.
That seems to be enough for now! Just two footnotes:
* The above description might result in a large number of very small files. Considering that each file should be signed, this might add considerable overhead. But there is no real need to put each piece of information in a separate file: instead of one file per package, we can put the data of a number of packages (for example, every 10 packages, or every ~100KB chunk of data) in a single file (the file name will be included in the package list that yum downloads initially).
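The chunking itself is straightforward; a minimal sketch of the ~100KB variant (the limit and the record sizes here are made up for illustration) could look like this:

```python
CHUNK_LIMIT = 100 * 1024  # ~100 KB per chunk file (illustrative choice)

def chunk_records(records):
    """Group (name, data) records into chunks whose total size stays under the limit."""
    chunks, current, size = [], [], 0
    for name, data in records:
        if current and size + len(data) > CHUNK_LIMIT:
            chunks.append(current)       # close the current chunk file
            current, size = [], 0
        current.append((name, data))
        size += len(data)
    if current:
        chunks.append(current)
    return chunks

# 7 packages with ~30 KB of metadata each end up in 3 chunk files (3 + 3 + 1).
records = [(f"pkg{i}", b"x" * 30 * 1024) for i in range(7)]
chunks = chunk_records(records)
print(len(chunks))  # 3
```

The initial package list would then record which chunk file holds each package, so a single signature per chunk covers many packages.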
** Do we really need to do the search exclusively locally?! No. It’s true that we do not want our mirrors to do any server-side processing, but:
1. The server-side search feature can be provided by a small number of Fedora infrastructure servers.
2. Even better (?!), a search engine like Google can do this for us! When a user issues a “yum search” command, yum can first search its local database, and then, instead of downloading all package descriptions, it can query a search engine restricted to a repoview URL on a mirror (or a new plain HTML format for package descriptions, more suitable for this kind of search) and show the results to the end user. So you get server-side processing using Google’s resources!
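Point 2 amounts to building a site-restricted query URL. A sketch, with a hypothetical mirror/repoview path and Google’s standard `q` parameter:

```python
from urllib.parse import urlencode

# Hypothetical repoview location on a mirror; the real path would come from
# the repository configuration.
REPOVIEW = "mirror.example.org/fedora/12/repoview"

def search_url(terms):
    """Build a search-engine URL restricted to a mirror's repoview pages."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{REPOVIEW} {terms}"})

print(search_url("media player"))
```

Yum would then fetch and parse the result page (or use a proper search API) and merge the hits with its local results.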
Wow! Much longer than I intended.