My plans/Ideas for Yum! (part 1)


Well, there are some ideas about yum in my mind which I would like to implement, but unfortunately I have not found enough time for them yet, and I won’t be able to start working on them for at least the next two months. However, I hope to be able to start working on them afterwards (even if slowly). I’m writing them down here so that: 1. someone might decide to work on them! 2. I might receive some feedback/suggestions about my ideas, to improve the design and fix its flaws.

For many users yum works great. But when it comes to low-bandwidth internet connections, it doesn’t shine. The most annoying part in this regard is downloading repository metadata, which is downloaded in its entirety at regular intervals. And until recently, yum was sometimes completely unable to download a repository’s metadata at all: the connection would time out during the download and yum would start over from the beginning… fortunately, this particular problem has been fixed.

Yum really wastes bandwidth by downloading lots of metadata, most of which is never used at all. For some people this is fine, as they prefer to spend bandwidth rather than CPU cycles (like those who don’t like delta rpms); but if you don’t have a good connection, you won’t like it.

Considering that fact, I’ve decided to add a “low bandwidth” mode to yum so that users can select whichever mode they prefer. In this mode, a new kind of repository metadata will be used, designed with the following goals:

  • The metadata should be downloaded incrementally as much as possible; avoid downloading any single piece of metadata more than once.
  • Only the data which is actually needed should be downloaded.
  • Yum should not become (noticeably) slower.
  • No server-side processing is acceptable: a repository is a set of files and directories, nothing more. Any plain http/ftp server should be able to host a repository.
  • Even if the bandwidth savings do not materialize in every use case, it is still worthwhile if they appear in the most common workflows (e.g. install/remove/update).
  • Security: transferred data should be verifiable.
  • Yum should still be able to do its job (e.g. resolving dependencies)!

Currently, a repository’s metadata is stored in a number of files; AFAIK the three biggest ones are the primary, filelists and other (if available) databases, stored as sqlite files which apparently give yum a fast way to query the data. Among those, the primary database is always downloaded; the others are downloaded only if the need arises. Currently, the primary db of F12 is around 12MB, and the primary db of F12 updates is around 5MB.

OK, in this post I’ll consider the primary database only. This database contains the list of package files in some directories (e.g. bin) and per-package information: name, summary, description, conflicts, obsoletes, provides, requires, and a few other things. But does yum really need all this information about all packages to function? No. So what I’d like to do is to split the metadata as far as possible (but not too much, as described later) in a way that lets yum avoid downloading data it doesn’t need.

Some people have said that using such methods might make yum much slower compared to using sqlite databases. But there is no need to use the same format on both the server and the client side. My ideas are generally about how the metadata is stored on the server side; yum could integrate any downloaded information into its local sqlite databases. The yum cache on the client side will have the same format in both low-bandwidth and high-bandwidth modes. The databases should contain a flag so that yum can tell which data is already available in its cache and which should be downloaded when needed. As a result, such a metadata split will not decrease yum’s performance when working with cached data (the only extra work is checking whether each needed piece of data is available or not).
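
To make this concrete, here is a minimal sketch of what such an availability flag could look like on the client side, assuming a hypothetical local sqlite schema; the table, column and flag names are illustrative only, not yum’s actual format:

    import sqlite3

    # in practice this would live under /var/cache/yum/;
    # ":memory:" just keeps the sketch self-contained
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE packages (
            pkgkey     INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            -- bit flags recording which metadata is already cached:
            -- 1 = summary/description, 2 = requires, 4 = provides/files
            have_flags INTEGER NOT NULL DEFAULT 0
        )""")

    HAVE_DESCRIPTION, HAVE_REQUIRES, HAVE_PROVIDES = 1, 2, 4

    def is_cached(name, flag):
        # the only extra work compared to today: one flag check before
        # deciding whether a piece of metadata must be downloaded
        row = conn.execute("SELECT have_flags FROM packages WHERE name = ?",
                           (name,)).fetchone()
        return row is not None and (row[0] & flag) != 0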

Now, let me be a bit more specific about how the primary database can be split. The minimum required information for yum is probably just the list of packages (package summaries might be included here too). For each package, its summary and description can be stored in a separate file (e.g. foo-1.0.0-1.fc12-description)*. We can have separate directories for each locale, so that localized package summaries and descriptions can be provided too. If a user issues a yum info command, the summary and description of that package are downloaded (if not downloaded before) and displayed to the user. But certainly, if the user wants to do a yum search, yum needs the summaries and descriptions of all packages; in that case, you’ll download all of them**. It is still better than the current situation. Also notice that this information never changes for a given package, so it will be downloaded only once.
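
As a rough illustration, fetching a description on demand could look like the sketch below; the file naming scheme, mirror URL and cache path are assumptions, not an existing format. The per-package requirements file described next could be fetched the same way.

    import os
    import urllib2  # yum is written in python 2, hence urllib2

    REPO_URL = "http://mirror.example.org/f12/metadata"  # placeholder
    CACHE_DIR = "/var/cache/yum/descriptions"            # placeholder

    def get_description(nevr):
        """Return the description of e.g. 'foo-1.0.0-1.fc12', downloading
        it only the first time it is requested (it never changes)."""
        path = os.path.join(CACHE_DIR, nevr + "-description")
        if not os.path.exists(path):
            if not os.path.isdir(CACHE_DIR):
                os.makedirs(CACHE_DIR)
            url = "%s/descriptions/%s-description" % (REPO_URL, nevr)
            data = urllib2.urlopen(url).read()
            # checksum/signature verification would go here (security goal)
            open(path, "wb").write(data)
        return open(path).read()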

Other package information (like its requirements) will be stored in a separate file too. Currently, package requirements take up a considerable share of the primary repo (IIRC, removing the requirements information from the primary repo cuts its compressed size roughly in half), but in common workflows you only need a package’s requirements when you want to install that package. Again, this information will be downloaded once per package.

When it comes to packages’ provides and file list information, splitting becomes hard: when you want to satisfy a package’s requirements, you might face file-based or capability-based dependencies, so you must be able to figure out which package provides a specific capability or file. I’ll describe my ideas in this regard in another post, but for now you can assume that each package’s list of provided capabilities and of files in specific directories (which are currently in the primary db) is downloaded for all packages by yum. Even in this case, such lists are downloaded only once per package (preserving the incremental metadata downloading); when new packages are added, their information is downloaded too.
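
A minimal sketch of that incremental step (the database and download helpers here are made up for illustration): after fetching the small package list, only metadata for packages we have never seen before is downloaded:

    def sync_provides(remote_pkg_list, local_db, fetch):
        """remote_pkg_list: names from the freshly downloaded package list;
        local_db and fetch are hypothetical cache/download helpers."""
        known = set(local_db.cached_packages())
        for pkg in remote_pkg_list:
            if pkg not in known:
                # each package's provides/file list is fetched exactly once
                local_db.store_provides(pkg, fetch(pkg))
        # packages that disappeared from the list can simply be dropped
        for pkg in known - set(remote_pkg_list):
            local_db.remove(pkg)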

That seems to be enough for now! Just two final points:

* The above description might result in a large number of very small files. Considering that each file should be signed, this could add considerable overhead. But it isn’t really necessary to put each piece of information in a separate file: instead of one file per package, we can put the data of a number of packages (for example every 10 packages, or every ~100KB chunk of data) into a single file (the file name would be included in the package list that yum downloads initially).
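
For illustration, the server-side chunking could be as simple as the following sketch (chunk naming and the on-disk format are assumptions); the returned index, mapping each package to its chunk file, is what would be shipped with the initial package list:

    import os

    def write_chunks(packages, out_dir, max_size=100 * 1024):
        """packages: iterable of (name, metadata_blob) pairs."""
        index = {}                      # package name -> chunk file name
        chunk, size, chunk_no = [], 0, 0
        for name, blob in list(packages) + [(None, None)]:  # sentinel flushes the tail
            if name is not None:
                chunk.append((name, blob))
                size += len(blob)
            if chunk and (size >= max_size or name is None):
                fname = "chunk-%04d" % chunk_no
                f = open(os.path.join(out_dir, fname), "wb")
                for pkg, data in chunk:
                    f.write(data)
                    index[pkg] = fname
                f.close()
                # each chunk file is signed once, instead of per package
                chunk, size, chunk_no = [], 0, chunk_no + 1
        return index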

** Do we really need to do the search exclusively locally?! No. It’s true that we do not want our mirrors to do any server-side processing, but:

1. The server-side search feature can be provided by a small number of Fedora infrastructure servers.

2. Even better (?!), a search engine like Google can do this for us! When a user issues a “yum search” command, yum can first search its local database, and then, instead of downloading all package descriptions, it can point a search engine at a repoview URL (or a new plain-html format for package descriptions, more suitable for this kind of search) on a mirror, and show the results to the end user. So you get server-side processing using Google’s resources!
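
As a sketch of point 2 (the mirror path is a placeholder, and the query is just an ordinary site-restricted web search, no special API):

    import urllib
    import webbrowser

    def web_search(terms, mirror="mirror.example.org/f12/repoview"):
        # restrict the search engine to the mirror's repoview pages
        q = urllib.quote_plus("site:%s %s" % (mirror, " ".join(terms)))
        webbrowser.open("http://www.google.com/search?q=" + q)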

Wow! Much longer than what I intended 😛


11 responses to this post.

  1. Local Mirroring

    Have the /var/cache/yum/* cache mirror the same layout as the repo mirror sites. You could set up any Fedora desktop with a web/webdav server to provide a local mirror of RPMs. Local installs could look up the local mirrors before downloading the packages/deltas from the repo mirrors. If the local mirror uses a webdav:/./ url with username/password, have each instance of Fedora put a copy of any new packages on the local mirror.

    Presto could also generate new packages from deltas + existing but uninstalled RPMs:
    unpack the old RPM, apply the delta, produce the new rpm.

    Reply

    • I’m talking about avoiding the complete metadata in the first place; local mirroring has nothing to do with the mentioned ideas. Also, I’m not talking about the packages themselves, just the metadata. (BTW, deltas are applied to existing and installed RPMs, after regenerating the original rpm from the installed files.)

      Reply

  2. Posted by Michael on May 14, 2010 at 11:00 am

    I have the same problem regarding yum… common stuff sometimes takes ages because the complete metadata is downloaded over and over again. It would be nice if someone finally fixed this…

    Reply

  3. Hi,

    I can’t use yum for this reason – the metadata download is too much. I filed a bug about this a few years ago, but it seems there was little interest. It’s great that you’ve decided to do something.

    My idea, I think, is fairly simple and could work (but I’m not a python coder, so I could not do it myself): simply have the repository publish a weekly delta to the metadata. So:

    * Everybody gets (or should get, once fixed) a full set of original metadata when they install Fedora.

    * Then when they do “yum update” (or “yum whatever”), they just need to download the delta to the metadata, not a full set of new metadata.

    So the local yum client just needs to ask itself, “what metadata revision do I currently have?” and then download all the deltas since then and patch the local database.
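
    Roughly, something like this sketch (the delta file names and helpers are made up):

        def update_metadata(local_rev, latest_rev, fetch, apply_delta):
            # apply every weekly delta between our revision and the
            # repository's current one, in order, instead of re-downloading
            for rev in range(local_rev + 1, latest_rev + 1):
                apply_delta(fetch("primary-%d-to-%d.delta" % (rev - 1, rev)))
            return latest_rev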

    David

    Reply

    • Yes, this approach will avoid downloading a piece of data more than once, but you still get lots of data you don’t need. I’m trying to avoid that too.

      Reply

  4. Posted by Seth Vidal on May 14, 2010 at 5:10 pm

    Hedayat,
    You should talk to some of the other yum folks before going too wild on this. Changing metadata formats to smaller sizes is not as easy as you might think. There are other tools involved that have to be accounted for.

    Come by the yum irc channel or let’s talk on yum-devel mailing list.

    Reply

    • I’ve talked about it on yum-devel in the past, very briefly, but… anyway, I thought it might be better to bother the yum people once I started seriously working on the implementation.

      Thanks for your attention

      Reply

      • That’s fine – but there are some other goals that might be worth having in mind before you start writing code.

        In particular, the idea of shipping translations of summaries/descriptions as separate metadata could be useful to be aware of.

      • Posted by Sajjad (sana) on May 22, 2010 at 8:57 pm

        I believe the variety of languages is already discussed in the post.

  5. Posted by Sajjad (sana) on May 22, 2010 at 8:54 pm

    Considering the time that has passed since you first had these ideas, I seriously recommend that you choose the simplest path and implement it that way.
    I think that by implementing only a subset of what you have in mind for yum (merely the simplest and most straightforward ideas, not the complex ones), it can be considerably improved. You know, you can achieve 80% of your goal with only 20% of the effort. And believe me, it’s enough for now.
    To be more specific, I vote for 2 ideas:
    Firstly and most importantly, download only a “delta” instead of the full archives.
    Secondly and optionally, if you have time, separate the summaries and descriptions from the package list (or separate the requirements as you discussed; whichever is more effective in reducing the size).

    Best regards,
    Sajjad

    Reply

    • Yes, I should not go for the final target from the beginning. I’m also thinking about separating the summary, description and requirements as the first step (these should be mostly similar). I think that would be enough to get somewhere near that 80%. But I’m not very interested in going down the “delta” path as a temporary step (for the primary repo db), since it would require a new format that would be dropped soon. I might go with it for the file list database, though, as you said.

      Thanks 🙂

      Reply
