Finding where a tarball came from with git

I got the idea from reading this question on the webkit-dev list, and from what I remembered of the git documentation. For the WebKit case, the basic script I put up would be of no help, because the tarballs don't contain all the files (plus, they use Subversion, not git, so it would also require a long import process). Anyway...

Considering how git creates the SHA-1 hashes of its objects, it is actually pretty easy to generate a SHA-1 hash from a random (non-git) tree, and then find the corresponding commit. First, you have git-hash-object, which helps you create a hash for a particular object (though it's also trivial to do with sha1sum). For regular files, git-hash-object -t blob $filename is enough. For symbolic links, you have to read the link destination and give it, without any trailing character (be it a NUL character or a newline), to git-hash-object -t blob --stdin. For directories, you have to generate a tree "structure" yourself and pass it to git-hash-object -t tree --stdin. I haven't bothered looking at other file types.
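
For illustration, here is a minimal Python sketch of what git-hash-object computes for a blob: the SHA-1 of a short header (the object type, a space, the payload size in decimal, a NUL byte) followed by the payload. The function names are made up for this example and are not part of git.

```python
import hashlib
import os


def git_object_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 of a git object: '<type> <size>' + NUL header, then the payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id_for_path(path: str) -> str:
    """Hash a regular file's contents, or a symlink's target, as a blob."""
    if os.path.islink(path):
        # The blob is the link target itself, with no trailing NUL or newline.
        payload = os.fsencode(os.readlink(path))
    else:
        with open(path, "rb") as f:
            payload = f.read()
    return git_object_id("blob", payload)
```

The result of blob_id_for_path should match git-hash-object -t blob for the same file, as long as no content filters (such as line-ending conversion) are configured.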

The tree structure can be guessed by looking either at mktree.c or at the output of git-cat-file tree $sha1, where $sha1 is the SHA-1 hash of a tree object. It contains three pieces of information for each node in the tree: the file mode, in the same format as what stat() returns, except that, for some reason, permissions are 000 for directories and symbolic links; the file name; and the SHA-1 hash. They are written in the following format: the file mode in ASCII octal with no zero padding ("%o"), followed by a space character, then the file name followed by a NUL character, and finally the binary form of the SHA-1 hash.

Nodes are sorted in a not-quite-lexical order (take a look at base_name_compare in read-cache.c) and are not separated by any special character: the mode of a file just follows the SHA-1 hash of its predecessor.
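
The entry format and ordering described above can be sketched in Python as follows. The helper names are hypothetical, and the sort key is my reading of base_name_compare: plain byte order, except that a directory is compared as if its name ended with "/".

```python
import hashlib

DIR_MODE = 0o40000


def sort_key(mode: int, name: bytes) -> bytes:
    # Directories sort as if their name had a trailing "/".
    return name + b"/" if mode == DIR_MODE else name


def serialize_tree(entries) -> bytes:
    """entries: iterable of (mode, name, hex_sha1) tuples.

    Returns the raw payload of the tree object: for each entry, the octal
    mode, a space, the name, a NUL byte, then the 20-byte binary SHA-1.
    """
    ordered = sorted(entries, key=lambda e: sort_key(e[0], e[1]))
    return b"".join(
        b"%o %s\0" % (mode, name) + bytes.fromhex(sha)
        for mode, name, sha in ordered
    )


def tree_id(payload: bytes) -> str:
    """SHA-1 of the tree object built from a serialized payload."""
    return hashlib.sha1(b"tree %d\0" % len(payload) + payload).hexdigest()
```

With this key, a directory named foo sorts after the files foo-bar and foo.c, because "/" comes after both "-" and "." in byte order. That is where the order departs from a plain sort on names.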

With all this new knowledge, you should be able to write some code that returns the SHA-1 hash for an arbitrary directory. Okay, since you must be at least as lazy as I am, you can take the script I wrote.
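
For readers who prefer Python, here is a rough sketch of such a directory hasher. It is not the Perl script mentioned above, only an illustration built on a few assumptions: regular files are recorded as 100644 or 100755 depending on the owner's executable bit, directories that end up empty are skipped (git does not track them), a .git directory is ignored, and each file is read into memory in one go.

```python
import hashlib
import os
import stat


def object_id(kind: bytes, payload: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a git object (header + payload)."""
    return hashlib.sha1(kind + b" %d\0" % len(payload) + payload).digest()


EMPTY_TREE = object_id(b"tree", b"")


def hash_tree(directory: str) -> bytes:
    """Recursively compute the tree id for the contents of `directory`."""
    entries = []
    for name in os.listdir(directory):
        if name == ".git":
            continue  # assumption: repository metadata is never part of the tree
        path = os.path.join(directory, name)
        st = os.lstat(path)
        if stat.S_ISLNK(st.st_mode):
            mode = 0o120000
            sha = object_id(b"blob", os.fsencode(os.readlink(path)))
        elif stat.S_ISDIR(st.st_mode):
            mode = 0o40000
            sha = hash_tree(path)
            if sha == EMPTY_TREE:
                continue  # nothing trackable inside: leave the directory out
        elif stat.S_ISREG(st.st_mode):
            # Only the owner's executable bit survives in the recorded mode.
            mode = 0o100755 if st.st_mode & 0o100 else 0o100644
            with open(path, "rb") as f:
                sha = object_id(b"blob", f.read())
        else:
            continue  # other file types are not handled
        raw_name = os.fsencode(name)
        key = raw_name + (b"/" if mode == 0o40000 else b"")
        entries.append((key, b"%o %s\0" % (mode, raw_name) + sha))
    payload = b"".join(entry for _, entry in sorted(entries))
    return object_id(b"tree", payload)
```

Calling hash_tree("linux-2.6.22").hex() (the directory name is just an example) gives the hexadecimal tree hash to look for in the repository.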

Now, let's take a look at a real-life case: which commit does the latest nightly snapshot of the Linux kernel come from? First, download the latest snapshot patch and its baseline, extract the baseline and apply the patch to it. Then, run my git-hash-tree.pl script with the directory containing the extracted kernel as an argument. After a while, it will return the SHA-1 hash for the whole tree. During this long process, you also have plenty of time to git clone Linus's tree.

Once you're all done, you can search for the commit corresponding to the tree hash (let's call it $hash) with the following command:

git-rev-list --all | while read h; do git-cat-file commit $h | grep -q "^tree $hash" && echo $h && break; done
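
The same search can be sketched in Python by asking git log for every commit hash together with its tree hash (the %H and %T format placeholders) and filtering the pairs. The function names and the repo argument are assumptions made for this example. Unlike the loop above, it returns every matching commit rather than stopping at the first one.

```python
import subprocess


def find_commits(log_output: str, tree_hash: str):
    """Return the commit ids whose tree id equals tree_hash.

    log_output holds one "<commit> <tree>" pair per line.
    """
    matches = []
    for line in log_output.splitlines():
        commit, _, tree = line.partition(" ")
        if tree == tree_hash:
            matches.append(commit)
    return matches


def commits_for_tree(repo: str, tree_hash: str):
    """Scan all refs of the repository at `repo` for commits with this tree."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all", "--format=%H %T"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return find_commits(out, tree_hash)
```

Listing all pairs with a single git process avoids spawning one git-cat-file per commit, which should matter on a history as long as the kernel's.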

If you just followed these steps, you should have spent quite a while getting no result at all. There are actually two things that prevent this method from working properly with the Linux kernel nightlies:

  • The snapshot patches contain a change to the top-level Makefile that doesn't exist in the repository. You need to remove the -gitN suffix from the EXTRAVERSION variable in the Makefile.
  • git diff only includes diff headers for the removal of empty files, so if you apply the snapshot patch with the patch utility (and you can't apply it with git-apply since you don't have a .git directory), empty files that were marked as deleted will still be in your tree. It happens with the current snapshot patch (2.6.22-rc7-git6): it doesn't remove include/asm-blackfin/macros.h.
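
Both fixes can be automated. The sketch below assumes the snapshot patch is in git's diff format, where a deletion shows up as a "diff --git a/... b/..." line followed by a "deleted file mode" line. The function names are hypothetical, and the path parsing is deliberately simple (it would not cope with quoted file names).

```python
import os
import re


def strip_git_suffix(makefile_text: str) -> str:
    """Remove a trailing -gitN from the EXTRAVERSION assignment."""
    return re.sub(
        r"^(EXTRAVERSION\s*=.*?)-git\d+[ \t]*$",
        r"\1",
        makefile_text,
        flags=re.MULTILINE,
    )


def deleted_paths(patch_text: str):
    """List the paths that a git-format patch marks as deleted."""
    paths, current = [], None
    for line in patch_text.splitlines():
        m = re.match(r"diff --git a/(.+) b/(.+)$", line)
        if m:
            current = m.group(2)
        elif line.startswith("deleted file mode ") and current is not None:
            paths.append(current)
    return paths


def remove_leftover_empty_files(root: str, paths) -> None:
    """Delete files under root that the patch deleted but patch(1) left behind."""
    for rel in paths:
        full = os.path.join(root, rel)
        if os.path.isfile(full) and os.path.getsize(full) == 0:
            os.remove(full)
```

Only zero-length files are removed, since those are the ones the patch utility cannot know about.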

Note that this is a naive method: I haven't spent much time going through the git documentation and code to find better ways, if there are any. Also note that it's pretty much pointless to do this with the kernel nightly snapshots, since a file containing the SHA-1 hash of the corresponding commit can be found alongside the patch.

I guess a similar method could be used with Mercurial, though I could not find any documentation detailing what the hashes are calculated from (I haven't searched much, I must say, but for git, it was right before my eyes).

2007-07-07 15:32:07+0900

miscellaneous, p.d.o


6 Responses to “Finding where a tarball came from with git”

  1. tonfa Says:

    It won’t work with mercurial because the hash is recursive:

    for any revlog (the base object):
    nodeid = hash(parent1, parent2, hash(content))

    then we have the manifest which contains the hashes from all the files, and the changelog which contains the changeset message (user, commit message, etc) and the manifest hash.

    see http://www.selenic.com/mercurial/wiki/index.cgi/Design for pictures :)

  2. James Westby Says:

    Hi,

    git-apply does work outside of a git repo, so you can avoid that problem.

  3. glandium Says:

    James: damn, that was git-applypatch that needed to be working with a repo.

  4. Aaron M. Ucko Says:

    patch -E will automatically delete files that become empty.

  5. glandium Says:

    Aaron: The fact is: the file was already empty in the tarball, and as such, the diff didn’t contain anything about it, except a git header. Plain diff also does this : diffing an empty file with /dev/null outputs nothing.

  6. Josh Triplett Says:

    Very creative and effective approach.

    One alternative: if the creator of the tarballs used git-archive (or the previous git-tar-tree) to create them, it defaults to embedding the sha1 of the corresponding commit in the tarball. You can then just git-get-tar-commit-id to find out the sha1.

    $ bzcat linux-2.6.22.tar.bz2 | git-get-tar-commit-id
    7dcca30a32aadb0520417521b0c44f42d09fe05c
    $ GIT_DIR=$HOME/src/linux-2.6/.git git show v2.6.22 | grep '^commit'
    commit 7dcca30a32aadb0520417521b0c44f42d09fe05c