by Alessandro Rubini
Reproduced and translated with permission of Linux & C, Edizioni Vinco.
The git package was originally written by Linus Torvalds, and was later maintained by other developers under the lead of Junio Hamano. The program is being adopted by an ever-increasing number of projects, from the kernel and U-Boot to Xorg and busybox. It belongs to the class of distributed version control systems.
A first introduction to using the package was published in this same magazine by Rodolfo Giometti. This article introduces some more advanced features, which are needed to interact with complex products, while trying to uncover the ideas used within the program. Box 1 summarizes the most important git commands, as a quick reference for less experienced readers.
Box 1 - Most Common Git Commands
The following list shows the most important commands for git users. Command arguments are not shown as there often are several alternate uses, with different argument lists:
Distributed version control systems are more and more widespread; git is not the first and won't be the last one. The ideas described in this article are also present, to varying degrees, in other packages, like mercurial; the aim of this text is not to show how git is superior to other tools, but rather to show interesting features that can be useful in other contexts as well. We talk about git because it is widely adopted.
To avoid ambiguity, we will never use the word tree to refer to a directory with files and subdirectories. The word in the git context is best reserved for the development history of a package, with all its ramifications.
One of the most common problems in managing big software projects is the imperfect matching of the working directories of different programmers: developers often completely remove their local copy, to restart from a known archive or repository; this is common, but unpleasant. The same problem occurs when there are collisions while applying a patch, or when a source file was damaged by some mishap.
Moving tens or hundreds of megabytes to recover a source package is certainly not a nice experience; verifying whether one's copy is exactly the same as the copy of a different developer is even worse if you can't directly transfer the files.
The solution to both problems is the identification of every object, within git, with a control code: a number that is derived from the whole amount of related data through a non-invertible mathematical algorithm. Such a control code, called hash or message digest, describes (or summarizes) the datum or data set, and the math makes it impossible (or at least unfeasible) to create a different data set with the same hash.
The algorithm used in git is SHA1 (Secure Hash Algorithm 1), which returns a 160-bit long hash. The number is usually represented as 40 hex digits. For example, if you want to group all files called COPYING in your system that are verbatim copies of the same license file, you can issue the command:
locate COPYING | xargs sha1sum | sort
Within git, thus, each individual file, each directory, each commit is identified by its own hash value. A specific point in the development history of a package is not referenced by a sequential version number, but rather by a unique 160-bit number. An object of type commit includes a reference to the contents of the directory it represents, the commit message and the hash of the previous commit. If two programmers have the same commit, they know for sure that they are accessing the same source package. During technical discussions on public mailing lists, referring to code or patches using the hash value or an abbreviation thereof is common practice.
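As a small illustration of how an object's name derives from its contents, the sketch below computes the identifier of a file in two ways: with git hash-object and by hand with sha1sum. The "blob <size>" header followed by a NUL byte is a detail of git's object format that is not discussed in this article, and the file name hello.txt is just an example.

```shell
# Work in a scratch directory so nothing existing is touched.
cd "$(mktemp -d)"
printf 'hello\n' > hello.txt

# Identifier git would assign to this file (no repository is needed).
git_id=$(git hash-object hello.txt)

# The same value computed by hand: SHA1 over "blob <size>\0" plus the contents.
size=$(wc -c < hello.txt | tr -d ' ')
manual_id=$( { printf 'blob %s\0' "$size"; cat hello.txt; } | sha1sum | cut -d ' ' -f 1)

echo "$git_id"
echo "$manual_id"
```

Both lines should print the same 40 hex digits; changing a single character in hello.txt changes the whole value.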
Box 2 - Uniqueness of hash codes
Message digest algorithms guarantee that the hash they return is unique only in a statistical sense. The most common such algorithms are SHA1 and MD5, which return values with a uniform distribution over the value space. By recalling that 2^10 is roughly equivalent to 10^3, we can say there's a chance of 1 in 4 billion for two 32-bit hashes to be equal. The number is on the order of 10^38 for 128-bit MD5 hashes and 10^48 for 160-bit SHA1 values.
Even with 1 million files, the chance for any two of them to feature the same SHA1 hash is 1 in 10^36, while for a billion files the chance is 1 in 10^30. 10^30 is roughly the number of sand grains in the mass of the whole earth; you can hardly say the hash is not unique.
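The figures above follow from the usual birthday approximation: with n objects and b-bit hashes, the chance that any two of them collide is about n^2 / 2^(b+1). The snippet below simply redoes that arithmetic with awk; the approximation is an assumption of this sketch and only holds while the result is much smaller than 1. The function name collision_chance is made up for the example.

```shell
# Birthday approximation: p ~= n^2 / 2^(bits+1)
collision_chance() {
    awk -v n="$1" -v bits="$2" 'BEGIN { printf "%.1e\n", n * n / 2 ^ (bits + 1) }'
}

million=$(collision_chance 1000000 160)      # one million SHA1 hashes
billion=$(collision_chance 1000000000 160)   # one billion SHA1 hashes
echo "1 million objects: $million"
echo "1 billion objects: $billion"
```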
Hash algorithms are designed so that the difference of even a single bit in the input data will affect all output bits. So in practice to identify a file you can use a few hex digits of the hash, and still be sure enough about not picking the wrong file. Thus, git accepts abbreviated identifiers, as long as the abbreviation is unique within its database. Likewise, if you ask for an abbreviated form of the hash, git shows just 7 hex digits.
Even within a project with about a million objects, like the kernel, the chance for a 7-digit code to be ambiguous is very low. When picking objects within a project, you don't need your hash to be globally unique; it's enough if it is unique within the project: when the 7-digit code is ambiguous, git will show 8 digits or more, in order to resolve the local ambiguity. It's somewhat like what happens at school: teachers call students by surname, and in case of ambiguity they add the initial of the first name, or more letters if needed. Please note that internally git always uses the hashes in their entirety.
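A minimal sketch of abbreviated identifiers, run in a throwaway repository. It assumes git rev-parse is the command you use to convert between names; the demo identity and file names are invented for the example.

```shell
# Identity for the throwaway commits, so no global configuration is needed.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
echo one > file.txt && git add file.txt && git commit -q -m "first commit"

full=$(git rev-parse HEAD)            # the complete 40-digit name
short=$(git rev-parse --short HEAD)   # abbreviated form, as short as uniqueness allows
echo "$full"
echo "$short"

# Any unambiguous prefix names the same object.
resolved=$(git rev-parse "$short")
echo "$resolved"
```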
We could test the idea by checking the SHA1 hash of Linux-2.6.30 as extracted from a git repository and that of the official tar file. To do this we need to use git cat-file to look into the relevant commits:
bash$ cd linux-2.6.git
bash$ git checkout v2.6.30
HEAD is now at 07a2039... Linux 2.6.30
bash$ git cat-file commit 07a2039 | grep tree
tree 0cea46e43f0625244c3d06a71d6559e5ec5419ca
bash$ tar xjf linux-2.6.30.tar.bz2
bash$ cd linux-2.6.30
bash$ git init
bash$ git add .
bash$ git commit -m "from tar file"
Created initial commit 7a6212e: from tar file
bash$ git cat-file commit 7a6212e | grep tree
tree 0cea46e43f0625244c3d06a71d6559e5ec5419ca
Box 3 - The git cat-file command
Since git records all of its objects in compressed format, naming them after their SHA1 hash, the tool offers the command "git cat-file" in order to print to stdout (like cat, as the name suggests) the contents of an object. The command receives two arguments: the object's type and its SHA1.
The command returns the contents of the file in case of blob objects, but for commit objects its output is a short text that includes the log message, the author's name, date of the commit in Unix format and the SHA1 codes of both parent and tree. The former is the previous commit in history and the latter is the directory of files described by this commit.
The user interface for git cat-file isn't what you'd define friendly. For example, if you git cat-file a tree object, you'll get a binary file sent to the terminal. This happens because cat-file is one of the low-level commands, used by other git commands to get their work done.
In git documentation, low-level commands like this one are called plumbing, while the ones meant to be typed by real users are called porcelain -- the user interface that sits above the plumbing, where eventually the user sits. Commands listed in box 1, for example, are part of the porcelain set.
As is apparent, the tree object contained in the two commits is the same one. This is enough to demonstrate that the two file sets, 400MB each, are identical even though we retrieved them in different ways.
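The same experiment can be reproduced in miniature without downloading a kernel. The sketch below builds two unrelated repositories holding identical files and compares the tree each commit points to. The directory names, file contents and demo identity are invented, and the HEAD^{tree} notation is a shorthand not used elsewhere in this article.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
top=$(mktemp -d) && cd "$top"

# Two unrelated repositories that happen to hold the same files.
for name in first second; do
    git init -q "$name"
    mkdir "$name/src"
    echo 'int main(void) { return 0; }' > "$name/src/main.c"
    echo 'demo' > "$name/README"
    git -C "$name" add .
    git -C "$name" commit -q -m "import into $name"
done

# The commits differ (different messages); the trees they point to do not.
tree_first=$(git -C first rev-parse 'HEAD^{tree}')
tree_second=$(git -C second rev-parse 'HEAD^{tree}')
commit_first=$(git -C first rev-parse HEAD)
commit_second=$(git -C second rev-parse HEAD)
echo "trees:   $tree_first $tree_second"
echo "commits: $commit_first $commit_second"
```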
With "git ls-tree" you can check what the internal git representation for such tree objects is; just like in a Unix directory, the object includes names and identifiers for other objects. In this listing, objects of type blob are normal files, those of type tree are subdirectories. Just like a directory associates names to their inode numbers, which are unique within the filesystem, a tree object in git associates names to their hashes, which are unique within the database (and globally, but this is irrelevant here). The content of each object, then, is stored in a file whose name is exactly the associated SHA1 value. A side effect of this approach is that objects in git are immutable: every modification, even of a single bit, creates a new object with a new hash and, thus, a new file.
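A short sketch of what this looks like on disk, in a throwaway repository with invented file names. It relies on one detail not spelled out above: for a freshly created object, git uses the first two hex digits of the hash as a directory under .git/objects and the remaining 38 as the file name. Later on git may move objects into pack files, so the individual file is only guaranteed to be there right after the commit.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
mkdir docs && echo 'notes' > docs/notes.txt && echo 'hello' > hello.txt
git add . && git commit -q -m "two files"

# One line per entry: mode, type (blob or tree), hash, name.
git ls-tree HEAD
entries=$(git ls-tree HEAD | wc -l | tr -d ' ')

# Locate the file that stores the blob for hello.txt.
blob=$(git rev-parse HEAD:hello.txt)
object_file=".git/objects/$(echo "$blob" | cut -c 1-2)/$(echo "$blob" | cut -c 3-)"
ls -l "$object_file"
```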
A secondary effect, and not a trivial one, of the massive use of hash codes is the ease of signing. If the author wants to sign a specific version of the software package, they can just sign the SHA1 that represents the last commit. This vouches for the whole development, because the commit you signed includes the whole directory and the parent commit, as already noted (which in turn includes its tree and its parent, and so on). Signing such hashes is done with the usual asymmetric-key tools.
Thus, if a developer signs a specific commit, whoever gets hold of the same commit or the tree it refers to can be certain about source code integrity by just verifying the signature, even if the files come from untrusted sources. The only components that you need to trust are the tools that create and check SHA1 hashes, i.e. git, gpg and other tools that are usually part of the operating system. You can usually trust them because they are signed by the relevant package maintainers.
The main difference between distributed and centralized version control systems is in how easy (or not) creating branches is.
Personally, as a git user, I find that the concept of branch is very similar to the concept of tag, and I hope this idea won't upset those who know the internal representation of the two. We may say that a branch is like a tag because just like "git tag v1.0" binds a meaningful name to the SHA1 name of the current development status, the command "git branch 1.0-fixes" binds a new symbolic name to the current status. Both such names can be used in retrieving the current version by calling "git checkout <name>".
But a branch name is a moving tag: whenever some change is committed, the tag name will keep referring to the original SHA1, while the branch name will move, following the development.
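The difference is easy to observe in a scratch repository. The sketch below reuses the names v1.0 and 1.0-fixes from the text; everything else (file name, messages, demo identity) is invented.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
echo one > file.txt && git add file.txt && git commit -q -m "release 1.0"

git tag v1.0             # fixed name for the current commit
git branch 1.0-fixes     # movable name for the same commit
git checkout -q 1.0-fixes

echo two >> file.txt && git commit -q -a -m "fix after release"

tag_commit=$(git rev-parse v1.0)
branch_commit=$(git rev-parse 1.0-fixes)
parent_of_branch=$(git rev-parse '1.0-fixes^')
echo "tag:    $tag_commit"
echo "branch: $branch_commit"
```

After the second commit the tag still names the first commit, while the branch has moved one step forward and has the tagged commit as its parent.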
Unless you are using a detached head, which is not common and not covered here, the current source status on the disk (the HEAD position, in git wording and case) corresponds to one of the development branches. Thus every commit operation is actually growing a branch.
The name of the main branch, the one called trunk by other packages, is master. The master branch is created when you make the first commit ever, and it is not a special name at all. All branches are managed in the same way; you can rename any branch you like or remove any branch you dislike, including master. Deleting a branch is like deleting a tag: all the git objects remain in the repository and you can retrieve them if you know their SHA1, at least until you garbage collect, a topic not covered in these pages. When you delete a branch, git tells you what hash it was, so you can undo the deletion if you removed it in error.
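A minimal sketch of undoing a branch deletion, assuming you noted the hash (here it is saved in a shell variable before deleting). The branch name experiment is invented; the first symbolic-ref line only makes sure the initial branch is called master regardless of local configuration.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
echo one > file.txt && git add file.txt && git commit -q -m "base"

git checkout -q -b experiment
echo two >> file.txt && git commit -q -a -m "experimental change"
lost=$(git rev-parse experiment)

git checkout -q master
git branch -D experiment          # git prints the hash the branch pointed to
git branch experiment "$lost"     # same name, same commit: deletion undone
restored=$(git rev-parse experiment)
echo "$restored"
```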
To move from one branch to another you can use the command git checkout <branch>, but the program refuses to perform the task if there are yet-uncommitted local modifications, to prevent losing your work in unexpected ways.
The idea that a branch is just a label, without any reference to where and how it got detached from the original branch, is a remarkable one: if during development you get to a dead end, you can always create a new branch from some place in past history and try a different way to attack your problem; if such new way turns out to be the winning one, you can delete the initial branch without any effect on the new branch. Deleting a branch is like deleting a tag: the only effect is you can't reach the associated status with its human-readable name any more. The fact that a new branch was spun from the one now deleted is irrelevant: the tip of the branch that you preserved identifies the whole history, from project inception up to there, without any reference to other branches or splitting points.
Obviously, you can ask about the differences, or the log, between your head and another branch, for example the one you split from. But a branch refers to no other branch, only to its past history; so, to compare branches, the system scans back the history of both branches until it finds a common commit, a SHA1 value that is common to both branches. This match in hash values is the only indication that the two branches have a common ancestor, and such ancestor can now be used as a starting point to perform the diff or log you asked for.
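The search for the common ancestor can also be requested explicitly. The sketch below uses git merge-base for that, and the three-dot form of git diff, which compares a branch against that ancestor; neither is covered in this article, and the branch and file names are invented.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
echo base > file.txt && git add file.txt && git commit -q -m "common base"
fork_point=$(git rev-parse HEAD)

git checkout -q -b topic
echo topic > topic.txt && git add topic.txt && git commit -q -m "work on topic"
git checkout -q master
echo main > main.txt && git add main.txt && git commit -q -m "work on master"

# The most recent commit reachable from both branches.
ancestor=$(git merge-base master topic)
echo "$ancestor"

# Changes made on topic since it diverged from master.
git diff --stat master...topic
```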
Such flexibility in branch management can easily lead developers to have dozens of branches in their tree; you must therefore be careful in choosing branch names, and remember to delete inactive branches, or move them to another git tree; otherwise, you'll find it hard to track the various aspects of your work.
Box 4 - Version numbers used in git
The identifiers used in git command lines, to name commits or other objects, fall in several categories. The most useful and most used are:
HEAD: the last commit in the current branch
Some commands can act on intervals, like git log or git format-patch. The most common expression for intervals is "v1..v2", which represents all commits that are reachable from v2 but not from v1. Each commit lets you trace back all of its past history, so reachable refers to an ancestor of the specific version. Therefore, the notation .. identifies the history from v1 to v2, if v1 is an ancestor of v2, or from the splitting point up to v2 if the versions are in different branches.
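A quick way to get a feeling for intervals is to count what they contain. The sketch below tags two points of a linear history as v1 and v2, as in the text; the file name and commit messages are invented.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
echo 1 > f && git add f && git commit -q -m "one" && git tag v1
echo 2 >> f && git commit -q -a -m "two"
echo 3 >> f && git commit -q -a -m "three" && git tag v2

# Commits reachable from v2 but not from v1: "three" and "two", not "one".
git log --oneline v1..v2
forward=$(git rev-list --count v1..v2)

# The reverse interval is empty, because v1 is an ancestor of v2.
backward=$(git rev-list --count v2..v1)
echo "$forward $backward"
```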
Hardly any developer writes working code from scratch -- there are very few exceptions, but we can't make tools that only work for them. Moreover, few people can afford to devote themselves to a single problem until it is completely solved, while ignoring other issues. The net result of these internal and external limits on the development activity is that in practice everyone writes code in a fuzzy way. On one hand we tend to add and then remove diagnostic messages or other tricks one may be ashamed of; on the other hand the available time is split among several issues, moving between them and temporarily abandoning each of them before it is completely solved, including the issues that will eventually be fixed.
History in a working branch is thus often quite a mess of changes: commits about different logical problems mix up in a seemingly random way, and some diagnostic code fragments are added and then removed soon after. Before such a mess is delivered to the net and becomes part of the official history of computer science, the author needs to clean up. This means changing the relative order of the commits, collapsing several work steps into a single patch that fixes a bug or adds a feature in a single step, and removing irrelevant modifications.
The tool git offers to this aim is "git rebase -i", where the i means "interactive". The command allows rewriting the history of the current branch, starting from a specified version. For example, "git rebase -i HEAD~10" allows reordering, collapsing and dropping anything since 10 commits ago. To do that, git fires up a text editor, opening a file that includes both the list of the commits in your recent history, one per line, and the instructions about how to edit it. The options are well described, so I won't repeat them here.
If you want to save the current status before daring a reordering step, you can simply create a new branch and try rebasing that one. Otherwise, you might just take note of the original hash, or make a temporary tag. It's always possible, at a later time, to ask git what the differences are between the old and the new branch, using git diff against the old branch, hash or tag.
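The sketch below shows the "save first, then rewrite" approach in a form that runs unattended. Normally git opens your editor on the list of commits; here, purely as an assumption for the demonstration, the GIT_SEQUENCE_EDITOR environment variable makes a sed command play the editor and change the second line from "pick" to "fixup", which folds the last commit into the previous one. The branch name backup and the file contents are invented.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
echo base > f && git add f && git commit -q -m "base"
echo feature >> f && git commit -q -a -m "add feature"
echo fix >> f && git commit -q -a -m "fix typo in feature"

git branch backup                 # remember where we started

# Rewrite the last two commits; sed stands in for the interactive editor.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2s/^pick/fixup/'" git rebase -i HEAD~2

commits=$(git rev-list --count HEAD)          # history is now one commit shorter
old_commits=$(git rev-list --count backup)    # the saved branch is untouched
changes=$(git diff backup)                    # empty: same files, different history
echo "$commits commits now, $old_commits before"
```

When the clean-up only changes how the work is split into commits, as here, the diff against the saved branch is empty, which is a convenient sanity check.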
The history of a package, including all branches, is hosted in the .git folder within the package itself; it is nothing more than a local copy. A common need, therefore, is moving branches between different trees, both within the same disk and across the network.
To copy objects between different trees, git uses the fetch subcommand. While working on the receiving side, you tell on the command line what remote repository and branch you want to download from, as well as the name of the local branch where commits should be placed. The program retrieves the history of the remote branch and only copies the objects that are missing from the local tree. During the copy, any remote tag on the relevant branch is reproduced locally.
When you git fetch, the name you use for the local branch may already exist in your repository. Git will create the branch if it doesn't exist; otherwise it will grow the existing local branch. In this case the local branch should be an ancestor of the branch being fetched, or it wouldn't be possible to copy the objects while preserving the local commits. When this is not the case, the error is "rejected: non fast-forward". Therefore, fetch can only grow a branch, without changing it in any way unless you explicitly force this behaviour.
Usually, the "small programmers" keep a local copy of the development branches by "big programmers", and they periodically git fetch to follow development of the upstream package. A local branch that is used to follow remote development is called remote tracking branch or remote branch for short. If you are modifying an external project, you'll likely create a new local branch that hangs off the "remote tracking" branch. The fetch command is used in this way:
git fetch id-remote-tree source-branch:target-branch
The remote tree may be a pathname, a remote folder specified in ssh format or a URL, either http:// or git://. There may be even more forms I don't know about.
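A minimal sketch of the fetch form shown above, using a plain pathname as the remote tree so that it runs without any network. The directory names upstream and mine and the local branch name upstream-master are invented for the example.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
top=$(mktemp -d) && cd "$top"

# "upstream" plays the remote tree; here it is just another directory.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo one > upstream/file.txt
git -C upstream add file.txt && git -C upstream commit -q -m "upstream work"

# The receiving side: an empty repository that fetches the remote master
# into a local branch called upstream-master.
git init -q mine && cd mine
git fetch -q ../upstream master:upstream-master

fetched=$(git rev-parse upstream-master)
remote_tip=$(git -C ../upstream rev-parse master)
echo "$fetched"
```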
Please note that all branches of a tree are local, even the ones called "remote". All information managed by git is included in the .git folder of your working directory, and this is a design choice. A branch may be remote tracking or not according to how it is used. To simplify your command lines, you can preset your preferred arguments for specific remote trees in your .git/config file.
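One way to fill in that configuration without editing the file by hand is git remote add, sketched below under the assumption that a short name per remote tree is what you want. The remote name upstream and the directory layout are invented; with this setup the fetched branch appears under the name upstream/master.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
top=$(mktemp -d) && cd "$top"
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo one > upstream/file.txt
git -C upstream add file.txt && git -C upstream commit -q -m "upstream work"

git init -q mine && cd mine
# Store the location under a short name; this writes a [remote "upstream"]
# section into .git/config.
git remote add upstream ../upstream
git fetch -q upstream             # no URL or branch names needed any more

url=$(git config remote.upstream.url)
tracked=$(git rev-parse upstream/master)
remote_tip=$(git -C ../upstream rev-parse master)
git branch -r                     # lists the remote-tracking branches
```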
After you worked on a local branch, spun out of a remote-tracking one, a further fetch performed on the remote branch will lead to split branches: your local branch and the remote one have different head commits, even if most of their history is in common. You'll thus frequently need to move the local branch in order to have it rooted in the current tip of the remote one. This operation is called rebase. To perform a rebase you need to be on the local branch; the command is just "git rebase <otherbranch>". Git does the following work: it identifies the most recent common commit, it rewinds all local commits after that forking point, it applies the commits that lead to otherbranch and finally re-applies the ones that have been rewound, managing possible conflicts. Any commit that matches a commit in the other branch is automatically discarded, so most commonly no conflict happens at all.
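The sequence can be tried out in a scratch repository. In the sketch below master stands in for the remote-tracking branch and local-work for your own branch; both names, like the files, are invented for the example.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
echo base > base.txt && git add base.txt && git commit -q -m "base"

git checkout -q -b local-work
echo mine > mine.txt && git add mine.txt && git commit -q -m "my change"

# Meanwhile the other branch moves on.
git checkout -q master
echo theirs > theirs.txt && git add theirs.txt && git commit -q -m "upstream change"

# Replant local-work on the new tip of master.
git checkout -q local-work
git rebase -q master

fork=$(git merge-base master local-work)
master_tip=$(git rev-parse master)
ahead=$(git rev-list --count master..local-work)
echo "local-work is $ahead commit(s) ahead of master"
```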
If you think about it, the "rebase -i" already described is similar. In the most common case you interactively rebase on an ancestor of your commit, as shown, but you can as well use -i when rebasing to a different branch.
Compared with creating and applying a set of patches, a rebase operation is much more powerful and straightforward. Besides, the tool can use a lot of context, so you get fewer conflicts and issues with a rebase than if you created and applied patches. The involved algorithms are called 3-way merge and octopus merge, and they are the state of the art in this area.
Another common requirement during development is importing into a branch some code fragments that already exist in another branch. The command git cherry-pick lets you choose and pick the commits you want, one at a time, and apply them to the head of your current branch. The command is very useful when you have good commits in an experimental branch, so you can selectively apply them to your "good" branch. Also, you can test individual features your fellow developer pushed to their own branch by applying them one at a time to your own branch.
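A minimal sketch of picking one commit out of an experimental branch. The branch name experimental and the two commits ("quick hack" and "useful feature") are invented for the example.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
cd "$(mktemp -d)" && git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
echo base > base.txt && git add base.txt && git commit -q -m "base"

git checkout -q -b experimental
echo hack > hack.txt && git add hack.txt && git commit -q -m "quick hack"
echo good > good.txt && git add good.txt && git commit -q -m "useful feature"
wanted=$(git rev-parse HEAD)

# Back on the "good" branch, take only the useful commit.
git checkout -q master
git cherry-pick "$wanted"
subject=$(git log -1 --format=%s)
ls
```

The picked commit keeps its log message but gets a new hash, because its parent is now a different commit.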
In all situations where several people are developing concurrently, one of the most common problems is conflict handling. A conflict happens when you try to apply a patch to a code fragment, but that fragment is not what you expect it to be, because some other patch modified it. The two changes (your own and the one already applied) start from the same code but they are not compatible, and cannot be merged automatically. A conflict may also happen within the same tree, during a merge (not covered here) or a rebase, whether interactive or not. For example, a conflict may happen when you reverse the order of two patches, if one patch renames a variable and the other changes code that uses that very variable. The later patch can't be applied to the original code, because it would modify lines that didn't exist when the variable was using the old name.
When a conflict happens, git reports it in its messages and stops the rebase operation, leaving the so-called "conflict markers" in the source file. Such markers are the usual <<<<<<<, ======= and >>>>>>> lines. The user is then expected to solve the issue manually and then "git add" the fixed files before continuing the merge or rebase operation (with commands such as "git rebase --continue"). As an alternative, the user can abort the whole rebase with "git rebase --abort". Before you explicitly call git add, the conflicting file is not saved in the database as a git object; this prevents the most common errors, where you would commit the conflict markers.
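The whole cycle (conflict, manual fix, git add, continue) can be rehearsed safely in a scratch repository. In the sketch below two branches change the same line of an invented settings.txt file; GIT_EDITOR is set to "true" only so that the example runs unattended, accepting the original commit message when the rebase continues.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
export GIT_EDITOR=true
cd "$(mktemp -d)" && git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
echo 'colour = red' > settings.txt && git add settings.txt && git commit -q -m "base"

git checkout -q -b local-work
echo 'colour = green' > settings.txt && git commit -q -a -m "prefer green"
git checkout -q master
echo 'colour = blue' > settings.txt && git commit -q -a -m "prefer blue"

# Both branches changed the same line, so the rebase stops.
git checkout -q local-work
git rebase master || echo "rebase stopped on a conflict"
cat settings.txt                              # shows the conflict markers
markers=$(grep -c '^<<<<<<<' settings.txt)

# Resolve by hand (here: simply decide on green), stage, continue.
echo 'colour = green' > settings.txt
git add settings.txt
git rebase --continue
final=$(cat settings.txt)
```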
Unlike CVS, git only finds conflicts when merging files that are already known to it, so there is no information loss: both files are still there. With CVS and some other centralized version systems, the conflicts happen between a local version and a file recorded in the repository. The program in this case adds the conflict markers in the local file, so the user won't have the original local file any more, and hand-editing is the only possible way out. With git, the local file modified by the markers is just a temporary copy, which is considered derived from two parent files, both known to git. To express this dual-parent situation, the git diff command uses a special output format in this case, to show separately the differences of the descendant from both ancestors at the same time. It takes some time to get accustomed to this new format, but with some practice you'll appreciate the usefulness of such information.
In conflict management it is helpful, once again, that git records history as immutable objects. Even in the most horrible source corruption, you can recover a good version to restart from, ignoring the result of the erroneous operation. If you tried a merge or rebase hoping it would succeed, but then it fails miserably and you have no time to fix the conflicts, you can just check out one of the original branches: all the files riddled with conflict markers will just be deleted, together with the files whose merge succeeded.
Another mistake that may happen is making a local modification to a remote-tracking branch. In this case, a later git pull will perform a merge, the local branch will feature an incorrect history, and your identifiers for the remote commits won't match upstream any more, because the commits were applied to a different local tree. Sometimes conflicts may arise, but they are the wrong way round: instead of being unable to apply the local patch to the upstream code, git tries to apply the upstream patches to the local development.
The solution here is relatively easy: you can rename your branch and repeat your fetch or pull. No significant data transfer will take place, because your repository already downloaded the objects, but you'll get a new, correct, remote-tracking branch. Later on, you can delete the previous branch, or cherry-pick some local commits from it, or check out one of the local commits in its history to rebase it to the current remote branch.
After you have cleaned up the code to make it acceptable, rebased to the new upstream version and solved any conflict, the next step is usually publication, sending the patches to maintainers. The command "git format-patch" is used to create in the current directory one file for each commit, starting from the version named on the command line up to the tip of the current branch. The name of each such file starts with a 4-digit number, from patch 0001- onwards, so they appear properly ordered when you name them with wildcards on the command line (e.g. *.patch or otherwise).
The files git format-patch creates are laid out like email messages, with all the headers. According to the options you gave to the command, messages may include all information needed to be identified as an email thread, if sent as-is. If you want to contribute your work to discussion lists for the relevant package, you can simply send those messages. If your email client changes messages in an unpleasant way (like breaking long lines or encoding in some non-plain-text MIME representation) you can run git send-email directly. This however assumes some more configuration of the git package, because you must tell it how to actually send out the messages.
At the other side of the net there are people who need to apply locally the patches they received by email. To do that they simply need to run git am (apply mailbox). The command applies the patches and reproduces the log message in the branch where it runs, preserving authorship and other attributions. If the current commit at the recipient's place is not the same as what the original poster intended, the program will apply the patch using the same techniques (and the same limits) as the patch command, not being able to run a 3-way merge technique. If a conflict happens, usually upstream maintainers discard the contribution and send back a terse and cold message to the original poster: "please rebase and resubmit". Actually, a rebase step can use the whole history to automatically fix conflicts, so it's really easier (globally) for you to rebase and resubmit than for them to guess the right fix for a conflict.
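The round trip can be rehearsed locally, with a directory standing in for the mailing list. In the sketch below the contributor and maintainer repositories, the outgoing directory and the tag base are all invented names; the -o option tells format-patch where to write its files instead of the current directory.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
top=$(mktemp -d) && cd "$top"

# Contributor side: two commits on top of a tagged starting point.
git init -q contributor && cd contributor
echo base > file.txt && git add file.txt && git commit -q -m "base" && git tag base
echo one >> file.txt && git commit -q -a -m "add line one"
echo two >> file.txt && git commit -q -a -m "add line two"
git format-patch -q -o ../outgoing base      # one file per commit since "base"
ls ../outgoing
patches=$(ls ../outgoing | wc -l | tr -d ' ')

# Maintainer side: a copy of the project as it was at "base".
cd "$top" && git clone -q contributor maintainer && cd maintainer
git reset -q --hard base
git am -q ../outgoing/*.patch
applied=$(git rev-list --count base..HEAD)
last_subject=$(git log -1 --format=%s)
last_author=$(git log -1 --format=%an)
```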
git format-patch is able to detect file renames or "copy and slightly edit" situations, so it reports this information in its output. For this reason, the patch command is not always able to work as git am. But besides renames or copies, the two diff formats are the same; in those special cases, however, the git format is more compact and more readable than the standard diff output (i.e. patch input), at least until the new feature is added to the two Unix commands.
In addition to the command line, which remains the preferred interaction tool for developers, there are some graphic tools for git users, which are useful both to understand how a project's history evolved and to navigate among the various development branches.
The figure shows a window of gitk (written in Tcl/Tk). Another approach to visualization is that of gitweb, which is usually installed on the servers that offer source code through git.
The git package is distributed together with extensive documentation, as man pages (man command). For each subcommand you find a manual page whose name begins with git-, so for example you can invoke "man git-fetch". This convention reflects the origins of git, when each subcommand was actually a standalone command (with a dash in the name); but it also allows splitting a big corpus of documentation into useful parts, whereas a single man page would be unmanageable. The main page, "man git", is available nonetheless, and provides introductory and general information.
Something more introductory, designed for beginners, is gittutorial(7) (i.e., the gittutorial man page in chapter 7 of the manual), and its sequel gittutorial-2(7), which goes into more depth. Other manual pages that can be useful are listed in the SEE ALSO section of git(1).
The official project site is git.or.cz, and includes among other things an interesting "git for svn users", and other course material, within http://git.or.cz/course/.
The http://www.youtube.com/watch?v=4XpnKHJAok8 video is a recording of Linus Torvalds talking about git to Google technicians. It's more like informal chatting than a technical presentation, but it is quite interesting nonetheless.
Box 5 in this page briefly lists other git subcommands that I originally planned to describe as useful or otherwise interesting; detailed information about such tools can be found elsewhere, as hinted in this section.
Box 5 - Other important subcommands
This box lists other git commands that are not covered in this article, but that I suggest studying if you want to become a serious git user. Some have been touched on in box 1.