Saturday, July 27, 2013

Git Hooks are awesome, but hard

Git is awesome and has profoundly changed the daily work of many developers. In addition to providing a very rich set of concepts and tools, Git is extensible in several ways, which makes it even more powerful.

One of the ways in which Git's functionality can be extended is via hooks. A Git hook is an external program (usually, a script) that Git invokes during the execution of some of its native operations. As of Git 1.8.3.4 there are 20 different Git hooks. During a git commit, for example, Git may invoke up to five hooks in specific phases of the command execution, in this order: pre-commit, prepare-commit-msg, commit-msg, post-commit, and post-rewrite.

When Git is about to invoke a hook it looks for an executable file named after the hook in the repository's .git/hooks directory. So, for example, when it's about to invoke the pre-commit hook it looks for an executable file called .git/hooks/pre-commit in the repository. If it doesn't find one, it simply continues the command execution. Otherwise, it invokes the file, passing it some information about the current command's state, and waits for it to terminate.

Some hooks can affect the Git command with their exit values. The pre-commit hook, for example, can abort the commit if it terminates abnormally, i.e., with an exit value different from zero. The post-commit hook, however, can't, since it is invoked after the commit has been completed.
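To make the lookup and the exit-value rule concrete, here's a minimal Python sketch of the decision described above. It is only an illustration of the behaviour, not Git's actual code, and the names find_hook and may_proceed are made up for this example. It models the hooks that can veto a command; for a hook like post-commit the result would simply be ignored.

```python
import os
import subprocess


def find_hook(git_dir, name):
    """Return the path of an enabled hook, or None if there is none."""
    path = os.path.join(git_dir, "hooks", name)
    # A hook only counts if it is a regular file with an executable bit set.
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    return None


def may_proceed(git_dir, name, args=()):
    """Run the hook, if any, and report whether the command should carry on."""
    hook = find_hook(git_dir, name)
    if hook is None:
        return True  # no hook installed: nothing can veto the command
    result = subprocess.run([hook, *args])
    return result.returncode == 0  # any non-zero exit value is a veto
```

For instance, may_proceed(".git", "pre-commit") would be False whenever an executable .git/hooks/pre-commit exits with a value other than zero.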

Git passes information about the current state of the command that is being executed to its hooks. This information may be passed via command parameters, environment variables, or standard input. Each hook has a specific set of information that's passed to it in some specific form.
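As an example of the first channel, the commit-msg hook gets a single parameter: the path of the file holding the proposed commit message. The sketch below reads that file and complains about the subject line. The 72-character limit and the function names are assumptions picked for illustration, not anything Git enforces.

```python
import sys

MAX_SUBJECT = 72  # assumed team convention, not a Git rule


def subject_problems(message):
    """Return a list of complaints about the first line of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not subject.strip():
        problems.append("the subject line is empty")
    if len(subject) > MAX_SUBJECT:
        problems.append(f"the subject line is longer than {MAX_SUBJECT} characters")
    return problems


def main(argv):
    # argv[1] is the path of the message file that Git passes to commit-msg.
    with open(argv[1], encoding="utf-8") as handle:
        problems = subject_problems(handle.read())
    for problem in problems:
        print(f"commit-msg: {problem}", file=sys.stderr)
    return 1 if problems else 0  # a non-zero exit value aborts the commit


# When saved as .git/hooks/commit-msg, end the file with:
#     sys.exit(main(sys.argv))
```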

By default, when you git init or git clone a repository the .git/hooks directory ends up with some template files, all having the .sample suffix in their names and helpful instructions inside explaining how to convert them into working hooks. To enable them, you simply have to edit and rename them, dropping the suffix. And don't forget to make them executable.
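The enabling steps can be scripted. Here's a small sketch, with a made-up helper name, that copies a sample to its working name and sets the executable bits. It copies instead of renaming so the original template stays around for reference; that's just a choice made for this example.

```python
import os
import shutil
import stat


def enable_sample_hook(git_dir, name):
    """Copy hooks/<name>.sample to hooks/<name> and make it executable."""
    hooks_dir = os.path.join(git_dir, "hooks")
    sample = os.path.join(hooks_dir, name + ".sample")
    target = os.path.join(hooks_dir, name)
    if os.path.exists(target):
        raise FileExistsError(f"{target} already exists; not overwriting it")
    shutil.copyfile(sample, target)  # keep the .sample file for reference
    mode = os.stat(target).st_mode
    os.chmod(target, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target
```

Calling enable_sample_hook(".git", "pre-commit") from the top of a working tree would leave you with an executable .git/hooks/pre-commit, which you should still read and edit before trusting it.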

Run git help githooks to read details about all the hooks and to understand their most common uses.

What Can Hooks Do?


In theory they can do anything allowed by the privileges of the user invoking them. Note that some hooks are invoked by your local Git, such as the above-mentioned commit hooks. These hooks run as you and have all the privileges that you have to investigate or change your local repository. Other hooks are invoked by the remote Git, the most common being the pre-receive and the update hooks. Those are invoked by the Git process running in the remote repository and are commonly used to reject pushes with commits that don't obey some of the project's agreed-upon policies.

Why Are They Awesome?


Because they can extend or restrict the functionality of Git's native commands in very useful ways.

For example, suppose your project's team decides to adhere to a set of coding standards. You could implement a pre-receive hook to run on the central Git server to check those standards in every added or modified source file in every commit, rejecting pushes carrying commits violating those standards. The remote hook's error messages are shown to the users performing the git push, letting them know what is wrong with their commits. This way you can automate a significant part of your code review process.

Even better, the same hook, slightly modified, could be installed by all developers on their own cloned repositories as a pre-commit or a post-commit hook, letting them know at commit time if they have violated any rule, before going on with development.
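Here's a rough sketch of what the local, pre-commit flavour of such a check might look like. The two rules (no trailing whitespace, no tabs) are stand-ins for whatever your team's standards are, and all the function names are invented for this example. The check itself is a plain function, so the same code could be reused by a server-side hook that feeds it file contents from pushed commits.

```python
import subprocess
import sys


def find_violations(path, text):
    """Check one file against two stand-in rules; return readable complaints."""
    complaints = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            complaints.append(f"{path}:{number}: trailing whitespace")
        if "\t" in line:
            complaints.append(f"{path}:{number}: tab character")
    return complaints


def git(*args):
    """Run a Git command and return its standard output as text."""
    done = subprocess.run(["git", *args], check=True, capture_output=True,
                          encoding="utf-8", errors="replace")
    return done.stdout


def main(run_git=git):
    # Names of files added, copied, or modified in the index, NUL-separated.
    names = run_git("diff", "--cached", "--name-only", "--diff-filter=ACM", "-z")
    complaints = []
    for path in filter(None, names.split("\0")):
        staged = run_git("show", f":{path}")  # staged version, not the working copy
        complaints.extend(find_violations(path, staged))
    for complaint in complaints:
        print(complaint, file=sys.stderr)
    return 1 if complaints else 0  # non-zero aborts the commit


# When saved as .git/hooks/pre-commit, end the file with:
#     sys.exit(main())
```

A real hook would probably restrict itself to source files by extension, since this sketch happily inspects binary files too.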

Most hooks are used to check for policy violations such as these. But you can also use them as a notification service. For instance, the post-receive hook is invoked after a successful git push and can be used to notify interested parties about recent activity in the central repository.

You can even use a hook to trigger the execution of some action external to Git, turning it into your Personal Workflow Automatizator Tabajara. For example, a post-receive hook could check if a specific branch called production has been changed and update the system in the production server via ssh, rsync, or even git pull in another clone.
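A post-receive hook along those lines could be sketched as follows. Git feeds the hook one line per updated ref on standard input, in the form "old-id new-id ref-name". The deployment command, host name, and path below are entirely hypothetical placeholders.

```python
import subprocess
import sys

PRODUCTION_REF = "refs/heads/production"
# Hypothetical deployment command; replace it with whatever updates your server.
DEPLOY_COMMAND = ["ssh", "deploy@prod.example.com", "cd /srv/app && git pull"]


def production_updated(lines):
    """True if any '<old> <new> <ref>' line creates or moves the production branch."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        old, new, ref = fields
        deleted = new.strip("0") == ""  # an all-zero new id means the ref was deleted
        if ref == PRODUCTION_REF and not deleted:
            return True
    return False


def main(stdin=sys.stdin, deploy=subprocess.call):
    if production_updated(stdin):
        deploy(DEPLOY_COMMAND)
    return 0  # post-receive runs after the push is accepted and can't reject it


# When saved as hooks/post-receive in the central repository, end the file with:
#     sys.exit(main())
```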

If your own imagination fails you, you can resort to Google to look for all sorts of useful hook scripts available elsewhere (e.g. https://github.com/gitster/git/tree/master/contrib/hooks and http://google.com/search?q=git+hooks).

Why Are They Hard?


Three things: implementing hooks requires Git-Fu, it's not easy to integrate functionality in a single hook, and it's not trivial to make them efficient.

Git-Fu


How many Git commands do you use? Ten? Twenty?

Last time I counted there were 161 Git commands... Really! Run git help -a to see them all, and then some.

Most of these commands aren't needed for your daily workflow. The ones you use directly (add, commit, checkout, branch, fetch, push, etc.) are part of a class of commands called porcelain, of which there are just a few. The majority of Git's commands belong to another class called plumbing. Those are the building blocks with which some porcelain commands are constructed, and they allow you to really get into Git's innards to investigate and poke around in the repository.

You don't need to know about the plumbing while you're just using Git as a high level version control tool. But as soon as you start to write hooks you have to learn some of the esoteric and fascinating plumbing commands. That's what I call Git-Fu. You don't need to be a Git master, but you're gonna need a little Git-Fu to be a proficient hook developer.
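To give a taste of what that looks like, here's a small sketch built around one plumbing command, git cat-file commit, which prints a commit object in its raw form: a few header lines (tree, parent, author, committer), a blank line, and then the log message. The helper names are made up for this example.

```python
import subprocess


def parse_commit(raw):
    """Split the output of 'git cat-file commit <id>' into headers and message."""
    head, _, message = raw.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        if line.startswith(" "):
            continue  # continuation of a multi-line header, skipped in this sketch
        key, _, value = line.partition(" ")
        headers.setdefault(key, []).append(value)  # 'parent' may appear more than once
    return headers, message


def read_commit(commit_id, run=subprocess.check_output):
    """Fetch a raw commit object with plumbing and parse it."""
    raw = run(["git", "cat-file", "commit", commit_id])
    return parse_commit(raw.decode("utf-8", errors="replace"))
```

Inside a repository, read_commit("HEAD") would return the headers as a dictionary of lists plus the full log message, which is the raw material for things like commit message checks.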

Integration


There are 20 different hooks, but each repository has just one of each. Suppose you already have a cool pre-receive hook in your project's central repository to check against coding standards violations and you stumble upon an awesome hook at GitHub to check the formatting of commit log messages. You would like to use both to guarantee the high quality of your project's commits. However, you can't use them both "as is" because there can be only one pre-receive hook in the repository.

One solution is to integrate the two hooks into a third one implementing both checks. This can be easy or hard, depending on the complexities of both hooks. Of course, if each one is written in a different programming language, the integration would be tantamount to re-implementing one into the other.

A more general solution is to implement a "hook driver", i.e. a script which would invoke a set of other scripts in turn, passing to them the same parameters, checking their exit values, and exiting accordingly. The one thing that makes this solution non-trivial is the fact that some Git hooks (viz. pre-push, pre-receive, post-receive, and post-rewrite) also get information from their standard input. So, the driver has to read all the input and then feed it to each one of the other scripts in turn.
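A bare-bones driver might look like the sketch below. The convention of keeping the real scripts in a directory named after the hook plus a ".d" suffix (e.g. hooks/pre-receive.d/) is an assumption of this example, not something Git knows about. It also assumes that standard input is either the hook's data or empty.

```python
import os
import subprocess
import sys


def run_hooks(directory, args, stdin_data):
    """Run every executable file in 'directory', in name order.

    Returns 0 if all of them succeed, else the first non-zero exit value.
    """
    if not os.path.isdir(directory):
        return 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        # Each script gets the same parameters and its own copy of the input.
        result = subprocess.run([path, *args], input=stdin_data)
        if result.returncode != 0:
            return result.returncode  # stop at the first script that objects
    return 0


def main(argv):
    # Assumed layout: the scripts for hooks/<name> live in hooks/<name>.d/
    directory = os.path.abspath(argv[0]) + ".d"
    stdin_data = sys.stdin.buffer.read()  # read once, replay for every script
    return run_hooks(directory, argv[1:], stdin_data)


# When saved as, say, hooks/pre-receive, end the file with:
#     sys.exit(main(sys.argv))
```

Stopping at the first failure is a design choice; a friendlier driver might run everything and report all the complaints at once.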

Anyway, standard Git doesn't have a ready solution for the need to invoke different programs in one hook.

Efficiency


Hooks invoked locally usually don't have to be particularly efficient. However, the hooks in your Git central server may be invoked much more frequently, even more so if your server serves many repositories for a large group of developers.

Moreover, if you're using "hook drivers", each hook may be invoking many processes to perform its duties. Since most hooks are implemented as scripts, just the startup times of the interpreters can have a significant impact on the overall utilization of your server. (If you're interested in comparing programming languages' startup times, I've blogged about it recently.)

Yet another issue that may affect the efficiency of your hooks is that most of them have to invoke one or more of Git's plumbing commands to grok information about the repository and be able to process it and take action. If you have integrated many scripts behind a driver, most of them may be invoking the same Git command to grok the same information over and over again. Since they're in different processes and unaware of each other, they can't cache the information.
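If the checks did live in a single process, the caching itself would be simple. Here's a minimal sketch, with an invented class name, that remembers the output of each distinct Git command. It's only safe for read-only queries whose answers can't change while the hook is running.

```python
import subprocess


class CachedGit:
    """Run Git commands, remembering the output of each distinct command."""

    def __init__(self, run=subprocess.check_output):
        self._run = run
        self._cache = {}

    def __call__(self, *args):
        # The tuple of arguments is the cache key: same command, same answer.
        if args not in self._cache:
            self._cache[args] = self._run(["git", *args])
        return self._cache[args]
```

With git = CachedGit(), two checks that both call git("cat-file", "commit", some_id) would spawn only one Git process between them.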

And the solution is...


Well, not "the", but "a" solution to alleviate the above-mentioned problems would be to come up with a framework for implementing Git hooks. Such a framework should provide an easier API to get the hook parameters and to invoke the plumbing. It also should implement the hook driver concept directly. And it should also allow for some kind of caching of information about the repository, minimizing the need to invoke Git commands redundantly.

Guess what? There is at least one such framework. It's Git::Hooks. From yours truly.

I should like to say a few things about it in the forthcoming posts...