O Que Me Interessa: 2013

sábado, 27 de julho de 2013

Git Hooks are awesome, but hard

Git is awesome and has profoundly changed the daily work of many developers. In addition to providing a very rich set of concepts and tools, Git is extensible in several ways, which makes it even more powerful.

One of the ways in which Git's functionality can be extend is via hooks. A Git hook is an external program (usually, a script) that Git invokes during the execution of some of its native operations. As of Git 1.8.3.4 there are 20 different Git hooks. During a git commit, for example, Git may invoke up to five hooks in specific phases of the command execution, in this order: pre-commit, prepare-commit-msg, commit-msg, post-commit, and post-rewrite.

When Git is about to invoke a hook it looks up for an executable file named after the hook in the repository's .git/hooks directory. So, for example, when it's about to invoke the pre-commit hook it looks for an executable file called .git/hooks/pre-commit in the repository. If it doesn't find any, it simply continues the command execution. Otherwise, it invokes the file passing to it some information about the current command's state and waits for it to terminate.

Some hooks can affect the Git command with their exit values. The pre-commit hook, for example, can abort the commit if it terminates abnormally, i.e., with an exit value different from zero. The post-commit hook, however, can't, since it is invoked after the commit has been completed.

Git passes information about the current state of the command that is being executed to its hooks . This information may go via command parameters, environment variables, or standard input. Each hook has a specific set of information that's passed to it in some specific form.

By default, when you git init or git clone a repository the .git/hooks directory ends up with some template files, all having the .sample suffix in their names and helpful instructions inside explaining how to convert them into working hooks. To enable them, you simply have to edit and to rename them, dropping the suffix. And don't forget to make them executable.

Run git help githooks to read details about all the hooks and to understand their most common uses.

What Can Hooks Do?

In theory they can do anything allowed by the privileges of the user invoking them. Note that some hooks are invoked by your local Git, such as the above mentioned commit hooks. These hooks run as yourself and have all the privileges that you have to investigate or change your local repository. Other hooks are invoked by the remote Git, the most common being the pre-receive and the update hooks. Those are invoked by the Git process running in the remote repository and are commonly used to reject pushes with commits that don't obey some of the project's agreed upon policies.

Why Are They Awesome?

Because they can extend or restrict the functionality of Git's native commands in very useful ways.

For example, suppose your project's team decide to adhere to a set of coding standards. You could implement a pre-receive hook to run on the central Git server to check those standards in every added or modified source file in every commit, rejecting pushes carrying commits violating those standards. The remote hook's error messages are shown to the users performing the git push, letting them know what is wrong with their commits. This way you can automate a significant part of your coding review process.

Even better, the same hook, slightly modified, could be installed by all developers on their own cloned repositories as a pre-commit or a post-commit hook, letting them know at commit time if they have violated any rule, before going on with development.

Most hooks are used to check for policy violations such as these. But you can also use them as a notification service. For instance, the post-receive hook is invoked after a successful git push and can be used to notify interested parties about recent activity in the central repository.

You can even use a hook to trigger the execution of some action external to Git, turning it into your Personal Workflow Automatizator Tabajara™. For example, a post-receive hook could check if a specific branch called production has been changed and update the system in the production server via ssh, rsync, or even git pull in another clone.

If your own imagination fails you, you can resort to Google to look for all sorts of useful hook scripts available elsewhere (e.g. https://github.com/gitster/git/tree/master/contrib/hooks and http://google.com/search?q=git+hooks).

Why Are They Hard?

Three things: implementing hooks require Git-Fu, it's not easy to integrate functionality in a single hook, and it's not trivial to make them efficient.

Git-Fu

How many Git commands do you use? Ten? Twenty?

Last time I counted there were 161 Git commands... Really! Run git help -a to see them all, and then some.

Most of these commands aren't needed for your daily workflow. The ones you use directly (add, commit, checkout, branch, fetch, push, etc.) are part of a class of commands called porcelain, of which there are just a few. The majority of Git's commands belong to another class called plumbing. Those are the building blocks with which some porcelain commands are constructed, and they allow you to really get into Git's innards to investigate and poke around in the repository.

You don't need to know about the plumbing while you're just using Git as a high level version control tool. But as soon as you start to write hooks you have to learn some of the esoteric and fascinating plumbing commands. That's what I call Git-Fu. You don't need to be a Git master, but you're gonna need a little Git-Fu to be a proficient hook developer.

Integration

There are 20 different hooks, but each repository has just one of each. Suppose you already have a cool pre-receive hook in your project's central repository to check against coding standards violations and you stumble upon an awesome hook at GitHub to check the formatting of commit log messages. You would like to use both to guarantee the high quality of your project's commits. However, you can't use them both "as is" because there can be only one pre-receive hook in the repository.

One solution is to integrate the two hooks into a third one implementing both checks. This can be easy or hard, depending on the complexities of both hooks. Of course, if each one is written in a different programming language, the integration would be tantamount to re-implementing one into the other.

A more general solution is to implement a "hook driver", i.e. a script which would invoke a set of other scripts in turn, passing to them the same parameters, checking their exit values, and exiting accordingly. The one thing that makes this solution non-trivial is the fact that some Git hooks (viz. pre-push, pre-receive, and post-rewrite) also get information from their standard input. So, the driver has to read all the input and then feed it to each one of the other scripts in turn.

Anyway, standard Git doesn't have a ready solution for the need to invoke different programs in one hook.

Efficiency

Hooks invoked locally usually don't have to be particularly efficient. However, the hooks in your Git central server may be invoked much more frequently, even more so if your server serves many repositories for a large group of developers.

Moreover, if your're using "hook drivers", each hook may be invoking many processes to perform its duties. Since most hooks are implemented as scripts, just the startup times of the interpreters can have a significant impact in the overall utilization of your server. (If you're interested in comparing programming languages startup times, I've blogged about it recently.)

Yet another issue that may affect the efficiency of your hooks is that most of them have to invoke one or more of Git's plumbing commands to grok information about the repository and be able to process it and take action. If you have integrated many scripts behind a driver, most of them may be invoking the same Git command to grok the same information over and over again. Since they're in different processes and unaware of each other, they can't cache the information.

And the solution is...

Well, not "the", but "a" solution to alleviate the above-mentioned problems would be to come up with a framework for implementing Git hooks. Such a framework should provide an easier API to get the hook parameters and to invoke the plumbing. It also should implement the hook driver concept directly. And it should also allow for some kind of caching of information about the repository, minimizing the need to invoke Git commands redundantly.

Guess what? There is at least one such framework. It's Git::Hooks. From yours truly.

I should like to say a few things about it in the forthcoming posts...

sábado, 20 de julho de 2013

Programming languages startup times - 2013 roundup

I just revised the study I did a year ago about programming languages startup times. It all started because I was writing some small script that would be frequently invoked and I wanted to know how did the startup times of Bash and Perl compare against each other. The results were not at all what I expected and I extended the investigation to other languages. The main conclusion for me was that Bash and Perl had very similar startup times, which let me stick with Perl, much to my delight.

That post received some attention this week due to my refering to it in another blog, which made me want to repeat it to see if anything has changed in the meantime and to do it a little bit more properly. Also, I got some feedback and suggestions to extend it even further. So, in order to make it easier for me to repeat it and, perhaps, to incent people to replicate it in other platforms and with other languages, I've written a simple script called startup-times to automate the benchmark process.

The script is written in Perl (you guessed it!) and uses the Benchmark module to calculate the timings. This time I investigated 12 programming languages, two compiled (C and Java) and 10 interpreted. Running it on my laptop, which is still the same I used a year ago, a Dell Latitude E6410, now running Lubuntu 13.10, I got this:

$ ./startup-times
Bash: GNU bash, versão 4.2.45(1)-release (x86_64-pc-linux-gnu)
timethis for 1: 10.1873 wallclock secs ( 0.08 usr + 0.92 sys = 1.00 CPU) @ 3840.00/s (n=3840)

C: gcc (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
timethis for 1: 5.09504 wallclock secs ( 0.09 usr + 1.04 sys = 1.13 CPU) @ 3964.60/s (n=4480)

Java: javac 1.7.0_25
timethis for 1: 304.648 wallclock secs ( 0.14 usr 1.05 sys + 246.21 cusr 52.60 csys = 300.00 CPU) @ 12.80/s (n=3840)

JavaSun: javac 1.7.0_25
timethis for 1: 208.54 wallclock secs ( 0.11 usr 0.96 sys + 159.18 cusr 44.00 csys = 204.25 CPU) @ 17.55/s (n=3584)

Ksh:   version         sh (AT&T Research) 93u+ 2012-08-01
timethis for 1: 9.63142 wallclock secs ( 0.07 usr + 1.02 sys = 1.09 CPU) @ 3793.58/s (n=4135)

Lua: Lua 5.2
timethis for 1: 7.12142 wallclock secs ( 0.12 usr + 0.98 sys = 1.10 CPU) @ 3258.18/s (n=3584)

PHP: PHP 5.4.9-4ubuntu2.2 (cli) (built: Jul 15 2013 18:23:35)
timethis for 1: 44.1422 wallclock secs ( 0.03 usr 1.07 sys + 23.97 cusr 13.64 csys = 38.71 CPU) @ 86.80/s (n=3360)

Perl: This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi
timethis for 1: 11.7166 wallclock secs ( 0.09 usr + 1.05 sys = 1.14 CPU) @ 3627.19/s (n=4135)

Python: Python 2.7.4
timethis for 1: 55.0902 wallclock secs ( 0.12 usr 1.01 sys + 31.30 cusr 15.82 csys = 48.25 CPU) @ 69.64/s (n=3360)

Ruby: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
timethis for 1: 68.0358 wallclock secs ( 0.02 usr 1.08 sys + 45.19 cusr 13.79 csys = 60.08 CPU) @ 63.91/s (n=3840)

TCL: TCL 8.5
timethis for 1: 18.4099 wallclock secs ( 0.17 usr 0.88 sys + 5.37 cusr 6.38 csys = 12.80 CPU) @ 233.28/s (n=2986)

Tcsh: tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-unknown-linux) options wide,nls,dl,al,kan,rh,nd,color,filec
timethis for 1: 26.4094 wallclock secs ( 0.11 usr 0.91 sys + 6.70 cusr 6.76 csys = 14.48 CPU) @ 231.98/s (n=3359)

Zsh: zsh 5.0.0 (x86_64-unknown-linux-gnu)
timethis for 1: 15.1896 wallclock secs ( 0.11 usr 0.99 sys + 0.50 cusr 0.82 csys = 2.42 CPU) @ 1586.78/s (n=3840)

LANGUAGE   CALLS/s NULL(ms)     SCORE
       C   879.286     1.137     1.000
     Lua   503.271     1.987     1.747
     Ksh   429.324     2.329     2.048
    Bash   376.941     2.653     2.333
    Perl   352.917     2.834     2.491
     Zsh   252.805     3.956     3.478
     TCL   162.196     6.165     5.421
    Tcsh   127.190     7.862     6.913
     PHP    76.118    13.138    11.552
Python    60.991    16.396    14.417
    Ruby    56.441    17.718    15.579
JavaSun    17.186    58.186    51.163
    Java    12.605    79.335    69.759

I think a graph makes some things clearer.

There are a few things to notice. The first one is that Lua beat all other interpreted languages. Rob Hoelz urged me to include it, already predicting this. I'm embarrassed to confess that I don't know much about Lua, even though it's a language with roots in Brazil.

All shells (ksh, bash, zsh, and tcsh) have good and comparable startup times. Among the heavier scripting languages just Lua, Perl, and TCL are in the same ballpark. I've left Tcsh out of the green group because it's the slowest and nobody should program in csh, anyway.

I've put PHP, Python, and Ruby in the yellow group. Their median startup time is six times higher than the green group median. So, for instance, in terms of performance alone for small and frequently used scripts this means that you can get six times more bang for buck with Perl than with Python or Ruby. :-)

Java is another story. I even tried two different JDKs: the OpenJDK that comes with Ubuntu and the ~~Sun~~Oracle JDK to see how much they differ. Not much. Both crawl in comparison with all other languages. There seems to be a fair amount of discussion about this "problem". Even in academia. But I couldn't find a solution. It seems that Java simply isn't cut for this particular niche of programming.