Saturday, July 27, 2013

Git Hooks are awesome, but hard

Git is awesome and has profoundly changed the daily work of many developers. In addition to providing a very rich set of concepts and tools, Git is extensible in several ways, which makes it even more powerful.

One of the ways in which Git's functionality can be extended is via hooks. A Git hook is an external program (usually, a script) that Git invokes during the execution of some of its native operations. As of Git 1.8.3.4 there are 20 different Git hooks. During a git commit, for example, Git may invoke up to five hooks in specific phases of the command execution, in this order: pre-commit, prepare-commit-msg, commit-msg, post-commit, and post-rewrite.

When Git is about to invoke a hook it looks for an executable file named after the hook in the repository's .git/hooks directory. So, for example, when it's about to invoke the pre-commit hook it looks for an executable file called .git/hooks/pre-commit in the repository. If it doesn't find one, it simply continues the command execution. Otherwise, it invokes the file, passing it some information about the current command's state, and waits for it to terminate.

Some hooks can affect the Git command with their exit values. The pre-commit hook, for example, can abort the commit if it terminates abnormally, i.e., with an exit value different from zero. The post-commit hook, however, can't, since it is invoked after the commit has been completed.
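To make the exit-value mechanism concrete, here is a minimal sketch of a pre-commit hook. It is an illustration under stated assumptions, not a recommended policy: the NOCOMMIT marker and the has_marker helper are names I made up for this example, and the check looks at the files in the working tree rather than at the staged content.

```shell
#!/bin/sh
# Sketch of a .git/hooks/pre-commit script.
# NOCOMMIT is a hypothetical marker; has_marker is a helper made up for this example.

# Succeed (exit 0) if at least one of the given files contains the marker.
has_marker() {
    [ $# -gt 0 ] || return 1
    grep -q 'NOCOMMIT' -- "$@" 2>/dev/null
}

# Run the check only when this file is installed under the name "pre-commit".
if [ "${0##*/}" = "pre-commit" ]; then
    # Paths that are added, copied, or modified in the index.
    files=$(git diff --cached --name-only --diff-filter=ACM)
    # Word splitting on $files is intentional; paths with spaces are not handled.
    # Note: this inspects the working tree copies, not the staged content.
    if [ -n "$files" ] && has_marker $files; then
        echo "pre-commit: remove the NOCOMMIT marker before committing" >&2
        exit 1    # any non-zero exit value aborts the commit
    fi
fi
```

Installed as an executable .git/hooks/pre-commit, the script lets the commit go through when it exits with zero and aborts it otherwise.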

Git passes information about the current state of the command that is being executed to its hooks. This information may go via command parameters, environment variables, or standard input. Each hook has a specific set of information that's passed to it in some specific form.
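The three channels can be seen side by side in a small sketch. The describe_hook_input function is a hypothetical helper, not something Git provides; it only reports what it was given.

```shell
#!/bin/sh
# Sketch: report the three channels through which a hook may receive information.
# describe_hook_input is a hypothetical helper, not part of Git.
describe_hook_input() {
    # 1. Command parameters (commit-msg, for example, gets the message file name as $1).
    for arg in "$@"; do
        echo "arg: $arg"
    done
    # 2. Environment variables (Git exports GIT_DIR, among others, when running hooks).
    echo "env: GIT_DIR=${GIT_DIR:-<unset>}"
    # 3. Standard input (pre-receive, for example, gets one "<old> <new> <ref>" line per ref).
    if [ ! -t 0 ]; then
        while read -r line; do
            echo "stdin: $line"
        done
    fi
}
```

Calling it from the body of any hook, as in describe_hook_input "$@", is a cheap way to find out what that particular hook receives.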

By default, when you git init or git clone a repository the .git/hooks directory ends up with some template files, all having the .sample suffix in their names and helpful instructions inside explaining how to convert them into working hooks. To enable one, you simply have to edit it and rename it, dropping the suffix. And don't forget to make it executable.
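As a sketch, the enabling steps could be wrapped in a small function. The enable_sample_hook name is hypothetical, and the function copies the template instead of renaming it, so the original sample stays around; you still have to edit the result to suit your needs.

```shell
#!/bin/sh
# Sketch: turn a sample hook into a working one.
# enable_sample_hook is a hypothetical helper; it copies instead of renaming
# so that the original template is preserved.
# usage: enable_sample_hook REPO_DIR HOOK_NAME
enable_sample_hook() {
    repo=$1
    hook=$2
    sample="$repo/.git/hooks/$hook.sample"
    if [ ! -f "$sample" ]; then
        echo "no sample found for $hook" >&2
        return 1
    fi
    cp "$sample" "$repo/.git/hooks/$hook" &&
    chmod +x "$repo/.git/hooks/$hook"
}
```

For example, enable_sample_hook . pre-commit run at the top of a working tree would activate the pre-commit template.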

Run git help githooks to read details about all the hooks and to understand their most common uses.

What Can Hooks Do?


In theory they can do anything allowed by the privileges of the user invoking them. Note that some hooks are invoked by your local Git, such as the above-mentioned commit hooks. These hooks run as yourself and have all the privileges that you have to investigate or change your local repository. Other hooks are invoked by the remote Git, the most common being the pre-receive and the update hooks. Those are invoked by the Git process running in the remote repository and are commonly used to reject pushes with commits that don't obey some of the project's agreed-upon policies.

Why Are They Awesome?


Because they can extend or restrict the functionality of Git's native commands in very useful ways.

For example, suppose your project's team decides to adhere to a set of coding standards. You could implement a pre-receive hook to run on the central Git server to check those standards in every added or modified source file in every commit, rejecting pushes carrying commits violating those standards. The remote hook's error messages are shown to the users performing the git push, letting them know what is wrong with their commits. This way you can automate a significant part of your code review process.

Even better, the same hook, slightly modified, could be installed by all developers on their own cloned repositories as a pre-commit or a post-commit hook, letting them know at commit time if they have violated any rule, before going on with development.

Most hooks are used to check for policy violations such as these. But you can also use them as a notification service. For instance, the post-receive hook is invoked after a successful git push and can be used to notify interested parties about recent activity in the central repository.

You can even use a hook to trigger the execution of some action external to Git, turning it into your Personal Workflow Automatizator Tabajara. For example, a post-receive hook could check if a specific branch called production has been changed and update the system in the production server via ssh, rsync, or even git pull in another clone.
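A rough sketch of the branch check for such a post-receive hook follows. It relies on the fact that post-receive gets one "<old> <new> <ref>" line per updated ref on its standard input; the production_updated function is a hypothetical name, and the deployment itself is only echoed because how you deploy is entirely up to you.

```shell
#!/bin/sh
# Sketch of a post-receive hook that reacts to updates of a single branch.
# production_updated is a hypothetical helper; the deploy step is only echoed.

# Read "<old> <new> <ref>" lines from stdin and succeed if the production
# branch was created or updated (a deletion does not count).
production_updated() {
    found=1
    while read -r old new ref; do
        if [ "$ref" = "refs/heads/production" ]; then
            case $new in
                *[!0]*) found=0 ;;   # a real commit id
            esac                     # an all-zero id means the branch was deleted
        fi
    done
    return $found
}

# Run only when this file is installed under the name "post-receive".
if [ "${0##*/}" = "post-receive" ]; then
    if production_updated; then
        echo "production changed: the deployment would be triggered here"
    fi
fi
```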

If your own imagination fails you, you can resort to Google to look for all sorts of useful hook scripts available elsewhere (e.g. https://github.com/gitster/git/tree/master/contrib/hooks and http://google.com/search?q=git+hooks).

Why Are They Hard?


Three things: implementing hooks requires Git-Fu, it's not easy to integrate functionality in a single hook, and it's not trivial to make them efficient.

Git-Fu


How many Git commands do you use? Ten? Twenty?

Last time I counted there were 161 Git commands... Really! Run git help -a to see them all, and then some.

Most of these commands aren't needed for your daily workflow. The ones you use directly (add, commit, checkout, branch, fetch, push, etc.) are part of a class of commands called porcelain, of which there are just a few. The majority of Git's commands belong to another class called plumbing. Those are the building blocks with which some porcelain commands are constructed, and they allow you to really get into Git's innards to investigate and poke around in the repository.

You don't need to know about the plumbing while you're just using Git as a high level version control tool. But as soon as you start to write hooks you have to learn some of the esoteric and fascinating plumbing commands. That's what I call Git-Fu. You don't need to be a Git master, but you're gonna need a little Git-Fu to be a proficient hook developer.
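To give a taste of that Git-Fu, here is a small sketch that combines two plumbing commands, git rev-list and git diff-tree, to list the files touched by a range of commits, which is the kind of question a policy-checking hook has to answer. The changed_files name is hypothetical, and the sketch deliberately ignores corner cases: merge commits and root commits produce no output with these options, and the all-zero id that a server-side hook receives for a newly created branch isn't handled.

```shell
#!/bin/sh
# Sketch: list the files added or modified by the commits in the range OLD..NEW.
# changed_files is a hypothetical helper built on two plumbing commands.
# Limitations: merge and root commits are skipped by diff-tree with these options,
# and an all-zero OLD (new branch on a server-side hook) is not handled.
changed_files() {
    old=$1
    new=$2
    # rev-list enumerates the commits reachable from NEW but not from OLD.
    git rev-list "$old..$new" | while read -r commit; do
        # diff-tree compares a commit with its parent; -r recurses into subdirectories.
        git diff-tree --no-commit-id --name-only --diff-filter=AM -r "$commit"
    done | sort -u
}
```

A hook could then loop over the output of changed_files "$old" "$new" and run its checks on each path.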

Integration


There are 20 different hooks, but each repository has just one of each. Suppose you already have a cool pre-receive hook in your project's central repository to check against coding standards violations and you stumble upon an awesome hook on GitHub to check the formatting of commit log messages. You would like to use both to guarantee the high quality of your project's commits. However, you can't use them both "as is" because there can be only one pre-receive hook in the repository.

One solution is to integrate the two hooks into a third one implementing both checks. This can be easy or hard, depending on the complexities of both hooks. Of course, if each one is written in a different programming language, the integration would be tantamount to re-implementing one into the other.

A more general solution is to implement a "hook driver", i.e. a script which would invoke a set of other scripts in turn, passing to them the same parameters, checking their exit values, and exiting accordingly. The one thing that makes this solution non-trivial is the fact that some Git hooks (e.g. pre-push, pre-receive, post-receive, and post-rewrite) also get information from their standard input. So, the driver has to read all the input and then feed it to each one of the other scripts in turn.
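The driver idea can be sketched in a few lines of shell. This is only an illustration under assumptions of my own: the run_hooks function and the convention of keeping the real scripts in a directory named after the hook plus a ".d" suffix (e.g. .git/hooks/pre-receive.d/) are not something Git defines.

```shell
#!/bin/sh
# Sketch of a hook driver. run_hooks and the "<hook>.d" directory layout
# are assumptions made for this example, not Git conventions.

# usage: run_hooks DIR [ARGS...]
# Runs every executable file in DIR with ARGS and a copy of our stdin.
# Stops at the first failure and returns that script's exit value.
run_hooks() {
    dir=$1
    shift
    input=$(mktemp) || return 1
    cat >"$input"                 # stdin can be read only once, so save it
    status=0
    for script in "$dir"/*; do
        [ -f "$script" ] && [ -x "$script" ] || continue
        "$script" "$@" <"$input" || { status=$?; break; }
    done
    rm -f "$input"
    return $status
}

# Run only when this file is installed under the name "pre-receive".
if [ "${0##*/}" = "pre-receive" ]; then
    run_hooks "$0.d" "$@"
    exit $?
fi
```

The scripts run in the order in which the shell expands the glob, so prefixing their names with numbers is an easy way to control the sequence.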

Anyway, standard Git doesn't have a ready solution for the need to invoke different programs in one hook.

Efficiency


Hooks invoked locally usually don't have to be particularly efficient. However, the hooks in your Git central server may be invoked much more frequently, even more so if your server serves many repositories for a large group of developers.

Moreover, if you're using "hook drivers", each hook may be invoking many processes to perform its duties. Since most hooks are implemented as scripts, just the startup times of the interpreters can have a significant impact on the overall utilization of your server. (If you're interested in comparing programming languages startup times, I've blogged about it recently.)

Yet another issue that may affect the efficiency of your hooks is that most of them have to invoke one or more of Git's plumbing commands to grok information about the repository and be able to process it and take action. If you have integrated many scripts behind a driver, most of them may be invoking the same Git command to grok the same information over and over again. Since they're in different processes and unaware of each other, they can't cache the information.
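Independent scripts can share results only if they cooperate. One possible workaround, sketched below, is a file-based cache: it assumes that every script is run by the same driver and that the driver creates a scratch directory and exports its path in a variable. The cached function and the HOOK_CACHE_DIR variable are hypothetical names, and the sketch pays for the sharing with extra file I/O.

```shell
#!/bin/sh
# Sketch: share the output of an expensive command among cooperating scripts.
# cached and HOOK_CACHE_DIR are hypothetical names; the driver is assumed to
# create the directory and export the variable before running the scripts.

# usage: cached KEY COMMAND [ARGS...]
# Prints COMMAND's output, running it only the first time KEY is requested.
cached() {
    key=$1
    shift
    file="${HOOK_CACHE_DIR:?HOOK_CACHE_DIR must be set}/$key"
    if [ ! -f "$file" ]; then
        # Write to a temporary name first so a failed run leaves no cache entry.
        "$@" >"$file.tmp" || { rm -f "$file.tmp"; return 1; }
        mv "$file.tmp" "$file"
    fi
    cat "$file"
}
```

A script would then call, say, cached new-commits git rev-list "$old..$new" and get the stored answer whenever another script has already asked the same question under the same key.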

And the solution is...


Well, not "the", but "a" solution to alleviate the above-mentioned problems would be to come up with a framework for implementing Git hooks. Such a framework should provide an easier API to get the hook parameters and to invoke the plumbing. It also should implement the hook driver concept directly. And it should also allow for some kind of caching of information about the repository, minimizing the need to invoke Git commands redundantly.

Guess what? There is at least one such framework. It's Git::Hooks. From yours truly.

I should like to say a few things about it in the forthcoming posts...

Saturday, July 20, 2013

Programming languages startup times - 2013 roundup

I just revised the study I did a year ago about programming languages startup times. It all started because I was writing some small script that would be frequently invoked and I wanted to know how the startup times of Bash and Perl compared against each other. The results were not at all what I expected and I extended the investigation to other languages. The main conclusion for me was that Bash and Perl had very similar startup times, which let me stick with Perl, much to my delight.

That post received some attention this week due to my referring to it in another blog post, which made me want to repeat it to see if anything has changed in the meantime and to do it a little bit more properly. Also, I got some feedback and suggestions to extend it even further. So, in order to make it easier for me to repeat it and, perhaps, to encourage people to replicate it on other platforms and with other languages, I've written a simple script called startup-times to automate the benchmark process.

The script is written in Perl (you guessed it!) and uses the Benchmark module to calculate the timings. This time I investigated 12 programming languages, two compiled (C and Java) and 10 interpreted. Running it on my laptop, which is still the same I used a year ago, a Dell Latitude E6410, now running Lubuntu 13.10, I got this:

$ ./startup-times
Bash: GNU bash, versão 4.2.45(1)-release (x86_64-pc-linux-gnu)
  timethis for 1: 10.1873 wallclock secs ( 0.08 usr +  0.92 sys =  1.00 CPU) @ 3840.00/s (n=3840)

C: gcc (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
  timethis for 1: 5.09504 wallclock secs ( 0.09 usr +  1.04 sys =  1.13 CPU) @ 3964.60/s (n=4480)

Java: javac 1.7.0_25
  timethis for 1: 304.648 wallclock secs ( 0.14 usr  1.05 sys + 246.21 cusr 52.60 csys = 300.00 CPU) @ 12.80/s (n=3840)

JavaSun: javac 1.7.0_25
  timethis for 1: 208.54 wallclock secs ( 0.11 usr  0.96 sys + 159.18 cusr 44.00 csys = 204.25 CPU) @ 17.55/s (n=3584)

Ksh:   version         sh (AT&T Research) 93u+ 2012-08-01
  timethis for 1: 9.63142 wallclock secs ( 0.07 usr +  1.02 sys =  1.09 CPU) @ 3793.58/s (n=4135)

Lua: Lua 5.2
  timethis for 1: 7.12142 wallclock secs ( 0.12 usr +  0.98 sys =  1.10 CPU) @ 3258.18/s (n=3584)

PHP: PHP 5.4.9-4ubuntu2.2 (cli) (built: Jul 15 2013 18:23:35)
  timethis for 1: 44.1422 wallclock secs ( 0.03 usr  1.07 sys + 23.97 cusr 13.64 csys = 38.71 CPU) @ 86.80/s (n=3360)

Perl: This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi
  timethis for 1: 11.7166 wallclock secs ( 0.09 usr +  1.05 sys =  1.14 CPU) @ 3627.19/s (n=4135)

Python: Python 2.7.4
  timethis for 1: 55.0902 wallclock secs ( 0.12 usr  1.01 sys + 31.30 cusr 15.82 csys = 48.25 CPU) @ 69.64/s (n=3360)

Ruby: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
  timethis for 1: 68.0358 wallclock secs ( 0.02 usr  1.08 sys + 45.19 cusr 13.79 csys = 60.08 CPU) @ 63.91/s (n=3840)

TCL: TCL 8.5
  timethis for 1: 18.4099 wallclock secs ( 0.17 usr  0.88 sys +  5.37 cusr  6.38 csys = 12.80 CPU) @ 233.28/s (n=2986)

Tcsh: tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-unknown-linux) options wide,nls,dl,al,kan,rh,nd,color,filec
  timethis for 1: 26.4094 wallclock secs ( 0.11 usr  0.91 sys +  6.70 cusr  6.76 csys = 14.48 CPU) @ 231.98/s (n=3359)

Zsh: zsh 5.0.0 (x86_64-unknown-linux-gnu)
  timethis for 1: 15.1896 wallclock secs ( 0.11 usr  0.99 sys +  0.50 cusr  0.82 csys =  2.42 CPU) @ 1586.78/s (n=3840)


LANGUAGE   CALLS/s  NULL(ms)     SCORE
       C   879.286     1.137     1.000
     Lua   503.271     1.987     1.747
     Ksh   429.324     2.329     2.048
    Bash   376.941     2.653     2.333
    Perl   352.917     2.834     2.491
     Zsh   252.805     3.956     3.478
     TCL   162.196     6.165     5.421
    Tcsh   127.190     7.862     6.913
     PHP    76.118    13.138    11.552
  Python    60.991    16.396    14.417
    Ruby    56.441    17.718    15.579
 JavaSun    17.186    58.186    51.163
    Java    12.605    79.335    69.759

I think a graph makes some things clearer.



There are a few things to notice. The first one is that Lua beat all other interpreted languages. Rob Hoelz urged me to include it, already predicting this. I'm embarrassed to confess that I don't know much about Lua, even though it's a language with roots in Brazil.

All shells (ksh, bash, zsh, and tcsh) have good and comparable startup times. Among the heavier scripting languages just Lua, Perl, and TCL are in the same ballpark. I've left Tcsh out of the green group because it's the slowest and nobody should program in csh, anyway.

I've put PHP, Python, and Ruby in the yellow group. Their median startup time is six times higher than the green group median. So, for instance, in terms of performance alone for small and frequently used scripts this means that you can get six times more bang for the buck with Perl than with Python or Ruby. :-)
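The "six times" figure can be rechecked from the NULL(ms) column of the table with a few lines of shell and awk. This is a sketch under one assumption: since the groups are only defined by the graph, I take the green group to be Lua, Ksh, Bash, Perl, Zsh, and TCL, and the yellow group to be PHP, Python, and Ruby. The median function is a helper written for this example.

```shell
#!/bin/sh
# Sketch: recompute the ratio between the yellow and green group medians.
# Group membership is an assumption: green = Lua, Ksh, Bash, Perl, Zsh, TCL;
# yellow = PHP, Python, Ruby. The values are the NULL(ms) column of the table.
LC_ALL=C
export LC_ALL

# Print the median of the numbers given as arguments.
median() {
    printf '%s\n' "$@" | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

green=$(median 1.987 2.329 2.653 2.834 3.956 6.165)
yellow=$(median 13.138 16.396 17.718)
# Ratio of the two medians, with one decimal place.
awk -v g="$green" -v y="$yellow" 'BEGIN { printf "%.1f\n", y / g }'
```

With these groups the ratio comes out at roughly six, in line with the claim above.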

Java is another story. I even tried two different JDKs: the OpenJDK that comes with Ubuntu and the Sun/Oracle JDK to see how much they differ. Not much. Both crawl in comparison with all other languages. There seems to be a fair amount of discussion about this "problem". Even in academia. But I couldn't find a solution. It seems that Java simply isn't cut out for this particular niche of programming.

Saturday, February 9, 2013

Regular expression crosswords

Hey, now this is a truly "nerdy" game. :-)

Aurélio Jargas posted the challenge on Twitter yesterday and said he would publish the solution after Carnival. I couldn't wait:

[Image]

If you'd rather solve it yourself, download the board.

Friday, December 21, 2012

"This is the end of the world"



If only I were a little more credulous, I'd think this had to be a sign of something.

I'm a fan of Muse's early albums but I'd never bought one. Until today, when I was looking for a present for my dad, stumbled upon the Absolution CD, and decided to buy it.

When I put it on in the car, while driving home, it dawned on me that those lyrics were so appropriate for today. The day the world is supposed to end...

"and our time is running out."

Wednesday, September 26, 2012

Why can't I download "Pouer Point"?

Check out the chat conversation I had with my daughter:

14:44 Juliana: daddy, i came on here to study but i'm on winddows and i want to go to pouer point, but there's no power point, how do i download it?

14:46 me: hi sweetie.  You can't download PowerPoint because it's a proprietary program. It's paid.  We don't have PowerPoint at home.

14:47 Juliana: and why can't i download it?

14:47 me: because it's illegal. You have to pay Microsoft to be able to install PowerPoint.  Some programs are proprietary and paid.  Others, like the ones on Linux, are "free" (as in freedom) and cost nothing.  Those we can download.

14:48 Juliana: what's that google one again?

14:48 me: It's Google Docs.

14:49 Click on "Drive" up there at the top of Gmail.  It runs right in Chrome

14:49 Juliana: ok

14:49 me: Later we'll talk and I'll explain this proprietary software and free software thing to you, okay?

14:50 Juliana: no need

14:50 me: But I want to!  Come on, let me!

14:50 Juliana: hahahahahaha

Will she let me explain? I'll tell you later. :-)

Thursday, July 5, 2012

Why I don't send email receipts

I don't usually let my email reader send out email receipts. If you have sent an email to me expecting to get a read- or a return-receipt, please don't. I don't mean to be rude. I simply don't like them. They're broken in several ways: they aren't reliable and they can be misused. So, why bother?

If you want to know if and when I read your email, please, say so. A simple "Please, acknowledge this." at the end of the message will trigger an instant reply. But please, don't make it a part of your signature. Not every message needs an acknowledgement and if you always require them I'll soon start to avoid replying. Few things annoy me more than a signature with an "I look forward to your reply" (or an equivalent "Aguardo retorno" in Portuguese) in it. I think that's rude.

 

Thursday, June 28, 2012

Programming languages start-up times

I don't remember the last time I wrote a substantial program in anything but Perl. It fills almost all of my needs as a sysadmin and dilettante programmer. But there are situations in which I write bash scripts. Two, to be precise.

One situation in which I feel more like using bash than Perl is when the script is small and functions as a driver to invoke other programs. Bash (any shell, in fact) syntax to invoke other programs is more succinct than Perl's. As you know: succinctness is power.

The other situation is when the script in question is short and is going to be invoked lots of times. In this case, I worry about its start-up time, because it may very well dominate the overall system performance. I always assumed that Perl's start-up time was much larger than any shell's start-up time. However, as Knuth wisely said: Premature optimization is the root of all evil. You should always verify your assumptions with a profiler or, at least, a stopwatch before investing in any optimization work.

These days I'm studying the implementation of Git hooks and I'm constantly struggling to decide if I should write them in Perl or bash, because they tend to be frequently invoked when you set up a Git server serving lots of developers.

So, I decided to check exactly that: what's the real difference between the start-up times of Perl and bash? My testing platform is bash. I simply timed one thousand invocations of bash and Perl telling them to do nothing. This is what I got on my Dell Latitude E6410 laptop running Ubuntu 12.04:

$ (time for i in `seq 1 1000`; do bash -c :; done) 2>&1 | grep real
real 0m2.858s

$ (time for i in `seq 1 1000`; do perl -e 0; done) 2>&1 | grep real
real 0m3.326s

Not that different at all, is it? Perl takes just 16% more time to do nothing than bash. I sure was expecting Perl to take much more time than bash. Of course, a script doing nothing isn't that useful, although it can inspire a blog post. But while in order to perform useful work a bash script needs to invoke other programs, a Perl script can do many things in a single process just by useing (sic) Perl modules. So, I guess that after starting up behind a bash script, an equivalent Perl script is going to catch up and finish the run first almost every time.

I found this very interesting. So much so that I decided to extend my investigations to other scripting and compiled languages as well. Just out of curiosity. But the results were startling.

The other three main scripting languages fared much worse than Perl. I wasn't expecting such a huge difference:

$ (time for i in `seq 1 1000`; do ruby -e 0; done) 2>&1 | grep real
real 0m5.628s

$ (time for i in `seq 1 1000`; do python -c 0; done) 2>&1 | grep real
real 0m27.373s

$ (time for i in `seq 1 1000`; do echo exit | tclsh; done) 2>&1 | grep real
real 0m10.991s

Ruby is 1.7 times slower than Perl, TCL is 3.3 times slower, and Python is 8.2 times slower!
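For the record, those factors are simply each language's total time divided by Perl's. The arithmetic can be redone with a small sketch; the ratio function is a helper written for this example, and the numbers are the "real" times reported above.

```shell
#!/bin/sh
# Sketch: derive the slowdown factors from the "real" times measured above.
LC_ALL=C
export LC_ALL

# usage: ratio TIME BASELINE (both in seconds)
# Prints TIME divided by BASELINE with one decimal place.
ratio() {
    awk -v t="$1" -v b="$2" 'BEGIN { printf "%.1f\n", t / b }'
}

perl_time=3.326    # seconds for 1000 runs of "perl -e 0"
echo "Ruby:   $(ratio 5.628  "$perl_time") times Perl"
echo "TCL:    $(ratio 10.991 "$perl_time") times Perl"
echo "Python: $(ratio 27.373 "$perl_time") times Perl"
```

The same function reproduces the bash comparison: ratio 3.326 2.858 gives the factor behind the 16% mentioned earlier, rounded to one decimal place.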

What about compiled languages? They should be faster, right? Of course they are. Let's C:

$ cat >null.c <<EOF
#include <stdlib.h>
int main()
{
    exit(0);
}
EOF

$ gcc -O -o null null.c

$ (time for i in `seq 1 1000`; do ./null; done) 2>&1 | grep real
real 0m1.185s

This is interesting, because I reckon that this C program must have one of the shortest possible start-up times. So we can use it as a yardstick with which to compare every other language.

What about Java? I don't speak Java, so I googled "java helloworld", found a good example and stripped it of all non-essential work:

$ cat >Null.java <<EOF
public class Null
{
    public static void main(String args[])
    {
    }
}
EOF

$ javac Null.java

$ (time for i in `seq 1 1000`; do java Null; done) 2>&1 | grep real
real 0m58.231s

What?!? Almost one minute for doing nothing one thousand times? I did it again and again just to be sure. I realize Java isn't a vanilla compiled language. At least, not like C. The Java compiler generates bytecode that is interpreted by the JVM. But since scripting languages in general have to perform the source to bytecode conversion just before the interpretation I thought that Java would be at least a little faster than most. So much for enterprise languages...

So, to sum it all up, here is the final score of the game: