domingo, 20 de dezembro de 2015

Tracing Subversion path histories

During the past few days I am trying to convert a large Subversion repository to Git. Every time I do this for a big repo with a large history I face the problem of mapping branches and tags from their path-based locations on Subversion to their reference-based names in Git.

The problem is that since branches and tags in Subversion are represented as simple directories with naming conventions, and conventions tend to change, for a large history it's very common to see that branches and tags may have been moved around.

Every conversion tool I know require that we specify the paths of Subversion trunk, branches, and tags and how they are to be mapped as Git references. I've used git-svn in the past but now I'm experimenting with subgit to see how it goes. So far so good. But the mapping problem still requires some thought and experimentation.

For example, in the repo I'm converting different kinds of branches are kept under different directories: feature branches under /branches/dev and release branches under /branches/fix. So, I started out with this configuration:

  branches = branches/dev/*:refs/heads/dev/*
  branches = branches/fix/*:refs/heads/fix/*

However, after having converted everything I noticed that some of the branches had short histories. That is, "git log branchname" showed me just a few commits and ended in an orphan commit. The initial history of the branch, showing when it branched from trunk, was lost.

With the help of "svn log" I was able to see that these branches had been moved in the past from branches/dev to branches/old and back again. Since I hadn't configured the path branches/old to be converted, all that history was lost. So, I had to insert another line to the configuration file and start again:

  branches = branches/old/*:refs/heads/old/*

Mind you, this is a simplified version of the story, as there was many more instances of branch and tag renaming that I was learning along the way. And doing that using "svn log" commands isn't fun.

So, I looked for some tool that could help me figure out at once and for all branches and tags I'm interested in their complete naming history. Basically I'd like to specify a list of paths existing at HEAD and the tool should tell me the complete history of renames for each of them, since they were copied originally from trunk or other branch.

I wasn't able to find anything like this so I ended up writing a small script that's really useful.

The script is invoked like this: --pathsfile FILE --logfile FILE >trace.csv

The --pathsfile argument must be a file listing one path per line. Each path represents a branch, a tag, or trunk, i.e., everything you want to trace the history of.

the --logfile argument must be a file containing the complete log of the repository in XML format, which you may produce like this:

  svn log --xml -v REPO_ROOT_URL >logfile.xml

Generating the complete log for a repository may take minutes or even hours depending on its size and if you're accessing it locally or from a remote server. So, it's better to produce it once to be able to reuse it multiple times after if you want to trace histories for different paths.

The script produces a CSV spreadsheet on its standard output, so it's better to redirect it to a file. Each line represents the creation of a path. The first column has the path and the second column the numeric revision when it was created. If it was created as a copy from another path (which is the most common situation if you're following branches and tags) it has two more columns showing the name of the original path and the revision from which is was copied. A fictional example may make it clearer:

  /tags/1.0.1       1534 /branches/fix/1.0 1533
  /tags/1.0.0       1234 /branches/fix/1.0 1230
  /branches/fix/1.0  999 /trunk             998
  /trunk               1

In this example you could have passed in --pathsfile only the two paths for the tags (/tags/1.0.1 and /tags/1.0.0). The script would start out tracking them and would find along the way from which paths they were copied, at which point it would start tracking those. That's why it ended up showing the history of /branches/fix/1.0 and /trunk.

Note that since /trunk wasn't copied from anywhere, it's line shows just the two first columns.

Also note that /tags/1.0.0 was created on r1234 as a copy from /branches/fix/1.0, but from a previous revision: r1230. This is common in some situations. If you use an automated continuous integration system to validate your branch and tag them automatically it can take several minutes to validate it while new commits may be created so that at the end the tag must be made to the revision that was validated, which isn't HEAD anymore.

The repo I'm converting has 235973 revisions and its complete XML log has 534MiB. I'm interested in tracing the history of 287 paths. To process that amount of information the script took 3min29s in my laptop and produced a spreadsheet with 442 lines. Not bad at all, I'd say.

You can open the resulting spreadsheet and study it to see all the paths that you must configure to be able to convert the complete history of the paths you're interested in.

With a little shell script you can produce a list of all of them:

  (csvtool col 1 trace.csv; csvtool col 3 trace.csv) | sort -u

The csvtool is a really handy command I use here to take only the first and third columns from the CSV file, the ones containing the original and copied paths. I concatenate them all and feed them to "sort -u" which sorts them and remove duplicates.

That's how I know every path I should consider when I'm configuring my conversion tool.

Now that I have this script, I'm sure I'll use it whenever I want to know the history of particular long lived branches. It's nice when you do something that ends up having unintended usefulness. :-)

sábado, 27 de julho de 2013

Git Hooks are awesome, but hard

Git is awesome and has profoundly changed the daily work of many developers. In addition to providing a very rich set of concepts and tools, Git is extensible in several ways, which makes it even more powerful.

One of the ways in which Git's functionality can be extend is via hooks. A Git hook is an external program (usually, a script) that Git invokes during the execution of some of its native operations. As of Git there are 20 different Git hooks. During a git commit, for example, Git may invoke up to five hooks in specific phases of the command execution, in this order: pre-commit, prepare-commit-msg, commit-msg, post-commit, and post-rewrite.

When Git is about to invoke a hook it looks up for an executable file named after the hook in the repository's .git/hooks directory. So, for example, when it's about to invoke the pre-commit hook it looks for an executable file called .git/hooks/pre-commit in the repository. If it doesn't find any, it simply continues the command execution. Otherwise, it invokes the file passing to it some information about the current command's state and waits for it to terminate.

Some hooks can affect the Git command with their exit values. The pre-commit hook, for example, can abort the commit if it terminates abnormally, i.e., with an exit value different from zero. The post-commit hook, however, can't, since it is invoked after the commit has been completed.

Git passes information about the current state of the command that is being executed to its hooks . This information may go via command parameters, environment variables, or standard input. Each hook has a specific set of information that's passed to it in some specific form.

By default, when you git init or git clone a repository the .git/hooks directory ends up with some template files, all having the .sample suffix in their names and helpful instructions inside explaining how to convert them into working hooks. To enable them, you simply have to edit and to rename them, dropping the suffix. And don't forget to make them executable.

Run git help githooks to read details about all the hooks and to understand their most common uses.

What Can Hooks Do?

In theory they can do anything allowed by the privileges of the user invoking them. Note that some hooks are invoked by your local Git, such as the above mentioned commit hooks. These hooks run as yourself and have all the privileges that you have to investigate or change your local repository. Other hooks are invoked by the remote Git, the most common being the pre-receive and the update hooks. Those are invoked by the Git process running in the remote repository and are commonly used to reject pushes with commits that don't obey some of the project's agreed upon policies.

Why Are They Awesome?

Because they can extend or restrict the functionality of Git's native commands in very useful ways.

For example, suppose your project's team decide to adhere to a set of coding standards. You could implement a pre-receive hook to run on the central Git server to check those standards in every added or modified source file in every commit, rejecting pushes carrying commits violating those standards. The remote hook's error messages are shown to the users performing the git push, letting them know what is wrong with their commits. This way you can automate a significant part of your coding review process.

Even better, the same hook, slightly modified, could be installed by all developers on their own cloned repositories as a pre-commit or a post-commit hook, letting them know at commit time if they have violated any rule, before going on with development.

Most hooks are used to check for policy violations such as these. But you can also use them as a notification service. For instance, the post-receive hook is invoked after a successful git push and can be used to notify interested parties about recent activity in the central repository.

You can even use a hook to trigger the execution of some action external to Git, turning it into your Personal Workflow Automatizator Tabajara. For example, a post-receive hook could check if a specific branch called production has been changed and update the system in the production server via ssh, rsync, or even git pull in another clone.

If your own imagination fails you, you can resort to Google to look for all sorts of useful hook scripts available elsewhere (e.g. and

Why Are They Hard?

Three things: implementing hooks require Git-Fu, it's not easy to integrate functionality in a single hook, and it's not trivial to make them efficient.


How many Git commands do you use? Ten? Twenty?

Last time I counted there were 161 Git commands... Really! Run git help -a to see them all, and then some.

Most of these commands aren't needed for your daily workflow. The ones you use directly (add, commit, checkout, branch, fetch, push, etc.) are part of a class of commands called porcelain, of which there are just a few. The majority of Git's commands belong to another class called plumbing. Those are the building blocks with which some porcelain commands are constructed, and they allow you to really get into Git's innards to investigate and poke around in the repository.

You don't need to know about the plumbing while you're just using Git as a high level version control tool. But as soon as you start to write hooks you have to learn some of the esoteric and fascinating plumbing commands. That's what I call Git-Fu. You don't need to be a Git master, but you're gonna need a little Git-Fu to be a proficient hook developer.


There are 20 different hooks, but each repository has just one of each. Suppose you already have a cool pre-receive hook in your project's central repository to check against coding standards violations and you stumble upon an awesome hook at GitHub to check the formatting of commit log messages. You would like to use both to guarantee the high quality of your project's commits. However, you can't use them both "as is" because there can be only one pre-receive hook in the repository.

One solution is to integrate the two hooks into a third one implementing both checks. This can be easy or hard, depending on the complexities of both hooks. Of course, if each one is written in a different programming language, the integration would be tantamount to re-implementing one into the other.

A more general solution is to implement a "hook driver", i.e. a script which would invoke a set of other scripts in turn, passing to them the same parameters, checking their exit values, and exiting accordingly. The one thing that makes this solution non-trivial is the fact that some Git hooks (viz. pre-push, pre-receive, and post-rewrite) also get information from their standard input. So, the driver has to read all the input and then feed it to each one of the other scripts in turn.

Anyway, standard Git doesn't have a ready solution for the need to invoke different programs in one hook.


Hooks invoked locally usually don't have to be particularly efficient. However, the hooks in your Git central server may be invoked much more frequently, even more so if your server serves many repositories for a large group of developers.

Moreover, if your're using "hook drivers", each hook may be invoking many processes to perform its duties. Since most hooks are implemented as scripts, just the startup times of the interpreters can have a significant impact in the overall utilization of your server. (If you're interested in comparing programming languages startup times, I've blogged about it recently.)

Yet another issue that may affect the efficiency of your hooks is that most of them have to invoke one or more of Git's plumbing commands to grok information about the repository and be able to process it and take action. If you have integrated many scripts behind a driver, most of them may be invoking the  same Git command to grok the same information over and over again. Since they're in different processes and unaware of each other, they can't cache the information.

And the solution is...

Well, not "the", but "a" solution to alleviate the above-mentioned problems would be to come up with a framework for implementing Git hooks. Such a framework should provide an easier API to get the hook parameters and to invoke the plumbing. It also should implement the hook driver concept directly. And it should also allow for some kind of caching of information about the repository, minimizing the need to invoke Git commands redundantly.

Guess what? There is at least one such framework. It's Git::Hooks. From yours truly.

I should like to say a few things about it in the forthcoming posts...

sábado, 20 de julho de 2013

Programming languages startup times - 2013 roundup

I just revised the study I did a year ago about programming languages startup times. It all started because I was writing some small script that would be frequently invoked and I wanted to know how did the startup times of Bash and Perl compare against each other. The results were not at all what I expected and I extended the investigation to other languages. The main conclusion for me was that Bash and Perl had very similar startup times, which let me stick with Perl, much to my delight.

That post received some attention this week due to my refering to it in another blog, which made me want to repeat it to see if anything has changed in the meantime and to do it a little bit more properly. Also, I got some feedback and suggestions to extend it even further. So, in order to make it easier for me to repeat it and, perhaps, to incent people to replicate it in other platforms and with other languages, I've written a simple script called startup-times to automate the benchmark process.

The script is written in Perl (you guessed it!) and uses the Benchmark module to calculate the timings. This time I investigated 12 programming languages, two compiled (C and Java) and 10 interpreted. Running it on my laptop, which is still the same I used a year ago, a Dell Latitude E6410, now running Lubuntu 13.10, I got this:

$ ./startup-times
Bash: GNU bash, versão 4.2.45(1)-release (x86_64-pc-linux-gnu)
  timethis for 1: 10.1873 wallclock secs ( 0.08 usr +  0.92 sys =  1.00 CPU) @ 3840.00/s (n=3840)

C: gcc (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
  timethis for 1: 5.09504 wallclock secs ( 0.09 usr +  1.04 sys =  1.13 CPU) @ 3964.60/s (n=4480)

Java: javac 1.7.0_25
  timethis for 1: 304.648 wallclock secs ( 0.14 usr  1.05 sys + 246.21 cusr 52.60 csys = 300.00 CPU) @ 12.80/s (n=3840)

JavaSun: javac 1.7.0_25
  timethis for 1: 208.54 wallclock secs ( 0.11 usr  0.96 sys + 159.18 cusr 44.00 csys = 204.25 CPU) @ 17.55/s (n=3584)

Ksh:   version         sh (AT&T Research) 93u+ 2012-08-01
  timethis for 1: 9.63142 wallclock secs ( 0.07 usr +  1.02 sys =  1.09 CPU) @ 3793.58/s (n=4135)

Lua: Lua 5.2
  timethis for 1: 7.12142 wallclock secs ( 0.12 usr +  0.98 sys =  1.10 CPU) @ 3258.18/s (n=3584)

PHP: PHP 5.4.9-4ubuntu2.2 (cli) (built: Jul 15 2013 18:23:35)
  timethis for 1: 44.1422 wallclock secs ( 0.03 usr  1.07 sys + 23.97 cusr 13.64 csys = 38.71 CPU) @ 86.80/s (n=3360)

Perl: This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi
  timethis for 1: 11.7166 wallclock secs ( 0.09 usr +  1.05 sys =  1.14 CPU) @ 3627.19/s (n=4135)

Python: Python 2.7.4
  timethis for 1: 55.0902 wallclock secs ( 0.12 usr  1.01 sys + 31.30 cusr 15.82 csys = 48.25 CPU) @ 69.64/s (n=3360)

Ruby: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
  timethis for 1: 68.0358 wallclock secs ( 0.02 usr  1.08 sys + 45.19 cusr 13.79 csys = 60.08 CPU) @ 63.91/s (n=3840)

TCL: TCL 8.5
  timethis for 1: 18.4099 wallclock secs ( 0.17 usr  0.88 sys +  5.37 cusr  6.38 csys = 12.80 CPU) @ 233.28/s (n=2986)

Tcsh: tcsh 6.18.01 (Astron) 2012-02-14 (x86_64-unknown-linux) options wide,nls,dl,al,kan,rh,nd,color,filec
  timethis for 1: 26.4094 wallclock secs ( 0.11 usr  0.91 sys +  6.70 cusr  6.76 csys = 14.48 CPU) @ 231.98/s (n=3359)

Zsh: zsh 5.0.0 (x86_64-unknown-linux-gnu)
  timethis for 1: 15.1896 wallclock secs ( 0.11 usr  0.99 sys +  0.50 cusr  0.82 csys =  2.42 CPU) @ 1586.78/s (n=3840)

       C   879.286     1.137     1.000
     Lua   503.271     1.987     1.747
     Ksh   429.324     2.329     2.048
    Bash   376.941     2.653     2.333
    Perl   352.917     2.834     2.491
     Zsh   252.805     3.956     3.478
     TCL   162.196     6.165     5.421
    Tcsh   127.190     7.862     6.913
     PHP    76.118    13.138    11.552
  Python    60.991    16.396    14.417
    Ruby    56.441    17.718    15.579
 JavaSun    17.186    58.186    51.163
    Java    12.605    79.335    69.759
I think a graph makes some things clearer.

There are a few things to notice. The first one is that Lua beat all other interpreted languages. Rob Hoelz urged me to include it, already predicting this. I'm embarrassed to confess that I don't know much about Lua, even though it's a language with roots in Brazil.

All shells (ksh, bash, zsh, and tcsh) have good and comparable startup times. Among the heavier scripting languages just Lua, Perl, and TCL are in the same ballpark. I've left Tcsh out of the green group because it's the slowest and nobody should program in csh, anyway.

I've put PHP, Python, and Ruby in the yellow group. Their median startup time is six times higher than the green group median. So, for instance, in terms of performance alone for small and frequently used scripts this means that you can get six times more bang for buck with Perl than with Python or Ruby. :-)

Java is another story. I even tried two different JDKs: the OpenJDK that comes with Ubuntu and the SunOracle JDK to see how much they differ. Not much. Both crawl in comparison with all other languages. There seems to be a fair amount of discussion about this "problem". Even in academia. But I couldn't find a solution. It seems that Java simply isn't cut for this particular niche of programming.

sábado, 9 de fevereiro de 2013

Expressões regulares cruzadas

Ei, essa sim é uma brincadeira "nerd". :-)

O Aurélio Jargas postou o desafio no twitter ontem e ficou de publicar a resolução depois do Carnaval. Eu não consegui esperar:



Pra quem quiser resolver sozinho, baixe o tabuleiro.

sexta-feira, 21 de dezembro de 2012

"This is the end of the world"

If only I were a little more credulous, I'd think this had to be a sign of something.

I'm a fan os Muse's early albums but I've never bought one. Until today, while I was looking for a present for my dad and I stumbled upon the Absolution CD and decided to buy it.

When I put it on in the car, while driving home, it downed on me that those lyrics were so much apropriate for today. The day the world is supposed to end...

"and our time is running out."

quarta-feira, 26 de setembro de 2012

Por que não dá pra baixar o "Pouer Point"?

Vejam só a conversa que tive com minha filha via chat:

14:44 Juliana: papi, eu entrei aqui pra estuda mais eu to no winddows e quero ir no pouer point, mas nao tem power point, como eu baixo?

14:46 eu: oi fofs.  Não dá pra baixar o PowerPoint porque é um programa proprietário. Ele é pago.  A gente não tem PowerPoint em casa.

14:47 Juliana: e pq n pode baixar?

14:47 eu: porque é ilegal. Tem que pagar pra Microsoft pra poder instalar o PowerPoint.  Alguns programas são proprietários e pagos.  Outros, como os do Linux, são "livres" e de graça.  Esses a gente pode baixar.

14:48 Juliana: qual é aquele do google memo?

14:48 eu: É o Google Docs.

14:49 Clique em "Disco" aí em cima do Gmail.  Ele roda no Chrome mesmo

14:49 Juliana: ok

14:49 eu: Depois a gente conversa e eu te explico esse negócio de software proprietário e software livre, tá?

14:50 Juliana: nao prescisa

14:50 eu: Mas eu quero!  Deixa, vai!

14:50 Juliana: rsrrsrzsrrsrsrs

Será que ela vai me deixar explicar? Depois eu conto. :-)

quinta-feira, 5 de julho de 2012

Why I don't send email receipts

I don't usually let my email reader send out email receipts. If you have sent an email to me expecting to get a read- or a return-receipt, please don't. I don't mean to be rude. I simply don't like them. They're broken in several ways: they aren't reliable and they can be misused. So, why bother?

If you want to know if and when I read your email, please, say so. A simple "Please, acknowlege this." at the end of the message will trigger an instant reply. But please, don't make it a part of your signature. Not every message needs an acknowlegement and if you always require them I'll soon start to avoid replying. Few things annoy me more than a signature with a "I look forward to your reply" (or an equivalent "Aguardo retorno" in Portuguese) in it. I think that's rude.