Sunday, December 20, 2015

Tracing Subversion path histories

For the past few days I have been trying to convert a large Subversion repository to Git. Every time I do this for a big repo with a long history, I face the problem of mapping branches and tags from their path-based locations in Subversion to their reference-based names in Git.

The problem is that branches and tags in Subversion are just directories that follow naming conventions, and conventions tend to change. In a repository with a long history, it's very common to find that branches and tags have been moved around.

Every conversion tool I know requires us to specify the paths of the Subversion trunk, branches, and tags, and how they are to be mapped to Git references. I've used git-svn in the past, but now I'm experimenting with subgit to see how it goes. So far so good. But the mapping problem still requires some thought and experimentation.

For example, in the repo I'm converting, different kinds of branches are kept under different directories: feature branches under /branches/dev and release branches under /branches/fix. So I started out with this configuration:

  branches = branches/dev/*:refs/heads/dev/*
  branches = branches/fix/*:refs/heads/fix/*

However, after having converted everything I noticed that some of the branches had short histories. That is, "git log branchname" showed me just a few commits and ended in an orphan commit. The initial history of the branch, showing when it branched from trunk, was lost.

With the help of "svn log" I was able to see that these branches had been moved in the past from branches/dev to branches/old and back again. Since I hadn't configured the path branches/old to be converted, all that history was lost. So I had to add another line to the configuration file and start again:

  branches = branches/old/*:refs/heads/old/*
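
To make the effect of these mapping lines concrete, here is a minimal Python sketch of how such a line could translate a Subversion path into a Git reference. It is only an illustration under stated assumptions, not how subgit works internally: the function name map_to_ref is made up, and I assume that "*" stands for exactly one path component.

```python
import re

def map_to_ref(svn_path, mappings):
    """Return the Git ref for svn_path, or None if no mapping matches.

    Hypothetical helper: each mapping is a string such as
    "branches/dev/*:refs/heads/dev/*".
    """
    for mapping in mappings:
        svn_pattern, ref_pattern = mapping.split(":", 1)
        # Assumption: "*" stands for exactly one path component.
        regex = "^" + re.escape(svn_pattern).replace(r"\*", "([^/]+)") + "$"
        match = re.match(regex, svn_path)
        if match:
            return ref_pattern.replace("*", match.group(1))
    # A path that matches no mapping is simply not converted.
    return None
```

With only the first two mapping lines, a path like branches/old/foo maps to nothing, which is why the part of a branch's history that lived there disappears. Once the third line is added, the same path maps to refs/heads/old/foo.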

Mind you, this is a simplified version of the story, as there were many more instances of branch and tag renaming that I kept discovering along the way. And doing that with "svn log" commands isn't fun.

So I looked for a tool that could help me figure out, once and for all, the complete naming history of every branch and tag I'm interested in. Basically, I'd like to specify a list of paths existing at HEAD and have the tool tell me the complete history of renames for each of them, back to when they were originally copied from trunk or another branch.


I wasn't able to find anything like this so I ended up writing a small script that's really useful.

The script is invoked like this:

  svn-trace-paths.pl --pathsfile FILE --logfile FILE >trace.csv

The --pathsfile argument must be a file listing one path per line. Each path represents a branch, a tag, or trunk, i.e., everything you want to trace the history of.

The --logfile argument must be a file containing the complete log of the repository in XML format, which you may produce like this:

  svn log --xml -v REPO_ROOT_URL >logfile.xml

Generating the complete log for a repository may take minutes or even hours, depending on its size and on whether you're accessing it locally or from a remote server. So it's better to produce it once and reuse it whenever you want to trace the histories of different paths.

The script produces a CSV spreadsheet on its standard output, so it's better to redirect it to a file. Each line represents the creation of a path. The first column has the path and the second column the number of the revision in which it was created. If it was created as a copy of another path (which is the most common situation if you're following branches and tags), it has two more columns showing the name of the original path and the revision from which it was copied. A fictional example may make it clearer:

  /tags/1.0.1       1534 /branches/fix/1.0 1533
  /tags/1.0.0       1234 /branches/fix/1.0 1230
  /branches/fix/1.0  999 /trunk             998
  /trunk               1

In this example you could have listed in the --pathsfile only the two paths for the tags (/tags/1.0.1 and /tags/1.0.0). The script would start out tracking them and would find along the way the paths from which they were copied, at which point it would start tracking those too. That's why it ended up showing the history of /branches/fix/1.0 and /trunk.
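
The tracing idea described above can be sketched in a few lines of Python. This is only an illustration, not the actual svn-trace-paths.pl script. The function name trace_paths is made up. It assumes the log has the shape that "svn log --xml -v" produces: logentry elements with a revision attribute, and path elements with action, copyfrom-path, and copyfrom-rev attributes. It loads the whole log into memory, and it ignores the case where a parent directory of a tracked path was the thing that got copied or moved.

```python
import xml.etree.ElementTree as ET

def trace_paths(log_xml, start_paths):
    """Return (path, rev, copyfrom_path, copyfrom_rev) rows, newest first.

    Hypothetical sketch: log_xml is the text of "svn log --xml -v" and
    start_paths are the paths whose creation history should be traced.
    """
    root = ET.fromstring(log_xml)
    # Walk from the newest revision to the oldest, whatever the log order is.
    entries = sorted(root.iter("logentry"),
                     key=lambda entry: int(entry.get("revision")),
                     reverse=True)
    tracked = set(start_paths)
    rows = []
    for entry in entries:
        rev = int(entry.get("revision"))
        for node in entry.iter("path"):
            path = node.text
            # Only additions (A) and replacements (R) create a path.
            if node.get("action") not in ("A", "R") or path not in tracked:
                continue
            # The creation was found, so stop tracking this path.
            tracked.discard(path)
            source = node.get("copyfrom-path")
            if source is None:
                rows.append((path, rev, None, None))
            else:
                rows.append((path, rev, source, int(node.get("copyfrom-rev"))))
                # From now on, also trace the path it was copied from.
                tracked.add(source)
    return rows
```

Given a log describing the fictional example above, calling trace_paths(log_xml, ["/tags/1.0.1", "/tags/1.0.0"]) would return four rows, the last one being the creation of /trunk with no copy source. For a log of several hundred megabytes, a streaming parser such as xml.etree.ElementTree.iterparse would probably be a better fit than parsing one big string.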

Note that since /trunk wasn't copied from anywhere, its line shows just the first two columns.

Also note that /tags/1.0.0 was created in r1234 as a copy of /branches/fix/1.0, but from an earlier revision: r1230. This is common in some situations. If you use an automated continuous integration system to validate your branches and tag them automatically, the validation can take several minutes, during which new commits may be created. In the end, the tag must be made from the revision that was validated, which isn't HEAD anymore.

The repo I'm converting has 235973 revisions and its complete XML log is 534MiB in size. I'm interested in tracing the history of 287 paths. To process that amount of information the script took 3min29s on my laptop and produced a spreadsheet with 442 lines. Not bad at all, I'd say.

You can open the resulting spreadsheet and study it to see all the paths that you must configure to be able to convert the complete history of the paths you're interested in.

With a little shell script you can produce a list of all of them:

  (csvtool col 1 trace.csv; csvtool col 3 trace.csv) | sort -u

csvtool is a really handy command. I use it here to take only the first and third columns from the CSV file, the ones containing the created paths and the paths they were copied from. I concatenate them all and feed them to "sort -u", which sorts them and removes duplicates.
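
If csvtool isn't at hand, the same list can be produced with a short Python sketch. The function name paths_to_configure is made up, and I'm assuming that trace.csv is comma-separated with the four columns described above. Unlike the shell pipeline, this version skips the empty third column of paths that weren't copied from anywhere, and its sort order doesn't depend on the locale.

```python
import csv
import io

def paths_to_configure(csv_text):
    """Return the sorted, unique paths found in columns 1 and 3.

    Hypothetical helper: csv_text is the content of the trace file,
    assumed to hold: path, revision, copy source path, copy source revision.
    """
    paths = set()
    for row in csv.reader(io.StringIO(csv_text)):
        # Index 0 is the created path, index 2 the path it was copied from.
        for index in (0, 2):
            if len(row) > index and row[index]:
                paths.add(row[index])
    return sorted(paths)
```

To use it, read the file produced by the script and print one path per line, for example with print("\n".join(paths_to_configure(open("trace.csv").read()))).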

That's how I know every path I should consider when I'm configuring my conversion tool.

Now that I have this script, I'm sure I'll use it whenever I want to know the history of particular long-lived branches. It's nice when you do something that ends up having unintended usefulness. :-)