Monday, November 9, 2020

Perl is Dead Expressive

A few days ago, when I learned that Python is now officially the new Cobol Java, I remembered a conversation I had several years ago with some colleagues about programming languages. I blogged about it in Portuguese back then, and I think it would be nice to have it in English too.

One of my teammates, an ardent proponent of Python, was presenting some slides boasting the virtues of the language. At some point he started presenting some dangerous slides ... each comparing Python to another programming language: Perl, Bash, Java, and Ruby. I found them dangerous because I think that comparing two languages decently is a complex task that cannot be condensed into one slide. First, a set of objective criteria for comparison must be defined. Then, it is necessary to take into account the context in which the language is being used: things like the domain of the applications that will be developed, the development and deployment platforms, the developers' experience with the language, the size of the team, and the time constraints of the project. After all that, you have to resist the temptation to argue passionately for the language of your preference in order to give at least the appearance of rationality.

But that's ok ... in a small group, this type of discussion is as stimulating and harmless as talking about politics, sports, or religion. ;)

I think it was on the Bash slide that he suggested a problem for which a standard Unix shell would not offer a solution as economical and as readable as the Python interactive shell could. The problem was, more or less, the following. Suppose there is a set of files in a directory whose names consist of an alphabetical prefix, followed by a sequence of digits, and ending in the extension .jpg. For example:
 $ ls
 a0.jpg b1.jpg c123.jpg
The challenge is to rename them so that all filenames have the same number of digits in them. In the case above, the result should be:
 a000.jpg b001.jpg c123.jpg
I left the talk with the problem in my head and the first thing I did was to come up with some one-liners:
 # printing the names
 $ ls | perl -lpe \
  's/^([a-z]+)(\d+)\.jpg/sprintf "%s%03d.jpg", $1, $2/e'
 a000.jpg
 b001.jpg
 c123.jpg

 # generating commands to rename them
 $ ls | perl -lpe \
  's/^([a-z]+)(\d+)\.jpg/sprintf "mv -n %s %s%03d.jpg", $&, $1, $2/e'
 mv -n a0.jpg a000.jpg
 mv -n b1.jpg b001.jpg
 mv -n c123.jpg c123.jpg

 # executing commands in the shell
 $ ls | perl -lpe \
  's/^([a-z]+)(\d+)\.jpg/sprintf "mv -n %s %s%03d.jpg", $&, $1, $2/e' \
  | sh
 $ ls
 a000.jpg  b001.jpg  c123.jpg
That's how I usually develop a shell solution. Instead of loops, I prefer to use commands that generate other commands, like the mv above, so that I can easily verify that I am doing the right thing. Once I'm sure of that, I just add a "| sh" at the end of the pipeline to execute the generated commands. Perl has some very useful options for writing one-liners like this: -l, -a, -n, -p, and -e are the ones I use most often. Read the perlrun documentation to learn more about them and many other interesting options. But, so that nobody says Perl can't do the job alone, I also wrote a solution that doesn't pipe anything to the shell.
 # doing everything in Perl
 $ ls | perl -lne \
  'if (/^([a-z]+)(\d+)\.jpg/) {
    rename $_, sprintf "%s%03d.jpg", $1, $2
  }'

 $ ls
 a000.jpg b001.jpg c123.jpg
Another teammate, who is a Bash fan, didn't let it go and came up with the following solutions:
 $ ls
 a0.jpg b1.jpg c123.jpg

 $ for i in *.jpg; do
 >   j=${i%*.jpg}
 >   printf "mv -n %s %s%03d.jpg\n" $i ${j//[0-9]/} ${j//[a-z]/}
 > done
 mv -n a0.jpg a000.jpg
 mv -n b1.jpg b001.jpg
 mv -n c123.jpg c123.jpg

 $ for i in *.jpg; do
 >   j=${i%*.jpg}
 >   printf "mv -n %s %s%03d.jpg\n" $i ${j//[0-9]/} ${j//[a-z]/}
 > done | sh

 $ ls
 a000.jpg b001.jpg c123.jpg
Ninja! I'll confess that I never had the willpower to learn these advanced Bash string manipulation moves. For me, shell is a glue that serves to stick other commands together. Whenever I need something more complicated, like data structures or regular expressions, I don't think twice about using Perl. But the Python die-hard counterattacked with this:
$ python
Python 2.7.18 (default, Aug  4 2020, 11:16:42) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for name in os.listdir("."):
...   base, number, ext = name[0], name[1:name.find(".")], name.split(".")[1]
...   os.rename(name, "%s%03d.%s"%(base, int(number), ext))
... 
>>> 

$ ls
a000.jpg  b001.jpg  c123.jpg

$ # readability counts
Ah ... what a subtle criticism in that last comment.

IS IT?


When I want to solve a problem with a one-liner, "readability" is irrelevant, because if I am not going to save the solution in a script, no one else will read it, right? Come on ... if I were going to save it in a script I could write something more like his Python version. Something like this:
    opendir CWD, '.';
    foreach $name (readdir CWD) {
        if (($base, $number, $ext) = ($name =~ /^(.)(\d+)\.(.*)/)) {
           rename $name, sprintf("%s%03d.%s", $base, $number, $ext);
        }
    }
    closedir CWD;
Hmmm ... I didn't even try to break the name with string operations because I find the regular expression more direct and, in this case, more readable. To make it even more readable I would replace the opendir, readdir, and closedir calls with a glob pattern:
    foreach $name (<*.jpg>) {
        if (($base, $number, $ext) = ($name =~ /^(.)(\d+)\.(.*)/)) {
           rename $name, sprintf("%s%03d.%s", $base, $number, $ext);
        }
    }
Better, right? But it’s still not good. It's very, how can I say it... heavyweight. One of the big differences between Perl and many other languages, Python in particular, is that we don't always have to be explicit. It is more or less like using pronouns or hidden subjects in natural languages. When you start learning a foreign language, you don't know it very well and you baby-talk like this:
Joe is married. Joe has five children. Joe's children are all single.
Then you learn to use the pronouns and start speaking more economically.
Joe is married. He has five children. They are all single.
Until you are really fluent in the language and speak naturally like this:
Joe is married and has five children, all single.
Unintelligible? Of course not. Unless you are just starting to learn English. We usually talk to people who are as fluent as we are, so we can, and should, be economical and direct. By avoiding redundancies we are not just more direct. We are also more intelligible (or readable), because we do not fill our speech with a series of repeated names that end up polluting the text and hiding the real content of the message. Well, all of this is to explain my next version, in which I delete the variable $name, because in Perl the loop iterator may be implicitly operated on, like this:
    foreach (<*.jpg>) {
        if (($base, $number, $ext) = /^(.)(\d+)\.(.*)/) {
           rename $_, sprintf("%s%03d.%s", $base, $number, $ext);
        }
    }
If you don't know Perl you won't know that the regular expression is being applied to the foreach implicit iterator. But if you've never seen Perl, that's not your biggest problem, is it? Oh, and $_ is the "pronoun" we use to refer explicitly to the iterator inside the loop.

On second thought, these local variables are not serving much purpose other than naming the parts captured by the regular expression. If we were to use them often, it would be proper. But to only use them once on the next line? The regular expression is clear enough (after gaining some experience with them, obviously). How about getting rid of those variables?
    foreach (<*.jpg>) {
        if (/^(.)(\d+)\.(.*)/) {
           rename $_, sprintf("%s%03d.%s", $1, $2, $3);
        }
    }
I could use named capture groups to use names instead of numbers to refer to the captures. But in a small block like this I usually don't bother.

Still ... it's looking too much like C to me. In Perl it's more direct and readable to interpolate the variables in the format string:
    foreach (<*.jpg>) {
        if (/^(.)(\d+)\.(.*)/) {
           rename $_, sprintf("$1%03d.$3", $2);
        }
    }
Hmmm ... the important thing is the rename ... the if is an accessory. In Perl, we can reverse the test and the action, more or less like when we choose the active voice or the passive voice for stylistic reasons. So, let's put what matters first:
    foreach (<*.jpg>) {
        rename $_, sprintf("$1%03d.$3", $2)
            if /^(.)(\d+)\.(.*)/;
    }
Nice. And we saved a pair of braces too, see?

Ah ... being so succinct it becomes easier to perceive the opportunity to make trivial optimizations:
    foreach (<*.jpg>) {
        rename $_, sprintf("$1%03d.jpg", $2)
            if /^(.)(\d+)\.jpg$/;
    }
Or timely generalizations:
    foreach (<*.jpg>) {
        rename $_, sprintf("$1%03d.jpg", $2)
            if /^([a-z]+)(\d+)\.jpg$/i;
    }
It seems very readable to me. How about you?

Anyway, at least it proves that There Is More Than One Way To Do It.

Addendum: Some time after writing this I discovered the rename command (the Perl-based one; some distributions ship a different rename with another syntax). With it the solution is trivial:
  $ rename 's/(\d+)/sprintf("%03d", $1)/e' *.jpg
Ah ... rename is written in Perl. :-)