Monday, November 9, 2020

Perl is Dead Expressive

A few days ago, when I learned that Python is now officially the new Cobol, er, Java, I remembered a conversation I had several years ago with some colleagues about programming languages. I blogged about it in Portuguese back then, and I think it would be nice to have it in English too.

One of my teammates, an ardent proponent of Python, was presenting some slides boasting the virtues of the language. At some point he started presenting some dangerous slides ... each comparing Python to another programming language: Perl, Bash, Java, and Ruby. I found them dangerous because I think that comparing two languages decently is a complex task that cannot be condensed into one slide. First, a set of objective criteria for comparison must be defined. Then, it is necessary to take into account the context in which the language is being used: the domain of the applications that will be developed, the development and deployment platforms, the developers' experience with the language, the size of the team, and the time constraints of the project. After all that, you have to resist the temptation to argue passionately for the language of your preference in order to give at least the appearance of rationality.

But that's ok ... in a small group, this type of discussion is as stimulating and harmless as talking about politics, sports, or religion. ;)

I think it was on the Bash slide that he suggested a problem for which a standard Unix shell would not offer a solution as economical and as readable as the Python interactive shell could. The problem was, more or less, the following. Suppose there is a set of files in a directory whose names consist of an alphabetical prefix, followed by a sequence of digits, and ending in the extension .jpg. For example:
 $ ls
 a0.jpg b1.jpg c123.jpg
The challenge is to rename them so that all filenames have the same number of digits in them. In the case above, the result should be:
 a000.jpg b001.jpg c123.jpg
I left the talk with the problem in my head and the first thing I did was to come up with some one-liners:
 # printing the names
 $ ls | perl -lpe \
  's/^([a-z]+)(\d+)\.jpg/sprintf "%s%03d.jpg", $1, $2/e'
 a000.jpg
 b001.jpg
 c123.jpg

 # generating commands to rename them
 $ ls | perl -lpe \
  's/^([a-z]+)(\d+)\.jpg/sprintf "mv -n %s %s%03d.jpg", $&, $1, $2/e'
 mv -n a0.jpg a000.jpg
 mv -n b1.jpg b001.jpg
 mv -n c123.jpg c123.jpg

 # executing commands in the shell
 $ ls | perl -lpe \
  's/^([a-z]+)(\d+)\.jpg/sprintf "mv %s %s%03d.jpg", $&, $1, $2/e' \
  | sh
 $ ls
 a000.jpg  b001.jpg  c123.jpg
That's how I usually develop a shell solution. Instead of loops I prefer to use commands to generate other commands, like the mv above, so that I can easily verify that I am doing the right thing. After making sure of that, just add a "| sh " at the end of the pipeline to execute the generated commands. Perl has some very useful options for making one-liners like this. -l, -a, -n, -p, and -e are the ones I use most often. Read the perlrun documentation to learn more about them and many other interesting options. But, not to say that Perl can't do things alone, I added a solution that doesn't use the shell at the end.
 # doing everything in Perl
 $ ls | perl -lne \
  'if (/^([a-z]+)(\d+)\.jpg/) {
    rename $_, sprintf "%s%03d.jpg", $1, $2
  }'

 $ ls
 a000.jpg b001.jpg c123.jpg
Another teammate, who is a Bash fan, didn't let it go and came up with the following solutions:
 $ ls
 a0.jpg b1.jpg c123.jpg

 $ for i in *.jpg; do
 >   j=${i%*.jpg}
 >   printf "mv -n %s %s%03d.jpg\n" $i ${j//[0-9]/} ${j//[a-z]/}
 > done
 mv -n a0.jpg a000.jpg
 mv -n b1.jpg b001.jpg
 mv -n c123.jpg c123.jpg

 $ for i in *.jpg; do
 >   j=${i%*.jpg}
 >   printf "mv -n %s %s%03d.jpg\n" $i ${j//[0-9]/} ${j//[a-z]/}
 > done | sh

 $ ls
 a000.jpg b001.jpg c123.jpg
Ninja! I'll confess that I never had the willpower to learn these advanced Bash string manipulation strokes. For me, the shell is a glue that serves to stick other commands together. Whenever I need something more complicated, like data structures or regular expressions, I don't think twice about using Perl. But the Python die-hard counterattacked with this:
$ python
Python 2.7.18 (default, Aug  4 2020, 11:16:42) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for name in os.listdir("."):
...   base, number, ext = name[0], name[1:name.find(".")], name.split(".")[1]
...   os.rename(name, "%s%03d.%s"%(base, int(number), ext))
... 
>>> 

$ ls
a000.jpg  b001.jpg  c123.jpg

$ # readability counts
Ah ... what a subtle criticism in that last comment.

IS IT?


When I want to solve a problem with a one-liner, "readability" is irrelevant, because if I am not going to save the solution in a script, no one else will read it, right? Come on ... if I were going to save it in a script I could write something more like his Python version. Something like this:
    opendir CWD, '.';
    foreach $name (readdir CWD) {
        if (($base, $number, $ext) = ($name =~ /^(.)(\d+)\.(.*)/)) {
           rename $name, sprintf("%s%03d.%s", $base, $number, $ext);
        }
    }
    closedir CWD;
Hmmm ... I didn't even try to break the name with string operations because I find the regular expression more direct and, in this case, more readable. To make it even more readable I would replace the opendir, readdir, and closedir calls with a glob pattern:
    foreach $name (<*.jpg>) {
        if (($base, $number, $ext) = ($name =~ /^(.)(\d+)\.(.*)/)) {
           rename $name, sprintf("%s%03d.%s", $base, $number, $ext);
        }
    }
Better, right? But it’s still not good. It's very, how can I say it... heavyweight. One of the big differences between Perl and many other languages, Python in particular, is that we don't always have to be explicit. It is more or less like using pronouns or hidden subjects in natural languages. When you learn a foreign language at first you don't know it very well and baby-talk like this:
Joe is married. Joe has five children. Joe's children are all single.
Then you learn to use the pronouns and start speaking more economically.
Joe is married. He has five children. They are all single.
Until you are really fluent in the language and speak naturally like this:
Joe is married and has five children, all single.
Unintelligible? Of course not. Unless you are just starting to learn English. We usually talk to people who are as fluent as we are, so we can, and should, be economical and direct. By avoiding redundancies we are not just more direct. We are also more intelligible (or readable), because we do not insert in the speech that series of repeated names that end up polluting the text, hiding the real content of the message. Well, all of this is to explain my next version, in which I delete the variable $name, because in Perl the loop iterator may be implicitly operated on, like this:
    foreach (<*.jpg>) {
        if (($base, $number, $ext) = /^(.)(\d+)\.(.*)/) {
           rename $_, sprintf("%s%03d.%s", $base, $number, $ext);
        }
    }
If you don't know Perl you won't know that the regular expression is being applied to the foreach implicit iterator. But if you've never seen Perl, that's not your biggest problem, is it? Oh, and $_ is the "pronoun" we use to refer explicitly to the iterator inside the loop.

On second thought, these local variables are not serving much purpose other than naming the parts captured by the regular expression. If we were to use them often, it would be proper. But to only use them once on the next line? The regular expression is clear enough (after gaining some experience with them, obviously). How about getting rid of those variables?
    foreach (<*.jpg>) {
        if (/^(.)(\d+)\.(.*)/) {
           rename $_, sprintf("%s%03d.%s", $1, $2, $3);
        }
    }
I could use named capture groups to use names instead of numbers to refer to the captures. But in a small block like this I usually don't bother.
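For comparison, here is a minimal sketch of what the named-capture variant might look like. The helper name padded_name is hypothetical, and it returns the new name instead of renaming anything, so it can be tried safely. Named captures and the %+ hash require Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original scripts): returns the
# zero-padded name for a matching file, or undef when the name does not match.
sub padded_name {
    my ($name) = @_;
    return undef
        unless $name =~ /^(?<base>.)(?<number>\d+)\.(?<ext>.*)/;
    # %+ holds the named captures of the last successful match.
    return sprintf '%s%03d.%s', $+{base}, $+{number}, $+{ext};
}

for my $name (qw(a0.jpg b1.jpg c123.jpg notes.txt)) {
    my $new = padded_name($name);
    print defined $new ? "$name -> $new\n" : "$name (skipped)\n";
}
```

The names document the captures, at the cost of a longer regular expression and a longer sprintf call.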

Still ... it's looking too much like C to me. In Perl it's more direct and readable to interpolate the variables in the format string:
    foreach (<*.jpg>) {
        if (/^(.)(\d+)\.(.*)/) {
           rename $_, sprintf("$1%03d.$3", $2);
        }
    }
Hmmm ... the important thing is the rename ... the if is an accessory. In Perl, we can reverse the test and the action, more or less like when we choose the active voice or the passive voice for stylistic reasons. So, let's put what matters first:
    foreach (<*.jpg>) {
        rename $_, sprintf("$1%03d.$3", $2)
            if /^(.)(\d+)\.(.*)/;
    }
Nice. And we saved a pair of braces too, see?

Ah ... being so succinct it becomes easier to perceive the opportunity to make trivial optimizations:
    foreach (<*.jpg>) {
        rename $_, sprintf("$1%03d.jpg", $2)
            if /^(.)(\d+)\.jpg$/;
    }
Or timely generalizations:
    foreach (<*.jpg>) {
        rename $_, sprintf("$1%03d.jpg", $2)
            if /^([a-z]+)(\d+)\.jpg$/i;
    }
It seems very readable to me. How about you?

Anyway, at least it proves that There Is More Than One Way To Do It.

Addendum: Sometime after writing this I discovered the rename command. With it the solution is trivial:
  $ rename 's/(\d+)/sprintf("%03d", $1)/e' *.jpg
Ah ... rename is written in Perl. :-)

Monday, April 22, 2019

Perl Weekly Challenge 005

This week's challenges are all about anagrams.

The first one is to
Write a program which prints out all anagrams for a given word. For more information about Anagram, please check this wikipedia page.
It's not stated, but I assume that, besides the word, the program must also read a dictionary of words in which it will look for anagrams. My solution is simple and very much like the solution to last week's second challenge.

The idea is to use a hash function that generates a key for each word so that anagrams always produce the same key and non-anagrams always produce different keys. The hash function I use lowercases the word so that letters are compared case-insensitively. Then it splits the word into its letters, sorts them, and joins them back together. So, for example, "Perl" is keyed as "elpr".

The script first generates the key for the input word. Then it iterates over all dictionary words, printing those whose key equals the input word's key.
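A minimal sketch of the approach just described, not necessarily the submitted script. The sub names anagram_key and anagrams_of are hypothetical, and the dictionary is a small in-memory list rather than a file.

```perl
use strict;
use warnings;

# The key described above: lowercase, split into letters, sort, join.
sub anagram_key {
    my ($word) = @_;
    my @letters = split //, lc $word;
    return join '', sort @letters;
}

# Hypothetical helper: the words in @$dictionary that share $word's key.
sub anagrams_of {
    my ($word, $dictionary) = @_;
    my $key = anagram_key($word);
    return grep { anagram_key($_) eq $key } @$dictionary;
}

# Assumption: a tiny stand-in for a real dictionary file.
my @dictionary = qw(Perl lamp palm reply pearl);
print "$_\n" for anagrams_of('lamp', \@dictionary);
```

Note that the input word itself is printed when it appears in the dictionary, since it trivially shares its own key.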


The second challenge is to
Write a program to find the sequence of characters that has the most anagrams.
My solution first reads all of the dictionary words and classifies them into groups of anagrams using the same hash function as the first script. Then it finds and prints the keys associated with the maximum number of anagrams.
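The grouping step can be sketched as follows. This is an illustration under assumptions, not the original script: most_anagrammed_keys is a hypothetical name, and the word list is passed in directly instead of being read from a dictionary.

```perl
use strict;
use warnings;

# Same key as in the first challenge: lowercase, split, sort, join.
sub anagram_key {
    my ($word) = @_;
    my @letters = split //, lc $word;
    return join '', sort @letters;
}

# Hypothetical helper: group the words by key and return, in sorted order,
# every key whose group is as large as the largest one.
sub most_anagrammed_keys {
    my @words = @_;
    my %group;
    push @{ $group{ anagram_key($_) } }, $_ for @words;

    my $max = 0;
    for my $members (values %group) {
        $max = @$members if @$members > $max;
    }

    my @keys = grep { @{ $group{$_} } == $max } keys %group;
    return sort @keys;
}

print "$_\n" for most_anagrammed_keys(qw(lamp palm perl reply player pearly replay));
```

More than one key can be returned, because several groups may tie for the maximum size.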


And this is how they work. First I use the second script to find the sequence of characters that has the most anagrams in my Ubuntu dictionary. Then I use the first script to list all the anagrams associated with it:

-----
I came up with another solution to the second challenge that is shorter, faster and uses no modules:

Wednesday, April 17, 2019

Perl Weekly Challenge 004

This week I submitted my solutions via a pull request to the GitHub repository.

This was the first time I solved the first problem, because it was interesting:
Write a script to output the same number of PI digits as the size of your script. Say, if your script size is 10, it should print 3.141592653.
After seeing a few solutions by other people I feel that my solution is a little dumb. I wrote the smallest script I could write, checked its size, and edited it back to print the number of characters I wanted. Some other solutions use clever ways to determine the script's size dynamically.

The second problem was interesting too:
You are given a file containing a list of words (case insensitive 1 word per line) and a list of letters. Print each word from the file that can be made using only letters from the list. You can use each letter only once (though there can be duplicates and you can use each of them once), you don’t have to use all the letters. (Disclaimer: The challenge was proposed by Scimon Proctor)
My solution is similar to others I saw after having written it. It's not particularly clever, but I find it very readable. This is how it works on my Linux box:

$ ./ch-2.pl /usr/share/dict/words Perl
E
L
Le
P
Perl
R
e
l
p
per
r
re
rep
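
A minimal sketch of the kind of check that yields output like the above. It is not the ch-2.pl script itself: can_make is a hypothetical helper, and the word list is hard-coded instead of being read from /usr/share/dict/words.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $word can be spelled using each letter of
# $letters at most once, ignoring case.
sub can_make {
    my ($word, $letters) = @_;

    # Count how many copies of each letter are available.
    my %available;
    $available{$_}++ for split //, lc $letters;

    # Consume one copy per letter of the word; fail as soon as one runs out.
    for my $letter (split //, lc $word) {
        return 0 unless ($available{$letter} // 0) > 0;
        $available{$letter}--;
    }
    return 1;
}

# Assumption: a few sample words standing in for the dictionary file.
print "$_\n" for grep { can_make($_, 'Perl') } qw(E Le Perl peel rep reply);
```

Counting the letters first is what lets duplicated letters in the list be used once each.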

That's it for this week.

----
After a while I came up with a new solution to the second problem which is more concise because it's written in a more functional style. But it depends on the List::Util module.

Thursday, April 11, 2019

svndumpsanitizer is a gem

I've been supporting Subversion repositories at work for more than ten years now. During this time I've grudgingly done my fair share of migrations, moving partial histories from one repository to another.

The standard procedure consists of dumping the source repository, filtering the resulting dump to keep only the part of the history you're interested in, and loading the filtered dump into the target repository. It's possible to do it in a single pipeline like this:
svnadmin dump source | svndumpfilter options | svnadmin load target
If you have ever done this with any non-trivial repository you know how exasperating it can be to come up with the correct options. It's a trial-and-error process because you never know exactly which paths you need to include in the filter: Subversion histories tend to contain all sorts of weird moves and renames, which break the filtering. Each time that happens, you have to work out which path to add to the filter and restart the process from the beginning.

This week I embarked on a Subversion migration adventure. If only I had known how much I would regret it... I had to move the histories of some 15 directories from three source repositories into a sub-directory of a single target repository. They are big and old repositories, but the directories seemed innocent enough that I started out very confident. Indeed, all but two of the directories were moved easily.

The remaining two kept me busy for most of the week though. Their histories are long and winding. During the course of my trials I became aware of some options in newer versions of the "svnadmin dump" command that promised to make it possible to avoid the intermediary svndumpfilter command. But it failed. Hard. Repeatedly. Annoyingly.

I gave myself today as my last chance to finish the process. I almost gave up but by chance I stumbled upon a link to svndumpsanitizer... and I was saved.

It's a simple, fast, and intelligent tool that seems to solve all the problems that the svndumpfilter program has. And it's superbly documented too. Its page explains very well the usual problems we get with svndumpfilter and how it overcomes them.

Discounting the time to make the initial dump and the final load, the filtering took less than a minute. Awesome!

Kudos to svndumpsanitizer's author, dsuni at GitHub, for such a gem!

Sunday, April 7, 2019

Perl Weekly Challenge #3

This week's challenge is to:

Create a script that generates Pascal Triangle. Accept number of rows from the command line. The Pascal Triangle should have at least 3 rows. For more information about Pascal Triangle, check this wikipedia page.
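
A minimal sketch of one way to meet this specification, not necessarily the script discussed in this post. The sub name pascal_rows and the default of 5 rows when no argument is given are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $n rows of Pascal's Triangle,
# each as an array reference.
sub pascal_rows {
    my ($n) = @_;
    my @rows;
    my @row = (1);
    for (1 .. $n) {
        push @rows, [@row];
        # Each inner entry of the next row is the sum of the two entries
        # above it; the outer entries are always 1.
        @row = (1, (map { $row[$_ - 1] + $row[$_] } 1 .. $#row), 1);
    }
    return @rows;
}

# Assumption: default to 5 rows when no command-line argument is given.
my $n = @ARGV ? $ARGV[0] : 5;
print "@$_\n" for pascal_rows($n);
```

Each row is derived only from the previous one, so no factorials or binomial formulas are needed.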

I don't know why there is a restriction on the number of rows. Here's my quick-and-dirty answer:


Here's how to use it:

Friday, April 5, 2019

The Principle and the End

My son has a cold and started arguing with my wife about which medicine he should take for pain and fever.

I wasn't paying much attention, but I noticed they were arguing about the differences between the active principles. She argued that if the medicines had different active principles there was no problem in taking two at once, whereas he insisted that if both served the same purpose it didn't make much sense... It must have been more complicated than that, but, as I said, I wasn't paying attention.

Trying to help, I asked:

- What does it matter if they don't have the same principle if both have the same end?

It didn't help at all... But didn't it sound nice? ;-)


Sunday, March 31, 2019

Perl Weekly Challenge #2

Last week I sent my solution to the Perl Weekly Challenge #1 via email. It was fun and simple.

This week's challenge is to "write a script that can convert numbers to and from a base35 representation, using the characters 0-9 and A-Y."

I cannot do it as a one-liner this time, but it was still fun. While trying to solve it I realized that it wouldn't be much harder to implement a general solution to convert from any base to any base between 2 and 36.
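
A minimal sketch of such a general conversion, not necessarily the script referred to here. The helper names from_base, to_base, and convert_base are hypothetical, input is assumed to be a non-negative integer, and very large values will lose precision once they exceed what a native Perl number can hold exactly.

```perl
use strict;
use warnings;

# Digits for bases up to 36: 0-9 followed by A-Z.
my @DIGITS = (0 .. 9, 'A' .. 'Z');
my %VALUE;
@VALUE{@DIGITS} = (0 .. $#DIGITS);

# Hypothetical helper: parse $text, written in base $base, into an integer.
sub from_base {
    my ($text, $base) = @_;
    my $n = 0;
    for my $digit (split //, uc $text) {
        my $value = $VALUE{$digit};
        die "invalid digit '$digit' for base $base\n"
            unless defined $value && $value < $base;
        $n = $n * $base + $value;
    }
    return $n;
}

# Hypothetical helper: write the non-negative integer $n in base $base.
sub to_base {
    my ($n, $base) = @_;
    my $text = '';
    do {
        $text = $DIGITS[$n % $base] . $text;
        $n    = int($n / $base);
    } while ($n > 0);
    return $text;
}

# Convert between any two bases (2 to 36) via an intermediate integer.
sub convert_base {
    my ($text, $from, $to) = @_;
    return to_base(from_base($text, $from), $to);
}

print convert_base('Y', 35, 10), "\n";    # prints 34
print convert_base('255', 10, 16), "\n";  # prints FF
```

Going through an intermediate integer keeps the two directions independent, so base35 becomes just one case among many.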

This is my solution:

And this is how it works: