Conditional regular expressions and branch reset in Perl

I have been using Perl for 10 years in my research, but the language is so vast that it is hard for anyone to learn every feature of it. Recently, I was looking for a simple solution to the following problem:

The problem: split the string ‘a,b,c,#d,e#,f,g’ into its comma-separated elements, except that a comma appearing between two ‘#’ characters is part of the element. Also, ‘#’ can appear only at the beginning and end of an element.

Expected output (one element per line):
a
b
c
#d,e#
f
g

This problem can be solved with a loop over each character in the string. Special attention should be given to an element starting with ‘#’: mark its presence and keep reading until the next ‘#’ is met, which closes the element.
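A minimal sketch of that character loop, assuming every ‘#’ simply toggles an "inside a protected element" flag. The helper name split_chars and the flag $in_hash are illustrative and not from any library.

```perl
use strict;
use warnings;

# Split on commas, except for commas between a pair of '#' characters.
# split_chars is a hypothetical helper name used only for this sketch.
sub split_chars {
    my ($s) = @_;
    my @elements;
    my $current = '';
    my $in_hash = 0;    # true while we are between an opening and a closing '#'

    for my $ch (split //, $s) {
        if ($ch eq '#') {
            $in_hash = !$in_hash;       # every '#' toggles the flag
            $current .= $ch;
        }
        elsif ($ch eq ',' && !$in_hash) {
            push @elements, $current;   # an unprotected comma ends the element
            $current = '';
        }
        else {
            $current .= $ch;            # ordinary character, or a protected comma
        }
    }
    push @elements, $current;           # the last element has no trailing comma
    return @elements;
}

print "$_\n" for split_chars('a,b,c,#d,e#,f,g');
```

It works, but it needs explicit state and about twenty lines for what is really a pattern-matching problem.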

But I wanted to solve this in a simpler way, using Perl's strength in regular expressions. So I searched Google and found a great summary page of Perl regular expressions. I found that a conditional expression is useful in this case. Basically, a conditional regular expression has the format:

(?(condition)yes-pattern|no-pattern)

so which of the two patterns is tried depends on the result of testing the condition. For my case, I wrote the following code:

perl -e '$s="a,b,c,#d,e#,f,g"; while($s=~/(?(?=#)(#[^#]*#)|([^,]*)),?/gc) {$r=$&; $r =~ s/,$//;  print "$r\n"; }'

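and got the expected result (followed by one empty line, because after the last element the pattern can still match the empty string at the end of the input). Here, in addition to the conditional expression, I also used the look-ahead (?=#) as the test condition, to check whether a ‘#’ character comes next.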
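To see the conditional in isolation, here is a small sketch that applies only the element part of the pattern to the start of a string. The helper first_element and the variable $element are names made up for this illustration.

```perl
use strict;
use warnings;

# (?(?=#)YES|NO): if the next character is '#', try YES; otherwise try NO.
my $element = qr/(?(?=#)#[^#]*#|[^,]*)/;

# first_element is a hypothetical helper: it returns the element found at
# the start of the string, or undef if the pattern does not match there.
sub first_element {
    my ($s) = @_;
    return $s =~ /^($element)/ ? $1 : undef;
}

print first_element('#d,e#,f,g'), "\n";   # condition true: the yes-pattern is used
print first_element('f,g'), "\n";         # condition false: the no-pattern is used
```

One difference from a plain alternation: once the condition is true, only the yes-pattern is tried. For an unterminated input such as ‘#d,e’ the helper returns undef rather than falling back to the no-pattern.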

Then I realized that this code is not that clean, as I need to post-process the whole matched string (stored in the variable $&), so I simplified the code to

perl -e '$s="a,b,c,#d,e#,f,g"; while($s=~/(?|(#[^#]*#)|([^,]*)),?/gc) {print $1, "\n"; }'

which does not use a conditional expression at all, but uses a feature called branch reset (see the summary page mentioned above for details). It has the format

(?|pattern)

in which every ‘|’-separated alternative of ‘pattern’ numbers its capture groups (the sub-expressions created by parentheses ‘(’ and ‘)’) starting from the same index. Note that my pattern has two alternative sub-expressions, ‘(#[^#]*#)’ and ‘([^,]*)’, separated by ‘|’. Without branch reset, these two have indices 1 and 2 respectively, so both $1 and $2 are needed to refer to the matched strings, yet only one of them has a value in any given match. To get the result under the same index (here always $1), I use branch reset, which resets the index of the second sub-expression to 1.
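The difference in numbering can be checked side by side. This is a sketch rather than the one-liner itself: the array names are illustrative, and the leading (?!\z) is an extra guard added here, not part of the original pattern, so that no empty match is recorded at the very end of the string.

```perl
use strict;
use warnings;

my $s = 'a,b,c,#d,e#,f,g';

# Without branch reset: the alternatives are groups 1 and 2, so on every
# match one of $1 / $2 is undefined and has to be checked.
my @plain;
while ($s =~ /(?!\z)(?:(#[^#]*#)|([^,]*)),?/g) {
    push @plain, defined $1 ? "1:$1" : "2:$2";
}

# With branch reset: both alternatives are numbered 1, so $1 is always set.
my @reset;
while ($s =~ /(?!\z)(?|(#[^#]*#)|([^,]*)),?/g) {
    push @reset, $1;
}

print "@plain\n";                  # each element tagged with its group number
print join("\n", @reset), "\n";    # one element per line
```

In the first loop only the ‘#d,e#’ element arrives in group 1 and everything else in group 2. In the second loop every element is in $1.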

Very cool!!

Please let me know your solution to this problem. I believe a better solution always exists.

Note that I used Perl v5.18.2 to test these examples. Older versions such as v5.8.8 may not work because they do not recognize all of this syntax; branch reset, for one, was only introduced in Perl 5.10.
