Conditional regular expression and Branch reset in Perl

I have used Perl for 10 years in my research, but the language is so vast that it is hard for anyone to learn every feature of it. Recently, I was looking for a simple solution to the following problem:

The problem: split the string ‘a,b,c,#d,e#,f,g’ into its comma-separated elements, except that a comma appearing between two ‘#’ characters is part of the element. Also, ‘#’ can only appear at the beginning and end of an element.

Expected output (one element per line):
a
b
c
#d,e#
f
g

This problem can be solved with a loop over each character in the string: special attention is given to an element starting with ‘#’, whose presence is marked until the next ‘#’ is met, so that any comma in between is kept as part of the element.

But I wanted to solve this in a simpler way, using Perl's strength in regular expressions. So I searched Google and found a great summary page of Perl regular expressions. I found that the conditional expression is useful in this case. Basically, a conditional regular expression has the format:

(?(condition)yes-pattern|no-pattern)

so which pattern is matched depends on whether the condition holds. For my case, I wrote the following code:

perl -e '$s="a,b,c,#d,e#,f,g"; while($s=~/(?(?=#)(#[^#]*#)|([^,]*)),?/gc) {$r=$&; $r =~ s/,$//; print "$r\n"; }'

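and got the expected result. Here, in addition to the conditional expression, I also used the look-ahead (?=#) as the test condition, to check whether the next character is a ‘#’. (The loop also prints one empty line after the last element, because the pattern can match the empty string at the end of the input.)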

Then I realized that this code is not that clean, as I need to post-process the whole matched string (stored in the variable $&), so I simplified the code to

perl -e '$s="a,b,c,#d,e#,f,g"; while($s=~/(?|(#[^#]*#)|([^,]*)),?/gc) {print $1, "\n"; }'

which does not use a conditional expression at all, but uses a feature called branch reset (see the summary page mentioned above for details). It has the format

(?|pattern)

which restarts the numbering of the capture groups (those created by parentheses ‘(’ and ‘)’) in each ‘|’-separated alternative of the ‘pattern’. Note that my pattern has two alternative sub-expressions, ‘(#[^#]*#)’ and ‘([^,]*)’, separated by ‘|’. Without branch reset, these two have index 1 and 2 respectively, so both $1 and $2 are needed to refer to the matched string, but only one of them has a value in each match. To use the same index for the result (here always $1), I use branch reset, which resets the index of the second sub-expression to 1.

Very cool!!

Please let me know your solution to this problem. I believe a better solution always exists.

Note that I used Perl v5.18.2 to test these examples. Older versions such as v5.8.8 may not work because they do not recognize all of this syntax; branch reset, for example, was introduced in Perl 5.10.
