The problem: split the string ‘a,b,c,#d,e#,f,g’ into its comma-separated elements, except that a comma appearing between two ‘#’ characters is part of the element. A ‘#’ can appear only at the beginning and end of an element.
Expected output (one element per line):
a
b
c
#d,e#
f
g
This problem can be solved with a loop over each character in the string: when an element starts with ‘#’, set a flag and keep it set until the next ‘#’ is met, so that any comma in between is treated as part of the element.
But I wanted to solve this in a simpler way, using Perl's strength in regular expressions. So I searched Google and found a great summary page of Perl regular expressions, where I learned that a conditional expression is useful in this case. Basically, the conditional regular expression has the format:
(?(condition)yes-pattern|no-pattern)
so which pattern is matched depends on the result of testing the condition. For my case, I wrote the following code:
perl -e '$s="a,b,c,#d,e#,f,g"; while($s=~/(?(?=#)(#[^#]*#)|([^,]*)),?/gc) {$r=$&; $r =~ s/,$//; print "$r\n"; }'
and got the expected result (followed by one extra blank line, because the pattern can also match the empty string once the input is exhausted). Here, in addition to the conditional expression, I also used the look-ahead (?=#) as the test condition, to check whether a hash character ‘#’ is ahead.
Then I realized that this code is not that clean, as I need to post-process the whole matched string (stored in the variable $&), so I simplified the code to
perl -e '$s="a,b,c,#d,e#,f,g"; while($s=~/(?|(#[^#]*#)|([^,]*)),?/gc) {print $1, "\n"; }'
which does not use a conditional expression at all, but uses a feature called branch reset (see the summary page mentioned above for details). It has the format
(?|pattern)
which makes the capture groups (those created by parentheses ‘(’ and ‘)’) in each ‘|’-separated alternative of ‘pattern’ start numbering from the same index. Note that in my pattern I have two alternative sub-expressions, ‘(#[^#]*#)’ and ‘([^,]*)’, separated by ‘|’. Without branch reset, these two have indices 1 and 2 respectively, so both $1 and $2 are needed to refer to the matched strings, although only one of them has a value in each match. To use the same index for the result (here always $1), I use the branch reset, which resets the index of the second sub-expression to 1.
Very cool!!
Please let me know your solution for this problem. I believe a better solution always exists.
Note that I used Perl v5.18.2 to test these examples. Older versions such as v5.8.8 may not work: the branch reset syntax (?|pattern) was only added in Perl 5.10.