# Count tandem repeats in Perl

I am trying to write code that, given this string:

"TTGCATCCCTAAAGGGATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCTTTGTGATCAA"

finds consecutive repeats (also known as tandem repeats) of the substring ATC, counts them, and prints the message "Off" if the count is higher than 10.

Here is my code:

```perl
my @count = ($content =~ /ATC+/g);
print @count . " Repeat length\n";

$nrRepeats = scalar(@count);
if ($nrRepeats > 10) {
    print("Off\n");
}
else {
    print("On\n");
}
```

The problem:
It counts all ATC substrings present in the string instead of only tandem repeats of ATC.

Thank you a lot for your help!

• What does it mean "consecutive repeats" vs "tandem repeats"? – zdim Jun 7 at 8:45
• Sorry..it is the same concept – RebiKirl Jun 7 at 8:47
• How many repeats should that sample string have? 8 or 1? (In other words, are you counting just the number of times ATCATC shows up, or the number of times 2 or more consecutive ATC substrings show up?) – Shawn Jun 7 at 9:14

Your question is a little ambiguous. I'm going to answer each interpretation separately.

1. If you're trying to determine whether the string contains a run of more than 10 ATCs in a row, you can use

```perl
if ($content =~ /ATCATCATCATCATCATCATCATCATCATCATC/)
```

This regex can be written more compactly as

```perl
if ($content =~ /(?:ATC){11}/)
```
2. If you're trying to count the number of occurrences of at least 2 ATCs in a row, you can use

```perl
my $count = () = $content =~ /(?:ATC){2,}/g;
if ($count > 10)
```

(See `perldoc -q count`.)
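
The two interpretations can be compared side by side. The following is only a minimal sketch: the helper names `has_long_run` and `count_runs` and the sample string are made up for illustration and are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (interpretation 1): true if $content contains
# a run of more than $limit consecutive ATCs.
sub has_long_run {
    my ($content, $limit) = @_;
    my $min = $limit + 1;
    return $content =~ /(?:ATC){$min}/ ? 1 : 0;
}

# Hypothetical helper (interpretation 2): number of separate runs
# made of at least two consecutive ATCs.
sub count_runs {
    my ($content) = @_;
    my $count = () = $content =~ /(?:ATC){2,}/g;
    return $count;
}

# Made-up sample: one run of 11 and one run of 2.
my $sample = 'GG' . ('ATC' x 11) . 'TT' . ('ATC' x 2) . 'A';
print has_long_run($sample, 10) ? "Off\n" : "On\n";
print count_runs($sample), " runs of two or more\n";
```

With this sample the first interpretation reports "Off" (a run of 11 exceeds 10), while the second counts only 2 runs and would not.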

• @Wolf, There are multiple matches. Subsequent evaluations of `//g` in scalar context will proceed where the last one left off. Did you accidentally make a single long chain of repeated `ATC`? – ikegami Jun 7 at 12:43
• @melpomene Yes, it's obviously a really poor spec that we are working on. And OP is not very responsive/cooperative. – Wolf Jun 7 at 14:01

Your regex `/ATC+/g` looks for `AT` followed by one or more `C`, because the `+` applies only to the final `C`. I suspect that what you want is this:

```perl
/(ATC(?:ATC)+)/g
```

This matches `ATC` followed by one or more further `ATC`, that is, a run of at least two.
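
To see the difference, it may help to compare what each pattern matches on a short test string. The string below is made up for illustration.

```perl
use strict;
use warnings;

my $test = 'ATCCCGATCATCGATC';

# The quantifier binds only to the final C: AT followed by one or more C.
my @loose = $test =~ /ATC+/g;

# Grouping makes the quantifier apply to the whole ATC unit.
my @grouped = $test =~ /(?:ATC)+/g;

# Requiring a second unit keeps only true tandem repeats.
my @tandem = $test =~ /(ATC(?:ATC)+)/g;

print "loose:   @loose\n";     # ATCCC ATC ATC ATC
print "grouped: @grouped\n";   # ATC ATCATC ATC
print "tandem:  @tandem\n";    # ATCATC
```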

• This would count every time ATC is repeated in the input not only consecutive. – sergiotarxz Jun 7 at 9:07
• Exactly..now it is working. Thanks for helping me! – RebiKirl Jun 7 at 9:10
• Good hint. @RebiKirl if this actually helped you, think about accepting this answer. – Wolf Jun 7 at 10:13

Perl was created to take the tedium out of repetitive manual work, and it is quite repetition-aware itself: you can build a string that repeats a pattern with `$pattern x $repetitions`, or literally type `'ATC' x 11`.

Besides matching via `/(?:ATC){11}/` (as already suggested), this would be another way to just get Off:

```perl
print "Off\n" if $content =~ ("ATC" x 11);
```

To match all tandem repeats of `ATC` and trigger on those with more than 10 repetitions,[1] you need to loop explicitly:

```perl
while ($content =~ /(ATC(?:ATC)+)/g) {
    my $count = (length $1) / 3;
    print "$count repeat length\n";
    print "Off\n" if $count > 10;
}
```

Otherwise, for inputs such as `$prefix.ATCx2.$infix.ATCx11.$postfix`, the detection would stop at the first tandem repeat. The capture variable `$1` holds the matched run, and its length gives the repeat count.
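
The effect of the loop can be sketched on a made-up input where a short run precedes the long one. The variable names `$first_len` and `@lengths` are illustrative only.

```perl
use strict;
use warnings;

# Made-up input: a run of 2 comes before the run of 11.
my $content = 'GG' . ('ATC' x 2) . 'TTG' . ('ATC' x 11) . 'AA';

# A single match without /g reports only the first tandem repeat.
my ($first) = $content =~ /(ATC(?:ATC)+)/;
my $first_len = length($first) / 3;

# Looping with /g visits every tandem repeat in turn.
my @lengths;
while ($content =~ /(ATC(?:ATC)+)/g) {
    push @lengths, length($1) / 3;
}

print "first run only: $first_len\n";
print "all runs: @lengths\n";
print "Off\n" if grep { $_ > 10 } @lengths;
```

Here the single match sees only the run of 2, so a check against 10 would miss the run of 11 that the loop finds.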

[1] The following counts all occurrences of `ATC`, regardless of whether they are consecutive:

```perl
my $count = () = $content =~ /ATC/g;
print "count (total matches) $count\n";
```
• @ikegami Yes, I'm a bit slow today. I tried to find out meanwhile (and succeeded). What kept me exploring was a confusing forever loop with `while ($str.$str =~ /PAT/g) { ... }` – Wolf Jun 7 at 13:05
• @ikegami fixed. Although `{2,}` now looks less repetitive to me ;) - also `while` would be better if there was another `ATCx2` before the actual `ATCx11`... – Wolf Jun 7 at 13:10
• @ikegami the loop is indispensable, maybe this should be explained with some more text. Again: thanks a lot for your time. – Wolf Jun 7 at 13:16
```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The string to search
my $content = "TTGCATCCCTAAAGGGATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCTTTGTGATCAA";
# Split the text at every position preceded or followed by ATC
my @array = split /(?:(?<=ATC)|(?=ATC))/, $content;
# Array of run lengths; it starts with a single counter set to 0
my @count = (0);
for (@array) {
    if (/^ATC$/) {
        # The piece is exactly ATC: extend the current run by one
        $count[-1]++;
    } else {
        # Any other piece ends the current run, so start a new counter
        # (unless the current counter is still unused)
        $count[-1] != 0 and push @count, 0;
    }
}
# Initialize $max and $index to 0 and undef respectively
my ($max, $index) = (0, undef);
for (keys @count) {
    # If the current run is longer than $max,
    # update $max and remember its position in $index
    $max < $count[$_] and ($max, $index) = ($count[$_], $_);
}
# $index stays undefined if ATC never occurs
defined $index and print "$max Repeat length\n";
# Print Off if the longest run has more than 10 repeats, as the question asks
print(($max > 10 ? 'Off' : 'On') . "\n");
```

I think this is a good approach because it gives you more information, such as the length of each run and how many runs there are.

• The whole thing can probably be simplified to `use List::UtilsBy qw(max_by); my @count = map length($_) / length('ATC'), $content =~ /(?:ATC)+/g; my $index = max_by { $count[$_] } 0 .. $#count; my $max = $count[$index];` – melpomene Jun 7 at 10:33
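
The simplification suggested in this comment relies on the CPAN module `List::UtilsBy`. A similar sketch that uses only the core module `List::Util` might look like the following; it tracks the longest run but not its index, and the input string is made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Made-up input: a lone ATC, a run of 12, and another lone ATC.
my $content = 'TTGCATCCC' . ('ATC' x 12) . 'TTTGTGATCAA';

# Length, in ATC units, of every run of one or more consecutive ATCs.
my @count = map { length($_) / length('ATC') } $content =~ /(?:ATC)+/g;

# max() returns undef for an empty list, so fall back to 0.
my $max = max(@count) // 0;

print "$max Repeat length\n" if @count;
print $max > 10 ? "Off\n" : "On\n";
```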