-1

I am trying to write code that, given this string:

"TTGCATCCCTAAAGGGATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCTTTGTGATCAA"

finds consecutive repeats (also known as tandem repeats) of the substring ATC, counts them, and, if the count is higher than 10, outputs the message "Off".

Here is my code:

my @count = ($content =~ /ATC+/g);
print @count . " Repeat length\n";

$nrRepeats = scalar(@count);    
if ($nrRepeats>10) {
    print("Off\n");
}
else {
    print("On\n");
}

Complications:
It counts all ATC substrings present in the string instead of only tandem repeats of ATC.

Thank you a lot for your help!

  • 1
    What does it mean "consecutive repeats" vs "tandem repeats"? – zdim Jun 7 at 8:45
  • 1
    Sorry..it is the same concept – RebiKirl Jun 7 at 8:47
  • 1
    How many repeats should that sample string have? 8 or 1? (In other words, are you counting just the number of times ATCATC shows up, or the number of times 2 or more consecutive ATC substrings show up?) – Shawn Jun 7 at 9:14
4

Your question is a little ambiguous. I'm going to answer each interpretation separately.

  1. If you're trying to determine whether the string contains a run of more than 10 ATCs in a row, you can use

    if ($content =~ /ATCATCATCATCATCATCATCATCATCATCATC/)
    

    This regex can be written more compactly as

    if ($content =~ /(?:ATC){11}/)
    
  2. If you're trying to count the number of occurrences of at least 2 ATCs in a row, you can use

    my $count = () = $content =~ /(?:ATC){2,}/g;
    if ($count > 10)
    

    (See perldoc -q count.)
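
The two readings can give different answers on the same input. Here is a minimal sketch, assuming a made-up test string built with the x operator; the sub names has_long_run and count_runs are hypothetical helpers, not part of the question:

```perl
use strict;
use warnings;

# Interpretation 1: is there a single run of more than 10 ATCs (at least 11)?
sub has_long_run {
    my ($content) = @_;
    return $content =~ /(?:ATC){11}/ ? 1 : 0;
}

# Interpretation 2: how many separate runs of 2 or more ATCs are there?
sub count_runs {
    my ($content) = @_;
    # The empty list assignment forces list context, then yields the match count
    my $count = () = $content =~ /(?:ATC){2,}/g;
    return $count;
}

# Hypothetical input: one run of 17 ATCs, a spacer, then a lone ATC
my $sample = 'GGG' . ('ATC' x 17) . 'TTTGTG' . 'ATC' . 'AA';

print has_long_run($sample) ? "Off\n" : "On\n";
print count_runs($sample), " run(s) of 2 or more ATCs\n";
```

On this input the first test would print Off, while the second finds only one qualifying run, so a "count > 10" check on it would not trigger.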

  • 1
    @Wolf, There are multiple matches. Subsequent evaluations of //g in scalar context will proceed where the last one left off. Did you accidentally make a single long chain of repeated ATC? – ikegami Jun 7 at 12:43
  • @melpomene Yes, it's obviously a really poor spec that we are working on. And OP is not very responsive/cooperative. – Wolf Jun 7 at 14:01
1

Your regex /ATC+/g looks for AT followed by one or more C, because the + applies only to the C. I suspect that what you want is this:

/(ATC(?:ATC)+)/g

That is, ATC followed by one or more further ATCs, so it matches a run of at least two.
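
To see the difference in quantifier scope, here is a small sketch on a made-up string; all_matches is a hypothetical helper and the test string is an assumption, not the asker's data:

```perl
use strict;
use warnings;

# Return every substring matched by $re in $text.
# The outer parentheses capture the whole match, so $re itself
# should only use non-capturing groups.
sub all_matches {
    my ($re, $text) = @_;
    return $text =~ /($re)/g;
}

# Hypothetical input: ATC followed by extra Cs, a spacer, then three ATCs in a row
my $text = 'ATCCC' . 'G' . ('ATC' x 3);

my @loose  = all_matches(qr/ATC+/,        $text);  # AT, then one or more C
my @units  = all_matches(qr/(?:ATC)+/,    $text);  # one or more whole ATC units
my @tandem = all_matches(qr/ATC(?:ATC)+/, $text);  # at least two ATC units

print "ATC+        : @loose\n";
print "(?:ATC)+    : @units\n";
print "ATC(?:ATC)+ : @tandem\n";
```

Only the last pattern skips the lone ATC at the start and reports the tandem run as a single match.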

  • This would count every time ATC is repeated in the input not only consecutive. – sergiotarxz Jun 7 at 9:07
  • Exactly..now it is working. Thanks for helping me! – RebiKirl Jun 7 at 9:10
  • Good hint. @RebiKirl if this actually helped you, think about accepting this answer. – Wolf Jun 7 at 10:13
1

Perl is quite a repetition-aware programming language, created to reduce repetitive manual work. You can build a string that repeats a pattern with the repetition operator, as $pattern x $repetitions, or write it literally as 'ATC' x 11.

Besides matching via /(?:ATC){11}/ (as already suggested), this would be another way to just get Off:

print "Off\n" if $content =~ ("ATC" x 11);

To match all tandem repeats of ATC and trigger on those with more than 10 repetitions,[1] you need to loop explicitly:

while ($content =~ /(ATC(?:ATC)+)/g) {
    my $count = (length $1) / 3;
    print "$count repeat length\n";
    print "Off\n" if $count > 10;
}

Without the loop, for inputs such as $prefix.ATCx2.$infix.ATCx11.$postfix the detection would stop at the first tandem repeat. The capture variable $1 holds the matched run, so its length divided by 3 gives the number of repeats.
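
That case can be checked with a short sketch. The input below is a made-up string with a run of 2 placed before a run of 11, and run_lengths is a hypothetical helper name:

```perl
use strict;
use warnings;

# Return the length, in ATC units, of every tandem run of 2 or more ATCs.
sub run_lengths {
    my ($content) = @_;
    my @lengths;
    # With /g in scalar context, each iteration resumes after the previous match
    while ($content =~ /((?:ATC){2,})/g) {
        push @lengths, length($1) / 3;
    }
    return @lengths;
}

# Hypothetical input: a short run of 2 comes before the long run of 11
my $content = 'GG' . ('ATC' x 2) . 'TTT' . ('ATC' x 11) . 'AA';

my @lengths = run_lengths($content);
print "run lengths: @lengths\n";

# grep in scalar context returns how many runs exceed 10 repeats
my $long_runs = grep { $_ > 10 } @lengths;
print $long_runs ? "Off\n" : "On\n";
```

Collecting the lengths first keeps the matching separate from the decision, so the threshold can be changed without touching the regex.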


[1] The following counts all appearances of ATC, regardless of whether they are consecutive:

my $count = () = $content =~ /ATC/g;
print "count (total matches) $count\n";
  • @ikegami Yes, I'm a bit slow today. I tried to find out meanwhile (and succeeded). What kept me exploring was a confusing forever loop with while ($str.$str =~ /PAT/g) { ... } – Wolf Jun 7 at 13:05
  • @ikegami fixed. Although {2,} now looks less repetitive to me ;) - also while would be better if there was another ATCx2 before the actual ATCx11... – Wolf Jun 7 at 13:10
  • @ikegami the loop is indispensable, maybe this should be explained with some more text. Again: thanks a lot for your time. – Wolf Jun 7 at 13:16
0
#!/usr/bin/env perl
use strict;
use warnings;
# The string with the text to match
my $content = "TTGCATCCCTAAAGGGATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCTTTGTGATCAA";
# Split the text at every position that is preceded or followed by ATC
my @array = split /(?:(?<=ATC)|(?=ATC))/, $content;
# Create an array whose first element is 0; each element holds the length of one run of consecutive ATCs
my @count = 0;
for (@array) {
    if (/^ATC$/) {
# If $_ is exactly ATC, increment the length of the current run
        $count[-1]++;
    } else {
# Otherwise, if a run of ATCs was being counted,
# start a new counter by adding a new element
        $count[-1] != 0 and push @count, 0;
    }
}
# Initialize $max and $index to 0 and undef respectively
my ($max,$index) = (0, undef);
for (keys @count) {
# If $max is less than the current run length,
# update $max to that length and $index to its position
    $max < $count[$_] and ($max, $index) = ($count[$_], $_);
}
# $index stays undefined if ATC never occurs
defined $index and print "$max Repeat length\n";
# Print Off if the longest run is greater than 10, as the question asks
print(($max>10?'Off':'On')."\n");

I think this is a good approach because it gives you more data, such as how many times ATC is repeated in each run.

EDIT: Updated with comments.

  • Can I know what is bad with the code to be downvoted? The number it returns is correct and also the output. – sergiotarxz Jun 7 at 9:39
  • You don't explain what this code is supposed to do. The code itself is not easy to understand, either. – melpomene Jun 7 at 9:56
  • @melpomene True, I'll correct it, thank you. – sergiotarxz Jun 7 at 9:58
  • Is it a good idea to get more results than required? – Wolf Jun 7 at 10:11
  • 1
    The whole thing can probably be simplified to use List::UtilsBy qw(max_by); my @count = map length($_) / length('ATC'), $content =~ /(?:ATC)+/g; my $index = max_by { $count[$_] } 0 .. $#count; my $max = $count[$index]; – melpomene Jun 7 at 10:33
