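Quoting reply #21: Originally posted by
essen11 at 2012-05-15 21:43:30:
It does need to be case-sensitive, so no further changes are needed. Using find-and-replace I found 14774 lowercase occurrences in total, but searching the sequences returned by the program gives only 10261. Looking at the returned sequences, part of the difference is definitely caused by incomplete returned sequences, but another part is ...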
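This program first merges each sequence that is spread over multiple lines into one complete sequence and only then runs the regex match, so in general a match should not be missed. In the program below, sequences that do not contain the short sequence are written to out2.txt; everything else is unchanged.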
#!/usr/bin/perl
use strict;
use warnings;
my @name;
my @seqs;
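# Check @ARGV first so that running with no arguments does not trigger an "uninitialized value" warning.
if(@ARGV && ($ARGV[0] eq '-h' || $ARGV[0] eq '--help')) {
    print "Usage: perl ./$0 input_file substring\n";
    exit(0);
}
if(@ARGV < 2) {
    print "Not enough arguments!\n";
    exit(1);
}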
my $cnt=0;
my $subseq=$ARGV[1];
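open IN, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
open OUTFILE, '>', 'outfile.txt' or die "Cannot open outfile.txt: $!\n";
open OUTFILE2, '>', 'out2.txt' or die "Cannot open out2.txt: $!\n";
# Read the input and join the lines of each record into one sequence string.
while(<IN>) {
    if (/^>/) {
        ++$cnt;
        $name[$cnt]=$_;
        $seqs[$cnt]='';   # start empty so a header without sequence lines is still defined
    }
    if(/^[ATCGatcg]/) {
        chomp;
        $seqs[$cnt] .= $_;
    }
}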
close IN;
my $i;
my $length;
my $hits=0;
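foreach (1..$cnt) {
    if($seqs[$_]=~/$subseq/) {
        # Sequence contains the substring: write it to outfile.txt, 50 bases per line.
        ++$hits;
        print OUTFILE "$name[$_]";
        $length=length $seqs[$_];
        for($i=1;$i<=$length;$i++) {
            printf OUTFILE "%s",substr($seqs[$_],$i-1,1);
            if($i%50 == 0) {
                print OUTFILE "\n";
            }
        }
        # End the last line unless it already ended on a multiple of 50 (avoids a blank line).
        print OUTFILE "\n" if $length%50;
    } else {
        # Sequence does not contain the substring: write it to out2.txt in the same layout.
        print OUTFILE2 "$name[$_]";
        $length=length $seqs[$_];
        for($i=1;$i<=$length;$i++) {
            printf OUTFILE2 "%s",substr($seqs[$_],$i-1,1);
            if($i%50 == 0) {
                print OUTFILE2 "\n";
            }
        }
        print OUTFILE2 "\n" if $length%50;
    }
}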
print "A total of $hits sequences matched.\n";
close OUTFILE;
close OUTFILE2;
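
To show why merging the lines before matching matters, here is a small sketch separate from the program above. The sample records, the motif, and the helper name merge_records are all made up for illustration. It compares matching line by line with matching after the lines of each record have been joined, using the same case-sensitive match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: in seq1 the motif "acgt" is split across two lines;
# seq2 only contains the uppercase form "ACGT".
my @lines = (
    ">seq1\n",
    "TTTTac\n",
    "gtTTTT\n",
    ">seq2\n",
    "TTTTTT\n",
    "ACGTTT\n",
);
my $motif = 'acgt';   # lowercase on purpose: the match is case-sensitive

# Join the lines of each record into one sequence string, keeping input order.
sub merge_records {
    my @records;
    for my $line (@_) {
        chomp(my $text = $line);
        if ($text =~ /^>/) {
            push @records, { name => $text, seq => '' };
        } elsif (@records && $text =~ /^[ATCGatcg]/) {
            $records[-1]{seq} .= $text;
        }
    }
    return @records;
}

my @records = merge_records(@lines);

# Matching each sequence line on its own misses a motif that straddles a line break.
my $line_hits = grep { !/^>/ && /$motif/ } @lines;

# Matching the merged sequences finds it, and splits the records into two groups.
my @matched   = grep { $_->{seq} =~ /$motif/ } @records;
my @unmatched = grep { $_->{seq} !~ /$motif/ } @records;

print "line-by-line hits: $line_hits\n";
print "merged hits: ", scalar(@matched), "\n";
print "matched: ",   join(',', map { $_->{name} } @matched),   "\n";
print "unmatched: ", join(',', map { $_->{name} } @unmatched), "\n";
```

With this sample data the line-by-line search finds nothing, while the merged search reports seq1 as matched and seq2 as unmatched (its "ACGT" is uppercase). Note that this counts sequences, not occurrences: a sequence containing the motif several times is still counted once.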