Views: 2795  |  Replies: 21

大爷给跪了

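New member (somewhat well known)

Quoted reply:
Post #10: Originally posted by peterrjp at 2014-10-21 21:25:14
That script splits each line on the | character. For your file 2, it takes the 2nd field (the gi number) and the 5th field (the protein name) after splitting and writes them to the output.
For it to run correctly, that rule has to hold for every line. Could you paste more of file 2 so I can take a look ...

Thanks for the reply.
My file 1 does contain duplicate entries, but I don't want them removed; I'd like the pl output to include the duplicates too. Although the file 1 entries are identical, they correspond to different genes of mine.
File 2 is the NR database. The whole database is about 11 GB, so I can't upload it.
Also, I'd like the first column of the output to follow the data in the .txt exactly, e.g. gi|571510024|ref|XP_006596207.1| from file 1, rather than just 571510024. This doesn't matter much, though.
Post #11, 2014-10-21 21:56:23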

大爷给跪了

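New member (somewhat well known)

Quoted reply:
Post #10: Originally posted by peterrjp at 2014-10-21 21:25:14
That script splits each line on the | character. For your file 2, it takes the 2nd field (the gi number) and the 5th field (the protein name) after splitting and writes them to the output.
For it to run correctly, that rule has to hold for every line. Could you paste more of file 2 so I can take a look ...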

>gi|67472372|sp|P0A7T7.2|RS18_ECOLI RecName: Full=30S ribosomal protein S18 [Escherichia coli K-12]gi|67472373|sp|P0A7T8.2|RS18_ECOL6 RecName: Full=30S ribosomal protein S18 [Escherichia coli CFT073]gi|67472374|sp|P0A7T9.2|RS18_ECO57 RecName: Full=30S ribosomal protein S18 [Escherichia coli O157:H7]gi|67472375|sp|P0A7U0.2|RS18_PHOLL RecName: Full=30S ribosomal protein S18 [Photorhabdus luminescens subsp. laumondii TTO1]gi|67472376|sp|P0A7U1.2|RS18_SALTI RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Typhi]gi|67472377|sp|P0A7U2.2|RS18_SHIFL RecName: Full=30S ribosomal protein S18 [Shigella flexneri]gi|75505387|sp|Q57GI9.1|RS18_SALCH RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67]gi|81677980|sp|Q5PJ56.1|RS18_SALPA RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150]gi|123047658|sp|Q0SX84.1|RS18_SHIF8 RecName: Full=30S ribosomal protein S18 [Shigella flexneri 5 str. 8401]gi|123084302|sp|Q1R358.1|RS18_ECOUT RecName: Full=30S ribosomal protein S18 [Escherichia coli UTI89]gi|123343273|sp|Q0T9J1.1|RS18_ECOL5 RecName: Full=30S ribosomal protein S18 [Escherichia coli 536]gi|123728279|sp|Q31TD3.1|RS18_SHIBS RecName: Full=30S ribosomal protein S18 [Shigella boydii Sb227]gi|123728458|sp|Q328J7.1|RS18_SHIDS RecName: Full=30S ribosomal protein S18 [Shigella dysenteriae Sd197]gi|123759489|sp|Q3YUE5.1|RS18_SHISS RecName: Full=30S ribosomal protein S18 [Shigella sonnei Ss046]gi|166220942|sp|A8AMJ6.1|RS18_CITK8 RecName: Full=30S ribosomal protein S18 [Citrobacter koseri ATCC BAA-895]gi|166220951|sp|A7MM78.1|RS18_CROS8 RecName: Full=30S ribosomal protein S18 [Cronobacter sakazakii ATCC BAA-894]gi|166220955|sp|A6THB3.1|RS18_KLEP7 RecName: Full=30S ribosomal protein S18 [Klebsiella pneumoniae subsp. 
pneumoniae MGH 78578]gi|167011188|sp|A7ZV73.1|RS18_ECO24 RecName: Full=30S ribosomal protein S18 [Escherichia coli E24377A]gi|167011189|sp|A8A7U8.1|RS18_ECOHS RecName: Full=30S ribosomal protein S18 [Escherichia coli HS]gi|167011190|sp|A4W5T1.1|RS18_ENT38 RecName: Full=30S ribosomal protein S18 [Enterobacter sp. 638]gi|189029651|sp|A9N520.1|RS18_SALPB RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Paratyphi B str. SPB7]gi|189029677|sp|B1IT04.1|RS18_ECOLC RecName: Full=30S ribosomal protein S18 [Escherichia coli ATCC 8739]gi|189029686|sp|A9MFK9.1|RS18_SALAR RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. arizonae serovar 62:z4,z23:-]gi|226735113|sp|B7MLK7.1|RS18_ECO45 RecName: Full=30S ribosomal protein S18 [Escherichia coli S88]gi|226735114|sp|B7NTQ7.1|RS18_ECO7I RecName: Full=30S ribosomal protein S18 [Escherichia coli IAI39]gi|226735115|sp|B7M9G4.1|RS18_ECO8A RecName: Full=30S ribosomal protein S18 [Escherichia coli IAI1]gi|226735116|sp|B7NGD6.1|RS18_ECOLU RecName: Full=30S ribosomal protein S18 [Escherichia coli UMN026]gi|226735117|sp|B1LQM1.1|RS18_ECOSM RecName: Full=30S ribosomal protein S18 [Escherichia coli SMS-3-5]gi|226735119|sp|B7LLY4.1|RS18_ESCF3 RecName: Full=30S ribosomal protein S18 [Escherichia fergusonii ATCC 35469]gi|229559983|sp|B6I2A8.1|RS18_ECOSE RecName: Full=30S ribosomal protein S18 [Escherichia coli SE11]gi|229559984|sp|B2VCV9.1|RS18_ERWT9 RecName: Full=30S ribosomal protein S18 [Erwinia tasmaniensis]gi|229559986|sp|B5Z2K7.1|RS18_ECO5E RecName: Full=30S ribosomal protein S18 [Escherichia coli O157:H7 str. EC4115]gi|229559987|sp|B1XDV2.1|RS18_ECODH RecName: Full=30S ribosomal protein S18 [Escherichia coli str. K-12 substr. 
DH10B]gi|229560040|sp|B5Y306.1|RS18_KLEP3 RecName: Full=30S ribosomal protein S18 [Klebsiella pneumoniae 342]gi|229560053|sp|B4F277.1|RS18_PROMH RecName: Full=30S ribosomal protein S18 [Proteus mirabilis HI4320]gi|229560058|sp|B5F3B9.1|RS18_SALA4 RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Agona str. SL483]gi|229560059|sp|B5FSA3.1|RS18_SALDC RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853]gi|229560060|sp|B5R0S0.1|RS18_SALEP RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Enteritidis str. P125109]gi|229560061|sp|B5R9F0.1|RS18_SALG2 RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Gallinarum str. 287/91]gi|229560062|sp|B4TFD7.1|RS18_SALHS RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Heidelberg str. SL476]gi|229560063|sp|B4T3F3.1|RS18_SALNS RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Newport str. SL254]gi|229560064|sp|B5BKL1.1|RS18_SALPK RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Paratyphi A str. AKU_12601]gi|229560065|sp|B4TT35.1|RS18_SALSV RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633]gi|229560074|sp|B2TY75.1|RS18_SHIB3 RecName: Full=30S ribosomal protein S18 [Shigella boydii CDC 3083-94]gi|254765429|sp|B7UQL1.1|RS18_ECO27 RecName: Full=30S ribosomal protein S18 [Escherichia coli O127:H6 str. E2348/69]gi|254765430|sp|B7LCR3.1|RS18_ECO55 RecName: Full=30S ribosomal protein S18 [Escherichia coli 55989]gi|254765431|sp|B7MST2.1|RS18_ECO81 RecName: Full=30S ribosomal protein S18 [Escherichia coli ED1a]gi|254812953|sp|C0Q6G0.1|RS18_SALPC RecName: Full=30S ribosomal protein S18 [Salmonella enterica subsp. enterica serovar Paratyphi C str. 
RKS4594]gi|259494808|sp|C4ZR79.1|RS18_ECOBW RecName: Full=30S ribosomal protein S18 [Escherichia coli BW2952]gi|259494809|sp|C5BF78.1|RS18_EDWI9 RecName: Full=30S ribosomal protein S18 [Edwardsiella ictaluri 93-146]
MARYFRRRKFCRFTAEGVQEIDYKDIATLKNYITESGKIVPSRITGTRAKYQRQLARAIKRARYLSLLPYTDRHQ

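This is one protein from the database. It has many gi entries, because those gi numbers belong to different genes or different plants. For example, if file 1 contains the ID gi|67472372|sp|P0A7T7.2|RS18_ECOLI, I want to get RecName: Full=30S ribosomal protein S18 [Escherichia coli K-12].
Your pl script as a whole is fine; the only problem is that it removes the duplicate entries.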
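
The lookup described here can be sketched as follows. This is a minimal illustration, not the script from this thread: `gi_names` is a hypothetical helper name, and it assumes every entry on a header line has the form gi|number|db|accession|name description, with the next entry starting directly at the following gi|number|.

```perl
use strict;
use warnings;

# Hypothetical helper: map each gi number on one header line to its description.
sub gi_names {
    my ($defline) = @_;
    $defline =~ s/^>//;        # drop the FASTA marker
    $defline =~ s/\s+\z//;     # drop the trailing newline, if any
    my %name_for;
    # Split in front of every "gi|<digits>|" without consuming it.
    for my $entry (split /(?=gi\|\d+\|)/, $defline) {
        my @f = split /\|/, $entry, 5;        # gi, number, db, accession, rest
        next unless @f == 5;
        (my $name = $f[4]) =~ s/^\S*\s+//;    # drop the locus name before the description
        $name_for{$f[1]} = $name;
    }
    return %name_for;
}
```

Called on the header above, `gi_names` would return a hash in which the key 67472372 maps to the RecName text for Escherichia coli K-12.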
Post #12, 2014-10-21 22:01:44

peterrjp

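Veteran member (renowned writer)

Quoted reply:
Post #11: Originally posted by 大爷给跪了 at 2014-10-21 21:56:23
Thanks for the reply.
My file 1 does contain duplicate entries, but I don't want them removed; I'd like the pl output to include the duplicates too. Although the file 1 entries are identical, they correspond to different genes of mine.
File 2 is the NR database. The whole database is about 11 GB, so I can't upload it.
Also ...

I've modified it as you asked; give it a try: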


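#!/usr/bin/perl
use strict;
use warnings;

my $list_file = "11111111.txt";
my $tb_file   = "nr.txt";
my ($ll, %gi1, %gi2, @f);

# "or" instead of "||": with "||" the die would never be reached
open(my $out, '>', "include.txt") or die "Can't open include.txt: $!";
open(my $in, '<', $list_file) or die "Can't open $list_file: $!";
while ($ll = <$in>) {
    chomp $ll;
    @f = split /\|/, $ll;
    next unless defined $f[1];    # skip blank or malformed lines
    $gi1{$f[1]}++;                # how many times this gi appears in the list
    $gi2{$f[1]} = $ll;            # full ID line, used as the first output column
}
close($in);

open($in, '<', $tb_file) or die "Can't open $tb_file: $!";
while ($ll = <$in>) {
    chomp $ll;
    @f = split /\|/, $ll;
    next unless @f > 4;           # skip lines with fewer than 5 fields
    # print one row per occurrence in the list, so duplicates are kept
    while ($gi1{$f[1]}) {
        print $out "$gi2{$f[1]}\t$f[4]\n";
        $gi1{$f[1]}--;
    }
}
close($in);
close($out);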
Post #13, 2014-10-21 22:07:58

peterrjp

Veteran member (renowned writer)

Quoted reply:
Post #12: Originally posted by 大爷给跪了 at 2014-10-21 22:01:44
>gi|67472372|sp|P0A7T7.2|RS18_ECOLI RecName: Full=30S ribosomal protein S18 gi|67472373|sp|P0A7T8.2|RS18_ECOL6 RecName: Full=30S ribosomal protein S18 gi|67472374|sp|P0A7T9.2|RS18_ECO57 RecName ...

Ugh, the format of file 2 is that messy? There is no delimiter at all between two gi entries; I had assumed each gi was on its own line. Let me revise the script again.
Post #14, 2014-10-21 22:10:47

大爷给跪了

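New member (somewhat well known)

Quoted reply:
Post #14: Originally posted by peterrjp at 2014-10-21 22:10:47
Ugh, the format of file 2 is that messy? There is no delimiter at all between two gi entries; I had assumed each gi was on its own line. Let me revise the script again...

I opened it in Notepad, and a separator is visible there, as shown in figure 3: an uppercase SOH.
Also, as the second >gi in the figure shows, a single gi can have several names, with the different names separated by ";".
Anyway, it's a complicated thing.
After local BLAST annotation, I want to get the Nr annotation from the Subject Seq-id in the results; guidance wanted.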
3.jpg
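
If the separator displayed as SOH is the ASCII control character 0x01 (Ctrl-A), which is an assumption based on the screenshot, the entries and names could be separated along these lines. Both `split_defline` and `split_names` are hypothetical helper names, and the sketch also assumes ";" never occurs inside a single name.

```perl
use strict;
use warnings;

# Split one header line into its entries, assuming they are joined by SOH (0x01).
sub split_defline {
    my ($defline) = @_;
    $defline =~ s/^>//;        # drop the FASTA marker
    $defline =~ s/\s+\z//;     # drop the trailing newline, if any
    return split /\x01/, $defline;
}

# Split one description into its names, assuming ";" separates them.
sub split_names {
    my ($description) = @_;
    my @names;
    for my $part (split /;/, $description) {
        $part =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
        push @names, $part if length $part;
    }
    return @names;
}
```

Splitting on the control character avoids relying on the text "gi|" as an entry boundary.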

Post #15, 2014-10-21 22:27:43

大爷给跪了

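New member (somewhat well known)

Quoted reply:
Post #14: Originally posted by peterrjp at 2014-10-21 22:10:47
Ugh, the format of file 2 is that messy? There is no delimiter at all between two gi entries; I had assumed each gi was on its own line. Let me revise the script again...

This time it basically all works, it's just that ...
After local BLAST annotation, I want to get the Nr annotation from the Subject Seq-id in the results; guidance wanted. -1
4.jpg

Post #16, 2014-10-21 22:55:37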

peterrjp

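Veteran member (renowned writer)

Quoted reply:
Post #15: Originally posted by 大爷给跪了 at 2014-10-21 22:27:43
I opened it in Notepad, and a separator is visible there, as shown in figure 3: an uppercase SOH.
Also, as the second >gi in the figure shows, a single gi can have several names, with the different names separated by ";".
Anyway, it's a complicated thing.

3.jpg
...

Er, where did the protein sequences come from now?
The revised code is below. If the file 2 you need to process looks like the one in post #12, this script should do the job. (I hadn't seen your new image when I made the change, so I didn't handle SOH, the protein sequences, and so on.) If it still doesn't work, please clean up file 2 first; the format is too messy and I'm getting confused.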


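#!/usr/bin/perl
use strict;
use warnings;

my $list_file = "11111111.txt";
my $tb_file   = "nr.txt";
my ($ll, $i, $nm, %gi1, %gi2, @f, @g);

# "or" instead of "||": with "||" the die would never be reached
open(my $out, '>', "include.txt") or die "Can't open include.txt: $!";
open(my $in, '<', $list_file) or die "Can't open $list_file: $!";

# Read the ID list: count each gi number and keep its full ID line.
while ($ll = <$in>) {
    chomp $ll;
    @f = split /\|/, $ll;
    next unless defined $f[1];    # skip blank or malformed lines
    $gi1{$f[1]}++;
    $gi2{$f[1]} = $ll;
}
close($in);

open($in, '<', $tb_file) or die "Can't open $tb_file: $!";
while ($ll = <$in>) {
    chomp $ll;
    $ll =~ s/>gi\|//;             # drop the leading ">gi|"
    @g = split /gi\|/, $ll;       # one element per gi entry on the line
    foreach $i (@g) {
        @f = split /\|/, $i;      # gi number, db, accession, name
        next unless @f > 3;       # skip fragments without a name field
        # print one row per occurrence in the list, so duplicates are kept
        while ($gi1{$f[0]}) {
            $nm = $f[3];
            $nm =~ s/.* RecName/RecName/;
            print $out "$gi2{$f[0]}\t$nm\n";
            $gi1{$f[0]}--;
        }
    }
}
close($in);
close($out);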
Post #17, 2014-10-21 22:57:28

大爷给跪了

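New member (somewhat well known)

Quoted reply:
Post #15: Originally posted by 大爷给跪了 at 2014-10-21 22:27:43
I opened it in Notepad, and a separator is visible there, as shown in figure 3: an uppercase SOH.
Also, as the second >gi in the figure shows, a single gi can have several names, with the different names separated by ";".
Anyway, it's a complicated thing.

3.jpg
...

I ran it on a server, and it's still very slow.
Post #18, 2014-10-21 22:57:41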

peterrjp

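Veteran member (renowned writer)

This new code filters out the protein sequences in file 2 and only looks at the content after >.
For the protein name, it only recognizes RecName and the string that follows it (if an AltName comes after it, that is shown as well).
Processing data with Perl ultimately comes down to finding patterns; if the pattern you worked out doesn't match the actual data, you get errors.
As for the SOH character, I don't know what it is; leaving it in shouldn't affect the result much: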


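#!/usr/bin/perl
use strict;
use warnings;

my $list_file = "11111111.txt";
my $tb_file   = "nr.txt";
my ($ll, $i, $nm, %gi1, %gi2, @f, @g);

# "or" instead of "||": with "||" the die would never be reached
open(my $out, '>', "include.txt") or die "Can't open include.txt: $!";
open(my $in, '<', $list_file) or die "Can't open $list_file: $!";

# Read the ID list: count each gi number and keep its full ID line.
while ($ll = <$in>) {
    chomp $ll;
    @f = split /\|/, $ll;
    next unless defined $f[1];    # skip blank or malformed lines
    $gi1{$f[1]}++;
    $gi2{$f[1]} = $ll;
}
close($in);

open($in, '<', $tb_file) or die "Can't open $tb_file: $!";
while ($ll = <$in>) {
    next unless $ll =~ /^>gi/;    # only header lines; skip protein sequences
    chomp $ll;
    $ll =~ s/>gi\|//;             # drop the leading ">gi|"
    @g = split /gi\|/, $ll;       # one element per gi entry on the line
    foreach $i (@g) {
        @f = split /\|/, $i;      # gi number, db, accession, name
        next unless @f > 3;       # skip fragments without a name field
        # print one row per occurrence in the list, so duplicates are kept
        while ($gi1{$f[0]}) {
            $nm = $f[3];
            $nm =~ s/.* RecName/RecName/;
            print $out "$gi2{$f[0]}\t$nm\n";
            $gi1{$f[0]}--;
        }
    }
}
close($in);
close($out);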
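
The script above writes its rows in the order the gi numbers are met in nr.txt. If the output should instead follow the order of the ID list, a two-step variant is one option. This is a different technique from the script above, shown only as a sketch: `annotate` is a hypothetical name, it takes array references instead of files to keep the example short, and it assumes entries of the form gi|number|db|accession|name description.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "id<TAB>name" rows, one per entry of @$ids,
# in the same order as the list (duplicates included).
sub annotate {
    my ($ids, $nr_lines) = @_;

    # gi numbers we need: the 2nd |-separated field of each list entry
    my %wanted;
    for my $id (@$ids) {
        my @f = split /\|/, $id;
        $wanted{$f[1]} = 1 if defined $f[1];
    }

    # single pass over nr: remember names only for the wanted gi numbers
    my %name_for;
    for my $line (@$nr_lines) {
        next unless $line =~ /^>gi\|/;           # skip sequence lines
        (my $defline = $line) =~ s/^>//;
        $defline =~ s/\s+\z//;                   # drop the trailing newline
        for my $entry (split /(?=gi\|\d+\|)/, $defline) {
            my @f = split /\|/, $entry, 5;       # gi, number, db, accession, rest
            next unless @f == 5 && $wanted{$f[1]};
            (my $name = $f[4]) =~ s/^\S*\s+//;   # drop the locus name before the description
            $name =~ s/\x01\z//;                 # drop a trailing SOH (0x01), if present
            $name_for{$f[1]} = $name;
        }
    }

    # walk the list in its original order; the first column is the ID as given
    my @rows;
    for my $id (@$ids) {
        my @f = split /\|/, $id;
        next unless defined $f[1] && exists $name_for{$f[1]};
        push @rows, "$id\t$name_for{$f[1]}";
    }
    return @rows;
}
```

Only the names for the listed gi numbers are held in memory, so the size of nr.txt affects run time but not the size of the hash.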
Post #19, 2014-10-21 23:06:20

cbh524674418

New member (newcomer)

May I ask the original poster where you got the file shown in figure 2? I have never been able to find it 0.0
Post #20, 2015-12-01 18:24:09