24小时热门版块排行榜    

查看: 2256  |  回复: 6

ldy2140

金虫 (小有名气)

[交流] 讨论下怎么通过gi号批量获得物种的definition 已有4人参与

做生物信息学的大都避免不了要blast 有时尽管我们blast出来的结果很多很吓人 但还是要将这些结果汇总成excel表格
最近就遇到了很让我头疼的事情 我做了很多转运蛋白的微生物全库的blast 但得到的table里只有匹配物种的gi号 在汇总结果的时候我想把gi号换成物种信息 比如像GBFF里的definition这种能说明物种遗传背景的字符串
所以我考虑用perl的正则表达式替换 写了如下的程序
CODE:
#!/usr/bin/perl
use Bio::Seq;
use Bio::DB::GenBank;

$gb = new Bio::DB::GenBank;
$^I = ".bak";

while (<>) {
  $line = $_;
  if ( /gi\|(\d+)\|/ ) {
    $gi = $1;
    $seq_obj = $gb->get_Seq_by_gi ($1);
    $def = $seq_obj->desc;
  }
  $_ = $line;
  s#\t.*?$gi.*?\t#\t$def\t#;
  print;
}

但是运行起来速度很慢而且很浪费带宽 因为用到的模块是将gi号对应的整个序列信息都下载下来 然后从中提取definition 所以效率很差 这是我花很短时间学习perl和bioperl编写的急功近利的程序 期待高手拍砖

[ Last edited by ldy2140 on 2012-8-28 at 21:55 ]
回复此楼

» 猜你喜欢

» 本主题相关价值贴推荐,对您同样有帮助:

伸手摘星,未必你如愿,但不会弄脏你的手。
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖

libralibra

至尊木虫 (著名写手)

骠骑将军


小木虫: 金币+0.5, 给个红包,谢谢回帖
贴个例子看看

贴一个blast出来的结果(未处理的字符串)
贴一个你想要的结果(目标字符串)
matlab/VB/python/c++/Java写程序请发QQ邮件:790404545@qq.com
2楼2012-08-28 22:41:10
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖

ldy2140

金虫 (小有名气)

引用回帖:
2楼: Originally posted by libralibra at 2012-08-28 22:41:10
贴个例子看看

贴一个blast出来的结果(未处理的字符串)
贴一个你想要的结果(目标字符串)

sp|P23936|LACY_STRTR        gi|169822596|gb|ABJK02000022.1|        61.65        631        241        1        5        634        344705        342813        0.0         714
sp|P23936|LACY_STRTR        gi|223555729|gb|ACGH01000016.1|        57.01        628        260        1        2        619        65439        67322        0.0         692
替换后
sp|P23936|LACY_STRTR        Streptococcus infantarius subsp. infantarius ATCC BAA-102 S_infantarius-2.0.1_Cont245, whole genome shotgun sequence.        61.65        631        241        1        5        634        344705        342813        0.0         714
sp|P23936|LACY_STRTR        Lactobacillus buchneri ATCC 11577 contig00018, whole genome shotgun sequence.        57.01        628        260        1        2        619        65439        67322        0.0         692
伸手摘星,未必你如愿,但不会弄脏你的手。
3楼2012-08-29 09:30:29
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖

libralibra

至尊木虫 (著名写手)

骠骑将军


小木虫: 金币+0.5, 给个红包,谢谢回帖
你是不是有gi和description的对应关系,如果有,直接正则替换gi那部分即可.
如果必须去网络上查,查回来肯定gi序号和description同时有的,你处理完了再写文件
生物不懂,不过有bioperl,搜了下,也有biopython,照着例子改改,可以直接打印gi号和对应的description,可以看看.
准备学个脚本语言的时候,看过perl和python的语法,果断选了python,perl不懂啊
biopython教程:
http://biopython.org/DIST/docs/tutorial/Tutorial.html

例子代码,测试过了
CODE:
# Import the modules for interfacing with BLAST and parsing the output
from Bio.Blast import NCBIWWW, NCBIXML

# Blast the sequence of interest (in this case using the accession number
result_handle = NCBIWWW.qblast("blastn", "nr", "8332116")

# Parse the resulting output
blast_record = NCBIXML.read(result_handle)

# Loop over the alignments printing some output of interest
E_VALUE_THRESH = 0.004
for alignment in blast_record.alignments:
    result = alignment.title
    print 'gi no.: '+result.split()[0]
    print 'gi-desc: '+' '.join(result.split()[1:])
    print
##    for hsp in alignment.hsps:
##        if hsp.expect < E_VALUE_THRESH:
##            print
##            print '****Alignment****'
##            print 'sequence:', alignment.title
##            print 'length:', alignment.length
##            print 'e value:', hsp.expect
##            print hsp.query[0:75] + '...'
##            print hsp.match[0:75] + '...'
##            print hsp.sbjct[0:75] + '...'

结果,gi号和description可以分别提取打印:
CODE:
gi no.: gi|224094601|ref|XM_002310151.1|
gi-desc: Populus trichocarpa predicted protein, mRNA

gi no.: gi|359495761|ref|XM_002274845.2|
gi-desc: PREDICTED: Vitis vinifera uncharacterized LOC100267774 (LOC100267774), mRNA

gi no.: gi|349709091|emb|FQ378501.1|
gi-desc: Vitis vinifera clone SS0AEB13YG07

gi no.: gi|255562758|ref|XM_002522339.1|
gi-desc: Ricinus communis COR413-PM2, putative, mRNA

gi no.: gi|358346403|ref|XM_003637210.1|
gi-desc: Medicago truncatula Cold acclimation protein-like protein (MTR_079s1009) mRNA, complete cds

gi no.: gi|358344000|ref|XM_003636035.1|
gi-desc: Medicago truncatula Cold acclimation protein-like protein (MTR_026s0005) mRNA, complete cds

gi no.: gi|356561272|ref|XM_003548859.1|
gi-desc: PREDICTED: Glycine max uncharacterized protein LOC100817084 (LOC100817084), mRNA

gi no.: gi|356502211|ref|XM_003519866.1|
gi-desc: PREDICTED: Glycine max uncharacterized protein LOC100810337 (LOC100810337), mRNA

gi no.: gi|225311746|dbj|AK326681.1|
gi-desc: Solanum lycopersicum cDNA, clone: LEFL2011M15, HTC in fruit

gi no.: gi|255762732|gb|GQ370517.1|
gi-desc: Salvia miltiorrhiza cold acclimation protein (COR) mRNA, complete cds

gi no.: gi|225428595|ref|XM_002284686.1|
gi-desc: PREDICTED: Vitis vinifera uncharacterized LOC100248690 (LOC100248690), mRNA

gi no.: gi|297819785|ref|XM_002877730.1|
gi-desc: Arabidopsis lyrata subsp. lyrata COR413-PM2, mRNA

gi no.: gi|86755971|gb|DQ359747.1|
gi-desc: Chimonanthus praecox cold acclimation protein COR413-PM1 mRNA, complete cds

gi no.: gi|145339339|ref|NM_114943.4|
gi-desc: Arabidopsis thaliana cold-regulated 413-plasma membrane 2 (COR413-PM2) mRNA, complete cds

gi no.: gi|15810634|gb|AY056356.1|
gi-desc: Arabidopsis thaliana putative cold acclimation protein (At3g50830) mRNA, complete cds

gi no.: gi|10121842|gb|AF283005.1|
gi-desc: Arabidopsis thaliana cold acclimation protein WCOR413-like protein beta form mRNA, complete cds

gi no.: gi|13430785|gb|AF360305.1|
gi-desc: Arabidopsis thaliana putative cold acclimation protein (At3g50830) mRNA, complete cds

gi no.: gi|60317457|gb|AY761065.1|
gi-desc: Gossypium barbadense cold-related protein Cor413 (Cor413) mRNA, complete cds

gi no.: gi|255556172|ref|XM_002519075.1|
gi-desc: Ricinus communis COR413-PM2, putative, mRNA

gi no.: gi|156567558|gb|EU077497.1|
gi-desc: Poncirus trifoliata cold acclimation WCOR413-like protein mRNA, complete cds

gi no.: gi|46577795|gb|AY587773.1|
gi-desc: Tamarix androssowii putative stress-responsive protein mRNA, complete cds

gi no.: gi|305690597|gb|HQ010041.1|
gi-desc: Corylus heterophylla COR413-PM1 mRNA, complete cds

gi no.: gi|224105476|ref|XM_002313788.1|
gi-desc: Populus trichocarpa predicted protein, mRNA

gi no.: gi|242389633|emb|FP100664.1|
gi-desc: Phyllostachys edulis cDNA clone: bphylf036p06, full insert sequence

gi no.: gi|242382816|emb|FP092058.1|
gi-desc: Phyllostachys edulis cDNA clone: bphyem114p22, full insert sequence

gi no.: gi|242382391|emb|FP097178.1|
gi-desc: Phyllostachys edulis cDNA clone: bphylf028m11, full insert sequence

gi no.: gi|242381728|emb|FP091375.1|
gi-desc: Phyllostachys edulis cDNA clone: bphyst020e14, full insert sequence

gi no.: gi|238007351|gb|BT084358.1|
gi-desc: Zea mays full-length cDNA clone ZM_BFb0105L06 mRNA, complete cds

gi no.: gi|195636267|gb|EU965484.1|
gi-desc: Zea mays clone 286348 cold acclimation protein COR413-PM1 mRNA, complete cds

gi no.: gi|54652523|gb|BT017742.1|
gi-desc: Zea mays clone EL01N0449E04.c mRNA sequence

gi no.: gi|162459269|ref|NM_001111732.1|
gi-desc: Zea mays LOC542099 (gpm455), mRNA >gi|27902672|gb|AY181208.1| Zea mays cold acclimation protein COR413-PM1 mRNA, complete cds

gi no.: gi|21209119|gb|AY106041.1|
gi-desc: Zea mays PCO103483 mRNA sequence

gi no.: gi|242037992|ref|XM_002466346.1|
gi-desc: Sorghum bicolor hypothetical protein, mRNA

gi no.: gi|255617390|ref|XM_002539789.1|
gi-desc: Ricinus communis COR413-PM2, putative, mRNA

gi no.: gi|30690903|ref|NM_119885.2|
gi-desc: Arabidopsis thaliana cold acclimation protein WCOR413 (AT4G37220) mRNA, complete cds

gi no.: gi|26449888|dbj|AK117399.1|
gi-desc: Arabidopsis thaliana At4g37220 mRNA for putative ap2 cold acclimation protein, complete cds, clone: RAFL16-98-J01

gi no.: gi|226504237|ref|NM_001155133.1|
gi-desc: Zea mays cold acclimation protein COR413-PM1 (LOC100282221), mRNA >gi|195620729|gb|EU960077.1| Zea mays clone 221611 cold acclimation protein COR413-PM1 mRNA, complete cds

gi no.: gi|166359605|gb|EU365626.1|
gi-desc: Thellungiella halophila stress responsive protein (COR) mRNA, complete cds

gi no.: gi|150172175|emb|CU406592.1|
gi-desc: Oryza rufipogon (W1943) cDNA clone: ORW1943C102K01, full insert sequence

gi no.: gi|115455578|ref|NM_001057925.1|
gi-desc: Oryza sativa Japonica Group Os03g0767800 (Os03g0767800) mRNA, complete cds

gi no.: gi|10121844|gb|AF283006.1|
gi-desc: Oryza sativa (japonica cultivar-group) cold acclimation protein WCOR413-like protein mRNA, complete cds

gi no.: gi|32976054|dbj|AK066036.1|
gi-desc: Oryza sativa Japonica Group cDNA clone:J013049B03, full insert sequence

gi no.: gi|32970924|dbj|AK060906.1|
gi-desc: Oryza sativa Japonica Group cDNA clone:001-035-F05, full insert sequence

gi no.: gi|32970018|dbj|AK060000.1|
gi-desc: Oryza sativa Japonica Group cDNA clone:006-301-G09, full insert sequence

gi no.: gi|28973358|gb|BT005584.1|
gi-desc: Arabidopsis thaliana clone U50435 putative cold acclimation protein homolog (At4g37220) mRNA, complete cds

gi no.: gi|326534181|dbj|AK358227.1|
gi-desc: Hordeum vulgare subsp. vulgare mRNA for predicted protein, complete cds, clone: NIASHv1071H11

gi no.: gi|160954667|emb|CU225096.1|
gi-desc: Populus EST from leave

gi no.: gi|160950966|emb|CU229055.1|
gi-desc: Populus EST from severe drought-stressed leaves

gi no.: gi|357114154|ref|XR_137736.1|
gi-desc: PREDICTED: Brachypodium distachyon uncharacterized LOC100844112 (LOC100844112), miscRNA

gi no.: gi|224035946|gb|BT070152.1|
gi-desc: Zea mays full-length cDNA clone ZM_BFc0138N11 mRNA, complete cds

matlab/VB/python/c++/Java写程序请发QQ邮件:790404545@qq.com
4楼2012-08-29 17:25:33
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖

wizardfan

至尊木虫 (著名写手)

优秀版主


小木虫: 金币+0.5, 给个红包,谢谢回帖
You know my comments on how to deal with high throughput data analysis: download the genbank flat file and parse the local file, which can improves the efficiency dramatically.

About your code:
1. Use "use strict;" all the time
2. Regular expression is fine, but I would use $` $' (special variables containing the previous and next part of the matching part) instead of another s/// statement
5楼2012-08-29 22:54:33
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖

321wangke321

金虫 (正式写手)


小木虫: 金币+0.5, 给个红包,谢谢回帖
引用回帖:
5楼: Originally posted by wizardfan at 2012-08-29 22:54:33
You know my comments on how to deal with high throughput data analysis: download the genbank flat file and parse the local file, which can improves the efficiency dramatically.

About your code:
1 ...

如何从拟南芥的含有所有transcripts的txt文件中根据ID号批量提取相应的序列并且保存到一个文档中?
王者风范,一切皆有可能。
6楼2014-04-07 16:16:17
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖

semiangle12

金虫 (正式写手)


小木虫: 金币+0.5, 给个红包,谢谢回帖
引用回帖:
5楼: Originally posted by wizardfan at 2012-08-29 22:54:33
You know my comments on how to deal with high throughput data analysis: download the genbank flat file and parse the local file, which can improves the efficiency dramatically.

About your code:
1 ...

请问已知gb号,怎么在批量下载的时候选择连GI值一起下载呢,批量下载的时候无法选择GI
7楼2017-04-26 10:47:59
已阅   回复此楼   关注TA 给TA发消息 送TA红花 TA的回帖
相关版块跳转 我要订阅楼主 ldy2140 的主题更新
普通表情 高级回复 (可上传附件)
最具人气热帖推荐 [查看全部] 作者 回/看 最后发表
[考研] 欢迎采矿、地质、岩土、计算机、人工智能等专业的同学报考 +3 pin8023 2026-02-28 5/250 2026-03-02 00:24 by 花YOU重开日
[基金申请] 本子写完了,给DS兄弟看了,得了92分 +3 Doma 2026-03-01 7/350 2026-03-02 00:00 by jnzsy
[考研] 材料化工调剂 +12 今夏不夏 2026-03-01 13/650 2026-03-01 23:32 by L135790
[考研] 265分求调剂不调专业和学校有行学上就 +6 礼堂丁真258 2026-02-28 8/400 2026-03-01 22:50 by jian_
[基金申请] 成果系统访问量大,请一小时后再尝试。---NSFC啥时候好哦,已经两天这样了 +4 NSFC2026我来了 2026-02-28 4/200 2026-03-01 22:37 by 铁门栓
[考研] 0856调剂 +5 刘梦微 2026-02-28 5/250 2026-03-01 22:30 by wang_dand
[考研] 306分材料调剂 +4 chuanzhu川烛 2026-03-01 5/250 2026-03-01 19:48 by 无际的草原
[考研] 0856材料求调剂 +11 hyf hyf hyf 2026-02-28 12/600 2026-03-01 18:57 by 18137688336
[考研] 291分工科求调剂 +9 science饿饿 2026-03-01 10/500 2026-03-01 18:55 by 18137688336
[考博] 26申博 +4 想申博! 2026-02-26 6/300 2026-03-01 17:32 by 想申博!
[考研] 328求调剂 +3 aaadim 2026-03-01 5/250 2026-03-01 17:29 by njzyff
[考研] 290求调剂 +9 材料专硕调剂; 2026-02-28 11/550 2026-03-01 17:21 by sunny81
[考研] 0856材料求调剂 +4 麻辣鱿鱼 2026-02-28 4/200 2026-03-01 16:51 by caszguilin
[基金申请] 刚录用,没有期刊号,但是在线可看的论文可以放为代表作吗 10+3 arang1 2026-03-01 3/150 2026-03-01 16:43 by babero
[考研] 求调剂 +6 repeatt?t 2026-02-28 6/300 2026-03-01 14:37 by Sakura绘
[考研] 317一志愿华南理工电气工程求调剂 +6 Soliloquy_Q 2026-02-28 11/550 2026-03-01 11:14 by 歌liekkas
[硕博家园] 2025届双非化工硕士毕业,申博 +3 更多的是 2026-02-27 4/200 2026-03-01 10:04 by ztg729
[论文投稿] Optics letters投稿被拒求助 30+3 luckyry 2026-02-26 4/200 2026-03-01 09:06 by babero
[考研] 307求调剂 +4 73372112 2026-02-28 6/300 2026-03-01 00:04 by ll247
[考研] 276求调剂 +3 路lyh123 2026-02-28 4/200 2026-02-28 19:45 by 路lyh123
信息提示
请填处理意见