
Views: 1378  |  Replies: 3

brightfuture01


[Help] Programming help requested

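Can this task be done with a script? If not, how would it be implemented in C++? My programming knowledge is limited, so I would appreciate your help. I would be very grateful for a runnable program, whether a script or a C++ program; coins are not a problem.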


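Problem description:

(1) Data: about 10,000 nucleotide sequences in FASTA format (> + annotation text + newline + nucleotide sequence). Everything between two > characters is one gene. The format is as follows: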

>hg19_refGene_NM_052933 range=chr7:130353386-130373406 5'pad=0 3'pad=0 strand=- repeatMasking=none
tcttccaacgtggcccagggaagccaaaagattggacaccctgcttcaga
gcccatattaggttttttttttttttttttttggtcataaactgcagaaa
tgaaccctaagaaaataaaggatttatttgaaagaatattaggaaagctc
acagaactgaagagaaaattgtgtagtcttgctttaggaaggactagatc
tggggcagctccggggacctcagcaggtagagctcagagctcagagtctt
ccagcctgggataccacatgaggtggctcaggtgcaatcatcttacagct
ctgtgggtcattagcaaactttcaaattcttagtataacctggttggcca
cctttgaatcaatcatccagagctaggatgcaggagtgaatggcactaat
tggtgatggctatagacccttataatttctcaacccctcctcagcactgt
catagacttctcaaggacctctttgcatctttctaatctatagttcttct
gtctttccaatctatagttcacatgccactatttttgtatttctaaaata
tatattgttcaataaggagattatacctcatttaaaatctcttgttagcg
cttcattactcagaataacccacaaacttcctatcaaatagttcatttta
ggatttggatcctgtcttcctctctagtctcacctctcactacccttccc
ctaaatcttatgttcctggcatattgcaattacagtattttcaattcccc
taaataagccaggctttgttgtattctcatgtttttgcctctgcttctcc
ctctttctcaaatgtctgtcctctcatcacctggttaattcacactcatc
ttccccagctcagtttaggcagtagttccaaatctctctcagtctcttct
cagtatgggttaagtccccttcctcattgcttccatagtaccctaaattt
agatctaatttggcattctactctgttgcaatcaaccgtttctttgtctc
tcttcctggctaaactgtgagctcctaaaggagctcaagaattacatctg
atttatgctttgttctcatcacctaggagagtgtctaatatgtcataaat
gttcagcaaatatattgaaataagcaaacactagttttgaatgcataaga
tacaaagcactgggaggggggcacaaatatgtttaaaatttacatttcca
cttttaagggactcacagcttaagaggcagacaagacccgcacataagaa
aagtgaaaacagaactacaagacactatacaatgacatgcacaaaaccca
gtgactatagtctagagtttgggattgcattaatctagaaaatcttctct
ttaaaggaaTGACCTgagtagtaacataaagaaagaagattcagggctgg
gcatggtggctcatgcctgtaatcccagcactttgggaggctgaggtggg
tggatcacaAGGTCAagagatcgagaccatcctggccaacatggtgaaac
cccgtctctactaaaaatacaaaaattagctgggcatggtggcaggcgcc
tgtagtcccagctactctggaggctgaggcaggaaaatcacttgaaccag
ggagacagaggttgcagtgaactgagactgtgccactgcactccagcctt
gtgacagtgtgaggctccgtcagaagaaagagaaggagaaggagaaggag
aagattcaaatggctgaagagattggggacagcattttaagtatggggaa
taagaaaaaatgctcactgtatttgaagaatgcagtgaaaacattggctt
aggtagaggtagattaagagatggtgaagatgtgtgtgtgtgtgtgtgtg
tgtgtgtgtgtgtgtgtgtgtgtgtagagaagtcataggctggcaaaata
agagattggcagattttgtttaaatatgcctttaagttttggccttttca
ttctctgtctccaacacagaacacctcagcattctagattgcctattcat
>hg19_refGene_NR_030165 range=chr7:136585914-136588141 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gagtgccctgggaaaggaaaaattccatttccaagatccaggcatgtgag
aattacgtgaaattaaatatgtggacttgcggtttttggttcttggaaaa
atagaaaggtatagtgggttgcatgcagttctagctgcattccagctaga
aacagattgagacagttcagattcacaggcacttgcagggcaagttttca
atgtcactgtggctttaccacgtaggcaaatttgaaaatagacacactaa
atattgactacaggaaaagaaaaatgtgtatttattcatacaaaacacat
ttatttagcatctataatgtgctgtgcccTGACCTatagtgacccaaaca
gattagcagatctcacccgatcttgtagtgcagtgatttgaatagcaatg
atctttggaccctaagcaacccagtttggccattccctgatgtcattttt
ttcccccatttccattgttattttttgagaacatacagacgctttttgat
ccatttgttgtaagaacctagcctcttgatcagttgcatttacaaaataa
caacattgataatttcataaatgtgtaagatttacagcttacaaagaact
tgcacacacatctcatgcataactcacaaaacctcttaattctttcccaa
taatacaggggaatcatggactcaggggcTGACCTccaaagacgctgttg
gtggtaaattgttagagccccgagatgtgacttaaatttaggttttctga
cagaggtgtgctgctcgtctctatgctaatccattacacagccagacagg
aagaactgtcagtagattctgatcaatttctctttctataaaaaaaaatg
ataagcttagttaaattgtattagataagtgaagggttgtttataacagt
ccagcctccccttagcttcttctatggctttcattaggctccatcaaagc
ctactctcaaaacaaaatataaaaataattgttaactactaacattgatt
ctgtgatcttccttttaaactcatctatctatctatcgatctatctatcc
atccatccatccatccatccatccatccatccatccatctatccacctag
ctatccttctatgtttgcagtcagttgctagttattaagataatcagaat
tgctttcagaattaataattggtataaatttcagaagagtttgaatttag
gtggcaaattcataataagtgagaaggtaagctatagcatcctctgataa
tgtgtgcagtttactttttatctgtctctttctaattgaaaactaacaaa
tatagccccaattaaatgcacgtaaaaattagaagctggtgggatagggt
attaaacacaatcctagatgactcttatgaactcataccataagcagcca
ctttctttctcgagcaaactatagtgagaatgaagcatcttagattgaga
agggattaggacgaccctgaatggaatgggcaaatcataaacagctaggc
ccttagaatttggttgcagtcccaaacccAGGTCAgtttttaaacatgac
tatcagctagatatcctttctccaccatacaataatagataacaacctta
ataagacgtgtagacattaaactttgaaattccacagtaagatgtaaata
tttgctcaatcaagtacaatttaatatgtttgttatacagcaactgcaga
gcacagaattttgtactctttggatgtttatgataaggtacacattattt
gcaagtttttgcttgtttccttgttcagtttttcattatcaaacaaacaa
agcttctcagcctgggattaacctggagtctggaaagtatacattatggc
cagcaactttaaacaataggccagagatgggaaatgaatgaatgaaattc
tgacacagaagacaaacaacagaaactcatttgggctagtgtaggtgtag
gttttttattcttcaaccaacggtggtgaagaggatctcccttcacttta
>hg19_refGene_NM_001190906 range=chr7:137759178-137803150 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gcctgagggtcagtttcctgaggaaaaaactcagttaagacataagtttc
aagttttaagacagagggcacatttctatgtttattcaaaaatccataaa
tatgggacaatttggccagtttcaacttcaggacattttaaccattgtgg
actcagTGACCTgtgaagtgtacaggccagggaaacttcctctttgcctt
ctgaagtttcactgaaaaatcaactcacaaaaggcagattaattggagaa
aaggtattcaattttatttaactatatgtttacacagggagaaccacaga
gtgattacccacccacaacagggtttagaagcttatttactggtaaatca
ggttatgggagaggggaaaagaggaattctgttgaggagatcactaggga
gaatgaatggatcagggagtagagattaacttgtacattaccttgtgaaa
gggtttgttcaggaaaggttacattcttagtcttacagggagaggaagaa
aaacgaattgttccttttggtgggtctggatcttaggcagataaaggaac
ttcactttgggagaggtggtagggagtagAGGTCAgagggaccttgaggc
ttttccagttcagtatgtcaaagtgccatattttggggtatcagtttctg
actcccaacacattctttgtcttctttttttgtaatggatcatgagataa
caggtaggggaaaaagaacaattgtcctccttggtgggtccatcctatct
ttatgtagacaggggaaagtctcttccagagcccgttgatctctaagggt
ctttacttcaaaatcttcattataccagggagccatatgttggggtggaa
tttcctgcctccttcaacccccagctccaaatttggaaatggcttctaat
ttttttaaggaaatgtcttttatttcctttggccactgtgattgatttca
cgatggatccatgaaccaagaaaggcaatcaaggctaacaagtttcaatt
tcaggactactgtttgagctactgagaaagtaaaatctctttttcgttgt
ttgagctggaaaccttgtcactccaaggacagtcagtaaacacttttcaa
tgttaagggaactgagagagagagtttcctgagagaacaggaaggcagag
agacgaatgagaagtgagaaattggtcctggccatgttattgttcctcta
gtccagtttaactggtgataagcttaacttgtgataaagatcctagatcc
ctttccagtcctgctgcatccaaatctcccaggaagtcctagaaaatgtc
tagtctcccctgaagctagccctactgccagaatttgaggaatatgaacc
aataaatttccattatagtttaagagaatttaagatagatgtgtgtatta
gttcattctcacactgctaataaagacaaccaagactgggtaattaacaa
agaaaaagagatttaatgggctcacagttccacatggctggggaggcctc
acaatcatggcagaaggtgaaagaggagcaaaggcatgtcttacatggca
ggaggcaagagagcatgtgcaggggaactgtccttcataaaaccatcaga
tcttgtgagacttattcactatcacaagagcagcatgggaaaaaaacacc
cccatgattcaattacctcccactgagtccctcccatgacattggggatt
gtgggagctacaattcaagatgagatttggttgaggacgcagccaaacta
tgtcagtctcttatttgccaacaaaagcatcctaactgatagaggccaga
cagatttgtttctttttgttttttcaatcttttgttgtgaagaagtaagc
ataaactctcaataggttacgttttacaagcctctgatgaagttcaaagg
acaaccatgcttaggatttccaggacaacctggaaaaaaaaacaggttga
gaaataggtgtgttaatctcccttccctctgctcctccctctggccttcc
.
.
.
.
.
.
>hg19_refGene_NM_005989 range=chr7:137759178-137803150 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gcctgagggtcagtttcctgaggaaaaaactcagttaagacataagtttc
aagttttaagacagagggcacatttctatgtttattcaaaaatccataaa
tatgggacaatttggccagtttcaacttcaggacattttaaccattgtgg
actcagTGACCTgtgaagtgtacaggccagggaaacttcctctttgcctt
ctgaagtttcactgaaaaatcaactcacaaaaggcagattaattggagaa
aaggtattcaattttatttaactatatgtttacacagggagaaccacaga
gtgattacccacccacaacagggtttagaagcttatttactggtaaatca
ggttatgggagaggggaaaagaggaattctgttgaggagatcactaggga
gaatgaatggatcagggagtagagattaacttgtacattaccttgtgaaa
gggtttgttcaggaaaggttacattcttagtcttacagggagaggaagaa
aaacgaattgttccttttggtgggtctggatcttaggcagataaaggaac
ttcactttgggagaggtggtagggagtagAGGTCAgagggaccttgaggc
ttttccagttcagtatgtcaaagtgccatattttggggtatcagtttctg
actcccaacacattctttgtcttctttttttgtaatggatcatgagataa
caggtaggggaaaaagaacaattgtcctccttggtgggtccatcctatct
ttatgtagacaggggaaagtctcttccagagcccgttgatctctaagggt
ctttacttcaaaatcttcattataccagggagccatatgttggggtggaa
tttcctgcctccttcaacccccagctccaaatttggaaatggcttctaat
ttttttaaggaaatgtcttttatttcctttggccactgtgattgatttca
cgatggatccatgaaccaagaaaggcaatcaaggctaacaagtttcaatt
tcaggactactgtttgagctactgagaaagtaaaatctctttttcgttgt
ttgagctggaaaccttgtcactccaaggacagtcagtaaacacttttcaa
tgttaagggaactgagagagagagtttcctgagagaacaggaaggcagag
agacgaatgagaagtgagaaattggtcctggccatgttattgttcctcta
gtccagtttaactggtgataagcttaacttgtgataaagatcctagatcc
ctttccagtcctgctgcatccaaatctcccaggaagtcctagaaaatgtc
tagtctcccctgaagctagccctactgccagaatttgaggaatatgaacc
aataaatttccattatagtttaagagaatttaagatagatgtgtgtatta
gttcattctcacactgctaataaagacaaccaagactgggtaattaacaa
agaaaaagagatttaatgggctcacagttccacatggctggggaggcctc
acaatcatggcagaaggtgaaagaggagcaaaggcatgtcttacatggca
ggaggcaagagagcatgtgcaggggaactgtccttcataaaaccatcaga
tcttgtgagacttattcactatcacaagagcagcatgggaaaaaaacacc
cccatgattcaattacctcccactgagtccctcccatgacattggggatt
gtgggagctacaattcaagatgagatttggttgaggacgcagccaaacta
tgtcagtctcttatttgccaacaaaagcatcctaactgatagaggccaga
cagatttgtttctttttgttttttcaatcttttgttgtgaagaagtaagc
ataaactctcaataggttacgttttacaagcctctgatgaagttcaaagg
acaaccatgcttaggatttccaggacaacctggaaaaaaaaacaggttga
gaaataggtgtgttaatctcccttccctctgctcctccctctggccttcc

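(2) Task: search all 10,000 genes for two strings, for example "ATCG" and "GCTAT", and sort the 10,000 genes in descending order by the total number of occurrences of the two strings. For example, if gene A contains 5 matches, gene B contains 4, genes C and D contain 3 each, and genes E, F, G and H contain none, the order is A B C D E F G H.

(3) After sorting, extract each gene's ID, in sorted order, into a separate Excel file. The ID is the part after refGene and before range in the header; for >hg19_refGene_NM_005989 range=chr7:137759178-137803150 5'pad=0 3'pad=0 strand=+ repeatMasking=none it is "NM_005989". The first column is the gene ID, the second is the number of matched strings, and the third is the number of bases in the gene.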
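
To make the ID rule in (3) concrete, here is a minimal Python sketch that pulls the ID out of one header line. The helper name gene_id and the regular expression are assumptions based only on the header layout shown above; headers with a different layout would need a different pattern.

```python
import re

# Hypothetical helper: the name and the pattern are assumptions derived
# from the header layout shown in the question.
HEADER_ID = re.compile(r"_refGene_(\S+)\s+range=")

def gene_id(header):
    """Return the text between '_refGene_' and 'range', or None if absent."""
    m = HEADER_ID.search(header)
    return m.group(1) if m else None

header = (">hg19_refGene_NM_005989 range=chr7:137759178-137803150 "
          "5'pad=0 3'pad=0 strand=+ repeatMasking=none")
print(gene_id(header))  # NM_005989
```

Returning None for a header that does not match lets the caller decide whether to skip the record or stop with an error.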


[ Last edited by brightfuture01 on 2012-12-5 at 14:43 ]

libralibra


[Answer] Reply to the help request

xzhdty: +2 coins, expert review, thanks 2012-12-05 22:00:51
brightfuture01: +500 coins, ★★★★★ best answer, Thanks a million O(∩_∩)O 2012-12-06 13:43:08
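Scripting languages are probably the most convenient for string parsing. Below is a Python script. Save the gene sequences in gene.txt in the same folder as the .py file; running it produces data.txt with three tab-separated columns: gene ID, number of matched strings, and number of bases.
The advantage is that you can copy the contents and paste them straight into Excel.
Writing the Excel file from Python is also possible, but it needs an extra library and is considerably more work than copy and paste, so it is not worth it.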
CODE:
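#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from operator import itemgetter

# The two strings to search for
PATTERNS = ('ATCG', 'GCTAT')

# Read the whole FASTA file (gene.txt sits next to this script)
with open('gene.txt', 'r') as f:
    s = f.read()

# Build one (gene ID, match count, base count) tuple per record
rows = []
for record in s.split('>')[1:]:
    # The first line is the header; the remaining lines are the sequence
    header, _, body = record.partition('\n')
    # Join the sequence lines so that matches spanning a line break are counted
    seq = ''.join(body.split()).upper()
    # Gene ID: the text between '_refGene_' and 'range'
    start = header.index('_refGene_') + len('_refGene_')
    refGene = header[start:header.index('range')].strip()
    # Total occurrences of both strings (non-overlapping, case-insensitive)
    strNum = sum(seq.count(p) for p in PATTERNS)
    # Number of bases: only A, T, C and G are counted
    baseNum = sum(seq.count(b) for b in 'ATCG')
    rows.append((refGene, strNum, baseNum))

# Sort by match count, largest first (the sort is stable, so ties keep file order)
rows.sort(key=itemgetter(1), reverse=True)

# Write the tab-separated result to a file
with open('data.txt', 'w') as f:
    for refGene, strNum, baseNum in rows:
        f.write('%s\t%d\t%d\n' % (refGene, strNum, baseNum))
print('Done')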

Result as posted (matches that span a line break are not included in these counts):
CODE:
NR_030165         2        2000
NM_052933         2        2000
NM_005989         0        2000
NM_001190906         0        2000
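
If a separate spreadsheet file is preferred over copy and paste, one option that needs no extra library is CSV, which Excel opens directly. The following is only a sketch under that assumption: the function name write_csv, the column names, the file name data.csv and the sample rows are all placeholders made up for illustration, not part of the script above.

```python
import csv

def write_csv(rows, path):
    """Write (gene_id, match_count, base_count) rows to a CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["gene_id", "match_count", "base_count"])
        writer.writerows(rows)

# Illustrative rows only, not real results
rows = [("GENE_A", 5, 2000), ("GENE_B", 4, 1800)]
write_csv(rows, "data.csv")
```

The rows are written in the order given, so they should be sorted before the call.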

Floor 2, 2012-12-05 21:05:29

cmdblock


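Quoted reply:
Floor 2: Originally posted by libralibra at 2012-12-05 21:05:29
Scripting languages are probably the most convenient for string parsing. Below is a Python script. Save the gene sequences in gene.txt in the same folder as the .py file; running it produces data.txt with three tab-separated columns: gene ID, number of matched strings, and number of bases.
The advantage is that you can directly ...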

Writing programs in Python is this simple; it looks like I should switch from the C family to Python too.
Floor 3, 2012-12-06 08:03:43

chembetsey


[Answer] Reply to the help request

brightfuture01: +50 coins, ★★★★★ best answer, Many thanks. 2012-12-06 13:44:07
xzhdty: +2 coins, thanks 2012-12-07 08:12:13
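awk 'BEGIN {RS=">"}
NF {     N=0
        ID=$1                                            # first field of the header line
        gsub(/hg19_refGene.+repeatMasking=none/, "")     # drop the header text
        gsub("\n", "")                                   # join the sequence lines
        gsub("\r", "")
        $0=toupper($0)
        Numb=length($0)                                  # number of bases
        N+=gsub("ATCG", "&", $0)+gsub("GCTAT", "&", $0)  # count both strings
        print ID, N, Numb} ' Gen.txt | sort -n -r -k 2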
Result
hg19_refGene_NR_030165 3 2000
hg19_refGene_NM_052933 2 2000
hg19_refGene_NM_005989 0 2000
hg19_refGene_NM_001190906 0 2000
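
The count of 3 for NR_030165 here, against 2 in the reply on floor 2, is consistent with one match straddling a line break, since this awk script deletes the newlines before counting. A minimal Python sketch of the effect, using two consecutive sequence lines taken from the NR_030165 record in the question; the helper name count_hits is made up for illustration.

```python
PATTERNS = ("ATCG", "GCTAT")

def count_hits(text):
    """Total non-overlapping, case-insensitive occurrences of PATTERNS."""
    text = text.upper()
    return sum(text.count(p) for p in PATTERNS)

# Two consecutive sequence lines from the NR_030165 record in the question
lines = [
    "atccatccatccatccatccatccatccatccatccatctatccacctag",
    "ctatccttctatgtttgcagtcagttgctagttattaagataatcagaat",
]

per_line = sum(count_hits(line) for line in lines)  # each line searched separately
joined = count_hits("".join(lines))                 # line break removed first
print(per_line, joined)  # 0 1
```

The single hit in the joined text is a "GCTAT" formed by the final g of the first line and the leading ctat of the second.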
Floor 4, 2012-12-06 09:25:09