
Views: 1743  |  Replies: 3

brightfuture01

Woodworm King (Literary Elite)


[Help] Programming request, looking for a solution

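Can this task be done with a script? If not, how could it be implemented in C++? My programming knowledge is limited, so I would appreciate your help. If you can provide a runnable program, such as a script or a C++ program, I would be very grateful; coins are not an issue.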


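Problem description:

(1) Data: about 10,000 nucleic acid sequences in FASTA format (">" + annotation text + line break + nucleotide sequence). The text between two ">" characters is one gene. The format is as follows: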

>hg19_refGene_NM_052933 range=chr7:130353386-130373406 5'pad=0 3'pad=0 strand=- repeatMasking=none
tcttccaacgtggcccagggaagccaaaagattggacaccctgcttcaga
gcccatattaggttttttttttttttttttttggtcataaactgcagaaa
tgaaccctaagaaaataaaggatttatttgaaagaatattaggaaagctc
acagaactgaagagaaaattgtgtagtcttgctttaggaaggactagatc
tggggcagctccggggacctcagcaggtagagctcagagctcagagtctt
ccagcctgggataccacatgaggtggctcaggtgcaatcatcttacagct
ctgtgggtcattagcaaactttcaaattcttagtataacctggttggcca
cctttgaatcaatcatccagagctaggatgcaggagtgaatggcactaat
tggtgatggctatagacccttataatttctcaacccctcctcagcactgt
catagacttctcaaggacctctttgcatctttctaatctatagttcttct
gtctttccaatctatagttcacatgccactatttttgtatttctaaaata
tatattgttcaataaggagattatacctcatttaaaatctcttgttagcg
cttcattactcagaataacccacaaacttcctatcaaatagttcatttta
ggatttggatcctgtcttcctctctagtctcacctctcactacccttccc
ctaaatcttatgttcctggcatattgcaattacagtattttcaattcccc
taaataagccaggctttgttgtattctcatgtttttgcctctgcttctcc
ctctttctcaaatgtctgtcctctcatcacctggttaattcacactcatc
ttccccagctcagtttaggcagtagttccaaatctctctcagtctcttct
cagtatgggttaagtccccttcctcattgcttccatagtaccctaaattt
agatctaatttggcattctactctgttgcaatcaaccgtttctttgtctc
tcttcctggctaaactgtgagctcctaaaggagctcaagaattacatctg
atttatgctttgttctcatcacctaggagagtgtctaatatgtcataaat
gttcagcaaatatattgaaataagcaaacactagttttgaatgcataaga
tacaaagcactgggaggggggcacaaatatgtttaaaatttacatttcca
cttttaagggactcacagcttaagaggcagacaagacccgcacataagaa
aagtgaaaacagaactacaagacactatacaatgacatgcacaaaaccca
gtgactatagtctagagtttgggattgcattaatctagaaaatcttctct
ttaaaggaaTGACCTgagtagtaacataaagaaagaagattcagggctgg
gcatggtggctcatgcctgtaatcccagcactttgggaggctgaggtggg
tggatcacaAGGTCAagagatcgagaccatcctggccaacatggtgaaac
cccgtctctactaaaaatacaaaaattagctgggcatggtggcaggcgcc
tgtagtcccagctactctggaggctgaggcaggaaaatcacttgaaccag
ggagacagaggttgcagtgaactgagactgtgccactgcactccagcctt
gtgacagtgtgaggctccgtcagaagaaagagaaggagaaggagaaggag
aagattcaaatggctgaagagattggggacagcattttaagtatggggaa
taagaaaaaatgctcactgtatttgaagaatgcagtgaaaacattggctt
aggtagaggtagattaagagatggtgaagatgtgtgtgtgtgtgtgtgtg
tgtgtgtgtgtgtgtgtgtgtgtgtagagaagtcataggctggcaaaata
agagattggcagattttgtttaaatatgcctttaagttttggccttttca
ttctctgtctccaacacagaacacctcagcattctagattgcctattcat
>hg19_refGene_NR_030165 range=chr7:136585914-136588141 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gagtgccctgggaaaggaaaaattccatttccaagatccaggcatgtgag
aattacgtgaaattaaatatgtggacttgcggtttttggttcttggaaaa
atagaaaggtatagtgggttgcatgcagttctagctgcattccagctaga
aacagattgagacagttcagattcacaggcacttgcagggcaagttttca
atgtcactgtggctttaccacgtaggcaaatttgaaaatagacacactaa
atattgactacaggaaaagaaaaatgtgtatttattcatacaaaacacat
ttatttagcatctataatgtgctgtgcccTGACCTatagtgacccaaaca
gattagcagatctcacccgatcttgtagtgcagtgatttgaatagcaatg
atctttggaccctaagcaacccagtttggccattccctgatgtcattttt
ttcccccatttccattgttattttttgagaacatacagacgctttttgat
ccatttgttgtaagaacctagcctcttgatcagttgcatttacaaaataa
caacattgataatttcataaatgtgtaagatttacagcttacaaagaact
tgcacacacatctcatgcataactcacaaaacctcttaattctttcccaa
taatacaggggaatcatggactcaggggcTGACCTccaaagacgctgttg
gtggtaaattgttagagccccgagatgtgacttaaatttaggttttctga
cagaggtgtgctgctcgtctctatgctaatccattacacagccagacagg
aagaactgtcagtagattctgatcaatttctctttctataaaaaaaaatg
ataagcttagttaaattgtattagataagtgaagggttgtttataacagt
ccagcctccccttagcttcttctatggctttcattaggctccatcaaagc
ctactctcaaaacaaaatataaaaataattgttaactactaacattgatt
ctgtgatcttccttttaaactcatctatctatctatcgatctatctatcc
atccatccatccatccatccatccatccatccatccatctatccacctag
ctatccttctatgtttgcagtcagttgctagttattaagataatcagaat
tgctttcagaattaataattggtataaatttcagaagagtttgaatttag
gtggcaaattcataataagtgagaaggtaagctatagcatcctctgataa
tgtgtgcagtttactttttatctgtctctttctaattgaaaactaacaaa
tatagccccaattaaatgcacgtaaaaattagaagctggtgggatagggt
attaaacacaatcctagatgactcttatgaactcataccataagcagcca
ctttctttctcgagcaaactatagtgagaatgaagcatcttagattgaga
agggattaggacgaccctgaatggaatgggcaaatcataaacagctaggc
ccttagaatttggttgcagtcccaaacccAGGTCAgtttttaaacatgac
tatcagctagatatcctttctccaccatacaataatagataacaacctta
ataagacgtgtagacattaaactttgaaattccacagtaagatgtaaata
tttgctcaatcaagtacaatttaatatgtttgttatacagcaactgcaga
gcacagaattttgtactctttggatgtttatgataaggtacacattattt
gcaagtttttgcttgtttccttgttcagtttttcattatcaaacaaacaa
agcttctcagcctgggattaacctggagtctggaaagtatacattatggc
cagcaactttaaacaataggccagagatgggaaatgaatgaatgaaattc
tgacacagaagacaaacaacagaaactcatttgggctagtgtaggtgtag
gttttttattcttcaaccaacggtggtgaagaggatctcccttcacttta
>hg19_refGene_NM_001190906 range=chr7:137759178-137803150 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gcctgagggtcagtttcctgaggaaaaaactcagttaagacataagtttc
aagttttaagacagagggcacatttctatgtttattcaaaaatccataaa
tatgggacaatttggccagtttcaacttcaggacattttaaccattgtgg
actcagTGACCTgtgaagtgtacaggccagggaaacttcctctttgcctt
ctgaagtttcactgaaaaatcaactcacaaaaggcagattaattggagaa
aaggtattcaattttatttaactatatgtttacacagggagaaccacaga
gtgattacccacccacaacagggtttagaagcttatttactggtaaatca
ggttatgggagaggggaaaagaggaattctgttgaggagatcactaggga
gaatgaatggatcagggagtagagattaacttgtacattaccttgtgaaa
gggtttgttcaggaaaggttacattcttagtcttacagggagaggaagaa
aaacgaattgttccttttggtgggtctggatcttaggcagataaaggaac
ttcactttgggagaggtggtagggagtagAGGTCAgagggaccttgaggc
ttttccagttcagtatgtcaaagtgccatattttggggtatcagtttctg
actcccaacacattctttgtcttctttttttgtaatggatcatgagataa
caggtaggggaaaaagaacaattgtcctccttggtgggtccatcctatct
ttatgtagacaggggaaagtctcttccagagcccgttgatctctaagggt
ctttacttcaaaatcttcattataccagggagccatatgttggggtggaa
tttcctgcctccttcaacccccagctccaaatttggaaatggcttctaat
ttttttaaggaaatgtcttttatttcctttggccactgtgattgatttca
cgatggatccatgaaccaagaaaggcaatcaaggctaacaagtttcaatt
tcaggactactgtttgagctactgagaaagtaaaatctctttttcgttgt
ttgagctggaaaccttgtcactccaaggacagtcagtaaacacttttcaa
tgttaagggaactgagagagagagtttcctgagagaacaggaaggcagag
agacgaatgagaagtgagaaattggtcctggccatgttattgttcctcta
gtccagtttaactggtgataagcttaacttgtgataaagatcctagatcc
ctttccagtcctgctgcatccaaatctcccaggaagtcctagaaaatgtc
tagtctcccctgaagctagccctactgccagaatttgaggaatatgaacc
aataaatttccattatagtttaagagaatttaagatagatgtgtgtatta
gttcattctcacactgctaataaagacaaccaagactgggtaattaacaa
agaaaaagagatttaatgggctcacagttccacatggctggggaggcctc
acaatcatggcagaaggtgaaagaggagcaaaggcatgtcttacatggca
ggaggcaagagagcatgtgcaggggaactgtccttcataaaaccatcaga
tcttgtgagacttattcactatcacaagagcagcatgggaaaaaaacacc
cccatgattcaattacctcccactgagtccctcccatgacattggggatt
gtgggagctacaattcaagatgagatttggttgaggacgcagccaaacta
tgtcagtctcttatttgccaacaaaagcatcctaactgatagaggccaga
cagatttgtttctttttgttttttcaatcttttgttgtgaagaagtaagc
ataaactctcaataggttacgttttacaagcctctgatgaagttcaaagg
acaaccatgcttaggatttccaggacaacctggaaaaaaaaacaggttga
gaaataggtgtgttaatctcccttccctctgctcctccctctggccttcc
.
.
.
.
.
.
>hg19_refGene_NM_005989 range=chr7:137759178-137803150 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gcctgagggtcagtttcctgaggaaaaaactcagttaagacataagtttc
aagttttaagacagagggcacatttctatgtttattcaaaaatccataaa
tatgggacaatttggccagtttcaacttcaggacattttaaccattgtgg
actcagTGACCTgtgaagtgtacaggccagggaaacttcctctttgcctt
ctgaagtttcactgaaaaatcaactcacaaaaggcagattaattggagaa
aaggtattcaattttatttaactatatgtttacacagggagaaccacaga
gtgattacccacccacaacagggtttagaagcttatttactggtaaatca
ggttatgggagaggggaaaagaggaattctgttgaggagatcactaggga
gaatgaatggatcagggagtagagattaacttgtacattaccttgtgaaa
gggtttgttcaggaaaggttacattcttagtcttacagggagaggaagaa
aaacgaattgttccttttggtgggtctggatcttaggcagataaaggaac
ttcactttgggagaggtggtagggagtagAGGTCAgagggaccttgaggc
ttttccagttcagtatgtcaaagtgccatattttggggtatcagtttctg
actcccaacacattctttgtcttctttttttgtaatggatcatgagataa
caggtaggggaaaaagaacaattgtcctccttggtgggtccatcctatct
ttatgtagacaggggaaagtctcttccagagcccgttgatctctaagggt
ctttacttcaaaatcttcattataccagggagccatatgttggggtggaa
tttcctgcctccttcaacccccagctccaaatttggaaatggcttctaat
ttttttaaggaaatgtcttttatttcctttggccactgtgattgatttca
cgatggatccatgaaccaagaaaggcaatcaaggctaacaagtttcaatt
tcaggactactgtttgagctactgagaaagtaaaatctctttttcgttgt
ttgagctggaaaccttgtcactccaaggacagtcagtaaacacttttcaa
tgttaagggaactgagagagagagtttcctgagagaacaggaaggcagag
agacgaatgagaagtgagaaattggtcctggccatgttattgttcctcta
gtccagtttaactggtgataagcttaacttgtgataaagatcctagatcc
ctttccagtcctgctgcatccaaatctcccaggaagtcctagaaaatgtc
tagtctcccctgaagctagccctactgccagaatttgaggaatatgaacc
aataaatttccattatagtttaagagaatttaagatagatgtgtgtatta
gttcattctcacactgctaataaagacaaccaagactgggtaattaacaa
agaaaaagagatttaatgggctcacagttccacatggctggggaggcctc
acaatcatggcagaaggtgaaagaggagcaaaggcatgtcttacatggca
ggaggcaagagagcatgtgcaggggaactgtccttcataaaaccatcaga
tcttgtgagacttattcactatcacaagagcagcatgggaaaaaaacacc
cccatgattcaattacctcccactgagtccctcccatgacattggggatt
gtgggagctacaattcaagatgagatttggttgaggacgcagccaaacta
tgtcagtctcttatttgccaacaaaagcatcctaactgatagaggccaga
cagatttgtttctttttgttttttcaatcttttgttgtgaagaagtaagc
ataaactctcaataggttacgttttacaagcctctgatgaagttcaaagg
acaaccatgcttaggatttccaggacaacctggaaaaaaaaacaggttga
gaaataggtgtgttaatctcccttccctctgctcctccctctggccttcc

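(2) Task: search all 10,000 genes for two strings, for example "ATCG" and "GCTAT", and sort the genes in descending order of the total number of occurrences of the two strings. For example, if gene A contains 5 matches, gene B contains 4, genes C and D contain 3 each, and genes E, F, G and H contain none, the order is A, B, C, D, E, F, G, H.

(3) After sorting, extract each gene's ID in sorted order into a separate Excel file. The ID is the part after "refGene_" and before "range" in the header; for >hg19_refGene_NM_005989 range=chr7:137759178-137803150 5'pad=0 3'pad=0 strand=+ repeatMasking=none it is "NM_005989". Column 1 is the gene ID, column 2 is the number of matched strings, and column 3 is the number of bases in the gene.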
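
As a rough illustration of steps (2) and (3), the sketch below pulls the ID out of a header with a regular expression and ranks genes by their total count. The helper names `extract_gene_id` and `rank_by_count` and the toy counts are assumptions made for this illustration; they are not part of the original post.

```python
import re

# Hypothetical helper pattern: capture the text between "_refGene_" and " range=".
HEADER_RE = re.compile(r"_refGene_(\S+)\s+range=")

def extract_gene_id(header):
    """Return the gene ID from a header line, or None if the pattern is absent."""
    match = HEADER_RE.search(header)
    return match.group(1) if match else None

def rank_by_count(counts):
    """Sort (gene, count) pairs by count, largest first; ties keep input order."""
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

header = (">hg19_refGene_NM_005989 range=chr7:137759178-137803150 "
          "5'pad=0 3'pad=0 strand=+ repeatMasking=none")
print(extract_gene_id(header))

# Toy counts taken from the A..H example in the task description.
toy = [("E", 0), ("C", 3), ("A", 5), ("F", 0), ("B", 4), ("D", 3), ("G", 0), ("H", 0)]
print("".join(gene for gene, _ in rank_by_count(toy)))
```

Because Python's sort is stable, genes with equal counts stay in the order they had in the input, which is one reasonable way to handle ties such as C and D.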


[ Last edited by brightfuture01 on 2012-12-5 at 14:43 ]

libralibra

Supreme Woodworm (Noted Writer)

Cavalry General

[Answer] Helper reply

xzhdty: coins +2, expert review, thanks Cavalry General 2012-12-05 22:00:51
brightfuture01: coins +500, ★★★★★ best answer, Thanks a million O(∩_∩)O 2012-12-06 13:43:08
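For string parsing, a scripting language is probably the most comfortable choice. Below is a Python script. Save the gene sequences in gene.txt in the same folder as the .py file; after it runs it produces data.txt with three tab-separated columns: gene ID, number of matched strings, and number of bases.
The advantage is that you can copy the contents and paste them straight into Excel.
Writing the Excel file from Python is also possible, but it needs an extra library and is considerably more work than copy and paste, so it is not worth it.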
CODE:
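#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Motifs to search for (upper case; sequences are upper-cased before counting)
MOTIFS = ('ATCG', 'GCTAT')

# Read the whole FASTA file
with open('gene.txt', 'r') as f:
    s = f.read()

rows = []
# Each record starts with '>'; any text before the first '>' is discarded
for record in s.split('>')[1:]:
    # The first line is the header, the remaining lines are the sequence
    header, _, body = record.partition('\n')
    # Join the sequence lines so that a motif spanning a line break is counted too
    seq = ''.join(body.split()).upper()
    # Gene ID: the text between '_refGene_' and 'range', without the trailing space
    start = header.index('_refGene_') + len('_refGene_')
    refGene = header[start:header.index('range')].strip()
    # Sum of the (non-overlapping) occurrences of both motifs
    strNum = sum(seq.count(m) for m in MOTIFS)
    # Number of bases (A/T/C/G only)
    baseNum = sum(1 for x in seq if x in 'ATCG')
    rows.append((refGene, strNum, baseNum))

# Sort by motif count, largest first (the sort is stable, so ties keep file order)
rows.sort(key=lambda r: r[1], reverse=True)

# Write the tab-separated result file
with open('data.txt', 'w') as f:
    for refGene, strNum, baseNum in rows:
        f.write('%s\t%d\t%d\n' % (refGene, strNum, baseNum))
print('Done')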

Result (as posted)
CODE:
NR_030165         2        2000
NM_052933         2        2000
NM_005989         0        2000
NM_001190906         0        2000
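
If a file that opens directly in Excel is preferred over copy and paste, one option that needs no extra library is the standard `csv` module. The sketch below is only an illustration: the file name `data.csv`, the column titles, and the function name `write_rows` are assumptions introduced here, and the rows are simply the four result lines posted above.

```python
import csv

def write_rows(path, rows):
    """Write (gene_id, motif_count, base_count) rows as a CSV file with a header row."""
    # newline="" lets the csv module control line endings, as its documentation recommends.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene_id", "motif_count", "base_count"])
        writer.writerows(rows)

# Rows copied from the posted result, for illustration only.
rows = [
    ("NR_030165", 2, 2000),
    ("NM_052933", 2, 2000),
    ("NM_005989", 0, 2000),
    ("NM_001190906", 0, 2000),
]
write_rows("data.csv", rows)
```

A CSV file keeps the three columns separate when opened in a spreadsheet program, so the manual paste step is not needed.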

Post #2, 2012-12-05 21:05:29

cmdblock

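Silver Worm (Regular Writer)

Quoted reply:
Post #2: Originally posted by libralibra at 2012-12-05 21:05:29
For string parsing, a scripting language is probably the most comfortable choice. Below is a Python script. Save the gene sequences in gene.txt in the same folder as the .py file; after it runs it produces data.txt with three tab-separated columns: gene ID, number of matched strings, and number of bases.
The advantage is that you can directly ...

Writing programs in Python is this simple; it looks like I should switch from the C family to Python as well.
Post #3, 2012-12-06 08:03:43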

chembetsey

Woodworm (Somewhat Known)

[Answer] Helper reply

brightfuture01: coins +50, ★★★★★ best answer, Many thanks. 2012-12-06 13:44:07
xzhdty: coins +2, thanks 2012-12-07 08:12:13
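awk 'BEGIN {RS=">"}
# Records are split on ">"; record 1 is the text before the first ">" and is skipped
NR > 1 {
        N=0
        ID=$1                                          # first word of the header line
        gsub(/hg19_refGene.+repeatMasking=none/, "")   # drop the header
        gsub("\n", "")                                 # join the sequence lines
        gsub("\r", "")
        $0=toupper($0)
        Numb=length($0)                                # number of bases
        # gsub returns the number of replacements; "&" puts the match back unchanged
        N+=gsub("ATCG", "&", $0)+gsub("GCTAT", "&", $0)
        print ID, N, Numb} ' Gen.txt | sort -n -r -k 2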
Result
hg19_refGene_NR_030165 3 2000
hg19_refGene_NM_052933 2 2000
hg19_refGene_NM_005989 0 2000
hg19_refGene_NM_001190906 0 2000
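
The two posted results disagree on NR_030165 (2 versus 3). One likely explanation, checked here against two consecutive lines of that record from the question, is that a motif can straddle a line break: counting on text that still contains the line breaks misses it, while counting after the line breaks are removed (as the awk script does) finds it. The function name `count_motifs` is an assumption made for this sketch.

```python
MOTIFS = ("ATCG", "GCTAT")

def count_motifs(text):
    """Count non-overlapping occurrences of every motif in text, ignoring case."""
    upper = text.upper()
    return sum(upper.count(motif) for motif in MOTIFS)

# Two consecutive sequence lines of NR_030165, copied from the question.
line_a = "atccatccatccatccatccatccatccatccatccatctatccacctag"
line_b = "ctatccttctatgtttgcagtcagttgctagttattaagataatcagaat"

wrapped = line_a + "\n" + line_b   # as stored in the FASTA file
joined = line_a + line_b           # line break removed

print(count_motifs(wrapped))  # the GCTAT that spans the break is not seen
print(count_motifs(joined))   # the GCTAT that spans the break is counted
```

For these two lines the wrapped text gives 0 and the joined text gives 1, because "...cctag" followed by "ctatcc..." only forms GCTAT once the line break is gone.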
Post #4, 2012-12-06 09:25:09