
zhq025

[Resource] [Repost] Perl Programming for Biologists (searched; no duplicate found)

Introduction

Molecular biology is a study in accelerated expectations.
In 1973, the first paper reporting a nucleotide sequence derived directly
from DNA was published. During the late 1970s, a graduate student could
earn a Ph.D. and publish multiple papers in Science, Cell, or any number
of respected journals by performing the astonishing task of sequencing a
gene – any gene. By 1982, DNA sequencing had become straightforward enough
that any well-equipped laboratory could clone and sequence a gene, provided
it had a copy of Molecular Cloning: A Laboratory Manual. By 1990, simply
sequencing a gene was considered sufficient for only a master’s degree, and
most journals considered the sequence of a gene to be only the starting point
for a scientific paper. The last sequencing-only paper published reported the full
genomic sequence of an organism. By 1995, the majority of journals had
stopped publishing sequence data completely. In 1999, mid-way through the
Human Genome Sequencing Project, approximately 1.5 megabases of human
genomic sequence were being deposited in GenBank monthly, and by the end
of 2001 there were almost 15 billion bases of sequence information in the
databases, representing over 13 million sequences.
Bioinformatics, by necessity, is following the same growth curve.
Once a rarefied realm, computers in biology have become commonplace.
Almost every biology lab has some type of computer, and the uses of the
computer range from manuscript preparation to Internet access, from data
collection to data crunching. And for each of these activities, some form of
bioinformatics is involved.
The field of bioinformatics can be split into two broad areas: computational
biology and analytical bioinformatics. Computational biology encompasses the
formal algorithms and testable hypotheses of biology, encoded into various
programs. Computational biologists often have more in common with people
in the campus computer science department than with those in the biology
department, and usually spend their time thinking about the mathematics
of biology. Computational biology is the source of bioinformatic tools
such as BLAST and FASTA, which are commonly used to analyze the results of
experiments.
If computational biology is about building the tools, analytical bioinformatics
is about using those tools. From sequence retrieval from GenBank to performing
an analysis of variance regression using local statistical software, nearly every
biological researcher does some form of analytical bioinformatics. And just as
DNA sequencing has turned into a Red Queen pursuit, every biology researcher
has to perform more and more analytical bioinformatics to keep up.
Fortunately, keeping up is not as hard as it used to be. The explosion of the
Internet and the use of the World Wide Web (WWW) as a means of accessing
data and tools means that most researchers can keep up simply by updating the
bookmarks file of their favorite browser. In itself, this is no mean feat – Internet
research skills can be tricky to acquire and trickier still to apply properly.
Still, there is a way to go further: one can begin to manipulate the
data returned from conventional programs.
Data manipulation can usually be done in spreadsheets and databases. Indeed,
these two types of programs are indispensable in any laboratory, especially
those quite sophisticated in analytical bioinformatics. But to take the final step
to truly exploit data analysis tools, a researcher needs to understand and be
able to use a scripting language.
A scripting language is similar in most ways to a programming language.
The user writes computer code according to the syntactic conventions of the
language, and then executes the result. However, a scripting language is typically
much easier to learn and use than a traditional programming language,
because many of the common functions people use have already been created
and stored. Additionally, most scripting languages are interpreted (turned into
binary computer instructions on the fly) rather than compiled (turned into
binary computer instructions once), so that script development is generally
quicker and the scripts themselves are more portable.
Of course, there is always a price to pay for things being easier, and in the case
of scripting languages, the major price is speed. Scripting languages typically
take longer to execute than compiled code. But, except for the most extreme
cases, the trade-off for ease of use over speed is quite acceptable, and might
not even be noticeable on the faster computers available today.
The Perl programming language is probably the most widely used scripting
language in bioinformatics. A large percentage of programs are written in Perl,
and many bioinformatists cut their programming teeth using Perl. In fact, the
most common advice heard by aspiring bioinformatists is "go learn Perl."
In part, Perl is a popular language because it is less structured than traditional
programming languages. With fewer rules and multiple ways to perform a task,
Perl is a language that allows for fast and easy coding. For the same reasons,
it is an easier language to learn as a first programming language. But the very
ease of using Perl is a bit of a trap: it is quite easy to make simple mistakes that
are difficult to catch.
But there are strong reasons to learn and use Perl. The language was originally
created for parsing files and quickly creating formatted reports. Larry
Wall, the author of Perl, claims the name stands for "Practical Extraction and
Report Language" (but he acknowledges that the name could just as easily
stand for "Pathologically Eclectic Rubbish Lister"), and the language is perfect
for rummaging through files looking for a particular pattern of characters, or
for reformatting data tables. Perl has a very powerful regular expression
capability for pattern matching, as well as built-in file manipulation and
input/output (I/O) piping mechanisms. These abilities have proven invaluable
for bioinformatics, where we are often looking for motifs within sequences
(pattern matching) or converting one database format into another.
The biggest use of Perl is the quick and dirty creation of small analysis
programs. Nearly every bioinformatist has written a program to convert a nucleotide
sequence into its reverse complement. Similarly, a great many people
use small Perl scripts to read disparate data files and parse the relevant data
into a new format. This usage is so prevalent that the term "glutility" was
coined by Sam Cartinhour for scripts that take the output of one program (like
BLAST, for example) and change it into a form suitable for import into another
program (like ClustalW). Finally, with the advent of the WWW, Perl has become
the language of choice to create Common Gateway Interface (CGI) scripts to
handle form submissions and create compute servers on the WWW.
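As an illustration of the kind of small utility described above, here is a minimal sketch of a reverse-complement routine. It is written in Python purely for illustration (the book itself teaches Perl), the function name reverse_complement is an assumption and not something taken from the book, and it handles only the four unambiguous bases A, C, G and T.

```python
# Translation table mapping each unambiguous DNA base to its complement.
# Both upper and lower case are covered so the input's case is preserved.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    # Complement every base, then reverse the resulting string.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))    # GCAT
print(reverse_complement("GAATTC"))  # GAATTC (the EcoRI site is palindromic)
```

Characters outside the table (for example N or other ambiguity codes) pass through unchanged in this sketch; a real script would need to decide how to treat them.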
The purpose of this book is to teach you Perl programming. What sets this
book apart from most Perl language books is 1) the assumption that you’ve
never had any formal training in programming, and 2) examples geared
toward real problems biologists face, so you don’t have to either learn an
entirely new concept to understand the example or wrestle with an example
that is generic and difficult to extrapolate into the real world of the laboratory.
At the conclusion of the book, you should be able to write a script to fix the
clone library prefix that your summer student mistyped on every line of the
spreadsheet, or to scan a FASTA sequence file for every occurrence of an EcoRI
site. Moreover, you’ll be able to write reusable and maintainable scripts so you
don’t have to rewrite the same piece of code over and over. Additionally, you’ll
be able to look at other people’s scripts and adapt them to your own purposes.
After all, to quote Larry Wall, the creator of Perl, "For programmers, laziness is
a virtue."
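To make the EcoRI example mentioned above concrete, the following is a rough sketch of scanning FASTA-formatted text for every occurrence of the EcoRI recognition sequence GAATTC. It is written in Python only as an illustration (the book's own scripts are in Perl). The helper names read_fasta and find_sites, and the sample records clone_A and clone_B, are made up for this example. The scan looks at the given strand only and reports 1-based start positions.

```python
import re
from typing import Iterable, Iterator, List, Tuple

ECORI_SITE = "GAATTC"  # EcoRI recognition sequence


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def find_sites(seq: str, site: str = ECORI_SITE) -> List[int]:
    """Return the 1-based start position of each occurrence of site in seq."""
    return [m.start() + 1 for m in re.finditer(re.escape(site), seq, re.IGNORECASE)]


# Made-up sample data standing in for the contents of a FASTA file.
example = """>clone_A
ATGGAATTCCG
GAATTCAA
>clone_B
acgtacgt
""".splitlines()

for name, seq in read_fasta(example):
    print(name, find_sites(seq))
# clone_A [4, 12]
# clone_B []
```

Joining the sequence lines before searching matters: a site that straddles a line break in the file would be missed by a line-by-line scan.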

Download link:http://www.isload.com.cn/store/u4qnppx23wxoh

[ Last edited by 2007骑猪逛街 on 2008-1-16 at 16:30 ]

agri521

★★★★★ Five stars, highly recommended

I think this should have been posted in the Agriculture and Forestry board. Thanks.
Reply #2, 2008-01-13 09:58:58

诸葛曹操

★★★★★ Five stars, highly recommended

It is bioinformatics, after all; posting it here seems appropriate.
Reply #3, 2008-01-16 09:11:12