| [Reward] This post has been rated 2 times; the author zhq025 gained 2 gold coins | ||
[Resource]
[Repost] Perl Programming for Biologists (searched, no duplicate found)
Introduction

Molecular biology is a study in accelerated expectations. In 1973, the first paper reporting a nucleotide sequence derived directly from DNA was published. During the late 1970s, a graduate student could earn a Ph.D. and publish multiple papers in Science, Cell, or any number of respected journals by performing the astonishing task of sequencing a gene: any gene. By 1982, DNA sequencing had become straightforward enough that any well-equipped laboratory could clone and sequence a gene, provided it had a copy of Molecular Cloning: A Laboratory Manual. By 1990, simply sequencing a gene was considered sufficient for only a master's degree, and most journals considered the sequence of a gene to be only the starting point for a scientific paper. The last sequencing-only paper published was the full genomic sequence of an organism. By 1995, the majority of journals had stopped publishing sequence data completely. In 1999, midway through the Human Genome Sequencing Project, approximately 1.5 megabases of human genomic sequence were being deposited in GenBank monthly, and by the end of 2001 there were almost 15 billion bases of sequence information in the databases, representing over 13 million sequences.

Bioinformatics, by necessity, is following the same growth curve. Once a rarefied realm, computers in biology have become commonplace. Almost every biology lab has some type of computer, and the uses of the computer range from manuscript preparation to Internet access, from data collection to data crunching. And for each of these activities, some form of bioinformatics is involved.

The field of bioinformatics can be split into two broad fields: computational biology and analytical bioinformatics. Computational biology encompasses the formal algorithms and testable hypotheses of biology, encoded into various programs.
Computational biologists often have more in common with people in the campus computer science department than with those in the biology department, and usually spend their time thinking about the mathematics of biology. Computational biology is the source of bioinformatic tools like BLAST or FASTA, which are commonly used to analyze the results of experiments.

If computational biology is about building the tools, analytical bioinformatics is about using those tools. From sequence retrieval from GenBank to performing an analysis of variance regression using local statistical software, nearly every biological researcher does some form of analytical bioinformatics. And just as DNA sequencing has turned into a Red Queen pursuit, every biology researcher has to perform more and more analytical bioinformatics to keep up.

Fortunately, keeping up is not as hard as it used to be. The explosion of the Internet and the use of the World Wide Web (WWW) as a means of accessing data and tools means that most researchers can keep up simply by updating the bookmarks file of their favorite browser. In itself, this is no mean feat: Internet research skills can be tricky to acquire and even trickier to apply properly.

Still, there is a way to go further: one can begin to manipulate the data returned from conventional programs. Data manipulation can usually be done in spreadsheets and databases. Indeed, these two types of programs are indispensable in any laboratory, especially those that are quite sophisticated in analytical bioinformatics. But to take the final step and truly exploit data analysis tools, a researcher needs to understand and be able to use a scripting language.

A scripting language is similar in most ways to a programming language. The user writes computer code according to the syntactic conventions of the language, and then executes the result.
However, a scripting language is typically much easier to learn and use than a traditional programming language, because many of the common functions people use have already been created and stored. Additionally, most scripting languages are interpreted (turned into binary computer instructions on the fly) rather than compiled (turned into binary computer instructions once), so that script development is generally quicker and the scripts themselves are more portable. Of course, there is always a price to pay for things being easier, and in the case of scripting languages, the major price is speed. Scripting languages typically take longer to execute than compiled code. But, except for the most extreme cases, the trade-off of speed for ease of use is quite acceptable, and might not even be noticeable on the faster computers available today.

The Perl programming language is probably the most widely used scripting language in bioinformatics. A large percentage of programs are written in Perl, and many bioinformatists cut their programming teeth using Perl. In fact, the most common advice heard by aspiring bioinformatists is "go learn Perl."

In part, Perl is a popular language because it is less structured than traditional programming languages. With fewer rules and multiple ways to perform a task, Perl is a language that allows for fast and easy coding. For the same reasons, it is an easier language to learn as a first programming language. But the very ease of using Perl is a bit of a trap: it is quite easy to make simple mistakes that are difficult to catch.

Still, there are strong reasons to learn and use Perl. The language was originally created for parsing files and quickly creating formatted reports.
Larry Wall, the author of Perl, claims the name stands for "Practical Extraction and Report Language" (but he acknowledges that the name could just as easily stand for "Pathologically Eclectic Rubbish Lister"), and the language is perfect for rummaging through files looking for a particular pattern of characters, or for reformatting data tables. The language has a very powerful regular expression capability for pattern matching, as well as built-in file manipulation and input/output (I/O) piping mechanisms. These abilities have proven invaluable for bioinformatics, where we are often looking for motifs within sequences (pattern matching) or rearranging one database format into another.

The biggest use of Perl is the quick and dirty creation of small analysis programs. Nearly every bioinformatist has written a program to convert a nucleotide sequence into its reverse complement. Similarly, a great many people use small Perl scripts to read disparate data files and parse the relevant data into a new format. This usage is so prevalent that the term "glutility" was coined by Sam Cartinhour for scripts that take the output of one program (like BLAST, for example) and change it into a form suitable for import into another program (like ClustalW). Finally, with the advent of the WWW, Perl has become the language of choice for creating Common Gateway Interface (CGI) scripts to handle form submissions and create compute servers on the WWW.

The purpose of this book is to teach you Perl programming. What sets this book apart from most Perl language books is that (1) it assumes you have never had any formal training in programming, and (2) its examples are geared toward real problems biologists face, so you don't have to either learn an entirely new concept to understand the example or wrestle with an example that is generic and difficult to extrapolate into the real world of the laboratory.
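As a rough illustration of the kind of small script described above (not taken from the book itself), the following is a minimal sketch of a reverse-complement routine and a regex-based motif search. The subroutine names `reverse_complement` and `motif_positions`, and the sample sequence, are made up for this example; the sketch also assumes plain A/C/G/T input without ambiguity codes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the reverse complement of a DNA sequence.
# Assumes only A, C, G, T (either case); other characters pass through unchanged.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # swap each base for its complement
    return $rc;
}

# Hypothetical helper: return the 1-based start position of every
# (possibly overlapping) occurrence of a motif, case-insensitively.
sub motif_positions {
    my ($seq, $motif) = @_;
    my @positions;
    # The zero-width lookahead lets overlapping matches be reported.
    while ($seq =~ /(?=\Q$motif\E)/gi) {
        push @positions, pos($seq) + 1;
    }
    return @positions;
}

my $dna = 'ATGGAATTCCGT';           # made-up sample sequence
print reverse_complement($dna), "\n";                      # prints ACGGAATTCCAT
print join(',', motif_positions($dna, 'GAATTC')), "\n";    # prints 4
```

The motif used here, GAATTC, is the EcoRI recognition site; because it is palindromic, it also appears in the reverse complement of the sample sequence.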
At the conclusion of the book, you should be able to write a script to fix the clone library prefix that your summer student mistyped on every line of the spreadsheet, or to scan a FASTA sequence file for every occurrence of an EcoRI site. Moreover, you'll be able to write reusable and maintainable scripts so you don't have to rewrite the same piece of code over and over. Additionally, you'll be able to look at other people's scripts and adapt them to your own purposes. After all, to quote Larry Wall, the creator of Perl, "For programmers, laziness is a virtue."

Download link: http://www.isload.com.cn/store/u4qnppx23wxoh

[ Last edited by 2007骑猪逛街 on 2008-1-16 at 16:30 ]