Cloud Technologies for Bioinformatics Applications


Abstract:
        Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare them with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different databases (PhyloD). The final stages have a different structure in each application.
Keywords: Cloud, Bioinformatics, Multicore, Dryad, Hadoop, MPI

Existing System:
        There have been several papers discussing data analysis using a variety of cloud and more traditional cluster/Grid technologies, with the Chicago paper influential in posing the broad importance of this type of problem. The Notre Dame all-pairs system clearly identified the "doubly data parallel" structure seen in all of our applications. In the Alu case, we discuss the linking of an initial doubly data parallel step to more traditional "singly data parallel" MPI applications. BLAST is a well-known doubly data parallel problem. The Swarm project successfully uses traditional distributed clustering scheduling to address the EST and CAP3 problem.

Note that approaches like Condor have significant startup times that dominate performance. For basic operations, we find that Hadoop and Dryad achieve similar performance on bioinformatics, particle physics, and the well-known kernels. Wilde has emphasized the value of scripting to control these (task parallel) problems, and here DryadLINQ offers some capabilities that we exploited. We note that most previous work has used Linux-based systems and technologies. Our work shows that Windows HPC Server based systems can also be very effective.

Disadvantages:
        Experience has shown that the initial (and often most time-consuming) parts of data analysis are naturally data parallel, and the processing can be made independent, with perhaps some collective (reduction) operation.
Proposed System:
        The applications each start with a "doubly data parallel" (all-pairs) phase that can be implemented in MapReduce, MPI, or using cloud resources on demand. The flexibility of clouds and MapReduce suggests they will become the preferred approaches. We showed how one can support an application (Alu) requiring a detailed output structure to allow follow-on iterative MPI computations. The applications differed in the heterogeneity of the initial data sets, but in each case good performance is observed, with the new cloud technologies competitive with MPI performance. The simple structure of the data/compute flow and the minimal inter-task communication requirements of these "pleasingly parallel" applications enabled them to be implemented using a wide variety of technologies. The support for handling large data sets, the concept of moving computation to data, and the better quality of services provided by the cloud technologies simplify the implementation of some problems over traditional systems.

        We find that the different programming constructs available in cloud technologies, such as independent "maps" in MapReduce, "homomorphic Apply" in Dryad, and "worker roles" in Azure, are all suitable for implementing applications of the type we examine. In the Alu case, we show that Dryad and Hadoop can be programmed to prepare data for use in later parallel MPI/threaded applications used for further analysis. Our Dryad and Azure work was all performed on Windows machines and achieved very large speed-ups.
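        As a rough illustration of this "doubly data parallel" (all-pairs) structure, the sketch below decomposes the pairwise comparison matrix into independent block tasks; each block could then be handed to a MapReduce map, a Dryad vertex, or an Azure worker role. This is an illustrative sketch only, not the code used in the paper; the class and method names (AllPairsBlocks, decompose) and the block size are assumptions.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: split an all-pairs ("doubly data parallel") computation
    // over n sequences into independent block tasks that need no communication.
    public class AllPairsBlocks {

        // One block of the n x n comparison matrix, identified by its index ranges.
        static class BlockTask {
            final int rowStart, rowEnd, colStart, colEnd;
            BlockTask(int rs, int re, int cs, int ce) {
                rowStart = rs; rowEnd = re; colStart = cs; colEnd = ce;
            }
        }

        // Emit only blocks on or above the diagonal, since pairwise distances are
        // symmetric; each block can run as a map task, a vertex, or a worker role.
        static List<BlockTask> decompose(int n, int blockSize) {
            List<BlockTask> tasks = new ArrayList<BlockTask>();
            for (int i = 0; i < n; i += blockSize) {
                for (int j = i; j < n; j += blockSize) {
                    tasks.add(new BlockTask(i, Math.min(i + blockSize, n),
                                            j, Math.min(j + blockSize, n)));
                }
            }
            return tasks;
        }

        public static void main(String[] args) {
            // Example: 50,000 sequences in 1,000 x 1,000 blocks -> 1,275 independent tasks.
            System.out.println(decompose(50000, 1000).size() + " independent block tasks");
        }
    }

        Emitting only the upper-triangular blocks exploits the symmetry of the pairwise distances and roughly halves the number of independent tasks.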
Advantages:
·         This is generating justified interest in new runtimes and programming models that, unlike traditional parallel models (such as MPI), directly address the data-specific issues.
·         This structure has motivated the important MapReduce paradigm and many follow-on extensions.

SYSTEM REQUIREMENTS:
Hardware Requirements
·         System                  :        Pentium IV, 2.4 GHz
·         Hard Disk               :        40 GB
·         Floppy Drive            :        1.44 MB
·         Monitor                 :        15-inch VGA Color
·         Mouse                   :        Logitech
·         RAM                     :        512 MB
Software Requirements
·         Operating System        :        Windows XP, Linux
·         Language                :        Java 1.4 or above
·         Technology              :        Swing, AWT
·         Back End                :        Oracle 10g
·         IDE                     :        MyEclipse 8.6


 Module Description
Modules:
·         Alu Sequence Classification
·         CAP3 Application: EST and Its Software CAP3

1 Alu Sequence Classification:

        The Alu clustering problem is one of the most challenging problems for sequence clustering because Alus represent the largest repeat families in the human genome. There are about 1 million copies of Alu sequences in the human genome, of which most insertions can be found in other primates and only a small fraction (about 7,000) is human-specific. This indicates that the classification of Alu repeats can be deduced solely from the 1 million human Alu elements. Alu clustering can be viewed as a classical case study for the capacity of computational infrastructures because it is not only of great intrinsic biological interest, but also a problem of a scale that will remain the upper limit of many other clustering problems in bioinformatics for the next few years, such as the automated protein family classification for millions of proteins. In our previous works, we have examined Alu samples of 35,339 and 50,000 sequences using the pipeline of Fig. 1.

2 CAP3 Application: EST and Its Software CAP3:
        An Expressed Sequence Tag (EST) corresponds to messenger RNAs (mRNAs) transcribed from the genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct full-length mRNA sequences for each expressed gene. Because ESTs correspond to the gene regions of a genome, EST sequencing has become a standard practice for gene discovery, especially for the genomes of many organisms that may be too complex for whole-genome sequencing. EST assembly is addressed by the software CAP3, a DNA sequence assembly program developed by Huang and Madan. CAP3 performs several major assembly steps, including computation of overlaps, construction of contigs, construction of multiple sequence alignments, and generation of consensus sequences for a given set of gene sequences. The program reads a collection of gene sequences from an input file (FASTA file format) and writes its output to several output files, as well as to the standard output.
                     
        CAP3 is often required to process large numbers of FASTA-formatted input files, which can be processed independently, making it an embarrassingly parallel application requiring no inter-process communication. We have implemented a parallel version of CAP3 using Hadoop and DryadLINQ. In both implementations, we adopt the following algorithm to parallelize CAP3 (a sketch of the Hadoop map step is given after the list).
         1. Distribute the input files to the storage used by the runtime. In Hadoop, the files are distributed to HDFS, and in DryadLINQ, the files are distributed to the individual shared directories of the computation nodes.
         2. Instruct each parallel process (in Hadoop, the map tasks; in DryadLINQ, the vertices) to execute the CAP3 program on each input file.
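
        A minimal sketch of step 2 in Hadoop is shown below, assuming each input record carries the path of one FASTA file and that the cap3 executable is installed on every compute node; the class name, record format, and paths are illustrative assumptions rather than the implementation used in the paper.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative sketch: each map task runs the CAP3 executable on one FASTA
    // file, so all assemblies proceed independently with no communication.
    public class Cap3Mapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record format: one FASTA file path per input line.
            String fastaFile = value.toString().trim();

            // Launch CAP3 as an external process on the node-local file,
            // sending its console output to the task log.
            ProcessBuilder pb = new ProcessBuilder("cap3", fastaFile);
            pb.inheritIO();
            int exitCode = pb.start().waitFor();

            // Record which file was processed and whether CAP3 succeeded.
            context.write(new Text(fastaFile), new Text("exit=" + exitCode));
        }
    }

        The DryadLINQ version follows the same pattern, with each vertex executing CAP3 on the files placed in its node's shared directory.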


System Architecture:
Algorithm:
·         Smith-Waterman-Gotoh algorithm (a scoring sketch is given below)
·         MPI algorithm
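
        For reference, the sketch below illustrates Smith-Waterman-Gotoh scoring (local alignment with affine gap penalties) as used for the pairwise Alu comparisons; the scoring constants and class name are example values chosen for the sketch, not the parameters used in the paper, and traceback is omitted.

    // Illustrative sketch of Smith-Waterman-Gotoh scoring: local alignment with
    // affine gap penalties. Match/mismatch and gap costs are example values.
    public class SmithWatermanGotoh {

        static final int MATCH = 5, MISMATCH = -4;
        static final int GAP_OPEN = 10, GAP_EXTEND = 1;   // penalties (subtracted)

        static int score(char a, char b) {
            return a == b ? MATCH : MISMATCH;
        }

        // Returns the best local alignment score of sequences a and b.
        static int align(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] h = new int[n + 1][m + 1];   // best score ending at (i, j)
            int[][] e = new int[n + 1][m + 1];   // best score ending with a gap in a
            int[][] f = new int[n + 1][m + 1];   // best score ending with a gap in b
            int best = 0;
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    e[i][j] = Math.max(h[i][j - 1] - GAP_OPEN, e[i][j - 1] - GAP_EXTEND);
                    f[i][j] = Math.max(h[i - 1][j] - GAP_OPEN, f[i - 1][j] - GAP_EXTEND);
                    int diag = h[i - 1][j - 1] + score(a.charAt(i - 1), b.charAt(j - 1));
                    h[i][j] = Math.max(0, Math.max(diag, Math.max(e[i][j], f[i][j])));
                    best = Math.max(best, h[i][j]);
                }
            }
            return best;
        }

        public static void main(String[] args) {
            System.out.println("score = " + align("GGTTGACTA", "TGTTACGG"));
        }
    }

        In the Alu case described above, such pairwise scores provide the distance data that the follow-on MPI clustering computations consume.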
