Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud


Exploiting Dynamic Resource Allocation for
Efficient Parallel Data Processing in the Cloud
Abstract
                In recent years ad-hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. In this paper we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today’s IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines which are automatically instantiated and terminated during the job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.
Index Terms: Many-Task Computing, High-Throughput Computing, Loosely Coupled Applications, Cloud Computing
Existing System:
          A growing number of companies have to process huge amounts of data in a cost-efficient manner. Classic representatives for these companies are operators of Internet search engines. The vast amount of data they have to deal with every day has made traditional database solutions prohibitively expensive. Instead, these companies have popularized an architectural paradigm based on a large number of commodity servers.
Problems like processing crawled documents or regenerating a web index are split into several independent subtasks, distributed among the available nodes, and computed in parallel.
Disadvantage
The challenge for both frameworks consists of two abstract tasks: given a set of random integer numbers, the first task is to determine the k smallest of those numbers. The second task is then to calculate the average of these k smallest numbers. The job is a classic representative of a variety of data analysis jobs whose particular tasks vary in their complexity and hardware demands.
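The two abstract tasks can be sketched in a few lines of plain Java (an illustrative, single-machine version only; class and method names are our own, not part of either framework). A bounded max-heap keeps just k candidates in memory while scanning the input, after which the average is trivial:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class KSmallestAverage {

    // Returns the average of the k smallest values in data,
    // keeping only k elements in memory via a max-heap.
    static double averageOfKSmallest(int[] data, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(k, Comparator.reverseOrder());
        for (int value : data) {
            if (heap.size() < k) {
                heap.add(value);
            } else if (value < heap.peek()) {
                heap.poll();      // evict the current largest of the k candidates
                heap.add(value);
            }
        }
        long sum = 0;
        for (int value : heap) sum += value;
        return (double) sum / heap.size();
    }

    public static void main(String[] args) {
        int[] numbers = {42, 7, 19, 3, 88, 5, 61};
        // k = 3: the smallest numbers are 3, 5 and 7, so the average is 5.0
        System.out.println(averageOfKSmallest(numbers, 3));
    }
}
```

Note how the two steps differ in character, exactly as the section argues: selecting the k smallest touches the whole data set, while averaging the k survivors is cheap and inherently sequential.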
Proposed System:
             In recent years a variety of systems to facilitate MTC have been developed. Although these systems typically share common goals (e.g., to hide issues of parallelism or fault tolerance), they aim at different fields of application. MapReduce is designed to run data analysis jobs on a large amount of data, which is expected to be stored across a large set of shared-nothing commodity servers. Once a user has fit his program into the required map and reduce pattern, the execution framework takes care of splitting the job into subtasks, distributing them, and executing them. A single MapReduce job always consists of a distinct map and reduce program.
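The map and reduce pattern itself can be illustrated with a toy, single-JVM word count (this is deliberately not the Hadoop API, just the shape of the computation): the map phase turns each input line into key-value pairs, and the reduce phase aggregates all values sharing a key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("cloud data", "data processing", "cloud");

        // Map phase: split each line into words, i.e. emit (word, 1) pairs.
        // Reduce phase: sum the counts for each distinct word.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));

        System.out.println(counts);
    }
}
```

In a real MapReduce framework the map and reduce phases run as many parallel subtasks on different servers; the framework, not the user, handles the splitting, shuffling, and fault tolerance.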
Advantage
·        The first task has to sort the entire data set and therefore can take advantage of large amounts of main memory and parallel execution.
·        The second aggregation task requires almost no main memory and, at least eventually, cannot be parallelized.



Hardware Requirements
·         System          :  Pentium IV, 2.4 GHz
·         Hard Disk       :  40 GB
·         Floppy Drive    :  1.44 MB
·         Monitor         :  15" VGA Color
·         Mouse           :  Logitech
·         RAM             :  512 MB
Software Requirements   
·         Operating System  :  Windows XP, Linux
·         Language          :  Java 1.4 or higher
·         Technology        :  Swing, AWT
·         Back End          :  Oracle 10g
·         IDE               :  MyEclipse 8.6

Module Description
Modules:
  • Network Module
  • LBS Service
  • System Model
  • Scheduled Task
  • Query Processing



Network Module:

         A network channel lets two subtasks exchange data via a TCP connection. Network channels allow pipelined processing: the records emitted by the producing subtask are immediately transported to the consuming subtask. As a result, two subtasks connected by a network channel may be executed on different instances. However, since they must run at the same time, they are required to run in the same Execution Stage.
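A minimal sketch of such a pipelined network channel, using plain Java sockets (this is illustrative only, not Nephele's actual implementation; all names are our own): the producer thread streams records over TCP and the consumer processes each one as soon as it arrives, rather than waiting for the whole batch.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

public class NetworkChannelDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {  // consumer side, ephemeral port
            int port = server.getLocalPort();

            // Producing subtask: emits records over the TCP connection one by one.
            Thread producer = new Thread(() -> {
                try (Socket s = new Socket("localhost", port);
                     DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                    for (int record = 1; record <= 3; record++) {
                        out.writeInt(record);  // each record is sent immediately
                        out.flush();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            producer.start();

            // Consuming subtask: processes records as they arrive (pipelining).
            try (Socket s = server.accept();
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                for (int i = 0; i < 3; i++) {
                    System.out.println("received record " + in.readInt());
                }
            }
            producer.join();
        }
    }
}
```

Because both ends of the connection must be alive simultaneously, the scheduling constraint from the text follows directly: both subtasks have to run in the same Execution Stage, even if on different instances.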

 LBS Service:
            Many people are familiar with wireless Internet, but many don't realize the value and potential to make information services highly personalized. One of the best ways to personalize information services is to enable them to be location based. An example would be someone using their Wireless Application Protocol (WAP) based phone to search for a restaurant. The LBS application would interact with other location technology components to determine the user's location and provide a list of restaurants within a certain proximity to the mobile user.
            In this age of significant telecommunications competition, mobile network operators continuously seek new and innovative ways to create differentiation and increase profits. One of the best ways to accomplish this is through the delivery of highly personalized services. One of the most powerful ways to personalize mobile services is based on location.
Scheduled Task:
 A file channel allows two subtasks to exchange records via the local file system. The records of the producing task are first written entirely to an intermediate file and afterwards read by the consuming subtask. Nephele requires the two subtasks to be assigned to the same instance. Moreover, the consuming Group Vertex must be scheduled to run in a higher Execution Stage than the producing Group Vertex. In general, Nephele only allows subtasks to exchange records across different stages via file channels, because they are the only channel type which stores the intermediate records in a persistent manner.
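The sequential, two-stage nature of a file channel can be sketched as follows (again an illustrative stand-in, not Nephele's API): the producer finishes writing the complete intermediate file before the consumer starts reading, which is exactly why the two sides can, and must, run in different Execution Stages.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FileChannelDemo {
    public static void main(String[] args) throws IOException {
        Path intermediate = Files.createTempFile("channel", ".tmp");

        // "Stage 1" - producing subtask: all records are written to the
        // intermediate file before any consumer runs.
        Files.write(intermediate, List.of("record-1", "record-2", "record-3"));

        // "Stage 2" - consuming subtask: runs strictly after the producer
        // has finished; the file persists the records in between.
        for (String record : Files.readAllLines(intermediate)) {
            System.out.println("consumed " + record);
        }
        Files.delete(intermediate);
    }
}
```

Since the intermediate file lives on the local file system, both subtasks must of course be placed on the same instance, matching the constraint stated above.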
System Model:
Query Processing
          Similar to a network channel, an in-memory channel also enables pipelined query processing. However, instead of using a TCP connection, the respective subtasks exchange data through the instance’s main memory. An in-memory channel typically represents the fastest way to transport records in Nephele; however, it also implies the most scheduling restrictions: the two connected subtasks must be scheduled to run on the same instance and in the same Execution Stage.
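An in-memory channel can be modeled with a shared bounded queue between two threads (a sketch under our own naming, not Nephele's implementation): producer and consumer run concurrently on the same machine and hand records over through main memory, with no TCP or disk I/O involved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class InMemoryChannelDemo {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue in shared main memory acts as the channel.
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(16);

        // Producing subtask: puts records into the shared queue.
        Thread producer = new Thread(() -> {
            for (int record = 1; record <= 3; record++) {
                try {
                    channel.put(record);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        // Consuming subtask: takes records as soon as they appear (pipelining).
        for (int i = 0; i < 3; i++) {
            System.out.println("consumed record " + channel.take());
        }
        producer.join();
    }
}
```

Both threads must be alive at the same time and share one address space, which mirrors the two restrictions above: same instance, same Execution Stage.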
