With the rapid growth of the Internet, especially after Web 2.0, companies from Google to Facebook face challenges that are unique to their scale. The terabytes and petabytes of data they receive every day would overwhelm even the best minds. Even so, companies are now more than willing to pay hefty sums to keep those mountains of data safe, because they know a goldmine lies underneath: information that, if used smartly, can turn a company's fortunes around.
Such processing requires huge computational resources, so companies like Google and Facebook run large private data centres to carry out the task. Not everyone can afford to build giant data centres for data-intensive computation, but SMBs can make up the difference by using cloud technology. Cloud technology is a fusion of several computing technologies, such as grid, distributed and parallel computing, and it combines their advantages, which gives it an edge over traditional computing. Some of the core features of cloud technology are:
Elasticity: The scalable nature of the cloud lets users seamlessly upgrade their virtual instances to meet their computation requirements. This elasticity means that even BORPS of structured or unstructured data can be processed and analyzed efficiently, and best of all, users pay only for the cloud resources actually used to process that data. Amazon offers one such service under the brand name EC2 for users looking for a scalable solution, since data-intensive work uses computing resources that grow linearly with the amount of data fed in.
Storage Space: Cloud technology readily provides optimized storage systems for large blobs of data, as well as other distributed data store architectures.
API and Framework: The cloud offers frameworks and programming APIs for processing massive amounts of data in an optimized way, because in the cloud the APIs are coupled with a specific storage infrastructure for efficient performance.
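The elasticity point above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that processing time scales linearly with data volume; the throughput and price constants are hypothetical placeholders, not real EC2 figures.

```python
import math

# Hypothetical figures for illustration only; not real EC2 capacities or prices.
GB_PER_INSTANCE_HOUR = 50.0      # assumed data one instance processes per hour
PRICE_PER_INSTANCE_HOUR = 0.10   # assumed pay-per-use rate, in dollars


def instances_needed(data_gb: float, deadline_hours: float) -> int:
    """Instances required to finish within the deadline, assuming linear scaling."""
    instance_hours = data_gb / GB_PER_INSTANCE_HOUR
    return max(1, math.ceil(instance_hours / deadline_hours))


def job_cost(data_gb: float) -> float:
    """Cost depends only on the instance-hours consumed, not on the instance count."""
    return data_gb / GB_PER_INSTANCE_HOUR * PRICE_PER_INSTANCE_HOUR


if __name__ == "__main__":
    for volume in (1000, 2000, 4000):
        print(volume, "GB ->", instances_needed(volume, deadline_hours=2),
              "instances, cost", round(job_cost(volume), 2))
```

Under these assumptions, doubling the data doubles both the number of instances needed to meet a fixed deadline and the bill, which is the pay-for-what-you-use behaviour described above.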
To process such enormous volumes of data, several architectures for data-intensive computing have been developed that can be used in the cloud:
- MapReduce: A popular programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Dataflow Processing: This architecture represents a computation as a two-dimensional graph in which directed arcs express the data dependencies between operations.
- Hybrid DBMS: This architecture is designed to combine the benefits of traditional shared-nothing parallel DBMSs with those of the MapReduce architecture. It provides a high level of fault tolerance and strong data-processing performance at the same time.
- Stream Processing: This architecture uses the 'single program, multiple data' technique to process each member of the input data independently using multiple computational resources. Sphere is one example that applies the stream processing concept.
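To illustrate the first architecture in the list, here is a minimal single-process sketch of the MapReduce model applied to word counting. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative choices, not part of any particular framework; on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data cloud"]))
```

Because each call to `map_phase` depends only on its own split and each call to `reduce_phase` depends only on its own key, both phases can be distributed across machines without coordination beyond the shuffle step.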
Because data-driven applications are designed to process datasets on the scale of multiple terabytes or even petabytes, feeding them such enormous volumes becomes a major headache: the data may not reside in a single location but may be distributed across different geographical locations, which causes significant latency and can delay processing. Some of the major challenges that need to be overcome are:
- More scalable algorithms to search and process massive datasets.
- Better metadata management solutions to handle complex, heterogeneous and distributed data sources.
- Advanced computing platforms designed for accessing in-memory, multi-terabyte data structures.
- Better signature-generation techniques to reduce data and increase processing speed.
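The last challenge is not spelled out in detail here, so the following is only one possible reading: compute a compact signature for each record and use it to skip data that has already been seen. This minimal sketch assumes a SHA-256 digest as the signature; the helper names `signature` and `deduplicate` are hypothetical.

```python
import hashlib


def signature(record: bytes) -> str:
    """Return a fixed-size fingerprint of a record (assumed here: SHA-256)."""
    return hashlib.sha256(record).hexdigest()


def deduplicate(records):
    """Keep the first occurrence of each record, comparing signatures only."""
    seen = set()
    unique = []
    for record in records:
        sig = signature(record)
        if sig not in seen:
            seen.add(sig)
            unique.append(record)
    return unique


if __name__ == "__main__":
    print(deduplicate([b"row-1", b"row-2", b"row-1"]))
```

Comparing fixed-size signatures instead of full records keeps the lookup set small, which is one way such a technique can reduce the data to be processed and speed it up.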
Some of these limitations can be overcome through smart use of grid computing, which can provide huge computational power as well as storage by pooling heterogeneous resources spread across different administrative domains. Using a data grid, users can perform data-intensive computing and manipulate large datasets stored in different repositories.
Cloud technology integrates all the necessary building blocks. Anyone who needs substantial computing resources for data-intensive work, to extract the rich information hidden in their data, should certainly consider the cloud, which provides an efficient mechanism for meeting those computation needs.