### Cluster Architecture
Modern data-mining applications, often called “big-data” analysis, demand fast processing over huge volumes of data. In most of these applications the data is extremely regular, which offers an opportunity to exploit parallelism. Let’s look at an example.

Google has to process about 6 billion web pages every day (4 million queries per minute!). The average size of a web page is around 20 KB, so 6 billion pages amount to roughly 120 TB. All of that data has to be read from disk into memory before the CPU can do anything with it. At an average disk read bandwidth of 50 MB/sec, scanning it on a single machine would take 2.4 million seconds, i.e. 27+ days. Yet Google answers queries within 1-2 seconds. How?
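A quick back-of-the-envelope check of those figures (a Python sketch using the rough averages quoted above):

```python
pages = 6e9              # pages processed per day
page_kb = 20             # average page size, ~20 KB
disk_mb_per_s = 50       # single-disk read bandwidth, ~50 MB/s

total_mb = pages * page_kb / 1000        # 1.2e8 MB, i.e. ~120 TB
seconds = total_mb / disk_mb_per_s       # 2.4e6 seconds
print(seconds / 86400)                   # ~27.8 days on a single machine
```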

“When the load on your ox increases, and you cannot increase the strength of that ox beyond a limit, then you must increase the number of oxen employed for the task” - Grace Hopper’s analogy for parallel processing.

This is exactly what Google did, and it forms the underlying principle of Cluster Architecture.

Most computation is done on a single processor, using its main memory, cache, and local disk (a compute node). The compute nodes are commodity hardware, roughly 10x cheaper than special-purpose parallel machines. The new parallel-computing architecture, sometimes called cluster computing, is organized as follows. Independent compute nodes are mounted on racks, perhaps 8-64 per rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. A data center contains many racks, and the racks are interconnected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than that of the intra-rack Ethernet, but it is shared by the many pairs of nodes that might need to communicate across racks, so the inter-rack links can easily become the bottleneck. The figure below suggests the architecture of a large-scale computing system, though there may be many more racks and many more compute nodes per rack.

Fig 1. Physical Organization of Compute Nodes
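To make the intra-rack vs. inter-rack distinction concrete, here is a toy model (a sketch only; the bandwidth figures, rack count, and node count are illustrative assumptions, not measurements of any real cluster):

```python
# Toy model of the topology in Fig 1 (all numbers are illustrative).
INTRA_RACK_GBPS = 1.0      # gigabit Ethernet within a rack
INTER_RACK_GBPS = 10.0     # assumed switch capacity linking the racks

RACKS, NODES_PER_RACK = 16, 64

def transfer_seconds(gigabytes, gbps, sharers=1):
    """Time to move `gigabytes` over a link shared fairly by `sharers` flows."""
    return gigabytes * 8 / (gbps / sharers)

# Two nodes on the same rack moving 10 GB get the full link:
print(transfer_seconds(10, INTRA_RACK_GBPS))           # 80 s

# If every node pairs up with a node on another rack, all flows
# share the one inter-rack switch:
pairs = RACKS * NODES_PER_RACK // 2                    # 512 flows
print(transfer_seconds(10, INTER_RACK_GBPS, pairs))    # 4096 s, over an hour
```

Even though the switch is nominally faster than a single rack link, dividing it among hundreds of node pairs makes cross-rack transfers far slower, which is why the difference matters.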

Cluster computing comes with its own set of challenges, among them:

  • Node Failure: a highly likely scenario at scale, and a costly one. To prevent loss of data, the same data is stored on multiple nodes, so that a redundant copy is always available for computation.
  • Network Speed: nodes continuously transfer huge amounts of data among themselves, so limited network bandwidth can become a bottleneck, needlessly adding to the overall query time.
  • Distributed Programming Paradigm: writing correct, highly optimized distributed algorithms is hard. It requires not only expert programming skill but also extensive knowledge of the underlying system.

These challenges are effectively addressed by the Distributed File System (DFS). A DFS assumes that data is written once and then read as and when required. Its three main components are:

  • Chunk Servers: these act as both storage and computation servers. Each chunk (a block of data) is replicated 2-3 times on different servers to provide persistence in the face of failure. These nodes also carry out the computation requested by the user: instead of sending the data (large) to the user, we bring the user’s query (small) to the data.
  • The Master Node: a node that stores the metadata about each chunk server and assigns tasks to them. The master node is responsible for the overall health of the DFS.
  • Client Libraries: a set of APIs that contact the master node to find the desired chunk server, and then connect directly to the chunk servers for read/write operations (a rough sketch of this read path follows the list).
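To illustrate that read path, here is a minimal sketch. The classes `MasterNode` and `ChunkServer` and their method names are invented for illustration; they do not correspond to any particular DFS API:

```python
# Hypothetical sketch of a DFS client read; not a real GFS/HDFS API.

class MasterNode:
    """Holds metadata only: which chunk servers hold which chunk."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations   # (file, idx) -> [server, ...]

    def locate(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

class ChunkServer:
    """Stores actual chunk bytes and serves reads directly to clients."""
    def __init__(self, chunks):
        self.chunks = chunks                     # (file, idx) -> bytes

    def read(self, filename, chunk_index):
        return self.chunks[(filename, chunk_index)]

def client_read(master, filename, chunk_index):
    # 1. Ask the master where the chunk's replicas live (metadata only).
    replicas = master.locate(filename, chunk_index)
    # 2. Read the bytes directly from any replica; the master is bypassed,
    #    so it never becomes a bandwidth bottleneck.
    return replicas[0].read(filename, chunk_index)

# Tiny demo: one chunk replicated on two servers.
s1 = ChunkServer({("web.log", 0): b"GET /index.html"})
s2 = ChunkServer({("web.log", 0): b"GET /index.html"})
master = MasterNode({("web.log", 0): [s1, s2]})
print(client_read(master, "web.log", 0))
```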


### Distributed File System Implementations

There are several distributed file systems of the type we have described that are used in practice. Among these:

  • The Google File System (GFS), the original of the class.
  • Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce and distributed by the Apache Software Foundation.
  • CloudStore, an open-source DFS originally developed by Kosmix.


### MapReduce

MapReduce is a style of computing for processing data whose size may be enormous, while keeping data access simple. It allows the most common calculations on large-scale data to be performed on computing clusters efficiently and in a way that is tolerant of hardware failures during the computation.

All MapReduce requires is the implementation of two functions: Map and Reduce. The Map function returns a collection of key-value pairs. The pair could be anything, like (< word >, < occurrence >) or (< pair of numbers >, < LCM/HCF >). The Reduce function, on the other hand, performs a reduction over these key-value pairs, combining the values for each key and thereby shrinking the data being accessed. Concretely, a MapReduce computation executes as follows (a word-count sketch follows the list):

  • Some number of Map tasks each are given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
  • The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
  • The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
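For instance, here is the classic word-count computation expressed in this style. This is a minimal single-process sketch in Python; a real MapReduce system would run many Map and Reduce tasks in parallel on different nodes, with the master controller doing the grouping:

```python
from collections import defaultdict

def map_fn(chunk):
    """Map: turn a chunk of text into (word, 1) key-value pairs."""
    for word in chunk.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: combine all values for one key, here by summing counts."""
    return key, sum(values)

def mapreduce(chunks):
    # Map phase: every chunk becomes a stream of key-value pairs.
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            # Group-by-key: the master routes equal keys to one Reduce task.
            groups[key].append(value)
    # Reduce phase: one call per key, combining all its values.
    return [reduce_fn(k, vs) for k, vs in groups.items()]

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
print(mapreduce(chunks))   # [('the', 3), ('quick', 1), ('brown', 1), ...]
```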

The figure below shows how a MapReduce computation works.

Fig 2. Schema of a MapReduce Computation: input chunks feed the Map tasks, which emit key-value pairs; the pairs are grouped by key, and each key with all its values goes to a Reduce task, which produces the desired output.