日期:2014-05-16 浏览次数:20468 次
(From:http://www.theserverside.com/news/1321219/Why-Should-You-Care-About-MapReduce)
?
?
?
By?Eugene Ciurana
01 Feb 2008 | TheServerSide.com
MapReduce is a distributed programming model intended for processing massive amounts of data in large clusters,?developed by Jeffrey Dean and Sanjay Ghemawat at Google. MapReduce is implemented as two functions, Map which applies a function to all the members of a collection and returns a list of results based on that processing, and Reduce, which collates and resolves the results from two or more Maps executed in parallel by multiple threads, processors, or stand-alone systems. Both Map() and Reduce() may run in parallel, though not necessarily in the same system at the same time.
Google uses MapReduce for indexing every Web page they crawl. It replaced the original indexing algorithms and heuristics in 2004, given its proven efficiency in processing very large, unstructured datasets. Several implementations of MapReduce are available in a variety of programming languages, including Java, C++, Python, Perl, Ruby, and C, among others. In some cases, like in Lisp or Python, Map() and Reduce() have been integrated into the fabric of the language itself. In general, these functions could be defined as:
List2 map(Functor1, List1); Object reduce(Functor2, List2);
The map() function operates on large datasets split into two or more smaller buckets. A bucket then contains a collection of loosely delimited logical records or lines of text. Each thread, processor, or system executes the map() function on a separate bucket to calculate a set of intermediate values based on the processing of each logical records. The combined results from all of these should be identical as having been executed as a single bucket by a single map() function. The general form of the map() function is:
map(function, list) { foreach ele