Appendix B Original Foreign-Language Text
Resilient Distributed Datasets
1 Introduction
In this chapter, we present the Resilient Distributed Dataset (RDD) abstraction,
on which the rest of the dissertation builds a general-purpose cluster
computing stack. RDDs extend the data flow programming model introduced
by MapReduce and Dryad, which is the most widely used model for
large-scale data analysis today. Data flow systems were successful because they
let users write computations using high-level operators, without worrying about
work distribution and fault tolerance. As cluster workloads diversified, however,
data flow systems were found inefficient for many important applications, including
iterative algorithms, interactive queries, and stream processing. This led to the
development of a wide range of specialized frameworks for these applications.
Our work starts from the observation that many of the applications that data
flow models were not suited for have a characteristic in common: they all require
efficient data sharing across computations. For example, iterative algorithms, such as
PageRank, K-means clustering, or logistic regression, need to make multiple passes
over the same dataset; interactive data mining often requires running multiple
ad-hoc queries on the same subset of the data; and streaming applications need to
maintain and share state across time. Unfortunately, although data flow frameworks
offer numerous computational operators, they lack efficient primitives for data
sharing. In these frameworks, the only way to share data between computations
(e.g., between two MapReduce jobs) is to write it to an external stable storage
system, e.g., a distributed file system. This incurs substantial overheads due to data
replication, disk I/O, and serialization, which can dominate application execution.
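
The cost of sharing data only through stable storage can be illustrated with a small single-process sketch. It is not taken from the original text: `ToyStableStorage`, `step`, `run_via_storage`, and `run_in_memory` are hypothetical names, and a dictionary of pickled bytes stands in for a distributed file system. The sketch only counts how often the dataset is deserialized when an iterative computation makes several passes over it.

```python
import pickle


class ToyStableStorage:
    """Hypothetical stand-in for a distributed file system (assumption).

    Data is kept as serialized bytes, so every read pays a deserialization cost.
    """

    def __init__(self):
        self._blobs = {}
        self.reads = 0
        self.writes = 0

    def write(self, name, data):
        self._blobs[name] = pickle.dumps(data)
        self.writes += 1

    def read(self, name):
        self.reads += 1
        return pickle.loads(self._blobs[name])


def step(w, points, lr=0.5):
    """One gradient step that moves w toward the mean of the points."""
    grad = sum(w - x for x in points) / len(points)
    return w - lr * grad


def run_via_storage(storage, name, passes):
    """Each pass reloads the dataset, as chained jobs without shared memory would."""
    w = 0.0
    for _ in range(passes):
        points = storage.read(name)
        w = step(w, points)
    return w


def run_in_memory(storage, name, passes):
    """The dataset is loaded once and reused across all passes."""
    points = storage.read(name)
    w = 0.0
    for _ in range(passes):
        w = step(w, points)
    return w


storage = ToyStableStorage()
storage.write("points", [1.0, 2.0, 3.0, 6.0])

w_storage = run_via_storage(storage, "points", passes=5)
reads_storage = storage.reads

w_memory = run_in_memory(storage, "points", passes=5)
reads_memory = storage.reads - reads_storage
```

Both variants compute the same result, but the storage-based one reads and deserializes the dataset once per pass. On a real cluster each of those reads would also involve disk and network I/O.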
Indeed, examining the specialized frameworks built for these new applications,
we see that many of them optimize data sharing. For example, Pregel is a
system for iterative graph computations that keeps intermediate state in memory,
while HaLoop is an iterative MapReduce system that can keep data partitioned
in an efficient way across steps. Unfortunately, these frameworks only support
specific computation patterns (e.g., looping a series of MapReduce steps), and
perform data sharing implicitly for these patterns. They do not provide abstractions
for more general reuse, e.g., to let a user load several datasets into memory and run
ad-hoc queries across them.
Instead, we propose a new abstraction called resilient distributed datasets
(RDDs) that gives users direct control of data sharing. RDDs are fault-tolerant,
parallel data structures that let users explicitly store data on disk or in memory,
control its partitioning, and manipulate it using a rich set of operators. They
offer a simple and efficient programming interface that can capture both current
specialized models and new applications.
The main challenge in designing RDDs is defining a programming interface
that can provide fault tolerance efficiently. Existing abstractions for in-memory
storage on clusters, such as distributed shared memory, key-value stores,
databases, and Piccolo, offer an interface based on fine-grained updates to
mutable state (e.g., cells in a table). With this interface, the only ways to provide
fault tolerance are to replicate the data across machines or to log updates across
machines. Both approaches are expensive for data-intensive workloads, as they
require copying large amounts of data over the cluster network, whose bandwidth
is far lower than that of RAM, and they incur substantial storage overhead.
In contrast to these systems, RDDs provide an interface based on coarse-grained
transformations (e.g., map, filter and join) that apply the same operation to many
data items. This allows them to efficiently provide fault tolerance by logging the
transformations used to build a dataset (its lineage) rather than the actual data. If
a partition of an RDD is lost, the RDD has enough information about how it was
derived from other RDDs to recompute just that partition. Thus, lost data can be
recovered, often quite quickly, without requiring costly replication.
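
Lineage-based recovery can be sketched in a few lines of plain Python. This is a toy, single-process illustration under stated assumptions, not Spark's implementation: `ToyRDD` and all of its methods are hypothetical names, partitions are ordinary lists, and "losing" a partition simply means dropping its materialized copy. Each dataset remembers only its parent and the transformation that produced it, so a lost partition is rebuilt by reapplying that transformation to the matching parent partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazily computed partitions plus lineage."""

    def __init__(self, num_partitions, compute, parent=None, op="source", log=None):
        self.num_partitions = num_partitions
        self._compute = compute          # partition index -> list of items
        self.parent = parent             # the dataset this one was derived from
        self.op = op                     # name of the transformation applied
        self.log = log if log is not None else []   # shared record of computations
        self._materialized = {}          # partitions currently held in memory

    @classmethod
    def from_partitions(cls, partitions):
        data = [list(p) for p in partitions]
        return cls(len(data), lambda i: list(data[i]))

    def _derive(self, op, per_partition):
        return ToyRDD(self.num_partitions,
                      lambda i: per_partition(self.partition(i)),
                      parent=self, op=op, log=self.log)

    def map(self, f):
        return self._derive("map", lambda items: [f(x) for x in items])

    def filter(self, pred):
        return self._derive("filter", lambda items: [x for x in items if pred(x)])

    def partition(self, i):
        # Compute (or recompute) a partition only when it is not materialized.
        if i not in self._materialized:
            self.log.append((self.op, i))
            self._materialized[i] = self._compute(i)
        return self._materialized[i]

    def lose_partition(self, i):
        """Simulate a failure that drops one materialized partition."""
        self._materialized.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lineage(self):
        ops, node = [], self
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


base = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.collect()
steps_before = len(base.log)

evens_squared.lose_partition(1)            # simulated failure
after = evens_squared.collect()            # triggers recovery through lineage
recomputed = base.log[steps_before:]       # what had to be recomputed
```

In this sketch only the lost partition of the final dataset is recomputed, because its parent partition is still materialized. If the parent had been lost as well, the same mechanism would walk further back along the lineage.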
Although an interface based on coarse-grained transformations may at first
seem limited, RDDs are a good fit for many parallel applications, because these
applications naturally apply the same operation to multiple data items. Indeed, we show
that RDDs can efficiently express many cluster programming models that have so
far been proposed as separate systems, including MapReduce, DryadLINQ, SQL,
Pregel and HaLoop, as well as new applications that these systems do not capture,
like interactive data mining. The ability of RDDs to accommodate computing needs
that were previously met only by introducing new frameworks is, we believe, the
most credible evidence of the power of the RDD abstraction.
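
As a small illustration of how an existing model reduces to coarse-grained transformations, the sketch below expresses a MapReduce-style job as a flat-map followed by a group-by-key and a per-key reduction. It runs in a single process on ordinary lists. The helper names `flat_map`, `group_by_key`, and `map_reduce` are hypothetical and chosen for this example; they are not an API from the original text.

```python
from collections import defaultdict


def flat_map(f, items):
    """Apply f to every item and concatenate the resulting lists."""
    return [out for item in items for out in f(item)]


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def map_reduce(records, mapper, reducer):
    """MapReduce built from two bulk transformations plus a per-key reduce."""
    pairs = flat_map(mapper, records)        # map phase
    groups = group_by_key(pairs)             # shuffle phase
    return {key: reducer(key, values) for key, values in groups.items()}


lines = ["a b a", "b c"]
word_counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
```

Every step applies one operation to the whole collection, which is the property that lets a lineage record stay small regardless of the data size.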
We have implemented RDDs in a system called Spark, which is being used
for research and production applications at UC Berkeley and several companies.
Spark provides a convenient language-integrated programming interface similar to
DryadLINQ in the Scala programming language. In addition, Spark can
be used interactively to query big datasets from the Scala interpreter. We believe
that Spark is the first system that allows a general-purpose programming language
to be used at interactive speeds for in-memory data mining on clusters.
We e