Resilient Distributed Datasets: Foreign-Language Translation Material


Appendix B: Original Foreign-Language Text

Resilient Distributed Datasets

1 Introduction

In this chapter, we present the Resilient Distributed Dataset (RDD) abstraction, on which the rest of the dissertation builds a general-purpose cluster computing stack. RDDs extend the data flow programming model introduced by MapReduce and Dryad, which is the most widely used model for large-scale data analysis today. Data flow systems were successful because they let users write computations using high-level operators, without worrying about work distribution and fault tolerance. As cluster workloads diversified, however, data flow systems were found inefficient for many important applications, including iterative algorithms, interactive queries, and stream processing. This led to the development of a wide range of specialized frameworks for these applications.

Our work starts from the observation that many of the applications that data flow models were not suited for have a characteristic in common: they all require efficient data sharing across computations. For example, iterative algorithms, such as PageRank, K-means clustering, or logistic regression, need to make multiple passes over the same dataset; interactive data mining often requires running multiple ad-hoc queries on the same subset of the data; and streaming applications need to maintain and share state across time. Unfortunately, although data flow frameworks offer numerous computational operators, they lack efficient primitives for data sharing. In these frameworks, the only way to share data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution.

Indeed, examining the specialized frameworks built for these new applications, we see that many of them optimize data sharing. For example, Pregel is a system for iterative graph computations that keeps intermediate state in memory, while HaLoop is an iterative MapReduce system that can keep data partitioned in an efficient way across steps. Unfortunately, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.

Instead, we propose a new abstraction called resilient distributed datasets (RDDs) that gives users direct control of data sharing. RDDs are fault-tolerant, parallel data structures that let users explicitly store data on disk or in memory, control its partitioning, and manipulate it using a rich set of operators. They offer a simple and efficient programming interface that can capture both current specialized models and new applications.
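To make that interface concrete, here is a minimal sketch in Scala using the API of current Apache Spark releases; the `SparkContext` named `sc`, the HDFS path, and the partition count are illustrative assumptions, not part of the original text:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddControlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-control").setMaster("local[*]"))

    // Load a text file as an RDD of lines (the path is hypothetical).
    val lines = sc.textFile("hdfs://namenode/data/events.log")

    // Manipulate the data with high-level operators.
    val pairs = lines.map(line => (line.split(",")(0), 1))

    // Explicitly control partitioning: hash-partition by key into 16 partitions.
    val partitioned = pairs.partitionBy(new HashPartitioner(16))

    // Explicitly choose where the data lives: keep it in memory,
    // spilling to disk only if it does not fit.
    partitioned.persist(StorageLevel.MEMORY_AND_DISK)

    // Several computations can now reuse the same in-memory dataset.
    println(partitioned.reduceByKey(_ + _).count())
    println(partitioned.keys.distinct().count())

    sc.stop()
  }
}
```

The only point of the sketch is that storage level and partitioning are explicit, user-visible choices rather than decisions the framework makes implicitly.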

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory, key-value stores, databases, and Piccolo, offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter, and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.
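A hedged sketch of what such a lineage looks like in Spark's Scala API (the `sc` handle and paths are again illustrative): only the final action triggers computation, and a lost partition of `counts` would be rebuilt by re-running these transformations on the corresponding input partition rather than by restoring a replica.

```scala
// Each transformation records how its output is derived from its input (the lineage);
// no data replication is needed for fault tolerance.
val lines  = sc.textFile("hdfs://namenode/data/app.log")   // lineage: file -> lines
val errors = lines.filter(_.contains("ERROR"))             // lineage: lines -> errors
val byHour = errors.map(line => (line.take(13), 1))        // lineage: errors -> (timestamp prefix, 1)
val counts = byHour.reduceByKey(_ + _)                     // lineage: shuffled aggregation by key

// Only this action executes the chain. If a partition of `counts` is lost,
// just that partition is recomputed from its parents using the lineage above.
counts.collect().foreach(println)
```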

Although an interface based on coarse-grained transformations may at first seem limited, RDDs are a good fit for many parallel applications, because these applications naturally apply the same operation to multiple data items. Indeed, we show that RDDs can efficiently express many cluster programming models that have so far been proposed as separate systems, including MapReduce, DryadLINQ, SQL, Pregel, and HaLoop, as well as new applications that these systems do not capture, like interactive data mining. The ability of RDDs to accommodate computing needs that were previously met only by introducing new frameworks is, we believe, the most credible evidence of the power of the RDD abstraction.
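As one illustration of that expressiveness, the classic MapReduce word-count example collapses to a couple of RDD transformations; a minimal Scala sketch (paths again hypothetical):

```scala
// flatMap plays the role of the map phase, reduceByKey the role of the reduce phase.
val counts = sc.textFile("hdfs://namenode/data/corpus.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://namenode/out/wordcount")
```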

We have implemented RDDs in a system called Spark, which is being used for research and production applications at UC Berkeley and several companies. Spark provides a convenient language-integrated programming interface similar to DryadLINQ in the Scala programming language. In addition, Spark can be used interactively to query big datasets from the Scala interpreter. We believe that Spark is the first system that allows a general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters.
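A hedged illustration of that interactive style, as one might type it at the Scala shell that ships with Apache Spark (`spark-shell`), where `sc` is predefined; the log file and queries are invented for the example:

```scala
// Load a dataset once and keep it in memory.
val logs = sc.textFile("hdfs://namenode/data/access.log").cache()

// Ad-hoc queries over the same cached dataset run at interactive latency.
logs.filter(_.contains(" 404 ")).count()
logs.filter(_.contains(" 500 ")).count()
```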