Large-scale Parallel Collaborative Filtering for the Netflix Prize: Foreign Literature Translation Material

 2023-01-20 10:34:30

Graduation Project (Thesis): Foreign Literature Translation

Source text:

Large-scale Parallel Collaborative Filtering for the Netflix Prize

Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan

HP Labs, 1501 Page Mill Rd, Palo Alto, CA, 94304

{yunhong.zhou, dennis.wilkinson, rob.schreiber, rong.pan}@hp.com

Abstract. Many recommendation systems suggest items to users by utilizing the techniques of collaborative filtering (CF) based on historical records of items that the users have viewed, purchased, or rated. Two major problems that most CF approaches have to resolve are scalability and sparseness of the user profiles. In this paper, we describe Alternating-Least-Squares with Weighted-λ-Regularization (ALS-WR), a parallel algorithm that we designed for the Netflix Prize, a large-scale collaborative filtering challenge. We use parallel Matlab on a Linux cluster as the experimental platform. We show empirically that the performance of ALS-WR monotonically increases with both the number of features and the number of ALS iterations. Our ALS-WR applied to the Netflix dataset with 1000 hidden features obtained an RMSE score of 0.8985, which is one of the best results based on a pure method. Combined with the parallel version of other known methods, we achieved a performance improvement of 5.91% over Netflix's own CineMatch recommendation system. Our method is simple and scales well to very large datasets.

Introduction

Recommendation systems try to recommend items (movies, music, webpages, products, etc.) to interested potential customers, based on the information available. A successful recommendation system can significantly improve the revenue of e-commerce companies or facilitate the interaction of users in online communities. Among recommendation systems, content-based approaches analyze the content (e.g., texts, meta-data, features) of the items to identify related items, while collaborative filtering uses the aggregated behavior/taste of a large number of users to suggest relevant items to specific users. Collaborative filtering is popular and widely deployed in Internet companies like Amazon [16], Netflix [2], Google News [7], and others.

The Netflix Prize is a large-scale data mining competition held by Netflix for the best recommendation system algorithm for predicting user ratings on movies, based on a training set of more than 100 million ratings given by over 480,000 users to nearly 18,000 movies. Each training data point consists of a quadruple (user, movie, date, rating) where rating is an integer from 1 to 5. The test dataset consists of 2.8 million data points with the ratings hidden. The goal is to minimize the RMSE (root mean squared error) when predicting the ratings on the test dataset. Netflix's own recommendation system (CineMatch) scores 0.9514 on the test dataset, and the grand challenge is to improve it by 10%.
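As a minimal sketch of the evaluation metric (not the competition's official scoring code), the following Python computes RMSE over a handful of invented predictions and works out the thresholds implied by the figures quoted in this paper. The function name `rmse` and the sample ratings are illustrative assumptions.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions and hidden true ratings (integers from 1 to 5).
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 5]
score = rmse(predicted, actual)

# Thresholds implied by the figures quoted in the text.
CINEMATCH_RMSE = 0.9514
grand_prize_target = CINEMATCH_RMSE * (1 - 0.10)    # 10% improvement
reported_combined = CINEMATCH_RMSE * (1 - 0.0591)   # 5.91% improvement

print(f"sample RMSE: {score:.4f}")
print(f"grand prize target: {grand_prize_target:.4f}")
print(f"RMSE implied by a 5.91% improvement: {reported_combined:.4f}")
```

Under these figures, a 10% improvement over CineMatch corresponds to an RMSE of about 0.8563 on the test dataset.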

The Netflix problem presents a number of practical challenges. (Which is perhaps why, as yet, the prize has not been won.) First, the size of the dataset is 100 times larger than previous benchmark datasets, resulting in much longer model training time and much larger system memory requirements. Second, only about 1% of the user-movie matrix has been observed, with the majority of (potential) ratings missing. This is, of course, an essential aspect of collaborative filtering in general. Third, there is noise in both the training and test datasets, due to human behavior: we cannot expect people to be completely predictable, at least where their feelings about ephemera like movies are concerned. Fourth, the distribution of ratings per user differs between the training and test datasets, as the training dataset spans many years (1995-2005) while the test dataset was drawn from recent ratings (year 2006). In particular, users with few ratings are more prevalent in the test set. Intuitively, it is hard to predict the ratings of a user who is sparsely represented in the training set.
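The "about 1%" figure can be checked with a rough calculation. The sketch below uses the rounded counts quoted earlier ("more than 100 million" ratings, "over 480,000" users, "nearly 18,000" movies), so the result is an estimate rather than the exact density of the dataset.

```python
# Rounded figures quoted in the text; the exact counts differ slightly,
# so this is only an order-of-magnitude check.
num_ratings = 100_000_000
num_users = 480_000
num_movies = 18_000

total_cells = num_users * num_movies   # size of the full user-movie matrix
density = num_ratings / total_cells    # fraction of cells with a rating

print(f"user-movie cells: {total_cells:,}")
print(f"observed fraction: {density:.2%}")
```

With these rounded inputs the observed fraction comes out slightly above 1%, consistent with the statement in the text.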

In this paper, we introduce the problem in detail. Then we describe a parallel algorithm, alternating-least-squares with weighted-λ-regularization. We use parallel Matlab on a Linux cluster as the experimental platform, and our core algorithm is parallelized and optimized to scale up well with large, sparse data. When we apply the proposed method to the Netflix Prize problem, we achieve a performance improvement of 5.91% over Netflix's own CineMatch system.

The rest of the paper is organized as follows: in Section 2 we introduce the problem formulation. In Section 3 we describe our novel parallel Alternating-Least-Squares algorithm. Section 4 describes experiments that show the effectiveness of our approach. Section 5 discusses related work and Section 6 concludes with some future directions.

Problem Formulation

Let $R = \{r_{ij}\}_{n_u \times n_m}$ denote the user-movie matrix, where each element $r_{ij}$ represents the rating score of movie $j$ rated by user $i$ with its value either being a real number or missing, $n_u$ designates the number of users, and $n_m$ indicates the number of movies. In many recommendation systems the task is to estimate some of the missing values in $R$ based on the known values.
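One way to picture this matrix in code is a sparse mapping that stores only the observed ratings. This is an illustrative sketch, not the paper's Matlab data layout; the name `get_rating` and all rating values are invented for the example.

```python
# Sparse user-movie matrix R: only observed ratings are stored.
# Keys are (user index i, movie index j); all values here are made up.
ratings = {
    (0, 0): 5.0,
    (0, 2): 3.0,
    (1, 1): 4.0,
    (2, 0): 1.0,
    (2, 2): 2.0,
}
n_u, n_m = 3, 3  # number of users and number of movies

def get_rating(i, j):
    """Return r_ij, or None when the rating is missing."""
    return ratings.get((i, j))

# The estimation task targets the cells that have no stored rating.
missing = [(i, j) for i in range(n_u) for j in range(n_m)
           if (i, j) not in ratings]
print(missing)
```

Storing only the known entries matters at Netflix scale, where the vast majority of cells are empty.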

We start with a low-rank approximation of the user-item matrix $R$. This approach models both users and movies by giving them coordinates in a low-dimensional feature space. Each user and each movie has a feature vector, and each rating (known or unknown) of a movie by a user is modeled as the inner product of the corresponding user and movie feature vectors.
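The inner-product model in this paragraph can be sketched in a few lines of Python. The two-dimensional feature vectors below are invented for illustration, and the helper name `predict` is an assumption; the sketch shows only how a rating is read off from given feature vectors, not how the vectors are learned.

```python
def predict(user_vec, movie_vec):
    """Modeled rating: inner product of a user and a movie feature vector."""
    return sum(u * m for u, m in zip(user_vec, movie_vec))

# Hypothetical 2-dimensional feature vectors (all values are invented).
user_features = [[1.0, 0.5], [0.2, 1.5]]                  # one row per user
movie_features = [[3.0, 1.0], [1.0, 2.0], [2.0, 2.0]]     # one row per movie

# Every (user, movie) pair gets a modeled rating, known or unknown.
predicted = [[predict(u, m) for m in movie_features] for u in user_features]

for i, row in enumerate(predicted):
    print(f"user {i}: " + ", ".join(f"{r:.1f}" for r in row))
```

Because the full matrix of modeled ratings is determined by the two small sets of feature vectors, its rank is at most the number of features, which is what makes the approximation low-rank.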



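Student name: 田睿 (Tian Rui)  Student ID: 1401150230

School: School of Computer Science and Technology

Major: School of Computer Science and Technology

Project (thesis) title: Online Music Management Software Based on Collaborative Filtering

Advisor: 刘学军 (Liu Xuejun)

February 20, 2020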


