---
layout: global
title: Collaborative Filtering - RDD-based API
displayTitle: Collaborative Filtering - RDD-based API
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

## Collaborative filtering

[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-item association matrix. `spark.mllib` currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
`spark.mllib` uses the [alternating least squares
(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
algorithm to learn these latent factors. The implementation in `spark.mllib` has the
following parameters:

* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of features to use (also referred to as the number of latent factors).
* *iterations* is the number of iterations of ALS to run. ALS typically converges to a reasonable
  solution in 20 iterations or less.
* *lambda* specifies the regularization parameter in ALS.
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
  *implicit feedback* data.
* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
  *baseline* confidence in preference observations.

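For orientation, the parameters above map onto the `ALS` builder API roughly as follows. This is a
sketch only: the input path and all parameter values are illustrative placeholders, not recommended
settings.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc: SparkContext = ??? // an existing SparkContext

// Each input line is assumed to look like "user,product,rating"
val ratings = sc.textFile("data/mllib/als/test.data").map(_.split(',') match {
  case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble)
})

val model = new ALS()
  .setBlocks(-1)           // numBlocks: -1 lets ALS auto-configure parallelism
  .setRank(10)             // rank: number of latent factors
  .setIterations(10)       // iterations: number of ALS sweeps
  .setLambda(0.01)         // lambda: regularization parameter
  .setImplicitPrefs(false) // use the explicit feedback variant
  .setAlpha(1.0)           // alpha: only relevant when implicitPrefs is true
  .run(ratings)
{% endhighlight %}
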
### Explicit vs. implicit feedback

The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

It is common in many real-world use cases to only have access to *implicit feedback* (e.g., views,
clicks, purchases, likes, shares, etc.). The approach used in `spark.mllib` to deal with such data is taken
from [Collaborative Filtering for Implicit Feedback Datasets](https://doi.org/10.1109/ICDM.2008.22).
Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
as numbers representing the *strength* of observations of user actions (such as the number of clicks,
or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
confidence in observed user preferences, rather than to explicit ratings given to items. The model
then tries to find latent factors that can be used to predict the expected preference of a user for
an item.

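Concretely, following the paper, each observed strength `r_ui` (e.g., a click count) can be thought
of as inducing a binary preference `p_ui` and a confidence weight `c_ui`. This is a sketch of the
idea rather than the exact implementation:

`\[
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise,} \end{cases}
\qquad
c_{ui} = 1 + \alpha r_{ui},
\]`

where `alpha` controls how quickly confidence grows with the observed strength. ALS then fits the
latent factors to the preferences `p_ui`, weighting each squared error by `c_ui`.
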
### Scaling of the regularization parameter

Since v1.1, we scale the regularization parameter `lambda` in solving each least squares problem by
the number of ratings the user generated when updating user factors,
or the number of ratings the product received when updating product factors.
This approach is named "ALS-WR" and is discussed in the paper
"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](https://doi.org/10.1007/978-3-540-68880-8_32)".
It makes `lambda` less dependent on the scale of the dataset, so we can apply the
best parameter learned from a sampled subset to the full dataset and expect similar performance.

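In equation form, the scaled objective for explicit feedback looks roughly as follows (a sketch:
`x_u` and `y_i` denote the user and product factor vectors, `n_u` and `n_i` the corresponding
rating counts, and the sum runs over the set of observed ratings):

`\[
\min_{x_*, y_*} \sum_{(u, i) \in \Omega} \left( r_{ui} - x_u^T y_i \right)^2
+ \lambda \left( \sum_u n_u \|x_u\|^2 + \sum_i n_i \|y_i\|^2 \right)
\]`
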
## Examples

<div class="codetabs">

<div data-lang="scala" markdown="1">
In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/scala/org/apache/spark/mllib/recommendation/ALS$.html)
method, which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.

{% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}

If the rating matrix is derived from another source of information (i.e., it is inferred from
other signals), you can use the `trainImplicit` method to get better results.

{% highlight scala %}
import org.apache.spark.mllib.recommendation.ALS

// ratings, rank and numIterations are defined as in the preceding example
val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
{% endhighlight %}
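
Once trained, the returned `MatrixFactorizationModel` can score individual user-product pairs or
produce top-N recommendations. A minimal sketch (the IDs below are placeholders):

{% highlight scala %}
// Predict the rating of user 1 for product 2
val predictedRating = model.predict(1, 2)

// Recommend the top 5 products for user 1
val topProducts = model.recommendProducts(1, 5)
{% endhighlight %}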
</div>

<div data-lang="java" markdown="1">
All of MLlib's methods use Java-friendly types, so you can import and call them from Java the same
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
calling `.rdd()` on your `JavaRDD` object. A self-contained application example
equivalent to the Scala example above is given below.

Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaRecommendationExample.java %}
</div>

<div data-lang="python" markdown="1">
In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default `ALS.train()` method, which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

Refer to the [`ALS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS) for more details on the API.

{% include_example python/mllib/recommendation_example.py %}

If the rating matrix is derived from another source of information (i.e., it is inferred from other
signals), you can use the `trainImplicit` method to get better results.

{% highlight python %}
from pyspark.mllib.recommendation import ALS

# Build the recommendation model using Alternating Least Squares based on implicit ratings;
# ratings, rank and numIterations are defined as in the preceding example
model = ALS.trainImplicit(ratings, rank, numIterations, alpha=0.01)
{% endhighlight %}
</div>

</div>

In order to run the above application, follow the instructions
provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
section of the Spark
Quick Start guide. Be sure to also include *spark-mllib* in your build file as
a dependency.
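
For example, with sbt this is a single extra line in `build.sbt` (the version string is a
placeholder; match it to the Spark version you run against):

{% highlight scala %}
// Replace x.y.z with your Spark version
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "x.y.z" % "provided"
{% endhighlight %}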

## Tutorial

The [training exercises](https://github.com/databricks/spark-training) from Spark Summit 2014 include a hands-on tutorial for
[personalized movie recommendation with `spark.mllib`](https://github.com/databricks/spark-training/blob/master/website/movie-recommendation-with-mllib.md).