0001 ---
0002 layout: global
0003 title: Collaborative Filtering
0004 displayTitle: Collaborative Filtering
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 * Table of contents
0023 {:toc}
0024
0025 ## Collaborative filtering
0026
0027 [Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
0028 is commonly used for recommender systems. These techniques aim to fill in the
0029 missing entries of a user-item association matrix. `spark.ml` currently supports
0030 model-based collaborative filtering, in which users and products are described
0031 by a small set of latent factors that can be used to predict missing entries.
0032 `spark.ml` uses the [alternating least squares
0033 (ALS)](http://dl.acm.org/citation.cfm?id=1608614)
0034 algorithm to learn these latent factors. The implementation in `spark.ml` has the
0035 following parameters:
0036
0037 * *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
0038 * *rank* is the number of latent factors in the model (defaults to 10).
0039 * *maxIter* is the maximum number of iterations to run (defaults to 10).
0040 * *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
0041 * *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
0042 *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
0043 * *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
0044 *baseline* confidence in preference observations (defaults to 1.0).
* *nonnegative* specifies whether to use nonnegativity constraints for least squares (defaults to `false`).
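
The alternating updates that give ALS its name can be sketched in plain Python for the rank-1 case. This is an illustrative toy, not the blocked, parallel implementation in `spark.ml`; the ratings matrix, `reg` value, and iteration count below are made up:

```python
# Minimal rank-1 ALS sketch: alternately solve for user factors with item
# factors fixed, and vice versa, each in closed form with L2 regularization.

ratings = {  # (user, item) -> rating; a made-up toy dataset
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 5.0,
}
n_users, n_items = 3, 3
reg = 0.1  # illustrative regularization parameter

u = [1.0] * n_users  # user latent factors (rank 1)
v = [1.0] * n_items  # item latent factors (rank 1)

for _ in range(20):
    # Fix v, solve each user's regularized least-squares problem.
    for i in range(n_users):
        num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
        den = sum(v[j] ** 2 for (ui, j), _ in ratings.items() if ui == i) + reg
        u[i] = num / den
    # Fix u, solve each item's regularized least-squares problem.
    for j in range(n_items):
        num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
        den = sum(u[i] ** 2 for (i, ij), _ in ratings.items() if ij == j) + reg
        v[j] = num / den

def predict(i, j):
    """Predicted rating is the product of the latent factors."""
    return u[i] * v[j]
```

Each alternating sweep decreases the regularized squared error, which is why the fitted factors reproduce the observed ratings far better than the initial all-ones factors.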
0046
**Note:** The DataFrame-based API for ALS currently only supports integer user and item ids.
Other numeric types are supported for the user and item id columns,
but the ids must be within the integer value range.
0050
0051 ### Explicit vs. implicit feedback
0052
0053 The standard approach to matrix factorization-based collaborative filtering treats
0054 the entries in the user-item matrix as *explicit* preferences given by the user to the item,
0055 for example, users giving ratings to movies.
0056
It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
clicks, purchases, likes, shares, etc.). The approach used in `spark.ml` to deal with such data is taken
0059 from [Collaborative Filtering for Implicit Feedback Datasets](https://doi.org/10.1109/ICDM.2008.22).
0060 Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
0061 as numbers representing the *strength* in observations of user actions (such as the number of clicks,
0062 or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
0063 confidence in observed user preferences, rather than explicit ratings given to items. The model
0064 then tries to find latent factors that can be used to predict the expected preference of a user for
0065 an item.
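
Concretely, the paper above maps each implicit count to a binary preference and a confidence weight: an observation \\(r_{ui}\\) becomes preference \\(p_{ui} = 1\\) if \\(r_{ui} > 0\\) (else 0) with confidence \\(c_{ui} = 1 + \alpha r_{ui}\\). A small plain-Python sketch, where `alpha` plays the role of the `alpha` parameter described above and the click counts are made up:

```python
# Map implicit-feedback counts to (preference, confidence) pairs:
# p_ui = 1 if r_ui > 0 else 0, and c_ui = 1 + alpha * r_ui.
# The click counts are illustrative data, not a real log.

alpha = 1.0  # baseline-confidence parameter

clicks = {("alice", "movie1"): 12, ("alice", "movie2"): 0, ("bob", "movie1"): 3}

prefs = {k: (1.0 if r > 0 else 0.0) for k, r in clicks.items()}
conf = {k: 1.0 + alpha * r for k, r in clicks.items()}
```

Note that a zero count still carries information: it yields preference 0 with the minimum confidence of 1, rather than being treated as a missing entry.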
0066
0067 ### Scaling of the regularization parameter
0068
When solving each least squares problem, we scale the regularization parameter `regParam` by
the number of ratings the user generated (when updating user factors)
or the number of ratings the product received (when updating product factors).
0072 This approach is named "ALS-WR" and discussed in the paper
0073 "[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](https://doi.org/10.1007/978-3-540-68880-8_32)".
0074 It makes `regParam` less dependent on the scale of the dataset, so we can apply the
0075 best parameter learned from a sampled subset to the full dataset and expect similar performance.
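
The effect of this scaling on a single user update can be sketched as follows, again for the rank-1 case. This is an illustrative toy, not the `spark.ml` solver; the closed form and sample numbers are assumptions for the sketch:

```python
# ALS-WR scaling: when updating user i's factor, the effective regularization
# is reg_param * n_i, where n_i is the number of ratings user i generated.
# Rank-1 closed-form update; all numbers below are illustrative.

reg_param = 0.1

def update_user_factor(user_ratings, item_factors):
    """user_ratings: list of (item, rating) pairs for one user.
    item_factors: dict mapping item id -> its (fixed) rank-1 factor."""
    n_i = len(user_ratings)  # ratings this user generated
    num = sum(r * item_factors[j] for j, r in user_ratings)
    den = sum(item_factors[j] ** 2 for j, _ in user_ratings) + reg_param * n_i
    return num / den
```

Because the penalty grows with each user's rating count, heavy raters and light raters are regularized proportionally, which is what makes a `regParam` tuned on a sample transfer to the full dataset.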
0076
0077 ### Cold-start strategy
0078
When making predictions using an `ALSModel`, it is common to encounter users and/or items in the
test dataset that were not present when the model was trained. This typically occurs in two
scenarios:
0082
0083 1. In production, for new users or items that have no rating history and on which the model has not
0084 been trained (this is the "cold start problem").
2. During cross-validation, the data is split between training and evaluation sets. When using
simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is
very common to encounter users and/or items in the evaluation set that are not in the training set.
0089 By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item
0090 factor is not present in the model. This can be useful in a production system, since it indicates
0091 a new user or item, and so the system can make a decision on some fallback to use as the prediction.
0092
0093 However, this is undesirable during cross-validation, since any `NaN` predicted values will result
0094 in `NaN` results for the evaluation metric (for example when using `RegressionEvaluator`).
0095 This makes model selection impossible.
0096
0097 Spark allows users to set the `coldStartStrategy` parameter
0098 to "drop" in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
0099 The evaluation metric will then be computed over the non-`NaN` data and will be valid.
0100 Usage of this parameter is illustrated in the example below.
0101
**Note:** currently the supported cold-start strategies are "nan" (the default behavior mentioned
above) and "drop". Further strategies may be supported in the future.
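
The reason a single unseen user or item can invalidate an entire evaluation is that `NaN` propagates through the metric computation. A plain-Python sketch with made-up (label, prediction) pairs, where `NaN` stands in for a user or item the model never saw:

```python
# One NaN prediction makes RMSE itself NaN; dropping NaN rows (as with
# coldStartStrategy="drop") restores a valid metric. Toy data only.
import math

pairs = [(3.0, 2.5), (4.0, 4.5), (5.0, float("nan"))]  # (label, prediction)

def rmse(pairs):
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

naive = rmse(pairs)  # NaN: the metric is unusable for model selection
dropped = rmse([(y, p) for y, p in pairs if not math.isnan(p)])
```

Here `dropped` is computed over only the two rows with real predictions, mirroring what the "drop" strategy does to the prediction `DataFrame` before evaluation.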
0104
0105 **Examples**
0106
0107 <div class="codetabs">
0108 <div data-lang="scala" markdown="1">
0109
0110 In the following example, we load ratings data from the
0111 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
0112 consisting of a user, a movie, a rating and a timestamp.
0113 We then train an ALS model which assumes, by default, that the ratings are
0114 explicit (`implicitPrefs` is `false`).
0115 We evaluate the recommendation model by measuring the root-mean-square error of
0116 rating prediction.
0117
0118 Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/ml/recommendation/ALS.html)
0119 for more details on the API.
0120
0121 {% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}
0122
0123 If the rating matrix is derived from another source of information (i.e. it is
0124 inferred from other signals), you can set `implicitPrefs` to `true` to get
0125 better results:
0126
0127 {% highlight scala %}
0128 val als = new ALS()
0129 .setMaxIter(5)
0130 .setRegParam(0.01)
0131 .setImplicitPrefs(true)
0132 .setUserCol("userId")
0133 .setItemCol("movieId")
0134 .setRatingCol("rating")
0135 {% endhighlight %}
0136
0137 </div>
0138
0139 <div data-lang="java" markdown="1">
0140
0141 In the following example, we load ratings data from the
0142 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
0143 consisting of a user, a movie, a rating and a timestamp.
0144 We then train an ALS model which assumes, by default, that the ratings are
0145 explicit (`implicitPrefs` is `false`).
0146 We evaluate the recommendation model by measuring the root-mean-square error of
0147 rating prediction.
0148
0149 Refer to the [`ALS` Java docs](api/java/org/apache/spark/ml/recommendation/ALS.html)
0150 for more details on the API.
0151
0152 {% include_example java/org/apache/spark/examples/ml/JavaALSExample.java %}
0153
0154 If the rating matrix is derived from another source of information (i.e. it is
0155 inferred from other signals), you can set `implicitPrefs` to `true` to get
0156 better results:
0157
0158 {% highlight java %}
0159 ALS als = new ALS()
0160 .setMaxIter(5)
0161 .setRegParam(0.01)
0162 .setImplicitPrefs(true)
0163 .setUserCol("userId")
0164 .setItemCol("movieId")
0165 .setRatingCol("rating");
0166 {% endhighlight %}
0167
0168 </div>
0169
0170 <div data-lang="python" markdown="1">
0171
0172 In the following example, we load ratings data from the
0173 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
0174 consisting of a user, a movie, a rating and a timestamp.
0175 We then train an ALS model which assumes, by default, that the ratings are
0176 explicit (`implicitPrefs` is `False`).
0177 We evaluate the recommendation model by measuring the root-mean-square error of
0178 rating prediction.
0179
0180 Refer to the [`ALS` Python docs](api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS)
0181 for more details on the API.
0182
0183 {% include_example python/ml/als_example.py %}
0184
0185 If the rating matrix is derived from another source of information (i.e. it is
0186 inferred from other signals), you can set `implicitPrefs` to `True` to get
0187 better results:
0188
0189 {% highlight python %}
0190 als = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
0191 userCol="userId", itemCol="movieId", ratingCol="rating")
0192 {% endhighlight %}
0193
0194 </div>
0195
0196 <div data-lang="r" markdown="1">
0197
0198 Refer to the [R API docs](api/R/spark.als.html) for more details.
0199
0200 {% include_example r/ml/als.R %}
0201 </div>
0202
0203 </div>