---
layout: global
title: "Migration Guide: SparkR (R on Spark)"
displayTitle: "Migration Guide: SparkR (R on Spark)"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

Note that this migration guide describes the items specific to SparkR.
Many items of SQL migration can be applied when migrating SparkR to higher versions.
Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).

## Upgrading from SparkR 2.4 to 3.0

 - The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, `read.json` instead.
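
A minimal sketch of the replacements (the file paths here are hypothetical):

```r
library(SparkR)
sparkR.session()

# Removed in 3.0: parquetFile(...), saveAsParquetFile(...), jsonFile(...), jsonRDD(...)
df <- read.parquet("/tmp/data.parquet")     # replaces parquetFile
write.parquet(df, "/tmp/data_out.parquet")  # replaces saveAsParquetFile
df2 <- read.json("/tmp/data.json")          # replaces jsonFile / jsonRDD
```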

## Upgrading from SparkR 2.3 to 2.4

 - Previously, the size of the last layer in `spark.mlp` was not validated. For example, if the training data has only two labels, a `layers` param like `c(1, 3)` previously did not cause an error, but now it does.
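
A minimal sketch, assuming a binary-label training set derived from `iris`:

```r
# Two labels ("versicolor", "virginica"), so the last layer must be 2.
df <- createDataFrame(iris[iris$Species != "setosa", ])

# Valid: 4 input features, one hidden layer of 5 units, 2 output classes.
model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 2))

# Since 2.4 this raises an error because the last layer (3) does not match
# the number of labels (2); SparkR 2.3 and earlier accepted it silently.
# model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3))
```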

## Upgrading from SparkR 2.3 to 2.3.1 and above

 - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and treated as 0-based. This could lead to inconsistent substring results and also did not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` resulted in `abc` in SparkR 2.3.0, and results in `bcd` in SparkR 2.3.1.
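
A minimal sketch of the corrected, 1-based behaviour:

```r
# A one-row DataFrame just to evaluate the column expression against.
df <- createDataFrame(data.frame(dummy = 1))

# Characters 2 through 4 of "abcdef": "bcd" since 2.3.1 ("abc" in 2.3.0 and earlier).
collect(select(df, substr(lit("abcdef"), 2, 4)))
```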

## Upgrading from SparkR 2.2 to 2.3

 - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
 - For `summary`, an option to specify the statistics to compute has been added (see the sketch after this list). Its output differs from that of `describe`.
 - A warning can be raised if the versions of the SparkR package and the Spark JVM do not match.
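
A minimal sketch of the new `summary` option (the statistic names below follow the `describe`/`summary` convention):

```r
df <- createDataFrame(iris)

# Since 2.3, the statistics to compute can be passed explicitly;
# with no extra arguments, summary() computes a default set.
collect(summary(df, "count", "min", "25%", "max"))
```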

## Upgrading from SparkR 2.1 to 2.2

 - A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame` (see the sketch after this list). When splitting the data, the partition position calculation has been made to match the one in Scala.
 - The method `createExternalTable` has been deprecated and replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
 - By default, `derby.log` is now saved to `tempdir()`. This will be created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
 - `spark.lda` was not setting the optimizer correctly. It has been corrected.
 - Several model summary outputs are updated to have `coefficients` as `matrix`. This includes `spark.logit`, `spark.kmeans`, `spark.glm`. Model summary outputs for `spark.gaussianMixture` have added log-likelihood as `loglik`.
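
A minimal sketch of the new `numPartitions` parameter and of `createTable` (the table name and path are hypothetical):

```r
# Since 2.2, the number of partitions can be requested explicitly.
df <- createDataFrame(iris, numPartitions = 4)
getNumPartitions(df)   # 4

# createTable supersedes the deprecated createExternalTable: with a path it
# creates an external table, without one a managed table.
createTable("people", path = "/tmp/people.parquet", source = "parquet")
```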

## Upgrading from SparkR 2.0 to 2.1

 - `join` no longer performs a Cartesian product by default; use `crossJoin` instead.
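
A minimal sketch of the explicit Cartesian product:

```r
df1 <- createDataFrame(data.frame(a = 1:2))
df2 <- createDataFrame(data.frame(b = 3:4))

# Before 2.1, join(df1, df2) with no join expression silently produced a
# Cartesian product; since 2.1 the intent must be stated explicitly.
collect(crossJoin(df1, df2))
```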

## Upgrading from SparkR 1.6 to 2.0

 - The method `table` has been removed and replaced by `tableToDF`.
 - The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
 - Spark's `SQLContext` and `HiveContext` have been deprecated and replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` to instantiate the SparkSession (see the sketch after this list). Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
 - The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix "spark.executorEnv.VAR_NAME", for example, "spark.executorEnv.PATH".
 - The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
 - The method `registerTempTable` has been deprecated and replaced by `createOrReplaceTempView`.
 - The method `dropTempTable` has been deprecated and replaced by `dropTempView`.
 - The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
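
A minimal sketch of the new entry point (the app name, executor-env value, and view name are hypothetical):

```r
# Before 2.0:
#   sc <- sparkR.init()
#   sqlContext <- sparkRSQL.init(sc)
#   df <- createDataFrame(sqlContext, iris)

# 2.0 and later: one entry point, no context arguments needed.
sparkR.session(appName = "migration-example",
               sparkConfig = list(spark.executorEnv.PATH = "/usr/local/bin"))
df <- createDataFrame(iris)
createOrReplaceTempView(df, "iris_view")   # replaces registerTempTable
head(sql("SELECT * FROM iris_view"))
```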

## Upgrading from SparkR 1.5 to 1.6

 - Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API (see the sketch after this list).
 - SparkSQL converts `NA` in R to `null` and vice-versa.
 - Since 1.6.1, the `withColumn` method in SparkR supports adding a new column to a DataFrame or replacing an existing column of the same name.
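
A minimal sketch of the two changes (the output path is hypothetical):

```r
df <- createDataFrame(iris)

# Since 1.6.0 the default save mode is "error" (fail if the path exists),
# so appending must now be requested explicitly.
write.df(df, path = "/tmp/iris_parquet", source = "parquet", mode = "append")

# Since 1.6.1, withColumn can also replace an existing column in place.
df <- withColumn(df, "Sepal_Length", df$Sepal_Length * 2)
```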