0001 ---
0002 layout: global
0003 title: Monitoring and Instrumentation
0004 description: Monitoring, metrics, and instrumentation guide for Spark SPARK_VERSION_SHORT
0005 license: |
0006 Licensed to the Apache Software Foundation (ASF) under one or more
0007 contributor license agreements. See the NOTICE file distributed with
0008 this work for additional information regarding copyright ownership.
0009 The ASF licenses this file to You under the Apache License, Version 2.0
0010 (the "License"); you may not use this file except in compliance with
0011 the License. You may obtain a copy of the License at
0012
0013 http://www.apache.org/licenses/LICENSE-2.0
0014
0015 Unless required by applicable law or agreed to in writing, software
0016 distributed under the License is distributed on an "AS IS" BASIS,
0017 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
0018 See the License for the specific language governing permissions and
0019 limitations under the License.
0020 ---
0021
0022 There are several ways to monitor Spark applications: web UIs, metrics, and external instrumentation.
0023
0024 # Web Interfaces
0025
0026 Every SparkContext launches a [Web UI](web-ui.html), by default on port 4040, that
0027 displays useful information about the application. This includes:
0028
0029 * A list of scheduler stages and tasks
0030 * A summary of RDD sizes and memory usage
* Environmental information
0032 * Information about the running executors
0033
0034 You can access this interface by simply opening `http://<driver-node>:4040` in a web browser.
0035 If multiple SparkContexts are running on the same host, they will bind to successive ports
0036 beginning with 4040 (4041, 4042, etc).
0037
0038 Note that this information is only available for the duration of the application by default.
0039 To view the web UI after the fact, set `spark.eventLog.enabled` to true before starting the
0040 application. This configures Spark to log Spark events that encode the information displayed
0041 in the UI to persisted storage.
0042
0043 ## Viewing After the Fact
0044
0045 It is still possible to construct the UI of an application through Spark's history server,
0046 provided that the application's event logs exist.
0047 You can start the history server by executing:
0048
0049 ./sbin/start-history-server.sh
0050
0051 This creates a web interface at `http://<server-url>:18080` by default, listing incomplete
0052 and completed applications and attempts.
0053
0054 When using the file-system provider class (see `spark.history.provider` below), the base logging
0055 directory must be supplied in the `spark.history.fs.logDirectory` configuration option,
and should contain sub-directories that each represent an application's event logs.
0057
The Spark jobs themselves must be configured to log events, and to log them to the same shared,
0059 writable directory. For example, if the server was configured with a log directory of
0060 `hdfs://namenode/shared/spark-logs`, then the client-side options would be:
0061
0062 spark.eventLog.enabled true
0063 spark.eventLog.dir hdfs://namenode/shared/spark-logs
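
The same settings can also be passed per job on the `spark-submit` command line. A sketch, where the application class and jar are placeholders:

```shell
# Hypothetical submission with event logging enabled; the log directory must
# match the history server's spark.history.fs.logDirectory.
./bin/spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://namenode/shared/spark-logs \
  --class com.example.MyApp \
  my-app.jar
```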
0064
0065 The history server can be configured as follows:
0066
0067 ### Environment Variables
0068
0069 <table class="table">
0070 <tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr>
0071 <tr>
0072 <td><code>SPARK_DAEMON_MEMORY</code></td>
0073 <td>Memory to allocate to the history server (default: 1g).</td>
0074 </tr>
0075 <tr>
0076 <td><code>SPARK_DAEMON_JAVA_OPTS</code></td>
0077 <td>JVM options for the history server (default: none).</td>
0078 </tr>
0079 <tr>
0080 <td><code>SPARK_DAEMON_CLASSPATH</code></td>
0081 <td>Classpath for the history server (default: none).</td>
0082 </tr>
0083 <tr>
0084 <td><code>SPARK_PUBLIC_DNS</code></td>
0085 <td>
0086 The public address for the history server. If this is not set, links to application history
0087 may use the internal address of the server, resulting in broken links (default: none).
0088 </td>
0089 </tr>
0090 <tr>
0091 <td><code>SPARK_HISTORY_OPTS</code></td>
0092 <td>
0093 <code>spark.history.*</code> configuration options for the history server (default: none).
0094 </td>
0095 </tr>
0096 </table>
0097
0098 ### Applying compaction on rolling event log files
0099
A long-running application (e.g. streaming) can produce a single huge event log file, which may be costly to maintain and
also requires a lot of resources to replay on each update in the Spark History Server.
0102
Enabling <code>spark.eventLog.rolling.enabled</code> and <code>spark.eventLog.rolling.maxFileSize</code> lets you
have rolling event log files instead of a single huge event log file, which may help some scenarios on its own,
but it still doesn't reduce the overall size of the logs.
0106
The Spark History Server can apply compaction to the rolling event log files to reduce the overall size of the
logs, by setting the configuration <code>spark.history.fs.eventLog.rolling.maxFilesToRetain</code> on the
Spark History Server.
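
Putting these together, a minimal sketch of the relevant settings might look as follows; the file-size and retain-count values are illustrative, not recommendations. On the application side:

    spark.eventLog.rolling.enabled true
    spark.eventLog.rolling.maxFileSize 128m

and on the History Server side:

    spark.history.fs.eventLog.rolling.maxFilesToRetain 10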
0110
Details are described below, but please note up front that compaction is a LOSSY operation.
Compaction discards some events, which will no longer be visible on the UI; you may want to check which events will be discarded
before enabling the option.
0114
When compaction happens, the History Server lists all the available event log files for the application, and the
event log files with an index smaller than the smallest index among the files to be retained become the targets of compaction.
For example, if application A has 5 event log files and <code>spark.history.fs.eventLog.rolling.maxFilesToRetain</code> is set to 2, then the first 3 log files will be selected for compaction.
0118
Once it selects the targets, it analyzes them to figure out which events can be excluded, and rewrites them
into one compact file, discarding the events it has decided to exclude.
0121
Compaction tries to exclude events which point to outdated data. As of now, the candidates for exclusion are:
0123
* Events for jobs which have finished, and their related stage/task events
* Events for executors which have terminated
* Events for SQL executions which have finished, and their related job/stage/task events
0127
Once rewriting is done, the original log files will be deleted on a best-effort basis. The History Server may not be able to delete
the original log files, but this will not affect its operation.
0130
Please note that the Spark History Server may not compact old event log files if it determines that not much space
would be saved by compaction. For a streaming query we normally expect compaction
to run, since each micro-batch triggers one or more jobs which finish shortly afterwards; for a batch query, compaction
often won't run.
0135
Please also note that this is a new feature introduced in Spark 3.0, and may not be completely stable. Under some circumstances,
compaction may exclude more events than you expect, leading to UI issues on the History Server for the application.
Use it with caution.
0139
0140 ### Spark History Server Configuration Options
0141
Security options for the Spark History Server are covered in more detail on the
0143 [Security](security.html#web-ui) page.
0144
0145 <table class="table">
0146 <tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
0147 <tr>
0148 <td>spark.history.provider</td>
0149 <td><code>org.apache.spark.deploy.history.FsHistoryProvider</code></td>
0150 <td>Name of the class implementing the application history backend. Currently there is only
0151 one implementation, provided by Spark, which looks for application logs stored in the
0152 file system.</td>
0153 <td>1.1.0</td>
0154 </tr>
0155 <tr>
0156 <td>spark.history.fs.logDirectory</td>
0157 <td>file:/tmp/spark-events</td>
0158 <td>
0159 For the filesystem history provider, the URL to the directory containing application event
0160 logs to load. This can be a local <code>file://</code> path,
0161 an HDFS path <code>hdfs://namenode/shared/spark-logs</code>
0162 or that of an alternative filesystem supported by the Hadoop APIs.
0163 </td>
0164 <td>1.1.0</td>
0165 </tr>
0166 <tr>
0167 <td>spark.history.fs.update.interval</td>
0168 <td>10s</td>
0169 <td>
0170 The period at which the filesystem history provider checks for new or
0171 updated logs in the log directory. A shorter interval detects new applications faster,
0172 at the expense of more server load re-reading updated applications.
0173 As soon as an update has completed, listings of the completed and incomplete applications
0174 will reflect the changes.
0175 </td>
0176 <td>1.4.0</td>
0177 </tr>
0178 <tr>
0179 <td>spark.history.retainedApplications</td>
0180 <td>50</td>
0181 <td>
0182 The number of applications to retain UI data for in the cache. If this cap is exceeded, then
0183 the oldest applications will be removed from the cache. If an application is not in the cache,
0184 it will have to be loaded from disk if it is accessed from the UI.
0185 </td>
0186 <td>1.0.0</td>
0187 </tr>
0188 <tr>
0189 <td>spark.history.ui.maxApplications</td>
0190 <td>Int.MaxValue</td>
0191 <td>
0192 The number of applications to display on the history summary page. Application UIs are still
0193 available by accessing their URLs directly even if they are not displayed on the history summary page.
0194 </td>
0195 <td>2.0.1</td>
0196 </tr>
0197 <tr>
0198 <td>spark.history.ui.port</td>
0199 <td>18080</td>
0200 <td>
0201 The port to which the web interface of the history server binds.
0202 </td>
0203 <td>1.0.0</td>
0204 </tr>
0205 <tr>
0206 <td>spark.history.kerberos.enabled</td>
0207 <td>false</td>
0208 <td>
Indicates whether the history server should use kerberos to log in. This is required
0210 if the history server is accessing HDFS files on a secure Hadoop cluster.
0211 </td>
0212 <td>1.0.1</td>
0213 </tr>
0214 <tr>
0215 <td>spark.history.kerberos.principal</td>
0216 <td>(none)</td>
0217 <td>
0218 When <code>spark.history.kerberos.enabled=true</code>, specifies kerberos principal name for the History Server.
0219 </td>
0220 <td>1.0.1</td>
0221 </tr>
0222 <tr>
0223 <td>spark.history.kerberos.keytab</td>
0224 <td>(none)</td>
0225 <td>
0226 When <code>spark.history.kerberos.enabled=true</code>, specifies location of the kerberos keytab file for the History Server.
0227 </td>
0228 <td>1.0.1</td>
0229 </tr>
0230 <tr>
0231 <td>spark.history.fs.cleaner.enabled</td>
0232 <td>false</td>
0233 <td>
0234 Specifies whether the History Server should periodically clean up event logs from storage.
0235 </td>
0236 <td>1.4.0</td>
0237 </tr>
0238 <tr>
0239 <td>spark.history.fs.cleaner.interval</td>
0240 <td>1d</td>
0241 <td>
0242 When <code>spark.history.fs.cleaner.enabled=true</code>, specifies how often the filesystem job history cleaner checks for files to delete.
0243 Files are deleted if at least one of two conditions holds.
0244 First, they're deleted if they're older than <code>spark.history.fs.cleaner.maxAge</code>.
Second, they are deleted if the number of files exceeds
<code>spark.history.fs.cleaner.maxNum</code>; in that case Spark tries to clean up the completed attempts
from the applications, in the order of their oldest attempt time.
0248 </td>
0249 <td>1.4.0</td>
0250 </tr>
0251 <tr>
0252 <td>spark.history.fs.cleaner.maxAge</td>
0253 <td>7d</td>
0254 <td>
0255 When <code>spark.history.fs.cleaner.enabled=true</code>, job history files older than this will be deleted when the filesystem history cleaner runs.
0256 </td>
0257 <td>1.4.0</td>
0258 </tr>
0259 <tr>
0260 <td>spark.history.fs.cleaner.maxNum</td>
0261 <td>Int.MaxValue</td>
0262 <td>
0263 When <code>spark.history.fs.cleaner.enabled=true</code>, specifies the maximum number of files in the event log directory.
0264 Spark tries to clean up the completed attempt logs to maintain the log directory under this limit.
This should be smaller than the underlying file system limit, such as
<code>dfs.namenode.fs-limits.max-directory-items</code> in HDFS.
0267 </td>
0268 <td>3.0.0</td>
0269 </tr>
0270 <tr>
0271 <td>spark.history.fs.endEventReparseChunkSize</td>
0272 <td>1m</td>
0273 <td>
0274 How many bytes to parse at the end of log files looking for the end event.
0275 This is used to speed up generation of application listings by skipping unnecessary
0276 parts of event log files. It can be disabled by setting this config to 0.
0277 </td>
0278 <td>2.4.0</td>
0279 </tr>
0280 <tr>
0281 <td>spark.history.fs.inProgressOptimization.enabled</td>
0282 <td>true</td>
0283 <td>
0284 Enable optimized handling of in-progress logs. This option may leave finished
0285 applications that fail to rename their event logs listed as in-progress.
0286 </td>
0287 <td>2.4.0</td>
0288 </tr>
0289 <tr>
0290 <td>spark.history.fs.driverlog.cleaner.enabled</td>
0291 <td><code>spark.history.fs.cleaner.enabled</code></td>
0292 <td>
0293 Specifies whether the History Server should periodically clean up driver logs from storage.
0294 </td>
0295 <td>3.0.0</td>
0296 </tr>
0297 <tr>
0298 <td>spark.history.fs.driverlog.cleaner.interval</td>
0299 <td><code>spark.history.fs.cleaner.interval</code></td>
0300 <td>
0301 When <code>spark.history.fs.driverlog.cleaner.enabled=true</code>, specifies how often the filesystem driver log cleaner checks for files to delete.
0302 Files are only deleted if they are older than <code>spark.history.fs.driverlog.cleaner.maxAge</code>
0303 </td>
0304 <td>3.0.0</td>
0305 </tr>
0306 <tr>
0307 <td>spark.history.fs.driverlog.cleaner.maxAge</td>
0308 <td><code>spark.history.fs.cleaner.maxAge</code></td>
0309 <td>
0310 When <code>spark.history.fs.driverlog.cleaner.enabled=true</code>, driver log files older than this will be deleted when the driver log cleaner runs.
0311 </td>
0312 <td>3.0.0</td>
0313 </tr>
0314 <tr>
0315 <td>spark.history.fs.numReplayThreads</td>
0316 <td>25% of available cores</td>
0317 <td>
Number of threads that will be used by the history server to process event logs.
0319 </td>
0320 <td>2.0.0</td>
0321 </tr>
0322 <tr>
0323 <td>spark.history.store.maxDiskUsage</td>
0324 <td>10g</td>
0325 <td>
Maximum disk usage for the local directory where the cached application history information
is stored.
0328 </td>
0329 <td>2.3.0</td>
0330 </tr>
0331 <tr>
0332 <td>spark.history.store.path</td>
0333 <td>(none)</td>
0334 <td>
0335 Local directory where to cache application history data. If set, the history
0336 server will store application data on disk instead of keeping it in memory. The data
0337 written to disk will be re-used in the event of a history server restart.
0338 </td>
0339 <td>2.3.0</td>
0340 </tr>
0341 <tr>
0342 <td>spark.history.custom.executor.log.url</td>
0343 <td>(none)</td>
0344 <td>
Specifies a custom Spark executor log URL for supporting an external log service instead of using the cluster
managers' application log URLs in the history server. Spark will support some path variables via patterns
which can vary on cluster manager. Please check the documentation for your cluster manager to
see which patterns are supported, if any. This configuration has no effect on a live application; it only
affects the history server.
0350 <p/>
For now, only YARN mode supports this configuration.
0352 </td>
0353 <td>3.0.0</td>
0354 </tr>
0355 <tr>
0356 <td>spark.history.custom.executor.log.url.applyIncompleteApplication</td>
0357 <td>false</td>
0358 <td>
Specifies whether to apply the custom Spark executor log URL to incomplete applications as well.
If executor logs for running applications should be provided as origin log URLs, set this to <code>false</code>.
Please note that incomplete applications may include applications which didn't shut down gracefully.
Even if this is set to <code>true</code>, this configuration has no effect on a live application; it only affects the history server.
0363 </td>
0364 <td>3.0.0</td>
0365 </tr>
0366 <tr>
0367 <td>spark.history.fs.eventLog.rolling.maxFilesToRetain</td>
0368 <td>Int.MaxValue</td>
0369 <td>
The maximum number of event log files which will be retained as non-compacted. By default,
all event log files will be retained. The lowest value is 1 for technical reasons.<br/>
Please read the section "Applying compaction on rolling event log files" for more details.
0373 </td>
0374 <td>3.0.0</td>
0375 </tr>
0376 </table>
0377
0378 Note that in all of these UIs, the tables are sortable by clicking their headers,
0379 making it easy to identify slow tasks, data skew, etc.
0380
0381 Note
0382
0383 1. The history server displays both completed and incomplete Spark jobs. If an application makes
0384 multiple attempts after failures, the failed attempts will be displayed, as well as any ongoing
0385 incomplete attempt or the final successful attempt.
0386
0387 2. Incomplete applications are only updated intermittently. The time between updates is defined
0388 by the interval between checks for changed files (`spark.history.fs.update.interval`).
0389 On larger clusters, the update interval may be set to large values.
0390 The way to view a running application is actually to view its own web UI.
0391
3. Applications which exited without registering themselves as completed will be listed
   as incomplete, even though they are no longer running. This can happen if an application
   crashes.
0395
4. One way to signal the completion of a Spark application is to stop the Spark Context
   explicitly (`sc.stop()`), or in Python to use the `with SparkContext() as sc:` construct
   to handle Spark Context setup and teardown.
0399
0400
0401 ## REST API
0402
0403 In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers
0404 an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for
both running applications, and in the history server. The endpoints are mounted at `/api/v1`. For example,
0406 for the history server, they would typically be accessible at `http://<server-url>:18080/api/v1`, and
0407 for a running application, at `http://localhost:4040/api/v1`.
0408
0409 In the API, an application is referenced by its application ID, `[app-id]`.
When running on YARN, each application may have multiple attempts, but attempt IDs exist
only for applications in cluster mode, not for applications in client mode. Applications in YARN cluster mode
can be identified by their `[attempt-id]`. In the API listed below, when running in YARN cluster mode,
0413 `[app-id]` will actually be `[base-app-id]/[attempt-id]`, where `[base-app-id]` is the YARN application ID.
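
As a sketch of how these endpoints compose, the helper below builds `/api/v1` URLs; the host names are placeholders, and only the URL construction is shown (fetching the JSON is left to any HTTP client):

```python
from urllib.parse import urlencode

def api_url(base, *path, **params):
    """Build a /api/v1 endpoint URL for a running application (port 4040)
    or the history server (port 18080)."""
    url = base.rstrip("/") + "/api/v1/" + "/".join(path)
    if params:
        url += "?" + urlencode(params)
    return url

# The last 10 completed applications on a (hypothetical) history server:
print(api_url("http://historyserver:18080", "applications",
              status="completed", limit=10))
# -> http://historyserver:18080/api/v1/applications?status=completed&limit=10
```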
0414
0415 <table class="table">
0416 <tr><th>Endpoint</th><th>Meaning</th></tr>
0417 <tr>
0418 <td><code>/applications</code></td>
0419 <td>A list of all applications.
0420 <br>
0421 <code>?status=[completed|running]</code> list only applications in the chosen state.
0422 <br>
0423 <code>?minDate=[date]</code> earliest start date/time to list.
0424 <br>
0425 <code>?maxDate=[date]</code> latest start date/time to list.
0426 <br>
0427 <code>?minEndDate=[date]</code> earliest end date/time to list.
0428 <br>
0429 <code>?maxEndDate=[date]</code> latest end date/time to list.
0430 <br>
0431 <code>?limit=[limit]</code> limits the number of applications listed.
0432 <br>Examples:
0433 <br><code>?minDate=2015-02-10</code>
0434 <br><code>?minDate=2015-02-03T16:42:40.000GMT</code>
0435 <br><code>?maxDate=2015-02-11T20:41:30.000GMT</code>
0436 <br><code>?minEndDate=2015-02-12</code>
0437 <br><code>?minEndDate=2015-02-12T09:15:10.000GMT</code>
0438 <br><code>?maxEndDate=2015-02-14T16:30:45.000GMT</code>
0439 <br><code>?limit=10</code></td>
0440 </tr>
0441 <tr>
0442 <td><code>/applications/[app-id]/jobs</code></td>
0443 <td>
0444 A list of all jobs for a given application.
0445 <br><code>?status=[running|succeeded|failed|unknown]</code> list only jobs in the specific state.
0446 </td>
0447 </tr>
0448 <tr>
0449 <td><code>/applications/[app-id]/jobs/[job-id]</code></td>
0450 <td>Details for the given job.</td>
0451 </tr>
0452 <tr>
0453 <td><code>/applications/[app-id]/stages</code></td>
0454 <td>
0455 A list of all stages for a given application.
0456 <br><code>?status=[active|complete|pending|failed]</code> list only stages in the state.
0457 </td>
0458 </tr>
0459 <tr>
0460 <td><code>/applications/[app-id]/stages/[stage-id]</code></td>
0461 <td>
0462 A list of all attempts for the given stage.
0463 </td>
0464 </tr>
0465 <tr>
0466 <td><code>/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]</code></td>
0467 <td>Details for the given stage attempt.</td>
0468 </tr>
0469 <tr>
0470 <td><code>/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskSummary</code></td>
0471 <td>
0472 Summary metrics of all tasks in the given stage attempt.
0473 <br><code>?quantiles</code> summarize the metrics with the given quantiles.
0474 <br>Example: <code>?quantiles=0.01,0.5,0.99</code>
0475 </td>
0476 </tr>
0477 <tr>
0478 <td><code>/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList</code></td>
0479 <td>
0480 A list of all tasks for the given stage attempt.
0481 <br><code>?offset=[offset]&length=[len]</code> list tasks in the given range.
0482 <br><code>?sortBy=[runtime|-runtime]</code> sort the tasks.
0483 <br>Example: <code>?offset=10&length=50&sortBy=runtime</code>
0484 </td>
0485 </tr>
0486 <tr>
0487 <td><code>/applications/[app-id]/executors</code></td>
0488 <td>A list of all active executors for the given application.</td>
0489 </tr>
0490 <tr>
0491 <td><code>/applications/[app-id]/executors/[executor-id]/threads</code></td>
0492 <td>
0493 Stack traces of all the threads running within the given active executor.
0494 Not available via the history server.
0495 </td>
0496 </tr>
0497 <tr>
0498 <td><code>/applications/[app-id]/allexecutors</code></td>
<td>A list of all (active and dead) executors for the given application.</td>
0500 </tr>
0501 <tr>
0502 <td><code>/applications/[app-id]/storage/rdd</code></td>
0503 <td>A list of stored RDDs for the given application.</td>
0504 </tr>
0505 <tr>
0506 <td><code>/applications/[app-id]/storage/rdd/[rdd-id]</code></td>
0507 <td>Details for the storage status of a given RDD.</td>
0508 </tr>
0509 <tr>
0510 <td><code>/applications/[base-app-id]/logs</code></td>
0511 <td>Download the event logs for all attempts of the given application as files within
0512 a zip file.
0513 </td>
0514 </tr>
0515 <tr>
0516 <td><code>/applications/[base-app-id]/[attempt-id]/logs</code></td>
0517 <td>Download the event logs for a specific application attempt as a zip file.</td>
0518 </tr>
0519 <tr>
0520 <td><code>/applications/[app-id]/streaming/statistics</code></td>
0521 <td>Statistics for the streaming context.</td>
0522 </tr>
0523 <tr>
0524 <td><code>/applications/[app-id]/streaming/receivers</code></td>
0525 <td>A list of all streaming receivers.</td>
0526 </tr>
0527 <tr>
0528 <td><code>/applications/[app-id]/streaming/receivers/[stream-id]</code></td>
0529 <td>Details of the given receiver.</td>
0530 </tr>
0531 <tr>
0532 <td><code>/applications/[app-id]/streaming/batches</code></td>
0533 <td>A list of all retained batches.</td>
0534 </tr>
0535 <tr>
0536 <td><code>/applications/[app-id]/streaming/batches/[batch-id]</code></td>
0537 <td>Details of the given batch.</td>
0538 </tr>
0539 <tr>
0540 <td><code>/applications/[app-id]/streaming/batches/[batch-id]/operations</code></td>
0541 <td>A list of all output operations of the given batch.</td>
0542 </tr>
0543 <tr>
0544 <td><code>/applications/[app-id]/streaming/batches/[batch-id]/operations/[outputOp-id]</code></td>
0545 <td>Details of the given operation and given batch.</td>
0546 </tr>
0547 <tr>
0548 <td><code>/applications/[app-id]/environment</code></td>
0549 <td>Environment details of the given application.</td>
0550 </tr>
0551 <tr>
0552 <td><code>/version</code></td>
0553 <td>Get the current spark version.</td>
0554 </tr>
0555 </table>
0556
The number of jobs and stages which can be retrieved is constrained by the same retention
mechanism as the standalone Spark UI; `spark.ui.retainedJobs` defines the threshold
value triggering garbage collection on jobs, and `spark.ui.retainedStages` that for stages.
0560 Note that the garbage collection takes place on playback: it is possible to retrieve
0561 more entries by increasing these values and restarting the history server.
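
For instance, the per-task JSON returned by the `taskList` endpoint can be post-processed to spot stragglers. A sketch on a trimmed, made-up response (real entries carry many more fields; `duration` is in milliseconds):

```python
import json

# Fabricated excerpt of a /taskList response, for illustration only.
sample = json.loads("""
[
  {"taskId": 0, "duration": 1200,  "status": "SUCCESS"},
  {"taskId": 1, "duration": 45000, "status": "SUCCESS"},
  {"taskId": 2, "duration": 1350,  "status": "SUCCESS"}
]
""")

def slowest_tasks(tasks, n=5):
    """Return the n longest-running tasks, like ?sortBy=-runtime does server-side."""
    return sorted(tasks, key=lambda t: t.get("duration", 0), reverse=True)[:n]

for t in slowest_tasks(sample, n=1):
    print(t["taskId"], t["duration"])   # task 1 stands out as the straggler
```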
0562
0563 ### Executor Task Metrics
0564
0565 The REST API exposes the values of the Task Metrics collected by Spark executors with the granularity
0566 of task execution. The metrics can be used for performance troubleshooting and workload characterization.
0567 A list of the available metrics, with a short description:
0568
0569 <table class="table">
0570 <tr><th>Spark Executor Task Metric name</th>
0571 <th>Short description</th>
0572 </tr>
0573 <tr>
0574 <td>executorRunTime</td>
0575 <td>Elapsed time the executor spent running this task. This includes time fetching shuffle data.
0576 The value is expressed in milliseconds.</td>
0577 </tr>
0578 <tr>
0579 <td>executorCpuTime</td>
0580 <td>CPU time the executor spent running this task. This includes time fetching shuffle data.
0581 The value is expressed in nanoseconds.</td>
0582 </tr>
0583 <tr>
0584 <td>executorDeserializeTime</td>
0585 <td>Elapsed time spent to deserialize this task. The value is expressed in milliseconds.</td>
0586 </tr>
0587 <tr>
0588 <td>executorDeserializeCpuTime</td>
0589 <td>CPU time taken on the executor to deserialize this task. The value is expressed
0590 in nanoseconds.</td>
0591 </tr>
0592 <tr>
0593 <td>resultSize</td>
0594 <td>The number of bytes this task transmitted back to the driver as the TaskResult.</td>
0595 </tr>
0596 <tr>
0597 <td>jvmGCTime</td>
0598 <td>Elapsed time the JVM spent in garbage collection while executing this task.
0599 The value is expressed in milliseconds.</td>
0600 </tr>
0601 <tr>
0602 <td>resultSerializationTime</td>
0603 <td>Elapsed time spent serializing the task result. The value is expressed in milliseconds.</td>
0604 </tr>
0605 <tr>
0606 <td>memoryBytesSpilled</td>
0607 <td>The number of in-memory bytes spilled by this task.</td>
0608 </tr>
0609 <tr>
0610 <td>diskBytesSpilled</td>
0611 <td>The number of on-disk bytes spilled by this task.</td>
0612 </tr>
0613 <tr>
0614 <td>peakExecutionMemory</td>
0615 <td>Peak memory used by internal data structures created during shuffles, aggregations and
0616 joins. The value of this accumulator should be approximately the sum of the peak sizes
0617 across all such data structures created in this task. For SQL jobs, this only tracks all
0618 unsafe operators and ExternalSort.</td>
0619 </tr>
0620 <tr>
0621 <td>inputMetrics.*</td>
0622 <td>Metrics related to reading data from <code>org.apache.spark.rdd.HadoopRDD</code>
0623 or from persisted data.</td>
0624 </tr>
0625 <tr>
0626 <td> .bytesRead</td>
0627 <td>Total number of bytes read.</td>
0628 </tr>
0629 <tr>
0630 <td> .recordsRead</td>
0631 <td>Total number of records read.</td>
0632 </tr>
0633 <tr>
0634 <td>outputMetrics.*</td>
0635 <td>Metrics related to writing data externally (e.g. to a distributed filesystem),
0636 defined only in tasks with output.</td>
0637 </tr>
0638 <tr>
0639 <td> .bytesWritten</td>
0640 <td>Total number of bytes written</td>
0641 </tr>
0642 <tr>
0643 <td> .recordsWritten</td>
0644 <td>Total number of records written</td>
0645 </tr>
0646 <tr>
0647 <td>shuffleReadMetrics.*</td>
0648 <td>Metrics related to shuffle read operations.</td>
0649 </tr>
0650 <tr>
0651 <td> .recordsRead</td>
0652 <td>Number of records read in shuffle operations</td>
0653 </tr>
0654 <tr>
0655 <td> .remoteBlocksFetched</td>
0656 <td>Number of remote blocks fetched in shuffle operations</td>
0657 </tr>
0658 <tr>
0659 <td> .localBlocksFetched</td>
0660 <td>Number of local (as opposed to read from a remote executor) blocks fetched
0661 in shuffle operations</td>
0662 </tr>
0663 <tr>
0664 <td> .totalBlocksFetched</td>
0665 <td>Number of blocks fetched in shuffle operations (both local and remote)</td>
0666 </tr>
0667 <tr>
0668 <td> .remoteBytesRead</td>
0669 <td>Number of remote bytes read in shuffle operations</td>
0670 </tr>
0671 <tr>
0672 <td> .localBytesRead</td>
0673 <td>Number of bytes read in shuffle operations from local disk (as opposed to
0674 read from a remote executor)</td>
0675 </tr>
0676 <tr>
0677 <td> .totalBytesRead</td>
0678 <td>Number of bytes read in shuffle operations (both local and remote)</td>
0679 </tr>
0680 <tr>
0681 <td> .remoteBytesReadToDisk</td>
0682 <td>Number of remote bytes read to disk in shuffle operations.
0683 Large blocks are fetched to disk in shuffle read operations, as opposed to
0684 being read into memory, which is the default behavior.</td>
0685 </tr>
0686 <tr>
0687 <td> .fetchWaitTime</td>
0688 <td>Time the task spent waiting for remote shuffle blocks.
0689 This only includes the time blocking on shuffle input data.
0690 For instance if block B is being fetched while the task is still not finished
0691 processing block A, it is not considered to be blocking on block B.
0692 The value is expressed in milliseconds.</td>
0693 </tr>
0694 <tr>
0695 <td>shuffleWriteMetrics.*</td>
0696 <td>Metrics related to operations writing shuffle data.</td>
0697 </tr>
0698 <tr>
0699 <td> .bytesWritten</td>
0700 <td>Number of bytes written in shuffle operations</td>
0701 </tr>
0702 <tr>
0703 <td> .recordsWritten</td>
0704 <td>Number of records written in shuffle operations</td>
0705 </tr>
0706 <tr>
0707 <td> .writeTime</td>
0708 <td>Time spent blocking on writes to disk or buffer cache. The value is expressed
0709 in nanoseconds.</td>
0710 </tr>
0711 </table>
0712
0713 ### Executor Metrics
0714
Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance of the executor itself, such as JVM heap memory and GC information.
0716 Executor metric values and their measured memory peak values per executor are exposed via the REST API in JSON format and in Prometheus format.
0717 The JSON end point is exposed at: `/applications/[app-id]/executors`, and the Prometheus endpoint at: `/metrics/executors/prometheus`.
The Prometheus endpoint is experimental and conditional on a configuration parameter: `spark.ui.prometheus.enabled=true` (the default is `false`).
0719 In addition, aggregated per-stage peak values of the executor memory metrics are written to the event log if
0720 `spark.eventLog.logStageExecutorMetrics` is true.
0721 Executor memory metrics are also exposed via the Spark metrics system based on the Dropwizard metrics library.
0722 A list of the available metrics, with a short description:
0723
0724 <table class="table">
0725 <tr><th>Executor Level Metric name</th>
0726 <th>Short description</th>
0727 </tr>
0728 <tr>
0729 <td>rddBlocks</td>
0730 <td>RDD blocks in the block manager of this executor.</td>
0731 </tr>
0732 <tr>
0733 <td>memoryUsed</td>
0734 <td>Storage memory used by this executor.</td>
0735 </tr>
0736 <tr>
0737 <td>diskUsed</td>
0738 <td>Disk space used for RDD storage by this executor.</td>
0739 </tr>
0740 <tr>
0741 <td>totalCores</td>
0742 <td>Number of cores available in this executor.</td>
0743 </tr>
0744 <tr>
0745 <td>maxTasks</td>
0746 <td>Maximum number of tasks that can run concurrently in this executor.</td>
0747 </tr>
0748 <tr>
0749 <td>activeTasks</td>
0750 <td>Number of tasks currently executing.</td>
0751 </tr>
0752 <tr>
0753 <td>failedTasks</td>
0754 <td>Number of tasks that have failed in this executor.</td>
0755 </tr>
0756 <tr>
0757 <td>completedTasks</td>
0758 <td>Number of tasks that have completed in this executor.</td>
0759 </tr>
0760 <tr>
0761 <td>totalTasks</td>
0762 <td>Total number of tasks (running, failed and completed) in this executor.</td>
0763 </tr>
0764 <tr>
0765 <td>totalDuration</td>
0766 <td>Elapsed time the JVM spent executing tasks in this executor.
0767 The value is expressed in milliseconds.</td>
0768 </tr>
0769 <tr>
0770 <td>totalGCTime</td>
0771 <td>Elapsed time the JVM spent in garbage collection summed in this executor.
0772 The value is expressed in milliseconds.</td>
0773 </tr>
0774 <tr>
0775 <td>totalInputBytes</td>
0776 <td>Total input bytes summed in this executor.</td>
0777 </tr>
0778 <tr>
0779 <td>totalShuffleRead</td>
0780 <td>Total shuffle read bytes summed in this executor.</td>
0781 </tr>
0782 <tr>
0783 <td>totalShuffleWrite</td>
0784 <td>Total shuffle write bytes summed in this executor.</td>
0785 </tr>
0786 <tr>
0787 <td>maxMemory</td>
0788 <td>Total amount of memory available for storage, in bytes.</td>
0789 </tr>
0790 <tr>
0791 <td>memoryMetrics.*</td>
0792 <td>Current value of memory metrics:</td>
0793 </tr>
0794 <tr>
0795 <td> .usedOnHeapStorageMemory</td>
0796 <td>Used on heap memory currently for storage, in bytes.</td>
0797 </tr>
0798 <tr>
0799 <td> .usedOffHeapStorageMemory</td>
0800 <td>Used off heap memory currently for storage, in bytes.</td>
0801 </tr>
0802 <tr>
0803 <td> .totalOnHeapStorageMemory</td>
<td>Total available on heap memory for storage, in bytes. This amount can vary over time, depending on the MemoryManager implementation.</td>
0805 </tr>
0806 <tr>
0807 <td> .totalOffHeapStorageMemory</td>
0808 <td>Total available off heap memory for storage, in bytes. This amount can vary over time, depending on the MemoryManager implementation.</td>
0809 </tr>
0810 <tr>
0811 <td>peakMemoryMetrics.*</td>
0812 <td>Peak value of memory (and GC) metrics:</td>
0813 </tr>
0814 <tr>
0815 <td> .JVMHeapMemory</td>
0816 <td>Peak memory usage of the heap that is used for object allocation.
0817 The heap consists of one or more memory pools. The used and committed size of the returned memory usage is the sum of those values of all heap memory pools whereas the init and max size of the returned memory usage represents the setting of the heap memory which may not be the sum of those of all heap memory pools.
0818 The amount of used memory in the returned memory usage is the amount of memory occupied by both live objects and garbage objects that have not been collected, if any.</td>
0819 </tr>
0820 <tr>
0821 <td> .JVMOffHeapMemory</td>
0822 <td>Peak memory usage of non-heap memory that is used by the Java virtual machine. The non-heap memory consists of one or more memory pools. The used and committed size of the returned memory usage is the sum of those values of all non-heap memory pools whereas the init and max size of the returned memory usage represents the setting of the non-heap memory which may not be the sum of those of all non-heap memory pools.</td>
0823 </tr>
0824 <tr>
0825 <td> .OnHeapExecutionMemory</td>
0826 <td>Peak on heap execution memory in use, in bytes.</td>
0827 </tr>
0828 <tr>
0829 <td> .OffHeapExecutionMemory</td>
0830 <td>Peak off heap execution memory in use, in bytes.</td>
0831 </tr>
0832 <tr>
0833 <td> .OnHeapStorageMemory</td>
0834 <td>Peak on heap storage memory in use, in bytes.</td>
0835 </tr>
0836 <tr>
0837 <td> .OffHeapStorageMemory</td>
0838 <td>Peak off heap storage memory in use, in bytes.</td>
0839 </tr>
0840 <tr>
0841 <td> .OnHeapUnifiedMemory</td>
0842 <td>Peak on heap memory (execution and storage).</td>
0843 </tr>
0844 <tr>
0845 <td> .OffHeapUnifiedMemory</td>
0846 <td>Peak off heap memory (execution and storage).</td>
0847 </tr>
0848 <tr>
0849 <td> .DirectPoolMemory</td>
0850 <td>Peak memory that the JVM is using for direct buffer pool (<code>java.lang.management.BufferPoolMXBean</code>)</td>
0851 </tr>
0852 <tr>
0853 <td> .MappedPoolMemory</td>
0854 <td>Peak memory that the JVM is using for mapped buffer pool (<code>java.lang.management.BufferPoolMXBean</code>)</td>
0855 </tr>
0856 <tr>
0857 <td> .ProcessTreeJVMVMemory</td>
0858 <td>Virtual memory size in bytes. Enabled if spark.executor.processTreeMetrics.enabled is true.</td>
0859 </tr>
0860 <tr>
0861 <td> .ProcessTreeJVMRSSMemory</td>
0862 <td>Resident Set Size: number of pages the process has
0863 in real memory. This is just the pages which count
0864 toward text, data, or stack space. This does not
0865 include pages which have not been demand-loaded in,
0866 or which are swapped out. Enabled if spark.executor.processTreeMetrics.enabled is true.</td>
0867 </tr>
0868 <tr>
0869 <td> .ProcessTreePythonVMemory</td>
0870 <td>Virtual memory size for Python in bytes. Enabled if spark.executor.processTreeMetrics.enabled is true.</td>
0871 </tr>
0872 <tr>
0873 <td> .ProcessTreePythonRSSMemory</td>
0874 <td>Resident Set Size for Python. Enabled if spark.executor.processTreeMetrics.enabled is true.</td>
0875 </tr>
0876 <tr>
0877 <td> .ProcessTreeOtherVMemory</td>
0878 <td>Virtual memory size for other kind of process in bytes. Enabled if spark.executor.processTreeMetrics.enabled is true.</td>
0879 </tr>
0880 <tr>
0881 <td> .ProcessTreeOtherRSSMemory</td>
0882 <td>Resident Set Size for other kind of process. Enabled if spark.executor.processTreeMetrics.enabled is true.</td>
0883 </tr>
0884 <tr>
0885 <td> .MinorGCCount</td>
0886 <td>Total minor GC count. For example, the garbage collector is one of Copy, PS Scavenge, ParNew, G1 Young Generation and so on.</td>
0887 </tr>
0888 <tr>
0889 <td> .MinorGCTime</td>
0890 <td>Elapsed total minor GC time.
0891 The value is expressed in milliseconds.</td>
0892 </tr>
0893 <tr>
0894 <td> .MajorGCCount</td>
0895 <td>Total major GC count. For example, the garbage collector is one of MarkSweepCompact, PS MarkSweep, ConcurrentMarkSweep, G1 Old Generation and so on.</td>
0896 </tr>
0897 <tr>
0898 <td> .MajorGCTime</td>
0899 <td>Elapsed total major GC time.
0900 The value is expressed in milliseconds.</td>
0901 </tr>
0902 </table>
The computation of RSS and Vmem is based on [proc(5)](http://man7.org/linux/man-pages/man5/proc.5.html).
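As an illustration of that interface, RSS and virtual memory for a process can be read as follows. This is a sketch, not Spark code; it assumes a Linux `/proc` filesystem and follows the field positions documented in proc(5).

```python
import os

def proc_memory(pid):
    """Return (virtual memory size, RSS) in bytes for a process, from /proc."""
    with open(f"/proc/{pid}/stat") as f:
        # Note: a robust parser would handle spaces inside the comm field
        # (field 2, shown in parentheses); split() is enough for a sketch.
        fields = f.read().split()
    vsize_bytes = int(fields[22])                              # field 23: vsize, in bytes
    rss_bytes = int(fields[23]) * os.sysconf("SC_PAGE_SIZE")   # field 24: rss, in pages
    return vsize_bytes, rss_bytes
```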
0904
0905 ### API Versioning Policy
0906
These endpoints have been strongly versioned to make it easier to develop applications on top of them.
In particular, Spark guarantees:

* Endpoints will never be removed from a given version
* Individual fields will never be removed from any given endpoint
* New endpoints may be added
* New fields may be added to existing endpoints
* New versions of the API may be added in the future as a separate endpoint (e.g., `api/v2`). New versions are *not* required to be backwards compatible.
* API versions may be dropped, but only after at least one minor release of co-existing with a new API version.
0916
Note that even when examining the UI of a running application, the `applications/[app-id]` portion is
still required, even though there is only one application available. E.g., to see the list of jobs for the
running app, you would go to `http://localhost:4040/api/v1/applications/[app-id]/jobs`. This is to
keep the paths consistent in both modes.
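The JSON returned by these endpoints can be post-processed with any JSON library. The snippet below is a sketch against a hand-written, abridged payload shaped like a `/api/v1/applications/[app-id]/executors` response; the executor ids and byte counts are illustrative only, while the field names are the ones documented above.

```python
import json

# Abridged, hand-written stand-in for an executors REST response.
payload = """
[
  {"id": "driver", "memoryUsed": 0, "maxMemory": 434031820,
   "peakMemoryMetrics": {"JVMHeapMemory": 629145600}},
  {"id": "1", "memoryUsed": 52428800, "maxMemory": 434031820,
   "peakMemoryMetrics": {"JVMHeapMemory": 734003200}}
]
"""

def storage_usage(executors):
    """Map executor id -> fraction of storage memory (maxMemory) in use."""
    return {e["id"]: e["memoryUsed"] / e["maxMemory"] for e in executors}

executors = json.loads(payload)
usage = storage_usage(executors)
peak_heap = max(e["peakMemoryMetrics"]["JVMHeapMemory"] for e in executors)
```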
0921
0922 # Metrics
0923
0924 Spark has a configurable metrics system based on the
0925 [Dropwizard Metrics Library](http://metrics.dropwizard.io/).
0926 This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
0927 files. The metrics are generated by sources embedded in the Spark code base. They
0928 provide instrumentation for specific activities and Spark components.
0929 The metrics system is configured via a configuration file that Spark expects to be present
0930 at `$SPARK_HOME/conf/metrics.properties`. A custom file location can be specified via the
0931 `spark.metrics.conf` [configuration property](configuration.html#spark-properties).
0932 Instead of using the configuration file, a set of configuration parameters with prefix
0933 `spark.metrics.conf.` can be used.
By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, users often want to track metrics
across applications for drivers and executors, which is hard to do with the application ID
(i.e. `spark.app.id`) since it changes with every invocation of the application. For such use cases,
a custom namespace can be specified for metrics reporting using the `spark.metrics.namespace`
configuration property.
If, say, users wanted to set the metrics namespace to the name of the application, they
can set the `spark.metrics.namespace` property to a value like `${spark.app.name}`. This value is
then expanded appropriately by Spark and used as the root namespace of the metrics system.
Metrics belonging to neither the driver nor the executors are never prefixed with `spark.app.id`, nor does the
`spark.metrics.namespace` property have any effect on such metrics.
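For example, to use the application name as the root namespace, `spark-defaults.conf` could contain:

```
# Use the (stable) application name instead of the per-run application id
# as the root namespace for driver and executor metrics
spark.metrics.namespace   ${spark.app.name}
```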
0945
0946 Spark's metrics are decoupled into different
0947 _instances_ corresponding to Spark components. Within each instance, you can configure a
0948 set of sinks to which metrics are reported. The following instances are currently supported:
0949
0950 * `master`: The Spark standalone master process.
0951 * `applications`: A component within the master which reports on various applications.
0952 * `worker`: A Spark standalone worker process.
0953 * `executor`: A Spark executor.
0954 * `driver`: The Spark driver process (the process in which your SparkContext is created).
0955 * `shuffleService`: The Spark shuffle service.
0956 * `applicationMaster`: The Spark ApplicationMaster when running on YARN.
0957 * `mesos_cluster`: The Spark cluster scheduler when running on Mesos.
0958
0959 Each instance can report to zero or more _sinks_. Sinks are contained in the
0960 `org.apache.spark.metrics.sink` package:
0961
0962 * `ConsoleSink`: Logs metrics information to the console.
0963 * `CSVSink`: Exports metrics data to CSV files at regular intervals.
0964 * `JmxSink`: Registers metrics for viewing in a JMX console.
0965 * `MetricsServlet`: Adds a servlet within the existing Spark UI to serve metrics data as JSON data.
0966 * `PrometheusServlet`: (Experimental) Adds a servlet within the existing Spark UI to serve metrics data in Prometheus format.
0967 * `GraphiteSink`: Sends metrics to a Graphite node.
0968 * `Slf4jSink`: Sends metrics to slf4j as log entries.
0969 * `StatsdSink`: Sends metrics to a StatsD node.
0970
0971 Spark also supports a Ganglia sink which is not included in the default build due to
0972 licensing restrictions:
0973
0974 * `GangliaSink`: Sends metrics to a Ganglia node or multicast group.
0975
0976 To install the `GangliaSink` you'll need to perform a custom build of Spark. _**Note that
0977 by embedding this library you will include [LGPL](http://www.gnu.org/copyleft/lesser.html)-licensed
0978 code in your Spark package**_. For sbt users, set the
0979 `SPARK_GANGLIA_LGPL` environment variable before building. For Maven users, enable
the `-Pspark-ganglia-lgpl` profile. In addition to modifying the cluster's Spark build,
user applications will need to link to the `spark-ganglia-lgpl` artifact.
0982
0983 The syntax of the metrics configuration file and the parameters available for each sink are defined
0984 in an example configuration file,
0985 `$SPARK_HOME/conf/metrics.properties.template`.
0986
0987 When using Spark configuration parameters instead of the metrics configuration file, the relevant
0988 parameter names are composed by the prefix `spark.metrics.conf.` followed by the configuration
0989 details, i.e. the parameters take the following form:
0990 `spark.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]`.
0991 This example shows a list of Spark configuration parameters for a Graphite sink:
0992 ```
0993 "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
"spark.metrics.conf.*.sink.graphite.host"="<graphiteEndPoint_hostName>"
0995 "spark.metrics.conf.*.sink.graphite.port"=<graphite_listening_port>
0996 "spark.metrics.conf.*.sink.graphite.period"=10
0997 "spark.metrics.conf.*.sink.graphite.unit"=seconds
0998 "spark.metrics.conf.*.sink.graphite.prefix"="optional_prefix"
0999 "spark.metrics.conf.*.sink.graphite.regex"="optional_regex_to_send_matching_metrics"
1000 ```
1001
1002 Default values of the Spark metrics configuration are as follows:
1003 ```
1004 "*.sink.servlet.class" = "org.apache.spark.metrics.sink.MetricsServlet"
1005 "*.sink.servlet.path" = "/metrics/json"
1006 "master.sink.servlet.path" = "/metrics/master/json"
1007 "applications.sink.servlet.path" = "/metrics/applications/json"
1008 ```
1009
1010 Additional sources can be configured using the metrics configuration file or the configuration
1011 parameter `spark.metrics.conf.[component_name].source.jvm.class=[source_name]`. At present the
JVM source is the only available optional source. For example, the following configuration parameter
activates the JVM source:
1014 `"spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"`
1015
1016 ## List of available metrics providers
1017
Metrics used by Spark are of multiple types: gauge, counter, histogram, meter and timer;
see the [Dropwizard library documentation for details](https://metrics.dropwizard.io/3.1.0/getting-started/).
The following list of components and metrics reports the name and some details about the available metrics,
grouped per component instance and source namespace.
The most common types of metrics used in Spark instrumentation are gauges and counters.
Counters can be recognized by their `.count` suffix. Timers, meters and histograms are annotated
in the list; the rest of the list elements are metrics of type gauge.
The large majority of metrics are active as soon as their parent component instance is configured;
some metrics also need to be enabled via an additional configuration parameter, as detailed
in the list.
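As a toy illustration of this naming convention (the helper below is not part of Spark), counters can be separated from the other metric names by suffix alone; timers, meters and histograms cannot, which is why this guide annotates them explicitly:

```python
def split_by_kind(names):
    """Split Dropwizard metric names into counters and everything else
    (gauges, plus the explicitly annotated timers/meters/histograms)."""
    counters = [n for n in names if n.endswith(".count")]
    others = [n for n in names if not n.endswith(".count")]
    return counters, others

counters, others = split_by_kind(
    ["numEventsPosted.count", "queue.appStatus.size", "jvmCpuTime"])
```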
1028
1029 ### Component instance = Driver
This is the component with the largest amount of instrumented metrics.
1031
1032 - namespace=BlockManager
1033 - disk.diskSpaceUsed_MB
1034 - memory.maxMem_MB
1035 - memory.maxOffHeapMem_MB
1036 - memory.maxOnHeapMem_MB
1037 - memory.memUsed_MB
1038 - memory.offHeapMemUsed_MB
1039 - memory.onHeapMemUsed_MB
1040 - memory.remainingMem_MB
1041 - memory.remainingOffHeapMem_MB
1042 - memory.remainingOnHeapMem_MB
1043
1044 - namespace=HiveExternalCatalog
  - **note:** these metrics are conditional to a configuration parameter:
    `spark.metrics.staticSources.enabled` (default is true)
1047 - fileCacheHits.count
1048 - filesDiscovered.count
1049 - hiveClientCalls.count
1050 - parallelListingJobCount.count
1051 - partitionsFetched.count
1052
1053 - namespace=CodeGenerator
  - **note:** these metrics are conditional to a configuration parameter:
    `spark.metrics.staticSources.enabled` (default is true)
1056 - compilationTime (histogram)
1057 - generatedClassSize (histogram)
1058 - generatedMethodSize (histogram)
1059 - sourceCodeSize (histogram)
1060
1061 - namespace=DAGScheduler
1062 - job.activeJobs
1063 - job.allJobs
1064 - messageProcessingTime (timer)
1065 - stage.failedStages
1066 - stage.runningStages
1067 - stage.waitingStages
1068
1069 - namespace=LiveListenerBus
1070 - listenerProcessingTime.org.apache.spark.HeartbeatReceiver (timer)
1071 - listenerProcessingTime.org.apache.spark.scheduler.EventLoggingListener (timer)
1072 - listenerProcessingTime.org.apache.spark.status.AppStatusListener (timer)
1073 - numEventsPosted.count
1074 - queue.appStatus.listenerProcessingTime (timer)
1075 - queue.appStatus.numDroppedEvents.count
1076 - queue.appStatus.size
1077 - queue.eventLog.listenerProcessingTime (timer)
1078 - queue.eventLog.numDroppedEvents.count
1079 - queue.eventLog.size
1080 - queue.executorManagement.listenerProcessingTime (timer)
1081
1082 - namespace=appStatus (all metrics of type=counter)
1083 - **note:** Introduced in Spark 3.0. Conditional to a configuration parameter:
1084 `spark.metrics.appStatusSource.enabled` (default is false)
1085 - stages.failedStages.count
1086 - stages.skippedStages.count
1087 - stages.completedStages.count
1088 - tasks.blackListedExecutors.count
1089 - tasks.completedTasks.count
1090 - tasks.failedTasks.count
1091 - tasks.killedTasks.count
1092 - tasks.skippedTasks.count
1093 - tasks.unblackListedExecutors.count
1094 - jobs.succeededJobs
1095 - jobs.failedJobs
1096 - jobDuration
1097
1098 - namespace=AccumulatorSource
  - **note:** User-configurable sources to attach accumulators to the metrics system
1100 - DoubleAccumulatorSource
1101 - LongAccumulatorSource
1102
1103 - namespace=spark.streaming
1104 - **note:** This applies to Spark Structured Streaming only. Conditional to a configuration
1105 parameter: `spark.sql.streaming.metricsEnabled=true` (default is false)
1106 - eventTime-watermark
1107 - inputRate-total
1108 - latency
1109 - processingRate-total
1110 - states-rowsTotal
1111 - states-usedBytes
1112
1113 - namespace=JVMCPU
1114 - jvmCpuTime
1115
1116 - namespace=ExecutorMetrics
1117 - **note:** these metrics are conditional to a configuration parameter:
1118 `spark.metrics.executorMetricsSource.enabled` (default is true)
1119 - This source contains memory-related metrics. A full list of available metrics in this
1120 namespace can be found in the corresponding entry for the Executor component instance.
1121
1122 - namespace=plugin.\<Plugin Class Name>
1123 - Optional namespace(s). Metrics in this namespace are defined by user-supplied code, and
1124 configured using the Spark plugin API. See "Advanced Instrumentation" below for how to load
1125 custom plugins into Spark.
1126
1127 ### Component instance = Executor
These metrics are exposed by Spark executors. Note that they are currently not available
when running in local mode.
1130
1131 - namespace=executor (metrics are of type counter or gauge)
1132 - bytesRead.count
1133 - bytesWritten.count
1134 - cpuTime.count
1135 - deserializeCpuTime.count
1136 - deserializeTime.count
1137 - diskBytesSpilled.count
1138 - filesystem.file.largeRead_ops
1139 - filesystem.file.read_bytes
1140 - filesystem.file.read_ops
1141 - filesystem.file.write_bytes
1142 - filesystem.file.write_ops
1143 - filesystem.hdfs.largeRead_ops
1144 - filesystem.hdfs.read_bytes
1145 - filesystem.hdfs.read_ops
1146 - filesystem.hdfs.write_bytes
1147 - filesystem.hdfs.write_ops
1148 - jvmGCTime.count
1149 - memoryBytesSpilled.count
1150 - recordsRead.count
1151 - recordsWritten.count
1152 - resultSerializationTime.count
1153 - resultSize.count
1154 - runTime.count
1155 - shuffleBytesWritten.count
1156 - shuffleFetchWaitTime.count
1157 - shuffleLocalBlocksFetched.count
1158 - shuffleLocalBytesRead.count
1159 - shuffleRecordsRead.count
1160 - shuffleRecordsWritten.count
1161 - shuffleRemoteBlocksFetched.count
1162 - shuffleRemoteBytesRead.count
1163 - shuffleRemoteBytesReadToDisk.count
1164 - shuffleTotalBytesRead.count
1165 - shuffleWriteTime.count
1166 - succeededTasks.count
1167 - threadpool.activeTasks
1168 - threadpool.completeTasks
1169 - threadpool.currentPool_size
1170 - threadpool.maxPool_size
1171 - threadpool.startedTasks
1172
1173 - namespace=ExecutorMetrics
1174 - **notes:**
1175 - These metrics are conditional to a configuration parameter:
1176 `spark.metrics.executorMetricsSource.enabled` (default value is true)
1177 - ExecutorMetrics are updated as part of heartbeat processes scheduled
1178 for the executors and for the driver at regular intervals: `spark.executor.heartbeatInterval` (default value is 10 seconds)
1179 - An optional faster polling mechanism is available for executor memory metrics,
1180 it can be activated by setting a polling interval (in milliseconds) using the configuration parameter `spark.executor.metrics.pollingInterval`
1181 - JVMHeapMemory
1182 - JVMOffHeapMemory
1183 - OnHeapExecutionMemory
1184 - OnHeapStorageMemory
1185 - OnHeapUnifiedMemory
1186 - OffHeapExecutionMemory
1187 - OffHeapStorageMemory
1188 - OffHeapUnifiedMemory
1189 - DirectPoolMemory
1190 - MappedPoolMemory
1191 - MinorGCCount
1192 - MinorGCTime
1193 - MajorGCCount
1194 - MajorGCTime
1195 - "ProcessTree*" metric counters:
1196 - ProcessTreeJVMVMemory
1197 - ProcessTreeJVMRSSMemory
1198 - ProcessTreePythonVMemory
1199 - ProcessTreePythonRSSMemory
1200 - ProcessTreeOtherVMemory
1201 - ProcessTreeOtherRSSMemory
1202 - **note:** "ProcessTree*" metrics are collected only under certain conditions.
1203 The conditions are the logical AND of the following: `/proc` filesystem exists,
1204 `spark.executor.processTreeMetrics.enabled=true`.
1205 "ProcessTree*" metrics report 0 when those conditions are not met.
1206
1207 - namespace=JVMCPU
1208 - jvmCpuTime
1209
1210 - namespace=NettyBlockTransfer
1211 - shuffle-client.usedDirectMemory
1212 - shuffle-client.usedHeapMemory
1213 - shuffle-server.usedDirectMemory
1214 - shuffle-server.usedHeapMemory
1215
1216 - namespace=HiveExternalCatalog
  - **note:** these metrics are conditional to a configuration parameter:
    `spark.metrics.staticSources.enabled` (default is true)
1219 - fileCacheHits.count
1220 - filesDiscovered.count
1221 - hiveClientCalls.count
1222 - parallelListingJobCount.count
1223 - partitionsFetched.count
1224
1225 - namespace=CodeGenerator
  - **note:** these metrics are conditional to a configuration parameter:
    `spark.metrics.staticSources.enabled` (default is true)
1228 - compilationTime (histogram)
1229 - generatedClassSize (histogram)
1230 - generatedMethodSize (histogram)
1231 - hiveClientCalls.count
1232 - sourceCodeSize (histogram)
1233
1234 - namespace=plugin.\<Plugin Class Name>
1235 - Optional namespace(s). Metrics in this namespace are defined by user-supplied code, and
1236 configured using the Spark plugin API. See "Advanced Instrumentation" below for how to load
1237 custom plugins into Spark.
1238
1239 ### Source = JVM Source
1240 Notes:
- Activate this source by setting the relevant `metrics.properties` file entry or the
  configuration parameter: `spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource`
1243 - This source is available for driver and executor instances and is also available for other instances.
1244 - This source provides information on JVM metrics using the
1245 [Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/3.1.0/manual/jvm/)
1246 and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
1247
1248 ### Component instance = applicationMaster
1249 Note: applies when running on YARN
1250
1251 - numContainersPendingAllocate
1252 - numExecutorsFailed
1253 - numExecutorsRunning
1254 - numLocalityAwareTasks
1255 - numReleasedContainers
1256
1257 ### Component instance = mesos_cluster
Note: applies when running on Mesos
1259
1260 - waitingDrivers
1261 - launchedDrivers
1262 - retryDrivers
1263
1264 ### Component instance = master
1265 Note: applies when running in Spark standalone as master
1266
1267 - workers
1268 - aliveWorkers
1269 - apps
1270 - waitingApps
1271
1272 ### Component instance = ApplicationSource
1273 Note: applies when running in Spark standalone as master
1274
1275 - status
1276 - runtime_ms
1277 - cores
1278
1279 ### Component instance = worker
1280 Note: applies when running in Spark standalone as worker
1281
1282 - executors
1283 - coresUsed
1284 - memUsed_MB
1285 - coresFree
1286 - memFree_MB
1287
1288 ### Component instance = shuffleService
1289 Note: applies to the shuffle service
1290
1291 - blockTransferRateBytes (meter)
1292 - numActiveConnections.count
1293 - numRegisteredConnections.count
1294 - numCaughtExceptions.count
1295 - openBlockRequestLatencyMillis (histogram)
1296 - registerExecutorRequestLatencyMillis (histogram)
1297 - registeredExecutorsSize
1298 - shuffle-server.usedDirectMemory
1299 - shuffle-server.usedHeapMemory
1300
1301 # Advanced Instrumentation
1302
1303 Several external tools can be used to help profile the performance of Spark jobs:
1304
1305 * Cluster-wide monitoring tools, such as [Ganglia](http://ganglia.sourceforge.net/), can provide
1306 insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia
1307 dashboard can quickly reveal whether a particular workload is disk bound, network bound, or
1308 CPU bound.
1309 * OS profiling tools such as [dstat](http://dag.wieers.com/home-made/dstat/),
1310 [iostat](http://linux.die.net/man/1/iostat), and [iotop](http://linux.die.net/man/1/iotop)
1311 can provide fine-grained profiling on individual nodes.
1312 * JVM utilities such as `jstack` for providing stack traces, `jmap` for creating heap-dumps,
1313 `jstat` for reporting time-series statistics and `jconsole` for visually exploring various JVM
1314 properties are useful for those comfortable with JVM internals.
1315
1316 Spark also provides a plugin API so that custom instrumentation code can be added to Spark
1317 applications. There are two configuration keys available for loading plugins into Spark:
1318
1319 - <code>spark.plugins</code>
1320 - <code>spark.plugins.defaultList</code>
1321
1322 Both take a comma-separated list of class names that implement the
1323 <code>org.apache.spark.api.plugin.SparkPlugin</code> interface. The two names exist so that it's
1324 possible for one list to be placed in the Spark default config file, allowing users to
1325 easily add other plugins from the command line without overwriting the config file's list. Duplicate
1326 plugins are ignored.
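As an illustration, with `com.example.ClusterDefaultPlugin` and `com.example.MyPlugin` standing in for hypothetical classes that implement `org.apache.spark.api.plugin.SparkPlugin`:

```
# Set by the admin in the cluster's spark-defaults.conf:
spark.plugins.defaultList=com.example.ClusterDefaultPlugin

# Added by a user at submission time; both lists are loaded:
#   spark-submit --conf spark.plugins=com.example.MyPlugin ...
```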
1327
1328 Distribution of the jar files containing the plugin code is currently not done by Spark. The user
1329 or admin should make sure that the jar files are available to Spark applications, for example, by
1330 including the plugin jar with the Spark distribution. The exception to this rule is the YARN
1331 backend, where the <code>--jars</code> command line option (or equivalent config entry) can be
1332 used to make the plugin code available to both executors and cluster-mode drivers.