0001 ===================
0002 Block IO Controller
0003 ===================
0004
0005 Overview
0006 ========
The cgroup subsystem "blkio" implements the block IO controller. Various
kinds of IO control policies (such as proportional bandwidth and maximum
bandwidth) are needed both at leaf nodes and at intermediate nodes in a
storage hierarchy. The plan is to use the same cgroup-based management
interface for the blkio controller and to switch IO policies in the
background based on user options.

One IO control policy is the throttling policy, which can be used to
specify upper IO rate limits on devices. This policy is implemented in
the generic block layer and can be used on leaf nodes as well as on
higher-level logical devices such as device mapper.
0017
0018 HOWTO
0019 =====
0020
0021 Throttling/Upper Limit policy
0022 -----------------------------
Enable Block IO controller::

    CONFIG_BLK_CGROUP=y

Enable throttling in block layer::

    CONFIG_BLK_DEV_THROTTLING=y
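
Whether a given kernel was built with both options can be checked against
its configuration file. The sketch below assumes the configuration is
available as a plain-text file, for example /boot/config-$(uname -r) on many
distributions; check_blkio_config is a hypothetical helper name, not an
existing tool::

    # Succeed only if both options are set to "y" in the config file $1.
    check_blkio_config() {
        grep -q '^CONFIG_BLK_CGROUP=y$' "$1" &&
        grep -q '^CONFIG_BLK_DEV_THROTTLING=y$' "$1"
    }

    # Example:
    #   check_blkio_config "/boot/config-$(uname -r)" && echo "throttling built in"

The location of the configuration file is distribution-specific, so the
path should be adjusted as needed.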
0030
Mount the blkio controller (see cgroups.txt, Why are cgroups needed?)::

    mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

Specify a bandwidth rate on a particular device for the root group. The
format of the policy is "<major>:<minor> <bytes_per_second>"::

    echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

This puts a limit of 1MB/second on reads by the root group on the device
with major/minor number 8:16.
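
The rule string can also be generated rather than typed by hand. The
following is a sketch only: bps_rule is a hypothetical helper that takes
the device numbers and a rate in MiB/s and prints a rule in the format
described above. The device numbers themselves can be looked up with, for
example, "ls -l /dev/sdb" or lsblk::

    # Print a rule in the "<major>:<minor> <bytes_per_second>" format.
    # $1 = device numbers as "major:minor", $2 = limit in MiB/s
    bps_rule() {
        echo "$1 $(( $2 * 1024 * 1024 ))"
    }

    # Example (as root, with blkio mounted as above); same 1MB/s read limit:
    #   bps_rule 8:16 1 > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device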
0042
Run dd to read a file and check whether the rate is throttled to 1MB/s::

    # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
    1024+0 records in
    1024+0 records out
    4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
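
The reported duration matches the configured limit: 1024 blocks of 4K are
4194304 bytes, which take about 4 seconds at 1048576 bytes per second. A
small sketch of that arithmetic, using a hypothetical helper named
expected_seconds::

    # Print the expected duration in whole seconds of a throttled transfer.
    # $1 = block size in bytes, $2 = block count, $3 = limit in bytes/second
    expected_seconds() {
        echo $(( $1 * $2 / $3 ))
    }

    expected_seconds 4096 1024 1048576    # prints 4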
0049
Limits for writes can be set using the blkio.throttle.write_bps_device file.
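
Limits can be set on a child group in the same way as on the root group.
The sketch below assumes the mount point used above; the group name test1,
the 2 MiB/s rate and the helper name limit_write_bps are illustrative
choices, not part of the interface::

    # Create child cgroup $2 under mount point $1 and set a WRITE limit.
    # $3 = device numbers as "major:minor", $4 = limit in bytes/second
    limit_write_bps() {
        mkdir -p "$1/$2" &&
        echo "$3 $4" > "$1/$2/blkio.throttle.write_bps_device"
    }

    # Example (as root):
    #   limit_write_bps /sys/fs/cgroup/blkio test1 8:16 2097152
    #   echo $$ > /sys/fs/cgroup/blkio/test1/tasks   # move this shell into test1
    #   dd oflag=direct if=/dev/zero of=/mnt/common/zerofile bs=4K count=1024

The dd command in the example uses direct IO, as in the read test above.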
0051
0052 Hierarchical Cgroups
0053 ====================
0054
Throttling implements hierarchy support. However, this support is
enabled only if "sane_behavior" is enabled on the cgroup side, which is
currently a development option and not publicly available.
0059
Suppose somebody created a hierarchy as follows::

       root
       /  \
    test1 test2
       |
    test3
0067
Throttling with "sane_behavior" handles the hierarchy correctly. All
limits apply to the whole subtree, while all statistics are local to the
IOs directly generated by tasks in that cgroup.
0072
Throttling without "sane_behavior" enabled on the cgroup side
practically treats all groups as being at the same level, as if the
hierarchy looked like the following::

            pivot
         /  /   \  \
    root  test1 test2  test3
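
The example hierarchy is created with ordinary mkdir calls under the mount
point. Because a limit covers the whole subtree when hierarchy support is
active, a group cannot exceed the smallest limit set on itself or on any of
its ancestors. The sketch below illustrates both points; make_hierarchy and
effective_limit are hypothetical helpers, and effective_limit only compares
numbers passed as arguments rather than querying the kernel::

    # Create root/test1/test3 and root/test2 under the mount point $1.
    make_hierarchy() {
        mkdir -p "$1/test1/test3" "$1/test2"
    }

    # Print the smallest of the given limits (bytes/second), listed from
    # the group itself up through its ancestors. Groups without a rule
    # are simply left out of the argument list.
    effective_limit() {
        min="$1"; shift
        for l in "$@"; do
            if [ "$l" -lt "$min" ]; then min="$l"; fi
        done
        echo "$min"
    }

    # test3 limited to 4 MiB/s below test1 limited to 1 MiB/s:
    effective_limit 4194304 1048576    # prints 1048576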
0080
0081 Various user visible config options
0082 ===================================
0083
0084 CONFIG_BLK_CGROUP
0085 Block IO controller.
0086
CONFIG_BFQ_CGROUP_DEBUG
    Debug help. Right now some additional stats files show up in the
    cgroup if this option is enabled.
0090
0091 CONFIG_BLK_DEV_THROTTLING
0092 Enable block device throttling support in block layer.
0093
0094 Details of cgroup files
0095 =======================
0096
0097 Proportional weight policy files
0098 --------------------------------
0099
blkio.bfq.weight
    Specifies the per-cgroup weight. This is the default weight of the
    group on all devices unless overridden by a per-device rule
    (see `blkio.bfq.weight_device` below).

    The currently allowed range of weights is from 1 to 1000. For more
    details, see Documentation/block/bfq-iosched.rst.
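
Under a proportional weight policy, the weights of the groups competing
for a device determine their relative shares. As a rough illustration only
(actual BFQ behaviour depends on the workload and is described in
Documentation/block/bfq-iosched.rst), the nominal share of a group is its
weight divided by the sum of the weights of all busy groups.
nominal_share is a hypothetical helper::

    # Print the nominal share, in percent, of the first weight among all
    # weights given as arguments (the first weight is part of the sum).
    nominal_share() {
        awk -v w="$1" 'BEGIN {
            for (i = 1; i < ARGC; i++) sum += ARGV[i]
            printf "%.1f\n", 100 * w / sum
        }' "$@"
    }

    # Two busy groups with weights 300 and 500 on the same device:
    nominal_share 300 500    # prints 37.5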
0107
blkio.bfq.weight_device
    Specifies per-cgroup, per-device weights, overriding the default group
    weight. For more details, see Documentation/block/bfq-iosched.rst.

    Following is the format::

        # echo dev_maj:dev_minor weight > blkio.bfq.weight_device

    Configure weight=300 on /dev/sdb (8:16) in this cgroup::

        # echo 8:16 300 > blkio.bfq.weight_device
        # cat blkio.bfq.weight_device
        dev     weight
        8:16    300

    Configure weight=500 on /dev/sda (8:0) in this cgroup::

        # echo 8:0 500 > blkio.bfq.weight_device
        # cat blkio.bfq.weight_device
        dev     weight
        8:0     500
        8:16    300

    Remove specific weight for /dev/sda in this cgroup::

        # echo 8:0 0 > blkio.bfq.weight_device
        # cat blkio.bfq.weight_device
        dev     weight
        8:16    300
0137
0138 blkio.time
0139 Disk time allocated to cgroup per device in milliseconds. First
0140 two fields specify the major and minor number of the device and
0141 third field specifies the disk time allocated to group in
0142 milliseconds.
0143
0144 blkio.sectors
0145 Number of sectors transferred to/from disk by the group. First
0146 two fields specify the major and minor number of the device and
0147 third field specifies the number of sectors transferred by the
0148 group to/from the device.
0149
0150 blkio.io_service_bytes
0151 Number of bytes transferred to/from the disk by the group. These
0152 are further divided by the type of operation - read or write, sync
0153 or async. First two fields specify the major and minor number of the
0154 device, third field specifies the operation type and the fourth field
0155 specifies the number of bytes.
0156
0157 blkio.io_serviced
0158 Number of IOs (bio) issued to the disk by the group. These
0159 are further divided by the type of operation - read or write, sync
0160 or async. First two fields specify the major and minor number of the
0161 device, third field specifies the operation type and the fourth field
0162 specifies the number of IOs.
0163
0164 blkio.io_service_time
0165 Total amount of time between request dispatch and request completion
0166 for the IOs done by this cgroup. This is in nanoseconds to make it
0167 meaningful for flash devices too. For devices with queue depth of 1,
0168 this time represents the actual service time. When queue_depth > 1,
0169 that is no longer true as requests may be served out of order. This
0170 may cause the service time for a given IO to include the service time
0171 of multiple IOs when served out of order which may result in total
0172 io_service_time > actual time elapsed. This time is further divided by
0173 the type of operation - read or write, sync or async. First two fields
0174 specify the major and minor number of the device, third field
0175 specifies the operation type and the fourth field specifies the
0176 io_service_time in ns.
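
One use of this counter is a rough per-IO average, obtained by dividing a
cumulative io_service_time value by the matching io_serviced count for the
same device and operation type. The caveat above about queue depth applies
to the result as well. A sketch with a hypothetical helper named
avg_service_ns; the two inputs are values the user has read from the
respective files::

    # Print the average service time per IO in nanoseconds (integer division).
    # $1 = cumulative io_service_time in ns, $2 = number of IOs serviced
    avg_service_ns() {
        if [ "$2" -eq 0 ]; then
            echo 0
        else
            echo $(( $1 / $2 ))
        fi
    }

    avg_service_ns 8000000 1024    # prints 7812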
0177
blkio.io_wait_time
    Total amount of time the IOs for this cgroup spent waiting in the
    scheduler queues for service. This can be greater than the total time
    elapsed since it is the cumulative io_wait_time for all IOs. It is not
    a measure of the total time the cgroup spent waiting but rather a
    measure of the wait_time for its individual IOs. For devices with
    queue_depth > 1, this metric does not include the time between an IO
    being dispatched to the device and it actually being serviced (there
    might be a time lag here due to re-ordering of requests by the
    device). This is in nanoseconds to make it meaningful for flash
    devices too. This time is further divided by the type of operation -
    read or write, sync or async. First two fields specify the major and
    minor number of the device, third field specifies the operation type
    and the fourth field specifies the io_wait_time in ns.
0192
0193 blkio.io_merged
0194 Total number of bios/requests merged into requests belonging to this
0195 cgroup. This is further divided by the type of operation - read or
0196 write, sync or async.
0197
0198 blkio.io_queued
0199 Total number of requests queued up at any given instant for this
0200 cgroup. This is further divided by the type of operation - read or
0201 write, sync or async.
0202
0203 blkio.avg_queue_size
0204 Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
0205 The average queue size for this cgroup over the entire time of this
0206 cgroup's existence. Queue size samples are taken each time one of the
0207 queues of this cgroup gets a timeslice.
0208
0209 blkio.group_wait_time
0210 Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
0211 This is the amount of time the cgroup had to wait since it became busy
0212 (i.e., went from 0 to 1 request queued) to get a timeslice for one of
0213 its queues. This is different from the io_wait_time which is the
0214 cumulative total of the amount of time spent by each IO in that cgroup
0215 waiting in the scheduler queue. This is in nanoseconds. If this is
0216 read when the cgroup is in a waiting (for timeslice) state, the stat
0217 will only report the group_wait_time accumulated till the last time it
0218 got a timeslice and will not include the current delta.
0219
0220 blkio.empty_time
0221 Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
0222 This is the amount of time a cgroup spends without any pending
0223 requests when not being served, i.e., it does not include any time
0224 spent idling for one of the queues of the cgroup. This is in
0225 nanoseconds. If this is read when the cgroup is in an empty state,
0226 the stat will only report the empty_time accumulated till the last
0227 time it had a pending request and will not include the current delta.
0228
0229 blkio.idle_time
0230 Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
0231 This is the amount of time spent by the IO scheduler idling for a
0232 given cgroup in anticipation of a better request than the existing ones
0233 from other queues/cgroups. This is in nanoseconds. If this is read
0234 when the cgroup is in an idling state, the stat will only report the
0235 idle_time accumulated till the last idle period and will not include
0236 the current delta.
0237
blkio.dequeue
    Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
    gives statistics about how many times a group was dequeued from the
    service tree of the device. First two fields specify the major
    and minor number of the device and third field specifies the number
    of times a group was dequeued from a particular device.
0244
0245 blkio.*_recursive
0246 Recursive version of various stats. These files show the
0247 same information as their non-recursive counterparts but
0248 include stats from all the descendant cgroups.
0249
0250 Throttling/Upper limit policy files
0251 -----------------------------------
blkio.throttle.read_bps_device
    Specifies upper limit on READ rate from the device. IO rate is
    specified in bytes per second. Rules are per device. Following is
    the format::

        echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

blkio.throttle.write_bps_device
    Specifies upper limit on WRITE rate to the device. IO rate is
    specified in bytes per second. Rules are per device. Following is
    the format::

        echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

blkio.throttle.read_iops_device
    Specifies upper limit on READ rate from the device. IO rate is
    specified in IO per second. Rules are per device. Following is
    the format::

        echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

blkio.throttle.write_iops_device
    Specifies upper limit on WRITE rate to the device. IO rate is
    specified in IO per second. Rules are per device. Following is
    the format::

        echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Note: If both BW and IOPS rules are specified for a device, then IO is
subjected to both constraints.
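
When both kinds of rule are set, the tighter one determines the achievable
rate for a given IO size. A sketch of that arithmetic, assuming fixed-size
IOs that are neither split nor merged; effective_bps is a hypothetical
helper, not a kernel interface::

    # Print the effective byte rate when both a bps and an iops limit apply.
    # $1 = bps limit, $2 = iops limit, $3 = IO size in bytes
    effective_bps() {
        by_iops=$(( $2 * $3 ))
        if [ "$by_iops" -lt "$1" ]; then
            echo "$by_iops"
        else
            echo "$1"
        fi
    }

    # 1 MiB/s and 100 IO/s with 4K IOs: the IOPS rule is the tighter one.
    effective_bps 1048576 100 4096    # prints 409600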
0282
0283 blkio.throttle.io_serviced
0284 Number of IOs (bio) issued to the disk by the group. These
0285 are further divided by the type of operation - read or write, sync
0286 or async. First two fields specify the major and minor number of the
0287 device, third field specifies the operation type and the fourth field
0288 specifies the number of IOs.
0289
0290 blkio.throttle.io_service_bytes
0291 Number of bytes transferred to/from the disk by the group. These
0292 are further divided by the type of operation - read or write, sync
0293 or async. First two fields specify the major and minor number of the
0294 device, third field specifies the operation type and the fourth field
0295 specifies the number of bytes.
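
Given the layout described above, individual counters can be extracted
with standard text tools. The sketch below assumes lines of the form
"8:16 Read 4194304", that is, the device numbers joined by a colon followed
by the operation type and the value. bytes_for is a hypothetical helper::

    # Read blkio.throttle.io_service_bytes-style lines on stdin and print
    # the value for device $1 ("major:minor") and operation type $2.
    bytes_for() {
        awk -v dev="$1" -v op="$2" '$1 == dev && $2 == op { print $3 }'
    }

    # Example:
    #   bytes_for 8:16 Read < /sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes

The same helper works for blkio.throttle.io_serviced, where the last field
is a number of IOs rather than bytes.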
0296
0297 Common files among various policies
0298 -----------------------------------
blkio.reset_stats
    Writing an integer to this file resets all the stats for that cgroup.
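
A minimal sketch of a reset, using a hypothetical helper named
reset_blkio_stats that takes the cgroup directory as its argument; the
value 1 is an arbitrary integer::

    # Reset all blkio statistics of the cgroup directory given as $1.
    reset_blkio_stats() {
        echo 1 > "$1/blkio.reset_stats"
    }

    # Example (as root):
    #   reset_blkio_stats /sys/fs/cgroup/blkio/test1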