Documentation/process/botching-up-ioctls.rst

0001 =================================
0002 (How to avoid) Botching up ioctls
0003 =================================
0004
0005 From: https://blog.ffwll.ch/2013/11/botching-up-ioctls.html
0006
0007 By: Daniel Vetter, Copyright © 2013 Intel Corporation
0008
0009 One clear insight kernel graphics hackers gained in the past few years is that
0010 trying to come up with a unified interface to manage the execution units and
0011 memory on completely different GPUs is a futile effort. So nowadays every
0012 driver has its own set of ioctls to allocate memory and submit work to the GPU.
0013 Which is nice, since there's no more insanity in the form of fake-generic, but
0014 actually only used once interfaces. But the clear downside is that there's much
0015 more potential to screw things up.
0016
0017 To avoid repeating all the same mistakes again I've written up some of the
0018 lessons learned while botching the job for the drm/i915 driver. Most of these
0019 only cover technicalities and not the big-picture issues like what the command
0020 submission ioctl exactly should look like. Learning these lessons is probably
0021 something every GPU driver has to do on its own.
0022
0023
0024 Prerequisites
0025 -------------
0026
0027 First the prerequisites. Without these you have already failed, because you
0028 will need to add a 32-bit compat layer:
0029
0030  * Only use fixed sized integers. To avoid conflicts with typedefs in userspace
0031    the kernel has special types like __u32, __s64. Use them.
0032
0033  * Align everything to the natural size and use explicit padding. 32-bit
0034    platforms don't necessarily align 64-bit values to 64-bit boundaries, but
0035    64-bit platforms do. So we always need padding to the natural size to get
0036    this right.
0037
0038  * Pad the entire struct to a multiple of 64-bits if the structure contains
0039    64-bit types - the structure size will otherwise differ on 32-bit versus
0040    64-bit. Having a different structure size hurts when passing arrays of
0041    structures to the kernel, or if the kernel checks the structure size, which
0042    e.g. the drm core does.
0043
0044  * Pointers are __u64, cast from/to a uintprt_t on the userspace side and
0045    from/to a void __user * in the kernel. Try really hard not to delay this
0046    conversion or worse, fiddle the raw __u64 through your code since that
0047    diminishes the checking tools like sparse can provide. The macro
0048    u64_to_user_ptr can be used in the kernel to avoid warnings about integers
0049    and pointers of different sizes.
0050
0051
0052 Basics
0053 ------
0054
0055 With the joys of writing a compat layer avoided we can take a look at the basic
0056 fumbles. Neglecting these will make backward and forward compatibility a real
0057 pain. And since getting things wrong on the first attempt is guaranteed you
0058 will have a second iteration or at least an extension for any given interface.
0059
0060  * Have a clear way for userspace to figure out whether your new ioctl or ioctl
0061    extension is supported on a given kernel. If you can't rely on old kernels
0062    rejecting the new flags/modes or ioctls (since doing that was botched in the
0063    past) then you need a driver feature flag or revision number somewhere.
0064
0065  * Have a plan for extending ioctls with new flags or new fields at the end of
0066    the structure. The drm core checks the passed-in size for each ioctl call
0067    and zero-extends any mismatches between kernel and userspace. That helps,
0068    but isn't a complete solution since newer userspace on older kernels won't
0069    notice that the newly added fields at the end get ignored. So this still
0070    needs a new driver feature flags.
0071
0072  * Check all unused fields and flags and all the padding for whether it's 0,
0073    and reject the ioctl if that's not the case. Otherwise your nice plan for
0074    future extensions is going right down the gutters since someone will submit
0075    an ioctl struct with random stack garbage in the yet unused parts. Which
0076    then bakes in the ABI that those fields can never be used for anything else
0077    but garbage. This is also the reason why you must explicitly pad all
0078    structures, even if you never use them in an array - the padding the compiler
0079    might insert could contain garbage.
0080
0081  * Have simple testcases for all of the above.
0082
0083
0084 Fun with Error Paths
0085 --------------------
0086
0087 Nowadays we don't have any excuse left any more for drm drivers being neat
0088 little root exploits. This means we both need full input validation and solid
0089 error handling paths - GPUs will die eventually in the oddmost corner cases
0090 anyway:
0091
0092  * The ioctl must check for array overflows. Also it needs to check for
0093    over/underflows and clamping issues of integer values in general. The usual
0094    example is sprite positioning values fed directly into the hardware with the
0095    hardware just having 12 bits or so. Works nicely until some odd display
0096    server doesn't bother with clamping itself and the cursor wraps around the
0097    screen.
0098
0099  * Have simple testcases for every input validation failure case in your ioctl.
0100    Check that the error code matches your expectations. And finally make sure
0101    that you only test for one single error path in each subtest by submitting
0102    otherwise perfectly valid data. Without this an earlier check might reject
0103    the ioctl already and shadow the codepath you actually want to test, hiding
0104    bugs and regressions.
0105
0106  * Make all your ioctls restartable. First X really loves signals and second
0107    this will allow you to test 90% of all error handling paths by just
0108    interrupting your main test suite constantly with signals. Thanks to X's
0109    love for signal you'll get an excellent base coverage of all your error
0110    paths pretty much for free for graphics drivers. Also, be consistent with
0111    how you handle ioctl restarting - e.g. drm has a tiny drmIoctl helper in its
0112    userspace library. The i915 driver botched this with the set_tiling ioctl,
0113    now we're stuck forever with some arcane semantics in both the kernel and
0114    userspace.
0115
0116  * If you can't make a given codepath restartable make a stuck task at least
0117    killable. GPUs just die and your users won't like you more if you hang their
0118    entire box (by means of an unkillable X process). If the state recovery is
0119    still too tricky have a timeout or hangcheck safety net as a last-ditch
0120    effort in case the hardware has gone bananas.
0121
0122  * Have testcases for the really tricky corner cases in your error recovery code
0123    - it's way too easy to create a deadlock between your hangcheck code and
0124    waiters.
0125
0126
0127 Time, Waiting and Missing it
0128 ----------------------------
0129
0130 GPUs do most everything asynchronously, so we have a need to time operations and
0131 wait for outstanding ones. This is really tricky business; at the moment none of
0132 the ioctls supported by the drm/i915 get this fully right, which means there's
0133 still tons more lessons to learn here.
0134
0135  * Use CLOCK_MONOTONIC as your reference time, always. It's what alsa, drm and
0136    v4l use by default nowadays. But let userspace know which timestamps are
0137    derived from different clock domains like your main system clock (provided
0138    by the kernel) or some independent hardware counter somewhere else. Clocks
0139    will mismatch if you look close enough, but if performance measuring tools
0140    have this information they can at least compensate. If your userspace can
0141    get at the raw values of some clocks (e.g. through in-command-stream
0142    performance counter sampling instructions) consider exposing those also.
0143
0144  * Use __s64 seconds plus __u64 nanoseconds to specify time. It's not the most
0145    convenient time specification, but it's mostly the standard.
0146
0147  * Check that input time values are normalized and reject them if not. Note
0148    that the kernel native struct ktime has a signed integer for both seconds
0149    and nanoseconds, so beware here.
0150
0151  * For timeouts, use absolute times. If you're a good fellow and made your
0152    ioctl restartable relative timeouts tend to be too coarse and can
0153    indefinitely extend your wait time due to rounding on each restart.
0154    Especially if your reference clock is something really slow like the display
0155    frame counter. With a spec lawyer hat on this isn't a bug since timeouts can
0156    always be extended - but users will surely hate you if their neat animations
0157    starts to stutter due to this.
0158
0159  * Consider ditching any synchronous wait ioctls with timeouts and just deliver
0160    an asynchronous event on a pollable file descriptor. It fits much better
0161    into event driven applications' main loop.
0162
0163  * Have testcases for corner-cases, especially whether the return values for
0164    already-completed events, successful waits and timed-out waits are all sane
0165    and suiting to your needs.
0166
0167
0168 Leaking Resources, Not
0169 ----------------------
0170
0171 A full-blown drm driver essentially implements a little OS, but specialized to
0172 the given GPU platforms. This means a driver needs to expose tons of handles
0173 for different objects and other resources to userspace. Doing that right
0174 entails its own little set of pitfalls:
0175
0176  * Always attach the lifetime of your dynamically created resources to the
0177    lifetime of a file descriptor. Consider using a 1:1 mapping if your resource
0178    needs to be shared across processes -  fd-passing over unix domain sockets
0179    also simplifies lifetime management for userspace.
0180
0181  * Always have O_CLOEXEC support.
0182
0183  * Ensure that you have sufficient insulation between different clients. By
0184    default pick a private per-fd namespace which forces any sharing to be done
0185    explicitly. Only go with a more global per-device namespace if the objects
0186    are truly device-unique. One counterexample in the drm modeset interfaces is
0187    that the per-device modeset objects like connectors share a namespace with
0188    framebuffer objects, which mostly are not shared at all. A separate
0189    namespace, private by default, for framebuffers would have been more
0190    suitable.
0191
0192  * Think about uniqueness requirements for userspace handles. E.g. for most drm
0193    drivers it's a userspace bug to submit the same object twice in the same
0194    command submission ioctl. But then if objects are shareable userspace needs
0195    to know whether it has seen an imported object from a different process
0196    already or not. I haven't tried this myself yet due to lack of a new class
0197    of objects, but consider using inode numbers on your shared file descriptors
0198    as unique identifiers - it's how real files are told apart, too.
0199    Unfortunately this requires a full-blown virtual filesystem in the kernel.
0200
0201
0202 Last, but not Least
0203 -------------------
0204
0205 Not every problem needs a new ioctl:
0206
0207  * Think hard whether you really want a driver-private interface. Of course
0208    it's much quicker to push a driver-private interface than engaging in
0209    lengthy discussions for a more generic solution. And occasionally doing a
0210    private interface to spearhead a new concept is what's required. But in the
0211    end, once the generic interface comes around you'll end up maintainer two
0212    interfaces. Indefinitely.
0213
0214  * Consider other interfaces than ioctls. A sysfs attribute is much better for
0215    per-device settings, or for child objects with fairly static lifetimes (like
0216    output connectors in drm with all the detection override attributes). Or
0217    maybe only your testsuite needs this interface, and then debugfs with its
0218    disclaimer of not having a stable ABI would be better.
0219
0220 Finally, the name of the game is to get it right on the first attempt, since if
0221 your driver proves popular and your hardware platforms long-lived then you'll
0222 be stuck with a given ioctl essentially forever. You can try to deprecate
0223 horrible ioctls on newer iterations of your hardware, but generally it takes
0224 years to accomplish this. And then again years until the last user able to
0225 complain about regressions disappears, too.