Documentation/process/handling-regressions.rst

0001 .. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
0002 .. See the bottom of this file for additional redistribution information.
0003
0004 Handling regressions
0005 ++++++++++++++++++++
0006
0007 *We don't cause regressions* -- this document describes what this "first rule of
0008 Linux kernel development" means in practice for developers. It complements
0009 Documentation/admin-guide/reporting-regressions.rst, which covers the topic from a
0010 user's point of view; if you have never read that text, go and at least skim
0011 over it before continuing here.
0012
0013 The important bits (aka "The TL;DR")
0014 ====================================
0015
0016 #. Ensure subscribers of the `regression mailing list <https://lore.kernel.org/regressions/>`_
0017    (regressions@lists.linux.dev) quickly become aware of any new regression
0018    report:
0019
0020     * When receiving a mailed report that did not CC the list, bring it into the
0021       loop by immediately sending at least a brief "Reply-all" with the list
0022       CCed.
0023
0024     * Forward or bounce any reports submitted in bug trackers to the list.
0025
0026 #. Make the Linux kernel regression tracking bot "regzbot" track the issue (this
0027    is optional, but recommended):
0028
0029     * For mailed reports, check if the reporter included a line like ``#regzbot
0030       introduced v5.13..v5.14-rc1``. If not, send a reply (with the regressions
0031       list in CC) containing a paragraph like the following, which tells regzbot
0032       when the issue started to happen::
0033
0034        #regzbot ^introduced 1f2e3d4c5b6a
0035
0036     * When forwarding reports from a bug tracker to the regressions list (see
0037       above), include a paragraph like the following::
0038
0039        #regzbot introduced: v5.13..v5.14-rc1
0040        #regzbot from: Some N. Ice Human <some.human@example.com>
0041        #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
0042
0043 #. When submitting fixes for regressions, add "Link:" tags to the patch
0044    description pointing to all places where the issue was reported, as
0045    mandated by Documentation/process/submitting-patches.rst and
0046    :ref:`Documentation/process/5.Posting.rst <development_posting>`.
0047
0048 #. Try to fix regressions quickly once the culprit has been identified; fixes
0049    for most regressions should be merged within two weeks, but some need to be
0050    resolved within two or three days.
0051
0052
0053 All the details on Linux kernel regressions relevant for developers
0054 ===================================================================
0055
0056
0057 The important basics in more detail
0058 -----------------------------------
0059
0060
0061 What to do when receiving regression reports
0062 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0063
0064 Ensure the Linux kernel's regression tracker and other subscribers of the
0065 `regression mailing list <https://lore.kernel.org/regressions/>`_
0066 (regressions@lists.linux.dev) become aware of any newly reported regression:
0067
0068  * When you receive a report by mail that did not CC the list, immediately bring
0069    it into the loop by sending at least a brief "Reply-all" with the list CCed;
0070    try to ensure it gets CCed again in case you reply to a reply that omitted
0071    the list.
0072
0073  * If a report submitted in a bug tracker hits your Inbox, forward or bounce it
0074    to the list. Consider checking the list archives beforehand to see whether
0075    the reporter already forwarded the report as instructed by
0076    Documentation/admin-guide/reporting-issues.rst.
0077
0078 When doing either, consider making the Linux kernel regression tracking bot
0079 "regzbot" immediately start tracking the issue:
0080
0081  * For mailed reports, check if the reporter included a "regzbot command" like
0082    ``#regzbot introduced 1f2e3d4c5b6a``. If not, send a reply (with the
0083    regressions list in CC) with a paragraph like the following::
0084
0085        #regzbot ^introduced: v5.13..v5.14-rc1
0086
0087    This tells regzbot the version range in which the issue started to happen;
0088    you can specify a range using commit-ids as well or state a single commit-id
0089    in case the reporter bisected the culprit.
0090
0091    Note the caret (^) before the "introduced": it tells regzbot to treat the
0092    parent mail (the one you reply to) as the initial report for the regression
0093    you want to see tracked; that's important, as regzbot will later look out
0094    for patches with "Link:" tags pointing to the report in the archives on
0095    lore.kernel.org.
0096
0097  * When forwarding a regression reported to a bug tracker, include a paragraph
0098    with these regzbot commands::
0099
0100        #regzbot introduced: 1f2e3d4c5b6a
0101        #regzbot from: Some N. Ice Human <some.human@example.com>
0102        #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
0103
0104    Regzbot will then automatically associate patches with the report that
0105    contain "Link:" tags pointing to your mail or the mentioned ticket.
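
The forwarding step above can be sketched in Python. This is only an
illustration: the helper names are hypothetical and not part of regzbot, while
the three commands and the example values are the ones shown above. The blank
lines around the regzbot paragraph keep the commands in a paragraph of their
own.

```python
def regzbot_forward_paragraph(introduced, reporter, ticket_url):
    """Build the regzbot paragraph for a report forwarded from a bug tracker.

    Hypothetical helper for illustration; it only formats the three commands
    described in the text above.
    """
    return "\n".join([
        f"#regzbot introduced: {introduced}",
        f"#regzbot from: {reporter}",
        f"#regzbot monitor: {ticket_url}",
    ])


def forwarded_mail_body(intro_text, regzbot_paragraph, original_report):
    """Join the parts of the forwarded mail, separated by blank lines."""
    parts = [intro_text.strip(), regzbot_paragraph, original_report.strip()]
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    paragraph = regzbot_forward_paragraph(
        "1f2e3d4c5b6a",
        "Some N. Ice Human <some.human@example.com>",
        "http://some.bugtracker.example.com/ticket?id=123456789",
    )
    print(forwarded_mail_body("Forwarding a report from the bug tracker.",
                              paragraph,
                              "(text of the original report)"))
```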
0106
0107 What's important when fixing regressions
0108 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0109
0110 You don't need to do anything special when submitting fixes for regressions, just
0111 remember to do what Documentation/process/submitting-patches.rst,
0112 :ref:`Documentation/process/5.Posting.rst <development_posting>`, and
0113 Documentation/process/stable-kernel-rules.rst already explain in more detail:
0114
0115  * Point to all places where the issue was reported using "Link:" tags::
0116
0117        Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
0118        Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
0119
0120  * Add a "Fixes:" tag to specify the commit causing the regression.
0121
0122  * If the culprit was merged in an earlier development cycle, explicitly mark
0123    the fix for backporting using the ``Cc: stable@vger.kernel.org`` tag.
0124
0125 All this is expected from you and important when it comes to regressions, as
0126 these tags are of great value for everyone (you included) who might be looking
0127 into the issue weeks, months, or years later. These tags are also crucial for
0128 tools and scripts used by other kernel developers or Linux distributions; one of
0129 these tools is regzbot, which heavily relies on the "Link:" tags to associate
0130 regression reports with changes resolving them.
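
As a rough illustration, the three tags could be assembled like this in Python.
The helper name, the commit id, and the subject are made up for the example.
The ``Fixes:`` layout (the first 12 characters of the commit id followed by the
quoted subject) is the one described in
Documentation/process/submitting-patches.rst. The order of the tags is just the
order used in the list above, not a requirement.

```python
import re


def regression_fix_trailers(report_urls, culprit_sha, culprit_subject,
                            needs_stable):
    """Return the tag block for a regression fix (hypothetical helper).

    report_urls:     every place where the issue was reported
    culprit_sha:     commit id of the change that caused the regression
    culprit_subject: subject line of that commit
    needs_stable:    True if the culprit was merged in an earlier cycle
    """
    if not re.fullmatch(r"[0-9a-f]{12,40}", culprit_sha):
        raise ValueError("expected at least 12 hex characters of the commit id")

    trailers = [f"Link: {url}" for url in report_urls]
    trailers.append(f'Fixes: {culprit_sha[:12]} ("{culprit_subject}")')
    if needs_stable:
        trailers.append("Cc: stable@vger.kernel.org")
    return "\n".join(trailers)


if __name__ == "__main__":
    print(regression_fix_trailers(
        ["https://bugzilla.kernel.org/show_bug.cgi?id=1234567890"],
        "1f2e3d4c5b6a7f8e9d0c",        # made-up commit id
        "foo: rework bar handling",    # made-up subject
        needs_stable=True,
    ))
```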
0131
0132 Prioritize work on fixing regressions
0133 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0134
0135 You should fix any reported regression as quickly as possible, to provide
0136 affected users with a solution in a timely manner and prevent more users from
0137 running into the issue; nevertheless developers need to take enough time and
0138 care to ensure regression fixes do not cause additional damage.
0139
0140 In the end though, developers should do their best to prevent users from
0141 running into situations where a regression leaves them only three options: "run
0142 a kernel with a regression that seriously impacts usage", "continue running an
0143 outdated and thus potentially insecure kernel version for more than two weeks
0144 after a regression's culprit was identified", and "downgrade to a still
0145 supported kernel series that lacks required features".
0146
0147 How to realize this depends a lot on the situation. Here are a few rules of
0148 thumb for you, in order of importance:
0149
0150  * Prioritize work on handling regression reports and fixing regressions over all
0151    other Linux kernel work, unless the latter concerns acute security issues or
0152    bugs causing data loss or damage.
0153
0154  * Always consider reverting the culprit commits and reapplying them later
0155    together with necessary fixes, as this might be the least dangerous and
0156    quickest way to fix a regression.
0157
0158  * Developers should handle regressions in all supported kernel series, but are
0159    free to delegate the work to the stable team if the issue probably never
0160    occurred in mainline.
0161
0162  * Try to resolve any regressions introduced in the current development cycle
0163    before its end. If you fear a fix might be too risky to apply only days before
0164    a new mainline release, let Linus decide: submit the fix separately to him as
0165    soon as possible with an explanation of the situation. He then can make a call
0166    and postpone the release if necessary, for example if multiple such changes
0167    show up in his inbox.
0168
0169  * Address regressions in stable, longterm, or proper mainline releases with
0170    more urgency than regressions in mainline pre-releases. That changes after
0171    the release of the fifth pre-release, aka "-rc5": mainline then becomes as
0172    important, to ensure all the improvements and fixes are ideally tested
0173    together for at least one week before Linus releases a new mainline version.
0174
0175  * Fix regressions within two or three days, if they are critical for some
0176    reason -- for example, if the issue is likely to affect many users of the
0177    kernel series in question on all or certain architectures. Note, this
0178    includes mainline, as issues like compile errors otherwise might prevent many
0179    testers or continuous integration systems from testing the series.
0180
0181  * Aim to fix regressions within one week after the culprit was identified, if
0182    the issue was introduced in either:
0183
0184     * a recent stable/longterm release
0185
0186     * the development cycle of the latest proper mainline release
0187
0188    In the latter case (say Linux v5.14), try to address regressions even
0189    quicker, if the stable series for the predecessor (v5.13) will be abandoned
0190    soon or already was stamped "End-of-Life" (EOL) -- this usually happens about
0191    three to four weeks after a new mainline release.
0192
0193  * Try to fix all other regressions within two weeks after the culprit was
0194    found. Two or three additional weeks are acceptable for performance
0195    regressions and other issues which are annoying, but don't prevent anyone
0196    from running Linux (unless it's an issue in the current development cycle,
0197    as those should ideally be addressed before the release). A few weeks in
0198    total are acceptable if a regression can only be fixed with a risky change
0199    and at the same time is affecting only a few users; as much time is
0200    also okay if the regression is already present in the second newest longterm
0201    kernel series.
0202
0203 Note: The aforementioned time frames for resolving regressions are meant to
0204 include getting the fix tested, reviewed, and merged into mainline, ideally with
0205 the fix being in linux-next at least briefly. This leads to delays you need to
0206 account for.
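
The time frames above can be condensed into a small Python sketch. It is a
deliberate simplification: the names ``Origin``, ``target_days`` and
``target_date`` are invented for this example, the upper bound is used wherever
the text gives a range, and the special cases (risky fixes affecting few users,
older longterm series, an upcoming EOL of the predecessor series) are left out.

```python
from datetime import date, timedelta
from enum import Enum


class Origin(Enum):
    """Where the regression was introduced (hypothetical classification)."""
    RECENT_STABLE = "recent stable/longterm release"
    LATEST_MAINLINE_CYCLE = "development cycle of the latest mainline release"
    CURRENT_CYCLE = "current development cycle"
    OTHER = "anything else"


def target_days(origin, critical=False, merely_annoying=False):
    """Days after the culprit was identified in which a fix should be merged."""
    if critical:
        return 3                     # "within two or three days"
    if origin in (Origin.RECENT_STABLE, Origin.LATEST_MAINLINE_CYCLE):
        return 7                     # "within one week"
    if merely_annoying and origin is not Origin.CURRENT_CYCLE:
        return 14 + 21               # two weeks plus up to three more
    return 14                        # "all other regressions"


def target_date(culprit_identified, origin, **kwargs):
    """Calendar date matching target_days(), counted from identification."""
    return culprit_identified + timedelta(days=target_days(origin, **kwargs))


if __name__ == "__main__":
    print(target_date(date(2022, 1, 3), Origin.RECENT_STABLE))
```

Keep in mind that the result is the date by which the fix should be merged into
mainline, so testing and review have to fit into the same period.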
0207
0208 Subsystem maintainers are expected to help meet those time frames by doing
0209 timely reviews and quick handling of accepted patches. They thus might have to
0210 send git-pull requests earlier or more often than usual; depending on the fix,
0211 it might even be acceptable to skip testing in linux-next. Especially fixes for
0212 regressions in stable and longterm kernels need to be handled quickly, as fixes
0213 need to be merged in mainline before they can be backported to older series.
0214
0215
0216 More aspects regarding regressions developers should be aware of
0217 ----------------------------------------------------------------
0218
0219
0220 How to deal with changes where a risk of regression is known
0221 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0222
0223 Evaluate how big the risk of regressions is, for example by performing a code
0224 search in Linux distributions and Git forges. Also consider asking other
0225 developers or projects likely to be affected to evaluate or even test the
0226 proposed change; if problems surface, maybe some solution acceptable for all
0227 can be found.
0228
0229 If the risk of regressions in the end seems to be relatively small, go ahead
0230 with the change, but let all involved parties know about the risk. Hence, make
0231 sure your patch description makes this aspect obvious. Once the change is
0232 merged, tell the Linux kernel's regression tracker and the regressions mailing
0233 list about the risk, so everyone has the change on the radar in case reports
0234 trickle in. Depending on the risk, you also might want to ask the subsystem
0235 maintainer to mention the issue in his mainline pull request.
0236
0237 What else is there to know about regressions?
0238 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0239
0240 Check out Documentation/admin-guide/reporting-regressions.rst; it covers a lot
0241 of other aspects you might want to be aware of:
0242
0243  * the purpose of the "no regressions rule"
0244
0245  * what issues actually qualify as a regression
0246
0247  * who's in charge of finding the root cause of a regression
0248
0249  * how to handle tricky situations, e.g. when a regression is caused by a
0250    security fix or when fixing a regression might cause another one
0251
0252 Whom to ask for advice when it comes to regressions
0253 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0254
0255 Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
0256 CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
0257 issue might better be dealt with in private, feel free to omit the list.
0258
0259
0260 More about regression tracking and regzbot
0261 ------------------------------------------
0262
0263
0264 Why the Linux kernel has a regression tracker, and why is regzbot used?
0265 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0266
0267 Rules like "no regressions" need someone to ensure they are followed, otherwise
0268 they are broken either accidentally or on purpose. History has shown this to be
0269 true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to
0270 keep an eye on things as the Linux kernel's regression tracker, who's
0271 occasionally helped by other people. None of them is paid to do this,
0272 which is why regression tracking is done on a best effort basis.
0273
0274 Earlier attempts to manually track regressions have shown it's exhausting and
0275 frustrating work, which is why they were abandoned after a while. To prevent
0276 this from happening again, Thorsten developed regzbot to facilitate the work,
0277 with the long term goal to automate regression tracking as much as possible for
0278 everyone involved.
0279
0280 How does regression tracking work with regzbot?
0281 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0282
0283 The bot watches for replies to reports of tracked regressions. Additionally,
0284 it's looking out for posted or committed patches referencing such reports
0285 with "Link:" tags; replies to such patch postings are tracked as well.
0286 Combined, this data provides good insights into the current state of the fixing
0287 process.
0288
0289 Regzbot tries to do its job with as little overhead as possible for both
0290 reporters and developers. In fact, only reporters are burdened with an extra
0291 duty: they need to tell regzbot about the regression report using the ``#regzbot
0292 introduced`` command outlined above; if they don't do that, someone else can
0293 take care of that using ``#regzbot ^introduced``.
0294
0295 For developers there normally is no extra work involved; they just need to make
0296 sure to do something that was expected long before regzbot came to light: add
0297 "Link:" tags to the patch description pointing to all reports about the issue
0298 fixed.
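
To illustrate why the "Link:" tags matter, here is a rough Python sketch of
how a tool could associate a commit with tracked reports. It is not regzbot's
actual implementation; the function name is hypothetical and URLs are compared
literally, apart from a trailing slash.

```python
def linked_reports(commit_message, tracked_report_urls):
    """Return the tracked reports a commit message points to via "Link:" tags.

    Hypothetical helper: it collects the URL of every line starting with
    "Link:" and keeps the tracked reports whose URL matches one of them.
    """
    links = set()
    for line in commit_message.splitlines():
        if line.startswith("Link:"):
            # Everything after the first colon is the URL.
            links.add(line.split(":", 1)[1].strip().rstrip("/"))
    return sorted(url for url in tracked_report_urls
                  if url.rstrip("/") in links)


if __name__ == "__main__":
    message = (
        "foo: fix bar handling\n"
        "\n"
        "Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/\n"
    )
    tracked = [
        "https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/",
        "https://bugzilla.kernel.org/show_bug.cgi?id=1234567890",
    ]
    print(linked_reports(message, tracked))
```

A fix without such a tag would yield an empty list here, which is the situation
in which somebody has to tell the tracking about the fix by hand.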
0299
0300 Do I have to use regzbot?
0301 ~~~~~~~~~~~~~~~~~~~~~~~~~
0302
0303 It's in the interest of everyone if you do, as kernel maintainers like Linus
0304 Torvalds partly rely on regzbot's tracking in their work -- for example when
0305 deciding to release a new version or extend the development phase. For this they
0306 need to be aware of all unfixed regression; to do that, Linus is known to look
0307 into the weekly reports sent by regzbot.
0308
0309 Do I have to tell regzbot about every regression I stumble upon?
0310 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0311
0312 Ideally yes: we are all humans and easily forget problems when something more
0313 important unexpectedly comes up -- for example a bigger problem in the Linux
0314 kernel or something in real life that's keeping us away from keyboards for a
0315 while. Hence, it's best to tell regzbot about every regression, except when you
0316 immediately write a fix and commit it to a tree regularly merged to the affected
0317 kernel series.
0318
0319 How to see which regressions regzbot tracks currently?
0320 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0321
0322 Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
0323 for the latest info; alternatively, `search for the latest regression report
0324 <https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
0325 which regzbot normally sends out once a week on Sunday evening (UTC), which is a
0326 few hours before Linus usually publishes new (pre-)releases.
0327
0328 What places is regzbot monitoring?
0329 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0330
0331 Regzbot is watching the most important Linux mailing lists as well as the git
0332 repositories of linux-next, mainline, and stable/longterm.
0333
0334 What kind of issues are supposed to be tracked by regzbot?
0335 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0336
0337 The bot is meant to track regressions, hence please don't involve regzbot for
0338 regular issues. But it's okay for the Linux kernel's regression tracker if you
0339 use regzbot to track severe issues, like reports about hangs, corrupted data,
0340 or internal errors (Panic, Oops, BUG(), warning, ...).
0341
0342 Can I add regressions found by CI systems to regzbot's tracking?
0343 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0344
0345 Feel free to do so, if the particular regression likely has impact on practical
0346 use cases and thus might be noticed by users; hence, please don't involve
0347 regzbot for theoretical regressions unlikely to show themselves in real world
0348 usage.
0349
0350 How to interact with regzbot?
0351 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0352
0353 By using a 'regzbot command' in a direct or indirect reply to the mail with the
0354 regression report. These commands need to be in their own paragraph (IOW: they
0355 need to be separated from the rest of the mail using blank lines).
0356
0357 One such command is ``#regzbot introduced <version or commit>``, which makes
0358 regzbot consider your mail as a regression report added to the tracking, as
0359 already described above; ``#regzbot ^introduced <version or commit>`` is another
0360 such command, which makes regzbot consider the parent mail as a report for a
0361 regression which it starts to track.
0362
0363 Once one of those two commands has been utilized, other regzbot commands can be
0364 used in direct or indirect replies to the report. You can write them below one
0365 of the ``introduced`` commands or in replies to the mail that used one of them
0366 or itself is a reply to that mail:
0367
0368  * Set or update the title::
0369
0370        #regzbot title: foo
0371
0372  * Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of
0373    the issue or a fix are discussed -- for example the posting of a patch fixing
0374    the regression::
0375
0376        #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
0377
0378    Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot
0379    will consider all messages in that thread or ticket as related to the fixing
0380    process.
0381
0382  * Point to a place with further details of interest, like a mailing list post
0383    or a ticket in a bug tracker that are slightly related, but about a different
0384    topic::
0385
0386        #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
0387
0388  * Mark a regression as fixed by a commit that is heading upstream or already
0389    landed::
0390
0391        #regzbot fixed-by: 1f2e3d4c5d
0392
0393  * Mark a regression as a duplicate of another one already tracked by regzbot::
0394
0395        #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
0396
0397  * Mark a regression as invalid::
0398
0399        #regzbot invalid: wasn't a regression, problem has always existed
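
The "own paragraph" rule can be illustrated with a short Python sketch that
pulls regzbot commands out of a mail body. This is an assumption about how
such a filter might look, not regzbot's real parser; the function name is
hypothetical.

```python
import re


def regzbot_commands(mail_body):
    """Return (command, argument) pairs found in a mail body.

    Only paragraphs consisting solely of "#regzbot ..." lines are considered,
    mirroring the rule that commands need to be separated from the rest of
    the mail by blank lines.
    """
    commands = []
    # Paragraphs are runs of lines separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", mail_body.strip()):
        lines = [line.strip() for line in paragraph.splitlines()]
        if lines and all(line.startswith("#regzbot ") for line in lines):
            for line in lines:
                rest = line[len("#regzbot "):]
                name, _, argument = rest.partition(" ")
                # The colon after the command name is dropped if present.
                commands.append((name.rstrip(":"), argument.strip()))
    return commands


if __name__ == "__main__":
    body = (
        "Hi,\n"
        "\n"
        "thanks for the report.\n"
        "\n"
        "#regzbot ^introduced: v5.13..v5.14-rc1\n"
        "#regzbot title: foo\n"
    )
    for command, argument in regzbot_commands(body):
        print(command, "->", argument)
```

A command mentioned in the middle of a sentence is ignored by this sketch,
because its paragraph contains text other than regzbot commands.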
0400
0401 Is there more to tell about regzbot and its commands?
0402 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0403
0404 More detailed and up-to-date information about the Linux
0405 kernel's regression tracking bot can be found on its
0406 `project page <https://gitlab.com/knurd42/regzbot>`_, which among other things
0407 contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
0408 and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
0409 which both cover more details than the above section.
0410
0411 Quotes from Linus about regression
0412 ----------------------------------
0413
0414 Find below a few real life examples of how Linus Torvalds expects regressions to
0415 be handled:
0416
0417  * From `2017-10-26 (1/2)
0418    <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
0419
0420        If you break existing user space setups THAT IS A REGRESSION.
0421
0422        It's not ok to say "but we'll fix the user space setup".
0423
0424        Really. NOT OK.
0425
0426        [...]
0427
0428        The first rule is:
0429
0430         - we don't cause regressions
0431
0432        and the corollary is that when regressions *do* occur, we admit to
0433        them and fix them, instead of blaming user space.
0434
0435        The fact that you have apparently been denying the regression now for
0436        three weeks means that I will revert, and I will stop pulling apparmor
0437        requests until the people involved understand how kernel development
0438        is done.
0439
0440  * From `2017-10-26 (2/2)
0441    <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
0442
0443        People should basically always feel like they can update their kernel
0444        and simply not have to worry about it.
0445
0446        I refuse to introduce "you can only update the kernel if you also
0447        update that other program" kind of limitations. If the kernel used to
0448        work for you, the rule is that it continues to work for you.
0449
0450        There have been exceptions, but they are few and far between, and they
0451        generally have some major and fundamental reasons for having happened,
0452        that were basically entirely unavoidable, and people _tried_hard_ to
0453        avoid them. Maybe we can't practically support the hardware any more
0454        after it is decades old and nobody uses it with modern kernels any
0455        more. Maybe there's a serious security issue with how we did things,
0456        and people actually depended on that fundamentally broken model. Maybe
0457        there was some fundamental other breakage that just _had_ to have a
0458        flag day for very core and fundamental reasons.
0459
0460        And notice that this is very much about *breaking* peoples environments.
0461
0462        Behavioral changes happen, and maybe we don't even support some
0463        feature any more. There's a number of fields in /proc/<pid>/stat that
0464        are printed out as zeroes, simply because they don't even *exist* in
0465        the kernel any more, or because showing them was a mistake (typically
0466        an information leak). But the numbers got replaced by zeroes, so that
0467        the code that used to parse the fields still works. The user might not
0468        see everything they used to see, and so behavior is clearly different,
0469        but things still _work_, even if they might no longer show sensitive
0470        (or no longer relevant) information.
0471
0472        But if something actually breaks, then the change must get fixed or
0473        reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
0474        your user space then". It was a kernel change that exposed the
0475        problem, it needs to be the kernel that corrects for it, because we
0476        have a "upgrade in place" model. We don't have a "upgrade with new
0477        user space".
0478
0479        And I seriously will refuse to take code from people who do not
0480        understand and honor this very simple rule.
0481
0482        This rule is also not going to change.
0483
0484        And yes, I realize that the kernel is "special" in this respect. I'm
0485        proud of it.
0486
0487        I have seen, and can point to, lots of projects that go "We need to
0488        break that use case in order to make progress" or "you relied on
0489        undocumented behavior, it sucks to be you" or "there's a better way to
0490        do what you want to do, and you have to change to that new better
0491        way", and I simply don't think that's acceptable outside of very early
0492        alpha releases that have experimental users that know what they signed
0493        up for. The kernel hasn't been in that situation for the last two
0494        decades.
0495
0496        We do API breakage _inside_ the kernel all the time. We will fix
0497        internal problems by saying "you now need to do XYZ", but then it's
0498        about internal kernel API's, and the people who do that then also
0499        obviously have to fix up all the in-kernel users of that API. Nobody
0500        can say "I now broke the API you used, and now _you_ need to fix it
0501        up". Whoever broke something gets to fix it too.
0502
0503        And we simply do not break user space.
0504
0505  * From `2020-05-21
0506    <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
0507
0508        The rules about regressions have never been about any kind of
0509        documented behavior, or where the code lives.
0510
0511        The rules about regressions are always about "breaks user workflow".
0512
0513        Users are literally the _only_ thing that matters.
0514
0515        No amount of "you shouldn't have used this" or "that behavior was
0516        undefined, it's your own fault your app broke" or "that used to work
0517        simply because of a kernel bug" is at all relevant.
0518
0519        Now, reality is never entirely black-and-white. So we've had things
0520        like "serious security issue" etc that just forces us to make changes
0521        that may break user space. But even then the rule is that we don't
0522        really have other options that would allow things to continue.
0523
0524        And obviously, if users take years to even notice that something
0525        broke, or if we have sane ways to work around the breakage that
0526        doesn't make for too much trouble for users (ie "ok, there are a
0527        handful of users, and they can use a kernel command line to work
0528        around it" kind of things) we've also been a bit less strict.
0529
0530        But no, "that was documented to be broken" (whether it's because the
0531        code was in staging or because the man-page said something else) is
0532        irrelevant. If staging code is so useful that people end up using it,
0533        that means that it's basically regular kernel code with a flag saying
0534        "please clean this up".
0535
0536        The other side of the coin is that people who talk about "API
0537        stability" are entirely wrong. API's don't matter either. You can make
0538        any changes to an API you like - as long as nobody notices.
0539
0540        Again, the regression rule is not about documentation, not about
0541        API's, and not about the phase of the moon.
0542
0543        It's entirely about "we caused problems for user space that used to work".
0544
0545  * From `2017-11-05
0546    <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
0547
0548        And our regression rule has never been "behavior doesn't change".
0549        That would mean that we could never make any changes at all.
0550
0551        For example, we do things like add new error handling etc all the
0552        time, which we then sometimes even add tests for in our kselftest
0553        directory.
0554
0555        So clearly behavior changes all the time and we don't consider that a
0556        regression per se.
0557
0558        The rule for a regression for the kernel is that some real user
0559        workflow breaks. Not some test. Not a "look, I used to be able to do
0560        X, now I can't".
0561
0562  * From `2018-08-03
0563    <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
0564
0565        YOU ARE MISSING THE #1 KERNEL RULE.
0566
0567        We do not regress, and we do not regress exactly because your are 100% wrong.
0568
0569        And the reason you state for your opinion is in fact exactly *WHY* you
0570        are wrong.
0571
0572        Your "good reasons" are pure and utter garbage.
0573
0574        The whole point of "we do not regress" is so that people can upgrade
0575        the kernel and never have to worry about it.
0576
0577        > Kernel had a bug which has been fixed
0578
0579        That is *ENTIRELY* immaterial.
0580
0581        Guys, whether something was buggy or not DOES NOT MATTER.
0582
0583        Why?
0584
0585        Bugs happen. That's a fact of life. Arguing that "we had to break
0586        something because we were fixing a bug" is completely insane. We fix
0587        tens of bugs every single day, thinking that "fixing a bug" means that
0588        we can break something is simply NOT TRUE.
0589
0590        So bugs simply aren't even relevant to the discussion. They happen,
0591        they get found, they get fixed, and it has nothing to do with "we
0592        break users".
0593
0594        Because the only thing that matters IS THE USER.
0595
0596        How hard is that to understand?
0597
0598        Anybody who uses "but it was buggy" as an argument is entirely missing
0599        the point. As far as the USER was concerned, it wasn't buggy - it
0600        worked for him/her.
0601
0602        Maybe it worked *because* the user had taken the bug into account,
0603        maybe it worked because the user didn't notice - again, it doesn't
0604        matter. It worked for the user.
0605
0606        Breaking a user workflow for a "bug" is absolutely the WORST reason
0607        for breakage you can imagine.
0608
0609        It's basically saying "I took something that worked, and I broke it,
0610        but now it's better". Do you not see how f*cking insane that statement
0611        is?
0612
0613        And without users, your program is not a program, it's a pointless
0614        piece of code that you might as well throw away.
0615
0616        Seriously. This is *why* the #1 rule for kernel development is "we
0617        don't break users". Because "I fixed a bug" is absolutely NOT AN
0618        ARGUMENT if that bug fix broke a user setup. You actually introduced a
0619        MUCH BIGGER bug by "fixing" something that the user clearly didn't
0620        even care about.
0621
0622        And dammit, we upgrade the kernel ALL THE TIME without upgrading any
0623        other programs at all. It is absolutely required, because flag-days
0624        and dependencies are horribly bad.
0625
0626        And it is also required simply because I as a kernel developer do not
0627        upgrade random other tools that I don't even care about as I develop
0628        the kernel, and I want any of my users to feel safe doing the same
0629        thing.
0630
0631        So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
0632        without upgrading some other random binary, then we have a problem.
0633
0634  * From `2021-06-05
0635    <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::
0636
0637        THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
0638
0639        Honestly, security people need to understand that "not working" is not
0640        a success case of security. It's a failure case.
0641
0642        Yes, "not working" may be secure. But security in that case is *pointless*.
0643
0644  * From `2011-05-06 (1/3)
0645    <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_::
0646
0647        Binary compatibility is more important.
0648
0649        And if binaries don't use the interface to parse the format (or just
0650        parse it wrongly - see the fairly recent example of adding uuid's to
0651        /proc/self/mountinfo), then it's a regression.
0652
0653        And regressions get reverted, unless there are security issues or
0654        similar that makes us go "Oh Gods, we really have to break things".
0655
0656        I don't understand why this simple logic is so hard for some kernel
0657        developers to understand. Reality matters. Your personal wishes matter
0658        NOT AT ALL.
0659
0660        If you made an interface that can be used without parsing the
0661        interface description, then we're stuck with the interface. Theory
0662        simply doesn't matter.
0663
0664        You could help fix the tools, and try to avoid the compatibility
0665        issues that way. There aren't that many of them.
0666
0667    From `2011-05-06 (2/3)
0668    <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_::
0669
0670        it's clearly NOT an internal tracepoint. By definition. It's being
0671        used by powertop.
0672
0673    From `2011-05-06 (3/3)
0674    <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_::
0675
0676        We have programs that use that ABI and thus it's a regression if they break.
0677
0678  * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::
0679
0680        > Now this got me wondering if Debian _unstable_ actually qualifies as a
0681        > standard distro userspace.
0682
0683        Oh, if the kernel breaks some standard user space, that counts. Tons
0684        of people run Debian unstable
0685
0686  * From `2019-09-15
0687    <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::
0688
0689        One _particularly_ last-minute revert is the top-most commit (ignoring
0690        the version change itself) done just before the release, and while
0691        it's very annoying, it's perhaps also instructive.
0692
0693        What's instructive about it is that I reverted a commit that wasn't
0694        actually buggy. In fact, it was doing exactly what it set out to do,
0695        and did it very well. In fact it did it _so_ well that the much
0696        improved IO patterns it caused then ended up revealing a user-visible
0697        regression due to a real bug in a completely unrelated area.
0698
0699        The actual details of that regression are not the reason I point that
0700        revert out as instructive, though. It's more that it's an instructive
0701        example of what counts as a regression, and what the whole "no
0702        regressions" kernel rule means. The reverted commit didn't change any
0703        API's, and it didn't introduce any new bugs. But it ended up exposing
0704        another problem, and as such caused a kernel upgrade to fail for a
0705        user. So it got reverted.
0706
0707        The point here being that we revert based on user-reported _behavior_,
0708        not based on some "it changes the ABI" or "it caused a bug" concept.
0709        The problem was really pre-existing, and it just didn't happen to
0710        trigger before. The better IO patterns introduced by the change just
0711        happened to expose an old bug, and people had grown to depend on the
0712        previously benign behavior of that old issue.
0713
0714        And never fear, we'll re-introduce the fix that improved on the IO
0715        patterns once we've decided just how to handle the fact that we had a
0716        bad interaction with an interface that people had then just happened
0717        to rely on incidental behavior for before. It's just that we'll have
0718        to hash through how to do that (there are no less than three different
0719        patches by three different developers being discussed, and there might
0720        be more coming...). In the meantime, I reverted the thing that exposed
0721        the problem to users for this release, even if I hope it will be
0722        re-introduced (perhaps even backported as a stable patch) once we have
0723        consensus about the issue it exposed.
0724
0725        Take-away from the whole thing: it's not about whether you change the
0726        kernel-userspace ABI, or fix a bug, or about whether the old code
0727        "should never have worked in the first place". It's about whether
0728        something breaks existing users' workflow.
0729
0730        Anyway, that was my little aside on the whole regression thing.  Since
0731        it's that "first rule of kernel programming", I felt it is perhaps
0732        worth just bringing it up every once in a while
0733
0734 ..
0735    end-of-content
0736 ..
0737    This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
0738    of the file. If you want to distribute this text under CC-BY-4.0 only,
0739    please use "The Linux kernel developers" for author attribution and link
0740    this as source:
0741    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/process/handling-regressions.rst
0742 ..
0743    Note: Only the content of this RST file as found in the Linux kernel sources
0744    is available under CC-BY-4.0, as versions of this text that were processed
0745    (for example by the kernel's build system) might contain content taken from
0746    files which use a more restrictive license.