0001 .. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
0002 .. See the bottom of this file for additional redistribution information.
0003
0004 Handling regressions
0005 ++++++++++++++++++++
0006
0007 *We don't cause regressions* -- this document describes what this "first rule of
0008 Linux kernel development" means in practice for developers. It complements
0009 Documentation/admin-guide/reporting-regressions.rst, which covers the topic from a
user's point of view; if you have never read that text, go and at least skim over it
0011 before continuing here.
0012
0013 The important bits (aka "The TL;DR")
0014 ====================================
0015
0016 #. Ensure subscribers of the `regression mailing list <https://lore.kernel.org/regressions/>`_
0017 (regressions@lists.linux.dev) quickly become aware of any new regression
0018 report:
0019
0020 * When receiving a mailed report that did not CC the list, bring it into the
0021 loop by immediately sending at least a brief "Reply-all" with the list
0022 CCed.
0023
0024 * Forward or bounce any reports submitted in bug trackers to the list.
0025
0026 #. Make the Linux kernel regression tracking bot "regzbot" track the issue (this
0027 is optional, but recommended):
0028
0029 * For mailed reports, check if the reporter included a line like ``#regzbot
0030 introduced v5.13..v5.14-rc1``. If not, send a reply (with the regressions
0031 list in CC) containing a paragraph like the following, which tells regzbot
0032 when the issue started to happen::
0033
0034 #regzbot ^introduced 1f2e3d4c5b6a
0035
0036 * When forwarding reports from a bug tracker to the regressions list (see
0037 above), include a paragraph like the following::
0038
0039 #regzbot introduced: v5.13..v5.14-rc1
0040 #regzbot from: Some N. Ice Human <some.human@example.com>
0041 #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
0042
0043 #. When submitting fixes for regressions, add "Link:" tags to the patch
0044 description pointing to all places where the issue was reported, as
0045 mandated by Documentation/process/submitting-patches.rst and
0046 :ref:`Documentation/process/5.Posting.rst <development_posting>`.
0047
0048 #. Try to fix regressions quickly once the culprit has been identified; fixes
0049 for most regressions should be merged within two weeks, but some need to be
0050 resolved within two or three days.
0051
0052
0053 All the details on Linux kernel regressions relevant for developers
0054 ===================================================================
0055
0056
0057 The important basics in more detail
0058 -----------------------------------
0059
0060
0061 What to do when receiving regression reports
0062 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0063
Ensure the Linux kernel's regression tracker and other subscribers of the
0065 `regression mailing list <https://lore.kernel.org/regressions/>`_
0066 (regressions@lists.linux.dev) become aware of any newly reported regression:
0067
0068 * When you receive a report by mail that did not CC the list, immediately bring
0069 it into the loop by sending at least a brief "Reply-all" with the list CCed;
0070 try to ensure it gets CCed again in case you reply to a reply that omitted
0071 the list.
0072
0073 * If a report submitted in a bug tracker hits your Inbox, forward or bounce it
0074 to the list. Consider checking the list archives beforehand, if the reporter
0075 already forwarded the report as instructed by
0076 Documentation/admin-guide/reporting-issues.rst.
0077
0078 When doing either, consider making the Linux kernel regression tracking bot
0079 "regzbot" immediately start tracking the issue:
0080
0081 * For mailed reports, check if the reporter included a "regzbot command" like
0082 ``#regzbot introduced 1f2e3d4c5b6a``. If not, send a reply (with the
regressions list in CC) with a paragraph like the following::
0084
0085 #regzbot ^introduced: v5.13..v5.14-rc1
0086
0087 This tells regzbot the version range in which the issue started to happen;
0088 you can specify a range using commit-ids as well or state a single commit-id
0089 in case the reporter bisected the culprit.
0090
0091 Note the caret (^) before the "introduced": it tells regzbot to treat the
0092 parent mail (the one you reply to) as the initial report for the regression
0093 you want to see tracked; that's important, as regzbot will later look out
0094 for patches with "Link:" tags pointing to the report in the archives on
0095 lore.kernel.org.
0096
* When forwarding a regression reported to a bug tracker, include a paragraph
0098 with these regzbot commands::
0099
0100 #regzbot introduced: 1f2e3d4c5b6a
0101 #regzbot from: Some N. Ice Human <some.human@example.com>
0102 #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
0103
0104 Regzbot will then automatically associate patches with the report that
0105 contain "Link:" tags pointing to your mail or the mentioned ticket.
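
The paragraph for a forwarded bug tracker report can also be generated instead
of typed by hand. The following is only a minimal sketch: the function name and
its parameters are made up for this illustration and are not part of regzbot;
the three commands are the ones shown above.

```python
def regzbot_forward_paragraph(introduced: str, reporter: str, ticket_url: str) -> str:
    """Return the regzbot command paragraph for a forwarded report.

    Hypothetical helper for illustration; regzbot ships no such function.
    """
    lines = [
        f"#regzbot introduced: {introduced}",
        f"#regzbot from: {reporter}",
        f"#regzbot monitor: {ticket_url}",
    ]
    # Leading and trailing newline keep the commands in their own paragraph
    # once the text is pasted into the body of the forwarded mail.
    return "\n" + "\n".join(lines) + "\n"


paragraph = regzbot_forward_paragraph(
    "v5.13..v5.14-rc1",
    "Some N. Ice Human <some.human@example.com>",
    "http://some.bugtracker.example.com/ticket?id=123456789",
)
print(paragraph)
```

Paste the returned text into the mail you forward to the regressions list.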
0106
0107 What's important when fixing regressions
0108 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0109
You don't need to do anything special when submitting fixes for regressions; just
0111 remember to do what Documentation/process/submitting-patches.rst,
0112 :ref:`Documentation/process/5.Posting.rst <development_posting>`, and
0113 Documentation/process/stable-kernel-rules.rst already explain in more detail:
0114
0115 * Point to all places where the issue was reported using "Link:" tags::
0116
0117 Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
0118 Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
0119
0120 * Add a "Fixes:" tag to specify the commit causing the regression.
0121
0122 * If the culprit was merged in an earlier development cycle, explicitly mark
0123 the fix for backporting using the ``Cc: stable@vger.kernel.org`` tag.
0124
All this is expected from you and important when it comes to regressions, as
these tags are of great value for everyone (you included) who might be looking
into the issue weeks, months, or years later. These tags are also crucial for
tools and scripts used by other kernel developers or Linux distributions; one of
these tools is regzbot, which heavily relies on the "Link:" tags to associate
reports for regressions with changes resolving them.
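
As a rough illustration of how the three tags fit together, the sketch below
assembles them for a patch description. The helper, its parameter names, the
commit id, and the subject line are all invented for this example; the
"Fixes:" layout (first 12 characters of the commit id followed by the quoted
subject) is the one described in Documentation/process/submitting-patches.rst,
and the order of the tags simply follows the list above.

```python
def regression_fix_trailers(report_urls, culprit_sha, culprit_subject, backport):
    """Build the tag lines for a regression fix (hypothetical helper)."""
    # One "Link:" tag per place where the issue was reported.
    tags = [f"Link: {url}" for url in report_urls]
    # "Fixes:" names the culprit: first 12 characters of the commit id plus
    # the subject line of that commit.
    tags.append(f'Fixes: {culprit_sha[:12]} ("{culprit_subject}")')
    # Request a backport only if the culprit was merged in an earlier cycle.
    if backport:
        tags.append("Cc: stable@vger.kernel.org")
    return "\n".join(tags)


tags = regression_fix_trailers(
    report_urls=[
        "https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/",
        "https://bugzilla.kernel.org/show_bug.cgi?id=1234567890",
    ],
    culprit_sha="1f2e3d4c5b6a7f8e9d0c1b2a3f4e5d6c7b8a9f0e",  # made-up commit id
    culprit_subject="foo: made-up subject of the culprit commit",
    backport=True,
)
print(tags)
```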
0131
0132 Prioritize work on fixing regressions
0133 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0134
0135 You should fix any reported regression as quickly as possible, to provide
0136 affected users with a solution in a timely manner and prevent more users from
0137 running into the issue; nevertheless developers need to take enough time and
0138 care to ensure regression fixes do not cause additional damage.
0139
In the end though, developers should do their best to prevent users from
running into situations where a regression leaves them only three options: "run
a kernel with a regression that seriously impacts usage", "continue running an
outdated and thus potentially insecure kernel version for more than two weeks
after a regression's culprit was identified", and "downgrade to a still
supported kernel series that lacks required features".
0146
0147 How to realize this depends a lot on the situation. Here are a few rules of
thumb for you, in order of importance:
0149
* Prioritize work on handling regression reports and fixing regressions over all
0151 other Linux kernel work, unless the latter concerns acute security issues or
0152 bugs causing data loss or damage.
0153
0154 * Always consider reverting the culprit commits and reapplying them later
0155 together with necessary fixes, as this might be the least dangerous and
0156 quickest way to fix a regression.
0157
* Developers should handle regressions in all supported kernel series, but are
free to delegate the work to the stable team if the issue probably never
occurred in mainline.
0161
* Try to resolve any regressions introduced in the current development cycle
before its end. If you fear a fix might be too risky to apply only days before
a new mainline release, let Linus decide: submit the fix separately to him as
soon as possible with an explanation of the situation. He can then make a call
and postpone the release if necessary, for example if multiple such changes
show up in his inbox.
0168
0169 * Address regressions in stable, longterm, or proper mainline releases with
0170 more urgency than regressions in mainline pre-releases. That changes after
0171 the release of the fifth pre-release, aka "-rc5": mainline then becomes as
0172 important, to ensure all the improvements and fixes are ideally tested
0173 together for at least one week before Linus releases a new mainline version.
0174
0175 * Fix regressions within two or three days, if they are critical for some
0176 reason -- for example, if the issue is likely to affect many users of the
0177 kernel series in question on all or certain architectures. Note, this
0178 includes mainline, as issues like compile errors otherwise might prevent many
0179 testers or continuous integration systems from testing the series.
0180
0181 * Aim to fix regressions within one week after the culprit was identified, if
0182 the issue was introduced in either:
0183
0184 * a recent stable/longterm release
0185
0186 * the development cycle of the latest proper mainline release
0187
0188 In the latter case (say Linux v5.14), try to address regressions even
0189 quicker, if the stable series for the predecessor (v5.13) will be abandoned
0190 soon or already was stamped "End-of-Life" (EOL) -- this usually happens about
0191 three to four weeks after a new mainline release.
0192
0193 * Try to fix all other regressions within two weeks after the culprit was
0194 found. Two or three additional weeks are acceptable for performance
0195 regressions and other issues which are annoying, but don't prevent anyone
0196 from running Linux (unless it's an issue in the current development cycle,
0197 as those should ideally be addressed before the release). A few weeks in
0198 total are acceptable if a regression can only be fixed with a risky change
0199 and at the same time is affecting only a few users; as much time is
0200 also okay if the regression is already present in the second newest longterm
0201 kernel series.
0202
0203 Note: The aforementioned time frames for resolving regressions are meant to
0204 include getting the fix tested, reviewed, and merged into mainline, ideally with
0205 the fix being in linux-next at least briefly. This leads to delays you need to
0206 account for.
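
To make the arithmetic behind these time frames concrete, here is a small
sketch that turns the day a culprit was identified into a target date for
having the fix merged. The category names and the function are hypothetical
labels chosen for this example, and the day counts are simplified upper bounds
taken from the rules of thumb above; real decisions still depend on the
situation.

```python
from datetime import date, timedelta

# Simplified upper bounds from the rules of thumb above; the keys are
# hypothetical labels invented for this sketch.
TARGET_DAYS = {
    "critical": 3,         # "within two or three days"
    "recent_release": 7,   # recent stable/longterm or latest mainline cycle
    "other": 14,           # all other regressions
}


def fix_target_date(culprit_identified: date, category: str) -> date:
    """Return the date by which the fix should ideally be in mainline."""
    return culprit_identified + timedelta(days=TARGET_DAYS[category])


target = fix_target_date(date(2022, 1, 10), "recent_release")
print(target)  # 2022-01-17
```

Keep in mind that testing, review, and time in linux-next all have to fit
before the computed date.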
0207
0208 Subsystem maintainers are expected to assist in reaching those periods by doing
0209 timely reviews and quick handling of accepted patches. They thus might have to
0210 send git-pull requests earlier or more often than usual; depending on the fix,
0211 it might even be acceptable to skip testing in linux-next. Especially fixes for
0212 regressions in stable and longterm kernels need to be handled quickly, as fixes
0213 need to be merged in mainline before they can be backported to older series.
0214
0215
0216 More aspects regarding regressions developers should be aware of
0217 ----------------------------------------------------------------
0218
0219
0220 How to deal with changes where a risk of regression is known
0221 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0222
0223 Evaluate how big the risk of regressions is, for example by performing a code
0224 search in Linux distributions and Git forges. Also consider asking other
0225 developers or projects likely to be affected to evaluate or even test the
0226 proposed change; if problems surface, maybe some solution acceptable for all
0227 can be found.
0228
0229 If the risk of regressions in the end seems to be relatively small, go ahead
0230 with the change, but let all involved parties know about the risk. Hence, make
0231 sure your patch description makes this aspect obvious. Once the change is
0232 merged, tell the Linux kernel's regression tracker and the regressions mailing
0233 list about the risk, so everyone has the change on the radar in case reports
0234 trickle in. Depending on the risk, you also might want to ask the subsystem
0235 maintainer to mention the issue in his mainline pull request.
0236
What else is there to know about regressions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0239
Check out Documentation/admin-guide/reporting-regressions.rst; it covers a lot
of other aspects you might want to be aware of:
0242
0243 * the purpose of the "no regressions rule"
0244
* what issues actually qualify as a regression

* who's in charge of finding the root cause of a regression
0248
0249 * how to handle tricky situations, e.g. when a regression is caused by a
0250 security fix or when fixing a regression might cause another one
0251
0252 Whom to ask for advice when it comes to regressions
0253 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0254
0255 Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
0256 CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
0257 issue might better be dealt with in private, feel free to omit the list.
0258
0259
0260 More about regression tracking and regzbot
0261 ------------------------------------------
0262
0263
0264 Why the Linux kernel has a regression tracker, and why is regzbot used?
0265 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0266
0267 Rules like "no regressions" need someone to ensure they are followed, otherwise
0268 they are broken either accidentally or on purpose. History has shown this to be
0269 true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to
0270 keep an eye on things as the Linux kernel's regression tracker, who's
occasionally helped by other people. None of them is paid to do this, which
is why regression tracking is done on a best-effort basis.

Earlier attempts to manually track regressions have shown it's exhausting and
frustrating work, which is why they were abandoned after a while. To prevent
0276 this from happening again, Thorsten developed regzbot to facilitate the work,
0277 with the long term goal to automate regression tracking as much as possible for
0278 everyone involved.
0279
0280 How does regression tracking work with regzbot?
0281 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0282
0283 The bot watches for replies to reports of tracked regressions. Additionally,
0284 it's looking out for posted or committed patches referencing such reports
0285 with "Link:" tags; replies to such patch postings are tracked as well.
Combined, this data provides good insights into the current state of the fixing
0287 process.
0288
0289 Regzbot tries to do its job with as little overhead as possible for both
0290 reporters and developers. In fact, only reporters are burdened with an extra
0291 duty: they need to tell regzbot about the regression report using the ``#regzbot
0292 introduced`` command outlined above; if they don't do that, someone else can
0293 take care of that using ``#regzbot ^introduced``.
0294
For developers there normally is no extra work involved; they just need to make
0296 sure to do something that was expected long before regzbot came to light: add
0297 "Link:" tags to the patch description pointing to all reports about the issue
0298 fixed.
0299
0300 Do I have to use regzbot?
0301 ~~~~~~~~~~~~~~~~~~~~~~~~~
0302
0303 It's in the interest of everyone if you do, as kernel maintainers like Linus
0304 Torvalds partly rely on regzbot's tracking in their work -- for example when
0305 deciding to release a new version or extend the development phase. For this they
need to be aware of all unfixed regressions; to do that, Linus is known to look
0307 into the weekly reports sent by regzbot.
0308
0309 Do I have to tell regzbot about every regression I stumble upon?
0310 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0311
0312 Ideally yes: we are all humans and easily forget problems when something more
0313 important unexpectedly comes up -- for example a bigger problem in the Linux
0314 kernel or something in real life that's keeping us away from keyboards for a
0315 while. Hence, it's best to tell regzbot about every regression, except when you
0316 immediately write a fix and commit it to a tree regularly merged to the affected
0317 kernel series.
0318
0319 How to see which regressions regzbot tracks currently?
0320 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0321
0322 Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
0323 for the latest info; alternatively, `search for the latest regression report
0324 <https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
0325 which regzbot normally sends out once a week on Sunday evening (UTC), which is a
0326 few hours before Linus usually publishes new (pre-)releases.
0327
0328 What places is regzbot monitoring?
0329 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0330
0331 Regzbot is watching the most important Linux mailing lists as well as the git
0332 repositories of linux-next, mainline, and stable/longterm.
0333
0334 What kind of issues are supposed to be tracked by regzbot?
0335 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0336
0337 The bot is meant to track regressions, hence please don't involve regzbot for
0338 regular issues. But it's okay for the Linux kernel's regression tracker if you
0339 use regzbot to track severe issues, like reports about hangs, corrupted data,
0340 or internal errors (Panic, Oops, BUG(), warning, ...).
0341
0342 Can I add regressions found by CI systems to regzbot's tracking?
0343 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0344
0345 Feel free to do so, if the particular regression likely has impact on practical
0346 use cases and thus might be noticed by users; hence, please don't involve
0347 regzbot for theoretical regressions unlikely to show themselves in real world
0348 usage.
0349
0350 How to interact with regzbot?
0351 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0352
0353 By using a 'regzbot command' in a direct or indirect reply to the mail with the
0354 regression report. These commands need to be in their own paragraph (IOW: they
0355 need to be separated from the rest of the mail using blank lines).
0356
0357 One such command is ``#regzbot introduced <version or commit>``, which makes
regzbot consider your mail as a regression report added to the tracking, as
0359 already described above; ``#regzbot ^introduced <version or commit>`` is another
0360 such command, which makes regzbot consider the parent mail as a report for a
0361 regression which it starts to track.
0362
Once one of those two commands has been issued, other regzbot commands can be
used in direct or indirect replies to the report. You can write them below one
of the `introduced` commands, in replies to the mail that used one of them, or
in replies to any mail that itself is a reply to that mail:
0367
0368 * Set or update the title::
0369
0370 #regzbot title: foo
0371
* Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of
0373 the issue or a fix are discussed -- for example the posting of a patch fixing
0374 the regression::
0375
0376 #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
0377
0378 Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot
0379 will consider all messages in that thread or ticket as related to the fixing
0380 process.
0381
0382 * Point to a place with further details of interest, like a mailing list post
0383 or a ticket in a bug tracker that are slightly related, but about a different
0384 topic::
0385
0386 #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
0387
0388 * Mark a regression as fixed by a commit that is heading upstream or already
0389 landed::
0390
0391 #regzbot fixed-by: 1f2e3d4c5d
0392
0393 * Mark a regression as a duplicate of another one already tracked by regzbot::
0394
0395 #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
0396
0397 * Mark a regression as invalid::
0398
0399 #regzbot invalid: wasn't a regression, problem has always existed
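
The "own paragraph" requirement mentioned above can be checked before sending
a mail. The sketch below is not regzbot's parser, only an illustration under
the assumption that a paragraph counts when every one of its lines starts with
``#regzbot``; the regular expression and the function name are invented for
this example.

```python
import re

# Assumed command shape: "#regzbot", a command name with an optional leading
# caret, an optional colon, then the argument.
COMMAND_RE = re.compile(r"^#regzbot\s+(\^?[\w-]+):?\s*(.*)$")


def regzbot_commands(mail_body: str):
    """Return (command, argument) pairs from command-only paragraphs."""
    commands = []
    # Paragraphs are runs of lines separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", mail_body.strip()):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        matches = [COMMAND_RE.match(line) for line in lines]
        # Accept the paragraph only if every line in it is a command.
        if lines and all(matches):
            commands.extend((m.group(1), m.group(2)) for m in matches)
    return commands


body = """Hi, thanks for the report.

#regzbot ^introduced: v5.13..v5.14-rc1
#regzbot title: foo

Mentioning #regzbot title: bar inside a sentence is not a command.
"""
found = regzbot_commands(body)
print(found)
```

Only the two lines in the middle paragraph are picked up; the mention inside
the last sentence is ignored because it does not stand in a paragraph of its
own.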
0400
0401 Is there more to tell about regzbot and its commands?
0402 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0403
0404 More detailed and up-to-date information about the Linux
0405 kernel's regression tracking bot can be found on its
0406 `project page <https://gitlab.com/knurd42/regzbot>`_, which among others
0407 contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
0408 and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
0409 which both cover more details than the above section.
0410
0411 Quotes from Linus about regression
0412 ----------------------------------
0413
0414 Find below a few real life examples of how Linus Torvalds expects regressions to
0415 be handled:
0416
0417 * From `2017-10-26 (1/2)
0418 <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
0419
0420 If you break existing user space setups THAT IS A REGRESSION.
0421
0422 It's not ok to say "but we'll fix the user space setup".
0423
0424 Really. NOT OK.
0425
0426 [...]
0427
0428 The first rule is:
0429
0430 - we don't cause regressions
0431
0432 and the corollary is that when regressions *do* occur, we admit to
0433 them and fix them, instead of blaming user space.
0434
0435 The fact that you have apparently been denying the regression now for
0436 three weeks means that I will revert, and I will stop pulling apparmor
0437 requests until the people involved understand how kernel development
0438 is done.
0439
0440 * From `2017-10-26 (2/2)
0441 <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
0442
0443 People should basically always feel like they can update their kernel
0444 and simply not have to worry about it.
0445
0446 I refuse to introduce "you can only update the kernel if you also
0447 update that other program" kind of limitations. If the kernel used to
0448 work for you, the rule is that it continues to work for you.
0449
0450 There have been exceptions, but they are few and far between, and they
0451 generally have some major and fundamental reasons for having happened,
0452 that were basically entirely unavoidable, and people _tried_hard_ to
0453 avoid them. Maybe we can't practically support the hardware any more
0454 after it is decades old and nobody uses it with modern kernels any
0455 more. Maybe there's a serious security issue with how we did things,
0456 and people actually depended on that fundamentally broken model. Maybe
0457 there was some fundamental other breakage that just _had_ to have a
0458 flag day for very core and fundamental reasons.
0459
0460 And notice that this is very much about *breaking* peoples environments.
0461
0462 Behavioral changes happen, and maybe we don't even support some
0463 feature any more. There's a number of fields in /proc/<pid>/stat that
0464 are printed out as zeroes, simply because they don't even *exist* in
0465 the kernel any more, or because showing them was a mistake (typically
0466 an information leak). But the numbers got replaced by zeroes, so that
0467 the code that used to parse the fields still works. The user might not
0468 see everything they used to see, and so behavior is clearly different,
0469 but things still _work_, even if they might no longer show sensitive
0470 (or no longer relevant) information.
0471
0472 But if something actually breaks, then the change must get fixed or
0473 reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
0474 your user space then". It was a kernel change that exposed the
0475 problem, it needs to be the kernel that corrects for it, because we
0476 have a "upgrade in place" model. We don't have a "upgrade with new
0477 user space".
0478
0479 And I seriously will refuse to take code from people who do not
0480 understand and honor this very simple rule.
0481
0482 This rule is also not going to change.
0483
0484 And yes, I realize that the kernel is "special" in this respect. I'm
0485 proud of it.
0486
0487 I have seen, and can point to, lots of projects that go "We need to
0488 break that use case in order to make progress" or "you relied on
0489 undocumented behavior, it sucks to be you" or "there's a better way to
0490 do what you want to do, and you have to change to that new better
0491 way", and I simply don't think that's acceptable outside of very early
0492 alpha releases that have experimental users that know what they signed
0493 up for. The kernel hasn't been in that situation for the last two
0494 decades.
0495
0496 We do API breakage _inside_ the kernel all the time. We will fix
0497 internal problems by saying "you now need to do XYZ", but then it's
0498 about internal kernel API's, and the people who do that then also
0499 obviously have to fix up all the in-kernel users of that API. Nobody
0500 can say "I now broke the API you used, and now _you_ need to fix it
0501 up". Whoever broke something gets to fix it too.
0502
0503 And we simply do not break user space.
0504
0505 * From `2020-05-21
0506 <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
0507
0508 The rules about regressions have never been about any kind of
0509 documented behavior, or where the code lives.
0510
0511 The rules about regressions are always about "breaks user workflow".
0512
0513 Users are literally the _only_ thing that matters.
0514
0515 No amount of "you shouldn't have used this" or "that behavior was
0516 undefined, it's your own fault your app broke" or "that used to work
0517 simply because of a kernel bug" is at all relevant.
0518
0519 Now, reality is never entirely black-and-white. So we've had things
0520 like "serious security issue" etc that just forces us to make changes
0521 that may break user space. But even then the rule is that we don't
0522 really have other options that would allow things to continue.
0523
0524 And obviously, if users take years to even notice that something
0525 broke, or if we have sane ways to work around the breakage that
0526 doesn't make for too much trouble for users (ie "ok, there are a
0527 handful of users, and they can use a kernel command line to work
0528 around it" kind of things) we've also been a bit less strict.
0529
0530 But no, "that was documented to be broken" (whether it's because the
0531 code was in staging or because the man-page said something else) is
0532 irrelevant. If staging code is so useful that people end up using it,
0533 that means that it's basically regular kernel code with a flag saying
0534 "please clean this up".
0535
0536 The other side of the coin is that people who talk about "API
0537 stability" are entirely wrong. API's don't matter either. You can make
0538 any changes to an API you like - as long as nobody notices.
0539
0540 Again, the regression rule is not about documentation, not about
0541 API's, and not about the phase of the moon.
0542
0543 It's entirely about "we caused problems for user space that used to work".
0544
0545 * From `2017-11-05
0546 <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
0547
0548 And our regression rule has never been "behavior doesn't change".
0549 That would mean that we could never make any changes at all.
0550
0551 For example, we do things like add new error handling etc all the
0552 time, which we then sometimes even add tests for in our kselftest
0553 directory.
0554
0555 So clearly behavior changes all the time and we don't consider that a
0556 regression per se.
0557
0558 The rule for a regression for the kernel is that some real user
0559 workflow breaks. Not some test. Not a "look, I used to be able to do
0560 X, now I can't".
0561
0562 * From `2018-08-03
0563 <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
0564
0565 YOU ARE MISSING THE #1 KERNEL RULE.
0566
0567 We do not regress, and we do not regress exactly because your are 100% wrong.
0568
0569 And the reason you state for your opinion is in fact exactly *WHY* you
0570 are wrong.
0571
0572 Your "good reasons" are pure and utter garbage.
0573
0574 The whole point of "we do not regress" is so that people can upgrade
0575 the kernel and never have to worry about it.
0576
0577 > Kernel had a bug which has been fixed
0578
0579 That is *ENTIRELY* immaterial.
0580
0581 Guys, whether something was buggy or not DOES NOT MATTER.
0582
0583 Why?
0584
0585 Bugs happen. That's a fact of life. Arguing that "we had to break
0586 something because we were fixing a bug" is completely insane. We fix
0587 tens of bugs every single day, thinking that "fixing a bug" means that
0588 we can break something is simply NOT TRUE.
0589
0590 So bugs simply aren't even relevant to the discussion. They happen,
0591 they get found, they get fixed, and it has nothing to do with "we
0592 break users".
0593
0594 Because the only thing that matters IS THE USER.
0595
0596 How hard is that to understand?
0597
0598 Anybody who uses "but it was buggy" as an argument is entirely missing
0599 the point. As far as the USER was concerned, it wasn't buggy - it
0600 worked for him/her.
0601
0602 Maybe it worked *because* the user had taken the bug into account,
0603 maybe it worked because the user didn't notice - again, it doesn't
0604 matter. It worked for the user.
0605
0606 Breaking a user workflow for a "bug" is absolutely the WORST reason
      for breakage you can imagine.

      It's basically saying "I took something that worked, and I broke it,
      but now it's better". Do you not see how f*cking insane that statement
      is?

      And without users, your program is not a program, it's a pointless
      piece of code that you might as well throw away.

      Seriously. This is *why* the #1 rule for kernel development is "we
      don't break users". Because "I fixed a bug" is absolutely NOT AN
      ARGUMENT if that bug fix broke a user setup. You actually introduced a
      MUCH BIGGER bug by "fixing" something that the user clearly didn't
      even care about.

      And dammit, we upgrade the kernel ALL THE TIME without upgrading any
      other programs at all. It is absolutely required, because flag-days
      and dependencies are horribly bad.

      And it is also required simply because I as a kernel developer do not
      upgrade random other tools that I don't even care about as I develop
      the kernel, and I want any of my users to feel safe doing the same
      time.

      So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
      without upgrading some other random binary, then we have a problem.

 * From `2021-06-05
   <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::

      THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.

      Honestly, security people need to understand that "not working" is not
      a success case of security. It's a failure case.

      Yes, "not working" may be secure. But security in that case is *pointless*.

 * From `2011-05-06 (1/3)
   <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_::

      Binary compatibility is more important.

      And if binaries don't use the interface to parse the format (or just
      parse it wrongly - see the fairly recent example of adding uuid's to
      /proc/self/mountinfo), then it's a regression.

      And regressions get reverted, unless there are security issues or
      similar that makes us go "Oh Gods, we really have to break things".

      I don't understand why this simple logic is so hard for some kernel
      developers to understand. Reality matters. Your personal wishes matter
      NOT AT ALL.

      If you made an interface that can be used without parsing the
      interface description, then we're stuck with the interface. Theory
      simply doesn't matter.

      You could help fix the tools, and try to avoid the compatibility
      issues that way. There aren't that many of them.

   From `2011-05-06 (2/3)
   <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_::

      it's clearly NOT an internal tracepoint. By definition. It's being
      used by powertop.

   From `2011-05-06 (3/3)
   <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_::

      We have programs that use that ABI and thus it's a regression if they break.

 * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::

      > Now this got me wondering if Debian _unstable_ actually qualifies as a
      > standard distro userspace.

      Oh, if the kernel breaks some standard user space, that counts. Tons
      of people run Debian unstable

 * From `2019-09-15
   <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::

      One _particularly_ last-minute revert is the top-most commit (ignoring
      the version change itself) done just before the release, and while
      it's very annoying, it's perhaps also instructive.

      What's instructive about it is that I reverted a commit that wasn't
      actually buggy. In fact, it was doing exactly what it set out to do,
      and did it very well. In fact it did it _so_ well that the much
      improved IO patterns it caused then ended up revealing a user-visible
      regression due to a real bug in a completely unrelated area.

      The actual details of that regression are not the reason I point that
      revert out as instructive, though. It's more that it's an instructive
      example of what counts as a regression, and what the whole "no
      regressions" kernel rule means. The reverted commit didn't change any
      API's, and it didn't introduce any new bugs. But it ended up exposing
      another problem, and as such caused a kernel upgrade to fail for a
      user. So it got reverted.

      The point here being that we revert based on user-reported _behavior_,
      not based on some "it changes the ABI" or "it caused a bug" concept.
      The problem was really pre-existing, and it just didn't happen to
      trigger before. The better IO patterns introduced by the change just
      happened to expose an old bug, and people had grown to depend on the
      previously benign behavior of that old issue.

      And never fear, we'll re-introduce the fix that improved on the IO
      patterns once we've decided just how to handle the fact that we had a
      bad interaction with an interface that people had then just happened
      to rely on incidental behavior for before. It's just that we'll have
      to hash through how to do that (there are no less than three different
      patches by three different developers being discussed, and there might
      be more coming...). In the meantime, I reverted the thing that exposed
      the problem to users for this release, even if I hope it will be
      re-introduced (perhaps even backported as a stable patch) once we have
      consensus about the issue it exposed.

      Take-away from the whole thing: it's not about whether you change the
      kernel-userspace ABI, or fix a bug, or about whether the old code
      "should never have worked in the first place". It's about whether
      something breaks existing users' workflow.

      Anyway, that was my little aside on the whole regression thing. Since
      it's that "first rule of kernel programming", I felt it is perhaps
      worth just bringing it up every once in a while

..
   end-of-content
..
   This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
   of the file. If you want to distribute this text under CC-BY-4.0 only,
   please use "The Linux kernel developers" for author attribution and link
   this as source:
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/process/handling-regressions.rst
..
   Note: Only the content of this RST file as found in the Linux kernel sources
   is available under CC-BY-4.0, as versions of this text that were processed
   (for example by the kernel's build system) might contain content taken from
   files which use a more restrictive license.