0001 Bug hunting
0002 ===========
0003
0004 Kernel bug reports often come with a stack dump like the one below::
0005
0006 ------------[ cut here ]------------
0007 WARNING: CPU: 1 PID: 28102 at kernel/module.c:1108 module_put+0x57/0x70
0008 Modules linked in: dvb_usb_gp8psk(-) dvb_usb dvb_core nvidia_drm(PO) nvidia_modeset(PO) snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore nvidia(PO) [last unloaded: rc_core]
0009 CPU: 1 PID: 28102 Comm: rmmod Tainted: P WC O 4.8.4-build.1 #1
0010 Hardware name: MSI MS-7309/MS-7309, BIOS V1.12 02/23/2009
0011 00000000 c12ba080 00000000 00000000 c103ed6a c1616014 00000001 00006dc6
0012 c1615862 00000454 c109e8a7 c109e8a7 00000009 ffffffff 00000000 f13f6a10
0013 f5f5a600 c103ee33 00000009 00000000 00000000 c109e8a7 f80ca4d0 c109f617
0014 Call Trace:
0015 [<c12ba080>] ? dump_stack+0x44/0x64
0016 [<c103ed6a>] ? __warn+0xfa/0x120
0017 [<c109e8a7>] ? module_put+0x57/0x70
0018 [<c109e8a7>] ? module_put+0x57/0x70
0019 [<c103ee33>] ? warn_slowpath_null+0x23/0x30
0020 [<c109e8a7>] ? module_put+0x57/0x70
0021 [<f80ca4d0>] ? gp8psk_fe_set_frontend+0x460/0x460 [dvb_usb_gp8psk]
0022 [<c109f617>] ? symbol_put_addr+0x27/0x50
0023 [<f80bc9ca>] ? dvb_usb_adapter_frontend_exit+0x3a/0x70 [dvb_usb]
0024 [<f80bb3bf>] ? dvb_usb_exit+0x2f/0xd0 [dvb_usb]
0025 [<c13d03bc>] ? usb_disable_endpoint+0x7c/0xb0
0026 [<f80bb48a>] ? dvb_usb_device_exit+0x2a/0x50 [dvb_usb]
0027 [<c13d2882>] ? usb_unbind_interface+0x62/0x250
0028 [<c136b514>] ? __pm_runtime_idle+0x44/0x70
0029 [<c13620d8>] ? __device_release_driver+0x78/0x120
0030 [<c1362907>] ? driver_detach+0x87/0x90
0031 [<c1361c48>] ? bus_remove_driver+0x38/0x90
0032 [<c13d1c18>] ? usb_deregister+0x58/0xb0
0033 [<c109fbb0>] ? SyS_delete_module+0x130/0x1f0
0034 [<c1055654>] ? task_work_run+0x64/0x80
0035 [<c1000fa5>] ? exit_to_usermode_loop+0x85/0x90
0036 [<c10013f0>] ? do_fast_syscall_32+0x80/0x130
0037 [<c1549f43>] ? sysenter_past_esp+0x40/0x6a
0038 ---[ end trace 6ebc60ef3981792f ]---
0039
0040 Such stack traces provide enough information to identify the line inside the
0041 Kernel's source code where the bug happened. Depending on the severity of
0042 the issue, it may also contain the word **Oops**, as on this one::
0043
0044 BUG: unable to handle kernel NULL pointer dereference at (null)
0045 IP: [<c06969d4>] iret_exc+0x7d0/0xa59
0046 *pdpt = 000000002258a001 *pde = 0000000000000000
0047 Oops: 0002 [#1] PREEMPT SMP
0048 ...
0049
0050 Despite being an **Oops** or some other sort of stack trace, the offended
0051 line is usually required to identify and handle the bug. Along this chapter,
0052 we'll refer to "Oops" for all kinds of stack traces that need to be analyzed.
0053
0054 If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the
0055 quality of the stack trace by using file:`scripts/decode_stacktrace.sh`.
0056
0057 Modules linked in
0058 -----------------
0059
0060 Modules that are tainted or are being loaded or unloaded are marked with
0061 "(...)", where the taint flags are described in
0062 file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is
0063 annotated with "+", and "being unloaded" is annotated with "-".
0064
0065
0066 Where is the Oops message is located?
0067 -------------------------------------
0068
0069 Normally the Oops text is read from the kernel buffers by klogd and
0070 handed to ``syslogd`` which writes it to a syslog file, typically
0071 ``/var/log/messages`` (depends on ``/etc/syslog.conf``). On systems with
0072 systemd, it may also be stored by the ``journald`` daemon, and accessed
0073 by running ``journalctl`` command.
0074
0075 Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to
0076 read the data from the kernel buffers and save it. Or you can
0077 ``cat /proc/kmsg > file``, however you have to break in to stop the transfer,
0078 since ``kmsg`` is a "never ending file".
0079
0080 If the machine has crashed so badly that you cannot enter commands or
0081 the disk is not available then you have three options:
0082
0083 (1) Hand copy the text from the screen and type it in after the machine
0084 has restarted. Messy but it is the only option if you have not
0085 planned for a crash. Alternatively, you can take a picture of
0086 the screen with a digital camera - not nice, but better than
0087 nothing. If the messages scroll off the top of the console, you
0088 may find that booting with a higher resolution (e.g., ``vga=791``)
0089 will allow you to read more of the text. (Caveat: This needs ``vesafb``,
0090 so won't help for 'early' oopses.)
0091
0092 (2) Boot with a serial console (see
0093 :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`),
0094 run a null modem to a second machine and capture the output there
0095 using your favourite communication program. Minicom works well.
0096
0097 (3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
0098 extract the kernel ring buffer from old memory with using dmesg
0099 gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.
0100
0101 Finding the bug's location
0102 --------------------------
0103
0104 Reporting a bug works best if you point the location of the bug at the
0105 Kernel source file. There are two methods for doing that. Usually, using
0106 ``gdb`` is easier, but the Kernel should be pre-compiled with debug info.
0107
0108 gdb
0109 ^^^
0110
0111 The GNU debugger (``gdb``) is the best way to figure out the exact file and line
0112 number of the OOPS from the ``vmlinux`` file.
0113
0114 The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``.
0115 This can be set by running::
0116
0117 $ ./scripts/config -d COMPILE_TEST -e DEBUG_KERNEL -e DEBUG_INFO
0118
0119 On a kernel compiled with ``CONFIG_DEBUG_INFO``, you can simply copy the
0120 EIP value from the OOPS::
0121
0122 EIP: 0060:[<c021e50e>] Not tainted VLI
0123
0124 And use GDB to translate that to human-readable form::
0125
0126 $ gdb vmlinux
0127 (gdb) l *0xc021e50e
0128
0129 If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function
0130 offset from the OOPS::
0131
0132 EIP is at vt_ioctl+0xda8/0x1482
0133
0134 And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled::
0135
0136 $ ./scripts/config -d COMPILE_TEST -e DEBUG_KERNEL -e DEBUG_INFO
0137 $ make vmlinux
0138 $ gdb vmlinux
0139 (gdb) l *vt_ioctl+0xda8
0140 0x1888 is in vt_ioctl (drivers/tty/vt/vt_ioctl.c:293).
0141 288 {
0142 289 struct vc_data *vc = NULL;
0143 290 int ret = 0;
0144 291
0145 292 console_lock();
0146 293 if (VT_BUSY(vc_num))
0147 294 ret = -EBUSY;
0148 295 else if (vc_num)
0149 296 vc = vc_deallocate(vc_num);
0150 297 console_unlock();
0151
0152 or, if you want to be more verbose::
0153
0154 (gdb) p vt_ioctl
0155 $1 = {int (struct tty_struct *, unsigned int, unsigned long)} 0xae0 <vt_ioctl>
0156 (gdb) l *0xae0+0xda8
0157
0158 You could, instead, use the object file::
0159
0160 $ make drivers/tty/
0161 $ gdb drivers/tty/vt/vt_ioctl.o
0162 (gdb) l *vt_ioctl+0xda8
0163
0164 If you have a call trace, such as::
0165
0166 Call Trace:
0167 [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
0168 [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
0169 [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
0170 ...
0171
0172 this shows the problem likely is in the :jbd: module. You can load that module
0173 in gdb and list the relevant code::
0174
0175 $ gdb fs/jbd/jbd.ko
0176 (gdb) l *log_wait_commit+0xa3
0177
0178 .. note::
0179
0180 You can also do the same for any function call at the stack trace,
0181 like this one::
0182
0183 [<f80bc9ca>] ? dvb_usb_adapter_frontend_exit+0x3a/0x70 [dvb_usb]
0184
0185 The position where the above call happened can be seen with::
0186
0187 $ gdb drivers/media/usb/dvb-usb/dvb-usb.o
0188 (gdb) l *dvb_usb_adapter_frontend_exit+0x3a
0189
0190 objdump
0191 ^^^^^^^
0192
0193 To debug a kernel, use objdump and look for the hex offset from the crash
0194 output to find the valid line of code/assembler. Without debug symbols, you
0195 will see the assembler code for the routine shown, but if your kernel has
0196 debug symbols the C code will also be available. (Debug symbols can be enabled
0197 in the kernel hacking menu of the menu configuration.) For example::
0198
0199 $ objdump -r -S -l --disassemble net/dccp/ipv4.o
0200
0201 .. note::
0202
0203 You need to be at the top level of the kernel tree for this to pick up
0204 your C files.
0205
0206 If you don't have access to the source code you can still debug some crash
0207 dumps using the following method (example crash dump output as shown by
0208 Dave Miller)::
0209
0210 EIP is at +0x14/0x4c0
0211 ...
0212 Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
0213 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
0214 <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85
0215
0216 Put the bytes into a "foo.s" file like this:
0217
0218 .text
0219 .globl foo
0220 foo:
0221 .byte .... /* bytes from Code: part of OOPS dump */
0222
0223 Compile it with "gcc -c -o foo.o foo.s" then look at the output of
0224 "objdump --disassemble foo.o".
0225
0226 Output:
0227
0228 ip_queue_xmit:
0229 push %ebp
0230 push %edi
0231 push %esi
0232 push %ebx
0233 sub $0xbc, %esp
0234 mov 0xd0(%esp), %ebp ! %ebp = arg0 (skb)
0235 mov 0x8(%ebp), %ebx ! %ebx = skb->sk
0236 mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt
0237
0238 file:`scripts/decodecode` can be used to automate most of this, depending
0239 on what CPU architecture is being debugged.
0240
0241 Reporting the bug
0242 -----------------
0243
0244 Once you find where the bug happened, by inspecting its location,
0245 you could either try to fix it yourself or report it upstream.
0246
0247 In order to report it upstream, you should identify the mailing list
0248 used for the development of the affected code. This can be done by using
0249 the ``get_maintainer.pl`` script.
0250
0251 For example, if you find a bug at the gspca's sonixj.c file, you can get
0252 its maintainers with::
0253
0254 $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c
0255 Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%)
0256 Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%)
0257 Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%)
0258 Bhaktipriya Shridhar <bhaktipriya96@gmail.com> (commit_signer:1/1=100%,authored:1/1=100%,added_lines:4/4=100%,removed_lines:9/9=100%)
0259 linux-media@vger.kernel.org (open list:GSPCA USB WEBCAM DRIVER)
0260 linux-kernel@vger.kernel.org (open list)
0261
0262 Please notice that it will point to:
0263
0264 - The last developers that touched the source code (if this is done inside
0265 a git tree). On the above example, Tejun and Bhaktipriya (in this
0266 specific case, none really involved on the development of this file);
0267 - The driver maintainer (Hans Verkuil);
0268 - The subsystem maintainer (Mauro Carvalho Chehab);
0269 - The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
0270 - the Linux Kernel mailing list (linux-kernel@vger.kernel.org).
0271
0272 Usually, the fastest way to have your bug fixed is to report it to mailing
0273 list used for the development of the code (linux-media ML) copying the
0274 driver maintainer (Hans).
0275
0276 If you are totally stumped as to whom to send the report, and
0277 ``get_maintainer.pl`` didn't provide you anything useful, send it to
0278 linux-kernel@vger.kernel.org.
0279
0280 Thanks for your help in making Linux as stable as humanly possible.
0281
0282 Fixing the bug
0283 --------------
0284
0285 If you know programming, you could help us by not only reporting the bug,
0286 but also providing us with a solution. After all, open source is about
0287 sharing what you do and don't you want to be recognised for your genius?
0288
0289 If you decide to take this way, once you have worked out a fix please submit
0290 it upstream.
0291
0292 Please do read
0293 :ref:`Documentation/process/submitting-patches.rst <submittingpatches>` though
0294 to help your code get accepted.
0295
0296
0297 ---------------------------------------------------------------------------
0298
0299 Notes on Oops tracing with ``klogd``
0300 ------------------------------------
0301
0302 In order to help Linus and the other kernel developers there has been
0303 substantial support incorporated into ``klogd`` for processing protection
0304 faults. In order to have full support for address resolution at least
0305 version 1.3-pl3 of the ``sysklogd`` package should be used.
0306
0307 When a protection fault occurs the ``klogd`` daemon automatically
0308 translates important addresses in the kernel log messages to their
0309 symbolic equivalents. This translated kernel message is then
0310 forwarded through whatever reporting mechanism ``klogd`` is using. The
0311 protection fault message can be simply cut out of the message files
0312 and forwarded to the kernel developers.
0313
0314 Two types of address resolution are performed by ``klogd``. The first is
0315 static translation and the second is dynamic translation.
0316 Static translation uses the System.map file.
0317 In order to do static translation the ``klogd`` daemon
0318 must be able to find a system map file at daemon initialization time.
0319 See the klogd man page for information on how ``klogd`` searches for map
0320 files.
0321
0322 Dynamic address translation is important when kernel loadable modules
0323 are being used. Since memory for kernel modules is allocated from the
0324 kernel's dynamic memory pools there are no fixed locations for either
0325 the start of the module or for functions and symbols in the module.
0326
0327 The kernel supports system calls which allow a program to determine
0328 which modules are loaded and their location in memory. Using these
0329 system calls the klogd daemon builds a symbol table which can be used
0330 to debug a protection fault which occurs in a loadable kernel module.
0331
0332 At the very minimum klogd will provide the name of the module which
0333 generated the protection fault. There may be additional symbolic
0334 information available if the developer of the loadable module chose to
0335 export symbol information from the module.
0336
0337 Since the kernel module environment can be dynamic there must be a
0338 mechanism for notifying the ``klogd`` daemon when a change in module
0339 environment occurs. There are command line options available which
0340 allow klogd to signal the currently executing daemon that symbol
0341 information should be refreshed. See the ``klogd`` manual page for more
0342 information.
0343
0344 A patch is included with the sysklogd distribution which modifies the
0345 ``modules-2.0.0`` package to automatically signal klogd whenever a module
0346 is loaded or unloaded. Applying this patch provides essentially
0347 seamless support for debugging protection faults which occur with
0348 kernel loadable modules.
0349
0350 The following is an example of a protection fault in a loadable module
0351 processed by ``klogd``::
0352
0353 Aug 29 09:51:01 blizard kernel: Unable to handle kernel paging request at virtual address f15e97cc
0354 Aug 29 09:51:01 blizard kernel: current->tss.cr3 = 0062d000, %cr3 = 0062d000
0355 Aug 29 09:51:01 blizard kernel: *pde = 00000000
0356 Aug 29 09:51:01 blizard kernel: Oops: 0002
0357 Aug 29 09:51:01 blizard kernel: CPU: 0
0358 Aug 29 09:51:01 blizard kernel: EIP: 0010:[oops:_oops+16/3868]
0359 Aug 29 09:51:01 blizard kernel: EFLAGS: 00010212
0360 Aug 29 09:51:01 blizard kernel: eax: 315e97cc ebx: 003a6f80 ecx: 001be77b edx: 00237c0c
0361 Aug 29 09:51:01 blizard kernel: esi: 00000000 edi: bffffdb3 ebp: 00589f90 esp: 00589f8c
0362 Aug 29 09:51:01 blizard kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
0363 Aug 29 09:51:01 blizard kernel: Process oops_test (pid: 3374, process nr: 21, stackpage=00589000)
0364 Aug 29 09:51:01 blizard kernel: Stack: 315e97cc 00589f98 0100b0b4 bffffed4 0012e38e 00240c64 003a6f80 00000001
0365 Aug 29 09:51:01 blizard kernel: 00000000 00237810 bfffff00 0010a7fa 00000003 00000001 00000000 bfffff00
0366 Aug 29 09:51:01 blizard kernel: bffffdb3 bffffed4 ffffffda 0000002b 0007002b 0000002b 0000002b 00000036
0367 Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128]
0368 Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3
0369
0370 ---------------------------------------------------------------------------
0371
0372 ::
0373
0374 Dr. G.W. Wettstein Oncology Research Div. Computing Facility
0375 Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com
0376 820 4th St. N.
0377 Fargo, ND 58122
0378 Phone: 701-234-7556