Uek5 master driver update #5

be2net · 2018-11-29T17:02:54Z

be2net patch set

Lancer HW cannot handle a TSO packet with a single segment. Disable TSO/GSO for such packets. Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>

If the driver receives a TX CQE with status as 0x1 or 0x9 or 0xb, the completion indexes should not be used. The driver must stop consuming CQEs from this TXQ/CQ. The TXQ from this point on-wards to be in a bad state. Driver should destroy and recreate the TXQ. 0x1: LANCER_TX_COMP_LSO_ERR 0x9 LANCER_TX_COMP_SGE_ERR 0xb: LANCER_TX_COMP_PARITY_ERR Reset the adapter if driver sees this error in TX completion. Also adding sge error counter in ethtool stats. Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>

gregmarsden · 2018-11-30T15:00:11Z

Thank you for your submission to UEK! We are reviewing the requested changes.

Per our kernel review and merge process, we are closing this pull request, but that doesn't mean it's not being reviewed.

[ Upstream commit 46b3722 ] We occasionaly hit following assert failure in 'perf top', when processing the /proc info in multiple threads. perf: ...include/linux/refcount.h:109: refcount_inc: Assertion `!(!refcount_inc_not_zero(r))' failed. The gdb backtrace looks like this: [Switching to Thread 0x7ffff11ba700 (LWP 13749)] 0x00007ffff50839fb in raise () from /lib64/libc.so.6 (gdb) #0 0x00007ffff50839fb in raise () from /lib64/libc.so.6 #1 0x00007ffff5085800 in abort () from /lib64/libc.so.6 #2 0x00007ffff507c0da in __assert_fail_base () from /lib64/libc.so.6 #3 0x00007ffff507c152 in __assert_fail () from /lib64/libc.so.6 #4 0x0000000000535373 in refcount_inc (r=0x7fffdc009be0) at ...include/linux/refcount.h:109 #5 0x00000000005354f1 in comm_str__get (cs=0x7fffdc009bc0) at util/comm.c:24 #6 0x00000000005356bd in __comm_str__findnew (str=0x7fffd000b260 ":2", root=0xbed5c0 <comm_str_root>) at util/comm.c:72 #7 0x000000000053579e in comm_str__findnew (str=0x7fffd000b260 ":2", root=0xbed5c0 <comm_str_root>) at util/comm.c:95 #8 0x000000000053582e in comm__new (str=0x7fffd000b260 ":2", timestamp=0, exec=false) at util/comm.c:111 #9 0x00000000005363bc in thread__new (pid=2, tid=2) at util/thread.c:57 #10 0x0000000000523da0 in ____machine__findnew_thread (machine=0xbfde38, threads=0xbfdf28, pid=2, tid=2, create=true) at util/machine.c:457 #11 0x0000000000523eb4 in __machine__findnew_thread (machine=0xbfde38, ... The failing assertion is this one: REFCOUNT_WARN(!refcount_inc_not_zero(r), ... The problem is that we keep global comm_str_root list, which is accessed by multiple threads during the 'perf top' startup and following 2 paths can race: thread 1: ... thread__new comm__new comm_str__findnew down_write(&comm_str_lock); __comm_str__findnew comm_str__get thread 2: ... comm__override or comm__free comm_str__put refcount_dec_and_test down_write(&comm_str_lock); rb_erase(&cs->rb_node, &comm_str_root); Because thread 2 first decrements the refcnt and only after then it removes the struct comm_str from the list, the thread 1 can find this object on the list with refcnt equls to 0 and hit the assert. This patch fixes the thread 1 __comm_str__findnew path, by ignoring objects that already dropped the refcnt to 0. For the rest of the objects we take the refcnt before comparing its name and release it afterwards with comm_str__put, which can also release the object completely. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Lukasz Odzioba <lukasz.odzioba@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wang Nan <wangnan0@huawei.com> Cc: kernel-team@lge.com Link: http://lkml.kernel.org/r/20180720101740.GA27176@krava Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3e1a127 upstream. When we disable hotplugging on the GPU, we need to be able to synchronize with each connector's hotplug interrupt handler before the interrupt is finally disabled. This can be a problem however, since nouveau_connector_detect() currently grabs a runtime power reference when handling connector probing. This will deadlock the runtime suspend handler like so: [ 861.480896] INFO: task kworker/0:2:61 blocked for more than 120 seconds. [ 861.483290] Tainted: G O 4.18.0-rc6Lyude-Test+ #1 [ 861.485158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 861.486332] kworker/0:2 D 0 61 2 0x80000000 [ 861.487044] Workqueue: events nouveau_display_hpd_work [nouveau] [ 861.487737] Call Trace: [ 861.488394] __schedule+0x322/0xaf0 [ 861.489070] schedule+0x33/0x90 [ 861.489744] rpm_resume+0x19c/0x850 [ 861.490392] ? finish_wait+0x90/0x90 [ 861.491068] __pm_runtime_resume+0x4e/0x90 [ 861.491753] nouveau_display_hpd_work+0x22/0x60 [nouveau] [ 861.492416] process_one_work+0x231/0x620 [ 861.493068] worker_thread+0x44/0x3a0 [ 861.493722] kthread+0x12b/0x150 [ 861.494342] ? wq_pool_ids_show+0x140/0x140 [ 861.494991] ? kthread_create_worker_on_cpu+0x70/0x70 [ 861.495648] ret_from_fork+0x3a/0x50 [ 861.496304] INFO: task kworker/6:2:320 blocked for more than 120 seconds. [ 861.496968] Tainted: G O 4.18.0-rc6Lyude-Test+ #1 [ 861.497654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 861.498341] kworker/6:2 D 0 320 2 0x80000080 [ 861.499045] Workqueue: pm pm_runtime_work [ 861.499739] Call Trace: [ 861.500428] __schedule+0x322/0xaf0 [ 861.501134] ? wait_for_completion+0x104/0x190 [ 861.501851] schedule+0x33/0x90 [ 861.502564] schedule_timeout+0x3a5/0x590 [ 861.503284] ? mark_held_locks+0x58/0x80 [ 861.503988] ? _raw_spin_unlock_irq+0x2c/0x40 [ 861.504710] ? wait_for_completion+0x104/0x190 [ 861.505417] ? trace_hardirqs_on_caller+0xf4/0x190 [ 861.506136] ? wait_for_completion+0x104/0x190 [ 861.506845] wait_for_completion+0x12c/0x190 [ 861.507555] ? wake_up_q+0x80/0x80 [ 861.508268] flush_work+0x1c9/0x280 [ 861.508990] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 [ 861.509735] nvif_notify_put+0xb1/0xc0 [nouveau] [ 861.510482] nouveau_display_fini+0xbd/0x170 [nouveau] [ 861.511241] nouveau_display_suspend+0x67/0x120 [nouveau] [ 861.511969] nouveau_do_suspend+0x5e/0x2d0 [nouveau] [ 861.512715] nouveau_pmops_runtime_suspend+0x47/0xb0 [nouveau] [ 861.513435] pci_pm_runtime_suspend+0x6b/0x180 [ 861.514165] ? pci_has_legacy_pm_support+0x70/0x70 [ 861.514897] __rpm_callback+0x7a/0x1d0 [ 861.515618] ? pci_has_legacy_pm_support+0x70/0x70 [ 861.516313] rpm_callback+0x24/0x80 [ 861.517027] ? pci_has_legacy_pm_support+0x70/0x70 [ 861.517741] rpm_suspend+0x142/0x6b0 [ 861.518449] pm_runtime_work+0x97/0xc0 [ 861.519144] process_one_work+0x231/0x620 [ 861.519831] worker_thread+0x44/0x3a0 [ 861.520522] kthread+0x12b/0x150 [ 861.521220] ? wq_pool_ids_show+0x140/0x140 [ 861.521925] ? kthread_create_worker_on_cpu+0x70/0x70 [ 861.522622] ret_from_fork+0x3a/0x50 [ 861.523299] INFO: task kworker/6:0:1329 blocked for more than 120 seconds. [ 861.523977] Tainted: G O 4.18.0-rc6Lyude-Test+ #1 [ 861.524644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 861.525349] kworker/6:0 D 0 1329 2 0x80000000 [ 861.526073] Workqueue: events nvif_notify_work [nouveau] [ 861.526751] Call Trace: [ 861.527411] __schedule+0x322/0xaf0 [ 861.528089] schedule+0x33/0x90 [ 861.528758] rpm_resume+0x19c/0x850 [ 861.529399] ? finish_wait+0x90/0x90 [ 861.530073] __pm_runtime_resume+0x4e/0x90 [ 861.530798] nouveau_connector_detect+0x7e/0x510 [nouveau] [ 861.531459] ? ww_mutex_lock+0x47/0x80 [ 861.532097] ? ww_mutex_lock+0x47/0x80 [ 861.532819] ? drm_modeset_lock+0x88/0x130 [drm] [ 861.533481] drm_helper_probe_detect_ctx+0xa0/0x100 [drm_kms_helper] [ 861.534127] drm_helper_hpd_irq_event+0xa4/0x120 [drm_kms_helper] [ 861.534940] nouveau_connector_hotplug+0x98/0x120 [nouveau] [ 861.535556] nvif_notify_work+0x2d/0xb0 [nouveau] [ 861.536221] process_one_work+0x231/0x620 [ 861.536994] worker_thread+0x44/0x3a0 [ 861.537757] kthread+0x12b/0x150 [ 861.538463] ? wq_pool_ids_show+0x140/0x140 [ 861.539102] ? kthread_create_worker_on_cpu+0x70/0x70 [ 861.539815] ret_from_fork+0x3a/0x50 [ 861.540521] Showing all locks held in the system: [ 861.541696] 2 locks held by kworker/0:2/61: [ 861.542406] #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620 [ 861.543071] #1: 0000000076868126 ((work_completion)(&drm->hpd_work)){+.+.}, at: process_one_work+0x1b3/0x620 [ 861.543814] 1 lock held by khungtaskd/64: [ 861.544535] #0: 0000000059db4b53 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185 [ 861.545160] 3 locks held by kworker/6:2/320: [ 861.545896] #0: 00000000d9e1bc59 ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620 [ 861.546702] #1: 00000000c9f92d84 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620 [ 861.547443] #2: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: nouveau_display_fini+0x96/0x170 [nouveau] [ 861.548146] 1 lock held by dmesg/983: [ 861.548889] 2 locks held by zsh/1250: [ 861.549605] #0: 00000000348e3cf6 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40 [ 861.550393] #1: 000000007009a7a8 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870 [ 861.551122] 6 locks held by kworker/6:0/1329: [ 861.551957] #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620 [ 861.552765] #1: 00000000ddb499ad ((work_completion)(&notify->work)#2){+.+.}, at: process_one_work+0x1b3/0x620 [ 861.553582] #2: 000000006e013cbe (&dev->mode_config.mutex){+.+.}, at: drm_helper_hpd_irq_event+0x6c/0x120 [drm_kms_helper] [ 861.554357] #3: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: drm_helper_hpd_irq_event+0x78/0x120 [drm_kms_helper] [ 861.555227] #4: 0000000044f294d9 (crtc_ww_class_acquire){+.+.}, at: drm_helper_probe_detect_ctx+0x3d/0x100 [drm_kms_helper] [ 861.556133] #5: 00000000db193642 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x4b/0x130 [drm] [ 861.557864] ============================================= [ 861.559507] NMI backtrace for cpu 2 [ 861.560363] CPU: 2 PID: 64 Comm: khungtaskd Tainted: G O 4.18.0-rc6Lyude-Test+ #1 [ 861.561197] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018 [ 861.561948] Call Trace: [ 861.562757] dump_stack+0x8e/0xd3 [ 861.563516] nmi_cpu_backtrace.cold.3+0x14/0x5a [ 861.564269] ? lapic_can_unplug_cpu.cold.27+0x42/0x42 [ 861.565029] nmi_trigger_cpumask_backtrace+0xa1/0xae [ 861.565789] arch_trigger_cpumask_backtrace+0x19/0x20 [ 861.566558] watchdog+0x316/0x580 [ 861.567355] kthread+0x12b/0x150 [ 861.568114] ? reset_hung_task_detector+0x20/0x20 [ 861.568863] ? kthread_create_worker_on_cpu+0x70/0x70 [ 861.569598] ret_from_fork+0x3a/0x50 [ 861.570370] Sending NMI from CPU 2 to CPUs 0-1,3-7: [ 861.571426] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x7f/0x120 [ 861.571429] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x7f/0x120 [ 861.571432] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x7f/0x120 [ 861.571464] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x7f/0x120 [ 861.571467] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x7f/0x120 [ 861.571469] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x7f/0x120 [ 861.571472] NMI backtrace for cpu 1 skipped: idling at intel_idle+0x7f/0x120 [ 861.572428] Kernel panic - not syncing: hung_task: blocked tasks So: fix this by making it so that normal hotplug handling /only/ happens so long as the GPU is currently awake without any pending runtime PM requests. In the event that a hotplug occurs while the device is suspending or resuming, we can simply defer our response until the GPU is fully runtime resumed again. Changes since v4: - Use a new trick I came up with using pm_runtime_get() instead of the hackish junk we had before Signed-off-by: Lyude Paul <lyude@redhat.com> Reviewed-by: Karol Herbst <kherbst@redhat.com> Acked-by: Daniel Vetter <daniel@ffwll.ch> Cc: stable@vger.kernel.org Cc: Lukas Wunner <lukas@wunner.de> Signed-off-by: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180452 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 28733324 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>

Refactor the code and export the build of the matched flow groups list to separate function. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Orabug: 28571062 (cherry picked from commit 46719d7) cherry-pick-repo=kernel/git/torvalds/linux.git Conflicts: drivers/net/ethernet/mellanox/mlx5/core/fs_core.c Conflict was caused by missing code semantics changes made in core/fs_core.c back in v4.15 (b92af5a). However, those changes in the commit (b92af5a) will be overwritten later by patch #5 (net/mlx5: Support multiple updates of steering rules in parallel) in this series. Signed-off-by: Erez Alfasi <ereza@mellanox.com> Signed-off-by: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Aron Silverton <aron.silverton@oracle.com>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180514 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

The customer hit this crash few times. PID: 31556 TASK: ffff880f823caa00 CPU: 1 COMMAND: "cellsrv" #0 [ffff880f823db850] machine_kexec at ffffffff8105d93c #1 [ffff880f823db8b0] crash_kexec at ffffffff811103b3 #2 [ffff880f823db980] oops_end at ffffffff8101a788 #3 [ffff880f823db9b0] no_context at ffffffff8106b9cf #4 [ffff880f823dba20] __bad_area_nosemaphore at ffffffff8106bc9d #5 [ffff880f823dba70] bad_area at ffffffff8106be97 #6 [ffff880f823dbaa0] __do_page_fault at ffffffff8106c71e #7 [ffff880f823dbb00] do_page_fault at ffffffff8106c81f #8 [ffff880f823dbb40] page_fault at ffffffff816b5a9f [exception RIP: rds_ib_inc_copy_to_user+104] RIP: ffffffffa04607b8 RSP: ffff880f823dbbf8 RFLAGS: 00010287 RAX: 0000000000000340 RBX: 0000000000001000 RCX: 0000000000004000 RDX: 0000000000001000 RSI: ffff88176cea2000 RDI: ffff8817d291f520 RBP: ffff880f823dbc48 R8: 0000000000001340 R9: 0000000000001000 R10: 0000000000001200 R11: ffff880f823dc000 R12: ffff880f823dbed0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff880f823dbc50] rds_recvmsg at ffffffffa041d837 [rds] int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to) ... ... ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); len = be32_to_cpu(inc->i_hdr.h_len); sg = frag->f_sg; while (iov_iter_count(to) && copied < len) { to_copy = min_t(unsigned long, iov_iter_count(to), sg->length - frag_off); ... sg is NULL and it crashes accessing sg->length above. The cause looks like is due to ic->i_frag_sz returning incorrect value. 16KB when 4KB was expected. if (copied % ic->i_frag_sz == 0) { frag = list_entry(frag->f_item.next, struct rds_page_frag, f_item); frag_off = 0; sg = frag->f_sg; } The other end is using 4KB RDS fragsize (Solaris Super Cluster). This end is UEK4 (4.1.12-94.8.4.el6uek.x86_64). The message being copied arrived over 4KB RDS frag size connection. But during the above check ic->i_frag_sz is 16KB. This can happen during a reconnect at the connection setup phase. We start off with ic->i_frag_sz as 16KB. Then settle down at 4KB. Failing this check if (copied % ic->i_frag_sz == 0) { can result in sg not getting set correctly. Say, "copied" = 4KB but ic->i_frag_sz is 16KB when it should be 4KB. During race condition with a reconnect, ic->i_frag_sz can be 16KB even though once the connection is set up it settled down to 4KB. It can change from 4KB to 16KB and back to 4KB during connection setup due to reconnect. We started seeing this crash after bug 26848749. But prior to that the same scenario could result in data copied to user from incorrect "sg" resulting in data corruption. Orabug: 28748049 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180535 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

commit c495144 upstream. We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted ------------------------------------------------------ aio-dio-invalid/30928 is trying to acquire lock: 0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0 but task is already holding lock: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&ei->dio_sem){++++}: lock_acquire+0xbd/0x220 down_write+0x51/0xb0 btrfs_log_changed_extents+0x80/0xa40 btrfs_log_inode+0xbaf/0x1000 btrfs_log_inode_parent+0x26f/0xa80 btrfs_log_dentry_safe+0x50/0x70 btrfs_sync_file+0x357/0x540 do_fsync+0x38/0x60 __ia32_sys_fdatasync+0x12/0x20 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #4 (&ei->log_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_record_unlink_dir+0x2a/0xa0 btrfs_unlink+0x5a/0xc0 vfs_unlink+0xb1/0x1a0 do_unlinkat+0x264/0x2b0 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #3 (sb_internal#2){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 start_transaction+0x3e6/0x590 btrfs_evict_inode+0x475/0x640 evict+0xbf/0x1b0 btrfs_run_delayed_iputs+0x6c/0x90 cleaner_kthread+0x124/0x1a0 kthread+0x106/0x140 ret_from_fork+0x3a/0x50 -> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_alloc_data_chunk_ondemand+0x197/0x530 btrfs_check_data_free_space+0x4c/0x90 btrfs_delalloc_reserve_space+0x20/0x60 btrfs_page_mkwrite+0x87/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #1 (sb_pagefaults){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 btrfs_page_mkwrite+0x6a/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #0 (&mm->mmap_sem){++++}: __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 down_read+0x48/0xb0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 __blockdev_direct_IO+0x32d/0x1c20 btrfs_direct_IO+0x227/0x400 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 io_submit_one+0x3a9/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 other info that might help us debug this: Chain exists of: &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->dio_sem); lock(&ei->log_mutex); lock(&ei->dio_sem); lock(&mm->mmap_sem); *** DEADLOCK *** 1 lock held by aio-dio-invalid/30928: #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 stack backtrace: CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: dump_stack+0x7c/0xbb print_circular_bug.isra.37+0x297/0x2a4 check_prev_add.constprop.45+0x781/0x7a0 ? __lock_acquire+0x42e/0x7a0 validate_chain.isra.41+0x7f0/0xb00 __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 ? get_user_pages_unlocked+0x5a/0x1e0 down_read+0x48/0xb0 ? get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 ? __alloc_workqueue_key+0x358/0x490 ? __blockdev_direct_IO+0x14b/0x1c20 __blockdev_direct_IO+0x32d/0x1c20 ? btrfs_run_delalloc_work+0x40/0x40 ? can_nocow_extent+0x490/0x490 ? kvm_clock_read+0x1f/0x30 ? can_nocow_extent+0x490/0x490 ? btrfs_run_delalloc_work+0x40/0x40 btrfs_direct_IO+0x227/0x400 ? btrfs_run_delalloc_work+0x40/0x40 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 ? kvm_clock_read+0x1f/0x30 ? __might_fault+0x3e/0x90 io_submit_one+0x3a9/0x620 ? io_submit_one+0xe5/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit ebaf39e ] The *_frag_reasm() functions are susceptible to miscalculating the byte count of packet fragments in case the truesize of a head buffer changes. The truesize member may be changed by the call to skb_unclone(), leaving the fragment memory limit counter unbalanced even if all fragments are processed. This miscalculation goes unnoticed as long as the network namespace which holds the counter is not destroyed. Should an attempt be made to destroy a network namespace that holds an unbalanced fragment memory limit counter the cleanup of the namespace never finishes. The thread handling the cleanup gets stuck in inet_frags_exit_net() waiting for the percpu counter to reach zero. The thread is usually in running state with a stacktrace similar to: PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4" #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480 #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856 #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0 #10 [ffff880621563e38] process_one_work at ffffffff81096f14 It is not possible to create new network namespaces, and processes that call unshare() end up being stuck in uninterruptible sleep state waiting to acquire the net_mutex. The bug was observed in the IPv6 netfilter code by Per Sundstrom. I thank him for his analysis of the problem. The parts of this patch that apply to IPv4 and IPv6 fragment reassembly are preemptive measures. Signed-off-by: Jiri Wiesner <jwiesner@suse.com> Reported-by: Per Sundstrom <per.sundstrom@redqube.se> Acked-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit c5a94f4 ] It was observed that a process blocked indefintely in __fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP to be cleared via fscache_wait_for_deferred_lookup(). At this time, ->backing_objects was empty, which would normaly prevent __fscache_read_or_alloc_page() from getting to the point of waiting. This implies that ->backing_objects was cleared *after* __fscache_read_or_alloc_page was was entered. When an object is "killed" and then "dropped", FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is ->backing_objects cleared. This leaves a window where something else can set FSCACHE_COOKIE_LOOKING_UP and __fscache_read_or_alloc_page() can start waiting, before ->backing_objects is cleared There is some uncertainty in this analysis, but it seems to be fit the observations. Adding the wake in this patch will be handled correctly by __fscache_read_or_alloc_page(), as it checks if ->backing_objects is empty again, after waiting. Customer which reported the hang, also report that the hang cannot be reproduced with this fix. The backtrace for the blocked process looked like: PID: 29360 TASK: ffff881ff2ac0f80 CPU: 3 COMMAND: "zsh" #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1 #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8 #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache] #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache] #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs] #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs] #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73 #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs] #10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756 #11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa #12 [ffff881ff43eff18] sys_read at ffffffff811fda62 #13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

This work around should be reverted when upstream commit (d8b91dd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) is available in uek. Issue appear is fixed in upstream tag 4.16.0 . Tag 4.15.0 still has this issue. The lask known commit on the perf topic branch that solve this issue is this: (c19d084 (tag: perf-core-for-mingo-4.16-20180125) perf trace beauty flock: Move to separate object file). Without this commit the perf topic branch has the below issue. With this commit the branch does not have the issue. Issue is that the above commit does not fix the issue on top of upstream tag 4.15.0. So the issue is probably fixed by this commit and some additional commits on the perf topic branch *or/and* on master branch below the point that the perf branch was branched. Also this specific commit is not a fix and the only possible relation to this bug is that it touches the 'flock' code which is used by bash/scripts to synchronize. To find the additional commits via git bisect I need to re-order the commits so that the above commit will be *below* the other commits that solve this issue. To do that I need to know what's the lowest commit that relate to this fix. I do not know and have no way to know that. Attempt to merge the perf topic on top of uek5 produce ~20k commits and tons of merge conflicts as uek5 is way behind the upstream. So can't even know if the topic branch with it's ~270 commits fix this issue for uek5. So I chose to work-around the issue and wait for the upstream topic merge to obsolite this commit. When issue occuer: Serial is flooded with messages: [71266.680745] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.682740] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.685738] bondib0: link status up for interface ib0, enabling it in 0 ms Then panic occur: [71266.695757] INFO: task NetworkManager:5837 blocked for more than 120 seconds. [71266.695759] Not tainted 4.14.35-1902.0.6.el7uek.x86_64 #2 [71266.695760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [71266.695761] NetworkManager D 0 5837 1 0x00000082 [71266.695765] Call Trace: [71266.695778] __schedule+0x2bc/0x8da [71266.695782] schedule+0x36/0x7c [71266.695785] schedule_preempt_disabled+0xe/0x10 [71266.695788] __mutex_lock.isra.5+0x20c/0x634 [71266.695792] __mutex_lock_slowpath+0x13/0x15 [71266.695794] mutex_lock+0x2f/0x3a [71266.695800] rtnetlink_rcv_msg+0x1d0/0x289 [71266.695806] ? __skb_try_recv_datagram+0xca/0x174 [71266.695809] ? rtnl_calcit.isra.25+0x110/0x103 [71266.695812] netlink_rcv_skb+0xdf/0x111 [71266.695816] rtnetlink_rcv+0x15/0x17 [71266.695818] netlink_unicast+0x18d/0x255 [71266.695820] netlink_sendmsg+0x2df/0x3cc [71266.695825] sock_sendmsg+0x3e/0x4a [71266.695828] ___sys_sendmsg+0x2b5/0x2c6 [71266.695832] ? arch_tlb_finish_mmu+0x1b/0xcb [71266.695835] ? tlb_finish_mmu+0x23/0x30 [71266.695838] ? unmap_region+0xf4/0x12d [71266.695844] ? lockref_put_or_lock+0x44/0x72 [71266.695846] ? __vma_rb_erase+0x10f/0x1f4 [71266.695850] __sys_sendmsg+0x54/0x8d [71266.695854] SyS_sendmsg+0x12/0x1c [71266.695860] do_syscall_64+0x79/0x1ae [71266.695864] entry_SYSCALL_64_after_hwframe+0x151/0x0 [71266.695866] RIP: 0033:0x7f16f2553c5d [71266.695867] RSP: 002b:00007ffff7a493f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [71266.695870] RAX: ffffffffffffffda RBX: 00005570a5026380 RCX: 00007f16f2553c5d [71266.695874] RDX: 0000000000000000 RSI: 00007ffff7a49420 RDI: 0000000000000007 [71266.695875] RBP: 00007ffff7a49420 R08: 0000000000000001 R09: 0000000000000000 [71266.695876] R10: 0000000000000808 R11: 0000000000000293 R12: 00005570a5026380 [71266.695876] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f16d4004b70 Issue analysis: The ip process is hung in addrconf_notify while trying to print to serial one of the below messages: "ADDRCONF(NETDEV_UP): %s: link is not ready\n" "ADDRCONF(NETDEV_CHANGE): %s: link becomes ready\n" The ip process hold the rtnl_lock while network-manager process try to grab this lock in 1 msec loop and every time it fail to grab the lock, the network-manager send additional line to the serial log as seen in the dmesg: "bondib0: link status up for interface ib0, enabling it in 0 ms" So the bond device flood the serial while waiting for the rtnl_lock while ip hold the rtnl_lock while waiting for the serial. Offending stack trace from vmcore is this: PID: 30063 TASK: ffff909c3f675a00 CPU: 7 COMMAND: "ip" #0 [fffffe000013ce38] crash_nmi_callback at ffffffff8e059ba7 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/paravirt.h: 99 #1 [fffffe000013ce48] nmi_handle at ffffffff8e032748 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 137 #2 [fffffe000013cea0] default_do_nmi at ffffffff8e032c96 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 336 #3 [fffffe000013cec8] do_nmi at ffffffff8e032e76 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 521 #4 [fffffe000013cef0] end_repeat_nmi at ffffffff8ea0436f /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 1750 [exception RIP: delay_tsc+51] RIP: ffffffff8e8558f3 RSP: ffff9f63c6c07390 RFLAGS: 00000046 RAX: 0000000016d23977 RBX: ffffffff903fbc00 RCX: 00009b7616d23038 RDX: 0000000000009b76 RSI: 0000000000000007 RDI: 000000000000095a RBP: ffff9f63c6c07390 R8: 00000000fffffffe R9: 0000000000000000 R10: 0000000000000005 R11: 0000000000020503 R12: 000000000000261f R13: 0000000000000020 R14: ffffffff8f96de2f R15: ffffffff903fbc00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 <NMI exception stack> #5 [ffff9f63c6c07390] delay_tsc at ffffffff8e8558f3 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 #6 [ffff9f63c6c07398] __const_udelay at ffffffff8e855838 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/lib/delay.c: 176 #7 [ffff9f63c6c073a8] wait_for_xmitr at ffffffff8e510dcc /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/nmi.h: 126 #8 [ffff9f63c6c073d0] serial8250_console_putchar at ffffffff8e510e6c /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/serial_core.h: 265 #9 [ffff9f63c6c073f0] uart_console_write at ffffffff8e509573 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/serial_core.c: 1886 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_port.c: 3256 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_core.c: 598 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1574 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1766 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1808 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk_safe.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1842 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/ipv6/addrconf.c: 3532 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 95 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1682 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1697 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 6903 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2072 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2624 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4255 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 2433 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4268 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1287 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1877 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 646 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2061 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/file.h: 26 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2102 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/common.c: 295 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 247 RIP: 00007faf75ccafd0 RSP: 00007ffc710a9368 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000005c65f66d RCX: 00007faf75ccafd0 RDX: 0000000000000000 RSI: 00007ffc710a93b0 RDI: 0000000000000003 RBP: 00007ffc710a93b0 R8: 0000000000000000 R9: 0000000000000008 R10: 00007ffc710a8f30 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000066a440 R14: 00007ffc710a9458 R15: 00007ffc710a9b88 ORIG_RAX: 000000000000002e CS: 0033 SS: 002b Orabug: 29357838 Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Aron Silverton <aron.silverton@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>

This work around should be reverted when upstream commit (d8b91dd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) is available in uek. Issue appear is fixed in upstream tag 4.16.0 . Tag 4.15.0 still has this issue. The lask known commit on the perf topic branch that solve this issue is this: (c19d084 (tag: perf-core-for-mingo-4.16-20180125) perf trace beauty flock: Move to separate object file). Without this commit the perf topic branch has the below issue. With this commit the branch does not have the issue. Issue is that the above commit does not fix the issue on top of upstream tag 4.15.0. So the issue is probably fixed by this commit and some additional commits on the perf topic branch *or/and* on master branch below the point that the perf branch was branched. Also this specific commit is not a fix and the only possible relation to this bug is that it touches the 'flock' code which is used by bash/scripts to synchronize. To find the additional commits via git bisect I need to re-order the commits so that the above commit will be *below* the other commits that solve this issue. To do that I need to know what's the lowest commit that relate to this fix. I do not know and have no way to know that. Attempt to merge the perf topic on top of uek5 produce ~20k commits and tons of merge conflicts as uek5 is way behind the upstream. So can't even know if the topic branch with it's ~270 commits fix this issue for uek5. So I chose to work-around the issue and wait for the upstream topic merge to obsolite this commit. When issue occuer: Serial is flooded with messages: [71266.680745] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.682740] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.685738] bondib0: link status up for interface ib0, enabling it in 0 ms Then panic occur: [71266.695757] INFO: task NetworkManager:5837 blocked for more than 120 seconds. [71266.695759] Not tainted 4.14.35-1902.0.6.el7uek.x86_64 #2 [71266.695760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [71266.695761] NetworkManager D 0 5837 1 0x00000082 [71266.695765] Call Trace: [71266.695778] __schedule+0x2bc/0x8da [71266.695782] schedule+0x36/0x7c [71266.695785] schedule_preempt_disabled+0xe/0x10 [71266.695788] __mutex_lock.isra.5+0x20c/0x634 [71266.695792] __mutex_lock_slowpath+0x13/0x15 [71266.695794] mutex_lock+0x2f/0x3a [71266.695800] rtnetlink_rcv_msg+0x1d0/0x289 [71266.695806] ? __skb_try_recv_datagram+0xca/0x174 [71266.695809] ? rtnl_calcit.isra.25+0x110/0x103 [71266.695812] netlink_rcv_skb+0xdf/0x111 [71266.695816] rtnetlink_rcv+0x15/0x17 [71266.695818] netlink_unicast+0x18d/0x255 [71266.695820] netlink_sendmsg+0x2df/0x3cc [71266.695825] sock_sendmsg+0x3e/0x4a [71266.695828] ___sys_sendmsg+0x2b5/0x2c6 [71266.695832] ? arch_tlb_finish_mmu+0x1b/0xcb [71266.695835] ? tlb_finish_mmu+0x23/0x30 [71266.695838] ? unmap_region+0xf4/0x12d [71266.695844] ? lockref_put_or_lock+0x44/0x72 [71266.695846] ? __vma_rb_erase+0x10f/0x1f4 [71266.695850] __sys_sendmsg+0x54/0x8d [71266.695854] SyS_sendmsg+0x12/0x1c [71266.695860] do_syscall_64+0x79/0x1ae [71266.695864] entry_SYSCALL_64_after_hwframe+0x151/0x0 [71266.695866] RIP: 0033:0x7f16f2553c5d [71266.695867] RSP: 002b:00007ffff7a493f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [71266.695870] RAX: ffffffffffffffda RBX: 00005570a5026380 RCX: 00007f16f2553c5d [71266.695874] RDX: 0000000000000000 RSI: 00007ffff7a49420 RDI: 0000000000000007 [71266.695875] RBP: 00007ffff7a49420 R08: 0000000000000001 R09: 0000000000000000 [71266.695876] R10: 0000000000000808 R11: 0000000000000293 R12: 00005570a5026380 [71266.695876] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f16d4004b70 Issue analysis: The ip process is hung in addrconf_notify while trying to print to serial one of the below messages: "ADDRCONF(NETDEV_UP): %s: link is not ready\n" "ADDRCONF(NETDEV_CHANGE): %s: link becomes ready\n" The ip process hold the rtnl_lock while network-manager process try to grab this lock in 1 msec loop and every time it fail to grab the lock, the network-manager send additional line to the serial log as seen in the dmesg: "bondib0: link status up for interface ib0, enabling it in 0 ms" So the bond device flood the serial while waiting for the rtnl_lock while ip hold the rtnl_lock while waiting for the serial. Offending stack trace from vmcore is this: PID: 30063 TASK: ffff909c3f675a00 CPU: 7 COMMAND: "ip" #0 [fffffe000013ce38] crash_nmi_callback at ffffffff8e059ba7 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/paravirt.h: 99 #1 [fffffe000013ce48] nmi_handle at ffffffff8e032748 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 137 #2 [fffffe000013cea0] default_do_nmi at ffffffff8e032c96 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 336 #3 [fffffe000013cec8] do_nmi at ffffffff8e032e76 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 521 #4 [fffffe000013cef0] end_repeat_nmi at ffffffff8ea0436f /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 1750 [exception RIP: delay_tsc+51] RIP: ffffffff8e8558f3 RSP: ffff9f63c6c07390 RFLAGS: 00000046 RAX: 0000000016d23977 RBX: ffffffff903fbc00 RCX: 00009b7616d23038 RDX: 0000000000009b76 RSI: 0000000000000007 RDI: 000000000000095a RBP: ffff9f63c6c07390 R8: 00000000fffffffe R9: 0000000000000000 R10: 0000000000000005 R11: 0000000000020503 R12: 000000000000261f R13: 0000000000000020 R14: ffffffff8f96de2f R15: ffffffff903fbc00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 <NMI exception stack> #5 [ffff9f63c6c07390] delay_tsc at ffffffff8e8558f3 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 #6 [ffff9f63c6c07398] __const_udelay at ffffffff8e855838 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/lib/delay.c: 176 #7 [ffff9f63c6c073a8] wait_for_xmitr at ffffffff8e510dcc /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/nmi.h: 126 #8 [ffff9f63c6c073d0] serial8250_console_putchar at ffffffff8e510e6c /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/serial_core.h: 265 #9 [ffff9f63c6c073f0] uart_console_write at ffffffff8e509573 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/serial_core.c: 1886 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_port.c: 3256 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_core.c: 598 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1574 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1766 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1808 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk_safe.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1842 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/ipv6/addrconf.c: 3532 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 95 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1682 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1697 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 6903 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2072 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2624 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4255 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 2433 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4268 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1287 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1877 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 646 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2061 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/file.h: 26 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2102 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/common.c: 295 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 247 RIP: 00007faf75ccafd0 RSP: 00007ffc710a9368 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000005c65f66d RCX: 00007faf75ccafd0 RDX: 0000000000000000 RSI: 00007ffc710a93b0 RDI: 0000000000000003 RBP: 00007ffc710a93b0 R8: 0000000000000000 R9: 0000000000000008 R10: 00007ffc710a8f30 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000066a440 R14: 00007ffc710a9458 R15: 00007ffc710a9b88 ORIG_RAX: 000000000000002e CS: 0033 SS: 002b Orabug: 29016284 Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>

Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the function called from __sanitizer_cov_trace_const_cmp4 shouldn't be traceable either. ftrace_graph_caller() gets called every time func write_comp_data() gets called if it isn't marked 'notrace'. This is the backtrace from gdb: #0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 #1 0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151 #2 0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116 #3 0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=<optimized out>, arg2=<optimized out>) at ../kernel/kcov.c:188 #4 0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27 #5 0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182 Rework so that write_comp_data() that are called from __sanitizer_cov_trace_*_cmp*() are marked as 'notrace'. Commit 903e8ff ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace") missed to mark write_comp_data() as 'notrace'. When that patch was created gcc-7 was used. In lib/Kconfig.debug config KCOV_ENABLE_COMPARISONS depends on $(cc-option,-fsanitize-coverage=trace-cmp) That code path isn't hit with gcc-7. However, it were that with gcc-8. Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6347244) Orabug: 29558684 Signed-off-by: George Kennedy <george.kennedy@oracle.com> Reviewed-by: John Donnelly <john.p.donnelly@oracle.com>

This work around should be reverted when upstream commit (d8b91dd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) is available in uek. Issue appear is fixed in upstream tag 4.16.0 . Tag 4.15.0 still has this issue. The lask known commit on the perf topic branch that solve this issue is this: (c19d084 (tag: perf-core-for-mingo-4.16-20180125) perf trace beauty flock: Move to separate object file). Without this commit the perf topic branch has the below issue. With this commit the branch does not have the issue. Issue is that the above commit does not fix the issue on top of upstream tag 4.15.0. So the issue is probably fixed by this commit and some additional commits on the perf topic branch *or/and* on master branch below the point that the perf branch was branched. Also this specific commit is not a fix and the only possible relation to this bug is that it touches the 'flock' code which is used by bash/scripts to synchronize. To find the additional commits via git bisect I need to re-order the commits so that the above commit will be *below* the other commits that solve this issue. To do that I need to know what's the lowest commit that relate to this fix. I do not know and have no way to know that. Attempt to merge the perf topic on top of uek5 produce ~20k commits and tons of merge conflicts as uek5 is way behind the upstream. So can't even know if the topic branch with it's ~270 commits fix this issue for uek5. So I chose to work-around the issue and wait for the upstream topic merge to obsolite this commit. When issue occuer: Serial is flooded with messages: [71266.680745] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.682740] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.685738] bondib0: link status up for interface ib0, enabling it in 0 ms Then panic occur: [71266.695757] INFO: task NetworkManager:5837 blocked for more than 120 seconds. [71266.695759] Not tainted 4.14.35-1902.0.6.el7uek.x86_64 #2 [71266.695760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [71266.695761] NetworkManager D 0 5837 1 0x00000082 [71266.695765] Call Trace: [71266.695778] __schedule+0x2bc/0x8da [71266.695782] schedule+0x36/0x7c [71266.695785] schedule_preempt_disabled+0xe/0x10 [71266.695788] __mutex_lock.isra.5+0x20c/0x634 [71266.695792] __mutex_lock_slowpath+0x13/0x15 [71266.695794] mutex_lock+0x2f/0x3a [71266.695800] rtnetlink_rcv_msg+0x1d0/0x289 [71266.695806] ? __skb_try_recv_datagram+0xca/0x174 [71266.695809] ? rtnl_calcit.isra.25+0x110/0x103 [71266.695812] netlink_rcv_skb+0xdf/0x111 [71266.695816] rtnetlink_rcv+0x15/0x17 [71266.695818] netlink_unicast+0x18d/0x255 [71266.695820] netlink_sendmsg+0x2df/0x3cc [71266.695825] sock_sendmsg+0x3e/0x4a [71266.695828] ___sys_sendmsg+0x2b5/0x2c6 [71266.695832] ? arch_tlb_finish_mmu+0x1b/0xcb [71266.695835] ? tlb_finish_mmu+0x23/0x30 [71266.695838] ? unmap_region+0xf4/0x12d [71266.695844] ? lockref_put_or_lock+0x44/0x72 [71266.695846] ? __vma_rb_erase+0x10f/0x1f4 [71266.695850] __sys_sendmsg+0x54/0x8d [71266.695854] SyS_sendmsg+0x12/0x1c [71266.695860] do_syscall_64+0x79/0x1ae [71266.695864] entry_SYSCALL_64_after_hwframe+0x151/0x0 [71266.695866] RIP: 0033:0x7f16f2553c5d [71266.695867] RSP: 002b:00007ffff7a493f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [71266.695870] RAX: ffffffffffffffda RBX: 00005570a5026380 RCX: 00007f16f2553c5d [71266.695874] RDX: 0000000000000000 RSI: 00007ffff7a49420 RDI: 0000000000000007 [71266.695875] RBP: 00007ffff7a49420 R08: 0000000000000001 R09: 0000000000000000 [71266.695876] R10: 0000000000000808 R11: 0000000000000293 R12: 00005570a5026380 [71266.695876] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f16d4004b70 Issue analysis: The ip process is hung in addrconf_notify while trying to print to serial one of the below messages: "ADDRCONF(NETDEV_UP): %s: link is not ready\n" "ADDRCONF(NETDEV_CHANGE): %s: link becomes ready\n" The ip process hold the rtnl_lock while network-manager process try to grab this lock in 1 msec loop and every time it fail to grab the lock, the network-manager send additional line to the serial log as seen in the dmesg: "bondib0: link status up for interface ib0, enabling it in 0 ms" So the bond device flood the serial while waiting for the rtnl_lock while ip hold the rtnl_lock while waiting for the serial. Offending stack trace from vmcore is this: PID: 30063 TASK: ffff909c3f675a00 CPU: 7 COMMAND: "ip" #0 [fffffe000013ce38] crash_nmi_callback at ffffffff8e059ba7 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/paravirt.h: 99 #1 [fffffe000013ce48] nmi_handle at ffffffff8e032748 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 137 #2 [fffffe000013cea0] default_do_nmi at ffffffff8e032c96 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 336 #3 [fffffe000013cec8] do_nmi at ffffffff8e032e76 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 521 #4 [fffffe000013cef0] end_repeat_nmi at ffffffff8ea0436f /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 1750 [exception RIP: delay_tsc+51] RIP: ffffffff8e8558f3 RSP: ffff9f63c6c07390 RFLAGS: 00000046 RAX: 0000000016d23977 RBX: ffffffff903fbc00 RCX: 00009b7616d23038 RDX: 0000000000009b76 RSI: 0000000000000007 RDI: 000000000000095a RBP: ffff9f63c6c07390 R8: 00000000fffffffe R9: 0000000000000000 R10: 0000000000000005 R11: 0000000000020503 R12: 000000000000261f R13: 0000000000000020 R14: ffffffff8f96de2f R15: ffffffff903fbc00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 <NMI exception stack> #5 [ffff9f63c6c07390] delay_tsc at ffffffff8e8558f3 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 #6 [ffff9f63c6c07398] __const_udelay at ffffffff8e855838 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/lib/delay.c: 176 #7 [ffff9f63c6c073a8] wait_for_xmitr at ffffffff8e510dcc /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/nmi.h: 126 #8 [ffff9f63c6c073d0] serial8250_console_putchar at ffffffff8e510e6c /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/serial_core.h: 265 #9 [ffff9f63c6c073f0] uart_console_write at ffffffff8e509573 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/serial_core.c: 1886 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_port.c: 3256 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_core.c: 598 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1574 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1766 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1808 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk_safe.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1842 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/ipv6/addrconf.c: 3532 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 95 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1682 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1697 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 6903 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2072 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2624 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4255 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 2433 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4268 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1287 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1877 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 646 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2061 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/file.h: 26 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2102 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/common.c: 295 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 247 RIP: 00007faf75ccafd0 RSP: 00007ffc710a9368 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000005c65f66d RCX: 00007faf75ccafd0 RDX: 0000000000000000 RSI: 00007ffc710a93b0 RDI: 0000000000000003 RBP: 00007ffc710a93b0 R8: 0000000000000000 R9: 0000000000000008 R10: 00007ffc710a8f30 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000066a440 R14: 00007ffc710a9458 R15: 00007ffc710a9b88 ORIG_RAX: 000000000000002e CS: 0033 SS: 002b Orabug: 29631452 Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Aron Silverton <aron.silverton@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>

…_map [ Upstream commit 39df730 ] Detected via gcc's ASan: Direct leak of 2048 byte(s) in 64 object(s) allocated from: 6 #0 0x7f606512e370 in __interceptor_realloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xee370) 7 #1 0x556b0f1d7ddd in thread_map__realloc util/thread_map.c:43 8 #2 0x556b0f1d84c7 in thread_map__new_by_tid util/thread_map.c:85 9 #3 0x556b0f0e045e in is_event_supported util/parse-events.c:2250 10 #4 0x556b0f0e1aa1 in print_hwcache_events util/parse-events.c:2382 11 #5 0x556b0f0e3231 in print_events util/parse-events.c:2514 12 #6 0x556b0ee0a66e in cmd_list /home/changbin/work/linux/tools/perf/builtin-list.c:58 13 #7 0x556b0f01e0ae in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 14 #8 0x556b0f01e859 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 15 #9 0x556b0f01edc8 in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 16 #10 0x556b0f01f71f in main /home/changbin/work/linux/tools/perf/perf.c:520 17 #11 0x7f6062ccf09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: 8989605 ("perf tools: Do not put a variable sized type not at the end of a struct") Link: http://lkml.kernel.org/r/20190316080556.3075-3-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 54569ba ] Detected with gcc's ASan: Direct leak of 66 byte(s) in 5 object(s) allocated from: #0 0x7ff3b1f32070 in __interceptor_strdup (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x3b070) #1 0x560c8761034d in collect_config util/config.c:597 #2 0x560c8760d9cb in get_value util/config.c:169 #3 0x560c8760dfd7 in perf_parse_file util/config.c:285 #4 0x560c8760e0d2 in perf_config_from_file util/config.c:476 #5 0x560c876108fd in perf_config_set__init util/config.c:661 #6 0x560c87610c72 in perf_config_set__new util/config.c:709 #7 0x560c87610d2f in perf_config__init util/config.c:718 #8 0x560c87610e5d in perf_config util/config.c:730 #9 0x560c875ddea0 in main /home/changbin/work/linux/tools/perf/perf.c:442 #10 0x7ff3afb8609a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Taeung Song <treeze.taeung@gmail.com> Fixes: 20105ca ("perf config: Introduce perf_config_set class") Link: http://lkml.kernel.org/r/20190316080556.3075-6-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 8bde851 ] Detected with gcc's ASan: Direct leak of 4356 byte(s) in 120 object(s) allocated from: #0 0x7ff1a2b5a070 in __interceptor_strdup (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x3b070) #1 0x55719aef4814 in build_id_cache__origname util/build-id.c:215 #2 0x55719af649b6 in print_sdt_events util/parse-events.c:2339 #3 0x55719af66272 in print_events util/parse-events.c:2542 #4 0x55719ad1ecaa in cmd_list /home/changbin/work/linux/tools/perf/builtin-list.c:58 #5 0x55719aec745d in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 #6 0x55719aec7d1a in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 #7 0x55719aec8184 in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 #8 0x55719aeca41a in main /home/changbin/work/linux/tools/perf/perf.c:520 #9 0x7ff1a07ae09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: 40218da ("perf list: Show SDT and pre-cached events") Link: http://lkml.kernel.org/r/20190316080556.3075-7-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 42dfa45 ] Using gcc's ASan, Changbin reports: ================================================================= ==7494==ERROR: LeakSanitizer: detected memory leaks Direct leak of 48 byte(s) in 1 object(s) allocated from: #0 0x7f0333a89138 in calloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xee138) #1 0x5625e5330a5e in zalloc util/util.h:23 #2 0x5625e5330a9b in perf_counts__new util/counts.c:10 #3 0x5625e5330ca0 in perf_evsel__alloc_counts util/counts.c:47 #4 0x5625e520d8e5 in __perf_evsel__read_on_cpu util/evsel.c:1505 #5 0x5625e517a985 in perf_evsel__read_on_cpu /home/work/linux/tools/perf/util/evsel.h:347 #6 0x5625e517ad1a in test__openat_syscall_event tests/openat-syscall.c:47 #7 0x5625e51528e6 in run_test tests/builtin-test.c:358 #8 0x5625e5152baf in test_and_print tests/builtin-test.c:388 #9 0x5625e51543fe in __cmd_test tests/builtin-test.c:583 #10 0x5625e515572f in cmd_test tests/builtin-test.c:722 #11 0x5625e51c3fb8 in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 #12 0x5625e51c44f7 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 #13 0x5625e51c48fb in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 #14 0x5625e51c5069 in main /home/changbin/work/linux/tools/perf/perf.c:520 #15 0x7f033214d09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Indirect leak of 72 byte(s) in 1 object(s) allocated from: #0 0x7f0333a89138 in calloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xee138) #1 0x5625e532560d in zalloc util/util.h:23 #2 0x5625e532566b in xyarray__new util/xyarray.c:10 #3 0x5625e5330aba in perf_counts__new util/counts.c:15 #4 0x5625e5330ca0 in perf_evsel__alloc_counts util/counts.c:47 #5 0x5625e520d8e5 in __perf_evsel__read_on_cpu util/evsel.c:1505 #6 0x5625e517a985 in perf_evsel__read_on_cpu /home/work/linux/tools/perf/util/evsel.h:347 #7 0x5625e517ad1a in test__openat_syscall_event tests/openat-syscall.c:47 #8 0x5625e51528e6 in run_test tests/builtin-test.c:358 #9 0x5625e5152baf in test_and_print tests/builtin-test.c:388 #10 0x5625e51543fe in __cmd_test tests/builtin-test.c:583 #11 0x5625e515572f in cmd_test tests/builtin-test.c:722 #12 0x5625e51c3fb8 in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 #13 0x5625e51c44f7 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 #14 0x5625e51c48fb in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 #15 0x5625e51c5069 in main /home/changbin/work/linux/tools/perf/perf.c:520 #16 0x7f033214d09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) His patch took care of evsel->prev_raw_counts, but the above backtraces are about evsel->counts, so fix that instead. Reported-by: Changbin Du <changbin.du@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/n/tip-hd1x13g59f0nuhe4anxhsmfp@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

…_event_on_all_cpus test [ Upstream commit 93faa52 ] ================================================================= ==7497==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f0333a88f30 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xedf30) #1 0x5625e5326213 in cpu_map__trim_new util/cpumap.c:45 #2 0x5625e5326703 in cpu_map__read util/cpumap.c:103 #3 0x5625e53267ef in cpu_map__read_all_cpu_map util/cpumap.c:120 #4 0x5625e5326915 in cpu_map__new util/cpumap.c:135 #5 0x5625e517b355 in test__openat_syscall_event_on_all_cpus tests/openat-syscall-all-cpus.c:36 #6 0x5625e51528e6 in run_test tests/builtin-test.c:358 #7 0x5625e5152baf in test_and_print tests/builtin-test.c:388 #8 0x5625e51543fe in __cmd_test tests/builtin-test.c:583 #9 0x5625e515572f in cmd_test tests/builtin-test.c:722 #10 0x5625e51c3fb8 in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 #11 0x5625e51c44f7 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 #12 0x5625e51c48fb in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 #13 0x5625e51c5069 in main /home/changbin/work/linux/tools/perf/perf.c:520 #14 0x7f033214d09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: f30a79b ("perf tools: Add reference counting for cpu_map object") Link: http://lkml.kernel.org/r/20190316080556.3075-15-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit f97a899 ] ================================================================= ==7506==ERROR: LeakSanitizer: detected memory leaks Direct leak of 13 byte(s) in 3 object(s) allocated from: #0 0x7f03339d6070 in __interceptor_strdup (/usr/lib/x86_64-linux-gnu/libasan.so.5+0x3b070) #1 0x5625e53aaef0 in expr__find_other util/expr.y:221 #2 0x5625e51bcd3f in test__expr tests/expr.c:52 #3 0x5625e51528e6 in run_test tests/builtin-test.c:358 #4 0x5625e5152baf in test_and_print tests/builtin-test.c:388 #5 0x5625e51543fe in __cmd_test tests/builtin-test.c:583 #6 0x5625e515572f in cmd_test tests/builtin-test.c:722 #7 0x5625e51c3fb8 in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 #8 0x5625e51c44f7 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 #9 0x5625e51c48fb in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 #10 0x5625e51c5069 in main /home/changbin/work/linux/tools/perf/perf.c:520 #11 0x7f033214d09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Signed-off-by: Changbin Du <changbin.du@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: 0751673 ("perf tools: Add a simple expression parser for JSON") Link: http://lkml.kernel.org/r/20190316080556.3075-16-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit d982b33 ] ================================================================= ==20875==ERROR: LeakSanitizer: detected memory leaks Direct leak of 1160 byte(s) in 1 object(s) allocated from: #0 0x7f1b6fc84138 in calloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xee138) #1 0x55bd50005599 in zalloc util/util.h:23 #2 0x55bd500068f5 in perf_evsel__newtp_idx util/evsel.c:327 #3 0x55bd4ff810fc in perf_evsel__newtp /home/work/linux/tools/perf/util/evsel.h:216 #4 0x55bd4ff81608 in test__perf_evsel__tp_sched_test tests/evsel-tp-sched.c:69 #5 0x55bd4ff528e6 in run_test tests/builtin-test.c:358 #6 0x55bd4ff52baf in test_and_print tests/builtin-test.c:388 #7 0x55bd4ff543fe in __cmd_test tests/builtin-test.c:583 #8 0x55bd4ff5572f in cmd_test tests/builtin-test.c:722 #9 0x55bd4ffc4087 in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 #10 0x55bd4ffc45c6 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 #11 0x55bd4ffc49ca in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 #12 0x55bd4ffc5138 in main /home/changbin/work/linux/tools/perf/perf.c:520 #13 0x7f1b6e34809a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Indirect leak of 19 byte(s) in 1 object(s) allocated from: #0 0x7f1b6fc83f30 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xedf30) #1 0x7f1b6e3ac30f in vasprintf (/lib/x86_64-linux-gnu/libc.so.6+0x8830f) Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: 6a6cd11 ("perf test: Add test for the sched tracepoint format fields") Link: http://lkml.kernel.org/r/20190316080556.3075-17-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 01ca667 upstream. Syzkaller report this: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 4378 Comm: syz-executor.0 Tainted: G C 5.0.0+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:__lock_acquire+0x95b/0x3200 kernel/locking/lockdep.c:3573 Code: 00 0f 85 28 1e 00 00 48 81 c4 08 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 24 00 00 49 81 7d 00 e0 de 03 a6 41 bc 00 00 RSP: 0018:ffff8881e3c07a40 EFLAGS: 00010002 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000010 RSI: 0000000000000000 RDI: 0000000000000080 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8881e3c07d98 R11: ffff8881c7f21f80 R12: 0000000000000001 R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000001 FS: 00007fce2252e700(0000) GS:ffff8881f2400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffc7eb0228 CR3: 00000001e5bea002 CR4: 00000000007606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: lock_acquire+0xff/0x2c0 kernel/locking/lockdep.c:4211 __mutex_lock_common kernel/locking/mutex.c:925 [inline] __mutex_lock+0xdf/0x1050 kernel/locking/mutex.c:1072 drain_workqueue+0x24/0x3f0 kernel/workqueue.c:2934 destroy_workqueue+0x23/0x630 kernel/workqueue.c:4319 __do_sys_delete_module kernel/module.c:1018 [inline] __se_sys_delete_module kernel/module.c:961 [inline] __x64_sys_delete_module+0x30c/0x480 kernel/module.c:961 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fce2252dc58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000140 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fce2252e6bc R13: 00000000004bcca9 R14: 00000000006f6b48 R15: 00000000ffffffff If alloc_workqueue fails, it should return -ENOMEM, otherwise may trigger this NULL pointer dereference while unloading drivers. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 0a38c17 ("fm10k: Remove create_workqueue") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit 7494cec ] Calling kvm_is_visible_gfn() implies that we're parsing the memslots, and doing this without the srcu lock is frown upon: [12704.164532] ============================= [12704.164544] WARNING: suspicious RCU usage [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G W [12704.164573] ----------------------------- [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [12704.164602] other info that might help us debug this: [12704.164616] rcu_scheduler_active = 2, debug_locks = 1 [12704.164631] 6 locks held by qemu-system-aar/13968: [12704.164644] #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [12704.164691] #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [12704.164726] #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164761] #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164794] #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164827] #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164861] stack backtrace: [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G W 5.1.0-rc1-00008-g600025238f51-dirty #16 [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [12704.164896] Call trace: [12704.164910] dump_backtrace+0x0/0x138 [12704.164920] show_stack+0x24/0x30 [12704.164934] dump_stack+0xbc/0x104 [12704.164946] lockdep_rcu_suspicious+0xcc/0x110 [12704.164958] gfn_to_memslot+0x174/0x190 [12704.164969] kvm_is_visible_gfn+0x28/0x70 [12704.164980] vgic_its_check_id.isra.0+0xec/0x1e8 [12704.164991] vgic_its_save_tables_v0+0x1ac/0x330 [12704.165001] vgic_its_set_attr+0x298/0x3a0 [12704.165012] kvm_device_ioctl_attr+0x9c/0xd8 [12704.165022] kvm_device_ioctl+0x8c/0xf8 [12704.165035] do_vfs_ioctl+0xc8/0x960 [12704.165045] ksys_ioctl+0x8c/0xa0 [12704.165055] __arm64_sys_ioctl+0x28/0x38 [12704.165067] el0_svc_common+0xd8/0x138 [12704.165078] el0_svc_handler+0x38/0x78 [12704.165089] el0_svc+0x8/0xc Make sure the lock is taken when doing this. Fixes: bf30824 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>

commit 01ca667 upstream. Syzkaller report this: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 4378 Comm: syz-executor.0 Tainted: G C 5.0.0+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:__lock_acquire+0x95b/0x3200 kernel/locking/lockdep.c:3573 Code: 00 0f 85 28 1e 00 00 48 81 c4 08 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 24 00 00 49 81 7d 00 e0 de 03 a6 41 bc 00 00 RSP: 0018:ffff8881e3c07a40 EFLAGS: 00010002 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000010 RSI: 0000000000000000 RDI: 0000000000000080 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8881e3c07d98 R11: ffff8881c7f21f80 R12: 0000000000000001 R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000001 FS: 00007fce2252e700(0000) GS:ffff8881f2400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffc7eb0228 CR3: 00000001e5bea002 CR4: 00000000007606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: lock_acquire+0xff/0x2c0 kernel/locking/lockdep.c:4211 __mutex_lock_common kernel/locking/mutex.c:925 [inline] __mutex_lock+0xdf/0x1050 kernel/locking/mutex.c:1072 drain_workqueue+0x24/0x3f0 kernel/workqueue.c:2934 destroy_workqueue+0x23/0x630 kernel/workqueue.c:4319 __do_sys_delete_module kernel/module.c:1018 [inline] __se_sys_delete_module kernel/module.c:961 [inline] __x64_sys_delete_module+0x30c/0x480 kernel/module.c:961 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fce2252dc58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000140 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fce2252e6bc R13: 00000000004bcca9 R14: 00000000006f6b48 R15: 00000000ffffffff If alloc_workqueue fails, it should return -ENOMEM, otherwise may trigger this NULL pointer dereference while unloading drivers. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 0a38c17 ("fm10k: Remove create_workqueue") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 9b9b0df4e7882638e53c55e8f556aa78915418b9) Orabug: 30322694 CVE: CVE-2019-15924 Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

Fix the deadlock by refactoring the MR cache cleanup flow to flush the workqueue without holding the rb_lock. This adds a race between cache cleanup and creation of new entries which we solve by denied creation of new entries after cache cleanup started. Lockdep: WARNING: possible circular locking dependency detected [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted [ 2785.339778 ] ------------------------------------------------------ [ 2785.340848 ] devlink/53872 is trying to acquire lock: [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900 [ 2785.343403 ] [ 2785.343403 ] but task is already holding lock: [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] [ 2785.346273 ] [ 2785.346273 ] which lock already depends on the new lock. [ 2785.346273 ] [ 2785.347720 ] [ 2785.347720 ] the existing dependency chain (in reverse order) is: [ 2785.349003 ] [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}: [ 2785.350160 ] __mutex_lock+0x14c/0x15c0 [ 2785.350962 ] delayed_cache_work_func+0x2d1/0x610 [mlx5_ib] [ 2785.352044 ] process_one_work+0x7c2/0x1310 [ 2785.352879 ] worker_thread+0x59d/0xec0 [ 2785.353636 ] kthread+0x28f/0x330 [ 2785.354370 ] ret_from_fork+0x1f/0x30 [ 2785.355135 ] [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}: [ 2785.356515 ] __lock_acquire+0x2d8a/0x5fe0 [ 2785.357349 ] lock_acquire+0x1c1/0x540 [ 2785.358121 ] __flush_work+0xe8/0x900 [ 2785.358852 ] __cancel_work_timer+0x2c7/0x3f0 [ 2785.359711 ] mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib] [ 2785.360781 ] mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib] [ 2785.361969 ] __mlx5_ib_remove+0x68/0x120 [mlx5_ib] [ 2785.362960 ] mlx5r_remove+0x63/0x80 [mlx5_ib] [ 2785.363870 ] auxiliary_bus_remove+0x52/0x70 [ 2785.364715 ] device_release_driver_internal+0x3c1/0x600 [ 2785.365695 ] bus_remove_device+0x2a5/0x560 [ 2785.366525 ] device_del+0x492/0xb80 [ 2785.367276 ] mlx5_detach_device+0x1a9/0x360 [mlx5_core] [ 2785.368615 ] mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core] [ 2785.369934 ] mlx5_devlink_reload_down+0x292/0x580 [mlx5_core] [ 2785.371292 ] devlink_reload+0x439/0x590 [ 2785.372075 ] devlink_nl_cmd_reload+0xaef/0xff0 [ 2785.372973 ] genl_family_rcv_msg_doit.isra.0+0x1bd/0x290 [ 2785.374011 ] genl_rcv_msg+0x3ca/0x6c0 [ 2785.374798 ] netlink_rcv_skb+0x12c/0x360 [ 2785.375612 ] genl_rcv+0x24/0x40 [ 2785.376295 ] netlink_unicast+0x438/0x710 [ 2785.377121 ] netlink_sendmsg+0x7a1/0xca0 [ 2785.377926 ] sock_sendmsg+0xc5/0x190 [ 2785.378668 ] __sys_sendto+0x1bc/0x290 [ 2785.379440 ] __x64_sys_sendto+0xdc/0x1b0 [ 2785.380255 ] do_syscall_64+0x3d/0x90 [ 2785.381031 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 2785.381967 ] [ 2785.381967 ] other info that might help us debug this: [ 2785.381967 ] [ 2785.383448 ] Possible unsafe locking scenario: [ 2785.383448 ] [ 2785.384544 ] CPU0 CPU1 [ 2785.385383 ] ---- ---- [ 2785.386193 ] lock(&dev->cache.rb_lock); [ 2785.386940 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.388327 ] lock(&dev->cache.rb_lock); [ 2785.389425 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.390414 ] [ 2785.390414 ] *** DEADLOCK *** [ 2785.390414 ] [ 2785.391579 ] 6 locks held by devlink/53872: [ 2785.392341 ] #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 [ 2785.393630 ] #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0 [ 2785.395324 ] #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core] [ 2785.397322 ] #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core] [ 2785.399231 ] #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600 [ 2785.400864 ] #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] Fixes: b958451 ("RDMA/mlx5: Change the cache structure to an RB-tree") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Orabug: 36385281 (cherry picked from commit 374012b) cherry-pick-repo=kernel/git/torvalds/linux.git unmodified-from-upstream: 374012b Signed-off-by: Mikhael Goikhman <migo@nvidia.com> Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>

The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock: PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8" #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91 #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528 #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core] #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core] #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core] #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core] #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core] #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core] #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core] #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core] #13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8 #14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower] #15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower] #16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0 [1031218.028143] wait_for_completion+0x24/0x30 [1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core] [1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core] [1031218.029885] process_one_work+0x24e/0x510 Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock. Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Orabug: 35622106 (cherry picked from commit 93a3319) cherry-pick-repo=kernel/git/torvalds/linux.git unmodified-from-upstream: 93a3319 Signed-off-by: Mikhael Goikhman <migo@nvidia.com> Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>

Fix the deadlock by refactoring the MR cache cleanup flow to flush the workqueue without holding the rb_lock. This adds a race between cache cleanup and creation of new entries which we solve by denied creation of new entries after cache cleanup started. Lockdep: WARNING: possible circular locking dependency detected [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted [ 2785.339778 ] ------------------------------------------------------ [ 2785.340848 ] devlink/53872 is trying to acquire lock: [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900 [ 2785.343403 ] [ 2785.343403 ] but task is already holding lock: [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] [ 2785.346273 ] [ 2785.346273 ] which lock already depends on the new lock. [ 2785.346273 ] [ 2785.347720 ] [ 2785.347720 ] the existing dependency chain (in reverse order) is: [ 2785.349003 ] [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}: [ 2785.350160 ] __mutex_lock+0x14c/0x15c0 [ 2785.350962 ] delayed_cache_work_func+0x2d1/0x610 [mlx5_ib] [ 2785.352044 ] process_one_work+0x7c2/0x1310 [ 2785.352879 ] worker_thread+0x59d/0xec0 [ 2785.353636 ] kthread+0x28f/0x330 [ 2785.354370 ] ret_from_fork+0x1f/0x30 [ 2785.355135 ] [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}: [ 2785.356515 ] __lock_acquire+0x2d8a/0x5fe0 [ 2785.357349 ] lock_acquire+0x1c1/0x540 [ 2785.358121 ] __flush_work+0xe8/0x900 [ 2785.358852 ] __cancel_work_timer+0x2c7/0x3f0 [ 2785.359711 ] mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib] [ 2785.360781 ] mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib] [ 2785.361969 ] __mlx5_ib_remove+0x68/0x120 [mlx5_ib] [ 2785.362960 ] mlx5r_remove+0x63/0x80 [mlx5_ib] [ 2785.363870 ] auxiliary_bus_remove+0x52/0x70 [ 2785.364715 ] device_release_driver_internal+0x3c1/0x600 [ 2785.365695 ] bus_remove_device+0x2a5/0x560 [ 2785.366525 ] device_del+0x492/0xb80 [ 2785.367276 ] mlx5_detach_device+0x1a9/0x360 [mlx5_core] [ 2785.368615 ] mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core] [ 2785.369934 ] mlx5_devlink_reload_down+0x292/0x580 [mlx5_core] [ 2785.371292 ] devlink_reload+0x439/0x590 [ 2785.372075 ] devlink_nl_cmd_reload+0xaef/0xff0 [ 2785.372973 ] genl_family_rcv_msg_doit.isra.0+0x1bd/0x290 [ 2785.374011 ] genl_rcv_msg+0x3ca/0x6c0 [ 2785.374798 ] netlink_rcv_skb+0x12c/0x360 [ 2785.375612 ] genl_rcv+0x24/0x40 [ 2785.376295 ] netlink_unicast+0x438/0x710 [ 2785.377121 ] netlink_sendmsg+0x7a1/0xca0 [ 2785.377926 ] sock_sendmsg+0xc5/0x190 [ 2785.378668 ] __sys_sendto+0x1bc/0x290 [ 2785.379440 ] __x64_sys_sendto+0xdc/0x1b0 [ 2785.380255 ] do_syscall_64+0x3d/0x90 [ 2785.381031 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 2785.381967 ] [ 2785.381967 ] other info that might help us debug this: [ 2785.381967 ] [ 2785.383448 ] Possible unsafe locking scenario: [ 2785.383448 ] [ 2785.384544 ] CPU0 CPU1 [ 2785.385383 ] ---- ---- [ 2785.386193 ] lock(&dev->cache.rb_lock); [ 2785.386940 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.388327 ] lock(&dev->cache.rb_lock); [ 2785.389425 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.390414 ] [ 2785.390414 ] *** DEADLOCK *** [ 2785.390414 ] [ 2785.391579 ] 6 locks held by devlink/53872: [ 2785.392341 ] #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 [ 2785.393630 ] #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0 [ 2785.395324 ] #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core] [ 2785.397322 ] #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core] [ 2785.399231 ] #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600 [ 2785.400864 ] #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] Fixes: b958451 ("RDMA/mlx5: Change the cache structure to an RB-tree") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Orabug: 36385281 (cherry picked from commit 374012b) cherry-pick-repo=kernel/git/torvalds/linux.git unmodified-from-upstream: 374012b Signed-off-by: Mikhael Goikhman <migo@nvidia.com> Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>

One of our customers reported the following stack. crash-7.3.0> bt PID: 250515 TASK: ffff888189482f80 CPU: 1 COMMAND: "vmbackup" #0 [ffffc90025017878] die at ffffffff81033c22 #1 [ffffc900250178a8] do_trap at ffffffff81030990 #2 [ffffc900250178f8] do_error_trap at ffffffff810311d7 #3 [ffffc900250179c0] do_invalid_op at ffffffff81031310 #4 [ffffc900250179d0] invalid_op at ffffffff81a01f2a [exception RIP: ocfs2_truncate_rec+1914] RIP: ffffffffc1e73b4a RSP: ffffc90025017a80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000053a75 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8882d385be08 RDI: ffff8882d385be08 RBP: ffffc90025017b10 R8: 0000000000000000 R9: 0000000000005900 R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000001 R13: ffff88829e5a9900 R14: ffffc90025017cf0 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: e030 SS: e02b #5 [ffffc90025017b18] ocfs2_remove_extent at ffffffffc1e73e6c [ocfs2] #6 [ffffc90025017bc8] ocfs2_remove_btree_range at ffffffffc1e745f2 [ocfs2] #7 [ffffc90025017c60] ocfs2_commit_truncate at ffffffffc1e75b1f [ocfs2] #8 [ffffc90025017d68] __dta_ocfs2_wipe_inode_606 at ffffffffc1e9a3e0 [ocfs2] #9 [ffffc90025017dd8] ocfs2_evict_inode at ffffffffc1e9ac10 [ocfs2] RIP: 00007f9b26ec8307 RSP: 00007ffc5a193f68 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000ddd0a0 RCX: 00007f9b26ec8307 RDX: 0000000000000001 RSI: 00007f9b2719e770 RDI: 0000000001010400 RBP: 0000000001263d80 R8: 0000000000000000 R9: 00000000012146a0 R10: 000000000000000d R11: 0000000000000246 R12: 0000000000ddd0a0 R13: 00007f9b27ba9595 R14: 00007f9b27ca4a50 R15: 00000000ffffffff ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b crash-7.3.0> This crash resulted due to invalid extent record selected for truncate. At the top of the function ocfs2_truncate_rec(), the code checks if the first extent record at the leaf extent list corresponding to the input path is still empty. In that case the tree is rotated left to get rid of the empty extent record but this rotation did not happen. But the function ocfs2_truncate_rec() assumes that the top level call to ocfs2_rotate_tree_left() to get rid of the empty extent always succeeds and hence it decrements the input "index" value. This results in selection of a wrong record for truncate that causes to hit a call to BUG() with the message "Owner %llu: Invalid record truncate: (%u, %u) ". The stack above is the panic stack caused due to hitting BUG(). Though the function ocfs2_rotate_tree_left() was intended to get rid of the first empty record in the extent block, it did not call the function ocfs2_rotate_rightmost_leaf_left() as it did not find h_next_leaf_blk in the extentleaf block to be zero, instead, it proceeded to call __ocfs2_rotate_tree_left(). However the input "index" value was indeed pointing to the last extent record in the leaf block. The macro path_leaf_bh() was returning rightmost extent block as per the tree-depth. and the function ocfs2_find_cpos_for_right_leaf() also found out that the extent block in question is indeed the rightmost and hence there is nothing to rotate at the last extent record pointed by the input "index" value. Hence the extent tree in the leaf block was not totated at all. Hence, the real reason for the above panic is that the value of the field h_next_leaf_blk in the right most leaf block was non-zero that caused the tree not to rotate left resulting in selection of invalid record for truncate. The reason why h_next_leaf_blk was not cleared for the last extent block is still not known and the code changes here is a workaround to avoid the panic by verifying that the extent block in question is indeed the rightmost leaf block in the tree and then correcting the invalid h_next_leaf_blk value. These changes have been verified by the customer by running the provided rpm in their env. Orabug: 35905419 Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Sherry Yang <sherry.yang@oracle.com>

Add a check to mlx5e_xmit() for shorter frames. A corrupted/malformed packet, with shorter length can eventually cause system panic further down in the code path. Avoid it by validating the length and dropping it at the earliest. Following is seen in our env with shorter skb->len crash> bt PID: 76981 TASK: ff19828cfe508000 CPU: 106 COMMAND: "vhost-76942" #0 [ff2d20159b39f2c8] machine_kexec at ffffffffad884801 #1 [ff2d20159b39f328] __crash_kexec at ffffffffad976142 #2 [ff2d20159b39f3f8] panic at ffffffffad8b3640 #3 [ff2d20159b39f4a0] no_context at ffffffffad8954e1 #4 [ff2d20159b39f518] __bad_area_nosemaphore at ffffffffad8958de #5 [ff2d20159b39f578] bad_area_nosemaphore at ffffffffad895a96 #6 [ff2d20159b39f588] do_kern_addr_fault at ffffffffad89688e #7 [ff2d20159b39f5b0] __do_page_fault at ffffffffad896b30 #8 [ff2d20159b39f618] do_page_fault at ffffffffad896db6 #9 [ff2d20159b39f650] page_fault at ffffffffae402acd [exception RIP: memcpy_erms+6] RIP: ffffffffae261ab6 RSP: ff2d20159b39f700 RFLAGS: 00010293 RAX: ff198291741ecf2e RBX: ff19828e70d6a100 RCX: fffffffffea1af2b RDX: fffffffffffffffd RSI: ff19828eba6d7e5e RDI: ff198291757d2000 RBP: ff2d20159b39f760 R8: ff198291741ecf00 R9: 000000000000037c R10: 000000000000003c R11: ff19828ffe953940 R12: ff198291741ecf20 R13: ff198267dcb1b600 R14: ff19828eeebb09c0 R15: ff198291741ecf00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ff2d20159b39f700] mlx5e_sq_xmit_wqe at ffffffffc05c162e [mlx5_core] #11 [ff2d20159b39f768] mlx5e_xmit at ffffffffc05c1ca3 [mlx5_core] #12 [ff2d20159b39f800] dev_hard_start_xmit at ffffffffae083766 #13 [ff2d20159b39f860] sch_direct_xmit at ffffffffae0e2564 #14 [ff2d20159b39f8b0] __qdisc_run at ffffffffae0e294e #15 [ff2d20159b39f928] __dev_queue_xmit at ffffffffae083eee #16 [ff2d20159b39f9a8] dev_queue_xmit at ffffffffae084370 #17 [ff2d20159b39f9b8] vlan_dev_hard_start_xmit at ffffffffc2fb6fec [8021q] #18 [ff2d20159b39f9d8] dev_hard_start_xmit at ffffffffae083766 #19 [ff2d20159b39fa38] __dev_queue_xmit at ffffffffae08416a #20 [ff2d20159b39fab8] dev_queue_xmit_accel at ffffffffae08438e #21 [ff2d20159b39fac8] macvlan_start_xmit at ffffffffc2fc18d9 [macvlan] #22 [ff2d20159b39faf0] dev_hard_start_xmit at ffffffffae083766 #23 [ff2d20159b39fb50] sch_direct_xmit at ffffffffae0e2564 #24 [ff2d20159b39fba0] __qdisc_run at ffffffffae0e294e #25 [ff2d20159b39fc18] __dev_queue_xmit at ffffffffae083c81 #26 [ff2d20159b39fc90] dev_queue_xmit at ffffffffae084370 #27 [ff2d20159b39fca0] tap_sendmsg at ffffffffc07206ed [tap] #28 [ff2d20159b39fd20] vhost_tx_batch at ffffffffc2fd6590 [vhost_net] #29 [ff2d20159b39fd68] handle_tx_copy at ffffffffc2fd70f3 [vhost_net] #30 [ff2d20159b39fe80] handle_tx at ffffffffc2fd7651 [vhost_net] #31 [ff2d20159b39feb0] handle_tx_kick at ffffffffc2fd76b5 [vhost_net] #32 [ff2d20159b39fec0] vhost_worker at ffffffffc12a5be8 [vhost] #33 [ff2d20159b39ff08] kthread at ffffffffad8dbfe5 #34 [ff2d20159b39ff50] ret_from_fork at ffffffffae400364 This change was discussed with Nvidia and they are in agreement. Orabug: 36660755 Fixes: e4cf27b ("net/mlx5e: Re-eanble client vlan TX acceleration") Reported-and-tested-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

…le_direct_reclaim() commit 6aaced5 upstream. The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <snishika@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 66cd37660ec34ec444fe42f2277330ae4a36bb19) Signed-off-by: Sherry Yang <sherry.yang@oracle.com>

…nt message commit cddc76b upstream. Address a bug in the kernel that triggers a "sleeping function called from invalid context" warning when /sys/kernel/debug/kmemleak is printed under specific conditions: - CONFIG_PREEMPT_RT=y - Set SELinux as the LSM for the system - Set kptr_restrict to 1 - kmemleak buffer contains at least one item BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 2 6 locks held by cat/136: #0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30 #1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128 #3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0 #4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0 #5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0 irq event stamp: 136660 hardirqs last enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8 hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0 softirqs last enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8 softirqs last disabled at (0): [<0000000000000000>] 0x0 Preemption disabled at: [<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0 CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G E 6.11.0-rt7+ #34 Tainted: [E]=UNSIGNED_MODULE Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0xa0/0x128 show_stack+0x1c/0x30 dump_stack_lvl+0xe8/0x198 dump_stack+0x18/0x20 rt_spin_lock+0x8c/0x1a8 avc_perm_nonode+0xa0/0x150 cred_has_capability.isra.0+0x118/0x218 selinux_capable+0x50/0x80 security_capable+0x7c/0xd0 has_ns_capability_noaudit+0x94/0x1b0 has_capability_noaudit+0x20/0x30 restricted_pointer+0x21c/0x4b0 pointer+0x298/0x760 vsnprintf+0x330/0xf70 seq_printf+0x178/0x218 print_unreferenced+0x1a4/0x2d0 kmemleak_seq_show+0xd0/0x1e0 seq_read_iter+0x354/0xe30 seq_read+0x250/0x378 full_proxy_read+0xd8/0x148 vfs_read+0x190/0x918 ksys_read+0xf0/0x1e0 __arm64_sys_read+0x70/0xa8 invoke_syscall.constprop.0+0xd4/0x1d8 el0_svc+0x50/0x158 el0t_64_sync+0x17c/0x180 %pS and %pK, in the same back trace line, are redundant, and %pS can void %pK service in certain contexts. %pS alone already provides the necessary information, and if it cannot resolve the symbol, it falls back to printing the raw address voiding the original intent behind the %pK. Additionally, %pK requires a privilege check CAP_SYSLOG enforced through the LSM, which can trigger a "sleeping function called from invalid context" warning under RT_PREEMPT kernels when the check occurs in an atomic context. This issue may also affect other LSMs. This change avoids the unnecessary privilege check and resolves the sleeping function warning without any loss of information. Link: https://lkml.kernel.org/r/20241217142032.55793-1-acarmina@redhat.com Fixes: 3a6f33d ("mm/kmemleak: use %pK to display kernel pointers in backtrace") Signed-off-by: Alessandro Carminati <acarmina@redhat.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Clément Léger <clement.leger@bootlin.com> Cc: Alessandro Carminati <acarmina@redhat.com> Cc: Eric Chanudet <echanude@redhat.com> Cc: Gabriele Paoloni <gpaoloni@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 64b2d32f22597b2a1dc83ac600b2426588851a97) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

…le_direct_reclaim() commit 6aaced5 upstream. The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <snishika@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 58d0d02dbc67438fc80223fdd7bbc49cf0733284) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

[ Upstream commit eb28fd7 ] gtp_newlink() links the device to a list in dev_net(dev) instead of src_net, where a udp tunnel socket is created. Even when src_net is removed, the device stays alive on dev_net(dev). Then, removing src_net triggers the splat below. [0] In this example, gtp0 is created in ns2, and the udp socket is created in ns1. ip netns add ns1 ip netns add ns2 ip -n ns1 link add netns ns2 name gtp0 type gtp role sgsn ip netns del ns1 Let's link the device to the socket's netns instead. Now, gtp_net_exit_batch_rtnl() needs another netdev iteration to remove all gtp devices in the netns. [0]: ref_tracker: net notrefcnt@000000003d6e7d05 has 1/2 users at sk_alloc (./include/net/net_namespace.h:345 net/core/sock.c:2236) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1558) udp_sock_create4 (net/ipv4/udp_tunnel_core.c:18) gtp_create_sock (./include/net/udp_tunnel.h:59 drivers/net/gtp.c:1423) gtp_create_sockets (drivers/net/gtp.c:1447) gtp_newlink (drivers/net/gtp.c:1507) rtnl_newlink (net/core/rtnetlink.c:3786 net/core/rtnetlink.c:3897 net/core/rtnetlink.c:4012) rtnetlink_rcv_msg (net/core/rtnetlink.c:6922) netlink_rcv_skb (net/netlink/af_netlink.c:2542) netlink_unicast (net/netlink/af_netlink.c:1321 net/netlink/af_netlink.c:1347) netlink_sendmsg (net/netlink/af_netlink.c:1891) ____sys_sendmsg (net/socket.c:711 net/socket.c:726 net/socket.c:2583) ___sys_sendmsg (net/socket.c:2639) __sys_sendmsg (net/socket.c:2669) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) WARNING: CPU: 1 PID: 60 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179) Modules linked in: CPU: 1 UID: 0 PID: 60 Comm: kworker/u16:2 Not tainted 6.13.0-rc5-00147-g4c1224501e9d #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:ref_tracker_dir_exit (lib/ref_tracker.c:179) Code: 00 00 00 fc ff df 4d 8b 26 49 bd 00 01 00 00 00 00 ad de 4c 39 f5 0f 85 df 00 00 00 48 8b 74 24 08 48 89 df e8 a5 cc 12 02 90 <0f> 0b 90 48 8d 6b 44 be 04 00 00 00 48 89 ef e8 80 de 67 ff 48 89 RSP: 0018:ff11000009a07b60 EFLAGS: 00010286 RAX: 0000000000002bd3 RBX: ff1100000f4e1aa0 RCX: 1ffffffff0e40ac6 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8423ee3c RBP: ff1100000f4e1af0 R08: 0000000000000001 R09: fffffbfff0e395ae R10: 0000000000000001 R11: 0000000000036001 R12: ff1100000f4e1af0 R13: dead000000000100 R14: ff1100000f4e1af0 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ff1100006ce80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9b2464bd98 CR3: 0000000005286005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn (kernel/panic.c:748) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? __pfx_ref_tracker_dir_exit (lib/ref_tracker.c:158) ? kfree (mm/slub.c:4613 mm/slub.c:4761) net_free (net/core/net_namespace.c:476 net/core/net_namespace.c:467) cleanup_net (net/core/net_namespace.c:664 (discriminator 3)) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 86f73d4ab2f27deeff22ba9336ad103d94f12ac7) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

[ Upstream commit ffc90e9 ] pfcp_newlink() links the device to a list in dev_net(dev) instead of net, where a udp tunnel socket is created. Even when net is removed, the device stays alive on dev_net(dev). Then, removing net triggers the splat below. [0] In this example, pfcp0 is created in ns2, but the udp socket is created in ns1. ip netns add ns1 ip netns add ns2 ip -n ns1 link add netns ns2 name pfcp0 type pfcp ip netns del ns1 Let's link the device to the socket's netns instead. Now, pfcp_net_exit() needs another netdev iteration to remove all pfcp devices in the netns. pfcp_dev_list is not used under RCU, so the list API is converted to the non-RCU variant. pfcp_net_exit() can be converted to .exit_batch_rtnl() in net-next. [0]: ref_tracker: net notrefcnt@00000000128b34dc has 1/1 users at sk_alloc (./include/net/net_namespace.h:345 net/core/sock.c:2236) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1558) udp_sock_create4 (net/ipv4/udp_tunnel_core.c:18) pfcp_create_sock (drivers/net/pfcp.c:168) pfcp_newlink (drivers/net/pfcp.c:182 drivers/net/pfcp.c:197) rtnl_newlink (net/core/rtnetlink.c:3786 net/core/rtnetlink.c:3897 net/core/rtnetlink.c:4012) rtnetlink_rcv_msg (net/core/rtnetlink.c:6922) netlink_rcv_skb (net/netlink/af_netlink.c:2542) netlink_unicast (net/netlink/af_netlink.c:1321 net/netlink/af_netlink.c:1347) netlink_sendmsg (net/netlink/af_netlink.c:1891) ____sys_sendmsg (net/socket.c:711 net/socket.c:726 net/socket.c:2583) ___sys_sendmsg (net/socket.c:2639) __sys_sendmsg (net/socket.c:2669) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) WARNING: CPU: 1 PID: 11 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179) Modules linked in: CPU: 1 UID: 0 PID: 11 Comm: kworker/u16:0 Not tainted 6.13.0-rc5-00147-g4c1224501e9d #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:ref_tracker_dir_exit (lib/ref_tracker.c:179) Code: 00 00 00 fc ff df 4d 8b 26 49 bd 00 01 00 00 00 00 ad de 4c 39 f5 0f 85 df 00 00 00 48 8b 74 24 08 48 89 df e8 a5 cc 12 02 90 <0f> 0b 90 48 8d 6b 44 be 04 00 00 00 48 89 ef e8 80 de 67 ff 48 89 RSP: 0018:ff11000007f3fb60 EFLAGS: 00010286 RAX: 00000000000020ef RBX: ff1100000d6481e0 RCX: 1ffffffff0e40d82 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8423ee3c RBP: ff1100000d648230 R08: 0000000000000001 R09: fffffbfff0e395af R10: 0000000000000001 R11: 0000000000000000 R12: ff1100000d648230 R13: dead000000000100 R14: ff1100000d648230 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ff1100006ce80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005620e1363990 CR3: 000000000eeb2002 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn (kernel/panic.c:748) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? __pfx_ref_tracker_dir_exit (lib/ref_tracker.c:158) ? kfree (mm/slub.c:4613 mm/slub.c:4761) net_free (net/core/net_namespace.c:476 net/core/net_namespace.c:467) cleanup_net (net/core/net_namespace.c:664 (discriminator 3)) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Fixes: 76c8764 ("pfcp: add PFCP module") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 1c35a66e2bfea53dea3562b2575ac7fd4c38ee61) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

[ Upstream commit c7b87ce0dd10b64b68a0b22cb83bbd556e28fe81 ] libtraceevent parses and returns an array of argument fields, sometimes larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr", idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6 elements max, creating an out-of-bounds access. This runtime error is found by UBsan. The error message: $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1 builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]' #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966 #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110 #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436 #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897 #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335 #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502 #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351 #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404 #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448 #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556 #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360 #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6) 0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1) = 1 Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint") Signed-off-by: Howard Chu <howardchu95@gmail.com> Link: https://lore.kernel.org/r/20250122025519.361873-1-howardchu95@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 161348aea66fde8356030df21f998d64f585bd51) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

[ Upstream commit eb28fd7 ] gtp_newlink() links the device to a list in dev_net(dev) instead of src_net, where a udp tunnel socket is created. Even when src_net is removed, the device stays alive on dev_net(dev). Then, removing src_net triggers the splat below. [0] In this example, gtp0 is created in ns2, and the udp socket is created in ns1. ip netns add ns1 ip netns add ns2 ip -n ns1 link add netns ns2 name gtp0 type gtp role sgsn ip netns del ns1 Let's link the device to the socket's netns instead. Now, gtp_net_exit_batch_rtnl() needs another netdev iteration to remove all gtp devices in the netns. [0]: ref_tracker: net notrefcnt@000000003d6e7d05 has 1/2 users at sk_alloc (./include/net/net_namespace.h:345 net/core/sock.c:2236) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1558) udp_sock_create4 (net/ipv4/udp_tunnel_core.c:18) gtp_create_sock (./include/net/udp_tunnel.h:59 drivers/net/gtp.c:1423) gtp_create_sockets (drivers/net/gtp.c:1447) gtp_newlink (drivers/net/gtp.c:1507) rtnl_newlink (net/core/rtnetlink.c:3786 net/core/rtnetlink.c:3897 net/core/rtnetlink.c:4012) rtnetlink_rcv_msg (net/core/rtnetlink.c:6922) netlink_rcv_skb (net/netlink/af_netlink.c:2542) netlink_unicast (net/netlink/af_netlink.c:1321 net/netlink/af_netlink.c:1347) netlink_sendmsg (net/netlink/af_netlink.c:1891) ____sys_sendmsg (net/socket.c:711 net/socket.c:726 net/socket.c:2583) ___sys_sendmsg (net/socket.c:2639) __sys_sendmsg (net/socket.c:2669) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) WARNING: CPU: 1 PID: 60 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179) Modules linked in: CPU: 1 UID: 0 PID: 60 Comm: kworker/u16:2 Not tainted 6.13.0-rc5-00147-g4c1224501e9d #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:ref_tracker_dir_exit (lib/ref_tracker.c:179) Code: 00 00 00 fc ff df 4d 8b 26 49 bd 00 01 00 00 00 00 ad de 4c 39 f5 0f 85 df 00 00 00 48 8b 74 24 08 48 89 df e8 a5 cc 12 02 90 <0f> 0b 90 48 8d 6b 44 be 04 00 00 00 48 89 ef e8 80 de 67 ff 48 89 RSP: 0018:ff11000009a07b60 EFLAGS: 00010286 RAX: 0000000000002bd3 RBX: ff1100000f4e1aa0 RCX: 1ffffffff0e40ac6 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8423ee3c RBP: ff1100000f4e1af0 R08: 0000000000000001 R09: fffffbfff0e395ae R10: 0000000000000001 R11: 0000000000036001 R12: ff1100000f4e1af0 R13: dead000000000100 R14: ff1100000f4e1af0 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ff1100006ce80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9b2464bd98 CR3: 0000000005286005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn (kernel/panic.c:748) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? __pfx_ref_tracker_dir_exit (lib/ref_tracker.c:158) ? kfree (mm/slub.c:4613 mm/slub.c:4761) net_free (net/core/net_namespace.c:476 net/core/net_namespace.c:467) cleanup_net (net/core/net_namespace.c:664 (discriminator 3)) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit c986380c1d5274c4d5e935addc807d6791cc23eb) Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>

On the node of an NFS client, some files saved in the mountpoint of the NFS server were copied to another location of the same NFS server. Accidentally, the nfs42_complete_copies() got a NULL-pointer dereference crash with the following syslog: [232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116 [232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116 [232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 [232066.588586] Mem abort info: [232066.588701] ESR = 0x0000000096000007 [232066.588862] EC = 0x25: DABT (current EL), IL = 32 bits [232066.589084] SET = 0, FnV = 0 [232066.589216] EA = 0, S1PTW = 0 [232066.589340] FSC = 0x07: level 3 translation fault [232066.589559] Data abort info: [232066.589683] ISV = 0, ISS = 0x00000007 [232066.589842] CM = 0, WnR = 0 [232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400 [232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000 [232066.590757] Internal error: Oops: 96000007 [#1] SMP [232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2 [232066.591052] vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs [232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1 [232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06 [232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4] [232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4] [232066.598595] sp : ffff8000f568fc70 [232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000 [232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001 [232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050 [232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000 [232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000 [232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6 [232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828 [232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a [232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058 [232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000 [232066.601636] Call trace: [232066.601749] nfs4_reclaim_open_state+0x220/0x800 [nfsv4] [232066.601998] nfs4_do_reclaim+0x1b8/0x28c [nfsv4] [232066.602218] nfs4_state_manager+0x928/0x10f0 [nfsv4] [232066.602455] nfs4_run_state_manager+0x78/0x1b0 [nfsv4] [232066.602690] kthread+0x110/0x114 [232066.602830] ret_from_fork+0x10/0x20 [232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00) [232066.603284] SMP: stopping secondary CPUs [232066.606936] Starting crashdump kernel... [232066.607146] Bye! Analysing the vmcore, we know that nfs4_copy_state listed by destination nfs_server->ss_copies was added by the field copies in handle_async_copy(), and we found a waiting copy process with the stack as: PID: 3511963 TASK: ffff710028b47e00 CPU: 0 COMMAND: "cp" #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4 #1 [ffff8001116ef760] __schedule at ffff800008dd0650 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898 #6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4] #7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4] #8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4] #9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4] The NULL-pointer dereference was due to nfs42_complete_copies() listed the nfs_server->ss_copies by the field ss_copies of nfs4_copy_state. So the nfs4_copy_state address ffff0100f98fa3f0 was offset by 0x10 and the data accessed through this pointer was also incorrect. Generally, the ordered list nfs4_state_owner->so_states indicate open(O_RDWR) or open(O_WRITE) states are reclaimed firstly by nfs4_reclaim_open_state(). When destination state reclaim is failed with NFS_STATE_RECOVERY_FAILED and copies are not deleted in nfs_server->ss_copies, the source state may be passed to the nfs42_complete_copies() process earlier, resulting in this crash scene finally. To solve this issue, we add a list_head nfs_server->ss_src_copies for a server-to-server copy specially. Fixes: 0e65a32 ("NFS: handle source server reboot") Signed-off-by: Yanjun Zhang <zhangyanjun@cestc.cn> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> (cherry picked from commit a848c29) Orabug: 37206487 Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Sherry Yang <sherry.yang@oracle.com>

[ Upstream commit c7b87ce0dd10b64b68a0b22cb83bbd556e28fe81 ] libtraceevent parses and returns an array of argument fields, sometimes larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr", idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6 elements max, creating an out-of-bounds access. This runtime error is found by UBsan. The error message: $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1 builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]' #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966 #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110 #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436 #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897 #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335 #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502 #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351 #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404 #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448 #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556 #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360 #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6) 0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1) = 1 Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint") Signed-off-by: Howard Chu <howardchu95@gmail.com> Link: https://lore.kernel.org/r/20250122025519.361873-1-howardchu95@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit a48ebcd853a4e973566e3ed313655a8d96789e78) Signed-off-by: Vijayendra Suman <vijayendra.suman@oracle.com>

[ Upstream commit c7b87ce0dd10b64b68a0b22cb83bbd556e28fe81 ] libtraceevent parses and returns an array of argument fields, sometimes larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr", idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6 elements max, creating an out-of-bounds access. This runtime error is found by UBsan. The error message: $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1 builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]' #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966 #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110 #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436 #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897 #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335 #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502 #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351 #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404 #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448 #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556 #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360 #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6) 0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1) = 1 Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint") Signed-off-by: Howard Chu <howardchu95@gmail.com> Link: https://lore.kernel.org/r/20250122025519.361873-1-howardchu95@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 093c20a38c9c81c653ced839e241cbf1b3b2a8b3) Signed-off-by: Sherry Yang <sherry.yang@oracle.com>

Scenario: 1. Port down and do fail over 2. Ap do rds_bind syscall PID: 47039 TASK: ffff89887e2fe640 CPU: 47 COMMAND: "kworker/u:6" #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518 #3 [ffff898e35f15b60] no_context at ffffffff8104854c #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95 [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: ffff898e35f15dc8 RFLAGS: 00010282 RAX: 00000000fffffffe RBX: ffff889b77f6fc00 RCX:ffffffff81c99d88 RDX: 0000000000000000 RSI: ffff896019ee08e8 RDI:ffff889b77f6fc00 RBP: ffff898e35f15df0 R8: ffff896019ee08c8 R9:0000000000000000 R10: 0000000000000400 R11: 0000000000000000 R12:ffff896019ee08c0 R13: ffff889b77f6fe68 R14: ffffffff81c99d80 R15: ffffffffa022a1e0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm] #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6 PID: 45659 TASK: ffff880d313d2500 CPU: 31 COMMAND: "oracle_45659_ap" #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm] #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma] #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds] #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds] #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670 PID: 45659 PID: 47039 rds_ib_laddr_check /* create id_priv with a null event_handler */ rdma_create_id rdma_bind_addr cma_acquire_dev /* add id_priv to cma_dev->id_list */ cma_attach_to_dev cma_ndev_work_handler /* event_hanlder is null */ id_priv->id.event_handler Orabug: 27530931 Signed-off-by: Guanglei Li <guanglei.li@oracle.com> Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 2c0aa08) Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 39e0939) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit 7d342f8) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

The customer hit this crash few times. PID: 31556 TASK: ffff880f823caa00 CPU: 1 COMMAND: "cellsrv" #0 [ffff880f823db850] machine_kexec at ffffffff8105d93c #1 [ffff880f823db8b0] crash_kexec at ffffffff811103b3 #2 [ffff880f823db980] oops_end at ffffffff8101a788 #3 [ffff880f823db9b0] no_context at ffffffff8106b9cf #4 [ffff880f823dba20] __bad_area_nosemaphore at ffffffff8106bc9d #5 [ffff880f823dba70] bad_area at ffffffff8106be97 #6 [ffff880f823dbaa0] __do_page_fault at ffffffff8106c71e #7 [ffff880f823dbb00] do_page_fault at ffffffff8106c81f #8 [ffff880f823dbb40] page_fault at ffffffff816b5a9f [exception RIP: rds_ib_inc_copy_to_user+104] RIP: ffffffffa04607b8 RSP: ffff880f823dbbf8 RFLAGS: 00010287 RAX: 0000000000000340 RBX: 0000000000001000 RCX: 0000000000004000 RDX: 0000000000001000 RSI: ffff88176cea2000 RDI: ffff8817d291f520 RBP: ffff880f823dbc48 R8: 0000000000001340 R9: 0000000000001000 R10: 0000000000001200 R11: ffff880f823dc000 R12: ffff880f823dbed0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff880f823dbc50] rds_recvmsg at ffffffffa041d837 [rds] int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to) ... ... ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); len = be32_to_cpu(inc->i_hdr.h_len); sg = frag->f_sg; while (iov_iter_count(to) && copied < len) { to_copy = min_t(unsigned long, iov_iter_count(to), sg->length - frag_off); ... sg is NULL and it crashes accessing sg->length above. The cause looks like is due to ic->i_frag_sz returning incorrect value. 16KB when 4KB was expected. if (copied % ic->i_frag_sz == 0) { frag = list_entry(frag->f_item.next, struct rds_page_frag, f_item); frag_off = 0; sg = frag->f_sg; } The other end is using 4KB RDS fragsize (Solaris Super Cluster). This end is UEK4 (4.1.12-94.8.4.el6uek.x86_64). The message being copied arrived over 4KB RDS frag size connection. But during the above check ic->i_frag_sz is 16KB. This can happen during a reconnect at the connection setup phase. We start off with ic->i_frag_sz as 16KB. Then settle down at 4KB. Failing this check if (copied % ic->i_frag_sz == 0) { can result in sg not getting set correctly. Say, "copied" = 4KB but ic->i_frag_sz is 16KB when it should be 4KB. During race condition with a reconnect, ic->i_frag_sz can be 16KB even though once the connection is set up it settled down to 4KB. It can change from 4KB to 16KB and back to 4KB during connection setup due to reconnect. We started seeing this crash after bug 26848749. But prior to that the same scenario could result in data copied to user from incorrect "sg" resulting in data corruption. Orabug: 28748008 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 14858a3) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e86878f) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180452 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 964cad6) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e40c8e4) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Scenario: 1. Port down and do fail over 2. Ap do rds_bind syscall PID: 47039 TASK: ffff89887e2fe640 CPU: 47 COMMAND: "kworker/u:6" #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518 #3 [ffff898e35f15b60] no_context at ffffffff8104854c #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95 [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: ffff898e35f15dc8 RFLAGS: 00010282 RAX: 00000000fffffffe RBX: ffff889b77f6fc00 RCX:ffffffff81c99d88 RDX: 0000000000000000 RSI: ffff896019ee08e8 RDI:ffff889b77f6fc00 RBP: ffff898e35f15df0 R8: ffff896019ee08c8 R9:0000000000000000 R10: 0000000000000400 R11: 0000000000000000 R12:ffff896019ee08c0 R13: ffff889b77f6fe68 R14: ffffffff81c99d80 R15: ffffffffa022a1e0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm] #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6 PID: 45659 TASK: ffff880d313d2500 CPU: 31 COMMAND: "oracle_45659_ap" #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm] #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma] #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds] #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds] #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670 PID: 45659 PID: 47039 rds_ib_laddr_check /* create id_priv with a null event_handler */ rdma_create_id rdma_bind_addr cma_acquire_dev /* add id_priv to cma_dev->id_list */ cma_attach_to_dev cma_ndev_work_handler /* event_hanlder is null */ id_priv->id.event_handler Orabug: 27530931 Signed-off-by: Guanglei Li <guanglei.li@oracle.com> Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 2c0aa08) Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 39e0939) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit 7d342f8) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

The customer hit this crash few times. PID: 31556 TASK: ffff880f823caa00 CPU: 1 COMMAND: "cellsrv" #0 [ffff880f823db850] machine_kexec at ffffffff8105d93c #1 [ffff880f823db8b0] crash_kexec at ffffffff811103b3 #2 [ffff880f823db980] oops_end at ffffffff8101a788 #3 [ffff880f823db9b0] no_context at ffffffff8106b9cf #4 [ffff880f823dba20] __bad_area_nosemaphore at ffffffff8106bc9d #5 [ffff880f823dba70] bad_area at ffffffff8106be97 #6 [ffff880f823dbaa0] __do_page_fault at ffffffff8106c71e #7 [ffff880f823dbb00] do_page_fault at ffffffff8106c81f #8 [ffff880f823dbb40] page_fault at ffffffff816b5a9f [exception RIP: rds_ib_inc_copy_to_user+104] RIP: ffffffffa04607b8 RSP: ffff880f823dbbf8 RFLAGS: 00010287 RAX: 0000000000000340 RBX: 0000000000001000 RCX: 0000000000004000 RDX: 0000000000001000 RSI: ffff88176cea2000 RDI: ffff8817d291f520 RBP: ffff880f823dbc48 R8: 0000000000001340 R9: 0000000000001000 R10: 0000000000001200 R11: ffff880f823dc000 R12: ffff880f823dbed0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff880f823dbc50] rds_recvmsg at ffffffffa041d837 [rds] int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to) ... ... ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); len = be32_to_cpu(inc->i_hdr.h_len); sg = frag->f_sg; while (iov_iter_count(to) && copied < len) { to_copy = min_t(unsigned long, iov_iter_count(to), sg->length - frag_off); ... sg is NULL and it crashes accessing sg->length above. The cause looks like is due to ic->i_frag_sz returning incorrect value. 16KB when 4KB was expected. if (copied % ic->i_frag_sz == 0) { frag = list_entry(frag->f_item.next, struct rds_page_frag, f_item); frag_off = 0; sg = frag->f_sg; } The other end is using 4KB RDS fragsize (Solaris Super Cluster). This end is UEK4 (4.1.12-94.8.4.el6uek.x86_64). The message being copied arrived over 4KB RDS frag size connection. But during the above check ic->i_frag_sz is 16KB. This can happen during a reconnect at the connection setup phase. We start off with ic->i_frag_sz as 16KB. Then settle down at 4KB. Failing this check if (copied % ic->i_frag_sz == 0) { can result in sg not getting set correctly. Say, "copied" = 4KB but ic->i_frag_sz is 16KB when it should be 4KB. During race condition with a reconnect, ic->i_frag_sz can be 16KB even though once the connection is set up it settled down to 4KB. It can change from 4KB to 16KB and back to 4KB during connection setup due to reconnect. We started seeing this crash after bug 26848749. But prior to that the same scenario could result in data copied to user from incorrect "sg" resulting in data corruption. Orabug: 28748008 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 14858a3) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e86878f) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180452 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 964cad6) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e40c8e4) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

One of our customers reported the following stack. crash-7.3.0> bt PID: 250515 TASK: ffff888189482f80 CPU: 1 COMMAND: "vmbackup" #0 [ffffc90025017878] die at ffffffff81033c22 #1 [ffffc900250178a8] do_trap at ffffffff81030990 #2 [ffffc900250178f8] do_error_trap at ffffffff810311d7 #3 [ffffc900250179c0] do_invalid_op at ffffffff81031310 #4 [ffffc900250179d0] invalid_op at ffffffff81a01f2a [exception RIP: ocfs2_truncate_rec+1914] RIP: ffffffffc1e73b4a RSP: ffffc90025017a80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000053a75 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8882d385be08 RDI: ffff8882d385be08 RBP: ffffc90025017b10 R8: 0000000000000000 R9: 0000000000005900 R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000001 R13: ffff88829e5a9900 R14: ffffc90025017cf0 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: e030 SS: e02b #5 [ffffc90025017b18] ocfs2_remove_extent at ffffffffc1e73e6c [ocfs2] #6 [ffffc90025017bc8] ocfs2_remove_btree_range at ffffffffc1e745f2 [ocfs2] #7 [ffffc90025017c60] ocfs2_commit_truncate at ffffffffc1e75b1f [ocfs2] #8 [ffffc90025017d68] __dta_ocfs2_wipe_inode_606 at ffffffffc1e9a3e0 [ocfs2] #9 [ffffc90025017dd8] ocfs2_evict_inode at ffffffffc1e9ac10 [ocfs2] RIP: 00007f9b26ec8307 RSP: 00007ffc5a193f68 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000ddd0a0 RCX: 00007f9b26ec8307 RDX: 0000000000000001 RSI: 00007f9b2719e770 RDI: 0000000001010400 RBP: 0000000001263d80 R8: 0000000000000000 R9: 00000000012146a0 R10: 000000000000000d R11: 0000000000000246 R12: 0000000000ddd0a0 R13: 00007f9b27ba9595 R14: 00007f9b27ca4a50 R15: 00000000ffffffff ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b crash-7.3.0> This crash resulted due to invalid extent record selected for truncate. At the top of the function ocfs2_truncate_rec(), the code checks if the first extent record at the leaf extent list corresponding to the input path is still empty. In that case the tree is rotated left to get rid of the empty extent record but this rotation did not happen. But the function ocfs2_truncate_rec() assumes that the top level call to ocfs2_rotate_tree_left() to get rid of the empty extent always succeeds and hence it decrements the input "index" value. This results in selection of a wrong record for truncate that causes to hit a call to BUG() with the message "Owner %llu: Invalid record truncate: (%u, %u) ". The stack above is the panic stack caused due to hitting BUG(). Though the function ocfs2_rotate_tree_left() was intended to get rid of the first empty record in the extent block, it did not call the function ocfs2_rotate_rightmost_leaf_left() as it did not find h_next_leaf_blk in the extentleaf block to be zero, instead, it proceeded to call __ocfs2_rotate_tree_left(). However the input "index" value was indeed pointing to the last extent record in the leaf block. The macro path_leaf_bh() was returning rightmost extent block as per the tree-depth. and the function ocfs2_find_cpos_for_right_leaf() also found out that the extent block in question is indeed the rightmost and hence there is nothing to rotate at the last extent record pointed by the input "index" value. Hence the extent tree in the leaf block was not totated at all. Hence, the real reason for the above panic is that the value of the field h_next_leaf_blk in the right most leaf block was non-zero that caused the tree not to rotate left resulting in selection of invalid record for truncate. The reason why h_next_leaf_blk was not cleared for the last extent block is still not known and the code changes here is a workaround to avoid the panic by verifying that the extent block in question is indeed the rightmost leaf block in the tree and then correcting the invalid h_next_leaf_blk value. These changes have been verified by the customer by running the provided rpm in their env. Orabug: 34393593 Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180452 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 964cad6) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e40c8e4) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Add a check to mlx5e_xmit() for shorter frames. A corrupted/malformed packet, with shorter length can eventually cause system panic further down in the code path. Avoid it by validating the length and dropping it at the earliest. Following is seen in our env with shorter skb->len crash> bt PID: 76981 TASK: ff19828cfe508000 CPU: 106 COMMAND: "vhost-76942" #0 [ff2d20159b39f2c8] machine_kexec at ffffffffad884801 #1 [ff2d20159b39f328] __crash_kexec at ffffffffad976142 #2 [ff2d20159b39f3f8] panic at ffffffffad8b3640 #3 [ff2d20159b39f4a0] no_context at ffffffffad8954e1 #4 [ff2d20159b39f518] __bad_area_nosemaphore at ffffffffad8958de #5 [ff2d20159b39f578] bad_area_nosemaphore at ffffffffad895a96 #6 [ff2d20159b39f588] do_kern_addr_fault at ffffffffad89688e #7 [ff2d20159b39f5b0] __do_page_fault at ffffffffad896b30 #8 [ff2d20159b39f618] do_page_fault at ffffffffad896db6 #9 [ff2d20159b39f650] page_fault at ffffffffae402acd [exception RIP: memcpy_erms+6] RIP: ffffffffae261ab6 RSP: ff2d20159b39f700 RFLAGS: 00010293 RAX: ff198291741ecf2e RBX: ff19828e70d6a100 RCX: fffffffffea1af2b RDX: fffffffffffffffd RSI: ff19828eba6d7e5e RDI: ff198291757d2000 RBP: ff2d20159b39f760 R8: ff198291741ecf00 R9: 000000000000037c R10: 000000000000003c R11: ff19828ffe953940 R12: ff198291741ecf20 R13: ff198267dcb1b600 R14: ff19828eeebb09c0 R15: ff198291741ecf00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ff2d20159b39f700] mlx5e_sq_xmit_wqe at ffffffffc05c162e [mlx5_core] #11 [ff2d20159b39f768] mlx5e_xmit at ffffffffc05c1ca3 [mlx5_core] #12 [ff2d20159b39f800] dev_hard_start_xmit at ffffffffae083766 #13 [ff2d20159b39f860] sch_direct_xmit at ffffffffae0e2564 #14 [ff2d20159b39f8b0] __qdisc_run at ffffffffae0e294e #15 [ff2d20159b39f928] __dev_queue_xmit at ffffffffae083eee #16 [ff2d20159b39f9a8] dev_queue_xmit at ffffffffae084370 #17 [ff2d20159b39f9b8] vlan_dev_hard_start_xmit at ffffffffc2fb6fec [8021q] #18 [ff2d20159b39f9d8] dev_hard_start_xmit at ffffffffae083766 #19 [ff2d20159b39fa38] __dev_queue_xmit at ffffffffae08416a #20 [ff2d20159b39fab8] dev_queue_xmit_accel at ffffffffae08438e #21 [ff2d20159b39fac8] macvlan_start_xmit at ffffffffc2fc18d9 [macvlan] #22 [ff2d20159b39faf0] dev_hard_start_xmit at ffffffffae083766 #23 [ff2d20159b39fb50] sch_direct_xmit at ffffffffae0e2564 #24 [ff2d20159b39fba0] __qdisc_run at ffffffffae0e294e #25 [ff2d20159b39fc18] __dev_queue_xmit at ffffffffae083c81 #26 [ff2d20159b39fc90] dev_queue_xmit at ffffffffae084370 #27 [ff2d20159b39fca0] tap_sendmsg at ffffffffc07206ed [tap] #28 [ff2d20159b39fd20] vhost_tx_batch at ffffffffc2fd6590 [vhost_net] #29 [ff2d20159b39fd68] handle_tx_copy at ffffffffc2fd70f3 [vhost_net] #30 [ff2d20159b39fe80] handle_tx at ffffffffc2fd7651 [vhost_net] #31 [ff2d20159b39feb0] handle_tx_kick at ffffffffc2fd76b5 [vhost_net] #32 [ff2d20159b39fec0] vhost_worker at ffffffffc12a5be8 [vhost] #33 [ff2d20159b39ff08] kthread at ffffffffad8dbfe5 #34 [ff2d20159b39ff50] ret_from_fork at ffffffffae400364 This change was discussed with Nvidia and they are in agreement. Orabug: 36879156 CVE: CVE-2024-41090 CVE: CVE-2024-41091 Fixes: e4cf27b ("net/mlx5e: Re-eanble client vlan TX acceleration") Reported-and-tested-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> (cherry picked from commit 0dd4b99) Orabug: 36879126 CVE: CVE-2024-41090 CVE: CVE-2024-41091 Signed-off-by: Harshvardhan Jha <harshvardhan.j.jha@oracle.com> Reviewed-by: Vijayendra Suman <vijayendra.suman@oracle.com>

One of our customers reported the following stack. crash-7.3.0> bt PID: 250515 TASK: ffff888189482f80 CPU: 1 COMMAND: "vmbackup" #0 [ffffc90025017878] die at ffffffff81033c22 #1 [ffffc900250178a8] do_trap at ffffffff81030990 #2 [ffffc900250178f8] do_error_trap at ffffffff810311d7 #3 [ffffc900250179c0] do_invalid_op at ffffffff81031310 #4 [ffffc900250179d0] invalid_op at ffffffff81a01f2a [exception RIP: ocfs2_truncate_rec+1914] RIP: ffffffffc1e73b4a RSP: ffffc90025017a80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000053a75 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8882d385be08 RDI: ffff8882d385be08 RBP: ffffc90025017b10 R8: 0000000000000000 R9: 0000000000005900 R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000001 R13: ffff88829e5a9900 R14: ffffc90025017cf0 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: e030 SS: e02b #5 [ffffc90025017b18] ocfs2_remove_extent at ffffffffc1e73e6c [ocfs2] #6 [ffffc90025017bc8] ocfs2_remove_btree_range at ffffffffc1e745f2 [ocfs2] #7 [ffffc90025017c60] ocfs2_commit_truncate at ffffffffc1e75b1f [ocfs2] #8 [ffffc90025017d68] __dta_ocfs2_wipe_inode_606 at ffffffffc1e9a3e0 [ocfs2] #9 [ffffc90025017dd8] ocfs2_evict_inode at ffffffffc1e9ac10 [ocfs2] RIP: 00007f9b26ec8307 RSP: 00007ffc5a193f68 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000ddd0a0 RCX: 00007f9b26ec8307 RDX: 0000000000000001 RSI: 00007f9b2719e770 RDI: 0000000001010400 RBP: 0000000001263d80 R8: 0000000000000000 R9: 00000000012146a0 R10: 000000000000000d R11: 0000000000000246 R12: 0000000000ddd0a0 R13: 00007f9b27ba9595 R14: 00007f9b27ca4a50 R15: 00000000ffffffff ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b crash-7.3.0> This crash resulted due to invalid extent record selected for truncate. At the top of the function ocfs2_truncate_rec(), the code checks if the first extent record at the leaf extent list corresponding to the input path is still empty. In that case the tree is rotated left to get rid of the empty extent record but this rotation did not happen. But the function ocfs2_truncate_rec() assumes that the top level call to ocfs2_rotate_tree_left() to get rid of the empty extent always succeeds and hence it decrements the input "index" value. This results in selection of a wrong record for truncate that causes to hit a call to BUG() with the message "Owner %llu: Invalid record truncate: (%u, %u) ". The stack above is the panic stack caused due to hitting BUG(). Though the function ocfs2_rotate_tree_left() was intended to get rid of the first empty record in the extent block, it did not call the function ocfs2_rotate_rightmost_leaf_left() as it did not find h_next_leaf_blk in the extentleaf block to be zero, instead, it proceeded to call __ocfs2_rotate_tree_left(). However the input "index" value was indeed pointing to the last extent record in the leaf block. The macro path_leaf_bh() was returning rightmost extent block as per the tree-depth. and the function ocfs2_find_cpos_for_right_leaf() also found out that the extent block in question is indeed the rightmost and hence there is nothing to rotate at the last extent record pointed by the input "index" value. Hence the extent tree in the leaf block was not totated at all. Hence, the real reason for the above panic is that the value of the field h_next_leaf_blk in the right most leaf block was non-zero that caused the tree not to rotate left resulting in selection of invalid record for truncate. The reason why h_next_leaf_blk was not cleared for the last extent block is still not known and the code changes here is a workaround to avoid the panic by verifying that the extent block in question is indeed the rightmost leaf block in the tree and then correcting the invalid h_next_leaf_blk value. These changes have been verified by the customer by running the provided rpm in their env. Orabug: 34393593 Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>

Add a check to mlx5e_xmit() for shorter frames. A corrupted/malformed packet, with shorter length can eventually cause system panic further down in the code path. Avoid it by validating the length and dropping it at the earliest. Following is seen in our env with shorter skb->len crash> bt PID: 76981 TASK: ff19828cfe508000 CPU: 106 COMMAND: "vhost-76942" #0 [ff2d20159b39f2c8] machine_kexec at ffffffffad884801 #1 [ff2d20159b39f328] __crash_kexec at ffffffffad976142 #2 [ff2d20159b39f3f8] panic at ffffffffad8b3640 #3 [ff2d20159b39f4a0] no_context at ffffffffad8954e1 #4 [ff2d20159b39f518] __bad_area_nosemaphore at ffffffffad8958de #5 [ff2d20159b39f578] bad_area_nosemaphore at ffffffffad895a96 #6 [ff2d20159b39f588] do_kern_addr_fault at ffffffffad89688e #7 [ff2d20159b39f5b0] __do_page_fault at ffffffffad896b30 #8 [ff2d20159b39f618] do_page_fault at ffffffffad896db6 #9 [ff2d20159b39f650] page_fault at ffffffffae402acd [exception RIP: memcpy_erms+6] RIP: ffffffffae261ab6 RSP: ff2d20159b39f700 RFLAGS: 00010293 RAX: ff198291741ecf2e RBX: ff19828e70d6a100 RCX: fffffffffea1af2b RDX: fffffffffffffffd RSI: ff19828eba6d7e5e RDI: ff198291757d2000 RBP: ff2d20159b39f760 R8: ff198291741ecf00 R9: 000000000000037c R10: 000000000000003c R11: ff19828ffe953940 R12: ff198291741ecf20 R13: ff198267dcb1b600 R14: ff19828eeebb09c0 R15: ff198291741ecf00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ff2d20159b39f700] mlx5e_sq_xmit_wqe at ffffffffc05c162e [mlx5_core] #11 [ff2d20159b39f768] mlx5e_xmit at ffffffffc05c1ca3 [mlx5_core] #12 [ff2d20159b39f800] dev_hard_start_xmit at ffffffffae083766 #13 [ff2d20159b39f860] sch_direct_xmit at ffffffffae0e2564 #14 [ff2d20159b39f8b0] __qdisc_run at ffffffffae0e294e #15 [ff2d20159b39f928] __dev_queue_xmit at ffffffffae083eee #16 [ff2d20159b39f9a8] dev_queue_xmit at ffffffffae084370 #17 [ff2d20159b39f9b8] vlan_dev_hard_start_xmit at ffffffffc2fb6fec [8021q] #18 [ff2d20159b39f9d8] dev_hard_start_xmit at ffffffffae083766 #19 [ff2d20159b39fa38] __dev_queue_xmit at ffffffffae08416a #20 [ff2d20159b39fab8] dev_queue_xmit_accel at ffffffffae08438e #21 [ff2d20159b39fac8] macvlan_start_xmit at ffffffffc2fc18d9 [macvlan] #22 [ff2d20159b39faf0] dev_hard_start_xmit at ffffffffae083766 #23 [ff2d20159b39fb50] sch_direct_xmit at ffffffffae0e2564 #24 [ff2d20159b39fba0] __qdisc_run at ffffffffae0e294e #25 [ff2d20159b39fc18] __dev_queue_xmit at ffffffffae083c81 #26 [ff2d20159b39fc90] dev_queue_xmit at ffffffffae084370 #27 [ff2d20159b39fca0] tap_sendmsg at ffffffffc07206ed [tap] #28 [ff2d20159b39fd20] vhost_tx_batch at ffffffffc2fd6590 [vhost_net] #29 [ff2d20159b39fd68] handle_tx_copy at ffffffffc2fd70f3 [vhost_net] #30 [ff2d20159b39fe80] handle_tx at ffffffffc2fd7651 [vhost_net] #31 [ff2d20159b39feb0] handle_tx_kick at ffffffffc2fd76b5 [vhost_net] #32 [ff2d20159b39fec0] vhost_worker at ffffffffc12a5be8 [vhost] #33 [ff2d20159b39ff08] kthread at ffffffffad8dbfe5 #34 [ff2d20159b39ff50] ret_from_fork at ffffffffae400364 This change was discussed with Nvidia and they are in agreement. Orabug: 36879156 CVE: CVE-2024-41090 CVE: CVE-2024-41091 Fixes: e4cf27b ("net/mlx5e: Re-eanble client vlan TX acceleration") Reported-and-tested-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> (cherry picked from commit 0dd4b99) Orabug: 36879126 CVE: CVE-2024-41090 CVE: CVE-2024-41091 Signed-off-by: Harshvardhan Jha <harshvardhan.j.jha@oracle.com> Reviewed-by: Vijayendra Suman <vijayendra.suman@oracle.com>

be2net added 3 commits November 21, 2018 02:12

be2net: Fix HW stall issue in Lancer

4cd817d

Lancer HW cannot handle a TSO packet with a single segment. Disable TSO/GSO for such packets. Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Update the driver version to 12.0.0.0

c337335

gregmarsden closed this Nov 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uek5 master driver update #5

Uek5 master driver update #5

be2net commented Nov 29, 2018

gregmarsden commented Nov 30, 2018

Uek5 master driver update #5

Uek5 master driver update #5

Conversation

be2net commented Nov 29, 2018

gregmarsden commented Nov 30, 2018