Track PCIe AER correctable errors; fix GPU status key routing

Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch
"bus correctable error" events seen in SEL (Critical Interrupt / offset 7).
Both patterns carry severity "warning" — correctable errors are
hardware-recovered and should not flag a card as failed.

Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF>
but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" →
pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both
flushWindow (SAT-window path) and flushImmediate (always-on path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Mikhail Chusavitin
2026-05-08 12:50:14 +03:00
parent 5bc9bd7fb3
commit 805a3b277d
2 changed files with 24 additions and 2 deletions

View File

@@ -165,7 +165,9 @@ func (w *kmsgWatcher) flushWindow(window *kmsgWindow) {
for _, id := range evt.ids {
var key string
switch evt.category {
case "gpu", "pcie":
case "gpu":
key = "pcie:gpu:" + normalizeBDF(id)
case "pcie":
key = "pcie:" + normalizeBDF(id)
case "storage":
key = "storage:" + id
@@ -218,7 +220,9 @@ func (w *kmsgWatcher) flushImmediate(evt kmsgEvent) {
for _, id := range evt.ids {
var key string
switch evt.category {
case "gpu", "pcie":
case "gpu":
key = "pcie:gpu:" + normalizeBDF(id)
case "pcie":
key = "pcie:" + normalizeBDF(id)
case "storage":
key = "storage:" + id