Compare commits

...

10 Commits

Author SHA1 Message Date
Mikhail Chusavitin
2a7d366e50 bump bible and chart submodules
bible: add chart-based web visual baseline contract
chart: support hardware contract 2.10

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-01 14:56:07 +03:00
Mikhail Chusavitin
5bfaecd417 storage SAT: wait for smartctl self-test completion, add human-readable resource summary
smartctl -t short only launches the self-test and returns immediately
("Testing has begun"); unlike nvme device-self-test --wait, it has no
blocking mode. Validate/Load runs closed the task and produced reports
before the drive actually finished the test. Now poll smartctl -a until
the test completes (or times out) and report the real result.

Also add a per-disk "Resource" section with pseudographic progress bars
for uptime (vs 5y design life), bytes written (vs 1 DWPD x 5y budget),
and bytes read (percent from SMART attribute 242), all rendered in
human-scaled units (days/years, TB/PB) instead of raw hour/byte counts.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-01 14:54:48 +03:00
Mikhail Chusavitin
8575cf06f8 webui: show all RAID drives per controller and add drive-prepare action
RAID Controller Management previously hid any LSI drive that wasn't
already Frgn/UGood/JBOD, and scoped VROC "free drives" from all system
disks instead of the ones actually wired to the VROC controller's
ports - drives attached directly to the CPU or another HBA could leak
in. Now every drive is listed per its own controller, and LSI drives
not already ready for array creation get a "Prepare" button that
forces them to Unconfigured Good via storcli.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-01 13:32:03 +03:00
Mikhail Chusavitin
d1d5f63257 storage SAT: split collect/self-test modes, add per-disk text reports
Check mode: read-only SMART/NVMe data collection, no self-test.
Load mode: same collection + short self-test (nvme device-self-test -s 1,
smartctl -t short). Card descriptions updated accordingly.

After each storage SAT run, a disk-N-devname-report.txt is written
per device into the runDir (auto-included in support bundles).
Web UI task page renders one card per disk directly below Task Report.

Also fixes pre-existing TestDashboardRendersRuntimeHealthTable failure:
test fixture used "inactive" status but code now treats inactive as OK
for completed oneshot services; updated to "failed" to match intent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-30 19:08:15 +03:00
fc9b446d2e webui: per-source status bar in FRU/Elabel card + fix stale runtime-health test
Show an explicit per-source status line after "Read All" instead of hiding
failed/blocked sources in a "(skipped: …)" tail. Sources blocked by a missing
Supermicro license (SFT-OOB-LIC / SFT-DCMS-SINGLE) are flagged in red with an
actionable message, so engineers see that SAA DMI is gated rather than silently
falling back to the futile ipmitool FRU path (BIOS re-syncs FRU from DMI on boot).

Also fix TestDashboardRendersRuntimeHealthTable, stale since 4f6579e moved
"inactive" to the OK service states: the fixture now uses a failed service and
the assertion matches the current contract (failed flagged, inactive not).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 17:30:50 +03:00
Mikhail Chusavitin
ea68318744 contract: bump to v2.11 — add sfp_modules[], deprecate scalar sfp_* fields 2026-06-19 18:26:29 +03:00
Mikhail Chusavitin
518082c2e2 proposals: RFC for sfp_modules[] contract extension (v2.10 → v2.11) 2026-06-19 18:14:46 +03:00
Mikhail Chusavitin
056dce0b98 backlog: add SFP module collection with contract gap analysis 2026-06-19 16:25:57 +03:00
Mikhail Chusavitin
24f2e65b6e Add unified FRU/Elabel card with Huawei iBMC OEM IPMI support
Replaces separate IPMI FRU and SAA DMI cards with a single FRU / Elabel
card that reads all available sources in parallel and shows each field
with a color-coded source chip (IPMI FRU / Huawei iBMC / SAA DMI).

Huawei elabel fields are read/written via OEM IPMI raw commands
(NetFn 0x30, cmd 0x90) with 19-byte chunking protocol, matching
the FusionServer ElabelTool V511 wire format. Covers DeviceName,
DeviceSerialNumber, ProductName, ProductSerialNumber, ProductAssetTag,
ProductManufacturer, MainboardManufacturer, BoardProductName,
ChassisPartnumber, ChassisType (read-only), IOChassisSerial,
IOChassisAssetTag, and GUID (read-only via standard 0x06 0x08).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 15:29:07 +03:00
Mikhail Chusavitin
7f27b9aa38 Fix AMD GPU false detection, blackbox deadlock, and NOGPU build bloat
- sat.go: DetectGPUVendor lspci fallback now checks GPU device classes
  ([0300]/[0302]/[0380]) per line instead of scanning the whole output for
  vendor name; AMD EPYC servers have dozens of AMD-branded PCIe entries
  (Root Complex, IOMMU, Host Bridge) that were triggering the old check
- blackbox.go: fix deadlock in finishCycle — it held w.mu while calling
  persistState(), which acquires rt.mu then re-acquires w.mu inside
  persistStateLocked(); now w.mu is released before persistState()
- build.sh: remove NVIDIA-specific overlay files (bee-gpu-burn,
  bee-john-gpu-stress, bee-nccl-gpu-stress, bee-nvidia-recover,
  bee-dcgmproftester-staggered, bee-check-nvswitch,
  nvidia-fabricmanager.service.d/) for non-nvidia build variants
- bee-selfheal: gate NVIDIA recovery on BEE_GPU_VENDOR=nvidia so the
  script does not attempt to restart bee-nvidia.service on NOGPU builds

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 09:37:26 +03:00
24 changed files with 1926 additions and 221 deletions

View File

@@ -365,7 +365,6 @@ func (w *blackboxWorker) currentFlushPeriod() time.Duration {
func (w *blackboxWorker) finishCycle(duration time.Duration, err error) {
w.mu.Lock()
defer w.mu.Unlock()
w.lastDuration = duration
if err != nil {
w.status = "degraded"
@@ -383,6 +382,10 @@ func (w *blackboxWorker) finishCycle(duration time.Duration, err error) {
}
w.flushPeriod = adjustFlushPeriod(w.flushPeriod, duration, true, w.fastCycles)
}
w.mu.Unlock()
// persistState must be called without w.mu held: it acquires rt.mu then
// each worker.mu inside persistStateLocked, so holding w.mu here would
// cause a deadlock (w.mu → rt.mu → w.mu).
w.runtime.persistState()
}

View File

@@ -182,9 +182,16 @@ func (s *System) DetectGPUVendor() string {
return "amd"
}
if raw, err := exec.Command("lspci", "-nn").Output(); err == nil {
text := strings.ToLower(string(raw))
if strings.Contains(text, "advanced micro devices") || strings.Contains(text, "amd/ati") {
return "amd"
// Only match AMD GPU device classes [0300]=VGA, [0302]=3D controller, [0380]=Display.
// AMD CPUs also appear in lspci as "Advanced Micro Devices" (Root Complex, IOMMU, etc.)
// so matching vendor alone causes false positives on AMD CPU servers without GPUs.
for _, line := range strings.Split(strings.ToLower(string(raw)), "\n") {
if !strings.Contains(line, "advanced micro devices") && !strings.Contains(line, "amd/ati") {
continue
}
if strings.Contains(line, "[0300]") || strings.Contains(line, "[0302]") || strings.Contains(line, "[0380]") {
return "amd"
}
}
}
return ""
@@ -723,12 +730,14 @@ func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, e
}
prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
commands := storageSATCommands(devPath, extended)
deviceOutputs := make(map[string][]byte, len(commands))
for cmdIndex, job := range commands {
if ctx.Err() != nil {
break
}
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
out, err := runSATCommandCtx(ctx, verboseLog, job.name, job.cmd, nil, logFunc)
deviceOutputs[job.name] = out
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
return "", writeErr
}
@@ -737,7 +746,28 @@ func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, e
key := filepath.Base(devPath) + "_" + strings.ReplaceAll(job.name, "-", "_")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
// smartctl -t short only launches the self-test on the drive firmware and
// returns immediately ("Testing has begun"); unlike `nvme device-self-test
// --wait`, smartctl has no blocking mode, so we must poll the drive
// ourselves until the self-test actually finishes.
if job.name == "smartctl-self-test-short" && err == nil {
statusName := "smartctl-self-test-status"
statusOut := waitForSmartctlSelfTest(ctx, verboseLog, devPath, logFunc)
deviceOutputs[statusName] = statusOut
statusFile := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+2, statusName)
if writeErr := os.WriteFile(filepath.Join(runDir, statusFile), statusOut, 0644); writeErr != nil {
return "", writeErr
}
sStatus, sRC := classifySATResult(statusName, statusOut, nil)
stats.Add(sStatus)
sKey := filepath.Base(devPath) + "_" + strings.ReplaceAll(statusName, "-", "_")
fmt.Fprintf(&summary, "%s_rc=%d\n", sKey, sRC)
fmt.Fprintf(&summary, "%s_status=%s\n", sKey, sStatus)
}
}
reportText := GenerateDiskReportText(index+1, devPath, deviceOutputs, time.Now().UTC())
_ = os.WriteFile(filepath.Join(runDir, "disk-"+prefix+"-report.txt"), []byte(reportText), 0644)
}
writeSATStats(&summary, stats)
@@ -1170,6 +1200,42 @@ func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string
return out, err
}
// smartctlSelfTestPollInterval/Timeout bound how long we poll the drive after
// launching `smartctl -t short`, which SMART/ATA specs put at ~2 minutes.
const (
smartctlSelfTestPollInterval = 5 * time.Second
smartctlSelfTestTimeout = 4 * time.Minute
)
// waitForSmartctlSelfTest polls `smartctl -a` until the short self-test
// started on devPath finishes (or the timeout/context elapses) and returns
// the final output, which reflects the actual test result rather than the
// "Testing has begun" launch acknowledgement.
func waitForSmartctlSelfTest(ctx context.Context, verboseLog, devPath string, logFunc func(string)) []byte {
deadline := time.Now().Add(smartctlSelfTestTimeout)
var last []byte
for {
out, _ := runSATCommandCtx(ctx, verboseLog, "smartctl-self-test-status", []string{"smartctl", "-a", devPath}, nil, nil)
last = out
if ctx.Err() != nil {
return last
}
lower := bytes.ToLower(out)
if !bytes.Contains(lower, []byte("self-test routine in progress")) &&
!bytes.Contains(lower, []byte("% of test remaining")) {
return last
}
if time.Now().After(deadline) {
return last
}
select {
case <-ctx.Done():
return last
case <-time.After(smartctlSelfTestPollInterval):
}
}
}
func listStorageDevices() ([]string, error) {
out, err := satExecCommand("lsblk", "-dn", "-o", "NAME,TYPE,TRAN").Output()
if err != nil {
@@ -1178,26 +1244,27 @@ func listStorageDevices() ([]string, error) {
return parseStorageDevices(string(out)), nil
}
// storageSATCommands returns the commands to run for a single storage device.
// extended=false (Check): read-only SMART/NVMe data collection, no self-test.
// extended=true (Load): data collection + short self-test.
func storageSATCommands(devPath string, extended bool) []satJob {
if strings.Contains(filepath.Base(devPath), "nvme") {
selfTestLevel := "1"
if extended {
selfTestLevel = "2"
}
return []satJob{
jobs := []satJob{
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", selfTestLevel, "--wait"}},
}
if extended {
jobs = append(jobs, satJob{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", "1", "--wait"}})
}
return jobs
}
smartTestType := "short"
if extended {
smartTestType = "long"
}
return []satJob{
jobs := []satJob{
{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", smartTestType, devPath}},
}
if extended {
jobs = append(jobs, satJob{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}})
}
return jobs
}
func (s *satStats) Add(status string) {

View File

@@ -14,14 +14,42 @@ import (
func TestStorageSATCommands(t *testing.T) {
t.Parallel()
nvme := storageSATCommands("/dev/nvme0n1", false)
if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
t.Fatalf("unexpected nvme commands: %#v", nvme)
// Check mode (extended=false): read-only collection, no self-test.
nvmeCheck := storageSATCommands("/dev/nvme0n1", false)
if len(nvmeCheck) != 2 {
t.Fatalf("check nvme: want 2 commands, got %d: %#v", len(nvmeCheck), nvmeCheck)
}
if nvmeCheck[0].name != "nvme-id-ctrl" || nvmeCheck[1].name != "nvme-smart-log" {
t.Fatalf("check nvme: unexpected command names: %#v", nvmeCheck)
}
sata := storageSATCommands("/dev/sda", false)
if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
t.Fatalf("unexpected sata commands: %#v", sata)
sataCheck := storageSATCommands("/dev/sda", false)
if len(sataCheck) != 1 || sataCheck[0].cmd[0] != "smartctl" {
t.Fatalf("check sata: want 1 smartctl command, got %#v", sataCheck)
}
// Load mode (extended=true): collection + short self-test.
nvmeLoad := storageSATCommands("/dev/nvme0n1", true)
if len(nvmeLoad) != 3 || nvmeLoad[2].name != "nvme-device-self-test" {
t.Fatalf("load nvme: want 3 commands with self-test last, got %#v", nvmeLoad)
}
if got := nvmeLoad[2].cmd[len(nvmeLoad[2].cmd)-3]; got != "-s" {
t.Fatalf("load nvme: want -s flag, got %q", got)
}
if got := nvmeLoad[2].cmd[len(nvmeLoad[2].cmd)-2]; got != "1" {
t.Fatalf("load nvme: want self-test level 1, got %q", got)
}
sataLoad := storageSATCommands("/dev/sda", true)
if len(sataLoad) != 2 || sataLoad[1].name != "smartctl-self-test-short" {
t.Fatalf("load sata: want 2 commands with short self-test last, got %#v", sataLoad)
}
// cmd is: smartctl -t short /dev/sda
if got := sataLoad[1].cmd[1]; got != "-t" {
t.Fatalf("load sata: want -t flag at index 1, got %q", got)
}
if got := sataLoad[1].cmd[2]; got != "short" {
t.Fatalf("load sata: want short at index 2, got %q", got)
}
}

View File

@@ -0,0 +1,517 @@
package platform
import (
"encoding/json"
"fmt"
"math"
"path/filepath"
"regexp"
"strconv"
"strings"
"time"
)
// GenerateDiskReportText builds a human-readable text report for one storage
// device from the raw command outputs collected during storage SAT.
//
// outputs keys match satJob.name: "nvme-id-ctrl", "nvme-smart-log",
// "smartctl-health", "smartctl-self-test-short".
func GenerateDiskReportText(index int, devPath string, outputs map[string][]byte, ts time.Time) string {
var b strings.Builder
devName := filepath.Base(devPath)
line := strings.Repeat("=", 80)
b.WriteString(line + "\n")
fmt.Fprintf(&b, "Disk %-3d %s\n", index, devPath)
b.WriteString(line + "\n")
isNVMe := strings.Contains(devName, "nvme")
if isNVMe {
writeNVMeReport(&b, outputs)
} else {
writeSATAReport(&b, outputs)
}
b.WriteString("\n")
fmt.Fprintf(&b, "Collected : %s\n", ts.UTC().Format("2006-01-02 15:04:05 UTC"))
b.WriteString(line + "\n")
return b.String()
}
// ── NVMe ─────────────────────────────────────────────────────────────────────
type nvmeIdCtrl struct {
ModelNumber string `json:"mn"`
SerialNumber string `json:"sn"`
Firmware string `json:"fr"`
TotalCap uint64 `json:"tnvmcap"`
NVMCap uint64 `json:"nvmcap"`
}
// nvmeU64 handles both plain JSON numbers and {"lo":n,"hi":n} objects that
// some nvme-cli versions emit for 128-bit counters.
func nvmeU64(raw json.RawMessage) uint64 {
if len(raw) == 0 {
return 0
}
var n uint64
if json.Unmarshal(raw, &n) == nil {
return n
}
var obj struct {
Lo uint64 `json:"lo"`
Hi uint64 `json:"hi"`
}
if json.Unmarshal(raw, &obj) == nil {
return obj.Lo
}
return 0
}
type nvmeSmartLogRaw struct {
CriticalWarning uint64 `json:"critical_warning"`
Temperature json.RawMessage `json:"temperature"`
AvailSpare uint64 `json:"avail_spare"`
SpareThresh uint64 `json:"spare_thresh"`
PercentUsed uint64 `json:"percent_used"`
DataUnitsRead json.RawMessage `json:"data_units_read"`
DataUnitsWritten json.RawMessage `json:"data_units_written"`
PowerCycles json.RawMessage `json:"power_cycles"`
PowerOnHours json.RawMessage `json:"power_on_hours"`
UnsafeShutdowns json.RawMessage `json:"unsafe_shutdowns"`
MediaErrors json.RawMessage `json:"media_errors"`
NumErrLogEntries json.RawMessage `json:"num_err_log_entries"`
}
func writeNVMeReport(b *strings.Builder, outputs map[string][]byte) {
// id-ctrl
var ctrl nvmeIdCtrl
if data := outputs["nvme-id-ctrl"]; len(data) > 0 {
_ = json.Unmarshal(data, &ctrl)
}
model := strings.TrimSpace(ctrl.ModelNumber)
serial := strings.TrimSpace(ctrl.SerialNumber)
firmware := strings.TrimSpace(ctrl.Firmware)
capacityGB := ""
if ctrl.TotalCap > 0 {
capacityGB = formatCapacityGB(ctrl.TotalCap)
} else if ctrl.NVMCap > 0 {
capacityGB = formatCapacityGB(ctrl.NVMCap)
}
writeField(b, "Model", model)
writeField(b, "Serial", serial)
writeField(b, "Firmware", firmware)
if capacityGB != "" {
writeField(b, "Capacity", capacityGB)
}
// smart-log
data := outputs["nvme-smart-log"]
if len(data) == 0 {
b.WriteString("\n(no SMART data)\n")
return
}
var sl nvmeSmartLogRaw
if err := json.Unmarshal(data, &sl); err != nil {
fmt.Fprintf(b, "\n(SMART parse error: %v)\n", err)
return
}
tempK := nvmeU64(sl.Temperature)
tempC := int(tempK) - 273
if tempC < 0 {
tempC = 0
}
critWarn := sl.CriticalWarning
critWarnStr := "OK"
if critWarn != 0 {
critWarnStr = fmt.Sprintf("0x%02X", critWarn)
}
poh := nvmeU64(sl.PowerOnHours)
pc := nvmeU64(sl.PowerCycles)
us := nvmeU64(sl.UnsafeShutdowns)
me := nvmeU64(sl.MediaErrors)
nel := nvmeU64(sl.NumErrLogEntries)
// data_units are in 1000 × 512-byte sectors = 512,000 bytes each
dataRead := float64(nvmeU64(sl.DataUnitsRead)) * 512000 / 1e9
dataWritten := float64(nvmeU64(sl.DataUnitsWritten)) * 512000 / 1e9
writeSectionHeader(b, "Health")
writeField(b, "Temperature", fmt.Sprintf("%d °C", tempC))
writeField(b, "Critical Warning", critWarnStr)
writeField(b, "Percentage Used", fmt.Sprintf("%d %%", sl.PercentUsed))
writeField(b, "Available Spare", fmt.Sprintf("%d %% (threshold: %d %%)", sl.AvailSpare, sl.SpareThresh))
writeSectionHeader(b, "Usage")
writeField(b, "Power On Hours", fmt.Sprintf("%s h", formatUint(poh)))
writeField(b, "Power Cycles", formatUint(pc))
writeField(b, "Unsafe Shutdowns", formatUint(us))
writeField(b, "Data Written", fmt.Sprintf("%.1f GB", dataWritten))
writeField(b, "Data Read", fmt.Sprintf("%.1f GB", dataRead))
writeSectionHeader(b, "Errors")
writeField(b, "Media Errors", formatUint(me))
writeField(b, "Error Log Entries", formatUint(nel))
capacityBytes := ctrl.TotalCap
if capacityBytes == 0 {
capacityBytes = ctrl.NVMCap
}
writeResourceSection(b, resourceInfo{
powerOnHours: poh,
writtenBytes: uint64(nvmeU64(sl.DataUnitsWritten)) * 512000,
readBytes: uint64(nvmeU64(sl.DataUnitsRead)) * 512000,
capacityBytes: capacityBytes,
})
if selfTest := outputs["nvme-device-self-test"]; len(selfTest) > 0 {
writeSectionHeader(b, "Self-Test")
result := parseSelfTestResult(string(selfTest))
writeField(b, "Result", result)
}
}
// ── SATA / SAS (smartctl) ────────────────────────────────────────────────────
var (
smartHealthRE = regexp.MustCompile(`(?i)SMART overall-health self-assessment test result:\s*(\S+)`)
smartAttrLineRE = regexp.MustCompile(
`^\s*(\d{1,3})\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d{1,3})\s+(\d{1,3})\s+(\d{1,3})\s+\S+\s+\S+\s+\S+\s+(.+?)\s*$`,
)
smartModelRE = regexp.MustCompile(`(?im)^Device Model:\s*(.+)$`)
smartSerialRE = regexp.MustCompile(`(?im)^Serial Number:\s*(.+)$`)
smartFirmwareRE = regexp.MustCompile(`(?im)^Firmware Version:\s*(.+)$`)
smartCapacityRE = regexp.MustCompile(`(?im)^User Capacity:\s*(.+)$`)
)
type smartAttr struct {
ID int
Name string
Value int
Worst int
Threshold int
Raw string
}
func writeSATAReport(b *strings.Builder, outputs map[string][]byte) {
data := outputs["smartctl-health"]
if len(data) == 0 {
b.WriteString("\n(no SMART data)\n")
return
}
text := string(data)
// Identity
if m := smartModelRE.FindStringSubmatch(text); m != nil {
writeField(b, "Model", strings.TrimSpace(m[1]))
}
if m := smartSerialRE.FindStringSubmatch(text); m != nil {
writeField(b, "Serial", strings.TrimSpace(m[1]))
}
if m := smartFirmwareRE.FindStringSubmatch(text); m != nil {
writeField(b, "Firmware", strings.TrimSpace(m[1]))
}
var capacityBytes uint64
if m := smartCapacityRE.FindStringSubmatch(text); m != nil {
cap := strings.TrimSpace(m[1])
capacityBytes = parseLeadingUint(cap)
// trim everything after "[" if present (e.g. "500,107,862,016 bytes [500 GB]")
if idx := strings.Index(cap, "["); idx > 0 {
cap = strings.TrimSpace(cap[idx+1:])
cap = strings.TrimSuffix(cap, "]")
}
writeField(b, "Capacity", cap)
}
writeSectionHeader(b, "Health")
health := "unknown"
if m := smartHealthRE.FindStringSubmatch(text); m != nil {
health = strings.TrimSpace(m[1])
}
writeField(b, "SMART Overall Health", health)
attrs := parseSMARTAttrs(text)
if len(attrs) > 0 {
writeSectionHeader(b, "SMART Attributes")
fmt.Fprintf(b, " %-4s %-32s %5s %5s %5s %s\n", "ID", "Attribute", "Value", "Worst", "Thresh", "Raw")
b.WriteString(" " + strings.Repeat("-", 72) + "\n")
for _, a := range attrs {
fmt.Fprintf(b, " %-4d %-32s %5d %5d %5d %s\n",
a.ID, a.Name, a.Value, a.Worst, a.Threshold, a.Raw)
}
}
var poh, writtenLBAs, readLBAs uint64
var readValue int
hasReadValue := false
for _, a := range attrs {
switch a.ID {
case 9: // Power_On_Hours
poh = parseLeadingUint(a.Raw)
case 241: // Total_LBAs_Written
writtenLBAs = parseLeadingUint(a.Raw)
case 242: // Total_LBAs_Read
readLBAs = parseLeadingUint(a.Raw)
readValue = a.Value
hasReadValue = true
}
}
const sataSectorBytes = 512
writeResourceSection(b, resourceInfo{
powerOnHours: poh,
writtenBytes: writtenLBAs * sataSectorBytes,
readBytes: readLBAs * sataSectorBytes,
capacityBytes: capacityBytes,
readPercent: 100 - readValue,
hasReadPercent: hasReadValue,
})
selfTest := outputs["smartctl-self-test-status"]
if len(selfTest) == 0 {
selfTest = outputs["smartctl-self-test-short"]
}
if len(selfTest) > 0 {
writeSectionHeader(b, "Self-Test")
result := parseSelfTestResult(string(selfTest))
writeField(b, "Result", result)
}
}
func parseSMARTAttrs(text string) []smartAttr {
var attrs []smartAttr
inTable := false
for _, line := range strings.Split(text, "\n") {
if strings.Contains(line, "ATTRIBUTE_NAME") {
inTable = true
continue
}
if !inTable {
continue
}
m := smartAttrLineRE.FindStringSubmatch(line)
if m == nil {
if strings.TrimSpace(line) == "" {
inTable = false
}
continue
}
id, _ := strconv.Atoi(m[1])
val, _ := strconv.Atoi(m[3])
worst, _ := strconv.Atoi(m[4])
thresh, _ := strconv.Atoi(m[5])
attrs = append(attrs, smartAttr{
ID: id,
Name: m[2],
Value: val,
Worst: worst,
Threshold: thresh,
Raw: strings.TrimSpace(m[6]),
})
}
return attrs
}
// parseSelfTestResult extracts a one-line summary from nvme device-self-test,
// smartctl -a (post-completion status), or smartctl -t short (launch ack) output.
func parseSelfTestResult(text string) string {
text = strings.TrimSpace(text)
if text == "" {
return "no output"
}
lines := strings.Split(text, "\n")
// smartctl -a: "Self-test execution status: ( 0)\n\tThe previous
// self-test routine completed\n\twithout error ..." — the description
// wraps onto following indented, colon-free continuation lines.
for i, line := range lines {
if strings.Contains(strings.ToLower(line), "self-test execution status") {
parts := []string{strings.TrimSpace(line)}
for j := i + 1; j < len(lines) && j < i+4; j++ {
cont := strings.TrimSpace(lines[j])
if cont == "" || strings.Contains(cont, ":") {
break
}
parts = append(parts, cont)
}
return strings.Join(parts, " ")
}
}
// nvme device-self-test: look for "Short Device Self-Test Status : 0x0" or similar
for _, line := range lines {
l := strings.ToLower(line)
if strings.Contains(l, "self-test status") || strings.Contains(l, "self test status") {
return strings.TrimSpace(line)
}
}
// smartctl -t short: "Testing has begun" or "Short BGST started"
for _, line := range lines {
l := strings.ToLower(line)
if strings.Contains(l, "testing has begun") || strings.Contains(l, "started") || strings.Contains(l, "complete") {
return strings.TrimSpace(line)
}
}
// fallback: last non-empty line
for i := len(lines) - 1; i >= 0; i-- {
if s := strings.TrimSpace(lines[i]); s != "" {
return s
}
}
return "done"
}
// ── Resource (pseudographic usage bars) ────────────────────────────────────────
// designLifeYears/dwpd model the drive's rated endurance: 1 drive-write-per-day
// for 5 years, the baseline enterprise endurance spec used when the vendor's
// own TBW/DWPD rating isn't available from SMART/NVMe data.
const (
designLifeYears = 5
dwpd = 1.0
)
type resourceInfo struct {
powerOnHours uint64
writtenBytes uint64
readBytes uint64
capacityBytes uint64
readPercent int // only meaningful when hasReadPercent
hasReadPercent bool // true when the source SMART attribute exposes a normalized read-wear value
}
func writeResourceSection(b *strings.Builder, r resourceInfo) {
writeSectionHeader(b, "Resource")
const maxLifeHours = designLifeYears * 365 * 24
upFrac := float64(r.powerOnHours) / float64(maxLifeHours)
fmt.Fprintf(b, " %-9s %s %s / %s (%s)\n",
"Uptime", progressBar(upFrac, 24), formatHoursHuman(r.powerOnHours), formatHoursHuman(maxLifeHours), formatPercent(upFrac*100))
if r.capacityBytes > 0 {
maxWritten := float64(r.capacityBytes) * dwpd * designLifeYears * 365
wFrac := float64(r.writtenBytes) / maxWritten
fmt.Fprintf(b, " %-9s %s %s / %s (%s, %g DWPD×%dy)\n",
"Written", progressBar(wFrac, 24), formatBytesHuman(float64(r.writtenBytes)), formatBytesHuman(maxWritten), formatPercent(wFrac*100), dwpd, designLifeYears)
} else {
fmt.Fprintf(b, " %-9s %s\n", "Written", formatBytesHuman(float64(r.writtenBytes)))
}
if r.hasReadPercent {
fmt.Fprintf(b, " %-9s %s %s (%d%%)\n",
"Read", progressBar(float64(r.readPercent)/100, 24), formatBytesHuman(float64(r.readBytes)), r.readPercent)
} else {
fmt.Fprintf(b, " %-9s %s\n", "Read", formatBytesHuman(float64(r.readBytes)))
}
}
// progressBar renders a fixed-width pseudographic bar, e.g. "[######------]".
func progressBar(frac float64, width int) string {
if math.IsNaN(frac) || frac < 0 {
frac = 0
}
if frac > 1 {
frac = 1
}
filled := int(math.Round(frac * float64(width)))
return "[" + strings.Repeat("#", filled) + strings.Repeat("-", width-filled) + "]"
}
// formatBytesHuman renders a decimal (SI) human-readable byte size, e.g. "1.23 TB".
func formatBytesHuman(n float64) string {
units := []string{"B", "KB", "MB", "GB", "TB", "PB"}
i := 0
for n >= 1000 && i < len(units)-1 {
n /= 1000
i++
}
if i == 0 {
return fmt.Sprintf("%.0f %s", n, units[i])
}
return fmt.Sprintf("%.2f %s", n, units[i])
}
// formatHoursHuman renders an hour count as a human-scaled duration (hours,
// days, or years) so uptimes don't show as raw four/five-digit hour counts.
func formatHoursHuman(hours uint64) string {
if hours < 48 {
return fmt.Sprintf("%d h", hours)
}
days := float64(hours) / 24
if days < 365 {
return fmt.Sprintf("%.0f d", days)
}
years := days / 365
if years == math.Trunc(years) {
return fmt.Sprintf("%.0f y", years)
}
return fmt.Sprintf("%.1f y", years)
}
// formatPercent renders a percentage with extra precision below 1% (e.g.
// "0.03%"), where a rounded "0%" would hide any usage at all.
func formatPercent(pct float64) string {
if pct > 0 && pct < 1 {
return fmt.Sprintf("%.2f%%", pct)
}
return fmt.Sprintf("%.0f%%", pct)
}
// parseLeadingUint parses the leading run of digits/commas in s (e.g. from a
// SMART raw value or "500,107,862,016 bytes") into a uint64, ignoring the rest.
func parseLeadingUint(s string) uint64 {
s = strings.TrimSpace(s)
end := 0
for end < len(s) && (s[end] >= '0' && s[end] <= '9' || s[end] == ',') {
end++
}
digits := strings.ReplaceAll(s[:end], ",", "")
n, _ := strconv.ParseUint(digits, 10, 64)
return n
}
// ── Formatting helpers ────────────────────────────────────────────────────────
func writeSectionHeader(b *strings.Builder, title string) {
b.WriteString("\n")
header := "-- " + title + " "
header += strings.Repeat("-", max(0, 76-len(header)))
b.WriteString(header + "\n")
}
func writeField(b *strings.Builder, label, value string) {
fmt.Fprintf(b, " %-20s : %s\n", label, value)
}
func formatCapacityGB(bytes uint64) string {
gb := float64(bytes) / 1e9
if gb >= 1000 {
return fmt.Sprintf("%.2g TB", gb/1000)
}
return fmt.Sprintf("%.0f GB", math.Round(gb))
}
func formatUint(n uint64) string {
if n == 0 {
return "0"
}
s := strconv.FormatUint(n, 10)
// insert thousand separators
var out []byte
for i, c := range s {
if i > 0 && (len(s)-i)%3 == 0 {
out = append(out, ',')
}
out = append(out, byte(c))
}
return string(out)
}
func max(a, b int) int {
if a > b {
return a
}
return b
}

View File

@@ -0,0 +1,122 @@
package platform
import (
"strings"
"testing"
"time"
)
var testNVMeIdCtrl = []byte(`{
"mn": "SAMSUNG MZ1L2960HCJR-00A07 ",
"sn": "S665NN0X415495",
"fr": "GDC7602Q",
"tnvmcap": 960197124096
}`)
var testNVMeSmartLog = []byte(`{
"critical_warning": 0,
"temperature": 311,
"avail_spare": 100,
"spare_thresh": 10,
"percent_used": 0,
"data_units_read": 1023456,
"data_units_written": 738281,
"power_cycles": 32,
"power_on_hours": 1234,
"unsafe_shutdowns": 3,
"media_errors": 0,
"num_err_log_entries": 0
}`)
// lo/hi variant emitted by some nvme-cli versions
var testNVMeSmartLogLoHi = []byte(`{
"critical_warning": 0,
"temperature": {"lo": 311, "hi": 0},
"avail_spare": 100,
"spare_thresh": 10,
"percent_used": 0,
"data_units_read": {"lo": 1023456, "hi": 0},
"data_units_written": {"lo": 738281, "hi": 0},
"power_cycles": {"lo": 32, "hi": 0},
"power_on_hours": {"lo": 1234, "hi": 0},
"unsafe_shutdowns": {"lo": 3, "hi": 0},
"media_errors": {"lo": 0, "hi": 0},
"num_err_log_entries": {"lo": 0, "hi": 0}
}`)
var testSmartCtlHealth = []byte(`
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.0] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG MZ1L2960HCJR-00A07
Serial Number: S665NN0X415495
Firmware Version: GDC7602Q
User Capacity: 960,197,124,096 bytes [960 GB]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1234
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 45
177 Wear_Leveling_Count 0x0013 097 097 000 Pre-fail Always - 30
190 Airflow_Temperature_Cel 0x0032 063 045 000 Old_age Always - 37
`)
func TestGenerateDiskReportNVMe(t *testing.T) {
t.Parallel()
outputs := map[string][]byte{
"nvme-id-ctrl": testNVMeIdCtrl,
"nvme-smart-log": testNVMeSmartLog,
}
report := GenerateDiskReportText(1, "/dev/nvme0n1", outputs, time.Unix(0, 0).UTC())
assertContains(t, report, "Disk 1", "/dev/nvme0n1")
assertContains(t, report, "SAMSUNG MZ1L2960HCJR-00A07")
assertContains(t, report, "S665NN0X415495")
assertContains(t, report, "GDC7602Q")
assertContains(t, report, "38 °C") // 311 K - 273
assertContains(t, report, "1,234 h") // power_on_hours with separator
assertContains(t, report, "32") // power_cycles
assertContains(t, report, "3") // unsafe_shutdowns
assertContains(t, report, "378.0 GB") // data_units_written * 512000 / 1e9
}
func TestGenerateDiskReportNVMeLoHi(t *testing.T) {
t.Parallel()
outputs := map[string][]byte{
"nvme-id-ctrl": testNVMeIdCtrl,
"nvme-smart-log": testNVMeSmartLogLoHi,
}
report := GenerateDiskReportText(1, "/dev/nvme0n1", outputs, time.Unix(0, 0).UTC())
assertContains(t, report, "38 °C")
assertContains(t, report, "1,234 h")
}
func TestGenerateDiskReportSATA(t *testing.T) {
t.Parallel()
outputs := map[string][]byte{
"smartctl-health": testSmartCtlHealth,
}
report := GenerateDiskReportText(2, "/dev/sda", outputs, time.Unix(0, 0).UTC())
assertContains(t, report, "Disk 2", "/dev/sda")
assertContains(t, report, "SAMSUNG MZ1L2960HCJR-00A07")
assertContains(t, report, "S665NN0X415495")
assertContains(t, report, "PASSED")
assertContains(t, report, "Reallocated_Sector_Ct")
assertContains(t, report, "Power_On_Hours")
}
func assertContains(t *testing.T, text string, needles ...string) {
t.Helper()
for _, needle := range needles {
if !strings.Contains(text, needle) {
t.Errorf("report missing %q\nreport:\n%s", needle, text)
}
}
}

View File

@@ -0,0 +1,280 @@
package webui
import (
"context"
"encoding/json"
"fmt"
"net/http"
"os/exec"
"strconv"
"strings"
"time"
)
type huaweiField struct {
Name string `json:"name"`
Key string `json:"key"`
Value string `json:"value"`
ReadOnly bool `json:"read_only,omitempty"`
}
type huaweiChange struct {
Key string `json:"key"`
Value string `json:"value"`
}
type huaweiFieldDef struct {
Name string
Key string
FruID byte
TypeID byte
FieldID byte
Special string // "chassis-type" | "guid"
}
var huaweiElabelDefs = []huaweiFieldDef{
{"Device Name", "DeviceName", 0x00, 0x06, 0x01, ""},
{"Device Serial Number", "DeviceSerialNumber", 0x00, 0x06, 0x03, ""},
{"Product Name", "ProductName", 0x00, 0x03, 0x01, ""},
{"Product Serial Number", "ProductSerialNumber", 0x00, 0x03, 0x04, ""},
{"Product Asset Tag", "ProductAssetTag", 0x00, 0x03, 0x05, ""},
{"Product Manufacturer", "ProductManufacturer", 0x00, 0x03, 0x00, ""},
{"Mainboard Manufacturer", "MainboardManufacturer", 0x00, 0x02, 0x01, ""},
{"Board Product Name", "BoardProductName", 0x00, 0x02, 0x02, ""},
{"Chassis Part Number", "ChassisPartnumber", 0x00, 0x01, 0x01, ""},
{"Chassis Type", "ChassisType", 0x00, 0x01, 0x00, "chassis-type"},
{"IO Chassis Serial", "IOChassisSerialNumber", 0x01, 0x03, 0x04, ""},
{"IO Chassis Asset Tag", "IOChassisAssetTag", 0x01, 0x03, 0x05, ""},
{"GUID", "GUID", 0x00, 0x00, 0x00, "guid"},
}
// huaweiGetRaw reads a string elabel field via OEM IPMI raw command.
// Protocol: ipmitool raw 0x30 0x90 0x05 <fru_id> <type_id> <field_id> 0x00 0x30
// Response: <length_byte> <ascii_byte1> ... (null-terminated)
func huaweiGetRaw(ctx context.Context, def huaweiFieldDef) (string, error) {
if def.Special == "guid" {
return huaweiGetGUID(ctx)
}
args := []string{
"0x30", "0x90", "0x05",
fmt.Sprintf("0x%02x", def.FruID),
fmt.Sprintf("0x%02x", def.TypeID),
fmt.Sprintf("0x%02x", def.FieldID),
"0x00", "0x30",
}
out, err := exec.CommandContext(ctx, "ipmitool", append([]string{"raw"}, args...)...).CombinedOutput()
if err != nil {
return "", err
}
return huaweiParseStringResponse(strings.TrimSpace(string(out)), def.Special), nil
}
// huaweiParseStringResponse decodes the OEM IPMI response bytes to a string.
// Format: <length_byte> <byte1> <byte2> ...
func huaweiParseStringResponse(hexOut, special string) string {
parts := strings.Fields(hexOut)
if len(parts) < 2 {
return ""
}
if special == "chassis-type" {
// Response: <length=1> <type_byte>
if len(parts) >= 2 {
n, err := strconv.ParseUint(parts[1], 16, 8)
if err == nil {
return fmt.Sprintf("0x%02x", n)
}
}
return ""
}
var sb strings.Builder
for _, p := range parts[1:] {
b, err := strconv.ParseUint(p, 16, 8)
if err != nil || b == 0 {
break
}
sb.WriteByte(byte(b))
}
return strings.TrimRight(sb.String(), "\x00")
}
// huaweiGetGUID reads the system GUID via standard IPMI Get System GUID (0x06 0x08).
func huaweiGetGUID(ctx context.Context) (string, error) {
out, err := exec.CommandContext(ctx, "ipmitool", "raw", "0x06", "0x08").CombinedOutput()
if err != nil {
return "", err
}
parts := strings.Fields(strings.TrimSpace(string(out)))
if len(parts) != 16 {
return "", nil
}
// Format as UUID: 4-2-2-2-6 byte groups
// iBMC returns bytes in reversed order; re-reverse to get canonical UUID.
var bytes [16]string
for i, p := range parts {
bytes[15-i] = p
}
return fmt.Sprintf("%s%s%s%s-%s%s-%s%s-%s%s-%s%s%s%s%s%s",
bytes[0], bytes[1], bytes[2], bytes[3],
bytes[4], bytes[5],
bytes[6], bytes[7],
bytes[8], bytes[9],
bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15],
), nil
}
// huaweiChunks splits a value into 19-byte chunks for the OEM IPMI SET protocol.
// Key byte: bit7=1 means more chunks follow; bits 0-6 = offset into string.
func huaweiChunks(value string) [][]string {
if len(value) == 0 {
return [][]string{{"0x00", "0x01", "0x00"}}
}
const maxLen = 63
if len(value) > maxLen {
value = value[:maxLen]
}
const chunkSize = 19
var chunks [][]string
for offset := 0; offset < len(value); {
end := offset + chunkSize
if end > len(value) {
end = len(value)
}
isLast := end >= len(value)
key := byte(offset)
if !isLast {
key |= 0x80
}
args := []string{
fmt.Sprintf("0x%02x", key),
fmt.Sprintf("0x%02x", end-offset),
}
for _, b := range []byte(value[offset:end]) {
args = append(args, fmt.Sprintf("0x%02x", b))
}
chunks = append(chunks, args)
offset = end
}
return chunks
}
func (h *handler) handleAPIHuaweiElabelRead(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 60*time.Second)
defer cancel()
var fields []huaweiField
for _, def := range huaweiElabelDefs {
val, err := huaweiGetRaw(ctx, def)
if err != nil {
// First field failure likely means no Huawei BMC — abort with error.
if len(fields) == 0 {
msg := strings.TrimSpace(err.Error())
writeError(w, http.StatusInternalServerError, "huawei elabel not available: "+msg)
return
}
val = ""
}
fields = append(fields, huaweiField{
Name: def.Name,
Key: def.Key,
Value: val,
ReadOnly: def.Special == "guid" || def.Special == "chassis-type",
})
}
writeJSON(w, fields)
}
func (h *handler) handleAPIHuaweiElabelWrite(w http.ResponseWriter, r *http.Request) {
var req struct {
Changes []huaweiChange `json:"changes"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid JSON")
return
}
if len(req.Changes) == 0 {
writeError(w, http.StatusUnprocessableEntity, "no changes provided")
return
}
defByKey := make(map[string]huaweiFieldDef, len(huaweiElabelDefs))
for _, d := range huaweiElabelDefs {
defByKey[d.Key] = d
}
for _, c := range req.Changes {
def, ok := defByKey[c.Key]
if !ok {
writeError(w, http.StatusUnprocessableEntity, "unknown field key: "+c.Key)
return
}
if def.Special == "guid" || def.Special == "chassis-type" {
writeError(w, http.StatusUnprocessableEntity, "field is read-only: "+c.Key)
return
}
if len(c.Value) > 63 {
writeError(w, http.StatusUnprocessableEntity, "value too long (max 63 chars): "+c.Key)
return
}
for _, ch := range c.Value {
if ch < 0x20 || ch > 0x7E {
writeError(w, http.StatusUnprocessableEntity, "non-printable character in value for: "+c.Key)
return
}
}
}
t := &Task{
ID: newJobID("huawei-elabel-write"),
Name: fmt.Sprintf("Huawei Elabel Write (%d field(s))", len(req.Changes)),
Target: "huawei-elabel-write",
Priority: defaultTaskPriority("huawei-elabel-write", taskParams{}),
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{HuaweiElabelChanges: req.Changes},
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID})
}
func runHuaweiElabelWriteTask(ctx context.Context, j *jobState, p taskParams) error {
defByKey := make(map[string]huaweiFieldDef, len(huaweiElabelDefs))
for _, d := range huaweiElabelDefs {
defByKey[d.Key] = d
}
// Enable device name effective flag before writing.
enableCmd := exec.CommandContext(ctx, "ipmitool", "raw", "0x30", "0x90", "0x21", "0x04", "0x01")
if out, err := enableCmd.CombinedOutput(); err != nil {
j.append("Warning: enable flag: " + strings.TrimSpace(string(out)))
}
for _, c := range p.HuaweiElabelChanges {
def := defByKey[c.Key]
setPrefix := []string{
"0x30", "0x90", "0x04",
fmt.Sprintf("0x%02x", def.FruID),
fmt.Sprintf("0x%02x", def.TypeID),
fmt.Sprintf("0x%02x", def.FieldID),
}
chunks := huaweiChunks(c.Value)
j.append(fmt.Sprintf("Setting %s = %q (%d chunk(s))", c.Key, c.Value, len(chunks)))
for _, chunk := range chunks {
args := append([]string{"raw"}, setPrefix...)
args = append(args, chunk...)
cmd := exec.CommandContext(ctx, "ipmitool", args...)
if err := streamCmdJob(j, cmd); err != nil {
return fmt.Errorf("set %s: %w", c.Key, err)
}
}
// Commit after each field.
commitCmd := exec.CommandContext(ctx, "ipmitool", "raw", "0x30", "0x90", "0x06", "0x00", "0xAA")
if out, err := commitCmd.CombinedOutput(); err != nil {
return fmt.Errorf("commit after %s: %w (output: %s)", c.Key, err, strings.TrimSpace(string(out)))
}
j.append("Committed " + c.Key)
}
return nil
}

View File

@@ -202,83 +202,3 @@ func runIPMIFRUWriteTask(ctx context.Context, j *jobState, exportDir string, p t
return nil
}
func renderIPMIFRUCard() string {
return `<div class="card"><div class="card-head card-head-actions">IPMI &#8212; FRU<div class="card-head-buttons"><button class="btn btn-sm btn-secondary" onclick="fruRead()">Read</button></div></div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Reads and edits FRU fields via ipmitool (In-Band, device 0). Works on any server with IPMI support.</p>
<div id="fru-status" style="font-size:13px;color:var(--muted);margin-bottom:8px"></div>
<div id="fru-table"></div>
</div></div>
<script>
var _fruActBtnStyle = 'width:22px;height:22px;padding:0;font-size:13px;line-height:1;border:1px solid var(--line);border-radius:3px;background:var(--surface);cursor:pointer;vertical-align:middle;';
var _fruInputStyle = 'width:100%;padding:3px 6px;border:1.5px solid #888;border-radius:3px;font-size:13px;font-family:monospace;background:var(--surface);color:var(--ink);';
function fruRead() {
var status = document.getElementById('fru-status');
status.textContent = 'Reading...'; status.style.color = 'var(--muted)';
document.getElementById('fru-table').innerHTML = '';
fetch('/api/tools/ipmi-fru', {cache:'no-store'})
.then(function(r) { return r.json().then(function(d){if(!r.ok)throw new Error(d.error||r.statusText);return d;}); })
.then(function(fields) {
if (!fields || !fields.length) { status.textContent = 'No FRU fields returned.'; return; }
status.textContent = '';
var rows = fields.map(function(f) {
var val = escHtml(f.value || '');
return '<tr>'
+ '<td style="color:var(--muted);white-space:nowrap;padding-right:16px;vertical-align:middle;font-size:13px">' + escHtml(f.name) + '</td>'
+ '<td style="vertical-align:middle"><input class="fru-inp" style="' + _fruInputStyle + '"'
+ ' data-area="' + escHtml(f.area||'') + '" data-index="' + (f.index||0) + '" data-name="' + escHtml(f.name) + '"'
+ ' data-original="' + val + '" value="' + val + '" oninput="fruChanged(this)"></td>'
+ '<td class="fru-act" style="display:none;white-space:nowrap;padding-left:6px;vertical-align:middle">'
+ '<button style="' + _fruActBtnStyle + 'color:var(--ok-fg,green);margin-right:3px" title="Save" onclick="fruSave(this)">✓</button>'
+ '<button style="' + _fruActBtnStyle + 'color:var(--crit-fg,#9f3a38)" title="Cancel" onclick="fruCancel(this)">✗</button>'
+ '<span class="fru-msg" style="font-size:11px;margin-left:5px;color:var(--muted)"></span>'
+ '</td></tr>';
}).join('');
document.getElementById('fru-table').innerHTML = '<table style="width:100%;border-collapse:collapse">' + rows + '</table>';
})
.catch(function(e) { status.textContent = 'Error: '+e.message; status.style.color='var(--crit-fg)'; });
}
function escHtml(s) {
return String(s==null?'':s).replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;');
}
function fruChanged(inp) {
inp.closest('tr').querySelector('.fru-act').style.display = inp.value !== inp.dataset.original ? '' : 'none';
}
function fruCancel(btn) {
var row = btn.closest('tr');
var inp = row.querySelector('.fru-inp');
inp.value = inp.dataset.original;
row.querySelector('.fru-act').style.display = 'none';
row.querySelector('.fru-msg').textContent = '';
}
function fruSave(btn) {
var row = btn.closest('tr');
var inp = row.querySelector('.fru-inp');
var msg = row.querySelector('.fru-msg');
var cancelBtn = row.querySelectorAll('.fru-act button')[1];
btn.disabled = true; cancelBtn.disabled = true;
msg.textContent = '…'; msg.style.color = 'var(--muted)';
fetch('/api/tools/ipmi-fru/write', {method:'POST', headers:{'Content-Type':'application/json'},
body: JSON.stringify({changes:[{area:inp.dataset.area, index:parseInt(inp.dataset.index,10), name:inp.dataset.name, value:inp.value}]})})
.then(function(r){return r.json().then(function(d){if(!r.ok)throw new Error(d.error||r.statusText);return d;});})
.then(function(d){
var poll = setInterval(function(){
fetch('/api/tasks',{cache:'no-store'}).then(function(r){return r.json();}).then(function(tasks){
var t = Array.isArray(tasks)?tasks.find(function(x){return x.id===d.task_id;}):null;
if(!t) return;
if(t.status==='done'){
clearInterval(poll);
inp.dataset.original = inp.value;
row.querySelector('.fru-act').style.display = 'none';
msg.textContent = '';
} else if(t.status==='failed'||t.status==='cancelled'){
clearInterval(poll);
msg.textContent = t.error||t.status; msg.style.color='var(--crit-fg)';
btn.disabled=false; cancelBtn.disabled=false;
}
});
},1500);
})
.catch(function(e){ msg.textContent='Error: '+e.message; msg.style.color='var(--crit-fg)'; btn.disabled=false; cancelBtn.disabled=false; });
}
</script>`
}

View File

@@ -404,13 +404,227 @@ loadNvidiaSelfHeal();
func renderTools() string {
return renderNVMeFormatCard() + `
` + renderSAADMICard() + `
` + renderIPMIFRUCard() + `
` + renderFRUEditorCard() + `
` + renderRAIDMgmtCard()
}
func renderFRUEditorCard() string {
return `<div class="card"><div class="card-head card-head-actions">FRU / Elabel<div class="card-head-buttons"><button class="btn btn-sm btn-secondary" onclick="fruAllRead()">Read All</button></div></div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Reads and edits hardware identity fields from all available sources. Each field shows its source method.</p>
<div id="fru-all-status" style="font-size:13px;color:var(--muted);margin-bottom:8px"></div>
<div id="fru-src-status" style="display:none;margin-bottom:10px"></div>
<div id="fru-all-table"></div>
</div></div>
<style>
.fru-chip{display:inline-block;font-size:10px;font-weight:600;letter-spacing:.02em;padding:1px 6px;border-radius:3px;vertical-align:middle;white-space:nowrap;margin-right:8px;flex-shrink:0}
.fru-chip-ipmi{background:#e8e8e8;color:#555}
.fru-chip-huawei{background:#fff0e6;color:#b83}
.fru-chip-saa{background:#e6f0ff;color:#557}
.fru-inp-wrap{display:flex;align-items:center;gap:0}
</style>
<script>
(function(){
var _actBtn='width:22px;height:22px;padding:0;font-size:13px;line-height:1;border:1px solid var(--line);border-radius:3px;background:var(--surface);cursor:pointer;vertical-align:middle;';
var _inp='width:100%;padding:3px 6px;border:1.5px solid #888;border-radius:3px;font-size:13px;font-family:monospace;background:var(--surface);color:var(--ink);';
var SOURCES = [
{
id: 'ipmi-fru',
label: 'IPMI FRU',
chipClass: 'fru-chip-ipmi',
url: '/api/tools/ipmi-fru',
writeUrl: '/api/tools/ipmi-fru/write',
rowAttrs: function(f) {
return 'data-source="ipmi-fru" data-area="'+esc(f.area||'')+'" data-index="'+(f.index||0)+'" data-name="'+esc(f.name)+'"';
},
writeBody: function(inp) {
return JSON.stringify({changes:[{area:inp.dataset.area,index:parseInt(inp.dataset.index,10),name:inp.dataset.name,value:inp.value}]});
},
fieldName: function(f) { return f.name; },
fieldValue: function(f) { return f.value||''; },
readOnly: function(f) { return false; },
},
{
id: 'huawei',
label: 'Huawei iBMC',
chipClass: 'fru-chip-huawei',
url: '/api/tools/huawei-elabel',
writeUrl: '/api/tools/huawei-elabel/write',
rowAttrs: function(f) {
return 'data-source="huawei" data-key="'+esc(f.key)+'"';
},
writeBody: function(inp) {
return JSON.stringify({changes:[{key:inp.dataset.key,value:inp.value}]});
},
fieldName: function(f) { return f.name; },
fieldValue: function(f) { return f.value||''; },
readOnly: function(f) { return !!f.read_only; },
},
{
id: 'saa-dmi',
label: 'SAA DMI',
chipClass: 'fru-chip-saa',
url: '/api/tools/saa-dmi',
writeUrl: '/api/tools/saa-dmi/write',
rowAttrs: function(f) {
return 'data-source="saa-dmi" data-shn="'+esc(f.shn)+'"';
},
writeBody: function(inp) {
return JSON.stringify({changes:[{shn:inp.dataset.shn,value:inp.value}]});
},
fieldName: function(f) { return f.name; },
fieldValue: function(f) { return f.value||''; },
readOnly: function(f) { return false; },
},
];
function esc(s){return String(s==null?'':s).replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;');}
function renderSrcStatus(perSource) {
var bar = document.getElementById('fru-src-status');
if (!perSource.length) { bar.style.display = 'none'; bar.innerHTML = ''; return; }
var html = '';
perSource.forEach(function(p) {
var state, color;
if (p.ok) {
state = p.count + ' field(s) available';
color = 'var(--ok-fg,green)';
} else if (/not activated|product key|SFT-DCMS|SFT-OOB/i.test(p.reason)) {
state = 'requires Supermicro license (SFT-OOB-LIC / SFT-DCMS-SINGLE) — activate on BMC';
color = 'var(--crit-fg,#9f3a38)';
} else {
state = p.reason || 'unavailable';
color = 'var(--muted)';
}
html += '<div style="display:flex;align-items:center;gap:8px;font-size:12px;margin:3px 0">'
+ '<span class="fru-chip '+p.src.chipClass+'">'+p.src.label+'</span>'
+ '<span style="color:'+color+'">'+esc(state)+'</span>'
+ '</div>';
});
bar.innerHTML = html;
bar.style.display = '';
}
window.fruAllRead = function() {
var status = document.getElementById('fru-all-status');
var table = document.getElementById('fru-all-table');
status.textContent = 'Reading…'; status.style.color = 'var(--muted)';
table.innerHTML = '';
var fetches = SOURCES.map(function(src) {
return fetch(src.url, {cache:'no-store'})
.then(function(r){ return r.json().then(function(d){ if(!r.ok) throw new Error(d.error||r.statusText); return d; }); });
});
Promise.allSettled(fetches).then(function(results) {
var rows = '';
var totalFields = 0;
var perSource = [];
results.forEach(function(res, i) {
var src = SOURCES[i];
if (res.status === 'rejected' || !Array.isArray(res.value) || res.value.length === 0) {
var reason = '';
if (res.status === 'rejected' && res.reason) reason = res.reason.message;
else reason = 'no editable fields returned';
perSource.push({src:src, ok:false, count:0, reason:reason});
return;
}
perSource.push({src:src, ok:true, count:res.value.length, reason:''});
res.value.forEach(function(f) {
var val = esc(src.fieldValue(f));
var ro = src.readOnly(f);
var attrs = ro ? '' : (' '+src.rowAttrs(f));
rows += '<tr>'
+ '<td style="white-space:nowrap;padding-right:4px;vertical-align:middle">'
+ '<span class="fru-chip '+src.chipClass+'">'+src.label+'</span>'
+ '</td>'
+ '<td style="color:var(--muted);white-space:nowrap;padding-right:16px;vertical-align:middle;font-size:13px">'+esc(src.fieldName(f))+'</td>'
+ '<td style="vertical-align:middle">'
+ (ro
? '<span style="font-family:monospace;font-size:13px;color:var(--muted)">'+val+'</span>'
: '<input class="fru-uni-inp" style="'+_inp+'" value="'+val+'" data-original="'+val+'"'+attrs+' oninput="fruUniChanged(this)">')
+ '</td>'
+ '<td class="fru-uni-act" style="display:none;white-space:nowrap;padding-left:6px;vertical-align:middle">'
+ '<button style="'+_actBtn+'color:var(--ok-fg,green);margin-right:3px" title="Save" onclick="fruUniSave(this)">✓</button>'
+ '<button style="'+_actBtn+'color:var(--crit-fg,#9f3a38)" title="Cancel" onclick="fruUniCancel(this)">✗</button>'
+ '<span class="fru-uni-msg" style="font-size:11px;margin-left:5px;color:var(--muted)"></span>'
+ '</td>'
+ '</tr>';
totalFields++;
});
});
renderSrcStatus(perSource);
if (totalFields === 0) {
status.textContent = 'No editable fields available — see per-source status below.';
status.style.color = 'var(--crit-fg,#9f3a38)';
table.innerHTML = '';
return;
}
table.innerHTML = '<table style="width:100%;border-collapse:collapse">'+rows+'</table>';
status.textContent = totalFields + ' field(s) loaded';
status.style.color = 'var(--muted)';
});
};
window.fruUniChanged = function(inp) {
var row = inp.closest('tr');
row.querySelector('.fru-uni-act').style.display = inp.value !== inp.dataset.original ? '' : 'none';
row.querySelector('.fru-uni-msg').textContent = '';
};
window.fruUniCancel = function(btn) {
var row = btn.closest('tr');
var inp = row.querySelector('.fru-uni-inp');
inp.value = inp.dataset.original;
row.querySelector('.fru-uni-act').style.display = 'none';
row.querySelector('.fru-uni-msg').textContent = '';
};
window.fruUniSave = function(btn) {
var row = btn.closest('tr');
var inp = row.querySelector('.fru-uni-inp');
var msg = row.querySelector('.fru-uni-msg');
var cancelBtn = row.querySelectorAll('.fru-uni-act button')[1];
var src = SOURCES.find(function(s){ return s.id === inp.dataset.source; });
if (!src) { msg.textContent = 'Unknown source'; msg.style.color='var(--crit-fg)'; return; }
btn.disabled = true; cancelBtn.disabled = true;
msg.textContent = '…'; msg.style.color = 'var(--muted)';
fetch(src.writeUrl, {method:'POST', headers:{'Content-Type':'application/json'}, body:src.writeBody(inp)})
.then(function(r){ return r.json().then(function(d){ if(!r.ok) throw new Error(d.error||r.statusText); return d; }); })
.then(function(d) {
var poll = setInterval(function() {
fetch('/api/tasks',{cache:'no-store'}).then(function(r){return r.json();}).then(function(tasks){
var t = Array.isArray(tasks) ? tasks.find(function(x){return x.id===d.task_id;}) : null;
if (!t) return;
if (t.status==='done') {
clearInterval(poll);
inp.dataset.original = inp.value;
row.querySelector('.fru-uni-act').style.display = 'none';
msg.textContent = ''; msg.style.color = '';
} else if (t.status==='failed'||t.status==='cancelled') {
clearInterval(poll);
msg.textContent = t.error||t.status; msg.style.color = 'var(--crit-fg)';
btn.disabled = false; cancelBtn.disabled = false;
}
});
}, 1500);
})
.catch(function(e) {
msg.textContent = 'Error: '+e.message; msg.style.color = 'var(--crit-fg)';
btn.disabled = false; cancelBtn.disabled = false;
});
};
})();
</script>`
}
func renderExportIndex(exportDir string) (string, error) {
entries, err := listExportFiles(exportDir)
if err != nil {

View File

@@ -143,9 +143,9 @@ func renderValidateMode(opts HandlerOptions, stressDefault bool) string {
)) +
renderSATCard("storage", "Storage", "runSAT('storage')", "", renderValidateCardBody(
inv.Storage,
`Scans all storage devices and runs the matching health or self-test path for each device type.`,
`<code>lsblk</code>; NVMe: <code>nvme</code>; SATA/SAS: <code>smartctl</code>`,
`Seconds in Validate (NVMe: instant device query; SATA/SAS: short self-test). Up to ~1 h per device in Stress (extended self-test, device-dependent).`,
`Collects SMART data and runs a short self-test on each storage device.`,
`<code>lsblk</code>; NVMe: <code>nvme id-ctrl</code>, <code>nvme smart-log</code>, <code>nvme device-self-test -s 1</code>; SATA/SAS: <code>smartctl -H -A</code>, <code>smartctl -t short</code>`,
`~2 min per device (NVMe short self-test; SATA/SAS short self-test — duration device-dependent).`,
)) +
`</div>
<div style="height:1px;background:var(--border);margin:16px 0"></div>
@@ -672,9 +672,9 @@ func renderCheck(opts HandlerOptions) string {
)) +
renderSATCard("storage", "Storage", "runSAT('storage')", "", renderValidateCardBody(
inv.Storage,
`Scans all storage devices and runs the matching health or self-test path for each.`,
`<code>lsblk</code>; NVMe: <code>nvme</code>; SATA/SAS: <code>smartctl</code>`,
`Seconds (NVMe: instant device query; SATA/SAS: short self-test).`,
`Collects SMART health and attributes for each storage device. No self-test is triggered — read-only query only.`,
`<code>lsblk</code>; NVMe: <code>nvme id-ctrl</code>, <code>nvme smart-log</code>; SATA/SAS: <code>smartctl -H -A</code>`,
`Seconds instantaneous device query, no wear counters incremented.`,
)) +
`</div>
<div style="height:1px;background:var(--border);margin:16px 0"></div>

View File

@@ -38,6 +38,7 @@ type raidControllerInfo struct {
Model string `json:"model"`
ForeignDrives []raidDriveInfo `json:"foreign_drives"`
FreeDrives []raidDriveInfo `json:"free_drives"`
AllDrives []raidDriveInfo `json:"all_drives"`
Arrays []raidArrayInfo `json:"arrays,omitempty"`
}
@@ -97,6 +98,7 @@ func detectLSIControllers() []raidControllerInfo {
Model: c.ResponseData.Basics.Model,
ForeignDrives: []raidDriveInfo{},
FreeDrives: []raidDriveInfo{},
AllDrives: []raidDriveInfo{},
}
if ctrl.Model == "" {
ctrl.Model = fmt.Sprintf("LSI Controller %d", ctrl.Index)
@@ -111,6 +113,7 @@ func detectLSIControllers() []raidControllerInfo {
SizeGB: raidParseHumanSizeGB(d.Size),
Serial: strings.TrimSpace(d.SN),
}
ctrl.AllDrives = append(ctrl.AllDrives, info)
switch strings.TrimSpace(d.State) {
case "Frgn":
ctrl.ForeignDrives = append(ctrl.ForeignDrives, info)
@@ -168,6 +171,30 @@ func parseRAIDMDStat(raw string) []mdStatEntry {
return entries
}
// raidVROCPortRx matches lines like " Port2 : /dev/sda (SERIAL123)"
// or " Port3 : - no device attached -" from `mdadm --detail-platform`.
var raidVROCPortRx = regexp.MustCompile(`^\s*Port\d+\s*:\s*(\S+)`)
// parseVROCPorts returns the block device basenames (e.g. "sda") that are
// physically wired to the VROC I/O controller's ports, per `mdadm
// --detail-platform` output. Drives attached directly to the CPU (or to a
// separate HBA) rather than through this controller's ports are excluded.
func parseVROCPorts(raw string) map[string]bool {
ports := map[string]bool{}
for _, line := range strings.Split(raw, "\n") {
m := raidVROCPortRx.FindStringSubmatch(line)
if m == nil {
continue
}
dev := m[1]
if !strings.HasPrefix(dev, "/dev/") {
continue
}
ports[strings.TrimPrefix(dev, "/dev/")] = true
}
return ports
}
func detectVROCController() *raidControllerInfo {
out, err := exec.Command("mdadm", "--detail-platform").CombinedOutput()
if err != nil && len(out) == 0 {
@@ -191,8 +218,16 @@ func detectVROCController() *raidControllerInfo {
Model: "Intel VROC",
ForeignDrives: []raidDriveInfo{},
FreeDrives: []raidDriveInfo{},
AllDrives: []raidDriveInfo{},
}
ports := parseVROCPorts(string(out))
// Some mdadm builds omit the "Port" lines from --detail-platform. When
// we can't determine which drives are actually wired to this
// controller, fall back to showing every disk not already in an array
// rather than hiding everything.
portsKnown := len(ports) > 0
inArray := map[string]bool{}
raw, err := os.ReadFile("/proc/mdstat")
if err == nil {
@@ -222,15 +257,25 @@ func detectVROCController() *raidControllerInfo {
}
if json.Unmarshal(lsblkOut, &lsblkDoc) == nil {
for _, d := range lsblkDoc.BlockDevices {
if d.Type != "disk" || inArray[d.Name] {
// Only consider disks wired to this controller's ports -
// drives attached directly to the CPU (or another
// controller) never show up as VROC ports and are skipped.
if d.Type != "disk" || (portsKnown && !ports[d.Name]) {
continue
}
ctrl.FreeDrives = append(ctrl.FreeDrives, raidDriveInfo{
info := raidDriveInfo{
Device: "/dev/" + d.Name,
Model: strings.TrimSpace(d.Model),
Serial: strings.TrimSpace(d.Serial),
State: "available",
})
}
if inArray[d.Name] {
info.State = "member"
}
ctrl.AllDrives = append(ctrl.AllDrives, info)
if info.State == "available" {
ctrl.FreeDrives = append(ctrl.FreeDrives, info)
}
}
}
}
@@ -348,6 +393,38 @@ func (h *handler) handleAPIRAIDCreateMirror(w http.ResponseWriter, r *http.Reque
writeJSON(w, map[string]string{"task_id": t.ID})
}
func (h *handler) handleAPIRAIDPrepareDrive(w http.ResponseWriter, r *http.Request) {
var req struct {
ControllerID string `json:"controller_id"`
Slot string `json:"slot"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid JSON")
return
}
ctrlIdx, ok := parseLSIControllerIndex(req.ControllerID)
if !ok {
writeError(w, http.StatusBadRequest, "invalid controller_id")
return
}
if _, _, ok := parseRAIDSlot(req.Slot); !ok {
writeError(w, http.StatusBadRequest, "invalid slot")
return
}
t := &Task{
ID: newJobID("raid-lsi-prepare-drive"),
Name: fmt.Sprintf("Prepare drive %s (LSI ctrl %d)", req.Slot, ctrlIdx),
Target: "raid-lsi-prepare-drive",
Priority: defaultTaskPriority("raid-lsi-prepare-drive", taskParams{}),
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{RAIDController: ctrlIdx, RAIDSlot: req.Slot},
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID})
}
func parseLSIControllerIndex(id string) (int, bool) {
if !strings.HasPrefix(id, "lsi-") {
return 0, false
@@ -385,6 +462,34 @@ func runRAIDLSICreateMirrorTask(ctx context.Context, j *jobState, ctrl int, driv
return streamCmdJob(j, cmd)
}
// parseRAIDSlot splits a storcli "EID:Slt" identifier (e.g. "252:0") into
// enclosure and slot numbers.
func parseRAIDSlot(slot string) (eid int, slt int, ok bool) {
parts := strings.SplitN(strings.TrimSpace(slot), ":", 2)
if len(parts) != 2 {
return 0, 0, false
}
eid, err1 := strconv.Atoi(strings.TrimSpace(parts[0]))
slt, err2 := strconv.Atoi(strings.TrimSpace(parts[1]))
if err1 != nil || err2 != nil {
return 0, 0, false
}
return eid, slt, true
}
func runRAIDPrepareDriveTask(ctx context.Context, j *jobState, ctrl int, slot string) error {
eid, slt, ok := parseRAIDSlot(slot)
if !ok {
return fmt.Errorf("invalid slot %q", slot)
}
j.append(fmt.Sprintf("Preparing drive %s on controller %d (set good, force)...", slot, ctrl))
cmd := exec.CommandContext(ctx, "storcli64",
fmt.Sprintf("/c%d/e%d/s%d", ctrl, eid, slt),
"set", "good", "force",
)
return streamCmdJob(j, cmd)
}
func runRAIDVROCCreateMirrorTask(ctx context.Context, j *jobState, devices []string, arrayName string) error {
if arrayName == "" {
arrayName = "bee-mirror0"
@@ -507,6 +612,7 @@ function raidRenderController(c, idx) {
html += '</div></div>';
}
html += raidRenderAllDrives(c, idx);
html += raidRenderMirrorSection(c, idx, 'lsi');
}
@@ -529,12 +635,71 @@ function raidRenderController(c, idx) {
html += '</table>';
}
html += raidRenderAllDrives(c, idx);
html += raidRenderMirrorSection(c, idx, 'vroc');
}
return html;
}
var RAID_READY_STATES = {'UGood': true, 'JBOD': true, 'available': true};
var RAID_NO_PREPARE_STATES = {'UGood': true, 'JBOD': true, 'Frgn': true, 'Onln': true, 'Msng': true};
function raidRenderAllDrives(c, idx) {
var drives = c.all_drives || [];
var isLSI = c.type === 'lsi';
if (drives.length === 0) {
return '<p style="font-size:13px;color:var(--muted);margin-bottom:12px">No drives detected on this controller.</p>';
}
var html = '<div style="font-size:12px;font-weight:600;color:var(--muted);margin-bottom:6px;text-transform:uppercase;letter-spacing:.04em">All Drives on This Controller</div>';
html += '<table style="margin-bottom:14px"><tr><th>' + (isLSI ? 'Slot' : 'Device') + '</th><th>Model</th><th>Size</th><th>State</th>' + (isLSI ? '<th></th>' : '') + '</tr>';
drives.forEach(function(d) {
var ready = !!RAID_READY_STATES[d.state];
var badgeClass = ready ? 'badge-ok' : 'badge-warn';
var actionCell = '';
if (isLSI && !RAID_NO_PREPARE_STATES[d.state]) {
actionCell = '<td><button class="btn btn-sm btn-secondary" onclick="raidPrepareDrive(\'' + escHtml(c.id) + '\',\'' + escHtml(d.slot) + '\',this)">Prepare</button></td>';
} else if (isLSI) {
actionCell = '<td></td>';
}
html += '<tr>'
+ '<td style="font-family:monospace">' + escHtml(isLSI ? d.slot : d.device) + '</td>'
+ '<td>' + escHtml(d.model||'—') + (d.serial ? ' [' + escHtml(d.serial) + ']' : '') + '</td>'
+ '<td>' + (d.size_gb > 0 ? Math.round(d.size_gb) + ' GB' : '—') + '</td>'
+ '<td><span class="badge ' + badgeClass + '">' + escHtml(d.state||'—') + '</span></td>'
+ actionCell
+ '</tr>';
});
html += '</table>';
return html;
}
function raidPrepareDrive(ctrlID, slot, btn) {
if (!confirm('Prepare drive ' + slot + ' on ' + ctrlID + ' for array creation?\n\nThis forces the drive into Unconfigured Good state. If it currently belongs to a virtual drive or holds data, that data will become inaccessible.')) {
return;
}
var original = btn ? btn.textContent : '';
if (btn) { btn.disabled = true; btn.textContent = 'Preparing...'; }
raidShowOutput('Prepare drive ' + slot, '', '');
fetch('/api/tools/raid/prepare-drive', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({controller_id: ctrlID, slot: slot})
})
.then(function(r) { return r.json(); })
.then(function(d) {
if (d.error) throw new Error(d.error);
raidStreamTask(d.task_id, 'Prepare drive ' + slot, function() {
if (btn) { btn.disabled = false; btn.textContent = original; }
raidLoad();
});
})
.catch(function(e) {
raidShowOutput('Error', 'failed', e.message);
if (btn) { btn.disabled = false; btn.textContent = original; }
});
}
function raidRenderMirrorSection(c, idx, kind) {
var free = c.free_drives || [];
var html = '<div style="font-size:12px;font-weight:600;color:var(--muted);margin-bottom:6px;text-transform:uppercase;letter-spacing:.04em">Create RAID 1 Mirror</div>';
@@ -683,6 +848,9 @@ function raidStreamTask(taskID, taskName, onDone) {
}
window.raidLoad = raidLoad;
window.raidForeignAction = raidForeignAction;
window.raidCreateMirror = raidCreateMirror;
window.raidPrepareDrive = raidPrepareDrive;
raidLoad();
})();
</script>`

View File

@@ -212,86 +212,3 @@ func runSAADMIWriteTask(ctx context.Context, j *jobState, exportDir string, p ta
return nil
}
func renderSAADMICard() string {
return `<div class="card"><div class="card-head card-head-actions">Supermicro &#8212; DMI<div class="card-head-buttons"><button class="btn btn-sm btn-secondary" onclick="saaDMIRead()">Read</button></div></div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Reads and edits DMI fields via SAA (In-Band).</p>
<div id="saa-dmi-status" style="font-size:13px;color:var(--muted);margin-bottom:8px"></div>
<div id="saa-dmi-table"></div>
</div></div>
<script>
var _dmiActBtnStyle = 'width:22px;height:22px;padding:0;font-size:13px;line-height:1;border:1px solid var(--line);border-radius:3px;background:var(--surface);cursor:pointer;vertical-align:middle;';
var _dmiInputStyle = 'width:100%;padding:3px 6px;border:1.5px solid #888;border-radius:3px;font-size:13px;font-family:monospace;background:var(--surface);color:var(--ink);';
function dmiEsc(s){return String(s==null?'':s).replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;');}
function saaDMIRead() {
var status = document.getElementById('saa-dmi-status');
status.textContent = 'Reading...'; status.style.color = 'var(--muted)';
document.getElementById('saa-dmi-table').innerHTML = '';
fetch('/api/tools/saa-dmi', {cache:'no-store'})
.then(function(r){return r.json().then(function(d){if(!r.ok)throw new Error(d.error||('HTTP '+r.status));return d;});})
.then(function(fields){
status.textContent = fields.length + ' field(s) loaded.';
var rows = fields.map(function(f){
var val = dmiEsc(f.value||'');
return '<tr>'
+ '<td style="font-size:13px;color:var(--muted);white-space:nowrap;padding-right:8px;vertical-align:middle">'+dmiEsc(f.name)+'</td>'
+ '<td style="font-family:monospace;font-size:12px;color:var(--muted);white-space:nowrap;padding-right:8px;vertical-align:middle">'+dmiEsc(f.shn)+'</td>'
+ '<td style="vertical-align:middle"><input class="dmi-inp" type="text" style="'+_dmiInputStyle+'"'
+ ' data-shn="'+dmiEsc(f.shn)+'" data-original="'+val+'" value="'+val+'" oninput="dmiChanged(this)"></td>'
+ '<td class="dmi-act" style="display:none;white-space:nowrap;padding-left:6px;vertical-align:middle">'
+ '<button style="'+_dmiActBtnStyle+'color:var(--ok-fg,green);margin-right:3px" title="Save" onclick="dmiSave(this)">✓</button>'
+ '<button style="'+_dmiActBtnStyle+'color:var(--crit-fg,#9f3a38)" title="Cancel" onclick="dmiCancel(this)">✗</button>'
+ '<span class="dmi-msg" style="font-size:11px;margin-left:5px;color:var(--muted)"></span>'
+ '</td></tr>';
}).join('');
document.getElementById('saa-dmi-table').innerHTML =
'<table style="width:100%;border-collapse:collapse">'
+ '<tr><th style="text-align:left;font-size:12px;color:var(--muted);padding-bottom:6px;font-weight:normal">Field</th>'
+ '<th style="text-align:left;font-size:12px;color:var(--muted);padding-bottom:6px;font-weight:normal">SHN</th>'
+ '<th style="text-align:left;font-size:12px;color:var(--muted);padding-bottom:6px;font-weight:normal">Value</th><th></th></tr>'
+ rows + '</table>';
})
.catch(function(e){ status.textContent='Error: '+e.message; status.style.color='var(--crit-fg,#9f3a38)'; });
}
function dmiChanged(inp) {
inp.closest('tr').querySelector('.dmi-act').style.display = inp.value !== inp.dataset.original ? '' : 'none';
}
function dmiCancel(btn) {
var row = btn.closest('tr');
var inp = row.querySelector('.dmi-inp');
inp.value = inp.dataset.original;
row.querySelector('.dmi-act').style.display = 'none';
row.querySelector('.dmi-msg').textContent = '';
}
function dmiSave(btn) {
var row = btn.closest('tr');
var inp = row.querySelector('.dmi-inp');
var msg = row.querySelector('.dmi-msg');
var cancelBtn = row.querySelectorAll('.dmi-act button')[1];
if(!window.confirm('Apply DMI change for '+inp.dataset.shn+'?\nServer will need to reboot for changes to take effect.'))return;
btn.disabled=true; cancelBtn.disabled=true;
msg.textContent='…'; msg.style.color='var(--muted)';
fetch('/api/tools/saa-dmi/write',{method:'POST',headers:{'Content-Type':'application/json'},
body:JSON.stringify({changes:[{shn:inp.dataset.shn,value:inp.value}]})})
.then(function(r){return r.json().then(function(d){if(!r.ok)throw new Error(d.error||('HTTP '+r.status));return d;});})
.then(function(d){
var poll=setInterval(function(){
fetch('/api/tasks',{cache:'no-store'}).then(function(r){return r.json();}).then(function(tasks){
var t=(tasks||[]).find(function(x){return x.id===d.task_id;});
if(!t)return;
if(t.status==='done'){
clearInterval(poll);
inp.dataset.original=inp.value;
row.querySelector('.dmi-act').style.display='none';
msg.textContent='Saved. Reboot to apply.'; msg.style.color='var(--ok-fg,green)';
} else if(t.status==='failed'||t.status==='cancelled'){
clearInterval(poll);
msg.textContent=t.error||t.status; msg.style.color='var(--crit-fg)';
btn.disabled=false; cancelBtn.disabled=false;
}
});
},1500);
})
.catch(function(e){msg.textContent='Error: '+e.message; msg.style.color='var(--crit-fg)'; btn.disabled=false; cancelBtn.disabled=false;});
}
</script>`
}

View File

@@ -318,9 +318,12 @@ func NewHandler(opts HandlerOptions) http.Handler {
mux.HandleFunc("POST /api/tools/saa-dmi/write", h.handleAPISAADMIWrite)
mux.HandleFunc("GET /api/tools/ipmi-fru", h.handleAPIIPMIFRURead)
mux.HandleFunc("POST /api/tools/ipmi-fru/write", h.handleAPIIPMIFRUWrite)
mux.HandleFunc("GET /api/tools/huawei-elabel", h.handleAPIHuaweiElabelRead)
mux.HandleFunc("POST /api/tools/huawei-elabel/write", h.handleAPIHuaweiElabelWrite)
mux.HandleFunc("GET /api/tools/raid/status", h.handleAPIRAIDStatus)
mux.HandleFunc("POST /api/tools/raid/foreign", h.handleAPIRAIDForeignAction)
mux.HandleFunc("POST /api/tools/raid/create-mirror", h.handleAPIRAIDCreateMirror)
mux.HandleFunc("POST /api/tools/raid/prepare-drive", h.handleAPIRAIDPrepareDrive)
// GPU presence / tools
mux.HandleFunc("GET /api/gpu/presence", h.handleAPIGPUPresence)

View File

@@ -1227,7 +1227,8 @@ func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
],
"services":[
{"name":"bee-web","status":"active"},
{"name":"bee-nvidia","status":"inactive"}
{"name":"bee-audit","status":"inactive"},
{"name":"bee-nvidia","status":"failed"}
]
}`
if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(health), 0644); err != nil {
@@ -1281,7 +1282,7 @@ func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
`Bee Services`,
`CUDA runtime is not ready for GPU SAT.`,
`Missing: nvidia-smi`,
`bee-nvidia=inactive`,
`bee-nvidia=failed`,
// Hardware Summary card — component health badges
`Hardware Summary`,
`>CPU<`,

View File

@@ -232,6 +232,9 @@ func renderTaskReportFragment(report taskReport, charts map[string]string, logTe
if powerCard := renderTaskPowerResultsCard(report.Target, logText); powerCard != "" {
b.WriteString(powerCard)
}
if report.Target == "storage" {
b.WriteString(renderStorageDiskReportCards(logText))
}
if len(report.Charts) > 0 {
for _, chart := range report.Charts {
@@ -369,3 +372,60 @@ func formatTaskDuration(sec int) string {
}
return fmt.Sprintf("%dh %02dm %02ds", sec/3600, (sec%3600)/60, sec%60)
}
// renderStorageDiskReportCards reads disk-*-report.txt files from the storage
// SAT run directory and renders one card per disk.
func renderStorageDiskReportCards(logText string) string {
runDir := taskStorageRunDirFromLog(logText)
if runDir == "" {
return ""
}
entries, err := os.ReadDir(runDir)
if err != nil {
return ""
}
var cards []string
for _, entry := range entries {
name := entry.Name()
if !strings.HasPrefix(name, "disk-") || !strings.HasSuffix(name, "-report.txt") {
continue
}
data, err := os.ReadFile(filepath.Join(runDir, name))
if err != nil || len(data) == 0 {
continue
}
// Extract disk label from filename: "disk-01-nvme0n1-report.txt" → "Disk 01 — nvme0n1"
stem := strings.TrimPrefix(strings.TrimSuffix(name, "-report.txt"), "disk-")
// stem is like "01-nvme0n1"
parts := strings.SplitN(stem, "-", 2)
title := "Disk " + stem
if len(parts) == 2 {
title = "Disk " + parts[0] + " — " + parts[1]
}
card := `<div class="card">` +
`<div class="card-head">` + html.EscapeString(title) + `</div>` +
`<div class="card-body" style="padding:0">` +
`<pre style="margin:0;padding:16px;font-size:12px;line-height:1.6;overflow-x:auto;white-space:pre">` +
html.EscapeString(string(data)) +
`</pre></div></div>`
cards = append(cards, card)
}
return strings.Join(cards, "\n")
}
// taskStorageRunDirFromLog finds the storage SAT run directory path logged as
// "Archive: /path/to/storage-YYYYMMDD-HHMMSS".
func taskStorageRunDirFromLog(logText string) string {
for _, line := range strings.Split(logText, "\n") {
line = strings.TrimSpace(line)
if !strings.HasPrefix(line, "Archive:") {
continue
}
path := strings.TrimSpace(strings.TrimPrefix(line, "Archive:"))
if strings.Contains(filepath.Base(path), "storage-") && !strings.HasSuffix(path, ".tar.gz") {
return path
}
}
return ""
}

View File

@@ -394,6 +394,12 @@ func executeTaskWithOptions(opts *HandlerOptions, t *Task, j *jobState, ctx cont
break
}
err = runIPMIFRUWriteTask(ctx, j, opts.ExportDir, t.params)
case "huawei-elabel-write":
if len(t.params.HuaweiElabelChanges) == 0 {
err = fmt.Errorf("no changes provided")
break
}
err = runHuaweiElabelWriteTask(ctx, j, t.params)
case "raid-foreign-clear":
err = runRAIDForeignClearTask(ctx, j, t.params.RAIDController)
case "raid-foreign-import":
@@ -404,6 +410,12 @@ func executeTaskWithOptions(opts *HandlerOptions, t *Task, j *jobState, ctx cont
break
}
err = runRAIDLSICreateMirrorTask(ctx, j, t.params.RAIDController, t.params.RAIDDevices)
case "raid-lsi-prepare-drive":
if strings.TrimSpace(t.params.RAIDSlot) == "" {
err = fmt.Errorf("no drive slot provided")
break
}
err = runRAIDPrepareDriveTask(ctx, j, t.params.RAIDController, t.params.RAIDSlot)
case "raid-vroc-create-mirror":
if len(t.params.RAIDDevices) < 2 {
err = fmt.Errorf("at least 2 devices required")

View File

@@ -140,11 +140,13 @@ type taskParams struct {
Device string `json:"device,omitempty"` // for install
LBAF int `json:"lbaf,omitempty"`
PlatformComponents []string `json:"platform_components,omitempty"`
SAADmiChanges []saaChange `json:"saa_dmi_changes,omitempty"`
FRUChanges []fruChange `json:"fru_changes,omitempty"`
SAADmiChanges []saaChange `json:"saa_dmi_changes,omitempty"`
FRUChanges []fruChange `json:"fru_changes,omitempty"`
HuaweiElabelChanges []huaweiChange `json:"huawei_elabel_changes,omitempty"`
RAIDController int `json:"raid_controller,omitempty"`
RAIDDevices []string `json:"raid_devices,omitempty"`
RAIDArrayName string `json:"raid_array_name,omitempty"`
RAIDSlot string `json:"raid_slot,omitempty"`
}
type persistedTask struct {

2
bible

Submodule bible updated: 1977730d93...d2600f1279

View File

@@ -13,6 +13,7 @@ Generic engineering rules live in `bible/rules/patterns/`.
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
| `docs/validate-vs-burn.md` | Validate and Validate -> Stress hardware test policy |
| `decisions/` | Architectural decision log, including read-only submodule policy |
| `proposals/` | RFCs and contract change proposals for Reanimator Core |
## Validate Test Matrix

View File

@@ -1,5 +1,103 @@
# Backlog
## Сбор SFP-модулей
**Статус:** не реализовано.
### Источник данных
`ethtool -m <iface>` / `ethtool --module-info <iface>` — читает EEPROM SFP/SFP+/QSFP28/QSFP-DD по стандарту MSA (SFF-8472 / SFF-8636).
Доступные поля из EEPROM:
- Идентификатор модуля: `Identifier` (SFP, SFP+, QSFP28, …)
- Тип коннектора: `Connector`
- Вендор: `Vendor name`, `Vendor OUI`, `Vendor PN`, `Vendor SN`, `Vendor rev`
- Оптика: `Wavelength`, `Transceiver type` (10GBase-SR, LR, DAC, …)
- Телеметрия DOM (если модуль поддерживает): `Laser tx bias current`, `Transmit avg optical power`, `Receive avg optical power`, `Module temperature`, `Module voltage`
- Статус: `Rx power high alarm`, `Tx power low warning`, …
Для QSFP28 данные повторяются на 4 канала (lane 03).
Инструмент требует root. На bee ISO — доступен (`ethtool` входит в образ).
### Scope для bee
1. Собирать список сетевых интерфейсов через `ip -j link show` (только `ether`, без `lo`/VLAN/bond).
2. Для каждого интерфейса пробовать `ethtool -m <iface>`. Если модуль отсутствует или не поддерживает EEPROM read — тихо пропускать.
3. Связывать интерфейс с PCIe-устройством через `ethtool -i <iface>` → поле `bus-info` (BDF) → сопоставление с `pcie_devices[].slot`.
### Gap в контракте
Текущий контракт v2.10 имеет в `pcie_devices[]` скалярные поля:
- `sfp_temperature_c`, `sfp_tx_power_dbm`, `sfp_rx_power_dbm`, `sfp_voltage_v`, `sfp_bias_ma`
Этого **недостаточно**:
- Одна NIC-карта может иметь несколько портов — нужен массив, а не скаляр.
- Нет полей идентификации модуля (vendor, part_number, serial_number, wavelength, connector).
- Нет разбивки по каналам для QSFP28.
### Предлагаемое расширение контракта
Добавить в `pcie_devices[]` массив `sfp_modules[]`:
```json
"pcie_devices": [
{
"slot": "0000:3b:00.0",
"device_class": "EthernetController",
"model": "ConnectX-6 Dx",
"sfp_modules": [
{
"port": 0,
"identifier": "QSFP28",
"connector": "LC",
"vendor": "Mellanox",
"part_number": "MFA1A00-C003",
"serial_number": "MT2124VS09999",
"revision": "A",
"wavelength_nm": 850,
"transceiver_type": "100GBase-SR4",
"temperature_c": 36.4,
"voltage_v": 3.29,
"tx_power_dbm": -1.8,
"rx_power_dbm": -2.1,
"bias_ma": 7.2
}
]
}
]
```
Поля `sfp_modules[]`:
| Поле | Тип | Описание |
|---|---|---|
| `port` | int | Номер порта на NIC (0-based) |
| `identifier` | string | `SFP`, `SFP+`, `QSFP28`, `QSFP-DD`, … |
| `connector` | string | `LC`, `MPO`, `DAC`, … |
| `vendor` | string | Производитель модуля |
| `part_number` | string | Партномер |
| `serial_number` | string | Серийный номер |
| `revision` | string | Ревизия |
| `wavelength_nm` | int | Длина волны, нм |
| `transceiver_type` | string | `10GBase-SR`, `100GBase-SR4`, `DAC`, … |
| `temperature_c` | float | Температура модуля, °C |
| `voltage_v` | float | Напряжение, В |
| `tx_power_dbm` | float | TX оптическая мощность, dBm |
| `rx_power_dbm` | float | RX оптическая мощность, dBm |
| `bias_ma` | float | Bias current, мА |
Старые скалярные поля `sfp_temperature_c` / `sfp_tx_power_dbm` / `sfp_rx_power_dbm` / `sfp_voltage_v` / `sfp_bias_ma` на уровне `pcie_devices[]`**вывести из контракта** (deprecated), заменить на `sfp_modules[]`.
### Порядок реализации
1. Согласовать расширение контракта с Reanimator Core (bump до v2.11).
2. Добавить `ethtool` parser в `audit/internal/collector/` — новый файл `sfp.go`.
3. Дополнить schema в `audit/internal/schema/` типом `SFPModule`.
4. Добавить `sfp_modules` в `PCIeDevice` в schema.
5. Заполнять в NIC-коллекторе: связь интерфейс → BDF → `pcie_devices[].sfp_modules`.
6. Показывать в TUI и web UI в разделе PCIe/NIC.
## BMC версия через IPMI
**Статус:** реализовано.

View File

@@ -1,7 +1,7 @@
---
title: Hardware Ingest JSON Contract
version: "2.10"
updated: "2026-04-29"
version: "2.11"
updated: "2026-06-19"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
language: ru
@@ -9,7 +9,7 @@ language: ru
# Интеграция с Reanimator: контракт JSON-импорта аппаратного обеспечения
Версия: **2.10** · Дата: **2026-04-29**
Версия: **2.11** · Дата: **2026-06-19**
Документ описывает формат JSON для передачи данных об аппаратном обеспечении серверов в систему **Reanimator** (управление жизненным циклом аппаратного обеспечения).
Предназначен для разработчиков смежных систем (Redfish-коллекторов, агентов мониторинга, CMDB-экспортёров) и может быть включён в документацию интегрируемых проектов.
@@ -22,6 +22,7 @@ language: ru
| Версия | Дата | Изменения |
|--------|------|-----------|
| 2.11 | 2026-06-19 | В `pcie_devices[]` добавлен необязательный массив `sfp_modules[]` с идентификацией и DOM telemetry SFP/QSFP-модулей. Скалярные поля `sfp_temperature_c` / `sfp_tx_power_dbm` / `sfp_rx_power_dbm` / `sfp_voltage_v` / `sfp_bias_ma` помечены как deprecated (принимаются, но `sfp_modules[]` имеет приоритет) |
| 2.10 | 2026-04-29 | Для `hardware.storage[]` добавлены необязательные числовые поля `logical_block_size_bytes`, `physical_block_size_bytes`, `metadata_bytes_per_block` для нормализованного описания формата блока накопителя |
| 2.9 | 2026-03-19 | Добавлена необязательная секция `hardware.platform_config` — произвольный объект с настройками платформы (BIOS/Redfish); хранится как latest-snapshot per machine |
| 2.8 | 2026-03-15 | Поле `location` удалено из всех `sensors.*`; сенсоры передаются только по `name` и измеренным значениям |
@@ -422,11 +423,12 @@ GET /ingest/hardware/jobs/{job_id}
| `battery_temperature_c` | float | нет | Температура батареи / supercap, °C |
| `battery_voltage_v` | float | нет | Напряжение батареи / supercap, В |
| `battery_replace_required` | bool | нет | Требуется замена батареи / supercap |
| `sfp_temperature_c` | float | нет | Температура SFP/optic, °C |
| `sfp_tx_power_dbm` | float | нет | TX optical power, dBm |
| `sfp_rx_power_dbm` | float | нет | RX optical power, dBm |
| `sfp_voltage_v` | float | нет | Напряжение SFP, В |
| `sfp_bias_ma` | float | нет | Bias current SFP, мА |
| `sfp_temperature_c` | float | нет | Температура SFP/optic, °C *(deprecated since 2.11)* |
| `sfp_tx_power_dbm` | float | нет | TX optical power, dBm *(deprecated since 2.11)* |
| `sfp_rx_power_dbm` | float | нет | RX optical power, dBm *(deprecated since 2.11)* |
| `sfp_voltage_v` | float | нет | Напряжение SFP, В *(deprecated since 2.11)* |
| `sfp_bias_ma` | float | нет | Bias current SFP, мА *(deprecated since 2.11)* |
| `sfp_modules` | array | нет | Установленные SFP/QSFP-модули по портам (см. sfp_modules[]) |
| `bdf` | string | нет | Deprecated alias для `slot`; при наличии ingest нормализует его в `slot` |
| `device_class` | string | нет | Класс устройства (см. список ниже) |
| `manufacturer` | string | нет | Производитель |
@@ -444,10 +446,43 @@ GET /ingest/hardware/jobs/{job_id}
`numa_node` передавайте для NIC / InfiniBand / RAID / GPU, когда источник знает CPU/NUMA affinity. Поле сохраняется в snapshot-атрибутах PCIe-компонента и дублируется в telemetry для topology use cases.
Поля `temperature_c` и `power_w` используйте для device-level telemetry GPU / accelerator / smart PCIe devices. Они не влияют на идентификацию компонента.
**Deprecated поля sfp_\*:** Скалярные поля `sfp_temperature_c`, `sfp_tx_power_dbm`, `sfp_rx_power_dbm`, `sfp_voltage_v`, `sfp_bias_ma` продолжают приниматься, но помечены как deprecated since 2.11. Если в payload одновременно присутствуют `sfp_modules[]` и deprecated sfp_-скаляры — приоритет у `sfp_modules[]`, скаляры игнорируются. Deprecated поля будут удалены в версии 3.0.
**Генерация serial_number при отсутствии или `"N/A"`:** `{board_serial}-PCIE-{slot}`, где `slot` для PCIe равен BDF.
`slot` — единственный канонический адрес компонента. Для PCIe в `slot` передавайте BDF. Поле `bdf` сохраняется только как переходный alias на входе и не должно использоваться как отдельная координата рядом со `slot`.
#### pcie_devices[].sfp_modules[]
Необязательный массив установленных SFP/QSFP-модулей для данного PCIe-устройства. Один элемент — один порт. Используйте для многопортовых NIC (ConnectX-6 Dx, Intel X710, Mellanox HDR и др.).
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `port` | int | **да** | Номер порта на NIC (0-based). Ключ дедупликации внутри устройства |
| `identifier` | string | нет | Тип модуля: `SFP`, `SFP+`, `SFP28`, `QSFP+`, `QSFP28`, `QSFP-DD`, `DAC` |
| `connector` | string | нет | Тип разъёма: `LC`, `MPO`, `RJ45`, `DAC`, `AOC`, `No separable connector` |
| `vendor` | string | нет | Производитель модуля из EEPROM |
| `part_number` | string | нет | Партномер из EEPROM |
| `serial_number` | string | нет | Серийный номер из EEPROM |
| `revision` | string | нет | Ревизия из EEPROM |
| `wavelength_nm` | int | нет | Длина волны, нм (0 для DAC/медных кабелей) |
| `transceiver_type` | string | нет | `10GBase-SR`, `10GBase-LR`, `25GBase-SR`, `100GBase-SR4`, `DAC`, … |
| `temperature_c` | float | нет | Температура модуля, °C (DOM telemetry) |
| `voltage_v` | float | нет | Напряжение питания, В (DOM telemetry) |
| `tx_power_dbm` | float | нет | TX оптическая мощность, dBm (DOM telemetry) |
| `rx_power_dbm` | float | нет | RX оптическая мощность, dBm (DOM telemetry) |
| `bias_ma` | float | нет | Bias current, мА (DOM telemetry) |
**Ключ дедупликации:** `(pcie_devices[].slot, sfp_modules[].port)`.
**Правила ingest:**
- При каждом импорте — полная замена `sfp_modules[]` для данного `pcie_devices[].slot` (upsert всего массива целиком).
- Если `sfp_modules` отсутствует или `null` — существующие данные SFP не трогать.
- Если `sfp_modules: []` (пустой массив) — трактовать как «модули не обнаружены», очистить сохранённые данные.
- Дубли по `port` внутри одного `pcie_devices[]` — невалидны, endpoint возвращает `400` с описанием поля.
- Модули без `serial_number` допустимы (многие DAC-кабели не имеют SN); сохраняются по ключу `(slot, port)`.
- Изменение `serial_number` или `part_number` модуля на порту создаёт событие `COMPONENT_CHANGED` для PCIe-устройства с описанием «SFP module replaced on port N».
**Значения `device_class`:**
| Значение | Назначение |
@@ -472,16 +507,47 @@ GET /ingest/hardware/jobs/{job_id}
"numa_node": 0,
"temperature_c": 48.5,
"power_w": 18.2,
"sfp_temperature_c": 36.2,
"sfp_tx_power_dbm": -1.8,
"sfp_rx_power_dbm": -2.1,
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"firmware": "9.20 0x8000d4ae",
"manufacturer": "Mellanox",
"model": "ConnectX-6 Dx",
"serial_number": "MT2012X12345",
"firmware": "22.35.2010",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
"status": "OK",
"sfp_modules": [
{
"port": 0,
"identifier": "QSFP28",
"connector": "LC",
"vendor": "Mellanox",
"part_number": "MFA1A00-C003",
"serial_number": "MT2124VS09999",
"revision": "A",
"wavelength_nm": 850,
"transceiver_type": "100GBase-SR4",
"temperature_c": 36.4,
"voltage_v": 3.29,
"tx_power_dbm": -1.8,
"rx_power_dbm": -2.1,
"bias_ma": 7.2
},
{
"port": 1,
"identifier": "QSFP28",
"connector": "LC",
"vendor": "Mellanox",
"part_number": "MFA1A00-C003",
"serial_number": "MT2124VS09998",
"revision": "A",
"wavelength_nm": 850,
"transceiver_type": "100GBase-SR4",
"temperature_c": 35.9,
"voltage_v": 3.28,
"tx_power_dbm": -1.9,
"rx_power_dbm": -2.3,
"bias_ma": 7.1
}
]
}
]
```
@@ -793,7 +859,24 @@ PSU без `serial_number` игнорируется.
"model": "X710 10GbE",
"serial_number": "K65472-003",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
"status": "OK",
"sfp_modules": [
{
"port": 0,
"identifier": "SFP+",
"connector": "LC",
"vendor": "Intel",
"part_number": "FTLX8574D3BCV-IT",
"serial_number": "FNS123456789",
"wavelength_nm": 850,
"transceiver_type": "10GBase-SR",
"temperature_c": 34.1,
"voltage_v": 3.30,
"tx_power_dbm": -2.5,
"rx_power_dbm": -3.0,
"bias_ma": 6.8
}
]
}
],
"power_supplies": [

View File

@@ -0,0 +1,201 @@
# RFC: SFP Module Identity and Telemetry in Hardware Ingest Contract
**Date:** 2026-06-19
**Status:** proposal
**Target:** Reanimator Core — ingest schema, validation, and storage
**Requested by:** bee (hardware audit LiveCD)
**Contract version affected:** 2.10 → 2.11
---
## Задача для агента
Внести изменения в контракт ingest и реализовать их в Reanimator Core:
1. Добавить массив `sfp_modules[]` внутри `pcie_devices[]`.
2. Обновить JSON-схему и валидатор ingest endpoint.
3. Реализовать хранение/upsert данных SFP-модулей.
4. Пометить скалярные поля `sfp_*` на уровне `pcie_devices[]` как deprecated.
5. Обновить документ контракта (`hardware-ingest-contract.md`) до версии 2.11.
Ниже — всё необходимое для принятия решений без дополнительных вопросов.
---
## Контекст: откуда берутся данные
**Источник в bee:** `ethtool -m <iface>` (читает EEPROM SFP/SFP+/QSFP28/QSFP-DD по стандарту MSA SFF-8472 / SFF-8636).
**Связь с PCIe:** `ethtool -i <iface>` возвращает `bus-info` = BDF (`0000:3b:00.0`), который совпадает с `pcie_devices[].slot`. Так bee связывает SFP-данные конкретного интерфейса с PCIe-устройством.
**Один NIC — несколько модулей:** карта ConnectX-6 Dx (2 порта), Intel X710 (4 порта), Mellanox HDR (2 порта). Каждый порт — отдельный `ethtool -m`, отдельный SFP-модуль. Одного скаляра на устройство недостаточно.
**QSFP28/QSFP-DD:** 4-канальные модули возвращают telemetry отдельно по каждому каналу (lane). В предложенной схеме lane-уровень не включён в первую версию — только агрегированные значения модуля в целом. Расширение до lane-уровня — отдельный RFC если понадобится.
---
## Проблема с текущим контрактом v2.10
В `pcie_devices[]` есть пять скалярных полей:
```
sfp_temperature_c float
sfp_tx_power_dbm float
sfp_rx_power_dbm float
sfp_voltage_v float
sfp_bias_ma float
```
Ограничения:
- **Нет идентификации модуля** — vendor, part_number, serial_number, wavelength отсутствуют; модуль нельзя инвентаризировать как самостоятельный компонент.
- **Только один набор значений на устройство** — невозможно описать 4-портовый NIC.
- **Нет типа модуля** — SFP, QSFP28, DAC-кабель не различаются.
- **Нет connector/transceiver_type** — невозможно понять, оптика это или медь.
---
## Предлагаемое изменение схемы
### Новая структура `sfp_modules[]`
Добавляется как необязательное поле внутри каждого объекта `pcie_devices[]`.
```json
"pcie_devices": [
{
"slot": "0000:3b:00.0",
"device_class": "EthernetController",
"model": "ConnectX-6 Dx",
"manufacturer": "Mellanox",
"serial_number": "MT2012X12345",
"status": "OK",
"sfp_modules": [
{
"port": 0,
"identifier": "QSFP28",
"connector": "LC",
"vendor": "Mellanox",
"part_number": "MFA1A00-C003",
"serial_number": "MT2124VS09999",
"revision": "A",
"wavelength_nm": 850,
"transceiver_type": "100GBase-SR4",
"temperature_c": 36.4,
"voltage_v": 3.29,
"tx_power_dbm": -1.8,
"rx_power_dbm": -2.1,
"bias_ma": 7.2
},
{
"port": 1,
"identifier": "QSFP28",
"connector": "LC",
"vendor": "Mellanox",
"part_number": "MFA1A00-C003",
"serial_number": "MT2124VS09998",
"revision": "A",
"wavelength_nm": 850,
"transceiver_type": "100GBase-SR4",
"temperature_c": 35.9,
"voltage_v": 3.28,
"tx_power_dbm": -1.9,
"rx_power_dbm": -2.3,
"bias_ma": 7.1
}
]
}
]
```
### Поля `sfp_modules[]`
| Поле | Тип | Обязательно | Описание |
|---|---|---|---|
| `port` | int | **да** | Номер порта на NIC (0-based). Ключ дедупликации внутри устройства. |
| `identifier` | string | нет | Тип модуля: `SFP`, `SFP+`, `SFP28`, `QSFP+`, `QSFP28`, `QSFP-DD`, `DAC` |
| `connector` | string | нет | Тип разъёма: `LC`, `MPO`, `RJ45`, `DAC`, `AOC`, `No separable connector` |
| `vendor` | string | нет | Производитель модуля из EEPROM |
| `part_number` | string | нет | Партномер из EEPROM |
| `serial_number` | string | нет | Серийный номер из EEPROM |
| `revision` | string | нет | Ревизия из EEPROM |
| `wavelength_nm` | int | нет | Длина волны, нм (0 для DAC/медных кабелей) |
| `transceiver_type` | string | нет | `10GBase-SR`, `10GBase-LR`, `25GBase-SR`, `100GBase-SR4`, `DAC`, … |
| `temperature_c` | float | нет | Температура модуля, °C (DOM telemetry) |
| `voltage_v` | float | нет | Напряжение питания, В (DOM telemetry) |
| `tx_power_dbm` | float | нет | TX оптическая мощность, dBm (DOM telemetry) |
| `rx_power_dbm` | float | нет | RX оптическая мощность, dBm (DOM telemetry) |
| `bias_ma` | float | нет | Bias current, мА (DOM telemetry) |
**Ключ дедупликации:** `(pcie_device.slot, sfp_modules[].port)`.
**Модули без серийного номера** — допустимы; многие DAC-кабели не имеют SN. Не игнорировать, сохранять по ключу `(slot, port)`.
---
## Deprecated поля
Следующие поля на уровне `pcie_devices[]` помечаются как **deprecated** начиная с v2.11:
```
sfp_temperature_c
sfp_tx_power_dbm
sfp_rx_power_dbm
sfp_voltage_v
sfp_bias_ma
```
**Поведение при получении deprecated полей:**
- Продолжать принимать и сохранять (не ломать существующих интеграторов).
- Если одновременно присутствуют `sfp_modules[]` и deprecated скаляры — приоритет у `sfp_modules[]`; скаляры игнорируются.
- В документации пометить как `deprecated since 2.11, will be removed in 3.0`.
**Не удалять** deprecated поля из валидации в этом PR — только пометить в документации и changelog.
---
## Правила ingest для `sfp_modules[]`
- `sfp_modules[]` хранится как snapshot-атрибут PCIe-компонента (аналогично `mac_addresses`).
- При каждом импорте — полная замена `sfp_modules[]` для данного `pcie_devices[].slot` (upsert всего массива целиком, не merge по портам).
- Если `sfp_modules` отсутствует или `null` — существующие данные SFP не трогать (не затирать).
- Если `sfp_modules: []` (пустой массив) — трактовать как «модули не обнаружены», очистить сохранённые данные.
- Изменение `serial_number` или `part_number` модуля на порту — создавать событие `COMPONENT_CHANGED` для PCIe-устройства с описанием «SFP module replaced on port N».
---
## Изменения в документе контракта
Файл: `bible-local/docs/hardware-ingest-contract.md`
1. Заголовок версии: `2.10``2.11`, дата → `2026-06-19`.
2. Добавить в changelog:
```
| 2.11 | 2026-06-19 | В `pcie_devices[]` добавлен необязательный массив `sfp_modules[]`
с идентификацией и DOM telemetry SFP/QSFP-модулей. Скалярные поля
sfp_temperature_c / sfp_tx_power_dbm / sfp_rx_power_dbm / sfp_voltage_v /
sfp_bias_ma помечены как deprecated (принимаются, но sfp_modules[] имеет приоритет). |
```
3. В секции `pcie_devices` добавить строку в таблицу полей:
```
| `sfp_modules` | array | нет | Установленные SFP/QSFP-модули по портам (см. sfp_modules[]) |
```
4. Добавить подсекцию `#### pcie_devices[].sfp_modules[]` с таблицей полей и примером JSON (из раздела выше).
5. Пометить deprecated поля в таблице: добавить суффикс `*(deprecated since 2.11)*`.
6. Обновить полный пример JSON — добавить `sfp_modules` к NIC-записи в `pcie_devices`.
---
## Что не нужно делать в этом PR
- Не добавлять lane-level данные QSFP (tx_power_dbm_lane_0 и т.п.) — отдельный RFC.
- Не удалять deprecated поля — только пометить.
- Не создавать отдельную top-level секцию `network_ports` — данные остаются вложенными в `pcie_devices`.
- Не менять логику идентификации PCIe-компонента — `serial_number` SFP-модуля не является ключом для самостоятельного компонента.
---
## Валидация
Единственное обязательное поле в `sfp_modules[]` — `port` (int, >= 0).
Все остальные поля опциональны.
Дубли по `port` внутри одного `pcie_devices[]` — невалидны, возвращать `400` с описанием поля.

View File

@@ -1419,6 +1419,13 @@ rm -rf \
if [ "$BEE_GPU_VENDOR" != "nvidia" ]; then
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nvidia-load"
rm -f "${OVERLAY_STAGE_DIR}/etc/systemd/system/bee-nvidia.service"
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-burn"
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-john-gpu-stress"
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nccl-gpu-stress"
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nvidia-recover"
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-dcgmproftester-staggered"
rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-check-nvswitch"
rm -rf "${OVERLAY_STAGE_DIR}/etc/systemd/system/nvidia-fabricmanager.service.d"
fi
# --- inject authorized_keys for SSH access ---

View File

@@ -67,7 +67,8 @@ if ! mkdir "${LOCK_DIR}" 2>/dev/null; then
fi
trap 'rmdir "${LOCK_DIR}" >/dev/null 2>&1 || true' EXIT
if have_nvidia_gpu && [ ! -e /dev/nvidia0 ]; then
GPU_VENDOR=$(cat /etc/bee-gpu-vendor 2>/dev/null || echo "")
if [ "$GPU_VENDOR" = "nvidia" ] && have_nvidia_gpu && [ ! -e /dev/nvidia0 ]; then
log_event "NVIDIA GPU detected but /dev/nvidia0 is missing"
restart_service bee-nvidia.service || true
fi