Bring codebase into compliance with bible contracts (A–E)

A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to HardwareSnapshot.

B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go; replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice, isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate) with numeric vendor_id comparisons.

C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go, app_network.go, app_services.go, app_packs.go, app_install.go; no logic changes.

D (go-code-style): wrap bare return err in interfaceAdminState and interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including the interface name.

E (go-project-bible): add bible-local/architecture/data-model.md and bible-local/architecture/api-surface.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

185
bible-local/architecture/api-surface.md
Normal file
@@ -0,0 +1,185 @@
# API Surface

HTTP endpoints exposed by `bee web` (binds `0.0.0.0:80`).
Handler registration: `audit/internal/webui/server.go` → `NewHandler()`.

---

## Health & readiness

| Method | Path         | Description |
|--------|--------------|-------------|
| GET    | `/healthz`   | Always 200. Used by load balancers / boot scripts. |
| GET    | `/api/ready` | 200 when audit JSON exists and is readable. |
| GET    | `/loading`   | HTML loading page shown before first audit. |

---

## Audit

| Method | Path                     | Description |
|--------|--------------------------|-------------|
| GET    | `/audit.json`            | Latest audit JSON with SAT overlay applied. |
| GET    | `/runtime-health.json`   | Latest runtime preflight JSON. |
| POST   | `/api/audit/run`         | Enqueue a full `bee audit` run. Returns task ID. |
| GET    | `/api/audit/stream`      | SSE: audit run log lines (`data:` + newline per line). |
| GET    | `/api/preflight`         | Run runtime preflight check (synchronous, returns JSON). |
| GET    | `/api/hardware-summary`  | Hardware health summary (status counts + failures). |
| GET    | `/api/components/{type}` | HTML fragment for component detail dialog (e.g. `cpu`, `memory`, `storage`, `pcie`). |

---

## SAT (System Acceptance Testing)

All SAT run endpoints enqueue an async task. Response: `{"task_id": "..."}`.

| Method | Path                                  | Description |
|--------|---------------------------------------|-------------|
| POST   | `/api/sat/nvidia/run`                 | NVIDIA DCGM SAT |
| POST   | `/api/sat/nvidia-targeted-stress/run` | NVIDIA targeted stress validate |
| POST   | `/api/sat/nvidia-compute/run`         | NVIDIA max compute load |
| POST   | `/api/sat/nvidia-targeted-power/run`  | NVIDIA targeted power |
| POST   | `/api/sat/nvidia-pulse/run`           | NVIDIA pulse test |
| POST   | `/api/sat/nvidia-interconnect/run`    | NCCL all_reduce_perf |
| POST   | `/api/sat/nvidia-bandwidth/run`       | NVBandwidth test |
| POST   | `/api/sat/nvidia-stress/run`          | NVIDIA stress pack |
| POST   | `/api/sat/memory/run`                 | Memory acceptance |
| POST   | `/api/sat/storage/run`                | Storage acceptance (smartctl) |
| POST   | `/api/sat/cpu/run`                    | CPU acceptance (stress-ng) |
| POST   | `/api/sat/amd/run`                    | AMD GPU SAT (ROCm) |
| POST   | `/api/sat/amd-mem/run`                | AMD memory integrity + bandwidth |
| POST   | `/api/sat/amd-bandwidth/run`          | AMD memory bandwidth |
| POST   | `/api/sat/amd-stress/run`             | AMD GPU stress |
| POST   | `/api/sat/memory-stress/run`          | Memory stress |
| POST   | `/api/sat/sat-stress/run`             | Combined storage+memory stress |
| POST   | `/api/sat/platform-stress/run`        | Fan + thermal stress |
| GET    | `/api/sat/stream`                     | SSE: live SAT log stream |
| POST   | `/api/sat/abort`                      | Abort the running SAT task |

---

## Benchmarks

| Method | Path                                    | Description |
|--------|-----------------------------------------|-------------|
| POST   | `/api/bee-bench/nvidia/perf/run`        | NVIDIA performance benchmark |
| POST   | `/api/bee-bench/nvidia/power/run`       | NVIDIA power benchmark |
| POST   | `/api/bee-bench/nvidia/autotune/run`    | Power source autotune (prerequisite for benchmarks) |
| GET    | `/api/bee-bench/nvidia/autotune/status` | Current autotune result / status |
| GET    | `/api/benchmark/results`                | List completed benchmark result archives |

---

## Tasks (async job queue)

| Method | Path                       | Description |
|--------|----------------------------|-------------|
| GET    | `/api/tasks`               | List all tasks with status |
| POST   | `/api/tasks/cancel-all`    | Cancel all pending/running tasks |
| POST   | `/api/tasks/kill-workers`  | Force-kill worker goroutines |
| POST   | `/api/tasks/{id}/cancel`   | Cancel a specific task |
| POST   | `/api/tasks/{id}/priority` | Elevate task priority |
| GET    | `/api/tasks/{id}/stream`   | SSE: live log stream for a task |
| GET    | `/api/tasks/{id}/charts`   | List chart names for a task |
| GET    | `/api/tasks/{id}/chart/`   | SVG chart for a task result |
| GET    | `/tasks/{id}`              | HTML task detail page |

---

## Services

| Method | Path                   | Description |
|--------|------------------------|-------------|
| GET    | `/api/services`        | List bee-* systemd services and their states |
| POST   | `/api/services/action` | start/stop/restart a service |

---

## Network

| Method | Path                    | Description |
|--------|-------------------------|-------------|
| GET    | `/api/network`          | List interfaces with state and IPv4 addresses |
| POST   | `/api/network/dhcp`     | Run dhclient on one or all interfaces |
| POST   | `/api/network/static`   | Set static IPv4 address |
| POST   | `/api/network/toggle`   | Bring interface up or down |
| POST   | `/api/network/confirm`  | Confirm pending network change (clears rollback) |
| POST   | `/api/network/rollback` | Restore pre-change network snapshot |

---

## Export

| Method | Path                     | Description |
|--------|--------------------------|-------------|
| GET    | `/export/support.tar.gz` | Download support bundle (live-generated) |
| GET    | `/export/file`           | Download a file from the export dir by path param |
| GET    | `/export/`               | Browse export dir (HTML index) |
| GET    | `/api/export/list`       | JSON list of files in export dir |
| GET    | `/api/export/usb`        | List removable USB targets available for export |

---

## GPU

| Method | Path                     | Description |
|--------|--------------------------|-------------|
| GET    | `/api/gpu/presence`      | `{"nvidia": bool, "amd": bool}` |
| GET    | `/api/gpu/nvidia`        | List NVIDIA GPUs from nvidia-smi |
| GET    | `/api/gpu/nvidia-status` | Per-GPU status (ECC, power, throttle) |
| POST   | `/api/gpu/nvidia-reset`  | GPU reset by index |
| GET    | `/api/gpu/tools`         | nvidia-smi / rocm-smi tool availability |

---

## System

| Method | Path                         | Description |
|--------|------------------------------|-------------|
| GET    | `/api/system/ram-status`     | toram boot state and ISO copy status |
| POST   | `/api/system/install-to-ram` | Copy ISO to RAM (background task) |
| GET    | `/api/install/disks`         | List block devices suitable for disk installation |
| POST   | `/api/install/run`           | Install bee to disk (background task) |

---

## Tools & NVMe

| Method | Path                         | Description |
|--------|------------------------------|-------------|
| GET    | `/api/tools/check`           | Check availability of required CLI tools |
| GET    | `/api/tools/nvme-formats`    | List NVMe format options for a device |
| POST   | `/api/tools/nvme-format/run` | Run nvme-format on a device |

---

## Live metrics

| Method | Path                      | Description |
|--------|---------------------------|-------------|
| GET    | `/api/metrics/stream`     | SSE: live metrics (GPU power, temp, utilization) |
| GET    | `/api/metrics/latest`     | Latest metrics snapshot (JSON) |
| GET    | `/api/metrics/chart/`     | SVG chart for a metric over time |
| GET    | `/api/metrics/export.csv` | Download metrics history as CSV |

---

## Blackbox logging

| Method | Path                    | Description |
|--------|-------------------------|-------------|
| GET    | `/api/blackbox/status`  | Blackbox log state (enabled, size, path) |
| POST   | `/api/blackbox/enable`  | Start recording blackbox log |
| POST   | `/api/blackbox/disable` | Stop recording, flush to disk |

---

## UI pages

| Method | Path      | Description |
|--------|-----------|-------------|
| GET    | `/`       | Main dashboard (serves all page routes) |
| GET    | `/viewer` | Standalone JSON viewer for uploaded audit files |

All pages are rendered server-side as HTML. The `/` route handles sub-paths such as
`/network`, `/services`, `/sat`, `/benchmark`, `/install`, `/validate`, `/export`.

137
bible-local/architecture/data-model.md
Normal file
@@ -0,0 +1,137 @@
# Data Model

The canonical output of `bee audit` is a `HardwareIngestRequest` JSON document accepted
by the Reanimator `/api/ingest/hardware` endpoint. The ingest endpoint uses a strict
decoder: unknown fields cause `400 Bad Request`.

Source of truth: `audit/internal/schema/hardware.go`

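
To see why a stray field matters, here is a minimal sketch of how a strict decoder behaves in Go. It assumes the ingest side uses something like `encoding/json` with `DisallowUnknownFields`; the actual Reanimator implementation is not shown in this commit. `ingestRequest` and `strictDecode` are hypothetical, trimmed stand-ins. `hardware` is kept as `json.RawMessage` to keep the sketch short, so nested fields are not validated here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ingestRequest is a trimmed, hypothetical stand-in for HardwareIngestRequest.
type ingestRequest struct {
	CollectedAt string          `json:"collected_at"`
	Hardware    json.RawMessage `json:"hardware"`
}

// strictDecode rejects any top-level key the struct does not declare.
func strictDecode(body []byte) (ingestRequest, error) {
	var req ingestRequest
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return ingestRequest{}, fmt.Errorf("decode ingest request: %w", err)
	}
	return req, nil
}

func main() {
	valid := []byte(`{"collected_at":"2024-01-01T00:00:00Z","hardware":{}}`)
	extra := []byte(`{"collected_at":"2024-01-01T00:00:00Z","hardware":{},"location":"rear"}`)

	_, err := strictDecode(valid)
	fmt.Println("valid accepted:", err == nil)

	_, err = strictDecode(extra)
	fmt.Println("extra field rejected:", err != nil)
}
```
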
---

## Top-level: HardwareIngestRequest

```
HardwareIngestRequest
├── collected_at    string            RFC3339 UTC timestamp of collection
├── hardware        HardwareSnapshot
├── runtime         RuntimeHealth?    from bee-runtime-preflight service
├── filename        string?
├── source_type     string?
├── protocol        string?
└── target_host     string?
```

`collected_at` is the primary sort key used by Reanimator to deduplicate ingests.

---

## HardwareSnapshot

All component arrays are `omitempty`: they are absent when the collector finds nothing.

| JSON key          | Go type                  | Source                      |
|-------------------|--------------------------|-----------------------------|
| `board`           | HardwareBoard            | dmidecode type 1/2          |
| `firmware`        | []HardwareFirmwareRecord | dmidecode type 0/13         |
| `cpus`            | []HardwareCPU            | dmidecode type 4            |
| `memory`          | []HardwareMemory         | dmidecode type 17           |
| `storage`         | []HardwareStorage        | lsblk + nvme-cli + smartctl |
| `pcie_devices`    | []HardwarePCIeDevice     | lspci                       |
| `power_supplies`  | []HardwarePowerSupply    | ipmitool fru + sdr          |
| `sensors`         | *HardwareSensors         | sensors -j                  |
| `event_logs`      | []HardwareEventLog       | ipmitool sel + journald     |
| `platform_config` | *json.RawMessage         | reserved, nil until used    |
| `vroc_license`    | *string                  | vroc-cli                    |

---

## Identity keys

Reanimator uses these fields to match components across successive audits:

| Component   | Identity key |
|-------------|--------------|
| Board       | `board.serial_number` (required, never empty) |
| CPU         | `serial_number` if present; else generated key |
| Memory DIMM | `serial_number`; absent DIMMs have `present: false` |
| Storage     | `serial_number` if present; else `linux_device` from Telemetry |
| PCIe device | `bdf` (Bus:Device.Function address) |
| PSU         | `slot` |

Components without a stable identity are still emitted but may not be matched across runs.

---

## HardwareComponentStatus (embedded in all components)

```go
type HardwareComponentStatus struct {
	Status           *string `json:"status,omitempty"` // OK | Warning | Critical | Unknown
	ErrorDescription *string `json:"error_description,omitempty"`
}
```

Status is set by collectors and overwritten at render time by `ApplySATOverlay`
(latest SAT run results are always merged on top before display).

---

## HardwarePCIeDevice

The most enriched component type. Key fields:

| JSON key                | Meaning |
|-------------------------|---------|
| `bdf`                   | PCI address (identity key), e.g. `0000:4b:00.0` |
| `vendor_id`             | Numeric PCI vendor ID (hex). Use this for classification, not `manufacturer`. |
| `device_id`             | Numeric PCI device ID (hex) |
| `device_class`          | Human-readable class, e.g. `VideoController` |
| `manufacturer`          | String label from lspci, for display only |
| `model`                 | Display name from nvidia-smi / rocm-smi |
| `link_speed`            | Current PCIe link speed, e.g. `Gen4` |
| `max_link_speed`        | Maximum supported link speed |
| `link_width`            | Current lane count |
| `max_link_width`        | Max lane count |
| `temperature_c`         | From nvidia-smi / rocm-smi |
| `power_w`               | Current power draw |
| `ecc_uncorrected_total` | Cumulative ECC uncorrected errors (NVIDIA) |
| `ecc_corrected_total`   | Cumulative ECC corrected errors (NVIDIA) |
| `hw_slowdown`           | HW throttle active (NVIDIA) |
| `telemetry`             | Free-form map for vendor-specific extras |

**Classification rule**: use `vendor_id` (numeric PCI ID), never `manufacturer` string.

| Vendor   | vendor_id |
|----------|-----------|
| NVIDIA   | `0x10de`  |
| AMD      | `0x1002`  |
| Mellanox | `0x15b3`  |
| Aspeed   | `0x1a03`  |
| Intel    | `0x8086`  |

Constants live in `audit/internal/collector/pci_vendors.go`.

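
As an illustration of the classification rule, the sketch below compares numeric vendor IDs and ignores the `manufacturer` label entirely. The constant names, the `pcieDevice` struct, and `isGPUVendor` are hypothetical and are not the identifiers used in `pci_vendors.go`; the Go type of `vendor_id` is assumed to be an integer. It matches on vendor only, so a real check would presumably also consult `device_class`.

```go
package main

import "fmt"

// Vendor IDs copied from the vendor table in this section.
// The constant names are hypothetical.
const (
	vendorNVIDIA   = 0x10de
	vendorAMD      = 0x1002
	vendorMellanox = 0x15b3
	vendorAspeed   = 0x1a03
	vendorIntel    = 0x8086
)

// pcieDevice is a trimmed, hypothetical stand-in for HardwarePCIeDevice.
type pcieDevice struct {
	VendorID     int
	Manufacturer string // display only, never used for classification
}

// isGPUVendor reports whether the device's vendor ID is NVIDIA or AMD.
// It deliberately does not inspect d.Manufacturer.
func isGPUVendor(d pcieDevice) bool {
	switch d.VendorID {
	case vendorNVIDIA, vendorAMD:
		return true
	}
	return false
}

func main() {
	devices := []pcieDevice{
		{VendorID: 0x10de, Manufacturer: "NVIDIA Corporation"},
		{VendorID: 0x10de, Manufacturer: "nVidia"}, // label varies, ID does not
		{VendorID: 0x15b3, Manufacturer: "Mellanox Technologies"},
	}
	for _, d := range devices {
		fmt.Printf("0x%04x %q gpu=%t\n", d.VendorID, d.Manufacturer, isGPUVendor(d))
	}
}
```
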
---

## HardwareMemory

The `location` field exists in the Go struct with `json:"-"`. It is intentionally excluded
from JSON output because the Reanimator schema does not include it. It is used internally
for DIMM telemetry matching only (`collector/memory_telemetry.go`).

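
The effect of the `json:"-"` tag can be checked with a small sketch. The `memory` struct below is a hypothetical, trimmed stand-in for `HardwareMemory`, and its field types are assumptions; only the tag behaviour is the point.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memory is a trimmed, hypothetical stand-in for HardwareMemory.
type memory struct {
	SerialNumber string `json:"serial_number,omitempty"`
	Location     string `json:"-"` // kept in memory, never serialized
}

// encodeMemory marshals m and returns the JSON text.
func encodeMemory(m memory) (string, error) {
	out, err := json.Marshal(m)
	if err != nil {
		return "", fmt.Errorf("encode memory: %w", err)
	}
	return string(out), nil
}

func main() {
	out, err := encodeMemory(memory{SerialNumber: "ABC123", Location: "DIMM_A1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints {"serial_number":"ABC123"}
}
```
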
---

## HardwareSensors

Sensor structs (`HardwareFanSensor`, `HardwareTemperatureSensor`,
`HardwarePowerSensor`, `HardwareOtherSensor`) do **not** have a `location` field.
Location was removed in contract v2.8. The Go types mirror the schema exactly.

---

## JSON naming convention

All JSON keys are `snake_case`. Go field names are `CamelCase`. The mapping is
maintained by struct tags in `audit/internal/schema/hardware.go`.

All pointer fields use `omitempty`: absent means not collected (not zero).
