SQMeter Nuclear Codebase Audit¶
Date: 2026-05-05
Auditor: Claude Sonnet 4.6 (automated deep inspection)
Branch audited: feat/rg15-rain-sensor
Commit: 0434605
Scope: Full repository — firmware, UI, API, MQTT, CI/CD, docs, tests, security, build system
Executive Summary¶
SQMeter is an ambitious, well-structured ESP32-based sky quality monitor with a genuinely impressive feature set for an alpha project. The architecture shows clear design intent: a clean SensorBase hierarchy, separation of firmware modules, a lightweight Preact UI served from LittleFS, and a working CI/CD pipeline that produces flash-ready release binaries. The addition of RG-15 rain sensor support is well-integrated.
However, several issues pose real reliability and safety risks before this device can be trusted in an observatory automation context:
Top 10 Highest-Impact Issues¶
- [CRITICAL] Blocking
delay()calls inside ESPAsyncWebServer async callbacks — WiFi connect, OTA restart, and MQTT test all calldelay()inside async request handlers. On ESPAsyncWebServer this runs on a FreeRTOS task distinct fromloop(). Blocking it can starve the network stack and trigger the watchdog. - [CRITICAL] Race condition:
createSensorDataJson()accessed from bothloop()task and async web server task — sensor readings are modified inloop()and serialized in async callbacks with no synchronization. Can produce garbled JSON or crash. - [CRITICAL] WiFi and MQTT passwords logged to serial in plaintext —
Config::save()logs the entire config JSON including credentials at INFO level to the serial port. - [HIGH] MQTT client ID is the topic name —
connect(config.topic.c_str())uses the topic (sqm/data) as the MQTT client ID. Multiple devices or broker reconnects will collide and cause unpredictable disconnections. - [HIGH]
WiFi.scanNetworks()called in async handler — Blocks the async task for seconds. Will starve WebSocket clients and may trigger the watchdog during a scan. - [HIGH] Config stored as pretty-printed JSON in NVS —
serializeJsonPrettyproduces unnecessarily large JSON. NVS has a per-key size limit (~4000 bytes). As sensors are added, this limit will be exceeded and config saves will silently fail. - [HIGH] ArduinoOTA has no password — Explicitly commented out:
// No password - don't call setPassword() to disable authentication entirely. Any device on the same network can push arbitrary firmware. - [HIGH] TCPServer (ASCOM) not connected to RG-15 sensor —
handleRainRate()returns"0.0"with aTODOcomment, but the RG-15 sensor exists and is running. Observatory clients using ASCOM rain rate will always see zero, potentially overriding a roof-close safety trigger. - [HIGH] WebSocket broadcasts sensor data 5× more often than sensors update —
WS_SENSOR_BROADCAST_INTERVAL_MS = 1000msbutsensor.readIntervalMs = 5000msby default. Clients receive four stale-data broadcasts for every fresh reading, with no indication that the data is stale. - [HIGH] Settings
PUTrequest never reaches firmware —Settings.tsxsendsPUT /api/configbut the firmware only registers a handler forPOST /api/config. Saves silently fail in any browser that correctly uses PUT.
Top 10 Quickest Wins¶
- Fix
saveConfiginSettings.tsxto usePOST(one character change —PUT→POST). - Replace
config.topic.c_str()with a unique client ID (device MAC + name) inMQTTClient::connect(). - Remove the full config JSON from the INFO log line in
Config::save()— keep byte count only. - Change
serializeJsonPrettytoserializeJsoninConfig::toJson(). - Redact passwords in
GET /api/configresponse (return empty strings or"***"for password fields). - Set a minimum
minFreeHeapmetric and exposeESP.getMinFreeHeap()in the status endpoint. - Add the RG-15 sensor reference to
TCPServer::setSensorReferences()and implementhandleRainRate(). - Change WebSocket broadcast interval to match
sensor.readIntervalMs(or add adataTimestampfield so clients can detect stale data). - Move WiFi scan to use
WiFi.scanNetworks(true)(async mode) with a polling approach rather than blocking. - Add
logLevelto config and default it toINFOrather thanDEBUGin production.
Overall Risk Level: High¶
The project is not ready for observatory automation safety use. It has several confirmed bugs (PUT/POST mismatch, MQTT client ID, async blocking) and one critical safety gap (ASCOM rain rate returning 0 while RG-15 data exists). It is suitable as a monitoring/dashboard device but should not be trusted for roof-safety interlocks until Phase 0 issues are resolved.
Scorecard¶
| Area | Score | Rationale |
|---|---|---|
| Architecture | 7/10 | Clean module separation, good use of C++17, clear sensor hierarchy. WebServer is slightly overloaded. |
| Firmware reliability | 5/10 | Confirmed async-blocking bugs, race condition on sensor data, WDT risk during WiFi scan. |
| Sensor abstraction | 7/10 | SensorBase hierarchy is solid. Reading structs are typed. Optional sensor pattern is consistent. |
| API/backend | 5/10 | PUT/POST mismatch is a confirmed bug. Docs are wrong in two places. Passwords returned in GET. |
| WebSocket/live telemetry | 6/10 | Two WebSocket channels is a good design. Broadcasting stale data 4/5 times is a significant UX flaw. |
| MQTT/telemetry | 4/10 | Client ID bug, no LWT/availability topic, millis() timestamp, field naming diverges from REST. |
| Embedded UI | 7/10 | Clean Preact/Tailwind app, good mobile layout. Validation logic has a fragile divide-by-2 hack. |
| Configuration/settings | 6/10 | Good Zod validation on UI side. Pretty-print JSON in NVS is wasteful and will fail at scale. |
| Testing | 2/10 | Only Playwright screenshot tests with no content assertions. Zero firmware or API unit tests. |
| CI/CD | 7/10 | Solid build pipeline with artifacts, release automation, and docs deployment. Missing lint/typecheck gates. |
| Documentation | 6/10 | Good hardware docs for RG-15. REST API docs have wrong response shapes in two places. MQTT docs incomplete. |
| Security | 4/10 | Passwords logged in plaintext. Unauthenticated OTA. Credentials returned via API. Acknowledged but not mitigated. |
| Observability | 6/10 | Good status endpoint. Missing minFreeHeap, reset reason, boot count, per-subsystem last error. |
| Build/release process | 7/10 | Clean PlatformIO + Vite pipeline. ESPAsyncWebServer pinned to GitHub HEAD (no hash). |
| Developer experience | 7/10 | Good DEV_WORKFLOW.md, CONTRIBUTING.md. Vite proxy requires manual IP edit. |
| Maintainability | 6/10 | Consistent naming. A few clear code smells (fragile validation path key, pretty-print in NVS). |
Findings by Severity¶
Critical¶
AUDIT-001¶
Severity: Critical
Area: Firmware — WebServer
Title: Blocking delay() calls inside ESPAsyncWebServer async request callbacks
Evidence: src/WebServer.cpp
- Line 209: delay(100) inside /api/wifi/connect handler before WiFi connect attempt
- Line 213: while (... attempts < 20) { delay(500); } — 10 second busy-wait in async callback
- Line 265: delay(1000) before restart in OTA success handler
- Line 327: delay(1000) before restart in /api/restart handler
- src/sensors/RG15Sensor.cpp lines 36 and 71: delay(200) inside begin(), which is called from reconfigure() which is called from saveConfigCallback() which is called from the async config POST handler
Why it matters: ESPAsyncWebServer runs its callback handlers on a separate FreeRTOS task (the lwIP/AsyncTCP task). Calling delay() on that task starves the network stack, drops WebSocket packets for connected clients, and can trigger the 30-second watchdog. This is a documented anti-pattern for ESPAsyncWebServer — their documentation explicitly warns against blocking in callbacks. The WiFi connect handler has a 10-second worst-case delay.
Recommended fix:
- For restart: schedule via esp_timer_create with a 500ms delay firing ESP.restart(), return the response immediately.
- For WiFi connect: use WiFi.scanNetworks(true) async mode or move the connection attempt to a flag polled in loop().
- For RG15 delay(): replace with a non-blocking startup sequence tracked by a timer.
Effort: M Owner: Firmware Dependency: None
AUDIT-002¶
Severity: Critical
Area: Firmware — Concurrency
Title: Race condition: sensor readings written in loop() task, read in async WebServer task
Evidence: src/WebServer.cpp:658 — createSensorDataJson() accesses tslSensor.getReading(), bmeSensor.getReading(), etc. via const references. These sensor objects' readings are mutated in src/main.cpp:335-340 inside loop(). ESPAsyncWebServer request handlers and WebSocket broadcasts run on a separate FreeRTOS task. There is no mutex, semaphore, or atomic access protecting these reads.
Why it matters: When the async task reads a sensor struct while loop() is writing to it, the result is undefined behavior: partial reads, torn floats, or crashes on multi-word fields. On Xtensa (ESP32) float reads are not atomic. The TSL2591Reading struct has float lux, uint16_t visible, uint16_t infrared, uint32_t full and uint32_t timestamp — six words written non-atomically from one task and read from another.
Recommended fix: Add a SemaphoreHandle_t mutex per sensor (or a single shared sensor-data mutex), take it before writing in update() and before reading in serialization. Alternatively, use double-buffering: keep a "committed" snapshot that is atomically swapped after each sensor poll.
Effort: M Owner: Firmware Dependency: None
AUDIT-003¶
Severity: Critical Area: Security — Credentials Title: WiFi and MQTT passwords logged to serial in plaintext at INFO level
Evidence: src/Config.cpp:59:
src/Config.cpp:38:
The config JSON includes wifi.password and mqtt.password in full. These are logged at INFO level, which is the default log level at runtime.
Why it matters: Anyone with serial monitor access (e.g., at a star party, or remote log capture) can read the WiFi password and MQTT credentials of the site network. If MQTT is used with an internet-facing broker, this is a credential leak.
Recommended fix: Log byte count only: Logger::info(TAG, "Config saved (%d bytes)", json.length());. Never log the JSON content at INFO. A DEBUG-level log is acceptable only if the password fields are masked (e.g., "password":"***").
Effort: S Owner: Firmware Dependency: None
High¶
AUDIT-004¶
Severity: High
Area: API — Settings
Title: Settings page sends PUT /api/config; firmware only handles POST /api/config
Evidence: web/src/components/Settings.tsx:148:
const response = await fetch('/api/config', {
method: 'PUT', // Using PUT for full resource replacement
src/WebServer.cpp:147:
AsyncCallbackJsonWebHandler only responds to POST by default (it does not accept PUT). The firmware's SPA fallback will return index.html for any unmatched route, so the UI receives a 200 OK with HTML instead of JSON, and the save operation silently appears to succeed before the JSON parse fails.
Recommended fix: Change method: 'PUT' to method: 'POST' in Settings.tsx. The comment about "full resource replacement" is incorrect — the firmware already does partial merging via default values in fromJson().
Effort: S Owner: UI Dependency: None
AUDIT-005¶
Severity: High Area: Firmware — MQTT Title: MQTT client ID is set to the topic name, not a device identifier
Evidence: src/MQTTClient.cpp:123-130:
connected = mqttClient->connect(
config.topic.c_str(), // This is "sqm/data", not a device ID
config.username.c_str(),
config.password.c_str());
Why it matters: The MQTT specification requires client IDs to be unique per broker. Using the topic name (sqm/data) means: (1) if two SQMeter devices publish to the same topic, they will knock each other off the broker repeatedly; (2) if the same device reconnects, it may conflict with its own lingering session; (3) brokers that validate client IDs may reject the connection because topic strings contain / which is invalid in client IDs on some brokers.
Recommended fix: Generate client ID from device MAC: "SQM-" + WiFi.macAddress().substring(9) (last 6 hex digits). Expose this in config as mqtt.clientId with a sane default.
Effort: S Owner: Firmware Dependency: None
AUDIT-006¶
Severity: High
Area: Firmware — WebServer
Title: WiFi.scanNetworks() is synchronous and blocks the async request handler
Evidence: src/WebServer.cpp:472:
handleWiFiScan(), which is registered as an ESPAsyncWebServer route handler.
Why it matters: Blocks the async task for up to 5 seconds. WebSocket clients will not receive updates. The 30-second watchdog countdown is a concern if called multiple times quickly (though unlikely in practice). The /api/wifi/scan endpoint is exposed to the UI with no rate-limiting.
Recommended fix: Use WiFi.scanNetworks(true) (async, returns immediately with WIFI_SCAN_RUNNING). Create a polling endpoint /api/wifi/scan/status that returns results when ready, or use a millis()-based polling approach in handle().
Effort: M Owner: Firmware Dependency: None
AUDIT-007¶
Severity: High Area: Firmware — Config Title: Config NVS entry uses pretty-printed JSON, risking NVS size limit failure
Evidence: src/Config.cpp:203:
putString() key size limit is 4000 bytes. A fully populated config in pretty-print format (all fields, plausible hostname/broker strings) easily exceeds 600 bytes. With more sensors added (e.g., wind sensor config) and longer string values, this will silently fail: Preferences::putString() returns 0 on overflow, and Config::save() reports an error, but the firmware continues running with the old NVS config.
Recommended fix: Change to serializeJson(doc, output). Compact JSON reduces size by ~40%. Also validate the resulting string length before saving and log a clear error if it would exceed 3800 bytes (with 200 bytes headroom).
Effort: S Owner: Firmware Dependency: None
AUDIT-008¶
Severity: High Area: Security — OTA Title: ArduinoOTA has no password, allowing any LAN device to push arbitrary firmware
Evidence: src/main.cpp:222:
ArduinoOTA.setHostname(config.wifi.hostname.c_str());
// No password - don't call setPassword() to disable authentication entirely
Why it matters: Any device on the same network segment that can reach the ESP32 on port 3232 (mDNS OTA port) can flash arbitrary firmware without any credential. This is particularly risky at shared observatories, star parties, or any network with untrusted users.
Recommended fix: Either set a password (ArduinoOTA.setPassword(...)) exposed via config, or disable ArduinoOTA entirely and rely solely on the web OTA endpoint (/api/update). If ArduinoOTA is kept for convenience, document the risk prominently and note that it only applies to LAN access.
Effort: S Owner: Firmware / Security Dependency: None
AUDIT-009¶
Severity: High Area: Firmware — TCPServer (ASCOM) Title: ASCOM rain rate command returns 0.0 even when RG-15 sensor is operational
Evidence: src/TCPServer.cpp:249-252:
TCPServer::setSensorReferences() does not accept an RG15Sensor* pointer (see src/TCPServer.cpp:42).
Why it matters: Observatory automation software (N.I.N.A., Voyager, Sequence Generator Pro) uses the ASCOM ObservingConditions interface to check rain rate before slewing or unparking. If rain rate is always reported as 0.0, a roof safety interlock based on this value will never trigger during actual rainfall. This is a safety issue for telescope equipment.
Recommended fix: Add RG15Sensor *rg15Sensor to TCPServer. Update setSensorReferences(). Implement handleRainRate() to return rg15Reading.rInt (or 0.0 if sensor not initialized/status not OK).
Effort: S Owner: Firmware Dependency: None
AUDIT-010¶
Severity: High Area: WebSocket Title: WebSocket broadcasts sensor data 5× more frequently than sensors are polled
Evidence:
- include/WebServer.h:60: WS_SENSOR_BROADCAST_INTERVAL_MS = 1000
- src/Config.cpp:135: cfg.sensor.readIntervalMs = 5000
- src/WebServer.cpp:87-94: broadcasts unconditionally every 1 second
Why it matters: With default settings, the WebSocket sends the same unchanged reading 4 times between each actual sensor poll. There is no timestamp in the WebSocket payload, so clients cannot detect staleness. This causes the UI to show values that look "live" but are actually up to 5 seconds old with no indication. For a sky quality device this matters — a cloud moving in or out mid-reading should be flagged, not hidden.
Recommended fix: Either (1) tie WS broadcast to the sensor update cycle (broadcast only when lastSensorUpdate changes), or (2) keep 1-second broadcast but add "dataTimestamp": lastSensorUpdate to the payload so the UI can display "data age". Option 2 is recommended for smoother UX.
Effort: S Owner: Firmware / UI Dependency: None
AUDIT-011¶
Severity: High
Area: Security — API
Title: WiFi and MQTT passwords returned in plaintext via GET /api/config
Evidence: src/Config.cpp:158,168:
Config::toJson() output returned by GET /api/config.
Why it matters: Any device or script that can make an unauthenticated HTTP request to the ESP32 can retrieve the WiFi password of the site network and any MQTT credentials. For devices on a shared observatory network or a guest network, this is a credential exposure.
Recommended fix: Return password fields as empty strings or masked ("***") in the API response. On settings save, accept empty password fields as "keep existing" rather than clearing. Store passwords separately in NVS from other config if possible.
Effort: M Owner: Firmware Dependency: None
AUDIT-012¶
Severity: High
Area: Firmware — Config
Title: LittleFS.begin(true) silently formats the filesystem on mount failure
Evidence: src/main.cpp:185:
Why it matters: If the flash is corrupted (e.g., power loss during a previous update), this will silently erase all web UI files. The device will still boot but the web UI will be empty. Users will see a blank browser page with no error explanation. This is a silent data loss scenario.
Recommended fix: Change to LittleFS.begin(false) and handle the error explicitly — log a clear error message and either attempt recovery or display a serial diagnostic suggesting re-flash of the filesystem.
Effort: S Owner: Firmware Dependency: None
AUDIT-013¶
Severity: High Area: Firmware — TSL2591 Lux Calculation Title: CPL formula has redundant factors that will silently produce wrong values if constants change
Evidence: src/sensors/TSL2591Sensor.cpp:83:
100.0 / 100.0 cancels to 1.0. The actual TSL2591 datasheet CPL formula is CPL = (ATIME_ms × AGAIN) / DF. As written, if someone changes the integration time constant (e.g., from 600ms to 300ms), only one of the 600.0 magic numbers would be updated, producing a 2× error in all lux calculations.
Why it matters: This is the core calculation used for all SQM, NELM, and Bortle outputs. A 2× lux error translates to approximately a 0.75 mag/arcsec² error in SQM — enough to shift Bortle classification by one class.
Recommended fix: Extract constants as named constexpr:
constexpr float INTEGRATION_TIME_MS = 600.0F;
constexpr float GAIN_FACTOR = 9876.0F;
constexpr float LUX_COEFF = 408.0F;
float cpl = (INTEGRATION_TIME_MS * GAIN_FACTOR) / LUX_COEFF;
Effort: S Owner: Firmware Dependency: None
Medium¶
AUDIT-014¶
Severity: Medium
Area: Firmware — MQTT
Title: MQTT timestamp is millis() (uptime ms), not wall-clock time
Evidence: src/MQTTClient.cpp:163:
Why it matters: Home Assistant, Grafana, and any time-series database expect Unix epoch timestamps. millis() resets on every reboot and has no relationship to real time. Historic data stitching, graphing, and alert rules based on this timestamp are broken.
Recommended fix: Use time(nullptr) to get Unix epoch seconds if NTP has synced. Fall back to millis() only if time is not synced, and add a "timeValid": true/false field so consumers know the difference.
Effort: S Owner: Firmware Dependency: TimeManager must be accessible from MQTTClient (currently it is not — pass via callback or reference)
AUDIT-015¶
Severity: Medium Area: Firmware — MQTT Title: No MQTT LWT (Last Will and Testament) / availability topic
Evidence: src/MQTTClient.cpp — no setWill() call before connect().
Why it matters: When the device loses power or WiFi, MQTT subscribers have no way to know the device is offline. Home Assistant will continue showing the last known sensor values as if they are current. For safety-critical use (observatory roof), this means a sensor failure is indistinguishable from a "clear sky" reading.
Recommended fix:
mqttClient->setWill(
(config.topic + "/availability").c_str(),
1, true, "offline");
// On successful connect:
mqttClient->publish((config.topic + "/availability").c_str(), "online", true);
Effort: S Owner: Firmware Dependency: Topic schema should be documented first
AUDIT-016¶
Severity: Medium
Area: API
Title: GET /api/status response shape contradicts documentation
Evidence: docs/api/rest.md:18-25 shows:
src/WebServer.cpp:745-923) nests rssi and ip under wifi, version under firmware, and includes dozens of additional fields. The doc example is the stale prototype shape.
Why it matters: External integrators (Node-RED, Home Assistant MQTT, Python scripts) reading the docs will build against the wrong schema and silently get undefined for every field they try to access.
Effort: S Owner: Docs Dependency: None
AUDIT-017¶
Severity: Medium
Area: API
Title: GET /api/wifi/scan response shape contradicts documentation
Evidence: docs/api/rest.md:140-143 shows:
src/WebServer.cpp:480-487):
Field name mismatch: "secure" vs "encryption", and the response is wrapped in an object, not a raw array. The UI correctly handles the actual format (data.networks), but the docs are wrong.
Effort: S Owner: Docs Dependency: None
AUDIT-018¶
Severity: Medium
Area: Firmware — Reliability
Title: readLine() in RG15Sensor spins without yielding during the 500ms response timeout
Evidence: src/sensors/RG15Sensor.cpp:163-187:
const uint32_t deadline = millis() + RESPONSE_TIMEOUT_MS;
while (millis() < deadline) {
while (serial->available()) { ... }
}
delay(), yield(), or vTaskDelay() inside the outer loop. In polling mode this runs every sensor.readIntervalMs.
Why it matters: In polling mode (mode == "polling"), this busy-waits for up to 500ms every poll cycle, consuming 100% CPU during that window. This prevents wifiManager->handle(), webServer->handle(), and mqttClient->handle() from running during the poll, causing dropped WebSocket packets and MQTT disconnections during rain conditions (exactly when you want telemetry most).
Recommended fix: Add vTaskDelay(1) or taskYIELD() inside the outer while loop, or restructure to a non-blocking state machine.
Effort: M Owner: Firmware Dependency: None
AUDIT-019¶
Severity: Medium Area: Firmware — Observability Title: No minimum free heap tracking or heap fragmentation diagnostics
Evidence: src/WebServer.cpp:758: doc["freeHeap"] = ESP.getFreeHeap(); — current free heap only. ESP.getMinFreeHeap() is not called.
Why it matters: On embedded devices, heap fragmentation is a primary cause of crashes. The current free heap can look healthy while the minimum ever recorded is near zero (indicating past pressure or fragmentation). Without minFreeHeap, it is impossible to diagnose heap-related crashes in the field.
Recommended fix: Add to status payload:
Effort: S Owner: Firmware Dependency: None
AUDIT-020¶
Severity: Medium Area: Firmware — Reliability Title: No reset reason or boot count in status or logs
Evidence: No calls to esp_reset_reason() or esp_restart_reason() anywhere in the codebase.
Why it matters: When a device reboots unexpectedly (watchdog, panic, power loss), there is no way to know why. Reset reason is essential for diagnosing firmware bugs in the field. A device that regularly reboots due to a watchdog triggered by the blocking async callback bug (AUDIT-001) will give no indication of this in the status endpoint.
Recommended fix:
#include <esp_system.h>
Logger::info("Main", "Reset reason: %d", esp_reset_reason());
doc["resetReason"] = esp_reset_reason();
doc["bootCount"] = bootCount; // stored in RTC memory: RTC_DATA_ATTR int bootCount = 0;
Effort: S Owner: Firmware Dependency: None
AUDIT-021¶
Severity: Medium Area: UI — Validation Title: Validation error count calculation is a fragile hack
Evidence: web/src/components/Settings.tsx:139:
rain.enabled → rainEnabled vs top-level fields)
- If the deduplication logic changes
- Any error path that doesn't produce exactly two keys
Recommended fix: Store errors in a Map<string, string> keyed by canonical dotted path only. Display using the dotted path with the nested section label. Remove the parallel flattened-key storage.
Effort: S Owner: UI Dependency: None
AUDIT-022¶
Severity: Medium
Area: UI — Mock data
Title: Mock config response uses {ok: true} but firmware returns {success: true}
Evidence: web/src/mocks/handlers.ts:29:
src/WebServer.cpp:162:
The mock returns ok but the real API returns success. The Settings component checks response.ok (the HTTP status, not the body), so this doesn't currently cause a UI bug, but any code that checks result.ok vs result.success will diverge between demo and real device.
Effort: S Owner: UI/Mocks Dependency: None
AUDIT-023¶
Severity: Medium Area: Documentation Title: CHANGELOG says GPS is "planned" but GPS is fully implemented
Evidence: CHANGELOG.md:8:
[Unreleased]. But src/sensors/GPSSensor.cpp, include/sensors/GPSSensor.h, and the full GPS pipeline are implemented and functional.
Why it matters: Prospective users reading the changelog will think GPS is not yet available and may not set it up or expect it to work.
Effort: S Owner: Docs Dependency: None
AUDIT-024¶
Severity: Medium Area: Firmware — Cloud Detection Title: Cloud detection uses a fixed 53% humidity fallback that silently masks BME280 failures
Evidence: src/WebServer.cpp:697:
53% is an arbitrary default. When the BME280 fails, cloud detection continues using this fixed humidity. The API response includes "humidityUsed": 53.0 but the cloudConditions object has no field indicating that the BME280 was unavailable. The MQTT payload (src/MQTTClient.cpp:196-204) skips cloud data when MLX is unavailable but does not similarly skip when BME is unavailable.
Why it matters: If the BME280 fails in a high-humidity environment (e.g., 85% actual humidity), the cloud cover estimate will be systematically wrong because the correction factor applied will be too small. Observatory users relying on cloud cover for go/no-go decisions will get incorrect results with no visible indication of the sensor failure.
Recommended fix: Add a "humiditySource" field to cloudConditions with values "bme280" or "default". Consider adding a separate "bme280Available" flag. Log a warning when the fallback is used.
Effort: S Owner: Firmware Dependency: None
AUDIT-025¶
Severity: Medium
Area: UI — Dashboard
Title: Rain sensor unit label is hardcoded as mm/hr regardless of configured units
Evidence: web/src/components/Dashboard.tsx:184:
in/hr but the label still says mm/hr.
Effort: S Owner: UI Dependency: Config must be available to Dashboard (currently it is not — Dashboard only uses sensor data)
AUDIT-026¶
Severity: Medium
Area: CI/CD
Title: Screenshots Playwright test has continue-on-error: true, silently hiding failures
Evidence: .github/workflows/docs.yml:66:
Why it matters: If the demo app crashes, screenshots fail silently and the docs deployment proceeds with outdated or missing screenshots. This is the only UI regression check in CI and it's disabled.
Recommended fix: Remove continue-on-error: true. Fix any screenshot test flakiness at the root (the waitForTimeout anti-pattern) rather than suppressing errors.
Effort: S Owner: CI/CD Dependency: Playwright test stability (see testing gaps)
AUDIT-027¶
Severity: Medium Area: CI/CD Title: ESPAsyncWebServer and AsyncTCP pinned to GitHub HEAD with no hash
Evidence: platformio.ini:33-34:
Why it matters: A breaking change or incompatible commit pushed to these repos will break the firmware build without any change to the SQMeter codebase. This has happened historically with ESPAsyncWebServer. The build is not reproducible.
Recommended fix: Pin to a specific commit hash: https://github.com/me-no-dev/ESPAsyncWebServer.git#<commit-sha>. Alternatively migrate to the maintained fork mathieucarbou/ESPAsyncWebServer which has proper versioned releases.
Effort: S Owner: Build Dependency: Verify compatible commit
AUDIT-028¶
Severity: Medium
Area: CI/CD
Title: npm dependency cache keyed on package.json, not package-lock.json
Evidence: .github/workflows/build.yml:65:
package-lock.json contains the exact resolved versions. Using package.json means the cache is not invalidated when transitive dependencies change.
Effort: S Owner: CI/CD Dependency: None
Low¶
AUDIT-029¶
Severity: Low Area: Firmware — Logger Title: Logger defaults to DEBUG level in production, spamming serial at ~600ms intervals
Evidence: src/Logger.cpp:12:
src/sensors/TSL2591Sensor.cpp:132). RG15 logs every parse. This generates significant serial output and introduces non-trivial delays if a serial monitor is attached, which may affect timing-sensitive operations.
Recommended fix: Default to INFO unless DEBUG_BUILD is defined. Add logLevel to Config so it can be changed at runtime.
Effort: S Owner: Firmware Dependency: None
AUDIT-030¶
Severity: Low
Area: UI — WebSocket
Title: useWebSocket reconnect uses a fixed 5-second delay with no exponential backoff
Evidence: web/src/hooks/useWebSocket.ts:38:
Why it matters: If the device is rebooting after an OTA update (which takes 15-30 seconds), the client will make ~3-6 failed connection attempts before the device is ready. While not harmful, it creates unnecessary log noise and could stress a slow device during startup.
Recommended fix: Implement exponential backoff with jitter: setTimeout(connect, Math.min(30000, 5000 * 2**retries + Math.random()*1000)).
Effort: S Owner: UI Dependency: None
AUDIT-031¶
Severity: Low
Area: Firmware — TCPServer
Title: TCPServer uses new WiFiServer(port) without delete on reconfiguration
Evidence: src/TCPServer.cpp:14-37: server = new WiFiServer(port) allocated in constructor. Destructor deletes it. However, if begin() is called after a reconnect (it checks WiFi.isConnected() but doesn't retry on connect), the server is never started.
Why it matters: Minor: if WiFi disconnects and reconnects, the TCP server is never restarted. ASCOM clients won't be able to reconnect.
Recommended fix: Check server state in handle() and call begin() if WiFi is connected but server is not running.
Effort: S Owner: Firmware Dependency: None
AUDIT-032¶
Severity: Low Area: Firmware — GPS Title: GPS data appears in two separate places in both API and UI with slightly different structure
Evidence:
- In GET /api/sensors: GPS data under "gps" key, hdop as raw value
- In GET /api/status (via gpsData): GPS data under "gpsData" key, hdop divided by 100 (gpsData["hdop"] = gpsReading.hdop / 100.0)
- Dashboard shows GPS from sensor endpoint; System page shows GPS from status endpoint
This means HDOP is inconsistently scaled between the two endpoints. api/sensors shows raw HDOP (e.g., 110), api/status shows divided HDOP (e.g., 1.10). Both appear in the UI in different pages.
Recommended fix: Normalize HDOP to actual floating-point value (divide by 100) at the reading source, not at serialization time.
Effort: S Owner: Firmware Dependency: None
Positive Findings¶
AUDIT-P01¶
Severity: Positive
Area: Architecture
Title: Clean SensorBase polymorphic hierarchy with consistent optional-sensor handling
Each sensor implements begin(), update(), getName(), toJson(). Optional sensors (BME280, MLX, GPS, RG15) are created unconditionally and their isInitialized() flag gates all usage. This is an excellent pattern that makes adding new sensors straightforward.
AUDIT-P02¶
Severity: Positive Area: Build Title: Complete flash image artifact produced in CI with correct partition layout
The CI workflow produces complete-flash.bin by merging bootloader, partition table, boot_app0.bin, firmware, and LittleFS — excluding NVS so user config is preserved. This is exactly what first-time flashers need and is often done incorrectly.
AUDIT-P03¶
Severity: Positive
Area: Firmware
Title: Watchdog timer configured and fed in loop()
30-second watchdog with esp_task_wdt_reset() in loop() provides a last-resort safety net. This is correctly implemented.
AUDIT-P04¶
Severity: Positive Area: UI Title: Zod schema validation on settings form with inline error display
web/src/validation/configSchema.ts provides thorough Zod validation with GPIO pin allow-listing, cross-field validation (SDA ≠ SCL, RX ≠ TX when enabled), and clear error messages. This is well above average for an embedded device UI.
AUDIT-P05¶
Severity: Positive Area: CI/CD Title: Separate build and release jobs with controlled artifact retention
PR builds retain artifacts 5 days; tag builds retain 90 days. Release job only runs on tags. Concurrency group correctly cancels superseded PR builds.
AUDIT-P06¶
Severity: Positive Area: Firmware Title: NVS config verification after save
src/Config.cpp:80-87 reads back the NVS value immediately after writing to verify it was saved correctly. This is a good practice for flash-backed storage.
AUDIT-P07¶
Severity: Positive Area: Firmware Title: Exponential backoff on WiFi reconnect
src/WiFiManager.cpp:150-154: currentReconnectDelay = std::min(currentReconnectDelay * 2, config.maxReconnectDelayMs). Correctly implemented with a configurable cap.
AUDIT-P08¶
Severity: Positive Area: Documentation Title: RG-15 wiring and protocol documentation is excellent
docs/hardware/rg15.md includes pin-by-pin wiring table, correct TX/RX cross-wiring note, DIP switch vs serial priority explanation, troubleshooting table, and MQTT/API payload examples. This is production-quality hardware documentation.
Code Smells¶
| File/Path | Smell | Impact | Fix |
|---|---|---|---|
src/WebServer.cpp:259-263 |
Heap-allocates String *responseStr = new String() then immediately deletes it after use |
Memory fragmentation, unnecessary allocation | Use stack String responseStr; |
src/WebServer.cpp:354-441 |
Static local variables for filesystem OTA state (fs_partition, fs_bytes_written, etc.) |
Not thread-safe, state persists across requests | Move to class members or use a dedicated state struct |
src/Config.cpp:59 |
Full config JSON logged at INFO level including passwords | Security leak | Log byte count only |
src/Config.cpp:203 |
serializeJsonPretty for NVS storage |
Wastes NVS space | Use serializeJson |
src/MQTTClient.cpp:124 |
Topic used as MQTT client ID | Protocol violation, device conflicts | Use MAC-based client ID |
src/sensors/TSL2591Sensor.cpp:83 |
(100.0F * 600.0F / 100.0F) — cancelling constants in formula |
Confusing, error-prone on maintenance | Extract named constants |
web/src/components/Settings.tsx:139 |
/ 2 hack for validation error count |
Fragile, breaks with any path deduplication change | Use Map with canonical keys |
web/src/components/Settings.tsx:168 |
updateConfig uses shallow spread then mutates deep path |
Could cause missed Preact re-renders | Use proper immutable deep update |
web/src/hooks/useWebSocket.ts:34-40 |
Creates ws state but cleanup effect uses stale closure reference |
WebSocket may not be closed on unmount | Use ref for socket, not state |
include/WebServer.h:60-61 |
WS_SENSOR_BROADCAST_INTERVAL_MS and WS_STATUS_BROADCAST_INTERVAL_MS don't reference config.sensor.readIntervalMs |
Hardcoded coupling | Derive from config |
src/TCPServer.cpp |
Entire file uses raw pointer new for WiFiServer, Arduino String everywhere else uses std::string |
Inconsistent ownership model | Use std::unique_ptr<WiFiServer> |
src/WebServer.cpp:136-146 |
GET /api/status and POST /api/config registered separately but no verb guard on GET config |
Potential confusion | Explicit verb routing |
web/src/mocks/data.ts:70 |
Mock firmware build date hardcoded: "Apr 24 2026" |
Will drift from reality | Use new Date().toLocaleDateString() |
web/src/mocks/handlers.ts:29 |
Config save mock returns {ok: true} but firmware returns {success: true} |
Contract drift | Align response shape |
src/WebServer.cpp:430 |
if (index % 10240 == 0) log in filesystem OTA — index is always 0 on first chunk |
Logging never fires as intended | Use fs_bytes_written for progress logging |
Testing Gaps¶
| Area | Current Coverage | Missing Tests | Priority | Suggested Test Files |
|---|---|---|---|---|
| Firmware — SkyQuality | None | SQM→Bortle boundary values, lux=0, lux=100000, NaN handling | Critical | test/test_sky_quality.cpp |
| Firmware — CloudDetection | None | Humidity correction range, threshold boundaries, MLX unavailable fallback | Critical | test/test_cloud_detection.cpp |
| Firmware — RG15 parser | None | Happy path, too-short line, partial parse, flag characters, imperial line | High | test/test_rg15_parser.cpp |
| Firmware — Config | None | Round-trip toJson/fromJson, partial JSON with missing keys, oversized JSON | High | test/test_config.cpp |
| Firmware — TSL2591 lux | None | Saturation path, IR > full path, normal path | High | test/test_tsl2591.cpp |
| API — contract | None | Response shape for all endpoints, error response shapes | High | web/tests/api.spec.ts |
| UI — Settings | None | Save with valid data, save with validation errors, PUT vs POST regression | High | web/tests/settings.spec.ts |
| MSW — contract | None | Mock payloads match TypeScript types | Medium | web/tests/mock-contract.test.ts |
| Firmware — MQTT payload | None | Timestamp field is epoch not millis, client ID is device ID not topic | Medium | test/test_mqtt_payload.cpp |
| UI — WebSocket | None | Reconnect on close, data staleness indicator | Medium | web/tests/websocket.spec.ts |
| Playwright — content | Screenshot-only (no assertions) | Assert specific SQM value visible, sensor status badges, error states | High | Extend tests/screenshots.spec.ts |
| Firmware — hardware abstraction | None | No mock/stub layer for I2C sensors | Medium | Requires hardware abstraction layer first |
CI/CD Gaps¶
| Workflow/File | Current Behaviour | Gap | Recommended Change | Priority |
|---|---|---|---|---|
build.yml |
Runs npm run build (which runs tsc && vite build) |
TypeScript errors only fail if Vite build fails — tsc --noEmit errors are not surfaced as a separate step |
Add - run: npx tsc --noEmit before Vite build |
High |
build.yml |
No lint step | ESLint is in package.json scripts but never called in CI |
Add - run: npm run lint |
Medium |
docs.yml |
Screenshots continue-on-error: true |
Screenshot failures are silently ignored | Remove continue-on-error after fixing Playwright waitForTimeout |
Medium |
| Both workflows | npm cache on package.json hash |
Transitive dep changes not caught | Change to hashFiles('web/package-lock.json') |
Medium |
| Both workflows | ESPAsyncWebServer pinned to HEAD | Not reproducible | Pin to specific commit hash | High |
build.yml |
No firmware size tracking | Binary grows silently | Add step to cat .pio/build/esp32dev/firmware.map \| grep -E 'FLASH\|RAM' and fail if > 90% |
Medium |
| Neither | No CHANGELOG.md check | Tags can be released without changelog entry | Add a check that CHANGELOG.md has an entry for the tag version |
Low |
| Neither | No dependabot/Renovate | JS deps go stale silently | Add Dependabot config for npm and GitHub Actions | Low |
Documentation Audit¶
Missing Documentation¶
| Missing Doc | Impact | Priority |
|---|---|---|
| ASCOM ObservingConditions driver — what commands are supported, port, protocol | Observatory integration users | High |
| Calibration — is the TSL2591 SQM calibrated against reference? What is the offset/accuracy? | Scientific users | High |
| Supported ESP32 boards — only says "ESP32" but GPIO defaults only work for specific variants | Reproduction | High |
| Config schema reference — complete field-by-field table of all config keys and defaults | Setup | High |
OTA update docs incomplete — no api/update/fs endpoint documented |
Users with UI issues | Medium |
| Node-RED flow examples | Integration | Medium |
| Reset config / factory reset procedure (not just "erase_flash") | Troubleshooting | Medium |
| Network security recommendations — when to use reverse proxy | Security | Medium |
| N.I.N.A. / Voyager ASCOM setup | Observatory use | High |
| Sensor accuracy and known limitations table | Scientific use | Medium |
Stale Documentation¶
| Doc | Stale Content | Fix |
|---|---|---|
CHANGELOG.md:8 |
GPS listed as planned/unreleased | Move to released section |
docs/api/rest.md |
/api/status response shows flat structure |
Update to actual nested structure |
docs/api/rest.md |
/api/wifi/scan shows array, "secure" field |
Update to {networks: [...]} with "encryption" field |
docs/api/websocket.md |
Missing rainSensor in message format |
Add rainSensor section |
docs/user-guide/mqtt.md |
Payload missing infrared, cloudConditions, rain, location objects |
Update to full payload example |
docs/user-guide/mqtt.md |
Home Assistant YAML uses environment.temperature but IR temp sensor is infrared.skyTemp |
Fix field paths |
Contradictions Table¶
| Doc | Code | Contradiction |
|---|---|---|
docs/api/rest.md shows status.condition = 0 for Clear |
src/calculations/CloudDetection.h: CLEAR=?, but enum starts from unnamed 0 |
Docs show 0=Clear, code has CloudCondition::CLEAR, but SensorBase.h enum starts at 0=OK — risk of confusion between sensor status 0 and cloud condition 0 |
docs/api/rest.md shows "version": "0.0.1" in status |
Firmware returns firmware.version nested |
Shape mismatch |
mockConfig in data.ts has ntp.daylightOffsetSec: 3600 |
Default config has daylightOffsetSec: 0 |
Demo shows DST configured but default doesn't |
mockConfig.mqtt.topic = "sqmeter/data" |
Default config has topic = "sqm/data" |
Demo topic diverges from default |
docs/hardware/rg15.md: "default: GPIO 18" for RX |
Config.cpp:128: cfg.rain.rxPin = 18 (RX pin) and cfg.rain.txPin = 19 |
Doc says GPIO 18 is RX; code matches; consistent |
Recommended Documentation Structure¶
The current structure is good. Suggested additions:
docs/
├── api/
│ ├── rest.md ← update response shapes
│ ├── websocket.md ← add rainSensor
│ └── ascom.md ← NEW: TCP protocol, commands, port 2020
├── integration/ ← NEW directory
│ ├── home-assistant.md ← move from mqtt.md, expand
│ ├── nina.md ← NEW: N.I.N.A. setup
│ └── node-red.md ← NEW: Node-RED flow
└── reference/
├── config-schema.md ← NEW: all config fields, types, defaults
└── calibration.md ← NEW: SQM accuracy, offsets
API/Telemetry Contract Audit¶
Endpoint Summary¶
| Endpoint | Method | Returns | Issues |
|---|---|---|---|
/api/status |
GET | Nested status object | Docs wrong; missing minFreeHeap, resetReason |
/api/sensors |
GET | Sensor readings | Correct; missing data age/timestamp |
/api/config |
GET | Full config | Returns passwords in plaintext |
/api/config |
POST | {success:true} or error |
UI sends PUT, firmware only handles POST |
/api/wifi/scan |
GET | {networks:[...]} |
Blocks async thread; docs show wrong shape |
/api/wifi/connect |
POST | {success:true, ip:...} |
Blocks async thread for 10s |
/api/mqtt/test |
POST | {success:bool, message/error} |
Good; no blocking issues found |
/api/restart |
POST | {success:true} |
delay(1000) in async handler |
/api/update |
POST | {success:bool} |
Unauthenticated; correct otherwise |
/api/update/fs |
POST | {success:bool} |
Not documented; unauthenticated |
Field Name Inconsistencies: API vs MQTT vs WebSocket¶
| Concept | REST /api/sensors |
WebSocket /ws/sensors |
MQTT payload |
|---|---|---|---|
| Light data object | lightSensor |
lightSensor |
light |
| Sky quality object | skyQuality |
skyQuality |
sky |
| IR cloud | cloudConditions |
cloudConditions |
clouds |
| IR temperature | irTemperature |
irTemperature |
infrared |
| Rain sensor | rainSensor |
rainSensor |
rain |
| GPS data | gps |
gps |
location |
Every subsystem uses different field names for the same data. Any integration that uses more than one channel must handle two schemas.
Recommended Stable Schema¶
{
"device": { "name": "...", "firmware": "...", "timestamp": 1700000000, "timeValid": true },
"light": { "lux": 0.0234, "visible": 123, "infrared": 45, "full": 168, "status": 0 },
"sky": { "sqm": 21.5, "nelm": 6.2, "bortle": 2, "description": "..." },
"environment": { "temperatureC": 12.4, "humidity": 72.1, "pressureHPa": 1013.25, "dewpointC": 7.8, "status": 0 },
"cloud": { "skyTempC": -15.2, "ambientTempC": 12.4, "deltaC": -27.6, "correctedDeltaC": -24.1, "coverPercent": 5.0, "condition": "clear", "humiditySource": "bme280", "status": 0 },
"rain": { "isRaining": false, "accMm": 0.0, "eventAccMm": 0.0, "totalAccMm": 12.34, "intensityMmPerHr": 0.0, "lensBad": false, "emSat": false, "status": 0 },
"gps": { "hasFix": true, "satellites": 8, "latitude": 51.5074, "longitude": -0.1278, "altitudeM": 42.0, "hdop": 1.2, "ageMs": 800 }
}
Adopt this as the single canonical schema for REST, WebSocket, and MQTT, with MQTT using the full object as the payload.
UI/UX Audit¶
Dashboard Issues¶
- Rain sensor shows hardcoded
mm/hrlabel regardless of configured units (AUDIT-025) - No data-age indicator — values look live but can be 5+ seconds old
- Cloud conditions card shows emoji icons that render as emoji in some terminals/screenshot tools but poorly on low-DPI displays
- Bortle
toFixed(1)on a value that is always an integer (1.0, 2.0...) shows1.0not1— minor visual noise - No fallback if the light sensor has READ_ERROR status —
sensors.lightSensor.lux.toFixed(6)throws if lux is null
Settings Issues¶
- GPS pin inputs use
<input type="number">withmin=0 max=39but no GPIO allow-list validation (unlike the Zod schema which does check this) - No indication that some settings require reboot to take effect (I2C pins, GPS pins — these are applied in
setup()only) - Validation error count divides by 2 (AUDIT-021)
updateConfig(['rain', 'enabled'], ...)mutates a deep path via a shared reference, which may not trigger Preact's shallow re-render
Mobile/Tablet¶
- Dashboard is well-responsive with
grid grid-cols-1 md:grid-cols-2 - Settings page scrolls to validation errors on mobile — good
- No PWA manifest or service worker for home screen install (separate from MSW)
- Touch targets on toggle checkboxes are small (4×4 block = 16px) — below WCAG AA 44px minimum
Accessibility¶
- No
aria-labelon icon-only elements (emoji icons as status indicators) - Color-only status indicators (green/red dots) without text for colorblind users
- No
<label for>association on some checkbox fields — screen readers may not read the label
Bundle Size (Embedded)¶
data/assets/index-z9OjbY1v.js: 140 KB (minified, presumably before gzip)data/assets/index-D_kLi0L7.css: 17 KB- LittleFS partition: 512 KB total, 40 KB in use (per mock data — actual TBD)
- Preact is ~3 KB gzipped, Tailwind (purged) is typically 5-15 KB, Zod ~12 KB gzipped
- Assessment: The 140 KB JS bundle is large for ESP32 serving but LittleFS has ~470 KB free. The main risk is gzip decompression time on client. The vite config correctly uses terser. Consider adding
build.reportCompressedSize: trueto track gzip size.
Security Audit¶
Realistic LAN-Device Threat Model¶
SQMeter is designed as a LAN-only device, which limits the attack surface significantly. The realistic threat model is:
- Local network attacker — any device on the same WiFi/VLAN can reach port 80 and port 2020 without authentication
- Shared observatory network — star parties, club observatories, or guest WiFi users can reach the device
- Compromised device on LAN — malware on a laptop can reach and reconfigure the device
- Physical access — serial port reveals all credentials (see AUDIT-003)
Security Findings¶
| Risk | Severity | Finding |
|---|---|---|
| ArduinoOTA unauthenticated on port 3232 | High | Any LAN device can push arbitrary firmware. Acknowledged in SECURITY.md but not mitigated. |
WiFi password exposed via GET /api/config |
High | Retrieves the site WiFi password without authentication |
MQTT password exposed via GET /api/config |
High | Same as above |
| WiFi password logged to serial | Critical | Full config including password logged at INFO level |
| No input size limit on POST body | Medium | AsyncCallbackJsonWebHandler default max size is ~16 KB; malformed large requests can cause heap issues |
| No rate limiting on settings endpoints | Low | Repeated POST /api/config calls are accepted without limit |
| Config can be replaced entirely via POST | Medium | No authentication prevents a LAN attacker from changing WiFi credentials, MQTT, pins |
| Rain sensor always returns 0 via ASCOM | High | Observatory safety software gets wrong data (safety impact, not security) |
What is Acceptable for a LAN Hobby Device¶
- No HTTPS — acceptable for LAN, but document that credentials travel unencrypted
- No UI authentication — acceptable if LAN is trusted; document the assumption
- Open OTA endpoint — not acceptable even for hobby use; the trivial fix (set a password) should be done
Recommended Mitigations (Prioritized)¶
- Immediate: Set ArduinoOTA password (exposes via config or hardcoded)
- Short-term: Mask passwords in
GET /api/configresponse - Short-term: Remove credential logging from
Config::save() - Medium-term: Add optional HTTP Basic Auth to the web server (single username/password in config)
- Nice-to-have: Add CSRF token for config-mutating endpoints (POST /api/config, POST /api/restart)
Observability Audit¶
What Exists¶
- Free heap (current)
- CPU frequency
- Flash and sketch size
- LittleFS usage
- Uptime (seconds)
- WiFi RSSI, SSID, IP, MAC
- Firmware name, version, build date
- Sensor initialized/status/lastUpdate per sensor
- NTP sync status, last/next sync, drift, server
- MQTT connected/state/lastPublish
- Partition layout (OTA slots, NVS usage)
What Is Missing¶
| Missing Metric | Why Needed |
|---|---|
minFreeHeap |
Detect past heap pressure events; required for stability diagnosis |
resetReason |
Distinguish normal boot from watchdog, panic, brownout |
bootCount |
Track how often device reboots; indicates reliability |
wifiReconnectCount |
Track connection instability |
mqttReconnectCount |
Track MQTT stability |
lastSensorError per sensor |
Last error message per sensor for field debugging |
sensorReadCount / sensorErrorCount |
Track sensor reliability over uptime |
dataTimestamp in WS payload |
Clients can detect stale data |
ntpLastSyncEpoch as absolute time (not millis) |
Meaningful for external consumers |
taskHighWaterMark |
FreeRTOS stack usage for each task |
Recommended Diagnostics Payload (additions to /api/status)¶
{
"diagnostics": {
"resetReason": 1,
"bootCount": 42,
"minFreeHeap": 98304,
"maxAllocHeap": 65536,
"wifiReconnectCount": 3,
"mqttReconnectCount": 1,
"sensorErrors": {
"tsl2591": 0,
"bme280": 2,
"mlx90614": 0,
"gps": 0,
"rg15": 0
}
}
}
Recommended Remediation Roadmap¶
Phase 0: Urgent Fixes (1–3 days)¶
These must be fixed before the device is used in any observatory automation or safety role.
Tasks:
1. Fix Settings.tsx to use POST instead of PUT for config save (S, UI)
2. Fix MQTT client ID — use MAC-based identifier (S, firmware)
3. Remove passwords from Config::save() log output (S, firmware)
4. Implement TCPServer::handleRainRate() using RG-15 sensor data (S, firmware)
5. Set ArduinoOTA password from config (S, firmware)
Rationale: Items 1, 2, 4 are confirmed bugs that produce wrong behavior. Item 3 is a security issue. Item 5 is a security risk for shared networks.
Dependencies: None — all are self-contained changes.
Expected Outcome: Config saves work, MQTT connects reliably, ASCOM rain rate is correct, credentials not leaked to serial.
Phase 1: Stabilise Foundations (1–2 weeks)¶
Tasks:
1. Fix async blocking — move WiFi connect, WiFi scan, restart, and RG15 startup delays out of async callbacks (M, firmware)
2. Add sensor data mutex — protect sensor struct reads from the async WebServer task (M, firmware)
3. Change serializeJson for NVS config (remove pretty-print) (S, firmware)
4. Add dataTimestamp to WebSocket sensor payload and sync broadcast interval to sensor poll interval (S, firmware)
5. Mask passwords in GET /api/config response (M, firmware)
6. Add minFreeHeap, resetReason, bootCount to /api/status (S, firmware)
7. Fix MQTT timestamp to use epoch time from TimeManager (S, firmware)
8. Add MQTT LWT/availability topic (S, firmware)
9. Fix screenshots Playwright tests — remove waitForTimeout, remove continue-on-error (M, CI/CD/UI)
10. Pin ESPAsyncWebServer to a specific commit hash (S, build)
Rationale: Items 1–2 address confirmed crash/reliability risks. Items 3–6 are quick wins that significantly improve operability. Items 7–8 are needed before Home Assistant integration is reliable.
Dependencies: Item 2 should precede item 1 (locking must be in place before concurrency changes). Item 8 requires topic schema decision.
Phase 2: Architecture and Testing (2–4 weeks)¶
Tasks:
1. Unify API/WebSocket/MQTT field naming to a single canonical schema (L, firmware)
2. Add firmware unit tests with PlatformIO test framework — SkyQuality, CloudDetection, RG15 parser, Config round-trip (L, firmware)
3. Add Playwright content-assertion tests for API contract (M, UI/CI)
4. Add ASCOM TCP server documentation (M, docs)
5. Add config schema reference documentation (M, docs)
6. Update REST API docs to match actual response shapes (S, docs)
7. Update WebSocket docs to include rainSensor (S, docs)
8. Update MQTT docs with full payload example (S, docs)
9. Fix HDOP unit inconsistency between endpoints (S, firmware)
10. Move GPS-disabled sensor creation to not instantiate GPS sensor at all (remove "disabled GPS sensor for API consistency" pattern — replace with nullptr + null checks) (M, firmware)
Rationale: Unified schema prevents integration failures. Tests provide regression coverage for the sensor math. Docs must match code for external integrators.
Phase 3: Polish, Docs, Release Maturity (ongoing)¶
Tasks:
1. Add HTTP Basic Auth option to web server (M, firmware)
2. Add N.I.N.A./ASCOM setup guide (M, docs)
3. Add Home Assistant integration guide (expanded from current MQTT docs) (M, docs)
4. Add Node-RED flow examples (M, docs)
5. Add PWA manifest for mobile home-screen install (S, UI)
6. Add exponential backoff to useWebSocket reconnect (S, UI)
7. Add logLevel to config, default to INFO (S, firmware)
8. Add firmware binary size gate to CI (fail if > 90% of partition) (S, CI/CD)
9. Add Dependabot for npm and GitHub Actions deps (S, CI/CD)
10. Add CHANGELOG entry for GPS implementation (S, docs)
11. Add accessibility improvements — ARIA labels, minimum touch targets (M, UI)
12. Fix TCPServer WiFi reconnect — restart server when WiFi reconnects (S, firmware)
Suggested GitHub Issues¶
Firmware¶
-
[Bug] Settings save silently fails — UI uses PUT but firmware only accepts POST for /api/config — Settings page sends
PUT /api/config. ESPAsyncWebServer only has a POST handler registered. The SPA fallback servesindex.htmlwith 200, so the save appears to succeed but config is not updated. -
[Bug] MQTT client ID is set to topic name, causing connection conflicts —
MQTTClient::connect()passesconfig.topic.c_str()as client ID. Should be a unique device identifier based on MAC address. -
[Bug] ASCOM TCP rain rate command (:051) always returns 0.0 — TCPServer has a TODO stub for rain rate. RG-15 sensor is available but not wired to TCPServer.
-
[Critical] Blocking delay() calls inside ESPAsyncWebServer async callbacks — WiFi connect, WiFi scan, OTA restart, and RG15 reconfiguration all call delay() inside async request handlers. This starves the network stack and can trigger the watchdog.
-
[Bug] Race condition: sensor readings read from async WebServer task while written in loop() task — No mutex protects sensor struct access between the Arduino loop task and the ESPAsyncWebServer async task.
-
[Security] WiFi and MQTT passwords logged to serial at INFO level — Config::save() logs the full config JSON including credentials. Should log byte count only.
-
[Improvement] Config stored as pretty-printed JSON in NVS — risks exceeding 4KB limit — serializeJsonPretty should be replaced with serializeJson. As more sensor configs are added, the NVS key size limit will be hit.
-
[Improvement] Add minFreeHeap, resetReason, bootCount to /api/status — Critical for field debugging of crashes and reboots.
-
[Bug] LittleFS.begin(true) silently formats filesystem on mount failure — Should fail explicitly rather than silently erasing all web files.
API/Telemetry¶
-
[Bug] MQTT timestamp uses millis() instead of Unix epoch — Home Assistant and time-series databases expect epoch timestamps. millis() resets on every reboot.
-
[Feature] Add MQTT LWT/availability topic — Without a Last Will and Testament, subscribers cannot detect when the device goes offline.
-
[Improvement] Unify field names across REST, WebSocket, and MQTT —
lightSensorvslight,skyQualityvssky, etc. causes fragile integrations. -
[Security] GET /api/config returns WiFi and MQTT passwords in plaintext — Should mask credential fields in the API response.
UI¶
-
[Bug] Rain sensor unit label hardcoded as mm/hr regardless of configured units — Dashboard shows
mm/hreven when imperial mode is configured. -
[Bug] Validation error count divides by 2 — fragile assumption —
Object.keys(validationErrors).length / 2is an undocumented hack that will break if path deduplication logic changes. -
[Improvement] Add data-age indicator to Dashboard — WebSocket broadcasts stale data 4 of every 5 seconds at default settings. UI should show when the last reading was fresh.
Testing¶
-
[Task] Add firmware unit tests for SkyQuality, CloudDetection, RG15 parser — Zero firmware unit tests currently exist. SkyQuality and CloudDetection are pure functions with no hardware dependencies — ideal for unit testing.
-
[Task] Add API contract tests — Playwright or Jest tests that verify the mock handler response shapes match TypeScript types.
-
[Bug] Playwright screenshot tests use waitForTimeout — replace with proper wait conditions —
await page.waitForTimeout(2000)is explicitly discouraged by Playwright and causes flaky tests.
Docs¶
-
[Bug] REST API docs show wrong /api/status response shape — Docs show flat
{uptime, freeHeap, rssi, ip, version}but actual response has nested objects. -
[Bug] /api/wifi/scan docs show wrong response shape and field name — Docs show array with
securefield; actual response is{networks: [...]}withencryptionfield. -
[Task] Document ASCOM TCP server (port 2020) commands — No documentation exists for the TCP/ASCOM protocol implementation.
-
[Bug] CHANGELOG lists GPS as unreleased but it is implemented — Move GPS from
[Unreleased]to[0.0.1]. -
[Task] Add config schema reference — Complete table of all config keys, types, defaults, and valid ranges.
CI/CD¶
-
[Bug] Screenshot Playwright CI step silently ignores failures —
continue-on-error: truehides regressions in the demo app. -
[Improvement] Pin ESPAsyncWebServer to specific commit hash — Currently pinned to GitHub HEAD, which is not reproducible.
-
[Improvement] Key npm cache on package-lock.json, not package.json — Transitive dep changes aren't caught by current cache key.
Security¶
-
[Security] ArduinoOTA has no password — Any LAN device can push arbitrary firmware. Add password from config.
-
[Security] Unauthenticated settings endpoints can change WiFi credentials — Consider optional HTTP Basic Auth for config-mutating endpoints.
Observability¶
-
[Feature] Add boot diagnostics endpoint — resetReason, bootCount, minFreeHeap, wifiReconnectCount for field debugging.
-
[Feature] Add lastSensorError per sensor to /api/status — Currently no way to see the last error message from a failing sensor without a serial monitor.
Final Recommendation¶
What is Solid¶
SQMeter has a genuinely good foundation. The sensor abstraction (SensorBase) is clean and extensible. The Preact UI is lightweight, mobile-friendly, and has better form validation than most hobby ESP32 projects. The CI/CD pipeline — including release binaries, complete flash images, and a live demo on GitHub Pages — is well above average. The RG-15 rain sensor integration is thorough and well-documented. The firmware uses C++17 properly with std::unique_ptr, std::optional, and namespacing throughout.
What is Risky¶
The device should not be used for observatory automation safety decisions in its current state. Specifically:
- The ASCOM TCP rain rate command returns 0.0 always — a roof safety interlock will not close during rain.
- The settings page save does not work (PUT vs POST mismatch).
- The async blocking in the WiFi connect and OTA restart handlers can destabilize the network stack during configuration operations.
- There is a race condition between sensor reading and JSON serialization that can produce garbled responses or crashes.
- Site WiFi passwords are logged to serial and returned via API.
What Must Be Fixed Before Adding More Sensors¶
Before adding another sensor (wind, all-sky camera, dew heater controller), the following must be addressed:
- Fix the sensor concurrency model — add a mutex before adding more sensor reads. More sensors = more race condition surface.
- Unify the field naming schema — adding a fourth naming convention for a fifth sensor will make MQTT/REST/WS completely unmaintainable.
- Fix the NVS config size issue — a wind sensor config block will push the pretty-printed JSON over 600 bytes; with 3-4 more sensors it will exceed the NVS limit.
- Add at least basic unit tests for sensor math — adding sensors without tests means regressions in existing calculations (SQM, cloud detection) cannot be caught.
What Must Be Fixed Before Observatory Automation/Safety Reliance¶
- All Phase 0 issues (3 days of work)
- ASCOM rain rate implementation (already Phase 0)
- Sensor data mutex (Phase 1)
- MQTT LWT/availability (Phase 1)
- Documented and tested sensor failure states — the system must distinguish "sensor unavailable" from "sensor reads 0" clearly in all telemetry channels
After Phase 0 and Phase 1, the device would be suitable as a monitoring device for a human-supervised observatory. For fully automated observatory control (unattended roof, equipment safety), complete the Phase 2 testing work and conduct field testing over at least one full season before relying on it for equipment protection decisions.
End of audit. This report was generated by automated deep inspection of the codebase at commit 0434605. All findings cite specific files and line numbers verified by reading the actual source. Issues marked "likely" were inferred from code patterns; all others were confirmed by direct code reading.