OK, so Loki from Grafana Labs is *sexy*. It solves a lot of the alerting and analytics problems I set out to fix on our shipped systems, without further burdening the overwhelmed software team with adding a metrics API to our codebases.
And as a plus, Loki (well, Promtail) can also eat syslog, systemd/journald, and Windows Event Log formats.
Combined with the time-series system health data we already collect and the relational databases we send out to the plant, we can answer questions like:
* "Does the inspection cycle time change with CPU temperature?"
* "What source file throws the most errors? Is it dependent on the inspection recipe?"
* "Can we send an alert when any inspection cycle time, per recipe, goes above 80% of the allotted cycle time for more than 5 cycles in a row?"
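
To make that last one concrete, here is a rough Python sketch of the rule itself. It assumes the cycle times have already been pulled out of the logs as `(recipe, cycle_time)` pairs in chronological order; the function name `recipes_to_alert` and the `allotted` lookup table are made up for illustration and are not part of Loki or Promtail.

```python
from collections import defaultdict


def recipes_to_alert(cycles, allotted, fraction=0.8, run_length=5):
    """Return the recipes with more than `run_length` slow cycles in a row.

    cycles:   iterable of (recipe, cycle_time) in chronological order
              (assumed to be already parsed out of the logs).
    allotted: dict mapping recipe -> allotted cycle time, same units.
    """
    streak = defaultdict(int)  # current run of slow cycles, per recipe
    alerted = set()
    for recipe, cycle_time in cycles:
        if cycle_time > fraction * allotted[recipe]:
            streak[recipe] += 1
            if streak[recipe] > run_length:
                alerted.add(recipe)
        else:
            # One cycle back under the threshold resets the run.
            streak[recipe] = 0
    return alerted


if __name__ == "__main__":
    allotted = {"A": 10.0, "B": 10.0}
    cycles = [("A", 9.0)] * 6 + [("B", 9.0)] * 5 + [("B", 1.0)]
    print(recipes_to_alert(cycles, allotted))
```

The streak is tracked per recipe, so interleaved recipes on the same machine don't reset each other's counts. In practice this logic would more likely live in an alerting rule than in a script, but the counting is the same.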
As well as the usual statistical process control questions like:
* "Which recipes have the worst reject rates?"
* "Are there time- or shift-dependent changes in judgement accuracy?"
* "Is this measurement trending out of tolerance?"
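
As a rough illustration of the trending question, here is a minimal Python sketch that fits a straight line to equally spaced measurements and estimates how many more samples it would take for that line to cross a tolerance limit. This is a deliberately naive stand-in, not a proper control chart; the function name and the `lower`/`upper` limits are hypothetical.

```python
def samples_until_out_of_tolerance(values, lower, upper):
    """Estimate how many more samples until a fitted line leaves [lower, upper].

    values: equally spaced measurements, oldest first (at least two).
    Returns None if the fitted line is flat, otherwise a non-negative float.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements to fit a line")

    # Ordinary least-squares fit of value against sample index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope == 0:
        return None  # no trend in either direction

    # Fitted value at the most recent sample, then distance to the limit
    # the line is heading towards.
    fitted_last = intercept + slope * (n - 1)
    limit = upper if slope > 0 else lower
    return max((limit - fitted_last) / slope, 0.0)


if __name__ == "__main__":
    readings = [10.0, 10.1, 10.2, 10.3, 10.4]
    print(samples_until_out_of_tolerance(readings, lower=9.5, upper=11.0))
```

A result of `0.0` means the fitted line is already at or past the limit. Noisy data will want a longer window or a real SPC rule set, but this is enough to flag measurements worth a closer look.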