2019-10-28
A System Management Principle
I have a system management principle that I’ve never seen enunciated anywhere, and which I think is important:
If you get a report of a system problem from a user before an appropriate alert within some reasonable alerting interval, then you have at least two problems to fix.
The first is, obviously, the reported problem (assuming it’s a real problem).
The second is the failure of your monitoring and reporting system. You want to arrange an alert for such a problem in future (unless the combination of the feasibility of providing the alert and its assumed frequency and impact says it’s not worth it).
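As a rough illustration, the principle can be written down as a small check run over each incident. This is only a sketch of one reading of it: the `Incident` record, the `problems_to_fix` function and the `tolerance` parameter (standing in for the "reasonable alerting interval") are hypothetical names, not part of any real monitoring tool, and it assumes the user's report has already been judged a real problem.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical record of one incident (names are illustrative only)."""
    user_report_at: datetime
    alert_at: Optional[datetime]  # None if no alert ever fired


def problems_to_fix(incident: Incident, tolerance: timedelta) -> List[str]:
    """List what needs fixing for an incident a user reported.

    Assumption: an alert counts as timely if it fired before the user's
    report, or no more than `tolerance` after it.
    """
    # The reported problem itself (assumed here to be a real problem).
    problems = ["reported problem"]

    deadline = incident.user_report_at + tolerance
    if incident.alert_at is None or incident.alert_at > deadline:
        # The user beat the monitoring system: that is a second problem.
        problems.append("monitoring gap")
    return problems


if __name__ == "__main__":
    reported = datetime(2019, 10, 28, 9, 0)
    no_alert = Incident(user_report_at=reported, alert_at=None)
    print(problems_to_fix(no_alert, timedelta(minutes=5)))
```

Whether the "monitoring gap" entry is worth acting on is still subject to the feasibility, frequency and impact judgement above; the check only flags that the question needs asking.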
Then, considering those failures may suggest related problems to forestall, as in van Vleck’s Three Questions About Each Bug You Find:
Is this mistake somewhere else also?
What next bug is hidden behind this one?
What should I do to prevent bugs like this?
Actually, I’m inclined to add a fourth:
Does a proposed fix involve a (possibly subtle) incompatibility or bad interaction that might cause trouble elsewhere?
That is, you need integration testing, not simply unit testing, if you think in those terms.