Five whys – Joel on Software: After some internal discussion we all agreed that rather than imposing a statistically meaningless measurement and hoping that the mere measurement of something meaningless would cause it to get better, what we really needed was a process of continuous improvement. Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we’re doing to prevent that problem in the future. In this case, the change is that our internal documentation will include detailed checklists for all operational procedures in the live environment. [Joel ran across the checklist article in the New Yorker and is putting it to use. Smart. So is understanding the value of service and where it makes sense to live on that continuum. Anyway, all this ties in nicely to the thoughts about system failures and people, with further proof that you cannot engineer out the occurrence of a collapse.]