If a single person can make the system fail then the system has already failed.
It’s never a single person who caused a failure.
The CrowdStrike CEO should go to jail. The corporation should get the death sentence.
Edit: For the downvoters: they for real negligently designed a system that killed people when it failed. The CEO, as an officer of the company, holds liability. If corporations want rights like people, then when they are grossly negligent they should be punished. We can’t put them in jail, so they should be forced to divest their assets and be “killed.” This doesn’t even sound radical to me; it sounds like a basic safeguard against corporate overreach.
Microsoft also started blaming the EU. It’s such a shitshow, it’s ridiculous.
Sure, it is the dev who is to blame, and not the clueless managers who evaluate devs based on number of commits/reviews per day, or the CEOs who think such managers are on top of their game.
Imagine if he/she was Russian or Chinese…
Note: Dmitry Kudryavtsev is the article author, and he argues that the real blame should go to the CrowdStrike CEO and other higher-ups.
If only we had terms for environments that are meant for testing, staging, and early release, before an update moves over to the servers that are critical…
I know, it’s crazy, really a new system that only I came up with (or at least it seems I could sell it to CrowdStrike as one).
It’s a systemic, multi-layered problem.
The simplest, least-effort thing that could have limited the scale of the issues is not installing updates automatically, but waiting four days and triggering the install afterwards if no issues have surfaced.
Automatically forwarding updates also forwards risk. The larger the impact area, the more worthwhile safeguards are.
Testing/staging or partial successive rollouts could also have mitigated a large number of issues, but they require more investment.
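To make the delayed, ring-by-ring idea concrete, here is a rough sketch in Python. Everything in it is hypothetical: the ring names, the `may_install` helper, and every delay except the four-day wait for critical hosts are assumptions for illustration, not anything CrowdStrike or any real product exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Ring:
    name: str
    delay: timedelta  # minimum age of an update before this ring may install it


# Hypothetical rollout rings. Only the four-day wait for critical hosts
# comes from the comment above; the other delays are made up.
RINGS = {
    "test": Ring("test", timedelta(0)),
    "staging": Ring("staging", timedelta(hours=12)),
    "early-release": Ring("early-release", timedelta(days=1)),
    "critical": Ring("critical", timedelta(days=4)),
}


def may_install(ring_name: str, released_at: datetime, now: datetime,
                open_issues: int) -> bool:
    """Return True if hosts in the given ring may install the update now."""
    if open_issues > 0:
        # Any issue reported by an earlier ring blocks every later install.
        return False
    return now - released_at >= RINGS[ring_name].delay


released = datetime(2024, 7, 19, 4, 0)
one_day_later = released + timedelta(days=1)
print(may_install("test", released, released, open_issues=0))           # True
print(may_install("critical", released, one_day_later, open_issues=0))  # False
```

The point of the design is that critical hosts never see an update that earlier rings have not already survived for a while; the cost is exactly the delay discussed in this thread.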
The update that crashed things was an anti-malware definitions update. CrowdStrike offers no way to delay or stage those (they are downloaded automatically as soon as they are available), and there’s good reason for not wanting to delay definition updates, as it leaves you vulnerable to known malware for longer.
And there’s a better reason for wanting to delay definition updates: this outage.
Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.
That means spending time and money on developing such a system, which means increasing costs in the short term… which is kryptonite for current-day CEOs.
Right. More than money, I say it’s about incentives. You might change the entire C-suite, management, and engineering teams, but if the incentives remain the same (e.g. developers are evaluated by number of commits), the new staff is bound to make the same mistakes.
I strongly believe in no-blame mindsets, but “blame” is not the same as “consequences” and lack of consequences is definitely the biggest driver of corporate apathy. Every incident should trigger a review of systemic and process failures, but in my experience corporate leadership either sucks at this, does not care, or will bury suggestions that involve spending man-hours on a complex solution if the problem lies in that “low likelihood, big impact” corner.
Because, likely, when the problem happens (again) they’ll be able to sweep it under the rug (again) or will have moved on to greener pastures.
What the author of the article suggests is actually a potential fix: if developers (in a broad sense of the word, including POs and such) were accountable (both responsible and empowered), then they would have the power to say No to shortsighted management decisions (and/or deflect the blame in a way that would actually stick to whoever went against an engineer’s recommendation).