FIRST AWS, NOW MICROSOFT CLOUD, WHO IS NEXT?
Now let’s turn to the Microsoft episode. On Tuesday, March 21st (what is it with Tuesdays?!), Outlook, Hotmail, OneDrive, Skype, and Xbox Live were all significantly impacted, with trouble ranging from being unable to log in to degraded services. True to form, Microsoft’s response was to downplay the impact and provide little detail (by contrast, and to be fair, Amazon provided a much more detailed post-mortem): “A subset of Azure customers may have experienced intermittent login failures while authenticating with their Microsoft accounts. Engineers identified a recent deployment task as the potential root cause. Engineers rolled back the recent deployment task to mitigate the issue.”
So, is this the death of public cloud? No! Far from it. And anyone who says otherwise should have their head examined. But it should serve as a wake-up call to every IT, security and compliance professional across every industry. Why? Because this kind of ‘user error’ or ‘deployment task snafu’ can happen anywhere: on-prem, on private cloud and on public cloud. And since every enterprise runs on one or more of the above, every enterprise is at risk. So enough of the fear mongering (sorry!). What can you do about it? Glad you asked.
There are really three vectors of control - Scope, Privileges and Governance Model.
Scope is really the number of ‘objects’, aka the blast radius, that each admin (or script) is authorized to work on at any given time. Using the Microsoft Cloud example (I am extrapolating since they have not provided any details), this may be the number of containers a deployment task can operate on at any given time.
Privileges are about controlling what an administrator or task can do to the object. Continuing with the container example from above, the privilege restriction could be that a container can be launched but not destroyed.
And finally, you need a governance model. This is really the implementation of best practices and a well-defined policy for enforcing the above two functions, scope oversight and privilege control, in a self-driven fashion. In this example, the policy could be to ensure that the number of containers an admin can operate on remains under 100 (scope) and that any increase in that number automatically requires a pre-defined approval process (control). Further sophistication can easily be built in: the approver need not be human but could be a bot that checks the type of container and the load on the system and approves (or denies) the request. Bottom line: checks and balances.
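To make the three vectors concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the names `Grant`, `Request`, `bot_approver` and `authorize`, the "stateless" container type and the 0.7 load threshold are my assumptions, not anything Microsoft or Amazon has described. It simply encodes the example policy above: stay under 100 containers, allow launch but not destroy, and send anything over the limit to an approver bot.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Grant:
    """What one admin or deployment task may do (hypothetical model)."""
    scope_limit: int                 # scope: container count must stay under this
    allowed_actions: FrozenSet[str]  # privileges: e.g. {"launch"} but not "destroy"


@dataclass(frozen=True)
class Request:
    """A single operation an admin or script wants to perform."""
    action: str
    container_type: str
    count: int


def bot_approver(request: Request, system_load: float) -> bool:
    """Hypothetical approval bot: checks container type and system load.

    The 'stateless' type and the 0.7 threshold are made-up values.
    """
    return request.container_type == "stateless" and system_load < 0.7


def authorize(
    grant: Grant,
    request: Request,
    system_load: float,
    approver: Callable[[Request, float], bool] = bot_approver,
) -> str:
    """Governance: enforce privileges first, then scope, then escalate."""
    # Privileges: is this action allowed on the object at all?
    if request.action not in grant.allowed_actions:
        return "denied: privilege"
    # Scope: within the limit, no approval is needed.
    if request.count < grant.scope_limit:
        return "allowed"
    # Over the limit: the pre-defined approval process decides.
    if approver(request, system_load):
        return "allowed: approved"
    return "denied: scope"
```

With a grant of `Grant(scope_limit=100, allowed_actions=frozenset({"launch"}))`, a request to launch 50 containers goes straight through, a request to destroy even one is refused, and a request to launch 250 is only allowed if the bot signs off. The point of keeping the approver as a plain function argument is that a human workflow and a bot can be swapped without touching the scope or privilege checks.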
So there you have it. Two of the largest public clouds have suffered embarrassing outages in the past month. They will recover, get stronger, and most likely have future outages as well. The question for the rest of us is what we learn from their experience, and how we use it to make our environments better, in our own data centers and on private and public clouds. If we don’t, we may not be lucky enough to survive to fight another day.