Innovation in... Operating Systems?

For most of us - pretty much all of us - the way we use our operating system (OS) on a laptop is not that different from how we use it on a mobile device, or how a system administrator uses it on a server:

  1. The operating system is installed to the local disk.
  2. Changes / upgrades are performed by installing files to the same disk and then rebooting.
  3. Software is installed and/or upgraded by installing files to the same disk.

In principle, this is not that different from what we were doing way back with early versions of Mac OS or Windows, or even DOS.

As annoying and painful as the process can be on laptops or mobile devices, those at least are one-to-one: each of us manages roughly one laptop, and the updates are not too frequent.

To boot (pun intended), much of the process has been simplified over the years, between InstallShield-style installers on Windows, Mac OS X's drag-and-drop .app bundles, and Linux's various package managers with their .rpm and .deb formats.

However, at heart, these improve the process of installing packages and upgrades; they do not improve the result or the resiliency of the installed system. Once you make a change, it is extraordinarily difficult to roll back:

https://twitter.com/msuster/status/724029873690316800

For those who manage servers, this issue is orders of magnitude worse for two key reasons:

  1. The ratios are much higher. Unlike the one-to-one ratio above, administrators manage many machines each. Server-to-admin ratios were around 10:1 in the 90s (except at places that heavily engineered them, as we did at Morgan Stanley), hundreds-to-one in the early 2000s, and thousands- or even tens-of-thousands-to-one nowadays.
  2. The risk is much higher. If your laptop has an error or is misconfigured, you suffer. If a server is misconfigured, it can affect millions of users and a company's revenue. Multiply that by the thousands or tens of thousands of servers you manage (see #1 above), and the risk can be very high indeed.

As a result, several "configuration management" tools have sprung up over the last decade-plus, notably Puppet and Chef, then Ansible and Salt. These manage the installed configuration of the system and applications.

The next iteration is containers, led by Docker. Abstracting the application away from the underlying operating system makes it portable, while at the same time forcing a clean separation of immutable code from mutable data; done correctly, the combination can have a significant impact on a company's operations.
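To make that separation concrete, here is a minimal sketch using stock Docker commands. The image name, tag, volume name, and data path are all illustrative placeholders, not any particular product:

    # Immutable code: the application ships as a versioned, read-only image.
    # Mutable data: state lives in a named volume, outside the image.
    docker run -d --name app -v appdata:/var/lib/app myapp:1.0

    # An upgrade replaces the immutable image; the data volume survives untouched.
    docker stop app && docker rm app
    docker run -d --name app -v appdata:/var/lib/app myapp:1.1

Rolling back is the same operation with the old tag - which is precisely the property the rest of this article chases for the OS itself.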

Ubuntu, a popular Linux distribution, or "distro", is following a similar path with "snaps" in its latest release, 16.04 LTS (despite issues with snap security for graphical applications, as raised by Matthew Garrett), although snaps were available earlier in its cloud- and IoT-focused Snappy Ubuntu Core.

Separating the application and all its dependencies from the OS enables us to deal with the OS alone, without worrying about applications changing (and breaking) it. However, those apps, even in containers, still need an OS on which to run. That OS, in turn, has its own configuration, updates and everything else that changes and can break.

In short, the OS itself is an application to manage.

Fortunately, in the last several years, a fresh wave of innovation in how operating systems are designed and managed has taken root.

Actually, as I use these, I begin to wonder why we cannot have these safer paradigms on our laptops.

Let's explore a few of these and how they innovate:

SmartOS

SmartOS is a version of OpenSolaris engineered by Joyent. As an aside, the Joyent team is one of the smartest collections of engineers I have ever met. SmartOS's main purpose is to serve as the basis for their bare-metal distributed container platform, Triton. Triton is fully open-source, and can be run on-premises, supported by Joyent, or in the public cloud.

From an operating system perspective, though, the relevant innovation in SmartOS is the immutability of the OS itself.

The entire OS is run from a single read-only USB key (or CD - it is all of 160MB). While all of the changeable data - containers and configuration - is saved to local disks, the operating system itself is immutable. You don't "change" or "upgrade" the OS; you just replace it. Say you are running version 20160414. To upgrade to 20160422, you just download that release as an image, write it onto a USB key, swap it for the one in the machine, and boot. If you are unhappy with 20160422, put back the USB key with 20160414 and go.
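The whole "upgrade" fits in a few shell commands. The download URL and the device name below are placeholders - check Joyent's release notes and your own device list before running anything like this:

    # Fetch the new release image (URL is illustrative, not Joyent's actual path).
    curl -O https://example.com/smartos/smartos-20160422.img

    # Write it to a spare USB key (double-check the device name first!).
    dd if=smartos-20160422.img of=/dev/sdX bs=4M

    # Swap the keys and reboot. To roll back, boot from the old key instead.

Note what is missing: no package transactions, no half-applied updates, no migration scripts. The unit of change is the whole OS image.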

In other words, you never upgrade. You always replace in toto.

I was so inspired by the design that I adopted it for the SecureAppliance initiative.

CoreOS

CoreOS is a Linux OS that is optimized for running Linux containers. CoreOS the company actually released a container engine - rkt - one of the few serious attempts to provide an alternative to Docker. Additionally, CoreOS the company releases several other excellent open-source tools for managing at cloud scale, like fleet and etcd.

CoreOS provides a different take on the same idea as SmartOS: an immutable operating system. However, rather than requiring you to download the new version and replace the installed USB key, the OS is installed to two separate areas of your disk ("partitions"). CoreOS always keeps one of them active. When a new version is available, it downloads the new version to the inactive partition and, when ready, switches which partition is active. If there is an issue, you can always just switch back.
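Roughly, it looks like this on disk. CoreOS inherits the cgpt GPT utility from ChromiumOS; the partition numbering below is illustrative, and the exact invocations are from memory of the CoreOS docs, so verify them against your release:

    # Show the GPT entries, including the priority flags on the two
    # /usr partitions (USR-A and USR-B).
    cgpt show /dev/sda

    # The update engine writes the new OS image to whichever partition is
    # inactive; the bootloader then boots the one with the higher priority.
    # A rollback is just re-prioritizing the previous partition, e.g.:
    cgpt prioritize /dev/sda3

Because the running OS lives on a partition that is never written to while active, an interrupted download or a bad release can never leave you with a half-upgraded system.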

Once again, you never upgrade. You always replace in toto, but the OS itself handles both versions for you.

RancherOS

RancherOS, a product of the company behind the excellent container orchestration platform Rancher, is also a Linux OS that is optimized for running containers. In that respect, it is similar to CoreOS. It is even smaller than the systems above, coming in at just 32MB.

However, rather than dedicating two disk areas to OS versions (with a maximum of two at any given moment), or replacing an installed CD/DVD/USB, RancherOS leverages Docker's image versioning and distribution system for the operating system itself. That means you can upgrade or downgrade to any given version at any given moment, and even have multiple versions available on your server at once.
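RancherOS exposes this through its ros command-line tool. The commands below are from memory of the RancherOS docs of that era, and the version tag is illustrative, so treat the exact flags as an assumption and check ros --help on your build:

    # List the OS versions available as Docker images.
    sudo ros os list

    # Upgrade (or downgrade) by activating a specific rancher/os image tag.
    sudo ros os upgrade -i rancher/os:v0.4.5

Since versions are just tagged images in a registry, "which OS versions can I run?" becomes the same question as "which image tags exist?".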

Once again, you never upgrade. You always replace in toto, but the OS handles multiple versions for you through Docker distribution.

And Whither the SysAdmin?

To some extent, the purpose of all of this innovation is to reduce risk in production systems by making upgrades and changes far more predictable, taking advantage of the separation between application and operating system that the container image paradigm provides.

But another purpose is to require fewer system administrators per system. As the definition of "large-scale" has grown from hundreds to thousands and hundreds of thousands of servers, the only way to keep managing them - let alone grow even larger - at lower risk is through better OS design and management.

This pushes further the trend away from systems administrators and towards systems engineers. As we discussed in our serverless article, this is good for the economy, very good for companies, and great for those admins who can evolve and grow into engineers.