In this article, Fabio Mora, Software Engineer, Agile Coach, DevOps expert and author, delves into some of the more practical and technical aspects of the Site Reliability Engineer profession and some of its fundamental concepts, in particular reliability.
Reliability
Site Reliability Engineering (SRE) means working on the most important functionality of a system: reliability, a “feature” that precedes any other. To illustrate its importance, imagine that you need to use a service whose operation depends, in whole or in part, on computer systems, electronics, telecommunications and related technologies. For any online service, reliability is critical.
While the assistant that wakes you up in the morning and streams your favorite radio station may not be essential, the smartphone that lets you interact with relatives and friends and manage documents and appointments is definitely critical to the quality of your daily life.
If your smartphone is “not available” and apps such as your bank account, Google, social networks and Wikipedia do not work, your daily routine could quickly run into problems. With varying degrees of criticality, these are, under the hood, very sophisticated platforms that interact with each other, balance themselves, and often consist of millions or even billions of lines of code and a vast number of hardware devices.
The functionality that devices and apps offer corresponds to possibilities in the real world. They should therefore be kept efficient and responsive for those who use them, with a quality of service that lives up to their needs. This is called reliability.
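To make the idea of a quality of service that lives up to users' needs more concrete, here is a minimal sketch of one common way to put a number on reliability: availability, the share of requests that were served successfully. The function names, the 99.9% target and the 30-day period are assumptions chosen for illustration, not figures taken from this article.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully.

    With no traffic at all there were no failures, so 1.0 is returned
    (an assumption made for this sketch).
    """
    if total_requests < 0 or not 0 <= successful_requests <= total_requests:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Return how many minutes of unavailability a target tolerates per period.

    `target` is a hypothetical availability objective such as 0.999.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    minutes_in_period = period_days * 24 * 60
    return (1.0 - target) * minutes_in_period


if __name__ == "__main__":
    # 999 successful requests out of 1000 is an availability of 99.9%.
    print(f"availability: {availability(999, 1000):.3%}")
    # A 99.9% target leaves roughly 43 minutes of downtime in 30 days.
    print(f"downtime budget: {allowed_downtime_minutes(0.999):.1f} minutes")
```

Even a seemingly strict target leaves only a small margin for failure, which is why reliability has to be worked on continuously rather than only when something breaks.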
The immensity of the system
Downloading a file from your Drive may appear to be a simple gesture, but behind it lies a very long chain of events: starting from the mobile radio network, the data flow travels encapsulated and encrypted through optical fiber and transoceanic cables that carry it within milliseconds to a remote datacenter, and back. In turn, there are data links that allow these infrastructures to communicate with each other and that provide network services and hardware, and, even further upstream, the networks that supply energy and gas.
The same is true well beyond file downloads: from in-store POS payments to ticketing services, from railway, motorway, aeronautical and civil signaling networks to remote surgery and medical diagnoses in the cloud. It also applies to the logistics of each package delivered by courier, the work of the “riders”, and the temperature tracking of food and drug transport. All of these are pivotal platforms for entire sectors and for the quality of personal life: industry, communication, education, marketing, media, health, public administration, democratic processes (almost the entire service sector) and beyond.
Possible drawbacks
There are many things that can go wrong. First, systems become inherently unstable over time. Due to their incredible complexity, they tend to break down, and continuous work is needed to prevent this. Work on the systems and their updates must not be carried out only when “accidents”, or events of an exceptional nature, occur; it must be part of business as usual.
As part of business as usual, these activities can prevent inertia, obsolescence and technical debt. These are all demons that threaten not only the quality of the services, but also the possibility of continuing to introduce changes to them. The job of the SREs is to keep the systems stable, while that of the programmers is to write and maintain the functionality of products, with continuous software releases. Each release, however, could introduce new errors and additional complexity.
Value of SRE
Reliability is the upstream feature of any system. However, it is also a difficult feature to communicate because, when it is present, it is easily taken for granted, and it is hard to consistently give it the importance it deserves. To correct this small cognitive bias, the roles and organizational structures of the SREs are often autonomous with respect to the Software Engineers, who instead build the products.
The value attributed to the SRE, therefore, lies in keeping these products stable, error-free, maintainable and usable for the user, no matter what is happening. What an SRE ultimately offers to their organisation and to the users of its products is a guarantee of stable production systems, maintainable software and a high level of service quality. All this regardless of external conditions, be it traffic peaks or continuous releases of new features.
If you want to learn more about this topic, read our blog “ITIL, SRE, DevOps: Similarities and Differences”.