Site Reliability Engineering, or SRE, is a trendy topic these days in the world of software development and infrastructure management.
But it’s not just the people who build applications and manage infrastructure who can benefit from SRE concepts. Data engineers, too, have much to gain from SRE principles, even though data engineering has not traditionally factored into the SRE conversation.
It’s time for that to change. If data engineers want to maximize the value that data pipelines bring to businesses, they need to learn to think like SREs.
Here’s why data engineers should adopt SRE practices, along with tips on how they can get started doing so.
The Basics of Site Reliability Engineering
The main goal of Site Reliability Engineering – a concept that originated at Google in the early 2000s but did not become widespread at other companies until the past several years – is to maximize the reliability of software applications and infrastructure.
Using practices like code-based automation and “chaos engineering” (a method of testing reliability by deliberately introducing error conditions into a system), Site Reliability Engineers, or SREs, seek to minimize downtime by maximizing a system’s ability to recover from a failure.
Traditionally, the systems that SREs managed centered on software. An SRE might help optimize the reliability of applications hosted on cloud-based infrastructure, for example, or manage a set of containerized apps running on Kubernetes.
What SRE Can Do for Data
But just because SREs have conventionally focused on software and the infrastructure that hosts it doesn’t mean their tools and practices can’t extend to data, too.
On the contrary, plenty of things can go wrong with a data pipeline or workload, just as they can with a software application. For example, data systems can suffer unexpected problems such as:
- Corrupted data that the system can’t read, triggering errors within processes that depend on the data.
- Accidental data deletion, another source of errors.
- Slow data movement, leading to delayed processing and decision-making.
- Unexpected spikes in demand for data, which can also cause delays in processing.
- The need to move a higher volume of data than a pipeline can support, resulting in processing delays or complete failures.
SRE strategies can help to mitigate the impact of these data failures in the same way that they mitigate the consequences of problems within software systems.
To be clear, I’m not saying that the typical company should hire SREs to help manage its data pipelines. SREs already have a lot on their plates, and they specialize in software systems and related infrastructure, not data pipelines.
But I do think that data engineers – meaning folks whose job is to design and manage data pipelines that move information from the places where it originates to the destinations where it is processed, stored or otherwise put to use – would do well to integrate SRE methodologies into their own work.
Four Ways to Apply SRE Principles to Data
A comprehensive guide to leveraging SRE strategies for data pipelines is beyond the scope of this article. But to provide a sense of the type of operations I have in mind, let’s consider how data engineers can use the so-called “four golden signals” to optimize data pipeline reliability.
The four golden signals are a concept popularized by Google SREs to help SRE teams understand which types of data to focus on when monitoring systems and how to interpret that data. They are latency, traffic, errors, and saturation.
Here’s how data engineering teams can use the golden signals to measure data pipeline performance and reliability.
Latency
Latency, meaning the time it takes to complete each request, maps directly onto the value that a data pipeline provides. The higher your latency, the less valuable your data may be, because it takes longer to process. And for data workloads (such as fraud detection) where making decisions in real time is critical, a delay of just a few hundred milliseconds might render your data essentially useless, because it arrives too late to enable real-time action.
Thus, data engineers should continuously measure data pipeline latency. If they notice high latency for an extended time, they should consider making changes to their pipelines (such as eliminating processing bottlenecks) that will improve overall latency. If they see periodic spikes in latency, which often result from sudden increases in requests, they might want to invest in scalable infrastructure so that their pipelines can accommodate more requests during periods of peak demand.
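To make that concrete, here’s a minimal sketch (in Python, using only the standard library) of one way to track per-record latency against a latency budget. The record ages and the 0.5-second p99 budget are illustrative assumptions, not recommendations; a real pipeline would take both from its own records and its service-level objectives.

```python
# A minimal sketch of per-record latency tracking, using only the Python
# standard library. The record ages and the 0.5-second p99 budget are
# illustrative assumptions, not recommendations.
import statistics
import time

def end_to_end_latency(event_time: float) -> float:
    """Seconds between when a record was produced and when it was processed."""
    return time.time() - event_time

# Example: simulate records that were produced between 50 ms and 400 ms ago.
event_times = [time.time() - age for age in (0.05, 0.12, 0.25, 0.40)]
latencies = [end_to_end_latency(t) for t in event_times]

p99_budget_s = 0.5  # hypothetical SLO: 99% of records processed within 500 ms
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th-percentile latency
print(f"median={statistics.median(latencies):.3f}s "
      f"p99={p99:.3f}s within_budget={p99 <= p99_budget_s}")
```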
Traffic
Traffic refers to the volume of requests that a system is handling. In a data pipeline, a spike in traffic rates doesn’t necessarily indicate a problem, but it could if traffic exceeds the volume that the pipeline was designed to support.
This means that monitoring traffic levels is important for ensuring that data pipelines are right-sized. Higher-than-expected traffic, or major spikes in traffic at unpredictable times, might be reason to modify the pipeline to ensure that traffic doesn’t become so high that the pipeline can’t handle it.
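As a rough illustration, here’s one way a pipeline might track its current traffic rate over a sliding window and compare it against an assumed design capacity. The window length and the 500-records-per-second capacity figure are hypothetical placeholders.

```python
# A rough sketch of traffic monitoring: records ingested per second over a
# sliding window, compared against an assumed design capacity. The window
# length and the capacity figure are hypothetical.
import time
from collections import deque

class TrafficMonitor:
    def __init__(self, window_s: float = 10.0, capacity_rps: float = 500.0):
        self.window_s = window_s          # how far back to look, in seconds
        self.capacity_rps = capacity_rps  # what the pipeline was sized for (assumed)
        self.arrivals = deque()           # timestamps of recent arrivals

    def record_arrival(self) -> None:
        """Call once per record (or per batch) ingested."""
        self.arrivals.append(time.monotonic())

    def requests_per_second(self) -> float:
        """Drop arrivals that fell outside the window, then compute the rate."""
        cutoff = time.monotonic() - self.window_s
        while self.arrivals and self.arrivals[0] < cutoff:
            self.arrivals.popleft()
        return len(self.arrivals) / self.window_s

    def over_capacity(self) -> bool:
        return self.requests_per_second() > self.capacity_rps

monitor = TrafficMonitor()
for _ in range(100):          # simulate a burst of 100 records
    monitor.record_arrival()
print(monitor.requests_per_second(), monitor.over_capacity())
```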
Errors
Errors in data pipelines can happen for a variety of reasons, including but not limited to ingesting data that the system was not designed to support, data quality problems, buggy code in applications that process or transform data, a lack of available infrastructure for processing data, and problems transmitting data over the network.
For data engineers, detecting and responding to errors is critical because errors are often the first indication that something is seriously wrong in a data pipeline. Some volume of errors is unavoidable, but if error rates are trending upward, data engineers need to investigate before those errors escalate into serious workload disruptions.
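For a sense of what this might look like in practice, here’s a simple sketch of tracking error rates by category and flagging when they cross a threshold. The 1% alert threshold and the "schema_validation" category are illustrative assumptions.

```python
# A simple sketch of error-rate tracking. The 1% alert threshold and the
# "schema_validation" error category are illustrative assumptions.
from collections import Counter

class ErrorTracker:
    def __init__(self, alert_rate: float = 0.01):
        self.alert_rate = alert_rate  # alert when more than 1% of records fail
        self.processed = 0
        self.errors = Counter()       # error counts, keyed by category

    def record(self, ok: bool, kind: str = "unknown") -> None:
        self.processed += 1
        if not ok:
            self.errors[kind] += 1

    def error_rate(self) -> float:
        return sum(self.errors.values()) / self.processed if self.processed else 0.0

    def should_alert(self) -> bool:
        return self.error_rate() > self.alert_rate

tracker = ErrorTracker()
for i in range(1000):
    # Pretend every 40th record fails validation (a 2.5% error rate).
    tracker.record(ok=(i % 40 != 0), kind="schema_validation")
print(f"error_rate={tracker.error_rate():.1%} alert={tracker.should_alert()}")
```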
Saturation
Saturation is the total amount of resources your system is using relative to the total amount available. For example, if your data pipeline is consuming 80% of available CPU resources, its CPU saturation is 80%.
By measuring saturation, data engineers can help ensure that their systems and the infrastructure they depend on remain adequate for supporting their intended workloads. If saturation rates approach 100%, the pipeline may begin experiencing high latency or errors because it lacks the infrastructure resources to process data normally. The problem gets even worse if baseline saturation rates are high and the pipeline experiences a sudden surge in requests.
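As one possible starting point, here’s a lightweight sketch of a saturation check. It assumes the psutil library is installed; the 80% warning threshold is an illustrative assumption that teams would tune to their own headroom targets.

```python
# A lightweight saturation check. This sketch assumes the third-party psutil
# library is installed (pip install psutil); the 80% warning threshold is an
# illustrative assumption.
import psutil

def saturation_snapshot(warn_at: float = 80.0) -> dict:
    """Return current CPU, memory, and disk saturation as percentages."""
    metrics = {
        "cpu_pct": psutil.cpu_percent(interval=1),     # sampled over one second
        "memory_pct": psutil.virtual_memory().percent,
        "disk_pct": psutil.disk_usage("/").percent,
    }
    warnings = [name for name, pct in metrics.items() if pct >= warn_at]
    return {"metrics": metrics, "warnings": warnings}

print(saturation_snapshot())
```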
Conclusion: It’s Time for Data Engineers to Think Like SREs
In short, if data engineers want to maximize the value that data creates for the business, they need to maximize data pipeline performance and reliability. And since optimizing performance and reliability is the specialty of SREs, applying SRE practices to data pipelines is a great way to derive more value from those pipelines.
For too long, the worlds of SRE and data engineering have been siloed from one another. That needs to change if businesses want to get more from their data.