The daily workload of a typical developer or operations engineer is constantly evolving.

As leaders seek new ways to scale while improving user experience, site reliability engineering (SRE) has emerged as a critical component of the digital transformation process.

What is SRE?

The definition of SRE can be vague and varies depending on the person you ask.

According to Ben Treynor, Google’s VP of Engineering, “Fundamentally, it's what happens when you ask a software engineer to design an operations function.”

Gartner defines SRE as a function that “enables organizations to fulfill their reliability needs at scale to support the demands of digital business.”

The core tenets of SRE involve quality, performance, and availability. To adequately address such objectives, organizations must maintain a vigilant commitment to:

seeking opportunities for improvement;
acting in a “big-picture” advisory capacity;
and responding to outages in a timely manner.

The benefits of SRE include:

a narrower gap between operations and development teams
improved visibility and reporting
accelerated culture change

The best way to define a concept like SRE is to see it in action. In this post, we’ll look at how Google, Netflix, and PayPal utilize SRE.

Google – on Implementing SRE

Google’s Developer Advocate Nathan Harvey was recently interviewed on Deloitte’s On Cloud Podcast.

One topic addressed in the podcast is how various organizations implement SRE.

Some organizations create a siloed SRE team that is separate from development and separate from operations.
Others spread their SRE practices across development and operations, and nobody receives the label Site Reliability Engineer.
Others have dedicated Site Reliability Engineers and embed them within either the development team or the operations team.

At Google, Harvey acknowledges there is an abundance of everything. So, they use all three of the implementation options above. A hybrid method, if you will.

They have entire SRE teams. They also have site reliability engineers who work with certain teams or services. But if a service becomes too unreliable or can’t sustain itself, the site reliability engineers will disengage.

According to Harvey, it sounds something like this: "Hey, developers, you own the reliability of this service. We're here to augment and to help you with that, but if you're shipping something that is always breaking…you have to take ownership of that. And when you have a more reliable system, then come and talk to us and we'll reengage."

Netflix – on Using SRE to Strengthen Resilience

Traditionally, when applications were built from scratch, capabilities were tangled in a single code package and security was not addressed until late in the process. Upgrading or adding a new feature was highly complicated and involved extensive system-wide testing.

Because apps were not designed to scale, maintaining a reliable and resilient system was a challenge, to say the least.

However, now that digital products are built to scale with SRE in mind, best-practice resiliency patterns can be found in open-source libraries like Netflix’s Hystrix library for improving fault tolerance, according to McKinsey.

Instead of building their own code, and struggling with security and compliance audits, teams can now reuse expert code.
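To make "resiliency pattern" a little more concrete, here is a minimal sketch of the circuit breaker idea that fault-tolerance libraries such as Hystrix are known for. This is not Hystrix's API (Hystrix is a Java library); the class name, parameters, and thresholds below are hypothetical and chosen only for illustration. The idea: after a number of consecutive failures, stop calling the failing dependency for a while, then let a single trial call through to see whether it has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable clock, handy for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hitting a dependency that is known to be failing.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial ("half-open") call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open the circuit on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success: reset the failure count and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a flaky dependency, for example breaker.call(fetch_profile, user_id), where fetch_profile is a stand-in for your own function, and treat CircuitOpenError as the signal to serve a fallback. A production library adds much more (metrics, thread isolation, fallbacks), which is exactly why reusing one is attractive.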

PayPal – on Improving Customer Satisfaction with SRE

According to Deloitte, “SRE isn’t just about managing to a number or a percentage. Instead, it’s about defining ‘reliability’ in terms of customer satisfaction and letting data drive decisions.”

At PayPal, they’ve done exactly that.

Nine months after creating an SRE team, the Mean Time to Resolution of PayPal’s most critical application issues dropped by more than 90%, resulting in increased customer satisfaction and savings of more than $20M in revenue that year.
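If you want to track the same kind of metric yourself, Mean Time to Resolution is simply the average of (resolved time minus opened time) across incidents. The sketch below shows one way to compute it and the percentage change between two periods. The function names and the sample timestamps are hypothetical and purely illustrative; they are not PayPal's data.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average resolution time for a list of (opened, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def percent_reduction(before, after):
    """Percentage drop from one timedelta to another."""
    return (before - after) / before * 100


# Made-up incident records for two periods (illustration only).
before_sre = [
    (datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 13, 0)),    # 4 hours
    (datetime(2023, 1, 20, 14, 0), datetime(2023, 1, 20, 20, 0)),  # 6 hours
]
after_sre = [
    (datetime(2023, 10, 3, 9, 0), datetime(2023, 10, 3, 9, 20)),     # 20 minutes
    (datetime(2023, 10, 18, 14, 0), datetime(2023, 10, 18, 14, 40)),  # 40 minutes
]

mttr_before = mean_time_to_resolution(before_sre)
mttr_after = mean_time_to_resolution(after_sre)
reduction = percent_reduction(mttr_before, mttr_after)
```

In practice the incident records would come from your ticketing or incident management tool, and it is worth segmenting by severity so that a handful of minor issues does not mask slow resolution of the critical ones.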

So Where Does That Leave You?

You don’t have to be a Google, a Netflix, or a PayPal to benefit from SRE practices.

Whether you’re just beginning to think about site reliability at your company or you’re working to optimize an existing SRE team, aim for progress, not perfection.

Transforming to an SRE practice is not a quick fix or a cost-savings initiative. It is a long-term investment in your business and should be treated as such. As your SRE practice evolves and gains momentum, the organization can expect a meaningful return on that investment.

Conclusion

Organizations that adopt SRE into their processes and into the overall culture demonstrate a commitment to taking risks in order to create real momentum.

No SRE team can prevent mishaps. But by improving processes, taking risks, and being resilient, you will ultimately provide a more reliable customer experience. And that’s what it’s all about.

About Stone Door Group

Stone Door Group® modernizes the digital enterprise through skilled DevOps and Hybrid Cloud professional services. We make it easy to quickly access and deploy DevOps solutions to transform your business and provide certified consultants to deliver your projects.

For more information about SRE, contact Stone Door Group today.

ABOUT THE AUTHOR

Ray Stoner is a Consultant focused on Observability, Service Management and Site Reliability Engineering at Stone Door Group, a Cloud and DevOps consulting company that delivers successful digital transformation projects in the private and public sectors. To speak with Ray and our team, send us an email at letsdothis@stonedoorgroup.com.

Top 3 Examples of Companies Implementing Site Reliability Engineering