CONTRIBUTOR
General Manager and Editorial Director,
Techstrong Group

Synopsis

In this Digital CxO Leadership Insights Series video, Mike Vizard talks with Thomas LaRock, the head geek for SolarWinds, about how observability is driving digital business transformation.

 

Transcript

Mike Vizard: Hey folks. Welcome to the latest Digital CxO Leadership video cast. I’m your host Mike Vizard. Today, we’re with Thomas LaRock who’s head geek for SolarWinds. And we’re talking about digital transformation and observability. Thomas, welcome to the show.

Thomas LaRock: Hi, thanks for having me.

Mike Vizard: What exactly do we mean by observability these days? I think a lot of digital CxO leadership folks kind of look at the term and they go, “Well, haven’t we been doing that all along?” And turns out that, while we may have been monitoring things, observability is a little bit different. So explain to the uninitiated what we mean by observability.

Thomas LaRock: Yeah. Observability has, as it turns out, many meanings. It depends on what company you ask. The traditional meaning comes out of control engineering theory, which is the ability to infer the health of a system by just examining its outputs. And that sounds well and good on the surface, right? It does sound like, “Haven’t we been doing that all along?” But the fact is, when you get into these platforms that claim to be observability solutions, what it really speaks to, in my opinion, is the transformation of the underlying architecture of the systems that have been built over just the past 15 years. So when you think traditional monitoring tools, you think that was built to monitor the server that sat in the closet or under my desk; it was a single node. People were connecting over the LAN, and now we’re talking about globally distributed systems that were built to be truly cloud-native. And there’s a lot of pieces and moving parts all over the world.

So what you need is, you know, the traditional visibility of a monitoring system, and now with this observability: Hey, how can I tell if my system is healthy right now? You know, and there’s a difference between health and performance, right? Health is, you know, I’m healthy enough to run a four-minute mile, but you know, I’m not going to be able to do that. So you really kind of need to have this platform that gives you that observability to understand health and performance and put it all together in a nice little dashboard.

Mike Vizard: And who is tracking these systems and this kind of data these days? Because historically it’s always been kind of an IT issue. But as I look at it lately, more businesses are dependent upon that IT infrastructure. So are there other people besides the IT folks starting to look at this data and trying to correlate some events as a result?

Thomas LaRock: Oh, I would say absolutely. The biggest thing I see in my chatters and my feeds is cost, right? So when you talk about these systems that are built to be cloud-native, a lot of the time these systems are using, let’s say, pieces of infrastructure inside of AWS and Azure that they’re being billed for, that maybe they weren’t aware of. So you can use the observability metrics that come back and say, all right, I can see what our system is doing. And I can also see, I don’t know, let’s say for example, it’s throwing a lot more errors than usual. And every time that error happens, well, that’s more compute and more resources. Can we fix where those errors are happening? You know, that might be a poor customer experience, or the customer may not even see the error, but you just know that you’re consuming something extra in the background if there’s something inefficient. So it can be used not just by somebody in IT that needs to log a ticket and resolve an issue, but it can be used by people, you know, writing the checks, and they go, “Hey, wait, are we truly using all the resources that we’re paying for in the most efficient way?”

Mike Vizard: It also seems the IT environment itself is much more complex than it used to be. And it’s much more distributed. So is it getting harder just to understand what’s happening in that environment? And then, if an application is slowing down, you just really don’t know what’s causing that.

Thomas LaRock: Yeah. Agreed. As I mentioned, the globally distributed nature of a lot of the architecture that’s happening these days does make this very, very difficult because you could have a user complaining of a slowdown, and now your first thought might be, well, you know, where are they? What part of the world are they in right now? Is there something wrong with the CDN? Like what is happening such that they are experiencing this illness, but somebody else might not be. And I think this is part of the reason that Google gave rise to a transition from say the administrative role to what they call site reliability engineering. And that’s where they focus on those four golden signals, right? Your concurrency, your errors, your latency, your threshold. And they look at those metrics first.

And as a DBA, I’m used to things like, you know, page splits and all these ____ statistics that I thought were what I needed to tune this engine. But the reality is when you go cloud-native, you need to kind of alter that first level, that first layer of troubleshooting. And so, yeah, it is absolutely more complex these days. So you’ve got to kind of have a bit of a simpler approach to start, before you can dive in and get to that mean time to resolution.
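To illustrate the "simpler first layer" of troubleshooting described above, here is a minimal sketch of a first-pass health check built from request-level signals (request volume, error rate, and tail latency). It is not taken from any SolarWinds or Google tooling; the `Request` record, the `first_pass` function, and both thresholds are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the request took
    status: int        # HTTP-style status code (assumed convention)


def first_pass(requests, latency_budget_ms=500.0, max_error_rate=0.01):
    """Summarize a batch of requests before any deeper investigation.

    The thresholds are illustrative defaults, not recommended values.
    """
    total = len(requests)
    if total == 0:
        return {"traffic": 0, "error_rate": 0.0,
                "p95_latency_ms": 0.0, "healthy": True}

    # Count server-side failures (5xx) as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / total

    # Nearest-rank 95th percentile, using integer math for the rank.
    latencies = sorted(r.latency_ms for r in requests)
    rank = (95 * total + 99) // 100
    p95 = latencies[rank - 1]

    healthy = error_rate <= max_error_rate and p95 <= latency_budget_ms
    return {"traffic": total, "error_rate": error_rate,
            "p95_latency_ms": p95, "healthy": healthy}


# Hypothetical batch: 98 fast successes and 2 slow server errors.
batch = [Request(100.0, 200)] * 98 + [Request(900.0, 500)] * 2
summary = first_pass(batch)
print(summary)
```

Only when a summary like this looks unhealthy would someone drop down to engine-level statistics such as page splits.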

Mike Vizard: And part of that is it’s just harder to query those systems, right? I mean, historically, if I looked at monitoring, it was, here’s a set of known conditions that we’re going to track. And that’s interesting, but ultimately it didn’t really tell me what the core issue was. So when I think about observability, it’s really all about the ability to query those systems and actually get to something that feels like a root cause.

Thomas LaRock: Well, querying would be fine, I would say, but I’m going to tell you, querying is a bit inefficient. If I’m sitting at a dashboard, I get an alert, and you see me starting to type queries to figure out what’s gone wrong, then I’m not sure your platform is doing the job for you. And that’s where things like AIOps come into play. What you need is a system that’s smart enough to say, hey, let me... Again, there’s a lot of metrics out there; I think you mentioned that. So how do you get a signal through all that noise? And one of those ways is with something as simple as anomaly detection. You can use time series forecasting to build a predictive model and say, I anticipate, let’s just say, you know, tomorrow you’ll have the number 10. But then tomorrow comes and the number’s 15, and you go, whoa, that is an anomaly. We weren’t expecting that 15. So now you should go look at something. Now maybe you want to write the query. So instead of, you know, writing queries all day long, trying to figure out what’s wrong, you can really have that laser-sharp focus and say, no, today, right now, I need to go investigate this thing.
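The "expected 10, saw 15" example above can be sketched in a few lines. This is a minimal illustration, not how any particular AIOps product works: it swaps in a plain average of recent values as a stand-in for a real time series forecasting model, and the `check_for_anomaly` function, the sample data, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def check_for_anomaly(history, observed, threshold=3.0):
    """Compare an observed value with a naive forecast built from history.

    Returns (is_anomaly, expected), where expected is the forecast value.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")

    expected = mean(history)   # naive forecast: average of recent values
    spread = stdev(history)    # typical variation around that forecast

    if spread == 0:
        # Perfectly flat history: any deviation at all is unexpected.
        return observed != expected, expected

    # Flag the value if it sits too many standard deviations away.
    deviation = abs(observed - expected) / spread
    return deviation > threshold, expected


# Hypothetical daily metric that has hovered around 10.
daily_values = [10, 9, 11, 10, 10, 9, 11]
flagged, expected = check_for_anomaly(daily_values, 15)
print(f"expected about {expected}, observed 15, anomaly: {flagged}")
```

A production system would replace the average with a model that accounts for trend and seasonality, but the decision step stays the same: compare the forecast with what arrived, and only then go write the query.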

Mike Vizard: So as we look at these higher levels of automation and we look at AI, what will be the role of IT going forward? What’s going to happen? What should it be?

Thomas LaRock: Ooh. Oh, I never like to predict the future, I guess, because it always changes. It’s just so fast. It’s hard. But what’s going to be the role of IT? I think you’re going to see them shift away from more of this reactive nature: Hey, something’s wrong? My phone’s ringing, right? Something’s wrong? Go fix it now, as quickly as you can. I think there’s going to be more of a proactive nature where you’re going to use these observability platforms to be able to monitor, say, in a holistic way, observe the health of all of your systems and take actions as necessary. Think of it more like you visit your doctor once a year, and it’s easier for you to stay on top of your health than if you only go every 20 years. So the role of IT is going to be more of that practitioner. They’re going to be able to, you know, let the machines do the work that the machines are best at. And then humans can do more of the work of, say, how do we build a better architecture? How do we make things, you know, safer? How do we make things faster? How do we do all these other things? That’s what humans need to do, and have the machines do the automated tasks and let them do what they’re best at.

Mike Vizard: As we go along here and we hear a lot about digital business transformation these days, it almost seems like every IT event correlates back to a business event and a business event now correlates back to an IT event. Do you think that we all get that these days, or are we still trying to figure out what the exact relationship is between IT and the rest of the business?

Thomas LaRock: I think that might vary by industry, but yes, at the end of the day, I do think there’s going to be a business unit that does wonder what IT does. I mentioned I was a DBA, and I used to think that the best DBA was one you never had to see, because if you had to see the DBA, something’s gone horribly wrong. And I think there are still some people who view IT in that way. If you have to see IT, you think, well, I had to open the help desk ticket, somebody from IT had to come and do something. So something is almost always wrong when IT is involved. But I think, again, by industry, things are kind of changing where IT really truly does have an equal seat at the table. There are a lot of various reasons for why that has happened, say in the past three to five years specifically, and maybe in the past two just for COVID-related reasons where people had to work remotely, where IT has really had to step up, be at that table and say, so here are the options we have. Here’s what it’ll cost. Here are the benefits. Here are the risks. Here are the costs. Right?

So yeah, I think there’s still going to be some disconnect between business and IT. And what’s a business event; what’s an IT event? But generally speaking, I believe they’re starting to play together a little bit nicer.

Mike Vizard: So what’s your best advice right now to folks? I mean, should they rip and replace all their monitoring tools for something that’s more of an observability platform? And how do I do that? Do I kind of put everybody in a room and lock the door and hope for the best? Or is there some more perhaps rational way of going after this?

Thomas LaRock: Yeah, that’s a great question. I don’t think I’m going to advocate that everybody toss out their current monitoring tools, but I do believe you need to take a hard look and say, this tool I’m using today, what was it built for? What was the purpose that it was built for? And if the answer is it was built for a system that was from 20 years ago, you know, before we started talking about cloud-native technologies and architectures, then you really have to think a little bit harder about whether you’re getting the value from that. Now, is this a tool that’s extensible? Is this a tool that can, you know, provide me with the necessary details? And like I said, separating out the health from the performance. Is this a tool that lets me be truly productive and efficient with my time during the day? Or is this a tool where I really have to go back, and I spend my time writing queries, or I’m spending my time managing disk and doing all these other things that I really know could be automated away? So I think that’s where I would tell somebody the best advice is to figure out how many hours you’re spending daily, weekly, monthly on administrative tasks that, you know, could be better handled through automation.

Mike Vizard: All right. Great. Hey Thomas, thanks for being on the show.

Thomas LaRock: Thanks for having me.

Mike Vizard: All right, guys. Thank you all for watching our latest episode. You can find this one and other ones on the Digital CxO website. We encourage you to check those all out and, once again, thanks for spending some time with us.

Show Notes