Figuring Out the Failures | Lodewijk Bogaards from StackState

When something goes wrong in a highly complex integrated system, the damage is potentially enormous. Even the smallest error can cost millions. But when you have so many technologies interacting, isolating the problem can be extremely difficult.

This is a problem that Lodewijk Bogaards understands well. “Think about the process running in a container, running on a pod, running on a virtual machine, running on a hypervisor, running in a blade on a rack in a data center somewhere,” he says. As a consultant for a bank, his client needed a better ability to understand their systems.

And as Lodewijk will tell you, there’s no better way to start a company than when you can find the solution to a problem a bank needs to have solved. StackState was born, a company that gives admins full observability on their systems, enabling them to quickly identify problems.

On this edition of UpTech Report, Lodewijk explains how his technology manages this feat, and the various ways it can help companies avoid disaster.

More information: https://stackstate.com/

Lodewijk Bogaards is a StackState co-founder and CTO. He combines deep technical skills with high-level technical vision.

If he’s not working on StackState, Lodewijk might be found playing squash, answering questions on StackOverflow, or meditating.

TRANSCRIPT

DISCLAIMER: Below is an AI generated transcript. There could be a few typos but it should be at least 90% accurate. Watch video or listen to the podcast for the full experience!

Lodewijk Bogaards 0:00
There’s a couple of different approaches to bringing all this information together. And many of these approaches have failed.

Alexander Ferguson 0:14
Welcome, everyone, to UpTech Report. UpTech Report is sponsored by TeraLeap; learn how to leverage the power of video at teraleap.io. My guest today is Lodewijk Bogaards, based in Amsterdam, in the Netherlands, CTO and co-founder of StackState. Welcome, Lodewijk, good to have you on.

Lodewijk Bogaards 0:31
Yeah, thanks for having me.

Alexander Ferguson 0:33
Now, on your site, you talk about how your product is really all about unified observability of all technology, providing that relationship-based observability across all IT components. So for those out there, maybe you’re an infrastructure and operations leader, or, a new term coming out, a site reliability engineer, or another business leader managing multiple systems and technologies, this could be a tool you’re going to want to check out. Now for you, Lodewijk, when you set out to solve this problem, how did you even discover the problem and realize, wow, this needs to be solved, and we have a new way to approach this?

Lodewijk Bogaards 1:17
Oh, typically the way it goes is you have the problem, and then you struggle with the problem for a long time, and you start to think about how you might solve it. And that’s also how I got involved. First of all, I had the problem, and I had some ideas on how to solve it. At that time, I didn’t have time to actually go and do it the way I wanted to do it, but I got an intern who actually did it for me on the side. And then when I switched companies, I was asked to solve that problem. But now I wasn’t the one having the problem; the bank that I was working for as a consultant had that problem. And they were asking us as consultants, help us solve that problem, but at the scale of the bank. And the other co-founder of StackState was in exactly the same place. He also had tried to solve that problem before. He switched to a consultancy gig at that bank, and we met. And we both had the same ideas on how to solve the problem. So that’s how we started.

Alexander Ferguson 2:22
So you come across the problem yourself internally, then you go on to become a consultant, and they’re asking you, hey, can you solve this for us? And your co-founder, Mark, correct? Yeah, he was also seeing that same issue, which led to the development of StackState. Yeah, exactly. That’s one of the best ways to build a company: you see the problem yourself, and then you have an enterprise company or a bank that needs it solved, and you go about doing that. So I’m curious, how has that changed over the years? Or has it always remained the same, just observability across all technologies?

Lodewijk Bogaards 2:59
It’s gotten worse, actually. We’re still just making a small dent in this problem. But to be clear, the problem that I’m talking about is this. Think about a bank, for example; that’s where we started. At that bank, they have a lot of monitoring systems: monitoring systems that are very specialized for databases, monitoring systems that are very specialized for infrastructure, for service buses, et cetera. They sometimes have literally hundreds of monitoring solutions, and these are all monitoring just a small part of the stack. Now, when something goes down, at the level of a bank that can really mean millions per hour, or even per minute if it’s in the wrong place, and you have to figure out which of these monitoring systems has something useful to say about that. Typically, across the stack, you’ll find that many solutions have something to say about many of the things that are going on, but it’s very hard to pinpoint where this actually started. What was the change that led to the failure? Where should I be looking? Where should I not be looking? Who should I involve? These kinds of fundamental questions are very hard to answer.

Alexander Ferguson 4:24
It’s almost like the saying of finding a needle in a haystack, and the quantity of technologies and monitoring solutions is only increasing, not decreasing, right? And so the difficulty of finding which one failed is becoming, as you said, more of a problem.

Lodewijk Bogaards 4:41
Yeah, it’s becoming more of a problem. So think about what’s the latest in IT technology. Everybody is now moving to the cloud native space. Kubernetes is the big hot new contender in the space. That means we have new monitoring solutions, now called observability. There’s a whole bunch of new tools, there’s a whole new ecosystem, and all those legacy systems are, of course, still there. So they have to be connected, so you have to build adapters. And then when you go into the cloud native space, all of a sudden we’re not talking anymore about these kinds of monoliths that run on servers. We’re now talking about highly distributed systems, microservices all over the place, all spinning up and down. So now all of a sudden the systems become more complex and more dynamic, and the speed of change is just increasing. And at the same time, of course, the entire economy is becoming more and more dependent on technology. So yeah, the problem is actually just getting worse. And there are good statistics on that as well, where you can see that the cost of one hour of downtime is still on the increase.

Alexander Ferguson 6:02
For those who are interested in the technology itself and how it works, let’s dive a little bit deeper. Given the quantity of tracking and reliability tools that are out there, and that you’re effectively plugging into them, allowing you to see all of them in a topology type of view, is that your team effectively creating the API connections? How does that all work, for you to be able to hook in properly to all these systems?

Lodewijk Bogaards 6:30
Well, the fundamental insight that we took is that there’s a couple of different approaches to bringing all this information together. And many of these approaches have failed and have now kind of become the laughingstock of the IT monitoring world. So for example, single pane of glass, sometimes also called single glass of pain. That’s the idea that we have all those tools, and we just kind of slap a pane of glass on that, and we make lots of dashboards with lots of widgets. That simply does not scale, and it means that you have to manually figure out how to pull that data together into dashboards. In dynamic situations, that absolutely does not work. Another solution, and many vendors are still trying to push this, it’s kind of still the main one if you look at the IT monitoring community at large, is what we call the big database in the sky approach. They say, well, we’re Dynatrace, we’re New Relic, we’re Datadog, we can give you complete end-to-end coverage of your entire IT landscape, but all you need to do is bring all your data to us, and then of course we can do it. And that’s been tried. We know that’s been happening since IBM in the 90s, and HP and all of those, and that’s just never going to happen. All the data is never going to end up in one place. Another solution has been, let’s not consider all the data, let’s consider only the events. So we bring in only the events, which kind of have the most important bits of the data, and we bring all that together. But that’s also not really a good approach.
Because if you get an event that says, well, the disk is 80% full, that tells you very little about whether you should now take action or not. What you want to see is, did it rise to 80% in five minutes, or was it already at that level for the last five months? That makes a world of difference. So you need to have more insight into the data. And also, when you’re looking at a whole bunch of events, it becomes very hard to correlate them. And then there’s all kinds of approaches like, let’s use natural language processing to bunch them together, and these approaches really also don’t work. So our approach is fundamentally different: we actually take the topology, the understanding of how the different components of your IT environment fit together, and we make sure that that is a real-time map of the smallest little components. So think about the process running in a container, running on a pod, running on a virtual machine, running on a hypervisor, running in a blade on a rack in a data center somewhere. And all those different abstractions, all the way to, well, this process is actually running a database, and that database is actually part of a logical application. And so on, all the way up to your business, where this logical application provides a service that connects to customers, et cetera, and actually has revenue streams connected to it. An understanding of all of that across the entire stack makes it possible for you to know exactly how your environment is configured. Keep that in real time, and make sure that you also track everything over time, because it’s highly dynamic. And then for each point in that stack, make it possible to know what data exists about that component and where that data is coming from. So you don’t really need to have all the data in one place; all you need to know is what you have.
So that’s kind of the metadata about your entire environment, which we call a topology, and how it’s all related, and then where you can find more information about that particular component and in which system. Then you pull that all together in kind of a seamless data fabric. So it doesn’t really matter if the data is coming from system X or system Y; we handle it with an abstraction layer as if it’s all part of the same system. And that’s been our fundamental approach from day one. I believe that’s the only approach that actually will work.
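
Editor’s illustration: the topology idea described above can be pictured as a small graph of components, each of which records what it runs on and which external system holds its telemetry, so that the data itself never needs to be copied. The following is only a minimal sketch under that reading; the class names, field names, and example components are hypothetical and do not represent StackState’s actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Component:
    """One node in the topology (hypothetical model, for illustration only)."""
    name: str
    layer: str                       # e.g. "process", "container", "pod"
    runs_on: Optional[str] = None    # name of the component one layer down
    # Maps a kind of telemetry to the system that holds it (a pointer, not the data).
    data_sources: Dict[str, str] = field(default_factory=dict)


class Topology:
    def __init__(self) -> None:
        self.components: Dict[str, Component] = {}

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def stack_below(self, name: str) -> List[str]:
        """Follow runs_on links downward and return the names in order."""
        chain: List[str] = []
        current = self.components[name].runs_on
        while current is not None:
            chain.append(current)
            current = self.components[current].runs_on
        return chain

    def where_is_data(self, name: str) -> Dict[str, Dict[str, str]]:
        """List telemetry pointers for a component and everything beneath it."""
        names = [name] + self.stack_below(name)
        return {
            n: self.components[n].data_sources
            for n in names
            if self.components[n].data_sources
        }


def build_example() -> Topology:
    """Build the process -> container -> pod -> VM -> hypervisor chain."""
    topo = Topology()
    topo.add(Component("hv-2", "hypervisor"))
    topo.add(Component("vm-7", "virtual machine", runs_on="hv-2",
                       data_sources={"metrics": "CloudWatch"}))
    topo.add(Component("db-pod", "pod", runs_on="vm-7",
                       data_sources={"metrics": "Prometheus"}))
    topo.add(Component("db-container", "container", runs_on="db-pod"))
    topo.add(Component("db-proc", "process", runs_on="db-container"))
    return topo


if __name__ == "__main__":
    topo = build_example()
    print(topo.stack_below("db-proc"))
    print(topo.where_is_data("db-proc"))
```

The point of the design is that the graph holds only relationships and pointers; a question about the database process resolves to "ask Prometheus about the pod and CloudWatch about the VM" rather than to a single central copy of all data.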

Alexander Ferguson 11:24
For bringing on a new customer for StackState, somebody who wants to be able to utilize it, what does that process look like? How long does it take to build it up, have all of their different technologies within that whole stack, and get that visibility?

Lodewijk Bogaards 11:42
Yeah, so it can go super fast, depending on the sources. So for example, you have an AWS environment, you have a Kubernetes environment, you’re using Dynatrace, you’re using CloudWatch, you’ve been using Prometheus, you’ve been using Datadog. These are well-known systems that have good identifiers and are often used in conjunction with automation. Setting all that up can go extremely fast, so we’re literally talking about minutes. And since we also don’t need to copy all the data, you get access to it immediately, once the topology is there. And then of course, if you talk about something like a bank, it becomes a much longer process, because now we also have to deal with all the legacy systems. That’s why we built it as a platform. And then it becomes more of a journey, where you say, well, we have an IBM mainframe here, we want to bring that into the fold. Okay, there’s a process, there’s an SDK, et cetera. And then it takes months, depending on how far you want to go.

Alexander Ferguson 12:54
So basically, you have your base platforms that integrate very quickly, in a matter of minutes. Those are the most common ones that almost everyone uses. But you do allow legacy systems to come in using your SDK; it just might take a few months to build that. But the idea is the platform itself can handle both the legacy systems as well as the everyday and the latest stacks.

Lodewijk Bogaards 13:20
Yeah, that’s pretty cool, actually, because we started with this data model that, in theory, when we started, should be able to handle the newer generations of IT as well. But of course, at that time, and we started in 2014, when we started working at that bank, we didn’t know about serverless yet, and Kubernetes was out there somewhere, but it wasn’t the hot new thing that it is right now. But it’s working perfectly in those environments as well. And I have no doubt that when the next generation of running your IT comes around, it will work perfectly in that environment as well.

Alexander Ferguson 14:04
If you could share a word of wisdom or insight with a site reliability engineer or someone who has to manage all this infrastructure, what word of wisdom would you give them in today’s day and age?

Lodewijk Bogaards 14:22
Well, I’ve recently been working on a blog which is about becoming predictive, because many companies are looking to become predictive in their monitoring. They want to be alerted to the problem before the problem actually occurs. And that’s been a desire of companies since the dawn of mankind. But what I would suggest is to think about becoming more proactive, and making sure that you can leverage that data such that at some point you can become predictive. One concept that might be interesting is to think about this: what would be really good actionable information, and how could I get that? So for example, if I get an alert, how can I make sure that alert tells me about the possible root cause? What do I need for that? And if I get an alert, and it would be a prediction, can it tell me about the possible impact? Because if it says the disk is 80% full, it could be Armageddon, or it could be that it doesn’t matter at all, because some process is going to come around and optimize disk space, and it’s going to go away. So what do you need? Maybe it’s more of a question than a word of wisdom. But think about what you need in order to get to that type of information.
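
Editor’s illustration: one way to read the advice above is that an alert becomes actionable once it can be placed in a dependency graph, where the things it depends on are possible root causes and the things that depend on it are the possible impact. The sketch below assumes a hand-written dependency map; the component names and the `enrich_alert` function are hypothetical and are not taken from any StackState product.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "mobile-app": ["payments-service"],
    "payments-service": ["payments-db"],
    "payments-db": ["db-pod"],
    "db-pod": ["vm-7"],
    "vm-7": [],
}


def reachable(start: str, edges: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk; returns every node reachable from start, nearest first."""
    seen: Set[str] = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Flip the edges so each component lists what depends on it."""
    inverted: Dict[str, List[str]] = {node: [] for node in edges}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def enrich_alert(component: str, unhealthy: Set[str],
                 edges: Dict[str, List[str]] = DEPENDS_ON) -> dict:
    """Attach possible root causes and possible impact to a bare alert."""
    upstream = reachable(component, edges)            # what this component relies on
    downstream = reachable(component, invert(edges))  # what relies on this component
    return {
        "component": component,
        "possible_root_causes": [c for c in upstream if c in unhealthy],
        "possible_impact": downstream,
    }


if __name__ == "__main__":
    print(enrich_alert("payments-db", unhealthy={"payments-db", "vm-7"}))
```

In this toy case, an alert on `payments-db` while `vm-7` is also unhealthy points at the virtual machine as a candidate cause and at the payments service and mobile app as the potential blast radius.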

Alexander Ferguson 15:59
Those powerful, predictive types of answers are much more helpful than just, here’s the state of something. And that’s effectively what you’re saying is the future and where people need to be thinking. Yep. Yeah. Where do you see the technology, and where are you guys headed four or five years from now? What are you excited about and working towards?

Lodewijk Bogaards 16:25
Well, I’m super excited about our autonomous AI engine. This model that we have at the base, which makes it possible to connect all that information together, is of course just the starting point. Now you have all that information together, what are you going to do with it? And there are many use cases that we have. We help you with root cause analysis, we help you find the right people at the right time, and all of that is centered around mean time to repair. But what I’m really excited about is the possibility of also connecting business components onto that stack, so you have a topology that also contains business components. And so my grand goal for autonomous AI, which is something that I’m working on, is for it to pick up on the right signals because it has an understanding of your business. If you want to look at where you can best apply AI: say I have a bank, and there’s a ton of data streaming, terabytes of data every minute, exabytes of data daily. Where can I best apply AI? I cannot apply it across the board, because then I would need two more data centers simply to do that, and it might not be worth the investment. But if you have an understanding of how your IT components, how that one pod, how that process in that pod, relate to your important streams, like how many users am I processing, what is my revenue, and if you can correlate what’s happening in that little process inside that pod back to what’s going on in the business, then it becomes possible for you to apply AI in a very smart manner. And that’s where I see the big future of StackState.

Alexander Ferguson 18:41
And I like the future that you paint there of autonomous AI and the specific places it can play a role. For those who want to learn more, you can go to StackState.com, and you’ll be able to schedule a demo and see how their platform works. For those who want to learn more about Lodewijk’s history and his story, stick around for part two of our discussion; we’re going to hear more about that founder’s journey. Thanks again, everyone, for joining us. Again, our sponsor for today’s episode was TeraLeap. If your company wants to learn how to better leverage the power of video to increase sales and marketing, head over to TeraLeap.io and check out the product customer stories. Thanks again, everyone. We’ll see you next time. That concludes the audio version of this episode. To see the original and more, visit our UpTech Report YouTube channel. If you know a tech company we should interview, you can nominate them at UpTechReport.com. Or if you just prefer to listen, make sure you subscribe to this series on Apple Podcasts, Spotify, or your favorite podcasting app.

PART 2

The Disaster Detective | Lodewijk Bogaards from StackState

Written by Alexander Ferguson