Applications were complicated enough, but with the advent of the cloud, they’ve reached a level of complexity that’s often difficult to manage. With multi-clouds, hybrid clouds, edge computing, and external data dependencies, applications have become a tangled web of interconnected systems.
And the more complex a system, the more opportunities for things to go wrong. Our guest on this edition of UpTech Report is trying to solve this with his startup, Glasnostic, which actively monitors and manages interactions in the cloud, allowing problems to be resolved in real time before they take down systems.
More information: https://glasnostic.com/
TRANSCRIPTION
DISCLAIMER: Below is an AI generated transcript. There could be a few typos but it should be at least 90% accurate. Watch video or listen to the podcast for the full experience!
Tobias Kunze 0:00
The problem is you have four, five, ten systems, 500 systems, you put them together, something doesn’t work. It’s not a code issue. It’s an environmental issue.
Alexander Ferguson 0:14
Welcome to UpTech Report. This is our applied tech series. UpTech Report is sponsored by TeraLeap. Learn how to leverage the power of video at Teraleap.io. Today, I’m excited to be joined by my friend and guest Tobias, soon to be friend. I’ve said friend already because I’m excited for you to be my friend. He’s based in Menlo Park, CEO and co-founder at Glasnostic. Welcome, Tobias, good to have you on. (Tobias: Thanks for having me.) Now, your whole platform is focused on making cloud applications resilient. Help me understand, Tobias: what was the root problem that you saw and set out to solve with Glasnostic?
Tobias Kunze 0:48
Yeah, absolutely. So the root problem is really application complexity, right? You move to the cloud, everything gets more complicated. We have more pieces running in more places, right? It’s not just cloud: there’s on-premise, there’s multi-cloud, there’s hybrid clouds, we’re getting to the edge, right? It’s not going to get simpler anytime soon. And frankly, the problem is also that our applications are not what they used to be 30, 20 years ago, right? Now the interesting data is always outside of the application. So you have dependencies, right? So the connectivity explodes. It’s not just a proliferation of application pieces, think of microservices, right? It’s really the number of systems you converse with. It’s SaaS services. It’s other services. It’s the components you depend on. It’s the cloud services, right? (Alexander: You saw the complexity of that, huh?) Yes. Yeah.
Alexander Ferguson 1:44
And from seeing this, how did that develop into the product itself?
Tobias Kunze 1:49
So from seeing that, it became very clear that there’s a natural limit to how much we can do with distributed systems engineering to begin with, right? As a developer, or, like, ex-developer mostly now, my team doesn’t let me code anymore. But it’s very clear to me you cannot build out a system with 500 moving parts, 5,000 moving parts, and think you built this like it’s an in-memory process space, right? You’re now calling out into services. Everything you use is shared. Everything is multi-tenant. It’s not like you own anything else, you just use it. That, in aggregate across the architecture, becomes very difficult to manage.
Alexander Ferguson 2:37
So the scope of complexity is definitely only increasing. But where you guys play a role, when it comes to Glasnostic, it’s not prior to launching, it’s not figuring out what happens before. If I understood correctly, it’s day-to-day operations. Can you help me understand that?
Tobias Kunze 2:52
Yeah, absolutely. So we used to think about applications, or like anything we do in software, that the value is in writing the code. And that’s been true: if you write a small piece, yes, that’s where the value is. Once it’s deployed, it’s just gonna run, right? But the more complex they become, the more involved it gets once the code is live. And it’s kind of nature versus nurture, right? Parents know that, right? Making the baby is not the work; raising it is the work afterwards. And we are entering that stage, right? 20 years ago, absolutely, writing the code was the difficult thing to do. Today, it’s all about keeping it up, and keeping it up in the face of change and rising complexity.
Alexander Ferguson 3:37
I love that focus: okay, maybe having a baby is easy, but then having to raise it, that’s a whole other thing, especially as new problems arise, they continue to learn, and environmental changes add new things to your world. Another analogy I think we talked about is, if you have a team of 20 people, the way you manage a team of 20 is different when you start to scale to 50, to 200. Help me understand, when it comes to the software side, being able to manage an application: what are some of the complexities that come in when you’re trying to scale? And what should you be looking for?
Tobias Kunze 4:05
Yeah, I think it really comes down to this natural limit to how much we can engineer deterministically, right? There’s just a scale limit to it. We can go into distributed systems engineering and try to really predict and prevent every single corner case, you know, things that could happen to us in production. But there’s a natural complexity and cost limit to that. So beyond that, you just need to kind of deal with it. If you can’t prevent it, you need to manage it. You need to deal with it. And I think it’s always very similar to when a company grows, right? You start with a bunch of people. There is no team structure. Everybody’s heads down. You have a question, you yell it over to the guy or gal next to you. But as you grow, responsibilities kind of divide and you need to get some kind of management in place. And the only purpose of management here is to make sure nothing’s in the way of the teams doing their work. And management is all about being situational, reacting to the situation. Now, strangely enough, because as software engineers we were always brought up writing code before it happens, and then letting it run, that kind of idea is kind of foreign to us. But it’s absolutely necessary. And before applications became really complicated, complex, that kind of management was done by the, you know, the ops teams, who made sure nothing got in the way of the applications. But now, the things that get in the way of the applications are other services, other software components, right? So it’s an exploding problem.
Alexander Ferguson 5:55
So then, let’s get a little bit nitty-gritty on the details of the tech and the platform itself. How does it work? And who’s actually the one using your platform?
Tobias Kunze 6:06
Two different questions. How does it work? Well, if you really want to manage any of these unknown unknowns that happen in production, you cannot rely on agents. You can’t have extra code running on a machine or any given workload, because you may not own it. And most of the time, you don’t own it. So you need to look at how these things behave. And the way you define behavior is by looking at wire signals, you’re looking at outward behavior. There’s a load balancer, tons of stuff coming in, and nothing coming out. That’s behavior, right? I don’t care what the function is on the box, I look at the outside, the outward behavior, and that becomes the most important paradigm shift, really. So we look at this as air traffic control, right? You manage the airspace, you’re no longer managing a single flight. It’s not about gear up, gear down, and navigating to the right destination. You’re managing the airspace, because now you have hundreds and thousands of planes in the air. And it’s not something you can manage from within the cockpit, where you have all the observability, right, and the switches and all that. So again, it’s a management task, a management issue that you’re solving here. And if you manage the airspace, you operate with different nouns and different words, right? For the air traffic controller, it’s the callsign, it’s the position, the altitude, direction, speed, maybe a little bit of weather data. For us, the callsign is: what’s the service name? What’s the endpoint name? How many requests are running, what’s the latency on them, how many run in parallel, the concurrency, and bandwidth. And, crucially, not even an error rate at this point, because that’s an application-specific signal. If I want to manage the airspace, so to speak, of a complex cloud application, I need to see everything; there can’t be anything that I don’t see. Right, hence touchless, agentless, and golden signals, wire signals.
So that’s how it works, right? We are a bump in the wire.
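To make the idea of wire signals concrete, here is a minimal Python sketch that summarizes requests, latency, concurrency, and bandwidth per route purely from what could be observed on the wire. It is an illustration only, not Glasnostic’s implementation: the `Observation` record shape, the `summarize` function, and the service names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One request as seen on the wire (hypothetical record shape)."""
    source: str
    destination: str
    start_s: float
    end_s: float
    bytes_transferred: int


def summarize(observations, window_s):
    """Aggregate wire signals per (source, destination) route.

    window_s is the length of the observation window in seconds.
    No application-level data (such as error codes) is used.
    """
    by_route = defaultdict(list)
    for obs in observations:
        by_route[(obs.source, obs.destination)].append(obs)

    summary = {}
    for route, items in by_route.items():
        # Peak concurrency: sweep over start (+1) and end (-1) events.
        events = []
        for o in items:
            events.append((o.start_s, 1))
            events.append((o.end_s, -1))
        events.sort()  # at equal times, ends (-1) sort before starts (+1)
        current = peak = 0
        for _, delta in events:
            current += delta
            peak = max(peak, current)

        summary[route] = {
            "requests_per_s": len(items) / window_s,
            "mean_latency_s": sum(o.end_s - o.start_s for o in items) / len(items),
            "peak_concurrency": peak,
            "bandwidth_bytes_per_s": sum(o.bytes_transferred for o in items) / window_s,
        }
    return summary


if __name__ == "__main__":
    seen = [
        Observation("web", "db", 0.0, 1.0, 100),
        Observation("web", "db", 0.5, 1.5, 300),
    ]
    print(summarize(seen, window_s=2.0))
```

Note that the summary deliberately has no error rate, in line with the point above that error rates are application-specific signals.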
Alexander Ferguson 8:20
So it’s having that high-level perspective, not being distracted by too many smaller details. It sounds like that’s where you’ve really been able to focus in. Who is that air traffic controller then, so to speak?
Tobias Kunze 8:36
So the titles in technology are really all over the place. I like to say: the people who are responsible for deployments. That can be the platform developer, it can be in app development, it can be the SRE that’s embedded in the app team, it could be the SRE in the platform team, it could be the operations team, the DevOps. It’s all over the place. Some companies have service delivery teams, right? Whoever is responsible for the deployments. Whatever is deployed, we need to make sure that nothing steps on somebody else’s toes, that everything’s always available, always stable, and secure. Now, at that level, I’m not really concerned with bugs in code. Bugs in code are actually really easy to find. You can unit test it: my code works. The problem is you have four, five, ten systems, 500 systems, you put them together, something doesn’t work. It’s not a code issue. It’s an environmental issue. And that’s what we are after.
Alexander Ferguson 9:42
What’s so interesting is you’re really honing in on that. It sounds like a lot of people want to know the root cause, like, just show me the root cause. But you’re actually saying no, let’s not look at the root cause, let’s see the forest for the trees. Is it that you feel that gives a better focus, that that’s where you should be starting before you get down into that?
Tobias Kunze 10:03
Absolutely. You need to start with: how can we mitigate it? How can we remediate? You do not care about the root cause, you care about quick fixes, right? You need to get out of that outage, you need to get rid of whoever is sucking up your bandwidth at the moment, or whatever the case may be. Keep in mind, it’s all unknown unknowns most of the time. So it’s really important that you manage the situation, again, management, before you then decide: is this something I want to do something about long term? Right? Do I want to invest in something so I don’t have to manage this again? And that’s totally fine. But that takes time, right? You probably want to look at different ways you might be able to fix it. The problem with root causes is really, once you think about it, there isn’t such a thing, right? I always like to say, there’s no source of the Nile. The Nile has 1,000 different creeks and whatever, and which one is the source is a pure matter of definition and convenience, right? Ultimately, if you have a hammer, the root cause is the nail that you find, right? That’s really what it is. And that’s the big problem with root causes. It’s a safari, it takes a long time.
Alexander Ferguson 11:27
That’s why, right? It can take a day, or multiple days. We’re going into a world where we need to stay online all the time; our customers are expecting it. And we have to be able to quickly address and shift things. That would be how I’d categorize the space.
Tobias Kunze 11:43
Absolutely. Yeah. Yeah. And some of this can be automated, right? Some of this can be automated, or fairly, almost automated. For other things you really want to have, you know, the big red button, and do something about it. But what it comes down to is: something happens, and it’s all about rapidly detecting and rapidly responding to it. First responders show us how it’s done, right?
Alexander Ferguson 12:07
So we’ve used a lot of analogies and stayed kind of high level, but can you give me a use case that kind of illustrates it, a simple story for Glasnostic?
Tobias Kunze 12:16
Yeah, so, there are hundreds, literally, of them. But if you just look at the big outages that happen, companies publish post-mortems all the time. I find it super interesting, because you look at the post-mortem, and 90% of the post-mortem is all the detail that happened, how heroically they figured out what happened and what they did in order to contain it. And the real nugget of the story is always before that, the first 10%. And the story arc of that is always the same. It’s always: we’ve been doing something that we’ve always been doing, so it was totally normal. But random circumstances conspired, and then we reached some weird limit that nobody could ever have conceived of, right? It was completely unforeseen, right? And then mayhem ensues, because that limit triggers a whole chain reaction of events. And then it’s craziness. But we, with our heroic attitude, we go and fix it. And yes, it took a day, but it was a big outage, right? And that’s totally backwards, right? When we discover something, we should automatically do something about it. And frankly, the vast majority of outages can be mitigated into a mere slight degradation of service, if you could just do back pressure, very specific back pressure against incoming requests for whatever system you’re dealing with right now.
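As an illustration of what "very specific back pressure against incoming requests" could look like, the following Python sketch caps the number of in-flight requests per caller, so that one noisy source is slowed down while everyone else is still admitted. This is a generic concurrency-limit technique chosen for the example, not a description of how Glasnostic works; the `BackPressure` class and the source names are hypothetical.

```python
class BackPressure:
    """Caps in-flight requests per source (illustrative sketch only)."""

    def __init__(self, limits, default_limit):
        self._limits = dict(limits)      # per-source overrides
        self._default = default_limit    # limit for every other source
        self._in_flight = {}             # source -> current in-flight count

    def try_admit(self, source):
        """Return True and count the request if the source is under its limit."""
        limit = self._limits.get(source, self._default)
        current = self._in_flight.get(source, 0)
        if current >= limit:
            return False  # push back: caller should wait, queue, or degrade
        self._in_flight[source] = current + 1
        return True

    def done(self, source):
        """Mark one request from the source as finished."""
        current = self._in_flight.get(source, 0)
        if current > 0:
            self._in_flight[source] = current - 1


if __name__ == "__main__":
    # Squeeze only the hypothetical "batch-job" caller; others keep a limit of 3.
    bp = BackPressure({"batch-job": 1}, default_limit=3)
    print(bp.try_admit("batch-job"))  # admitted
    print(bp.try_admit("batch-job"))  # pushed back
    print(bp.try_admit("checkout"))   # unaffected caller is admitted
```

The design choice here is that the limit is keyed by source: the result is a slight degradation for one caller instead of an outage for all of them.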
Alexander Ferguson 13:50
You’re saying you can prevent the outages just by simply making slight adjustments, versus everything goes down until you fix the root cause a day or two later?
Tobias Kunze 13:59
If there is a root cause, and if you think that it’s worth fixing, right? That’s a decision you should take. You shouldn’t just blindly hop on an alert and say, okay, let’s investigate, because, frankly, there’s 1,000 alerts every second, right?
Alexander Ferguson 14:13
You may not be able to answer this question, but what percentage, do you think, of the solutions out there, web applications, are taking the approach of: when it breaks, all right, let’s fix it completely and then launch it again, versus slight degradation, but it stays online? Who’s actually doing one versus the other?
Tobias Kunze 14:33
It’s difficult to say. I don’t have hard numbers on that. From our customers, clearly, it’s a biased group. I’m guessing, but...
Alexander Ferguson 14:42
I guess I’m more wondering, is everyone already there and they’re just not doing it well, or is it a shift in mindset?
Tobias Kunze 14:50
It’s a shift. It’s a paradigm shift. Yeah. If you leave production to the engineers, the gut reaction is always: well, let’s find out what went wrong. Let’s dig in and find out. And by the way, we have all the time in the world, it’s important to get it right. And no, it’s not. If somebody has a car accident, first responders come in. They don’t call in the oncologist and the, you know, orthopedic surgeon and whatever else. They just stop the bleeding first, and then we can see whether there’s something we need to do. And strangely enough, because of the way we’ve been brought up as engineers, to think methodically and in procedures, you know, we write code, and code is the epitome of everything being deterministic, it’s very difficult for engineers to jump out of that mindset. It’s definitely a mind shift.
Alexander Ferguson 15:51
It’s a shift in mindset. Can you share anything more about who your customers are now that are working with you?
Tobias Kunze 15:58
So our prime target audience is the people who live and breathe these problems: I’m touching something in production, a new deployment, and one out of three times something breaks somewhere else, right? It’s a pure matter of complexity. So those tend to be the platform developers, the platform operators in companies, all the way up to managed service providers. So the people who kind of expose Lego blocks to their internal or external customers, and then run a ton of different workloads. By the way, demand curves are pretty unpredictable, right? So everything is already in a fairly unpredictable space. Contrary to that, if you have just five services, a small microservice, whatever web application, you’re not our customer; you can write this like you always wrote applications. The tech stack is a little bit different, but it’s the same thing. You plan it, you design it, you implement it, modify it, and code really matters at that stage.
Alexander Ferguson 17:02
The real power and the purpose of your platform comes into play when you get to a scale of enterprise where there are just too many applications, there’s no way that you can dig into all of them. There’s just a complexity there; you need a platform like that, a bird’s-eye view.
Tobias Kunze 17:15
Everything’s connected, everything changes all the time. Something changes over here and, literally the butterfly effect, something else breaks over there. Now, what are you gonna do? Are you going to push back and not deploy anymore, because, oh, we need to figure that out first, we need to de-risk the deployment, right? Or do you just say, that’s how it works, I just need to invest in a capability to manage, to, you know, effect control?
Alexander Ferguson 17:38
What can you share, looking ahead from here, about how the space will change as we go forward?
Tobias Kunze 17:46
It’s very clear, applications are not going to become less complex. Even the types of applications: they are going to be based on even more data, right? More data means more connectivity to other services, sources of data. And because of physical limitations on how much data we can process in and of itself, there’s going to be intermediate stages, there’s explosions, there’s redundancies, there’s different clouds, there’s edge locations. It’s only going to explode. And we are reaching a limit where basically most applications I see today are not synchronous anymore. Everything is eventually consistent, obviously, because that’s the only thing you can do. So even where that application landscape ends is not really clear, right? Because you’re somehow connected to some cloud region that has, you know, a peering failure, right? And now you are having a retry storm, or there’s a feedback loop, you know, whatever the case may be. And most companies today fly totally blind to these effects. Once they happen, there’s nothing they can do. And it’s not just about if it happens, because frankly, these degradations happen all the time. At a certain scale, something’s always off. And you just don’t know how much time you’re wasting and how many resources you’re wasting by not seeing it. Because what happens is, you have, again, 1,000 alerts a second. And some people in your engineering teams are always looking at these alerts, whether they are meaningful or not. Or they’re not looking at them, and then you have different problems. You’re not insulated from this at all. Most of what we deal with in SRE or in, you know, operations today, we don’t own. We need to be able to deal with the fact that we don’t own it.
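A retry storm of the kind mentioned above is easy to underestimate. As a rough worked example, assume every layer in a call chain retries a failed call the same number of times. Then one user request can fan out to (1 + retries) to the power of the chain depth attempts at the bottom layer. The Python sketch below just computes that worst case; the function name and the uniform-retry assumption are illustrative and not taken from the interview.

```python
def worst_case_attempts(retries_per_layer: int, depth: int) -> int:
    """Worst-case attempts hitting the deepest service for one user request.

    Assumes every one of `depth` layers makes 1 initial call plus
    `retries_per_layer` retries, and that every attempt fails.
    """
    if retries_per_layer < 0 or depth < 0:
        raise ValueError("retries_per_layer and depth must be non-negative")
    return (1 + retries_per_layer) ** depth


if __name__ == "__main__":
    for depth in range(1, 5):
        attempts = worst_case_attempts(retries_per_layer=3, depth=depth)
        print(f"depth={depth}: up to {attempts} attempts per user request")
```

With 3 retries per layer, a chain of 3 services can already turn one failing request into 64 attempts, which is why a failure in one place can show up as overload somewhere else.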
Alexander Ferguson 19:56
So the future, as you state, is not going to get simpler. And we can’t own all the applications. So having something that allows us to see across it and be able to adjust things in flight is really going to be integral, Tobias. I really appreciate you spending a little time with us. For those who want to hear a bit more about the journey, because this is not your first day in this space for sure, stick around for part two, where we’ll be going into our founder’s journey to hear his insights. But those that want to learn more about Glasnostic can go over to Glasnostic.com. That’s GLASNOSTIC.COM. And probably the first thing they could do is a demo. Is that a good first step?
Tobias Kunze 20:35
They can look at the demo, they can sign up for an account. No problem. Yeah.
Alexander Ferguson 20:40
Awesome. All right. Well, thank you so much for this, Tobias. And we’ll see you all on the next episode of UpTech Report. That concludes the audio version of this episode. To see the original and more, visit our UpTech Report YouTube channel. If you know a tech company we should interview, you can nominate them at Uptechreport.com. Or if you just prefer to listen, make sure you subscribe to this series on Apple Podcasts, Spotify, or your favorite podcasting app.