High-Performing SRE Teams with Dave Mangot

Dave Mangot joins Mike to give more thoughts and depth on his idea of “ops smells”: like the infamous “code smell,” Dave has identified a number of ops smells through his lengthy career in Ops/SRE. This episode covers a range of wonderful topics, including the dangers of outsourced ops teams, testing in production, and the value of consistency in your infrastructure.

About the Guest

Dave Mangot is the author of Mastering DevOps from Packt Publishing. He’s formerly the head of Site Reliability Engineering (SRE) for the SolarWinds Cloud companies and an accomplished systems engineer with over 20 years' experience. He has held positions in various organizations, from small startups to multinational corporations such as Cable & Wireless and Salesforce, from systems administrator to architect. He has led transformations at multiple companies in operational maturity and in a deeper adherence to DevOps thinking. He enjoys time spent as a mentor, speaker, and student to so many talented members of the community.

Guest Links

Links Referenced: 

Transcript

Mike Julian: Running infrastructure at scale is hard, it's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.


Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at influxdata.com.


Mike Julian: Hey folks, this is Mike Julian, I'm here with Dave Mangot, former head of SRE for SolarWinds Cloud. Welcome to the show, Dave.


Dave Mangot: Thanks Mike. It's great to be here.


Mike Julian: So why don't you tell us a bit about yourself and what you've been up to lately.


Dave Mangot: Sure. Um, I've recently, like you said, left SolarWinds Cloud. There I was running the global SRE organization. We started that with about two people at Librato and wound up growing it into multiple teams in multiple locations, with certainly lots of products you've heard of. Before that, I was an architect in technical operations at Salesforce, working on their internal infrastructure, monitoring, automation, configuration management, all kinds of fun stuff like that.


Dave Mangot: I've been able to take a bunch of the experience from those two companies, plus lots of years of experience doing stuff before that, into a pretty fun conference talk that I gave a few weeks ago in Nashville at the USENIX LISA '18 Conference. I was talking about familiar smells I've detected in your systems engineering organization and how to fix them. And so it was really fun to be able to use a lot of the things that I've seen in my career, and all the things I've seen as we try to mature engineering organizations, both operationally and in all kinds of other fun DevOps ways, and be able to give a talk about that. It was, I guess you could say, controversial at times, but well received at other times.


Mike Julian: It's always interesting when you start pointing out the smells in people's infrastructure, and their reactions are kind of all over the place. Some people are like, "Yes, you're absolutely right." And other times it's like, "No, no, please don't say that. Like, I'm not cool with you pointing out my flaws."


Dave Mangot: Yeah, I think the important thing is, you know, what I was trying to get across in the talk as a message is, like, nobody's perfect, we all have to get better. If I've seen those things in other organizations, there's a chance that I was involved in those things. It's not like I just kind of wandered in from Mars and I knew everything and everything was all good. Like, I've been growing throughout my career, and I try to continue to keep growing throughout my career. And if people feel a little bit uncomfortable, I think that's good, right? That's a sign of growth, if we're feeling uncomfortable. It's just, what do we do with that? If we get defensive, then we could double down on something that maybe we know deep down inside isn't really working for us, as opposed to maybe we go home and sleep on it and wake up the next day and say, "Well, maybe there's a couple of things here that we could actually get some incremental improvement on." And that's ultimately how we make all these successful changes to our infrastructure.


Mike Julian: I've had people come find me where they've taken my place in their previous job, and they'll be like, "Why did you do this really dumb thing?" And my answer is, "Yeah, it was really dumb. I'm sorry about that." Like, that was a technical smell that I created, and I'm really sorry.


Dave Mangot: Yeah.


Mike Julian: These are things that we did, like they're not necessarily ... we didn't do them because they're good ideas, we did them for all sorts of reasons.


Dave Mangot: Yeah. And I also think that's important with, like, John Allspaw's point that it's easy to come back later and say, why did you do that? That was a bad idea. But we make the best decisions that we can with the information that we have at the time.


Mike Julian: Absolutely.


Dave Mangot: It turns out that was a bad idea.


Mike Julian: Yup. So this talk that you gave at LISA, sadly I couldn't make it to LISA to see it live, but you and I have been talking about this stuff for years off and on over coffee and lunch, and I did get to watch the talk after the recording went up. It's a fantastic talk, which we'll link in the show notes. But what I want to talk about with you here is, let's talk about that some more. Let's maybe do the 2.0 version of this talk: go a bit deeper, explore some of the points that you made a bit more. With the conference talk you were limited on time; we have a bit more time on this.


Dave Mangot: Yeah, there was a lot of content that I definitely cut out, and some of it was really, what would you call it, painful to drop from the talk. I know that you recently gave some talks up in Oregon, and cutting down to the time that you have allotted sometimes can be-


Mike Julian: It's rough.


Dave Mangot: It could be like being told you should give up some beloved piece of infrastructure that you have your heart set on and somebody gets up at a conference talk and says, "Yeah, you probably shouldn't be doing this."


Mike Julian: Stephen King has a fantastic quote on that: "Kill your darlings. Kill your darlings. Kill your darlings." The stuff you're most attached to, yeah, you should probably cut it, but even then it's so hard because it's such an interesting topic. It's a fascinating story, it's a really good point, so it's hard to cut. So now that you've said that, what is some of the stuff you had to cut from the talk?


Dave Mangot: One of the things I really wanted to talk about was using revision control. I know it sounds really silly, right? Especially for those of us who've attended all the DevOpsDays where the unicorns stand up and talk about this and that and whatever. I'm certainly old enough to argue the finer points of CVS versus SVN versus Git, but I don't really care which you use. One of the things that I was going to talk about in the talk about systems engineering is: make sure your stuff is in revision control, no matter what. And that really comes back to the fundamentals of how we make change, right? So if you're using configuration management for things, and certainly there's still a role for configuration management despite the rise of containerization and all that other stuff, there's lots of things that we wouldn't necessarily want to run in a container, or there's no advantage to it or whatever.


Dave Mangot: But one of the things when we were starting on this journey was: if we're going to make changes to production, don't go on the host and make changes — make the change in configuration management and then go make the change in production. We have to use the code to do it. And so, going back to the origins of the talk, one of the things I really tried to make a central part of how we're supposed to examine these problems was this concept of crawl, walk, run. We don't have to jump to the end, like everything's "we're running Kubernetes in production tomorrow because we decided we wanted to do it today." We need to get our house in order, in order to get to the point where that's something that we can do, if that's indeed what's best for the business, obviously.
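
To make that concrete, here's a minimal sketch of what that kind of code-first change workflow can look like. The repo, role, and playbook names are placeholders, not anything Dave describes:

```bash
# The change goes into the config-management repo first, then gets applied to
# production from that committed code, never by hand on the host.
git clone git@example.com:ops/config-mgmt.git
cd config-mgmt
$EDITOR roles/webserver/templates/nginx.conf.j2    # make the change in configuration management
git add roles/webserver/templates/nginx.conf.j2
git commit -m "Raise worker_connections for the webserver role"
git push                                           # code review / CI happens here if you have it
ansible-playbook -i production site.yml --limit webservers   # apply the committed change to production
```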


Dave Mangot: But I really liked the idea of configuration management, or even just configs, like forget about the systems, but making sure those things are in revision control, because there's all kinds of stuff that can happen, especially when we're firefighting and we're in the middle of a crisis and we just want to get things fixed quickly. This is one of our ways of making sure that we understood what we did. Well, did we change that file before we changed the other file, or did we change it after? I don't remember. Things get real confusing, and if you have this practice of committing stuff to revision control, then it's all documented there for you. And I loved working with Peter Norton, that's, I guess, his handle on Twitter as well, because he was always like, "We can roll it back." Because that's the great thing about configuration, or about revision control, is-


Mike Julian: We know what it looked like before.


Dave Mangot: Yeah. Like we can always roll it back. Like that's not getting into the discussion of like whether there's true rollback of applications or whatever, but like the actual changes that we committed, that's why Git has a history. So like don't be afraid of it, like embrace it. That's really important.


Mike Julian: I worked at a job where we had this problem, and we were all looking at how do we do this, how do we do it like DevOps says we should do it, where it has to be a full CI/CD pipeline or else it's not worth doing. And I don't know, that's bullshit. Except no one in the company really knew Git except for me, and everyone was kind of uncomfortable with the idea of revision control in general. So I'm like, we have RCS, let's use RCS.


Dave Mangot: co -l.


Mike Julian: So kicking it old school there, the granddaddy of version control. But like, yes, it sucks, it's not a great version control system, but it worked. It solved the problem of versioning the files. And from there we eventually moved to Git.


Dave Mangot: Better than none.


Mike Julian: Exactly.


Dave Mangot: Yeah. So I kinda wanted to make a point of that, and then you kind of took it to the natural extension. If we're gonna do crawl, walk, run — crawl can be RCS. There's nothing wrong with that. We're trying to develop some muscle memory here. That's how you get started. And then yeah, maybe walking is using Git, or maybe walking is something else, like setting up a Jenkins server, and then maybe run is we've got Docker containers that are automatically being generated on code check-in and whatever. And even with Docker, you have versions for the containers. That was a topic I really wanted to explore in the talk and never really could. Time was a problem, so I had to let it go.
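
For anyone who hasn't touched RCS in a while, this is roughly what that "crawl" step looks like in practice; the file name here is just an example:

```bash
# RCS, the "granddaddy" mentioned above. First time: check the file in.
ci -u httpd.conf      # creates the ,v history file and leaves a readable working copy
# Every change after that: lock, edit, check in.
co -l httpd.conf      # check out with a lock so you can edit it
$EDITOR httpd.conf
ci -u httpd.conf      # record the change (you'll be prompted for a log message)
rlog httpd.conf       # later: see who changed what, and when
```

Not glamorous, but it gives you exactly what the conversation is after: a history you can read and a known state you can roll back to.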


Mike Julian: Yeah. It's such a good topic. I think a lot of people take it for granted that, oh, of course everyone's doing this. And I don't think that's true. Like even in companies that are doing this, there are places in the infrastructure where they're not doing it.


Dave Mangot: Yeah. And it's in the talk, in the testing infrastructure section, right, where I kind of take, what would you call it, an extreme position on runbooks, that I don't appreciate them. Not the remediation kind, but the "this is how we're going to make changes in production" kind. But that flies so much in the face of doing revision control, because if my changes are done via me reading a Word document and doing whatever the computer tells me to do, there's nothing there, like, who knows. We have 500 servers in production, what percentage of them have received the change? A bunch. A bunch isn't a number, that's not something we can quantify, and that's really important.


Mike Julian: Somewhere between one percent and N percent.


Dave Mangot: Right.


Mike Julian: It's like who knows?


Dave Mangot: And that's, like... people like you and me have done this a while; that's where you're having outages.


Mike Julian: Right, when the systems are mostly the same.


Dave Mangot: Yep. I guess Knight Capital is probably like the classic example of that one.


Mike Julian: Right. Yeah, the Knight Capital story, for those not familiar, is the... actually, why don't you tell that story. I think you probably know it better. You've been looking at this for a while. That's the one where they deployed a change too quickly, or something like that?


Dave Mangot: I don't remember the details exactly, but what I do remember is I think there were like three servers, and two of them were configured one way and one was configured another.


Mike Julian: Yup.


Dave Mangot: And I don't remember the specifics beyond that but that's you know, it is not a surprise that there was a major problem when some of the infrastructure is one way and some of the infrastructure is another. And that's not to say, like, canarying is bad or anything like that, but like this isn't canarying, it's not intentional. Like this is somewhere where we wound up and in canarying you have a very specific understanding of which machines are the canaries and which ones are not.


Mike Julian: And we know how they were changed in a very specific way.


Dave Mangot: Yeah. And if I want to roll that back, I can do that. If I'm SSHing onto the host and 207 out of my 500 hosts are configured one way and the rest are configured another, I don't know what that means. I don't know which are the canaries, which ones aren't. Like, it's impossible to reason about. And then when you have an outage, then you get people running off saying, "Well, I'm going to go check these 60 servers and I'm going to go check those and you check these." I don't want to spend my time in an outage checking stuff. I want to have a very specific understanding of how things are configured, and that's why the automation comes in and the revision control comes in and all these other kinds of things. Because testing in production is fine if you are deliberately testing in production; you don't want testing in production to be a byproduct of a bad process.
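
Here's a small sketch of what that deliberate version can look like, where the canary set is an explicitly named group rather than whichever hosts happened to get touched. The group names, the playbook, and the app_version variable are all made up for illustration:

```bash
# Push the new version to the named canary group only, so everyone knows
# exactly which hosts differ from the rest of the fleet.
ansible-playbook -i production deploy.yml --limit canary

# Watch the canary dashboards. If it looks bad, roll just those hosts back:
ansible-playbook -i production deploy.yml --limit canary -e app_version=previous

# If it looks good, roll the rest of the fleet forward the same, repeatable way:
ansible-playbook -i production deploy.yml --limit 'webservers:!canary'
```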


Mike Julian: Yeah. Testing in production is an interesting point. I think you and I probably hold a bit of a contrarian view. Personally, I think the whole testing in production argument is mostly bullshit. What do you think about that? Like, how do you view it, having run large scale systems yourself?



Dave Mangot: I think testing in production has a place, but testing in production is one of many things that are required, and being able to safely carry out tests in production is a really important capability. We can do that with dark launches, we can do that with things like feature flags, we can do things like blue-green deployments, the canaries, all this stuff that we're talking about. That's great. I think the place where I get nervous when I see people having this discussion is when it comes out as "you should test in production," just the way that it comes out. You should have the ability to test in production, but testing in production is a really, really expensive place to test. And there's a lot of reasons for that. One simple one would be: if I have an outage in production, that's expensive. Now I've got all hands on deck, everybody stopping what they're doing. That affects the people who are responding, it affects the delivery of my software, because we have a fire now and everything else comes to a stop.


Dave Mangot: And then there's the reputation damage, and then there's the customer damage. There's a lot of ways that testing in production can be dangerous, but then it's expensive on that side. The other side of where it's expensive is, to go back to Jez Humble and Dave Farley's continuous delivery book, it's way cheaper to fix things early in the development life cycle, let's call it on the left side, than on the right side. And the reason for that is: I sit down, I write some code, I run my unit tests and whatever other tests I want to run, and maybe that becomes integration tests down the line or whatever. But if I get feedback 10 minutes later that says you broke something, okay, I can fix that. I just wrote the code 10 minutes ago. It's not a problem. I'll sit down, oh yeah, I totally messed that up. All right, I'm going to change this, I'm going to change that. Okay, we're done. Boom. That's cheap. We talk about fast feedback a lot in DevOps, and that's a really cheap fix. If I'm doing all my testing in production, well, okay, I wrote that code yesterday, or the day before, or maybe, depending on how fast your deployment pipeline is, a week ago. Now I've got to go back and re-familiarize myself with the code and what was I thinking back then and whatever. It takes a lot more time to fix things.


Dave Mangot: And yeah, I understand, if I check something in and it creates a Docker container and that goes out like 20 minutes later, maybe I'm not going to have forgotten. I can understand that argument. At the same time, do I really want to give my customers a bad experience? And saying, "Oh, well, you should have all these things in place so they don't have a bad experience" is fine, but that comes back to the old "in theory, theory and practice are the same; in practice, they're not." In theory, I can have all this stuff, that's magic and it's awesome, but in the end I'm going to have a production outage at some point. It's inevitable. So I'd much rather have those outages earlier in the process, because they're not really outages: they're unit tests failing, they're staging environments blowing up, they're whatever. And I think that's just a lot less expensive to be doing it that way.


Mike Julian: You had a major point in your talk that staging is like prod. Where I think the whole testing in prod thing falls down, or creates problems in my mind, is that it's giving people license to not have a staging environment, or to not invest in that area, with the kind of backing of "you cannot have a staging environment that fully replicates production." Which I think for most people is not actually true; you totally can. It's only at very large scale, bigger than most companies are, that you can't do it.


Dave Mangot: Yeah, I think there's always going to be something special about production and I don't think you can catch all those things. I agree with you in terms of it's a lot more manifest at scale because you are not going to be running those kinds of numbers through the staging environment.


Mike Julian: Right. And you can never really predict what a customer is going to do with your systems.


Dave Mangot: Yeah. Right. There can be all kinds of crazy inputs coming in from the internet that you could never have predicted. But at the same time, I think one of the things I really tried to drive home in the talk, why I kept saying staging is like prod, staging is like prod, staging is like prod, is that what I've seen people do is they get, not so much lazy, as they're trying to make compromises, and they're like, "Well, in staging, you know, we would like to do this, but let's just do it a different way just for staging." And I think you have to be extremely, extremely deliberate about making those kinds of choices, because you really do want to catch things as far to the left as you possibly can. And staging is one of the best places to do that. And it's also, I know when we're trying out new things, a great environment for doing that.


Dave Mangot: But the important part, when I say "staging is like prod" multiple times, is that it really has to be a representative test environment. If you find that you're making these compromises so that it's no longer a representative test environment, that's when you're going to have trouble. And that's why you're going to be like, oh, let's just put it in production and find out, because we didn't spend the time on staging. And I talk about it in the LISA talk: if you've ever seen Gene Kim's general "How do we get better?" speech, and there's lots of variations on it, one of the main points he drives home is that high-performing organizations can build representative test environments in a very short period of time. And if you can't build your staging environment because you hand-crafted a bunch of stuff to be way different than production, you're not going to be a high-performing team. Sorry, it's a major indicator. And so you can take that all the way back to the famous... what is it, you should be able to walk into your data center, pull a server out of the rack, and throw it out the window? And how long does it take for you to replace that server with an exact duplicate? I mean, thank God that's so much easier for us in the cloud now than it was in the data center.


Mike Julian: I do not miss those days.


Dave Mangot: But still, the point stands: you have to be able to construct things, and it should be the same code. If I'm building something in staging, I'm building something in production. If it's not the same code, then what are you doing?


Mike Julian: Right? Everything we're trying to do is make operating production safer, and having a one-to-one staging environment, or as close to one-to-one as possible, makes production safer to operate. So it's not like we're trying to skip all the way and just say, "No staging, we can't have staging." Staging is a good thing. I'm sure there are going to be some situations where you can't have a one-to-one. I can't replicate Google-scale traffic in a staging environment; it's not going to happen. But I can do some stuff there. At least the infrastructure and the code can be the same.


Dave Mangot: Yeah. And sometimes the argument is financial. Well, we don't want to spend as much on the staging environment as we spend on production. I think that's great. I don't think there's anything wrong with that. You definitely should not be spending as much on staging as you are on production. But one of the things I mentioned in the talk is, if you're running on a c4.4xlarge in production, maybe you run it on a c4.large or something smaller than that. The point is it's called a representative test environment, and it doesn't have to be the exact same thing.
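
One way "representative, but smaller" shows up in practice is making size a parameter in the same code path, rather than hand-building a special staging stack. The variable names and the case statement here are illustrative; the instance types are the ones from Dave's example:

```bash
# Same provisioning code for every environment; only the size differs.
case "$DEPLOY_ENV" in
  production) instance_type="c4.4xlarge" ;;
  staging)    instance_type="c4.large"   ;;
  *)          echo "unknown environment: $DEPLOY_ENV" >&2; exit 1 ;;
esac
echo "Provisioning ${instance_type} instances for ${DEPLOY_ENV}"
```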


Dave Mangot: And going back to like the Humble and Farley continuous delivery book, again, one of the points I really love in that book was the idea of, well, how much confidence do you want to have in what you're deploying to production? Because that's going to dictate how much you're going to spend on your test environments and how much money and how much time and all the other things.


Dave Mangot: I remember working at a company and I was making an argument for infrastructure as code, like testing environments. And they said, "Well, how much money do you want to spend on this?" And I said, "How much confidence do you want to have?" Because it's not up to me, this is a business problem at this point. Like if you want to have a high degree of confidence, we should spend more money on the testing environment. If you want to have an average or a lower amount of confidence, then let's spend less on it.


Dave Mangot: It's not up to me to decide what the right thing is, it's what's the business's tolerance for risk. And some businesses have a higher tolerance for risk than others. And if you're running a really small startup and nobody cares that much about your product and you don't want to build a staging environment, then that's a business decision. And if you're saying we're going to do all of our testing in production, that's a business decision. That's okay. There's no one saying that you can't make that decision, but you obviously have a very high tolerance for risk in that environment.


Dave Mangot: And I think as companies mature and they get more customers and more revenue that they want to protect, their tolerance for risk goes down. And I think that's why, when you look at the Google SRE model with the error budgets and SLOs and things like that, there's a very important lesson there that Google's trying to reinforce: you would think that we would have a very low appetite for risk, but we are very deliberate about the amount of risk that we're willing to take on, and we really do believe in having that risk. That's actually important to us. And so we're going to make that a formal part of our process.
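
For listeners who haven't worked with error budgets before, the arithmetic behind the SLO idea is small enough to fit on a couple of lines. This is back-of-the-envelope math, not anything from Google's process:

```bash
# A 99.9% availability SLO over a 30-day month leaves about 43 minutes of
# "error budget": downtime you have deliberately decided you can afford.
awk 'BEGIN { slo = 0.999; month_minutes = 30 * 24 * 60;
             printf "%.1f minutes of error budget per month\n", (1 - slo) * month_minutes }'
```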


Mike Julian: Yeah, I think that is such a fantastic point, because so many people look at SLOs and SLAs as a technical thing, and they look at staging environments mirroring production, and how many environments do you have, what do they look like. And they look at all these as technical decisions, but they're really not. You're completely right, it's really a business decision. What's your appetite for risk? How much confidence do you want? Do you want to be 80 percent confident or 50 percent confident or 100 percent confident? And all of that's going to cost more or less.


Dave Mangot: Yeah. I love, actually, that you were using real numbers like that, because you recommended the How to Measure Anything book to me and I've been reading it, and there's just a lot of stuff in that book about being able to quantify things that people would consider unquantifiable. So I love that you're able to talk about your appetite for risk in terms of actual numbers. That's pretty awesome.

Mike Julian: Yeah, that book is one of my favorite books and I talk about it weekly just because there's ... like how do you measure risk? How do you measure confidence? And like those are hard problems and people would consider them to be unmeasurable but they actually aren't. And this book talks about how you can do that.


Dave Mangot: Yeah. Like he basically, he does a little bit of the idea of this crawl, walk, run, right?


Mike Julian: Yup, he absolutely does.


Dave Mangot: Start with something. Everyone's trying to jump to the end: "Oh, I've got a 58 percent tolerance for risk." Forget it, you can't just start there. Start with something. And so I think that is really good. Obviously, with the things that we do in systems engineering or production or distributed systems, it's not unique. There are lessons from other places to be taken there.


Mike Julian: Right. Even when building bridges and skyscrapers or an apartment building, basically everything is mostly known going in, but people in the civil engineering disciplines are still talking about levels of risk, and they're very upfront with, "This is what we're going to do with a 90 percent error budget," essentially. Like, "This is probably going to work, and we're confident to like 90 percent that it's going to be fine." And it's like, "Well, do you want five percent more?" That's going to cost a lot more money.


Dave Mangot: Yeah. And I think that's important for when we're talking about like building the staging environment.


Mike Julian: Yeah, absolutely.


Dave Mangot: If it's really going to cost you 100 percent more to get that last five percent, then let's make some really deliberate choices. And that's okay.


Mike Julian: Does your revenue support that?


Dave Mangot: There's no point in going out of business just to get that other five percent.


Mike Julian: Right.


Dave Mangot: Like, that's not useful. Yeah, when you said testing the bridges, I just kept thinking of that Calvin and Hobbes where Calvin asks, "Dad, how do they decide what the weight limit is on the bridge?" They just keep driving bigger and bigger trucks across it until it collapses, and then they weigh the last truck and rebuild the bridge. And then the mom's like, "If you don't know the answer, just say so."


Mike Julian: I love that comic. We were talking before the call about boring technology, and there was an article that sparked this a couple of years ago by a... oh, I forget his name. McFinley? What's his name?


Dave Mangot: I think it's Dan McKinley.



Dave Mangot: Yeah. He's got a whole website that he did with his slides from a talk and then he has an explanation of every slide next to it. It's really-


Mike Julian: Oh, that's cool. I haven't seen that.


Dave Mangot: That's really nice.


Mike Julian: So when you're running these large scale systems, how does that actually play out? Like what do you ... I know you believe firmly in this “choose boring tech,” so how does that actually look when you're making decisions?


Dave Mangot: Yeah, I believe in it because I've been burned by it, right? That's mostly the way that people learn things: the hard way. But there's a couple of different things to that. The first one is the systems that we're working on are complex, and if they're not complex... and when I say complex, I mean it in the... what's it called [inaudible 00:31:05]. I gave a talk on this, I don't remember. In the Cynefin way. But this is a natural byproduct of working on any of these infrastructures for long enough: you're going to get complexity. Whether you started out that way or not is irrelevant. If I started out with two servers, or two cloud instances, whatever you want to call them, and eventually I have 500, something happened along the way. We're not starting out with our two things or whatever.


Dave Mangot: And so I kind of say choose boring technology because complexity is an emergent property of all these things — you run anything long enough, you ship enough code; if I have 2000 code ships, whether that's through Docker containers or whatever, the environment is going to be more complex than it was when we started. So introducing complexity into an environment that's already going to get complex is sort of a bad idea. It's a bad smell. It's a bad sign, because we're going to get that for free. I don't have to do any extra work to try to make things more complex; that's going to happen. And so it comes back to even some of the stuff that you and I were talking about earlier, like when we talked about revision control: if all the configs for app foo are under /etc/foo and all the configs for app bar are under /etc/bar.


Dave Mangot: If I'm going to make some new application called baz, make sure the config for that is under /etc/baz. Don't be like, "Oh well, /var/lib/baz seems like a good place now." Right? And it sounds silly when you put it out there like that, but one of the principles of keeping this stuff as simple as we possibly can is there should be some intuition about what the right answer is. So if I'm an operator and it's three o'clock in the morning and I'm on the host and we deployed baz and I want to know where the configs are, I shouldn't be going to a documentation document and reading through it and being like, "Okay, where's the config section? Where's the section that tells me where the configs are for this?" No, keep it simple right from the start. Follow patterns that people understand, so that when it is the heat of the moment, they know where to go and they know how to do stuff, because you don't want to be relying on magic.
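
To make the point painfully literal, the convention Dave is describing looks something like this; foo, bar, and baz are stand-ins for whatever real services you run:

```bash
# Every app keeps its configs in the same predictable place, /etc/<app>,
# so finding them at three in the morning never requires the documentation.
for app in foo bar baz; do
  ls -l "/etc/${app}/"
done
```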


Dave Mangot: And that's a little bit more of what I was talking about in the talk in terms of like boring technology. But the magic stuff is what kills you and that's where you have these, "Well, I don't really know what's happening." And nobody ... I mean, I've been in I don't know how many outages, “I don't know what's happening” is not an answer I want to hear or something I want to say.


Mike Julian: Yes. I've always-


Dave Mangot: And obviously with your background and monitoring and things like that, that's one of the things that we're trying to answer. We want to make sure that we're never in a situation where we say “I don't know what's happening.”


Mike Julian: Yeah, absolutely. We were talking about ... you have an Einstein quote that you're pretty fond of.


Dave Mangot: Yeah. I don't know if we have the actual one because I think his actual one is a little bit more wordy, but the one I've heard boiled down to is “make things as simple as possible, but no simpler.”


Mike Julian: Yeah. So there's also a related quote by John Gall [inaudible 00:34:45]. He's a systems theorist, well, actually he was a pediatrician who got into systems theory. And people love to quote... I'd love to quote what he's saying, which is: "A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system." It's a great quote when you're talking about systems architecture, but there's another bit that he goes on to say that people have kind of forgotten about, which is that "a simple system may or may not work." Like, if you were to rebuild the SolarWinds product today and try to make it simpler, you may not be able to. You will probably end up with a pretty complex system, because it just cannot be made simpler and still serve the customers. So complexity is not necessarily a bad thing. As you said, it's an emergent property. It's going to happen.


Dave Mangot: Yeah. I think the important thing is we need to learn from all the things that we're doing. The part of that quote that kind of resonated with me is I was working somewhere where they had rebuilt their entire production stack. And not many people get an opportunity to do this, right?


Mike Julian: Right.


Dave Mangot: That's pretty awesome. But they had a new VP come in, and he was a guy from Microsoft, and they wound up rebuilding their whole production stack on Windows. They were using Java and Tomcat on Windows, but they wound up rebuilding it. I was talking to one of the systems engineers and he said, "Well, obviously Microsoft is a better operating system and a better environment to do development on and run things on, because our new environment has so many fewer problems than we used to have in our old environment, where things were breaking all the time and it was hard to get things done. So obviously this is a superior technology." And I said, "Well, do you think that the reason this new environment is better than your old environment is because the technology is better? Or do you think that you looked at all the things that you were having problems with, and when you re-architected it, you made sure you architected all that stuff out of it? You probably have a lot simpler system now that runs a lot better, because you knew all the things that you didn't like about your last environment." Oh, okay, maybe it was that.


Mike Julian: Right. Every time I've seen a company or a team move from one technology to another, like competing technologies... I've seen this happen with config management over and over and over. Someone will have Salt, Ansible, Chef, Puppet, it doesn't even matter, pick one. And they'll say, "All right, this is screwed. We're going to move, because it's going to fix all of our problems." They go to one of the other three and everything is better. Except that happens in the other direction too. So how is it possible that two teams, one going from Chef to Puppet and one from Puppet to Chef, both improve their situation by going to the other tool? That doesn't make any sense. So maybe it's because they got the opportunity to start over, to make things simpler, to clean it up as best they could. And that's what made things better, not the tool they were using.


Dave Mangot: Yeah, and that's kind of what you and I were talking about. That's why keeping it simple, keeping it as simple as possible but no simpler, choosing boring technology, all that kind of stuff is so important, because it's almost inevitable that you're going to wind up in that situation where you're like, "Oh my gosh, we have to throw out Puppet and rebuild everything on Chef."


Mike Julian: Ha ha, right.


Dave Mangot: It's almost an inevitability, because, like we said, complexity is an emergent property of these systems. So if we can put that off for as long as possible, that's an important thing, and that makes our systems more operable, and it makes them cheaper to run, and it makes our time to recover faster, and all those other great benefits that we get out of it. Because that's what we want to be doing with our time. We don't want to be spending our time doing endless rewrites from Chef to Puppet and back and forth; we want to be spending our time operating our environment and making money. That's why we're in business. That's why we're in a for-profit organization, that's what best serves our customers, not this shooting ourselves in the foot and then trying to figure out all the ways to undo that.


Mike Julian: I think it was Joel Spolsky who made a comment some years ago that software gets more stable over time, and if you're constantly ripping out whatever your code is and replacing it with something completely different, you're actually introducing instability.


Dave Mangot: Yeah, that's interesting. Considering, I guess, there's the argument from chaos engineering that chaos engineering doesn't create problems, it exposes problems that already existed. I'm trying to wrap my head around how that jibes with that a little bit, because if my code is in production and it's been running there for a long time, we know that latent bug or whatever can be there, but that's okay. It's only when that code path is exercised in that way that we actually have the problem. And so I guess it does work pretty well with Joel's idea, because we can use chaos engineering to be able to expose those things.


Mike Julian: Right. And I mean, just even knowing that you have the bug is that's 90 percent of the battle right there.


Dave Mangot: Yeah. And then it becomes a business question as to whether or not we want to do something.


Mike Julian: Exactly. And so actually that's a great segue to another thing that you talked about in your talk, which was: if you have this bug, if you know that you have this situation, then you actually have two or three responses to it. One, you can do absolutely nothing about it; two, you can fix it; or three, you can kind of patch over it with a human response, and how this tends to take shape is a runbook. So rather than go fix it, or maybe in the interim while it's getting fixed and getting [inaudible 00:41:25] and all that, you write up something like, this is how you handle this when this happens. You had some pretty interesting stuff to say about runbooks in the talk.


Dave Mangot: Yeah. The runbooks I was talking about in the talk were the... what would you call them? The procedural runbooks, as opposed to, say, a remediation runbook or whatever. And the procedural runbooks I have a big problem with, because I'm a big fan of making computers work for me instead of me working for computers. I don't know why, it's just a thing I've picked up over the years.


Mike Julian: That’s weird.


Dave Mangot: But when I'm sitting down and there's a Word document or Google document in front of me that's telling me what I should be typing, I kind of feel like the computer is telling me what to do. I just kind of have a problem with that, like, fundamentally. But also, like we were talking about earlier, that's a good way of causing an outage. There are things that happen, and I talk about it in the talk, where people make decisions that you didn't expect, and all kinds of other things, because you're leaving it up to people's judgment, and everybody's got their background and experiences that they bring to the table.


Dave Mangot: Ordinarily, that's a really good thing. If I'm going to have an SRE team or something like that, I want a lot of diversity, I want a lot of different opinions, I want a lot of different backgrounds. When people are making judgments about which file to edit? Not interested. I don't want a lot of diversity there, I don't want a lot of different ways of editing the file, or "well, we should use this convention." I want no guesswork.


Mike Julian: I want one convention, I want one location.


Dave Mangot: Right. And I think, to your point a little bit about remediation runbooks: I'm not a fan of them, but I'm not as against them, I guess. It really depends on how they're implemented.


Mike Julian: Sure.


Dave Mangot: In the case of a procedural runbook, I don't care how it's implemented, I'm against it. You can come to me with all the different arguments you want, I'm totally against that. I think it's a terrible idea, we definitely should not do that. The remediation runbooks that I've seen work well are not really runbooks as much as they're a guide to things that you should be looking at if this application is having trouble. So like these are the important metrics, this is what this means, these are the downstream systems for this, these are the upstream systems for this. It's all about context and understanding, right? Because ultimately the goal of all these tools or whatever is to empower the humans to do great work and we're creative, we do all kinds of things. The more data we have, the more information that we can use to make these decisions.


Dave Mangot: I think that stuff is really important, and I think that's great. And certainly this isn't your idea or my idea; Allspaw talks about this a lot in terms of what the tools are supposed to be doing for you. I think the place where it starts to go off the rails a little bit, in terms of remediation runbooks, is when it starts to do those things like: if you see this, type this; if you see that, type that; if you see this, restart this thing. That's where I start to feel like it goes off the rails, because then it starts to become a lot more like that procedural thing. And literally all you're doing now is giving the people who are trying to fix the problem one more thing to think about and one more thing to remember.


Dave Mangot: Because if they read it in the runbook like three times, then the fourth time they have to fix this problem (and there's already problems with just that statement), they're like, "Oh, I know what I'm doing, I'll just go restart this thing over here." And my goal for the SREs I'm working with is, I don't want you to have a bunch of things in your head that you have to remember; that's a problem. And we've talked about this a lot in systems, in SRE, on the teams: if you're telling me that the solution here is that a sysadmin or an SRE has to remember some specific thing about this application at three o'clock in the morning, that's a fragile system. And I don't want to put my SREs in a situation where they have to remember stuff like that. I want them to be in a situation where they understand what they're doing and they have the tools that are necessary to get themselves out of that situation.


Dave Mangot: But using a human as a repository for minutiae, that's just terrible. That's not a good place to be. And so I think, for the remediation runbooks, it's very dependent on how they're implemented. Now, going back to what we said, if you are in a situation where you're trying to fix that same problem for the fourth time, that goes back to one of the other things in the talk, which was: don't fix problems, solve them. Because I've seen this way too many times, where people think that their job is to restart the Tomcat server. And that's-


Mike Julian: Yeah. It's like, no, no, your job is to make that so I never have to do that, so it's no longer a thing.


Dave Mangot: Right. And we talk about this a lot in our SRE stuff: I don't want you to solve the same problem more than once.


Mike Julian: Yeah, exactly.


Dave Mangot: Like, if I've seen that problem, what am I going to do about it? And the answer isn't always something that that engineer can do all by themselves, but that's DevOps, right?


Mike Julian: Yep.


Dave Mangot: I'm going to sit down with the teams who are responsible for that system or interact with that system, whether it's developers or DBAs or network engineers or whoever, and be like, “What can we do so that we never see this problem again?” And that's different than saying, “What can we do so that if we see this problem again, we know what to do,” right?


Dave Mangot: That for me is where things are... there is a problem, because, well, what are you going to do? "Well, I'm going to restart Tomcat." Okay, well, now we have a solution, great. And now, if I'm going to say we're going to measure MTTR, which I guess we're not supposed to measure the mean anymore, but if I'm trying to lower my time to recover, now it becomes a race: how fast can I get to the computer and restart Tomcat?


Mike Julian: Yeah.


Dave Mangot: That's not a good situation to be in.


Mike Julian: I once worked in a company that had an outsourced ops team. And this was like the lowest level of ops, so basically they were on call, and that's the only thing they did for us: tier one on call. And they had runbooks where they would do various things when they got various alarms, but the alarms were never very good and the solutions were also never very good. So one time they called me, and it's like two in the morning, and they say, "Hey, this thing happened." And in my mind I'm thinking, "Oh god, I know exactly what that is, and they did the wrong thing." And then I'm like, "Okay, what are you calling me for?" "Well, it said to push this button. So I pushed the button and the whole thing exploded." I'm like, "Okay. Did you know that it would explode when you pushed the button?" "Well, yeah, we've done this before." "Well, why did you push the button?" "Well, the runbook said to."


Dave Mangot: All right, yeah. My favorite, for those of us... I worked for a company where the pager would go off, and then inevitably a minute later the president of the company would call and say, "What are you doing to fix this problem?" And I was like, "Well, number one, I was trying to fix the problem, but now I'm talking to you. And number two, if you're going to call me every time the pager goes off, why do I need a pager? Why don't you just call me?" Skip it.


Mike Julian: Yup.


Dave Mangot: The runbooks can be really dangerous. And the outsourced ops thing is something we also talk about in the talk. And I think that's one of the things I was... your story highlights it very well: what are you gaining out of that? Their incentive is sort of "don't bother Mike." And that's about it. Other than that, their incentive is to follow whatever it says in the runbook, because who's going to fire them for saying "I did what it said in the runbook"? As management, that's not a defensible position. Wait, they followed the instructions exactly, so you fired them? No, that's not gonna work. And so you don't want to set up environments like that, or, to use your... when you're talking about systems theorists, you don't want to set up your system like that, because the incentives are all wrong.


Mike Julian: Yeah, that's a bad feedback loop.


Dave Mangot: Yeah, yeah. Absolutely. And that's why outsource ops can be so dangerous is how do you get out of that situation? Like who's going to be the driving factor to get out of that situation?


Mike Julian: No one. Like that's a politically difficult environment. 


Dave Mangot: Like, hey, they said, “Hey, we restarted it even though we knew it was going to blow up.” Okay, well that was a bad idea, but what's the incentive to get out of there? Who's going to be the person that drives that change? Is it their manager? Like when Mike goes back to their manager and says, "Hey, they did what it said in the runbook and it blew up, so they probably shouldn't have done that." Like you're not going to win that argument.


Mike Julian: No.


Dave Mangot: And that manager will be like, "Well, give me better documentation." That's kind of going to be their answer, because they don't want to be in that situation either, and now it's on you to do that, because you don't want to be woken up for something dumb. And so now you're feeling the pain, and that's going to be the only thing that changes anything. But what if that only happens once every six months, or once every eight months, or every year and a half? When is Mike going to be like, "Well, what I really need to do is stop everything that I'm doing and go write up better documentation for people, for something that happens once every 16 months"?


Mike Julian: Right.


Dave Mangot: Like, no way, that's never gonna happen. That's never going to get prioritized.


Mike Julian: Nicole Forsgren, in her State of DevOps research, found that companies that use outsourcing, or functional outsourcing in IT, are three point nine times more likely to be low performers. I'm like, yeah, I could definitely see that. In fact, I think the number might be a little low.


Dave Mangot: Yeah. I mean I agree with that 100 percent and the research that Nicole has done is freaking outstanding.


Mike Julian: Oh yes.


Dave Mangot: And, like, whatever. But I think the thing I would be worried about when I hear that, and like I said, I totally agree with it, is people are like, well, we're low performing because we have outsourced ops or whatever, or we're more likely to be low performing. And I think it kind of goes back to what we were just saying; it's sort of a feedback loop, one thing causes the other. Low performers will choose to do outsourced ops, and then people who do outsourced ops will wind up being low performers. I think they both feed each other, because you're in that situation where, what's your incentive? Mike's incentive is not to fix the documentation for something that happens rarely, and their incentive is not to be creative thinkers, because that's not what they're being paid for. They're being paid for doing the thing that's in front of them.


Dave Mangot: And so I think the only way out of that situation to like to do our crawl, walk, run is like, we need to change the incentives and so maybe the right thing for that manager of that team is to say like, "Hey Mike, we don't want to call you for no reason and you don't want to get called for no reason and these folks are just doing what it says, what can we do to change the scenario or change the situation so that this doesn't happen?" And this goes back a little bit to don't fix problems, solve them. Like in that situation, the discussion shouldn't even be “Mike, can you help us with the documentation for this one specific thing” — because that's fixing the problem, that's not solving the problem.


Dave Mangot: The discussion should be: what can we do to fundamentally change the relationship that we have, and what people's incentives are, so that we don't keep winding up in this situation? Because I don't want to keep coming back to you, Mike, and saying, "Can you fix this documentation, can you fix that documentation?" Because that doesn't get us anywhere. We need to change the relationship so that they're incentivized to say, "Hey, we knew that this thing blew up the last time. Why did it blow up? What got us into this situation? What are the things that we can do to understand better why we're in this situation? What are the things we can do better, like we just said before, to make it so that we never see this thing again? I don't want to see the same problem twice."


Mike Julian: Yeah. Like, that completely shifts the relationship from us versus them to us and them. They're no longer someone outside the team; they're now part of the team.


Dave Mangot: Right. And there's your DevOps in a nutshell.


Mike Julian: Right. So on that note, it's been absolutely wonderful talking with you today. I like to have some sort of closing action. When people are trying to improve stuff, especially these smells, what's something that they could do today or this week to really help them improve? Like something concrete?


Dave Mangot: Yeah. I think my advice for that always comes back to crawl, walk, run. I was talking in the talk at LISA about, if you've ever read The Lean Startup by... the name is escaping me. Anyway, if you've ever read The Lean Startup, one of the things they talk about there is the MVP. The MVP is the minimum viable product, and a lot, a lot, a lot of people really mess up what the MVP actually means. They think that the minimum viable product is the junkiest thing I can possibly develop that I can throw out there, and that's the minimum viable product, because it's a viable product and it's minimum. But the real lesson of that book, and of the MVP, is that it's the minimal thing that we can put out there and learn something from. If we're not learning anything from it, it's not the minimum viable product.


Dave Mangot: And so what I talk to people about, when they want to crawl, walk, run their way out of it, is: what is the smallest change that I can make that I can learn something from? And this is why, especially when we have these discussions about everybody wants Kubernetes. Like, that's what we're gonna do, we're gonna go full-on K8s from the start. It's like, okay, well, what kind of workloads do we want to run on Kubernetes, and what are the things that Kubernetes is going to bring to the table for us, and are these problems that we have? Because if they are not problems that we have, then let's go back to our choose boring technology idea. Don't run some massively scalable distributed system when that's not a problem that we have.


Dave Mangot: And then once we've kind of thought about what are the things we want to do, how can we solve this problem? What are the different ways we can do that? Then we talk about what is the first step that I can do that I can learn something from? Because one of the great lessons about agile is to not do agile but to be agile. And in order to be agile, I need to be able to respond to things that as they come. And this is why we don't do waterfall anymore because if four months into a six month project we discovered that we're doing all the wrong thing, but waterfall says we should keep going, then we're just going to waste another two months of time and effort and money and everything. So forget it. We need to be agile. We need to be able to adjust. And so by doing this like smallest thing that we can bite off that we can learn something from, then I have the ability to say, "Hey, you know what, we thought we were going to need Kubernetes, but it turns out we did this other thing, and this is actually sending us down this completely different path that we did not expect that we were going to go down, but it turns out this is going to fit the problem that we have so much better than what we thought we were going to do like right in the beginning when we were trying to jump from nothing to run." And so that's sort of my takeaway for people is what's the smallest thing I can learn something from, start there.


Mike Julian: Yeah, that's fantastic advice. So where can people find out more about you and your work?


Dave Mangot: Certainly, I'm available on Twitter, I'm Dave Mangot on Twitter. If you are interested in a lot of these DevOps concepts that we talked about, I have a video course from Packt Publishing called Mastering DevOps. It's about six hours of listening to me talk about high-performing organizations, but certainly that's another great resource that I'm happy to have had the time and ability to put out there. And I think a lot of the things that you and I have been talking about are kind of covered in that course as well.


Mike Julian: Oh yeah. All right. Well thank you so much for joining today. This has been a wonderful call.


Dave Mangot: Thanks Mike. I really appreciated it. That was fun.


Mike Julian: Thanks for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at RealWorldDevOps.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


Want to sponsor the podcast? Send me an email.

2019 Duckbill Group, LLC