Troubleshooting in China with Steve Mushero
The wild world of systems in China may be different — and smaller — than you’d think. This episode’s guest is Steve Mushero, CEO of ChinaNetCloud and Siglos, who joins Mike to discuss the challenges and evolution of systems infrastructure in China. They also dig into what it could look like to standardize troubleshooting methods and the challenge of teaching troubleshooting to people.
About the Guest
Steve Mushero is CEO of Ops Platform provider Siglos.io, and CEO of ChinaNetCloud, China's first Internet Managed Service Provider, an AWS Partner, and manager of hundreds of large-scale (up to hundreds of millions of users each) systems. He has previously been CTO at a variety of organizations in Silicon Valley, Seattle, New York, and around the world. You can follow along with his work and insights on LinkedIn, Medium, and Twitter.
Links Referenced
- ChinaNetCloud
- Siglos
- Taking Over & Managing Large Messy Systems (LISA 2018)
- Incidents as we Imagine Them Versus How They Actually Are by John Allspaw (PagerDuty Summit 2018)
- Brendan Gregg's USE Method
- Steve’s tool, runqstat
- Observability vs. Monitoring, is it about Active vs. Passive or Dev vs. Ops? by Steve Mushero
- Dao of Troubleshooting
- How to Monitor the SRE Golden Signals
Transcript
Mike: Running infrastructure at scale is hard. It's messy. It's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps Podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly, and author of O'Reilly's Practical Monitoring.
Mike: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you might not be as familiar with their other tools, Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at influxdata.com.
Mike: Hi, folks. My name is Mike Julian. I'm here with Steve Mushero, CEO of ChinaNetCloud, and his new startup, Siglos. Welcome to the show, Steve.
Steve: Thank you. Glad to be here.
Mike: You and I met through Monitoring Weekly some time ago, and we've just been chatting since then, and learning all about, or me learning, primarily, about the wild world of systems in China. I believe the company you run, ChinaNetCloud, is one of the largest, if not the largest, Chinese AWS partner?
Steve: Probably. We're the first one. We were their first sort of operations partner here in China, and also their first MSP, or Managed Service Provider. We've been doing AWS here and around the world for almost 10 years. Yeah, it's been an interesting time.
Mike: Yeah, I'm sure. Why don't you tell us a bit more about what your company does? What are you doing in the Chinese market?
Steve: Here in China, we're basically a managed service provider, so we started 10 years ago now, really managing systems. I was CTO for Tudou, similar to YouTube in the US, so online video and so on. Actually, older and bigger than YouTube. We saw how that market really needed operations help, so we launched this to start helping people design, build, operate, manage, monitor, all that sort of stuff, internet systems. This was before the clouds and all that. Then, we were the first cloud company in China actually doing online VMs. You could buy actual virtual servers; now, of course, you have AWS, and Aliyun, and others doing that. Today, our job is really just design, build, manage, monitor, troubleshoot, and do 7-by-24 support for large-scale internet systems, so we're way in the backend, you might say, but really sort of ditch-digging operations guys and girls, really dealing with 24-hour systems in clouds, and hybrids, and all the interesting problems you can have. Especially at the Chinese scale that we operate. That's what we do.
Mike: I think it's absolutely wild that there is ... To think about the size of YouTube, like, pretty much everyone in the western world thinks about YouTube, it's like, oh, wow, that's absolutely huge. To think that there's a competitor in China that's bigger than that, the scale is staggering.
Steve: Yeah. I think now YouTube is larger, but back, this is 10 years ago, now-
Mike: Oh, okay.
Steve: ... It was different, because YouTube, even now in the US ... think of us more like Netflix and YouTube combined, because YouTube historically has been short video clips, right? 10 years ago, it was, hey, my cat is cute, stuff like this, you know, 90 seconds, three minutes, five minutes. We were running full-length movies and TV, and all that, not entirely legally, you might say. It was more like Netflix.
Steve: We had 100 million viewers a day doing sort of full-length stuff, so YouTube's average thing was whatever, you know, a couple minutes; ours was an hour. Completely different infrastructure needs. This was before China especially had good infrastructure, and BGP routing, and all kinds of stuff, so you had sort of massive CDN problems, and lots of other stuff. I think we were the world's biggest bandwidth buyer for a while.
Mike: Oh, wow.
Steve: I know. Now, everybody else is bigger, but this is, again, a long time ago.
Mike: Oh, sure.
Steve: Because DSL had just hit China, and everybody wanted to watch Western movies and TV, and a lot of people a lot of times had nothing else to do during the day, and so they're watching this everywhere. Before smartphones too. Yes, it was quite interesting. Actually, that company's older than YouTube, so even though a lot of things that come into China are sort of copies of Western things, those companies actually had many more features, and were much richer than YouTube and even Netflix today ... 10 years ago. More advanced, innovative, actually, than the Americans at this point.
Mike: Right.
Steve: It was pretty cool. I learned a lot about infrastructure here, and the craziness of data centers, and large-scale systems, you know, thousands of servers, and all kinds of locations, all this kind of stuff. It's all physical servers, no virtualization, no nothing, right? This all started in 2005, and I was here in 2007. It's hundreds of racks of equipment, and all that kind of stuff.
Mike: What's it like today? Thinking back to what I was doing for systems in 2007 ... Yeah, I was doing quite a bit of bare metal stuff. There was a little bit of virtualization in my world, but the scale wasn't even what you were talking about then, so what's it like today? Is it just an even more staggering scale? Is there a lot of cloud native stuff that's getting deployed, or are you still working on things that are, like, 10 years old by San Francisco Bay area standards?
Steve: It's all of those things. I think, more broadly, the China market is similar to the rest of the world. We have a lot of clouds; we have AWS here, which you're probably familiar with. It's probably the number two or three player. We have Alibaba Cloud, also called Aliyun, which is the number one player by far. Also Tencent, the other major player, who makes WeChat and QQ, and so on. The cloud market in China is still quite small. It's only one-tenth of the rest of the world. Even though it's pretty hot, and people are doing stuff, and certainly anything new you would do on the Internet, not unlike the Bay Area, would go on the cloud somewhere. You know, you wouldn't buy your own hardware for new things on the Internet. There's a lot of legacy stuff; probably 95% of it is still traditional, or beginning to think about hybrids and that kind of stuff. There's a huge tradition here of having your own hardware, having your own data centers, having your own equipment. Much like the US was maybe 10 or 20 years ago. That's only slowly changing. In fact, everybody's rushing to do private virtual clouds here. VMware, of course. Even Aliyun, and now I hear AWS announced one also in the US. You know, these private systems where you can have all your own hardware, all your own data centers, but have the benefits of the cloud. That's going to be a huge thing in China, because people just love to have that hardware, and have their arms around it. They don't fully trust the cloud. Just like anybody else doesn't trust it fully.
Mike: Right.
Steve: They've made a lot of progress in that. That's still many, many years away here. It's a mix, so we have half our customers on the cloud. Whatever the cloud is. Half have physical servers, and some are doing hybrids. A lot of people want to move. You know, one of the problems, even globally, is that there's not enough talent to help you move. Migration sounds easy, and it never is, and we do a lot of them. It just takes real skill, and real knowledge, and real experience to migrate an old SAP system, or web system, or mobile e-commerce or whatever. You know, it's always messy, it's always on physical servers, it's always five-, eight-year-old software.
Mike: Yeah.
Steve: How do you move that? We have a customer that, I think they have 22-year-old Linux running.
Mike: Oh, wow.
Steve: It's Red Hat Six, but not Red Hat Enterprise Linux Six, but Red Hat Six.
Mike: Right. It's been a while since I've seen that one.
Steve: It's on a 2.2 kernel, and that's just one. They have six, they have seven, they have eight, and then they have RHEL, I don't know what, three, four, five. I think I counted 22 different operating system versions there. What do you do with that stuff? Okay, migrate that to the cloud. Where do you even start with that, right?
Mike: Yeah.
Steve: This is a huge, multi, multi-billion-dollar company. You can't just throw this stuff away, right? It runs important things. That's the enterprise world, even in the US. We do a bunch of that kind of stuff. Yeah, everybody's everywhere. I think people are getting to the cloud. You still do not see very much cloud-native stuff, however you define that. Of course, in the US everybody talks about Kubernetes and all that. How many people are really running Kubernetes in production at scale, even in the US? It's not nearly what you'd think. Here, you can probably count it on a few hands. Especially at scale. That's true of OpenStack also, and some of these more advanced systems. They're just not there yet. Docker, kind of floating around. We still don't see that much of it really. Everything is still pretty traditional. I'd say a good portion of our customers, most of them probably, still FTP code to their servers. Maybe they do some Jenkins and some of that stuff, but it's still not that common. Yet it's at scale, right?
Mike: Right.
Steve: You have systems with ten million, a hundred million users or customers, ten- to a hundred-million-user systems, and it's FTP, everything. Also, from the developer's laptop.
Mike: Right, so for us that are steeped in, I guess, the Silicon Valley culture of how we do things, to have even a few people FTP-ing code around is just, that's anathema to everything we do and think; to think of hundreds of thousands of people doing this as just standard practice. And, to think about all the other practices and systems that go into deployment, and security, and monitoring, and what that must mean if their deployment practice is FTP. What does everything else look like? That's- [crosstalk 00:10:31]
Steve: Yeah, it's a little bit weird, because on one level it's like you said, it's, let's call it behind, or not up to what we might look at in the Valley.
Mike: We'll call it less mature.
Steve: Less mature. On the other level, even when I was at Tudou, again, the video company 10 years ago ... Bare-metal hardware, thousands of servers, and more around the world. We were deploying code every few hours on a hundred different subsystems. Updates every day. Long before anybody was doing this in the US, right?
Mike: That stuff is wild. I mean ...
Steve: That's because of the-
Mike: ... There are companies in the US that still don't do that.
Steve: Right, exactly, with all the automated tools. We had none of those tools. In one sense it's behind, because we don't have all the sexy tools, but with the sort of dynamics of both the Chinese Internet and websites, and China in general, things just update and happen fast, right? The idea of only updating your website every six months, every three months, like NetSuite and Salesforce and people do, for as long as I've been here, almost 15 years, it's been multiple times a day. In some sense, the culture has been very dynamic, extremely, let's call it DevOps-y, and extremely native in that sense, and dynamic way before the US was, but without all the sexy tools. To some extent that's just continued. While the West has gotten up to this multiple-releases-per-day kind of thing, and tooling and all of that, here it's always been like that. Just not as mature, you might say.
Steve: It's ahead and behind at the same time, if you know what I mean. But, also very dynamic, so we have to deal with that. Just, you know, code pushes at all hours of the day, all versions of everything. That's just going to continue, where I think in the West it sort of wasn't like that, and suddenly all these tools came and people started thinking about more dynamic pushes, and dynamic code updates. We've had to deal with that forever. Where, even 10 years ago, customers were just deploying all the time, all hours of the day, breaking things all the time. How do you find that? How do you prove somebody did a deployment? You have to figure it out, because they'll say, "Oh, we didn't do anything." Then, you find out that third-party developer pushed something in at 9 pm, so we have all these tools to track all this because, in our world, we're sort of the exact opposite of the way I think people think of cloud-native now. You might say that the most modern way is, you know, infrastructure as code, everything is in git. Now it could be the hardware and the infrastructure's in git also, my code's in git, everything's in git, all my assets are in git. Everything's there, right? In theory, git is the reference, right?
Mike: Mm-hmm (affirmative)-
Steve: It's the gold standard or whatever. We could be-
Mike: It's the source of truth.
Steve: The source of truth, yes. That was what I was looking for. We have the exact opposite. Our source of truth is what's running on the server, because it's the only thing we can really believe, so we have lots of tools and technology to actually reverse-engineer that. We have deep CMDBs, we have config inspectors, we have command trackers. We figure out what actually happened, and we can diff, for example, all these production servers. You have 10 web servers, and two of them aren't working the same way. One's crashing, or it could be giving you 500 errors. We can reverse-engineer all the configs, and all the pieces and parts, and figure out why that is, because that's just reality for us. You might say from a cloud-native side, well, just blow them away and just restart them and build them from a gold master.
Mike: Yeah, exactly.
Steve: ... master. We don't have a gold master. You copy one from one of the others, right? We used to have tools, we could even do that. We can suck all the configs out of one server and build another one, before you could do that on the cloud, you know?
Mike: Mm-hmm (affirmative)-
Steve: In that sense it's sort of developed its own culture of, how do you be extremely dynamic, and messy and all that. In China things move fast, and at scale, so it's the reverse of a bank or something like that; the very super-structured, super-full-of-processes places are not that common.
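As a concrete sketch of the config-diffing Steve describes, assuming SSH access and hypothetical hostnames, a few lines of Python can pull the live config off each box and diff it against a baseline; this illustrates the approach, not ChinaNetCloud's actual tooling:

```python
import difflib
import subprocess

SERVERS = ["web01", "web02", "web03"]   # hypothetical inventory
CONFIG = "/etc/nginx/nginx.conf"        # any config file you care about

def fetch_config(host):
    # The running server is the source of truth, so pull the live file.
    result = subprocess.run(
        ["ssh", host, "cat", CONFIG],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines(keepends=True)

baseline = fetch_config(SERVERS[0])
for host in SERVERS[1:]:
    diff = list(difflib.unified_diff(
        baseline, fetch_config(host),
        fromfile=SERVERS[0], tofile=host,
    ))
    print(f"{host}: {'in sync' if not diff else 'differs'}")
    print("".join(diff), end="")
```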
Mike: Yeah, so it's almost like the Western culture and Chinese culture, as far as tech, came to the same conclusion of quickly updating dynamic infrastructure from two different premises, two different approaches. Whereas, the US kinda started with, how can we make this safe? China said, we don't really care about safety as much, because we just have to do it. We have to move this fast, and you've created some sort of safety, or some sort of introspection after the fact.
Steve: Yeah, exactly. That's what we have done. That's interesting now, because we see in reality that's all moving forward, and you do have more Jenkins, and you have people more pushing formal releases, and you're starting to get more Docker builds. You know, that stuff's starting to come in at, let's call it, the more advanced, or the newer companies, who of course have seen this, who are coming out of Google and other places.
Steve: That's interesting for us, because now we look to the West and some of the things there in the US, and the things people are struggling with, which is how do I track all these moving parts and all that, you know, on the cloud and all that, and to us this is old hat in a lot of ways, because we've seen this movie before. People are still trying to deal with, oh my God, everything's changing all the time. We're like, yeah, we know, everything's changing all the time.
Steve: We're many generations into some of these things, exactly for these reasons. I think reality is, of course, they're converging ...
Mike: Right.
Steve: ... Is that things aren't really as formal in the US as people think. Everyone says, never SSH into a server; you shouldn't even have it, it shouldn't be possible. Yet, you go to conferences like Velocity, I was there last year in the Bay Area, and someone asks, "How many of you SSH into servers to fix stuff?" Of course, every hand in the room goes up, right?
Mike: Of course.
Steve: These are some of the largest operations folks in the world, so it ain't as clear and clean as people think it is. Yet, they behave as if it is. We behave as if it's a complete mess at all times, and assume it's going to be like that, but yet try to work towards cleanliness, and cleaning it up, and rigor and all that, realizing it's imperfect, I think. Sometimes, especially in the Bay Area, the cloud-native folks and all that just assume everything is immutable and perfect all the time. Then that, of course, gets into a lot of trouble, because it's not really. We live in that hybrid somewhere, which is cool.
Mike: That reminds me of a quote I read a while back, and I can't remember who it is. Of course, someone listening will figure it out for me. The quote says, "The only person in my life who behaves rationally is my tailor. He measures me every time I go to see him."
Steve: Yes.
Mike: That's kind of how I feel about DevOps, and infrastructure, and running systems at scale in the US: everyone talks about the idealistic perspective, like, "Oh yes, we never log in to our servers," or, "We should never do it." Yet, we all do it. We all do it daily. Some, more than that, so yes, while we shouldn't, we do. Whereas, it sounds like in China you say, "Well, no. We clearly do. We're being honest with ourselves that this is what we do, but how can we make it better from there?"
Steve: Absolutely, and even with that, we have to go a step further and track what other people are doing. We actually have command trackers so that, when you SSH in, the shell sort of puts in some hooks, and we get all the commands everyone runs, because you've got us as an outside third party, you've got the customer, let's say, you've often got third-party developers. Maybe more people, maybe other third-party people, and everybody is not communicating, of course, or organized, or documenting or anything. They're changing stuff all the time, so you've really got to figure out what people actually did so you can, not so much point the finger in a blaming way, but point the finger to, hey, you did this, and what happened. Also, be able to go look deeply at configuration, because as people have discovered, it's not your infrastructure physically, it's the configurations of things that matter a lot. Even if it's Nginx, or PHP, or Java or whatever. Ruby. So you've got to track that stuff, and nearest we can tell, even the advanced people really aren't doing a good job of that.
Mike: Right.
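As a rough illustration of the command-tracker idea, here's a toy sketch: a wrapper "shell" that logs each interactive command to syslog before running it. Real deployments hook the actual shell (or use auditd); the names here are hypothetical, and this only shows the shape of the idea:

```python
#!/usr/bin/env python3
# Toy command tracker: set as a user's login shell to record who ran what.
import getpass
import subprocess
import syslog

syslog.openlog("cmd-tracker")
user = getpass.getuser()
while True:
    try:
        cmd = input(f"{user}$ ")
    except EOFError:
        break                                 # user logged out
    if not cmd.strip():
        continue
    syslog.syslog(f"user={user} cmd={cmd}")   # the audit trail
    subprocess.run(cmd, shell=True)           # then actually run it
```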
Steve: You know, developers are going off, Googling, or downloading, or finding simple Nginx configs, throwing those into production, not really understanding what they mean. That's another whole set of problems in operations. Then, not really tracking changes to that, and their effects, and then how they span across the fleet, right? Of course, if everything's immutable and built from scratch every time, okay, that's all right, but it often isn't that way. Especially with databases, and longer-persisted virtual machines or containers or whatever.
Steve: We actually have to go and reverse-engineer all that and say, "You know, we got these nine Nginx servers. Something's wrong with Nginx three. What's wrong?" Or, questions like, "We've got SSL bugs, or SSL vulnerabilities. Which of our servers are running TLS 1.1, 1.2, and what ciphers?" Near as we can tell, nobody in the room can tell you that, because they're just like, "We don't care, we just push it all out there." In reality, you have to know this stuff, right?
Mike: Yeah.
Steve: We have tools that can figure all that out for us, so we can survey the whole fleet very quickly. Again, we've dealt with all this chaos for so long that we need to do these things, so it's very interesting, and of course, at scale, you know?
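The TLS survey question above is answerable with a short script. Here's a minimal sketch using Python's standard ssl module against a hypothetical host list; it reports the negotiated protocol version and cipher per server, which is the fleet-wide view Steve is describing:

```python
import socket
import ssl

HOSTS = ["web01.example.com", "web02.example.com"]  # hypothetical fleet

def probe(host, port=443):
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # survey only; we want the handshake, not trust
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()

for host in HOSTS:
    try:
        version, (cipher, _, _) = probe(host)
        print(f"{host}: {version}, cipher={cipher}")
    except OSError as exc:
        print(f"{host}: probe failed ({exc})")
```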
Mike: Right.
Steve: Now you get into more hybrids and different things, and just like DevOps is still pretty new for a lot of people even in the West, a lot of these practices are pretty new here too, so it's still an amalgamation of lots of different pieces and parts, and technologies, and hybrids, and physicals, and traditional things, and new things, and, you know, serverless. Maybe not here yet, but serverless stuff is coming, and Kubernetes, but also SAP systems. Now we've got terabyte-RAM databases for SAP right alongside Docker containers serving mobile apps to register something, so it's interesting.
Mike: Yeah, absolutely. You and I have been talking about troubleshooting, and I think this maps well to what you're talking about. How do you troubleshoot a system, like, actually really troubleshoot it and understand what it's doing, when the way most people approach troubleshooting is, I'm going to check this thing, I'm going to look over here? After a while, you just start guessing about how your systems are working. For example, with your SSL thing, if you start having SSL negotiation problems intermittently, how do you even approach that? How do you even get to SSL being the issue? What do you think about that? Are there any standardized troubleshooting methods that you've seen work well with your own teams, or with yourself?
Steve: Well, this is a very interesting area. In part because I grew up troubleshooting. I grew up in construction, and factories, manufacturing, so I've been troubleshooting since I was, I don't know, five years old, you know? I have to be careful, because what comes naturally to me, and what I've been doing across lots of industries, is not always what everyone else is doing. Here, it's different too, because it sort of highlights ... You know, things move very quickly, and people want to get stuff done, want to fix stuff, and understanding how things work, and how it's all put together, is not that common. People just want to fix it, right? Just get a hammer and just start redoing stuff, and rebooting stuff, and typing, rather than stepping back and understanding how it works. I think that's actually pretty common, and I suspect that DevOps, whatever that means, is making that worse also, just more globally, because the reality is, most developers don't know a lot about operations, and vice-versa. At the same time, things are getting more complicated, and more pieces, and more microservices, and more going from complicated to complex, and all these interactions, and nobody really knows how this stuff works unless you really have pretty senior folks involved with it. In many ways I suspect the West is becoming more like China, where we really don't understand all these parts, they're all moving around, and like you say, you just want to bang on it a little bit, and restart stuff, and reboot stuff, and make it work.
Mike: Yeah.
Steve: I've never seen a really good set of troubleshooting things. I've looked over the years. I recently wrote an article based on a talk I gave at LISA, about sort of the Dao of troubleshooting, sort of 10 steps of that.
Mike: Yeah, we'll throw that in the show notes as well.
Steve: Yeah, it's just a little bit cuter, like, know all the things, and look at all the things, but ...
Mike: Right.
Steve: The only ones that I've seen, the only people who have really attacked troubleshooting methodically, have been in the engineering disciplines, like, nuclear reactors, or military systems, or communications, like the cell phone guys have a lot of stuff about this. How do you troubleshoot a broken, not cell phone, but cell phone towers and infrastructures, telcos? That stuff tends to be very academic, and very hard to approach. A lot of it was written in the '70s and stuff like this, so I've never seen more modern, like, this is how you go about approaching troubleshooting in systems. You've got to think, and it's just very important, about methodically going through things, understanding what it can be. Very, very importantly, understanding what it can't be. We have endless engineers waste time Googling something, and trying to solve problems they not only don't have, they can't have, because of the way the system works, right? Google said, look at this, this and this, but we don't actually have any of those things. Yet, they'll spend a whole afternoon hunting that stuff down, right? Because, they don't know they don't have it. Stuff like that. How do you teach them to step back? Then, when it all seems messed up, you've got to step back and go, my assumptions are wrong.
Mike: Right.
Steve: Something that I assume is true, is not. There's a famous, I think it's a Mark Twain quote. I think it's Mark Twain, I'm not sure who it is. You know, it's not the things that we don't know that get us in trouble. It's the things that we know that just aren't so. It's the things that we think, and we're sure that configuration says this. I'm sure I checked that file, I'm sure I just went and looked at this, when actually you didn't. It was the wrong thing. You logged in to the wrong server, you looked at some other wrong thing, so something in your mental model is wrong. One of the best ways to check senior engineers, or senior troubleshooting folks, is, can they realize that, can they step back and recheck all their assumptions? After you've done this for a couple hours, because something you've done is wrong and you go, "Oh yeah, that thing over there." Of course, involving other people, and communicating and all that. It's not just troubleshooting skill, of course.
Steve: Let's talk about this. PagerDuty, I think, has done quite a bit of good work on this: how do you manage incidents, and how do you communicate, and incident commanders and all that stuff. We actually created all that stuff, like, seven, eight years ago for our own needs. It's interesting to see them, and other people, start to talk about incident management in the last two or three years. Same exact thing. That was pretty cool. It's not only troubleshooting itself, but it's all the things around it, you know?
Mike: John Allspaw has actually been talking a lot about this lately too. You mentioned the checking of assumptions, and how people think about troubleshooting. In the talk that he gave at PagerDuty Summit a while back, he showed this timeline of an incident. At several points in this timeline, there was one engineer who said, "Hey, this problem feels like this, like this thing has happened," but they weren't able to say, like, why. It just felt that way. Sure enough, it turns out that it was related to that thing, but it all came out of this engineer's instinct, and a year-old experience, or several-years-old experience, with one particular incident at some point in the past. That particular incident informed their mental model of how the system worked, and allowed them to have an instinctual response to this issue that they're seeing. The problem there is, you can't really teach instincts. You can't teach experience. People form mental models by having more experience with the system. Then again, maybe you can. Maybe you can teach mental models. Have you had any success with trying to teach people how to do this?
Steve: Not a lot.
Mike: That's unfortunate.
Steve: I agree. Well, here it's a little different, because I think we're enmeshed, in our case, sort of in the Chinese education system, which doesn't focus on that in a lot of ways. Especially the folks that tend to be our mid-to-senior folks, who are a little bit older, you know, went to school here in the '70s, where the education was very focused on sort of test-taking and mathematics. Studying sort of abstract mental models is not something you see a lot, realistically, in our workforce. That's a problem. That's back to this, how do you understand how the whole system works? Then, how do you troubleshoot and put it together? I think that's an ongoing project, I would say, in China. More broadly, across Asia. That's a challenge, as you know, with education. I think in the West it's a little bit different, but the same thing, yeah. You can't obviously teach all that experience. You can write it all down, but the mental model, and how you think about it, and the intuition, if you can try to teach it and/or you can identify people that have it, you know, it's like having the Jedi force, or having the Harry Potter magic. People either have a touch for troubleshooting and figuring things out, or they don't. It's almost like sales. Somebody said, you're either a natural-born salesperson or you're not, and I think that's some very important component of it, and then how do you teach on top of that? I don't know that anyone's really worked on it very well, especially in a formal setting, especially in modern times. I mean, John's doing great stuff. More about incidents and communication, and also root cause, and how do you dig into that? How do you apply that stuff? I still don't know.
Mike: Yeah. Have you seen Brendan Gregg's USE model?
Steve: Oh sure, yeah. As far as Golden Signals things, you mean?
Mike: Yeah, so Brendan Gregg has the USE model: Utilization, Saturation, and Errors. He uses it for performance troubleshooting, but it's very much a low-level sort of thing. For his model, it's not like, I'm checking the rate of requests coming into a web server, or an entire tier of web servers. Instead it's like, what's the utilization on this particular core? Or, what's the current memory utilization across this particular bus? It's very low-level stuff, but it's meant as a troubleshooting approach.
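For a concrete taste of the low-level measurement Mike means, here's a minimal sketch of per-core utilization, the U in USE, computed from two /proc/stat snapshots on Linux. It's an illustration of the idea, not Brendan Gregg's own tooling:

```python
import time

def per_cpu_utilization(interval=1.0):
    """USE 'Utilization' per core, from two /proc/stat snapshots."""
    def snapshot():
        cpus = {}
        with open("/proc/stat") as f:
            for line in f:
                parts = line.split()
                # Per-core lines look like "cpu0 ..."; skip the "cpu" total.
                if parts[0].startswith("cpu") and parts[0] != "cpu":
                    ticks = [int(x) for x in parts[1:]]
                    idle = ticks[3] + ticks[4]    # idle + iowait
                    cpus[parts[0]] = (sum(ticks), idle)
        return cpus

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {
        cpu: 1 - (after[cpu][1] - before[cpu][1]) /
                 (after[cpu][0] - before[cpu][0])
        for cpu in before
    }

for cpu, busy in sorted(per_cpu_utilization().items()):
    print(f"{cpu}: {busy:.0%} utilized")
```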
Steve: Right, and I think you're right. Golden Signals is the larger sense of it, so I wrote a whole series of articles about Golden Signals, picking up Brendan's stuff, and some of the others in the same way, partially because that all sounds excellent. My whole focus was, how do you get that data about a system?
Mike: Right.
Steve: Everybody talks about it; no one talks about how you actually get the stuff, so I wrote a bunch of stuff up about that on Medium. Exactly that. Definitely, and I think that applies across a lot of things. That's relatively new for us here, and I think new more broadly. We've actually been migrating things over to sort of Golden Signals. Whether it's USE, or one of the other models, kind of pulling the RED model in together also, on error rates, and especially latency, of course.
Mike: Right.
Steve: Latency is the big one to track there, and errors, and so on, so we're trying to do much more of that. The challenge, of course, is still what do you do with it, but in the microservices world, you know, that's when it becomes really, really important. We have customers with dozens of microservices. Every one of them looks okay. Monitoring is beautiful, there's no errors, no nothing, but the system's down, and we can't figure out why. The game is changing all the time.
Mike: That's the worst.
Steve: Yeah, exactly. The canonical problem ... Or, a system's slow. Even worse, right? You can't figure out why, and it used to be you had one, or two, or three tiers, and you could figure it out. Now, you've got 25 tiers, and they all talk to each other. You have third-party services, and you've got RDS, and cloud services. That's becoming common everywhere. I don't think anybody's figured out how to properly troubleshoot and diagnose this. Somebody like Honeycomb and so on is looking at the observability side, and then some of the better monitoring is sort of looking at all the Golden Signals, if you can get the data. If you can, then of course, with the USE stuff, utilization, saturation, particularly the latency, usually shows up as a huge red flag right in the middle of that. You've got to get that data, and you've got to have it organized, and you've got to understand how all these pieces communicate, which nobody seems to bother to ever document.
Mike: Yeah. That's actually a super-interesting point, because even in Brendan's documentation about how to use his USE model, he actually says there are some metrics that there's just no easy way to get. Circonus has done a lot of work with USE too, and there are entire facets of these low-level metrics that are just not easy to measure.
Steve: Well, exactly. In fact, here's a simple one, even out of Brendan's stuff, right? His BPF stuff helps some, but that's not simple: how do you get Linux CPU run queues?
Mike: Right.
Steve: You can get the load average; you cannot get the run queue other than instantaneous. Every tool that exists, because I checked every single one of them ... I wrote one, actually, because every single tool out there gives you the instantaneous run queue. How do I know how busy my CPU is on Linux? You can't get it. You can't get the saturation. I wrote a program, linked out of my articles, that actually samples it. That's all you can get. The kernel can't give it to you over time. I sample it, like, every 10 milliseconds for a few seconds, and try to come up with an average and say, this is the current run queue. Of course, as we know, when the queues fill up, then you're really saturated, right?
Mike: Mm-hmm (affirmative)-
Steve: We need to get that in the kernel. The kernel has a very nice load average, but that's not the run queue. We need to have the kernel also give you the run queue. I'm not, unfortunately, a good enough kernel developer to put that in there, but somebody should, and then we'd have that, right? It would be super-useful, because when the CPU run queue is more than the number of CPUs or whatever, you're saturated, cut and dried.
Mike: Yeah.
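Here's a minimal sketch of the sampling approach Steve just described: average the instantaneous procs_running count from /proc/stat over a few seconds. His actual tool is runqstat, linked above; this only shows the idea:

```python
import os
import time

def avg_run_queue(duration=3.0, interval=0.01):
    """Average the instantaneous run queue over `duration` seconds.

    The kernel only exposes an instantaneous procs_running count, so we
    sample it every 10 ms and average, as described above."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("procs_running"):
                    # Includes this sampler itself; close enough for a sketch.
                    samples.append(int(line.split()[1]))
                    break
        time.sleep(interval)
    return sum(samples) / len(samples)

avg = avg_run_queue()
cpus = os.cpu_count()
print(f"avg run queue {avg:.2f} across {cpus} CPUs"
      f"{' (saturated)' if avg > cpus else ''}")
```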
Steve: Everything else doesn't matter anymore. I think Brendan is doing some of that stuff with BPF and so on, but we need it, we need it more broadly. Yeah, I did a whole bunch of work on, how do you get that out of Apache logs? Not saturation, but how do you pull latency, how do you get web server latency? That stuff's really hard to get. Load balancers, ALBs, those should give it to you pretty nicely. HAProxy kind of gives it to you, but you've got to hunt all this stuff down, and the average developer sort of person isn't doing this stuff. We need more work pulling that stuff out.
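As a sketch of pulling latency out of web server logs: Apache's mod_log_config can append %D, the request time in microseconds, to each access-log line, and a few lines of scripting turn that into percentiles. The log format here is an assumption; adjust the field position to match your own:

```python
import statistics
import sys

# Assumes an Apache LogFormat ending in %D (request time in microseconds),
# e.g.  LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
latencies_ms = []
with open(sys.argv[1]) as log:
    for line in log:
        try:
            latencies_ms.append(int(line.rsplit(None, 1)[-1]) / 1000.0)
        except ValueError:
            continue              # line without a trailing %D field

if not latencies_ms:
    sys.exit("no latency fields found")

latencies_ms.sort()
p50 = statistics.median(latencies_ms)
p95 = latencies_ms[int(0.95 * len(latencies_ms))]
print(f"requests={len(latencies_ms)} p50={p50:.1f}ms p95={p95:.1f}ms")
```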
Mike: There's a new HTTP standard that, what is it, W3C is currently working on. I believe it's called Server-Timing, and it allows you to add arbitrary metrics, like, key-value metrics, into an HTTP header. Which allows you to do things like, how long did this request take, and record it in the response headers. From there, you can pull it from the browser side, which is super-cool. Like, that's awesome.
Steve: Yeah. That is a good idea.
Mike: I wish we had that even a couple years ago. It surprises me that we're only just now getting things like that.
Steve: True, and there's no technical barrier. You could've written that yourself 10 years ago, right? Put it in your code.
Mike: Exactly.
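A response-timing header really is only a few lines in any web framework. Here's a minimal sketch using Flask (the framework choice is an assumption) that emits a Server-Timing header the browser can read back:

```python
import time

from flask import Flask, g

app = Flask(__name__)

@app.before_request
def start_timer():
    g.start = time.perf_counter()

@app.after_request
def add_server_timing(response):
    elapsed_ms = (time.perf_counter() - g.start) * 1000
    # Readable in the browser via the PerformanceServerTiming API.
    response.headers["Server-Timing"] = f"app;dur={elapsed_ms:.1f}"
    return response

@app.route("/")
def index():
    return "ok"
```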
Steve: Things like that make me think, gee, we should put that in our application. Actually, we do it at the page load; it's at the bottom of the page, right? It used to be, and sure, there were other tools, they've done this forever. They tell you at the bottom of the page how long that page took, which is cool. The problem now is, everything's API-driven, you have React, or AngularJS-type stuff, so a page is a hundred calls. All those other underlying API calls are just lost, right? That's where the backend side, Honeycomb and other systems that are picking up backend events, in theory picks that stuff up, with a tag and so on. But that's all a lot of moving parts to get working, right?
Mike: Yeah.
Steve: Then, what do you do with all the data? That's still an open question. It doesn't help you with your saturation, so you still get a problem: okay, now I have latency here across the board, or I have latency on this type of customer. Now what? You've really got to start digging, and that's where everybody falls off the cliff of no available tools.
Mike: Yeah.
Steve: You know, you start trying to dig through these multi-systems. Even then, people are really trying to focus on that, and I think that's going to help, but that's sort of not for the faint of heart in some ways, right?
Mike: Mm-hmm (affirmative)-
Steve: You got to dive in pretty deep and know what's going on, and really that's pretty cutting-edge. I think it's the future in a lot of ways, but we still got to get better at it.
Mike: Yeah, absolutely. We've been talking a lot about troubleshooting, and it's kind of hard not to think about troubleshooting in terms of monitoring as well. When I talk about USE, yes, USE is primarily a model for troubleshooting, but it's also a way to work through instrumentation. Golden Signals and Tom Wilkie's RED model are actually even more useful for instrumentation and understanding monitoring. All that feeds into the topic of observability, which I know you've been talking about quite a bit. How do you think about observability in terms of what you're doing in China? Are you still working with fairly old tools, or are you working with a lot of newer stuff, a mix?
Steve: Well, I think here, observability, however you define it, is still an unknown thing. I mean, first, to me, well, one of the phrases I like ... Well, I go, it's sort of, I monitor you, but you make yourself observable. Something I put in an article a while ago, and I think that's actually true, in that here we're still very much in the traditional monitoring sense. Whether it's Zabbix, or other traditional things here, Prometheus sometimes. You know, doing traditional black-box kind of monitoring, where you're only getting what Golden Signals you can get, and so on. Resources, and usual things.
Steve: The observability, which is sort of, I make myself observable, where I'm emitting rich events, maybe JSON, maybe structured with some tags, I don't think we've ever seen.
Mike: Okay.
Steve: Really, it's still a nearly unheard-of kind of thing here. Even if you could do it, the tools aren't here. One of the challenges of the Great Firewall and other things is that the vast majority of Western tools, whether it's Datadog, or New Relic, or APM tools, monitoring tools, log tools, all that stuff, are really not available here. We're providing some of this stuff for our customers, but a lot of those tools, and therefore the thinking around those tools and how you should use them, and all this kind of stuff, it's just really not here. People don't talk about this stuff very much. There's a few local providers, but it's the same thing. A little bit of APM. In that sense, not mature in a lot of ways, and so people really aren't even capturing their logs. Step one is what? Step one is getting your logs off the server.
Mike: Yup.
Steve: Putting them someplace you can look at them. Splunk's too expensive for most people, but at least some sort of ELK stack, which is not trivial in itself, or any sort of third party. That's step one, right? That's actually incredibly rare. We have some customers running their own ELK stacks. There's nothing beyond that. There's just nothing happening beyond that, any sort of [inaudible 00:37:57].
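Step one can be as small as pointing application logs at a remote collector. A minimal sketch, assuming a hypothetical syslog or Logstash listener at logs.example.com; in practice you'd usually use Filebeat or rsyslog forwarding rather than hand-rolling it:

```python
import logging
from logging.handlers import SysLogHandler

# Ship log lines off the box over UDP syslog (hypothetical endpoint).
logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(SysLogHandler(address=("logs.example.com", 514)))

logger.info("order=1234 status=ok latency_ms=87")  # a structured-ish event
```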
Mike: Yeah, that's absolutely crazy to think. I spend my entire day in monitoring. It's what I do, it's what I know best, but every company I go to, no matter how mature or advanced they are or not, they still have a giant Splunk cluster, or an ELK cluster, or a syslog box where logs just go to die. At least they're doing it, which is good in some ways. It's wild to me that, with a market as large as China, and with so much technology there, a lot of the tools that we take for granted aren't there. Why is that? Why is it Datadog, and New Relic, and Dynatrace, and all these companies, why aren't they in China?
Steve: Well, there's two problems. One is technical. Let's say distance, the firewall, and so on. Just the distances, and performance, and various connectivity issues make these things slow ... Although, some stuff does work. New Relic works, and so on. Another one is, you really can't sell to Chinese customers without having legal entities here, and tax and all that kind of stuff, and potentially running inside the Chinese firewall, which has a lot of other issues. We help a lot of companies come into China. They're not typical companies, usually, but other companies. It's a big process.
Mike: Okay.
Steve: You can service the UK or Germany from the US with no problem, and get paid, and do everything. You really can't do that with China. It's another world in a lot of ways. It's not a focus. All these companies are doing really well in the West, and they'd have to carve out a whole other business or whatever. It's just not been something most of them, any of them, have been willing to focus on. You are seeing HashiCorp and so on be very popular here, and starting to push things, but those are products people can use. Certainly Vagrant and Vault and these things are becoming more popular. That's different than running a SaaS service; foreign SaaS services consumed by Chinese users, there's just not much of that.
Mike: Gotcha.
Steve: For all the reasons that I mentioned. Yeah, we've had centralized logging since we started the company. It's one of the first things we built 10 years ago. Then, we gradually added Splunk, and ELK stacks. We struggle to get people to use it. I mean, they'd much rather SSH to the server and grep the logs. Really. They know how to do it. They understand it, versus that step back and look at multiple things. I don't like it much anyway, actually, thinking about it. I find it very hard to use, but I think that's part of it too. I think there's a big usability issue there. Splunk has its own complications, so it's in some ways hard, and we don't have the Sumos and some of the really nice tools to use. They just SSH and grep the logs, right? I mean, even our team does that. Grep for IPs, grep for agents, grep for this, try to figure out who's attacking you and block stuff. It's just what people are used to, and that's what they do, and that's how we are. That's web logs. Now, you start having rich application logs and so on. It's a challenge, yeah. Like I said, we have customers with dozens of Java microservices running. I'm pretty sure that customer's not logging that stuff. I don't know how they troubleshoot it.
Mike: Yeah.
Steve: It gets worse, because that system has, I don't know, like, 20 microservices in very nice HA pairs. You have payment one, payment two, customer one, customer two, and so on, but they don't do the same thing, and sometimes they're not even running. Payment one and payment two actually aren't mirrors of each other; they do different things. That's horrifically confusing, if you can imagine. That's our world right there.
Mike: Man, that's an absolutely fascinating world.
Steve: These things are coming slowly, and we're trying to provide them somewhat with our platform here, which is just beginning to provide more operations stuff, and more discoverability, and more centralized logging, and more Golden Signals and things like this. Beyond that, it's pretty much a Zabbix world. Zabbix wasn't popular here five, six years ago, and suddenly it's everywhere. A little bit of Prometheus, but you really see Zabbix as the primary monitoring system now, I think.
Mike: Interesting.
Steve: For most people. We love Zabbix. We're a huge Zabbix fan. We've been on Zabbix forever, but it has its limitations also. It's not very DevOps-friendly, in my opinion.
Mike: Right. I ask all my guests for something actionable. We've hit on a lot of really interesting topics here, but some of them are more, what is it like on the other side of the world, for the listeners. Let's bring this to something that we can do today. For you, when you're thinking about large-scale systems, and trying to make things better, is there anything that you would recommend people focus on today, or this week, to improve their infrastructure?
Steve: I think here, and sort of in the West, as we talked about, you know, sort of Golden Signals and basic information like that is probably the single thing that we can do, and I think a lot of people can do, because they're extremely useful, and they're hard to get, and they take real work. But when you have them, they really can help you a lot. Especially as you become more dynamic, more multi-service, multi-microservice, serverless and all this kind of stuff.
Steve: As you scale out into lots of different alleged copies of all the same thing, you've really got to know how things are structured, how they're communicating, and how those communications are doing. You know, rates, error rates, latency and so on. If we could all get to that, you know, we'd be 80% of the way there, I think, to figuring out a lot of stuff. Almost all the rest of it is just added on after that, debugging hard things, and distributed tracing and all that kind of stuff, but those are really esoteric if you can't even look at your database latency or your payment gateway latencies and so on. I really still focus on the Golden Signals. That's why we try to spend a lot of time on them, and write about them and so on. That's still the bang for the buck for us.
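As a sketch of what "rates, error rates, latency" looks like in code, here's a minimal RED-style instrumentation of one hypothetical endpoint using the prometheus_client library; the library choice and all the names are assumptions, not Steve's stack:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("payments_requests_total", "Request rate")
ERRORS = Counter("payments_errors_total", "Error rate")
LATENCY = Histogram("payments_latency_seconds", "Request latency")

def handle_payment():
    REQUESTS.inc()
    with LATENCY.time():                       # observes the latency signal
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        if random.random() < 0.02:             # stand-in failure rate
            ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes :8000/metrics
    while True:
        handle_payment()
```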
Mike: Yeah, that's fantastic advice. Steve, where can people find more about you and your work?
Steve: We have a few things. One, we have our Chinese company. If you want to see more about that, it's called ChinaNetCloud, that's the English name, so it's chinanetcloud.com. We're also actually taking all of our tools and technology and bringing them to the US, launching there next year. That's a company called Siglos, which is S-I-G-L-O-S dot io. That's just going to be taking all of our experience, and things we've been doing, sort of an operations platform, AWS and cloud stuff, and bringing that to the rest of the world. It's siglos.io. That's going to be in the Bay Area in the spring. Beyond that, I'm on LinkedIn, and Medium, and Twitter; those are the most important places to find me, so maybe you can have some links to that.
Mike: Yeah, those will all be in the show notes. Highly recommend Steve's Twitter account. He writes some fantastic stuff on there. So, Steve, thank you so much for joining us. It's been a real pleasure.
Steve: Well, thank you. It's been very interesting.
Mike: Thanks for listening to the Real World DevOps Podcast. If you want to stay up-to-date on the latest episodes, you can find us at realworlddevops.com, and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.
2019 Duckbill Group, LLC