Welcome to the Chaos
Jan. 23, 2025

Disaster Recovery Fails: Lessons from the Trenches

Welcome to the Chaos Lever podcast! In this episode, we're sharing some of our favorite (and most cringe-worthy) disaster recovery stories as Chris and I relive our days in the IT trenches. From accidentally shutting down a whole data center with the push of a button to a missing utility server derailing an entire cloud migration, we’ve seen it all. If you’ve ever wondered how NOT to handle DR or just need a good laugh, you’re in the right place. 😅⚡  

We’ll talk about lessons learned the hard way—like why servers named "util01" are always critical, why you should *actually* test your DR plans, and why a bad backup can ruin your entire week. Whether you’re an IT pro looking for a relatable rant or someone curious about the chaos behind the scenes, you’ll enjoy this wild ride through tech disasters (and recoveries). 💾🔥  

Thanks for hanging out with us and listening to our stories of near-catastrophes and occasional triumphs. If there’s a topic you want us to cover—or if you just want to share your own war stories—hit us up! You made it all the way to the end, so reward yourself with a seat on the couch and a nice, quiet pilot light DR plan. You’ve earned it. 🎙️🛋️  

Chapters

00:00 - Intro: What even is rest?

00:30 - Banter and fancy vibes

02:00 - Why white balance matters (and why Ned’s fixing it)

03:30 - Listener request: Disaster recovery chaos

06:00 - Chris’s tale of the “emergency shutoff” button disaster

16:00 - My NYC blackout adventure and DR horror stories

27:00 - VMware SRM, bubble tests, and frustration

37:00 - Disaster recovery vs. migration planning

Transcript

[00:00:00.27]
Chris: Have you ever considered Rest?


[00:00:06.10]
Ned: Sounds familiar. Doesn't seem like fun, though. I'm out.


[00:00:10.22]
Chris: You might want to look it up. I'm just thinking out loud on this one. Is it not like an Italian appetizer of some kind?


[00:00:19.21]
Ned: No, it does sound like one. After your aperitif, you can have your resto.


[00:00:27.01]
Chris: Or resty, if you're getting fancy.


[00:00:30.26]
Ned: Well, I mean, it's pretty clear from my shirt and my general demeanor that I am quite fancy. It's what everybody says. Hello, Alleged Human, and welcome to the Chaos Lever podcast. My name is Ned, and I'm definitely not a robot. I'm a real human person who desires rest and quiet and a lack of activity. With me is Chris, who is also here. I didn't have a good bit. I'm sorry.


[00:01:11.00]
Chris: Yeah, I don't have one either, actually.


[00:01:12.27]
Ned: Wow. It's come to this, huh?


[00:01:17.13]
Chris: How is your [fill in the blank] with [thing that I don't care about]?


[00:01:22.01]
Ned: Oh, quite good. And thank you so much for asking.


[00:01:24.02]
Chris: I'm glad. Is that how this works?


[00:01:27.24]
Ned: Yeah, we can go with that.


[00:01:29.07]
Chris: I've communicated with humans before.


[00:01:32.07]
Ned: Did you notice that my background is sparser than usual?


[00:01:36.01]
Chris: I noticed it's darker.


[00:01:38.07]
Ned: It is that, too. It's not actually darker. Well, it is a little bit darker. I could turn my fancy lights on. But it more has to do with me adjusting the white balance on my camera.


[00:01:52.00]
Chris: Oh, you finally learned that you can do that.


[00:01:55.16]
Ned: You hush.


[00:01:57.02]
Chris: I'm helping.


[00:01:58.19]
Ned: I'm working on a new course, well, a revision of an existing course for Pluralsight, and they now want people to do... Instructors to do face-to-camera is what they call it. I call it video, but they want people to do that. If you want to do it, you have to get your setup approved. Their feedback was, one, you have too much crap behind you and you need less. Number two was, your white balance doesn't match the lighting, so you have to fix that.


[00:02:35.00]
Chris: Number one feels a little subjective/rude. I don't disagree. I'm just thinking out loud.


[00:02:43.20]
Ned: Yeah, it has to be distracting, and apparently I had too many things that were distracting behind me. People would be like, Oh, what's that?


[00:02:51.14]
Chris: They pay too much attention to the flea circus?


[00:02:54.20]
Ned: I mean, they do get a little out of hand sometimes. When they're jumping through the fire hoops, I could see how it would be difficult to focus on learning about Vault.


[00:03:05.28]
Chris: Right, especially that one time that Lorelai got set on fire. That was almost a tragedy.


[00:03:10.27]
Ned: I mean, almost. I'm so glad that they managed to put her out with the thimble full of water.


[00:03:15.19]
Chris: Look, that's what it's there for, man. That's what it's there for.


[00:03:18.27]
Ned: Safety first. Safety in numbers. So many numbers. Let's talk about some tech garbage, shall we?


[00:03:27.09]
Chris: Okay.


[00:03:28.20]
Ned: We had a listener request that we share some of our stories from the trenches. Since neither of us has ever actually been in a war zone, let alone a trench, I'm not sure what they were looking for, Chris.


[00:03:45.20]
Chris: I've been in a trench before.


[00:03:48.11]
Ned: Is it one you dug yourself?


[00:03:50.16]
Chris: Well, I mean, not just all by myself. I was part of a group, a very friendly group that all dressed alike, and we were there. Actually, I feel like I've already said too much.


[00:04:04.28]
Ned: I'm not saying cult because I would never call the Boy Scouts a cult.


[00:04:11.06]
Chris: Look, man, they don't even have cookies. What's the point?


[00:04:14.05]
Ned: They have popcorn, and it's not good. Bro.


[00:04:19.07]
Chris: What? Bro. I can't even. With the popcorn right now, what are you doing? I love popcorn. I thought this was a serious podcast.


[00:04:26.18]
Ned: What have you been listening to for the last three years? A guy and his freaking cookies. I love popcorn, but not the popcorn that the Boy Scouts sell. Pick something else. Donuts. Sell me donuts.


[00:04:45.26]
Chris: I can do that. $85.


[00:04:48.29]
Ned: Done. I will buy $85 worth of donuts. What is that? Half a donut? I don't know. Inflation is out of control.


[00:04:55.20]
Chris: You just give me your credit card and I'll take care of it.


[00:04:59.12]
Ned: I feel like that's gotten me in trouble before. But I like donuts. Let's do this thing. I thought that we could share some stories of disaster recovery. Perhaps disaster recovery disasters, if you will. I certainly have my fair share of disaster recovery exercises that I was part of, and not all of them went well. You?


[00:05:32.07]
Chris: Accurate. Similar. I'm pretty sure that's the rule. You remember that one time a company had a DR plan and it went perfectly? No notes?


[00:05:43.12]
Ned: Yeah, no. Just not a thing. But there's different levels of how poorly it can go. I've seen it go about as poorly as it can. Would you like to start, or do you want me to start?


[00:06:03.09]
Chris: I can start. Actually, before disaster recovery tests, there is such a thing as having a disaster without a recovery plan.


[00:06:15.13]
Ned: That seems like the worst disaster recovery.


[00:06:22.10]
Chris: Back in the SME days, you would get a phone call at 1:00 in the morning. Emergency, customer, absolutely panicking. Everything is broken and we don't know how to fix it. Do you have any more information for me, sir? No. Okay. Right. This is a lot to go on. Thank you for the heads up, sir. So I've had this happen a couple of times, actually. The first time, way back when I was still young and had hope for the world.


[00:06:56.28]
Ned: So when you were five?


[00:06:58.05]
Chris: It was a glorious nap. No, I worked for a university that had an unplanned outage. Now, here's the fun thing about this. They were supposed to be backed up to the gills so that power would at least stay up even if power in the city went down. They had dual feeds from PECO. Good sign, right? Promising.


[00:07:28.08]
Ned: Yeah, good start.


[00:07:29.24]
Chris: It also had a backup battery system. I want to say it was supposed to be rated for 110 minutes.


[00:07:38.27]
Ned: Okay.


[00:07:39.16]
Chris: That's ample. We're talking early 2000s. That's a lot of car batteries.


[00:07:43.18]
Ned: Yeah.


[00:07:44.11]
Chris: Honestly, spoiler alert, if you ever look at your backup system, that's all it is, is a shitload of car batteries.


[00:07:53.11]
Ned: Have you ever been to the SunGard facility down in Philly? Yeah.


[00:07:59.03]
Chris: I'm not allowed to talk about it, of course.


[00:08:01.15]
Ned: Well, they have a battery room, and when you walk past, it just looks like a field of car batteries all linked together. Just like hundreds of car batteries, and that's their battery room. And that's when I realized, oh, that's what's inside those big metal boxes. It's just lead-acid batteries.


[00:08:23.20]
Chris: Now, I'm sure that as time has gone by, the technology has advanced, but I'm also pretty sure not. It was a lot. Anyway, this particular disaster had no disaster recovery plan. So basically, we walked into an empty building with all the lights out. I can't remember if it was a brownout or a blackout. It was a citywide incident, whatever it was, and it completely knocked out many zip codes. Also, a fun takeaway: maybe you should have disaster recovery in a different zip code.


[00:09:02.06]
Ned: It's a thought.


[00:09:04.21]
Chris: We were trying to figure out what in the world was going on, and the call went out. The third thing that they have after the batteries was a diesel generator. Yeah. And here's the thing about diesel generators. They don't run all the time. They sit there and wait to be needed.


[00:09:24.10]
Ned: Yes.


[00:09:25.06]
Chris: Because otherwise, that would be a waste of a lot of diesel.


[00:09:29.19]
Ned: True.


[00:09:29.26]
Chris: So we get into the room and we're trying to figure out why nothing works, why the battery is not powering the room. Mind you, we're doing this by flashlight because the entire building is out. Long story short, we never had to turn on the diesel generator. Would you like to know why?


[00:09:53.22]
Ned: Because it had no diesel?


[00:09:56.04]
Chris: Because somebody pushed the whole-room emergency shutoff in a panic to get out of the room, thinking it was going to break the mantrap magnets.


[00:10:06.18]
Ned: Oh.


[00:10:07.27]
Chris: Which meant that even the power from the battery backup stopped going to the servers. Yeah. There were a number of takeaways from this incident.


[00:10:21.14]
Ned: One would imagine, yes.


[00:10:23.17]
Chris: First question, was it a good idea to have the battery power run through that whole-room emergency kill switch in the first place? Now, if you think about it, this is not an easy question to answer. Let's say there's an electrical fire, or there's a reason that you want all of the power to be able to be pulled or killed from one place immediately. That's the use case for that.


[00:10:46.07]
Ned: Yeah, that's immediately where my mind went to: there's a fire in the server room, I need to cut off the power. All of the power. How do I do that? All of the power. All the power. Hit the big button. It would be bad if I hit the big button and instead the battery backups kicked in and kept power going to the electrical fire.


[00:11:03.14]
Chris: Right. Incidentally, not really disaster recovery or computer-related, but if you have some type of power generation in your house, you have to be connected to the PECO All Power Kill Network for similar reasons. If PECO is working on wires and expects all the power to be out, and your solar system or whatever you have is still sending power into the grid, you could kill somebody.


[00:11:29.03]
Ned: That seems bad.


[00:11:30.10]
Chris: Yeah, not ideal.


[00:11:31.19]
Ned: No.


[00:11:32.17]
Chris: There's absolutely a reason to have a "when in emergency, kill all power" switch.


[00:11:37.11]
Ned: Yes.


[00:11:39.01]
Chris: Was it necessary to have that directly next to the door, though?


[00:11:44.27]
Ned: Probably not, or at least there should have been another more obvious button for an emergency open of the door.


[00:11:54.26]
Chris: Right. Also, everybody that has access to that room should have been trained and actually practiced leaving the room in a hurry, so that you have the muscle memory to push the right button and not push every button you can find.


[00:12:13.13]
Ned: I think you raise an interesting point, because typically when we think of disaster recovery, we are thinking of all of the technical things you need to do to bring your systems back online. But there's a whole other component to it, the business continuity planning, that includes what humans need to do in the event of an emergency. It would cover things like, make sure you've trained everybody on how to leave the server room without breaking everything.


[00:12:45.27]
Chris: We won't spend too much time on this because we actually did a whole episode about it not too long ago. What's an incident response plan? What's a disaster recovery plan? What's a business continuity plan? Those are all different things that work together. If you don't have them and they're not current, hilarity can ensue.


[00:13:06.10]
Ned: Hilarity in big old scare quotes. I guess you did eventually get it resolved once you figured out what had happened.


[00:13:15.05]
Chris: Oh, yeah. I mean, that's the fun part of the story. Long story short, they pushed the button, and all the servers slowly started coming back online. About 15, 20 minutes later, PECO resolved. I think they only resolved half the power, but it was enough, because if the power is out in the city, then your network is probably down anyway because you've got a relay. But yeah, that was the fun part of the story, just recognizing there are other things that need to go into a DR plan that you might not consider until you actually practice.


[00:13:46.28]
Ned: That is so true. Okay, so I don't have anything quite as catastrophic as somebody turning off all the power because they hit the wrong button. Although I did ask once in a server room, what is that button for? And it was the one to emergency enable the halon. And the guy was like, Don't push that button.


[00:14:11.06]
Chris: I have a story about that, too. Why?


[00:14:13.00]
Ned: Because you'll die. So the halon systems don't actually trigger right away. You have a countdown basically before it kicks in for you to evacuate from the room.


[00:14:28.20]
Chris: It depends on the system. It's usually like 15 seconds.


[00:14:31.05]
Ned: Yeah, that was the case.


[00:14:33.18]
Chris: It can't be a huge amount of time. Otherwise, it wouldn't kill the fire before the fire did unrecoverable damage. But yeah, they don't want the people to be in the room.


[00:14:44.03]
Ned: Not so much. It wasn't my... I guess this was technically my very first tech job, and it wasn't disaster recovery per se, but I did end up trapped in New York City during a blackout. That was great. It doesn't seem great. I was working for a major retail company at the time, doing support for their POS systems, their point of sale systems. And so I was visiting one of the stores on site in New York City in the summer, and it would have to be like 2003 or so. So there was a major rolling power outage across most of the Eastern US. I think it was caused by something that happened in Connecticut. I don't know. It's always Connecticut's fault, whatever it was. I happened to be in the basement of the building working on the server at the time, and suddenly every single light in the building and all the power just went out. Yikes. Then the emergency battery-powered lights came on, but there was no UPS for the server or really any of the equipment in the entire store. Of course, this was close enough to 9/11 that my first thought was not, Oh, it's just a power outage.


[00:16:10.09]
Ned: My first thought was like, Something catastrophic has just happened.


[00:16:15.03]
Chris: Yeah. And what's sad is I'm pretty sure that's probably word for word exactly how it went through your head.


[00:16:21.02]
Ned: Yeah, well, I am lame that way. I made my way up to the street level and realized that the power was out across the entire city. And, New Yorkers being New Yorkers, somebody had a crank-powered radio, and they were listening to the broadcast. I listened in as well, and there was no mention of an attack or anything. Instead, I just spent a night in New York City with no power because my car was in a garage that used an elevator to park your car in the basement. It makes it a little hard to get out.


[00:17:01.10]
Chris: On the one hand, those are cool.


[00:17:03.19]
Ned: Yeah.


[00:17:04.15]
Chris: On the other hand, yeah, they need power.


[00:17:07.24]
Ned: It actually was fortunate because anybody who tried to get off of Manhattan during the power outage just sat on a bridge or close to a bridge for 9 to 10 hours.


[00:17:17.29]
Chris: I was going to say a 700-hour traffic jam.


[00:17:20.21]
Ned: Yeah. That's my non-disaster recovery power outage story. But more germane to actual disaster recovery was my subsequent job. I was working for a small company, 200-ish people, and there were three of us on the IT staff, and two of us did work.


[00:17:42.22]
Chris: Actually, it's got to be a pretty high ratio.


[00:17:46.19]
Ned: They put me in charge of the disaster recovery plan because they had a contract with SunGard to have some servers and a tape machine available to do recoveries, and they'd done two rounds of DR testing, and both times it had been an absolute disaster. Could not recover a single server within the 16 hours that they were there. It's not good.


[00:18:17.25]
Chris: No, that's more like a disaster, not recovery. Nailed it.


[00:18:23.12]
Ned: Awesome. They would joke about how the first thing they tried to do was bring up a domain controller. But if you're a person who's managed Active Directory for any amount of time, you'll know that there are some key roles that one domain controller in the forest has responsibility for. They're called the FSMO roles.


[00:18:53.06]
Chris: Oh, that role. I thought you were talking about a croissant.


[00:18:56.29]
Ned: Oh, that would be good, too. That's the sixth FSMO role. The croissant master. Flexible Single Master Operation roles is what they're called. When you bring up a domain controller from a backup, if it's not the holder of those roles, it tries to find out who the holder is so it can get up to date with whatever's happened with the domain since it was last up. If it can't do that, it won't do anything. It's a protective measure. But the other two guys that I worked with didn't know that. After many, many hours of troubleshooting and trying to figure out what was going on, some guy from SunGard came and he's like, Well, did you seize the FSMO roles? They were like, The what? Seize the who?


[00:19:48.09]
Chris: Who said who to the what now?


[00:19:49.25]
Ned: Exactly. The running joke was like, Did you seize the FSMO roles? But yeah, so this was 2004. We were running Server 2000 and 2003 for everything, Exchange 2003 for email. The goal of the recovery was to bring up a domain controller, a file server, email, and the main application that the company used to get business done. Those were the modest goals. It took me six months of working on this DR plan three days a week to actually get it all up and operational because nothing was documented. There were no recovery procedures for anything. I had to use old server equipment because this was pre-virtual machines. I couldn't just spin up a whole fleet of virtual machines and do my recoveries there. I had to have physical hardware, which was all the older hardware that was due for retirement, to do the recovery on. Server 2000 and 2003 didn't really like being recovered on different hardware. They got angry about that. A lot of it was I had to make driver disks that had all the drivers for this alternate server, and then I would have to go into safe mode to manually install the drivers to do the recovery, because it was just that period of time.


[00:21:21.05]
Ned: That's when I also discovered that tape backups aren't always good, but you don't figure out that the tape backup isn't good until you've recovered about half of the data, and then it just fails.


[00:21:34.16]
Chris: It just stops.


[00:21:36.28]
Ned: Non-recoverable errors. I really learned to hate Backup Exec a lot.


[00:21:46.18]
Chris: Did I ever tell you the time that somebody tried to reorganize the tapes in the tape library?


[00:21:52.03]
Ned: Oh, Lord.


[00:21:53.11]
Chris: By hand?


[00:21:54.16]
Ned: Was it a robotic tape library?


[00:21:56.00]
Chris: I think you know that it was.


[00:22:00.12]
Ned: The robot knows best.


[00:22:03.00]
Chris: That wasn't Backup Exec. I don't remember what program that was.


[00:22:08.10]
Ned: But yeah, so from that whole process, I developed a deep hatred for tape.


[00:22:13.10]
Chris: NetWorker.


[00:22:15.01]
Ned: I'm sorry.


[00:22:16.22]
Chris: I told you, I'm 100 years old.


[00:22:19.03]
Ned: But I also learned a metric shit ton about Active Directory and the DNS entries that it requires to function properly. Because the first thing you had to do in the recovery process was seize the FSMO roles, use ADSI Edit to clean out some of the domain controllers that weren't there, and update some DNS records. So when the clients came on, they would actually be able to successfully find a domain controller and authenticate. It was a whole pain in the ass. That's the only thing I could call it, but I did learn a lot.


[00:22:54.08]
Chris: When you say recover Active Directory, what does that mean in this instance? Is that just a flat system image that you're restoring and re-IPing and all that?


[00:23:06.11]
Ned: It's a multi-step process.


[00:23:08.09]
Chris: The proper way to do it on Microsoft's website, I think, is 31 steps long?


[00:23:13.14]
Ned: That seems like not enough, but you're close. Yeah, you would recover... The way that Backup Exec did it, and I think other backup software too, is that it would back up the directory and the actual operating system separately. You would do a recovery of the operating system, and then you would do a recovery of the NTDS database that actually stored Active Directory. Then you would have to use a combination of ntdsutil and ADSI Edit to get everything up and running, because you basically had to tell it that even though the records were old, this was now the official record. You had to fast-forward the database to say, you are now the active copy and the authoritative copy for this domain going forward. Because in our case, we only had one site, so it's not like you could replicate from another domain controller. It was a lot of steps. None of them were documented at the time. Microsoft's documentation website in 2004 was not great. Six months later, we went and did our disaster recovery test, and we passed. Then we fired off the confetti cannons. Then we started upgrading everything to 2008, and I had to start all over again.
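
For anyone wondering what "seizing the FSMO roles" actually involves, here's a minimal sketch that drives ntdsutil from Python. The server name is a placeholder, the exact ntdsutil wording varies a bit between Windows versions, and seizure should only ever happen when the original role holders are permanently gone:

```python
# Minimal sketch of seizing all five FSMO roles onto a recovered domain
# controller. Assumes it runs on a Windows DC with Domain/Enterprise Admin
# rights; "DR-DC01" is a hypothetical server name. Only seize roles when the
# original holders are permanently gone -- never if they might come back.
import subprocess

DR_DC = "DR-DC01"  # placeholder: the recovered domain controller

# ntdsutil accepts its interactive menu commands as command-line arguments,
# so the sequence can be scripted. Exact command wording (for example
# "seize naming master" vs. "seize domain naming master") varies by Windows
# version, and each seize may still pop a confirmation prompt.
commands = [
    "roles",
    "connections",
    f"connect to server {DR_DC}",
    "quit",
    "seize schema master",
    "seize naming master",
    "seize pdc",
    "seize rid master",
    "seize infrastructure master",
    "quit",
    "quit",
]

result = subprocess.run(["ntdsutil", *commands], capture_output=True, text=True)
print(result.stdout)
```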


[00:24:43.00]
Chris: Yay.


[00:24:43.23]
Ned: Yay. Okay.


[00:24:46.03]
Chris: So if we fast-forward... This is probably an unfair question. I don't know how much of this you've actually been sticking with and looking into, but in terms of Active Directory in 2025.


[00:24:54.21]
Ned: Oh, Lord. Okay, sure.


[00:24:56.11]
Chris: Has anything changed? In terms of if you were asked to do a DR plan for Active Directory today.


[00:25:04.11]
Ned: Well, the first thing I would say is don't have all your domain controllers in a single site.


[00:25:10.13]
Chris: Good start.


[00:25:11.10]
Ned: When your disaster recovery site does come online, it can talk to other domain controllers. You may still have to seize a FSMO role if some of the domain controllers you lost held those roles. But that would be my biggest piece of advice: just don't rely on a single site for all of your domain controllers, and then you're in a better situation. As far as behind the scenes, the way that Active Directory actually functions, I don't think anything has significantly changed in the way that sites and services works, the DNS records, and the FSMO roles. All that is still the same as far as I know. The recovery process, I think that the backup tools have gotten better at that recovery process, but it's still quite a process.


[00:26:06.22]
Chris: Yeah. I mean, that's something that I had noticed over the past 15, 20 years: people went from this concept of backing up to tape, or backing up to whatever medium off-site that you can then recover from, more into a hot/warm standby environment. So theoretically, if your Active Directory was up in this warm standby and it was continuously being updated, it would be more a matter of promoting it rather than having to recover it from a backup.


[00:26:37.24]
Ned: Exactly. What you're describing is a pilot light setup, where you have the minimum number of servers with key services that you want to stay replicated, and you keep those up at a moderate amount of cost. That's the thing you can do with cloud now: just have two EC2 instances or Azure VMs running to provide that pilot light, and then have your actual disaster recovery site. It could be in Azure, could be a secondary site, but you can always point at that Azure or AWS deployment to replicate and authenticate instead of having to recover from scratch.
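
As a rough illustration of the compute side of a pilot light, here's a hedged boto3 sketch: a couple of always-on instances (say, domain controllers) keep replicating, while the rest of the app tier exists as pre-built but stopped EC2 instances that get started during a failover. The region and instance IDs are made up; the same idea applies to Azure VMs.

```python
# Hedged sketch of "lighting" a pilot-light DR site on AWS: the app tier
# exists as pre-built but stopped EC2 instances in the recovery region,
# while a couple of always-on instances (e.g., domain controllers) keep
# replicating. Instance IDs and region are placeholders.
import boto3

DR_REGION = "us-west-2"                        # assumed recovery region
APP_INSTANCE_IDS = ["i-0aaa111", "i-0bbb222"]  # hypothetical stopped app servers

ec2 = boto3.client("ec2", region_name=DR_REGION)

# Start the dormant instances and block until they are running.
ec2.start_instances(InstanceIds=APP_INSTANCE_IDS)
ec2.get_waiter("instance_running").wait(InstanceIds=APP_INSTANCE_IDS)
print("Pilot light scaled up:", APP_INSTANCE_IDS)
```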


[00:27:19.15]
Chris: The downside there is that you pay a little bit each month for an environment that's effectively not helping you at all as a productivity and revenue-generating thing. But if a disaster does occur, then you push a bunch of buttons. I'm thinking very perfect world here. You push a bunch of buttons, your supporting systems all pop online, you swap out some DNS records, and you're back in business.
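
The "swap out some DNS records" step is usually the last mile of that button-push. Here's a minimal Route 53 sketch, where the hosted zone ID, record name, and DR endpoint are all placeholders:

```python
# Minimal sketch of repointing a DNS record at the DR site after failover.
# Hosted zone ID, record name, and target IP are placeholders; a low TTL
# matters more than the API call, since clients cache the old answer.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z3EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={
        "Comment": "Fail over app.example.com to the DR site",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.50"}],  # DR endpoint
            },
        }],
    },
)
```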


[00:27:45.12]
Ned: Yeah. One of the later disaster recovery projects that I worked on, and you may have been part of this as well, was for a pharmaceutical company that we were doing consulting work for. Did it work? It did. You probably worked on it. Okay.


[00:27:57.07]
Chris: In that case, I was probably not there. No.


[00:28:00.15]
Ned: But what we were actually using for disaster recovery was VMware's SRM product, Site Recovery Manager. Site Recovery Manager did site-to-site replication of your VMware clusters. The process of recovering was pretty straightforward, because all you had to do was say, bring up these virtual machines, these replicated virtual machines, over at the other site, and make sure that you update your DNS records. There was more to it than that, but that was essentially the crux of it. What that meant was you weren't recovering from tape. It wasn't this big, long, drawn-out process. And also it had the capability of recovering into an isolated VLAN, so you could do test recoveries. We called that the bubble.


[00:28:54.19]
Chris: We love the bubble. We love the bubble.


[00:29:00.01]
Ned: While there were some tricky things with SRM, ultimately, it was awesome. I think we were using array-based replication, which meant we were getting really good replication times as well. Instead of using SRM's built-in replication, which was a little bit slower because I think it was doing snapshot-based replication, this was just doing array-to-array replication, and so it was really fast. I think our recovery point objective was five minutes.


[00:29:29.24]
Chris: You'd have five minutes of loss. You can only do that with array-based replication.


[00:29:32.25]
Ned: Yeah, you could only do that with an array, absolutely. But the two sites were not next to each other, not in the same zip code. They were several states away from each other. So it's the kind of thing where you'd have to get on a plane. It was eventually successful, although the tabletop exercises and the early bubble exercises were incredibly frustrating because people kept doing things not in the bubble. God damn it. Try not to bring down production when we're doing a DR test.


[00:30:09.13]
Chris: Bring down production on your own time. That's called a hobby.


[00:30:15.13]
Ned: Yeah.


[00:30:17.07]
Chris: Yeah. I mean, those are challenges, too. When you're running a DR test, it is essential that everybody runs only off of the documentation provided to them and only in the disaster recovery designated area. Yes. Because otherwise, it's not a true test.


[00:30:36.07]
Ned: Exactly. It is very hard and expensive to run a full DR test, but there are some major benefits, because you'll encounter those things that you didn't plan for. Right. And figure out workarounds, or at the very least update the documentation to go, Oh, okay, we actually need to do it this other way because X, Y, and Z. We certainly encountered that with some of the... Some of the virtual machines would not remap their IPs quite right because of the way the applications that were installed were configured. Basically, there were hard-coded IPs inside some of the machines, and so we had to account for that as part of the recovery process. You would think, not that the servers wouldn't have static IPs, but that the remapping of the IPs, the SRM scripts should take care of that. It did at the operating system level, but it did not at the application level, hence we had a problem.
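
A tiny, hypothetical example of the kind of per-application fix-up that ends up in the recovery runbook when an app has a hard-coded IP: SRM re-IPs the guest OS, but the config file still points at production, so a script rewrites it. The path and addresses here are invented.

```python
# Hypothetical post-recovery fix-up: SRM re-IPs the guest OS, but an app
# config file still contains the production IP, so the runbook rewrites it.
# The file path and addresses are illustrative only.
from pathlib import Path

CONFIG = Path("/opt/acme-app/conf/app.properties")  # made-up app config
PROD_IP, DR_IP = "10.10.20.5", "10.50.20.5"          # old and new addresses

text = CONFIG.read_text()
if PROD_IP in text:
    CONFIG.write_text(text.replace(PROD_IP, DR_IP))
    print(f"Updated {CONFIG}: {PROD_IP} -> {DR_IP}")
else:
    print(f"No occurrences of {PROD_IP} in {CONFIG}; nothing to do.")
```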


[00:31:40.20]
Chris: You always tell people you shouldn't have static IPs hard-coded into applications.


[00:31:47.07]
Ned: Yeah.


[00:31:48.00]
Chris: And yet- But there's that word should.


[00:31:54.11]
Ned: I think we have time for maybe one more story if you have something.


[00:31:59.19]
Chris: I mean, I'm trying to think. I do have one that was similar to the SRM model that you were talking about, and I started alluding to it where it was a customer, and they had this situation where their on-prem environment was their production. They were just starting to use the cloud, and they wanted to do a little bit more than a pilot light, but that idea where you've got continuous high availability at the application level, taking care of the replication to a separate site. The theory with this was everything's consistently updated. It's geographically separated by a responsible amount of distance into an environment which allegedly will never go down. It wasn't US East. What is it? Shit, I can't remember which one we make fun of all the time now.


[00:32:50.01]
Ned: Is it one or two? It's US East One. You were right. Northern Virginia, baby.


[00:32:54.01]
Chris: I should never doubt myself. What they wanted to do was just effectively turn it into a situation where it could be a blue-green environment. You've got production here on-prem. You throw a switch, automated things happen, and then all of a sudden, production is up in the cloud. Some of the stuff that they did here was super expensive. One of them was they were running Oracle RAC, already expensive, with a node here and a node in the cloud, and the commensurate storage costs that go along with that. Obviously, hopefully, there are better ways to do this. But this was a few years ago, and for a long period of time, people were insistent that the cloud is just another place to run virtual machines.


[00:33:43.07]
Ned: Yes.


[00:33:43.27]
Chris: I think that mindset has changed at this point, but it's important to remember, in the 2010s, 2008 and on, that's what people thought. Other things like PaaS were viewed with suspicion, not with an eye towards opportunity.


[00:33:57.07]
Ned: Not only that, but I don't think Oracle PaaS was even an available option at that time.


[00:34:02.24]
Chris: No, if I remember correctly, they were actually violating the rules by running it in a virtual machine, let alone...


[00:34:08.10]
Ned: Yeah, I was going to say, Oracle was real pissy about all of that for a long time until they built their own cloud. And then they were like, Now you can run Oracle in the cloud.


[00:34:17.09]
Chris: Everyone should only use Exadata.


[00:34:21.08]
Ned: Not everybody has just barrels and barrels of cash.


[00:34:26.06]
Chris: But anyway, I'm getting ahead of myself. The point here is that they wanted to have as smooth a transition as possible, like literally, script out all the little things like you were talking about. Push a button, everything migrates into the cloud. They ran a test, and their tests were actually fairly brazen. It was still, I want to say, three in the morning on a Saturday or something, but they didn't do a bubble test. They did a real live test migrating everything into the cloud. Wow. I know. Ironically, they didn't want to do the bubble because they thought the bubble was going to be too much work. They go through all this stuff. They have all the people on site. They'd run all of their pre-checks. Everybody's ready to go. They throw the switch. We wait 15 minutes. We try to go to the host name of this particular company, and nothing happens.


[00:35:25.11]
Ned: Okay.


[00:35:26.12]
Chris: Everybody's like, Maybe we have to wait a little bit longer. So we waited a little bit longer. Nothing happened. Now people start to panic a little bit, trying to figure out what went wrong. We followed all the steps. Well, long story short, they followed all the steps as they were written three years ago. Certain applications, they just copied and pasted from old documentation into the new documentation without double-checking that things were consistent the way that they used to be. In particular, there was this helper machine that did something, some essential utility. I cannot for the life of me remember what it was. It was just absolutely necessary. It was basically a Raspberry Pi. It was like a nothing system, but it wasn't part of the plan. When they went to the production environment in the cloud, little Mr. I think I can, I think I can was left behind.


[00:36:31.29]
Ned: Oh, poor baby.


[00:36:32.18]
Chris: This crucial data transfer or pipeline or whatever the F it was, was not there. So the application was like, You know what I'm going to do instead? Crash. Right.


[00:36:46.27]
Ned: Oh, dear. Yeah. If you ever find a server that's named like srv-util01, that is the most important server in your entire fucking company.


[00:36:57.07]
Chris: You have no idea.


[00:37:00.23]
Ned: We had a util01 and util02 at that one company I worked for. Some of the stuff that was on there was so business critical, and it had no business being all on the same server, but that was pre-virtualization. Right. You only had so many physical servers, and sometimes one server needed to share a whole bunch of stuff, and it just did. Oh, God.


[00:37:25.12]
Chris: I was talking to a customer recently, actually, and they are a modern company, a gigantic modern company, and they still run some critical services on AS/400s and IBM Z mainframes.


[00:37:42.06]
Ned: Yeah, no, that sounds accurate.


[00:37:45.27]
Chris: Trying to conversate about disaster recovery there has been entertaining. Some of this code is literally older than some of their interns.


[00:37:57.12]
Ned: Oh, yeah. No, I believe it. I believe IBM does offer AS/400s as a service.


[00:38:04.24]
Chris: They have the Z cloud, and that's what this customer is investigating using. Actually, this is something that has... This happened a lot, and we can end with this one because I'm sure you've had this experience, too. Customers would be looking curiously at a cloud of some kind, eventually wanting to get there, and basically building the disaster recovery plan with the idea that it's actually a migration plan.


[00:38:28.17]
Ned: Oh, several times.


[00:38:29.16]
Chris: In fact, I think Azure Migrate started being called Azure Migrate because people just used it to migrate.


[00:38:38.03]
Ned: Yeah. What was it originally called? Azure Site Recovery, ASR. It was originally called Azure Site Recovery, and they just renamed it to Azure Migrate because most people, that's what they used it for. Oh, my goodness. All right. Well, hopefully, dear listener, you enjoyed this conversation with Chris and I reminiscing about our days in the trenches of disaster recovery. If you want other episodes like this, let us know. Go to chaoslever.com, fill out the contact form, and if there's another topic you want us to cover of what we've lived through, we are more than happy to do so. In the meantime, hey, thanks for listening or something. I guess you found it worthwhile enough if you made it all the way to the end. So congratulations to you, dear friend. Now you can go sit on the couch, fire up your DR bubble in Azure, and migrate your physical hardware to there. You've earned it. You can find us on LinkedIn, Chaos Lever. You can go to our website, chaoslever.com. You'll find show notes, blog posts, and general tomfoolery. We'll be back next week to see what fresh hell is upon us.


[00:39:47.10]
Ned: Ta-ta for now.


[00:40:00.26]
Chris: Did I ever tell you about the time that I was helping a company with a merger and acquisition, and nobody wanted to change their servers' names? And there was a huge fight between the teams because both teams had a server named Snoopy.