Welcome to the Chaos
Aug. 1, 2024

The Day the Earth Stood Still (Because of CrowdStrike)

Ned and Chris explore the chaotic fallout from a CrowdStrike Falcon sensor update that crashed Windows systems across various sectors.

Where Were You the Day the Screens Turned Blue?

The tech industry is a house of cards propped up by a mishmash of redundant systems and safety nets. In this episode, Ned and Chris dive into CrowdStrike’s Falcon sensor update on July 19, 2024. This blunder sent Windows systems crashing, causing chaos across airlines, retail stores, and hospitals. They dissect how the update triggered the dreaded Blue Screen of Death and the nightmarish recovery process, especially for BitLocker-encrypted systems. Solutions like macOS’s System Extensions and Linux’s eBPF are tossed around, with a side of skepticism about the balance between speed and security and the inevitable trainwreck of regulatory responses.


Transcript
1
00:00:00,430 --> 00:00:01,390
Ready to party.

2
00:00:02,090 --> 00:00:03,360
By party, do you mean nap?

3
00:00:04,019 --> 00:00:05,610
Uh, yeah, we could have a nap party.

4
00:00:05,720 --> 00:00:07,450
I’m not even mad about that.

5
00:00:07,480 --> 00:00:08,680
That sounds delightful.

6
00:00:10,650 --> 00:00:10,800
[snoring]

7
00:00:21,779 --> 00:00:25,040
Hello, alleged human, and welcome to the Chaos Lever podcast.

8
00:00:25,040 --> 00:00:27,680
My name is Ned, and I’m definitely not a robot.

9
00:00:27,710 --> 00:00:31,430
I am a real human person who does not make random

10
00:00:31,469 --> 00:00:34,929
updates to your computer and then crash entire airports.

11
00:00:35,619 --> 00:00:37,780
That would be wild, and I am definitely not

12
00:00:37,780 --> 00:00:40,400
behind that update through my Skynet uplink.

13
00:00:40,570 --> 00:00:43,379
With me is Chris, who was also here.

14
00:00:44,250 --> 00:00:45,510
Let’s talk about not Skylink.

15
00:00:46,230 --> 00:00:47,629
[crosstalk] Whatever. [laugh]

16
00:00:48,099 --> 00:00:48,659
Wow.

17
00:00:49,389 --> 00:00:52,520
Yeah, I don’t know if you go into this or not, but you know, there are already

18
00:00:53,469 --> 00:00:57,730
conspiracy theories out there that this was an actual attack of some kind.

19
00:00:58,190 --> 00:00:59,790
Or a test for an actual attack.

20
00:01:00,310 --> 00:01:02,729
Yeah, I have seen that on the Reddit

21
00:01:02,769 --> 00:01:04,989
threads, [background noise] and other places.

22
00:01:05,069 --> 00:01:05,920
Ohh, that was loud.

23
00:01:06,340 --> 00:01:07,940
That was way louder than I thought it was going to be.

24
00:01:08,039 --> 00:01:08,430
Sorry [laugh]

25
00:01:09,160 --> 00:01:14,420
Who likes fizzy water? [laugh] Yeah, on the Reddit forums, and in

26
00:01:14,420 --> 00:01:19,060
other places, I’ve seen the conspiracy theories starting to propagate.

27
00:01:19,800 --> 00:01:22,870
I feel like those start out as jokes a lot of the time.

28
00:01:22,870 --> 00:01:25,300
Like, “Wouldn’t it be funny if this was like a state actor,”

29
00:01:25,330 --> 00:01:28,409
or something, but like, when you actually drill down into

30
00:01:28,410 --> 00:01:31,940
it at all, you realize that this is just incompetence.

31
00:01:33,930 --> 00:01:39,660
Don’t attribute to malfeasance what is more likely just gross incompetence.

32
00:01:40,460 --> 00:01:41,939
There’s a pithier way of saying that.

33
00:01:42,469 --> 00:01:44,289
Well, maybe now we should just talk about what we’re actually

34
00:01:44,289 --> 00:01:46,140
talking about because we’re already talking about it.

35
00:01:46,260 --> 00:01:47,240
Oh, CrowdStruck.

36
00:01:48,270 --> 00:01:49,000
UnderStrike?

37
00:01:49,040 --> 00:01:49,210
Dunn.

38
00:01:49,210 --> 00:01:49,240
[laugh]

39
00:01:51,690 --> 00:01:55,759
Those of us who have been around the tech industry for a while,

40
00:01:55,760 --> 00:01:59,559
and have peeked behind the mysterious curtain to see what actually

41
00:01:59,559 --> 00:02:04,010
supports this endeavor that we call modern information technology—

42
00:02:04,279 --> 00:02:04,899
It’s terrifying.

43
00:02:05,500 --> 00:02:07,040
It’s three monkeys in a trench coat.

44
00:02:07,040 --> 00:02:07,089
Barely.

45
00:02:10,440 --> 00:02:15,209
I feel like you quickly become aware of how fragile this entire construction

46
00:02:15,210 --> 00:02:20,969
is, and just how many redundancies and safeguards have to be put in place

47
00:02:21,309 --> 00:02:25,990
to prevent the entire edifice from crumbling into the proverbial sea.

48
00:02:26,430 --> 00:02:28,489
Yeah, and just to put a pin on that, in terms of, not

49
00:02:28,490 --> 00:02:30,540
only is the technology fragile, so are the people.

50
00:02:31,030 --> 00:02:35,230
I saw a joke on LinkedIn today about power-washing the back

51
00:02:35,230 --> 00:02:39,389
of your servers to let the packets go faster, and I guarantee

52
00:02:39,389 --> 00:02:41,649
there’s somebody out there going, “I haven’t done that.

53
00:02:42,260 --> 00:02:44,140
I should do that.”

54
00:02:44,140 --> 00:02:48,250
[laugh] If nothing else, it’ll clean the air filters, so that’s probably good.

55
00:02:49,050 --> 00:02:50,709
It’ll make everything a lot quieter.

56
00:02:52,610 --> 00:02:53,430
[laugh] I suppose it will.

57
00:02:53,630 --> 00:02:55,410
Oh, silence is golden.

58
00:02:55,960 --> 00:02:57,609
The packets go faster in silence.

59
00:02:58,270 --> 00:03:01,480
To quote the second-greatest sci-fi movie of all time, Men

60
00:03:01,480 --> 00:03:05,620
in Black, “There’s always an Arquillian Battle Cruiser,

61
00:03:05,620 --> 00:03:09,190
or a Korilian Death Ray, or an intergalactic plague that

62
00:03:09,190 --> 00:03:12,119
is about to wipe out all life on this miserable planet.

63
00:03:12,410 --> 00:03:14,430
The only way these people can get on with their

64
00:03:14,430 --> 00:03:17,769
happy lives is that they do not know about it!”

65
00:03:18,320 --> 00:03:19,110
I love that quote.

66
00:03:19,400 --> 00:03:22,170
Yeah, so just kind of apply that to technology instead

67
00:03:22,170 --> 00:03:25,720
of aliens, and it’s pretty much the same thing.

68
00:03:26,370 --> 00:03:30,099
The CrowdStrike debacle may not have been a Korilian death

69
00:03:30,099 --> 00:03:35,829
ray, but for 8.5 million Windows devices, it basically was.

70
00:03:36,820 --> 00:03:40,900
Everything, everywhere, is breaking, all at once, and it is

71
00:03:40,930 --> 00:03:45,030
only through the heroic efforts of thousands of ops people

72
00:03:45,040 --> 00:03:49,310
diligently doing their jobs that the public is unaware.

73
00:03:50,130 --> 00:03:54,359
Of course, the public does occasionally become very aware,

74
00:03:54,850 --> 00:03:58,110
and then senators have to hold hearings to grandstand

75
00:03:58,120 --> 00:04:01,070
about things they do not even slightly understand.

76
00:04:01,770 --> 00:04:06,960
They’ll hold some CEO’s feet to the fire for an hour, make self-serving

77
00:04:06,960 --> 00:04:11,169
proclamations and possibly even attempt to levy a fine or two.

78
00:04:11,690 --> 00:04:14,670
Good luck with that now that Chevron Deference is dead.

79
00:04:15,010 --> 00:04:16,670
But hey, we’re not a Supreme Court podcast.

80
00:04:17,430 --> 00:04:18,880
Go listen to 5-4 for that.

81
00:04:20,029 --> 00:04:21,320
Solid plug for 5-4.

82
00:04:21,940 --> 00:04:22,799
Definitely [crosstalk] time.

83
00:04:23,630 --> 00:04:28,380
After all the hubbub dies down, honestly, one or two C-level executives

84
00:04:28,380 --> 00:04:32,169
will probably fall on their swords to appease the investor public.

85
00:04:33,080 --> 00:04:34,409
I wouldn’t feel too sorry for them.

86
00:04:35,290 --> 00:04:39,219
It is a metaphorical sword after all, and it comes with a guaranteed

87
00:04:39,270 --> 00:04:43,970
payout of several millions of dollars, and a cushy job as a lobbyist

88
00:04:43,970 --> 00:04:49,059
or CEO of some other poor unsuspecting private equity firm-acquired

89
00:04:49,080 --> 00:04:54,570
disaster where they can oversee another unavoidable catastrophe.

90
00:04:55,290 --> 00:04:56,659
It’s the circle of life, Chris.

91
00:04:57,349 --> 00:04:58,150
I’m not going to sing it.

92
00:04:58,150 --> 00:04:59,230
I don’t want to get sued again.

93
00:04:59,360 --> 00:05:02,984
I—no, you’re singing it in your head though, and I can see it [laugh]

94
00:05:03,889 --> 00:05:03,919
[laugh]

95
00:05:05,139 --> 00:05:08,830
Oh, so rather than talking about CrowdStrike for the next 30

96
00:05:08,830 --> 00:05:11,690
minutes, I think we should all just go watch The Lion King—

97
00:05:11,700 --> 00:05:12,300
The original.

98
00:05:12,320 --> 00:05:15,539
Which is the best sci-fi movie of all time [laugh]

99
00:05:15,649 --> 00:05:17,840
I’m not really sure where to go with that [laugh]

100
00:05:18,539 --> 00:05:18,969
[laugh] I don’t either.

101
00:05:20,049 --> 00:05:22,309
I’m curious to hear the comments that we get in.

102
00:05:22,320 --> 00:05:28,190
I did recently watch Dune: Part Two, which was excellent.

103
00:05:28,560 --> 00:05:29,560
Took you long enough.

104
00:05:30,090 --> 00:05:30,590
Listen.

105
00:05:30,910 --> 00:05:33,550
Some of us responsible citizens saw it in the theater.

106
00:05:34,380 --> 00:05:35,539
I have one word for you.

107
00:05:36,040 --> 00:05:36,999
That word is children.

108
00:05:37,469 --> 00:05:37,949
Anyway.

109
00:05:38,139 --> 00:05:39,260
You don’t think they would like it?

110
00:05:39,700 --> 00:05:44,830
I think nightmare fuel would probably be the closest, yeah,

111
00:05:44,840 --> 00:05:49,000
Feyd-Rautha—Routha—however the hell you say his name—yeah,

112
00:05:49,000 --> 00:05:51,500
those scenes in particular, God, that dude is creepy.

113
00:05:51,940 --> 00:05:55,510
Yeah, he really inhabited the creepy level of the character.

114
00:05:56,150 --> 00:05:59,719
Like, Jared Leto levels of creepy.

115
00:06:00,020 --> 00:06:01,320
No, but except good.

116
00:06:01,500 --> 00:06:06,100
Yes. [laugh] Yeah, because he’s playing a character, not himself.

117
00:06:06,910 --> 00:06:07,140
Oh.

118
00:06:07,750 --> 00:06:08,410
Anyway.

119
00:06:09,179 --> 00:06:09,269
CrowdStrike.

120
00:06:10,230 --> 00:06:11,239
What the hell happened?

121
00:06:12,040 --> 00:06:19,000
On Friday, July 19th, 2024, at 5:24 UTC—that’s 1 a.m.

122
00:06:19,010 --> 00:06:23,110
for our East Coast peeps, and the day before for California

123
00:06:23,330 --> 00:06:26,480
because you’re a bunch of weirdos—security vendor CrowdStrike

124
00:06:26,520 --> 00:06:29,479
released an update for their Falcon sensor platform.
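
[Editor's note: the time-zone arithmetic above can be checked with a short Python sketch. The fixed offsets (UTC-4 for Eastern Daylight Time, UTC-7 for Pacific Daylight Time in July) are assumptions written out by hand rather than looked up from a time-zone database, and the release time is the one quoted in the episode.]

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed summer offsets; a real tool would use a time-zone database.
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Release time as quoted in the episode.
release_utc = datetime(2024, 7, 19, 5, 24, tzinfo=timezone.utc)

# Convert the same instant into each local clock.
east_coast = release_utc.astimezone(EDT)
california = release_utc.astimezone(PDT)

print(east_coast.isoformat())
print(california.isoformat())
```

[Under those assumptions the East Coast time lands at 1:24 a.m. on July 19, and the California time falls on the evening of July 18, which is the "day before" mentioned above.]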

125
00:06:30,120 --> 00:06:34,219
Falcon is an endpoint detection and response solution meant to protect

126
00:06:34,219 --> 00:06:38,210
systems against viruses, malware, and advanced persistent threats.

127
00:06:38,900 --> 00:06:44,230
The update type was a content update, or what CrowdStrike calls a channel file,

128
00:06:44,559 --> 00:06:50,399
which you can think of as, like, the virus definition, except as a modern EDR,

129
00:06:51,049 --> 00:06:54,560
it’s a bit more complicated than that, and we’ll get to why that’s important

130
00:06:54,740 --> 00:06:59,040
when we get to the root-cause analysis, or what we know so far.

131
00:06:59,860 --> 00:07:03,039
Once the channel file was loaded by the Falcon sensor

132
00:07:03,040 --> 00:07:07,080
platform, it caused a memory access fault at the kernel

133
00:07:07,080 --> 00:07:11,420
level that forced a system crash on all Windows clients.

134
00:07:11,959 --> 00:07:14,750
The old Blue Screen of Death popped up, and then the

135
00:07:14,750 --> 00:07:18,760
system either rebooted or sat at that screen for a while.

136
00:07:19,310 --> 00:07:20,440
Possibly forever.

137
00:07:21,070 --> 00:07:22,600
Yeah, until somebody touched it.

138
00:07:22,990 --> 00:07:23,729
Pretty much.

139
00:07:24,740 --> 00:07:27,919
So, if you happened to walk into a major airport around that time,

140
00:07:28,270 --> 00:07:31,640
you might have been greeted by giant display signs that just had

141
00:07:31,640 --> 00:07:37,220
the sad frowny face on it, because now the blue screen has an emoji.

142
00:07:38,180 --> 00:07:40,289
And it was kind of funny, actually.

143
00:07:40,639 --> 00:07:43,830
I mean, funny for the people, you know, seeing the screens; not funny

144
00:07:43,830 --> 00:07:46,090
for everybody who had to deal with the disaster, [unintelligible]

145
00:07:46,330 --> 00:07:46,650
Right.

146
00:07:46,690 --> 00:07:52,520
And were sitting in airports for three days while waiting to, you know, go home.

147
00:07:52,960 --> 00:07:53,590
Yeah.

148
00:07:53,970 --> 00:07:58,520
Depending on which airline you were working with, you

149
00:07:58,520 --> 00:08:02,050
may have been not impacted at all, impacted slightly, or

150
00:08:02,050 --> 00:08:04,280
still sitting in the airport listening to this right now.

151
00:08:04,949 --> 00:08:05,779
I’m so sorry.

152
00:08:06,240 --> 00:08:08,410
Maybe don’t fly Delta [laugh] next time.

153
00:08:08,990 --> 00:08:10,000
Actually, I don’t know if it was Delta.

154
00:08:10,010 --> 00:08:10,880
It might have been United.

155
00:08:10,960 --> 00:08:11,760
They’re all terrible.

156
00:08:11,950 --> 00:08:12,700
It doesn’t matter.

157
00:08:13,530 --> 00:08:16,620
But one of the few that wasn’t affected was Southwest.

158
00:08:17,010 --> 00:08:18,389
Is that because they’re running Linux?

159
00:08:18,679 --> 00:08:19,429
Allegedly.

160
00:08:19,870 --> 00:08:23,610
Again, this is unproven internet theory, but allegedly it’s because

161
00:08:23,629 --> 00:08:26,410
their systems were so old that CrowdStrike wouldn’t run on them.

162
00:08:27,670 --> 00:08:33,710
[laugh] I feel like we did cover Southwest in a Chaos Lever, or possibly its

163
00:08:33,719 --> 00:08:38,689
precursor, when we talked about old, out-of-date systems that are super fragile.

164
00:08:38,889 --> 00:08:39,990
Am I remembering correctly?

165
00:08:40,340 --> 00:08:44,730
I mean, I had that theory, or that thought as well, but I’m also now like,

166
00:08:45,180 --> 00:08:49,199
did they just post that, and it became a memory, or is it a real memory?

167
00:08:49,530 --> 00:08:51,450
[laugh] It’s hard to say.

168
00:08:52,080 --> 00:08:57,359
I will say that it was in fact Delta—and is Delta—that’s having

169
00:08:57,370 --> 00:09:01,720
the biggest struggle because they use BitLocker extensively.

170
00:09:02,150 --> 00:09:02,340
Right.

171
00:09:02,380 --> 00:09:03,550
I assume you’re going to get into that.

172
00:09:03,810 --> 00:09:04,410
Oh, yes.

173
00:09:04,560 --> 00:09:04,890
Okay.

174
00:09:05,200 --> 00:09:06,010
I don’t want to interrupt.

175
00:09:06,310 --> 00:09:06,710
Carry on.

176
00:09:06,940 --> 00:09:07,780
So, we had all these crashes—

177
00:09:07,780 --> 00:09:08,479
Whenever you’re ready.

178
00:09:08,960 --> 00:09:10,330
And you know, when your system—

179
00:09:10,330 --> 00:09:12,000
Just go with [crosstalk]

180
00:09:12,000 --> 00:09:12,130
[unintelligible]

181
00:09:12,130 --> 00:09:12,363
—whenever [unintelligible]

182
00:09:12,363 --> 00:09:12,536
[unintelligible]

183
00:09:12,710 --> 00:09:13,170
At any time—

184
00:09:13,170 --> 00:09:13,464
[unintelligible]

185
00:09:13,759 --> 00:09:15,729
When you could—why would—who—

186
00:09:15,740 --> 00:09:18,099
[laugh] We have all these crashed systems,

187
00:09:18,230 --> 00:09:19,459
and what do you do with a crashed system?

188
00:09:19,460 --> 00:09:20,150
You restart it.

189
00:09:20,840 --> 00:09:23,240
But unfortunately, attempts to restart the afflicted

190
00:09:23,240 --> 00:09:26,160
systems just resulted in another blue screen of death.

191
00:09:26,599 --> 00:09:30,959
Because Falcon sensor is loaded as a driver during system

192
00:09:30,969 --> 00:09:35,790
boot, and it has been marked as boot required, meaning

193
00:09:35,790 --> 00:09:38,780
it must be loaded for the system to boot properly.

194
00:09:39,580 --> 00:09:42,880
As soon as Falcon started, it would load all of its channel

195
00:09:42,880 --> 00:09:45,630
files and, predictably, the system would crash again.

196
00:09:46,309 --> 00:09:49,630
This rendered all affected systems completely

197
00:09:49,670 --> 00:09:53,429
unusable and inaccessible through in-band management.

198
00:09:53,440 --> 00:09:56,069
So, you can’t RDP into this thing and fix it.

199
00:09:56,830 --> 00:10:01,240
So, this makes sense from an EDR perspective, right?

200
00:10:01,240 --> 00:10:01,355
Yes.

201
00:10:01,355 --> 00:10:02,699
You want to protect your computer.

202
00:10:03,580 --> 00:10:07,469
No matter what tool you have, it’s going to have this boot requirement

203
00:10:08,290 --> 00:10:11,460
because you don’t want your system booting without endpoint protection.

204
00:10:11,639 --> 00:10:11,939
Right.

205
00:10:11,939 --> 00:10:14,210
Because endpoint protection, ostensibly, is good.

206
00:10:15,090 --> 00:10:15,750
Ostensibly.

207
00:10:16,700 --> 00:10:20,450
The problem, obviously, comes in where your endpoint management

208
00:10:20,450 --> 00:10:23,510
is now, effectively, malware that’s crashing your system.

209
00:10:23,880 --> 00:10:24,230
Right.

210
00:10:24,550 --> 00:10:26,190
That would be what we call ‘the downside.’

211
00:10:26,190 --> 00:10:27,210
[laugh] Yes.

212
00:10:27,639 --> 00:10:29,769
And we will definitely get into that as well.

213
00:10:30,290 --> 00:10:34,700
Microsoft has published a blog post where they claim, according to their

214
00:10:34,700 --> 00:10:39,109
telemetry, about 8.5 million Windows devices were impacted by this.

215
00:10:39,450 --> 00:10:43,210
Now, that’s only about 1 or 2% of all Windows devices out

216
00:10:43,210 --> 00:10:47,530
there, so this is not, as a percentage, a ton of devices.

217
00:10:47,539 --> 00:10:53,010
However… it’s still a lot of devices, [laugh] and the impact was pretty severe.

218
00:10:53,020 --> 00:10:56,489
As we discussed, airlines had to suspend or cancel

219
00:10:56,500 --> 00:11:00,120
flights, retail stores suddenly couldn’t accept payment.

220
00:11:00,690 --> 00:11:03,630
Medical devices and hospitals crashed in the middle of

221
00:11:03,630 --> 00:11:08,180
surgeries, bowling alleys had to hand out paper and pencils to

222
00:11:08,180 --> 00:11:12,210
individuals, who just looked at them like, what the hell is this?

223
00:11:12,400 --> 00:11:14,599
How do I track ten frames by hand?

224
00:11:14,900 --> 00:11:16,520
How does a turkey even work?

225
00:11:17,150 --> 00:11:19,539
[sigh] Dark times for all of us, Chris.

226
00:11:19,730 --> 00:11:23,569
That’s the kind of math podcast that needs to come out because I guarantee

227
00:11:23,570 --> 00:11:26,079
there’s no one left on earth who knows how to score bowling by hand.

228
00:11:28,140 --> 00:11:28,829
[laugh] True story.

229
00:11:29,390 --> 00:11:33,240
I was up in Cape Cod, and we went duckpin bowling—which is a real thing.

230
00:11:33,309 --> 00:11:33,840
Look it up—

231
00:11:33,929 --> 00:11:34,630
Oh, it’s so fun.

232
00:11:34,730 --> 00:11:35,450
It’s super fun.

233
00:11:35,469 --> 00:11:36,280
Definitely look it up.

234
00:11:36,639 --> 00:11:40,860
Super fun, but the bowling alley was so old that

235
00:11:40,860 --> 00:11:43,279
they did not have a computerized scoring system.

236
00:11:43,810 --> 00:11:44,160
Wow.

237
00:11:44,630 --> 00:11:44,950
Yeah.

238
00:11:45,000 --> 00:11:46,630
They gave me a piece of paper and pencil, and

239
00:11:46,630 --> 00:11:49,980
I was like, “Uh, score is not important, right?

240
00:11:50,950 --> 00:11:58,140
We’re just here to have fun.” Oh… Now, to get these systems back to a working

241
00:11:58,140 --> 00:12:03,580
state, the offending channel files had to be removed before Falcon was loaded.

242
00:12:04,130 --> 00:12:07,550
There’s a few options to do this, and none of them are great or easy.

243
00:12:08,429 --> 00:12:13,110
You can boot the system into Windows safe mode, which only loads the

244
00:12:13,219 --> 00:12:17,580
absolute bare minimum of Windows drivers, and then remove the files.

245
00:12:18,440 --> 00:12:22,710
For virtual systems, you could mount the system disk on another system and

246
00:12:22,710 --> 00:12:27,390
remove the files, and then reattach the drive to the original system, or

247
00:12:27,420 --> 00:12:31,959
if you had snapshots or a backup, you could roll back to a prior snapshot.
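
[Editor's note: a minimal Python sketch of the "remove the files" step. The directory and the `C-00000291*.sys` pattern come from CrowdStrike's public remediation guidance, not from this episode. `find_bad_channel_files` and `remove_bad_channel_files` are hypothetical helper names, and the default is a dry run that deletes nothing.]

```python
from pathlib import Path

# Location and pattern per CrowdStrike's public guidance (not stated in the episode).
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_PATTERN = "C-00000291*.sys"


def find_bad_channel_files(directory: Path = DEFAULT_DIR) -> list[Path]:
    """Return the channel files matching the faulty pattern, sorted by name."""
    if not directory.is_dir():
        return []
    return sorted(directory.glob(BAD_PATTERN))


def remove_bad_channel_files(directory: Path = DEFAULT_DIR, dry_run: bool = True) -> list[Path]:
    """Delete matching files, or only list them when dry_run is True."""
    matches = find_bad_channel_files(directory)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run: report what would be removed without touching anything.
    for path in remove_bad_channel_files(dry_run=True):
        print(f"would remove {path}")
```

[On a machine stuck in the crash loop, something like this would only be usable from safe mode or with the disk mounted on another system, which is exactly why the fix was so labor-intensive.]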

248
00:12:32,770 --> 00:12:38,540
Fortunately, CrowdStrike did pull the offending file from the update servers,

249
00:12:38,970 --> 00:12:42,850
so you wouldn’t then immediately redownload it and be back where you were.

250
00:12:43,550 --> 00:12:48,339
While it is a huge pain to fix all of these virtual systems, the real pain

251
00:12:48,349 --> 00:12:52,840
is those physical systems that don’t have an out-of-band management option.

252
00:12:53,539 --> 00:12:56,790
Someone will need to physically sit at the terminal, invoke

253
00:12:56,790 --> 00:13:00,880
safe mode, and perform the remediation steps, or use a separate

254
00:13:00,890 --> 00:13:04,260
boot device like a thumb drive to perform the maintenance.

255
00:13:04,360 --> 00:13:06,519
This is very bad.

256
00:13:07,349 --> 00:13:10,389
You forgot about the other way to fix the system, which apparently

257
00:13:10,410 --> 00:13:14,010
did work on some—at least a number of people’s, which is just

258
00:13:14,010 --> 00:13:17,959
keep rebooting it until CrowdStrike Falcon updated, and deleted

259
00:13:17,960 --> 00:13:20,300
the file on its own before it crashed because of the file.

260
00:13:20,890 --> 00:13:25,360
[laugh] I guess if it does load and the network stack loads

261
00:13:25,360 --> 00:13:30,210
in time for it to pull the update and replace it, maybe?

262
00:13:30,730 --> 00:13:31,200
Maybe.

263
00:13:31,650 --> 00:13:33,550
Between 15 and 20 reboots.

264
00:13:33,650 --> 00:13:35,410
Sometimes people were getting it to work.

265
00:13:35,680 --> 00:13:36,290
Wow.

266
00:13:37,110 --> 00:13:37,920
That’s awful.

267
00:13:38,030 --> 00:13:38,909
But okay.

268
00:13:38,970 --> 00:13:40,030
So, another option.

269
00:13:40,990 --> 00:13:45,970
Microsoft has published a USB tool to assist with the

270
00:13:45,970 --> 00:13:49,830
removal of this file, so you have that option as well.

271
00:13:51,330 --> 00:13:55,480
As I mentioned, the BitLocker thing does throw a bit of a wrench

272
00:13:55,650 --> 00:14:00,360
in the whole plan because in order to access a BitLocker-protected

273
00:14:00,360 --> 00:14:05,770
system drive out-of-band, you have to supply a BitLocker unlock key—

274
00:14:06,139 --> 00:14:06,349
Yeah.

275
00:14:07,080 --> 00:14:09,280
And that can be hard to get.

276
00:14:10,080 --> 00:14:13,240
Well, it’s not like people want their end-users to have that.

277
00:14:13,370 --> 00:14:13,720
Again—

278
00:14:13,730 --> 00:14:13,880
Yes.

279
00:14:13,880 --> 00:14:15,380
—this is a security concern.

280
00:14:15,929 --> 00:14:20,290
Also, the BitLocker key is 48 characters long, so not only finding it

281
00:14:20,290 --> 00:14:24,839
but typing it in before BitLocker times out… which it does, apparently.
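
[Editor's note: a 48-digit key is easy to mistype, and the format itself helps catch typos. As commonly documented, a BitLocker recovery password is eight groups of six digits, and each group is a multiple of 11 whose quotient fits in 16 bits. Assuming those rules, a quick plausibility check in Python might look like the sketch below. `is_plausible_recovery_password` is a hypothetical helper, and passing the check does not mean the key unlocks any particular drive.]

```python
import re

GROUPS = 8          # eight blocks of digits
GROUP_DIGITS = 6    # six digits per block, 48 digits in total


def is_plausible_recovery_password(text: str) -> bool:
    """Check the documented shape of a recovery password; it does not test the key."""
    groups = re.split(r"[-\s]+", text.strip())
    if len(groups) != GROUPS:
        return False
    for group in groups:
        if not re.fullmatch(rf"[0-9]{{{GROUP_DIGITS}}}", group):
            return False
        value = int(group)
        # Each block is assumed to encode a 16-bit number multiplied by 11.
        if value % 11 != 0 or value // 11 > 0xFFFF:
            return False
    return True


if __name__ == "__main__":
    # A made-up key built only to satisfy the format rules.
    sample = "-".join(f"{n * 11:06d}" for n in range(1, 9))
    print(sample, is_plausible_recovery_password(sample))
```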

282
00:14:26,230 --> 00:14:27,230
So, a bit of a nightmare.

283
00:14:27,790 --> 00:14:28,710
Not a great situation.

284
00:14:29,030 --> 00:14:29,329
No.

285
00:14:30,300 --> 00:14:33,360
And so, that’s part of the reason Delta is still struggling.

286
00:14:34,179 --> 00:14:39,099
I would love to say that, as of right now, we know exactly what caused the

287
00:14:39,099 --> 00:14:44,660
error, but honestly portions of the supply chain are still pretty murky.

288
00:14:45,440 --> 00:14:51,650
Instead, I will try to explain how a simple update for an EDR caused millions

289
00:14:51,650 --> 00:14:56,090
of Windows machines to blue screen, and we can also have fun pointing all the

290
00:14:56,090 --> 00:15:01,050
fingers that we have at all the other parties because we’ve got jazz hands.

291
00:15:01,050 --> 00:15:02,200
[whispering] It’s all your fault.

292
00:15:03,030 --> 00:15:04,340
That works better with a—

293
00:15:04,660 --> 00:15:05,430
Visual medium?

294
00:15:05,760 --> 00:15:09,490
Yeah. [laugh] So, to start with, we have to consider

295
00:15:09,490 --> 00:15:11,870
what the Falcon sensor’s actually trying to do.

296
00:15:12,690 --> 00:15:16,930
Falcon sensor, as I mentioned, is an EDR product, and it’s meant to

297
00:15:16,930 --> 00:15:21,209
scan all activity on the host operating system looking for threats.

298
00:15:21,820 --> 00:15:24,900
Most applications aren’t granted that level of access

299
00:15:24,900 --> 00:15:27,849
to other applications or to the system as a whole.

300
00:15:28,400 --> 00:15:32,099
As you mentioned, Chris, it needs to be in a privileged position.

301
00:15:32,679 --> 00:15:37,150
But that’s the point: you’re trying to prevent other pieces of software from

302
00:15:37,150 --> 00:15:41,320
getting themselves into privileged positions to compromise your computer.

303
00:15:42,430 --> 00:15:45,219
To understand what it means to be in that privileged position,

304
00:15:45,220 --> 00:15:48,350
I’m going to briefly talk about user space and kernel space.

305
00:15:48,620 --> 00:15:51,790
Please feel free to interrupt me when I get something wrong, which I will.

306
00:15:52,510 --> 00:15:53,386
[whispering] Yes, thank you.

307
00:15:53,609 --> 00:15:55,920
Your operating system, whether it’s Windows,

308
00:15:55,980 --> 00:15:59,910
Linux, macOS, I don’t know, Solaris—

309
00:16:00,680 --> 00:16:01,280
AIX?

310
00:16:01,760 --> 00:16:06,299
—sure—it is responsible for managing the hardware on your system.

311
00:16:06,410 --> 00:16:10,200
That includes stuff like memory management, writing data to disk,

312
00:16:10,510 --> 00:16:14,460
sensing input from peripherals, and scheduling threads on the CPU.

313
00:16:15,230 --> 00:16:17,510
This all happens in what is called kernel

314
00:16:17,520 --> 00:16:20,530
space, and it’s considered highly privileged.

315
00:16:20,860 --> 00:16:24,870
If something goes wrong in kernel space, the system may have to halt

316
00:16:25,230 --> 00:16:29,579
or crash to prevent damage to the hardware, or corruption of data.

317
00:16:30,490 --> 00:16:34,710
Ideally, as little as possible should be running in kernel space.

318
00:16:35,500 --> 00:16:38,530
Instead, most applications run in user space

319
00:16:38,639 --> 00:16:41,310
which does not have direct access to the hardware.

320
00:16:42,170 --> 00:16:45,940
Applications running in user space interact with the operating system,

321
00:16:45,990 --> 00:16:50,190
and make requests based on that operating system’s published APIs.

322
00:16:50,530 --> 00:16:52,650
Do you want to write a file to disk?

323
00:16:52,950 --> 00:16:55,620
You make an API call and pass the correct information.

324
00:16:56,130 --> 00:16:57,450
Need to access memory?

325
00:16:57,880 --> 00:17:01,220
Make an API call and specify the address and range.

326
00:17:01,960 --> 00:17:04,530
The operating system will evaluate that request,

327
00:17:05,010 --> 00:17:08,839
make sure it’s valid and allowed before executing it.

328
00:17:09,510 --> 00:17:12,810
This means when an application runs into issues or it crashes, the

329
00:17:12,849 --> 00:17:17,329
operating system is able to handle that crash gracefully—most of

330
00:17:17,329 --> 00:17:21,339
the time—and keep other processes and the system as a whole running.
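
[Editor's note: a small Python sketch of that user-space pattern. The program never touches the disk directly. It asks the operating system through calls such as `os.open` and `os.write`, and an invalid request comes back as an error instead of taking the machine down. The file name is made up for the example.]

```python
import os
import tempfile

# Ask the OS to create a file and write to it on our behalf.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
written = os.write(fd, b"hello from user space\n")
os.close(fd)

# Now make a request the OS should refuse: writing through a read-only handle.
fd = os.open(path, os.O_RDONLY)
try:
    os.write(fd, b"not allowed")
    refused = False
except OSError as err:
    refused = True
    print(f"request refused by the OS: {err}")
finally:
    os.close(fd)

print(f"wrote {written} bytes; process still running, refused={refused}")
```

[The refused write is the point: the bad request fails in one process, and everything else keeps running.]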

331
00:17:22,440 --> 00:17:22,720
Right.

332
00:17:23,069 --> 00:17:25,129
And there’s one point to note here.

333
00:17:25,129 --> 00:17:27,843
So, first of all, some of the terminology, they call it the

334
00:17:27,920 --> 00:17:32,270
kernel; also, they call it Ring 0, meaning it is the lowest

335
00:17:32,280 --> 00:17:35,000
possible level of the system, and it has access to everything

336
00:17:35,000 --> 00:17:37,910
else that is going on in the system without restriction.

337
00:17:38,679 --> 00:17:42,690
Necessary to make sure, for things like EDR tools, that it can

338
00:17:42,790 --> 00:17:46,770
scan not only all of the files, but all of the activity, all of the

339
00:17:46,770 --> 00:17:49,669
network, all of the I/O, the disk, et cetera, et cetera, et cetera.

340
00:17:50,270 --> 00:17:50,490
Right.

341
00:17:50,870 --> 00:17:56,040
One thing people always get upset about is, why does Windows crash so easily?

342
00:17:56,420 --> 00:17:56,780
And—

343
00:17:58,030 --> 00:17:58,060
[laugh]

344
00:17:58,350 --> 00:18:01,250
While there is an argument to be made that it is fragile and poorly

345
00:18:01,250 --> 00:18:04,100
designed and should have a better way of handling things like EDR

346
00:18:04,140 --> 00:18:07,730
that needs this access—which is true, and I assume you’ll get to that—

347
00:18:07,990 --> 00:18:08,310
Yes.

348
00:18:08,310 --> 00:18:12,320
The other thing is, again, remember, completely unfettered access.

349
00:18:12,349 --> 00:18:16,239
If something goes wrong at the kernel level, we

350
00:18:16,240 --> 00:18:20,700
get our old friend, unanticipated consequences.

351
00:18:21,600 --> 00:18:22,840
And this is extremely bad.

352
00:18:22,850 --> 00:18:27,709
So, for example, let’s say you have a system that is running a database.

353
00:18:28,200 --> 00:18:30,120
Databases, as you know, are kind of important.

354
00:18:31,389 --> 00:18:35,919
A kernel-level job is trying to write a new file, or

355
00:18:35,920 --> 00:18:38,760
a new table, or a new row, or record, or whatever, but

356
00:18:38,760 --> 00:18:42,179
it runs into an error with, say, memory misallocation.

357
00:18:43,290 --> 00:18:44,980
What is it going to write to the database?

358
00:18:45,610 --> 00:18:47,669
It could be writing absolute nonsense.

359
00:18:47,710 --> 00:18:49,949
It could completely corrupt the database.

360
00:18:49,969 --> 00:18:53,639
Therefore, the kernel crashes preemptively whenever it detects a

361
00:18:53,639 --> 00:18:59,280
failure because the consequences of trying to soldier on might be worse.
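
[Editor's note: a sketch of that fail-fast reasoning, written as a user-space analogy in Python rather than real kernel code. The `Record` type, its checksum field, `make_record`, and `write_record` are all hypothetical. The only point is that when the data in memory no longer matches what was intended, stopping is safer than writing nonsense into the database.]

```python
import zlib
from dataclasses import dataclass


class IntegrityError(RuntimeError):
    """Raised instead of writing data that fails its integrity check."""


@dataclass
class Record:
    payload: bytes
    checksum: int  # CRC32 computed when the record was created


def make_record(payload: bytes) -> Record:
    """Build a record and remember the checksum of its intended contents."""
    return Record(payload, zlib.crc32(payload))


def write_record(table: list[bytes], record: Record) -> None:
    """Append the payload only if it still matches its checksum; otherwise halt."""
    if zlib.crc32(record.payload) != record.checksum:
        raise IntegrityError("payload corrupted in memory; refusing to write")
    table.append(record.payload)


if __name__ == "__main__":
    table: list[bytes] = []
    write_record(table, make_record(b"row 1"))

    bad = make_record(b"row 2")
    bad.payload = b"r\x00w 2"  # simulate a stray memory write
    try:
        write_record(table, bad)
    except IntegrityError as err:
        print(f"halted: {err}")
    print(table)
```

[A kernel cannot always raise a tidy exception like this, so its version of "refusing to write" is the blue screen.]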

362
00:18:59,970 --> 00:19:00,420
Right.

363
00:19:00,940 --> 00:19:04,810
It’s that, “Out of an abundance of caution, I’m going to fail.”

364
00:19:05,179 --> 00:19:05,469
Right.

365
00:19:05,889 --> 00:19:08,170
Which is the same thing I did in high school.

366
00:19:09,280 --> 00:19:12,560
[laugh] Yes… it was better if you didn’t succeed, Chris.

367
00:19:12,560 --> 00:19:12,590
[laugh]

368
00:19:13,890 --> 00:19:17,070
So, Windows applications have been able to request

369
00:19:17,080 --> 00:19:19,760
access to run in kernel mode for a long time.

370
00:19:20,309 --> 00:19:24,420
Generally, that’s a bad idea, for the reasons you just articulated.

371
00:19:25,150 --> 00:19:27,389
But Microsoft wasn’t super strict about it.

372
00:19:27,960 --> 00:19:30,600
Microsoft is nothing if not accommodating

373
00:19:30,610 --> 00:19:32,959
to developers and their terrible ideas.

374
00:19:33,860 --> 00:19:36,030
Some applications actually do need to run in

375
00:19:36,030 --> 00:19:38,879
kernel mode, in particular, antivirus software.

376
00:19:39,450 --> 00:19:42,869
Applications running in user mode are not generally allowed to

377
00:19:42,870 --> 00:19:46,529
access the memory and monitor the behavior of other applications.

378
00:19:46,949 --> 00:19:50,220
Microsoft Teams can’t just decide to read the memory space of

379
00:19:50,220 --> 00:19:54,620
Slack or kill the Zoom processes, as much as it might want to.

380
00:19:54,620 --> 00:19:56,010
I was going to say, it totally would.

381
00:19:56,270 --> 00:19:59,939
[laugh] The operating system just doesn’t allow that type of nonsense.

382
00:20:00,509 --> 00:20:03,739
But, you know, an antivirus application needs a privileged

383
00:20:03,740 --> 00:20:06,700
level of access and monitoring to defeat the bad guys.

384
00:20:07,100 --> 00:20:10,350
So, antivirus companies like Symantec wrote

385
00:20:10,380 --> 00:20:12,640
their application to run in kernel space.

386
00:20:13,500 --> 00:20:18,460
Now, Microsoft actually tried to push back on the rampant abuse of kernel

387
00:20:18,460 --> 00:20:23,949
mode by antivirus outfits—and others—when Windows Vista was being rolled out.

388
00:20:24,440 --> 00:20:24,800
Yeah.

389
00:20:24,920 --> 00:20:26,020
What, what, what?

390
00:20:26,230 --> 00:20:29,930
Vista, for you youngsters in the crowd,

391
00:20:30,310 --> 00:20:33,220
Vista was the Windows 8 of the early aughts.

392
00:20:34,230 --> 00:20:36,230
Hopefully that puts some perspective on things.

393
00:20:37,179 --> 00:20:42,590
While Vista was a disaster as an operating system release, they did add a whole

394
00:20:42,590 --> 00:20:48,410
bunch of additional functionality and features that brought the client OSes more

395
00:20:48,410 --> 00:20:52,810
in line with what the server OSes were doing, and added a bunch of security.

396
00:20:53,040 --> 00:20:57,280
And one of the things they really tried to do was lock down kernel mode access.

397
00:20:58,150 --> 00:21:02,159
Unfortunately, antivirus companies didn’t like that, and they threw a hissy

398
00:21:02,160 --> 00:21:06,550
fit, claiming that since Windows Defender could run in kernel mode, and their

399
00:21:06,550 --> 00:21:12,870
stuff couldn’t, Microsoft was abusing their influence, a la Internet Explorer.

400
00:21:13,530 --> 00:21:17,420
And Microsoft, still reeling from their decade-long battle

401
00:21:17,420 --> 00:21:22,040
with the FTC over antitrust, kowtowed to the AV club, and

402
00:21:22,040 --> 00:21:25,040
allowed them to keep their precious kernel mode access.

403
00:21:25,710 --> 00:21:29,170
It’s not an unreasonable request because all the

404
00:21:29,170 --> 00:21:31,929
other players wanted was an even playing field.

405
00:21:32,120 --> 00:21:32,419
Right.

406
00:21:32,460 --> 00:21:35,360
The fact that the even playing field was a wide-open

407
00:21:35,469 --> 00:21:38,660
security nightmare is still a Microsoft problem.

408
00:21:39,710 --> 00:21:40,190
[laugh] Right.

409
00:21:40,750 --> 00:21:44,390
Microsoft did add an interesting requirement, though, if you

410
00:21:44,390 --> 00:21:47,550
wanted to play in kernel space, and that was driver signing.

411
00:21:48,889 --> 00:21:51,370
Antivirus applications would present themselves

412
00:21:51,410 --> 00:21:55,050
as device drivers to get to run in kernel mode.

413
00:21:55,480 --> 00:21:59,320
A device driver to nothing, but a device driver nonetheless.

414
00:21:59,960 --> 00:22:05,620
Microsoft created the Windows Hardware Quality Labs Testing Certification—aka

415
00:22:05,750 --> 00:22:14,240
WHQL—and once a driver had gone through that lab and gotten its certification,

416
00:22:14,470 --> 00:22:20,190
Microsoft would digitally sign the driver and give them the Certified for

417
00:22:20,190 --> 00:22:25,139
Windows logo, so they could proudly display ‘Certified for Windows Vista’—or

418
00:22:25,139 --> 00:22:30,540
Windows 8 or whatever—on the box when you buy the software, or on their website.

419
00:22:31,400 --> 00:22:34,430
Now, vendors could still choose to sign their drivers

420
00:22:34,450 --> 00:22:38,750
internally, but the antivirus folks wanted to get that

421
00:22:39,400 --> 00:22:42,710
WHQL certification and all the cachet that went with it.

422
00:22:43,599 --> 00:22:46,369
As long as your driver code didn’t change,

423
00:22:46,500 --> 00:22:49,179
the digital signature would remain valid.
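
A rough sketch of why a signature only stays valid while the signed bytes are unchanged. This is an illustration under stated assumptions, not how Windows driver signing is implemented: real driver signing uses public-key certificates, while this example substitutes an HMAC with a made-up `SIGNING_KEY` so it can run with only the Python standard library.

```python
import hashlib
import hmac

# Assumption: stands in for the signer's private key in a real scheme.
SIGNING_KEY = b"hypothetical-signing-key"


def sign(driver_bytes: bytes) -> bytes:
    # The signature covers a digest of the exact bytes of the driver.
    digest = hashlib.sha256(driver_bytes).digest()
    return hmac.new(SIGNING_KEY, digest, hashlib.sha256).digest()


def is_valid(driver_bytes: bytes, signature: bytes) -> bool:
    # Recompute over the bytes we actually have and compare.
    return hmac.compare_digest(sign(driver_bytes), signature)


driver = b"driver code v1"
signature = sign(driver)

unchanged_ok = is_valid(driver, signature)               # same bytes
modified_ok = is_valid(driver + b" patched", signature)  # any change breaks it
```

Changing even one byte of the driver changes the digest, so the old signature no longer verifies and the driver would need to be signed again.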

424
00:22:49,630 --> 00:22:53,690
So, that means all these antivirus companies—like CrowdStrike—would

425
00:22:53,690 --> 00:22:56,710
get that certification, which meant that it had gone through

426
00:22:56,720 --> 00:23:00,079
some level of rigorous testing when it came to the way the

427
00:23:00,080 --> 00:23:03,000
driver was written and the way it interacted with the kernel.

428
00:23:03,880 --> 00:23:04,919
Seems like a good idea.

429
00:23:05,609 --> 00:23:06,109
I’m for it.

430
00:23:06,980 --> 00:23:10,649
Unfortunately, external data could be loaded by the driver, like—

431
00:23:10,660 --> 00:23:10,680
No.

432
00:23:11,290 --> 00:23:12,670
—virus definitions.

433
00:23:13,100 --> 00:23:14,909
But in theory, the actual running code

434
00:23:14,910 --> 00:23:18,270
should all live in that signed device driver.

435
00:23:18,429 --> 00:23:21,300
So, read in some config, but all the logic in the

436
00:23:21,300 --> 00:23:23,609
actual code should live in that device driver.

437
00:23:24,150 --> 00:23:27,490
That’s all well and good for loading virus signatures and

438
00:23:27,490 --> 00:23:32,100
looking for matches in memory and CPU threads, but Falcon

439
00:23:32,110 --> 00:23:36,600
sensor is a modern EDR, and it doesn’t just use signatures.

440
00:23:37,170 --> 00:23:40,800
Instead, Falcon uses machine learning to develop behavior

441
00:23:40,800 --> 00:23:44,150
patterns, and then it needs to detect and respond to

442
00:23:44,150 --> 00:23:47,120
emerging threats that match those behavior patterns.

443
00:23:47,740 --> 00:23:53,510
The channel updates Falcon sensor receives to model that behavior, those updates

444
00:23:53,550 --> 00:23:59,639
appear to include some amount of pseudocode that is executed by the driver.

445
00:24:00,309 --> 00:24:03,970
And it is that injected code from the channel—or lack

446
00:24:03,970 --> 00:24:07,010
thereof, actually—that seems to have caused the issue.

447
00:24:07,730 --> 00:24:10,330
According to people who have looked at the channel

448
00:24:10,330 --> 00:24:14,279
file in question, it is entirely filled with zeros.

449
00:24:18,150 --> 00:24:21,670
[laugh] Now, you would hope that the driver would look

450
00:24:21,670 --> 00:24:25,909
at a file full of zeros and just ignore it, like, “Nope.

451
00:24:26,510 --> 00:24:30,700
That’s invalid.” Falcon sensor chose a slightly different route and crashed.
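
A sketch of the kind of defensive check being wished for here, assuming a made-up file layout. The `MAGIC` header, the entry count field, and `ENTRY_SIZE` are all hypothetical; the real channel file format is not public and is not represented by this example.

```python
import struct

MAGIC = b"CHNL"  # hypothetical 4-byte header, not the real format
ENTRY_SIZE = 8   # hypothetical fixed-size entries


def parse_channel_file(data: bytes):
    """Return a list of entries, or None if the file should be ignored."""
    if len(data) < 8 or not any(data):
        return None  # too short, or nothing but zero bytes
    if data[:4] != MAGIC:
        return None  # not a file we recognize
    (count,) = struct.unpack_from("<I", data, 4)
    body = data[8:]
    if count == 0 or count * ENTRY_SIZE != len(body):
        return None  # declared entry count does not match the bytes present
    return [body[i * ENTRY_SIZE:(i + 1) * ENTRY_SIZE] for i in range(count)]


all_zeros = bytes(4096)
good = MAGIC + struct.pack("<I", 2) + b"A" * 8 + b"B" * 8

ignored = parse_channel_file(all_zeros)
entries = parse_channel_file(good)
```

The point of the sketch is the order of operations: every length and count is checked before any entry is read, so a malformed file is rejected instead of being indexed into.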

452
00:24:31,500 --> 00:24:31,800
Right.

453
00:24:32,259 --> 00:24:35,180
So, what we have here is a driver that is legitimate

454
00:24:35,670 --> 00:24:39,169
and was tested and proven resilient, which is good.

455
00:24:39,870 --> 00:24:40,149
Yeah.

456
00:24:40,300 --> 00:24:43,300
And we have updates that come down the wire multiple times

457
00:24:43,300 --> 00:24:47,090
a day and interact directly with that driver, that were not.

458
00:24:47,660 --> 00:24:48,250
Precisely.

459
00:24:48,940 --> 00:24:53,010
And it would appear that of the many tests that were run against

460
00:24:53,010 --> 00:24:56,510
that driver, none of the tests were, “Here’s a file full of zeros.

461
00:24:56,630 --> 00:25:00,520
What do you do?” Because no one thought that was a thing that would ever occur.

462
00:25:01,070 --> 00:25:01,639
But it did.

463
00:25:02,460 --> 00:25:08,940
There is a popular breakdown of the Falcon sensor crash dump by Twitter person

464
00:25:09,240 --> 00:25:15,260
Perpetualmaniac, which I won’t be linking because after assessing that it was

465
00:25:15,280 --> 00:25:20,420
a lack of null pointer checking in the dump, he then went on to make weird

466
00:25:20,429 --> 00:25:24,080
disparaging comments about the Rust community and blamed the whole thing on DEI.

467
00:25:25,740 --> 00:25:29,760
It got strange and kind of fasci, so fuck that guy.

468
00:25:29,990 --> 00:25:30,250
Fair.

469
00:25:30,420 --> 00:25:33,900
Instead, I’ll include a link to a different Twitter thread, by

470
00:25:33,900 --> 00:25:37,989
someone who actually debugs stuff like this for a living, and he

471
00:25:37,990 --> 00:25:42,959
basically said that Perpetualmaniac was wrong and thinks that it is

472
00:25:43,570 --> 00:25:47,300
uninitialized data being read from a table that caused the crash.

473
00:25:48,059 --> 00:25:51,180
Now, considering that the input file was entirely filled

474
00:25:51,180 --> 00:25:54,950
with nothing, uninitialized sounds like an understatement.

475
00:25:55,970 --> 00:26:00,170
Unfortunately, we won’t know for sure unless CrowdStrike shares

476
00:26:00,170 --> 00:26:03,480
their source code for their driver, which seems unlikely.

477
00:26:03,990 --> 00:26:06,239
Maybe they should, but I don’t think they will.

478
00:26:07,080 --> 00:26:11,360
The point is that the channel update caused Falcon sensor to attempt to access a

479
00:26:11,360 --> 00:26:15,560
memory location that didn’t exist or wasn’t initialized, and the driver crashed,

480
00:26:15,719 --> 00:26:19,989
forcing the system to halt in order to prevent possible data corruption.

481
00:26:20,780 --> 00:26:21,800
So, that’s where we’re at.

482
00:26:22,770 --> 00:26:24,070
Now, it’s time to point fingers.

483
00:26:24,590 --> 00:26:25,010
Cool.

484
00:26:25,340 --> 00:26:26,540
[It’s] Everybody’s favorite part.

485
00:26:26,950 --> 00:26:30,780
Predictably, in a fuckup of this magnitude, the blame

486
00:26:30,780 --> 00:26:33,949
game and armchair quarterbacking is in full effect.

487
00:26:34,530 --> 00:26:37,090
Thought leaders are tripping over themselves on

488
00:26:37,349 --> 00:26:39,680
LinkedIn to have an opinion about the whole mess.

489
00:26:40,080 --> 00:26:43,760
And I’ve seen posts ranging from ‘this is all CrowdStrike’s fault.

490
00:26:43,930 --> 00:26:47,650
How did this update ever get out the door?’ ‘This is all Microsoft’s fault.

491
00:26:47,860 --> 00:26:50,750
How could they let third parties run in kernel mode?’ ‘This is the

492
00:26:50,750 --> 00:26:55,530
customers’ fault for not having phased rollouts.’ Et cetera, et cetera.

493
00:26:56,280 --> 00:26:59,799
And then there’s all the conspiracy theories about how this was a state actor,

494
00:26:59,799 --> 00:27:05,200
or a planned thing, or, I don’t know, CrowdStrike did it on purpose, for reasons?

495
00:27:05,910 --> 00:27:06,140
Anyway.

496
00:27:06,990 --> 00:27:08,040
Solar flares?

497
00:27:08,660 --> 00:27:09,460
Oh, I like that one.

498
00:27:09,840 --> 00:27:11,030
That’s what made it all zeros.

499
00:27:11,770 --> 00:27:14,439
There’s plenty of blame to go around, and none of it is

500
00:27:14,440 --> 00:27:17,850
actually helpful while the fire is burning, but now that

501
00:27:17,850 --> 00:27:21,640
we’re over a week out, maybe we can take a more nuanced look.

502
00:27:22,110 --> 00:27:22,480
Or not.

503
00:27:23,640 --> 00:27:27,020
So, how did this update actually leave CrowdStrike’s front door?

504
00:27:27,760 --> 00:27:28,630
That’s a great question.

505
00:27:29,349 --> 00:27:33,370
The truth is, we will not know until CrowdStrike tells us or

506
00:27:33,380 --> 00:27:37,330
a lawsuit forces legal discovery, and we find out that way.

507
00:27:38,140 --> 00:27:39,560
The former could come any day.

508
00:27:39,600 --> 00:27:43,160
I’ve checked their [unintelligible] blog posts several times as I was

509
00:27:43,160 --> 00:27:47,680
writing this piece, and so far, they haven’t said, but maybe they will.

510
00:27:48,230 --> 00:27:49,979
Uh, actually, so they did—

511
00:27:50,780 --> 00:27:51,010
Ooh.

512
00:27:51,130 --> 00:27:52,520
—at about three o’clock this morning.

513
00:27:54,230 --> 00:27:54,870
[laugh] Of course they did.

514
00:27:54,870 --> 00:27:56,910
They released an official—well, an official

515
00:27:56,920 --> 00:27:59,740
unofficial preliminary post-incident review.

516
00:28:00,130 --> 00:28:00,540
Okay.

517
00:28:00,540 --> 00:28:01,370
It’s a good name.

518
00:28:01,840 --> 00:28:04,240
And basically what they’re saying is, it went through automated

519
00:28:04,240 --> 00:28:07,540
testing, but the automated content validator had a bug in it.

520
00:28:08,099 --> 00:28:12,729
So, they passed it, quote-unquote “passed,” but it was an invalid file.

521
00:28:13,239 --> 00:28:13,249
Ah.

522
00:28:13,389 --> 00:28:17,540
“Once the file went out, it was immediately picked up, read by Falcon

523
00:28:17,540 --> 00:28:22,250
sensor, and it caused an out-of-bounds memory read, triggering an exception.

524
00:28:23,150 --> 00:28:25,520
This unexpected exception could not be gracefully handled,

525
00:28:25,520 --> 00:28:29,509
resulting in a Windows operating system crash (BSOD).” Unquote.

526
00:28:30,930 --> 00:28:34,379
So, it seems like their testing harness or whatever they’re using

527
00:28:34,540 --> 00:28:37,450
also doesn’t know what to do with a file that’s all zeros.

528
00:28:37,870 --> 00:28:38,480
Well, yeah.

529
00:28:38,530 --> 00:28:39,909
There’s a lot of problems here.

530
00:28:40,030 --> 00:28:43,520
First of all, clearly they did not test the tester enough.

531
00:28:44,410 --> 00:28:44,720
Yeah.

532
00:28:44,770 --> 00:28:46,579
Because if you have a bug in a testing system

533
00:28:46,580 --> 00:28:48,770
in an automated deployment, that is a problem.

534
00:28:48,890 --> 00:28:50,159
That is a huge problem.

535
00:28:50,900 --> 00:28:55,720
And the fact that simply loading the file caused the blue screen pretty

536
00:28:55,720 --> 00:28:59,490
quickly makes it sound like they don’t actually push these updates to

537
00:28:59,490 --> 00:29:04,970
test machines that then run the update to see if the system crashes.

538
00:29:05,309 --> 00:29:08,060
They’re using some other testing process.

539
00:29:08,510 --> 00:29:11,770
Right, which they do not go into any detail about, unsurprisingly.

540
00:29:12,410 --> 00:29:16,600
So, I am sure that the lawsuits are forthcoming, and maybe we’ll

541
00:29:16,600 --> 00:29:20,549
find out more when legal discovery happens, if it gets that far, but

542
00:29:21,009 --> 00:29:25,199
the truth is, CrowdStrike pushes these channel updates frequently.

543
00:29:25,199 --> 00:29:28,460
Like you said, Chris, they push these more than once a day.

544
00:29:28,850 --> 00:29:33,029
And they have automated testing in place, but they’re trying to stay

545
00:29:33,030 --> 00:29:37,029
one step ahead of the bad guys, which means time is of the essence.

546
00:29:37,400 --> 00:29:39,700
This specific update was meant to address

547
00:29:39,700 --> 00:29:43,480
something, a new vulnerability found in named pipes.

548
00:29:44,070 --> 00:29:46,670
They wanted to get that update out before any

549
00:29:46,670 --> 00:29:50,260
attacker figured out how to abuse this vulnerability.

550
00:29:51,160 --> 00:29:53,350
So, maybe what they’re doing is sacrificing

551
00:29:53,350 --> 00:29:56,480
quality or testing in favor of speed.

552
00:29:57,200 --> 00:30:00,520
This is a systematic failure, and it’s not the fault of one person.

553
00:30:00,640 --> 00:30:03,969
Yes, maybe someone screwed up and accidentally saved the file

554
00:30:03,990 --> 00:30:06,600
empty, but something else in the chain should have caught that.

555
00:30:07,310 --> 00:30:07,700
Right.

556
00:30:08,210 --> 00:30:11,060
If a single person can unwittingly push an update that

557
00:30:11,060 --> 00:30:13,430
takes down eight-and-a-half million Windows clients,

558
00:30:14,099 --> 00:30:16,619
that’s an organizational and systematic problem.

559
00:30:17,449 --> 00:30:19,850
There’s also some indications that this isn’t

560
00:30:19,860 --> 00:30:22,870
the first time such a transgression has occurred.

561
00:30:23,390 --> 00:30:28,000
It appears that Red Hat Enterprise Linux, Debian, and

562
00:30:28,000 --> 00:30:31,489
Rocky Linux have all encountered similar crashing problems

563
00:30:31,559 --> 00:30:34,959
earlier this year after a channel update was pushed.

564
00:30:35,730 --> 00:30:40,600
I think it was April and May were the two months where the issues were found.

565
00:30:40,820 --> 00:30:43,980
The issue with Debian in particular was traced to a specific

566
00:30:43,990 --> 00:30:47,760
version of the kernel that wasn’t included in CrowdStrike’s

567
00:30:47,889 --> 00:30:51,990
testing matrix, but was on their list of supported kernel versions.

568
00:30:52,129 --> 00:30:56,680
macOS seems to have weathered the storm, for reasons that we will get to.

569
00:30:57,340 --> 00:30:58,689
Yeah, and I mean, that’s an important point.

570
00:30:58,690 --> 00:31:01,940
And, you know, a lot of times people will say, this only

571
00:31:01,940 --> 00:31:04,090
happens to Windows, and that’s absolutely not the case.

572
00:31:04,290 --> 00:31:07,669
Anytime something runs unfettered in Ring 0 of any operating

573
00:31:07,670 --> 00:31:11,250
system of any kind, you run the risk of causing an immediate crash.

574
00:31:12,109 --> 00:31:15,800
It’s just not usually so public-facing because you don’t tend to

575
00:31:15,800 --> 00:31:20,970
have Linux running your displays that’s also running CrowdStrike.

576
00:31:21,280 --> 00:31:21,540
Right.

577
00:31:21,550 --> 00:31:23,540
For whatever reason, we like Windows for that.

578
00:31:24,130 --> 00:31:24,479
I don’t know.

579
00:31:24,770 --> 00:31:27,409
We’ll get to that, too. [laugh] So, what about Microsoft?

580
00:31:27,830 --> 00:31:29,999
Shouldn’t they prevent this kind of thing from happening?

581
00:31:30,430 --> 00:31:32,690
In an ideal world, they could.

582
00:31:32,970 --> 00:31:35,139
And we’ll get into the technical solutions in a

583
00:31:35,140 --> 00:31:38,680
moment, but this is largely not Microsoft’s fault.

584
00:31:39,440 --> 00:31:46,440
Yes, Windows has its flaws—many, many, many flaws—and Microsoft

585
00:31:46,449 --> 00:31:49,750
hasn’t always produced the stablest or most secure software.

586
00:31:50,290 --> 00:31:52,980
No one could call them blameless with a straight face.

587
00:31:53,240 --> 00:31:56,760
But in this specific instance, the system is working

588
00:31:56,760 --> 00:31:59,730
as designed, even if the design kind of sucks.

589
00:32:00,490 --> 00:32:04,319
Should we be shaming all these organizations who let the update

590
00:32:04,340 --> 00:32:07,750
barrel through their environment like salmonella on a cruise ship?

591
00:32:08,330 --> 00:32:09,170
That’s an image for you.

592
00:32:10,000 --> 00:32:14,280
Think about the counterexample for a second. Let’s say that a zero-day

593
00:32:14,300 --> 00:32:18,430
attack was discovered using this named pipes thing, and it was

594
00:32:18,430 --> 00:32:22,419
leveraged by a hacking group to infect a major airline with ransomware,

595
00:32:22,840 --> 00:32:27,010
and later it came out that CrowdStrike would have protected them

596
00:32:27,240 --> 00:32:30,180
if they had been running the newest version of the channel updates.

597
00:32:30,700 --> 00:32:34,590
Stupid CISO decided to stay at n minus one for updates.

598
00:32:35,580 --> 00:32:39,819
Do you think the defense of not running the latest channel updates as

599
00:32:39,820 --> 00:32:43,920
a resiliency strategy would appease litigators and the public at large?

600
00:32:44,790 --> 00:32:45,950
I’m going to go with unlikely.

601
00:32:46,590 --> 00:32:49,320
[laugh] So, I mean, another point that’s important to note

602
00:32:49,320 --> 00:32:51,730
here is that the kind of patch that came out—or the channel

603
00:32:51,730 --> 00:32:55,620
update—would not have been stopped by an n minus one effect.

604
00:32:56,190 --> 00:32:59,360
N minus one would stop the driver update.

605
00:32:59,879 --> 00:33:03,639
Remember, that’s the part that was signed by Microsoft, and is noted as good.

606
00:33:03,880 --> 00:33:06,270
The actual kernel—or the actual channel update itself

607
00:33:06,490 --> 00:33:08,639
happens automatically, and you can’t do anything about it.

608
00:33:09,490 --> 00:33:12,670
There is… I was reading through some Reddit posts, and some

609
00:33:12,679 --> 00:33:16,620
people did say that there is a way to run a little behind

610
00:33:17,650 --> 00:33:21,550
the channel updates, so to postpone them by certain periods.

611
00:33:21,790 --> 00:33:24,389
There is a way to run, kind of like, n minus one for the

612
00:33:24,390 --> 00:33:28,149
channel updates, but there’s an inherent risk in doing that.

613
00:33:28,500 --> 00:33:29,760
Yeah, like you said, it’s certainly not the

614
00:33:29,770 --> 00:33:32,150
sort of thing that a CISO is going to encourage.

615
00:33:32,849 --> 00:33:33,209
Right.

616
00:33:33,259 --> 00:33:37,100
And there’s also a regulatory hurdle with that, too, because there may be

617
00:33:37,130 --> 00:33:41,370
compliance and regulations that say you have to be running the latest version.

618
00:33:41,890 --> 00:33:46,790
So really, it’s just a rational decision based on balancing priorities and

619
00:33:46,790 --> 00:33:51,390
political realities, and trying to protect your customers as best you can.

620
00:33:52,299 --> 00:33:54,540
So, the blame ultimately should reside on

621
00:33:54,540 --> 00:33:55,735
CrowdStrike for putting out a floud update—floud?

622
00:33:55,735 --> 00:33:56,040
Flawed.

623
00:33:58,150 --> 00:33:58,370
Floud.

624
00:33:58,450 --> 00:33:59,210
Words.

625
00:33:59,890 --> 00:34:00,400
I love them.

626
00:34:00,840 --> 00:34:02,030
I like floud, actually.

627
00:34:02,849 --> 00:34:04,580
It’s like loud, but with an F.

628
00:34:04,840 --> 00:34:05,470
It’s floud.

629
00:34:05,470 --> 00:34:07,250
It’s like a flan that has opinions.

630
00:34:07,710 --> 00:34:08,559
An opinionated flan.

631
00:34:09,540 --> 00:34:09,989
I like it.

632
00:34:10,779 --> 00:34:14,659
Let’s talk about solutions. [laugh] The reason macOS hasn’t encountered

633
00:34:14,659 --> 00:34:20,100
a similar fate as the Linux and Windows installations is that Apple

634
00:34:20,630 --> 00:34:25,160
doesn’t let CrowdStrike, or really anything else, run in kernel mode.

635
00:34:25,599 --> 00:34:29,219
Starting in macOS 10.15—I didn’t look at the codename,

636
00:34:29,600 --> 00:34:34,639
so please forgive me—Apple offered System Extensions.

637
00:34:35,260 --> 00:34:38,190
These allow an application to stay in user mode while

638
00:34:38,190 --> 00:34:41,819
requesting special access to hardware managed by the kernel.

639
00:34:42,370 --> 00:34:47,379
At the same time, Apple phased out Kernel Extensions—often

640
00:34:47,529 --> 00:34:50,429
shortened to kext, or [pronounced] kext, I guess—

641
00:34:50,560 --> 00:34:52,020
Yeah, it’s pronounced, unfortunately.

642
00:34:52,840 --> 00:34:56,020
[sigh] They phased those out starting in macOS 11.

643
00:34:56,599 --> 00:35:00,150
So basically, CrowdStrike doesn’t run in kernel mode on

644
00:35:00,160 --> 00:35:04,390
macOS, and thusly, it cannot crash macOS the same way.

645
00:35:05,480 --> 00:35:08,135
I don’t know no Mac, so I don’t know about any of that. [laugh]

646
00:35:08,400 --> 00:35:09,100
No, it’s true.

647
00:35:09,100 --> 00:35:11,950
And for a while, it was extremely annoying because a lot of programs

648
00:35:11,950 --> 00:35:15,319
relied on kexts for similar reasons: to have instant access.

649
00:35:15,320 --> 00:35:18,370
Like, a good example is if you have an external audio

650
00:35:18,370 --> 00:35:23,190
device and you want that to work as fast—as efficiently as

651
00:35:23,190 --> 00:35:25,569
possible, you would want it to work and run in kernel mode.

652
00:35:25,840 --> 00:35:26,150
Right.

653
00:35:26,590 --> 00:35:29,120
So, there are actually ways to get around the

654
00:35:29,120 --> 00:35:32,270
security that you just talked about in macOS.

655
00:35:32,760 --> 00:35:35,190
I don’t recommend it, but it is doable.

656
00:35:36,020 --> 00:35:38,489
And the whole point here is that you have this little secret

657
00:35:38,500 --> 00:35:41,400
enclave, effectively, where things run in this sort of in-between

658
00:35:41,400 --> 00:35:44,600
mode—sandbox, if you will—which we’re going to go into in a second.

659
00:35:45,160 --> 00:35:48,340
But if it crashes there, it doesn’t take down the operating system.

660
00:35:48,940 --> 00:35:49,240
Right.

661
00:35:49,820 --> 00:35:53,680
And Linux actually has a similar option with eBPF,

662
00:35:53,830 --> 00:35:57,340
which I struggle to say because it’s awkward.

663
00:35:57,639 --> 00:35:59,699
And apparently, it’s no longer an acronym.

664
00:36:00,469 --> 00:36:01,660
It’s just its own thing.

665
00:36:02,220 --> 00:36:03,950
So… that’s weird.

666
00:36:04,500 --> 00:36:08,370
eBPF lets applications load into a sandboxed

667
00:36:08,760 --> 00:36:10,930
secure kernel execution environment.

668
00:36:11,320 --> 00:36:15,490
So, once again, gives them kernel-level access to resources, while applying

669
00:36:15,580 --> 00:36:19,480
stringent safety checks to make sure the application doesn’t crash the system.

670
00:36:20,210 --> 00:36:25,200
CrowdStrike now offers running Falcon in user mode on Linux—what

671
00:36:25,200 --> 00:36:29,629
they call user mode—which actually uses eBPF under the covers.

672
00:36:30,440 --> 00:36:33,500
If you were running in that mode, those previous crashes

673
00:36:33,780 --> 00:36:37,470
that happened with Red Hat, and Debian, and—what was

674
00:36:37,490 --> 00:36:40,750
it?—Rocky Linux, you would not have been affected by those.

675
00:36:41,460 --> 00:36:43,420
I mean, CrowdStrike would—Falcon still would have

676
00:36:43,420 --> 00:36:44,950
crashed, but it wouldn’t have crashed your system.

677
00:36:45,880 --> 00:36:46,659
Which is better.

678
00:36:47,020 --> 00:36:47,630
I think so.

679
00:36:48,330 --> 00:36:52,850
Windows has some similar functionality available.

680
00:36:53,180 --> 00:36:56,600
There’s the Windows Filtering Platform, Windows Defender

681
00:36:56,620 --> 00:37:00,705
Application Control, and Windows Defender Device Guard, all of

682
00:37:00,720 --> 00:37:05,159
which have APIs, but none of them have the same mechanisms present

683
00:37:05,940 --> 00:37:10,680
that, like, System Extensions for macOS or eBPF for Linux have.

684
00:37:11,020 --> 00:37:15,220
So, they provide an API that applications could be rewritten to take

685
00:37:15,220 --> 00:37:20,630
advantage of and get, you know, almost kernel levels of access and speed,

686
00:37:21,370 --> 00:37:26,720
but they’re not the same as this, sort of, sandbox, secured enclave.

687
00:37:27,610 --> 00:37:31,330
There is a project to port eBPF over to Windows, for what it’s worth.

688
00:37:31,830 --> 00:37:35,940
I don’t know if that will be the ultimate solution, but this catastrophic

689
00:37:35,940 --> 00:37:39,970
calamity should at least prompt Microsoft to try something similar.

690
00:37:40,790 --> 00:37:43,350
I have heard some folks—and we could call this a technical

691
00:37:43,350 --> 00:37:45,220
solution—I’ve heard some folks say that you just shouldn’t

692
00:37:45,230 --> 00:37:47,270
be running Windows in most of these environments.

693
00:37:47,900 --> 00:37:49,920
Like… you’re not wrong.

694
00:37:50,700 --> 00:37:55,250
If I could wave a magic wand and turn back time, if I could find a way, Chris—

695
00:37:56,440 --> 00:37:56,799
Stop it.

696
00:37:57,099 --> 00:37:59,400
—I would take back all the Windows that hurt

697
00:37:59,400 --> 00:38:02,450
you and replace them with Linux variants.

698
00:38:03,280 --> 00:38:04,270
Okay, it doesn’t rhyme.

699
00:38:04,920 --> 00:38:06,019
[laugh] It’s the best I could do.

700
00:38:06,940 --> 00:38:09,600
If you’re out there, and you’re building a net-new system,

701
00:38:10,040 --> 00:38:13,910
that’s, like, an end-user terminal, an IoT device, or even a

702
00:38:13,910 --> 00:38:18,270
server running in the cloud, I think anything but Windows is your

703
00:38:18,280 --> 00:38:22,020
best bet, and it would probably be malpractice to do otherwise.

704
00:38:22,830 --> 00:38:25,500
But like it or not, Windows remains the most popular desktop

705
00:38:25,520 --> 00:38:28,830
operating system, and that doesn’t appear to be changing anytime soon.

706
00:38:29,500 --> 00:38:33,499
We need a short-term plan to make things better—through some sort

707
00:38:33,500 --> 00:38:38,080
of update—and a long-term plan to ditch Windows for most use cases.

708
00:38:38,799 --> 00:38:39,279
Thoughts?

709
00:38:40,080 --> 00:38:45,230
So, a lot of in—I’ll put, ‘in my opinion’ around all of this—

710
00:38:45,650 --> 00:38:45,890
Right.

711
00:38:46,220 --> 00:38:51,730
A lot of this comes down to the never-ending battle between speed and

712
00:38:51,730 --> 00:38:58,020
security, and making assumptions that things are just going to work.

713
00:38:59,320 --> 00:39:03,620
After all, like we said, they’ve done multiple channel updates

714
00:39:03,620 --> 00:39:07,319
a day for years and years and years and years and years, and

715
00:39:07,320 --> 00:39:10,159
while they’ve had a few issues in the past, it’s not very many.

716
00:39:11,130 --> 00:39:14,010
This is the sort of thing that leads developers—and, you know,

717
00:39:14,020 --> 00:39:17,839
engineering teams—to have a false sense of security, and a false sense

718
00:39:17,840 --> 00:39:20,110
that everything they do is golden, and they will never have a problem.

719
00:39:20,980 --> 00:39:25,300
Therefore, checks get skipped, checks get removed from the process—because

720
00:39:25,300 --> 00:39:29,350
after all, they’re just slowing us down—and that’s a huge issue.

721
00:39:30,050 --> 00:39:34,630
The other issue is, when you push everything out all at once, the problem

722
00:39:34,779 --> 00:39:38,560
can occur—like it did this time—that everything will crash all at once.

723
00:39:40,060 --> 00:39:44,250
There needs to be some type of a fuzzed deployment.

724
00:39:44,599 --> 00:39:46,839
So, let’s just say these things get released

725
00:39:46,840 --> 00:39:48,779
on a schedule, I don’t know, every four hours.

726
00:39:49,759 --> 00:39:51,740
You get a customer that’s got a hundred servers.

727
00:39:52,410 --> 00:39:57,250
Those servers should get that update five minutes apart in, like, groups of 30.

728
00:39:58,140 --> 00:40:01,120
That way, if there is a catastrophic failure, it

729
00:40:01,120 --> 00:40:03,940
only takes down a percentage of your platform.

730
00:40:04,460 --> 00:40:06,580
Now, that’ll happen for every single customer on Earth, and

731
00:40:06,580 --> 00:40:10,369
that’s not great, but the assumption is, and should be, that

732
00:40:10,369 --> 00:40:13,560
there is high availability built into this, so if half your

733
00:40:13,560 --> 00:40:16,620
systems go down, theoretically, the other half can carry the load.
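
The staggered rollout described here can be sketched as follows. This is a minimal illustration, not CrowdStrike's deployment mechanism: `staged_rollout`, `apply_update`, and `wait` are hypothetical names, and the batch size of 30 is taken from the example in the conversation.

```python
from typing import Callable, List


def staged_rollout(hosts: List[str], batch_size: int,
                   apply_update: Callable[[str], bool],
                   wait: Callable[[], None] = lambda: None) -> List[str]:
    """Update hosts batch by batch; stop at the first batch with a failure.

    Returns the hosts that were attempted.
    """
    attempted: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        results = [apply_update(host) for host in batch]
        attempted.extend(batch)
        if not all(results):
            break  # halt: remaining hosts keep the old, working version
        wait()  # for example, sleep five minutes before the next batch
    return attempted


hosts = [f"server-{n:03d}" for n in range(100)]


def bad_update(host: str) -> bool:
    return False  # hypothetical update that crashes every host it touches


touched = staged_rollout(hosts, batch_size=30, apply_update=bad_update)
```

With a hundred servers and a bad update, only the first batch of 30 is hit and the other 70 keep running, which is the "only takes down a percentage of your platform" outcome described above.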

734
00:40:17,260 --> 00:40:17,570
Mmm.

735
00:40:18,470 --> 00:40:21,759
Yeah, and that’s something that CrowdStrike could change today.

736
00:40:22,010 --> 00:40:23,470
That’s within their realm of control.

737
00:40:23,470 --> 00:40:26,850
Yeah, and I suspect that they will. [laugh] Because the only

738
00:40:26,850 --> 00:40:29,540
other option—if this is the situation—is people are going to end

739
00:40:29,540 --> 00:40:32,470
up with half of their environment running one antivirus solution,

740
00:40:32,480 --> 00:40:34,370
and the other half of their environment running another one.

741
00:40:35,020 --> 00:40:35,850
That seems worse.

742
00:40:35,930 --> 00:40:39,070
It’s just as insane as running—insane and difficult

743
00:40:39,080 --> 00:40:40,960
to manage as running in a multi-cloud environment.

744
00:40:41,240 --> 00:40:42,669
Or a super cloud, as some might say.

745
00:40:43,449 --> 00:40:43,469
Ugh.

746
00:40:43,780 --> 00:40:44,280
I hate you.

747
00:40:44,900 --> 00:40:47,099
Yeah, these are all technical solutions.

748
00:40:47,590 --> 00:40:50,700
I don’t know if there’s any policy solutions, but my biggest

749
00:40:50,700 --> 00:40:54,750
concern coming out of all of this is that regulators and

750
00:40:54,750 --> 00:40:58,579
litigators are going to get into a hubbub and pass some poorly

751
00:40:58,590 --> 00:41:02,970
thought-out legislation that makes things effectively worse.

752
00:41:03,660 --> 00:41:05,739
I can’t quite figure out how they would make

753
00:41:05,740 --> 00:41:08,170
things worse, but I am excited to see them try.

754
00:41:08,180 --> 00:41:11,990
[laugh]. They’re nothing if not creative.

755
00:41:12,780 --> 00:41:14,230
Well, hey, thanks for listening or something.

756
00:41:14,230 --> 00:41:16,130
I guess you found it worthwhile enough if you made it all

757
00:41:16,130 --> 00:41:18,280
the way to the end, so congratulations to you, friend.

758
00:41:18,480 --> 00:41:19,730
You accomplished something today.

759
00:41:19,740 --> 00:41:23,229
Now, you can go sit on the couch, update your CrowdStrike channel

760
00:41:23,230 --> 00:41:26,990
file, and watch everything crash in beautiful synchronicity.

761
00:41:27,140 --> 00:41:27,820
You’ve earned it.

762
00:41:28,500 --> 00:41:30,840
You can find more about this show by visiting our LinkedIn page,

763
00:41:30,840 --> 00:41:34,240
just search ‘Chaos Lever,’ or go to our website, chaoslever.com.

764
00:41:34,520 --> 00:41:36,460
You’ll find show notes, blog posts, and general tomfoolery.

765
00:41:37,280 --> 00:41:39,620
And if we got something wrong, or you have strong opinions

766
00:41:39,620 --> 00:41:42,200
about what CrowdStrike should have done, leave us a comment.

767
00:41:42,379 --> 00:41:43,290
Leave us a voicemail.

768
00:41:43,410 --> 00:41:45,400
We might even listen to it.

769
00:41:45,400 --> 00:41:47,920
We’ll be back next week to see what fresh hell is upon us.

770
00:41:48,240 --> 00:41:48,950
Ta-ta for now.

771
00:41:57,200 --> 00:41:57,790
What a mess.

772
00:41:58,440 --> 00:41:58,770
Mmm.

773
00:41:59,400 --> 00:42:00,359
A glorious mess.