Transcript
1
00:00:00,430 --> 00:00:01,390
Ready to party.
2
00:00:02,090 --> 00:00:03,360
By party, do you mean nap?
3
00:00:04,019 --> 00:00:05,610
Uh, yeah, we could have a nap party.
4
00:00:05,720 --> 00:00:07,450
I’m not even mad about that.
5
00:00:07,480 --> 00:00:08,680
That sounds delightful.
6
00:00:10,650 --> 00:00:10,800
[snoring]
7
00:00:21,779 --> 00:00:25,040
Hello, alleged human, and welcome to the Chaos Lever podcast.
8
00:00:25,040 --> 00:00:27,680
My name is Ned, and I’m definitely not a robot.
9
00:00:27,710 --> 00:00:31,430
I am a real human person who does not make random
10
00:00:31,469 --> 00:00:34,929
updates to your computer and then crash entire airports.
11
00:00:35,619 --> 00:00:37,780
That would be wild, and I am definitely not
12
00:00:37,780 --> 00:00:40,400
behind that update through my Skynet uplink.
13
00:00:40,570 --> 00:00:43,379
With me is Chris, who was also here.
14
00:00:44,250 --> 00:00:45,510
Let’s talk about not Skylink.
15
00:00:46,230 --> 00:00:47,629
[crosstalk] Whatever [laugh]
16
00:00:48,099 --> 00:00:48,659
Wow.
17
00:00:49,389 --> 00:00:52,520
Yeah, I don’t know if you go into this or not, but you know, there are already
18
00:00:53,469 --> 00:00:57,730
conspiracy theories out there that this was an actual attack of some kind.
19
00:00:58,190 --> 00:00:59,790
Or a test for an actual attack.
20
00:01:00,310 --> 00:01:02,729
Yeah, I have seen that on the Reddit
21
00:01:02,769 --> 00:01:04,989
threads, [background noise] and other places.
22
00:01:05,069 --> 00:01:05,920
Ohh, that was loud.
23
00:01:06,340 --> 00:01:07,940
That was way louder than I thought it was going to be.
24
00:01:08,039 --> 00:01:08,430
Sorry [laugh]
25
00:01:09,160 --> 00:01:14,420
Who likes fizzy water? [laugh] Yeah, on the Reddit forums, and in
26
00:01:14,420 --> 00:01:19,060
other places, I’ve seen the conspiracy theories starting to propagate.
27
00:01:19,800 --> 00:01:22,870
I feel like those start out as jokes a lot of the time.
28
00:01:22,870 --> 00:01:25,300
Like, “Wouldn’t it be funny if this was like a state actor,”
29
00:01:25,330 --> 00:01:28,409
or something, but like, when you actually drill down into
30
00:01:28,410 --> 00:01:31,940
it at all, you realize that this is just incompetence.
31
00:01:33,930 --> 00:01:39,660
Don’t attribute to malfeasance what is more likely just gross incompetence.
32
00:01:40,460 --> 00:01:41,939
There’s a pithier way of saying that.
33
00:01:42,469 --> 00:01:44,289
Well, maybe now we should just talk about what we’re actually
34
00:01:44,289 --> 00:01:46,140
talking about because we’re already talking about it.
35
00:01:46,260 --> 00:01:47,240
Oh, CrowdStruck.
36
00:01:48,270 --> 00:01:49,000
UnderStrike?
37
00:01:49,040 --> 00:01:49,210
Done.
38
00:01:49,210 --> 00:01:49,240
[laugh]
39
00:01:51,690 --> 00:01:55,759
Those of us who have been around the tech industry for a while,
40
00:01:55,760 --> 00:01:59,559
and have peeked behind the mysterious curtain to see what actually
41
00:01:59,559 --> 00:02:04,010
supports this endeavor that we call modern information technology—
42
00:02:04,279 --> 00:02:04,899
It’s terrifying.
43
00:02:05,500 --> 00:02:07,040
It’s three monkeys in a trench coat.
44
00:02:07,040 --> 00:02:07,089
Barely.
45
00:02:10,440 --> 00:02:15,209
I feel like you quickly become aware of how fragile this entire construction
46
00:02:15,210 --> 00:02:20,969
is, and just how many redundancies and safeguards have to be put in place
47
00:02:21,309 --> 00:02:25,990
to prevent the entire edifice from crumbling into the proverbial sea.
48
00:02:26,430 --> 00:02:28,489
Yeah, and just to put a pin on that, in terms of, not
49
00:02:28,490 --> 00:02:30,540
only is the technology fragile, so are the people.
50
00:02:31,030 --> 00:02:35,230
I saw a joke on LinkedIn today about power-washing the back
51
00:02:35,230 --> 00:02:39,389
of your servers to let the packets go faster, and I guarantee
52
00:02:39,389 --> 00:02:41,649
there’s somebody out there going, “I haven’t done that.
53
00:02:42,260 --> 00:02:44,140
I should do that.”
54
00:02:44,140 --> 00:02:48,250
[laugh] If nothing else, it’ll clean the air filters, so that’s probably good.
55
00:02:49,050 --> 00:02:50,709
It’ll make everything a lot quieter.
56
00:02:52,610 --> 00:02:53,430
[laugh] I suppose it will.
57
00:02:53,630 --> 00:02:55,410
Oh, silence is golden.
58
00:02:55,960 --> 00:02:57,609
The packets go faster in silence.
59
00:02:58,270 --> 00:03:01,480
To quote the second-greatest sci-fi movie of all time, Men
60
00:03:01,480 --> 00:03:05,620
in Black, “There’s always an Arquillian Battle Cruiser,
61
00:03:05,620 --> 00:03:09,190
or a Korilian Death Ray, or an intergalactic plague that
62
00:03:09,190 --> 00:03:12,119
is about to wipe out all life on this miserable planet.
63
00:03:12,410 --> 00:03:14,430
The only way these people can get on with their
64
00:03:14,430 --> 00:03:17,769
happy lives is that they do not know about it!”
65
00:03:18,320 --> 00:03:19,110
I love that quote.
66
00:03:19,400 --> 00:03:22,170
Yeah, so just kind of apply that to technology instead
67
00:03:22,170 --> 00:03:25,720
of aliens, and it’s pretty much the same thing.
68
00:03:26,370 --> 00:03:30,099
The CrowdStrike debacle may not have been a Korilian death
69
00:03:30,099 --> 00:03:35,829
ray, but for 8.5 million Windows devices, it basically was.
70
00:03:36,820 --> 00:03:40,900
Everything, everywhere, is breaking, all at once, and it is
71
00:03:40,930 --> 00:03:45,030
only through the heroic efforts of thousands of ops people
72
00:03:45,040 --> 00:03:49,310
diligently doing their jobs that the public is unaware.
73
00:03:50,130 --> 00:03:54,359
Of course, the public does occasionally become very aware,
74
00:03:54,850 --> 00:03:58,110
and then senators have to hold hearings to grandstand
75
00:03:58,120 --> 00:04:01,070
about things they do not even slightly understand.
76
00:04:01,770 --> 00:04:06,960
They’ll hold some CEO’s feet to the fire for an hour, make self-serving
77
00:04:06,960 --> 00:04:11,169
proclamations and possibly even attempt to levy a fine or two.
78
00:04:11,690 --> 00:04:14,670
Good luck with that now that Chevron Deference is dead.
79
00:04:15,010 --> 00:04:16,670
But hey, we’re not a Supreme Court podcast.
80
00:04:17,430 --> 00:04:18,880
Go listen to 5-4 for that.
81
00:04:20,029 --> 00:04:21,320
Solid plug for 5-4.
82
00:04:21,940 --> 00:04:22,799
Definitely [crosstalk] time.
83
00:04:23,630 --> 00:04:28,380
After all the hubbub dies down, honestly, one or two C-level executives
84
00:04:28,380 --> 00:04:32,169
will probably fall on their swords to appease the investor public.
85
00:04:33,080 --> 00:04:34,409
I wouldn’t feel too sorry for them.
86
00:04:35,290 --> 00:04:39,219
It is a metaphorical sword after all, and it comes with a guaranteed
87
00:04:39,270 --> 00:04:43,970
payout of several millions of dollars, and a cushy job as a lobbyist
88
00:04:43,970 --> 00:04:49,059
or CEO of some other poor unsuspecting private equity firm-acquired
89
00:04:49,080 --> 00:04:54,570
disaster where they can oversee another unavoidable catastrophe.
90
00:04:55,290 --> 00:04:56,659
It’s the circle of life, Chris.
91
00:04:57,349 --> 00:04:58,150
I’m not going to sing it.
92
00:04:58,150 --> 00:04:59,230
I don’t want to get sued again.
93
00:04:59,360 --> 00:05:02,984
I—no, you’re singing it in your head though, and I can see it [laugh]
94
00:05:03,889 --> 00:05:03,919
[laugh]
95
00:05:05,139 --> 00:05:08,830
Oh, so rather than talking about CrowdStrike for the next 30
96
00:05:08,830 --> 00:05:11,690
minutes, I think we should all just go watch The Lion King—
97
00:05:11,700 --> 00:05:12,300
The original.
98
00:05:12,320 --> 00:05:15,539
Which is the best sci-fi movie of all time [laugh]
99
00:05:15,649 --> 00:05:17,840
I’m not really sure where to go with that [laugh]
100
00:05:18,539 --> 00:05:18,969
[laugh] I don’t either.
101
00:05:20,049 --> 00:05:22,309
I’m curious to hear the comments that we get in.
102
00:05:22,320 --> 00:05:28,190
I did recently watch Dune: Part Two, which was excellent.
103
00:05:28,560 --> 00:05:29,560
Took you long enough.
104
00:05:30,090 --> 00:05:30,590
Listen.
105
00:05:30,910 --> 00:05:33,550
Some of us responsible citizens saw it in the theater.
106
00:05:34,380 --> 00:05:35,539
I have one word for you.
107
00:05:36,040 --> 00:05:36,999
That word is children.
108
00:05:37,469 --> 00:05:37,949
Anyway.
109
00:05:38,139 --> 00:05:39,260
You don’t think they would like it?
110
00:05:39,700 --> 00:05:44,830
I think nightmare fuel would probably be the closest, yeah,
111
00:05:44,840 --> 00:05:49,000
Feyd-Rautha—Routha—however the hell you say his name—yeah,
112
00:05:49,000 --> 00:05:51,500
those scenes in particular, God, that dude is creepy.
113
00:05:51,940 --> 00:05:55,510
Yeah, he really inhabited the creepy level of the character.
114
00:05:56,150 --> 00:05:59,719
Like, Jared Leto levels of creepy.
115
00:06:00,020 --> 00:06:01,320
No, but except good.
116
00:06:01,500 --> 00:06:06,100
Yes. [laugh] Yeah, because he’s playing a character, not himself.
117
00:06:06,910 --> 00:06:07,140
Oh.
118
00:06:07,750 --> 00:06:08,410
Anyway.
119
00:06:09,179 --> 00:06:09,269
CrowdStrike.
120
00:06:10,230 --> 00:06:11,239
What the hell happened?
121
00:06:12,040 --> 00:06:19,000
On Friday, July 19th, 2024, at 5:24 UTC—that’s 1 a.m.
122
00:06:19,010 --> 00:06:23,110
for our East Coast peeps, and the day before for California
123
00:06:23,330 --> 00:06:26,480
because you’re a bunch of weirdos—security vendor CrowdStrike
124
00:06:26,520 --> 00:06:29,479
released an update for their Falcon sensor platform.
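Editor's sketch: the time-zone aside is easy to check. This assumes the 5:24 UTC figure quoted in the episode and fixed daylight-time offsets (UTC-4 for the East Coast, UTC-7 for California), which held in July 2024; the names below are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed daylight-time offsets, valid for US clocks in July 2024.
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Release time as stated in the episode.
release_utc = datetime(2024, 7, 19, 5, 24, tzinfo=timezone.utc)


def local_times(moment):
    """Return the same instant expressed in Eastern and Pacific daylight time."""
    return moment.astimezone(EDT), moment.astimezone(PDT)


east, west = local_times(release_utc)
print(east.strftime("%Y-%m-%d %H:%M %Z"))  # 2024-07-19 01:24 EDT
print(west.strftime("%Y-%m-%d %H:%M %Z"))  # 2024-07-18 22:24 PDT
```

So "1 a.m." is a round-down of 1:24, and California really was still on Thursday evening.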
125
00:06:30,120 --> 00:06:34,219
Falcon is an endpoint detection and response solution meant to protect
126
00:06:34,219 --> 00:06:38,210
systems against viruses, malware, and advanced persistent threats.
127
00:06:38,900 --> 00:06:44,230
The update type was a content update, or what CrowdStrike calls a channel file,
128
00:06:44,559 --> 00:06:50,399
which you can think of as, like, the virus definition, except as a modern EDR,
129
00:06:51,049 --> 00:06:54,560
it’s a bit more complicated than that, and we’ll get to why that’s important
130
00:06:54,740 --> 00:06:59,040
when we get to the root-cause analysis, or what we know so far.
131
00:06:59,860 --> 00:07:03,039
Once the channel file was loaded by the Falcon sensor
132
00:07:03,040 --> 00:07:07,080
platform, it caused a memory access fault at the kernel
133
00:07:07,080 --> 00:07:11,420
level that forced a system crash on all Windows clients.
134
00:07:11,959 --> 00:07:14,750
The old Blue Screen of Death popped up, and then the
135
00:07:14,750 --> 00:07:18,760
system either rebooted or sat at that screen for a while.
136
00:07:19,310 --> 00:07:20,440
Possibly forever.
137
00:07:21,070 --> 00:07:22,600
Yeah, until somebody touched it.
138
00:07:22,990 --> 00:07:23,729
Pretty much.
139
00:07:24,740 --> 00:07:27,919
So, if you happened to walk into a major airport around that time,
140
00:07:28,270 --> 00:07:31,640
you might have been greeted by giant display signs that just had
141
00:07:31,640 --> 00:07:37,220
the sad frowny face on them, because now the blue screen has an emoji.
142
00:07:38,180 --> 00:07:40,289
And it was kind of funny, actually.
143
00:07:40,639 --> 00:07:43,830
I mean, funny for the people, you know, seeing the screens; not funny
144
00:07:43,830 --> 00:07:46,090
for everybody who had to deal with the disaster, [unintelligible]
145
00:07:46,330 --> 00:07:46,650
Right.
146
00:07:46,690 --> 00:07:52,520
And were sitting in airports for three days while waiting to, you know, go home.
147
00:07:52,960 --> 00:07:53,590
Yeah.
148
00:07:53,970 --> 00:07:58,520
Depending on which airline you were working with, you
149
00:07:58,520 --> 00:08:02,050
may have been not impacted at all, impacted slightly, or
150
00:08:02,050 --> 00:08:04,280
still sitting in the airport listening to this right now.
151
00:08:04,949 --> 00:08:05,779
I’m so sorry.
152
00:08:06,240 --> 00:08:08,410
Maybe don’t fly Delta [laugh] next time.
153
00:08:08,990 --> 00:08:10,000
Actually, I don’t know if it was Delta.
154
00:08:10,010 --> 00:08:10,880
It might have been United.
155
00:08:10,960 --> 00:08:11,760
They’re all terrible.
156
00:08:11,950 --> 00:08:12,700
It doesn’t matter.
157
00:08:13,530 --> 00:08:16,620
But one of the few that wasn’t affected was Southwest.
158
00:08:17,010 --> 00:08:18,389
Is that because they’re running Linux?
159
00:08:18,679 --> 00:08:19,429
Allegedly.
160
00:08:19,870 --> 00:08:23,610
Again, this is unproven internet theory, but allegedly it’s because
161
00:08:23,629 --> 00:08:26,410
their systems were so old that CrowdStrike wouldn’t run on them.
162
00:08:27,670 --> 00:08:33,710
[laugh] I feel like we did cover Southwest in a Chaos Lever, or possibly its
163
00:08:33,719 --> 00:08:38,689
precursor, when we talked about old, out-of-date systems that are super fragile.
164
00:08:38,889 --> 00:08:39,990
Am I remembering correctly?
165
00:08:40,340 --> 00:08:44,730
I mean, I had that theory, or that thought as well, but I’m also now like,
166
00:08:45,180 --> 00:08:49,199
did they just post that, and it became a memory, or is it a real memory?
167
00:08:49,530 --> 00:08:51,450
[laugh] It’s hard to say.
168
00:08:52,080 --> 00:08:57,359
I will say that it was in fact Delta—and is Delta—that’s having
169
00:08:57,370 --> 00:09:01,720
the biggest struggle because they use BitLocker extensively.
170
00:09:02,150 --> 00:09:02,340
Right.
171
00:09:02,380 --> 00:09:03,550
I assume you’re going to get into that.
172
00:09:03,810 --> 00:09:04,410
Oh, yes.
173
00:09:04,560 --> 00:09:04,890
Okay.
174
00:09:05,200 --> 00:09:06,010
I don’t want to interrupt.
175
00:09:06,310 --> 00:09:06,710
Carry on.
176
00:09:06,940 --> 00:09:07,780
So, we had all these crashes—
177
00:09:07,780 --> 00:09:08,479
Whenever you’re ready.
178
00:09:08,960 --> 00:09:10,330
And you know, when your system—
179
00:09:10,330 --> 00:09:12,000
Just go with [crosstalk]
180
00:09:12,000 --> 00:09:12,130
—
[unintelligible]
181
00:09:12,130 --> 00:09:12,363
—
—whenever [unintelligible]
182
00:09:12,363 --> 00:09:12,536
—
[unintelligible]
183
00:09:12,710 --> 00:09:13,170
—
At anytime—
184
00:09:13,170 --> 00:09:13,464
[unintelligible]
185
00:09:13,759 --> 00:09:15,729
—
When you could—why would—who—
186
00:09:15,740 --> 00:09:18,099
[laugh] We have all these crashed systems,
187
00:09:18,230 --> 00:09:19,459
and what do you do with a crashed system?
188
00:09:19,460 --> 00:09:20,150
You restart it.
189
00:09:20,840 --> 00:09:23,240
But unfortunately, attempts to restart the afflicted
190
00:09:23,240 --> 00:09:26,160
systems just resulted in another blue screen of death.
191
00:09:26,599 --> 00:09:30,959
Because Falcon sensor is loaded as a driver during system
192
00:09:30,969 --> 00:09:35,790
boot, and it has been marked as boot required, meaning
193
00:09:35,790 --> 00:09:38,780
it must be loaded for the system to boot properly.
194
00:09:39,580 --> 00:09:42,880
As soon as Falcon started, it would load all of its channel
195
00:09:42,880 --> 00:09:45,630
files and, predictably, the system would crash again.
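Editor's sketch: the boot loop described here can be modeled in a few lines. This is a toy model, not CrowdStrike's or Windows' actual logic; the file names and the `boot` function are hypothetical.

```python
BAD_FILE = "channel-bad"  # hypothetical name for the faulty content update


def boot(channel_files, safe_mode=False):
    """Model one boot attempt and return 'running' or 'crash'."""
    if safe_mode:
        # Safe mode loads only a minimal driver set, so the sensor never starts.
        return "running"
    # The sensor is boot-required, so it always loads and reads every channel file.
    for name in channel_files:
        if name == BAD_FILE:
            return "crash"  # a kernel-level fault takes the whole system down
    return "running"


disk = ["channel-ok", BAD_FILE]
print([boot(disk) for _ in range(3)])  # ['crash', 'crash', 'crash']
print(boot(disk, safe_mode=True))      # running
disk.remove(BAD_FILE)                  # the manual fix
print(boot(disk))                      # running
```

Nothing on the disk changes between normal boots, so every restart ends the same way until the bad file is gone.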
196
00:09:46,309 --> 00:09:49,630
This rendered all affected systems completely
197
00:09:49,670 --> 00:09:53,429
unusable and inaccessible through in-band management.
198
00:09:53,440 --> 00:09:56,069
So, you can’t RDP into this thing and fix it.
199
00:09:56,830 --> 00:10:01,240
So, this makes sense from an EDR perspective, right?
200
00:10:01,240 --> 00:10:01,355
Yes.
201
00:10:01,355 --> 00:10:02,699
You want to protect your computer.
202
00:10:03,580 --> 00:10:07,469
No matter what tool you have, it’s going to have this boot requirement
203
00:10:08,290 --> 00:10:11,460
because you don’t want your system booting without endpoint protection.
204
00:10:11,639 --> 00:10:11,939
Right.
205
00:10:11,939 --> 00:10:14,210
Because endpoint protection, ostensibly, is good.
206
00:10:15,090 --> 00:10:15,750
Ostensibly.
207
00:10:16,700 --> 00:10:20,450
The problem, obviously, comes in where your endpoint management
208
00:10:20,450 --> 00:10:23,510
is now, effectively, malware that’s crashing your system.
209
00:10:23,880 --> 00:10:24,230
Right.
210
00:10:24,550 --> 00:10:26,190
That would be what we call ‘the downside.’
211
00:10:26,190 --> 00:10:27,210
[laugh] Yes.
212
00:10:27,639 --> 00:10:29,769
And we will definitely get into that as well.
213
00:10:30,290 --> 00:10:34,700
Microsoft has published a blog post where they claim, according to their
214
00:10:34,700 --> 00:10:39,109
telemetry, about 8.5 million Windows devices were impacted by this.
215
00:10:39,450 --> 00:10:43,210
Now, that’s only about 1 or 2% of all Windows devices out
216
00:10:43,210 --> 00:10:47,530
there, so this is not, as a percentage, a ton of devices.
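Editor's sketch: taking the two figures quoted in the episode at face value (8.5 million devices, and a share of about 1 to 2%), the implied number of Windows devices works out as follows. The function name is illustrative, and the percentages are the hosts' estimate rather than a verified statistic.

```python
def implied_install_base(impacted, percent):
    """Total devices implied if `impacted` devices are `percent` percent of all."""
    return impacted * 100 // percent


impacted = 8_500_000
for percent in (1, 2):
    print(f"{percent}% -> {implied_install_base(impacted, percent):,} devices")
# 1% -> 850,000,000 devices
# 2% -> 425,000,000 devices
```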
217
00:10:47,539 --> 00:10:53,010
However… it’s still a lot of devices, [laugh] and the impact was pretty severe.
218
00:10:53,020 --> 00:10:56,489
As we discussed, airlines had to suspend or cancel
219
00:10:56,500 --> 00:11:00,120
flights, retail stores suddenly couldn’t accept payment.
220
00:11:00,690 --> 00:11:03,630
Medical devices and hospitals crashed in the middle of
221
00:11:03,630 --> 00:11:08,180
surgeries, bowling alleys had to hand out paper and pencils to
222
00:11:08,180 --> 00:11:12,210
individuals, who just looked at them like, what the hell is this?
223
00:11:12,400 --> 00:11:14,599
How do I track ten frames by hand?
224
00:11:14,900 --> 00:11:16,520
How does a turkey even work?
225
00:11:17,150 --> 00:11:19,539
[sigh] Dark times for all of us, Chris.
226
00:11:19,730 --> 00:11:23,569
That’s the kind of math podcast that needs to come out because I guarantee
227
00:11:23,570 --> 00:11:26,079
there’s no one left on earth who knows how to score bowling by hand.
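Editor's sketch, for anyone handed the paper and pencil: standard ten-pin scoring fits in a short function. This assumes a complete, valid game given as a flat list of pins per roll, and it covers ten-pin rules only (duckpin allows three balls per frame and scores differently).

```python
def score_game(rolls):
    """Score a complete ten-pin game from a flat list of pins knocked down per roll."""
    total = 0
    i = 0
    for _frame in range(10):
        if rolls[i] == 10:                   # strike: 10 plus the next two rolls
            total += 10 + rolls[i + 1] + rolls[i + 2]
            i += 1
        elif rolls[i] + rolls[i + 1] == 10:  # spare: 10 plus the next roll
            total += 10 + rolls[i + 2]
            i += 2
        else:                                # open frame: just the pins
            total += rolls[i] + rolls[i + 1]
            i += 2
    return total


print(score_game([10] * 12))               # 300, a perfect game
print(score_game([10, 10, 10] + [0] * 14))  # 60, a turkey followed by gutter balls
```

The turkey is worth 30 + 20 + 10 because each strike also counts the two rolls after it.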
228
00:11:28,140 --> 00:11:28,829
[laugh] True story.
229
00:11:29,390 --> 00:11:33,240
I was up in Cape Cod, and we went duckpin bowling—which is a real thing.
230
00:11:33,309 --> 00:11:33,840
Look it up—
231
00:11:33,929 --> 00:11:34,630
Oh, it’s so fun.
232
00:11:34,730 --> 00:11:35,450
It’s super fun.
233
00:11:35,469 --> 00:11:36,280
Definitely look it up.
234
00:11:36,639 --> 00:11:40,860
Super fun, but the bowling alley was so old that
235
00:11:40,860 --> 00:11:43,279
they did not have a computerized scoring system.
236
00:11:43,810 --> 00:11:44,160
Wow.
237
00:11:44,630 --> 00:11:44,950
Yeah.
238
00:11:45,000 --> 00:11:46,630
They gave me a piece of paper and pencil, and
239
00:11:46,630 --> 00:11:49,980
I was like, “Uh, score is not important, right?
240
00:11:50,950 --> 00:11:58,140
We’re just here to have fun.” Oh… now to get these systems back to a working
241
00:11:58,140 --> 00:12:03,580
state, the offending channel files had to be removed before Falcon was loaded.
242
00:12:04,130 --> 00:12:07,550
There’s a few options to do this, and none of them are great or easy.
243
00:12:08,429 --> 00:12:13,110
You can boot the system into Windows safe mode, which only loads the
244
00:12:13,219 --> 00:12:17,580
absolute bare minimum of Windows drivers, and then remove the files.
245
00:12:18,440 --> 00:12:22,710
For virtual systems, you could mount the system disk on another system and
246
00:12:22,710 --> 00:12:27,390
remove the files, and then reattach the drive to the original system, or
247
00:12:27,420 --> 00:12:31,959
if you had snapshots or a backup, you could roll back to a prior snapshot.
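Editor's sketch: the "remove the files" step, once the disk is reachable from safe mode or mounted on another system, amounts to deleting files that match a pattern. The directory and file pattern below are not stated in the episode; they are assumptions taken from CrowdStrike's public remediation guidance. The function name is hypothetical, and it defaults to a dry run.

```python
from pathlib import Path

# Assumptions, not from the episode: location and pattern from public guidance.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
DEFAULT_PATTERN = "C-00000291*.sys"


def remove_channel_files(directory=DEFAULT_DIR, pattern=DEFAULT_PATTERN, dry_run=True):
    """Return the matching channel files, deleting them unless dry_run is set."""
    matches = sorted(Path(directory).glob(pattern))
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Calling `remove_channel_files()` only lists what would be removed; passing `dry_run=False` performs the deletion. For a mounted disk, `directory` would point at the same folder under the mounted drive letter.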
248
00:12:32,770 --> 00:12:38,540
Fortunately, CrowdStrike did pull the offending file from the update servers,
249
00:12:38,970 --> 00:12:42,850
so you wouldn’t then immediately redownload it and be back where you were.
250
00:12:43,550 --> 00:12:48,339
While it is a huge pain to fix all of these virtual systems, the real pain
251
00:12:48,349 --> 00:12:52,840
is those physical systems that don’t have an out-of-band management option.
252
00:12:53,539 --> 00:12:56,790
Someone will need to physically sit at the terminal, invoke
253
00:12:56,790 --> 00:13:00,880
safe mode, and perform the remediation steps, or use a separate
254
00:13:00,890 --> 00:13:04,260
boot device like a thumb drive to perform the maintenance.
255
00:13:04,360 --> 00:13:06,519
This is very bad.
256
00:13:07,349 --> 00:13:10,389
You forgot about the other way to fix the system, which apparently
257
00:13:10,410 --> 00:13:14,010
did work on some—at least a number of people’s, which is just
258
00:13:14,010 --> 00:13:17,959
keep rebooting it until CrowdStrike Falcon updated, and deleted
259
00:13:17,960 --> 00:13:20,300
the file on its own before it crashed because of the file.
260
00:13:20,890 --> 00:13:25,360
[laugh] I guess if it does load and the network stack loads
261
00:13:25,360 --> 00:13:30,210
in time for it to pull the update and replace it, maybe?
262
00:13:30,730 --> 00:13:31,200
Maybe.
263
00:13:31,650 --> 00:13:33,550
Between 15 and 20 reboots.
264
00:13:33,650 --> 00:13:35,410
Sometimes people were getting it to work.
265
00:13:35,680 --> 00:13:36,290
Wow.
266
00:13:37,110 --> 00:13:37,920
That’s awful.
267
00:13:38,030 --> 00:13:38,909
But okay.
268
00:13:38,970 --> 00:13:40,030
So, another option.
269
00:13:40,990 --> 00:13:45,970
Microsoft has published a USB tool to assist with the
270
00:13:45,970 --> 00:13:49,830
removal of this file, so you have that option as well.
271
00:13:51,330 --> 00:13:55,480
As I mentioned, the BitLocker thing does throw a bit of a wrench
272
00:13:55,650 --> 00:14:00,360
in the whole plan because in order to access a BitLocker-protected
273
00:14:00,360 --> 00:14:05,770
system drive out-of-band, you have to supply a BitLocker unlock key—
274
00:14:06,139 --> 00:14:06,349
Yeah.
275
00:14:07,080 --> 00:14:09,280
And that can be hard to get.
276
00:14:10,080 --> 00:14:13,240
Well, it’s not like people want their end-users to have that.
277
00:14:13,370 --> 00:14:13,720
Again—
278
00:14:13,730 --> 00:14:13,880
Yes.
279
00:14:13,880 --> 00:14:15,380
—this is a security concern.
280
00:14:15,929 --> 00:14:20,290
Also, the BitLocker key is 48 characters long, so not only finding it
281
00:14:20,290 --> 00:14:24,839
but typing it in before BitLocker times out… which it does, apparently.
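Editor's sketch: part of why typing the key is so error-prone is its shape. As generally documented, a BitLocker recovery password is 48 digits in eight groups of six, and each group is a multiple of 11 below 720,896. The check below assumes that format; it validates the shape only, cannot unlock anything, and the function name is hypothetical.

```python
import re

GROUP_COUNT = 8
GROUP_LIMIT = 720_896  # 11 * 65_536: each group encodes a 16-bit value times 11


def looks_like_recovery_password(text):
    """Check the shape of a 48-digit recovery password written as 8 dash-separated groups."""
    groups = text.strip().split("-")
    if len(groups) != GROUP_COUNT:
        return False
    for group in groups:
        if not re.fullmatch(r"[0-9]{6}", group):
            return False
        value = int(group)
        if value % 11 != 0 or value >= GROUP_LIMIT:
            return False
    return True


# Synthetic example built from multiples of 11; not a real key.
sample = "-".join(f"{11 * n:06d}" for n in range(1, 9))
print(looks_like_recovery_password(sample))              # True
print(looks_like_recovery_password(sample[:-1] + "9"))   # False: one mistyped digit
```

The divisibility rule means a single mistyped digit in a group is caught right away, which helps a little when the key is being read over the phone.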
282
00:14:26,230 --> 00:14:27,230
So, a bit of a nightmare.
283
00:14:27,790 --> 00:14:28,710
Not a great situation.
284
00:14:29,030 --> 00:14:29,329
No.
285
00:14:30,300 --> 00:14:33,360
And so, that’s part of the reason Delta is still struggling.
286
00:14:34,179 --> 00:14:39,099
I would love to say that, as of right now, we know exactly what caused the
287
00:14:39,099 --> 00:14:44,660
error, but honestly portions of the supply chain are still pretty murky.
288
00:14:45,440 --> 00:14:51,650
Instead, I will try to explain how a simple update for an EDR caused millions
289
00:14:51,650 --> 00:14:56,090
of Windows machines to blue screen, and we can also have fun pointing all the
290
00:14:56,090 --> 00:15:01,050
fingers that we have at all the other parties because we’ve got jazz hands.
291
00:15:01,050 --> 00:15:02,200
[whispering] It’s all your fault.
292
00:15:03,030 --> 00:15:04,340
That works better with a—
293
00:15:04,660 --> 00:15:05,430
Visual medium?
294
00:15:05,760 --> 00:15:09,490
Yeah. [laugh] So, to start with, we have to consider
295
00:15:09,490 --> 00:15:11,870
what the Falcon sensor’s actually trying to do.
296
00:15:12,690 --> 00:15:16,930
Falcon sensor, as I mentioned, is an EDR product, and it’s meant to
297
00:15:16,930 --> 00:15:21,209
scan all activity on the host operating system looking for threats.
298
00:15:21,820 --> 00:15:24,900
Most applications aren’t granted that level of access
299
00:15:24,900 --> 00:15:27,849
to other applications or to the system as a whole.
300
00:15:28,400 --> 00:15:32,099
As you mentioned, Chris, it needs to be in a privileged position.
301
00:15:32,679 --> 00:15:37,150
But that’s the point: you’re trying to prevent other pieces of software from
302
00:15:37,150 --> 00:15:41,320
getting themselves into privileged positions to compromise your computer.
303
00:15:42,430 --> 00:15:45,219
To understand what it means to be in that privileged position,
304
00:15:45,220 --> 00:15:48,350
I’m going to briefly talk about user space and kernel space.
305
00:15:48,620 --> 00:15:51,790
Please feel free to interrupt me when I get something wrong, which I will.
306
00:15:52,510 --> 00:15:53,386
[whispering] Yes, thank you.
307
00:15:53,609 --> 00:15:55,920
Your operating system, whether it’s Windows,
308
00:15:55,980 --> 00:15:59,910
Linux, macOS, I don’t know, Solaris—
309
00:16:00,680 --> 00:16:01,280
AIX?
310
00:16:01,760 --> 00:16:06,299
—sure—it is responsible for managing the hardware on your system.
311
00:16:06,410 --> 00:16:10,200
That includes stuff like memory management, writing data to disk,
312
00:16:10,510 --> 00:16:14,460
sensing input from peripherals, and scheduling threads on the CPU.
313
00:16:15,230 --> 00:16:17,510
This all happens in what is called kernel
314
00:16:17,520 --> 00:16:20,530
space, and it’s considered highly privileged.
315
00:16:20,860 --> 00:16:24,870
If something goes wrong in kernel space, the system may have to halt
316
00:16:25,230 --> 00:16:29,579
or crash to prevent damage to the hardware, or corruption of data.
317
00:16:30,490 --> 00:16:34,710
Ideally, as little as possible should be running in kernel space.
318
00:16:35,500 --> 00:16:38,530
Instead, most applications run in user space
319
00:16:38,639 --> 00:16:41,310
which does not have direct access to the hardware.
320
00:16:42,170 --> 00:16:45,940
Applications running in user space interact with the operating system,
321
00:16:45,990 --> 00:16:50,190
and make requests based on that operating system’s published APIs.
322
00:16:50,530 --> 00:16:52,650
Do you want to write a file to disk?
323
00:16:52,950 --> 00:16:55,620
You make an API call and pass the correct information.
324
00:16:56,130 --> 00:16:57,450
Need to access memory?
325
00:16:57,880 --> 00:17:01,220
Make an API call and specify the address and range.
326
00:17:01,960 --> 00:17:04,530
The operating system will evaluate that request,
327
00:17:05,010 --> 00:17:08,839
make sure it’s valid and allowed before executing it.
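Editor's sketch: from user space, "make an API call and let the operating system decide" looks roughly like this. Python's `os.open` and `os.write` are thin wrappers over the operating system's file calls; the helper name `try_write` is made up for illustration.

```python
import os
import tempfile


def try_write(path, data):
    """Ask the OS to create and write a file; report a refusal instead of crashing."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except OSError as err:
        # The OS evaluated the request and rejected it.
        return f"refused: {err.strerror}"
    try:
        written = os.write(fd, data)
    finally:
        os.close(fd)
    return f"wrote {written} bytes"


with tempfile.TemporaryDirectory() as tmp:
    print(try_write(os.path.join(tmp, "ok.txt"), b"hello"))            # wrote 5 bytes
    print(try_write(os.path.join(tmp, "missing", "x.txt"), b"hello"))  # refused: ...
```

The second request targets a directory that does not exist, so the operating system returns an error to the program and nothing else on the machine is affected.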
328
00:17:09,510 --> 00:17:12,810
This means when an application runs into issues or it crashes, the
329
00:17:12,849 --> 00:17:17,329
operating system is able to handle that crash gracefully—most of
330
00:17:17,329 --> 00:17:21,339
the time—and keep other processes and the system as a whole running.
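Editor's sketch: the user-space half of this can be observed directly. The child process below deliberately reads from address 0, which the operating system does not permit; the child dies (or errors out, depending on the platform) while the parent carries on. The same class of fault inside a kernel-mode driver has nothing above it to absorb the failure. The names here are illustrative.

```python
import subprocess
import sys

# Child program: attempt to read memory at address 0, an invalid access.
CRASHING_CHILD = "import ctypes; ctypes.string_at(0)"


def run_child(code):
    """Run Python code in a separate process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


status = run_child(CRASHING_CHILD)
print("child exit status:", status)  # nonzero; the exact value varies by OS
print("parent is still running")
```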
331
00:17:22,440 --> 00:17:22,720
Right.
332
00:17:23,069 --> 00:17:25,129
And there’s one point to note here.
333
00:17:25,129 --> 00:17:27,843
So, first of all, some of the terminology, they call it the
334
00:17:27,920 --> 00:17:32,270
kernel; also, they call it Ring 0, meaning it is the lowest
335
00:17:32,280 --> 00:17:35,000
possible level of the system, and it has access to everything
336
00:17:35,000 --> 00:17:37,910
else that is going on in the system without restriction.
337
00:17:38,679 --> 00:17:42,690
Necessary to make sure, for things like EDR tools, that it can
338
00:17:42,790 --> 00:17:46,770
scan not only all of the files, but all of the activity, all of the
339
00:17:46,770 --> 00:17:49,669
network, all of the I/O, the disk, et cetera, et cetera, et cetera.
340
00:17:50,270 --> 00:17:50,490
Right.
341
00:17:50,870 --> 00:17:56,040
One thing people always get upset about is, why does Windows crash so easily?
342
00:17:56,420 --> 00:17:56,780
And—
343
00:17:58,030 --> 00:17:58,060
[laugh]
344
00:17:58,350 --> 00:18:01,250
While there is an argument to be made that it is fragile and poorly
345
00:18:01,250 --> 00:18:04,100
designed and should have a better way of handling things like EDR
346
00:18:04,140 --> 00:18:07,730
that needs this access—which is true, and I assume you’ll get to that—
347
00:18:07,990 --> 00:18:08,310
Yes.
348
00:18:08,310 --> 00:18:12,320
The other thing is, again, remember, completely unfettered access.
349
00:18:12,349 --> 00:18:16,239
If something goes wrong at the kernel level, we
350
00:18:16,240 --> 00:18:20,700
get our old friend, unanticipated consequences.
351
00:18:21,600 --> 00:18:22,840
And this is extremely bad.
352
00:18:22,850 --> 00:18:27,709
So, for example, let’s say you have a system that is running a database.
353
00:18:28,200 --> 00:18:30,120
Databases, as you know, are kind of important.
354
00:18:31,389 --> 00:18:35,919
A kernel-level job is trying to write a new file, or
355
00:18:35,920 --> 00:18:38,760
a new table, or a new row, or record, or whatever, but
356
00:18:38,760 --> 00:18:42,179
it runs into an error with, say, memory misallocation.
357
00:18:43,290 --> 00:18:44,980
What is it going to write to the database?
358
00:18:45,610 --> 00:18:47,669
It could be writing absolute nonsense.
359
00:18:47,710 --> 00:18:49,949
It could completely corrupt the database.
360
00:18:49,969 --> 00:18:53,639
Therefore, the kernel crashes preemptively whenever it detects a
361
00:18:53,639 --> 00:18:59,280
failure because the consequences of trying to soldier on might be worse.
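Editor's sketch: the "fail rather than write nonsense" idea can be shown with a user-space analogy. This is not how the Windows kernel decides to bug check; it only illustrates the trade-off, using a checksum as a stand-in for whatever integrity signal the system has. All names are hypothetical.

```python
import zlib


class IntegrityFault(Exception):
    """Raised when a buffer no longer matches the checksum taken earlier."""


def commit(table, payload, expected_crc):
    """Append payload to table only if it is intact; otherwise halt loudly."""
    if zlib.crc32(payload) != expected_crc:
        raise IntegrityFault("payload corrupted before write; refusing to continue")
    table.append(payload)


table = []
row = b"id=1,name=ned"
crc = zlib.crc32(row)
commit(table, row, crc)                      # intact, so it is written
try:
    commit(table, b"id=1,name=n\x00d", crc)  # simulated memory corruption
except IntegrityFault as fault:
    print("halted:", fault)
print(len(table))                            # 1: the corrupt row never landed
```

Stopping is disruptive, but the table still holds only data that was known to be good.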
362
00:18:59,970 --> 00:19:00,420
Right.
363
00:19:00,940 --> 00:19:04,810
It’s that, “Out of an abundance of caution, I’m going to fail.”
364
00:19:05,179 --> 00:19:05,469
Right.
365
00:19:05,889 --> 00:19:08,170
Which is the same thing I did in high school.
366
00:19:09,280 --> 00:19:12,560
[laugh] Yes… it was better if you didn’t succeed, Chris.
367
00:19:12,560 --> 00:19:12,590
[laugh]
368
00:19:13,890 --> 00:19:17,070
So, Windows applications have been able to request
369
00:19:17,080 --> 00:19:19,760
access to run in kernel mode for a long time.
370
00:19:20,309 --> 00:19:24,420
Generally, that’s a bad idea, for the reasons you just articulated.
371
00:19:25,150 --> 00:19:27,389
But Microsoft wasn’t super strict about it.
372
00:19:27,960 --> 00:19:30,600
Microsoft is nothing if not accommodating
373
00:19:30,610 --> 00:19:32,959
to developers and their terrible ideas.
374
00:19:33,860 --> 00:19:36,030
Some applications actually do need to run in
375
00:19:36,030 --> 00:19:38,879
kernel mode, in particular, antivirus software.
376
00:19:39,450 --> 00:19:42,869
Applications running in user mode are not generally allowed to
377
00:19:42,870 --> 00:19:46,529
access the memory and monitor the behavior of other applications.
378
00:19:46,949 --> 00:19:50,220
Microsoft Teams can’t just decide to read the memory space of
379
00:19:50,220 --> 00:19:54,620
Slack or kill the Zoom processes, as much as it might want to.
380
00:19:54,620 --> 00:19:56,010
I was going to say, it totally would.
381
00:19:56,270 --> 00:19:59,939
[laugh] The operating system just doesn’t allow that type of nonsense.
382
00:20:00,509 --> 00:20:03,739
But, you know, an antivirus application needs a privileged
383
00:20:03,740 --> 00:20:06,700
level of access and monitoring to defeat the bad guys.
384
00:20:07,100 --> 00:20:10,350
So, antivirus companies like Symantec wrote
385
00:20:10,380 --> 00:20:12,640
their application to run in kernel space.
386
00:20:13,500 --> 00:20:18,460
Now, Microsoft actually tried to push back on the rampant abuse of kernel
387
00:20:18,460 --> 00:20:23,949
mode by antivirus outfits—and others—when Windows Vista was being rolled out.
388
00:20:24,440 --> 00:20:24,800
Yeah.
389
00:20:24,920 --> 00:20:26,020
What, what, what?
390
00:20:26,230 --> 00:20:29,930
Vista, for you youngsters in the crowd,
391
00:20:30,310 --> 00:20:33,220
Vista was the Windows 8 of the early aughts.
392
00:20:34,230 --> 00:20:36,230
Hopefully that puts some perspective on things.
393
00:20:37,179 --> 00:20:42,590
While Vista was a disaster as an operating system release, they did add a whole
394
00:20:42,590 --> 00:20:48,410
bunch of additional functionality and features that brought the client OSes more
395
00:20:48,410 --> 00:20:52,810
in line with what the server OSes were doing, and added a bunch of security.
396
00:20:53,040 --> 00:20:57,280
And one of the things they really tried to do was lock down kernel mode access.
397
00:20:58,150 --> 00:21:02,159
Unfortunately, antivirus companies didn’t like that, and they threw a hissy
398
00:21:02,160 --> 00:21:06,550
fit, claiming that since Windows Defender could run in kernel mode, and their
399
00:21:06,550 --> 00:21:12,870
stuff couldn’t, Microsoft was abusing their influence, a la Internet Explorer.
400
00:21:13,530 --> 00:21:17,420
And Microsoft, still reeling from their decade-long battle
401
00:21:17,420 --> 00:21:22,040
with the FTC over antitrust, kowtowed to the AV club, and
402
00:21:22,040 --> 00:21:25,040
allowed them to keep their precious kernel mode access.
403
00:21:25,710 --> 00:21:29,170
It’s not an unreasonable request because all the
404
00:21:29,170 --> 00:21:31,929
other players wanted was an even playing field.
405
00:21:32,120 --> 00:21:32,419
Right.
406
00:21:32,460 --> 00:21:35,360
The fact that the even playing field was a wide-open
407
00:21:35,469 --> 00:21:38,660
security nightmare is still a Microsoft problem.
408
00:21:39,710 --> 00:21:40,190
[laugh] . Right.
409
00:21:40,750 --> 00:21:44,390
Microsoft did add an interesting requirement, though, if you
410
00:21:44,390 --> 00:21:47,550
wanted to play in kernel space, and that was driver signing.
411
00:21:48,889 --> 00:21:51,370
Antivirus applications would present themselves
412
00:21:51,410 --> 00:21:55,050
as device drivers to get to run in kernel mode.
413
00:21:55,480 --> 00:21:59,320
A device driver to nothing, but a device driver nonetheless.
414
00:21:59,960 --> 00:22:05,620
Microsoft created the Windows Hardware Quality Labs Testing Certification—aka
415
00:22:05,750 --> 00:22:14,240
WHQL—and once a driver had gone through that lab and gotten its certification,
416
00:22:14,470 --> 00:22:20,190
Microsoft would digitally sign the driver and give them the Certified for
417
00:22:20,190 --> 00:22:25,139
Windows logo, so they could proudly display ‘Certified for Windows Vista’—or
418
00:22:25,139 --> 00:22:30,540
Windows 8 or whatever—on the box when you buy the software, or on their website.
419
00:22:31,400 --> 00:22:34,430
Now, vendors could still choose to sign their drivers
420
00:22:34,450 --> 00:22:38,750
internally, but the antivirus folks wanted to get that
421
00:22:39,400 --> 00:22:42,710
WHQL certification and all the cachet that went with it.
422
00:22:43,599 --> 00:22:46,369
As long as your driver code didn’t change,
423
00:22:46,500 --> 00:22:49,179
the digital signature would remain valid.
424
00:22:49,630 --> 00:22:53,690
So, that means all these antivirus companies—like CrowdStrike—would
425
00:22:53,690 --> 00:22:56,710
get that certification, which meant that it had gone through
426
00:22:56,720 --> 00:23:00,079
some level of rigorous testing when it came to the way the
427
00:23:00,080 --> 00:23:03,000
driver was written and the way it interacted with the kernel.
428
00:23:03,880 --> 00:23:04,919
Seems like a good idea.
429
00:23:05,609 --> 00:23:06,109
I’m for it.
430
00:23:06,980 --> 00:23:10,649
Unfortunately, external data could be loaded by the driver, like—
431
00:23:10,660 --> 00:23:10,680
No.
432
00:23:11,290 --> 00:23:12,670
—virus definitions.
433
00:23:13,100 --> 00:23:14,909
But in theory, the actual running code
434
00:23:14,910 --> 00:23:18,270
should all live in that signed device driver.
435
00:23:18,429 --> 00:23:21,300
So, read in some config, but all the logic in the
436
00:23:21,300 --> 00:23:23,609
actual code should live in that device driver.
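A minimal sketch of why a signature over the driver says nothing about the data it loads later. It uses a plain SHA-256 digest as a stand-in for a real code signature, which in practice involves certificates and is not just a hash. The names `driver_v1`, `signed_digest`, and `driver_is_unchanged` are made up for illustration.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Return the SHA-256 digest of a blob as a hex string."""
    return hashlib.sha256(blob).hexdigest()


# Hypothetical driver image; its digest is recorded at certification time.
driver_v1 = b"driver code: read rules, match patterns"
signed_digest = digest(driver_v1)


def driver_is_unchanged(driver: bytes) -> bool:
    """True only if the driver bytes match what was certified."""
    return digest(driver) == signed_digest


# The data file is never part of the check, so it can change every day,
# including to something the driver cannot handle.
rules_monday = b"rule set A"
rules_tuesday = bytes(16)  # sixteen zero bytes

print(driver_is_unchanged(driver_v1))         # the certified driver
print(driver_is_unchanged(driver_v1 + b"!"))  # any change breaks the match
```

The check only ever sees the driver bytes. Whatever arrives in `rules_tuesday` is outside what was certified.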
437
00:23:24,150 --> 00:23:27,490
That’s all well and good for loading virus signatures and
438
00:23:27,490 --> 00:23:32,100
looking for matches in memory and CPU threads, but Falcon
439
00:23:32,110 --> 00:23:36,600
sensor is a modern EDR, and it doesn’t just use signatures.
440
00:23:37,170 --> 00:23:40,800
Instead, Falcon uses machine learning to develop behavior
441
00:23:40,800 --> 00:23:44,150
patterns, and then it needs to detect and respond to
442
00:23:44,150 --> 00:23:47,120
emerging threats that match those behavior patterns.
443
00:23:47,740 --> 00:23:53,510
The channel updates Falcon sensor receives to model that behavior, those updates
444
00:23:53,550 --> 00:23:59,639
appear to include some amount of pseudocode that is executed by the driver.
445
00:24:00,309 --> 00:24:03,970
And it is that injected code from the channel—or lack
446
00:24:03,970 --> 00:24:07,010
thereof, actually—that seems to have caused the issue.
447
00:24:07,730 --> 00:24:10,330
According to people who have looked at the channel
448
00:24:10,330 --> 00:24:14,279
file in question, it is entirely filled with zeros
449
00:24:18,150 --> 00:24:21,670
[laugh] . Now, you would hope that the driver would look
450
00:24:21,670 --> 00:24:25,909
at a file full of zeros and just ignore it, like, “Nope.
451
00:24:26,510 --> 00:24:30,700
That’s invalid.” Falcon sensor chose a slightly different route and crashed.
452
00:24:31,500 --> 00:24:31,800
Right.
453
00:24:32,259 --> 00:24:35,180
So, what we have here is a driver that is legitimate
454
00:24:35,670 --> 00:24:39,169
and was tested and proven resilient, which is good.
455
00:24:39,870 --> 00:24:40,149
Yeah.
456
00:24:40,300 --> 00:24:43,300
And we have updates that come down the wire multiple times
457
00:24:43,300 --> 00:24:47,090
a day and interact directly with that driver, that were not.
458
00:24:47,660 --> 00:24:48,250
Precisely.
459
00:24:48,940 --> 00:24:53,010
And it would appear that of the many tests that were run against
460
00:24:53,010 --> 00:24:56,510
that driver, none of the tests were, “Here’s a file full of zeros.
461
00:24:56,630 --> 00:25:00,520
What do you do?” Because no one thought that was a thing that would ever occur.
462
00:25:01,070 --> 00:25:01,639
But it did.
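The "here's a file full of zeros, what do you do?" test can be sketched in a few lines. This assumes an invented file layout (a 4-byte magic value, an entry count, then 4-byte entries); CrowdStrike's actual channel file format is not public, so `MAGIC`, `HEADER`, and every function name here are hypothetical.

```python
import struct
from typing import List

MAGIC = b"CHNL"                 # hypothetical file signature
HEADER = struct.Struct("<4sI")  # magic, then number of 4-byte entries


class InvalidChannelFile(ValueError):
    """Raised when a content file fails basic sanity checks."""


def parse_channel_file(data: bytes) -> List[int]:
    """Return the entries in data, or raise InvalidChannelFile."""
    if len(data) < HEADER.size:
        raise InvalidChannelFile("file shorter than header")
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise InvalidChannelFile("bad magic bytes")
    if count == 0 or len(data) != HEADER.size + 4 * count:
        raise InvalidChannelFile("entry count does not match file size")
    return list(struct.unpack_from(f"<{count}I", data, HEADER.size))


def load_or_keep_previous(data: bytes, previous: List[int]) -> List[int]:
    """Use the new rules if they parse, otherwise keep the last good set."""
    try:
        return parse_channel_file(data)
    except InvalidChannelFile:
        return previous
```

Feeding `bytes(4096)` (4 KB of zeros) to `load_or_keep_previous` fails the magic check and hands back the previous rule set, which is the "Nope. That's invalid." behavior described above.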
463
00:25:02,460 --> 00:25:08,940
There is a popular breakdown of the Falcon sensor crash dump by Twitter person
464
00:25:09,240 --> 00:25:15,260
Perpetualmaniac, which I won’t be linking because after assessing that it was
465
00:25:15,280 --> 00:25:20,420
a lack of null pointer checking in the dump, he then went on to make weird
466
00:25:20,429 --> 00:25:24,080
disparaging comments about the Rust community and blamed the whole thing on DEI.
467
00:25:25,740 --> 00:25:29,760
It got strange and kind of fasci, so fuck that guy.
468
00:25:29,990 --> 00:25:30,250
Fair.
469
00:25:30,420 --> 00:25:33,900
Instead, I’ll include a link to a different Twitter thread, by
470
00:25:33,900 --> 00:25:37,989
someone who actually debugs stuff like this for a living, and he
471
00:25:37,990 --> 00:25:42,959
basically said that Perpetualmaniac was wrong and thinks that it is
472
00:25:43,570 --> 00:25:47,300
uninitialized data being read from a table that caused the crash.
473
00:25:48,059 --> 00:25:51,180
Now, considering that the input file was entirely filled
474
00:25:51,180 --> 00:25:54,950
with nothing, uninitialized sounds like an understatement.
475
00:25:55,970 --> 00:26:00,170
Unfortunately, we won’t know for sure unless CrowdStrike shares
476
00:26:00,170 --> 00:26:03,480
their source code for their driver, which seems unlikely.
477
00:26:03,990 --> 00:26:06,239
Maybe they should, but I don’t think they will.
478
00:26:07,080 --> 00:26:11,360
The point is that the channel update caused Falcon sensor to attempt to access a
479
00:26:11,360 --> 00:26:15,560
memory location that didn’t exist or wasn’t initialized, and the driver crashed,
480
00:26:15,719 --> 00:26:19,989
forcing the system to halt in order to prevent possible data corruption.
481
00:26:20,780 --> 00:26:21,800
So, that’s where we’re at.
482
00:26:22,770 --> 00:26:24,070
Now, it’s time to point fingers.
483
00:26:24,590 --> 00:26:25,010
Cool.
484
00:26:25,340 --> 00:26:26,540
[It’s] Everybody’s favorite part.
485
00:26:26,950 --> 00:26:30,780
Predictably, in a fuckup of this magnitude, the blame
486
00:26:30,780 --> 00:26:33,949
game and armchair quarterbacking is in full effect.
487
00:26:34,530 --> 00:26:37,090
Thought leaders are tripping over themselves on
488
00:26:37,349 --> 00:26:39,680
LinkedIn to have an opinion about the whole mess.
489
00:26:40,080 --> 00:26:43,760
And I’ve seen posts ranging from ‘this is all CrowdStrike’s fault.
490
00:26:43,930 --> 00:26:47,650
How did this update ever get out the door?’ ‘This is all Microsoft’s fault.
491
00:26:47,860 --> 00:26:50,750
How could they let third parties run in kernel mode?’ ‘This is the
492
00:26:50,750 --> 00:26:55,530
customers’ fault for not having phased rollouts.’ Et cetera, et cetera.
493
00:26:56,280 --> 00:26:59,799
And then there’s all the conspiracy theories about how this was a state actor,
494
00:26:59,799 --> 00:27:05,200
or a planned thing, or, I don’t know, CrowdStrike did it on purpose, for reasons?
495
00:27:05,910 --> 00:27:06,140
Anyway.
496
00:27:06,990 --> 00:27:08,040
Solar flares?
497
00:27:08,660 --> 00:27:09,460
Oh, I like that one.
498
00:27:09,840 --> 00:27:11,030
That’s what made it all zeros.
499
00:27:11,770 --> 00:27:14,439
There’s plenty of blame to go around, and none of it is
500
00:27:14,440 --> 00:27:17,850
actually helpful while the fire is burning, but now that
501
00:27:17,850 --> 00:27:21,640
we’re over a week out, maybe we can take a more nuanced look.
502
00:27:22,110 --> 00:27:22,480
Or not.
503
00:27:23,640 --> 00:27:27,020
So, how did this update actually leave CrowdStrike’s front door?
504
00:27:27,760 --> 00:27:28,630
That’s a great question.
505
00:27:29,349 --> 00:27:33,370
The truth is, we will not know until CrowdStrike tells us or
506
00:27:33,380 --> 00:27:37,330
a lawsuit forces legal discovery, and we find out that way.
507
00:27:38,140 --> 00:27:39,560
The former could come any day.
508
00:27:39,600 --> 00:27:43,160
I’ve checked their [unintelligible] blog posts several times as I was
509
00:27:43,160 --> 00:27:47,680
writing this piece, and so far, they haven’t said, but maybe they will.
510
00:27:48,230 --> 00:27:49,979
Uh, actually, so they did—
511
00:27:50,780 --> 00:27:51,010
Ooh.
512
00:27:51,130 --> 00:27:52,520
—at about three o’clock this morning.
513
00:27:54,230 --> 00:27:54,870
[laugh] . Of course they did.
514
00:27:54,870 --> 00:27:56,910
They released an official—well, an official
515
00:27:56,920 --> 00:27:59,740
unofficial preliminary post-incident review.
516
00:28:00,130 --> 00:28:00,540
Okay.
517
00:28:00,540 --> 00:28:01,370
It’s a good name.
518
00:28:01,840 --> 00:28:04,240
And basically what they’re saying is, it went through automated
519
00:28:04,240 --> 00:28:07,540
testing, but the automated content validator had a bug in it.
520
00:28:08,099 --> 00:28:12,729
So, they passed it—quote-unquote, “Passed,” but it was an invalid file.
521
00:28:13,239 --> 00:28:13,249
Ah.
522
00:28:13,389 --> 00:28:17,540
“Once the file went out, it was immediately picked up, read by Falcon
523
00:28:17,540 --> 00:28:22,250
sensor, and it caused an out-of-bounds memory read, triggering an exception.
524
00:28:23,150 --> 00:28:25,520
This unexpected exception could not be gracefully handled,
525
00:28:25,520 --> 00:28:29,509
resulting in a Windows operating system crash BSOD.” Unquote.
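An out-of-bounds read is what happens when code trusts an index or offset it was handed. Python raises an error instead of reading stray memory, so the sketch below is only an analogy for native kernel code. The helper name `safe_lookup` is hypothetical.

```python
from typing import Optional, Sequence


def safe_lookup(table: Sequence[str], index: int) -> Optional[str]:
    """Return table[index], or None when index falls outside the table."""
    # Reject negative values too: Python would otherwise wrap them around.
    if 0 <= index < len(table):
        return table[index]
    return None


patterns = ["pattern-a", "pattern-b"]
print(safe_lookup(patterns, 1))   # inside the table
print(safe_lookup(patterns, 20))  # outside the table: refused, no crash
```

The design choice is to treat every index that comes from outside the program as untrusted and check it before the read.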
526
00:28:30,930 --> 00:28:34,379
So, it seems like their testing harness or whatever they’re using
527
00:28:34,540 --> 00:28:37,450
also doesn’t know what to do with a file that’s all zeros.
528
00:28:37,870 --> 00:28:38,480
Well, yeah.
529
00:28:38,530 --> 00:28:39,909
There’s a lot of problems here.
530
00:28:40,030 --> 00:28:43,520
First of all, clearly they did not test the tester enough.
531
00:28:44,410 --> 00:28:44,720
Yeah.
532
00:28:44,770 --> 00:28:46,579
Because if you have a bug in a testing system
533
00:28:46,580 --> 00:28:48,770
in an automated deployment, that is a problem.
534
00:28:48,890 --> 00:28:50,159
That is a huge problem.
535
00:28:50,900 --> 00:28:55,720
And the fact that simply loading the file caused the blue screen pretty
536
00:28:55,720 --> 00:28:59,490
quickly makes it sound like they don’t actually push these updates to
537
00:28:59,490 --> 00:29:04,970
test machines that then run the update to see if the system crashes.
538
00:29:05,309 --> 00:29:08,060
They’re using some other testing process.
539
00:29:08,510 --> 00:29:11,770
Right, which they do not go into any detail about, unsurprisingly.
540
00:29:12,410 --> 00:29:16,600
So, I am sure that the lawsuits are forthcoming, and maybe we’ll
541
00:29:16,600 --> 00:29:20,549
find out more when legal discovery happens, if it gets that far, but
542
00:29:21,009 --> 00:29:25,199
the truth is, CrowdStrike pushes these channel updates frequently.
543
00:29:25,199 --> 00:29:28,460
Like you said, Chris, they push these more than once a day.
544
00:29:28,850 --> 00:29:33,029
And they have automated testing in place, but they’re trying to stay
545
00:29:33,030 --> 00:29:37,029
one step ahead of the bad guys, which means time is of the essence.
546
00:29:37,400 --> 00:29:39,700
This specific update was meant to address
547
00:29:39,700 --> 00:29:43,480
something, a new vulnerability found in named pipes.
548
00:29:44,070 --> 00:29:46,670
They wanted to get that update out before any
549
00:29:46,670 --> 00:29:50,260
attacker figured out how to abuse this vulnerability.
550
00:29:51,160 --> 00:29:53,350
So, maybe what they’re doing is sacrificing
551
00:29:53,350 --> 00:29:56,480
quality or testing in favor of speed.
552
00:29:57,200 --> 00:30:00,520
This is a systemic failure, and it’s not the fault of one person.
553
00:30:00,640 --> 00:30:03,969
Yes, maybe someone screwed up and accidentally saved the file
554
00:30:03,990 --> 00:30:06,600
empty, but something else in the chain should have caught that.
555
00:30:07,310 --> 00:30:07,700
Right.
556
00:30:08,210 --> 00:30:11,060
If a single person can unwittingly push an update that
557
00:30:11,060 --> 00:30:13,430
takes down eight-and-a-half million Windows clients,
558
00:30:14,099 --> 00:30:16,619
that’s an organizational and systemic problem.
559
00:30:17,449 --> 00:30:19,850
There’s also some indications that this isn’t
560
00:30:19,860 --> 00:30:22,870
the first time such a transgression has occurred.
561
00:30:23,390 --> 00:30:28,000
It appears that Red Hat Enterprise Linux, Debian, and
562
00:30:28,000 --> 00:30:31,489
Rocky Linux have all encountered similar crashing problems
563
00:30:31,559 --> 00:30:34,959
earlier this year after a channel update was pushed.
564
00:30:35,730 --> 00:30:40,600
I think it was April and May were the two months where the issues were found.
565
00:30:40,820 --> 00:30:43,980
The issue with Debian in particular was traced to a specific
566
00:30:43,990 --> 00:30:47,760
version of the kernel that wasn’t included in CrowdStrike’s
567
00:30:47,889 --> 00:30:51,990
testing matrix, but was on their list of supported kernel versions.
568
00:30:52,129 --> 00:30:56,680
macOS seems to have weathered the storm, for reasons that we will get to.
569
00:30:57,340 --> 00:30:58,689
Yeah, and I mean, that’s an important point.
570
00:30:58,690 --> 00:31:01,940
And, you know, a lot of times people will say, this only
571
00:31:01,940 --> 00:31:04,090
happens to Windows, and that’s absolutely not the case.
572
00:31:04,290 --> 00:31:07,669
Anytime something runs unfettered in Ring 0 of any operating
573
00:31:07,670 --> 00:31:11,250
system of any kind, you run the risk of causing an immediate crash.
574
00:31:12,109 --> 00:31:15,800
It’s just not usually so public-facing because you don’t tend to
575
00:31:15,800 --> 00:31:20,970
have Linux running your displays that’s also running CrowdStrike.
576
00:31:21,280 --> 00:31:21,540
Right.
577
00:31:21,550 --> 00:31:23,540
For whatever reason, we like Windows for that.
578
00:31:24,130 --> 00:31:24,479
I don’t know.
579
00:31:24,770 --> 00:31:27,409
We’ll get to that, too [laugh] . So, what about Microsoft?
580
00:31:27,830 --> 00:31:29,999
Shouldn’t they prevent this kind of thing from happening?
581
00:31:30,430 --> 00:31:32,690
In an ideal world, they could.
582
00:31:32,970 --> 00:31:35,139
And we’ll get into the technical solutions in a
583
00:31:35,140 --> 00:31:38,680
moment, but this is largely not Microsoft’s fault.
584
00:31:39,440 --> 00:31:46,440
Yes, Windows has its flaws—many, many, many flaws—and Microsoft
585
00:31:46,449 --> 00:31:49,750
hasn’t always produced the stablest or most secure software.
586
00:31:50,290 --> 00:31:52,980
No one could call them blameless with a straight face.
587
00:31:53,240 --> 00:31:56,760
But in this specific instance, the system is working
588
00:31:56,760 --> 00:31:59,730
as designed, even if the design kind of sucks.
589
00:32:00,490 --> 00:32:04,319
Should we be shaming all these organizations who let the update
590
00:32:04,340 --> 00:32:07,750
barrel through their environment like salmonella on a cruise ship?
591
00:32:08,330 --> 00:32:09,170
That’s an image for you.
592
00:32:10,000 --> 00:32:14,280
Think about the counterexample for a second: let’s say that a zero-day
593
00:32:14,300 --> 00:32:18,430
attack was discovered using this named pipes thing, and it was
594
00:32:18,430 --> 00:32:22,419
leveraged by a hacking group to infect a major airline with ransomware,
595
00:32:22,840 --> 00:32:27,010
and later it came out that CrowdStrike would have protected them
596
00:32:27,240 --> 00:32:30,180
if they had been running the newest version of the channel updates.
597
00:32:30,700 --> 00:32:34,590
Stupid CISO decided to stay at n minus one for updates.
598
00:32:35,580 --> 00:32:39,819
Do you think the defense of not running the latest channel updates as
599
00:32:39,820 --> 00:32:43,920
a resiliency strategy would appease litigators and the public at large?
600
00:32:44,790 --> 00:32:45,950
I’m going to go with unlikely.
601
00:32:46,590 --> 00:32:49,320
[laugh] . So, I mean, another point that’s important to note
602
00:32:49,320 --> 00:32:51,730
here is that the kind of patch that came out—or the channel
603
00:32:51,730 --> 00:32:55,620
update—would not have been stopped by an n minus one effect.
604
00:32:56,190 --> 00:32:59,360
N minus one would stop the driver update.
605
00:32:59,879 --> 00:33:03,639
Remember, that’s the part that was signed by Microsoft, and is noted as good.
606
00:33:03,880 --> 00:33:06,270
The actual kernel—or the actual channel update itself
607
00:33:06,490 --> 00:33:08,639
happens automatically, and you can’t do anything about it.
608
00:33:09,490 --> 00:33:12,670
There is… I was reading through some Reddit posts, and some
609
00:33:12,679 --> 00:33:16,620
people did say that there is a way to run a little behind
610
00:33:17,650 --> 00:33:21,550
the channel updates, so to postpone them by certain periods.
611
00:33:21,790 --> 00:33:24,389
There is a way to run, kind of like, n minus one for the
612
00:33:24,390 --> 00:33:28,149
channel updates, but there’s an inherent risk in doing that.
613
00:33:28,500 --> 00:33:29,760
Yeah, like you said, it’s certainly not the
614
00:33:29,770 --> 00:33:32,150
sort of thing that a CISO is going to encourage.
615
00:33:32,849 --> 00:33:33,209
Right.
616
00:33:33,259 --> 00:33:37,100
And there’s also a regulatory hurdle with that, too, because there may be
617
00:33:37,130 --> 00:33:41,370
compliance and regulations that say you have to be running the latest version.
618
00:33:41,890 --> 00:33:46,790
So really, it’s just a rational decision based on balancing priorities and
619
00:33:46,790 --> 00:33:51,390
political realities, and trying to protect your customers as best you can.
620
00:33:52,299 --> 00:33:54,540
So, the blame ultimately should reside on
621
00:33:54,540 --> 00:33:55,735
CrowdStrike for putting out a floud update—floud?
622
00:33:55,735 --> 00:33:56,040
Flawed.
623
00:33:58,150 --> 00:33:58,370
Floud.
624
00:33:58,450 --> 00:33:59,210
Words.
625
00:33:59,890 --> 00:34:00,400
I love them.
626
00:34:00,840 --> 00:34:02,030
I like floud, actually.
627
00:34:02,849 --> 00:34:04,580
It’s like loud, but with an F.
628
00:34:04,840 --> 00:34:05,470
It’s floud.
629
00:34:05,470 --> 00:34:07,250
It’s like a flan that has opinions.
630
00:34:07,710 --> 00:34:08,559
An opinionated flan.
631
00:34:09,540 --> 00:34:09,989
I like it.
632
00:34:10,779 --> 00:34:14,659
Let’s talk about solutions [laugh] . The reason macOS hasn’t encountered
633
00:34:14,659 --> 00:34:20,100
a similar fate as the Linux and Windows installations is that Apple
634
00:34:20,630 --> 00:34:25,160
doesn’t let CrowdStrike—or really anything else—run in kernel mode.
635
00:34:25,599 --> 00:34:29,219
Starting in macOS 10.15—I didn’t look at the codename,
636
00:34:29,600 --> 00:34:34,639
so please forgive me—Apple offered System Extensions.
637
00:34:35,260 --> 00:34:38,190
These allow an application to stay in user mode while
638
00:34:38,190 --> 00:34:41,819
requesting special access to hardware managed by the kernel.
639
00:34:42,370 --> 00:34:47,379
At the same time, Apple phased out Kernel Extensions—often
640
00:34:47,529 --> 00:34:50,429
shortened to kext, or [pronounced] kext, I guess—
641
00:34:50,560 --> 00:34:52,020
Yeah, it’s pronounced, unfortunately.
642
00:34:52,840 --> 00:34:56,020
[sigh] . They phased those out starting in macOS 11.
643
00:34:56,599 --> 00:35:00,150
So basically, CrowdStrike doesn’t run in kernel mode on
644
00:35:00,160 --> 00:35:04,390
macOS, and thusly, it cannot crash macOS the same way.
645
00:35:05,480 --> 00:35:08,135
I don’t know no Mac, so I don’t know about any of that [laugh] .
646
00:35:08,400 --> 00:35:09,100
No, it’s true.
647
00:35:09,100 --> 00:35:11,950
And for a while, it was extremely annoying because a lot of programs
648
00:35:11,950 --> 00:35:15,319
relied on kexts for similar reasons: to have instant access.
649
00:35:15,320 --> 00:35:18,370
Like, a good example is if you have an external audio
650
00:35:18,370 --> 00:35:23,190
device and you want that to work as fast—as efficiently as
651
00:35:23,190 --> 00:35:25,569
possible, you would want it to work and run in kernel mode.
652
00:35:25,840 --> 00:35:26,150
Right.
653
00:35:26,590 --> 00:35:29,120
So, there are actually ways to get around the
654
00:35:29,120 --> 00:35:32,270
security that you just talked about in macOS.
655
00:35:32,760 --> 00:35:35,190
I don’t recommend it, but it is doable.
656
00:35:36,020 --> 00:35:38,489
And the whole point here is that you have this little secret
657
00:35:38,500 --> 00:35:41,400
enclave, effectively, where things run in this sort of in-between
658
00:35:41,400 --> 00:35:44,600
mode—sandbox, if you will—which we’re going to go into in a second.
659
00:35:45,160 --> 00:35:48,340
But if it crashes there, it doesn’t take down the operating system.
660
00:35:48,940 --> 00:35:49,240
Right.
661
00:35:49,820 --> 00:35:53,680
And Linux actually has a similar option with eBPF,
662
00:35:53,830 --> 00:35:57,340
which I struggle to say because it’s awkward.
663
00:35:57,639 --> 00:35:59,699
And apparently, it’s no longer an acronym.
664
00:36:00,469 --> 00:36:01,660
It’s just its own thing.
665
00:36:02,220 --> 00:36:03,950
So… that’s weird.
666
00:36:04,500 --> 00:36:08,370
eBPF lets applications load into a sandboxed
667
00:36:08,760 --> 00:36:10,930
secure kernel execution environment.
668
00:36:11,320 --> 00:36:15,490
So, once again, gives them kernel-level access to resources, while applying
669
00:36:15,580 --> 00:36:19,480
stringent safety checks to make sure the application doesn’t crash the system.
670
00:36:20,210 --> 00:36:25,200
CrowdStrike now offers running Falcon in user mode on Linux—what
671
00:36:25,200 --> 00:36:29,629
they call user mode—which actually uses eBPF under the covers.
672
00:36:30,440 --> 00:36:33,500
If you were running in that mode, those previous crashes
673
00:36:33,780 --> 00:36:37,470
that happened with Red Hat, and Debian, and—what was
674
00:36:37,490 --> 00:36:40,750
it?—Rocky Linux, you would not have been affected by those.
675
00:36:41,460 --> 00:36:43,420
I mean, CrowdStrike would—Falcon still would have
676
00:36:43,420 --> 00:36:44,950
crashed, but it wouldn’t have crashed your system.
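A rough analogy for "Falcon still would have crashed, but it wouldn't have crashed your system": work that runs in its own isolated context can fail without taking its host down with it. The sketch uses an ordinary child process, which is not how eBPF or System Extensions actually work. `run_isolated` is a hypothetical helper.

```python
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run code in a separate Python process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# The child dies with an unhandled error; the parent just sees a status code.
status = run_isolated("raise RuntimeError('sensor fell over')")
print("child exit status:", status)
print("parent still running")
```

The parent here gets to decide what to do about the failure, such as restarting the child or carrying on without it. Code running directly in kernel mode does not get that option.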
677
00:36:45,880 --> 00:36:46,659
Which is better.
678
00:36:47,020 --> 00:36:47,630
I think so.
679
00:36:48,330 --> 00:36:52,850
Windows has some similar functionality available.
680
00:36:53,180 --> 00:36:56,600
There’s the Windows Filtering Platform, Windows Defender
681
00:36:56,620 --> 00:37:00,705
Application Control, and Windows Defender Device Guard, all of
682
00:37:00,720 --> 00:37:05,159
which have APIs, but none of them have the same mechanisms present
683
00:37:05,940 --> 00:37:10,680
that, like, System Extensions for macOS or eBPF for Linux have.
684
00:37:11,020 --> 00:37:15,220
So, they provide an API that applications could be rewritten to take
685
00:37:15,220 --> 00:37:20,630
advantage of and get, you know, almost kernel levels of access and speed,
686
00:37:21,370 --> 00:37:26,720
but they’re not the same as this, sort of, sandbox, secured enclave.
687
00:37:27,610 --> 00:37:31,330
There is a project to port eBPF over to Windows, for what it’s worth.
688
00:37:31,830 --> 00:37:35,940
I don’t know if that will be the ultimate solution, but this catastrophic
689
00:37:35,940 --> 00:37:39,970
calamity should at least prompt Microsoft to try something similar.
690
00:37:40,790 --> 00:37:43,350
I have heard some folks—and we could call this a technical
691
00:37:43,350 --> 00:37:45,220
solution—I’ve heard some folks say that you just shouldn’t
692
00:37:45,230 --> 00:37:47,270
be running Windows in most of these environments.
693
00:37:47,900 --> 00:37:49,920
Like… you’re not wrong.
694
00:37:50,700 --> 00:37:55,250
If I could wave a magic wand and turn back time, if I could find a way, Chris—
695
00:37:56,440 --> 00:37:56,799
Stop it.
696
00:37:57,099 --> 00:37:59,400
—I would take back all the Windows that hurt
697
00:37:59,400 --> 00:38:02,450
you and replace them with Linux variants.
698
00:38:03,280 --> 00:38:04,270
Okay, it doesn’t rhyme.
699
00:38:04,920 --> 00:38:06,019
[laugh] . It’s the best I could do.
700
00:38:06,940 --> 00:38:09,600
If you’re out there, and you’re building a net-new system,
701
00:38:10,040 --> 00:38:13,910
that’s, like, an end-user terminal, an IoT device, or even a
702
00:38:13,910 --> 00:38:18,270
server running in the cloud, I think anything but Windows is your
703
00:38:18,280 --> 00:38:22,020
best bet, and it would probably be malpractice to do otherwise.
704
00:38:22,830 --> 00:38:25,500
But like it or not, Windows remains the most popular desktop
705
00:38:25,520 --> 00:38:28,830
operating system, and that doesn’t appear to be changing anytime soon.
706
00:38:29,500 --> 00:38:33,499
We need a short-term plan to make things better—through some sort
707
00:38:33,500 --> 00:38:38,080
of update—and a long-term plan to ditch Windows for most use cases.
708
00:38:38,799 --> 00:38:39,279
Thoughts?
709
00:38:40,080 --> 00:38:45,230
So, a lot of in—I’ll put, ‘in my opinion’ around all of this—
710
00:38:45,650 --> 00:38:45,890
Right.
711
00:38:46,220 --> 00:38:51,730
A lot of this comes down to the never-ending battle between speed and
712
00:38:51,730 --> 00:38:58,020
security, and making assumptions that things are just going to work.
713
00:38:59,320 --> 00:39:03,620
After all, like we said, they’ve done multiple channel updates
714
00:39:03,620 --> 00:39:07,319
a day for years and years and years and years and years, and
715
00:39:07,320 --> 00:39:10,159
while they’ve had a few issues in the past, it’s not very many.
716
00:39:11,130 --> 00:39:14,010
This is the sort of thing that leads developers—and, you know,
717
00:39:14,020 --> 00:39:17,839
engineering teams—to have a false sense of security, and a false sense
718
00:39:17,840 --> 00:39:20,110
that everything they do is golden, and they will never have a problem.
719
00:39:20,980 --> 00:39:25,300
Therefore, checks get skipped, checks get removed from the process—because
720
00:39:25,300 --> 00:39:29,350
after all, they’re just slowing us down—and that’s a huge issue.
721
00:39:30,050 --> 00:39:34,630
The other issue is, when you push everything out all at once, the problem
722
00:39:34,779 --> 00:39:38,560
can occur—like it did this time—that everything will crash all at once.
723
00:39:40,060 --> 00:39:44,250
There needs to be some type of a fuzzed deployment.
724
00:39:44,599 --> 00:39:46,839
So, let’s just say these things get released
725
00:39:46,840 --> 00:39:48,779
on a schedule, I don’t know, every four hours.
726
00:39:49,759 --> 00:39:51,740
You get a customer that’s got a hundred servers.
727
00:39:52,410 --> 00:39:57,250
Those servers should get that update five minutes apart in, like, groups of 30.
728
00:39:58,140 --> 00:40:01,120
That way, if there is a catastrophic failure, it
729
00:40:01,120 --> 00:40:03,940
only takes down a percentage of your platform.
730
00:40:04,460 --> 00:40:06,580
Now, that’ll happen for every single customer on Earth, and
731
00:40:06,580 --> 00:40:10,369
that’s not great, but the assumption is, and should be, that
732
00:40:10,369 --> 00:40:13,560
there is high availability built into this, so if half your
733
00:40:13,560 --> 00:40:16,620
systems go down, theoretically, the other half can carry the load.
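The staggered rollout described here can be sketched as a loop over batches that stops as soon as a batch looks unhealthy. Everything in it is an assumption for illustration: the host names, the `apply_update` and `is_healthy` callbacks, and the batch size. A real rollout would also wait between batches (five minutes in the example above); the wait is left out so the sketch runs instantly.

```python
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most size hosts."""
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def staged_rollout(
    hosts: Sequence[str],
    batch_size: int,
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update hosts batch by batch; halt after the first unhealthy batch.

    Returns the hosts that received the update.
    """
    updated: List[str] = []
    for batch in batches(hosts, batch_size):
        for host in batch:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in batch):
            break  # stop here so the remaining hosts keep running
    return updated
```

With a hundred hosts, a batch size of 30, and an update that breaks every host it touches, the rollout stops after the first batch and the other 70 hosts never receive the update.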
734
00:40:17,260 --> 00:40:17,570
Mmm.
735
00:40:18,470 --> 00:40:21,759
Yeah, and that’s something that CrowdStrike could change today.
736
00:40:22,010 --> 00:40:23,470
That’s within their realm of control.
737
00:40:23,470 --> 00:40:26,850
Yeah, and I suspect that they will [laugh] . Because the only
738
00:40:26,850 --> 00:40:29,540
other option—if this is the situation—is people are going to end
739
00:40:29,540 --> 00:40:32,470
up with half of their environment running one antivirus solution,
740
00:40:32,480 --> 00:40:34,370
and the other half of their environment running another one.
741
00:40:35,020 --> 00:40:35,850
That seems worse.
742
00:40:35,930 --> 00:40:39,070
It’s just as insane as running—insane and difficult
743
00:40:39,080 --> 00:40:40,960
to manage as running in a multi-cloud environment.
744
00:40:41,240 --> 00:40:42,669
Or a super cloud, as some might say.
745
00:40:43,449 --> 00:40:43,469
Ugh.
746
00:40:43,780 --> 00:40:44,280
I hate you.
747
00:40:44,900 --> 00:40:47,099
Yeah, these are all technical solutions.
748
00:40:47,590 --> 00:40:50,700
I don’t know if there’s any policy solutions, but my biggest
749
00:40:50,700 --> 00:40:54,750
concern coming out of all of this is that regulators and
750
00:40:54,750 --> 00:40:58,579
litigators are going to get into a hubbub and pass some poorly
751
00:40:58,590 --> 00:41:02,970
thought-out legislation that makes things effectively worse.
752
00:41:03,660 --> 00:41:05,739
I can’t quite figure out how they would make
753
00:41:05,740 --> 00:41:08,170
things worse, but I am excited to see them try.
754
00:41:08,180 --> 00:41:11,990
[laugh] . They’re nothing if not creative.
755
00:41:12,780 --> 00:41:14,230
Well, hey, thanks for listening or something.
756
00:41:14,230 --> 00:41:16,130
I guess you found it worthwhile enough if you made it all
757
00:41:16,130 --> 00:41:18,280
the way to the end, so congratulations to you, friend.
758
00:41:18,480 --> 00:41:19,730
You accomplished something today.
759
00:41:19,740 --> 00:41:23,229
Now, you can go sit on the couch, update your CrowdStrike channel
760
00:41:23,230 --> 00:41:26,990
file, and watch everything crash in beautiful synchronicity.
761
00:41:27,140 --> 00:41:27,820
You’ve earned it.
762
00:41:28,500 --> 00:41:30,840
You can find more about this show by visiting our LinkedIn page,
763
00:41:30,840 --> 00:41:34,240
just search ‘Chaos Lever,’ or go to our website, chaoslever.com.
764
00:41:34,520 --> 00:41:36,460
You’ll find show notes, blog posts, and general tomfoolery.
765
00:41:37,280 --> 00:41:39,620
And if we got something wrong, or you have strong opinions
766
00:41:39,620 --> 00:41:42,200
about what CrowdStrike should have done, leave us a comment.
767
00:41:42,379 --> 00:41:43,290
Leave us a voicemail.
768
00:41:43,410 --> 00:41:45,400
We might even listen to it.
769
00:41:45,400 --> 00:41:47,920
We’ll be back next week to see what fresh hell is upon us.
770
00:41:48,240 --> 00:41:48,950
Ta-ta for now.
771
00:41:57,200 --> 00:41:57,790
What a mess.
772
00:41:58,440 --> 00:41:58,770
Mmm.
773
00:41:59,400 --> 00:42:00,359
A glorious mess.