>>Hi everyone, it's great to be here at DefCon! I'm here to talk to you today about Breaking the x86 Instruction Set Architecture. Before I get into any of that, my name is Christopher Domas, I'm a cyber security researcher at a company called the Battelle Memorial Institute. I really like working here, it gives me the chance to encounter a lot of very hard problems. But I think what I like the most about working here is that the problems and people I encounter during the day give me a lot of interesting ideas for the fringe areas of cyber security to explore, kind of in my free time. And that's what I wanted to show you today: one of these little side projects I've been working on on-and-off for the last couple of months. So, this whole thing's based around trust and I'm gonna start this presentation off with a really, really obvious statement: we don't trust software and we shouldn't - software can be horrible. So, we audit software to make sure that it works the way we expect it to work; we reverse engineer software to make sure there are no secrets hidden inside of it. We don't want Minesweeper dialing out to a Russian website. We break software, we try to find vulnerabilities and exploit them in order to harden it, [background noise] make things more resilient. And despite all this effort, we still don't trust software, so we sandbox it so that when something goes wrong at least it's contained. But the processor itself, you know, the thing responsible for actually running our software, for enforcing all of our security checks - the processor itself we pretty much just blindly trust. And we don't really have a lot of choice here because we don't have reverse engineering or introspection or auditing tools for the processor the way that we do for software.
Really, the best we have for the processor is a bunch of processor specifications and reference manuals, a bunch of documents that tell us "This is how your processor operates". And we're just supposed to take it on faith that that's how the processor actually operates. And I think that's kind of crazy, right? Because we would never do that with software, we would never download just some totally random executable, open up the readme, see that it says "Totally not a virus" and think "Yeah, that's totally not a virus, I can go ahead and run this". But that's exactly what we're doing with hardware right now. Why? Hardware has all the same problems as software. So why do we trust hardware but not software? If we're worried about secret functionality in software, does hardware have that? Absolutely. Intel infamously had what was called the 'Appendix H' documents that described all the secret pieces of the x86 architecture and were only available to trusted Intel sources. What about bugs? You're worried about bugs in software, does hardware have bugs? You've probably heard of several of these. The 'F00F bug' in the Intel Pentium could allow an unprivileged user to lock the system. The 'FDIV bug' in the Pentium: every couple of billion floating point operations, the Pentium would be off by a fraction of a percent. That cost Intel about five hundred million dollars. The 'TSX bugs' in the Haswell architecture - Intel had to completely disable transactional memory support in Haswell because of these bugs. Skylake has hyperthreading data corruptions, the Ryzen processors are crashing, faulting under heavy FMA3 operation loads. Modern processors have tons and tons of bugs in them. What about vulnerabilities? We're worried about vulnerabilities in software, [background noise] hardware's got those too.
Like the 'SYSRET' vulnerability, which allowed a user to escalate into kernel level permissions on almost every major operating system. The cache poisoning and memory sinkhole attacks allowed infiltrating x86's protected system management mode. So, vulnerabilities exist in processors too. So, the point here is, we should stop blindly trusting our hardware - if we don't trust software we shouldn't trust hardware either. So, what kinds of things do we need to worry about when we're talking about trusting hardware? [background noise] Well, the thing that I was mostly interested in for this research was hidden instructions inside of the processor. Like, maybe there's secret functionality that can give us backdoor or powerful access to the processor internals. Now, I know that sounds almost conspiracy theory-esque but it's not that far off from reality. There are actually some interesting historical examples of this in the x86 architecture. So, for example, early x86 chips had what was called the 'ICE breakpoint instruction'. This was an instruction that wasn't documented anywhere, that would switch the processor over to a very privileged ICE mode. Those same processors also had this 'LOADALL instruction' - another undocumented secret instruction that would give you access to hidden pieces of the processor's registers that you normally couldn't reach. Or just a more recent example, you may have heard about the 'apicall instruction' in Microsoft's x86 emulator. Basically what Microsoft did was they backdoored the UD0 x86 instruction and turned it into something entirely different without ever releasing that information. And that caused a fairly serious vulnerability in their emulator. So, these things can exist in real life and sort of highlight that point.
If you actually go into your processor's documentation, if you look up the Intel Software Developer's Manual, Volume 2, near the end of it you will see dozens and dozens of tables that look something like this. What these are, are the opcode maps - these are supposed to enumerate all of the instructions the processor supports. It's basically saying that when the processor sees this byte, it's going to execute this instruction; when it sees this byte it's going to execute this instruction. So, that's what we're seeing in the opcode maps. [background noise] But if you look really closely at these opcode maps, you'll notice something - there are gaps here and there. So, this is a document that's supposed to tell us everything the processor does. This is a document that we're basing all of our trust for this processor on, but it's leaving things out with these gaps. That's not really a good start for trust, right? If we're relying on this document and it's intentionally not telling us certain things about the architecture. So, I wanted to find a way to actually audit the processors that are in all of our computers - I wanted to figure out what's really on my processor. So, I wanted to find these hidden instructions. How do we do that in this architecture? The challenge with x86 is the instruction format is very, very complex. In x86 you can have one-byte machine instructions, like 'inc eax', which is hex 40. You can also have fifteen-byte machine instructions: an 'add qword' with a CS override, a complicated memory access and a complicated immediate value is a fifteen-byte x86 instruction. And you can have everything in between as well. So, if we actually look at a worst case scenario and assume all the instructions are fifteen bytes, just for simplicity, you're looking at something like one point three undecillion possible instructions on this architecture.
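As a quick check on that figure, and assuming every one of the fifteen bytes is free to take any of 256 values, the size of the worst-case space can be computed directly. This is only arithmetic, written in Python; the variable names are illustrative.

```python
# Worst case from the talk: treat every instruction as 15 bytes,
# with each byte free to take any of 256 values.
MAX_INSN_BYTES = 15
search_space = 256 ** MAX_INSN_BYTES

print(search_space)           # exact integer, equal to 2**120
print(f"{search_space:.2e}")  # 1.33e+36, i.e. about 1.3 undecillion
```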
Now, if I just want to find one hidden secret instruction in all of that, that's going to be really, really difficult. So, the obvious approaches to sorting through all of these instructions just don't really work. You might think "Let's just try all of the x86 instructions". And that might work for a RISC chip with a fixed length instruction set - that's not going to work for x86. There's way, way too much. You might think "Well, what if we just do random instructions and see if we find anything interesting." The problem is you could run random instructions for the next ten years and you'd still cover such a tiny, tiny portion of that search space that it wouldn't be very useful. You might think "Well, we've got this documentation that kinda tells us what these instructions look like. Why don't we guide our search based on the documentation?". But the big problem with that is, like we already established, the documentation can't be trusted. If you're basing your search on the documentation, you're kind of doomed to fail from the start. The other big problem: you might think "Well, there are these gaps in those manuals, why don't we just explore those gaps?" But those gaps really only tell you the first byte of the instruction. If you were looking for a secret instruction that's fifteen bytes long, you've still got a long way to search even basing it on the documentation. So, we don't want to do that, we want to find a better way to search through this instruction set. Really, our goal is to find the bytes in the instruction that actually matter so that we can fuzz those and ignore all the bytes in the instruction that don't matter. So, I wanna throw out an observation here: the meaningful bytes in an x86 machine instruction impact either that instruction's length or its exception behavior.
Basically, the meaningful bytes change something important about the instruction, and the bytes that don't really matter don't really change anything important about the instruction. So, I kinda came up with this idea [background noise] of a depth-first search algorithm that would let us search through all the reasonably distinct instructions inside of the x86 instruction set. So, the way that this is going to work is we're going to guess an instruction - now, we don't know how long this instruction is. [background noise] Let's just guess fifteen bytes of zeros. Then I'm going to execute this instruction and from the execution I'm going to see how long this instruction actually was. You'll see that this instruction is two bytes long. [sneeze] So, we're going to increment the very last byte of this instruction and then repeat the process. Execute the instruction; observe its length; increment the last byte; execute; observe the length; increment the last byte. Very, very simple algorithm here. But eventually, when you do that incrementation and execute the instruction, you'll find that at some point the length changes. Our instruction just went from two bytes to three bytes. So what we do when the length changes is we move to the new last byte and increment it; then execute the instruction; observe its length; increment the last byte. Over and over and over again. So if you repeat this process, this is what your instructions kind of look like. You can see that through this incrementation approach, we very gradually sort of drill down into the instruction set by generating more and more complex instructions as we go along. So, at some point you're gonna increment through all possible values for that last byte and you're going to see that the length still didn't change.
So, when you've done that, when the last byte is an FF, two fifty-five, you're going to increment it one more time, let it roll over, and then you're going to move back a byte. And that's going to be our marker byte. So then we're going to increment that marker, so it becomes one. And we'll repeat the process - execute the instruction; observe its length - if the length hasn't changed yet, and it didn't here, increment the marker again. And just repeat that over and over and over, [background noise] and eventually in this situation that marker will roll over again. So, you move back a byte now; if the marker rolls over, move back; marker rolls over, move back. And just keep that up, incrementing these bytes one by one and observing the length changes. So, eventually when you increment the marker, you're going to find when you execute the instruction that the length changed again. So, at this point we now move back to the end of the instruction and start the whole process over, incrementing that last byte and seeing how long that instruction is. So, this is a sort of neat algorithm - this tunneling algorithm lets us search through the instruction space really quickly, [background noise] skipping over the bytes that don't matter and focusing on those that do. Now, that lets us basically exhaustively search the bytes in the instructions that actually matter. And effectively what that does is it reduces your search space in this architecture from one point three times ten to the thirty-sixth instructions down to about a hundred million - something that you can actually scan in about a day. So that's pretty neat. I was excited that I found a way to quickly search this ISA. But there's a catch here that I kind of glossed over - namely that in order for this algorithm to work, we need to know the length of an x86 instruction. And it's not as easy as it sounds.
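The tunneling steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's actual code: `toy_length` is a made-up three-byte instruction set standing in for "execute the bytes on the processor and observe the length", and `tunnel`, `marker` and `last_len` are hypothetical names.

```python
def toy_length(buf):
    """Made-up ISA (an assumption): 0x0F is an escape byte; every other first byte is a 1-byte instruction."""
    if buf[0] != 0x0F:
        return 1
    # After the escape, a second byte >= 0x80 takes one extra operand byte.
    return 2 if buf[1] < 0x80 else 3


def tunnel(length_of, max_len=3):
    """Yield candidate instructions, incrementing only the byte at the marker."""
    insn = bytearray(max_len)      # initial guess: all zeros
    marker = 0
    last_len = None
    while True:
        n = length_of(insn)        # "execute" and observe the length
        yield bytes(insn[:n])
        if n != last_len:          # length changed: move to the end of the instruction
            marker = n - 1
        last_len = n
        insn[marker] = (insn[marker] + 1) & 0xFF
        while insn[marker] == 0:   # rolled over: move back a byte and increment that
            marker -= 1
            if marker < 0:
                return             # the first byte rolled over, search is finished
            insn[marker] = (insn[marker] + 1) & 0xFF


candidates = list(tunnel(toy_length))
print(len(candidates), "candidates instead of", 256 ** 3)
```

On this toy instruction set the search fully enumerates the operand byte once (for `0F 80`) and then skips it for `0F 81` through `0F FF`, because changing it no longer changes the observed length. That is the sense in which the bytes that don't matter get skipped.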
We can't just disassemble the instruction to see its length because we might be dealing with undocumented instructions or misdocumented instructions. [background noise] So, the simple approach here, if you're familiar with x86, might be to use the x86 trap flag. The way the x86 trap flag works is, if you've got an instruction and you set the trap flag before executing that instruction, then when you execute that instruction the processor throws an exception and gives control to your trap handler. Now, that trap handler could look at the new instruction pointer, compare it to the original instruction pointer, and that difference tells you the instruction's length. So that's how you could try to do this with the trap flag, but the trap flag has a really big limitation here - namely it fails to resolve the length of faulting instructions. So why do we care about faulting instructions? Well, it's necessary if you want to search through privileged instructions. So, let's say we're executing this in ring three - the least privileged mode of execution on the x86 processor. We still want to figure out if there are instructions that exist only in ring zero. So, for example, loading a control register can only be done in the kernel, in ring zero. There are instructions that can only be executed in a hypervisor - like VM enter. Or instructions that can only be executed in system management mode, like 'RSM', the resume instruction. So, regardless of what mode of execution or privilege level we're scanning at - we want to be able to resolve the instructions that occur in any other mode. So, we gotta come up with a better approach for figuring out the length of an instruction. What I came up with for this was a sort of page fault analysis. So, the idea here is: choose an arbitrary instruction - we don't actually know how long this instruction is right now but we wanna figure out how long it is. So, we're gonna map two pages into memory.
The first page is going to have read, write and execute permissions and the second page is only going to have read and write permissions. Then what we do is we place our instruction in memory so that the first byte of the instruction is on the last byte of the executable page, and the rest of the instruction is at the beginning of the non-executable page. Then you execute this instruction, basically just jump to the instruction and see what happens. So, internally the processor's fetcher is basically going to grab that first byte - that '0F' here - and it's going to see 0F is not an instruction by itself - I need more bytes. So it tries to grab the next byte of the instruction but now it sees that that next byte is on a non-executable page. So, the processor throws a page fault; specifically the processor's going to generate a page fault exception with the fault address in the CR2 register. So, CR2 points to the address of that second page. So, whenever that happens, whenever we receive a page fault and CR2 is set to the address of the second page - we know that the instruction is longer than what we have in the executable page. So, in this situation I know the instruction is longer than just 0F. So, we move the instruction back a byte and repeat - execute the instruction, the decoder's gonna grab that first byte; it's gonna see the instruction continues beyond 0F. So, it's gonna grab that next byte - that works now because this is an executable page - it's gonna see that if this is 6A I still need more bytes to finish the instruction. And then it's gonna try and grab the next byte; again it's going to fault here because the byte is on a non-executable page. [background noise] So we basically just repeat this process over and over: as long as we receive page fault exceptions with CR2 set to the second page address - keep moving the instruction backwards one byte.
And eventually what will happen when you execute the instruction is [sneeze] you're gonna find that the entire instruction resides in an executable page. So, the processor could do a lot of different things here. This instruction could run; the instruction could throw a different kind of fault; it could throw a page fault as well, but it's going to have a different address in the CR2 register. So, regardless of what happens here - all of those situations mean that this instruction has now been successfully decoded, so it must reside entirely in the executable page. That means that we know the instruction length at this point. So, we know the instruction length - we know how many bytes the instruction decoder consumed. What we don't actually know at this point is whether or not this instruction really exists. The processor has to decode even non-existing instructions - fortunately it's pretty easy to check if this instruction exists or not at this point. Basically if the instruction does not exist on the processor - like, it's really not there; not just that it's undocumented, [sneeze] if the instruction is really not there - the processor generates an undefined opcode exception at this point. So, if we receive anything other than a UD exception then we know that this instruction actually exists on this processor. So, this is kind of a neat approach. It lets us resolve the lengths for successfully executing instructions, faulting instructions, privileged instructions - things that can only execute in ring zero, ring minus one and system management mode. And so, I threw all this functionality into a little process that I called 'the injector'. The injector basically does this page fault analysis and the instruction search operations. The injector has a really big problem though - it's fuzzing the device that it's actually running on. So, how do we keep the injector from crashing itself with these random instructions that it's generating?
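The page fault analysis just described can be simulated without touching real hardware. The sketch below is a pure-Python model under stated assumptions: the "processor" is a toy decoder (`toy_needs_more`, where `0F` is an escape byte and `FE` is treated as a non-existent opcode), and `PageFault`, `UndefinedOpcode`, `simulated_execute` and `measure` are hypothetical names. A real implementation would map actual pages and read the fault address in a signal handler.

```python
PAGE_SIZE = 4096
SECOND_PAGE = PAGE_SIZE  # address of the first byte of the non-executable page


class PageFault(Exception):
    def __init__(self, cr2):
        super().__init__(hex(cr2))
        self.cr2 = cr2  # faulting address, as CR2 would report it


class UndefinedOpcode(Exception):
    pass


def toy_needs_more(prefix):
    """Toy decoder (assumption): does the decoder need another byte after this prefix?"""
    if len(prefix) == 1:
        return prefix[0] == 0x0F
    if len(prefix) == 2:
        return prefix[1] >= 0x80
    return False


def simulated_execute(candidate, bytes_on_exec_page):
    """Place the candidate so only its first N bytes are executable, then 'run' it."""
    start = SECOND_PAGE - bytes_on_exec_page
    fetched = 0
    while True:
        addr = start + fetched
        if addr >= SECOND_PAGE:            # fetch ran onto the non-executable page
            raise PageFault(cr2=addr)
        fetched += 1
        if not toy_needs_more(candidate[:fetched]):
            break
    if candidate[0] == 0xFE:               # toy rule: this opcode does not exist
        raise UndefinedOpcode()


def measure(candidate):
    """Return (length, exists) by moving the candidate back one byte at a time."""
    for n in range(1, len(candidate) + 1):
        try:
            simulated_execute(candidate, n)
        except PageFault as fault:
            if fault.cr2 == SECOND_PAGE:
                continue                   # instruction is longer than n bytes
            return n, True                 # fault elsewhere: decoding had finished
        except UndefinedOpcode:
            return n, False                # decoded, but the instruction is not there
        return n, True                     # ran without faulting
    raise ValueError("candidate buffer too short to finish decoding")


print(measure(bytes.fromhex("0f8000")))
```

In this model the only page fault the simulation can raise is the one at the second page, so the "different CR2" branch is there purely to mirror the description.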
Well, the first step is we're gonna restrict ourselves to running inside of ring three. [background noise] And it's not such a big limitation because even in ring three, using this approach, we can find out whether or not certain instructions exist in more privileged rings as well. So, that's not a big limitation, but it does mean that we're essentially not going to accidentally crash the entire system through these random instructions. Unless you've got a very serious processor bug. And so the next thing we're gonna do is try and hook all the exceptions that this instruction could generate. In Linux that means we're gonna hook segfault exceptions; we're gonna hook illegal instruction exceptions; we're going to hook floating point exceptions, bus errors, traps. And whenever we receive one of these exceptions the process is basically gonna clean up after itself; by that I mean that the process is going to essentially reload all of the registers to known good values. [background noise] That way, no matter what happened with that instruction that we generated, we reset the system to a known good state. So, at step three, what we're going to do to prevent other types of crashes is we'll set all the general purpose registers on that processor to zero before we try to execute the instruction. So, what that does for us is it handles an arbitrary memory write instruction like this - this instruction adds the value 9102 into some location in memory. We don't want that to actually hit something that our program needs. We don't want that location in memory to fall into our process's address space. So by loading all the registers with zero we ensure that the memory address portion of this instruction resolves to zero.
But that's not quite enough just yet because x86 has some complicated addressing modes - we can have an address like 'eax plus four times ecx plus' some weird long offset. [cough] So, even though we can ensure that the part on the left resolves to zero, the part on the right might still fall into the process's address space. In which case you could corrupt yourself beyond repair, so, we don't want that to happen. Fortunately we're actually in good shape here because of the way that the tunneling search works. Essentially tunneling ensures that those offsets are constrained. If we have a four-byte offset, three of those bytes will always be zero when we're searching these instructions. So, there's a range of different values we can get for that offset but none of those actually fall into the normal Linux process address space. Meaning that our process won't accidentally corrupt itself. Now, those instructions will segfault, but that's perfectly okay - we catch segfaults and can correct those. [background noise] So, we've handled faulting instructions at this point - what about non-faulting instructions? The instructions we generate that actually do run, how do we handle those? We need some way for the analysis to continue after these instructions. So, imagine that you randomly generated an instruction like jump backwards fifty bytes; the processor would jump back fifty bytes and then just start executing random garbage that could corrupt your process beyond recovery. So we don't want that to happen, but here's where the trap flag can actually help us.
Basically, right before we execute an instruction we're gonna set the trap flag, so what will happen is, when that instruction executes, a trap will trigger that gives control over to our program, and our program can restore our registers to a known good state again. So, with all of these things - limiting ourselves to ring three; handling exceptions; initializing registers; keeping registers set to known good values; trapping execution - the injector survives, basically, it won't crash itself now. So we've got an effective way to search the x86 instruction space. Now, the next question is how do we make sense of these instructions that we're actually generating? So for this I designed what I call the 'sifter' process. It's kind of a wrapper around the injector. And the sifter's job is to look at what the injector's doing and pull out anomalies from the injector. So what do I mean by anomalies, what's an anomaly in x86? Well, what I really want to find is somewhere where the actual processor execution deviates from the processor's specifications, so my idea here was we could use a disassembler as a sort of ground truth for this search, because a disassembler is presumably written based on the processor specifications - a good abstraction of those specifications for us. So for this work I used the Capstone disassembler - a really great tool. So, by comparing the results of the actual execution to the disassembled results we can start pulling out interesting things from the architecture. For example we can find undocumented instructions now: basically if the disassembler does not recognize a byte sequence, but that byte sequence generates anything other than an undefined opcode exception when it executes, that means that this instruction exists on the processor even though it's not in the documentation.
We can find software bugs this way - the disassembler recognizes an instruction but the processor says that that instruction's length is something different. That's usually indicative of a software bug. And with the search we can also find hardware bugs, and there's actually a really good heuristic I came up with for these hardware bugs. Basically when everything goes haywire you can dive in and figure out what's really going wrong on a system. So, I wanna give you a quick demo, not sure how well this is gonna show up on our projectors but [background noise] we're gonna run the sandsifter tool. [pause] And what we see here is our tool fuzzing the x86 processor. Now in the top half of this image we're seeing these instructions that the injector is generating. On the right side you'll see the actual machine code - the raw bytes that the processor is executing. The part highlighted in white down here - that's the observed length of the instruction on this processor. And then on the left we see what Capstone, our disassembler, our ground truth, thinks this instruction actually is. And the sifter is basically just watching for differences between those two sides, and whenever it finds a difference, whenever it finds an anomaly, it's going to toss it into this little section down here, so that we can investigate that in a little bit more depth. So, after a little while, it will spit out a couple of anomalies, and I will talk a little bit about what these actually are momentarily. But the tunneling approach we're using right now is very, very thorough. It does a really good job of searching the x86 instruction space, but it's also not very fast. It takes about a day for this scan to complete, so if we want just quick results we can sort of change the fuzzing mode that we're using.
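The two comparisons the sifter makes, as described above, can be sketched as a small classifier. This is a hedged illustration rather than the tool's code: `Observation` and `classify` are hypothetical names, and it assumes a Linux-style setup where an undefined opcode exception reaches the process as `SIGILL` and a trapped, successfully executed instruction as `SIGTRAP`.

```python
import signal
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    raw: bytes                    # bytes handed to the processor
    observed_length: int          # length measured on the processor
    signum: int                   # signal received when the bytes executed
    disas_length: Optional[int]   # disassembler's length, None if not recognized


def classify(obs):
    """Compare what the processor did against what the disassembler believes."""
    exists = obs.signum != signal.SIGILL  # anything but undefined opcode
    if obs.disas_length is None:
        # Disassembler does not know these bytes, yet the processor ran them.
        return "undocumented instruction" if exists else "no anomaly"
    if obs.disas_length != obs.observed_length:
        # Disassembler and processor disagree on length.
        return "length mismatch, likely software bug"
    return "no anomaly"


hidden = Observation(bytes.fromhex("0fa7c1"), 3, signal.SIGTRAP, None)
print(classify(hidden))  # undocumented instruction
```

The hardware-bug heuristic mentioned in the talk is not modeled here, since the talk does not spell out a rule for it.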
So I added totally random fuzzing into this, and it doesn't do as good a job searching but it does let you get very quick results. So, if we change over to totally random fuzzing mode we'll start to really see a lot of anomalies appearing in the processor. Basically what every one of these indicates is an instruction that exists on my x86 chip but that we're not told about in the reference manuals, and I think that's kind of scary, that all of these things are on my processor but nobody's acknowledging what they actually are. So, if you let this run for a day it'll quit, it'll dump out all the results; on most modern systems it'll come up with one to three million interesting things to look at. Now, that's a lot to go through by hand and try to make sense of, so I built this summarizer tool. Again, I don't know how well that's gonna show up on the monitors. But what the summarizer does is it tries to condense all that information and tries to pull out the most meaningful parts for you. So, for example, what the summarizer is saying here is that it found thirty-two different instructions that start with 0F 18 that appear to not be documented. Or if I dive into one of these instructions I can drill down in a little bit more depth. So, it's telling me the instruction 0F A7 C1 existed on this processor but none of the three disassemblers that we gave that byte sequence to understood what that instruction was. That's a really strong indication that this is an undocumented instruction sitting on my processor. [sigh] So basically this gave us a way to systematically scan our processors for secrets and bugs. So I scanned eight of the systems in my test library and I want to share with you the things that we found, because we found some really interesting things. First, we found hidden instructions in every single processor we scanned.
[background noise] We found ubiquitous software bugs, hypervisor flaws and some very, very serious hardware bugs as well. So, when I set out to do this, I was trying to find hidden instructions on the processor, so let me share the hidden instructions with you first. So, I scanned an Intel Core i7 chip; this was a chip manufactured in 2012. And these are some of the hidden instructions that I found: 0F 0D, 0F 18, 0F 1A, 0F AE. Most of these instructions are documented for certain combinations of bits inside of the instruction, but even instructions without that specific combination - the ones that aren't documented - still run. Now, for some of these, Intel has recently updated their processor specification to document these instructions. So, for example, 0F 18 was added to the processor documentation in December of 2016. The thing is, I was scanning a chip made in 2012. These instructions were sitting on these processors for a very, very long time before they were acknowledged by anybody. This was another set of instructions I found on that Intel chip: DB E0, DF F1, C0 D0, D2 F6 F7, a whole wide range of instructions which have absolutely no documentation in the reference manuals. So that's a little worrisome to me, so I started scanning others; I scanned an AMD Geode processor. And what I found when I started scanning processors from other manufacturers is that there's actually a lot of overlap in the undocumented instructions between these processors, which kind of indicates that there's some kind of collaboration going on between different manufacturers as to what these undocumented instructions are actually going to be doing. So the really interesting parts, I thought, were the places where there was no overlap, where this was really a unique set of instructions only on that processor. So for the AMD chip I scanned, the part where there was no overlap was 0F 0F 40 80 followed by some byte.
Now, AMD documented some versions of the last byte but the vast majority of those instructions aren't documented - it's just functionality hidden in the processor. So, next I scanned a couple of VIA Nano chips; the unique parts of the VIA architecture were the instructions around 0F A7. So 0F A7 is a unique VIA instruction set extension called the 'PadLock instructions'. So you can actually look up VIA's PadLock reference manual and see that all of these instructions have to do with cryptographic functions. However, the PadLock reference manual doesn't acknowledge the existence of C1 through C7 as that final byte. So, these are cryptographic instructions that are doing something on the processor, but there's no hint as to what they're actually doing. So what do these do? Some of these instructions - if you Google around you'll find that people have stumbled across these in the past and have reverse engineered what they do, basically looking at the register differences before and after an instruction executes in order to find out what that instruction must be doing. But some of these instructions have absolutely no record at all - they're not in the documents; they're not in the processor specifications; they don't appear anywhere online, and that's, I think, scary in terms of trust for these things. So those are the hidden instructions I found; this thing also ended up returning a lot of software bugs. So, this is not what I set out to find but some of the results we got are sort of interesting. So the issue here is that the sifter process is forced to use a disassembler as its ground truth - it turns out every single disassembler that we tried for our ground truth was absolutely littered with bugs. So, most of the bugs we encountered with this tool only appeared in a couple of tools so they weren't all that interesting.
Uhm, but some of the bugs we, we encountered appeared in all tools that we checked. Those can actually be used to an attacker's advantage. So, two of the more interesting ones I found were these jump and call instructions. So, in the 64-bit version of the x86 architecture, uhm, the E9 instruction is a jump instruction; E8 is a call instruction and 66 is supposed to be what's called a data size override prefix. The idea is that that 66 prefix on these instructions is supposed to change the default operand sizes of the instruction. Now, the default operand size here is 32 bit - so 66 is supposed to change that to either 16 or 64. But, on Intel processors, for whatever reason, they silently ignore that 66 prefix on these instructions alone. So, uhm, why does that matter? Well, turns out because of that everyone is parsing these instructions wrong - I tried this in IDA, Valgrind, GDB, objdump, Visual Studio, Capstone, QEMU, uhm, every single tool I tried is parsing these instructions incorrectly. And, and here's sort of how we can use that to our advantage as an attacker. So, uhm, here we're seeing how IDA parses this instruction - you'll see that IDA thinks this is a 4-byte instruction; it saw the 66 override - thought that changed the next, uh, uh, operand to 16 bits instead of 32 bits so IDA's getting the wrong length for this instruction. Here's Visual Studio's view of the instruction. Visual Studio actually recognizes that 66 didn't change the operand size, uhm, but it thinks that 66 caused the destination of this jump to get truncated to 16 bits - that's also not the correct, uhm, behavior. So, Visual Studio isn't able to resolve the, uh, the target of this jump. So, you can actually use this, uhm, for some malicious things by throwing off the disassembler - so I made a little bit, a little malicious program. And, and had objdump try to analyze it. So, you'll see here objdump is misanalyzing, misparsing that jump instruction.
It gets the wrong size for that jump instruction. Now objdump thinks we have a jump followed by an add, followed by an add, followed by an add, followed by a movabs, except objdump got the wrong size for this instruction - it's messed up all of the previous instruc-, or, all of the following instructions that it tries to disassemble. So, that's useful too because I embedded a malicious instruction inside of the operands for one of these movabs instructions. What this jump is really gonna do, it's gonna jump into the middle of one of these instructions and execute malicious code that objdump can't see. And to highlight that - sort of the implications of that, uhm, I wrote this little program that you can run in QEMU, uhm, so it's a popular emulator. So, what I'm going to do here, I'm going to SSH into our QEMU VM and I'm gonna run my little program and as far as QEMU can tell when this program runs it's just totally benign behavior - it prints out "I'm totally benign". And no matter how many times you run this, QEMU is only going to see benign behavior but that's because QEMU is misparsing a jump instruction in this code. So, you, when you run this on bare metal you find this is actually a malicious program - it prints out "I'm actually malicious". [laughter] And the neat thing about this, there's, there's not any QEMU detection logic inside of this program. The only thing that allows this is one misparsed jump instruction that everybody is getting wrong. So, I think that was, uh, that was kinda interesting. Uhm, so I was curious about why everybody was misparsing these jump instructions, uh, and my best theory right now is that AMD actually obeyed that override prefix on this whereas Intel ignores that override prefix in their architecture. Uhm, and that's a little bit troubling when we can't agree on a standard for our processor architecture.
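To make the length disagreement concrete, here is a minimal Python sketch of three ways a tool might decode a `66 E9` jump. It is an illustration only: the function name, the model names, the example bytes and the base address are all assumptions made for this sketch, and the three models are simplified from the behaviors described in the talk, not taken from any tool's actual code.

```python
import struct

def decode_jmp(code, addr, model):
    """Decode a `66 E9 ...` near jump located at `addr`.

    `model` is a hypothetical label for one of three interpretations:
      prefix_ignored   - 66 is ignored: rel32 operand, 6-byte instruction
      operand_16       - 66 shrinks the operand: rel16, 4-byte instruction
      target_truncated - rel32 operand, but the target is cut to 16 bits
    Returns (length_in_bytes, jump_target).
    """
    if code[0] != 0x66 or code[1] != 0xE9:
        raise ValueError("expected a 66 E9 jump")
    if model == "prefix_ignored":
        (rel,) = struct.unpack_from("<i", code, 2)  # signed 32-bit offset
        length = 6
        return length, addr + length + rel
    if model == "operand_16":
        (rel,) = struct.unpack_from("<h", code, 2)  # signed 16-bit offset
        length = 4
        return length, addr + length + rel
    if model == "target_truncated":
        (rel,) = struct.unpack_from("<i", code, 2)
        length = 6
        return length, (addr + length + rel) & 0xFFFF
    raise ValueError(f"unknown model: {model}")

# Example bytes (assumed): 66 E9 followed by the offset 0x10.
CODE = bytes([0x66, 0xE9, 0x10, 0x00, 0x00, 0x00])
BASE = 0x400000  # assumed load address

for model in ("prefix_ignored", "operand_16", "target_truncated"):
    length, target = decode_jmp(CODE, BASE, model)
    print(f"{model:16s} length={length} target={target:#x}")
```

A decoder that picks the 4-byte length resumes disassembly two bytes too early, so every instruction after the jump is decoded from the wrong offset. That is the desynchronization that lets code hide inside another instruction's operand.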
The last time that happened, when Intel deviated from AMD's specifications, uh, just a little bit, it resulted in the very serious vulnerability called the 'SYSRET bug' that allowed kernel privilege escalation. Uhm, so you might think, why don't we update all of our tools to fix this issue? Why don't we just do it the way Intel does? They have 95% of the market share. Well, you can do that but then AMD is going to be vulnerable. There's really no winning here, no correct answer. Uhm, and I think it kind of highlights the impractically complex nature of this architecture when tools can't even analyze a simple jump instruction correctly. So, uhm, switching gears a little bit - uhm, early on in the development of this I was getting fairly tired of waiting a day for my scans to, uh, complete. So, I wanted scans that could run a lot faster - I wanted to enumerate the instructions in the instruction, the instruction set in an hour instead of a day. So, I added multi-core support to this fuzzer, uhm, and I thought, wouldn't it be neat if I could run this thing on 20 cores? So, I rented an Azure instance, uhm, and ran this tool. But I very quickly found out that my tool wasn't running correctly on my Azure instance. And it turned out that there's a small bug in the Azure hypervisor. Namely, Azure doesn't emulate the trap flag correctly if the trap flag is set on the CPUID instruction. So, the idea here is that if you execute CPUID, uhm, it causes a VM exit and the hypervisor takes over. [background] And the hypervisor's supposed to take over and emulate that CPUID instruction and then the last thing it's supposed to do is check whether or not the trap flag was set - if it was, it should inject the trap into the guest VM. Azure forgets that last step, uhm, and what that, uh, looks like is, is something like this. So, I've got a little program here, uhm, and I'm gonna run it on bare metal. Uh, it's basically gonna check that CPUID trap behavior.
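The VM-exit handling described above can be pictured with a toy Python model. This is a sketch under stated assumptions, not real hypervisor code: the function name, the instruction list and the `hypervisor_injects_trap` switch are all hypothetical, and the model only tracks where the first single-step trap is reported.

```python
def first_trap_index(program, hypervisor_injects_trap):
    """Return the index of the instruction the guest is about to run
    when the first single-step trap is delivered, or None if no trap.

    Toy assumptions: the trap flag is already set when the first
    instruction runs, and only `cpuid` causes a VM exit.
    """
    trap_flag = True
    rip = 0
    while rip < len(program):
        insn = program[rip]
        rip += 1  # the instruction completes and RIP advances
        if insn == "cpuid":
            # VM exit: the hypervisor emulates CPUID, so the hardware
            # never raises the trap for this instruction on its own.
            if trap_flag and hypervisor_injects_trap:
                return rip  # trap injected right after CPUID
            # Otherwise the pending trap is dropped and the guest resumes.
        elif trap_flag:
            return rip  # native execution: trap after this instruction
    return None

PROGRAM = ["cpuid", "nop", "nop"]

print(first_trap_index(PROGRAM, hypervisor_injects_trap=True))
print(first_trap_index(PROGRAM, hypervisor_injects_trap=False))
```

With injection the trap is reported at index 1, the first NOP. Without it the trap only arrives after the first NOP has executed natively, so it is reported at index 2, the second NOP.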
And it's telling me it's gonna execute a CPUID NOP NOP instruction sequence and the trap flag is gonna be set on the first CPUID instruction. Now it expects to get a trap at that first NOP and sure enough when it ran the code it got a trap at that first NOP instruction. But now what I'm gonna do - I'm gonna SSH into, uh, my Azure instance and try to run that exact same piece of code. Uh, what you'll see is we get different results inside of the Azure hypervisor. Uh, when I executed the CPUID NOP NOP, we expected a trap at that first NOP but Azure gave us a trap at that second NOP. Uhm, that's a very small bug in the hypervisor. So, uhm, this isn't a security critical bug, it's not an especially big deal, it would be pretty hard to run across this in a normal situation but it's always a little bit troubling when the hypervisor can't faithfully emulate the underlying hardware. Uhm, so that sort of brings us to the last class of issues I found with this processor scanning tool. Uhm, we found some hardware bugs, and hardware bugs are always troubling no matter how small they are because a bug in your hardware means you have that exact same bug in all of your software. Hardware bugs are very hard to find and they're very difficult to fix. So, I started out scanning some Intel processors - a Core, a Pentium, and a Core i7 processor. And I didn't actually find anything too interesting on these - I did find the F00F bug on the Intel Pentium - that's a, a, a malformed instruction that can totally lock the processor but that's a very old, very well-known bug. Uhm, so that was kind of anticlimactic but it sort of validated that this tool is capable of finding these malformed instructions that could have serious, uh, uh, consequences. Uh, next, I scanned an AMD - a couple of AMD systems. On several of these AMD systems I noticed, uhm, uhm, some odd behavior - basically they could generate an undefined opcode exception before they completed the instruction fetch.
So, I dove into the AMD specifications - that's not the correct behavior for these processors. Uhm, basically a page fault during instruction fetch should, should take priority over an undefined opcode exception. Uhm, so that was an errata. Uhm, until, when I was getting these slides ready, uh, to present, I found AMD - somebody at AMD must have found this recently because they updated their document in March of 2017 - to have this tiny little footnote down here. This footnote basically says "okay, this behavior is allowed". And I think this is a bit, a bit of a copout. Uhm, if you've got a processor errata and you update your documents to allow that errata - is it still an errata? [laughter] Uhm, I think it is, but, uhm, maybe it's not anymore. I scanned a Transmeta processor - so, Transmeta's not so popular anymore but they were kind of popular a few years ago. Uhm, I found some interesting behavior on a Transmeta processor: with the instructions 0F 71, 72 or 73 the Transmeta would give us a, uh, floating point exception during the instruction fetch. So, this is also the incorrect behavior. Basically, if you had an FPU exception pending and executed one of these instructions you'd receive an FPU exception in the middle of that instruction fetch. Uhm, the correct behavior here, if that last byte of the instruction is on an invalid page, is the page fault. Uhm, so again, this is a very minor errata but it's always a little unsettling when there are bugs in our processors. Uhm, but that, that takes us to our last finding and I think this is the most interesting one. Uhm, we actually found, on one specific processor, what appears to be a halt and catch fire instruction.
So, a, [laughter] a 'halt and catch fire' instruction, if you haven't, if you haven't seen this, uhm, is a - well, it can mean a lot of things but in this case it's a single malformed instruction that, when we execute it, execute it in ring three - the least privileged realm of execution on these processors - it completely locks, uh, the processor. So, I wanted to make really sure this wasn't a, uh, kernel bug, uh, instead. So, I tested this on two different, two different Windows kernels; three different Linux kernels but then I really, really wanted to be sure this wasn't a kernel bug. Uhm, so I actually wrote a small loadable kernel module for Linux that would hook all of the interrupts in the interrupt descriptor table and dump serial debug information whenever interrupts occurred. I was a little bit worried that maybe this infrastructure was causing what's called an 'interrupt storm' to make it look like the processor was locked when it wasn't really locked. But this also seems to validate that the processor is completely locked when we execute this instruction. So, unfortunately I found this about two weeks ago during one of my scans, which means the vendor has not had time to respond to this issue. So there's no detail available right now on the actual chip affected, the vendor or the actual instruction format. But I did wanna give you, uh, a demo of, uh, this thing in action. So, I've got Debian booted on this processor - you can see it can just run pretty much any program. It can run 'Hello World' and print out 'Hello World'. But when I run 'a.out', uhm, it's gonna do something different. Now, the only instruction in a.out is this one malformed instruction - that's, that's all there is. And when I run it the processor locks. Now, I don't mean the process locks, uhm, I mean the processor itself is completely locked up at this point. You can try and use 'Control-C' to break out of this program. You can choose to change levels.
Uhm, the system won't respond anymore. You can even try the Linux magic SysRq keys in order to just crash or reboot the system and you'll get no response. 'Cause that processor is done executing instructions at this point. [laughter] So, uh, uh, I was pretty excited about this, uhm, as far as I know this is the first such attack on x86 found in 20 years. The original version of this, uhm, [applause] [cheering] So, so the, [applause] Alright, so the original, uhm, halt and catch fire instruction was that Pentium F00F bug on the original Pentium in, uh, 1997, uhm, so I think that's, that's pretty cool and I don't wanna, don't wanna try to make people panic. Uhm, this is on a, uh, very specific chip that is not in widespread use. So, it's not like the sky is falling here. I think the, the interesting part about this is that we have a tool that can actually find these kinds of very serious bugs, because it is, uh, a serious bug. And it should be apparent, I think, that this is a, a very serious security concern if an unprivileged user on the processor can initiate an entire processor DoS that's gonna lock everyone else out of the system. Um, so, so it is a really neat thing. And, uh, I'm hoping and, uh, and if everything goes well with responsible disclosure I'll be able to tell you more about that within the next month if you, uh, stay tuned. So, uhm, I didn't want this to just be an academic thing. I wanted people to actually be able to use this tool and scan their processors. So, this now is open sourced on GitHub - that's github.com/xoreaxeaxeax/sandsifter. Uhm, and I encourage you to try this thing out because we don't really know what's in our processors. So, uhm, you should use this to audit your processor, find the secret, secret instructions in it. Break disassemblers, break emulators, hypervisors. Find these halt and catch fire instructions. Like, my system didn't literally halt and catch fire but yours might and I think that would be awesome!
[laughter] So, [laughter] so, you won't know unless you, you actually try this out on your specific system. And I wanna, I wanna highlight I've only scanned a few systems with this and this is really just a fraction of what I found on those few systems. This is just what I could cram into a 45-minute presentation and so if that's what I found on these few scans - who knows what you'll find on, uh, on your computers. So, uhm, check your system. If you're, if you're not sure, uhm, about the results that this thing is generating - you know, it dumps out all these weird instructions - uhm, you can send me those results, I'd be happy to, uh, try to dissect them for you. Uhm, but the real point here is we need to stop blindly trusting specifications. Uhm, if we don't trust software we shouldn't trust hardware. We need tools that would actually audit our processors and make sure that they're really doing what we're told they're doing. And that's what sandsifter lets us do. It sort of gives us a really important, critical first step in introspecting this black box that's at the center of all of our systems. Uhm, so, again, you can find this on GitHub right now - it's the sandsifter project. Uhm, you can find some other, uhm, fun things that I've tinkered with over the last few years on there. Like, I wrote a single instruction C compiler for no reason. [laughter] Uhm, I wrote some tools for manipulating program control flow to, uh, show images in IDA. A couple of years ago I released an architectural exploit on x86 and there's, there's lots of other random things that I've, I've played around with over the last few years that you can check out on there. If anybody has feedback or ideas I would absolutely love to discuss that with you. Uhm, so grab me after the talk or, uh, you can reach out to me - my name is Christopher Domas, I'm on Twitter at xoreaxeaxeax - or same thing at gmail dot com.
And like I said I'll hopefully be releasing all the details for that processor vulnerability within the next month. Uhm, so thank you everyone! [applause]