Hi. Welcome, everybody. Thanks a lot for coming to my talk. What you are about to see here today is pretty cool stuff. I had lots of fun working on this project, so I hope you will find it cool too. In a nutshell, I am about to show you how to set up a cache timing covert channel so that two VMs co-located on the same physical box, on the same socket, can talk together without being detected. The "without being detected" part is important here, because this talk is all about a practical implementation, not just some proof-of-concept stuff. Sorry. Just a second. Alright. So, before we begin, the usual disclaimer: this is a research project that was done on my own time, on my own network. This talk reflects my own opinion and not that of my employer. The information and code provided here should be used for informational purposes only. Alright. Let me introduce myself and also give you some context around how I ended up working on this project in the first place. My name is Etienne Martineau. I'm from Ottawa, Canada. I currently work for Cisco Systems, where I do Linux kernel and KVM stuff. As a kid I was fascinated by electronics and radio, especially the concept of modulation. It was only much later, during my studies at Laval University, that I finally understood what was going on. That was cool. But then I got a job and ended up spending several years on the Linux kernel, and as you may imagine, I forgot all about modulation until recently, when, as part of some low-level analysis I was doing on KVM, I noticed something strange. Basically, I observed some sort of crosstalk going on between two virtual machines. The crosstalk was very subtle, but the tool I was using was sensitive enough to detect it. Doing more investigation, I realized that the two VMs were being assigned to the same physical core, but different threads of execution. This threads-of-execution concept is typically known as SMT, or hyperthreading on Intel. I did more research and found a nice diagram from Intel. Here we can clearly see that the two execution threads are actually sharing some common functional units, and some operations have to be serialized one after the other. So that started to explain the result I got. Then I had an idea. What if, on one of the hyperthreaded siblings, I modulate the contention pattern over the execution pipeline? Say a long instruction is a one and a short instruction is a zero. And what if, on the other hyperthreaded sibling, I try to detect the amount of contention over the execution pipeline by executing an instruction and measuring the time it takes? If it's slow, it's a one. If it's fast, it's a zero. I realized that with this technique I could do pretty cool stuff, such as sending information from one VM to another, assuming that the two VMs are being assigned to the two hyperthreaded siblings. So I ended up spending quite a bit of time on this project. One of my goals was to see with my own eyes the quality of the communication channel. Naturally, I decided to try to transmit an image so I could see the resulting output on the other side. Obviously, the very first image I used back then was not this one; it was a picture of my kids, which I am obviously not going to show around here. So I used the Def Con logo. It has been reformatted at 640 by 480, VGA quality, 1 bit per pixel. And this is what I got on the other side. That was pretty cool.
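To make that first idea concrete, here is a rough C sketch of the pipeline-contention channel. This is not the actual code from the project: the instruction mix, slot length and threshold are placeholders that would need calibration, synchronization between the two sides is omitted, and the two processes have to be pinned to the two SMT siblings of the same core (with taskset, for example).

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Read the time stamp counter. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Sender: a '1' is a burst of long-latency instructions that keep the
     * shared execution units busy; a '0' is (almost) no work at all.      */
    static void send_bit(int bit, uint64_t slot_cycles)
    {
        uint64_t end = rdtsc() + slot_cycles;
        volatile double x = 1.234;

        while (rdtsc() < end) {
            if (bit)
                x /= 1.0001;                     /* long-latency divide    */
            else
                __asm__ __volatile__("pause");   /* short, polite idle     */
        }
    }

    /* Receiver: time a fixed chunk of work on the sibling thread.
     * If it runs slow, the sender was contending: read it as a '1'.       */
    static int recv_bit(uint64_t threshold_cycles)
    {
        volatile double y = 5.678;
        uint64_t t0 = rdtsc();

        for (int i = 0; i < 1000; i++)
            y /= 1.0001;

        return (rdtsc() - t0) > threshold_cycles;
    }

    int main(int argc, char **argv)
    {
        if (argc > 1 && strcmp(argv[1], "send") == 0) {
            for (int b = 0; ; b ^= 1)            /* send 1010... forever   */
                send_bit(b, 5000000);            /* ~2 ms slot at 2.4 GHz  */
        }
        for (;;) {
            putchar(recv_bit(30000) ? '1' : '0');
            fflush(stdout);
        }
    }

The divide is just a convenient long-latency instruction here; anything that keeps the shared functional units busy would do.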
I was able to see all the #(Inaudible 4:21), but I was still able to get some information out of it. It was at this point that I realized the problem relates to security, and I said to myself, there is a big issue with this kind of stuff. And this is basically why I am here today. By the way, before we move on to the core of this talk, just for the fun of us here at Def Con, I have a recording that shows what happens in real time when we take that image and send it over and over again at 15 frames per second. One thing I did in that recording is start and stop my noise generator in the background, which is essentially a compilation of the Linux kernel, and you will see the effect on the communication channel. One more thing: I'm running the mp4 encoding software on the same machine where I am running this experiment, so on its own this thing is generating quite a bit of noise. Alright, let's take a look at the video. We see that when the Linux kernel compilation is running, the channel is completely saturated by noise, and that is expected, because the pipeline is running so many instructions at the same time, and so many context switches are coming in, that nothing can get through to the other side. Alright, let's take a step back here. My goal was to come up with a practical implementation, not just some theory. Why? Because I wanted to prove that this is a real issue and that we need to fix it. So in this talk we are going to go over the design and what it takes to build such a cache timing channel. Basically, we are going to go over the shared resources on x86 multicore. I'm going to show you how to encode and decode data using a cache line. Doing that, we will see the effect of the hardware prefetcher, and I am going to show a trick we can use to get around it. Then we will also see that the encoded data we put in the cache line doesn't stay there for very long, especially with VMs, because with VMs there is lots of noise. I will also show you how to find cache lines that are shared across VMs. After that I will show you my implementation, which essentially enables two processes running in different VMs to synchronize together very, very precisely. Finally, we will look at some detection and mitigation aspects, and toward the end of this talk I will do some bandwidth measurements on this channel and we will also go over a reverse shell example. Alright. So, when you have hyperthreading enabled, there are lots of possibilities for inter-VM modulation, assuming that the two VMs are being assigned to the two hyperthreading siblings on the same core. You can do pipeline contention; this is the first example I showed you. But you can also do modulation in the L1 cache, or modulation in the L2 cache. Now, if you have hyperthreading disabled-- Hi. >> (Cannot hear him/her.) Maybe this one? Alright. So, as you guys are all very familiar with by now, we have a fantastic tradition called "Shot The New". Has this guy been doing good? (Applause) Alright. That is exactly what I like to hear. Alright. We're not going to hold him up too much longer. To new speakers (cheers), to new attendees, and to those new to Def Con 23.
(Applause) >> Alright, as I was saying, now if hyperthreading is disabled, which is typically the case because this type of issue was reported way back in 2005, looking at the bottom of this slide we can still do modulation, but this time in the L3 cache. That is what this talk is all about. Obviously, if the VMs are assigned to different sockets, this cache timing modulation will not work, because the caches are not shared across sockets. But then there is that bus that connects all the caches together, the cache coherency mechanism, which could potentially be used. That is also interesting, but it is outside the scope of my talk today. Alright. Now it's time to understand how we encode data in the cache. A cache line typically holds 64 bytes, so when you read a byte that is not in the cache, the whole cache line is brought in from memory. The basis of this trick relies on the fact that we can measure very accurately the time it takes to read a byte from memory. When we get it from L1 it's very fast, from L2 a bit slower, from L3 even slower, and from main memory it's very slow. So the way we encode a pattern in a cache line is that we load or flush a particular cache line: let's say when it's loaded it's a one, flushed it's a zero. For the decoding part, we measure the time it takes to read a byte that belongs to this cache line. If it's fast it's a one, because that cache line was loaded; if it's slow it's a zero. Sounds simple? Alright. Let's take a look at a practical example. When I started this I wrote a simple client and server test program. There is no VM in the picture at this point; this thing is running directly on Linux, on the host, and the cache lines come from shared memory. Here the client is encoding a pattern; this is the graphic you see at the bottom left. Once the pattern is encoded in the cache, the client signals the server on a mutex, and the server wakes up and does the decoding. But now there is something weird: this is what I got on the other side. There is clearly a pattern when you look at it; this is not pure noise. Alright, then I decided to take a step back. I wrote a simple test that flushes all the cache lines from zero to one hundred and then measures the time it takes to load them back. I was expecting long latency for all of them, but there was obviously something else going on: some of the cache lines were exhibiting long latency, but lots of them were very fast. What is going on? This is when I learned about prefetching. Prefetching means bringing data or instructions from memory into the cache before they are needed. On the processor I am using, which is a Xeon 5500, there is more than one prefetcher: there is a prefetcher for L1, a prefetcher for L2, and there is also #(Inaudible 12:06). But at the end of the day, this thing is trying to predict what addresses will be needed in the future. Alright, before I move on: this hardware prefetcher is one of those things you can control at the BIOS level directly, enabling or disabling it. That is obviously not what I am doing here, because we don't have that kind of access to the machine, so we have to work around it.
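Before getting to that workaround, the load/flush encoding and the timed read described above boil down to roughly this in C. A sketch only: the 150-cycle threshold and the single in-process test in main() are placeholders, and on a real system you would calibrate the threshold against measured L3-hit versus DRAM latencies.

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>      /* _mm_clflush, _mm_mfence, __rdtscp */

    /* Encode one bit into one cache line: present in cache = 1, flushed = 0. */
    static void encode_bit(volatile uint8_t *line, int bit)
    {
        if (bit)
            (void)*line;                     /* touching it pulls the whole line in */
        else
            _mm_clflush((const void *)line); /* evict it back to memory             */
        _mm_mfence();
    }

    /* Decode one bit: time a read of the same line.  A fast read means the
     * line was still cached (one); a slow read means it came from DRAM (zero). */
    static int decode_bit(volatile uint8_t *line, uint64_t threshold)
    {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*line;
        uint64_t dt = __rdtscp(&aux) - t0;
        return dt < threshold;
    }

    int main(void)
    {
        static uint8_t buf[64] __attribute__((aligned(64)));
        encode_bit(buf, 1);
        printf("decoded: %d\n", decode_bit(buf, 150));
        return 0;
    }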
The workaround I came up with is to simply randomize the cache line access. First, I randomize the cache line access within the page. Fair enough. But it turns out that we also need to randomize the cache line access at the page level. In other words, you cannot just go with an incremental pattern at the page level, because the prefetcher will kick in, detect it, and basically try to prefetch your cache lines ahead of time. Doing all that is apparently enough to defeat the hardware prefetcher, at least on the machine I was running on. Then I faced another problem: what happens if you end up waiting longer before doing the decoding? Right now I have a client that encodes the data and signals the server, and the server wakes up and does the decoding; it's very fast. But what happens if you wait, wait more, and wait even more? Well, we clearly see that the time from when you encode the data to when you decode it has to be very small, otherwise the other stuff running on the system will kick in, start to pollute the cache, and essentially erase your data. In other words, the encoded data in the cache evaporates pretty quickly. And this is even more true for us when we are running in a VM, because with VMs there is lots of noise. Speaking of noise, I have done a couple of experiments to try to characterize it. I have a test program that uses a calibrated software loop that takes exactly two CPU cycles per iteration to execute. I run that loop 100,000 times, so I expect the execution time to be 200,000 cycles, and I repeat that test 1,000 times. When you are running in a bare-metal kernel with all interrupts disabled, there is no noise: the loop always takes 200,000 cycles, over and over again. But if you are running in user space, with processes running and so on, there is some noise. Actually, the noise is coming from the host operating system doing interrupt handling and all that stuff. By the way, the small spikes that we see there are the timer interrupts on a per-CPU basis; this is a six-core machine. The bigger spike is a network interrupt that is running on CPU zero. Now, if you are running in the kernel inside the virtual machine, even with all interrupts disabled in the guest, there is still quite a bit of noise, because the host kernel is running, all its interrupts are running, and you have some noise because of the hypervisor layer. And finally, if you are running in VM user space, which is where we will be when we do this communication, there is quite a bit of noise, because the guest kernel has its own timer interrupt and all that stuff running, plus the hypervisor layer, plus the host. By the way, it looks really bad on the graph, but if you do the math, the degradation comes down to about 2%, which is about what we expect for a compute load running in a VM.
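The noise experiment just described has roughly this shape. Again a sketch: the loop is only approximately fixed-cost, the counts are illustrative, and a real run would histogram the samples rather than just keep the minimum and maximum.

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define INNER 100000        /* iterations of the calibrated loop   */
    #define RUNS  1000          /* how many times the test is repeated */

    int main(void)
    {
        unsigned aux;
        uint64_t best = UINT64_MAX, worst = 0;

        for (int r = 0; r < RUNS; r++) {
            uint64_t t0 = __rdtscp(&aux);
            for (volatile int i = 0; i < INNER; i++)
                ;               /* fixed-cost busy loop                */
            uint64_t dt = __rdtscp(&aux) - t0;
            if (dt < best)  best = dt;
            if (dt > worst) worst = dt;
        }
        /* The spread between best and worst is the noise we live with. */
        printf("min %llu cycles, max %llu cycles, jitter %llu cycles\n",
               (unsigned long long)best, (unsigned long long)worst,
               (unsigned long long)(worst - best));
        return 0;
    }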
Alright. Now we understand the noise, and we have a way to trick the hardware prefetcher. Now it's time to put the client in VM1 and the server in VM2. Remember, in the first test we did, all that stuff was running on the host directly. So I put my client and server in VM1 and VM2, but then I realized there is another problem: the cache lines I was using initially are not valid anymore. Basically, the L2 and L3 caches are tagged by the physical address, but in the VM the physical address that you see has nothing to do with the real physical address on the host that the cache is using. Why? Because there is another translation layer, and the guest in the virtual machine does not have access to that information. It's a tricky problem to solve, but I don't think it's impossible. Fortunately for us, we don't have to worry about this issue at all, thanks to KSM, at least as long as KSM is enabled on those systems. What is KSM? KSM is a kernel thread that runs on the host kernel and scans the running processes, comparing their memory. If it finds identical pages, KSM merges them into a single page. Obviously, if one of those programs then wants to modify one of those pages, KSM kicks in and does the un-merging. KSM is pretty useful because it saves a whole lot of memory with VMs, especially when the guest operating system in one VM can be shared with the guest operating system in the other VM. Alright, coming back to that slide. The idea I had was to create a unique per-page pattern in memory that is the same across the client and the server. The idea is that on the host, KSM will scan those pages and eventually do the page deduplication for us. Note that the per-page uniqueness is important: if different pages are identical in content, KSM will detect that and merge them on top of each other, so you end up overlapping your cache lines, which is obviously not what you want. A side comment: with KSM you can do pretty cool stuff, such as identifying the operating system or the application that is running beside you. All you need to do is load into your own memory the image of what you think is running beside you. Then you wait a bit, because the KSM deduplication process takes time. Then you write to some of those pages and measure the time it takes. If the write takes much longer than a normal write inside your virtual machine, it means KSM was involved all the way down on the host: the page had been merged, your write forced the un-merge, and that means you have a match. You have basically identified what is running beside you.
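That side comment, in code, could look something like the sketch below. The page content, the sleep time and the cycle threshold are all placeholders; in reality you would load the actual image of the OS or application you are guessing at, probe many pages, and calibrate the threshold against normal in-guest write latency.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <x86intrin.h>

    #define PAGE 4096

    /* Time the first write to a page: breaking a KSM-merged (copy-on-write)
     * page has to go all the way down to the host, so it is much slower
     * than a normal write inside the guest.                               */
    static int looks_merged(uint8_t *page, uint64_t threshold)
    {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        page[0] ^= 1;
        return (__rdtscp(&aux) - t0) > threshold;
    }

    int main(void)
    {
        uint8_t *guess = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (guess == MAP_FAILED)
            return 2;

        /* Stand-in: in reality, fill with the page you think your
         * neighbour is running (guest kernel text, library pages, ...). */
        memset(guess, 0x41, PAGE);

        sleep(60);              /* give KSM time to scan and merge       */

        return looks_merged(guess, 10000) ? 0 : 1;   /* 0 = likely match */
    }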
Alright, coming back to that picture again. I realized there was another problem with my design: there is no synchronization primitive across processes running in different VMs. Remember, when I was running directly on the host, the client was signaling a mutex and the server was waking up and doing the decoding. Here there is no such thing. Well, in reality there are ways to do it; for example, on Linux with KVM there is ivshmem, where you can basically send a signal from one VM to another, but that stuff is not enabled in production. We need something to replace the mutex. Why? Because we want the server to run right after the client so that it picks up the signal, right? Remember what happens if we wait too long: all the data is gone, so we have to be fast. So then I went through a couple of options; I did not really know how to attack this thing. Option one was basically to forget all about the synchronization aspect and hope for the best. With some error correction (ECC) we can still achieve some data transmission, even though the phase between the client and the server is totally random. This gives us a very low bit rate, obviously, but the CPU consumption is low, and that is kind of cool, because we don't want to burn CPU; otherwise everybody will detect us, right? Option two: I set it up so that there is a loop on each side and the client runs a bit faster than the server. At some point in time there is an overlap, and at that point the server picks up the signal and the transmission happens. This gives an okay bit rate, but the problem is that the CPU consumption is very high, and that is no good for us because we want to remain undetected; ideally we would like to be under one percent CPU usage. Option three is basically the phase-locked loop implementation I was talking about at the beginning. Let's find a common period on the server and the client and have the client and the server lock into place. How did I do that? At the beginning of each period, the server sends out a sync pattern. This sync pattern is very similar to the vertical sync you find in analog TV; it's the same kind of concept. Then the client runs a scan over that period and tries to detect that same pattern. Once it detects it, it locks onto it. So once the sync is detected, the client just shifts back the phase and now we are ready for transmission. But there is a tricky requirement for that to work: we need a monotonic pulse. We can in reality tolerate some jitter, but not too much, because in a VM there is lots of noise and the data evaporates out of the cache very quickly. In theory all that stuff looks fine, but in practice it's a bit more tricky. How can we achieve a monotonic pulse? The first thing that comes to mind is to use a timer. Timers are good, because we need to sleep anyway in order to avoid detection. But with timers there is a big problem. This graphic represents the latency distribution, in microseconds, of a timer that is running in the VM; this is a log-log scale. We see from that graph that there is lots of jitter: it can range from as low as 20 microseconds all the way up to almost 200 microseconds. And if you factor in the original design, this jitter from the timer is going to come from both VMs at the same time, because they have the same kind of distribution. So there was just too much jitter for that to work: the data will not persist and transmission will not happen, for sure. Okay, so the idea I had was to compensate this timer in software up to some value above the maximum jitter, so that in theory this should give us a nice monotonic signal. Here we need to be a bit careful, because this compensation is itself subject to noise. In other words, the longer you try to compensate for, the more noise you end up accumulating from the underlying stuff running beneath you, right? The other thing we need to be careful with is that this compensation burns CPU, but on that front it's not too bad, because all we have to do is stretch the timer period to some higher number and we will still stay under one percent CPU usage. It's a tricky problem, but I believe in the end I got it right.
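Before the numbers, the general shape of such a compensated timer is sketched below: sleep for most of the period with an ordinary timer, then burn the last stretch in a busy-wait checked against the TSC, so the edge lands on an almost jitter-free deadline. The period, the guard band and the nap length are made-up values, and TSC-frequency calibration is ignored.

    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>

    #define PERIOD_CYCLES  2400000ULL   /* e.g. 1 ms at 2.4 GHz               */
    #define GUARD_CYCLES    500000ULL   /* larger than the worst timer jitter */

    static void wait_until(uint64_t deadline)
    {
        unsigned aux;

        /* Coarse part: nap until we are within the guard band of the deadline. */
        while ((int64_t)(deadline - __rdtscp(&aux)) > (int64_t)GUARD_CYCLES) {
            struct timespec ts = { 0, 100000 };  /* 100 us nap                  */
            nanosleep(&ts, NULL);
        }
        /* Fine part: spin on the TSC for the last stretch.                     */
        while (__rdtscp(&aux) < deadline)
            __asm__ __volatile__("pause");
    }

    int main(void)
    {
        unsigned aux;
        uint64_t next = __rdtscp(&aux) + PERIOD_CYCLES;

        for (;;) {
            wait_until(next);           /* edge lands within tens of cycles    */
            /* ... encode or decode one slot of the channel here ...           */
            next += PERIOD_CYCLES;
        }
    }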
The compensation I am using is basically a calibrated software loop that is kept in check against the TSC at every single point in time. And this is the result I got. My machine is a 2.4 GHz machine, and when it is running idle, this graphic shows the jitter I have on my compensated timer, in cycles. I have roughly 50 cycles of jitter on my timer, which corresponds to about 20 nanoseconds. Even on a loaded system I get roughly 300 cycles, which is about 120 nanoseconds. It's pretty accurate. If you compare that with the original timer, there is obviously no comparison here. Even if we put the latency back into perspective on the original graph, the jitter doesn't even show up at that scale, because the original timer was on a microsecond scale and now I am working on a nanosecond scale. As you may have already understood, this synchronization aspect is the key of this design, because it enables the communication to happen with very low noise, since the two processes run very precisely in step, while at the same time consuming little CPU. Okay, let's recap what we have so far. We have an encoding and decoding scheme that is based on memory access time: loaded (a fast read) is a one, flushed (a slow read) is a zero. We managed to get around the hardware prefetcher without disabling it in the BIOS, by randomizing the cache line access at the page level and so on. We also found cache lines that are shared across VMs, thanks to KSM. And we managed to design a phase-locked loop that gives very high precision across two processes running in two different VMs. Time for a demo now. Okay. With that technique I basically repeated the original experiment, which consists of sending that Def Con logo from one VM to another. We can see that this technique offers a pretty high quality, fairly low-noise communication channel, at least if you compare it with the original pipeline contention example I showed you at the beginning, right? There is no error correction running on the transmission channel, no retransmission, nothing. But as you can see, if you look carefully you will see in the picture a couple of bits that are flipped here and there. That is expected, right? The channel is a bit noisy. Alright. Now in this next experiment, again a recording, I repeated that exact same streaming experiment with my noise generator running on and off in the background. The first thing I want to mention, and this is kind of cool, is that when the transmitter is not running, the receiver is picking up the noise from whatever is running on the operating system. So to me, this could potentially be used to fingerprint the operating system that is running underneath. Also, in this recording, same as the previous one, the mp4 encoding software is running on the same machine, so on its own this thing is generating quite a bit of noise. But still, you will see the effect when, say, I move a window around, and of course you will also see the effect when I compile the Linux kernel; you will see the noise going on and off and so on. One last thing, I may have mentioned it before: there is no compression, no retransmission, no protocol. What you see is essentially the raw capacity of that channel. Let's take a look at the video. That is the noise I was talking about. You see the compilation of the Linux kernel totally saturates the channel.
So on the left is the source, and on the right is obviously the destination. (Applause) Thank you. Alright, we'll make it fast here. That video was transmitted at 60 frames per second, interlaced four times, so 15 full frames per second; one frame is VGA quality, 640 x 480, 1 bit per pixel. If you do the math it's roughly 4.5 megabits per second, and the CPU on both sides is at 15% utilization. Of course, if you use more CPU you can crank up the bandwidth on this one. Alright, now let's focus on something a bit more useful than streaming a picture from one VM to another. The first thing I'm doing here, and again this is the same two VMs I had originally, is running the server in loopback mode, which displays whatever was sent by the client, and the client in that mode is sending a bunch of #(Inaudible 31:18). In the background I'm running my noise generator on and off again; you will see the effect. You will see that the synchronization happens at some point. Alright, now transmission is going on. Alright, I'm going to hit the pause button for a second here. In my program, the reverse shell stuff, I have a way to turn on ECC on the communication channel, because as you can observe there are some bits here and there that are flipped, right? So now I'm just going to unpause and run the system with error correction turned on this time. Alright, now we see that the output is clean, right? (Applause) Thank you. Obviously, sometimes it will happen that more bits are flipped than what the ECC can correct, and in that case my program displays a couple of stars right there; that is the server displaying that. Obviously we can crank up the error correction; right now I believe I am using a 16-bit ECC over a 240-bit payload, so that can be increased easily. So now I'm going to remove the loopback mode, because I want to send commands to the other VM. That's the reverse shell mode, right? The loopback was just a test mode. So I'm going to unpause that thing. Again, the lock was pretty fast; the "lock" is when the phase-locked loop is synchronizing. Now I'm sending commands to that other VM, and the output is coming back. These are the processes that are running in the other VM. I'm just going to hit the pause button again. Well, to be imaginative here, I named my program "timing channel"; that is the program running in the other VM. As you can see, the CPU time this program has consumed is very minimal. This is the reverse shell. (Applause) Here I'm looking, I believe, at the source code of this program. Alright. You see that the reverse shell is not super responsive, and that is because the server has been configured in a mode where I don't want it to burn too much CPU on the other side. We can dynamically crank it up or down; right now I believe it's using half a percent of CPU in this demo, so you see the responsiveness level. So that's it for the demo. Now, what can we do to prevent this stuff? The first thing is to disable that page deduplication thing, or to set it up with a per-VM policy so that it doesn't cross the VM boundary. That's one thing you can do. It will take care of those inter-VM shared read-only pages; it's going to move them out, so this flush-and-reload technique essentially won't work. And it will also take care of that VM and application fingerprinting trick I was mentioning earlier. Obviously, this comes at the cost of higher memory usage.
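For reference, KSM already has knobs pointing in this direction: the whole thing can be turned off on the host via /sys/kernel/mm/ksm/run, and a process such as the VMM can opt a memory range out of merging with madvise, roughly as sketched below (Linux-specific, needs a kernel built with CONFIG_KSM; the region here is just a dummy).

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16 * 4096;
        void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
            return 1;

        /* Tell KSM to leave these pages alone (and to un-merge any that
         * were already merged).  Returns 0 on success.                  */
        if (madvise(region, len, MADV_UNMERGEABLE) != 0) {
            perror("madvise(MADV_UNMERGEABLE)");
            return 1;
        }
        printf("region opted out of KSM merging\n");
        return 0;
    }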
Another thing that could be done: on x86 today, the CLFLUSH instruction is not privileged. So maybe Intel could make it privileged, or something; I don't know if that's doable with a microcode update. Obviously you also have to revisit your co-location policy: what do you put on a core, what do you put on a socket, what do you put on a per-box basis? Personally, I'm more a fan of trying to detect this kind of communication, and for that there are many ways to go. One of them: there are a bunch of hardware counters available, for different reasons, on these chips, so one could do some sort of pattern and noise analysis to try to detect spikes and very regular noise. Another thing that could be done would be to try to detect those inter-VM process scheduling patterns, meaning that if you have two processes running in two different VMs and they are somehow always scheduled at the same time, always overlapping, that could be detected. And one other thing: for the stuff I have been doing, there are lots of calls to RDTSC, because that is what the compensation relies on, so one could monitor that and try to detect it. That is really what I am working on now. Yeah. So basically that is pretty much all I have. The source code for this is going to be on my GitHub; I will post an update to this slide deck shortly. Thank you very much, everybody. (Applause)