Hopefully everybody will be entertained. I know everyone thinks what I really want to do at 5:00 p.m. is go to a talk that involves math. >> Whoo! >> Excellent. Came for the math, stay for the moustache. This is "I Am Packer and So Can You." I'm going to attempt to keep this to the 45-ish minute mark so I can do some Q and A. Hopefully I'll get good questions. Now we're on to the agenda. A little bit of an intro, talk about the project, a little about me, because why wouldn't I, I'm up here, give everybody a little refresher talking about the PE format, look at the data, pull out our magnifying glass and do a little bit of math, look at the solution, and finally look at the results. So the most important part: me. What do I do? Currently research; those are my hobbies. Anybody else from Texas? >> Yeah! >> There we go. If you're in Austin, I will totally buy you a beer. If you're looking for various security data, I try to keep an updated list, everything from Bro logs to Snort logs to other projects, way more information than I can possibly host. Follow me on Twitter, and I'm sometimes a contributing member -- and feel free to tweet about this and use the hashtag secure math, because we are going to be talking about math. So what's the main problem here? I'm sure a lot of people are familiar with the idea of detecting compilers and encryptors. Some of the tools are really old. PEiD was written in 2005, so it's 10-year-old technology. Maybe there's a more interesting way, or a better way, to manage this problem. So really the goal was: can we do something new and different? So we've got some goals. We've got some great projects out there, like PEiD and some other ones. Yeah, they might be a little old, but there's probably still some validity. However, for this we're going to try and adopt kind of a zero-trust stance towards them. In other words, if somebody as an analyst says this PEiD signature is verifiably correct, then great, we -- being myself or anybody else in this room -- can take that signature and kind of directly translate it into this new language. The other goal is easy-to-create signatures. Looking at PEiD and some of the other associated tools, you've got to live in a hex editor, you've got to open up IDA and find the exact pattern you're looking for; there's a certain barrier to entry. So the idea here is, can this really be distilled down to something anybody can get value out of? Let's make it easy, and we're going to talk about the signatures as well. Cross-platform. Running PEiD on a Mac itself, that's not going to happen. There are a couple of solutions that let you run the signatures on Linux or on Mac. They're really good, but they're not as full featured as using it on Windows, so that's kind of a negative there. The other thing, once again, is simple to extend and understand. So in my opinion, what I'm going to start with here is kind of this base notion, this idea, present some data, and say look, I'm pretty sure this mostly works, and hopefully somebody, multiple somebodies in this room or elsewhere, will go wow, that guy wasn't really dumb, he was only mildly dumb, and instead, here are a couple of enhancements. And the other thing I wanted to get out of this is this idea of fuzzy matching. If you've got something like PEiD or another signature-based language, generally either the signature hit or it didn't. So instead I want to introduce a notion of, well, part of the signature hit, and this is about how much of the signature hit.
So in other words, when I use this, or anybody else uses this, for signature management, you can kind of figure out where your overlapping signatures lie, and maybe be a little bit more effective out of the gate. So now we're going to jump in with an easy refresher: talk about the terms, what I mean when I say certain words. It might be different from what other people say, so I want to do some basic level setting and talk about the PE file structure. I'm sure most of you in this room go home and dream about the PE headers. Probably not everybody does. All right. This is a very simplified look at the PE file structure. You've got this DOS stub at the beginning, these other various headers that are optional, some of which are only generated by certain compilers, this notion of sections -- some sections contain code and some contain data and so forth and so on -- and this idea of resources. So if you ever look at a program executable's icon, that's the resource section. So there are many different parts. This is one of my favorite graphs, and I apologize if you can't see it all that well. These are all the header values that you can have in a PE file. Keep in mind not all of them are required to exist, and not all of them are required to be filled out in an entirely accurate way. But this is what you can deal with, so there are a lot of things to mess with. They're color coded. So as far as the PE format itself and header structure, this is what we're going to care about today. There are three basic features out of the PE header that I decided -- whether I'm correct or not, that's fine -- could be kind of interesting: the major linker version, the minor linker version, and the number of sections. These should generally vary enough from compiler to compiler, or packer to packer, or crypter to crypter, that they should be useful features for doing this type of analysis. Things like UPX and a lot of other packers maybe jam the executable into one section and have this other little section. So when I say tool chain, what I'm talking about is the set of tools used to develop software. You have things like IDEs and linkers and compilers, and each one of these actually leaves a relatively unique fingerprint on the binary that it creates. You can manually go in and change these; not a lot of people do. So for this, when I talk about tool chain, I'm talking about the build environment: GCC versus Visual C++. So packers, what are they? A program within a program. When I want to pack a binary, I'll take the original executable, kind of smoosh it down and ram it somewhere inside this new executable. I want to do that to evade AV, maybe make analysts' lives harder, because who doesn't love stepping through OllyDbg trying to figure out how do I get this thing unpacked, because this is ridiculous. At least if you can identify what packer is similar to anything you've seen before, you know what steps you have to go through, or maybe you know what tool to pull out of your tool box in order to do the unpacking. So there are really two parts to a packer. You've got the packer executable that you run on the original file; this is the thing that actually does the compression and creates the new executable. And you've got the unpacker, and the unpacker is generally this little stub that ends up in this new program. The stub is generally the first thing executed, and it goes through, unpacks the original binary, and runs it.
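Just to make those header features concrete, here's a minimal sketch of pulling the three values out of a PE file. It uses the pefile library, which I'm picking purely for illustration -- this isn't necessarily how the actual tool does it:

```python
# Sketch: extract the three PE header features discussed above.
# Uses the third-party "pefile" library; the library choice and the
# file name are illustrative assumptions, not the speaker's exact tool.
import pefile

def header_features(path):
    pe = pefile.PE(path, fast_load=True)
    return {
        "major_linker_version": pe.OPTIONAL_HEADER.MajorLinkerVersion,
        "minor_linker_version": pe.OPTIONAL_HEADER.MinorLinkerVersion,
        "number_of_sections": pe.FILE_HEADER.NumberOfSections,
    }

if __name__ == "__main__":
    print(header_features("sample.exe"))  # hypothetical file name
```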
When I talk about packer detection in this context, I'm going to refer to the unpacker, or the stub. So unpackers, how do they work? What you really want to do is take control of the address of the entry point -- so, when a Windows PE file is loaded, where should I go and begin executing code. You want that to point to your stub. And once you unpack it, maybe you decrypt it or whatever it is, you find the packed data. You've got to do a couple of relocation fixes, because it's not the Windows loader doing the actual loading for execution, you have to mimic some of that, and then you jump into the original program and keep going. All right. So now we're on to the popular kids. These are, in my opinion, the three tools -- and there are probably several more -- that when people do compiler detection, this is what they're talking about. So PEiD, I mentioned that one before. The signature language is pretty good, it's been around forever, and in my opinion it's kind of the de facto standard. There are several projects that will let you take PEiD rule sets so you can update your analyst tools, but you're still kind of using this limited idea of what you're looking at, this harder way to describe data. And the last one, I really like their slogan. All right. So now we're going to dig into data, and who doesn't love data. And honestly, if you're going to talk about math and you're going to do any type of analysis, if you don't use data and you don't understand your data, it's really, really hard to get good results. A lot of times data is really ugly, right? It's not this beautiful end result; it's this nasty thing you have to slog through and dissect and understand. So this is the data that I used in my testing setup. I went and found and Googled and threw together 3,977 unique PEiD signatures. That's a lot of PEiD signatures, right? So that alone got me thinking maybe we can address the signature management problem. We've got some file sets, various sizes, right? We've got smaller ones that I understood, that I could pull apart and go, oh, okay, I get it, and this giant random sample at the bottom. Everybody loves big data, and this wouldn't be a math talk unless I used the phrase big data, so there you go. That was kind of the end-all: after I felt comfortable with the technique and comfortable with the tool, that's what I ran it over, verifying and spot checking against the smaller data sets. We'll talk about that as well. Let's get into the data analysis. There's a handful of slides; we'll go through them. We'll talk about the basic exploration of the Zeus data set -- roughly 7,600 samples are what these slides are based off of. So the first thing I did was, what happens if I run PEiD on these files? It turns out it doesn't match 4,600 of them. Really disappointing. So we get some other ones: there's one UPX version and another UPX version, Microsoft Visual Basic, and the Armadillo packer, and I'm sure just by looking at the numbers you could make a relatively educated guess that Visual Basic and Armadillo are really closely related. So what do those numbers look like in visual format? It's a bar chart; you don't have to worry about the numbers. That really tall bar is the 4,600. It's just another way to visualize it, just to get into the idea that creating signatures is hard. It's non-trivial.
So having an easier way to do that would be great, because then that really big giant -- I apologize for not using gray scale -- blue-ish purple-ish box gets smaller, and you get more things you can actually label and understand. This graph, in my opinion, is what science looks like. You show this to somebody, that dude up there, totally good science. This is simply a correlation matrix. You take all of these PEiD signatures, and for files that had a flag, you want to see which signatures flagged with a high correlation -- when one flagged, the other one was very, very likely to flag. The diagonal is basically each signature correlating with itself, right? Because every time a signature fires, it's going to be observed. So with this you want to pull out the black dots. You can zoom in: this is the upper left-hand corner, and you can see there are a couple of signatures that are highly correlated. So there's a lot of signature overlap. There could be signature overlap in your environments; there's obviously signature overlap on the Internet. Every time one of these ASPack signatures flagged, the other one did. So with that you get a feel for where you're lacking, or maybe where you have some duplication. So we understand what PEiD looks like on a sample data set. Now we see what we can look at, in addition to header features, that allows us to say with a probability that we're looking at a specific packer or a specific compiler. I love it when any type of malware author, or any author in general, includes a PDB string, because sometimes they're awesome. It's important to keep in mind these are just text, so there's no reason why you can't create your own. Then you've got these major and minor linker versions -- what do they look like in the sample set? This is just breaking it down: if you look at the first one, linker 2.5, there are 2,000 of them. So while you can group this Zeus sample set, or many other sample sets, by looking at the linker versions and the counts, it still really doesn't tell you the whole story. So we looked at the number of sections, and you can see a relatively similar distribution. You've got a couple of really big groups of files that might indicate a specific campaign within the Zeus data set, and you have this longer tail. The other thing we wanted to look at: assembly mnemonics. These are kind of cool. The idea here is when an executable runs, there's code, and those bytes can be translated into a mnemonic. And Johnny Five is alive, but -- sorry, Johnny. For this, the Capstone engine was used. If you're looking for a free and awesome disassembler, it's great. I love it. It runs on multiple architectures, has bindings for multiple languages, and it's super easy to use. The reason I call this out specifically: I'm sure a lot of you have noticed that every single time you run a different disassembler on an executable, you can get different results. So really you only get consistency within a disassembly engine. The point is just to be consistent with this type of stuff. So then I had what I thought was a really rad idea. I was going to look at the correlation between assembly mnemonics. They describe the program behavior, and what we're looking to capture is what is this packer doing, or how does the executable get set up because it's compiler specific, or, in the case of a packer or crypter, they have to know what to undo so they can run whatever code they want. We have to capture this program behavior, and that's what we're doing with the assembly mnemonics. So how can we look at these various assembly mnemonics?
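To give you a feel for what that extraction looks like, here's a minimal sketch that combines pefile with Capstone to grab the first handful of mnemonics starting at the entry point. The byte window and the count of 30 are just assumptions for illustration, not the tool's exact parameters:

```python
# Sketch: disassemble from the entry point and keep only the mnemonics.
# Uses pefile + Capstone; the window size, instruction count, and the
# 32-bit x86 mode are illustrative assumptions.
import pefile
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def entry_mnemonics(path, count=30, window=0x200):
    pe = pefile.PE(path)
    ep = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    image = pe.get_memory_mapped_image()
    code = image[ep:ep + window]          # raw bytes at the entry point
    md = Cs(CS_ARCH_X86, CS_MODE_32)      # assumes a 32-bit x86 sample
    mnemonics = [insn.mnemonic for insn in md.disasm(code, ep)]
    return mnemonics[:count]

if __name__ == "__main__":
    print(entry_mnemonics("sample.exe"))  # hypothetical file name
```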
It looked ridiculous. Imagine looking at that for 400,000 samples. You're going to go blind and be sad. So there's this notion of distance, or similarity -- that fuzzy idea. If I have a signature, I want to know how close what I'm looking at is to the signature. How similar is it? We'll talk a little about Jaccard distance. It's cool, however it doesn't take order into account. Code generally executes in order; it doesn't just jump around -- I mean, there's flow control and that kind of stuff, but generally if the sequence is mov then xor, it will be executed in that order, not xor then mov, or vice versa. So while Jaccard is great and might be useful, I thought order was important to take into account. Still, it's a cool distance metric. Position is important. The left-most mnemonic is the one at the entry point, so this is where the executable will start, and it moves from left to right, and you can see there are various ones. So the easy way to view Jaccard: take the number of shared elements divided by the total number of elements, and that's your measure. In this case the shared elements are mov and push, which is 2, divided by the size of the combined set, which is 8, and you get .25. So as far as set membership is concerned, these two sequences score .25. And while that's okay, it just didn't feel right, because once again you lose this idea of order. So, how many things have to change to make one sequence into the other? That's edit distance, and it fits the domain a little bit better. Once again, you're doing a quick compare, looking at whether the elements are different. So right there, there's one difference; here they're not different; and so forth and so on. Basically seven changes are necessary to make one sequence into the other, therefore we get a distance of 7. But code is executed in order, and there may be branches. I really didn't want to build any type of flow graph; I wanted to keep it simple and understandable and efficient. So the assumption I worked with was that the assembly mnemonics on the left should matter more than the assembly mnemonics on the right, because execution starts on the left and finishes on the right. And if there's a jump in there, maybe you want to care about it, but maybe you don't want to care about the stuff after it as much as the fact that there was a jump. So there was a bunch of testing and metrics where I tried to figure out where the cut-off was, and so forth and so on. We have to take into account how big the stub is, and if you don't know what you're looking at, some of these questions are really hard to answer. So: any edit toward the left will have a higher weight than an edit toward the right, which kind of makes sense. We care more about the things that are executed first, in case there is something like a branch or a jump. And then we have a language, these assembly mnemonics, to capture program behavior. So we can put those two together: at every single position, you weight an edit by that position's distance from the end divided by the length of the sequence. So in this case there are ten things in the sequence, so an edit at the first position costs one full edit, the second position needs zero edits because it matches, and an edit at the third position costs .8 of an edit. Add those up and you have a distance of 3.5. To me this was great because it said yeah, these things are separate and different, but there might be some similarities. The next thing you can do is use it as a similarity calculation: one minus the distance over the length, which says those two sequences are 65% similar. This is how we get the idea of similarity mixed into the algorithm. All right.
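If you want it in code, here's a minimal sketch of that position-weighted comparison as I've described it -- a straight position-by-position compare where an edit at index i of an n-element sequence costs (n - i) / n, and similarity is one minus the total cost over n. The real implementation may handle details like unequal-length sequences differently:

```python
# Sketch of the position-weighted distance/similarity described above.
# Assumptions: a straight positional comparison (no insertions/deletions);
# edits near the entry point cost more; cost at index i is (n - i) / n.
def weighted_distance(sig, sample):
    """Positionally weighted edit count between two mnemonic sequences."""
    n = max(len(sig), len(sample))
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        a = sig[i] if i < len(sig) else None
        b = sample[i] if i < len(sample) else None
        if a != b:                    # an edit at position i
            total += (n - i) / n      # earlier edits cost more
    return total

def similarity(sig, sample):
    """1.0 means identical; matches the 3.5-distance / 65%-similar example."""
    n = max(len(sig), len(sample))
    return 1.0 if n == 0 else 1.0 - weighted_distance(sig, sample) / n

# Example: ten-position sequences differing at positions 0 and 2.
a = ["push", "mov", "xor", "call", "jmp", "pop", "add", "sub", "cmp", "jne"]
b = ["mov",  "mov", "and", "call", "jmp", "pop", "add", "sub", "cmp", "jne"]
print(weighted_distance(a, b), similarity(a, b))  # 1.8, 0.82
```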
We've made it through the refresher; everybody loves PE files and headers, and we have an idea of the features we're going to look at: the major linker version, the minor linker version, the various assembly mnemonics. We have some fancy-sounding algorithms that are simple to understand, which is great, and we have a way to do fuzzy matching. Awesome. Now what do we do? First step, gather samples. There are well over 411,000 samples. The second thing was, let's run PEiD, this industry standard, and see what it looks like for everything. And then from there, for every single one of the executables, we're going to disassemble them, because we need the assembly mnemonics -- in this case we wound up using the first 30 -- and we need the header features. We'll talk about clustering so you can understand which PE files are similar based on these features. When I ran this across all the data sets, my threshold was 90% similar. I felt that if an executable's signature and the signature I was matching against were not at least 90% similar, that wasn't good enough to call it an actual match. One of the things I started off using was a similarity comparison and optimization; I wound up doing a lot of comparisons, but luckily not by hand. And we created signatures so we can verify. One of the things I want to talk about briefly is signatures versus machine learning. Everybody thinks that because we live in the world of security data science, we have to do security machine learning, and if we're not using scikit-learn we're doing it wrong -- no. No. Sometimes it's overkill, right? One of the nice things about signatures in this case is we can use them to capture this domain-specific language, but I, or anybody else, don't have to worry about model drift: a model might have great accuracy, but when you get new data and go to retrain it, it gets out of whack, so to speak, and you keep going through this large retraining process, right? This is one of the issues with operationalizing machine learning. Also, the model will vary based on the training source. It would be good at finding things labeled APT1, but it would be worse at trying to determine which packer or which crypter is which. And likely everybody else will have a different data set than me, so it wasn't a good fit. And the last bullet, which is where I was going, is simple: you want to play, you want to do things, you want to tinker. Sometimes machine learning is fun to tinker with; sometimes you really just want to get something done. So here is what the signature language itself looks like. Really, really simple. It's highlighted to show you the signature, and I'm going to go into a demo in a second. So for the Microsoft Visual Basic signature, you can see there are quite a few mnemonics, the ones on the left, and you get a similarity of .902. So in my opinion, that accurately captures how close the signature is to the file, and I feel pretty confident that this file matches my signature. Now let's move into a demo. I think I broke everything. That's phenomenal. Seriously. It just hates full screen. I scripted this all because I was kind of a chicken; I didn't want to type commands. So I'll direct your attention to the top, to the kind of small box, and walk through the demo. It just sits there. Third time is a charm. When in doubt, try a different port. You know what, screw it. If anybody wants to actually see a demo afterwards, I promise, I literally promise -- I swear. Completely unreadable slides, then. There are two phases to this.
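Just so you can picture how the pieces fit together, here's a minimal sketch of a signature as a plain data structure plus a match decision at that 0.9 threshold, reusing the helpers from the earlier sketches. The field names, the example values, and exactly how the header features factor into the decision are assumptions on my part -- the real signature format is only shown on the slide:

```python
# Sketch: a signature plus a match decision at the 0.9 similarity threshold.
# Field names, example values, and the header-check logic are illustrative
# assumptions. Assumes the helpers from the earlier sketches live in a local
# module (the module name here is hypothetical).
from mnemonic_features import similarity, entry_mnemonics, header_features

signature = {
    "name": "Microsoft Visual Basic (example)",
    "mnemonics": ["push", "call", "xor", "mov", "mov",
                  "mov", "mov", "mov", "push", "push"],
    "major_linker_version": 6,
    "minor_linker_version": 0,
    "number_of_sections": 4,
}

def matches(sig, sample_mnemonics, sample_headers, threshold=0.9):
    """Return (is_match, similarity score) for one sample vs. one signature."""
    sim = similarity(sig["mnemonics"], sample_mnemonics)
    headers_ok = all(
        sig[key] == sample_headers[key]
        for key in ("major_linker_version",
                    "minor_linker_version",
                    "number_of_sections")
    )
    return sim >= threshold and headers_ok, sim

if __name__ == "__main__":
    path = "sample.exe"  # hypothetical file name
    ok, sim = matches(signature, entry_mnemonics(path), header_features(path))
    print(ok, round(sim, 3))
```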
One is the signature generation phase, and that simply says run this one script on the binary -- which I can't even show on a computer; that's what I get for trying to do a demo -- and generate the signature. And the signature is going to be a simple list of assembly mnemonics, plus the major and minor linker versions as well as the number of sections. And all you have to do, if you're not giving a demo, is run this other script, which if you could see it is MMPES.py, on a signature, and you can do all sorts of things. If your idea of similarity is different from mine -- say, 50% similar -- you can do that. You can give it this crazy verbose output: here's the signature that I have, and here's what I'm matching against. It also tells you when the major and minor linker versions match, or when the numbers of sections don't match. It tells you how many edits you have and the actual similarity. This is between two samples, and you can see the signature was generated on the two files in this directory. The first one really didn't match all that well: that .844 required roughly four and a half edits. But this other file matched exactly; all 30 assembly mnemonics were perfectly in order, both linker versions matched, as well as the number of sections. And here's a better description of the rule that you might actually be able to see. All right. So let's look at some of the data sets -- some of the bigger ones, because again, big data. We'll start with APT1. This is describing the clusters, in other words the like things grouped with other like things, and it's two bar charts, which is why you get the color variations. Apologies for the gray scale. The bar on the right is how many things PEiD said are similar, and the green bar is the assembly mnemonics comparison. The cool thing is, even with having zero trust in the labels from something like PEiD, you get kind of this anticipated view: you expect a lot of things to fall into a few buckets, and you get this long tail that as an analyst is always a pain in the ass to deal with. One of the ways to represent this is a neat-looking bubble graph. It's not really science unless you have sweet graphs. This is just clustered on assembly mnemonics, so once again it's representing what you can see: one large cluster and these other ones. The signature language in this work revolves around a couple of other features, so what do they look like? The darker blue is the actual assembly mnemonics cluster -- in this case, that big orange one from before is the big dark blue one here -- and within that one cluster, based only on assembly mnemonics similarity, you have three subclusters by number of sections. This is kind of interesting: maybe there's a little bit of variation, maybe somebody used a slightly different version of something, so forth and so on. Likewise with linker versions; I thought this was kind of neat. There's very little deviation in linker versions in this sample set when subclustering. So this is a two-dimensional view of a three-dimensional set of features. Once again, the dark blue is the assembly mnemonics circle, and you've got these various subcircles. In the one in the lower right-hand corner you can see one cluster that was actually based off of the number of sections, and you can see two subclusters in that, and everything else only had that one cluster. So it's kind of cool. So let's look at Zeus. Much bigger data set, many more graphs. Much more science. This is what Zeus looked like.
Once again, like the earlier teaser, you get this massive, massive PEiD unknown label, but the clustering actually breaks it up. In this one, the stacked one, you can see the assembly mnemonics clustering: that yellow bar is a little bit more manageable, and you get this slightly more gentle sloping curve. But you get a lot of bubbles. So the end result is I shouldn't do anything in D3, or you shouldn't do D3 while you're high. (Laughter.) Those scenarios all end badly. So once again, what does it look like if we subcluster on number of sections versus the initial cluster on assembly mnemonics? You get these crazy subspirals; things look bizarre. For me it was kind of enjoyable because it was an exploration of Zeus and a way to visualize this entire data set. And you kind of want to go home and cry. It's never very good. So I mentioned that I did something on 411,000 files, and it was awesome. Let's talk about them. This is just the assembly mnemonics graph. Roughly 5,400 of these files are not 90% similar to any other file in this entire corpus. I thought that was really cool and really surprising. This might be some polymorphic stuff, this might be various crypters, who knows. But it was cool. That many things is way too many for me to actually dig through. So we'll skip through some of these; everybody loves spirals, and I wanted to leave 15 minutes for questions. So don't do D3. I actually broke D3 -- this is the one that I broke. I give up, or you're doing it wrong. It might very well be that I was doing it wrong, but it cried. There were a couple of really cool things that popped out of this relatively large data set. Like Google Chrome. There are 97 Google Chrome instances, and they all match this same signature. They all have this kind of same assembly mnemonic string; they're very consistent with what linkers they use. Out of the 97, the take-home is 94 of those 97 have matching linker versions, matching number of sections, and assembly mnemonics within 90%, this .9 similarity. So it was kind of cool. And it really wouldn't be a talk about packers if we didn't talk about UPX; somebody was going to ask about it. So this was kind of cool, kind of telling. I had dug into UPX some in the past, but this forced me to do a little bit more digging. So I kind of cheated and said, all right, what if I do this really, really naively and just look for the section names UPX0 and UPX1 and say it's probably UPX, right? Because, once again, I didn't want to trust any prior solution, and I wanted to see how this stuff stacked up. So with the assembly mnemonics, I got 65 different groups, and I thought, shit, now I'm going to be laughed off stage. However, there are some pretty cool results here. You can see this group label and this count -- the group label, or cluster label, is just the arbitrary number I assigned to that group. So you can see, once again, you get this neat little slope. I was like, all right. So maybe there are some variations of UPX, maybe I'm much smarter than I thought I was and I can do UPX version detection with this. Maybe my head is going to explode, or maybe I failed miserably. The answer is kind of somewhere in between. Looking at it against PEiD, it was neat to see that either I and every other person were making the same mistakes, or maybe we're totally on to something. The cool thing was, here are the numbers: it looks like maybe I was on to something after all.
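In case you're curious, that naive UPX check could look something like this with pefile -- just the section-name heuristic, not my actual script:

```python
# Sketch: naive UPX labeling by section name, as described above.
# Uses pefile; treat this as illustrative, not the talk's actual code.
import pefile

def probably_upx(path):
    pe = pefile.PE(path, fast_load=True)
    names = [s.Name.rstrip(b"\x00").decode("latin-1") for s in pe.sections]
    return any(n.startswith("UPX") for n in names)  # e.g. UPX0, UPX1

if __name__ == "__main__":
    print(probably_upx("sample.exe"))  # hypothetical file name
```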
There's also -- I dug through that a little bit to see what was going on, and it turns out there are a bunch of packers that basically wrap UPX. I thought that was awesome, so I learned a whole bunch there, these kinds of variations. So let's go through the recap. The idea was to generate signatures; if I had a working demo you would have seen me type one command, and the signature would have appeared out of nowhere, and it would have been awesome. It involves math, great, who doesn't love math. It's cross-platform; it's all written in Python, because Python is the new old Ruby. The Capstone engine is cross-platform. It's mostly easy to understand -- it involves a little bit of math, but hopefully not too bad even for 5:00 on a Friday -- and, probably most important for me, it works. So even though the paper promised a demo and it didn't work, I'm going to release it online. The guys at work are more than happy to say you can totally release this tool and sample signatures for people to play with and use. It and these slides -- the updated slides, because the old ones are on the CD -- feel free to take a picture, or hit me on Twitter. However, it's not up there yet because I'm a slacker, and it will get done next week. If anybody has any questions, I'm more than happy to answer them. >> (Inaudible.) >> The question is, once you have all of this data, what's the action? And that's really a good question. So aside from why did I do it -- because I love messing with things -- it's important in my opinion for any analysis to drive an action, and the action is to understand whether what you're looking at is malware, or to get extra context on something. If I could help solve part of this signature management problem, and you can get this idea of fairly accurate signatures with very little lift, then when you're at your home organization and you go, man, I've got this piece of malware that I've never seen, you can grab 3,900 signatures off the Internet and it tells you how similar it is to some things other people have seen. It gives you a starting point for analysis. That would be awesome. >> (Inaudible.) >> Would I believe it if anyone is using it? I haven't run into it. The question was, have I run into anyone putting the packer information into the packed files. My answer is no, because I didn't run into it in any of my sample sets. However, even at 410,000 or 411,000 binaries, given the number of executables that everybody talks about, that's still a relatively small sample set, so it's nowhere near everything. Does this apply to protectors as well? When I say packer, I mean protectors, crypters, the whole gamut. Any more? Man. Is my math that bad? Not everybody fell asleep, and nobody has questions on the math? All right. Cool. So I'll be around if anybody has questions. One more. >> (Inaudible.) >> How do I make this moustache happen? I think it is genetics. It is math. This is what happens when you do too much math. (Laughter.) I actually had a really long beard at one point in time, and my wife hated my long beard because I told her I was going for wizard length. So I said, you know, if I can't have a long beard, I'm going to have a long moustache. Now I sleep on the couch. (Laughter.) Too much D3. Exactly. Any more questions? Nobody? All right, cool. Thanks for coming. I appreciate it. (Applause.)