>> So thank you everyone for being here. We're really excited to be at DEF CON. This is my second time; it's Thomas's first. And it's amazing to be around people who care about security and sharing information. Our talk today is about malicious CDNs, and we're going to cover one particular one. There aren't many, but Zbot has been an interesting fast flux proxy network over the past few years. We're going to show how we have been studying it using SSL scans, combining a few interesting heuristics that use graph theory and some basic statistics. >> Hi, my name is Thomas Mathew and I'm a researcher at Cisco Umbrella, which is formerly OpenDNS, and my main focus is on data science and machine learning. >> And I'm Dhia. I'm the head of security research at Cisco Umbrella, and my interests are in graph theory and security overall. So what are we talking about today? Just a quick overview, a few words about CDNs. Most of us know about them: a very interesting and powerful technology that enables people who have content, especially popular sites, to deliver that content in an efficient way, so people around the world can get it with low latency from the edge nodes closest to those users. For the sake of this talk, certain features or requirements of a CDN infrastructure are of particular interest to us. Specifically: the content of a customer is delivered with low latency; the customer's website is protected against DDoS attacks; customers try to hide their origin IP behind the CDN infrastructure; and if the site communicates via HTTPS, then an SSL cert is deployed on the edge nodes, so you can guarantee end-to-end secure communication. Now most of us know about the legit ones: Akamai, Cloudflare, Google Cloud, Amazon CloudFront. Cloudflare tends to be abused now and then by some kinds of bad content, but those are the legit side, and we work with them to mitigate some of these threats. But then there are some purely criminal content delivery networks, more on the reverse proxy, fast flux network side, and Zbot is one of them. In fact we've been studying this network for the past few years. We gave talks at Black Hat 2014 and at BotConf 2016, and I invite you to go check the details there on how to detect it, and some of the other features of the infrastructure. So this is an overview of how this network operates. You have around thirty to forty thousand compromised machines, mainly routers and access points in Ukraine and Russia, and they are maintained and harvested by the actors. It's not necessarily the guy selling the service in the underground who is also the one compromising the machines. Usually you can buy installs for thousands of machines and then go and provision them to your customers; as we know, there's this big segmentation of expertise in the underground. And that's the actual offer in the underground for fast flux: as a customer, let's say you want to deliver malware, ransomware sites, phishing especially, and we also saw a lot of carding and cybercrime forums, who will switch between these infrastructures to protect the content so it's not taken down.
So it's not revealed to security researchers, et cetera. In that case what they do is hide behind the Zbot CDN, or fast flux proxy network. When they buy the service, for a couple hundred dollars or in that range, they get provisioned with, I would say, between forty and fifty (sometimes more, but in that range) IPs or bots on which their SSL cert will be installed. That way they guarantee that the end-to-end communication, whether the victims are talking to ransomware C2s, crimeware consumers, or researchers like us, will be over HTTPS. So this is an interesting infrastructure; in fact it's been around for years, and it's worth investigating from the SSL perspective because, like we said, all of the bots will have SSL certs installed on them, and when you have scans you can figure out a lot of interesting patterns quickly. I mentioned crimeware forums, dump shops, malware; these are just screenshots you can find anywhere on the web. And quickly, as we know, the cert has a field called the common name, which has to match the domain you're trying to protect with HTTPS. Obviously I won't get into the details of wildcards and subdomains and so on. The main point is, like we said, if you want to protect your site end to end, then you have to have the cert of, let's say, your crimeware forum deployed on the bots you got provisioned, so they can deliver your content with SSL encryption. >> So the main objective today is to provide researchers with a series of statistical tools that allow them to analyze large sets of SSL data. All of the data we're going to be discussing in today's talk is actually available at the following URL; it's collected by Rapid7 and the University of Michigan. This is a high-level overview of how the talk is going to go. We're going to discuss what exactly is contained in the SSL sonar data, and once we do that, we can model that data using a bipartite graph which splits the data into common names and ASNs. Once we have that bipartite graph, we can start collecting what we call global information through a series of histograms that calculate the relative frequencies of the popularity of both ASNs and domains. These histograms then allow us to create micro, local features on a per-domain basis. Once we have those tiny histograms, we can use a bucketing scheme to convert them into a vector, and this vector can then be measured against other domains in the same neighborhood of popularity in the graph. Ultimately we use a very simple anomaly detection mechanism to see whether a domain in a particular neighborhood is unlike its other neighbors, and that's how we can identify whether a domain is potentially a Zbot host. With SSL we don't really need to dig into the details too much; we just want to know that SSL is the secure sockets layer, it's used for encrypting web traffic, and there's been a steady increase in websites employing it. The type of SSL data we're working with is the x509 certificate, and the x509 certificate contains information regarding the issuer as well as the subject.
And for today's talk we are more interested in the subject, that is, the entity the SSL certificate belongs to. In particular, when we look at the subject information, we're interested in the field Dhia mentioned, called the common name. Now, a common name can be any alphanumeric string, but we're interested in common names that are legitimate domain names, because we want to see which domain names are associated with a particular x509 certificate. This is just an example of how an x509 certificate looks if you decode the base64. One of the useful things about SSL data is that it allows us to map out not only residential versus commercial IP space, but also to understand how a network can be spread out over a series of IPs. Let's say we have a set of x509 certificates and their corresponding IPs. If we look at the common name information, we can see how a particular common name is spread out over a series of IPs, and we can make a guess that that common name entity is somehow involved with the hosting or co-location at those IPs. With sonar data, it's a scan run roughly two to four times a month over the entire IPv4 space, although in actuality, because certain network operators don't allow the University of Michigan to scan their ranges, we don't get information from certain ranges. It's a basic scan on port 443, and the most important part is that we get an x509 certificate along with the IP it was found on. This is the flow chart we used to get our data prepped before we performed the analysis: we take a raw monthly scan of sonar data, extract the common names, map each IP to an ASN, and come up with a quadruple: the SSL SHA, the IP, the common name, and the ASN the IP belongs to. What's really great about the sonar study is that, because it's produced on a monthly basis, we can see how hosting patterns emerge and change over a five-month period. >> So real quick, here you can see the range of the number of SHAs we collect every month: between 250K and a million unique SHAs per month over a five-month period. The data also has the raw certs, so you can decode them and extract the common names. The point is that we have a big enough data set to try to find interesting patterns, anomalies, and threats in general. The other thing is that it's difficult to manually inspect these domains; that's why we're looking for a large-scale abstraction model that can help us do this kind of analytics. Graphs, as you know, are very useful for a lot of things, and most of you know about bipartite graphs: you have two sets that are disjoint. In this case we took a simple representation where the common name set is connected to the ASN set, meaning a common name is hosted on an IP, and that IP belongs to an ASN. We found that lumping all of the IPs into their ASN node is more useful for our investigation than keeping the CN-to-IP bipartite graph. And that's basically what you end up having.
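A minimal sketch, in Python, of the data prep just described: decoding each scan record's certificate, pulling the subject common name, and forming the (SHA, IP, common name, ASN) quadruples and the CN-to-ASN edges. This is not the speakers' actual pipeline; the record layout and the `ip_to_asn` helper are assumptions, and the `cryptography` library stands in for whatever parser they used.

```python
import base64
from cryptography import x509
from cryptography.x509.oid import NameOID

def extract_common_name(der_bytes):
    """Parse a DER-encoded certificate and return its subject CN, if present."""
    try:
        cert = x509.load_der_x509_certificate(der_bytes)
    except ValueError:          # malformed certs do show up in internet-wide scans
        return None
    cns = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)
    return cns[0].value if cns else None

def build_quadruples(records, ip_to_asn):
    """records: iterable of (sha, ip, base64_cert) tuples (assumed layout).
    ip_to_asn: callable mapping an IP string to its ASN (e.g. a pyasn lookup)."""
    quads = []
    for sha, ip, b64_cert in records:
        cn = extract_common_name(base64.b64decode(b64_cert))
        if cn:
            quads.append((sha, ip, cn, ip_to_asn(ip)))
    return quads

# The CN -> ASN bipartite graph is then just the distinct (cn, asn) pairs:
# edges = {(cn, asn) for _sha, _ip, cn, asn in quads}
```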
This has been useful for the type of analysis we're going to describe in a second. So I guess the first takeaway is that the bipartite graph is a useful representation of this problem that helped us work through these issues. >> So classically there are multiple methods for analyzing a graph and identifying various substructures. You can use a graph factorization technique, you can identify the various connected components within the graph and study each of them, or you can calculate the minimum spanning tree. For this talk we're actually not going to use any of those three methods; instead we're going to look at another set of statistics. But in general our goal is to identify possibly anomalous substructures within the graph. A substructure within the graph can be thought of as a certain set of domains and ASNs that have some sort of odd shape, and I know that sounds slightly vague right now, but as the talk goes forward you'll start to see what we mean by a substructure within the graph. So when we analyze the graph, we first need a baseline metric for what we consider normal, and we wanted a metric based on the topological features of the graph. That means we're looking at the relationship between domains and their mapping to ASNs, and vice versa. A really easy way to do that is to look at the popularity of each common name. The popularity of a common name is defined as essentially the degree count at that domain vertex, so we calculate the frequencies of how a particular domain name is distributed across a set of ASNs. Then we model each ASN as having a particular type, where the type of the ASN refers to its popularity, and the popularity of the ASN is how many unique common names appear on it. So there's this mirror relationship between the two popularity scores we create. Dhia will now show a simple example of how this works. >> So let's break it down with a very simple example. We see our common name set, in red, linking to the other set of the bipartite graph, the ASNs in blue. The nice analogy we're going to use in this talk is: common names are people, individuals, and ASNs are cities or states. You can see that a person like John at the top lived in, let's say, four cities, which are four ASNs, or four different states. What we try to do here is study the ASN side; in a sense, you're looking at the behavior of cities in terms of how many people they hosted. So the ASN at the top, the blue one, has three incident edges, which means it has a degree of three: it had three common names hosted on it. Looking at the following ASNs, the second one has one incident edge, which means it has a degree of one. Anyhow, what you end up with is the three bullets at the bottom: two occurrences of an ASN with degree one, three occurrences of an ASN with degree two, and two occurrences of degree three. And in the simplified histogram at the bottom you can see the ASN degree on the x axis and the number of occurrences of that degree on the y axis.
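As a rough sketch of the ASN-side computation just walked through (assuming the `quads` list from the earlier sketch), the degree of each ASN is the number of unique common names on it, and the histogram counts how often each degree value occurs:

```python
from collections import Counter, defaultdict

asn_to_cns = defaultdict(set)
for _sha, _ip, cn, asn in quads:
    asn_to_cns[asn].add(cn)

# Degree of each ASN = number of unique common names hosted on it.
asn_degree = {asn: len(cns) for asn, cns in asn_to_cns.items()}

# Histogram: degree value -> number of ASNs with that degree.
asn_degree_hist = Counter(asn_degree.values())
# The toy example above would give Counter({2: 3, 1: 2, 3: 2}).
```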
And that's how you can scale that to a bigger data set. >> So when we apply this technique to the entire January data set, where there are around twenty-two thousand ASNs, the histogram on the right is a stratified sampling from a set of 5k. What we notice is that there's a definite long tail to this distribution: the majority of the ASNs are lumped in the zero-to-five range, which is what's being circled, but then there are a couple of ASNs way out to the right that host more than fifty thousand unique domain names. >> Yeah, the quick takeaway is that, as Thomas said, the majority of the ASNs are hosting between one and a hundred domains. In a sense, the majority of cities in the U.S., for example, are hosting between one and a hundred people, just to use the analogy. >> So here are some raw numbers to back up that statistic. The number of ASNs hosting just one unique common name is around seven thousand six hundred, and the number of ASNs hosting two common names is a little under four thousand. And if you look at the number of ASNs hosting under twenty unique common names, it's nineteen thousand. So more than ninety-seven percent of all ASNs fall within that twenty-unique-common-name band, and of course the ASNs hosting more than one hundred number fewer than a thousand. >> So let's take the mirror set now, the common names, which are, let's say, the people or individuals; in a sense you're trying to figure out their behavior, to get a better understanding. Here we can see that the first red dot has four outgoing edges, so it has a degree of four. The next one has five outgoing edges, so a degree of five. You end up with a list of events: two times we have a common name with degree one, et cetera, et cetera. And you end up constructing the histogram at the bottom, with the common name degrees on the x axis and the number of times each degree value occurred on the y axis. >> So what happens when we apply this metric to the global data set? Out of around eight hundred fifty thousand domains we did a sampling, represented in the histogram, and we can see again that there's very close clustering toward the one-to-two-hundred range. You can see this if we zoom in; that's the bottom histogram, where the majority of common names map to one to three unique ASNs, then there's a very sharp drop-off, and you get ticks at essentially every other count between one and a hundred, a hundred and forty. So let's take a really quick look at the outliers. You can see that one of the outliers is dlink and the other is google video, both domains people are pretty familiar with, and they're definitely not malicious. >> Yeah, the quick takeaway is that google video, as we see, is found as a common name on two thousand different ASNs. That's the common name you'll find on the certs used for YouTube content delivery. Obviously, Google has deployed a lot of caching mechanisms on g-nodes around the world, so that's why you see this big diversity of ASNs for google video. So google video is not fast flux; it's just a core CDN common name you find on the certs that are serving the content.
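The mirror computation for the common-name side follows the same pattern, and sorting the degrees also surfaces the dlink and google video style outliers. Again a sketch over the assumed `quads` list:

```python
from collections import Counter, defaultdict

cn_to_asns = defaultdict(set)
for _sha, _ip, cn, asn in quads:
    cn_to_asns[cn].add(asn)

# Degree of each common name = number of unique ASNs it appears on.
cn_degree = {cn: len(asns) for cn, asns in cn_to_asns.items()}
cn_degree_hist = Counter(cn_degree.values())

# The mega outliers (google video, dlink, ...) sit at the top of this list.
top_outliers = sorted(cn_degree.items(), key=lambda kv: kv[1], reverse=True)[:10]
```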
>> So I guess the one other point I was going to mention is that there's an exponential drop-off in how domain counts are distributed across ASNs. You can see that the jump from synology to example, then example to dlink, then dlink to Apple iTunes is very rapid, which shows how quickly everything converges toward common names that map to very few ASNs. So why are we talking about all of this and doing all these histograms? Well, the goal of this talk is to find substructures within the graph. As you move toward the right of this graph, where a common name maps to more and more ASNs, we gain more information about that common name. For example, when we know that a common name mapped to, say, a thousand ASNs, it's very easy to understand what that common name's behavioral role might be. But as you move toward a common name that maps to only one ASN, it's very difficult to understand what's going on. And we saw that around ninety-seven percent of all domains map to just a single ASN; the problem is that in that range of one to ten mappings, there's just not enough information to make any kind of inference. >> A quick thought here: in general, if you're trying to do data analysis, data is useful, but if the data is too sparse there's no chance of finding anything interesting, and if the data is, let's say, google video and dlink, they are so popular that you're not expecting to find anything useful. That's why we focus, like Thomas said, on a much smaller range that we believe holds the instances of the interesting patterns we're trying to track. In general, we could have taken a clustering approach, since this is kind of an unsupervised method on this big data set, but we found that simple statistical techniques like histograms are very good for building this understanding of the data, step by step. You're trying to isolate your focus on specific regions that you can then go and peel off with other techniques. >> So with the information we just discussed, it's very easy to come up with a simple heuristic to filter out ninety-nine percent of all the domains: we just don't look at domains that map to fewer than ten different ASNs. To expand on that again, it's like when you're analyzing a document. If a document only has a single word or a couple of words, it's very difficult to know whether that document was just created by some random process, somebody typing out a word and spitting out that document. But as the document increases in size, it's potentially a lot easier to find some sort of topic within it. So the goal here is to find the topic of a domain: we want to see how the domain is deployed within this larger ASN structure, and in particular we're trying to understand whether that topic can be considered malicious.
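A sketch of that coarse filter, using the `cn_to_asns` map from the previous sketch. The lower cutoff of ten follows the talk; the upper cutoff for the mega outliers is our assumption:

```python
MIN_ASNS, MAX_ASNS = 10, 1000   # lower bound from the talk; upper bound assumed

candidates = {cn: asns for cn, asns in cn_to_asns.items()
              if MIN_ASNS <= len(asns) <= MAX_ASNS}
```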
So we've talked a lot about the macro level of the graph, but we haven't talked much about the micro level. The micro level is understanding how a particular domain is mapped to its set of ASNs. The two histograms here give frequency counts at the domain level. The x axis denotes the type of the ASNs, and just as a refresher, the type of an ASN means how many other domains are mapped to that ASN. The y axis represents the frequency of that type of ASN for the particular domain. If you look at the top histogram, for naranyamarket, we can see that it contains one ASN that is extremely popular: this ASN hosts more than 25k other unique domains. And it has a concentration of ASNs that are hosting at least a thousand other unique domains; that's where its general mass density is concentrated. But if you look at meenyousecu and its ASN frequencies, we notice that the ASN hosting the most other domains is only a little under two thousand, and the majority of the ASNs it's found on are hosting only one to five other unique domain names. So there's clearly a difference in how these two domains are distributed. >> To bring back the analogy from earlier: think of naranyamarket as, let's say, John, and John happens to have lived in one single city that had twenty-five thousand people in it, but most of the time he lived in cities that had between one and a thousand people. So you can see how this guy is migrating between different cities. Similarly for meenyousecu: it happened to have lived in only one city, or ASN, that hosted fifteen hundred-plus other common names, or people, but most of the time meenyousecu rotated around cities, or ASNs, in that smaller range at the bottom. These are very lowly populated ASNs. And this, as we will see, is very interesting, because it will tell you what this common name is used for, depending on where it resides and how it moves around the ASN ecosystem. >> So I understand that pictures can sometimes be a little confusing, especially at the resolution I've had them at, so hopefully this numeric information further highlights what we were discussing. We can see that for meenyousecu, the ASNs it's hosted on are all ASNs that host just one, two, or three other unique domain names, while for naranyamarket, there's not a single ASN that hosts fewer than a thousand other unique domain names. So with this general intuition, it's only natural to ask: how can we come up with a mechanism to determine how far apart, or how different, naranyamarket and meenyousecu are? Real quick: in fact meenyousecu is part of Zbot, so that interesting pattern will be highlighted later, and we were able to find it with an unsupervised method.
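A sketch of the per-domain micro histogram: for one common name, collect the popularity (type) of every ASN it is hosted on, reusing the `cn_to_asns` and `asn_degree` maps from the earlier sketches:

```python
def domain_asn_types(cn, cn_to_asns, asn_degree):
    """Return the list of ASN popularities ("types") for one common name."""
    return [asn_degree[asn] for asn in cn_to_asns[cn]]

# A Zbot-style domain yields mostly tiny values, e.g. [1, 1, 2, 1, 5, ...],
# while a legit CDN customer yields values in the thousands.
```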
>> So as I mentioned, you can't directly compare these two histograms, because they're on completely different scales: meenyousecu only has counts under a thousand, essentially, and naranyamarket has counts over a thousand. So we need some representation of the entire spectrum of possible domain-ASN counts to be created. This object will be unique per domain, and we can use it as a vector, a mechanism for determining similarity. In order to create this vector we need a bucketing scheme, which maps certain regions of counts to a particular dimension within that vector. In this case we're interested in domains that might be mapped to a variety of different ASNs, where the ASNs they're mapped to are actually quite unpopular. So what we're most interested in are ASNs that appear at a very low frequency, and as a result we create a bucketing scheme that is incredibly sensitive to low frequencies. The best way to think about this is perhaps with a picture. Let's say you're interested in a certain color; then you devise a filter that essentially blocks out the other colors and focuses just on, say, the grays or the purples. In the same way you can think of a domain's ASN distribution as belonging to colors: low frequencies are more like blues, high frequencies are more like reds, and we care more about the blues. So when we bucket the histogram per domain, we bucket it into nine different bands, and each band refers to an index of popularity. You have a band counting the number of ASNs with popularity between one and five, then five and ten, then ten and twenty, but as we go larger, we increase the size of each bucket. That means, for example, that all the numbers between one thousand and four thousand map to the same bucket in the vector. Again, this gives us much better resolution for understanding how a domain maps to low-frequency ASNs. And if this has become a little complicated, or I messed up the explanation, this slide gives a really nice pictorial representation. >> Yeah, so if we look here, we can see meenyousecu and it has that long array. The array, as you can see, has all of these ones, and based on the bucketing Thomas described, one to five, we're basically counting how many numbers fall in the range of one to five: you can see one, one, one, all the way to five, which gives you fifteen occurrences of those numbers. And you can keep going: for five to ten, you have five numbers occurring in that range. That's how you build your vector for both the top domain and the bottom domain, and that way you get two vectors that allow you to compare these two domains on the same scale. >> Yeah, I guess takeaway three here is that the majority of domains, as we saw earlier, are mapping to, hosted on, living in between one and two hundred ASNs, and as Thomas said, we had to devise a bucketing that is sensitive to low-popularity ASNs. In other words, you have a configurable, variable resolution depending on the lower bins of interest. Now, the next step is to go back and focus on the common names and how many ASNs each domain maps to. That will be the step that lets us explain how we find these outliers, and hence these Zbot domains, within this very big data set.
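A sketch of that bucketing. The exact band edges below are an assumption pieced together from the examples in the talk (1-5, 5-10, 10-20, ..., one wide band for 1,000-4,000); what matters is nine bands that are narrow at low popularities and wide at high ones:

```python
import bisect

BAND_EDGES = [1, 5, 10, 20, 50, 100, 400, 1000, 4000]   # nine bands; last is open-ended

def bucket_vector(asn_types, edges=BAND_EDGES):
    """Map a list of ASN popularities to a fixed-length band-count vector."""
    vec = [0] * len(edges)
    for t in asn_types:
        band = min(bisect.bisect_right(edges, t) - 1, len(edges) - 1)
        vec[band] += 1
    return vec

# e.g. bucket_vector(domain_asn_types(cn, cn_to_asns, asn_degree))
```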
And so as we know we can kind of filter out the make mega outliers, the dlinks the googles because we kind of already have a very good idea of what they are and now what we want to do is come up with a mechanism to kind of create neighborhoods of domains. And in this case a neighborhood of a domain is uh other domains that share a very close uh count in how many other ASNs they’re mapped to. So for example uh on the picture you can see that let’s say we were looking at the list of domains that map to a hundred and sixty to a hundred and fifty other ASNs. Well in that kind of neighborhood you would have like iTunes at apple dot com, uh asos-media, download dot mcafee dot com and so all of these domains we say belong to the same neighborhood. Because they all kind of map to the same out of ASNs. So once we have a neighborhood uh we also have a histogram, a histogram vector that we created for each domain. And now this is where we can just apply uh a really simple pairwise Euclidean distance between any of these two domains using the domain’s histogram. >> Real quick, like a quick analogy again is think about the bands as like income. So you have like people and they are in the band of income like one fifty, one sixty k, and you have like these cities with neighborhoods and you’re trying to find people who are within a, how close are they to each other if they’re making within that range of income. And you’ll see later with Thomas that some of them will have some interesting outliers and they they are maybe anomalous, they’re making this much money, but maybe it’s there’s something fishy about them. >> So this is a a hypo- uh it’s like a hypothetical distance matrix for a band or a neighborhood that contains like three domains. Domain one, domain two, domain three. So uh each cell you can think of the value there is calculated by uh calculating the distance between any two uh domains. So let’s just look at the red column. So uh the distance between domain one and itself is naturally going to be zero right? Then the uh the distance between uh domain two and domain one is also gonna be some value and then the distance between domain three and domain one is also going to be some value. And so what this means is that if I look at the red column, I can then see the distance between d1 and ev- and every other domain in it’s band. And uh this naturally kind of means that if you want to find a domains that are very different from its neighbors, we just calculate the uh- Euclidean norm of each column of this of this matrix. Uh and that then allows us to kind of figure out uh how different it is from its neighbor. And of course the larger the value, that’s gonna mean that its more different than its neighbors. So over the January, Feb- uh data set as a trial, we kind of ran this awhile ago. And we had one really interesting case in the neighborhood of a hundred to a hundred and ten. And so as you can see, the average in this ban or in this neighborhood is around a hundred and twenty-eight or so. But there’s one very clear outlier, which has an overall distance from its neighbors from around five hundred and sixty-seven and so in the histogram you can see how the averages are all kind of bunched up in this really tall spike and then you have these two outliers way out at four hundred and around five hundred. And just kind leads it kind of the its its easy to kind of calculate the standard deviation then of these two and then notice that they’re go that they’re definitely two standard deviations away. 
And what was great was that when we ran this, we found this domain called tangerine dash secure dot com, which, through some further manual probing, we were able to identify as a Zbot domain. But Zbot lives in other ranges as well, and so again, in the neighborhood of thirty to forty different ASNs, we found a couple of other outliers. In this case, as the amount of information decreases (because we're going from, say, a hundred ASNs, which is a lot more information, down to, say, thirty), the spectrum of possible distances increases, so it becomes a little noisier. But at the same time, if you look at the tail of the histogram, there are still interesting domains. The majority of distances are all nicely lumped together, but if you cross the two hundred distance mark, you can see there are actually five domains, and out of those five, three actually turned out to be malicious: meenyousecu, securedataassl, and secure tangerine access dot com. The further validation was done through some passive DNS down the road and then some more active probing. What's really good about this method is that we were able to take a set of around eight hundred thousand domains and reduce it to a far more manageable set of around eight domains that we could inspect manually, by hand, or hand to an analyst, and also gain IP information. >> To go back to the earlier analogy: you have all of these Zbot domains trying to hide within the large SSL, ASN, and IP ecosystem, and they're part of the same gang. However, you can see that some of them have lived in low, medium, and high income ranges, to keep the analogy, and with this method you're able to find them, with that whole pipeline of macro, micro, and distance measurement that we described. So a few final thoughts here. When we got this last list, reduced from eight hundred thousand to fewer than ten, we had to use some extra signals to verify the true positives and weed out the false positives. For that we used some simple ones, like: how many SHAs does the common name map to? And also the ratio between the IPs a SHA was found on and the ASNs where those IPs live. That ratio turned out to be very revealing for all of these Zbot domains: meenyousecu, securedataassl, and tangerine all happen to have an IP count over ASN count between one and two. In other words, and this was confirmed, it matches the business model of the actor behind Zbot: when he sells you between forty and fifty IPs, he will rarely give you IPs that belong to the same ASN. Usually you get one IP per ASN, as he tries to diversify the offering he gives to his customers. So yeah, this has been very useful for finding actionable intelligence, so we can block these domains or investigate them further.
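A sketch of that verification signal over the assumed `quads` list: the ratio of unique IPs to unique ASNs per common name. A ratio near one matches Zbot's one-IP-per-ASN provisioning, while ordinary hosting tends to pile many IPs into few ASNs:

```python
from collections import defaultdict

cn_ips, cn_asns = defaultdict(set), defaultdict(set)
for _sha, ip, cn, asn in quads:
    cn_ips[cn].add(ip)
    cn_asns[cn].add(asn)

def ip_asn_ratio(cn):
    """IP count over ASN count; the Zbot domains landed between 1 and 2."""
    return len(cn_ips[cn]) / len(cn_asns[cn])
```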
Obviously we have some other systems to catch these Zbot domains, but the whole exercise we tried to share with you is that you can start with a very large data set and peel it off by building this understanding from macro to micro, and as the intuitions strengthen, you end up with these interesting ways to reduce the set to a scale where it's manageable by hand, or by eyeballing. So just a quick comparison here. You have secure tangerine, which is Zbot, and Blue Apron, which is a legit domain, and they happen to live in the same neighborhood; to use the analogy, they're both making between thirty and forty k. The idea is that they live in a neighborhood where the common names map to thirty to forty ASNs. For tangerine, you can see that all of the ASNs we mention here are Ukrainian ISPs: basically residential IPs, compromised routers or access points that have been leveraged for this infrastructure. Whereas Blue Apron, even though it's in the same band, is hosted on legit networks like Akamai and Orange and Amazon. So for the takeaways, I'll let Thomas go over those. >> So I guess one of the big takeaways is that you can use the global structure of the ASN and domain name graph to help inform decisions at the local level. And there are also really easy statistical tools you can use to whittle down a pretty large data set into something far more manageable. What was great about this is that we started in January, then ran this approach every month on the subsequent sonar SSL feeds and started monitoring the IP space for the domains we found, and Dhia will show some more examples from the later months, like April and June, of what else we found. >> So one more thought about the main takeaways. You might think, OK, you spent all this time to just catch eight or three things; fair enough, but the exercise is useful because you can take this generic method and apply it to any other data set. Most research is often not about the particular problem you're trying to solve; it's about the whole methodology and the mental exercise you go through with your team and your peers. That's why we felt this might be useful as a general thought process, one you can apply to other data sets where you can represent the data as a bipartite graph, X to Y. Now, some bonus slides about the Zbot infrastructure itself. By studying SSL we were seeing some interesting patterns. The slide got messed up there, unfortunately, but the two timelines show a malware C2 domain at the top that operates with SSL differently from the domain at the bottom, private zone dot w s, which is a known crimeware forum. At the top you can see that orospu.cc was created in April of this year. The first DNS queries we saw in our traffic were four days later; then two days after that it was hosted on another bulletproof infrastructure we call Alex, which we actually covered on Thursday at Black Hat. And then on April twenty-third, a cert was created and deployed on the Zbot fast flux, and the domain started being hosted on Zbot.
So you can see here that as soon as you buy the service, you're immediately provisioned with SSL: either you buy the cert yourself and provide it to the actor, and he pushes it to the machines, to the nodes, or he does it for you. Similarly for private zone dot w s, which has been a known crimeware forum for years: the domain was created around four years ago, 2014 I would say. For a long time it was hiding behind CloudFlare, but it had an origin IP that was unknown, let's say, unless you had other ways to find it, like SSL or passive DNS probing. Over that period of CloudFlare protection, they were using a variety of SSL certs provisioned by CloudFlare. Then on May the seventh, 2017, they created a dedicated SSL cert, and the same day they started being hosted on the Alex fast flux infrastructure. Then on June twenty-seventh they became hosted on Zbot with the same old SSL cert, and finally on July nineteenth they created a second SSL cert and pushed it to the edge nodes bought by the customer, which are around forty to fifty machines in Ukraine and Russia. What we're trying to show here is that it's interesting to see how the actor sets up his backend infrastructure and how he maintains all of these different domain creations, SSL creations, hosting, changes of SSL certs, et cetera. I guess the final slide here is the same one we showed earlier, maybe to bring it all together. Again, it's an infrastructure provided for customers to hide their content behind the scenes. You can have one SSL cert per domain, or, often times actually, we saw that even if a bot is still holding the common name of a known crimeware forum, that doesn't prevent any other domain from being hosted or delivered through that IP. But in that case it will not be using SSL encryption, because the domain of the new site will not match the common name of, let's say, private zone. So yeah, that was it. Thanks again for your attention. [applause] Questions? >> Yeah? >> What tools are you using to process the data and generate the graph? >> Uh, we just used Python and CBORD. What's kind of funny is you don't really need any fancy machine learning stuff; you just use some basic stats and you find interesting stuff in the data. >> So yeah, another thing: the whole set of scans was pushed into HBase so we can do the searches at scale. But yeah, like Thomas said, it's mainly that. >> [inaudible off mic comment] >> Yeah, and some good judgment and discussions with the team. >> [inaudible off mic question] >> Oh, he asked whether we found anything interesting about the certificate authorities for the SSL certs, for the domains that were hosting Zbot. >> Well, unfortunately, your typical abused ones, like Comodo and Let's Encrypt, and I don't want to, you know, name names, but those are the ones you see used a lot, because they either offer free certs or they have a lot of resellers. Usually you find a lot of these suspicious or bulletproof hosting providers who will offer you hosting plus SSL certs. So it has become a very common commodity to get a cert with the hosting space.
>> [inaudible off mic question] >> You're asking whether we used the same method to find some other botnets? Good question. Besides Zbot, we are tracking other bulletproof hosting infrastructures that are distributed, but this one happens to be the only one that uses a CDN-like structure. The others will have certs, but they're deployed on only one or two IPs, and as we saw earlier, if the information is too sparse, there isn't much to find with this method. You're better off using other techniques that are much simpler, with no need to complicate your life, basically. >> [inaudible off mic comment] >> OK, sorry, you mean DDoS command and control. We haven't. That would be a good discussion; we can talk. Yeah, thank you. >> [inaudible off mic question] >> Well, Akamai were not a problem. Like we just said, they were good and they were hosting good stuff. >> Yeah, it was a bunch of residential people who were unsuspectingly hosting a lot of these guys. Akamai, for the most part, actually has a really clean network. >> Yeah, the reminder is that most of these bot IPs are in Ukraine and Russia, so they're mostly residential ISPs that are being abused. Akamai was an example we showed of a legit, clean infrastructure. Thank you all. [applause]