I'm Mahdi Namazifar, and currently I'm a senior data scientist with Twitter, but before that I was lucky enough to be part of the amazing Talos team at Cisco, and this work is something that I did when I was with Talos. First I want to give you the definition of the problem that I'm trying to address here. Let's say that you're given an arbitrary string and you want to decide whether this string is a random sequence of characters. One thing to note here is that I said random sequence of characters; this work does not address strings that are random sequences of dictionary words. The other thing is that my focus is on strings that are at least 8 characters long; for anything shorter than that it is very difficult, even for a human being, to detect randomness. So I'm focusing on strings of length 8 or more. So why do we even look at this problem? Our motivation was detecting domain names that are generated using domain generation algorithms. You know better than I do how these are used. And this is not a new problem. It has been studied quite a lot; there's rich literature around it, at least I'm aware of some of it, and there are also a bunch of works that were done at Cisco. Usually these works look at this problem as a classification problem and take a machine learning classification approach to solve it. But here my approach is a little bit different. So I'll give you the big picture of the approach, and then I'll go deeper into the details of it. The first thing I want is a word list, a list of words, as many as possible. Once I have this, I put them together into what I call the mega dictionary. How do I use it? I basically take the arbitrary string that I'm given, take substrings of it, and look them up in this mega dictionary.
So based on the number of dictionary hits that I find among these substrings, based on the lengths of these substrings, and based on the different languages that these substrings are from, I come up with a randomness score. And based on this randomness score, I determine whether or not this is a random string. The idea is that if you can see how the substrings of this string can be covered by words from different languages, then you have an idea of whether this is a random string or not. So I'll get into some details here. The first part is that mega dictionary. First I try to find as many dictionaries as I can, language-based dictionaries, basically, from the Internet. These are almost 70 different languages that I found dictionaries for. Some of them are constructed languages; some of them are English, and I have multiple versions of the English dictionary, like British English dictionaries or American English dictionaries. So these are the languages that I was able to find. I should also note here that out of these dictionaries I only want the key, not the value, for each entry: you have the word and you have the definition of it, and I don't care about the definition, I just want the words. So that was the languages. I also get some names, based on the census data: female names, male names, surnames. Also I get a list of Scrabble words, words that are not necessarily in the English dictionary; they could be acronyms. These two items were given to me by my good friend and former colleague Adam Katz, and they helped a lot, actually. The next thing is I get the Alexa top 1,000 domain names and add them to my word list. The word Yelp or the word eBay might not be in any dictionary, but these are actually important words. I also add some numbers, and I got my hands on a list of texting acronyms: YOLO, BRB, things like that.
For some of these words, I need to do some special treatment. For instance, for words that come from Eastern European languages I need to get rid of the accents on the characters. For Mandarin, I need to take these characters and basically transliterate them to Roman characters, and for that I use the pinyin standard. For Russian and Ukrainian, I take care of the fact that I and Y are used interchangeably, and there are a bunch of other special treatments like that. The next thing I need to note is that the same word might appear in multiple dictionaries. The word "book" appears in at least these three dictionaries: English, Polish, Dutch. To take care of this, I run a map to find all the dictionaries that a given word appears in. The result looks something like this: for each given word that I have in these dictionaries, I have a list of the dictionaries that the word appears in. So here in this example, "sui" appears in the French dictionary, in the Catalan dictionary, and a bunch of others. This is what my mega dictionary looks like in the end: the keys are words and the values are lists of dictionary names. And the lookup complexity is constant, so it's pretty fast to look up anything here. In the end I have about 12 million words in this mega dictionary. So that was the dictionary part. Now how am I going to use this dictionary for detecting random strings? I find substrings of a given string and I look them up in this dictionary. How do I do that? I do that by traversing strings. I can traverse a string from left to right this way: I have two indices, one fixed at the end, one at the beginning. If I'm traversing from the left, I fix the right index and move the left index one character at a time, and as a result, these are the substrings that I find. If I traverse the same string from right to left, I get these other substrings.
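A minimal sketch of how such a mega dictionary could be built in Python; the dictionary names, the sample words, and the `normalize` helper here are illustrative assumptions, not the actual Talos data or code:

```python
import unicodedata
from collections import defaultdict

def normalize(word):
    # Strip accents (e.g. for Eastern European languages) and lowercase.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def build_mega_dictionary(word_lists):
    # word_lists: {dictionary_name: iterable of words}.
    # The result maps each normalized word to the list of dictionaries it
    # appears in, so a lookup is a single constant-time hash probe.
    mega = defaultdict(list)
    for name, words in word_lists.items():
        for w in words:
            mega[normalize(w)].append(name)
    return mega

mega = build_mega_dictionary({
    "english": ["book", "here"],
    "polish": ["book"],
    "dutch": ["book"],
})
print(mega["book"])  # ['english', 'polish', 'dutch']
```

The same-word-in-many-languages case falls out for free: "book" maps to all three dictionary names.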
It will become a little bit clearer later on why I do it once from right to left and once from left to right, but let's just fix the definition here and go to the next slide and look at an example. How do I find the substrings of a given string? Let's say we are dealing with the English dictionary only, and let's say that this is my given string. So I start traversing this string from the left, and at each step I find a substring and look it up in my mega dictionary. I continue until I find a hit. So none of these substrings appeared in my English dictionary until I hit this word, which did appear in my English dictionary. So I take that word out of my string, put it aside, and now I am left with this substring. Again, I reset the indices and start the traversal from the left. These are the substrings; none of them are in the English dictionary, until I find this word. It's a hit. I take it out, I'm left with this substring, and so on. So at the end I get these three words out of this string, "it's good to be here". I did this once from the left; I do the same thing once from right to left. From left to right I get this list, and from right to left I get this list. You see that these two traversals, one from the right and one from the left, give me a higher chance of finding the right words in that given string. So I need to pick between these two, and because the minimum length of the words in this list is four, I pick this one: the longer the words you find in the string, the higher the chance that they are not there by chance. So that was just looking up in the English dictionary. What about the case where, well, we built the dictionary based on 70 different languages. How do we use that? Let's say we are looking at this string, and let's say that I found these substrings in this string.
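As a sketch, here is one way the two-directional traversal could look in Python. The exact greedy bookkeeping (which candidate substring gets checked first, the minimum word length, the tie-breaking between the two passes) is my assumed reading of the talk, not necessarily the original indexing:

```python
def extract_words(s, dictionary, from_left=True, min_len=2):
    # Greedy pass: one index stays fixed at one end of the remaining
    # string while the other slides toward it; the first substring that
    # hits the dictionary is pulled out, and the traversal restarts on
    # whatever is left.
    words = []
    remaining = s
    while remaining:
        if from_left:  # fix the right index, slide the left index
            candidates = [remaining[i:] for i in range(len(remaining))]
        else:          # fix the left index, slide the right index
            candidates = [remaining[:j] for j in range(len(remaining), 0, -1)]
        hit = next((c for c in candidates
                    if len(c) >= min_len and c in dictionary), None)
        if hit is None:
            break
        words.append(hit)
        # keep only the still-uncovered part of the string
        remaining = (remaining[:-len(hit)] if from_left
                     else remaining[len(hit):])
    return words

words_en = {"its", "good", "to", "be", "tobe", "here"}
left = extract_words("itsgoodtobehere", words_en)
right = extract_words("itsgoodtobehere", words_en, from_left=False)
# pick the segmentation whose shortest found word is longest
best = max(left, right, key=lambda ws: min((len(w) for w in ws), default=0))
print(best)
```

The final `max` mirrors the selection rule from the talk: prefer the pass whose shortest recovered word is longest, since longer hits are less likely to occur by chance.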
And these are all the dictionaries that each one of them appears in. So the question here is: at the end of the day, how many languages do I need to cover these substrings that I found? If I take the union of these dictionary lists, it's going to be way too many. If I take the intersection, it's going to be empty; they don't have any intersection. So I need to find the minimal set of dictionaries, or languages, that covers these substrings. How do I do that? That's actually the minimum hitting set problem. It's a very well-known, very well-studied problem, and it's basically the father of a bunch of other well-known problems such as the set covering problem. This is the definition of the problem; I don't want to get into it, you can always look it up. It is an NP-hard problem. But the good news is that our sets are small enough that even if I do a greedy search in the space of possible solutions, I can find the hitting set. So that is exactly what I do. I basically have this very, very simple algorithm for finding a minimal hitting set. This is by no means an optimal or even the best greedy algorithm for this problem, but I don't care, because it's fast enough and it gives me what I need. That's good enough for me. So back to our example. We have these sets that we wanted to find the minimum hitting set of, and by just applying a very simple greedy search, I find these minimum hitting sets. The minimum hitting set number here is 2, meaning that I need at least 2 different languages to cover the substrings that I found. So based on this minimum hitting set number, the length of the string itself, the percentage of it that was covered by the substrings that I found, the sum of the lengths of the words found in the string, and the lengths of the substrings themselves, I define a randomness score, and that becomes my touchstone for detecting randomness.
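A greedy heuristic of the kind described could be sketched like this; the example sets are illustrative, not the ones from the slide:

```python
def greedy_hitting_set(sets):
    # Greedy heuristic: repeatedly pick the dictionary that appears in
    # the most still-uncovered substring sets.  Not guaranteed optimal
    # (minimum hitting set is NP-hard in general), but the sets here are
    # tiny, so fast-and-approximate is good enough.
    uncovered = [set(s) for s in sets]
    chosen = []
    while uncovered:
        counts = {}
        for s in uncovered:
            for lang in s:
                counts[lang] = counts.get(lang, 0) + 1
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered = [s for s in uncovered if best not in s]
    return chosen

# three substrings and the dictionaries each one appears in
substring_langs = [{"french", "catalan"},
                   {"english", "polish", "dutch"},
                   {"english"}]
hitting_set = greedy_hitting_set(substring_langs)
print(len(hitting_set))  # hitting-set number: 2 languages suffice
```

Here "english" covers two of the three substrings, so the greedy pass picks it first and then needs only one more language, giving a hitting set number of 2.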
One last thing to mention here is that I do it twice. I first run the string against the English language only. English is universal; many people use it. If, according to the English language, I don't have a verdict about the randomness of the string, I next go to all the languages. So this is basically a two-phase filter. Then there are a bunch of other considerations. One is a sequence of alternating vowels and consonants: if you try this yourself and put down a random sequence of alternating vowels and consonants, it tends to look like a pronounceable word, so I give that case special treatment. Another thing I consider is that if I see a dash or underscore, it means there is some natural separation at that point, because otherwise why would there be an underscore or a dash in the middle of a string? So I treat that as a separator and look at each one of the separate pieces individually. Now I'm going to get into some results. Based on the experiments that we ran, I look at both false positive and false negative rates. Let's first look at false negatives. I'm using 9 domain generation algorithms. These are from known malware and botnets, and they were reverse engineered by the Talos team, so they gave me the code and I generated these samples. This is the number of samples that comes from each one of these algorithms, and this is the number of strings generated by each specific algorithm that were missed by my randomness detection. If you look at the missed percentages, I don't know how clear it is on the screen, but basically the rate is pretty low; across the board it's around 1%, less than 1% false negative rate. So how about false positives? For this I took the Alexa 10,000 domain names, filtered out strings that are shorter than 8 characters, and put them aside, as I mentioned in the first slide. So I'm left with almost 5,400 domain names, and I run them through the code.
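The dash/underscore treatment could look something like this sketch (the helper name and sample string are mine):

```python
import re

def natural_pieces(s):
    # A dash or underscore is treated as a natural separator: split on
    # them and score each piece for randomness on its own.
    return [p for p in re.split(r"[-_]", s) if p]

print(natural_pieces("qwxzv-myblog_site"))  # ['qwxzv', 'myblog', 'site']
```

Each resulting piece would then go through the same substring-lookup pipeline independently.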
The rationale here is that in the Alexa 10,000 you're unlikely to see a lot of DGAs; hopefully most of those are legitimate websites. Out of the roughly 5,400 domains that I checked, 42 of them were flagged by my algorithm as random, and this is the whole list of them. Some of them, like this one, if you showed it to me I would say it's a random string. Or this one; this is pretty random looking to me. But some others are not random. For instance, as a matter of fact, I know this is a pretty legitimate Turkish website, and this one is a legitimate Farsi website. But in the end, out of 5,400, we have 42, and we can say that's about 1% false positive, which is not bad. So 1% false positive, 1% false negative; overall I think that's a good rate compared to other studies that I've seen on this problem. And this concludes my talk. Thank you very much. (Applause.)