Karen Benson, take it away.

Hi everyone, thank you so much for coming today to learn about examining the internet's pollution. As announced, I'm Karen Benson and I'm really excited to be talking here today at my first DEF CON. So, to start off: a couple of years ago on Reddit, somebody asked the garbage men on there about the illegal, strange, and valuable things that they had seen while examining other people's trash. You can go find this thread and read what they found, but the main takeaway is that they found a number of interesting and valuable items. Today I'm going to ask an analogous question, but for the internet: what sort of interesting and valuable information can we find by looking at packets and traffic that you might consider the internet's trash? I feel that I'm pretty qualified to talk to you about this, not because I'm Oscar the Grouch, but because I just defended my PhD, in which I spent the last four years looking at this type of traffic. Prior to that I looked at not-so-trashy traffic, writing intrusion detection software. So, I've looked at some packets.

Alright, a quick outline of the talk. I'm going to go into a little more depth on what this trash is and the various ways that you can collect it. I'll talk about the ways that we collect it and the ways that you could collect it on your own network. I'll go into a little bit about the data that I use for the presentation, and then the bulk of the talk is going to be about the interesting and valuable items that you can find in the trash. And then I'll conclude.

Alright. So, what is internet trash? This is a term I made up, so what am I calling trash? Basically, I mean any unsolicited packets. This means you're not going out trying to get people to send packets to you.
You're just passively capturing everything that comes to your own IP addresses. This has a name other than trash: it's internet background radiation, or IBR. People have studied this for a long time, to look at worms and things like that, but I'll tell you more about the things that have happened in the past couple of years.

Probably the most obvious example of IBR is scanning, when you're searching for hosts that run a service. You're going to send packets to hosts that will respond to you, as well as to hosts that are behind firewalls and are not going to respond to you, and possibly to people like me, who are just collecting the garbage of the internet.

We also get backscatter packets, which are any packets sent in response to a forged or spoofed packet. Typically you think of these in denial of service attacks. You have a victim and an attacker, and the attacker doesn't necessarily want everyone to know that they are the one launching the attack. So they may forge the source address, the "from" field of the packet. When they send it to the victim, the victim may have a hard time differentiating between forged and non-forged packets, and they may respond. But they're not going to respond to the attacker. Instead they're going to, hopefully, respond to us.

Next, we have misconfigurations, which is when you erroneously believe that a machine is hosting a service. These can be small scale, like someone typing an IP address incorrectly, but they can also be pretty large scale and affect a lot of hosts. We see this a lot in peer-to-peer networks. Similar to misconfigurations are bugs.
And this is when you have some sort of software error that causes packets to reach an unintended destination, such as a byte order bug. So even if you know the address of your DNS server correctly, you may, because of some issue in software, send the packet to an unintended destination. We also get a bunch of spoofed traffic where, for some reason, people are using the wrong source address. They typically aren't trying to attack me, but we still get some packets like this. And then finally there's some traffic where we just don't know what it is. This can be TCP SYN packets to non-standard ports, or UDP packets where we don't understand the payload. One example of this is encrypted packets, where it is difficult to understand what the intention of the packet is. So this is a summary of the major classes of IBR.

So how can we collect this? You've probably heard of honeypots, where you purposely set up machines to be infected with malware. Maybe you run an old operating system or some sort of vulnerable service. With this you can get really in-depth information, because you're infected and you understand the attack vectors and the consequences. But if we don't want to do something so in-depth, we can have some other setups.

The first example of this is just collecting one-way traffic. If this is your network and these are the used machines in your network, you announce some BGP prefix, and you probably have some sort of middlebox keeping state about the connections: which ones are bidirectional and which ones haven't received an acknowledgement yet.
And if they never receive an acknowledgement, this is probably some sort of unsolicited traffic, so you can store it as your collection of IBR. Similar to this, you can have a greynet, where your state is the set of IP addresses that are used, and then you know that traffic to the other ones can be routed to storage as it comes into your network. Another concept related to this: if all of your addresses are in some small BGP prefix but you have a much larger one, you can announce the whole prefix that you have and then, based on the destination, decide which packets to route to the destination and which to route to storage. And then finally, an extreme example of this is a network telescope, where you just don't use a BGP prefix that you have and you record all the traffic that comes in. In the order that I presented these, it becomes easier to scale and implement, and there are normally relatively fewer privacy concerns. But you lack the ability to do really in-depth analysis if you're not responding, and people can avoid your IP addresses.

For this talk I am going to use traffic collected at a number of network telescopes. We have multiple large academic network telescopes and we receive a ton of data from them. We're currently capturing about 5 terabytes of compressed PCAP per week, and we have traffic going all the way back to 2008, so we can do some historical analysis with this. With this data we see traffic from all over the internet. In terms of countries, we see all countries except a few islands in the Pacific Ocean. In terms of IP addresses, we are seeing about 5 percent of the IP addresses announced in BGP, so it's a pretty good sampling. I'm showing you data from July 2013, but if we look over time we're almost always seeing data like this. I didn't extend this graph, but it has increased a lot recently too.
There can also be events such as the Spamhaus attack, which was a really big DNS-based denial of service attack. With an event like that we were able to see traffic from even more hosts.

Alright, so now we get to the exciting part of the talk, where we talk about the interesting and valuable things found in the internet's trash. For this section I'm going to go through the major classes of traffic besides spoofed, and I'm going to tell you about the thing that I think is the most exciting for each of them.

In terms of scanning, I'll talk about some trends and their relationship to vulnerability announcements. To collect this data we used the historical data that we have had since 2008, and we just applied Bro's parameters for determining whether an IP address is a scanner: if you send packets to 25 different IP addresses on the same protocol and port within 5 minutes, Bro would alert that you were being scanned. This is maybe not the best definition of a scanner, because it obviously depends on how many IP addresses you have, and it's definitely not capturing slower scans, but it can give us a first look at the macroscopic scanning that's happening on the internet, or at least of our networks.

I broke up the data, looking first at what was happening between 2008 and 2012. The colors correspond to ports, and we see that in terms of both packets and IP addresses the purple port is very popular. This is TCP 445. We see that the first increase is right when the Conficker outbreak occurred, and then we see subsequent increases often corresponding to new releases of Conficker. But we can't say all of this is necessarily Conficker, because there are other scans of this port, though most of it happens to be from Conficker.
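The scanner definition described above (25 different destination addresses on the same protocol and port within 5 minutes) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Bro's actual implementation: the function name `find_scanners` and the packet tuple layout are hypothetical, and the input is assumed to be sorted by timestamp.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60      # 5-minute window from the talk
DISTINCT_DESTINATIONS = 25   # 25 different destination IPs from the talk


def find_scanners(packets, window=WINDOW_SECONDS, threshold=DISTINCT_DESTINATIONS):
    """Return the (src_ip, proto, dst_port) keys that look like scanners.

    `packets` is an iterable of (timestamp, src_ip, dst_ip, proto, dst_port)
    tuples sorted by timestamp. This tuple layout is an assumption made
    for illustration only.
    """
    recent = defaultdict(deque)  # key -> deque of (timestamp, dst_ip)
    scanners = set()
    for ts, src, dst, proto, port in packets:
        key = (src, proto, port)
        queue = recent[key]
        queue.append((ts, dst))
        # Drop observations that have fallen out of the sliding window.
        while queue and ts - queue[0][0] > window:
            queue.popleft()
        # Count distinct destinations still inside the window.
        if len({d for _, d in queue}) >= threshold:
            scanners.add(key)
    return scanners
```

As noted in the talk, a rule like this misses slow scans and depends on how many addresses the observer monitors, so it is only a first-pass filter.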
And so we can come up with some heuristics to determine which packets originate from Conficker. To do this we can exploit a bug that Conficker has in its pseudorandom number generator. For the most part, when it's randomly scanning the internet to propagate, it has a bug where it only targets IP addresses a.b.c.d where b is less than 128 and d is less than 128, so it's only really scanning a fourth of the internet. So we used a heuristic based on the birthday problem, which basically asks: given a random group of people, what is the probability that two people are going to share a birthday? Often the answer is surprising: it takes only about 23 people before it's likely that two people share a birthday. Another way of asking this question is: how many unique birthdays can we expect given n people and 365 birthdays?

Turning this into a way of identifying Conficker: if we have IP addresses a.b.c.d that are being scanned, we can look at the individual bytes of the IP address. If we look at d, we ask how many unique d values we can expect given either 128 or 256 possible values for d. You can repeat this for the other bytes, and you can then start to differentiate, in expectation, between randomly scanning a quarter of the internet and scanning the entire internet.

So if we look at the Conficker outbreak and the amount of scanning that happened around that time period (this graph is in log scale, and we do have some missing bits), we see an increase right when Conficker was discovered. What we would expect here is that before then we wouldn't see any hosts matching our Conficker heuristic. However, when we look at the number of IP addresses meeting the Conficker heuristic, this is what we see. Up until about August, no IP addresses met the heuristic.
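The birthday-problem reasoning above can be made concrete. For n uniform random draws from m possible values, the expected number of distinct values is m(1 - (1 - 1/m)^n). The sketch below compares an observed count of distinct last octets against the expectation for 128 and for 256 possible values. The function names and the "closest expectation" decision rule are assumptions for illustration; the talk does not specify the exact thresholds used.

```python
def expected_unique(n_draws, n_values):
    """Expected number of distinct values after n_draws uniform random
    draws from n_values possibilities (the birthday-problem expectation)."""
    return n_values * (1.0 - (1.0 - 1.0 / n_values) ** n_draws)


def closest_space(octets):
    """Guess whether a list of observed octet values (0-255) was drawn
    from 128 or from 256 possible values.

    Hypothetical decision rule: any value >= 128 rules out the restricted
    space; otherwise pick whichever expectation is closer to the observed
    number of distinct values.
    """
    if not octets:
        raise ValueError("need at least one observation")
    if max(octets) >= 128:
        return 256
    observed = len(set(octets))
    n = len(octets)
    diff_128 = abs(observed - expected_unique(n, 128))
    diff_256 = abs(observed - expected_unique(n, 256))
    return 128 if diff_128 <= diff_256 else 256
```

Applying the same test to the second octet (b) as well would narrow the match further, since the bug described in the talk restricts both b and d.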
So we're pretty sure about the heuristic, and then all of a sudden we started seeing some traffic matching it. This is well before Conficker was actually discovered, so this is evidence that someone was trying to test out their Conficker code prior to this. For the first couple of days, the IP addresses were all in the same province in China. So maybe this can be helpful: as far as I know, nobody has claimed the Microsoft $250k bounty for the Conficker worm author, so perhaps this information could be useful for that.

Alright, so that was before 2012. If we look at what has been happening since 2012, not surprisingly Conficker is dying out, but the most popular port has been replaced with port 23, which is telnet. The best explanation I have for this is that people may be scanning for Internet of Things devices. If you have a better idea, let me know. We can also see some other interesting things happening here. This spike in gray was a variety of ports, and it corresponded to traffic from the Carna botnet. The Carna botnet was where somebody decided to create a botnet, scan the whole internet, and then publish all of the results anonymously. We see this, and we can verify, based on their data, that the traffic was actually coming from the Carna botnet.

If we look at the IP addresses, we notice some periods of time where there's increased activity on a port. If we look at Heartbleed, right around there, you can see in red where the Heartbleed vulnerability announcement occurred, and then a week or so later we see a lot of increased activity on the pink port, which is TCP 443, where Heartbleed could likely be exploited. Similarly, a little bit later we see another burst of activity.
We see a lot of traffic scanning TCP port 5000. Just from Google searching TCP port 5000 for that time: Akamai had a report that they were seeing lots of Universal Plug and Play devices being used in denial of service attacks, and prior to that report we see evidence of scanning on that port. So we were potentially seeing activity before it was used in an attack. Alright, so that was scanning. Hopefully we will release our scanning data set pretty soon.

Going on to backscatter, I'm going to talk about an attack that we've been seeing on authoritative DNS servers. Just a reminder: backscatter is a response to a spoofed packet. Let's suppose you have a web server that you want to perform a denial of service attack on. You could perform a denial of service attack directly on the web server. However, there is also another weak point. All legitimate hosts who want to contact that web server need to find the IP address associated with the name, so they have to do a number of DNS queries. It turns out that you could also perform a denial of service attack on the authoritative name server.

One way that you can do this is with an open resolver. Typically with DNS you should only resolve domains for machines that you administer, so UCSD's domain server should only resolve domain names for clients in UCSD. An open resolver resolves for anyone, and that's typically considered bad because open resolvers can be used in DDoS attacks. So you could use an open resolver to pull off this attack on the authoritative name server. In particular, the attacker can spoof a packet, a DNS query, and send it to the open resolver. Since the open resolver resolves for everyone, it's more than happy to ask the authoritative name server, and it gets a response.
And since the original query was spoofed, the open resolver does not respond to the attacker. Instead, there's some probability that it will respond to our network telescope. So we're seeing a lot of traffic recently from open resolvers. This is 2014 data. Prior to pretty much the end of January 2014, we saw hardly any traffic from open resolvers: about 3,000 open resolvers per month. Then, starting in February 2014, we saw 1.5 million open resolvers per month. We noticed that once this attack took off, we were seeing traffic from the same open resolvers over and over again. This is only a small fraction of the open resolvers on the internet. The Open Resolver Project, which was actively scanning at the same time, saw about 20 times the number of open resolvers that we did. So this means that this attack is only using a subset of the open resolvers.

We can also look at some other data that we have from the attack, which is the status code that comes back with the DNS response. It can be OK, meaning everything is happy. But you can also get a number of failures, including SERVFAIL, which indicates that there is a problem, most likely with the authoritative name server. In the month of data, we got SERVFAIL errors from nearly every open resolver that we saw, whereas in the Open Resolver Project scan they see this error very seldom. So this is evidence that this attack is actually overwhelming authoritative name servers.

Another interesting thing: we see some data on January 29th, and then the attack seems to really take off in the beginning of February. On this first day, the queries were all for baidu.com, which is a popular website, so this looks like a testing phase.
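The status code mentioned above is the RCODE, the low four bits of the flags word in the 12-byte DNS header (RFC 1035). A minimal sketch of pulling it out of a captured UDP payload might look like this. The function name `dns_rcode` is hypothetical, and the sketch assumes the payload starts at the DNS header and ignores extended RCODEs carried in EDNS.

```python
import struct

# Standard RCODE values from RFC 1035.
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def dns_rcode(payload: bytes) -> str:
    """Return the RCODE name of a DNS response given its raw UDP payload."""
    if len(payload) < 12:
        raise ValueError("payload shorter than a DNS header")
    # First two 16-bit fields of the header: transaction ID and flags.
    _txid, flags = struct.unpack("!HH", payload[:4])
    if not flags & 0x8000:  # QR bit clear means this is a query
        raise ValueError("not a DNS response")
    rcode = flags & 0x000F
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")
```

Counting how many distinct resolvers return `SERVFAIL` for a given queried domain would be one way to reproduce the kind of comparison described in the talk.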
Since then there have been lots of domains, and each seems to be used for just a very short period of time. Most of them seem to have bogus registration information. All this analysis was from the first month of activity, and we're still observing this type of attack right now.

Alright, so that was backscatter. Now I'm going to go on to misconfigurations, and in particular I'm going to talk about BitTorrent misconfigurations. If you want to download a torrent through BitTorrent, you typically contact the BitTorrent distributed hash table, and it will tell you the location of the torrent, or some other BitTorrent node that is closer to the torrent that you want. However, there can be malicious nodes in the distributed hash table, and they can lie to you about the location of the torrent. If this happens over and over again, it's going to be a lot harder for you to actually find the torrent and get the latest episode of Game of Thrones or whatever you want to watch. This attack is called an index poisoning attack, where you're purposely inserting fake information about what's in the hash table.

What happens after you receive this false information is that you try to set up a connection. So when people send BitTorrent packets to the network telescope, we get an idea of what torrents they are trying to download. This is some data from July 2012, showing the torrents with the most packets associated with them. You'll notice that a lot of them happen to have the word China in their name. A year later we see about the same thing. This attack doesn't seem to be going on right now, or if it is, it's a lot slower. But we have... oh, I'm sorry.
And typically in this China attack, the IP addresses that are asked for the torrents satisfy this equation, or fall in this set of IP addresses. Basically, they're in certain /13 blocks. So it seems that they're being generated programmatically with a buggy pseudorandom number generator. With this attack, sometimes we see a lot of packets from it and sometimes we don't see any. Currently we're not seeing very many.

But more recently, about a year ago, we saw a huge spike in the amount of BitTorrent traffic we see. We're getting traffic from about 250 times more IP addresses per hour, and we don't really know everything that's going on. To try to investigate this, just a recap: when you want a torrent, you ask a BitTorrent DHT node for the location of the torrent, it comes back with locations, and then you potentially contact our network telescope. So we want to know who is spreading this false information. We can't really learn this by looking at the IBR. Instead, we can set up nodes to actually interact with the distributed hash table. Is there a problem?

So we set up two BitTorrent clients and examined what happened for over two months, and they both contacted our network telescope fairly frequently. We looked at who was telling our clients to contact the network telescope. The most popular client string was a libtorrent one, but this only accounted for about 70% of the clients, and it's a pretty popular client string among legitimate hosts as well. Most of the IP addresses were in China, but they were in multiple ASes, so this wasn't too successful in identifying who is actually sending this false information. But we did notice one really suspicious behavior.
In the hash table all the nodes have an ID, so that means that they think that the IP addresses in our network telescope also have IDs. And the IDs that they request all have four as the third byte. So that's kind of weird. Typically when you receive locations, you receive not just one location but multiple locations at a time, and this behavior is similar for a lot of the other IP addresses that we see. So we're receiving a lot of BitTorrent traffic as a result of a bug or a misconfiguration in a peer-to-peer network.

Peer-to-peer networks also cause a lot of traffic as a result of bugs. If we look at the number of sources sending us traffic over time, we notice some interesting things, like the Conficker outbreak and when we started seeing a lot of BitTorrent traffic. Then all of a sudden, in October 2010, the shape of the graph definitely changes: it's very diurnal, and we weren't really sure what was happening here. We were able to identify the responsible payload. Certain bytes seemed fixed, and we could hypothesize about what the other ones were used for, but we still had no idea what this was. We weren't really sure what the most frequently used ports were either. But we did notice that a large number of the sources sending this traffic were located in China. In fact, in a month's time we received traffic from 30% of all BGP-announced IP addresses in China. So this is huge. Also interestingly, in the USA category, 4 IP addresses belonged to the UCSD computer science department, where I went to school. So we were able to coordinate with someone who could monitor the traffic going in and out of UCSD's network to capture traffic from these IP addresses.
This ensured that this traffic wasn't spoofed and was actually happening. All of the computer science machines basically contacted a common IP address, and in response they got a pretty large packet. Based on this packet, they then sent about 40 more packets to different machines, whose addresses were all encoded in this original big packet. It wasn't just one packet; they were exchanging a lot of packets. Eventually the UCSD machines would receive a packet like this one. This packet is from 113.70.40.122, but instead they would respond to 122.40.70.113 immediately after receiving it. And that response met the BPF filter that we had used to identify all of this traffic. So this is a byte order bug, and this is why we were receiving a lot of this traffic.

We identified that this software bug was in Qihoo 360. If you look at their license agreement, it says that this is the most popular security software in China, and that they will use peer-to-peer technology to update program modules, malware definition databases, and components of the software. So basically we were getting information about when people were getting software updates. We contacted Qihoo and told them, hey, you have this bug, and then we could see how long it took them to fix it. The traffic had one kind of weird thing, which was that every 4 to 5 weeks there was a large spike, probably related to big update events, but there wasn't a big decrease following one of these. Instead it decreased about a month later, around the same time a new version of the Qihoo software was available on their website. So we're still getting some traffic, but in general this bug has been fixed.

Alright, now on to the last part, which is looking at some unknown traffic. The bug was also an example of unknown traffic, but I'll go through another one.
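The byte order bug described above amounts to reversing the four bytes of an IPv4 address. A small sketch using the addresses quoted in the talk (the helper name `byte_swapped` is hypothetical):

```python
import ipaddress


def byte_swapped(ip: str) -> str:
    """Return the IPv4 address obtained by reversing the byte order of `ip`,
    as would happen if a 32-bit address were read with the wrong endianness."""
    packed = ipaddress.IPv4Address(ip).packed  # 4 bytes, network order
    return str(ipaddress.IPv4Address(packed[::-1]))


# The pair of addresses from the talk: the packet came from the first,
# but the reply was sent to the second.
print(byte_swapped("113.70.40.122"))  # prints 122.40.70.113
```

Any unused address space therefore receives replies intended for hosts whose byte-reversed address happens to fall inside it.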
So I did a little research into one kind of traffic that we couldn't identify. We did a byte-wise analysis: what is the first byte, the second byte, the third byte, and so on. We found that this byte here always seemed to be somewhat related to the whole length of the packet itself. Then I read a bunch of white papers and found that in the Sality botnet, the encryption is such that these 4 bytes are an RC4 key used to encrypt or decrypt the entire rest of the message. When we decrypted almost all the packets to this one IP address, we found that they all started like this. So this confirmed that this is a Sality command and control packet.

This is kind of interesting, because I understand why someone would have a bug, or why someone would purposely put false information into the BitTorrent DHT, or how a byte order bug happens, but this also happens in peer-to-peer botnets. And that's why we receive a lot of this traffic. In fact, if we look at how many IP addresses were sending us traffic per month, basically to this one IP address, we see about the same number of infections as Symantec was seeing in the early part of this decade.

So in conclusion, it's pretty likely that you are transmitting internet background radiation, and if you use network telescopes or other technologies you can find a whole bunch of interesting things.
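The decryption step described earlier, where a few bytes of the packet act as an RC4 key for the rest of the message, can be sketched with a plain RC4 implementation. The RC4 routine itself is the standard algorithm. The packet layout in `decrypt_payload` (key in the first 4 bytes, ciphertext immediately after) is a hypothetical placeholder, since the talk only says "these 4 bytes" without giving offsets.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Standard RC4: the same call encrypts and decrypts."""
    # Key-scheduling algorithm.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    # Pseudo-random generation algorithm, XORed with the data.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


def decrypt_payload(payload: bytes, key_offset: int = 0, key_len: int = 4) -> bytes:
    """Decrypt a payload whose RC4 key is embedded in the packet itself.

    ASSUMPTION: the key occupies `key_len` bytes at `key_offset` and the
    ciphertext is everything after it. The real offsets are not given in
    the talk and would need to come from the protocol analysis.
    """
    key = payload[key_offset:key_offset + key_len]
    body = payload[key_offset + key_len:]
    return rc4(key, body)
```

With a layout like this, checking whether every decrypted payload begins with the same few bytes is what would confirm the protocol guess.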
In addition to looking at these security-related events, we can also learn about the networks and machines generating the traffic. For example, you can do outage detection with traffic reaching network telescopes. This is a graph from a paper that analyzed events during the Arab Spring. As you can see, the number of packets coming from Libya went down to zero at certain periods of time, and these corresponded to known times when the Libyan government had pulled the plug on their country's internet.

We can also look at path changes. When you send a packet on the internet, there's this TTL field that is decremented by every intermediate router to prevent routing loops. Based on this you can infer how many hops away the source is, so if this changes then you know that a path change occurred. This can help you analyze outages and understand routing dynamics. If you have traffic like this, where the TTL is about the same value over time, it's probably using the same path. But if it looks more like this, then you know that the path has changed.

And then as a final example, we can also look at DHCP lease duration. When you join a network using DHCP, you announce that you want to join the network and you're given an IP address to use, and typically at some point in time you no longer use that IP address, which means that at a future time someone else can use the same IP address. So we can look at DHCP lease durations using any traffic that has some sort of ID associated with a client.
So if these are the packets you receive over time, you know that the lease duration is at least this long and at most this long. As I noted before, BitTorrent has IDs as well, so we can use BitTorrent to identify how long lease durations are for various autonomous systems. For this autonomous system, almost everything has a minimum lease duration of less than 7 days. This is really useful for understanding the effectiveness of blacklisting, or how people may end up unable to access the internet because you have blacklisted their IP.

So hopefully you enjoyed the talk today, where we discussed some of the crazy things that happen on the internet. Thank you.

Hello? Hi, very fascinating research and great presentation, thank you. Looking toward the future, I noticed this was all IPv4. Have you given any consideration to IPv6-based telescopes, and do you think it's practical with the sparseness of prefixes and v6 addresses?

So I haven't, but some people wrote a research paper where they used an IPv6 telescope. They were able to announce a covering prefix and basically capture everything that other people weren't announcing in BGP. They didn't find as much, but I think as IPv6 evolves, this will evolve as well.

Thank you. So thank you, that's very convincing that this is incredibly useful data. How can other security researchers get access to it?

So I know that the data that UCSD has is available. It's available to academic researchers. You might need to sign a bunch of things, but I don't know the whole process. But you can start with your own network too, if you have one.

Is there a question over there? Thank you. So I'm not sure if I can answer this question. Thank you.