All right, how's everybody doing? Having a good time? All right. How many people here use Tor? Yeah. All right. How many people trust Tor? No. There we go. How about that? That's good to see. Well, these two gentlemen have been working on a project that is going to help us perhaps trust Tor a little bit more, or at least find those people that are out there messing around with Tor, making it untrustworthy. Let's give these two guys a big round of applause.
Thank you for the introduction. It seems that we have a problem with the slides. They're not showing up, so they're working on fixing it. In the meantime, we're going to start. I'm Guevara Noubir, and this is a joint presentation and joint work with Amirali Sanatinia. We are both from Northeastern University. We work on developing security and privacy techniques and building systems to enhance security. We're also interested in investigating the potential of attacks on real-world systems, hence this work.
So, unfortunately you cannot see the slides, but the talk is about something called Honey Onions that we developed, and it is about exposing snooping Tor HSDir relays. So, this is in the context of Tor. A large number of people use it. It's quite popular. We're interested in understanding how many of the Tor relays are misbehaving. These are relays that host information about what are called hidden services. And, in fact, the issue that we're going to be talking about is known to the Tor people, and they've been working on long-term solutions and also short-term solutions. Our interest was in knowing how many of these relays are misbehaving today. So, the next slide, which you cannot see, unfortunately, is about what Tor is. Tor is a very powerful and popular tool for enhancing privacy.
I personally use Tor regularly. For example, any time I want to check something that has to do with health, or when I Google something and I don't want the whole world to know about it, I use it, and I feel that among the systems that exist today, it's really one of the best systems that one can use. So, maybe one question to the audience. The first question was how many of you know about it. So, how many of you use it? Okay, a fair number. How many of you have run a relay? Yeah. And how many of you have a hidden service that's running? Okay, I see far fewer, and it could mean that maybe some of you don't even want to disclose this information. So, this work is about showing that, while you can trust Tor about various kinds of things, you cannot trust that the existence of your hidden service stays hidden. And this is the goal of this work.
So, Tor provides two types of services that are quite well known. One of them is that you can browse the internet anonymously, in the sense that you can go to some website, and the goal is that your ISP and the website would not know that you're doing so. The other type, called hidden services, allows one to run a server, a website for example, and no one would be able to know its IP address, and therefore its physical location.
Tor is used by a large number of people. I think they have over a million people who use it every day. Most of these users are normal users: they browse the internet and don't want to reveal things about themselves. Some people try to circumvent censorship. All these are reasonable applications. You also have a fair number of journalists who, whenever they want to communicate with their sources or access information, don't want to reveal that. So, that is another type of use of Tor.
You also have activists and whistleblowers, who don't want to reveal information about themselves, who they're talking to, or what information is being shared. You also have law enforcement and military use. They don't want to reveal information about what they're doing, and sometimes they also want to hide among the larger group of people who are much more normal. And, obviously, you also have criminals who use Tor for their activities.
You don't see the slides, but the next one is about hidden services. A hidden service is basically this capability of running a website or server and hiding its physical location. It has other side effects that are quite interesting. One of them is that, by running your website as a hidden service, you get self-authentication. You don't need certificates, because the onion address itself is derived from the public key and allows you to self-authenticate. It also gives you end-to-end encryption. And there are many systems besides websites that use Tor hidden services. SecureDrop, for example, used by the New Yorker or the Guardian, allows one to communicate with journalists and provide information to them. But even mainstream systems like Facebook have a hidden service. You also have applications like Ricochet that allow secure messaging, where every client runs a hidden service so you can hide the identity of the clients and who's talking to whom, and you also get rid of central entities. You also have other types of people who use it: the Silk Road, for example, ran as a hidden service. Ransomware like CryptoLocker used hidden services to hide the location of the server that collects the bitcoins that people would have to pay. So there's a variety of people who use it.
So maybe to also clarify for this talk: Tor claims to have a set of properties, but there are things it does not claim. For example, if you use the Tor browser, you don't have any guarantee of end-to-end encryption, because if you browse a website that does not have HTTPS, any traffic leaving the exit node is in the clear. For hidden services, when you create a hidden service, there's no guarantee that the existence of that hidden service is protected. What Tor aims at, at least in the system as it is today, is that nobody can tell where the service is located, not that nobody can tell it exists. And this work was about finding out how many relays misbehave by trying to get this information, as an indicator of other malicious activities.
This fits within the general space of looking at privacy infrastructure and attacks on it. Whenever you have a privacy infrastructure, various kinds of entities try to attack it, maybe to reduce its popularity or to misuse it. I mean, cryptography in general is used for good things, and a lot of people also try to misuse it. There has been related work trying to find out how many of the exit relays snoop on the traffic of users, which is different from what we are doing. Other work looked into the hidden services themselves and what kind of content they have. Here, what we want to know is this: out of the Tor relays, take the subset that have the HSDir flag and therefore host descriptors about the hidden services. How many of them are misbehaving? Misbehaving means that they log information that they are not supposed to log, so whoever is running them modified the code to be able to log this information, and later on they might visit these websites. As I mentioned, this is a problem that is known to the Tor people. They've been working on resolving it, and they have other techniques that they use to identify these misbehaving relays.
Our techniques have the advantage that we can cover misbehaving relays at a larger scale. So this is not really about breaking the privacy of Tor in terms of where you browse, but more about the existence of hidden services. The questions we try to address, there are four of them. The first one is how many of the Tor relays are misbehaving in the sense that I defined. This makes my life easier. So there are four questions we're trying to address. One of them is how many of them are misbehaving. If we can get even a lower bound on that number, we have an idea about how much misbehavior is happening in Tor. The next one is which of them are snooping, in terms of trying to find out information that they're not supposed to collect. The third one: what do they really do? Are they just collecting information? Do they try to attack? Are they aggressive or not? And the last one is who they really are, beyond which relay and which IP addresses. We have addressed mostly the first two questions and a little bit of the third one. The last one, who they really are, we didn't really solve. This might be a nice community for looking into that and pushing this work to the next stage.
So first maybe I'll explain a little bit how hidden services work. This diagram somewhat summarizes that. To run a hidden service, you pick a random public key. Some people will search for one that ends up in an onion address that they prefer, like Facebook, which has a nice one. But typically you pick a random public-private key pair, you hash the public key in a specific way, and that gives you the .onion address. Then you pick a subset of the relays, a few of them, called introduction points, and you set up circuits to them. These introduction points will help people come back to you later on. Then you hash your .onion address with time information and other things, and that gives you a descriptor ID.
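As a rough illustration of "hash the public key in a specific way", here is a minimal sketch of how a 16-character onion address of the kind discussed in this talk can be derived: SHA-1 over the encoded public key, truncated to 80 bits, then base32. The function name `onion_address_v2` and the placeholder key bytes are assumptions for illustration; a real service would hash the DER encoding of its actual RSA public key.

```python
import base64
import hashlib


def onion_address_v2(public_key_der: bytes) -> str:
    """Derive a 16-character .onion name from encoded public key bytes.

    Sketch only: the input here is any byte string standing in for the
    DER-encoded RSA public key of the service.
    """
    digest = hashlib.sha1(public_key_der).digest()
    permanent_id = digest[:10]  # keep the first 80 bits of the hash
    # 10 bytes encode to exactly 16 base32 characters, with no padding.
    name = base64.b32encode(permanent_id).decode("ascii").lower()
    return name + ".onion"


if __name__ == "__main__":
    # Placeholder bytes, not a real key.
    print(onion_address_v2(b"placeholder public key bytes"))
```

Because the name is a hash of the key, anyone who connects can check that the service holds the matching private key, which is the self-authentication property mentioned earlier in the talk.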
And it also tells you which relays with the HSDir flag you should store this descriptor with. You compute two descriptor IDs, and you end up with a set of six relays with the HSDir flag where you'll put your information. Now, at the same time, you give your .onion address to whoever you want to communicate with you. Then, in step three, the client takes the .onion address and hashes it the same way you did, and for any given day that gives him the descriptor ID, which tells him which HSDir relays he should go to. So in step three, he goes to these relays and asks them what the introduction points are, to be able to talk to this hidden service. In step four, the client also selects something called the rendezvous point. That's some other relay, and he sets up a circuit to it. What matters for this talk is the relays with the HSDir flag, and which of them are misbehaving in the sense that I defined earlier.
So now, to be a little bit more specific about the HSDirs: you have these relays, they have identifiers, and those show up in something called the ring of HSDir identifiers. Your .onion address, once you hash it, gives you a descriptor ID. You find the first HSDir after the descriptor ID on the ring, and you pick the first three, I mean the first one, the second, and the third. Then you hash it again in a different way, which gives you another descriptor ID, and you take the three relays after that one. You now have three and three relays that will host the information on how you can be reached. The reason why it changes every day, and why you take more than one relay, is to have reliability and to protect against denial of service: if you always hosted your information in the same location, someone who wants to block you could hold that information and simply not serve it to anyone.
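The ring lookup just described can be sketched as follows. This is an illustration under stated assumptions, not Tor's code: `descriptor_id` is loosely modeled on the daily rotating hash of the older hidden service design and ignores the optional descriptor cookie, `responsible_hsdirs` is a hypothetical helper name, and the relay fingerprints are made up.

```python
import bisect
import hashlib
import struct

SECONDS_PER_DAY = 86400


def descriptor_id(permanent_id: bytes, now: int, replica: int) -> bytes:
    """Hash the service identifier with the current day and a replica number."""
    # The day boundary is shifted per service so that not every service
    # changes its descriptor ID at the same moment.
    offset = permanent_id[0] * SECONDS_PER_DAY // 256
    time_period = (now + offset) // SECONDS_PER_DAY
    secret = hashlib.sha1(struct.pack(">I", time_period) + bytes([replica])).digest()
    return hashlib.sha1(permanent_id + secret).digest()


def responsible_hsdirs(desc_id: bytes, ring: list, count: int = 3) -> list:
    """Return the `count` relays that follow desc_id on the sorted ring."""
    start = bisect.bisect_right(ring, desc_id)
    return [ring[(start + k) % len(ring)] for k in range(count)]


if __name__ == "__main__":
    # Made-up ring of 50 relay fingerprints (20 bytes each), sorted.
    ring = sorted(hashlib.sha1(bytes([i])).digest() for i in range(50))
    service = hashlib.sha1(b"placeholder public key bytes").digest()[:10]
    for replica in (0, 1):
        did = descriptor_id(service, now=1_455_235_200, replica=replica)
        hosts = responsible_hsdirs(did, ring)
        print(replica, [h.hex()[:8] for h in hosts])
```

Two replicas times three relays gives the six HSDirs per onion that the talk refers to; any one of the six sees the descriptor and could log it.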
The side effect is that whenever they host that information, they can log it, and they can go visit your system, whose existence you don't want to leak. So, our system for detecting who is misbehaving: the idea is quite simple. We create a large number of these things we call honey onions, like honeypots. We set them up in a secure way, in the sense that we follow all the instructions so that they are not going to leak. We don't tell anyone about them. So we know that if someone comes to visit our service, then whoever had the information leaked it. He logged it, and maybe gave it to someone. That is the fundamental idea. But it is not that trivial, because every day we give the information to six relays. It could be any of the six, so we have to find out which of the six.
Since we want to look at the global scale, we generate several batches. Every day we generate some number, every week another number, and every month another batch. The reason for this is that some of them will collect these onion addresses but won't visit them immediately. They will wait for a few days to make us think it might be someone else. We computed that we need 1,500 onions to be able to cover 95% of these relays. So every day we generate 1,500, every week 1,500, and every month 1,500. And then we see who's visiting.
Some of you remember there was a peak in the number of Tor hidden services. I don't have time to comment on that, but it is not us, for various reasons. I mean, that number was much larger than what we generated. At any moment we had 4,500 onions in the system. So before I bore you a little bit with some math, I'll just show you the reason for the name. We chose the name honey onion because it made sense. Then we googled it a little bit and we found that it has a meaning. And it was in fact quite interesting that the meaning really matched what we were doing. So we couldn't resist using it.
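A back-of-the-envelope way to see where a figure like 1,500 onions for 95% coverage can come from is sketched below. It is not the speakers' computation: it assumes descriptor IDs land uniformly on the ring, that each of an onion's 2 descriptor IDs is stored on 3 consecutive relays, and a hypothetical ring of 3,000 HSDirs (the talk does not state the ring size).

```python
import math


def expected_coverage(n_relays: int, n_onions: int,
                      replicas: int = 2, spread: int = 3) -> float:
    """Expected fraction of relays that host at least one honey onion."""
    # Under uniform placement, one descriptor ID reaches a given relay
    # with probability spread / n_relays.
    miss_one = 1.0 - spread / n_relays
    return 1.0 - miss_one ** (replicas * n_onions)


def onions_needed(n_relays: int, target: float,
                  replicas: int = 2, spread: int = 3) -> int:
    """Smallest number of onions whose expected coverage reaches target."""
    miss_one = 1.0 - spread / n_relays
    return math.ceil(math.log(1.0 - target) / (replicas * math.log(miss_one)))


if __name__ == "__main__":
    n = 3000  # hypothetical number of HSDirs
    print(f"coverage with 1500 onions: {expected_coverage(n, 1500):.3f}")
    print(f"onions needed for 95%: {onions_needed(n, 0.95)}")
```

Under these assumptions the miss probability is roughly e^-3, about 5%, which is consistent with the 95% figure quoted in the talk. Real relays are not evenly spaced on the ring, so actual coverage will differ from this estimate.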
So how do we find out who's misbehaving? Again, we create the onion and we put it in six places. If it gets visited, we know that one of the six is bad. But then as we create more onions, they end up in different locations, and now for these two onions there are two relays that could explain the visits. Later on we place more, and you can tell that some of these are probably not the culprit, depending on the assumptions and so on. So you can see that there is a possible way to find out who is misbehaving.
So, the architecture of the system we built: in step one, you generate all these onions that you're going to place on these relays. Then some of them get visited. Whenever there's a visit, it tells us that everyone who knew about that onion should become suspicious. We put them in a graph that we build, called a bipartite graph, which has nodes that correspond to onions and nodes that correspond to relays. Whenever there's a visit to an onion, we add edges from it to all the relays that had this information.
Since I'm running a little bit out of time, let me get to what exactly we did. I'm not going to talk about the details of the math. The first thing is that we wanted to know the smallest set of suspicious relays that could explain the visits that we see. That tells us a lower bound on how many of these relays are misbehaving: if you find the smallest set that explains the visits, we know that at least that many are misbehaving. There's a way to formulate it, but I'm going to skip that math. This is not a trivial problem. There are some heuristics that can give an approximation, and we can formulate this as something called an integer linear program, where to each relay we assign a variable that is either 0 or 1, with 1 meaning that it is malicious. And we want to minimize the sum of these x_i's so that you find the smallest set.
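The optimization can be stated as: minimize the sum of x_j over all relays j, subject to, for every visited onion, the x_j of the relays that hosted it summing to at least 1, with each x_j in {0, 1}. The sketch below solves a toy instance by exhaustive search, which is a stand-in for an ILP solver and only practical for tiny inputs. The function name, the relay labels, and the example visits are all made up for illustration.

```python
from itertools import combinations


def smallest_explaining_set(visited: dict) -> set:
    """Smallest set of relays that touches every visited onion.

    `visited` maps each visited onion to the set of relays that hosted
    its descriptor (the edges of the bipartite graph). Each set is
    assumed to be non-empty.
    """
    relays = sorted(set().union(*visited.values()))
    # Try candidate sets in order of increasing size, so the first one
    # that explains every visit is a minimum one.
    for size in range(len(relays) + 1):
        for candidate in combinations(relays, size):
            chosen = set(candidate)
            if all(chosen & hosts for hosts in visited.values()):
                return chosen
    return set()


if __name__ == "__main__":
    # Toy data: 4 visited onions, hosted by relays labelled A..G.
    visits = {
        "onion1": {"A", "B", "C"},
        "onion2": {"B", "D"},
        "onion3": {"B", "E"},
        "onion4": {"F", "G"},
    }
    suspects = smallest_explaining_set(visits)
    print(len(suspects), sorted(suspects))
```

In this toy case relay B alone explains three of the visits, but onion4 needs F or G, so at least two relays must be misbehaving. The size of the returned set is the lower bound; which relays appear in it is only one of possibly several minimum explanations.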
But we need to explain every visit. And this can be solved with something called an ILP solver. Before we tell you what we found, here is why we trust that this technique works reasonably well: we also did some simulations, selecting some of the relays to be malicious and so on. You can see that we get between 97% and 81% accuracy, where 81% corresponds to the case with a significant number of malicious ones. So now I'm going to pass the talk to Amirali, who's going to tell you the results of the experiments.
Good thing the slides are back, because otherwise I didn't have much to talk about. Here we can see, from left to right on the bottom, the visits for our daily, weekly, and monthly schedules. The reason behind having three schedules is that if an adversary visits a honey onion immediately, we can catch them in our daily batch, but if they wait for a while they won't show up in the daily, and we can still spot them in the weekly and monthly. The other thing, as was mentioned by Guevara, is the rise in the number of onion services. At each point in time we only had 4,500, but the increase in onion addresses was at least an order of magnitude more than what we had. So we are sure it's not us. We started our experiments on February 12th, and most of the results that we explain are based on the 72 days that we ran this experiment, although we have data beyond that, which we discuss later. We are sure that the visits are not a result of the rise in the number of onion addresses, because the increase happened on the 18th of February and we can already see visits happening on the 12th and 13th of February. So they happened before the increase. The other thing to mention is that in the daily graph you can see there are not many visits during the peak. One of the reasons is that getting the HSDir flag takes 96 hours, or four days.
So after people saw there were a lot of new onion addresses, they probably set up new relays, and it took them four days to get the HSDir flag. After that they started probing. The other thing is that you see more visits on the weekly and monthly batches, because those run for a longer time, so the adversaries, the malicious HSDirs, had more time to visit them.
This is an example of a typical connectivity graph that we had. The gray circles in the middle are the onions, the visited onions. The black ones above them are the HSDirs that are picked by the ILP and that explain the visits. All the other colorful nodes are the other HSDirs that have been hosting these onions. As you can see, the orange one at the top right is more trivial to pick, because you only have one HSDir that hosted both of the two visited onions. But the power of the ILP really shows in cases like the purple one at the top left, where you have many HSDirs that have been hosting many of the visited onions, and you want to know what the lower bound is, or which are the most likely HSDirs. That's where our technique and the ILP come into their own. We can pick these four and identify them as the most likely suspicious HSDirs.
And apparently we are running out of time. So, the snooping behavior: some of them were visiting everything, and were hosted on Alibaba. Others weren't visiting most of the onions. The ones we identified, the Tor people also identified. We talked to the Tor people, and after a while the snoopers became more advanced and delayed their visits, which is the bottom left graph. As for geographical location, we mostly see them in Europe and North America, because Tor is used more there, so it's also representative of the use of Tor. You don't have it much in the Middle East and China, because it's mostly blocked. And the location doesn't necessarily mean that these are the countries that are snooping. These are relays located in these countries, not the countries themselves.
And to give you more statistics, more than half of them were hosted on cloud infrastructure, and about 25% of them also had the exit flag, which is much more than what you would expect. Some of them were carrying out attacks and some of them were less aggressive.
So maybe just one final comment. Since we've done this work, whoever was snooping has in fact changed their behavior, and now you can see that most of them delay. They don't visit immediately; they wait for days and sometimes weeks before they do the visits. So this is still an interesting problem. I think we'll stop here. Sorry that we couldn't really talk about the last part in detail.