Hi everyone. So I have a lot of stuff to get through and not a lot of time. I am Ryan Mitchell, and I know, the name is weird for me too. Surprise. This is "Separating the Bots from the Humans." So, who I am other than my name: I am a software engineer. I work in Boston on API back-end Java stuff, and I am the author of two books that have nothing to do with my day job. I am an engineering graduate, and I take night classes. I have been doing that for a few years, and I am graduating with my master's in engineering.

So I submitted a brief proposal to O'Reilly for a hacking book, and Python is a popular language these days, so I called it Web Scraping with Python. I put together a proposal and they accepted my hacking book. Thank you. And I don't have a lot of time, so instead I'm going to focus on the first step: separating the bots from the humans. Because if you are a web administrator, what do you want to do? You want to figure out who is the human and who is the bot. And if you are a web scraper, you want to try to keep the administrators from stopping you. So it is a constant back and forth. So this could also be called "How to Look Like a Human When You Are a Bot," which is very important.

So we are at DEF CON; hopefully you have heard of web scrapers before. [Lost audio] ... they can take their sweet time, and they can be smart or dumb. But we're going to start off with the defense stuff, so we're going to go through the stupid human things to stop bots.

So, robots.txt: completely legally unenforceable, and not a standard of anything. It is called the robots exclusion standard, but the IETF doesn't recognize it and the government doesn't recognize it. It is like you and your friends got together, and one of your friends happens to be Google. So it is nice for getting out of indexing, but other than that it is useless for blocking bots, other than the good ones that want to follow it.

Terms of service: this can be more useful, but only in specific circumstances, and those are usually when you and the American who wrote the bot end up in court.
If you have to click "I agree" -- if you have to agree to the terms of service, that is a contract, so you should be really careful about scraping the site. But if it is just at the bottom of the page, go for it; that is not enforceable. You have to do some other thing in order to access the site; they can't just enforce the terms of service because they happen to be on the site.

So, headers. You can change the headers in Chrome. So mostly they are useless, but most websites do check the headers, so if you go to Amazon -- they will send you a 403 Forbidden.

JavaScript. I know we have other opinions about this. I'm not going to sway anyone here. You are now taking the code execution from where you have a controlled environment and giving it to the client, where you don't quite know what is going on. It makes your site less usable, but are you protected against bots? Most can execute JavaScript.

Embedding text in images: don't do that. Your site is not usable by anyone anymore. It is not resizable, so don't try that. And if it is readable, if you do have text that can be read, then it is very easy to OCR, so it is not stopping anyone at all.

CAPTCHAs: most CAPTCHAs are breakable. If you use a CAPTCHA, make sure it is really good, like old Google, or make sure that it is so obscure that no one is going to look at it.

So behavioral patterns are where the future is. The Google reCAPTCHA demo is here, and I have been doing it so often that it will mark me as a bot. It did it. But if I refresh it and click on it again, see, now I have not moved my mouse around enough to convince them I am not a robot. That is the kind of thing that a robot would do, so now it wants me to select all the images. Okay. So, behavioral patterns: if someone is not moving over the page appropriately, or loading things too fast, or typing too fast, or not scrolling down the page, that could be a clue that it is a bot.

Honeypots: that is something that is used often.
Humans can't see it, bots can see it, and the bot thinks that humans can see it; otherwise it would be an obvious honeypot. And it is important to add the honeypots to robots.txt. Then you have the bad robot that says, "It tells me not to go there, so I'm going to go there": honeypot. And the Googlebot will not go to it, so you will not block them.

Now is the fun part, and we are going to go over optical character recognition and JavaScript execution. So, optical character recognition. There is a CAPTCHA here that we're going to go to. All right. So you can see a couple of things about this CAPTCHA. There is this prefix thing here. Notice it has this gray background and it has these blue lines going through it. You can build this image getter that uses a library to go and clean this image. You have functions like "is it gray? If so, take it out." You have things where you can go around the perimeter of the image and see what the first non-gray pixels are, and these are the lines here. And by cleaning the image you can get really nice-looking things like this. And then you can start to do things with it.

So the first step is using this tool that I made, that you can get to; I have the link here. Here is the Tesseract trainer. So you download a folder of CAPTCHAs and label them, so this one is labeled 40AGU -- I suppose you can ship it off to Amazon -- but it is not that hard. Watch a few movies; it is kind of relaxing.

The next step is to bring it into the software. This is just being run from my documents. It is just JavaScript, and it can be found on the GitHub page. So add a collection of files. Let's go to the ones that are labeled, but let's do them again. So open as many files as you want. You get these boxes, and when you move them around it creates box files. Creating these box files is a very necessary step for training optical character recognizers, and I have tried to make it as painless as possible. So download, new one, and this goes through all the images until you run out, and you get a nice folder full of box files that look like this.
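As a rough illustration of what those box files contain: Tesseract's box format is one line per character, giving the character, then the left, bottom, right, and top pixel coordinates measured from the bottom-left corner of the image, then a page number. The sketch below is not the speaker's tool. The function names and the coordinates are made up for illustration, and it assumes the boxes were drawn with a top-left origin, as on-screen boxes usually are.

```python
def to_box_line(char, left, top, width, height, image_height, page=0):
    """Format one labeled character as a Tesseract box-file line.

    Input coordinates use a top-left origin (typical for boxes drawn on
    screen); box files measure from the bottom-left corner, so the
    vertical values are flipped against the image height.
    """
    right = left + width
    bottom = image_height - (top + height)
    upper = image_height - top
    return f"{char} {left} {bottom} {right} {upper} {page}"


def box_file_text(labels, image_height):
    """Join (char, left, top, width, height) tuples into box-file contents."""
    lines = [
        to_box_line(char, left, top, width, height, image_height)
        for char, left, top, width, height in labels
    ]
    return "\n".join(lines) + "\n"


# Hypothetical boxes for a four-character CAPTCHA in a 40-pixel-tall image.
labels = [
    ("Y", 10, 5, 12, 20),
    ("4", 25, 6, 11, 20),
    ("E", 39, 5, 10, 20),
    ("E", 52, 7, 10, 20),
]
print(box_file_text(labels, image_height=40), end="")
```

Each CAPTCHA image gets one such file next to it, which is why the labeling step has to happen before any training can run.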
So this is really important: you need to keep a backup, because it will overwrite them and destroy them. So you have the box files and the PNGs, and you put them into the images folder here.

And now we go to the fun part: training. So this is what uses Tesseract. Run shape clustering -- I don't remember what half these things are, and I spent hours with the documentation. This is a library that was built by academics; it is very difficult to use. All you have to do is get your box files with the appropriate names, named with their solution, and you say python3 (I have both of them installed) trainer.py. Go, and it just does everything for you, which is really nice, because back in the day we had to write the commands.

Now the next step is you get this -- so the language that we're using for this is called CAP. If we wanted to solve any regular text we would use the eng training file, but we make our own here, and we need to move that. I'm just trying to copy the file. So we created this file in the images directory with all the box files and image files, then we just move it over, and now Tesseract knows what the language CAP is and how to run it. So we can do something like this, where we want to get the text from, let's just use an image that is in the labeled folder. Let's check that one out. It is Y4EE. So I did plan that one ahead of time, but it does work well.

But remember, our goal is to go to this page and post a comment. So let's check out what the code does now that we have everything all built. This uses the same cleaning code that is in the image training file. We cleaned our images before we trained on them, so when we get the image we have to clean it again before we can do that comparison. So the main function is post comment, and it goes in here, and the first thing it does is get me a CAPTCHA.
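The cleaning step described earlier (take out the gray background, then look around the perimeter for the first non-gray pixels to find the line colors) might look roughly like the sketch below. It is not the speaker's code. It assumes the image is already a list of rows of `(r, g, b)` tuples so that it needs no imaging library, that the text never touches the border, and that the line color matches exactly wherever it appears; the thresholds and function names are invented for illustration.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def is_background_gray(pixel, tolerance=12, min_brightness=120):
    """True for light gray pixels: channels nearly equal and not dark."""
    return max(pixel) - min(pixel) <= tolerance and min(pixel) >= min_brightness


def perimeter_line_colors(image):
    """Collect non-gray colors on the border, assumed to belong to the lines."""
    height, width = len(image), len(image[0])
    border = [
        image[y][x]
        for y in range(height)
        for x in range(width)
        if y in (0, height - 1) or x in (0, width - 1)
    ]
    return {pixel for pixel in border if not is_background_gray(pixel)}


def clean(image):
    """Whiten the background and the lines; turn everything else black."""
    line_colors = perimeter_line_colors(image)
    return [
        [
            WHITE if is_background_gray(pixel) or pixel in line_colors else BLACK
            for pixel in row
        ]
        for row in image
    ]


# Tiny synthetic example: gray background, one blue line, two text pixels.
GRAY, BLUE, TEXT = (200, 200, 200), (0, 0, 255), (20, 20, 20)
sample = [[GRAY] * 5 for _ in range(5)]
sample[2] = [BLUE] * 5          # a line running edge to edge
sample[1][1] = TEXT
sample[3][3] = TEXT
cleaned = clean(sample)
```

Exact color matching is fragile on real images with anti-aliased lines, so a practical version would probably compare colors within a tolerance. Whatever cleaning is used for training has to be applied again, unchanged, to each image at solve time.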
So this is where we are posting our comment. It goes and gets this SI code, that is the code at the end of the image that acts as a seed for the random CAPTCHA. It downloads the image; you don't have to, you can work with it in memory, but I prefer to download it. Then it uses Tesseract, and then it gets the solution and checks to see if it has four characters. Then it puts in the parameters and sends a POST request. You might get an error page, in which case your Python will explode, but hopefully that will not happen.

So let's try this out. python3 comment poster. Let's watch it. It is going to work. It is loading the page and -- success. (CLAPPING) So I did that in like 10 minutes; you can too. Except for the box file creation, but you can too.

So I mentioned Ajax; that is an important part of the internet today, so I have a bot that can handle this. So there is this page. Here is content that will appear while the page is loading; now the important text pops up. This is the content that we would scrape. So what we need -- in my slides I have the download link. So download PhantomJS. It is a browser that runs in the background; you can run it from the command line and let it do its thing. You can execute JavaScript from the command line. So remember I have that website. So this is the text that it displayed first. So this script is doing its job. It should wait for a little bit while the page is loading, and the text you want to retrieve should pop up. So it is going. I have given it a web driver, the driver gets that page, and it sleeps for a few seconds.

So, honeypots. Here is a bot-proof form. Why is it bot-proof? It has a lot of honeypots. Most people have hidden input fields, and you see in the CSS, I moved this one 50,000 pixels to the right. So these are things that bots try to fill out, and I think there is one at the top of the page. There are other ones too that I invented: basically an input with a couple of divs overlaying it. That makes it so that you can't -- I don't notice anything going on.
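To make the simpler honeypot styles concrete, here is a minimal sketch that flags suspicious inputs by reading the raw HTML with Python's standard `html.parser`. This is a swapped-in static check, not the browser-based approach used in the talk. It only sees inline `style` attributes (not stylesheets), the 5,000-pixel cutoff is an arbitrary assumption, and it cannot detect an input covered by overlaying divs. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches positional offsets such as "right:50000px" or "left:-9999px"
# after whitespace has been stripped from the style string.
OFFSET = re.compile(r"(?:left|right|top|bottom):-?(\d+)px")


class HoneypotFinder(HTMLParser):
    """Collect (field name, reason) pairs for inputs a human would not see."""

    def __init__(self):
        super().__init__()
        self.traps = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if (attributes.get("type") or "").lower() == "hidden":
            self.traps.append((name, "type=hidden"))
        elif "display:none" in style:
            self.traps.append((name, "display:none"))
        elif any(int(px) >= 5000 for px in OFFSET.findall(style)):
            self.traps.append((name, "off-screen"))


def find_honeypots(html):
    """Return the suspicious input fields found in an HTML string."""
    finder = HoneypotFinder()
    finder.feed(html)
    return finder.traps


form = """
<form>
  <input type="text" name="email">
  <input type="hidden" name="token" value="abc">
  <input type="text" name="phone" style="display: none">
  <input type="text" name="fax" style="position:absolute; right: 50000px"/>
</form>
"""
for field, reason in find_honeypots(form):
    print(field, reason)
```

Hidden fields are often legitimate (tokens, for example), so a flagged field is one to leave at whatever value the server supplied rather than one to drop from the submission.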
You can't select that, so it meets the three criteria: invisible to humans and visible to bots. So we use the same Selenium, and you can select by tag name. If you have divs overlaying it, and the text is the same color as the background, then it can be tricked sometimes.

So let's run this. It is going to get that page, and every time a field is not displayed it will tell you it is a trap. If you are a bot filling out the form or navigating from link to link, it is easy to see if something is displayed first before you go to it. It found the one whose type is hidden. The display: none one is a trap. But our honeypot covered with divs, it can't see that at all, so that worked out well.

In the last five minutes I wanted to take a couple of questions, but first run a really cool script that I like. These are the Amazon previews that are protected. They are embedded as images. Not difficult for a bot. These are Ajax loaded, so we can use a bot to go grab this text and print it out. This works better on a larger screen, but I will pop up Firefox so you can see it work. You can do this with PhantomJS as well.

Any questions?

He said, do I have any idea what the defenses might be, or what the attacks might be, against this new generation of behavioral defenses. I think the big one is going to be things like that, where you use JavaScript and you can make the mouse move and interact with drag and drop and send characters one key at a time. So you can add in all this timing and make the mouse move with functions. So I think the next version is going to be emulating the behavioral patterns.

It broke. It took so long to load. So I will try it again.

So, how do you defend against JavaScript? It uses a browser. The only thing between the bot and the human in this case is what the behavior is like: can they do things like a human would be able to do? CAPTCHAs and behavioral, and CAPTCHAs might not be that great.

And I wanted to let you know that everything is available at pythonscraping.com, and I will put it up after the talk. I will be in the DEF CON 23 lounge after this, drinking happily. (CLAPPING)