>>Ah, so today we're here to talk about 411, a framework for managing security alerts, which we will be open sourcing after Defcon. [cheering] Before we get started, let's do introductions. My name is Kai, Kai Zhong, and I am a product security engineer at Etsy, so I'm responsible for helping developers write secure code and for maintaining some of the internal applications that we use on the security team, like 411. On occasion I've been known to wear many hats, like you see in that photo. After this presentation I'll be tweeting out links to the slides on my Twitter, so follow me please, gotta get those followers. Alright, sorry, I'm supposed to make a really, really bad pun here: hopefully you won't find our presentation to be unbearable. Yes, you groaned.

>>Thanks Kai. My name's Ken Lee, and I'm a senior product security engineer at Etsy. I'm glad to be back at Defcon; I was here three years ago for a presentation on Content Security Policy. Two important facts about me: one, my Twitter handle is KennySan, and two, I really love funny cat gifs, so I've managed to sneak one into the slide deck. >>Nice! >>For those that don't know, this adorable cat is Maru.

Let me start by explaining what Etsy is. Etsy is a marketplace for handmade and vintage goods. The security team at Etsy is responsible for keeping members' personal information, such as credit card details and addresses, private. In addition, the Etsy security team has been successfully running our own bug bounty program for the past four years. [applause]

I'm going to go into some more detail about what we're covering in today's presentation. First we're going to talk a little bit about the history of our transition to using ELK. We're going to delve into some of the problems that we encountered during this transition process, and we're going to talk about our solution, which we call 411. Then we're going to dive into how we at Etsy do alert management using 411, show you some additional, more involved examples, and finish things off with a non-live demo. I know, I really wanted the live demo, but I never trust the demo gods to get it right.

First we're going to go over some terminology. For some of you this will be old news, so we're going to get through it as quickly as possible. For those that don't know, this is a log file: logs are interesting messages, typically generated by a web server, that are stored in a log file. This is the ELK stack. The ELK stack consists of three different technologies, Elasticsearch, Logstash, and Kibana, and I'm going to quickly go over what each of these applications does.
The first, as represented by our friendly mustachioed log over here, is called Logstash. Logstash is our data processor and log shipper tool; we primarily use it as a way to identify interesting fields that we would want to perform searches on in the future. In addition, we also use Logstash to ship logs into Elasticsearch proper. What is Elasticsearch? Great question, me! Elasticsearch is a distributed, real-time search engine created by Elastic.co. It allows for storing complex nested documents, but in this case we primarily use Elasticsearch for storing the log events parsed by Logstash. In addition, Elasticsearch allows the generation of statistics from your data, so you can run interesting aggregations over the information that you have stored, which lends itself very well to analysis of the data that you have. Finally, the K in ELK stands for Kibana, and that's the data visualization web application that serves as a front end for Elasticsearch. Kibana allows for log discovery and, more importantly, debugging of problems in your application, and in addition Kibana provides some interesting visualization options. Unfortunately this was the best stock image of Kibana that I could find to show you what it does: you can build interesting pie charts, graphs, etcetera, using Kibana as a front end.
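As a concrete illustration of the kind of aggregation Elasticsearch makes easy, here is a minimal sketch that asks a Logstash-style index for the top ten client IPs behind failed logins over the last hour, using the plain REST search API from Python. The index pattern and field names are assumptions about how a pipeline might label events, not anything specific to Etsy's setup.

```python
# Minimal sketch: top ten client IPs behind failed logins in the last hour.
# The index pattern and the event/clientip/@timestamp field names are
# assumptions about how your Logstash pipeline labels events.
import requests

body = {
    "query": {
        "bool": {
            "must": [{"match": {"event": "login_failed"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    "size": 0,  # skip the raw hits; we only want the aggregation
    # Depending on your mapping, the field may need to be "clientip.keyword".
    "aggs": {"top_ips": {"terms": {"field": "clientip", "size": 10}}},
}

resp = requests.post("http://localhost:9200/logstash-*/_search", json=body, timeout=10)
for bucket in resp.json()["aggregations"]["top_ips"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```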
So now let's talk a little bit more about the history of how we transitioned into using ELK. Etsy switched to the ELK stack from Splunk back in mid-2014, and the work took about a year. Throughout this process we learned a lot of good lessons from the migration and we got a bunch of great tools out of it, including 411, but it wasn't a super easy road to go down. We were aware that we were going to run into issues when we started to transition to ELK, and we had to deal with our fair share of really annoying, performance-impacting bugs with our ELK cluster. In addition, the security team was concerned about the usability of ELK as a solution for doing some of our alerting and monitoring.

To give an example of one of these bugs, here we have two AnandTech articles, one from September of 2014 and the other from April of 2015, a span of about six or so months. The first article illustrates the discovery of a bug with Samsung's line of solid state drives, and the acknowledgment and fix came out about six-plus months later. Unfortunately for us, our ELK cluster was powered by these SSDs, so we were affected by this read-performance bug for more than six months. In addition, this is just a small snippet from an email: we had an issue with a kernel-level bug affecting how NFS mounts were handled, which caused a lot of instability with our ELK cluster and, unfortunately, some additional downtime as well. So, to say the least, and these are just two example bugs that we encountered, at times it felt like we were riding the struggle bus with regards to all of the bugs and issues we had to deal with in ELK. That aside, Kai is now going to talk to you about some of the actual problems, not just bugs, that we encountered when migrating to ELK.

>>Thank you Ken. Like most security organizations, alerting is a major part of how the security team at Etsy knows what is going on on the site, and some of the mechanisms we use for alerting are Splunk (or used to be Splunk), StatsD, and Graphite. Unfortunately, when we first started this migration we were making use of Splunk saved searches to automatically schedule queries on some sort of periodic interval, and Elasticsearch didn't offer equivalent functionality at that time. Additionally, Elasticsearch didn't offer any sort of web UI for managing the queries we were writing, which is pretty useful when, say, it's the middle of the weekend, you're getting spammed with alerts, and you need to make a change to one of the queries. Doing so would otherwise require a code push, and you don't want to break something; with a web UI where everything is handled for you, you can just go in, change the query, update it, and you're good to go. The second problem was that we were simply not familiar with the new query language we were faced with. Our old queries were built using SPL, which is the language that Splunk uses, and some of the functionality that we needed in order to write our queries simply wasn't available in Elasticsearch's Lucene shorthand.
Additionally, there were some things that weren't obvious coming from Splunk, especially how Elasticsearch indexes documents, which has an effect on whether and how you can query the fields that you are searching on. This came as a surprise to us at certain points. Because of these issues, the road to ELK integration was a long one. In order to successfully complete the migration we essentially needed three things: firstly, a query language that would allow us to build complex queries, preferably without having to write any code; secondly, a mechanism to actually run these queries and email us the results; and finally, we wanted all of this ready before we turned off Splunk, because otherwise we'd be dark, and that would be really bad.

As it turns out, the first half of the solution was provided to us by the data engineering team at Etsy, and that solution is called ESQuery. ESQuery is a superset of the standard Lucene shorthand, and it's syntactically pretty similar to SPL: it's got pipelines everywhere, so you can take data from the first stage and pass it to the second. I'll provide an example in a bit, but more importantly, it supports all of the functionality that we need. Here's a quick summary of the syntax. When you define an Elasticsearch query you normally do it via a large JSON DSL; ESQuery provides the ability to inline all of that directly into the query, so you can specify, say, the size, how you're sorting the results that come back, or just which fields come back. Additionally, you can do an emulated join, so you can take results from one query and insert them into a subsequent query, and all of the aggregation functionality that is available in Elasticsearch is also available in ESQuery, inline. Finally, you can also define variables within ESQuery, configure them in 411, and have those variables substituted into your queries at run time, so you can have a list of values that you can update independently of the queries.
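To illustrate the variable idea, here is a rough sketch of what expanding a named list into a Lucene-style query string could look like. The ${name} placeholder syntax and the helper below are made up for illustration; they are not ESQuery's actual syntax or 411's implementation.

```python
# Hypothetical illustration of list variables: expand a named list into a
# Lucene-style OR clause before the query is sent off. The ${name} syntax
# and this helper are invented for the example, not ESQuery's real syntax.
def expand_variables(query: str, variables: dict) -> str:
    for name, values in variables.items():
        clause = "(" + " OR ".join(f'"{v}"' for v in values) + ")"
        query = query.replace("${" + name + "}", clause)
    return query

# The list can be updated independently of every search that references it.
variables = {"blocked_ips": ["203.0.113.7", "198.51.100.23"]}
print(expand_variables("type:login AND clientip:${blocked_ips}", variables))
# -> type:login AND clientip:("203.0.113.7" OR "198.51.100.23")
```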
So here's an example SPL query: what it does is find all failed login attempts and then give you the top ten IP addresses that made attempts. This is the same query using Elasticsearch's query DSL, and finally this is the same query using ESQuery. You can see it's pretty similar to how you would write it using SPL, and way shorter as well, and the two are actually similar enough that someone at Etsy was able to write a simple query translator, which we made use of during our migration. What we did was plug a query in, test it out, make changes if necessary, and then stick it into 411.

Speaking of which, next up let's talk about what 411 is. 411 is an alert management application: it allows you to write queries that get automatically executed on some sort of schedule, you can configure it to email you alerts whenever the data sources that you're querying return any results, and additionally you can manage the alerts that are generated through the web interface. Before we dive into 411, let's talk briefly about how scheduling works within the system. Whenever a search job is run, it executes a query against a data source and then generates an alert for every single result that comes back. You can then configure a series of filters on those alerts to reduce or modify the stream somehow, and finally specify a list of targets that the remaining alerts get sent to. One example of a target that is pretty neat is the Jira target, which allows you to generate a ticket for every single alert that goes through the pipeline. If we take a step back, what happens is that there's a scheduler that runs periodically and generates those search jobs, which then get fed off to a bunch of workers that actually execute them.
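Here is a bare-bones sketch of that flow, with made-up class and method names purely to show the shape of the pipeline; it is not 411's actual code.

```python
# Bare-bones sketch of the flow described above (made-up names, not 411's
# code): a search job queries its data source, every hit becomes an alert,
# filters trim or modify the stream, and each target receives what is left.
from dataclasses import dataclass

@dataclass
class Alert:
    content: dict

def run_search_job(search, filters, targets):
    alerts = [Alert(content=hit) for hit in search.execute()]  # one alert per result
    for f in filters:
        alerts = f.apply(alerts)        # e.g. dedupe, throttle, regex include/exclude
    for alert in alerts:
        for target in targets:          # e.g. email notification, Jira ticket
            target.send(alert)
```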
And now we're ready to get into 411. The first thing you'll see when you log on is the dashboard, which is this thing over here. It's pretty simple, but it gives you some useful information about the current status of 411: there's a breakdown of alerts that are currently active, as well as a histogram of the alerts that have come in over the last few days.

Alright, moving on. One of the most important things you'll want to do in 411 is manage the queries that you have scheduled to execute, and you do that via the search management page, which you can see here. In the center you've got all the searches listed out with some categorization information, and on the right you can see the health of that particular search: whether or not it's been running correctly and whether or not it's been able to execute. Now, if you want to modify an individual search, you'll get taken to this page over here, which has a whole slew of options that you can configure. There's a title, which is not too exciting, but more importantly there are all of these fields, so let's go through them briefly. At the top here is the query, which is quite simply the query you're sending off to whatever data source; in this case it's a Logstash source, so we're sending it to an Elasticsearch cluster with a Logstash index. You can also configure a result type: whether you want the actual contents of the logs that match the query, or whether you just want a simple count, or even just an indication that there are no results. And you can apply thresholds on how many results you want to get back. Next up, you can provide a description that gets included whenever an alert gets sent to you, so you should preferably put in information that allows whoever is assigned to the alert to resolve it, and there are a few categorization options at the bottom as well for the alerts that are generated. Next up is the frequency, which is how often you want to run this search, and the time range, which is how far back of a time window you want to search. Most of the time you're going to want both of these to be the same value, but if you want better granularity you might specify a frequency of one minute and a time range of ten minutes. And finally we've got the status button, which lets you toggle this search on and off. Cool, that's all for the basic tab.
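Taken together, the fields on that page amount to a search definition roughly like the following. This is only an illustrative sketch written as a Python dict; the key names are assumptions, not 411's real schema.

```python
# Illustrative only: roughly what the search form's fields amount to, written
# out as a Python dict. The key names are assumptions, not 411's real schema.
failed_login_search = {
    "name": "Excessive failed logins",
    "source": "logstash",                      # which data source to query
    "query": "type:login AND result:failure",  # Lucene-style query string
    "result_type": "count",                    # full results | count | no results
    "threshold": 50,                           # only alert above this many hits
    "description": "Investigate the top offending IPs; see the login runbook.",
    "category": "medium",                      # categorization for generated alerts
    "frequency_minutes": 1,                    # how often the search runs
    "time_range_minutes": 10,                  # how far back each run looks
    "enabled": True,                           # the status toggle
}
```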
Next up, let's talk about notifications. In 411 you can configure email notifications whenever a search generates any alerts, and those notifications can be sent out as soon as the alerts are generated or included in an hourly or daily rollup. You also have to assign these alerts to an assignee, which is the person or the group of people responsible for actually taking a look at and resolving those alerts, and finally the owner field is just for bookkeeping, so you can keep track of who is responsible for maintaining that particular search. And here's the AppSec group that we're currently using: you can see it's got a list of all the users that are currently on the security AppSec team, and whenever 411 generates an alert for this particular search it'll email all of these people.

Alright, moving on to the final tab. Here we've got some more advanced functionality that's less commonly used, like auto close, which allows you to automatically close alerts that haven't seen any activity after a while, since they're probably stale. We've also got the actual configuration for filters and targets here as well. Again, recall that filters allow you to reduce the list of alerts that get passed through 411 and eventually generated. Here is the list of filters that are currently available, so I'll just highlight a few of them. Dedupe allows you to collapse alerts that are the same, and Throttle lets you limit the alerts that are generated to some threshold. For the purposes of this presentation, let's talk about the regular expression filter, because that one is relatively complicated: you can configure this particular filter with the keys you want to match on within the alert, as well as a regular expression to match, and then you can specify whether you want matching alerts to be included in or excluded from the final list of alerts. Similarly, on the other side, we've got the list of targets that you can configure, and we're going to cover the Jira target, which allows you to specify a Jira instance, a project, a type, and an assignee; any alerts that make it to this target get turned into Jira tickets, so that's useful if you want to use Jira as your alert management workflow. Cool, so that's about it as far as managing searches goes.
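As a sketch of the include/exclude behaviour of that regular expression filter, it boils down to something like this; the alert shape and field names are assumptions, not 411's implementation.

```python
# Sketch of a regular expression filter with include/exclude semantics.
# The alert shape and field names are assumptions, not 411's implementation.
import re

def regex_filter(alerts, key, pattern, include=True):
    """Keep alerts whose `key` field matches `pattern`; drop matches if include=False."""
    compiled = re.compile(pattern)
    kept = []
    for alert in alerts:
        matched = bool(compiled.search(str(alert.get(key, ""))))
        if matched == include:
            kept.append(alert)
    return kept

alerts = [{"useragent": "sqlmap/1.5"}, {"useragent": "Mozilla/5.0"}]
# Exclude anything that looks like the sqlmap scanner:
print(regex_filter(alerts, "useragent", r"sqlmap", include=False))
```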
Next up, we're going to get into actually managing the alerts that are generated by 411. So here it is, the main alert management interface. You'll notice at the top there's a search bar for filtering the list of alerts that are visible; 411 actually indexes all of its alerts into Elasticsearch, so all of your standard Lucene shorthand queries are valid here. In the center you'll see all of the actual alerts that matched the current filter, and you can select individual alerts and apply actions to them using the action bar at the bottom. Now, if you want to drill down into an individual alert, you can: this is the view for a single alert, and you can see in the center there's all of the information that was available before, but also a change log for viewing all the actions that have been taken on this one alert. Additionally, you'll see the same action bar is available at the bottom. Let's say we were to investigate this alert: we took a look at the IP address and determined that it's just a scanner, so nothing to worry about. We can then hit Resolve on that action bar, which will pop up this little dialog where we can select a resolution status, in this case "not an issue", and a description of exactly what actions we took to resolve this alert. Once you hit Resolve there, you'll see the change log has been updated with this additional action. 411 also offers an alert feed, so you can just keep this open and whatever new alerts come in will show up in this list. You can also leave it running in the background, because it's got desktop notifications, so you'll see that nice little Chrome popup whenever there are new alerts. Cool, alright, next up.

>>Thanks Kai. I'm going to talk to you more about how we do alert management at Etsy using 411. Here we have a sample email generated by 411, and I'm going to go into some more depth and explain what's going on. The subject line of this email says "login service 500s" and the description says "login 500s, investigate". For people that aren't very familiar with it, login is basically the process that logs you into a website, and a 500 is basically a response that says "something bad is happening", usually bad enough that you would want to create an alert for it and be notified about it. We can see from the time range that this alert covers the past five minutes, and we have buttons at the bottom both to view the alert in 411 and to view this query in Kibana as well. We also get a short snippet including the PHP error that was thrown, and as you can see, from this short email snippet people can take action based on the alert. But let's take a step back a little bit and think more about what we do to actually create high-quality alerts, and at Etsy the secret is that we create alerts that have a high degree of sensitivity.
What do I mean when I say high sensitivity? Well, let's say that we have an alert that fires one hundred times over the course of a day, and out of those hundred times the alert correctly predicts an event actually happening ninety times. What that means is that out of a hundred firings, the alert only improperly fires ten times, so there's a one-in-ten chance that the alert is misfiring, and ninety percent of the time the alert is responding correctly to an event. We say that that particular alert has a sensitivity of ninety percent, and that's a pretty high sensitivity that we would find to be useful. For alerts that aren't as important, we still create them as searches and alerts in 411, but what we do is we end up not generating email notifications from them, and I'll go into more detail as to why in just a moment. For more important alerts we still generate alerts off of them, but we set them up as rollups, so every hour or every day the alert goes off and emails us the results. One reason we really like doing this is that it gives us the option of being able to monitor a particular search over a period of time for anomalies.

One of the reasons we take this sort of tiered approach to alerting is that attackers hitting your website will often generate a lot of noise, and in the process of doing so they'll set off a bunch of the different alerts that you have set up. So one thing we often have to answer when we see an alert on our phone at three in the morning is: is this something that I really need to respond to at three in the morning? Can I just continue sleeping and answer this tomorrow, or even after the weekend? Well, one way in which we make that determination is by looking at the other alerts that have gone off in the same period of time, the high alerts, the medium alerts, and the low alerts. A good example of this would be: let's say a high alert for a very large number of failed login attempts has gone off recently. Well, if we also have a lower alert that indicates a low-quality series of bots is trying to scan us at the same time, maybe that's indicative that this actually isn't a really concentrated attack that we need to worry about, so we can go back to sleep.
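To make that triage idea concrete, here is an illustrative sketch of the kind of check described above, correlating a high-severity alert with nearby low-severity scanner noise. This is not a built-in 411 feature, and all of the names are assumptions.

```python
# Illustrative triage helper (not a 411 feature): before treating a
# high-severity alert as a 3 a.m. page, check whether a low-severity
# scanner alert fired in the same window. All names are assumptions.
from datetime import timedelta

def needs_immediate_response(high_alert, recent_alerts, window=timedelta(minutes=30)):
    """Return False when the high alert coincides with known scanner noise."""
    for other in recent_alerts:
        close_in_time = abs(other["time"] - high_alert["time"]) <= window
        if other["severity"] == "low" and "scanner" in other["name"] and close_in_time:
            return False  # likely just scanner noise; review in the morning instead
    return True
```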
So, in addition to creating alerts, one thing that we also have to be vigilant about is maintaining our alerts. Sometimes we create alerts that overfit on a particular attacker, and as a result of that the alerts become less useful over time. One way in which this happens is that the alert simply generates too much noise: sometimes we've created a search and it turns out that the IP address, for example, might be shared by some legitimate users as well, and that can create a bunch of false positives. In those cases we sometimes have to fine-tune our alerts, and one way in which we do that is by looking at other fields. As another example, sometimes an attacker might accidentally be using a static but very easily identifiable user agent when attacking our website, so we can create a search off of that to easily identify that attacker. But perhaps they become a little savvier, realize that they're making this terrible mistake in the first place, and make an effort to randomize the user agent. By doing this, what they're essentially doing is forcing us to use other fields to identify the attacker, maybe looking at what data center the traffic is coming from, or at the other IP addresses that they're coming from, for example.

So let's take a step back. We've sort of sold 411 as a tool for security teams, but it's also a very useful tool for the average developer, and one way in which 411 can be useful for a developer is creating alerts based off of potential error conditions in your code. A good example of this would be when you want to know about potential exception conditions, say for code wrapped in a try/catch statement. You generally don't want your application to be running into too many exceptions, so by adding a log line and creating an alert based off of that log line, you'll get a notification when something bad happens in your application. Another condition under which you'd want to create an alert is when you're getting a large amount of unwanted traffic to an endpoint that you consider sensitive. A good example of this would be an attacker trying to hit a gift card redemption endpoint or a credit card re-entry endpoint. Those endpoints are probably already rate limited in the first place, so it's only natural to add an additional alert on top of that, just so you know when someone is intentionally trying to brute force that particular endpoint. And finally, the last instance under which you might want to consider creating an alert is when you're deprecating old code.
So at Etsy we have what's called a feature flag system that allows us to very easily flag particular bits of code on and off, but sometimes we need to evaluate how often a particular code branch is being exercised before we can remove it entirely from the code base. One way in which we do that is to just add a log line and create an alert with a rollup, to see how many times this particular code branch has been exercised throughout the course of a day or even a week. By doing that, once we have confidence in knowing that, yes, this code is not really being used that often, we can go ahead and actually remove the code in question.

So at Etsy we actually have a couple of different instances of 411 set up, and I'll explain what they are. Our main instance, which the application security and risk engineering teams use, is called Sec411, and it's primarily used for monitoring issues that happen on Etsy.com itself. The network security team has its own instance of 411 called, appropriately, netsec411, and this instance is set up primarily to aid in monitoring laptops and our servers. And finally, for those compliance-loving folks, we have an instance of 411 set up called Sox411, which is primarily used for SOX-related compliance issues.

Now I'm going to go into some more examples of functionality that we have present in 411 that we're going to be making available to you when we open source the tool. A lot of this additional functionality was built at the request of developers at Etsy, and we found it useful enough to include in the open source version of 411 as well. So, Kai mentioned earlier that 411 has the ability to incorporate lists into queries. Here we have a search that looks for suspicious Duo activity coming from known Tor exit nodes. This query looks fairly straightforward, but let's take a deeper look: we're looking at logs of the type "duo login", and we're looking for an IP address that matches this "tor exits" variable. Well, if we take a look at the list functionality, we can see that "tor exits" is defined as a URL that just enumerates a list of IP addresses. So what 411 is actually doing behind the scenes is taking this "tor exits" variable and expanding the query out to include all of those IP addresses in the Tor exit node list, so essentially, whenever you get any hit on a log line that contains a Tor exit node IP address, it matches the search and generates an alert.
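Behind the scenes, that list expansion amounts to something like the following sketch: fetch the URL that enumerates the Tor exit IPs and fold them into a single query clause. The URL and field names here are placeholders, and this is an illustration of the idea rather than 411's actual code.

```python
# Sketch of the list feature described above (not 411's code): pull a URL
# that enumerates Tor exit node IPs and fold it into a single query clause.
# The list URL and the field names are placeholders.
import requests

TOR_EXITS_URL = "https://example.com/tor-exit-nodes.txt"  # placeholder list URL

def tor_exit_clause():
    lines = requests.get(TOR_EXITS_URL, timeout=10).text.splitlines()
    ips = [line.strip() for line in lines if line.strip() and not line.startswith("#")]
    # One terms filter instead of a hand-maintained pile of OR'd IP addresses.
    return {"terms": {"clientip": ips}}

query = {
    "bool": {
        "must": [{"match": {"type": "duo login"}}],
        "filter": [tor_exit_clause()],
    }
}
```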
Now I'm going to talk more about some of the additional functionality that we offer beyond just the ELK stack with 411. We offer a searcher for Graphite, which is basically a way of storing and viewing time series data. This is what Graphite's front-end interface looks like; as you can see, it's a very nice way of easily generating graphs. This particular graph shows an overlay of potential cross-site scripting over potential scanners, and it's just a really nice way of being able to determine when there are anomalies happening. The Graphite searcher gives you a really easy way to do simple threshold-style alerting, and because the Graphite searcher basically sends the query directly to Graphite itself, all of Graphite's data transform functions are available for you to use in the searcher. As an example of some of the things you can do, you can essentially write a query that says: please fire off an alert when you see a high rate of change for failed logins.

Now I'm going to talk a little bit about the HTTP searcher that we're also making available. This is a fairly straightforward searcher: you provide an HTTP endpoint, and if 411 receives an unexpected response code it creates an alert based off of that. It's very useful for web services, when you want to know if a particular service is, for example, down, or even up, and for those in the devops community it's very similar in functionality to the tool called Nagios.
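The check that this searcher performs boils down to something like the following minimal sketch (not 411's actual implementation): request the endpoint and treat any unexpected status code, or a connection failure, as alert-worthy.

```python
# Minimal sketch of an HTTP availability check (not 411's actual
# implementation): fetch an endpoint and flag any unexpected status code,
# treating a connection failure as unexpected too.
import requests

def check_endpoint(url, expected_status=200):
    try:
        status = requests.get(url, timeout=5).status_code
    except requests.RequestException:
        status = None
    if status == expected_status:
        return None
    return f"unexpected response from {url}: {status}"

print(check_endpoint("https://example.com/healthcheck"))  # placeholder endpoint
```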
Now I'm going to go to the non-live demo portion; let's hope this works. [laughter] Okay, I'll be narrating this. For this demo we set up a very simple WordPress blog instance called Demo All The Things, and we have a plugin installed called WP Audit Log, which logs everything that happens in this WordPress instance. In addition, we are forwarding the logs to our own ELK stack so that we can index the log files. Here I'm just showing off this one nice blog post that we have: red is apparently the best color. Now we go into Kibana proper to actually look at some of the log files from this WordPress instance, and we can see here there's an interesting log line, "User deactivated a WordPress plugin". Okay, that's kind of interesting; maybe we can make an alert off of that particular phrase that we can use in the future.

So what we're going to go and do now is go into 411 proper, go into the Searches tab, hit the Create button, and create a new search of the Logstash type. We're basically just going to create a new search to look for this particular message. We're going to call this search "Disabled WordPress plugin", and the query is going to look for anything in the message field that contains the phrase "user deactivated a wordpress plugin". We're going to provide a little description in the search to let others that use 411 know what this search is about, in case they have to deal with an alert generated by it in the future. We're going to look back over the past fifteen minutes, and we're going to test this search, and we can see here that 411 has successfully grabbed data from Logstash, so we're going to go ahead and create the search. To actually generate a real alert, we're going to go ahead and hit the Execute button, which will not just test the search but actually create a real alert for us on the alerts page, and we can see here that we get the same results back that we just got from hitting the Test button. So now we're going to go into Alerts and click View to take a look at the particular alert that was just generated, and we can see in the plugin file information that the Duo WordPress plugin was disabled. Well, that's not good. So now that we've gotten the relevant information from this alert, we're going to go ahead and remediate the issue: we go into the WordPress back end, we go into the Plugins page, and what do you know, the Duo two-factor auth plugin is disabled. So we're going to go ahead and re-enable it, and now that we've taken care of that issue, we're going to go ahead and hit Resolve and just say that we've taken action to re-enable this plugin, and we've taken care of the alert by doing that. That concludes the live demo, the not-live demo. [applause]

>>Cool, and that also happens to conclude the presentation as well. Once again, 411 is going to be open sourced after Defcon, and we will take questions now. There's a mic over there and over there, so if you've got a question, please line up. [movement in the room] >>If you're leaving, you have to leave out these doors in the back. >>When deciding to move away from Splunk, how did you guys scale ELK versus going with Splunk? ELK has a problem when it gets really big, it gets really expensive, so was it a cost decision moving from Splunk?
>>The question was why we switched from Splunk; it was basically a decision made by our operations team. >>Okay, one last question: what are you guys using as your sendmail function? Are you guys using something like Mailchimp? >>We've just got everything set up correctly already, so it's whatever you provide to PHP. >>The question was what we use to send mail in 411.

>>So, yeah, I have a question. You're open sourcing 411 after this talk, that's the first part, and the second part is: is this built on an AWS architecture, such as using Simple Email Service? Is it using Elasticsearch? What is it using as far as your infrastructure, that you can talk about? >>We're going to be open sourcing this after Defcon, and as far as email... sorry, what was the second question, email, right? >>No, is it an AWS architecture? Do you have an AWS architecture to go with it? >>No, it's just whatever email... >>No, no, I meant in general, the entire thing, because with Elasticsearch, are you using Lambda functions, or is it all pretty much internal-to-itself instances? >>Everything's inside our data centers. >>Okay, got it, thanks. >>Questions?

>>Hey, I have a question about the configuration you showed us, the beautiful UI, but how is the configuration actually stored? Yes, there is a change log on individual pages, but would it be easy to version control the configuration somehow? >>So the question was about the change log and version controlling of alerts. >>There is no version controlling of alerts, but there is a change log of all of the actions that have been taken on an alert. Could you also speak louder, because I think the mic isn't that great. >>Oh, okay. So the initial question was: how is the configuration stored? Is it stored in some text format that you can review, is it XML, can we version control it? >>All of it's stored in MySQL, so we're using MySQL as the database.

>>Hello, hey, so at this point you guys are probably definitely aware of Watcher, Elasticsearch's own alerting service. What's the motivation for this versus using their own plugin, built straight into the cluster? >>So at the time when we started working on this, I don't think Watcher existed yet. >>Yeah, it's super new. >>So that's why we ended up writing this. >>Right, so is there any point to using it now as opposed to just running the plugin? I don't want to be that guy, I'm just... >>I don't know, you're kind of putting me on the spot. There's also the fact that it's not just Elasticsearch: you can also plug other data sources into 411 and query those data sources. >>Okay, thank you.
>>Hi, I have two questions. One of them is: what was your motivation to move away from Splunk and build your own? >>That was a decision made by our sysops team. >>Okay. >>So I didn't really have much input on that. >>But did they have any security concerns at all? >>I don't think so. I think at one point the scripting functionality in ELK was enabled by default and there were some serious security issues with that, but that's as far as I can remember. >>Okay, and just one last question: does ELK also help with doing log analysis across multiple servers and services? >>Uh... >>Or is it dedicated to just one group? >>Yeah, you can set up multiple instances and have them connect to the same database, and that would just work. >>Oh, okay, thanks. >>Okay.

>>Are y'all open sourcing that ESQuery as well? Because the Query DSL sucks. >>Yeah, it's built in. >>Oh, it's already up? >>Huh? >>Oh, it's already up? Oh, it's built in? Okay. >>Mm-hmm. >>My question's on the Jira integration: in your demo you showed that you resolved the issue with the user turning off the feature in WordPress. Does that end up closing a Jira ticket? >>No, it doesn't. The Jira target is pretty much separate: you just send that data off to Jira and then 411 forgets about it. >>Okay, thank you. >>Mm-hmm.

>>Okay, so my question is a little bit twofold. >>Okay. >>We saw a lot of web UI around this, but there wasn't any real focus on an API around it. So consider the use case where the same type of alert happens frequently but self-resolves: would there be the possibility to either escalate that type of alert due to its frequency, or, in contrast, if it somehow self-resolves, have all of the history of those alerts get resolved as well? >>That's not currently built in, but that's because it hasn't been asked for yet, so once this is open sourced you could create an issue and then we could consider it. >>Okay, thank you. >>Cool, guess that's it. [applause]