>> I want to point out that we have ourselves a first-time speaker here. I'd like to welcome John Seymour to the stage for his talk, Quantum Classification of Malware. Give it up!
>> All right, hi everybody, thanks for coming out to DEF CON — I hope you'll find this interesting. My name is John Seymour, and the best way to reach me is probably through my email, but I'll put my Twitter info on the last slide if you need it. So I guess we'll just go ahead and get started. A little bit about myself: I'm a PhD student at the University of Maryland, Baltimore County, trying to find out what it means to be a good malware dataset. I've been actively studying and researching infosec for about three years now, so I'm still a bit of a newbie, but I'm trying to bridge the gap between academia and industry, and I'm currently at CyberPoint International finishing up a few summer projects involving both infosec and machine learning.

I've split this talk into a few major segments. First I want to talk about the current state of the D-Wave and the D-Wave controversy, and what working with one actually looks like. Then we'll switch gears and move into the machine learning background necessary to understand how the D-Wave classifier works — we don't have the time right now for a full machine learning course, so there's going to be a little bit of technical stuff glossed over, and I hope that's okay. Finally, I'll talk about our design choices and the implementation details of actually getting our classifier onto the D-Wave instance that my university has access to — we found some interesting things as we played around — and I'll wrap this up with some observations about where this research might actually be useful.

Some of you have probably heard about quantum computing, and all the hype and all the criticisms around it — things like quantum computers that will let us communicate perfectly securely, or break all our crypto. I just want to lower everybody's expectations: the D-Wave doesn't do any of that, regardless of the state of standard quantum computing. In fact,
there's a lot of expectation-lowering that I need to do here, because there was some misinformation about the D-Wave when it first came out, so let's go ahead and get that out of the way. You might have heard that the D-Wave solves NP-complete problems in polynomial time. It definitely does not solve NP-complete problems in polynomial time. It might find good solutions to these problems, but it's important to remember that often we can do that classically too. It's also a hard question whether the D-Wave is better at solving any real-world problem at this moment than classical machines. There are a few papers arguing that the D-Wave outperforms classical machines already, but so far these have been pretty spurious comparisons — more specialized classical software was able to outperform the D-Wave on them. Now of course that's not to say it can't become better than standard classical computing in the future, but it just isn't yet.

Because of all this misinformation, it's no surprise that there's a lot of polarized debate about what the D-Wave can do. To my knowledge, this is the current state of affairs. First, the facts: the D-Wave does use quantum effects, but that by itself might not actually be interesting — quantum effects are everywhere, even in NAND flash, for example. The questions everyone's interested in are whether the D-Wave uses quantum effects for computation, and whether the implementation will or might perform better in the future. Regardless, the standard quantum algorithms that everyone gets excited about can't run on it, and D-Wave does try to say this pretty clearly in every presentation about it. It looks like they've made some design choices in the pursuit of solving NP-complete problems which mean universal quantum computation can't happen on the D-Wave machines. Also, to their credit, they have made several advances in cooling and in the power consumption of electronic devices, and I've also heard that some of their techniques might be useful for scaling even standard quantum computation. So that's the non-technical stuff. Most of you have probably seen this
picture before. This is what the D-Wave looks like in its case: it's a big black box, about this size. And if you open it up, this is what you see inside. This contraption is mostly housing for the cooling — there's a lot of room for a technician to stand inside for repairs and whatnot — but the chip is way down at the bottom; you can't even really see it from here. So here's a close-up. There's actually a lot of classical circuitry on this too; only the small part in the middle is the quantum part. And here you can see an even closer view of it.
So this is the actual chip — well, this is the Washington, the 1000-qubit chip that just came out this summer, but its structure is pretty similar to the one we were just looking at. You might actually be able to see a faint grid on the chip: that's a lot of what they call the qubits. There are a lot of little superconducting loops, which is where the possible quantum behavior comes from. These loops are magnetized, and they interact where they intersect — at least, I think we have consensus that the entanglement is at these intersection points, but as with everything else, that's still hotly debated. The idea is that these loops want to be in agreement, which will happen in the minimal-energy state of the system. So think of it like magnets — north-north repels, north-south attracts — and we want all of these different magnetizations to end up pointing in the right directions. The way you program this is by biasing the individual loops, and the couplers which govern their interactions. This formula here is how you represent that mathematically: given the h's and J's, which are all real numbers, the D-Wave computes the final assignment of q's such that this formula is minimized. We normally work with each q being either 0 or 1, and in that case this is known as a Quadratic Unconstrained Binary Optimization problem, or QUBO for short. It turns out the D-Wave can solve QUBOs pretty naturally — although it doesn't always get the absolute minimum solution, so the company now calls their machine a heuristic for solving these sorts of problems. So: we received access to a D-Wave instance in Burnaby, Canada, and D-Wave has built a little website for working with it.
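To make that formula concrete, here's a tiny classical sketch — my own illustration, not D-Wave code — of the QUBO objective the machine minimizes: E(q) = Σᵢ hᵢqᵢ + Σᵢ<ⱼ Jᵢⱼqᵢqⱼ over binary q. Brute force like this only works for toy sizes, which is exactly why a hardware heuristic is interesting.

```python
from itertools import product

def qubo_energy(q, h, J):
    """Evaluate E(q) = sum_i h[i]*q[i] + sum_{(i,j)} J[(i,j)]*q[i]*q[j]."""
    energy = sum(h[i] * q[i] for i in range(len(q)))
    energy += sum(Jij * q[i] * q[j] for (i, j), Jij in J.items())
    return energy

def brute_force_qubo(h, J):
    """Exhaustively find the binary assignment minimizing the QUBO energy.

    Only feasible for tiny problems: the search space is 2^n, and the
    D-Wave is pitched as a heuristic for exactly this minimization.
    """
    best_q, best_e = None, float("inf")
    for q in product((0, 1), repeat=len(h)):
        e = qubo_energy(q, h, J)
        if e < best_e:
            best_q, best_e = q, e
    return best_q, best_e

# Two variables: a negative bias pulls q0 toward 1, and a positive
# coupling penalizes both variables being on at the same time.
h = [-1.0, 0.5]
J = {(0, 1): 2.0}
q, e = brute_force_qubo(h, J)
print(q, e)  # -> (1, 0) -1.0
```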
The website is for submitting programs, and parameters for running them, to the machine. They also have API access, which is basically like using OAuth, for those of you who have used Google Apps before. When you want to play with the chip, on their website you get a visual representation that looks like this — the Chimera graph. This is the System 6 processor we had access to when we ran our first experiments. Now, the first thing you'll notice is all the spots where nodes are missing. Those are physically dead qubits, which are defective: programmers can't interact with them at all, and they're assumed not to interfere with any of the computations. They're determined when the machine boots up, so a reboot can fix those dead qubits — but it can also kill other ones. Reboots of the D-Wave don't happen often, probably every two months or so in my experience. Out of the maximum of 512 qubits, this System 6 chip has 496 working. Now compare that to the System 13 chip, which is what we have access to now — there are a lot more dead qubits on this chip. The takeaway here is that just because they call it a 512-qubit machine doesn't mean the machine actually has all 512 qubits.

And finally, here's an example of an optimization problem run on the System 6 chip. The left is what we input to the machine — each colored node corresponds to the bias that we gave it. On the right is what was returned to us by the D-Wave — again, each color corresponds to a qubit's final state. I think red means the qubit measured at the end of the run was +1, and blue means it was -1. But we like to work with binary variables, so we apply a simple substitution to change the -1s to 0s.

So far we've been talking about problems that map directly onto the Chimera graph. There are certain problems, like 3-SAT, which transform into those pretty simply — and as a side note, this is why the company is so interested in 3-SAT. However, D-Wave has also developed some closed-source
software to embed arbitrary minimization problems, and they call this software Blackbox. Generally, the problem of embedding an arbitrary minimization problem onto the Chimera graph is NP-complete — it's very similar to the subgraph isomorphism problem — though there might still be good solutions for particular graphs. Rather than actually solving that problem for a chip with its given dead qubits, D-Wave instead uses a heuristic called tabu search for embedding problems onto the chip. Blackbox involves a dialogue between this classical tabu algorithm and the D-Wave chip. What I mean by this is: the tabu algorithm, on a classical machine, finds what it thinks is a good embedding for some chunk of the problem, and that chunk is sent over to the D-Wave for solving. The solution is passed back from the D-Wave to the tabu search algorithm, which uses it as input for the next iteration on the problem. And this continues until the machine can't find a better solution, or until a specified timeout is reached. So a lot of our time in Blackbox programs is actually wasted just due to network latency as part of this dialogue.

But actually coding this is all pretty straightforward. Here's an example of some Python code which connects to the System 6 processor and minimizes a given function. You basically just put in the solver you want to use, some parameters, and a function which returns a value for how good a given bit string is, and then Blackbox will look for the best bit string to minimize that function. So the question is: what can this machine actually do? D-Wave lists a lot of applications, like classification, protein folding, and getting close-to-optimal solutions to NP-complete problems like traveling salesman. They do have tutorials for most of these on their website, but it's not quite clear to me how those toy-sized tutorials scale to larger problems. So now we're going to switch gears a little bit and talk about the machine learning background necessary for the D-Wave part.
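Since Blackbox itself is closed source, here's a hypothetical sketch — not D-Wave's actual API, and the function name is my own — of the shape of that loop: a classical search proposes bit strings, a user-supplied objective scores them, and the best string found before the timeout wins. In the real system the scoring round-trips to the quantum hardware, which is where the network latency comes in.

```python
import random
import time

def blackbox_style_minimize(objective, num_bits, timeout_s=1.0, seed=0):
    """Toy classical stand-in for a Blackbox-style dialogue.

    `objective` maps a list of 0/1 bits to a score; lower is better.
    In the real setup the proposals would be evaluated with help from
    the quantum annealer instead of purely locally.
    """
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(num_bits)]
    best_score = objective(best)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Propose a neighbor by flipping one random bit.
        cand = best[:]
        cand[rng.randrange(num_bits)] ^= 1
        score = objective(cand)
        if score < best_score:  # keep only improvements
            best, best_score = cand, score
    return best, best_score

# Example: minimize the number of 1-bits (optimum is all zeros).
bits, score = blackbox_style_minimize(lambda b: sum(b), num_bits=8,
                                      timeout_s=0.2)
print(bits, score)  # -> [0, 0, 0, 0, 0, 0, 0, 0] 0
```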
Now, I obviously can't get through everything to do with machine learning in the time we have available today, but I want to try to go through what's relevant to this project. I'm going to assume you guys know about things like supervised versus unsupervised classification — if you don't, definitely check out the previous DEF CON talks on machine learning; they're amazing. We're using a supervised technique here, which means that the instances we feed into the algorithm are labeled before we train our classifier.

The D-Wave classifier we looked at is a boosting algorithm, and to explain this concept I'm borrowing from a good tutorial I read recently — definitely check it out if you're interested in machine learning. It's very similar to error-correcting codes, if any of you work with signal processing. Let's say we have three programs that classify malware, and further, let's suppose any single one of these programs has a 70% probability of being correct for any given instance. You could simply choose one classifier and be happy with getting 30% of your answers wrong — or you could be a bit smarter. A simple way to combine them is to run each of your three classifiers on an instance and use whichever classification the majority assigns as your final guess. Doing this, your new classifier can be right, in our example, up to 78% of the time, because the new classifier will be correct whenever at least two of the old classifiers guess correctly. You can check this yourself by writing out the probabilities for the four cases where 0, 1, 2, and 3 classifiers are correct. Many boosting algorithms allow you to give some weak classifiers more weight than others, though the one we run on the D-Wave doesn't — and interestingly, the D-Wave algorithm does pretty well even in spite of being simpler.

So first, definitely don't be scared by the equation — it's really not that important, and I'm going to try to talk us through it. Let's talk about this whole process as a minimization problem, because the D-Wave likes to minimize.
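You can check that 70%-to-78% majority-vote arithmetic directly; this little snippet (just my illustration of the calculation) enumerates all the ways the three independent classifiers can be right or wrong:

```python
from itertools import product

def majority_vote_accuracy(p, n=3):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right label."""
    total = 0.0
    for outcome in product((True, False), repeat=n):
        if sum(outcome) > n / 2:  # the majority got it right
            prob = 1.0
            for ok in outcome:
                prob *= p if ok else (1 - p)
            total += prob
    return total

# 0.7^3 (all three right) + 3 * 0.7^2 * 0.3 (exactly two right)
print(round(majority_vote_accuracy(0.7), 3))  # -> 0.784
```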
Central to the idea of machine learning algorithms is the loss function: a quantification of how poorly a classifier performs. The idea is that you want to minimize that loss. Generally the loss has two parts we want to minimize: the number of misclassifications our model makes, and a term that keeps our model as simple as possible. In our case, we have a set of classifiers, and we're trying to find which subset of classifiers can be boosted, using majority vote, into the best possible classifier — and this scary formula is just a formalization of that. This is the actual loss function we used for our classifier; it's the function we hand Blackbox to minimize. I'm mostly including this for people looking at the slides later. I will say, though, that the sign of the weighted sum of the w's is what our boosted classifier guesses the executable is, and this is compared to whether the executable is actually labeled as malicious or not — if they differ, that's called a misclassification. And the lambda term is just a penalty for using a lot of weak classifiers.

So the final machine learning ingredient in our classifier is the features we use. We used n-grams, which is a standard type of feature used in document analysis — we obtained the n-grams from the hex dumps of malware and benign software. You can think of n-grams as a sliding window over text. So consider the hex string "deadbeef". I'll go ahead and give an example of its 2-byte bigrams, and remember that one byte is just two hex digits. We take the first two bytes, and that's one bigram. Then we take two bytes at an offset of one, and that's another bigram. And we keep going until we reach the end of the string. So here are the final bigrams — there are three: "dead", "adbe", and "beef". Now, we actually used trigrams for our programs — a three-byte sliding window instead of a two-byte one — as the basis for our classifier, but that doesn't really change much. There are a few reasons why we chose n-grams over other features.
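Here's a quick sketch of that sliding-window extraction over a hex string — illustrative only; our real pipeline ran this over full hex dumps of executables:

```python
def byte_ngrams(hex_string, n=2):
    """Slide an n-byte window over a hex dump, one byte at a time.

    One byte is two hex characters, so the window is 2*n characters
    wide and advances two characters per step.
    """
    step = 2              # one byte
    width = n * step      # window width in hex characters
    return [hex_string[i:i + width]
            for i in range(0, len(hex_string) - width + step, step)]

# The "deadbeef" example from the talk: three 2-byte bigrams.
print(byte_ngrams("deadbeef", n=2))  # -> ['dead', 'adbe', 'beef']
# The classifier itself used 3-byte trigrams instead:
print(byte_ngrams("deadbeef", n=3))  # -> ['deadbe', 'adbeef']
```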
They've been used for malware before, with decent results, first off. But we mostly used them because we had no idea how many features the D-Wave could handle: it's easy to generate a large number of features with n-grams and then preprocess them down to any given number. And it's also trivial to turn these n-grams into weak classifiers — you can just use whether or not the n-gram is present in the executable as a weak classifier. Obviously, since we're only using trigrams, what we build won't be as good as state-of-the-art malware classifiers. But we don't need it to be the best malware classifier in existence here — we're just using it to compare classifiers between the D-Wave machine and standard classical machine learning techniques.

Now, hopefully all of that wasn't too painful, and we can get into the fun stuff. At first glance, the D-Wave looks like it's going to be awesome for classifying malicious executables. There's an algorithm already developed, called QBoost, for using the D-Wave to classify things. The QBoost models in the paper had higher accuracy than at least one standard classical boosting algorithm, called AdaBoost. What's really interesting here is that, depending on the loss function you use, the classifier can be robust to label noise. Generally, if you're wrong about a lot of the labels in a training set, then many algorithms you apply to it will learn incorrectly. If an algorithm is robust to label noise, however, then short of catastrophic labeling failure it will still generally learn, even if a significant number of instances are mislabeled. And since, when you train a classifier, you have to tell it whether each instance is benign or malicious — and obtaining ground truth is hard in this domain, as we found even in our own lab — that matters. Finally, I found a talk by the creator of QBoost claiming it can eventually scale to what's known as Google-sized problems. Blackbox handles chunking of problems, and so it supposedly can scale to larger problems. There was also a tutorial for implementing QBoost using the
Blackbox software, and it looked pretty easy to do. So that's what we did here. Our goal at the time of this research was to classify executables as either malicious or benign. Of course, there are loads of malicious datasets to choose from; we used VX Heaven, which is a pretty standard dataset for training malware classifiers, although it's starting to show its age — it's like ten years old by now. However, there's no standard benign software dataset, and this is pretty problematic. Our benign set was the combination of executables found in clean Windows XP and Windows 7 installs, plus the executables resulting from installing Cygwin, plus certain SourceForge executables based on some previous work we did. And if you ever thought about doing that — don't. We don't claim that this is an acceptable dataset for future malware classification: it's not very diverse or representative of benign executables in general. We're actually trying to solve that problem now. But as a final note on datasets: we do know about the sources for the benignware, and we would like to make the disclaimer that no adware was used in this classifier.

So there's some classical preprocessing we did before we sent this thing off to Blackbox. First, you'll notice that we have tons more malware than benign examples. If we created a program to classify executables that always returned that the executable was malicious, that program would do extremely well on our dataset, even though it's not actually learning what malware is — just like a random number generator that always returns four isn't really random. To get around this issue, we sampled with replacement, and there are some upsides and some downsides to that. On the one hand, the classifiers are faster to train, but we're throwing a lot of information away by doing so. Sampling with replacement also has some good statistical properties for the underlying distribution, if you care about that sort of thing. Now, we used trigrams as the basis for our classifier, knowing that what we built
here won't be as good as state-of-the-art systems — we're not turning any heads with the accuracy of the models we built here — but the models will be complex enough to compare accuracy and timing information. I did all of this with a simple Python program which uses Blackbox, along with the System 6 D-Wave Two instance that we have access to, to minimize that scary loss function from earlier. That's actually what trains our classifier: it determines which n-grams, combined by majority vote, best classify malware.

So, when we first ran our classifier, it wasn't doing any better than random chance. We did some digging and found that Blackbox was using up all the time that we had actually allotted to solve this problem — we needed to increase the amount of time we allowed Blackbox to search for solutions. But the question is, how much time do we actually need to get reasonable accuracy on a problem with a given number of variables? Previous work using Blackbox mostly deals with NP-complete problems, so they all use a rather large, and arbitrary, 30-minute timeout. But many classical models on the scale of what we built in the past — especially after resampling to smaller numbers of executables — train in a few seconds to minutes, so this is an extremely large time. We originally thought that slimming down the problem in this way would give us a reasonable decrease in the time required to solve it, but we quickly found out that this wasn't the case: even for minimization problems with small numbers of binary variables, it took over ten minutes to get decent solutions. But we still pressed on, just in case some accuracy increase might justify a ten-minute model creation time, even for very, very simple models. So, as a result of our pilot study, we decided to restrict our classifier to 32 features, to balance the complexity of the classifier against the time it took to train. Now, 32 features is a very, very small number of features for machine learning problems, but we found it took almost an
hour to train a single model, and we had a limited allotment of time on the D-Wave machine. We kind of naïvely split those 32 trigrams from earlier into 16 each of benign and malware features. Then we used the same Python code from earlier to train the classifiers, and we noted the time it took to train, the accuracy, and which features were present in the final boosted classifier. We did that on both the D-Wave and the D-Wave simulator — which is classical in nature — using the same features. We compared the D-Wave classifier to several models we built using Weka, which is a well-known machine learning library written in Java: classifiers like AdaBoost, J48 decision trees, and random forests. It should be pretty obvious why we compared to those: AdaBoost and QBoost have been compared before, and they're similar techniques, and J48 and random forests are easy-to-use techniques that work right out of the box, and have also been shown, I think, to be pretty good with malware as well.

I'll just take a minute to let you all look at our results. There are two major things that look super weird about this — maybe you guys can spot them. So, yeah: our first finding is that, for a "quantum speedup", this thing is extremely slow compared to classical. In fact, the timing for the D-Wave results is actually underreported: we only included the total time that the D-Wave itself was running on our problems, so we don't include any latency caused by the network in this calculation — remember that Blackbox involves a conversation between classical and quantum hardware. As a side effect, the classical time for the tabu portion of Blackbox isn't actually included in the time taken to build the D-Wave classifier either. Now, we found the D-Wave to be middle-of-the-road on accuracy, but it takes 10,000 times longer to train. And remember, the other algorithms scale, whereas we had to heavily restrict our number of features in order to train the D-Wave classifier in a reasonable amount of time. We don't know why it takes so long, but we do
have a few guesses. It's possible that Blackbox isn't finding very good embeddings. Or maybe the D-Wave isn't actually getting good enough solutions — those dead qubits from earlier, for example. Or it could be that Blackbox is trying to solve an exponential problem. We don't really know right now.

Our second interesting result is that the D-Wave simulator — which is, again, classical in nature — takes less time to train than the actual D-Wave chip. And that's kind of surprising, right? Why buy a D-Wave when you can just use the D-Wave simulator on your laptop? We think this might be an artifact of dead qubits, because the simulator assumes a perfect chip, but it's still really, really weird that the simulator for the D-Wave actually outperformed the actual chip.

So what does this all mean? We found that, while it is possible to create a malware classifier using the D-Wave, and it has similar accuracy to standard machine learning techniques, it's not very practical: there's significant overhead, and we had to restrict the problem substantially. We don't know exactly where this overhead comes from — it could be from the D-Wave software that embeds arbitrary minimization problems onto the chip, or it could come from the chip itself not finding good enough solutions. However, we're betting that Blackbox is the problem here. Regardless, it seems that the D-Wave isn't quite ready for even this sort of toy problem, much less the real-world malware problems that we currently deal with. So we probably really should have stuck with QUBOs here, even though those aren't ready either.

Now, there are still some areas we'll look into. Before closing this out, there are a few things we could try in order to get around this timing problem. We could just wait: D-Wave chip sizes are supposed to double every couple of years, and defects should decrease over time, and each added qubit should even exponentially increase the size of the problem the chip is able to solve. It's possible that the next-generation chip, or
the chip after that, will be fast enough for this method to compare well with standard models. But of course, waiting is no fun, right? Instead, solving an embedding of the problem directly onto a particular chip, rather than using heuristics for embedding, is what I think is probably the best route. We also noticed that the D-Wave classifier often used fewer, or different, features than the classical algorithms we compared it to. So it's possible the machine might be useful for some other purpose, like feature selection or feature preprocessing — but that time issue is still there.

Beyond the D-Wave revisions, we noticed that most infosec datasets are out of date and relatively small, and private researchers regard datasets and features as being part of their secret sauce for classification. These facts combined make it really, really hard to reproduce results or effectively evaluate our own creations, and that's a challenge I feel we really, really need to overcome.

So actually, that's the end of my talk. I know a lot of people are flying out later today, and I wanted to make sure we had plenty of time for questions. So that's my experience with programming the D-Wave and building our classifier, and I hope you enjoyed it. I can go ahead and field questions if anyone has any, or I guess I could step down and meet with people one on one — I'm honestly kind of better at that anyways. Anyone? Right — so, I hadn't actually seen any studies on scaling with dead qubits here, and I really, really would like to; it's certainly possible to do. So yeah, I think that's a good next step, to see how the new D-Wave chips will actually do in the future. Oh, yes, sorry — if anyone has questions, please use the mic. Yes?