Using Machine Learning and DNS in the Cat & Mouse Game of Fighting Bot Malware

Posted by Yuriy Yuzifovich

Although they’ve been on the security radar for years, botnets have not gone away. Bots lurking on devices are built to find valuable data such as credit card numbers, logins, or credentials for financial transactions, and to quietly exfiltrate it for monetization.

Bot malware spreads randomly through software flaws or social engineering, and attackers always want to maximize their ROI. Contemporary botnets employ ever-more sophisticated strategies to hide themselves so they can activate and monetize as many exploits as possible. For instance, most malware developers design bots to use resources on compromised hosts judiciously, so they evade the layers of filters that could betray their presence. They also carefully blend legitimate and malicious behaviors to complicate evaluation.

One of the most common approaches in security research is to obtain malware samples using honeypots or other techniques and then evaluate the captured binaries. It can be difficult to fully characterize malware this way, either because of obfuscation and complexity, or because it behaves differently in artificial lab environments than it does in the wild. It also takes time to capture and evaluate samples, which delays subsequent detection. Analysis is further complicated by sophisticated forms of polymorphic malware that vary substantially from one instance to another.

More than anything, it’s hard to scale teams of security researchers parsing outputs from honeypots. As these cat-and-mouse games play out, humans are challenged to keep pace, which allows attackers to sustain their exploits until they do damage. Getting ahead of the volume and sophistication of today’s attacks requires augmenting human expertise with more agile and comprehensive threat insights. 

DNS Data: Agile, Broad, and Deep

An approach that is getting a lot of attention is leveraging Domain Name System (DNS) data for threat intelligence. The DNS has always been central to the Internet: it makes it simple for users to navigate to web destinations, and for content providers, apps, and services like VoIP to advertise their presence so everyone can find and use them. The DNS is also widely used by malware developers because it:

  • Is scalable, manageable, and available on every host and network
  • Easily supports highly dynamic threats
  • Makes it simple to maintain anonymity
  • Makes it free to generate and use random names to obfuscate exploits, and inexpensive (and stealthy) to instantly activate malicious domain names when they’re needed

This means threat intelligence derived from DNS data has the potential to power up human security researchers. DNS queries are often a leading indicator of malicious activity because resolving the IP address of a malicious resource – a botnet Command & Control server (C&C), malware download server, data exfiltration site, etc. – is the first step in enabling most malicious activity. Monitoring DNS activity thus provides an extremely agile method of detecting malware. 

The Power of Machine Learning

Applying machine learning to DNS data delivers an additional benefit. Coverage of malicious activity can be expanded by using techniques borrowed from natural language processing to reveal relationships among seemingly random domain names. Once relationships among suspect domain names have been calculated, they can be fed into other algorithms that generate domain clusters, with the most closely correlated names grouped together.
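
To make the idea concrete, here is a minimal Python sketch of one way to score relationships among domain names. It is not the method referred to above: it swaps in a simple term-document model from text retrieval, treating each client session as a “document” and each queried domain as a “word”. The session grouping, the sample names, and the functions domain_vectors and cosine are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def domain_vectors(sessions):
    """Represent each domain by the sessions it appears in.

    Assumption: a "session" is the list of names one client queried in a
    short time window. The article does not say how queries are grouped.
    """
    vectors = defaultdict(Counter)
    for session_id, domains in enumerate(sessions):
        for domain in domains:
            vectors[domain][session_id] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical sample data: two random-looking names always queried together.
sessions = [
    ["qzxv1.example", "kkpw9.example", "news.example"],
    ["qzxv1.example", "kkpw9.example"],
    ["news.example", "mail.example"],
]
vectors = domain_vectors(sessions)
related = cosine(vectors["qzxv1.example"], vectors["kkpw9.example"])
unrelated = cosine(vectors["qzxv1.example"], vectors["mail.example"])
```

In this toy data set the two random-looking names score far higher against each other than against the unrelated mail domain. A production system would presumably use richer representations and far more data.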

Members of a cluster can then be compared to domains reported by third-party researchers, and when they share the same characteristics as the reported domains, they can inherit the third party’s findings. This extends validated security knowledge to many more domain names. Measurements have shown these machine-derived clusters can increase human security intelligence by 5x to 10x without sacrificing accuracy. The net result is better threat coverage and better precision, and because machine learning operates in real time, there is no delay associated with uncovering new malicious activity. Details of these techniques were presented at a recent security conference.
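
The clustering and inheritance steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm used in the research: it takes precomputed pairwise similarity scores, groups names with a basic single-linkage rule at an arbitrary threshold of 0.8, and copies a third-party finding to every member of a cluster that contains a reported domain. The function names, the threshold, and the sample data are hypothetical.

```python
def cluster(domains, similarity, threshold=0.8):
    """Group domains linked by a similarity score at or above the threshold.

    Assumption: single-linkage grouping via union-find stands in for
    whatever clustering algorithm a real system would use.
    """
    parent = {d: d for d in domains}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving keeps lookups short
            d = parent[d]
        return d

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for d in domains:
        groups.setdefault(find(d), set()).add(d)
    return list(groups.values())


def inherit_labels(clusters, reported):
    """Give every member of a cluster the findings of its reported members."""
    inherited = {}
    for members in clusters:
        findings = {reported[d] for d in members if d in reported}
        if findings:
            for d in members:
                inherited[d] = findings
    return inherited


# Hypothetical inputs: similarity scores and a single third-party report.
domains = ["a1.example", "b2.example", "c3.example", "shop.example"]
similarity = {
    ("a1.example", "b2.example"): 0.95,
    ("b2.example", "c3.example"): 0.90,
    ("c3.example", "shop.example"): 0.10,
}
reported = {"a1.example": "botnet C&C"}

clusters = cluster(domains, similarity)
labels = inherit_labels(clusters, reported)
```

Here one reported name leads to three labelled names, while the unrelated shop domain stays unlabelled. A real pipeline would need stronger checks before inheriting a verdict, since single linkage can chain loosely related names together.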

Creating Threat Intelligence

Like other security research, creating DNS threat intelligence requires DNS data. Data from DNS resolvers, the servers in every network that answer queries from hosts and other networked devices, is an especially rich source of threat insights since resolvers see all DNS activity in a network. Processing live-streamed DNS resolver data substantially improves agility, and diversifying the data set with worldwide samples enhances coverage even more since malware can start in one part of the world and then migrate elsewhere over the course of a day. 

One problem that has to be addressed is that DNS resolvers generate prodigious amounts of data, the vast majority of which is benign. In order to scale efficiently, machine learning systems need to be focused on data of interest, in this case DNS queries that are likely to reflect malicious activity. A method is needed to reduce DNS resolver data by removing as much noise (legitimate queries) as possible while retaining queries that are likely to be malicious.

An effective way to reduce DNS data sets is to match incoming queries, streamed from resolvers, against a log file that aggregates every unique instance of historical queries. This exposes “new” queries which are being seen for the first time. It turns out new queries are highly correlated with malicious activity and are thus excellent candidates for algorithmic processing to determine their reputation and decide whether or not they should be placed on a threat feed to block them. Details were presented at a recent DNS conference. 
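
A “first seen” filter of this kind is simple to sketch in Python. The version below is a minimal illustration, assuming the history fits in an in-memory set and that names are compared case-insensitively without the trailing dot. The class name, helper functions, and sample queries are hypothetical.

```python
def normalize(qname):
    """Lower-case a query name and drop the trailing root dot."""
    return qname.rstrip(".").lower()


class FirstSeenFilter:
    """Pass through only the names that have never been queried before."""

    def __init__(self, historical=()):
        # Assumption: the historical log is small enough to hold in a set.
        self.seen = {normalize(q) for q in historical}

    def is_new(self, qname):
        name = normalize(qname)
        if name in self.seen:
            return False
        self.seen.add(name)  # remember it so a repeat is no longer "new"
        return True


def new_queries(stream, history):
    """Return the queries in the stream that are being seen for the first time."""
    first_seen = FirstSeenFilter(history)
    return [q for q in stream if first_seen.is_new(q)]


# Hypothetical data: two well-known names and one never seen before.
history = ["www.example.com", "mail.example.com"]
stream = [
    "WWW.example.com.",
    "x7f3q.example.net",
    "mail.example.com",
    "x7f3q.example.net",
]
candidates = new_queries(stream, history)
```

Only the previously unseen name survives the filter, and it is reported once even though it was queried twice. At resolver scale the history would presumably need a more compact structure than a plain set, such as a Bloom filter or a key-value store.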


Evaluating real-time DNS query data obtained from DNS resolvers provides an extremely agile method of detecting the presence of bot malware. Machine learning increases depth of threat coverage by revealing subtle variations in exploits that honeypots or other traditional forensics techniques miss. It also operates in real time so new threats are identified quickly. 

The end game is to equip DNS resolvers with real-time DNS threat intelligence, which allows them to identify malicious traffic instantly by matching incoming queries against the threat list. This approach is lightweight, with no dedicated inline or offline data-plane packet processing. Instead, intelligent processing of DNS traffic in the control plane is highly efficient and has no perceptible network impact. There’s also no need to install any software on devices – computers, tablets, phones, IoT – so there’s no impact on them either.
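
As a rough illustration of how cheap that matching step can be, the Python sketch below checks a query name against a threat list held as a set. It rests on one added assumption: a listed domain also covers its subdomains, which the article does not specify. The function names and sample domains are hypothetical, and a real resolver would implement this in its own policy engine.

```python
def candidate_names(qname):
    """List the query name and its parent domains, most specific first.

    The bare top-level label is skipped unless the name has only one label.
    """
    labels = qname.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(max(1, len(labels) - 1))]


def is_blocked(qname, threat_list):
    """Return True if the name or one of its parent domains is on the list.

    Assumption: listing a domain is meant to block its subdomains as well.
    """
    return any(name in threat_list for name in candidate_names(qname))


# Hypothetical threat feed containing a single domain.
threat_list = {"bad.example"}
blocked = is_blocked("cdn.Bad.example.", threat_list)
allowed = not is_blocked("good.example", threat_list)
```

Each lookup costs a handful of set membership tests, one per label in the query name, which is consistent with the lightweight, control-plane approach described above.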

Yuriy Yuzifovich

Director of Data Science, Akamai


Blogs posted to the website are intended for educational purposes only and do not replace independent professional judgment. Statements of fact and opinions expressed are those of the blog author individually and, unless expressly stated to the contrary, are not the opinion or position of RSA Conference™, or any other co-sponsors. RSA Conference does not endorse or approve, and assumes no responsibility for, the content, accuracy or completeness of the information presented in this blog.
