LLMs Save Threat Researchers Time, But Our Obfuscated JavaScript Case Study Shows How Multi-Step Reasoning Can Go Awry


Posted by Armin Buescher

In its search for new ways to improve the speed and accuracy of threat research, the cybersecurity community has, of course, turned to generative AI (genAI). It’s no surprise, then, that genAI featured prominently in the RSAC™ 2025 Call for Submissions entries for the “Threat Intelligence and Incident Response” and “Threat Intelligence and Attribution” subtopics. Vicente Diaz, Threat Intelligence Strategist at Google, addressed the question in detail in his “Analyzing VirusTotal’s Malware Executables Collection with LLMs” session at RSAC™ 2025 Conference: he found that LLMs struggled with some code analysis because of limited prompt sizes, and that they were most effective when combined with decompilation and disassembly tools.

Here at RSAC, we’ve also conducted research to assess the potential of genAI in threat and malware research. We began by opening the aperture a bit to consider some of the challenges(a) that practitioners face in threat and malware research and its broad applications in threat hunting, which include: 1) defining clear threat models at the appropriate granularity; 2) deciding which tools to use; 3) tests and experiments that aren’t easily reproducible; 4) the limitations of VirusTotal and other repositories (incomplete, corrupted, or incorrectly labeled samples, etc.); 5) biased or unrepresentative data sets that lead researchers astray; and 6) an insufficient connection between academic research (focused on novelty without practical constraints) and industry research (focused on viable solutions to real-world problems).

Threat and Malware Researchers Need Data and Tools to Manipulate That Data

Threat researchers need two basic resources to meet the challenges described above: (accurate) data, and tools to work with that data. Many data sources are relevant to these roles, including threat intelligence feeds, telemetry from security products, network and endpoint logs, open source intelligence, and sometimes private communications. The data ranges from completely unstructured to fully standardized formats, and from a single indicator of compromise (IoC) to the massive intelligence firehoses of Microsoft or Google. As for tooling, a wide range is available, each option focused on a different aspect of threat research. For example: 1) a malware reverse engineer might use only disassemblers and debuggers like IDA Pro, Ghidra, Binary Ninja, x64dbg, or Hopper; 2) an incident responder might spend most of their time securing evidence and analyzing logs with tools like Velociraptor, Timesketch, or osquery; and 3) a threat actor tracker might live in underground fora like BreachForums and DarkFox, as well as foreign-language blogs. The field is diverse, and a varied skillset produces synergistic effects: skills from one discipline make a cybersecurity professional better at all the others. But time is scarce, and using it efficiently is a challenge.

Generative AI Tools Eliminate Routine Work and Shave Time From Complex Efforts

So, can generative AI help threat researchers speed up their day-to-day jobs? Yes, it can. A plethora of AI-assisted security and analysis products are already on the market, but there are also many ways for individual researchers to use AI to make manual work easier and faster, provided they keep its limitations in mind. Here are a few of the applications we find most compelling:

  • AI plugins and add-ons for popular reverse engineering tools like IDA Pro, Ghidra, and Binary Ninja are improving, and can significantly speed up reverse engineering efforts. Mainstream AI offerings are also getting better at illustrating code; some can create complex flowcharts that can be imported into other tools.
  • Deobfuscating code, typically scripts (see Figures 1 and 2 below, and the sketch that follows them). This is a particular benefit for earlier-career researchers: their senior colleagues spent thousands of hours learning to deobfuscate code through painstaking trial and error, whereas today’s new malware researchers can rely on LLMs to do much of this work for them. But don’t expect a 100% deobfuscation success rate with LLMs; see our obfuscated JavaScript case study in the next section for an example of how an LLM can go astray.

Figure 1: An example obfuscated script


Figure 2: DeepSeek’s deobfuscated version of the script in Figure 1

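For researchers who prefer to drive this kind of deobfuscation pass from a script rather than paste samples into a chat window, a minimal sketch follows. It assumes an OpenAI-compatible endpoint reachable through the openai Python client; the model name, the prompt wording, and the obfuscated.js input file are placeholders for illustration, not our exact setup.

    # Minimal sketch: ask an LLM to deobfuscate a script via an OpenAI-compatible API.
    # The model name, prompt, and input filename are assumptions for illustration only.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("obfuscated.js", "r", encoding="utf-8", errors="replace") as f:
        sample = f.read()

    prompt = (
        "Deobfuscate the following JavaScript. Return the cleaned-up code with "
        "meaningful variable names, then summarize what it does. If you are "
        "unsure about any behavior, say so explicitly.\n\n" + sample
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

As the case study below shows, the explanatory portion of the answer in particular should be treated as a hypothesis to verify, not as a finding.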

  • Explaining artifacts. Malware writers don’t usually provide explanatory comments in their code, so even if the code isn’t deliberately obfuscated, researchers still need to spend time figuring out what it does and how. GenAI tools can perform a lot of this detective work effectively if given a few prompts. This applies not only to code but to other artifacts as well – for example, the tools can explain what kind of data a DNS TXT record contains and why.
  • Developing scripts to automate manual tasks (a sketch of the kind of script we mean follows the figures below). Researchers have two options for using LLMs to create useful little bits of project-specific code: 1) “traditional,” where one prompts an AI code assistant like GitHub Copilot, Amazon Q Developer, or Google Gemini Code Assist and then reviews and adjusts the code it generates; and 2) so-called “vibe-coding,” where the user keeps refining the output with their AI code assistant until the script does what they want (Cursor and Windsurf are popular options for this approach). We estimate that even with the more cautious “traditional” approach, we can cut the time spent automating manual tasks by 90%. Automation scripts are a great place to experiment with vibe-coding because they’re quasi-throwaways; they don’t need to be elegant code masterpieces.
  • Transcription, translation, summarization, and image description (see Figures 3 and 4 below, and the sketch that follows them). LLMs’ ability to transcribe, translate, and summarize content, and to describe images so that analysts can quickly decide whether they are relevant, saves a great deal of time, particularly for analysts who work with material in many human languages and scripts and with communications that mix audio, text, and images. RSAC tested this on a previously analyzed Telegram data set that included a mix of languages in audio and text, plus images; the materials the LLMs produced were accurate enough, and they meaningfully reduced the time and effort required for analysis.

Figure 3: Summarizing a chat in an unknown language


Figure 4: Describing images, including translation and explanation of text contained in the image

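As a rough illustration of the translation-and-summarization workflow shown in Figures 3 and 4, the following sketch walks a folder of chat exports and asks an LLM for an English summary of each. It assumes the same OpenAI-compatible client as in the earlier sketch, a placeholder model name, and a simplified export layout (one JSON list of message objects with a text field per file); adapt both to your own setup.

    # Minimal sketch: translate and summarize exported chat logs in bulk.
    # Directory layout, JSON format, model name, and prompt are assumptions.
    import json
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    for export in sorted(Path("chat_exports").glob("*.json")):
        # Assumed layout: a JSON list of message objects, each with a "text" field.
        messages = json.loads(export.read_text(encoding="utf-8"))
        text = "\n".join(m.get("text", "") for m in messages if isinstance(m, dict))

        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": "Translate this chat log to English, then summarize it in "
                           "five bullet points, noting any mention of malware, "
                           "credentials, or infrastructure:\n\n" + text,
            }],
        )
        print(f"--- {export.name} ---")
        print(response.choices[0].message.content)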
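To make the point about developing automation scripts concrete, here is the kind of quasi-throwaway script we mean: a minimal sketch (the directory and output filenames are placeholders) that hashes every sample in a folder and writes the results to a CSV for later pivoting.

    # Minimal sketch of a throwaway automation script: hash every file in a
    # sample directory and record name, size, and SHA-256 in a CSV.
    import csv
    import hashlib
    from pathlib import Path

    samples_dir = Path("samples")  # placeholder directory of collected samples

    with open("hashes.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "size_bytes", "sha256"])
        for path in sorted(samples_dir.iterdir()):
            if not path.is_file():
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            writer.writerow([path.name, path.stat().st_size, digest])

Whether a script like this is typed by hand, prompted for in the “traditional” way, or vibe-coded, the review step is the same: read it once and run it against a folder whose contents you already know.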

  • Classification and rating of data. When working with heterogeneous data, AI can help make sense of its structure so that it can be worked with effectively. For example, LLMs can extract relevant indicators of compromise (IoCs) from free text and can normalize fields that contain errors or inconsistencies (see the sketch after this list).
  • Simulation of user behavior in testing, sandboxing, and honeypotting scenarios. Training a model on a recorded user behavior data set and having it synthesize similar behavior makes tests both faster and more faithful. Varied behavior patterns are also harder to profile.
  • Generally faster upskilling for junior researchers. As with deobfuscation, the widespread availability of genAI tools lets newer researchers fill gaps in their knowledge and capabilities more easily. This gives employers more flexibility in hiring: they can be more confident that a curious candidate who is missing a few job requirements but likes experimenting will be able to learn fast.
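For heterogeneous report text, a deterministic first pass is often enough, and it also gives you something to compare an LLM’s answer against. The sketch below is a simplified baseline (the refanging rules, regexes, and example report text are assumptions for illustration): it refangs common defanging styles and pulls hashes, IPs, and domains out of free text. An LLM prompt can take over for the messier cases that regexes miss.

    # Minimal sketch: refang common defanging styles and extract basic IoCs with regexes.
    # The patterns are deliberately simple; treat them as a baseline, not a parser.
    import re

    REFANG = [
        (re.compile(r"hxxp", re.IGNORECASE), "http"),
        (re.compile(r"\[\.\]|\(\.\)|\{\.\}"), "."),
    ]

    PATTERNS = {
        "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
        "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
        "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
    }

    def extract_iocs(text):
        for pattern, replacement in REFANG:
            text = pattern.sub(replacement, text)
        return {kind: sorted(set(rx.findall(text))) for kind, rx in PATTERNS.items()}

    # Example input (documentation-safe values, not real infrastructure)
    report = ("Payload 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 "
              "beacons to hxxp://malicious-example[.]com from 203.0.113.10")
    print(extract_iocs(report))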

Beware of Errors and Hallucinations: An Obfuscated JavaScript Case Study

As with all LLM-generated content, be wary of hallucinations. They occur more often with complex or technically specific tasks and are a particular problem for deobfuscated or explanatory content, because analysts with less deobfuscation experience cannot easily tell when the AI is making things up.

Here, we gave DeepSeek an obfuscated JavaScript file that downloads an installer for the Lumma stealer malware. DeepSeek claimed instead that it was a web skimmer.

Figure 5: DeepSeek’s first attempt at deobfuscation


When we pointed out that the domain name did not look very plausible, DeepSeek seemed to correct itself:

Figure 6: Second attempt at deobfuscation


At this point, we were left shaking our heads, because that r000t[.]ru domain does not seem to have ever existed!

Here we have an example of the LLM’s logic breaking down. DeepSeek took a wrong turn and kept moving deeper into a fictional worldview of its own making. This can happen with any LLM, not just DeepSeek: with multi-step reasoning, the model can end up amplifying its own error. In this situation we’ve found it difficult to get the LLM back onto the right track; we prefer to reset the session and start again. So when using an LLM for any of the threat research tasks we suggest here (and indeed, for any complex task), treat its output with skepticism and always compare it with known data to verify that it matches expectations. One cheap version of that check is sketched below.
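As a concrete example of that comparison step, here is a minimal sketch (the obfuscated.js filename and the indicator list are placeholders) that checks whether each indicator an LLM reports actually appears in the original sample, directly or under a couple of trivial encodings. A miss is not proof of a hallucination, since a string can be assembled dynamically at runtime, but it is a quick flag that an indicator like r000t[.]ru deserves manual verification before it goes into a report.

    # Minimal sketch: sanity-check indicators an LLM claims to have recovered by
    # searching for them in the original sample, including a few trivial encodings.
    import base64
    import binascii

    def appears_in_sample(indicator, sample):
        candidates = {
            indicator,
            indicator[::-1],                                # reversed string
            base64.b64encode(indicator.encode()).decode(),  # base64-encoded
            binascii.hexlify(indicator.encode()).decode(),  # hex-encoded
        }
        lowered = sample.lower()
        return any(c.lower() in lowered for c in candidates)

    with open("obfuscated.js", encoding="utf-8", errors="replace") as f:
        sample = f.read()

    for claimed in ["r000t.ru"]:  # indicators the LLM reported
        verdict = "found in sample" if appears_in_sample(claimed, sample) else "NOT found; verify manually"
        print(f"{claimed}: {verdict}")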

______________________________________________________________________________________________

(a) An academic literature review from 2021 identifies 5 challenges and 20 pitfalls common to malware research: https://www.sciencedirect.com/science/article/abs/pii/S0167404821001115

Contributors
Armin Buescher

Technical Director, RSAC

Snorre Fagerland

Research Lead, RSAC

Laura Koetzle

Head of Community Research, RSAC

Blogs posted to the RSAConference.com website are intended for educational purposes only and do not replace independent professional judgment. Statements of fact and opinions expressed are those of the blog author individually and, unless expressly stated to the contrary, are not the opinion or position of RSAC™ Conference, or any other co-sponsors. RSAC™ Conference does not endorse or approve, and assumes no responsibility for, the content, accuracy or completeness of the information presented in this blog.

