
AI malware detectors fail on unfamiliar data

Summary

– Machine learning malware detectors for Windows are typically evaluated on data similar to their training sets, but real-world malware differs and is often obfuscated.
– A study tested these detectors by evaluating them on external datasets, including real-world and obfuscated samples, unlike standard benchmarks.
– Models performed well on data from their own training distribution but showed significantly worse performance on diverse, external datasets like SOREL-20M.
– Adding a dataset of obfuscated malware to training improved detection of those samples but reduced the model’s effectiveness on broader, more diverse threat data.
– The research highlights that a detector’s real-world utility depends on whether its benchmark data reflects the actual threat landscape, as factors like obfuscation can degrade performance.

Current AI-powered malware detection systems often struggle when faced with data that differs from their training material. This reality poses a significant challenge for enterprise security, as real-world threats are frequently obfuscated and originate from varied sources. A new study from researchers at the Polytechnic of Porto explicitly tests this performance gap, revealing critical weaknesses in the static malware detectors many organizations rely on as a primary defense layer. The findings are particularly relevant given that the European Union Agency for Cybersecurity has consistently identified public administration as the most frequently targeted sector for malware attacks in recent years, with ransomware and data intrusions as leading vectors.

The research constructed detection pipelines using a common feature format across six public Windows PE datasets. It tested two training configurations: one using combined EMBER and BODMAS datasets, and another that also incorporated the ERMDS dataset, which is specifically designed with obfuscated malware samples at multiple levels. Crucially, the models were evaluated not just on data from their own training distribution but on four external datasets: TRITIUM, containing natural threat samples; INFERNO, derived from red team and custom command-and-control malware; the large-scale, temporally diverse SOREL-20M benchmark; and ERMDS used as an external test. This cross-dataset evaluation methodology sets the study apart from typical benchmarks that test models on splits of their original training data.
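The cross-dataset methodology can be illustrated with a minimal sketch: train a boosting-based classifier on one feature distribution, then score both an in-distribution test split and an "external" set whose distribution has drifted. The code below uses synthetic Gaussian features as a stand-in for PE feature vectors; the dataset roles in the comments (EMBER/BODMAS for training, an external set for evaluation) are only labels mapping the sketch onto the study's setup, not the actual data.

```python
# Sketch of cross-dataset evaluation: fit on one distribution, then
# compare in-distribution vs. external-set performance. Synthetic
# Gaussians stand in for real PE feature vectors.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_set(n, benign_mean=0.0, malicious_mean=1.5):
    benign = rng.normal(benign_mean, 1.0, size=(n, 8))
    malicious = rng.normal(malicious_mean, 1.0, size=(n, 8))
    X = np.vstack([benign, malicious])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_set(2000)   # role: training pool (EMBER+BODMAS)
X_in, y_in = make_set(500)          # role: in-distribution test split
# role: external dataset -- here, benign files drift toward the
# malicious region (e.g. packed goodware), shrinking the separation
X_ext, y_ext = make_set(500, benign_mean=1.2)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
auc_in = roc_auc_score(y_in, clf.predict_proba(X_in)[:, 1])
auc_ext = roc_auc_score(y_ext, clf.predict_proba(X_ext)[:, 1])
print(f"in-distribution AUC: {auc_in:.3f}, external AUC: {auc_ext:.3f}")
```

The same classifier that looks near-perfect on its own split degrades on the drifted set, which is the pattern the study measures at scale across TRITIUM, INFERNO, SOREL-20M, and ERMDS.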

Performance was strong when models were tested on data resembling their training sets, with top models achieving AUC and F1 scores in the high 90s and maintaining robust true positive rates even at very low false positive thresholds. These in-distribution results appear promising for operational environments where false alarms are costly. However, the cross-dataset results painted a different picture. While models transferred reasonably well to the TRITIUM dataset, performance on the INFERNO red team dataset was more variable, with detection rates dropping at strict thresholds. The most significant decline occurred on the SOREL-20M dataset, where some model configurations lost so much effectiveness that their practical utility would be limited. Performance on the external ERMDS test set was similarly poor.
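The "true positive rate at a low false positive threshold" metric mentioned above is worth making concrete, since it is stricter than AUC: the operating threshold is chosen so that false alarms stay under a budget, and detection is read off at that point. This sketch uses toy scores, not the study's models; the 0.1% budget is an illustrative choice.

```python
# Sketch of the TPR-at-low-FPR metric: choose the operating point
# whose false positive rate fits a budget, then report the detection
# rate there.
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, fpr_budget=0.001):
    fpr, tpr, _ = roc_curve(y_true, scores)
    # best detection rate among thresholds within the FPR budget
    return tpr[fpr <= fpr_budget].max()

# toy scores: malicious files tend to score higher than benign ones
rng = np.random.default_rng(1)
y = np.array([0] * 5000 + [1] * 5000)
scores = np.concatenate([rng.normal(0, 1, 5000), rng.normal(3, 1, 5000)])
print(f"TPR at 0.1% FPR: {tpr_at_fpr(y, scores):.3f}")
```

A model with a strong AUC can still miss a large share of malware at a 0.1% false positive budget, which is why the study's strict-threshold results diverge from its headline scores.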

A particularly instructive finding involved the attempt to directly address obfuscation. Adding the ERMDS dataset to training improved performance on obfuscated samples within that specific distribution. However, it also reduced the model’s ability to generalize to the broader SOREL-20M dataset compared to training without ERMDS. This indicates a fundamental tension in model training: specializing a detector to recognize heavily obfuscated malware can alter its feature understanding in ways that diminish effectiveness against a wider, more diverse threat landscape. The researchers suggest that obfuscation-heavy samples cause feature vectors to spread within each class, blurring the distinction between benign and malicious files that classifiers depend on.
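The researchers' "feature spread" explanation can be sketched numerically: if obfuscated variants keep the malicious class mean but inflate its variance, the gap between class means shrinks relative to the within-class spread, and the boundary a classifier relies on blurs. The numbers below are synthetic assumptions chosen to illustrate the mechanism, not measurements from the paper.

```python
# Sketch of the feature-spread argument: pooling high-variance
# obfuscated variants into the malicious class reduces the separation
# between classes relative to their spread.
import numpy as np

rng = np.random.default_rng(2)
benign = rng.normal(0.0, 1.0, size=(1000, 8))
malicious = rng.normal(2.0, 1.0, size=(1000, 8))
# hypothetical obfuscated variants: same mean, much larger variance
obfuscated = rng.normal(2.0, 3.0, size=(1000, 8))

def separation(a, b):
    # gap between class means, scaled by the pooled per-feature spread
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
    spread = (a.std(axis=0) + b.std(axis=0)) / 2
    return (gap / spread).mean()

pooled = np.vstack([malicious, obfuscated])
print(f"benign vs plain malware:      {separation(benign, malicious):.2f}")
print(f"benign vs obfuscation-pooled: {separation(benign, pooled):.2f}")
```

The drop in the scaled separation mirrors the trade-off the study observed: training on the obfuscation-heavy pool reshapes the classes in a way that can hurt generalization to broader data.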

For security teams, the implications are clear. Static detection remains attractive for on-host deployment due to its low computational footprint and fast verdicts, and the study confirms that compact, boosting-based models can be viable under the right conditions. Yet the research underscores a critical, often overlooked limitation: a detector’s benchmark performance is only meaningful if the evaluation data accurately mirrors the operational threat environment it will face. Tools from red teams, packed malware, and samples from different time periods can all severely degrade a model that performs excellently in controlled tests. The team plans to extend this evaluation to deep learning architectures, continuing to investigate how training data composition impacts detection at the low false positive rates required for real-world deployment.

(Source: Help Net Security)
