A Survey on Malware Detection Using Data Mining Techniques

Authors:
Yanfang Ye, West Virginia University, Morgantown, USA
Tao Li, Florida International University and Nanjing University of Posts and Telecommunications, Nanjing, China (ORCID: 0000-0001-9277-1539)
Donald Adjeroh, West Virginia University, Morgantown, USA
S. Sitharama Iyengar, Florida International University, Miami, FL, USA

ACM Computing Surveys, Volume 50, Issue 3, Article No. 41, pp. 1–40. https://doi.org/10.1145/3073559

Published: 29 June 2017

Abstract

In the Internet age, malware (such as viruses, trojans, ransomware, and bots) poses serious and evolving security threats to Internet users. To protect legitimate users from these threats, anti-malware software products from different companies, including Comodo, Kaspersky, Kingsoft, and Symantec, provide the major defense against malware. Unfortunately, driven by economic benefits, the number of new malware samples has grown explosively: anti-malware vendors are now confronted with millions of potential malware samples per year. To keep pace with this growth, there is an urgent need for intelligent methods that detect malware effectively and efficiently in the large collections of real samples gathered every day. In this article, we first provide a brief overview of malware and the anti-malware industry, and present the industry's needs for malware detection. We then survey intelligent malware detection methods. In these methods, the detection process is usually divided into two stages: feature extraction and classification/clustering. The performance of such intelligent malware detection approaches depends critically on the extracted features and on the methods used for classification/clustering. We provide a comprehensive investigation of both the feature extraction and the classification/clustering techniques. We also discuss additional issues and challenges of malware detection using data mining techniques, and finally forecast the trends of malware development.
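
The two-stage process described in the abstract can be illustrated with a minimal sketch. It is not a method from the survey: it assumes byte n-gram counts as the extracted features and a 1-nearest-neighbor rule with cosine similarity as the classifier, and the names `byte_ngrams`, `cosine`, `classify`, and `TRAINING`, as well as the toy byte strings, are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Stage 1 (feature extraction): count overlapping byte n-grams."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(sample: bytes, labeled: list[tuple[bytes, str]], n: int = 2) -> str:
    """Stage 2 (classification): return the label of the most similar
    labeled sample (1-nearest-neighbor, an illustrative choice)."""
    features = byte_ngrams(sample, n)
    _, label = max(labeled, key=lambda item: cosine(features, byte_ngrams(item[0], n)))
    return label


# Made-up byte strings standing in for real files (assumption, not real data).
TRAINING = [
    (b"\x90\x90\x90\xeb\xfe\x90\x90\xeb\xfe", "malicious"),
    (b"hello world, hello reader", "benign"),
]

if __name__ == "__main__":
    print(classify(b"\x90\x90\xeb\xfe\x90", TRAINING))
    print(classify(b"hello there world", TRAINING))
```

Keeping the two stages as separate functions mirrors the split in the abstract: either the feature extractor or the classifier can be swapped without touching the other.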

References

Tony Abou-As saleh, Nick Cercone, Vlado Keselj, and Ray Sweidan. 2004. N-gram-based detection of new malicious code. In Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC). Google ScholarDigital Library
David W. Aha, Dennis Kibler, and Marc K. Albert. 1991. Instance-based learning algorithms. Machine Learning 6, 1 (1991), 37--66. Google ScholarDigital Library
Blake Anderson, Daniel Quist, Joshua Neil, Curtis Storlie, and Terran Lane. 2011. Graph based malware detection using dynamic analysis. Journal in Computer Virology 4 (2011), 247--258. Google ScholarDigital Library
Blake Anderson, Curtis Storlie, and Terran Lane. 2012. Improving malware classification: Bridging the static/dynamic gap. In Proceedings of 5th ACM Workshop on Security and Artificial Intelligence (AISec). Google ScholarDigital Library
Anubis. 2010. Anubis: Analyzing Unknown Binaries. Retrieved from http://anubis.iseclab.org/.Google Scholar
Michael Bailey, Jon Oberheide, Jon Andersen, Z. Morley Mao, Farnam Jahanian, and Jose Nazario. 2007. Automated classification and analysis of internet malware. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection. Google ScholarDigital Library
Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. 2009. Scalable, behavior-based malware clustering. In Proceedings of the 16th Annual Network and Distributed System Security Symposium.Google Scholar
Ulrich Bayer, Christopher Kruegel, and Engin Kirda. 2006a. TTAnalyze: A tool for analyzing malware. In EICAR.Google Scholar
Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda. 2006b. Dynamic analysis of malicious code. Journal in Computer Virology 2(1) (2006), 67--77. Google ScholarCross Ref
Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, and Ali Hamzeh. 2013. A survey on heuristic malware detection techniques. In Proceedings of the 5th Conference on Information and Knowledge Technology (IKT). Google ScholarCross Ref
Philippe Beaucamps and ric Filiol. 2007. On the possibility of practically obfuscating programs towards a unified perspective of code protection. Journal in Computer Virology 3, 1 (2007), 3--21.Google ScholarCross Ref
Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1--127. Google ScholarDigital Library
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (2007).Google Scholar
Christopher M. Bishop. 1995. Neural networks for pattern recognition. Oxford, Clarendon Press. Google ScholarDigital Library
Bizjournals. 2011. McAfee: Trends in a decade of cybercrime. Retrieved from http://www.bizjournals.com/sanjose/news/2011/01/25/mcafee-trends-in-a-decade-of-cybercrime.html?page=all.Google Scholar
Kevin Borders and Atul Prakash. 2004. Web tap: Detecting covert web traffic. In Proceedings of the 11th ACM Conference on Computer and Communications Security. Google ScholarDigital Library
Leo Breiman. 1996. Bagging predicators. Machine Learning 24, 2 (1996), 123--140. Google ScholarDigital Library
Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5--32. Google ScholarDigital Library
Juan Caballero, Heng Yin, Zhenkai Liang, and Dawn Song. 2007. Polyglot: Automatic extraction of protocol message format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS). Google ScholarDigital Library
Duen Horng Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, and Christos Faloutsos. 2011. Polonium: Tera-scale graph mining for malware detection. In Proceedings of the SIAM International Conference on Data Mining (SDM). Google ScholarCross Ref
Lingwei Chen, William Hardy, Yanfang Ye, and Tao Li. 2015. Analyzing file-to-file relation network in malware detection. In Proceedings of the International Conference on Web Information Systems Engineering (WISE). Google ScholarDigital Library
Mihai Christodorescu and Somesh Jha. 2003. Static analysis of executables to detect malicious patterns. In Proceedings of the 12th Conference on USENIX Security Symposium. Google ScholarDigital Library
Mihai Christodorescu, Somesh Jha, and Christopher Kruegel. 2007. Mining specifications of malicious behavior. In Proceedings of ESEC/FSE. Google ScholarDigital Library
Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Randal E. Bryant. 2005. Semantics-aware malware detection. In Proceedings of IEEE Symposium on Security and Privacy. Google ScholarDigital Library
William W. Cohen. 1995. Fast effective rule induction. In Proceedings of 12th International Conference on Machine Learning. Google ScholarDigital Library
Peter Coogan. 2010. SpyEye Bot Versus Zeus Bot. Retrieved from http://www.symantec.com/connect/blogs/spyeye-bot-versus-zeus-bot.Google Scholar
Thomas Cover and Peter Hart. 1967. Nearest nieghbor pattern classification. IEEE Transaction on Information Theory IT-13, 1 (1967), 21--27. Google ScholarDigital Library
Jedidiah R. Crandall, Zhendong Su, S. Felix Wu, and Frederic T. Chong. 2005. On deriving unknown vulnerabilities from zero-day polymorphic and metamorphic worm exploits. In Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS). Google ScholarDigital Library
Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. 2004. Adversarial classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 99--108. Google ScholarDigital Library
Damballa. 2008. 3% to 5% of Enterprise Assets Are Compromised by Bot-Driven Targeted Attack Malware. Retrieved from http://www.prnewswire.com/news-releases/3-to-5-of-enterprise-assets-are-compromised-by-bot-driven-targeted-attack-malware-61634867.html.Google Scholar
Mohsen Damshenas, Ali Dehghantanha, and Ramlan Mahmoud. 2013. A survey on malware propagation, analysis, and detection. International Journal of Cyber-Security and Digital Forensics (IJCSDF) 2, 4 (2013), 10--29.Google Scholar
Sanjeev Das, Yang Liu, Wei Zhang, and Mahintham Chandramohan. 2016. Semantics-based online malware detection: Towards efficient real-time protection against malware. IEEE Transactions on Information Forensics and Security 11, 2 (2016), 289--302. Google ScholarDigital Library
Thomas Dietterich. 1997. Machine learning research: Four current directions. Artificial Intelligence Magzine 18, 4 (1997), 97--36.Google Scholar
Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the 1st International Workshop on Multiple Classifier Systems. Google ScholarDigital Library
Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee. 2008. Ether: Malware analysis via hardware virtualization extensions. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS). Google ScholarDigital Library
Pedro Domingos and Michael Pazzani. 1997. On the optimality of simple Bayesian classifier under zero-one loss. Machine Learning 29, 2--3 (1997), 103--130. Google ScholarDigital Library
Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2012. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR) 44, 2 (2012), 6. Google ScholarDigital Library
Yuval Elovici, Asaf Shabtai, Robert Moskovitch, Gil Tahan, and Chanan Glezer. 2007. Applying machine learning techniques for detection of malicious code in network traffic. KI: Advances in Artificial Intelligence (2007). Google ScholarDigital Library
EMarketer. 2014. Global B2C Ecommerce Sales to Hit &dollar;1.5 Trillion This Year Driven by Growth in Emerging Markets. Retrieved from http://www.emarketer.com/Article/Global-B2C-Ecommerce-Sales-Hit-15-Trillion-This-Year-Driven-by-Growth-Emerging-Markets/1010575.Google Scholar
Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Gomes Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, 1 (2014), 3133--3181. Google ScholarDigital Library
Eric Filiol, Gregoire Jacob, and Mickael Le Liard. 2007. Evaluation methodology and theoretical model for antiviral behavioural detection strategies. Journal in Computer Virology 3, 1 (2007), 23--37. Google ScholarCross Ref
Ivan Firdausi, Alva Erwin, and Anto Satriyo Nugroho. 2010. Analysis of machine learning techniques used in behavior based malware detection. In Proceedings of 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT). Google ScholarDigital Library
Evelyn Fix and Joseph L. Hodges Jr. 1951. Discriminatory analysis-nonparametric discrimination: Consistency properties. US Air Force, School of Avaiation Medicine, Tech. Rep 4 (1951), 5--32.Google Scholar
Matt Fredrikson, Somesh Jha, Mihai Christodorescu, Reiner Sailer, and Xifeng Yan. 2010. Synthesizing near-optimal malware specifications from suspicious behaviors. In Proceedings of IEEE Symposium on Security and Privacy. Google ScholarDigital Library
Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comp. Syst. Sci. 55, 1 (1997), 119--39. Google ScholarDigital Library
Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. 2014. Malware analysis and classification: A survey. Journal of Information Security 5, 2 (2014), 56--64. Google ScholarCross Ref
Maria Garnaeva, Victor Chebyshev, Denis Makrushin, Roman Unuchek, and Anton Ivanov. 2014. Kaspersky Security Bulletin 2014. Retrieved from http://securelist.com/analysis/kaspersky-security-bulletin/68010/kaspersky-security-bulletin-2014-overall-statistics-for-2014/.Google Scholar
Todd R. Golub, Donna K. Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P. Mesirov, and Hilary Coller. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 5439 (1999), 531--537. Google ScholarCross Ref
Isabelle Guyon and Andr Elisseeff. 2003. An introduction to variable and feature selection. Jouranl of Machine Learning Research 3 (March 2003), 1157--1182. Google ScholarDigital Library
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter (2009). Google ScholarDigital Library
William Hardy, Lingwei Chen, Shifu Hou, Yanfang Ye, and Xin Li. 2016. DL4MD: A deep learning framework for intelligent malware detection. In Proceedings of the International Conference on Data Mining (DMIN).Google Scholar
Olivier Henchiri and Nathalie Japkowicz. 2006a. A feature selection and evaluation scheme for computer virus detection. In Proceedings of the 6th International Conference on Data Mining. Google ScholarDigital Library
Olivier Henchiri and Nathalie Japkowicz. 2006b. A feature selection and evaluation scheme for computer virus detection. In Proceedings of ICDM. Google ScholarDigital Library
Shif Hou, Aaron Saas, Yanfang Ye, and Lifei Chen. 2016. DroidDelver: An android malware detection system using deep belief network based on API call blocks. In Proceedings of the International Conference on Web-Age Information Management. 54--66. Google ScholarCross Ref
Xin Hu. 2011. Large-scale malware analysis, detection, and signature generation. Ph.D. Dissertation, Department of Computer Science and Engineering, University of Michigan. Google ScholarDigital Library
Galen Hunt and Doug Brubacher. 1998. Detours: Binary interception of win32 functions. In Proceedings of the 3rd USENIX Windows NT Symposium. Google ScholarDigital Library
IDAPro. 2016. The Interactive Disassembler. Retrieved from https://www.hex-rays.com/products/ida/support/download_freeware.shtml.Google Scholar
Nwokedi Idika and Aditya P. Mathur. 2007. A survey of malware detection techniques. Research Report in Purdue University (2007).Google Scholar
Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of 30th Annual ACM Symposium on Theory of Computing. Google ScholarDigital Library
Virtualization Technology Intel. 2013. Retrieved from http://www.intel.com/technology/virtualization.Google Scholar
Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. 2013. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Application 36, 2 (2013), 646--656. Google ScholarDigital Library
ITU. 2014. ITU releases 2014 ICT figures. Retrieved from https://www.itu.int/net/pressoffice/press_releases/2014/23.aspx.Google Scholar
Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1 (2000), 4--37. Google ScholarDigital Library
Xuxian Jiang, Dongyan Xu, Helen Wang, and Eugene Spafford. 2005. Virtual playgrounds for worm behavior investigation. In Proceedings of the 8th International Symposium on Recent Advances in Intrusion Detection. Google ScholarDigital Library
Thorsten Joachims. 1998. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines (1998). Google ScholarDigital Library
George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. Google ScholarDigital Library
Min Gyung Kang, Pongsin Poosankam, and Heng Yin. 2007. Renovo: A hidden code extractor for packed executables. In Proceedings of the 5th ACM Workshop on Recurring Malcode (WORM). Google ScholarDigital Library
Chris Kanich, Christian Kreibich, Kirill Levchenko, Brandon Enright, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. 2008. Spamalytics: An empirical analysis of spam marketing conversion. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS). Google ScholarDigital Library
Nikos Karampatziakis, Jack W. Stokes, Anil Thomas, and Mady Marinescu. 2013. Using file relationships in malware classification. In Proceedings of the Conference on Detection of Intrusions and Malware and Vulnerability Assessment. Google ScholarDigital Library
Md Enamul Karim, Andrew Walenstein, Arun Lakhotia, and Laxmi Parida. 2005. Malware phylogeny generation using permutations of code. Journal in Computer Virology 1, 1--2 (2005), 13--23.Google ScholarCross Ref
Kaspersky. 2015. The Great Bank Robbery. Retrieved from http://www.kaspersky.com/about/news/virus/2015/Carbanak-cybergang-steals-1-bn-USDfrom-100-financial-institutions-worldwide.Google Scholar
Kris Kendall and Chad McMillan. 2007. Practical Malware Analysis. Retrieved from https://www.blackhat.com/presentations/bh-dc-07/Kendall_McMillan/Presentation/bh-dc-07-Kendall_McMillan.pdf.Google Scholar
Kingsoft. 2014. 2013-2014 Internet Security Report in China. Retrieved from http://www.ijinshan.com/news/2014011401.shtml.Google Scholar
Kingsoft. 2015. 2014-2015 Internet Security Research Report in China. Retrieved from http://www.cssn.cn/xwcbx/xwcbx_gcsy/201501/P020150122566733317860.pdf.Google Scholar
Kingsoft. 2016. 2015-2016 Internet Security Research Report in China. Retrieved from http://cn.cmcm.com/news/media/2016-01-14/60.html.Google Scholar
Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin Kirda, Xiaoyong Zhou, and XiaoFengWang. 2009. Effective and efficient malware detection at the end host. In Proceedings of the 18th Conference on USENIX Security Symposium. Google ScholarDigital Library
Jeremy Z. Kolter and Marcus A. Maloof. 2004. Learning to detect malicious executables in the wild. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD). Google ScholarDigital Library
J. Zico Kolter and Marcus A. Maloof. 2006. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7 (Dec. 2006), 2721--2744. Google ScholarDigital Library
Nojun Kwak and Chong-Ho Choi. 2002. Input feature selection by mutual information based on parzen window. IEEE Trans. Pattern Anal. Mach. Intell. 24, 12 (2002), 1667--1671. Google ScholarDigital Library
Pat Langley. 1994. Selection of relevant features in machine learning. In Proceedings of AAAI Fall Symposium. Google ScholarCross Ref
Andrea Lanzi, Monirul Sharif, and Wenke Lee. 2009. K-Tracer: A system for extracting kernel malware behavior. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS).Google Scholar
Tony Lee and Jigar J. Mody. 2006. Behavioral classification. In Proceedings of the European Institute for Computer Antivirus Research Conference (EICAR).Google Scholar
David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, New York, Inc., 3--12. Google ScholarDigital Library
Shengqiao Li, E. James Harner, and Donald A. Adjeroh. 2011. Random KNN feature selection - A fast and stable alternative to random forests. BMC Bioinformatics 12, 1 (2011), 450.Google ScholarCross Ref
Shengqiao Li, E. James Harner, and Donald A. Adjeroh. 2014. Random KNN. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshops. 629--636. Google ScholarCross Ref
Tao Li (Ed.). 2015. Event Mining: Algorithms and Applications. CRC Press. Google ScholarDigital Library
LordPE. 2013. PE Tools - LordPE. Retrieved from http://www.malware-analyzer.com/pe-tools.Google Scholar
Mike Loukides and Andy Oram. 1996. Getting to know gdb. Linux Journal (1996). Google ScholarDigital Library
James Lyne. 2014. Security threat trends 2015. Retrieved from https://www.sophos.com/threat-center/medialibrary/PDFs/other/sophos-trends-and-predictions-2015.pdf.Google Scholar
Mohammad M. Masud, Tahseen Al-Khateeb, Kevin W. Hamlen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. 2011. Cloud-based malware detection for evolving data streams. ACM Trans. Management Inf. Syst. 2, 3 (2011), 16. Google ScholarDigital Library
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. 2008. Mining concept-drifting data stream to detect peer to peer botnet traffic. Tech. rep. UTDCS-05-08, The University of Texas at Dallas, Richardson (2008).Google Scholar
Mohammad M. Masud, Latifur Khan, and Bhavani Thuraisingham. 2007. A scalable multi-level feature extraction technique to detect malicious executables. Information Systems Frontiers 10, 1 (2007), 33--45. Google ScholarDigital Library
Kirti Mathur and Saroj Hiranwal. 2013. A survey on techniques in detection and analyzing malware executables. International Journal of Advanced Research in Computer Science and Software Engineering 3, 4 (2013), 422--428.Google Scholar
Micropoint. 2008. Micropoint Antivirus. Retrieved from http://www.micropoint.com.cn/Channel//20080626114608.html.Google Scholar
David Moore and Colleen Shannon. 2002. Code-red: A case study on the spread and victims of an internet worm. In Proceedings of the Internet Measurement Workshop. Google ScholarDigital Library
Andreas Moser, Christopher Kruegel, and Engin Kirda. 2007. Limits of static analysis for malware detection. In Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC). Google ScholarCross Ref
Robert Moskovitch, Clint Feher, and Yuval Elovici. 2009. A chronological evaluation of unknown malcode detection. LNCS: Intelligence and Security Informatics 5477 (2009), 112--117. Google ScholarDigital Library
Robert Moskovitch, Clint Feher, Nir Tzachar, Eugene Berger, Marina Gitelman, Shlomi Dolev, and Yuval Elovici. 2008a. Unknown malcode detection using OPCODE representation. In Proceedings of the European Conference on Intelligence and Security Informatics (EuroISI). Google ScholarDigital Library
Robert Moskovitch, Nir Nissim, and Yuval Elovici. 2008b. Acquisition of malicious code using active learning. In PinKDD.Google Scholar
Robert Moskovitch, Dima Stopel, Clint Feher, Nir Nissim, and Yuval Elovici. 2008c. Unknown malcode detection via text categorization and the imbalance problem. In IEEE Intelligence and Security Informatics. Google ScholarDigital Library
Kevin P. Murphy. 2012. Machine learning: A probabilistic perspective. In The MIT Press, Cambridge, Massachusetts. Google ScholarDigital Library
Ion Muslea, Steven Minton, and Craig A. Knoblock. 2006. Active learning with multiple views. Journal of Artificial Intelligence Research 27 (2006), 203--233. Google ScholarCross Ref
Carey Nachenberg and Vijay Seshadri. 2010. An analysis of real-world effectiveness of reputation-based security. In Proceedings of the Virus Bulletin Conference (VB).Google Scholar
Nicholas Nethercote and Julian Seward. 2007. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation. Google ScholarDigital Library
Hieu T. Nguyen and Arnold Smeulders. 2004. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning. ACM, 79. Google ScholarDigital Library
Ming Ni, Tao Li, Qianmu Li, Hong Zhang, and Yanfang Ye. 2016. FindMal: A file-to-file social network based malware detection framework. Knowledge-Based Systems 112 (2016), 142--151. Google ScholarDigital Library
Corporation of Compuware. 1999. Debugging blue screens. Technical Paper (September 1999).Google Scholar
Gunter Ollmann. 2010. Serial variant evasion tactics techniques used to automatically bypass antivirus technologies. Retrieved from http://www.damballa.com/downloads/rpubs/WPSerialVariantEvasionTactics.pdf.Google Scholar
OllyDump. 2006. PE Tools - OllyDump. Retrieved from http://www.openrce.org/downloads/details/108/OllyDump.Google Scholar
David Orenstein. 2000. Application programming interface (API). In Quick Study: Application Programming Interface (API).Google Scholar
Nikunj C. Oza and Stuart Russell. 2001. Experimental comparisons of online and batch versions of bagging and boosting. In Proceedings of SIGKDD. Google ScholarDigital Library
Judea Pearl. 1987. Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence 32, 2 (1987), 245--258. Google ScholarDigital Library
Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226--1238. Google ScholarDigital Library
Simon Perkins, Kevin Lacker, and James Theiler. 2003. Grafting: Fast incremental feature selection by gradient descent in function space. JMLR 3 (March 2003), 1333--1356. Google ScholarDigital Library
Qemu. 2016. (2016). http://www.qemu-project.org/index.html.Google Scholar
Internet Security Center Qihoo. 2015. 2014 Internet Security Research Report in China. Retrieved from http://zt.360.cn/report/.Google Scholar
J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81--106. Google ScholarCross Ref
J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, Inc. (1993). Google ScholarDigital Library
Alain Rakotomamonjy. 2003. Variable selection using SVM-based criteria. JMLR 3 (March 2003), 1357--1370. Google ScholarDigital Library
Zulfikar Ramzan, Vijay Seshadri, and Carey Nachenberg. 2013. Reputation-based security: An analysis of real world effectiveness. In Symantec Security Response.Google Scholar
Rizwan Rehmani, G. C. Hazarika, and Gunadeep Chetia. 2011. Malware threats and mitigation strategies: A Survey. Journal of Theoretical and Applied Information Technology 29, 2 (2011), 69--73.Google Scholar
John Robbins. 1999. Debugging windows based applications using windbg. Microsoft Systems Journal (1999).Google Scholar
Lior Rokach. 2010. Ensemble-based classifiers. Artif Intell Rev 33, 1 (2010), 1--39. Google ScholarDigital Library
Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee. 2006. PolyUnpack: Automating the hidden-code extraction of unpack-executing malware. In Proceedings of the 22nd Annual Computer Security Applications Conference. Google ScholarDigital Library
Yvan Saeys, Inaki Inza, and Pedro Larranaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 19 (2007), 2507--2517. Google ScholarDigital Library
Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613--620. Google ScholarDigital Library
Igor Santos, Jaime Devesa, Felix Brezo, Javier Nieves, and Pablo Garcia Bringas. 2013. OPEM: A static-dynamic approach for machine learning based malware detection. In Proceedings of International Conference CISIS-ICEUTE, Special Sessions Advances in Intelligent Systems and Computing. Google ScholarCross Ref
Igor Santos, Carlos Laorden, and Pablo G. Bringas. 2011a. Collective classification for unknown malware detection. In Proceedings of the International Conference on Security and Cryptography.Google Scholar
Igor Santos, Javier Nieves, and Pablo G. Bringas. 2011b. Semi-supervised learning for unknown malware detection. In International Symposium on Distributed Computing and Artificial Intelligence Advances in Intelligent and Soft Computing. Google ScholarCross Ref
Joshua Saxe and Konstantin Berlin. 2015. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 10th International Conference on Malicious and Unwanted Software (MALWARE). Google ScholarDigital Library
Matthew G. Schultz, Eleazar Eskin, F. Zadok, and Salvatore J. Stolfo. 2001. Data mining methods for detection of new malicious executables. In Proc. of the IEEE Symposium on Security and Privacy. Google ScholarDigital Library
Fabrizio Sebastiani. 2002. Text categorization. Comput. Surveys 34, 1 (2002), 1--47. Google ScholarDigital Library
H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory. ACM, 287--294. Google ScholarDigital Library
Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. 2009. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report 14, 1 (2009), 16--29. Google Scholar
Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2008. A survey of data mining techniques for malware detection using file features. In Proceedings of ACM-SE. Google ScholarDigital Library
Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2009. Detecting internet worms using data mining techniques. Journal of Systemics, Cybernetics and Informatics 6, 6 (2009), 48--53.Google Scholar
Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. 2008. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security. Google ScholarDigital Library
Eugene H. Spafford. 1989. The internet worm incident. In Proceedings of the 2nd European Software Engineering Conference. Google ScholarDigital Library
Elizabeth Stinson and John C. Mitchell. 2007. Characterizing bots’ remote control behavior. LNCS: Detection of Intrusions and Malware, and Vulnerability Assessment 4579 (2007), 89--108.
Jack W. Stokes, John C. Platt, Helen J. Wang, Joe Faulhaber, Jonathan Keller, Mady Marinescu, Anil Thomas, and Marius Gheorghescu. 2012. Scalable telemetry classification for automated malware detection. Computer Security - ESORICS (2012).
Andrew H. Sung, Jianyun Xu, Patrick Chavez, and Srinivas Mukkamala. 2004. Static analyzer of vicious executables (SAVE). In Proceedings of the 20th Annual Computer Security Applications Conference.
Symantec. 2008. Symantec global internet security threat report. Retrieved from http://eval.symantec.com/mktginfo/enterprise/white_papers/b-whitepaper_internet_security_threat_report_xiii_04-2008.en-us.pdf.
Symantec. 2014a. Internet Security Threat Report 2014. Retrieved from http://www.symantec.com/security_response/publications/threatreport.jsp.
Symantec. 2014b. The Threat Landscape in 2014 and Beyond: Symantec and Norton Predictions for 2015, Asia Pacific and Japan. Retrieved from http://www.symantec.com/connect/blogs/threat-landscape-2014-and-beyond-symantec-and-norton-predictions-2015-asia-pacific-japan.
Symantec. 2016. Internet Security Threat Report. Retrieved from https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf.
Kimberly Tam, Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Lorenzo Cavallaro. 2017. The evolution of android malware and android analysis techniques. ACM Computing Surveys (CSUR) 49, 4 (2017), 76.
Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. 2014. Guilt by association: Large scale malware detection by mining file-relation graphs. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD).
Fadi Abdeljaber Thabtah. 2007. A review of associative classification mining. Knowledge Engineering Review 22, 1 (2007), 37--65.
Ronghua Tian, Rafiqul Islam, Lynn Batten, and Steve Versteeg. 2010. Differentiating malware from cleanwares using behavioral analysis. In Proceedings of 5th International Conference on Malicious and Unwanted Software (Malware).
TrendLabs. 2014. The invisible becomes visible: Trend micro security predictions for 2015 and beyond. (2014). Retrieved from http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/reports/rpt-the-invisible-becomes-visible.pdf.
Trend Threat Research Team TrendMicro. 2010. Zeus: A Persistent Criminal Enterprise. Retrieved from http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/white-papers/wp_zeuspersistent-criminal-enterprise.pdf.
Amit Vasudevan and Ramesh Yerraballi. 2005. Stealth breakpoints. In Proceedings of the 21st Annual Computer Security Applications Conference.
Amit Vasudevan and Ramesh Yerraballi. 2006. Cobra: Fine-grained malware analysis using stealth localized-executions. In Proceedings of 2006 IEEE Symposium on Security and Privacy.
Shobha Venkataraman, Avrim Blum, and Dawn Song. 2008. Limits of learning-based signature generation with adversaries. In NDSS.
Andrei Venzhega, Polina Zhinalieva, and Nikolay Suboch. 2013. Graph-based malware distributors detection. In Proceedings of the 22nd International Conference on World Wide Web Companion (WWW).
Randall Wald, Taghi M. Khoshgoftaar, and Amri Napolitano. 2013. Comparison of stability for different families of filter-based and wrapper-based feature selection. In ICMLA.
Tzu-Yen Wang, Shi-Jinn Horng, Ming-Yang Su, Chin-Hsiung Wu, Peng-Chu Wang, and Wei-Zen Su. 2006b. A surveillance spyware detection system based on data mining methods. Evolutionary Computation (2006), 3236--3241.
Yi-Min Wang, Doug Beck, Xuxian Jiang, and Roussi Roussev. 2006a. Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser vulnerabilities. In NDSS.
SECURITY LABS WEBSENSE. 2014. 2015 Security Predictions. Retrieved from http://www.websense.com/assets/reports/report-2015-security-predictions-en.pdf.
Paul Werbos. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. Dissertation, Harvard University.
Wikipedia. 2016. Scareware. Retrieved from https://en.wikipedia.org/wiki/Scareware.
Wikipedia. 2017a. Assembly Language. Retrieved from http://en.wikipedia.org/wiki/Assembly_language.
Wikipedia. 2017b. Computer Virus. Retrieved from http://en.wikipedia.org/wiki/Computer_virus.
Wikipedia. 2017c. Morris Worm. Retrieved from http://en.wikipedia.org/wiki/Morris_worm.
Wikipedia. 2017d. Ransomware. Retrieved from https://en.wikipedia.org/wiki/Ransomware.
Wikipedia. 2017e. Rootkit. Retrieved from http://en.wikipedia.org/wiki/Rootkit.
Wikipedia. 2017f. Zero-day (computing). Retrieved from https://en.wikipedia.org/wiki/Zero-day_(computing).
Wikipedia. 2017g. Zeus (malware). Retrieved from http://en.wikipedia.org/wiki/Zeus_(malware).
Carsten Willems, Thorsten Holz, and Felix Freiling. 2007. Toward automated dynamic malware analysis using cwsandbox. In IEEE Security and Privacy.
Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 3 (2005), 645--678.
Yanfang Ye. 2010. Research on intelligent malware detection methods and their applications. Ph.D. Dissertation, Department of Computer Science, Xiamen University (2010).
Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao. 2009. SBMDS: An interpretable string based malware detection system using SVM ensemble with bagging. Journal in Computer Virology 5, 4 (2009), 283--293.
Yanfang Ye, Tao Li, Yong Chen, and Qingshan Jiang. 2010. Automatic malware categorization using cluster ensemble. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
Yanfang Ye, Tao Li, Kai Huang, Qingshan Jiang, and Yong Chen. 2009a. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. Journal of Intelligent Information Systems 35, 1 (2009), 1--20.
Yanfang Ye, Tao Li, Qingshan Jiang, Zhixue Han, and Li Wan. 2009c. Intelligent file scoring system for malware detection from the gray list. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
Yanfang Ye, Tao Li, Qingshan Jiang, and Youyu Wang. 2009b. CIMDS: Adapting post-processing techniques of associative classification for malware detection system. IEEE Transactions on Systems, Man, and Cybernetics 40, 3 (2009), 298--307.
Yanfang Ye, Tao Li, Shenghuo Zhu, Weiwei Zhuang, Egemen Tas, Umesh Gupta, and Melih Abdulhayoglu. 2011. Combining file content and file relations for cloud based malware detection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD).
Yanfang Ye, Dingding Wang, Tao Li, and Dongyi Ye. 2007. IMDS: Intelligent malware detection system. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).
Yanfang Ye, Dingding Wang, Tao Li, Dongyi Ye, and Qingshan Jiang. 2008. An intelligent PE-malware detection system based on association mining. Journal in Computer Virology 4, 4 (2008), 323--334.
Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. 2001. Understanding belief propagation and its generalizations. In Mitsubishi Electric Research Laboratories.
Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda. 2007. Panorama: Capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS).
Chunqiu Zeng, Liang Tang, Wubai Zhou, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2017. An integrated framework for mining temporal logs from fluctuating events. IEEE Transactions on Services Computing (TSC) (2017). In Press.
Boyun Zhang, Jianping Yin, Jingbo Hao, Dingxing Zhang, and Shulin Wang. 2007. Malicious codes detection based on ensemble learning. Autonomic and Trusted Computing (2007).
Jianwei Zhuge, Thorsten Holz, Chengyu Song, Jinpeng Guo, Xinhui Han, and Wei Zou. 2008. Studying malicious websites and the underground economy on the Chinese web. In Proceedings of the 7th Workshop on Economics of Information Security.

Index Terms

A Survey on Malware Detection Using Data Mining Techniques
1. Computing methodologies
  1. Machine learning
2. Security and privacy
  1. Systems security
    1. Operating systems security

Reviews

Reviewer: Klerisson Paixao

It is not new that software is eating the world [1]. Industries and businesses everywhere are being "softwareized." Meanwhile, we cannot deny that malware (malicious software) is also having a feast. This paper provides a comprehensive survey of existing technology for malware detection focused on data mining techniques. It starts with a taxonomy, primarily based on common types of malware: viruses, worms, Trojans, spyware, ransomware, scareware, bots, rootkits, and hybrid malware. Then, the paper describes the current state of the (anti-)malware industry.

The study is a bit short on the data mining techniques used. The authors restrict themselves to describing detection approaches that rely on classification and clustering algorithms. On the other hand, the survey does a very good job of summarizing dozens of methods used in the literature. Further, the authors suggest new ideas for future research directions. Notably, they discuss the application of active learning to the task. Such a technique seems well suited to a critical problem in the field: data scarcity. While cybercriminals usually cooperate and collaborate to build their malware, their counterparts keep collections of cybercrime data under lock and key.

The paper ends with a clear conclusion: there is no silver bullet when it comes to malware detection. All classification/clustering techniques have their pros and cons; thus, they will not always perform optimally. This survey serves well as a starting point and initial set of guidelines for people willing to do research in this field.

Online Computing Reviews Service

Published in
ACM Computing Surveys Volume 50, Issue 3
May 2018
550 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3101309
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 June 2017
- Accepted: 1 March 2017
- Revised: 1 November 2016
- Received: 1 August 2015

Author Tags
Survey
data mining
malware detection
Qualifiers
- survey
- Research
- Refereed

Article Metrics
- Total Citations: 388
- Total Downloads: 10,300
- Downloads (Last 12 months): 2,341
- Downloads (Last 6 weeks): 374