Machine learning approaches for malware classification based on hybrid artefacts
Permanent link to Research Commons version: https://hdl.handle.net/10289/15896
Malware can be developed and transformed into many forms to deceive users and evade antivirus and endpoint security detection. Furthermore, once one machine in a network is compromised, it can be used for lateral movement, in which malware spreads stealthily without raising an alarm in monitoring systems. Malware attacks pose security threats to modern enterprises and can cause massive financial, reputational, and data losses. It is therefore important to detect these attacks effectively to keep losses to a minimum. Current research uses different approaches, including static and dynamic analysis, to detect and analyze malware categories using distinct feature sets, such as imported modules, opcodes, and API calls, which can improve performance in binary and multi-class classification problems. This thesis proposes a method for identifying and analyzing malware samples via static and dynamic approaches, including memory analysis and consecutive application operation sequences captured in a Windows 10 virtual environment. Standard classifiers and frequently used sequence models are used to expose malware characteristics and improve predictive capability. The features used by these algorithms are extracted from static and dynamic analysis of malware samples, and include the Rich header, debug information, temporary files, prefetch files, and event logs. Classifier performance is measured using accuracy, F1-score, Mean Absolute Error (MAE), the confusion matrix, and Area Under the ROC Curve (AUC). Combining the two feature sets, static file properties and dynamic analysis results, gives the best classification performance whether or not feature selection is applied, achieving an accuracy and F1-score of 97% when the two datasets are integrated.
For consecutive sequences, concatenating the Gated Recurrent Unit (GRU) and Transformer models yields the highest accuracy, 97%, for Noriben operations, while a GRU achieves the highest accuracy for opcode sequences, 89%.
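As an illustration of the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, macro-averaged F1-score, MAE, and a confusion matrix might be computed for a multi-class problem. It is not the thesis's own code: the function names, the integer encoding of class labels, the choice of macro averaging, and the six toy labels are all assumptions made for this example. AUC is omitted because it requires predicted scores rather than hard labels.

```python
# Hypothetical sketch: none of these names or values come from the thesis.

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    matrix = confusion_matrix(y_true, y_pred, n_classes)
    scores = []
    for c in range(n_classes):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n_classes)) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


def mean_absolute_error(y_true, y_pred):
    """MAE over integer-encoded class labels (an assumed encoding)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for three hypothetical malware categories encoded as 0, 1, 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, 3)
mae = mean_absolute_error(y_true, y_pred)
```

Note that MAE over class indices is only meaningful if the integer encoding carries some ordering; for purely nominal malware families it depends on how the labels happen to be numbered.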
The University of Waikato
All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.