From exploration to classification: harnessing symbolic execution and graph neural networks in malware analysis
The proliferation of cybercrime and the escalating threat of malware attacks necessitate more effective and efficient analysis techniques. Traditional methods, such as static and dynamic analysis, have limitations that hinder their effectiveness against sophisticated and evasive malware. Symbolic execution, a promising approach, overcomes some of these limitations but faces scalability challenges.
This thesis enhances SEMA, a toolchain using Symbolic Execution for Malware Analysis, with two main contributions: (1) developing novel exploration strategies to improve symbolic execution, and (2) incorporating graph neural networks (GNNs) to improve the classification accuracy of malware samples. Additionally, this work investigates GNN explainability, aiming to shed light on the decision-making process of GNN models.
To address the scalability issues of symbolic execution, in particular the path explosion problem, this study proposes seven new exploration strategies that select paths based on assigned weights, with the goal of improving code coverage within a limited timeframe. GNNs are also integrated into the toolchain as new classifiers; they leverage the structural properties of malware represented as graphs, in particular System Call Dependency Graphs (SCDGs), for accurate and efficient classification and detection. Specifically, a Graph Isomorphism Network (GIN) and a Graph Isomorphism Network with JumpingKnowledge (GINJK) are implemented. Furthermore, a GNN explainability module is developed to provide insights into the decision-making process of GNN models.
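The idea of selecting paths by assigned weights can be illustrated with a minimal sketch. It is not one of the seven strategies proposed in the thesis and does not use SEMA's code: the names `new_address_weight` and `explore` are hypothetical, each pending state is reduced to the list of instruction addresses it would execute, and the weight is assumed to be the number of addresses not yet covered.

```python
def new_address_weight(state_addrs, seen_addrs):
    """Hypothetical weight: how many instruction addresses of this
    pending state have not been covered yet."""
    return len(set(state_addrs) - seen_addrs)


def explore(pending_states, budget):
    """Greedy weighted exploration under a fixed step budget.

    pending_states: dict mapping a state name to the list of
        instruction addresses that state would execute (a stand-in
        for a real symbolic state).
    budget: maximum number of states to step, modelling the limited
        analysis timeframe.
    Returns the order in which states were chosen and the set of
    covered addresses.
    """
    pending = dict(pending_states)  # copy so the caller's dict is untouched
    seen = set()
    order = []
    while pending and len(order) < budget:
        # Weights are recomputed at every step because coverage changes;
        # ties go to the state that was inserted first.
        best = max(pending, key=lambda n: new_address_weight(pending[n], seen))
        order.append(best)
        seen.update(pending.pop(best))
    return order, seen


if __name__ == "__main__":
    states = {
        "a": [0x10, 0x14],
        "b": [0x10, 0x14, 0x18, 0x1C],
        "c": [0x18, 0x20],
    }
    order, covered = explore(states, budget=2)
    print(order, sorted(hex(a) for a in covered))
```

In this toy setting, state `b` is stepped first because it reaches the most uncovered addresses, after which `c` is preferred over `a` because `a` no longer contributes anything new. A real strategy would assign weights from properties of the symbolic state, such as the system calls it has issued, rather than from a precomputed address list.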
The experimental results demonstrate that GINJK, combined with an exploration strategy that focuses on the nature of the system calls and on exploring states leading to new assembly instruction addresses, outperforms other classifiers combined with different exploration strategies. The thesis concludes by outlining directions for future research. Overall, this work contributes to the advancement of the SEMA toolchain, enabling more effective and accurate malware analysis.