Article Type: Research Paper
Date of acceptance: July 2025
Date of publication: August 2025
DOI: 10.5772/acrt.20250021
Copyright: ©2025 The Author(s), Licensee IntechOpen, License: CC BY 4.0
Machine learning assisted binary analysis is an area of great interest in cybersecurity research. Training accurate machine learning models requires methods of binary lifting, which translate binaries through an intermediate language representation. This study postulates that different intermediate language representations change the performance characteristics of these machine learning models. Taking a published machine learning framework as a control and modifying the input methodology to include different intermediate language representation transforms, this study compared the performance of models in the realm of malware classification. The contributions of this study are: verification and replication of a published machine learning framework, novel transforms and usage of a public malware dataset, a comparative study on the performance impact of different intermediate language representations for opcode based malware classification, and a set of heatmaps that can be utilized as a reference lookup table to inform binary lifting choice.
binary analysis
intermediate language representation
machine learning
malware classification
natural language processing
The ongoing development and improvement of machine learning models for advanced and time-intensive tasks continues to be a concentrated focus for researchers [1]. A popular research topic is machine learning aided binary analysis, with examples like identifying buffer overflow vulnerabilities using neural networks [2], automating reverse engineering utilizing machine learning [3], Android malware analysis [4, 5], and recovering metadata from obfuscated binaries with machine learning techniques [6]. However, an issue with these model frameworks is the implicit trust, for both performance and accuracy, in how binaries are lifted for training. Model architects do not experiment with differing modalities of lifting binaries within their frameworks. This weakness in the contemporary literature is explored in [7] during a discussion on the usage of intermediate language representations.
In contrast, this study seeks to fill gaps elucidated in the literature [7]. This study is a controlled comparative study of the impact of different intermediate language representations on a malware classification machine learning framework. The study also replicates previous research from published design documents using publicly available original code and training data. The original binary data set used to train the subject framework was lifted using three intermediate language representations to create new training data. This training data was used to train a model for each intermediate language representation using the original subject framework’s architecture. The results for each intermediate language representation are then placed within a heatmap to perform a comparative analysis.
Section 2 presents the background of intermediate language representations and the chosen intermediate language representations used for translation, as well as the tools that leverage each intermediate language representation for binary lifting. Section 3 describes the subject machine learning framework and replication results. Section 4 presents the details of the malware dataset, the limitations of the original method of distribution, and conversion methods to lift the malware binaries into different intermediate code representations. Section 5 reports the performance differences of models trained within the target malware classification framework using different intermediate language representations of the converted training data. Section 6 concludes the work and highlights the key insights and directions for future research.
To emphasize, no new algorithm is proposed in this work. This work explored how different methods of binary lifting into different intermediate language representations change the performance characteristics of a known control machine learning model which aids in malware classification. This exploration is done by measuring key machine learning performance metrics, after the control malware classification model architecture has been trained while using each of the different intermediate language representations, to evaluate how each chosen intermediate language representation affects the measured metrics. Each intermediate language representation has different metrics compared to other intermediate language representations, even though the same model architecture and training data are used.
To summarize, this work contributes the following:
Independent verification and replication of a published machine learning framework.
Novel transforms and usage of a public malware dataset.
A comparative study on the performance impact of different intermediate language representations for opcode based malware classification.
A set of heatmaps that can be utilized as a reference lookup table to inform binary lifting choice.
Utilizing the overview of contemporary uses of intermediate language representations from [7], compilation is the process by which raw source code is transformed into a packaged software application. Intermediate code, the code the compiler generates from source code in the first phase of the analysis-synthesis model of a compiler, is written in an intermediate language representation [8]. Using intermediate code allows for advantages such as access to code optimizations and platform independence [8]. These advantages are due to the generalization it allows in the front of the compilation process. Without compilers translating source code into an intermediate language, a full native compiler would be required for each new machine the code needs to operate on [8].
The importance of this transformation cannot be ignored. It allows a compiler chain to have the same front sections for any machine. When a new target machine is developed, only the backend of a compiler needs to be redesigned. Intermediate language representations allow the portability and extension of code to new machines. This idea can be described as a process line that starts with source code, continues into an intermediate representation, and finalizes as a compiled binary. Binary lifting is this process line going in reverse.
This work compares binary lifting into intermediate language representations using three common reverse engineering tools. Each tool has a different intermediate language representation that impacts the resulting output analysis.
IDA Pro [9] is an established tool. With active development beginning over 30 years ago, IDA Pro has been a core software in the reverse engineering community. It set the standard in disassembly against which many other tools position themselves as analogous or improved alternatives. IDA Pro boasts many features and supports many architectures for disassembly and decompilation. This is powered fundamentally by its intermediate language representation known as microcode. Exposed to users only since IDA Pro version 7.1, the RISC [10] inspired intermediate language representation was explained in detail during Black Hat 2018 by its designer [11].
In order to allow IDA Pro to operate in a headless, or script-based, environment, headless IDA [12] is utilized. This allows full IDA Pro functionality using the same programming interface as the other tools present in testing. This is done to expose IDA Pro to the complete functionality that the chosen machine learning frameworks use to process binaries.
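As an illustration only, a minimal sketch of this headless usage, assuming the interface documented for the headless-ida package [12]; the idat64 and binary paths are placeholders, and the exact invocation may differ between versions:

```python
from headless_ida import HeadlessIda

# Point headless-ida at a local IDA installation and the binary to analyze
# (both paths are placeholders for this sketch).
headlessida = HeadlessIda("/opt/ida/idat64", "sample.bin")

# Once HeadlessIda is active, the regular IDAPython modules can be imported
# and used as if running inside the IDA Pro GUI.
import idautils
import ida_name

for func_ea in idautils.Functions():
    print(hex(func_ea), ida_name.get_name(func_ea))
```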
Starting off as an internal capture the flag tool, Binary Ninja [13] is a reverse engineering platform that includes an interactive decompiler, disassembler, and debugger for binary analysis. With open source APIs in C++, Python, and Rust, Binary Ninja allows a large amount of flexibility when it comes to automation and inclusion in larger projects. Binary Ninja has several intermediate language representations built onto each other in a stack-like manner, with each level more abstract than the last. The most relevant for this research is the Low Level Intermediate Language (LLIL), the lowest level intermediate representation, responsible for lifting instructions and including flags in conditional instructions. LLIL is comparable to the other intermediate representations of focus because, like them, it sits close to the bottom of the code abstraction stack. How the LLIL representation is refined into higher forms is explained directly in the documentation given in [13].
By natural design and support, the Binary Ninja Python API already allows direct headless operation. This allows all Binary Ninja functionality using the same programming interface as the other tools present in testing. This is done to expose Binary Ninja to all of the needed functionality that the chosen machine learning frameworks use to process binaries.
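For example, a minimal sketch of headless LLIL opcode extraction with the Binary Ninja Python API (a headless-capable license is assumed, the file name is a placeholder, and older API releases expose open_view rather than load):

```python
import binaryninja

# Load the reconstructed binary and wait for auto-analysis to finish.
bv = binaryninja.load("sample.bin")
bv.update_analysis_and_wait()

# Walk every identified function in LLIL form and record its operation names.
opcodes = []
for func in bv.functions:
    for block in func.low_level_il:
        for instr in block:
            opcodes.append(instr.operation.name)  # e.g. "LLIL_SET_REG", "LLIL_CALL"

print(len(opcodes), opcodes[:8])
```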
Created and maintained by the National Security Agency (NSA), Ghidra [14] is the product of the Research Directorate, the main research and development group of the United States of America’s offensive cybersecurity intelligence agency. Ghidra is an open-source suite of software analysis tools designed to aid in reverse engineering. Disassembly, assembly, decompilation, graphing, and scripting are some of the advertised capabilities. Ghidra operates using the intermediate representation known as PCode, which was designed to provide functionality similar to microcode while being more streamlined and implemented in Ghidra’s native programming language of Java. PCode operations are claimed to theoretically define operations for any given processor [14].
While Ghidra supports headless operation, it does not support it in the same development environments commonly used for machine learning applications. Pyhidra [15] is a solution that solves this problem. Written by the Department of Defense’s Cyber Crime Center, Pyhidra allows direct access to the Ghidra API using Python 3. This modification exposes all of the needed functionality that the chosen machine learning frameworks use to process binaries using Ghidra.
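As a hedged sketch of this headless access, following the pattern shown in the Pyhidra documentation (the binary path is a placeholder, and a local Ghidra installation is assumed to be discoverable):

```python
import pyhidra

pyhidra.start()  # locates the Ghidra installation, e.g. via GHIDRA_INSTALL_DIR

with pyhidra.open_program("sample.bin") as flat_api:
    program = flat_api.getCurrentProgram()
    listing = program.getListing()

    # Enumerate the functions Ghidra identified and dump the raw PCode
    # mnemonics of each instruction they contain.
    for func in program.getFunctionManager().getFunctions(True):
        for instr in listing.getInstructions(func.getBody(), True):
            for op in instr.getPcode():
                print(func.getName(), op.getMnemonic())  # e.g. "COPY", "INT_ADD"
```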
To start off, the subject model architecture of note is a supervised malware classification structure outlined in [16]. This is the known control machine learning framework used to test performance impact when different intermediate language representations are used in the binary lifting stage of training data processing. That study was done as a response to the 2015 Microsoft Malware Classification Challenge and used the provided training data [17]. The idea in [16] was to train multiple machine learning models to classify the malware for the challenge but use different features in training to identify those with the most impact. The work of [16] was focused on feature extraction using a malware sample’s machine code snippet with the output of IDA Pro’s disassembly. The model architecture provides good comparison tests because the ingested disassembly can be swapped for different intermediate language representations of the same samples for training comparison.
To clarify, [16] was an experiment that sought to determine the impact of training with different features on malware classification performance. The publication used the dataset provided in [17]. Seven unique feature vectors were extracted from the dataset. The main feature of importance is the Opcode 4-gram, as it is the only feature impacted by code being converted to other intermediate representations. For each feature, three machine learning models were trained: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest Classifier (RF) models. SVM models use a linear decision boundary to separate objects of different classes [18]. KNN models assign an object to a category based on the categories of the objects nearest to it in feature space [18]. RF models are ensembles of decision trees that make predictions based on a path from the root, or base of a tree, to a leaf, or end [18]. After selecting the best of the three types of models, each feature’s performance was compared against each other. Moreover, certain mixtures of features were tested in the same way.
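As an illustrative sketch of this per-feature training and comparison step (scikit-learn is assumed, the toy data stands in for extracted feature vectors, and the hyperparameters shown are not those reported in [16]):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy stand-ins: rows are samples, columns are one extracted feature vector,
# labels are the nine malware families of the challenge dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 50)).astype(float)
y = rng.integers(0, 9, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "SVM": LinearSVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_val, model.predict(X_val)))
```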
In contrast, Chen et al. [16] was not the only malware classification model that could have been considered as a control for this comparative study. Li et al. [19] also built a similar malware classification model using the newer deep learning methodology of a transformer model. Li et al. leveraged a few static training features similar to those found in Chen et al., including opcode sequences, but with a different data set and code base. The model of Li et al. [19] is also an example where only one intermediate representation was used, as only IDA Pro [9] is acknowledged. However, while [19] did release the code utilized, the training data was not released; therefore, it cannot be used as a control framework within a comparative study. Only transformations of code representations into the three different intermediate language representations were allowed to impact the model performance metrics, and without access to both the original training data and model architecture that cannot be guaranteed. It is true Chen et al. only experimented with classical machine learning model types; as a consequence, deep learning models such as transformers are considered in future work.
The experiment in [16] was independently replicated for this study and the results are shown in Table 1. The overall margin of error was 0.397%. This margin was calculated by taking the individual error between the published and replicated for each feature or feature mix and averaging them. Table 1 lists the best accuracy results for each feature vector both published and replicated. The dimension field shows the original length of the feature vectors and what length the vectors were reduced to. For example, the total size for the Import Library feature vector was 570 items, but the models were only trained using the most common 300 item types present within the dataset. Accuracy is defined as the ratio of correct classifications over the total number of classification attempts. Table 1 serves as a verification source and is a replication of a published machine learning framework, a major contribution of this work. While Table 1 shows the successful replication of the full experiment designed in [16], the rest of this article focuses on the All Features, (File Size, API 4-gram, Opcode 4-gram), and Opcode 4-gram feature combinations. These are the feature combinations impacted by different intermediate language representations in the binary lifting stage of training data pre-processing.
Feature(s) | Dimension | Published best accuracy | Replicated best accuracy |
---|---|---|---|
All features | 1812921 → 10343 | 0.9948 | 0.9922 |
Section size, section permission, content complexity | 861 → 40 | 0.9940 | 0.9922 |
Section size, section permission, content complexity, import library | 1431 → 340 | 0.9922 | 0.9890 |
Opcode 4-gram | 1408515 → 5000 | 0.9908 | 0.9913 |
File size, API 4-gram, Opcode 4-gram | 1811490 → 10003 | 0.9899 | 0.9940 |
Content complexity | 6 | 0.9811 | 0.9862 |
Section size | 846 → 25 | 0.9775 | 0.9784 |
Section permission | 9 | 0.9701 | 0.9738 |
Import library | 570 → 300 | 0.9393 | 0.9255 |
File size | 3 | 0.9352 | 0.9402 |
API 4-gram | 402972 → 5000 | 0.5796 | 0.5787 |
Evaluation replication for [16].
Opcode 4-gram is a feature vector based on collections of groupings of 4 sequential operation codes present in the lifted binary. File Size is a feature vector that contains the size of a sample’s two component files and the ratio between them. API 4-gram is a feature vector based on collections of groupings of 4 sequential external call or jump instructions from imports present in the lifted binary. The Import Library feature is a vector that identifies which of the 300 most commonly imported libraries seen in the dataset are present in each sample. PE Section Size is a feature vector based on the relative size of each sample’s sections compared to its other sections and on which sections are present in a sample. Section Permission is a feature vector that details whether a portion of a section was readable data, writable data, or executable code. Content Complexity is a feature vector that includes compression ratios for each malware sample as a method to flag code encryption and obfuscation.
To provide greater detail and understanding, the training data provided in [17] includes several hundred gigabytes of sterilized malware samples from different malware families, totaling 10868 samples. The malware families are listed in Table 2. This dataset is not balanced between the classes of malware present. The largest class is KelihosVer3 with 2942 samples and the smallest class is Simda with 42 samples. Each malware sample is composed of two files of information: an IDA Pro disassembly output file, or .asm file, and a text file containing the binary contents of the sample in hexadecimal after processing with a Portable Executable file header stripping routine, or a .bytes file.
Family name | Number of samples | Type |
---|---|---|
Ramnit | 1541 | Worm |
Lollipop | 2478 | Adware |
KelihosVer3 | 2942 | Backdoor |
Vundo | 475 | Trojan |
Simda | 42 | Backdoor |
Tracur | 751 | TrojanDownloader |
KelihosVer1 | 398 | Backdoor |
Obfuscator.ACY | 1228 | All obfuscated malware |
Gatak | 1013 | Backdoor |
Malware dataset class populations.
To ensure common understanding, the definition of each type of malware is established. A worm is a self-replicating malware that spreads through networks. Adware displays intrusive ads and tracks victim behavior. Backdoors enable unauthorized remote system access. Trojans are malware that are disguised as legitimate software to deliver a malicious payload. A trojan-downloader downloads additional malicious software after the initial infection. Obfuscated malware is malware that hides using specific techniques to evade detection and thwart analysis.
In order to sterilize released malware binaries, Microsoft performed a header stripping routine. This header stripping routine ensures that it is impossible for the malware sample to become actively malicious during research as it cannot import required functions due to a missing Import Address Table. The Import Address Table lists what libraries need to be loaded into system memory for the program to function.
Due to the dataset’s safety measures, it is impossible to carry any information found in the stripped header into the other intermediate language representations and their respective binary lifting methods. This lack of header information is a limitation derived from how the original dataset was distributed. Moreover, header information is not executable code, and there would be no way to convert it to an intermediate representation even if it were available. Therefore, the API n-gram feature vector, which is derived from the Import Address Table information, cannot be converted to another intermediate language representation. Additionally, the machine learning architecture of [16] was designed around the limitation resulting from the header stripping process.
In addition, the lack of header information leaves the binary lifting mechanisms of both IDA Pro and Ghidra unable to consistently identify functions within code regions of a binary. Overcoming this limitation is described in Section 4.3.
The process to convert the malware data samples into a readable form for Binary Ninja is shown in Figure 1. The .bytes file is loaded into a Python development environment and rewritten into a temporary file such that the hexadecimal is interpreted into its appropriate binary form. This is done by iterating through the file as a long string and converting each hexadecimal value into binary before writing it to a file stream that produces the temporary file. This temporary file is then loaded into Binary Ninja. Once Binary Ninja completes its automated analysis process to identify the code regions in the binary file, each identified function, in the order encountered via the linear memory view, is converted to its intermediate language representation form. This form is then stored in a sample’s .LLIL reference file so that it can be loaded during the Opcode 4-gram feature extraction phase of [16] before machine learning model training. To illustrate feature extraction in more detail, the .LLIL file is loaded and interpreted as a string. This string is scanned against a regular expression containing the pattern that isolates all opcodes present in the LLIL representation. Once the list of opcodes is created, each opcode and the three opcodes that follow it are collected to make an entry in the feature extraction file.
Low level intermediate language data conversion process.
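A minimal sketch of the .bytes rewriting step described above; the handling of the leading address column and of "??" placeholders for unreadable bytes is an assumption about the dataset layout, not a detail specified in [16]:

```python
import binascii
import tempfile

def bytes_file_to_binary(bytes_path: str) -> str:
    """Rewrite a hexadecimal .bytes file into a temporary raw binary file."""
    out = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
    with open(bytes_path, "r") as src, out:
        for line in src:
            tokens = line.split()[1:]  # drop the leading address column (assumed layout)
            hex_str = "".join(t for t in tokens if t != "??")  # skip unreadable bytes
            out.write(binascii.unhexlify(hex_str))
    return out.name  # path handed to Binary Ninja for loading and auto-analysis
```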
The data conversion process is slightly modified for the remaining two intermediate language representations. In order to confirm that the language representations themselves are impacting the model performance, to control for auto-analysis differences within the reverse engineering tools that utilize these intermediate languages, and to overcome the stripped header information limitation within the dataset, the function boundaries are fixed between the intermediate language representations. This allows the same function to be represented in all three intermediate language representations during the opcode extraction phase.
Therefore, for both the PCode and Microcode translations, the function boundaries are correlated to the corresponding boundaries identified within Binary Ninja. These functions are then iterated through to extract the respective intermediate language representation of each function, resulting in .pcode and .idaMicro files respectively. This modified process is shown in Figure 2. To discern the differences between the intermediate language representations, refer to the excerpts from sample dWy5HfqNGPs6vwMxB1m3 shown in Figure 3, which shows LLIL having a higher level of abstraction than PCode or Microcode. Furthermore, it depicts PCode as a highly verbose assembly-like language with a large variety of opcodes, while showing Microcode as an assembly-like language with reduced verbosity. This difference in verbosity was noticed experimentally, as the correlated binary conversions produced significantly more PCode operations than the Microcode equivalent, even though the language design documents show no significant difference in the number of possible opcodes between the two languages. At the same time, the conversion processes into the different intermediate language representations have different associated costs. As displayed in Table 3, there are significant time differences to produce a training set converted to each intermediate representation. LLIL representation was the fastest, taking 30.30% less time than the average of the other two intermediate language representations, but that is easily explained by the differences within the conversion process. There are more steps in the PCode and Microcode conversion process, due to the requirements of the correlation step, that increase the time required to produce an output; however, since PCode and Microcode share the modified conversion process they can be compared directly, and it takes 12.41% less time to create a PCode dataset as opposed to Microcode.
Pcode and microcode data conversion process.
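To illustrate the correlation step for the PCode half of this process, a hedged sketch: the function start addresses are assumed to have been exported from Binary Ninja beforehand (placeholder values below), and the Pyhidra session is opened as in the earlier sketch; the Microcode half is analogous but uses the IDA Pro API instead.

```python
import pyhidra

# Function start addresses previously recorded from Binary Ninja's analysis
# (placeholder values for this sketch).
binja_function_starts = ["0x401000", "0x401250", "0x4019a0"]

pyhidra.start()
with pyhidra.open_program("sample.bin") as flat_api:
    listing = flat_api.getCurrentProgram().getListing()
    pcode_lines = []
    for addr_str in binja_function_starts:
        func = flat_api.getFunctionAt(flat_api.toAddr(addr_str))
        if func is None:
            continue  # Ghidra found no function at this correlated boundary
        for instr in listing.getInstructions(func.getBody(), True):
            pcode_lines.extend(op.getMnemonic() for op in instr.getPcode())

# pcode_lines would then be written out as the sample's .pcode reference file.
```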
Excerpts of LLIL, PCode, Microcode transforms from sample dWy5HfqNGPs6vwMxB1m3.
Intermediate language representations | Conversion time (hours) | Converted training data size |
---|---|---|
Low level intermediate language | 273.03 | 27.72 GB |
PCode | 365.8 | 4.99 TB1 |
Microcode | 417.63 | 3.66 GB |
Training data conversion metrics.
1This size is estimated. The stored size is 249.52 GB, held as gzip archives with an approximately 95% compression ratio.
Microcode had the smallest converted dataset size in storage requirements. LLIL was larger by approximately one order of magnitude, but the relative prolixness of PCode data incurred a storage requirement three orders of magnitude larger than Microcode. Due to the increased storage requirements, .pcode files are not stored as raw text like the other two representations, but as gzip archives. This process had no impact on opcode retrieval and extraction for training. These differences in time and storage cost scale with the size of the data to be converted, and individuals seeking to export binaries into different intermediate language representations should take note. To reiterate, this method serves to highlight the novel transforms and usage of a public malware dataset. This process allows the dataset created in [17] to be used and analyzed in novel ways.
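A short sketch of this transparent compression, using Python's standard gzip module (the file name reuses the sample identifier from Figure 3 purely as an example, and the text content is a stand-in):

```python
import gzip

pcode_text = "COPY\nINT_ADD\nSTORE\n"  # stand-in for a sample's extracted PCode listing

# Write the PCode listing as a gzip archive instead of raw text.
with gzip.open("dWy5HfqNGPs6vwMxB1m3.pcode.gz", "wt", encoding="utf-8") as f:
    f.write(pcode_text)

# Reading it back during feature extraction looks the same as reading plain text.
with gzip.open("dWy5HfqNGPs6vwMxB1m3.pcode.gz", "rt", encoding="utf-8") as f:
    recovered = f.read()

assert recovered == pcode_text  # compression is lossless, so extraction is unaffected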
Due to the focus of interest on code representation, the results of training only include the feature models impacted by the representation conversion process. This includes the Opcode 4-gram feature models and all multi-feature mixture models that include the Opcode 4-gram feature vector. All other features act as a control, and therefore any difference in performance is attributed to the different code representations. See Table 1 for other features tested in [16]. N-grams are a concept drawn from natural language processing in which tokens are collected into groupings n items long. In the case of these models, the length is four tokens of information. In the trained models, each 4-gram is anchored by an operation code that begins the group; the next three opcodes in order are then collected as tokens and paired with it to form the operation code grouping. The total length of the extracted opcode vector for a malware sample in each intermediate language representation and the vector usage percentage are listed in Table 4. Following the malware classification framework design in [16], each opcode vector for each sample was reduced to the 5000 most common Opcode 4-grams extracted from all malware samples in the respective converted dataset. All Opcode 4-gram combinations not retained by this pruning process are discarded and not used for training, as specified in [16].
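For clarity, a small sketch of this anchoring and pruning logic (toy opcode sequences stand in for the lifted training data; the retained vocabulary size of 5000 follows [16]):

```python
from collections import Counter

def opcode_4grams(opcodes):
    """Each opcode anchors a group with the three opcodes that follow it."""
    return [tuple(opcodes[i:i + 4]) for i in range(len(opcodes) - 3)]

# Toy per-sample opcode sequences standing in for the lifted training data.
samples = [
    ["LLIL_PUSH", "LLIL_SET_REG", "LLIL_CALL", "LLIL_RET", "LLIL_PUSH", "LLIL_CALL"],
    ["LLIL_SET_REG", "LLIL_CALL", "LLIL_RET", "LLIL_NOP", "LLIL_RET"],
]

# Keep only the most common 4-grams across the whole dataset (5000 in [16]).
counts = Counter(g for s in samples for g in opcode_4grams(s))
vocab = [g for g, _ in counts.most_common(5000)]
index = {g: i for i, g in enumerate(vocab)}

# Each sample becomes a fixed-length count vector over the retained 4-grams;
# combinations outside the vocabulary are discarded.
vectors = []
for s in samples:
    v = [0] * len(vocab)
    for g in opcode_4grams(s):
        if g in index:
            v[index[g]] += 1
    vectors.append(v)
```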
Intermediate language representations | Dimension | Vector usage percentage |
---|---|---|
Low level intermediate language | 26759750 → 5000 | 0.02% |
PCode | 452074 → 5000 | 1.11% |
Microcode | 69318 → 5000 | 7.21% |
Intermediate language Opcode 4-grams vector dimension information.
To focus on intermediate language design implications, Table 4 shows that LLIL produced roughly two orders of magnitude more extracted Opcode 4-grams overall than PCode. This is due to LLIL representing a higher form of abstraction than the other two intermediate representations, with less structured opcodes and opcode sequences, which led directly to more unique entries and a low vector usage percentage. On the other hand, PCode and Microcode are much more assembly-like in implementation. This creates an environment where opcode sequences repeat more often due to the more limited permutations of sequences created from a smaller list of explicit opcodes.
Therefore, a smaller number of combinations, each represented more frequently in the extraction process, led to smaller initial vector dimensions. However, PCode’s vector length is an order of magnitude larger than that of Microcode; consistent with the extracted function representations, PCode required more operations and unique opcodes to represent the dataset. Extracted dimension size has an inverse relationship with vector usage percentage: the larger the original extracted Opcode 4-gram vector dimension, the smaller the vector usage percentage. In order to maintain controlled variables, the published method restricted the Opcode 4-gram vector’s length to 5000 groups. This means that the larger the initial dimension, the more of the extracted information goes unused in training, indicating that the more abstract the intermediate representation, the more information is wasted with this framework or similar ones.
This section describes the heatmap comparisons with a cool-warm color mapping. Red signifies a higher value, which transitions to gray and then blue as the value decreases. Providing context for the comparison figures, the heatmaps relay information about differences in model accuracy due to the intermediate language representation used during training. Moreover, this numerical analysis also includes per-class accuracy measurements over the validation set of each model trained. In Appendix A, additional class-based validation heatmaps are provided for the machine learning statistics precision, recall, and F1-score. Heatmaps for all three training feature combinations for each statistic are intended to be used as a reference to inform intermediate language representation choice or machine learning model type choice. To best utilize Appendix A, readers are suggested to refer to the heatmaps relevant to the machine learning statistic or malware type of interest for their use case. The heatmaps, both overall and class-based, are meant to be used as a reference tool to focus machine learning model efforts on more efficient model type and intermediate language representation selection. Accuracy measurements were selected to remain in the core of the document because accuracy is the main statistic of note in the original implementation and publication of [16], the source of the malware classification framework. To reiterate, accuracy is the ratio of correct classifications over the total number of classification attempts. Precision is the ratio of correct classifications of a class to all attempted classifications into that class. Recall is the ratio of correct classifications to possible correct classifications. The F1-score is twice the product of precision and recall divided by their sum. Section 5.1 demonstrates how researchers can use the heatmaps to make a more informed intermediate language representation choice when designing binary analysis machine learning models.
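In standard confusion-matrix terms (true positives TP, true negatives TN, false positives FP, false negatives FN), these definitions correspond to:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```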
Prior to quantitative analysis of the accuracy metric, it is necessary to explain that Figures 4 and 5 represent the same group of malware classification models; however, Figure 4 reports the accuracy on the training set while Figure 5 reports the accuracy on the validation set. In machine learning, the validation set is a group of samples withheld from training so that they can be used to test the model and validate its performance. All models shown in the heatmaps use an 80% training and 20% validation split. Analysis of the results of one will be similar to the other, as they represent the same overall trend but with reduced performance on the validation set. Both sets are presented to illustrate any possible over-training that may have occurred within the trained models for each intermediate language representation. Over-training is a phenomenon where the model becomes very good at classifying training data but is unable to generalize to new but similar information, such as samples found in a validation set. Based on the validation results, over-training does not appear to have been an issue in this training.
Intermediate language training accuracy comparison.
Intermediate language validation accuracy comparison.
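For readers who wish to reproduce this style of figure, a minimal sketch using matplotlib's coolwarm colormap; the values and labels below are placeholders, not the measured results:

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder accuracies: rows are model types, columns are intermediate languages.
data = np.array([[0.91, 0.89, 0.86],
                 [0.96, 0.93, 0.92],
                 [0.98, 0.95, 0.94]])
rows = ["SVM", "KNN", "RF"]
cols = ["LLIL", "PCode", "Microcode"]

fig, ax = plt.subplots()
im = ax.imshow(data, cmap="coolwarm")  # red = higher value, blue = lower value
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols)
ax.set_yticks(range(len(rows)))
ax.set_yticklabels(rows)
for (i, j), v in np.ndenumerate(data):
    ax.text(j, i, f"{v:.2f}", ha="center", va="center")
fig.colorbar(im)
plt.show()
```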
Discussion focuses on training results with feature combinations from least information to most information. First is the analysis of models trained with the least amount of information: models trained only on the Opcode 4-gram. From an intermediate language representation perspective, Microcode performed the worst across all model types, PCode trained models had middling performance, and LLIL trained models were the most successful at attaining higher accuracy with only opcode vectors acting as differentiating features of the nine malware classes. SVM-based models performed the worst across all intermediate languages, while RF classifiers had the highest performance with this feature set. The relative weakness of SVM models is attributed to the increased difficulty of creating a proper linear decision boundary in a multi-class classification problem. It is noted that even though the dimension reduction process caused the lowest vector usage percentage with LLIL, it trained a comparatively stronger KNN model than the other representations. The performance drop-off from RF models to KNN models was the smallest with LLIL, implying that the language design of LLIL led to stronger clustering mechanisms than the other intermediate language representations. This is explained by how clustering models interface with the vector usage percentage of LLIL. The Opcode 4-gram feature vector includes the 5000 most common 4-grams, and LLIL’s very low vector usage percentage compared to the other intermediate representations indicates that there were substantially more unique combinations due to its higher abstraction level. Since the 5000 most common groupings came from a set with more unique combinations, they were more distinctive within the vector compared to Microcode and PCode. This greater uniqueness, due to the higher abstraction of LLIL, created larger distances between groupings, and the larger the distances between groupings in a clustering algorithm, the easier it is to identify which group a new data point belongs to.
Second, the analysis of the File Size, API 4-gram, Opcode 4-gram based models, also known as the Three Feature models, is interesting because this is the only case where a KNN trained model outperformed the RF version within an intermediate language representation, again with LLIL. This is significant because one of the conclusions of [16] is that RF models are the model type of choice when it comes to the static analysis malware classification, but based on this study, if one is utilizing LLIL with certain feature restrictions, it may be more performant to create a KNN model. LLIL was the highest performing intermediate representation, but this time PCode was the worst as it was outdone by models trained with Microcode converted training data.
Third, comparing the overall accuracy between training and validation set results when all features were used to train classification models showed that RF models performed the best for both LLIL and Microcode; however, Microcode had the highest accuracy on the training set, while LLIL had the highest accuracy on the validation set. Microcode did have the highest training accuracy value for the feature set, but LLIL was only marginally lower, and its validation result is the more significant one. A trend shown across the feature sets used for training was that as more features were added, the better Microcode performed relative to the other intermediate language representations. This is due to its high vector usage percentage meshing well with other features, which act as additional discrimination factors and allow the least amount of waste from the opcode extraction step.
This phenomenon may be attributed to Microcode’s high vector usage percentage compared to the other intermediate representations. As opposed to LLIL, Microcode had a lower number of unique combinations of Opcode 4-grams. In a lower abstraction level language, opcode combinations that perform relatively simple tasks take up more tokens and occur more often. This relationship indicates that when cultivating a grouping of the 5000 most commonly found opcode combinations, the overall vector will include groupings that perform the most common tasks. This makes classification-based tasks more challenging because differentiating malware that also performs common tasks becomes harder. The additional features act as a method to increase differentiation between malware samples, which explains the improved classification as features are added. The same effect can be seen in PCode models, as PCode is also a lower abstraction level language, but not to the same magnitude because its vector usage percentage is proportionally smaller. A key analytical insight is that the higher the vector usage percentage of the Opcode 4-grams for each intermediate representation, the higher the marginal utility additional features add when evaluating performance metrics.
Class-based statistics are shown in Figures 6, 7, and 8. See class definitions in Section 4.1. These represent accuracy per class within the validation set of malware samples. It should be noted that, due to the imbalance of the dataset, it is possible for classes to be heavily underrepresented in the validation set. Additionally, accuracy as a measure can be misleading when reporting class statistics in imbalanced dataset scenarios and when trying to infer overall accuracy, due to the need to correct for dataset sample percentages. However, there are a few takeaways that are useful even in this imbalanced dataset scenario. Most apparent is the comparative weakness of the models in all intermediate language representations for the classification of the Ramnit class of malware. Ramnit is a class of worm samples, a self-propagating malware type, with adequate representation in the training data. It generally scored noticeably lower in all feature combination scenarios, across all model types, and across all languages when compared to other classes of malware. Conversely, the other machine learning metric scores were low for the class Simda, the smallest class with only 42 samples, but LLIL generally did better with it than the other intermediate language representations. This calls back to how LLIL had a higher level of abstraction allowing more unique combinations in the Opcode 4-gram vector, such as those found in small underrepresented categories, which may have enhanced classification performance. Underrepresented categories in unbalanced training data tend to have poor results in multi-class classification models, and Simda is severely underrepresented in the training data, smaller by an order of magnitude than the next smallest class. In addition, the class containing obfuscated malware, Obfuscator.ACY, also performed lower than other classes, with roughly equal difficulty in all three intermediate language representations.
Opcode 4-grams class-based accuracy comparison.
Three feature class-based accuracy comparison.
All features class-based accuracy comparison.
Understanding design requirement permutations and how to best utilize the set of heatmaps and tables within this work can inform which binary lifting mechanism and intermediate language representation is most appropriate. These are examples that illustrate how different intermediate language representations may be more appropriate than others for different design considerations.
If one has been given the design parameters that a malware classification model must: fit within a cluster-based ensemble model, prioritize the machine learning evaluation statistic of accuracy, and may only use opcode sequences from binaries as the main training feature; then, after referring to the heatmaps in either Figure 4 or Figure 5, it is evident that using the binary lifter Binary Ninja with the intermediate language representation of LLIL would result in the most performant model that meets all requirements.
If one has been given the design parameters that a malware classification model must: be an SVM model, prioritize classification of malware similar to the Vundo and KelihosVer1 families, prioritize the machine learning evaluation statistic of F1-Score, and may only use opcode sequences from binaries as the main training feature; then, after referring to the heatmaps in Figure A.7, it is evident that using the binary lifter Ghidra with the intermediate language representation of PCode would result in the most performant model that meets all requirements.
If one has been given the design parameters that a malware classification model must: train on the most efficient representation of binaries when considering a hardware storage limitation, maximize the precision metric, and train with all available features; then, after reviewing Table 3 and Figure A.3, it is evident that using the binary lifter IDA Pro with the intermediate language representation of Microcode would result in the most performant model that meets all requirements.
In closing, this study sought to describe the issue of model frameworks having implicit trust, for both performance and accuracy, in how binaries are lifted for training. Limited information in the contemporary literature is explored in [7] during a discussion on the usage of intermediate language representations. Different reverse engineering programs that convert binaries to intermediate language representations have different strengths and weaknesses for training machine learning models. This study provides an independent verification and replication of a published machine learning framework, novel transforms and usage of a public malware dataset, a comparative study on the performance impact of different intermediate language representations for opcode based malware classification, and a set of heatmaps that can be utilized as a reference lookup table to inform binary lifting choice.
The verification of previous work adds validity to the methods described in [16]. Also, the novel transforms of the dataset [17] may act as a springboard for further analysis and research opportunities. Moreover, there are relevant takeaways from the conversion process and machine learning model comparison. If time or storage capacity is of critical importance for a similar task, such as classification of malware samples, refer to Table 3 to help inform binary lifting choice, as each intermediate language representation has a respective reverse engineering tool associated with it. If model performance is prioritized and the model method is fixed, refer to the overall accuracy heatmaps in Section 5 to inform the determination of the intermediate language representation. Furthermore, when working with malware similar to any of the classes in the data used within this work, review the heatmaps in the main body and Appendix A for the metric of focus to see how similar malware may behave with each intermediate language representation and machine learning model type tested in the comparative analysis. A significant message to bear in mind is that the more unique operation code sequences an intermediate representation produces, the lower the marginal utility additional features add when evaluating performance metrics. The comparative analysis can be used as a look-up table to inform intermediate language representation choice.
Ultimately, this work focused on the direct impact of intermediate language representations in a specific malware classification task utilizing converted training data against different model types. There are two different avenues to go forward with when considering future work: a continuation of how intermediate language representations affect malware classification utilizing more recent developments in deep learning, or going up the chain and looking at how abstract syntax trees derived from intermediate language representations may affect binary analysis. Moreover, now that it has been shown that intermediate languages do affect model performance using a controlled machine learning architecture, there are also two directions for each avenue: original implementations of tasks while testing intermediate representations or more controlled comparative studies.
One could tackle the first avenue, continuing research on intermediate language impact on malware classification, in both directions. The work of Chen et al. [16] could be expanded by including the training of a transformer or long short-term memory model and adding further model types for comparison, as a more controlled comparative study. On the other hand, the reimplementation of [19], with an included intermediate representation study, on a custom dataset could yield more insights with deep learning techniques.
One could tackle the second avenue by going up the chain and looking at how abstract syntax trees derived from intermediate language representations affect binary analysis. A controlled comparative study would look similar to the one presented here, with a known model architecture, accessible dataset, and public code, modified to learn how abstract syntax trees are impacted by intermediate language design. Conversely, an original implementation of [20] with a new dataset that can be transformed into the three intermediate language representations would allow abstract syntax tree study with the transformer technique for cloned code analysis.
Opcode 4-grams class-based precision comparison.
Three feature class-based precision comparison.
All features class-based precision comparison.
Opcode 4-grams class-based recall comparison.
Three feature class-based recall comparison.
All features class-based recall comparison.
Opcode 4-grams class-based F1-score comparison.
Three feature class-based F1-score comparison.
All features class-based F1-score comparison.
Cannan, Logan: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft; Morris, Tommy: Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing. The authors contributed equally to this work.
This study was supported by internal funding from the College of Engineering at The University of Alabama in Huntsville through a Graduate Research Assistantship provided to the first author. The study did not receive external funding from any agencies.
Not applicable.
The data that support the findings of this study are available on request from the corresponding author.
The authors declare no conflict of interest.
© The Author(s) 2025. Licensee IntechOpen. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.