Most Recent Malicious Software Datasets and Machine Learning Detection Techniques: A Review

Background: In cyber security, monitoring systems and analyzing data have become crucial to maintaining data security and integrity. Recently, it has also become important to build systems that analyze and classify data in order to detect and block malicious programs (malware).


INTRODUCTION
Malware is a term for malicious programs or code developed to harm computer systems. Malware is used for disrupting system services, gaining access to systems, denying services, or stealing or modifying confidential data [1] [2].
Malware can be classified into different types, such as viruses, Trojans, spyware, and adware. The rapid growth of both malware and goodware, and the increasing number of malware families, create a demand for practical studies of malware classification [3] [4]. Generally, there are two types of malware analysis: static and dynamic. The analysis is static if the sample is not run on a system; otherwise, it is dynamic [5].
In static analysis, an executable file is analyzed on the basis of its structure, without being executed. Executable files expose numerous static characteristics, including memory compactness and the various memory sections. One way to extract static characteristics from executable files is the Python portable executable library (pefile) [6].
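To make the idea of static feature extraction concrete, the following minimal sketch reads a few header fields from the raw bytes of a portable executable without running it. It deliberately uses only the Python standard library rather than pefile, so it is a swapped-in illustration and not the method of [6]. The function names, the choice of fields, and the synthetic demo image are assumptions made for this example.

```python
import struct


def read_pe_header_features(data: bytes) -> dict:
    """Read a few static header fields from raw PE bytes (nothing is executed)."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE image: missing 'MZ' magic")
    # The 4-byte field at offset 0x3C gives the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 24 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # The 20-byte COFF file header follows the signature.
    machine, n_sections, timestamp, _, _, opt_size, characteristics = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    features = {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "characteristics": characteristics,
        "subsystem": None,
    }
    # The optional header starts right after the COFF header; the Subsystem
    # field sits at offset 68 inside it.
    opt_start = pe_offset + 24
    if opt_size >= 70 and len(data) >= opt_start + 70:
        features["subsystem"] = struct.unpack_from("<H", data, opt_start + 68)[0]
    return features


def build_demo_image() -> bytes:
    """Hypothetical helper: a synthetic header-only image used for illustration."""
    buf = bytearray(0x40 + 24 + 0xE0)
    buf[:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)          # offset of the PE signature
    buf[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<HHIIIHH", buf, 0x44, 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
    struct.pack_into("<H", buf, 0x40 + 24 + 68, 2)   # Subsystem field
    return bytes(buf)


print(read_pe_header_features(build_demo_image()))
```

A real pipeline would extract many more characteristics (sections, imports, strings); the sketch only shows that such features come from parsing the file structure.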
In dynamic analysis, the malware is analyzed while it runs inside a controlled environment. During execution, the malware may make malicious changes to registry keys and then obtain privileged mode in the OS, which allows it to change anything in the operating system [7]. The software is given full access to the resources of the environment in which it executes [8]; it could alter the registry keys on the computer or activate debugger mode. The behavior of the software is logged by an agent in the controlled environment. After the malware has executed, the environment is reverted to its previous state, using the snapshot taken at the beginning of the setup [6].

1992-0652
The main problem in any detection system is its efficiency and its ability to detect malicious files accurately. Given the increasing number of ML algorithms, it is rather challenging to identify the ideal one. This review therefore discusses a group of research papers and compares a set of datasets that have been classified with ML algorithms.
ML techniques are used to identify malware, to categorize it into categories and families, and to separate out instances that exhibit new activity for in-depth study. This section discusses the literature related to these approaches.
The authors in [9] present a flexible architecture that allows users to distinguish between clean and malicious files using machine learning methods. In their research, one-sided perceptrons and kernelized one-sided perceptrons were used to reduce false positives while distinguishing between malware and clean files.
In [10], the authors propose analyzing how malware samples behave. Several algorithms are used in this process, including K-NN, decision trees, SVM, Naive Bayes, and random forest.
The work in [11] presents a methodology in which machine learning techniques are used to process data, classify malware, and detect new malware. Opcode n-grams, feature extraction, and grayscale images are all used in data processing. Malware classification is presented in the form of a decision-making model. The detection module uses a feed-forward neural network (FFNN) to categorize malware families.
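As a rough illustration of two of the processing steps named above, the sketch below counts opcode n-grams and lays raw bytes out as rows of grayscale pixel values. The n-gram length, the image width, the zero padding of the last row, and the function names are assumptions for this example; [11] may use different settings.

```python
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Count every run of n consecutive opcodes in a disassembled sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def bytes_to_grayscale(data, width):
    """Treat each byte (0-255) as one pixel and wrap the bytes into rows."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # pad the last row with black pixels
        rows.append(row)
    return rows


# Illustrative inputs only; not taken from any dataset in this review.
counts = opcode_ngrams(["push", "mov", "push", "mov", "call"], n=2)
print(counts.most_common(1))
print(bytes_to_grayscale(bytes([0, 255, 16]), width=2))
```

The n-gram counts can be used directly as a feature vector, while the pixel rows can be fed to an image classifier.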
Most previous research works test the efficiency of a particular algorithm on only one dataset. In some cases, they test a group of algorithms on one specific dataset, even though the efficiency of some algorithms differs from one dataset to another. An example is the FFNN algorithm: one of its advantages is that it can handle large data, unlike the SVM algorithm, which is less efficient on large data. It has therefore become necessary to test algorithms on more than one type of data, which is what this paper aims to do.

Materials and Methods
Data Sets and Machine Learning Methods
In its general sense, a dataset is defined as "a collection of data". Normally, the data is represented in tables, in which each column indicates a distinct variable and each row is a specific record of the dataset. Each variable takes different values across the records [12].
In recent years, different datasets have been used because of the increase in the number and types of attacks. It has therefore become necessary to generate and update datasets in order to reduce attacks and improve security. Table 1 shows an example of a collected sample of a dataset [13] [14].
Machine learning (ML) is used to teach machines how to handle data more effectively. The main reason for using ML is that the data often cannot be evaluated or extrapolated by direct inspection [15]. To solve data challenges, machine learning uses a variety of algorithms. The choice of algorithm depends on many factors, such as the type of problem to be solved, the number of variables, and the most suitable type of model [16].
This part contains two main sections. The first presents the most popular datasets used in recent years, while the second focuses on the methods used for malware classification on each dataset.
o SOREL-20M Dataset
This is a large-scale dataset consisting of metadata and pre-extracted features, together with high-quality labels derived from multiple sources. It covers about 20 million samples and records the vendor detections at collection time, along with tags for each malware sample that can serve as additional targets [17]. SOREL also provides approximately 10 million malware samples that have been disarmed by setting the optional_headers.subsystem and file_header.machine flags to zero, for use in exploring features and detection strategies [18].

▪ SOREL Detection Techniques
Two baseline machine learning models are used on the SOREL dataset. The first is a feed-forward neural network (FFNN). The weights applied to the input data are a key element in classifying the data. Pre-training and data pre-processing are key components of effective methods for achieving quick training and high classification accuracy [19].
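The role of the weights in a feed-forward network can be sketched with a tiny forward pass. This is a minimal illustration under assumed settings, not the SOREL baseline: the layer sizes, the ReLU and sigmoid activations, and the hand-picked weights are all hypothetical, and no training is shown.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum of the inputs, bias, activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def ffnn_forward(features, layers):
    """Pass the features through each layer in order; data only flows forward."""
    out = features
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output score.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, -1.0]], [0.5], sigmoid),
]
score = ffnn_forward([2.0, 1.0], layers)[0]
print(f"malware score: {score:.2f}")
```

In a trained model the weights are learned from labeled samples, and the output score is thresholded to decide between malicious and benign.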

The second is the LightGBM gradient boosted decision tree (GBDT) model. GBDT is a commonly adopted ML algorithm known for being accurate, interpretable, and efficient. It achieves state-of-the-art performance on many ML tasks, such as multiclass classification, click prediction, and learning to rank [20]. However, with high feature dimensions and large data sizes, its efficiency and scalability remain unsatisfactory. This is mostly because, for every feature, all data instances must be scanned to estimate the information gain of every possible split point, which is very time-consuming [21].
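The cost described above can be seen in a minimal sketch of exact split finding for one tree node. It is an illustration only and not LightGBM's implementation: the gain measure (reduction in the sum of squared deviations of the targets) and all names are assumptions chosen for this example.

```python
def squared_deviation(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_exact(rows, targets):
    """Try every feature and every observed threshold, rescanning all rows each time."""
    total = squared_deviation(targets)
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for f in range(len(rows[0])):
        thresholds = sorted({row[f] for row in rows})
        for t in thresholds[:-1]:  # the largest value cannot split the data
            left = [y for row, y in zip(rows, targets) if row[f] <= t]
            right = [y for row, y in zip(rows, targets) if row[f] > t]
            gain = total - squared_deviation(left) - squared_deviation(right)
            if gain > best[2]:
                best = (f, t, gain)
    return best


# Illustrative data: four samples with two features each.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
targets = [0.0, 0.0, 1.0, 1.0]
print(best_split_exact(rows, targets))
```

The nested loops make the work grow with the number of features times the number of candidate thresholds times the number of instances, which is the scalability problem the text refers to.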

▪ SOREL Classification Result
This section shows the SOREL dataset classification results after applying the FFNN and LightGBM algorithms.

o Ember Dataset
The Ember dataset is one of the most popular malware datasets for training ML models to detect malicious executable files. It consists of features extracted from 1.1 million binary files, including 900k training samples (300k benign, 300k unlabeled, 300k malicious) [13].

▪ Ember Detection Techniques
Gradient boosting decision tree (GBDT) is a common algorithm for ML models. The reasons for its popularity are its efficiency, interpretability, and accuracy [22]. However, big data (a large number of instances and a large number of features) poses several challenges to GBDT in the trade-off between efficiency and accuracy. This is because, for every feature, GBDT scans all data instances to estimate the information gain, which increases the running time when handling big data [

▪ Ember Classification Result
This section shows the results of the Ember dataset classification using LightGBM, as shown in the table below.

o BODMAS Dataset
For the BODMAS dataset [14], a preliminary analysis is performed to illustrate the effect of concept drift and to discuss how the dataset could contribute to current and future research efforts.

▪ BODMAS Detection Techniques
As with the Ember dataset, a gradient boosted decision tree (GBDT) classifier is used on the BODMAS dataset.

▪ BODMAS Classification Result
This section illustrates the results of the LightGBM algorithm on the BODMAS dataset, as presented in the table below.

o New Dataset for Dynamic Malware Classification
This research introduces two datasets. The first, with 9795 samples, was obtained from simple sharing, and the second was obtained from VirusShare. The authors also analyze classification performance on balanced and imbalanced multi-class malware data using RF, SVM, histogram-based gradient boosting, and XGBoost [24].

▪ Histogram-based Gradient Boosting (HGB):
HGB is a widely used ML algorithm applied in many domains. Its model complexity is easy to control through the tree depth and the number of trees [24]. HGB is significantly faster on larger datasets and natively supports missing values in the dataset, which is a key feature [25].
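Why binning speeds up training can be sketched for a single feature: the values are grouped into a small number of bins once, and only the bin boundaries are then tried as split points. This is a minimal sketch under assumed settings (equal-width bins, a variance-style gain, hypothetical names) and it omits the missing-value handling mentioned above; it is not the implementation evaluated in [24].

```python
def histogram_split(values, targets, n_bins=4):
    """Find the best split of one feature using per-bin counts and target sums."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / n_bins) or 1.0  # avoid a zero width for constant features
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    # Single pass over the data: build the histogram.
    for v, y in zip(values, targets):
        b = min(int((v - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += y
    total_n, total_s = sum(counts), sum(sums)
    best_threshold, best_gain = None, 0.0
    left_n, left_s = 0, 0.0
    # Only n_bins - 1 candidate splits are evaluated, independent of data size.
    for b in range(n_bins - 1):
        left_n += counts[b]
        left_s += sums[b]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        gain = left_s ** 2 / left_n + right_s ** 2 / right_n - total_s ** 2 / total_n
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


# Illustrative data only.
print(histogram_split([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 0, 1, 1, 1, 1]))
```

After the histogram is built, the split search no longer depends on the number of instances, which is where the speed-up on large datasets comes from.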

Table 7. HGB Algorithm Result
▪ Random Forest (RF)
RF is a popular machine learning algorithm. It has developed into a widely used nonparametric approach that can be applied to classification or regression problems [25]. To obtain more precise predictions, the RF algorithm builds numerous decision trees and then combines them [26].
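The combination step can be sketched as bootstrap sampling plus a majority vote. In this minimal illustration the trees are stand-in threshold rules written by hand, so tree training itself is not shown; the function names, the feature names, and the rules are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement; each tree trains on one such sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a label and combine the answers by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Stand-in "trees": hand-written rules over hypothetical features.
trees = [
    lambda s: "malware" if s["entropy"] > 7.0 else "benign",
    lambda s: "malware" if s["imports"] < 5 else "benign",
    lambda s: "malware" if s["sections"] > 8 else "benign",
]
sample = {"entropy": 7.4, "imports": 3, "sections": 4}
print(forest_predict(trees, sample))

rng = random.Random(0)
print(bootstrap_sample([1, 2, 3, 4, 5], rng))
```

Because each tree sees a different bootstrap sample, individual errors tend to differ between trees, and the vote averages them out.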

▪ Random Forest Result
This section shows the results of the RF algorithm, as illustrated in the table below.

▪ Support Vector Machine (SVM)
SVM is a supervised ML algorithm that is commonly applied to classification tasks on complex datasets. The benefit of SVM is its efficiency in high-dimensional spaces [26]. In addition, it provides more accurate results and can be used with a larger number of independent variables [1].

When the new dataset was classified using the HGB, RF, and SVM algorithms, the accuracy was 89.5% for both HGB and RF, whereas the accuracy of SVM was 91%. On this dataset, SVM is therefore considered more accurate than the RF and HGB algorithms.
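The accuracy figures compared above are the fraction of samples whose predicted class matches the true class. A minimal sketch of that computation follows; the labels in the usage lines are made up for illustration and are not results from any of the reviewed papers.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative labels only.
y_true = ["trojan", "adware", "trojan", "virus"]
y_pred = ["trojan", "adware", "virus", "virus"]
print(f"accuracy: {accuracy(y_true, y_pred):.1%}")
```

On imbalanced multi-class data, accuracy alone can hide poor performance on rare families, so per-class measures are usually reported alongside it.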

Results and Discussion
This paper reviewed a set of research works on the datasets and machine learning algorithms used for malware classification. The first work classified the SOREL dataset using the FFNN algorithm, with accurate and efficient results. The Ember dataset was classified using the GBDT algorithm, which classified the data in a semi-efficient manner. The limitation of GBDT is its running time: the algorithm needs all the features of every instance in the dataset, which slows it down. The same holds for the BODMAS dataset when classified using GBDT. The new dataset was classified by three algorithms: random forest, SVM, and HGB. Of these, SVM performed best, with an accuracy of 91%. Overall, the review concludes that the best algorithm is FFNN as applied to SOREL-20M. The reason for choosing this algorithm is that it was tested on a dataset that is considerably larger than the other datasets and that contains a larger number of malware types, so it can be concluded to be more efficient.

Conclusion
Malware classification is a significant field of study. This paper has reviewed a group of research papers on malware classification using machine learning algorithms. The malware data came from popular malware datasets, namely the Ember and SOREL datasets, and from datasets collected by researchers, such as BODMAS and the new dataset. Based on the research work discussed in this paper, the FFNN algorithm was found to be the best algorithm for the SOREL-20M dataset.

Conflict of interest
There are no conflicts of interest.