Summary

Title

When Malware is Packin’ Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features

Abstract

Machine learning techniques are widely used in addition to signatures and heuristics to increase the detection rate of anti-malware software, because they automate the creation of detection models and thus make it possible to handle an ever-increasing number of new malware samples. Correspondingly, malware uses packing and other forms of obfuscation to foil the analysis of anti-malware systems and evade detection. However, few realize that benign applications use packing and obfuscation as well, to protect intellectual property and prevent license abuse.

In this paper, we study how machine learning based on static analysis features operates on packed samples. Malware researchers have often assumed that packing would prevent machine learning techniques from building effective classifiers. However, both industry and academia have published results showing that machine-learning-based classifiers can achieve good detection rates, leading many experts to think that classifiers simply detect the fact that a sample is packed, as packing is more prevalent in malicious samples.

We show that, contrary to common assumptions, packers do preserve some information that is "useful" for malware classification when packing programs, but this information does not necessarily capture the sample's behavior. We demonstrate that, because the signals extracted from packed executables are not rich enough, machine-learning-based models cannot generalize their knowledge to previously unseen packers and are not robust against adversarial examples. We also show that a naive application of machine learning techniques results in a substantial number of false positives, which, in turn, may have led to incorrect labeling of ground-truth data used in past work.

Background

Problem Addressed

Does static analysis on packed binaries provide a rich enough set of features to build a malware classifier using machine learning?

Significance of the Problem

Any approach that fails to consider packed benign samples when designing and evaluating a malware detection method will ultimately produce a substantial number of false positives on real-world data. This is especially a concern for machine-learning-based approaches, which, in the absence of reliable and fresh ground truth, frequently rely on labels from anti-malware products on VirusTotal. We argue that one aspect is particularly troublesome: dataset pollution, i.e., packed benign samples that anti-malware products detect as malicious are incorrectly used as malware samples. Solving this problem helps improve the detection rate of anti-malware software.

Main Contributions

  • We study the limits of machine-learning-based malware classifiers that use only static features. We show that a lack of overlap between the packers used in benign and malicious samples causes the classifier to associate specific packers with maliciousness. We show that, if trained correctly, the classifier is able to distinguish between benign and malicious samples packed by real-world packers, though it remains susceptible to unseen packing routines or, even worse, to the application of strong encryption to the entire program. Furthermore, we show that it is possible to craft evasive samples that bypass detection via a naive adversarial attack.
  • Our evaluation of six products on VirusTotal shows that current static machine-learning-based anti-malware engines detect packing instead of maliciousness.
  • We release a dataset of 392,168 executables for which we know whether each sample is benign or malicious, and packed or unpacked. We also know the specific packer for the lab dataset, which includes 341,444 executables.

Main Work

  • Building the datasets. Our experiments require a dataset of executable programs for which we know whether each is: (1) benign or malicious, and (2) packed or unpacked. We combined a labeled dataset from a commercial vendor with the (labeled) EMBER dataset to build our wild dataset, and used several methods to label each executable as packed or unpacked. We then packed all executables in the wild dataset with widely used commercial and free packers as well as our own packer, AES-Encrypter, to build a second ground-truth dataset, the lab dataset. Following a detailed study of the literature, we extracted nine families of features for all samples.
  • We introduce and study research questions that help us answer our main hypothesis; for each question, we carried out one or more experiments.

Experiments

RQ1. Does a skew in the distribution of packers used in benign and malicious samples cause the classifier to treat specific packers as a sign of maliciousness?

  • Experiment I: “no packed benign”

    Training on packed malicious and unpacked benign samples yields a classifier with a high false positive rate on packed benign samples, i.e., a classifier that is biased towards detecting packing.

  • Experiment II: “packer classifier”

    Telling packers apart is a trivial task: a packer classifier trained on the lab dataset achieves precision and recall above 99.99% for every packer class, so a packer bias in the training set can lead the classifier to learn specific packing routines as a sign of maliciousness.

  • Experiment III: “good-bad packers”

    When benign and malicious samples are packed by two non-overlapping sets of packers ("good" and "bad" packers), the classifier labels anything packed by a good packer as benign and anything packed by a bad packer as malicious, regardless of the sample's actual nature.

**Finding 1:** If there is a lack of overlap between the packers used in benign and malicious samples, the classifier becomes biased towards distinguishing packers.

RQ2. Do packers hinder machine-learning-based malware classifiers that use only static analysis features?

  • Experiment IV: “different packed ratios (wild)”
  • Experiment V: “different packed ratios (lab)”
  • Experiment VI: “single packer”

**Finding 2:** Packers preserve some information that can be "useful" for malware classification when packing a program, but this information does not necessarily represent the sample's true nature.

RQ3. Can a carefully trained classifier that is not biased towards specific packers perform well in real-world scenarios?

  • Experiment VII: “wild vs. packers”
  • Experiment VIII: “withheld packer”
  • Experiment IX: “lab against wild”
  • Experiment X: “Strong & Complete Encryption”
  • Experiment XI: “adversarial samples”

**Finding 3:** Although we observe that static analysis features combined with machine learning can distinguish packed benign samples from packed malicious samples, such a classification approach still produces intolerable errors in the real world.

RQ4. How do packers affect the accuracy of real-world anti-malware engines that combine machine learning with static analysis features?

  • Experiment XII: “anti-malware industry”

**Finding 4:** Machine-learning-based anti-malware engines on VirusTotal detect packing rather than maliciousness.

Conclusions

We first observe that the distribution of packers in the training set must be taken into account: a lack of overlap between the packers used in benign and malicious samples can lead the classifier to distinguish packers rather than behavior. Contrary to common assumptions, packers do preserve information that is "useful" for malware classification when packing programs, yet this information does not necessarily capture the sample's behavior. Moreover, it does not help the classifier generalize its knowledge to previously unseen packers or remain robust against trivial adversarial attacks. We observe that static machine-learning-based products on VirusTotal produce a high false positive rate on packed binaries, which may stem from the limitations discussed in this work. This problem will only be amplified as the anti-malware industry increasingly deploys machine-learning-based classifiers that use only static features.

Limitations

We observed that machine learning combined with static analysis transfers poorly to unseen packers, and our experiments did not take temporal constraints into account; both are left as future work. In addition, this paper focuses only on Windows x86 executables, but our hypotheses may also apply to Android apps, where packing is becoming increasingly common.

Takeaways

  • Dynamic analysis provides a clear picture of an executable's behavior, but dynamically analyzing untrusted code requires either kernel-level privileges, which expands the attack surface, or a virtual machine, which demands substantial computing resources. Malware often employs environmental checks to avoid detection, and a virtualized environment may not reflect the malware's target environment. Packed executables should therefore be detected by combining dynamic and static analysis rather than by a naive use of static analysis alone.

  • Detecting packed executables with machine learning over static analysis features is a hard problem, but the careful labeling and filtering of the datasets in this paper lowered the practical difficulty of the experiments and raised the credibility of the results. Posing a research question, designing controlled experiments, discovering new questions from the results, and then designing further experiments to answer them is a step-by-step experimental mindset we benefited from greatly.

Paper Reading

Paper: When Malware is Packin’ Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features

Abstract—Machine learning techniques are widely used in addition to signatures and heuristics to increase the detection rate of anti-malware software, as they automate the creation of detection models, making it possible to handle an ever-increasing number of new malware samples. In order to foil the analysis of anti-malware systems and evade detection, malware uses packing and other forms of obfuscation. However, few realize that benign applications use packing and obfuscation as well, to protect intellectual property and prevent license abuse.

In this paper, we study how machine learning based on static analysis features operates on packed samples. Malware researchers have often assumed that packing would prevent machine learning techniques from building effective classifiers. However, both industry and academia have published results that show that machine-learning-based classifiers can achieve good detection rates, leading many experts to think that classifiers are simply detecting the fact that a sample is packed, as packing is more prevalent in malicious samples. We show that, different from what is commonly assumed, packers do preserve some information when packing programs that is “useful” for malware classification. However, this information does not necessarily capture the sample’s behavior. We demonstrate that the signals extracted from packed executables are not rich enough for machine-learning-based models to generalize their knowledge to operate on unseen packers, and be robust against adversarial examples. We also show that a naive application of machine learning techniques results in a substantial number of false positives, which, in turn, might have resulted in incorrect labeling of ground-truth data used in past work.


I. INTRODUCTION

Anti-malware software provides end-users with a means to detect and remediate the presence of malware on their machines. Most anti-malware software traditionally consists of two parts: a signature-based detector and a heuristics-based classifier. While signature-based methods detect similar versions of known malware families with a small error rate, they become insufficient as an ever-increasing number of new malware samples are being identified. VirusTotal reports that, on average, over 680,000 new samples are analyzed per day, of which some are merely repacked versions of previously seen samples with identical behavior. Over the last few years, the need for techniques that generalize to new, unknown malware samples while removing expensive human experts from the loop has led to approaches that leverage both static and dynamic analyses combined with data mining and machine learning techniques.


Although dynamic analysis provides a clear picture of an executable’s behavior, it has some issues in practice: for example, dynamic analysis of untrusted code requires either kernel-level privileges, thus expanding the attack surface, or a virtual machine, which requires a substantial amount of computing resources. In addition, malware usually employs environmental checks to avoid detection, and the virtualized environment may not reflect the environment targeted by the malware. To avoid such limitations, some approaches heavily rely on features extracted through static analysis. These approaches are appealing to anti-malware companies that want to replace anti-malware systems based on dynamic analysis. These static-analysis-based anti-malware vendors, which have quickly grown into billion-dollar companies, boast that their tools leverage “AI techniques” to determine the maliciousness of programs solely based on their static features (i.e., without having to execute them). However, static analysis has known issues when applied to obfuscated and packed samples.


It is commonly assumed that packing greatly hinders machine learning techniques that leverage features extracted from static (file) analysis. However, both industry and academia have published results showing that machine-learning-based classifiers can achieve good detection rates. Many experts assume that these results are due to the fact that classifiers just learn to distinguish between packed and unpacked programs. In fact, we would expect that machine-learning-based classifiers will deliver poor performance in real-world settings, where packing is increasingly seen in both malicious and benign software. Unfortunately, most related work did not consider or only briefly discussed the effects of packing when proposing machine-learning-based classifiers. Surprisingly, our initial experiments showed that machine-learning-based classifiers can distinguish between packed benign and packed malicious samples in our dataset. This led us to the following research question: does static analysis on packed binaries provide a rich enough set of features to build a malware classifier using machine learning?


Our experiments require a ground-truth dataset for which we can determine if each sample is packed or unpacked and malicious or benign. We created our first dataset, the wild dataset, with executables provided by a commercial anti-malware vendor, which uses dynamic features, combined with the labeled benchmark dataset EMBER. We leveraged the vendor’s sandbox, along with VirusTotal, to remove samples with inconsistent benign/malicious labels from the dataset. For identifying packed executables, we used the vendor’s sandbox combined with the Deep Packer Inspector tool and a number of static tools. The fact that we built the dataset mainly based on the runtime behavior of samples gives us high confidence in our ground truth labels. We created a second dataset, the lab dataset, by packing all the executables in the wild dataset with widely used commercial and free packers. Following a detailed literature study, we extracted nine families of features from the executables in the two datasets. Even though in our experiments we used SVM, deep neural networks (i.e., MalConv), and different variants of decision-tree learners, like random forest, we only discuss the results of the random forest approach as we observed similar findings for these approaches, with random forest being the best classifier in most experiments, and random forest allows for better interpretation of the results compared to neural networks.


As a naive experiment, we first trained the classifier on packed malicious and unpacked benign samples. The resulting classifier produced a high false positive rate on packed benign samples, which shows that the classifier is biased towards detecting packing. Using n-grams, Perdisci et al. also observed that packing detection is an easier task to learn compared to detecting maliciousness. In addition, we demonstrated that “packer classification” is a trivial task by training a packer classifier using samples from each packer (class) in the lab dataset. The classifier achieved precision and recall greater than 99.99% for each class. This indicates that a bias in the training set regarding packers may cause the classifier to learn specific packing routines as a sign of maliciousness. We verified this by training the classifier on benign and malicious executables packed by two non-overlapping subsets of packers, which we refer to as good and bad packers, respectively. The resulting classifier learned to label anything packed by good packers as benign, and anything packed by bad packers as malicious, regardless of whether or not the sample is malicious.


We extended the naive experiment by training the classifier on different training sets with increasing ratios of packed benign samples. To avoid the bias introduced by the use of good and bad packers, we selected packed samples from the lab dataset uniformly distributed over packers. Surprisingly, despite the popular assumption that packing hinders machine-learning-based classifiers, we found that increasing the packed benign ratio in the training set helped the classifier to maintain relatively low false positive and false negative rates. This shows that packers preserve some information about the original binary that can be leveraged for malware detection. For example, most packers keep .CAB file headers in the resource sections of the executables. Jacob et al. found a similar trend for packers that employ weak encryption or compression. By training on one packer at a time, we observed that the information preserved about the original binaries is not necessarily associated with malicious behavior, but is “useful” for malware detection. Nevertheless, we argue that such a classifier still suffers from three issues: inability to generalize, failure in the presence of strong encryption, and vulnerability to adversarial samples.


Generalization. Training the classifier on packed samples is not guaranteed to generalize to packers that are not included in the training set. We excluded one packer at a time from the training dataset and evaluated the classifier against samples packed with the excluded packer. We observed false positive rates of 43.65%, 47.49%, and 83.06% when excluding tElock, PECompact, and kkrunchy, respectively. Moreover, the classifier trained on all packers from the lab dataset produced a false negative rate of 41.98% on packed executables from the wild dataset. This means that although packers preserve some information, the trained classifier fails to generalize to previously unseen packing routines. This is a severe problem as malware authors often prefer customized packing routines to off-the-shelf packers.


Strong & complete encryption. We argue that an executable might be packed in a way that reveals no information related to its behavior until it is executed. As a preliminary step, we packed all executables in the wild dataset with our own packer, called AES-Encrypter, which encrypts the executable with AES and injects it as the overlay of the packed binary. When the packed program is executed, AES-Encrypter decrypts the overlay and executes the original program within a new process. All static features are always the same, except for features extracted from the encrypted overlay. We trained and tested the classifier on executables packed by the AES-Encrypter, and, as expected, the classifier could not distinguish between benign and malicious executables packed by AES-Encrypter. This shows that packing can be performed without transferring any (static) initial pattern to the packed program, if properly optimized for this purpose.


Adversarial samples. Machine-learning-based malware classifiers have been shown to be vulnerable against adversarial samples, especially those that use only static analysis features. We expect that generating such adversarial samples would be easier in our case, as static analysis of packed binaries does not provide features that capture a sample’s behavior. We first trained the classifier on a dataset whose benign and malicious samples are packed with the same packers so that the classifier is not biased to detect specific packing routines as a sign of maliciousness. The classifier maintained a low error rate. From all malicious samples that the classifier detected successfully, we managed to generate new samples that the classifier no longer detects as malicious. Specifically, we identified “benign” sequences of bytes that occurred more frequently in benign samples and injected them into the target binary without affecting the sample’s behavior. Very recently, a group of researchers used a very similar technique to trick Cylance’s AI-based anti-malware engine into thinking that malware like WannaCry and tools such as Mimikatz were benign. They did this by taking strings from an online gaming program and injecting them into malicious files. Since games are highly obfuscated and packed, they confront such an engine with a dilemma; either inherit a bias towards games or produce high rates of false positives for them.

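The attack described above does not require changing any executed code. As a rough illustration (not the authors' exact procedure), bytes appended after the end of a PE image become overlay data that the Windows loader never maps or executes, so byte-level static features can be pushed towards the benign class while runtime behavior stays identical. The file names and byte sequences below are purely hypothetical.

```python
# Rough sketch of a naive adversarial transformation: append "benign-looking" byte
# sequences to the PE overlay, which is ignored at load time, so behavior is unchanged.
def inject_benign_bytes(in_path: str, out_path: str, benign_ngrams: list) -> None:
    with open(in_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(data + b"".join(benign_ngrams))

# Hypothetical usage with byte sequences that a model associates with benign samples.
inject_benign_bytes("sample.exe", "sample_evasive.exe",
                    [b"Copyright (C) Microsoft Corporation", b"GetProcAddress\x00"])
```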

To investigate how real-world malware detectors operate on packed executables, we submitted benign and malicious executables packed by each packer to VirusTotal. We only focused on six machine-learning-based engines that use only static analysis features according to their description on VirusTotal or the company’s website. Unfortunately, we observed that all these six engines learned that packing implies maliciousness. It must be noted that we used commercial packers, like Themida, PECompact, PELock, and Obsidium, that legitimate software companies use to protect their software. Nevertheless, benign programs packed by these packers were detected as malware.


As packing is being increasingly adopted by legitimate software, the anti-malware industry needs to do better than detecting packers, otherwise good and bad programs are misclassified, causing pain to users and eventually resulting in alert fatigue and missed detections. This is especially a concern for previous studies that rely on anti-malware products for establishing ground truth, as misclassification of packed benign programs might have biased those studies.


In summary, we make the following contributions:

  • We study the limits of machine-learning-based malware classifiers that use only static features. We show that the lack of overlap between packers used in benign and malicious samples causes the classifier to associate specific packers with maliciousness. We show that, if trained correctly, the classifier is able to distinguish between benign and malicious samples packed by real-world packers, though it remains susceptible to unseen packing routines or, even worse, to the application of strong encryption to the entire program. Furthermore, we show that it is possible to craft evasive samples that bypass detection via a naive adversarial attack.
  • Our evaluation of six products on VirusTotal shows that current static machine-learning-based anti-malware engines detect packing instead of maliciousness.
  • We release a dataset of 392,168 executables for which we know whether each sample is benign or malicious, and packed or unpacked. We also know the specific packer for the lab dataset, which includes 341,444 executables.

We release the source code of all experiments in a Docker image at https://github.com/ucsb-seclab/packware to support the reproducibility of our results.


II. MOTIVATION

Packing has long been an effective method for malware authors to evade the signature-based detection of anti-malware engines, but little is known about its legitimate usage in benign applications. As the first step in this direction, in 2013, Lakshman Nataraj explored how anti-malware scanners available on VirusTotal handle packing. He packed 16,663 benign system executables from various Windows OS versions with four different packers (UPX, Upack, NSPack, and BEP), and submitted them to VirusTotal. He showed that 96.7% of the files packed with Upack, NSPack, and BEP triggered at least ten detections on VirusTotal. Another recent study mined byte pattern-based signatures of anti-malware products to force misclassifications of benign files, and also found that the artifacts of packers are effective as “malicious markers.” We argue that these results stem from the fact that packing historically has been associated with malware only. Consequently, a naïve detection approach only based on static features from packed samples will be heavily biased towards associating packing with malicious behavior. In fact, static analysis features that are shown to be useful for packing detection are also being used by machine-learning-based malware detectors.


We collected a large-scale, real-world dataset of malicious, suspicious, and benign files from a commercial vendor of advanced malware protection products. This dataset includes samples that the vendor analyzed from customers around the globe over the past three years. As Figure 1 shows, packing is not only widespread in malware samples (75%), but also common in benign samples (50% in the worst case). Note that Figure 1 presents a lower bound for the ratio of packed executables. Our findings overlap with the findings of Rahbarinia et al., who studied 3 million web-based software downloads over 7 months in 2014, finding that both malicious and benign files use known packers (58% and 54%, respectively). Making matters even worse, more than half of the 69 unique packers they observed (e.g., INNO, UPX) are being used by both malicious and benign software. While some packers (e.g., NSPack, Molebox) were exclusively used to pack malware in their dataset, they conclude that packing information alone is not a good indicator of malicious behavior. We further packed 613 executables from a fresh installation of Windows 10 (located in C:\Windows\System32) with Themida and submitted them to VirusTotal. Figure 2 shows the histogram of the number of detections. Unsurprisingly, out of 613 binaries, 564 binaries were detected as malicious by more than 10 anti-malware tools. If we consider only the six machine-learning-based anti-malware engines on VirusTotal, out of 613 binaries, 553 binaries were detected as malicious by more than four tools.



As these numbers show, any approach that fails to consider packed benign samples when designing and evaluating a malware detection approach ultimately results in a substantial number of false positives on real-world data. This is especially a concern for machine-learning-based approaches, which, in the absence of reliable and fresh ground truth, frequently rely on labels from anti-malware products available on VirusTotal. Given the disagreement of anti-malware products in labeling samples, a common practice is to sanitize a dataset, for example, by considering decisions from a selected set of anti-malware products, or, as another example, by using a voting-based consensus. While this approach is problematic for various reasons, we believe that one main aspect is particularly troublesome: Dataset pollution. Packed benign samples that are detected by anti-malware products as malicious are incorrectly used as malware samples. For example, a recent related work used a similar procedure for labeling, as stated by the authors: “We train a classifier using supervised learning and therefore require a target label for each sample (0 for benign and 1 for malware). We use malware indicators from VirusTotal. For each sample, we count the number of malicious detections from the various engines aggregated by VirusTotal, weighted according to a reputation we give to each engine, such that several well-known engines are given weight >1, and all others are weighted 1. We use the result to label a sample benign or malicious.” While we do not know which weights are used by the authors, there is a good chance that their dataset is skewed, since, as we showed above, a number of anti-malware engines on VirusTotal detect packed benign samples as malware.


As studied by the anti-malware community, evaluating existing malware detection methodologies poses substantial challenges. For example, Rossow et al. presented guidelines for collecting and using malware datasets. Our work aims to find whether packing even retains rich enough static features from the original binary to detect anything meaningful besides the packing itself. To the best of our knowledge, no prior work has considered the effects of packed executables on machine-learning-based malware detectors that leverage only static analysis features.


III. BACKGROUND

A. Executable Packers

A packer is a software component that applies a set of routines to compress or encrypt a target program. The simplest form of packing consists of the decryption or decompression (at runtime) of the original payload followed by a jump to the memory address that contains the target payload (this technique is called “tail jump”). Ugarte et al. classify packers into six types, with an increasing level of complexity in the reconstruction of the target payload:

Type I: A single unpacking routine is executed to transfer the control to the original program. UPX is the most popular packer in this class.

Type II: The packer employs a chain of unpacking routines executed sequentially, with the original code recomposed at the end of the chain.

Type III: Unpacking routines include loops and backward edges. Though the original code is not necessarily reconstructed in the last layer, a tail transition still exists to separate the packer and the application code.

Type IV: In each layer of packing, the corresponding part of the unpacking routine is interleaved with the corresponding part of the original code. However, the entire original code will be completely unpacked in memory at some point during the execution.

Type V: The packer is composed of different layers in which the unpacking code is mangled with the original code. There are multiple tail jumps that reveal only a single frame of the original code at a time.

Type VI: Packers reveal (unpack) only a single fragment of the original code (as little as a single instruction) at any given time.


We discuss approaches that are proposed for packing detection, packer identification, and automated unpacking in Appendix A. Here, we discuss the limitations of these methods.


Limitations of packing detection

Signature-based approaches to packing detection have a high false negative rate, as they require a priori knowledge of packed executables generated by each packer. As an example, PEiD is shown to have approximately a 30% false negative rate. Other approaches apply static analysis to extract a set of features or use hand-crafted heuristics to detect packed executables. However, they are vulnerable to adversaries. As an example, the Zeus malware family applies different techniques, such as inserting a selected set of bytes into executables, in order to keep the entropy of the file and its sections low. Such malware evades entropy-based heuristics, as they are often used to determine if an executable is packed. Dynamic approaches seem to perform better, since they often look for a write-execute sequence in a memory location, which is the definition of packing. However, packed executables usually employ different techniques to evade analysis, like conditional execution of unpacking routines.

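To make the entropy heuristic (and its weakness) concrete, here is a minimal sketch that flags a PE file as likely packed when any of its sections has very high byte entropy; the 7.0 threshold is an illustrative value, and, as noted above, malware like Zeus can defeat exactly this check by padding its sections with low-entropy bytes.

```python
# Minimal sketch of an entropy-based packing heuristic (illustrative threshold only).
import math
import pefile

def shannon_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return -sum(c / len(data) * math.log2(c / len(data)) for c in counts if c)

def looks_packed(path: str, threshold: float = 7.0) -> bool:
    pe = pefile.PE(path)
    # Compressed or encrypted sections have near-uniform byte distributions.
    return any(shannon_entropy(section.get_data()) > threshold for section in pe.sections)
```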

Limitations of generic unpackers

Packers usually employ different techniques to evade analysis approaches utilized by generic unpackers. For example, tELock and Armadillo leverage several anti-debugging routines to terminate the execution in a debugging setting. Although some unpackers exploit hardware virtualization to achieve transparency, the introduced performance overhead could be unacceptable. Themida applies virtualization obfuscation to its unpacking routine, which can cause slice size explosion. In general, generic unpackers rely on a number of assumptions that do not necessarily hold in practice: the entire original code is in memory at a certain point, the original code is unpacked in the last layer, the execution of the unpacking routine and the original code are completely separated, and the unpacking code and the original code run in the same process without any inter-process communication. These simplifications make these unpackers inadequate for handling the challenges introduced by complex, real-world packers. Moreover, generic unpackers often rely on heuristics that are designed for specific packers.


B. Packing vs. Static Malware Analysis

In Appendix B, we discuss how machine learning is being adopted by the anti-malware community to statically analyze malicious programs. In particular, we reviewed a wide range of static malware analysis approaches based on machine learning. Although static malware detectors have been shown to be biased towards detecting packing, we observed a number of limitations in related work when it comes to the handling of packed executables. In particular, out of the 30 papers mentioned above: (1) Ten papers do not mention packing or obfuscation techniques. (2) Ten approaches work only on unpacked executables, as mentioned by the authors. They used either unpacked executables or executables that they managed to unpack. (3) Seven papers claim to perform well in malware classification regardless of whether or not the executables are packed. However, the authors did not discuss whether any bias in terms of packing was present in their dataset or not. More precisely, they did not mention using packed benign executables in their dataset, or brief examinations have been done on the effects of packed executables, though the evaluation has been thoroughly carried out only on unpacked executables. (4) Only three papers focused on packed executables. However, they have two major limitations: (a) they use signature-based packer detectors, such as PEiD, to detect packing, while PEiD has approximately a 30% false negative rate, and (b) they augmented their datasets by packing benign executables using only a small number of packers. However, malicious executables might be packed with a different set of packers, which can result in a bias towards detecting specific packing techniques. Jacob et al. detect similar malware samples even if they are packed, yet, their method is resilient only against packers that employ compression or weak encryption, as they acknowledge.


Finally, most related work did not publish their datasets, hence these approaches cannot be fairly compared to each other.


IV. DATASET

Our experiments require a dataset composed of executable programs for which we know if they are: (1) benign or malicious and (2) packed or unpacked. We combined a labeled dataset from a commercial vendor with the EMBER dataset (labeled) to build our wild dataset. We leveraged a hybrid approach to label an executable as packed or unpacked. We built another ground-truth dataset, the lab dataset, by packing all executables in the wild dataset with widely used commercial and free packers and our own packer, AES-Encrypter. Following a detailed study of the literature, we extracted nine families of features for all samples.


A. Wild Dataset

We used two different sources to create our wild dataset of Windows x86 executables. (1) A commercial anti-malware vendor provided 29,573 executables. These samples, observed “in the wild,” were randomly selected from an original pool that was analyzed by the anti-malware vendor’s sandbox in the US during the period from 2017-05-15 to 2017-09-19. Along with the benign/malicious label and the malicious behaviors observed during the execution, the vendor identified which executable was packed or not. (2) A labeled benchmark dataset, called EMBER, was introduced by Anderson et al. for training machine learning models to statically detect Windows malware. This dataset consists of 800,000 Windows executables that are labeled. However, no information is provided regarding packing. We randomly selected 56,411 x86 executables from this dataset and submitted each sample to the commercial anti-malware vendor’s sandbox, in order to identify if the sample is packed. This also provides us confirmation whether an executable is malware or benign software, as the sandbox detects malicious behavior. Note that samples from these two sources were observed “in the wild” sometime in 2017, allowing more than enough time for current anti-malware engines to have incorporated means to detect them. As these two sources might have samples that are incorrectly labeled, we performed a careful and extensive post-processing step, which we describe in the following paragraphs.


Malicious vs. benign

We used three different sources to detect whether an executable is malicious or benign. (1) VirusTotal: We obtained reports for our entire dataset by querying VirusTotal. All 85,984 executables in our dataset have been available on VirusTotal for more than one year. From all engines used by VirusTotal, we considered only seven tools that are well-known as strong products in the anti-malware industry and labeled each executable based on the majority vote. (2) The anti-malware vendor: Since we sent samples extracted from the EMBER dataset to the vendor’s sandbox, we have the benign/malicious label for all samples. (3) EMBER dataset: All samples that we selected from the EMBER dataset are labeled by Endgame.


We discarded 4,113 samples for which there was a disagreement about their benign/malicious nature between the three sources. As Table I shows, at the end of this step, we have 37,269 benign and 44,602 malicious samples left (a total of 81,871 executables).

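The following is a small sketch of this sanitization logic as we understand it (the function and variable names are ours, not the authors'): a sample is kept only when the VirusTotal majority vote, the vendor's sandbox, and the EMBER label all agree.

```python
# Sketch of the three-source label sanitization (illustrative, simplified).
def consensus_label(vt_votes, vendor_malicious, ember_malicious):
    """vt_votes: dict mapping each of the seven selected VirusTotal engines to its verdict."""
    vt_malicious = sum(vt_votes.values()) > len(vt_votes) / 2   # majority vote
    labels = {vt_malicious, vendor_malicious, ember_malicious}
    return labels.pop() if len(labels) == 1 else None           # None -> discard the sample
```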


Packed vs. unpacked

Due to the limitations discussed in Section III-A, we leveraged a hybrid approach to determine if an executable is packed. In particular, for each sample, we took the following steps: (1) The anti-malware vendor: We submitted the sample to the vendor’s sandbox, and given the downloaded report, we detected whether unpacking behavior had occurred or not. The anti-malware tool detects the presence of packed code by running the executable in a custom sandbox that interrupts the execution every time there is a write to a memory location followed by a jump to that address. At that point in time, a snapshot of the loaded instructions is compared to the original binary, and if they differ, the executable is marked as packed. (2) Deep Packer Inspector (dpi): We used dpi to further analyze each sample. This framework measures the runtime complexity of packers. Adding an extra dynamic engine helps us to identify packed executables that are not detected as packed by the first dynamic engine. For example, the host configuration might make the sample terminate before the unpacking process starts. In addition, this framework gives us insights about the runtime complexity of packers in our dataset. As dpi is not operating on .NET executables, we removed all 13,489 .NET executables, 10,681 benign and 2,808 malicious, from our dataset, resulting in 68,382 executables, 26,588 benign and 41,794 malicious. (3) Signatures and heuristics: We used Manalyze, Exeinfo PE, yara rules, PEiD, and F-Prot (from VirusTotal) to identify packers that leave noticeable artifacts in packed executables.


In particular, we labeled an executable as packed in our dataset if one among the vendor’s sandbox, dpi, and signature-based tools detects the executable as packed. In total, we labeled 46,328 samples as packed divided into 12,647 benign and 33,681 malicious samples. We further used heuristics proposed by Manalyze for packing detection to determine samples that might be packed. Manalyze labeled 24,911 samples as “possibly packed,” of which 6,898 samples are not detected as packed by other tools. We argue that this discrepancy might be due to limitations with packing detection, which we discuss in Section III-A. Nevertheless, we discarded all these samples as we were not completely sure if they are packed or not.


Table X in the Appendix shows statistics about packed executables that are detected by each approach. Of 17,043 benign executables, 12,647 executables are packed, and 4,396 executables are unpacked, and of 40,031 malicious executables, 33,681 executables are packed, and 5,752 executables are unpacked. While unpacked malware is shown to be rare, we did not detect packing for 5,752 (13.61%) malicious samples. Since this percentage could be considered somewhat higher than expected, we attempted to verify our packer analysis by randomly selecting 20 samples, and manually looking for the presence or absence of unpacking routines. We observed the unpacking routine code for 18 samples, but our packer detection scheme did not detect them due to the anti-detection techniques that these samples use. Since we do not need any unpacked malicious executables for our experiments, we discarded all 5,752 malicious samples that our system labeled as unpacked. To confirm that all 4,396 benign samples that we identified as unpacked are not packed, we manually looked into 100 unpacked benign executables and did not find any sign of packing. Simple statistics guarantee that more than 97.11% (95.59%) of these samples are labeled correctly with the confidence of 95% (99%).


We further noticed that our dataset was skewed in terms of DLL files, containing 4,005 benign DLLs but only 598 malicious ones. We removed all these samples from our dataset. In the end, the wild dataset consists of 50,724 executables divided into 4,396 unpacked benign, 12,647 packed benign, and 33,681 packed malicious executables.



Packer complexity

As Table X in the Appendix shows, dpi detects the unpacking behavior for 34,044 executables in the wild dataset. Table XI presents the packer complexity classes, as defined by Ugarte et al., for these executables.


Packers in the wild

Using PEiD, F-Prot, Manalyze, Exeinfo PE, and yara rules, we matched signatures of packers for 9,448 executables, 1,866 benign and 7,582 malicious. We found the artifacts of 48 packers in the wild dataset. As Table XII in the Appendix shows, some packers like dxpack, MPRESS, and PECompact have been used mostly in malicious samples.



B. Lab Dataset

Some of our experiments require us to know with certainty which packer is used to pack a program. Therefore, we obtained nine packers that are either commercially available or freeware (namely Obsidium, PELock, Themida, PECompact, Petite, UPX, kkrunchy, MPRESS, and tElock) and packed all 50,724 executables in our wild dataset to create the lab dataset. None of the packers were able to pack all samples. For example, Petite failed on most executables with a GUI, while Obsidium in some cases produced empty executables. We looked at logs generated by these packers and removed those executables that were not properly packed. We also verified that all packed executables have valid entry points. Finally, we developed our own simple packer, called AES-Encrypter, which, given the executable P, encrypts P using AES with a random key (which is included in the final binary), and injects the encrypted binary as the overlay of the packed binary P’. When P’ is executed, it first decrypts the overlay and then executes the decrypted (original) binary. Table II lists the number of samples we packed successfully with each packer. In total, we generated 341,444 packed executables. To ascertain if packing does, in fact, preserve the original behavior, we compared the behavior of these samples with the original samples. Our results confirm that 94.56% of samples exhibit the original behavior. We explain in Appendix C how we conducted this comparison.

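As an illustration of the construction described above (a sketch only, not the authors' AES-Encrypter code), the packing step can be as simple as encrypting the original program and appending key, IV, and ciphertext as the overlay of a fixed unpacking stub; the example assumes the pycryptodome package, and the stub itself (which would decrypt and run the payload in a new process) is out of scope.

```python
# Sketch of an AES-Encrypter-style packing step (assumes pycryptodome; stub not shown).
import os
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad

def pack(stub_path: str, payload_path: str, out_path: str) -> None:
    key, iv = os.urandom(16), os.urandom(16)          # per-sample key, stored in the output
    with open(payload_path, "rb") as f:
        payload = f.read()
    ciphertext = AES.new(key, AES.MODE_CBC, iv).encrypt(pad(payload, AES.block_size))
    with open(stub_path, "rb") as f:
        stub = f.read()                               # unpacking stub: identical for every sample
    # Bytes past the end of the PE image are overlay data, so only overlay-derived
    # static features differ between two programs packed this way.
    with open(out_path, "wb") as f:
        f.write(stub + key + iv + ciphertext)
```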


C. Features

Following a detailed analysis of the literature (see Section B), we extracted nine families of static analysis features that were shown to be useful in related work. We used pefile to extract features from three different sources: the PE structure, the program’s assembly, and the raw bytes of the binary. As Table III shows, we extracted a total of 56,543 individual features from the samples in our dataset.


(1) PE headers

Features related to PE headers have been widely used in related work. In our case, we use all fields in the PE headers that exhibit some variability across different executables (some header fields never change). We extracted 12 individual features from the Optional and COFF headers, which are described in Table XX in the Appendix. Moreover, from the characteristics field in the COFF header, we extracted 16 binary features, each representing whether the corresponding flag is set for the executable or not. Thus, we extracted 12 integer and 16 binary features from the PE headers, resulting in a total of 28 features.

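A minimal sketch of this family with pefile is shown below; the particular header fields are illustrative (the full list of the 12 fields is in Table XX of the paper), while the 16 flag features come from the COFF Characteristics bits as described above.

```python
# Sketch of PE-header feature extraction with pefile (field selection is illustrative).
import pefile

def pe_header_features(path: str) -> dict:
    pe = pefile.PE(path)
    feats = {
        "timestamp": pe.FILE_HEADER.TimeDateStamp,
        "num_sections": pe.FILE_HEADER.NumberOfSections,
        "size_of_code": pe.OPTIONAL_HEADER.SizeOfCode,
        "size_of_image": pe.OPTIONAL_HEADER.SizeOfImage,
        "entry_point_rva": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
    }
    for bit in range(16):  # one binary feature per COFF Characteristics flag
        feats[f"coff_flag_{bit}"] = bool(pe.FILE_HEADER.Characteristics & (1 << bit))
    return feats
```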


(2) PE sections

Every executable has different sections, such as the .data and .text sections. For each section, we extracted 8 individual features as described in Table XXI in the Appendix. Moreover, from the characteristics field in the section header, we created up to 32 binary features for each bit (flag). For example, the feature corresponding to the 30th bit is true when the section is executable. We ignored the bits (flags) that do not vary in our dataset. For each section of the PE file, we computed 32 (at most) binary, 7 integer, and one string feature, named pesection sectionId field. The maximum number of sections that an executable has in our dataset is 19. For each executable, we built a vector of 516 different features obtained from its sections followed by the default values for sections that the sample does not include. Based on the related work, we augmented this set of features with the following processing steps: (1) We extracted the above- mentioned features for the section where the executable’s entry point resides and added them to the dataset separately; (2) We calculated the mean, minimum, and maximum entropy of the sections for each executable. We did the same for both the size and the virtual size attributes. As a result, we extracted a total of 570 features from the PE sections.

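The per-section flags and the aggregate statistics can be computed along the following lines (a sketch covering only a subset of the 570 features; the feature names are ours):

```python
# Sketch of PE-section features: per-section flags plus entropy/size aggregates.
import pefile

def pe_section_features(path: str) -> dict:
    pe = pefile.PE(path)
    entropies = [s.get_entropy() for s in pe.sections]      # pefile computes section entropy
    raw_sizes = [s.SizeOfRawData for s in pe.sections]
    virt_sizes = [s.Misc_VirtualSize for s in pe.sections]
    feats = {
        "entropy_mean": sum(entropies) / len(entropies),
        "entropy_min": min(entropies), "entropy_max": max(entropies),
        "size_min": min(raw_sizes), "size_max": max(raw_sizes),
        "vsize_min": min(virt_sizes), "vsize_max": max(virt_sizes),
    }
    for i, s in enumerate(pe.sections):
        # 30th characteristics bit (IMAGE_SCN_MEM_EXECUTE): the section is executable.
        feats[f"section_{i}_executable"] = bool(s.Characteristics & 0x20000000)
    return feats
```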


(3) DLL imports

Most executables are linked to dynamically-linked libraries (DLLs). For each library, we use a binary feature that is true when an executable uses that library. In total, we have 4,305 binary features in this set.


(4) API imports

Every executable has an Import Directory Table that includes the APIs that the executable imports from external DLLs. We introduce a binary feature for each API function that is true if the executable imports that function. In total, we have 19,168 binary features in this set.

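Both import-based families can be read from the Import Directory Table with pefile, roughly as follows (a sketch; the real feature vectors are binary indicators over the 4,305 DLL names and 19,168 API names observed in the dataset):

```python
# Sketch of DLL-import and API-import extraction with pefile.
import pefile

def import_names(path: str):
    pe = pefile.PE(path)
    dlls, apis = set(), set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dlls.add(entry.dll.decode(errors="ignore").lower())
        for imp in entry.imports:
            if imp.name:                          # imports by ordinal have no name
                apis.add(imp.name.decode(errors="ignore"))
    return dlls, apis                             # one binary feature per known DLL / API
```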

(5) Rich Header

The Rich Header field in the PE file includes information regarding the identity or type of the object files and the compiler used to build the executable. Webster et al. have shown that the Rich Header is useful for detecting different versions of malware, as malware authors often do not deliberately strip this header. In particular, they observed that “most packers, while sometimes introducing anomalies, did not often strip the Rich Header from samples.” Based on our observation, as Table II shows, while Obsidium, kkrunchy, MPRESS, and PELock stripped the Rich Header for 70–80% of binaries in the wild dataset, other packers always kept this header, except for AES-Encrypter, which always produces the same header. We followed the procedure by Webster et al. to encode this header into 66 integer features.


(6) Byte n-grams

Given that an executable file is a sequence of bytes, we extracted byte n-grams by considering every n consecutive bytes as an individual feature. Given the practical impossibility of storing the representation of n-grams for n ≥ 4 in main memory, a feature selection process is needed. Raff et al. observed that 6-grams perform best over their dataset. We used the same strategy to select the most important 6-gram features, where each feature represents if the executable contains the corresponding 6-gram. We first randomly selected a set of 1,000 samples and computed the number of files containing each individual 6-gram. We observed 1,060,957,223 unique 6-grams in these samples. As Figure 10a in the Appendix shows, and as Raff et al. observed, byte 6-grams follow a power-law type distribution, with 99.99% 6-grams occurring ten or fewer times. We reduced our set of candidate 6-grams by selecting 6-grams that occurred in more than 1% of the samples in the set, which results in 204,502 individual 6-gram features. Then, we selected the top 13,000 n-gram features based on the Information Gain (IG) measure, since our dataset roughly converges at this value, as depicted in Figure 10b.

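The selection pipeline sketched above can be approximated as follows; this is an illustration rather than the authors' code, and it uses scikit-learn's mutual information estimator as a stand-in for the Information Gain ranking.

```python
# Sketch of byte 6-gram selection: document-frequency filter, then IG-style ranking.
from collections import Counter
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def candidate_6grams(paths, min_df=0.01):
    df = Counter()
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        df.update({data[i:i + 6] for i in range(len(data) - 5)})   # presence per file
    return [g for g, c in df.items() if c > min_df * len(paths)]

def top_k_by_information_gain(paths, labels, grams, k=13000):
    X = np.array([[gram in open(p, "rb").read() for gram in grams] for p in paths])
    scores = mutual_info_classif(X, labels, discrete_features=True)
    return [grams[i] for i in np.argsort(scores)[::-1][:k]]
```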


(7) Opcode n-grams

We used the Capstone disassembler to tokenize executables into sequences of opcodes and then built the opcode n-grams. While a small value may fail to detect complex malicious blocks of code, long sequences of opcodes can easily be avoided with simple obfuscation techniques. Moreover, large values of n introduce a high performance overhead. For these reasons, similarly to most related work, we use sequences up to a length of four. We represent opcode n-grams by computing the TF-IDF value for each sequence. While we could extract the assembly for all samples in the wild dataset, out of the 341,444 samples in the lab dataset, we could not disassemble 2,200 samples (see Table II). For these programs, we put -1 as the value of opcode n-grams features. In total, we extracted 5,373,170 unique opcode n-grams, from which, only 51,942 n-grams occurred in more than 0.1% of executables in the lab dataset (Figure 10c). We only consider these opcode n-grams (reduction of 98.47%). Figure 10d presents the Information Gain (IG) measure of these opcode n-grams. We selected the top 2,500 opcode n-grams (based on IG value) with their TF-IDF weights as feature values, resulting into 2,500 float features.

(8) Strings

The (printable) strings contained in an executable may give valuable insights into the executable, such as file names, system resource information, malware signatures, etc. We leveraged the GNU strings program to extract the printable character sequences that are at least 4 characters long. We represent each printable string with a binary feature indicating if the executable contains the string. We observed 1,856,455,113 unique strings, from which more than 99.99% were seen in less than 0.4% of samples. After removing these rare strings, we obtained 16,900 binary features.
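
A pure-Python equivalent of `strings -n 4` plus the binary presence features might look like this; the kept vocabulary is assumed to come from the rarity filtering described above.

```python
# Sketch: extract printable ASCII runs of length >= 4 (what GNU strings does
# by default) and encode each kept string as a binary "contains it" feature.
import re

PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

def printable_strings(path):
    with open(path, "rb") as f:
        data = f.read()
    return {m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)}

def string_features(path, vocabulary):
    present = printable_strings(path)
    return [1 if s in present else 0 for s in vocabulary]
```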

(9) File generic

We also computed the size of each sample (in bytes), and the entropy of the whole file. We refer to this small family of features as “generic.”
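
Both features are straightforward to compute; a minimal sketch:

```python
# Sketch: the two "generic" features, file size in bytes and Shannon entropy
# of the whole file (bits per byte, between 0 and 8).
import math
from collections import Counter

def generic_features(path):
    data = open(path, "rb").read()
    counts = Counter(data)
    entropy = -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())
    return {"file_size": len(data), "file_entropy": entropy}
```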

V. EXPERIMENTS AND RESULTS

In this work, we aim to answer the following question: does static analysis on packed binaries provide rich enough features to a malware classifier? We analyze multiple facets of this question by performing a number of experiments. As explained in the introduction, even though we used several machine learning approaches (i.e., SVM, neural networks, and decision trees), we only discuss the results of the random forest approach as (1) we observed similar findings for these approaches, with random forest being the best classifier in most experiments, and (2) random forest allows for better interpretation of the results compared to neural networks. Following a linear search over different configurations of random forest, we found a suitable trade-off between learning time and test accuracy. Table XIX in the Appendix shows the parameters of the model.
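
The concrete parameters are in Table XIX (not reproduced here); a hedged sketch of such a configuration search, with an illustrative grid rather than the paper's actual values, could look like this:

```python
# Sketch: search over a few random forest configurations and keep the one with
# the best held-out accuracy. The grid values are illustrative assumptions,
# not the parameters reported in Table XIX.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 20, 40],
    "max_features": ["sqrt", 0.1],
}
search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, cv=3, scoring="accuracy")
# search.fit(X_train, y_train); model = search.best_estimator_
```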

Note that all malicious executables in our datasets are packed. Unless stated otherwise: (1) we always partition the dataset into training and test sets with a 70%-30% split, and both the training and test sets are balanced over benign and malicious executables; (2) We repeat each experiment five times by randomly splitting the dataset into training and test sets each time, and average the results of all five rounds; (3) We use all 56,543 features to train the classifier; (4) We focus only on real-world packers (we do not include AES-Encrypter except for Experiment X).
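
A compact sketch of this protocol, assuming `X` and `y` are NumPy arrays with benign labeled 0 and malicious labeled 1:

```python
# Sketch: five repetitions of a stratified 70/30 split, averaging the false
# positive and false negative rates over the rounds.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def evaluate(X, y, rounds=5):
    fprs, fnrs = [], []
    for seed in range(rounds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, stratify=y, random_state=seed)
        pred = RandomForestClassifier(n_jobs=-1, random_state=seed).fit(X_tr, y_tr).predict(X_te)
        fprs.append(np.mean(pred[y_te == 0] == 1))   # benign misclassified as malicious
        fnrs.append(np.mean(pred[y_te == 1] == 0))   # malicious missed
    return np.mean(fprs), np.mean(fnrs)
```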

We introduce and motivate research questions that help us answer our main hypothesis. For each, we describe one or more experiments followed by the corresponding results. Our results fit into four major findings, which we divide as follows. (I) Findings 1 and 3 may be intuitively known in the community, though mostly based on anecdotal experience. We confirm these findings with solid experiments. (II) Previous works have shown preliminary evidence of Finding 2, but with major limitations. We provide extensive evidence for this finding. (III) We present additional evidence for Finding 4, which is a fairly established fact confirmed by related work.

A. Effects of Packing Distribution During Training

RQ1. Does a bias in the distribution of packers used in benign and malicious samples cause the classifier to learn specific packing routines as a sign of maliciousness?

RQ1 is important for two reasons: (1) Machine learning is increasingly being used for malware detection, while, as discussed in Section III-B, most related work does not state whether packed benign executables were considered, and the remaining few neglect the bias that may be introduced by the overlap between packers used in benign and malicious samples; (2) Nowadays, packing is also widespread in benign samples. To answer RQ1, we conducted three experiments.

Experiment I: “no packed benign”

We trained the classifier on 3,956 unpacked benign and 3,956 packed malicious executables from the wild dataset. The resulting classifier produced a false positive rate of 23.40% on 12,647 packed benign samples. It should be noted that the classifier is fairly well calibrated, with false negative and false positive rates of 3.82% and 2.64% for 440 (unseen) packed malicious and 440 (unseen) unpacked benign samples. While this is a naïve experiment, it delivers an important message: excluding packed benign samples from the training set makes the classifier biased towards interpreting packing as an indication of maliciousness, and such a classifier will produce a substantial number of false positives in real-world settings, where packing is also widespread in benign samples. This experiment shows that packed benign executables must be considered when training the classifier.

The overlap between packers that are used in benign and malicious samples may cause the classifier to distinguish between packing routines, i.e., packers. To further investigate this issue, we performed the following two experiments.

Experiment II: “packer classifier”

We used the lab dataset to create a packer classifier. We defined nine classes for the classifier, one per packer. We trained and tested the classifier on datasets with samples uniformly distributed over all classes. In particular, we trained the classifier on 107,471 samples and evaluated it against 46,059 samples. Note that we discarded the benign and malicious labels of samples. The classifier maintained the precision and recall of 99.99% per class. This result shows that “packer classification” is a simple task for the classifier, which indicates that the lack of overlap between packers that are used in benign and malicious samples of the dataset might bias the classifier to associate specific packing routines with maliciousness.

Experiment III: “good-bad packers”

We trained the classifier on a dataset in which benign samples are packed by four specific packers, and malicious samples are packed by the remaining five packers. We refer to these two non-overlapping subsets of packers as good and bad packers, respectively. Then, we tested the classifier on benign and malicious samples that are packed by bad and good packers, respectively. We repeated this experiment for each split of packers. The accuracy of the classifier varied from 0.01% to 12.57% across all splits, showing that the classifier was heavily biased to distinguish between good and bad packers.

Finding 1. The lack of overlap between packers used in benign and malicious samples will bias the classifier towards distinguishing between packing routines.

B. Packers vs. Malware Classification

RQ2. Do packers prevent machine-learning-based malware classifiers that leverage only static analysis features?

It is commonly assumed that machine learning combined with only static analysis is not able to distinguish between benign and malicious samples that are packed. We performed the following three experiments to validate this assumption.

Experiment IV: “different packed ratios (wild)”

We trained the classifier on different subsets of the wild dataset by increasing the ratio of packed benign executables in the training set, with steps of 0.05. The “packed benign ratio” is defined as the proportion of benign samples that are packed. We always used datasets of the same size to fairly compare the trained models with each other, and tested models against the test set with a “wild ratio” of packed benign samples, i.e., the maximum ratio of packed benign executables that the vendor has seen in the wild (i.e., 50% packed benign, see Figure 1). As Figure 3a shows, increasing the packed benign ratio helps the classifier to maintain a lower false positive rate on packed samples, while the false negative rate slightly increases. However, the false positive rate on unpacked samples considerably increases from 3.18% to 16.24% as the classifier sees fewer unpacked samples, which indicates that a classifier that is trained only on packed samples cannot achieve high accuracy on unpacked samples. As illustrated by Table IV, we always used training sets of the same size, uniformly distributed over benign and malicious executables. Table IV also demonstrates that as we increase the ratio of packed benign executables in the training dataset, byte n-gram features play a much more significant role compared to other feature families.
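
A sketch of how such training sets can be assembled (sample-ID arrays and sizes below are illustrative assumptions, not the paper's exact numbers):

```python
# Sketch: build benign training pools with a packed-benign ratio swept from
# 0.0 to 1.0 in steps of 0.05, keeping the total number of benign samples fixed.
import numpy as np

def benign_pool(packed_ids, unpacked_ids, n_benign, ratio, rng):
    n_packed = int(round(ratio * n_benign))
    packed = rng.choice(packed_ids, n_packed, replace=False)
    unpacked = rng.choice(unpacked_ids, n_benign - n_packed, replace=False)
    return np.concatenate([packed, unpacked])

rng = np.random.default_rng(0)
# for ratio in np.arange(0.0, 1.0001, 0.05):
#     benign_train = benign_pool(packed_benign_ids, unpacked_benign_ids, 3956, ratio, rng)
```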

Note that the performance of the classifier might be due to features that do not necessarily capture the real behavior of samples. For example, packed benign executables might be packed by a different set of packers compared to malicious executables. Table XII in the Appendix shows that the distribution of packers being used by benign samples is very different from packers used by malicious samples. For example, there are 13 packers for which we found signatures only in malicious executables in our dataset (e.g., FSG, VMProtect, dxpack, and PE-Armor). Although this discrepancy might not hold for the entire wild dataset, it indicates that such a difference may make the classifier biased to distinguish between good and bad packers, and thus, results can be misleading.

Experiment V: “different packed ratios (lab)”

To mitigate the uncertainty about the distribution of packers in the dataset, we repeated the previous experiment on the lab dataset combined with unpacked benign executables from the wild dataset. We selected packed samples uniformly distributed over the packers for training and test sets. Surprisingly, unlike the popular assumption that packing greatly hinders machine learning models based on static features, the classifier performed better than our expectations, even when there was no unpacked sample in the training set, with false positive and false negative rates of 12.24% and 11.16%, respectively. As Figure 3b presents, the false positive rate for packed executables decreases from 99.76% to 16.03% as we increase the ratio of packed benign samples in the training dataset. Unsurprisingly, when there is no packed benign executable in the training set, the classifier detects everything packed by the packers in the lab dataset as malicious. Table V presents the important features for the classifier based on the ratio of packed benign executables in the dataset. Byte n-grams and PE sections are the most useful families of features. We focused on one packer at a time in the next experiment to identify useful features for each packer.

Experiment VI: “single packer”

For each packer, we trained and tested the classifier on only benign and malicious executables that we packed with that packer. Table VI presents the performance of the classifier corresponding to each individual packer. In all cases, the classifier performed relatively well, with byte n-gram and PE section features as the most useful.

We are also curious to see how packers preserve information when packing programs. To this end, for each packer, we built different models by training the classifier on one family of features at a time. In particular, we observed the following:

Rich Header

The Rich Header family alone helps the classifier to achieve relatively high accuracy, except for those packers that often strip this header (see Table II). As an example, using only Rich Header features, the classifier that is trained on executables packed with Themida maintains an accuracy of 89.03%. Webster et al. also showed that the Rich Header is useful for detecting similar malware.

API imports

When samples are packed with tElock, Themida, or kkrunchy, API import features are no longer useful for malware detection. However, other packers preserve some information in these features. For example, we trained the classifier on executables that are packed with UPX and observed an accuracy of 89.11%. We noticed a similar trend for the DLL imports family. Among the packers for which these features remain informative, the number of API imports was one of the most important features for the classifier. Figure 5 presents the distribution of this feature for UPX, Petite, and PECompact. We also observed specific API imports to be very distinguishing, like ShellExecuteA. Table XXII in the Appendix shows the number of benign and malicious samples that import each of these APIs. For example, Obsidium keeps importing the API FreeSid when packing a binary, and it is well known that UPX keeps one API import from each DLL that the original binary imports to avoid the complexity of loading DLLs during execution. This indicates that packers preserve some information in the Import Directory Table when packing programs.

Opcode n-grams

For each of Obsidium, tElock, and Themida, we trained the classifier using opcode n-grams, and the accuracy dropped to ∼50%. However, we observed the accuracy of 89.01%, 88.72%, 88.27%, 77.25%, 77.04%, and 65.75% while training on samples packed with Petite, PELock, Mpress, kkrunchy, UPX, and PECompact, respectively.

PE headers

For all packers, the classifier had an accuracy above 90%. In particular, the “size of the initialized data” was the most important feature in all cases but UPX. However, the distribution of this feature differs across packers (see Figure 4). While malicious samples packed with kkrunchy, Obsidium, PECompact, tElock, and Themida have bigger initialized data compared to benign executables, the same malicious samples, packed with MPRESS, PELock, and Petite, have smaller initialized data. Interestingly, malicious samples packed with UPX follow a distribution very similar to the distribution observed for benign samples.

PE sections

The accuracy of the classifier was above 90% for all packers, varying from 91.23% to 96.72%. As Figure 7 shows, the importance weights of features significantly differ across different models. For example, the entropy of the entry point section is a very important feature for the classifier that is trained on MPRESS. However, this feature is not helpful when we train the classifier on samples packed with Obsidium, Themida, or PELock. The entry point of binaries packed with MPRESS resides in the second section, .MPRESS2, for which benign and malicious executables have a mean entropy of 6.16 and 5.77. However, for Obsidium, the entry point section always has a high entropy, close to 8.
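
For reference, the entry-point-section entropy discussed here can be computed with `pefile` roughly as follows (a sketch, not the authors' feature extractor):

```python
# Sketch: locate the section containing the entry point and report its name
# and entropy, the feature whose per-packer distribution is discussed above.
import pefile

def entry_point_section_entropy(path):
    pe = pefile.PE(path, fast_load=True)
    ep = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    for section in pe.sections:
        if section.contains_rva(ep):
            name = section.Name.rstrip(b"\x00").decode(errors="replace")
            return name, section.get_entropy()
    return None, None   # entry point outside all sections (seen with some packers)
```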

Finding 2. Packers preserve some information when packing programs that may be “useful” for malware classification; however, such information does not necessarily represent the real nature of samples.

We should emphasize that related work has provided preliminary evidence of Finding 2. Jacob et al. showed that some packers employ weak encryption, which can be used to detect similar malware samples packed with these packers. Webster et al. also showed that some packers do not touch the Rich Header, leaving it viable for malware detection.

C. Malware Classification in Real-world Scenarios

RQ3. Can a classifier that is carefully trained and not biased towards specific packing routines perform well in real-world scenarios?

RQ3 is a key question in the development of machine- learning-based malware classifiers. In this work, we focus on three specific issues:

Generalization. Nowadays, runtime packers are evolving, and malware authors often tend to use their own custom packers. This raises serious doubt about how a classifier performs against previously unseen packers.

Strong & Complete Encryption. Malware authors might customize the packing process to remove the static features that machine-learning-based classifiers can reasonably be expected to leverage. Can malware classifiers be effective in the presence of strong and complete encryption?

Adversarial Examples. Despite their limited scope, recent work has shown that machine-learning-based malware detectors are vulnerable to adversarial examples. Is it possible to use the learned model to drive evasion?

To investigate the generalization question, we carried out the next three experiments.

Experiment VII: “wild vs. packers”

First, we trained the classifier on a dataset with a “wild ratio” of packed benign samples extracted from the wild dataset, and tested it on the lab dataset. As Table VIII shows, the classifier performed poorly against all packers, with the highest accuracy being 78.19% against Themida. This is interesting, as we knew that at least 50% of the packers in our dataset keep the Rich Header, and, therefore, the classifier still should have maintained high accuracy based on the earlier results. We argue that this happened because the classifier chose features with more information gain, and, while testing on the lab dataset, those features are not helpful anymore. In fact, we trained the classifier using only the Rich Header, and the classifier’s accuracy against packers that keep the Rich Header increased considerably, up to over 90%.

Experiment VIII: “withheld packer”

Second, we performed several rounds of experiments on the lab dataset, in which we withheld one packer from the training set and then evaluated the resulting classifier on packed executables generated by this packer (one round for each of the nine packers). To have a fair comparison between rounds, we fixed the size of the training set to 83,760, by selecting 5,235 benign and 5,235 malicious executables for each of the packers. We then tested the classifier on 5,235 benign and 5,235 malicious executables packed with the withheld packer. As Table VII shows, except for the three noticeable cases of PECompact, tElock, and kkrunchy, the classifier performed relatively well, with an F-1 score ranging from 0.90 to 0.95.

In all cases, we identified byte n-gram features extracted from .CAB file headers (residing in the resource sections) as the most important features. There are 6,717 benign and 1,269 malicious executables having these features enabled in the wild dataset. In the previous experiment, the classifier did not learn these features as there were more distinguishing features. However, as packers mostly keep headers of resources despite the encryption of the body, this initial bias is intensified as we packed each sample with multiple packers. In particular, there are 28,765 benign and 2,428 malicious executables in the lab dataset that include these sequences of bytes. However, for PECompact the situation is a bit different, as we could pack only 1,095 benign and 451 malicious samples that have .CAB headers. For tElock, we could pack only 181 benign and 444 malicious samples with .CAB headers. This explains why the accuracy of the classifier is low against PECompact and tElock. We looked at the most important features when we withheld kkrunchy in the learning phase, and we found that byte n-grams extracted from the version info field of resources are very helpful for the classifier. Other packers usually keep this information, hence the classifier learns it, but fails to utilize it against samples packed with kkrunchy, as the packer strips this information. We repeated the experiment by excluding byte n-gram features, and the accuracy of the classifier dropped significantly in all cases, except when we withheld PECompact or kkrunchy (see Table VII).

Experiment IX: “lab against wild”

In this third experiment, we trained the classifier on the lab dataset and evaluated it on packed executables in the wild dataset. This experiment is important as malware authors often prefer customized packing routines to off-the-shelf packers. To avoid any bias in our dataset toward any particular packer, benign and malicious executables were uniformly selected from the various packers. We observed a false negative rate of 41.84% and a false positive rate of 7.27%.

Experiments VII, VIII, and IX demonstrate that when using static analysis features, the classifier is not guaranteed to generalize well to previously unseen packers. As a preliminary step towards the Strong & Complete Encryption issue, we performed the following experiment.

Experiment X: “Strong & Complete Encryption”

In this experiment, we trained the classifier on 11,929 benign and 11,929 malicious executables packed with AES-Encrypter and evaluated it against 5,113 benign and 5,113 malicious executables packed with AES-Encrypter. As AES-Encrypter encrypts the whole executable with AES, we would expect that static analysis features are no longer helpful for a static malware classifier. Surprisingly, the classifier performed better than a random guess just because of two features, “file size” and “file entropy,” with accuracy of 72.67%. As benign samples are bigger in the wild dataset, obviously, packed benign executables are still larger than packed malicious executables as AES-Encrypter just encrypts the original binary. Also, the entropy of the packed executable is affected as a bigger overlay increases the entropy of the packed program more. All other static analysis features are the same across executables packed with AES-Encrypter, except for byte n-grams and strings features, as executables have different (encrypted) overlays. Since we have more malicious samples in the wild dataset, our feature selection procedures for extracting byte n-grams and strings (see Section IV-C) tend to select those features that appear in malicious samples with a higher probability, thus, we expect that the accuracy of the classifier is still slightly better than 50%. In particular, removing the features “file size” and “file entropy” from the dataset resulted in a classifier with an accuracy of 56.85%. In fact, we repeated the feature selection procedure for a balanced dataset of only executables packed with AES-Encrypter, and we got an accuracy of 50% for the classifier when removing these two features.

Experiment X raises serious doubts about machine learning classifiers. When packing hides all information about the original binary until execution, the classifier has no choice but to classify any sample packed by such a packer as malicious. This is an issue, as packing is increasingly being adopted by legitimate software.

Experiment XI: “adversarial samples”

Recent work has shown that machine-learning-based malware detectors, especially those that are based on only static analysis features, are vulnerable to adversarial samples. In our case, this issue becomes magnified as packing causes machine learning classifiers to make decisions based on features that are not directly derived from the actual (unpacked) program. Therefore, generating such adversarial samples would be easier for an adversary.

In this experiment, first we carefully trained the classifier on 3,956 unpacked benign, 3,956 packed benign, and 7,912 malicious executables whose packed benign and malicious samples are uniformly distributed over the same packers from the lab dataset and packed executables in the wild. We showed that such a classifier is not biased towards detecting specific packing routines as a sign of maliciousness. As expected, the classifier performed relatively well in the evaluation, with false positive and false negative rates of 9.70% and 5.33%, respectively. Figure 6 shows the box and whisker plot of the classifier’s confidence score on the test set. The mean confidence of the classifier for packed and unpacked executables that are classified correctly is 0.89 and 0.93, respectively. For benign samples that the classifier misclassified (false positives), the mean confidence is 0.68 and 0.58 for packed and unpacked samples, respectively.

Then, we generated adversarial samples from all 2,494 malicious samples that the classifier detected as malicious (i.e., true positives). To achieve this, we identified byte n-gram and string features that occurred more in benign samples and injected the corresponding bytes into the target program without affecting its behavior. We verified this by analyzing the sample with the ANY.RUN sandbox. By injecting 34.24 (69.92) benign features on average, we managed to generate 2,483 (1,966) adversarial samples that cause the classifier to make false predictions with a confidence greater than 0.5 (0.9). We expect that a more complex attack is needed when the classifier is trained using features extracted from dynamic analysis, which represent the sample’s behavior.
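
One simple way to perform such an injection, given here as an assumption rather than the authors' exact method, is to append the benign-looking bytes to the PE overlay, which the Windows loader ignores; packers that validate the file beyond its mapped image could still reject the modified sample.

```python
# Sketch (assumed injection strategy, not necessarily the one used in the paper):
# append benign-looking strings after the end of the PE image. The overlay is
# not mapped or executed, so behavior is typically unchanged, while the byte
# n-gram and string features of the file do change.
def inject_benign_features(src_path, dst_path, benign_strings):
    with open(src_path, "rb") as f:
        data = f.read()
    payload = b"\x00".join(s.encode() for s in benign_strings)
    with open(dst_path, "wb") as f:
        f.write(data + payload)

# inject_benign_features("sample.exe", "sample_adv.exe",
#                        ["ExampleBenignString1", "ExampleBenignString2"])
```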

Finding 3. Although we observed that static analysis features combined with machine learning can distinguish between packed benign and packed malicious samples, such a classifier will produce intolerable errors in real-world settings.

Recently, a group of researchers found a very similar way to subvert Cylance’s AI-based anti-malware engine. They developed a “global bypass” method that works with almost any malware to fool the Cylance engine. The evasion technique involves simply taking strings from an online gaming program and appending them to known malware, like WannaCry. The major problem that plagued Cylance was that behaviors that are common in malware are also common in games. Games use these techniques for various reasons, e.g., to prevent cheating or reverse engineering. Tuning the system to flag the malware but not such benign programs is quite difficult and prone to more errors, which, in this case, confronts Cylance’s engine with a dilemma: either produce high false positives for games or inherit a bias towards them.

D. Anti-malware Industry vs. Packers

RQ4. How is the accuracy of real-world anti-malware engines that leverage machine learning combined with static analysis features affected by packers?

In today’s world, legitimate software authors pack their products. Therefore, it is no longer acceptable for anti-malware products to detect anything packed as malicious. RQ4 is important because most machine-learning-based approaches rely on labels from VirusTotal in the absence of a reliable and fresh ground-truth dataset. To this end, we identified six products on VirusTotal that, either on the corresponding company’s website or on a VirusTotal blog post, are described as machine-learning-based malware detectors that use only static analysis features. It should be noted that, while VirusTotal clearly discourages using their service to perform anti-malware comparative analyses, in the next experiment, we aim only to see how these engines assign labels to packed samples in general. We do not intend to compare these tools with each other or against another tool.

Experiment XII: “anti-malware industry”

In February 2019, we submitted 6,000 benign and 6,000 malicious executables packed with each packer from the lab dataset to VirusTotal to evaluate these six anti-malware products. As Table IX clearly shows, all six engines have learned to associate packing with maliciousness. Other engines on VirusTotal also produced a similarly high error rate as these six engines. As we discussed in Section II, related work has published results showing a similar trend. This experiment indicates that as packing is being used more often in legitimate software, unless the anti-malware industry does better than detecting packing, benign and malicious software are going to be increasingly misclassified.
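
Reading the per-engine verdicts back from VirusTotal can be done through its (legacy v2) file/report endpoint; the sketch below assumes a valid API key, and the engine names are placeholders for the six products.

```python
# Sketch: query VirusTotal's legacy v2 file/report endpoint for per-engine
# verdicts of a submitted sample. API key and engine names are placeholders.
import requests

API_KEY = "YOUR_VT_API_KEY"
ENGINES = ["EngineA", "EngineB"]   # placeholders for the six ML-based products

def engine_labels(sha256):
    r = requests.get("https://www.virustotal.com/vtapi/v2/file/report",
                     params={"apikey": API_KEY, "resource": sha256})
    scans = r.json().get("scans", {})
    return {e: scans.get(e, {}).get("detected") for e in ENGINES}
```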

Finding 4. Machine-learning-based anti-malware engines on VirusTotal detect packing instead of maliciousness.

VI. DISCUSSION

We showed that machine-learning-based anti-malware engines on VirusTotal produce a substantial number of false positives on packed binaries, which can be due to the limitations discussed in this work. This is especially a serious issue for machine-learning-based approaches that frequently rely on labels from VirusTotal, causing an endless loop in which new approaches rely on polluted datasets, and, in turn, generate polluted datasets for future work.

One might say that this general issue with packing can be avoided by whitelisting samples based on code-signing certificates. However, we have seen that valid digital signatures allowed malware like LockerGoga, Stuxnet, and Flame to bypass anti-malware protections. It should be noted that although we showed that packer classification is an easy task for the classifier to learn over our dataset, packing detection, in general, is a challenging task, especially when malware authors use customized packers that evolve rapidly. While using dynamic analysis features seems necessary to mitigate the limitations of static malware detectors, malware could still force malware detectors to fall back on static features by using sandbox evasion. For example, Jana et al. discovered 45 evasion exploits against 36 popular anti-malware scanners by targeting file processing in malware detectors. All these issues suggest that malware detection should be done using a hybrid approach leveraging both static and dynamic analysis.

Limitations

As encouraged by Pendlebury et al. and Jordaney et al., malware detectors should be evaluated on how they deal with concept drift. We have observed that machine learning combined with static analysis generalizes poorly to unseen packers; however, we did not consider time constraints in our experiments, which we leave as future work. Also, we focused only on Windows x86 executables in this paper, but our hypothesis might also be applicable to Android apps, for which packing is also getting more common.

The theoretical limitations of malware detection have been studied widely. Early work on computer viruses showed that the existence of a precise virus detector that detects all computer viruses implies a decision procedure for the halting problem. Later, Chess et al. presented a polymorphic virus that cannot be precisely detected by any program. Similarly, several critical techniques of static and dynamic analysis are undecidable, including detection of unpacking execution.

Moser et al. proposed a binary obfuscation scheme based on opaque constants that scrambles a program’s control flow and hides data locations and usage. They showed that static analysis for the detection of malicious code can be evaded by their approach in a general way. Christodorescu et al. showed that three anti-malware tools can be easily evaded by very simple obfuscation transformations. Later, they developed a system for evaluating anti-malware tools against obfuscation transformations commonly used to disguise malware. ADAM and DroidChameleon used similar transformation techniques to evaluate commercial Android anti-malware tools. In particular, DroidChameleon’s results on ten anti-malware products show that none of these is resistant to common and simple malware transformation methods. Bacci et al. showed that while dynamic-analysis-based detection demonstrates equal performance on both obfuscated and non-obfuscated Android malware, static-analysis-based detection has a poor performance on obfuscated samples. Although they showed that this effect can be mitigated by using obfuscated malicious samples in the learning phase, no obfuscated benign sample is used, which raises the doubt that the classifier might have learned to detect obfuscation. Hammad et al. recently studied the effects of code obfuscation on Android apps and anti-malware products and found that most anti-malware products are severely impacted by simple obfuscations.

VIII. CONCLUSIONS

In this paper, we have investigated the following question: does static analysis on packed binaries provide a rich enough set of features to a malware classifier? We first observed that the distribution of the packers in the training set must be considered, otherwise the lack of overlap between packers used in benign and malicious samples might cause the classifier to distinguish between packing routines instead of behaviors. Different from what is commonly assumed, packers preserve information when packing programs that is “useful” for malware classification; however, such information does not necessarily capture the sample’s behavior. In addition, such information does not help the classifier to (1) generalize its knowledge to operate on previously unseen packers, and (2) be robust against trivial adversarial attacks. We observed that static machine-learning-based products on VirusTotal produce a high false positive rate on packed binaries, possibly due to the limitations discussed in this work. This issue becomes magnified as we see a trend in the anti-malware industry toward an increasing deployment of machine-learning-based classifiers that only use static features.

To the best of our knowledge, this work is the first comprehensive study on the effects of packed Windows executables on machine-learning-based malware classifiers that use only static analysis features. The source code and our dataset of 392,168 executables are publicly available at https://github.com/ucsb-seclab/packware.