Best Languages to Learn for Malware Analysis
One of the most common questions I’m asked is “what programming language(s) should I learn to get into malware analysis/reverse engineering?” To answer it, I’m going to write about the top 3 languages which I’ve personally found most useful. I’ll focus on native malware (malware which does not require a framework such as Java, Python, or .NET to run), as this is the most common type and understanding it will give you the skills required to pivot into other kinds. In this article I won’t be covering IoT or mobile malware, as these are more specialized and I don’t have as much experience with them.
Python is an incredibly versatile language and my personal go-to when I need to get something done quickly. Whilst there are several other languages which are fast for development, I find Python to have the best combination of readability, rapid development potential, and ease of learning. One of my favorite things about Python is its write-once, run-anywhere nature: it’s an interpreted language, so the interpreter does all the translating between different operating systems for you, meaning you can write code and expect it to work on any operating system with Python installed. I’m a big fan of Linux servers but I use Windows as my main OS; therefore, it’s incredibly useful to be able to write and test my code on my desktop then just upload it to the server when I’m done, instead of trying to develop and test code in a PuTTY terminal (ugh).
One of my favorite uses for Python is quickly replicating components of malware in order to better understand how they work, or to interface with the malware itself, allowing for quicker analysis. A good example is my TrickBot toolkit, which helps overcome some of the hurdles faced when reversing modular malware.
One of the major issues I had when working with TrickBot is that it uses hacked servers for command and control, most of which get shut down very quickly. Within a few days of getting a fresh sample, all the hardcoded control servers would be dead and the sample wouldn’t work. After analyzing the encryption/decryption code used by the malware, I was able to create a simple Python script which allowed me to decrypt and modify any of TrickBot’s config files. I used it to edit the main config file and supply it with fresh control servers which I found posted online by other analysts.
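To give a feel for how small such a script can be, here is a minimal sketch of the decrypt, patch, re-encrypt workflow. It is not TrickBot’s real scheme: the repeating-key XOR cipher, the `srv=` line format, the key, and the helper names `xor_crypt` and `replace_servers` are all assumptions made up for illustration. In practice you would swap in whatever algorithm and layout you recovered from the sample.

```python
from itertools import cycle


def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Hypothetical stand-in cipher: repeating-key XOR (same call encrypts and decrypts)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def replace_servers(encrypted: bytes, key: bytes, servers: list) -> bytes:
    """Decrypt a config, replace its control server entries, and re-encrypt it."""
    plaintext = xor_crypt(encrypted, key).decode("utf-8")
    # Assumed layout: one "srv=host:port" line per control server.
    kept = [line for line in plaintext.splitlines() if not line.startswith("srv=")]
    kept += [f"srv={server}" for server in servers]
    return xor_crypt("\n".join(kept).encode("utf-8"), key)


KEY = b"example-key"  # placeholder key, not taken from any real sample
original = xor_crypt(b"ver=1\nsrv=198.51.100.7:443\n", KEY)
patched = replace_servers(original, KEY, ["192.0.2.10:443"])
print(xor_crypt(patched, KEY).decode("utf-8"))
```

Keeping the cipher in its own function means only `xor_crypt` has to change once the real algorithm is known; the patching logic stays the same.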
Another handy script I wrote was for interfacing with the command & control infrastructure, allowing me to fetch commands, payloads, and new server addresses, all without needing to run the malicious binary. If you’re privacy conscious then you probably don’t want to run malware on a system connected to the internet; instead, it makes sense to analyze malware offline and re-implement any code which requires internet access in Python, where you have complete control of what it does (say goodbye to malware sending the bad guys real data and hello to honeypot credentials).

## Debugger Extensions
Often analysis tasks can be repetitive, mundane, and extremely time consuming, so it’s handy to be able to identify such tasks and automate them. Every debugger and disassembler I’ve worked with has its own built-in scripting language, but if you’re using more than one setup (which you probably should be), it doesn’t make sense to learn a bunch of different scripting languages; luckily, almost all of them support Python!
Previously I’ve written posts explaining my use of Python for automation of tasks (in this case using IDA Pro): in my Let’s Analyze: Dridex Series I use Python to automate the decryption, commenting, and dumping of encrypted strings; furthermore, I showed how it can be used to handle functions which are only resolved at call time to complicate static analysis.
In that example, the script goes through every encrypted string in the Dridex binary, then decrypts and dumps each one, all without needing to run the malware.

## Task Automation
Python is well known for its large range of downloadable libraries, which implement all kinds of different functionality so you don’t have to, allowing you to efficiently automate almost any task you can do on a computer without writing code from scratch. Using Python for task automation doesn’t just have to apply to malware analysis either (it’s great for anything). I have written scripts to check command and control domains, upload or query files on VirusTotal, load virtual machines to analyze malware in, and even provision entire servers. Python is the Swiss Army knife of programming, so it’s worth learning as much for everyday tasks as it is for malware analysis.
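As a small taste of this kind of automation, the sketch below hashes every file in a folder using only the standard library. A SHA-256 hash is the usual identifier for looking a sample up on services such as VirusTotal; the lookup itself is left out here, and the function names `sha256_of_file` and `hash_samples` are just illustrative choices.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large samples never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_samples(folder):
    """Map each file name in a folder to its SHA-256 hex digest."""
    return {
        entry.name: sha256_of_file(entry)
        for entry in sorted(Path(folder).iterdir())
        if entry.is_file()
    }
```

Calling `hash_samples("samples")` on a folder of your choosing returns a dictionary that is easy to feed into whatever lookup or reporting step comes next.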
C was the first programming language I became competent in after I began studying it at the age of 12. I didn’t actually learn C with the intention of getting into reverse engineering (I wanted to be a programmer), so I spent years studying it and ended up using the knowledge I gained as the foundation for my malware analysis career instead. Whilst I won’t insist you go and master programming in C (unless it’s something you want to do), I do recommend learning how to read and understand it, which is much less time consuming and just as useful when it comes to malware analysis.
Even as an experienced reverse engineer, there will be times when you come across a call to a function you’re not familiar with in some malware’s code. If you want to understand what a function does, what the parameters are, how to initialize it, and what it returns, then the best course of action is to pull up the documentation.
As an example, we’ll take the Windows and Linux documentation for the “connect” function.
In both cases the function definition is in C (the MSDN documentation says C++ but it’s actually C). In C each parameter passed to a function has a “type”, which simply specifies what kind of data it contains (is it a number? is it text? is it binary data?). The 2nd parameter in both examples is a pointer to a “sockaddr”, which is a well known C type. Simply knowing what “sockaddr” is would allow us to immediately know what data to expect, how to reference it, and how to interpret it. If we didn’t know, then because “sockaddr” is a structure (a group of multiple pieces of data) we’d first have to look up the definition of sockaddr to find what types of data it contains, then look up each individual data type to understand what it is (this is essentially learning C). MSDN also often provides sample C code showing how a group of related functions is used together (reading this code is a fast way to understand the relationship between functions).
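To make the idea of a structure concrete, here is a rough sketch that decodes the IPv4 form of sockaddr (known as `sockaddr_in`) from raw bytes, the way you might when you find one in a memory dump. The field layout in the comments is the one used on Windows and Linux for x86; the function name `parse_sockaddr_in` and the sample bytes are made up for the example.

```python
import socket
import struct


def parse_sockaddr_in(raw: bytes) -> dict:
    """Decode a 16-byte sockaddr_in as laid out on x86 Windows and Linux."""
    # sin_family: 2 bytes, host byte order (little-endian on x86)
    # sin_port:   2 bytes, network byte order (big-endian)
    # sin_addr:   4 bytes, network byte order
    # sin_zero:   8 bytes of padding, ignored here
    (family,) = struct.unpack_from("<H", raw, 0)
    (port,) = struct.unpack_from(">H", raw, 2)
    address = socket.inet_ntoa(raw[4:8])
    return {"family": family, "port": port, "address": address}


# AF_INET (2), port 443, address 192.0.2.1, then 8 padding bytes.
sample = bytes.fromhex("0200" "01bb" "c0000201") + bytes(8)
print(parse_sockaddr_in(sample))
# {'family': 2, 'port': 443, 'address': '192.0.2.1'}
```

Notice that the port is stored big-endian while the family is not; this is exactly the kind of detail the C definition and its documentation spell out for you.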
With closed source operating systems such as Microsoft Windows, there are a lot of data structures and functions that aren’t documented (usually because they are not supposed to be used by anyone but Microsoft themselves); these are often referred to as Windows Internals. Malware developers like to abuse internals to get around security controls, or simply to confuse analysts who are unfamiliar with them. Although it is possible to reverse engineer the operating system itself in order to understand its internals (which is how I originally gained most of my own understanding), it is extremely time consuming and difficult; however, there are faster options if you know C.
## ReactOS Source Code

ReactOS is an open source operating system written in C and designed to be compatible with Windows Server 2003 executables. In order to maintain compatibility with Windows binaries, the developers had to reverse engineer and re-implement most of NT 5.2 (the version of the Windows operating system core which drives Windows Server 2003 and 64-bit Windows XP, and a close relative of the cores in Windows 2000 and 32-bit Windows XP), so a lot of Windows can be understood through ReactOS. Because Windows XP still has a large market share, malware developers still go to a lot of trouble to make their malware XP compatible; it is therefore rare to see usage of internals not present in the 5.2 core and thus ReactOS. I’ve personally used the ReactOS source code to understand some Windows internals well enough that I can successfully work with them on actual Windows systems.
## Windows Research Kernel

The Windows Research Kernel (or WRK) is a subset of the NTOS Kernel source code made available by Microsoft for researchers. It includes most of the core kernel source code; however, some things have been removed. Although the WRK includes only the kernel part of Windows, it is still useful for understanding non-kernel malware, because parts of the kernel are exposed to userland applications via the “Native API”, which is often abused by rootkits. Initially the WRK was difficult to obtain (it was only provided to accredited universities as part of various computer science based programs), but since then some of the older versions have been posted to GitHub and are therefore publicly available. Whilst the WRK is not as complete as ReactOS, it has some benefits, such as containing code for the 64-bit Windows Kernel and being actual code written by Microsoft, not just a replica made from reverse engineering.
One of the best ways to understand how malware works and what to look for when reverse engineering is to read the source code of actual malware, which is almost always written in C or C++. Obviously it’s impossible to get the source code of every or even most pieces of malware, because the developers don’t publish it; however, some of the biggest malware families have had their source code leaked at some point in time (e.g. Zeus, Mirai, Carberp, ISFB, Rovnix). A large quantity of leaked malware source code is available via the following GitHub Repository, but it’s best to avoid downloading the code as it may contain elements which could infect your computer (be aware that in some countries possession of malware code may even be illegal).
## IDA Pro Pseudocode
The full edition of IDA Pro with Decompiler has an option to display assembly as “Pseudocode”, which makes it generate C-like code that matches the Assembly.
Although IDA Pseudocode is not technically 100% valid C (there are subtle differences), it is clearly legible to anyone capable of reading C. The Pseudocode feature will allow you to quickly get a feel for what code is doing, but it’s more a way of saving time than a replacement for understanding Assembly language.
Assembly (commonly abbreviated to ASM) is by far the most important tool in any reverse engineer’s toolkit: it’s the human readable version of machine code, the only language the computer’s CPU actually understands. Any code which needs to run on a computer without the assistance of an interpreter must be compiled to machine code, which is a collection of zeros and ones that form instructions telling the CPU what to do. Languages like C, C++, Go, Pascal, and Haskell are all compiled (translated) to machine code, and as a result the majority of software (including malware) can be read as Assembly code using a Disassembler (software which translates machine code into its human readable version, Assembly). If you can read Assembly well, you won’t need the original code for anything written in a language which compiles to machine code (this has many uses outside of malware analysis).
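To show what “human readable version of machine code” means in practice, here is a toy disassembler written in Python. It only knows five i386 encodings, chosen by hand for this example, whereas a real disassembler has to decode thousands of instruction forms; the names `OPCODES` and `disassemble` are likewise just for illustration.

```python
# Toy lookup table: a handful of real i386 encodings and their Assembly text.
OPCODES = {
    b"\x55": "push ebp",
    b"\x89\xe5": "mov ebp, esp",
    b"\x5d": "pop ebp",
    b"\x90": "nop",
    b"\xc3": "ret",
}


def disassemble(code: bytes) -> list:
    """Translate known byte sequences to Assembly; emit unknown bytes as data."""
    listing = []
    offset = 0
    while offset < len(code):
        for length in (2, 1):  # try the longer encoding first
            chunk = code[offset:offset + length]
            if chunk in OPCODES:
                listing.append(OPCODES[chunk])
                offset += len(chunk)
                break
        else:
            # Anything not in the table is shown as a raw byte.
            listing.append(f"db 0x{code[offset]:02x}")
            offset += 1
    return listing


print(disassemble(bytes.fromhex("5589e55dc3")))
# ['push ebp', 'mov ebp, esp', 'pop ebp', 'ret']
```

The five bytes in the example form about the smallest function a 32-bit C compiler might emit: set up a stack frame, tear it down, and return.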
Unfortunately, because Assembly is just a human readable version of machine code, there isn’t only one assembly language (because there isn’t only one type of machine code). Different types of CPUs accept different types of instructions, so when you’re learning Assembly you’re actually just learning the text versions of the instructions your CPU supports (known as its instruction set). So in order to learn Assembly, you’ll first need to choose which instruction set to learn it for; luckily, there are only 2 instruction sets which are common when it comes to traditional computers.
i386 (short for Intel 80386) is the 32-bit version of the x86 instruction set and is used in some capacity by almost all 32-bit desktop computers, servers, and laptops. It doesn’t matter if a machine is running Windows, Linux, or Mac; if it’s 32-bit, then the CPU is probably i386 based. When someone says “Assembly”, they’re usually talking about i386 Assembly, as it’s one of the most widespread instruction sets there is.
When Microsoft began designing 64-bit operating systems they went to great trouble to ensure 32-bit (i386) applications would still work. As a result, most Windows malware developers simply write 32-bit malware, because it works on both 32-bit and 64-bit Windows. Even on 64-bit Windows the majority of malware is i386, so this is the instruction set I recommend starting with.
x86_64 is the 64-bit version of x86 and the most common instruction set used by modern computers. Architectures in the x86 family maintain strong backwards compatibility: all x86_64 CPUs can execute i386 code, and most 64-bit instructions closely resemble their i386 counterparts. It therefore generally makes sense to learn i386 Assembly first and then progress on to x86_64, because the x86_64 instruction set essentially includes the i386 one.
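The resemblance is easy to see with a toy example. The sketch below decodes the same five bytes twice, once with a tiny hand-picked i386 table and once with an x86_64 one. The encodings are real, but the tables and the `decode` helper are only an illustration and cover nothing beyond this example.

```python
# Hand-picked encodings only: the same bytes as seen in 32-bit and 64-bit mode.
I386 = {
    b"\x55": "push ebp",
    b"\x48": "dec eax",
    b"\x89\xe5": "mov ebp, esp",
    b"\xc3": "ret",
}
X86_64 = {
    b"\x55": "push rbp",
    b"\x48\x89\xe5": "mov rbp, rsp",  # 0x48 acts as a prefix in 64-bit mode
    b"\x89\xe5": "mov ebp, esp",
    b"\xc3": "ret",
}


def decode(code: bytes, table: dict) -> list:
    """Greedy longest-match decode of a byte string against one lookup table."""
    longest = max(len(encoding) for encoding in table)
    listing, offset = [], 0
    while offset < len(code):
        for length in range(longest, 0, -1):
            chunk = code[offset:offset + length]
            if chunk in table:
                listing.append(table[chunk])
                offset += len(chunk)
                break
        else:
            listing.append(f"db 0x{code[offset]:02x}")
            offset += 1
    return listing


code = bytes.fromhex("554889e5c3")
print(decode(code, I386))    # ['push ebp', 'dec eax', 'mov ebp, esp', 'ret']
print(decode(code, X86_64))  # ['push rbp', 'mov rbp, rsp', 'ret']
```

Most of the listing is identical apart from register names, but the `0x48` byte changes meaning between modes, which is why a disassembler always needs to be told whether it is looking at 32-bit or 64-bit code.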
Whilst 64-bit Windows can run i386 applications, other 64-bit operating systems generally do not offer such support out of the box, so native code for them must be x86_64. More advanced Windows malware families will often deploy 64-bit versions of their code on 64-bit Windows, and on 64-bit Windows all kernel mode code is 64-bit, so x86_64 Assembly is a valuable addition to knowing i386.
Whilst there is no easy path from non-programmer to malware reverse engineer, learning programming is very rewarding for a variety of reasons. Having experience in various programming languages (especially low level ones like ASM and C) will allow you to learn other languages much faster, as well as give you a better understanding of how computers work in general. Programming is one of the most marketable computer based skills and will be something you can use even if you decide not to be a malware analyst, so don’t feel like you’re wasting your time learning so much just to start out analyzing malware.