CS2821, Python
Reverse engineering binaries, whether malicious or benign, is made more difficult by the absence of debug information. The identifiers of variables and functions have been “stripped”, so reverse engineers have to name them manually during analysis, based on their own understanding of what the code does.
The goal of this project is to use machine learning to predict debug information for binaries. You will use a corpus of open source software, such as an entire Linux distribution, to predict correct names for particular code features such as variables and functions. The project has several technical parts:
- Disassembling binaries with debug information, using Capstone (with Python bindings)
- Extracting features usable for machine learning, e.g., graphs of assembly instructions
- Training a classifier, using scikit-learn, on the debug symbols in one half of the binaries
- Evaluating the classifier on the other half of the binaries. The same classifier can then be used to predict symbols for code where neither source code nor symbols are available.
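As a rough illustration of the training and testing steps above, here is a minimal sketch using scikit-learn. Everything in it is an assumption made for illustration: the toy corpus, the binary names, the choice of mnemonic n-gram counts as features (a simple stand-in for the instruction graphs mentioned above), and the choice of logistic regression as the classifier. In the real project the mnemonic sequences would come from disassembling each function with Capstone rather than being written by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: (binary, function name from debug symbols,
# space-separated instruction mnemonics of that function).
CORPUS = [
    ("bin_a", "strlen", "xor cmp je inc jmp ret"),
    ("bin_a", "memcpy", "mov mov inc inc dec jne ret"),
    ("bin_b", "strlen", "xor cmp je add jmp ret"),
    ("bin_b", "memcpy", "mov mov add add dec jne ret"),
    ("bin_c", "strlen", "xor cmp je inc jmp ret"),
    ("bin_c", "memcpy", "mov mov inc inc dec jne ret"),
    ("bin_d", "strlen", "xor cmp je add jmp ret"),
    ("bin_d", "memcpy", "mov mov add add sub jne ret"),
]


def split_by_binary(corpus, train_binaries):
    """Split whole binaries into train and test sets, so that no binary
    contributes functions to both sides."""
    train = [(name, code) for binary, name, code in corpus
             if binary in train_binaries]
    test = [(name, code) for binary, name, code in corpus
            if binary not in train_binaries]
    return train, test


def train_and_evaluate(corpus, train_binaries):
    """Train on the named binaries, then predict names for the rest."""
    train, test = split_by_binary(corpus, train_binaries)
    train_names, train_code = zip(*train)
    test_names, test_code = zip(*test)

    # Features: counts of single mnemonics and adjacent mnemonic pairs.
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_code, train_names)

    predicted = [str(name) for name in model.predict(test_code)]
    accuracy = model.score(test_code, test_names)
    return predicted, accuracy


predicted, accuracy = train_and_evaluate(CORPUS, {"bin_a", "bin_b"})
print(predicted)
print(f"accuracy on held-out binaries: {accuracy:.2f}")
```

Splitting by binary rather than by individual function keeps functions from the same program out of both halves, which would otherwise make the evaluation look better than it is on genuinely unseen code.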
You should feel confident writing Python code, and know the basics of x86 assembly or be willing to learn them quickly. Knowledge of machine learning isn’t necessary, but you should be interested in learning more about it.