Advanced Machine Learning-Based PE File Analysis
The Malware Detection System is a machine learning-based application that identifies potentially malicious Portable Executable (PE) files. It analyzes the structural characteristics and header information of Windows executables (.exe, .dll, .sys) and classifies each file as benign or malicious.
The project relies on static analysis: it extracts features from PE headers without executing the files, so samples can be analyzed safely and quickly. The system combines a Flask web application with a scikit-learn model and provides a web interface for security analysts and researchers.
Flask 2.x
Lightweight Python web framework for building the REST API and serving the application
scikit-learn
Robust ML library for model training and predictions
joblib
Efficient model serialization and persistence
pefile
Python PE file parser for extracting header information and structural features
pandas
Data manipulation and feature engineering
NumPy
Numerical computing support
HTML / CSS
Modern web standards for UI
Jinja2
Template engine for dynamic content
Werkzeug
Secure filename handling and upload validation
1. User uploads a PE file through the web interface
2. File type and size validation (max 25 MB)
3. Temporary storage with a timestamped filename
4. Extract 18 PE header features using pefile
5. Pre-trained model analyzes the features
6. Show prediction, confidence, and features
7. Automatic removal of the uploaded file
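
The validation, temporary storage, and cleanup steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names `validate_upload`, `timestamped_name`, and `analyze_upload` are hypothetical, the allowed extensions are taken from the file types mentioned earlier, and the `analyze` callback stands in for feature extraction plus prediction. In a Flask application, Werkzeug's `secure_filename` would normally sanitize the name as well.

```python
import os
import tempfile
import time

# Assumptions for illustration: extensions from the overview, 25 MB cap from the workflow.
ALLOWED_EXTENSIONS = {".exe", ".dll", ".sys"}
MAX_SIZE_BYTES = 25 * 1024 * 1024


def validate_upload(filename: str, size: int) -> None:
    """Raise ValueError if the extension or the size is not acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if size > MAX_SIZE_BYTES:
        raise ValueError("file exceeds the 25 MB limit")


def timestamped_name(filename: str) -> str:
    """Drop any directory part and prefix the base name with a Unix timestamp."""
    base = os.path.basename(filename.replace("\\", "/"))
    return f"{int(time.time())}_{base}"


def analyze_upload(filename: str, data: bytes, analyze):
    """Validate, store temporarily, call analyze(path), then always delete the file."""
    validate_upload(filename, len(data))
    path = os.path.join(tempfile.gettempdir(), timestamped_name(filename))
    with open(path, "wb") as fh:
        fh.write(data)
    try:
        return analyze(path)
    finally:
        os.remove(path)  # cleanup runs even if the analysis raises
```

Putting the removal in a `finally` block means an uploaded sample does not linger on disk when feature extraction or prediction fails.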
The system analyzes 18 critical features from PE file headers:
| Feature Name | Description | Source |
|---|---|---|
| TimeDateStamp | Compilation timestamp of the file | FILE_HEADER |
| Characteristics | File characteristics flags | FILE_HEADER |
| MajorLinkerVersion | Linker major version number | OPTIONAL_HEADER |
| SizeOfInitializedData | Total size of all initialized data sections | OPTIONAL_HEADER |
| AddressOfEntryPoint | Entry point address, relative to the image base | OPTIONAL_HEADER |
| ImageBase | Preferred base address in memory | OPTIONAL_HEADER |
| MajorOperatingSystemVersion | Required OS major version | OPTIONAL_HEADER |
| MinorOperatingSystemVersion | Required OS minor version | OPTIONAL_HEADER |
| MajorImageVersion | Image major version | OPTIONAL_HEADER |
| MinorImageVersion | Image minor version | OPTIONAL_HEADER |
| MajorSubsystemVersion | Subsystem major version | OPTIONAL_HEADER |
| MinorSubsystemVersion | Subsystem minor version | OPTIONAL_HEADER |
| SizeOfHeaders | Combined size of all headers | OPTIONAL_HEADER |
| CheckSum | Image file checksum | OPTIONAL_HEADER |
| Subsystem | Required subsystem identifier | OPTIONAL_HEADER |
| DllCharacteristics | DLL characteristics flags | OPTIONAL_HEADER |
| SizeOfStackReserve | Stack reservation size | OPTIONAL_HEADER |
| ImageDirectoryEntryExport | Export directory size | DATA_DIRECTORY |
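
As a rough sketch of how the 18 features in the table could be read with pefile: the function names `features_from_pe` and `extract_features` are hypothetical, and mapping `ImageDirectoryEntryExport` to the `Size` field of data directory entry 0 (the export directory) is an assumption based on the table's description, not something confirmed by the project source.

```python
FILE_HEADER_FIELDS = ["TimeDateStamp", "Characteristics"]
OPTIONAL_HEADER_FIELDS = [
    "MajorLinkerVersion", "SizeOfInitializedData", "AddressOfEntryPoint",
    "ImageBase", "MajorOperatingSystemVersion", "MinorOperatingSystemVersion",
    "MajorImageVersion", "MinorImageVersion", "MajorSubsystemVersion",
    "MinorSubsystemVersion", "SizeOfHeaders", "CheckSum", "Subsystem",
    "DllCharacteristics", "SizeOfStackReserve",
]
EXPORT_DIRECTORY_INDEX = 0  # IMAGE_DIRECTORY_ENTRY_EXPORT in the PE format


def features_from_pe(pe) -> dict:
    """Collect the 18 header values from a parsed PE object (e.g. pefile.PE)."""
    features = {name: getattr(pe.FILE_HEADER, name) for name in FILE_HEADER_FIELDS}
    for name in OPTIONAL_HEADER_FIELDS:
        features[name] = getattr(pe.OPTIONAL_HEADER, name)
    directories = pe.OPTIONAL_HEADER.DATA_DIRECTORY
    # Assumed mapping: export directory size; 0 when no data directories exist.
    features["ImageDirectoryEntryExport"] = (
        directories[EXPORT_DIRECTORY_INDEX].Size if directories else 0
    )
    return features


def extract_features(path: str) -> dict:
    """Parse the file at `path` with pefile and return its feature dict."""
    import pefile  # third-party; imported here so features_from_pe has no dependency

    pe = pefile.PE(path, fast_load=True)  # headers only; directory contents are skipped
    try:
        return features_from_pe(pe)
    finally:
        pe.close()
```

Keeping the field names in two lists makes it easy to check that the feature order at prediction time matches the order used during training.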
The ML model is trained on a dataset of known malware and benign PE files. Its performance is evaluated with the following metrics:
Accuracy
Precision
Recall
F1-Score
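
For reference, the four metrics listed above can all be derived from confusion-matrix counts, treating "malicious" as the positive class. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical and the counts in the usage lines are made up, not results from this project.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives, and
    false negatives. Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For malware detection, recall (the share of malicious files that are caught) and precision (the share of alerts that are real) usually matter more than raw accuracy, because benign files tend to outnumber malicious ones.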
```bash
# 1. Clone the repository
git clone https://github.com/yourusername/malware-detection-system.git
cd malware-detection-system

# 2. Create virtual environment
python -m venv venv

# 3. Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# 4. Install dependencies
pip install flask pandas scikit-learn joblib pefile werkzeug

# 5. Train the model (if not already trained)
python train_model.py

# 6. Run the application
python app.py

# 7. Access the application
# Open browser and navigate to: http://127.0.0.1:5000
```
| Endpoint | Method | Description |
|---|---|---|
| / | GET, POST | Main interface for file upload and analysis |
| /about | GET | Information about the system |
| /project | GET | Detailed project documentation |
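
A minimal Flask sketch of the three routes in the table might look like the following. It is not the project's `app.py`: the form field name `file`, the placeholder response strings, and the handler names are assumptions for illustration, and the analysis step is left out.

```python
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
# Reject request bodies larger than 25 MB before they reach the handler.
app.config["MAX_CONTENT_LENGTH"] = 25 * 1024 * 1024


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        upload = request.files.get("file")  # "file" is an assumed form field name
        if upload is None or upload.filename == "":
            return "No file selected", 400
        safe_name = secure_filename(upload.filename)
        if not safe_name:
            return "Invalid filename", 400
        # Feature extraction and prediction would run here.
        return "Upload received", 200
    return "Upload form placeholder", 200


@app.route("/about")
def about():
    return "About placeholder"


@app.route("/project")
def project():
    return "Project documentation placeholder"
```

In a real application the handlers would return rendered templates rather than plain strings; the sketch only shows how the endpoints and HTTP methods map onto routes.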
This project was developed as part of a machine learning and cybersecurity initiative. Special thanks to the open-source community for providing the tools and libraries that made this project possible.
This project is released under the MIT License. Feel free to use, modify, and distribute the code for educational and research purposes. Commercial use should be done with proper attribution and consideration of security implications.
For questions, bug reports, or feature requests, please open an issue on the GitHub repository or contact the development team.