Phishing Detection
A machine-learning phishing website detector that classifies sites from URL and HTML features. Built with Python, scikit-learn, Flask, and BeautifulSoup.
Core Features
Data Processing
Load URLs and labels from a CSV file, extract 10 features (five from the URL, such as length, dot count, and HTTPS use, and five from the page HTML), and save the processed features as Parquet.
Machine Learning
A RandomForestClassifier trained on the extracted features. A screenshot hash captured with Selenium can optionally be added as an extra feature.
Prediction System
A CLI and a REST API endpoint for single-URL predictions, each returning a confidence score and an explanation.
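As a rough illustration of how a model score could become a prediction, the sketch below assembles a result with the four fields the API response uses (url, score, label, explanation). The function name build_result, the 0.5 threshold, the "phishing" label, and the reading of the score as a phishing probability are all assumptions made for this example, not details confirmed by the project.

```python
def build_result(url: str, phishing_probability: float,
                 reasons: list[str], threshold: float = 0.5) -> dict:
    """Assemble a prediction result from a model score (hypothetical helper)."""
    # Assumption: the score is the probability that the URL is phishing,
    # and anything at or above the threshold is labelled "phishing".
    label = "phishing" if phishing_probability >= threshold else "legit"
    return {
        "url": url,
        "score": round(phishing_probability, 2),
        "label": label,
        # Join the individual reasons into one human-readable sentence.
        "explanation": ", ".join(reasons) if reasons else "No notable signals",
    }
```

Calling build_result("https://example.com", 0.12, ["Contains HTTPS", "no suspicious patterns"]) yields a dictionary shaped like the sample API response.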
Evaluation Tools
Accuracy, precision, recall, F1 score, and ROC-AUC for assessing model performance.
Feature Extraction
URL Features
  • URL length analysis
  • Number of dots in domain
  • Presence of @ symbol
  • HTTPS protocol verification
  • IP address detection
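The five URL features above can be computed with the Python standard library alone. The following is a minimal sketch, not the project's actual extractor: the function name extract_url_features and the feature key names are assumptions, and dots are counted in the host part only.

```python
import ipaddress
from urllib.parse import urlsplit


def extract_url_features(url: str) -> dict:
    """Compute five URL features for one URL string (illustrative names)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # host without port or credentials

    # IP address detection: does the host parse as an IPv4/IPv6 literal?
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False

    return {
        "url_length": len(url),
        "num_dots": host.count("."),       # dots in the host only
        "has_at_symbol": "@" in url,       # '@' anywhere in the URL
        "is_https": parts.scheme == "https",
        "has_ip_address": has_ip,
    }
```

For example, extract_url_features("http://192.168.0.1/login@secure") flags both the IP-address host and the @ symbol, and reports that HTTPS is not used.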
HTML Features
  • Form element detection
  • Password field analysis
  • Iframe presence
  • External link counting
  • Script tag evaluation
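A sketch of how the five HTML features might be counted is shown below. To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup, which the project itself uses. The names extract_html_features and _FeatureParser are hypothetical, and "external link" is assumed here to mean an anchor whose host differs from the page's host.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _FeatureParser(HTMLParser):
    """Counts forms, password fields, iframes, external links, and scripts."""

    def __init__(self, page_host: str):
        super().__init__()
        self.page_host = page_host.lower()
        self.counts = {"forms": 0, "password_fields": 0, "iframes": 0,
                       "external_links": 0, "scripts": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.counts["forms"] += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.counts["password_fields"] += 1
        elif tag == "iframe":
            self.counts["iframes"] += 1
        elif tag == "script":
            self.counts["scripts"] += 1
        elif tag == "a":
            try:
                host = urlsplit(attrs.get("href") or "").hostname
            except ValueError:  # malformed href, ignore it
                host = None
            # Relative links have no host and count as internal.
            if host and host != self.page_host:
                self.counts["external_links"] += 1


def extract_html_features(html: str, page_url: str) -> dict:
    """Return tag counts for one HTML document fetched from page_url."""
    parser = _FeatureParser(urlsplit(page_url).hostname or "")
    parser.feed(html)
    parser.close()
    return parser.counts
```

Counting tags rather than recording mere presence keeps more information for the classifier; a presence flag can always be derived later with count > 0.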
API Integration
The Flask-based REST API returns a JSON response containing the URL, a confidence score, a classification label, and an explanation for each prediction.
    curl -X POST http://localhost:5000/scan \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com"}'

Response:

    {
      "url": "https://example.com",
      "score": 0.12,
      "label": "legit",
      "explanation": "Contains HTTPS, no suspicious patterns"
    }
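The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is running locally on port 5000 with the /scan endpoint shown above; the helper names build_scan_request and scan are made up for this example.

```python
import json
import urllib.request


def build_scan_request(url_to_check: str,
                       api_base: str = "http://localhost:5000") -> urllib.request.Request:
    """Build the POST request for the /scan endpoint without sending it."""
    payload = json.dumps({"url": url_to_check}).encode("utf-8")
    return urllib.request.Request(
        f"{api_base}/scan",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan(url_to_check: str, api_base: str = "http://localhost:5000",
         timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_scan_request(url_to_check, api_base)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, scan("https://example.com") should return a dictionary with the url, score, label, and explanation fields.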
Quick Start Guide
01
Set Up Environment
Clone the repository, install the dependencies from requirements.txt, and configure environment variables using the .env.example template.
02
Prepare Data
Create a CSV file with URL and label columns, or use synthetic data, and save it as data/sample_urls.csv for training.
03
Train Model
Run the training script to produce the baseline.joblib model and the processed.parquet feature file. The model reaches approximately 95% accuracy on the synthetic data.
04
Deploy & Test
Launch the API server, run the pytest suite, and test predictions via the CLI or the REST endpoint.
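For the data preparation step, a synthetic CSV can be generated with a few lines of standard-library Python. This is a sketch under assumptions: the column names url and label, the 0 = legitimate / 1 = phishing encoding, and the URL patterns are all illustrative, and the project's own synthetic data may look different.

```python
import csv
import random
from pathlib import Path


def write_synthetic_csv(path, n_rows: int = 100, seed: int = 0) -> Path:
    """Write a toy url,label CSV; labels are 0 (legit) or 1 (phishing)."""
    rng = random.Random(seed)  # seeded so the file is reproducible
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "label"])
        for i in range(n_rows):
            if rng.random() < 0.5:
                # Legitimate-looking: HTTPS and a named host.
                writer.writerow([f"https://site{i}.example.com/home", 0])
            else:
                # Phishing-looking: plain HTTP, raw IP host, '@' in the path.
                writer.writerow([f"http://192.0.2.{i % 256}/login@verify-account", 1])
    return path
```

Calling write_synthetic_csv("data/sample_urls.csv") creates the file at the path the training step expects. Data this simple is trivially separable, so accuracy measured on it says little about performance on real phishing URLs.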
Model Performance
~95%
Accuracy
Achieved on synthetic training data
10
Features
URL and HTML characteristics analysed
5
Metrics
Accuracy, precision, recall, F1, ROC-AUC tracked
Evaluation also produces classification reports and confusion matrices for a more detailed view of model errors.
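To make the relationship between the confusion matrix and the headline metrics concrete, the sketch below computes accuracy, precision, recall, and F1 by hand for binary labels. It is written in plain Python for illustration; scikit-learn provides equivalents such as confusion_matrix and classification_report, which a project built on it would more likely use. The function name binary_metrics and the 1 = phishing encoding are assumptions.

```python
def binary_metrics(y_true, y_pred) -> dict:
    """Compute confusion counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legit passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legit flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a phishing detector, recall (the share of phishing pages caught) and precision (the share of flagged pages that really are phishing) are usually more informative than accuracy alone, especially when the classes are imbalanced.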
Deployment Options
Docker Support
Containerised deployment with ChromeDriver included. Build and run with single commands, mounting data and models for persistence.
Docker Compose
Multi-container setup with volumes mounted for data and models, intended for local development and testing.
CI/CD Pipeline
GitHub Actions workflow for automated linting, testing, training, and deployment verification on every push and pull request.
Technology Stack
Core Libraries
  • Python 3.10+
  • scikit-learn
  • Flask API
  • BeautifulSoup4
Optional Tools
  • Selenium WebDriver
  • ChromeDriver
  • Pytest testing
  • Black formatter
Languages
  • Python 84.6%
  • Jupyter 12.6%
  • Dockerfile 2.8%
Future Enhancements
Real Datasets
Integrate PhishTank and OpenPhish data sources for production-grade training and validation.
Advanced Features
Add JavaScript analysis, WHOIS lookup, SSL certificate validation, and domain age checking.
Model Upgrades
Implement XGBoost, LSTM networks, and BERT transformers for improved detection accuracy.
Contributors & Maintainers
Mantra Patil
Project author and lead developer. GitHub: mantrapatil03
Himali Patil
Core contributor. GitHub: himalipatil26
Maintained by Shree Organisation and built with ❤️ by CodeM03 Company. If you find this project useful, please star the repository and share it with others. Stay safe online! 🕵️‍♂️