## Project Overview
This project tackles the fascinating problem of predicting online news popularity using machine learning techniques. The challenge involves analyzing a heterogeneous set of features from articles published by Mashable over two years to predict the number of social media shares.
### Key Highlights
- Dataset: 39,797 articles with 61 features
- Target: Social media shares (discretized into 5 classes)
- Challenge: Handling severely imbalanced classes
- Best Model: Support Vector Machine with strategic class balancing
## Dataset Description
The dataset contains articles from Mashable with diverse features including:
- Content Features: Word counts, token rates, sentiment analysis
- Metadata: Publication day, data channels (lifestyle, entertainment, business, etc.)
- Keyword Analytics: Min/max/average shares for keywords
- Multimedia: Number of images, videos, and links
- NLP Features: LDA topic modeling, sentiment polarity
### Target Variable Distribution
The original dataset showed extreme class imbalance, with the majority of articles having low share counts (class 0), making this a challenging classification problem.
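The discretization of raw share counts into 5 classes isn't detailed here; as an illustrative sketch, a quantile-based binning with scikit-learn's `KBinsDiscretizer` (which appears in the project's imports) could look like the following. The share counts and binning strategy below are assumptions, not the project's actual values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical share counts; the real values come from the Mashable data.
shares = np.array([120, 950, 1400, 2800, 5100, 843000]).reshape(-1, 1)

# Quantile-based binning into 5 ordinal classes; the project's actual
# bin edges and strategy may differ.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
classes = disc.fit_transform(shares).ravel().astype(int)
```

With quantile binning, the extreme right tail (viral articles) is compressed into the top class, which is what produces the imbalance described next.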
## Methodology

### 1. Data Preprocessing & Cleaning

**Noise Detection and Removal:**
- Identified and removed 135 rows with formatting issues
- Eliminated samples with values outside expected column domains
- Handled missing values and "n.a." entries
- Applied Z-score outlier detection (threshold: 5) removing 1,875 additional samples
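The Z-score filter described above can be sketched as follows (synthetic single-column data; the real pipeline applies the same rule across all numeric feature columns):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic stand-in for one numeric feature column.
rng = np.random.default_rng(0)
tokens = rng.normal(500, 20, size=100)
tokens[0] = 50_000  # inject one extreme outlier
df = pd.DataFrame({"n_tokens_content": tokens})

# Drop any row whose |z-score| exceeds 5 in any column (threshold from above).
z = np.abs(zscore(df.to_numpy(), axis=0))
clean = df[(z < 5).all(axis=1)]
```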
**Feature Engineering:**
- Used Random Forest for feature importance analysis
- Retained only features with importance > 0.02
- Reduced from 61 to 18 most informative features
- Applied MinMax scaling for model compatibility
### 2. Class Imbalance Strategy

The most critical challenge was the severe class imbalance. Instead of uniform balancing, I implemented a strategic approach:

**Class Distribution Strategy:**
- Class 0 (low shares): 1/3 of samples
- Class 1: 1/4 of samples
- Class 2: 1/5 of samples
- Class 3: 1/6 of samples
- Class 4 (viral): 1/7 of samples
**Rationale:** This distribution acknowledges that viral content is naturally rare while ensuring all classes have sufficient representation for learning.

**Implementation:**
- RandomUnderSampler: Reduced overrepresented classes
- SMOTE: Generated synthetic samples for underrepresented classes
- Maintained realistic class proportions reflecting real-world news sharing patterns
### 3. Model Selection & Evaluation

**Models Tested:**
- Decision Trees
- Support Vector Machines
- AdaBoost Ensemble
- Random Forest
- Multi-Layer Perceptron Neural Networks
**Evaluation Strategy:**
- 66% training, 33% testing split
- Cross-validation with both accuracy and F1-macro scoring
- Emphasis on F1-macro to ensure sensitivity across all classes
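The evaluation setup can be sketched as follows (synthetic data and an untuned SVM stand in for the real features and models):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic multi-class data in place of the preprocessed article features.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)

# 66/33 train/test split as described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          stratify=y, random_state=42)

clf = SVC().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
f1_macro = f1_score(y_te, clf.predict(X_te), average="macro")

# Cross-validation scored with both metrics.
cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
cv_f1 = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
```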
## Results

### Before Class Balancing

| Model | Accuracy | F1-Macro | CV Accuracy | CV F1-Macro |
|---|---|---|---|---|
| DT | 0.66 | 0.21 | 0.4071 | 0.1513 |
| SVM | 0.80 | 0.18 | 0.7968 | 0.1774 |
| Boost | 0.80 | 0.18 | 0.7957 | 0.1783 |
| Random Forest | 0.80 | 0.18 | 0.6614 | 0.1570 |
| MLPN | 0.80 | 0.18 | 0.7968 | 0.1774 |
### After Class Balancing

| Model | Accuracy | F1-Macro | CV Accuracy | CV F1-Macro |
|---|---|---|---|---|
| DT | 0.49 | 0.20 | 0.4071 | 0.1513 |
| SVM | 0.65 | 0.25 | 0.7968 | 0.1774 |
| Boost | 0.59 | 0.23 | 0.7957 | 0.1783 |
| Random Forest | 0.70 | 0.26 | 0.6614 | 0.1570 |
| MLPN | 0.65 | 0.22 | 0.7968 | 0.1774 |
### Hyperparameter Tuning Results

**Best SVM Configuration:**

```python
{
    'C': 1,
    'kernel': 'rbf',
    'gamma': 'scale',
    'decision_function_shape': 'ovo',
    'random_state': 42
}
```
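The exact search space isn't listed; a plausible `GridSearchCV` sketch around the winning configuration might look like this (the grid values and stand-in data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; the real search ran on the preprocessed article features.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=5,
                           random_state=0)

# Illustrative grid centered on the parameters reported above.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["rbf", "linear"],
    "gamma": ["scale", "auto"],
    "decision_function_shape": ["ovo", "ovr"],
}
search = GridSearchCV(SVC(random_state=42), param_grid,
                      scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_)
```

Scoring the search with `f1_macro` rather than accuracy keeps the tuning consistent with the evaluation strategy above.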
## Key Insights

### 1. The Class Imbalance Challenge

The original dataset's extreme imbalance (majority class dominance) led to models that achieved high accuracy by simply predicting the majority class. This highlighted the importance of F1-macro as an evaluation metric for imbalanced problems.
### 2. Strategic Balancing vs. Uniform Balancing
Unlike uniform class balancing, the strategic approach maintained the natural hierarchy of news sharing patterns while ensuring model sensitivity to all classes. This proved crucial for real-world applicability.
### 3. Feature Importance Impact
Reducing features from 61 to 18 using Random Forest importance analysis improved model efficiency without sacrificing performance, demonstrating the value of feature selection in high-dimensional problems.
### 4. Model Performance Trade-offs
The class balancing strategy improved F1-macro scores significantly (from 0.18 to 0.25 for SVM) while maintaining reasonable accuracy, showing the successful navigation of the precision-recall trade-off.
## Technical Implementation

### Dependencies

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from scipy.stats import zscore
```
### Key Functions

- `clean_and_load_dataset()`: Comprehensive data cleaning pipeline
- `get_hist()`: Visualization utility for distribution analysis
- `fit_models()` & `evaluate_models()`: Model training and evaluation framework
- `tune_model()`: Hyperparameter optimization wrapper
## Lessons Learned

**Domain Knowledge Matters:** Understanding that news sharing follows natural power-law distributions informed the strategic balancing approach.

**Evaluation Metrics Selection:** In imbalanced problems, accuracy alone can be misleading. F1-macro provided better insights into model performance across all classes.

**Preprocessing Impact:** Careful outlier removal and feature selection significantly improved model performance and training efficiency.

**Class Balancing Strategy:** One-size-fits-all approaches (like uniform balancing) may not suit all problems. Domain-informed strategies often perform better.
## Future Improvements
- Advanced Sampling Techniques: Explore borderline-SMOTE or ADASYN for more sophisticated synthetic sample generation
- Feature Engineering: Investigate polynomial features or interaction terms
- Ensemble Methods: Combine multiple models with different balancing strategies
- Deep Learning: Experiment with neural networks designed for imbalanced classification
- Temporal Analysis: Incorporate time-series aspects of news popularity
## References

- K. Fernandes, P. Vinagre, and P. Cortez. "A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News." Proceedings of the 17th EPIA 2015, Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal.
This project demonstrates the importance of careful data preprocessing, strategic handling of class imbalance, and thoughtful evaluation in machine learning challenges. The combination of domain knowledge and technical expertise proved crucial for developing an effective news popularity prediction model.