How to win Kaggle competitions
Tabular competition
EDA - Exploratory Data Analysis
- Look at outliers (check simple exploration notebook @ kaggle - Elo)
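A minimal EDA sketch of this step, assuming the Elo competition's train.csv layout with a "target" column (file and column names are assumptions, not from the talk). The histogram reveals a detached spike near -33.219 holding roughly 1% of the rows:

```python
# Hedged sketch: assumes the Elo "train.csv" layout with a "target" column.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# The histogram shows two distributions: a roughly normal bulk and an
# isolated spike near -33.219 (about 1% of rows) -- the outliers.
train["target"].hist(bins=200)
plt.xlabel("target")
plt.ylabel("count")
plt.show()

# Flag the outlier rows; any cutoff well below the normal bulk works.
train["is_outlier"] = (train["target"] < -30).astype(int)
print("outlier fraction:", train["is_outlier"].mean())
```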
Training Method
- Separate models (using a separate model for each distribution gives better results): one model for the normally distributed data and one model for the outliers
a single LightGBM model can't fit the two distributions well
- Binary prediction for the outliers (AUC as the metric) - regression for the normal data
- Training method - weighted average of models, or combine the full model with a model trained without outliers; compare local CV across the different models
Final model = P(outlier) × (-33.219) + (1 - P(outlier)) × L(no-outlier model)
- For card_ids with low outlier likelihood, use the no-outlier model's result; for card_ids with high outlier likelihood, predict the outlier value (-33.219)
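A short sketch of both combination rules (the input arrays and the 0.9 cutoff are illustrative assumptions, not values from the talk):

```python
import numpy as np

OUTLIER_VALUE = -33.219

# Stand-in inputs; in practice these come from the two trained models:
#   p_outlier       - P(outlier) per card_id from the binary classifier
#   pred_no_outlier - predictions from the regressor trained without outliers
p_outlier = np.array([0.02, 0.10, 0.95, 0.50])
pred_no_outlier = np.array([0.3, -1.2, 0.8, -0.5])

# Soft blend: Final = P(outlier) * (-33.219) + (1 - P(outlier)) * L(no outlier)
final_soft = p_outlier * OUTLIER_VALUE + (1.0 - p_outlier) * pred_no_outlier

# Hard replacement: keep the no-outlier prediction where P(outlier) is low,
# pin confident outliers to -33.219 (the 0.9 cutoff is hypothetical; tune it
# against local CV).
final_hard = np.where(p_outlier > 0.9, OUTLIER_VALUE, pred_no_outlier)
```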
Feature Engineering
- Look at the influence/behavior of the outlier variables - what makes a record an outlier? Is the number of outliers in the training data too small? Add the probability from the outlier binary prediction model
- Feature engineering for the binary prediction model: aggregate features, statistics over different items, most-frequent-item features, SVD features (see the sketch below)
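A hedged sketch of the aggregate, most-frequent-item, and SVD feature ideas. The transactions file and the column names "card_id", "merchant_id", "purchase_amount" follow the Elo schema and are assumptions here:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

trans = pd.read_csv("historical_transactions.csv")  # assumed file name

# Aggregate features: per-card statistics of the transaction amounts.
agg = trans.groupby("card_id")["purchase_amount"].agg(
    ["mean", "std", "min", "max", "count"]
).add_prefix("amount_")

# Most-frequent-item feature: each card's modal merchant.
agg["top_merchant"] = trans.groupby("card_id")["merchant_id"].agg(
    lambda s: s.mode().iloc[0]
)

# SVD features: treat each card's merchant history as a "document" and
# compress the card x merchant count matrix into a few dense columns.
docs = trans.groupby("card_id")["merchant_id"].apply(
    lambda s: " ".join(s.astype(str))
)
counts = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(counts)
svd_df = pd.DataFrame(svd, index=docs.index).add_prefix("merchant_svd_")

features = agg.join(svd_df)
```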
Feature Selection
step1
- run the model with all features
step2
- shuffle the target randomly and run the model again
step3
- get the LightGBM importance tables from steps 1 & 2; calculate the feature ratio = lgbm_gain(real target) / lgbm_gain(random target); if a feature's ratio is below the threshold (e.g. 3), drop it
step4
- run a model containing the remaining features
step5
- repeat steps 2-4
step6
- repeat steps 2-4 again
step7
- try different ratio thresholds for dropping features to get the best local CV
A 0.0015 improvement in the outlier binary prediction is already good. (check Feature Selection with Null Importances @ kaggle - Olivier)
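A minimal sketch of steps 1-3 above with LightGBM. Toy data stands in for the real features; the model parameters, the three averaged null runs, and the threshold of 3 are assumptions to tune (step 7):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Toy data so the sketch runs end to end; replace with the real features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

def gain_importance(X, y):
    """Train LightGBM and return per-feature gain importance."""
    model = lgb.LGBMClassifier(n_estimators=100)  # placeholder parameters
    model.fit(X, y)
    return pd.Series(
        model.booster_.feature_importance(importance_type="gain"),
        index=X.columns,
    )

# step1: importance with the real target.
real_gain = gain_importance(X, y)

# step2: importance with a randomly shuffled target (average a few runs
# to stabilize the null importances).
rng = np.random.default_rng(0)
null_gain = sum(gain_importance(X, rng.permutation(y)) for _ in range(3)) / 3

# step3: drop features whose real gain is not clearly above the null gain;
# the threshold 3 is the talk's starting point, to be tuned on local CV.
ratio = real_gain / (null_gain + 1e-6)
keep = ratio[ratio >= 3].index.tolist()
print("kept features:", keep)
```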
Read every notebook/code and discussion carefully to get insight.
Read the top solutions and top kernels of past competitions.
You have to try an idea!
From: Kaggle Days China, October 19-20th, 2019 - Jian Qiao & Zhongshan Huang