How to win Kaggle competitions
Tabular competition
EDA - Exploratory Data Analysis
- Look at outliers (check simple exploration notebook @ kaggle - Elo)
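A minimal EDA sketch of this step, assuming the Elo competition's train.csv layout with a "target" column (file and column names are assumptions, not from the talk). The histogram reveals a detached spike near -33.219 holding roughly 1% of the rows:

```python
# Hedged sketch: assumes the Elo "train.csv" layout with a "target" column.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# The histogram shows two distributions: a roughly normal bulk and an
# isolated spike near -33.219 (about 1% of rows) -- the outliers.
train["target"].hist(bins=200)
plt.xlabel("target")
plt.ylabel("count")
plt.show()

# Flag the outlier rows; any cutoff well below the normal bulk works.
train["is_outlier"] = (train["target"] < -30).astype(int)
print("outlier fraction:", train["is_outlier"].mean())
```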
Training Method
- Separate models (using a separate model for each distribution gives better results): one model for the normally distributed data and one model for the outliers
a single LightGBM model can't fit the two distributions well
- Binary prediction for the outliers (AUC as the metric) - regression for the normal data
- Training method - weighted average of models, or combine the full model with a model trained without outliers; compare local CV across the different models
Final model = P(outlier) × (-33.219) + (1 - P(outlier)) × L(no-outlier model)
- For card_ids with low outlier likelihood, use the no-outlier model's result; for card_ids with high outlier likelihood, predict the outlier value (-33.219)
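A short sketch of both combination rules (the input arrays and the 0.9 cutoff are illustrative assumptions, not values from the talk):

```python
import numpy as np

OUTLIER_VALUE = -33.219

# Stand-in inputs; in practice these come from the two trained models:
#   p_outlier       - P(outlier) per card_id from the binary classifier
#   pred_no_outlier - predictions from the regressor trained without outliers
p_outlier = np.array([0.02, 0.10, 0.95, 0.50])
pred_no_outlier = np.array([0.3, -1.2, 0.8, -0.5])

# Soft blend: Final = P(outlier) * (-33.219) + (1 - P(outlier)) * L(no outlier)
final_soft = p_outlier * OUTLIER_VALUE + (1.0 - p_outlier) * pred_no_outlier

# Hard replacement: keep the no-outlier prediction where P(outlier) is low,
# pin confident outliers to -33.219 (the 0.9 cutoff is hypothetical; tune it
# against local CV).
final_hard = np.where(p_outlier > 0.9, OUTLIER_VALUE, pred_no_outlier)
```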
Feature Engineering
- Look at the influence/behavior of the outlier variables - what makes a record an outlier? Is the number of outliers in the training data too small? Add the probability from the outlier binary prediction model
- Feature engineering for the binary prediction model: aggregate features, statistics over different items, most-frequent-item features, SVD features (see the sketch below)
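A hedged sketch of the aggregate, most-frequent-item, and SVD feature ideas. The transactions file and the column names "card_id", "merchant_id", "purchase_amount" follow the Elo schema and are assumptions here:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

trans = pd.read_csv("historical_transactions.csv")  # assumed file name

# Aggregate features: per-card statistics of the transaction amounts.
agg = trans.groupby("card_id")["purchase_amount"].agg(
    ["mean", "std", "min", "max", "count"]
).add_prefix("amount_")

# Most-frequent-item feature: each card's modal merchant.
agg["top_merchant"] = trans.groupby("card_id")["merchant_id"].agg(
    lambda s: s.mode().iloc[0]
)

# SVD features: treat each card's merchant history as a "document" and
# compress the card x merchant count matrix into a few dense columns.
docs = trans.groupby("card_id")["merchant_id"].apply(
    lambda s: " ".join(s.astype(str))
)
counts = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(counts)
svd_df = pd.DataFrame(svd, index=docs.index).add_prefix("merchant_svd_")

features = agg.join(svd_df)
```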
Feature Selection
step1
- run the model with all features
step2
- shuffle the target randomly and run the model again
step3
- get the LightGBM importance tables from steps 1 & 2; calculate the feature ratio = lgbm_gain(real target) / lgbm_gain(random target); if a feature's ratio is below the threshold (e.g. 3), drop it
step4
- run a model containing the remaining features
step5
- repeat steps 2-4
step6
- repeat steps 2-4 again
step7
- try different ratio thresholds for dropping features to get the best local CV
A 0.0015 improvement in the outlier binary prediction is already good. (check Feature Selection with Null Importances @ kaggle - Olivier)
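A minimal sketch of steps 1-3 above with LightGBM. Toy data stands in for the real features; the model parameters, the three averaged null runs, and the threshold of 3 are assumptions to tune (step 7):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Toy data so the sketch runs end to end; replace with the real features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

def gain_importance(X, y):
    """Train LightGBM and return per-feature gain importance."""
    model = lgb.LGBMClassifier(n_estimators=100)  # placeholder parameters
    model.fit(X, y)
    return pd.Series(
        model.booster_.feature_importance(importance_type="gain"),
        index=X.columns,
    )

# step1: importance with the real target.
real_gain = gain_importance(X, y)

# step2: importance with a randomly shuffled target (average a few runs
# to stabilize the null importances).
rng = np.random.default_rng(0)
null_gain = sum(gain_importance(X, rng.permutation(y)) for _ in range(3)) / 3

# step3: drop features whose real gain is not clearly above the null gain;
# the threshold 3 is the talk's starting point, to be tuned on local CV.
ratio = real_gain / (null_gain + 1e-6)
keep = ratio[ratio >= 3].index.tolist()
print("kept features:", keep)
```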
Read every notebook/code and discussion carefully to get insight.
Read the top solutions and top kernels of past competitions.
You have to try an idea!
From: Kaggle Days China, October 19-20th, 2019 - Jian Qiao & Zhongshan Huang