How to win Kaggle competitions


Tabular competition

EDA - Exploratory Data Analysis

  • Look at outliers (see the simple exploration notebook on Kaggle for the Elo competition); a minimal sketch follows below.
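A minimal sketch of that first outlier check, assuming the Elo competition's train.csv with its card_id and target columns; the cut at -30 is just a convenient way to count the cluster near -33.219, not a value from the talk:

```python
# Minimal EDA sketch for spotting the Elo outlier cluster in the target.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# The target is roughly normal except for a cluster pinned near -33.219.
print(train["target"].describe())
n_outliers = (train["target"] < -30).sum()
print(f"outliers (< -30): {n_outliers} of {len(train)}")

# Histogram makes the second mode at -33.219 obvious.
train["target"].hist(bins=100)
plt.xlabel("target")
plt.ylabel("count")
plt.show()
```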

Training Method

  • Separate models: using a separate model for each distribution gives a better result, i.e. one model for the normally distributed data and one model for the outliers.

A single LightGBM model cannot predict data drawn from two different distributions well.

  • Binary prediction for the outliers (evaluated with AUC); the normal data uses regression.
  • Training method: weight-average the models, or combine the binary model with a model trained without outliers, comparing local CV across the different models (see the sketch after this list):
    Final prediction = P(outlier) × (-33.219) + (1 - P(outlier)) × L(no outlier),
    where L(no outlier) is the prediction of the regression model trained without the outliers.
  • Replace the predictions for card_ids with a low outlier likelihood with the no-outlier model's result, and set the card_ids with a high outlier likelihood to the outlier value (-33.219).
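A sketch of this two-model combination using LightGBM on toy data; the data, model settings, and the replacement thresholds here are illustrative assumptions, not details from the talk:

```python
# Two-model blend: binary outlier classifier + regression trained
# without outliers, combined per the formula above.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
OUTLIER_VALUE = -33.219

# Toy data: a normal target plus ~1% outliers pinned at -33.219.
X = rng.normal(size=(5000, 10))
is_outlier = rng.random(5000) < 0.01
y = rng.normal(size=5000)
y[is_outlier] = OUTLIER_VALUE

# Model 1: binary classifier for P(outlier), evaluated with AUC.
clf = lgb.LGBMClassifier(n_estimators=100).fit(X, is_outlier)

# Model 2: regression trained only on the non-outlier rows.
reg = lgb.LGBMRegressor(n_estimators=100).fit(X[~is_outlier], y[~is_outlier])

p_outlier = clf.predict_proba(X)[:, 1]
pred_clean = reg.predict(X)

# Weighted blend: P(outlier) * (-33.219) + (1 - P(outlier)) * pred_clean
final = p_outlier * OUTLIER_VALUE + (1 - p_outlier) * pred_clean

# Hard replacement variant: trust the classifier at the extremes
# (the 0.01 / 0.99 thresholds are hypothetical; tune against local CV).
final = np.where(p_outlier < 0.01, pred_clean, final)
final = np.where(p_outlier > 0.99, OUTLIER_VALUE, final)
```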

Feature Engineering

  • Look at the influence/behavior of the outlier variables: what makes a data point an outlier? If the outlier training data is too small, add the outlier probability from the binary prediction model as a feature. Feature engineering ideas: aggregate features, statistics across different items, most-frequent features, SVD features (see the sketch below).
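A hedged sketch of those aggregate, most-frequent, and SVD feature ideas using pandas and scikit-learn; the transactions table and column names mimic the Elo data, but the exact aggregations are assumptions:

```python
# Aggregate, most-frequent, and SVD features from a transactions table.
import pandas as pd
from sklearn.decomposition import TruncatedSVD

transactions = pd.DataFrame({
    "card_id": ["a", "a", "b", "b", "b", "c"],
    "merchant_id": ["m1", "m2", "m1", "m1", "m3", "m2"],
    "purchase_amount": [1.0, 2.5, 0.3, 4.0, 1.1, 2.2],
})

# Aggregate features: per-card statistics over purchase amounts.
agg = transactions.groupby("card_id")["purchase_amount"].agg(
    ["mean", "std", "min", "max", "count"]
).add_prefix("amount_")

# Most-frequent feature: each card's top merchant.
top_merchant = transactions.groupby("card_id")["merchant_id"].agg(
    lambda s: s.value_counts().index[0]
).rename("top_merchant")

# SVD features: factorize the card x merchant count matrix.
counts = pd.crosstab(transactions["card_id"], transactions["merchant_id"])
svd = TruncatedSVD(n_components=2, random_state=0)
svd_feats = pd.DataFrame(
    svd.fit_transform(counts), index=counts.index,
    columns=["svd_0", "svd_1"],
)

features = agg.join(top_merchant).join(svd_feats)
print(features)
```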

Feature Selection

Step 1: run the model with all features.
Step 2: shuffle the target randomly and run the model again.
Step 3: from the LightGBM importance tables of steps 1 and 2, calculate feature ratio = lgbm_gain(real target) / lgbm_gain(random target); if a feature's ratio falls below the threshold (e.g. 3), drop it.
Step 4: run a model containing the remaining features.
Step 5: repeat steps 2 to 4.
Step 6: repeat step 2 to re-check the surviving features.
Step 7: try different ratio thresholds for dropping features to get the best local CV.
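A compact sketch of steps 1 to 3 of that loop, assuming a LightGBM regression setup; the synthetic data and the threshold of 3 are placeholders to be tuned against local CV:

```python
# Null-importance feature selection: compare gain importance on the
# real target vs. a randomly shuffled target.
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 20)),
                 columns=[f"f{i}" for i in range(20)])
y = X["f0"] * 2 + rng.normal(size=2000)   # only f0 is truly informative

def gain_importance(X, y):
    """Fit LightGBM and return per-feature gain importance."""
    model = lgb.LGBMRegressor(n_estimators=100, importance_type="gain")
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns)

real_gain = gain_importance(X, y)                                 # step 1
null_gain = gain_importance(X, y.sample(frac=1.0,
                                        random_state=0).values)   # step 2

# Step 3: keep features whose real-target gain clearly beats the null gain.
ratio = real_gain / (null_gain + 1e-10)
keep = ratio[ratio >= 3].index.tolist()   # threshold of 3 from the talk
print("kept:", keep)
```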

A 0.0015 AUC improvement on the outlier binary prediction is already good. (Check the "Feature Selection with Null Importances" notebook by Olivier on Kaggle.)

Read every notebook, every piece of code, and every discussion carefully to gain insight.

Read the top solutions and top kernels of past competitions.

You have to try out your ideas!

From: Kaggle Days China, October 19-20th, 2019 - Jian Qiao & Zhongshan Huang