Post-game review of winning a bronze medal in my first Kaggle competition (1): extracting the fake data in the test set

TL;DR

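I won a bronze medal in my first Kaggle competition.
After the competition ended, I went back over a key point that I had neither noticed nor managed to implement while it was running: extracting the fake data in the test set.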

Introduction

More than three weeks have already gone by, but I won a bronze medal in my first Kaggle competition*1!

The competition I entered was this one:
Santander Customer Transaction Prediction | Kaggle

[Image f:id:niaoz:20190504130942p:plain] Just 0.00003 short of a silver medal…

So, let me look back at the elements that turned out to be key in this competition.
First up: extracting the fake data in the test set.

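The conclusion first: this trick alone (extracting the fake data and rebuilding the dataset afterwards) appears to have been enough for roughly 60th place out of about 8,300 (within the top 1%).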

Reference kernels

The kernel that made this trick widely known is this one:
List of Fake Samples and Public/Private LB split | Kaggle
About half of the test data is fake data generated from the original data, and telling the two apart was crucial.

That kernel only covers the extraction itself, so to see how much the score actually improves, refer to the following kernel:
simple magic VAR 0.922 | Kaggle

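The two kernels above are my main sources, but I also borrow a little from the kernel below, which I used as my baseline around the middle of the competition.
Data augmentation, the main theme of that kernel, is out of scope for this post.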
LGB 2 leaves + augment | Kaggle

Jupyter notebook used for verification

The Jupyter notebook I used for verification is here:

santander_late2

Walkthrough

Brief notes on each of the key parts.

Extracting the real test data (cell 4)

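Extract the indices of the real test data.
Fake data is copied from other data, so unique_count marks, for each column, the values that occur only once in that column. A row with at least one such unique value is classified as real; a row whose values all appear elsewhere is classified as fake.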

# GET INDICES OF REAL TEST DATA
#######################
# TAKEN FROM YAG320'S KERNEL
# https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
import numpy as np
import pandas as pd

df_test = pd.read_csv('input/test_santander.csv')
df_test.drop(['ID_code'], axis=1, inplace=True)
df_test = df_test.values

# unique_count[i, j] becomes 1 when the value at row i occurs only once in column j
unique_count = np.zeros_like(df_test)
for feature in range(df_test.shape[1]):
    _, index_, count_ = np.unique(df_test[:, feature], return_counts=True, return_index=True)
    unique_count[index_[count_ == 1], feature] += 1

# Samples with at least one unique value are real; the others are fake
real_samples_indexes = np.argwhere(np.sum(unique_count, axis=1) > 0)[:, 0]
synthetic_samples_indexes = np.argwhere(np.sum(unique_count, axis=1) == 0)[:, 0]

print('Found', len(real_samples_indexes), 'real test')
print('Found', len(synthetic_samples_indexes), 'fake test')
#######################
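
To see why the uniqueness check separates the two groups, here is a tiny toy run of the same idea. It is only a sketch: the array `toy` and the helper name `split_real_fake` are made up for illustration and are not from the kernel. Rows 0 and 1 each hold a value that appears nowhere else in its column, while row 2 consists entirely of values that also occur in other rows.

```python
import numpy as np

def split_real_fake(values):
    """Return (real_idx, fake_idx) for a 2-D array of test features.

    A row counts as real if at least one of its values occurs exactly
    once in its column; otherwise it is treated as fake.
    """
    unique_count = np.zeros(values.shape, dtype=int)
    for feature in range(values.shape[1]):
        _, index_, count_ = np.unique(
            values[:, feature], return_index=True, return_counts=True
        )
        # index_ is the first row of each distinct value; keep the
        # values that occur only once in this column.
        unique_count[index_[count_ == 1], feature] = 1
    has_unique = unique_count.sum(axis=1) > 0
    return np.flatnonzero(has_unique), np.flatnonzero(~has_unique)

# Hypothetical toy data: 3 rows, 2 features.
toy = np.array([
    [1.0, 5.0],  # 5.0 is unique in column 1 -> real
    [2.0, 6.0],  # 2.0 is unique in column 0 -> real
    [1.0, 6.0],  # both values also occur in other rows -> fake
])

real_idx, fake_idx = split_real_fake(toy)
print(real_idx, fake_idx)  # [0 1] [2]
```

The `real_test` frame used in the next section is not defined in the excerpts shown here. Presumably it is the test DataFrame restricted to `real_samples_indexes`, but check the notebook for the actual definition.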

Feature generation and rebuilding the data (cells 10-17)

Define features as the columns of the train data, excluding ID_code and target.

features = [col for col in df_train.columns if col not in ['ID_code', 'target']]

Concatenate the train data and the real test data, then add the new features to build a new dataset (= df).

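#concatenate train data and real test data with new features
import pandas as pd
from tqdm.notebook import tqdm

df = pd.concat([df_train, real_test], axis=0)

# Variance within each group of identical values:
# 0 if the value repeats, NaN if it occurs only once
for feat in tqdm(features):
    df[feat+'_var'] = df.groupby([feat])[feat].transform('var')

# Repeated values stay as they are, unique values become NaN
for feat in tqdm(features):
    df[feat+'plus_'] = df[feat] + df[feat+'_var']

# The intermediate '_var' columns are no longer needed
drop_features = [c for c in df.columns if '_var' in c]
df.drop(drop_features, axis=1, inplace=True)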
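
What this feature actually encodes is easy to miss. Within a group of identical values, the sample variance is 0 when the value occurs at least twice and NaN when it occurs only once (pandas uses ddof=1 by default). Adding it back to the original column therefore leaves repeated values unchanged and turns unique values into NaN. The toy frame below is made up for illustration and is only a sketch of that behaviour:

```python
import pandas as pd

# Hypothetical toy column: 1.5 and 3.0 repeat, 2.0 occurs once.
toy = pd.DataFrame({"var_0": [1.5, 1.5, 2.0, 3.0, 3.0, 3.0]})

# Variance inside each group of identical values:
# 0.0 when the value repeats, NaN when it appears only once.
toy["var_0_var"] = toy.groupby("var_0")["var_0"].transform("var")

# Adding it back keeps repeated values and blanks out unique ones.
toy["var_0plus_"] = toy["var_0"] + toy["var_0_var"]

print(toy)
```

LightGBM handles missing values natively, so the model can split on whether `var_0plus_` is NaN, that is, on whether the value is unique in the combined train and real test data.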

Split df back into train and test by row position (train/test = 200000/100000 rows).

#divide into new train and test sets
df_train = df.iloc[:df_train.shape[0]]
df_test2 = df.iloc[df_train.shape[0]:]

Redefining features

#redefine features from the new train data and take its target
features = [col for col in df_train.columns if col not in ['ID_code', 'target']]
target = df_train['target']

With the feature generation and data processing above, you can see that 200 new features have been added to each dataset (cells 15-17).
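
The concatenate, add, and split steps can be checked end to end on a toy example. This is a minimal sketch under assumptions: the helper name `add_plus_features` and the toy frames are hypothetical, with two features standing in for the 200 real ones. Note that 1.0 in var_0 is unique within the toy train set alone but repeats once the real test rows are included, which is why the counts are taken over the combined data.

```python
import pandas as pd

def add_plus_features(train, real_test, features):
    """Concatenate, add one '<feat>plus_' column per feature, split back."""
    df = pd.concat([train, real_test], axis=0)
    for feat in features:
        # NaN where the value is unique across train + real test, else 0.0
        var = df.groupby(feat)[feat].transform("var")
        df[feat + "plus_"] = df[feat] + var
    n_train = train.shape[0]
    # Train rows come first in the concatenation, so split by position.
    return df.iloc[:n_train], df.iloc[n_train:]

# Hypothetical toy data.
train = pd.DataFrame({
    "var_0": [1.0, 2.0, 2.0],
    "var_1": [7.0, 8.0, 9.0],
    "target": [0, 1, 0],
})
real_test = pd.DataFrame({"var_0": [1.0, 4.0], "var_1": [7.0, 5.0]})
features = ["var_0", "var_1"]

new_train, new_test = add_plus_features(train, real_test, features)
new_features = [c for c in new_train.columns if c not in ["ID_code", "target"]]
print(new_train.shape, new_test.shape, len(new_features))  # (3, 5) (2, 5) 4
```

The feature count doubles from 2 to 4 here, in the same way that the 200 original features grow to 400 in the notebook.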

Training (cells 19-22)

Feed the newly generated train and test data into a 5-fold LightGBM.
Parameters and other settings are almost unchanged from the simple-magic-var-0-922 kernel.
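
For readers unfamiliar with the 5-fold setup, the loop has roughly the following shape. This is a model-agnostic sketch, not the notebook's code: `run_cv`, `fit_predict`, and `mean_model` are hypothetical names, and the stand-in model simply predicts the training positive rate. In the notebook the learner is LightGBM, with the parameters taken from the kernel mentioned above. Stratified folds are an assumption here, chosen because they keep the class ratio the same in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_cv(X, y, X_test, fit_predict, n_splits=5, seed=0):
    """Return out-of-fold predictions and fold-averaged test predictions.

    fit_predict(X_tr, y_tr, X_eval_list) is a hypothetical callable that
    fits one model and returns one prediction array per array in
    X_eval_list.
    """
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in folds.split(X, y):
        val_pred, tst_pred = fit_predict(
            X[trn_idx], y[trn_idx], [X[val_idx], X_test]
        )
        oof[val_idx] = val_pred              # each row is predicted once
        test_pred += tst_pred / n_splits     # average over the folds
    return oof, test_pred

def mean_model(X_tr, y_tr, X_eval_list):
    # Stand-in for a real learner: predicts the training positive rate.
    return [np.full(len(X_ev), y_tr.mean()) for X_ev in X_eval_list]

# Hypothetical toy data with a 10% positive rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)
X_test = rng.normal(size=(20, 3))

oof, test_pred = run_cv(X, y, X_test, mean_model)
print(oof.shape, test_pred.shape)
```

The out-of-fold array gives a validation score over the full train set, and the averaged test predictions are what would go into a submission.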

Results

Even without augmentation or any other tricks, this data processing alone is enough to reach the top 1%!

[Image f:id:niaoz:20190504145554p:plain] Roughly 60th place on the Private LB

Closing

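There is still a mountain of things for me to learn in machine learning, so I want to keep at it.
Data augmentation was also a key in this competition, so I would like to dig a little deeper into that as well…*2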

…but maybe it is about time I got to my infrastructure tasks.

*1: I had previously joined the PLAsTiCC competition just to look around, so strictly speaking this was my first "serious" competition.

*2: In fact, improving this augmentation was what actually drove my bronze medal this time.

Trial and Error

Scratch-paper notes on the various things I am doing in IT.