Imbalanced Data

2020-06-06

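In real-world domains, many datasets have imbalanced classes. If a model is trained on such data as is, as in the figure below, it sees far more samples of the majority class and therefore classifies the minority class poorly.

When the class ratio is heavily skewed (highly imbalanced data), a model that simply picks the dominant class already achieves high accuracy, which makes it hard to judge model performance. In other words, even when accuracy is high, the recall of the class with few samples can drop sharply. Problems caused by this difference in the number of samples per class are called the imbalanced data problem.

Figure: The data imbalance problem - 01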
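To make the accuracy problem concrete, here is a minimal sketch that scores a hypothetical "model" which always predicts the majority class. The 200:20 label counts mirror the example used later in this post; the always-majority predictor is an illustration, not one of the models trained here.

```python
# Labels mirroring the 200:20 split used later in this post (0 = majority, 1 = minority).
y_true = [0] * 200 + [1] * 20
# A hypothetical "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_pos / y_true.count(1)

print(f"accuracy        = {accuracy:.3f}")   # 200 correct out of 220
print(f"minority recall = {minority_recall:.3f}")
```

Accuracy comes out at about 0.909 even though not a single minority sample is found, which is why accuracy alone is misleading here.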

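The code and figure below compare the results of classifying imbalanced and balanced data with an SVM, where each class follows a multivariate (here bivariate) normal distribution. To make the labels clearly separable, the samples for each label are drawn from bivariate normal distributions with different means.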

import numpy as np
import scipy as sp
import scipy.stats  # makes sp.stats available
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.svm import SVC


def classification_result(n0, n1, title=""):
    rv1 = sp.stats.multivariate_normal([-1, 0], [[1, 0], [0, 1]])
    rv2 = sp.stats.multivariate_normal([+1, 0], [[1, 0], [0, 1]])
    X0 = rv1.rvs(n0, random_state=0)
    X1 = rv2.rvs(n1, random_state=0)
    X = np.vstack([X0, X1])
    y = np.hstack([np.zeros(n0), np.ones(n1)])

    x1min = -4; x1max = 4
    x2min = -2; x2max = 2
    xx1 = np.linspace(x1min, x1max, 1000)
    xx2 = np.linspace(x2min, x2max, 1000)
    X1, X2 = np.meshgrid(xx1, xx2)

    plt.contour(X1, X2, rv1.pdf(np.dstack([X1, X2])), levels=[0.05], linestyles="dashed")
    plt.contour(X1, X2, rv2.pdf(np.dstack([X1, X2])), levels=[0.05], linestyles="dashed")

    model = SVC(kernel="linear", C=1e4, random_state=0).fit(X, y)
    Y = np.reshape(model.predict(np.array([X1.ravel(), X2.ravel()]).T), X1.shape)
    plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='x', label="class 0")
    plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='o', label="class 1")
    plt.contour(X1, X2, Y, colors='k', levels=[0.5])
    y_pred = model.predict(X)
    plt.xlim(-4, 4)
    plt.ylim(-3, 3)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.title(title)

    return model, X, y, y_pred

plt.figure(figsize=(12,8))
plt.subplot(121)
model1, X1, y1, y_pred1 = classification_result(200, 200, "Balanced data (5:5)")
plt.subplot(122)
model2, X2, y2, y_pred2 = classification_result(200, 20, "Imbalanced data (10:1)")
plt.tight_layout()
plt.savefig('Imbalanced_data_svc_example')
plt.show()

Figure: SVC classification results on balanced and imbalanced data

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y1, y_pred1))
print(classification_report(y2, y_pred2))

Result

              precision    recall  f1-score   support

         0.0       0.86      0.83      0.84       200
         1.0       0.84      0.86      0.85       200

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

              precision    recall  f1-score   support

         0.0       0.96      0.98      0.97       200
         1.0       0.75      0.60      0.67        20

    accuracy                           0.95       220
   macro avg       0.86      0.79      0.82       220
weighted avg       0.94      0.95      0.94       220

from sklearn.metrics import roc_curve, confusion_matrix

fpr1, tpr1, thresholds1 = roc_curve(y1, model1.decision_function(X1))
fpr2, tpr2, thresholds2 = roc_curve(y2, model2.decision_function(X2))

c1 = confusion_matrix(y1, y_pred1, labels=[1, 0])
c2 = confusion_matrix(y2, y_pred2, labels=[1, 0])
r1 = c1[0, 0] / (c1[0, 0] + c1[0, 1])
r2 = c2[0, 0] / (c2[0, 0] + c2[0, 1])
f1 = c1[1, 0] / (c1[1, 0] + c1[1, 1])
f2 = c2[1, 0] / (c2[1, 0] + c2[1, 1])

plt.figure(figsize=(12, 8))
plt.plot(fpr1, tpr1, ':', label="balanced")
plt.plot(fpr2, tpr2, '-', label="imbalanced")
plt.plot([f1], [r1], 'ro')
plt.plot([f2], [r2], 'ro')
plt.legend()
plt.xlabel('Fall-Out')
plt.ylabel('Recall')
plt.title('ROC curve')
# plt.savefig("roc_curve_different_result_between_balanced_and_imbalanced")
plt.show()

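The red points mark the current operating point (fall-out, recall) of each model. The model trained on the imbalanced data performs much worse than the one trained on the balanced data.

Figure: Operating points of the SVC models and their ROC curves

As mentioned above, the model does not learn the minority class well during training and therefore fails to classify minority samples, which the figure below also shows. Minority-class samples also exist in the region circled in red, but because they are surrounded by many majority-class samples the model fails to classify them.

Figure: Example of the data imbalance problem - 01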

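The figure below shows the training result when oversampling is used to increase the share of minority-class samples so that the model learns more from them than before. The decision boundary is now clearly formed where it should have been. If the model has a weight parameter, giving the minority class a larger weight should lead to a similar result.

Figure: Example of the data imbalance problem - 02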
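As a rough illustration of the weighting idea, the sketch below computes per-class weights with the heuristic n_samples / (n_classes * class_count), which is the rule scikit-learn documents for class_weight="balanced". The helper name balanced_class_weights is hypothetical, and the 200:20 labels are the counts used in this post.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

y = [0] * 200 + [1] * 20
weights = balanced_class_weights(y)
print(weights)  # the minority class gets a weight ten times larger
```

A dictionary like this could be passed as the class_weight argument of SVC, or class_weight="balanced" could be used directly; either choice is an assumption about how one might apply it, not something the original experiments did.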

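Beyond the model not learning enough from the minority class, there is a second problem. A sample is predicted as class 1 when its predicted probability is high, but because most of the training data belongs to the majority class (class 0), the predicted probabilities tend to be pulled toward 0. Lowering the threshold from its default of 0.5 can therefore improve prediction performance. The threshold should be chosen using a validation set.

Figure: The data imbalance problem - 02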
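One way to pick the threshold on a validation set is sketched below: evaluate a grid of candidate thresholds and keep the one with the best F1 for the minority class. The validation labels and probabilities are made up for illustration, and the function names and the choice of F1 as the selection criterion are assumptions.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class (label 1) when predicting 1 for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_val, val_scores, candidates):
    """Return the first candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda th: f1_at_threshold(y_val, val_scores, th))

# Hypothetical validation labels and predicted probabilities of class 1.
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
val_scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.45, 0.35, 0.60]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

th = best_threshold(y_val, val_scores, candidates)
print(th, f1_at_threshold(y_val, val_scores, th), f1_at_threshold(y_val, val_scores, 0.5))
```

On this toy data a threshold below 0.5 gives a higher F1 than the default, because one of the two positives only receives a probability of 0.35.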

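Choosing a performance metric is also an issue. If only accuracy is considered, a model scores well merely by predicting the majority class correctly. We therefore need metrics that allow models to be compared while accounting for the small share of the minority class. The following two kinds of metrics are the most commonly used for this purpose.

Figure: Performance metrics for imbalanced data - 01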

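Recall is the fraction of actual positives that are predicted as positive, and precision is the fraction of predicted positives that are actually positive. The F1-score is the harmonic mean of the two, so it takes both into account. Recall and precision are in a trade-off relationship. It is best to compute and compare all three metrics (recall, precision, F1-score).

Figure: Performance metrics for imbalanced data - 02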
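The three metrics can be computed directly from confusion-matrix counts, as in this small sketch. The counts tp=12, fp=4, fn=8 are inferred from the class 1.0 row of the imbalanced report above (support 20, recall 0.60, precision 0.75); they were not printed by the original code.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the class 1.0 row of the imbalanced report above.
p, r, f1_score = precision_recall_f1(tp=12, fp=4, fn=8)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score:.2f}")
```

The rounded values match the 0.75 / 0.60 / 0.67 shown in the report.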

Because the threshold needs to vary, plotting the ROC curve or using the AUC as the metric is also a good approach.

Figure: Performance metrics for imbalanced data - 03
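AUC does not depend on a single threshold: it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it from that pairwise definition in plain Python; the labels and scores are made up, and a library routine would normally be used instead.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and decision scores.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.7, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 6 of the 8 positive/negative pairs are ordered correctly
```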

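Solutions

As briefly mentioned above, approaches to imbalanced data fall into two broad groups. The first is resampling: over-sampling, which inflates the minority class; under-sampling, which uses only part of the majority class; and hybrid sampling, which combines the two. The second is to give the minority class a larger weight in the model's own training.

Figure: Resampling methods for addressing the imbalance problem

When resampling, first split the dataset into train, validation, and test sets while preserving the class ratio of the whole dataset. Then apply the resampling method only to the training data. These steps are intuitive, but they are easy for beginners to get wrong, so they are worth stating explicitly.

Figure: Example of solving the imbalance problem with over-sampling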
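The order of operations described above (stratified split first, resampling on the training part only) can be sketched in plain Python as follows. Both helper functions are hypothetical and work on indices, and only a train/test split is shown; a validation set would be carved out the same way.

```python
import random
from collections import Counter

def stratified_split(y, test_fraction, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def random_oversample(indices, y, seed=0):
    """Duplicate minority indices (with replacement) until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(y[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    out = []
    for members in by_class.values():
        out += members + [rng.choice(members) for _ in range(target - len(members))]
    return out

y = [0] * 200 + [1] * 20
train_idx, test_idx = stratified_split(y, test_fraction=0.25)
train_resampled = random_oversample(train_idx, y)

print(Counter(y[i] for i in test_idx))         # test set keeps the original ratio
print(Counter(y[i] for i in train_resampled))  # only the training set is balanced
```

Because the duplicated indices come only from train_idx, no test sample leaks into training, which is the point of splitting before resampling.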

The imbalanced-learn package

A Python package that implements a variety of sampling methods for dealing with the imbalanced data problem.

pip install -U imbalanced-learn

Under sampling

RandomUnderSampler : random under-sampling method
TomekLinks : Tomek’s link method
CondensedNearestNeighbour : condensed nearest neighbor method
OneSidedSelection : under-sampling based on one-sided selection method
EditedNearestNeighbours : edited nearest neighbor method
NeighbourhoodCleaningRule : neighborhood cleaning rule

from imblearn.under_sampling import *

n0 = 200; n1 = 20
rv1 = sp.stats.multivariate_normal([-1, 0], [[1, 0], [0, 1]])
rv2 = sp.stats.multivariate_normal([+1, 0], [[1, 0], [0, 1]])
X0 = rv1.rvs(n0, random_state=0)
X1 = rv2.rvs(n1, random_state=0)
X_imb = np.vstack([X0, X1])
y_imb = np.hstack([np.zeros(n0), np.ones(n1)])

x1min = -4; x1max = 4
x2min = -2; x2max = 2
xx1 = np.linspace(x1min, x1max, 1000)
xx2 = np.linspace(x2min, x2max, 1000)
X1, X2 = np.meshgrid(xx1, xx2)

def classification_result2(X, y, title=""):
    plt.contour(X1, X2, rv1.pdf(np.dstack([X1, X2])), levels=[0.05], linestyles="dashed")
    plt.contour(X1, X2, rv2.pdf(np.dstack([X1, X2])), levels=[0.05], linestyles="dashed")
    model = SVC(kernel="linear", C=1e4, random_state=0).fit(X, y)
    Y = np.reshape(model.predict(np.array([X1.ravel(), X2.ravel()]).T), X1.shape)
    plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='x', label="class 0")
    plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='o', label="class 1")
    plt.contour(X1, X2, Y, colors='k', levels=[0.5])
    y_pred = model.predict(X)
    plt.xlim(-4, 4)
    plt.ylim(-3, 3)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.title(title)
    return model

Random Under-Sampler

A simple sampling method that removes majority-class samples at random.
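What random under-sampling does can be sketched in a few lines of plain Python. This is an illustration of the idea under the assumption that every class is cut down to the size of the smallest one; it is not the imbalanced-learn implementation, and the placeholder features are made up.

```python
import random

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_keep = min(len(idx) for idx in by_class.values())
    kept = sorted(i for idx in by_class.values() for i in rng.sample(idx, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]

X = [[float(i), 0.0] for i in range(220)]  # placeholder features
y = [0] * 200 + [1] * 20
X_s, y_s = random_undersample(X, y)
print(len(X_s), y_s.count(0), y_s.count(1))
```

All 20 minority samples survive, while 180 of the 200 majority samples are discarded, so information from the majority class is lost.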

# fit_resample is the current name of the method formerly called fit_sample
X_samp, y_samp = RandomUnderSampler(random_state=0).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with RandomUnderSampler')
# plt.savefig('RandomUnderSampler_result_resampling')

Figure: Result of resampling with RandomUnderSampler

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.92      0.95       200
         1.0       0.51      0.90      0.65        20

    accuracy                           0.91       220
   macro avg       0.75      0.91      0.80       220
weighted avg       0.95      0.91      0.92       220

Tomek’s link method

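A Tomek link is a pair of samples $ (x_{+}, x_{-}) $ from different classes for which no other sample is closer to either of them than they are to each other. In other words, two samples of different classes that sit very close together form a Tomek link. The Tomek link method finds these links and removes the majority-class sample of each pair, which has the effect of pushing the decision boundary toward the majority class.

The idea is to widen the minority side of the boundary, although usually not many samples are removed. The procedure can be repeated.

Figure: The Tomek link method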
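The definition translates into a mutual-nearest-neighbor check, sketched below in plain Python on a made-up one-dimensional layout. The function names are hypothetical, the brute-force search is only suitable for tiny inputs, and math.dist requires Python 3.8 or later.

```python
import math

def nearest_index(X, i):
    """Index of the point closest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i), key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j) of opposite-class points that are each other's nearest neighbor."""
    nn = [nearest_index(X, i) for i in range(len(X))]
    return [(i, nn[i]) for i in range(len(X))
            if i < nn[i] and nn[nn[i]] == i and y[i] != y[nn[i]]]

def remove_majority_in_links(X, y, majority=0):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.4, 0.0], [4.0, 0.0]]
y = [0, 0, 0, 1, 1]
links = tomek_links(X, y)
X_s, y_s = remove_majority_in_links(X, y)
print(links, y_s)
```

Only the points at 2.0 (majority) and 2.4 (minority) are mutual nearest neighbors of different classes, so a single majority sample is removed.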

# TomekLinks is deterministic, so no random_state is passed
X_samp, y_samp = TomekLinks().fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with the Tomek link method')
# plt.savefig('TomekLinks_result_resampling')

Figure: Result of resampling with the Tomek link method

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       200
         1.0       0.70      0.70      0.70        20

    accuracy                           0.95       220
   macro avg       0.83      0.83      0.83       220
weighted avg       0.95      0.95      0.95       220

Condensed Nearest Neighbor

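The CNN (Condensed Nearest Neighbour) method keeps only the samples that a 1-NN model fails to classify correctly.
- It was originally proposed to shrink the training set of a nearest-neighbor classifier, and is applied here to under-sampling.
- One sample is picked at random and the 1-NN rule is applied against the samples selected so far. If its nearest selected sample has the same class (majority), it is not selected; if it has a different class (minority), it is selected.

Let $ S $ be the set of selected samples.
- 1) Put every minority-class sample into $ S $.
- 2) Pick one majority-class sample. If its nearest sample in $ S $ belongs to the majority class, leave it out; otherwise add it to $ S $.
- 3) Repeat step 2 until no more samples are selected.

With this method, a sample that is close to an already selected sample of the same class is not selected, so only a small fraction of the majority class is kept.
- However, a majority sample whose nearest neighbor is a minority sample is added to the set, and if it lies as close to the minority cluster as in the figure below, it will not help separate the two classes.
- CNN preserves the boundary, so majority samples that are far from the boundary or have no minority samples nearby are removed. For that reason it is often used as an intermediate step of other methods rather than on its own.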
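The three steps for building $ S $ can be sketched as follows. This is a simplified illustration: majority samples are visited in index order rather than at random, the function name is hypothetical, and the one-dimensional data is made up.

```python
import math

def condensed_nearest_neighbour(X, y, minority=1):
    """Return the sorted indices kept by the three steps described above."""
    selected = [i for i, label in enumerate(y) if label == minority]    # step 1
    candidates = [i for i, label in enumerate(y) if label != minority]
    changed = True
    while changed:                                                      # step 3
        changed = False
        for i in list(candidates):                                      # step 2
            nearest = min(selected, key=lambda j: math.dist(X[i], X[j]))
            if y[nearest] != y[i]:  # 1-NN on S misclassifies this majority sample
                selected.append(i)
                candidates.remove(i)
                changed = True
    return sorted(selected)

X = [[0.0], [0.2], [0.4], [0.6], [3.0], [3.2]]
y = [0, 0, 0, 0, 1, 1]
kept = condensed_nearest_neighbour(X, y)
print(kept)
```

Here the first majority sample is added because only minority samples are in $ S $ at that point; the remaining majority samples are then classified correctly by it and are left out.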

X_samp, y_samp = CondensedNearestNeighbour(random_state=0).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with CondensedNearestNeighbour')
# plt.savefig('CondensedNearestNeighbour_result_resampling')

Figure: Result of resampling with CondensedNearestNeighbour

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.96      0.98      0.97       200
         1.0       0.75      0.60      0.67        20

    accuracy                           0.95       220
   macro avg       0.86      0.79      0.82       220
weighted avg       0.94      0.95      0.94       220

One Sided Selection

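One Sided Selection combines the Tomek link method with the Condensed Nearest Neighbour method. It removes the majority-class samples that belong to Tomek links, and from the remaining data it also removes majority-class samples that are clustered together, using the 1-NN rule.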

X_samp, y_samp = OneSidedSelection(random_state=0).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with OneSidedSelection')
# plt.savefig('OneSidedSelection_result_resampling')

Figure: Result of resampling with OneSidedSelection

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       200
         1.0       0.70      0.70      0.70        20

    accuracy                           0.95       220
   macro avg       0.83      0.83      0.83       220
weighted avg       0.95      0.95      0.95       220

Edited Nearest Neighbours

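The ENN (Edited Nearest Neighbours) method removes a majority-class sample unless all (kind_sel='all') or most (kind_sel='mode') of its k (n_neighbors) nearest neighbors belong to the majority class. Majority-class samples around the minority class disappear.

As a result, majority-class samples that have minority-class samples among their k nearest neighbors are removed, and the separation between the minority and majority classes becomes relatively clearer.

It is similar to CNN, but it looks at the k nearest neighbors of every sample and deletes the sample when those neighbors are not all, or not mostly, majority-class. The majority-class samples on the boundary disappear, which gives the effect described above.

Because k-NN methods compute the distances between all pairs of samples, they are hard to use when the dataset is large.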
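The difference between the two selection modes can be seen in a small plain-Python sketch. The function below is a hypothetical re-implementation of the rule as described in the text (it borrows the kind_sel name only for readability), and the one-dimensional data is made up.

```python
import math

def edited_nearest_neighbours(X, y, majority=0, k=3, kind_sel="all"):
    """Indices kept after dropping majority samples whose k neighbors disagree with them."""
    keep = []
    for i in range(len(X)):
        if y[i] != majority:
            keep.append(i)  # minority samples are never removed
            continue
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        same = sum(1 for j in neighbours if y[j] == majority)
        agrees = same == k if kind_sel == "all" else same > k / 2
        if agrees:
            keep.append(i)
    return keep

X = [[0.0], [0.1], [0.2], [0.5], [0.6], [2.0], [2.1], [2.2], [2.3]]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1]
kept_all = edited_nearest_neighbours(X, y, kind_sel="all")
kept_mode = edited_nearest_neighbours(X, y, kind_sel="mode")
print(kept_all, kept_mode)
```

The majority sample at 0.5 has one minority neighbor out of three, so it is removed under "all" but kept under "mode"; the majority sample at 2.0 is surrounded by minority samples and is removed in both cases.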

# EditedNearestNeighbours is deterministic, so no random_state is passed
X_samp, y_samp = EditedNearestNeighbours(kind_sel="all", n_neighbors=5).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with EditedNearestNeighbours')
# plt.savefig('EditedNearestNeighbours_result_resampling')

Figure: Result of resampling with EditedNearestNeighbours

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.94      0.96       200
         1.0       0.58      0.90      0.71        20

    accuracy                           0.93       220
   macro avg       0.79      0.92      0.83       220
weighted avg       0.95      0.93      0.94       220

Neighbourhood Cleaning Rule

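This method combines CNN (Condensed Nearest Neighbour) and ENN (Edited Nearest Neighbours).

It removes samples whose nearest neighbor is not a minority sample, or whose k nearest neighbors are all or mostly majority-class, so I think it can keep the boundary from being made too sharp. It can be seen as preventing ENN from removing too much around the minority-class samples.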

# NeighbourhoodCleaningRule is deterministic, so no random_state is passed
X_samp, y_samp = NeighbourhoodCleaningRule(kind_sel="all", n_neighbors=5).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with NeighbourhoodCleaningRule')
# plt.savefig('NeighbourhoodCleaningRule_result_resampling')

Figure: Result of resampling with NeighbourhoodCleaningRule

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96       200
         1.0       0.56      0.95      0.70        20

    accuracy                           0.93       220
   macro avg       0.78      0.94      0.83       220
weighted avg       0.96      0.93      0.94       220

Over sampling

RandomOverSampler : random sampler
ADASYN : Adaptive Synthetic Sampling Approach for Imbalanced Learning
SMOTE : Synthetic Minority Over-sampling Technique

from imblearn.over_sampling import *

RandomOverSampler

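Random over-sampling repeatedly adds copies of minority-class samples (sampling with replacement). It is similar to increasing the weight of the minority class.

The left and right panels may look unchanged, but on the right identical samples have been duplicated, and the larger count pushes the boundary.

Figure: What random over-sampling is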

X_samp, y_samp = RandomOverSampler(random_state=0).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with RandomOverSampler')
# plt.savefig('RandomOverSampler_result_resampling')

Figure: Result of resampling with RandomOverSampler

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.91      0.95       200
         1.0       0.51      0.95      0.67        20

    accuracy                           0.91       220
   macro avg       0.75      0.93      0.81       220
weighted avg       0.95      0.91      0.92       220

ADASYN

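Two minority samples are picked at random and joined by a line segment, and a new sample is generated at a random position on that segment.

The ADASYN (Adaptive Synthetic Sampling) method creates synthetic minority-class samples on the line segment between a minority-class sample and a sample chosen at random from its k nearest minority-class neighbors. It is adaptive in that it generates more synthetic samples around minority samples that have many majority-class neighbors.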
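The adaptive part of ADASYN, deciding how many synthetic samples to create around each minority sample, can be sketched as below. Each minority sample gets a share of the budget proportional to the fraction of majority samples among its k nearest neighbors. The function name, the budget of 10 and the one-dimensional data are assumptions for illustration, and the interpolation step itself is omitted.

```python
import math

def adasyn_allocation(X, y, n_new, minority=1, k=3):
    """Number of synthetic samples to generate around each minority index."""
    ratios = {}
    for i in range(len(X)):
        if y[i] != minority:
            continue
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        # Fraction of majority-class samples among the k nearest neighbors.
        ratios[i] = sum(1 for j in neighbours if y[j] != minority) / k
    total = sum(ratios.values())
    if total == 0:  # no minority sample has majority neighbors
        return {i: 0 for i in ratios}
    return {i: round(n_new * r / total) for i, r in ratios.items()}

X = [[0.0], [0.1], [0.2], [0.3], [0.16], [1.0], [1.1], [1.2], [1.37], [1.5], [1.6]]
y = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
allocation = adasyn_allocation(X, y, n_new=10)
print(allocation)
```

The minority sample at 0.16, which sits entirely among majority samples, receives the largest share, while the three minority samples deep inside their own cluster receive none.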

X_samp, y_samp = ADASYN(random_state=0).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with ADASYN')
# plt.savefig('ADASYN_result_resampling')

Figure: Result of resampling with ADASYN

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.90      0.94       200
         1.0       0.47      0.95      0.63        20

    accuracy                           0.90       220
   macro avg       0.73      0.92      0.79       220
weighted avg       0.95      0.90      0.91       220

SMOTE

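The SMOTE (Synthetic Minority Over-sampling Technique) method generates samples by interpolation in the same way as ADASYN, and every synthetic sample is labeled as the minority class. The difference is that SMOTE picks the base minority samples uniformly, whereas ADASYN concentrates on minority samples surrounded by majority-class neighbors.

Figure: The SMOTE method

The samples are generated at random, so the result differs from run to run unless random_state is fixed.

Figure: How SMOTE works

Over-sampling applies to every minority-class sample, so note that isolated minority samples that are not part of a dense group, and could be regarded as noise, are also used for sampling.

Figure: A caveat when using SMOTE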
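The interpolation step at the heart of SMOTE can be sketched in plain Python as below. The function name, the value k=3 and the four minority points are assumptions for illustration; the library implementation differs in detail.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

minority = [[1.0, 0.0], [1.5, 0.5], [2.0, 0.0], [1.2, -0.4]]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Every synthetic point lies between two existing minority points, so an isolated minority sample still acts as an endpoint and can pull new points toward a noisy region, which is the caveat noted above.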

X_samp, y_samp = SMOTE(random_state=4).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with SMOTE')
# plt.savefig('SMOTE_result_resampling')

Figure: Result of resampling with SMOTE

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.91      0.95       200
         1.0       0.50      0.90      0.64        20

    accuracy                           0.91       220
   macro avg       0.74      0.91      0.80       220
weighted avg       0.94      0.91      0.92       220

BLSMOTE

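BLSMOTE (Borderline-SMOTE) assumes that the samples on the borderline have the largest effect on the class imbalance problem and applies SMOTE only to them. The intent is to resample the data that can actually affect the decision boundary.

Figure: Concept and mechanism of BLSMOTE

It addresses the weakness of SMOTE, which also over-samples samples that look like noise, by putting such samples in a separate noise category and excluding them, so that the newly generated samples do not cause overfitting to those minority samples.

Figure: Advantages of BLSMOTE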
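The categorisation that Borderline-SMOTE relies on can be sketched as follows: each minority sample is labeled from the number of majority samples among its m nearest neighbors (all of them: noise; at least half: danger, meaning borderline; otherwise: safe), and only the danger samples would then be used as SMOTE seeds. The function name, m=3 and the data are assumptions for illustration.

```python
import math

def borderline_categories(X, y, minority=1, m=3):
    """Label each minority index as 'noise', 'danger' (borderline) or 'safe'."""
    labels = {}
    for i in range(len(X)):
        if y[i] != minority:
            continue
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:m]
        n_major = sum(1 for j in neighbours if y[j] != minority)
        if n_major == m:
            labels[i] = "noise"    # surrounded only by majority samples
        elif n_major >= m / 2:
            labels[i] = "danger"   # on the borderline: the SMOTE seeds
        else:
            labels[i] = "safe"
    return labels

X = [[0.0], [0.1], [0.2], [0.3], [0.16], [1.0], [1.1], [1.2], [1.37], [1.5], [1.6]]
y = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
categories = borderline_categories(X, y)
print(categories)
```

In this layout the isolated minority sample at 0.16 is classed as noise and skipped, and only the sample at 1.2, next to the majority cluster, is treated as borderline.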

DBSMOTE(DBSCAN SMOTE)

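DBSMOTE builds DBSCAN clusters first and then applies SMOTE within each cluster.

Figure: Concept and mechanism of DBSMOTE

Because it runs DBSCAN clustering, the synthetic samples tend to connect toward the cluster centers, and, like BLSMOTE, it samples after removing noise, which is one of DBSCAN's own strengths.

Figure: Result of DBSMOTE

Results vary considerably from dataset to dataset, so try several sampling methods experimentally and use the one that fits the data at hand.

Figure: Drawbacks of over-sampling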

Combined sampling

SMOTEENN : SMOTE + ENN
SMOTETomek : SMOTE + Tomek

from imblearn.combine import *

SMOTE + ENN

The SMOTE+ENN method combines SMOTE (Synthetic Minority Over-sampling Technique) with ENN (Edited Nearest Neighbours).

X_samp, y_samp = SMOTEENN(random_state=0).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')  # set after plotting, since classification_result2 resets the title
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with SMOTE + ENN')
# plt.savefig('SMOTE_and_ENN_result_resampling')

Figure: Result of resampling with SMOTE + ENN

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.92      0.96       200
         1.0       0.54      0.95      0.69        20

    accuracy                           0.92       220
   macro avg       0.77      0.94      0.82       220
weighted avg       0.95      0.92      0.93       220

SMOTE+Tomek

The SMOTE+Tomek method combines SMOTE (Synthetic Minority Over-sampling Technique) with the Tomek link method.

X_samp, y_samp = SMOTETomek(random_state=4).fit_resample(X_imb, y_imb)

plt.figure(figsize=(12,8))
plt.subplot(121)
classification_result2(X_imb, y_imb)
plt.title('Original data')
plt.subplot(122)
model_samp = classification_result2(X_samp, y_samp)
plt.title('Data resampled with SMOTE + Tomek')
plt.savefig('SMOTETomek_result_resampling')

Figure: Result of resampling with SMOTE + Tomek

print(classification_report(y_imb, model_samp.predict(X_imb)))

Result

              precision    recall  f1-score   support

         0.0       0.99      0.92      0.95       200
         1.0       0.51      0.90      0.65        20

    accuracy                           0.91       220
   macro avg       0.75      0.91      0.80       220
weighted avg       0.95      0.91      0.92       220

DataLatte's IT Blog using Hexo

machine learning
