Regularization


By Sooyeong Lim

This note explains the regularization example from Figures 1.7 and 1.8 of Bishop's PRML and implements it in Python.

#Import packages
import numpy as np
import pandas as pd
import random
import math
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
import sklearn
rcParams['figure.figsize'] = 12, 8 # Set up the size of figures

Let's generate 10 points by sampling $\sin(2\pi x)$ and adding random noise.

#Define 10 evenly spaced inputs on [0, 1]

x= np.linspace(0,1,10) #Generate the training dataset
y = np.sin((2*math.pi)*x) + np.random.normal(0,0.3,len(x)) #Add Gaussian noise with s.d.=0.3
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'o')
x_1=np.linspace(0,1,1000)
plt.plot(x_1,np.sin((2*math.pi)*x_1),color='green')

[Figure: the 10 noisy data points (dots) and the curve sin(2πx) (green)]

We will fit a linear model, but because the relationship between the predictor and the response variable is not a simple straight line, we add polynomial terms. The phi function below adds the predictors $x^0, x^1, x^2, \dots, x^9$ as columns of the X matrix.

#Build the polynomial features; used below with order M=9
#Reused phi function: takes an input vector of length N and returns an N x (M+1) matrix
def phi(x, order):
    x = np.atleast_1d(x) # <-- Make sure the input is an array
    return np.column_stack([x**k for k in range(order+1)]) # Columns x^0, x^1, ..., x^order

X=np.array(phi(x,9))
X_1=np.array(phi(x_1,9))
#Only 10 data points
print(np.shape(X))
#1000 points, used only for plotting the fitted curve
print(np.shape(X_1))

(10, 10)
(1000, 10)
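
The matrix returned by `phi` is a Vandermonde matrix, so NumPy's `np.vander` with `increasing=True` should produce the same columns. The sketch below checks this on the same 10 inputs; the `phi` helper is copied from the cell above, and the comparison itself is an illustration, not part of the original notebook.

```python
import numpy as np

def phi(x, order):
    """Polynomial design matrix with columns x**0, x**1, ..., x**order."""
    x = np.atleast_1d(x)
    return np.column_stack([x**k for k in range(order + 1)])

x = np.linspace(0, 1, 10)
X = phi(x, 9)

# np.vander with increasing=True orders the powers as x^0, x^1, ..., x^(N-1)
X_vander = np.vander(x, N=10, increasing=True)

print(X.shape)                   # (10, 10)
print(np.allclose(X, X_vander))  # True
```

Either construction works here; `phi` keeps the notation closer to the book.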

The scikit-learn package provides most of the commonly used machine learning functionality. Let's import Linear Regression from sklearn. The inputs are the X matrix we just built with the phi function and the response variable y.

#Define Linear Regression model
#Import Linear Regression model from scikit-learn.
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
#Train the model
lm.fit(X,y)
#Fitted without any regularization term.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

x_test=np.linspace(0,1,100)
X_test=np.array(phi(x_test,9))
lm.predict(X_test)

array([ 0.21671715, -0.39897038, -0.75620912, -0.91337483, -0.91974278,
       -0.8165217 , -0.63780309, -0.4114304 , -0.15979295,  0.09945129,
        0.35271945,  0.58991022,  0.80386884,  0.98990696,  1.14537352,
        1.26927334,  1.36192986,  1.42468886,  1.45966005,  1.46949357,
        1.45718852,  1.42593088,  1.37895813,  1.31944827,  1.25043071,
        1.17471705,  1.09484937,  1.01306436,  0.93127114,  0.85104122,
        0.77360882,  0.69988001,  0.63044938,  0.56562261,  0.50544402,
        0.44972764,  0.398091  ,  0.34999047,  0.30475741,  0.26163424,
        0.21980973,  0.17845296,  0.13674524,  0.09390964,  0.04923766,
        0.0021127 , -0.04796994, -0.10138672, -0.15837335, -0.21901619,
       -0.28324756, -0.35084503, -0.42143444, -0.49449671, -0.56937813,
       -0.64530397, -0.72139514, -0.79668757, -0.87015398, -0.94072775,
       -1.00732829, -1.06888764, -1.12437784, -1.1728384 , -1.21340356,
       -1.24532872, -1.26801546, -1.28103461, -1.28414686, -1.2773202 ,
       -1.26074368, -1.23483689, -1.20025445, -1.1578851 , -1.10884459,
       -1.05446193, -0.99625825, -0.93591792, -0.87525112, -0.81614746,
       -0.7605201 , -0.71023979, -0.66705841, -0.63252154, -0.60786956,
       -0.59392691, -0.59097919, -0.59863765, -0.61569085, -0.63994318,
       -0.66804006, -0.69527963, -0.71541075, -0.72041732, -0.70028882,
       -0.64277715, -0.53313983, -0.35386971, -0.08441137,  0.29913543])

data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'o')
x_1=np.linspace(0,1,1000)
plt.plot(x_1,np.sin((2*math.pi)*x_1),color='green')
plt.plot(x_test,lm.predict(X_test)) #Degree-9 polynomial fit

[Figure: the noisy data points, the curve sin(2πx) (green), and the degree-9 polynomial fit]

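As the figure above shows, fitting a linear model with polynomial terms up to degree 9 and no regularization leads to overfitting: with 10 coefficients and only 10 data points, the curve can pass through every noisy point instead of following the underlying $\sin(2\pi x)$.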
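
To put numbers on the overfitting, the following sketch fits the same degree-9 polynomial with plain least squares and prints the training RMSE and the largest coefficient magnitude. It is a minimal illustration under my own assumptions: the seed is fixed only for reproducibility, and `np.linalg.lstsq` stands in for scikit-learn's `LinearRegression`, so the numbers will not match the notebook's output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# Same columns as phi(x, 9): x^0, x^1, ..., x^9
X = np.vander(x, N=10, increasing=True)

# Unregularized least-squares fit of the 10 coefficients
w, *_ = np.linalg.lstsq(X, t, rcond=None)

train_rmse = np.sqrt(np.mean((X @ w - t) ** 2))
print(f"training RMSE: {train_rmse:.2e}")
print(f"largest |w_j|: {np.abs(w).max():.2e}")
```

A training error close to zero together with very large coefficients is the pattern PRML associates with an over-fitted polynomial.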

L2 Regularization (Ridge Regression)

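Ridge Regression is one method for addressing overfitting when there are many parameters, as in the situation above. L1 regularization is called Lasso Regression, and L2 regularization is called Ridge Regression. Both rest on the same basic idea of penalizing the parameters: Lasso penalizes the sum of their absolute values, Ridge the sum of their squares.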
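
In PRML's notation the L2-penalized error is $\tilde{E}(w) = \frac{1}{2}\sum_n (y(x_n, w) - t_n)^2 + \frac{\lambda}{2}\lVert w \rVert^2$, and its minimizer has the closed form $w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t$. The sketch below implements that formula in NumPy and checks that the gradient vanishes at the solution. `ridge_closed_form` is a hypothetical helper written for this note. It penalizes every coefficient, including the constant term, and does not rescale the features, so its results are not expected to match the scikit-learn `Ridge(normalize=True)` calls used in this post.

```python
import numpy as np

def ridge_closed_form(Phi, t, lam):
    """Minimize 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2 (hypothetical helper)."""
    n_features = Phi.shape[1]
    A = lam * np.eye(n_features) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
Phi = np.vander(x, N=10, increasing=True)  # columns x^0 ... x^9

lam = 1e-3  # illustrative value, not taken from the post
w = ridge_closed_form(Phi, t, lam)

# At the minimizer the gradient Phi^T (Phi w - t) + lam * w should be ~0
grad = Phi.T @ (Phi @ w - t) + lam * w
print(np.abs(grad).max())
```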

np.shape(x)

(10,)

#Ridge regularization
#Fit the same data using ridge regression
#Note: normalize=True requires scikit-learn < 1.2; the parameter was removed in later releases
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
rcParams['figure.figsize'] = 15, 9 # Set up the size of figures
ridgeReg=Ridge(alpha=np.exp(-18),normalize=True) # Weak penalty
ridgeReg.fit(X,y)
ridgeReg2=Ridge(alpha=100,normalize=True) # Strong penalty: alpha=100 is passed here, although the panel is labelled lambda=exp(0)=1
ridgeReg2.fit(X,y)
plt.figure(figsize=(13, 5))
#Left panel: weak penalty
plt.subplot(121)
plt.plot(x_1,ridgeReg.predict(X_1),color='red')
plt.ylabel('t')
plt.xlabel('x')
plt.plot(x_1,np.sin((2*math.pi)*x_1),color='green')
plt.plot(data['x'],data['y'],'o')
plt.text(0.7, 0.1, r'$\lambda=exp(-18)$')
#Right panel: strong penalty
plt.subplot(122)
plt.text(0.7,0.1, r'$\lambda=exp(0)=1$')
plt.plot(x_1,ridgeReg2.predict(X_1),color='red')
plt.ylabel('t')
plt.xlabel('x')
plt.plot(x_1,np.sin((2*math.pi)*x_1),color='green')
plt.plot(data['x'],data['y'],'o')

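[Figure: two panels showing the ridge fit (red), the curve sin(2πx) (green) and the data points; the left panel is labelled λ = exp(-18), the right panel λ = exp(0) = 1]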

The left panel uses $e^{-18}$ (a very small value) as the hyperparameter $\lambda$, while the right panel uses the comparatively large $\lambda = e^{0} = 1$. The left fit generalizes well, whereas in the right panel the regularization penalty is so large that the model barely fits the data.
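
The effect of the penalty can also be seen numerically. The sketch below solves the closed-form ridge problem for three values of $\ln\lambda$ and prints the coefficient norm and the training RMSE for each. It is only an illustration: it uses a NumPy closed-form solve on unscaled features in place of scikit-learn's `Ridge`, and the seed and the three $\ln\lambda$ values are my own choices, so the numbers do not correspond to the two panels above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
Phi = np.vander(x, N=10, increasing=True)  # columns x^0 ... x^9

norms, train_rmse = [], []
for ln_lam in (-18.0, 0.0, 5.0):  # illustrative values
    lam = np.exp(ln_lam)
    # Closed-form ridge solution w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)
    norms.append(np.linalg.norm(w))
    train_rmse.append(np.sqrt(np.mean((Phi @ w - t) ** 2)))
    print(f"ln(lambda)={ln_lam:6.1f}  ||w||={norms[-1]:10.3f}  train RMSE={train_rmse[-1]:.3f}")
```

As $\lambda$ grows the coefficients shrink toward zero and the training error rises, which is the underfitting seen in the right panel.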

Figure 1.8

So what value should the hyperparameter take? The answer differs by problem and by model, so it has to be chosen by visualizing the error or by running a grid search. In Kaggle competitions, rankings often come down to how such hyperparameters are set.

# Set the range of ln(lambda); lambdas holds the corresponding lambda values
lg_lambdas = np.linspace(-100, 10, 100)
lambdas = np.exp(lg_lambdas)
rmse_tr=[]
rmse_test=[]

# Generate a new test dataset
x_test= np.linspace(0,1,100)
y_test= np.sin((2*math.pi)*x_test) + np.random.normal(0,0.5,len(x_test)) #Assign random noise with s.d=0.5

In regression problems, error is usually measured by the mean squared error (MSE) or the root mean squared error (RMSE).
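
For reference, the RMSE of predictions $\hat{y}_n$ against targets $t_n$ is $\sqrt{\frac{1}{N}\sum_n (\hat{y}_n - t_n)^2}$, the square root of the MSE. A minimal NumPy sketch follows; `rmse` is a hypothetical helper written for this note, not a scikit-learn function.

```python
import numpy as np

def rmse(t_true, t_pred):
    """Root mean squared error between targets and predictions (hypothetical helper)."""
    t_true = np.asarray(t_true, dtype=float)
    t_pred = np.asarray(t_pred, dtype=float)
    return np.sqrt(np.mean((t_pred - t_true) ** 2))

# Errors are 0, 0 and 2, so MSE = 4/3 and RMSE = sqrt(4/3)
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # ≈ 1.1547
```

Unlike the MSE, the RMSE is in the same units as the target, which makes the two curves in the plot easier to read.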

from sklearn.metrics import mean_squared_error
X_test=phi(x_test,9) #Build the test design matrix once, outside the loop
for lam in lambdas:
    ridgeReg=Ridge(alpha=lam,normalize=True)
    #Fit on the original training data
    ridgeReg.fit(X,y)
    rmse_tr.append(np.sqrt(mean_squared_error(y,ridgeReg.predict(X)))) #Training RMSE
    rmse_test.append(np.sqrt(mean_squared_error(y_test,ridgeReg.predict(X_test)))) #Test RMSE

plt.subplot(1,1,1)
plt.xlim((-60,10))
plt.ylim((0,1))
plt.plot(lg_lambdas,rmse_tr,label='training')
plt.plot(lg_lambdas,rmse_test,label='test',color='red')
plt.legend(loc='upper right')
plt.ylabel('E_RMSE')
plt.xlabel(r'ln($\lambda$)') #Raw string avoids the invalid '\l' escape

[Figure: training RMSE and test RMSE (red) plotted against ln(λ)]

The test RMSE is smallest at roughly $\ln(\lambda) = -9$. Setting $\lambda$ higher or lower than this raises the test error.
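
The minimum can also be located programmatically instead of by eye. The sketch below runs a small grid search over $\ln\lambda$ and picks the value with the lowest test RMSE. It is a self-contained illustration under my own assumptions: it uses a closed-form NumPy ridge solve on unscaled features in place of scikit-learn's `Ridge`, a fixed seed, and a narrower grid, so the selected value will not necessarily be near -9.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def make_data(n, noise_sd):
    """Sample n noisy points of sin(2*pi*x) and return (design matrix, targets)."""
    x = np.linspace(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, size=n)
    return np.vander(x, N=10, increasing=True), t

Phi_train, t_train = make_data(10, 0.3)
Phi_test, t_test = make_data(100, 0.5)

ln_lambdas = np.linspace(-20, 5, 51)  # narrower grid than the sweep in the post
test_rmse = []
for ln_lam in ln_lambdas:
    # Closed-form ridge solution for this lambda
    A = np.exp(ln_lam) * np.eye(10) + Phi_train.T @ Phi_train
    w = np.linalg.solve(A, Phi_train.T @ t_train)
    test_rmse.append(np.sqrt(np.mean((Phi_test @ w - t_test) ** 2)))

best = int(np.argmin(test_rmse))
print(f"best ln(lambda) = {ln_lambdas[best]:.1f}, test RMSE = {test_rmse[best]:.3f}")
```

With only 10 training points the selected value can shift noticeably from one random draw to the next, so it is worth repeating the search with several seeds before settling on a $\lambda$.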
