[Data Science] Two Ways to Compute r^2 (with Python Code)

Introduction

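Regression models are usually evaluated with r^2. It can be read loosely as the strength of the correlation between the true values and the predicted values, but the actual calculation is slightly more involved.

In addition, there are two different definitions of r^2. They give the same value for a least-squares linear fit evaluated on the data it was trained on, but they can differ slightly on held-out data, and the calculations themselves are different. It is therefore worth working through both by hand and thinking about what r^2 means.

This post looks at r^2 through the following two methods.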

Pearson's correlation coefficient
Coefficient of Determination r^2
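
For reference, the two quantities can be written out explicitly. The notation here is the standard textbook one rather than the post's own: y_i are the observed values, \hat{y}_i the predictions, and bars denote means over the evaluated samples.

```latex
% Method 1: Pearson correlation between observed and predicted values
r = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
         {\sqrt{\sum_i (y_i - \bar{y})^2}\,\sqrt{\sum_i (\hat{y}_i - \bar{\hat{y}})^2}}

% Method 2: coefficient of determination
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

Method 1 squares r; Method 2 uses R^2 directly.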

Preprocessing

1. Import libraries

import os 
import pandas as pd 
import numpy as np 
from scipy import stats 
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

2. Load skin cancer mortality data

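# Read the raw RTF export line by line.
with open('skin-cancer-mortality.rtf') as f:
    lines = f.readlines()

data = []
# Skip the first 10 lines and the last line (presumably RTF markup, not data).
for line in lines[10:-1]:
    temp = []
    for i, el in enumerate(line.split()):
        if i > 0:
            # Numeric column: drop the trailing RTF backslash, then convert.
            # The built-in float replaces np.float, which was removed in NumPy 1.24.
            temp.append(float(el.split("\\")[0]))
        else:
            # First column is the state name.
            temp.append(el)

    data.append(temp)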
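
To make the parsing step concrete, here is a minimal sketch of what happens to a single row. The sample line and the helper name `parse_row` are made up for illustration; the real file's layout is assumed to be whitespace-separated fields with an RTF backslash at the end of the line.

```python
def parse_row(line):
    """Split one whitespace-delimited row into [state, number, number, ...]."""
    fields = line.split()
    # The first field is the state name; the remaining fields are numbers that
    # may carry a trailing RTF backslash, which is cut off before conversion.
    return [fields[0]] + [float(el.split("\\")[0]) for el in fields[1:]]


# Hypothetical row for illustration only (assumed format, not read from the file).
sample = "Alabama 33.0 219 1 87.0\\\n"
row = parse_row(sample)
print(row)  # ['Alabama', 33.0, 219.0, 1.0, 87.0]
```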

3. Create a pandas dataframe

df = pd.DataFrame(data, columns=['State', 'Lat', 'Mort', 'Ocean', 'Long'])
y = df.pop('Mort').values.reshape(-1,1)
X = df.Lat.values.reshape(-1,1)

print(X.shape, y.shape) 
# (49, 1) (49, 1)

Check the dataframe

df.head()

4. Split dataset into train and test set

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# results are 
# (36, 1) (36, 1) (13, 1) (13, 1)
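
The 36/13 split follows from scikit-learn's defaults: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set, rounded up. The arithmetic can be sketched as follows (the variable names are illustrative only).

```python
import math

n_samples = 49         # rows in the dataset above
test_fraction = 0.25   # scikit-learn default when no size is specified

n_test = math.ceil(n_samples * test_fraction)  # the test count is rounded up
n_train = n_samples - n_test                   # the remainder goes to training
print(n_train, n_test)  # 36 13
```

Because `train_test_split` is called without `random_state`, the rows that land in each set change from run to run, so the scores printed later will vary between runs.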

5. Linear regression

reg = LinearRegression().fit(X_train, y_train)

print(reg.score(X_test,y_test))
# 0.7818725023507447

y_pred = reg.predict(X_test)

print(r2_score(y_test, y_pred))
# 0.7818725023507447
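
The two calls agree because `LinearRegression.score` returns the coefficient of determination, the same quantity `r2_score` computes. As a rough sketch of that definition, here it is in plain Python on toy numbers (not the skin cancer data); the function name `r2_from_definition` is hypothetical.

```python
def r2_from_definition(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot


# Toy numbers for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r2 = r2_from_definition(y_true, y_pred)
print(round(r2, 4))  # 0.9486
```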

Method 1: Pearson's correlation coefficient

r, p = stats.pearsonr(y_test.flatten(), y_pred.flatten())

print(f'Pearson correlation coefficient is {r}')
print(f'r^2 is {r**2}')

# Pearson correlation coefficient is 0.8884540790782807
# r^2 is 0.7893506506308358
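
To see what `stats.pearsonr` measures, the coefficient can also be written out from its definition. This is a minimal sketch in plain Python on toy numbers (not the skin cancer data); `pearson_r` is a hypothetical helper, not a SciPy function.

```python
import math


def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


# Toy numbers for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r = pearson_r(y_true, y_pred)
print(r, r ** 2)
```

Note that the correlation ignores any constant offset or rescaling of the predictions, whereas the coefficient of determination penalizes both.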

Method 2: Coefficient of Determination r^2

print(r2_score(y_test, y_pred))

# 0.7818725023507447

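The two results are close but not identical: the squared Pearson coefficient is about 0.7894, while the coefficient of determination is about 0.7819. The coefficient of determination can also be computed by hand from sums of squares, as follows.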
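
Why the small gap? For a least-squares line evaluated on its own training data the two numbers coincide, but on held-out data the squared correlation can only be equal to or larger than the coefficient of determination for a linear prediction. The sketch below checks this with a hand-written one-feature fit; the data and helper names are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SSE / SSTO."""
    m = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ssto = sum((t - m) ** 2 for t in y_true)
    return 1 - sse / ssto


def pearson_sq(a, b):
    """Squared Pearson correlation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov ** 2 / (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))


# Made-up data for illustration only.
x_train, y_train = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
x_test, y_test = [1.5, 3.5, 5.5], [3.4, 6.6, 11.5]

slope, intercept = fit_line(x_train, y_train)
pred_train = [slope * v + intercept for v in x_train]
pred_test = [slope * v + intercept for v in x_test]

r2_train, rsq_train = r2(y_train, pred_train), pearson_sq(y_train, pred_train)
r2_test, rsq_test = r2(y_test, pred_test), pearson_sq(y_test, pred_test)

print(r2_train, rsq_train)  # equal up to floating-point rounding
print(r2_test, rsq_test)    # squared correlation is the larger of the two here
```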

1. Regression sum of squares (SSR)

SSR = np.sum((y_pred - np.mean(y_test))**2)

print(SSR)

# 9574.961686529288

2. Error sum of squares (SSE)

SSE = np.sum((y_test - y_pred)**2)

print(SSE)

# 3233.287118616631

3. Total sum of squares

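# SSTO is defined from the observed values alone. The shortcut SSTO = SSR + SSE
# holds only for a least-squares fit evaluated on its own training data, so on
# the test set SSTO is computed directly.
SSTO = np.sum((y_test - np.mean(y_test))**2)

print(SSR/SSTO, 1-SSE/SSTO)

# The second value reproduces r2_score(y_test, y_pred) from above (about 0.7819).
# The first value generally differs from it on held-out data.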
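
The identity SSTO = SSR + SSE is a property of a least-squares fit on its own training data; it is not guaranteed on a test set. The following sketch, using a hand-written one-feature fit and made-up numbers, computes each sum from its own definition and checks the identity on both splits. All helper names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def sums_of_squares(y_true, y_pred):
    """Return (SSR, SSE, SSTO), each computed from its own definition."""
    mean_true = sum(y_true) / len(y_true)
    ssr = sum((p - mean_true) ** 2 for p in y_pred)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ssto = sum((t - mean_true) ** 2 for t in y_true)
    return ssr, sse, ssto


# Made-up data for illustration only.
x_train, y_train = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
x_test, y_test = [1.5, 3.5, 5.5], [3.4, 6.6, 11.5]

slope, intercept = fit_line(x_train, y_train)
pred_train = [slope * v + intercept for v in x_train]
pred_test = [slope * v + intercept for v in x_test]

ssr_tr, sse_tr, ssto_tr = sums_of_squares(y_train, pred_train)
ssr_te, sse_te, ssto_te = sums_of_squares(y_test, pred_test)

gap_train = ssr_tr + sse_tr - ssto_tr  # zero up to rounding on training data
gap_test = ssr_te + sse_te - ssto_te   # nonzero on this held-out data
print(gap_train, gap_test)
```

This is why 1 - SSE/SSTO, with SSTO taken directly from the observed values, is the safer formula when scoring on a test set.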

Reference

2.6 - (Pearson) Correlation Coefficient r

2.5 - The Coefficient of Determination, r-squared

sklearn.metrics.r2_score

scipy.stats.pearsonr
