[Week 5 Recap] Visualizing Unsupervised Learning Models (K-Means, PCA)

1. K-Means Algorithm

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # needed for drawing on a 3D canvas
from sklearn import cluster
from sklearn import datasets
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")

# 1. Prepare the data (as arrays)
# (2. Feature selection: skipped)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# (3. Train/test split: skipped)

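# 4. Create model objects
# Keep three models in one list so they can all be visualized in a single loop.
# Each entry is a tuple: ('name', model)
estimators = [('k=8', cluster.KMeans(n_clusters=8)),
              ('k=3', cluster.KMeans(n_clusters=3)),
              # A single run from purely random initial centers can converge to a poor clustering.
              # Textbook K-Means picks the initial centers at random, but scikit-learn defaults to
              # init='k-means++', which picks the initial centers from the data points so that
              # they tend to be spread apart.
              ('k=3(r)', cluster.KMeans(n_clusters=3, n_init=1, init='random'))]  # random init
print(estimators[0])
print()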

# 5. Train the model
# (6. Test the model: skipped)
# 7. Visualize the model
fignum = 1   # figure number
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
    # One figure per model (unlike subplots, which put several graphs in one figure)
    fig = plt.figure(fignum, figsize=(7, 7))
    # Create 3D axes on the figure, then set the elevation and azimuth of the view
    ax = fig.add_subplot(projection='3d')
    ax.view_init(elev=48, azim=134)

    est.fit(X)

    labels = est.labels_   # cluster index assigned to each row

    # Plot three of the four features on the 3D axes
    # c=labels.astype(float): 150 values, each a cluster index (0 to 7 when k=8)
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float), edgecolor='w', s=100)

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.set_box_aspect(None, zoom=0.85)  # zoom < 1 shrinks the whole plot (in place of the deprecated ax.dist = 12)

    fignum = fignum + 1

plt.show()
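
To make the effect of the initial centers more concrete, here is a minimal NumPy sketch of the plain Lloyd iterations that K-Means is built on: assign every point to its nearest center, then move each center to the mean of its points. This is not scikit-learn's implementation; the function name `lloyd_kmeans`, the `init_centers` argument, and the toy blobs are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, n_iter=100):
    """Plain Lloyd iterations starting from the given initial centers (illustrative sketch)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: shape (n_samples, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)          # index of the nearest center for each point
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:            # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                           # centers stopped moving: converged
        centers = new_centers
    return labels, centers

# Hypothetical toy data: three tight, well-separated 2-D blobs of 20 points each
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([m + 0.5 * rng.standard_normal((20, 2)) for m in blob_means])

# Start from one point taken from each blob (a deliberately good initialization)
labels_toy, centers_toy = lloyd_kmeans(X_toy, X_toy[[0, 20, 40]])
print(centers_toy)
```

Because the result depends only on where the centers start, passing a different `init_centers` (for example, three points from the same blob) is an easy way to see how a poor start can change the final clustering.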

- Ground Truth (the true labels)

# Plot the ground truth (the true labels)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    # Place each species name above the mean position of its points
    ax.text3D(X[y == label, 3].mean(), X[y == label, 0].mean(), X[y == label, 2].mean() + 2,
              name, horizontalalignment='center')

ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='w', s=100)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.set_box_aspect(None, zoom=0.85)  # zoom < 1 shrinks the whole plot (in place of the deprecated ax.dist = 12)

plt.show()
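
Comparing the cluster plots with the ground truth by eye is tricky because cluster indices are arbitrary: cluster 0 need not correspond to class 0. One way to put a number on the agreement, sketched here as an assumption rather than something the original notes do, is `metrics.adjusted_rand_score`, which ignores how the labels are numbered. The `n_init=10` and `random_state=0` settings are choices made only for this example.

```python
from sklearn import cluster, datasets, metrics

iris = datasets.load_iris()

# Fit K-Means with k=3 (settings chosen for this example only)
km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Adjusted Rand index: 1.0 means identical groupings, values near 0 mean chance-level agreement
ari = metrics.adjusted_rand_score(iris.target, km.labels_)
print("ARI between K-Means (k=3) and the true species:", ari)

# The score does not depend on label numbering: a relabeled copy still scores 1.0
perm_check = metrics.adjusted_rand_score([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
print("ARI for a pure relabeling:", perm_check)
```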

2. PCA

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition   # PCA lives in the matrix decomposition module
from sklearn import datasets
import warnings
warnings.filterwarnings("ignore")

# 1. Prepare the data (as arrays)
# (2. Feature selection: skipped)
iris = datasets.load_iris()
x = iris.data
y = iris.target

# (3. Train/test split: skipped)

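# 4. Create model object
# PC1 and PC2 come out the same no matter how many components are requested
model = decomposition.PCA(n_components=1)
# A target cumulative explained-variance ratio can be passed instead of a count
# model = decomposition.PCA(n_components=0.95)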

# 5. Train the model 
model.fit(x)

# 6. Transform the data with the fitted model
x1 = model.transform(x)

# 7. Visualize the model
# Per-class histograms of PC1, to see where the class distributions overlap
import seaborn as sns
# bins: number of histogram bins
# kde=False: draw the histogram only, without a Gaussian kernel density estimate
sns.histplot(x1[y == 0, 0], color="b", bins=20, kde=False)
sns.histplot(x1[y == 1, 0], color="g", bins=20, kde=False)
sns.histplot(x1[y == 2, 0], color="r", bins=20, kde=False)
plt.xlim(-6, 6)
plt.show()
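
For reference, the transform step amounts to centering the data and projecting it onto the fitted component vectors. The sketch below checks this by hand against `transform`, using the fitted attributes `mean_` and `components_`; it assumes the default `whiten=False`, and the variable names are made up for this example.

```python
import numpy as np
from sklearn import datasets, decomposition

x_demo = datasets.load_iris().data

pca_demo = decomposition.PCA(n_components=1).fit(x_demo)

# Manual projection: subtract the per-feature mean, then take the dot product
# with each principal axis (the rows of components_)
manual = (x_demo - pca_demo.mean_) @ pca_demo.components_.T

# Should agree with the library's own transform up to floating-point error
print(np.allclose(manual, pca_demo.transform(x_demo)))
```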

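# PCA plot of 2 PCs
# Project onto three components first; plotting x directly would show raw features, not PCs
x3 = decomposition.PCA(n_components=3).fit_transform(x)
plt.scatter(x3[:, 0], x3[:, 1], c=iris.target)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

# PCA plot of 3 PCs
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)  # set the elevation and azimuth of the view
ax.scatter(x3[:, 0], x3[:, 1], x3[:, 2], c=iris.target, edgecolor='w', s=100)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_box_aspect(None, zoom=0.85)  # zoom < 1 shrinks the whole plot (in place of the deprecated ax.dist = 12)
plt.show()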

- Code to find how many PC axes are needed for the cumulative explained-variance ratio to reach a given percentage

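# Fit with all components here: with n_components=1 the cumulative sum never reaches 0.95
np.argmax(np.cumsum(decomposition.PCA().fit(x).explained_variance_ratio_) >= 0.95) + 1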
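
The one-liner above relies on `np.argmax` returning the index of the first `True`. When the threshold is never reached, `np.argmax` quietly returns 0, so the expression reports 1 component even though that is wrong. Below is a hedged NumPy-only sketch of the same idea with a guard for that case. The helper names and the tiny data matrix are assumptions for illustration; the matrix has orthogonal, zero-mean columns so the ratios can be checked by hand (column sums of squares 36, 4 and 0.04).

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance along each principal axis, via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    var = s ** 2 / (len(X) - 1)                # variance along each principal axis
    return var / var.sum()

def n_components_for(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    reached = np.cumsum(ratios) >= threshold
    if not reached.any():
        raise ValueError("threshold is never reached; fit more components")
    return int(np.argmax(reached)) + 1         # argmax gives the first True

# Hypothetical data: three orthogonal, zero-mean columns
X_small = np.array([[ 3.0,  1.0,  0.1],
                    [-3.0,  1.0, -0.1],
                    [ 3.0, -1.0, -0.1],
                    [-3.0, -1.0,  0.1]])

ratios = explained_variance_ratio(X_small)     # about [0.899, 0.100, 0.001]
print(ratios)
print(n_components_for(ratios, 0.95))
```

With these ratios the first component alone covers about 90% of the variance, so a 0.95 threshold needs two components, while a 0.8 threshold needs only one.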

올라프의 [데이터 사이언스] 공부 일기
