[Week 5 Summary] A Framework for the scikit-learn Machine Learning Model Training Steps

1. Load the dataset

import pandas as pd
df = pd.read_csv('   .csv')

# df.describe() / df.info() / df.shape
# To check whether a variable is categorical:
# df[?].value_counts(sort=False)
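As a minimal sketch of the inspection calls above, the following uses scikit-learn's bundled diabetes dataset as a DataFrame. The dataset and the `sex` column are assumptions made for illustration, since the original CSV file name is not given.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Assumption: the bundled diabetes dataset stands in for the unspecified CSV file.
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame   # the 10 feature columns plus the 'target' column

print(df.shape)       # (number of rows, number of columns)
df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics per column

# A column with only a handful of distinct values is likely categorical.
sex_counts = df['sex'].value_counts(sort=False)
print(sex_counts)
```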

- Split the data into X and y

- diabetes.data[:, 7:8]

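-> When selecting a single column, X must be a 2-D array of shape (n_samples, n_features), so use [:, 7:8] rather than [:, 7], which returns a 1-D array.

- Always pass X to the model as a 2-D array (matrix).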

- diabetes.data[:, 1:8]

- diabetes.data[:, (3,5)] or diabetes.data[:, [3,5]]

- Feature Selection

- Convert the data type to np.array
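The slicing rules above can be checked directly on array shapes. This is a minimal sketch that assumes `diabetes` is the dataset returned by scikit-learn's `load_diabetes()`; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Assumption: `diabetes` comes from scikit-learn's bundled dataset loader.
diabetes = load_diabetes()
data = np.asarray(diabetes.data)   # already a NumPy array; asarray makes that explicit

col_1d = data[:, 7]          # integer index drops the column axis -> 1-D
col_2d = data[:, 7:8]        # slice keeps the column axis -> 2-D with one column
cols_range = data[:, 1:8]    # columns 1 through 7
cols_pick = data[:, [3, 5]]  # only columns 3 and 5

print(col_1d.shape)      # (442,)
print(col_2d.shape)      # (442, 1)
print(cols_range.shape)  # (442, 7)
print(cols_pick.shape)   # (442, 2)

# reshape(-1, 1) is an equivalent way to turn a 1-D column into a 2-D matrix.
col_reshaped = col_1d.reshape(-1, 1)
```

Either `[:, 7:8]` or `.reshape(-1, 1)` produces an X that scikit-learn estimators accept for a single feature.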

2. Split the data into train/test sets

# test_size and random_state must be passed as keyword arguments
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=test_size, random_state=random_state)

# print(X_train.shape)
# print(X_test.shape)
# print(y_train.shape)
# print(y_test.shape)
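A minimal sketch of the split, assuming the diabetes dataset and illustrative values `test_size=0.2` and `random_state=42` (neither value comes from the original notes):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)   # X: (442, 10), y: (442,)

# Assumed values: hold out 20% for testing, fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (353, 10)
print(X_test.shape)   # (89, 10)
print(y_train.shape)  # (353,)
print(y_test.shape)   # (89,)
```

Fixing `random_state` makes the split reproducible across runs.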

- Feature Normalization

- Continuous variable -> Feature Normalization (Min-Max algorithm / Standardization)

- Categorical (discrete) variable -> one-hot encoding (dummy variables)

- For Decision Tree based models, as opposed to linear models, Feature Normalization is not required.

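- Never fit Feature Normalization on the test data.

- The normalization parameters must be learned from the training data only; the same fitted transformation is then applied to the test data.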
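The rules above can be sketched as follows. The toy arrays and the `color` column are hypothetical, and `StandardScaler` and `pd.get_dummies` are just one possible choice for standardization and one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one continuous feature, already split into train/test.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std are learned from the training data only
X_test_scaled = scaler.transform(X_test)        # the test data reuses the training mean/std

print(scaler.mean_)     # mean of the training column
print(X_test_scaled)    # test value expressed in training-set standard deviations

# Hypothetical categorical column -> one-hot (dummy) columns.
colors = pd.DataFrame({'color': ['red', 'green', 'red']})
dummies = pd.get_dummies(colors, columns=['color'])
print(dummies.columns.tolist())  # ['color_green', 'color_red']
```

Calling `fit_transform` on the test set would leak test-set statistics into preprocessing, which is why only `transform` is used there.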

3. Create a model object (Model Instance)

model = sklearn.linear_model.LinearRegression()
model = sklearn.linear_model.LogisticRegression()

model = sklearn.neighbors.KNeighborsClassifier(n_neighbors)

model = sklearn.cluster.KMeans(n_clusters)

model = sklearn.decomposition.PCA(n_components)

# SVC accepts its parameters as keyword arguments only
model = sklearn.svm.SVC(kernel=kernel, C=C, gamma=gamma)
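As a sketch of what these constructor calls look like with concrete hyperparameters, the values below are illustrative assumptions, not recommendations from the notes:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# All hyperparameter values here are assumed for illustration.
knn = KNeighborsClassifier(n_neighbors=5)
kmeans = KMeans(n_clusters=3)
pca = PCA(n_components=2)
svc = SVC(kernel='rbf', C=1.0, gamma='scale')

# Creating an instance only stores the hyperparameters; nothing is learned until fit().
print(knn.get_params()['n_neighbors'])
print(svc.get_params()['kernel'])
```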

4. Train the model (Model fitting)

model.fit(X_train, y_train)
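A minimal sketch of fitting, assuming a `LinearRegression` model on the diabetes dataset with an illustrative split:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0   # assumed values
)

model = LinearRegression()
model.fit(X_train, y_train)   # learns one coefficient per feature plus an intercept

print(model.coef_.shape)      # (10,)
print(model.intercept_)
```

After `fit`, the learned parameters are available as attributes ending in an underscore, such as `coef_` and `intercept_`.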

5. Predict new data with the model (Predict on test data)

model.predict(X_test)

# logistic regression
model.predict(X_test)        # predicted class labels
model.predict_proba(X_test)  # predicted probability for each class
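The difference between `predict` and `predict_proba` can be sketched as follows. The iris dataset, `max_iter=1000`, and the split parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0   # assumed values
)

model = LogisticRegression(max_iter=1000)  # max_iter raised so the solver converges
model.fit(X_train, y_train)

labels = model.predict(X_test)        # one class label per sample
probs = model.predict_proba(X_test)   # one probability per class per sample

print(labels.shape)  # (30,)
print(probs.shape)   # (30, 3)

# Each row of probabilities sums to 1, and the most probable class is the predicted label.
most_probable = model.classes_[probs.argmax(axis=1)]
```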

- Evaluate the Cost Function (sklearn.metrics.~)

- Regression : MSE, MAE, MAPE

- Classification : Softmax Algorithm -> Cross-entropy [-\sum_{i} y^{(i)} \log(h_\theta(x^{(i)}))]

from sklearn.metrics import mean_squared_error
# Argument order is (y_true, y_pred)
print('MSE(Training data) : ', mean_squared_error(y_train, model.predict(X_train)))
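The metrics listed above can be sketched on small hand-checkable arrays. The numbers are hypothetical, and `log_loss` is used here as scikit-learn's cross-entropy metric; note that it averages over samples rather than summing.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    log_loss,
)

# Hypothetical regression targets and predictions.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

mse = mean_squared_error(y_true, y_pred)               # (10^2 + 20^2 + 0^2) / 3
mae = mean_absolute_error(y_true, y_pred)              # (10 + 20 + 0) / 3
mape = mean_absolute_percentage_error(y_true, y_pred)  # (0.1 + 0.1 + 0) / 3, as a fraction
print(mse, mae, mape)

# Hypothetical binary labels and predicted probabilities of class 1.
labels = [0, 1, 1]
prob_class1 = [0.1, 0.8, 0.6]
ce = log_loss(labels, prob_class1)   # mean of -log(probability assigned to the true class)
print(ce)
```

`mean_absolute_percentage_error` returns a fraction, so multiply by 100 to report a percentage.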

올라프의 [데이터 사이언스] 공부 일기