[Week 7 Summary] AutoML, PyCaret Practice (Colab Recommended)

0. Import Library

# !pip install pycaret==2.3.10
# !pip install jinja2==3.1.2
# !pip install xgboost==1.6.0

from google.colab import files
uploaded = files.upload()
# For large files, it is better to store them in Google Drive and load them from there

from pycaret.classification import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Prepare train & test data + 2. Data Preprocessing = Setup

titanic_df = pd.read_csv("titanic_modified.csv")

# Handles train_test_split and preprocessing in a single call
model = setup(data=titanic_df, 
              target='Survived', 
              train_size=0.7, # default value
              session_id=123) # Random seed 

# 11	Transformed Train Set	(623, 18) : Training data (70% of the rows)
# 12	Transformed Test Set	(268, 18) : Test data (30% of the rows)
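To see where the 623 / 268 row counts come from, here is a minimal sketch of a seeded shuffle-and-split in plain Python. It is not PyCaret's internal code; the helper name `split_indices` is hypothetical, and the rounding rule (round the test share up) is an assumption that happens to reproduce the sizes reported by setup for 891 rows.

```python
import math
import random

def split_indices(n_rows, train_size=0.7, seed=123):
    """Shuffle row indices with a fixed seed, then split them into train and test."""
    # Assumption: the test share is rounded up and the remainder goes to train
    n_test = math.ceil(n_rows * (1 - train_size))
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(891)
print(len(train_idx), len(test_idx))  # 623 268
```

Fixing the seed plays the same role as session_id in setup: rerunning the cell gives the same split.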

3. Build the model & Set the criterion = Create_model, Compare_models, Model_blending

1) When you want to pick a specific model type and tune it yourself

Create_model

# creates a model and scores it using stratified cross validation
xgb = create_model('xgboost', fold=5) 
# tunes the hyperparameters of a model on a pre-defined search space and scores it using stratified cross validation
xgb_tuned = tune_model(xgb, optimize='Accuracy')
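Both functions above score the model with stratified cross validation. As a rough illustration of what "stratified" means, the sketch below deals the rows of each class out to the folds in turn so that every fold keeps the overall class ratio. The function `stratified_folds` and the toy labels are made up for this example, and real implementations typically also shuffle the rows.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign row indices to folds so each fold keeps the overall class ratio."""
    rows_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        rows_by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for rows in rows_by_class.values():
        # Deal the rows of one class out to the folds like cards
        for position, idx in enumerate(rows):
            folds[position % n_folds].append(idx)
    return folds

labels = [0] * 60 + [1] * 40  # toy labels: 60% class 0, 40% class 1
folds = stratified_folds(labels, n_folds=5)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # each fold: 20 rows, 8 of them class 1
```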

2) When you want to compare several model types

Compare_models

# When compare_models runs in the cell below,
# an [ AttributeError: 'str' object has no attribute 'decode' ] error may occur depending on the state of the data.
# In that case, run the code below instead to exclude the Logistic Regression model from the list of models being compared.
# top_3_models = compare_models(exclude=['lr'], 
#                               sort='Accuracy', # Other options are 'AUC', 'Recall', 'Precision', 'F1', 'Kappa' and 'MCC'
#                               n_select = 3) # Select top n models 

top_3_models = compare_models(sort='Accuracy', # Other options are 'AUC', 'Recall', 'Precision', 'F1', 'Kappa' and 'MCC'
                              n_select = 3) # Select top n models
# top_3_models
# Shows the top 3 models and their parameters
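Conceptually, sort and n_select amount to ranking the candidate models by one cross-validated metric and keeping the first n. The sketch below shows only that ranking step. The helper `select_top_models` is hypothetical and the scores are invented placeholder numbers, not results from the Titanic data.

```python
def select_top_models(scores, sort="Accuracy", n_select=3):
    """Rank model names by the chosen metric (higher is better) and keep the top n."""
    ranked = sorted(scores.items(), key=lambda item: item[1][sort], reverse=True)
    return [name for name, _ in ranked[:n_select]]

# Placeholder scores for illustration only (not real results)
cv_scores = {
    "lr": {"Accuracy": 0.80, "AUC": 0.85},
    "rf": {"Accuracy": 0.82, "AUC": 0.86},
    "xgboost": {"Accuracy": 0.81, "AUC": 0.87},
    "knn": {"Accuracy": 0.70, "AUC": 0.75},
}

top_by_accuracy = select_top_models(cv_scores, sort="Accuracy", n_select=3)
top_by_auc = select_top_models(cv_scores, sort="AUC", n_select=3)
print(top_by_accuracy)  # ['rf', 'xgboost', 'lr']
print(top_by_auc)       # ['xgboost', 'rf', 'lr']
```

Changing the sort metric can change which model comes first, which is why the choice of metric matters before blending or tuning.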

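- You can pull out any of the selected models by index and use it, including for visualization

# Any plot produced by evaluate_model can also be requested individually

best_model = top_3_models[0]  # the top-ranked model (not necessarily XGBoost)
plot_model(best_model, plot = 'auc')
# plot_model(best_model, plot = 'pr')
# plot_model(best_model, plot='feature')
# plot_model(best_model, plot = 'confusion_matrix')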

- Predictions on the test data are produced automatically

predict_model(top_3_models[0]) # Predictions for the test data that setup held out earlier

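- The Top N models can also be tuned

# optimize must name an available metric; Log Loss is not among the default classification metrics unless it is registered with add_metric
tuned_top3 = [tune_model(i, optimize='Accuracy') for i in top_3_models]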
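As far as I know, tune_model in PyCaret 2.x defaults to a random search over a predefined grid. The sketch below shows the general idea of random search only. `random_search`, `toy_score`, and the search space are hypothetical stand-ins; in real tuning the score would be a cross-validated metric rather than a formula.

```python
import random

def random_search(score_fn, search_space, n_iter=10, seed=123):
    """Sample n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Made-up score that peaks at max_depth=4, learning_rate=0.1."""
    return -((params["max_depth"] - 4) ** 2) - 10 * abs(params["learning_rate"] - 0.1)

search_space = {
    "max_depth": [2, 3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}
best_params, best_score = random_search(toy_score, search_space, n_iter=10, seed=123)
print(best_params, best_score)
```

Because only a sample of the grid is tried, the result is not guaranteed to be the global best; raising n_iter trades time for coverage.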

3) When you want to blend the Top N models

Blend_models

The Top N models selected above with compare_models() can also be blended into a single model.

blended = blend_models(estimator_list=top_3_models, 
                       fold=10, # default
                       optimize='Accuracy',
                       method = 'hard')

# method 'hard' : uses predicted class labels for majority rule voting. 
# (e.g. votes 0, 0, 1 -> 0)
# method 'soft' : predicts the class label based on the argmax of the sums of the predicted probabilities, 
# which is recommended for an ensemble of well-calibrated classifiers. (uses each model's predicted probabilities, not just the 0/0/1 votes)
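The difference between the two voting methods is easiest to see on a case where they disagree. The sketch below is a plain-Python illustration for binary labels, not PyCaret's implementation; `hard_vote`, `soft_vote`, and the three probabilities are made up for the example.

```python
from collections import Counter

def hard_vote(probabilities, threshold=0.5):
    """Turn each model's P(class 1) into a 0/1 vote and return the majority label."""
    votes = [int(p >= threshold) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities):
    """Sum the predicted probabilities per class and return the argmax."""
    sum_positive = sum(probabilities)
    sum_negative = sum(1 - p for p in probabilities)
    return int(sum_positive > sum_negative)

# P(class 1) from three hypothetical models: two weakly say 0, one strongly says 1
probs = [0.45, 0.45, 0.90]
print(hard_vote(probs))  # 0  (votes are 0, 0, 1)
print(soft_vote(probs))  # 1  (class 1 sums to 1.8, class 0 to 1.2)
```

Soft voting lets one confident model outweigh two lukewarm ones, which only makes sense when the probabilities are reasonably well calibrated.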

- Predictions on the test data are produced automatically

predict_model(blended)

4. Train the model = Finalize Model

Train the model on the entire dataset, including the test/hold-out sample.

# finalize_model() 
# - fits the model onto the complete dataset including the test/hold-out sample (30% in this case). 
# - The purpose of this function is to train the model on the complete dataset before it is deployed in production.

final_blended = finalize_model(blended)
print(final_blended)
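The effect of finalizing can be illustrated with a toy stand-in for an estimator. In the sketch below, `MajorityClassModel` and both label lists are invented for illustration; the point is only that refitting on train plus hold-out rows gives a model that has seen more data and may behave differently.

```python
from collections import Counter

class MajorityClassModel:
    """Toy stand-in for a real estimator: always predicts the most common label."""

    def fit(self, labels):
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        self.n_seen_ = len(labels)
        return self

# Made-up labels for illustration only
train_labels = [0, 0, 0, 1, 1, 0, 1]  # the 70% used during cross validation
holdout_labels = [1, 1, 1]            # the 30% held out by setup

validated = MajorityClassModel().fit(train_labels)
finalized = MajorityClassModel().fit(train_labels + holdout_labels)

print(validated.n_seen_, validated.prediction_)  # 7 0
print(finalized.n_seen_, finalized.prediction_)  # 10 1
```

Once the hold-out rows have been used for fitting, they can no longer serve as an independent check of the finalized model.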

5. Test the model = Predict_model

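# After finalize_model the hold-out rows are part of the training data, so this score is optimistic rather than an independent estimate
predict_model(final_blended)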

- Predicting on new data

'''1) If you have genuinely new data'''
# data_unseen = ? # unseen data as pd.DataFrame (without labels)
# unseen_predictions = predict_model(final_blended, data=data_unseen)
# unseen_predictions

'''2) If you simulate new data by sampling a hold-out set from the existing data'''
titanic_df = pd.read_csv("titanic_modified.csv")
holdout_data = titanic_df.sample(frac=0.10, random_state=0).reset_index(drop=True)   
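# Samples 10% of the DataFrame and treats it as if it were brand-new data
# Note: these rows were also used to fit final_blended, so scores on them will look better than on truly unseen data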

unseen_predictions = predict_model(final_blended, data=holdout_data)   
# Pass the trained model and the new data; the saved pipeline applies the same feature engineering to the new data
unseen_predictions

+ Evaluate Model

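from pycaret.utils import check_metric
# In PyCaret 2.x, predict_model keeps the true 'Survived' column and adds the prediction as 'Label'
check_metric(unseen_predictions['Survived'], unseen_predictions['Label'], metric = 'Accuracy')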
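For reference, the Accuracy metric itself is just the share of rows where the predicted label matches the actual one. A minimal sketch, with a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(actual, predicted, digits=4):
    """Fraction of rows where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return round(correct / len(actual), digits)

# Made-up labels for illustration only
actual = [1, 0, 0, 1, 1]
predicted = [1, 0, 1, 1, 0]
print(accuracy(actual, predicted))  # 0.6 (3 of 5 match)
```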

+ Save_model

save_model(final_blended, 'final_blended_2022')
# !ls
# files.download('final_blended_2022.pkl')

# Run only PyCaret on Colab, then download just final_blended_2022.pkl and continue working in Jupyter
# loaded_model = load_model('final_blended_2022')
# unseen_predictions = predict_model(loaded_model, data=holdout_data)
# unseen_predictions
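The save-then-load workflow is a serialization round trip: write the fitted object to a .pkl file in one environment and read it back in another. The sketch below shows that pattern with the standard pickle module and a plain dictionary standing in for a fitted pipeline. `save_object`, `load_object`, and `toy_pipeline` are hypothetical names, and this is not how PyCaret implements save_model internally.

```python
import os
import pickle
import tempfile

def save_object(obj, name, directory):
    """Serialize obj to <directory>/<name>.pkl and return the file path."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path

def load_object(name, directory):
    """Read back the object stored in <directory>/<name>.pkl."""
    with open(os.path.join(directory, name + ".pkl"), "rb") as handle:
        return pickle.load(handle)

# A plain dict stands in for a fitted preprocessing + model pipeline
toy_pipeline = {"steps": ["impute", "encode", "model"], "threshold": 0.5}

with tempfile.TemporaryDirectory() as workdir:
    saved_path = save_object(toy_pipeline, "toy_pipeline_2022", workdir)
    restored = load_object("toy_pipeline_2022", workdir)

print(restored == toy_pipeline)  # True
```

When moving a real model between Colab and Jupyter, keeping the same library versions in both environments reduces the chance of unpickling errors.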


Olaf's [Data Science] Study Diary (올라프의 [데이터 사이언스] 공부 일기)
