(Source: https://github.com/ResidentMario/missingno)
0. Installing the libraries
# !pip install missingno==0.5.1
# !pip install quilt==2.9.15
# !quilt install ResidentMario/missingno_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import quilt
import missingno as msno
1. Loading the data
from quilt.data.ResidentMario import missingno_data
collisions = missingno_data.nyc_collision_factors()
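# Missing entries are stored as the string "nan"; convert them to real NaN values
# so that pandas and missingno treat them as missing.
collisions = collisions.replace("nan", np.nan)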
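To see why the `replace("nan", np.nan)` step matters, here is a minimal sketch on a made-up toy frame (the column names and values are illustrative assumptions, not taken from the collisions dataset). Before the replacement pandas counts no missing values at all, because `"nan"` is just an ordinary string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the string "nan" stands in for missing values.
toy = pd.DataFrame({
    "borough": ["QUEENS", "nan", "BRONX"],
    "zip_code": ["11101", "nan", "nan"],
})

# The string "nan" is not recognized as missing.
missing_before = int(toy.isnull().sum().sum())

# After the replacement the same cells are real NaN values.
cleaned = toy.replace("nan", np.nan)
missing_per_column = cleaned.isnull().sum()

print(missing_before)
print(missing_per_column)
```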
2. Visualizing missing values
1) General data
msno.matrix(collisions.sample(250))
# The sparkline at right summarizes the general shape of the data completeness and
# points out the rows with the maximum and minimum nullity in the dataset.
# This visualization will comfortably accommodate up to 50 labelled variables.
# Past that range labels begin to overlap or become unreadable, and by default large displays omit them.

# Bar chart of nullity by column: each bar shows how much of that column is filled in.
msno.bar(collisions.sample(1000))
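The numbers behind these two plots can be approximated with plain pandas. The sketch below uses a made-up toy frame (an assumption for illustration) to compute the per-column filled counts that a nullity bar chart displays and the per-row completeness that the matrix sparkline summarizes. It does not reproduce missingno's own plotting code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Per-column view (what a nullity bar chart shows).
filled_per_column = toy.notnull().sum()      # count of present values
filled_fraction = toy.notnull().mean()       # share of present values

# Per-row view (what the sparkline next to the matrix summarizes).
filled_per_row = toy.notnull().sum(axis=1)
least_complete_row = filled_per_row.idxmin()

print(filled_per_column)
print(filled_fraction)
print(filled_per_row)
```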

# The missingno correlation heatmap measures nullity correlation:
# how strongly the presence or absence of one variable affects the presence of another
msno.heatmap(collisions)
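Nullity correlation can be sketched by hand as the ordinary correlation between the missing/present indicators of each pair of columns. The toy frame below is an assumption for illustration, and missingno's exact preprocessing may differ. A value of 1 means two columns are always missing together, and -1 means one is present exactly when the other is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "y": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],        # missing in the same rows as x
    "z": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],  # present only where x is missing
    "full": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],           # never missing
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isnull().astype(int)

# Columns that are always present (or always missing) have no variation,
# so their correlation is undefined; drop them first.
varying = nullity.loc[:, nullity.nunique() > 1]

nullity_corr = varying.corr()
print(nullity_corr)
```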

# The dendrogram allows you to more fully correlate variable completion,
# revealing trends deeper than the pairwise ones visible in the correlation heatmap:
msno.dendrogram(collisions)
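A dendrogram groups columns whose missingness patterns are similar. As a rough sketch of the kind of input such a grouping works from, the code below counts, for every pair of columns, the rows where one is missing and the other is not. The toy frame and the choice of a simple disagreement count are assumptions for illustration, not necessarily the distance missingno uses internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "y": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "w": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "z": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
})

# Rows x columns array of 0/1 missingness indicators.
mask = toy.isnull().to_numpy().astype(int)

# For each pair of columns, count the rows where their indicators differ.
counts = np.abs(mask[:, :, None] - mask[:, None, :]).sum(axis=0)
disagreements = pd.DataFrame(counts, index=toy.columns, columns=toy.columns)

print(disagreements)
```

Columns with a count of 0 (here `x` and `y`) would be merged first by a hierarchical clustering, while a column such as `z`, which disagrees with the others on most rows, would join last.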

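2) Time-series data
# Build a random 50 x 20 pattern of True/False, then turn every False into a missing value.
null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
# Index the 50 rows by month (Jan 2011 to Feb 2015) and label the y-axis by business quarter.
# Note: recent pandas releases deprecate the 'BQ' alias in favor of 'BQE'; adjust if you see a warning.
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')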
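The time-series example only works because the monthly period range has exactly as many entries as the frame has rows. The sketch below checks that without plotting anything. The seeded generator and the use of `DataFrame.where` to blank out the `False` cells are assumptions chosen to make the example reproducible, not the original code.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, so the pattern is reproducible.
rng = np.random.default_rng(0)
mask = rng.random((50, 20)) > 0.5

# Keep True cells and turn False cells into missing values.
values = pd.DataFrame(mask).where(mask)

# Monthly periods from January 2011 through February 2015: 50 in total.
periods = pd.period_range('1/1/2011', '2/1/2015', freq='M')

# One period per row, so the range can serve as the index.
indexed = values.set_index(periods)

print(len(periods), indexed.shape)
print(indexed.index[0], indexed.index[-1])
```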
