[Week 3 Recap] Building a Word Cloud After Okt Text Analysis

The previous post covered crawling news articles. Here we run text analysis on the articles crawled there and then turn the result into a Word Cloud.

https://lafgh.tistory.com/29


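1. Installing the libraries

Korean text analysis requires both NLTK and KoNLPy to be installed.

Installation order

1) Install Microsoft Build Tools 2015

2) Install the Java SE Development Kit (JDK)

3) Run the cell below

# If this fails, run the commands from a cmd window opened as administrator
# If installation keeps failing, a version mismatch is a likely cause, so check the versions
# (the JPype1 wheel below is built for Python 3.8 on 64-bit Windows)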
!pip install JPype1-1.2.0-cp38-cp38-win_amd64.whl
!pip install konlpy==0.5.2
!pip install tweepy==3.10.0

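2. Loading the crawled data

For text analysis, the text data has to be prepared as a single str.

1) articles = df['Article'].to_list()

Take the df['Article'] Series from the DataFrame and convert it to a list with the .to_list() method.

2) articles = ' '.join(articles)

Join the list items into one string, using a chosen separator (here a single space).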
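One thing to watch in the join step: str.join raises a TypeError if the list contains anything other than str, and a crawled column can hold missing values. The sketch below is a minimal illustration, assuming a hypothetical helper named join_articles and made-up sample values standing in for df['Article'].to_list().

```python
def join_articles(items, sep=" "):
    """Join article bodies into one string, skipping non-string entries."""
    # A crawled column may contain missing values (e.g. float NaN or None),
    # and str.join raises TypeError on anything that is not a str.
    texts = [item.strip() for item in items if isinstance(item, str)]
    # Drop entries that are empty after stripping whitespace.
    return sep.join(text for text in texts if text)


# Stand-in for df['Article'].to_list(); the sample values are made up.
sample_articles = ["첫 번째 기사 본문", float("nan"), "  두 번째 기사 본문 ", None, ""]
articles_text = join_articles(sample_articles)
print(articles_text)
```

If the column is known to contain only strings, the plain ' '.join(articles) from the post is enough.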

3. Word normalization, stemming, and POS tagging (Okt morphological analyzer)

Options for Okt POS tagging

- norm (normalization) / stem (stemming)

- Example with stem: 한국어/Noun, 를/Josa, 처리/Noun, 하다/Verb, 예시/Noun, 이다/Adjective, ㅋㅋ/KoreanParticle

from konlpy.tag import Okt

tokenizer = Okt()
# POS tagging with normalization and stemming
raw_pos_tagged = tokenizer.pos(articles, norm=True, stem=True)
raw_pos_tagged


# <<<Important>>>
# The kernel sometimes dies here (The kernel appears to have died.)
# The cause is emoji embedded in the text data
# Remove them before POS tagging (search: removing emoji from a Python string)
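As a rough sketch of the emoji removal mentioned above: most emoji sit outside the Basic Multilingual Plane (code points U+10000 and above), so one simple approach is to delete every character in that range. The helper name strip_non_bmp and the sample string are assumptions for illustration; older symbols such as ☺ (U+263A) lie inside the BMP and are not removed by this pattern.

```python
import re

# Assumption: most emoji live outside the Basic Multilingual Plane
# (code points U+10000 and above), so this pattern targets that range only.
NON_BMP_PATTERN = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text):
    """Remove every character above U+FFFF (covers most emoji)."""
    return NON_BMP_PATTERN.sub("", text)


# Made-up sample text; Hangul and ASCII are inside the BMP and are kept.
sample = "오늘 날씨 최고😀🔥 hello"
cleaned = strip_non_bmp(sample)
print(cleaned)
```

In this post's flow it would be applied as articles = strip_non_bmp(articles) before calling tokenizer.pos.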

- raw_pos_tagged is a list of tuples, each pairing a word with its POS tag

- e.g., [('인천', 'Noun'), ... ]
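Because each item is a (word, tag) tuple, it can be unpacked directly in a loop or comprehension. The sketch below uses a hand-written sample in the same shape as raw_pos_tagged (not real Okt output) to count how often each tag appears and to pull out only the nouns; looking at the tag counts first can help decide which tags to exclude in the next step.

```python
from collections import Counter

# Hand-written sample in the same shape as raw_pos_tagged (not real output).
sample_pos_tagged = [
    ("인천", "Noun"), ("에서", "Josa"), ("행사", "Noun"),
    ("를", "Josa"), ("열다", "Verb"), (".", "Punctuation"),
]

# Unpack each (word, tag) tuple and count how often every tag appears.
tag_counts = Counter(tag for _word, tag in sample_pos_tagged)

# Keep only the words carrying a given tag, here nouns.
nouns = [word for word, tag in sample_pos_tagged if tag == "Noun"]

print(tag_counts)
print(nouns)
```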

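4. Counting word frequencies

# Frequent but uninformative stems to exclude
del_list = ['하다', '있다', '되다', '이다', '돼다', '않다', '그렇다', '아니다', '이렇다', '어떻다']

word_cleaned = []
for word in raw_pos_tagged:   # tuple: ('서울', 'Noun')
    # Skip particles, endings, punctuation, and foreign-script tokens
    if word[1] not in ["Josa", "Eomi", "Punctuation", "Foreign"]:
        # Skip single-character words and the stop words in del_list
        if len(word[0]) != 1 and word[0] not in del_list:
            word_cleaned.append(word[0])

word_cleaned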

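# '안/된다', '못/쓰겠어요', '안/들어요'
# For sentiment analysis with deep learning, negators such as '안' and '못' matter,
# so single-character words are not dropped there

# Even a single-character word such as '한' can be important
# In that case, add a branch: if the word is '한', keep it
# and handle the remaining cases with elif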
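One way to express the exception described above is a whitelist of single-character words to keep, rather than a chain of if/elif branches. This is a sketch under assumptions: clean_tokens, keep_single, and the sample tuples are hypothetical names and made-up data, while the excluded tags are the ones used in the loop in this section.

```python
def clean_tokens(pos_tagged, stop_words, keep_single=()):
    """Filter (word, tag) tuples down to the words worth counting."""
    excluded_tags = {"Josa", "Eomi", "Punctuation", "Foreign"}
    cleaned = []
    for word, tag in pos_tagged:
        # Drop excluded parts of speech and explicit stop words.
        if tag in excluded_tags or word in stop_words:
            continue
        # Drop single-character words unless they are whitelisted.
        if len(word) == 1 and word not in keep_single:
            continue
        cleaned.append(word)
    return cleaned


# Made-up sample in the same shape as raw_pos_tagged.
sample_pos_tagged = [
    ("안", "VerbPrefix"), ("되다", "Verb"), ("한", "Noun"),
    ("를", "Josa"), ("것", "Noun"), ("서울", "Noun"),
]
kept = clean_tokens(
    sample_pos_tagged,
    stop_words={"되다"},
    keep_single={"안", "못", "한"},
)
print(kept)
```

Passing an empty keep_single reproduces the stricter behavior of dropping every single-character word.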

from collections import Counter
result = Counter(word_cleaned)
word_dic = dict(result)   # convert the type from Counter to dict
word_dic

+ Extra: sorting by frequency and visualizing with line and bar graphs

1. Sorting by frequency (sorted)

# Using a lambda, take the dict built above item by item (as tuples)
# and sort in descending order (reverse=True) by each tuple's value (x[1])
sorted_word_dic = sorted(word_dic.items(), key=lambda x:x[1], reverse=True)
sorted_word_dic

# for word, count in sorted_word_dic[:50]:
#     print("{0}({1})".format(word, count), end=" ")
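As a side note, a Counter already offers most_common(n), which returns the n most frequent (word, count) pairs in descending order, so the sort can also be done before converting to a dict. The sketch below uses a made-up word list to show that it matches the sorted approach for the top entries.

```python
from collections import Counter

# Made-up word list standing in for word_cleaned.
sample_words = ["데이터", "분석", "데이터", "서비스", "데이터", "분석"]
counts = Counter(sample_words)

# The two most frequent words, highest count first.
top_two = counts.most_common(2)

# The same ordering obtained with sorted() on the dict items.
sorted_all = sorted(dict(counts).items(), key=lambda x: x[1], reverse=True)

print(top_two)
print(sorted_all)
```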

2. Visualizing word frequencies (line graph)

import nltk
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

# Put the path to a Korean font here (covered in the 'crime visualization' notebook)
# e.g. NanumGothic.otf
font_name = matplotlib.font_manager.FontProperties(fname="C:/Windows/Fonts/malgun.ttf").get_name()
matplotlib.rc('font', family=font_name)

# nltk provides the Text class, which makes it easier to count and plot word frequencies

word_counted = nltk.Text(word_cleaned)
plt.figure(figsize=(15, 7))   # set the size of the plot area
word_counted.plot(50)   # plot the graph, showing the top 50 words

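3. Visualizing word frequencies (bar graph)

- A bar graph makes per-word frequencies easier to read, but unlike the line graph, which Text() can draw directly, a bar graph is hard to produce with NLTK's functions alone.

- So we apply NLTK's FreqDist, put the data into a pd.DataFrame, and plot from there.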

import pandas as pd

### Frequency Distribution
word_frequency = nltk.FreqDist(word_cleaned)
# >FreqDist({'데이터': 538, '분석': 252, '서비스': 154, '빅데이터': 115, ...})


### Build a DataFrame from the values in the word-frequency dict
# pd.DataFrame(data=values, index=keys)
df = pd.DataFrame(list(word_frequency.values()), index=list(word_frequency.keys()))

# Sort by frequency in descending order.
result = df.sort_values([0], ascending=False)

# The full set of words is too large, so take only the 50 most frequent for plotting.
result = result[:50]

result.plot(kind='bar', legend=False, figsize=(15,5))
plt.show()

5. Making the word cloud

1) conda install -c https://conda.anaconda.org/conda-forge wordcloud==1.5.0

2) Import the libraries

from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image   # Python Image Library
import numpy as np

3) Basic usage

word_cloud = WordCloud(font_path="C:/Windows/Fonts/malgun.ttf", # font_path="C:/Windows/Fonts/NanumSquareB.ttf"
                       width=2000, height=1000,   # output size in pixels
                       # background_color='white',   # optionally set a white background
                       max_font_size=100).generate_from_frequencies(word_dic) # Max font-size

plt.figure(figsize=(15,15))
plt.imshow(word_cloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

4) Applying a mask from an image of your choice

from PIL import Image
# Convert the image into a numeric array
np.array(Image.open('python_mask.jpg')).shape
# plt.imshow takes an np.array and displays it as an image
plt.imshow(np.array(Image.open('python_mask.jpg')))

# Object that generates colors from an image
from wordcloud import ImageColorGenerator

python_coloring = np.array(Image.open("python_mask.jpg"))
image_colors = ImageColorGenerator(python_coloring)

word_cloud = WordCloud(font_path="C:/Windows/Fonts/malgun.ttf",
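                       # With a mask, the output takes the mask image's size, so its resolution cannot exceed the original image's (Tools > Size > Large)
                       # Workaround: enlarge the image in PowerPoint and export it as an image file (slight pixelation is fine)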
                       width=2000, height=1000,  
                       mask=python_coloring, 
                       background_color='white').generate_from_frequencies(word_dic)

plt.figure(figsize=(15,15))
# Recolor the words using the image's colors
plt.imshow(word_cloud.recolor(color_func=image_colors), interpolation='bilinear') 
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

# color_func: the function that generates the colors
# interpolation: how the colors of missing pixel values are filled in

# If you would rather pick the colors yourself:
# plt.imshow(word_cloud.recolor(colormap='Blues'), interpolation='bilinear') 
# Uses a Matplotlib colormap (http://j.mp/32UXOQ6)

6. Saving the word cloud as an image

word_cloud.to_file("word_cloud_completed.png")
