NLP Practice: Optimizing Search Keywords with Similarity Scores

2020-02-11

This practice session is presented as a small project.

Project Overview

Keyword search optimization and chatbot implementation for the Douzone ICT online customer center

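Motivation for the project
- First, on the Smart A section of the Douzone online customer center, I clicked the "All" tab and looked through the Q&A pages.
- Putting myself in a customer's position, I would expect questions similar to mine to exist already (leaving aside questions whose context matters so much that a screenshot must be attached), so I would search by keyword first. The figure below shows the result of searching the Smart A page for the keyword 재입사자 (rehired employee). Of the 11 results, 8 concern year-end tax settlement for rehired employees.
- So I searched for 재입사자 연말정산 (rehired employee, year-end tax settlement). Although the 재입사자 search returned 8 questions about year-end tax settlement for rehired employees, this search shows only 5 of those 8, as in the figure below.
- I suspected that the 8 posts might match the keyword only in their titles while their bodies were about something else, so I read the questions missing from the search results. It was also possible that the missing questions were asking something essentially different and that the site's algorithm excluded them for that reason, so I compared them with the questions that were included.
- The figure on the left shows a question and answer excluded from the search results; the figure on the right, outlined in red, shows one that was included. The two questions look similar. Since the first one is nevertheless absent from the results for 재입사자 연말정산, I concluded that the site needs a system that scores the similarity between each question and the search keywords and shows the questions with high similarity.
- For the chatbot, the input and output sequences must be brought to a fixed length, and answers in some areas (for example, year-end tax settlement or withholding tax) can be long. Building a chatbot would therefore also require analyzing sentence lengths per area.
- If these constraints make a chatbot impractical, search results that reflect the similarity between questions and search keywords would at least make the Douzone online customer center a little more convenient for customers.
- The Douzone site offers separate services for sales inquiries and pre-purchase consultation, but Q&A is handled only in the online customer center tab, and answers are given only during business hours. A chatbot that runs around the clock, or at least outside business hours, would let customers use Douzone's services and solutions more conveniently, which is what drew me to building one.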
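Dataset name: qna_smart_a.csv
- Douzone offers a range of services on its WEHAGO platform. Among them, Douzone Smart A is an accounting program that lets small and medium-sized businesses manage financial accounting, tax filing, HR and payroll, and logistics in one place. I will train on this program's questions and answers first. The other products (ERP and WEHAGO) are developed for varied uses depending on the type of user, whereas Smart A, as an accounting program, is used by companies in general. The main reason, though, is that its Q&A board holds the most data.
Data purpose:
Data source: every question and answer under the "All" tab of the Douzone online customer center for Smart A, crawled with a Scrapy spider written to walk through the pages.
First, the questions and answers are crawled from the Douzone online customer center to build the dataset.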

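Building the spider bot

The overall layout of the Scrapy bot is as follows.
I used items.py and settings.py, and saved the final result as a CSV file. To store the data in a database instead, add the extra handling in pipelines.py. On the customer center board, recent posts are mostly in the answered state, but older posts are occasionally still waiting for an answer. pipelines.py could therefore be used to keep only the answered posts, but I crawled everything so that I could see for myself which posts were still unanswered.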
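The filtering mentioned above could live in pipelines.py. The following is a minimal sketch, not the pipeline used for this crawl: the class name AnsweredOnlyPipeline is hypothetical, and the status string 답변완료 is the value that appears in the crawled data later in this post. The try/except fallback exists only so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class AnsweredOnlyPipeline:
    """Hypothetical pipeline that keeps only posts marked as answered."""

    ANSWERED = "답변완료"

    def process_item(self, item, spider):
        # Scrapy items and plain dicts both support .get().
        status = (item.get("answer_yes") or "").strip()
        if status != self.ANSWERED:
            raise DropItem("unanswered post: %s" % item.get("article_num"))
        return item
```

To take effect, a pipeline like this would have to be listed in ITEM_PIPELINES in settings.py, in the same way as ThezonePipeline.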

thezone
├── scrapy.cfg
└── thezone
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   ├── items.cpython-37.pyc
    │   ├── pipelines.cpython-37.pyc
    │   └── settings.cpython-37.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   └── qnacrawler.cpython-37.pyc
        ├── last_qna_smart_a.csv
        ├── qna_smart_a.csv
        └── qnacrawler.py

qnacrawler.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import sys
sys.path.insert(0, '/Users/heungbaelee/workspace/project/chat_bot_project/thezone/thezone')
from items import ThezoneItem

class QnacrawlerSpider(CrawlSpider):
    name = 'qnacrawler'
    allowed_domains = ['help.douzone.com']
    start_urls = ['http://help.douzone.com/pboard/index.jsp?code=qna10&pid=10&s_category_id=all&type=all&s_listnum=50&s_field=&s_keyword=']

    # rules = [
    #     Rule(LinkExtractor(allow=r'/pboard/index.jsp?code=qna10&pid=10&s_category_id=all&type=all&s_listnum=50&s_field=&s_keyword=&page=\d+', ), callback='parse_parent', follow=True),
    # ]
    rules = [
        Rule(LinkExtractor(restrict_css='div.page_box > ul > li:nth-child(n+4)',attrs='href'), callback='parse_parent', follow=True),
    ]

    def parse_parent(self, response):
        # link = LinkExtractor(allow=r'/pboard/index.jsp?code=qna10&pid=10&s_category_id=all&type=all&s_listnum=50&s_field=&s_keyword=&page=\d+')
        # links = link.extract_links(response)
        # print(links)
        # print(response.status)
        for url in response.css('div.tab_cnt.mt30 > table > tbody > tr'):
            article_num = url.css('td:nth-child(1)::text').extract_first().strip()
            self.logger.info('Article number : %s' % article_num)
            article_link = url.css('td:nth-child(3) > a::attr(href)').extract_first().strip()
            self.logger.info('Article link : %s' % article_link)
            # print(article_num, response.urljoin(article_link))
            yield scrapy.Request(response.urljoin(article_link), self.parse_child, meta={'article_num': article_num})

    def parse_child(self, response):
        # Log the URL and status of the child (detail page) response
        self.logger.info('----------------------------------------')
        self.logger.info('Child Response URL : %s' % response.url)
        self.logger.info('Child Response Status : %s' % response.status)
        self.logger.info('----------------------------------------')

        # Question number
        article_num = response.meta['article_num']

        # Category
        category = response.css("div.qna_read.mt30 > table:nth-child(1) > tbody > tr:nth-child(2) > td > dl > dd:nth-child(2)::text").extract_first().strip()

        # Question text
        question = "".join(response.css("div.qna_read.mt30 > table:nth-child(1) > tbody > tr:nth-child(2) > td > div.q > div.q_cnt > p::text").extract()).strip()

        # Date the question was registered
        enrolled_date_time = response.css("div.qna_read.mt30 > table:nth-child(1) > tbody > tr:nth-child(1) > td:nth-child(4)::text").extract_first().strip()

        # Date the answer was written
        answer_date_time = response.css("table.mt10 > tbody > tr:nth-child(1) > td:nth-child(4)::text").extract_first().strip()

        # Answer status
        answer_yes = response.css("table.mt10 > tbody > tr:nth-child(1) > td.ta_l > span::text").extract_first().strip()

        # Answer text
        answering = "".join(response.css("table.mt10 > tbody > tr:nth-child(2) > td.ta_l.pd20 > div.a > div > p::text").extract()).strip()

        yield ThezoneItem(article_num=article_num, category=category, enrolled_date_time=enrolled_date_time, question=question, answer_date_time=answer_date_time, answer_yes=answer_yes, answering=answering)
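
One thing to watch in the spider above: extract_first() returns None when a selector matches nothing, so chaining .strip() directly raises AttributeError on a page that lacks a field. A small helper is one way to guard against that. The name clean is hypothetical and not part of the original spider.

```python
def clean(value, default=""):
    """Strip surrounding whitespace, tolerating a missing (None) value."""
    if value is None:
        return default
    return value.strip()


# Values as a selector might return them.
print(clean("  답변완료 \n"))
print(repr(clean(None)))
```

In the spider this would wrap each call, for example clean(response.css(...).extract_first()). extract_first() also accepts a default argument, which covers the same case.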

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ThezoneItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Article number
    article_num = scrapy.Field()

    # Category
    category = scrapy.Field()

    # Question text
    question = scrapy.Field()

    # Answer text
    answering = scrapy.Field()

    # Date the answer was written
    answer_date_time = scrapy.Field()

    # Date the question was registered
    enrolled_date_time = scrapy.Field()

    # Answer status
    answer_yes = scrapy.Field()

settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'thezone'

SPIDER_MODULES = ['thezone.spiders']
NEWSPIDER_MODULE = 'thezone.spiders'

DEFAULT_REQUEST_HEADERS = {'Referer' : 'http://help.douzone.com'}

# robots.txt rules are not obeyed for this crawl
ROBOTSTXT_OBEY = False

# Enable cookies
COOKIES_ENABLED = True

DOWNLOAD_DELAY = 3

# Use a User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

# Enable the item pipeline
# Lower numbers mean higher priority
ITEM_PIPELINES = {
    'thezone.pipelines.ThezonePipeline': 300,
}

# Number of retries
RETRY_ENABLED = True
RETRY_TIMES = 2

# Output encoding, so that Korean text is written correctly
FEED_EXPORT_ENCODING = 'utf-8'

With the Scrapy files above, I first collected the data. On my local machine the crawl took about 10 hours.

scrapy runspider qnacrawler.py -o qna_smart_a.csv -t csv

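Data overview

The Scrapy spider above produced data with the following features.
The goal of the project is to crawl the entire Q&A for the accounting program Smart A and build a chatbot from it.

Raw data fields

answer_date_time : date the answer was completed
answer_yes : answer status
answering : answer text
category : question category
enrolled_date_time : date the question was registered
question : question text

Douzone Smart A online customer center data

raw_data.info()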

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12475 entries, 0 to 12474
Data columns (total 6 columns):
answer_date_time      12436 non-null object
answer_yes            12436 non-null object
answering             12420 non-null object
category              12475 non-null object
enrolled_date_time    12475 non-null object
question              12409 non-null object
dtypes: object(6)
memory usage: 584.9+ KB

Because the data contains null values, they are removed. There were also 39 posts still waiting for an answer; since no answer has been written for them, they are removed as well.

import pandas as pd

# Keep only the posts that have been answered.
raw_data = raw_data[raw_data["answer_yes"] == "답변완료"].reset_index(drop=True)

# Drop rows with a missing question, then confirm none remain.
raw_data = raw_data[raw_data["question"].notnull()].reset_index(drop=True)
print(raw_data["question"].isnull().sum())

# Drop rows with a missing answer, then confirm none remain.
raw_data = raw_data[raw_data["answering"].notnull()].reset_index(drop=True)
print(raw_data["answering"].isnull().sum())

After removing the unanswered posts and the null values, 12,354 question-answer pairs remain usable.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12354 entries, 0 to 12353
Data columns (total 6 columns):
answer_date_time      12354 non-null object
answer_yes            12354 non-null object
answering             12354 non-null object
category              12354 non-null object
enrolled_date_time    12354 non-null object
question              12354 non-null object
dtypes: object(6)
memory usage: 579.2+ KB

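First, a quick look at how the data is distributed across categories.

Number of question-answer pairs per category

Preprocessing the question data

Because these are tax and accounting questions, many questions and answers involve amounts, and I went back and forth on whether to remove numbers. For the project's first goal, measuring the similarity between search keywords and question text, I judged that numbers matter little, so I decided to remove digits along with punctuation such as periods. Text inside [ ] is kept, however, because it mostly refers to Smart A menu names. To have [menu name] recognized as a single noun, I also use soynlp, which takes an unsupervised approach, for morphological analysis.

Question data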

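import re

def pattern_match(x):
    # Replace runs of digits with a space.
    sentence = re.sub(r"\d+", " ", x)

    # Replace punctuation and runs of ㅠ or ㅎ with a space.
    # '[' and ']' are not listed, so [menu name] markers are kept.
    pattern = r"[!|,.?~※)(■+=\-/*>;^%']|ㅠ+|ㅎ+"
    sentence = re.sub(pattern, " ", sentence)
    return sentence

# question_after is assumed to hold a copy of the question text created in a step not shown here.
raw_data['question_after'] = raw_data['question_after'].apply(lambda x : pattern_match(str(x)))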
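
To see what this cleaning does to a single question, here is a self-contained sketch applied to a made-up sentence. The function name strip_noise and the sample text are assumptions for illustration; the symbol list follows the one described above, with the square brackets left out on purpose.

```python
import re

DIGITS = re.compile(r"\d+")
# Symbols to drop; '[' and ']' are deliberately absent so menu names survive.
SYMBOLS = re.compile(r"[!|,.?~※)(■+=\-/*>;^%']|ㅠ+|ㅎ+")


def strip_noise(text):
    """Replace each digit run and each listed symbol with a single space."""
    text = DIGITS.sub(" ", text)
    return SYMBOLS.sub(" ", text)


sample = "[급여자료입력] 메뉴에서 2019년 금액이 안 나와요ㅠㅠ"
print(strip_noise(sample))
```

The menu name in brackets is preserved, while the year and the trailing ㅠㅠ are replaced by spaces. Those replacement spaces are why a later step collapses repeated spaces.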

With digits and basic punctuation removed, I next look roughly at the average question length, measured both in space-delimited tokens (eojeol) and in characters (syllables, that is, each individual character).

# Summary statistics of the number of space-delimited tokens per question
raw_data['question_after'].apply(lambda x: len(str(x).split(" "))).describe()

Split on spaces, a question has about 35 tokens on average, with a median of 27. With a third quartile of 42 and a maximum of 791, the mean appears to be pulled up by outliers and does not represent the center of the data well. So I inspect the outliers to find out what is going on.

count    12290.000000
mean        35.055411
std         33.537140
min          1.000000
25%         17.000000
50%         27.000000
75%         42.000000
max        791.000000
Name: question_after, dtype: float64

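Looking at the question with the largest count, it appears to use spaces to lay the text out in a fixed format, as in the figure below. What matters for this task is the content and keywords of the question, not its layout, so the extra spaces are removed.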
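
The inflated counts come from how str.split(" ") treats runs of spaces: every extra space yields an empty string that is still counted as a token. A small illustration with a made-up, space-padded question:

```python
import re
from statistics import mean, median

# Hypothetical question padded with spaces for layout (4 spaces between words).
padded = "코드" + " " * 4 + "금액" + " " * 4 + "비고"
print(len(padded.split(" ")))     # 9: three words plus six empty strings

collapsed = re.sub(r" {2,}", " ", padded)
print(len(collapsed.split(" ")))  # 3

# One padded outlier is enough to pull the mean away from the median.
counts = [3, 4, 5, 4, 120]
print(mean(counts), median(counts))  # 27.2 4
```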

Appearance of the outlier data

def pattern_match(x):
    pattern = "  +"
    reg = re.compile(pattern)
    sentence = re.sub(reg, " ", x)
    return sentence

raw_data['question_after'] = raw_data['question_after'].apply(lambda x : pattern_match(str(x)))

sent_len_by_token = raw_data['question_after'].apply(lambda x: len(str(x).split(" ")))

After collapsing the runs of spaces, the summary statistics of tokens per question are as follows. The gap between mean and median has narrowed noticeably: a question now has about 28 tokens on average, with a median of 23.

count    12290.000000
mean        28.416273
std         22.040055
min          1.000000
25%         15.000000
50%         23.000000
75%         35.000000
max        355.000000
Name: question_after, dtype: float64

The 90th percentile of the token count was 52.

np.quantile(sent_len_by_token, 0.90)

52.0

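I also looked at the question with 355 tokens, since it is an outlier too. As shown below, it is a question about an error code, so it inevitably contains many spaces. Those spaces are not irrelevant to expressing the question, so the question is left as it is.

The question with 355 tokens

Next, length in characters. The summary statistics are below: a question uses 136 characters on average, with a median of 112.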

count    12279.000000
mean       136.459647
std        109.823703
min          3.000000
25%         74.000000
50%        112.000000
75%        167.000000
max       2637.000000
Name: question_after, dtype: float64

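To view these summary statistics at a glance, I plotted the histograms below. As expected, counts in characters are much larger than counts in tokens. The thing to look at is the tail: by either measure, question lengths are concentrated below a certain level, with outliers beyond it.

Histograms of question length in tokens and in characters

Model setup

First, I build a TF-IDF matrix, reduce it with TruncatedSVD (a form of LSA) to obtain sentence embeddings, and then look for the documents most similar to the search keywords.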

import math
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity


q_sentence = list(raw_data['question_after'])

word_extractor = WordExtractor(min_frequency=1, min_cohesion_forward=0.05, min_right_branching_entropy=0.0)
word_extractor.train(q_sentence)
scores = word_extractor.word_scores()

cohesion_scores = {key:(scores[key].cohesion_forward * math.exp(scores[key].right_branching_entropy)) for key in scores.keys()}
tokenizer = LTokenizer(scores=cohesion_scores)

tokens = []
for q_s in q_sentence:
    tokens.append(tokenizer.tokenize(q_s))

sentence_by_tokens = [' '.join(word) for word in tokens]

## TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1,1), lowercase=True, tokenizer=lambda x : x.split(" "))
input_matrix = vectorizer.fit_transform(sentence_by_tokens)

vocab2id = {token : vectorizer.vocabulary_[token] for token in vectorizer.vocabulary_.keys()}

id2vocab = {vectorizer.vocabulary_[token]: token for token in vectorizer.vocabulary_.keys()}

## TruncatedSVD
svd =  TruncatedSVD(n_components=100)
vecs = svd.fit_transform(input_matrix)

criterion_sentence = "재입사자 연말정산"

criterion_tokens = tokenizer.tokenize(criterion_sentence)
criterion_tokens

Tokenizing the keyword 재입사자 연말정산 gives the result below: it is split into three tokens, 재입사, 자, and 연말정산.

['재입사', '자', '연말정산']
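
The split into 재입사 and 자 follows from how an L-tokenizer picks, for each space-delimited word, the left-side substring with the best score. The sketch below is a simplified stand-in for that idea, not soynlp's actual LTokenizer implementation, and the scores are made-up values chosen only to reproduce the behaviour.

```python
def l_tokenize(sentence, scores):
    """Split each word into its best-scoring prefix (L) and the remainder (R)."""
    tokens = []
    for word in sentence.split():
        # Candidate cut positions 1..len(word); unknown prefixes score 0.
        # Ties are broken in favour of the longer prefix.
        best_cut = max(
            range(1, len(word) + 1),
            key=lambda cut: (scores.get(word[:cut], 0.0), cut),
        )
        left, right = word[:best_cut], word[best_cut:]
        tokens.append(left)
        if right:
            tokens.append(right)
    return tokens


# Hypothetical scores: "재입사" outscores "재입사자", so the word is cut after it.
toy_scores = {"재입사": 0.6, "재입사자": 0.4, "연말정산": 0.9}
print(l_tokenize("재입사자 연말정산", toy_scores))
# ['재입사', '자', '연말정산']
```

If the score of the whole word 재입사자 were higher than that of 재입사, the word would be kept in one piece.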

criterion_sentence_by_token = [" ".join(criterion_tokens)]
criterion_vec = vectorizer.transform(criterion_sentence_by_token)
criterion_vec = svd.transform(criterion_vec)

svd_l2norm_vectors = normalize(vecs, axis=1, norm='l2')
svd_l2norm_criterion_vector = normalize(criterion_vec, axis=1, norm='l2').ravel()
# Dot products of L2-normalized vectors are cosine similarities (one per question).
# A separate name avoids shadowing the imported cosine_similarity function.
similarities = np.dot(svd_l2norm_vectors, svd_l2norm_criterion_vector)

# Sort (index, similarity) pairs from most to least similar.
sorted_list = sorted(enumerate(similarities.tolist()), key=lambda x: x[1], reverse=True)

# Keep only the questions whose raw text contains every query token.
show_list = []
for idx, similarity in sorted_list:
    if all(token in raw_data['question'].loc[idx] for token in criterion_tokens):
        show_list.append((idx, similarity))
show_list
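
The ranking and filtering steps can also be sketched without any libraries. The vectors and questions below are made up, and rank_and_filter is a hypothetical helper name. The logic mirrors the approach used here: L2-normalize, take dot products as cosine similarities, sort, then keep only the questions that contain every query token.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def rank_and_filter(doc_vecs, query_vec, raw_texts, query_tokens):
    """Return (index, cosine) pairs, best first, for texts containing all tokens."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(doc_vecs):
        d = l2_normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(d, q))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        (idx, sim) for idx, sim in scored
        if all(tok in raw_texts[idx] for tok in query_tokens)
    ]


texts = ["재입사자 연말정산 방법", "연말정산 간소화 자료", "중도 재입사자의 연말정산"]
vecs = [[1.0, 0.2], [0.9, 0.1], [0.5, 0.5]]
print(rank_and_filter(vecs, [1.0, 0.3], texts, ["재입사자", "연말정산"]))
```

The second text is close to the query in vector space but is dropped because it does not contain 재입사자.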

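Some of the questions in show_list are shown below.

The questions finally selected

As seen above, the keyword was split into three tokens, 재입사, 자, and 연말정산. To split it properly into the two words we know it contains, 재입사자 and 연말정산, I tokenize again using scores to which noun extractor scores have been added.

Tokenizer using scores combined with noun extractor scores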

import math
from soynlp.word import WordExtractor
from soynlp.noun import LRNounExtractor_v2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

noun_extractor = LRNounExtractor_v2(verbose=True)
nouns = noun_extractor.train_extract(q_sentence)

noun_scores = {noun:score.score for noun, score in nouns.items()}
combined_scores = {noun:score + cohesion_scores.get(noun, 0) for noun, score in noun_scores.items()}
combined_scores.update(
    {subword:cohesion for subword, cohesion in cohesion_scores.items()
    if not (subword in combined_scores)}
)

tokenizer = LTokenizer(scores=combined_scores)

tokens = []
for q_s in q_sentence:
    tokens.append(tokenizer.tokenize(q_s))

sentence_by_tokens = [' '.join(word) for word in tokens]

vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1,1), lowercase=True, tokenizer=lambda x : x.split(" "))
input_matrix = vectorizer.fit_transform(sentence_by_tokens)

vocab2id = {token : vectorizer.vocabulary_[token] for token in vectorizer.vocabulary_.keys()}

id2vocab = {vectorizer.vocabulary_[token]: token for token in vectorizer.vocabulary_.keys()}


svd =  TruncatedSVD(n_components=100)
vecs = svd.fit_transform(input_matrix)

criterion_sentence = "재입사자 연말정산"

criterion_tokens = tokenizer.tokenize(criterion_sentence)
criterion_tokens
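
Combining the two score dictionaries is a plain dictionary merge. The toy numbers below are invented. The point is that nouns receive their noun score plus their cohesion score, while subwords known only to the cohesion table keep their cohesion score. Note that dict.update modifies the dictionary in place and returns None, so its result should not be assigned back to the variable.

```python
# Invented scores for illustration only.
noun_scores = {"재입사자": 0.8, "연말정산": 0.9}
cohesion_scores = {"재입사": 0.6, "재입사자": 0.3, "연말": 0.4}

# Nouns: noun score plus cohesion score (0 when the noun has no cohesion entry).
combined_scores = {
    noun: score + cohesion_scores.get(noun, 0)
    for noun, score in noun_scores.items()
}
# Everything else: fall back to the cohesion score alone.
for subword, cohesion in cohesion_scores.items():
    combined_scores.setdefault(subword, cohesion)

print(combined_scores)
```

With these numbers the whole word 재입사자 ends up scoring higher than the prefix 재입사, which is the effect the noun scores are meant to have.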

Tokenizing the search keyword 재입사자 연말정산 now yields 재입사자 and 연말정산, as shown below.

['재입사자', '연말정산']

However, as with the first model, most of the high-similarity questions in sorted_list are ones that contain 연말정산. My own view is that this happens because a TF-IDF matrix is used as the input and 연말정산 accounts for a large share of all questions. To address it, I thought I would need to either balance the number of questions per area or use a different kind of input matrix.
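
How document frequency enters the weights can be seen from the IDF term alone. The sketch uses the smoothed formula that scikit-learn's TfidfVectorizer applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a made-up corpus. It illustrates the weighting only and does not by itself confirm the explanation given above.

```python
import math
from collections import Counter

# Made-up tokenized corpus in which "연말정산" appears in most documents.
docs = [
    ["재입사자", "연말정산"],
    ["연말정산", "간소화"],
    ["연말정산", "부양가족"],
    ["급여", "자료입력"],
]

n_docs = len(docs)
doc_freq = Counter(token for doc in docs for token in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

for token, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(token, round(weight, 3))
```

A term that occurs in many documents gets a lower IDF, yet it still contributes to the vectors of all those documents, so a query containing it has some overlap with a large part of the corpus.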

criterion_sentence_by_token = [" ".join(criterion_tokens)]
criterion_vec = vectorizer.transform(criterion_sentence_by_token)

criterion_vec=svd.transform(criterion_vec)

svd_l2norm_vectors = normalize(vecs, axis=1, norm='l2')
svd_l2norm_criterion_vector = normalize(criterion_vec, axis=1, norm='l2').ravel()
# Dot products of L2-normalized vectors are cosine similarities (one per question).
# A separate name avoids shadowing the imported cosine_similarity function.
similarities = np.dot(svd_l2norm_vectors, svd_l2norm_criterion_vector)

# Sort (index, similarity) pairs from most to least similar.
sorted_list = sorted(enumerate(similarities.tolist()), key=lambda x: x[1], reverse=True)

# Keep only the questions whose raw text contains every query token.
show_list_noun = []
for idx, similarity in sorted_list:
    if all(token in raw_data['question'].loc[idx] for token in criterion_tokens):
        show_list_noun.append((idx, similarity))
show_list_noun

show_list_index = []
for idx, score in show_list:
    show_list_index.append(idx)

show_list_noun_index = []
for idx, score in show_list_noun:
    show_list_noun_index.append(idx)

set(show_list_noun_index) == set(show_list_index)

The result is False, so the two result sets differ; the tokenization with the noun extractor scores added retrieved more, and better, questions.

False
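
The == comparison only says that the two result sets differ. Set differences show which indices each tokenizer found on its own. The index values here are placeholders, not the actual results.

```python
# Placeholder indices standing in for show_list_index and show_list_noun_index.
first_model = {3, 17, 42}
noun_model = {3, 17, 42, 58, 101}

print(noun_model - first_model)   # found only with the noun scores added
print(first_model - noun_model)   # found only by the first tokenizer
print(first_model <= noun_model)  # True when the first result is a subset
```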

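Conclusion

Posts in the year-end tax settlement category make up a large share of all posts. As a result, whenever the search keywords included 연말정산, the high-similarity posts were mostly about year-end tax settlement alone, so I later changed the output to show only posts that contain every token of the tokenized search keywords. The first model is the one finally selected. In the current search system, 재입사자 returns 11 posts and 재입사자 연말정산 returns 5, as in the figure below, so posts go unfound even though most of the 재입사자 results are about year-end tax settlement; I am satisfied that this problem can be solved. The approach also compensates for the existing system's sensitivity to spacing, so I would argue it is a better search system than before.
The figure below shows the results of searching the Smart A board of the Douzone online customer center for two keywords that mean the same thing and differ only in spacing: 재입사자 연말정산 (top) and 재입사자 연말 정산 (bottom).

Search results for 재입사자 연말정산

Search results for 재입사자 연말 정산

With this approach, the same results were obtained even when the spacing of the search keywords differed.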
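
One way to see why spacing matters less here: the filter checks each query token as a substring of the question, so extra spaces in the query change how the query is cut but not whether the pieces occur in the text. In this sketch whitespace splitting stands in for the real tokenizer, and the questions are made up.

```python
def contains_all(question, query):
    """True when every whitespace-separated piece of the query occurs in the question."""
    return all(piece in question for piece in query.split())


question = "중도퇴사 후 재입사자 연말정산은 어떻게 하나요"
print(contains_all(question, "재입사자 연말정산"))   # True
print(contains_all(question, "재입사자 연말 정산"))  # True

# The reverse case is not covered: a question written with the space
# does not contain the unspaced piece "연말정산".
print(contains_all("재입사자 연말 정산 문의", "재입사자 연말정산"))  # False
```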

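Tokenization result for the keyword 재입사자 연말 정산

Model results

Areas for improvement

Because TF-IDF is used, I concluded that I should have taken into account that the weights depend on the terms and on the number of documents in which each term appears. Also, since the search keywords are later used as a filter over the similarity-ranked list, an algorithm that starts from the search keywords but strips their unnecessary parts should make for a better search system. Beyond search, it would also be useful to have a service that, after a customer writes a question, takes them to a page listing existing questions with high similarity to it. I thought of this because answers at the Douzone online customer center are sometimes a little slow during accounting peaks such as the year-end tax settlement season (although every question is still answered within a day). Such a system would ease the load on the customer center staff and help customers find a solution sooner.

What I learned from this toy project

What struck me most was preprocessing. I am reminded every time that preprocessing is more than 80% of data analysis, but this toy project drove it home more than usual. For the next toy project I plan to build a chatbot that writes answers to questions, and I think a separate model may be needed per question category. A chatbot needs input and output vectors of a fixed size, padded when a sequence falls short, and I expect the typical sequence lengths of questions and answers to differ by category.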
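
Padding to a fixed length can be sketched as follows. The function name pad_to_length and the pad id 0 are assumptions for illustration. A per-category maximum length, as suggested above, would simply mean a different max_len for each category.

```python
def pad_to_length(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_len items."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


print(pad_to_length([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_to_length([5, 8, 2, 9, 4, 7], 5))  # [5, 8, 2, 9, 4]
```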