
AAI02, What is natural language processing

List of posts to read before reading this article


Contents


Korean morpheme analysis

konlpy installation

URL

$ sudo apt-get install g++ openjdk-8-jdk python3-dev python3-pip curl      # Install Java 1.8 or up
$ python3 -m pip install --upgrade pip
$ python3 -m pip install konlpy                                            # Python 3.x




Hannanum : KAIST

from konlpy.tag import Hannanum

hannanum = Hannanum()
analyze = hannanum.analyze((u'대한민국은 아름다운 나라이다.'))
morphs = hannanum.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = hannanum.nouns((u'대한민국은 아름다운 나라이다.'))
pos = hannanum.pos((u'대한민국은 아름다운 나라이다.'))

print("analyze :\n", analyze)
print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)





Kkma : SNU

from konlpy.tag import Kkma

kkma = Kkma()
morphs = kkma.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = kkma.nouns((u'대한민국은 아름다운 나라이다.'))
pos = kkma.pos((u'대한민국은 아름다운 나라이다.'))
sentences = kkma.sentences((u'대한민국은 아름다운 나라이다.'))

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
print("sentences :\n", sentences)





Komoran : Shineware
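
Komoran is also bundled with konlpy and exposes the same morphs/nouns/pos interface as the other analyzers; a minimal sketch mirroring the examples above:

from konlpy.tag import Komoran

komoran = Komoran()
morphs = komoran.morphs(u'대한민국은 아름다운 나라이다.')
nouns = komoran.nouns(u'대한민국은 아름다운 나라이다.')
pos = komoran.pos(u'대한민국은 아름다운 나라이다.')

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)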






Mecab : Eunjeon project

$ sudo apt-get install curl git
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
from konlpy.tag import Mecab

mecab = Mecab()
morphs = mecab.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = mecab.nouns((u'대한민국은 아름다운 나라이다.'))
pos = mecab.pos((u'대한민국은 아름다운 나라이다.'))

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)





Okt : Twitter

from konlpy.tag import Okt

okt = Okt()
morphs = okt.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = okt.nouns((u'대한민국은 아름다운 나라이다.'))
pos = okt.pos((u'대한민국은 아름다운 나라이다.'))
phrases = okt.phrases((u'대한민국은 아름다운 나라이다.'))

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
print("phrases :\n", phrases)





Introduction

Keywords : Unsupervised machine translation, Pretrained language models, Common sense inference datasets, Meta-learning, Robust unsupervised methods, Understanding representations, Clever auxiliary tasks, Combining semi-supervised learning with transfer learning, QA and reasoning with large documents, Inductive bias




NLP Categorization

  • Phonology : Linguistic sounds
    • Speech to Text(STT)
  • Morphology : Meaningful components of words
    • Lexical analysis
  • Syntax : Structural relationships between words
    • Syntax analysis
  • Semantics : Meaning
    • Semantic analysis
  • Pragmatics : How language is used to accomplish goals
    • Pragmatic analysis
  • Discourse : Larger linguistic units





NLP Research trend

  • Rule-based approach(deductive reasoning, deterministic)
  • Statistical approach(inductive reasoning, stochastic)
  • Machine learning approach(inductive reasoning, stochastic) : end-to-end multi-task learning



Upstream-task

  • Tokenize
  • Embedding
    • Factorization based(Matrix decomposition)
      • GloVe, Swivel
    • Prediction based (see the sketch after this list)
      • Word2Vec, FastText, BERT, ELMo, GPT
    • Topic based
      • LDA
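
As a rough illustration of the prediction-based family above, here is a minimal Word2Vec sketch using gensim (this example assumes gensim >= 4.0 is installed; the toy corpus is made up):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (normally the output of a tokenizer).
sentences = [["natural", "language", "processing", "is", "fun"],
             ["language", "models", "predict", "words"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["language"][:5])          # first 5 dimensions of the embedding
print(model.wv.most_similar("language"))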





Downstream-task

  • Part of Speech tagging
  • Named Entity Recognition
  • Semantic Role Labeling





Text preprocessing(1) : Lexical analysis

Tokenization

Word Tokenization

word_tokenize

from nltk.tokenize import word_tokenize  

print(word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))  

[‘Do’, “n’t”, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘,’, ‘Mr.’, ‘Jone’, “‘s”, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’, ‘.’]

WordPunctTokenizer

from nltk.tokenize import WordPunctTokenizer  

print(WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

[‘Don’, “’”, ‘t’, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘,’, ‘Mr’, ‘.’, ‘Jone’, “’”, ‘s’, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’, ‘.’]


text_to_word_sequence

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

[“don’t”, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘mr’, “jone’s”, ‘orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’]




Consideration

  • Don’t simply exclude punctuation marks or special characters.
    • ex] Ph.D, AT&T, 123,456,789
  • In case of abbreviations and spacing within words
    • ex] rock ‘n’ roll(abbreviation), New York(spacing within words)
  • Standard : Penn Treebank Tokenization

TreebankWordTokenizer

from nltk.tokenize import TreebankWordTokenizer

tokenizer=TreebankWordTokenizer()
text="Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."
print(tokenizer.tokenize(text))

[‘Starting’, ‘a’, ‘home-based’, ‘restaurant’, ‘may’, ‘be’, ‘an’, ‘ideal.’, ‘it’, ‘does’, “n’t”, ‘have’, ‘a’, ‘food’, ‘chain’, ‘or’, ‘restaurant’, ‘of’, ‘their’, ‘own’, ‘.’]




Sentence Tokenization

sent_tokenize

from nltk.tokenize import sent_tokenize

text="His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff. He dug a hole in the midst of some reeds. He looked about, to mae sure no one was near."
print(sent_tokenize(text))

[‘His barber kept his word.’, ‘But keeping such a huge secret to himself was driving him crazy.’, ‘Finally, the barber went up a mountain and almost to the edge of a cliff.’, ‘He dug a hole in the midst of some reeds.’, ‘He looked about, to make sure no one was near.’]

from nltk.tokenize import sent_tokenize

text="I am actively looking for Ph.D. students. and you are a Ph.D student."
print(sent_tokenize(text))

[‘I am actively looking for Ph.D. students.’, ‘and you are a Ph.D student.’]


Part-of-speech tagging

English

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text="I am actively looking for Ph.D. students. and you are a Ph.D. student."
x=word_tokenize(text)

print(x)
pos_tag(x)

[‘I’, ‘am’, ‘actively’, ‘looking’, ‘for’, ‘Ph.D.’, ‘students’, ‘.’, ‘and’, ‘you’, ‘are’, ‘a’, ‘Ph.D.’, ‘student’, ‘.’]
[(‘I’, ‘PRP’), (‘am’, ‘VBP’), (‘actively’, ‘RB’), (‘looking’, ‘VBG’), (‘for’, ‘IN’), (‘Ph.D.’, ‘NNP’), (‘students’, ‘NNS’), (‘.’, ‘.’), (‘and’, ‘CC’), (‘you’, ‘PRP’), (‘are’, ‘VBP’), (‘a’, ‘DT’), (‘Ph.D.’, ‘NNP’), (‘student’, ‘NN’), (‘.’, ‘.’)]

Reference

PRP : personal pronoun
VBP : verb, non-3rd person singular present
RB : adverb
VBG : verb, gerund or present participle
IN : preposition
NNP : proper noun, singular
NNS : noun, plural
CC : coordinating conjunction
DT : determiner




Korean

from konlpy.tag import Kkma  

kkma=Kkma()  
print(kkma.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(kkma.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))  
print(kkma.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))  

[‘열심히’, ‘코딩’, ‘하’, ‘ㄴ’, ‘당신’, ‘,’, ‘연휴’, ‘에’, ‘는’, ‘여행’, ‘을’, ‘가보’, ‘아요’]
[(‘열심히’,’MAG’), (‘코딩’, ‘NNG’), (‘하’, ‘XSV’), (‘ㄴ’, ‘ETD’), (‘당신’, ‘NP’), (‘,’, ‘SP’), (‘연휴’, ‘NNG’), (‘에’, ‘JKM’), (‘는’, ‘JX’), (‘여행’, ‘NNG’), (‘을’, ‘JKO’), (‘가보’, ‘VV’), (‘아요’, ‘EFN’)]
[‘코딩’, ‘당신’, ‘연휴’, ‘여행’]




Named entity recognition
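
One simple way to try named entity recognition is NLTK's pretrained ne_chunk chunker, which labels PERSON / ORGANIZATION / GPE spans on top of POS tags (a minimal sketch; the maxent_ne_chunker and words resources must be downloaded once, and the example sentence is arbitrary):

from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# import nltk; nltk.download('maxent_ne_chunker'); nltk.download('words')  # one-time setup
sentence = "James is working at Disney in London."
tagged = pos_tag(word_tokenize(sentence))
print(ne_chunk(tagged))   # a Tree with named-entity chunks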




Co-reference




Basic dependencies
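
One way to inspect basic dependencies is spaCy's dependency parser (a minimal sketch; it assumes the en_core_web_sm model is installed, which is not used elsewhere in this post):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The barber kept his word.")
for token in doc:
    print(token.text, token.dep_, token.head.text)   # token, dependency label, head word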




Tokenization with regular expression

import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer=RegexpTokenizer(r"[\w]+")
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))

[‘Don’, ‘t’, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘Mr’, ‘Jone’, ‘s’, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’]

import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer=RegexpTokenizer(r"[\s]+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))

[“Don’t”, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name,’, ‘Mr.’, “Jone’s”, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’]





Cleaning and normalization : Morphological analysis

  • morphology
    • stem
    • affix




Lemmatization : preserves POS

WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

n=WordNetLemmatizer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([n.lemmatize(w) for w in words])

[‘policy’, ‘doing’, ‘organization’, ‘have’, ‘going’, ‘love’, ‘life’, ‘fly’, ‘dy’, ‘watched’, ‘ha’, ‘starting’]

The results above include forms that are not meaningful words, such as 'dy' and 'ha'. This happens because the lemmatizer needs to know the part of speech of the original word to produce an accurate result.

from nltk.stem import WordNetLemmatizer

n=WordNetLemmatizer()
print(n.lemmatize('dies', 'v'))
print(n.lemmatize('watched', 'v'))
print(n.lemmatize('has', 'v'))

‘die’
‘watch’
‘have’




Stemming : does not preserve POS

Stemming with the Porter algorithm
PorterStemmer

from nltk.stem import PorterStemmer

s=PorterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([s.stem(w) for w in words])

[‘polici’, ‘do’, ‘organ’, ‘have’, ‘go’, ‘love’, ‘live’, ‘fli’, ‘die’, ‘watch’, ‘ha’, ‘start’]

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

s = PorterStemmer()
text="This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
words=word_tokenize(text)

print(words)
print([s.stem(w) for w in words])

[‘This’, ‘was’, ‘not’, ‘the’, ‘map’, ‘we’, ‘found’, ‘in’, ‘Billy’, ‘Bones’, “‘s”, ‘chest’, ‘,’, ‘but’, ‘an’, ‘accurate’, ‘copy’, ‘,’, ‘complete’, ‘in’, ‘all’, ‘things’, ‘–’, ‘names’, ‘and’, ‘heights’, ‘and’, ‘soundings’, ‘–’, ‘with’, ‘the’, ‘single’, ‘exception’, ‘of’, ‘the’, ‘red’, ‘crosses’, ‘and’, ‘the’, ‘written’, ‘notes’, ‘.’]
[‘thi’, ‘wa’, ‘not’, ‘the’, ‘map’, ‘we’, ‘found’, ‘in’, ‘billi’, ‘bone’, “‘s”, ‘chest’, ‘,’, ‘but’, ‘an’, ‘accur’, ‘copi’, ‘,’, ‘complet’, ‘in’, ‘all’, ‘thing’, ‘–’, ‘name’, ‘and’, ‘height’, ‘and’, ‘sound’, ‘–’, ‘with’, ‘the’, ‘singl’, ‘except’, ‘of’, ‘the’, ‘red’, ‘cross’, ‘and’, ‘the’, ‘written’, ‘note’, ‘.’]

The results of the above algorithm include words that are not in the dictionary.




Stemming with the Lancaster algorithm
LancasterStemmer

from nltk.stem import LancasterStemmer

l=LancasterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([l.stem(w) for w in words])




Removing Unnecessary Words(noise data)

Stopword
List of stopwords for English
stopwords

from nltk.corpus import stopwords  
stopwords.words('english')[:10]

[‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘our’, ‘ours’, ‘ourselves’, ‘you’, ‘your’]
Removing stopwords for English

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

example = "Family is not an important thing. It's everything."
stop_words = set(stopwords.words('english')) 

word_tokens = word_tokenize(example)

result = []
for w in word_tokens: 
    if w not in stop_words: 
        result.append(w) 

print(word_tokens) 
print(result) 

[‘Family’, ‘is’, ‘not’, ‘an’, ‘important’, ‘thing’, ‘.’, ‘It’, “‘s”, ‘everything’, ‘.’]
[‘Family’, ‘important’, ‘thing’, ‘.’, ‘It’, “‘s”, ‘everything’, ‘.’]
List of stopwords for Korean


Removing stopwords for Korean

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

example = "고기를 아무렇게나 구우려고 하면 안 돼. 고기라고 다 같은 게 아니거든. 예컨대 삼겹살을 구울 때는 중요한 게 있지."
stop_words = "아무거나 아무렇게나 어찌하든지 같다 비슷하다 예컨대 이럴정도로 하면 아니거든"
# The stopwords above were chosen arbitrarily by the author from non-noun words; this is not a meaningful selection criterion.
stop_words=stop_words.split(' ')
word_tokens = word_tokenize(example)

result = [] 
for w in word_tokens: 
    if w not in stop_words: 
        result.append(w) 
# The four lines above can be replaced with the single line below
# result = [word for word in word_tokens if word not in stop_words]

print(word_tokens) 
print(result)

[‘고기를’, ‘아무렇게나’, ‘구우려고’, ‘하면’, ‘안’, ‘돼’, ‘.’, ‘고기라고’, ‘다’, ‘같은’, ‘게’, ‘아니거든’, ‘.’, ‘예컨대’, ‘삼겹살을’, ‘구울’, ‘때는’, ‘중요한’, ‘게’, ‘있지’, ‘.’]
[‘고기를’, ‘구우려고’, ‘안’, ‘돼’, ‘.’, ‘고기라고’, ‘다’, ‘같은’, ‘게’, ‘.’, ‘삼겹살을’, ‘구울’, ‘때는’, ‘중요한’, ‘게’, ‘있지’, ‘.’]



Rare words




Words with a very short length

import re

text = "I was wondering if anyone out there could enlighten me on this car."
shortword = re.compile(r'\W*\b\w{1,2}\b')

print(shortword.sub('', text))

was wondering anyone out there could enlighten this car.





Encoding

Integer encoding
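
A minimal integer-encoding sketch with the Keras Tokenizer (reusing the example sentence from the stopword section; frequency-based indices start at 1):

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["Family is not an important thing. It's everything."]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)                 # build the vocabulary
print(tokenizer.word_index)                    # word -> integer index
print(tokenizer.texts_to_sequences(corpus))    # sentence -> list of indices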




One-hot encoding
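
A minimal one-hot encoding sketch on top of the integer encoding above, using Keras to_categorical (column 0 stays unused because Tokenizer indices start at 1):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

corpus = ["you know I want your love"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
encoded = tokenizer.texts_to_sequences(corpus)[0]   # integer-encoded sentence
print(to_categorical(encoded))                      # one row per token, one column per index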




Byte Pair encoding
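
A minimal byte pair encoding sketch following the classic merge procedure of Sennrich et al. (2016), applied to a toy frequency dictionary (the symbol </w> marks the end of a word):

import re, collections

def get_stats(vocab):
    # Count how often each pair of adjacent symbols occurs.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Merge the chosen pair into a single symbol everywhere it occurs.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                    # 10 merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair
    vocab = merge_vocab(best, vocab)
    print(best)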





Text preprocessing(2) : Syntax analysis





Language model

A language model is a criterion for determining whether a sentence is natural.

  • Assign probabilities for word sequences
    • Machine Translation
    • Spell Correction
    • Speech Recognition




Statistical Language Model, SLM

  • Prediction of the next word given the previous words
    • P(W) = P(w_1,w_2,w_3,…,w_n)
    • P(w_n|w_1,…,w_n-1) = P(w_1,w_2,…,w_n)/P(w_1,w_2,w_3,…,w_n-1)

P(w_1,w_2,…,w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) … P(w_n|w_1,…,w_n-1)  (chain rule)

Each conditional probability is estimated from counts in a training corpus, so word sequences that never appear in the corpus receive zero probability. This causes the sparsity problem.
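
A minimal counting sketch of the estimation above (maximum likelihood over a toy two-sentence corpus, bigram case):

from collections import Counter
from nltk.tokenize import word_tokenize

corpus = "his barber kept his word . his barber kept his secret ."
tokens = word_tokenize(corpus)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def cond_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(cond_prob("his", "barber"))   # 2 / 4 = 0.5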


N-gram Language Model


  • Sparsity problem
  • Trade off (sparsity vs accuracy)
    • The larger n is, the more severe the sparsity problem, but the more context the model captures (potentially higher accuracy).
    • The smaller n is, the milder the sparsity problem, but the lower the accuracy.

In the general case, the next word is conditioned on all previous words: P(w_i|w_1,…,w_i-1). An n-gram model approximates this with only the previous n-1 words: P(w_i|w_i-n+1,…,w_i-1).
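
The n-gram windows themselves can be listed directly with nltk.ngrams (the sentence is arbitrary):

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("His barber kept his word")
print(list(ngrams(tokens, 2)))   # bigrams
print(list(ngrams(tokens, 3)))   # trigrams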




Evaluation: Perplexity(Branching factor)

  • Evaluation
    • extrinsic
    • intrinsic : Perplexity

PPL(W) = P(w_1,w_2,…,w_N)^(-1/N)
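
A tiny numeric illustration of the intrinsic measure above (the probability value is made up):

N = 10                       # number of tokens in the test sentence
p_sentence = 1e-10           # probability the model assigned to the whole sentence
ppl = p_sentence ** (-1 / N)
print(ppl)                   # 10.0: the model hesitates between about 10 choices per word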





Neural Network Based Language Model

Feed Forward Neural Network Language Model, FFNNLM : Neural Probabilistic Language Model

  • Improvement : Solving sparsity problem
  • Limitation : Fixed-length input




Recurrent Neural Network Language Model, RNNLM
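
A minimal Keras sketch of the idea (the sizes are made-up assumptions, not values from this post): an Embedding layer maps token ids to vectors, a SimpleRNN summarizes the variable-length history, and a softmax layer outputs the next-word distribution.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size, seq_len, emb_dim = 5000, 20, 64   # hypothetical sizes

model = Sequential([
    Embedding(vocab_size, emb_dim),           # token ids -> dense vectors
    SimpleRNN(128),                           # summarize the history into one hidden state
    Dense(vocab_size, activation="softmax"),  # distribution over the next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

dummy = np.random.randint(0, vocab_size, size=(2, seq_len))   # two fake input sequences
print(model.predict(dummy).shape)             # (2, 5000): one next-word distribution each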






Quantification : Word representation

(figure: taxonomy of word representations, divided into local and continuous representations)

Local Representation

  • WDM, Word-Document Matrix
  • TF-IDF, Term Frequency-Inverse Document Frequency
  • WCM, Word-Context Matrix
  • PMIM, Point-wise Mutual Information Matrix




Count based word Representation(1) : BoW

Bag of Words

from konlpy.tag import Okt
import re  
okt=Okt()  

token = re.sub(r"(\.)", "", "정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다.")
# Cleaning step: remove the period with a regular expression.
token = okt.morphs(token)
# Tokenize with the Okt morphological analyzer and store the result in token.

word2index = {}
bow = []
for voca in token:
    if voca not in word2index.keys():
        # A word not yet in word2index gets a new index,
        # and its count in the BoW starts at 1 (every word occurs at least once).
        word2index[voca] = len(word2index)
        bow.insert(len(word2index) - 1, 1)
    else:
        # A word that reappears: look up its index and add 1 to its count.
        index = word2index.get(voca)
        bow[index] = bow[index] + 1
print(word2index)
print(bow)

{‘정부’: 0, ‘가’: 1, ‘발표’: 2, ‘하는’: 3, ‘물가상승률’: 4, ‘과’: 5, ‘소비자’: 6, ‘느끼는’: 7, ‘은’: 8, ‘다르다’: 9}
[1, 2, 1, 1, 2, 1, 1, 1, 1, 1]



Create Bag of Words with CountVectorizer class

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray()) # record the frequency of each word in the corpus
print(vector.vocabulary_) # show the index assigned to each word

[[1 1 2 1 2 1]]
{‘you’: 4, ‘know’: 1, ‘want’: 3, ‘your’: 5, ‘love’: 2, ‘because’: 0}



Remove stopwords in Bag of Words
Custom stopwords

from sklearn.feature_extraction.text import CountVectorizer

text=["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words=["the", "a", "an", "is", "not"])

print(vect.fit_transform(text).toarray()) 
print(vect.vocabulary_)

[[1 1 1 1 1]]
{‘family’: 1, ‘important’: 2, ‘thing’: 4, ‘it’: 3, ‘everything’: 0}
CountVectorizer stopwords

from sklearn.feature_extraction.text import CountVectorizer

text=["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words="english")

print(vect.fit_transform(text).toarray())
print(vect.vocabulary_)

[[1 1 1]]
{‘family’: 0, ‘important’: 1, ‘thing’: 2}
NLTK stopwords

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

text=["Family is not an important thing. It's everything."]
sw = stopwords.words("english")
vect = CountVectorizer(stop_words =sw)
print(vect.fit_transform(text).toarray()) 
print(vect.vocabulary_)

[[1 1 1 1]]
{‘family’: 1, ‘important’: 2, ‘thing’: 3, ‘everything’: 0}



Count based word Representation(2) : DTM

Document-Term Matrix, DTM
(figure: an example document-term matrix, with documents as rows and terms as columns)

  • Limitations : Sparse representation




Count based word Representation(3) : TF-IDF

Term Frequency-Inverse Document Frequency
Implementation : pandas

import pandas as pd # for DataFrames
from math import log # for computing IDF

docs = [
  '먹고 싶은 사과',
  '먹고 싶은 바나나',
  '길고 노란 바나나 바나나',
  '저는 과일이 좋아요'
] 
vocab = list(set(w for doc in docs for w in doc.split()))
vocab.sort()



N = len(docs) # total number of documents

def tf(t, d):
    return d.count(t)

def idf(t):
    df = 0
    for doc in docs:
        df += t in doc
    return log(N/(df + 1))

def tfidf(t, d):
    return tf(t,d)* idf(t)
    


result = []
for i in range(N): # for each document, run the commands below
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]        
        result[-1].append(tf(t, d))

tf_ = pd.DataFrame(result, columns = vocab)
print(tf_)

(output: term frequency table, documents x vocabulary)

result = []
for j in range(len(vocab)):
    t = vocab[j]
    result.append(idf(t))

idf_ = pd.DataFrame(result, index = vocab, columns = ["IDF"])
print(idf_)

(output: IDF value for each term)

result = []
for i in range(N):
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]

        result[-1].append(tfidf(t,d))

tfidf_ = pd.DataFrame(result, columns = vocab)
print(tfidf_)

(output: TF-IDF matrix, documents x vocabulary)

Implementation : scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]
vector = CountVectorizer()

print(vector.fit_transform(corpus).toarray()) # record the frequency of each word in the corpus
print(vector.vocabulary_) # show the index assigned to each word

[[0 1 0 1 0 1 0 1 1]
[0 0 1 0 0 0 0 1 0]
[1 0 0 0 1 0 1 0 0]]
{‘you’: 7, ‘know’: 1, ‘want’: 5, ‘your’: 8, ‘love’: 3, ‘like’: 2, ‘what’: 6, ‘should’: 4, ‘do’: 0}

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]
tfidfv = TfidfVectorizer().fit(corpus)

print(tfidfv.transform(corpus).toarray())
print(tfidfv.vocabulary_)

[[0. 0.46735098 0. 0.46735098 0. 0.46735098 0. 0.35543247 0.46735098]
[0. 0. 0.79596054 0. 0. 0. 0. 0.60534851 0. ]
[0.57735027 0. 0. 0. 0.57735027 0. 0.57735027 0. 0. ]]
{‘you’: 7, ‘know’: 1, ‘want’: 5, ‘your’: 8, ‘love’: 3, ‘like’: 2, ‘what’: 6, ‘should’: 4, ‘do’: 0}

Implementation : keras






Continuous Representation

Topic modeling(1) : LSA

Singular Value Decomposition, SVD

import numpy as np

A = np.array([[0,0,0,1,0,1,1,0,0],
              [0,0,0,1,1,0,1,0,0],
              [0,1,1,0,2,0,0,0,0],
              [1,0,0,0,0,0,0,1,1]])
              
# Full SVD              
U, s, VT = np.linalg.svd(A, full_matrices = True)  
S = np.zeros(np.shape(A))         # create a 4 x 9 zero matrix, the size of the diagonal matrix
S[:4, :4] = np.diag(s)            # insert the singular values into the diagonal

# Truncated SVD
S=S[:2,:2]
U=U[:,:2]
VT=VT[:2,:]




Latent Semantic Analysis, LSA
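
A minimal LSA sketch with scikit-learn's TruncatedSVD on the same toy DTM (2 topics is an arbitrary choice):

import numpy as np
from sklearn.decomposition import TruncatedSVD

A = np.array([[0,0,0,1,0,1,1,0,0],
              [0,0,0,1,1,0,1,0,0],
              [0,1,1,0,2,0,0,0,0],
              [1,0,0,0,0,0,0,1,1]])

svd = TruncatedSVD(n_components=2)
doc_topic = svd.fit_transform(A)     # documents in the 2-dimensional latent space
print(doc_topic.shape)               # (4, 2)
print(svd.components_.shape)         # (2, 9): latent topics x terms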





Topic modeling(2) : LDA

Latent Dirichlet Allocation, LDA
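
A minimal LDA sketch with scikit-learn, reusing the toy documents from the TF-IDF section (2 topics is an arbitrary choice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['먹고 싶은 사과', '먹고 싶은 바나나', '길고 노란 바나나 바나나', '저는 과일이 좋아요']
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
print(lda.fit_transform(X))          # document-topic distribution
print(lda.components_.shape)         # (2, vocabulary size): topic-word weights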




Word Embedding





Document Similarity
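
A minimal document-similarity sketch: cosine similarity between TF-IDF vectors, reusing the corpus from the scikit-learn TF-IDF example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['you know I want your love', 'I like you', 'what should I do ']
tfidf = TfidfVectorizer().fit_transform(corpus)
print(cosine_similarity(tfidf, tfidf))   # pairwise similarity matrix, 1.0 on the diagonal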





Recurrent Neural Network

Word-level





Character-level






Text Classification





Tagging Task





Neural Machine Translation





Attention Mechanism

attention explanation URL



Transformer





Convolutional Neural Network





List of posts followed by this article


Reference

