
AAI02, What is natural language processing

List of posts to read before reading this article


Contents


Korean morpheme analysis

konlpy installation

URL

$ sudo apt-get install g++ openjdk-8-jdk python3-dev python3-pip curl      # Install Java 1.8 or up
$ python3 -m pip install --upgrade pip
$ python3 -m pip install konlpy                                            # Python 3.x




Hannanum : KAIST

from konlpy.tag import Hannanum

hannanum = Hannanum()
analyze = hannanum.analyze((u'대한민국은 아름다운 나라이다.'))
morphs = hannanum.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = hannanum.nouns((u'대한민국은 아름다운 나라이다.'))
pos = hannanum.pos((u'대한민국은 아름다운 나라이다.'))

print("analyze :\n", analyze)
print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)





Kkma : SNU

from konlpy.tag import Kkma

kkma = Kkma()
morphs = kkma.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = kkma.nouns((u'대한민국은 아름다운 나라이다.'))
pos = kkma.pos((u'대한민국은 아름다운 나라이다.'))
sentences = kkma.sentences((u'대한민국은 아름다운 나라이다.'))

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
print("sentences :\n", sentences)





Komoran : Shineware
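
Komoran is also bundled with konlpy and exposes the same morphs/nouns/pos interface as the other analyzers; a minimal sketch mirroring the examples above:

from konlpy.tag import Komoran

komoran = Komoran()
morphs = komoran.morphs(u'대한민국은 아름다운 나라이다.')
nouns = komoran.nouns(u'대한민국은 아름다운 나라이다.')
pos = komoran.pos(u'대한민국은 아름다운 나라이다.')

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)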






Mecab : Eunjeon project

$ sudo apt-get install curl git
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
from konlpy.tag import Mecab

mecab = Mecab()
morphs = mecab.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = mecab.nouns((u'대한민국은 아름다운 나라이다.'))
pos = mecab.pos((u'대한민국은 아름다운 나라이다.'))

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)





Okt : Twitter

from konlpy.tag import Okt

okt = Okt()
morphs = okt.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = okt.nouns((u'대한민국은 아름다운 나라이다.'))
pos = okt.pos((u'대한민국은 아름다운 나라이다.'))
phrases = okt.phrases((u'대한민국은 아름다운 나라이다.'))

print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
print("phrases :\n", phrases)





Introduction

Keywords : Unsupervised machine translation, Pretrained language models, Common sense inference datasets, Meta-learning, Robust unsupervised methods, Understanding representations, Clever auxiliary tasks, Combining semi-supervised learning with transfer learning, QA and reasoning with large documents, Inductive bias




NLP Categorization

  • Phonology : Linguistic sounds
    • Speech to Text(STT)
  • Morphology : Meaningful components of words
    • Lexical analysis
  • Syntax : Structural relationships between words
    • Syntax analysis
  • Semantics : Meaning
    • Semantic analysis
  • Pragmatics : How language is used to accomplish goals
    • Pragmatic analysis
  • Discourse : Larger linguistic units





NLP Research trend

  • Rule-based approach(deductive reasoning, deterministic)
  • Statistical approach(inductive reasoning, stochastic)
  • Machine learning approach(inductive reasoning, stochastic) : end-to-end multi-task learning



Upstream-task

  • Tokenize
  • Embedding
    • Factorization based(Matrix decomposition)
      • GloVe, Swivel
    • Prediction based (see the sketch after this list)
      • Word2Vec, FastText, BERT, ELMo, GPT
    • Topic based
      • LDA
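
As a rough illustration of the prediction-based family above, here is a minimal Word2Vec sketch using gensim (this example assumes gensim >= 4.0 is installed; the toy corpus is made up):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (normally the output of a tokenizer).
sentences = [["natural", "language", "processing", "is", "fun"],
             ["language", "models", "predict", "words"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["language"][:5])          # first 5 dimensions of the embedding
print(model.wv.most_similar("language"))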





Downstream-task

  • Part of Speech tagging
  • Named Entity Recognition
  • Semantic Role Labeling





Text preprocessing(1) : Lexical analysis

Tokenization

Word Tokenization

word_tokenize

from nltk.tokenize import word_tokenize  

print(word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))  

[‘Do’, “n’t”, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘,’, ‘Mr.’, ‘Jone’, “‘s”, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’, ‘.’]

WordPunctTokenizer

from nltk.tokenize import WordPunctTokenizer  

print(WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

[‘Don’, “’”, ‘t’, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘,’, ‘Mr’, ‘.’, ‘Jone’, “’”, ‘s’, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’, ‘.’]


text_to_word_sequence

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

[“don’t”, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘mr’, “jone’s”, ‘orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’]




Consideration

  • Don’t simply exclude punctuation marks or special characters.
    • ex] Ph.D, AT&T, 123,456,789
  • In case of abbreviations and spacing within words
    • ex] rock ‘n’ roll(abbreviation), New York(spacing within words)
  • Standard : Penn Treebank Tokenization

TreebankWordTokenizer

from nltk.tokenize import TreebankWordTokenizer

tokenizer=TreebankWordTokenizer()
text="Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."
print(tokenizer.tokenize(text))

[‘Starting’, ‘a’, ‘home-based’, ‘restaurant’, ‘may’, ‘be’, ‘an’, ‘ideal.’, ‘it’, ‘does’, “n’t”, ‘have’, ‘a’, ‘food’, ‘chain’, ‘or’, ‘restaurant’, ‘of’, ‘their’, ‘own’, ‘.’]




Sentence Tokenization

sent_tokenize

from nltk.tokenize import sent_tokenize

text="His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff. He dug a hole in the midst of some reeds. He looked about, to mae sure no one was near."
print(sent_tokenize(text))

[‘His barber kept his word.’, ‘But keeping such a huge secret to himself was driving him crazy.’, ‘Finally, the barber went up a mountain and almost to the edge of a cliff.’, ‘He dug a hole in the midst of some reeds.’, ‘He looked about, to make sure no one was near.’]

from nltk.tokenize import sent_tokenize

text="I am actively looking for Ph.D. students. and you are a Ph.D student."
print(sent_tokenize(text))

[‘I am actively looking for Ph.D. students.’, ‘and you are a Ph.D student.’]


Part-of-speech tagging

English

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text="I am actively looking for Ph.D. students. and you are a Ph.D. student."
x=word_tokenize(text)

print(x)
pos_tag(x)

[‘I’, ‘am’, ‘actively’, ‘looking’, ‘for’, ‘Ph.D.’, ‘students’, ‘.’, ‘and’, ‘you’, ‘are’, ‘a’, ‘Ph.D.’, ‘student’, ‘.’]
[(‘I’, ‘PRP’), (‘am’, ‘VBP’), (‘actively’, ‘RB’), (‘looking’, ‘VBG’), (‘for’, ‘IN’), (‘Ph.D.’, ‘NNP’), (‘students’, ‘NNS’), (‘.’, ‘.’), (‘and’, ‘CC’), (‘you’, ‘PRP’), (‘are’, ‘VBP’), (‘a’, ‘DT’), (‘Ph.D.’, ‘NNP’), (‘student’, ‘NN’), (‘.’, ‘.’)]

Reference

PRP : personal pronoun
VBP : verb, non-3rd person singular present
RB : adverb
VBG : verb, gerund or present participle
IN : preposition
NNP : proper noun, singular
NNS : noun, plural
CC : coordinating conjunction
DT : determiner




Korean

from konlpy.tag import Kkma  

kkma=Kkma()  
print(kkma.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(kkma.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))  
print(kkma.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))  

[‘열심히’, ‘코딩’, ‘하’, ‘ㄴ’, ‘당신’, ‘,’, ‘연휴’, ‘에’, ‘는’, ‘여행’, ‘을’, ‘가보’, ‘아요’]
[(‘열심히’,’MAG’), (‘코딩’, ‘NNG’), (‘하’, ‘XSV’), (‘ㄴ’, ‘ETD’), (‘당신’, ‘NP’), (‘,’, ‘SP’), (‘연휴’, ‘NNG’), (‘에’, ‘JKM’), (‘는’, ‘JX’), (‘여행’, ‘NNG’), (‘을’, ‘JKO’), (‘가보’, ‘VV’), (‘아요’, ‘EFN’)]
[‘코딩’, ‘당신’, ‘연휴’, ‘여행’]




Named entity recognition
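
One simple way to try named entity recognition is NLTK's pretrained ne_chunk chunker, which labels PERSON / ORGANIZATION / GPE spans on top of POS tags (a minimal sketch; the maxent_ne_chunker and words resources must be downloaded once, and the example sentence is arbitrary):

from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# import nltk; nltk.download('maxent_ne_chunker'); nltk.download('words')  # one-time setup
sentence = "James is working at Disney in London."
tagged = pos_tag(word_tokenize(sentence))
print(ne_chunk(tagged))   # a Tree with named-entity chunks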




Co-reference




Basic dependencies
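
One way to inspect basic dependencies is spaCy's dependency parser (a minimal sketch; it assumes the en_core_web_sm model is installed, which is not used elsewhere in this post):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The barber kept his word.")
for token in doc:
    print(token.text, token.dep_, token.head.text)   # token, dependency label, head word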




Tokenization with regular expression

import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer=RegexpTokenizer(r"[\w]+")
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))

[‘Don’, ‘t’, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name’, ‘Mr’, ‘Jone’, ‘s’, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’]

import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer=RegexpTokenizer(r"[\s]+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))

[“Don’t”, ‘be’, ‘fooled’, ‘by’, ‘the’, ‘dark’, ‘sounding’, ‘name,’, ‘Mr.’, “Jone’s”, ‘Orphanage’, ‘is’, ‘as’, ‘cheery’, ‘as’, ‘cheery’, ‘goes’, ‘for’, ‘a’, ‘pastry’, ‘shop’]





Cleaning and normalization : Morphological analysis

  • morphology
    • stem
    • affix




Lemmatization : preserves POS

WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

n=WordNetLemmatizer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([n.lemmatize(w) for w in words])

[‘policy’, ‘doing’, ‘organization’, ‘have’, ‘going’, ‘love’, ‘life’, ‘fly’, ‘dy’, ‘watched’, ‘ha’, ‘starting’]

The results above include forms that are not meaningful words, such as 'dy' and 'ha'. This happens because the lemmatizer needs to know the part of speech of the original word to produce an accurate result.

from nltk.stem import WordNetLemmatizer

n=WordNetLemmatizer()
print(n.lemmatize('dies', 'v'))
print(n.lemmatize('watched', 'v'))
print(n.lemmatize('has', 'v'))

‘die’
‘watch’
‘have’




Stemming : does not preserve POS

Stemming with the Porter algorithm
PorterStemmer

from nltk.stem import PorterStemmer

s=PorterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([s.stem(w) for w in words])

[‘polici’, ‘do’, ‘organ’, ‘have’, ‘go’, ‘love’, ‘live’, ‘fli’, ‘die’, ‘watch’, ‘ha’, ‘start’]

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

s = PorterStemmer()
text="This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
words=word_tokenize(text)

print(words)
print([s.stem(w) for w in words])

[‘This’, ‘was’, ‘not’, ‘the’, ‘map’, ‘we’, ‘found’, ‘in’, ‘Billy’, ‘Bones’, “‘s”, ‘chest’, ‘,’, ‘but’, ‘an’, ‘accurate’, ‘copy’, ‘,’, ‘complete’, ‘in’, ‘all’, ‘things’, ‘–’, ‘names’, ‘and’, ‘heights’, ‘and’, ‘soundings’, ‘–’, ‘with’, ‘the’, ‘single’, ‘exception’, ‘of’, ‘the’, ‘red’, ‘crosses’, ‘and’, ‘the’, ‘written’, ‘notes’, ‘.’]
[‘thi’, ‘wa’, ‘not’, ‘the’, ‘map’, ‘we’, ‘found’, ‘in’, ‘billi’, ‘bone’, “‘s”, ‘chest’, ‘,’, ‘but’, ‘an’, ‘accur’, ‘copi’, ‘,’, ‘complet’, ‘in’, ‘all’, ‘thing’, ‘–’, ‘name’, ‘and’, ‘height’, ‘and’, ‘sound’, ‘–’, ‘with’, ‘the’, ‘singl’, ‘except’, ‘of’, ‘the’, ‘red’, ‘cross’, ‘and’, ‘the’, ‘written’, ‘note’, ‘.’]

The results of the above algorithm include words that are not in the dictionary.




Stemming with the Lancaster algorithm
LancasterStemmer

from nltk.stem import LancasterStemmer

l=LancasterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([l.stem(w) for w in words])




Removing Unnecessary Words(noise data)

Stopword
List of stopwords for English
stopwords

from nltk.corpus import stopwords  
stopwords.words('english')[:10]

[‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘our’, ‘ours’, ‘ourselves’, ‘you’, ‘your’]
Removing stopwords for English

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

example = "Family is not an important thing. It's everything."
stop_words = set(stopwords.words('english')) 

word_tokens = word_tokenize(example)

result = []
for w in word_tokens: 
    if w not in stop_words: 
        result.append(w) 

print(word_tokens) 
print(result) 

[‘Family’, ‘is’, ‘not’, ‘an’, ‘important’, ‘thing’, ‘.’, ‘It’, “‘s”, ‘everything’, ‘.’]
[‘Family’, ‘important’, ‘thing’, ‘.’, ‘It’, “‘s”, ‘everything’, ‘.’]
List of stopwords for Korean


Removing stopwords for Korean

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

example = "고기를 아무렇게나 구우려고 하면 안 돼. 고기라고 다 같은 게 아니거든. 예컨대 삼겹살을 구울 때는 중요한 게 있지."
stop_words = "아무거나 아무렇게나 어찌하든지 같다 비슷하다 예컨대 이럴정도로 하면 아니거든"
# The stopwords above were chosen arbitrarily by the author from non-noun words; this is not a meaningful selection criterion.
stop_words=stop_words.split(' ')
word_tokens = word_tokenize(example)

result = [] 
for w in word_tokens: 
    if w not in stop_words: 
        result.append(w) 
# The four lines above can be replaced with the single line below
# result = [word for word in word_tokens if word not in stop_words]

print(word_tokens) 
print(result)

[‘고기를’, ‘아무렇게나’, ‘구우려고’, ‘하면’, ‘안’, ‘돼’, ‘.’, ‘고기라고’, ‘다’, ‘같은’, ‘게’, ‘아니거든’, ‘.’, ‘예컨대’, ‘삼겹살을’, ‘구울’, ‘때는’, ‘중요한’, ‘게’, ‘있지’, ‘.’]
[‘고기를’, ‘구우려고’, ‘안’, ‘돼’, ‘.’, ‘고기라고’, ‘다’, ‘같은’, ‘게’, ‘.’, ‘삼겹살을’, ‘구울’, ‘때는’, ‘중요한’, ‘게’, ‘있지’, ‘.’]



Rare words




Words with a very short length

import re

text = "I was wondering if anyone out there could enlighten me on this car."
shortword = re.compile(r'\W*\b\w{1,2}\b')

print(shortword.sub('', text))

was wondering anyone out there could enlighten this car.





Encoding

Integer encoding
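
A minimal integer-encoding sketch with the Keras Tokenizer (reusing the example sentence from the stopword section; frequency-based indices start at 1):

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["Family is not an important thing. It's everything."]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)                 # build the vocabulary
print(tokenizer.word_index)                    # word -> integer index
print(tokenizer.texts_to_sequences(corpus))    # sentence -> list of indices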




One-hot encoding
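
A minimal one-hot encoding sketch on top of the integer encoding above, using Keras to_categorical (column 0 stays unused because Tokenizer indices start at 1):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

corpus = ["you know I want your love"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
encoded = tokenizer.texts_to_sequences(corpus)[0]   # integer-encoded sentence
print(to_categorical(encoded))                      # one row per token, one column per index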




Byte Pair encoding
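
A minimal byte pair encoding sketch following the classic merge procedure of Sennrich et al. (2016), applied to a toy frequency dictionary (the symbol </w> marks the end of a word):

import re, collections

def get_stats(vocab):
    # Count how often each pair of adjacent symbols occurs.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Merge the chosen pair into a single symbol everywhere it occurs.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                    # 10 merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair
    vocab = merge_vocab(best, vocab)
    print(best)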





Text preprocessing(2) : Syntax analysis





Language model

A language model is a criterion for determining whether a sentence is natural.

  • Assign probabilities for word sequences
    • Machine Translation
    • Spell Correction
    • Speech Recognition




Statistical Language Model, SLM

  • Prediction of the next word given the previous words
    • P(W) = P(w_1,w_2,w_3,…,w_n)
    • P(w_n|w_1,…,w_n-1) = P(w_1,w_2,…,w_n)/P(w_1,w_2,w_3,…,w_n-1)

P(w_1,w_2,…,w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) … P(w_n|w_1,…,w_n-1)  (chain rule)

Each conditional probability is estimated from counts in a training corpus, so word sequences that never appear in the corpus receive zero probability. This causes the sparsity problem.
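
A minimal counting sketch of the estimation above (maximum likelihood over a toy two-sentence corpus, bigram case):

from collections import Counter
from nltk.tokenize import word_tokenize

corpus = "his barber kept his word . his barber kept his secret ."
tokens = word_tokenize(corpus)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def cond_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(cond_prob("his", "barber"))   # 2 / 4 = 0.5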


N-gram Language Model


  • Sparsity problem
  • Trade off (sparsity vs accuracy)
    • The larger n is, the more severe the sparsity problem, but the more context the model captures (potentially higher accuracy).
    • The smaller n is, the milder the sparsity problem, but the lower the accuracy.

In the general case, the next word is conditioned on all previous words: P(w_i|w_1,…,w_i-1). An n-gram model approximates this with only the previous n-1 words: P(w_i|w_i-n+1,…,w_i-1).
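
The n-gram windows themselves can be listed directly with nltk.ngrams (the sentence is arbitrary):

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("His barber kept his word")
print(list(ngrams(tokens, 2)))   # bigrams
print(list(ngrams(tokens, 3)))   # trigrams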




Evaluation: Perplexity(Branching factor)

  • Evaluation
    • extrinsic
    • intrinsic : Perplexity

PPL(W) = P(w_1,w_2,…,w_N)^(-1/N)
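
A tiny numeric illustration of the intrinsic measure above (the probability value is made up):

N = 10                       # number of tokens in the test sentence
p_sentence = 1e-10           # probability the model assigned to the whole sentence
ppl = p_sentence ** (-1 / N)
print(ppl)                   # 10.0: the model hesitates between about 10 choices per word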





Neural Network Based Language Model

Feed Forward Neural Network Language Model, FFNNLM : Neural Probabilistic Language Model

  • Improvement : Solving sparsity problem
  • Limitation : Fixed-length input




Recurrent Neural Network Language Model, RNNLM
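
A minimal Keras sketch of the idea (the sizes are made-up assumptions, not values from this post): an Embedding layer maps token ids to vectors, a SimpleRNN summarizes the variable-length history, and a softmax layer outputs the next-word distribution.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size, seq_len, emb_dim = 5000, 20, 64   # hypothetical sizes

model = Sequential([
    Embedding(vocab_size, emb_dim),           # token ids -> dense vectors
    SimpleRNN(128),                           # summarize the history into one hidden state
    Dense(vocab_size, activation="softmax"),  # distribution over the next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

dummy = np.random.randint(0, vocab_size, size=(2, seq_len))   # two fake input sequences
print(model.predict(dummy).shape)             # (2, 5000): one next-word distribution each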






Quantification : Word representation

(figure: taxonomy of word representations, divided into local and continuous representations)

Local Representation

  • WDM, Word-Document Matrix
  • TF-IDF, Term Frequency-Inverse Document Frequency
  • WCM, Word-Context Matrix
  • PMIM, Point-wise Mutual Information Matrix




Count based word Representation(1) : BoW

Bag of Words

from konlpy.tag import Okt
import re  
okt=Okt()  

token = re.sub(r"(\.)", "", "정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다.")
# Cleaning step: remove the period with a regular expression.
token = okt.morphs(token)
# Tokenize with the Okt morphological analyzer and store the result in token.

word2index = {}
bow = []
for voca in token:
    if voca not in word2index.keys():
        # A word not yet in word2index gets a new index,
        # and its count in the BoW starts at 1 (every word occurs at least once).
        word2index[voca] = len(word2index)
        bow.insert(len(word2index) - 1, 1)
    else:
        # A word that reappears: look up its index and add 1 to its count.
        index = word2index.get(voca)
        bow[index] = bow[index] + 1
print(word2index)
print(bow)

{‘정부’: 0, ‘가’: 1, ‘발표’: 2, ‘하는’: 3, ‘물가상승률’: 4, ‘과’: 5, ‘소비자’: 6, ‘느끼는’: 7, ‘은’: 8, ‘다르다’: 9}
[1, 2, 1, 1, 2, 1, 1, 1, 1, 1]



Create Bag of Words with CountVectorizer class

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray()) # record the frequency of each word in the corpus
print(vector.vocabulary_) # show the index assigned to each word

[[1 1 2 1 2 1]]
{‘you’: 4, ‘know’: 1, ‘want’: 3, ‘your’: 5, ‘love’: 2, ‘because’: 0}



Remove stopwords in Bag of Words
Custom stopwords

from sklearn.feature_extraction.text import CountVectorizer

text=["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words=["the", "a", "an", "is", "not"])

print(vect.fit_transform(text).toarray()) 
print(vect.vocabulary_)

[[1 1 1 1 1]]
{‘family’: 1, ‘important’: 2, ‘thing’: 4, ‘it’: 3, ‘everything’: 0}
CountVectorizer stopwords

from sklearn.feature_extraction.text import CountVectorizer

text=["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words="english")

print(vect.fit_transform(text).toarray())
print(vect.vocabulary_)

[[1 1 1]]
{‘family’: 0, ‘important’: 1, ‘thing’: 2}
NLTK stopwords

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

text=["Family is not an important thing. It's everything."]
sw = stopwords.words("english")
vect = CountVectorizer(stop_words =sw)
print(vect.fit_transform(text).toarray()) 
print(vect.vocabulary_)

[[1 1 1 1]]
{‘family’: 1, ‘important’: 2, ‘thing’: 3, ‘everything’: 0}



Count based word Representation(2) : DTM

Document-Term Matrix, DTM
(figure: an example document-term matrix, with documents as rows and terms as columns)

  • Limitations : Sparse representation




Count based word Representation(3) : TF-IDF

Term Frequency-Inverse Document Frequency
Implementation : pandas

import pandas as pd # for DataFrames
from math import log # for computing IDF

docs = [
  '먹고 싶은 사과',
  '먹고 싶은 바나나',
  '길고 노란 바나나 바나나',
  '저는 과일이 좋아요'
] 
vocab = list(set(w for doc in docs for w in doc.split()))
vocab.sort()



N = len(docs) # total number of documents

def tf(t, d):
    return d.count(t)

def idf(t):
    df = 0
    for doc in docs:
        df += t in doc
    return log(N/(df + 1))

def tfidf(t, d):
    return tf(t,d)* idf(t)
    


result = []
for i in range(N): # for each document, run the commands below
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]        
        result[-1].append(tf(t, d))

tf_ = pd.DataFrame(result, columns = vocab)
print(tf_)

(output: term frequency table, documents x vocabulary)

result = []
for j in range(len(vocab)):
    t = vocab[j]
    result.append(idf(t))

idf_ = pd.DataFrame(result, index = vocab, columns = ["IDF"])
print(idf_)

(output: IDF value for each term)

result = []
for i in range(N):
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]

        result[-1].append(tfidf(t,d))

tfidf_ = pd.DataFrame(result, columns = vocab)
print(tfidf_)

(output: TF-IDF matrix, documents x vocabulary)

Implementation : scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]
vector = CountVectorizer()

print(vector.fit_transform(corpus).toarray()) # record the frequency of each word in the corpus
print(vector.vocabulary_) # show the index assigned to each word

[[0 1 0 1 0 1 0 1 1]
[0 0 1 0 0 0 0 1 0]
[1 0 0 0 1 0 1 0 0]]
{‘you’: 7, ‘know’: 1, ‘want’: 5, ‘your’: 8, ‘love’: 3, ‘like’: 2, ‘what’: 6, ‘should’: 4, ‘do’: 0}

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]
tfidfv = TfidfVectorizer().fit(corpus)

print(tfidfv.transform(corpus).toarray())
print(tfidfv.vocabulary_)

[[0. 0.46735098 0. 0.46735098 0. 0.46735098 0. 0.35543247 0.46735098]
[0. 0. 0.79596054 0. 0. 0. 0. 0.60534851 0. ]
[0.57735027 0. 0. 0. 0.57735027 0. 0.57735027 0. 0. ]]
{‘you’: 7, ‘know’: 1, ‘want’: 5, ‘your’: 8, ‘love’: 3, ‘like’: 2, ‘what’: 6, ‘should’: 4, ‘do’: 0}

Implementation : keras






Continuous Representation

Topic modeling(1) : LSA

Singular Value Decomposition, SVD

import numpy as np

A = np.array([[0,0,0,1,0,1,1,0,0],
              [0,0,0,1,1,0,1,0,0],
              [0,1,1,0,2,0,0,0,0],
              [1,0,0,0,0,0,0,1,1]])
              
# Full SVD              
U, s, VT = np.linalg.svd(A, full_matrices = True)  
S = np.zeros(np.shape(A))         # create a 4 x 9 zero matrix, the size of the diagonal matrix
S[:4, :4] = np.diag(s)            # insert the singular values into the diagonal

# Truncated SVD
S=S[:2,:2]
U=U[:,:2]
VT=VT[:2,:]




Latent Semantic Analysis, LSA
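
A minimal LSA sketch with scikit-learn's TruncatedSVD on the same toy DTM (2 topics is an arbitrary choice):

import numpy as np
from sklearn.decomposition import TruncatedSVD

A = np.array([[0,0,0,1,0,1,1,0,0],
              [0,0,0,1,1,0,1,0,0],
              [0,1,1,0,2,0,0,0,0],
              [1,0,0,0,0,0,0,1,1]])

svd = TruncatedSVD(n_components=2)
doc_topic = svd.fit_transform(A)     # documents in the 2-dimensional latent space
print(doc_topic.shape)               # (4, 2)
print(svd.components_.shape)         # (2, 9): latent topics x terms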





Topic modeling(2) : LDA

Latent Dirichlet Allocation, LDA
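
A minimal LDA sketch with scikit-learn, reusing the toy documents from the TF-IDF section (2 topics is an arbitrary choice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['먹고 싶은 사과', '먹고 싶은 바나나', '길고 노란 바나나 바나나', '저는 과일이 좋아요']
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
print(lda.fit_transform(X))          # document-topic distribution
print(lda.components_.shape)         # (2, vocabulary size): topic-word weights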




Word Embedding





Document Similarity
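
A minimal document-similarity sketch: cosine similarity between TF-IDF vectors, reusing the corpus from the scikit-learn TF-IDF example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['you know I want your love', 'I like you', 'what should I do ']
tfidf = TfidfVectorizer().fit_transform(corpus)
print(cosine_similarity(tfidf, tfidf))   # pairwise similarity matrix, 1.0 on the diagonal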





Recurrent Neural Network

Word-level





Character-level






Text Classification





Tagging Task





Neural Machine Translation





Attention Mechanism

attention explanation URL



Transformer





Convolutional Neural Network





List of posts followed by this article


Reference

