AAI02, What is natural language processing?
Contents
- Korean morpheme analysis
- KoNLPy installation
- Introduction
- Text preprocessing(1) : Lexical analysis
- Text preprocessing(2) : Syntax analysis
- Language model
- Quantification : Word representation
- Recurrent Neural Network
- Text Classification
- Tagging Task
- Neural Machine Translation
- Attention Mechanism
- Transformer
- Convolutional Neural Network
Korean morpheme analysis
KoNLPy installation
$ sudo apt-get install g++ openjdk-8-jdk python3-dev python3-pip curl # Install Java 1.8 or up
$ python3 -m pip install --upgrade pip
$ python3 -m pip install konlpy # Python 3.x
Hannanum : KAIST
from konlpy.tag import Hannanum
hannanum = Hannanum()
analyze = hannanum.analyze((u'대한민국은 아름다운 나라이다.'))
morphs = hannanum.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = hannanum.nouns((u'대한민국은 아름다운 나라이다.'))
pos = hannanum.pos((u'대한민국은 아름다운 나라이다.'))
print("analyze :\n", analyze)
print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
Kkma : SNU
from konlpy.tag import Kkma
kkma = Kkma()
morphs = kkma.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = kkma.nouns((u'대한민국은 아름다운 나라이다.'))
pos = kkma.pos((u'대한민국은 아름다운 나라이다.'))
sentences = kkma.sentences((u'대한민국은 아름다운 나라이다.'))
print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
print("sentences :\n", sentences)
Komoran : Shineware
Mecab : Eunjeon project
$ sudo apt-get install curl git
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
from konlpy.tag import Mecab
mecab = Mecab()
morphs = mecab.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = mecab.nouns((u'대한민국은 아름다운 나라이다.'))
pos = mecab.pos((u'대한민국은 아름다운 나라이다.'))
print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
Okt : Twitter
from konlpy.tag import Okt
okt = Okt()
morphs = okt.morphs((u'대한민국은 아름다운 나라이다.'))
nouns = okt.nouns((u'대한민국은 아름다운 나라이다.'))
pos = okt.pos((u'대한민국은 아름다운 나라이다.'))
phrases = okt.phrases((u'대한민국은 아름다운 나라이다.'))
print("morphs :\n", morphs)
print("nouns :\n", nouns)
print("pos :\n", pos)
print("phrases :\n", phrases)
Introduction
Keywords : Unsupervised machine translation, Pretrained language models, Common sense inference datasets, Meta-learning, Robust unsupervised methods, Understanding representations, Clever auxiliary tasks, Combining semi-supervised learning with transfer learning, QA and reasoning with large documents, Inductive bias
NLP Categorization
- Phonology : Linguistic sounds
- Speech to Text(STT)
- Morphology : Meaningful components of words
- Lexical analysis
- Syntax : Structural relationships between words
- Syntax analysis
- Semantics : Meaning
- Semantic analysis
- Pragmatics : How language is used to accomplish goals
- Pragmatic analysis
- Discourse : Larger linguistic units
NLP Research trend
- Rule-based approach (deductive reasoning, deterministic)
- Statistical approach (inductive reasoning, stochastic)
- Machine learning approach (inductive reasoning, stochastic) : end-to-end multi-task learning
Upstream-task
- Tokenize
- Embedding
- Factorization based(Matrix decomposition)
- GloVe, Swivel
- Prediction based
- Word2Vec, FastText, BERT, ELMo, GPT
- Topic based
- LDA
Downstream-task
- Part of Speech tagging
- Named Entity Recognition
- Semantic Role Labeling
Text preprocessing(1) : Lexical analysis
Tokenization
Word Tokenization
word_tokenize
from nltk.tokenize import word_tokenize
print(word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))
['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
WordPunctTokenizer
from nltk.tokenize import WordPunctTokenizer
print(WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))
['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
text_to_word_sequence
from tensorflow.keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))
["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
Consideration
- Don’t simply exclude punctuation marks or special characters.
- ex] Ph.D, AT&T, 123,456,789
- In case of abbreviations and spacing within words
- ex] rock ‘n’ roll(abbreviation), New York(spacing within words)
- Standard : Penn Treebank Tokenization
TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
text="Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."
print(tokenizer.tokenize(text))
['Starting', 'a', 'home-based', 'restaurant', 'may', 'be', 'an', 'ideal.', 'it', 'does', "n't", 'have', 'a', 'food', 'chain', 'or', 'restaurant', 'of', 'their', 'own', '.']
Sentence Tokenization
sent_tokenize
from nltk.tokenize import sent_tokenize
text="His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff. He dug a hole in the midst of some reeds. He looked about, to mae sure no one was near."
print(sent_tokenize(text))
['His barber kept his word.', 'But keeping such a huge secret to himself was driving him crazy.', 'Finally, the barber went up a mountain and almost to the edge of a cliff.', 'He dug a hole in the midst of some reeds.', 'He looked about, to make sure no one was near.']
from nltk.tokenize import sent_tokenize
text="I am actively looking for Ph.D. students. and you are a Ph.D student."
print(sent_tokenize(text))
['I am actively looking for Ph.D. students.', 'and you are a Ph.D student.']
Part-of-speech tagging
English
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
text="I am actively looking for Ph.D. students. and you are a Ph.D. student."
x=word_tokenize(text)
print(x)
pos_tag(x)
['I', 'am', 'actively', 'looking', 'for', 'Ph.D.', 'students', '.', 'and', 'you', 'are', 'a', 'Ph.D.', 'student', '.']
[('I', 'PRP'), ('am', 'VBP'), ('actively', 'RB'), ('looking', 'VBG'), ('for', 'IN'), ('Ph.D.', 'NNP'), ('students', 'NNS'), ('.', '.'), ('and', 'CC'), ('you', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('Ph.D.', 'NNP'), ('student', 'NN'), ('.', '.')]
| Tag | Description |
| --- | --- |
| PRP | personal pronoun |
| VBP | verb, non-3rd person singular present |
| RB | adverb |
| VBG | gerund or present participle |
| IN | preposition or subordinating conjunction |
| NNP | proper noun, singular |
| NNS | noun, plural |
| CC | coordinating conjunction |
| DT | determiner |
Korean
from konlpy.tag import Kkma
kkma=Kkma()
print(kkma.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(kkma.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(kkma.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가보', '아요']
[('열심히', 'MAG'), ('코딩', 'NNG'), ('하', 'XSV'), ('ㄴ', 'ETD'), ('당신', 'NP'), (',', 'SP'), ('연휴', 'NNG'), ('에', 'JKM'), ('는', 'JX'), ('여행', 'NNG'), ('을', 'JKO'), ('가보', 'VV'), ('아요', 'EFN')]
['코딩', '당신', '연휴', '여행']
Named entity recognition
Co-reference
Basic dependencies
Tokenization with regular expression
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"[\w]+")
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))
['Don', 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'Mr', 'Jone', 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"[\s]+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))
["Don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name,', 'Mr.', "Jone's", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
Cleaning and normalization : Morphological analysis
- morphology
- stem
- affix
Lemmatization : preserves part of speech
WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
n=WordNetLemmatizer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([n.lemmatize(w) for w in words])
['policy', 'doing', 'organization', 'have', 'going', 'love', 'life', 'fly', 'dy', 'watched', 'ha', 'starting']
The results above include malformed words that have no meaning, such as 'dy' or 'ha'. This happens because the lemmatizer needs to know the part of speech of the original word to produce accurate results.
from nltk.stem import WordNetLemmatizer
n=WordNetLemmatizer()
print(n.lemmatize('dies', 'v'))
print(n.lemmatize('watched', 'v'))
print(n.lemmatize('has', 'v'))
'die'
'watch'
'have'
Stemming : does not preserve part of speech
Stemming with the Porter algorithm
PorterStemmer
from nltk.stem import PorterStemmer
s=PorterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([s.stem(w) for w in words])
['polici', 'do', 'organ', 'have', 'go', 'love', 'live', 'fli', 'die', 'watch', 'ha', 'start']
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
s = PorterStemmer()
text="This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
words=word_tokenize(text)
print(words)
print([s.stem(w) for w in words])
['This', 'was', 'not', 'the', 'map', 'we', 'found', 'in', 'Billy', 'Bones', "'s", 'chest', ',', 'but', 'an', 'accurate', 'copy', ',', 'complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']
['thi', 'wa', 'not', 'the', 'map', 'we', 'found', 'in', 'billi', 'bone', "'s", 'chest', ',', 'but', 'an', 'accur', 'copi', ',', 'complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
The results of the above algorithm include words that are not in the dictionary.
Stemming with the Lancaster stemmer algorithm
LancasterStemmer
from nltk.stem import LancasterStemmer
l=LancasterStemmer()
words=['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print([l.stem(w) for w in words])
Removing unnecessary words (noise data)
Stopword
List of English stopwords
stopwords
from nltk.corpus import stopwords
stopwords.words('english')[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
Removing English stopwords
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example = "Family is not an important thing. It's everything."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example)
result = []
for w in word_tokens:
    if w not in stop_words:
        result.append(w)
print(word_tokens)
print(result)
['Family', 'is', 'not', 'an', 'important', 'thing', '.', 'It', "'s", 'everything', '.']
['Family', 'important', 'thing', '.', 'It', "'s", 'everything', '.']
List of Korean stopwords
Removing Korean stopwords
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example = "고기를 아무렇게나 구우려고 하면 안 돼. 고기라고 다 같은 게 아니거든. 예컨대 삼겹살을 구울 때는 중요한 게 있지."
stop_words = "아무거나 아무렇게나 어찌하든지 같다 비슷하다 예컨대 이럴정도로 하면 아니거든"
# The stopwords above are non-noun words chosen arbitrarily by the author; this is not a meaningful selection criterion.
stop_words=stop_words.split(' ')
word_tokens = word_tokenize(example)
result = []
for w in word_tokens:
    if w not in stop_words:
        result.append(w)
# The four lines above can be replaced with the single line below:
# result=[word for word in word_tokens if not word in stop_words]
print(word_tokens)
print(result)
['고기를', '아무렇게나', '구우려고', '하면', '안', '돼', '.', '고기라고', '다', '같은', '게', '아니거든', '.', '예컨대', '삼겹살을', '구울', '때는', '중요한', '게', '있지', '.']
['고기를', '구우려고', '안', '돼', '.', '고기라고', '다', '같은', '게', '.', '삼겹살을', '구울', '때는', '중요한', '게', '있지', '.']
Rare words
Words with a very short length
import re
text = "I was wondering if anyone out there could enlighten me on this car."
shortword = re.compile(r'\W*\b\w{1,2}\b')
print(shortword.sub('', text))
was wondering anyone out there could enlighten this car.
Encoding
Integer encoding
One-hot encoding
Byte Pair encoding
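As a minimal sketch covering the first two of these (byte pair encoding is not shown; the toy corpus and helper function below are my own, not from the original post), integer encoding assigns each word an index, usually in order of frequency, and one-hot encoding turns that index into a binary vector:
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus for illustration
tokens = [w for sent in corpus for w in sent.split()]

# Integer encoding: index words by descending frequency.
vocab = {w: i for i, (w, _) in enumerate(Counter(tokens).most_common())}
print(vocab)  # e.g. {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'ran': 4}

# One-hot encoding: a 1 at the word's index, 0 elsewhere.
def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(one_hot("cat", vocab))  # [0, 1, 0, 0, 0]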
Text preprocessing(2) : Syntax analysis
Language model
A language model is a criterion for determining whether a sentence is natural.
- Assign probabilities for word sequences
- Machine Translation
- Spell Correction
- Speech Recognition
Statistical Language Model, SLM
- Prediction of the next word from a given previous word
- P(W) = P(w_1,w_2,w_3,…,w_n)
- P(w_n|w_1,…,w_n-1) = P(w_1,w_2,…,w_n)/P(w_1,w_2,w_3,…,w_n-1)
This causes a sparsity problem: word sequences that never appear in the training corpus get a count of zero, and therefore zero probability.
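For example, a count-based estimate uses corpus frequencies directly (the counts here are made-up numbers for illustration):
- P(is|An adorable little boy) ≈ count(An adorable little boy is)/count(An adorable little boy) = 30/100 = 0.3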
N-gram Language Model
- Sparsity problem
- Trade off (sparsity vs accuracy)
- The higher the value of n, the more severe the sparsity problem, but the more context the model captures (potentially higher accuracy).
- The lower the value of n, the milder the sparsity problem, but the lower the accuracy.
In general the model conditions on all previous words, P(w_n|w_1,…,w_(n-1)); an n-gram model approximates this by conditioning only on the previous n-1 words, P(w_n|w_(n-(n-1)),…,w_(n-1)).
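A minimal sketch of a bigram (n = 2) model estimated purely from counts; the toy corpus is illustrative:
from collections import Counter

corpus = ["<s> i like nlp </s>", "<s> i like deep learning </s>", "<s> i hate bugs </s>"]
tokens = [sent.split() for sent in corpus]

unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def p_bigram(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("i", "like"))    # 2/3
print(p_bigram("like", "nlp"))  # 1/2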
Evaluation: Perplexity(Branching factor)
- Evaluation
- extrinsic
- intrinsic : Perplexity
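Perplexity is the inverse probability of a test sequence of N words, normalized by N (lower is better); it can be read as the average branching factor, i.e. how many words the model is effectively choosing between at each step:
- PPL(W) = P(w_1,w_2,…,w_N)^(-1/N)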
Neural Network Based Language Model
Feed Forward Neural Network Language Model, FFNNLM : Neural Probabilistic Language Model
- Improvement : Solving sparsity problem
- Limitation : Fixed-length input
Recurrent Neural Network Language Model, RNNLM
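A minimal Keras sketch of an RNN language model that predicts the next word from the preceding words (the vocabulary size and layer sizes are arbitrary assumptions, not values from this post):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size = 5000   # assumed vocabulary size
model = Sequential([
    Embedding(vocab_size, 64),                # word index -> dense vector
    SimpleRNN(128),                           # summarizes the preceding words
    Dense(vocab_size, activation="softmax"),  # probability distribution over the next word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
Unlike the feed-forward language model above, the recurrent layer is not restricted to a fixed-length input, since it consumes the sequence one word at a time.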
Quantification : Word representation
Local Representation
- WDM, Word-Document Matrix
- TF-IDF, Term Frequency-Inverse Document Frequency
- WCM, Word-Context Matrix
- PMIM, Point-wise Mutual Information Matrix
Count based word Representation(1) : BoW
Bag of Words
from konlpy.tag import Okt
import re
okt=Okt()
token=re.sub(r"(\.)","","정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다.")
# Cleaning step: remove the period with a regular expression.
token=okt.morphs(token)
# Tokenize with the Okt morphological analyzer and store the result in token.
word2index={}
bow=[]
for voca in token:
    if voca not in word2index.keys():
        word2index[voca]=len(word2index)
        # While reading token, add words that are not yet in word2index; words already present go to the else branch.
        bow.insert(len(word2index)-1,1)
        # Give the new word a default count of 1, since every recorded word occurs at least once.
    else:
        index=word2index.get(voca)
        # Get the index of a word that reappears.
        bow[index]=bow[index]+1
        # Add 1 at that index (this counts the word occurrences).
print(word2index)
print(bow)
{'정부': 0, '가': 1, '발표': 2, '하는': 3, '물가상승률': 4, '과': 5, '소비자': 6, '느끼는': 7, '은': 8, '다르다': 9}
[1, 2, 1, 1, 2, 1, 1, 1, 1, 1]
Create Bag of Words with CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray()) # Record the frequency of each word in the corpus.
print(vector.vocabulary_) # Show which index was assigned to each word.
[[1 1 2 1 2 1]]
{'you': 4, 'know': 1, 'want': 3, 'your': 5, 'love': 2, 'because': 0}
Remove stopwords in Bag of Words
Custom stopwords
from sklearn.feature_extraction.text import CountVectorizer
text=["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words=["the", "a", "an", "is", "not"])
print(vect.fit_transform(text).toarray())
print(vect.vocabulary_)
[[1 1 1 1 1]]
{'family': 1, 'important': 2, 'thing': 4, 'it': 3, 'everything': 0}
CountVectorizer stopwords
from sklearn.feature_extraction.text import CountVectorizer
text=["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words="english")
print(vect.fit_transform(text).toarray())
print(vect.vocabulary_)
[[1 1 1]]
{'family': 0, 'important': 1, 'thing': 2}
NLTK stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
text=["Family is not an important thing. It's everything."]
sw = stopwords.words("english")
vect = CountVectorizer(stop_words =sw)
print(vect.fit_transform(text).toarray())
print(vect.vocabulary_)
[[1 1 1 1]]
{'family': 1, 'important': 2, 'thing': 3, 'everything': 0}
Count based word Representation(2) : DTM
Document-Term Matrix, DTM
- Limitations : Sparse representation
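A quick sketch of that limitation: each document uses only a few of the vocabulary terms, so most entries of the document-term matrix are zero (the toy documents are illustrative):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you know I want your love",
        "I like you",
        "what should I do"]
dtm = CountVectorizer().fit_transform(docs)
print(dtm.shape)      # (number of documents, vocabulary size)
print(dtm.toarray())  # most entries are 0 -> sparse representation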
Count based word Representation(3) : TF-IDF
Term Frequency-Inverse Document Frequency
Implementation : pandas
import pandas as pd # to use DataFrames
from math import log # to compute IDF
docs = [
'먹고 싶은 사과',
'먹고 싶은 바나나',
'길고 노란 바나나 바나나',
'저는 과일이 좋아요'
]
vocab = list(set(w for doc in docs for w in doc.split()))
vocab.sort()
N = len(docs) # total number of documents
def tf(t, d):
    return d.count(t)

def idf(t):
    df = 0
    for doc in docs:
        df += t in doc
    return log(N/(df + 1))

def tfidf(t, d):
    return tf(t,d)* idf(t)
result = []
for i in range(N): # run the commands below for each document
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]
        result[-1].append(tf(t, d))
tf_ = pd.DataFrame(result, columns = vocab)
print(tf_)
result = []
for j in range(len(vocab)):
    t = vocab[j]
    result.append(idf(t))
idf_ = pd.DataFrame(result, index = vocab, columns = ["IDF"])
print(idf_)
result = []
for i in range(N):
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]
        result[-1].append(tfidf(t,d))
tfidf_ = pd.DataFrame(result, columns = vocab)
print(tfidf_)
Implementation : scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'you know I want your love',
'I like you',
'what should I do ',
]
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray()) # Record the frequency of each word in the corpus.
print(vector.vocabulary_) # Show which index was assigned to each word.
[[0 1 0 1 0 1 0 1 1]
[0 0 1 0 0 0 0 1 0]
[1 0 0 0 1 0 1 0 0]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'you know I want your love',
'I like you',
'what should I do ',
]
tfidfv = TfidfVectorizer().fit(corpus)
print(tfidfv.transform(corpus).toarray())
print(tfidfv.vocabulary_)
[[0. 0.46735098 0. 0.46735098 0. 0.46735098 0. 0.35543247 0.46735098]
[0. 0. 0.79596054 0. 0. 0. 0. 0.60534851 0. ]
[0.57735027 0. 0. 0. 0.57735027 0. 0.57735027 0. 0. ]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}
Implementation : keras
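A minimal Keras sketch: the Tokenizer class can produce a TF-IDF-weighted document matrix via texts_to_matrix, though its weighting formula differs slightly from scikit-learn's:
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',
]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
print(tokenizer.word_index)                             # word -> index (1-based)
print(tokenizer.texts_to_matrix(corpus, mode='tfidf'))  # column 0 is reserved and stays zero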
Continuous Representation
Topic modeling(1) : LSA
Singular Value Decomposition, SVD
import numpy as np
A = np.array([[0,0,0,1,0,1,1,0,0],
[0,0,0,1,1,0,1,0,0],
[0,1,1,0,2,0,0,0,0],
[1,0,0,0,0,0,0,1,1]])
# Full SVD
U, s, VT = np.linalg.svd(A, full_matrices = True)
S = np.zeros(np.shape(A)) # create a 4 x 9 zero matrix, the size of the diagonal matrix
S[:4, :4] = np.diag(s) # insert the singular values into the diagonal matrix
# Truncated SVD
S=S[:2,:2]
U=U[:,:2]
VT=VT[:2,:]
Latent Semantic Analysis, LSA
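LSA applies truncated SVD to a document-term (or TF-IDF) matrix and uses the resulting low-rank factors as topic representations. A minimal scikit-learn sketch on the same 4 x 9 matrix as above (keeping 2 topics is an arbitrary choice):
import numpy as np
from sklearn.decomposition import TruncatedSVD

A = np.array([[0,0,0,1,0,1,1,0,0],
              [0,0,0,1,1,0,1,0,0],
              [0,1,1,0,2,0,0,0,0],
              [1,0,0,0,0,0,0,1,1]])

svd = TruncatedSVD(n_components=2)
doc_topic = svd.fit_transform(A)  # document-topic matrix, shape (4, 2)
print(doc_topic)
print(svd.components_)            # topic-term matrix, shape (2, 9)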
Topic modeling(2) : LDA
Latent Dirichlet Allocation, LDA
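LDA models each document as a mixture of topics and each topic as a distribution over words. A minimal scikit-learn sketch (the toy corpus and the choice of 2 topics are illustrative assumptions):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["I like apples and bananas",
        "bananas and apples are fruit",
        "neural networks learn word representations",
        "language models learn from text"]
X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # per-document topic distribution
print(doc_topic)
print(lda.components_.shape)      # (2 topics, vocabulary size)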
Word Embedding
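A minimal sketch of a prediction-based embedding with gensim's Word2Vec (the toy sentences and hyperparameters are assumptions; gensim >= 4.0 uses vector_size):
from gensim.models import Word2Vec

sentences = [["i", "like", "natural", "language", "processing"],
             ["i", "like", "deep", "learning"],
             ["language", "models", "learn", "word", "vectors"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["language"].shape)                 # (50,) dense word vector
print(model.wv.most_similar("language", topn=2))  # nearest neighbours by cosine similarity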
Document Similarity
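A minimal sketch of document similarity as cosine similarity between TF-IDF vectors, reusing the corpus from the TF-IDF example above (the pairing shown is my own choice):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',
]
tfidf = TfidfVectorizer().fit_transform(corpus)
print(cosine_similarity(tfidf[0], tfidf[1]))  # similarity between document 0 and document 1
print(cosine_similarity(tfidf))               # full 3 x 3 similarity matrix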
Recurrent Neural Network
Word-level
Character-level
Text Classification
Tagging Task
Neural Machine Translation
Attention Mechanism
Transformer
Convolutional Neural Network