AI02, Clustering
Implementation with sklearn
Clustering with the K-Means algorithm
# [0] : importing modules
from sklearn import datasets
from sklearn import metrics
from sklearn import cluster
import numpy as np
# [1] : loading dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# [2] : clustering
clustering = cluster.KMeans(n_clusters=3)
clustering.fit(X)
y_pred = clustering.predict(X)
# [3] : relabel the clusters so their integer codes match the target labels
#       (this particular permutation depends on the K-Means initialization)
idx_0, idx_1, idx_2 = (np.where(y_pred == n) for n in range(3))
y_pred[idx_0], y_pred[idx_1], y_pred[idx_2] = 2, 0, 1
# [4] : summarize the overlap between the true labels and the cluster assignments
metrics.confusion_matrix(y, y_pred)
OUTPUT
array([[50,  0,  0],
       [ 0, 48,  2],
       [ 0, 14, 36]], dtype=int64)
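The relabeling in step [3] works here, but the permutation it corrects depends on how K-Means happens to initialize. A permutation-invariant alternative is the adjusted Rand index, which scores the agreement between two labelings regardless of which integers name the clusters; a minimal sketch:
# compare the clustering to the true labels without any manual relabeling
from sklearn import cluster, datasets, metrics

X, y = datasets.load_iris(return_X_y=True)
y_pred = cluster.KMeans(n_clusters=3).fit_predict(X)

# adjusted_rand_score is invariant to label permutations; 1.0 means perfect agreement
print(metrics.adjusted_rand_score(y, y_pred))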
VISUALIZATION
# [0] : importing modules
from sklearn import datasets
from sklearn import metrics
from sklearn import cluster
import numpy as np
import matplotlib.pyplot as plt
# [1] : loading dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# [2] : clustering
n_clusters = 3
clustering = cluster.KMeans(n_clusters=n_clusters)
clustering.fit(X)
y_pred = clustering.predict(X)
# [3] : relabel the clusters so their integer codes match the target labels
idx_0, idx_1, idx_2 = (np.where(y_pred == n) for n in range(3))
y_pred[idx_0], y_pred[idx_1], y_pred[idx_2] = 2, 0, 1
# [4] : summarize the overlap between the true labels and the cluster assignments
metrics.confusion_matrix(y, y_pred)
# Visualization
N = X.shape[1]
fig, axes = plt.subplots(N, N, figsize=(12, 12), sharex=True, sharey=True)
colors = ["coral", "blue", "green"]
markers = ["^", "v", "o"]
for m in range(N):
    for n in range(N):
        for p in range(n_clusters):
            mask = y_pred == p
            # feature n on the x-axis and feature m on the y-axis,
            # matching the labels set on the outer axes below
            axes[m, n].scatter(X[mask, n], X[mask, m], s=30, marker=markers[p], color=colors[p], alpha=0.25)
    axes[N-1, m].set_xlabel(iris.feature_names[m], fontsize=16)
    axes[m, 0].set_ylabel(iris.feature_names[m], fontsize=16)
plt.show()
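The fitted estimator also stores the centroids it converged to in cluster_centers_; a short sketch (reusing clustering, X, y_pred, iris, and n_clusters from above) that overlays them on a single feature pair:
# overlay the learned centroids on the first two features
centers = clustering.cluster_centers_            # shape (n_clusters, n_features)
fig, ax = plt.subplots(figsize=(6, 6))
for p in range(n_clusters):
    mask = y_pred == p
    ax.scatter(X[mask, 0], X[mask, 1], s=30, alpha=0.25)
ax.scatter(centers[:, 0], centers[:, 1], s=120, marker="x", color="black")
ax.set_xlabel(iris.feature_names[0], fontsize=16)
ax.set_ylabel(iris.feature_names[1], fontsize=16)
plt.show()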
Details code[1]
iris dataset
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df1 = pd.DataFrame(iris.data)
df2 = pd.DataFrame(iris.target)
df = pd.concat([df1,df2], axis=1)
df.columns = iris.feature_names + ['target_names']
print(df)
s.length (cm) s.width (cm) ... p.width (cm) target_names
0 5.1 3.5 ... 0.2 0
1 4.9 3.0 ... 0.2 0
2 4.7 3.2 ... 0.2 0
3 4.6 3.1 ... 0.2 0
4 5.0 3.6 ... 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 ... 2.3 2
146 6.3 2.5 ... 1.9 2
147 6.5 3.0 ... 2.0 2
148 6.2 3.4 ... 2.3 2
149 5.9 3.0 ... 1.8 2
[150 rows x 5 columns]
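The same frame can be built more directly, and iris.target_names can translate the integer codes into species strings; a sketch (the column name species is my choice, not part of the dataset):
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# name the feature columns at construction time instead of mutating iris.feature_names
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]   # 'setosa', 'versicolor', 'virginica'
print(df.head())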
Details code[2]
INPUT
# [0] : importing modules
from sklearn import datasets
from sklearn import metrics
from sklearn import cluster
import numpy as np
# [1] : loading dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# [2] : clustering
clustering = cluster.KMeans(n_clusters=3)
clustering.fit(X)
OUTPUT
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
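Besides the repr of the hyperparameters shown above, the fitted object exposes the learned quantities as trailing-underscore attributes; a brief sketch, reusing the clustering object just fitted:
print(clustering.cluster_centers_.shape)   # (3, 4): one centroid per cluster
print(clustering.labels_[:10])             # cluster index assigned to each training sample
print(clustering.inertia_)                 # sum of squared distances to the closest centroid
print(clustering.n_iter_)                  # number of iterations run before convergence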
Details code[3]
# [0] : importing modules
from sklearn import datasets
from sklearn import metrics
from sklearn import cluster
import numpy as np
# [1] : loading dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# [2] : clustering
clustering = cluster.KMeans(n_clusters=3)
clustering.fit(X)
y_pred = clustering.predict(X)
y_pred[::8]
OUTPUT
array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0])
y[::8]
OUTPUT
array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
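The permutation visible in the two slices above can also be recovered programmatically rather than by eye; a minimal numpy sketch that maps each raw cluster index to the majority true class among its members:
import numpy as np

# for each raw cluster index, find the most common true label inside that cluster
mapping = {p: np.bincount(y[y_pred == p]).argmax() for p in np.unique(y_pred)}
y_aligned = np.array([mapping[p] for p in y_pred])
print(y_aligned[::8])   # now directly comparable to y[::8]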
Details code[4]
In the confusion matrix above, the diagonal entries are the numbers of samples correctly classified for each level of the category variable, and the off-diagonal entries are the numbers of misclassified samples. More precisely, the element C[i, j] of the confusion matrix C is the number of samples of true category i that were assigned to category j.
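As a sanity check on that definition, the matrix can be accumulated by hand; a minimal sketch (the helper confusion is hypothetical, for illustration only):
import numpy as np

def confusion(y_true, y_hat, k=3):
    # C[i, j] counts samples of true category i that were assigned to category j
    C = np.zeros((k, k), dtype=int)
    for i, j in zip(y_true, y_hat):
        C[i, j] += 1
    return C
Applied to y and the relabeled y_pred from the first listing, this reproduces metrics.confusion_matrix(y, y_pred).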