Developer Interface
This part of the documentation covers the public interface of k-prototypes.
Main Interface
The main entry point follows similar conventions to Scikit-Learn, but it is not fully compatible (see design choices).
-
class kprototypes.KPrototypes(n_clusters=8, initialization=None, numerical_distance='euclidean', categorical_distance='matching', gamma=None, n_iterations=100, random_state=None, verbose=0)
K-Prototypes clustering.
The k-prototypes algorithm, as described in “Clustering large data sets with mixed numeric and categorical values” by Huang (1997), is an extension of k-means for mixed data.
This wrapper loosely follows Scikit-Learn conventions for clustering estimators, as it provides the usual fit and predict methods. However, the signature is different, as it expects numerical and categorical data to be provided in separate arrays.
-
initialization (callable) – Centroid initialization function.
-
numerical_distance (callable) – Distance function used for numerical features.
-
categorical_distance (callable) – Distance function used for categorical features.
-
gamma (float32) – Categorical distance weight.
-
n_clusters (int32) – Number of clusters.
-
n_iterations (int32) – Maximum number of iterations.
-
true_gamma (float32) – Categorical distance weight inferred from data, if gamma was not specified.
-
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
-
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
-
cost (float32) – Loss after last training iteration.
-
fit(numerical_values, categorical_values)
Fit centroids.
- Parameters
numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
- Returns
self
- Return type
KPrototypes
-
fit_predict(numerical_values, categorical_values)
Fit centroids and assign points to closest clusters.
- Parameters
numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
- Returns
clustership – Closest clusters.
- Return type
int32, n_samples
-
predict(numerical_values, categorical_values)
Assign points to closest clusters.
- Parameters
numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
- Returns
clustership – Closest clusters.
- Return type
int32, n_samples
-
Distance Measure
Common distance functions are provided, but any callable can be used, as long as broadcasting is properly done.
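As a rough illustration of what proper broadcasting could look like, the sketch below defines a custom Chebyshev distance that reduces over the last (feature) axis, so the same function works on a single pair of points and on broadcast sample and centroid arrays. The name `chebyshev_distance` and the last-axis convention are assumptions made for this example, not part of the library; compare against the built-in distances before relying on it.

```python
import numpy as np


def chebyshev_distance(a, b):
    """Largest absolute difference over the last (feature) axis."""
    return np.max(np.abs(a - b), axis=-1)


# Single pair of points: the result is a scalar.
single = chebyshev_distance(np.array([0.0, 0.0, 0.0]), np.array([1.0, 2.0, 3.0]))

# Pairwise distances: samples shaped (4, 1, 3) against centroids shaped (1, 2, 3).
samples = np.zeros((4, 3), dtype=np.float32)
centroids = np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]], dtype=np.float32)
pairwise = chebyshev_distance(samples[:, None, :], centroids[None, :, :])

print(single, pairwise.shape)
```

Reducing over `axis=-1` rather than over the whole array is what keeps one distance value per (sample, centroid) pair.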
-
kprototypes.check_distance(distance)
Resolve distance function.
If distance is a string, only "euclidean", "manhattan" and "matching" are accepted. If it is a callable, then it is returned as-is. If it is None, it defaults to "euclidean".
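The resolution rules described above can be sketched with a small lookup table. This is not the library's implementation: `resolve_distance`, the three helper functions and the choice of `ValueError` and `TypeError` for invalid input are all hypothetical names and assumptions introduced for illustration.

```python
import numpy as np


def squared_euclidean(a, b):
    """Sum of squared differences over the last axis."""
    return np.sum((a - b) ** 2, axis=-1)


def manhattan(a, b):
    """Sum of absolute differences over the last axis."""
    return np.sum(np.abs(a - b), axis=-1)


def matching(a, b):
    """Number of feature pairs that differ, over the last axis."""
    return np.sum(a != b, axis=-1)


_DISTANCES = {
    "euclidean": squared_euclidean,
    "manhattan": manhattan,
    "matching": matching,
}


def resolve_distance(distance):
    """Hypothetical resolver: None, a known string, or a callable."""
    if distance is None:
        distance = "euclidean"  # documented default
    if callable(distance):
        return distance  # callables are returned as-is
    if isinstance(distance, str):
        if distance not in _DISTANCES:
            raise ValueError(f"unknown distance: {distance!r}")  # assumed error type
        return _DISTANCES[distance]
    raise TypeError("distance must be None, a string or a callable")


a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 8, 3, 4, 0])
mismatches = resolve_distance("matching")(a, b)
print(mismatches)
```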
-
kprototypes.euclidean_distance(a, b)
Squared euclidean distance.
This is the sum of squared differences for each feature pair, also known as squared L2 norm.
Example
>>> a = np.array([0.0, 0.0, 0.0])
>>> b = np.array([1.0, 2.0, 3.0])
>>> euclidean_distance(a, b)
14.0
-
kprototypes.manhattan_distance(a, b)
Manhattan distance.
This is the sum of absolute differences for each feature pair, also known as L1 norm.
Example
>>> a = np.array([0.0, 0.0, 0.0])
>>> b = np.array([1.0, 2.0, 3.0])
>>> manhattan_distance(a, b)
6.0
-
kprototypes.matching_distance(a, b)
Matching distance.
Each feature pair that does not match adds one to the distance. This distance measure is often used for categorical features.
Example
>>> a = np.array([1, 2, 3, 4, 5])
>>> b = np.array([1, 8, 3, 4, 0])
>>> matching_distance(a, b)
2
Initialization
Simple initialization functions are provided, but any callable can be used. Explicit centroids can also be provided, either to resume training or to use an external initialization process.
-
kprototypes.check_initialization(initialization)
Resolve initialization function.
If initialization is a string, only "random" and "frequency" are accepted. If it is a callable, then it is returned as-is. If it is None, it defaults to "random". Directly specifying centroids as a tuple of arrays is also accepted.
- Returns
function – Centroid initialization function.
- Return type
callable
-
kprototypes.random_initialization(numerical_values, categorical_values, n_clusters, numerical_distance, categorical_distance, gamma, random_state, verbose)
Random initialization.
Choose random points as cluster centroids.
Used in “Clustering large data sets with mixed numeric and categorical values” by Huang (1997), the original k-prototypes definition.
- Returns
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
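A minimal sketch of picking random points as centroids is shown below. The helper name `pick_random_centroids` is hypothetical, and sampling without replacement is an assumption; the library may handle duplicates differently.

```python
import numpy as np


def pick_random_centroids(numerical_values, categorical_values, n_clusters, random_state):
    """Use n_clusters distinct rows of the data as initial centroids."""
    n_samples = numerical_values.shape[0]
    # Assumption: sample row indices without replacement.
    indices = random_state.choice(n_samples, size=n_clusters, replace=False)
    # The same rows are taken from both arrays so each centroid is a real point.
    return numerical_values[indices], categorical_values[indices]


random_state = np.random.RandomState(0)
numerical_values = random_state.normal(size=(10, 2)).astype(np.float32)
categorical_values = random_state.randint(0, 3, size=(10, 4)).astype(np.int32)

numerical_centroids, categorical_centroids = pick_random_centroids(
    numerical_values, categorical_values, 3, random_state
)
print(numerical_centroids.shape, categorical_centroids.shape)
```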
-
kprototypes.frequency_initialization(numerical_values, categorical_values, n_clusters, numerical_distance, categorical_distance, gamma, random_state, verbose)
Frequency-based initialization.
Choose centroids from points, based on the probability distributions of each feature. The first centroid is selected at the highest-density point. Then, the remaining centroids are selected to be both far from the current centroids and at dense locations.
This is an extension for mixed values of “A new initialization method for categorical data clustering” by Cao et al. (2009).
- Returns
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
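To make the selection rule more concrete, here is a sketch restricted to categorical features, in the spirit of Cao et al. (2009): density is the average relative frequency of a point's values, and each new centroid maximizes density times distance to the nearest centroid already chosen. How the library extends this to numerical features is not shown here, and the function names are hypothetical.

```python
import numpy as np


def categorical_density(categorical_values):
    """Average relative frequency of each sample's values across features."""
    n_samples, n_features = categorical_values.shape
    density = np.zeros(n_samples, dtype=np.float64)
    for j in range(n_features):
        _, inverse, counts = np.unique(
            categorical_values[:, j], return_inverse=True, return_counts=True
        )
        density += counts[inverse.ravel()] / n_samples
    return density / n_features


def select_dense_centroids(categorical_values, n_clusters):
    """Pick points that are dense and far from the centroids chosen so far."""
    density = categorical_density(categorical_values)
    chosen = [int(np.argmax(density))]  # first centroid: highest density
    while len(chosen) < n_clusters:
        centroids = categorical_values[chosen]
        # Matching distance from every sample to every chosen centroid.
        distances = np.sum(
            categorical_values[:, None, :] != centroids[None, :, :], axis=-1
        )
        # Points already chosen have distance 0, hence score 0.
        score = np.min(distances, axis=1) * density
        chosen.append(int(np.argmax(score)))
    return categorical_values[chosen]


categorical_values = np.array(
    [[0, 0], [0, 0], [0, 1], [1, 1], [2, 2]], dtype=np.int32
)
centroids = select_dense_centroids(categorical_values, 3)
print(centroids)
```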
Data Preprocessing
As k-prototypes only accepts floating-point values for numerical data and integer values for categorical data, any other data type must be properly converted beforehand.
The CategoricalTransformer class below handles categorical values; sklearn.preprocessing.StandardScaler is the equivalent for numerical values.
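For readers without scikit-learn at hand, the following NumPy-only sketch mimics the default behaviour of StandardScaler (zero mean, unit population variance) and casts the result to float32, the numerical type expected here. The sample data is made up for the example.

```python
import numpy as np

# Made-up raw numerical features on very different scales.
raw = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

mean = raw.mean(axis=0)
std = raw.std(axis=0)  # population standard deviation (ddof=0)
std[std == 0.0] = 1.0  # leave constant columns unscaled instead of dividing by zero

numerical_values = ((raw - mean) / std).astype(np.float32)
print(numerical_values.dtype, numerical_values.mean(axis=0))
```

Scaling matters because the numerical distance is summed with the weighted categorical distance, so a feature with a large range would otherwise dominate.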
-
class kprototypes.CategoricalTransformer(**kwargs)
Encode categorical values as integers.
Each column has its own vocabulary. Values are mapped from 0 to N - 1, where N is the size of the vocabulary.
- Parameters
min_count (int, optional) – Ignore values that appear fewer than a given number of times. Unknown values must be enabled as well.
allow_unknown (bool, optional) – Add an additional value for unexpected or unknown values.
nan_as_unknown (bool, optional) – Treat NaN as unknown, instead of allocating a dedicated index.
-
fit(values)
Build index.
-
fit_transform(values)
Build index and transform values.
-
inverse_transform(indices)
Convert indices back to values.
-
transform(values)
Transform values.
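The per-column vocabulary idea can be sketched in plain NumPy as follows. This is not the CategoricalTransformer implementation: the function names are hypothetical, vocabularies are assumed to be sorted, and unknown values, NaN handling and min_count are deliberately left out.

```python
import numpy as np


def fit_vocabularies(values):
    """Build one sorted vocabulary per column."""
    return [np.unique(values[:, j]) for j in range(values.shape[1])]


def encode(values, vocabularies):
    """Map each value to its index (0 to N - 1) in the column vocabulary.

    Assumes every value was seen when the vocabularies were built.
    """
    columns = [
        np.searchsorted(vocabulary, values[:, j])
        for j, vocabulary in enumerate(vocabularies)
    ]
    return np.stack(columns, axis=1).astype(np.int32)


def decode(indices, vocabularies):
    """Convert indices back to the original values."""
    columns = [
        vocabulary[indices[:, j]] for j, vocabulary in enumerate(vocabularies)
    ]
    return np.stack(columns, axis=1)


values = np.array([["red", "small"], ["blue", "large"], ["red", "large"]])
vocabularies = fit_vocabularies(values)
indices = encode(values, vocabularies)
restored = decode(indices, vocabularies)
print(indices)
```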
Low-Level Optimization Methods
At its core, k-prototypes is defined by these two methods.
-
kprototypes.fit(numerical_values, categorical_values, numerical_centroids, categorical_centroids, numerical_distance, categorical_distance, gamma, n_iterations, random_state, verbose)
Fit centroids.
This implementation follows the standard k-means algorithm, also referred to as Lloyd’s algorithm. The optimization proceeds by alternating between two steps:
assignment step, where each sample is assigned to the closest centroid;
update step, where centroids are recomputed based on the assignment.
This approach differs from the original paper, where centroids are updated after each individual assignment.
- Parameters
numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
numerical_distance (callable) – Distance function used for numerical features.
categorical_distance (callable) – Distance function used for categorical features.
gamma (float32) – Categorical distance weight.
n_iterations (int32) – Maximum number of iterations.
random_state (numpy.random.RandomState) – Random state used to initialize centroids.
verbose (int32) – Verbosity level (0 for no output).
- Returns
clustership (int32, n_samples) – Closest clusters.
cost (float32) – Loss after last iteration.
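The alternating assignment and update steps described for this function can be sketched as below. It is a simplified illustration, not the library's code: it hard-codes squared Euclidean and matching distances, combines them as numerical distance plus gamma times categorical distance, uses the mean and the mode as centroid updates, and stops early when assignments no longer change. All function names are hypothetical.

```python
import numpy as np


def assign(num, cat, num_c, cat_c, gamma):
    """Assignment step: closest centroid per sample, plus the total cost."""
    num_d = np.sum((num[:, None, :] - num_c[None, :, :]) ** 2, axis=-1)
    cat_d = np.sum(cat[:, None, :] != cat_c[None, :, :], axis=-1)
    total = num_d + gamma * cat_d
    clustership = np.argmin(total, axis=1)
    cost = total[np.arange(len(num)), clustership].sum()
    return clustership, cost


def update(num, cat, clustership, num_c, cat_c):
    """Update step: mean of numerical features, mode of categorical features."""
    num_c, cat_c = num_c.copy(), cat_c.copy()
    for k in range(len(num_c)):
        mask = clustership == k
        if not mask.any():
            continue  # assumption: an empty cluster keeps its previous centroid
        num_c[k] = num[mask].mean(axis=0)
        for j in range(cat.shape[1]):
            values, counts = np.unique(cat[mask, j], return_counts=True)
            cat_c[k, j] = values[np.argmax(counts)]
    return num_c, cat_c


def lloyd(num, cat, num_c, cat_c, gamma, n_iterations):
    """Alternate both steps until assignments are stable or iterations run out."""
    previous = None
    for _ in range(n_iterations):
        clustership, _ = assign(num, cat, num_c, cat_c, gamma)
        if previous is not None and np.array_equal(clustership, previous):
            break
        num_c, cat_c = update(num, cat, clustership, num_c, cat_c)
        previous = clustership
    # Final assignment against the final centroids.
    clustership, cost = assign(num, cat, num_c, cat_c, gamma)
    return clustership, cost, num_c, cat_c


num = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], dtype=np.float32)
cat = np.array([[0], [0], [1], [1]], dtype=np.int32)
clustership, cost, num_c, cat_c = lloyd(
    num, cat, num[[0, 2]], cat[[0, 2]], gamma=1.0, n_iterations=10
)
print(clustership, cost)
```

Updating centroids only after every sample has been assigned is what distinguishes this batch scheme from the per-assignment updates of the original paper.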
-
kprototypes.predict(numerical_values, categorical_values, numerical_centroids, categorical_centroids, numerical_distance, categorical_distance, gamma, return_cost=False)
Assign points to closest clusters.
- Parameters
numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
numerical_distance (callable) – Distance function used for numerical features.
categorical_distance (callable) – Distance function used for categorical features.
gamma (float32) – Categorical distance weight.
return_cost (bool, optional) – Whether to return cost.
- Returns
clustership (int32, n_samples) – Closest clusters.
cost (float32) – Loss after last iteration, if return_cost is true.