Developer Interface

This part of the documentation covers the public interface of k-prototypes.

Main Interface

The main entry point follows similar conventions to Scikit-Learn, but it is not fully compatible (see design choices).

class kprototypes.KPrototypes(n_clusters=8, initialization=None, numerical_distance='euclidean', categorical_distance='matching', gamma=None, n_iterations=100, random_state=None, verbose=0)

K-Prototypes clustering.

The k-prototypes algorithm, as described in “Clustering large data sets with mixed numeric and categorical values” by Huang (1997), is an extension of k-means for mixed data.

This wrapper loosely follows Scikit-Learn conventions for clustering estimators, as it provides the usual fit and predict methods. However, the signature is different, as it expects numerical and categorical data to be provided in separate arrays.
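To illustrate the "separate arrays" convention, here is a minimal sketch that splits mixed records into a numerical part and a categorical part. The records, column positions and variable names are assumptions made for this example, and plain lists stand in for arrays. The categorical part still holds raw labels at this point and would need integer encoding before clustering.

```python
# Hypothetical mixed records: (age, income, city, plan).
records = [
    (34, 52000.0, "paris", "basic"),
    (51, 78000.0, "lyon", "premium"),
    (29, 41000.0, "paris", "premium"),
]

# Column positions are assumptions made for this example.
numerical_columns = [0, 1]
categorical_columns = [2, 3]

# One container per feature type, with samples kept in the same order.
numerical_values = [[float(row[i]) for i in numerical_columns] for row in records]
categorical_raw = [[row[i] for i in categorical_columns] for row in records]
```

Both containers share the same row order, so row i of one always describes the same sample as row i of the other.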

Distance Measure

Common distance functions are provided, but any callable can be used, as long as broadcasting is properly done.

kprototypes.check_distance(distance)

Resolve distance function.

If distance is a string, only "euclidean", "manhattan" and "matching" are accepted. If it is a callable, it is returned as-is. If it is None, it defaults to "euclidean".
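The resolution rules above can be sketched in plain Python. This is an illustration only, not the library's code: resolve_distance and the three helper functions are hypothetical names, the helpers compare two single vectors rather than broadcasting over arrays, and raising ValueError for an unknown name is an assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences."""
    return float(sum(abs(x - y) for x, y in zip(a, b)))

def matching(a, b):
    """Number of positions where the two sequences differ."""
    return float(sum(x != y for x, y in zip(a, b)))

_DISTANCES = {"euclidean": euclidean, "manhattan": manhattan, "matching": matching}

def resolve_distance(distance=None):
    """Hypothetical resolver mirroring the documented rules."""
    if distance is None:
        return euclidean              # documented default
    if callable(distance):
        return distance               # callables pass through unchanged
    if isinstance(distance, str) and distance in _DISTANCES:
        return _DISTANCES[distance]
    # Error type is an assumption; the source does not specify it.
    raise ValueError(f"unknown distance: {distance!r}")
```

Resolving once up front lets the rest of the code treat every distance as a plain function, whether it came from a name, a default or a user-supplied callable.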

Initialization

Simple initialization functions are provided, but any callable can be used. Explicit centroids can also be provided, either to resume training or to use an external initialization process.

kprototypes.check_initialization(initialization)

Resolve initialization function.

If initialization is a string, only "random" and "frequency" are accepted. If it is a callable, it is returned as-is. If it is None, it defaults to "random".

Directly specifying centroids as a tuple of arrays is also accepted.

Returns: function – Centroid initialization function.
Return type: callable
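As a rough sketch of how explicit centroids can be turned into an initialization function, consider the following. Everything here is hypothetical: the function names, the four-argument callable signature, and the reading of "random" as "draw distinct samples as starting centroids" are assumptions that the documentation above does not confirm. The "frequency" strategy is deliberately left out.

```python
import random

def random_initialization(numerical_values, categorical_values, n_clusters, rng):
    """Pick n_clusters distinct samples as starting centroids (assumed behaviour)."""
    indices = rng.sample(range(len(numerical_values)), n_clusters)
    return (
        [list(numerical_values[i]) for i in indices],
        [list(categorical_values[i]) for i in indices],
    )

def resolve_initialization(initialization=None):
    """Hypothetical resolver that always returns a function."""
    if initialization is None or initialization == "random":
        return random_initialization
    if callable(initialization):
        return initialization
    if isinstance(initialization, tuple) and len(initialization) == 2:
        numerical_centroids, categorical_centroids = initialization

        # Wrap explicit centroids so callers still receive a callable.
        def fixed(numerical_values, categorical_values, n_clusters, rng):
            return numerical_centroids, categorical_centroids

        return fixed
    if initialization == "frequency":
        raise NotImplementedError("not covered by this sketch")
    raise ValueError(f"unknown initialization: {initialization!r}")
```

Wrapping the tuple in a function keeps a single code path downstream, which is one way to support resuming training from previously fitted centroids.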

Data Preprocessing

As k-prototypes only accepts floating-point values for numerical data and integer values for categorical data, any other data type must be properly converted beforehand.

For numerical values, sklearn.preprocessing.StandardScaler plays the role that CategoricalTransformer, described below, plays for categorical values.

class kprototypes.CategoricalTransformer(**kwargs)

Encode categorical values as integers.

Each column has its own vocabulary. Values are mapped from 0 to N - 1, where N is the size of the vocabulary.

Parameters

min_count (int, optional) – Ignore values that appear fewer than a given number of times. Unknown values must be enabled as well (see allow_unknown).
allow_unknown (bool, optional) – Add an additional value for unexpected or unknown values.
nan_as_unknown (bool, optional) – Treat NaN as unknown, instead of allocating a dedicated index.

fit(values): Build index.

fit_transform(values): Build index and transform values.

inverse_transform(indices): Convert indices back to values.

transform(values): Transform values.
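The per-column vocabulary idea can be sketched in a few lines of plain Python. This is not the library's class: ColumnVocabularyEncoder is a hypothetical name, values are indexed in sorted order, and placing the unknown index right after the last known value is an assumption. NaN handling is not covered.

```python
from collections import Counter

class ColumnVocabularyEncoder:
    """Illustrative per-column integer encoder (not the library's class)."""

    def __init__(self, min_count=1, allow_unknown=True):
        self.min_count = min_count
        self.allow_unknown = allow_unknown
        self.vocabularies = []

    def fit(self, rows):
        """Build one value-to-index mapping per column."""
        self.vocabularies = []
        for column in zip(*rows):
            counts = Counter(column)
            kept = sorted(v for v, c in counts.items() if c >= self.min_count)
            self.vocabularies.append({v: i for i, v in enumerate(kept)})
        return self

    def transform(self, rows):
        """Map each value to its column-specific index."""
        encoded = []
        for row in rows:
            out = []
            for value, vocabulary in zip(row, self.vocabularies):
                if value in vocabulary:
                    out.append(vocabulary[value])
                elif self.allow_unknown:
                    # Assumed convention: one extra index after known values.
                    out.append(len(vocabulary))
                else:
                    raise ValueError(f"unknown value: {value!r}")
            encoded.append(out)
        return encoded

rows = [["red", "s"], ["blue", "m"], ["red", "l"]]
encoder = ColumnVocabularyEncoder().fit(rows)
codes = encoder.transform(rows)
```

Raising min_count moves rare values into the unknown bucket, which is why that option only makes sense when unknown values are allowed.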

Low-Level Optimization Methods

At its core, k-prototypes is defined by these two methods.

kprototypes.fit(numerical_values, categorical_values, numerical_centroids, categorical_centroids, numerical_distance, categorical_distance, gamma, n_iterations, random_state, verbose)

Fit centroids.

This implementation follows the standard k-means algorithm, also referred to as Lloyd’s algorithm. The optimization proceeds by alternating between two steps:

assignment step, where each sample is assigned to the closest centroid;

update step, where centroids are recomputed based on the assignment.

This approach differs from the original paper, where centroids are updated after each individual assignment.
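The alternating scheme can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: all function names are hypothetical, squared Euclidean and simple matching are hard-coded as distances, empty clusters are left unchanged, and lists stand in for arrays. Centroids use the per-feature mean for numerical values and the per-feature mode for categorical values.

```python
from collections import Counter

def mixed_distance(num_x, cat_x, num_c, cat_c, gamma):
    """Squared Euclidean part plus gamma times the number of mismatches."""
    numerical = sum((a - b) ** 2 for a, b in zip(num_x, num_c))
    categorical = sum(a != b for a, b in zip(cat_x, cat_c))
    return numerical + gamma * categorical

def assign(nums, cats, num_centroids, cat_centroids, gamma):
    """Assignment step: closest centroid per sample, plus total cost."""
    labels, cost = [], 0.0
    for num_x, cat_x in zip(nums, cats):
        distances = [mixed_distance(num_x, cat_x, nc, cc, gamma)
                     for nc, cc in zip(num_centroids, cat_centroids)]
        best = min(range(len(distances)), key=distances.__getitem__)
        labels.append(best)
        cost += distances[best]
    return labels, cost

def update(nums, cats, labels, num_centroids, cat_centroids):
    """Update step: mean of numerical features, mode of categorical ones."""
    new_num, new_cat = [], []
    for k in range(len(num_centroids)):
        members = [i for i, label in enumerate(labels) if label == k]
        if not members:
            # Simplification: an empty cluster keeps its previous centroid.
            new_num.append(list(num_centroids[k]))
            new_cat.append(list(cat_centroids[k]))
            continue
        new_num.append([sum(col) / len(members)
                        for col in zip(*(nums[i] for i in members))])
        new_cat.append([Counter(col).most_common(1)[0][0]
                        for col in zip(*(cats[i] for i in members))])
    return new_num, new_cat

def lloyd(nums, cats, num_centroids, cat_centroids, gamma=1.0, n_iterations=100):
    """Alternate both steps until assignments stop changing."""
    labels = None
    for _ in range(n_iterations):
        new_labels, _cost = assign(nums, cats, num_centroids, cat_centroids, gamma)
        if new_labels == labels:
            break
        labels = new_labels
        num_centroids, cat_centroids = update(
            nums, cats, labels, num_centroids, cat_centroids)
    # Final pass so labels and cost match the returned centroids.
    labels, cost = assign(nums, cats, num_centroids, cat_centroids, gamma)
    return labels, cost, num_centroids, cat_centroids
```

Because all samples are assigned before any centroid moves, each iteration is a batch update, in contrast with the per-sample update of the original paper.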

Parameters

numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
numerical_distance (callable) – Distance function used for numerical features.
categorical_distance (callable) – Distance function used for categorical features.
gamma (float32) – Categorical distance weight.
n_iterations (int32) – Maximum number of iterations.
random_state (numpy.random.RandomState) – Random state used to initialize centroids.
verbose (int32) – Verbosity level (0 for no output).

Returns

clustership (int32, n_samples) – Closest clusters.
cost (float32) – Loss after last iteration.

kprototypes.predict(numerical_values, categorical_values, numerical_centroids, categorical_centroids, numerical_distance, categorical_distance, gamma, return_cost=False)

Assign points to closest clusters.

Parameters

numerical_values (float32, n_samples x n_numerical_features) – Numerical feature array.
categorical_values (int32, n_samples x n_categorical_features) – Categorical feature array.
numerical_centroids (float32, n_clusters x n_numerical_features) – Numerical centroid array.
categorical_centroids (int32, n_clusters x n_categorical_features) – Categorical centroid array.
numerical_distance (callable) – Distance function used for numerical features.
categorical_distance (callable) – Distance function used for categorical features.
gamma (float32) – Categorical distance weight.
return_cost (bool, optional) – Whether to return cost.

Returns

clustership (int32, n_samples) – Closest clusters.
cost (float32) – Loss of the assignment, returned only if return_cost is true.
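To show how the gamma weight can change which cluster is closest, here is a small sketch under stated assumptions: the helper names, the centroids and the sample are hypothetical, and squared Euclidean plus simple matching stand in for the configurable distance functions.

```python
def mixed_distance(num_x, cat_x, num_c, cat_c, gamma):
    """Numerical distance plus gamma times the categorical distance."""
    numerical = sum((a - b) ** 2 for a, b in zip(num_x, num_c))
    categorical = sum(a != b for a, b in zip(cat_x, cat_c))
    return numerical + gamma * categorical

def closest(num_x, cat_x, num_centroids, cat_centroids, gamma):
    """Index of the centroid with the smallest mixed distance."""
    distances = [mixed_distance(num_x, cat_x, nc, cc, gamma)
                 for nc, cc in zip(num_centroids, cat_centroids)]
    return min(range(len(distances)), key=distances.__getitem__)

# Centroid 0 is numerically close; centroid 1 matches both categories.
num_centroids = [[1.0], [3.0]]
cat_centroids = [[0, 0], [1, 1]]
point_num, point_cat = [1.5], [1, 1]

low = closest(point_num, point_cat, num_centroids, cat_centroids, gamma=0.1)
high = closest(point_num, point_cat, num_centroids, cat_centroids, gamma=5.0)
```

With the small weight the sample goes to centroid 0 (distances 0.45 against 2.25), while the large weight sends it to centroid 1 (10.25 against 2.25), so gamma should be chosen with the scale of the numerical features in mind.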