
Unsupervised Machine Learning: Validation Techniques

by Priyanshu Jain, Senior Data Scientist, Guavus, Inc.

When validating a machine learning model, it is important to remember that the validation techniques employed not only measure performance but also go a long way toward helping you understand the model on a deeper level. This is why a significant amount of time is devoted to result validation while building a machine learning model.

Result validation is a crucial step, as it ensures that our model gives good results not just on the training data but, more importantly, on live or test data as well. In the case of supervised learning, this is mostly done by measuring performance metrics such as accuracy, precision, recall, AUC, etc. on the training and holdout sets. Such performance metrics help in deciding model viability.

If the performance is not acceptable, we tune the hyperparameters and repeat the process until we achieve the desired performance. In the case of unsupervised learning, however, the process is not as straightforward, because we have no ground truth. In the absence of labels, it is very difficult to identify KPIs that can be used to validate results.

There are two classes of statistical techniques for validating the results of cluster learning. These are:

  1. Internal validation
  2. External validation

Most of the literature on internal validation for cluster learning revolves around the following two types of metrics:

  • Cohesion within each cluster
  • Separation between different clusters

External validation, also called business/user validation, requires inputs that are external to the data. The idea is to generate clusters on the basis of subject-matter-expert knowledge and then evaluate the similarity between the two sets of clusters, i.e. the clusters generated by ML and the clusters generated from human inputs. In most cases, however, such knowledge is not readily available, and the approach does not scale well. Hence, in practice, external validation is usually skipped.

In this article, we propose twin-sample validation as a methodology for validating the results of unsupervised learning, to be used in addition to internal validation. It is very similar to external validation, but without the need for human inputs. In the subsequent sections, we briefly explain the different metrics used to perform internal and external validation. This is followed by an explanation of how to perform twin-sample validation for unsupervised clustering, and of its advantages.

Internal Validation

Most of the methods of internal validation combine cohesion and separation to estimate the validation score.

The approach is to compute a validation score for each cluster and then combine them in a weighted manner to arrive at the final score for the set of clusters. Let S be a set of clusters {C1, C2, C3, …, Cn}; the validity of S is then computed as a weighted sum of the per-cluster scores:

validity(S) = Σi wi · validity(Ci)

where the weight wi is typically the size of cluster Ci.

Cohesion for a cluster Ci can be computed by summing the similarity between each pair of records contained in that cluster: cohesion(Ci) = Σ sim(x, y) over all pairs x, y in Ci.

Separation between two clusters Ci and Cj can be computed by summing the distance between pairs of records in which one record comes from each of the two clusters: separation(Ci, Cj) = Σ dist(x, y) over all x in Ci and y in Cj.

A set of clusters having high cohesion within the clusters and high separation between the clusters is considered to be good.
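As a minimal sketch of these two quantities (assuming numeric records and Euclidean distance; the random data is purely illustrative, and note that with a distance measure a lower cohesion total is better):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def cohesion(cluster):
    # Sum of pairwise distances within one cluster. The text defines cohesion
    # via similarity (higher is better); with distances, lower is better.
    return pdist(cluster, metric="euclidean").sum()

def separation(cluster_a, cluster_b):
    # Sum of distances over every cross-cluster pair of records.
    return cdist(cluster_a, cluster_b, metric="euclidean").sum()

# Illustrative data: X holds the records, labels the cluster assignments.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
labels = rng.integers(0, 3, size=100)

print("cohesion of cluster 0:", cohesion(X[labels == 0]))
print("separation of clusters 0 and 1:", separation(X[labels == 0], X[labels == 1]))
```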

In practice, instead of dealing with two metrics, several measures are available that combine both of the above into a single measure. A few examples of such measures are:

  • Silhouette coefficient
  • Calinski-Harabasz index
  • Dunn index
  • Xie-Beni score
  • Hartigan index
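Two of these, the silhouette coefficient and the Calinski-Harabasz index, ship with scikit-learn. A minimal sketch (the toy data and the k-means run are only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Toy data and k-means are placeholders for any clustering pipeline.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Silhouette: in [-1, 1]; higher means tight, well-separated clusters.
print("silhouette:", silhouette_score(X, labels))

# Calinski-Harabasz: ratio of between- to within-cluster dispersion; higher is better.
print("calinski-harabasz:", calinski_harabasz_score(X, labels))
```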

External Validation

This type of result validation can be carried out only if true cluster labels are available. Labels generated by SMEs can also serve as a proxy for true labels. In this approach we have a set of clusters S = {C1, C2, C3, …, Cn} generated by some clustering algorithm, and another set of clusters P = {D1, D2, D3, …, Dm} representing the true cluster labels on the same data. The idea is to measure the statistical similarity between the two sets. A cluster set is considered good if it is highly similar to the true cluster set.

In order to measure the similarity between S and P, we label each pair of records in the data as Positive if the pair belongs to the same cluster in P, and Negative otherwise. The same exercise is carried out for S. We then compute a confusion matrix between the pair labels of S and P, which can be used to measure the similarity:

TP: Number of pairs of records which are in the same cluster, for both S and P

FP: Number of pairs of records which are in the same cluster in S but not in P

FN: Number of pairs of records which are in the same cluster in P but not in S

TN: Number of pairs of records which are not in the same cluster in either S or P

From the above four indicators, we can calculate different metrics to estimate the similarity between S (cluster labels generated by the unsupervised method) and P (true cluster labels). Some example metrics that could be used are as follows (a short pair-counting sketch follows the list):

  • Precision measures the ratio of true positives to total positives predicted.
  • Recall measures the ratio of positives captured out of the total true positives.
  • F1-measure combines precision and recall into a single metric.
  • Jaccard similarity measures the ratio TP / (TP + FP + FN).
  • Mutual information measures how much information the two label assignments share.
  • Fowlkes-Mallows index is the geometric mean of pairwise precision and recall.
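A minimal sketch of the pair-counting approach described above (the label lists are illustrative; scikit-learn also offers ready-made pair-based scores such as fowlkes_mallows_score):

```python
from itertools import combinations

def pair_confusion(s_labels, p_labels):
    # Compare the co-cluster status of every pair of records in S vs. P.
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(s_labels)), 2):
        same_s = s_labels[i] == s_labels[j]
        same_p = p_labels[i] == p_labels[j]
        if same_s and same_p:
            tp += 1  # together in both S and P
        elif same_s:
            fp += 1  # together in S only
        elif same_p:
            fn += 1  # together in P only
        else:
            tn += 1  # apart in both
    return tp, fp, fn, tn

tp, fp, fn, tn = pair_confusion([0, 0, 1, 1, 2], [0, 0, 1, 2, 2])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("F1:", 2 * precision * recall / (precision + recall))
print("Jaccard:", tp / (tp + fp + fn))
```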

Twin-Sample Validation

In this section, we explain how we can further validate the results of our unsupervised learning model in the absence of true cluster labels. This step assumes that we have already performed clustering on our training data and now want to validate the results. The approach consists of the following four steps:

  1. Creating a twin-sample of training data
  2. Performing unsupervised learning on twin-sample
  3. Importing results for twin-sample from training set
  4. Calculating similarity between two sets of results

1. Creating a twin-sample

This is the most important step in the process of performing twin-sample validation. The key idea is to create a sample of records that is expected to exhibit the same behavior as the training set. It is similar to a validation set in supervised learning, but with additional constraints. The following constraints should be considered while creating a twin-sample:

  1. It should come from the same distribution as the training set.
  2. It should sufficiently cover most of the patterns observed in the training set.
  3. In the case of time-series data:
    • It should come from a different time window than the training set (the immediately succeeding window is a good choice).
    • It should cover at least one complete season of the data, i.e. if the data has weekly seasonality, the twin-sample should cover at least one complete week.

Keeping the above constraints in mind, a twin-sample can be formed and used to validate the results of the clustering performed on the training set.
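For time-series data, this can be as simple as holding out the window that immediately follows the training period. A minimal sketch, assuming hourly records with weekly seasonality (the synthetic DataFrame is a stand-in for real data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: five weeks of hourly records with two features.
idx = pd.date_range("2024-03-01", periods=24 * 7 * 5, freq="h")
df = pd.DataFrame(np.random.rand(len(idx), 2), columns=["f1", "f2"], index=idx)

# Train on the first four weeks; the immediately succeeding full week
# becomes the twin-sample, covering one complete weekly season.
split = idx[0] + pd.Timedelta(weeks=4)
train = df.loc[:split]
twin = df.loc[split + pd.Timedelta(hours=1) : split + pd.Timedelta(weeks=1)]
```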

2. Performing unsupervised learning on twin-sample 

Now that we have our twin-sample, the next step is to perform cluster learning on it. For this, we will use the same parameters that we used on our training set, including the number of clusters, the distance metric, etc. We will get a set of cluster labels as the output of this step and denote this set by S. The idea is that we should get similar results on our twin-sample as we got on our training set, given that both sets contain similar data and we are using the same parameter set. This similarity will be measured in the subsequent steps.
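Continuing the sketch above with k-means (the cluster count K is an assumed hyperparameter carried over from the training run):

```python
from sklearn.cluster import KMeans

K = 4  # hypothetical cluster count, identical to the training configuration

# Re-run clustering on the twin-sample with exactly the same parameters
# (k, initialization, and the Euclidean distance implied by k-means).
S = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(twin[["f1", "f2"]])
```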

3. Importing results for twin-sample from training set

In this step, we will compute another set of cluster labels on the twin-sample, this time using the results of the clustering performed on the training set. For each point in the twin-sample, we will perform the following two steps:

  1. Identify its nearest neighbor in the training set. Note that the distance metric should be the same as the one used in the clustering process.
  2. Import the cluster label of its nearest neighbor.

Following the above process, we will have a cluster label for each point in the twin-sample. Let’s denote this set of cluster labels by P.
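Continuing the sketch, a 1-nearest-neighbor lookup (with the same Euclidean metric k-means uses) imports the training labels onto the twin-sample:

```python
from sklearn.neighbors import KNeighborsClassifier

# Labels produced by the original clustering run on the training window.
train_labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(train[["f1", "f2"]])

# For each twin-sample record, find its nearest training record and
# import that neighbor's cluster label.
nn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
nn.fit(train[["f1", "f2"]], train_labels)
P = nn.predict(twin[["f1", "f2"]])
```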

4. Calculating similarity between the two sets of results

Now that we have two sets of cluster labels, S and P, for the twin-sample, we can compute their similarity using any measure defined in the External Validation section, such as the F1-measure or Jaccard similarity.
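For instance, the pair_confusion sketch from the External Validation section can be reused directly, or an off-the-shelf pair-based score can be applied:

```python
from sklearn.metrics import fowlkes_mallows_score

# Pair-based scores are invariant to the arbitrary cluster ids,
# so S and P do not need to share a numbering scheme.
print("twin-sample similarity:", fowlkes_mallows_score(P, S))
```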

A set of clusters having high similarity with its twin-sample is considered good.

Conclusions

  1. Twin-sample validation can be used to validate the results of unsupervised learning.
  2. It should be used in combination with internal validation.
  3. It can prove highly useful for time-series data, where we want to ensure that our results remain consistent across time.
  4. In this article we used k-means clustering as an example to explain the process, but the approach is general and can be adapted to any unsupervised learning technique.

 
