## Introduction

The clustering analysis is one type of unsupervised technique used to identify (i.e. group) similar samples in a multi-dimensional data space. In the context of remote sensing data, it is largely used for pixel classification, where the multi-dimensional space represents the pixel reflectances in ** n** different wavelengths (sensor bands) or any other combination.

During my first year of PhD, at the Paul Sabatier University (Toulouse), more specifically working at the GET laboratory, we have been working with satellite images from Sentinel-2 and Landsat-8 missions to assess inland water surface. As the water reflectance can vary a lot depending on the region, atmospheric conditions, and constituents it is not viable to employ a single threshold, or a set of fixed rules to separate water from land. Thus, one of the challenges was to develop an algorithm to automatically identify the water pixels in the scene.

The chosen approach was to run an unsupervised clustering algorithm to identify the pixels closest to the water pixel spectra. In the next two sections I will give a brief introduction to K-Means and Agglomerative clustering algorithms. If you are already familiarized to these techniques you could jump to the **Proposed Solution **section. The underlying code, as well as the git repository, is explained in the story __Water Detection in High Resolution Satellite Images using the waterdetect python package__.

## K-Means and the “Mouse” Data-set

The most used algorithm for clustering within the remote sensing community is, by far, the K-Means clustering. It’s fast, easy to implement or to find it in software packages, but it’s results are not good for this kind of task.

The idea behind K-Means is that given a desired number of clusters to split the data-set on, each sample belongs to the closest cluster centroid. The algorithm, then, searches for centroids that minimizes the within-cluster variance (the variance of the distance from the samples to the cluster centroid) by repeatedly assigning the samples to the clusters and then recomputing the centroids. A quick explanation of the method can be found in following video.

Video:__ K-means clustering: how it works__.

K-Means has some problems with convergence of the algorithm, depending on the initialization centroids, but there are alternatives like the k-means++ implemented in Scikit-Learn and other packages that can overcome this.

There is one shortcoming, however, that is inherent to K-Means. The final solution is a Voronoi’s diagram, with all the clusters sharing the same size, what can be easily explained by the initial assumption that all samples belongs to the closest cluster centroid. Thus, for data-sets with different clusters sizes (the “mouse” data-set, for example), this can lead to sub-optimal solutions (Figure 2).

Considering that our primary objective was to separate pixels of ** water** from

**(everything else), you can imagine how different in size those clusters can be (Figure 1).**

*land*## The alternative: Agglomerative Clustering

To overcome this K-Means problem, we opted to employ the agglomerative clustering. Agglomerative clustering is a sub-type of hierarchical clustering that follows a bottom-up approach, where each observation starts in its own cluster and then are merged iteratively until the desired number of clusters.

The key parameters for the algorithm to decide whether to merge two clusters are the metric and the linkage. The metric specifies the measure of distance between pairs of observations. We used the simple Euclidean distance for multi-dimensional space:

In this formula *a* and *b* are the point coordinates in a *n*-dimensional space and *i* is the considered dimension.

Besides the metric, the linkage criteria determine how to compute the distance between clusters as a function of the pairwise distances between observations. The Average Linkage, used in this study, considers the mean distance considering all points in each cluster and can be described as:

In this formula, a and b are the coordinates in a n-dimensional space and |A| and |B| are the total amount of observations in each cluster.

During each iteration, the algorithm merges the two clusters, among all of them, that are closer to each other considering the criteria described above. The iteration continues until the specified number of K is reached. With this algorithm, the “mouse” data-set, would look like this:

Everything was fine, except for one detail… one entire Sentinel-2 image simply don’t fit into such algorithm and for one reason, that is explained in the ** Understanding the concept of Hierarchical clustering Technique** story:

Space and Time Complexity of Hierarchical clustering Technique:Space complexity:The space required for the Hierarchical clustering Technique is very high when the number of data points are high as we need to store the similarity matrix in the RAM. The space complexity is the order of the square of n.

Spacecomplexity = O(n²) where n is the number of data points.

Time complexity:Since we’ve to perform n iterations and in each iteration, we need to update the similarity matrix and restore the matrix, the time complexity is also very high. The time complexity is the order of the cube of n.

Time complexity = O(n³) where n is the number of data points.

## The Proposed Solution

As we saw, the hierarchical clustering if better for our task, but it is not scalable due to its time and space complexities (that’s one reason why K-Means is far more common in Remote Sensing community).

The time and space for computing the clusters in our algorithm were viable for 10k, up to 50k pixels, but the Sentinel-2 image has 10980x10980… a total of **120 million pixels**.

The solution was to sub-sample the image, extracting completely random pixels. The only concern here is to provide an amount that can still represent the whole scene and all the targets. In the next story we will demonstrate that for a Sentinel-2 image, 10k pixels (i.e. 0,01%) is enough.

Figure 4-a shows all a scene of Sentinel 2, and Figure 4-b a scatter plot of all the pixels in the image, as well as the sub-sampled pixels represented in red. As our main objective was to identify water, the Modified Normalized Water Index — MNDWI and the band B12 (short wave infrared) were used as Y and X axis respectively. I this graph, the water pixels are expected to be at the top left corner, with a high MNDWI and low B12 reflectance. Figure 4-c shows the solution of the clustering for the sub sample of pixels.

After applying the clustering, we have the labels for all sub sampled pixels (Figure 4-c) and the problem becomes how to generalize the clustering solution for the entire scene.

That can be easily done by applying a machine learning algorithm that can actually “learn” the characteristics of each cluster and then replicate it to all other pixels in scene. For this task, many machine learning algorithms are available. The most commons are Random forests (RF), Support Vector Machines (SVM), Multi Layer Perceptron (MLP), Naive Bayes and many others. To achieve this, we can use a machine learning package like__ Scikit-Learn__, and the process has basically 2 steps:

Fit the model to the previously labeled data. That’s the same as “training” the model.

Predict the values

As we are going to predict all 120 million pixels classes, it is important that the selected algorithm have a good performance concerning accuracy and speed. The one that worked best in the mentioned study was Naive Bayes, with a good relation of accuracy and speed.

In Scikit, these two steps can be obtained with the following code:

**# train a NB classifier with the data and labels provided
**model = GaussianNB()
__model.fit__(clusters_data, clusters_labels)
**# return the new predicted labels for the whole dataset
**return model.predict(data)

Where clusters_data and clusters_labels are the vectors used in the clustering process and data is the whole scene data to be predicted. As already mentioned, the code and the api to apply these steps to satellite images will be explained in the next story (in preparation). The result of the process is shown in figure 5 and compared to the results of a K-Means clustering for the same data.

## Conclusion

As we have seen, the results of an unsupervised classification in remote sensing can be greatly improved by using agglomerative hierarchical clustering instead of the most used K-Means. Time and space complexity drawbacks can be dealt by the proposed solution of sub-sampling the image and generalizing it afterwards with a quick machine learning algorithm like the Naive Bayes. This generalization step is produced in less than 1 minute, which otherwise would be impossible to achieve with a regular agglomerative clustering into the full data-set.

The story __Water Detection in High Resolution Satellite Images using the waterdetect python package__ brings an example of this approach applied in the **waterdetect **package.

Figure 6 shows some examples of the this approach for water identification in some France scenes.

If you have any doubts or ideas on how to improve this story, please feel free to contact us on __geocorner.net__ or post the suggestions on the comments. If you are interested in some deep learning for remote sensing images, check out my other stories on that theme.

Thank you.

## 留言