kmeans clustering is a very popular technique for simplifying datasets into archetypes or clusters of observations with similar properties. The technique works by ‘forcing’ the observations into k different groups, with k chosen by the analyst, such that variance within each group is minimized. As with most statistical techniques, this analysis needs to be conducted with judgment. Some choices of k result in smaller groups for which the gain in insight is not sufficient to warrant them being treated differently. The elbow method is a popular way to choose an appropriate value of k, but human judgment is still necessary in any such method.
For the most part, kmeans clustering is conducted on static, point in time, observations. Examples can include clustering populations based on a selection of demographics at a point in time, clustering patients based on a set of medical observations at a point in time or clustering cities based on a set of urban statistics in a given year.
Increasingly, there is a desire to cluster observations based on how they change over time. Do they increase, decrease, stay the same? Are they consistently high, consistently low, or do they go up and down? Are some more complex in their changes than others?
Dealing with and preparing time series data
To illustrate how to conduct kmeans clustering on time series data (or trajectories), I am going to use a fictional dataset of survey responses from individuals over a five year timeframe, where the same survey was administered annually, and where individual IDs were tracked over the period. I am going to download this dataset from my Github repo and take a look at it. The repo contains both a csv
and and RDS
version of the data, and here I will read the RDS
version straight into my R session.
library(readr)
library(tidyverse)
# get data straight from Github
survey_data < readr::read_rds(url("https://github.com/keithmcnulty/longitudinal_clustering/blob/master/survey_data.RDS?raw=true"))
# display first 5 rows
knitr::kable(survey_data %>% head(5))
id  average_rating  year 

5b758a319063cdcd1afbcfcd6173006c  5.720280  2019 
66cddc5ab668f44582002b973451d684  4.125874  2019 
da400f831a94d78c816c413e6433b9f0  2.825175  2019 
55c3efe6ed30668c4c07eca7b2ed7fda  2.237762  2019 
8c49365b6b11232ae24e94ad07ad8b5f  3.790210  2019 
We can see that our data is quite simple. It looks like the survey may have had a number of questions on it, and we have been given data on the average rating the individual gave for all the questions. And it also looks like there may be some different years involved. Let’s see some of our basic stats:
# number of individuals
unique(survey_data$id) %>% length()
## [1] 436
# years involved
unique(survey_data$year)
## [1] 2019 2018 2017 2016 2015
# range ratings
range(survey_data$average_rating)
## [1] 1 10
So it looks like we are dealing with over 400 people and five years worth of survey data. We can reasonably assume that there is a 1 to 10 scale on the survey overall, although it would be sensible to get a copy of the survey to confirm this. Now, we don’t know whether every individual has data for all the years – it may be that some did not take the survey in certain years. So it’s probably a good idea to organize this data by individual and form a time series out of it, since that is what we are interested in.
Essentially we want to move this data set from its current long format, where each individual has a different row for each annual observation, to a wide format, where each row is an individual and each column is an annual observation for that individual. Let’s use our best friend tidyr::pivot_wider()
to do this.
# pivot from long to wide based on year
survey_data_wide < survey_data %>%
dplyr::arrange(year) %>%
tidyr::pivot_wider(id_cols = id, names_from = year, values_from = average_rating)
# show first five rows
knitr::kable(survey_data_wide %>% head(5))
id  2015  2016  2017  2018  2019 

5b758a319063cdcd1afbcfcd6173006c  5.342657  5.153846  5.216783  4.146853  5.720280 
66cddc5ab668f44582002b973451d684  3.265734  4.965035  3.895105  1.629371  4.125874 
da400f831a94d78c816c413e6433b9f0  4.335664  2.195804  1.251748  2.573427  2.825175 
55c3efe6ed30668c4c07eca7b2ed7fda  3.013986  3.265734  4.083916  5.027972  2.237762 
8c49365b6b11232ae24e94ad07ad8b5f  4.335664  3.769231  3.643357  2.762238  3.790210 
So we now have time series data – let’s just check if the data is complete for all individuals:
is.na(survey_data_wide) %>% sum()
## [1] 935
Clearly not – there are obviously many individuals that do not have observations for some of these years. We should bear this in mind as we move to the next stage.
Running kmeans clustering on the timeseries data
We are going to use the kml
package in R to cluster these individuals into a certain number of groups based on the pattern of their trajectories. This is very easy to do if you understand how to use the package, so this example should help you with that.
Before we can run the clustering algorithm, we need to get our data into a shape that the algorithm understands. The kml
algorithm expects an object of the form clusterLongData
, so we will need to transform our data into this object. This object is really just the data together with some extra information, such as where to find the actual time series in the data.
Luckily the kml
package has a function cld()
to convert our data into the required object. The function accepts an R data frame and some extra arguments, and uses the arguments to make a clusterLongData
object from the dataframe. In particular, this function needs to know two important things:

The
timeInData
argument tells the function which numeric columns are the timeseries columns. In our case it is all columns except the first column, so2:6
. 
The
maxNA
argument tells the function to ignore certain observations if there are too manyNA
values. In this case, it seems silly to me to try to cluster any individual who has more that twoNA
observations out of five.
# ensure data is in a data frame (cld also accepts a matrix but no other data type)
survey_data_wide < as.data.frame(survey_data_wide)
# create clusterLongData object
survey_data_cld < kml::cld(survey_data_wide, timeInData = 2:6, maxNA = 2)
# inspect object
class(survey_data_cld)
## [1] "ClusterLongData"
## attr(,"package")
## [1] "kml"
It looks like we have successfully made the conversion. Now we can run the kml
algorithm. The algorithm runs through:
 Different values of k according to the argument
nbCluster
. The default is 2 to 6 clusters.  Several iterations of each clustering using different staring points. The default is 20, but this can be adjusted using the
nbRedrawing
argument.
Usually, if you don’t mess with the more geeky arguments of the kml function, the algorithm will run in its fast version, which is optimized to be very fast. Also, kml
writes the clustering results into your clusterLongData
object so you don’t need to create a new object here.
Let’s say that I am happy with looking at up to six clusters but I only want five redrawings of each.
# run kml with 26 clusters and five redrawings for each
kml::kml(survey_data_cld, nbRedrawing = 5)
## ~ Fast KmL ~
## *************************S
So that looks promising. But this is all a bit abstract – how do I see results?
To look at results, and make a choice on which clustering you would like to proceed with, you can use the choice()
function. This will open an interactive panel that you can use to view the different clusterings, choose one you are interested in and download the data on which individual fell into which cluster based on your choice.
In Windows, this function should work without any problems, on a Mac or Linux machine you may need to open your X11 console before you run this.
# on Mac or Linux you may need to enable X11 console first
if (.Platform$OS.type != "windows") {
X11(type = "Xlib")
}
# run choice
kml::choice(survey_data_cld)
If your command succeeds here, you should see a screen that you can interact with like I am doing in this video:
On the left are the various values of k and you can move them up or down using the arrow keys, while you can move left or right to see the different redrawings for each k. The full navigation instructions should be visible in your R console window, like this:
Selecting a clustering
As you navigate the different clusters you might take several factors into account in deciding which clustering is most useful to you:

The sorting scale on the left uses the CalinskiHarabasz Index to provide a score on the ‘quality’ of the clustering. For more on measures of cluster quality, including CalinskiHarabasz, see this paper. In practical terms, this is simply stating the obvious: that as you move to more and more clusters, the statistics of each cluster become less reliable. You should, however, be careful if there is a big drop in quality when you are moving lefttoright through redrawings of the same k. There is no "minimum quality" that I am aware of, so this should be used as directional guidance.

The size of each cluster. If you choose three clusters for example, but one of them only has 1% of your sample, that may not be particularly useful. The statistics on cluster proportions at the top of the graph should be helpful in your evaluation of this.

The trajectories themselves – given by the colored lines. You will likely be seeking to identify trajectories that support your objective. At the end of the video, for example, I identify four useful trajectories – permanently low, permanently high, one moving from low to high and the other moving in the opposite direction. If this was an employee satisfaction survey, for example, you may be interested in exploring each of these groups more to better understand these patterns.
When you have selected a drawing of a particular clustering that you believe is helpful, you can formally select it by pressing the space key (don’t forget this step). Then you can exit your view by pressing the m
key. At the point of exit, various files on your selected clustering will be written to your project folder, including csv
files with cluster details and statistics and a mapping of each observation to a cluster, as well as graphics showing the cluster trajectories. You can find the files for my choice in the Github repo for this project.
Next steps
Assuming the survey is not anonymous, or that you have at least some demographic data on the respondents, you can use your clustering output and join it to that demographic data to try to see what the clusters you have identified have in common. This can help you with targeted actions against the different clusters, or to help illustrate where the different pockets of satisfaction or sentiment may lie in the organization.
In a similar way, if this were clinical trial data and the time periods were patient obs on the treated condition, you could join the clusters to other patient data to determine where the trial has been particularly effective or ineffective.
I really loved reading your blog. It was very well authored and easy to understand. Unlike other blogs I have read which are really not that good.Thanks alot!
click here For more info