Proximity-based spatial customer grouping (in R)

This post provides a coding example for spatial proximity customer clustering, applicable e.g. when searching for multiple centers of gravity (i.e. when solving a multiple warehouse location problem). The logic and approach are the same as in any distance-based clustering problem.

I will apply k-means clustering to group customers based on their spatial distance.

The k-means clustering algorithm is well explained, e.g. in this article: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/

First I define a dataframe containing random latitude and longitude coordinates, representing randomly distributed customers.

customer_df <- as.data.frame(matrix(nrow=1000,ncol=2))
colnames(customer_df) <- c("lat","long")
customer_df$lat <- runif(n=1000,min=-90,max=90)
customer_df$long <- runif(n=1000,min=-180,max=180)

Here are the first rows of the dataframe:

head(customer_df)
##          lat       long
## 1  67.260409   47.08063
## 2  55.400065   55.46616
## 3 -47.152065 -107.63843
## 4 -84.266658 -163.62681
## 5  -6.012361  103.34046
## 6 -10.717590  -59.64681

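The standard k-means clustering algorithm selects k random initial points and defines those as the cluster centers. The algorithm then assigns each data point to its closest cluster center, recomputes each center as the mean of its assigned points, and repeats these two steps until the assignments stabilize.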
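
The assign-and-update loop described above can be sketched in a few lines of R. This is a minimal Lloyd-style illustration only, and not the algorithm that `kmeans` runs by default (that is Hartigan-Wong). The function name `simple_kmeans`, its arguments and the demo data are hypothetical.

```r
# Minimal Lloyd-style k-means sketch (illustrative only; all names are hypothetical)
simple_kmeans <- function(points, k, iterations = 10) {
  points <- as.matrix(points)
  # pick k random rows as initial cluster centers
  centers <- points[sample(nrow(points), k), , drop = FALSE]
  for (it in seq_len(iterations)) {
    # squared Euclidean distance of every point to every center (n x k matrix)
    dists <- sapply(seq_len(k), function(j) {
      center_mat <- matrix(centers[j, ], nrow(points), ncol(points), byrow = TRUE)
      rowSums((points - center_mat)^2)
    })
    # assign each point to its closest center
    assignment <- max.col(-dists, ties.method = "first")
    # move each center to the mean of its assigned points
    for (j in seq_len(k)) {
      members <- points[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

set.seed(1)
pts <- data.frame(lat = runif(200, -90, 90), long = runif(200, -180, 180))
res <- simple_kmeans(pts, k = 4)
table(res$cluster)
```

The sketch runs a fixed number of iterations instead of checking for convergence, which keeps it short but is not how a production implementation would stop.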

In this case we later want to use the clustering algorithm for solving facility location problems with multiple warehouses to locate. It thus seems more appropriate to select cluster centers that are reasonably distanced from one another. For this I define a function that returns evenly spaced row indices of the customer dataframe, which are then used as starting centers. Note that these rows are only spread along the longitude dimension if the dataframe is sorted by longitude; for the unsorted random data used here the picked rows are effectively arbitrary:

initial_centers <- function(customers,centers){
  # spacing between the picked rows (number of customers per center, rounded down)
  step <- as.integer(nrow(customers)/centers)
  # row indices step, 2*step, ..., centers*step
  seq_len(centers) * step
}
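
For 1000 customers and 4 centers, the function above returns the row indices 250, 500, 750 and 1000. If the starting centers should really be spread along the longitude dimension, one option is to order the rows by longitude first. The sketch below is my own assumption and not part of the original code; the helper name `initial_centers_by_long` and the demo data are hypothetical.

```r
# hypothetical variant: row indices of customers spread along the longitude dimension
initial_centers_by_long <- function(customers, centers) {
  step <- nrow(customers) %/% centers   # spacing between picked positions
  positions <- seq_len(centers) * step  # evenly spaced positions in the sorted order
  order(customers$long)[positions]      # map positions back to original row indices
}

set.seed(42)
demo_df <- data.frame(lat = runif(1000, -90, 90), long = runif(1000, -180, 180))
rows <- initial_centers_by_long(demo_df, 4)
demo_df$long[rows]  # increasing longitudes, the last one being the maximum
```

The returned indices can be passed to `kmeans` in the same way as before, e.g. `centers = demo_df[rows, ]`.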

We can now apply the function above in combination with the kmeans function from the stats package, which ships with R and is loaded by default. In this example I derive four proximity-based customer groups.

cluster_obj <- kmeans(customer_df,centers=customer_df[initial_centers(customer_df,4),])
head(cluster_obj)
## $cluster
##    [1] 4 4 1 1 2 1 2 4 1 2 1 4 3 4 4 1 1 1 2 2 3 2 1 3 2 3 1 4 2 4 4 2 4 2
##   [35] 1 4 4 2 2 1 3 2 2 1 3 2 4 3 2 1 1 2 2 3 4 1 4 2 2 3 2 1 2 1 2 2 2 3
##   [69] 1 4 3 3 2 1 4 3 1 1 3 1 2 1 2 1 1 4 2 4 1 2 2 1 4 3 4 2 1 2 3 4 1 2
##  [103] 3 3 4 4 4 1 4 3 1 4 1 2 2 1 3 2 3 2 4 3 4 3 2 1 1 2 2 2 4 4 4 1 2 2
##  [137] 3 3 2 4 4 3 4 1 1 1 3 3 4 4 1 1 2 4 3 4 4 2 2 1 3 2 4 3 2 1 1 2 1 1
##  [171] 2 1 1 1 4 3 3 1 2 3 2 4 2 2 2 4 3 2 1 4 1 2 4 2 3 2 2 2 2 2 1 2 2 1
##  [205] 2 1 2 3 3 2 3 1 2 1 2 4 1 1 2 4 3 2 4 2 1 4 4 3 1 1 2 1 2 2 3 2 1 1
##  [239] 3 1 3 1 2 1 2 1 1 4 1 1 2 2 1 2 1 4 1 4 2 2 2 2 4 4 1 3 3 1 1 4 3 4
##  [273] 4 4 1 2 2 1 4 1 2 4 2 1 4 2 4 2 3 4 4 4 2 2 1 4 2 4 4 1 2 1 2 1 2 3
##  [307] 1 1 1 1 2 3 3 3 1 4 4 1 2 1 4 1 4 3 2 4 3 2 1 2 2 4 2 4 2 2 2 4 2 1
##  [341] 3 2 1 3 3 2 1 1 3 1 4 1 2 1 4 1 2 3 2 1 4 2 3 1 3 1 1 2 2 2 2 2 1 3
##  [375] 2 2 1 2 4 4 1 3 1 2 3 4 2 4 4 1 1 2 4 4 4 2 3 4 1 3 2 3 4 1 3 3 1 4
##  [409] 2 1 4 1 3 2 1 3 3 2 2 2 1 2 3 1 2 4 4 2 2 4 3 4 3 1 1 3 1 3 4 2 4 3
##  [443] 3 3 4 1 1 2 1 3 2 1 1 2 1 4 2 2 1 1 2 1 2 4 2 4 3 2 1 1 1 4 2 3 1 4
##  [477] 3 1 2 1 1 1 2 3 4 3 2 3 4 4 2 1 3 2 1 4 4 2 4 2 3 1 2 2 3 4 2 3 2 4
##  [511] 3 4 2 4 2 1 3 2 1 4 2 4 3 1 1 4 2 2 2 1 4 2 1 3 1 4 1 4 2 3 4 3 1 2
##  [545] 2 2 2 2 2 2 2 2 4 4 1 4 1 2 2 1 1 1 2 3 3 1 1 2 2 3 4 3 2 2 2 1 1 3
##  [579] 4 2 1 4 1 3 3 1 1 1 2 3 1 2 3 1 4 4 1 1 3 1 4 1 2 3 3 2 4 4 2 4 2 2
##  [613] 2 3 1 1 4 2 3 4 1 4 4 2 2 1 4 3 3 4 4 1 1 3 4 3 1 1 2 3 3 3 3 1 1 1
##  [647] 4 1 2 1 2 4 2 4 2 2 3 4 4 2 4 1 2 1 1 4 2 1 1 2 1 4 4 1 3 3 1 3 4 4
##  [681] 2 2 4 3 1 2 3 2 4 3 2 4 3 4 1 4 4 1 3 1 3 3 4 2 1 4 4 2 2 2 2 3 1 1
##  [715] 1 2 1 4 1 3 1 2 2 4 3 3 2 2 1 3 2 2 1 1 3 4 3 3 1 1 2 1 1 4 2 4 1 4
##  [749] 2 2 2 2 3 1 2 1 1 1 2 1 3 2 1 3 2 3 2 2 1 2 4 3 4 1 4 2 3 1 3 1 3 2
##  [783] 3 1 1 1 1 1 4 2 2 1 2 1 4 1 4 3 4 1 2 1 1 4 2 1 4 4 3 4 2 3 1 3 2 3
##  [817] 1 3 4 2 4 1 3 2 1 3 3 1 1 1 1 4 2 2 4 1 1 3 4 1 2 3 2 4 1 1 1 3 2 2
##  [851] 1 3 3 2 3 1 2 2 3 2 1 4 1 1 1 3 2 1 3 1 2 3 2 4 2 2 2 2 1 3 4 3 1 4
##  [885] 2 3 2 2 3 4 4 2 2 1 3 4 4 1 4 4 3 1 2 4 2 1 1 1 2 4 3 1 1 3 1 3 1 1
##  [919] 4 3 1 2 1 3 2 4 2 1 4 2 1 3 1 2 1 3 3 1 2 1 1 1 1 1 1 3 4 4 2 1 2 2
##  [953] 2 1 1 1 4 2 3 4 3 4 1 2 3 3 1 4 2 1 1 3 1 3 4 1 3 1 3 1 3 3 1 4 3 4
##  [987] 1 3 2 4 4 2 3 4 3 2 4 2 3 2
## 
## $centers
##           lat        long
## 1   0.6938018 -122.442238
## 2  -5.3567099  123.957813
## 3 -46.9979863   -2.714282
## 4  48.9979562   15.062099
## 
## $totss
## [1] 13050174
## 
## $withinss
## [1] 1108924.4 1028012.3  423675.5  523506.7
## 
## $tot.withinss
## [1] 3084119
## 
## $betweenss
## [1] 9966055

Above you see the first six elements of the list returned by the kmeans function. Below I combine the cluster indices contained in the kmeans object with the customer dataframe, so that we now have three columns. This allows us to plot the groups with ggplot2.

result_df <- customer_df
result_df$group <- cluster_obj$cluster
head(result_df)
##          lat       long group
## 1  67.260409   47.08063     4
## 2  55.400065   55.46616     4
## 3 -47.152065 -107.63843     1
## 4 -84.266658 -163.62681     1
## 5  -6.012361  103.34046     2
## 6 -10.717590  -59.64681     1
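
With the group column attached, per-group summaries are short to write. The sketch below uses its own small random data set (the names `demo_df`, `demo_fit` and `group_means` are placeholders) to count customers per group and to compute the mean latitude and longitude of each group. These means should match the `centers` element returned by `kmeans`, up to floating point error.

```r
set.seed(7)
demo_df <- data.frame(lat = runif(300, -90, 90), long = runif(300, -180, 180))
demo_fit <- kmeans(demo_df, centers = 3)
demo_df$group <- demo_fit$cluster

# number of customers per group
table(demo_df$group)

# mean position per group (center of gravity with equal customer weights)
group_means <- aggregate(cbind(lat, long) ~ group, data = demo_df, FUN = mean)
group_means

# compare with the centers reported by kmeans
demo_fit$centers
```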

I complete this post by visualizing the results in a scatterplot, using the ggplot2 R package. For coloring I use the viridis package:

library(ggplot2)
library(viridis)
# cluster indices are labels, not quantities, so they are mapped as a factor with a discrete color scale
ggplot(result_df) + geom_point(mapping = aes(x=lat,y=long,color=factor(group))) +
  xlim(-90,90) + ylim(-180,180) + scale_color_viridis(discrete = TRUE, option = "D") + labs(color = "group")

Let's run another test with 20 warehouses:

cluster_obj <- kmeans(customer_df,centers=customer_df[initial_centers(customer_df,20),])
result_df$group <- cluster_obj$cluster
# cluster indices are labels, not quantities, so they are mapped as a factor with a discrete color scale
ggplot(result_df) + geom_point(mapping = aes(x=lat,y=long,color=factor(group))) +
  xlim(-90,90) + ylim(-180,180) + scale_color_viridis(discrete = TRUE, option = "D") + labs(color = "group")
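
Runs with different numbers of warehouses can also be compared numerically. One option, offered here as a sketch and not as part of the original analysis, is the total within-cluster sum of squares (`tot.withinss`) reported by `kmeans`. Because `kmeans` uses Euclidean distance on the raw coordinates, the values are in squared degrees. The demo data, the helper `start_rows` and the candidate values in `ks` are arbitrary placeholders.

```r
set.seed(123)
demo_df <- data.frame(lat = runif(1000, -90, 90), long = runif(1000, -180, 180))

# evenly spaced row indices used as starting centers (hypothetical helper)
start_rows <- function(n, k) seq_len(k) * (n %/% k)

# total within-cluster sum of squares for several numbers of warehouses
ks <- c(2, 4, 10, 20)
tot_wss <- sapply(ks, function(k) {
  fit <- kmeans(demo_df, centers = demo_df[start_rows(nrow(demo_df), k), ])
  fit$tot.withinss
})
data.frame(warehouses = ks, tot_withinss = tot_wss)
```

More warehouses generally reduce this total, so the interesting question is where the reduction starts to flatten out.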

If interested, check out my post on center of mass calculation in R and how it can be used to solve a warehouse location problem in R.
