This post provides a coding example of spatial proximity customer clustering, applicable e.g. when searching for multiple centers of gravity (i.e. when locating multiple warehouses). The logic and approach are the same as in any distance-based clustering problem.
I will apply k-means clustering to group customers based on their spatial distance, measured here as the Euclidean distance between latitude/longitude pairs.
The algorithm for k-means clustering is well explained, e.g. in this article: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/
First I define a dataframe containing random latitude and longitude coordinates, representing randomly distributed customers.
customer_df <- as.data.frame(matrix(nrow=1000,ncol=2))
colnames(customer_df) <- c("lat","long")
customer_df$lat <- runif(n=1000,min=-90,max=90)
customer_df$long <- runif(n=1000,min=-180,max=180)
Here are the first rows of the data frame:
head(customer_df)
## lat long
## 1 -42.69660 58.067160
## 2 37.31715 179.655272
## 3 -28.68660 -3.025719
## 4 -76.15463 117.119388
## 5 -14.84898 -162.408406
## 6 54.19468 -128.476143
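Because runif draws new values on every run, the coordinates (and all results further down) will differ each time the script is executed. A minimal sketch of how the data generation could be made repeatable is shown here; the seed value 123 is an arbitrary assumption and not part of the original code, so it will not reproduce the numbers printed in this post.

```r
# Assumption: arbitrary seed, chosen only to make the draw repeatable
set.seed(123)
customer_df <- data.frame(
  lat  = runif(n = 1000, min = -90,  max = 90),
  long = runif(n = 1000, min = -180, max = 180)
)

# Resetting the same seed and drawing again gives the same data frame
set.seed(123)
customer_df_again <- data.frame(
  lat  = runif(n = 1000, min = -90,  max = 90),
  long = runif(n = 1000, min = -180, max = 180)
)
identical(customer_df, customer_df_again)
```

Building the data frame directly with data.frame() also avoids creating an empty matrix first.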
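The standard k-means clustering algorithm selects k random initial points and defines those as the cluster centers. The algorithm then assigns each data point to its nearest cluster center, recomputes every center as the mean of its assigned points, and repeats both steps until the assignments no longer change.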
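To make the assignment and update steps more tangible, here is a minimal sketch of a single iteration on six toy points. The data, the choice of starting centers and all variable names are illustrative assumptions. Note also that this is the plain textbook iteration; the kmeans function in R uses the Hartigan-Wong algorithm by default, which refines the same idea.

```r
# Toy data: six points in two dimensions (illustrative values, not the customer data)
pts <- matrix(c( 0,  0,
                 1,  0,
                 0,  1,
                10, 10,
                11, 10,
                10, 11), ncol = 2, byrow = TRUE)

# Two starting centers, here simply the first and the fourth point
ctr <- pts[c(1, 4), , drop = FALSE]

# Assignment step: squared Euclidean distance from every point to every center
d2 <- sapply(seq_len(nrow(ctr)), function(j) {
  rowSums(sweep(pts, 2, ctr[j, ])^2)
})
assignment <- max.col(-d2, ties.method = "first")  # index of the nearest center

# Update step: each center moves to the mean of the points assigned to it
new_ctr <- t(sapply(seq_len(nrow(ctr)), function(j) {
  colMeans(pts[assignment == j, , drop = FALSE])
}))
```

Repeating both steps with new_ctr in place of ctr until assignment stops changing gives the basic k-means loop.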
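In this case we want to later use the clustering algorithm for solving facility location problems with multiple warehouses to locate. It thus seems more appropriate to select starting centers that are reasonably distanced from one another. For this I define a function that returns the defined number of evenly spaced row indices, which are then used to pick the starting centers from the spatial data set. Note that these indices are spread along the longitude dimension only if the data frame is sorted by longitude; for the unsorted random data used here they simply select fixed rows: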
# Returns 'centers' evenly spaced row indices of the data frame 'customers'
initial_centers <- function(customers, centers) {
  step <- as.integer(nrow(customers) / centers)  # number of rows per block
  seq_len(centers) * step                        # last row index of each block
}
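If the starting centers should really be spread along the longitude dimension, the rows have to be ranked by longitude first. The following is a sketch of one possible variant, not the function used for the results in this post. The name initial_centers_by_long, the demo data and the seed are hypothetical, and the sketch assumes a column called long.

```r
# Hypothetical variant: starting rows that are evenly spaced in longitude rank
initial_centers_by_long <- function(customers, centers) {
  ord  <- order(customers$long)          # row indices sorted from west to east
  step <- nrow(customers) %/% centers    # rows per longitude band
  # take the middle row of each consecutive band of 'step' sorted rows
  ord[(seq_len(centers) - 1) * step + (step + 1) %/% 2]
}

set.seed(1)  # assumption: seed only to make this sketch repeatable
demo_df <- data.frame(lat  = runif(1000, min = -90,  max = 90),
                      long = runif(1000, min = -180, max = 180))
idx <- initial_centers_by_long(demo_df, 4)
demo_df[idx, ]   # four starting centers, ordered by increasing longitude
```

The returned indices can be passed to kmeans in the same way as those from initial_centers, i.e. via centers = demo_df[idx, ].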
We can now apply the function above, in combination with the kmeans function from the stats package, which ships with R. In this example I derive four proximity-based customer groups.
cluster_obj <- kmeans(customer_df,centers=customer_df[initial_centers(customer_df,4),])
head(cluster_obj)
## $cluster
## [1] 1 4 2 1 3 3 3 1 2 1 3 2 2 4 3 4 4 4 3 4 3 2 3 4 2 3 2 3 1 4 4 4 1 1 4 2 2
## [38] 2 4 4 3 3 1 2 4 1 2 1 3 1 2 2 3 3 4 1 4 2 3 3 2 4 2 2 3 2 1 4 2 2 2 4 4 2
## [75] 4 3 3 4 1 1 1 3 3 2 1 1 3 3 4 4 3 1 2 4 3 1 3 2 2 2 3 2 3 4 4 2 3 3 1 3 1
## [112] 2 2 4 1 1 1 3 4 1 2 3 3 3 1 1 2 3 3 2 1 3 4 2 2 3 2 2 1 2 1 2 2 2 2 3 2 3
## [149] 1 2 2 1 2 3 2 2 1 4 2 4 3 3 3 2 1 1 2 2 3 3 4 1 2 4 1 2 1 2 3 2 2 2 3 3 2
## [186] 1 1 1 4 3 4 4 2 1 3 2 4 2 2 3 3 1 3 2 3 2 4 2 3 2 4 1 1 3 1 2 1 3 4 2 4 3
## [223] 4 2 4 3 4 2 4 2 1 2 1 3 4 2 2 3 2 4 2 1 2 3 3 2 2 3 3 1 3 4 4 3 4 1 1 2 3
## [260] 3 4 2 1 1 1 2 2 2 1 4 4 3 1 2 4 3 3 3 3 3 3 3 2 3 3 3 3 2 2 3 3 1 4 1 2 1
## [297] 4 2 2 3 1 4 4 2 3 3 2 4 4 3 2 1 2 3 2 2 4 4 2 2 2 3 2 2 2 2 2 2 2 1 2 2 4
## [334] 3 3 2 2 3 3 1 2 4 2 1 3 3 4 1 2 4 1 4 4 4 1 2 3 1 3 1 3 3 2 3 4 1 2 2 2 2
## [371] 1 2 2 2 1 3 2 1 2 2 2 4 3 2 2 3 1 3 3 4 1 1 3 4 2 4 1 1 4 4 2 4 2 3 3 2 4
## [408] 4 4 3 2 1 3 3 4 1 3 3 1 3 4 2 3 2 2 3 2 2 2 1 2 3 4 3 4 3 4 4 2 1 3 2 3 1
## [445] 3 1 1 2 3 3 2 2 3 4 1 1 3 1 2 4 2 2 2 3 1 3 2 1 4 2 3 2 4 1 4 3 1 1 4 4 3
## [482] 1 2 4 3 3 2 1 4 2 3 2 4 3 4 4 1 2 2 2 3 3 4 4 1 3 2 3 2 4 1 2 4 1 2 3 1 3
## [519] 2 3 3 3 1 3 2 4 1 3 4 3 4 4 3 4 4 2 1 1 3 3 3 3 3 4 2 1 3 3 1 1 4 1 4 2 2
## [556] 1 1 4 4 3 3 4 3 1 4 3 1 2 3 3 2 4 1 2 3 3 1 2 2 1 3 1 4 4 3 2 4 3 1 4 2 3
## [593] 3 2 2 1 1 2 4 2 3 3 2 1 4 1 4 3 3 3 3 3 3 2 2 2 1 1 2 3 2 1 1 2 1 1 1 1 1
## [630] 2 4 2 1 1 3 1 4 2 4 2 2 1 4 1 2 2 3 1 1 3 1 1 3 4 3 2 4 1 1 1 2 1 1 1 2 3
## [667] 4 3 2 4 4 4 2 4 4 3 2 1 2 2 3 3 3 4 2 4 3 1 2 4 2 3 1 3 3 1 4 3 4 4 1 2 3
## [704] 3 4 4 2 2 1 2 2 1 3 4 1 2 2 3 4 4 2 3 1 2 4 3 1 2 2 2 1 4 1 3 1 4 2 2 1 1
## [741] 2 2 2 2 1 2 4 3 3 3 1 3 4 1 1 3 2 1 4 4 2 4 2 3 2 3 3 4 2 1 2 3 2 1 1 1 3
## [778] 2 3 4 3 2 2 4 1 4 4 2 2 1 1 2 3 1 2 2 2 1 4 3 3 3 1 2 1 3 2 4 2 3 4 1 4 3
## [815] 1 3 2 2 1 1 2 3 4 4 4 3 1 2 4 2 2 2 1 3 4 4 2 2 3 2 4 3 4 2 2 2 3 3 3 4 2
## [852] 3 3 3 1 1 3 1 1 2 2 1 3 3 4 3 3 3 2 2 1 3 2 3 1 4 3 2 4 1 4 3 3 2 3 4 4 1
## [889] 3 1 2 4 4 3 2 1 2 3 2 1 1 2 3 2 1 1 3 3 4 3 3 4 3 3 3 2 2 1 2 3 1 1 1 4 2
## [926] 3 4 2 4 1 4 3 4 4 1 3 2 2 1 2 2 4 2 4 1 4 1 2 1 2 4 3 4 2 4 4 4 2 4 1 2 3
## [963] 3 2 4 4 4 2 4 4 3 2 1 2 4 3 2 3 1 2 3 4 1 4 3 4 2 1 4 1 4 2 3 3 1 1 2 2 2
## [1000] 4
##
## $centers
## lat long
## 1 -44.672042 103.20907
## 2 9.621406 -22.15262
## 3 -4.487789 -127.84173
## 4 48.358322 110.24174
##
## $totss
## [1] 13417586
##
## $withinss
## [1] 557304.2 1006745.4 962130.0 492832.8
##
## $tot.withinss
## [1] 3019012
##
## $betweenss
## [1] 10398574
Above you see the first six components of the result object returned by the kmeans function. Below I combine the cluster indices contained in the kmeans object with the customer data frame, such that we now have three columns. This allows us to plot the results with ggplot2.
result_df <- customer_df
result_df$group <- cluster_obj$cluster
head(result_df)
## lat long group
## 1 -42.69660 58.067160 1
## 2 37.31715 179.655272 4
## 3 -28.68660 -3.025719 2
## 4 -76.15463 117.119388 1
## 5 -14.84898 -162.408406 3
## 6 54.19468 -128.476143 3
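With the group column in place, the combined data frame can also be summarized per group. The sketch below counts the customers in each group and recomputes the mean coordinates per group, which should agree with the centers component of the kmeans object and can be read as unweighted centers of gravity. The data is regenerated so that the block runs on its own; the seed and the names group_sizes and group_means are assumptions.

```r
set.seed(42)  # assumption: arbitrary seed so that the sketch is repeatable
customer_df <- data.frame(lat  = runif(1000, min = -90,  max = 90),
                          long = runif(1000, min = -180, max = 180))

# Same starting rows that initial_centers(customer_df, 4) returns
cluster_obj <- kmeans(customer_df, centers = customer_df[c(250, 500, 750, 1000), ])

result_df <- customer_df
result_df$group <- cluster_obj$cluster

# Number of customers per group
group_sizes <- table(result_df$group)

# Mean latitude and longitude per group
group_means <- aggregate(cbind(lat, long) ~ group, data = result_df, FUN = mean)
group_means
```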
I complete this post by visualizing the results in a scatterplot, using the ggplot2 R package. For coloring I use the viridis package:
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
# cluster labels are categories, so map them as a factor and use the discrete viridis scale
ggplot(result_df) + geom_point(mapping = aes(x=lat,y=long,color=factor(group))) +
xlim(-90,90) + ylim(-180,180) + scale_color_viridis(discrete = TRUE, option = "D") + labs(color = "group")
Let's run another test with 20 warehouses:
cluster_obj <- kmeans(customer_df,centers=customer_df[initial_centers(customer_df,20),])
result_df$group <- cluster_obj$cluster
# cluster labels are categories, so map them as a factor and use the discrete viridis scale
ggplot(result_df) + geom_point(mapping = aes(x=lat,y=long,color=factor(group))) +
xlim(-90,90) + ylim(-180,180) + scale_color_viridis(discrete = TRUE, option = "D") + labs(color = "group")
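To compare the two runs numerically rather than only visually, one option is to look at the sums of squares stored in the kmeans object. In the four-group output printed earlier, betweenss divided by totss is roughly 0.775 (10398574 / 13417586). The sketch below computes the same ratio for 4 and 20 groups on freshly generated data; the seed, the helper start_rows and the raised iter.max are assumptions made for this illustration.

```r
set.seed(7)  # assumption: arbitrary seed so that the sketch is repeatable
customer_df <- data.frame(lat  = runif(1000, min = -90,  max = 90),
                          long = runif(1000, min = -180, max = 180))

# Hypothetical helper: the same evenly spaced row indices as initial_centers()
start_rows <- function(n, k) seq_len(k) * (n %/% k)

fit_4  <- kmeans(customer_df, centers = customer_df[start_rows(1000, 4), ],
                 iter.max = 50)
fit_20 <- kmeans(customer_df, centers = customer_df[start_rows(1000, 20), ],
                 iter.max = 50)

# Share of the total sum of squares explained by the grouping
explained <- c(k4  = fit_4$betweenss  / fit_4$totss,
               k20 = fit_20$betweenss / fit_20$totss)
explained
```

More groups will generally explain a larger share, so the ratio alone does not tell you how many warehouses to open; it only shows how much tighter the customer groups become.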
Data scientist focusing on simulation, optimization and modeling in R, SQL, VBA and Python