Cluster analysis of the housing market

1. Introduction

1.1. About the data
I have used information about the housing type of 20,786 households across Afghanistan, collected in 2013-2014 for the nation-wide, comprehensive Afghanistan Living Conditions Survey. A subset of 35 housing-related variables (out of the original 345) is used and analyzed via R. This is an exercise in big-data clustering and does not attempt to answer any question pertaining to the housing market in Afghanistan.

1.2. Methodology
The dataset is numeric-categorical in nature, for which a hierarchical algorithm would be the ideal choice. Nevertheless, this project applies both k-means clustering and divisive hierarchical clustering to a fraction of the sample. I first performed a principal component analysis of the dataset to reduce its dimensionality; the four components that accounted for ~95% of the variance of the data were used for clustering. These components correspond to the following variables:
·       Type of dwelling: single family house, part of a shared house, apartment, tent, temporary shelter/shack, other,
·       Main construction material of walls: fired brick/stone, concrete/cement, mudbrick/mud, stone/mud, other,
·       Main construction material of roof: concrete with metal, wood/wood with mud, tin/metal, girder with fired brick, mud brick, other,
·       Main construction material of floor: mud/earth, concrete/tile, other.
The following sections cover project preparation, graphical analysis, within-group sum-of-squares calculation, multi-cluster analysis, silhouette calculation, clustering with the optimal number of clusters, wrap-up of the dataset, descriptive statistics and conclusions. Where ready-made functions were available, I used them rather than writing new ones, and I avoided overly complex ones.

2. Project preparation
Tidying involves removing NAs, which reduces the number of observations to 10,088. To reduce computing cost, I further sampled 20% (a 0.2 fraction) of these, for an effective sample size of 2,018. This dramatically reduced the computational burden of the exercise. PC1, PC2, PC3 and PC4 cumulatively explain 94.7% of the variance in the dataset and are therefore used to cluster the data below.
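The preparation steps above can be sketched as follows. This is a minimal sketch, not the original script: `alcs_housing` (the 35-variable subset) and the seed are hypothetical stand-ins, since the survey data is not reproduced here.

```r
library(dplyr)

# Drop incomplete cases: ~10,088 complete observations remain
housing <- na.omit(alcs_housing)

# Sample 20% to keep clustering tractable (~2,018 observations)
set.seed(42)                       # hypothetical seed, for reproducibility
housing_s <- sample_frac(housing, 0.2)

# Principal component analysis on the scaled variables;
# summary() reports the cumulative proportion of variance (PC1-PC4 ~ 94.7%)
pca <- prcomp(housing_s, scale. = TRUE)
summary(pca)

# Keep the first four components for clustering
pc <- as.data.frame(pca$x[, 1:4])
```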

3. Graphical analysis
The enhanced scatterplot matrix shows that dwelling is largely left-skewed and is positively correlated with wall material, roof material and floor material; none of these correlations is significant. Wall material is negatively correlated with roof material and floor material, while roof material is positively correlated with floor material. The scatterplots also confirm that the latter pair shows a slightly stronger association.
Graph 1
Histograms (see graph 2) show that dwelling has largely two distinct breaks, wall material roughly three, roof material three major breaks and floor material two. Since the dataset is categorical-numeric, this is inevitable. Moreover, none of the variables is normally distributed.
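A sketch of the plots described above, assuming the four variables sit in the sampled data frame under the hypothetical names `dwelling`, `wall_mat`, `roof_mat` and `floor_mat`; `pairs.panels()` from the psych package is one way to produce an enhanced scatterplot matrix with correlations, though the original may have used a different function.

```r
library(psych)

vars <- c("dwelling", "wall_mat", "roof_mat", "floor_mat")

# Enhanced scatterplot matrix: histograms on the diagonal,
# scatterplots below it and correlation coefficients above it
pairs.panels(housing_s[, vars])

# Separate histograms (graph 2): the breaks reflect the category codes
par(mfrow = c(2, 2))
for (v in vars) hist(housing_s[[v]], main = v, xlab = v)
```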

4. Clustering

4.1. Determining the number of clusters based on within-group sum of squares for k-means clustering
The within-group sum of squares is lowest for 7 clusters; nonetheless, there is a spike in value at 5 clusters. Opting for brevity, a 4-cluster solution would be preferred.
Graph 3
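The elbow diagnostic above can be computed along these lines, assuming `pc` is a data frame holding the four retained principal components:

```r
# Total within-group sum of squares for k = 1..7
set.seed(42)   # hypothetical seed
wss <- sapply(1:7, function(k) {
  kmeans(pc, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot (graph 3)
plot(1:7, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-group sum of squares")
```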
4.2. K-means clustering and divisive clustering (general treatment)
I used a function to perform k-means clustering for 2-7 clusters and print the cluster sizes and within-sum-of-squares for each cluster; the function also plots the data for each number of clusters. The plot for k = 4 (see graph 4, appendix) shows that observations are grouped fairly well. The maximum number of clusters (k = 7) yields 7 groups of 337, 147, 25, 49, 1383, 34 and 43 observations, with within-sums-of-squares of 131.33, 20.86, 20.08, 37.02, 496.27, 40.50 and 41.44 respectively. The dendrogram for divisive hierarchical clustering indicates that the optimal number of clusters could be 5.
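A minimal version of such a function, plus the divisive (DIANA) fit and its dendrogram, might look like the sketch below; `pc` again stands for the component scores, and the function name is hypothetical.

```r
library(cluster)

# k-means for k = 2..7: print sizes and within-cluster SS, then plot
explore_kmeans <- function(data, ks = 2:7) {
  for (k in ks) {
    km <- kmeans(data, centers = k, nstart = 25)
    cat("k =", k, "| sizes:", km$size, "\n")
    cat("  withinss:", round(km$withinss, 2), "\n")
    plot(data$PC1, data$PC2, col = km$cluster,
         main = paste("k-means with k =", k))
  }
}
explore_kmeans(pc)

# Divisive hierarchical clustering with its dendrogram
dv <- diana(pc)
pltree(dv, cex = 0.4, main = "Dendrogram of divisive clustering")
```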

4.3. Methodical selection of the optimal number of clusters using silhouette values
This step is performed via fviz_nbclust, part of the factoextra package written by Alboukadel Kassambara. The silhouette plot shows that the average silhouette width is highest for 3 clusters in k-means clustering; for divisive hierarchical clustering, it is highest for 10.

Graph 5
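The silhouette diagnostic described above can be reproduced roughly as follows. `hcut` is factoextra's hierarchical-clustering helper; whether the original analysis passed the divisive variant through it is an assumption.

```r
library(factoextra)

# Average silhouette width across k for k-means (peaks at k = 3 here)
fviz_nbclust(pc, kmeans, method = "silhouette")

# The same diagnostic for hierarchical clustering
fviz_nbclust(pc, hcut, method = "silhouette", k.max = 10)
```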

4.4. K-means clustering and divisive clustering using optimum number of clusters
Observations are plotted neatly using PC1 and PC2 for the k-means algorithm. K-means clustering with 3 clusters yields cluster sizes of 410, 1574 and 34.

The divisive clustering plot is overlapping, though it does show cluster boundaries around their centroids. Cluster one has 1483 observations, cluster two 347, cluster three 41, and the rest follow with fewer observations; cluster 10 has only one observation (graph 6 below and graph 7, annexed).
Graph 6
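The final fits can be sketched as below, assuming `pc` holds the component scores and `dv` is a fitted diana object; the reported sizes come from the text above.

```r
# Final k-means solution with the silhouette-optimal k = 3
set.seed(42)   # hypothetical seed
km3 <- kmeans(pc, centers = 3, nstart = 25)
km3$size                          # cluster sizes: 410, 1574, 34

# Cut the divisive tree into 10 clusters
dv10 <- cutree(as.hclust(dv), k = 10)
table(dv10)                       # cluster 10 contains a single observation

# Cluster plot on PC1/PC2 (graph 6)
library(factoextra)
fviz_cluster(list(data = pc, cluster = km3$cluster))
```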

4.5. Descriptive Statistics

4.5.1a. Results of k-means clustering:
Characteristics of cluster one (410 observations): the households' median housing type is 1 (single family house), median wall construction material is 3 (mud bricks/mud), median roof material is 5 (mud bricks) and median floor material is 1 (mud/earth).

Characteristics of cluster two (1574 observations): the households' median housing type is 1 (single family house), median wall construction material is 3 (mud bricks/mud), median roof material is 2 (wood or wood with mud) and median floor material is 1 (mud/earth).

Characteristics of cluster three (34 observations): the households' median housing type is 3 (apartment, shared or separate), median wall construction material is 5 (other), median roof material is 2 (wood or wood with mud) and median floor material is 1 (mud/earth). Standard deviations are modest for all clusters and across all components.
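The per-cluster summaries above can be produced with a pattern like the following, assuming `km3` is the 3-cluster k-means fit and the variables carry the hypothetical names used earlier:

```r
# Attach the k-means cluster labels to the sampled data
housing_s$cluster <- km3$cluster

# Per-cluster medians of the four housing variables
aggregate(cbind(dwelling, wall_mat, roof_mat, floor_mat) ~ cluster,
          data = housing_s, FUN = median)

# Per-cluster standard deviations
aggregate(cbind(dwelling, wall_mat, roof_mat, floor_mat) ~ cluster,
          data = housing_s, FUN = sd)
```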



4.5.1b. Results of k-means clustering (graphical analysis)
A histogram is created for the new cluster variable and the type of housing; this can be repeated for the other variables as well. The histogram shows the proportion of housing types in each cluster, which in turn depends on the number of observations in the cluster. Cluster two accounts for the majority of counts.
A scatterplot of the new cluster variable against wall material shows that cluster two contains the majority of observations. A significant share of this cluster's members have walls made of mud, and another sizeable share have walls made of stone or mud. Cluster one follows, with most of its members having walls made of mud/mud bricks.
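A sketch of these two plots in ggplot2, under the same hypothetical column names; jittering is added because the integer category codes would otherwise overplot.

```r
library(ggplot2)

# Housing-type counts within each k-means cluster
ggplot(housing_s, aes(x = dwelling)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ cluster)

# Wall material against cluster membership
ggplot(housing_s, aes(x = factor(cluster), y = wall_mat)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.4) +
  labs(x = "cluster", y = "wall material code")
```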

4.5.2a. Results of divisive clustering
Divisive clustering created 10 clusters. Cluster one has 1483 observations, cluster two 347, cluster three 41, and the rest follow with fewer observations; cluster 10 has only one observation.

Characteristics of cluster one: the households' median housing type is 1 (single family house), median wall construction material is 3 (mud bricks/mud), median roof material is 2 (wood/wood with mud) and median floor material is 1 (mud/earth).

Characteristics of cluster two: the households' median housing type is 1 (single family house), median wall construction material is 3 (mud bricks/mud), median roof material is 5 (mud bricks) and median floor material is 1 (mud/earth).


Characteristics of cluster three: the households' median housing type is 1 (single family house), median wall construction material is 3 (mud bricks/mud), median roof material is 2 (wood or wood with mud) and median floor material is 2 (concrete/tile).

Characteristics of cluster ten: the households' median housing type is 2 (part of a shared house), median wall construction material is 1 (fired bricks/stone), median roof material is 3 (tin/metal) and median floor material is 3 (other). Standard deviations are more modest for all clusters and across all components.

4.5.2b. Results of divisive clustering (graphical analysis)
As explained above, clusters one, two, three and seven have the highest numbers of members; the others have very few observations.
A faceted scatterplot of the divisive clustering results by floor material type is created. The majority of observations in cluster one have floor material type 2, followed by types 4 and 5; the same holds for cluster two. None of the members of cluster six have floors of type 1, 3, 4, 5 or 6.
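The faceted plot might be built roughly as follows, assuming `dv10` holds the 10-group cut of the divisive tree; the axis pairing is an illustrative choice, not necessarily the one used in the original graph.

```r
library(ggplot2)

# Attach the divisive cluster labels and facet by cluster
housing_s$dcluster <- dv10
ggplot(housing_s, aes(x = floor_mat, y = roof_mat)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.4) +
  facet_wrap(~ dcluster, ncol = 5) +
  labs(x = "floor material code", y = "roof material code")
```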

6. Conclusions
Divisive hierarchical clustering produces starkly different clusters from k-means clustering, which is understandable given that the inner workings of the two approaches differ. While the k-means algorithm efficiently grouped 10,000 observations, divisive hierarchical clustering could not handle more than about 3,000. I also tried agglomerative approaches, which were equally inefficient.

K-means clustering produced 3 optimal clusters; divisive hierarchical clustering produced 10. Interestingly, divisive hierarchical clustering has the smallest standard deviations compared to k-means clustering, except for the 9th cluster, which has a standard deviation of 1 for principal component 2. This supports the argument that, among the branches of hierarchical clustering, divisive clustering is efficient in some circumstances.

7. References

7.1 K-means Cluster Analysis · UC Business Analytics R Programming Guide. (2018, October 2). Retrieved April 20, 2019, from https://uc-r.github.io/kmeans_clustering
7.2 Hierarchical Cluster Analysis · UC Business Analytics R Programming Guide. (2018, October 2). Retrieved April 20, 2019, from https://uc-r.github.io/hc_clustering
7.3 Kassambara, A., & Mundt, F. (2019, December 5). Package 'factoextra.' Retrieved January 6, 2020, from https://cran.r-project.org/web/packages/factoextra/factoextra.pdf
7.4 Kassambara, A. (n.d.). Partitional Clustering in R: The Essentials. Retrieved April 20, 2019, from https://www.datanovia.com/en/courses/partitional-clustering-in-r-the-essentials/?id=236
7.5 Grolemund, G. (2017, January). 3 Data visualisation | R for Data Science. Retrieved April 20, 2019, from https://r4ds.had.co.nz/data-visualisation.html#facets
7.6 Manning, C. D. (2009, April 7). Divisive clustering. Retrieved April 20, 2019, from https://nlp.stanford.edu/IR-book/html/htmledition/divisive-clustering-1.html
7.7 Piech, C. (2012). K Means. Retrieved April 20, 2020, from http://stanford.edu/%7Ecpiech/cs221/handouts/kmeans.html

8. Appendix:

Graph 2

Graph 4
Graph 7
