Regionalisation with Spatially Constrained Cluster Analysis

Published

December 5, 2022

case study : Regionalisation by Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods.

1. OVERVIEW

Regionalisation with Spatially Constrained clustering analysis requires similar observations to be grouped according to their statistical attributes and spatial location.

1.1 Objectives

Regionalise Nigeria based on water points’ attribute

Total number of water points by status, i.e. functional, non-functional, and unknown;
Percentage of water points by :
- status (functional, non-functional, and unknown);
- deployed water technology (hand pump, mechanical pump, stand tap, etc.) ;
- usage capacity (1000, 300, 250, 50);
- rural or urban.

1.2 Study Area Introduction

Water points in Federal Republic of Nigeria

Alpha-3 Code : NGA

Population : 225 million (1st in Africa, 6th globally)

Local Government Areas (LGA) : 774

Water Point Observations : 95,008

Environmental Aspects :

Geography :
- Southwest - “rugged” highland.
- Southeast - hills and mountains, which form the Mambilla Plateau, the highest plateau in Nigeria.
Hydrology :
- Two (2) main catchment areas - Chad Basin & Niger catchment area.
- Surface area of lake Chad is shrinking recent decades due to irrigation activities.¹
- Untreated wastes dump in places resulted in waterways and groundwater pollution.²
Vegetation Coverage :
- Lost nearly 80% of primary forest by 2012.³
- States with dense forests concentrated : Bayelsa, Cross River, Edo, Ekiti, Ondo, Osun, Rivers, and Taraba.

¹ Wikipedia. Nigeria. https://en.wikipedia.org/wiki/Nigeria

² Ogbonna, D.N., Ekweozor, I.K.E., Igwe, F.U. (2002). “Waste Management: A Tool for Environmental Protection in Nigeria”. Ambio: A Journal of the Human Environment. 31 (1): 55–57. doi:10.1639/0044-7447(2002)031[0055:wmatfe]2.0.co;2.

³ https://rainforests.mongabay.com/20nigeria.htm

1.3 Scope of Works

Specific tasks required for this study :

import the shapefile into R with the appropriate sf method, and save it in a simple feature data frame format;

Note

Three (3) Projected Coordinate Systems of Nigeria, EPSG : 26391, 26392, and 26303.

derive the proportion of functional and non-functional water points at LGA level (i.e. ADM2) by appropriate tidyr and dplyr methods;
combine geospatial and aspatial data frames into a simple feature data frame.
delineate water points measures functional regions by using :
- conventional hierarchical clustering.
- spatially constrained clustering algorithms.
plot two (2) main types of maps below :

Thematic Mapping

Show the derived water-point measures by appropriate statistical graphics and choropleth mapping technique.

Analytical Mapping

Plot delineated functional regions using non-spatially constrained and spatially constrained clustering algorithms.

2. R PACKAGE REQUIRED

The following are the packages required for this exercise :

!!!!! Tidy up the functions according to the high-level function category.

Load one or more packages

pacman package :
- p_load( ) - 2.1

Import and process Geospatial data

sf package :
- st_as_sfc( ) - 3.2.5.1
- st_sf( ) - 3.2.5.2
- st_crs( )- 3.2.5.2
- st_geometry( )- 3.2.5.3, 3.4.2.3
- st_read( ) - 3.3
- st_join( ) - 3.4.2.1
- st_set_crs( ) - 3.4.5.20
- st_set_geometry( ) - 4.1
stars package :
- st_intersects( )- 3.4.5.4

Import and process Aspatial data

tidyverse :
- readr package :
  - read_csv( ) - 3.2.1, 3.2.2, 3.2.3, 3.2.2.1
  - problems( ) - 3.2.1, 3.2.2, 3.2.3, 3.3
  - write_rds( ) - 3.2.1.1, 3.2.2.1, 3.2.3.1, 3.4.2.2, 3.4.5.1
  - read_rds( )- 3.2.1.1, 3.2.2.1, 3.2.3.1, 3.4.2.2, 3.4.4.3, 3.4.5.1
- dplyr :
  - rename( ) - 3.2.1, 3.2.2, 3.2.3
  - filter( ) - 3.3.1.3, 3.4.1.1, 3.4.5.1
  - left_join( ) - 3.2.4,
  - select( ) - 3.3, 3.3.1.3, 3.4.1.1
  - add_count( ) - 3.3.1.3, 3.4.1.1
  - mutate( ) - 3.4.3, 3.4.4.1
- tidyr :
  - replace_na( ) - 3.4.3, 3.4.5.4

Plot map for visualisation

tmap :
- tmap_mode( ) - 3.3.1.4
- tm_shape( ) - 3.3.1.4
- tm_polygons( ) - 3.3.1.4
- tm_view( ) - 3.3.1.4
- tm_fill( ) - 3.3.1.4
- tm_borders( ) - 3.3.1.4
- tmap_style( ) - 3.3.1.4
- tm_layout( ) - 3.3.1.4
- qtm( ) - 3.4.5.12
- tmap_arrange( ) - 3.4.5.12

Plot graph and chart

ggplot2 package :
- ggplot( ) - 3.4.5.3
- geom_bar( ) - 3.4.5.3
- geom_text( ) - 3.4.5.3
- scale_x_discrete( ) - 3.4.5.3
- geom_boxplot( )- 3.5.1.1
- geom_histogram( ) - 3.5.2.1
egg package :
- ggarrange( ) - 3.5.2.1

Visualise correlation matrix

corrplot package :
- corrplot.mixed( ) - 4.2

Get descriptive stats

questionr :
- freq.na( )- 3.3.1.1
- freq( ) - 3.3.1.2, 3.4.2.3
factoextra :
- fviz_nbclust( ) -
skimr package :
- skim( ) -
stats :
- dist( ) - 5.1.2
- as.dist( ) -

Miscellaneous generic functions

janitor package :
- get_dupes( ) - 3.4.2.3
plyr package :
- count( ) - 3.4.5.2
base package :
- summary( ) - 3.2.1.2, 3.2.2.2, 3.2.3.2, 3.4.2.3, 3.4.5.2
- duplicated( ) - 3.3.1.2
- length( )- 3.4.5.2
- sum( ) - 3.4.5.2

2.1 Load R Packages

Usage of the code chunk below :

p_load( ) - pacman - to load packages into R environment. This function will attempt to install the package from CRAN or pacman repository list if it is not installed.

Show the code

pacman::p_load(sf, tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally, skimr, stringr, funModeling, knitr, caTools, viridis, rgeoda, cowplot, patchwork)

3. GEOSPATIAL DATA

3.1 Acquire Data

Aspatial Data
1. Download the Nigeria data set in shapefile format via Access WPdx+ Global Data Repository from WPdx Global Data Repositories.
2. Rename the title of the data set to “geo_export”.

Note

The file size of the downloaded data is about 422 MB due to water points data from multiple countries.

Such file size may require extra effort and time to manage the code chunks and files in the R environment before pushing them to GitHub.

Hence, to avoid any error in pushing files larger than 100 MB to Git, filtered Nigeria water points and removed unnecessary variables before uploading into the R environment.

Therewith, the CSV file size should be lesser than 100 MB.

Geospatial Data
1. Download the Nigeria geoBoundaries data set at ADM2 level⁴ from geoBoundaries.org or the Humanitarian Data Exchange portal.
2. Rename the title of the data set to “nga_admbnda_adm2_osgof_20190417”

⁴ Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866

3.2 Import Attribute Data

3.2.1 Import Aspatial Data

⇳ Usage of the code chunk below :

read_csv( ) - readr - to import and save the comma separated value (CSV) file as a data frame, with title “wp_coord”.

col_select( ) - readr - to include only the selected variables into wp_cood data frame.

rename( ) - dplyr - to remove “#” from the variables.

problems( ) - readr - to reveal any parsing errors when importing the CSV file.

Show the code

wp_attribute <- read_csv("data/aspatial/WPdx_NGAv1.2.1.csv",
                       col_select = c(`row_id`,
                                      `#lat_deg`,
                                      `#lon_deg`,
                                      `New Georeferenced Column`,
                                      `lat_lon_deg`,
                                      `#water_source`,
                                      `#water_source_clean`,
                                      `#water_source_category`,
                                      `#water_tech_clean`,
                                      `#water_tech_category`,
                                      `#status_clean`,
                                      `#status`,
                                      `#status_id`,
                                      `#clean_adm1`,
                                      `#clean_adm2`,
                                      `water_point_population`,
                                      `local_population_1km`,
                                      `crucialness_score`,
                                      `pressure_score`,
                                      `usage_capacity`,
                                      `staleness_score`,
                                      `rehab_priority`,
                                      `is_urban`)) %>%
  rename(lat_deg = `#lat_deg`, 
         lon_deg = `#lon_deg`,
         water_source = `#water_source`,
         water_source_clean = `#water_source_clean`, 
         water_source_category = `#water_source_category`, 
         water_tech_clean = `#water_tech_clean`, 
         water_tech_category = `#water_tech_category`,
         status_clean = `#status_clean`,
         status = `#status`,
         status_id = `#status_id`,
         clean_adm1 = `#clean_adm1`,
         clean_adm2 = `#clean_adm2`)

Rows: 95008 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): #status_id, #water_source_clean, #water_source_category, #water_te...
dbl (10): row_id, #lat_deg, #lon_deg, rehab_priority, water_point_population...
lgl  (1): is_urban

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Show the code

problems(wp_attribute)

3.2.2 Get Imported Data Frame Overview

⇳ Usage of the code chunk below :

skim( ) - skimr - to get a broad overview of wp_coord data frame.

skim(wp_attribute)

3.2.3 Save Imported Data Frame into RDS Format

⇳ Usage of the code chunk below :

write_rds( ) - readr - to save wp_attribute data table into an RDS format.

note : reduce the file size with this function -> compress = “xz”.

write_rds(wp_attribute,
          "data/geodata/wp_attribute.rds",
          compress = "xz")

3.2.4 Read RDS File

⇳ Usage of the code chunk below :

read_rds( ) - readr - to read wp_attribute RDS file into wp_attribute.

wp_attribute <- read_rds("data/geodata/wp_attribute.rds")

3.2.5 Convert Well Known Text (WKT) Data to SF Data Frame

The “New Georeferenced Column” in wp_attribute contains spatial data in a WKT format.

Two (2) steps to convert the WKT data format into an sf data frame :

derive “geometry” variable.
conversion to sf data frame.

3.2.5.1 derive new field :: “geometry”

⇳ Usage of the code chunk below :

st_as_sfc( ) - sf - to convert foreign geometry object `New Georeferenced Column` to an sfc object

wp_attribute$geometry = st_as_sfc(wp_attribute$`New Georeferenced Column`)

3.2.5.2 convert to SF Data Frame

⇳ Usage of the code chunk below :

st_sf( ) - sf - to convert the tibble data frame into sf data frame with crs first set to WGS 84 (EPSG : 4326).

st_crs( ) - sf - to retrieve coordinate reference system from the object.

wp.sf<- st_sf(wp_attribute, crs = 4326)
st_crs(wp.sf)

Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

3.3 Import Boundary Data

3.3.1 Import Geospatial Data

⇳ Usage of the code chunk below :

st_read( ) - sf - to read simple features.

select( ) - dplyr - to select “shapeName” variable.

bdy_nga.sf <- st_read(dsn = "data/geospatial",
               layer = "geoBoundaries-NGA-ADM2",
               crs = 4326) %>%
  select(shapeName)

Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `D:\jephOstan\ISSS624\class_project\project_2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

problems(bdy_nga.sf)

3.3.3 Review Imported File

3.3.3.1 check for missing and duplicated data

Show the code

skim(bdy_nga.sf)

Remarks :

There is no missing data.
There are 774 unique “geometry” but only 768 unique “shapeName”
- That means there are 6 values of “shapeName” duplicated among the identified unique “shapeName”.

3.3.3.2 list the unique “shapeName” associated with duplication

⇳ Usage of the code chunk below :

add_count( ) - dplyr - to count observations by group

filter( ) - dplyr - to retain shapeName that has count not equal to 1.

dupl_shapeName.sf <- bdy_nga.sf %>%
  add_count(bdy_nga.sf$shapeName) %>%
  filter(n != 1) %>%
  select(-n)

freq(dupl_shapeName.sf$shapeName)

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

       var frequency percentage cumulative_perc
1    Bassa         2      16.67           16.67
2 Ifelodun         2      16.67           33.34
3 Irepodun         2      16.67           50.01
4 Nasarawa         2      16.67           66.68
5      Obi         2      16.67           83.35
6 Surulere         2      16.67          100.00

3.3.3.3 verify findings in section 3.3.3.2

⇳ Usage of the code chunk below :

tmap_mode( ) - tmap - to set tmap mode to static plotting or interactive.

tm_shape( ) - tmap - to specify the shape object.

tm_polygons( ) - tmap - to fill the polygons and draw the polygon borders.

tm_view( ) - tmap - to set the options for the interactive tmap viewer.

tm_fill( ) - tmap - to specify either which colour to be used or which data variable mapped to the colour palette.

tm_borders( ) - tmap - to draw the polygon borders.

tmap_style( ) - tmap - to set the tmap style.

tm_layout( ) - tmap - to set the layout of cartographic map.

tmap_mode("view")

tmap mode set to interactive viewing

tm_shape(bdy_nga.sf) +
  tm_polygons() +
  tm_view(set.zoom.limits = c(6,8)) +
  tm_shape(dupl_shapeName.sf) +
  tm_fill("shapeName",
          n = 6,
          style = "jenks") +
  tm_borders(alpha = 0.5) +
  tmap_style("albatross") +
  tm_layout(main.title = "Distribution of Duplicated ShapeName",
            main.title.size = 1.3,
            main.title.position = "center")

tmap style set to "albatross"

other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "beaver", "bw", "classic", "watercolor"

Show the code

tmap_mode("plot")

tmap mode set to plotting

Remarks :

The plot above indicates those duplicated water points are not within the same location.

3.3.3.4 acquire State info for duplicated value

⇳ Usage of the code chunk below :

To identify the correct State for the duplicated “shapeName” value -

First, retrieve the centroid coordinate for each “shapeName”.
Next, get the coordinates for LGA Headquarters or Secretariat office via Google Maps.
Lastly, match the duplicated LGA’s coordinate with the offices’ coordinates to determine the State therein.
- The steps to update can be referred to in section 3.4.1.

centroid_dupl.sf <- dupl_shapeName.sf %>%
  mutate(st_centroid
         (dupl_shapeName.sf$geometry,
           of_largest_polygon = FALSE))

centroid_dupl.sf[[4]]

Geometry set for 12 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3.346919 ymin: 6.493217 xmax: 8.782946 ymax: 12.00446
Geodetic CRS:  WGS 84
First 5 geometries:

POINT (7.031827 7.791971)

POINT (8.782946 10.08015)

POINT (5.052235 8.544311)

POINT (4.636735 7.920948)

POINT (4.926215 8.169349)

lga	row_id	headquarter	state	iso3166code	dupl_shapeName_coord	lga_office_coord
Bassa	94	Oguma	Kogi	NG.KO.BA	7.791971, 7.031827	7.80932, 6.74853
Bassa	95	Bassa	Plateau	NG.PL.BA	10.08015, 8.782946	10.11143, 8.71559
Ifelodun	304	Share	Kwara	NG.KW.IF	8.544311, 5.052235	8.5 5.0
Ifelodun	305	Ikirun	Osun	NG.OS.ID	7.920948, 4.636735	7.5 4.5
Irepodun	355	Omu Aran	Kwara	NG.KW.IR	8.169349, 4.926215	8.5 5.0
Irepodun	356	Ilobu	Osun	NG.OS.IP	7.84861, 4.498797	7.5 4.5
Nasarawa	519	Bompai	Kano	NG.KN.NA	12.00446, 8.578262	11.5, 8.5
Nasarawa	520	Nasarawa	Nasarawa	NG.NA.NA	8.304034, 7.760272	8.53477, 7.70153
Obi	546	Obarike-Ito	Benue	NG.BE.OB	7.022495, 8.281026	7.01129, 8.33118
Obi	547	Obi	Nasarawa	NG.NA.OB	8.35534, 8.734777	8.37944, 8.78561
Surelere	693	Surelere	Lagos	NG.LA.SU	6.493217, 3.346919	6.50048, 3.35488
Surelere	694	Iresa-Adu	Oyo	NG.OY.SU	8.088897, 4.393574	8.08459, 4.38538

:::

3.4 Data Wrangling

3.4.1 Edit Duplicated Value :: “shapeName”

Update the LGA boundary data frame with the matched state and “shapeName” by row index.⁵

⁵ Ong Z.R.J. (2022). Geospatial Analytics for Social Good - Understanding Nigeria Water functional and non-functional water point rate. https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#checking-of-duplicated-area-name

Show the code

bdy_nga.sf$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] <- 
  c("Bassa Kogi",
    "Bassa Plateau",
    "Ifelodun Kwara",
    "Ifelodun Osun",
    "Irepodun Kwara",
    "Irepodun Osun",
    "Nasarawa Kano",
    "Nasarawa Nasarawa",
    "Obi Nasarawa",
    "Obi Benue",
    "Surulere Lagos",
    "Surulere Oyo")

bdy_nga.sf$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)]

 [1] "Bassa Kogi"        "Bassa Plateau"     "Ifelodun Kwara"   
 [4] "Ifelodun Osun"     "Irepodun Kwara"    "Irepodun Osun"    
 [7] "Nasarawa Kano"     "Nasarawa Nasarawa" "Obi Nasarawa"     
[10] "Obi Benue"         "Surulere Lagos"    "Surulere Oyo"

3.4.1.1 validate edited value :: “shapeName”

Show the code

dupl_shapeName_val.sf <- bdy_nga.sf %>%
  add_count(bdy_nga.sf$shapeName) %>%
  filter(n != 1) %>%
  select(-n)

dupl_shapeName_val.sf

Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] shapeName            bdy_nga.sf$shapeName geometry            
<0 rows> (or 0-length row.names)

3.4.2 Perform Point-in-Polygon Overlay

Combine the attribute and boundary of the water points into a simple feature object.

3.4.2.1 join objects :: wp.sf, bdy_nga.sf

⇳ Usage of the code chunk below :

st_join( ) - sf - to join sf-class objects based on geometry, namely, wp.sf and bdy_nga.sf.

Show the code

wp_joined.sf <- st_join(x = wp.sf,
                        y = bdy_nga.sf,
                        join = st_intersects,
                        left = TRUE)

3.4.2.2 inspect joined file :: wp_joined

-- assess the uniqueness of water Point

wp_joined.sf %>% janitor::get_dupes(shapeName,lat_lon_deg)

No duplicate combinations found of: shapeName, lat_lon_deg

Simple feature collection with 0 features and 25 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 26
# … with 26 variables: shapeName <chr>, lat_lon_deg <chr>, dupe_count <int>,
#   row_id <dbl>, lat_deg <dbl>, lon_deg <dbl>, New Georeferenced Column <chr>,
#   water_source <chr>, water_source_clean <chr>, water_source_category <chr>,
#   water_tech_clean <chr>, water_tech_category <chr>, status_clean <chr>,
#   status <chr>, status_id <chr>, clean_adm1 <chr>, clean_adm2 <chr>,
#   water_point_population <dbl>, local_population_1km <dbl>,
#   crucialness_score <dbl>, pressure_score <dbl>, usage_capacity <dbl>, …

Remarks :

Each water point observation is unique as there is no duplication between the “shapeName” and “lat_lon_deg”.

-- determine reference point :: “shapeName” / “clean_adm2”

wp_reference <- (wp_joined.sf$shapeName == wp_joined.sf$clean_adm2)

summary(wp_reference)

   Mode   FALSE    TRUE    NA's 
logical   29856   65123      29

Remarks :

There are 29,856 “FALSE”, which is approximately 31% of LGA names mismatched between “shapeName” and “clean_adm2”.
- Since the geoBoundaries data is collected from government-published and reliable internet sources.⁶
  - Hence, the “shapeName” variable will be used throughout this study.
The 29 NA’s are 29 water points located beyond the LGA boundary, as shown in the plot below.

⁶ Daniel et. al (2020) geoBoundaries: A global database of political administrative boundaries. PlosOne. https://doi.org/10.1371/journal.pone.0231866

Show the code

tmap_mode("view")

tmap mode set to interactive viewing

Show the code

tm_shape(bdy_nga.sf) +
  tm_polygons() +
  tm_view(set.zoom.limits = c(5.5, 12)) +
  
tm_shape(filter(wp_joined.sf, 
                is.na(wp_reference))) +
  tm_dots(size = 0.05,
          col = "red")

Show the code

tmap_mode("plot")

tmap mode set to plotting

3.4.3 Remove Water Point Outside LGA Boundary

wp_joined1.sf <- wp_joined.sf %>% 
  filter(shapeName == clean_adm2 | shapeName != clean_adm2)

3.4.4 Combine Unique Value

questionr::freq(wp_joined1.sf$status_clean)

                                     n    % val%
Abandoned                          175  0.2  0.2
Abandoned/Decommissioned           232  0.2  0.3
Functional                       45874 48.3 54.4
Functional but needs repair       4579  4.8  5.4
Functional but not in use         1683  1.8  2.0
Non-Functional                   29378 30.9 34.8
Non-Functional due to dry season  2403  2.5  2.8
Non functional due to dry season     7  0.0  0.0
NA                               10648 11.2   NA

Remarks :

There are 9 unique values for “status_clean”. However, four (4) of them share the same context :

“Non functional due to dry season”
“Non-Functional due to dry season”
“Abandoned”
“Abandoned/Decommissioned”

Hence, the same context values need to combine into one unique value.

3.4.4.1 combine values with the same context

wp_joined1.sf$status_clean[wp_joined1.sf$status_clean == "Non functional due to dry season"] <- "Non-Functional due to dry season"

wp_joined1.sf$status_clean[wp_joined1.sf$status_clean == "Abandoned"] <- "Abandoned/Decommissioned"

-- review combined output

Show the code

unique(wp_joined1.sf$status_clean)

[1] "Non-Functional"                   NA                                
[3] "Functional"                       "Functional but needs repair"     
[5] "Abandoned/Decommissioned"         "Functional but not in use"       
[7] "Non-Functional due to dry season"

3.4.4.2 compute missing value :: “crucialness_score”

Show the code

summary(wp_joined1.sf$water_point_population == 0)

   Mode   FALSE    TRUE    NA's 
logical   88106    6336     537

Show the code

summary(wp_joined1.sf$local_population_1km == 0)

   Mode   FALSE    TRUE    NA's 
logical   89985    4457     537

Show the code

summary(wp_joined1.sf$crucialness_score == 0)

   Mode   FALSE    NA's 
logical   88106    6873

Remarks :

There will be 6,873 water points without crucialness score due to 0 value in 6,336 observations or missing value in 537 observations.

!!!!! need to remove these 537 observations?

3.4.4.3 save and read RDS file :: wp_joined1.sf

Save the updated values into wp_joined1.sf RDS file.

Show the code

write_rds(wp_joined1.sf,
          "data/geodata/wp_joined1.sf.rds",
          compress = "xz")

Show the code

wp_joined1.sf <- read_rds("data/geodata/wp_joined1.sf.rds")

3.5 Extract Water Point by Attribute

3.5.1 Extract Functional Water Point :: wpt_functional.sf

wpt_functional.sf <- wp_joined1.sf %>%
  filter(status_clean %in%
           c("Functional", 
             "Functional but not in use",
             "Functional but needs repair"))

3.5.1.1 compute data table for clustering analysis

wp_nga.sf <- bdy_nga.sf %>%
  mutate(`total_wp` = lengths(
    st_intersects(bdy_nga.sf, 
                  wp_joined1.sf))) %>%
  mutate(`wp_functional` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_functional.sf))) %>%
  mutate(`pct_functional` = (`wp_functional`/`total_wp`*100))

3.5.2 Extract Non-Functional Water Point :: wpt_nonFunctional.sf

wpt_nonFunctional.sf <- wp_joined1.sf %>%
  filter(status_clean %in%
           c("Abandoned/Decommissioned", 
             "Non-Functional",
             "Non-Functional due to dry season"))

3.5.2.1 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`wp_nonFunctional` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_nonFunctional.sf))) %>%
  mutate(`pct_nonFunctional` = (`wp_nonFunctional`/`total_wp`*100))

3.5.3 Extract Unknown Water Point :: wpt_unknown.sf

wpt_unknown.sf <- wp_joined1.sf %>%
  filter(status_clean == "Unknown")

3.5.3.1 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`wp_unknown` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_unknown.sf))) %>%
  mutate(`pct_unknown` = (`wp_unknown`/`total_wp`*100))

3.5.4 Extract Water Points by “water_tech_category”

Show the code

freq(wp_joined1.sf$water_tech_category)

              var frequency percentage cumulative_perc
1       Hand Pump     58735      61.84           61.84
2 Mechanized Pump     25636      26.99           88.83
3            <NA>     10054      10.59           99.42
4        Tapstand       553       0.58          100.00
5 Rope and Bucket         1       0.00          100.00

Remarks :

Only “Hand Pump”, “Mechanized Pump”, and “Tapstand” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

wpt_hPump.sf <- wp_joined1.sf %>%
  filter(water_tech_category %in% "Hand Pump")

wpt_mPump.sf <- wp_joined1.sf %>%
  filter(water_tech_category %in% "Mechanized Pump")

wpt_tStand.sf <- wp_joined1.sf %>%
  filter(water_tech_category %in% "Tapstand")

3.5.4.1 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`total_handPump` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_hPump.sf))) %>%
  mutate(`total_mechPump` = lengths(
    st_intersects(bdy_nga.sf,
                  wpt_mPump.sf))) %>%
  mutate(`total_tapStand` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_tStand.sf))) %>%
  mutate(`pct_handPump` = (`total_handPump`/`total_wp`*100)) %>%
  mutate(`pct_mechPump` = (`total_mechPump`/`total_wp`*100)) %>%
  mutate(`pct_tapStand` = (`total_tapStand`/`total_wp`*100))

3.5.5 Extract Water Point by “usage_capacity”

Show the code

freq(wp_joined1.sf$usage_capacity)

   var frequency percentage cumulative_perc
1  300     68768      72.40           72.40
2 1000     25636      26.99           99.39
3  250       573       0.60           99.99
4   50         2       0.00          100.00

Remarks :

Only “300”, “1000”, and “250” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.
But, “50” will be included in the new variable “total_ucN1000” as part of the none ‘1000’ “usage_capacity” value.
Based on the metadata description, the usage capacity is a recommended value of maximum users per water point based on the Sphere Standards.⁷
- 250 people usually consists of tapstand, kiosk, rainwater catchment.
- 1,000 people - mechanised well.

⁷ Sphere Association. The Sphere Handbook : Humanitarian Charter and Minimum Standards in Humanitarian Response 2018 Edition. Pg.106. https://spherestandards.org/wp-content/uploads/Sphere-Handbook-2018-EN.pdf

wpt_uC300.sf <- wp_joined1.sf %>%
  filter(usage_capacity %in% "300")

wpt_uC1000.sf <- wp_joined1.sf %>%
  filter(usage_capacity %in% "1000")

wpt_uC250.sf <- wp_joined1.sf %>%
  filter(usage_capacity %in% "250")

wpt_uC50.sf <- wp_joined1.sf %>%
  filter(usage_capacity %in% "50")

3.5.5.1 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`total_uc300` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_uC300.sf))) %>%
  mutate(`total_uc1000` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_uC1000.sf))) %>%
  mutate(`total_uc250` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_uC250.sf))) %>%
  mutate(`total_uc50` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_uC50.sf))) %>%
  mutate(`total_ucN1000` = (
    (lengths(
      st_intersects(bdy_nga.sf, wpt_uC300.sf))) +
      (lengths(
        st_intersects(bdy_nga.sf, wpt_uC250.sf))) +
      (lengths(
        st_intersects(bdy_nga.sf, wpt_uC50.sf))))) %>%
  mutate(`pct_ucN1000` = (`total_ucN1000`/`total_wp`*100)) %>%
  mutate(`pct_uc300` = (`total_uc300`/`total_wp`*100)) %>%
  mutate(`pct_uc1000` = (`total_uc1000`/`total_wp`*100)) %>%
  mutate(`pct_uc250` = (`total_uc250`/`total_wp`*100))

3.5.6 Extract Water Point by “is_urban”

Show the code

questionr::freq(wp_joined1.sf$'is_urban')

          n    % val%
FALSE 75423 79.4 79.4
TRUE  19556 20.6 20.6

Remarks :

Approximately 79.4% of the water points fall within non-urban community.

wpt_urban1.sf <- wp_joined1.sf %>%
  filter(is_urban %in% "TRUE")

wpt_urban0.sf <- wp_joined1.sf %>%
  filter(is_urban %in% "FALSE")

3.5.6.1 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`total_urban1` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_urban1.sf))) %>%
  mutate(`total_urban0` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_urban0.sf))) %>%
  mutate(`pct_urban1` = (`total_urban1`/`total_wp`*100)) %>%
  mutate(`pct_urban0` = (`total_urban0`/`total_wp`*100))

3.5.7 Extract Water Point by “crucialness_score”

Under section 3.4.4.2, there are 6,873 records without value. Hence, these NAs need to first replace with 0 value.

3.5.7.1 replace NA with 0

wp_joined1.sf <- wp_joined1.sf %>%
  mutate(crucialness_score = replace_na(crucialness_score, 0))

3.5.7.2 examine distribution statistically

Show the code

summary(wp_joined1.sf$crucialness_score)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.09669 0.26671 0.38363 0.56474 1.00017

Show the code

summary(wp_joined1.sf$crucialness_score == 1)

   Mode   FALSE    TRUE 
logical   79968   15011

Remarks :

According to the metadata, the “crucialness_score” indicates the importance of each water point to the surrounding communities. The value is a ratio of users assigned to the water point to the total local population within a 1km radius thereof.

For non-functional water points, the “crucialness_score” shows how important the water point would be if it were to be rehabilitated.
Having a median value of 0.26671 means 50% of the water points supply approximately 26% of residents within 1 km of the water point.
According to Sphere Standards⁸, the distance from any household to the nearest water point should be less than 500 metres.
However, 15,011 water points cover entire communities within 1 km. This is approximately 15.8% of the total water points.

⁸ Sphere Association. The Sphere Handbook : Humanitarian Charter and Minimum Standards in Humanitarian Response 2018 Edition. Pg.106. https://spherestandards.org/wp-content/uploads/Sphere-Handbook-2018-EN.pdf

3.5.7.3 reveal the distribution “crucialness_score” by “shapeName”

Show the code

questionr::freq(wp_joined1.sf$crucialness_score[
  wp_joined1.sf$shapeName == "Kirfi"])

          n    % val%
0         2  2.3  2.3
0.077914  1  1.1  1.1
0.115731  1  1.1  1.1
0.12009   1  1.1  1.1
0.129994  1  1.1  1.1
0.147203  1  1.1  1.1
0.15509   1  1.1  1.1
0.157345  1  1.1  1.1
0.164474  1  1.1  1.1
0.166955  1  1.1  1.1
0.167232  1  1.1  1.1
0.170958  1  1.1  1.1
0.173628  1  1.1  1.1
0.193043  1  1.1  1.1
0.210386  1  1.1  1.1
0.235887  1  1.1  1.1
0.251512  1  1.1  1.1
0.258065  1  1.1  1.1
0.266712  1  1.1  1.1
0.286574  1  1.1  1.1
0.298597  1  1.1  1.1
0.302252  1  1.1  1.1
0.303544  1  1.1  1.1
0.319412  1  1.1  1.1
0.330536  1  1.1  1.1
0.337174  1  1.1  1.1
0.338235  1  1.1  1.1
0.363727  1  1.1  1.1
0.433821  1  1.1  1.1
0.470878  1  1.1  1.1
0.472674  1  1.1  1.1
0.472941  1  1.1  1.1
0.473177  1  1.1  1.1
0.477725  1  1.1  1.1
0.480092  1  1.1  1.1
0.484076  1  1.1  1.1
0.496662  1  1.1  1.1
0.49864   1  1.1  1.1
0.50136   1  1.1  1.1
0.502825  1  1.1  1.1
0.504512  1  1.1  1.1
0.512159  1  1.1  1.1
0.514331  1  1.1  1.1
0.519908  1  1.1  1.1
0.524109  1  1.1  1.1
0.52478   1  1.1  1.1
0.525721  1  1.1  1.1
0.526823  1  1.1  1.1
0.527059  1  1.1  1.1
0.528302  1  1.1  1.1
0.534786  1  1.1  1.1
0.549686  1  1.1  1.1
0.565784  1  1.1  1.1
0.606396  1  1.1  1.1
0.772727  1  1.1  1.1
0.800382  1  1.1  1.1
0.846154  1  1.1  1.1
0.952727  1  1.1  1.1
0.986722  1  1.1  1.1
0.989832  1  1.1  1.1
1        26 29.9 29.9

Remarks :

Cannot use means to compute the “crucialness_score” to represent a “shapeName”. As shown in the output above, ranging from 0 to 1, a mean value can reduce the accuracy of the analysis.

Hence, the “crucialness_score” will be split into 2 groups. Since its 3rd quartile value is around 0.56, 0.6 will be the lowest bin as shown below -

x <= 0.60
x > 0.60

3.5.7.4 extract water point

wpt_cS04.sf <- wp_joined1.sf %>%
  filter(crucialness_score <= 0.6)

wpt_cS10.sf <- wp_joined1.sf %>%
  filter(crucialness_score > 0.6)

3.5.7.5 compute additional variable

wp_nga.sf <- wp_nga.sf %>%
  mutate(`total_cs04` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_cS04.sf))) %>%
  mutate(`total_cs10` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_cS10.sf))) %>%
  mutate(`pct_cs04` = (`total_cs04`/`total_wp`*100)) %>%
  mutate(`pct_cs10` = (`total_cs10`/`total_wp`*100))

3.5.8 Extract Water Point by “pressure_score”

Replace NAs with 0 value before computing a new variable associated with “pressure_score”.

3.5.8.1 replace NA with 0

wp_joined1.sf <- wp_joined1.sf %>%
  mutate(pressure_score = replace_na(pressure_score, 0))

3.5.8.2 examine distribution statistically

Show the code

summary(wp_joined1.sf$pressure_score)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.270   1.026   2.978   2.850 776.970

Show the code

summary(wp_joined1.sf$pressure_score > 1)

   Mode   FALSE    TRUE 
logical   46898   48081

Remarks :

Based on the metadata, pressure_score is a ratio of water point users assigned by the recommended maximum usage capacity of the water tech deployed for the water point.

When the score is greater than 1, that reflects the water point is supplying over its limit.

48,081 water points, approximately 50.6% of the water points are over their usage capacity limit.

3.5.8.3 extract water point

wpt_pS09.sf <- wp_joined1.sf %>%
  filter(crucialness_score <= 0.9)
wpt_pS19.sf <- wp_joined1.sf %>%
  filter(crucialness_score > 0.9 & crucialness_score <= 1.9)
wpt_pS39.sf <- wp_joined1.sf %>%
  filter(crucialness_score > 1.9 & crucialness_score <= 3.9)
wpt_pS40.sf <- wp_joined1.sf %>%
  filter(crucialness_score > 3.9)

3.5.8.4 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`total_ps09` = lengths(
    st_intersects(bdy_nga.sf, wpt_pS09.sf))) %>%
  mutate(`total_ps19` = lengths(
    st_intersects(bdy_nga.sf, wpt_pS19.sf))) %>%
  mutate(`total_ps39` = lengths(
    st_intersects(bdy_nga.sf, wpt_pS39.sf))) %>%
  mutate(`total_ps40` = lengths(
    st_intersects(bdy_nga.sf, wpt_pS40.sf))) %>%
  mutate(`pct_ps09` = (`total_ps09`/`total_wp`*100)) %>%
  mutate(`pct_ps19` = (`total_ps19`/`total_wp`*100)) %>%
  mutate(`pct_ps39` = (`total_ps39`/`total_wp`*100)) %>%
  mutate(`pct_ps40` = (`total_ps40`/`total_wp`*100))

3.5.9 Extract Water Point by “status_id”

Show the code

questionr::freq(wp_joined1.sf$status_id)

            n    % val%
No      32210 33.9 33.9
Unknown 10178 10.7 10.7
Yes     52591 55.4 55.4

3.5.9.1 extract water point

wpt_stat1 <- wp_joined1.sf %>%
  filter(status_id %in% "Yes")

wpt_stat0 <- wp_joined1.sf %>%
  filter(status_id %in% "No")

3.5.9.2 compute additional variables

wp_nga.sf <- wp_nga.sf %>%
  mutate(`total_status1` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_stat1))) %>%
  mutate(`total_status0` = lengths(
    st_intersects(bdy_nga.sf, 
                  wpt_stat0))) %>%
  mutate(`pct_stat1` = (`total_status1`/`total_wp`*100)) %>%
  mutate(`pct_stat0` = (`total_status0`/`total_wp`*100))

3.5.10 Save and Read RDS File :: wp_nga

Show the code

write_rds(wp_nga.sf,"data/geodata/wp_nga.sf.rds")

wp_nga <- read_rds("data/geodata/wp_nga.sf.rds")

3.6 Transform Coordinate System

The EPSG for wp_nga is 4326, which is WGS 84. To compute the proximity distance matrix for clustering analysis, the coordinate reference system for attribute (wp_nga) and boundary (bdy_nga) data frames needs to transform into EPSG: 26391.

⇳ Usage of the code chunk below :

st_set_crs( ) - sf - to update the coordinate reference system.

wp_ngaTrans <- st_set_crs(wp_nga, 26391)

Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that

bdy_ngaTrans <- st_set_crs(bdy_nga.sf, 26391)

Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that

3.6.1 Review CRS :: wp_ngaTrans

⇳ Usage of the code chunk below :

st_crs( ) - sf - to inspect the coordinate reference system.

st_crs(wp_ngaTrans)

Coordinate Reference System:
  User input: EPSG:26391 
  wkt:
PROJCRS["Minna / Nigeria West Belt",
    BASEGEOGCRS["Minna",
        DATUM["Minna",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4263]],
    CONVERSION["Nigeria West Belt",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",4.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.99975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",230738.26,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
        BBOX[3.57,2.69,13.9,6.5]],
    ID["EPSG",26391]]

3.6.2 Review CRS :: bdy_ngaTrans

st_crs(bdy_ngaTrans)

Coordinate Reference System:
  User input: EPSG:26391 
  wkt:
PROJCRS["Minna / Nigeria West Belt",
    BASEGEOGCRS["Minna",
        DATUM["Minna",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4263]],
    CONVERSION["Nigeria West Belt",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",4.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.99975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",230738.26,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
        BBOX[3.57,2.69,13.9,6.5]],
    ID["EPSG",26391]]

3.6.3 Save and Read RDS File :: wp_ngaTrans

Show the code

write_rds(wp_ngaTrans,"data/geodata/wp_ngaTrans.rds")
write_rds(bdy_ngaTrans,"data/geodata/bdy_ngaTrans.rds")

wp_ngaTrans <- read_rds("data/geodata/wp_ngaTrans.rds")
bdy_ngaTrans <- read_rds("data/geodata/bdy_ngaTrans.rds")

3.7 Exploratory Data Analysis (EDA)

3.7.1 Geospatial EDA

Reference of the code chunk below source from Lin S.Y., 2022⁹.

⁹ Lin S.Y.(2022). Regionalisation Using Water Point Availability in Nigeria. https://lins-92-isss624.netlify.app/take-home_ex02/take-home_ex02#exploratory-data-analysis

shapeName_na <- function(x){
  tm_shape(wp_ngaTrans) +
    tm_fill(col=x,
            style="pretty") +
    tm_borders(alpha=0.5) +
    tm_layout(legend.height = 0.2, 
              legend.width = 0.2)
}

eda_wpNA <- map(names(wp_ngaTrans)
                [c(3,5,7,13:15,21:24,28)], 
                shapeName_na)
tmap_arrange(eda_wpNA)

Remarks :

The LGA that does not have any water points as shown above to be removed from the data frame.

3.7.1.1 filter state without water point

wp_ngaTrim <- wp_ngaTrans %>%
  filter(if_all(
    starts_with("total_wp"),~. > 0))

3.7.1.2 visualise distribution of water status & non-functional water points

stat1 <- tm_shape (wp_ngaTrim) +
  tm_fill("pct_stat1",
          style = "jenks",
          n = 6,
          title = "Water Observed (%)") +
  tm_layout(main.title = "Water Point with Water Observed",
            main.title.position = "center",
            main.title.size = 0.9,
            main.title.fontface = "bold",
            legend.height = 0.25, 
            legend.width = 0.25,
            frame = TRUE) +
  tm_borders(alpha = 0.5)

nfunc <- tm_shape (wp_ngaTrim) +
  tm_fill("pct_nonFunctional",
          style = "jenks",
          n = 6,
          title = "Non-Functional (%)") +
  tm_layout(main.title = "Non-Functional Water Point",
            main.title.position = "center",
            main.title.size = 0.9,
            main.title.fontface = "bold",
            legend.height = 0.25, 
            legend.width = 0.25,
            frame = TRUE) +
  tm_borders(alpha = 0.5)

tmap_arrange (stat1, nfunc, ncol = 2, asp = 1)

3.7.1.3 save and read RDS file

Show the code

write_rds(wp_ngaTrim,"data/geodata/wp_ngaTrim.rds")

wp_ngaTrim <- read_rds("data/geodata/wp_ngaTrim.rds")

3.7.2 Visualise Distribution of Cluster Variable

Create a data frame with variables for clustering analysis before visualise the distribution.

3.7.2.1 create cluster variable data frame

cluster_vars <- wp_ngaTrim %>%
  st_set_geometry(NULL) %>%
  select("shapeName",
         "total_wp",
         "pct_functional", 
         "pct_nonFunctional",
         "pct_handPump",
         "pct_mechPump",
         "pct_tapStand",
         "pct_uc300",
         "pct_uc1000",
         "pct_ucN1000",
         "pct_uc250",
         "pct_urban1",
         "pct_urban0",
         "pct_cs04",
         "pct_cs10",
         "pct_stat1",
         "pct_stat0",
         "pct_ps09",
         "pct_ps19")

3.7.2.2 replace row ID with “shapeName”

row.names(cluster_vars) <- cluster_vars$shapeName
cluster_vars <- cluster_vars %>%
  select(-shapeName)

3.7.2.3 examine distribution of Cluster Variables

⇳ Usage of the code chunk below :

st_join( ) - sf - to join sf-class objects based on geometry, namely, wp_sf and bdy_nga.

Reference of the code chunk below source from Lin S.Y., 2022¹⁰.

¹⁰ Lin S.Y.(2022). Regionalisation Using Water Point Availability in Nigeria. https://lins-92-isss624.netlify.app/take-home_ex02/take-home_ex02#exploratory-data-analysis

plot_num(cluster_vars)

3.7.3 Identify Outliers

Boxplot shows the minimum, maximum, median, first quartile, third quartile and outliers, if any.

ggarrange(
  (ggplot(data=cluster_vars, aes(x=`pct_functional`)) + 
     geom_boxplot(color="black", fill="#19ff3fFF")), 
  ((ggplot(data=cluster_vars, aes(x=`pct_nonFunctional`)) +
      geom_boxplot(color="black", fill="#ff1919FF"))),
  ((ggplot(data = cluster_vars, aes(x = `pct_handPump`)) +
      geom_boxplot(color="black", fill="#FFA319FF"))),
  ((ggplot(data = cluster_vars, aes(x = `pct_mechPump`)) +
      geom_boxplot(color="black", fill="#ff8419FF"))),
  ((ggplot(data = cluster_vars, aes(x = `pct_tapStand`)) +
      geom_boxplot(color="black", fill="#ff5619FF"))),
  ((ggplot(data = cluster_vars, aes(x = `pct_urban0`)) +
      geom_boxplot(color="black", fill="#19beffFF"))),
  ((ggplot(data = cluster_vars, aes(x = `pct_uc1000`)) +
      geom_boxplot(color="black", fill="#C16622FF"))),
  ((ggplot(data = cluster_vars, aes(x = `pct_ucN1000`)) +
      geom_boxplot(color="black", fill="#543005FF"))),
   ncol = 2,
  nrow = 4)

3.7.4 Alternate EDA Approach

With references to various coursemates’ assignment.

3.7.4.1 set helper function

-- set Helper Function 1

The approach and the code chunk refers from Chua Y.T.¹¹

¹¹ Chua Y.T. (2022). 4 Exploratory Data Analysis - Helper Function 1. https://isss624-amelia.netlify.app/exercises/take-home_ex2/take-home_ex2#exploratory-data-analysis

hist_box_plot <- function(varname, title){ 
  
  func1 <- ggplot(data = cluster_vars, 
       aes(x = varname)) + 
  geom_histogram(bins = 30,
                 color = "black", 
                 fill = "steelblue") +
  theme_classic() + 
  xlab(title)
  
  func2 <- ggplot(data = cluster_vars, 
         aes(x = varname)) + 
    geom_boxplot(fill = "steelblue", 
                 color = "black") + 
    theme_classic() +
    xlab("") +
    theme(axis.text.y = element_blank(),
          axis.ticks.y = element_blank())
  
plot_grid(func1,
          func2,
          align = "v",
          ncol = 1)
}

-- set Helper Function 2

hist_plot <- function(varname, title){
  func3 <- ggplot(data = cluster_vars, 
       aes(x = varname)) + 
  geom_histogram(bins = 30,
                 color = "black", 
                 fill = "steelblue") +
  theme_classic() + 
  xlab(title)

  func3
}

-- set Helper Function 3

choropleth_plot <- function(varname, style, title) {
  tm_shape(wp_ngaTrim) +
    tm_fill(varname, 
          n = 5,
          style = style) +
    tm_borders(alpha = 0.5) +
    tm_layout(main.title = title,
              main.title.size = 0.8,
              main.title.position = "center",
              legend.height = 3, 
              legend.width = 3,
              legend.title.size = 0.8,
              legend.text.size = 0.5,
              frame = TRUE)+ 
    tm_compass(position = c('left','bottom'))
}

3.7.4.2 Univariate Distribution Analysis

-- visualise cluster variable with histogram and boxplot

hist_box_plot(cluster_vars$pct_nonFunctional, "Non-Functional Water Point")

hist_box_plot(cluster_vars$pct_stat1, "Water Point with Water Observed")

hist_box_plot(cluster_vars$pct_cs10, "Crucialness-score Greater than 0.6")

hist_box_plot(cluster_vars$pct_handPump, "Hand Pump Deployed")

-- multiple plot histograms

pct_functional <- hist_plot(cluster_vars$pct_functional, "pct_functional")
pct_nonfunctional <- hist_plot(cluster_vars$pct_nonFunctional, "pct_nonFunctional")
pct_handpump <- hist_plot(cluster_vars$pct_handPump, "pct_handPump")
pct_mechpump <- hist_plot(cluster_vars$pct_mechPump, "pct_mechPump")
pct_uc1000 <- hist_plot(cluster_vars$pct_uc1000, "pct_uc1000")
pct_ucN1000 <- hist_plot(cluster_vars$pct_ucN1000, "pct_ucN1000")
pct_urban0 <- hist_plot(cluster_vars$pct_urban0, "pct_urban0")
pct_stat1 <- hist_plot(cluster_vars$pct_stat1, "pct_stat1")
pct_cs10 <-hist_plot(cluster_vars$pct_cs10, "pct_cs10")
pct_ps19 <-hist_plot(cluster_vars$pct_ps19, "pct_ps19")

pct_functional + pct_nonfunctional +
  pct_handpump + pct_mechpump +
  pct_uc1000 + pct_ucN1000 +
  pct_urban0 + pct_stat1 +
  pct_cs10 + pct_ps19 +
  plot_layout(ncol = 2)

-- multiple plot geospatial EDA

Show the code

tmap_arrange(choropleth_plot("wp_functional", "quantile", 
                "Functional Water Point"),
             choropleth_plot("wp_nonFunctional", "quantile", 
                "Non-functional Water Point"), 
             choropleth_plot("pct_functional", "quantile", 
                "Pct of functional water point"),
             choropleth_plot("pct_nonFunctional", "quantile", 
                "Pct of Non-functional water point"),
             choropleth_plot("pct_handPump", "quantile", 
                "Pct of Hand Pump Deployed"),
             choropleth_plot("pct_mechPump", "quantile", 
                "Pct of Mechanical Pump Deployed"),
             choropleth_plot("pct_urban0", "quantile", 
                "Pct of Water Point in Non-Urban Community"),
             choropleth_plot("pct_cs10", "quantile", 
                "Pct of Water Points with Crucialness > 0.6"),
             choropleth_plot("pct_ps19", "quantile", 
                "Pct of Water Points with Reaching Usage Limit"),
             ncol = 2,
             heights = 5,
             nrow = 5)

4. CORRELATION ANALYSIS

4.1 Multivariate Analysis

4.1.1 Visualise Correlation Matrix

⇳ Usage of the code chunk below :

corrplot.mixed( ) - corrplot - to use mixed methods to visualise a correlation matrix.

This plot allows to identify the pattern and the relationship in the matrix.

corrplot.mixed((cor(cluster_vars)),
               upper = "number",
               lower = "ellipse",
               tl.col = "black",
               diag = "l",
               tl.pos = "lt")

Remarks :

Following are the pairs with strong correlation :

correlation coefficients	variable_1	variable_2
1.00	pct_uc1000	pct_ucN1000
1.00	pct_mechPump	pct_uc1000
-1.00	pct_mechPump	pct_ucN1000
-1.00	pct_uc1000	pct_ucN1000
-1.00	pct_cs04	pct_cs10
-1.00	pct_urban1	pct_urban0
-1.00	pct_ps09	pct_ps19
0.99	pct_tapStand	pct_uc250
0.99	pct_uc300	pct_ucN1000
-0.99	pct_mechPump	pct_uc300
-0.99	pct_uc300	pct_uc1000
0.96	pct_functional	pct_stat1
0.96	pct_nonFunctional	pct_stat0
0.95	pct_cs04	pct_ps09
0.95	pct_cs10	pct_ps19
-0.95	pct_cs04	pct_ps19
-0.95	pct_cs10	pct_ps19

Since only “usage_capacity” and “water_tech_category” are showing multicollinearity, hence the study will be split into 2 models with water point status by -

1) “non-functional water point”

2) “functional”

4.2 Refine Model :: cluster_varsNF

4.2.1 Remove Variable

cluster_varsNF <- cluster_vars %>%
  select(-pct_tapStand, -pct_uc300, -pct_uc250, -pct_urban1, -pct_cs04,-pct_stat0, -pct_ps09)

corrplot.mixed((cor(cluster_varsNF)),
               upper = "number",
               lower = "ellipse",
               tl.col = "black",
               diag = "l",
               tl.pos = "lt")

Remarks :

“pct_handPump” and “pct_mechPump” are negatively correlated at -0.82.

Since hand pump is the instructed main water point technology in the scope of work, “pct_mechPump” will be removed from the model.

cluster_varsNF1 <- cluster_varsNF %>%
  select(-pct_functional, -pct_mechPump, -pct_uc1000, -pct_ucN1000, -pct_ps19)

4.2.2 Build Regression Model

model_test <- lm(total_wp ~ pct_nonFunctional +
         pct_handPump +
         pct_urban0 +
         pct_cs10 +
         pct_stat1,
         data = cluster_varsNF1)

summary(model_test)


Call:
lm(formula = total_wp ~ pct_nonFunctional + pct_handPump + pct_urban0 + 
    pct_cs10 + pct_stat1, data = cluster_varsNF1)

Residuals:
    Min      1Q  Median      3Q     Max 
-271.52  -51.51  -11.50   31.22  681.72 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       199.1869    18.9601  10.506  < 2e-16 ***
pct_nonFunctional  -1.0952     0.2089  -5.241 2.07e-07 ***
pct_handPump        1.0160     0.1300   7.814 1.86e-14 ***
pct_urban0          0.7333     0.1152   6.363 3.42e-10 ***
pct_cs10           -1.8779     0.1665 -11.275  < 2e-16 ***
pct_stat1          -1.4837     0.2200  -6.744 3.06e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 89.64 on 755 degrees of freedom
Multiple R-squared:  0.3289,    Adjusted R-squared:  0.3245 
F-statistic: 74.02 on 5 and 755 DF,  p-value: < 2.2e-16

4.2.3 Detect Multicollinearity

car::vif(model_test)

pct_nonFunctional      pct_handPump        pct_urban0          pct_cs10 
         1.764617          1.670228          1.245002          1.643799 
        pct_stat1 
         2.447319

Remarks :

Alert

When encountering the following error :

” Error in vif.default(model_test) : there are aliased coefficients in the mode “

It means two or more predictor variables in the model are perfectly correlated.

Note

Variables with VIF threshold values between 5 to 10 may need to be cautious, but VIF greater than 10 can be problematic to the model performance due to serious collinearity problem.¹²

¹² Chouiry G. (2022). What is an Acceptable Value for VIF? (With References). https://quantifyinghealth.com/vif-threshold/

“A VIF value of 10 means that the tolerance of the relevant independent variable is 0.10 and that 90% (𝐑𝟐) of the variable is explained by other variables. This is undesirable, as it indicates that the relevant independent variable is unnecessarily included in the model.”¹³

¹³ ResearchGate. Multicollinearity issues: is a value less than 10 acceptable for VIF? https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF

When a study involves a large sample size, the VIF threshold value can be set up to 10.

Based on the VIF report above, all value is below 5. Hence, there is no multicollinearity in the model.

5. DATA STANDARDISATION

5.1 Remove Variable

cluster_varsNF1 <- cluster_varsNF1 %>%
  select(-total_wp)

5.2 Standardise Data

As shown in 3.7.3.3, not all variables are not distributed normally. Hence, standardisation will be required before the clustering analysis.

5.2.1 Standardise :: Min-Max Method

wp_stdMM <- normalize(cluster_varsNF1)
describe(wp_stdMM)

wp_stdMM 

 5  Variables      761  Observations
--------------------------------------------------------------------------------
pct_nonFunctional 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     761        0      618        1   0.3654   0.2354  0.03012  0.08333 
     .25      .50      .75      .90      .95 
 0.22111  0.35593  0.50820  0.64444  0.72727 

lowest : 0.000000000 0.004065041 0.009009009 0.009708738 0.012987013
highest: 0.852941176 0.857142857 0.864864865 0.878787879 1.000000000
--------------------------------------------------------------------------------
pct_handPump 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     761        0      619        1   0.4956   0.3721  0.00000  0.03125 
     .25      .50      .75      .90      .95 
 0.18605  0.52555  0.78571  0.92593  0.96000 

lowest : 0.000000000 0.009009009 0.009523810 0.012987013 0.013333333
highest: 0.989898990 0.991011236 0.994555354 0.994604317 1.000000000
--------------------------------------------------------------------------------
pct_urban0 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     761        0      447    0.975   0.7395   0.3267   0.0000   0.1818 
     .25      .50      .75      .90      .95 
  0.5922   0.8717   1.0000   1.0000   1.0000 

lowest : 0.000000000 0.008298755 0.009345794 0.010309278 0.013888889
highest: 0.992957746 0.993730408 0.993975904 0.996138996 1.000000000
--------------------------------------------------------------------------------
pct_cs10 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     761        0      612        1   0.3318   0.2774  0.02462  0.05091 
     .25      .50      .75      .90      .95 
 0.13542  0.27027  0.47059  0.69841  0.82143 

lowest : 0.000000000 0.001798561 0.003184713 0.006944444 0.008064516
highest: 0.944444444 0.947368421 0.964285714 0.969696970 1.000000000
--------------------------------------------------------------------------------
pct_stat1 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     761        0      623        1   0.5165   0.2642   0.1724   0.2267 
     .25      .50      .75      .90      .95 
  0.3391   0.4828   0.6866   0.8667   0.9206 

lowest : 0.00000000 0.03030303 0.05555556 0.06666667 0.07692308
highest: 0.98701299 0.99029126 0.99099099 0.99593496 1.00000000
--------------------------------------------------------------------------------

5.2.2 Standardise :: Z-score Method

wp_stdZ <- scale(cluster_varsNF1)
describe(wp_stdZ)

wp_stdZ 

 5  Variables      761  Observations
--------------------------------------------------------------------------------
pct_nonFunctional 
         n    missing   distinct       Info       Mean        Gmd        .05 
       761          0        618          1 -1.201e-18      1.139   -1.62178 
       .10        .25        .50        .75        .90        .95 
  -1.36438   -0.69793   -0.04573    0.69082    1.34989    1.75056 

lowest : -1.767485 -1.747821 -1.723906 -1.720521 -1.704663
highest:  2.358458  2.378783  2.416136  2.483486  3.069827
--------------------------------------------------------------------------------
pct_handPump 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      761         0       619         1 3.424e-18     1.151   -1.5333   -1.4366 
      .25       .50       .75       .90       .95 
  -0.9577    0.0927    0.8977    1.3315    1.4369 

lowest : -1.533321 -1.505447 -1.503854 -1.493139 -1.492068
highest:  1.529392  1.532834  1.543799  1.543951  1.560645
--------------------------------------------------------------------------------
pct_urban0 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      761         0       447     0.975 6.197e-17     1.038   -2.3488   -1.7713 
      .25       .50       .75       .90       .95 
  -0.4677    0.4200    0.8274    0.8274    0.8274 

lowest : -2.3488119 -2.3224529 -2.3191273 -2.3160670 -2.3046972
highest:  0.8050784  0.8075325  0.8083123  0.8151828  0.8274464
--------------------------------------------------------------------------------
pct_cs10 
         n    missing   distinct       Info       Mean        Gmd        .05 
       761          0        612          1 -9.464e-17      1.108    -1.2273 
       .10        .25        .50        .75        .90        .95 
   -1.1223    -0.7847    -0.2459     0.5543     1.4645     1.9559 

lowest : -1.325627 -1.318442 -1.312904 -1.297885 -1.293410
highest:  2.447339  2.459020  2.526603  2.548221  2.669279
--------------------------------------------------------------------------------
pct_stat1 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      761         0       623         1 9.351e-17     1.143   -1.4879   -1.2529 
      .25       .50       .75       .90       .95 
  -0.7672   -0.1457    0.7357    1.5146    1.7480 

lowest : -2.233545 -2.102491 -1.993279 -1.945226 -1.900869
highest:  2.035083  2.049261  2.052287  2.073668  2.091249
--------------------------------------------------------------------------------

Remarks :

Comparing the reports above, the Min-Max method is the only method that can standardise the value to between 0 and 1.

5.2.3 Compare Distribution For Standardisation Method

Visualise to determine which standardisation method provide the better output.

ggarrange(
  (ggplot(data = cluster_varsNF1, aes(x = `pct_stat1`)) +
    geom_density(color = "black", fill = "#19ff3fFF") + 
    ggtitle("Before Standardisation")),
  (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_stat1`)) +
      geom_density(color = "black", fill = "#19ff3fFF") +
      ggtitle("Min-Max Stdsn.")),
  (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_stat1`)) +
     geom_density(color = "black", fill="#19ff3fFF") +
     ggtitle("Z-score Stdsn.")),
   ncol = 3)

ggarrange(
  (ggplot(data = cluster_varsNF1, aes(x = `pct_nonFunctional`)) +
    geom_density(color = "black", fill = "#ff1919FF") + 
    ggtitle("Before Standardisation")),
  (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_nonFunctional`)) +
      geom_density(color = "black", fill = "#ff1919FF") +
      ggtitle("Min-Max Stdsn.")),
  (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_nonFunctional`)) +
     geom_density(color = "black", fill="#ff1919FF") +
     ggtitle("Z-score Stdsn.")),
   ncol = 3)

ggarrange(
  (ggplot(data = cluster_varsNF1, aes(x = `pct_handPump`)) +
    geom_density(color = "black", fill = "#FFA319FF") + 
    ggtitle("Before Standardisation")),
  (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_handPump`)) +
      geom_density(color = "black", fill = "#FFA319FF") +
      ggtitle("Min-Max Stdsn.")),
  (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_handPump`)) +
     geom_density(color = "black", fill="#FFA319FF") +
     ggtitle("Z-score Stdsn.")),
   ncol = 3)

ggarrange(
  (ggplot(data = cluster_varsNF1, aes(x = `pct_cs10`)) +
    geom_density(color = "black", fill = "#ff5619FF") + 
    ggtitle("Before Standardisation")),
  (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_cs10`)) +
      geom_density(color = "black", fill = "#ff5619FF") +
      ggtitle("Min-Max Stdsn.")),
  (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_cs10`)) +
     geom_density(color = "black", fill="#ff5619FF") +
     ggtitle("Z-score Stdsn.")),
   ncol = 3)

ggarrange(
  (ggplot(data = cluster_varsNF1, aes(x = `pct_urban0`)) +
    geom_density(color = "black", fill = "#19beffFF") + 
    ggtitle("Before Standardisation")),
  (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_urban0`)) +
      geom_density(color = "black", fill = "#19beffFF") +
      ggtitle("Min-Max Stdsn.")),
  (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_urban0`)) +
     geom_density(color = "black", fill="#19beffFF") +
     ggtitle("Z-score Stdsn.")),
   ncol = 3)

6. CLUSTERING ANALYSIS

6.1 Hierarchical Clustering

There are four (4) main steps :

compute proximity matrix and determine clustering algorithm.
compute Hierarchical Clustering.
identify optimal number of cluster and merge similar clusters.
update the distance matrix.

6.1.1 Compute Proximity Matrix and Clustering Algorithm

First determine the clustering algorithm before compute Hierarchical Clustering analysis.

6.1.1.1 determine Hierarchical Clustering algorithm

⇳ Usage of the code chunk below :

agnes( ) - cluster - to get agglomerative coefficient of 4 clustering structure, namely “average”, “single”, “complete” and “ward”.

m <- c("average","single","complete","ward")

names(m) <- c("average","single","complete","ward")

ac <- function(x){agnes(wp_stdMM, method = x)$ac}

map_dbl(m, ac)

  average    single  complete      ward 
0.9077095 0.8277742 0.9414055 0.9892021

Remarks :

Value 1 indicates the strongest clustering structure.
With value of 0.9892, Ward’s method provides the strongest clustering structure. Therefore, Ward’s method will be used in subsequent analysis.

6.1.1.2 compute proximity matrix

⇳ Usage of the code chunk below :

dist( ) - stats - to compute the proximity distance matrix. Among euclidean, maximum, manhattan, canberra, binary and minkowski, euclidean is used to compute proxmat_euc.

prox_mat_euc <- dist(wp_stdMM, 
                     method = 'euclidean')

6.1.2 Compute Hierarchical Clustering

⇳ Usage of the code chunk below :

hclust( ) - stats - to compute cluster with agglomeration method.

ggdendrogram( ) - ggdendro - to plot dendrogram with tools available in ggplot2.

hclust_ward <- hclust(prox_mat_euc, 
                      method = 'ward.D')
ggdendrogram(hclust_ward,
             rotate = TRUE,
             cex = 0.1,
             theme_dendro = FALSE)

Remarks :

This is in spite of adjusting the cex = parameter that scales the resolution of the dendogram to 10%.¹⁴

¹⁴ Chua Y.T. (2022). Take-home Exercise 2 - Regionalisation of Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods. https://isss624-amelia.netlify.app/exercises/take-home_ex2/take-home_ex2#computing-hierarchical-clustering

6.1.3 Identify Number of Optimal Cluster

To determine the optimal clusters to retain, the following commons methods are to be tested :

Gap statistic
Elbow
Average Silhouette

6.1.3.1 compute Gap Statistic method

⇳ Usage of the code chunk below :

clusGap( ) - cluster - to compute the gap statistic.

set.seed(12345)

gap_stat <- clusGap(wp_stdMM, 
                    FUN = hcut, 
                    nstart = 25, 
                    K.max = 7, 
                    B = 50)

print(gap_stat, method = "firstmax")

Clustering Gap statistic ["clusGap"] from call:
clusGap(x = wp_stdMM, FUNcluster = hcut, K.max = 7, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..7; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstmax'): 4
         logW   E.logW       gap      SE.sim
[1,] 5.007896 5.556869 0.5489733 0.007508419
[2,] 4.832041 5.428820 0.5967789 0.013537767
[3,] 4.698841 5.360628 0.6617876 0.012344996
[4,] 4.591572 5.301564 0.7099919 0.012258503
[5,] 4.552221 5.254345 0.7021242 0.011632129
[6,] 4.514782 5.213852 0.6990704 0.011133338
[7,] 4.475110 5.178753 0.7036430 0.010996961

Remarks :

The number of clusters recommended by “firstmax” approach of the Gap Statistic method is 4.

6.1.3.2 visualise gap_stat

⇳ Usage of the code chunk below :

fviz_nbclust( ) - factoextra - to compute and visualise the Optimal Number of clusters.

set.seed(12345)

fviz_nbclust(wp_stdMM,
             FUNcluster = hcut,
             nstart = 25,  
             method = "gap_stat", 
             nboot = 50,
             linecolor = "white")+
  theme_dark() +
  labs(subtitle = "Gap statistic method")

6.1.3.3 compute and visualise Total Within Sum of Squares (Elbow) method

fviz_nbclust(wp_stdMM, 
             kmeans, 
             method = "wss",
             linecolor = "white")+
  theme_dark() +
  labs(subtitle = "Elbow method")

6.1.3.4 compute and visualise Silhouette method

fviz_nbclust(wp_stdMM, 
             kmeans, 
             method = "silhouette",
             linecolor = "white") +
  theme_dark() +
  labs(subtitle = "Silhouette method")

Remarks :

Method	Gap stat	Elbow	Silhouette
Optimal Value for K	4	3	3

Both Elbow method, Silhouette method indicated the 3-cluster by Silhouette method will be used for the rest of the study.

6.1.4 Merge Similar Clusters

⇳ Usage of the code chunk below :

rect.hclust( ) - stats - to draw a dendrogram with a borders around the selected clusters.

plot(hclust_ward, cex = 0.5)
rect.hclust(hclust_ward, 
            k = 3, 
            border = 2:5)

:::

6.1.5 Visually-Driven Hierarchical Clustering Analysis

::: {.callout-warning appearance=“simple” icon=“false”} The data is loaded into a data frame, but it has to be a data matrix to plot the heatmap. Hence, the data frame will need to first transform into a matrix.

6.1.5.1 transform data frame into matrix

⇳ Usage of the code chunk below :

data.matrix( ) - base - to transform wp_stdMM data frame into a data matrix, and named it as wp_stdMM_mat.

wp_stdMM_mat <- data.matrix(wp_stdMM)

6.1.5.2 plot interactive cluster heatmap

⇳ Usage of the code chunk below :

heatmaply( ) - heatmaply - to build an interactive cluster heatmap. The variables are already transformed with Min-Max method to have comparable values. Hence, neither scaling, normalise or percentise is required.

heatmaply(wp_stdMM_mat,
          Colv = NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 3,
          margins = c(NA,200,60,NA),
          fontsize_row = 1,
          fontsize_col = 8,
          main="Cluster Heatmap of Non-Functional Water Points",
          xlab = "Attribute",
          ylab = "Nigeria LGA",
          dend_hoverinfo = TRUE
          )

Remarks :

Following are the interpretation of the heatmap above :

Among these clusters, Cluster 1 has the highest percentage of water points that failed to supply to more than 60% of the residents within 1 km therefrom.
Most non-functional and non-urban water points are deployed with hand pumps in Cluster 3.
As indicated by the percentage of stat1, water was observed at the majority of the water points in Cluster 3.

6.1.5.3 create 3 clusters model

⇳ Usage of the code chunk below :

cutree( ) - base - to derive a 3-cluster model, and named the output as groups.

groups <- as.factor(cutree(hclust_ward, k = 3))

6.1.5.4 append groups to wp_ngaTrim

nga_clust.sf <- cbind(wp_ngaTrim, as.matrix(groups)) %>%
  rename(`cluster`=`as.matrix.groups.`)

6.1.5.5 plot choropleth map :: nga_clust.sf

clusGeo.map <- tm_shape(nga_clust.sf) +
  tm_fill(col = "cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Non-Spatial Hierarchical Clustering",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

clusGeo.map

Remarks :

The choropleth map above shows the fragmented clusters by the used of non-spatial clustering algorithm (hierarchical cluster analysis method).

6.2 Spatially Constrained Clustering :: SKATER Approach

SKATER function only supports sp objects in SpatialPolygonDataFrame.

Hence, the wp_ngaTrim has to first transform into SpatialPolygonDataFrame before proceeding further.

6.2.1 Convert SF to SP Data Frame

⇳ Usage of the code chunk below :

as_Spatial( ) - sf - to convert wp_ngaTrim into nga_sp in a SP data frame.

nga.sp <- as_Spatial(wp_ngaTrim)

6.2.2 Compute Neighbour List

First compute the neighbour list before plot it.

6.2.2.1 compute neighbour list from polygon list

⇳ Usage of the code chunk below :

poly2nb( ) - spdep - to compute the neighbours list from polygon list.

nga.nb <- poly2nb(nga.sp, queen = TRUE)

summary(nga.nb)

Neighbour list object:
Number of regions: 761 
Number of nonzero links: 4348 
Percentage nonzero weights: 0.750793 
Average number of links: 5.713535 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  4  16  57 122 177 137 121  71  39  11   4   1   1 
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links

Remarks :

There is no LGA without a link.

6.2.2.2 plot Neighbour List by Centroid Node

⇳ Usage of the code chunk below :

Plot the boundary first before the neighbour list object to avoid any region from being clipped away.

plot(nga.sp, border = grey(.5))
plot(nga.nb, coordinates(nga.sp),
     cex.lab = .7,
     col = "blue", 
     add = TRUE)

6.2.3 Compute Minimum Spanning Tree (MST)

To find a minimum path connecting all nodes in a graph, a minimum spanning tree with a minimum weight than all other spanning trees is to be used in subsequent steps.

6.2.3.1 calculate edge costs

⇳ Usage of the code chunk below :

nbcosts( ) - spdep - to compute the cost of each edge which is the distance between nodes.

edge_cost <- nbcosts(nga.nb, wp_stdMM)

6.2.3.2 specify spatial weight

⇳ Usage of the code chunk below :

nb2listw( ) - spdep - to specify edge_cost as the spatial weights. Set the “style” to “B” to ensure the cost values are not row-standardised.

nga.w <- nb2listw(nga.nb,
                  edge_cost,
                  style = "B")
summary(nga.w)

Characteristics of weights list object:
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 4348 
Percentage nonzero weights: 0.750793 
Average number of links: 5.713535 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  4  16  57 122 177 137 121  71  39  11   4   1   1 
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links

Weights style: B 
Weights constants summary:
    n     nn       S0       S1       S2
B 761 579121 2116.963 2727.142 29001.78

6.2.3.3 compute MST

⇳ Usage of the code chunk below :

nbcosts( ) - spdep - to compute the minimum spanning tree.

nga_minSpanT <- mstree(nga.w)

6.2.3.4 review class and dimension of the computed MST

class(nga_minSpanT)

[1] "mst"    "matrix"

dim(nga_minSpanT)

[1] 760   3

head(nga_minSpanT)

     [,1] [,2]      [,3]
[1,]   56  197 0.2311487
[2,]  197  306 0.1954847
[3,]  306  195 0.3444925
[4,]  197  274 0.3611552
[5,]  274  302 0.3559411
[6,]  302  548 0.1901367

6.2.3.5 plot MST Neighbour List

plot(nga.sp, border = gray(.5))

plot.mst(nga_minSpanT,
         coordinates(nga.sp), 
         col = "blue", 
         cex.lab = 0.7, 
         cex.circles = 0.005, 
         add = TRUE)

6.2.4 Compute Spatially Constrained Cluster

⇳ Usage of the code chunk below :

skater( ) - spdep - to compute the spatially constrained cluster.

note: ncuts = number of clusters - 1

clust3 <- spdep::skater(edges = nga_minSpanT[,1:2],
                        data = cluster_varsNF1,
                        method = "euclidean",
                        ncuts = 2)
str(clust3)

List of 8
 $ groups      : num [1:761] 3 3 1 3 1 1 1 1 3 1 ...
 $ edges.groups:List of 3
  ..$ :List of 3
  .. ..$ node: num [1:402] 324 556 745 5 215 564 13 576 613 262 ...
  .. ..$ edge: num [1:401, 1:3] 496 631 237 36 209 740 636 302 667 363 ...
  .. ..$ ssw : num 21313
  ..$ :List of 3
  .. ..$ node: num [1:237] 479 480 639 651 761 735 760 482 112 649 ...
  .. ..$ edge: num [1:236, 1:3] 635 86 479 448 760 112 148 649 479 406 ...
  .. ..$ ssw : num 9770
  ..$ :List of 3
  .. ..$ node: num [1:122] 102 33 546 201 538 282 325 359 9 214 ...
  .. ..$ edge: num [1:121, 1:3] 538 201 359 9 214 572 539 325 308 282 ...
  .. ..$ ssw : num 5746
 $ not.prune   : NULL
 $ candidates  : int [1:3] 1 2 3
 $ ssto        : num 42921
 $ ssw         : num [1:3] 42921 38492 36829
 $ crit        : num [1:2] 1 Inf
 $ vec.crit    : num [1:761] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "class")= chr "skater"

6.2.4.1 tabulate cluster assignment

ccs3 <- clust3$groups
table(ccs3)

ccs3
  1   2   3 
402 237 122

6.2.4.2 plot the pruned tree

plot(nga.sp, border = gray(.5))
plot(clust3, 
     coordinates(nga.sp), 
     cex.lab = .7,
     groups.colors = c("red","green","blue", "brown"),
     cex.circles = 0.005, 
     add = TRUE)

6.2.5 Visualise SKATER Clusters in Choropleth Map

groups_mat <- as.matrix(clust3$groups)

nga_spClust.sf <- cbind(nga_clust.sf, 
                        as.factor(groups_mat)) %>%
  rename(`sp_cluster`=`as.factor.groups_mat.`)

To compare the output of hierarchical clustering and spatially constrained hierarchical clustering :

clusGeo_SKAT.map <- tm_shape(nga_spClust.sf) +
  tm_fill(col = "sp_cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Spatially Constrained SKATER Method",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

clusGeo_SKAT.map

6.3 ClustGeo Method

This section consists of two (2) parts i.e. spatially and non-spatially Constrained Cluster analysis.

6.3.1 Non-Spatially Constrained Hierarchical Cluster Analysis

Dissimilarity matrix must be an object of class dist.

6.3.1.1 create Class `dist` Object

proxmat_ngc <- dist(wp_stdMM, method = 'euclidean')

6.3.1.2 compute Non-Spatially Constrained Hierarchical Clustering

⇳ Usage of the code chunk below :

hclustgeo( ) - ClustGeo - to perform a typical Ward-like hierarchical clustering.

nonGeo_clust <- hclustgeo(proxmat_ngc)

plot(nonGeo_clust, 
     cex = 0.5)

rect.hclust(nonGeo_clust, 
            k = 3, 
            border = 1:5)

6.3.1.3 derive 5-cluster model

groups_ngc <- as.factor(cutree(nonGeo_clust, 
                               k = 3))

6.3.1.4 combine groups_ngc with wp_ngaTrim

nga_ngeo_clust.sf <- cbind(
  wp_ngaTrim, 
  as.matrix(groups_ngc)) %>%
  rename(`cluster` = `as.matrix.groups_ngc.`)

6.3.1.5 visualise Non-Spatially Constrained Hierarchical Cluster

clusGeo_nSp.map <- tm_shape(nga_ngeo_clust.sf) +
  tm_fill(col = "cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Non-Spatially Constrained ClustGeo Method",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

clusGeo_nSp.map

6.3.2 Spatially Constrained Hierarchical Cluster Analysis

6.3.2.1 determine Spatial Distance Matrix

⇳ Usage of the code chunk below :

st_distance( ) - sf - to derive the spatial distance matrix before perform spatially constrained hierarchical clustering.

as.dist( ) - stats - to convert the data frame into matrix.

dist <- st_distance(wp_ngaTrim, wp_ngaTrim)
dist_mat <- as.dist(dist)

6.3.2.2 determine Alpha value

⇳ Usage of the code chunk below :

choicealpha( ) - psych - to determine a suitable value for the mixing parameter alpha.

cr <- choicealpha(
  proxmat_ngc, 
  dist_mat, 
  range.alpha = seq(0, 1, 0.1), 
  K = 3, 
  graph = TRUE)

Remarks :

With reference to the plot above, multiple alpha values to be used to perform spatially constrained hierarchical clustering.

6.3.2.3 compute Spatially Constrained Hierarchical Clustering

clustG_0.3 <- hclustgeo(proxmat_ngc, 
                    dist_mat, 
                    alpha = 0.3)

clustG_0.35 <- hclustgeo(proxmat_ngc, 
                    dist_mat, 
                    alpha = 0.35)

clustG_0.4 <- hclustgeo(proxmat_ngc, 
                    dist_mat, 
                    alpha = 0.4)

6.3.2.4 derive “cluster” Object

groups_cg_0.3 <- as.factor(cutree(clustG_0.3, k = 3))

groups_cg_0.35 <- as.factor(cutree(clustG_0.35, k = 3))

groups_cg_0.4 <- as.factor(cutree(clustG_0.4, k = 3))

6.3.2.5 combine groups_cg with wp_ngaTrim

wp_nga_clustG_0.3 <- cbind(wp_ngaTrim, as.matrix(groups_cg_0.3)) %>%
  rename(`cluster` = `as.matrix.groups_cg_0.3.`)

wp_nga_clustG_0.35 <- cbind(wp_ngaTrim, as.matrix(groups_cg_0.35)) %>%
  rename(`cluster` = `as.matrix.groups_cg_0.35.`)

wp_nga_clustG_0.4 <- cbind(wp_ngaTrim, as.matrix(groups_cg_0.4)) %>%
  rename(`cluster` = `as.matrix.groups_cg_0.4.`)

6.3.2.6 Visualise Spatially Constrained Hierarchical Clustering

clusGeo_sp.map0.3 <- tm_shape(wp_nga_clustG_0.3) +
  tm_fill(col = "cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Spatially Constrained ClustGeo Method",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

clusGeo_sp.map0.3

clusGeo_sp.map0.35 <- tm_shape(wp_nga_clustG_0.35) +
  tm_fill(col = "cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Spatially Constrained ClustGeo Method",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

clusGeo_sp.map0.35

clusGeo_sp.map0.4 <- tm_shape(wp_nga_clustG_0.4) +
  tm_fill(col = "cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Spatially Constrained ClustGeo Method",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

clusGeo_sp.map0.4

Remarks :

Three threshold values were tested, i.e. 3 (crossing point in the first alpha value plot), 3.5 (crossing point in the second alpha value plot) and 4 (noticeable elbow in the alpha value plots).

When the alpha value is set to :

0.3, Cluster 1 scatter around the southwestern region while Cluster 2 has the highest LGA coverage compared to alpha values of 3.5 or 4. Cluster 1 scatter around South-western region.
0.35, Cluster 3 covers more LGA than Cluster 2 with minimal changes to Cluster 1.
0.4, Cluster 1 has the highest LGA coverage compared to alpha values of 3.5 or 4. Cluster 2 expand LGA coverage north-eastward.

6.4 Spatially Constrained Clustering :: RedCap Method

6.4.1 Derive Queen Contiguity Spatial Weights

⇳ Usage of the code chunk below :

queen_weights( ) - rgeoda - to create a Queen contiguity weights.

nga_queenW <- queen_weights(wp_ngaTrim)

nga_queenW

Reference class object of class "Weight"
Field "gda_w":
An object of class "p_GeoDaWeight"
Slot "pointer":
<pointer: 0x000001a130ab7e70>

Field "is_symmetric":
[1] TRUE
Field "sparsity":
[1] 0.00750793
Field "min_neighbors":
[1] 1
Field "max_neighbors":
[1] 14
Field "num_obs":
[1] 761
Field "mean_neighbors":
[1] 5.713535
Field "median_neighbors":
[1] 6
Field "has_isolates":
[1] FALSE

6.4.2 Compute Spatially Constrained Cluster

⇳ Usage of the code chunk below :

redcap( ) - rgeoda - to compute spatially constrained cluster based on three (3) mandatory arguments:-

k, the number of clusters to form
w, an instance of Weight class
df, data frame with cluster variables

set.seed(12345)

clust_rCap <- redcap(3, 
                     nga_queenW, 
                     wp_stdMM, 
       method = "fullorder-singlelinkage",
       scale_method = 'raw',
       distance_method = "euclidean",
       random_seed = 12345)

str(clust_rCap)

List of 5
 $ Clusters                                    : int [1:761] 3 3 1 3 1 2 2 1 3 2 ...
 $ Total sum of squares                        : num 3800
 $ Within-cluster sum of squares               : num [1:5] 731 412 692 647 613
 $ Total within-cluster sum of squares         : num 706
 $ The ratio of between to total sum of squares: num 0.186

6.4.3 Reveal Cluster Assignment

redCap_cluster <- clust_rCap$Cluster

table(redCap_cluster)

redCap_cluster
  1   2   3 
443 219  99

6.4.4 Visualise Spatially Constrained Cluster

redCap_mat <- as.matrix(redCap_cluster)

wp_nga_redCap <- cbind(wp_ngaTrim, as.factor(redCap_mat)) %>%
  rename(`cluster`=`as.factor.redCap_mat.`)

redCap_sp.map <- tm_shape(wp_nga_redCap) +
  tm_fill(col = "cluster",
          title = "Cluster") +
  tm_borders(alpha = 0.3) +
  tm_style("cobalt") +
  tm_layout(main.title = "Spatially Constrained RedCap Method",
            main.title.size = 1.1,
            main.title.position = "center",
            legend.height = 0.3,
            legend.width = 0.1, 
            legend.title.size = 1,
            legend.text.size = 1,
            frame = TRUE,
            asp = 0)

redCap_sp.map

7. VISUAL INTERPRETATION

7.1 Visualise Individual Clustering Variable

7.1.1 Plot Boxplot

ggplot(data = nga_ngeo_clust.sf,
       aes(x = cluster, y = pct_nonFunctional)) +
  geom_boxplot()

Remarks :

The boxplot reveals Cluster 3 displays the highest mean of non-functional water points. This is followed by Cluster 2.

7.1.2 Visualise Multivariate

Create a separate data frame to ensure key variables are included.

7.1.2.1 prepare data frame

nga_ngeo_clust.sf1 <- nga_ngeo_clust.sf %>%
  select("shapeName",
         "pct_nonFunctional",
         "pct_handPump", 
         "pct_urban0",
         "pct_cs10",
         "pct_stat1",
         "cluster")
         
head(nga_ngeo_clust.sf1,3)

Simple feature collection with 3 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 6.778522 ymin: 5.052192 xmax: 7.402708 ymax: 9.232154
Projected CRS: Minna / Nigeria West Belt
  shapeName pct_nonFunctional pct_handPump pct_urban0 pct_cs10 pct_stat1
1 Aba North          52.94118    11.764706   0.000000 41.17647  41.17647
2 Aba South          49.29577     9.859155   5.633803 26.76056  49.29577
3     Abaji          59.64912    40.350877  84.210526 42.10526  40.35088
  cluster                       geometry
1       1 MULTIPOLYGON (((7.401109 5....
2       1 MULTIPOLYGON (((7.334479 5....
3       2 MULTIPOLYGON (((7.045872 9....

7.1.2.2 plot parallel coordinate plot

⇳ Usage of the code chunk below :

ggparcoord( ) - GGally - to plot static parallel coordinate plots to reveal distribution of variables by cluster.

scale is a character string with the following options -
- std : univariately, subtract mean & divide by standard deviation.
- robust : univariately, subtract median & divide by median absolute deviation.
- uniminmax : univariately, scale the minimum to 0, the maximum to 1.
- globalminmax : no scaling; the range of the graphs is defined by the global minimum and the global maximum.
- center : use uniminmax to standardize vertical height, then center each variable at a value specified by the scaleSummary param.
- centerObs : use uniminmax to standardize vertical height, then center each variable at the value of the observation specified by the centerObsID param

ggparcoord(data = nga_ngeo_clust.sf1,
           columns = c(2:6),
           mapping = aes(
             color = as.factor(`cluster`)),
           scale = "uniminmax",
           alphaLines = 0.4,
           boxplot = TRUE,
           groupColumn = "cluster") +
  scale_color_manual("cluster", 
                     values = c("maroon", "steelblue", "darkgreen"),
                     labels = levels(nga_ngeo_clust.sf1$cluster)) +
  labs(title = "Visual Clustering for ClustGeo Method",
       subtitle = "Multiple Parallel Coordinates Plot",
       xlab = "Cluster Variables") +
  facet_grid(~ cluster,) +
  theme(axis.text.x = element_text(angle = 90),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "bottom",
        text = element_text(
          size = 12)) +
  facet_wrap(~ cluster)

Remarks :

Based on the parallel coordinate plot above, insights for the stakeholder or decision makers from :

the Federal Ministry of Agriculture & Rural Development (FMARD) -
- Cluster 2 has the highest mean of non-functional water points, followed by Cluster 3, which is slightly higher than Cluster 1.
- Cluster 3, on the other hand, has the highest mean of non-urban communities and hand pump deployment. It also has the highest mean of non-functional water points with water observed during data collection. That means other causes caused these water points not-functioning accordingly instead of due to the absence of water.

7.2 Compare Clustering Method Visually

tmap_arrange(clusGeo.map, clusGeo_SKAT.map, 
             clusGeo_nSp.map, clusGeo_sp.map0.35,
             asp = 0,
             ncol = 2)

Legend labels were too wide. The labels have been resized to 0.02, 0.02, 0.02. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
Legend labels were too wide. The labels have been resized to 0.02, 0.02, 0.02. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
Legend labels were too wide. The labels have been resized to 0.02, 0.02, 0.02. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
Legend labels were too wide. The labels have been resized to 0.02, 0.02, 0.02. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

Remarks :

The output by both Hierarchical Clustering and Hierarchical Clustering by ClustGeo method are relatively similar.

However, for spatially constrained clustering, ClustGeo method produce a desirable output compare to SKATER and RedCap methods.

8. CONCLUSION

8.1 Conclude with Summary Statistics

nga_ngeo_clust.sf1 %>% 
  st_set_geometry(NULL) %>%
  group_by(cluster) %>%
  summarise(mean_pct_stat1 = mean(pct_stat1),
            mean_pct_nonFunctional = mean(pct_nonFunctional),
            mean_pct_handPump = mean(pct_handPump), 
            mean_pct_cs10 = mean(pct_cs10), 
            mean_pct_urban0 = mean(pct_urban0))

# A tibble: 3 × 6
  cluster mean_pct_stat1 mean_pct_nonFunctional mean_pct_handP…¹ mean_…² mean_…³
  <chr>            <dbl>                  <dbl>            <dbl>   <dbl>   <dbl>
1 1                 57.8                   30.5             38.5    14.7    15.7
2 2                 36.4                   43.4             14.9    54.4    77.9
3 3                 58.3                   34.5             72.1    26.8    88.7
# … with abbreviated variable names ¹mean_pct_handPump, ²mean_pct_cs10,
#   ³mean_pct_urban0

Remarks :

Cluster 2 having the highest mean of non-functional compare to the other two clusters, may require more resources to overhaul water points as 54% of them are currently supplying from 60% up to entire communities within 1 km. These water points may need to :
- increase the limit of usage capacity.
- reassign amount of users to the water points.
Cluster 3, with average up to 88% of the region falls under non-urban setting, having average 72% of water points deployed with hand pump. However, water was observed in 58% of the non-funcitonal water points. Thus, resources may required to explore further what rendered these water points from being functional.