Explore2 : Thematic Mapping

Published

January 16, 2023

Modified

March 6, 2023

1 OVERVIEW

  • installing and loading sf and tidyverse packages into R environment,

  • importing geospatial data by using appropriate functions of sf package,

  • importing aspatial data by using appropriate function of readr package,

  • exploring the content of simple feature data frame by using appropriate Base R and sf functions,

  • assigning or transforming coordinate systems by using using appropriate sf functions,

  • converting an aspatial data into a sf data frame by using appropriate function of sf package,

  • performing geoprocessing tasks by using appropriate functions of sf package,

  • performing data wrangling tasks by using appropriate functions of dplyr package and

  • performing Exploratory Data Analysis (EDA) by using appropriate functions from ggplot2 package.

2. R PACKAGE REQUIRED

  • sf for importing, managing, and processing geospatial data.

  • tidyverse for performing data science tasks such as importing, wrangling and visualising data.

    • readr for importing csv data

    • readxl for importing Excel worksheet

    • tidyr for manipulating data

    • dplyr for transforming data

    • ggplot2 for visualising data

2.1 Load R Packages

Usage of the code chunk below :

p_load( ) - pacman - to load packages into R environment. This function will attempt to install the package from CRAN or the pacman repository list if it is not installed.

Show the code
pacman::p_load(sf, tidyverse, readr, readxl, ggplot2, dplyr, tidyr, units)

Remarks :

sf, sp, (rgdal, rgeos both are retiring by year 2023), tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally, skimr, stringr, funModeling, knitr, caTools, viridis, rgeoda, cowplot, patchwork.

Alternate code chunk -

packages = c('tidyverse', 'sf', 'readr', 'readxl','ggplot2','dplyr','tidyr')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. GEOSPATIAL DATA

3.1 Acquire Data

Note

The file size of the downloaded data is about 422 MB due to water points data from multiple countries.

-   [Such file size may require extra effort and time to manage the code chunks and files in the R environment before pushing them to GitHub.]{style="color:#d69c3c"}

[Hence, to avoid any error in pushing files larger than 100 MB to Git, filtered Nigeria water points and removed unnecessary variables before uploading into the R environment.]{style="color:#d69c3c"}

[Therewith, the CSV file size should be lesser than 100 MB.]{style="color:#d69c3c"}

3.2 Import Data

3.2.1 Import Boundary Data

Included two extra functions when importing the data :

  • transform the boundary data to “26392”

  • check parsing error.

3.2.1.1 import data

bdy_nga.sf <- st_read(dsn = "data/geospatial",
               layer = "geoBoundaries-NGA-ADM2") %>%
  select(shapeName) %>%
  st_transform(crs = 26392)
Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `D:\jephOstan\ISSS624\exploration\exploration_2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84
problems(bdy_nga.sf)

Remarks :

Observations : 774 in multipolygon

CRS : Projected CRS i.e. WGS 84

3.2.1.2 save sf data frame into RDS format

Show the code
write_rds(bdy_nga.sf,
          "data/geodata/bdy_nga.sf.rds")

3.2.1.3 read RDS file

bdy_nga.sf <- read_rds("data/geodata/bdy_nga.sf.rds")

3.2.2 Import Attribute Data

wp_attribute <- read_csv("data/aspatial/WPdx_NGAv1.2.1.csv",
                       col_select = c(`row_id`,
                                      `#lat_deg`,
                                      `#lon_deg`,
                                      `New Georeferenced Column`,
                                      `lat_lon_deg`,
                                      `#water_source`,
                                      `#water_source_clean`,
                                      `#water_source_category`,
                                      `#water_tech_clean`,
                                      `#water_tech_category`,
                                      `#status_clean`,
                                      `#status`,
                                      `#status_id`,
                                      `#clean_adm1`,
                                      `#clean_adm2`,
                                      `water_point_population`,
                                      `local_population_1km`,
                                      `crucialness_score`,
                                      `pressure_score`,
                                      `usage_capacity`,
                                      `staleness_score`,
                                      `rehab_priority`,
                                      `is_urban`)) %>%
  rename(lat_deg = `#lat_deg`, 
         lon_deg = `#lon_deg`,
         water_source = `#water_source`,
         water_source_clean = `#water_source_clean`, 
         water_source_category = `#water_source_category`, 
         water_tech_clean = `#water_tech_clean`, 
         water_tech_category = `#water_tech_category`,
         status_clean = `#status_clean`,
         status = `#status`,
         status_id = `#status_id`,
         clean_adm1 = `#clean_adm1`,
         clean_adm2 = `#clean_adm2`)
Rows: 95008 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): #status_id, #water_source_clean, #water_source_category, #water_te...
dbl (10): row_id, #lat_deg, #lon_deg, rehab_priority, water_point_population...
lgl  (1): is_urban

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
problems(wp_attribute)

3.2.2.1 inspect for missing value

Show the code
skimr::skim(wp_attribute)
Data summary
Name wp_attribute
Number of rows 95008
Number of columns 23
_______________________
Column type frequency:
character 12
logical 1
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
New Georeferenced Column 0 1.00 11 45 0 95008 0
lat_lon_deg 0 1.00 8 42 0 95008 0
water_source 0 1.00 3 32 0 16 0
water_source_clean 302 1.00 8 22 0 5 0
water_source_category 302 1.00 4 11 0 3 0
water_tech_clean 10055 0.89 8 26 0 11 0
water_tech_category 10055 0.89 8 15 0 4 0
status_clean 10656 0.89 9 32 0 8 0
status 10656 0.89 14 156 0 834 0
status_id 0 1.00 2 7 0 3 0
clean_adm1 0 1.00 3 25 0 37 0
clean_adm2 0 1.00 3 19 0 753 0

Variable type: logical

skim_variable n_missing complete_rate mean count
is_urban 0 1 0.21 FAL: 75444, TRU: 19564

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 199975.48 189726.13 10732.00 52632.75 86952.50 323671.50 681838.00 ▇▃▂▂▁
lat_deg 0 1.00 9.33 2.48 4.30 7.36 9.09 11.83 13.87 ▃▇▅▅▆
lon_deg 0 1.00 7.50 2.25 2.71 5.52 7.89 9.08 14.22 ▃▃▇▃▁
water_point_population 539 0.99 1246.32 4027.41 0.00 117.00 413.00 1169.00 384595.00 ▇▁▁▁▁
local_population_1km 539 0.99 3723.15 7417.59 0.00 597.00 1756.00 4393.00 384595.00 ▇▁▁▁▁
crucialness_score 6879 0.93 0.41 0.34 0.00 0.13 0.30 0.63 1.00 ▇▅▃▁▅
pressure_score 6879 0.93 3.21 9.04 0.00 0.40 1.18 3.10 776.97 ▇▁▁▁▁
usage_capacity 0 1.00 488.63 310.95 50.00 300.00 300.00 1000.00 1000.00 ▁▇▁▁▃
staleness_score 0 1.00 44.94 6.29 23.13 41.49 42.87 44.34 99.00 ▁▇▁▁▁
rehab_priority 53109 0.44 1545.45 5243.53 0.00 136.50 522.00 1527.00 384595.00 ▇▁▁▁▁

Remarks :

Observation : 95,008 water points

Variable : Out of 23 variables, there are 7 variables with missing value n percent.

3.2.2.2 derive “geometry” variable

Show the code
wp_attribute$geometry = st_as_sfc(wp_attribute$`New Georeferenced Column`)

Remarks :

Need to have “geometry” variable, else will encounter the following error -

” Error in st_sf(wp_attribute, crs = 26392) : no simple features geometry column present ”

3.2.2.3 convert to sf data frame

Show the code
wp.sf<- st_sf(wp_attribute, 
              crs = 26392)
st_crs(wp.sf)
Coordinate Reference System:
  User input: EPSG:26392 
  wkt:
PROJCRS["Minna / Nigeria Mid Belt",
    BASEGEOGCRS["Minna",
        DATUM["Minna",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4263]],
    CONVERSION["Nigeria Mid Belt",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",8.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.99975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",670553.98,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Nigeria between 6°30'E and 10°30'E, onshore and offshore shelf."],
        BBOX[3.57,6.5,13.53,10.51]],
    ID["EPSG",26392]]

3.2.2.4 write sf data frame into RDS format

write_rds(wp.sf,
          "data/geodata/wp.sf.rds",
          compress = "xz")

3.2.2.5 read RDS file

wp.sf <- read_rds("data/geodata/wp.sf.rds")

3.3 Data Wrangling

3.3.1 Ensure Unique References

In order to use “shapeName” as the unique reference id, the duplicated shapeName will be append with state name respectively to become unique value.

3.3.1.1 update “shapeName” with list of row id

Show the code
bdy_nga.sf$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] <- 
  c("Bassa Kogi",
    "Bassa Plateau",
    "Ifelodun Kwara",
    "Ifelodun Osun",
    "Irepodun Kwara",
    "Irepodun Osun",
    "Nasarawa Kano",
    "Nasarawa Nasarawa",
    "Obi Nasarawa",
    "Obi Benue",
    "Surulere Lagos",
    "Surulere Oyo")

bdy_nga.sf$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)]
 [1] "Bassa Kogi"        "Bassa Plateau"     "Ifelodun Kwara"   
 [4] "Ifelodun Osun"     "Irepodun Kwara"    "Irepodun Osun"    
 [7] "Nasarawa Kano"     "Nasarawa Nasarawa" "Obi Nasarawa"     
[10] "Obi Benue"         "Surulere Lagos"    "Surulere Oyo"     

3.3.1.2 varify output

Usage of the code chunk below : The codechunk below is to verify the output from previous step.

Show the code
dupl_shapeName_val.sf <- bdy_nga.sf %>%
  add_count(bdy_nga.sf$shapeName) %>%
  filter(n != 1) %>%
  select(-n)

dupl_shapeName_val.sf
Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: Minna / Nigeria Mid Belt
[1] shapeName            bdy_nga.sf$shapeName geometry            
<0 rows> (or 0-length row.names)

3.3.2 Perform Point-in-Polygon Overlay

Compare different approaches in combining the attribute and boundary of the water points into a simple feature object :

x y left
wp.sf bdy_nga.sf TRUE
wp.sf bdy_nga.sf NULL
bdy_nga.sf wp.sf NULL

3.3.2.1 join objects :: wp.sf, bdy_nga.sf

Show the code
wp_point.left <- st_join(x = wp.sf,
                     y = bdy_nga.sf,
                     join = st_intersects,
                     left = TRUE)

st_geometry(wp_point.left)
Geometry set for 95008 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
POINT (10.47318 10.60104)
POINT (6.95009 6.78599)
POINT (7.615451 6.799595)
POINT (7.30539 6.30817)
POINT (10.44625 10.50681)
Show the code
wp_point <- st_join(x = wp.sf,
                     y = bdy_nga.sf,
                     join = st_intersects)

st_geometry(wp_point)
Geometry set for 95008 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
POINT (10.47318 10.60104)
POINT (6.95009 6.78599)
POINT (7.615451 6.799595)
POINT (7.30539 6.30817)
POINT (10.44625 10.50681)
Show the code
wp_mpolygon <- st_join(x = bdy_nga.sf,
                     y = wp.sf,
                     join = st_intersects,
                     left = TRUE)

st_geometry(wp_mpolygon)
Geometry set for 774 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 26662.71 ymin: 30523.38 xmax: 1344157 ymax: 1096029
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
MULTIPOLYGON (((548795.5 119641, 548687.4 11968...
MULTIPOLYGON (((541412.3 122192.3, 541544.6 122...
MULTIPOLYGON (((1248985 1048169, 1247285 104795...
MULTIPOLYGON (((510864.9 578541.6, 508736.2 577...
MULTIPOLYGON (((594269 120968.5, 594389.6 12087...

https://cengel.github.io/R-spatial/spatialops.html

3.3.2.2 save and read RDS Files

Show the code
write_rds(wp_point,
          "data/geodata/wp_point.rds")
Show the code
wp_point <- read_rds("data/geodata/wp_point.rds")

4. GEOSPATIAL VISUALISATION

4.1 Plot with graphics package

Usage of the code chunk below :

plot( ) - graphics - to plot map for each field / variable in mpsz.

Show the code
plot(wp_point,
     max.plot = 15,
     asp = 1)

4.2 Plot with sf package

4.2.1 Plot Boundary

Usage of the code chunk below :

st_geometry( ) - sf - to get the map boundary for the plot( ) function.

Show the code
plot(st_geometry(wp_point))

4.2.2 Plot Specific Attribute

Show the code
plot(wp_point["status_clean"])

5. REFERENCES

  • r4gdsa.netlify.app. https://r4gdsa.netlify.app/chap02.html#data-preparation

  • STHDA (Statistical tools for high-throughput data analysis), (N.A.), ggplot2 scatter plots : Quick start guide - R software and data visualization. http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization

Footnotes

  1. Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866↩︎