Show the code
pacman::p_load(sf, tidyverse, readr, readxl, ggplot2, dplyr, tidyr, units)January 16, 2023
March 6, 2023
installing and loading sf and tidyverse packages into R environment,
importing geospatial data by using appropriate functions of sf package,
importing aspatial data by using appropriate function of readr package,
exploring the content of simple feature data frame by using appropriate Base R and sf functions,
assigning or transforming coordinate systems by using using appropriate sf functions,
converting an aspatial data into a sf data frame by using appropriate function of sf package,
performing geoprocessing tasks by using appropriate functions of sf package,
performing data wrangling tasks by using appropriate functions of dplyr package and
performing Exploratory Data Analysis (EDA) by using appropriate functions from ggplot2 package.
sf for importing, managing, and processing geospatial data.
tidyverse for performing data science tasks such as importing, wrangling and visualising data.
readr for importing csv data
readxl for importing Excel worksheet
tidyr for manipulating data
dplyr for transforming data
ggplot2 for visualising data
Usage of the code chunk below :
p_load( ) - pacman - to load packages into R environment. This function will attempt to install the package from CRAN or the pacman repository list if it is not installed.
Remarks :
sf, sp, (rgdal, rgeos both are retiring by year 2023), tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally, skimr, stringr, funModeling, knitr, caTools, viridis, rgeoda, cowplot, patchwork.
Alternate code chunk -
Aspatial Data
The file size of the downloaded data is about 422 MB due to water points data from multiple countries.
-   [Such file size may require extra effort and time to manage the code chunks and files in the R environment before pushing them to GitHub.]{style="color:#d69c3c"}
[Hence, to avoid any error in pushing files larger than 100 MB to Git, filtered Nigeria water points and removed unnecessary variables before uploading into the R environment.]{style="color:#d69c3c"}
[Therewith, the CSV file size should be lesser than 100 MB.]{style="color:#d69c3c"}
Geospatial Data
Included two extra functions when importing the data :
transform the boundary data to “26392”
check parsing error.
bdy_nga.sf <- st_read(dsn = "data/geospatial",
               layer = "geoBoundaries-NGA-ADM2") %>%
  select(shapeName) %>%
  st_transform(crs = 26392)Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `D:\jephOstan\ISSS624\exploration\exploration_2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84
Remarks :
Observations : 774 in multipolygon
CRS : Projected CRS i.e. WGS 84
wp_attribute <- read_csv("data/aspatial/WPdx_NGAv1.2.1.csv",
                       col_select = c(`row_id`,
                                      `#lat_deg`,
                                      `#lon_deg`,
                                      `New Georeferenced Column`,
                                      `lat_lon_deg`,
                                      `#water_source`,
                                      `#water_source_clean`,
                                      `#water_source_category`,
                                      `#water_tech_clean`,
                                      `#water_tech_category`,
                                      `#status_clean`,
                                      `#status`,
                                      `#status_id`,
                                      `#clean_adm1`,
                                      `#clean_adm2`,
                                      `water_point_population`,
                                      `local_population_1km`,
                                      `crucialness_score`,
                                      `pressure_score`,
                                      `usage_capacity`,
                                      `staleness_score`,
                                      `rehab_priority`,
                                      `is_urban`)) %>%
  rename(lat_deg = `#lat_deg`, 
         lon_deg = `#lon_deg`,
         water_source = `#water_source`,
         water_source_clean = `#water_source_clean`, 
         water_source_category = `#water_source_category`, 
         water_tech_clean = `#water_tech_clean`, 
         water_tech_category = `#water_tech_category`,
         status_clean = `#status_clean`,
         status = `#status`,
         status_id = `#status_id`,
         clean_adm1 = `#clean_adm1`,
         clean_adm2 = `#clean_adm2`)Rows: 95008 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): #status_id, #water_source_clean, #water_source_category, #water_te...
dbl (10): row_id, #lat_deg, #lon_deg, rehab_priority, water_point_population...
lgl  (1): is_urban
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| Name | wp_attribute | 
| Number of rows | 95008 | 
| Number of columns | 23 | 
| _______________________ | |
| Column type frequency: | |
| character | 12 | 
| logical | 1 | 
| numeric | 10 | 
| ________________________ | |
| Group variables | None | 
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace | 
|---|---|---|---|---|---|---|---|
| New Georeferenced Column | 0 | 1.00 | 11 | 45 | 0 | 95008 | 0 | 
| lat_lon_deg | 0 | 1.00 | 8 | 42 | 0 | 95008 | 0 | 
| water_source | 0 | 1.00 | 3 | 32 | 0 | 16 | 0 | 
| water_source_clean | 302 | 1.00 | 8 | 22 | 0 | 5 | 0 | 
| water_source_category | 302 | 1.00 | 4 | 11 | 0 | 3 | 0 | 
| water_tech_clean | 10055 | 0.89 | 8 | 26 | 0 | 11 | 0 | 
| water_tech_category | 10055 | 0.89 | 8 | 15 | 0 | 4 | 0 | 
| status_clean | 10656 | 0.89 | 9 | 32 | 0 | 8 | 0 | 
| status | 10656 | 0.89 | 14 | 156 | 0 | 834 | 0 | 
| status_id | 0 | 1.00 | 2 | 7 | 0 | 3 | 0 | 
| clean_adm1 | 0 | 1.00 | 3 | 25 | 0 | 37 | 0 | 
| clean_adm2 | 0 | 1.00 | 3 | 19 | 0 | 753 | 0 | 
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count | 
|---|---|---|---|---|
| is_urban | 0 | 1 | 0.21 | FAL: 75444, TRU: 19564 | 
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | 
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 199975.48 | 189726.13 | 10732.00 | 52632.75 | 86952.50 | 323671.50 | 681838.00 | ▇▃▂▂▁ | 
| lat_deg | 0 | 1.00 | 9.33 | 2.48 | 4.30 | 7.36 | 9.09 | 11.83 | 13.87 | ▃▇▅▅▆ | 
| lon_deg | 0 | 1.00 | 7.50 | 2.25 | 2.71 | 5.52 | 7.89 | 9.08 | 14.22 | ▃▃▇▃▁ | 
| water_point_population | 539 | 0.99 | 1246.32 | 4027.41 | 0.00 | 117.00 | 413.00 | 1169.00 | 384595.00 | ▇▁▁▁▁ | 
| local_population_1km | 539 | 0.99 | 3723.15 | 7417.59 | 0.00 | 597.00 | 1756.00 | 4393.00 | 384595.00 | ▇▁▁▁▁ | 
| crucialness_score | 6879 | 0.93 | 0.41 | 0.34 | 0.00 | 0.13 | 0.30 | 0.63 | 1.00 | ▇▅▃▁▅ | 
| pressure_score | 6879 | 0.93 | 3.21 | 9.04 | 0.00 | 0.40 | 1.18 | 3.10 | 776.97 | ▇▁▁▁▁ | 
| usage_capacity | 0 | 1.00 | 488.63 | 310.95 | 50.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▁▇▁▁▃ | 
| staleness_score | 0 | 1.00 | 44.94 | 6.29 | 23.13 | 41.49 | 42.87 | 44.34 | 99.00 | ▁▇▁▁▁ | 
| rehab_priority | 53109 | 0.44 | 1545.45 | 5243.53 | 0.00 | 136.50 | 522.00 | 1527.00 | 384595.00 | ▇▁▁▁▁ | 
Remarks :
Observation : 95,008 water points
Variable : Out of 23 variables, there are 7 variables with missing value n percent.
Remarks :
Need to have “geometry” variable, else will encounter the following error -
” Error in st_sf(wp_attribute, crs = 26392) : no simple features geometry column present ”
Coordinate Reference System:
  User input: EPSG:26392 
  wkt:
PROJCRS["Minna / Nigeria Mid Belt",
    BASEGEOGCRS["Minna",
        DATUM["Minna",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4263]],
    CONVERSION["Nigeria Mid Belt",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",8.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.99975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",670553.98,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Nigeria between 6°30'E and 10°30'E, onshore and offshore shelf."],
        BBOX[3.57,6.5,13.53,10.51]],
    ID["EPSG",26392]]
In order to use “shapeName” as the unique reference id, the duplicated shapeName will be append with state name respectively to become unique value.
bdy_nga.sf$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] <- 
  c("Bassa Kogi",
    "Bassa Plateau",
    "Ifelodun Kwara",
    "Ifelodun Osun",
    "Irepodun Kwara",
    "Irepodun Osun",
    "Nasarawa Kano",
    "Nasarawa Nasarawa",
    "Obi Nasarawa",
    "Obi Benue",
    "Surulere Lagos",
    "Surulere Oyo")
bdy_nga.sf$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] [1] "Bassa Kogi"        "Bassa Plateau"     "Ifelodun Kwara"   
 [4] "Ifelodun Osun"     "Irepodun Kwara"    "Irepodun Osun"    
 [7] "Nasarawa Kano"     "Nasarawa Nasarawa" "Obi Nasarawa"     
[10] "Obi Benue"         "Surulere Lagos"    "Surulere Oyo"     
Usage of the code chunk below : The codechunk below is to verify the output from previous step.
Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: Minna / Nigeria Mid Belt
[1] shapeName            bdy_nga.sf$shapeName geometry            
<0 rows> (or 0-length row.names)
Compare different approaches in combining the attribute and boundary of the water points into a simple feature object :
| x | y | left | 
|---|---|---|
| wp.sf | bdy_nga.sf | TRUE | 
| wp.sf | bdy_nga.sf | NULL | 
| bdy_nga.sf | wp.sf | NULL | 
Geometry set for 95008 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
POINT (10.47318 10.60104)
POINT (6.95009 6.78599)
POINT (7.615451 6.799595)
POINT (7.30539 6.30817)
POINT (10.44625 10.50681)
Geometry set for 95008 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
POINT (10.47318 10.60104)
POINT (6.95009 6.78599)
POINT (7.615451 6.799595)
POINT (7.30539 6.30817)
POINT (10.44625 10.50681)
Geometry set for 774 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 26662.71 ymin: 30523.38 xmax: 1344157 ymax: 1096029
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
MULTIPOLYGON (((548795.5 119641, 548687.4 11968...
MULTIPOLYGON (((541412.3 122192.3, 541544.6 122...
MULTIPOLYGON (((1248985 1048169, 1247285 104795...
MULTIPOLYGON (((510864.9 578541.6, 508736.2 577...
MULTIPOLYGON (((594269 120968.5, 594389.6 12087...
https://cengel.github.io/R-spatial/spatialops.html
r4gdsa.netlify.app. https://r4gdsa.netlify.app/chap02.html#data-preparation
STHDA (Statistical tools for high-throughput data analysis), (N.A.), ggplot2 scatter plots : Quick start guide - R software and data visualization. http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization
Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866↩︎