# MarineSPEED: Spatial cross-validation and variograms

As one of the downloads for the MarineSPEED dataset, I'm trying to create spatially disjoint subsamples of the original dataset in order to reduce spatial sorting bias (Hijmans 2012). One possible approach is an adaptation of the spatial jackknifing used by SDMtoolbox:

1. create voronoi polygons for all points
2. spatially cluster points
3. merge voronoi polygons of the same cluster
4. clip the voronoi polygons with some buffer around the points

For this last step I need to define the buffer distance. I decided to fit variograms to a selection of the Bio-ORACLE and MARSPEC rasters and use the fitted range as an indication of the buffer distance.
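The four steps above could be sketched as follows with dismo and rgeos (a sketch only, not the exact code used; the function name and parameters are illustrative, and `pts` is assumed to be a SpatialPoints object in a projected, metric CRS):

```r
library(dismo)   # for voronoi()
library(rgeos)   # for gUnaryUnion(), gBuffer(), gIntersection()

spatial_blocks <- function(pts, n_clusters = 4, buffer_m = 200000) {
  # 1. voronoi polygons for all points (one polygon per point, in input order)
  vor <- voronoi(pts)
  # 2. spatially cluster the points (here: k-means on the coordinates)
  cl <- kmeans(coordinates(pts), centers = n_clusters)$cluster
  lapply(seq_len(n_clusters), function(k) {
    # 3. merge the voronoi polygons belonging to the same cluster
    merged <- gUnaryUnion(vor[cl == k, ])
    # 4. clip the merged polygons with a buffer around the cluster's points
    gIntersection(merged, gBuffer(pts[cl == k], width = buffer_m))
  })
}
```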

| Raster | Range (cutoff=1000) | Range (cutoff=2000) |
| --- | --- | --- |
| calcite | 100 | 201 |
| ph | 604 | 790 |
| sst | 1400 | 2717 |
| chlorophyl | 300 | 325 |
| chlorophyl range | 40 | 810 |
| mean cloud fraction | 700 | 1172 |
| mean diffuse attenuation | 255 | 433 |
| dissolved oxygen | 907 | 1919 |
| nitrate | 1402 | 2254 |
| photosynthetically available radiation | 755 | 1556 |
| phosphate | 1360 | 2154 |
| salinity | 536 | 749 |
| silicate | 240 | 2184 |
| sst range | 666 | 1238 |
| bathymetry | 325 | 530 |
| salinity variance | 390 | 223 |

I fitted Gaussian variogram models with the gstat package from 20000 random sample points of the rasters, once with the cutoff set to 1000 and once with it set to 2000. The two starting models tried were vgm(1, "Gau", 100, 1) and vgm(1, "Gau", 1000, 1), as the fits didn't always converge when trying only one starting model. The Gaussian model was used because, based on visual inspection, it seemed to fit better for models with shorter ranges, which gives us a more conservative estimate of the range. Note that the cutoff has a big influence on the model results, especially for models with range values around 1000 and more. For example, refitting the model for "dissolved oxygen" with a cutoff of 500 gives a range of 448, compared to a range of 907 when the cutoff is set to 1000.
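In code, the fitting procedure looks roughly like this (a sketch with raster, sp and gstat; the helper name is an assumption, and distances are in the raster's coordinate units):

```r
library(raster)
library(sp)
library(gstat)

# fit a Gaussian variogram to n random samples of raster r and return the range
fit_range <- function(r, cutoff = 1000, n = 20000) {
  pts <- as.data.frame(sampleRandom(r, n, xy = TRUE))
  names(pts)[3] <- "value"
  coordinates(pts) <- ~ x + y
  v <- variogram(value ~ 1, pts, cutoff = cutoff)
  # try two starting ranges, as a single starting model did not always converge
  for (start_range in c(100, 1000)) {
    fit <- try(fit.variogram(v, vgm(1, "Gau", start_range, 1)), silent = TRUE)
    if (!inherits(fit, "try-error") && !isTRUE(attr(fit, "singular")))
      return(fit$range[2])  # range of the Gaussian component (row 2, after the nugget)
  }
  NA
}
```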

From these results we can conclude that the range of spatial autocorrelation is very large for most rasters, and that a buffer size of 200 km for the spatial cross-validation is a reasonable value: only 2 of the 17 rasters have a smaller range at a cutoff of 1000, and no rasters have a smaller range when the cutoff is set to 2000.

# MarineSPEED: broke git, lost some work

Lost a lot of time today trying to recover files that got lost due to incorrect use of the BFG tool (used for removing big files from a git repository). Finally decided to restart the process; again 200 species left to process. Lesson learned: don't mess with git when you're in a hurry.

I added a small performance improvement by adding some memoization when accessing raster layer values by cell index:

```r
library("R.cache")

setCacheRootPath(path="./.Rcache") # put the .Rcache directory in the current working dir

get_values_layers <- function(layers, cells) {
  layers[cells]
}

# wrap the lookup so repeated calls with the same arguments hit the disk cache
get_values_layers <- addMemoization(get_values_layers)
```


Last weekend I encountered the following 3 interesting papers:

Also make sure to try out the zoon package if you haven't already.

# Distribution records

Only the records for 190 species got processed last night. With another 300+ species to go, performance needed to be improved. The bottleneck is looking up the environmental data, so I tried to improve this part of the process.

Old version:

```r
add_environment <- function(row, data, layers) {
  environment <- extract(layers, data[, xycols])
  data <- cbind(data, environment)
  write.csv2(data, paste0("3_environment/", row$ScientificName, " ", row$AphiaID, ".csv"),
             row.names = FALSE)
  data
}
```

New version:

```r
add_environment <- function(row, data, layers) {
  # look up each raster cell only once, then map the values back to all records
  cells <- cellFromXY(layers, data[, lonlat])
  unique_cells <- unique(cells)
  matched_cells <- match(cells, unique_cells)
  environment <- extract(layers, unique_cells)
  data <- cbind(data, environment[matched_cells, ])
  write.csv2(data, paste0("3_environment/", row$ScientificName, " ", row$AphiaID, ".csv"),
             row.names = FALSE)
  data
}
```


But it didn't improve the performance, and might even have slowed everything down; it's hard to tell because there are huge caching effects.

I've finished writing the species info files with traits from WoRMS. Not sure it will be useful information to include in the modeling, but we'll see.

# Shiny UI

I've also been working on the MarineSPEED UI. There were some bugs which are now gone, and I've added a Leaflet map with the species points, but this is rather slow for large numbers of records (45000 records). So I've filtered the records down to fewer than 2000 points per species, which still gives a good idea of where the points are without slowing down the map.
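The per-species thinning can be done with a small base R helper (a sketch; the `records` data frame and its `species` column are assumptions):

```r
# keep at most `max_points` randomly chosen records per species
thin_records <- function(records, max_points = 2000) {
  do.call(rbind, lapply(split(records, records$species), function(d) {
    if (nrow(d) > max_points) d[sample(nrow(d), max_points), ] else d
  }))
}
```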

Useful leaflet help of the day: http://stackoverflow.com/questions/32107667/clearshapes-not-working-leaflet-for-r/32118413

# Next week

• Create a poster for the VLIZ Marine Scientist conference
• MarineSPEED viewer:
  • create 5-fold/10-fold random cross-validation sets of the data (filtered with environmental data)
  • deploy the viewer and link marinespeed.org to it

# MarineSPEED

Prepared the lab records in the correct format.
Prepared code for (see gist):

• combining all records: GBIF (hasGeoSpatialIssue == "false"), OBIS (qc=1,2,3,4,5,6), Reef Life Survey, lab data
• filter duplicates (exact same coordinates)
• filter rounded coordinates
• add environmental data with my experimental sdmpredictors package
• move points without environmental data up to 1 km
• filter points without environmental data
• filter duplicate points within the same 25km² Behrmann equal area grid
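The last deduplication step could be sketched as follows (a sketch with sp, rgdal and raster, applied per species; the column names and helper name are assumptions):

```r
library(sp)
library(rgdal)   # projection support for spTransform()
library(raster)

# project the records to the Behrmann equal-area projection, assign each one
# to a 5 km x 5 km (= 25 km^2) grid cell and keep one record per cell
dedup_grid <- function(records) {
  pts <- SpatialPoints(records[, c("longitude", "latitude")],
                       proj4string = CRS("+proj=longlat +datum=WGS84"))
  behrmann <- CRS("+proj=cea +lon_0=0 +lat_ts=30 +units=m +datum=WGS84")
  pts_eq <- spTransform(pts, behrmann)
  grid <- raster(ext = extent(pts_eq), resolution = 5000, crs = behrmann)
  records[!duplicated(cellFromXY(grid, pts_eq)), ]
}
```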

# Today I learned

• read.csv accepts gzipped file paths
• GeoTiffs don't support the Mollweide projection, which gives strange results when you have an existing QGIS project. The solution is to set the project's projection to the one stored in the GeoTiff.

# Good to know

With the R package robis you can download species records along with their qc values, and the occurrence method has a qc parameter to only download records that pass specific qc flags.
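For example (the species name is illustrative, and I'm assuming the qc parameter takes a vector of flag numbers, matching the OBIS flags used above):

```r
library(robis)

# download only records that pass QC flags 1 to 6
records <- occurrence("Abra alba", qc = 1:6)
```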