MarineSPEED: Spatial cross-validation and variograms

For one of the downloads of the MarineSPEED dataset I’m trying to create a spatially disjoint set of subsamplings of the original dataset in order to reduce spatial sorting bias (Hijmans 2012). One possible approach is an adaptation of the spatial jackknifing procedure used by sdmtoolbox:

  1. create voronoi polygons for all points
  2. spatially cluster points
  3. merge voronoi polygons of the same cluster
  4. clip the voronoi polygons with some buffer around the points
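The four steps above could be sketched roughly as follows. This is an untested outline, assuming the dismo, rgeos and raster packages, projected coordinates in metres, and a hypothetical function name:

```r
# Rough sketch of the spatial jackknife steps; not the actual MarineSPEED code.
library(dismo)   # voronoi()
library(rgeos)   # gBuffer(), gIntersection()
library(raster)  # aggregate() for SpatialPolygonsDataFrame

spatial_jackknife_polys <- function(pts, k = 5, buffer_m = 200000) {
  # 1. voronoi polygons for all points (one polygon per point, same order)
  vor <- voronoi(pts)
  # 2. spatially cluster the points (plain k-means on the coordinates)
  vor$cluster <- kmeans(coordinates(pts), centers = k)$cluster
  # 3. merge (dissolve) the voronoi polygons belonging to the same cluster
  merged <- aggregate(vor, by = "cluster")
  # 4. clip the merged polygons with a buffer around the points
  buf <- gBuffer(pts, width = buffer_m)
  gIntersection(merged, buf, byid = TRUE)
}
```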

For this last step I need to define the buffer distance. I decided to fit variograms to a selection of the Bio-ORACLE and MARSPEC rasters and use the fitted range as an indication of the buffer distance.


Raster                                   Range (cutoff=1000)   Range (cutoff=2000)
calcite                                  100                   201
ph                                       604                   790
sst                                      1400                  2717
chlorophyll                              300                   325
chlorophyll range                        40                    810
mean cloud fraction                      700                   1172
mean diffuse attenuation                 255                   433
dissolved oxygen                         907                   1919
nitrate                                  1402                  2254
photosynthetically available radiation   755                   1556
phosphate                                1360                  2154
salinity                                 536                   749
silicate                                 240                   2184
sst range                                666                   1238
bathymetry                               325                   530
salinity variance                        390                   223

I fitted Gaussian variogram models with the gstat package from 20000 random sample points of the rasters, once with the cutoff set to 1000 and once with it set to 2000. The two starting models tried were vgm(1, "Gau", 100, 1) and vgm(1, "Gau", 1000, 1), as the fit didn’t always converge when trying only one starting model. The Gaussian model was used because, based on visual inspection, it seemed to fit better for models with shorter ranges, which gives us a more conservative estimate of the range. Note that the cutoff has a big influence on the model results, especially for models with range values around 1000 and more. For example, refitting the model for "dissolved oxygen" with a cutoff of 500 gives a range of 448, compared to a range of 907 when the cutoff is set to 1000.
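A condensed sketch of this fitting procedure, assuming the raster, sp and gstat packages; fit_range is a hypothetical helper, not code from the actual scripts:

```r
library(raster)
library(gstat)

fit_range <- function(r, n = 20000, cutoff = 1000) {
  # 20000 random sample points as a SpatialPointsDataFrame
  pts <- sampleRandom(r, n, sp = TRUE)
  colnames(pts@data) <- "value"
  v <- variogram(value ~ 1, pts, cutoff = cutoff)
  # try two starting models, as one alone didn't always converge
  for (start in list(vgm(1, "Gau", 100, 1), vgm(1, "Gau", 1000, 1))) {
    fit <- fit.variogram(v, start)
    if (!attr(fit, "singular")) {
      return(fit$range[2])  # row 1 is the nugget, row 2 the Gaussian component
    }
  }
  NA
}
```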

From these results we can conclude that the range of spatial autocorrelation is very large for most rasters, and that a buffer size of 200 km for the spatial cross-validation is a reasonable value: only 2 of 17 rasters have a smaller range at a cutoff of 1000, and no rasters have a smaller range when the cutoff is set to 2000.


MarineSPEED: broke git, lost some work

Lost a lot of time today trying to recover files that got lost due to incorrect use of the bfg tool (used for removing big files from git history). Finally decided to restart the process; again 200 species left to process. Lesson learned: don’t mess with git when you’re in a hurry.

I added a small performance improvement by adding some memoization when accessing raster layer values by cell indexes:


library(R.cache) # provides setCacheRootPath() and addMemoization()

setCacheRootPath(path="./.Rcache") # put the .Rcache directory in the current working dir

get_values_layers <- function(layers, cells) {
  # body reconstructed: presumably a plain value lookup by cell index
  extract(layers, cells)
}
get_values_layers <- addMemoization(get_values_layers)

Next step is to prepare files for downloading.

Last weekend I encountered the following 3 interesting papers:

And also make sure to try out the zoon package if you haven’t already.

MarineSPEED slow data preparation and improving the UI

Distribution records

Only the records for 190 species got processed last night. With another 300+ species to go, performance had to be improved. The bottleneck is looking up the environmental data, so I tried to improve this part of the process.

Old version:

 add_environment <- function(row, data, layers) {
   print("add environment")
   environment <- extract(layers, data[, xycols])
   data <- cbind(data, environment)
   write.csv2(data, paste0("3_environment/", row$ScientificName, " ", row$AphiaID, ".csv"), row.names = FALSE)
 }

New version:

 add_environment <- function(row, data, layers) {
   print("add environment")
   # only extract values once per unique cell, then map them back to the records
   cells <- cellFromXY(layers, data[, lonlat])
   unique_cells <- unique(cells)
   matched_cells <- match(cells, unique_cells)
   environment <- extract(layers, unique_cells)
   data <- cbind(data, environment[matched_cells, ])
   write.csv2(data, paste0("3_environment/", row$ScientificName, " ", row$AphiaID, ".csv"), row.names = FALSE)
 }

But it didn’t improve the performance and might even have slowed everything down; it’s hard to test because there are huge caching effects.

I’ve finished writing the species info files with traits from WoRMS. Not sure this will be useful information to include in the modeling, but we’ll see.

Shiny UI

I’ve also been working on the MarineSPEED UI. There were some bugs, which are now gone, and I’ve added a Leaflet map with species points, but this is rather slow with large numbers of records (45000 records). So I’ve filtered the records to have fewer than 2000 points per species, which still gives a good idea of where the points are without slowing down the map.
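The filtering itself can be as simple as random subsampling per species. A base-R sketch (thin_records is a hypothetical helper name, not the actual function):

```r
# Keep at most max_points randomly chosen records; return small sets unchanged.
thin_records <- function(records, max_points = 2000) {
  if (nrow(records) <= max_points) return(records)
  records[sample(nrow(records), max_points), ]
}
```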

Useful leaflet help of the day:

Next week

  • Create a poster for the VLIZ Marine Scientist conference
  • MarineSPEED viewer:
    • add some overview information
    • create 5-fold/10-fold random cross validation sets of the data (filtered with environmental data)
    • add download links for the data
    • deploy the viewer and link to it
    • add a short about page
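The random cross-validation sets could be as simple as assigning every record a fold number. A hypothetical base-R sketch:

```r
# Assign each of n records to one of k random folds (balanced fold sizes).
make_folds <- function(n, k = 5) {
  sample(rep_len(1:k, n))
}
```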

Some progress on


Redownloaded all OBIS data.
Prepared lab records in correct format.
Prepared code (see gist) for:

  • combining all records: GBIF (hasGeoSpatialIssue == "false"), OBIS (qc=1,2,3,4,5,6), Reef Life Survey, lab data
  • filter duplicates (exact same coordinates)
  • filter rounded coordinates
  • add environmental data with my experimental sdmpredictors package
  • move points without environmental data up to 1 km
  • filter points without environmental data
  • filter duplicate points within the same 25km² Behrmann equal area grid
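The last step could look roughly like this (untested sketch, assuming the sp and raster packages; the Behrmann proj4 string and the 5 × 5 km resolution giving 25 km² cells are my assumptions):

```r
# Keep one record per 25 km² cell of a Behrmann equal-area grid; sketch only.
library(sp)
library(raster)

behrmann <- CRS("+proj=cea +lon_0=0 +lat_ts=30 +units=m")

dedup_grid <- function(pts_lonlat) {
  pts <- spTransform(pts_lonlat, behrmann)
  grid <- raster(extent(pts), resolution = 5000, crs = behrmann)
  cells <- cellFromXY(grid, pts)
  pts[!duplicated(cells), ]
}
```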

Today I learned

  • read.csv accepts gzipped file paths
  • GeoTIFFs don’t support the Mollweide projection, which gives strange results when you have an existing QGIS project. The solution is to set the project’s projection to the one stored in the GeoTIFF.

Good to know

With the R package robis you can download species records with their qc values, and the occurrence method has a qc parameter to download only records that pass specific qc flags.

See help(qc) for more information on the different flags.

In case you forgot to filter during the request, you can still filter using the filter_qc function in the gist of the day. The code would have been a lot shorter if there were bitwise AND and OR operators for 64-bit integers.
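A sketch of how such a flag check can be done without 64-bit bitwise operators, using integer division on doubles. This is a simplified stand-in for the gist's filter_qc, with hypothetical names, and flag numbering assumed to start at 1 for the least significant bit:

```r
# TRUE if bit `flag` (1-based) is set in the qc value (stored as a double,
# so values above the 32-bit integer range still work).
has_qc_flag <- function(qc, flag) {
  (qc %/% 2^(flag - 1)) %% 2 == 1
}

# Keep only records whose qc value has all required flags set.
filter_qc_flags <- function(qc, required_flags) {
  ok <- rep(TRUE, length(qc))
  for (f in required_flags) {
    ok <- ok & has_qc_flag(qc, f)
  }
  ok
}
```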

Hello world!

Finally took action to start my own open notebook. I’m working on marine species distribution modeling and my main tool is R, so expect to see a lot of posts about these two. In the past I’ve also posted some of my research related posts on my personal blog.

Currently I’m working on a benchmark data set: I’ve gathered records from OBIS, GBIF and Reef Life Survey but now have to combine them into one dataset. I’m also writing an R Shiny UI for exploring the data set.