Jeeve Science-Nature-Photography: Mai 2018

Species lists from different sources:
how to synchronize names with the GBIF database in R,
how to compile a list of Syrphidae from Switzerland and
how to determine which species are lacking in common identification keys

You may know this problem: you have 2 or more species list from different sources and they use different synonyms for the same species. You want to merge the lists to a single one that has only unique species. In this post you find:

Getting species lists from books in to R
Using R to compare these species list with the GBIF species database and merge them (including 'tricks', e.g. to extract names from a multidimensional list)
Discussion of common problems to take in to account
The final list of Syrphidae nicely formatted for printing

Case study description
The case study concerns European Hoverflies (Syrphidae). Unlike plants or mammals, this insect group has been subject to many synonymizations which results in very dissimilar name lists over time frames of a decade. I want to identify Syrphidae species from the Alps (mainly Switzerland), with two commonly used identification keys that do not include all alpine species. I need to know which species have to be identified with additional keys and which genera are safe to identify with the common keys. I took the species list of Switzerland from the website World of Syrphidae. For convenience and usability I also downloaded the German species list.

For Northwest Europe there are two recent major identification works:
The first covers (almost) all lowland species:
Van Veen, M. (2004) Hoverflies of Northwest Europe: identification keys to the Syrphidae. 256pp.
Van Veen published an addendum to his book on his website which enables identification of even more Syrphidae. I copied those species names manually into a txt file, an this list will be used as well.
The second work comes in two volumes and covers many alpine species:
Bartsch, H. et al. (2009) Blomflugor: Syrphinae. Nationalnyckeln till Sveriges flora och flora, DH53a. Artdatabanken, SLU, Uppsala. 406 pp.
Bartsch, H. et al. (2009) Blomflugor: Eristalinae & Microdontinae. Nationalnyckeln till Sveriges flora och flora, DH 53b. Artdatabanken, SLU, Uppsala. 478pp.

1. Getting the species lists from books and web pages into R

When you get experienced with this, it is faster than typing every single species manually.

Scan the species list from your book, either with a modern copy machine or flatbed scanner (A digital camera often yield too strong distorted images that do not give a good text recognition.)
Get a good text recognition (OCR, optical character recognition). I prefer to do a good OCR with Acrobat pro on a pdf made from the scans. There is also free OCR software that can do a good job on tiff and bmp images (here some software suggestions to try).
For a good result, make sure that you have one column with species on your input. If you have e.g. 2 columns on a single page or printed 2 pages on a single pdf page, you have to edit the images a bit. For a pdf you can do the following. Crop your pdf to the left side first and export as tiff image. Rename from filename.* to filename_A.* (you can do this with e.g. total commander under Files -> multi-rename tool). Crop your pdf to right side and export as tiff images. Rename from filename.* to filename_B.* Now combine files to pdf again. You now should have a single list of species per page.
Perform the OCR. Either export to text (save as other, txt (accessible), or select parts of the list manually and copy it into a *.txt file.
Clean up the file. Keep a list of what to "find and replace": e.g. / must be l.
Import the text file to excel or libre office calc. The delimiter is space. First 2 columns are genus and species. The rest may be rubbish or things like authority or common name and may be deleted. If you like to keep any of this (for example authority) you can do the following. Select all remaining columns (except genus/species), paste into a txt file, put behind the authority a sign for separation, instead of a space. In my example I had the kind of format: "Authority - Name". I replaced " - " by "-" to get a nice separation without spaces. Import and choose delimiter like "-".

Everything should now be in a perfect table! Scan through for further OCR errors, remaining ones can be identified in R. For use in R, I named the columns "GenSpec" for the species name, "FullGenSpec" for the species with authority, and "Gen" and "Spec" for using both elements separately.

2. Using R to compare species list with the GBIF database

Start and read the files with species lists:

################################_
### Load packages and functions, set working directory     ####
################################_

setwd("C://.../specieslistsR")

#install.packages("taxize")
#install.packages("dplyr")

require(taxize) #for connecting to gbif species lists
require(dplyr) #for bind_rows

# a nice function for removing accidential spaces at end of species name
trimendspace <- function (x) sub("\\s+$", "", x) 


################################_
### Read files     ####
################################_


#DATAFILES

list_vV <- read.table ("VanVeen_species_list.txt",na.strings="NA",head=T,sep="\t", dec=",", encoding = "utf-8") #encoding is relevant when OCR on scanned documents was performed
head(list_vV)
for (i in c(1:5))    {list_vV[,i] <- trimendspace(as.character(list_vV[,i]))}

list_vVadd <- read.table ("vanVeen_Addendum.txt",na.strings="NA",head=T,sep="\t", dec=",", encoding = "utf-8")
head(list_vVadd)
for (i in c(1:1))    {list_vVadd[,i] <- trimendspace(as.character(list_vVadd[,i]))}

list_Ba <- read.table ("Bartsch_species_list.txt",na.strings="NA",head=T,sep="\t", dec=",", encoding = "utf-8")
head(list_Ba)
for (i in c(1:5))    {list_Ba[,i] <- trimendspace(as.character(list_Ba[,i]))}

list_CH <- read.table ("checklist_CH.txt",na.strings="NA",head=T,sep="\t", dec=",", encoding = "utf-8")
head(list_CH)
for (i in c(1:3))    {list_CH[,i] <- trimendspace(as.character(list_CH[,i]))}

list_DE <- read.table ("checklist_DE.txt",na.strings="NA",head=T,sep="\t", dec=",", encoding = "utf-8")
head(list_DE)
for (i in c(1:5))    {list_DE[,i] <- trimendspace(as.character(list_DE[,i]))}

Two functions - the heart of the code:


#function for consulting Gbif and putting the right information in a dataframe
#with the species list copied to 'Gbifask'

func.GbifQuery <<- function() 
{
  #get current status for a list of species from gbif
  listOfResults <<- get_gbifid_(Gbifask)
  
  cat("\n", file = "gbifunknown.txt", append = F) #File for species names without retuns; F: clean file
  
  #put data in dataframe
  for (i in c(1:length(specieslist)))
  {
    if(i==1) # first entry used to start new dataframe
    {
      #first entry is the exact match if any present
      #Gbifid <- listOfResults[[i]][1,]
      listOfResults[[i]][,"enteredname"] <<- names(listOfResults[i])
      Gbifid <<- listOfResults[[i]]
    }  else # second an further are added to dataframe by bindrows
    {
      if(length(listOfResults[[i]][])>0) # all valid entries
      {
        listOfResults[[i]][,"enteredname"] <<- names(listOfResults[i])
        Gbifid <<-  bind_rows(Gbifid, listOfResults[[i]])      
      } else # species unknown to gbif, write to file diectly
      {
        cat("\n", names(listOfResults[i]) ,file = "gbifunknown.txt", append = T)
      }
    }
  }
}


#function for corrections with synonym lists

func.correctlists <<- function()   {
  speciescorrect <<- 0
  
  for (i in c(1:nrow(CorrectionList)))     #go through list with synonyms
  {
    for (Listname in ToCorrect)           #go through series of lists to correct
    {
      Specieslist <<- get(paste(Listname))
      
      speciescorrect <<- as.character(CorrectionList[i, "OldGenSpec"])
      if(length(Specieslist[Specieslist$GenSpec==speciescorrect, "GenSpec"])>0)
      {
        #first save old name
        if(correctiontype=="Synonym")
        {Specieslist[Specieslist$GenSpec==speciescorrect, "SynSpec"] <<- as.character(CorrectionList[i, "OldGenSpec"])}
        #than replace
        Specieslist[Specieslist$GenSpec==speciescorrect, "GenSpec"] <<- as.character(CorrectionList[i, "NewGenSpec"])
      }
      assign(paste(Listname), Specieslist, inherits = TRUE)
    }
  }
}

Merge the species lists:

#Merge the species lists
speciesbind <- bind_rows(list_vV, list_vVadd, list_Ba, list_CH, list_DE)
speciesbind <- speciesbind[order(speciesbind$GenSpec),]

#remove duplicates

#which full species names (with authority) are duplicated?
speciesauthoritiesduplicated <- speciesbind[duplicated(speciesbind$FullGenSpec),"FullGenSpec"]
#remove them
set1 <- speciesbind[-which(speciesbind$FullGenSpec %in% speciesauthoritiesduplicated),]

#check what remains:
speciesduplicated <- set1[duplicated(set1$GenSpec),"GenSpec"]
set1[which(set1$GenSpec %in% speciesduplicated),"FullGenSpec" ]

Look at the saved table. Slightly different spelled authority names can be dismissed, clearly different authority names need to be retained.

Consult the GBIF database with the raw species list:

#make the species list. In our case all duplicates were Authority spelling variations
specieslist <- unique(speciesbind$GenSpec)

#Now find the synonyms
Gbifask <- specieslist
func.GbifQuery() #returns Gbifid

First get a list of typos:

# which entry has no exact match?

# get list of entries which have an exact match
matchingnames <- Gbifid[Gbifid$matchtype=="EXACT",  ]
matchingnameslist <- matchingnames[,c("enteredname")]

#get a list of entries which have a fuzzy match
unresolvednames <- Gbifid[Gbifid$matchtype=="FUZZY",  ]  
#reduce the fuzzy list; filter out entries that also have an exact match
unresolvednames <- unresolvednames[-which(unresolvednames$enteredname %in% matchingnameslist), ]

#save this list, these are possible typos making them unrecognizable for GBIF
namestosave <- unresolvednames[,c("enteredname", "scientificname", "canonicalname","species","matchtype")]
write.table(namestosave, "possibletypos.txt", row.names = FALSE, col.names = TRUE, sep = "\t", dec=".")

Open the saved file in a spreadsheet program, with the goal to make make a correction list
FUZZY: mostly misspellings (by OCR) or male/female spelling changes - make a list for corrections.
the column "entered name" gets "OldGenSpec" and the column "canonicalname" gets "NewGenSpec" (synonyms in "species" are handled later). This is a translation list to replace the typos.
Carefully check all entries. You can also consult the list of valid species names to see if one of the fuzzy ones matches. If the fuzzy match is unclear and you like to see if the Authority matches somewhere, you can e.g. use: speciesbind[grep("^Pipiza ju",speciesbind$GenSpec),"FullGenSpec"]
Leave alone names that do not match at all! Remove them.
Look for crosslinking changes in the list (new name also occurs in old name list): forbidden! (here not present)
Name your file TypoCorrection.txt

Remove the typos:

### now replace typos in all lists with correction function ###

#load correction file
corrections <- read.table ("TypoCorrection.txt",na.strings="NA",head=T,sep="\t", dec=",")
head(corrections)
for (i in c(1:2))    {corrections[,i] <- trimendspace(as.character(corrections[,i]))}

ToCorrect <- c("list_vV","list_vVadd","list_Ba","list_CH","list_DE")
CorrectionList <- corrections
correctiontype <- "Typo"
func.correctlists()

Get synonyms round 1

#Merge species lists
speciesbind <- bind_rows(list_vV, list_vVadd, list_Ba, list_CH, list_DE, list_SA)
speciesbind <- speciesbind[order(speciesbind$GenSpec),]
specieslist <- unique(speciesbind$GenSpec)

#get current status for species from Gbif and put in dataframe
Gbifask <- specieslist
func.GbifQuery() #returns Gbifid

### Identify synonyms and evaluate thoroughly ###

#having solved all Fuzzy returns without an accepted counterpart in the data in the previous step, we now only have fuzzy returns which can be ignored, because they have a recognized valid name as well
#the names remaining are species with exact match with either synonym false and synonym true

#in first instance you would try to get the synonyms in this way:
resolvenames <- Gbifid[Gbifid$matchtype=="EXACT" & Gbifid$synonym==TRUE,  ]  #list of names marked as synonym
namestosave1 <- resolvenames[,c("enteredname", "scientificname", "canonicalname","species")]
#however this also lists all other known synonyms that are not relevant here. 
#the list needs to be reduced to the names that were originally entered - and see which of those ar now considered synonyms
getnames <- namestosave1[, "enteredname"]  
resolvenames2 <- Gbifid[which(Gbifid$enteredname %in% getnames),  ]  #reduce Gbifid
resolvenames3 <- resolvenames2[resolvenames2$matchtype=="EXACT", ]  #retain those labelled as exact match
namestosave2 <- resolvenames3[,c("enteredname", "scientificname", "canonicalname","species", "synonym")]
write.table(namestosave2, "synonymsfromgbif.txt", row.names = FALSE, col.names = TRUE, sep = "\t", dec=".") # save this table

open the saved file in a spreadsheet program, with the goal to make make a synonym list to replace 'old' names
the column "entered name" gets "OldGenSpec" and the column "species" gets "NewGenSpec"
when one name has both an entry for synonym true and false, choose the right one: you can check for options in your data like this: speciesbind[speciesbind$GenSpec=="Eristalis lineata","FullGenSpec"] If multiple sources (duplicates in speciesbind) consistently use the same authority, you can use that one (if synonym=false you can thus delete both entries). You can also check other sources for the right authority
go through all remaining names and decide what is the right and accepted name
at the end of this blog post I will give some hints on doing this, since GBIF is far from perfect when it comes to insect classification in general.
for now accept the synonym list you made
look for crosslinking changes in the list (new name also occurs in old name list): forbidden! (here not present)
name your file Synonyms1.txt

Replace synonym names with most actual one:

# Prepare new columns and variables

synonyms1 <- read.table ("Synonyms1.txt",na.strings="NA",head=T,sep="\t", dec=",")
head(synonyms1)
for (i in c(1:2))    {synonyms1[,i] <- trimendspace(as.character(synonyms1[,i]))}

ToCorrect <- c("list_vV","list_vVadd","list_Ba","list_CH","list_DE")
CorrectionList <- synonyms1
correctiontype <- "Synonym"
func.correctlists()

Resolve remaining issues:

#Merge species lists
speciesbind <- bind_rows(list_vV, list_vVadd, list_Ba, list_CH, list_DE, list_SA)
speciesbind <- speciesbind[order(speciesbind$GenSpec),]
specieslist <- unique(speciesbind$GenSpec)

#get current status for species from Gbif and put in dataframe
Gbifask <- specieslist
func.GbifQuery() #returns Gbifid

#get list of species names with exact match 
matchingnames <- Gbifid[Gbifid$matchtype=="EXACT",  ]
matchingnameslist <- matchingnames[,c("enteredname")]

#get unresolved names
unresolvednames <- Gbifid[Gbifid$matchtype!="EXACT",  ]  
#filter out those with exact match
unresolvednames <- unresolvednames[-which(unresolvednames$enteredname %in% matchingnameslist), ]
#save unresolved names to file
namestosave <- unresolvednames[,c("enteredname", "scientificname", "canonicalname","species","matchtype")]
write.table(namestosave, "gbifunclear.txt", row.names = FALSE, col.names = TRUE, sep = "\t", dec=".") 

# open gbifunknown.txt and gbifunclear.txt and edit as spreadsheet
# there are probably more misspellings and/or synonyms to resolve
# make a new file with synonyms, name it Synonyms2.txt

# replace synonyms

synonyms2 <- read.table ("Synonyms2.txt",na.strings="NA",head=T,sep="\t", dec=",")
head(synonyms2)
for (i in c(1:2))    {synonyms2[,i] <- trimendspace(as.character(synonyms2[,i]))}

ToCorrect <- c("list_vV","list_vVadd","list_Ba","list_CH","list_DE")
CorrectionList <- synonyms2
correctiontype <- "Synonym"
func.correctlists()

Combining the different species lists in one table and solve multiple synonyms:

#challenge: some accepted species names have multiple synonyms that need to be merged

#Merge species lists
speciesbind <- bind_rows(list_vV, list_vVadd, list_Ba, list_CH, list_DE)
speciesbind <- speciesbind[order(speciesbind$GenSpec),]
speciesbind <- speciesbind[!is.na(speciesbind$GenSpec),]

#make unique species list
#specieslist <- unique(speciesbind[, c("GenSpec","SynSpec")]) #does not work: many duplicate entries
specieslist <- unique(speciesbind$GenSpec)

#deal with  multiple synonyms
speciessynonyms <-  unique(speciesbind[speciesbind$SynSpec!="0", c("GenSpec","SynSpec")  ])
specieslistwithdouble <- speciessynonyms[duplicated(speciessynonyms$GenSpec),c("GenSpec")]

#merge synonyms
specieswithdouble <- setNames(as.data.frame(specieslistwithdouble),c("GenSpec"))
specieswithdouble$DSynSpec <- as.character("0")
for(i in c(1:nrow(specieswithdouble)))
{
  allsynonyms <- paste(as.character(speciessynonyms[speciessynonyms$GenSpec==specieslistwithdouble[i] & !is.na(speciessynonyms$GenSpec),c("SynSpec")], sep=","))
  syncombi <- NULL
  for (j in c(1:length(allsynonyms)))
  {
    if(j==1)
      syncombi <- paste0(allsynonyms[j])
    else syncombi <- paste0(syncombi,", ", allsynonyms[j])
  }
  specieswithdouble[i,"DSynSpec"] <- syncombi
}

#replace single synonyms with double synonyms
for (i in c(1:nrow(specieswithdouble)))
{
  speciessynonyms[speciessynonyms$GenSpec==specieswithdouble[i, "GenSpec"] & !is.na(speciessynonyms$GenSpec), "SynSpec"] <- as.character(specieswithdouble[i, "DSynSpec"])  
} 
#reduce duplicates
speciessynonyms <- unique(speciessynonyms)
speciessynonyms <- speciessynonyms[!is.na(speciessynonyms$GenSpec) ,]

#merge species list with synonyms
speciestable <- merge(setNames(as.data.frame(specieslist),c("GenSpec")), speciessynonyms, by="GenSpec", all.x = TRUE)

Other steps:

add genus name

get a clean standardized authority name for your species
get a column for year of description, which is helpful for judging older species list.

#add Genus name, work with string splitting 
#strsplit(specieslist, " ")
speciestable$Genus <- 0
for (i in c(1:nrow(speciestable)))
{
  speciestable[i,"Genus"] <- strsplit(as.character(speciestable$GenSpec), " ")[[i]][1]
}


#Get fresh Authority names
Gbifask <- specieslist
func.GbifQuery() #returns Gbifid

getnames <- Gbifid[(Gbifid$matchtype=="EXACT" & Gbifid$synonym==FALSE),c("enteredname", "scientificname") ]  
length(specieslist)-nrow(getnames) #some, because of one reversed name change

authority <- setNames(Gbifid[Gbifid$matchtype=="EXACT", c("enteredname", "scientificname") ] ,c("GenSpec", "ScientificName"))  

speciestable <- merge(speciestable, authority, by="GenSpec", all.x = TRUE, all.y = FALSE)

#authorities not available from Gbif, get from own input files
getAuthority <- speciestable[is.na(speciestable$ScientificName), "GenSpec"]

if(length(getAuthority)>0)   #if(nrow(getAuthority)>0)
{
  getAuthority <- setNames(as.data.frame(getAuthority), c("GenSpec"))
  # here Scaeva selenetica does not occur in the list, take the other name?
  getAuthority$ScientificName <- list_vV[list_vV$GenSpec==getAuthority$GenSpec, "FullGenSpec"]
  for (i in c(1:nrow(getAuthority)))
  {  genspec <- getAuthority[i,"GenSpec"]
  ScientificName <- getAuthority[i,"ScientificName"]
  speciestable[speciestable$GenSpec==genspec, "ScientificName"] <- ScientificName
  }
}

#check double authorities
speciestable[duplicated(speciestable$GenSpec),c("GenSpec","ScientificName")]
speciesauthoritiesduplicated <- speciestable[duplicated(speciestable$GenSpec),"GenSpec"]
speciestable[which(speciestable$GenSpec %in% speciesauthoritiesduplicated),]

#from the returned list, choose which to remove
listofnamestoremove <- c("Cheilosia griseiventris Vujic, 1994","Eristalis lineata Wahlberg, 1843",
                         "Eumerus argyropus Doleschall, 1857","Eumerus ruficornis Macquart, 1829",
                         "Merodon rufus (Macquart, 1835)", "Microdon analis Curran, 1940",
                         "Microdon myrmicae", "Paragus quadrifasciatus Say, 1824",
                         "Rhingia rostrata Scopoli, 1763", "Sphaerophoria interrupta Jones, 1917")

#remove them
speciestable <- speciestable[-which(speciestable$ScientificName %in% listofnamestoremove),]

#check if certain authorities were given to multiple species

speciestable[duplicated(speciestable$ScientificName),c("GenSpec","ScientificName")]

#cleanup: remove one non-meaningfull entry:
speciestable <- speciestable[speciestable$GenSpec!="Cheilosia melanura ssp. Rubra",]

# add year of description
speciestable$ID_Year <- 0
for (i in c(1:nrow(speciestable)))
{
  speciestable[i,"ID_Year"] <- substr(strsplit(as.character(speciestable$ScientificName), ", ")[[i]][2], 1, 4)
}

Compare the species lists and put the information in new columns and write the final table

# now see which species is in which list

#KEY van Veen
speciestable$Key_vVeen <- ifelse(speciestable$GenSpec %in% list_vV$GenSpec, "y", "n")
speciestable$Key_vVeen <- ifelse(speciestable$GenSpec %in% list_vVadd$GenSpec, "A", speciestable$Key_vVeen)

#Genus completely covered? #Make table with species per genus and 'no' counts
vVeensum <- setNames(as.data.frame(table(speciestable[speciestable$Key_vVeen=="n","Genus"])), c("Genus", "NumOfNo"))
speciestable$Key_vVeenGen <- ifelse(speciestable$Genus %in% vVeensum$Genus, "n", "y")

#KEY Bartsch
speciestable$Key_Bartsch <- ifelse(speciestable$GenSpec %in% list_Ba$GenSpec, "y", "n")
#Genus completely covered? #Make table with #species per genus and 'no' counts
Basum <- setNames(as.data.frame(table(speciestable[speciestable$Key_Bartsch=="n","Genus"])), c("Genus", "NumOfNo"))
speciestable$Key_BartschGen <- ifelse(speciestable$Genus %in% Basum$Genus, "n", "y")

#Checklists CH, DE, S-A
#Switzerland
speciestable$PA_CH <- ifelse(speciestable$GenSpec %in% list_CH$GenSpec, "p", "a")
speciestable$KeyCH <- ifelse( (speciestable$Key_vVeen=="n" & speciestable$Key_Bartsch=="n" & speciestable$PA_CH=="p"), "n", "y")
speciestable$KeyCH <- ifelse( (speciestable$PA_CH=="a"), "-", speciestable$KeyCH)
#Germany
speciestable$PA_DE <- ifelse(speciestable$GenSpec %in% list_DE$GenSpec, "p", "a")
speciestable$KeyDE <- ifelse( (speciestable$Key_vVeen=="n" & speciestable$Key_Bartsch=="n" & speciestable$PA_DE=="p"), "n", "y")
speciestable$KeyDE <- ifelse( (speciestable$PA_DE=="a"), "-", speciestable$KeyDE)
#small optic correction: 
speciestable$SynSpec <- as.character(speciestable$SynSpec)
speciestable$SynSpec <- ifelse(is.na(speciestable$SynSpec), "-", speciestable$SynSpec)

speciestable <- speciestable[ , c("GenSpec", "Genus", "SynSpec", "Key_vVeen", "Key_vVeenGen","Key_Bartsch", "Key_BartschGen", "PA_CH","KeyCH", "PA_DE", "KeyDE") ]
# "ScientificName","ID_Year" can be added in subsequent step

#FIN
write.table(speciestable, "SyrphidSpecies&Keys&Presence.txt", row.names = FALSE, col.names = TRUE, sep = "\t", dec=".")

The full R script and its input files can be downloaded here.

3. Discussion of common problems to take in to account

Unfortunately the GBIF database is often wrong about the names of Syrphidae, like:

Synonyms are very wrong (e.g. Chrysotoxum festivum and Xanthogramma festiva are separate species, not synonyms)
Old and new name should be reveresed (many examples, e.g. Brachypalpoides lentus and Xylota lenta)
Genus name changes should be consistent, but aren't always (resulting in two genera names instead of one, e.g. Anasimyia vs Lejops -> choose Anasimya and reverse respective synonyms)
Synonymization merging species that should remain separate (e.g. Eristalis picea and Eristalis rupium are separate species, not synonyms)
Some species do not have an exact GBIF match, but are valid anyway (e.g. Eristalis similis, Myathropa florea and Sphaerophoria interrupta)
The "Syrph the Net (StN) Species Accounts" is the most elaborate documentation about European Syrphid species, synonyms and doubtfull species names. I trusted this source more than GBIF and consulted it many times to get to my final list of species.
The Genus Cheilosia (not well handled by Gbif) is covered by a separate publication: Claußen, C.; Speight, M.C.D. (2007): Names of uncertain application and some previously unpublished synonyms, in the European Cheilosia fauna (Diptera, Syrphidae). - Volucella 8, 73-86. Stuttgart.

As a final note: I had a very strange error caused by the OCR software. In some species names it turned the letter 'l' into a special character, that I could not convert to a 'l' in R in any way, leading to the strange fact that R handled e.g. 'Episyrphus balteatus' and 'Episyrphus balteatus' as being equal, but both unique at the same time. The only working solution was to re-type the specific names in the spreadscheet files manually, to get rid of these type of duplicates in the list.

The supplied synonym files used in my R-script include all those changes needed for the list I received from GBIF. Note that GBIF is constantly changing, possibly requiring a slightly different synonyization list.

Code for some of the manual work

# track more misspellings
similarspecies <- as.data.frame(specieslist)
similarspecies$specieslist <- as.character(similarspecies$specieslist)
similarspecies$short <-  substr(similarspecies$specieslist, 1, nchar(similarspecies$specieslist)-5)
templist <- similarspecies[duplicated(similarspecies$short), "short"]
similarspecies[similarspecies$short %in% templist,]

# check for double species names accepted under different genera
# similarnames <- as.data.frame(speciesbind[, c("GenSpec", "Genus", "Species")])
similarnames <- as.data.frame(specieslist)
# add Genus name and species name, work with string splitting 
similarnames$Genus <- 0
similarnames$Species <- 0
for (i in c(1:nrow(similarnames)))
{
  similarnames[i,"Genus"] <- strsplit(as.character(similarnames$specieslist), " ")[[i]][1]
  similarnames[i,"Species"] <- strsplit(as.character(similarnames$specieslist), " ")[[i]][2]
}
similarnames$short <-  substr(similarnames$Species, 1, nchar(similarnames$Species)-2)
doublenames <- similarnames[duplicated(similarnames$short), "short"]
doublenamestable <- similarnames[similarnames$short %in% doublenames,]
doublenamestable[order(doublenamestable$short),]

# I also recommend to do new Gbif Queries

# You can use an extra file for replacement "Extra synonyms.txt"

# Or do manual replacements like this:

speciestable [speciestable $GenSpec=="Merodon subfasciatus", "GenSpec"] <- "Merodon cinereus"

# "Doubtfull species" can be removed like this

list_Chei <- read.table ("Cheilosia.txt",na.strings="NA",head=T,sep="\t", dec=",")
head(list_Chei)
for (i in c(1:4))    {list_Chei[,i] <- trimendspace(as.character(list_Chei[,i]))}
# Or remove species manually

speciestable <- speciestable[speciestable$GenSpec!="Eumerus emarginatus",]

4. The final list

The final formatted list for downloading

The final table has for both identification keys a colomn with yes/no for each species. The species supplied in the addendum by van Veen, are specified with "A". A genus completely covered by the key is marked with a vertical black line. For Switserland and Germany there is present/absent column, with an additional column whether or not the present species can be identified with one of both keys, with "n" highlighted. These are the species that require additional keys. Any species in the same genus identified with those two main keys need to be treated carefully, since it can also be one of the rarer species for which you need an additional key. Have fun identifying!

Jeeve Science-Nature-Photography

Mittwoch, 2. Mai 2018

Species lists from different sources - how to synchronise names with the GBIF database in R