ECOTOX data processing

Last updated: 2019-07-24

Purpose

This code cleans and standardizes data on the LD₅₀ of pesticides to the honey bee, Apis mellifera from EPA’s ECOTOX database, matches these data to USGS pesticide names using CAS numbers, and exports a slimmed down version of the cleaned data.

Libraries & functions

library(tidyverse)
library(readxl)
library(scales)

Load data

# Create path for most recent ECOTOX file
filename <- "ecotox_ld50_apis_mellifera_20170731" 
path <- paste("../data/",filename,".csv",sep="")

# Store date when data were downloaded from ECOTOX
data_date <- substr(filename, (nchar(filename)-7), nchar(filename))

# Read in data
# Convert blanks, NC (not coded), and NR (not reported) to NA (i.e. missing data)
data <- read.csv(path, na.strings=c(""," ","NC","NR"))

# Check structure of the data
str(data)

'data.frame':   1668 obs. of  55 variables:
 $ CAS.Number                      : int  50293 50293 50293 50293 50293 50293 50293 50293 50293 50293 ...
 $ Chemical.Name                   : Factor w/ 456 levels "(1-Methylethyl)phosphoramidic acid ethyl-3-methyl-4-(methylthio)phenyl ester",..: 73 73 73 73 73 73 73 73 73 73 ...
 $ Chemical.Formulation            : Factor w/ 25 levels "Active ingredient",..: 13 13 20 13 13 10 13 13 13 13 ...
 $ Chemical.Comment                : Factor w/ 671 levels "10.1","10WP",..: 251 250 249 250 250 248 247 247 247 247 ...
 $ Species.Scientific.Name         : Factor w/ 5 levels "Apis mellifera",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Species.Common.Name             : Factor w/ 5 levels "Carniolan Honey Bee",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Organism.Source                 : Factor w/ 5 levels "Captive breeding colony",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Organism.Lifestage              : Factor w/ 7 levels "Adult(s)","Larva(e)",..: 1 4 1 4 4 1 1 1 1 1 ...
 $ Exposure.Duration.Mean.Op..Days.: Factor w/ 1 level ">=": NA NA NA NA NA NA NA NA NA NA ...
 $ Exposure.Duration.Mean..Days.   : num  2 2 2 2 2 2 1 1 1 1 ...
 $ Exposure.Duration.Min.Op..Days. : logi  NA NA NA NA NA NA ...
 $ Exposure.Duration.Min..Days.    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Exposure.Duration.Max.Op..Days. : logi  NA NA NA NA NA NA ...
 $ Exposure.Duration.Max..Days.    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Exposure.Duration.Unit..Days.   : Factor w/ 2 levels "Day(s)","Emergence": 1 1 1 1 1 1 1 1 1 1 ...
 $ Test.Type                       : Factor w/ 3 levels "Acute","Not coded",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Test.Location                   : Factor w/ 1 level "Lab": 1 1 1 1 1 1 1 1 1 1 ...
 $ Test.Method                     : Factor w/ 5 levels "European and Mediterranean Plant Protection Organization",..: 5 5 5 5 5 5 2 2 2 2 ...
 $ Exposure.Type                   : Factor w/ 11 levels "Dermal","Diet, unspecified",..: 9 11 9 11 9 9 11 11 11 11 ...
 $ Dose.Mean.Op                    : logi  NA NA NA NA NA NA ...
 $ Dose.Mean                       : logi  NA NA NA NA NA NA ...
 $ Dose.Min.Op                     : logi  NA NA NA NA NA NA ...
 $ Dose.Min                        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Dose.Max.Op                     : logi  NA NA NA NA NA NA ...
 $ Dose.Max                        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Dose.Units                      : Factor w/ 10 levels "AI mg/L","AI ppm",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Exposure.Sample.Number.Mean.Op  : logi  NA NA NA NA NA NA ...
 $ Exposure.Sample.Number.Mean     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Exposure.Sample.Number.Min.Op   : logi  NA NA NA NA NA NA ...
 $ Exposure.Sample.Number.Min      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Exposure.Sample.Number.Max.Op   : logi  NA NA NA NA NA NA ...
 $ Exposure.Sample.Number.Max      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Dose.Number                     : Factor w/ 17 levels "' 4-5","' 5-6",..: NA NA NA NA NA NA 12 12 12 12 ...
 $ Endpoint                        : Factor w/ 1 level "LD50": 1 1 1 1 1 1 1 1 1 1 ...
 $ Effect                          : Factor w/ 1 level "Mortality": 1 1 1 1 1 1 1 1 1 1 ...
 $ Effect.Measurement              : Factor w/ 1 level "Mortality": 1 1 1 1 1 1 1 1 1 1 ...
 $ Observed.Response.Mean.Op       : Factor w/ 4 levels "<","<=",">","~": NA NA NA NA NA NA NA NA NA NA ...
 $ Observed.Response.Mean          : num  7.12 6.4 5.36 3.9 3.7 5.95 9.6 6.9 7.5 20.1 ...
 $ Observed.Response.Min.Op        : Factor w/ 2 levels "<",">": NA NA NA NA NA NA NA NA NA NA ...
 $ Observed.Response.Min           : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Observed.Response.Max.Op        : Factor w/ 1 level "<": NA NA NA NA NA NA NA NA NA NA ...
 $ Observed.Response.Max           : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Result.Statistical.Method       : logi  NA NA NA NA NA NA ...
 $ Observed.Response.Value         : logi  NA NA NA NA NA NA ...
 $ Observed.Response.Units         : Factor w/ 33 levels "%","AI lb/acre",..: 32 32 32 32 32 32 32 32 32 32 ...
 $ Conc.1.Type..Author.            : Factor w/ 4 levels "Active ingredient",..: 1 3 3 3 3 3 2 2 2 2 ...
 $ Conc.1..Author.                 : Factor w/ 1027 levels "<=170","<=50",..: 945 907 855 737 730 869 1009 914 950 649 ...
 $ Conc.1.Units..Author.           : Factor w/ 33 levels "%","AI lb/acre",..: 32 32 32 32 32 32 32 32 32 32 ...
 $ Result.Comment                  : Factor w/ 376 levels "//","COLONY 1//",..: 1 1 1 1 1 1 52 85 121 161 ...
 $ Reference.Number                : int  344 344 344 344 344 344 101387 101387 101387 101387 ...
 $ Source                          : Factor w/ 66 levels "Acta Hortic.288:133-138",..: 27 27 27 27 27 27 39 39 39 39 ...
 $ Author                          : Factor w/ 64 levels "Aliano,N.P., and M.D. Ellis",..: 58 58 58 58 58 58 25 25 25 25 ...
 $ Title                           : Factor w/ 66 levels "Acute Contact Toxicity of Oxalic Acid to Varroa destructor (Acari: Varroidae) and Their Apis mellifera (Hymenop"| __truncated__,..: 44 44 44 44 44 44 56 56 56 56 ...
 $ Publication.Year                : int  1992 1992 1992 1992 1992 1992 1965 1965 1965 1965 ...
 $ Summary.of.Additional.Parameters: Factor w/ 329 levels "Conc 1 (Author): \xe6 Active ingredient >0.04 - <0.05 ug/org | Conc 2 (Author): \xe6 NR (NR - NR) NR | Conc 3 ("| __truncated__,..: 206 322 322 322 322 322 277 277 277 277 ...

Clean data

# Fix variable names with double periods 
# This code replaces '..' with '.'
names(data) <- gsub("..",".",names(data), fixed=TRUE)

# Select variables of interest & rename them
data_slim<-select(data, 
                  cas = CAS.Number, 
                  cmpd_epa = Chemical.Name,
                  species = Species.Scientific.Name,
                  sp_common = Species.Common.Name,
                  org_source = Organism.Source,
                  stage = Organism.Lifestage,
                  exp_days = Exposure.Duration.Mean.Days.,
                  exp_type = Exposure.Type,
                  form = Conc.1.Type.Author.,
                  n = Exposure.Sample.Number.Mean,
                  dose_n = Dose.Number,
                  ld50_op = Observed.Response.Mean.Op,
                  ld50 = Observed.Response.Mean,
                  ld50_min_op = Observed.Response.Min.Op,
                  ld50_min = Observed.Response.Min,
                  ld50_max_op = Observed.Response.Max.Op,
                  ld50_max = Observed.Response.Max,
                  units = Observed.Response.Units,
                  method = Test.Method,
                  ref = Reference.Number
                  )

# Convert CAS number to factor
data_slim$cas <- as.factor(data_slim$cas) 

# Check structure of new data
str(data_slim)

'data.frame':   1668 obs. of  20 variables:
 $ cas        : Factor w/ 456 levels "50293","51036",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cmpd_epa   : Factor w/ 456 levels "(1-Methylethyl)phosphoramidic acid ethyl-3-methyl-4-(methylthio)phenyl ester",..: 73 73 73 73 73 73 73 73 73 73 ...
 $ species    : Factor w/ 5 levels "Apis mellifera",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sp_common  : Factor w/ 5 levels "Carniolan Honey Bee",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ org_source : Factor w/ 5 levels "Captive breeding colony",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ stage      : Factor w/ 7 levels "Adult(s)","Larva(e)",..: 1 4 1 4 4 1 1 1 1 1 ...
 $ exp_days   : num  2 2 2 2 2 2 1 1 1 1 ...
 $ exp_type   : Factor w/ 11 levels "Dermal","Diet, unspecified",..: 9 11 9 11 9 9 11 11 11 11 ...
 $ form       : Factor w/ 4 levels "Active ingredient",..: 1 3 3 3 3 3 2 2 2 2 ...
 $ n          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ dose_n     : Factor w/ 17 levels "' 4-5","' 5-6",..: NA NA NA NA NA NA 12 12 12 12 ...
 $ ld50_op    : Factor w/ 4 levels "<","<=",">","~": NA NA NA NA NA NA NA NA NA NA ...
 $ ld50       : num  7.12 6.4 5.36 3.9 3.7 5.95 9.6 6.9 7.5 20.1 ...
 $ ld50_min_op: Factor w/ 2 levels "<",">": NA NA NA NA NA NA NA NA NA NA ...
 $ ld50_min   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ld50_max_op: Factor w/ 1 level "<": NA NA NA NA NA NA NA NA NA NA ...
 $ ld50_max   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ units      : Factor w/ 33 levels "%","AI lb/acre",..: 32 32 32 32 32 32 32 32 32 32 ...
 $ method     : Factor w/ 5 levels "European and Mediterranean Plant Protection Organization",..: 5 5 5 5 5 5 2 2 2 2 ...
 $ ref        : int  344 344 344 344 344 344 101387 101387 101387 101387 ...

# Make table of reference/source info
ref_key <- data %>%
  select(ref = Reference.Number,
        pub = Source,
        auth = Author,
        title = Title,
        year = Publication.Year) %>%
  group_by(ref, pub, auth, title, year) %>%
  summarise(obs = length(ref))
  
ref_key <- ref_key[order(ref_key$ref),] # order by ref number

# Save reference key with associated date for future reference
ref_path <- paste("../keys/","ecotox_ref_key","_",data_date,".csv",sep="")
write.csv(ref_key, ref_path)

Recode variables

# Code exposure type into fewer categories (contact, oral, other)
# Note: 'Dropwise' occurs only in Torchio (1973) and involves putting a drop of known volume on each bee

exp_type <- levels(data_slim$exp_type)
exp_type

 [1] "Dermal"                                    
 [2] "Diet, unspecified"                         
 [3] "Direct application"                        
 [4] "Drinking water"                            
 [5] "Dropwise"                                  
 [6] "Environmental, unspecified"                
 [7] "Food"                                      
 [8] "Multiple routes between application groups"
 [9] "Oral via capsule"                          
[10] "Spray"                                     
[11] "Topical, general"

exp_grp <- c("contact",
           "oral",
           "contact",
           "oral",
           "contact",
           "other",
           "oral",
           "other",
           "oral",
           "contact",
           "contact")
exp_key <- as.data.frame(cbind(exp_type,exp_grp))
exp_key

                                     exp_type exp_grp
1                                      Dermal contact
2                           Diet, unspecified    oral
3                          Direct application contact
4                              Drinking water    oral
5                                    Dropwise contact
6                  Environmental, unspecified   other
7                                        Food    oral
8  Multiple routes between application groups   other
9                            Oral via capsule    oral
10                                      Spray contact
11                           Topical, general contact

# Use table join to add new, recoded variable
data_slim <- left_join(data_slim, exp_key, by="exp_type")

# Convert number of doses into a numeric value (minimum # doses)
dose_n <- levels(data_slim$dose_n)
dose_n

 [1] "' 4-5" "' 5-6" "' 5-8" "' 6-7" "' 6-8" ">=4"   ">=5"   ">=6"  
 [9] ">=7"   "10"    "11"    "2"     "5"     "6"     "7"     "8"    
[17] "9"

dose_min <- c(4,5,5,6,6,4,5,6,7,10,11,2,5,6,7,8,9)
dose_key <- as.data.frame(cbind(dose_n, dose_min))
dose_key

   dose_n dose_min
1   ' 4-5        4
2   ' 5-6        5
3   ' 5-8        5
4   ' 6-7        6
5   ' 6-8        6
6     >=4        4
7     >=5        5
8     >=6        6
9     >=7        7
10     10       10
11     11       11
12      2        2
13      5        5
14      6        6
15      7        7
16      8        8
17      9        9

# Use table join to add new, minimum dose variable (and make numeric)
data_slim <- left_join(data_slim, dose_key, by="dose_n")
data_slim$dose_min <- as.numeric(as.character(data_slim$dose_min))

Standardize units

To standardize into common units of ug/bee, I exported all of the units from the dataset, then determined (a) which ones could be standardized, and (b) standardization factors for converting into ug/bee - these are related in the unit_key. Most cases simply involve unit conversions (i.e. mg/org to ug/org). For units like ug I checked the original studies to interpret the units, making corrections when necessary. Units that were checked in original references are reflected in the ecotox_unit_key.csv in the column called checked_ref (y = yes, n = no). Some LD₅₀s were reported on a body weight basis; I assumed an individual adult bee weighs 0.12 g, see COLOSS Bee Manual Section 2.2

# Check distribution of observations across units
table(data_slim$units)


           %   AI lb/acre      AI mg/L    AI mg/org        AI ng 
           6           40            1            1            3 
   AI ng/org  AI ng/org/d   AI oz/acre       AI ppm  AI ppm food 
         122            4            1           24            3 
       AI ug      AI ug/g AI ug/g bdwt     AI ug/ml    AI ug/org 
           1            2            6            3           50 
      lb/gal           mg       mg/bee       mg/cm2       mg/org 
           1           12            3            9           12 
 ml/1000 org       ng/org       pg/org          ppb          ppm 
           3           54            1            9           11 
          ug       ug/bee        ug/eu         ug/g    ug/g bdwt 
          74          207           11           12           21 
    ug/g org       ug/org        ug/ul 
          34          923            3

# Fix a few mistaken entries
# Laurino et al. 2011 - Table 2, incorrect units recorded in ECOTOX
which((data_slim == "AI ppm food"), arr.ind=TRUE)

      row col
[1,] 1570  18
[2,] 1571  18
[3,] 1572  18

data_slim[1570,18] <- "AI ng/org" # correct dataset
data_slim[1571,18] <- "AI ng/org" # correct dataset
data_slim[1572,18] <- "AI ng/org" # correct dataset

# Lindberg et al. 2000 - for some reason one entry from Fig. 2 was recorded differently from others
data_slim[245,21] <- "other" # correct dataset
data_slim[245,8] <- "Multiple routes between application groups" # correct dataset

# Read in unit key and join to dataset
unit_key <- read.csv("../keys/ecotox_unit_key.csv",
                   na.strings=c("na"))
unit_key <- select(unit_key, units, units_keep, units_conv) # get rid of unnecessary columns
unit_key

          units units_keep units_conv
1             %          n         NA
2    AI lb/acre          n         NA
3       AI mg/L          n         NA
4     AI mg/org          y    1.0e+03
5         AI ng          y    1.0e-03
6     AI ng/org          y    1.0e-03
7   AI ng/org/d          n         NA
8    AI oz/acre          n         NA
9        AI ppm          n         NA
10  AI ppm food          n         NA
11        AI ug          y    1.0e+00
12      AI ug/g          y    1.2e-01
13 AI ug/g bdwt          y    1.2e-01
14     AI ug/ml          n         NA
15    AI ug/org          y    1.0e+00
16       lb/gal          n         NA
17           mg          y         NA
18       mg/bee          y    1.0e+03
19       mg/cm2          n         NA
20       mg/org          y    1.0e+03
21  ml/1000 org          n         NA
22       ng/org          y    1.0e-03
23       pg/org          y    1.0e-06
24          ppb          n         NA
25          ppm          n         NA
26           ug          y    1.0e+00
27       ug/bee          y    1.0e+00
28        ug/eu          n         NA
29         ug/g          y    1.2e-01
30    ug/g bdwt          y    1.2e-01
31     ug/g org          y    1.2e-01
32       ug/org          y    1.0e+00
33        ug/ul          n         NA

data_slim <- left_join(data_slim, unit_key, by="units")

# Generate new outcomes standardized to ug/bee
data_slim <- mutate(data_slim, 
       ld50_ugbee = ld50*units_conv,
       ld50_min_ugbee = ld50_min*units_conv,
       ld50_max_ugbee = ld50_max*units_conv)

# Check for obviously wierd values
ggplot(data=data_slim, aes(ld50_ugbee)) +
  geom_histogram(bins=100) +
  scale_x_log10(breaks = trans_breaks("log10", n=10, function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  theme_classic()

# Inspect values on the extremes of the distribution
check_units <- filter(data_slim, ld50_ugbee < 10^-4 | ld50_ugbee > 10^4)

# Ref 70351 (5 extreme obs): diazinon & benomyl; all check out from original paper
# Ref 64900 (1 extreme obs): amitraz; unit error, it was ug/bee but ECOTOX recorded pg/bee (see Table 2 in paper)
# Ref 58587 (1 extreme obs): tau-fluvalinate; not a unit error, but kind of a wierd experimental design with mix of exposure pathways

# Identify location of amitraz outlier
which((data_slim == 2.5500e-06), arr.ind=TRUE)

     row col

which((data_slim == 1.57e-06), arr.ind=TRUE)

     row col
[1,] 975  26

which((data_slim == 4.32e-06), arr.ind=TRUE)

     row col
[1,] 975  27

# Correct dataset
data_slim[975,18] <- "ug/org" 
data_slim[975,24] <- 1 
data_slim[975,25] <- 2.55 
data_slim[975,26] <- 1.57
data_slim[975,27] <- 4.32

print(t(data_slim[975,])) # check that it worked

               975                                                                                    
cas            "33089611"                                                                             
cmpd_epa       "N'-(2,4-Dimethylphenyl)-N-[[(2,4-dimethylphenyl)imino]methyl]-N-methylmethanimidanide"
species        "Apis mellifera ssp. ligustica"                                                        
sp_common      "Italian Honeybee"                                                                     
org_source     "Not coded"                                                                            
stage          "Adult(s)"                                                                             
exp_days       "1"                                                                                    
exp_type       "Topical, general"                                                                     
form           "Formulation"                                                                          
n              NA                                                                                     
dose_n         "' 5-8"                                                                                
ld50_op        NA                                                                                     
ld50           "2.55"                                                                                 
ld50_min_op    NA                                                                                     
ld50_min       "1.57"                                                                                 
ld50_max_op    NA                                                                                     
ld50_max       "4.32"                                                                                 
units          "ug/org"                                                                               
method         "Not coded"                                                                            
ref            "64900"                                                                                
exp_grp        "contact"                                                                              
dose_min       "5"                                                                                    
units_keep     "y"                                                                                    
units_conv     "1"                                                                                    
ld50_ugbee     "2.55"                                                                                 
ld50_min_ugbee "1.57"                                                                                 
ld50_max_ugbee "4.32"

Descriptives

summary(data_slim)

        cas      
 50293    :  79  
 138261413:  75  
 1332407  :  73  
 153719234:  66  
 210880925:  49  
 2702729  :  45  
 (Other)  :1281  
                                                                                   cmpd_epa   
 1,1'-(2,2,2-Trichloroethylidene)bis[4-chlorobenzene]                                  :  79  
 (2E)-1-[(6-Chloro-3-pyridinyl)methyl]-N-nitro-2-imidazolidinimine                     :  75  
 Copper chloride oxide hydrate                                                         :  73  
 3-[(2-Chloro-5-thiazolyl)methyl]tetrahydro-5-methyl-N-nitro-4H-1,3,5-oxadiazin-4-imine:  66  
 [C(E)]-N-[(2-Chloro-5-thiazolyl)methyl]-N'-methyl-N''-nitroguanidine                  :  49  
 2-(2,4-Dichlorophenoxy)acetic acid sodium salt (1:1)                                  :  45  
 (Other)                                                                               :1281  
                          species                   sp_common   
 Apis mellifera               :1345   Carniolan Honey Bee: 206  
 Apis mellifera ssp. carnica  : 206   European Dark Bee  :  20  
 Apis mellifera ssp. caucasica:   4   European Honey Bee :   4  
 Apis mellifera ssp. ligustica:  93   Honey Bee          :1345  
 Apis mellifera ssp. mellifera:  20   Italian Honeybee   :  93  
                                                                
                                                                
                   org_source                                 stage    
 Captive breeding colony:   4   Adult(s)                         :771  
 Laboratory             :  60   Larva(e)                         : 96  
 Not coded              :1182   New, newly or recent hatch       :  4  
 Not reported           : 413   Not coded                        :257  
 Wild                   :   9   Not reported                     :533  
                                Organisms at different lifestages:  2  
                                Post-emergence                   :  5  
    exp_days                                             exp_type  
 Min.   : 0.0139   Topical, general                          :778  
 1st Qu.: 1.0000   Food                                      :458  
 Median : 2.0000   Oral via capsule                          :146  
 Mean   : 2.0055   Multiple routes between application groups:121  
 3rd Qu.: 2.0000   Dermal                                    : 46  
 Max.   :21.0000   Direct application                        : 42  
 NA's   :117       (Other)                                   : 77  
                form           n             dose_n     ld50_op    
 Active ingredient:925   Min.   :10.00   2      : 115   <   :  34  
 Formulation      :407   1st Qu.:10.00   ' 5-6  : 113   <=  :   2  
 Not coded        :252   Median :10.00   >=4    :  83   >   : 446  
 Total            : 84   Mean   :18.24   8      :  83   ~   :  18  
                         3rd Qu.:30.00   5      :  61   NA's:1168  
                         Max.   :60.00   (Other): 172              
                         NA's   :1526    NA's   :1041              
      ld50           ld50_min_op    ld50_min       ld50_max_op
 Min.   :0.000e+00   <   :   1   Min.   :   0.00   <   :  29  
 1st Qu.:1.000e+00   >   :  30   1st Qu.:   0.12   NA's:1639  
 Median :1.200e+01   NA's:1637   Median :   2.08              
 Mean   :6.276e+05               Mean   :  90.65              
 3rd Qu.:1.000e+02               3rd Qu.:  22.30              
 Max.   :1.020e+09               Max.   :7480.00              
 NA's   :39                      NA's   :1375                 
    ld50_max               units    
 Min.   :    0.001   ug/org   :924  
 1st Qu.:    0.190   ug/bee   :207  
 Median :    3.900   AI ng/org:125  
 Mean   :  143.196   ug       : 74  
 3rd Qu.:   43.890   ng/org   : 54  
 Max.   :12000.000   (Other)  :283  
 NA's   :1375        NA's     :  1  
                                                      method   
 European and Mediterranean Plant Protection Organization:  2  
 Not coded                                               :512  
 Not reported                                            :435  
 Organization for Economic Cooperation and Development   : 19  
 United States Environmental Protection Agency           :700  
                                                               
                                                               
      ref            exp_grp       dose_min      units_keep 
 Min.   :   344   contact:886   Min.   : 2.000   n   : 126  
 1st Qu.:   344   oral   :619   1st Qu.: 4.000   y   :1541  
 Median : 39264   other  :163   Median : 5.000   NA's:   1  
 Mean   : 52142                 Mean   : 5.242              
 3rd Qu.: 91623                 3rd Qu.: 7.000              
 Max.   :173554                 Max.   :11.000              
                                NA's   :1041                
   units_conv         ld50_ugbee       ld50_min_ugbee     
 Min.   :   0.001   Min.   :0.00e+00   Min.   :   0.0000  
 1st Qu.:   1.000   1st Qu.:0.00e+00   1st Qu.:   0.0346  
 Median :   1.000   Median :7.00e+00   Median :   0.1400  
 Mean   :  11.292   Mean   :6.80e+05   Mean   :  29.5882  
 3rd Qu.:   1.000   3rd Qu.:9.10e+01   3rd Qu.:   5.8350  
 Max.   :1000.000   Max.   :1.02e+09   Max.   :1489.0000  
 NA's   :139        NA's   :165        NA's   :1398       
 ld50_max_ugbee     
 Min.   :   0.0011  
 1st Qu.:   0.0639  
 Median :   0.2500  
 Mean   :  51.2546  
 3rd Qu.:   9.1475  
 Max.   :2962.7200  
 NA's   :1398

Notes

Life stage is not reported or coded for about half of all rows
Sample size not reported for majority of rows
Number of doses not reported for majority of rows
Most rows don’t have a min/max
Most rows are either contact or oral (few others)
Almost half of all observations are dervied from studies using EPA methodology

# Are some of these patterns explained by source/method?
method <- data_slim %>%
  group_by(method) %>%
  summarise(obs = length(ld50_ugbee),
            n_doses = sum(!is.na(dose_min)),
            n_orgs = sum(!is.na(n)),
            n_exp_days = sum(!is.na(exp_days)),
            n_ld50_min = sum(!is.na(ld50_min)))
as.data.frame(t(method))

                                                                 V1
method     European and Mediterranean Plant Protection Organization
obs                                                               2
n_doses                                                           0
n_orgs                                                            0
n_exp_days                                                        2
n_ld50_min                                                        0
                  V2           V3
method     Not coded Not reported
obs              512          435
n_doses          373          235
n_orgs             0          142
n_exp_days       397          435
n_ld50_min       130           53
                                                              V4
method     Organization for Economic Cooperation and Development
obs                                                           19
n_doses                                                       19
n_orgs                                                         0
n_exp_days                                                    19
n_ld50_min                                                    19
                                                      V5
method     United States Environmental Protection Agency
obs                                                  700
n_doses                                                0
n_orgs                                                 0
n_exp_days                                           698
n_ld50_min                                            91

Notes

Studies w/ EPA methods don’t report # doses or sample size - maybe this info is standard & available elsewhere? See this EPA guidance

Min/Max check

This section is intended to aid in the interpretation of the min/max values reported in ECOTOX. I identified references with min/max values, then consulted the original papers to determine the units in which these values were reported, and then added them back to the dataset.

# First, create table of refs that report min/max
min_max <- data_slim %>%
  filter(ld50_min_ugbee >= 0|ld50_max_ugbee >= 0) %>%
  group_by(ref) %>%
  summarise(obs_range = length(ref)) %>%
  left_join(ref_key, by='ref') %>%
  arrange(desc(obs_range))

# Export min/max table for editing, with associated date
minmax_path <- paste("../keys/","ecotox_min_max","_",data_date,".csv",sep="")
write.csv(min_max, minmax_path, row.names=FALSE)

# Next, check original references and create new column for the range units (google scholar search, refs saved in EndNote, csv filed edited outside of R)

# CI_95 = 95% confidence interval 
# FL = fiducial limits

# Load revised min/max table
minmax_key_path <- paste("../keys/","ecotox_min_max_units","_",data_date,".csv",sep="")
min_max_key <- read.csv(minmax_key_path)

range_key <- select(min_max_key,
                    ref,range_units)

# Add column to dataset indicating range units
data_slim <- data_slim %>%
  left_join(range_key, by="ref")

data_slim <- within(data_slim, range_units[is.na(ld50_min_ugbee)] <- NA) # clean up so only rows w/ range values have units

Life stage check

This section fills in missing life stage information. I identified references with missing life stage values, then consulted the original papers to determine life stages, and added them back into the dataset.

# First, create table of refs that do not report life stage
stage_check <- data_slim %>%
  filter(stage=="Not coded"|stage=="Not reported") %>%
  group_by(ref, stage) %>%
  summarise(stage_missing = length(stage)) %>%
  left_join(ref_key, by='ref') %>%
  arrange(desc(stage_missing)) %>%
  select(ref,stage,obs,stage_missing,auth,year, pub, title)

# Export life stage table for editing, with associated date
stage_path <- paste("../keys/","ecotox_stage","_",data_date,".csv",sep="")
write.csv(stage_check, stage_path, row.names=FALSE)

# Next, check original references and create new column of 'confirmed' life stage info (google scholar search, refs saved in EndNote)

# When studies included multiple methodologies (e.g. adults + larvae), check the individual data points to see what they correspond to in the paper

# Load revised life stage lookup table
stage_key_path <- paste("../keys/","ecotox_stage_fixed","_",data_date,".csv",sep="")
stage_new <- read.csv(stage_key_path)

stage_key <- stage_new %>%
  select(ref, stage, stage_new)
stage_key$stage_new <- as.factor(stage_key$stage_new)

# Fix the stage column with updated information
data_slim_stage_new <- data_slim %>%
  filter(stage=="Not coded"|stage=="Not reported") %>%
  left_join(stage_key, by=c("ref","stage")) %>%
  select(-stage) %>%
  rename(stage = stage_new)

Warning: Column `stage` joining factors with different levels, coercing to
character vector

# Save records that already had stage info
data_slim_stage_old <- data_slim %>%
  filter((stage!="Not coded" & stage!="Not reported")|is.na(stage)==TRUE)

# Save corrected full dataframe
data_slim_fixed <- data_slim_stage_new %>%
  rbind(data_slim_stage_old)

summary(data_slim_fixed)

        cas      
 50293    :  79  
 138261413:  75  
 1332407  :  73  
 153719234:  66  
 210880925:  49  
 2702729  :  45  
 (Other)  :1281  
                                                                                   cmpd_epa   
 1,1'-(2,2,2-Trichloroethylidene)bis[4-chlorobenzene]                                  :  79  
 (2E)-1-[(6-Chloro-3-pyridinyl)methyl]-N-nitro-2-imidazolidinimine                     :  75  
 Copper chloride oxide hydrate                                                         :  73  
 3-[(2-Chloro-5-thiazolyl)methyl]tetrahydro-5-methyl-N-nitro-4H-1,3,5-oxadiazin-4-imine:  66  
 [C(E)]-N-[(2-Chloro-5-thiazolyl)methyl]-N'-methyl-N''-nitroguanidine                  :  49  
 2-(2,4-Dichlorophenoxy)acetic acid sodium salt (1:1)                                  :  45  
 (Other)                                                                               :1281  
                          species                   sp_common   
 Apis mellifera               :1345   Carniolan Honey Bee: 206  
 Apis mellifera ssp. carnica  : 206   European Dark Bee  :  20  
 Apis mellifera ssp. caucasica:   4   European Honey Bee :   4  
 Apis mellifera ssp. ligustica:  93   Honey Bee          :1345  
 Apis mellifera ssp. mellifera:  20   Italian Honeybee   :  93  
                                                                
                                                                
                   org_source      exp_days      
 Captive breeding colony:   4   Min.   : 0.0139  
 Laboratory             :  60   1st Qu.: 1.0000  
 Not coded              :1182   Median : 2.0000  
 Not reported           : 413   Mean   : 2.0055  
 Wild                   :   9   3rd Qu.: 2.0000  
                                Max.   :21.0000  
                                NA's   :117      
                                       exp_type                  form    
 Topical, general                          :778   Active ingredient:925  
 Food                                      :458   Formulation      :407  
 Oral via capsule                          :146   Not coded        :252  
 Multiple routes between application groups:121   Total            : 84  
 Dermal                                    : 46                          
 Direct application                        : 42                          
 (Other)                                   : 77                          
       n             dose_n     ld50_op          ld50          
 Min.   :10.00   2      : 115   <   :  34   Min.   :0.000e+00  
 1st Qu.:10.00   ' 5-6  : 113   <=  :   2   1st Qu.:1.000e+00  
 Median :10.00   >=4    :  83   >   : 446   Median :1.200e+01  
 Mean   :18.24   8      :  83   ~   :  18   Mean   :6.276e+05  
 3rd Qu.:30.00   5      :  61   NA's:1168   3rd Qu.:1.000e+02  
 Max.   :60.00   (Other): 172               Max.   :1.020e+09  
 NA's   :1526    NA's   :1041               NA's   :39         
 ld50_min_op    ld50_min       ld50_max_op    ld50_max        
 <   :   1   Min.   :   0.00   <   :  29   Min.   :    0.001  
 >   :  30   1st Qu.:   0.12   NA's:1639   1st Qu.:    0.190  
 NA's:1637   Median :   2.08               Median :    3.900  
             Mean   :  90.65               Mean   :  143.196  
             3rd Qu.:  22.30               3rd Qu.:   43.890  
             Max.   :7480.00               Max.   :12000.000  
             NA's   :1375                  NA's   :1375       
       units    
 ug/org   :924  
 ug/bee   :207  
 AI ng/org:125  
 ug       : 74  
 ng/org   : 54  
 (Other)  :283  
 NA's     :  1  
                                                      method   
 European and Mediterranean Plant Protection Organization:  2  
 Not coded                                               :512  
 Not reported                                            :435  
 Organization for Economic Cooperation and Development   : 19  
 United States Environmental Protection Agency           :700  
                                                               
                                                               
      ref            exp_grp       dose_min      units_keep 
 Min.   :   344   contact:886   Min.   : 2.000   n   : 126  
 1st Qu.:   344   oral   :619   1st Qu.: 4.000   y   :1541  
 Median : 39264   other  :163   Median : 5.000   NA's:   1  
 Mean   : 52142                 Mean   : 5.242              
 3rd Qu.: 91623                 3rd Qu.: 7.000              
 Max.   :173554                 Max.   :11.000              
                                NA's   :1041                
   units_conv         ld50_ugbee       ld50_min_ugbee     
 Min.   :   0.001   Min.   :0.00e+00   Min.   :   0.0000  
 1st Qu.:   1.000   1st Qu.:0.00e+00   1st Qu.:   0.0346  
 Median :   1.000   Median :7.00e+00   Median :   0.1400  
 Mean   :  11.292   Mean   :6.80e+05   Mean   :  29.5882  
 3rd Qu.:   1.000   3rd Qu.:9.10e+01   3rd Qu.:   5.8350  
 Max.   :1000.000   Max.   :1.02e+09   Max.   :1489.0000  
 NA's   :139        NA's   :165        NA's   :1398       
 ld50_max_ugbee           range_units  
 Min.   :   0.0011   CI         :  16  
 1st Qu.:   0.0639   CI_95      : 138  
 Median :   0.2500   EPA/unknown:  87  
 Mean   :  51.2546   FL         :  16  
 3rd Qu.:   9.1475   FL_95      :  12  
 Max.   :2962.7200   unknown    :   1  
 NA's   :1398        NA's       :1398  
                               stage     
 Adult(s)                         :1269  
 EPA/unknown                      : 282  
 Larva(e)                         : 105  
 Post-emergence                   :   5  
 New, newly or recent hatch       :   4  
 Organisms at different lifestages:   2  
 (Other)                          :   1

Filter observations

Select only those observations that meet the following criteria:

Exposure time 4 days or less (i.e. acute, after Sanchez-Bayo & Goka 2014)
Contact or oral exposure
Adults (assume ‘EPA/unknown’ are adults)
LD₅₀ estimate in ug/bee (and > 0)

Also re-select variables of interest to remove non-standardized LD₅₀ values and re-organize columns.

data_clean <- data_slim_fixed %>%
  filter(exp_days<=4,
         (exp_grp=="contact"|exp_grp=="oral"),
         (stage=="Adult(s)"|stage=="EPA/unknown"),
         ld50_ugbee>0) %>%
  select(cas, 
         cmpd_epa, 
         species, 
         sp_common,
         org_source,
         stage,
         exp_days,
         form,
         n,
         dose_min,
         exp_type,
         exp_grp,
         ld50_op,
         ld50_min_op,
         ld50_max_op,
         ld50_ugbee,
         ld50_min_ugbee,
         ld50_max_ugbee,
         range_units,
         method,
         ref)

Add USGS names & categories

# Load key relating USGS names and CAS numbers
# See metadata in excel file for how it was generated
cas_key <- read_excel("../keys/USGS_Pesticide-CASRN.xlsx",1)
head(cas_key, 10)

# A tibble: 10 x 5
   compound_state      compound_cty         cas       cmpd_mult cas_mult
   <chr>               <chr>                <chr>     <chr>     <chr>   
 1 1METHYLCYCLOPROPENE 1-METHYLCYCLOPROPENE 3100047   n         n       
 2 24D                 2,4-D                94757     n         y       
 3 24D                 2,4-D                2702729   n         y       
 4 24DB                2,4-DB               94826     n         n       
 5 6BENZYLADENINE      6-BENZYLADENINE      1214397   n         n       
 6 ABAMECTIN           ABAMECTIN            71751412  n         n       
 7 ACEPHATE            ACEPHATE             30560191  n         n       
 8 ACEQUINOCYL         ACEQUINOCYL          57960197  n         n       
 9 ACETAMIPRID         ACETAMIPRID          135410207 n         n       
10 ACETOCHLOR          ACETOCHLOR           34256821  n         n

cas_key_slim <- select(cas_key,
                 cmpd_usgs = compound_state,
                 cas = cas)

# Load category key & add combined category/class column
# See metadata in excel file for how it was generated
cat_key <- read_excel("../keys/USGS_Pesticide-Category.xlsx",1)
head(cat_key, 10)

# A tibble: 10 x 7
   compound    compound_cty   compound_2017  category class    group source
   <chr>       <chr>          <chr>          <chr>    <chr>    <chr> <chr> 
 1 1METHYLCYC… 1-METHYLCYCLO… 1-METHYLCYCLO… PGR      unclass… NC    none  
 2 24D         2,4-D          2,4-D          H        Phenoxy… O     HRAC  
 3 24DB        2,4-DB         2,4-DB         H        Phenoxy… O     HRAC  
 4 6BENZYLADE… 6-BENZYLADENI… 6-BENZYLADENI… PGR      biologi… NC    none  
 5 ABAMECTIN   ABAMECTIN      ABAMECTIN      I        avermec… 6     IRAC  
 6 ACEPHATE    ACEPHATE       ACEPHATE       I        organop… 1B    IRAC  
 7 ACEQUINOCYL ACEQUINOCYL    ACEQUINOCYL    I        unclass… 20B   IRAC  
 8 ACETAMIPRID ACETAMIPRID    ACETAMIPRID    I        neonico… 4A    IRAC  
 9 ACETOCHLOR  ACETOCHLOR     ACETOCHLOR     H        Chloroa… K3    HRAC  
10 ACIBENZOLAR ACIBENZOLAR    ACIBENZOLAR    F        benzo-t… P01   FRAC

# Create new variable that is Category-Group
cat_key$cat_grp <- sub('-NA$', '', 
                       paste(cat_key$category, cat_key$group, 
                             sep = "-"))

# Remove unnecessary columns
cat_key_slim <- select(cat_key,
                 cmpd_usgs = compound,
                 cat = category,
                 cat_grp = cat_grp,
                 class = class,
                 grp_source = source)

# Create combined key w/ USGS names and categories
comb_key <- left_join(cas_key_slim, cat_key_slim, by="cmpd_usgs")
comb_key$cat_grp <- as.factor(comb_key$cat_grp)
head(comb_key, n=10) # view sample of key

# A tibble: 10 x 6
   cmpd_usgs          cas      cat   cat_grp class               grp_source
   <chr>              <chr>    <chr> <fct>   <chr>               <chr>     
 1 1METHYLCYCLOPROPE… 3100047  PGR   PGR-NC  unclassified        none      
 2 24D                94757    H     H-O     Phenoxy-carboxylic… HRAC      
 3 24D                2702729  H     H-O     Phenoxy-carboxylic… HRAC      
 4 24DB               94826    H     H-O     Phenoxy-carboxylic… HRAC      
 5 6BENZYLADENINE     1214397  PGR   PGR-NC  biological          none      
 6 ABAMECTIN          71751412 I     I-6     avermectin          IRAC      
 7 ACEPHATE           30560191 I     I-1B    organophosphate     IRAC      
 8 ACEQUINOCYL        57960197 I     I-20B   unclassified        IRAC      
 9 ACETAMIPRID        1354102… I     I-4A    neonicotinoid       IRAC      
10 ACETOCHLOR         34256821 H     H-K3    Chloroacetamide     HRAC

# Merge USGS names and categories with LD50 dataset
# Ignore warning - just converting factor to character vector
# Note this is a full join - i.e. USGS compounds that don't match any ECOTOX value will be added to the dataset w/ NAs for missing data
data_full <- full_join(data_clean, comb_key, by="cas")

Warning: Column `cas` joining factor and character vector, coercing into
character vector

str(data_full)

'data.frame':   1643 obs. of  26 variables:
 $ cas           : chr  "50293" "50293" "50293" "50293" ...
 $ cmpd_epa      : Factor w/ 456 levels "(1-Methylethyl)phosphoramidic acid ethyl-3-methyl-4-(methylthio)phenyl ester",..: 73 73 73 73 257 404 400 434 425 425 ...
 $ species       : Factor w/ 5 levels "Apis mellifera",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sp_common     : Factor w/ 5 levels "Carniolan Honey Bee",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ org_source    : Factor w/ 5 levels "Captive breeding colony",..: 3 3 3 4 3 3 3 4 3 3 ...
 $ stage         : Factor w/ 9 levels "Adult(s)","EPA/unknown",..: 2 2 2 1 2 2 2 1 1 1 ...
 $ exp_days      : num  2 2 2 1 2 2 2 1 1 1 ...
 $ form          : Factor w/ 4 levels "Active ingredient",..: 3 3 3 1 1 1 1 1 2 1 ...
 $ n             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ dose_min      : num  NA NA NA NA NA NA NA NA 8 5 ...
 $ exp_type      : Factor w/ 11 levels "Dermal","Diet, unspecified",..: 11 11 9 1 11 11 11 1 2 11 ...
 $ exp_grp       : Factor w/ 3 levels "contact","oral",..: 1 1 2 1 1 1 1 1 2 1 ...
 $ ld50_op       : Factor w/ 4 levels "<","<=",">","~": NA NA NA NA 3 NA NA NA NA NA ...
 $ ld50_min_op   : Factor w/ 2 levels "<",">": NA NA NA NA NA NA NA NA NA NA ...
 $ ld50_max_op   : Factor w/ 1 level "<": NA NA NA NA NA NA NA NA NA NA ...
 $ ld50_ugbee    : num  6.4 3.9 3.7 13.7 11 ...
 $ ld50_min_ugbee: num  NA NA NA NA NA ...
 $ ld50_max_ugbee: num  NA NA NA NA NA ...
 $ range_units   : Factor w/ 6 levels "CI","CI_95","EPA/unknown",..: NA NA NA NA NA NA NA NA NA 2 ...
 $ method        : Factor w/ 5 levels "European and Mediterranean Plant Protection Organization",..: 5 5 5 2 5 5 5 2 2 2 ...
 $ ref           : int  344 344 344 35123 344 344 344 35123 101175 119503 ...
 $ cmpd_usgs     : chr  NA NA NA NA ...
 $ cat           : chr  NA NA NA NA ...
 $ cat_grp       : Factor w/ 105 levels "F-1","F-11","F-12",..: NA NA NA NA 104 80 NA 80 NA NA ...
 $ class         : chr  NA NA NA NA ...
 $ grp_source    : chr  NA NA NA NA ...

Save new dataset

new_filename<-paste(filename,"_clean", sep="")
new_path<-paste("../output/",new_filename,".csv", sep="")
write.csv(data_full, file=new_path)

Session information

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.0.0    readxl_1.3.1    forcats_0.4.0   stringr_1.4.0  
 [5] dplyr_0.8.3     purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
 [9] tibble_2.1.3    ggplot2_3.2.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       cellranger_1.1.0 pillar_1.4.2     compiler_3.6.1  
 [5] tools_3.6.1      zeallot_0.1.0    digest_0.6.20    lubridate_1.7.4 
 [9] jsonlite_1.6     evaluate_0.14    nlme_3.1-140     gtable_0.3.0    
[13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.4.0      cli_1.1.0       
[17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.1      xfun_0.8        
[21] withr_2.1.2      xml2_1.2.0       httr_1.4.0       knitr_1.23      
[25] vctrs_0.2.0      hms_0.5.0        generics_0.0.2   fs_1.3.1        
[29] grid_3.6.1       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0        
[33] fansi_0.4.0      rmarkdown_1.14   modelr_0.1.4     magrittr_1.5    
[37] backports_1.1.4  htmltools_0.3.6  rvest_0.3.4      assertthat_0.2.1
[41] colorspace_1.4-1 labeling_0.3     utf8_1.1.4       stringi_1.4.3   
[45] lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2      crayon_1.3.4

This R Markdown site was created with workflowr