For this initial compilation, we focused on gathering traits from field guides and species accounts rather than the primary research literature because each guide represents the culmination of a comprehensive effort by local experts to describe a regional flora/fauna25. Authors of these guides have already done the hard work of scouring the literature, corresponding with fellow naturalists, and compiling occurrence records to support range, phenology, and habitat associations26. We began with a comprehensive review of all the holdings in the library of the Florida Museum of Natural History’s McGuire Center for Lepidoptera and Biodiversity, at the University of Florida. This review, together with subsequent searches of online databases, yielded a reference list that currently contains more than 800 relevant resources.
We first identified the categories of trait information available in each resource, and their format, to target volumes for trait extraction and processing. Given the unequal availability of resources among regions, our explicit goal was to identify a corpus that would maximize the number of extractable trait records for as many butterfly species as possible, distributed as evenly across the globe as possible. This led us to select 117 volumes across several global regions (Fig. 2, Supplementary Material S1) and to focus on the following traits: measurements (wingspan/forewing length), phenology (months of adult flight and total duration of flight in months), voltinism (the number of adult flight periods per year), habitat affinities, and host plants (Table 1, Supplementary Material S2).
To process these resources, we developed a protocol to scan each volume, extract verbatim natural language descriptions, provide quality control for extraction, and then resolve given taxonomic names to a standardized list27. This provided a database of trait information in which each “cell” included all text from a single resource relevant to one trait category of a single taxon. In order to “atomize” the raw text into standardized metrics or a controlled list of descriptive terms, we developed a methodology appropriate to each trait. This resulted in a more fine-grained dataset in which each “cell” included a single, standardized trait value. Since the values of these taxon-specific traits frequently differed among resources, we then calculated “consensus” traits for each species, for example, the average forewing length (Table 1). A graphical representation of this process with an example trait is illustrated in Fig. 1.
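The step from atomized records to a consensus value can be pictured with a minimal Python sketch. It assumes a simple record layout (the class name `AtomizedRecord`, its fields, and the function `consensus_mean` are hypothetical and not taken from the project's database), and the species names and values are placeholders rather than real data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AtomizedRecord:
    """One atomized 'cell': a single standardized value from one resource.

    Field names are illustrative assumptions, not the project's schema.
    """
    resource_id: str
    species: str      # name after taxonomic normalization
    trait: str        # e.g. "forewing_length_cm"
    value: float


def consensus_mean(records, trait):
    """Average a numeric trait across resources, per species."""
    values_by_species = defaultdict(list)
    for record in records:
        if record.trait == trait:
            values_by_species[record.species].append(record.value)
    return {species: mean(values) for species, values in values_by_species.items()}


# Placeholder records: two resources disagree on species_a, one reports species_b.
records = [
    AtomizedRecord("resource_1", "species_a", "forewing_length_cm", 3.0),
    AtomizedRecord("resource_2", "species_a", "forewing_length_cm", 4.0),
    AtomizedRecord("resource_1", "species_b", "forewing_length_cm", 2.0),
    AtomizedRecord("resource_1", "species_a", "wingspan_cm", 6.0),
]
consensus = consensus_mean(records, "forewing_length_cm")
```

Keeping the resource identifier on every record means the resource-level values remain available alongside the consensus value.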
Resource compilation and ingestion
Text sources from the master list were digitized by multiple participating institutions. Each page of each volume was scanned, and the images were converted to editable text with Abbyy FineReader optical character recognition (OCR) software (abbyy.com). The resulting PDFs, with copy-and-pastable text, were then uploaded to a secure online database that included citation information about each resource. The geographic breadth covered by each resource was designated using the World Geographic Scheme (WGS)28; this information was used to assess the geographic evenness of our trait compilation efforts. Resource metadata, including the WGS designation, were stored with each resource in the online database, where individuals could access the scanned copies for trait extraction.
Verbatim data extraction
Individual workers were assigned to a resource and instructed to copy verbatim trait information from the original source. They then pasted that text into the relevant data field in a standardized electronic form on an online portal designed to facilitate extraction and processing. Most field guides and other book-length resources are organized by taxonomic hierarchy (for example, family, then genus, then species, and finally subspecies within species), with the traits of each taxon described in a contiguous block of text. We call these text blocks describing a single taxon “accounts” (e.g., family account, species account), and we recorded data at the taxonomic resolution provided in the original source. These taxonomic ranks included family, subfamily, tribe, genus, species, and subspecies. When information for a taxon was encountered outside its own account, the “extractor” (project personnel trained to manually extract verbatim text) assigned to the book entered this text into a separate entry for the taxon. Trait information from figure captions and tables was also extracted from the resource. Graphical representations of phenology and voltinism were common, and these visual data were converted to text descriptions. Each resource was extracted in stages, and each stage was subjected to a quality assurance and control process (see Technical Validation). This process corrected mistakes and sought to identify data overlooked by the extractor. These problems were corrected before the extractor could proceed with further trait extraction from the resource and were also used for training purposes.
Verbatim text extracts were subjected to an “atomization” process in which raw text was standardized into disaggregated, readily computable data. This conversion into the final trait data format (numerical, categorical, etc.) was two-pronged and involved both manual editing and semi-automated atomization of verbatim text. Regular expressions were used for most semi-automated atomization, including extraction of wing measurements, which were converted into centimeters. Keyword searches were also performed in the semi-automated pipeline for phenology, voltinism, and oviposition traits. For example, “univoltine” or “uni*” was searched for across the voltinism raw text, along with other search terms. All semi-automated atomization outputs were subject to the quality assurance and control detailed further in Technical Validation. Manual atomization was performed by multiple team members for traits that presented higher complexity. For example, habitat affinities and host plant associations were atomized manually under a quality control protocol based on predefined rule sets, which are described further in Supplementary Material S3.
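As an illustration of semi-automated atomization, the following Python sketch extracts wing measurements with a regular expression, converts them to centimeters, and flags voltinism keywords. The pattern, the unit table, the keyword list beyond “uni*”, and the function names are assumptions made for this example and are not the expressions used in the actual pipeline.

```python
import re

# Hypothetical pattern: a number or a range ("35-40", "3.5 to 4") followed by a unit.
MEASUREMENT = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:[-\u2013]|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*(?P<unit>mm|cm|inch(?:es)?)\b",
    re.IGNORECASE,
)

# Conversion factors to centimeters, keyed by the first two letters of the unit.
TO_CM = {"mm": 0.1, "cm": 1.0, "in": 2.54}

# "uni\w*" mirrors the "uni*" search mentioned in the text; the other
# patterns are illustrative assumptions.
VOLTINISM_PATTERNS = {
    "univoltine": re.compile(r"\buni\w*", re.IGNORECASE),
    "bivoltine": re.compile(r"\bbivoltin\w*", re.IGNORECASE),
    "multivoltine": re.compile(r"\bmultivoltin\w*", re.IGNORECASE),
}


def atomize_measurement(text):
    """Return (low_cm, high_cm) tuples for each measurement found in the text."""
    results = []
    for match in MEASUREMENT.finditer(text):
        factor = TO_CM[match.group("unit").lower()[:2]]
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        results.append((round(low * factor, 3), round(high * factor, 3)))
    return results


def atomize_voltinism(text):
    """Return the sorted voltinism labels whose keyword pattern occurs in the text."""
    return sorted(label for label, pattern in VOLTINISM_PATTERNS.items()
                  if pattern.search(text))


measurements = atomize_measurement("Wingspan 35-40 mm")
labels = atomize_voltinism("Univoltine in the north, bivoltine southward")
```

A broad wildcard such as `uni\w*` also matches unrelated words (for example, “unique”), which is one reason outputs of this kind need a quality control pass before use.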
Normalization and consensus traits
To provide consensus traits at the species (and sometimes genus) level, we standardized nomenclature through a process we called “name-normalization,” which harmonizes taxonomy across all of our resources29. This name-normalization procedure relied on a comprehensive catalog of valid names and synonyms27. Following taxonomic harmonization, we compiled consensus traits based on rule sets specified in the metadata of each trait. For example, species-level consensus of primary and secondary host plant families required that at least one-third of the records for a given taxon list a particular family of plants (when multiple records were available).
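The one-third rule for host plant families can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `consensus_host_families`, the input layout (one set of plant families per resource record), and the placeholder family names are hypothetical, and the sketch does not distinguish primary from secondary hosts.

```python
from collections import Counter
from fractions import Fraction


def consensus_host_families(records, threshold=Fraction(1, 3)):
    """Return families listed in at least `threshold` of a taxon's records.

    `records` is a list with one set of plant family names per resource
    record (an assumed layout). Exact fractions avoid floating-point
    surprises at the one-third boundary.
    """
    if not records:
        return []
    n_records = len(records)
    counts = Counter(family for record in records for family in set(record))
    return sorted(
        family for family, count in counts.items()
        if Fraction(count, n_records) >= threshold
    )


# Placeholder records for one taxon drawn from four resources.
records = [
    {"FamilyA", "FamilyB"},
    {"FamilyA"},
    {"FamilyA", "FamilyC"},
    {"FamilyD"},
]
host_families = consensus_host_families(records)
```

With a single record, every listed family trivially passes the threshold, so the rule only filters when multiple records are available.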
Categorical traits such as voltinism list all known voltinism patterns for a species regardless of geographic context. Users of these data should therefore be aware that not every listed trait value may apply to their study region. For example, some species may be univoltine at higher latitudes or elevations, but bivoltine elsewhere. We therefore present both the resource-level records and the species consensus traits for use in analysis.
For this initial synopsis of butterfly species traits, we extracted records from 117 literature/web-based resources, resulting in 75,103 individual trait extraction records across 12,448 unique species, out of the ca. 19,200 species described to date27. Figure 2 indicates the geographic regions covered by our 117 resources, mapped at the resolution of level-two regions in the World Geographic Scheme28. A full list of resources is provided as a bibliography in Supplementary Material S1. The geographic distribution of trait records is shown in Fig. 3. Resource and consensus species trait records varied in number and in the scope of taxonomic coverage. Table 1 indicates the number of unique records and species-level records for each trait. Table 2 indicates the number of species-level records by family. Measurement traits, including wingspan and forewing length, were the most comprehensive traits extracted from our resource set. This represents one of the largest trait datasets and the most comprehensive dataset for butterflies to date.