The database comprises information obtained with permission from the Catalogue of Endangered Languages, which is hosted on the Endangered Languages Project platform (https://www.endangeredlanguages.com/). The Endangered Languages Project was first developed and launched by Google, and is currently overseen by the First Peoples’ Cultural Council and the Institute for Language Information and Technology at Eastern Michigan University. Information about the languages in this project is provided by the Catalogue, which is produced by the University of Hawai’i at Mānoa and Eastern Michigan University, with funding provided by the U.S. National Science Foundation (Grants #1058096 and #1057725) and the Luce Foundation. The project is supported by a team of global experts comprising its Governance Council and Advisory Committee.
In general, the Catalogue aims to present all languages that communities and scholars have identified as being at some level of risk, as well as languages that have become dormant. In addition to being the largest database of endangered languages globally, the Catalogue is updated periodically based on feedback gathered from language communities and scholars worldwide. The data therefore represents what was most accurately known about the state of each language’s vitality at the time the data was retrieved. At that time, 3423 languages in the Catalogue were determined to be at various levels of risk. Each language’s risk level is assessed using the Language Endangerment Index, which was developed for the Catalogue’s purposes. The Index assesses the level of endangerment of a language based on four factors: intergenerational transmission (whether the language is being passed on to younger generations), absolute number of speakers, speaker number trends (whether numbers are stable, increasing, or decreasing), and domains of language use (whether the language is used in a wide range of domains or only a limited number). The levels of endangerment that the Index generates include ‘safe’, ‘vulnerable’, ‘threatened’, ‘endangered’, ‘severely endangered’, and ‘critically endangered’. Languages whose extinction status remains unclear, or whose last fluent speaker is reported to have died in recent times, are referred to as ‘dormant’. Given that the Catalogue focuses on languages that are at some level of threat, safe languages are generally excluded. Where locality information is available, each language is also accompanied by its latitude and longitude coordinates.
Steps taken to prepare the data for network analysis
The data obtained from the Catalogue was further organized and cleaned up for analysis.
Where available, the ISO 639-3 code for each language was used as its unique identifier. Otherwise, its LINGUIST List local use code was used; local use codes are temporary codes that are not in the current version of the ISO 639-3 Standard for languages. For languages with neither, unique 3-letter codes were constructed.
Each language’s endangerment level appeared together with a level of certainty score in the same cell in the original data file. Both pieces of information were split into separate columns and only endangerment levels were utilized.
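The splitting step can be illustrated with a minimal Python sketch. The cell layout assumed here ("<level> (<NN> percent certain ...)") is hypothetical, as are the names CELL_PATTERN and split_endangerment_cell; the actual layout of the Catalogue export may differ.

```python
import re

# Hypothetical cell layout: "<level> (<NN> percent certain ...)".
# The parenthesised certainty part is treated as optional.
CELL_PATTERN = re.compile(
    r"^\s*(?P<level>[^()]+?)\s*(?:\((?P<certainty>\d+)\s*percent[^)]*\))?\s*$"
)

def split_endangerment_cell(cell):
    """Split one combined cell into (level, certainty).

    certainty is an int percentage, or None when the cell has no score.
    Returns (None, None) when the cell does not match the assumed layout.
    """
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None, None
    certainty = match.group("certainty")
    return match.group("level"), (int(certainty) if certainty else None)
```

Only the first element of the returned pair (the endangerment level) would be carried forward, in line with the description above.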
For languages where the Catalogue reported different data depending on the resource consulted, the data was listed in additional columns. The endangerment level data points used in these cases were the ones with the most complete and up-to-date information. If no data was available regarding endangerment level, this was also recorded.
Where exact coordinates were not available, coordinates were approximated using Google Maps based on the location description provided in the Catalogue source (e.g., the Tel Aviv district), obtained from other sources such as Glottolog or the UNESCO Atlas of the World’s Languages in Danger, or approximated from maps provided in other sources. ‘NA’ was entered in the coordinates field if none could be found.
Coordinates found to be inaccurate were rejected, for example where the coordinates provided pointed to a location outside the country in which the language is reported to be found. The above steps were then taken to populate the coordinates field.
In instances where a language appears in more than one country, these are listed in separate rows as separate entries. Where there are two sets of coordinates for a country, the set that best corresponds with the written description in the Catalogue source, has greater detail, or is more recent is chosen. Where there are more than two sets of coordinates, a middle point is chosen as being representative of the language’s location, by plotting all coordinates on MapCustomizer (www.mapcustomizer.com).
In the Catalogue, the information regarding language family may be multi-tiered. For example, Laghuu falls under the Lolo-Burmese branch of the Sino-Tibetan family. For this study, the broader family is used; in the case of Laghuu, the label ‘Sino-Tibetan’ is used.
Mixed languages, pidgins, and creoles have all been categorized as ‘contact languages’.
Language isolates are listed as ‘isolates’.
The Catalogue groups ‘Mexico, Central America, Caribbean’ together under region. Central America and Caribbean are listed as separate regions in this study, with Mexico falling under Central America.
A spatial network of endangered languages was constructed from the database. Each node represented an endangered language, and edges or links depicted the distance between the locations of the languages as specified in the database. A distance matrix containing the distances between all endangered languages was computed by using functions from the ‘geosphere’ R package. Specifically, Haversine distances were computed for each pair of longitude and latitude points in the dataset. The radius of the earth used in the Haversine distance calculation is 6,378,137 m (for more details see: https://www.rdocumentation.org/packages/geosphere/versions/1.5-14/topics/distHaversine). Haversine distance refers to the shortest distance between two points on a spherical earth, also referred to as the “great-circle-distance”29.
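For illustration, the Haversine computation can be sketched in plain Python (the analysis itself used the ‘geosphere’ R package). The function names haversine_m and distance_matrix are hypothetical; the radius is the value quoted above, and the (longitude, latitude) argument order is an assumption chosen to mirror geosphere's convention.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # radius quoted in the text, in metres

def haversine_m(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))

def distance_matrix(points):
    """Symmetric matrix of pairwise distances for a list of (lon, lat) pairs."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = haversine_m(*points[i], *points[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

As a sanity check, two points on the equator that are 90 degrees of longitude apart lie a quarter of a great circle apart, i.e. radius × π/2.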
Sensitivity analyses of edge thresholds
The distance matrix is a fully connected network with weighted, undirected links. Because we set out to capture the strongest or “closest” spatial relationships among the endangered languages, an edge threshold was applied to the distance matrix such that only the edges with distances below the xth percentile of all distances were retained in the spatial network. Such an approach allows for the analysis of the most meaningful (i.e., the physically closest) spatial relations in the dataset and how they relate to language endangerment status. The edges were then transformed into unweighted connections to create a simple unweighted, undirected graph for analysis. In order to determine the value of x (i.e., the percentile at which the edge threshold is to be applied), we constructed 10 spatial networks that retained edges with distances below the 1st, 2nd, 3rd, …, 10th percentile (in increments of 1%) of all distances in the matrix. Additional information on the distances depicted by the edges in each of the 10 networks is provided in Supplementary Information.
These 10 networks were then analyzed for their macro- and meso-scale network properties. A summary of the macro- and meso-scale network measures used in this analysis and their definitions is provided in Table 1, which shows that the 10 networks have similar patterns in their network structures.
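The thresholding step can be sketched as follows, in Python for illustration. The helper names are hypothetical, the percentile uses linear interpolation between order statistics, and ties at the cutoff are kept (<=); both choices are assumptions that the text does not specify.

```python
def percentile(values, pct):
    """Percentile (pct in 0..100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)                      # floor, since rank >= 0
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def threshold_edges(dist, pct):
    """Return the unweighted, undirected edges (i, j), i < j, whose distance
    falls at or below the pct-th percentile of all pairwise distances."""
    n = len(dist)
    pairwise = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    cutoff = percentile(pairwise, pct)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i][j] <= cutoff}

# Ten candidate networks, one per threshold from 1% to 10%:
# networks = {x: threshold_edges(dist, x) for x in range(1, 11)}
```

Dropping the distances after thresholding is what turns the weighted matrix into the simple unweighted graph described above.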
As expected, network density and average degree of the networks, which serve as indicators of the number of edges relative to the number of nodes in the network, increased as the edge threshold used to connect nodes became more liberal. The relatively high values of C (i.e., high levels of local clustering among nodes) and low values of ASPL (i.e., relatively short paths despite large size of network) suggested the presence of small world structure30. The community detection analysis using the Louvain method31 indicated strong evidence of community structure in the networks—suggesting the presence of clusters of endangered languages.
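A minimal Python sketch of three of the macro-scale measures mentioned above, for a simple undirected graph. The function names are hypothetical, and C is taken here to be the average local clustering coefficient, which is an assumption; Table 1 gives the definitions actually used.

```python
def adjacency(n, edges):
    """Adjacency sets for nodes 0..n-1 from an iterable of undirected edges."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def density(n, edges):
    """Fraction of possible node pairs that are connected."""
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0

def average_degree(n, edges):
    """Mean number of edges per node (each edge touches two nodes)."""
    return 2 * len(edges) / n if n else 0.0

def average_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are themselves linked.
    Nodes with fewer than two neighbours contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0
```

Both density and average degree grow with the number of retained edges, which is why they rise as the threshold becomes more liberal.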
The 5% edge threshold was the point at which the vast majority of nodes were located within the largest connected component of the network. Because the 5% network was not too fragmented, we report the analyses conducted on the largest connected component of the 5% network in the following subsections; the smaller connected components were excluded. Please see Supplementary Information for additional details on the rationale for selecting the 5% network for further analyses. Note, however, that our results are robust across spatial networks of various edge thresholds (due to lack of space, a complete summary of all reported analyses conducted on all 10 spatial networks is provided in Supplementary Information).
Macro-level analysis: assortative mixing of endangerment statuses
To investigate the macro-level structure of the spatial network of endangered languages, we computed the assortativity coefficient of the spatial network. Specifically, we wanted to know if the endangerment statuses of the languages tended to cluster at the global level of the entire network. If the assortativity coefficient is positive, the languages in the network would tend to be connected to languages of similar levels of endangerment. If the assortativity coefficient is negative, the languages in the network would tend to be connected to languages of dissimilar levels of endangerment.
There is a significant positive correlation (Spearman’s rank correlation) between the endangerment status of connected pairs of endangered languages in the network, r = 0.20, p < 0.001. This indicates that languages that are more endangered tend to be connected to (hence, close to) languages that are also more endangered. Figure 2 shows a bubble plot of endangerment statuses among spatially close languages: The larger bubbles toward the diagonal as compared to the edges of the plot indicate the presence of positive assortative mixing patterns in the network.
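The edge-level correlation can be sketched in Python as below. This is an illustration under stated assumptions rather than the procedure actually run: endangerment statuses are coded as ordered integers, each undirected edge is counted in both directions so that the measure is symmetric, and Spearman's coefficient is computed as Pearson's correlation on tie-averaged ranks. All function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else float("nan")

def edge_spearman(edges, status):
    """Spearman correlation between the statuses at the two ends of each edge."""
    xs, ys = [], []
    for u, v in edges:
        xs += [status[u], status[v]]   # count the edge in both directions
        ys += [status[v], status[u]]
    return pearson(average_ranks(xs), average_ranks(ys))
```

A value near +1 means connected languages share similar statuses, and a value near -1 means they tend to differ.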
Many real-world networks from diverse domains have robust community structure32. Broadly speaking, communities are defined as groups of nodes in the network that are more interconnected with each other than with nodes outside of the community33. Networks with robust community structure will have high values of modularity, Q, a network science measure that quantifies the density of connections within and across communities33.
Here, we applied the community detection method to our spatial network of endangered languages to investigate the following questions:
Do data-driven approaches such as community detection return “meaningful” communities (groups of endangered languages) that correspond or align with regions identified in previous work?
Are there particular language communities that show more severe endangerment levels? In other words, do more endangered languages tend to cluster around specific communities or are they found across all communities in the network?
We applied a community detection algorithm to the largest connected component of the 5% network. The largest connected component (LCC) is the network component containing the largest number of nodes that are connected to each other in a single component.
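Extracting the largest connected component can be sketched with a breadth-first search, shown here in Python for illustration. The function name and the adjacency-dictionary input format are assumptions.

```python
from collections import deque

def largest_connected_component(adj):
    """Return the node set of the largest connected component.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    seen = set()
    best = set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

Nodes outside the returned set correspond to the smaller components, which are excluded from the subsequent analyses.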
Although many community detection methods exist, we used the Louvain method31 as it is an efficient method that works well for large graphs. The general idea behind this approach is to reassign nodes to communities such that the highest contribution to modularity can be achieved. The reassignment process stops when the modularity of the network cannot be improved further. Specifics of the method can be found in Blondel et al.31.
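The quantity that the Louvain method seeks to maximize can be made concrete with a short Python sketch that evaluates modularity, Q, for a given assignment of nodes to communities in an unweighted, undirected graph. This computes Q only and is not an implementation of the Louvain algorithm; the function name and input format are assumptions.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition.

    edges: iterable of undirected edges (u, v), each listed once.
    community: dict mapping every node to its community label.
    Q = sum over communities of [L_c / m - (d_c / (2m))^2], where L_c is the
    number of edges inside community c, d_c is the total degree of its nodes,
    and m is the total number of edges.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)
```

For example, two triangles joined by a single edge and split into their two natural communities give Q = 5/14, whereas placing all nodes in one community gives Q = 0.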
The community detection method returned 13 communities, ranging in size from 11 to 624. Modularity, Q, was 0.77, indicating high levels of community structure of the network. Figure 1 of the findings section (replicated here as Fig. 3) shows the 13 communities and their properties.
Qualitatively, we observe that these communities correspond well to known or existing language regions. We also observe that certain communities have a much higher proportion of languages that are especially endangered (i.e., with larger, darker stacks: Communities 12, 7, 10, 6). A more detailed discussion is provided in the “Main findings” section.
Micro-level analysis: the protective value of high closeness centralities
Closeness centrality is a network science measure that quantifies the “centrality” of nodes in the network. Mathematically, it is the reciprocal of the mean shortest-path length between a target node and all other nodes in the network, so that nodes that are, on average, fewer steps away from all other nodes have higher closeness centrality. Hence, it provides a way to measure a node’s importance by considering its distance in relation to other nodes in the network (see Table 2).
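A minimal Python sketch of closeness centrality on an unweighted graph, assuming the common definition as the reciprocal of the mean shortest-path length from the target node to every other reachable node. The function name and input format are hypothetical, and the sketch presumes a connected graph such as the largest connected component.

```python
from collections import deque

def closeness_centrality(adj, node):
    """Reciprocal of the mean shortest-path length from node to all nodes it can reach.

    adj maps each node to the set of its neighbours. Returns 0.0 for an isolated node.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:                      # breadth-first search gives hop counts
        cur = queue.popleft()
        for nbr in adj[cur]:
            if nbr not in dist:
                dist[nbr] = dist[cur] + 1
                queue.append(nbr)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0
```

On a three-node path, the middle node scores 1.0 and each end node scores 2/3, so more central nodes receive higher values.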
Overall, a one-way ANOVA comparing the closeness centralities of endangered languages across their endangerment statuses was statistically significant, F(5, 3634) = 12.47, p < 0.001. Post-hoc multiple comparisons (Tukey’s test, with corrected family-wise p-values) were conducted. This analysis revealed that more endangered languages have lower closeness centralities. In other words, highly endangered languages tend to lie on the periphery of the spatial network, whereas less endangered languages tend to be found closer to its center.
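The F statistic of a one-way ANOVA can be sketched in Python as follows. The function name is hypothetical and the sketch computes only F and its degrees of freedom, not the p-value or the Tukey comparisons.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Because the degrees of freedom are k - 1 and n - k, the reported F(5, 3634) corresponds to six groups and 3640 observations.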
The relationship between linguistic diversity, spatial network structure, and endangerment status
Language families and isolates
A parent language and all its derived daughter languages are one unique representation of each language family. A language isolate does not belong to a wider family but can be considered to be a unique representation of itself, therefore constituting its own family. Our paper considers each unique representation in its count of linguistic diversity. Contact languages are left out of these counts, their classification being unclear.
Operationalization of language families and isolates: based on pre-defined regions in the database
The languages were grouped by ‘region’ as defined in the Catalogue.
The number of unique language families and isolates per region was counted. Items within the categories of ‘contact language’, ‘unclassified’ and ‘sign language’ were excluded from the count. This value is entered into the regression analysis below.
Operationalization of language families and isolates: based on community detection results
The languages are grouped by ‘community’ based on the output of the community analysis.
The number of unique language families and isolates per community was counted. Items within the categories of ‘contact language’, ‘unclassified’ and ‘sign language’ were excluded from the count. This value is entered into the regression analysis below.
We note that these two measures of language families and isolates are highly positively correlated, r = +0.71, p < 0.001.
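The counting rule used for both operationalizations can be sketched in Python as below. The record keys ('code', 'family') and the grouping key passed as group_key (e.g. 'region' or 'community') are hypothetical field names, as are the exact label spellings; each isolate is counted as its own family, following the definition given earlier.

```python
# Labels excluded from the diversity count (singular and plural forms assumed)
EXCLUDED = {"contact language", "contact languages", "unclassified",
            "sign language", "sign languages"}
ISOLATE_LABELS = {"isolate", "isolates"}

def diversity_by_group(records, group_key):
    """Count unique language families plus isolates within each group.

    records: iterable of dicts with hypothetical keys 'code', 'family',
    and the grouping key (for example 'region' or 'community').
    """
    units = {}
    for rec in records:
        family = rec["family"].strip().lower()
        if family in EXCLUDED:
            continue
        if family in ISOLATE_LABELS:
            unit = ("isolate", rec["code"])   # every isolate is its own family
        else:
            unit = ("family", family)
        units.setdefault(rec[group_key], set()).add(unit)
    return {group: len(members) for group, members in units.items()}
```

Calling the function once with the Catalogue region as the grouping key and once with the detected community gives the two diversity measures that enter the regression.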
Model performance indices (see Table 3) indicate that the model containing the diversity (language families and isolates) coefficient derived from communities of nodes in the spatial network is a better model than the one containing the diversity coefficient derived from pre-defined regions in the Catalogue.
As seen in Table 4, the odds ratio of closeness centrality is less than 1, indicating that greater closeness centralities are associated with lower probabilities of a more severe language endangerment status; that is, languages that are more centrally positioned have less severe endangerment statuses. The odds ratio of diversity is greater than 1, indicating that greater linguistic diversity in the region is associated with higher probabilities of a more severe language endangerment status.