Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services
Tropical Medicine and Health volume 45, Article number: 33 (2017)
Tropical medicine appeared as a distinct sub-discipline in the late nineteenth century, during a period of rapid European colonial expansion in Africa and Asia. After a dramatic drop following World War II, research on tropical diseases has received more attention and research funding in the twenty-first century.
We used Apache Taverna to integrate Europe PMC and MapAffil web services, containing the spatiotemporal analysis workflow from a list of PubMed queries to a list of publication years and author affiliations geoparsed to latitudes and longitudes. The results could then be visualized in the Quantum Geographic Information System (QGIS).
Our workflows automatically matched 253,277 affiliations to geographical coordinates for the first authors of 379,728 papers on tropical diseases in a single execution. The bibliometric analyses show how research output in tropical diseases follows major historical shifts in the twentieth century and the renewed interest in and funding for tropical disease research in the twenty-first century. They show the effects of disease outbreaks, WHO eradication programs, vaccine developments, wars, refugee migrations, and peace treaties.
Literature search and geoparsing web services can be combined in scientific workflows performing complete spatiotemporal bibliometric analyses of research in tropical medicine. The workflows and datasets are freely available and can be used to reproduce or refine the analyses, test specific hypotheses, or look into particular diseases or geographic regions. This work exceeds all previously published bibliometric analyses on tropical diseases in both scale and spatiotemporal range.
Tropical medicine first appeared as a distinct sub-discipline and professional specialization toward the end of the nineteenth century, and its heyday coincided with European colonialism in Africa and Asia around this time. After a decline in the decades following World War II, recent years have seen increasing attention and significant funding to combat tropical diseases in an increasingly globalized world. In this paper, we attempt to visualize these and other aspects of the history of tropical medicine by spatiotemporal bibliometric analyses.
This is not the first bibliometric venture into the history of research on tropical diseases. In 2006, Falagas et al. published two studies [1, 2] on parasitology and tropical medicine research, respectively, over the 9-year period 1995–2003, identifying Oceania countries as the most productive when adjusting for both gross national income per capita and population. The authors also noted that the number of publications on parasitology from Latin America, the Caribbean, and Asia doubled between 1995 and 2003, but that the production from African countries remained low despite many of the diseases being endemic there. More recently, Ramos et al. published bibliometric analyses of Chagas disease research 1940–2009 and leishmaniasis research 1945–2010, identifying Brazil as the most productive country in the first decade of the twenty-first century when looking at the first-author affiliations. Similarly, Zyoud et al. published a spatiotemporal bibliometric analysis of publications on dengue 1872–2015, noting both the most productive countries in the field and a considerable increase in dengue-related publications in the last decade. For Sub-Saharan Africa [6,7,8] and Latin America, biomedical research, including neglected infectious diseases and the relation between disease burden and clinical trials, has been assessed by bibliometric methods. The field has seen rapid development in recent years, and an updated analysis of research output on tropical diseases is therefore warranted. What are the global and historical trends in tropical medicine research, and how do recent outbreaks, attention, and funding compare in these contexts? What else can be learned from broad, spatiotemporal bibliometric analyses?
Here, we also show how to use scientific workflows and freely available web services for spatiotemporal bibliometric analyses. Scientific workflows integrate specialized software, databases, or services into an overall data flow. They are particularly well suited for multi-step analyses using different types of software tools. The workflows are reusable for similar purposes and make analyses reproducible. Using web services and online databases, the workflows always access the latest information. Technical details on how the literature and geoparsing web services are accessed and how the returned data are parsed are abstracted and tucked away in workflow components, allowing less experienced users to focus on the overall workflow logic and scientific hypothesis. To our knowledge, this is the first time literature and geoparsing web services have been integrated this way. The bibliometric analyses were done in Taverna workflows available on myExperiment.
To count the number of publications on specific topics, such as a disease, we used the Europe PubMed Central (PMC) profilePublications Simple Object Access Protocol (SOAP) web service [ref. Europe PMC or manual]. This service returns summaries by category, i.e., database source (Agricola, CiteXplore, PubMed/MEDLINE NLM, PubMed Central, Biological Patents, etc.) and publication type (full text, open access, reviews, books, and documents). A Taverna  workflow profilePublications_over_time integrating this service is shown in Fig. 1 and also shared on the myExperiment  website (http://myexperiment.org/workflows/4980.html). This workflow takes as input one or more Europe PMC search queries. From each of these, the Build_Queries_1900_2016 component generates 117 individual search queries to retrieve the publication summaries by year. The Extract_ALL conditional XPath extracts the total number of publications in all categories. The workflow outputs a list with the number of publications per year, similar to the “Results by year” chart in the PubMed web interface. We ran this workflow in Taverna Workbench Core 2.5.0 in November 2016, with a list of the 10 most researched tropical diseases as defined by WHO (malaria, cholera, leprosy, schistosomiasis, trypanosomiasis, dengue, leishmaniasis, Chagas disease, Ebola, and taeniasis/cysticercosis).
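The query-expansion step can be sketched in a few lines of Python. This is only an illustration of the idea behind Build_Queries_1900_2016, not the component itself: the function name build_queries is hypothetical, and restricting by year with the Europe PMC PUB_YEAR search field is an assumption, since the text does not state the exact query syntax used inside the workflow.

```python
def build_queries(topic, first_year=1900, last_year=2016):
    """Return one year-restricted query string per publication year.

    Assumption: years are restricted with the Europe PMC PUB_YEAR field;
    the syntax used inside the Taverna component is not given in the text.
    """
    return [
        f"({topic}) AND PUB_YEAR:{year}"
        for year in range(first_year, last_year + 1)
    ]


queries = build_queries("malaria")
print(len(queries))   # 117 queries, one per year from 1900 through 2016
print(queries[0])     # (malaria) AND PUB_YEAR:1900
```

Each generated query would then be sent to the service once, and the total count extracted from each response gives one point on the publications-per-year curve.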
For a spatiotemporal analysis of the scientific literature, in particular using PubMed and other open resources, it is often necessary to parse the author affiliation information. We performed this geoparsing using MapAffil, a tool specifically developed to parse the author affiliation strings in PubMed. MapAffil correctly identifies cities (or similar localities) and assigns the city-center geocodes to about 98% of affiliations in PubMed. The remaining 2% largely lack place information (e.g., only the name of a multi-location institution is given), while errors and unresolved ambiguities are rare.
A Taverna workflow searchPublications_and_MapAffil integrating the Europe PMC searchPublications SOAP web service as previously described [13, 14] with MapAffil using its REST-like API is shown in Fig. 2. The workflow takes as input one or more queries, searches Europe PMC using searchPublications, retrieves the records, and extracts PubMed IDs and publication years with two XPaths. The IDs are passed to MapAffil by Build_URL_for_MapAffil and the built-in Get_Web_Page_from_URL service. MapAffil returns geoparsed affiliations with official city, county and country names, FIPS codes, latitudes and longitudes in WGS 84, and author orders in JSON. Two JsonPaths, Get_Latitudes and Get_Longitudes, extract coordinates for the first authors. The coordinates are then combined into one list of latitudes, longitudes, and publication years. The workflow outputs this list and hitCount, the number of records returned from Europe PMC, for each input query. The workflow was run in November 2016, using the same input as the first workflow.
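The extraction and joining steps described above can be illustrated outside Taverna with a short Python sketch. The JSON layout below is an assumption made for illustration only: the key names pmid, author_order, lat, and lon are hypothetical stand-ins, and the actual MapAffil response format should be checked before adapting this. The sample records are invented and do not come from the workflow output.

```python
import json


def first_author_points(mapaffil_json, year_by_pmid):
    """Combine first-author coordinates with publication years.

    Assumes a JSON array of objects with hypothetical keys
    'pmid', 'author_order', 'lat', and 'lon'.
    """
    points = []
    for rec in json.loads(mapaffil_json):
        if rec.get("author_order") != 1:
            continue  # keep first authors only
        lat, lon = rec.get("lat"), rec.get("lon")
        year = year_by_pmid.get(rec.get("pmid"))
        if lat is None or lon is None or year is None:
            continue  # affiliation could not be geoparsed, or year unknown
        points.append((float(lat), float(lon), int(year)))
    return points


# Invented example records, for illustration only.
sample = json.dumps([
    {"pmid": "1", "author_order": 1, "lat": 52.16, "lon": 4.49},
    {"pmid": "1", "author_order": 2, "lat": 40.11, "lon": -88.24},
    {"pmid": "2", "author_order": 1, "lat": None, "lon": None},
])
years = {"1": 2015, "2": 2016}
print(first_author_points(sample, years))  # [(52.16, 4.49, 2015)]
```

Records without coordinates are dropped rather than given a default location, which mirrors the distinction in the text between retrieved records and successfully mapped affiliations.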
Geographical information can be visualized using different software tools, including from within Taverna using the rworldmap [13, 15] or RQGIS R packages. Here, we used the standalone Quantum Geographic Information System (QGIS) desktop software version 2.18.0 and directly imported the coordinates from the searchPublications_and_MapAffil workflow in Fig. 2 as a delimited text layer in QGIS and overlaid these on a world map. For co-authorship analysis, we used VOSviewer version 1.6.5 and projected the collaborative network onto the same world map, using latitudes and longitudes from MapAffil, but for all co-author affiliations. Collaborative clusters were extracted using resolution = 0.3 and minimum size = 100. These parameters determine the sensitivity for separating clusters and how many nodes are required to form a unique cluster.
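As a minimal sketch of the hand-off to QGIS, the combined list of latitudes, longitudes, and publication years can be written as comma-delimited text. The function name write_points and the column names are assumptions chosen for illustration; when adding a delimited text layer in QGIS, the longitude column would be selected as the X field and the latitude column as the Y field, with WGS 84 as the coordinate reference system.

```python
import csv
import io


def write_points(points, handle):
    """Write (latitude, longitude, year) tuples as delimited text."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["latitude", "longitude", "year"])  # assumed column names
    writer.writerows(points)


# Write to an in-memory buffer here; pass an open file handle for real use.
buffer = io.StringIO()
write_points([(52.16, 4.49, 2015)], buffer)
print(buffer.getvalue())
```

Keeping the publication year as a column allows the layer to be filtered by year inside QGIS.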
Results and discussion
The results from the profilePublications_over_time workflow are summarized in Fig. 3. The figure shows the fraction of publications in PubMed devoted to one or more of these tropical diseases over the course of the twentieth and the early twenty-first century. This fraction was highest in the early 1900s (for malaria in 1900–1901, trypanosomiasis 1903–1904, and leishmaniasis 1912–1914). Minor dips can be observed at the beginning of World War II in 1939–1940 and to a lesser degree after the outbreak of World War I in 1914–1915. The absolute number of papers in PubMed increased dramatically after 1945, whereas the absolute number of publications per year on tropical diseases grew more slowly (malaria), remained largely constant (cholera), or even decreased (dengue fever). To some degree, this may have been a direct consequence of the rapid decolonization of Africa, South Asia, and Southeast Asia after 1945, reducing the relative research output on tropical diseases from the UK, France, Belgium, and the Netherlands. In this century, with increased emphasis on previously neglected tropical diseases and research funding, for example, through the Bill & Melinda Gates Foundation since 2000, the fraction of publications on tropical diseases has slowly increased since around 2005–2007. For Ebola, which is a special case, we observe (Fig. 3, inset) an initial local maximum in 1978, 2 years after the first outbreak in Zaire in September–October 1976, with interest waning around 1983–1984. After a period of few publications per year, we note a larger increase in 1995 after the outbreak the same year in DRC. The research output then increased steadily, in particular after the outbreaks in Congo and Uganda 2000–2001, and spiked in 2015–2016 after the major West African outbreak in 2014. Each time, the outbreak was followed by a distinct increase in research output.
Although seven Ebola articles were published in 1977, within a year of the first outbreak, later outbreaks were followed by a more rapid increase in the number of research publications.
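The quantity plotted in Fig. 3, the fraction of all publications in a given year that concern a disease, is a simple ratio of two yearly counts. A minimal sketch follows, assuming the per-year counts for the disease query and for the whole database are available as dictionaries keyed by year; the function name and the toy numbers are hypothetical and are not taken from the workflow output.

```python
def yearly_fraction(topic_counts, total_counts):
    """Fraction of all publications per year that match the topic.

    Years with no publications at all are skipped to avoid dividing by zero.
    """
    return {
        year: topic_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


# Toy counts for illustration only.
print(yearly_fraction({1900: 5}, {1900: 100, 1901: 50, 1902: 0}))
# {1900: 0.05, 1901: 0.0}
```

Normalizing by the yearly total separates genuine shifts in attention from the overall growth of the literature after 1945.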
Fewer publications on a particular research topic or disease do not imply neglect. Though not exclusively a tropical disease, smallpox was successfully eradicated in 1980. This is clearly seen in the research output, where a period of higher research output with 309 ± 42 publications per year during the WHO Smallpox Eradication Programme 1966–1980 was followed by a period of lower output with 129 ± 19 publications per year between 1981 and 1995. With increasing concerns about bioterrorism in the early 2000s, the number of publications increased dramatically, reaching a maximum of 756 publications in 2003. Similar trends can be observed for polio, with an increased research output from 1952, the year the first successful vaccine was developed, reaching a local maximum of 326 publications in 1957, and then falling as the incidence declined rapidly following mass vaccination in developed countries, until reaching a steady level of ~150 publications/year from the mid-1960s until the mid-1980s.
Geoparsing PubMed affiliations reveals where research was conducted, in addition to when. Table 1 contains the number of results, first authors, and mapped first-author affiliations for the 10 tropical diseases. In total, the workflow retrieved 379,728 records on these 10 tropical diseases. Only 259,138 or 68.2% of the retrieved Europe PMC records have a first-author affiliation explicitly designated as such. This information is lacking in many older publications (and we are here looking at publications as far back as the early 1800s, at least for leprosy and malaria). Of these, however, 253,277 or 97.7% could be successfully parsed to geographical coordinates by MapAffil. We did not observe a major difference in the MapAffil success rate between diseases (range from 95.9 to 98.3%) or as a function of publication year. Temporal artefacts may be due to the availability of affiliations in PubMed, with indexing starting in 1988 and all authors’ affiliations being available only from 2014. Publications older still are available for specific journals, such as Medico-Chirurgical Transactions (1809–1907)/Proceedings of the Royal Society of Medicine (1908–1977) provided by the Royal Society of Medicine Press, and journals such as British Medical Journal (1857–1980), Journal of Anatomy and Physiology (1867–1916), The Journal of Physiology (1878-), and Annals of Surgery (1885-). Even though single-author publications were common in the past, there may still be some geographic bias from looking only at the first authors of older papers. For example, looking at all authors may reveal additional insights into the structure of research collaboration and more accurately cover field work and local collaborators in endemic areas. The last author/principal investigator affiliation may correlate with the institution awarded the grants to conduct the research. The overall coverage of affiliations in MapAffil over time is shown in Fig. 4.
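The coverage percentages quoted above follow directly from the three record counts, and can be re-derived as a quick check. The variable names are illustrative; the third ratio, the share of all retrieved records that end up with coordinates, is not stated in the text and is derived here from the same counts.

```python
retrieved = 379_728        # records retrieved from Europe PMC
with_affiliation = 259_138 # records with a designated first-author affiliation
geoparsed = 253_277        # affiliations mapped to coordinates by MapAffil

share_with_affiliation = with_affiliation / retrieved
share_parsed = geoparsed / with_affiliation
share_mapped_overall = geoparsed / retrieved  # derived, not quoted in the text

print(f"{share_with_affiliation:.1%}")  # 68.2%
print(f"{share_parsed:.1%}")            # 97.7%
print(f"{share_mapped_overall:.1%}")    # 66.7%
```

The overall figure makes explicit that roughly two thirds of all retrieved records contribute a point to the maps, with missing affiliations rather than geoparsing failures accounting for most of the loss.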
The US National Library of Medicine started consistently recording first-author affiliations in 1988. Some of the records lacking affiliations were supplemented with data harvested from sources external to PubMed, including PMC, Microsoft Academic Graph (MAG), US National Institutes of Health (NIH) grants, and the Astrophysics Data System (ADS). For MAG and ADS, a crosswalk between citations was created using the Patci citation matcher, while NIH grant links were based on grant numbers listed in the XML distribution of PubMed. As the trend in the figure shows, this supplementation has the greatest effect for papers published before 1985, where the coverage goes from ~0% to 10–20%. Since 1990, supplementation covers another 2% of papers, yielding ~80% coverage in 1990, and ~90% more recently. It should be noted that the figure does not reflect that supplemental affiliations are added to authors beyond the first author, which may pick up additional geocodes for multisite collaborations.
The spatiotemporal analysis by the searchPublications_and_MapAffil workflow shows the expected correlation between disease prevalence and research output. For example, Fig. 5 shows the geographical distribution of research output for the 10 tropical diseases based on all publications in PubMed 1813–2016 for which the first-author affiliation was available and could be geoparsed. In absolute terms, Western Europe and North America dominate research output on most tropical diseases except leprosy and Chagas disease. The geographic differences become more pronounced when looking at low- and medium-income countries, many of which are affected by one or more of these tropical diseases. South-East Asia (including India) is clearly overrepresented in research output on leprosy. This is consistent with disease prevalence and a previous report by Schoonbaert and Demedts. In 2014, 72% of all reported new cases were detected in this region. Research output on Chagas disease, also known as American trypanosomiasis, is concentrated in South America (Brazil in particular). We also observe a high proportion of research output on malaria in all tropical regions. Dengue research is found more broadly over South and Southeast Asia, and leishmaniasis research is concentrated in the Middle East (Israel, Egypt) and Brazil. A significant share of the funding Egypt and Israel received after the 1978 Camp David Accords went to malaria and leishmaniasis research. Ramos et al. have also found that Israel produced the largest number of publications on leishmaniasis per capita. On the other hand, Ebola research output is concentrated in a few hotspots in Africa, such as Franceville, Brazzaville, Kampala, and Nairobi. When filtered by publication year, it is clear that the fraction of research publications coming from outside Europe, North America, or Japan increases dramatically after World War II.
When analyzing these maps in detail, it should be remembered that fieldwork is, by definition, geographically separated from the research institutes often listed in the author affiliations.
Figure 6 shows the geographical patterns of schistosomiasis research, based on co-authorships in Europe PMC with the first publication date in 2016. Although the field is highly internationalized, the collaborations separate into clusters. Cluster 1 (red, 199 locations) is dominated by the UK and anglophone countries in East Africa, cluster 2 (green, 141 locations) is dominated by Asia, and cluster 3 (blue, 125 locations) is shared between France (Paris, Lille, Caen, Bordeaux) and francophone countries in West Africa, Portugal, and Brazil. This is consistent with the social network analysis of international academic ties by Safonova and Sokolov, identifying “academic neocolonialism” as being of primary importance for institutional links and the collaborative clusters they form, and explaining observed patterns of international student flows. Lacunae, “missing” research output from major countries with high disease prevalence, may be due to conditions disadvantageous for conducting research, such as civil war or unrest, or lack of infrastructure. This is difficult to quantify, as disease prevalence may also be underreported from such regions. Affiliations for authors other than the first are only available for very recent publications, and historical studies on international collaborations are for this reason difficult in any field. The relationship between developed and developing countries has recently been investigated by González-Alcaide and co-workers, showing that countries of low and medium income have a higher degree of participation in areas of tropical medicine (co-authorship of 41% of research publications) and parasitology (24%) than infectious disease (19%) or pediatrics (8%) between 2011 and 2015.
We here used simple search queries and disabled the synonym lookup options in the Europe PMC web services. This will result in the inclusion of a few unrelated publications; for example, one paper from 1958, 18 years before the first report of Ebola hemorrhagic fever, on the geographic distribution of endemic goiter, including the areas watered by the Ebola river. Topic disambiguation is possible using Medical Subject Headings (MeSH). For example, Ramos et al. in their work looked for the MeSH terms “Leishmania” or “leishmaniasis.” Using MeSH may also bridge publications that exclusively refer to a disease by an alternate name, such as leprosy as Hansen’s disease or schistosomiasis as bilharziasis or Katayama fever, though care should be taken that all synonyms are specific and that searches for all diseases are expanded to a similar “depth.” Text-mining methods can also be used to disambiguate topics de novo but will only be usefully accurate for full-text articles. Regardless of query specification, some relevant articles will always be missed, and some less relevant included, in large datasets.
This paper illustrates how literature search and geoparsing web services can be combined in scientific workflows for reproducible, shareable, and reusable spatiotemporal bibliometric analyses. We have demonstrated this using research on 10 tropical diseases, as these exhibit characteristic and interpretable spatiotemporal patterns. Other resources that could, in principle, be combined in similar workflows include, for example, genomic, molecular, and epidemiological data, though geographical mapping of disease is a challenging but rapidly progressing field in itself [26,27,28]. The European Nucleotide Archive, ENA, and UniProt are extensively linked with publications in Europe PMC. These database links can also be traversed using the searchPublications and getDatabaseLinks web services from Europe PMC and RESTful web services from UniProt.
Research output on tropical diseases has some correlation with disease burden, in particular when comparing countries of similar resources and research output. Shared colonial history and language are also important factors. The Ebola example suggests the research community now reacts faster and more strongly than in past decades to outbreaks of diseases in Sub-Saharan Africa.
All work was performed on open data using freely available tools, including Taverna Workbench, Europe PMC, MapAffil web services, and QGIS. The two workflows are available from myExperiment for anyone who wishes to repeat or modify our analyses, without the need to download any bibliographic databases. The workflows and results are also available on the Open Science Framework (osf.io/dtkep/).
ADS: Astrophysics Data System
ENA: European Nucleotide Archive
MAG: Microsoft Academic Graph
MeSH: Medical Subject Headings
NIH: National Institutes of Health
NLM: National Library of Medicine
QGIS: Quantum Geographic Information System (now QGIS)
REST: REpresentational State Transfer
SOAP: Simple Object Access Protocol (now SOAP)
VOS: Visualization Of Similarities
WHO: World Health Organization
XML: Extensible Markup Language
Falagas ME, Karavasiou AI, Bliziotis IA. A bibliometric analysis of global trends of research productivity in tropical medicine. Acta Trop. 2006;99:155–9.
Falagas ME, Papastamataki PA, Bliziotis IA. A bibliometric analysis of research productivity in parasitology by different world regions during a 9-year period (1995–2003). BMC Infect Dis. 2006;6:56.
Ramos JM, Gonzalez-Alcaide G, Gascon J, Gutierrez F. Mapping of Chagas disease research: analysis of publications in the period between 1940 and 2009. Rev Soc Bras Med Trop. 2011;44:708–16.
Ramos JM, Gonzalez-Alcaide G, Bolanos-Pizarro M. Bibliometric analysis of leishmaniasis research in Medline (1945-2010). Parasit Vectors. 2013;6:55.
Zyoud SH. Dengue research: a bibliometric analysis of worldwide and Arab publications during 1872–2015. Virol J. 2016;13:78.
Uthman OA, Uthman MB. Geography of Africa biomedical publications: an analysis of 1996–2005 PubMed papers. Int J Health Geogr. 2007;6:46.
Hofman KJ, Kanyengo CW, Rapp BA, Kotzin S. Mapping the health research landscape in Sub-Saharan Africa: a study of trends in biomedical publications. J Med Libr Assoc. 2009;97:41–4.
Breugelmans JG, Makanga MM, Cardoso AL, Mathewson SB, Sheridan-Jones BR, Gurney KA, Mgone CS. Bibliometric assessment of European and Sub-Saharan African research output on poverty-related and neglected infectious diseases from 2003 to 2011. PLoS Negl Trop Dis. 2015;9:e0003997.
Perel P, Miranda JJ, Ortiz Z, Casas JP. Relation between the global burden of disease and randomized clinical trials conducted in Latin America published in the five leading medical journals. PLoS One. 2008;3:e1696.
Goble CA, Bhagat J, Aleksejevs S, Cruickshank D, Michaelides D, Newman D, Borkum M, Bechhofer S, Roos M, Li P, De Roure D. myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res. 2010;38:W677–82.
Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P, et al. The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res. 2013;41:W557–61.
Torvik VI. MapAffil: a bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. Dlib Mag. 2015;21:11-2. doi:10.1045/november2015-torvik.
Guler AT, Waaijer CJ, Palmblad M. Scientific workflows for bibliometrics. Scientometrics. 2016;107:385–98.
Guler AT, Waaijer CJF, Mohammed Y, Palmblad M. Automating bibliometric analyses using Taverna scientific workflows: a tutorial on integrating web services. J Informetr. 2016;10:830–41.
South A. rworldmap: a new R package for mapping global data. The R Journal. 2011;3:35–43.
Muenchow J, Schratz P. Integrating R with QGIS. 2017.
Quantum GIS Development Team. Quantum GIS Geographic Information System. Open Source Geospatial Foundation Project. 2017. http://qgis.osgeo.org.
van Eck NJ, Waltman L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics. 2010;84:523–38.
Agarwal S, Lincoln M, Cai H, Torvik VI. Patci––a tool for identifying scientific articles cited by patents. GSLIS Research Showcase; 2014.
Schoonbaert D, Demedts V. Analysis of the leprosy literature indexed in Medline (1950-2007). Lepr Rev. 2008;79:387–400.
WHO Global Leprosy Programme. Global leprosy strategy 2016-2020: accelerating towards a leprosy-free world. New Delhi: WHO; 2016.
Shope R, Baker RH, Buck A, Heyneman D, Krogstad DJ, Western KA, Hornbeak H. Epidemiology and control of vector-borne diseases in Egypt and Israel. In Report by external scientific committee on NIAID research contracts NOI-Al-22667/8. Bethesda: NIAID; 1985.
Safonova M, Sokolov M. The construction of the academic world-system: regression and social network approaches to analysis of international academic ties. In: Gorraiz J, Schiebel E, Gumpenberger C, Hörlesberger M, Moed H, editors. 14th International Society of Scientometrics and Informetrics Conference, Vienna, Austria. AIT Austrian Institute of Technology GmbH; 2013. p. 389–403.
Gonzalez-Alcaide G, Park J, Huamani C, Ramos JM. Dominance and leadership in research activities: collaboration between countries of differing human development is reflected through authorship order and designation as corresponding authors in scientific publications. PLoS One. 2017;12:e0182513.
Kelly FC, Snedden WW. Prevalence and geographical distribution of endemic goitre. Bull World Health Organ. 1958;18:5–173.
Hay SI, Battle KE, Pigott DM, Smith DL, Moyes CL, Bhatt S, Brownstein JS, Collier N, Myers MF, George DB, Gething PW. Global mapping of infectious disease. Philosophical Transactions of the Royal Society B-Biological Sciences. 2013;368
Carroll LN, Au AP, Detwiler LT, Fu TC, Painter IS, Abernethy NF. Visualization and analytics tools for infectious disease epidemiology: a systematic review. J Biomed Inform. 2014;51:287–98.
Kraemer MUG, Hay SI, Pigott DM, Smith DL, Wint GRW, Golding N. Progress and challenges in infectious disease cartography. Trends Parasitol. 2016;32:19–29.
The authors would like to thank Prof. André M. Deelder for many helpful suggestions and careful reading of the manuscript and Dr. Cathelijn J. F. Waaijer for additional recommendations.
VIT was funded by the US NIH P01AG039347.
Availability of data and materials
The two workflows used and the data produced and analyzed in this work are available on Open Science Framework (https://osf.io/dtkep/) and the workflows on myExperiment (https://www.myexperiment.org/workflows/4980.html and https://www.myexperiment.org/workflows/4981.html).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.