The Lattes platform has hosted the curricula of Brazilian researchers since the late 1990s and contains more than 5 million curricula. Lattes curriculum data can be downloaded in XML format. The complexity of reading these files motivated the development of the getLattes package, which imports the information from the XML files into a list in R and then tabulates the Lattes data into a data.frame.
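The two steps described above (XML to list, then list to data.frame) can be sketched without getLattes itself. The following is a minimal illustration using the xml2 package on a tiny inline XML fragment. The fragment, the attribute values, and the column names are simplified assumptions made for illustration; this is not the full Lattes schema, nor the package's internal implementation.

```r
library(xml2)

# Toy fragment standing in for a Lattes curriculum (assumed, simplified structure).
xml_text <- paste0(
  '<CURRICULO-VITAE NUMERO-IDENTIFICADOR="0000000000000000">',
  '<DADOS-GERAIS NOME-COMPLETO="Fulano de Tal" PAIS-DE-NASCIMENTO="Brasil"/>',
  '</CURRICULO-VITAE>'
)

doc  <- read_xml(xml_text)
root <- xml_root(doc)

# Step 1: read the attributes of one node into a named list.
dados <- as.list(xml_attrs(xml_find_first(doc, "//DADOS-GERAIS")))

# Step 2: tabulate the list as a one-row data.frame, adding the curriculum id.
tab <- data.frame(
  id                 = xml_attr(root, "NUMERO-IDENTIFICADOR"),
  nome_completo      = dados[["NOME-COMPLETO"]],
  pais_de_nascimento = dados[["PAIS-DE-NASCIMENTO"]],
  stringsAsFactors   = FALSE
)

print(tab)
```

The getLattes functions listed below perform this kind of extraction for each section of the curriculum, so the reader does not need to navigate the XML tree by hand.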
The main information contained in the XML files is imported via the following getLattes functions:
getAreasAtuacao()
getArtigosPublicados()
getAtuacoesProfissionais()
getBancasDoutorado()
getBancasGraduacao()
getBancasMestrado()
getCapitulosLivros()
getDadosGerais()
getEnderecoProfissional()
getEventosCongressos()
getFormacaoDoutorado()
getFormacaoMestrado()
getFormacaoGraduacao()
getIdiomas()
getLinhaPesquisa()
getLivrosPublicados()
getOrganizacaoEvento()
getOrientacoesDoutorado()
getOrientacoesMestrado()
getOrientacoesPosDoutorado()
getOutrasProducoesTecnicas()
getParticipacaoProjeto()
getProducaoTecnica()
getPatentes()
getId()
With the functionality provided by this package, the main remaining challenge in working with Lattes curriculum data is downloading it, because the website is protected by CAPTCHAs. To download a large number of curricula, I suggest using Captchas Negated by Python reQuests - CNPQ. The second barrier is managing and processing a large volume of data: the whole Lattes platform in XML files totals over 200 GB. This tutorial focuses on the features of the getLattes package; the reader is responsible for downloading and managing the files.
The following is an example of how to search for and download data from the Lattes website.