University of Cambridge scientists have created two automatically generated databases presenting photovoltaic properties and device material data for dye-sensitized solar cells (DSCs) and perovskite solar cells (PSCs).
The scientists used the ChemDataExtractor text-mining toolkit, which they described as a “chemistry-aware” natural-language-processing (NLP) tool. It was applied to 25,720 scientific articles and yielded 660,881 data entries representing 57,678 photovoltaic devices. The database for the dye-sensitized devices included 475,045 entries organized into 41,680 records. The one for perovskite cells included 185,836 entries organized into 15,818 records.
“Such a database could also reveal information about the variation observed in popular structures that have been synthesized multiple times in different studies, to glean information on the underlying variation for that particular architecture,” they said.
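As a rough illustration of the kind of analysis the researchers describe, the Python sketch below groups repeated reports of the same device architecture and summarizes the spread of a reported property. The record layout, the architecture labels, and the efficiency values are hypothetical and are not taken from either database.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (device architecture label, reported efficiency in %).
# Values are made up for illustration and are not taken from either database.
records = [
    ("architecture A", 10.0),
    ("architecture A", 12.0),
    ("architecture A", 11.0),
    ("architecture B", 18.0),
    ("architecture B", 20.0),
]

def efficiency_spread(rows):
    """Group efficiencies by architecture; return (mean, sample std dev, count)."""
    grouped = defaultdict(list)
    for architecture, efficiency in rows:
        grouped[architecture].append(efficiency)
    return {
        name: (mean(values), stdev(values), len(values))
        for name, values in grouped.items()
        if len(values) >= 2  # a standard deviation needs at least two reports
    }

spread = efficiency_spread(records)
for name, (avg, sd, n) in spread.items():
    print(f"{name}: mean={avg:.1f}%, std={sd:.2f}, n={n}")
```

A large standard deviation for a frequently reported architecture would point to variation between studies rather than between designs.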
The researchers claim their multifaceted evaluation approach ensured data quality, with precision metrics ranging from 73.1% to 95.8%.
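Precision, in the usual information-extraction sense, is the share of extracted entries that turn out to be correct when checked against the source paper. A minimal Python sketch of that calculation follows. The function name and the counts are illustrative assumptions, not figures or code from the study.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of extracted entries that were verified as correct."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no extracted entries to evaluate")
    return true_positives / total

# Illustrative counts only: 90 correct and 10 incorrect extractions
# in a hypothetical manually checked sample.
sample_precision = precision(90, 10)
print(f"{sample_precision:.1%}")
```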
“It is interesting to note that the accuracy of data extracted for the PSC database exceeded that of the DSC database on both metrics,” they said. “This is surprising since the parsers for the photovoltaic properties and the logic for calculating derived properties are the same in both cases.