A short paper entitled “Rescuing Historical Climate Observations to Support Hydrological Research: A Case Study of Solar Radiation Data”, presented at the ACM Symposium on Document Engineering 2021 (DocEng ’21), outlines research toward establishing an efficient and accurate process for digitizing paper-based climate data. The work is framed around an evaluation of two optical character recognition (OCR) engines, Tesseract OCR and ABBYY FineReader. Both tools were applied to tabular data contained in pre-1990s Solar Radiation and Radiation Balance Data booklets published by the Hydrometeorological Service of the USSR. The paper compares the two engines on ease of character training, data conversion time, the range of font types and data formats each can recognize, and the accuracy of the results. The results indicate that, while ABBYY FineReader achieved higher accuracy than Tesseract OCR, Tesseract OCR can be easily trained to adapt to data in different formats. It is also free and non-proprietary. The paper was co-authored by ERG’s Bhaleka Persaud and Philippe Van Cappellen and former ERG co-op students Naveela Sookoo and Anthony Cavallin, in collaboration with co-authors from the David R. Cheriton School of Computer Science (Ogundepo Odunayo, Jimmy Lin, Gautam Bathla) and the Davis Centre Library (Kathy Szigeti). The paper can be accessed via this link.
Monday, August 16, 2021