DAY 2, 22 MAY
14:00 - 14:45
ABOUT THE SPEAKER
I am a team lead in Scrapinghub's QA department. Scrapinghub uses Python and machine learning to extract data from the web at massive scale.
Talk: Data QA, The Automated Way
Datasets collected from the web can suffer from many different issues, from poorly written web scrapers to esoteric HTML structure, any of which can lead to missing fields or incorrect values. This is where Data QA (as opposed to application QA) comes into play. For datasets containing millions of records, it is impossible to manually check every record, so we have to innovate with tools that automate the QA process. In this presentation you will learn the main problems that affect data, the difficulties of QAing large datasets, and how to automate the Data QA process using Python, Jupyter, and JSON.