Conference paper Open Access
Tarasconi, Francesco; Farina, Michela; Mazzei, Antonio; Bosca, Alessio
Social media can be an important, constantly updated source of information concerning natural disasters. User-generated, free-text messages contain useful elements for the three main phases of disaster management: awareness/early warning, response, and post-disaster assessment. However, most previous research focuses on studying content collected in relation to specific events. More work can be done in extending Information Extraction tasks to continuous streams of (potentially) hazard-related documents, regardless of time or location. We describe a Natural Language Processing architecture, employed in our study, to collect and monitor keyword-based streams associated with different languages and event types. Starting from existing work, we review the definitions of disaster-related Information Types and Informativeness to better capture relevant and interesting items in the newly defined streams. To act as both a guideline in this procedure and a gold standard for automatic classification, we created and annotated a multi-language, multi-hazard corpus of more than 10,000 tweets, sampled from our collected data streams. We conclude by discussing the methodology behind the rule-based classifiers that we developed using domain and linguistic knowledge, and the results they achieved. Our approach is found to be viable for performing Information Extraction on generic, hazard-related (but noisy) social media data streams.
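As an illustration of the general idea only, the sketch below shows how a keyword-based stream per language and hazard type might be combined with a simple rule-based informativeness check. It is not the architecture or the classifiers described in the paper: the keyword lists, the regular expressions, and the names `STREAM_KEYWORDS`, `match_streams`, and `classify` are all hypothetical and chosen for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists keyed by (language, hazard); not taken from the paper.
STREAM_KEYWORDS = {
    ("en", "flood"): ["flood", "flooding", "flooded"],
    ("it", "flood"): ["alluvione", "esondazione"],
    ("en", "wildfire"): ["wildfire", "forest fire"],
}

# Hypothetical rules: cues suggesting an informative message.
INFORMATIVE_PATTERNS = [
    re.compile(r"\b(evacuat\w*|injured|damage\w*|road closed|warning)\b", re.I),
]

# Hypothetical rules: figurative uses that make a keyword match noisy.
NOISE_PATTERNS = [
    re.compile(r"\bflood(ed)? (of|with) (emails|messages|tears)\b", re.I),
]


@dataclass
class Labelled:
    text: str
    streams: list       # (language, hazard) streams whose keywords matched
    informative: bool   # outcome of the rule-based check


def match_streams(text):
    """Return every (language, hazard) stream with a keyword in the text."""
    lowered = text.lower()
    return [
        key
        for key, words in STREAM_KEYWORDS.items()
        if any(word in lowered for word in words)
    ]


def classify(text):
    """Assign a message to streams, then apply the informativeness rules.

    Returns None when no stream keyword matches, i.e. the message
    would never have been collected in the first place.
    """
    streams = match_streams(text)
    if not streams:
        return None
    is_noise = any(p.search(text) for p in NOISE_PATTERNS)
    has_signal = any(p.search(text) for p in INFORMATIVE_PATTERNS)
    return Labelled(text, streams, has_signal and not is_noise)
```

In this sketch, a message such as "My inbox is flooded with emails" still enters the English flood stream through its keyword but is labelled as not informative, which mirrors the kind of noise a keyword-based stream has to cope with.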