Social media is a great source of data for analysis, since it gives people ways to
share emotions, feelings, ideas, and even symptoms of diseases. By the end of 2019, a global pandemic
alert was raised concerning a virus that had a high transmission rate and could cause respiratory
complications. To help identify those who may have the symptoms of this disease or to detect who is
already infected, this paper analyzed the performance of eight machine learning algorithms (KNN,
Naive Bayes, Decision Tree, Random Forest, SVM, simple Multilayer Perceptron, Convolutional
Neural Network and BERT) in searching for and classifying tweets that contain self-reports of
COVID-19 symptoms. The dataset was labeled using a set of disease symptom keywords provided
by the World Health Organization. The tests showed that the Random Forest algorithm had the best
results, closely followed by BERT and the Convolutional Neural Network, although the other traditional
machine learning algorithms can also provide good results. This work could also aid in the selection
of algorithms for identifying disease symptoms in social media content.
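As an illustration of the keyword-based labeling described above, the following minimal sketch marks a tweet as positive when it mentions any symptom keyword. The keyword list, the function name `label_tweet`, and the example tweets are hypothetical and chosen only for illustration; they are not the World Health Organization list or the labeling code used in this work.

```python
import re

# Hypothetical keyword list for illustration only; not the WHO list used in the paper.
SYMPTOM_KEYWORDS = ["fever", "cough", "shortness of breath", "loss of taste"]

# One case-insensitive pattern that matches any keyword as a whole word or phrase.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SYMPTOM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def label_tweet(text: str) -> int:
    """Return 1 if the tweet mentions any symptom keyword, else 0."""
    return 1 if _PATTERN.search(text) else 0


# Made-up example tweets, not taken from the dataset.
tweets = [
    "I have had a fever and a dry cough since Monday",
    "Watching the news about the pandemic",
]
labels = [label_tweet(t) for t in tweets]
```

Labels produced this way could then serve as targets for any of the classifiers compared here. Whole-word matching is a design choice of this sketch: it avoids flagging words such as "feverish", at the cost of missing inflected or misspelled forms.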
This work has been supported by FCT—Fundação para a Ciência e Tecnologia within the
Project Scope: DSAIPA/AI/0088/2020.