First of all, I want to apologize if the title is misleading.
I have a dataset of around 60,000 tweets, each with its date and time and the author's username. I need to classify them into topics. I am doing topic modelling with LDA and chose the number of topics (correctly, I hope) using this R package, which calculates three metrics (“CaoJuan2009”, “Arun2010”, “Deveaud2014”). Since I am very new to this, I have a few questions that might be obvious to some of you, but whose answers I can’t find online.
Before cleaning the data (removing mentions, stopwords, odd characters, numbers, etc.), I removed all exact duplicates, i.e. rows identical in all three columns, so that they would not skew the topic model. Is this right?
Should I also remove all retweets, for the same reason?
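To make the two filtering steps concrete, here is a minimal sketch of what I mean. It is in plain Python purely for illustration (I only use the R package for choosing the number of topics). The example rows, the helper names and the assumption that a retweet starts with `RT @user` are all made up for this example and may not match how retweets appear in every export.

```python
import re

# Hypothetical rows: (text, timestamp, username). Not real data.
tweets = [
    ("Budget vote today", "2020-01-01 10:00", "alice"),
    ("Budget vote today", "2020-01-01 10:00", "alice"),  # exact duplicate
    ("RT @alice: Budget vote today", "2020-01-01 10:05", "bob"),
    ("New stadium opens", "2020-01-02 09:30", "carol"),
]

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row that matches in all three columns."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Assumption: a retweet is marked by a leading "RT @user" prefix.
RETWEET_PATTERN = re.compile(r"^RT @\w+:?\s")

def is_retweet(text):
    """Return True if the text looks like a retweet under the assumption above."""
    return bool(RETWEET_PATTERN.match(text))

unique_tweets = drop_exact_duplicates(tweets)
originals = [row for row in unique_tweets if not is_retweet(row[0])]
retweets = [row for row in unique_tweets if is_retweet(row[0])]
```

Keeping `retweets` in a separate list rather than discarding them would still let me assign topics to them later, if that turns out to be the right approach.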
Until now, I had planned to classify each tweet using the “per-document-per-topic” probability. If I get rid of so many instances, do I instead have to classify them based on the “per-word-per-topic” probability?
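By classifying with the per-document-per-topic probability I mean something like the sketch below: each tweet gets the topic with the highest probability in its row. The matrix `doc_topic` and its numbers are invented for illustration; in practice they would come from the fitted LDA model.

```python
# Hypothetical per-document-per-topic probabilities:
# one row per tweet, one column per topic, each row sums to 1.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.40, 0.45, 0.15],
]

def dominant_topic(probabilities):
    """Return (topic index, probability) of the most probable topic."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best, probabilities[best]

assignments = [dominant_topic(row) for row in doc_topic]

for i, (topic, prob) in enumerate(assignments):
    print(f"tweet {i}: topic {topic} (p = {prob:.2f})")
```

The third row shows a case I am unsure about: the top two topics are almost tied, so a hard assignment hides quite a lot of uncertainty.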
Do I have to split the dataset into training and test sets? I thought that was only done in supervised learning, since I cannot really use a test set to measure the quality of the classification.
Another goal would be to classify Twitter users by the topic they are most passionate about. Do you have any idea how to implement this?
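One naive idea I had, sketched here under my own assumptions and not as an established method, is to average the per-document-per-topic probabilities over all tweets by the same user and take the topic with the highest average. The names `authors`, `doc_topic` and `user_topic_profiles` and all the numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: authors[i] wrote the tweet described by doc_topic[i].
authors = ["alice", "bob", "alice", "bob"]
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

def user_topic_profiles(authors, doc_topic):
    """Average the topic probabilities of each user's tweets."""
    sums = {}
    counts = defaultdict(int)
    for author, probs in zip(authors, doc_topic):
        if author not in sums:
            sums[author] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[author][k] += p
        counts[author] += 1
    return {a: [s / counts[a] for s in sums[a]] for a in sums}

profiles = user_topic_profiles(authors, doc_topic)

# The topic with the highest average probability for each user.
favourite = {
    user: max(range(len(profile)), key=profile.__getitem__)
    for user, profile in profiles.items()
}
```

I do not know whether a simple average is sensible here, for example for users with only one or two tweets, so I would be glad to hear about better approaches.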
Thank you all very much in advance.