nlp – machine learning data annotation

I'm going to develop a machine learning model and I have a large amount of text data. I need detailed evaluation metrics such as the F1 score. I use a data annotation tool (Dataturks). Which approach is better for labeling the data: a single label per entity or multiple labels per entity? (Since we plan to use 5-fold cross-validation, do we have to label each item once or five times to get a better overall score?) Your help is greatly appreciated.
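For reference, a minimal sketch of what 5-fold evaluation with a per-fold F1 score could look like, assuming scikit-learn; the classifier, vectorizer, and placeholder data below are illustrative assumptions, not part of the original question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# hypothetical placeholder data; in practice, load the annotated texts
# and their labels exported from the annotation tool
texts = ["nice work", "terrible result", "great job", "awful experience"] * 10
labels = ["pos", "neg", "pos", "neg"] * 10

# with one label per entity, each example contributes exactly once per fold
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=5, scoring="f1_macro")
print("per-fold F1:", scores, "mean:", scores.mean())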

Machine Learning – What is the main concept of using a lexical, linguistic, semantic or syntactic approach in NLP for cyberbullying detection?

I really need an explanation. I'm working on an NLP cyberbullying detection tool that I'm going to implement on the web using the Django framework. However, I'm stuck on one idea; can someone explain it to me? What is the main concept behind using lexical, linguistic, semantic, or syntactic approaches in NLP, and how are they used for cyberbullying detection? Is POS tagging a semantic technique, since it is a process that connects words with a root and representative word in an understandable context? Correct me if I'm wrong.

I read an article in which the authors started a project using a predictive-analysis approach with feature extraction techniques and Naive Bayes to classify and train the model. The article also discussed how other teams used a semantic approach to classify cyberbullying. I am familiar with data cleansing, tokenization, stemming, and most feature extraction models. However, I am stuck on which approach is relevant (lexical, semantic, or syntactic) and how to work with each of them.
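For context, a minimal sketch of the feature-extraction-plus-Naive-Bayes setup such an article describes might look like this, assuming scikit-learn; the sample texts and labels are hypothetical placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy data: 1 = bullying, 0 = neutral
texts = ["you are worthless", "have a nice day",
         "nobody likes you", "see you tomorrow"]
labels = [1, 0, 1, 0]

# bag-of-words counts (a lexical feature representation) feeding Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["you are a nice person"]))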

Probability Theory – Contingency table confusion in NLP

For the contingency table terms [true positive, false negative, false positive, true negative]: it is difficult for me to remember the difference between them, because they all consist of words that are very similar to each other but are used in such contrasting contexts. The only ones that make sense to me are true positives and false negatives; the others I always confuse. Is there a brief mnemonic I can use?
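One mnemonic that may help: the second word ("positive"/"negative") is what the model predicted, and the first word ("true"/"false") says whether that prediction was correct. A small sketch to anchor the terms, assuming scikit-learn; the labels here are made up:

from sklearn.metrics import confusion_matrix

# hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# scikit-learn's layout: rows are the true class, columns the prediction:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
# "false positive": the model said positive, and that was false (wrong)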

Python – NLP sentiment analysis in Norwegian

Please bear in mind that I'm very new to data science and completely new to NLP! I'm trying to create a model that classifies customer reviews as negative or positive. My main problem, however, is that most of them are in Norwegian, which does not seem to be very well supported. I've found this repo https://github.com/ltgoslo/norsentlex/tree/master/fullform, which contains both a negative and a positive lexicon, and I've decided to use it.

with open("Fullform_Positive_lexicon.txt") as f:
    reader = csv.reader(f)
    positive_lexicon = ()
    for row in reader:
        print(str(row))
        positive_lexicon.append(row(0))

with open("Fullform_Negative_lexicon.txt") as f:
    reader = csv.reader(f)
    negative_lexicon = ()
    for row in reader:
        negative_lexicon.append(row(0))

# adjusted to use the tokens with NaiveBayesClassifier,
# which expects a dict of features per example
def get_tokens_for_model(cleaned_tokens_list):
    final_token_list = []
    for token in cleaned_tokens_list:
        token_dict = {token: True}
        final_token_list.append(token_dict)
    return final_token_list

positive_tokens_for_model = get_tokens_for_model(positive_lexicon)
negative_tokens_for_model = get_tokens_for_model(negative_lexicon)

positive_dataset = [(token, "Positive")
                    for token in positive_tokens_for_model]

negative_dataset = [(token, "Negative")
                    for token in negative_tokens_for_model]

# combine and shuffle the dataset
dataset = positive_dataset + negative_dataset
random.shuffle(dataset)

train_data = dataset[:20000]
test_data = dataset[20000:]  # the remaining ~7742 examples

classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

I train my model only on this lexicon and then try to apply it to my comments. I get 87% accuracy here, which is not that bad. However, the results look worse when it is applied to whole sentences.

with open("stopwords.csv") as f:
    reader = csv.reader(f)
    norwegian_stopwords = ()
    for row in reader:
        norwegian_stopwords.append(row(0))

customer_feedback = pd.read_excel("classification_sample.xlsx")
customer_feedback = customer_feedback.dropna()
customer_feedback('Kommentar') = customer_feedback('Kommentar').apply(remove_punctuation)

list_of_comments = list(customer_feedback('Kommentar'))

for comment in list_of_comments:
    custom_tokens = word_tokenize(comment)
    filtered_tokens = []
    for word in custom_tokens:
        if word not in norwegian_stopwords:
            filtered_tokens.append(word)

    classification = classifier.classify(dict((token, True) for token in filtered_tokens))
    probdist = classifier.prob_classify(dict((token, True) for token in filtered_tokens))
    pred_sentiment = probdist.max()
    print("Sentence: " + comment)
    print("Classified as: " + str(classification))
    print("Key words: " + str(custom_tokens))
    print("Probability: " + str(round(probdist.prob(pred_sentiment), 2)))
    print("-----------------------------------------------------------")

I know my code is currently not of the highest quality (suggestions are welcome too). I'm mainly looking for feedback on what I can do to improve my accuracy. What I don't really understand right now is how to train the model correctly on single words and then achieve the same accuracy on sentences.

NLP possible in languages other than English?

I would like to use Mathematica's NLP functions to parse texts in French. I have not found any documentation on how well they work compared to English (or whether it is even possible to use non-English languages).

For example, I just tried:

moon = WikipediaData["Lune", Language -> "French"];
contents = TextContents[moon, VerifyInterpretation -> True]

and got the error message:

NaturalLanguageProcessingTextTokenize::liberr: -- Message text not found -- (Java) (MathLinkException: 49: Unable to convert from MathLink encoding to requested character encoding)

Is this just an encoding problem, or a more general problem with using non-English languages?

nlp – ResourceExhausted: 429 quota metric for the Natural Language API exceeded in Dataflow with the Python SDK

I'm creating a Dataflow pipeline that reads a CSV file, runs sentiment analysis using the Google Cloud NLP API, and sends the result to BigQuery.

When the function performing the sentiment analysis processes the collection, the error mentioned above is raised.

I'm considering splitting the PCollection into smaller batches to work within the quota restriction of the NLP API; a sketch of this batching idea follows after the pipeline snippet below.

(p
       | 'ReadData' >> beam.io.textio.ReadFromText(src_path)
       | 'ParseCSV' >> beam.ParDo(Analysis())
       | 'WriteToBigQuery' >> ...
)
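A minimal sketch of that batching idea, assuming the Python SDK's built-in beam.BatchElements transform; AnalyzeBatch and analyze_sentiment are hypothetical placeholders for the existing analysis code, not part of the original pipeline:

import time

import apache_beam as beam
from google.api_core.exceptions import ResourceExhausted

class AnalyzeBatch(beam.DoFn):
    # hypothetical DoFn: analyze one batch, backing off when the quota is hit
    def process(self, batch):
        for row in batch:
            for attempt in range(5):
                try:
                    yield analyze_sentiment(row)  # placeholder for the NLP API call
                    break
                except ResourceExhausted:
                    time.sleep(2 ** attempt)  # exponential backoff, then retry

(p
       | 'ReadData' >> beam.io.textio.ReadFromText(src_path)
       | 'Batch' >> beam.BatchElements(min_batch_size=10, max_batch_size=50)
       | 'Analyze' >> beam.ParDo(AnalyzeBatch())
       | 'WriteToBigQuery' >> ...
)

Smaller batches mean more calls overall, but each quota error only delays one batch instead of failing the whole bundle.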

python – pandas read_csv with large texts for NLP

I have a large CSV file that consists of one column, with a single cell in each of about 18,800 rows. Each cell contains a large text. The file size is around 318 MB. I use Jupyter notebooks.
I'm trying to read the CSV file into a pandas data frame. However, I get the following error message:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2

I've tried a few tricks, but I do not get the original text or even the right number of rows; I only get 200 rows.
I have to use this file for an NLP project.
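For context, a minimal sketch of two common ways such a parse failure is handled, assuming pandas 1.3 or newer; the file and column names are placeholders, since the error usually points to an extra, unescaped delimiter inside one of the texts:

import pandas as pd

# hypothetical file name; "Expected 1 fields in line 12, saw 2" usually means
# line 12 contains an extra separator, e.g. an unquoted comma inside the text

# option 1: surface the bad lines while diagnosing (pandas >= 1.3)
df = pd.read_csv("large_texts.csv", header=None, names=["text"],
                 on_bad_lines="warn")

# option 2: if the file is really one text per line with no quoting at all,
# bypass the CSV parser entirely
with open("large_texts.csv", encoding="utf-8") as f:
    texts = f.read().splitlines()
df = pd.DataFrame({"text": texts})
print(len(df))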

Your help is greatly appreciated.
Many thanks