r – Error when Running a Naive Bayes on Text as Data using Caret

I am attempting to train a naive Bayes model on text data, using a predetermined number of folds (so as to allow comparison with other models) and adaptive resampling for hyperparameter tuning. However, this error appears:

Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed

I know there are other approaches, such as those provided by the quanteda package; however, I want to stay with caret so that I can compare other models on the same data.

Any help would be much appreciated.

My code is below:


library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
library(caret)

corp <- data_corpus_moviereviews

set.seed(300)

id_train <- sample(docnames(corp), size = 1500, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE, tolower=TRUE, remove=stopwords("en"), remove_symbols=TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE, tolower=TRUE, remove_symbols=TRUE, remove=stopwords("en")) %>% 
  dfm_select(pattern = training_dfm, 
             selection = "keep")

training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")

myFolds <- createFolds(training_m, k = 5) 

myControl <- trainControl(
  method="adaptive_cv",
  repeats=2,
  summaryFunction = twoClassSummary,
  classProbs = TRUE, 
  verboseIter = TRUE,
  index = myFolds, 
  adaptive = list(min = 2, 
                  alpha = 0.05, 
                  method = "gls", 
                  complete = TRUE), 
  search = "random")

nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "sentiment")),
                  method = "naive_bayes",
                  trControl = myControl,
                  tuneLength = 3,
                  verbose = TRUE,
                  metric = "ROC")
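
For reference, a minimal sketch of how createFolds() is usually applied to the outcome factor rather than to the document-feature matrix, so that every fold is stratified over both sentiment classes (this is only an assumption about one possible source of the NA summaries, not a confirmed fix):

# Sketch: build folds from the class labels, not from the feature matrix.
# With returnTrain = TRUE the list holds the in-fold (training) row indices,
# which is the form trainControl(index = ...) expects.
train_labels <- as.factor(docvars(training_dfm, "sentiment"))
myFolds <- createFolds(train_labels, k = 5, returnTrain = TRUE)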

machine learning – Bayes Classifier and Bayes Risk

A probability distribution $P$ over $\mathcal{X} \times \{0, 1\}$. $P$ can be defined in terms of its marginal distribution over $\mathcal{X}$, which we will denote by $P_X$, and the conditional labeling distribution, which is defined by the regression function
$$
\mu(x) = \mathbb{P}_{(x,y)\sim P}(y = 1 \mid x)
$$

Let’s consider a 2-dimensional Euclidean domain, that is $\mathcal{X} = \mathbb{R}^2$, and the following data-generating process: the marginal distribution over $\mathcal{X}$ is uniform over the two square areas $(1, 2) \times (1, 2) \cup (3, 4) \times (1.5, 2.5)$. Points in the first square $Q_1 = (1, 2) \times (1, 2)$ are labeled 0 (blue) and points in the second square $Q_2 = (3, 4) \times (1.5, 2.5)$ are labeled 1 (red).

(a) Describe the density function of $P_X$, and the regression function, Bayes predictor and Bayes risk of $P$.

(b) Consider the two distributions $P^1$ and $P^2$ that we obtain by “projecting” onto each of the axes. Formally, we are marginalizing out one of the features to obtain $P^1$ and $P^2$. Both are distributions over $\mathbb{R} \times \{0, 1\}$. Describe the density functions of $P^1_X$ and $P^2_X$, and the regression functions, Bayes predictors and Bayes risks of $P^1$ and $P^2$.
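
A sketch of part (a) under the stated generating process (my own working, so it may need checking): each square has area 1 and the marginal is uniform over their union, so
$$
p_X(x) = \begin{cases} \tfrac{1}{2} & x \in Q_1 \cup Q_2 \\ 0 & \text{otherwise} \end{cases},
\qquad
\mu(x) = \mathbf{1}[x \in Q_2],
$$
and since the labels are deterministic given $x$, the Bayes predictor is $h^*(x) = \mathbf{1}[x \in Q_2]$ with Bayes risk $0$.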

python – Naive Bayes Classifier for sentiment labelled documents

In order to continue improving my Python knowledge, I have implemented a naïve Bayes classifier as described in “Introduction to Information Retrieval”. I would be very interested to hear which parts could be improved, be it coding style or the use of data structures, for example.

"""Implementation of a naive Bayes classifier based on sentiment labelled sentences.
See https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf for the algorithm.
The dataset was obtained from
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences"""
import sys
import re
import math
from collections import Counter
from stop_words import get_stop_words


# PARAMETERS
DATAFILE = r"data\imdb_labelled.txt"

# FUNCTIONS
def load_data(filepath):
    """Load the sentiment labelled data."""
    # A library is a list of categories, each of which is a list of documents
    library = [[], []]

    # The textfile is formatted as document (string), TAB, category (int), NL
    with open(filepath, 'r') as file:
        for line in file:
            document, category = line.split('\t')
            library[int(category)].append(document)

    return library

def clean_library(library):
    """Clean documents in the library array."""
    for i, category in enumerate(library):
        for j, document in enumerate(category):
            cleaned_doc = clean_document(document)
            library[i][j] = cleaned_doc

def clean_document(document):
    """Clean a document from stop words, numbers and various other
       characters and return a list of all words."""
    stop_words = get_stop_words('en')

    new_doc = document.strip().lower()
    new_doc = re.sub(r'[-0-9.,!;:\/()"&]', "", new_doc)
    new_doc = new_doc.split()
    new_doc = [word for word in new_doc if word not in stop_words]

    return new_doc

def train_categories(library):
    """Calculate probabilities for the naive Bayes classifier and
       return the vocabulary with conditional probabilities and the priors."""
    total_docs = sum(len(category) for category in library)
    vocabulary = {word for category in library
                       for document in category
                       for word in document}

    cond_prob = []
    prior = []

    for category in library:
        # Prior probability
        total_cat_docs = len(category)
        prior.append(total_cat_docs / total_docs)

        # Conditional probabilities
        text = [word for document in category for word in document]

        word_count = Counter(text)
        total_word_count = sum(word_count.values())

        cat_cond_prob = {}

        for word in vocabulary:
            cat_cond_prob[word] = (word_count[word] + 1) / (total_word_count + 1)

        cond_prob.append(cat_cond_prob)

    return (vocabulary, prior, cond_prob)

def apply_nb(vocabulary, priors, cond_prob, document):
    """Apply the naive Bayes classification to a document in order
       to retrieve its category."""
    prepared_doc = clean_document(document)
    prepared_doc = [word for word in prepared_doc if word in vocabulary]

    score = [math.log(prior) for prior in priors]

    for cat, cat_cond_prob in enumerate(cond_prob):
        score[cat] += sum(math.log(cat_cond_prob[word]) for word in prepared_doc)

    return score.index(max(score))

def main(argv):
    """Train a naive Bayes classifier and apply it to a user-supplied string."""
    if len(argv) == 0:
        print("Please supply a document string.")
        return

    user_doc = argv[0]

    library = load_data(DATAFILE)
    clean_library(library)
    vocabulary, priors, cond_prob = train_categories(library)

    doc_cat = apply_nb(vocabulary, priors, cond_prob, user_doc)
    print(f'"{user_doc}": Category {doc_cat}')

if __name__ == "__main__":
    main(sys.argv[1:])
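
A usage sketch (the script name is only illustrative, assuming the UCI file has been saved at the DATAFILE path above):

    python naive_bayes.py "A wonderful little production with believable acting"

The script then prints the supplied string followed by the predicted category (0 or 1), per the f-string in main().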

probability – Distribution for which the Bayes error equals the asymptotic nearest neighbor error

Let $\eta(x) = \mathbb{P}(Y=1 \mid X=x)$ where $Y$ is a random variable in $\{0, 1\}$ and $X$ is a random variable taking values in some space $\mathcal{X}$. Show that there exists some distribution over $X, Y$ such that:

$$\mathbb{E}[\min\{\eta(X), 1-\eta(X)\}] = \mathbb{E}[2\eta(X)(1-\eta(X))]$$

Note that the term on the LHS is the Bayes error, while the term on the RHS is the asymptotic nearest neighbor error.

What I’ve Tried

I’m still having trouble getting intuition on when this might be true. I have tried coming up with simple examples, such as $X$ Bernoulli with parameter $q$ and $Y \sim \mathrm{Bernoulli}(p + (-1)^X \epsilon)$ for some constants $p, \epsilon > 0$, but none of the examples I’ve come up with have cases where the above equality is true.
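
One pointwise comparison that might help (my own reasoning, worth double-checking): for $t = \eta(x) \in [0,1]$,
$$
2t(1-t) - \min\{t, 1-t\} = \begin{cases} t(1-2t) & t \le \tfrac{1}{2} \\ (1-t)(2t-1) & t \ge \tfrac{1}{2} \end{cases} \;\ge\; 0,
$$
with equality exactly when $t \in \{0, \tfrac{1}{2}, 1\}$. So the two expectations coincide whenever $\eta(X) \in \{0, \tfrac{1}{2}, 1\}$ almost surely; for instance, a deterministic labeling ($\eta \equiv 0$ or $\eta \equiv 1$) makes both sides $0$.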

statistics – Bayes Estimator for Bernoulli Variance

I have the following question, which I have also posted here, but nobody has answered it, so I thought I would post it here as well.

Let $X_1,\dots,X_n$ be independent, identically distributed random variables with
$$
P(X_i=1)=\theta = 1-P(X_i=0)
$$

where $\theta$ is an unknown parameter, $0<\theta<1$, and $n\geq 2$. It is desired to estimate the quantity $\phi = \theta(1-\theta) = n\operatorname{Var}\big((X_1+\dots+X_n)/n\big)$.

Suppose that a Bayesian approach is adopted and that the prior distribution for $\theta$, $\pi(\theta)$, is taken to be the uniform distribution on $(0,1)$. Compute the Bayes point estimate of $\phi$ when the loss function is $L(\phi,a)=(\phi-a)^2$.

Now, my solution so far:

It can easily be proven that $a$ needs to be the mean of the posterior. Also, when $\theta$ spans $(0,1)$, $\phi$ spans $(0,\frac{1}{4})$. Hence, we have that
$$
a = \int_0^{\frac{1}{4}}\phi\cdot f(\phi\mid x_1,\dots,x_n)\,d\phi.
$$

Now, we have that
$$
f(\phi\mid x_1,\dots,x_n)\propto f(x_1,\dots,x_n\mid\phi)\cdot \pi(\phi).
$$

Given that $\theta$ follows $U(0,1)$, we get that $\phi$ follows:

$$
P(\Phi\leq t) = \frac{1-\sqrt{1-4t}}{2}
$$

Hence we can derive $\pi(\phi)$. However, I am not sure how to derive $f(x_i\mid\phi)$.

Help with proceeding and pointing out any mistakes I have made so far would be much appreciated.
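
A possible shortcut (my own sketch, relying on standard Beta-posterior facts, so please double-check): since the loss is squared error, the Bayes estimate is the posterior mean of $\phi$, and it can be computed directly as the posterior expectation of $\theta(1-\theta)$ without deriving the density of $\phi$. With a uniform prior, $\theta\mid x_1,\dots,x_n \sim \mathrm{Beta}(s+1,\,n-s+1)$ where $s=\sum_i x_i$, so
$$
a = E\big[\theta(1-\theta)\mid x_1,\dots,x_n\big] = \frac{(s+1)(n-s+1)}{(n+2)(n+3)}.
$$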

Bayes' Theorem with multiple inputs

Trying to write a Bayes equation to predict the playfulness of a dog breed based on related breeds. An example of this is how playful a Labrador is when you look at the playfulness of a golden retriever, poodle and German Shepherd. (These breeds don't appear to be related, but genetically similar breeds can tell you about personality traits.)

I originally thought the following

Pl = play, GS = German Shepherd, GR = golden retriever, Po = poodle

$$ p(PlLab \mid PlGR, PlPo, PlGS) = \frac{p(PlGR)\,p(GR) + p(PlPo)\,p(Po) + p(PlGS)\,p(GS)}{p(PlGR) + p(PlPo) + p(PlGS)} $$

Although I don't know if that's right.
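
For comparison, one common way to condition on several pieces of evidence is the naive-Bayes form (a sketch only, and it assumes the related breeds' playfulness values are conditionally independent given the Labrador's):
$$
p(PlLab \mid PlGR, PlPo, PlGS) \;\propto\; p(PlLab)\, p(PlGR \mid PlLab)\, p(PlPo \mid PlLab)\, p(PlGS \mid PlLab),
$$
normalized so that the values over the possible levels of $PlLab$ sum to one.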

Probability Theory – Bayes Theorem Probability does not make sense

I'm trying to use Bayes' theorem to calculate the probability $P(A \mid B)$. I have $P(A)$ in column 1, $P(B \mid A)$ in column 2, and $P(B)$ in column 3. I get the following:

[Screenshot of the spreadsheet with the three probability columns]

My calculations were:

$$ P(B \mid A) = 0.8 \times A $$
$$ P(B) = (Bx \times 0.55) + ((1-Bx) \times 0.55) $$
$$ P(A \mid B) = (Ax \times Bx) / Cx $$

The probability is over 1. What am I doing wrong?
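
For reference, the usual form of Bayes' theorem with the law of total probability in the denominator (a sketch, not based on the actual numbers in the screenshot):
$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,\big(1 - P(A)\big).
$$
With $P(B)$ computed this way the ratio cannot exceed $1$, because the numerator is one of the two non-negative terms in the denominator.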

python – Naive Bayes – From Likelihood to Probability

I am implementing a Gaussian Naive Bayes from scratch following this really good tutorial: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

Everything seems to work well. I've adapted the code to work with another dataset, which comes from here: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)

Because this tutorial uses a Gaussian probability density function, it produces likelihoods. I've tried to tweak it so that my output also shows a probability, not just a likelihood. The decision making works well and it predicts correctly. However, my output shows values such as 1.2803300002643495e-15, which I cannot interpret as a probability.

I've tried converting from likelihood to probability like this:

probfor1 = probabilities[1] / (probabilities[1] + probabilities[2])
probfor2 = probabilities[2] / (probabilities[1] + probabilities[2])

However, it does not produce correct probabilities (i.e., it outputs 0.9991818295017332 or 2.8694419736742756e-06, which is clearly incorrect). Please let me know if I should add more explanation.

My whole code is:

# -*- coding: utf-8 -*-
"""Gaussian naive Bayes, adapted from
https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/"""

import csv
import math


"""
-------------------
Start: load data
-------------------
Reads the DAT files and returns the values as floats.
Used for the training and test files.
"""
def loadCsv(filename):
    # the statlog heart .dat files are space-separated
    lines = csv.reader(open(filename, "rt"), delimiter=' ')
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
"""
-------------------
End: load data
-------------------
"""


def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    # print(sum(numbers) / float(len(numbers)))
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    # print(math.sqrt(variance))
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # drop the class column
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    # print(summaries)
    return summaries

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    # print((1 / (math.sqrt(2 * math.pi) * stdev)) * exponent)
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    print("here")
    print(probabilities[1])
    print("Here2")
    print(probabilities[2])

    # print("Here3")
    # print(probabilities[1] / (probabilities[1] + probabilities[2]))
    return probabilities

def predict(summaries, inputVector):
    print("-----------------------------")
    probabilities = calculateClassProbabilities(summaries, inputVector)
    print('Probabilities for each class: {0}'.format(probabilities))
    probfor1 = probabilities[1] / (probabilities[1] + probabilities[2])
    probfor2 = probabilities[2] / (probabilities[1] + probabilities[2])
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    print(bestProb)
    print(bestLabel)
    if bestLabel == 1:
        print("hereeeee11111")
        print(probfor1)
    else:
        print("Hereeeee2222")
        print(probfor2)

    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
        print(result)
    return predictions

def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    filename_training = 'heart.training.dat'
    filename_testing = 'heart.test.dat'
    # splitRatio = 0.67
    trainingSet = loadCsv(filename_training)
    testSet = loadCsv(filename_testing)
    # trainingSet, testSet = splitDataset(trainingSet, splitRatio)
    # print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # Prepare the model
    summaries = summarizeByClass(trainingSet)
    # Test the model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()
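
For reference, a minimal sketch of one way to turn per-class Gaussian scores into normalized probabilities without underflow, by summing log-densities instead of multiplying raw densities and then applying the log-sum-exp trick; it assumes the summaries structure from the code above, and the function names are only illustrative:

import math

def log_class_scores(summaries, inputVector):
    """Sum log-densities per class instead of multiplying raw densities."""
    scores = {}
    for classValue, classSummaries in summaries.items():
        total = 0.0
        for i, (m, s) in enumerate(classSummaries):
            x = inputVector[i]
            # log of the Gaussian density, evaluated directly to avoid underflow
            total += -0.5 * math.log(2 * math.pi * s ** 2) - (x - m) ** 2 / (2 * s ** 2)
        scores[classValue] = total
    return scores

def posteriors_from_log_scores(scores):
    """Normalize log scores into probabilities with the log-sum-exp trick."""
    highest = max(scores.values())
    shifted = {c: math.exp(v - highest) for c, v in scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

With equal (or ignored) class priors, posteriors_from_log_scores(log_class_scores(summaries, row)) gives values in [0, 1] that sum to one, which is the quantity probfor1 and probfor2 above are trying to approximate.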