I have some time series data, $\mathbf{x}_{1:T} = \{ x_1, \dots, x_T \}$, where the observation at time $t$, $X_t$, is a continuous random variable. Let $Y_t$ denote a discrete random variable at time $t$ that, conditioned on the previous $t$ observations, has support over $t$ values. (It is an estimate for each of the previous time points.)
Is this mutual information well-defined?
$$
\text{MI}(Y_t, X_t) = \mathbb{H}\big(p(\color{red}{Y_{t-1}} \mid \mathbf{x}_{1:t-1})\big) - \mathbb{E}_{X_t}\Big[\mathbb{H}\big(p(Y_{t} \mid \mathbf{x}_{1:t-1}, X_t = x_t)\big)\Big].
$$
In words, I want to know how much information I gain about $Y_t$ by observing $X_t$.
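To make this concrete, here is a minimal sketch of the quantity I am trying to compute, with a made-up model (`p_y_given_x` and the Gaussian predictive for $X_t$ are stand-ins, not my actual model). Both entropy terms here use the same $Y_t$, and the expectation over $X_t$ is done by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 5  # toy value: Y_t has support over t values

def p_y_given_x(x, t=t):
    """Made-up stand-in for p(Y_t | x_{1:t-1}, X_t = x): a softmax
    over t categories whose logits depend on the new observation x."""
    logits = -(x - np.arange(t)) ** 2
    w = np.exp(logits - logits.max())
    return w / w.sum()

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Stand-in predictive for X_t given x_{1:t-1}.
xs = rng.normal(2.0, 1.0, size=100_000)
cond = np.array([p_y_given_x(x) for x in xs])

# p(Y_t | x_{1:t-1}) = E_{X_t}[ p(Y_t | x_{1:t-1}, X_t) ]
h_prior = entropy(cond.mean(axis=0))

# E_{X_t}[ H(p(Y_t | x_{1:t-1}, X_t)) ]
h_post = np.mean([entropy(p) for p in cond])

print(f"MI(Y_t; X_t) ~= {h_prior - h_post:.4f} nats")
```

With both terms over the same $Y_t$, this difference is non-negative by Jensen's inequality (entropy is concave), which is part of why the $Y_{t-1}$ in the first term confuses me.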
I ask for two reasons:

The maximum entropy of a discrete distribution is a function of the size of its support, so I am not sure the left and right terms above are on the same "scale". I wonder if I should have $Y_t$ instead of $Y_{t-1}$ (in red above), or if there is another way to handle this (assuming it is a problem at all).
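To make the scale worry concrete: a discrete distribution with support of size $k$ has entropy at most $\log k$, so as written the two terms obey different bounds,
$$
\mathbb{H}\big(p(Y_{t-1} \mid \mathbf{x}_{1:t-1})\big) \le \log(t-1),
\qquad
\mathbb{E}_{X_t}\Big[\mathbb{H}\big(p(Y_t \mid \mathbf{x}_{1:t-1}, X_t)\big)\Big] \le \log t,
$$
whereas with $Y_t$ in both places both terms concern the same $t$-valued variable.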

When I approximate $\text{MI}(X_t, Y_t)$ using some code, I get a slightly different answer than for $\text{MI}(Y_t, X_t)$ (always slightly larger). And sometimes the value for $\text{MI}(Y_t, X_t)$ is negative, even though I know that MI is non-negative.
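For what it's worth, estimating the two entropy terms separately (rather than from one joint sample) can produce negative values even when the true MI is non-negative, since the Monte Carlo errors of the two terms no longer cancel. Here is a self-contained toy (again a made-up model, not my actual code) where that can happen once the true MI is small relative to the noise:

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def p_y_given_x(x, t=5):
    """Made-up conditional p(Y_t | X_t = x): a softmax over t
    categories that depends only weakly on x, so the true MI is tiny."""
    logits = -0.001 * (x - np.arange(t)) ** 2
    w = np.exp(logits - logits.max())
    return w / w.sum()

def mi_estimate(n):
    # The two terms are estimated from *independent* Monte Carlo
    # batches, as can happen when the marginal and conditional
    # entropies come from different approximations. Their errors
    # then need not cancel, and the difference can dip below zero.
    xs_a = rng.normal(2.0, 1.0, size=n)
    xs_b = rng.normal(2.0, 1.0, size=n)
    h_marginal = entropy(np.mean([p_y_given_x(x) for x in xs_a], axis=0))
    h_conditional = np.mean([entropy(p_y_given_x(x)) for x in xs_b])
    return h_marginal - h_conditional

ests = np.array([mi_estimate(10) for _ in range(200)])
print(f"fraction of negative MI estimates: {np.mean(ests < 0):.2f}")
```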