python – TensorFlow giving wrong values for precision, recall and accuracy for validation set

I have a very simple model –

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Dense(64, 'relu'),
    Dense(384, 'relu'),
    Dense(384, 'relu'),
    Dense(512, 'relu'),
    Dense(512, 'relu'),
    Dense(512, 'relu'),
    Dense(512, 'relu'),
    Dense(312, 'relu'),
    Dense(1, 'sigmoid'),
])

And the training setup is also pretty straightforward.

initial_learning_rate = 0.0008
decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=75,
    decay_rate=0.95,
    staircase=False)

optimizer = Adam(learning_rate=decay, amsgrad=True)  # 0.0005
model.compile(
    optimizer=optimizer,
    loss='BinaryCrossentropy',
    metrics=[
        'accuracy',
        'Precision',
        'Recall',
        'AUC',
    ],
)
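As a side note on the schedule above: with staircase=False, ExponentialDecay applies the continuous formula initial_learning_rate * decay_rate ** (step / decay_steps) at each optimizer step (per the TensorFlow documentation). A minimal pure-Python sketch of what it produces, using the values from my code:

```python
# Sketch of the learning rate ExponentialDecay yields at a given
# optimizer step (continuous decay, since staircase=False).
initial_learning_rate = 0.0008
decay_steps = 75
decay_rate = 0.95

def decayed_lr(step):
    # initial_learning_rate * decay_rate ** (step / decay_steps)
    return initial_learning_rate * decay_rate ** (step / decay_steps)

lr_at_0 = decayed_lr(0)     # 0.0008, the starting rate
lr_at_75 = decayed_lr(75)   # 0.0008 * 0.95, one full decay period
```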

The dataset has 800+ features and about 150,000 observations, with a 60-40 train-test split. (In previous models I did have a separate validation set, but to just get a basic idea of a good architecture I have skipped it and am using the test set as the validation set.)

The last layer of the model returns a value between 0 and 1. In production I ultimately have to return probabilities, not 1/0, but the actual labels in the dataset contain only 1s and 0s, not probabilities. That's why I am using the above architecture for training:

model.fit(
    X_train,
    y_train,
    batch_size=512,
    epochs=1000,
    validation_data=(X_train, y_train),
)

Now here is the problem: as the data is a bit imbalanced, I have to rely heavily on precision, recall, and AUC. But the values reported during training differ by a large margin from what I get when I compute the same metrics with sklearn functions at various thresholds.

Epoch 114/1000
166/166 [==============================] - 2s 11ms/step - loss: 0.0746 - accuracy: 0.9782 - precision: 0.9863 - recall: 0.8498 - auc: 0.9850 - val_loss: 0.0746 - val_accuracy: 0.9782 - val_precision: 0.9863 - val_recall: 0.8497 - val_auc: 0.9850

Epoch 115/1000
166/166 [==============================] - 2s 11ms/step - loss: 0.0746 - accuracy: 0.9782 - precision: 0.9863 - recall: 0.8497 - auc: 0.9850 - val_loss: 0.0746 - val_accuracy: 0.9782 - val_precision: 0.9863 - val_recall: 0.8497 - val_auc: 0.9850

Epoch 116/1000
162/166 [============================>.] - ETA: 0s - loss: 0.0746 - accuracy: 0.9782 - precision: 0.9862 - recall: 0.8498 - auc: 0.9850

Restoring model weights from the end of the best epoch.
166/166 [==============================] - 2s 11ms/step - loss: 0.0746 - accuracy: 0.9782 - precision: 0.9863 - recall: 0.8497 - auc: 0.9850 - val_loss: 0.0746 - val_accuracy: 0.9782 - val_precision: 0.9863 - val_recall: 0.8497 - val_auc: 0.9850

Sklearn scores for the testing (validation) set at threshold 0.5:

[[45912  3240]
 [ 4132  3370]]

Precision: 0.5098335854765507
Recall: 0.4492135430551853
Accuracy: 0.8698767959896918
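To rule out a mistake on my end, here is how the three scores above follow directly from the confusion matrix (sklearn's layout: rows are true classes, columns are predicted classes, so the entries are TN, FP / FN, TP):

```python
# Recompute the reported sklearn scores from the confusion matrix above.
# sklearn.metrics.confusion_matrix layout: [[TN, FP], [FN, TP]].
tn, fp = 45912, 3240
fn, tp = 4132, 3370

precision = tp / (tp + fp)                   # ~0.5098
recall = tp / (tp + fn)                      # ~0.4492
accuracy = (tp + tn) / (tn + fp + fn + tp)   # ~0.8699
```

So the sklearn numbers are internally consistent with the matrix; the discrepancy is only against what Keras reports during training.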

Is there something I am doing wrong, or am I just interpreting the results wrong? I eventually tried writing custom functions for these metrics, but that was pretty hard too. I really want to understand how TensorFlow determines the threshold for splitting the output probability into two classes when computing precision, recall, accuracy, and AUC, and why these values are so far off from the real ones calculated using sklearn.
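For reference, my current understanding is that tf.keras.metrics.Precision and Recall use a fixed default threshold of 0.5 and accumulate true/false positive counts across batches, then report the ratio at the end of the epoch. A minimal numpy sketch of that streaming behaviour (the class and names are my own illustration, not Keras internals):

```python
import numpy as np

# Illustrative streaming precision: accumulate TP/FP counts batch by
# batch at a fixed threshold, as I believe Keras does with its
# default thresholds=0.5. Not the actual Keras implementation.
class StreamingPrecision:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.tp = 0
        self.fp = 0

    def update(self, y_true, y_prob):
        y_pred = (np.asarray(y_prob) >= self.threshold).astype(int)
        y_true = np.asarray(y_true)
        self.tp += int(np.sum((y_pred == 1) & (y_true == 1)))
        self.fp += int(np.sum((y_pred == 1) & (y_true == 0)))

    def result(self):
        total = self.tp + self.fp
        return self.tp / total if total else 0.0

m = StreamingPrecision()
m.update([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])  # batch 1: tp=1, fp=1
m.update([1, 1, 0, 0], [0.7, 0.6, 0.3, 0.2])  # batch 2: tp=2, fp=0
print(m.result())  # 3 / (3 + 1) -> 0.75
```

If that understanding is right, I still don't see why the epoch-level values disagree so strongly with sklearn at the same 0.5 threshold.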