I’m working on a small ML regression problem, and after some empirical testing I found that the best way to normalize my dependent variable is log scaling (np.log or np.log1p). But I wanted to know whether it is possible to tell, directly from the plots, that this is the right way to normalize. Here is my code (y is a pandas Series containing the values of my dependent variable):
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats

fig = plt.figure(figsize=(12, 8), constrained_layout=True)
grid = gridspec.GridSpec(ncols=3, nrows=4, figure=fig)

# Histogram
ax1 = fig.add_subplot(grid[0, :])
sns.distplot(y, ax=ax1)
ax1.set_title("Histogram of revenue", fontsize=10)

# Probability plot
ax2 = fig.add_subplot(grid[2:, :2])
stats.probplot(y, plot=ax2)
ax2.set_title("QQ plot of revenue")

# Boxplot
ax3 = fig.add_subplot(grid[2:, 2])
sns.boxplot(y=y, ax=ax3, orient="v")
ax3.set_title("Boxplot of revenue")

plt.show()
It gives the following plots:
[Plots for the distribution of revenue]
I also have this for kurtosis and skewness:
print(f"Kurtosis : {y.kurt()}")
print(f"Skewness : {y.skew()}")
Kurtosis : 12.055176638707394
Skewness : 2.793478695162504
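For context, the "empirical testing" I mention above amounts to comparing these statistics before and after the transform. A minimal, self-contained sketch of that comparison (using synthetic log-normal data in place of my actual revenue series, which I can't share):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for y (assumption: revenue-like,
# roughly log-normal). Seeded for reproducibility.
rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

# Skewness and kurtosis of the raw series: both large, as in my real data.
print(f"Raw   -> skew: {y.skew():.2f}, kurt: {y.kurt():.2f}")

# After log scaling, the distribution is close to symmetric.
y_log = np.log1p(y)
print(f"log1p -> skew: {y_log.skew():.2f}, kurt: {y_log.kurt():.2f}")
```

On this synthetic series, log1p brings the skewness from well above 2 down to near 0, which mirrors what I observed on my data.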
So my question is: based on this information, how do you know which normalization technique to pick (here, log scaling instead of Z-score normalization, for example)?