Hi!

Very good blog...

I'm doing some stuff with Clementine... and I have an issue...

My target for the NN training dataset is a continuous value between 0 and 100... the problem is that it has a normal/Gaussian distribution, and that makes the NN predict badly...

How can I resolve the unbalanced data? Split it into classes with the same frequency!?

Regards,

Pedro

Pedro,

I am not aware that neural networks have a problem with predicting values with normal distributions. In fact, if you randomize the weights in a neural network whose output layer has a linear transfer function, then the output is likely to follow a normal distribution -- just from the Central Limit Theorem of statistics.
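
A rough way to see this for yourself, using only the Python standard library. This is a sketch under assumptions that are mine, not anything Clementine does: one hidden layer of 50 tanh units, Gaussian input weights, uniform output weights, and a single made-up input row. It redraws all the weights 2000 times and checks how much of the output falls within one standard deviation of the mean; for a normal distribution that share is about 0.68.

```python
import math
import random
import statistics

def random_net_output(inputs, n_hidden, rng):
    """Forward pass through a one-hidden-layer network with freshly drawn weights."""
    hidden = []
    for _ in range(n_hidden):
        # Bias plus weighted sum of the inputs, every weight drawn at random.
        z = rng.gauss(0, 1) + sum(rng.gauss(0, 1) * x for x in inputs)
        hidden.append(math.tanh(z))
    # Linear transfer function on the output layer: a plain weighted sum.
    return sum(rng.uniform(-1, 1) * h for h in hidden)

rng = random.Random(0)       # fixed seed so the run is repeatable
inputs = [0.2, -0.5, 0.9]    # one arbitrary, made-up input row
outputs = [random_net_output(inputs, 50, rng) for _ in range(2000)]

mean = statistics.mean(outputs)
sd = statistics.stdev(outputs)
share_within_1sd = sum(abs(o - mean) <= sd for o in outputs) / len(outputs)
print(f"share within one standard deviation: {share_within_1sd:.2f}")
```

The output is a sum of many independent random terms, which is exactly the situation the Central Limit Theorem describes.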

So, you have a neural network that is not producing good results. There can be several causes.

The first thing to look for is too many inputs. Clementine has options to prune the input variables on a neural network. Be sure that you do not have too many inputs. I would recommend a variable reduction technique such as principal components, and advise you to avoid categorical variables that have many levels.

A similar problem can occur if your hidden layer is too large.

Whatever the network, it is worthwhile looking at the number of weights in the network (or a related measure called the degrees of freedom). Remember, you want to have lots of training data for each weight.
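
Counting the weights is simple arithmetic for a fully connected network. Here is a minimal sketch; the layer sizes and the row count are hypothetical, and it assumes one bias weight per unit, which may not match how a particular tool sets things up.

```python
def count_weights(layer_sizes, bias=True):
    """Number of weights in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # Each unit in the next layer gets one weight per incoming unit,
        # plus one bias weight if biases are used.
        total += (n_in + (1 if bias else 0)) * n_out
    return total

# Hypothetical network: 40 inputs, 20 hidden units, 1 output.
n_weights = count_weights([40, 20, 1])
n_rows = 10_000  # hypothetical size of the training set
print(n_weights, "weights,", round(n_rows / n_weights, 1), "training rows per weight")
```

Cutting either the inputs or the hidden layer in half roughly halves the number of weights, which doubles the training rows available per weight.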

Another problem may be that the target is continuous, but bounded between 0 and 100. This could result in a neural network where the output layer uses a linear transfer function. Although not generally a bad idea, it may not work in this case because the range of a linear function is from minus infinity to positive infinity, which far exceeds the range of the data.

One simple solution would be to divide the output by 100 and treat it as a probability. The neural network should then be set up with a logistic function in the target layer.
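
In code, the rescaling is a one-liner each way. A sketch, assuming the bounds really are 0 and 100; the function names are mine, and the raw activations are made up to show the behavior at the extremes.

```python
import math

def logistic(z):
    """Logistic transfer function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_unit_interval(y, lo=0.0, hi=100.0):
    """Rescale a target from [lo, hi] to [0, 1] before training."""
    return (y - lo) / (hi - lo)

def from_unit_interval(p, lo=0.0, hi=100.0):
    """Map a network output in [0, 1] back to the original scale."""
    return lo + p * (hi - lo)

# Made-up raw activations reaching the output unit, including extreme ones.
raw_activations = [-50.0, -2.0, 0.0, 2.0, 50.0]
predictions = [from_unit_interval(logistic(z)) for z in raw_activations]
print(predictions)
```

However extreme the activation, the prediction cannot leave the 0 to 100 range, which is the property a linear output lacks.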

Your idea of binning the results might also work, assuming that bins work for solving the business problem. Equal sized bins are reasonable, since they are readily understandable as quantiles.
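
Equal-frequency binning can be sketched by ranking the values and cutting the ranks into equal groups. The target values below are made up, and this simple version splits tied values arbitrarily between neighboring bins, which a real tool might handle differently.

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin label 0..n_bins-1 so bins have (nearly) equal counts."""
    # Positions of the values in sorted order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        # Cut the ranks, not the value range, into n_bins equal pieces.
        labels[idx] = rank * n_bins // len(values)
    return labels

# Made-up targets clustered around 50, as with a bell-shaped distribution.
targets = [52, 48, 50, 47, 55, 30, 70, 51, 49]
labels = quantile_bins(targets, 3)
print(labels)
```

With three bins, each bin holds a third of the rows even though most of the values sit close to the middle of the range.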

Good luck.

## Tuesday, August 25, 2009


Uaahhooo...

Amazingly fast answer!! Thanks a lot!!

The "problem" with the normal distribution is that I thought the NN might be smart enough to notice that, if 70% of the target values fall between, for example, 30 and 60, it could keep its predictions within that range too and guarantee at least 70% accuracy... :-)

But I'll follow your tips... maybe I have too many inputs... around 40... I'll focus on quality and less on quantity...

Do you suggest normalizing the input columns?

My goal is to try to predict the signal for a stock for the next day... this signal can be -1, 0 or 1 (sell/hold/buy)... as part of my master's degree in BI...

I'm using a supervised NN and I need to generate the optimal signals... for that I'm using a formula to get the best return over the next 5 days, normalized between 0 and 100... next, to reduce the unbalanced data, I generated those signals based on the mean and the standard deviation... and now I'm trying to predict the classes -1, 0, 1 instead of the continuous output... I'm trying to get some results...

I can show you more detail if you are interested... maybe you have some time and can give me more tips... :-)

Thanks a lot!

Pedro

www.pedrocgd.blogspot.com

Predicting stock market outcomes is a known hard problem. Any system that does a little better than random is still going to be quite valuable.

That may be the issue: your definition of a 'good model' may be too demanding.