Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Calculating the outliers in R

I have a data frame like this:

x

Team  01/01/2012  01/02/2012  01/03/2012  01/01/2012  01/04/2012  SD  Mean
A     100         50          40          NA          30          60  80

I would like to compare each cell against the row's Mean and SD in order to flag outliers. For example:

abs(x-Mean) > 3*SD

x$count<-c(1) (this value should be incremented each time the above condition is met).

I am doing this to check for anomalies in my data set. If I knew the column names, the calculation would be easier, but the number of columns will vary. Some cells may contain NA.

I would like to subtract the mean from each cell, and I tried this:

x$diff<-sweep(x, 1, x$Mean, FUN='-')

This does not seem to work. Any ideas?


1 Reply


Get your IQR (Interquartile range) and lower/upper quartile using:

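lowerq = quantile(data)[2]  # 2nd element of quantile() is the 25% quantile (lower quartile)
upperq = quantile(data)[4]  # 4th element is the 75% quantile (upper quartile)
iqr = upperq - lowerq       # or use IQR(data)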
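
Note that the lookup above picks the quartiles by position and that quantile() stops with an error when the input contains NA, which the question says can happen. As a minimal sketch, assuming data is a plain numeric vector (the sample values are borrowed from the row in the question), the quartiles can be requested by probability with missing values dropped:

```r
# Sample values taken from the question's row; NA marks a missing cell.
data <- c(100, 50, 40, NA, 30)

# Ask for the 25% and 75% quantiles explicitly and ignore NA values.
q <- quantile(data, probs = c(0.25, 0.75), na.rm = TRUE)
lowerq <- q[["25%"]]
upperq <- q[["75%"]]

# Same value as IQR(data, na.rm = TRUE).
iqr <- upperq - lowerq
```

Both quantile() and IQR() use R's default quantile definition (type 7) here, so the two ways of getting the IQR agree.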

Compute the bounds for a mild outlier:

mild.threshold.upper = (iqr * 1.5) + upperq
mild.threshold.lower = lowerq - (iqr * 1.5)

Any data point outside these bounds (greater than mild.threshold.upper or less than mild.threshold.lower) is a mild outlier.
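
To make the comparison concrete, here is a small sketch that flags mild outliers in a made-up vector. The sample values are hypothetical and chosen only so that one point clearly falls outside the bounds; NA handling via na.rm is an addition not shown in the steps above.

```r
# Hypothetical sample: 95 sits far above the other values.
data <- c(10, 12, 11, 13, 12, 95, NA)

lowerq <- quantile(data, 0.25, na.rm = TRUE)[[1]]
upperq <- quantile(data, 0.75, na.rm = TRUE)[[1]]
iqr <- upperq - lowerq

mild.threshold.upper <- upperq + 1.5 * iqr
mild.threshold.lower <- lowerq - 1.5 * iqr

# TRUE where a value lies outside the mild bounds; NA cells stay NA.
is.mild <- data > mild.threshold.upper | data < mild.threshold.lower

# which() ignores NA, so this returns the positions of the flagged values.
flagged <- which(is.mild)
```

With these values only the sixth element (95) should be flagged.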

To detect extreme outliers, do the same but multiply the IQR by 3 instead:

extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)

Any data point outside these bounds (greater than extreme.threshold.upper or less than extreme.threshold.lower) is an extreme outlier.
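
Both sets of bounds can be wrapped into one helper. The following is only a sketch: classify_outliers is a hypothetical name, not a base R function, and the input vector is made up for illustration.

```r
# Hypothetical helper: label each value as "none", "mild" or "extreme".
classify_outliers <- function(data) {
  q <- quantile(data, probs = c(0.25, 0.75), na.rm = TRUE)
  lowerq <- q[[1]]
  upperq <- q[[2]]
  iqr <- upperq - lowerq

  # Outside 3 * IQR counts as extreme, outside 1.5 * IQR as mild.
  extreme <- data > upperq + 3 * iqr | data < lowerq - 3 * iqr
  mild <- data > upperq + 1.5 * iqr | data < lowerq - 1.5 * iqr

  # Extreme is checked first because every extreme outlier is also mild.
  ifelse(extreme, "extreme", ifelse(mild, "mild", "none"))
}

# Made-up values: 18 lands between the two bounds, 40 is beyond both.
flags <- classify_outliers(c(10, 11, 12, 12, 13, 13, 14, 18, 40, NA))
```

For this input the quartiles are 12 and 14, so 18 is labelled "mild", 40 is labelled "extreme", and the NA cell stays NA.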

Hope this helps

Edit: the code was previously accessing the 50% quantile, not the 75% quantile.
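
For the rule in the question itself (abs(x - Mean) > 3*SD per row, with a varying number of date columns), a rough sketch follows. The sweep() call in the question likely fails because x still contains the non-numeric Team column, and its result is a whole table rather than a single column. This sketch assumes the columns Team, SD and Mean always carry those names and that every other column holds numeric values; row B is made up so that one cell is flagged, and the duplicated date column from the question is left out.

```r
# Data frame modeled on the question; row B is hypothetical.
x <- data.frame(
  Team = c("A", "B"),
  "01/01/2012" = c(100, 10),
  "01/02/2012" = c(50, 12),
  "01/03/2012" = c(40, NA),
  "01/04/2012" = c(30, 90),
  SD = c(60, 5),
  Mean = c(80, 11),
  check.names = FALSE
)

# Treat every column except the known ones as a value column.
value.cols <- setdiff(names(x), c("Team", "SD", "Mean"))
values <- as.matrix(x[value.cols])

# Subtract each row's Mean from that row's cells (MARGIN = 1 means rows).
diffs <- sweep(values, 1, x$Mean, FUN = "-")

# 3 * x$SD has one entry per row, so it recycles row-wise over the matrix.
is.outlier <- abs(diffs) > 3 * x$SD

# Count flagged cells per row, skipping NA cells.
x$count <- rowSums(is.outlier, na.rm = TRUE)
```

Row A has no cell further than 3 * 60 from its mean, so its count is 0; row B has one cell (90) further than 3 * 5 from 11, so its count is 1.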

