r - Random Data Sets within loops


posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

Here is what I want to do:

I have a time series data frame with, say, 100 time series of length 600, each in one column of the data frame.

I want to pick 4 of the time series at random and assign them random weights that sum to one (e.g. 0.1, 0.5, 0.3, 0.1). Using those, I want to compute the mean of the weighted sum of the 4 time series (i.e. the mean of a convex combination).

I want to do this, say, 100k times and store each result in the form

ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean

so that I get a 100k x 9 df.

I have tried some things already, but R is slow with loops, and I know vectorized solutions suit R's design better.

Here is what I did, and I know it is horrible.

The df is in the form

v1,v2,v3.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc

res=NULL
for (x in 1:100000)
{
  s=sample(1:100,4)                 # pick 4 variables randomly
  a=sample(seq(0,1,0.01),1)
  b=sample(seq(0,1-a,0.01),1)
  c=sample(seq(0,(1-a-b),0.01),1)
  d=1-a-b-c
  w=c(a,b,c,d)                      # 4 random weights
  average=mean(as.matrix(timeseries.df[,s])%*%w)
  res=rbind(res,c(s,w,average))     # in the end I get the 100k x 9 result
}

The procedure runs way too slow.

EDIT:

Thanks for the help. I am not used to thinking in R, and I am not used to translating every problem into matrix algebra, which is what R calls for. The problem becomes a little more complex if I want to calculate the standard deviation: I need the covariance matrix, and I am not sure whether or how I can pick, for each sample, the corresponding elements from the covariance matrix of the original timeseries.df and then compute the sample variance

t(sampleweights)%*%sample_cov.mat%*%sampleweights

to end up with the ts.weighted_standard_dev matrix.

Last question: what is the best way to proceed if I want to bootstrap the original df x times and then apply the same computations to test the robustness of my data?

thanks

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:15:52+0000

OK, let me try to solve your problem. As a foreword: I can think of no application where it is sensible to do what you are doing. However, that is for you to judge (nonetheless, I would be interested in the application...).

First, note that the mean of the weighted sums equals the weighted sum of the means, as:

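$$
\frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{n} w_i \, x_{i,t}
= \sum_{i=1}^{n} w_i \left(\frac{1}{T}\sum_{t=1}^{T} x_{i,t}\right)
= \sum_{i=1}^{n} w_i \, \bar{x}_i
$$

where x_{i,t} is the value of the i-th selected time series at time t, w_i is its weight, T is the length of the series and n is the number of series in the combination.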
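A quick numerical check of this identity, as a minimal sketch on made-up data (the names `x` and `w` are illustrative and not part of the original post):

```r
set.seed(1)
x <- matrix(runif(600 * 4, 1, 10), ncol = 4)  # 4 hypothetical series of length 600
w <- c(0.1, 0.5, 0.3, 0.1)                    # weights summing to 1

mean_of_weighted_sum  <- mean(x %*% w)        # combine the series first, then average over time
weighted_sum_of_means <- sum(colMeans(x) * w) # average each series first, then combine

all.equal(mean_of_weighted_sum, weighted_sum_of_means)
```

The two quantities agree up to floating-point error, which is why the column means can be computed once and reused for every replication.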

Let's generate some sample data:

timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol=40))
n <- 4                # number of items in the convex combination
replications <- 100   # number of replications

Thus, we may first compute the mean of all columns and do all further computations using this mean:

ts.means <- apply(timeseries.df, 2, mean)

Let's create some samples:

samples <- replicate(replications, sample(1:length(ts.means), n))

and the corresponding weights for those samples:

weights <- matrix(runif(replications*n), nrow=n)
# Now normalize the weights so that each column sums to 1:
weights <- weights / matrix(apply(weights, 2, sum), nrow=n, ncol=replications, byrow=TRUE)

That part was a little bit tricky. Run each function on its own with a small number of replications to figure out what it is doing. Note that I took a different approach for generating the weights: first draw uniformly distributed data and then normalize them by their sum. The result should be comparable to your approach (see the edit further down for a caveat about the distribution), but with arbitrary resolution and much better performance.

Again, a little bit tricky: get the means for each time series and multiply them by the weights just computed:
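To see what the normalization step does, here is a small sketch with only three replications. `sweep` and `colSums` are base R functions used here as an alternative way of writing the same division; the names ending in `.small` are illustrative only.

```r
set.seed(42)
n <- 4
replications.small <- 3

weights.small <- matrix(runif(replications.small * n), nrow = n)
# Divide every column by its own sum, so each column becomes a set of convex weights
weights.small <- sweep(weights.small, 2, colSums(weights.small), "/")

colSums(weights.small)  # each entry should be 1, up to floating-point error
```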

ts.weightedmeans <- matrix(ts.means[samples], nrow=n) * weights
# and sum them up:
weights.sum <- apply(ts.weightedmeans, 2, sum)
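As a sanity check, the vectorized result can be compared with the direct computation from the question's loop. This is a self-contained sketch that regenerates the sample data; `direct` is an illustrative name, and `colMeans`, `colSums` and `sweep` are used as shorthand for the `apply` calls above.

```r
set.seed(123)
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol = 40))
n <- 4
replications <- 100

ts.means <- colMeans(timeseries.df)
samples  <- replicate(replications, sample(seq_along(ts.means), n))
weights  <- matrix(runif(replications * n), nrow = n)
weights  <- sweep(weights, 2, colSums(weights), "/")

# Vectorized result: weighted sum of the precomputed column means
weights.sum <- colSums(matrix(ts.means[samples], nrow = n) * weights)

# Direct computation per replication: mean of the weighted sum of the series
direct <- vapply(seq_len(replications), function(i) {
  mean(as.matrix(timeseries.df[, samples[, i]]) %*% weights[, i])
}, numeric(1))

all.equal(weights.sum, direct)
```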

Now we are basically done: all the information is available and ready to use. The rest is just a matter of formatting the data.frame correctly.

result <- data.frame(t(matrix(names(ts.means)[samples], nrow=n)), t(weights), weights.sum)

# For completeness, use better names:
colnames(result) <- c(paste("Sample", 1:n, sep=''), paste("Weight", 1:n, sep=''), "WeightedMean")

I would expect this approach to be rather fast: on my system the code took 1.25 seconds with the number of repetitions you stated.

Final word: You were lucky that I was looking for something to keep me thinking for a while. Your question was not asked in a way that encourages users to think about your problem and give good answers. Next time you have a problem, I would suggest that you read www.whathaveyoutried.com first and try to break the problem down as far as you can. The more concrete your problem, the faster and better your answers will be.
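To reproduce a timing on your own machine, the steps can be bundled into one function and run at the size stated in the question. This is a sketch: the function name `random_convex_means` and the data frame `big.df` are hypothetical, and the elapsed time will differ from system to system.

```r
# Hypothetical helper that bundles the steps above; the name is illustrative
random_convex_means <- function(df, n = 4, replications = 100) {
  ts.means <- colMeans(df)
  samples  <- replicate(replications, sample(seq_along(ts.means), n))
  weights  <- matrix(runif(replications * n), nrow = n)
  weights  <- sweep(weights, 2, colSums(weights), "/")
  weighted.mean <- colSums(matrix(ts.means[samples], nrow = n) * weights)
  result <- data.frame(t(matrix(names(ts.means)[samples], nrow = n)),
                       t(weights), weighted.mean)
  colnames(result) <- c(paste0("Sample", 1:n), paste0("Weight", 1:n), "WeightedMean")
  result
}

set.seed(1)
big.df <- data.frame(matrix(runif(600 * 100, 1, 10), ncol = 100))  # 100 series of length 600
timing <- system.time(res <- random_convex_means(big.df, n = 4, replications = 1e5))
dim(res)   # 100000 rows, 9 columns
timing     # elapsed time on this machine
```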

Edit

You mentioned correctly that the weights generated above are not uniformly distributed over the whole range of values. (I would still point out that even (0.9, 0.05, 0.025, 0.025) is possible, though very unlikely.)

Now we are playing in a different league, though. I am pretty sure that the approach you took is not uniformly distributed either: the probability of the last value being 0.9 is far lower than the probability of the first one being that large. Honestly, I do not have a good idea ready for you for generating random numbers that are uniformly distributed on the unit sphere with respect to the L_1 distance. (Actually, it is not really a unit sphere, but the two problems should be identical.)

Thus, I have to give up on this.

I would suggest raising a new question at stats.stackexchange.com concerning the generation of those random vectors. It is probably fairly simple with the correct technique. However, I doubt that this question, with that heading and a fairly long answer, will attract a potential responder... (If you ask the question over there, I would appreciate a link, as I would like to know the solution ;)
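For reference, one standard technique (not part of the original answer) for drawing non-negative weights that sum to one with a uniform distribution is to normalize independent exponential variates. The result follows a Dirichlet(1, ..., 1) distribution, which is uniform over that set of weight vectors. A minimal sketch, with `expo` and `weights.unif` as illustrative names:

```r
set.seed(7)
n <- 4
replications <- 100000

# Independent Exp(1) draws, one column per replication
expo <- matrix(rexp(replications * n), nrow = n)
# Normalizing each column by its sum gives one Dirichlet(1, ..., 1) draw per column
weights.unif <- sweep(expo, 2, colSums(expo), "/")

# By symmetry every weight has the same marginal distribution with mean 1/n
rowMeans(weights.unif)
```

This is a drop-in replacement for the `runif`-based `weights` matrix above, since it has the same shape (n rows, one column per replication).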

Concerning the variance: I do not fully understand which standard deviation you want to compute. If you just want to compute the standard deviation of each time series, why do you not use the built-in function sd? In the computation above you could just replace mean by it.
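If what is meant is instead the standard deviation of the weighted combination itself, the expression from the question, t(w) %*% S %*% w with S the covariance sub-matrix of the selected series, can be evaluated per replication. The following is a sketch under that assumption; `cov.mat`, `ts.weighted.sd` and `direct.sd` are illustrative names.

```r
set.seed(99)
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol = 40))
n <- 4
replications <- 100

samples <- replicate(replications, sample(ncol(timeseries.df), n))
weights <- matrix(runif(replications * n), nrow = n)
weights <- sweep(weights, 2, colSums(weights), "/")

cov.mat <- cov(timeseries.df)  # full covariance matrix, computed once

# For each replication, pick the rows/columns of the sampled series
# and evaluate sqrt(w' S w)
ts.weighted.sd <- vapply(seq_len(replications), function(i) {
  s <- samples[, i]
  w <- weights[, i]
  sqrt(drop(t(w) %*% cov.mat[s, s] %*% w))
}, numeric(1))

# Cross-check the first replication against sd() of the combined series
direct.sd <- sd(as.matrix(timeseries.df[, samples[, 1]]) %*% weights[, 1])
all.equal(ts.weighted.sd[1], direct.sd)
```

Computing the covariance matrix once and indexing it with `cov.mat[s, s]` avoids touching the raw series inside the loop.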

Bootstrapping: That is a whole new question. Separate different topics by starting new questions.
