Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pca - Principal component analysis plot in R

I need a PCA plot that shows whether and how the data cluster by population (AFR_ACB, AFR_ASW, etc.). I also need a different colour for each population and a legend for the population colours. It would also be good if I could add a frame around all African populations together, and likewise around the American, Asian and European populations, as my real data contains all of these populations.

I have data in the following format in a csv (TLR9.csv) file which I created from my result files. In reality, there are 26 columns (26 populations) and 1522 rows.

nuc_pos AFR_ACB AFR_ASW AMR_PUR AMR_PEL EAS_CHS EAS_JPN EUR_FIN EUR_CEU AMR_MXL AMR_PEL AMR_PUR EAS_CDX  EAS_CHB  EAS_CHS
42809473    0   0   0   0   0   0   0   0   0   0   0   0   0.00971 0
42809498    0.01042 0   0.0201  0.00885 0   0.03488 0.00926 0   0   0   0   0   0   0
42809524    0   0   0   0   0.0201  0   0.00926 0   0   0   0   0   0   0
42809625    0   0   0   0   0   0   0   0.08192 0.01563 0.02339 0.02857 0   0   0
42809638    0   0   0   0.00885 0   0   0   0   0   0   0   0   0   0
42809715    0.30628 0.20485 0.34743 0.36531 0.19059 0.36199 0.34729 0.02116 0.01563 0   0.06536 0   0   0
42809846    0   0   0   0   0   0   0   0   0   0   0   0   0.00971 0.00952
42809910    0   0   0   0   0   0   0   0   0   0.01176 0   0   0   0
42809911    0   0   0   0   0   0   0   0   0   0   0   0   0   0
42809964    0.30628 0.20485 0.34743 0.36531 0.20638 0.38016 0.35241 0.02116 0.01563 0   0.06536 0   0   0
42810034    0.30628 0.20485 0.34743 0.36531 0.19059 0.34918 0.34729 0.02116 0.01563 0   0.06536 0   0   0
42810082    0   0   0   0   0   0.02339 0   0   0   0   0   0   0   0
42810098    0   0   0   0   0   0   0   0   0   0   0   0   0   0
42810103    0   0   0   0   0.0101  0   0   0   0   0   0   0   0   0
42810184    0   0   0   0   0.03    0   0   0   0   0   0   0   0   0
42810189    0.30628 0.20485 0.34743 0.36531 0.19853 0.34918 0.34729 0.02116 0.01563 0   0.06536 0   0   0
42810233    0   0   0   0   0   0   0   0   0   0   0   0   0   0

I have made a PCA plot using the following code:

library(ggfortify) # provides autoplot() for prcomp objects
df <- read.csv('TLR9.csv')
pca_res <- prcomp(df, scale. = TRUE)
autoplot(pca_res, data = df, loadings = TRUE, loadings.label = TRUE, frame = TRUE, label = TRUE, shape = FALSE, label.size = 2, loadings.label.size = 3)

This is my scatterplot

Is the input file format correct for this type of analysis? Is it also right to take all 26 populations as principal components?

I have tried other R packages whose tutorials explain more clearly how to run a PCA in R, but they are not compatible with the R version I have. So I tried this one, and it works, but I am not sure the output is the way it should be.

This is my first time doing PCA and I am not very familiar with R. Any help would be most appreciated. Thanks in advance!

question from:https://stackoverflow.com/questions/65901839/principal-component-analysis-plot-in-r


1 Reply


I cannot use your dataset, since you did not make it available to us, so I have made an example dataset available below.

This is very easily done with the FactoMineR package.

Load data frame

df <- read.table("https://pastebin.com/raw/6aukL6YW", header = TRUE)

library(FactoMineR) # load package

names(df) # note that one column is called "treatment"; the other columns are filled with data

Run the PCA

x <- PCA(df, quali.sup = 1) # quali.sup is the index of the column to treat as a category (each category is automatically assigned a colour); in your case, this would be "population"

You can also make a scatterplot with the plot.PCA() function, which is part of the FactoMineR package

plot.PCA(x, axes = c(1, 2), cex = 1, choix = "ind", habillage = 1) # habillage is the index of the column to treat as a factor; it also assigns a different colour to each level (again, in your case, "population"), and the plot automatically adds a legend
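Since the pastebin data may not always be reachable, here is a minimal self-contained sketch of the same two steps. It assumes the built-in iris data as a stand-in (not the asker's data), with column 5 (Species) playing the role of "population"; graph = FALSE only suppresses the plots that PCA() would otherwise draw by default.

```r
library(FactoMineR)

# iris is a stand-in dataset: columns 1-4 are numeric, column 5 (Species) is the category.
res <- PCA(iris, quali.sup = 5, graph = FALSE)

# Eigenvalues and percentage of variance explained by each component.
print(res$eig)

# Individuals plot coloured by the categorical column; the legend is added automatically.
plot(res, axes = c(1, 2), choix = "ind", habillage = 5)
```

Calling plot() on the result dispatches to plot.PCA(), so the two spellings are interchangeable here.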

Finally, you can make a plot that shows which variables contribute most to the variation in your dataset, again with plot.PCA()

plot.PCA(x, choix = 'var', select = 'contrib 2') # highlights the top 2 contributors to the variation; the rest are de-emphasised. You could use 5, 10, etc.
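To see the numbers behind this kind of plot without FactoMineR, a rough base R sketch is shown here. It assumes the built-in iris data as a stand-in and uses the squared loadings from prcomp() as percentage contributions; this is in the same spirit as the 'contrib' selection but is not guaranteed to match FactoMineR's ranking exactly.

```r
# iris is a stand-in dataset; only its four numeric columns are used.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Each column of the rotation matrix has unit length, so the squared
# loadings of a component sum to 1; times 100 gives percentages.
contrib <- 100 * pca$rotation^2

# Variables ordered by their contribution to the first component.
print(sort(contrib[, "PC1"], decreasing = TRUE))
```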

And there you go...
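On the input format in the question: PCA functions treat rows as observations and columns as variables, so with populations in columns it is the nucleotide positions that get plotted, not the populations. A minimal base R sketch of one way to handle this follows. It assumes a few rows and the first eight columns copied from the excerpt in the question, transposes them so that populations become rows, and takes the continental group from the prefix of each population name; the colouring by group rather than by individual population is a simplification.

```r
# A few rows and the first eight columns copied from the excerpt in the question.
df <- data.frame(
  nuc_pos = c(42809498, 42809524, 42809625, 42809715, 42809911, 42809964),
  AFR_ACB = c(0.01042, 0, 0, 0.30628, 0, 0.30628),
  AFR_ASW = c(0, 0, 0, 0.20485, 0, 0.20485),
  AMR_PUR = c(0.0201, 0, 0, 0.34743, 0, 0.34743),
  AMR_PEL = c(0.00885, 0, 0, 0.36531, 0, 0.36531),
  EAS_CHS = c(0, 0.0201, 0, 0.19059, 0, 0.20638),
  EAS_JPN = c(0.03488, 0, 0, 0.36199, 0, 0.38016),
  EUR_FIN = c(0.00926, 0.00926, 0, 0.34729, 0, 0.35241),
  EUR_CEU = c(0, 0, 0.08192, 0.02116, 0, 0.02116)
)

# Populations become rows (observations); positions become columns (variables).
freq <- t(as.matrix(df[, -1]))
colnames(freq) <- df$nuc_pos

# Drop positions with zero variance; they cannot be scaled to unit variance.
freq <- freq[, apply(freq, 2, var) > 0, drop = FALSE]

pca_res <- prcomp(freq, center = TRUE, scale. = TRUE)

# Continental group = the prefix before the underscore (AFR, AMR, EAS, EUR).
group <- factor(sub("_.*", "", rownames(freq)))

# One point per population, coloured by group, with labels and a legend.
plot(pca_res$x[, 1:2], col = as.integer(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pca_res$x[, 1:2], labels = rownames(freq), pos = 3, cex = 0.7)
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)
```

With the real file, the same steps would start from read.csv('TLR9.csv') instead of the inline data frame, and the nuc_pos column must be kept out of the PCA input as above.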

