Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

Colouring a PCA plot by clusters in R

I have some biological data that looks like this, with 2 different types of clusters (A and B):

                    Cluster_ID       A1       A2       A3       B1        B2       B3
5  chr5:100947454..100947489,+  3.31322  7.52365  3.67255 21.15730  8.732710 17.42640
12 chr5:101227760..101227782,+  1.48223  3.76182  5.11534 15.71680  4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040  6.85094 15.551400  7.18341

I clean up the data:

CAGE <- read.table("CAGE_expression_matrix.txt", header=TRUE)
CAGE_data <- as.data.frame(CAGE)

#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]

#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)

The data size is reduced from 6981 clusters to 599 after this.
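
The two filters above can be checked on a small made-up table before running them on the real file. This is only a sketch: the `toy` data frame and its values are invented for illustration, and only the column layout (an ID column followed by six expression columns) follows the question.

```r
# Hypothetical toy table: one ID column plus six expression columns.
toy <- data.frame(
  Cluster_ID = c("c1", "c2", "c3", "c4"),
  A1 = c(0, 6, 1, 9), A2 = c(0, 7, 2, 0), A3 = c(0, 1, 3, 0),
  B1 = c(0, 8, 4, 0), B2 = c(0, 0, 4, 5), B3 = c(0, 0, 2, 6)
)

# Drop clusters whose expression is 0 in all six samples (removes c1).
nonzero <- toy[rowSums(abs(toy[, 2:7])) > 0, ]

# Keep clusters with at least 5 TPM in at least 3 samples.
# rowSums() on the logical matrix counts the samples that pass the threshold.
kept <- nonzero[rowSums(nonzero[, 2:7] >= 5) >= 3, ]

kept$Cluster_ID
```

Here `rowSums()` on the logical matrix does the same counting as `apply(..., 1, sum)` in the question; either form should give the same rows.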

I then go on to apply PCA:

#Get data dimensions
dim(CAGE_data)
PCA.CAGE <- prcomp(CAGE_data[,2:7], scale.=TRUE)
summary(PCA.CAGE)

I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B). The plot should therefore use two colors, with a text label for each sample.

This is what I have tried, to erroneous results:

qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))

ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()

qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))

The errors appear as such:

> qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")

Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]

> qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))

Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data

> ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()

Error: ggplot2 doesn't know how to deal with data of class 

1 Reply

Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principal components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris dataset.

pca   <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)

So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principal components in the columns. We then run k-means clustering on the first 2 PCs and extract the cluster numbers into clust. Finally, we use ggbiplot(...) to make the plot.
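
For the layout in the question, where each column is a sample and each row is a CAGE cluster, one possible reading of the goal is a plot with one point per sample. A minimal sketch under that assumption: transpose the expression columns so that samples become the cases (rows), run prcomp, and derive the A/B type from the first letter of each sample name. The simulated matrix, the names `expr` and `scores`, and the colour choices are illustrative only and are not the asker's data.

```r
# Toy stand-in for the filtered expression columns (rows = clusters, columns = samples).
# The values are simulated; only the column naming (A1..A3, B1..B3) follows the question.
set.seed(1)
expr <- matrix(rexp(60, rate = 0.1), nrow = 10,
               dimnames = list(paste0("cluster", 1:10),
                               c("A1", "A2", "A3", "B1", "B2", "B3")))

# Transpose so that each sample is a case (row) and each cluster is a variable.
pca <- prcomp(t(expr), scale. = TRUE)

# One row of scores per sample; the type is the first character of the sample name.
scores <- as.data.frame(pca$x[, 1:2])
scores$sample <- rownames(scores)
scores$type   <- factor(substr(scores$sample, 1, 1))

# Base-graphics scatter plot: one colour per type, one text label per sample.
plot(scores$PC1, scores$PC2, pch = 19,
     col = c("red", "blue")[as.integer(scores$type)],
     xlab = "PC1", ylab = "PC2")
text(scores$PC1, scores$PC2, labels = scores$sample, pos = 3)
```

With ggplot2 loaded, the same `scores` data frame could be passed to `ggplot(scores, aes(PC1, PC2, colour = type, label = sample)) + geom_point() + geom_text()`. The point of building `scores` first is that ggplot2 expects a data frame rather than the prcomp object itself.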

