Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
494 views
in Technique[技术] by (71.8m points)

apache spark - ALS model - how to generate full_u * v^t * v?

I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this stackoverflow answer. I've copied the answer below for the reader's convenience:

You can get predictions for new users using the trained model (without updating it):

To get predictions for a user in the model, you use its latent representation (vector u of size f (number of factors)), which is multiplied by the product latent factor matrix (matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have the full representation of size M (number of different products), but what you can do is use a similarity function to compute a similar latent representation for this new user by multiplying it by the transpose of the product matrix.

i.e. if you user latent matrix is u and your product latent matrix is v, for user i in the model, you get scores by doing: u_i * v for a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v This will approximate the latent factors for the new users and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users)

To answer the question of training, this allows you to compute predictions for new users without having to do the heavy computation of the model which you can now do only once in a while. So you have you batch processing at night and can still make prediction for new user during the day.

Note: MLLIB gives you access to the matrix u and v

The quoted text above is an excellent answer, however, I'm struggling to understand how to programmatically implement this solution. For example, the matrix u and v can be obtained with:

# pyspark example

# ommitted for brevity ... loading movielens 1M ratings

model = ALS.train(ratings, rank, numIterations, lambdaParam)

matrix_u = model.userFeatures()

print(matrix_u.take(2)) # take a look at the dataset

This returns:

[
  (2, array('d', [0.26341307163238525, 0.1650490164756775, 0.118405282497406, -0.5976635217666626, -0.3913084864616394, -0.1379186064004898, -0.3866392970085144, -0.1768060326576233, -0.38342711329460144, 0.48550787568092346, -0.18867433071136475, -0.02757863700389862, 0.1410026103258133, 0.11498363316059113, 0.03958914801478386, 0.034536730498075485, 0.08427099883556366, 0.46969038248062134, -0.8230801224708557, -0.15124185383319855, 0.2566414773464203, 0.04326820373535156, 0.19077207148075104, 0.025207923725247383, -0.02030213735997677, 0.1696728765964508, 0.5714617967605591, -0.03885050490498543, -0.09797532111406326, 0.29186877608299255, -0.12768596410751343, -0.1582849770784378, 0.01933656632900238, -0.09131495654582977, 0.26577943563461304, -0.4543033838272095, -0.11789630353450775, 0.05775507912039757, 0.2891307771205902, -0.2147761881351471, -0.011787488125264645, 0.49508437514305115, 0.5610293745994568, 0.228189617395401, 0.624510645866394, -0.009683617390692234, -0.050237834453582764, -0.07940001785755157, 0.4686132073402405, -0.02288617007434368])), 
  (4, array('d', [-0.001666820957325399, -0.12487432360649109, 0.1252429485321045, -0.794727087020874, -0.3804478347301483, -0.04577340930700302, -0.42346617579460144, -0.27448347210884094, -0.25846347212791443, 0.5107921957969666, 0.04229479655623436, -0.10212298482656479, -0.13407345116138458, -0.2059325873851776, 0.12777331471443176, -0.318756639957428, 0.129398375749588, 0.4351944327354431, -0.9031049013137817, -0.29211774468421936, -0.02933369390666485, 0.023618215695023537, 0.10542935132980347, -0.22032295167446136, -0.1861676126718521, 0.13154461979866028, 0.6130356192588806, -0.10089754313230515, 0.13624103367328644, 0.22037173807621002, -0.2966669499874115, -0.34058427810668945, 0.37738317251205444, -0.3755438029766083, -0.2408779263496399, -0.35355791449546814, 0.05752146989107132, -0.15478627383708954, 0.3418906629085541, -0.6939512491226196, 0.4279302656650543, 0.4875738322734833, 0.5659542083740234, 0.1479463279247284, 0.5280753970146179, -0.24357643723487854, 0.14329688251018524, -0.2137598991394043, 0.011986476369202137, -0.015219110995531082]))
]

I can also do similar to get the v matrix:

matrix_v = model.productFeatures()

print(matrix_v.take(2)) # take a look at the dataset

This results in:

[
  (2, array('d', [0.019985994324088097, 0.0673416256904602, -0.05697149783372879, -0.5434763431549072, -0.40705952048301697, -0.18632276356220245, -0.30776089429855347, -0.13178342580795288, -0.27466219663619995, 0.4183739423751831, -0.24422742426395416, -0.24130797386169434, 0.24116989970207214, 0.06833088397979736, -0.01750543899834156, 0.03404173627495766, 0.04333991929888725, 0.3577033281326294, -0.7044714689254761, 0.1438472419977188, 0.06652364134788513, -0.029888223856687546, -0.16717877984046936, 0.1027146726846695, -0.12836599349975586, 0.10197233408689499, 0.5053384900093079, 0.019304445013403893, -0.21254844963550568, 0.2705852687358856, -0.04169371724128723, -0.24098040163516998, -0.0683765709400177, -0.09532768279314041, 0.1006036177277565, -0.08682398498058319, -0.13584329187870026, -0.001340558985248208, 0.20587041974067688, -0.14007550477981567, -0.1831497997045517, 0.5021498203277588, 0.3049483597278595, 0.11236990243196487, 0.15783801674842834, -0.044139936566352844, -0.14372406899929047, 0.058535050600767136, 0.3777201473712921, -0.045475270599126816])), 
  (4, array('d', [0.10334215313196182, 0.1881643384695053, 0.09297363460063934, -0.457258403301239, -0.5272660255432129, -0.0989445373415947, -0.2053477019071579, -0.1644461452960968, -0.3771175146102905, 0.21405018866062164, -0.18553146719932556, 0.011830524541437626, 0.29562288522720337, 0.07959598302841187, -0.035378433763980865, -0.11786794662475586, -0.11603366583585739, 0.3776192367076874, -0.5124108791351318, 0.03971947357058525, -0.03365595266222954, 0.023278912529349327, 0.17436474561691284, -0.06317273527383804, 0.05118614062666893, 0.4375131130218506, 0.3281322419643402, 0.036590900272130966, -0.3759073317050934, 0.22429685294628143, -0.0728025734424591, -0.10945595055818558, 0.0728464275598526, 0.014129920862615108, -0.10701996833086014, -0.2496117204427719, -0.09409723430871964, -0.11898282915353775, 0.18940524756908417, -0.3211393356323242, -0.035668935626745224, 0.41765937209129333, 0.2636736035346985, -0.01290816068649292, 0.2824321389198303, 0.021533429622650146, -0.08053319901227951, 0.11117415875196457, 0.22975310683250427, 0.06993964314460754]))
]

However, I'm not sure how to progress from this to full_u * v^t * v

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This new user is not the the matrix U, so you don't have its latent representation in 'k' factors, you only know its full representation, i.e., all its ratings. full_u here means all of the new user ratings in a dense format (not the sparse format ratings are) e.g.:

[0 2 0 0 0 1 0] if user u has rated item 2 with a 2 and item 6 with a 1.

then you can get v pretty much like you did and turning it to a matrix in numpy for instance:

pf = model.productFeatures()
Vt = np.matrix(np.asarray(pf.values().collect()))

then is is just a matter of multiplying: full_u*Vt*Vt.T

Vt and V are transposed compared to the other answer but that's just a matter of convention.

Note that the Vt*Vt.T product is fixed, so if you are going to use this for multiple new users it will be computationally more efficient to pre-compute it. Actually for more than one user it would be better to put all their ratings in bigU (in the same format as my one new user example) and just do the matrix product: bigU*Vt*Vt.T to get all the ratings for all the new users. Might still be worth checking that the product is done in the most efficient way in terms of number of operations.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...