Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
774 views
in Technique[技术] by (71.8m points)

numpy - How to implement Conflation for probability distribution in python?

I looked online for performing the combining several continuous probability distributions into one continuous probability distribution. This method is called Conflation, the method can be found in the following article: An Optimal Method for Consolidating Data from Different Experiments. In this article, I found out that it was better to perform Conflation instead of averaging to combine distributions.

From what I understood from the article is that equation performs by multiplying each probability density values from several probability distributions divided by the integration of the product of each probability density value from several probability distributions for continuous distribution while for the discrete distribution it is done by multiplying each probability density value from several probability distributions divided by the summation of each probability density value from several probability distributions. (Details can be found on page 5 of the article) enter image description here enter image description here

Say I have around 4 lists from, for example, 4 norm distributions, for example,

list_1 = [5, 8, 6, 2, 1]
list_2 = [2, 6, 1, 3, 8]
list_3 = [1, 9, 2, 7, 5]
list_4 = [3, 2, 4, 1, 6]

and implementing the Conflation the result list becomes,

Con_list = [2.73, 34.56, 3.69, 3.23, 12]

(Correct me if I am wrong)

how is it possible to implement both equations in the photo into python to get the Conflation of inputted PDF distribution?

I found stackflow question before regarding averaging list and the code was the following,

def average(l):
    llen = len(l)
    def divide(x):
        return x / llen
    # return map(divide, map(sum, zip(*l)))
    return map(divide, map(sum, zip(l)))

I have been trying to recode this function to follow the equation above but I can't find a way to get conflated pdf for a continuous distribution.

Edit 1:

Based on the answer from @Josh Purtell, I rewrote the code, however, I keep on getting the following error message:

Error Message:

Traceback (most recent call last):
  File "/tmp/sessions/c903d99d60f20c3b/main.py", line 72, in <module>
    graph=conflate_pdf(domain, dists,lb,ub)
  File "/tmp/sessions/c903d99d60f20c3b/main.py", line 58, in conflate_pdf
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
  File "/usr/local/lib/python3.6/dist-packages/scipy/integrate/quadpack.py", line 341, in quad
    points)
  File "/usr/local/lib/python3.6/dist-packages/scipy/integrate/quadpack.py", line 448, in _quad
    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
TypeError: only size-1 arrays can be converted to Python scalars

Code:

def prod_pdf(x,pdfs):
    prod=np.ones(pdfs[0].shape[0])
    for pdf in pdfs:
        prod=prod*pdf
    return prod

def conflate_pdf(x,dists,lb,ub):
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
    return prod_pdf(x,dists)/denom

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = stats.norm.pdf(domain, 2,1)
dist_2 = stats.norm.pdf(domain, 2.5,1.5)
dist_3 = stats.norm.pdf(domain, 2.2,1.6)
dist_4 = stats.norm.pdf(domain, 2.4,1.3)
dist_5 = stats.norm.pdf(domain, 2.7,1.5)

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
graph=conflate_pdf(domain, dists,lb,ub)

from matplotlib import pyplot as plt
plt.plot(domain, dist_1)
plt.plot(domain, dist_2)
plt.plot(domain, dist_3)
plt.plot(domain, dist_4)
plt.plot(domain, dist_5)
plt.plot(domain,graph)
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.show()

From the code, what causes this error?

Edit 2:

I managed to rewrite the code to look into lists of distribution instead of getting the pdf in the product function in Edit 1, but still, I keep on having the same error in Edit 1.

Code:

def prod_pdf(x,pdfs):
    prod=np.ones(np.array(pdfs)[0].shape)
    for pdf in pdfs:
        print(prod)
        for c,y in enumerate(pdf):
            prod[c]=prod[c]*y
        print('final:', prod)
    return prod

def conflate_pdf(x,dists,lb,ub):
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
    print('Denom: ',denom)
    print('product pdf: ', prod_pdf(x,dists))
    conflated_pdf=prod_pdf(x,dists)/denom
    print(conflated_pdf)
    return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)

from matplotlib import pyplot as plt
plt.plot(domain, dist_1, 'r')
plt.plot(domain, dist_2, 'g')
plt.plot(domain, dist_3, 'b')
plt.plot(domain, dist_4, 'y')
plt.plot(domain, dist_5, 'c')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
graph=conflate_pdf(domain, dists,lb,ub)


plt.plot(domain,graph, 'm')
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.show()

Edit 3:

I tried to run the following code (based on an answer from @Josh Purtell), but, I keep on getting one variable it gets the whole array after product function and it produces the same error message regarding the size-1 array. See the following code with a portion of the output:

Code:

from scipy.integrate import quad
from scipy import stats
import numpy as np

def prod_pdf(x,dists):
    p_pdf=1
    print('Incoming Array:', p_pdf)
    for dist in dists:
        p_pdf=p_pdf*dist
        print('final:', p_pdf)
    return p_pd

def conflate_pdf(x,dists,lb,ub):
    print('Input product pdf: ', prod_pdf(x,dists))
    denom = quad(prod_pdf, lb, ub, args=(dists,))[0]
    # denom = simps(prod_pdf)
    # denom = nquad(func=(prod_pdf), ranges=([lb, ub]), args=(dists,))[0]
    print('Denom: ', denom)
    conflated_pdf=prod_pdf(x,dists)/denom
    print('Conflated PDF: ', conflated_pdf)
    return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)

from matplotlib import pyplot as plt
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.legend()
plt.plot(domain, dist_1, 'r', label='Dist. 1')
plt.plot(domain, dist_2, 'g', label='Dist. 2')
plt.plot(domain, dist_3, 'b', label='Dist. 3')
plt.plot(domain, dist_4, 'y', label='Dist. 4')
plt.plot(domain, dist_5, 'c', label='Dist. 5')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
print('distribution list: 
', dists)
graph=conflate_pdf(domain, dists,lb,ub)

plt.plot(domain,graph, 'm', label='Conflated Dist.')
plt.show()

Here is a small portion of the output:

Incoming Array: 1
final: [2.14638374e-32 2.41991991e-32 2.72804284e-32 ... 6.41980576e-15
 5.92770938e-15 5.47278628e-15]
final: [4.75178372e-48 5.66328097e-48 6.74864868e-48 ... 7.03075979e-21
 6.27970218e-21 5.60806584e-21]
final: [2.80912097e-61 3.51131870e-61 4.38823989e-61 ... 1.32670185e-26
 1.14952951e-26 9.95834610e-27]
final: [1.51005552e-81 2.03116529e-81 2.73144352e-81 ... 1.76466623e-34
 1.46198598e-34 1.21092834e-34]
final: [1.09076800e-97 1.55234627e-97 2.20861552e-97 ... 3.72095218e-40
 2.98464396e-40 2.39335035e-40]
Input product pdf:  [1.09076800e-97 1.55234627e-97 2.20861552e-97 ... 3.72095218e-40
 2.98464396e-40 2.39335035e-40]
Incoming Array: 1
final: [2.14638374e-32 2.41991991e-32 2.72804284e-32 ... 6.41980576e-15
 5.92770938e-15 5.47278628e-15]
final: [4.75178372e-48 5.66328097e-48 6.74864868e-48 ... 7.03075979e-21
 6.27970218e-21 5.60806584e-21]
final: [2.80912097e-61 3.51131870e-61 4.38823989e-61 ... 1.32670185e-26
 1.14952951e-26 9.95834610e-27]
final: [1.51005552e-81 2.03116529e-81 2.73144352e-81 ... 1.76466623e-34
 1.46198598e-34 1.21092834e-34]
final: [1.09076800e-97 1.55234627e-97 2.20861552e-97 ... 3.72095218e-40
 2.98464396e-40 2.39335035e-40]

I managed to look into the code to implement the same method in Edit 3, I edited the code where it gets the first variables from each distribution however, for the rest of the loop it keeps on printing the same values, it does not go to the next values in the lists and Conflated distribution is a single variable. See the following code with a portion of the output:

Code:

from scipy.integrate import quad
from scipy import stats
import numpy as np

def prod_pdf(x,dists):
    p_pdf=1
    print('Incoming Array:', p_pdf)
    for c,dist in enumerate(dists):
        p_pdf=p_pdf*dist[c]
        print('final:', p_pdf)
    return p_pdf

def conflate_pdf(x,dists,lb,ub):
    print('Input product pdf: ', prod_pdf(x,dists))
    denom = quad(prod_pdf, lb, ub, args=(dists,))[0]
    # denom = simps(prod_pdf)
    # denom = nquad(func=(prod_pdf), ranges=([lb, ub]), args=(dists,))[0]
    print('Denom: ', denom)
    conflated_pdf=prod_pdf(x,dists)/denom
    print('Conflated PDF: ', conflated_pdf)
    return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)

from matplotlib import pyplot as plt
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.legend()
plt.plot(domain, dist_1, 'r', label='Dist. 1')
plt.plot(domain, dist_2, 'g', label='Dist. 2')
plt.plot(domain, dist_3, 'b', label='Dist. 3')
plt.plot(domain, dist_4, 'y', label='Dist. 4')
plt.plot(domain, dist_5, 'c', label='Dist. 5')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
print('distribution list: 
', dists)
graph=conflate_pdf(domain, dists,lb,ub)

plt.plot(domain,graph, 'm', label='Conflated Dist.')
plt.show()

A portion of the output:

Incoming Array: 1
final: 2.1463837356630605e-32
final: 5.0231307782193034e-48
final: 3.266239495519432e-61
final: 2.187514996217005e-81
final: 1.979657878680375e-97
Incoming Array: 1
final: 2.1463837356630605e-32
final: 5.0231307782193034e-48
final: 3.266239495519432e-61
final: 2.187514996217005e-81
final: 1.979657878680375e-97
Denom:  3.95931575736075e-96
Incoming Array: 1
final: 2.1463837356630605e-32
final: 5.0231307782193034e-48
final: 3.266239495519432e-61
final: 2.187514996217005e-81
final: 1.979657878680375e-97
Conflated PDF:  0.049999999999999996

Edit 4:

I implemented the following code and it seems to work, also, I managed to sort out the problem with quad it seems if I changed the quad into fixed_quad and normalise the pdf list. I will get the same result. Here is the following code:

import scipy.stats as st
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, Normalizer, normalize, StandardScaler
from scipy.integrate import quad, simps, quad_vec, nquad, cumula

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Disclaimer: there's a good chance I'm misunderstanding either you or the paper authors, in which case please suggest an edit to this answer.

Here is a trivial, not-especially-performant implementation of what I think conflation might look like

##define pdfs for discrete RV X = {1,2,3,4}
import numpy as np

def mult_list(pdfs):
    prod=np.ones(pdfs[0].shape[0])
    for pdf in pdfs:
        prod=prod*pdf
    return prod

def conflate(pdfs):
    return mult_list(pdfs)/sum(mult_list(pdfs))

pdf_1=np.array([.25,.25,.25,.25])
pdf_2=np.array([.33,.33,.33,.00])
pdf_3=np.array([.25,.12,.13,.50])

print(conflate([pdf_1,pdf_2,pdf_3]))

which yields the resulting conflated pdf

>>> [0.5  0.24 0.26 0.  ]

which passes a cursory sniff test.

On the continuous side of things, the above translates to

from scipy.integrate import quad
from scipy import stats
import numpy as np

def prod_pdf(x,dists):
    p_pdf=1
    for dist in dists:
        p_pdf=p_pdf*dist.pdf(x)
    return p_pdf

def conflate_pdf(x,dists,lb,ub):
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
    return prod_pdf(x,dists)/denom

dists=[stats.norm(2,1),stats.norm(4,2)]
lb=-10
ub=10
domain=np.arange(lb,ub,.01)
graph=conflate_pdf(domain,dists,lb,ub)

from matplotlib import pyplot as plt
plt.plot(domain,graph)
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.show()
plt.savefig("conflatedpdf.png")

which gives conflatedpdf

As you can see, the distribution is not bimodal, just as one would hope.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...