Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Proportional venn diagram for more than 3 sets

I have a collection of documents in MongoDB where each has one or more categories in a list. Using map reduce, I can get the details of how many documents have each unique combination of categories:

['cat1']               = 523
['cat2']               = 231
['cat3']               = 102
['cat4']               = 72
['cat1','cat2']        = 710
['cat1','cat3']        = 891
['cat1','cat3','cat4'] = 621 ...

where each total is the number of documents with that exact combination of categories.

I'm looking for a sensible way to present this data, and I think a Venn diagram with proportional areas would be a good idea. Using the above example, the area of cat1 would be 523+710+891+621, the area of the overlap between cat1 and cat3 would be 891+621, the area of the overlap between cat1, cat3 and cat4 would be 621, etc.
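The sums above can be computed mechanically from the exact-combination counts. The following is only a minimal sketch: the names `exact_counts` and `region_total` are made up for illustration, and the dictionary holds only the combinations listed in the example (the elided "..." entries are not included).

```python
# Exact-combination counts copied from the example above.
# Hypothetical variable name; the elided "..." combinations are left out.
exact_counts = {
    ("cat1",): 523,
    ("cat2",): 231,
    ("cat3",): 102,
    ("cat4",): 72,
    ("cat1", "cat2"): 710,
    ("cat1", "cat3"): 891,
    ("cat1", "cat3", "cat4"): 621,
}


def region_total(categories, counts):
    """Sum the counts of every exact combination that contains all of `categories`."""
    wanted = set(categories)
    return sum(n for combo, n in counts.items() if wanted <= set(combo))


print(region_total(["cat1"], exact_counts))                  # 523+710+891+621 = 2745
print(region_total(["cat1", "cat3"], exact_counts))          # 891+621 = 1512
print(region_total(["cat1", "cat3", "cat4"], exact_counts))  # 621
```

A region's area is the sum over every exact combination that is a superset of the requested categories, which is why the cat1 total includes all four combinations containing cat1.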

Does anyone have any tips for how I might go about implementing this? I'd prefer to do it in Python (with NumPy/Matplotlib) or MATLAB.



1 Reply


The Problem

We need to represent the counts of multiple interconnected categories of objects. A Venn diagram cannot represent more than a small number of categories and their overlaps.

A Solution

Consider each category, and each combination of categories, as a node in a graph. Draw the graph so that the size of each node represents its count, and the edges connect each combination to the categories it contains. The advantage of this approach is that many categories can be accommodated with ease; the result is a kind of connected bubble chart.
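As a rough illustration of this node-and-edge idea, the sketch below derives both lists from a dictionary of per-combination counts. The function name `build_graph_data` and the dictionary layout are assumptions made for this example, not part of the original answer; the "1+3" style labels follow the answer's own naming. Categories that only ever appear inside combinations are given a size of 0 here, which is one possible choice among several.

```python
def build_graph_data(counts):
    """Return (nodes, edges) built from {combination tuple: count}.

    Nodes are (label, {'size': count}) pairs and edges are (combination, member)
    pairs, i.e. the shapes accepted by NetworkX's add_nodes_from/add_edges_from.
    """
    sizes = {}
    edges = []
    for combo, n in counts.items():
        label = "+".join(combo)
        sizes[label] = n
        if len(combo) > 1:
            for member in combo:
                # Assumption: a category with no exact count of its own gets size 0.
                sizes.setdefault(member, 0)
                edges.append((label, member))
    nodes = [(label, {"size": size}) for label, size in sizes.items()]
    return nodes, edges


# Hypothetical input using the counts from the question.
counts = {
    ("1",): 523, ("2",): 231, ("3",): 102, ("4",): 72,
    ("1", "2"): 710, ("1", "3"): 891, ("1", "3", "4"): 621,
}
nodes, edges = build_graph_data(counts)
print(nodes)
print(edges)
```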

The Result

[Figure: network layout]

The Code

The proposed solution uses NetworkX to create the data structure and matplotlib to draw it. If data is presented in the right format, this will scale to a large number of categories with multiple connections.

import networkx as nx
import matplotlib.pyplot as plt

def load_nodes():
    text = '''  Node    Size
                1        523
                2        231
                3        102
                4         72
                1+2      710
                1+3      891
                1+3+4    621'''
    # load nodes into list, discard header
    # this may be replaced by some appropriate output 
    # from your program
    data = text.split('\n')[1:]
    data = [ d.split() for d in data ]
    data = [ tuple([ d[0], 
                    dict( size=int(d[1]) ) 
                    ]) for d in data]
    return data

def load_edges():
    text = '''  From   To
                1+2    1
                1+2    2
                1+3    1
                1+3    3
                1+3+4    1
                1+3+4    3
                1+3+4    4'''
    # load edges into list, discard header
    # this may be replaced by some appropriate output 
    # from your program
    data = text.split('\n')[1:]
    data = [ tuple( d.split() ) for d in data ]
    return data

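if __name__ == '__main__':
    scale_factor = 5
    G = nx.Graph()
    nodes = load_nodes()
    # add the nodes first so that each one carries its 'size' attribute
    G.add_nodes_from( nodes )

    edges = load_edges()
    G.add_edges_from( edges )

    # build the sizes in G's own node order so each size matches its node
    node_sizes = [ G.nodes[n]['size']*scale_factor
                  for n in G.nodes ]

    nx.draw_networkx(G, 
                     pos=nx.spring_layout(G),
                     node_size = node_sizes)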
    plt.axis('off')
    plt.show()
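The fixed scale_factor above may need tuning for other data sets. As far as I know, draw_networkx hands node_size to Matplotlib's scatter, where marker size is an area in points squared, so scaling the counts linearly keeps node area proportional to the count. The helper below is only a sketch of one way to normalize the sizes; `scaled_node_sizes` and its `max_size` default are invented for illustration.

```python
def scaled_node_sizes(counts, max_size=3000.0):
    """Scale counts linearly so that the largest count maps to max_size.

    max_size is assumed to be an area in points**2, matching how Matplotlib's
    scatter interprets marker sizes.
    """
    largest = max(counts)
    if largest <= 0:
        raise ValueError("at least one count must be positive")
    return [max_size * c / largest for c in counts]


# Counts in the same order as the Node/Size table above.
sizes = scaled_node_sizes([523, 231, 102, 72, 710, 891, 621])
print(sizes)
```

The resulting list could be passed as node_size in place of the scale_factor product, provided it is in the same order as the graph's nodes.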

Other Solutions

Other solutions might include: bubble charts, Voronoi diagrams, chord diagrams, and hive plots among others. None of the linked examples use Python; they are just given for illustrative purposes.



...