Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Google App Engine: Task queue performance

I have an application running on App Engine that executes several jobs using the deferred library. Some of these tasks run daily, while others run once a month. Most of them query Datastore to retrieve entities and then store them as documents in an index (Search API). Some of the underlying tables are replaced monthly, so I have to run these tasks over all entities (4-5M).

One example of such a task is:

import logging

from google.appengine.api import search
from google.appengine.ext import deferred

# Company (the Datastore model) and getCompanyDocument are defined elsewhere in the app.

def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    BATCH_SIZE = 200
    # Fetch one page of entities, resuming from the previous task's cursor.
    cps, next_cursor, more = Company.query().fetch_page(
        BATCH_SIZE, start_cursor=cursor)

    # Create one index document per Datastore entity; each document
    # has only about 5 text fields and one date field.
    doc_list = [getCompanyDocument(cp) for cp in cps]

    index = search.Index(name='Company')
    index.put(doc_list)

    n_entities += len(doc_list)

    if more:
        logging.debug('Company: %d added to index', n_entities)
        # Chain the next batch as a new deferred task.
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities,
                       mindate=mindate)
    else:
        logging.debug('Finished Company index creation (%d processed)', n_entities)

When I run only this task, each deferred call takes around 4-5 s. Since the calls are chained one after another, indexing my 5M entities (25,000 batches of 200) would take about 35 hours.
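As a rough check of that figure, here is a minimal sketch of the arithmetic. The function name `estimate_hours` and the `parallel_tasks` parameter are hypothetical, introduced only for illustration; `parallel_tasks` assumes the work could somehow be split into independent chains, which the code above does not do.

```python
import math


def estimate_hours(total_entities, batch_size, seconds_per_task, parallel_tasks=1):
    """Rough wall-clock estimate, in hours, for a chain of batched tasks."""
    # Number of deferred calls needed to cover every entity.
    n_tasks = int(math.ceil(total_entities / float(batch_size)))
    # A cursor chain runs its tasks one after another. parallel_tasks > 1
    # is an assumption: it models evenly splitting the work across
    # independent chains that run at the same time.
    return n_tasks * seconds_per_task / float(parallel_tasks) / 3600.0


# 5M entities, 200 per batch, about 5 s per deferred call.
print(round(estimate_hours(5000000, 200, 5), 1))  # 34.7
```

With these numbers the per-call latency matters less than the fact that only one call runs at a time: under the same assumptions, ten independent chains would bring the estimate down to roughly 3.5 hours.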

Another problem is that when I run an update on another index (e.g., one of the daily updates) using a different deferred task on the same queue, both run a lot slower: each deferred call starts taking about 10-15 seconds, which is far too slow.

My question is: is there a way to do this faster and scale the push queue so that more than one job runs at a time? Or should I use a different approach for this problem?

Thanks in advance.
