Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Following on from this question, when I try to create a postgresql table from a dask.dataframe with more than one partition I get the following error:

IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(test1, 2200) already exists.
 [SQL: '
CREATE TABLE test1 (
"A" BIGINT, 
"B" BIGINT, 
"C" BIGINT, 
"D" BIGINT, 
"E" BIGINT, 
"F" BIGINT, 
"G" BIGINT, 
"H" BIGINT, 
"I" BIGINT, 
"J" BIGINT, 
idx BIGINT
)

']

You can recreate the error with the following code:

import numpy as np
import dask.dataframe as dd
import dask
import pandas as pd
import sqlalchemy_utils as sqla_utils
import sqlalchemy as sqla
DATABASE_CONFIG = {
    'driver': '',
    'host': '',
    'user': '',
    'password': '',
    'port': 5432,
}
DBNAME = 'dask'
url = '{driver}://{user}:{password}@{host}:{port}/'.format(
        **DATABASE_CONFIG)
db_url = url.rstrip('/') + '/' + DBNAME
# create db if non-existent
if not sqla_utils.database_exists(db_url):
    print("Creating database '{}'".format(DBNAME))
    sqla_utils.create_database(db_url)
conn = sqla.create_engine(db_url)
# create pandas df with random numbers
df = pd.DataFrame(np.random.randint(0,40,size=(100, 10)), columns=list('ABCDEFGHIJ'))
# add index so that it can be used as primary key later on
df['idx'] = df.index
# create dask df
ddf = dd.from_pandas(df, npartitions=4)
# Write to psql
dto_sql = dask.delayed(pd.DataFrame.to_sql)
out = [dto_sql(d, 'test', db_url, if_exists='append', index=False, index_label='idx')
       for d in ddf.to_delayed()]
dask.compute(*out)

The code doesn't produce an error if npartitions is set to 1, so I'm guessing it has to do with Postgres not being able to handle parallel requests to write to the same SQL table. How can I fix this?


1 Reply

I was reading this. It seems this error arises when the same table is created or updated by several processes in parallel. As I understand it, the cause is this (as explained in the Google Groups discussion). With if_exists='append', each to_sql call first checks whether the table exists and creates it if it does not, so several partitions can presumably end up issuing CREATE TABLE for the same name at almost the same time.

So I think the error comes from PostgreSQL itself and not from the connection driver or the module used for the multiprocessing.

In practice, the only workaround I found is to make the chunks big enough that writing each one takes longer than the computation itself. With bigger chunks the error no longer occurs.
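
Another option, which is not from the discussion above and is only a minimal sketch: create the table once, up front, from an empty slice of the frame, so that the parallel writers find an existing table and only ever INSERT. The helper names create_empty_table and append_partition are hypothetical, and an in-memory SQLite connection stands in for the PostgreSQL engine so the sketch runs without a server.

```python
import sqlite3

import numpy as np
import pandas as pd


def create_empty_table(df, table, con):
    """Create `table` with df's columns and dtypes but no rows (hypothetical helper)."""
    # if_exists='replace' drops any existing table of that name first.
    df.head(0).to_sql(table, con, if_exists='replace', index=False)


def append_partition(part, table, con):
    """Append one partition to a table that already exists (hypothetical helper)."""
    # The table is already there, so pandas skips table creation and only inserts.
    part.to_sql(table, con, if_exists='append', index=False)
    return len(part)


# Assumption: SQLite stands in for the PostgreSQL database from the question.
con = sqlite3.connect(':memory:')

# Same shape of data as in the question.
df = pd.DataFrame(np.random.randint(0, 40, size=(100, 10)),
                  columns=list('ABCDEFGHIJ'))
df['idx'] = df.index

# Step 1: create the table exactly once, before any parallel work starts.
create_empty_table(df, 'test', con)

# Step 2: append the partitions. Here they are written one after another;
# with dask each call would be a delayed task instead.
partitions = [df.iloc[start:start + 25] for start in range(0, len(df), 25)]
written = sum(append_partition(p, 'test', con) for p in partitions)

row_count = con.execute('SELECT COUNT(*) FROM test').fetchone()[0]
print(written, row_count)
```

With the setup from the question, step 2 would presumably become dask.delayed(append_partition)(d, 'test', db_url) for each d in ddf.to_delayed(), run only after step 1 has finished against the same db_url. I have not tested this against PostgreSQL.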

