
python - Optimizing performance of Postgresql database writes in Django?

I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.

The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar Stack Exchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.

As I say, I'm wondering if I'm doing something wrong (e.g. are the batches I commit too big? too many foreign keys?...), or whether I should just move away from Django models for this task and use the DB API directly. Any ideas or suggestions?

Here are the basic models I'm dealing with when importing data (I've removed some of the fields from the original code for the sake of simplicity):

from django.db import models
from django.utils.translation import ugettext_lazy as _

class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release) 

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))

And here's the bit of code that seems to take ages to complete:

@transaction.commit_manually
def add_translations(translation_data):

    releases = Release.objects.all()

    # There are 5 releases
    for release in releases:

        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                            sourcepackage=lp_translation['sourcepackage'],
                            template_name=lp_translation['template_name'],
                            translation_domain=
                                lp_translation['translation_domain'],
                            release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
                )

            translation.save()

        # I realize I should commit every n entries
        transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                                    template=template,
                                    language=language,
                                    release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()

            transaction.commit()
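
Just to make explicit what I mean by committing in batches instead of on every .save(): this is roughly the pattern I have in mind (the batch size and the helper name are made up for illustration, not what I actually run):

from django.db import transaction

BATCH_SIZE = 1000  # arbitrary; something I'd still need to tune

@transaction.commit_manually
def save_in_batches(objects):
    # 'objects' is any iterable of unsaved model instances,
    # e.g. the Translation objects built in the loop above
    count = 0
    for obj in objects:
        obj.save()
        count += 1
        if count % BATCH_SIZE == 0:
            transaction.commit()
    # commit whatever is left in the final partial batch
    transaction.commit()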

1 Reply


Going through your app and processing every single row is a lot slower than loading the data directly on the server, even with optimized code. And inserting / updating one row at a time is again a lot slower than processing everything at once.

If the import files are available locally on the server you can use COPY. Otherwise you could use the meta-command \copy in the standard interface psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
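
For instance, flattening the JSON entries into a CSV that COPY can read could look something like this (a minimal sketch; the keys are the ones from your JSON, the file names are placeholders, and the column order has to match the table you copy into):

import csv
import json

# Flatten the JSON entries into one CSV row per entry.
with open('translations.json') as f:      # placeholder path
    entries = json.load(f)

with open('/tmp/translations.csv', 'wb') as out:   # Python 2 csv wants 'wb'
    writer = csv.writer(out)
    for e in entries:
        writer.writerow([
            e['sourcepackage'],
            e['template_name'],
            e['translation_domain'],
            e['language'],
            e['translated'],
        ])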

If you just want to add new rows to a table:

COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);

Or if you want to INSERT / UPDATE some rows:

First off: give temp_buffers enough RAM (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before any temporary tables are accessed in the session.

SET LOCAL temp_buffers='128MB';

The in-memory representation takes somewhat more space than the on-disk representation of the data. So, for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess; just do a test run and measure it:

SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));

Create the temporary table:

CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);

Or, to just duplicate the structure of an existing table:

CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;

Copy values (should take seconds, not hours):

COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);

From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:

ANALYZE tmp_x;

For instance, to update existing rows, matched by id:

UPDATE tbl
SET    val_a = tmp_x.val_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;

Finally, drop the temporary table:

DROP TABLE tmp_x;

Or have it dropped automatically at the end of the session.
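
If you would rather drive all of this from your Python code instead of psql (for example because the CSV does not live on the database server), psycopg2 can stream it through COPY ... FROM STDIN. A rough sketch, assuming the tmp_x staging table from above and an id-based match; the connection string, table and column names are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
cur = conn.cursor()

# psycopg2 opens a transaction on the first statement, so SET LOCAL
# stays in effect until the commit and runs before any temp table access.
cur.execute("SET LOCAL temp_buffers = '128MB'")
cur.execute("CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0")

# Stream the CSV from the client instead of reading a server-side file.
with open('/tmp/translations.csv') as f:
    cur.copy_expert("COPY tmp_x FROM STDIN (FORMAT csv)", f)

cur.execute("ANALYZE tmp_x")

# Update the rows that already exist ...
cur.execute("""
    UPDATE tbl
    SET    val_a = tmp_x.val_a
    FROM   tmp_x
    WHERE  tbl.id = tmp_x.id
""")

# ... and insert the ones that don't.
cur.execute("""
    INSERT INTO tbl (id, val_a, val_b)
    SELECT id, val_a, val_b
    FROM   tmp_x
    WHERE  NOT EXISTS (SELECT 1 FROM tbl WHERE tbl.id = tmp_x.id)
""")

cur.execute("DROP TABLE tmp_x")
conn.commit()
cur.close()
conn.close()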

