Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Very huge associative array in Perl

I need to merge two files into a new file.

Both files have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The second file may have records the first file does not.

Sample File 1:

1001234|X15X1211,J,S,12,15,100.05

Sample File 2:

1231112|AJ32,,,18,JP
1001234|AJ15,,,16,PP

Output:

1001234,X15X1211,J,S,12,15,100.05,AJ15,,,16,PP

I am using the following piece of code:

tie %hash_REP, 'Tie::File::AsHash', 'rep.in', split => '|';
my $counter=0;
while (($key,$val) = each %hash_REP) {
    if($counter==0) {
        print strftime "%a %b %e %H:%M:%S %Y", localtime;
    }
}

It takes almost 1 hour to prepare the associative array. Is that reasonable, or is it really bad? Is there any faster way to handle this many records in an associative array? Any suggestion in any scripting language would really help.

Thanks, Nitin T.

I also tried the following program, which also took over an hour:

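#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Print the start time.
my $now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";

# Load APP.in into a hash: key => second field.
my %hash;
open(my $app, '<', 'APP.in') or die "Could not open file 'APP.in' $!";
while (my $line = <$app>) {
    chomp($line);
    next unless length $line;
    # The pipe must be escaped: a bare /|/ matches the empty string
    # and splits the line into single characters.
    my ($key, $val) = split /\|/, $line;
    $hash{$key} = $val;
}
close $app;

my $filename = 'report.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
open(my $rep, '<', 'rep.in') or die "Could not open file 'rep.in' $!";
while (my $line = <$rep>) {
    chomp($line);
    next unless length $line;
    my @words = split /\|/, $line;
    # Write every field except the key, each followed by "|^".
    for my $i (1 .. $#words) {
        print $fh $words[$i] . "|^";
    }
    # Append the matching APP.in value (empty if the key is missing).
    my $val = $hash{$words[0]} // '';
    print $fh $val . "\n";
}
close $rep;
close $fh;
print "done\n";

# Print the end time.
$now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";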

1 Reply


Your technique is extremely inefficient for a few reasons.

  • Tying is extremely slow.
  • You're pulling everything into memory.

The first can be mitigated by doing the reading and splitting yourself, but the latter is always going to be a problem. The rule of thumb is to avoid pulling big hunks of data into memory. It'll hog all the memory, probably force the machine to swap to disk, and slow everything way down, especially if you're using a spinning disk.

Instead, there are various "on-disk hashes" you can use, with modules like GDBM_File or BerkeleyDB.
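As a rough illustration of the on-disk hash idea, here is a minimal sketch in Python (the question welcomes any scripting language). It uses the standard-library dbm module as a stand-in for GDBM_File or BerkeleyDB; the function names build_disk_hash and lookup and the sample records are hypothetical, and which backend dbm picks depends on the platform.

```python
import dbm
import os
import tempfile


def build_disk_hash(lines, db_path):
    """Store key -> value pairs from pipe-separated lines in an on-disk hash."""
    # "n" always creates a new, empty database at db_path.
    with dbm.open(db_path, "n") as db:
        for line in lines:
            key, _, value = line.rstrip().partition("|")
            # str keys and values are encoded to bytes by dbm.
            db[key] = value


def lookup(db_path, key):
    """Return the stored value for key as a str, or None if it is absent."""
    with dbm.open(db_path, "r") as db:
        value = db.get(key)
        return None if value is None else value.decode()


if __name__ == "__main__":
    # Hypothetical sample data taken from the question's second file.
    sample = ["1231112|AJ32,,,18,JP\n", "1001234|AJ15,,,16,PP\n"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rep_db")
        build_disk_hash(sample, path)
        print(lookup(path, "1001234"))
```

Only one record is held in memory at a time; the lookups go to disk, which is the trade-off an on-disk hash makes.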

But really, there's no reason to mess around with them, because we have SQLite, and it does everything they do faster and better.


Create a table in SQLite.

create table imported (
    id integer,
    value text
);

Import your file using the sqlite shell's .import command, adjusting for your format with .mode and .separator.

sqlite>     create table imported (
   ...>         id integer,
   ...>         value text
   ...>     );
sqlite> .mode list
sqlite> .separator |
sqlite> .import test.data imported
sqlite> .mode column
sqlite> select * from imported;
12345       NITIN     
12346       NITINfoo  
2398        bar       
9823        baz     

And now you, and anyone else who has to work with the data, can do whatever you like with it in efficient, flexible SQL. Even if it takes a while to import, you can go do something else while it does.
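For the merge in the question, the same idea can be sketched from a script. The following is a minimal Python example using the standard-library sqlite3 module, not a tuned implementation. The table names file1 and file2, the index name, and the helper names load and merged_rows are hypothetical, and it assumes that records without a match in the other file are dropped, as in the question's sample output.

```python
import sqlite3


def load(conn, table, lines):
    """Stream pipe-separated lines into table(id, value), one row at a time."""
    # table is a trusted, hard-coded name here, not user input.
    conn.execute(f"CREATE TABLE {table} (id INTEGER, value TEXT)")
    rows = (line.rstrip().split("|", 1) for line in lines if "|" in line)
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()


def merged_rows(conn):
    """Yield 'id,value1,value2' for every id present in both tables."""
    # An index on the join key lets SQLite look up matches instead of scanning.
    conn.execute("CREATE INDEX IF NOT EXISTS file2_id ON file2 (id)")
    query = """
        SELECT file1.id, file1.value, file2.value
        FROM file1
        JOIN file2 ON file2.id = file1.id
    """
    for key, left, right in conn.execute(query):
        yield f"{key},{left},{right}"


if __name__ == "__main__":
    # ":memory:" keeps the demo self-contained; real data would use a file path.
    conn = sqlite3.connect(":memory:")
    load(conn, "file1", ["1001234|X15X1211,J,S,12,15,100.05\n"])
    load(conn, "file2", ["1231112|AJ32,,,18,JP\n", "1001234|AJ15,,,16,PP\n"])
    for row in merged_rows(conn):
        print(row)  # 1001234,X15X1211,J,S,12,15,100.05,AJ15,,,16,PP
```

With the real files, passing an open file handle as lines keeps memory use flat, since executemany consumes the generator row by row.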

