Thursday, June 25, 2009

Editing a Very, Very Large File

Suppose we want to change a few lines in a very, very large file. A file that big (say, several GB, larger than RAM plus swap) cannot be opened in an editor. Even sed or awk can take a very long time when the command uses a pattern, because the pattern is then matched against every line; if we give a line number instead, the edit applies to that one line only. I have written a Perl script that edits multiple lines independently. It applies one sed command to each line that has to change.
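
As a small illustration of the line-number form, the following shell sketch applies a substitution to one line only. The sample lines and the `sample` and `edited` variable names are made up for the example, and it assumes a POSIX `sh` with `sed` and `mktemp` available.

```shell
#!/bin/sh
# Sketch: restrict a sed substitution to one line with a line-number address.
sample=$(mktemp)
printf '%s\n' 'Shri Bharat Singh Jat' 'Smt Amita Jat' 'Mitesh Jat' 'Shikha Jat' > "$sample"

# "3" is the address: the s command runs on line 3 only.
# Every other line is copied through unchanged.
edited=$(sed '3s/ / Singh /' "$sample")
printf '%s\n' "$edited"
rm -f "$sample"
```

sed still reads the whole file, but it no longer has to try a pattern on every line.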

The format of each line in the config file is:
line_number:sed_command
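
Only the first colon separates the two fields, so the sed command itself may contain colons. Here is a minimal shell sketch of that split, assuming a POSIX shell. `parse_entry` is a hypothetical helper for illustration and is not part of the Perl script.

```shell
#!/bin/sh
# Sketch: split one config entry at the first ':' only.
# parse_entry is a hypothetical helper used for illustration.
parse_entry() {
    lineno=${1%%:*}   # everything before the first ':'
    sedcmd=${1#*:}    # everything after the first ':'
    printf 'line=%s cmd=%s\n' "$lineno" "$sedcmd"
}

parse_entry '4:s/ / Singh /'
parse_entry '12:s/10:30/11:00/'
```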


#!/usr/bin/perl -w
#===============================================================================
#
# FILE: ed_large_file.pl
#
# USAGE: ./ed_large_file.pl <config_file> <file_name> [overwrite]
#
# DESCRIPTION: Edit very[ very] large file
#
# OPTIONS: ---
# REQUIREMENTS: ---
# BUGS: ---
# NOTES: ---
# AUTHOR: Mitesh Singh Jat (mitesh), <mitesh[at]yahoo-inc[dot]com>
# VERSION: 1.0
# CREATED: Thursday 25 June 2009 02:32:37 IST
# REVISION: ---
#===============================================================================

use strict;
use warnings;

if (@ARGV < 2)
{
    print STDERR "$0: <config_file> <file_name> [overwrite]\n";
    print STDERR "!!!Be careful while using [overwrite] option,\n";
    print STDERR "because original file will be deleted.\n";
    exit(-1);
}

my $conf_file = $ARGV[0];
my $large_file = $ARGV[1];
my $overwrite = 0;
if (@ARGV >= 3 && $ARGV[2] eq "overwrite")
{
    $overwrite = 1;
}

# dirname() from File::Basename avoids running a shell on an unquoted path.
use File::Basename qw(dirname);
my $temp_file = dirname($large_file);
if ($temp_file eq "" || (!(-d $temp_file)))
{
    print STDERR "$0: Cannot find dirname for temporary file.\n";
    print STDERR "Please check path of file '$large_file'\n";
    exit(-1);
}

$temp_file = $temp_file . "/temp";
print "Temporary file is '$temp_file'\n";

## Read config file
print "Reading config file '$conf_file'\n";
open(CFH, "<", $conf_file) or die("Cannot read Config file '$conf_file'\n");
my $line;
my %lineno_sedcmd;
while ($line = <CFH>)
{
    chomp($line);
    # Split at the first ':' only, so the sed command may contain colons.
    my ($lineno, $sedcmd) = split /:/, $line, 2;
    if (defined($sedcmd))
    {
        $lineno_sedcmd{$lineno} = $sedcmd;
        print "$lineno $lineno_sedcmd{$lineno}\n";
        # Verifying sedcmd before running it;
        # it gives a chance to reedit config file
        my $cmd = "echo \"Mitesh Singh Jat\" | sed '$sedcmd' 1> /dev/null 2>&1";
        if (!(system($cmd) == 0))
        {
            print STDERR "$0: sed command '$sedcmd' for line '$lineno' ";
            print STDERR "has an error. Please recheck with \$ man sed\n";
            close(CFH);
            exit(-1);
        }
    }
}
close(CFH);

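# Sort numerically; a plain string sort would put line 10 before line 4,
# and the edit for line 4 would then be skipped.
my @line_nos = sort { $a <=> $b } keys(%lineno_sedcmd);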

## Open large file
open(LFH, "<", $large_file) or die("$0: Cannot open file '$large_file'");
## Temporary File
open(OFH, ">", $temp_file) or die("$0: Cannot create temporary file '$temp_file'");
my $nline = 0;
my $i = 0;
my $end_idx = @line_nos - 1;
print "Processing...";
while ($line = <LFH>)
{
    ++$nline;
    # The @line_nos check avoids a warning when the config has no entries.
    if (@line_nos && $line_nos[$i] == $nline) # now edit
    {
        ++$i; # This config line is over
        if ($i > $end_idx)
        {
            $i = $end_idx;
        }
        chomp($line);
        # Single-quote the line for the shell, so that quotes, dollar signs
        # and backticks in the data reach sed unchanged.
        (my $quoted = $line) =~ s/'/'\\''/g;
        my $cmd = "printf '%s\\n' '$quoted' | sed '$lineno_sedcmd{$nline}'";
        #print "$cmd\n";
        my $out_line = `$cmd`;
        print OFH "$out_line";
        print " $nline"; #sleep 1; # to see progress :)
    }
    else
    {
        print OFH "$line";
    }
}

print "\n";

close(OFH);
close(LFH);

if ($overwrite == 0)
{
    print "done\n";
    exit(0);
}

## Overwrite original file by deleting it and moving temp
print "Overwriting...\n";
my $cmd = "rm -f $large_file \&\& mv $temp_file $large_file";
print "$cmd\n";
system($cmd) == 0
    or die("Problem in overwriting. '$cmd' failed: $?\n");
print "done\n";
exit(0);
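
The same config format can also be fed to a single sed process, instead of starting one sed for every edited line. The following is only a sketch of that alternative, not the script above: `build_sed_script`, `edit.sed` and the `workdir` layout are hypothetical, and it assumes a POSIX shell with `sed` and `mktemp`. It wraps every entry as `N{ command }`, so that sed applies the command to line N only.

```shell
#!/bin/sh
# Sketch: turn "line_number:sed_command" entries into one sed script.
# build_sed_script is a hypothetical helper, not part of ed_large_file.pl.
build_sed_script() {
    # IFS=: with two variables splits at the first ':' only;
    # the rest of the line, colons included, lands in sedcmd.
    while IFS=: read -r lineno sedcmd; do
        [ -n "$sedcmd" ] || continue
        # Emit an address block: the command runs on that line only.
        printf '%s{\n%s\n}\n' "$lineno" "$sedcmd"
    done < "$1"
}

workdir=$(mktemp -d)
printf '%s\n' 'Shree Ganeshay Namah' 'Shri Bharat Singh Jat' 'Smt Amita Jat' \
    'Mitesh Jat' 'Shikha Jat' 'Shilpa Jat' \
    'This is garbage line. Please delete it.' > "$workdir/large_file.txt"
printf '%s\n' '1:s/^.*$/!!&!!/' '4:s/ / Singh /' '7:/.*/d' > "$workdir/large_file.conf"

build_sed_script "$workdir/large_file.conf" > "$workdir/edit.sed"
sed -f "$workdir/edit.sed" "$workdir/large_file.txt" > "$workdir/temp"
cat "$workdir/temp"
```

With this sample data the output has six lines: line 1 is wrapped in `!!`, line 4 reads `Mitesh Singh Jat`, and the garbage line is gone. The trade-off is that a mistake in any one command makes sed reject the whole generated script, so checking each command first, as the Perl script does, is still worthwhile.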


Sample Run:


--(0 : 618)> ./ed_large_file.pl
./ed_large_file.pl: <config_file> <file_name> [overwrite]
!!!Be careful while using [overwrite] option,
because original file will be deleted.

--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(255 : 619)> cat large_file.txt
Shree Ganeshay Namah
Shri Bharat Singh Jat
Smt Amita Jat
Mitesh Jat
Shikha Jat
Shilpa Jat
This is garbage line. Please delete it.
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 620)> cat large_file.conf
1:s/^.*$/!!&!!/
4:s/ / Singh /
7:/.*/d
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 621)> ./ed_large_file.pl large_file.conf large_file.txt
Temporary file is './temp'
Reading config file 'large_file.conf'
1 s/^.*$/!!&!!/
4 s/ / Singh /
7 /.*/d
Processing... 1 4 7
done
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 622)> cat ./temp
!!Shree Ganeshay Namah!!
Shri Bharat Singh Jat
Smt Amita Jat
Mitesh Singh Jat
Shikha Jat
Shilpa Jat
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 623)> ./ed_large_file.pl large_file.conf large_file.txt overwrite
Temporary file is './temp'
Reading config file 'large_file.conf'
1 s/^.*$/!!&!!/
4 s/ / Singh /
7 /.*/d
Processing... 1 4 7
Overwriting...
rm -f large_file.txt && mv ./temp large_file.txt
done
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 624)> cat large_file.txt
!!Shree Ganeshay Namah!!
Shri Bharat Singh Jat
Smt Amita Jat
Mitesh Singh Jat
Shikha Jat
Shilpa Jat
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 625)>
