
Thursday, June 25, 2009

Editing a Very, Very Large File

Suppose we want to change a few lines in a very, very large file. A file that big (say, several GB, larger than RAM plus swap) cannot be opened in an editor. Even sed or awk takes a long time when a pattern is given, because the pattern is then matched against every line; the alternative is to address a single line by its line number. I have written a Perl script that edits multiple lines independently, each with its own sed command.

Each line of the config file has the format:
line_number:sed_command
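
As an illustration of this format, here is a minimal sketch of how one config line can be split into its two fields. It assumes that only the first colon is the separator, so colons inside the sed command are kept. The helper name `parse_conf_line` and the numeric check on the line number are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split "line_number:sed_command" at the first colon only.
# The limit of 2 keeps any later colons inside the sed command intact.
sub parse_conf_line
{
    my ($conf_line) = @_;
    chomp($conf_line);
    my ($lineno, $sedcmd) = split(/:/, $conf_line, 2);
    # Assumption for this sketch: reject lines without a colon
    # or with a non-numeric line number by returning an empty list.
    return unless defined($sedcmd) && $lineno =~ /^\d+$/;
    return ($lineno, $sedcmd);
}

my ($n, $cmd) = parse_conf_line("4:s/ / Singh /\n");
print "$n => $cmd\n";    # prints "4 => s/ / Singh /"
```

A line such as `12:s/a:b/c/` would yield line number 12 and the command `s/a:b/c/`, because the split stops after the first colon.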


#!/usr/bin/perl -w
#===============================================================================
#
# FILE: ed_large_file.pl
#
# USAGE: ./ed_large_file.pl <config_file> <file_name> [overwrite]
#
# DESCRIPTION: Edit very[ very] large file
#
# OPTIONS: ---
# REQUIREMENTS: ---
# BUGS: ---
# NOTES: ---
# AUTHOR: Mitesh Singh Jat (mitesh), <mitesh[at]yahoo-inc[dot]com>
# VERSION: 1.0
# CREATED: Thursday 25 June 2009 02:32:37 IST
# REVISION: ---
#===============================================================================

use strict;
use warnings;

if (@ARGV < 2)
{
print STDERR "$0: <config_file> <file_name> [overwrite]\n";
print STDERR "!!!Be careful while using [overwrite] option,\n";
print STDERR "because original file will be deleted.\n";
exit(-1);
}

my $conf_file = $ARGV[0];
my $large_file = $ARGV[1];
my $overwrite = 0;
if (@ARGV >= 3 && $ARGV[2] eq "overwrite")
{
$overwrite = 1;
}

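# Use the core File::Basename module rather than shelling out to dirname,
# so that paths with spaces or shell metacharacters are handled correctly
use File::Basename qw(dirname);
my $temp_file = dirname($large_file);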
if ($temp_file eq "" || (!(-d $temp_file)))
{
print STDERR "$0: Cannot find dirname for temporary file.\n";
print STDERR "Please check path of file '$large_file'\n";
exit(-1);
}

$temp_file = $temp_file . "/temp";
print "Temporary file is '$temp_file'\n";

## Read config file
print "Reading config file '$conf_file'\n";
open(CFH, "$conf_file") or die("Cannot read Config file '$conf_file'\n");
my $line;
my %lineno_sedcmd;
while ($line = <CFH>)
{
chomp($line);
my ($lineno, $sedcmd) = split /:/, $line, 2;
if (defined($sedcmd))
{
$lineno_sedcmd{$lineno} = $sedcmd;
print "$lineno $lineno_sedcmd{$lineno}\n";
# Verifying sedcmd before running it;
# it gives a chance to reedit config file
my $cmd = "echo \"Mitesh Singh Jat\" | sed '$sedcmd' 1> /dev/null 2>&1";
if (!(system($cmd) == 0))
{
print STDERR "$0: sed command '$sedcmd' for line '$lineno' ";
print STDERR "has an error. Please recheck with \$ man sed\n";
close(CFH);
exit(-1);
}
}
}
close(CFH);

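# Sort line numbers numerically; a plain sort is lexicographic
# ("10" before "4") and would make the scan below skip edits
my @line_nos = sort { $a <=> $b } keys %lineno_sedcmd;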

## Open large file
open(LFH, "$large_file") or die("$0: Cannot open file '$large_file'");
## Temporary File
open(OFH, ">$temp_file") or die("$0: Cannot create temporary file '$temp_file'");
my $nline = 0;
my $i = 0;
my $end_idx = @line_nos - 1;
print "Processing...";
while ($line = <LFH>)
{
++$nline;
if (@line_nos && $line_nos[$i] == $nline) # now edit (guard: config may be empty)
{
++$i; # This config line is over
if ($i > $end_idx)
{
$i = $end_idx;
}
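chomp($line);
# Single-quote the line for the shell, escaping embedded single quotes,
# so that ", $, ` and \ in the data are passed to sed unchanged
(my $quoted = $line) =~ s/'/'\\''/g;
my $cmd = "printf '%s\\n' '$quoted' | sed '$lineno_sedcmd{$nline}'";
#print "$cmd\n";
my $out_line = `$cmd`;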
print OFH "$out_line";
print " $nline"; #sleep 1; # to see progress :)
}
else
{
print OFH "$line";
}
}

print "\n";

close(OFH);
close(LFH);

if ($overwrite == 0)
{
print "done\n";
exit(0);
}

## Overwrite original file by deleting it and moving temp
print "Overwriting...\n";
my $cmd = "rm -f $large_file \&\& mv $temp_file $large_file";
print "$cmd\n";
system($cmd) == 0
or die("Problem in overwriting. '$cmd' failed: $?\n");
print "done\n";
exit(0);
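
The overwrite step in the script above shells out to `rm -f` and `mv`. As a hedged alternative sketch (not part of the original script), Perl's built-in `rename` can replace the original file in one step without going through the shell. This assumes the temporary file is on the same filesystem as the original, which should hold here because it is created in the same directory. The helper name `replace_file` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $target with $temp using rename().
# No shell is involved, so paths with spaces need no quoting.
# Assumes both paths are on the same filesystem.
sub replace_file
{
    my ($temp, $target) = @_;
    rename($temp, $target)
        or die("Cannot rename '$temp' to '$target': $!\n");
    return 1;
}

# Usage sketch, guarded so that nothing happens without two arguments:
#   perl replace_file.pl ./temp large_file.txt
if (@ARGV == 2)
{
    replace_file($ARGV[0], $ARGV[1]);
    print "done\n";
}
```

With this approach the original file is never deleted before its replacement is in place, so a failure leaves the original untouched.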


Sample Run:


--(0 : 618)> ./ed_large_file.pl
./ed_large_file.pl: <config_file> <file_name> [overwrite]
!!!Be careful while using [overwrite] option,
because original file will be deleted.

--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(255 : 619)> cat large_file.txt
Shree Ganeshay Namah
Shri Bharat Singh Jat
Smt Amita Jat
Mitesh Jat
Shikha Jat
Shilpa Jat
This is garbage line. Please delete it.
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 620)> cat large_file.conf
1:s/^.*$/!!&!!/
4:s/ / Singh /
7:/.*/d
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 621)> ./ed_large_file.pl large_file.conf large_file.txt
Temporary file is './temp'
Reading config file 'large_file.conf'
1 s/^.*$/!!&!!/
4 s/ / Singh /
7 /.*/d
Processing... 1 4 7
done
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 622)> cat ./temp
!!Shree Ganeshay Namah!!
Shri Bharat Singh Jat
Smt Amita Jat
Mitesh Singh Jat
Shikha Jat
Shilpa Jat
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 623)> ./ed_large_file.pl large_file.conf large_file.txt overwrite
Temporary file is './temp'
Reading config file 'large_file.conf'
1 s/^.*$/!!&!!/
4 s/ / Singh /
7 /.*/d
Processing... 1 4 7
Overwriting...
rm -f large_file.txt && mv ./temp large_file.txt
done
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 624)> cat large_file.txt
!!Shree Ganeshay Namah!!
Shri Bharat Singh Jat
Smt Amita Jat
Mitesh Singh Jat
Shikha Jat
Shilpa Jat
--(mitesh@roundduck-lm)-(~/Programming/Perl/Editing_Large_Files)--
--(0 : 625)>