Useful Perl Scripts With Regular Expressions
md_doc
Many people talk about Perl and many more about regular expressions but unless you are a programmer you probably never use either. We will discuss a few unique and very useful ways to use both of them. Have you ever needed to parse multiple files to remove or modify a certain string? Have you ever needed to parse multiple files in subdirectories to change content in them? If so then this tutorial will certainly give you the insight you need. Read Useful Perl Scripts With Regular Expressions tutorial here (http://www.grindinggears.com/articles/Server_Side_Coding/Perl/perl_scripts_regex/page1.html)
md_doc
How would I change the script you showed under "Replace on Multiple files" so that it only processes the current directory?
Question submitted by e-mail.
md_doc
The following code parses a single directory. The only thing that changes is the if statement, where we add $File::Find::dir and compare it to our initial directory, which is stored in $directory.
On a side note: make sure that the $directory value DOES NOT have a trailing slash. If it does, you will not get the results you want, because $File::Find::dir returns directories without a trailing slash; for example, a directory would be /tmp, not /tmp/.
#!/usr/bin/perl
use File::Find;
use strict;
my $directory = "/home/directory";
find (\&process, $directory);
sub process
{
my @outLines; #Data we are going to output
my $line; #Data we are reading line by line
# print "processing $_ / $File::Find::name\n";
# Only parse files that end in .html
if ( ($File::Find::dir eq $directory ) && ($File::Find::name =~ /\.html$/ ) ) {
open (FILE, $File::Find::name ) or
die "Cannot open file: $!";
print "\n" . $File::Find::name . "\n";
while ( $line = <FILE> ) {
$line =~ s/<body([^>]*)>/<body>/i;
push(@outLines, $line);
}
close FILE;
open ( OUTFILE, ">$File::Find::name" ) or
die "Cannot open file: $!";
print ( OUTFILE @outLines );
close ( OUTFILE );
undef( @outLines );
}
}
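The trailing-slash note above can be handled in code rather than by hand. This is only a sketch: strip_trailing_slash is a hypothetical helper name, not part of the original script or of File::Find.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing slashes so the value can be
# compared with $File::Find::dir, which has no trailing slash.
sub strip_trailing_slash {
    my ($dir) = @_;
    # Leave a path made only of slashes (the root directory) alone.
    $dir =~ s{/+\z}{} unless $dir =~ m{\A/+\z};
    return $dir;
}

print strip_trailing_slash('/tmp/'), "\n";
print strip_trailing_slash('/home/directory'), "\n";
```

In the script above you would then write my $directory = strip_trailing_slash("/home/directory/"); before calling find.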
md_doc
This question came from an e-mail.
Thank you. I'll put that to use! To change this so that the directory
to parse is handled dynamically, you would change
my $directory = "/home/directory";
to
my $directory = $_[0];
correct?
I'm no Perl expert but with your help in cases such as this one, maybe I
can someday approach the level of expert.
md_doc
Actually, the way to do it is to use $ARGV[0]. $_[0] is the first argument passed to a subroutine; the command line arguments are in @ARGV.
So the code would look like
my $directory = $ARGV[0];
instead of
my $directory = "/home/directory";
By doing this you would be able to run your perl script from the command line as follows
> FindAndReplace.pl /my/directory
I hope this helps!
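To make the difference between the two concrete, here is a small sketch. The sub name first_arg and the paths are made up for illustration, and @ARGV is filled in by hand only so the example runs without a real command line.

```perl
use strict;
use warnings;

# Inside a sub, @_ holds the arguments of that call.
sub first_arg {
    return $_[0];
}

# Stand-in for running: FindAndReplace.pl /my/directory
local @ARGV = ('/my/directory');

my $directory = $ARGV[0];               # from the command line
my $from_sub  = first_arg('/other/path');   # from the sub call

print "command line argument: $directory\n";
print "sub argument:          $from_sub\n";
```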
write-only
Hi, thank you for writing and posting your script. I found this thread via google.
Anyway, I thought I would contribute the version I use. It recurses, but there is a line commented out to only process a single directory, as that is sometimes required. I also like the script to report what it is changing; this can be useful if the output is being written to a file via `>>log` or something.
#!/usr/local/bin/perl
#
# find and replace within HTML files
#
use File::Find;
use strict;
# edit these values for search and replace
my $old = '../images';
my $new = '../../images';
my $directory = $ARGV[0];
die "\nRequires argument for search path\n" unless defined $directory ;
find (\&process, $directory);
sub process
{
my @outLines; #Data we are going to output
my $line; #Data we are reading line by line
# print "processing $_ / $File::Find::name\n";
# Only parse files that end in .html
# if ( ($File::Find:dir eq $directory ) && ($File::Find::name =~ /.html$/ ) ) { # only search one dir
if ( $File::Find::name =~ /\.html$/ ) { # search sub dirs too
open (FILE, $File::Find::name ) or
die "Cannot open file: $!";
print "\n" . $File::Find::name . "\n";
while ( $line = <FILE> ) {
$line =~ s/\Q$old\E/$new/i; # \Q...\E keeps the dots in $old from acting as wildcards
push(@outLines, $line);
}
close FILE;
open ( OUTFILE, ">$File::Find::name" ) or
die "Cannot open file: $!";
print ( OUTFILE @outLines );
close ( OUTFILE );
undef( @outLines );
}
}
print "\n\n Done changing $old to $new in\n $directory\n";
Please let me know if you see a bug!
Thanks for sharing.
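One thing to watch once the directory comes from the command line: File::Find changes into each directory before it calls the callback, so a relative $File::Find::name may not point at the file any more, while $_ (the bare file name) does. The sketch below is only a demonstration under assumptions: it builds a throwaway directory with File::Temp, and the names sub and page.html are made up.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Find;
use File::Temp qw(tempdir);

my $start = getcwd();
my $root  = tempdir( CLEANUP => 1 );

# Build root/sub/page.html
mkdir "$root/sub" or die "mkdir failed: $!";
open( my $fh, '>', "$root/sub/page.html" ) or die "Cannot create file: $!";
close $fh;

chdir $root or die "chdir failed: $!";

my ( $via_name, $via_underscore );
find(
    sub {
        return unless /\.html$/;
        # The current directory is now ./sub, so ./sub/page.html is not found from here.
        $via_name       = ( -e $File::Find::name ) ? 1 : 0;
        $via_underscore = ( -e $_ )                ? 1 : 0;
    },
    '.'    # relative start directory, like passing . on the command line
);

chdir $start or die "chdir failed: $!";

print "found via \$File::Find::name: $via_name\n";
print "found via \$_:                $via_underscore\n";
```

Opening $_ instead of $File::Find::name, or passing an absolute directory, avoids the problem.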
md_doc
write-only
Actually that is a great idea to have variables for what you want changed as it makes it much easier to modify the script. Great job!
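One caution when the search text lives in a variable: it is treated as a regular expression, so the dots in '../images' match any character. A minimal sketch of the difference that \Q...\E makes; the sample string is invented for the demonstration.

```perl
use strict;
use warnings;

my $old = '../images';
my $new = '../../images';

# Invented sample line: "xx/images" is NOT the text we are looking for.
my $sample = 'src="xx/images/a.png"';

# Without quoting, each . matches any character, so "xx/images" is replaced.
( my $unquoted = $sample ) =~ s/$old/$new/i;

# With \Q...\E the dots are literal, so nothing matches and the line is unchanged.
( my $quoted = $sample ) =~ s/\Q$old\E/$new/i;

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```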
write-only
I found a bug!
# if ( ($File::Find:dir eq $directory ) && ($File::Find::name =~ /.html$/ ) ) { # only search one dir
should be:
# if ( ($File::Find::dir eq $directory ) && ($File::Find::name =~ /.html$/ ) ) { # only search one dir
Sometimes when making changes to web sites, you do not want to recurse into subdirectories.
perl -c SCRIPT.pl is our friend!
mazad_04
Hi, I was just wondering.
When opening a file in Windows, the writer of "Useful Perl Scripts With Regular Expressions" says you must give the Windows path "C:\my documents\file.txt", then goes on to say "while it might seem funny to see the double slashes". But there are no double slashes. What is the story?
Thanks mazad
md_doc
It seems the content management system and display system we use escapes the double slashes.
So it should be either "C:\\directory\\file" or (I think, but don't quote me on this) "C:/directory/file".
I will try to get the site updated as soon as possible.
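For reference, here is a short sketch of the ways to write such a path in Perl source. The path itself is just the example from above; the forward-slash form is included on the assumption that Perl on Windows accepts it, which is generally the case for open.

```perl
use strict;
use warnings;

# In double quotes a backslash starts an escape, so it has to be doubled.
my $double = "C:\\directory\\file";

# In single quotes only \\ and \' are special, so single backslashes stay as they are.
my $single = 'C:\directory\file';

# Forward slashes need no escaping at all.
my $forward = "C:/directory/file";

print "$double\n";
print "$single\n";
print "$forward\n";
```

The first two print the same text, C:\directory\file.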
khalifa
Can anybody explain what "ne" does in ($File::Find::dir ne ".")? Thank you.
Any other info would be appreciated.
part of the script is below:
~~~~~~~~~~~~~~~~~~~
sub dir_trav {
if (($File::Find::dir ne ".")&&($File::Find::dir ne "..")&&!($File::Find::dir =~ /call/i)) {
$temp = $File::Find::dir;
$temp =~ s/.*\///;
}
}
finddepth(\&dir_trav,@path);
khalifa
md_doc
It stands for "not equal" and is used to compare strings in Perl (numbers are compared with != instead).
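A small sketch of the string comparison next to the numeric one; the variable names and values are made up.

```perl
use strict;
use warnings;

my $dir = 'reports';

# ne compares the two values as strings.
my $not_current_dir = ( $dir ne '.' ) ? 1 : 0;

# As strings '1.0' and '1' differ; as numbers they are equal.
my $differ_as_strings = ( '1.0' ne '1' ) ? 1 : 0;
my $differ_as_numbers = ( '1.0' != 1 )   ? 1 : 0;

print "dir is not '.':        $not_current_dir\n";
print "'1.0' ne '1':          $differ_as_strings\n";
print "'1.0' != 1:            $differ_as_numbers\n";
```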
khalifa
I understand parts of the dir_trav function, but could you give me a brief explanation of what it is doing, and of what finddepth(\&dir_trav,@path); is doing? Thank you.
md_doc
I believe (but please do not quote me on this, because I do not have File::Find installed on the system I am currently on) that the difference between find and finddepth is the order of the calls: finddepth goes down as deep as possible first and calls your function on the way back up, so the contents of a directory are handled before the directory itself.
I use find, and since the two are so similar, the same description covers both: they take a callback function. The callback is called once for every file and directory that is found, with the name available in $_ and the full path in $File::Find::name. So suppose you have the following directory structure (Unix style):
.
..
bill.txt
bob.txt
/home
etc
What these functions do is take these entries one at a time and hand each name to your callback function. You then do whatever you want with the file in question.
In my examples I look for .html documents and parse them but you can do anything.
I hope this helps. It has been a really long day and I hope I did more good than bad.
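The difference in order between find and finddepth can be seen with a small sketch. It assumes a throwaway directory made with File::Temp; the names sub and page.html are invented for the demonstration.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build root/sub/page.html in a temporary directory.
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/sub" or die "mkdir failed: $!";
open( my $fh, '>', "$root/sub/page.html" ) or die "Cannot create file: $!";
close $fh;

# Record the order in which each function calls the callback.
my ( @find_order, @finddepth_order );
find(      sub { push @find_order,      $File::Find::name }, $root );
finddepth( sub { push @finddepth_order, $File::Find::name }, $root );

print "find:\n";
print "  $_\n" for @find_order;
print "finddepth:\n";
print "  $_\n" for @finddepth_order;
```

find reports the top directory first and the file last; finddepth reports the file first and the top directory last.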
bpaineusa
I would like to update an online pricebook. All of the prices in my pages are three digit numbers in table tags. A grep search like this ">\s*(\d{3})\s*<" finds all of my prices. I want to increase the prices by 6% so I've developed the following using your tutorial:
#!/usr/local/bin/perl -w
use File::Find;
use strict;
my $directory = "/Users/bpaine/Documents/buttura_gherardi_webs/Web_Master/try_perl/tryagain";
find (\&process, $directory);
sub process
{
my @outLines; #Data we are going to output
my $text; #Data we are reading line by line
#print "processing $_ / $File::Find::name\n";
# Only parse files that end in .html
if ( $File::Find::name =~ /\.html$/ ){
open (FILE, $File::Find::name ) or
die "Cannot open file: $!";
print "\n" . $File::Find::name . "\n";
while ( $text = <FILE> ) {
$text =~ s/>\s*(\d{3})\s*</'>' . int($1 * 1.06 + 0.5) . '<'/gex;
push(@outLines, $text);
}
close FILE;
open(OUTFILE, ">$File::Find::name" ) or
die "Cannot open file: $!";
print ( OUTFILE @outLines );
close ( OUTFILE );
undef( @outLines );
}
}
Why doesn't it work? Can y'all help?
Thanks,
Bill
md_doc
Bill,
I did the following
#!/usr/local/bin/perl -w
use strict;
my $text = "100abced300asdfew";
$text =~ s/\s*(\d{3})\s*/'>' . int($1 * 1.06 + .5) . '<'/gex;
print $text;
and I got the following output
>106<abced>318<asdfew
Which is, I think, what you expect.
Maybe your problem is that your greater than and less than signs are not where you expect them to be in the files?
Or maybe the pattern has to match across a line break in the file?
bpaineusa
Here's an example of the code that I'm trying to transform.
There are several html files in several directories. There are other files in those directories, usually css. I want to multiply all of the 3 digit numbers by 1.06 (raising them 6%). I don't want to alter the file in any other way.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="base.css" media="screen">
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="650">
<tr>
<td colspan="17"><a href="../../printable/bases/06bases/06basesa.html" target="_blank">Printable View: Click here</a>
</td>
</tr>
<tr>
<td width="18">
</td>
<td width="10">
</td>
<td width="18">
</td>
<td width="14">
</td>
<td width="18">
</td>
<td width="10">
</td>
<td width="18">
</td>
<td width="14">
</td>
<td width="18">
</td>
<td width="10">
</td>
<td width="18">
</td>
<td width="80">
</td>
<td width="81">
</td>
<td width="81">
</td>
<td width="81">
</td>
<td width="81">
</td>
<td width="80">
</td>
</tr>
<tr>
<td colspan="17" class="xl37">
6 Inch Bases - Polished Flat Top, BRP
</td>
</tr>
<tr>
<td colspan="11">
</td>
<td class="xl34">
Artisan® Gray
</td>
<td>
</td>
<td class="xl34">
Artisan® Red
</td>
<td colspan="3">
</td>
</tr>
<tr>
<td colspan="11">
</td>
<td class="xl34">
Artisan® Rose
</td>
<td class="xl34">
Heirloom® </td>
<td class="xl34">
Heirloom® </td>
<td class="xl34">
Artisan® Gem Mist
</td>
<td class="xl38">
</td>
<td class="xl34">
Heirloom® </td>
</tr>
<tr>
<td class="xl39" colspan="11">
</td>
<td class="xl39">
Artisan® Mah.
</td>
<td class="xl39">
Blue Gray
</td>
<td class="xl39">
Black Tweed
</td>
<td class="xl39">
Heirloom® Rose
</td>
<td class="xl39">
Heirloom® Mah.
</td>
<td class="xl39">
Jet Black
</td>
</tr>
<tr>
<td class="xl42">
2
</td>
<td class="xl42">
-
</td>
<td class="xl42">
0
</td>
<td class="xl42">
x
</td>
<td class="xl42">
1
</td>
<td class="xl42">
-
</td>
<td class="xl42">
0
</td>
<td class="xl42">
x
</td>
<td class="xl42">
0
</td>
<td class="xl43">
-
</td>
<td class="xl45">
6
</td>
<td class="xl42">
98
</td>
<td class="xl42">
102
</td>
<td class="xl42">
130
</td>
<td class="xl42">
158
</td>
<td class="xl42">
184
</td>
<td class="xl45">
210
</td>
</tr>
<tr>
<td class="xl42">
2
</td>
<td class="xl43">
-
</td>
<td class="xl42">
0
</td>
<td class="xl42">
x
</td>
<td class="xl42">
1
</td>
<td class="xl43">
-
</td>
<td class="xl42">
2
</td>
<td colspan="2" class="xl42">
</td>
<td class="xl43">
</td>
<td class="xl45">
</td>
<td class="xl42">
114
</td>
<td class="xl42">
119
</td>
<td class="xl42">
152
</td>
<td class="xl42">
184
</td>
<td class="xl42">
215
</td>
<td class="xl45">
245
</td>
</tr>
<tr>
<td class="xl42">
2
</td>
<td class="xl43">
-
</td>
<td class="xl42">
0
</td>
<td class="xl42">
x
</td>
<td class="xl42">
1
</td>
<td class="xl43">
-
</td>
<td class="xl42">
4
</td>
<td colspan="2" class="xl42">
</td>
<td class="xl43">
</td>
<td class="xl45">
</td>
<td class="xl42">
131
</td>
<td class="xl42">
136
</td>
<td class="xl42">
173
</td>
<td class="xl42">
211
</td>
<td class="xl42">
245
</td>
<td class="xl45">
280
</td>
</tr>
<tr>
<td class="xl47">
3
</td>
<td class="xl46">
-
</td>
<td class="xl47">
10
</td>
<td class="xl47">
x
</td>
<td class="xl47">
1
</td>
<td class="xl46">
-
</td>
<td class="xl47">
6
</td>
<td class="xl47">
</td>
<td class="xl47">
</td>
<td class="xl46">
</td>
<td class="xl48">
</td>
<td class="xl47">
282
</td>
<td class="xl47">
293
</td>
<td class="xl47">
374
</td>
<td class="xl47">
454
</td>
<td class="xl47">
529
</td>
<td class="xl48">
604
</td>
</tr>
</table>
</body>
</html>
Thanks,
Bill
md_doc
Bill,
Your problem is that the >, the ###, and the < are all on separate lines, and your script reads the file one line at a time, so no single line ever contains the whole pattern.
Adding /s to the end of the s/// will not fix this by itself: /s only lets . match a newline, and \s already matches line breaks. What you need is to read the whole file into one string (for example by undefining $/ before the read) and run the substitution on that string.
The /x option only lets you put whitespace and comments inside the pattern, so /ge would do just as well as /gex.
Also be careful: the replacement throws away the whitespace matched by \s*, so the line breaks around each price will disappear, and every three-digit number that sits between a > and a < will be changed, whatever it is. I would make sure to test well before doing this on your production data.
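Whatever modifiers are used, the substitution can only see the >, the number, and the < together if they are in the same string. Here is a minimal sketch of reading everything at once. It is not Bill's script: the sample HTML is a two-cell stand-in, it is read through an in-memory filehandle so no files are touched, and the whitespace is captured and put back, which is a variation on the original replacement.

```perl
use strict;
use warnings;

# Stand-in for one of the price pages; a real script would open the file itself.
my $html = qq{<td class="xl42">\n102\n</td>\n<td class="xl42">\n98\n</td>\n};

# Read from an in-memory filehandle so the sketch needs no files on disk.
open( my $fh, '<', \$html ) or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$fh> };    # with $/ undefined, one read returns everything
close $fh;

# Raise each three-digit price by 6%, keeping the whitespace around it.
$text =~ s/>(\s*)(\d{3})(\s*)</'>' . $1 . int( $2 * 1.06 + 0.5 ) . $3 . '<'/ge;

print $text;
```

Here 102 becomes 108, and 98 is left alone because it has only two digits.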