HTML Parser
Thread: HTML Parser
SS101
I read your nice article titled "Useful Perl Scripts With Regular Expressions" and found it very useful. I have a directory of about 1000 HTML files that I would like to clean up and then import into a database. The problems are:
1. I am very new to Perl :-P
2. The HTML is very badly coded: the title is not in a title tag and there is not even an opening body tag, so it's very hard to clean up the code.
But I have to start somewhere, so I might as well start here. In the coding examples you used regex to search and replace. Should I be using something else?
Here is a copy of the HTML. As you can see, the HTML is in bad shape.
<p><b><font color="#DD0011"><font face= "Arial"><font size="5">
<li>PUMPKIN* DESSERT</font></font></font>
<p><li>*1 lg. can Libby pumpkin (for 2 pies)
<p>*Mix as directed, except cut the milk amount
in half. 1 box yellow cake mix
<li>Chopped
walnuts
<li>1 1/2
sticks oleo, melted
<p>*Spread this in a buttered 9 x 13 inch pan.* Sprinkle over this (dry) yellow cake
mix.* Sprinkle over this chopped
walnuts.* Drizzle over this 1 1/2 sticks
melted oleo.* Bake at 350 degrees for 1
hour.* Test as for pumpkin pie.* (Whipped topping, if desired.)* </body>
</html>
Any help on this would be great, as I am getting a little lost in this new world of Perl.
md_doc
Do all your files look similar, i.e. just about identical to that one in terms of HTML tags?
If they do, just let us know which part you think should be the title. I am assuming in this case that the title of the page should be "PUMPKIN* DESSERT" and that you probably want an h1 tag with the same text on the page.
Formatting the rest is not going to be easy, since there is currently no structure to it except line breaks, but that should be enough.
If you could show us what you think the example you gave us should look like when it is done, then we can help modify the script so that you can get it working on your files.
SS101
Thank you for your reply md_doc.
Yes, as you pointed out, whoever wrote them needs to take a few
lessons in basic HTML.
I have posted another example. As you will see, the title always seems to be in these tags
<p><b><font color="#DD0011"><font face= "Arial"><font size="5">
and then ends in
</font></font></font>
After that there is about
<p><b><font color="#DD0011"><font face= "Arial"><font size="5">*19 -- LOBSTER* AND* ROASTED* CORN*
CHOWDER</font></font></font>
<p><li>*1 (21 lb.) live Maine lobster
<li>4
strips bacon, fine dice
<li>1/2
med. onion, fine dice
<li>1/4
med. green pepper, fine dice
<li>1/4
med. red pepper, fine dice
<li>1/4
med. yellow pepper, fine dice
<li>1/4 sm.
jalapeno pepper, fine dice
<li>1/2
stalk celery, fine dice
<li>1/2
med. carrot, fine dice
<li>1/2 c.
diced green chilies, canned
<li>1/4 lb.
unsalted butter
<li>1/2 c.
all purpose flour
<li>6 c.
lobster stock or 1 tbsp. lobster
base and 6 c. water
<li>1 tbsp.
tomato paste
<li>1 c.
corn, cut off the cob
<li>1 c.
cream style corn
<li>1 sm.
smoked hamhock
<li>2 med.
baking potatoes, peeled and
cut into 1 inch square dice
<li>1 bunch
cilantro, fine chop
<li>2
stalks green onions, fine bias cut
<li>2 c.
heavy whipping cream
<li>1 tsp.
Lenard's southwestern
seasoning blend
<li>1/2
lemon juice
<li>Salt
and ground black pepper, to taste
<p>*Steam lobster 17 minutes, let cool and remove
from shell.* Save shells to make lobster
stock if desired.* Saute bacon until
crispy in sauce or small stock pot.*
Add:* onions; green, red, yellow
and jalapeno peppers; celery and carrots, cook until soft.* Add:*
tomato paste and green chilies.*
Add:* lobster base, if that is
your choice over stock.* Cook 3 mintues
stirring constantly over medium heat.*
Add:
<li> 1/2 of the unsalted butter and cook until melted.* Add: flour and cook 3 more minutes.* Roast corn kernals in an oven on a baking
sheet or other flat pan until slightly browned and add to chowder.* Add:*
cream style corn; southwestern seasoning and smoked hamhocks.* Add:*
stock or water, if using base.*
Cook for 1/2 hour keeping at a slow boil, stirring constantly.* Add: potatoes and cook for another 15
minutes.* If too thick add more stock or
water to desired consistency.* Add:
green onions; cilantro and lemon juice.*
Slowly Whisk in cream and the remaining butter until melted.* Season with salt and black pepper to
taste.* Serves 12 people.* The Phoenician </body>
</html>
The next question is a good one. I am not sure either. The plan is to insert these recipes into the open source project called Mambo. The only way I can see to do this is a database insert, as there is no import API.
I don't know whether I should write these to a txt file stripped of the HTML tags, insert that into the database, and then format it all nicely in the CMS editor, or insert everything, HTML tags and all, and then do a little cleaning up in the editor.
This is where I need some ideas and thoughts...
md_doc
Well, to be honest, it really does not matter whether you edit the HTML files or insert them into a database, because you still have to parse the information.
What I think you want to do, and you will want to check a bunch of your files for this, is to take everything before the /font tags, remove the HTML, and use what is left as the name or title. But please check more than 2 of your files to make sure.
The first file you posted does not have anything before the first font tag and it has an LI inside the font tags, whereas the second one you posted has something before the font tags and no LI inside them. So can you check to see if there are any other variations?
You might get lucky: I see no other font tags, so you might be able to just delete all HTML tags before the last font tag and use the remaining text as your title.
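A minimal sketch of that "everything before the last font tag" idea, assuming the file has already been read into one string. The helper name title_before_last_font is made up for illustration, and the sample text is the first file posted above.

```perl
use strict;
use warnings;

# Hypothetical helper: treat everything before the last </font> as the title.
sub title_before_last_font {
    my ($html) = @_;
    my $end = rindex(lc($html), '</font>');   # start of the last closing font tag
    return if $end < 0;                       # no font tag: check this file by hand
    my $title = substr($html, 0, $end);
    $title =~ s/<[^>]*>//g;                   # drop every tag before that point
    $title =~ s/\s+/ /g;                      # join lines that wrapped
    $title =~ s/^\s+|\s+$//g;                 # trim both ends
    return length($title) ? $title : undef;
}

my $sample = <<'HTML';
<p><b><font color="#DD0011"><font face= "Arial"><font size="5">
<li>PUMPKIN* DESSERT</font></font></font>
<p><li>*1 lg. can Libby pumpkin (for 2 pies)
HTML

my $title = title_before_last_font($sample);
print defined $title ? "$title\n" : "no title found\n";
```

This only works if the title really is the only thing wrapped in font tags, so it is worth running against a batch of files before trusting it.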
SS101
Thanks for the update. I wrote a very small script to extract the title and scanned about 100 of the files, and I think about 98% of them
did have
<p><b><font color="#DD0011"><font face= "Arial"><font size="5"> TITLE NAME
What I was thinking is that all the ones that do not match this could be moved to another folder to be checked by hand.
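Roughly, that sorting step could look like the sketch below. It assumes the files end in .htm or .html and sit in one directory. The function names and the two directory arguments are invented for the example, and File::Copy and File::Basename are core modules.

```perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Basename qw(basename);

# True when the page contains the usual three-font title markup.
sub has_title_markup {
    my ($html) = @_;
    return $html =~ m{<font[^>]*>\s*<font[^>]*>\s*<font[^>]*>.*?</font>\s*</font>\s*</font>}is ? 1 : 0;
}

# Move every HTML file in $src_dir that lacks the markup into $review_dir.
sub set_aside_odd_files {
    my ($src_dir, $review_dir) = @_;
    mkdir $review_dir unless -d $review_dir;
    opendir my $dh, $src_dir or die "Cannot open $src_dir: $!";
    my @names = grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    my @moved;
    for my $name (@names) {
        my $path = "$src_dir/$name";
        open my $fh, '<', $path or die "Cannot read $path: $!";
        my $html = do { local $/; <$fh> };   # slurp the whole file into one string
        close $fh;
        next if has_title_markup($html);
        move($path, "$review_dir/" . basename($path)) or die "Cannot move $path: $!";
        push @moved, $path;
    }
    return @moved;
}

# Usage (hypothetical names): perl sort_recipes.pl recipes needs_review
if (@ARGV == 2) {
    my @moved = set_aside_odd_files(@ARGV);
    print "Set aside: $_\n" for @moved;
}
```

Trying it on a copy of the directory first is safer, since the files are moved rather than copied.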
I think the first one must have been an odd one, as I have not seen another one like it.
What I do see a lot of is the title wrapping around:
<p><b><font color="#DD0011"><font face= "Arial"><font size="5">*7 -- BLACK*
BEAN* SOUP</font></font></font>
but I have to say they seem to follow the same layout most of the time.
md_doc
The title wrapping won't be an issue. You just have to add /s to your regular expression, which lets . match newlines too, so once the whole file has been read into a single string the expression treats it as 1 single line.
This means in theory you could do something like
s/<font(.*?)><font(.*?)><font(.*?)>(.*?)<\/font><\/font><\/font>/<h1>$4<\/h1>/is;
Note that I did not test the code. Also note that I used (.*?) instead of ([^>]*). Both are there to stop the match at the first > rather than the last one: (.*?) is non-greedy, while ([^>]*) is greedy but can never match a > character.
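To see what /s changes, here is a small sketch run against the wrapped BLACK BEAN SOUP title posted earlier. It uses braces as delimiters so the slashes in the closing tags need no escaping, and [^>]* for the tag attributes. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $page = <<'HTML';
<p><b><font color="#DD0011"><font face= "Arial"><font size="5">*7 -- BLACK*
BEAN* SOUP</font></font></font>
HTML

# The three opening font tags, whatever their attributes are.
my $open_tags = qr{<font[^>]*><font[^>]*><font[^>]*>}i;

# Without /s the dot cannot cross the line break inside the title.
my $without_s = $page =~ m{$open_tags(.*?)</font></font></font>}i ? 'match' : 'no match';

# With /s the dot also matches "\n", so the wrapped title is captured.
(my $fixed = $page) =~ s{$open_tags(.*?)</font></font></font>}{<h1>$1</h1>}is;

print "without /s: $without_s\n";   # prints "without /s: no match"
print $fixed;                       # the title is now inside <h1>...</h1>
```

The modifier only helps when the title and its closing tags are in the same string, so a script that reads the file line by line would need to slurp it first.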
That should take everything between the font tags and put it in an h1 tag in place of the font tags. You will still want to do some other things, like create a title tag, so maybe doing a match instead of a substitution is what you should be looking at, and maybe you want to write all the information out to a new file.
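As a rough, untested-on-your-files sketch of the match approach: capture the title, clean it up, and write a rebuilt page to a new file next to the original. The function names, the output layout and the ".new" suffix are all assumptions made for the example, not anything Mambo requires.

```perl
use strict;
use warnings;

# Capture the title with a match (m//) instead of rewriting the text in place.
sub extract_title {
    my ($html) = @_;
    return unless $html =~ m{<font[^>]*><font[^>]*><font[^>]*>(.*?)</font></font></font>}is;
    my $title = $1;
    $title =~ s/<[^>]*>//g;       # drop stray tags such as <li>
    $title =~ s/\s+/ /g;          # join a title that wrapped across lines
    $title =~ s/^\s+|\s+$//g;     # trim both ends
    return $title;
}

# Build a new page: title and h1 from the captured text, then the old body.
sub rebuild_page {
    my ($html) = @_;
    my $title = extract_title($html);
    return unless defined $title;
    my $body = $html;
    # Remove the old title block, including the <p><b> that opens it.
    $body =~ s{(?:<p>\s*<b>\s*)?<font[^>]*><font[^>]*><font[^>]*>.*?</font></font></font>}{}is;
    $body =~ s{</body>\s*</html>\s*$}{}i;     # remove the old closing tags
    $body =~ s/^\s+|\s+$//g;
    return <<"PAGE";
<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
$body
</body>
</html>
PAGE
}

# Usage: perl rebuild.pl file1.html file2.html ...
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };        # slurp the whole file
    close $in;
    my $page = rebuild_page($html);
    unless (defined $page) {
        warn "No title found in $file, skipping\n";
        next;
    }
    open my $out, '>', "$file.new" or die "Cannot write $file.new: $!";
    print {$out} $page;
    close $out or die "Cannot close $file.new: $!";
}
```

Writing to a new file leaves the originals alone, so a bad pattern costs nothing but a rerun. The title is copied as-is, so characters such as & would still need escaping if they turn up.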
If you want you can let us see what you have so far and we might be able to help more.