HELP: parsing unicode web sites



andrewwan1980 at 13:56 on Thursday, July 31, 2008

I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using LWP, HTTP, or the get($url) function. When I use get($url) on a non-Unicode web page, the content comes back as clean ASCII, but on a Unicode page the content returned is always garbled.

Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about the Encode module but don't know how to use it.
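
For reference, here is a minimal sketch of what Encode does: decode() turns raw bytes into Perl characters once the byte encoding is known. The cp936 (GBK) charset is an assumption for illustration, since the thread does not say which charset the site uses, and bytes_to_text is a hypothetical helper name. The sample bytes are built locally so the sketch runs without a download.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw bytes into characters.
# The charset is an assumption; check the Content-Type header or the
# page's <meta> tag to find the real one.
sub bytes_to_text {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

# Build sample bytes locally instead of fetching a page:
# two Chinese characters encoded as cp936 (also known as GBK).
my $raw  = encode('cp936', "\x{7535}\x{5F71}");
my $text = bytes_to_text($raw, 'cp936');

# Encode characters again on output so the terminal receives UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d bytes -> %d characters\n", length($raw), length($text);
```

Regular expressions and length() then operate on characters rather than on individual bytes, which is usually what a parsing script wants.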

I need a Perl script that parses the page above and extracts the image URL from markup matching this pattern:

<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />
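
One way to pull the URL out of markup shaped like the line above is a regular expression with a single capture group. This is only a sketch: extract_movie_image is a hypothetical helper name, and it assumes the attribute order and quoting are exactly as shown in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: return the image URL that follows the
# <div class="movie"> marker, or undef when the pattern is absent.
sub extract_movie_image {
    my ($html) = @_;
    # [^"]+ stops at the closing quote of the src attribute.
    if ($html =~ m{<div class="movie">\s*<img src="([^"]+)" class="mp"\s*/>}) {
        return $1;
    }
    return;
}

my $sample = '<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />';
my $url = extract_movie_image($sample);
print defined $url ? "$url\n" : "no image found\n";
```

A regular expression like this is brittle if the site changes its markup; an HTML parser module would be more robust, at the cost of more code.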

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.

Thank you

Re: HELP: parsing unicode web sites

andrewwan1980 at 10:46 on Monday, August 04, 2008

Thanks to those who helped. Here's my working script:

#!/usr/bin/perl
# tom365crawl2.pl
# http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
# http://perldoc.perl.org/Encode.html
# http://juerd.nl/site.plp/perluniadvice
# http://www.perlmonks.org/?node_id=620068

use warnings;
use strict;

use File::stat;
use Tie::File;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
#use File::Slurp;

use Encode;

my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
my $delim1a = "\<div class=\"movie\"\>\<img src=\"";
my $delim1b = "\" class=\"mp\" \/\>";
my $folder1 = "movie_2004/html/";
my $url1;
my $start1 = 1000; # first page number to fetch
my $end1 = 1000; # last page number; equal to $start1, so only one page is fetched
my $contents1;
my $image1;

my $browser1 = LWP::UserAgent->new();
$browser1->timeout(10);
my $request1;
my $response1;

my $count;
for ($count=$start1; $count<=$end1; $count++) {
    $url1 = $site1 . $folder1 . $count . ".html";
    printf "Downloading %s\n", $url1;

    # Method 1
    #$contents1 = get($url1);

    # Method 2
    $request1 = HTTP::Request->new(GET => $url1);
    $response1 = $browser1->request($request1);
    if ($response1->is_error()) {
        printf "%s\n", $response1->status_line;
        next; # skip this page; there is nothing to parse
    }
    # decoded_content() returns characters rather than raw bytes
    $contents1 = $response1->decoded_content();

    #open(NEWFILE1, ">Debug.txt");
    #(print NEWFILE1 $contents1) or die "Can't write to Debug.txt: $!";
    #close(NEWFILE1);

    #print $contents1;

    # [^"]+ stops at the closing quote of the src attribute
    if ($contents1 =~ /<div class="movie"><img src="([^"]+)" class="mp" \/>/) {
        $image1 = $1;
        printf "Downloading %s\n", $image1;
        # List-form system() skips the shell, so the URL reaches wget verbatim
        system("wget", "-q", "-O", "$count.jpg", $image1) == 0
            or warn "wget failed for $image1\n";

        #if ($image1 =~ /\/([^\/]*)$/m) {
        #    printf "Renaming %s to $count.jpg\n", $1;
        #} else {
        #    printf "Could not rename %s to $count.jpg\n", $image1;
        #}
    } else {
        #open(NEWFILE1, ">count.txt");
        #(print NEWFILE1 "Download failed.\n") or die "Can't write to $image1: $!";
        #close(NEWFILE1);
    }
}
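
The part of the script that deals with the garbling is decoded_content(). The sketch below shows the difference between content() and decoded_content() on a response built locally, so it runs without network access. The GBK charset in the Content-Type header is an assumption for illustration; the thread does not state what the real site sends.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Build a response by hand. Assumption: the body is cp936 (GBK) bytes
# and the header declares that charset.
my $bytes = encode('cp936', "<title>\x{7535}\x{5F71}</title>");
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=gbk' ],
    $bytes,
);

my $raw  = $response->content;          # undecoded bytes as sent by the server
my $text = $response->decoded_content;  # characters, decoded using the declared charset

printf "raw: %d bytes, decoded: %d characters\n", length($raw), length($text);

# Matching works on characters once the content is decoded.
print "title matched\n" if $text =~ /\x{7535}\x{5F71}/;
```

Matching against the decoded text means a pattern containing non-ASCII characters is compared character by character instead of byte by byte.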
