codetoad.com
  HELP: parsing unicode web sites  andrewwan1980 at 13:56 on Thursday, July 31, 2008
 

I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html about using the LWP and HTTP libraries and the get($url) function, but the content I get back is always garbled. When I use get($url) on a non-Unicode web page, the content comes back as perfect ASCII.

Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, but the page I get back is garbled, apparently because of its encoding. I have read about Encode but don't know how to use it.

I need a Perl script that parses the above page and extracts the image URL from this pattern:
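For anyone else stuck on the Encode step, here is a minimal sketch of the decode/encode round trip. It assumes the page bytes are GBK, which is only a guess (the actual charset of the site is not stated in this thread), and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw bytes as a server might send them: two Chinese
# characters (U+4E2D U+6587) encoded as GBK. The charset is an assumption.
my $bytes = "\xD6\xD0\xCE\xC4";

# decode() turns encoded bytes into a Perl character string.
my $text = decode('gbk', $bytes);

# encode() goes the other way; here it produces UTF-8 bytes for output.
my $utf8 = encode('UTF-8', $text);

printf "%d bytes in, %d characters, %d UTF-8 bytes out\n",
    length($bytes), length($text), length($utf8);
```

Pattern matching should be done on the decoded string ($text here), and encoding back to bytes only needs to happen when writing to a file or terminal.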

<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />
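A rough sketch of pulling the URL out of that pattern with a regex follows. The sample string is copied from the line above; the variable names are made up for illustration, and a real page would come from the HTTP response rather than a literal. A proper HTML parser would be more robust if the markup varies.

```perl
use strict;
use warnings;

# Sample markup copied from the pattern above.
my $html = '<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />';

# [^"]+ captures everything up to the closing quote of the src attribute.
# In list context the match returns the captured group, or an empty list on failure.
my ($image_url) = $html =~ m{<div class="movie"><img src="([^"]+)" class="mp" />};

if (defined $image_url) {
    print "Image URL: $image_url\n";
} else {
    print "No match\n";
}
```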

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.

Thank you

  Re: HELP: parsing unicode web sites  andrewwan1980 at 10:46 on Monday, August 04, 2008
 

Thanks to those who helped. Here's my working script:

#!/usr/bin/perl
# tom365crawl2.pl
# http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
# http://perldoc.perl.org/Encode.html
# http://juerd.nl/site.plp/perluniadvice
# http://www.perlmonks.org/?node_id=620068

use warnings;
use strict;

use LWP::Simple;    # get(), used only by the commented-out Method 1
use LWP::UserAgent; # Browser object used by Method 2
use HTTP::Request;

use Encode;         # decoded_content() relies on Encode to convert charsets

my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
# Single quotes avoid the "Unrecognized escape" warnings that \< \> and \/ trigger in double quotes.
my $delim1a = '<div class="movie"><img src="';
my $delim1b = '" class="mp" />';
my $folder1 = "movie_2004/html/";
my $url1;
my $start1 = 1000;
my $end1 = 1000;
my $contents1;
my $image1;

my $browser1 = LWP::UserAgent->new();
$browser1->timeout(10);
my $request1;
my $response1;

for my $count ($start1 .. $end1) {
    $url1 = $site1 . $folder1 . $count . ".html";
    printf "Downloading %s\n", $url1;

    # Method 1
    #$contents1 = get($url1);

    # Method 2: decoded_content() converts the body to Perl characters
    # using the charset declared by the response.
    $request1 = HTTP::Request->new(GET => $url1);
    $response1 = $browser1->request($request1);
    if ($response1->is_error()) {
        printf "%s\n", $response1->status_line;
        next; # Skip pages that failed to download
    }
    $contents1 = $response1->decoded_content();
    next unless defined $contents1; # decoded_content() returns undef if decoding fails

    # Debug: dump the decoded page to a file
    #open(my $debug_fh, '>:encoding(UTF-8)', "/forum/gt_Debug.txt") or die "Can't open Debug.txt: $!";
    #print {$debug_fh} $contents1;
    #close($debug_fh);

    # [^"]+ stops at the first closing quote; a greedy (.*) can run past it.
    if ($contents1 =~ /<div class="movie"><img src="([^"]+)" class="mp" \/>/) {
        $image1 = $1;
        printf "Downloading %s\n", $image1;
        # Save the image through LWP rather than passing an unquoted URL to a wget shell command.
        $response1 = $browser1->get($image1, ':content_file' => "$count.jpg");
        printf "%s\n", $response1->status_line if $response1->is_error();
    } else {
        printf "No image found in %s\n", $url1;
    }
}
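
To show what decoded_content() does in the script above without touching the network, here is a small sketch that builds an HTTP::Response by hand. The GBK charset and the sample bytes are assumptions for illustration only; the real site may declare something different.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response: two Chinese characters encoded as GBK,
# with the charset declared in the Content-Type header.
my $resp = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=gbk' ],
    "\xD6\xD0\xCE\xC4",
);

my $raw     = $resp->content;          # undecoded bytes, as received
my $decoded = $resp->decoded_content;  # characters, decoded per the header charset

printf "content: %d bytes, decoded_content: %d characters\n",
    length($raw), length($decoded);
```

The raw bytes are what look garbled when printed or matched directly; the decoded string is what the regex in the script should run against.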









© Copyright codetoad.com 2001-2008