2004-08-17 07:06 pm (UTC) Here's another off-point comment and question: I had the brilliant idea to download and install a spider so that I could do my surfing off of my own computer and save time (no waiting for files to download while I'm surfing). My hope was to blast down all the "CRFH" comics so that I could read them all in half an hour or less. I tried a program called "Power Siphon", which proved to be very broken.
Can you recommend a more robust package?
Details: Power Siphon (downloadable from http://www.powersiphon.com/spider.asp) reports termination of spidering almost immediately, and then crashes every time. Sometimes it continues spidering after it crashes. When it does spider, it ignores the "Include URL" specification and downloads from URLs that do not contain the specified string. After using it a few times, it has become worse, and won't spider at all, and goes straight to crashing. I downloaded and installed the required MSDAC file, but it lost knowledge of it and started complaining that it needed to be installed. I stopped using the databasing feature, but that didn't seem to help, because by then it was so broken that it was just crashing right away. (I'm running "Netzero" as my dialup ISP, and I'm deleting c:/windows/temp/ files every time it crashes and I have to reboot and run Scandisk. It should be able to recover from this, of course, if that is part of the problem). I tried it on two different websites, http://www.geocities.com/jameswi.geo/index.html and http://www.crfh.net.
Details:
Free version of Power Siphon 1.9.3 build 28
Downloaded MS Data Access Components (MSDAC) ver 2.5
My OS version: Win 98: 4.10 (Build 2222A) DOS: 7.10
Of course I am going to be uninstalling this. I certainly wouldn't risk my cash on an official version after seeing this performance.
If it *was* working, I also wouldn't like the fact that the New Project Wizard doesn't auto-set the "Include URL" to equal exactly the string that was specified for the start URL, but rather it chops any leading "http://www" and any trailing pages/subdirs after the .xxx domain specifier. I also don't like how it was not possible (at least I couldn't easily see a way) to specify that it shouldn't change file names and subdirs, but rather download the subdirectories and files the way they are posted. Of course, if it *was* working, I could live with this. -Jim
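For anyone following along, the complaint about the wizard's "Include URL" can be made concrete with a small sketch. This is not Power Siphon's actual code; `chopped_include` only approximates the behaviour as described above, and both function names are made up for illustration. It compares that chopped string with an include string taken directly from the start URL, using plain substring matching.

```python
from urllib.parse import urlsplit


def include_from_start_url(start_url: str) -> str:
    """Hypothetical helper: keep the start URL up to its last directory."""
    parts = urlsplit(start_url)
    # Drop only the final page name, keep scheme, host and subdirectories.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def chopped_include(start_url: str) -> str:
    """Approximation of the described wizard behaviour (an assumption):
    drop the leading "http://www" and everything after the domain."""
    host = urlsplit(start_url).netloc
    return host[3:] if host.startswith("www") else host


def is_included(url: str, include: str) -> bool:
    """Simple substring test, as an "Include URL" filter might apply it."""
    return include in url


start = "http://www.geocities.com/jameswi.geo/index.html"
exact = include_from_start_url(start)
chopped = chopped_include(start)

# A page on the same host that belongs to somebody else's site.
other = "http://www.geocities.com/someoneelse/page.html"
print(exact, is_included(other, exact))
print(chopped, is_included(other, chopped))
```

On a shared host like Geocities the chopped string matches every site on the host, while the directory-level string matches only the one site, which is presumably why the exact start URL would be the more useful default.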
I've only tried one or two web spiders, and I wasn't terribly impressed with either. That said, there are probably a hundred web spiders on SourceForge. Search for 'web spider' and make sure you click the box that says you want ALL words (it does an 'OR' by default).
If you are just particularly interested in CRFH, then maybe you want the CRFH Archive Project. You can also search for 'web comics' and find hundreds of specialized comic spiders.
I still do it the old-fashioned way. I've been meaning to get around to making myself an internet portal homepage that would have a comic-spider back end. That way, whenever I started up my browser, I could get to a single page with all of my comics with just one click. It hasn't been a high priority though, as I have 3 Megabit DSL right now, so the old-fashioned way is pretty durn fast.
Wed, 18 Aug 2004 17:20:00 EDT (-0400) Even if you weren't terribly impressed with them, would they still meet my minimum requirements of downloading the contents of a website unattended and refraining from crashing (Windows 98) until they've done that? Which would be the best, or least objectionable of the one or two that you tried? It would be nice to have that URL matching feature, if possible. And of course freeware would be nice. -Jim
I've used the HTTrack Website Copier built into Spiderzilla (a website downloader that you can load as a Mozilla plugin) and another called, IIRC, 'black widow' or something similar. Both would meet your criteria.
As for freeware, both are, as is everything on SourceForge.
Sat, 21 Aug 2004 20:31:00 EDT (-0400) I tried one called weblech, which I found at http://sourceforge.net/projects/weblech/, and it was pretty cool. There are some bugs somewhere, and it doesn't download my website (http://www.geocities.com/jameswi.geo) completely before stopping, so it doesn't quite meet my minimum requirements. It's still pretty cool though, and looked like it would have downloaded the whole CRFH site (it was busily downloading the comics when I finally stopped it). I found some interesting content that I wouldn't have found myself. On my site it didn't find and download my http://www.geocities.com/jameswi.geo/FandSF/PAnderson/PAnderson.htm pages, even though they are linked. It might have been a bug with the search level, although I set it to 0 (infinite depth) -- or maybe a problem with case sensitivity. It's an older project, dated June 2002. I didn't get the promised GUI with the download. It didn't stop when I pressed any key, as advertised. Sometimes I had to help it along by pressing Enter at startup, and by manually creating the local directory to which it copies files; once it tried to create this itself but failed to mark it as a folder and then couldn't download any content after that. It has an innovative "interesting URL" field you can set, but I was disappointed with the inherent limitations of setting the include string for desired URLs to be downloaded -- it seems that if I specify the comics' image URL pattern, it won't go to the containing HTML page in the first place, and if I specify the HTML page URL pattern, that won't match the image URL pattern. The "interesting URL" field didn't seem to have much effect; maybe I could have tried putting only my pattern in there and deleting the others.
If I have more time I can try something else, like one of those dedicated comics spiders you mentioned. -Jim
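The include-string limitation described above (the image pattern keeps the spider off the HTML pages, and the HTML pattern doesn't match the images) goes away if a spider keeps two separate filters: one for pages to follow and one for files to save. The sketch below is not how weblech works; it is a minimal illustration over an in-memory link table, and the `comics.example` and `ads.example` URLs, the `crawl` function and both patterns are all made up. Matching is case-insensitive, on the guess that case might matter for file extensions.

```python
import re
from collections import deque


def crawl(start, links, follow, save):
    """Breadth-first walk over `links` (dict: page URL -> linked URLs).

    `follow` picks the pages that are scanned for further links;
    `save` picks the URLs that are kept. The two are independent.
    """
    seen = {start}
    queue = deque([start])
    saved = []
    while queue:
        url = queue.popleft()
        if save.search(url):
            saved.append(url)
        if not follow.search(url):
            continue  # not a page we scan for links
        for target in links.get(url, []):
            # Only queue URLs that at least one filter cares about.
            wanted = follow.search(target) or save.search(target)
            if wanted and target not in seen:
                seen.add(target)
                queue.append(target)
    return saved


# Hypothetical site layout: archive pages link to images and to each other.
links = {
    "http://comics.example/archive/001.html": [
        "http://comics.example/images/001.gif",
        "http://comics.example/archive/002.html",
        "http://ads.example/banner.gif",
    ],
    "http://comics.example/archive/002.html": [
        "http://comics.example/images/002.GIF",
        "http://comics.example/archive/001.html",
    ],
}
start = "http://comics.example/archive/001.html"
pages = re.compile(r"comics\.example/archive/.*\.html$", re.IGNORECASE)
images = re.compile(r"comics\.example/images/.*\.gif$", re.IGNORECASE)

saved = crawl(start, links, follow=pages, save=images)
single_filter = crawl(start, links, follow=images, save=images)
print(saved)
print(single_filter)
```

With separate filters both comic images are collected and the banner is skipped. Using the image pattern for both roles reproduces the problem from the post: the start page never qualifies for link scanning, so nothing is saved.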
Wed, 18 Aug 2004 19:19:00 (-0400) I just finished reading through the entire CRFH series (http://www.crfh.net). Amusing. I liked the one where Mike read the cough medicine label with the list of horrendous side-effects, chugged the whole thing down, and tossed the bottle away, saying "Minty." Also, poor April needs a boyfriend, a life, and some superpowers. At least Margaret has some training and a would-be. So what is there to look at on the CRFH Archive project? -Jim
Thu, 19 Aug 2004 10:19:00 EDT (-0400) Thanks for the info. Wow, my memory certainly is playing tricks on me. I could have sworn that CRFH was one of the webcomics that you listed, but when I checked back at Feb 27 at http://www.livejournal.com/users/swestrup/135164.html sure enough, it's not included. Maybe it's time for a dose of Geritol (However, those who remember *that* product, are probably simultaneously old enough for it, but not in need of it!). -Jim
2004-08-23 08:54 am (UTC) I can recommend CRFH. It grows on you. The graphic artwork improves drastically from its beginning. -Jim