Spider Crawl Problem...

Moderator: pgolovko

Postby UncleTimmy » 10/26/2007 1:51 pm

Hello,

I added a few sites to my directory and then tried to crawl them. However, the crawler does not go any further than the first page. Here is what it returns:

Crawling: http://www.edirectories.info
--------------------------------------------------------------------------------
URL: http://www.edirectories.info/
Status: HTTP/1.0 200 OK
Referer-URL:
Content not received
--------------------------------------------------------------------------------

Report:
Links followed: 1
Files received: 0
Bytes received: 419



--------------------------------------------------------------------------------
Finished!


Mind you, this is a web directory of mine, and there are obviously a lot more pages it could be crawling. I have set the FOLLOW MODE to 0, and the Receive Content Type is: /text\/\html/ (all the pages are PHP, but I don't see how that would matter).

What's the deal? I am almost positive all my options are set correctly. This is a great feature and I would like to roll it out across my web directories, but it doesn't seem to be working. ANY help would be greatly appreciated!
UncleTimmy
 
Posts: 6
Joined: 10/26/2007 1:45 pm


Postby bihlink » 10/26/2007 11:13 pm

Hmm... Go to Admin>Config>Spider and check all your settings!

Follow mode 0:
The crawler will follow EVERY link, even if the link leads to a different host or domain. If you choose this mode, you really should set a limit on the crawling process; otherwise the crawler may end up crawling the whole WWW!
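
To make the warning about follow mode 0 concrete, here is a minimal Python sketch of a crawl loop with and without a page limit. It is not the directory script's spider: the in-memory `LINKS` table, the `crawl` function and the `page_limit` parameter are all made up for illustration, and treating every non-zero mode as "stay on the starting host" is a simplification of mine, not the tool's documented behaviour.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> links found on that page.
LINKS = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(start, follow_mode=0, page_limit=None):
    """Breadth-first crawl over LINKS, returning pages in visit order.

    follow_mode 0 follows every link, whatever host it points to.
    Any other value is treated as "same host only" (an assumption).
    """
    start_host = urlparse(start).hostname
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        # Stop once the configured limit is reached.
        if page_limit is not None and len(visited) >= page_limit:
            break
        url = queue.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if follow_mode != 0 and urlparse(link).hostname != start_host:
                continue  # link leaves the starting host
            seen.add(link)
            queue.append(link)
    return visited

print(crawl("http://a.example/"))                 # wanders off to b.example and c.example
print(crawl("http://a.example/", page_limit=2))   # stops after two pages
print(crawl("http://a.example/", follow_mode=2))  # stays on a.example
```

With mode 0 and no limit, the only thing that ends the crawl is running out of links, which on the real web effectively never happens.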


I use follow mode 2.
Also check "content size limit" and "connection timeout".

I guess that if it starts crawling, the crawler itself is working fine, so everything else should be down to configuration. Check it twice.
bihlink
 
Posts: 67
Joined: 05/08/2007 1:53 am

Postby UncleTimmy » 10/29/2007 2:11 pm

Thanks for the reply. I've double-checked Content Size and Timeout and raised them as high as they go, but it still does the same thing. As I showed, it returns status code 200, which means it can tell the site exists. I have checked the configuration over and over again, but the problem remains.

Is anyone else having a similar problem?
UncleTimmy
 
Posts: 6
Joined: 10/26/2007 1:45 pm

Postby Chipman » 01/13/2008 1:55 pm

I have this with my new install:


Crawling: http://www.chipman-online.de
URL: http://www.chipman-online.de/
Status: HTTP/1.1 200 OK
Referer-URL:
Content not received

Report:
Links followed: 1
Files received: 0
Bytes received: 145
Finished!


Please help me.

Regards,
Chipman
Chipman
 
Posts: 26
Joined: 04/09/2007 11:42 am

Postby bihlink » 01/13/2008 10:15 pm

It looks like you made some changes to spider.php. If you didn't, try re-uploading the original spider.php and crawl again.
bihlink
 
Posts: 67
Joined: 05/08/2007 1:53 am

Postby djhunlimited » 01/25/2008 12:25 pm

I had this same problem. Make sure you have the correct PHP modules installed. If you're using PHP 5, that might be the problem. I don't remember exactly, but it was one of those two things that caused this problem for me. In case you need to know which PHP modules are required, here is what I have installed, and my crawler works just fine.

PHP Version 4.4.7 With Mysql 4.1.21
System    Linux sever.com 2.6.9-023stab043.1-enterprise #1 SMP Mon Mar 5 16:58:09 MSK 2007 i686
Build Date    Oct 26 2007 01:56:00
Configure Command    './configure' '--enable-bcmath' '--enable-calendar' '--enable-dbase' '--enable-dbx' '--enable-exif' '--enable-force-cgi-redirect' '--enable-ftp' '--enable-gd-native-ttf' '--enable-libxml' '--enable-sockets' '--prefix=/usr' '--with-bz2' '--with-curl=/opt/curlssl/' '--with-dom=/opt/xml2/' '--with-dom-exslt=/opt/xslt/' '--with-dom-xslt=/opt/xslt/' '--with-expat-dir=/usr' '--with-freetype-dir=/usr' '--with-gd' '--with-gettext' '--with-imap=/opt/php_with_imap_client/' '--with-imap-ssl=/usr' '--with-jpeg-dir=/usr' '--with-kerberos' '--with-libxml-dir=/opt/xml2/' '--with-mysql=/usr' '--with-mysql-sock=/var/lib/mysql/mysql.sock' '--with-openssl=/usr' '--with-openssl-dir=/usr' '--with-png-dir=/usr' '--with-ttf' '--with-xmlrpc' '--with-xpm-dir=/usr/X11R6' '--with-zlib' '--with-zlib-dir=/usr' '--without-pear'
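
For anyone comparing their own build against the one above, a small Python sketch like the following can list which flags appear in a configure command copied from a phpinfo() page. The `configure_flags` helper is hypothetical, the sample string is a shortened excerpt of the command above, and which flags the crawler actually depends on is an assumption: the thread only hints at sockets support.

```python
import shlex

def configure_flags(command):
    """Split a quoted configure command into a set of flag names.

    Values after "=" are dropped, so '--with-curl=/opt/curlssl/'
    becomes '--with-curl'.
    """
    flags = set()
    for token in shlex.split(command):
        if token.startswith("--"):
            flags.add(token.split("=", 1)[0])
    return flags

# Shortened excerpt of the configure command quoted above.
sample = (
    "'./configure' '--enable-ftp' '--enable-sockets' "
    "'--with-curl=/opt/curlssl/' '--without-pear'"
)

flags = configure_flags(sample)
# Flags to look for are a guess, not a documented requirement list.
for wanted in ("--enable-sockets", "--with-curl"):
    print(wanted, "present" if wanted in flags else "missing")
```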
djhunlimited
 
Posts: 6
Joined: 07/13/2007 11:35 am

Postby Chipman » 01/28/2008 11:03 am

bihlink wrote: It looks like you made some changes to spider.php. If you didn't, try re-uploading the original spider.php and crawl again.

Hi,

This is the original file.

Regards,
Chipman
Chipman
 
Posts: 26
Joined: 04/09/2007 11:42 am

Postby Chipman » 04/17/2008 11:06 am

Hi,

I have the same problem!

Server Version: 4.1.22
MySQL-Client-Version: 5.0.51a
PHP Version 4.4.8
Sockets Support enabled

Please help me!

Chipman
Chipman
 
Posts: 26
Joined: 04/09/2007 11:42 am

Re: Spider Crawl Problem...

Postby bigmarko » 06/11/2008 8:01 pm

UncleTimmy wrote: [...] I have set the FOLLOW MODE to 0 and the Receive Content Type is: /text\/\html/ [...]


After searching all over this forum, I just decided to delete the "/text\/\html/" from the Receive Content Type text box (I left it blank). It worked! So far it looks like the filter isn't needed. I'll follow up on this.
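
One possible explanation, which nobody in the thread confirms: PCRE releases that support `\h` treat it as "one horizontal whitespace character" rather than a literal h, so on such a server `/text\/\html/` would never match a `text/html` header and every page would be filtered out. The Python sketch below illustrates how a content-type filter of this kind could behave. The `should_receive` function is hypothetical and not taken from spider.php, and because current Python versions reject `\h` in a pattern, the PCRE reading is spelled out by hand as `[ \t]`.

```python
import re

def should_receive(content_type, patterns):
    """Return True if a page with this Content-Type should be downloaded.

    An empty pattern list means "no filter": everything is received.
    """
    if not patterns:
        return True
    return any(re.search(p, content_type) for p in patterns)

header = "text/html; charset=utf-8"

# What the setting was presumably meant to express: literal "text/html".
intended = r"text/html"

# Hand-written stand-in for /text\/\html/ when "\h" is read as
# horizontal whitespace: "text/", one space or tab, then "tml".
pcre_style_reading = r"text/[ \t]tml"

print(should_receive(header, [intended]))            # True
print(should_receive(header, [pcre_style_reading]))  # False
print(should_receive(header, []))                    # True
```

If this is indeed the cause, changing the setting to `/text\/html/` (without the backslash before the h) should work as well as leaving the box blank, while still skipping non-HTML files.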
bigmarko
 
Posts: 5
Joined: 06/11/2008 3:09 pm

Postby xprt007 » 06/16/2008 2:49 am

Hi bigmarko,

If I got you right, you found a way of getting the spidering to work. How exactly did you do it, and which file did you edit?

I am quite sure this feature used to work, including in my case, but the problem seems to have started when my host's server was updated to a higher version of MySQL (5.xx). I've noticed more people reporting similar problems, but since Pavel left without a trace, it has not been addressed yet.

So if you found a solution, please share exactly which files you edited and tell us how it's going.

Thanks in advance.
xprt007
 
Posts: 92
Joined: 02/12/2007 4:39 am

Re: Spider Crawl Problem...

Postby robb58 » 08/01/2008 12:37 am

UncleTimmy wrote: After searching all over this forum, I just decided to delete the "/text\/\html/" from the Receive Content Type text box (left it blank). It worked (like it's not needed, so far)! I'll follow up on this.


I've just tried this and it does seem to have fixed the spidering problem. Well done Uncle Timmy!! You'll find the offending "/text\/\html/" halfway down the admin/config/spider page.
robb58
 
Posts: 10
Joined: 05/04/2007 4:09 pm

Postby bigmarko » 08/02/2008 9:55 pm

xprt007 wrote: If I got you right, you found a way of getting the spidering to work. How exactly did you do it, and which file did you edit?

I am quite sure this feature used to work, including in my case, but the problem seems to have started when my host's server was updated to a higher version of MySQL (5.xx). I've noticed more people reporting similar problems, but since Pavel left without a trace, it has not been addressed yet.

So if you found a solution, please share exactly which files you edited and tell us how it's going.

Thanks in advance.


xprt007, I had no real solution; I just deleted the /text\/\html/ from the Receive Content Type text box on the admin/config/spider page.

robb58 wrote: I've just tried this and it does seem to have fixed the spidering problem. Well done Uncle Timmy!! You'll find the offending "/text\/\html/" halfway down the admin/config/spider page.


robb58, actually it was me who wrote the post you're referring to!
bigmarko
 
Posts: 5
Joined: 06/11/2008 3:09 pm


