IRobotSoft Robot
08/07/20 at 17:28:19
News: Welcome to the IRobotSoft Visual Web Scraping and Web Automation forum.
CrawlWebsites() using https... (Read 6476 times)
BrentH
Junior Member
**




Posts: 16
CrawlWebsites() using https...
09/08/16 at 10:15:39
 
I need to crawl both http and https websites. Whether the calls run in parallel doesn't really matter, but no combination of logic returns data for https websites.
 
1) When calling CrawlWebsite() in parallel, no data is returned for http or https websites. The log shows that each tuple/websiteURL is being passed to CrawlWebsite() correctly.
 
2) When using callTask (instead of Parallel), http sites return data while https sites do not.
 
My setup/logic:
https.dll in system folder.
loadData with repeat for each tuple/url in file.
For each tuple callTask/callParallel -> CrawlWebsite(...saveDataTask....)
 
a) Is something set up wrong?
b) Does CrawlWebsite() support https?
c) Does CrawlWebsite() support being called in Parallel?
 
Thanks
« Last Edit: 09/08/16 at 12:59:25 by BrentH »  

Win7, 64bit, latest version of irobot (visual)
IRobotSoft Administrator
IRobotSoft Administrator
*****


IRobotSoft, the Best
Internet Robot
System

Posts: 1608
Gender: male
Re: CrawlWebsites() using https...
Reply #1 - 09/14/16 at 16:53:46
 
See http://irobotsoft.com/download.htm
- HTTPS multithread support: download https.zip, extract https.dll, and put it in the IROBOT\system directory.
 
Also, you may need to extract the files from http://irobotsoft.com/python27.zip and put them in the IROBOT root directory.
 

The Administrator.
BrentH
Re: CrawlWebsites() using https...
Reply #2 - 09/19/16 at 17:51:02
 
I already tried /system/https.dll...as I mentioned in my initial post.
 
I added python27.dll to root folder (per your instructions).
Now when I start irobot.exe it does not open. If I try starting it again, it says 'another instance is running', yet Task Manager does not show irobot running.
I removed python27.dll and irobot works again. That did not work!
 
Further testing has shown that calling any https url in parallel does not return the expected results, so I assume this is why the crawlWebsite function is failing.
 
The logs show no errors and all urls are being processed.
For frame I am using -1.
My testOutput.txt file includes data for <title>:tx (variable 'title'), TargetUrl, box_url.
 
Retrieving pages via callTask processes all urls with expected data returned.
 
Retrieving pages via callParallel processes all urls with expected data returned EXCEPT https pages do not return <title>:tx data.  I looked at the source code for test pages and they indeed all have <title> tags.
Also, when callParallel is used the url in the output file is appended with a question mark. Example: https://twitter.com? The question mark is not present when using callTask.
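
As a stopgap for the stray question mark, the output file can be cleaned up after the run. The following is a minimal Python sketch that runs outside IRobotSoft; `strip_empty_query` is a hypothetical helper name, not part of the product, and it assumes the URLs have already been read from the output file as plain strings.

```python
from urllib.parse import urlsplit


def strip_empty_query(url):
    """Drop a trailing '?' that carries no query string.

    URLs with a real query or a fragment are returned unchanged.
    """
    if url.endswith("?"):
        parts = urlsplit(url)
        # Only strip when the '?' introduces an empty query and nothing follows it.
        if parts.query == "" and parts.fragment == "":
            return url[:-1]
    return url


if __name__ == "__main__":
    for raw in ["https://twitter.com?", "https://example.com/?q=1"]:
        print(strip_empty_query(raw))
```

This only tidies the output; it does not explain why callParallel appends the character in the first place.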
 
Does https.dll indeed work or am I missing something?
 
Related question...
 
The embedded browser will resolve/redirect to the proper url; for example, facebook.com will redirect to https://www.facebook.com/
When called in parallel (no browser, just raw source code without javascript executed), do urls need to be passed in exactly, or will they resolve like the facebook example above?
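
One way to sidestep the question is to resolve each url to its final address before handing the list to the robot. The following is a minimal Python sketch run outside IRobotSoft; whether callParallel follows redirects on its own is not confirmed anywhere in this thread. The names `add_scheme` and `resolve` are illustrative, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.request


def add_scheme(url, default="http"):
    """Prefix a bare host such as 'facebook.com' with a scheme."""
    return url if "://" in url else "%s://%s" % (default, url)


def resolve(url, opener=urllib.request.urlopen, timeout=10):
    """Return the address reached after following any redirects.

    urllib follows HTTP redirects by default, so geturl() reports
    the final location rather than the one that was requested.
    """
    with opener(add_scheme(url), timeout=timeout) as response:
        return response.geturl()


if __name__ == "__main__":
    # Requires network access; the result depends on the site at the time of the call.
    print(resolve("facebook.com"))
```

The resolved urls could then be written back to the input file that loadData reads, so the robot only ever sees exact addresses.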
 
box_url was mentioned in this post as a way to test output: http://irobotsoft.org/bb/YaBB.pl?num=1394930019
Is this an internal variable? I included it in my output...it never returns a value.
 
Thanks
 

BrentH
Re: CrawlWebsites() using https...
Reply #3 - 09/20/16 at 09:14:54
 
I've done more testing when using callParallel...
 
TargetPage variable output...
throws an error on https urls: 'IRobot exe file has stopped working'.
does NOT throw errors on http urls.
 
 
SourcePage variable output...
returns no data for https urls.
does return data for http urls.
 

IRobotSoft Administrator
Re: CrawlWebsites() using https...
Reply #4 - 09/20/16 at 20:54:05
 
You can try to update your irobot.exe from http://irobotsoft.com/irobot/irobot.zip  
 
Also, try to copy the python27.dll to C:\Windows\System32 directory.  
 
Then test the https function via the menu Design -> Test Scripting, and enter the following in the interface:
https('https://google.com/')
Then press Run. If it shows the Google page, it works.
 

BrentH
Re: CrawlWebsites() using https...
Reply #5 - 09/21/16 at 11:01:46
 
I replaced irobot.exe with one from here: http://irobotsoft.com/irobot/irobot.zip  
I added python27.dll to C:\Windows\System32 directory
 
I tested using Design -> Test Scripting using: https('https://google.com/')  
 
Returns: Error code: -2!
 
What's next? ;)
 

BrentH
Re: CrawlWebsites() using https...
Reply #6 - 10/05/16 at 14:08:16
 
It has been three weeks since I last posted.
The new irobot.exe file and python27.dll did not work.
 
Are there plans to fix the software?
 
Thanks
 

IRobotSoft Administrator
Re: CrawlWebsites() using https...
Reply #7 - 10/06/16 at 10:25:10
 
What are your Windows and IE versions?  Can you test it on some other machines to see if it works?
 
Currently it is difficult to pinpoint the issue because it is related to the OS environment. We will try to add some logging so that we can identify such issues more easily later.
 
 

BrentH
Re: CrawlWebsites() using https...
Reply #8 - 10/07/16 at 15:53:20
 
I am running Windows 7, 64 bit; IE 11.
 
I also ran a test on a virtual machine (Windows XP with IE 8), with python27.dll in either the irobot or the windows/system32 folder and https.dll in the irobot/system folder.
 
I still get Error code: -2! when testing https('https://google.com/').
 
Unfortunately I do not have access to any other test environments.
 
What is the ideal environment to be running the software?
 
Has anybody else had this issue?
 
Thanks
 
 

IRobotSoft Administrator
Re: CrawlWebsites() using https...
Reply #9 - 10/08/16 at 08:39:03
 
It looks like a Python issue.  Please remove python27.dll from your irobot directory, then install the 32-bit version of Python 2.7 from
https://www.python.org/downloads/release/python-2712/  (Windows x86 MSI installer)
 
Please let us know if this works.  
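
Since the instruction above names the 32-bit build specifically, it may be worth confirming which build actually ended up installed. The following is a minimal sketch, written to run under Python 2.7 as well as Python 3; `interpreter_bits` is an illustrative name, and the assumption that IRobotSoft needs the 32-bit build is inferred from the installer named above rather than stated outright.

```python
import struct
import sys


def interpreter_bits():
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # A 32-bit Python 2.7 install should report version 2.7 and 32 bits here.
    print("Python %d.%d, %d-bit" % (sys.version_info[0],
                                    sys.version_info[1],
                                    interpreter_bits()))
```

Run it with the newly installed interpreter (for example from its install directory) so that the result describes that copy and not another Python on the PATH.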
 

BrentH
Re: CrawlWebsites() using https...
Reply #10 - 10/10/16 at 21:21:49
 
Ok that worked!
 
callParallel now works with https sites after installing the 'Windows x86 MSI installer'. Thanks for the fix!
 
However, I retested the crawlWebsites() function for crawling https sites and found that it does not work.  
crawlWebsites() only seems to work with http sites.
 
Is there a fix for that?
 
Thanks again
 
 

IRobotSoft Administrator
Re: CrawlWebsites() using https...
Reply #11 - 10/11/16 at 09:40:05
 
Are you sure it is not a typo? Note that the function is crawlWebsite(), with no s.
 

BrentH
Re: CrawlWebsites() using https...
Reply #12 - 10/11/16 at 21:19:10
 
I just mistyped in the blog ;)  I am definitely using crawlWebsite() correctly.
 
I am testing with a mix of http and https urls...
 
http sites are crawled and data is returned without issues.
https sites show "0 of 0 bytes..." for each https site in the download popup.
 
The log shows each https site as a tuple...but no data is returned for them.  It seems that they are being skipped.
 
A bug?
 
Thanks