IRobotSoft Robot
05/27/17 at 19:07:12
Wikipedia History scraper (Read 563 times)
Jean-Pierre
IRobotSoft Newbies
*


I Love IRobotSoft

Posts: 6
Gender: male
Wikipedia History scraper
07/08/16 at 14:29:55
 
I am trying to extract information from an unordered list (UL) structure on a special type of Wikipedia page called a History page.  
An example URL is https://en.wikipedia.org/w/index.php?title=English_language&action=history  
and my HTQL so far is as follows:  
 
Code:
<UL (ID='pagehistory')>1.<LI>1-0{
CurrLink=<SPAN (CLASS='mw-history-histlinks')>1.<A (tx='cur')>1:href;
PrevLink=<SPAN (CLASS='mw-history-histlinks')>1.<A (tx='prev')>1:href;
DiffID=<INPUT (Name='diff')>1:value;
EditDTS=<A (CLASS='mw-changeslist-date')>1:tx;
UserName=<SPAN (CLASS='history-user')>1.<A (CLASS='mw-userlink')>1:tx;
UserText=<SPAN (CLASS='comment')>1:tx;
} 


 
I am running into a few problems. They are (in order of importance):
 
#1. Can't save the data to CSV file.
#2. Can't seem to get all the desired data.
#3. Can't get the built-in "Tuple" variable to show in data results.
#4. Don't know how to handle a special case for the first item in list.
#5. Unicode characters corrupted.
 
DETAILS:  

 

Problem #1. Overall the biggest problem is that I can't seem to get it to save the data as a CSV file and so I am working only with the IRobotSoft interface to see the data. This may mean some of the rest of the problems (like #3 and #5) are not actually real but I cannot tell since I cannot view the data outside the program. I carefully followed the manual's instructions for saving variables except I chose CSV instead of XML. I see an event (triggered "after each tuple") exists but no file is ever created.
 

Problem #2. For this example I am using the "21 June 2016" entry by user "Yobot". The desired output for that tuple should be:
 
CurrLink = /w/index.php?title=English_language&diff=727784939&oldid=726279199
PrevLink = /w/index.php?title=English_language&diff=726279199&oldid=726169770
DiffID   = 726279199
EditDTS  = 05:38, 21 June 2016
UserName = Yobot
UserText = (WP:CHECKWIKI error fixes using AWB (12030))

 
  • (A) The DiffID, EditDTS, and UserName fields seem to be working fine.  
  • (B) The CurrLink and PrevLink fields seem to be truncated at "&diff=$". Is this a string length limit imposed by the software?
  • (C) The UserText field seems to break/stop when it runs across certain html tags.  

Note that the last field (UserText) is actually a SPAN with a relatively small amount of highly variable data from which I only desire the final visible text (equivalent to what shows on the actual webpage but with no font formatting). I do not need the HTML tags or links (but I am okay if they are there) but I need all the rest of the text. This SPAN may contain any kind of text (including Unicode) and zero or more anchor tags mixed into the text.
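
One way to check for the truncation described in (B) outside the program is to test whether each saved link still carries both its diff and oldid parameters. The following is a minimal Python sketch, not part of IRobotSoft; the function names are made up for illustration, and the two links are the desired values listed above.

```python
from urllib.parse import urlsplit, parse_qs

def diff_ids(link):
    """Return (diff, oldid) from a history link; a missing part comes back as None."""
    params = parse_qs(urlsplit(link).query)
    return (params.get("diff", [None])[0], params.get("oldid", [None])[0])

def is_complete(link):
    """A link counts as complete only if both revision IDs survived."""
    return all(part is not None for part in diff_ids(link))

curr_link = "/w/index.php?title=English_language&diff=727784939&oldid=726279199"
prev_link = "/w/index.php?title=English_language&diff=726279199&oldid=726169770"
truncated = "/w/index.php?title=English_language&diff="

print(diff_ids(curr_link), is_complete(curr_link))
print(diff_ids(prev_link), is_complete(prev_link))
print(diff_ids(truncated), is_complete(truncated))
```

In this example the oldid of CurrLink and the diff of PrevLink both equal the DiffID value, which gives a second consistency check for each row.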

 

Problem #3. I added Tuple as the first variable in the Save Variable page because I want an index of the tuples so I can recover the original list order. It doesn't show in the test runs.
 

Problem #4. In the very first <LI> there is no anchor tag for CurrLink, just the plain text "cur". This seems to break the extract and **ALL** fields for the first entry contain (null). Do I need to run a special action just for the first <LI> tuple?
 

Problem #5. If I leave UserText as Original Contents then any Unicode is suppressed. If I strip out tags then Unicode is shown as "garbage" characters. The entries with a right-arrow at the start are easy to see examples. The right-arrow is character U+2192.  
 

I will keep working on finding the solutions on my own but any help is appreciated.
* Jean-Pierre
 
 
Jean-Pierre
Re: Wikipedia History scraper
Reply #1 - 07/08/16 at 16:20:13
 
Okay. I solved 4 out of 5 of the problems as follows:
 
Problem 1 was my error, the file was going to a different directory than I thought it was.  
 
Problem 2 was (as feared) just a problem with the IRobotSoft interface. The full data fields were being written into the CSV file all along.
 
Problem 3, same as Problem 2.
 
Problem 4 appears to need a special solution, perhaps a special data extract action and then modify the next action to start on tuple #2. I'll let you know how that works out.
 
Problem 5 was a combination of Problem 2 and some quirks with Excel. At first I tried to just open the CSV file by double clicking, but Excel made some bad assumptions about the data. So instead I opened a blank Excel sheet and then after that I adjusted the data parsing wizard to import using the 65001: Unicode (UTF-8) codepage.
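
The "garbage" characters are consistent with UTF-8 bytes being read under a single-byte codepage. The Python sketch below reproduces that effect for the right-arrow U+2192, assuming the wrong codepage was Windows-1252; the thread does not say which codepage Excel actually picked, so that part is an assumption.

```python
arrow = "\u2192"                          # the right-arrow mentioned in Problem #5

utf8_bytes = arrow.encode("utf-8")        # three bytes: E2 86 92
garbled = utf8_bytes.decode("cp1252")     # misread as Windows-1252 (assumed codepage)

# Reversing the wrong decoding recovers the original character.
restored = garbled.encode("cp1252").decode("utf-8")

# ascii() keeps the output printable on any console.
print(utf8_bytes.hex(), ascii(garbled), ascii(restored))
```

Importing with the UTF-8 codepage, as described above, avoids the wrong decoding in the first place.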
 
BTW, I told IRobotSoft that I wanted the separator to be tabs instead of commas because the data contains commas sometimes. I also used no quote delimiters for the same reason.
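
For anyone reading the same export without Excel, here is a minimal Python sketch of parsing a tab-separated, unquoted, UTF-8 file. The header row, the sample rows, and the name ExampleUser are assumptions for illustration; the thread does not show the actual file layout.

```python
import csv
import io

def read_history_rows(stream):
    """Parse tab-separated rows, treating quote characters as ordinary text."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

# Stand-in for the exported file. A real file would be opened with
# open(path, encoding="utf-8", newline="") and passed in the same way.
sample = (
    "Tuple\tUserName\tUserText\n"
    "1\tYobot\t(WP:CHECKWIKI error fixes using AWB (12030))\n"
    '2\tExampleUser\t\u2192 fixed "typo", again\n'   # hypothetical row
)

rows = read_history_rows(io.StringIO(sample))
for row in rows:
    print(row["Tuple"], row["UserName"], ascii(row["UserText"]))
```

With tabs as the separator and quoting switched off, the commas and quote marks inside UserText come through unchanged.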
 

Finally, I used a solution I found at http://stackoverflow.com/questions/5327512/convert-html-to-plain-text-in-vba to clean up the UserText field...
 
First, I pasted the following Macro in Excel VBA:
Code:
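' Returns html with everything between "<" and ">" removed.
' HTML entities such as &amp; are left untouched.
Function StripTags(ByVal html As String) As String
    Dim text As String
    Dim accumulating As Boolean   ' True while outside a tag
    Dim n As Long                 ' Long, not Integer: a 32,767-character cell would overflow an Integer counter
    Dim c As String

    text = ""
    accumulating = True

    n = 1
    Do While n <= Len(html)

        c = Mid(html, n, 1)
        If c = "<" Then
            ' Start of a tag: stop collecting characters
            accumulating = False
        ElseIf c = ">" Then
            ' End of a tag: resume collecting characters
            accumulating = True
        Else
            If accumulating Then
                text = text & c
            End If
        End If

        n = n + 1
    Loop

    StripTags = text
End Function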


 
UserText was in Column G so I added a new field called "EditSummary" in Column H and pasted =StripTags(G2) in cell H2 and copied that down to the bottom of the data. Worked like a charm.
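
The same cleanup can also be done outside Excel. Below is a minimal Python sketch that uses the standard library's HTML parser instead of the character loop above, so it is a different technique rather than a port of the macro. The sample markup is invented for illustration and is not copied from the Wikipedia page.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text between tags, decoding entities such as &amp;."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.parts)

# Hypothetical edit summary with one link and one entity.
sample = '(<a href="/wiki/WP:CHECKWIKI" title="WP:CHECKWIKI">WP:CHECKWIKI</a> error fixes &amp; cleanup)'
print(strip_tags(sample))
```

Unlike the macro, this version turns entities into their characters, so &amp; becomes a plain ampersand.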
 
Liking the program more! Still have some things I need to try later. * Jean-Pierre  
 
 
 
 
IRobotSoft Administrator
IRobotSoft Administrator
*****


IRobotSoft, the Best
Internet Robot
System

Posts: 1597
Gender: male
Re: Wikipedia History scraper
Reply #2 - 07/09/16 at 06:10:28
 
To convert HTML to plain text, use the Name Variable interface where you define your variables. After the variable name there is a Transformation column, where you can choose "Exclude all tags and line breaks".  
 
It can also be done in the HTQL query directly, by adding &tx to the column query.
 
 

The Administrator.
IRobotSoft Administrator
Re: Wikipedia History scraper
Reply #3 - 07/09/16 at 06:15:27
 
I don't quite understand your problem with the first cur link. There is no href link in the HTML anyway. You can add an "After each tuple" event to the Table action with the condition "CurrLink is null" and assign it the value of TargetUrl.  
 
 

Jean-Pierre
Re: Wikipedia History scraper
Reply #4 - 07/09/16 at 20:06:35
 
Quote from IRobotSoft Administrator on 07/09/16 at 06:10:28:
It can also be done in the HTQL query directly, by adding &tx to the column query.

 
So Code:
UserText=<SPAN (CLASS='comment')>1:tx; 

and Code:
UserText=<SPAN (CLASS='comment')>1:&tx; 

are not the same thing?  
 
Or does the &tx go somewhere else in the query?
 
 
Jean-Pierre
Re: Wikipedia History scraper
Reply #5 - 07/09/16 at 20:41:51
 
Quote from IRobotSoft Administrator on 07/09/16 at 06:15:27:
I don't quite understand your problem with the first cur link. There is no href link in the HTML anyway. You can add an "After each tuple" event to the Table action with the condition "CurrLink is null" and assign it the value of TargetUrl.

 
The "prev" and "cur" links send the user to a new page showing an old/new content diff between two versions of the article page.  
Prev shows the diff between the selected line (version of the page) and the previous version of the page (the next line in the list).  
Cur shows the difference between the selected line and the current version of the page.  
On History Pages the first item in the list IS the current version of the page, thus it is moot to have a "cur" link on that line.
 
Wikipedia decided to keep the text "(cur | prev)" even though there is no HREF link for cur. Actually the Wikimedia software just suppresses the anchor tag during the page generation and leaves the text content the same.  
 
First item in list:
Code:
<span class="mw-history-histlinks">
(cur |
<a title="" href="/w/index.php?title=English_language&diff=727784939&oldid=727742573">prev</a>
)
</span> 


 
Second item in list:
Code:
<span class="mw-history-histlinks">
(
<a title="English language" href="/w/index.php?title=English_language&diff=727784939&oldid=727742573">cur</a>
 |
<a title="English language" href="/w/index.php?title=English_language&diff=727742573&oldid=727639069">prev</a>
)
</span> 


 
This means that the CurrLink field should be null/empty/blank for the first item. That much is correct, but IRobotSoft apparently skips the entire list item when it cannot find an anchor tag for that one field. All other fields are then showing as "(null)" according to the in-program test run interface and the resulting CSV file starts on the second list item with a Tuple variable value of "1" even though it is technically the 2nd tuple processed.
 
The desired output would be that the first list item has all fields filled exactly like the other list items, but with the CurrLink field left as just a blank (x'20').
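
To make the desired behaviour concrete, here is a minimal Python sketch that extracts both links from one histlinks span and fills in a blank when the "cur" anchor is missing, instead of dropping the row. It is an illustration outside IRobotSoft; the class and function names are made up, and the two spans are the ones quoted above.

```python
from html.parser import HTMLParser

class HistLinks(HTMLParser):
    """Collect href values keyed by the visible text of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None

def extract_links(span_html, blank=" "):
    """Return both link fields; a missing anchor yields `blank`, not a lost row."""
    parser = HistLinks()
    parser.feed(span_html)
    parser.close()
    return {
        "CurrLink": parser.links.get("cur", blank),
        "PrevLink": parser.links.get("prev", blank),
    }

first_item = '''<span class="mw-history-histlinks">
(cur |
<a title="" href="/w/index.php?title=English_language&diff=727784939&oldid=727742573">prev</a>
)
</span>'''

second_item = '''<span class="mw-history-histlinks">
(
<a title="English language" href="/w/index.php?title=English_language&diff=727784939&oldid=727742573">cur</a>
 |
<a title="English language" href="/w/index.php?title=English_language&diff=727742573&oldid=727639069">prev</a>
)
</span>'''

print(extract_links(first_item))
print(extract_links(second_item))
```

The point of the sketch is that each field is looked up independently, so one missing anchor does not affect the other fields in the same list item.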
 
 
IRobotSoft Administrator
Re: Wikipedia History scraper
Reply #6 - 07/11/16 at 01:22:46
 
Indeed, there is an issue when the first column has empty data. We will try to fix it. In the meantime, you can swap your first and second columns as a temporary workaround. Your HTQL query can be:  
 
<UL (ID='pagehistory')>1.<LI>1-0{
PrevLink=<SPAN (CLASS='mw-history-histlinks')>1.<A (tx='prev')>1:href;
CurrLink=<SPAN (CLASS='mw-history-histlinks')>1.<A (tx='cur')>1:href;
DiffID=<INPUT (Name='diff')>1:value;
EditDTS=<A (CLASS='mw-changeslist-date')>1:tx;
UserName=<SPAN (CLASS='history-user')>1.<A (CLASS='mw-userlink')>1:tx;
UserText=<SPAN (CLASS='comment')>1:tx;
}
 
 
Also, :tx and &tx are different. :tx gets the text attribute, while &tx is a function that converts everything into text.  
 
 
 
 
 
