Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - web scraping with vba using XMLHTTP

I would like to get some data from the web page http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures.

If I use the old InternetExplorer object (code below), I can walk through the HTML document. However, I would like to use the XMLHTTP object instead (second code block).

Sub IEZagon() 
     'we define the essential variables
    Dim ie As Object 
    Dim TDelement, TDelements 
    Dim AnhorLink, AnhorLinks 

     'late binding via CreateObject, so no "Microsoft Internet Controls" reference is needed
    Set ie = CreateObject("InternetExplorer.Application") 
    With ie 
        .Visible = True 
        .navigate ("[URL]http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures[/URL]") 
        While ie.ReadyState <> 4 
            DoEvents 
        Wend 
        Set AnhorLinks = .document.getElementsbytagname("a") 
        Set TDelements = .document.getElementsbytagname("td") 
        For Each AnhorLink In AnhorLinks 
            Debug.Print AnhorLink.innertext 
        Next 
        For Each TDelement In TDelements 
            Debug.Print TDelement.innertext 
        Next 
    End With 
    Set ie = Nothing 
End Sub

Code using the XMLHTTP object:

Sub FuturesScrap(ByVal URL As String) 
    Dim XMLHttpRequest As XMLHTTP 
    Dim HTMLDoc As New HTMLDocument 

    Set XMLHttpRequest = New MSXML2.XMLHTTP 
    XMLHttpRequest.Open "GET", URL, False 
    XMLHttpRequest.send 
    While XMLHttpRequest.readyState <> 4 
        DoEvents 
    Wend 

    Debug.Print XMLHttpRequest.responseText 
    HTMLDoc.body.innerHTML = XMLHttpRequest.responseText 

    With HTMLDoc.body 
        Set AnchorLinks = .getElementsByTagName("a") 
        Set TDelements = .getElementsByTagName("td") 

        For Each AnchorLink In AnchorLinks 
            Debug.Print AnchorLink.innerText 
        Next 

        For Each TDelement In TDelements 
            Debug.Print TDelement.innerText 
        Next 
    End With 
End Sub 

I only get this basic HTML:

<html> 
<head> 
<title>Resource Not found</title> 
<link rel= 'stylesheet' type='text/css' href='/blueprint/css/errorpage.css'/>
</head> 
<body> 
<table class="header"> 
<tr> 
<td class="CMTitle CMHFill"><span class="large">Resource Not found</span></td> 
</tr> 
</table> 
<div class="body"> 
<p style="font-weight:bold;">The requested resource does Not exist.</p> 
</div> 
<table class="footer"> 
<tr> 
<td class="CMHFill"> </td> 
</tr> 
</table> 
</body> 
</html>

I would like to walk through the tables and the corresponding data. Finally, I would like to switch the time interval from Year to Month.

I'd really appreciate any help! Thank you!

1 Reply

I can confirm that I get the same HTML as you when I run your code (with or without the url tags). I found a useful post here. I have modified your code using the method found there and it now appears to have downloaded the correct information.

Sub test()
    Call FuturesScrap1("http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures")
End Sub

I included the calling sub because the url tags appeared to cause an error for the MSXML request.

Sub FuturesScrap1(ByVal URL As String)
    Dim HTMLDoc As New HTMLDocument
    Dim oHttp As MSXML2.XMLHTTP
    Dim sHTML As String
    Dim AnchorLinks As Object
    Dim TDelements As Object
    Dim TDelement As Object
    Dim AnchorLink As Object

    'Try to create the request object; fall back to CreateObject if that fails
    On Error Resume Next
    Set oHttp = New MSXML2.XMLHTTP
    If Err.Number <> 0 Then
        Set oHttp = CreateObject("MSXML.XMLHTTPRequest")
        MsgBox "An error occurred while creating MSXML2.XMLHTTP; tried MSXML.XMLHTTPRequest instead"
    End If
    On Error GoTo 0
    If oHttp Is Nothing Then
        MsgBox "For some reason I wasn't able to make a MSXML2.XMLHTTP object"
        Exit Sub
    End If

    'Send a synchronous GET request and keep the response text
    oHttp.Open "GET", URL, False
    oHttp.send
    sHTML = oHttp.responseText

    Debug.Print sHTML

    'Load the downloaded markup into the HTML document
    HTMLDoc.body.innerHTML = sHTML

    With HTMLDoc.body
        Set AnchorLinks = .getElementsByTagName("a")
        Set TDelements = .getElementsByTagName("td")

        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next

        For Each TDelement In TDelements
            Debug.Print TDelement.innerText
        Next
    End With

End Sub
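
As a side note on the url tags mentioned above: a minimal sketch of a helper that removes them before the request is made, so the same string can be passed to either version of the code. The name StripUrlTags is hypothetical (it is not part of the original code), and the helper assumes the tags appear literally as [URL] and [/URL].

```vba
' Hypothetical helper, not part of the original post.
' Removes literal [URL] and [/URL] markers (any letter case) from a string.
Function StripUrlTags(ByVal sUrl As String) As String
    sUrl = Replace(sUrl, "[URL]", "", , , vbTextCompare)
    sUrl = Replace(sUrl, "[/URL]", "", , , vbTextCompare)
    StripUrlTags = Trim$(sUrl)
End Function
```

It could then be used as Call FuturesScrap1(StripUrlTags(sAddress)), where sAddress is whatever string holds the address.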

Edit following a comment:

I haven't been able to find the table elements using the MSXML2 object; the source code it downloads doesn't appear to contain them. In Firebug the td tags are present, so I think the table is generated by JavaScript. I don't know whether MSXML2 can run the JavaScript, so I've modified the sub to use Internet Explorer. It's not quick code, but it does find the td elements and does allow clicking the tabs. I have found that the td elements can take some time to become available (presumably because IE has to run the JavaScript first), so I have put in a couple of steps where Excel waits before downloading the data.

I have put in some code that will download the contents of the td elements into the active worksheet, so be careful when running it in a workbook that contains data you want to keep.

Sub FuturesScrap3(ByVal URL As String)

    Dim HTMLDoc As New HTMLDocument
    Dim AnchorLinks As Object
    Dim tdElements As Object
    Dim tdElement As Object
    Dim AnchorLink As Object
    Dim lRow As Long
    Dim oElement As Object

    Dim oIE As InternetExplorer

    Set oIE = New InternetExplorer

    oIE.navigate URL
    oIE.Visible = True

    Do Until (oIE.readyState = 4 And Not oIE.Busy)
        DoEvents
    Loop

    'Wait for Javascript to run
    Application.Wait (Now + TimeValue("0:01:00"))

    HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

    With HTMLDoc.body
        Set AnchorLinks = .getElementsByTagName("a")
        Set tdElements = .getElementsByTagName("td")

        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next AnchorLink

    End With

    lRow = 1
    For Each tdElement In tdElements
        Debug.Print tdElement.innerText
        Cells(lRow, 1).Value = tdElement.innerText
        lRow = lRow + 1
    Next

    'Clicking the Month tab
    For Each oElement In oIE.document.all
        If Trim(oElement.innerText) = "Month" Then
            oElement.Focus
            oElement.Click
        End If
    Next oElement

    Do Until (oIE.readyState = 4 And Not oIE.Busy)
        DoEvents
    Loop

    'Wait for Javascript to run
    Application.Wait (Now + TimeValue("0:01:00"))

    HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

    With HTMLDoc.body
        Set AnchorLinks = .getElementsByTagName("a")
        Set tdElements = .getElementsByTagName("td")

        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next AnchorLink
    End With

    lRow = 1
    For Each tdElement In tdElements
        Debug.Print tdElement.innerText
        Cells(lRow, 2).Value = tdElement.innerText
        lRow = lRow + 1
    Next tdElement

End Sub
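
The fixed one-minute waits above are the slow part. A possible alternative, shown here only as a sketch, is to poll the live IE document until td elements appear or a timeout expires. The function name WaitForCells and its timeoutSeconds parameter are hypothetical, and the sketch assumes that the presence of at least one td means the table has been rendered; on a page with other static td elements a more specific check would be needed.

```vba
' Hypothetical helper, not part of the original answer.
' Polls the live IE document until at least one <td> exists
' or the timeout expires. Returns True if a <td> was found.
' Call it only after the readyState/Busy loop has finished.
Function WaitForCells(ByVal oIE As Object, ByVal timeoutSeconds As Long) As Boolean
    Dim deadline As Date

    'One day is 86400 seconds, so this converts seconds to a Date offset
    deadline = Now + timeoutSeconds / 86400#

    Do
        DoEvents
        If oIE.document.getElementsByTagName("td").Length > 0 Then
            WaitForCells = True
            Exit Function
        End If
    Loop While Now < deadline
End Function
```

With this in place, each Application.Wait line could be replaced by something like If Not WaitForCells(oIE, 60) Then Exit Sub, which returns as soon as the cells are available instead of always pausing for the full minute.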
