Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

xml parsing - Storing specific XML node values with R's xmlEventParse

I have a big XML file that I need to parse with xmlEventParse in R. Unfortunately, the online examples are more complex than I need. I just want to set a flag when a node tag matches, so that I can store the matched node's text (not its attributes), with the text for each tag going into a separate list. See the comments in the code below:

library(XML)
z <- xmlEventParse(
    "my.xml",
    handlers = list(
        startDocument = function() {
            cat("Starting document\n")
        },
        startElement = function(name, attr) {
            if (name == "myNodeToMatch1") {
                cat("FLAG Matched element 1\n")
            }
            if (name == "myNodeToMatch2") {
                cat("FLAG Matched element 2\n")
            }
        },
        text = function(text) {
            # if (matched element 1 ....)
            #     store text in element 1 list
            # if (matched element 2 ....)
            #     store text in element 2 list
        },
        endDocument = function() {
            cat("ending document\n")
        }
    ),
    addContext = FALSE,
    useTagName = FALSE,
    ignoreBlanks = TRUE,
    trim = TRUE)
# z$ ...   # show lists ??

My question is: how do I implement this flag in R (in a professional way :)? Also, what is the best way to match N arbitrary node names (name == "myNodeToMatchN") without writing a separate case for each one?

my.xml could be a simple XML file such as:

<A>
  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>
  <B>
    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>
    ...
  </B>
</A>

1 Reply


I'll use fileName from example(xmlEventParse) as a reproducible example. It has record tags, each with an id attribute and text that we'd like to extract. Rather than use handlers, I'll use the branches argument. A branch function is like a handler, but it has access to the full node rather than just the element. The idea is to write a closure that has a place to keep the data we accumulate and a function to process each branch of the XML document we are interested in. So let's start by defining the closure, which for our purposes is a function that returns a list of functions:

ourBranches <- function() {

We need a place to store the results we accumulate. We choose an environment so that each insertion takes constant time (rather than a list, which we would have to append to and which would be memory inefficient):

    store <- new.env() 

The event parser expects a list of functions, each invoked when a tag matching its name is discovered. We're interested in the record tag. The function we write receives a node of the XML document. We extract the id attribute and use it as the key under which we store the node's (text) value:

    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }

Once the document is processed, we'd like a convenient way to retrieve our results, so we add a function for our own use, independent of the nodes in the document:

    getStore <- function() as.list(store)

and then finish the closure by returning a list of functions:

    list(record=record, getStore=getStore)
}
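The closure above depends on every call creating its own private environment. Here is a minimal base-R sketch of that behaviour, independent of the XML package. The names makeStore, put and getStore are hypothetical and chosen only for illustration:

```r
# Hypothetical illustration; makeStore, put and getStore are made-up names.
makeStore <- function() {
    store <- new.env()                          # created afresh on every call
    put <- function(key, value) store[[key]] <- value
    getStore <- function() as.list(store)
    list(put = put, getStore = getStore)
}

a <- makeStore()
b <- makeStore()
a$put("id1", "first value")
a$put("id2", "second value")

sort(names(a$getStore()))   # the keys stored through a
length(b$getStore())        # b has its own environment, which is still empty
```

Because as.list() on an environment does not guarantee any particular order, the keys are sorted before being inspected.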

A tricky concept here is that the environment in which a function is defined is part of the function, so each call to ourBranches() returns a list of functions together with a new environment, store, to keep our results. To use it, invoke xmlEventParse on our file with an empty list of event handlers, then access the accumulated store:

> branches <- ourBranches()
> xmlEventParse(fileName, list(), branches=branches)
list()
> head(branches$getStore(), 2)
$`Hornet Sportabout`
[1] "18.7   8 360.0 175 3.15 3.440 17.02  0  0    3 "

$`Toyota Corolla`
[1] "33.9   4  71.1  65 4.22 1.835 19.90  1  1    4 "
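If you prefer to stay with plain handlers and a flag, as in the question, the same closure idea carries over. The following is only a sketch under some assumptions: makeHandlers, targets, current and buffer are hypothetical names, the XML is passed as a string via asText = TRUE instead of being read from my.xml, and matched elements are assumed not to be nested inside one another. Testing with name %in% targets avoids writing one if per node name.

```r
library(XML)

# Hypothetical helper: collects the text of every element named in `targets`.
makeHandlers <- function(targets) {
    store <- new.env()   # tag name -> character vector of collected texts
    current <- NULL      # the "flag": name of the matched element we are inside
    buffer <- ""         # text seen so far inside that element

    startElement <- function(name, attrs, ...) {
        if (name %in% targets) {
            current <<- name
            buffer <<- ""
        }
    }
    text <- function(x, ...) {
        # Text may be delivered in several pieces, so accumulate it.
        if (!is.null(current)) buffer <<- paste0(buffer, x)
    }
    endElement <- function(name, ...) {
        if (!is.null(current) && name == current) {
            store[[name]] <- c(store[[name]], buffer)
            current <<- NULL
        }
    }
    getStore <- function() as.list(store)

    list(startElement = startElement, text = text,
         endElement = endElement, getStore = getStore)
}

xml <- paste0(
    "<A><myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
    "<B><myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2></B></A>")

h <- makeHandlers(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(
    xml,
    handlers = h[c("startElement", "text", "endElement")],
    asText = TRUE, addContext = FALSE, useTagName = FALSE,
    ignoreBlanks = TRUE, trim = TRUE))

result <- h$getStore()
result$myNodeToMatch1
result$myNodeToMatch2
```

Only the three event handlers are passed to xmlEventParse. getStore stays outside the handler list and is called afterwards to retrieve one character vector per matched tag name.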
