Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Removing an element from a parsed XML tree disrupts iteration

I want to parse an xml file, then process the result tree by removing selected elements. My problem is that removing an element disrupts the loop that iterates over the elements.

Consider the following xml data:

<results>
    <group>
        <a />
        <b />
        <c />
    </group>
</results>

and the code:

import xml.etree.ElementTree as ET

def showGroup(group,s):
    print(s + '  len=' + str(len(group)))
    print('<group>' )
    for e in group:
        print('   <' + e.tag + '>')
    print('</group>\n')

def processGroup(group):
    for e in group:
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')

tree = ET.parse('x.xml')
root = tree.getroot()

for group in root:
    processGroup(group)

I expected the for loop to process elements <a>, <b>, and <c> in order. In particular:

  1. processing <a> should not remove any element
  2. processing <b> should remove <b>
  3. processing <c> should remove <c>

I expected the resulting tree to have a single element inside <group> (the <a> element), and that len(group) would return 1.

Instead, after processing <b>, the for loop decides the end test has been met, and it does not process element <c>. If it did, <c> would be removed. Instead, I am left with a tree with elements <a> and <c>, and len(group) returns 2.

What do I need to do to process all three elements while removing selected elements? PS: any comments on style or better ways to do something are welcome.

Update: an ugly hack "fixes" the problem at the cost of some efficiency, if there is no code after removing the element. But in my real program, there is a lot of code after the pruning loop.

def processGroup(group):
    for e in group:
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')
            # restart from the beginning of the (now shorter) group
            processGroup(group)

I assume that if the for loop is disrupted, then starting again with the group at the beginning might solve the problem. Recursion is a tidy way of doing that - at the expense of reprocessing all elements that have already been checked but not removed.

I am not satisfied with this solution.


1 Reply


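The issue is that you are removing elements from the container you are iterating over. When you remove an element, the remaining elements shift down one position, so the loop skips the element that moved into the freed slot. Here, removing <b> moves <c> to index 1; the loop then advances to index 2, finds nothing there, and ends without ever visiting <c>.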
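The same effect can be reproduced with a plain Python list, which may make the skipped element easier to see. This is only an illustrative sketch; the names items and visited are made up for the example and do not appear in the question.

```python
# Plain-list illustration of the skipped element.
# 'items' and 'visited' are hypothetical names used only for this demo.
items = ['a', 'b', 'c']
visited = []

for item in items:
    visited.append(item)
    if item != 'a':
        # Removing while iterating shifts later items one slot to the left.
        items.remove(item)

print(visited)  # ['a', 'b']  ('c' is never visited)
print(items)    # ['a', 'c']
```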

A simple solution is to iterate over a copy of the group's child list, or to iterate in reverse order with reversed:

copy:

def processGroup(group):
    # group[:] creates a shallow copy of the child list, so we remove
    # from the original while iterating over the copy.
    for e in group[:]:
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')

reversed:

def processGroup(group):
    # Start at the end: when an element is removed, the elements
    # not yet visited keep the positions they had when the loop started.
    for e in reversed(group):
        if e.tag != 'a':
            group.remove(e)
            showGroup(group,'removed <' + e.tag + '>')
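Both variants can be checked without a file on disk. The sketch below parses the XML from the question with ET.fromstring and records which tags each variant removes. The helper names new_group, prune_copy and prune_reversed are hypothetical and exist only for this comparison.

```python
import xml.etree.ElementTree as ET

XML = "<results><group><a /><b /><c /></group></results>"

def new_group():
    # Parse a fresh tree each time so the two variants do not interfere.
    return ET.fromstring(XML).find('group')

def prune_copy(group):
    removed = []
    for e in group[:]:            # iterate over a copy of the child list
        if e.tag != 'a':
            group.remove(e)
            removed.append(e.tag)
    return removed

def prune_reversed(group):
    removed = []
    for e in reversed(group):     # iterate from the last child to the first
        if e.tag != 'a':
            group.remove(e)
            removed.append(e.tag)
    return removed

g1, g2 = new_group(), new_group()
removed_copy = prune_copy(g1)
removed_reversed = prune_reversed(g2)

print(removed_copy, [e.tag for e in g1])
print(removed_reversed, [e.tag for e in g2])
```

The only visible difference is the order in which elements are removed: the copy variant removes <b> before <c>, while the reversed variant removes <c> first. Both leave <a> as the only child.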

Output using the copy version:

In [7]: tree = ET.parse('test.xml')

In [8]: root = tree.getroot()

In [9]: for group in root:
   ...:         processGroup(group)
   ...:     
removed <b>  len=2
<group>
   <a>
   <c>
</group>

removed <c>  len=1
<group>
   <a>
</group>

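You can also use ET.tostring in place of the for loop in showGroup:

import xml.etree.ElementTree as ET

def show_group(group,s):
    print(s + '  len=' + str(len(group)))
    # encoding='unicode' returns a str; the default returns bytes in Python 3
    print(ET.tostring(group, encoding='unicode'))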


def process_group(group):
    for e in group[:]:
        if e.tag != 'a':
            group.remove(e)
            show_group(group, 'removed <' + e.tag + '>')

tree = ET.parse('test.xml')
root = tree.getroot()

for group in root.findall(".//group"):
    process_group(group)
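If the intermediate printing is not needed, another option (not part of the answer above, just a possible alternative) is to rebuild the child list in one step with slice assignment, which avoids removing during iteration altogether. The sketch assumes the same XML as in the question, parsed from a string instead of test.xml.

```python
import xml.etree.ElementTree as ET

# Same data as in the question, parsed from a string for a self-contained demo.
root = ET.fromstring("<results><group><a /><b /><c /></group></results>")

for group in root.findall(".//group"):
    # The list comprehension finishes iterating before the slice
    # assignment replaces the children, so nothing is skipped.
    group[:] = [e for e in group if e.tag == 'a']

remaining = [e.tag for e in root.find('group')]
print(remaining)  # ['a']
```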
