Convert XML to CSV with Python - cannot return result properly

I would like to convert an XML file to CSV. I have tried several approaches but have not been able to get the conversion to work. Could anyone help me with the coding part?

The 1267.xml file comes from:

https://scsanctions.un.org/al-qaida/

Here is my code:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("1267.xml")
root = tree.getroot()

get_range = lambda col: range(len(col))
l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]

df = pd.DataFrame.from_dict(l)
df.to_csv('1267.csv')

The output I get from the above code is:

,INDIVIDUAL,ENTITY
0,,
1,,

I would like to get the output like the following:

DATAID,VERSIONNUM,FIRST_NAME,SECOND_NAME,THIRD_NAME,FOURTH_NAME,UN_LIST_TYPE, NATIONALITY, ......
6908576,1,IYAD,NAZMI,SALIH,KHALIL,Al-Qaida, Jordan, ......

I have tried other code as well, but it does not produce any result at all. I have no clue what is going wrong, so please help me out and explain the reason behind it. I appreciate your help.

Question from: https://stackoverflow.com/questions/65895413/convert-xml-to-csv-with-python-cannot-return-result-properly


1 Reply


Try using xmljson to parse the data.

# !pip install xmljson
# https://pypi.org/project/xmljson/
import re
import json
import requests
import pandas as pd
from xmljson import parker
from xml.etree.ElementTree import fromstring

url = 'https://scsanctions.un.org/al-qaida/'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'})

# strip the <?xml ... ?> declaration: fromstring() rejects str input that
# still carries an encoding declaration
data = re.sub(r'<\?xml.*?\?>', '', r.text)
obj = parker.data(fromstring(data))
# convert the nested OrderedDicts to plain dicts
obj_dict = json.loads(json.dumps(obj))

df_INDIVIDUAL = pd.DataFrame(obj_dict['INDIVIDUALS']['INDIVIDUAL'])
df_INDIVIDUAL.columns
# Index(['DATAID', 'VERSIONNUM', 'FIRST_NAME', 'SECOND_NAME', 'THIRD_NAME',
#        'FOURTH_NAME', 'UN_LIST_TYPE', 'REFERENCE_NUMBER', 'LISTED_ON',
#        'NAME_ORIGINAL_SCRIPT', 'COMMENTS1', 'NATIONALITY', 'LIST_TYPE',
#        'LAST_DAY_UPDATED', 'INDIVIDUAL_ALIAS', 'INDIVIDUAL_ADDRESS',
#        'INDIVIDUAL_DATE_OF_BIRTH', 'INDIVIDUAL_PLACE_OF_BIRTH',
#        'INDIVIDUAL_DOCUMENT', 'SORT_KEY', 'SORT_KEY_LAST_MOD', 'GENDER',
#        'TITLE', 'SUBMITTED_BY'],
#       dtype='object')

cols = 'DATAID,VERSIONNUM,FIRST_NAME,SECOND_NAME,THIRD_NAME,FOURTH_NAME,UN_LIST_TYPE'.split(',')
# df_INDIVIDUAL[cols].to_csv('INDIVIDUAL.csv', index=False)
df_INDIVIDUAL[cols]

result:

print(df_INDIVIDUAL[cols].head(3).T)

                         0              1                      2
    DATAID         6908576        6908757                6908443
    VERSIONNUM           1              1                      1
    FIRST_NAME        IYAD          HAJJI          'ABD AL-MALIK
    SECOND_NAME     NAZMI   'ABD AL-NASIR               MUHAMMAD
    THIRD_NAME       SALIH            NaN                  YUSUF
    FOURTH_NAME     KHALIL            NaN  'UTHMAN 'ABD AL-SALAM
    UN_LIST_TYPE  Al-Qaida       Al-Qaida               Al-Qaida
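
The same pattern should also cover the entity records that showed up as the ENTITY column in the question's output; a minimal sketch, assuming the parsed dict nests them under ENTITIES -> ENTITY the same way INDIVIDUALS nests INDIVIDUAL (check obj_dict.keys() first):

# assumed key names, mirroring the INDIVIDUALS/INDIVIDUAL nesting above
df_ENTITY = pd.DataFrame(obj_dict['ENTITIES']['ENTITY'])
print(df_ENTITY.columns)  # inspect which columns are actually present
# df_ENTITY.to_csv('ENTITY.csv', index=False)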

update:

Getting a repeated tag, e.g. <INDIVIDUAL_ALIAS>.

I am familiar with BeautifulSoup for parsing page content, so I will use it here.

BeautifulSoup's select uses CSS-style selector syntax:

  1. select_one returns the first match, or None if nothing matches
  2. select returns a list of all matches

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')

data_list = []
for INDIVIDUAL in soup.select('INDIVIDUAL'):
    dataid = INDIVIDUAL.select_one('dataid').text
    for individual_alias in INDIVIDUAL.select('individual_alias'):
        quality = individual_alias.select_one('quality')
        alias_name = individual_alias.select_one('alias_name')
        # if quality or alias_name is missing from individual_alias, select_one returns None
        quality = quality.text if quality is not None else None
        alias_name = alias_name.text if alias_name is not None else None

        data_list.append((dataid, quality, alias_name))

dfn = pd.DataFrame(data_list, columns=('dataid', 'alias_quality', 'alias_name'))

print(dfn.head())

       dataid  alias_quality                 alias_name
    0  6908576          Good  Ayyad Nazmi Salih Khalil 
    1  6908576          Good    Eyad Nazmi Saleh Khalil
    2  6908576           Low           Iyad al-Toubasi 
    3  6908576           Low            Iyad al-Tubasi 
    4  6908576           Low             Abu al-Darda' 

Join back to df_INDIVIDUAL:

# groupby dataid to join the alias_name together 
alias_name = dfn.groupby('dataid')['alias_name'].agg(list).str.join('###')

# split by ###, and expand list to column
df_alias_name = alias_name.str.split('###', expand=True)

# column name add prefix
df_alias_name = df_alias_name.add_prefix('INDIVIDUAL_ALIAS.ALIAS_NAME/')

# set the DATAID as type int to join df_INDIVIDUAL
df_alias_name.index = df_alias_name.index.astype(int)
df_INDIVIDUAL = df_INDIVIDUAL.join(df_alias_name, on='DATAID', how='left')
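
Finally, the joined table can be written out the same way as the commented-out to_csv above; a minimal sketch (the output file name is just an example):

# write the joined table to CSV without the index column
df_INDIVIDUAL.to_csv('1267_individuals.csv', index=False)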

