Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
577 views
in Technique[技术] by (71.8m points)

Convert nested JSON to CSV file in Python

I know this question has been asked many times. I tried several solutions but I couldn't solve my problem.

I have a large nested JSON file (1.4GB) and I would like to make it flat and then convert it to a CSV file.

The JSON structure is like this:

{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

I know that this is easy to accomplish with pandas module but I am not familiar with it.

EDITED

The desired output should be something like this:

company_number, address_line_1, locality, country_of_residence, kind,

12345678, Address 1, Henley-On-Thamed, England, individual-person-with-significant-control

Note that this is just the short version. The output should have all the fields.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Please scroll down for the newer, faster solution

This is an older question, but I struggled the entire night to get a satisfactory result for a similar situation, and I came up with this:

import json
import pandas

def cross_join(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop('key', 1)

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=None):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                df = cross_join(df, to_frame(data[key], prev_key + '.' + key))
        elif isinstance(data, list):
            df = pandas.DataFrame()
            for i in range(len(data)):
                df = pandas.concat([df, to_frame(data[i], prev_key)])
        else:
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

if __name__ == '__main__':
    with open('somefile') as json_file:
        json_data = json.load(json_file)

    df = json_to_dataframe(json_data)
    df.to_csv('data.csv', mode='w')

Explanation:

The cross_join function is a neat way I found to do a cartesian product. (credit: here)

The json_to_dataframe function does the logic, using pandas dataframes. In my case, the json was deeply nested, and I wanted to split dictionary key:value pairs into columns, but the lists I wanted to transform into rows for a column -- hence the concat -- which I then cross join with the upper level, thus multiplying the records number so that each value from the list has its own row, while the previous columns are identical.

The recursiveness creates stacks that cross join with the one below, until the last one is returned.

Then with the dataframe in a table format, it's easy to convert to CSV with the "df.to_csv()" dataframe object method.

This should work with deeply nested JSON, being able to normalize all of it into rows by the logic described above.

I hope this will help someone, someday. Just trying to give back to this awesome community.

---------------------------------------------------------------------------------------------

LATER EDIT: NEW SOLUTION

I'm coming back to this as while the dataframe option kinda worked, it took the app minutes to parse not so large JSON data. Therefore I thought of doing what the dataframes do, but by myself:

from copy import deepcopy
import pandas


def cross_join(left, right):
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            for key, value in right_row.items():
                temp_row[key] = value
            new_rows.append(deepcopy(temp_row))
    return new_rows


def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem


def json_to_dataframe(data_in):
    def flatten_json(data, prev_heading=''):
        if isinstance(data, dict):
            rows = [{}]
            for key, value in data.items():
                rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key))
        elif isinstance(data, list):
            rows = []
            for i in range(len(data)):
                [rows.append(elem) for elem in flatten_list(flatten_json(data[i], prev_heading))]
        else:
            rows = [{prev_heading[1:]: data}]
        return rows

    return pandas.DataFrame(flatten_json(data_in))


if __name__ == '__main__':
    json_data = {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
            {
                "batter":
                    [
                        {"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"},
                        {"id": "1003", "type": "Blueberry"},
                        {"id": "1004", "type": "Devil's Food"}
                    ]
            },
        "topping":
            [
                {"id": "5001", "type": "None"},
                {"id": "5002", "type": "Glazed"},
                {"id": "5005", "type": "Sugar"},
                {"id": "5007", "type": "Powdered Sugar"},
                {"id": "5006", "type": "Chocolate with Sprinkles"},
                {"id": "5003", "type": "Chocolate"},
                {"id": "5004", "type": "Maple"}
            ],
        "something": []
    }
    df = json_to_dataframe(json_data)
    print(df)

OUTPUT:

      id   type  name   ppu batters.batter.id batters.batter.type topping.id              topping.type
0   0001  donut  Cake  0.55              1001             Regular       5001                      None
1   0001  donut  Cake  0.55              1001             Regular       5002                    Glazed
2   0001  donut  Cake  0.55              1001             Regular       5005                     Sugar
3   0001  donut  Cake  0.55              1001             Regular       5007            Powdered Sugar
4   0001  donut  Cake  0.55              1001             Regular       5006  Chocolate with Sprinkles
5   0001  donut  Cake  0.55              1001             Regular       5003                 Chocolate
6   0001  donut  Cake  0.55              1001             Regular       5004                     Maple
7   0001  donut  Cake  0.55              1002           Chocolate       5001                      None
8   0001  donut  Cake  0.55              1002           Chocolate       5002                    Glazed
9   0001  donut  Cake  0.55              1002           Chocolate       5005                     Sugar
10  0001  donut  Cake  0.55              1002           Chocolate       5007            Powdered Sugar
11  0001  donut  Cake  0.55              1002           Chocolate       5006  Chocolate with Sprinkles
12  0001  donut  Cake  0.55              1002           Chocolate       5003                 Chocolate
13  0001  donut  Cake  0.55              1002           Chocolate       5004                     Maple
14  0001  donut  Cake  0.55              1003           Blueberry       5001                      None
15  0001  donut  Cake  0.55              1003           Blueberry       5002                    Glazed
16  0001  donut  Cake  0.55              1003           Blueberry       5005                     Sugar
17  0001  donut  Cake  0.55              1003           Blueberry       5007            Powdered Sugar
18  0001  donut  Cake  0.55              1003           Blueberry       5006  Chocolate with Sprinkles
19  0001  donut  Cake  0.55              1003           Blueberry       5003                 Chocolate
20  0001  donut  Cake  0.55              1003           Blueberry       5004                     Maple
21  0001  donut  Cake  0.55              1004        Devil's Food       5001                      None
22  0001  donut  Cake  0.55              1004        Devil's Food       5002                    Glazed
23  0001  donut  Cake  0.55              1004        Devil's Food       5005                     Sugar
24  0001  donut  Cake  0.55              1004        Devil's Food       5007            Powdered Sugar
25  0001  donut  Cake  0.55              1004        Devil's Food       5006  Chocolate with Sprinkles
26  0001  donut  Cake  0.55              1004        Devil's Food       5003                 Chocolate
27  0001  donut  Cake  0.55              1004        Devil's Food       5004                     Maple

As per what the above does, well, the cross_join function does pretty much the same thing as in the dataframe solution, but without dataframes, thus being faster.

I added the flatten_list generator as I wanted to make sure that the JSON arrays are all nice and flattened, then provided as a single list of dictionaries comprising of the previous key from one iteration before assigned to each of the list's values. This pretty much mimics the pandas.concat behaviour in this case.

The logic in the main function, json_to_dataframe is then the same as before. All that needed to change was having the operations performed by dataframes as coded functions.

Also, in the dataframes solution I was not appending the previous heading to the nested object, but unless you are 100% sure you do not have conflicts in column names, then it is pretty much mandatory.

I hope this helps :).

EDIT: Modified the cross_join function to deal with the case when a nested list is empty, basically maintaining the previous result set unmodified. The output is unchanged even after adding the empty JSON list in the example JSON data. Thank you, @Nazmus Sakib for pointing it out.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...