Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Regular expression to match comma separated list of key=value where value can contain commas

I have a naive "parser" that simply does something like:
[x.split('=') for x in mystring.split(',')]

However mystring can be something like
'foo=bar,breakfast=spam,eggs'

Obviously, the naive splitter will not handle that. I am limited to the Python 2.6 standard library for this, so pyparsing, for example, cannot be used.

Expected output is
[('foo', 'bar'), ('breakfast', 'spam,eggs')]

I'm trying to do this with regex, but am facing the following problems:

My first attempt,
r'([a-z_]+)=(.+),?'
gave me
[('foo', 'bar,breakfast=spam,eggs')]

Obviously, making .+ non-greedy does not solve the problem.

So I'm guessing I have to somehow make the last comma (or $) mandatory. Doing just that does not really work either:
r'([a-z_]+)=(.+?)(?:,|$)'
With that, the part after the comma in a value containing one is omitted,
e.g. [('foo', 'bar'), ('breakfast', 'spam')]
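
For reference, the attempts above can be reproduced with a short script. This is only a sketch of the behaviour described in the question; the variable names greedy, lazy and terminated are made up for illustration.

```python
import re

mystring = 'foo=bar,breakfast=spam,eggs'

# Attempt 1: the greedy .+ runs to the end of the string.
greedy = re.findall(r'([a-z_]+)=(.+),?', mystring)

# Non-greedy .+? followed by an optional comma stops after one character,
# because nothing forces the value to extend any further.
lazy = re.findall(r'([a-z_]+)=(.+?),?', mystring)

# Mandatory terminator: the value ends at the first comma,
# so the ',eggs' part of the second value is lost.
terminated = re.findall(r'([a-z_]+)=(.+?)(?:,|$)', mystring)

print(greedy)
print(lazy)
print(terminated)
```

None of the three variants knows that a comma only ends a value when a new key follows it, which is the missing piece.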

I think I must use some sort of look-behind(?) operation.

The question(s):
1. Which one do I use? or
2. How do I do that/this?

Edit:

Based on daramarak's answer below, I ended up doing pretty much the same thing as abarnert later suggested, in a slightly more verbose form:

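parts = data.split('=')
# Only the middle pieces hold "value,next_key"; the last piece is a pure
# value and must not be split, or a trailing 'spam,eggs' loses ',eggs'.
vals = [x.rsplit(',', 1) for x in parts[:-1]] + [[parts[-1]]]
ret = list()
while vals:
    value = vals.pop()[0]
    key = vals[-1].pop()
    ret.append((key, value))
    if len(vals[-1]) == 0:
        break
ret.reverse()  # pairs were collected back to front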

EDIT 2:

Just to satisfy my curiosity, is this actually possible with pure regular expressions? I.e., so that re.findall() would return a list of 2-tuples?

1 Reply

Just for comparison purposes, here's a regex that seems to solve the problem as well:

([^=]+)    # key
=          # equals is how we tokenise the original string
([^=]+)    # value
(?:,|$)    # value terminator, either comma or end of string

The trick here is to restrict what you're capturing in your second group. .+ swallows the = sign, which is the character we can use to distinguish keys from values. The full regex uses no lookaround or backreferences (so it should be compatible with something like re2, if that's desirable), although a backtracking engine such as Python's re still backtracks inside the value group to find the comma before the next key. It also works on abarnert's examples.
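
The commented layout above is not valid as a plain pattern string, but it can be kept as written if the pattern is compiled with re.VERBOSE, which ignores whitespace and # comments inside the pattern. A minimal sketch, where the name PAIR_RE is made up for illustration:

```python
import re

# re.VERBOSE lets the pattern keep its layout and inline comments.
PAIR_RE = re.compile(r"""
    ([^=]+)    # key
    =          # equals is how we tokenise the original string
    ([^=]+)    # value
    (?:,|$)    # value terminator, either comma or end of string
""", re.VERBOSE)

pairs = PAIR_RE.findall('foo=bar,breakfast=spam,eggs')
print(pairs)
```

This is the same pattern as the one-line form, only spread over several lines.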

Usage as follows:

re.findall(r'([^=]+)=([^=]+)(?:,|$)', 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')

Which returns:

[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
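
As for the lookaround the question was guessing at: a lookahead rather than a look-behind seems to be the natural fit. The following is a sketch, not a tested production pattern. It reuses the question's [a-z_]+ key class as an assumption about what keys look like, and the name LOOKAHEAD_RE is made up for illustration. The value is matched lazily and must be followed by either a comma that starts a new key= or by the end of the string.

```python
import re

# The lookahead (?=...) checks for ",key=" or end of string without
# consuming it, so a comma inside a value does not end the value.
LOOKAHEAD_RE = re.compile(r'([a-z_]+)=(.+?)(?=,[a-z_]+=|$)')

short = LOOKAHEAD_RE.findall('foo=bar,breakfast=spam,eggs')
longer = LOOKAHEAD_RE.findall(
    'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')

print(short)
print(longer)
```

Unlike the pattern above, this variant needs lookahead support, so it would not be expected to work with an engine such as re2.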
