Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How to parse data effectively with python

Here is my code for extracting the fields I want.
However, I don't think it is efficient, because it runs one regex search per field, so the work grows with the number of fields.
That hardly matters for small data, but I would like to know a better way.
Ideally I want to extract all the fields at once, or at least more efficiently.
Sorry if this is a naive question.

import re

data="""
Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>
Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
Received: from 125.209.x.x (net58.219.x-18.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
 Tue, 22 Dec 2020 11:20:58 -0000
From: "test"<from@google.com>
To: test@google.com
Subject:example email
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
"""

def searchHeader(field):
    form = re.search(r'(' + field + r'\W+(.*?)\n)', data)
    if form:
        print(form.group())

fields = ['From','To','Cc','Subject','Message-ID','Date','(Return-Path|Reply-To)']
for field in fields:
    searchHeader(field)  # prints the match; the function returns nothing

1 Reply

Depending on your definition of "effective", you can use named capture groups to pick up every field in a single pass:

(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)
  • (?P<field>^[\w-]+) - name a capture group "field" and capture the run of \w characters or - dashes at the beginning of a line. This needs the multiline flag so that ^ matches at the start of every line, not only at the start of the string.
  • : * - match (without capturing) a colon followed by optional spaces.
  • (?P<value>[\s\S]+?) - name a capture group "value" and lazily capture everything, including newlines. If you enable the dotall modifier then .+? could be used in place of [\s\S]+?. This ensures we capture multiline values such as the ones found after Received:.
  • (?=^[\w-]+: *|\Z) - a lookahead that makes "value" keep growing until the next "field" begins or the end of the string is reached.

https://regex101.com/r/rBBRfM/1

You can see performance stats in the upper right at regex101.
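
For reference, here is a minimal sketch of how the pattern above might be applied in Python. It assumes re.MULTILINE is set so that ^ matches at each line start, and it uses a trimmed copy of the sample headers from the question. The names HEADER_RE, parse_headers and wanted are made up for this illustration and are not from the original post.

```python
import re

# Trimmed copy of the sample headers from the question (illustrative data).
data = """Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>
Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
 Tue, 22 Dec 2020 11:20:58 -0000
From: "test"<from@google.com>
To: test@google.com
Subject:example email
"""

# re.MULTILINE lets ^ match at the start of every line.
# \Z matches only at the very end of the string.
HEADER_RE = re.compile(
    r"(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)",
    re.MULTILINE,
)


def parse_headers(text):
    """Return (field, value) pairs found in a single pass over text."""
    return [
        (m.group("field"), m.group("value").strip())
        for m in HEADER_RE.finditer(text)
    ]


headers = parse_headers(data)

# Filter afterwards instead of running one search per field.
wanted = {"From", "To", "Subject", "Message-ID"}
for field, value in headers:
    if field in wanted:
        print(f"{field}: {value}")
```

The result is kept as a list of pairs rather than a dict because a field such as Received can appear more than once. Continuation lines that start with a space stay attached to the preceding field, since the lookahead only stops at a line that begins with a field name followed by a colon.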
