Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Parse email body in Python

I'm working with the Enron dataset, and I'd like to extract the clean body of each email into a list, keeping each reply as a separate string in the list.

For the following email:

Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: John_Arnold_Dec2000Notes Folders'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it?   And by the way, don't start with the excuses.   You're 
expected to be a full, gourmet cook.

Kisses, not music, makes cooking a more enjoyable experience.  




"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi


I told you I have a long email address.

I've decided what to prepare for dinner tomorrow.  I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience.

Watch the debate if you are home tonight.  I want a report tomorrow...
Jen

___________________________________________________________________
To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,
all in one place - sign up today at http://www.zdnetonebox.com

I want to get the following response:

["So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience.", 
"I told you I have a long email address. I've decided what to prepare for dinner tomorrow.  I hope you aren't 
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience. Watch the debate if you are home tonight.  I want a report tomorrow...
Jen"]

Where the first element in the list is:

"So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience."

Is there a library capable of doing this?

I have tried the Python email library, but it does not seem to have that functionality, since I get the full body back:

import email

# data_ holds the raw text of the email shown above
message = data_
e = email.message_from_string(message)
print(e.get_payload())

So, what is it? And by the way, don't start with the excuses.
You're expected to be a full, gourmet cook. Kisses, not music, makes cooking a more enjoyable experience. "Jennifer White" jenwhite7@zdnetonebox.com on 10/17/2000 04:19:20 PM To: jarnold@enron.com cc: Subject: Hi I told you I have a long email address. I've decided what to prepare for dinner tomorrow. I hope you aren't expecting anything extravagant because my culinary skills haven't been put to use in a while. My only request is that your stereo works. Music makes cooking a more enjoyable experience. Watch the debate if you are home tonight. I want a report tomorrow... Jen ___________________________________________________________________ To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax, all in one place - sign up today at http://www.zdnetonebox.com


1 Reply


I'm going to assume that you have all the Enron email messages in a .csv file, which is a common format for this dataset. I noticed some data cleansing issues when processing this single message, mostly around the line breaks in the message: words on either side of a line break can end up fused together. I'm still trying to figure out how to resolve this small issue.

import re as regex

def expunge_doublespaces(raw_string):
    # Collapse runs of spaces by replacing double spaces until none remain.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))


def parse_raw_email_message(raw_message):
    lines = raw_message.splitlines()
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # Lines without a colon are treated as body text. They are
            # concatenated without a separator, so words at a line break
            # can end up fused (see "beenput" in the output below).
            message += line
            email['body'] = message
        else:
            # Lines with a colon are treated as "key: value" headers.
            pairs = line.split(':', 1)
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email

###############################################
# change this open section to fit your dataset
###############################################
with open('enron_emails/sample_email.txt', 'r') as in_file:
    parsed_email = parse_raw_email_message(in_file.read())

for key, value in parsed_email.items():
    if key == "body":
        # This regex adds a space after 't and after a period or comma
        # whenever the next character is not whitespace.
        first_cleaning = regex.sub(r"(?<='t)(?=\S)|(?<=[.,])(?=\S)", ' ', value)
        cleaned_body = expunge_doublespaces(first_cleaning)
        print(cleaned_body)

# print output
# So, what is it? And by the way, don't start with the excuses. You're
# expected to be a full, gourmet cook. Kisses, not music, makes cooking
# a more enjoyable experience. I told you I have a long email address.
# I've decided what to prepare for dinner tomorrow. I hope you aren't
# expecting anything extravagant because my culinary skills haven't
# beenput to use in a while. My only request is that your stereo works.
# Musicmakes cooking a more enjoyable experience. Watch the debate if
# you are home tonight. I want a report tomorrow. . . Jen
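
One possible way around the fused words at line breaks (such as "beenput" above) is to separate the headers from the body at the first blank line and then join the body lines with single spaces. This is only a minimal sketch: the helper names split_headers_and_body and flatten_body are made up for illustration, and the sample string is a shortened stand-in for a real message.

```python
def split_headers_and_body(raw_message):
    """Split a raw message at the first blank line, where the headers end."""
    normalized = raw_message.replace('\r\n', '\n')
    headers, _, body = normalized.partition('\n\n')
    return headers, body


def flatten_body(body):
    """Join the body's lines with single spaces so words never fuse."""
    # str.split() with no arguments splits on any run of whitespace,
    # so line breaks and double spaces both collapse to one space.
    return ' '.join(body.split())


# Shortened, hypothetical sample in the same layout as the Enron message.
sample = (
    "From: john.arnold@enron.com\n"
    "To: jenwhite7@zdnetonebox.com\n"
    "Subject: Re: Hi\n"
    "\n"
    "my culinary skills haven't been\n"
    "put to use in a while.  Music\n"
    "makes cooking a more enjoyable experience.\n"
)

headers, body = split_headers_and_body(sample)
print(flatten_body(body))
```

Because the whitespace is normalized in one step, neither the regex nor expunge_doublespaces is needed in this variant. The trade-off is that any intentional double spacing in the original text is lost.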

UPDATE

Here is another way to obtain the body of the email message. There are other examples in another question that I answered.

import re as regex
import email

def expunge_doublespaces(raw_string):
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))

with open('enron_emails/sample_email.txt', 'r') as in_file:
    raw_message = in_file.read()

# Return a message object structure from a string
msg = email.message_from_string(raw_message)

# Iterate over all the parts and subparts of the message object tree
for part in msg.walk():
    # Only handle parts whose content type is plain text
    if part.get_content_type() == 'text/plain':
        email_body = part.get_payload()
        # Remove the quoted-reply header block (sender and timestamp line,
        # followed by the To:, cc: and Subject: lines)
        first_cleaning = regex.sub(
            r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))",
            ' ', email_body)
        clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
        print(clean_body)

# print output
# So, what is it? And by the way, don't start with the excuses.
# You're expected to be a full, gourmet cook. Kisses, not music,
# makes cooking a more enjoyable experience. I told you I have a
# long email address. I've decided what to prepare for dinner
# tomorrow. I hope you aren't expecting anything extravagant
# because my culinary skills haven't been put to use in a while.
# My only request is that your stereo works. Music makes cooking a
# more enjoyable experience. Watch the debate if you are home
# tonight. I want a report tomorrow... Jen
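
To get a list with one string per reply, as the question asks, the body could be split on the quoted-reply header instead of just deleting it. The following is a rough sketch rather than a general solution: REPLY_HEADER, FOOTER and split_replies are names I made up, the patterns assume the header layout and the underscore footer line seen in this one sample message, and other Enron messages use different layouts.

```python
import re

# Header placed in front of a quoted reply in the sample message, e.g.
#   "Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
#   To: jarnold@enron.com
#   cc:
#   Subject: Hi
# Assumption: every quoted reply in the body uses exactly this layout.
REPLY_HEADER = re.compile(
    r"^.*\bon \d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} (?:AM|PM)[ \t]*\n"
    r"To:.*\n"
    r"cc:.*\n"
    r"Subject:.*$",
    re.MULTILINE,
)

# Assumption: a line of ten or more underscores starts the advertising
# footer, and everything after it can be dropped.
FOOTER = re.compile(r"^_{10,}[ \t]*$.*", re.MULTILINE | re.DOTALL)


def split_replies(body):
    """Return one whitespace-normalized string per reply found in body."""
    body = FOOTER.sub("", body)
    chunks = REPLY_HEADER.split(body)
    replies = [" ".join(chunk.split()) for chunk in chunks]
    # Drop chunks that were empty or contained only whitespace.
    return [reply for reply in replies if reply]


# Shortened, hypothetical body in the same layout as the sample email.
sample_body = (
    "So, what is it?\n"
    "\n"
    '"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\n'
    "To: jarnold@enron.com\n"
    "cc:  \n"
    "Subject: Hi\n"
    "\n"
    "I told you I have a long email address.\n"
    "Jen\n"
    "\n"
    "___________________________________________________________________\n"
    "To get your own FREE ZDNet Onebox\n"
)

print(split_replies(sample_body))
```

In practice, body would be the string returned by part.get_payload() for the text/plain part. Messages whose quoted replies are introduced differently (for example with "-----Original Message-----") would need their own patterns.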
