Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

python - Javascript variable with html code regex email matching

This python script is not working to output the email address example@email.com for this case.

This was my previous post.

How can I use BeautifulSoup or Slimit on a site to output the email address from a javascript variable

#!/usr/bin/env python

from bs4 import BeautifulSoup
import re

soup = '''
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
'''


soup = BeautifulSoup(soup)

for script in soup.find_all('script'):
  reg = '(<)?(w+@w+(?:.w+)+)(?(1)>)'
  reg2 = 'mailto:.*'
  secondHalf= re.search(reg, script.text)
  firstHalf= re.search(reg2, script.text)
  secondHalfEmail = secondHalf.group()
  firstHalfEmail = firstHalf.group()
  firstHalfEmail = firstHalfEmail.replace('mailto:', '')
  firstHalfEmail = firstHalfEmail.replace('";', '')
  if firstHalfEmail == secondHalfEmail:
     email = secondHalfEmail
  else:
     if ('>') not in firstHalfEmail:
        if ('>') not in secondHalfEmail:
            if firstHalfEmail != secondHalfEmail:
                email = firstHalfEmail + secondHalfEmail
        else:
            email = firstHalfEmail
    else:
        email = secondHalfEmail

    print email

It would be nice if someone can help me.

Thank you

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here is a rather interesting (I think) approach.

Instead of parsing this javascript code - execute it!

Get the ptr value, load it via BeautifulSoup and get the href attribute value from the a tag. Example using V8 engine:

from bs4 import BeautifulSoup
from pyv8 import PyV8

data = """
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
"""

soup = BeautifulSoup(data)

# prepare the function to return a value and add a function call
js_code = soup.script.text.strip().replace('document.all.something.innerHTML = ptr;', 'return ptr;') + "; something()"

ctxt = PyV8.JSContext()
ctxt.enter()

soup = BeautifulSoup(ctxt.eval(str(js_code)))
print soup.a['href'].split('mailto:')[1]

Prints:

example@email.com

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...