Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
375 views
in Technique by (71.8m points)

html - RegEx match open tags except XHTML self-contained tags

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right?

And more importantly, what do you think?
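
One way to check that reading is to run the pattern against the four sample tags. The sketch below assumes Python's `re` module, since the question does not name a regex engine. The helper name `tag_name` and the two extra inputs at the end are illustrative additions, not part of the original question. Note that `*?` is the lazy (non-greedy) form of `*`, so the fourth step matches as few characters as it can rather than as many.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")

def tag_name(tag):
    """Return the captured a-z name if the whole string matches, else None."""
    m = OPEN_TAG.fullmatch(tag)
    return m.group(1) if m else None

# The four samples from the question.
for sample in ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />']:
    print(f"{sample!r:24} -> {tag_name(sample)}")

# Illustrative extra inputs (assumptions, not from the question):
# a "/" inside an attribute value makes a real opening tag fail to match.
print(tag_name('<a href="/index.html">'))

# "*?" is lazy, so a search stops at the first ">" it can reach.
print(OPEN_TAG.search("<p>text<b>").group(0))
```

On these inputs the two opening tags are captured and the two self-closing tags are rejected, as intended. The `/index.html` link is rejected too, because `[^/]` forbids a slash anywhere after the tag name, not only just before the `>`.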

  asked by community wiki, translated from Stack Overflow

Fight the dragon for too long and you become the dragon yourself; gaze into the abyss for too long and the abyss will gaze back…
1 Reply

0 votes
by (71.8m points)

You can't parse [X]HTML with regex.

Because HTML can't be parsed by regex.

Regex is not a tool that can be used to correctly parse HTML.

As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML.

HTML is not a regular language and hence cannot be parsed by regular expressions.
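
A small illustration of why nesting is the sticking point. The sketch assumes Python's `re` module, and the `<div>` pattern, the sample string, and the helper `element_end` are hypothetical, made up for this example. The lazy pattern keeps no count of how many elements are still open, so it pairs the first `<div>` with the first `</div>` it meets. A loop with a depth counter, which is the kind of memory a parser keeps, finds the right closing tag.

```python
import re

# Hypothetical pattern for illustration: one <div>...</div> element.
DIV = re.compile(r"<div>(.*?)</div>")

# Recognises a single opening or closing div tag; used only as a tokenizer.
TOKEN = re.compile(r"</?div>")

nested = "<div>outer <div>inner</div> tail</div>"

# The pattern alone stops at the first </div>, cutting the outer element short.
print(DIV.search(nested).group(0))

def element_end(text, start=0):
    """Return the index just past the </div> that closes the first <div>
    found at or after `start`, or None if it is never closed."""
    depth = 0
    for tok in TOKEN.finditer(text, start):
        depth += 1 if tok.group() == "<div>" else -1
        if depth == 0:
            return tok.end()
    return None

# With a depth counter the whole outer element is recovered.
print(nested[:element_end(nested)])
```

The regex still has a job here, recognising individual tokens, but the pairing of open and close tags is done by the counter outside the pattern.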

Regex queries are not equipped to break down HTML into its meaningful parts.

so many times but it is not getting to me.

Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML.

You will never make me crack.

HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions.

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.

Parsing HTML with regex summons tainted souls into the realm of the living.

HTML and regex go together like love, marriage, and ritual infanticide.

The <center> cannot hold it is too late.

The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty.

If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.

HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror.

Regex-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a child ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he comes, his unholy radiance destroying all enlightenment, HTML tags leaking from your eyes like liquid pain, the song of regular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see it it is beautiful the final snuffing of the lies of Man ALL IS LOST ALL IS LOST the pony he comes he comes he comes the ichor permeates all MY FACE MY FACE oh god no NO NOOOO NΘ stop the angels are not real ZALGO IS TONY THE PONY HE COMES

Have you tried using an XML parser instead?

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
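
As a rough illustration of the answer's closing suggestion, here is a minimal sketch that collects opening tags while skipping XHTML-style self-closing ones. It assumes Python and its standard `html.parser` module as a stand-in for whatever parser the answer has in mind (it is a tolerant HTML parser rather than a strict XML one). The class name `OpenTagCollector` and the function `open_tags` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class OpenTagCollector(HTMLParser):
    """Collect the names of opening tags, skipping self-closing ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary opening tags such as <p> or <a href="foo">.
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <br />; ignore them.
        pass

def open_tags(markup):
    """Return the opening tag names found in `markup`, in document order."""
    parser = OpenTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.open_tags

sample = '<p><a href="/index.html">x</a></p><br /><hr class="foo" />'
print(open_tags(sample))
```

Because the parser tokenizes attributes itself, a slash inside an attribute value such as `href="/index.html"` does not confuse it. A bare `<br>` with no trailing slash would still be reported as an opening tag, so any list of void elements to exclude would need separate handling.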
