Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
388 views
in Technique[技术] by (71.8m points)

php - How do I grab just the parsed Infobox of a wikipedia article?

I'm still stuck on my problem of trying to parse articles from wikipedia. Actually I wish to parse the infobox section of articles from wikipedia i.e. my application has references to countries and on each country page I would like to be able to show the infobox which is on corresponding wikipedia article of that country. I'm using php here - I would greatly appreciate it if anyone has any code snippets or advice on what should I be doing here.

Thanks again.


EDIT

Well I have a db table with names of countries. And I have a script that takes a country and shows its details. I would like to grab the infobox - the blue box with all country details images etc as it is from wikipedia and show it on my page. I would like to know a really simple and easy way to do that - or have a script that just downloads the information of the infobox to a local remote system which I could access myself later on. I mean I'm open to ideas here - except that the end result I want is to see the infobox on my page - of course with a little Content by Wikipedia link at the bottom :)


EDIT

I think I found what I was looking for on http://infochimps.org - they got loads of datasets in I think the YAML language. I can use this information straight up as it is but I would need a way to constantly update this information from wikipedia now and then although I believe infoboxes rarely change especially o countries unless some nation decides to change their capital city or so.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I'd use the wikipedia (wikimedia) API. You can get data back in JSON, XML, php native format, and others. You'll then still need to parse the returned information to extract and format the info you want, but the info box start, stop, and information types are clear.

Run your query for just rvsection=0, as this first section gets you the material before the first section break, including the infobox. Then you'll need to parse the infobox content, which shouldn't be too hard. See en.wikipedia.org/w/api.php for the formal wikipedia api documentation, and www.mediawiki.org/wiki/API for the manual.

Run, for example, the query: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...