Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
154 views
in Technique[技术] by (71.8m points)

c# - description on each constans specified in pdfname, since i need to be able to retrieve both images and text at the same time

i am having a trouble in retrieving images and text in a pdf file at the same, i was able to get images and text in a pdf file but not at the same time (this will cause a question of whether to render the image first or the text first for example in my panel control?), maybe if you guys can help me define what does each constants in pdfname means? i tried using pdfname.all but it returns null, but when using pdfname.resources it returns procset, font and xobject. i used xobject for image, but what are procset and font (could this be the style of the text? does it have pdfname.text for retrieving text)?

thanks in advance.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

First of all,

i am having a trouble in retrieving images and text in a pdf file at the same

for this task you should use the iText(Sharp) parser API. In iTextSharp you essentially implement IRenderListener (an interface with methods for being informed about (bitmap) images and text fragments in a content stream) and process the page contents with it:

PdfReader reader = new PdfReader(...);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
int pageNumber = [... the number of the page you are interested in; may be a loop variable ...];

IRenderListener listener = new [... your IRenderListener implementation ...]
parser.ProcessContent(pageNumber, listener);

You ask

whether to render the image first or the text first for example in my panel control

The IRenderListener methods also retrieve information on the location of the bitmap or text fragment in question.

For ideas how the text fragments may be combined in your listener, you may want to be inspired by the implementations SimpleTextExtractionStrategy or LocationTextExtractionStrategy present in iTextSharp.

If you insist on doing it manually, though...

maybe if you guys can help me define what does each constants in pdfname means?

You find the definitions of what the names map to in the PDF specification ISO 32000-1:2008 a copy of which Adobe made available here.

when using pdfname.resources it returns procset, font and xobject. i used xobject for image, but what are procset and font (could this be the style of the text?

The contents of the page Resource Dictionaries are explained in section 7.8.3 of the specification.

does it have pdfname.text for retrieving text)?

You'll find how test is presented in page content streams and xobjects in section 9.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...