Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
381 views
in Technique[技术] by (71.8m points)

java - Getting Text Colour with PDFBox

I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text itself that I am extracting. However I cannot seem to find any way of getting that information.

Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?

Many thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

All color informations should be stored in the class PDGraphicsState and the used color (stroking/nonstroking etc.) depends on the used text rendering mode (via pdfbox mailing list).

Here is a small sample I tried:

After creating a pdf with just one line ("Sample" written in RGB=[146,208,80]), the following program will output:

DeviceRGB 146.115 208.08 80.07

Here's the code:

PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        System.out.println(c * 255);
    }
}
finally {
    if (doc != null) {
        doc.close();
    }

Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.

As I understand it, as PDFStreamEngine processes a page stream, it sets various variable states depending on what operators it is processing at the moment. So when it hits green text, it will change the PDGraphicsState because it will encounter appropriate operators. So for CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace as defined by mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor and so on.

In this case, the PDGraphicsState hasn't changed because the document has just text and the text it has is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when color changes. You could also write your own mappings in your own .properties file.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...