Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c# - How to read text of appearance stream?

I have a PDF where the text shown in an annotation (as rendered in Adobe Reader) differs from what its /Contents and /RC entries contain. This is related to the problem I was dealing with in this question:

Can't change /Contents of annotation

In this case, instead of changing the appearance to match the annotation's contents, I want to do the opposite: get the appearance text and change the /Contents and /RC values to match. E.g., if the annotation displays "appearance" and /Contents is set to "content", I want to do something like:

void setContent(PdfDictionary dict)
{
 PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));
 dict.Put(PdfName.CONTENTS,str);
}

But I can't find where the appearance text is stored. I got the dictionary referenced by /AP with this code:

private PdfDictionary getAPAnnot(PdfArray annotArray, PdfDictionary annot)
{
    PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
    if (apDict != null)
    {
        PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
        PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);
        return apRefDict;
    }
    else
    {
        return null;
    }
}

This dictionary contains the following entries:

{[/BBox, [-38.7578, -144.058, 62.0222, 1]]} 
{[/Filter, /FlateDecode]}   
{[/Length, 172]}    
{[/Matrix, [1, 0, 0, 1, 0, 0]]} 
{[/Resources, Dictionary]}

/Resources has indirect references to the fonts, but no contents. So it seems that the appearance stream doesn't include content data.

Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?


1 Reply


Unfortunately, the OP has not provided a sample PDF. Considering his previous question, though, he is most likely interested in free text annotations, so I use a sample PDF of that kind here. It has one page with a typewriter free text annotation that looks like this:

[Screenshot of sampleTypewriter.pdf]


The OP asked:

Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?

The major shortcoming of the OP's code is that it treats the normal appearance as a plain PdfDictionary:

PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);

It actually is a PdfStream, i.e. a dictionary with a data stream, and this data stream is where the appearance drawing instructions are located.
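As a rough sketch of the difference, the normal appearance can be fetched as a stream and its decoded bytes dumped for inspection. This assumes iTextSharp 5.x and an annotation whose /N entry is a stream rather than a sub-dictionary; the class and method names are made up for illustration.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class AppearanceStreamDump
{
    // Returns the decoded drawing instructions of the normal appearance (/AP /N),
    // or null if the annotation has no /AP or /N is not a stream.
    public static string GetNormalAppearanceContent(PdfDictionary annot)
    {
        PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
        if (apDict == null)
            return null;

        // GetAsStream resolves the indirect reference and checks the object type.
        PdfStream normal = apDict.GetAsStream(PdfName.N);
        if (normal == null)
            return null;

        // Applies the stream filters (e.g. /FlateDecode) and returns the raw content.
        byte[] bytes = ContentByteUtils.GetContentBytesFromContentObject(normal);

        // ASCII is enough to read the operators; string operands in other
        // encodings will show up as '?' here, so this is for inspection only.
        return Encoding.ASCII.GetString(bytes);
    }
}
```

The result is the kind of operator listing shown further down, not the annotation text itself.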

But even with this data stream at hand, it is not as simple as imagined by the OP:

PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));

Actually the text in the appearance stream can be drawn in pieces, e.g. in my sample file the stream data look like this:

0 w
131.2646 564.8243 180.008 30.984 re
n
q
1 0 0 1 0 0 cm
131.2646 564.8243 180.008 30.984 re
W
n
0 g
1 w
BT
/Cour 12 Tf
0 g
131.265 587.96 Td
(This ) Tj
35.999 0 Td
(is ) Tj
21.6 0 Td
(written ) Tj
57.599 0 Td
(using ) Tj
43.2 0 Td
(the ) Tj
-158.398 -16.3 Td
(typewriter ) Tj
79.199 0 Td
(tool.) Tj
ET
Q

Furthermore, the font encoding need not be a standard encoding as it is here; an embedded font can define its own encoding ad hoc, in which case the string bytes in the stream are not readable text at all.

Thus, one has to apply full-fledged text extraction.
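To illustrate why a shortcut is fragile, here is a deliberately naive sketch that just collects the literal string operands of the Tj operators with a regular expression. It needs no PDF library; the class name, method name and Sample constant are made up for this illustration, and the sample text is the BT ... ET part of the stream above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; it only understands "(literal) Tj".
public static class NaiveTjDemo
{
    public const string Sample =
        "BT\n/Cour 12 Tf\n0 g\n131.265 587.96 Td\n(This ) Tj\n35.999 0 Td\n(is ) Tj\n" +
        "21.6 0 Td\n(written ) Tj\n57.599 0 Td\n(using ) Tj\n43.2 0 Td\n(the ) Tj\n" +
        "-158.398 -16.3 Td\n(typewriter ) Tj\n79.199 0 Td\n(tool.) Tj\nET";

    // Collects the operands of Tj operators in stream order.
    // Ignores escapes, nested parentheses, hex strings, TJ arrays and font encodings.
    public static List<string> ShowTextOperands(string content)
    {
        var pieces = new List<string>();
        foreach (Match m in Regex.Matches(content, @"\(([^()\\]*)\)\s*Tj"))
        {
            pieces.Add(m.Groups[1].Value);
        }
        return pieces;
    }
}
```

For the sample, string.Concat(NaiveTjDemo.ShowTextOperands(NaiveTjDemo.Sample)) gives "This is written using the typewriter tool." on a single line: the line break is lost because it exists only as the Td offset "-158.398 -16.3", and with a custom font encoding the pieces would not even be readable. A real extraction strategy accounts for both.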

All of this can be implemented as follows:

for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    Console.Write("\nPage {0}\n", page);
    PdfDictionary pageDictionary = pdfReader.GetPageNRelease(page);
    PdfArray annotsArray = pageDictionary.GetAsArray(PdfName.ANNOTS);
    if (annotsArray == null || annotsArray.IsEmpty())
    {
        Console.Write("  No annotations.\n");
        continue;
    }
    foreach (PdfObject pdfObject in annotsArray)
    {
        // Resolve indirect references to the actual annotation dictionary.
        PdfObject direct = PdfReader.GetPdfObject(pdfObject);
        if (direct.IsDictionary())
        {
            PdfDictionary annotDictionary = (PdfDictionary)direct;
            Console.Write("  SubType: {0}\n", annotDictionary.GetAsName(PdfName.SUBTYPE));
            PdfDictionary appearancesDictionary = annotDictionary.GetAsDict(PdfName.AP);
            if (appearancesDictionary == null)
            {
                Console.Write("    No appearances.\n");
                continue;
            }
            // Keys are the appearance types present, e.g. /N, /R, /D.
            foreach (PdfName key in appearancesDictionary.Keys)
            {
                Console.Write("    Appearance: {0}\n", key);
                PdfStream value = appearancesDictionary.GetAsStream(key);
                if (value != null)
                {
                    String text = ExtractAnnotationText(value);
                    Console.Write("    Text:\n---\n{0}\n---\n", text);
                }
            }
        }
    }
}

using this helper method (the parser classes live in the iTextSharp.text.pdf.parser namespace):

public String ExtractAnnotationText(PdfStream xObject)
{
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
    processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);
    return strategy.GetResultantText();
}

For the sample file above, the code prints:

Page 1
  SubType: /FreeText
    Appearance: /N
    Text:
---
This is written using the 
typewriter tool.
---
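To get back to the original goal, the extracted text can then be written into /Contents. The following is only a minimal sketch, assuming iTextSharp 5.x; the class and method names are made up for illustration. It leaves /RC untouched, so rich text stored there may still disagree with the appearance, and the change only reaches a file once the document is saved, for example with a PdfStamper.

```csharp
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class ContentsSync
{
    // Overwrites /Contents of the annotation with the text extracted
    // from its appearance stream. /RC is deliberately not modified.
    public static void SetContentsFromAppearance(PdfDictionary annot, string appearanceText)
    {
        // TEXT_UNICODE stores the string as UTF-16BE, so non-ASCII text survives.
        annot.Put(PdfName.CONTENTS, new PdfString(appearanceText, PdfObject.TEXT_UNICODE));
    }
}
```

In the loop above this could be called as ContentsSync.SetContentsFromAppearance(annotDictionary, text) for the /N appearance.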

Beware: some annotations, in particular the widget annotations of checkboxes and radio buttons, have a slightly deeper structure than the code here expects.
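For such annotations an /AP entry like /N is not a stream but a dictionary that maps appearance state names (for example /Off) to streams. A possible way to cover both cases is sketched below, assuming iTextSharp 5.x; the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class AppearanceStreams
{
    // Yields every appearance stream under /AP, labelled "/N" for a direct
    // stream or "/N /Off" for a stream inside an appearance state dictionary.
    public static IEnumerable<KeyValuePair<string, PdfStream>> Enumerate(PdfDictionary appearancesDictionary)
    {
        foreach (PdfName key in appearancesDictionary.Keys)
        {
            // Simple case: the entry itself is the appearance stream.
            PdfStream stream = appearancesDictionary.GetAsStream(key);
            if (stream != null)
            {
                yield return new KeyValuePair<string, PdfStream>(key.ToString(), stream);
                continue;
            }

            // Deeper case: the entry maps state names to appearance streams.
            PdfDictionary states = appearancesDictionary.GetAsDict(key);
            if (states == null)
                continue;
            foreach (PdfName state in states.Keys)
            {
                PdfStream stateStream = states.GetAsStream(state);
                if (stateStream != null)
                    yield return new KeyValuePair<string, PdfStream>(key + " " + state, stateStream);
            }
        }
    }
}
```

Each yielded stream can be passed to ExtractAnnotationText in the same way as in the loop above.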

