iTextSharp can't extract PDF Unicode content in C#

Question


posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)


I am trying to get the content of a PDF file using iTextSharp, as you can see:

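// Required namespaces (iTextSharp 5.x):
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static void Main(string[] args)
{
    StringBuilder text = new StringBuilder();
    using (PdfReader reader = new PdfReader(@"D:\a.pdf"))
    {
        // PDF page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }
    }
    System.IO.File.WriteAllText(@"c:/a.txt", text.ToString());
    Console.ReadLine();
}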

My PDF content is written in Persian, and after running the above code the extracted text is not correct. Should I set any option in iTextSharp?


1 Reply

深蓝 · Answer 1 · 2022-01-31T07:15:51+0000

It is hard to say without the original file, but if characters or words come out in the wrong position, you should try using LocationTextExtractionStrategy, like this:

text.Append(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));
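
For context, here is a minimal sketch of how that call might fit into the whole program, assuming iTextSharp 5.x. The class name PdfTextDump, the helper ExtractText, and the command-line paths are hypothetical and not part of the original post. It creates a new strategy for each page and writes the output as UTF-8. Whether this fixes Persian text depends on the PDF itself, so treat it as something to try rather than a guaranteed fix.

```csharp
using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfTextDump // hypothetical class name
{
    // Extracts the text of every page with a location-based strategy.
    static string ExtractText(string pdfPath)
    {
        var text = new StringBuilder();
        using (var reader = new PdfReader(pdfPath))
        {
            // PDF page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A strategy instance accumulates text, so use a new one per page.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        return text.ToString();
    }

    static void Main(string[] args)
    {
        // Hypothetical usage: PdfTextDump.exe <input.pdf> <output.txt>
        string pdfPath = args[0];
        string txtPath = args[1];

        // Write the result explicitly as UTF-8 so non-Latin characters are preserved.
        File.WriteAllText(txtPath, ExtractText(pdfPath), Encoding.UTF8);
    }
}
```

If the output is still wrong with this strategy, the cause may lie in how the PDF's fonts map glyphs to Unicode, which no extraction strategy can repair on its own.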
