Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - What is the best way to read data from a table in a PDF?

I want to read the data from the table in this PDF.

I had thought about reading the PDF, exporting it to Excel, and then using the data. The problem with this approach is that some column values shift into empty columns, because I read it with Apache POI and that way the whole PDF ends up in a single string.

Another approach would be to read the data at exact coordinates, but I do not think that is a very good option.

Could someone advise me? Which of these approaches is better, or is there another way?



1 Reply


I've had the best luck using Xpdf pdftotext with a combination of the -layout and -table options.

You would call it like this:

pdftotext -table c:\temp\ENaB20180317.pdf c:\temp\output.txt
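Since the question is about Java, here is a minimal sketch of launching that same command from Java with `ProcessBuilder`. It assumes the `pdftotext` executable is on the `PATH`; the class and method names (`PdfToTextRunner`, `buildCommand`, `run`) are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class PdfToTextRunner {

    // Builds the argument list for: pdftotext -table <pdf> <txt>
    // Passing arguments as a list avoids quoting problems with spaces in paths.
    static List<String> buildCommand(String pdfPath, String txtPath) {
        return Arrays.asList("pdftotext", "-table", pdfPath, txtPath);
    }

    // Runs pdftotext and returns its exit code.
    // Assumption: "pdftotext" can be resolved through the PATH.
    static int run(String pdfPath, String txtPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath, txtPath))
                .inheritIO()   // forward the tool's output and errors to this console
                .start();
        return process.waitFor();
    }
}
```

A call such as `PdfToTextRunner.run(pdfPath, txtPath)` blocks until the tool exits; a non-zero return value should be treated as a failure before trying to read the text file.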

You could then parse by getting the starting column position from the header on each page.
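The header-based parsing could look roughly like the sketch below. It is only an illustration under stated assumptions: header labels are separated by at least two spaces (so a label like "Unit Price" stays in one piece), and each cell starts at or after its header's start column. The names `FixedWidthTable`, `columnStarts` and `splitRow` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedWidthTable {

    // Finds the index where each header label starts.
    // Assumption: a new column begins at a non-space character that follows
    // two or more spaces (or the beginning of the line).
    static List<Integer> columnStarts(String header) {
        List<Integer> starts = new ArrayList<>();
        int spaces = 2; // treat the beginning of the line as a wide gap
        for (int i = 0; i < header.length(); i++) {
            if (header.charAt(i) == ' ') {
                spaces++;
            } else {
                if (spaces >= 2) {
                    starts.add(i);
                }
                spaces = 0;
            }
        }
        return starts;
    }

    // Cuts a data row at the header's column starts and trims each cell.
    // An empty cell stays an empty string instead of shifting later values left.
    static List<String> splitRow(String row, List<Integer> starts) {
        List<String> cells = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            int from = Math.min(starts.get(c), row.length());
            int to = (c + 1 < starts.size())
                    ? Math.min(starts.get(c + 1), row.length())
                    : row.length();
            cells.add(row.substring(from, to).trim());
        }
        return cells;
    }
}
```

Because the cut positions come from the header rather than from whitespace in the row itself, a blank cell does not cause the following values to slide into the wrong column. Right-aligned columns whose values begin to the left of the header label would need the boundaries adjusted.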

Another good option is PDFBox; it may extract the text in a format you can use without having to call a separate command-line app.

