Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Error trying to read a PDF using readPDF from the tm package

(Windows 7 / R version 3.0.1)

Below are the commands and the resulting error:

> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory

How do I solve this issue?


EDIT I

(As suggested by Ben and described here)

I downloaded Xpdf and copied the 32-bit version to C:\Program Files (x86)\xpdf32 and the 64-bit version to C:\Program Files\xpdf64.

The environment variables pdfinfo and pdftotext point to the respective executables: the 32-bit ones when tested with 32-bit R, the 64-bit ones when tested with 64-bit R.
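
Judging from the function body printed further down, tm calls system2("pdftotext", ...) by bare name, so the executables most likely need to be reachable through PATH rather than through variables named pdfinfo and pdftotext. A minimal sketch of how that could be checked from within R, assuming the Xpdf directory given above (xpdf_dir is an illustrative variable name, not part of tm):

```r
# Directory holding the Xpdf executables (path taken from the question; adjust as needed)
xpdf_dir <- "C:/Program Files/xpdf64"

# Prepend it to PATH for the current R session only
Sys.setenv(PATH = paste(xpdf_dir, Sys.getenv("PATH"), sep = .Platform$path.sep))

# Sys.which() returns the full path of each executable,
# or an empty string if it cannot be found on PATH
found <- Sys.which(c("pdfinfo", "pdftotext"))
print(found)
```

If either entry comes back empty, readPDF has no way of launching that program, whatever the other environment variables say.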


EDIT II

One very confusing observation is that, starting from a fresh session (tm not loaded), the last command alone produces the error:

> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL\pdfinfode8283c422f': No such file or directory

I don't understand this at all, because at that point pdf has not yet been assigned the function returned by tm's readPDF. Below is what pdf refers to "naturally" in the fresh session, followed by what readPDF returns:

> pdf

function (elem, language, id) 
{
    meta <- tm:::pdfinfo(elem$uri)
    content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri), 
        "-"), stdout = TRUE)
    PlainTextDocument(content, meta$Author, meta$CreationDate, 
        meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>

> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf

function (elem, language, id) 
{
    meta <- tm:::pdfinfo(elem$uri)
    content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri), 
        "-"), stdout = TRUE)
    PlainTextDocument(content, meta$Author, meta$CreationDate, 
        meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>

Apparently there is no difference - then why use readPDF at all?


EDIT III

The PDF file is located here: C:\Users\Raffael\Documents

> getwd()
[1] "C:/Users/Raffael/Documents"

EDIT IV

The first instruction in pdf() is a call to tm:::pdfinfo(), and the error arises within its first few lines:

> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")), 
+                   stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator", 
+           "Producer", "CreationDate", "ModDate", "Tagged", "Form", 
+           "Pages", "Encrypted", "Page size", "File size", "Optimized", 
+           "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:", 
+                                                       tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or directory

Apparently tempfile() simply doesn't create a file.

> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d437bd65d9"

The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and holds some files, but none is named pdfinfo8d437bd65d9.
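
For what it is worth, tempfile() only returns a unused file name and never writes anything to disk. In the code above the file would be created by system2() redirecting the output of pdfinfo into it, so a missing file suggests that pdfinfo never ran. The following is a minimal sketch of a more defensive version of that step; read_command_output is a hypothetical helper written for illustration, not a function from tm:

```r
# tempfile() returns a name only; no file exists at that path yet
outfile <- tempfile("pdfinfo")
name_only <- !file.exists(outfile)

# Hypothetical helper: run a command, capture stdout in a temporary file,
# and read it back only if the command could be found and succeeded
read_command_output <- function(command, args = character()) {
  out <- tempfile("out")
  on.exit(unlink(out))
  if (!nzchar(Sys.which(command))) {
    stop("executable not found on PATH: ", command)
  }
  status <- system2(command, args, stdout = out)
  if (status != 0 || !file.exists(out)) {
    stop("command failed with status ", status)
  }
  readLines(out, warn = FALSE)
}
```

Calling read_command_output("pdfinfo", shQuote("17214.pdf")) would then fail with a message about the missing executable instead of the less obvious "cannot open the connection".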

1 Reply

Interesting: on my machine, after a fresh start, pdf is the grDevices graphics device function that writes plots to a PDF file:

> getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
  package:grDevices
  namespace:grDevices [etc.]
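
One possible explanation for the difference, stated as an assumption since the question does not confirm it: if R restored a saved workspace (.RData) at startup, a pdf object created by readPDF in an earlier session would still sit in the global environment and mask grDevices::pdf. A minimal sketch of how one might check this:

```r
# TRUE if a user-level object named `pdf` exists in the global environment
# (inherits = FALSE keeps the search from falling through to attached packages)
masked <- exists("pdf", envir = globalenv(), inherits = FALSE)

# The graphics device itself lives in the grDevices namespace
device_home <- environmentName(environment(grDevices::pdf))

if (masked) {
  message("A global `pdf` masks grDevices::pdf; rm(pdf) would remove it.")
} else {
  message("`pdf` resolves to the ", device_home, " graphics device.")
}
```

Giving the reader function a different name, for example pdf_reader <- readPDF(...), would avoid the clash altogether.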

But back to the problem of reading in PDF files as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes here.

In your case it would be (note the two sets of quotes):

system(paste('"C:/Program Files/xpdf64/pdftotext.exe"', 
             '"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)

This could easily be extended with an *apply function or loop if you have many PDF files.
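
A minimal sketch of such an extension, assuming the executable and folder paths used above. txt_path and convert_pdf are hypothetical helper names, and system2 is swapped in for system so that the exit status of each conversion can be checked:

```r
# Path to pdftotext (taken from the answer above; adjust to your installation)
exe <- "C:/Program Files/xpdf64/pdftotext.exe"

# Hypothetical helper: name of the text file belonging to a PDF
txt_path <- function(pdf) {
  sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
}

# Hypothetical helper: convert one PDF and return the path of its text file
convert_pdf <- function(pdf, exe) {
  txt <- txt_path(pdf)
  status <- system2(exe, args = c("-layout", shQuote(pdf), shQuote(txt)))
  if (status != 0) warning("pdftotext failed for ", pdf)
  txt
}

# Every PDF in the folder, with full paths
pdfs <- list.files("C:/Users/Raffael/Documents", pattern = "\\.pdf$",
                   full.names = TRUE, ignore.case = TRUE)

# Convert them one by one, collecting the names of the text files
txt_files <- vapply(pdfs, convert_pdf, character(1), exe = exe)
```

Each text file can afterwards be read with readLines(). Unlike the wait=FALSE call above, system2 waits for every conversion to finish, so the files exist by the time vapply returns.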

