textutil convert PDF to txt producing garbled output
I am trying to convert PDF files to text files using textutil. I don't know whether there are special types of PDFs that can and cannot be converted. The files I am trying to convert are searchable, which I assume is a minimum requirement. When I convert a file, the resulting text document is completely garbled. Here is my command:
textutil -convert txt example.pdf
Here are some of the first lines, in case that helps to identify where I am going wrong:
%PDF-1.3
%ƒÂÚÂÎßÛ†–ƒ∆
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
xÌõYè‹∏«flı)8>2”„å,R%Ÿªõ¯fixs9ôM‚<YÅ`„Ô‰W,J¢‘íF3”@^2Z›<ädˇ:(ˇl>òüçuπ´Í¶ñ¶nõº.⁄⁄
4>~˘œ?Ã_ÕøÕ”W_≠˘Ù’·fl◊OL.ò´øÂKI5ÖÀª∫*≥O_ÃÀk”‘aH|1OØØù
±Ê˙'sqv0◊ˇ2oÆ√Vñ©˘÷Êmy2jæ»;P+Ú¢(*s˝ikó3>z¸ãõæ8;èè˙΄·ê—z~=|
¯D˝rËî)WÈå<˝¡ÒˇnÆfl/3¿’UnõÆ4~∫Á;Ú”µ≠J˙4‰JWùîgz8€]êªA@g¸≠kRŸ¯‹÷ùàëeÁÔπUŸÓ÷Ü´≤Œ
I'm guessing it has to do with some encoding feature -- not my area of expertise, so any assistance would be greatly appreciated!
pdf conversion text
Have you tried your luck with poppler's pdftotext/pdftohtml? Not every PDF can be converted that way. See if you can search for text in the PDF; if that doesn't work, you might have to resort to OCR.
– frostschutz
Mar 31 '15 at 14:59
The lines that you show here are the PDF itself.
– A.B.
Mar 31 '15 at 15:45
@frostschutz That alternative worked perfectly. After installing it, I ran pdftotext example.pdf and it produced exactly what I needed. Feel free to suggest it in an answer and I will accept!
– Brian P
Mar 31 '15 at 16:05
edited Mar 31 '15 at 14:38 by Brian P
asked Mar 31 '15 at 14:23 by Brian P
1 Answer
According to the TEXTUTIL(1) manual page, PDF is not among the formats this utility handles:
fmt is one of: txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive
On Linux/Unix, installing a scriptable tool such as Xpdf's pdftotext may be a valid solution, as a comment already suggested.
On OS X, text can be extracted from a PDF with a native Automator action (see this answer or the last 4 minutes of this tutorial). An Automator workflow can then be scripted from the command line with the automator command.
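The garbled output in the question begins with %PDF-1.3, which suggests the PDF bytes were simply passed through as text. A minimal sketch of how a script could guard against that, assuming head -c is available and pdftotext (from poppler or Xpdf) is installed. The function names is_pdf and convert_to_text are hypothetical and used only for illustration:

```shell
# Succeed if the file starts with the PDF signature "%PDF-".
is_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Hypothetical helper: send PDFs to pdftotext and everything else to textutil.
# Assumes the chosen tool is on PATH; textutil exists only on OS X.
convert_to_text() {
    if is_pdf "$1"; then
        # Writes example.txt next to example.pdf.
        pdftotext "$1" "${1%.pdf}.txt"
    else
        textutil -convert txt "$1"
    fi
}
```

Calling convert_to_text example.pdf would then run pdftotext instead of handing the PDF to textutil.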
edited Apr 13 '17 at 12:45 by Community♦
answered Jul 31 '15 at 9:04 by Franco Rondini