RetroBBS - comp.text.pdf

pdf grep?

<uujj10$3tv68$2@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=331&group=comp.text.pdf#331

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: dieterhansbritz@gmail.com (db)
Newsgroups: comp.text.pdf
Subject: pdf grep?
Date: Wed, 3 Apr 2024 12:45:20 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 3
Message-ID: <uujj10$3tv68$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 03 Apr 2024 12:45:20 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="3fc59fc5b164e30958ab4c2ac5ec4c56";
logging-data="4127944"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+iL0HfbMqGhtmHWui5lehrmonfk8Da64w="
User-Agent: Pan/0.149 (Bellevue; 4c157ba)
Cancel-Lock: sha1:mOOCkst187rse+NF/UCsNrLSnDg=

by: db - Wed, 3 Apr 2024 12:45 UTC

Re: pdf grep?

<XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>

https://www.rocksolidbbs.com/computers/article-flat.php?id=332&group=comp.text.pdf#332

Path: i2pn2.org!i2pn.org!news.neodome.net!npeer.as286.net!npeer-ng0.as286.net!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!Xl.tags.giganews.com!local-1.nntp.ord.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Wed, 03 Apr 2024 14:03:37 +0000
MIME-Version: 1.0
From: heller@deepsoft.com (Robert Heller)
Organization: Deepwoods Software
X-Newsreader: TkNews 3.0 (1.2.17)
Subject: Re: pdf grep?
In-Reply-To: <uujj10$3tv68$2@dont-email.me>
References: <uujj10$3tv68$2@dont-email.me>
Newsgroups: comp.text.pdf
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset="us-ascii"
Originator: heller@sharky4.deepsoft.com
Message-ID: <XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>
Date: Wed, 03 Apr 2024 14:03:37 +0000
Lines: 24
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-48zvgird97A9/ITu+Drx0XzrDBTk43LIy4R0e+edm9E5pljQW9kbEU7P8w/DXZvJ1v3OuRqcfYE+7Ux!ihML7eQjlYz0d+KRgDNfiz0Y0QvgmG2I1bvsRaQmEfYJXrNuf465pgMgXq5+PTbpJdC1/+R4BT4i!qHQ=
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Received-Bytes: 2349

by: Robert Heller - Wed, 3 Apr 2024 14:03 UTC

Plain grep may partly work on PDF files. You might also want to run the
strings command first to get "clean" strings. Note: *some* PDF files are just
images (no actual text). These would be PDFs created by scanning a document
(without using OCR). Also, many typesetting programs (TeX/LaTeX, word processors, etc.)
might do some typesetting "magic" (eg ligitures, etc.) that might make things
hard for grep.
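
As an illustration of the ligature problem (a sketch added for clarity, not part of the original post): text extracted from a PDF can contain single ligature code points such as U+FB01 instead of the letters "f" and "i". One common way to cope, assuming the text has already been extracted, is Unicode NFKC normalization before matching. The function name `contains_phrase` and the sample string are made up for this example.

```python
import unicodedata


def contains_phrase(extracted_text: str, phrase: str) -> bool:
    """Case-insensitive substring test that tolerates ligature code points."""
    # NFKC replaces compatibility characters such as U+FB01 (fi ligature)
    # and U+FB02 (fl ligature) with their plain-letter equivalents.
    haystack = unicodedata.normalize("NFKC", extracted_text).casefold()
    needle = unicodedata.normalize("NFKC", phrase).casefold()
    return needle in haystack


# Hypothetical extractor output containing two ligature code points.
sample = "The \ufb01nal \ufb02ow chart"

print("final" in sample)                 # naive substring test on the raw text
print(contains_phrase(sample, "final"))  # same test after normalization
```

Normalization only helps when the extractor emits real Unicode ligature characters; fonts with custom encodings can still defeat it.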

xpdf includes a text search button as part of its UI.

At Wed, 3 Apr 2024 12:45:20 -0000 (UTC) db <dieterhansbritz@gmail.com> wrote:

>
> Under Linux, I can use grep to search a bunch of
> files for a character string. Is there an equivalent
> command for searching pdf files?
>
>

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
heller@deepsoft.com -- Webhosting Services

Re: pdf grep?

<grep-20240403151634@ram.dialup.fu-berlin.de>

https://www.rocksolidbbs.com/computers/article-flat.php?id=333&group=comp.text.pdf#333

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: 3 Apr 2024 14:17:22 GMT
Organization: Stefan Ram
Lines: 9
Expires: 1 Feb 2025 11:59:58 GMT
Message-ID: <grep-20240403151634@ram.dialup.fu-berlin.de>
References: <uujj10$3tv68$2@dont-email.me> <XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de zgDP1DF/5Ws8fAzvm+jV3gsZmKXxjvsOPlbppt0MFxgcJu
Cancel-Lock: sha1:yCzZ3e5JZEsdgmg5fv/Is93ItZ0= sha256:i7ow4sk1NyC22+jTMgeXzJE4XOOVb5Rjg2b+P7Vtnn0=
X-Copyright: (C) Copyright 2024 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US

by: Stefan Ram - Wed, 3 Apr 2024 14:17 UTC

Robert Heller <heller@deepsoft.com> wrote or quoted:
>might do some typesetting "magic" (eg ligitures, etc.) that might make things

"ligatures"

Text in PDFs is sometimes compressed. So one can either use
programs like "Agent Ransack" to search for text in PDFs or
tools like "pdftotext" to first create a text file for every
PDF file and then grep those text files.
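
To make the compression point concrete, here is a small Python sketch (an added illustration, not a real PDF parser). It deflate-compresses a made-up content stream, which is how the common FlateDecode filter stores page content, and checks for the phrase in the stored bytes and again after decompression. The stream contents are invented, and real PDFs add font encodings that this sketch ignores.

```python
import zlib

phrase = b"searchable phrase"

# A made-up page content stream, repeated so that it compresses well.
content = b"BT /F1 12 Tf (searchable phrase) Tj ET\n" * 3

# FlateDecode streams are zlib/deflate data.
stored = zlib.compress(content)

raw_hit = phrase in stored                       # what a byte-level grep sees
decoded_hit = phrase in zlib.decompress(stored)  # what a PDF-aware tool sees

print("found in stored bytes:", raw_hit)
print("found after decoding: ", decoded_hit)
```

This is the work that tools like pdftotext do before any text matching can happen.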

Re: pdf grep?

<87wmpe1q79.fsf@vagabond.tim-landscheidt.de>

https://www.rocksolidbbs.com/computers/article-flat.php?id=334&group=comp.text.pdf#334

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: tim@tim-landscheidt.de (Tim Landscheidt)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: Wed, 03 Apr 2024 14:22:18 +0000
Organization: https://www.tim-landscheidt.de/
Lines: 10
Message-ID: <87wmpe1q79.fsf@vagabond.tim-landscheidt.de>
References: <uujj10$3tv68$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: individual.net bkGqySacOKNlh+IW18qhaA2eO8wtu4dY0XzSj/7VbNdjhmQhJj
Cancel-Lock: sha1:hK1coaA6m4l4oneREIlMZEtYMQ0= sha1:j0wI1pbYO5wBaF3wPsJfM6+SJTs= sha256:IOivHOMSEq3/latdjC7ui0lScvW0efRuF1VTSfEWvok=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.3 (gnu/linux)

by: Tim Landscheidt - Wed, 3 Apr 2024 14:22 UTC

db <dieterhansbritz@gmail.com> wrote:

> Under Linux, I can use grep to search a bunch of
> files for a character string. Is there an equivalent
> command for searching pdf files?

You can use pdfgrep (https://pdfgrep.org/) for that. It is
available as a package in Fedora and Debian as well.
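
For readers who want to script this, a minimal Python sketch follows (added here as an illustration). It assumes pdfgrep is installed and on the PATH, and uses its `-r` (recurse into a directory), `-i` (ignore case) and `-n` (show page numbers) options. The helper names `pdfgrep_command` and `run_pdfgrep` are made up for this example.

```python
import subprocess


def pdfgrep_command(phrase: str, directory: str = ".") -> list[str]:
    """Build a pdfgrep invocation that searches all PDFs under a directory."""
    # -r: recurse into the directory, -i: ignore case,
    # -n: prefix each match with its page number
    return ["pdfgrep", "-r", "-i", "-n", phrase, directory]


def run_pdfgrep(phrase: str, directory: str = ".") -> list[str]:
    """Run pdfgrep and return its output lines (empty if nothing matched)."""
    result = subprocess.run(
        pdfgrep_command(phrase, directory),
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


print(" ".join(pdfgrep_command("blabla")))
```

Calling `run_pdfgrep("blabla")` would then be the rough counterpart of `grep blabla *.txt`, applied to the PDFs below the current directory.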

Tim

Re: pdf grep?

<search-20240403152924@ram.dialup.fu-berlin.de>

https://www.rocksolidbbs.com/computers/article-flat.php?id=335&group=comp.text.pdf#335

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: 3 Apr 2024 14:29:40 GMT
Organization: Stefan Ram
Lines: 11
Expires: 1 Feb 2025 11:59:58 GMT
Message-ID: <search-20240403152924@ram.dialup.fu-berlin.de>
References: <uujj10$3tv68$2@dont-email.me> <XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com> <grep-20240403151634@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de jSb4m5gA5BEGBu6j4KzGsQFojIzpWOlFMo1rZplTjNmV4M
Cancel-Lock: sha1:9riBKCF8fxhQypFxUfn/Ae8mRx0= sha256:uxItC6vosBIu/EtCgJuJVq+iTlgZsa7lvosJso3jkRo=
X-Copyright: (C) Copyright 2024 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US

by: Stefan Ram - Wed, 3 Apr 2024 14:29 UTC

ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>Text in PDFs is sometimes compressed. So one can either use
>programs like "Agent Ransack" to search for text in PDFs or
>tools like "pdftotext" to first create a text file for every
>PDF file and then grep those text files.

PS: "Agent Ransack" is Windows software. "pdftotext" is also
available for Linux. Converting all PDFs to text files needs
to be done only once, and then search operations on those
text files are faster than scanning the PDF files for text
on every search!
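
The convert-once idea can be sketched in Python as follows (an added illustration under stated assumptions, not a tested tool). It assumes the `pdftotext` program is installed and is called as `pdftotext PDF-file text-file`; the cache directory layout and the function names `is_stale`, `refresh_cache` and `search_cache` are invented for this example.

```python
import subprocess
from pathlib import Path


def is_stale(pdf: Path, txt: Path) -> bool:
    """True if the cached text is missing or older than the PDF."""
    return not txt.exists() or txt.stat().st_mtime < pdf.stat().st_mtime


def refresh_cache(pdf_dir: Path, cache_dir: Path) -> None:
    """Run pdftotext only for PDFs whose cached text is stale."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        txt = cache_dir / (pdf.stem + ".txt")
        if is_stale(pdf, txt):
            # Assumed invocation: pdftotext PDF-file text-file
            subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)


def search_cache(cache_dir: Path, phrase: str) -> list[str]:
    """Return the names of PDFs whose cached text contains the phrase."""
    hits = []
    for txt in sorted(cache_dir.glob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="replace")
        if phrase.casefold() in text.casefold():
            hits.append(txt.stem + ".pdf")
    return hits
```

A typical session would call `refresh_cache(Path("."), Path(".pdftext"))` once in a while and `search_cache(Path(".pdftext"), "blabla")` for each search, so only new or changed PDFs are converted again.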

Re: pdf grep?

<uujs1s$7u0$3@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=336&group=comp.text.pdf#336

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: dieterhansbritz@gmail.com (db)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: Wed, 3 Apr 2024 15:19:24 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <uujs1s$7u0$3@dont-email.me>
References: <uujj10$3tv68$2@dont-email.me>
<XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>
<grep-20240403151634@ram.dialup.fu-berlin.de>
<search-20240403152924@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 03 Apr 2024 15:19:24 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="3fc59fc5b164e30958ab4c2ac5ec4c56";
logging-data="8128"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ruhJn43xRmQgQeqvwqBUNjdbeWethttw="
User-Agent: Pan/0.149 (Bellevue; 4c157ba)
Cancel-Lock: sha1:g5dkZRgK81gQJJzE7717xw5oqLg=

by: db - Wed, 3 Apr 2024 15:19 UTC

On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:

> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>>Text in PDFs is sometimes compressed. So one can either use programs
>>like "Agent Ransack" to search for text in PDFs or tools like
>>"pdftotext" to first create a text file for every PDF file and then grep
>>those text files.
>
> PS: "Agent Ransack" is Windows software. "pdftotext" is also available
> for Linux. Converting all PDFs to text files needs to be done only
> once, and then search operations on those text files are faster than
> scanning the PDF files for text on every search!

I should maybe have elaborated a bit. Sometimes I
remember a certain phrase or word but forget which
pdf it is in. With text files I can do
grep blabla *.txt
and I wanted an equivalent. Using pdftotext would
mean using it for every suspect pdf. Since a lot of
pdf files are searchable, I figured that such a
command might exist.
But if there really is a pdfgrep command, that might
do the job. I will do some googling, thanks.

Re: pdf grep?

<uult5m$iqkv$1@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=339&group=comp.text.pdf#339

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: dieterhansbritz@gmail.com (db)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: Thu, 4 Apr 2024 09:50:46 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <uult5m$iqkv$1@dont-email.me>
References: <uujj10$3tv68$2@dont-email.me>
<XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>
<grep-20240403151634@ram.dialup.fu-berlin.de>
<search-20240403152924@ram.dialup.fu-berlin.de> <uujs1s$7u0$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 04 Apr 2024 09:50:46 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="11d4a932091b4ca06ee40d12e4656fcc";
logging-data="617119"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lleGXsaB3j6A72oeVCsTr711vCdBSMX4="
User-Agent: Pan/0.149 (Bellevue; 4c157ba)
Cancel-Lock: sha1:8lzRqLoeAsmZW4nD79/DdzGer1A=

by: db - Thu, 4 Apr 2024 09:50 UTC

On Wed, 3 Apr 2024 15:19:24 -0000 (UTC), db wrote:

> On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:
>
>> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>>>Text in PDFs is sometimes compressed. So one can either use programs
>>>like "Agent Ransack" to search for text in PDFs or tools like
>>>"pdftotext" to first create a text file for every PDF file and then
>>>grep those text files.
>>
>> PS: "Agent Ransack" is Windows software. "pdftotext" is also
>> available for Linux. Converting all PDFs to text files needs to be
>> done only once, and then search operations on those text files are
>> faster than scanning the PDF files for text on every search!
>
> I should maybe have elaborated a bit. Sometimes I remember a certain
> phrase or word but forget which pdf it is in. With text files I can do
> grep blabla *.txt and I wanted an equivalent. Using pdftotext would mean
> using it for every suspect pdf. Since a lot of pdf files are searchable,
> I figured that such a command might exist.
> But if there really is a pdfgrep command, that might do the job. I will
> do some googling, thanks.

I installed pdfgrep in my Kubuntu system, but it is
not happy. Although the man file is there, even help
doesn't work:

$ pdfgrep --help
terminate called after throwing an instance of 'std::runtime_error'
what(): locale::facet::_S_create_c_locale name not valid
Aborted (core dumped)
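
An added diagnostic note, offered as a guess rather than a confirmed cause: this message is commonly produced by the C++ runtime when the locale named in the environment (for example via `LANG` or `LC_ALL`) is not installed on the system. The Python sketch below only checks whether the environment's locale can be activated; the function names are made up for this example, and it does not touch pdfgrep itself.

```python
import locale
import os


def locale_settings() -> dict[str, str]:
    """Locale-related environment variables that are currently set."""
    names = ("LC_ALL", "LC_CTYPE", "LANG", "LANGUAGE")
    return {name: os.environ[name] for name in names if name in os.environ}


def environment_locale_is_valid() -> bool:
    """Try to activate the locale that the environment asks for."""
    try:
        locale.setlocale(locale.LC_ALL, "")
        return True
    except locale.Error:
        # The C library rejected the requested locale name.
        return False


print("locale variables:", locale_settings())
print("locale usable:   ", environment_locale_is_valid())
```

If the check reports the locale as unusable, running the program with `LC_ALL=C` set, or generating the missing locale, may be worth trying.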

Re: pdf grep?

<l780vtFidhnU1@mid.individual.net>

https://www.rocksolidbbs.com/computers/article-flat.php?id=341&group=comp.text.pdf#341

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: peter@silmaril.ie (Peter Flynn)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: Thu, 4 Apr 2024 16:57:49 +0100
Organization: Usenet Labs Bozon Detector Facility
Lines: 11
Message-ID: <l780vtFidhnU1@mid.individual.net>
References: <uujj10$3tv68$2@dont-email.me>
<XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>
<grep-20240403151634@ram.dialup.fu-berlin.de>
<search-20240403152924@ram.dialup.fu-berlin.de> <uujs1s$7u0$3@dont-email.me>
<uult5m$iqkv$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: individual.net lwMg5sGS1URvdMcf4ijROQ8hv33YmSvHTUIRoK1ANavtKejYwq
Cancel-Lock: sha1:FsgDr+6AHXdZJxut4G7/25xxEbM= sha256:28sQRzY1LTxlNz3mQI86V6lFcPScrnur+NHb8ioKnPQ=
User-Agent: Mozilla Thunderbird
Content-Language: en-GB
In-Reply-To: <uult5m$iqkv$1@dont-email.me>

by: Peter Flynn - Thu, 4 Apr 2024 15:57 UTC

On 04/04/2024 10:50, db wrote:
[...]
> I installed pdfgrep in my Kubuntu system, but it is
> not happy. Although the man file is there, even help
> doesn't work:

I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
seems to work OK. What version is the Kubuntu one?

Peter

Re: pdf grep?

<uuoqu8$1cco6$1@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=345&group=comp.text.pdf#345

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: dieterhansbritz@gmail.com (db)
Newsgroups: comp.text.pdf
Subject: Re: pdf grep?
Date: Fri, 5 Apr 2024 12:31:04 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 14
Message-ID: <uuoqu8$1cco6$1@dont-email.me>
References: <uujj10$3tv68$2@dont-email.me>
<XB6cnYfPsZMk_JD7nZ2dnZfqnPGdnZ2d@giganews.com>
<grep-20240403151634@ram.dialup.fu-berlin.de>
<search-20240403152924@ram.dialup.fu-berlin.de> <uujs1s$7u0$3@dont-email.me>
<uult5m$iqkv$1@dont-email.me> <l780vtFidhnU1@mid.individual.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 05 Apr 2024 12:31:04 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="88a35264003d6fd2fe031050366b57d5";
logging-data="1454854"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18HXOuPTbuU3gLqx9L60Wr4pT09yR+rmk0="
User-Agent: Pan/0.149 (Bellevue; 4c157ba)
Cancel-Lock: sha1:1BA4E6VODTaXCk5nR0V9vtmlO2w=

by: db - Fri, 5 Apr 2024 12:31 UTC

On Thu, 4 Apr 2024 16:57:49 +0100, Peter Flynn wrote:

> On 04/04/2024 10:50, db wrote:
> [...]
>> I installed pdfgrep in my Kubuntu system, but it is not happy. Although
>> the man file is there, even help doesn't work:
>
> I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
> seems to work OK. What version is the Kubuntu one?
>
> Peter

The man file for pdfgrep says V. 2.1.1. My Kubuntu
is 23.04.
