
Subject                               Author
* Unable to wget some pages           Michael F. Stemper
+* Re: Unable to wget some pages      Dan Purgert
|+* Re: Unable to wget some pages     Josef Möllers
||`- Re: Unable to wget some pages    Michael F. Stemper
|`* Re: Unable to wget some pages     Michael F. Stemper
| `* Re: Unable to wget some pages    Dan Purgert
|  `- Re: Unable to wget some pages   Michael F. Stemper
`* Re: Unable to wget some pages      Paul
 `- Re: Unable to wget some pages     Michael F. Stemper

Unable to wget some pages

<usn3ec$3l9th$1@dont-email.me>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: michael.stemper@gmail.com (Michael F. Stemper)
Newsgroups: alt.os.linux.ubuntu
Subject: Unable to wget some pages
Date: Mon, 11 Mar 2024 09:11:24 -0500
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <usn3ec$3l9th$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 11 Mar 2024 14:11:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6e72de6da3f953b2946e59ff54f03c42";
logging-data="3844017"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/oU3bfZAHVBKi5TO4llhO3YomXKouUTgc="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:Fc2GrQKrQCUNH0dWodjw5Si2/nU=
Content-Language: en-US
 by: Michael F. Stemper - Mon, 11 Mar 2024 14:11 UTC

Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
download some pages. A simplified version, showing the problem, is:

$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden

Username/Password Authentication Failed.
$

Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.

Despite this, I tried telling wget to provide empty username and
password, with no observable change in results.

On a purely cargo-cult basis, I tried some different user agent
strings, with no effect.

I searched on "401 HTTP Forbidden", only to find that there does
not appear to be such an error. There are "401 Unauthorized" and
"403 Forbidden", but no such cross-breed.

I looked briefly at the page source (in Firefox), but without a
top-level design document, couldn't make head or tail of it.

Does anybody have any suggestions on how to fix my problem and
again automatically download this, and neighboring, pages?
--
Michael F. Stemper
Indians scattered on dawn's highway bleeding;
Ghosts crowd the young child's fragile eggshell mind.

Re: Unable to wget some pages

<slrnuuu55t.807.dan@djph.net>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: dan@djph.net (Dan Purgert)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Mon, 11 Mar 2024 14:27:09 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <slrnuuu55t.807.dan@djph.net>
References: <usn3ec$3l9th$1@dont-email.me>
Injection-Date: Mon, 11 Mar 2024 14:27:09 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4d7200ff6d722f330eae8f0f7638a116";
logging-data="3849347"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18C1gdvGre1JrhOwtP5N8wPwgH520ebByo="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:nce9oPGrst3zq0CvQdCB0lE9Jw8=
 by: Dan Purgert - Mon, 11 Mar 2024 14:27 UTC

On 2024-03-11, Michael F. Stemper wrote:
> Late last week, a script that I have used for several years suddenly
> stopped working. Investigation showed that wget was failing to
> download some pages. A simplified version, showing the problem, is:
>
> $ cat ic
> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
> $ . ./ic
> --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
> HTTP request sent, awaiting response... 401 HTTP Forbidden
>
> Username/Password Authentication Failed.
> $
>
> Looking at the error message, one might think that this page/site
> requires user login credentials. However, the same URL works just
> fine in Firefox, with no login requested or required.

Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I'd imagine either

(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason

The "401" is the error code. The "HTTP Forbidden" is (for lack of a
better word) "custom text" they're supplying. I've done similar where a
HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
when scripting things with expect.
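
To make the split between status code and reason phrase concrete, here is a minimal sketch in plain sh. It assumes a status line of the usual "version code reason" shape; the sample line is modeled on the wget output quoted above, and the function name split_status is made up for this illustration.

```shell
#!/bin/sh
# Split an HTTP/1.x status line into its numeric code and its free-text
# reason phrase. Assumes the line actually contains a reason phrase.
split_status() {
    line=$1                # e.g. "HTTP/1.1 401 HTTP Forbidden"
    rest=${line#* }        # drop the protocol version (first word)
    code=${rest%% *}       # next word is the three-digit status code
    reason=${rest#* }      # everything after the code is the reason phrase
    printf 'code=%s reason=%s\n' "$code" "$reason"
}

split_status "HTTP/1.1 401 HTTP Forbidden"
```

Only the number carries meaning for the client; the text after it is whatever the server chose to send, which is why a search for the exact phrase turns up nothing.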

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860

Re: Unable to wget some pages

<l58okrFs04jU1@mid.individual.net>

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: josef@invalid.invalid (Josef Möllers)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Mon, 11 Mar 2024 17:08:59 +0100
Lines: 40
Message-ID: <l58okrFs04jU1@mid.individual.net>
References: <usn3ec$3l9th$1@dont-email.me> <slrnuuu55t.807.dan@djph.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Trace: individual.net jC11HCSCbrRtsjlZe6WxKQX2runEcaPA7xoLZe3P1FGFPKwyDD
Cancel-Lock: sha1:KwdOB4Jg10UewgwsSG79Q12KlQ4= sha256:ya6zDQVy7TmpjKzBOSXzyFEbBHPJpUo425OSdrIxp3I=
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <slrnuuu55t.807.dan@djph.net>
 by: Josef Möllers - Mon, 11 Mar 2024 16:08 UTC

On 11.03.24 15:27, Dan Purgert wrote:
> On 2024-03-11, Michael F. Stemper wrote:
>> Late last week, a script that I have used for several years suddenly
>> stopped working. Investigation showed that wget was failing to
>> download some pages. A simplified version, showing the problem, is:
>>
>> $ cat ic
>> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
>> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
>> $ . ./ic
>> --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
>> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
>> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
>> HTTP request sent, awaiting response... 401 HTTP Forbidden
>>
>> Username/Password Authentication Failed.
>> $
>>
>> Looking at the error message, one might think that this page/site
>> requires user login credentials. However, the same URL works just
>> fine in Firefox, with no login requested or required.
>
> Looks like the page *does* have a login button / javascript thing
> "somewhere" (at least I can see it when I open the page in lynx here).
> I'd imagine either
>
> (1) wget is respecting some robots.txt somewhere OR
> (2) wget is following that login link for some reason
>
> The "401" is the error code. The "HTTP Forbidden" is (for lack of a
> better word) "custom text" they're supplying. I've done similar where a
> HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
> when scripting things with expect.
>
Besides that ... is it on purpose that $uas is between single quotes, so
it won't get expanded? Double quotes are required because the user agent
string has blanks (and parentheses), but single quotes are definitely
wrong here!
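
A small sketch of the difference, for anyone who wants to see it rather than take it on faith. The user agent value is shortened and made up for the demonstration, and show_args is a hypothetical helper, not part of the original script.

```shell
#!/bin/sh
# Shortened, made-up user agent value, just for the demonstration.
uas="Mozilla/5.0 (X11; Linux x86_64) Example/1.0"

# Print each argument on its own line, bracketed, so that expansion and
# word splitting become visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show_args '$uas'    # single quotes: the literal text $uas, no expansion
show_args "$uas"    # double quotes: the expanded value, as one argument
show_args $uas      # unquoted: expanded, then split on blanks
```

The first call prints [$uas], the second prints the whole string inside one pair of brackets, and the third prints one bracketed line per blank-separated word.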

Josef "2cts" Möllers

Re: Unable to wget some pages

<usnchs$3ne8k$1@dont-email.me>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nospam@needed.invalid (Paul)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Mon, 11 Mar 2024 12:46:50 -0400
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <usnchs$3ne8k$1@dont-email.me>
References: <usn3ec$3l9th$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 11 Mar 2024 16:46:52 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="53c6c14cb837f2fb4f5580d1bb8946e3";
logging-data="3914004"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lumUUsR1/Vy3peIJbSwV4+OmXKK84OCU="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:RWsTVG8KZHpl4eZ3LMoIu9EUEGQ=
In-Reply-To: <usn3ec$3l9th$1@dont-email.me>
Content-Language: en-US
 by: Paul - Mon, 11 Mar 2024 16:46 UTC

On 3/11/2024 10:11 AM, Michael F. Stemper wrote:
> Late last week, a script that I have used for several years suddenly
> stopped working. Investigation showed that wget was failing to
> download some pages. A simplified version, showing the problem, is:
>
> $ cat ic
> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
> $ . ./ic
> --2024-03-11 08:52:23--  https://www.marketwatch.com/investing/index/spx
> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
> HTTP request sent, awaiting response... 401 HTTP Forbidden
>
> Username/Password Authentication Failed.
> $
>
> Looking at the error message, one might think that this page/site
> requires user login credentials. However, the same URL works just
> fine in Firefox, with no login requested or required.
>
> Despite this, I tried telling wget to provide empty username and
> password, with no observable change in results.
>
> On a purely cargo-cult basis, I tried some different user agent
> strings, with no effect.
>
> I searched on "401 HTTP Forbidden", only to find that there does
> not appear to be such an error. There are "401 Unauthorized" and
> "403 Forbidden", but no such cross-breed.
>
> I looked briefly at the page source (in Firefox), but without a
> top-level design document, couldn't make head or tail of it.
>
> Does anybody have any suggestions on how to fix my problem and
> again automatically download this, and neighboring, pages?

It's almost as if https:// and http:// are getting mixed up at some
point in the operation, with the website denying http:// access.

Maybe at some point, the website used to redirect the http://
attempt to https:// for you, and maybe it's not doing that
any more ?

Or perhaps wget has developed a defect in dining habits
related to that aspect.

Paul

Re: Unable to wget some pages

<usnjs6$3p7g5$1@dont-email.me>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: michael.stemper@gmail.com (Michael F. Stemper)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Mon, 11 Mar 2024 13:51:49 -0500
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <usnjs6$3p7g5$1@dont-email.me>
References: <usn3ec$3l9th$1@dont-email.me> <usnchs$3ne8k$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 11 Mar 2024 18:51:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6e72de6da3f953b2946e59ff54f03c42";
logging-data="3972613"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18d2VUaIQE557Po9M1RIBHsStUHOsl98D4="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:dOw2SGmFEMYmPbp6E1QlJ3fJX/Q=
Content-Language: en-US
In-Reply-To: <usnchs$3ne8k$1@dont-email.me>
 by: Michael F. Stemper - Mon, 11 Mar 2024 18:51 UTC

On 11/03/2024 11.46, Paul wrote:
> On 3/11/2024 10:11 AM, Michael F. Stemper wrote:

>> $ cat ic
>> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
>> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
>> $ . ./ic
>> --2024-03-11 08:52:23--  https://www.marketwatch.com/investing/index/spx
>> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
>> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
>> HTTP request sent, awaiting response... 401 HTTP Forbidden
>>
>> Username/Password Authentication Failed.
>> $

> It's almost as if https:// and http:// are getting mixed up at some
> point in the operation, with the website denying http:// access.
>
> Maybe at some point, the website used to redirect the http://
> attempt to https:// for you, and maybe it's not doing that
> any more ?

Sorry, but I don't follow this. The URL that I show above is
https:// not http:// and that's also what the output of wget
shows as the URL.

What is the source for your suspicion that it's really doing
http:// under the covers?

--
Michael F. Stemper
There's no "me" in "team". There's no "us" in "team", either.

Re: Unable to wget some pages

<usnkeo$3p7g5$2@dont-email.me>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: michael.stemper@gmail.com (Michael F. Stemper)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Mon, 11 Mar 2024 14:01:44 -0500
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <usnkeo$3p7g5$2@dont-email.me>
References: <usn3ec$3l9th$1@dont-email.me> <slrnuuu55t.807.dan@djph.net>
<l58okrFs04jU1@mid.individual.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 11 Mar 2024 19:01:44 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6e72de6da3f953b2946e59ff54f03c42";
logging-data="3972613"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18jeRSnMSDCjrIpFsn34tDdWg/z3ZrTBsE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:3eFZpaeDRWwDQH0/vdRAAX8RVJI=
Content-Language: en-US
In-Reply-To: <l58okrFs04jU1@mid.individual.net>
 by: Michael F. Stemper - Mon, 11 Mar 2024 19:01 UTC

On 11/03/2024 11.08, Josef Möllers wrote:
> On 11.03.24 15:27, Dan Purgert wrote:
>> On 2024-03-11, Michael F. Stemper wrote:
>>> Late last week, a script that I have used for several years suddenly
>>> stopped working. Investigation showed that wget was failing to
>>> download some pages. A simplified version, showing the problem, is:
>>>
>>> $ cat ic
>>> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
>>> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500

> Besides that ... is it on purpose that $uas is between single quotes, so it won't get expanded? Double quotes are required because the user agent string has blanks (and parentheses), but single quotes are definitely wrong here!

Interesting. All this time, I've been sending a User Agent string
of $uas, and it's worked.

If I recall my thinking from six years ago, I had used single quotes
because I used double quotes in the definition of the variable. Now
that I look back, that was pretty obviously wrong, since the *value*
of the variable doesn't have any quotes in it.

However, changing from single to double didn't help. (I'm guessing
that you didn't expect that it would.)

Single versus double quotes always get me tangled up.

--
Michael F. Stemper
There's no "me" in "team". There's no "us" in "team", either.

Re: Unable to wget some pages

<usnkkt$3p7g5$3@dont-email.me>

Path: i2pn2.org!rocksolid2!news.neodome.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: michael.stemper@gmail.com (Michael F. Stemper)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Mon, 11 Mar 2024 14:05:01 -0500
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <usnkkt$3p7g5$3@dont-email.me>
References: <usn3ec$3l9th$1@dont-email.me> <slrnuuu55t.807.dan@djph.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 11 Mar 2024 19:05:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6e72de6da3f953b2946e59ff54f03c42";
logging-data="3972613"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/3T48mTNbGOGVSCUGPAkzSxUdUQCIdr7U="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:NYxzBg7gPKT2Un8Apk0YCWTfgFo=
Content-Language: en-US
In-Reply-To: <slrnuuu55t.807.dan@djph.net>
 by: Michael F. Stemper - Mon, 11 Mar 2024 19:05 UTC

On 11/03/2024 09.27, Dan Purgert wrote:
> On 2024-03-11, Michael F. Stemper wrote:
>> Late last week, a script that I have used for several years suddenly
>> stopped working. Investigation showed that wget was failing to
>> download some pages. A simplified version, showing the problem, is:
>>
>> $ cat ic
>> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
>> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
>> $ . ./ic
>> --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
>> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
>> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
>> HTTP request sent, awaiting response... 401 HTTP Forbidden
>>
>> Username/Password Authentication Failed.
>> $
>>
>> Looking at the error message, one might think that this page/site
>> requires user login credentials. However, the same URL works just
>> fine in Firefox, with no login requested or required.
>
> Looks like the page *does* have a login button / javascript thing
> "somewhere" (at least I can see it when I open the page in lynx here).

I've never installed lynx. Is it capable of running as a background
process, e.g., via crontab?

> I'd imagine either
>
> (1) wget is respecting some robots.txt somewhere OR
> (2) wget is following that login link for some reason

Any ideas how I could test for, or prevent, either of these?

--
Michael F. Stemper
There's no "me" in "team". There's no "us" in "team", either.

Re: Unable to wget some pages

<slrnuv07q8.807.dan@djph.net>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: dan@djph.net (Dan Purgert)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Tue, 12 Mar 2024 09:24:24 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <slrnuv07q8.807.dan@djph.net>
References: <usn3ec$3l9th$1@dont-email.me> <slrnuuu55t.807.dan@djph.net>
<usnkkt$3p7g5$3@dont-email.me>
Injection-Date: Tue, 12 Mar 2024 09:24:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="08af59b90df807c866dbf133bc03d33d";
logging-data="232071"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+TcOvG+Ybsh9jbVADcMaNIggPNAULox/8="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:s8pAbzpZpnQ8deMehDe9hBoWqvE=
 by: Dan Purgert - Tue, 12 Mar 2024 09:24 UTC

On 2024-03-11, Michael F. Stemper wrote:
> On 11/03/2024 09.27, Dan Purgert wrote:
>> On 2024-03-11, Michael F. Stemper wrote:
>>> Late last week, a script that I have used for several years suddenly
>>> stopped working. Investigation showed that wget was failing to
>>> download some pages. A simplified version, showing the problem, is:
>>>
>>> $ cat ic
>>> uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
>>> wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
>>> $ . ./ic
>>> --2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
>>> Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
>>> Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
>>> HTTP request sent, awaiting response... 401 HTTP Forbidden
>>>
>>> Username/Password Authentication Failed.
>>> $
>>>
>>> Looking at the error message, one might think that this page/site
>>> requires user login credentials. However, the same URL works just
>>> fine in Firefox, with no login requested or required.
>>
>> Looks like the page *does* have a login button / javascript thing
>> "somewhere" (at least I can see it when I open the page in lynx here).
>
> I've never installed lynx. Is it capable of running as a background
> process, e.g., via crontab?

Not that I'm aware of, sorry.

>
>> I'd imagine either
>>
>> (1) wget is respecting some robots.txt somewhere OR
>> (2) wget is following that login link for some reason
>
> Any ideas how I could test for, or prevent, either of these?

Potentially adding "-e robots=off" will avoid #1. More verbosity (-v) or
turning on headers (-S?) may help for both as well.

But both of these were a bit of a stab in the dark.
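
For reference, a rough sketch of how those switches could be combined. The function name fetch_debug, the WGET override variable, and the user agent value are all assumptions made for this illustration; only the wget options themselves (-S, -e robots=off, -U, -O) are real. As far as I know, wget only consults robots.txt during recursive retrieval, so robots=off may make no difference for a single page.

```shell
#!/bin/sh
# Illustrative user agent value, not the one from the original script.
uas="Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0"

# fetch_debug URL OUTFILE
# Fetch one page with extra diagnostics switched on.
# WGET is a hypothetical override so the command can be swapped out when testing.
fetch_debug() {
    # -S              print the response headers sent by the server
    # -e robots=off   ignore robots.txt
    # -U "$uas"       send the expanded user agent string (note the double quotes)
    # -O "$2"         write the body to the named output file
    "${WGET:-wget}" -S -e robots=off -U "$uas" -O "$2" "$1"
}

# Example call, left commented out so nothing is fetched by accident:
#   fetch_debug "https://www.marketwatch.com/investing/index/spx" SP500
```

The headers printed by -S go to stderr, so redirect with 2> if you want to keep them next to the downloaded file.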

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860

Re: Unable to wget some pages

<uspnvu$as0t$1@dont-email.me>

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: michael.stemper@gmail.com (Michael F. Stemper)
Newsgroups: alt.os.linux.ubuntu
Subject: Re: Unable to wget some pages
Date: Tue, 12 Mar 2024 09:14:22 -0500
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <uspnvu$as0t$1@dont-email.me>
References: <usn3ec$3l9th$1@dont-email.me> <slrnuuu55t.807.dan@djph.net>
<usnkkt$3p7g5$3@dont-email.me> <slrnuv07q8.807.dan@djph.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 12 Mar 2024 14:14:22 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="52b83cbda72929ed4a11cc63e32bc899";
logging-data="356381"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+V9tm9OZ/dZxjWHr20niJJH8jUYvhQPAw="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:vieIL2glRJI9bdlnrKtcL6NDOg4=
Content-Language: en-US
In-Reply-To: <slrnuv07q8.807.dan@djph.net>
 by: Michael F. Stemper - Tue, 12 Mar 2024 14:14 UTC

On 12/03/2024 04.24, Dan Purgert wrote:
> On 2024-03-11, Michael F. Stemper wrote:
>> On 11/03/2024 09.27, Dan Purgert wrote:
>>> On 2024-03-11, Michael F. Stemper wrote:
>>>> Late last week, a script that I have used for several years suddenly

>>>> Looking at the error message, one might think that this page/site
>>>> requires user login credentials. However, the same URL works just
>>>> fine in Firefox, with no login requested or required.
>>>
>>> Looks like the page *does* have a login button / javascript thing
>>> "somewhere" (at least I can see it when I open the page in lynx here).

>>> I'd imagine either
>>>
>>> (1) wget is respecting some robots.txt somewhere OR
>>> (2) wget is following that login link for some reason
>>
>> Any ideas how I could test for, or prevent, either of these?
>
> Potentially adding "-e robots=off" will avoid #1. More verbosity (-v) or
> turning on headers (-S?) may help for both as well.

No joy from robots=off, and wget's man page says that -v is the default.

But, I just tried with curl, and think that I've found a clue. Included
in what it downloaded was:
"Please enable JS and disable any ad blocker"

I'm not sure if it's possible for wget to fake having javascript, but
it seems as if that's the next place to look.
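
In the meantime, a minimal sketch of how a script could at least notice that it received the placeholder instead of the real page. It assumes the marker text is the one curl returned above; is_js_wall is a made-up helper name, and SP500 is the output file from the original command.

```shell
#!/bin/sh
# Succeed (exit status 0) if the saved file contains the "enable JS" notice,
# i.e. it looks like the placeholder rather than real content.
is_js_wall() {
    grep -q -i 'Please enable JS' "$1"
}

# Hypothetical usage after a download to the file SP500:
#   if is_js_wall SP500; then
#       echo "got the JS placeholder, not the page" >&2
#   fi
```

This does not get past the check; it only lets the script fail loudly rather than quietly saving a useless file.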
--
Michael F. Stemper
This sentence no verb.
