Re: why is this procedure is so slow?

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jshem@yaxenu.org (Julieta Shem)
Newsgroups: comp.lang.lisp
Subject: Re: why is this procedure is so slow?
Date: Sat, 10 Feb 2024 11:37:32 -0300
Organization: A noiseless patient Spider
Lines: 264
Message-ID: <871q9kjsgj.fsf@yaxenu.org>
References: <87sf22ykaw.fsf@yaxenu.org> <m24jeiyeb1.fsf@MacBook-Pro-2.home>
<87v86xpy8a.fsf@yaxenu.org> <m2a5o9h1bw.fsf@MacBook-Pro-2.home>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="4e28b53434257e8ae1ee25e3f601f368";
logging-data="3378863"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/lQAgWzRVWvl6W+Jtoe87pZRt1LyeGWFE="
Cancel-Lock: sha1:yhqRAK/BHSMKtH4GKGRTEg9QYkA=
sha1:UGMPJyCbDUiFecDbH/m70ZzqjNM=

Raymond Wiker <rwiker@gmail.com> writes:

> Julieta Shem <jshem@yaxenu.org> writes:
>
>> I'd like to thank everyone in this thread. Valuable help. Thank you.
>>
>> Raymond Wiker <rwiker@gmail.com> writes:
>>
>>> Julieta Shem <jshem@yaxenu.org> writes:
>>>
>>>> First I define x to be a list of bytes of an NNTP article.
>>>>
>>>> (defvar x (fetch-article "local.test" "28"))
>>>>
>>>> Now x is a list of integers, the bytes of the article 28, which is a
>>>> MIME message containing a PDF of 610 KiB.
>>>>
>>>> Now I want to split it line by line. I am able to do it, but it's very
>>>> slow.
>>>>
>>>> * (time (length (split-sequence (list 13 10) x nil)))
>>>> Evaluation took:
>>>> 65.710 seconds of real time
>>>> 65.671875 seconds of total run time (47.093750 user, 18.578125 system)
>>>> [ Run times consist of 23.968 seconds GC time, and 41.704 seconds non-GC time. ]
>>>> 99.94% CPU
>>>> 170,322,749,168 processor cycles
>>>> 79,439,358,864 bytes consed
>>>>
>>>> 11585
>>>> *
>>>>
>>>> Less than 12k lines. The number 79,439,358,864 of bytes is way more
>>>> than I would have expected for anything regarding this invocation above,
>>>> but I have no idea what this number really measures here. I appreciate
>>>> any help. Thank you.
>>>>
>>>> Here's the procedure:
>>>>
>>>> (defun split-sequence (delim ls acc &key limit (so-far 1))
>>>>   (let* ((len (length ls))
>>>>          (delim delim)
>>>>          (pos (search delim ls))
>>>>          (n-take (or pos len))
>>>>          (n-drop (if pos
>>>>                      (+ n-take (length delim))
>>>>                      n-take)))
>>>>     (cond ((zerop len) acc)
>>>>           ((and limit (= so-far limit)) (list ls))
>>>>           (t (split-sequence
>>>>               delim (drop n-drop ls)
>>>>               (cons (take n-take ls) acc)
>>>>               :limit limit
>>>>               :so-far (1+ so-far))))))
>>>>
>>>> (defun take (n seq) (subseq seq 0 n))
>>>> (defun drop (n seq) (subseq seq n))
>>>
>>> It is slow because
>>>
>>> * You're repeatedly calling #'length on whatever is left of the
>>> list. Computing the length of a list is an O(n) operation; repeatedly
>>> calling #'length on the remainder of the list is then O(n^2).
>>>
>>> * You are generating new lists for both the initial segments and the
>>> rest of the list. In order to generate 12k lines you have to generate
>>> lists containing the lines, but also 12k lists containing the remainder
>>> (at various points) of the input.
>>
>> That's the take and drop, I think.
>
> Yup.
>
>>> * Each character of the input requires a cons cell, which is probably 16
>>> bytes (assuming a 64-bit Lisp implementation). So, your input now
>>> requires 610000*16, so about 9.76MB.
>>
>> So that's one of the advantages of using a vector, I think.
>
> Right. The other is that accessing any element in an array by position
> (index) is O(1); for a list, it's O(n).
>
>>> * The remainder lists will consume about 0.5*610000*16*12000, so about
>>> 55GB. This is not too far from the 74GB that time reports. (My
>>> calculations are probably off, but clearly not _too_ far off.)
>>
>> Impressive.
>
> Maybe... but it also appears to be wrong :-)

That's okay. :-)

>>> * I'm going to assume that 74GB is going to be a significant chunk of
>>> the available memory in your computer. That means that there will be
>>> significant garbage-collection activity, and you may also end up
>>> swapping to disk.
>>
>> What do you mean by ``available memory''? I have 24 GiB of RAM---23.9
>> GiB usable, says Windows. I don't know how to show this right now, but
>> I think I don't have any swap space (on Windows). I tried to turn it
>> off some time ago, and I think I succeeded.
>
> What implementation of Common Lisp are you using? Is it a 32-bit or
> 64-bit version?

I believe it's a 64-bit version.

%file sbcl.exe
sbcl.exe: PE32+ executable (console) x86-64, for MS Windows

I got it from

https://www.sbcl.org/platform-table.html

and I downloaded the MSI-file from column AMD64, version 2.3.2.

%./sbcl.exe --version
SBCL 2.3.2

>>> If you used strings (or byte arrays instead), you would
>>>
>>> * Require much less memory.
>>>
>>> * Computing #'length for a string or array is O(1), so the overall complexity
>>> becomes O(n).
>>
>> Because strings and arrays store their size somewhere in their data
>> structure, I suppose, and lists do not.
>
> Right. Lists are literally composed of objects that contain two values
> (cons cells), where the second value is a pointer to the next cons cell.
>
>>> * You don't have to create new arrays containing the successive
>>> remainders; instead, you can use the :start keyword to #'search.
>>
>> I tried this. Here's how I rewrote it. First some user interface that
>> gets the length just once, even though now it's probably okay to call
>> length multiple times. (The /acc/umulator should probably be optional
>> here, but I was afraid of mixing &optional with &key. This is
>> my first Common Lisp program.)
>>
>> (defun split-vector (delim v acc &key limit (so-far 1))
>>   (let ((len (length v)))
>>     (split-vector-helper delim v len acc limit so-far 0)))
>>
>> Now, find the position of /delim/ between /start/ and /len/. If you
>> can't find it, we're done. If we've reached the limit of lines the
>> caller requested, we're done. Now we found /delim/: copy the slice of
>> the vector that goes from /start/ to the /pos/ition we found /delim/;
>> save it in the accumulator. Update /start/ to move along the vector.
>> Now repeat everything.
>>
>> (defun split-vector-helper (delim v len acc limit so-far start)
>>   (if (zerop len)
>>       acc
>>       (let ((pos (search delim v :start2 start :end2 len)))
>>         (cond ((or (not pos) (and limit (= so-far limit)))
>>                (nreverse (cons (subseq v start len) acc)))
>>               (t (split-vector-helper delim
>>                                       v
>>                                       len
>>                                       (cons (subseq v start (or pos len)) acc)
>>                                       limit
>>                                       (1+ so-far)
>>                                       (+ pos (length delim))))))))
>>
>> The result is better:
>>
>> * (time (length (split-vector (vector 13 10) x nil)))
>> Evaluation took:
>> 0.011 seconds of real time
>> 0.000000 seconds of total run time (0.000000 user, 0.000000 system)
>> 0.00% CPU
>> 30,857,378 processor cycles
>> 7,000,064 bytes consed
>>
>> 11576
>>
>> (Difficult to understand 0.011 real time against ``total run time''.)
>
> That's a good second implementation, and a very nice improvement in run
> time and memory use!

Thanks!

> Note that you do not actually need the "len" parameter: if you don't
> specify :end2 for #'search or the end position for #'subseq, they will
> assume the value nil, which means the end of the sequence.

Noted.
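
A sketch of that simplification (my own illustration, not code from the
thread): drop the /len/ parameter and the helper entirely, let /search/
and /subseq/ default to the end of the vector, and carry only /start/.
The name split-vector* is hypothetical.

```lisp
;; Hypothetical simplification of SPLIT-VECTOR along the lines Raymond
;; suggests: no LEN parameter; SEARCH and SUBSEQ default to the end of
;; the sequence.  A sketch only, not code from the thread.
(defun split-vector* (delim v acc &key limit (so-far 1) (start 0))
  (let ((pos (search delim v :start2 start)))
    (cond ((or (not pos) (and limit (= so-far limit)))
           ;; No more delimiters (or LIMIT reached): keep the tail.
           (nreverse (cons (subseq v start) acc)))
          (t (split-vector* delim v
                            (cons (subseq v start pos) acc)
                            :limit limit
                            :so-far (1+ so-far)
                            :start (+ pos (length delim)))))))

;; Called the same way as before:
;; (split-vector* (vector 13 10) x nil)
```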

>> I haven't studied about the /loop/ macro yet.
>
> Some people abhor loop, while others find it strangely attractive (or
> even addictive).

People say loop is hard to understand. I have a feeling I won't like it
much: hard-to-understand things don't give us a feeling of control.

``The details of the interaction matter, ease of use matters, but I
want more than correct details, more than a system that is easy to
learn or to use: I want a system that is enjoyable to use. This is
an important, dominating design philosophy, easier to say than to
do. It implies developing systems that provide a strong sense of
understanding and control. This means tools that reveal their
underlying conceptual model and allow for interaction, tools that
emphasize comfort, ease, and pleasure of use [...]. A major factor
in this debate is the feeling of control that the user has over the
operations that are being performed. A `powerful,' `intelligent'
system can lead to the well documented problems of `overautomation,'
causing the user to be a passive observer of operations, no longer
in control of either what operations take place, or of how they are
done. On the other hand, systems that are not sufficiently powerful
or intelligent can leave too large a gap in the mappings from
intention to action execution and from system state to psychological
interpretation. The result is that operation and interpretation are
complex and difficult, and the user again feels out of control,
distanced from the system.'' -- ``User Centered System Design'',
chapter 3, ``cognitive engineering'', ``on the quality of
human-computer interaction'', pages 48--49, Donald A. Norman, CRC
Press, 1986, ISBN 0-89859-872-9.
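
For what it is worth, the vector split above fits in a handful of /loop/
clauses. This is my own sketch for comparison, not code from the thread,
and it omits the /limit//so-far/ bookkeeping:

```lisp
;; The same split expressed with LOOP, without the LIMIT handling.
;; A sketch for comparison only.
(defun split-vector-loop (delim v)
  (loop with dlen = (length delim)
        for start = 0 then (+ pos dlen)
        for pos = (search delim v :start2 start)
        collect (subseq v start pos)   ; POS = NIL means "to the end"
        while pos))
```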

>>> * Also, if you used strings, you could just create a string stream from
>>> the input string and repeatedly call #'read-line to collect the
>>> lines. The entire program would be something like this (untested):
>>>
>>> (defun split-article (article-as-string)
>>>   (with-input-from-string (str article-as-string)
>>>     (loop for line = (read-line str nil)
>>>           while line
>>>           collect line)))
>>
>> As Spiros Bousbouras remarked in the thread, I'd prefer not to try to
>> understand article content because the user has the right to use
>> whatever encoding they please in NNTP articles. So I've been trying
>> to handle them as byte sequences.
>
> At some point you will probably have to decode the article into
> characters. In order to do so, you will have to parse the article
> headers to find what encoding is used. I'm not convinced that this is
> possible, but if you have a 16-bit character encoding, you may have an
> octet 13 from one character, and an octet 10 from the following
> character.

That's a good question. I wonder what would happen. Spiros Bousbouras
seems to say that the NNTP protocol is such that if someone uses an
encoding that could encode a character as CRLF, it's their problem
now---NNTP came first.

So far what I have written is what Bousbouras says. I take an article
as a sequence of bytes. I look for the first sequence of bytes

13 10 13 10.

That separates headers and body. The body is everything that follows
this mark. Headers are everything that precedes it. I do have to
decode some headers completely---such as /newsgroups/. To make my life
easy and let me use string functions, I've been decoding all headers,
though I would rewrite that if I could and let people put whatever
encoding they want in the headers so long as they respect
line-termination as CRLF and respect the header-value separator, which
is the COLON.

Perhaps this perspective turns the NNTP protocol into a binary protocol.
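
The header/body split described above can be sketched in a few lines (my
own illustration; split-headers-body is a hypothetical name):

```lisp
;; Find the first 13 10 13 10 in the article's octets and slice around
;; it: everything before is the headers, everything after is the body.
(defun split-headers-body (octets)
  (let ((pos (search #(13 10 13 10) octets)))
    (if pos
        (values (subseq octets 0 pos)        ; headers, separator excluded
                (subseq octets (+ pos 4)))   ; body
        (values octets nil))))               ; no separator: headers only
```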
