Error while attempting to parse toc as utf-8 #10

Azlinon · 2024-12-13T13:39:46Z

I'm getting this stack trace trying to rip Taylor Swift: Tortured Poets Department: The Anthology (disc 1) in whipper with python3-puaidio-2.1.0-14, and I didn't see anything relevant in the issues list. My understanding is that the toc is supposed to be encoded as iso-8859-1, so attempting to use utf-8 is the underlying problem. Can someone confirm if this is the correct place to address this issue?

CRITICAL:whipper.command.main:exception UnicodeDecodeError at /usr/lib64/python3.10/codecs.py:322: decode(): 'utf-8' codec can't decode byte 0x89 in position 124: invalid start byte
Traceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/whipper/extern/task/task.py", line 522, in c
    callable_task(*args, **kwargs)
  File "/usr/lib64/python3.10/site-packages/whipper/program/cdrdao.py", line 115, in _read
    self._done()
  File "/usr/lib64/python3.10/site-packages/whipper/program/cdrdao.py", line 153, in _done
    self.toc.parse()
  File "/usr/lib64/python3.10/site-packages/whipper/image/toc.py", line 257, in parse
    content = f.readlines()
  File "/usr/lib64/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 124: invalid start byte

rocky · 2024-12-13T16:33:35Z

> My understanding is that the toc is supposed to be encoded as iso-8859-1,

Where is this documented?

> so attempting to use utf-8 is the underlying problem.

Ok, try changing the encoding and let everyone know if that fixes things.

> Can someone confirm if this is the correct place to address this issue?

Confirmed.

Azlinon · 2024-12-13T18:21:11Z

I don't have any definitive source, but the timelines don't match up. From what I have gathered (I failed to locate the CDDA specification), TOC may just be a string of bytes.

A quick Wikipedia search shows that the CD-DA standard was published in 1980 and that products and media using it were commercially available starting in 1982. A similar search shows that UTF-8 was developed in 1992 and released in 1993, and it did not see mainstream software adoption until some time later.

While ISO-8859-1 strings that contain only ASCII characters are also valid UTF-8, strings with bytes above 0x7F usually are not. Wikipedia has a list of the invalid cases in the error handling section of https://en.wikipedia.org/wiki/UTF-8.

So, either UTF-8 data was later placed into pressed discs that would not work properly with pre-existing hardware/software, or this data is all intended to be in an earlier encoding that was available at CDDA release, like ISO-8859-1.

I seem to be running into the "continuation byte" at the start of a string case here. I will attempt to make this change and test when I have some time over the holidays.
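To illustrate the "continuation byte at the start of a sequence" case, here is a minimal Python sketch. The byte string is made up for the example and is not taken from the actual disc; only the offending byte 0x89 comes from the stack trace above.

```python
# Hypothetical data: 0x89 is the byte reported in the traceback,
# the rest of the string is invented for illustration.
raw = b"\x89abc"


def is_continuation_byte(b: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx,
    # so they can never begin a character.
    return (b & 0xC0) == 0x80


try:
    raw.decode("utf-8")
    utf8_ok = True
    reason = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    reason = exc.reason

# ISO-8859-1 maps every byte value to a code point, so decoding never fails.
# Whether the result is the text the disc author intended is a separate question.
latin1_text = raw.decode("iso-8859-1")

print(is_continuation_byte(raw[0]), utf8_ok, reason, len(latin1_text))
```

Since ISO-8859-1 accepts any byte sequence, a successful decode only shows that the bytes are not valid UTF-8; it does not prove which encoding was actually used.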

Azlinon · 2024-12-22T17:46:01Z

Now that I've researched this a bit further, I found informal references to some discs indeed using alternate encodings. So, it seems like some discs may be utf-8, some iso-8859-1, and even a few others.

I found a reference to the behavior change regarding utf-8 in cdrdao 1.2.5 within parse_toc_string in whipper/image/toc.py. Following that prompt, I compiled a version of cdrdao 1.2.4, and this same disc does indeed work with that version.

So, is this ultimately a compatibility issue against cdrdao 1.2.4, a flaw in cdrdao 1.2.4, or something else?

rocky · 2024-12-22T17:54:28Z

> I found a reference to the behavior change regarding utf-8 in cdrdao 1.2.5 within parse_toc_string in whipper/image/toc.py. Following that prompt, I compiled a version of cdrdao 1.2.4, and this same disc does indeed work with that version.
>
> So, is this ultimately a compatibility issue against cdrdao 1.2.4, a flaw in cdrdao 1.2.4, or something else?

One suggestion is to add some sort of switch or flag to whatever parses the TOC, so that the user can specify an encoding: utf-8, iso-8859-1, or something else.
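As a rough sketch of what such a switch might look like on the Python side: the helper below is hypothetical and is not whipper's actual API. The function name `read_toc_lines` and the `encoding` and `fallback` parameters are assumptions made for this example; it only relies on the built-in `open()` and `readlines()`, the call that raised in the traceback.

```python
import os
import tempfile


def read_toc_lines(path, encoding="utf-8", fallback="iso-8859-1"):
    """Hypothetical helper: read a TOC file with a caller-chosen encoding.

    If decoding fails and a fallback encoding is given, retry with it;
    pass fallback=None to let the UnicodeDecodeError propagate.
    """
    try:
        with open(path, encoding=encoding) as f:
            return f.readlines()
    except UnicodeDecodeError:
        if fallback is None:
            raise
        with open(path, encoding=fallback) as f:
            return f.readlines()


# Usage with a made-up TOC line containing the ISO-8859-1 byte 0xE9.
with tempfile.NamedTemporaryFile(delete=False, suffix=".toc") as tmp:
    tmp.write(b'TITLE "Caf\xe9"\n')
try:
    lines = read_toc_lines(tmp.name)
finally:
    os.unlink(tmp.name)

print(lines)
```

A silent fallback is convenient, but it can hide a wrong guess, so an explicit user-supplied encoding (with `fallback=None`) may be the safer default.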
