Error while attempting to parse toc as utf-8 #10

Open
Azlinon opened this issue Dec 13, 2024 · 4 comments

Comments


Azlinon commented Dec 13, 2024

I'm getting this stack trace trying to rip Taylor Swift: Tortured Poets Department: The Anthology (disc 1) in whipper with python3-puaidio-2.1.0-14, and I didn't see anything relevant in the issues list. My understanding is that the toc is supposed to be encoded as iso-8859-1, so attempting to use utf-8 is the underlying problem. Can someone confirm if this is the correct place to address this issue?

CRITICAL:whipper.command.main:exception UnicodeDecodeError at /usr/lib64/python3.10/codecs.py:322: decode(): 'utf-8' codec can't decode byte 0x89 in position 124: invalid start byte
Traceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/whipper/extern/task/task.py", line 522, in c
    callable_task(*args, **kwargs)
  File "/usr/lib64/python3.10/site-packages/whipper/program/cdrdao.py", line 115, in _read
    self._done()
  File "/usr/lib64/python3.10/site-packages/whipper/program/cdrdao.py", line 153, in _done
    self.toc.parse()
  File "/usr/lib64/python3.10/site-packages/whipper/image/toc.py", line 257, in parse
    content = f.readlines()
  File "/usr/lib64/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 124: invalid start byte
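
For anyone trying to reproduce this without the disc, here is a minimal sketch of the same failure. The `raw` bytes are a made-up stand-in for a TOC line; only the 0x89 byte value is taken from the traceback above. It shows the strict UTF-8 decode failing with the same reason, and an ISO-8859-1 decode of the same bytes succeeding.

```python
# Hypothetical stand-in for one line of a TOC file; only the 0x89 byte
# comes from the traceback, the rest is invented for illustration.
raw = b'TITLE "\x89"\n'

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0x89 cannot begin a UTF-8 sequence, so strict decoding fails.
    utf8_error = exc

# ISO-8859-1 maps every byte value to a code point, so this cannot fail.
latin1_text = raw.decode("iso-8859-1")

print(utf8_error)
print(repr(latin1_text))
```

Note that a successful ISO-8859-1 decode only means no exception was raised; it does not show that the disc was actually authored in that encoding.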

Owner

rocky commented Dec 13, 2024

> My understanding is that the toc is supposed to be encoded as iso-8859-1,

Where is this documented?

> so attempting to use utf-8 is the underlying problem.

Ok, try changing the encoding and let everyone know if that fixes things.

> Can someone confirm if this is the correct place to address this issue?

Confirmed.

Author

Azlinon commented Dec 13, 2024

I don't have any definitive source, but the timelines don't match up. From what I have gathered (I failed to locate the CDDA specification), TOC may just be a string of bytes.

A quick Wikipedia search shows that the CD-DA standard was published in 1980 and products/media using it were commercially available starting in 1982. A similar search shows that the UTF-8 standard was developed in 1992 and released in 1993. It didn't receive mainstream software adoption until some time later.

While many ISO-8859-1 strings are also valid UTF-8, some are not. Wikipedia has a list in the error handling section of https://en.wikipedia.org/wiki/UTF-8.

So, either UTF-8 data was later placed into pressed discs that would not work properly with pre-existing hardware/software, or this data is all intended to be in an earlier encoding that was available at CDDA release, like ISO-8859-1.

I seem to be running into the "continuation byte" at the start of a string case here. I will attempt to make this change and test when I have some time over the holidays.
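
As a rough illustration of the "continuation byte" case, the sketch below classifies a single byte by its leading bits, following the UTF-8 bit layout. The function name `utf8_byte_kind` is made up for this example and is not part of whipper or any library.

```python
def utf8_byte_kind(byte: int) -> str:
    """Classify one byte value according to the UTF-8 bit layout."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in range 0..255")
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if byte < 0xC2:
        return "invalid"        # 0xC0 and 0xC1 would start overlong forms
    if byte < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if byte < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    if byte <= 0xF4:
        return "lead-4"         # 11110xxx: starts a four-byte sequence
    return "invalid"            # 0xF5..0xFF never appear in UTF-8


print(utf8_byte_kind(0x89))
```

The byte from the traceback, 0x89, falls in the 0x80..0xBF range, which is why the decoder reports it as an invalid start byte when it appears where a new character should begin.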

Author

Azlinon commented Dec 22, 2024

Now that I've researched this a bit further, I found informal references to some discs indeed using alternate encodings. So, it seems like some discs may be utf-8, some iso-8859-1, and even a few others.
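
If discs really do vary, one possible approach is to try a list of candidate encodings in order. This is only a sketch: `decode_with_fallback` is a hypothetical helper, not existing whipper code, and the default candidate list is an assumption based on the two encodings discussed in this thread.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Iterable[str] = ("utf-8", "iso-8859-1"),
) -> Tuple[str, str]:
    """Return (text, encoding_name) for the first encoding that decodes data."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_with_fallback(b"\x89"))
```

Since ISO-8859-1 accepts any byte sequence, it should come last in the list, and a result from it is a best guess, not a detection of the real encoding.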

I found a reference to the behavior change regarding utf-8 in cdrdao 1.2.5 within parse_toc_string in whipper/image/toc.py. Following that prompt, I compiled a version of cdrdao 1.2.4, and this same disc does indeed work with that version.

So, is this ultimately a compatibility issue against cdrdao 1.2.4, a flaw in cdrdao 1.2.4, or something else?

Owner

rocky commented Dec 22, 2024

> I found a reference to the behavior change regarding utf-8 in cdrdao 1.2.5 within parse_toc_string in whipper/image/toc.py. Following that prompt, I compiled a version of cdrdao 1.2.4, and this same disc does indeed work with that version.

> So, is this ultimately a compatibility issue against cdrdao 1.2.4, a flaw in cdrdao 1.2.4, or something else?

One suggestion is to add some sort of switch or flag to whatever parses the TOC, so the user can specify an encoding: utf-8, iso-8859-1, or something else.
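
A minimal sketch of what such a switch could look like, using only the standard library. Everything named here is an assumption for illustration: the `--toc-encoding` option, `build_parser`, and `read_toc_lines` do not exist in whipper or cdrdao as far as this thread shows.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical --toc-encoding option."""
    parser = argparse.ArgumentParser(description="Read a TOC file")
    parser.add_argument("toc_path", help="path to the TOC file")
    parser.add_argument(
        "--toc-encoding",
        default="utf-8",  # assumed default, matching the behavior in the traceback
        help="text encoding used to read the TOC (e.g. utf-8, iso-8859-1)",
    )
    return parser


def read_toc_lines(path: str, encoding: str = "utf-8") -> List[str]:
    """Read all lines of the TOC file using the caller-selected encoding."""
    with open(path, encoding=encoding) as f:
        return f.readlines()
```

A caller would parse the arguments with `build_parser().parse_args()` and pass `args.toc_path` and `args.toc_encoding` to `read_toc_lines`. Keeping utf-8 as the default leaves current behavior unchanged unless the user opts in.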
