Error while attempting to parse toc as utf-8 #10
Comments
Where is this documented?
Ok, try changing the encoding and let everyone know if that fixes things.
Confirmed.
I don't have any definitive source, but the timelines don't match up. From what I have gathered (I failed to locate the CDDA specification), the TOC may just be a string of bytes. A quick Wikipedia search shows that the CD-DA standard was published in 1980 and that products and media using it were commercially available starting in 1982. A similar search shows that the UTF-8 standard was developed in 1992 and released in 1993, and it didn't receive mainstream software adoption until some time later. While many ISO-8859-1 strings are also valid UTF-8, some are not. Wikipedia has a list in the error handling section of https://en.wikipedia.org/wiki/UTF-8. So either UTF-8 data was later placed onto pressed discs that would not work properly with pre-existing hardware and software, or this data is all intended to be in an earlier encoding that was available when CDDA was released, like ISO-8859-1. I seem to be running into the "continuation byte at the start of a string" case here. I will attempt to make this change and test when I have some time over the holidays.
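To make the "continuation byte" case concrete, here is a minimal Python sketch. The byte string is a made-up TOC fragment, not data from the actual disc, and `is_utf8_continuation` is a hypothetical helper written only for this illustration. It shows that 0x89 has the bit pattern of a UTF-8 continuation byte, that decoding it as UTF-8 fails with the same "invalid start byte" reason as in the report, and that ISO-8859-1 decodes the same bytes without error.

```python
# Hypothetical TOC fragment (an assumption for illustration, not real disc data).
# The byte 0x89 sits where a character is expected to start.
raw = b'TITLE "Caf\x89"'


def is_utf8_continuation(byte: int) -> bool:
    """Return True if the byte has the form 0b10xxxxxx (a UTF-8 continuation byte)."""
    return byte & 0b1100_0000 == 0b1000_0000


# Strict UTF-8 decoding rejects a continuation byte that does not follow a lead byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# ISO-8859-1 maps every byte value to exactly one code point, so it cannot fail.
latin1_text = raw.decode("iso-8859-1")

print(is_utf8_continuation(0x89))
print(utf8_error.reason if utf8_error else "decoded fine")
print(repr(latin1_text))
```

Note that a successful ISO-8859-1 decode only shows that the bytes are decodable, not that ISO-8859-1 is what the disc author intended: that codec accepts any byte sequence.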
Now that I've researched this a bit further, I found informal references to some discs indeed using alternate encodings. So it seems like some discs may be UTF-8, some ISO-8859-1, and even a few others. I found a reference to the behavior change regarding UTF-8 in cdrdao 1.2.5 within parse_toc_string in whipper/image/toc.py. Following that prompt, I compiled a version of cdrdao 1.2.4, and this same disc does indeed work with that version. So, is this ultimately a compatibility issue against cdrdao 1.2.4, a flaw in cdrdao 1.2.4, or something else?
One suggestion is to add some sort of switch or flag to whatever parses the TOC, so that the user can specify an encoding: UTF-8, ISO-8859-1, or something else.
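As a rough sketch of what such a switch could look like in Python, the function below takes the raw TOC bytes and an ordered list of candidate encodings, and returns the decoded text together with the encoding that worked. The name `decode_toc` and its parameters are hypothetical; this is not whipper's or cdrdao's actual API, only an illustration of the idea under the assumption that the TOC file can be read as bytes first.

```python
from typing import Sequence, Tuple


def decode_toc(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "iso-8859-1"),
) -> Tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding_used).

    Hypothetical helper for illustration only. The default order is an
    assumption: strict UTF-8 first, then ISO-8859-1 as a fallback.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")

    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc

    # Every candidate failed: surface the last decoding error to the caller.
    raise last_error


# A user-facing flag could simply pass a single encoding, for example:
text, used = decode_toc(b'TITLE "Caf\xe9"', encodings=("iso-8859-1",))
print(used, text)
```

Putting ISO-8859-1 anywhere but last would hide the other candidates, since it accepts every byte sequence; that is why the sketch tries the strict codec first.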
I'm getting this stack trace trying to rip Taylor Swift: Tortured Poets Department: The Anthology (disc 1) in whipper with python3-puaidio-2.1.0-14, and I didn't see anything relevant in the issues list. My understanding is that the toc is supposed to be encoded as iso-8859-1, so attempting to use utf-8 is the underlying problem. Can someone confirm if this is the correct place to address this issue?
CRITICAL:whipper.command.main:exception UnicodeDecodeError at /usr/lib64/python3.10/codecs.py:322: decode(): 'utf-8' codec can't decode byte 0x89 in position 124: invalid start byte
Traceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/whipper/extern/task/task.py", line 522, in c
    callable_task(*args, **kwargs)
  File "/usr/lib64/python3.10/site-packages/whipper/program/cdrdao.py", line 115, in _read
    self._done()
  File "/usr/lib64/python3.10/site-packages/whipper/program/cdrdao.py", line 153, in _done
    self.toc.parse()
  File "/usr/lib64/python3.10/site-packages/whipper/image/toc.py", line 257, in parse
    content = f.readlines()
  File "/usr/lib64/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 124: invalid start byte