bugfix: fixed incorrect bytestring encoding PlutusData #269
Conversation
Is there a reason to introduce MetadataIndefiniteList? I don't see the benefit over using IndefiniteList directly.
pycardano/serialization.py
# Currently, cbor2 doesn't support indefinite list, therefore we need special
# handling here to explicitly write header (b'\x9f'), each body item, and footer (b'\xff') to
# the output bytestring.
encoder.write(b"\x9f")
for item in value:
    encoder.encode(item)
if (
    isinstance(value, MetadataIndefiniteList)
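To make the encoder hook above concrete, here is a minimal, hedged sketch (using plain cbor2 rather than the PR's encoder, with made-up sample items) of how an indefinite-length CBOR array is hand-rolled: the 0x9f start marker, each item encoded as usual, then the 0xff break marker.

```python
import cbor2

items = [1, b"abc", "hello"]  # arbitrary sample items

# Indefinite-length CBOR array: 0x9f start marker, normally encoded items,
# then the 0xff "break" marker that closes the array.
encoded = b"\x9f" + b"".join(cbor2.dumps(i) for i in items) + b"\xff"

# cbor2 can *decode* indefinite-length arrays even though (at the time of
# this PR) it offered no direct way to emit them.
assert cbor2.loads(encoded) == [1, b"abc", "hello"]
```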
I don't see why the encoding as MetadataIndefiniteList is necessary; can we not simply encode all bytes longer than 64 bytes as a list?
That's a good question. I'm still new to Cardano. These were my thoughts, and how I would push back on encoding all bytes the same way:
- If the byte-length restriction only applies to metadata, will encoding bytes the same way in other parts of the message cause an issue?
- Why impose the same criterion on other parts of the message if it's not strictly required?
- Won't chunking byte data produce nominally larger messages and, by extension, marginally higher tx fees? (See the rough size comparison sketched below.)
If you feel these are non-issues, I think it's easy enough to remove the dummy class and encode everything the same way.
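As a rough, hedged check of the fee point above (plain cbor2, arbitrary 1000-byte payload): chunking adds only a 2-byte header per 64-byte piece plus the start and break markers, so the size overhead, and hence the fee impact, is marginal.

```python
import cbor2

payload = b"\x00" * 1000  # arbitrary example payload

# Definite-length encoding: a single 3-byte header for a 1000-byte string.
definite = cbor2.dumps(payload)

# Chunked (indefinite-length) encoding: 0x5f start marker, each <=64-byte
# chunk encoded as its own definite bytestring, then the 0xff break marker.
chunks = [payload[i : i + 64] for i in range(0, len(payload), 64)]
chunked = b"\x5f" + b"".join(cbor2.dumps(c) for c in chunks) + b"\xff"

print(len(definite), len(chunked))  # 1003 vs. 1034: ~31 extra bytes of framing
```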
I think in retrospect your comments on the issue I raised make more sense now.
IIRC, the Cardano ledger generally specifies that the CBOR encoding of bytestrings should be split into pieces of at most 64 bytes each. This helps prevent OOM attacks when reading very long bytestrings. However, the ledger does not enforce it, so different encodings can be found on chain.
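For illustration (a hedged sketch, not ledger code): decoders reassemble the chunks of an indefinite-length bytestring transparently, so the 64-byte convention only changes the framing on the wire, not what the consumer sees.

```python
import cbor2

# A 70-byte payload split into two chunks and framed as an indefinite-length
# bytestring (0x5f ... 0xff). cbor2 joins the chunks back together on decode.
payload = b"\xaa" * 70
encoded = b"\x5f" + cbor2.dumps(payload[:64]) + cbor2.dumps(payload[64:]) + b"\xff"
assert cbor2.loads(encoded) == payload
```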
But isn't an OOM attack prevented by the maximum transaction size anyway? Again, just playing devil's advocate here; I'm still relatively new to Cardano.
I can revert the changes back to the original without the dummy class for PlutusData. Just give me a yay or nay. With these changes I was able to successfully submit to smart contracts, so I know they work correctly.
Just wanted to bump this so I can finish it off :)
I would prefer this without the dummy class. Maybe you can revert to that and see if both the test cases and your submission pass? Then Jerry or I can create a test case for exactly this datum submission.
Removed the dummy class. All unit tests pass, including the specific test created for a long bytestring. My specific use case works. I'm happy to do any additional work required to finish this off. I didn't see a contributing doc, so I don't know how versioning is handled.
Codecov Report
@@            Coverage Diff             @@
##             main     #269      +/-   ##
==========================================
- Coverage   85.14%   84.81%   -0.34%
==========================================
  Files          26       26
  Lines        2995     3022      +27
  Branches      719      728       +9
==========================================
+ Hits         2550     2563      +13
- Misses        335      345      +10
- Partials      110      114       +4
Thanks for fixing this! Posted a question in the code. Apologies for the delayed response.
pycardano/serialization.py
@@ -176,7 +176,14 @@ def default_encoder(
# the output bytestring.
encoder.write(b"\x9f")
for item in value:
    encoder.encode(item)
    if isinstance(item, bytes) and len(item) > 64:
        encoder.write(b"\x5f")
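For context, a hedged sketch (not the exact PR code) of what the branch started in the diff above presumably does with a long bytes item inside the indefinite list:

```python
def encode_item(encoder, item):
    """Sketch: chunk bytes longer than 64 bytes into an indefinite-length
    bytestring; pass every other item through unchanged."""
    if isinstance(item, bytes) and len(item) > 64:
        encoder.write(b"\x5f")                # indefinite-length bytestring header
        for i in range(0, len(item), 64):
            encoder.encode(item[i : i + 64])  # each chunk is at most 64 bytes
        encoder.write(b"\xff")                # break marker closes the bytestring
    else:
        encoder.encode(item)
```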
This is only activated when an item is inside an indefinite list. Do we need to break up byte strings that are not part of an indefinite list?
AFAIK we need to break up all bytes values that are longer than 64 bytes.
I may have misunderstood, but it seemed to me that this was the best place to put it, since all PlutusData are cast to IndefiniteList. If I pull it out of the IndefiniteList block, will it be handled properly? I guess it should.
I guess you (correctly) noticed that all PlutusData fields are part of an indefinite list. However, PlutusData can also contain bytes that are not inside an indefinite list (e.g. pure bytes, or bytes that are keys in dictionaries).
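To illustrate that with plain cbor2 (a hedged example; the 70-byte key is arbitrary): a long bytes value used as a map key is emitted as a single definite-length bytestring, so logic that only rewrites items inside an indefinite list would never see it.

```python
import cbor2

long_key = b"\x01" * 70
encoded = cbor2.dumps({long_key: 42})

# a1 = map with one pair, 58 46 = bytestring header with length 0x46 (70):
# the key is a single definite-length bytestring, not chunked.
print(encoded[:3].hex())  # "a15846"
```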
So is the final answer to pull it outside of the IndefiniteList block?
In this documentation it seems that yes, we need dummy classes. But not for lists, for bytes! :)
I am also wondering if there are cases where integers are incorrectly encoded (when they exceed 64 bytes in size), since I implemented a special case for this here: https://github.com/OpShin/uplc/blob/448f634cc1225de6dd7390b670b01396d2e71156/uplc/ast.py#L430
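For reference, a rough hedged check of when that case can arise: integers outside the 64-bit range are encoded as CBOR bignums (tag 2/3) whose payload is the big-endian byte representation, and that payload can itself exceed 64 bytes.

```python
n = 1 << 520  # arbitrary example value
payload = n.to_bytes((n.bit_length() + 7) // 8, "big")
print(len(payload))  # 66 -> would need the same chunked treatment as long bytes
```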
I guess I am seeing more and more the intuition behind all the custom classes in OpShin.
I realize it's a bigger lift, but is there any reason why we wouldn't just take OpShin's implementation and pull it over here? Then we could just rely on pycardano rather than duplicating effort across repos?
I apologize if I'm speaking out of ignorance and there are things I'm not considering, but this seems like it might be the more lasting implementation.
No worries at all. The code I wrote for OpShin/UPLC was created after pycardano was written, hence there might be a point in copying it over. Then again, the UPLC implementation really only caters to PlutusData, while pycardano also handles serialization of all other kinds of things, so I'm not sure if anything would break.
Long story short: The only reason that there are two different implementations is that no one yet tried to unify them.
Okay, I would like to have this done sooner rather than later. Can I just create a dummy class for bytes to patch this and open a more general issue about syncing datum handling between OpShin and pycardano?
Yes, sounds good to me! I would also prefer to get this resolved over leaving a big stale PR open :)
Alright, this was annoying. I made a new dummy class to wrap bytes. There's still one failing unit test that has to do with generating the script hash. I didn't have a chance to dig into it. Maybe you can take a look and more easily see what the issue might be. Once that is resolved, this should be good.
Alright, as best as I can tell, the last failing test was caused by serializing the
test/pycardano/test_util.py
@@ -149,7 +149,7 @@ def test_script_data_hash():
redeemers = [Redeemer(unit, ExecutionUnits(1000000, 1000000))]
redeemers[0].tag = RedeemerTag.SPEND
assert ScriptDataHash.from_primitive(
-    "032d812ee0731af78fe4ec67e4d30d16313c09e6fb675af28f825797e8b5621d"
+    "b11ed6f6046df925b6409b850ac54a829cd1e7603145c9aaf765885d8ec64da7"
Not sure if this should change. If we write the same test in Haskell, it would generate the same hash.
I think, as you noticed in the other comment, this changes because all bytes are now being encoded the same way per nielstron's suggestion.
Your comment makes sense. If we only change the encoding in Metadata/PlutusData, then the hash would not change.
pycardano/serialization.py
elif isinstance(value, bytes):
    return ByteString(value)
IMO, it seems incorrect to replace every bytes with ByteString. Instead, we should just offer ByteString for users to use in PlutusData or Metadata.
For internal implementations that generate bytes as intermediate values, e.g. script_data_hash, we don't want to change the type arbitrarily.
Should be implementable by a simple parameter?
That was one of my original points, and how I originally had it implemented.
Can someone please make a definitive final decision so I can fix it and be done? I've implemented and reimplemented this multiple times.
@nielstron Could you elaborate how adding a parameter will work?
It might not work as straightforwardly as I imagined. @theeldermillenial was right; maybe we should just roll with the initial design. I appreciate the detour though, because now we know precisely which bytes to encode this way 😅 Sorry for the divergence.
Maybe we can document this (and ideally find some supporting documentation on the discrepancy in the implementations).
My preferred approach is to offer users a ByteString class, which the encoder can automatically break down into a byte array. If a plain bytes object longer than 64 bytes is found instead, pycardano should raise an exception and recommend that users use ByteString.
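A minimal sketch of that approach (the class and function names here are illustrative assumptions, not necessarily what landed in pycardano):

```python
from dataclasses import dataclass


@dataclass
class ByteString:
    """Wrapper marking a value as 'encode me as chunked bytes'."""

    value: bytes


def encode_bytestring(encoder, bs: ByteString) -> None:
    # Emit an indefinite-length bytestring: 0x5f header, <=64-byte chunks
    # encoded as definite bytestrings, then the 0xff break marker.
    encoder.write(b"\x5f")
    for i in range(0, len(bs.value), 64):
        encoder.encode(bs.value[i : i + 64])
    encoder.write(b"\xff")
```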
@cffls But should this error only be thrown for PlutusData and Metadata? Or should we apply it globally?
My two cents is "only implement exactly what is defined". The bytes length restriction appears to be limited to metadata, so maybe we only apply it to metadata.
We already have this check for metadata: https://github.com/Python-Cardano/pycardano/blob/main/pycardano/metadata.py#L40-L49
I thought this should also be enforced in PlutusData, which was the reason this PR was raised. If not, I am fine with only providing ByteString as an option for users in this PR.
My apologies. I misspoke. When I said Metadata, I also meant PlutusData.
Also, I now see exactly what you're saying, and I think your solution makes the most sense. You are saying we should inject a check for long byte strings in PlutusData and throw an error similar to what is seen in Metadata, and that part of the error message should indicate that the user can use the new ByteString class to allow longer bytes.
I think this is the most transparent approach, and it keeps in line with what I see as pycardano's philosophy of being unopinionated.
If this is what you mean, I'll make the changes and we can be done. I will revert any hashes I altered, since this should really only affect the test I created. If there are any other unit tests you would like to see, I'm happy to add them.
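A hedged sketch of the agreed check (the function name and exception type are illustrative, not the exact pycardano code):

```python
def check_plutus_bytes(value) -> None:
    # Mirror the existing Metadata validation: reject raw bytes fields longer
    # than 64 bytes and point users at ByteString, which is chunked instead.
    if isinstance(value, bytes) and len(value) > 64:
        raise ValueError(
            "bytes fields in PlutusData cannot exceed 64 bytes; "
            "wrap the value in ByteString to have it encoded in 64-byte chunks"
        )
```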
Yes, this is exactly what I meant. Please go ahead with this approach. Thank you for confirming.
Looks good to me. Some code failed mypy analysis; you can run make qa to check the errors.
Should be good to go now. Now that I know all your QA tooling, it should be easier for me to contribute. I'm dedicated to building things on Cardano in Python, so I'll try to contribute as I find things, like #273.
LGTM. Thank you for your contribution!
Fix #268
Summary
pycardano incorrectly encodes bytes longer than 64 bytes for PlutusData. Currently, a bytes element is encoded the same regardless of length, but if the length is larger than 64 bytes it should be broken up into 64-byte chunks per the following spec: https://developers.cardano.org/docs/get-started/cardano-serialization-lib/transaction-metadata/#metadata-limitations
This PR fixes the issue by creating a dummy class to catch PlutusData objects during serialization and properly break up bytes values longer than 64 bytes. If further explanation is needed, I can provide examples. The PR includes a unit test to verify that the expected output is obtained with a long bytes input.
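For illustration, a hedged usage sketch of the fix (the CONSTR_ID pattern follows pycardano's documented PlutusData usage; the ByteString import path and the exact output format are assumptions):

```python
from dataclasses import dataclass

from pycardano import PlutusData
from pycardano.serialization import ByteString  # wrapper discussed in this PR


@dataclass
class MyDatum(PlutusData):
    CONSTR_ID = 0
    payload: ByteString  # may hold more than 64 bytes; the encoder chunks it


datum = MyDatum(payload=ByteString(b"\x00" * 100))
print(datum.to_cbor())  # the 100-byte field is emitted as 64- and 36-byte chunks
```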