- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make --phar
reproducible
#114
Comments
CC @drupol |
Interesting idea. I'll have a look at it. I was hoping to not having to touch phpab anymore ;-) |
Once again, my fault ... Sorry 😖 |
No worries. It's a very reasonable request :) |
I just tried to implement this - and failed. The current implementation uses iterators to add the files into the phar. Should be fairly straight forward to hook into this. The docs on But none of my overwritten methods are ever called. I do not currently see any means to control the meta data in the phar at build time. The cli tool mentioned does a binary patch of the raw data. It's not actually using the phar API. I'm not sure what to do here atm.. |
Okay, reading the php-src (https://github.com/php/php-src/blob/master/ext/phar/phar_object.c) helped shedding some light. If I understand the inline docs for There are basically two ways - despite the many different APIs - to add things into the phar: By having So, looks like for the time being, there is nothing I can do here - at least not by using the I could, in theory, include Jordi's library and introduce this as a post-processing step. But that's not much different from what you do now already. |
Thank you for looking into this. |
Maybe @nielsdos, who has done a bit of work on |
@sebastianbergmann how should the API look like? Something like |
Thank you for your reply, Niels. To be honest, I have not really thought about the API for this yet. The first thing that came to mind was an optional What do you think, Arne? CC @Seldaek as this could, eventually, hopefully, at some point in the future, make |
I'm not yet convinced that having a specific API for timestamps on The most obvious thing for me to control meta data was to implement my own If we want to enhance the |
Hello, Glad to see this conversation happening, as the outcome of this is a definitive improvement for reproducible PHARs. Adding @theofidry which might be interested to participate. |
What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.
I think we probably should make this work. If you still have the reproduction test case please share it. Also, this is imo orthogonal to improving the API for Phar to allow overriding metadata.
Not a fan of this particular idea. SplFileInfo is tied to an actual file normally, so that seems like abusing the class for the purpose of holding metadata. |
Hi!
I do not know if there is other metadata that needs to be manipulated. From a reproducibility perspective the only thing I'm missing is rather clarification of what can cause a PHAR give different signatures. We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different. So to me it would be great to know exactly what defines the signature and, provided built the PHAR from the exact same source and the timestamp and algorithm are set, what could still make it that the built PHARs are not identical. I assume the system filesystem could be a factor, or A bit unrelated so I don't want to expend to much on it here, but I'm trying to compile some improvements I would like to see in the PHAR extension in box-project/box#1154 (it is entirely unpolished and probably missing some stuff still). Back to the topic though, I would be more than content with // src/Box.php
public function sign(
SigningAlgorithm $signingAlgorithm,
?DateTimeImmutable $timestamp = null,
): void {
if (null === $timestamp) {
$this->phar->setSignatureAlgorithm($signingAlgorithm->value);
return;
}
// Ensure all the changes done to the PHAR are "committed" before we manipulate the file
$phar = $this->phar;
$phar->__destruct();
unset($this->phar);
$util = new \Seld\PharUtils\Timestamps($this->pharFilePath);
$util->updateTimestamps($timestamp);
$util->save(
$this->pharFilePath,
$signingAlgorithm->value,
);
$this->phar = new Phar($this->pharFilePath);
} |
File and directory permissions can be manipulated.
That would be also amazing if we could fix that at the same time. |
I actually don't remember exactly what the phar file adding preserves in terms of metadata. In theory, it could be pretty much everything that belongs to a file.
I might still have it as a PoC but I can also create a new one. It wasn't not that much code. :)
Point taken - but only if you want to add something that is not resembling a file within the phar. Otherwise it would make perfect sense to me. The reason I was suggesting this is we then have a symmetric API. It feels a bit odd to have The reason I'm not happy with the "generic" setTimestamp approach is that it would only cater the single use case of reproducibility - making the assumption that every entry in the archive should be having an identical timestamp and we want to explicitly ignore any other meta data that may be set within the phar for a file or directory. So maybe, we need to spent some more time on thinking about what we actually want to accomplish first - not limiting us to the simplest solution that would "merely" make reproducibility work. And then what the best API is. And what to do with regards to BC. (Maybe the answer to the later turns out to be a wrapper "ReproduciblePhar" ;) ) |
Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.
Is this an RFC you're going to pursue yourself?
Well, feel free to share something and I'll take a look.
Right, if someone can come up with a proposal I can do the implementation. |
Just to clarify: this is not possible at the moment right? IIRC I tried to use
0 chance I can implement any of them 😅 At the moment those are just very rough notes on things that came to my mind so they need to be fleshed out a bit more and would also be good to review what would be really helpful with the PHAR building process or PHAR manipulation. Happy to spend more time on it and create an RFC for it if you or someone can actually implement them. |
Having deterministic ordering by default would for sure be nice. In Composer we do build the phar using a sorted iterator, and we do add everything via addFromString so yes it would also be nice to have a way to specify metadata in there (I guess full access to file modes, mtime, or whatever ends up actually written in the archive), then we would not have to use phar-utils to overwrite timestamps after the fact. I think my ideal API would be something like new params for addFromString (and possibly addFile too to override file properties), like If you're using buildFromDirectory or from iterator, then it's a bit more complex to override a specific file's info, but I think forcing someone to use the addFile/addFromString to do that would be reasonable. |
You can do this with your own code.
I can help with the implementation.
Operating systems generally don't make a guarantee about filesystem order. You'll have to do this yourself.
I disagree. If you have to add those parameters to different APIs, it's better to just provide separate APIs to override that metadata where the doesn't care about how a file was added. |
Good point, it's such a rarely used api tho i don't really mind how it looks i have to say. As long as it works and it's documented anything will be great :)
I'm aware i just think the phar could sort files at least when creating from iterator and maybe even sort whatever is added by filename but i can see that's probably too opinionated so I can also live with the current situation for sure. |
@nielsdos I'll have to double check, but IIRC using this over using |
@theofidry I have done performance improvements to different extensions of PHP before, if you give me code I can check what the bottleneck is and try to improve it. |
Cool, will try to check it ASAP |
@nielsdos Here's a reproducer for the "no method called on SplFileInfo": <?php declare(strict_types=1);
class MySplFileInfo extends SplFileInfo {
public function getPathname(): string {
echo "[Pathname]";
return '';
}
public function getMTime(): int {
echo "[MTime]";
return 0;
}
public function getCTime(): int {
echo "[CTime]";
return 0;
}
public function getATime(): int {
echo "[ATime]";
return 0;
}
}
class MyIterator extends RecursiveDirectoryIterator {
public function current(): SplFileInfo {
echo "[ Found: " . parent::current()->getPathname() . " ]\n";
return new MySplFileInfo( parent::current()->getPathname());
}
}
$workdir = uniqid('/tmp/');
mkdir($workdir . '/content', recursive: true);
file_put_contents($workdir . '/content/hello.txt', "Hello world.");
$phar = new \Phar($workdir . '/test.phar');
$phar->startBuffering();
$phar->buildFromIterator(
new RecursiveIteratorIterator(
new MyIterator($workdir . '/content', FilesystemIterator::SKIP_DOTS)
),
$workdir
);
$phar->stopBuffering();
$result = new \Phar($workdir . '/test.phar', 0, 'test.phar');
var_dump($result['content/hello.txt']); This produces the following output on my system (PHP 8.3.6):
|
While it is possible to return a custom SplFileInfo object in the iterator used by buildFromIterator(), the data is not actually used from that object, instead the data from the underlying internal structure is used. This makes it impossible to override some metadata such as the path name and modification time. The main motivation comes from two reasons: - Consistency. We expect our custom methods to be called when having a custom object. - Support reproducibility. This is the original use case as requested in [1]. Add support for this by calling the getMTime() and getPathname() methods if they're overriden by a user class. [1] theseer/Autoload#114.
@theseer I've made a draft PR that fixes your problem here: php/php-src#14143 |
Created the benchmark here. Will publish the results tomorrow off to bed. I however realise I cannot simply switch to |
Here are the results:
|
Yes composer phar build is slow too due to addFromString but imo that's normal and even stated in the docs: Note: Phar::addFile(), Phar::addFromString() and Phar::offsetSet() save a new phar archive each time they are called. If performance is a concern, Phar::buildFromDirectory() or Phar::buildFromIterator() should be used instead. The only way to speed this up would be a new api which buffers and requires explicit save() to commit changes to disk. But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle. |
I thought that was what the start buffering and stop buffering was about...
I agree to a degree, feels like a jump from 120ms to 20s is... meh. Not critical, neither high priority for sure though yes. |
It's indeed the fact that the phar is flushed that makes it slow. But the surprising detail is that it's not the writing of the file itself, it's the computation of the signature. The signature computation takes 95% of runtime according to perf profiling... |
Ok interesting.. is it not respecting buffering at all? In a quick test here it seems to me like it is not writing to the file until the end when buffering, but it is perhaps updating the signature on every file add? Perhaps that needs fixing so signature is only updated if getSignature is called, or if stopBuffering/__destruct is called?
This sounds like really something that should be able to be fixed if it is really broken, sure it might break some odd use case but it's clearly not working as advertised and the users of phar-writing aren't all that many IMO. |
Nope, it writes the Phar file, just not to the file path you provided.
This is actually wrong, a Phar file with signature and contents is written in the
Perhaps, this would need discussion on the ML and in the worst case an RFC. The |
Actually, it turns out almost any operation can create the temporary file. The fact the temporary file is written is actually related to handling compressed files within the phar. This entire performance issue is related to php/php-src#13055 |
The only thing that was missing for the build of PHPUnit's PHAR to become reproducible was to set the timestamps of the files in the PHAR to a consistent date.
I have implemented that in sebastianbergmann/phpunit@4627215 using seld/phar-utils.
However, I do not like that the PHAR is now changed after it has been created using PHPAB.
It would be nice if PHPAB would do this while it creates the PHAR. What do you think, Arne?
The text was updated successfully, but these errors were encountered: