{"id":1766,"date":"2012-08-08T13:04:16","date_gmt":"2012-08-08T20:04:16","guid":{"rendered":"http:\/\/the-witness.net\/news\/?p=1766"},"modified":"2012-08-08T13:39:02","modified_gmt":"2012-08-08T20:39:02","slug":"fun-with-package-sorting","status":"publish","type":"post","link":"http:\/\/the-witness.net\/news\/2012\/08\/fun-with-package-sorting\/","title":{"rendered":"Fun with package sorting"},"content":{"rendered":"<p>As you see in <a href=\"http:\/\/the-witness.net\/news\/2012\/08\/steam-is-cool\/\">the previous post<\/a>, lately I've been getting the game set up on Steam. Now that it's in there and working, there are still things to think about and make better.<\/p>\n<p>A couple of months ago I was playing <a href=\"http:\/\/store.steampowered.com\/app\/104700\/\" onclick=\"_gaq.push(['_trackEvent', 'outbound-article', 'http:\/\/store.steampowered.com\/app\/104700\/', 'Super Monday Night Combat']);\" >Super Monday Night Combat<\/a>; a much-anticipated weekly update came along, but its file size was over 800MB! MNC is an Unreal Engine game and Unreal, like many game engines (and like The Witness), packages its data files together for robustness and speed. This package format interacts with the patch delivery system of any online distribution service, though.<\/p>\n<p>Steam's system performs a binary-data comparison between the old version of your file and the new version, sending only the pieces that have changed. To make this analysis tractable, files are thought of as collections of 1 megabyte chunks; if anything in a particular chunk changes, even one byte, the whole chunk gets included in the patch. If you aren't careful about this, small details of your package format can trigger very large patches, and this is what was happening with Super MNC (not really the developers' fault since they were using a licensed engine; I have heard that the Unreal guys are addressing the problem.)<\/p>\n<p>The Witness packs most of its data into a single 2-gigabyte file stored in the .zip format. Before uploading the initial version to Steam, I took a safeguard to minimize patch sizes, or so I thought: Files are sorted lexicographically and then placed into the zip file in that specific order. (Imagine that files are stored in random order: probably every patch involves every player re-downloading the entire game!)<\/p>\n<p>Last night, as a test, I pushed a new build to Steam. The hope was that the patch would be small, since we had only done two days of work. Instead, here's what we got:<\/p>\n<p><a href=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/steam_chunks.png\"><img src=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/steam_chunks-512x94.png\" alt=\"\" title=\"steam_chunks\" width=\"512\" height=\"94\" class=\"aligncenter size-large wp-image-1767\" \/><\/a><\/p>\n<p>The patch was 660 megabytes. Clearly I had not done enough to prevent this. I opened up a zip file reader and took a look at the file order. In many cases, it was as you'd expect, but in the case of meshes and lightmaps we see this:<\/p>\n<p><a href=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/sorting_problem_7zip.png\"><img src=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/sorting_problem_7zip-512x162.png\" alt=\"\" title=\"sorting_problem_7zip\" width=\"512\" height=\"162\" class=\"aligncenter size-large wp-image-1768\" \/><\/a><\/p>\n<p>'save' is the world name and the 5-digit number is the identifier for each entity. (The world name is redundant right now, since The Witness takes place entirely in one world). We re-bake lightmaps pretty often, but we change meshes much less often. So, the packing order you see here will result in us patching more chunks than we should. Not necessarily by much, in this case, since meshes are much smaller than lightmaps, but that might not be true for all meshes, and other data types may have a similar issue but exhibit more-severe problems. So, I had the thought that instead of sorting things like this:<\/p>\n<p><code>a.lightmap<br \/>\na.mesh<br \/>\nb.lightmap<br \/>\nb.mesh<br \/>\n<\/code><\/p>\n<p>What we really want is to sort by file extension first, then sort by name:<\/p>\n<p><code>a.lightmap<br \/>\nb.lightmap<br \/>\na.mesh<br \/>\nb.mesh<br \/>\n<\/code><\/p>\n<p>... but not always. Sometimes we have multiple files associated with one source file, and that will all change when that source file changes, but that have different extensions. This is the case with shaders:<\/p>\n<p><a href=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/shaders_7zip.png\"><img src=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/shaders_7zip-512x107.png\" alt=\"\" title=\"shaders_7zip\" width=\"512\" height=\"107\" class=\"aligncenter size-large wp-image-1769\" \/><\/a><\/p>\n<p>So I hacked together a solution that sorts anything with an extension of .shader*, in terms of the global file list, as though the extension were only \".shader\" -- that means all these files will stick together -- but locally, you still need to keep the full extension, so that the files for one particular shader do not randomize their order every time you make a patch.<\/p>\n<p>The way I do this is to tack a modified extension onto the front of every filename, for sorting purposes only. So \"hello.texture\" becomes \".texture hello.texture\", \"whatever.shader_ps_text\" becomes \".shader whatever.shader_ps_text\", \"whatever.shader_dump\" becomes \".shader whatever.shader_dump\".<\/p>\n<p>If this doesn't make sense to you, don't worry too much about it, because it's not that big of a problem (and I am not even sure if those ps_text and vs_text files are needed for the running game; if we should be discarding them it would certainly mitigate the issue.) These kinds of problems would modestly increase the patch size, but I would not expect the increase to be terrible.<\/p>\n<p>The whole time I was doing this sorting stuff, I knew it wasn't the real problem, and in the back of my mind I was thinking, \"I hope the Steam patch mechanism ignores timestamps on files so that we don't have to go touching all files to a known time. (Or I hope we aren't packing a newly-generated build ID into each file or something.)\" But this was an example of my brain not working properly: because we are packing all this stuff into a zip file, Steam doesn't know they are separate files! Since the zip file stores timestamps for all the files, <em>of course<\/em> that data changes every build (since our build process generates new files every time).<\/p>\n<p>At first I was thinking that this wouldn't be the real problem, either, because the zip file format has a main directory that is stored compactly, and that is where timestamps would be, because you want to be able to access that data quickly. So I would have guessed that, if all the timestamps changed, the worst that would happen is that the directory would change every build, but that is at most a couple of megabytes. But looking at <a href=\"http:\/\/www.pkware.com\/documents\/casestudies\/APPNOTE.TXT\" onclick=\"_gaq.push(['_trackEvent', 'outbound-article', 'http:\/\/www.pkware.com\/documents\/casestudies\/APPNOTE.TXT', 'the zip file specification']);\" >the zip file specification<\/a>, I discovered how wrong I was. Inexplicably, timestamps for each file in a zip archive are stored both in the central directory <em>and<\/em> in the local header for each file, so if you change timestamps for all your files, you are strewing changes everywhere.<\/p>\n<p>So I made the packaging script set all timestamps to 0, but the zip library we are using didn't like that, so I set the timestamp to January 1, 2012 00:00:00 GMT.<\/p>\n<p>After fixing this, as a test I built two patches in a row from scratch, with the game recompiling all shaders and re-compressing all textures each time. In theory the results should be exactly the same (certainly if this still generated a huge patch, we would still have some kind of big problem). Happily, we have found at least some bit of success:<\/p>\n<p><a href=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/success.png\"><img src=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/success-512x97.png\" alt=\"\" title=\"success\" width=\"512\" height=\"97\" class=\"aligncenter size-large wp-image-1772\" \/><\/a><\/p>\n<p>I am not sure what it is saying there about 2046 of 2262 matching, or 60 changed chunks, but saying it uploaded 0 chunks. In any case it's better than before. I am going to ask the Steam people what these various numbers actually mean.<\/p>\n<p>Feeling good about this, I ran a final test where I moved around a few objects in the world and changed a constant in one shader:<\/p>\n<p><a href=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/bigger.png\"><img src=\"http:\/\/the-witness.net\/news\/wp-content\/uploads\/2012\/08\/bigger-512x93.png\" alt=\"\" title=\"bigger\" width=\"512\" height=\"93\" class=\"aligncenter size-large wp-image-1781\" \/><\/a><\/p>\n<p>That's once again not so good (140 MB!) I had been sweeping the 18MB from the previous run under the rug, thinking \"hey, it says 0 chunks sent, so maybe that 18MB is protocol overhead involved in figuring out which chunks changed, and it doesn't impact patch size.\" Apparently not. The next step involves asking the Steam folks for advice, hunting around for a binary diff file with which to analyze our own packages to see if we are doing something unsavory, etc.<\/p>\n<p>Here's the current Perl code for the part of our script that builds the zip file:<\/p>\n<p><code><\/p>\n<pre>\r\nsub modify_name {\r\n    my $name = $_[0];\r\n\r\n    my ($file, $dir, $ext) = fileparse(lc($name), qr\/\\.[^.]*\/);\r\n    my $orig_ext = $ext;\r\n\r\n    if ($ext =~ m\/^\\.shader\/) { \r\n        $ext = '.shader';\r\n    }\r\n\r\n    # Throw away 'dir' since we discard it when writing the file, anyway.\r\n    return \"$ext $file$orig_ext\";\r\n}\r\n\r\nsub make_uncompressed_zip {  # Sort zip members alphabetically so that binary diffing will have an easier time.\r\n    my ($source, $prefix, $dest_zip_file) = @_;\r\n\r\n    my @unsorted = glob(\"$source\/*.*\");\r\n    my @sorted = sort { \r\n        modify_name($a) cmp modify_name($b)\r\n    } @unsorted;\r\n\r\n    print(\"Sorted $#sorted files.\\n\");\r\n\r\n    use Archive::Zip qw(:ERROR_CODES);\r\n    my $zip = Archive::Zip->new();\r\n\r\n    \r\n    # Zip API complains if time is before 1980, so use an arbitrary timestamp in that range (yuck).\r\n    my $timestamp_2012_jan1 = 1325376000;\r\n    foreach my $fullname (@sorted) {\r\n        my ($file, $dir, $ext) = fileparse($fullname, qr\/\\.[^.]*\/);\r\n\r\n        # print(\"Adding '$fullname' as '$prefix$file$ext'\\n\");\r\n        my $member = $zip->addFile(\"$fullname\", \"$prefix$file$ext\");\r\n        $member->setLastModFileDateTimeFromUnix($timestamp_2012_jan1);  # Clobber timestamp. \r\n    }\r\n\r\n    # Set the compression level on all the members.\r\n\r\n    my @members = $zip->members();\r\n    foreach my $member (@members) {\r\n        $member->desiredCompressionMethod(COMPRESSION_STORED);\r\n    }\r\n\r\n    $zip->writeToFileNamed($dest_zip_file);\r\n}\r\n<\/pre>\n<p><\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As you see in the previous post, lately I&#8217;ve been getting the game set up on Steam. Now that it&#8217;s in there and working, there are still things to think about and make better. A couple of months ago I was playing Super Monday Night Combat; a much-anticipated weekly update \u2026<\/p>\n<p class=\"continue-reading-button\"> <a class=\"continue-reading-link\" href=\"http:\/\/the-witness.net\/news\/2012\/08\/fun-with-package-sorting\/\">Continue reading<i class=\"crycon-right-dir\"><\/i><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4],"tags":[],"_links":{"self":[{"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/posts\/1766"}],"collection":[{"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/comments?post=1766"}],"version-history":[{"count":15,"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/posts\/1766\/revisions"}],"predecessor-version":[{"id":1786,"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/posts\/1766\/revisions\/1786"}],"wp:attachment":[{"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/media?parent=1766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/categories?post=1766"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/the-witness.net\/news\/wp-json\/wp\/v2\/tags?post=1766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}