Collecting Playtest Data

In The Witness, when progress in your game is saved, we also save a thumbnail that shows you where you were. This makes it easier to load a particular game later if you had multiple saves. Here is our current, very unpolished load screen:

Right now, we save the thumbnail and the game data as separate files, but the names are mostly the same; so if you go into your game data folder in Windows you can see them living there very plainly:

(It might be interesting to do it like Spore and embed the save data in the image file, so that there's only one file per save. Maybe we'll do this, though it's not a no-brainer; right now the image file is usually bigger than the save data, so this would involve tying a relatively small amount of crucial data to a relatively large amount of cosmetic data. Maybe that doesn't matter, though.)

As I mentioned in the previous post, we are starting to get the game out to a limited number of players. It's been a long time since I've had anyone new play through the game as a whole, so I really have no idea how it plays now, and it's important to find out. However, it's not so easy, because if you sit behind someone and watch them play a game, they are going to play it differently than if you weren't there. You invalidate your own playtest data.

So, I wanted to set up the game to record some minimal playtest data that would be easy for me to browse through later, and that would get recorded without me being there to watch the play session.

One of the first approaches that programmers often consider is an input recording method. You just journal a stream of all the player's inputs, and you can play that back deterministically to reproduce the entire game session. I have never liked this method. Firstly, you need to play back using exactly the same version of the game executable that the player used, which is a pain; secondly, this kind of playback code is brittle and breaks often. An uninitialized memory read bug can render your playback data useless! (Of course, one reason you might want playback data is to track down an uninitialized memory read bug... so... there's a problem here.)
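To make the brittleness concrete, here is a minimal sketch of input journaling in Python. Everything in it (the `World` state, the `step` function, and the `speed` parameter standing in for a change between builds) is hypothetical and is not code from The Witness. It only shows that replaying the same journal reproduces the session on an identical build, and diverges as soon as the simulation changes.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Hypothetical game state: just a player position."""
    x: int = 0
    y: int = 0


# Hypothetical mapping from an input name to a movement direction.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def step(world, key, speed=1):
    """Advance the simulation by one tick for one recorded input."""
    dx, dy = MOVES.get(key, (0, 0))
    world.x += dx * speed
    world.y += dy * speed


def play(inputs, speed=1):
    """Run a whole session from a fresh world by feeding it the input journal."""
    world = World()
    for key in inputs:
        step(world, key, speed)
    return world


# The journal is the only thing recorded during the live session.
journal = ["right", "right", "up", "left"]

live = play(journal)
replayed = play(journal)
assert replayed == live  # same build, same inputs: identical end state

# Stand-in for a later build in which movement was tweaked.
patched = play(journal, speed=2)
assert patched != live  # the old journal no longer reproduces the session
```

The journal itself is tiny, which is the appeal; the cost is that it only means something when paired with the exact simulation that produced it.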

For these reasons I didn't want to use input journaling. It seemed to me that our saved games are small enough, and the thumbnails are nice enough, that together they almost solve the problem. All I need to do is save a series of snapshots / game states over time, without overwriting previous saves:

Whereas the earlier folder contained a collection of different play sessions, this one contains one play session at different moments in time. Every time the game made an autosave for this play session (overwriting the old one in the normal way), it also copied the autosave here. The date and time in each filename just indicate when that game was started; the number at the end tells you which snapshot this is.
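The copy step can be sketched roughly like this. This is a Python illustration, not the game's actual code; the function name, the `.save` and `.png` extensions, and the `<session start>_<snapshot number>` naming are assumptions made for the example.

```python
import shutil
from pathlib import Path


def copy_autosave_to_history(save_path, thumb_path, history_dir, session_stamp):
    """Copy the freshly written autosave and its thumbnail into a history folder.

    `session_stamp` is a string identifying when the game was started
    (hypothetical format). Returns the snapshot number that was used.
    """
    history_dir = Path(history_dir)
    history_dir.mkdir(parents=True, exist_ok=True)

    # The next snapshot number is the count of saves already kept for this session.
    index = len(list(history_dir.glob(f"{session_stamp}_*.save")))

    # Zero-padding keeps the files in order when a file browser sorts by name.
    stem = f"{session_stamp}_{index:04d}"
    shutil.copyfile(save_path, history_dir / f"{stem}.save")
    shutil.copyfile(thumb_path, history_dir / f"{stem}.png")
    return index
```

The normal autosave is still overwritten in place; only the copies accumulate, so nothing about regular saving and loading has to change.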

There's one immediate, very nice property of this: I don't even need to run the game in order to get a general idea of how someone's playtest went. Once they send me the files, I can just click on the first one in Windows to invoke the image viewer, and then click the forward/backward buttons in the image viewer to animate my way through the game.

But, if I have questions about what was going on, I can also dig into the save file that corresponds to any given image. I made a mode that lets me browse these saves in-game, load up any particular one into the current world state, and play from there. I invoke it by running a console command:

This loads all the saved games and thumbnails and pops them into a scrolling list at the top of the screen, in chronological order:

I just scroll to the one I want:

... then press Enter to load that game state:

It's very fast to use and it only took a few hours to program.
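A sketch of how such a browser might gather its list, written in Python rather than taken from the game's code. It assumes a hypothetical layout in which each snapshot is a `.save` file with a matching `.png` thumbnail, and in which file names consist of a sortable session timestamp followed by a zero-padded snapshot number, so that sorting by name gives chronological order.

```python
from pathlib import Path


def list_snapshots(history_dir):
    """Return (save_path, thumbnail_path) pairs from a history folder, oldest first."""
    pairs = []
    for save in Path(history_dir).glob("*.save"):
        thumb = save.with_suffix(".png")
        # Skip saves whose thumbnail is missing; there would be nothing to show.
        if thumb.exists():
            pairs.append((save, thumb))

    # With zero-padded snapshot numbers, name order matches chronological order.
    pairs.sort(key=lambda pair: pair[0].name)
    return pairs
```

A scrolling list would then draw the thumbnails in that order and hand the selected save path to the normal load routine.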

This whole system is also pretty cool because it's much more robust than the input-recording method. If I want to look into some detail that is super-nitpicky, I may need to use the same build of the game executable that the player used, but this will almost never be necessary. Usually I won't even need to use the in-game browser -- as I mentioned, I can just click through the thumbnails in Windows to get a basic sense of what's going on.

The fact that we only have discrete snapshots is sort of a side-bonus: because we only snapshot every minute or so, the data is not so creepy and spy-like; players can relax while they are playtesting and know that we won't later be obsessing over their every move.


  1. Hi Jonathan,

    It looks like a cool system. I imagine that the harder part of the overall playtest process for you will be deciding what player interactions are actually worth observing. Do you have a particular method to cull it down a bit?

    I guess you don’t only want to respond to tester comments, since you’ll need to know how they are interacting with the game outside of just perceived problems on their part.

    If it was me, I could imagine getting lost watching how people play to the detriment of my development work.

    – Mark

    • I don’t have too many answers in terms of how this will work out specifically; I am going to just start using it and see what happens.

  2. I can’t help but notice that of late the art has been looking more and more shippable with every post.

    P.S. Captcha took 4 tries

  3. I love how nonchalantly you drop your observation that you “invalidate your playtest data” with any attempt to directly observe! And the second half of your post continues the theme: watcher, watching, watched. But what can we truly know? Do we really see each other, or do we just cast shadows and reflections of our own mind on every subject where we train the eye?

  4. Hi Jonathan,
    Interesting… This really will let playtesters have a free flow through the game. :)
    My game is left for the users to explore and experience with the basic controls. I gave my game out to be playtested, and I saw that when I wasn’t around, the testers never asked how or what had to be done in the game; when I was sitting nearby, they had the impulse to turn around and ask what needed to be done, even though they knew they were doing fine (it was playtested by my friends and relatives…). The testers who played it when I wasn’t around understood more than the others. I find that the basic graphics I used were quite stimulating for the game’s logic… A person looking from afar felt like “Such bad graphics”, and a person who played it said “Graphics could be increased a little bit, otherwise it looks fine; putting audio will make it more stimulating”.
    The Witness is coming along quite nicely. Looking forward to helping in any way possible. :)

  5. It looks very pretty and Datura-like (although I hated Datura and thought it was a bad game).
    And I can see two things. First, I can finally see that pink tree that I’ve been keeping my eye on forever now, the tree that is always moved but always sits beside a purple/violet tree. Why are those trees always moving but always together side by side? Hmm, must be a puzzle!

    The second thing I noticed is that the water and bushes and maybe even the grass move, and their shadows with them. I don’t see any shadows of clouds, which I thought would be cool. Could you put shadows of clouds on the ground that move over time to indicate some sort of progression of time… and somehow affect the puzzles that have to do with shadows not being on the touch screen panels?

  6. I regularly record videos of games I’m playing. Is there any chance you’d be interested in an additional playtester to record and send you videos of their playtest(s)? I realize it’s unlikely, but it’s worth a shot to ask.

  7. This is a great example of something that I have seen more generally.

    That is, it’s a really good idea to have a function that dumps the state of a system. When it comes to debugging you can then sprinkle calls to the dump function around the place and work out exactly where you went wrong.

    Once you have a dumper a loader isn’t far behind.

    Here is a video of us doing something like that with our random terrain generator:

    Each frame of the video is generated from a call to a function that dumps the current state of the map.

    We have used this functionality to debug generator bugs, work out differences between Windows and Linux builds, and teach artists how the generator works.

  8. How would this method reflect solving difficult puzzles – are you saving state for each of the “failed” panels or are those save-states on a timer? Would it be possible to see if players are failing to understand the puzzle? What about instances of “getting lucky” with solving panels when they aren’t supposed to? (If any of the above are even possible, that is!)

    How do you, as a game designer, react to the possibility of brute-forcing solutions to puzzles? Did you know of any Braid players solving its puzzles without fully understanding them? (Not sure why, but Fickle Companion and Jumpman levels come to mind, mostly because I’ve read complaints about them.)

    • I am not necessarily super-interested in the specifics of how much people get confused about specific puzzles. It is too easy to way overthink that stuff. That is exactly the kind of information that prods many game designers into filing down their game until it is smooth and generic.

      It is okay to get stuck on things if, when you finally understand them, they were worth being stuck on.

      If something is a huge blocker, I will see that when people spend a long time there. But, right now I am not even saving puzzle attempts.

      Some puzzles in The Witness are brute-forceable but most are not (within practical time and effort limits). If you try brute forcing puzzles, it feels bad; it is obvious you are doing it wrong. I have no doubt that some people will try and play the game that way anyway… but then they are just sort of losing the game.

      Sometimes there is a mechanism where panels will shut off and require you to redo the previous panel. This is meant to discourage brute-forcing and also to clearly communicate that you don’t want to play the game that way. I don’t use it super-often though.

      Very few people fully understand all the puzzles in Braid. That is part of what makes them interesting puzzles: they provide capacity for varying levels of understanding, depending on how much effort and interest the player brings to them.

      • Can’t smoothness be a good thing?

        Mathematicians do their best to break new concepts down into a manageable progression. And when they want to give examples of a phenomenon, they choose things that are mostly simple and ordinary, so that they can draw your attention to a particular thing.

        Life is short; I feel games should try to press the interesting stuff they have to show you into as small an amount of time as they can.

        We all like “Portal” here. When teaching you the “fling”, the makers broke down every aspect of it, giving you a lot of training wheels before they expected you to understand and experiment with it on your own.

        Jon, remember the puzzle in my game you got stuck on? With the triangle you kept firing into, to no avail? Well, after we talked I turned that triangle upside down, so you can’t really fire into it. You weren’t the only one who got fixated on the idea that that triangle shot might help them, so I’ve disposed of it. I also added another puzzle that nudges you in the right direction.

        I think you already support me on some of this? You’ve praised the cutting out of red herrings, and the simplification of puzzles. In these games, where we are intending to express stuff to players, is “getting stuck” a good thing to have happen?

        • Yes, smoothness is a good thing, often.

          But there is such a thing as taking it too far (which is what pretty much every AAA game company does in focus testing).

          Once new people are playtesting the game I probably will make a lot of changes for smoothness purposes. But what I am saying here is just about overall philosophy of playtesting. It is not the case that everything rough is bad. It is okay for people to get stuck, and it is okay for people to misunderstand things… sometimes. Sometimes smoothing those things out makes the game better, but sometimes it makes the game worse.

  9. But in Braid you had full playback data, didn’t you? At least it should have been easy to dump the rewind data to disk and play it back!

    (BTW: I always wanted to know how many hours of replay data Braid was able to store and how it was compressed. I could only find this so far: )

    • The abstract for Jon’s talk below says “30 to 60 minutes”, but I don’t know how accurate that is.

      • To fit fully in RAM on the Xbox 360 I restricted rewind recordings to about 40MB, which indeed gives 30-60 minutes depending on what level you are playing (more objects == more memory).

        However, as I say in the talk that Sean linked, the rewind system for Braid isn’t rocket science, it is just a baseline competent implementation with some useful choices made about how to reduce size, but with no real micro-optimization kinds of work done beyond those basic choices. If reducing size were really a priority, I probably could have gotten the rewind data 2x-6x smaller. I just had too many other things to think about.

        • P.P.S. World 5 in Braid (the parallel universe one) works by a relatively straightforward version of the input recording idea. I hated that, and in fact the Xbox 360 version shipped buggy, in the sense that sometimes on World 5 when you rewind, the thing that was supposed to happen doesn’t happen the same way again. I looked for the cause of the bug and couldn’t figure it out and was dead tired of working on the game and the bug wasn’t that bad, so I shipped it.

          Later, for the PC version, I finally figured out the problem. It was something a little unexpected, but also obvious in hindsight.

          • That’s so interesting that you mention that bug in world 5, because I noticed it happening a few times and I always felt like I wasn’t wrapping my mind around the concept of time properly.

            I feel a little better now.

  10. This is a pretty interesting approach. It doesn’t look like it gives you enough information, but I’m guessing that most of the details (such as how difficult puzzles were, how confused they were, any mistakes or graphical bugs) are given by the playtester themselves, and that these periodic screenshots are just to give you a basic view of where the player went in which order, and also with a rough estimation of how quickly.

    By the way, I realise the save game screen is very incomplete, but I noticed that the save on the bottom said “1 panels solved”. Are you planning to bother coding it to say “panel” for when just a single panel is solved or not? I’ve seen a plural used for a value of 1 in lots of websites and games and it’s just a bit unsettling, it should take minimal code to correct it. I know it’s a bit picky of me and not the worst thing ever, but I just don’t see why not.

  11. I don’t see the input-recording method as necessarily worse or better than what you’re doing – I think each is useful in its own ways.

    Input-recording can be useful to track down hard to find bugs related to how you got *into* a particular state in the first place. I’m also using it now to help validate that I don’t introduce any gameplay bugs with last minute changes – if a variety of recorded replays still conclude with the same results, then I can be reasonably sure I haven’t broken anything. Obviously this is really only useful when the game is nearly in a final shipping state (but it is *really* useful). It’s very brittle, but that’s kind of the point.

    For the purpose of “recording” playtests, what you’re doing makes a lot more sense.

    The water is looking great in those screenshots.

    • The problem with input recording is that it’s difficult to make it robust between versions. If you have a game that is played and updated a lot, your replays will quickly stop working. You can jump through hoops to make this work, but I think recording the output makes more sense.

    • Agreed Phil, both input and output recording are useful – and they are useful for quite different purposes.

      I used to work on the user interface for an air traffic control system. Frequently we would be asked to investigate situations that occurred in the operational environment. Unfortunately our recordings were mostly input based, which made it very hard to confirm rare bugs. Even with good input recordings, hardware & network conditions could make a replay not produce a bug.

      I think output recording is great for:
      – Confirming the presence and presentation of bugs which may be hard for an end user to describe (“Yes, you’re not crazy… that really did happen!”)
      – Looking at how the user interacts with the system in general

      And input recordings are very useful for:
      – Tracking a confirmed bug back through the system & code to its root cause
      – Assisting in confirming that you have resolved the bug against the original cause
      – Regression testing (if you have a library of input recordings to test with). Although this approach suffers from the versioning.

      – Mark

      • We have created a commercial product for playtesting browser games based on input tracking. While having the ability to replay sessions is cool, it’s also quite ineffective in the long run: how many 40-minute gameplay sessions could you watch? It quickly turned out that our users find much more value in having a complete log of user inputs that they can search (by version, game level, specific behavior) and aggregate (e.g. to create heat maps or trends).

        Replays are only used for quick discovery process and the typical scenario for analysis looks like this:
        1. you aggregate recordings and find out that people are dropping out at level 2
        2. you filter out those recordings
        3. you *replay* a couple of them
        4. you find a problem (i.e. people are confused by UI)
        5. you read a log of user behaviors or analyze its visualization to find a hook (i.e. ‘’)
        6. then you compare how many players among those who dropped out and those who didn’t behaved that way (clicked the button) to verify your assumption.

        I believe playtesting becomes really powerful if you can easily combine qualitative and quantitative analysis into a single process. It’s impossible if you are recording just output (or is it?).

        As for the problem with game versions and ability to replay the session – it turned out not to be that crucial. Of course staging client helps (no need for staging backend), but usually, you either analyze sessions pretty quickly (before a new version goes live) or you forget about old ones because new sessions got recorded (we record tens of thousands of sessions).

        The bigger problem we have with replays based on input is accurate synchronization of user input, especially in a 3D environment. But it can be done quite well with some extra effort.

        And there’s one more case for input-based recordings: we use them to detect cheaters in competitions. If a cheater attacks the code or memory, the score achieved during replay differs from the submitted one :)

        (sorry for a looong comment!)

  12. The idea of intentionally degrading game recordings as a means of filtering the feedback you receive from players is an interesting one. It’s a bit like throwing a low pass filter over the whole thing.

    But what sort of information do you seek in recordings like these, once all that low-level data is lost?

  13. I can’t help but be excited by the possibilities that may all be explored within the confines of such a seemingly simple thing as paths drawn through mazes. Perhaps I am still confused as to what the “real” game is. It is probably a closely guarded secret. But I am still impressed by what I have seen so far.

    I wonder though, how you will tackle the problem of affording players too many choices. I often feel in non-linear games as though there are too many things demanding my attention at one time, and I don’t know what is the best to pursue first. What do I want to do today, as opposed to tomorrow?

    Choice fatigue, lol.

    I think it may be fine as long as it is clear that there is no best path. That all choices are equal, and you will never miss an opportunity by choosing the wrong set of puzzles first.

    If by fortitude, you find a spot for The Witness in the next console generation: I would be happy to buy a new console to play it. It is unlikely that I will be thus compelled by a shinier Uncharted.

  14. How about using that mode for browsing snapshots in-game for loading regular saved games? It certainly looks nicer and takes better advantage of the screenshot preview. You can add the missing information with a text overlay at the bottom of the screen.
