Tuesday, 22 January 2013

On serialization and message passing

You've been doing network IO using this seemingly awesome tool that lets you write virtually any object to any data stream and read it back perfectly, without implementing any serialization logic yourself.

Sounds like magic: you just put "implements Serializable" on your class and woop-dee-doodle-doo, everything just works... or does it?

What could possibly go wrong?

Java serialization certainly has advantages. It has been there from the beginning - virtually all standard Java classes support it. It has a well-known API, and many external libraries support it. It just might be the tool for you, but it might also be completely wrong.

Let's examine what it does. Marking your own class as Serializable allows you to write/read instances of your class to/from an object stream (which can wrap disk IO, network IO, an in-memory buffer, etc.). The conversion of objects to the underlying byte stream is done at runtime using Java reflection - essentially using class metadata to write class/member identifiers and member data as bytes in a consistent way (it can also do some other fancy things).
Example: You create an object stream around a byte output stream and write objects A and B to it, where B holds a reference to A. When you read the contents back from the stream you get objects A and B, and behold, B's reference to A still holds! Even cooler: if you write the same object twice, the second write emits only a small back-reference handle (an int) instead of the whole object!
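To make this concrete, here is a minimal sketch (the classes A, B and the method names are made up for illustration) showing that the reference survives a round trip through an in-memory buffer:

```java
import java.io.*;

public class IdentityDemo {
    // Hypothetical message classes, just for illustration.
    public static class A implements Serializable {
        int value = 42;
    }
    public static class B implements Serializable {
        A ref; // B holds a reference to A
    }

    // Write A and B (which references A), read both back, and check
    // that the deserialized B still points at the deserialized A.
    public static boolean referencePreserved() {
        try {
            A a = new A();
            B b = new B();
            b.ref = a;

            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(a);
                out.writeObject(b); // only a back-reference handle to A is written here
            }

            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                A a2 = (A) in.readObject();
                B b2 = (B) in.readObject();
                return b2.ref == a2; // same instance: identity was restored
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reference preserved: " + referencePreserved());
    }
}
```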

aHA! So can I use this as a way to pass messages over sockets between VMs/programs/game instances?
Sure, let's say every object is a message. That works - right up to the point where you start getting strange out-of-memory or stream-corruption errors, and you wonder what is going on...

Remember the example above: B may hold references to A, therefore the stream on both sides must guarantee that the GC does not collect A until the stream is closed or reset. So, on both sides, the stream holds references to pretty much every object it has ever sent or received.

"Hey hey, I can live with this, I can work around it - why don't I just reset the stream every N bytes/messages?" That sort of solves the issue above, except it's rather inefficient. The reason is that the first time you send an object of a given type across an object stream, it carries an obscene amount of metadata - it's not uncommon for the first message of type X to be 5-10 times larger than subsequently sent instances of type X.
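You can measure this yourself. The sketch below (the Msg class is invented) flushes after each write and compares how many bytes each writeObject() call added to the stream; the first instance drags the full class descriptor with it, the second does not:

```java
import java.io.*;

public class MetadataOverheadDemo {
    // Hypothetical message class.
    public static class Msg implements Serializable {
        double x, y, z; // a small payload
    }

    // Returns the bytes each writeObject() added to the stream:
    // [0] = first instance (carries the full class descriptor),
    // [1] = second instance (the class is already known to the stream).
    public static int[] messageSizes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(buf);
            out.flush();
            int header = buf.size();

            out.writeObject(new Msg());
            out.flush();
            int afterFirst = buf.size();

            out.writeObject(new Msg());
            out.flush();
            int afterSecond = buf.size();

            return new int[] { afterFirst - header, afterSecond - afterFirst };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = messageSizes();
        System.out.println("first: " + sizes[0] + " bytes, second: " + sizes[1] + " bytes");
    }
}
```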

What more is wrong with this? B can no longer reference A, because the remote endpoint no longer has any knowledge of A after the reset. So, in general, you have lost the ability for subsequent messages to reference previous messages - not always, but whenever a reset has occurred in between.
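A small sketch (again with hypothetical A/B classes) showing the link breaking: send A, reset, then send B, and the receiver's B no longer points at the A it already received:

```java
import java.io.*;

public class ResetDemo {
    public static class A implements Serializable { int id = 1; }
    public static class B implements Serializable { A ref; }

    // Send A, reset the stream, then send B (which references A).
    // After the reset A is re-serialized from scratch, so the receiver
    // ends up with two distinct copies of A.
    public static boolean referenceSurvivesReset() {
        try {
            A a = new A();
            B b = new B();
            b.ref = a;

            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(a);
                out.reset();        // forget every object sent so far
                out.writeObject(b); // A is written again, in full
            }

            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                A a2 = (A) in.readObject();
                B b2 = (B) in.readObject();
                return b2.ref == a2; // false: the link is broken
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```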

"Wait wait, I'll just hook in a resend request for A, or something similar."
Right. Now, what if B arrives before A in the object stream? B arrives with a reference to an object whose id has not yet arrived... Here it gets even more interesting: you will probably end up with a stream corruption exception and need to reset the stream again...
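The object stream format is entirely position-dependent - even the header bytes are validated up front. As a minimal illustration, feeding the reader bytes that don't start exactly where it expects fails immediately:

```java
import java.io.*;

public class CorruptionDemo {
    // An ObjectInputStream validates the stream header in its constructor;
    // bytes that don't begin with the serialization magic number fail
    // immediately with StreamCorruptedException.
    public static boolean corruptInputDetected() {
        byte[] garbage = { 0x01, 0x02, 0x03, 0x04 };
        try {
            new ObjectInputStream(new ByteArrayInputStream(garbage));
            return false; // should not get here
        } catch (StreamCorruptedException expected) {
            return true;
        } catch (IOException other) {
            return false;
        }
    }
}
```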

So, in short: to use the many cool features of Java serialization, you must guarantee that objects/messages in the stream arrive in the same order they were sent, and that you never reset the stream - except that never resetting the stream will eventually give you out-of-memory crashes/errors.

All these problems are non-issues when you just want to write/read binary files: it-just-works (let's ignore data class versions and changes for now ;)). But for low-latency, unreliable networking (hint: UDP) it's pretty much a no-go, given that to get it to work over UDP you would either have to live with a reset after every message and the extra-high bandwidth requirements due to resending the metadata with each message, or implement TCP on top of your UDP... either way... not very useful.
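For UDP, every datagram would have to be self-contained, which effectively means a fresh object stream per message. This sketch (Msg is a made-up message class) compares what that costs against the incremental cost of the same message on a long-lived stream:

```java
import java.io.*;

public class DatagramCostDemo {
    // Hypothetical message class.
    public static class Msg implements Serializable {
        long seq;
        double x, y;
    }

    // [0] = bytes for one message in its own stream (self-contained datagram:
    //       stream header + class descriptor included every time),
    // [1] = incremental bytes for the same message on a long-lived stream
    //       once the class descriptor has already been sent.
    public static int[] standaloneVsIncremental() {
        try {
            // Self-contained: a fresh stream per message.
            ByteArrayOutputStream one = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(one)) {
                out.writeObject(new Msg());
            }

            // Long-lived stream: measure only the second message.
            ByteArrayOutputStream shared = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(shared);
            out.writeObject(new Msg());
            out.flush();
            int before = shared.size();
            out.writeObject(new Msg());
            out.flush();
            return new int[] { one.size(), shared.size() - before };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```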

All these issues become even worse when you start dealing with a network environment where each message has to go through several nodes, each with its own node state/remembered objects, and where remembrance of types/objects may be linked to multiple machines/other nodes. Which ones should be reset, and when? Heck, you can't even add an extra pipe/layer in between, because it might reorder messages, choose different routing paths... anything... you're just simply plain screwed.

There are third-party variants such as JBoss Serialization where you can reset the object cache separately from the type cache (allowing you to reset the object cache after each message, which solves the OOM issue), but you still have the ordering requirement and the requirement that all class information must come through.

What if you want your host to serve multiple clients at the same time? One thread per client, at least!
What about an asynchronous (NIO) approach? With serialization :) ... which is totally locked into blocking reads... you've got some work to do!
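One common workaround (a sketch of one possible design, not the only one) is to length-prefix each message, so a non-blocking reader can buffer incoming bytes until a complete frame has arrived and only then hand the payload to the blocking ObjectInputStream:

```java
import java.io.*;
import java.nio.ByteBuffer;

public class FramingDemo {
    // Hypothetical message class.
    public static class Msg implements Serializable {
        public final int id;
        public Msg(int id) { this.id = id; }
    }

    // Encode one message as [4-byte length][serialized payload] so a
    // non-blocking reader can tell when a whole message has arrived.
    public static byte[] encode(Serializable msg) {
        try {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(body)) {
                out.writeObject(msg);
            }
            byte[] payload = body.toByteArray();
            ByteBuffer frame = ByteBuffer.allocate(4 + payload.length);
            frame.putInt(payload.length).put(payload);
            return frame.array();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Decode one complete frame; a real NIO handler would keep partial
    // trailing bytes around until the next read() completes the frame.
    public static int decodeId(byte[] frame) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(frame);
            byte[] payload = new byte[buf.getInt()];
            buf.get(payload);
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(payload))) {
                return ((Msg) in.readObject()).id;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that each frame is its own object stream, so you pay the metadata cost per message - the same trade-off discussed above.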

My point being:
Strongly consider using other methods than java serialization for message passing, unless you are absolutely sure of its guarantees and requirements. It's terribly easy to shoot oneself in the foot.


