Network Exception

We are at the 5th stop on the roadmap. In this article I’ll talk about network exceptions.
By exceptions I mean the common errors you should expect from the Internet, i.e. failures that are a built-in feature of the Internet. As for the case when someone deliberately attacks your server, I’ll cover that in the security article.

Breaking the Myth of TCP

TCP is supposed to provide reliable, ordered, and error-checked delivery of a stream of octets between applications over a network.
Unfortunately, this is just a lie.

Let’s look at the send() API of a TCP socket.
If send() reports failure (in the BSD socket API, a return value of -1), the situation is obvious. The question is, what can we be sure of when send() reports success? If you think it means the peer has received the data and can read it, think again.
Success only means that the operating system has copied the data into an internal buffer and is ready to send it. The data still has to make several hops and will probably be split into smaller packets at the IP layer. The receiving side has to reassemble the packets in the correct order. Needless to say, errors can happen anywhere during this process.
The reliability that TCP guarantees is just that TCP will keep retransmitting packets until it receives an acknowledgment from the peer. There’s no guarantee that the data will actually arrive. If the network breaks while packets are being sent, the remaining packets certainly can’t reach the peer. You can do an experiment to test this.
Connect a socket to a remote peer, then pull out the network cable and call send(). The call succeeds, even though you know the data can never leave the machine while the link is down. It succeeds because, as far as the operating system can tell, everything is OK: the connection is already established. The OS copies the data into its buffer and tells you, in effect, "I’m ready to send this to the peer." If you call send() again during the next few seconds, it still succeeds. Only after some time does send() fail: the operating system has not received an acknowledgment from the peer, and a timeout has fired. How long this takes depends on the operating system’s TCP retransmission settings. The operating system now considers the socket broken and marks it invalid. From here on, every send() fails.
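
The first half of this claim can be checked without unplugging anything. Here is a minimal Python sketch, assuming a local socket pair as a stand-in for a real client/server connection. Note that Python's send() returns the number of bytes the OS accepted and raises OSError on failure, rather than returning true or false.

```python
import socket

# A connected pair of local sockets stands in for client and server.
client, server = socket.socketpair()

# The peer never calls recv(), yet send() still succeeds: the bytes only
# had to fit into the operating system's send buffer.
sent = client.send(b"hello")
print(sent)  # bytes accepted by the OS, not bytes read by the peer

client.close()
server.close()
```

The return value says nothing about whether the other side has read, or will ever read, the data.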

Now let’s look at recv(). TCP is signal-based, which implies that if a connection stays idle, its status does not change. Suppose the server is waiting for a client’s data by calling recv(), and some sudden disaster strikes the client, such as a power loss or a network outage. The server will not know, since no signal arrives from the client. The problem is that the server will keep waiting forever for a dead client. The memory held by that connection is occupied and never returned. If this happens a lot, the server will run out of memory.
Interestingly, if only the client application crashes, the server will know. That’s because the client’s operating system cleans up and sends a close-connection signal on behalf of the application.
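
Both cases can be reproduced locally. The sketch below is only an approximation: it uses a local socket pair, a silent peer stands in for the dead client, and an explicit close() stands in for what the OS does when a process crashes. The 0.2 second timeout is an arbitrary value chosen for the demo.

```python
import socket

client, server = socket.socketpair()
server.settimeout(0.2)  # without a timeout, recv() would block indefinitely

try:
    server.recv(1024)
    idle_result = "data"
except socket.timeout:
    # Silent peer: nothing tells the server whether the client is alive.
    idle_result = "timeout"

client.close()  # stands in for the OS closing a crashed process's socket
closed_result = server.recv(1024)  # an empty result signals a closed peer

print(idle_result, closed_result)  # timeout b''
server.close()
```

A timeout on recv() keeps the server from blocking forever, but on its own it cannot distinguish a dead client from a quiet one.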

Exception Handling

Q: How to make sure the connection is alive?
A: Use heartbeat.
Periodically send a short message to the peer to prove your existence. If no heartbeat arrives within a threshold, assume the connection has died and close the socket.
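
The bookkeeping behind a heartbeat is small. Here is a minimal sketch of the receiving side; the class name HeartbeatMonitor and its methods are hypothetical, and the injectable clock exists only to make the logic easy to test.

```python
import time

class HeartbeatMonitor:
    """Tracks when the peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def beat(self):
        # Call this whenever a heartbeat (or any message) arrives.
        self.last_seen = self.clock()

    def is_alive(self):
        # The peer counts as alive while the silence is within the threshold.
        return self.clock() - self.last_seen <= self.timeout_s
```

The server would call beat() on every message from the peer and close the socket once is_alive() returns False.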

Q: How to make sure the message is intact?
A: Send a hash along with the original message.
Note that this is only meant to deal with network exceptions. For attacks such as man-in-the-middle, you need a more complicated method.
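
As a rough sketch, the hash can simply be prepended to the payload. SHA-256 and the function names pack and unpack are my choices for illustration, not a required format.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def pack(payload: bytes) -> bytes:
    # Frame layout (an assumption for this sketch): digest, then payload.
    return hashlib.sha256(payload).digest() + payload

def unpack(frame: bytes) -> bytes:
    digest, payload = frame[:DIGEST_LEN], frame[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("message corrupted in transit")
    return payload
```

The receiver calls unpack() on each frame and treats a ValueError like a lost message, for example by not acknowledging it.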

Q: How to make sure the message is ordered?
A: Use a sequence number to mark the order of messages.
The number serves as a clock, so the receiver can spot gaps and duplicates and guarantee that messages are handled strictly in order.
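
One way to use such a number on the receiving side is a small reorder buffer. This is a sketch under my own assumptions: numbering starts at 1, and the class name OrderedReceiver is hypothetical.

```python
class OrderedReceiver:
    """Delivers messages strictly in sequence order (hypothetical helper)."""

    def __init__(self):
        self.next_seq = 1   # the sequence number we are waiting for
        self.pending = {}   # messages that arrived early, keyed by seq

    def receive(self, seq, message):
        """Return the messages that became deliverable, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already delivered or already buffered
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

If message 2 arrives before message 1, receive(2, ...) returns nothing, and the later receive(1, ...) returns both in order.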

Q: How to make sure the peer received the intended message?
A: Always send an ack at the application level, and resend the message if the ack doesn’t arrive.
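
The sender's side of an application-level ack boils down to "send, wait, resend". In the sketch below, send and wait_for_ack are hypothetical callables supplied by the caller; a real wait_for_ack would read from the socket with a timeout.

```python
def send_with_ack(send, wait_for_ack, message, retries=3):
    """Send message until the peer acks it; return the attempts used."""
    for attempt in range(1, retries + 1):
        send(message)
        if wait_for_ack():  # True if the ack arrived before the timeout
            return attempt
    raise TimeoutError("no ack from peer, giving up")
```

Because a message may be resent after it was in fact received, the receiver has to tolerate duplicates, which is another reason to number the messages.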

Putting it Together

Let’s look at this scenario.

Client1 and client2 are connected to a server.
Now client1 wants to send a message M1 to client2.
It sends M1 together with its hash to the server. The server receives M1, checks the hash, and acks client1.
(If client1 doesn’t receive the server’s ack, it should resend M1.)
The server caches M1 and sends M1 to client2.
Client2 receives M1 and acks the server. When the server gets the ack, it can erase M1 from its cache.
(If the server doesn’t receive an ack from client2, it can either resend or wait for client2 to ask for M1, depending on the implementation.)

Now client1 sends message M2 to client2, but for some reason client2 drops offline.
The status is: client1 successfully sent M2 to the server and got the ack. The server cached M2 and sent it to client2, but no ack came back.

After some time client2 is back online. It should tell the server what state it is in right now. Since the last message it successfully received was M1, its state is 1.
When the server receives this, it sends the messages client2 missed while it was offline. In this case, that is M2.
In the meantime, client1 confirms it is online by heartbeat.
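
The server side of this scenario can be sketched as a cache keyed by sequence number. The class RelayServer and its method names are hypothetical, and the sketch assumes a single recipient and leaves out the sockets, hashes and heartbeats.

```python
class RelayServer:
    """Caches messages until the recipient acks them (hypothetical helper)."""

    def __init__(self):
        self.cache = {}     # seq -> message, kept until acked
        self.next_seq = 1

    def accept(self, message):
        # Called when a sender's message passes the hash check.
        seq = self.next_seq
        self.next_seq += 1
        self.cache[seq] = message
        return seq

    def on_ack(self, seq):
        # The recipient confirmed this message, so it can leave the cache.
        self.cache.pop(seq, None)

    def on_reconnect(self, last_received_seq):
        # Drop what the recipient already has, then return what it missed.
        for seq in [s for s in self.cache if s <= last_received_seq]:
            del self.cache[seq]
        return [(s, self.cache[s]) for s in sorted(self.cache)]


server = RelayServer()
server.on_ack(server.accept("M1"))   # M1 delivered and acked
server.accept("M2")                  # M2 sent, but client2 dropped
missed = server.on_reconnect(1)      # client2 returns and reports state 1
print(missed)  # [(2, 'M2')]
```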

This is a simplified scenario, but it makes the point. If you organize your online communication like this, you get a reliable channel.