Sunday 15 August 2010

rabbitmq - What is the point of the immediate multiple retries in messaging systems?


I have recently been reading up on messaging systems, and have looked in particular at both RabbitMQ and NServiceBus. As I understand it, if a message fails for some reason, it is immediately retried several times. Both systems then offer the option of trying again later, for example in five seconds; when the five seconds have passed, the message is again retried several times.

To quote Vaughn Vernon (p. 502):

Another way to handle this is to retry the send until it succeeds, perhaps using some form of back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries may be the best approach. However, if our process retries three times every five minutes, then

In NServiceBus this is called second-level retries, and when the delayed retry kicks in, the message is again retried multiple times.

Why does the message need to be retried multiple times? Why not retry just once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably only milliseconds later, succeeds?

And if it does not need to because of some configuration (what is it?), why does every example I see retry more than once?

My background is NServiceBus, so my answer is framed in those terms.

First-level retries are great for very transient errors. A deadlock is a perfect example: you try to update the database, and your transaction is chosen as the deadlock victim. In these cases, a first-level retry is exactly what you need. Most of the time a single first-level retry is enough; if you have a lot of contention in the database, then perhaps 2 or 3 retries will do.

Second-level retries are for less transient errors: think of a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. Retrying after a few milliseconds will do you no good, but after 10, 20, or 30 seconds you may have a good shot.
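To make the two levels concrete, here is a minimal Python sketch of immediate retries nested inside delayed ones. It is not NServiceBus or RabbitMQ code: `TransientError`, `process_with_retries`, and all of its parameters are hypothetical names chosen for illustration, and the growing delay (10 s, 20 s, 30 s) simply mirrors the numbers used in this answer.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def process_with_retries(handler, message, tries_per_round=3,
                         delayed_retries=3, delay_step=10.0, sleep=time.sleep):
    """Call handler(message) with immediate tries nested inside delayed rounds.

    Returns the number of attempts used on success. Re-raises the last error
    once every attempt has failed (a real bus would move the message to an
    error queue at that point instead).
    """
    attempts = 0
    last_error = TransientError("no attempts were made")
    for round_no in range(delayed_retries + 1):
        if round_no > 0:
            # Second level: wait longer before each new round (10 s, 20 s, ...).
            sleep(delay_step * round_no)
        for _ in range(tries_per_round):
            # First level: try again straight away, e.g. after a deadlock.
            attempts += 1
            try:
                handler(message)
                return attempts
            except TransientError as exc:
                last_error = exc
    raise last_error
```

Passing a fake `sleep` (for example `sleeps.append` on a list) lets you see which delays would have been taken without actually waiting.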

However, the crux of the question is this: after 5 first-level retries and then a delay, why try 5 times again before the next delay? First of all, on your first second-level retry, it is still possible to hit a deadlock or another very transient error. After all, the goal is usually not to slow a system down, so if the problem really is transient, it is better not to wait for an additional delay before trying again. Of course, the infrastructure has no way of knowing just how transient the problem is.

The second reason is that it is easier to configure it all: X tries per level and Y levels = X * Y attempts in total, with only 2 numbers in the configuration file. In NServiceBus there is a back-off time span to go with these 2 values, so the config looks like this:

    <TransportConfig MaxRetries="3" />

It's simple enough: try 3 times, wait 10 seconds; try 3 times, wait 20 seconds; try 3 times, wait 30 seconds; try 3 times. Then you're done, and the message goes to an error queue.

Configuring different values for each level would require a more complex configuration story.
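The arithmetic behind that schedule can be checked with a small Python sketch. The function and parameter names (`retry_schedule`, `tries_per_round`, `delayed_rounds`, `time_increase`) are hypothetical, not NServiceBus settings, and the sketch assumes the delay grows by a fixed increment each round, as in the 10/20/30 second sequence above.

```python
def retry_schedule(tries_per_round=3, delayed_rounds=3, time_increase=10):
    """Return (steps, total_attempts, total_wait_seconds) for the schedule.

    The first round runs immediately; each later round is preceded by a
    wait that grows by time_increase seconds.
    """
    steps = []
    for round_no in range(delayed_rounds + 1):
        if round_no > 0:
            steps.append(f"wait {time_increase * round_no}s")
        steps.append(f"try x{tries_per_round}")
    # Counting the initial round as a level: tries per level * number of levels.
    total_attempts = tries_per_round * (delayed_rounds + 1)
    # Sum of time_increase * (1 + 2 + ... + delayed_rounds).
    total_wait = time_increase * delayed_rounds * (delayed_rounds + 1) // 2
    return steps, total_attempts, total_wait
```

With the defaults this gives 12 attempts spread over 60 seconds of waiting before the message would be handed to the error queue.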
