Thursday, 17 March 2011

Reacting to a production problem (software, of course)

A production problem starts with PANIC and ends with PEACE, but it's in your hands to avoid CHAOS.

I worked for almost four years in a production environment, supporting websites for media giants in Singapore. I faced many production problems and learned many lessons while deploying code, troubleshooting and finally solving problems, thus keeping the business users' trust. Here I summarize the points to keep in mind while working in a production environment, be it banking, media or any other.

Things to consider when deploying code to production

1. I find writing documents boring. But a detailed document that outlines the steps to follow when deploying code to production is mandatory, so that nothing is forgotten at the last minute. Once the document is ready, all we have to do is follow the steps, and that's all. No stress. Moreover, preparing the document lets us estimate the timing as well. Let's say it takes 2 minutes to deploy each module and, adding those up, 30 minutes to finish the whole deployment. Once we have the duration, we can keep everyone informed.
If something goes wrong, at least we can cover ourselves by showing our managers the pre-reviewed and pre-approved document and that we followed the procedure.

2. Plan for failures - if something goes wrong, how soon can we bring the system back to its normal state? It is important to include a 'rollback plan' in the deployment document. After all, if we cannot deploy our code, we don't want to ruin the system as well. It's our moral responsibility to bring the system back to normal.

3. Avoid peak hours - especially in the media domain, early morning is the peak time for churning out news. Business users hate to wait for deployments. So choosing off-peak hours for deployment is better. If something goes wrong, at least we will have a couple of hours to troubleshoot and debug the problem. I prefer midnight deployments, to avoid calls and SMSes from bosses every 10 minutes asking whether everything is okay! Irritating! But it's their concern too, isn't it?
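
To make the deployment points above a little more concrete, here is a minimal sketch in Python. It is not a real deployment tool: the `Step` class, the module names and the 2-minute estimates are all made up for illustration, and the deploy and rollback actions are placeholders for whatever the deployment document actually says. It only shows the idea of adding up the per-step timings and of undoing the steps in reverse order when one of them fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One line of the deployment document (all names here are hypothetical)."""
    name: str                     # module being deployed
    minutes: float                # estimated duration of this step
    deploy: Callable[[], None]    # action that deploys the module
    rollback: Callable[[], None]  # action that restores the previous version


def estimated_minutes(steps: List[Step]) -> float:
    """Add up the per-step estimates so everyone can be told the duration."""
    return sum(step.minutes for step in steps)


def run_deployment(steps: List[Step]) -> bool:
    """Run the steps in order; if one fails, roll back in reverse order.

    The failed step is rolled back too, on the assumption that it may have
    been applied halfway.
    """
    attempted: List[Step] = []
    for step in steps:
        attempted.append(step)
        try:
            step.deploy()
        except Exception as error:
            print(f"{step.name} failed: {error}; rolling back")
            for previous in reversed(attempted):
                previous.rollback()
            return False
    return True


if __name__ == "__main__":
    # Placeholder actions; real ones would copy files, restart services, etc.
    steps = [
        Step(f"module-{n}", 2, lambda: None, lambda: None)
        for n in range(1, 4)
    ]
    print(f"Estimated duration: {estimated_minutes(steps)} minutes")
    print("Deployed" if run_deployment(steps) else "Rolled back")
```

Keeping the rollback action next to each deploy action means the rollback plan gets written at the same time as the deployment steps, not at midnight when something has already gone wrong.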

Things to consider when troubleshooting a production problem

1. If the system fails suddenly, the phone on your desk rings immediately. And the chaos starts!!! Everyone sits on your head, from managers to end users to the CEO. All you have to do is STAY COOL. Tell them that you're working on the problem. If you panic ALONG WITH THEM, you won't even remember the server password!!!

2. Solving a production problem is a two-step process. First, we MUST thoroughly understand the problem. We must be 100% sure that a specific 'scenario' caused it. Second, once we have established the root cause, we need to think of possible solutions. My approach is to fix the problem quickly (even if it is a workaround) just to keep bosses and end users happy for the moment. Once they're quiet, I take my own time to refactor the code and fix the problem for good.

3. Document the problem and the solution you applied in a log. This helps the next poor guy who takes over the system after you quit the company :) After all, you don't want to disturb the other guy's family life at 2 am.

4. On top of all these things, the important thing is to STAY COOL AND CALM even if you cannot fix the problem. If you think you need more time or you cannot fix the problem, it is better to escalate the issue to your manager. The manager will probably call some expert or architect who can advise you or fix the problem themselves.
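
The problem log from the points above does not need to be anything fancy. As a rough sketch, assuming a plain text file with one JSON entry per line (the field names and both functions are my own invention, not a standard format), it could look like this in Python:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_incident(log_path, problem, root_cause, fix, permanent):
    """Append one problem/solution entry to the log file and return it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "problem": problem,
        "root_cause": root_cause,
        "fix": fix,
        "permanent": permanent,  # False means a quick workaround was applied
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def open_workarounds(log_path):
    """Return the entries that still need a permanent fix."""
    entries = []
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                entries.append(json.loads(line))
    return [entry for entry in entries if not entry["permanent"]]
```

Calling `log_incident(...)` with `permanent=False` right after applying a workaround, and checking `open_workarounds(...)` once everyone has calmed down, keeps the quick fixes from being forgotten.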
