Several weeks ago, Peter from Lutheran Church of Hope emailed me a picture of their brand new Dell md3000i iSCSI array with its angry light on. For those of you who don’t know, an md3000i is a fancy box with a bunch of hard drives in it and redundant everything. This thing is well built: redundant drives, power supplies, even a redundant RAID controller, and everything is hot swappable. When I heard that it was angry, I wasn’t too worried. Everything was up and running; they had just lost redundancy on one of the RAID controllers. I told Peter that he should open a ticket with Dell and they would get him fixed up. They have 4 hour support and a new part showed up a couple hours later via courier: Impressive!
I hadn’t heard anything else from Peter until about 5pm, when he wrote to tell me that there were now 3 technicians working on the array and that they would be performing a ‘syswipe’ to resuscitate it. A syswipe is where they completely reformat (wipe) the operating system, reload it and rebuild the configuration. They said that the 7 terabytes of data would be fine.
When I arrived on the scene at about 7pm that Wednesday, I learned that the array said it was happy, but the Citrix XenServer couldn’t see the 5 storage repositories that get served up from the array. We worked on trying to get the XenServer to talk to the array for about another 3 hours. We could get it to see the LUNs (logical unit numbers), but when it looked for the data, it stated that there was no data present. In fact, when we checked manually with fdisk, we didn’t even see partitions. It was like we were starting with fresh, blank volumes. We left for the evening with Dell scouring the support information to see if they could find any flaw in the way the syswipe was performed; they would call when they reached a determination.
At about 2am, Dell called and we got on a conference call: Peter, me, and about 3 or 4 people from Dell’s side. The news wasn’t good. They couldn’t find anything wrong with the disk alignment or configuration that would suggest they had made a mistake, and they were convinced that the XenServer was to blame for the missing data. I was dumbfounded. I couldn’t believe what I was hearing. They basically told me that this was now a Citrix support issue and they were done. Trying to stay calm, we decided to hang it up for the evening and get some rest.
Thursday was a bust for Peter and me as far as continuing the recovery, but on Friday, I started to research the log of commands that Dell had run on the array for the syswipe. From the documentation I found, there are 2 types of commands you can perform: create commands and recover commands. Guess what Dell ran. Both. On the inside, I was getting pretty steamed. Convinced that Dell blew away the data, I was formulating my argument for what Dell was going to do to rectify the situation.
Two days earlier, we had an error on our array saying that we had lost some redundant capability, but we were up and running. Now we were at complete data loss at the hands of Dell. Peter had a feeling that we weren’t going to be able to recover the data, but I *KNEW* the data was GONE.
At this point, we both let the project chill for about a day. We started talking about possible scenarios for recovering data by going back to the old array we had just replaced, but since the new backup procedure wasn’t in place yet, there would certainly be things missing.
And then came Sunday. It was Hope’s 16th birthday! It was a time to celebrate. We even had cupcakes : ) Crissy was helping in my 5yr old’s Sunday school class and my 3yr old requested to go play, so I sat by myself at the 9:15 service that day. Up to then, I had been the strong one in the process, right up until we started to sing. I was a mess. Thank goodness I wasn’t sitting near anyone I knew! I wasn’t crying, but my eyes were rather moist. I kept telling myself “For crap’s sake, it is *just* data.” There are way bigger problems in this world! Here is the thing: I wasn’t crying because of data loss. I was crying because I *knew* something was going to happen. I said (in my head, of course) “Damn-it, God, you are going to freaking fix this problem and I won’t be able to ignore it and you anymore.” Let me make this clear. I wasn’t demanding that God fix it. I knew that it was about to be fixed and I’d have to talk about it. Still in denial, I had a couple of conversations on the way out with Chris G. (operations director) and Peter about next steps and went home.
After lunch, I went upstairs and started running the same commands I had run over and over, trying to introduce the storage repositories to the XenServer, with the same failed results. While I was sitting there, a single word came to me: IPv6. I literally said out loud “What do you mean, IPv6?” I don’t know where it came from or what it meant, but there it was: IPv6. How could that have anything to do with anything? I was doing explicit single pathing on a particular IPv4 IQN, just like when I first configured the array. Frustrated, I turned off my computer and went back downstairs to get ready to take my family to go see Megamind. As we were about to go out the door, I had a spark of an idea. I called Peter and asked him to look at the array and, if he saw IPv6 on any of the iSCSI interfaces, to completely disable it. I told him I was taking the kids to a movie and I would check on it when I got back.
After the movie, I went back upstairs and tried the same procedure that I had performed NUMEROUS times. It.freaking.worked! I successfully set up all 5 storage repositories and re-linked all the virtual hard drives back to the storage server! I turned on the server and it was like nothing had happened. Everything was back to normal. I started weeping like a baby. Silently of course, I am a man : ) I called Peter and we both sat dumbfounded on the phone in disbelief. After I got off the phone, Crissy walked in and asked why I was crying. All I could tell her was “Oh nothing, I’m just really happy.” It was certainly NOT nothing.
To top off the story, on Monday, Peter heard back from Dell. They had reviewed all the support calls again and determined that, due to everything that had happened, our data was so completely scrambled that no data recovery company would be able to retrieve any of it. I bet they were a little surprised when he told them that everything was back to normal.
Dell, I’m sorry for all the bad things that I said to you in my mind. I certainly didn’t approach you with a loving heart.
If I could please hold your attention a little bit longer, I want to point you to the sermon from that Sunday. It is about 45 minutes long, but worthy of your time. Thanks! Click here to play it.