Server Room Fire | TechSNAP 44

It’s a worst case scenario, when a server room catches fire in this week’s war story!

Plus: We’ll share a story that might make you re-think taking advantage of your hard drive warranty, the secrets to reliable SQL replication.

All that and more, in this episode of TechSNAP!

Thanks to:

GoDaddy.com Use our code TechSNAP10 to save 10% at checkout, or TechSNAP20 to save 20% on hosting!

Super special savings for TechSNAP viewers only. Get a .co domain for only $7.99 (regular $29.99, previously $17.99). Use the GoDaddy Promo Code cofeb8 before February 29, 2012 to secure your own .co domain name for the same price as a .com.

Pick your code and save:
cofeb8: .co domain for $7.99
techsnap7: $7.99 .com
techsnap10: 10% off
techsnap20: 20% off 1, 2, 3 year hosting plans
techsnap40: $10 off $40
techsnap25: 25% off new Virtual DataCenter plans
Deluxe Hosting for the Price of Economy (12+ mo plans)
Code:  hostfeb8
Dates: Feb 1-29

Direct Download Links:

HD Video | Large Video | Mobile Video | MP3 Audio | OGG Audio | YouTube

Subscribe via RSS and iTunes:

Show Notes:

Crypto crack makes satellite phones vulnerable to eavesdropping

  • Researchers at the Ruhr University Bochum in Germany have reverse engineered the GMR–1 and GMR–2 encryption systems used by satellite phones and found serious weaknesses
  • Both algorithms rely on security by obscurity, but by downloading and disassembling the firmware, researchers were able to isolate the cryptographic algorithms
  • “Unlike standard algorithms such as AES and Blowfish—which have been subjected to decades of scrutiny from some of the world’s foremost cryptographers—these secret encryption schemes often rely more on obscurity than mathematical soundness and peer review to rebuff attacks”
  • The GMR–1 encryption system uses an algorithm that closely resembles the proprietary A5/2 encryption system that formerly protected GSM phone networks, before it was phased out in 2006 due to weaknesses that allowed attackers to decrypt calls in real time
  • The attack against GMR–1 allows anyone with a modest PC and some open source software to decrypt a call in less than an hour. With a cluster of more powerful machines, it is possible to decrypt a call in real time
  • GMR–2 phones are also vulnerable to cracking when there is known plaintext. This is a particularly glaring issue because the datagrams contain predictable headers and other content that can be known by the attacker, making such attacks possible
  • Researchers have not yet reverse engineered the audio codec that is used for voice calls, so a call can be decrypted, but not played back (yet). However, other data types that do not use the audio codec (fax, SMS, data) have successfully been intercepted
  • Researchers are only able to intercept communications between the satellite and the user, not communications in the other direction, so only one side of the call can be eavesdropped. This is likely a limitation of the way satellite signals work: to intercept the signal from the phone to the satellite, you would need line of sight, usually requiring an ELINT aircraft or satellite.

Customer buys refurbished drive from newegg, finds existing partitions and data

  • This story raises a number of questions about used and refurbished drives
  • Everyone knows that they should securely erase their drive before they resell it, we covered some of the techniques on TechSNAP 31 – How Malware Makes Money
  • However, how do you securely erase a drive when it has failed in some way?
  • You send the drive back to the store or the manufacturer in order to receive a replacement drive; however, you must trust them to securely erase your data, since the drive was not usable when you sent it back
  • In this case it would seem that the drive was repaired, turned around, and sold to another customer without the data being properly erased
  • It would seem the only option that customers have is to not return the failed drive, which means not taking advantage of their warranty and having to pay full price for the replacement drive

Feedback:

Q: chocamo from the chatroom asks about MySQL Replication

A: MySQL has a few different replication modes built in, the main one being asynchronous replication, where a slave server constantly reads from the binary log of all changes made to the database. So you start with your two servers in a converged state (meaning they have exactly the same data); then each time an UPDATE or INSERT command is run on the master, the slave runs the same commands, in the same order, and should continue to have the same data.
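
For reference, here is a minimal sketch of wiring a single slave up to a master in this mode (the host name, credentials, and binary log file/position values are placeholders; use the real coordinates reported by SHOW MASTER STATUS):

    -- Assumed my.cnf settings: master has server-id=1 and log-bin=mysql-bin,
    -- and the slave has its own unique server-id (e.g. 2).

    -- On the master: create an account the slave will replicate as
    CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

    -- Note the current binary log coordinates (File / Position columns)
    SHOW MASTER STATUS;

    -- On the slave: point it at the master using those coordinates
    CHANGE MASTER TO
      MASTER_HOST='master.example.com',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=107;
    START SLAVE;

    -- Confirm that both the I/O and SQL threads are running
    SHOW SLAVE STATUS\G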

However, the slave is read only. If you want to load balance more than just reads, you need to do what is called ‘multi-master replication’. In this setup, you have 2 or more servers that are all masters, and each is also the slave of the server in front of it, something like: A -> B -> C -> A. So when an INSERT is done on server B, server C then executes that same INSERT statement, and then A does; when the query gets back to B, B notices that the query originated at B and skips it, preventing a loop. If you attempt an approach such as this, you will also need to adjust the auto_increment settings in MySQL: you will want auto_increment_increment to be at least as large as the number of servers you have, and each server should have a different auto_increment_offset. This is to prevent primary key collisions, so that if an INSERT is done on each of the three servers at the same time, each row still ends up with a unique key; otherwise replication stops until you resolve the primary key collision. In the ScaleEngine setup, we also have 2 read-only slaves, one from server A and one from server C: the first offers read-only access to customers, to be used by applications that support using a read-only slave, and the other is used for taking backups (we pause replication to get a perfectly consistent copy of the entire database, then resume replication to catch back up to real-time data).
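
As a rough sketch, the auto_increment settings for a three-server A -> B -> C -> A ring would look something like this (values are illustrative; each server also needs its own unique server-id, and log-slave-updates enabled in my.cnf so that changes it applies as a slave are written to its own binary log for the next hop):

    -- Server A
    SET GLOBAL auto_increment_increment = 3;  -- at least the number of servers in the ring
    SET GLOBAL auto_increment_offset    = 1;

    -- Server B
    SET GLOBAL auto_increment_increment = 3;
    SET GLOBAL auto_increment_offset    = 2;

    -- Server C
    SET GLOBAL auto_increment_increment = 3;
    SET GLOBAL auto_increment_offset    = 3;

    -- Put the same values in each server's my.cnf so they survive a restart.
    -- Result: keys generated on A are 1, 4, 7, ..., on B are 2, 5, 8, ...,
    -- and on C are 3, 6, 9, ..., so concurrent INSERTs on different masters
    -- can never collide on an auto-generated primary key.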

MySQL 5.5 also introduces ‘semi-synchronous replication’. In this mode, the MySQL client does not return from the query until the data has been written not only to the master, but also to at least one of the slaves. This allows you to ensure that the data has actually been replicated and is safe from the failure of the master server. Normal replication in MySQL is asynchronous, meaning that when you make a change, the client returns a successful result as soon as the data has been written to the server you are connected to, and replication happens later. This is normally the desired behaviour because it provides the greatest speed; however, if the server you wrote to fails before any other server replicates the change, that change could be lost. Semi-synchronous replication attempts to solve this issue by waiting until there is at least one additional replica of the data before returning a successful write. Fully synchronous replication is normally undesirable due to the performance impact.
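
A hedged sketch of turning this on in MySQL 5.5 (the plugin file names assume a Linux build; the master waits for an acknowledgement from at least one semi-sync slave, and falls back to asynchronous replication if none responds within the timeout):

    -- On the master: load and enable the semi-sync plugin
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;
    SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms to wait for a slave ack before falling back

    -- On each slave that should acknowledge writes:
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;
    STOP SLAVE IO_THREAD;
    START SLAVE IO_THREAD;  -- reconnect so the slave registers as semi-synchronous

    -- Check how many semi-sync slaves the master currently sees
    SHOW STATUS LIKE 'Rpl_semi_sync_master_clients';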

If a table is too large, you can use ‘partitioning’ to break it into smaller pieces. You can also use MySQL’s ‘Federation’ feature (the FEDERATED storage engine) to make tables from more than one server appear to be local to a single server, allowing you to move different databases to different physical machines.
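
Two illustrative sketches (table, column, and host names are made up, and the FEDERATED engine has to be enabled at server startup since it is off by default):

    -- Partitioning: one logical table stored as several smaller physical pieces
    CREATE TABLE access_log (
      id        BIGINT NOT NULL AUTO_INCREMENT,
      logged_at DATE   NOT NULL,
      message   TEXT,
      PRIMARY KEY (id, logged_at)   -- the partitioning column must be part of every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE ( YEAR(logged_at) ) (
      PARTITION p2010 VALUES LESS THAN (2011),
      PARTITION p2011 VALUES LESS THAN (2012),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Federation: a local table definition that is really a window onto a
    -- table living on another MySQL server
    CREATE TABLE remote_orders (
      id    INT NOT NULL,
      total DECIMAL(10,2),
      PRIMARY KEY (id)
    ) ENGINE=FEDERATED
    CONNECTION='mysql://app_user:secret@db2.example.com:3306/shop/orders';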

War Story:

This week’s episode features another war story from our good friend Irish_Darkshadow (the other other Alan)


Setting:
IBM has essentially two “faces”, one is the commercial side that deals with all of the clients and the other is a completely internal organisation called the IGA (IBM Global Account) that provides IT infrastructure and support to all parts of IBM engaged with commercial business.

The IBM email system uses Lotus Domino as the server component and Lotus Notes as the client side application. The Domino servers handle the email for the company but also serve as database hosts and applications hosts. At the point in time when this war story took place, each country had their own server farm for these email / database / application servers. Each individual EMEA (Europe / Middle East / Africa) country then routed email from their in-country servers to the two “hubs”, those being Portsmouth (North Harbour) in the UK and Ehningen in Germany.

The events described below took place in the summer of 2004.

War Story:

Well, there I was once more with the 24×7 on-call phone, bouncing through my weekend without a care in the world. Well, sort of, I suppose, if you don’t count a German girlfriend with a shopping addiction and two kids with the inability to be quiet and give daddy some quality time with his computers. It was a Sunday afternoon and we were at the cinema, which I figured was a safer option than what I chose to do for my last war story (getting very drunk).

The on-call phone started to ring almost immediately after we got out of the movie, and it was the duty manager telling me that she had been “summoned” to the office by some of the higher ups for the EMEA geography. My first instinct was “and this is my problem, why?” but I resisted the urge to expose my inner bastard and played nice instead. I suspected that she had simply guessed that being called in to the office without any details was likely not a good sign, and that it might be useful to have some insurance (or a scapegoat) beside her for the upcoming call. Apparently, as I was the Crit Sit Manager for that week, I was the aforementioned insurance.

Being the devious little git that I am, I decided to bring the kids with me to the office. That would then allow me to counter any requirements on my time there with a need to get the kids home to feed them / wash them / imprison them…whatever fitted best. Essentially they would be my passport to get out of the office and buy myself some time if I needed it.

The Duty Manager that day was one of those people who had graduated to the position despite having absolutely no technical skill or capability but had an uncanny knack of lunching with the right people and “networking” with the right higher ups. Upon arrival in the office I sat in her office with her to chat about any details she had left out during her call to me. I had the kids running up and down the aisles of the call centre with one of the agents I trusted keeping an eye on them.

Nothing new was divulged prior to the big conference call kicking off, and even when they started to explain the purpose of the call, details were being kept very, very vague. The driver on the call was a guy from Italian Service Management, which completely threw me as I had never seen a high level call originate from that part of the organisation.

The key part of the call went something like this:

Italian Guy: We are, eh, here today to eh, discuss a situation in the Vimercate (vim err kaa tay) site. Eh, perhaps we should proceed on that basis.

Duty Manager: Hello there, xxx here. I’m the duty manager for the EMEA CSC this weekend. I’m not sure what the Vimercate site is. Could you please explain ?

Me: *presses mute on the phone* Vimercate is the server farm location for Italy; all of the email and Lotus Notes databases / applications for the country are run from there. If that site is down, then IBM Italy will be unable to do ANY business for the duration of the outage. *unmutes the phone*

Italian Guy: It is one of our locations here in Italy that is responsible for some servers.

Duty Manager: Ah ok, thanks for the explanation.

Italian Guy: Well about two hours ago eh….we a, received a call from the cleaning contractors that there was a, some cigarette coming out of the server room. We immediately alerted the rest of Service Management and started dealing with the crisis as a critical situation.

Me: **rolls about laughing, then thinks to telnet to some email servers in that site… nothing was connecting. The urgency of the call started to dawn on me at this point.

Duty Manager: I’m sorry but I don’t understand what you mean when you say that there was a cigarette coming out of the server room. Did I mishear you?

Italian Guy: Sorry, not cigarette, I mean to say smoke. There was smoke coming out of the server room.

Duty Manager: Oh lord, has anyone been hurt? Is there any emergency service personnel at the site?

Italian Guy: Yes, the fire service were alerted almost immediately and nobody other than the cleaning staff was in the site when the alarm was raised. The fire has spread to other parts of the building and the firemen have been unable to get to the server room yet.

Me: Hi, I’m the crit sit manager here today. Could you please give me a current status on the server room itself? If those servers are not recoverable then we will need to activate the business continuity location and get the backup tapes couriered there. We could be up and running within 12 hours that way.

Italian Guy: Yes, yes, we know all of that. We are service management. We have already started to deal with those things. We invited you onto this call so that you are aware of the issue and can place voice messages on your incoming call lines and have your agents prepared to explain things to our users if they call your help desk. Nothing more.

Me: I have no doubt that you are on top of the situation but in such circumstances the in-country Service Management report in to the EMEA Critical Situations team who then coordinate all actions until there is a satisfactory resolution as per the EOP (Enterprise Operating Procedures). I will be taking point on this for you and liaising with EMEA Service Management for the duration of this situation.

**lots of back and forth, territorial pissing contest arguing took place until it was agreed to have a followup call every hour. The second call went something like this:

Me: Good evening folks, how are things progressing on the site now?

Italian Guy: The emergency services are having difficulty due to the age of the building and they have not been able to get to the server room yet. There is nothing else new to say.

Duty Manager: So does that mean the servers are destroyed now or is there still some chance?

Italian Guy: The fire suppression system in the server room activated, that is all that we know right now.

**we adjourned the call and the next two were more of the same until the fifth call:

Italian Guy: The firemen have made it to the server room and have reported that the fire suppression system has not worked correctly. The servers themselves have been fire damaged.

Duty Manager: That’s very unfortunate, how are your efforts to get the backup tapes to the secondary site going?

Italian Guy: Eh, there is a problem with that too. The tape libraries are in the same room as the servers in an enclosure. The firemen have not retrieved them for us yet.

Me: Whoa, hold on a minute. The tapes that we’ve been trying to get into play for the last four hours are actually in the same room with the fire? Why didn’t you tell us that earlier? If both the servers AND the backup tapes are destroyed then IBM Italy will be offline for days while a secondary site is configured. This completely changes the severity of this situation.

Italian Guy: Yes, we believe that both the servers and the tapes have been damaged at this time.

**at this point I resisted the urge to reach my arm through the phone line and throttle this guy.

Duty Manager: So what can we do at this point?

Me: We need to get EMEA Service Management to start prepping a completely fresh site to take over for the ruined server farm. The problem is that we’ve now lost four hours waiting for tapes that were never going to arrive; we could have had the new servers being readied all that time.

So this all continued for a few more calls. I had my girlfriend pick up the kids between the calls and take them home, and I just dived in and tried to maintain some momentum in the resolution efforts. Rather than drag it out and bore you to tears, here were the remaining revelations:

  • Servers were burnt to a crisp.
  • Backup tapes (which were in the same room) were partially burned, but all were smoke damaged.
  • The fire suppression system simply failed to work.
  • The firemen had to use water due to the composition of the building…WATER…on a room full of electronics.
  • It took 2 full days to build the new server environment, which essentially meant that IBM Italy were unable to do business electronically for that duration.
  • Nobody ever explained why the tapes were in the server room other than to say it was an oversight by the IT Manager. Really? An oversight?!
  • The only bright spot in the entire debacle was that some of the data on the tapes was salvaged, which shortened the duration of the outage significantly for some people.

I’m not sure there is a moral to the story or a catchy tag line like “patch your shit” but I suppose that my overriding memory of the whole situation was when I wondered how anyone thought it would be a good idea to put backup tapes in the same physical location as the servers and then neglected to do regular maintenance on an old building that was clearly a fire trap.


Round-Up:

Question? Comments? Contact us here!