/var/water/logged | TechSNAP 82

An inside look at how hard some Sysadmins had to work to keep their servers running after being hit by Superstorm Sandy!

Plus the final analysis of the DigiNotar saga, an epic network debugging war story that will leave you groaning, a huge batch of your questions, and so much more!

Thanks to:

Use our codes TechSNAP10 to save 10% at checkout, or TechSNAP20 to save 20% on hosting!

BONUS ROUND PROMO:

Get your .COMs for just $5.99 per year, up to 3 domains! Additional .COMs just $7.99 per year!
CODE: 599tech

SPECIAL OFFER! Save 20% off your order!
Code: go20off5

Pick your code and save:
techsnap7: $7.49 .com
techsnap10: 10% off
techsnap11: $1.99 hosting for the first 3 months
techsnap20: 20% off 1, 2, 3 year hosting plans
techsnap40: $10 off $40
techsnap25: 25% off new Virtual DataCenter plans
techsnapx: 20% off .xxx domains

 

Direct Download:

HD Video | Mobile Video | MP3 Audio | Ogg Audio | YouTube | HD Torrent | Mobile Torrent

RSS Feeds:

HD Video Feed | Mobile Video Feed | MP3 Audio Feed | Ogg Audio Feed | iTunes Feeds | Torrent Feed

 

Support the Show:

   

Show Notes:

Get TechSNAP on your Android:

Browser Affiliate Extension:

  • Jupiter Broadcasting Affiliate Extensions for Chrome and Firefox
  • Hurricane Sandy creates havoc for data centers in New York and New Jersey

    • A number of data centers in and around New York and New Jersey suffered various failures and issues
    • ConEd, the utility provider in New York, started proactively shutting down power before the storm hit, in an effort to avoid damage to their equipment
    • Most data centers had already proactively switched to off-grid mode, providing their own power via diesel generators
    • What happens when salt water meets high voltage gear
    • Slashdot created a status page, showing the known issues
    • WebHostTalk thread where various customers report the status of their gear
    • More reporting from the Web Hosting Industry Review
    • Equinix reports on their situation
    • Oct 29th: Datagram goes down, takes out Gawker, HuffingtonPost, BuzzFeed and others
    • Oct 29th: Internap announces they are evacuating the 75 Broad Street building
    • “The flooding has submerged and destroyed the site’s diesel pumps and is preventing fuel from being pumped to the generators on the mezzanine level. The available fuel reserves on the mezzanine level are estimated to support customer loads for approximately 5–7 hours”
    • It appears that NY building codes prevent storing large amounts of fuel on the upper floors due to the danger to occupants and emergency personnel in the event of a fire
    • Generators are located in the basement with the fuel supply, and some customers have their own generators on the upper floors
    • The above-ground generators are fueled from small ‘day tanks’, which are then refueled from the sub-basement by pumps
    • The pumps must be located near the fuel supply, rather than on the upper levels, because the pumps would not be able to ‘prime’ themselves (pumps need liquid to pump, they cannot create enough vacuum to draw the fuel up many floors)
    • Due to the flooding cutting off fuel supplies and drowning generators, some data centers that did manage to stay powered up lost some or all of their transit to the Internet, because the other buildings their connections pass through, or their providers, lost power
    • Peer1, on the 17th floor of the same building (75 Broad Street), provided customers hourly updates via their forums
    • Peer1 staff and customers took to carrying barrels of diesel fuel up to the 17th floor to keep the day tanks full (some rough arithmetic on what that implies follows this list)
      • Oct 29, 17:40 – Sites 1 and 2 transitioned to generator power
      • 20:36 – Still on generator. The building reports that the lobby has taken in some water
      • 22:27 – The building has detected some flooding in the 1st and 2nd basements due to the storm surge. The extent of the damage will not be determined until the basement is accessible. The fuel system has a header with 5000 gallons of fuel and will be the primary supply for the next 12–24 hours. They are also observing some lowering of the water level outside the building.
      • Oct 30, 03:30 – We are still running from emergency generator power. Water has receded and we are currently waiting for a report back from building engineers on the status of the fuel and power systems that were located in the basement. We will post further updates when we have them.
      • 08:00 – At this point we have an estimate of 4 hours of fuel left on our generators. Our techs and the facility are continuously working to get an emergency fuel delivery in time, and are looking to set up a temporary tank and pump since the basement is still flooded. If the fuel does not arrive in time, the worst case scenario is that we will have to gracefully shut down the facility.
      • 16:00 – The PEER 1 Hosting NYC datacenter remains on generator power, with fuel being provided through the remaining building supply. The fuel tank has arrived at our facility and, due to the flooding in the basement, we are working on alternative methods of fuel delivery to the day tank located on the 17th floor. As of now, our datacenter team is carrying half-full 50 gallon barrels of diesel to our daily fuel reservoir on the 17th floor, until a more sustainable solution is reached.
      • Oct 31, 00:00 – Peer1 is still maintaining generator power. We did have a slight temperature rise at Site 1, but this has been addressed by technicians. We will provide our next update in 1 hour.
      • 04:00 – Peer1 is still maintaining generator power for most customers in Site 2 and Site 1. The temperature in Site 1 is still running at a critically high level. At this point, we have started to call all clients in Site 1 and are asking all our colocation clients to power down non-essential equipment. This will maximize our time running on generator and help with the temperature rise in Site 1. Our technicians will go ahead and shut down all customers at Site 1 within the next hour (you will receive an update when this is being performed). We will provide our next update in 1 hour.
      • 08:00 – Completed shutdown of customer equipment in Site 1
      • 10:00 – The A/C in Site 1 is powered off the building generators, which are still down. If we bring Site 1 back up before the building generators are back, Site 1 will just overheat. We are working to find another workaround, but we are having trouble getting electricians on site and are also working with the building to get their generator up and running. Additional spare fuel is still being manually put into our generator. We have also scheduled a fuel drop-off for the next fueling marker. We will provide our next update in 1 hour.
      • 15:00 – Peer1 is still maintaining generator power for customers in Site 2. The temperature in Site 1 is starting to stabilize, but we are still not bringing the power back up because our cooling system in Site 1 is still down. The electrician is currently moving electrical circuits to get a portion of the CRAC units in Site 1 online. We will contact those customers directly once we have these units online. Fuel is still good; we will provide our next update in 1 hour.
      • 23:00 – Peer1 is still maintaining generator power for customers in Site 2. The temperature in Site 1 has stabilized. We will soon begin the process of slowly bringing up customers’ cabinets at Site 1. Fuel is still good; we will provide our next update in 1 hour.
      • Nov 1, 13:00 – Peer1 is still maintaining generator power. We have an update from the building: we are providing them with a fuel hose that will allow them to start filling the building fuel tank in the next hour. We are continuing to run from our generator.
      • 16:00 – Peer1 is still maintaining generator power. The building is currently pumping fuel into the 5000 gallon header tank. We are looking at cutting over to the header tank in ~90 minutes
    • Additional Story
    • NY Times live updates on Sandy’s Aftermath
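
    Taking the quoted figures at face value gives a sense of the scale of that bucket brigade: a 5000 gallon header lasting 12–24 hours implies a burn rate of a few hundred gallons per hour, while each half-full 50 gallon barrel holds only 25 gallons. A rough back-of-the-envelope in Python (ours, not from the coverage; Peer1 never published their own generator’s consumption, so treat it as order-of-magnitude only):

      # Back-of-the-envelope using the figures quoted above; the 5000 gallon
      # header fed the whole building, not just Peer1, so this is an upper bound.
      header_gallons = 5000
      hours_low, hours_high = 12, 24            # "primary supply for the next 12-24 hours"
      barrel_gallons = 50 / 2                   # "half-full 50 gallon barrels"

      burn_high = header_gallons / hours_low    # ~417 gallons/hour
      burn_low = header_gallons / hours_high    # ~208 gallons/hour

      print(f"Implied burn rate: {burn_low:.0f}-{burn_high:.0f} gallons/hour")
      print(f"Barrels per hour to keep pace: {burn_low / barrel_gallons:.0f}-{burn_high / barrel_gallons:.0f}")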

    70% of State chief information security officers report breaches this year

    • Between 2010 and 2011, only 14% of CISOs saw a budget increase, while 44% said their budgets didn’t change and 34% saw their budgets reduced
    • Only 24% of CISOs are confident that they can safeguard their data from outside attacks
    • Report PDF

    DigiNotar report lands, all CAs totally compromised

    • The attacker who compromised the SSL CA DigiNotar last year had full control over all 8 of their certificate-issuing servers
    • The report suggests that the attacker may have issued additional rogue certificates that were never identified
    • This risk was mitigated somewhat by most vendors revoking all trust in DigiNotar-issued certificates, but customers who did not receive the root trust update could still be vulnerable (one way to spot-check a local trust store is sketched after this list)
    • The company investigating the compromise found that the log files were generally stored on the same servers that had been compromised and evidence was found that they had been tampered with
    • “While these log files could be used to make inconclusive observations regarding unauthorized actions that took place, the absence of suspicious entries could not be used to conclude that no unauthorized actions took place”
    • Investigators also found evidence that a claim by the anonymous attacker who compromised the Comodo CA, that he was also the one who breached DigiNotar, may in fact be true
    • The DigiNotar network was highly segmented and a number of the segments were isolated from the public Internet. However, a lack of strict enforcement of these policies may have allowed the attacker to island-hop from the compromised web servers to the CA servers
    • "The investigation showed that web servers in DigiNotar’s external Demilitarized Zone (DMZ-ext-net) were the first point of entry for the intruder on June 17, 2011”
    • "From the web servers in DMZ-ext-net, the intruder first compromised systems in the Office-net network segment between the 17th and 29th of June 2011”
    • “Subsequently, the Secure-net network segment that contained the CA servers was compromised on July 1, 2011”
    • “Specialized tools were recovered on systems in these segments, which were used to create tunnels that allowed the intruder to make an Internet connection to DigiNotar’s systems that were not directly connected to the Internet. The intruder was able to tunnel Remote Desktop Protocol connections in this way, which provided a graphical user interface on the compromised systems, including the compromised CA servers.”
    • The attack on DigiNotar lasted for almost six weeks, without being detected
    • “The private keys were activated in the netHSM using smartcards. No records could be provided by DigiNotar regarding if and when smartcards were used to activate private keys, except that the smartcard for the Certificate Authorities managed on the CCV-CA server, which is used to issue certificates used for electronic payment in the retail business, had reportedly been in a vault for the entire intrusion period”
    • Original Article, in Dutch
    • Full Report PDF
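
    For anyone who wants to verify that the DigiNotar root-trust update actually reached a given machine, here is a minimal sketch of spot-checking the local trust store with Python’s standard ssl module (ours, not from the report; depending on the platform and how OpenSSL loads its CA bundle, the list returned may be incomplete or empty, and OS, browser, and Java trust stores each need their own check):

      # Minimal sketch: look for DigiNotar roots among the CA certificates
      # visible to Python's ssl module. Only a spot check, not an audit.
      import ssl

      ctx = ssl.create_default_context()     # loads the system's default CA certificates
      suspect = []
      for cert in ctx.get_ca_certs():        # decoded certificate dictionaries
          subject = dict(rdn[0] for rdn in cert.get("subject", ()))
          if "DigiNotar" in subject.get("organizationName", ""):
              suspect.append(subject.get("commonName", "<unknown CN>"))

      if suspect:
          print("DigiNotar roots still trusted:", ", ".join(suspect))
      else:
          print("No DigiNotar roots found in the visible trust store")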

    Feedback

    Followup:

    War Story: The little ssh that sometimes couldn’t

    • Mina Naguib is a sysadmin and director of engineering at Adgear
    • Noticed that some of his SSH cronjobs started reporting failures and timeouts between his servers in London (UK) and Montreal (CA)
    • He found that the transfers either completed at high speed, or hung and never completed (there were no transfers that succeeded at a low speed)
    • Running the transfers manually seemed to work fine
    • After examining packets with tcpdump as they left London, he found that some packets were being transmitted, not acknowledged, and then retransmitted, still without being acknowledged
    • While examining the packets as they were received in Montreal, he noticed a difference
    • The 15th byte of every 16 bytes was being predictably corrupted
    • In the SSH handshake, all instances of “h” became “x” and all instances of “c” became “s”, but only beyond the first 576 bytes (a short worked example of what that substitution implies follows this list)
    • The SSH sessions were getting stuck because the remote server’s kernel was discarding the corrupted TCP packets, the retransmits were corrupted in the same way, and so the connection was in a stalemate
    • He ruled out an issue with the NICs in the servers on either side, because the issue was affecting multiple servers in two different Montreal data centers
    • To prove his hypothesis, he used netcat to pipe /dev/zero over the network; examining the packets as they were received on the other side showed that, beyond the first 576 bytes, a specific bit was being flipped from a 0 to a 1
    • The issue did not affect UDP or ICMP packets, only TCP
    • Now, the task was to pinpoint which router along the path was causing the issue
    • This was more difficult because, unlike an ICMP ECHO where you can elicit a predictable response from a remote host, TCP requires both endpoints to cooperate
    • So he grabbed nmap and used its ‘Random IP’ mode to find a collection of SSH servers, some that did and some that did not share hops in common with the affected route between London and Montreal
    • He created a list of servers that did not experience corruption and those that did, and used traceroutes to identify the paths the packets took (a rough sketch of this path-intersection idea also follows this list)
    • Note: some internet paths are asymmetrical, and a standard traceroute will not find the return path, this could have made this problem much harder to diagnose
    • After finding 16 bad and 25 good SSH connections, he was able to narrow his list of suspects down to a specific connection between two backbone providers
    • London → N hops upstream1 → Y hops upstream2
    • “Through upstream1, I got confirmation that the hop I pointed out (first in upstream2) had an internal “management module failure” which affected BGP and routing between two internal networks. It’s still down (they’ve routed around it) until they receive a replacement for the faulty module.”
    • The upstreams involved appear to have been GBLX and Level3
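
    The substitutions above (“h” to “x”, “c” to “s”) already reveal which bit was being damaged. A few lines of Python (ours, not from the episode) show that both substitutions set the same bit, 0x10, which matches the single 0-to-1 bit flip later confirmed with the /dev/zero test:

      # Both observed substitutions differ from the original byte by exactly
      # one bit, 0x10 (bit 4), consistent with a single stuck bit on the path.
      for before, after in [("h", "x"), ("c", "s")]:
          diff = ord(before) ^ ord(after)
          print(f"{before!r} -> {after!r}: XOR = {diff:#04x} (bit {diff.bit_length() - 1})")
      # Both pairs print 0x10 (bit 4), i.e. that bit flipping from 0 to 1.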

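    The narrowing-down step is essentially set arithmetic on traceroute output: a hop that appears on every corrupting path but on no clean path is a suspect. Below is a rough Python sketch of that idea (the host lists are placeholders, traceroute must be installed, and as noted above asymmetric return paths will not show up in this forward-only view):

      # Rough sketch: hops present on every "bad" path and absent from every
      # "good" path. Host lists are placeholders, not real measurement data.
      import re
      import subprocess

      def hops(host):
          """Return the set of hop IPs traceroute reports on the way to host."""
          out = subprocess.run(["traceroute", "-n", host],
                               capture_output=True, text=True, timeout=120).stdout
          return set(re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, re.MULTILINE))

      bad_hosts = ["198.51.100.10", "198.51.100.20"]   # SSH servers that showed corruption (placeholders)
      good_hosts = ["203.0.113.5", "203.0.113.9"]      # SSH servers that did not (placeholders)

      suspects = set.intersection(*(hops(h) for h in bad_hosts))
      for h in good_hosts:
          suspects -= hops(h)

      print("Hops on every bad path and on no good path:")
      for ip in sorted(suspects):
          print(" ", ip)
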
    Round Up:

Question? Comments? Contact us here!