Archive for the ‘outage’ Category

Back to Normal: The post-S3 problems

July 21, 2008

Sorry again, but it appears that everything is finally stable and back to normal as of 12:15 ET.

Basically, we had a server in our memcache installation fail, and while I was fixing it, I neglected a critical step in the process, which caused a nice cascade of other problems.  I called in the big guns (Don), and I now know what not to do in the future to avoid this!

For those who are interested, Don will be doing a post-mortem in his blog, and will link it from here.  He still needs to gather some details, so it may take a bit to post.

Back to normal, we hope

July 21, 2008

OK folks, as of midnight ET, we’re back to normal.  We hope for good now.  Sorry for the major hassle today, and thank you all for the massive amounts of patience you’ve had with us.  We have great customers!

Not out of the woods yet

July 21, 2008

As of about 10:55pm ET July 20, our engineers have taken the site down again, so we are out of Read Only Mode and in Site Down. Sorry folks.  More as we get it.

Service back to normal

July 21, 2008

Hi All, as of 8:30 pm ET, SmugMug is back up and service is back to normal.  Sorry for the prolonged delay, and thank you all for your amazing patience.

Amazon S3 outage causes SmugMug outage [UPDATED: 1:20pm PST]

July 20, 2008

Amazon’s S3 service, SmugMug’s primary storage provider, is currently experiencing problems.  As a result, a large portion of the photos and videos stored on SmugMug are currently offline.

Historically, Amazon has been very stable.  We’ve seen three of these in our entire history with Amazon (>2 years), including this one.  I expect, like the last two, that service will be restored shortly.  You can keep track of their efforts over on their own Status Dashboard.

Our faith in Amazon, and the care they take of your priceless memories, hasn’t been shaken.  Your photos and videos are safe - which is our #1 concern.  Since problems in this industry are inevitable, and Amazon’s performance over the last two years has been so exceptional, we’ve been afraid of an outage like this.  I’m sure there will be more over the next few years, too.

The important thing is that they’re few and far between, short, and handled properly.  Every component SmugMug has ever used, whether it’s networking providers, datacenter providers, software, servers, storage, or even people, has let us down at one point or another.  

It’s the name of the game, and our job is to handle these problems and outages as best we can.  We’ve already spent the last few months investigating additional resources we can combine with Amazon to cover outage scenarios like the one we’re experiencing today, and have some promising prospects.

Please continue to check here and Amazon’s Dashboard for future updates. Thank you for your patience - we realize this is no fun, and we’re truly sorry it’s happening.

UPDATE:  Amazon continues to make progress, and we hope they’ll be online again soon, at which point we’re ready to re-enable SmugMug.  A few of our newer customers have asked, so I’d like to point out that I believe this is our first unplanned long (more than a minute or two) site-wide outage in well over a year, and we’re definitely going to take steps to reduce the likelihood of this recurring.

When it rains, it pours

July 16, 2008

We were doing some routine network maintenance tonight which shouldn’t have affected service.  Of course, Murphy’s Law struck, and we went hard down.   :(

We’re currently back up in read-only mode, and expect to shortly resume full service.

Sorry about all the bumps the last few days - there’s nothing in particular, like server load or software bugs, that caused these issues other than failure events we didn’t anticipate.

Out of Read Only Mode

July 11, 2008

Thanks, everyone, for your patience.  The site’s out of read-only mode.  Sorry for the hassle today!

Status update

July 11, 2008

We’ve located some corruption in an important table in one of our databases.  That database has been removed from the cluster and is being replaced, but we’re being extremely gentle and analyzing the other cluster members before coming back online.

I’m sorry to report that this is going to take a while longer - but your data is important to us and we want to make sure we don’t damage it.

More as we get it…

Read-Only Maintenance 7/10/08 @ 10:00pm-3am PDT

July 11, 2008

Sun just dropped off some awesome new servers that we’re itchin’ to throw into production. If we *cough*PrimeTime*cough* can get them ready in time (no pressure Craig!), we’re going to put the site into read only mode this evening and put more awesome into SmugMug.  (see that last post, this is what’s fixing it! woo hoo!)

We don’t think it’ll take long (although we could use the entire window), nor do we anticipate taking the entire site down (we’ll be in read-only mode unless things don’t go well). We’ll do our best to keep complete downtime to a minimum.

For more info on our maintenance windows, please check out this dgrin post:
http://www.dgrin.com/showthread.php?t=30979

Thanks, and sorry for the inconvenience!

Here’s a link to our dgrin post, if you want to follow the progress:
http://www.dgrin.com/showthread.php?p=870098

Read-Only Maintenance 6/19/08 @ 10pm-3am PDT

June 19, 2008

One of our master database machines needs more storage, so to keep it happy, we’re going to pop in more tonight during the maintenance window.

We don’t think it’ll take long (although we could use the entire window), nor do we anticipate taking the entire site down (we’ll be in read-only mode unless things don’t go well). We’ll do our best to keep complete downtime to a minimum.

For more info on our maintenance windows, please check out this dgrin post:
http://www.dgrin.com/showthread.php?t=30979

Thanks, and sorry for the inconvenience!

If you’d like, please check the following dgrin thread for updates as we work/finish:
http://www.dgrin.com/showthread.php?p=853395