Technology’s Infestation of my Life

Examples of how technology has permeated every single bit of my life.

Calendar Invitation Email Gone Awry

Here are some more details on the calendar email issue I noted late Monday.

Just after 10pm on Monday, I attempted to migrate my calendar from Google Calendar to Fastmail Calendar[1].

I did this by exporting my existing calendar from Google (per https://support.google.com/calendar/answer/37111?hl=en) and then re-importing it into Fastmail using Apple’s Calendar app. During this re-import, it appears that the Fastmail system regenerated the event requests and emailed all the participants of the events, although I initially suspected Apple’s Calendar app.

My wife, who was sitting next to me, was the first to let me know something was awry when she received over 400 emails from me.

After aborting last night’s attempt, I tried to import the data again Tuesday morning using FastMail’s “Subscribe to a public calendar” feature (https://www.fastmail.fm/help/calendar/publiccalendar.html), which should not have resulted in emails being sent but still did.

In total, 109 people were affected by this issue and up to 2904 emails were sent (1452 from each incident).

[Figure: graph of emails sent]

The good news (if there is such a thing) is that 45% of those affected only received a single email (well, two emails), and 78% of those affected received less than 10 emails (20 emails across both incidents).

Unfortunately, emails were also sent to people even when I was not the original organizer of the event. This accounted for over half the emails that were sent.

I have opened a ticket with Fastmail (Calendar import emailing participants (Ticket Id: 479473)). Fastmail has been prompt and the issue is, in theory, resolved. However, in the future I plan on scrubbing the calendar file of email addresses to prevent this issue from occurring again.
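A minimal sketch of that scrubbing in Python, assuming the participant addresses live in the ATTENDEE properties (which may be folded across multiple lines) and with a hypothetical output filename:

# Drop ATTENDEE properties (and their folded continuation lines) from an ICS
# file so that re-importing the calendar cannot trigger invitation emails.
def scrub_attendees(src_path, dst_path):
    with open(src_path) as src, open(dst_path, "w") as dst:
        skipping = False
        for line in src:
            if line.upper().startswith("ATTENDEE"):
                skipping = True      # start of an ATTENDEE property
                continue
            if skipping and line[:1] in (" ", "\t"):
                continue             # folded continuation of that property
            skipping = False
            dst.write(line)

# "basic-scrubbed.ics" is just an illustrative output name
scrub_attendees("basic.ics", "basic-scrubbed.ics")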

For those curious, here’s how I extracted[2] the number of those affected from the ICS file:

grep -Eiorh 'mailto:([[:alnum:]_.-]+@[[:alnum:]_.-]+?\.[[:alpha:].]{2,6})' "$@" basic.ics | sort | uniq -c | sort -r

Mea Culpa.

  1. there’s a larger story about why, but that’s not important at the moment 

  2. based on mosg’s answer on http://stackoverflow.com/questions/2898463/using-grep-to-find-all-emails/2898907#2898907 

VMWare and USB 3

It took me a while to figure out why my external Seagate hard drive wasn’t working with Windows 7 under VMware Fusion 5. As it turns out, VMware Fusion 5 does not support USB 3.0 with Windows 7[1].

What is not intuitive — and frankly doesn’t make sense — is that VMware Fusion 5 will not automatically fall back to USB 2.0 to support such devices.

The solution to this is to run your USB 3.0 capable device through a USB 2.0 hub, such as an Apple Keyboard with Numeric Keypad.

See also: http://communities.vmware.com/thread/415658

  1. you need Windows 8, per their features list “Microsoft Windows 8 required for USB 3 support” 

Why You’re Doing Passwords Wrong

If you use passwords, there’s a good chance you’re doing them wrong and exposing yourself to unnecessary risk.

My intent is to provide some basic information on how you can do passwords better[1], suitable for grandma to use (no offense, grandma), because there’s no reason you can’t do passwords better.

Why We Have Passwords

In the beginning, the internet was a benevolent place. If I said I was fergbrain, everyone knew I was fergbrain. I didn’t need to prove I was fergbrain. Of course, that didn’t last long and so passwords were created to validate that I was, in fact, fergbrain.

Passwords are one of three ways in which someone can authenticate who they are:

  1. Password: something you know
  2. Token: something you have that can’t be duplicated (such as an RSA token or YubiKey)
  3. Biometric: something you are (such as a fingerprint or other biometric marker unique to you)

Back In The Day™, passwords were the de facto method of authentication because they were the easiest to implement and in many ways still are.

At the time, token-based methods were just on the verge of development, with many of the underlying technologies (such as public-key encryption) not even possible until the mid-1970s. And once suitable encryption was more completely developed[2], it could not be legally deployed outside of the United States until 1996, when President Clinton signed Executive Order 13026.

Finally, biometric authentication was an expensive pipe dream[3].

The point being: passwords were the method of choice, and as we know, it is quite difficult to change the path of something once it gets moving.

Having just one password is easy enough, especially if you use it often enough. But how many places do you need to use a password? Email, social media, work, banking, games, utilities…the list goes on.

It would be pretty hard to remember all those different passwords. So we do the only thing we believe is reasonable: we use the same password. Or maybe a couple of different passwords: one for bank stuff, another for social media, maybe a third one for email.

Why Passwords Can Be a Problem

Bad guys know that most people use the same username, email address, and password for multiple services. This creates a massive incentive for bad guys to try and get that information. If the bad guys can extract your information from one web site, it’s likely they can use your hacked data to get into your account at other web sites.

For bad guys, the most bang for the buck comes from attacking systems that store lots of usernames and passwords. And this is how things have gone. Over just the last two years Kickstarter, Adobe, LinkedIn, eHarmony, Zappos.com, last.fm, LivingSocial, and Yahoo have all been hacked and had passwords compromised. And those are just the big companies.

In my opinion, most people know they have bad passwords, but don’t know what to do about it. It’s likely your IT person at work[4] keeps telling you to make “more complex” passwords, but what does that mean? Does it even help? What are we to do about this? Can we do anything to keep ourselves safer?

How to do Passwords Better

There is no single best way to do passwords. The best way for any particular person is a compromise between security, cost, and ease of use.

There are several parts to doing passwords better:

Have Unique Passwords

If one web site is hacked, that should not compromise your data at another web site. Web sites generally identify you by your username (or email address) and password. You could have a different username for every single web site you use, but that would probably be more confusing (and could possibly lead to a personality disorder). Besides, having to explain to your friends why you go by TrogdorTheMagnificent on one site but TrogdorTheBold on another would get tiring pretty quickly.

For reasons which I hope are obvious, making your passwords unique is better than making your usernames unique. Unless you don’t want people to find you, then make both your username and password unique.

General Rule of Thumb

Passwords should be unique for each web site or service.

Why: If a unique password is compromised (e.g., someone hacked the site), it cannot be used to gain access to additional resources (i.e., other web sites).

If you’re asking yourself, “But how do I remember all those passwords?!” just hold your horses.

Choose Better Passwords

People suck…at picking good passwords.

If you choose your own passwords, here’s a little test:

  1. For the 1st character in your password, give yourself 4 points.
  2. For 2nd through 8th character in your password, give yourself 2 points for each character.
  3. For the 9th through 20th character in your password, give yourself 1.5 points for each character.
  4. If your password has upper case, lower case, and numbers (or special characters), give yourself an additional 6 points.
  5. If your password does not contain any words from the dictionary, give yourself an additional 6 points.
  • If you score 44 points or more, you have a good password!
  • If you score between 21 and 44 points, your password sucks.
  • If you score 20 points or less, your password really sucks.

If my password was, for example, Ferguson86Gmail, I would only have 34.5 points:

  • F: 4 points
  • erguson: 2 points each, 14 points
  • 86Gmail: 1.5 points each, 10.5 points
  • I have uppercase, lowercase, and a number: 6 points
  • “Ferguson” and “gmail” are both considered dictionary words, so I get no extra points

Instead of choosing Ferguson86Gmail as my password, what if my password was Dywpac27Najunst? The password is still 15 characters long, it still has two capital letters, and it still has two numbers. However, since it’s randomly generated it would score 89.3 — over twice as many points as the password I chose.
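For those who would rather not count on their fingers, here is a rough sketch of the scoring test above in Python; the dictionary-word check is left as a flag you set yourself, and characters beyond the 20th are ignored because the test above does not cover them:

from math import log2

def point_score(password, contains_dictionary_word):
    score = 0.0
    for i in range(1, min(len(password), 20) + 1):
        if i == 1:
            score += 4       # 1st character: 4 points
        elif i <= 8:
            score += 2       # 2nd through 8th characters: 2 points each
        else:
            score += 1.5     # 9th through 20th characters: 1.5 points each
    has_upper = any(c.isupper() for c in password)
    has_lower = any(c.islower() for c in password)
    has_other = any(not c.isalpha() for c in password)
    if has_upper and has_lower and has_other:
        score += 6           # mixed case plus numbers or specials: 6 points
    if not contains_dictionary_word:
        score += 6           # no dictionary words: 6 points
    return score

# "Ferguson" and "gmail" are dictionary words, so no bonus there
print(point_score("Ferguson86Gmail", contains_dictionary_word=True))  # 34.5

# A randomly generated password gets full entropy per character instead:
print(round(15 * log2(62), 1))  # 89.3 for a 15-character upper/lower/digit password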

What’s going on here?

When you make up your own password, such as Ferguson86Gmail, you’re not choosing it at random and thus your password will not have a uniform random distribution of information[5].

Passwords chosen by users probably roughly reflect the patterns and character frequency distributions of ordinary English text, and are chosen by users so that they can remember them. Experience teaches us that many users, left to choose their own passwords will choose passwords that are easily guessed and even fairly short dictionaries of a few thousand commonly chosen passwords, when they are compared to actual user chosen passwords, succeed in “cracking” a large share of those passwords.[6]

The “goodness” of a password is measured by its randomness, usually referred to as bits of entropy (which I cleverly disguised as “points” in the above test). The reality of the situation is that humans suck at picking their own passwords.

More Entropy!

If more entropy leads to better passwords, let’s look at what leads to more bits of entropy in a password. The number of bits of entropy, H, in a randomly generated password (versus a password you picked) of length, L, is:

H=log_{2}N^{L}

Where N is the number of characters possible. If you use only lowercase letters, N is 26. If you use lower and uppercase, N is 52. Adding numbers increases N to 62.

For example:

  • mougiasw is an eight-character all lowercase password that has log_{2}26^{8}=37.6 bits of entropy.
  • gLAviAco is an eight-character lowercase and uppercase password that has log_{2}52^{8}=45.6 bits of entropy
  • Pr96Regu is an eight-character lowercase, uppercase, and numeric password that has log_{2}62^{8}=47.6 bits of entropy.

Adding uppercase gets us 8 additional bits, but adding numbers only nets us 2 additional bits of entropy. However, look what happens when we just add additional characters instead:

  • vubachukus is a ten-character all lowercase password that has log_{2}26^{10}=47.0 bits of entropy.
  • neprajubrawa is a twelve-character all lowercase password that has log_{2}26^{12}=56.4 bits of entropy.
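If you want to check these numbers yourself, the calculation is a one-liner in Python:

from math import log2

def entropy_bits(length, alphabet_size):
    # H = log2(N^L) = L * log2(N) for a randomly generated password
    return length * log2(alphabet_size)

print(round(entropy_bits(8, 26), 1))   # mougiasw     -> 37.6
print(round(entropy_bits(8, 52), 1))   # gLAviAco     -> 45.6
print(round(entropy_bits(8, 62), 1))   # Pr96Regu     -> 47.6
print(round(entropy_bits(10, 26), 1))  # vubachukus   -> 47.0
print(round(entropy_bits(12, 26), 1))  # neprajubrawa -> 56.4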

For every additional character, you add log_{2}N bits of entropy. And unlike expanding the character set (e.g. using uppercase letters and/or numbers and/or special characters), you get more bits of entropy for every additional character you extend your password by…not just the first one.

The good news is that for randomly generated passwords, increasing the length by one character increases the difficulty of guessing it by a factor of about 32. The bad news is that for user-selected passwords, each additional character only quadruples the difficulty (adding roughly 2 bits of entropy per character for roughly the first 12 characters, based on NIST Special Publication 800-63 Rev 1).

More bits of entropy is better; I usually like to have at least 44 bits of entropy in my passwords.

Having to break out a calculator to determine the entropy of your passwords is not easy, and passwords should be easy. So let’s make it easy:

General Rule of Thumb

Longer passwords (at least ten characters long) are better than more complex passwords.

Why: Adding complexity only provides a minimal and one time benefit. Adding length provides benefit for each character added and is likely to be easier to remember.
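As an illustration (not a recommendation of any particular tool), here is a sketch of generating a longer, random, all-lowercase password with Python’s standard secrets module:

import secrets
import string
from math import log2

def random_password(length=12, alphabet=string.ascii_lowercase):
    # each uniformly random character adds log2(len(alphabet)) bits of entropy
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(random_password())                # e.g. "qvxbetnoshua" (yours will differ)
print(round(12 * log2(26), 1), "bits")  # ~56 bits, comfortably above 44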

To anyone who understands information theory and security and is in an infuriating argument with someone who does not (possibly involving mixed case), I sincerely apologize.

Track Your Passwords

The inevitable reality of doing passwords better is that you need a way to keep track of them. There simply is no way a person can keep track of all the different passwords for all the different sites.

This leaves us with two other options:

Write Down Your Passwords

Yes. Writing your passwords down in a safe place is an acceptable method of keeping track of your passwords:
From www.schneier.com:

Simply, people can no longer remember passwords good enough to reliably defend against dictionary attacks, and are much more secure if they choose a password too complicated to remember and then write it down. We’re all good at securing small pieces of paper. I recommend that people write their passwords down on a small piece of paper, and keep it with their other valuable small pieces of paper: in their wallet.

Bruce Schneier, 2005

Writing down passwords can be appropriate because the most common attack vector is online (i.e., someone you’ve never even heard of trying to hack into your account from half a world away), with the following caveat: you make the passwords more unique and more entropic.

By writing down passwords, you can increase their entropy (i.e. making them harder to guess) since you don’t have to memorize them. And since you don’t have to memorize them, you are more likely to create a better password. Additionally, if you write your passwords down, you don’t have to remember which password goes with which account so you can have a different password for each account: this also increases password uniqueness.

Encrypt Your Passwords

It would be reasonable to obfuscate your password list — instead of just writing them down in plaintext — so that if someone were to riffle through your wallet, they wouldn’t immediately recognize it as a password list or know exactly which passwords go with which accounts.

Instead of keeping them on a piece of paper, you could use a program to encrypt your passwords for you. There are a variety of ways to safely encrypt and store your passwords on your computer. I have been using 1Password for several years now and have been very impressed with their products[7].

KeePass is another password manager I’ve used; however, it does not have good support for OS X. There are other systems one could use, including Password Safe and YubiKey.

I tend to be leery of web-based systems, such as LastPass and Passpack, for two reasons:

  1. Having lots of sensitive data stored in a known location on the internet is ripe for an attack.
  2. The defense against such an attack is predicated on the notion that the company has implemented their encryption solution correctly!

General Rule of Thumb

You don’t have to remember your passwords.

Why: It’s better to have unique and more entropic passwords than it is to never write down your password.

That’s it! Hopefully you found this helpful, now go make your passwords better and report back!

19 February 2014: Added additional clarification about entropy of user-generated versus randomly-generated passwords.

  1. Arguably, there is no one right way to do passwords 

  2. it’s one thing to prove the mathematics of something, it’s a whole other thing to release a suitable product 

  3. and still sort of is 

  4. or your son/grandson/nephew/cousin 

  5. this is, in part, how predictive typing technologies such as SWYPE work 

  6. NIST Special Publication 800-63 Rev 1 

  7. as well as their technical discussions on topics such as threats to confidentiality versus threats to availability 

The Day We Fight Back

[Image: “I Do Not Consent to the Search of this Device” sticker]

Six months ago, primarily in light of the NSA’s use of what I believe to be unconstitutional searches, I started the process of moving my email system (which is also the email system my family and extended family uses) away from Google Apps.

Last week, I completed the technical transition to the new mail system provided by FastMail.

Today, the fight continues. I called both my Senators, as well as my Representative…yes, I called them. On the phone.

I talked to a live human being and I told them what I thought:

I’d like my Senator / Representative to support and co-sponsor H.R. 3361 / S. 1599, the USA Freedom Act. I would also like my Senator / Representative to oppose S. 1631, the so-called FISA Improvements Act. Moreover, I’d like my Senator / Representative to work to prevent the NSA from undermining encryption standards.

If you visit AFdN today, you will see a large banner that will help you contact your Senators and Representative to do the same.

“I Do Not Consent to the Search of this Device / EFF.org” image used under Creative Commons License from EFF


We’re Going To Need a Bigger Boat

[Image: Comcast bandwidth usage]

I somehow have managed to churn through over 3.5 terabytes of data on our internet connection over the last three months. While I’m glad Comcast’s “enforcement of the 250GB data consumption threshold is currently suspended”, I’m also scared to see what happens when they bring it back.

For what it’s worth, most (~90%) of that data is being pushed up to the cloud, likely to CrashPlan[1].

I can’t wait for Fiber to the Home.

  1. or the NSA 

OpenMediaVault, Round 2: Picking a NAS for Home

One year ago, I spent my Thanksgiving setting up OpenMediaVault on a computer I had just hanging around. It has served me faithfully over the past year, but several things became clear, the most important being that external hard drives are not designed to be continuously powered.

I had two drives fail and growing concerns about the remaining disks. I use CrashPlan to back up the data, so I wasn’t concerned with losing the data, but I was concerned with having it available when I needed it.

I also had a huge increase in storage requirements, due mostly to my video archiving project from last Christmas (which I still need to write up).

I also got married this year, and Rachel had several external drives I was hoping to consolidate. Ironically, her computer also died last week…good thing we had a backup!

The need was clear: a more robust NAS with serious storage requirements.

Requirements

Minimum Requirements:

  • Multiple user access
  • Simultaneous user access
  • File sharing (prefer SMB)
  • Media sharing (prefer iTunes DAAP and DLNA)
  • Access Control List (ACL)
  • High availability (99% up time ~ 3.5 days of downtime/year) for all local users
  • Remote backup (prefer CrashPlan)
  • 10TB of usable space
  • Minimum 100MBit/s access rate
  • Minimal single points of failure (e.g. RAID 5, ZFS, or BTRFS)
  • Secure system
  • Minimum of five years of viable usage
  • Cost effective

Trade Study

I performed a trade study based on four major options:

  1. Upgrading the internal drives in our systems
  2. Continuing to use external hard drives
  3. Using cloud storage
  4. Using a NAS
                                  Internal  External  Cloud  Network
Multiple User Access                  2         3        3      3
Simultaneous User Access              2         2        3      3
File Sharing                          3         3        3      3
Media Sharing                         2         2        1      3
Access Control List                   3         2        3      3
> 99% Up Time                         0         0        3      3
Remote Backup                         3         3        2      3
> 10TB Usable Space                   1         1        3      3
> 100MBit/s Bandwidth                 3         3        1      3
Minimal Single Point of Failure       3         1        2      3
Secure System                         3         3        1      3
> 5 Years of Usage                    3         3        3      3
Total                                28        26       28     36

From this trade study, the differentiators pop out pretty quickly: accessibility and security.

Accessibility

Accessibility covers multiple and simultaneous user access, as well as bandwidth of data.

Single user storage

While increasing the internal local storage is often the best option for a single user, we are in a multi-user environment and the requirement for simultaneous access requires some sort of network connection. This requirement eliminates both per-user options of increasing either the internal or external disk space. Also, increasing the disk space would have been impossible given that Rachel and I both use laptops.

Cloud Storage

Storing and sharing data on the Internet has become incredibly easy thanks to the likes of DropBox, Google Drive, Microsoft Spaces, Microsoft Azure, RackSpace Cloud Storage, Amazon S3, SpiderOak, and the like. In fact, many consumer Cloud storage solutions (such as DropBox) use enterprise systems (such as Amazon S3) to store their data. Because it’s provided as a network service, simultaneous data access with multiple users is possible.

The challenge of Cloud Storage is getting access to the data, which requires a working Internet connection and sufficient bandwidth to transport the data. Our current bandwidth with Comcast is typically limited to no more than ~48MBit/s, which is less than 50% of the 100MBit/s requirement. While higher data rates are possible, they are cost prohibitive at this time.

NAS

Network Attached Storage (NAS) devices are not a new thing and have been around for decades. Within the last 10 years, though, their popularity in home and home office environments has grown as the costs of implementation and maintenance have decreased. At its core, a NAS is a computer with lots of internal storage that is shared with users over the home network. While more costly than simply increasing internal/external local storage, it provides significantly better access to the data.

Because the NAS is primarily accessed over the home network, the speed of access is limited to the connection speed of the NAS to the network and the network to the end system. Directly connected systems (using an ethernet cable) can reach speeds of 1000 MBit/s, and about 300 MBit/s over wireless. This is significantly slower than directly connected drives, but faster than externally connected USB 2.0 drives and Cloud Storage. Most files would open in less than one second and all video files would be able to stream immediately with no buffering.
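As a rough sanity check on those claims (the file sizes here are hypothetical):

# Back-of-the-envelope transfer times over the home network
def seconds_to_transfer(size_mb, link_mbit_per_s):
    return (size_mb * 8) / link_mbit_per_s

print(seconds_to_transfer(25, 300))    # ~0.7 s for a 25 MB file over 802.11n
print(seconds_to_transfer(25, 1000))   # 0.2 s for the same file over gigabit ethernet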

System Security

Securing data is the other challenge.

Cloud Storage

Because the data is stored by a third-party there are considerable concerns about data safety, as well as the right to data privacy from allegedly lawful (but arguably constitutionally illegal) search and seizures by government agencies.

I ran into similar issues with securing my Linode VPS, and ended up not taking any extraordinary steps because the bottom line is: without physical control of the data, the data is not secure.

The data that I’m looking to store for this project is certainly more sensitive than whatever I host on the web. There are many ways to implement asymmetric encryption to store files, but it would also require that each end-user have the decryption keys. Key management gets very complicated very quickly (trust me) and also throws out any hope of streaming media.

NAS

Since the NAS is local to the premises, physical control of the data is maintained, and the data are also given the superior protection of the 4th Amendment for items in your control.

Additionally, the system is behind several layers of security that would make remote extraction of data highly difficult and improbable.

Designing a NAS

With a NAS selected, I had to figure out which one. But first, a short primer on the 10TB of usable space and what that means.

Hard Drives

Capacity

I arrived at the 10TB requirement by examining the amount of storage we were currently using and then extrapolating what we might need over the next five years, which is generally considered the useful-life period[1]:

[Figure: field failure rate pattern of hard disk drives]

While the “bathtub curve” has been widely used as a benchmark for life expectancy:

Changes in disk replacement rates during the first five years of the lifecycle were more dramatic than often assumed. While replacement rates are often expected to be in steady state in year 2-5 of operation (bottom of the “bathtub curve”), we observed a continuous increase in replacement rates, starting as early as in the second year of operation.[2]

Practically speaking, the data show that:

For drives less than five years old, field replacement rates were larger than what the datasheet MTTF suggested by a factor of 2-10. For five to eight year old drives, field replacement rates were a factor of 30 higher than what the datasheet MTTF suggested.[3]

Something to keep in mind if you’re building larger systems.

Redundancy

Unfortunately, there is no physical 10TB drive one can buy, but a series of smaller drives can be logically arranged to appear as 10TB. However, the danger of logically arranging these drives is that typically if any single drive fails, you would lose all the data. To prevent this, a redundancy system is employed that allows at least one drive to fail, but still have access to all the data.

Using a RAID array is the de facto way to do this, and RAID 5 has been the preferred implementation because it has one of the best storage efficiencies and only “requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost.”

Annualized Failure Rate

Failure rates of hard drives are generally given as a Mean Time Between Failures (MTBF), although Seagate has started to use Annualized Failure Rate (AFR), which is seen as a better measure.

A common MTBF for hard drives is about 1,000,000 hours, which can be converted to AFR:

\textup{AFR}=1-e^{\left(\frac{-\textup{Annual Operating Hours}}{\textup{MTBF}}\right)}

Assuming the drives are powered all the time, the Annual Operating Hours is 8760, which gives an AFR of 0.872%. Over five years, it can be expected that 4.36% of the drives will fail.
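Here is that calculation in Python, using the same linear approximation for the five-year figure:

from math import exp

mtbf_hours = 1_000_000     # from the datasheet
annual_hours = 8760        # powered 24x7

afr = 1 - exp(-annual_hours / mtbf_hours)
print(f"AFR: {afr:.3%}")             # ~0.872% per year
print(f"Five years: {5 * afr:.2%}")  # ~4.36% (simple linear approximation)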

The AFR for the entire RAID array (not just a given disk) can be generally approximated as a Bernoulli trial.

For a RAID 5 array:
\textup{AFR}_{RAID5} = 1-(1-r)^{n}-nr(1-r)^{n-1}

For a RAID 6 array:
\textup{AFR}_{RAID6} = 1-(1-r)^{n}-nr(1-r)^{n-1}-{n\choose 2}r^{2}(1-r)^{n-2}

Efficiency of Space and Failure Rate

Using a five year failure rate of 4.36%, the data show that RAID 6 is significantly more tolerant to failure than RAID 5, which should not be a surprise: RAID 6 can lose two disks while RAID 5 can only lose one.
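Evaluating these approximations in Python with the five-year per-disk failure rate from above gives a feel for the difference (and for how quickly RAID 5 degrades as disks are added):

from math import comb

def raid5_failure(r, n):
    # array is lost if two or more of the n disks fail
    return 1 - (1 - r)**n - n * r * (1 - r)**(n - 1)

def raid6_failure(r, n):
    # array is lost if three or more of the n disks fail
    return raid5_failure(r, n) - comb(n, 2) * r**2 * (1 - r)**(n - 2)

r = 0.0436  # five-year per-disk failure rate from above
for n in (4, 5, 8, 12):
    print(f"{n:>2} disks  RAID 5: {raid5_failure(r, n):.2%}  "
          f"RAID 6: {raid6_failure(r, n):.3%}")
# For five disks this works out to roughly 1.7% (RAID 5) versus 0.08% (RAID 6)
# over five years, and the RAID 5 number grows much faster as disks are added.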

What was more impressive to me is how quickly RAID 5 failure rates grow (as a function of number of disks), especially when compared to RAID 6 failure rates.

Technically a Bernoulli trial requires the disk failures to be statistically independent, however there is strong evidence[4] for the existence of correlations between disk replacement interarrivals; in short, once a disk fails there is actually a higher chance that another disk will fail within a short period of time. However, I believe the Bernoulli trial is still helpful to illustrate the relative failure rate differences between RAID 5 and RAID 6.

Bit Error Rate

Even if you ignore the data behind AFR, single disk fault tolerance is still no longer good enough due to non-recoverable read errors – the bit error rate (BER). For most drives, the BER is <1 in 10^14, “which means that once every 100,000,000,000,000 bits, the disk will very politely tell you that, so sorry, but I really, truly can’t read that sector back to you.”

One hundred trillion bits is about 12 terabytes (which is roughly the capacity of the planned system), and “when a disk fails in a RAID 5 array and it has to rebuild there is a significant chance of a non-recoverable read error during the rebuild (BER / UER). As there is no longer any redundancy the RAID array cannot rebuild, this is not dependent on whether you are running Windows or Linux, hardware or software RAID 5, it is simple mathematics.”
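To put a rough number on that “significant chance”: assuming independent bit errors and roughly 12 TB read back during a rebuild, a back-of-the-envelope estimate looks like this:

ber = 1e-14              # one unrecoverable read error per 10^14 bits read
bits_read = 12e12 * 8    # ~12 TB read back during a RAID 5 rebuild

p_ure = 1 - (1 - ber)**bits_read
print(f"{p_ure:.0%}")    # roughly a 60% chance of hitting at least one URE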

The answer is dual disk fault tolerance, such as RAID 6, with one to guard against a whole disk failure and the other to, essentially, guard against the inevitable bit error that will occur.

RAID or ZFS

I originally wanted to use ZFS RAID-Z2, which is a dual disk fault tolerant file system. While it offers features similar to RAID 6, RAID 6 still needs a file system (such as ext4) put on top of it. ZFS RAID-Z2 is a combined system, which is important because:

From blogs.oracle.com:

“RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd, and Row Diagonal Parity) never quite delivered on the RAID promise — and can’t — due to a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero — it’s that equation that allows you to reconstruct data when a disk fails. The problem is that there’s no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage.

RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do read-modify-write.

Whoa, whoa, whoa — that’s it? Variable stripe width? Geez, that seems pretty obvious. If it’s such a good idea, why doesn’t everybody do it?

Well, the tricky bit here is RAID-Z reconstruction. Because the stripes are all different sizes, there’s no simple formula like “all the disks XOR to zero.” You have to traverse the filesystem metadata to determine the RAID-Z geometry. Note that this would be impossible if the filesystem and the RAID array were separate products, which is why there’s nothing like RAID-Z in the storage market today. You really need an integrated view of the logical and physical structure of the data to pull it off.”

However, it’s not quite ready for primetime, and more importantly, OpenMediaVault does not support it yet[5].

So RAID 6 it is.

Cost

RAID 6 is pretty straightforward and provides (n-2)*capacity of storage. To provide at least 10 TB, I would need five 4 TB drives (or six 3 TB drives, or seven 2 TB drives, or twelve 1 TB drives, etc.).

Western Digital’s Red NAS drives are designed for 24×7 operation (versus other drives which are geared toward 8 hours of daily operation) and are widely regarded as the best drives to use for a NAS.

Their cost structure breaks out as such:

Capacity  Cost/Disk  Cost/GB
1 TB      $70        $0.0700
2 TB      $99        $0.0495
3 TB      $135       $0.0450
4 TB      $180       $0.0450

At first glance, it appears that there’s no cost/GB difference between the 3 TB and 4 TB drives, but using smaller-sized drives is more cost-effective because the amortization of the redundant disks is spread over more total disks, which brings the cost per usable GB down faster for a given storage capacity.

[Figure: RAID 6 cost vs. usable space]

However, the actual cost per usable GB is the same (between 3 TB and 4 TB) for a given number of disks; you just get more usable space when using five 4 TB drives versus five 3 TB drives.
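Here is a quick sketch, using the prices above, of cost per usable gigabyte for a few RAID 6 layouts:

# WD Red street prices from the table above; RAID 6 keeps (n - 2) disks of data
price = {1: 70, 2: 99, 3: 135, 4: 180}  # capacity in TB -> cost per disk ($)

def usable_tb(n, tb):
    return (n - 2) * tb

def cost_per_usable_gb(n, tb):
    return (n * price[tb]) / (usable_tb(n, tb) * 1000)

for n, tb in [(5, 4), (5, 3), (6, 3), (7, 2), (12, 1)]:
    print(f"{n} x {tb} TB: {usable_tb(n, tb):>2} TB usable, "
          f"${cost_per_usable_gb(n, tb):.4f}/usable GB")

Five 3 TB and five 4 TB drives land at the same $0.075 per usable GB, while six 3 TB drives come in cheaper per usable GB, which is the amortization effect described above.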

Given that I was trying to keep things small, and some reviews indicated there are some possible manufacturing issues with the 3 TB WD Red drives, I decided to splurge a bit[6] and go for the 4 TB drives.

Also, the cost per GB has, for the last 30+ years, decreased by half every 14 months. This corresponds to an order of magnitude every 5 years (i.e. if it costs $0.045/GB today, five years ago it would have cost about $0.45/GB and ten years ago it would have cost about $4.50/GB). If we wait 14 months, presumably it would cost $450 to purchase five new 4TB drives. If we wait 28 months, the cost should halve again and it would presumably cost about $225 to purchase five new 4TB drives.

However, since we need drives now, whatever we spend becomes a sunk cost. The difference between buying five 2TB drives or five 4TB drives now is $181. However, if we buy them in 28 months, we would have to spend close to $225…or 24% more than we would have to pay now.

Since we will need the additional space sooner than 2.3 years from now, it actually makes financial sense to buy the 4TB drives now.

The Rest of the System

With the hard drives figured out, it’s time to figure out the rest of the system. There are basically two routes: build your own or buy an appliance.

Build your own NAS

My preliminary research quickly pointed to HP’s ProLiant MicroServer as an ideal candidate: it was small, reasonably powerful, and a great price. Since I’ve built computers before, I also wanted to price out what it would cost to build a system from scratch.

I was able to design a pretty slick system:
[Image: the BitFenix-based homebuilt system]

Buy an Appliance

After careful review, Synology is the only company that I believe builds an appliance worth considering. Their DiskStation Manager operating system seemed solid when I tried it, there was an easy and known method to get CrashPlan working on their x86-based system, and their system stability has garnered lots of praise.

Initially, I was looking at:

  • DS412+
  • DS414
  • DS1513+
  • DS1813+

However, the DS41x units only hold 4 drives and that was not going to be enough to have at least 10TB of RAID6 usable storage.

System Trade Study

                      HP G7   HP G8   DS1513+  DS1813+  Homebuilt
x86-based             Yes     Yes     Yes      Yes      Yes
> 2GB RAM             2GB     2GB     2GB      2GB      4GB
  Max RAM             16GB    16GB    4GB      4GB      16GB
> 10TB Usable Space   12 TB   12 TB   12 TB    24 TB    12 TB
> 100MBit/s NIC       1GBit   1GBit   1GBit    1GBit    1GBit
Cost[7]               $415    $515    $800     $1000    $449

The main differences between the G7 and the G8 are:

  • G8 uses an Intel Celeron G1610T Dual Core 2.3 GHz instead of the AMD Turion II Model Neo N54L 2.2GHz…no real benefit
  • G8 has a second ethernet plug; however, this is no real benefit since our configuration would not use it
  • G8 has USB 3.0, which would be nice but can be added to the G7 for $30.
  • G8 has only one PCI Express slot, which is a downgrade since the G7 has two slots.
  • G8 has an updated RAID controller; however, this is no real benefit since it would not be used in our configuration
  • G8 has the iLO Management Engine; however, this is no real benefit for our configuration
  • The G8 HP BIOS is digitally signed, “reducing accidental programming and preventing malicious efforts to corrupt system ROM.” It also means I cannot use a modified BIOS…which is bad.
  • The G8 supports SATA III, which is faster than the G7’s SATA II…but probably not a differentiator for our configuration.

Conclusion

Perhaps the most important element is getting buy-in from your wife. All of this analysis is fun, but at the end of the day, can I convince my wife to spend over $1000 on a data storage system that will sit in the closet – my side of the closet?

We selected the HP ProLiant MicroServer G7, which I think is a good choice.

I really wanted to build a server from scratch, but it can be a risky endeavour. I tried to pick good quality parts (those with good ratings, lots of reviews, and from vendors I know), but it can be a crapshoot.

For a first time major NAS system like this, I wanted something more reliable. I believe the HP ProLiant MicroServer G7 will be a reliable system and will meet our needs; lots of NAS enthusiasts use it, which is a big plus because it means that it works well and there are lots of people to ask questions of.

For next time (in five years or so), I want to do some more analysis of our data storage over time, which I will be able to track.

I’m also curious what the bottlenecks will be. We currently use a mix of 802.11n over 2.4 GHz and 5 GHz, but I’ve thought about putting in a GigE CAT5 cable.

RAID 6 still has the write hole issue, and I hope it doesn’t cause a problem.

I’m not terribly thrilled with the efficiency of 3+2 (three storage disks plus two parity disks), but there’s not really a better way to slice it unless I add more disks. And it may be that using more, smaller disks actually does make a difference.

Resources

  1. J. Yang and F.-B. Sun. A comprehensive review of hard-disk drive reliability. In Proc. of the Annual Reliability and Maintainability Symposium, 1999. 

  2. Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?. In Proceedings of the 5th USENIX conference on File and Storage Technologies (FAST ’07)  

  3. Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?. In Proceedings of the 5th USENIX conference on File and Storage Technologies (FAST ’07)  

  4. Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?. In Proceedings of the 5th USENIX conference on File and Storage Technologies (FAST ’07)  

  5. NAS4Free and FreeNAS both support ZFS RAID-Z, but they run FreeBSD which does not have native support for CrashPlan 

  6. for the capacity, it’s an 11% increase in per GB cost 

  7. Not including hard drives 

Transition to LEMP

If you’re reading this, it means you are using the new AFdN server! As part of my foolish plunge into Virtual Private Servers, I’ve migrated all the files over[1], set up the new system, and fine-tuned it.

It’s not that I wasn’t happy with Bluehost, just that I had grown out of Bluehost, which makes sense: Bluehost really is targeted at people new to web hosting. I’ve had a web site since I was 11.

I’ve heard rumors that Bluehost has over 500 users on each one of their boxes; upgrading to their Pro Package a couple of years ago put me on a box with “80% less accounts per server”, but it still wasn’t cutting it. I needed more!

The LEMP setup: Linux, Nginx[2], MariaDB, PHP-FPM.

From a hardware standpoint, fremont is a NextGen 1GB Linode Virtual Private Server (VPS), powered by dual Intel Sandy Bridge E5-2670 processors each of which “enjoys 20 MB of cache and has 8 cores running at 2.6 Ghz” and is shared with, on average, 39 other Linodes.

Linux

I’ve chosen to run Debian 7 (64 bit); it’s a Linux distribution I trust, has a good security focus, and I’m also very familiar with it.

Setting up the Linode was easy. I decided against using StackScripts because I wanted to know exactly what was going into my system and I wanted to have the experience in case something goes wrong down the line.

I took a fresh copy of Wheezy (Debian 7) and then used the following guides:

I very seriously considered encrypting the entire server, but decided against because ultimately the hardware was still going to be out of my physical control and thus encrypting the system was not an appropriate solution for the attack vector I was concerned with.

Nginx

I’ve always used Apache to do the actual web serving, but I’ve heard great things about Nginx and I wanted to try it. Since I was already going down the foolish path, I figured that I had nothing to lose with trying a new web server as well.

To make things easier, I installed Nginx from the repo instead of from source and then configured it using the (more or less) standard approach.

It’s really simple to install; I probably overthought it.

rtCamp has a really great tutorial on setting up fastcgi_cache_purge that allows Nginx to cache WordPress data and then purge and rebuild the cached content after you edit a post/page from the WordPress dashboard or approve a comment on an article.

MariaDB

The standard tool for web-based SQL databases in my book has always been MySQL. But just like Nginx, I’ve heard some good things about MariaDB and figured, why not? The great thing is, MariaDB is essentially a drop-in replacement for MySQL. Installing from the repo was a piece of cake and there really is no practical difference in operation…it just works, but better (in theory).

PHP-FPM

PHP FastCGI Process Manager (FPM) is an alternative to the regular PHP FastCGI implementation. In particular, it includes adaptive process spawning, among other things, and seems to be the de facto PHP implementation method for Nginx. Installing from the repo was a piece of cake and required only minimal configuration.

I originally used the TCP Sockets, but found that UNIX Sockets gave better performance.

Fine-tuning

Getting everything moved over was pretty easy. I did some benchmarking using Google Chrome’s Network DevTools and the Plugin Performance Profiler from GoDaddy[3].

Most of the fine-tuning was in the little things, like better matching the number of threads to the number of cores I had available. I also enabled IPv6 support, which means that AFdN is IPv6 compliant:

[Image: IPv6 Ready badge]

Enjoy faster and better access to AFdN!

  1. at least for AFdN, there are other sites I run that are still in migration 

  2. pronounced engine-x, the “e” is invisible 

  3. I know, I’m just as shocked as you