Storage Optimization

Are you Content Aware?

Posted in Analyst, Storage by storageoptimization on October 2, 2008
Storage analyst Robin Harris commented on the storage story of the week: NetApp’s guarantee that virtualization will mean a 50% gain in storage capacity for its customers.
Harris’s take on the announcement is that dedupe for primary storage could be “the next big win for IT shops.” Perhaps, but let’s keep in mind that NetApp dedupe is very simple: it only finds duplicate blocks that line up on NetApp WAFL 4K block boundaries. The reason they are positioning it as a big win for VMware users is that virtual machines (static images of whole operating systems) are one of the few places where you’ll find lots of dupes in primary storage on block-aligned boundaries.
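To make the block-aligned idea concrete, here is a minimal sketch in Python. It is not NetApp's implementation. `BlockStore`, `fake_block`, and the two VM image names are hypothetical, made up for illustration. The only thing taken from the discussion above is the 4K block size. Each unique block is kept once, keyed by its hash, and a file is just a list of block hashes.

```python
import hashlib

BLOCK_SIZE = 4096  # the 4K block size described above


class BlockStore:
    """Toy block-aligned dedupe store: each unique 4K block is kept once."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_blocks(self):
        return sum(len(digests) for digests in self.files.values())

    def physical_blocks(self):
        return len(self.blocks)


def fake_block(tag):
    """Hypothetical helper: a deterministic, distinct 4K block per tag."""
    return hashlib.sha256(tag.encode()).digest() * (BLOCK_SIZE // 32)


# Two made-up VM images cloned from the same 100-block base OS image,
# each with one block of its own configuration appended.
base_image = b"".join(fake_block(f"os-{i}") for i in range(100))
store = BlockStore()
store.write("vm-a.img", base_image + fake_block("vm-a-config"))
store.write("vm-b.img", base_image + fake_block("vm-b-config"))

print(store.logical_blocks(), "logical blocks kept in",
      store.physical_blocks(), "physical blocks")
```

The two images hold 202 blocks between them but only 102 distinct ones, which is the kind of case where block-aligned dedupe does well: the shared content sits at the same offsets in every copy.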
Here is my take: The best results in dedupe for primary storage are going to be from applications that can recognize file types and understand how to find the duplicate information in them. That is where the big wins in dedupe for primary storage are going to be.
Consider this typical scenario: I create a PowerPoint and email it to someone else. They save it, open it, and make an edit – add a slide, or even just edit a bullet or two. That small edit shifts the content that follows it, so little or none of the redundant content in the new file falls on the same NetApp WAFL block boundaries as in the original. So although the two files are almost entirely the same, you won’t see good dedupe results on them.
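The scenario above can be sketched in a few lines of Python. This is a simplification under stated assumptions: the "file" is plain random bytes rather than a real PowerPoint, the names `original`, `edit`, and `edited` are hypothetical, and the edit is placed near the start of the file. An edit near the end would leave the earlier blocks aligned.

```python
import hashlib
import random

BLOCK_SIZE = 4096


def block_hashes(data):
    """Set of SHA-256 digests, one per aligned fixed-size block."""
    return {
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


# Hypothetical stand-in for the original file: 40 blocks of seeded random bytes.
rng = random.Random(2008)
original = bytes(rng.getrandbits(8) for _ in range(40 * BLOCK_SIZE))

# The recipient inserts a few bytes near the start and saves a new copy.
edit = b"one new bullet"
edited = original[:100] + edit + original[100:]

shared = block_hashes(original) & block_hashes(edited)
print(f"bytes carried over: {len(original)} of {len(edited)}")
print(f"aligned blocks in common: {len(shared)} of {len(block_hashes(original))}")
```

Every byte of the original survives in the edited copy, yet the 14 inserted bytes push all later content off its old 4K boundaries, so the two copies have no aligned block in common.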
A content-aware solution – one that combines information-level dedupe with content-aware compression – should be able to get 10:1 compression on most typical file mixes (especially Office and engineering ones). A 10:1 ratio is the same as a 90% reduction, so if you can shrink 80% of your data by 90%, you can get a pretty good handle on how big the win could be. And by the way, it’s not necessarily a bad thing for the guys who sell disks, either, because what happens when you can get that kind of win is that you start to think differently about what you can store, and how long you can store it for. For example, at my company, Ocarina Networks, we have a customer that plans to store a snapshot a day online for every day’s data for 10 years. That wouldn’t be possible without some drastic deduplication.
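For anyone who wants the 80% and 90% figures worked through, here is the arithmetic as a small Python sketch. The function name `overall_saving` is hypothetical, and the calculation assumes the remaining 20% of the data is not reduced at all.

```python
def overall_saving(candidate_fraction, ratio):
    """Fraction of total capacity saved when only part of the data shrinks.

    candidate_fraction: share of the data that can be reduced (0.80 here)
    ratio: reduction ratio on that share (10 means 10:1, a 90% reduction)
    Assumption: the rest of the data stays at its original size.
    """
    size_after = candidate_fraction / ratio + (1 - candidate_fraction)
    return 1 - size_after


saving = overall_saving(candidate_fraction=0.80, ratio=10)
print(f"overall saving: {saving:.0%}")               # 72%
print(f"effective ratio: {1 / (1 - saving):.2f}:1")  # 3.57:1
```

So under those assumptions the whole store shrinks by about 72%, or roughly 3.5:1 overall.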
Block-level dedupe – whether simple block-aligned like NetApp or sliding-window like market leader Data Domain – is only going to find a small subset of the duplicate or redundant information in primary storage. That’s because most file types that drive storage growth in primary (or nearline) storage are compressed. Compression causes the stored contents of a file to be recomputed – and to look random – every time the file is changed. So if I store a photo, then open it, edit one pixel, and save the new version as a new file, there may not be a single duplicate block at the disk level. Yet almost the entire file is duplicate information.
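The one-pixel example can be approximated with Python's `zlib` module. This is a rough sketch, not a real image format: `pixels` is a hypothetical stand-in for raw image data (seeded random bytes with a skewed distribution so that they compress), and plain deflate stands in for whatever compression a real photo format applies.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096


def block_hashes(data):
    """Set of SHA-256 digests, one per aligned fixed-size block."""
    return {
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


# Hypothetical "raw pixels": 256 KiB of skewed random bytes (compressible).
rng = random.Random(2008)
pixels = bytes(
    min(int(rng.expovariate(0.05)), 255) for _ in range(64 * BLOCK_SIZE)
)

# Edit one "pixel": flip the bits of the first byte, leave the rest untouched.
edited_pixels = bytes([pixels[0] ^ 0xFF]) + pixels[1:]

# What a block-level deduper sees if the data is stored uncompressed...
raw_shared = block_hashes(pixels) & block_hashes(edited_pixels)

# ...and what it sees when each version is stored compressed.
packed = zlib.compress(pixels)
edited_packed = zlib.compress(edited_pixels)
packed_shared = block_hashes(packed) & block_hashes(edited_packed)

print(f"raw: {len(raw_shared)} of {len(block_hashes(pixels))} blocks shared")
print(f"compressed: {len(packed_shared)} of "
      f"{len(block_hashes(packed))} blocks shared")
```

The uncompressed copies should share all but one block. The compressed copies should share few or none, because the one changed byte alters the encoding early in the stream and everything after it lands on different bits.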
Can you find a duplicate graphic that was used in a PowerPoint, a Word document, and a PDF? PowerPoint and Word both compress with a variant of zip; PDF compresses with deflate. Even if the graphic is identical, block-level dedupe won’t find the duplicate graphics because they are not stored identically on disk. You need something that can find duplicate data at the information level. Finally, there is pretty concrete data saying that about 80% of the file data on NAS is a candidate for deduplication.
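Here is a minimal Python sketch of what "finding duplicates at the information level" could look like. It is an illustration under loud assumptions, not how any product works: the three containers are toy stand-ins (two in-memory zip files for the Office documents, a one-line header plus a zlib stream for the PDF), and every name in it (`zip_container`, `zip_payload_hashes`, `stream_payload_hashes`, the part names) is hypothetical. The idea is simply to unpack each container and hash the payloads instead of the on-disk blocks.

```python
import hashlib
import io
import random
import zipfile
import zlib

BLOCK_SIZE = 4096


def block_hashes(data):
    """Set of SHA-256 digests, one per aligned fixed-size block."""
    return {
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


def zip_container(parts):
    """Pack {name: bytes} into an in-memory zip (stand-in for an Office file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, payload in parts.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def zip_payload_hashes(container):
    """Hash every part of a zip container after decompressing it."""
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        return {hashlib.sha256(zf.read(n)).digest() for n in zf.namelist()}


def stream_payload_hashes(container):
    """Hash the decompressed stream of the toy PDF-like container."""
    _header, _, stream = container.partition(b"\n")
    return {hashlib.sha256(zlib.decompress(stream)).digest()}


# One hypothetical graphic, embedded in three different containers.
rng = random.Random(2008)
graphic = bytes(rng.randrange(64) for _ in range(40_000))

slides = zip_container({
    "ppt/slides/slide1.xml": b"<slide/>",
    "ppt/media/image1.png": graphic,
})
report = zip_container({
    "word/document.xml": b"<document>" + b"text " * 50 + b"</document>",
    "word/media/image1.png": graphic,
})
pdf_like = b"%PDF-like stand-in\n" + zlib.compress(graphic)

# Block level: compare the containers as they would sit on disk.
on_disk = [block_hashes(c) for c in (slides, report, pdf_like)]
block_dupes = ((on_disk[0] & on_disk[1]) | (on_disk[0] & on_disk[2])
               | (on_disk[1] & on_disk[2]))

# Information level: compare what is inside the containers.
info_dupes = (zip_payload_hashes(slides) & zip_payload_hashes(report)
              & stream_payload_hashes(pdf_like))

print("duplicate blocks on disk:", len(block_dupes))
print("duplicate payloads inside:", len(info_dupes))
```

In this toy setup the block-level comparison finds nothing in common, while the payload comparison finds the one shared graphic in all three files. A real content-aware tool would need format-specific parsers for each file type, which is where the hard work is.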
With all that in mind, don’t you think content aware optimization is going to be the next truly big win?
