This year’s big storage story
My company Ocarina Networks is one of Network World’s “10 Storage Companies to Watch.” We’ve been given this honor before, by Byte and Switch in 2007. This is not just to toot our horn, but to mention that more and more, we’re confirming our hunch that primary storage is where there is an immense need for new and innovative optimization solutions.
As I mentioned in an earlier post, the kind of data that’s driving much of today’s storage growth–files–demands a file-aware solution for shrinking them down. It’s clear to me that there is an emerging set of opportunities in this space, and we are only beginning to see where this will lead.
In Startup City’s spotlight
I was interviewed on video by John Foley for InformationWeek’s Startup City a month or so ago, and have just discovered that the video is now up on the site. If you haven’t already, I encourage you to explore this blog. Foley does a great job of covering the vast and growing landscape of IT startups. Enjoy.
What’s Hot in Storage — Spending Less
Byte & Switch has once again released its “Top 10 Storage Startups to Watch” for 2008, and it’s definitely worth a read. My company Ocarina Networks was on that same list last year, and so I can say with confidence that they got it right at least once before.
As reflected in this year’s list, data reduction technologies continue to be hot. It makes sense that in a down economy anything that increases capacity will continue to get budget dollars. As we’re finding, the dollars for stuff like Ocarina are already there in every data center’s budget – they’re just listed as disk expense. We’re not only ahead of our revenue goals for our storage optimization product launched in April, but we’re having to triple the size of our sales force to keep up with demand.
If you have planned to buy 100 TB of disk, and can spend half as much for an optimization solution that shrinks your files, that means you don’t have to buy any disk at all. A win all the way around. While Ocarina started out with wins in large web sites – where the fastest year-to-year storage growth is taking place – we’re now seeing installs in life sciences, energy, movie studios, and finance.
The chief takeaway from what I’ve seen: some nice-to-have new technologies may be facing a tough summer with an economic downturn, but data reduction scores high on both saving money and green IT, and is likely to stay strong, or maybe even move up in priority, during a down cycle in storage spending.
Saturated: The Cloud’s Storage Dilemma
Yesterday’s Mashable post looking at online file storage providers caught my eye. Right now, online “cloud” storage providers are all targeting different markets, but the competition is fierce in all segments. Some are going after the consumer (such as AOL X-Drive), some are going after online backup, and some are going after web site data. That article doesn’t even mention Amazon’s S3, for example, which is a huge online repository.
The obvious benefits are basically twofold: ease-of-use and – for most of them – the fact that they manage your data for you in terms of backing it up, replicating it, etc. The biggest drawback here is that you have to be connected to a network to get to your files.

Most customers will still look at cost per gigabyte as the main motivator to use a service like this and, at the right price and benefit point, people will put their files online. Since these storage service providers all buy their disks from the same small number of companies that actually make disk drives, the costs are all roughly the same for the physical infrastructure needed to build an online storage service and compete.
I think that the real solution here is that, for anyone to break through and get some separation from the crowd, they are going to have to incorporate breakthrough storage optimization in their offering – and do so in a way that’s transparent to the end user. That could be dedupe, that could be compression, or it could be something more sophisticated like Ocarina. The main thing is that if you can get 5:1 or 10:1 ratios on how much logical space you can provide via the cloud to how much physical space you, as a provider, have to buy, then you can have a compelling proposition. The competition is fierce in this market and in order to grow and thrive in any business that offers online storage, the providers are going to have to develop a strategy to significantly increase their online storage capacity without increasing cost and overhead in step.
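To put numbers on those ratios, here is a rough Python sketch of the arithmetic. The function names and the price per physical terabyte are assumptions made up for illustration; only the 5:1 and 10:1 ratios come from the discussion above.

```python
def physical_tb_needed(logical_tb: float, reduction_ratio: float) -> float:
    """Physical disk a provider must buy to offer `logical_tb` to customers."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return logical_tb / reduction_ratio


def cost_per_logical_tb(cost_per_physical_tb: float, reduction_ratio: float) -> float:
    """Effective hardware cost of each terabyte the customer sees."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return cost_per_physical_tb / reduction_ratio


# Hypothetical price, chosen only to make the comparison readable.
ASSUMED_COST_PER_PHYSICAL_TB = 1000.0

for ratio in (1, 5, 10):
    disk = physical_tb_needed(100, ratio)
    cost = cost_per_logical_tb(ASSUMED_COST_PER_PHYSICAL_TB, ratio)
    print(f"{ratio}:1 ratio: buy {disk:.0f} TB of disk, ${cost:.0f} per logical TB")
```

Since every provider pays roughly the same for raw disk, the ratio is the one input in this calculation that a provider can actually move.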
Who’s Really Melting the Ice Cap?
Can You Compress Already Compressed Files? Part II
In my last post I discussed the fact that most files that are used are already compressed. And up to now, there were no algorithms to further compress them. Yet, it’s obvious that there needs to be a new solution.
On the cutting edge, there are some new innovations in file-aware optimization that allow companies to reduce their storage footprint and get more from the storage they already have. The key to this is understanding specific file types, their formats, and how the applications that created those files use and save data. Most existing compression tools are generic. To get better results than you can get with a generic compressor, you need to go to file-type-aware compressors.
There’s another problem. Let’s say you just created a way better tool for compressing photographs than JPEG. That doesn’t mean your tool can compress already-compressed JPEGs; it means that if you were given the same original photo in the first place, you could do a better job. So the first step in moving towards compressing already-compressed files is what we call Extraction – you have to extract the original full information from the file. In most cases, that’s going to involve de-compressing the file first, getting back to the uncompressed original, and then applying your better tools.
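A small Python sketch can illustrate why extraction comes first. It is not how Ocarina works; it simply uses zlib at its fastest setting as a stand-in for the "lesser" original compressor and lzma as a stand-in for the "better" tool, and the sample text is generated, not real customer data.

```python
import lzma
import random
import zlib


def make_sample_text(word_count: int = 20000, seed: int = 0) -> bytes:
    """Build deterministic, text-like sample data (a stand-in for a real file)."""
    vocabulary = ["storage", "file", "photo", "disk", "growth", "data",
                  "server", "upload", "user", "cloud", "backup", "archive"]
    rng = random.Random(seed)
    words = (rng.choice(vocabulary) for _ in range(word_count))
    return " ".join(words).encode("ascii")


original = make_sample_text()

# The file as it sits on disk: already compressed by a weaker, faster tool.
already_compressed = zlib.compress(original, 1)

# Naive approach: run the stronger compressor over the compressed bytes.
recompressed_directly = lzma.compress(already_compressed)

# Extraction first: recover the original, then apply the stronger compressor.
extracted = zlib.decompress(already_compressed)
recompressed_after_extraction = lzma.compress(extracted)

print("on disk (zlib level 1):     ", len(already_compressed))
print("lzma over compressed bytes: ", len(recompressed_directly))
print("lzma after extraction:      ", len(recompressed_after_extraction))
```

The stronger tool only gets to show its advantage when it sees the original data; pointed at the compressed bytes, it has almost nothing to work with.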
Extraction may seem simple enough – just reverse whatever was done to a file in the first place. But it’s not always quite that easy. Many files are compound documents, with multiple sections or objects of different data types. A PowerPoint presentation, for example, may have text sections, graphics sections, some photos pasted in, etc. The same is true for PDFs, email folders with attachments, and a lot of the other file types that are driving storage growth. So to really extract all the original information from these files, you may need to not only be able to decompress files, but to look inside them, understand how they are structured, break them apart into their separate pieces, and then do different things to each different piece.
The two things to take away from this discussion are: 1) you won’t get much benefit from applying generic compression to already-compressed file types, which are the file types that are driving most of your storage growth, and 2) it is possible to compress already-compressed files, but to do so, you first have to extract all the original information from them, which may involve decoding and unraveling complex compound documents and then decompressing all the different parts. Once you’ve gotten to that point, you’re only at the starting line for real online data reduction on today’s file types.
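Here is a minimal sketch of that "break it apart, then treat each piece differently" idea. It relies on the fact that Office 2007 documents are ZIP containers, but the sample container, the member names, and the per-type strategies below are all made up for illustration; a real file-aware optimizer would need far more format knowledge than this.

```python
import io
import lzma
import random
import zipfile


def build_sample_document() -> bytes:
    """Create a small ZIP container standing in for a compound document."""
    # Random bytes stand in for an embedded photo; this is not a real JPEG.
    photo_like = bytes(random.Random(0).getrandbits(8) for _ in range(4096))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as container:
        container.writestr("slides/slide1.xml", "<p>Quarterly results</p>" * 500)
        container.writestr("media/photo1.jpg", photo_like)
    return buffer.getvalue()


def extract_parts(document: bytes) -> dict[str, bytes]:
    """Open the container and return each member, decompressed."""
    with zipfile.ZipFile(io.BytesIO(document)) as container:
        return {name: container.read(name) for name in container.namelist()}


def optimize_part(name: str, data: bytes) -> tuple[str, bytes]:
    """Pick a per-type strategy; the strategies here are placeholders."""
    if name.endswith(".xml"):
        return "text", lzma.compress(data)
    # An image part would go to an image-aware coder; this sketch leaves it alone.
    return "untouched", data


for part_name, part_data in extract_parts(build_sample_document()).items():
    strategy, result = optimize_part(part_name, part_data)
    print(f"{part_name}: {len(part_data)} bytes -> {len(result)} bytes ({strategy})")
```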
Can you compress an already compressed file? Part I
We can all recognize the amount of data we generate. And just like we keep telling ourselves we’ll clean out the garage “one of these days,” most of us rarely bother to clean out our email or photo sharing accounts.
As a result, enterprise and internet data centers have to buy hundreds of thousands of petabytes of disk every year to handle all the data in those files. It all has to be stored somewhere.
One way to reduce the amount of storage growth is to compress files. Compression techniques have been around forever, and are built into many operating systems (like Windows) and storage platforms (such as file servers).
Here’s the problem: most modern file formats, the formats driving all this storage growth, are already compressed.
· The most common format for photos is JPEG – that’s a compressed image format.
· The most common format for most documents at work is Microsoft Office, and in Office 2007, all Office documents are compressed as they are saved.
· Music (mp3) and video (MPEG-2 and MPEG-4) are highly compressed.
The mathematics of compression dictate that once you compress a file and reduce its size, you can’t expect to compress it again and get even more size reduction. Compression works by looking for patterns in the data and replacing the patterns it finds with more efficient codes. So if you’ve compressed something once, the compressed file should have few, if any, patterns left in it.
Of course, some compression algorithms are better than others, and you might see some small benefits by trying to compress something that has already been compressed with a lesser tool, but for the most part, you’re not going to see a big win by doing that. In fact, in a lot of cases, trying to compress an already compressed file will make it bigger!
Conventional wisdom dictates that once files are compressed via commonly used technologies, the ability to further limit their size and consumption of expensive resources is nearly impossible. So, what can be done about this?
Less is More–Part 2
As we all know, the internet is where there is huge storage growth, multi-petabyte scale, and a need to stay very close to the commodity price point on storage costs. There are two common threads across all of the “less is more” file systems that have been popping up to handle all this growth.
First, they are all designed in a way that you can build very scalable, very large pools of storage using generic white box servers stuffed with cheap disks. Second, they mostly support only the most primitive operations: create a new file, read that file, delete a file. While I’m generalizing, and this is not exactly true for all of these new file systems, many just skip things that are considered standard in traditional file systems: locking, POSIX semantics, authentication, ACLs, concurrency control, metadata, or the ability to list and search for files.
The overhead of all those traditional file system operations is too much for massive internet-scale operations where the primary purpose of a file system is for a user to upload something, for millions of people to look at it over and over, and maybe someday, sometime, someone will delete something.
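To give a feel for how little these systems expose, here is a toy Python sketch of a store that supports only create, read, and delete, spread across a handful of simulated nodes. The class name, the hash-based placement, and the in-memory dictionaries are all assumptions for illustration; none of this is taken from any of the real file systems mentioned in this series.

```python
import hashlib


class SimpleBlobStore:
    """A toy 'less is more' store: create, read, delete, and nothing else."""

    def __init__(self, node_count: int = 4) -> None:
        # Each dict stands in for one cheap white box server full of disks.
        self._nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key: str) -> dict:
        # Hash the key to pick a node, so callers see one big pool.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self._nodes[int.from_bytes(digest[:8], "big") % len(self._nodes)]

    def create(self, key: str, data: bytes) -> None:
        node = self._node_for(key)
        if key in node:
            raise FileExistsError(key)
        node[key] = data

    def read(self, key: str) -> bytes:
        return self._node_for(key)[key]  # raises KeyError if missing

    def delete(self, key: str) -> None:
        del self._node_for(key)[key]  # raises KeyError if missing


store = SimpleBlobStore()
store.create("photo-123", b"...image bytes...")
print(store.read("photo-123"))
store.delete("photo-123")
```

Notice what is missing: no locking, no permissions, no directory listing. Anything that needs those has to live somewhere else.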
These file systems are in contrast to advanced file system developments from places like NetApp’s latest OnTap and WAFL releases, HP’s PolyServe cluster file system, or the transaction-enabled NTFS from Microsoft that you can find in Server 2008.
The line in the sand is, there are file systems that are designed to be used by people, and file systems that are designed to be used by specific applications only.
The commercial file systems grew up serving the needs of business users and business applications. They are designed to host a wide variety of applications, including production databases, to let users peruse and manage their files, and to let storage administrators keep up with growth, availability, and corporate compliance requirements.
As a consequence, more and more value-add features are being put into the file system to support these use-cases. The “less is more” crowd, on the other hand, wants a very cost-effective but massively scalable pool of storage to make available to their web applications. A global namespace (so it looks like one giant pool of storage), and low, low cost per terabyte are the drivers of these file systems.
Users don’t list their directories in these file systems. In fact, users never see these file systems. Users see web applications, and the web applications use databases to keep track of what files are where in the massive storage pool, and who is allowed to see them. In that sense, in the “less is more” file system world, a lot of the value-add and management functionality of the file system is moving up into the application layer, especially in the largest content-rich web sites.
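As a rough sketch of what "moving up into the application layer" can look like, the Python below keeps a small catalog in SQLite that records which blob belongs to whom and who may see it. The table layout, column names, and sample rows are hypothetical; real sites will have their own schemas and their own databases.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog; the storage pool itself knows none of this."""
    catalog = sqlite3.connect(":memory:")
    catalog.execute(
        "CREATE TABLE files ("
        " blob_key TEXT PRIMARY KEY,"   # where the bytes live in the storage pool
        " owner TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " is_public INTEGER NOT NULL)"
    )
    return catalog


def register_file(catalog: sqlite3.Connection, blob_key: str, owner: str,
                  title: str, is_public: bool) -> None:
    catalog.execute(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        (blob_key, owner, title, int(is_public)),
    )
    catalog.commit()


def visible_files(catalog: sqlite3.Connection, viewer: str) -> list[tuple[str, str]]:
    """The application, not the file system, decides who sees what."""
    rows = catalog.execute(
        "SELECT blob_key, title FROM files"
        " WHERE is_public = 1 OR owner = ? ORDER BY blob_key",
        (viewer,),
    )
    return rows.fetchall()


catalog = open_catalog()
register_file(catalog, "blob-0001", "alice", "Beach photo", True)
register_file(catalog, "blob-0002", "alice", "Tax return", False)
print(visible_files(catalog, "bob"))
print(visible_files(catalog, "alice"))
```

Listing, searching, and access control all happen in the database query; the storage pool only ever gets asked for a blob by its key.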
From my point of view, the feature-rich commercial file systems will continue to evolve to meet the needs of corporate customers, including scaling to meet their growth needs. The “less is more” file systems will continue to push out traditional file systems in the highest growth web properties and other customers whose data growth is at that many-petabyte scale. Finally, the two things are not entirely incompatible – most of the new web tier file systems actually have a bunch of single node file systems buried in them on each storage node somewhere at the bottom building block level of their architecture.
But it’s time that these two file system approaches evolve and develop some kind of relationship–because for now, neither is perfectly suited to the problem at hand. There’s no reason why those building blocks couldn’t have richer functionality, such as transparent clustering and failover, that comes from commercial file systems, and still give you the massive scale and cheap $/petabyte of a global namespace and commodity building blocks.
The internet has often been the cauldron in which new technologies are forged that then eventually move into the corporate data center. We saw this in the server world, where low cost Linux servers displaced Sun and other Unix systems early on, and eventually that movement to cheaper, standard servers pushed Big Unix out of the corporate data center too.
The cost difference between a corporation’s EMC DMX storage array and a storage pool of white boxes with disk is even greater than the cost difference between Unix machines and standard Linux boxes. People are more hesitant to change storage platforms than server platforms (for good reason), but that huge cost difference and the rate at which storage is growing are going to cause the shift to happen sooner or later.
My prediction (and hope) is that someone will figure out a way to marry the “less is more” simple file system layers with richer underlying commercial file systems. This is what’s needed.
Less is More
Less is more … or is it? Part One
I recently returned from Storage Networking World in Orlando. As everyone knows, the conference is mainly a place for storage vendors to meet each other, tout their wares, and nose around in their competitors’ booths pretending to be potential customers. There are some good sessions, however, and one of the best was IDC analyst Noemi Greyzdorf’s presentation on the future of file systems.
Her smart and interesting talk was on the evolution of clustered, distributed, and grid file systems. As I listened, it occurred to me that I’m seeing a big split in the file system world, especially at the high end, where really large amounts of data are stored.
One of Noemi’s key points is that more and more functionality is being packed into file systems. As she puts it, file systems are the natural place for value-add knowledge about storage to be kept. That’s certainly true, and there are a number of advanced file systems that are becoming richer and richer in terms of integrated features.
At the same time, there is definitely a “less is more” crowd emerging, where many of the most basic features of file systems are being left out in some of the newest large-scale file systems around. This group includes file systems like GoogleFS, Hadoop, Mogile, Amazon’s S3 simple storage service, and the in-house developments at a couple of other very large online web 2.0 shops.
Are these two trends in file systems headed on a collision course? I don’t think so. But what I do see is that neither of these solutions is nailing the growing problem posed by the exploding amount of internet data that needs to be managed and stored. In other words, there are issues with both of these approaches. In my next entry, I will discuss what those issues are, and how we might solve them.


