Are you Content Aware?
Nice to be quoted/mentioned
It’s not always clear whether I’m making an impact, but this past week was one of those times when I realized that others are taking note of the excitement around storage optimization and the importance I attach to it. In a June 27 editorial article in Processor, “Doing More With Less,” I was quoted in a section on “Saving Space” as follows:
That same day, my company Ocarina Networks earned a mention in a post on Jon Toigo’s excellent Drunken Data blog. In a post recalling a conversation with Chris Santilli of Copan Systems, he writes:
“Chris noted that de-duplication technology was past the hype stage (not sure about that one) but that the technology was still undergoing substantial development — rather like compression in its early days: a lot of variations, no standards. He further noted that some interesting work was being done by companies such as Ocarina on improved file type awareness that might help mitigate some nagging technical issues involving de-dupe of data on disks that had been defragged. (Lot’s of “D’s” in that sentence.)”
Thanks guys. Good to know I can get the word out to the very folks who really know and understand what’s going on.
What to do about the coming video explosion
Pete Steege’s Storage Effect is commenting today on an ABI report that highlights the explosion of video content on the web, whose audience is expected to increase to one billion viewers by 2013. Steege’s response is that the report ignores the “digital home,” which will no doubt become ubiquitous in the coming years.
I agree, and would add that there are still other things driving video storage growth as well, such as a drastic increase in the number of video surveillance cameras and their resolution. But mainly, what I see is that the storage problem itself could actually be solved to a great extent with the proper optimization. For video, since video files are already compressed for transmission, the proper storage optimization has to include both video-specific recompression and video-specific deduplication.
For video on the internet, you have two related but different problems. The first is to store the vast amount of content that is being generated. The second is to provide the bandwidth needed for high-definition viewing of hot content.
Most video content is not hot. People upload thousands of hours of video per day to popular sites like YouTube, but only a small fraction of that gets wide viewership. It all needs to be stored, but the key thing for most of it is to store it cheaply. That’s going to mean not just cheap disks, but video-specific storage optimization that greatly reduces the size of the video files.
The relatively few videos (meaning, a couple hundred a day) that do become popular won’t be so aggressively compressed, or they’ll be compressed for bandwidth rather than for storage optimization. That is, solving the speed problem for the hot stuff that everyone is watching is easy – it will be replicated and cached, and people will get access to their hot shows and user-contributed videos. The “store 900 Petabytes of user-generated video really cheaply” problem is not so easy to solve.
Another major opportunity for optimizing video storage is that most of the videos people want access to are duplicated across many homes. Today, a blockbuster movie, a hit TV show, a TiVo of the big game – these are all stored hundreds of thousands of times across millions of households.
As video storage moves to cloud storage services, a lot of that can be deduplicated. For entire licensed content (e.g., a studio movie) that’s relatively easy – you’d say, here are 10,000,000 users uploading their copy of the Lion King…let’s just save one. But to get real optimization, cloud storage providers are going to want to be able to find and compress video at finer granularity than that. Let’s say there’s a football game broadcast on ABC in some markets, and carried by ESPN (with different commercials) in others. User A records it in standard def. User B records it in high def. The user in Atlanta records it from ABC. The user in Portland records it from ESPN. To be efficient, you’ll want storage optimization that recognizes that those users are all uploading versions of the same thing, and takes out the redundant information as part of the compression / deduplication process.
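To make the easy case concrete, here is a minimal Python sketch of whole-file deduplication by content hash. It is an illustration only, not how Ocarina or any cloud provider actually does it; the DedupStore class and its method names are made up for this example. It only catches byte-for-byte identical uploads (the "everyone uploads the same Lion King file" case). The standard-def vs. high-def, ABC vs. ESPN case needs content-aware matching that this sketch does not attempt.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): identical uploads share one physical copy."""

    def __init__(self):
        self.blobs = {}    # SHA-256 digest -> file bytes (the single physical copy)
        self.catalog = {}  # (user, filename) -> digest (each user's logical view)

    def upload(self, user, filename, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.catalog[(user, filename)] = digest
        return digest

    def logical_bytes(self):
        # What users believe they have stored.
        return sum(len(self.blobs[d]) for d in self.catalog.values())

    def physical_bytes(self):
        # What the provider actually has to keep on disk.
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
movie = b"\x00" * 1000  # stand-in for a studio movie file
for i in range(5):
    store.upload(f"user{i}", "lion_king.mpg", movie)

print(store.logical_bytes(), store.physical_bytes())
```

Five users upload the same 1,000 bytes, so the logical total is 5,000 bytes while only 1,000 are physically stored. Change a single byte in one copy and that copy is stored in full again, which is exactly why finer-grained, video-aware dedupe matters.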
Without aggressive storage optimization – including video-specific compression and dedupe – the explosive growth of video content is going to overwhelm storage capability.
Saturated: The Cloud’s Storage Dilemma
Yesterday’s Mashable post looking at online file storage providers caught my eye. Right now, online “cloud” storage providers are all targeting different markets, but the competition is fierce in all segments. Some are going after the consumer (AOL X-Drive, for example), some are going after online backup, and some are going after web site data. That article doesn’t even mention Amazon’s S3, which is a huge online repository.
The obvious benefits are basically twofold: ease-of-use and – for most of them – the fact that they manage your data for you in terms of backing it up, replicating it, etc. The biggest drawback here is that you have to be connected to a network to get to your files.

Most customers will still look at cost/Gigabyte as the main motivator to use a service like this and, at the right price and benefit point, people will put their files online. Since these storage service providers all buy their disks from the same small number of companies that actually make disk drives, the costs of the physical infrastructure needed to build an online storage service and compete are roughly the same for everyone.
I think that the real answer here is that, for anyone to break through and get some separation from the crowd, they are going to have to incorporate breakthrough storage optimization in their offering – and do so in a way that’s transparent to the end user. That could be dedupe, that could be compression, or it could be something more sophisticated like Ocarina. The main thing is that if you can get a 5:1 or 10:1 ratio of the logical space you provide via the cloud to the physical space you, as a provider, have to buy, then you can have a compelling proposition. The competition is fierce in this market, and in order to grow and thrive in any business that offers online storage, providers are going to have to develop a strategy to significantly increase their online storage capacity without increasing cost and overhead in step.
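The arithmetic behind those ratios is simple enough to sketch in a few lines of Python. The $1.00 per physical Gigabyte figure below is a made-up placeholder, not a real price, and the function names are mine; the point is only how the optimization ratio divides the provider's cost for each Gigabyte the customer sees.

```python
def physical_gb_needed(logical_gb, ratio):
    """Physical capacity the provider must buy to offer logical_gb at a ratio:1 reduction."""
    return logical_gb / ratio


def cost_per_logical_gb(cost_per_physical_gb, ratio):
    """Provider's disk cost for each Gigabyte the customer sees."""
    return cost_per_physical_gb / ratio


ASSUMED_DISK_COST = 1.00  # hypothetical dollars per physical GB, for illustration only

for ratio in (1, 5, 10):
    print(
        f"{ratio}:1 -> {physical_gb_needed(1000, ratio):.0f} GB of disk per 1000 GB sold, "
        f"${cost_per_logical_gb(ASSUMED_DISK_COST, ratio):.2f} per logical GB"
    )
```

If every provider pays about the same for raw disk, the one running at 10:1 has a tenth of the per-Gigabyte disk cost of the one running at 1:1, and that gap is the separation from the crowd.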
Less is More
Less is more … or is it? Part One
I recently returned from Storage Networking World in Orlando. As everyone knows, the conference is mainly a place for storage vendors to meet each other, tout their wares, and nose around in their competitors’ booths pretending to be potential customers. There are some good sessions, however, and one of the best was IDC analyst Noemi Greyzdorf’s presentation on the future of file systems.
Her smart and interesting talk was on the evolution of clustered, distributed, and grid file systems. As I listened, it occurred to me that I’m seeing a big split in the file system world, especially at the high end, where really large amounts of data are stored.
One of Noemi’s key points is that more and more functionality is being packed into file systems. As she puts it, file systems are the natural place for value-add knowledge about storage to be kept. That’s certainly true, and there are a number of advanced file systems that are becoming richer and richer in terms of integrated features.
At the same time, there is definitely a “less is more” crowd emerging, where many of the most basic features of file systems are being left out in some of the newest large-scale file systems around. This group includes file systems like GoogleFS, Hadoop, Mogile, Amazon’s S3 simple storage service, and the in-house developments at a couple of other very large online web 2.0 shops.
Are these two trends in file systems on a collision course? I don’t think so. But what I do see is that neither of these approaches is nailing the growing problem posed by the exploding amount of internet data that needs to be managed and stored. In other words, there are issues with both of them. In my next entry, I will discuss what those issues are, and how we might solve them.
