Storage Optimization

Can You Compress Already Compressed Files? Part II

Posted in Featured,File Systems,Storage by storageoptimization on May 6, 2008

In my last post I discussed the fact that most files that are used are already compressed. And up to now, there were no algorithms to further compress them. Yet, it’s obvious that there needs to be a new solution.

On the cutting edge, there are some new innovations in file-aware optimization that allow companies to reduce their storage footprint and get more from the storage they already have. The key to this is understanding specific file types, their formats, and how the applications that created those files use and save data. Most existing compression tools are generic. To get better results than you can get with a generic compressor, you need to go to file-type-aware compressors.

There’s another problem. Let’s say you just created a way better tool for compressing photographs than JPEG. That doesn’t mean your tool can compress already-compressed JPEGs, it means that if you were given the same original photo in the first place, you could do a better job. So the first step in moving towards compressing already-compressed files is what we call Extraction – you have to extract the original full information from the file. In most cases, that’s going to involve de-compressing the file first, getting back to the uncompressed original, and then applying your better tools.

Extraction may seem simple enough – just reverse whatever was done to a file in the first place. But it’s not always quite that easy. Many files are compound documents, with multiple sections or objects of different data types. A PowerPoint presentation, for example, may have text sections, graphics sections, some photos pasted in, etc. The same is true for PDFs, email folders with attachments, and a lot of the other file types that are driving storage growth. So to really extract all the original information from these files, you may need to not only be able to decompress files, but to look inside them, understand how they are structured, break them apart in to their separate pieces, and then do different things to each different piece.

The two things to take away from this discussion are: 1) you won’t get much benefit from applying generic compression to already-compressed file types, which are the file types that are driving most of your storage growth and 2) it is possible to compress already-compressed files, but to do so, you have to first extract all the original information from them, which may involve decoding and unraveling complex compound documents and then decompressing all the different parts. Once you’ve gotten to that point, you’re just at the starting point for where online data reduction can really get started for today’s file types.

Can you compress an already compressed file? Part I

Posted in Featured,File Systems,Storage by storageoptimization on May 1, 2008

We can all recognize the amount of data we generate. And just like we keep telling ourselves we’ll clean out the garage “one of these days” most of us rarely bother to clean out our email or photo sharing accounts.

As a result, enterprise and internet data centers have to buy hundreds of thousands of petabytes of disk every year to handle all the data in those files. It all has to be stored somewhere.

One way to reduce the amount of storage growth is to compress files. Compression techniques have been around forever, and are built in to many operating systems (like Windows) and storage platforms (such as file servers).

Here’s the problem: most modern file formats, the formats driving all this storage growth, are already compressed.
· The most common format for photos is JPEG – that’s a compressed image format.
· The most common format for most documents at work is Microsoft Office, and in Office 2007, all Office documents are compressed as they are saved.
· Music (mp3) and video (MPEG-2 and MPEG-4) are highly compressed.

The mathematics of compression are that once you compress a file, and reduce its size, you can’t expect to be able to compress it again and get even more size reduction. The way compression works is that it looks for patterns in the data, and if it finds patterns it replaces them with more efficient codes. So if you’ve compressed something once, the compressed file shouldn’t have any patterns in it.

Of course, some compression algorithms are better than others, and you might see some small benefits by trying to compress something that has already been compressed with a lesser tool, but for the most part, you’re not going to see a big win by doing that. In fact, in a lot of cases, trying to compress an already compressed file will make it bigger!
Conventional wisdom dictates that once files are compressed via commonly used technologies, the ability to further limit their size and consumption of expensive resources is nearly impossible. So, what can be done about this?

Greening storage

Posted in File Systems,Storage by storageoptimization on May 1, 2008

The New York Times Bits blog has a post on the need to green Internet and other data centers, “Data Centers are Becoming Big Polluters.” Citing a study by McKinsey & Company, Bits’ Steve Lohr states that data centers are “projected to surpass the airline industry as a greenhouse gas polluter by 2020.”

He goes on to sum up the report, which “also lists 10 ‘game-changing improvements’ intended to double data center efficiency, ranging from using virtualization software to integrated control of cooling units.”

Many of us are aware that server virtualization is the path to increasing server utilization. But servers are only half of the data center picture. The other half is storage. The solution for that? Storage optimization.

Just as server virtualization lets you turn 10 physical servers in to 10 virtual servers and then consolidate them on to one physical machine, storage optimization lets you store 10 times more files on a given disk than you can today. The heat, cooling, rackspace, and power benefits are obvious.

Update: Ben Worthen at the Wall Street Journal is also discussing this on the Business Technology Blog. His post, “Can the Tech Guy Afford to Care about Pollution?” also talks about how the problem will only get worse in the future. Worthen’s take: “Given that most of the tech departments we talk to are looking to cut costs, they’re not likely to invest in new technology that will cut emissions, unless it cuts short-term costs at the same time.”

Less is More–Part 2

Posted in Featured,File Systems,Storage by storageoptimization on April 25, 2008

As we all know, the internet is where there is huge storage growth, multi-petabyte scale, and a need to stay very close to the commodity price point on storage costs. There are two common threads across all of the “less is more” file systems that have been popping up to handle all this growth. 

First, they are all designed in a way that you can build very scalable, very large pools of storage using generic white box servers stuffed with cheap disks. Second, they mostly support only the most primitive operations — create a new file, read that file, delete a file. While I’m generalizing, and this is not exactly true for all of these new file systems, many just skip things that are considered standard in traditional file systems: locking, Posix semantics, authentication, ACLs, concurrency control, metadata or the ability to list and search for files.

The overhead of all those traditional file system operations is too much for massive internet-scale operations where the primary purpose of a file system is for a user to upload something, for millions of people to look at it over and over, and maybe someday, sometime, someone will delete something.

These file systems are in contrast to advanced file system developments from places like NetApp’s latest OnTap and WAFL releases, HP’s PolyServe cluster file system, or the transaction-enabled NTFS from Microsoft that you can find in Server 2008. 

The line in the sand is, there are file systems that are designed to be used by people, and file systems that are designed to be used by specific applications only.    

The commercial file systems grew up serving the needs of business users and business applications. They are designed to host a wide variety of applications, including production databases, to let users peruse and manage their files, and to let storage administrators keep up with both growth, availability, and corporate compliance requirements.     

As a consequence, more and more value-add features are being put in to the file system to support these use-cases. The “less is more” crowd, on the other hand, wants a very cost-effective but massively scalable pool of storage to make available to their web applications.  A global namespace (so it looks like one giant pool of storage), and low, low cost per terabyte are the drivers of these file systems.

Users don’t list their directories in these file systems. In fact, users never see these file systems. Users see web applications, and the web applications use databases to keep track of what files are where in the massive storage pool, and who is allowed to see them. In that sense, in the “less is more” file system world, a lot of the value-add and management functionality of the file system is moving up in to the application layer, especially in the largest content-rich web sites.   

>From my point of view, the feature-rich commercial file systems will continue to evolve to meet the needs of corporate customers, including scaling to meet their growth needs. The “less is more” file systems will continue to push out traditional file systems in the highest growth web properties and other customers whose data growth is at that many-petabyte scale. Finally, the two things are not entirely incompatible – most of the new web tier file systems actually have a bunch of single node file systems buried in them on each storage node somewhere at the bottom building block level of their architecture.

But it’s time that these two file system approaches evolve and develop some kind of relationship–because for now, neither is perfectly suited the problem at hand. There’s no reason why those building blocks couldn’t have richer functionality, such as transparent clustering and failover, that comes from commercial file systems, and still give you the massive scale and cheap $/petabyte of a global namespace and commodity building blocks.

The internet has often been the cauldron in which new technologies are forged that then eventually move in to the corporate data center. We saw this in the server world, where low cost Linux servers displaced Sun and other Unix systems early on, and eventually that movement to cheaper, standard servers pushed Big Unix out of the corporate data center too.   

The cost differences between a corporation’s EMC DMX storage array and a storage pool of white boxes with disk is even greater than the cost difference between Unix machines and standard Linux boxes. People are more hesitant to change storage platforms than server platforms (for good reason), but that huge cost difference and the rate at which storage is growing is going to cause the shift to happen sooner or later.

My prediction (and hope) is that someone will figure out a way to marry the “less is more” simple file system layers with richer underlying commercial file systems. This is what’s needed.

Less is More

Posted in Analyst,Featured,File Systems,Storage by storageoptimization on April 24, 2008
Tags: , ,

Less is more … or is it? Part One

I recently returned from Storage Networking World in Orlando. As everyone knows, the conference is mainly a place for storage vendors to meet each other, tout their wares, and nose around in their competitors’ booths pretending to be potential customers. There are some good sessions, however, and one of the best was IDC analyst Noemi Greyzdorf’s presentation on the future of file systems.

Her smart and interesting talk was on the evolution of clustered, distributed, and grid file systems. As I listened, it occurred to me that I’m seeing a big split in the file system world, especially at the high end, where really large amounts of data are stored.

One of Noemi’s key points is that more and more functionality is being packed into file systems. As she puts it, file systems are the natural place for value-add knowledge about storage to be kept. That’s certainly true, and there are a number of advanced file systems that are becoming richer and richer in terms of integrated features.

At the same time, there is definitely a “less is more” crowd emerging, where many of the most basic features of file systems are being left out in some of the newest large-scale file systems around. This group includes file systems like GoogleFS, Hadoop, Mogile, Amazon’s S3 simple storage service, and the in-house developments at a couple of other very large online web 2.0 shops.

Are these two trends in file systems headed on a collision course? I don’t think so. But what I do see is that neither of these solutions is nailing the growing problem posed by the exploding amount of internet data that needs to be managed and stored. In other words, there are issues with both of these approaches. In my next entry, I will discuss what that is, and how we might solve it.