Data archiving organizes your information at a granular level, making it searchable when needed. While different from a backup, effective archiving can ease the backup challenges companies and organizations face in the “Big Data” era.
It’s no secret that data is growing year after year at alarming rates. The majority of this growth comes from machine and sensor data, not users, although people are certainly contributing. However, the amount of active data (data accessed within the last 30 days) is growing at a much slower rate. In other words, a significant amount of data growth relates to older, infrequently accessed information.
Much of this unused data is unstructured (file) data. This is the hardest type of data for a backup system to ingest and the most expensive to store, yet IT professionals methodically back this data up week after week. Removing this data from the data protection process by archiving it improves backup operations substantially, while reducing costs.
The Unstructured Data Challenge
As previously stated, much of this unstructured data is not currently active, but it may become active again later. Knowing which unstructured data will become active again is the equivalent of finding a needle in a haystack. Manually restoring this data based on user need is also time-consuming. Often, users cannot provide essential details, such as the file names or creation dates of the data they need recovered. Even if the files can be identified, finding them within a vast sea of data is challenging.
The easiest solution often appears to be keeping all data on production servers and not managing it at all. To some extent, storage vendors have led IT professionals to believe this is the best possible outcome and that their infrastructure and technology can handle it. Thanks to scale-out NAS, adding almost limitless capacity is now possible. But is it smart? A large scale-out NAS architecture can be expensive to purchase, scale and operate (think energy and cooling costs). Not to mention, you still have to protect it. Backing all that data up presents several challenges. It takes time for the backup application to examine each of these files, which may number in the billions, to determine which should be backed up. Then there is the time it takes to transfer all this data to the backup system over the network. Finally, there is the cost of maintaining multiple copies of all this data. At a minimum, backup storage is 2X the size of production storage, and in most cases it is 5X or more.
Typically, 80% to 90% of backup data will not change. This data is generally just consuming storage space and slowing down the backup process. The results of removing this data from regular backups can be dramatic: in some cases, backup data sets shrink by as much as 80%. With a modern archive system in place, only databases and the most active unstructured data must be backed up.
The biggest responsibility of a data archiving system is to provide seamless access to the archived information. There are three ways to provide rapid response to recovery requests. In all three cases, the archive system needs to have an on-premises disk front end. Then, the archive software can replicate the directory structure of the primary storage system, provide an indexing and search function or offer automated recovery of data.
The first method to ease access, replicating the directory structure of the archived data set, allows the user to navigate the same directory path as before to find their files, with the insertion of “archive” in the path. For example, instead of looking for a file in “\home\docs\”, they would be instructed to look in “archive\home\docs.” This method is the easiest to implement, the least likely to break due to a software update and has the least impact on the environment. It is often the least expensive of the available options. But it does require the most training, since users have to be comfortable navigating file systems.
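The path rewriting this method relies on can be sketched in a few lines. This is an illustrative example, not any vendor's implementation; the function name `archive_path` and the `archive` prefix are assumptions taken from the example in the text:

```python
from pathlib import PureWindowsPath

def archive_path(original: str, archive_root: str = "archive") -> str:
    """Map a primary-storage path to its mirrored location in the archive.

    The archive replicates the primary directory tree under an 'archive'
    prefix, so users navigate the same folder structure they already know.
    """
    p = PureWindowsPath(original)
    # Drop the leading root separator, then re-root under the archive prefix.
    parts = p.parts[1:] if p.root else p.parts
    return str(PureWindowsPath(archive_root, *parts))

print(archive_path(r"\home\docs"))  # archive\home\docs
```

Because the mapping is purely mechanical, no per-file metadata needs to be maintained, which is part of why this approach is so robust.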
The second method is to move all data to the archive and layer a “Google-like” search capability on top of it, which could include context-level searching. A sophisticated search like this allows users to search by file name, modification date or content within the file. Once they find the necessary file, the interface would present an option to restore the file to the location of their choice, assuming they have adequate security clearance.
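The search axes described here (file name, dates, content) can be illustrated with a toy catalog. Everything below is hypothetical, a minimal sketch of the idea rather than a real archive index; the `ArchivedFile` fields and sample entries are invented:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ArchivedFile:
    name: str
    modified: date
    content: str   # extracted text, for context-level search
    path: str      # location in the archive, for restore

# Toy catalog standing in for the archive's search index.
catalog = [
    ArchivedFile("q3_report.docx", date(2014, 10, 1),
                 "quarterly revenue summary", "archive/home/docs/q3_report.docx"),
    ArchivedFile("logo.png", date(2013, 5, 12), "", "archive/marketing/logo.png"),
]

def search(term: str) -> list[str]:
    """Match on file name or file content, two of the axes the text names."""
    term = term.lower()
    return [f.path for f in catalog
            if term in f.name.lower() or term in f.content.lower()]

print(search("revenue"))  # ['archive/home/docs/q3_report.docx']
```

A production system would of course use a real full-text index rather than a linear scan, but the user-facing contract is the same: a query in, a restorable path out.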
The third method is automated recovery. Some products have the ability to leave stub files or symbolic links in place, which make it appear that the file is still in its original location; when needed, it is transparently retrieved from its archived location. While this option is the easiest on users, it is also the most fragile. If the stub files or links are deleted, accessing the original data can be difficult.
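The symbolic-link variant of this technique can be demonstrated directly with standard OS primitives. This is a simplified sketch (real products typically use proprietary stub files with recall agents, not plain symlinks); the function name `migrate_with_stub` is invented:

```python
import os
import tempfile

def migrate_with_stub(primary_path: str, archive_path: str) -> None:
    """Move a file to the archive and leave a symbolic link behind.

    The link makes the file appear to still live in its original
    location; opening it transparently reads the archived copy.
    """
    os.makedirs(os.path.dirname(archive_path), exist_ok=True)
    os.replace(primary_path, archive_path)   # move the real data
    os.symlink(archive_path, primary_path)   # leave the stub behind

# Demo in a throwaway directory.
root = tempfile.mkdtemp()
src = os.path.join(root, "report.txt")
dst = os.path.join(root, "archive", "report.txt")
with open(src, "w") as f:
    f.write("hello")
migrate_with_stub(src, dst)
print(open(src).read())  # still readable via the stub: hello
```

The fragility the text mentions is visible here too: delete the symlink at `src` and the only remaining pointer to the archived copy is whatever catalog the archive software keeps.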
Each of these methods significantly reduces the amount of data that the backup software needs to transfer from primary storage to secondary storage. Only the first two lessen the amount of file system walk time required. The reduction occurs because the data and the files associated with it are no longer on production storage. With the third method, since stub files replace migrated files, there are still the same number of physical files or objects to process.
Choosing The Right Storage Solution
Archive storage can be disk, cloud, tape or a combination of all three. Many data archiving systems use scale-out architecture. They also have data protection schemes like erasure coding or replication that offer faster drive-rebuild times than traditional RAID. This allows for the use of very high-capacity drives – 8 TB or even larger. These systems also often feature an object storage file system that is not impacted by the number of files it stores. Despite these capabilities, a disk-only archive strategy can get expensive. As with scale-out NAS, the cost to power and cool these systems grows as they scale.
Cloud storage is another option, and many data archiving products will connect directly to a cloud provider. The advantage of this method is that it totally removes the hardware, power, cooling and footprint expense from the data center. The latency of the cloud is also not typically an issue with this type of data. While data may need to be recovered, it often does not need to be recovered instantly. Additionally, most archive products that leverage the cloud will keep a local copy on disk for a period prior to making data “cloud only.”
The challenge with the cloud is ongoing costs. An organization is essentially paying for the same capacity repeatedly. In addition, each periodic payment grows because the data set continues to grow. When those payments are added up, they can far exceed the cost of simply buying the storage and keeping it on premises. For many environments (those under 50 TB), the periodic cost is not much of an issue. But for medium to large data centers, the total cost can be significant.
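The "paying for the same capacity repeatedly" effect is easy to see with a little arithmetic. The prices below are purely illustrative assumptions, not any provider's actual rates:

```python
def cloud_total(start_tb: float, monthly_growth_tb: float,
                price_per_tb_month: float, months: int) -> float:
    """Cumulative cloud spend: each month you pay for everything stored so far."""
    total, tb = 0.0, start_tb
    for _ in range(months):
        total += tb * price_per_tb_month
        tb += monthly_growth_tb
    return total

# Assumed pricing: $25/TB-month in the cloud vs. $300/TB purchased once.
months = 36
cloud = cloud_total(start_tb=100, monthly_growth_tb=2,
                    price_per_tb_month=25.0, months=months)
on_prem = (100 + 2 * months) * 300.0  # buy enough capacity for the final footprint
print(round(cloud), round(on_prem))   # 121500 51600
```

Under these assumed numbers, three years of cloud storage costs more than double the one-time purchase, which is the crossover the text describes for medium to large data centers. (Note the sketch omits on-premises power, cooling and maintenance, which narrow the gap.)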
Another option is tape, the least expensive storage medium, especially once power and cooling are factored in. Archiving software that is “tape aware” has made significant progress in abstracting tape management. Today, tape can appear to be merely an extension of the archive disk system. Users may not even know or care that they are accessing files from tape storage. In many cases, the tapes created are now interchangeable between systems thanks to LTFS. A combination of these technologies is typically the best design. Using tape with either disk or cloud archive limits the cost of those platforms, while still allowing relatively rapid access to data.
Choosing The Right Data Archiving Software
Software is the most important part of archive design. The good news is that there are many data archiving products that are well integrated with backup software products. Some backup software products have even added archiving functionality. If you are just getting started with data archiving, this may be a good entry point.
The data archiving software should abstract the user and administrator from the management of the backend devices, essentially making a tape library or the cloud an extension of disk. Software should present storage as an addressable file system (CIFS or NFS) or an object store. It should also provide policy management capabilities that manage the movement of data between the front-end disk cache and the tape or cloud repository. It may also manage the protection of the data by making sure that multiple copies of data are created on two tapes, to two different cloud providers or a combination of both.
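The policy management described above (moving data between the front-end disk cache and the tape or cloud repository) usually comes down to rules evaluated against file metadata. Below is a minimal sketch under assumed policy terms; the 90-day retention window and the tier names are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical policy: files untouched on the disk cache for 90 days
# are demoted to the tape/cloud repository.
CACHE_RETENTION = timedelta(days=90)

def tier_for(last_access: datetime, now: datetime) -> str:
    """Return which tier a file belongs on under the retention policy."""
    return "disk-cache" if now - last_access <= CACHE_RETENTION else "tape-or-cloud"

now = datetime(2015, 1, 1)
print(tier_for(datetime(2014, 12, 1), now))  # disk-cache
print(tier_for(datetime(2014, 6, 1), now))   # tape-or-cloud
```

Real products layer more dimensions on top (file size, owner, path patterns, copy counts), but the shape is the same: metadata in, tier decision out, with the mover process acting on the result.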
Beyond those core capabilities, some products may provide the ability to store user-defined segments of a file or object on disk, but the majority of it on near-line (tape or cloud). For formats that can support it, this allows for rapid response and maximum storage efficiencies. An ideal example is video. A copy of the first 10 minutes of a video might always be cached on disk, but the entire file is stored on a secondary repository. This allows the initial playback of the video to occur while the rest of the video is loaded to the primary system. The user experiences instant access but the data center achieves maximum data efficiency.
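The video example above amounts to a byte-budget split per object: a hot head on disk, the remainder near-line. A rough sketch, with the bitrate and cache window as assumed illustrative numbers:

```python
def split_for_cache(total_bytes: int, head_bytes: int) -> dict[str, int]:
    """Decide how much of an object stays on disk vs. near-line storage.

    Mirrors the video example: keep the opening segment hot so playback
    starts instantly while the remainder streams from tape or cloud.
    """
    head = min(head_bytes, total_bytes)
    return {"disk": head, "nearline": total_bytes - head}

# A 90-minute video at an assumed 5 Mbit/s; cache roughly the first 10 minutes.
ten_minutes = 10 * 60 * 5_000_000 // 8   # bytes for the cached head
full_video = 90 * 60 * 5_000_000 // 8
print(split_for_cache(full_video, ten_minutes))
```

In this example, under a ninth of the video occupies premium disk while the bulk sits on cheap near-line storage, yet playback still starts immediately, which is the trade the text describes.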
The net impact of implementing an archive is a massive reduction in backup storage and complexity. The backup process is not only storing less data, it also does not need to process that data. Also, archive storage options are typically far less expensive than backup storage options. The prerequisite work in creating a backup job is reduced and the back-end work, like metadata management, is greatly simplified.
Teoma Systems is a preferred distributor for Quantum, NetApp and Dell Compellent on-premises enterprise backup solutions, along with several other hosted solutions, to fit your business or organization’s backup needs. Contact us today for more information about how our comprehensive data analysis can help you find the perfect solution to fit your needs.