Oracle RMAN Best Practices and Myths

Copyright © 2012, 2015 Caleb Small, Caleb.com all rights reserved

 

Including a sample script and sample logfile.

 

RMAN is beyond a doubt the best option for backing up an Oracle database.  It can be as simple as typing BACKUP DATABASE; and letting RMAN do the driving.  However, RMAN is a powerful and complicated tool, sometimes even fickle, and production backup requirements vary widely.  There are a few good papers floating around discussing best practices, in addition to the official Oracle documentation; see the References section below.

 

The purpose of this paper is to collect all the good (and bad) ideas and present some practical tips based on years of production experience.  I've also included a sample script and sample logfile that demonstrate some of the ideas presented here.  I welcome comments and feedback in an effort to improve the practical application of this product in the industry workplace.

 

 


Listed below is the collection of topics, followed by detailed discussions that will grow over time.

 

 

List of Topics

 

General Best Practices

Controlfile Autobackup

Do/Don't use crosscheck!??!?

Crosscheck backup but not archive logs!

Do not rely on one backup – don't use Oracle Suggested Backup

Do not rely on RMAN stored configuration

Save/Restore RMAN configuration at start/end of your script

Notify if there is a problem, or if the script runs too long

 

Database Backup

Block Checking/Checksum

Include 'check logical' when backing up DB

Block Change Tracking – for incremental backups only

Have each datafile in a single backup piece

Backup optimization – devil or saint?

 

Archivelog Backup

Duplex archive log groups and have more than one archive log dest

Don't delete archive logs based on time – Use an archivelog deletion policy!

Do not use 'DELETE ALL INPUT' when backing up archive logs

 

Controlfile/SPfile Backup

Backup controlfile last – to include a record of this backup

Use controlfile alone if catalog DB is not available

Maintain your RMAN catalog/controlfile (controlfile_record_keep_time, delete obsolete, resync)

 

Preparing for Recovery

Prepare for loss of controlfiles – have many copies, know your DBID

Re-cataloging backups

Test your backup

Test recovery

Alternate technologies – Flashback Database, Flashback Table & Query, DataGuard standby DB

 

Backup Log Files

Review log files, scan for errors and successful completion

Set NLS_DATE_FORMAT, NLS_LANGUAGE to make output readable

Echo actual commands, and command output to the log file

Show all – what settings are in effect for the backup

Include timing of the entire script, and individual steps

Don't overwrite log file, use a unique time/date stamp filename/ext

Rotate log files, but keep pertinent ones accessible

 

Backup Performance

Parallelize channels across RAC nodes

To compress or not to compress

Throttling backups

Special requirements for de-duplication

 

Backup Architecture

Using the FRA

Near-line vs far-line backups (or both)

Backup database to FRA, FRA to tape

The advent of de-duplication

Data mirroring

Special requirements for de-duplication

Special requirement for tape

Third party RMAN agents

 

Tape Backup Considerations

Backup verification pass

Use MML topics (from Yuri)

- don't use delete obsolete for tape (or see Caleb's method)

- set retention policy high, do not leave at default

- delete oldest backup first before making today's backup

- half-way backed-up system files (FRA to tape)

 

Detailed Discussions

 

Control File Auto Backup.  Why, Oh Why?

 

All the experts, including Oracle's own paper, agree: controlfile autobackup should be turned ON.  So why, oh why, after so many versions and releases of RMAN, is the default still OFF?!

 

The purpose of controlfile autobackup is to automatically back up the control file, which contains the RMAN repository, after any structural change to the database (e.g. a data file or log file added, moved, or deleted).  Having an accurate description of the database file structure is crucial to performing recovery.

 

So, when you do an alter database… and forget to diligently back up the control file right after, RMAN will do it for you.

 

Now, I can understand some situations where you may actually want to turn it off – at least temporarily.  For example, resizing all the online redo logs, or any other operation that requires multiple alter database… commands.  The autobackup can add a very noticeable delay to command completion, consume system resources, and clutter the FRA and repository, potentially even throwing off a redundancy-based retention policy (more on that later).

 

However, even this problem is solved in 11g, creating some confusion in the process.  Starting in 11g, the controlfile autobackup no longer occurs immediately, nor does it hold up command completion.  When a structural change is made, RMAN starts a timer in case you are about to make additional structural changes.  After no further structural changes have been made for a few minutes, the controlfile autobackup occurs. 

 

This is a significant behavior change in 11g and has caught more than one seasoned DBA (including myself) off guard!

 

Discuss autobackup location and the difference between the autobackup and snapshot controlfile destinations – the latter has to be changed from its default on RAC.
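For reference, a minimal sketch of the relevant configuration commands.  The format string and file names below are placeholders, not recommendations, and on RAC the snapshot controlfile needs to live on storage shared by all instances:

CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/backup/%F';   # %F expands to a unique name containing the DBID and timestamp
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+DATA/snapcf_orcl.f';                   # shared location for RAC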

 

 

To Crosscheck, or Not To Crosscheck?

 

That is the question.  Oracle's Top 10 Best Practices paper says to crosscheck your backups; Pythian's 10 Problems paper says don't use crosscheck.  Who's right, and why?

 

They both are, dig deeper.

 

There is more than one object type in RMAN.  Crosschecking your backups is a GOOD idea, and it should be done before a delete obsolete.  If a backup set or piece has gone missing, we would not want delete obsolete to remove what may be one of the last remaining good backups that could take its place.  Crosschecking only marks the missing backup set/piece as expired.  It does not delete or remove anything.  Backup sets/pieces marked as expired will not count toward the retention policy of the delete obsolete command.

 

Now, about other object types in RMAN.  Crosschecking archive logs is a BAD idea, unless it is done interactively under the close scrutiny of the DBA.  Archive logs marked as expired will be silently ignored during all subsequent archive log backups.  This compromises the recoverability of the database without so much as a warning message.  I would prefer my backup to fail and belch error messages all over the place, rather than silently pretend it worked.

 

Following any crosscheck command, it's a good idea to do a corresponding list expired and pipe that output to the log file.  If something has gone missing, we need to know about it.  When I scan my log file for errors, I also scan for the word 'expired'.

 

Lastly, delete expired will remove the entries flagged as expired from the RMAN repository.  It does not delete anything from disk, as the whole point of being expired is that the underlying file(s) no longer exist in RMAN's eyes.  Unless there is a good reason to the contrary, this should probably be done manually by the DBA.  If expired objects are not removed from the repository, they will keep showing up on subsequent list expired commands.
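Put together, a hedged sketch of how this splits between the scheduled script and the DBA (the command names are standard RMAN; the split itself simply follows the discussion above):

# In the scheduled backup script: crosscheck backups only, then prune
CROSSCHECK BACKUP;
LIST EXPIRED BACKUP;        # goes to the log file; scan it for the word 'expired'
DELETE NOPROMPT OBSOLETE;

# Interactively, under DBA scrutiny only
CROSSCHECK ARCHIVELOG ALL;
DELETE EXPIRED BACKUP;      # removes the repository entries, not files on disk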

 

 

Bad Block Detection

 

The best way to recover from bad blocks, which will eventually happen, is to avoid them in the first place.  This may not be entirely possible, but there are several proactive steps that will detect bad blocks sooner.  The sooner bad blocks are detected, the better the chance of recovery.

 

At the database level there are two init parameters that control the level of block validation done when blocks are read.  This affects all block I/O, not just the backup.  It does not cause any additional I/O, but it can cause additional CPU overhead; just how much remains an argument among the experts.  Best strategy: unless the additional CPU overhead prevents their use, enable both of these features.

 

The two parameters are DB_BLOCK_CHECKING and DB_BLOCK_CHECKSUM.  Their range of possible values changes between 10g and 11g, and 11g introduces a new parameter DB_ULTRA_SAFE which affects the default value of the above two.  This can get a bit confusing, so read the appropriate documentation for your version.  If at all in doubt, just set the two original parameters to TRUE.
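For illustration, both can be set dynamically from SQL*Plus.  Whether TRUE is the right value for your system depends on the CPU overhead discussion above:

-- Both parameters are dynamic; SCOPE=BOTH assumes an spfile is in use
ALTER SYSTEM SET db_block_checksum = TRUE SCOPE=BOTH;
ALTER SYSTEM SET db_block_checking = TRUE SCOPE=BOTH;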

 

RMAN also has a command option CHECK LOGICAL that can be used to perform additional validation as a backup is being written.  This option applies to BACKUP DATABASE (and look up any others).  It performs …..

 

There's also no harm in running the good old-fashioned dbverify utility against your data files.  If there is any question about possible block corruption, dbverify will scan a data file and perform a physical block check.

 

There is a new, easier way to invoke the equivalent of dbverify.
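Presumably that easier way is RMAN's own validation commands.  A hedged sketch of the options (the datafile number is illustrative); corrupt blocks found by any of these show up in V$DATABASE_BLOCK_CORRUPTION:

BACKUP CHECK LOGICAL DATABASE;                           # logical checks during the real backup
BACKUP VALIDATE CHECK LOGICAL DATABASE ARCHIVELOG ALL;   # read-only pass, nothing is written
VALIDATE DATAFILE 4;                                     # 11g: per-datafile check, much like dbverify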

 

Go into physical vs logical block corruption.

 

 

Block Change Tracking

 

A minor point, significant only if incremental backups are in use.  This feature can greatly speed up incremental backups by keeping a log of changed blocks since the last backup, thus avoiding a full scan of every datafile.  It does, however, add a bit of overhead to daily operation which is not necessary if you are not doing incrementals.
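Enabling and checking it is straightforward.  The file location below is just an example; with OMF it can be omitted entirely:

-- Enable, confirm, and (if incrementals are ever abandoned) disable again
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '/u01/oradata/ORCL/bct.chg';
SELECT status, filename FROM v$block_change_tracking;
ALTER DATABASE DISABLE BLOCK CHANGE TRACKING;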

 

 

Keep Multiple Backups (in separate locations)

 

The Oracle Suggested Backup (portrayed in DB Console) may be suitable for a starter database in a test environment, but it is a good example of what is not appropriate for any production environment.  It makes one baseline backup of your database the first day, and then forever after makes incrementally updated backups.  There are two problems with this:  First, there is only one actual backup in one place.  Second, incrementally updated backups simply take changed blocks from the database and substitute them for the original blocks in the very first backup, but the baseline backup is never repeated.  Eventually, one can see the potential for bad blocks to creep in and never be detected (until, of course, the day you try to recover from your one and only backup!).
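For reference, a sketch of the kind of incrementally updated backup that job runs (the tag name is illustrative, not the one DB Console generates).  Note that the level 0 image copy is created once and only ever rolled forward:

RUN {
  RECOVER COPY OF DATABASE WITH TAG 'oem_suggested';    # roll yesterday's changed blocks into the copy
  BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'oem_suggested' DATABASE;
}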

 

Even if not required for possible point-in-time restore, multiple self-contained backups should exist in case one of them turns out to be corrupt.  They should also exist, or be duplicated, in more than one physical location, especially if one location is the FRA in your server room.  What happens if the sprinklers go off on the floor above the server room?  And why do people insist on putting server rooms in the basement (water does tend to follow gravity)?  Snide comments aside, companies have ceased to exist when their physical server room was destroyed and there was no off-site backup.

 

Many options exist for maintaining off-site backups including DataGuard (my favorite for various reasons), storage level replication (eg. snap mirror), remote archive destinations, and plain old-fashioned off-site tape (or disk) storage. 

 

 

Do Not Rely on RMAN Stored Configuration

 

It's nice that RMAN can store a set of default parameters that will be applied to all subsequent operations, and in the early days of my career I used to run a setup script once and then rely on those parameters remaining stable.  That was a mistake.  Eventually someone else needed to make a change and the production backups started going to the wrong place; even worse mistakes are possible.

 

At minimum, echo the stored configuration in the backup log file so there is absolute certainty what settings were in effect when the backup (or restore) ran.  Better yet, explicitly set all required parameters at the start of the backup script.  Even better, use a run {} command and override the required parameters for the duration of the job.
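A hedged sketch of the last approach: everything allocated inside the run block applies to this job only and overrides whatever happens to be stored (paths, channel count and tag are examples):

RUN {
  ALLOCATE CHANNEL d1 DEVICE TYPE DISK FORMAT '/backup/%d_%T_%U';
  ALLOCATE CHANNEL d2 DEVICE TYPE DISK FORMAT '/backup/%d_%T_%U';
  BACKUP DATABASE TAG 'nightly_full';     # uses only the channels above, not the stored configuration
}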

 

 

Save/Restore RMAN configuration at start/end of your script

 

A good idea echoed by others.  The SHOW ALL; command just happens to produce the exact syntax to reset all stored configuration parameters to their current value.  This makes it easy to capture the original configuration prior to making any changes, and then restore it at the end of the backup script.
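For example, SHOW ALL; output looks roughly like the lines below (values are illustrative).  Each line is itself a valid CONFIGURE command, so the captured block can simply be replayed at the end of the script:

CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION OFF; # default
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default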

 

 

Notify of Problems

 

This seems like such common sense, but so many backup scripts run as scheduled jobs, do "something", then write a log file that nobody ever looks at. 

 

All of my production backups log everything to a time/date stamped log file, then scan the log file for errors and send a success/fail email to the DBAs.  The subject line includes the word "SUCCESS" or "FAIL", followed by the database identification, making it easy to scan several emails quickly looking for the FAILs.  The absence of an email indicates a script did not run, or did not finish.  This is a bit harder to watch for, and brings up the subject of writing a watchdog to watch the watchdog (I've seen it done!).

 

As an aside, I also implement monthly log rotation to purge old logs (not just RMAN logs, but things like listener and OEM logs as well), a daily alert log scanner, and an hourly DataGuard health check.  On a good morning, my inbox looks something like this:

 

 

 

Separate Backup Pieces for Each Datafile

 

Clearly, there is some sense in limiting the number of datafiles or archivelogs in a single backup set piece.  If the database has hundreds or thousands of datafiles, and we only need to recover one or two, it would be much faster to read only the backup set piece(s) we need.  In the case of data de-duplication (e.g. Data Domain), this may even be a requirement to ensure de-duplication actually happens.

 

On the other side of the coin, this can make management of the backups (e.g. list backup) difficult.  Adding a unique tag to each backup can help.  My starting point is generally a compromise: 5 datafiles or 20 archivelogs per backup set piece, with a unique tag for each backup.  This provides some optimization of the restore without making the backup too difficult to manage.
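A hedged sketch of that compromise (the tag values are placeholders; in the real script they would carry a date stamp):

BACKUP DATABASE FILESPERSET 5 TAG 'db_full_20150101';
BACKUP ARCHIVELOG ALL FILESPERSET 20 TAG 'arch_20150101';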

 

 

Backup Optimization

 

Wouldn't everybody want their backup optimized?  Don't be fooled by the name.  This feature causes RMAN to skip files that have already been backed up.  If that's OK and you're willing to rely on one (hopefully good) copy of a datafile or archivelog, then go for it.  Otherwise, leave this feature off so that each backup includes all candidate files each time.
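The corresponding setting, shown here explicitly even though OFF happens to be the default:

CONFIGURE BACKUP OPTIMIZATION OFF;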

 

 

Duplex Archivelog Groups

 

In DBA 101, we all learned to duplex on-line redo logs, but what about archived redo logs?  They can be large and numerous, and take up a lot of space.  They are also absolutely critical to the recoverability of your database, and the integrity of any standby databases.  A corrupt or missing archivelog can spell disaster.

 

The engineered systems and ASM aficionados argue that writing once to double- or triple-mirrored storage is just as good, but I don't buy it.  Too many times I've seen one copy of a log get corrupted, and been very thankful for the remaining good copy.

 

When duplexing logs, be they on-line or archived, put the members on different media!  There is little point in duplexing files on the same physical media when that media fails or becomes corrupt.  Do be cautious of the write speed of the media, as this can directly affect database performance, especially in the case of the on-line redo logs.
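As a sketch, two local archive destinations on separate storage might look like this (the disk group names follow the layout mentioned below and are examples only; USE_DB_RECOVERY_FILE_DEST points the second copy at the FRA):

ALTER SYSTEM SET log_archive_dest_1 = 'LOCATION=+ARCH' SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_dest_2 = 'LOCATION=USE_DB_RECOVERY_FILE_DEST' SCOPE=BOTH SID='*';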

 

Include diagram and go into storage layout for +DATA, +FRA, +ARCH and +REDO

 

 

Use an Archivelog Deletion Policy

 

Archivelogs must be managed, period.  Like the doughnut machine that wouldn't stop (The Doughnuts, 1963, Weston Woods Studios), they will be produced continuously and there is no automated process to delete them.  If left unattended, they will fill up all available storage and halt the database.

 

Some DBAs have foolishly deleted archivelogs manually, or with time-scheduled jobs.  The problem with this approach is that there is no way of knowing with certainty that the deleted archivelogs are not still needed for potential recovery or standby database synchronization.  An unresolvable gap means a useless backup or a corrupt standby database.

 

The good news is, RMAN provides an excellent way to manage archivelogs.  Simply configure an archivelog deletion policy, include the DELETE INPUT option when backing up archivelogs, and let RMAN do the work!  DBAs, keep your hands off.  For example, with the following policy RMAN will delete archivelogs only after they have been backed up and applied on standby databases:

 

CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY BACKED UP 1 TIMES TO DEVICE TYPE DISK;

 

Note, this policy is appropriate for duplexed archive logs.  Each copy of an archivelog will be backed up once before becoming eligible for deletion.  Each copy will be in a different backup set, another redundancy feature.  If archivelogs are not duplexed (eg engineered systems), then change the policy to ...BACKED UP 2 TIMES... to ensure every archivelog is included in two separate backup sets.

 

 

Do not use DELETE ALL INPUT

 

This applies when archivelogs are duplexed.  There is a significant difference between ...DELETE INPUT; and ...DELETE ALL INPUT;

 

Including the ALL keyword means that after backing up any one copy of an archivelog, all copies of it will be deleted.  That does not provide redundancy in the backup.  If the single backup of a critical archivelog is missing or corrupt, there is no second chance.  Without the ALL keyword, only the copy that was just backed up is deleted; the other copy (or copies) remain for subsequent backups to handle.
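Side by side, a sketch of the two forms (the FILESPERSET value is just the compromise suggested earlier):

BACKUP ARCHIVELOG ALL FILESPERSET 20 DELETE INPUT;      # deletes only the copy just backed up
BACKUP ARCHIVELOG ALL FILESPERSET 20 DELETE ALL INPUT;  # deletes every copy; avoid when archivelogs are duplexed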

 

 

Backup Verification Pass

 

OK, I'm from the days when disk drives were the size of washing machines, and tape drives were the size of refrigerators.  There was only one kind of backup, and it was to tape.  We never fully trusted tapes, and rightfully so.  Whenever a backup was made there were two passes: the backup recording pass, and the backup verification pass.

 

Whatever happened to that?  The backup verification pass, I mean.  Nowadays we just blindly trust that the tape recorded OK and that we will be able to read it.

 

Restore database validate

Restore backupset validate

Caleb's sample code to perform backup verification
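Not the author's sample code, but a hedged sketch of what a modern verification pass can look like.  Each command reads the backup pieces and reports problems without actually restoring anything (the backup set key is illustrative):

RESTORE DATABASE VALIDATE CHECK LOGICAL;    # reads the backups a full restore would use
RESTORE ARCHIVELOG ALL VALIDATE;            # same for the archived log backups
VALIDATE BACKUPSET 1234;                    # spot-check a specific backup set listed by LIST BACKUP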

 

 

References

 

10 Problems With Your RMAN Backup Script, by Yury Velikanov and Michael S. Abbey

http://www.slideshare.net/yvelikanov/10-problems-with-your-rman-backup-script-whitepaper

 

Top 10 Backup and Recovery Best Practices

http://linda-smith.blogspot.com/2007/11/top-10-backup-and-recovery-best.html

 

Backup and Recovery Best Practices for the Oracle Database Appliance

http://www.oracle.com/technetwork/articles/servers-storage-admin/dbappliancebackupstrategies-519664.pdf

 

Oracle RMAN Design Best Practices with EMC Data Domain

http://www.emc.com/collateral/hardware/white-papers/h8110-oracle-rman-data-domain-wp.pdf

 

 

About Caleb

 

Caleb has worked a lifetime in IT, the last two decades specializing in Oracle server technologies.  He still has his first computer, a TRS-80 Model 1, but prefers to run a veritable mini-datacenter on his laptop now.  His roles include both professional instructor and senior consulting DBA, a combination which brings enhanced value to all of his projects.  He has been active in many Oracle user groups along the west coast and across Canada, and still presides over the Victoria/Vancouver BC OUG.  Most of his work involves setup and maintenance of critical server infrastructure in high availability environments utilizing RAC, DataGuard and RMAN.

 

 


Contact Me

 

To weed out the spam and triflers, getting my email address is like a mini IQ test.  Serious discussion is welcome.

 

The first word in a pirate's every sentence

 

The opposite of woman

 

Guess who?

 

@

.
COM

 

Caleb Small

Senior Curmudgeon DBA