Health Check of the Database Instance
Explanation:
An ideal health check should provide an overview of a database's stability across three major areas:
Availability
Performance
Scalability
Category: Availability
1. Database space (consider both database space and OS space; ideally, also consider growth trends)
2. Archive log space and listener status
3. Dump area space (bdump, cdump, adump, udump, etc.)
4. Verify that database archive logs are successfully written to disk/tape
5. Verify success of database backup
6. Snapshot/materialized view status
7. Status of DBMS jobs
8. Replication collisions
9. Monitoring backups
10. Online redo logs multiplexed (on different mount points)
11. Control file multiplexed (on different mount points)
12. Miscellaneous errors (potential bugs) in alert.log
13. Daily tablespace utilization
14. Check the temporary tablespace/files
15. Check the UNDO tablespace and retention
16. Monitor the Unix /tmp and /var locations
17. Check for invalid objects and recompile them
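As a rough illustration of the space checks above (items 1 and 13), here is a minimal Python sketch that flags tablespaces at or above a utilization threshold. The `TablespaceUsage` record, the `flag_full_tablespaces` helper, the 85% default, and the sample figures are all assumptions made for this example; in practice the numbers would come from the database's own space views.

```python
from dataclasses import dataclass


@dataclass
class TablespaceUsage:
    """Hypothetical record: one row of space usage for a tablespace."""
    name: str
    used_mb: float
    total_mb: float

    @property
    def pct_used(self) -> float:
        # Guard against a zero-sized entry so the check never divides by zero.
        return 100.0 * self.used_mb / self.total_mb if self.total_mb else 0.0


def flag_full_tablespaces(usages, warn_pct=85.0):
    """Return (name, pct_used) pairs at or above warn_pct, fullest first."""
    flagged = [(u.name, round(u.pct_used, 1)) for u in usages if u.pct_used >= warn_pct]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        TablespaceUsage("SYSTEM", 450, 1000),
        TablespaceUsage("USERS", 9200, 10000),
        TablespaceUsage("UNDOTBS1", 1900, 2000),
    ]
    for name, pct in flag_full_tablespaces(sample):
        print(f"{name}: {pct}% used")
```

The same threshold pattern could be reused for archive log space, dump areas, and the /tmp and /var file systems.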
Category: Performance
18. Disparate segment types (tables, indexes, etc.) in the same tablespace
19. SYSTEM tablespace assigned as a default or temporary tablespace
20. Temporary tablespace not being a true temporary tablespace
21. Deadlock-related errors in alert.log
22. Non-symmetric segments or unequally sized extents in a tablespace (for dictionary-managed tablespaces)
23. Invalid objects
24. Any event that incurred a wait of more than X seconds. (X is defined by the user when the health check report is run; the default could be 5 seconds. For this value to be available, some kind of stats recording mechanism needs to be in place. In our case, Data Palette collects these stats, so the health check report can query the Data Palette repository for wait events and their durations.)
25. Hit ratios: DB buffer cache, redo log, SQL area, dictionary/row cache, etc. (Opinion is mixed on whether these are useful. I include them for DBAs who rely on them to identify memory shortages in the database instance and adjust the related resources accordingly. I do have an opinion on this matter, but my goal is not to argue whether the stat is useful; it is to provide it to the people who need it, and quite a few folks still value hit ratios.)
26. I/O / disk busy (I/O stats, at the OS and database levels.)
27. CPU load average or queue size
28. RAM usage
29. Swap space usage
30. Network bandwidth usage (Input errors, output errors, queue size, collisions, etc.)
31. Multi-threaded settings (Servers, dispatchers, circuits, etc.)
32. RAC-related statistics (false pings, cache fusion and interconnect traffic, etc., depending on the Oracle version)
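Items 24 and 25 above lend themselves to a small worked example. The Python sketch below filters recorded wait events against the user-defined X (defaulting to 5 seconds) and computes the commonly cited buffer cache hit ratio, 1 - physical reads / (db block gets + consistent gets). The function names and the sample event list are assumptions for illustration; the list stands in for whatever the stats repository returns.

```python
def waits_over(events, threshold_seconds=5.0):
    """Return the (event_name, seconds_waited) pairs that exceed the threshold."""
    return [(name, secs) for name, secs in events if secs > threshold_seconds]


def buffer_cache_hit_ratio(physical_reads, db_block_gets, consistent_gets):
    """Classic ratio: the share of logical reads satisfied without a physical read.

    Returns None when there have been no logical reads, since the ratio
    is undefined in that case.
    """
    logical_reads = db_block_gets + consistent_gets
    if logical_reads == 0:
        return None
    return 1.0 - physical_reads / logical_reads


if __name__ == "__main__":
    # Made-up sample data standing in for rows from a stats repository.
    recorded = [
        ("db file sequential read", 2.3),
        ("log file sync", 7.8),
        ("enqueue", 12.0),
    ]
    for name, secs in waits_over(recorded):
        print(f"{name}: waited {secs}s")
    ratio = buffer_cache_hit_ratio(physical_reads=100, db_block_gets=400, consistent_gets=600)
    print(f"buffer cache hit ratio: {ratio:.2%}")
```

Returning None for an idle instance, rather than 0 or 1, keeps the report from presenting a meaningless ratio as a real measurement.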
Category: Scalability
(Note: To ensure there are no scalability-related issues, the mechanism that generates the health check should ideally relate current resource consumption to its trend and apply predictive algorithms to discern whether contention or a shortfall lies ahead. In the absence of such predictive capabilities, a basic health check routine can still use thresholds to determine whether a resource is close to being depleted.)
33. Sessions
34. Processes
35. Multi-threaded resources (dispatchers, servers, circuits, etc.)
36. Disk Space
37. Memory structures (locks, latches, semaphores, etc.)
38. I/O
39. CPU
40. RAM
41. Swap space
42. Network bandwidth
43. RAC-related statistics (false pings, cache fusion and interconnect traffic, etc., depending on the Oracle version)
44. Understanding system resources consumed by non-DB processes running on the same server/domain (third-party applications such as ETL jobs, web servers, app servers, etc.)
45. Understanding system resources consumed by DB-related processes running outside their normal scheduled window (applications such as backup processes, archive log propagation, monitoring (OEM) agents, etc.). This requires the health check utility to know which processes are related to the database and their normal execution time/frequency.
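The scalability note talks about relating consumption trends to a predicted shortfall. One simple way to sketch that idea in Python is a least-squares straight line fitted to daily usage samples and projected forward to the resource limit. This is only an assumed stand-in for a real predictive algorithm; the `days_until_full` name, the one-sample-per-day spacing, and the sample numbers are made up for illustration.

```python
def days_until_full(daily_used, capacity):
    """Project how many days remain until usage reaches capacity.

    Fits a least-squares line through equally spaced daily samples
    (oldest first) and extends it forward. Returns None when usage is
    flat or shrinking, and 0.0 when the fitted line is already at or
    beyond capacity.
    """
    n = len(daily_used)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so no projected depletion
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope  # day index where the line hits capacity
    return max(0.0, day_full - (n - 1))        # measured from the latest sample


if __name__ == "__main__":
    # Made-up example: sessions in use over four days against a limit of 100.
    remaining = days_until_full([10, 20, 30, 40], capacity=100)
    print(f"projected days until limit: {remaining}")
```

The same projection applies to any of the resources listed above (sessions, processes, disk space, swap) as long as usage history is being recorded; where it is not, a fixed threshold remains the fallback.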