Exadata OL7 session disconnects after 10 minutes

When upgrading to Exadata software 19c (release 19.1.0.0.0 and above), the compute nodes (database servers) are upgraded to Oracle Linux 7.  As part of this upgrade, the sshd ClientAliveInterval setting is changed to a value of 600 for STIG (Security Technical Implementation Guide) purposes, as detailed in the My Oracle Support note below:

Changed sshd setting “Clientaliveinterval” after updating Exadata Database Nodes (domU, dom0 and physical) (Doc ID 2501968.1)

When updating Exadata Database nodes (dom0, domu and physical) running either Oracle Linux 6 or Oracle Linux 7, “sshd Clientaliveinterval” settings are changed to a value of 600 for STIG purposes via unpublished bug 28204681.

This will result in your ssh connection being closed after being idle for 600 seconds, whereas before this would not happen until 86400 seconds had passed. While, for the same security reasons, it's not recommended to undo this change, the choice rests with the operator, who is free to do so.
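For completeness, below is a minimal sketch of what reverting would look like on a compute node. This is not recommended, it assumes OL7 with root access, and the setting may well be put back to 600 by the next Exadata update:

# revert ClientAliveInterval to the pre-STIG value of 86400 (not recommended)
sudo sed -i 's/^ClientAliveInterval .*/ClientAliveInterval 86400/' /etc/ssh/sshd_config
sudo systemctl restart sshd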

This means your connections to the Exadata Machines disconnect after 10 minutes of inactivity 😦 :

[AnwarZ@v1proxy1 ~]$ date;ssh oracle@v1ex1dbadm01;date
Thu May 21 15:27:31 IST 2020
oracle@v1ex1dbadm01's password:
Last login: Thu May 21 15:27:31 IST 2020 from x.x.x.x on pts/0
Last login: Thu May 21 15:27:40 2020 from x.x.x.x
[oracle@v1ex1dbadm01 ~]$ Connection to x.x.x.x closed by remote host.
Connection to x.x.x.x closed.
[AnwarZ@v1proxy1 ~]$date
Thu May 21 15:37:40 IST 2020
[AnwarZ@v1proxy1 ~]$

As per the MOS note, the recommendation is not to change ClientAliveInterval on the compute nodes, but to use the ssh client options ServerAliveInterval and ServerAliveCountMax on the connection, as shown below:

[AnwarZ@v1proxy1 ~]$ date;ssh -o ServerAliveInterval=550 -o ServerAliveCountMax=157 oracle@v1ex1dbadm01;date
Thu May 21 15:41:29 IST 2020
oracle@v1ex1dbadm01's password:
Last login: Thu May 21 15:27:40 IST 2020 from x.x.x.x on pts/0
Last login: Thu May 21 15:41:37 2020 from x.x.x.x
[oracle@v1ex1dbadm01 ~]$ date
Thu May 21 15:55:10 IST 2020
[oracle@v1ex1dbadm01 ~]$

This session didn't disconnect, and a manual 'date' shows that more than 10 minutes have passed 🙂

This is because ServerAliveInterval=550 ensures that a null packet is sent every 550 seconds from the client side, which stops the server from disconnecting the session, as this is less than the ClientAliveInterval=600 on the compute nodes.  The ServerAliveCountMax is multiplied by the ServerAliveInterval value to determine the maximum amount of time the session can be kept alive while idle (550 x 157 = 86,350 seconds), back in line with the previous standard of 86400.
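Rather than remembering to pass these flags on every connection, the same settings can be made persistent on the client side. Below is a minimal sketch, assuming the host alias v1ex1dbadm01 and that you are appending to your own ~/.ssh/config on the machine you connect from:

cat >> ~/.ssh/config <<'EOF'
# Send a client-side keepalive every 550 seconds, below the server's
# ClientAliveInterval of 600, so idle sessions are not dropped
Host v1ex1dbadm01
    ServerAliveInterval 550
    ServerAliveCountMax 157
EOF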

Alternatively, if you are using a program like PuTTY, you can configure the same effect in the settings:

[Screenshot: PuTTY Connection settings showing the keep-alive option]

It also appears from the MOS note that this can affect OL6 on higher Exadata releases where the STIG recommendations were implemented, in which case the same workaround can be used.
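If you want to confirm what a given compute node is actually enforcing, the effective sshd configuration can be checked as shown below; a quick sketch, assuming root access on the node:

# print the effective value from the running sshd configuration
sudo sshd -T | grep -i clientaliveinterval
# or check the configuration file directly
sudo grep -i '^ClientAliveInterval' /etc/ssh/sshd_config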

 

If you found this blog post useful, please like as well as follow me through my various Social Media avenues available on the sidebar and/or subscribe to this oracle blog via WordPress/e-mail.

Thanks

Zed DBA (Zahid Anwar)

 

OGB Appreciation Day : Exadata X8M

What is OGB Appreciation Day?

The Oracle Groundbreakers (OGB) Appreciation Day, formerly known as OTN Appreciation Day and ODC Appreciation Day, is a great initiative by Tim Hall aka Oracle-Base.com, where we take the opportunity to say thanks to the Oracle Community, which includes, but is not limited to, ACEs, Java Champions, Ambassadors and all those who have the Groundbreakers spirit #ThanksOGB 🙂

I wonder what the name will be next year 😉

More info on Oracle Groundbreakers Community can be found here:
About Oracle Groundbreakers Community

When is it?

This year it was on Thursday 10th October 2019, and I have to confess I totally missed it, so my post is a few days late, but I didn't want to do a disservice to the spirit of the initiative.

You can see my previous post here:
2017 – ODC Appreciation Day : Oracle Exadata Database Machine
2018 – ODC Appreciation Day : Oracle dcli Utility

You can see a summary of previous years blog posts here:
2016 – OTN Appreciation Day : Summary
2017 – ODC Appreciation Day 2017 : It’s a Wrap (#ThanksODC)
2018 – ODC Appreciation Day 2018 : It’s a Wrap (#ThanksODC)
2019 – OGB Appreciation Day 2019 : It’s a Wrap (#ThanksOGB) – this year

My Contribution : Exadata X8M

When I was at Oracle Open World 2019 a few weeks ago, Larry Ellison (CTO of Oracle) announced the new Exadata X8M:

[Photo: Larry Ellison announcing the Exadata X8M at Oracle Open World 2019]

The key point is in-memory performance, utilising persistent memory and RDMA over Converged Ethernet (RoCE), which I will detail later in this blog post.

Larry also boasted that Exadata X8M storage is 50x faster than AWS and 100x faster than Azure all-flash storage:

[Photo: slide comparing Exadata X8M storage performance with AWS and Azure all-flash storage]

Following the announcement I attended another 2 sessions with Juan Loaiza (Executive Vice President, Mission Critical Database Technologies, Oracle) and Kothanda Umamageswaran (Senior Vice President, Exadata Development)/Gavin Parish (Senior Principal Product Manager, Exadata Development), who gave more details on the Exadata X8M:

[Photo: Exadata X8M session slide]

The key changes are:

  1. 100Gb/sec RoCE internal fabric
  2. 1.5TB Persistent Memory per storage server/cell

[Photo: slide summarising the Exadata X8M key changes]

RoCE Networking

[Photo: RoCE networking overview slide]

RoCE stands for RDMA (Remote Direct Memory Access) over Converged Ethernet.  From the start, Exadata's internal fabric had been InfiniBand; however, Oracle stated that Ethernet has caught up with and surpassed InfiniBand, giving 100Gb/sec throughput as opposed to 40Gb/sec with InfiniBand, i.e. 2.5 times faster:

[Photos: slides comparing RoCE and InfiniBand throughput]

RoCE uses InfiniBand RDMA software on top of Ethernet, so it includes all the optimisations and allows for backwards compatibility:

[Photo: slide on RoCE running InfiniBand RDMA software over Ethernet]

Also mentioned was smart network prioritisation, which can prioritise critical database messages, such as transaction commits and cache fusion, over backups etc., using Class of Service:

[Photo: smart network prioritisation slide]

Another nice addition is instance failure detection through the use of RoCE: if all 4 ports don't respond, server failure is confirmed and the server is instantly evicted from the cluster:

[Photo: instance failure detection slide]

Persistent Memory

[Photo: persistent memory overview slide]

The Exadata X8M uses Intel Optane DC persistent memory, a new silicon technology whose capacity, performance and cost sit between DRAM and flash:

[Photo: Intel Optane DC persistent memory slide]

In the Exadata X8M, the persistent memory is shared, just as disks and flash are, so you get all the benefits of aggregated performance, redundancy, etc.:

[Photo: shared persistent memory slide]

The benefit of RoCE with persistent memory is the Persistent Memory Data Accelerator, which allows the database to use RDMA instead of I/O, bypassing the network and I/O software, interrupts and context switches:

[Photo: Persistent Memory Data Accelerator slide]

Another benefit of persistent memory is the Memory Commit Accelerator, which, like Smart Flash Logging, uses persistent memory as a buffer to further speed up log writes by 8x; the buffer is then flushed to flash or disk later on:

[Photo: Memory Commit Accelerator slide]

There is also smart capacity management of persistent memory: primary copies are kept on persistent memory and secondary copies on flash, with a secondary automatically moved to persistent memory when its primary is unavailable:

[Photo: smart capacity management slide]

As if Exadata was not fast enough, all this innovation has led to the "World's Fastest Database Machine", with an astonishing 16 million IOPS at less than 19 microseconds of latency:

[Photo: slide showing 16 million IOPS at under 19 microseconds]

More information on the Exadata X8M can be found here.

Finally Happy OGB Appreciation Day! #ThanksOGB #ThanksODC #ThanksOTN 🙂

If you found this blog post useful, please like as well as follow me through my various Social Media avenues available on the sidebar and/or subscribe to this oracle blog via WordPress/e-mail.

Thanks

Zed DBA (Zahid Anwar)

Oracle and Microsoft to Interconnect Oracle Cloud and Microsoft Azure

On the 5th of June 2019, both Oracle and Microsoft made a joint announcement on the interconnection between Oracle Cloud and Microsoft Azure:

Oracle’s Press Release
Microsoft Press Release

The key aspects of the announcement are:

  1. A direct interconnect between Oracle Cloud and Azure Cloud, starting in Oracle's Ashburn (North America) region and Azure Washington DC (US East) region, with plans to expand to additional regions in the future*.
  2. Unified identity and access management, via a unified single sign-on experience to manage resources across both Oracle Cloud and Azure.
  3. Supported deployment of custom applications and packaged Oracle applications (such as JD Edwards, PeopleSoft and E-Business Suite, typically referred to as Oracle Applications Unlimited) in Azure Cloud, with the Oracle databases (such as RAC, Exadata, Autonomous Database) deployed in Oracle Cloud.
  4. A collaborative support model for customers leveraging these new capabilities.

*With only one region available in both Oracle and Azure, this will allow for the failure of an Availability Domain in Oracle Cloud and/or an Availability Zone in Azure Cloud, but not the failure of a whole region in either.  So until more regions are added, Disaster Recovery will be limited to Availability Domains/Availability Zones:

[Diagram: connectivity between Oracle Cloud and Microsoft Azure]

Figure 4 from Oracle's blog "Overview of the Interconnect Between Oracle and Microsoft".

What does this mean?

In a nutshell, customers who have Microsoft Azure as their cloud platform of choice can now migrate their application tier to Microsoft Azure whilst migrating the database tier, i.e. the Oracle database (mandatory for Applications Unlimited), to Oracle Cloud, without having to worry about the all-important latency (high-throughput, low-latency, as stated in Oracle's blog post).  It is, however, unclear if there will be any charge for outbound/inbound traffic between the clouds, but it does seem from the documentation and blog post that both Oracle's dedicated FastConnect and Azure's dedicated ExpressRoute are required, which are both fixed-rate products.  It also helps those customers who require the favourable database licensing on Oracle Cloud; more info can be found in my blog post here.

This is certainly a step towards the trend of multi-cloud/hybrid-cloud platforms.

More info regarding this announcement can be found in our Version 1 blog post here.

There is also another interesting read from SearchCloudComputing in regards to this announcement.

 

If you found this blog post useful, please like as well as follow me through my various Social Media avenues available on the sidebar and/or subscribe to this oracle blog via WordPress/e-mail.

Thanks

Zed DBA (Zahid Anwar)

Online Redo Log Switch Frequency Map

A query I find myself often running is the online redo log switch frequency map query, which queries the v$log_history/gv$log_history (for clustered databases) views and shows the historical log switch frequency.

Why, you might ask? Well, it's important to see how frequently log switches are occurring, as Oracle's rule of thumb is to not switch more than 3 logs per hour (20 minutes of redo) at peak DML activity, in order to prevent excessive checkpoints.  The query can also highlight abnormal churn (DML activity).

Online Redo Log Switch Frequency Map Query

set pages 999 lines 400
col h0 format 999
col h1 format 999
col h2 format 999
col h3 format 999
col h4 format 999
col h5 format 999
col h6 format 999
col h7 format 999
col h8 format 999
col h9 format 999
col h10 format 999
col h11 format 999
col h12 format 999
col h13 format 999
col h14 format 999
col h15 format 999
col h16 format 999
col h17 format 999
col h18 format 999
col h19 format 999
col h20 format 999
col h21 format 999
col h22 format 999
col h23 format 999
SELECT TRUNC (first_time) "Date", inst_id, TO_CHAR (first_time, 'Dy') "Day",
 COUNT (1) "Total",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) "h0",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
 SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23",
 ROUND (COUNT (1) / 24, 2) "Avg"
FROM gv$log_history
WHERE thread# = inst_id
AND first_time > sysdate -7
GROUP BY TRUNC (first_time), inst_id, TO_CHAR (first_time, 'Dy')
ORDER BY 1,2;

Online Redo Log Switch Frequency Map Output

Date         INST_ID Day      Total   h0   h1   h2   h3   h4   h5   h6   h7   h8   h9  h10  h11  h12  h13  h14  h15  h16  h17  h18  h19  h20  h21  h22  h23        Avg
--------- ---------- --- ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----------
24-MAY-19          1 Fri          1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1        .04
24-MAY-19          2 Fri          1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1        .04
25-MAY-19          1 Sat         56    1    1    3    0    2    2    0    7    0    0   10    4    4    6    3    1    2    4    1    1    1    2    0    1       2.33
25-MAY-19          2 Sat         62    0    3    3    0    1    2    0    7    0    0    8    5    2    2    3    5    4    2    3    3    5    4    0    0       2.58
26-MAY-19          1 Sun         56    1    1    3    0    2    5    0   11    0    4    1    1    7    7    3    5    1    0    0    0    1    1    1    1       2.33
26-MAY-19          2 Sun         28    1    3    3    0    1    4    0    5    0    1    1    0    2    3    1    1    1    0    0    0    0    0    1    0       1.17
27-MAY-19          1 Mon         33    1    1    3    0    1    4    0    6    5    1    0    1    1    2    1    1    1    0    0    0    1    1    0    2       1.38
27-MAY-19          2 Mon         21    0    3    3    0    0    4    0    3    2    0    0    1    0    1    0    1    0    1    0    0    0    0    1    1        .88
28-MAY-19          1 Tue         43    2    3    3    0    3    5    1    4    3    1    3    0    1    1    1    1    2    3    2    1    1    0    1    1       1.79
28-MAY-19          2 Tue         38    2    4    3    1    1    4    1    7    1    1    1    1    0    1    0    0    2    4    0    1    0    1    1    1       1.58
29-MAY-19          1 Wed         58    2    4    4    1    3    5    2   12    3    1    3    1    2    5    4    1    1    0    1    0    1    0    2    0       2.42
29-MAY-19          2 Wed         28    0    2    3    0    1    4    1    5    1    0    1    1    0    2    1    1    1    1    0    0    1    1    0    1       1.17
30-MAY-19          1 Thu         39    2    3    3    0    1    5    1    7    1    1    2    3    1    2    2    1    0    1    1    0    0    1    0    1       1.63
30-MAY-19          2 Thu         29    0    1    3    0    2    4    1    4    2    1    1    1    1    1    2    0    1    2    0    0    1    1    0    0       1.21
31-MAY-19          1 Fri        153    2    1    3    0    2    4    1    8    3    0   23   29   10    2    1   18   20    6   12    2    3    3    0    0       6.38
31-MAY-19          2 Fri        223    1    2    3    0    1    4    0    3    1    0   71   86   29    1    0    6    7    2    4    0    1    1    0    0       9.29

16 rows selected.

SQL>

Where the log switch frequency exceeds 3 per hour, I have highlighted it in red.

From the above output, we can see that most of the time the 3 log switches per hour threshold is met, with the 5th and 7th hours exceeding it.  We can also see that today there has been abnormal and excessive churn in the 10th and 11th hours.

Investigating this further, we look at v$log to see the size of the online redo logs:

SQL> set pages 999 lines 400
SQL> col FIRST_CHANGE# format 999999999999999
SQL> select GROUP#, THREAD#, SEQUENCE#, BYTES/1024/1024 SIZE_MB, BLOCKSIZE, MEMBERS, ARCHIVED, STATUS, FIRST_CHANGE#, FIRST_TIME, NEXT_CHANGE#, NEXT_TIME, CON_ID from v$log;

    GROUP#    THREAD#  SEQUENCE#    SIZE_MB  BLOCKSIZE    MEMBERS ARC STATUS              FIRST_CHANGE# FIRST_TIM NEXT_CHANGE# NEXT_TIME     CON_ID
---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ---------------- --------- ------------ --------- ----------
         1          1      40649       4096        512          2 YES INACTIVE             137921001754 31-MAY-19   1.3793E+11 31-MAY-19          0
         2          1      40650       4096        512          2 NO  CURRENT              137930464620 31-MAY-19   2.8147E+14                    0
         3          1      40648       4096        512          2 YES INACTIVE             137920121808 31-MAY-19   1.3792E+11 31-MAY-19          0
         4          2      39319       4096        512          2 NO  CURRENT              137921005288 31-MAY-19   2.8147E+14                    0
         5          2      39317       4096        512          2 YES INACTIVE             137917291807 31-MAY-19   1.3792E+11 31-MAY-19          0
         6          2      39318       4096        512          2 YES INACTIVE             137919186475 31-MAY-19   1.3792E+11 31-MAY-19          0

6 rows selected.

SQL>

As we can see, the online redo logs are 4GB, which is the default size on Oracle Exadata (which this is).  We could increase the size of the redo logs, but it's a trade-off between performance and space usage.  The peak was 29 log switches for instance 1 and 86 log switches for instance 2; assuming these were full redo logs before switching, that is 29 + 86 = 115 log switches for the database in that hour, equating to a significant 460GB of churn!  But this is an Exadata machine and is sized and designed for this sort of load 🙂  To bring the redo logs down to only 3 switches per hour, we would need to resize them to around 154GB (460GB / 3), which wouldn't be feasible, so we could either increase to something more reasonable like 10GB or leave them as is, depending on what's more important: performance or space usage.
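If we did opt for something like 10GB logs, note that online redo logs can't be resized in place; the usual approach is to add new, larger groups and drop the old ones once they are inactive.  Below is a minimal sketch, assuming group numbers 7 and 8 are free and that OMF/default file destinations are in use:

-- add new 10GB groups, one per thread shown here (repeat as required)
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 7 SIZE 10G;
ALTER DATABASE ADD LOGFILE THREAD 2 GROUP 8 SIZE 10G;
-- force a switch so the old groups can become INACTIVE (check v$log)
ALTER SYSTEM ARCHIVE LOG CURRENT;
-- then drop each old 4GB group once its status is INACTIVE
ALTER DATABASE DROP LOGFILE GROUP 1;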

 

If you found this blog post useful, please like as well as follow me through my various Social Media avenues available on the sidebar and/or subscribe to this oracle blog via WordPress/e-mail.

Thanks

Zed DBA (Zahid Anwar)

Prevent crontab jobs overlapping using flock

As an Oracle DBA, you may find yourself in the situation where you have crontab jobs overlapping.

For example, an RMAN backup takes longer than normal and then overlaps with another RMAN database backup, leading to more resources being consumed:

00 00 * * * /home/oracle/scripts/backups/db_backup.sh PROD1 >> /home/oracle/backups/logs/db_backup_PROD1.log 2>&1
00 01 * * * /home/oracle/scripts/backups/db_backup.sh PROD2 >> /home/oracle/backups/logs/db_backup_PROD2.log 2>&1

If the PROD1 backup takes longer than 1 hour, it will contend with the PROD2 backup when it starts.

Another, more recent, example of my own is from when I was copying archive logs to AWS for an Oracle Database Standard Edition migration using a manual standby, which required manually transferring archive logs using rsync every 15 minutes:

0,15,30,45 * * * * /home/oracle/copy_Arch_to_AWS.sh > /home/oracle/copy_Arch_to_AWS.log

This ran fine most of the time, but when there was significant churn in the database, the crontab job would overlap, causing several concurrent rsync processes 😦

The easiest solution is to wrap the crontab job in flock 🙂

Using flock

flock is a Linux utility that uses a lock file to determine whether the process is already running.  The syntax I use is:

flock -x <lockfile> -c '<command>'

The “-x” obtains an exclusive lock and the “-c” specifies the command to run.
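To see the behaviour in isolation before touching crontab, you can try something like the following in a shell; the lock file name and sleep duration are just examples, and the second invocation will block until the first one's sleep finishes:

# hold the lock for 60 seconds in the background
flock -x /tmp/flock_demo.lock -c 'sleep 60' &
# this blocks until the lock above is released, then runs
flock -x /tmp/flock_demo.lock -c 'echo got the lock'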

Flock examples

Backup example

00 00 * * * flock -x /home/oracle/scripts/backups/backup.lock -c '/home/oracle/scripts/backups/db_backup.sh PROD1 >> /home/oracle/backups/logs/db_backup_PROD1.log 2>&1'
00 01 * * * flock -x /home/oracle/scripts/backups/backup.lock -c '/home/oracle/scripts/backups/db_backup.sh PROD2 >> /home/oracle/backups/logs/db_backup_PROD2.log 2>&1'

Now when the backup for PROD2 starts, flock will check for the lock, see that it exists, and not run the command until the backup for PROD1 has completed 🙂

Archive log copy example

0,15,30,45 * * * * flock -x /home/oracle/copy_Arch_to_AWS.lock -c '/home/oracle/copy_Arch_to_AWS.sh > /home/oracle/copy_Arch_to_AWS.log'

Now when the job runs, an exclusive lock is taken; hence, when it runs again in 15 minutes, if there is an existing run, it will not run the command until the previous one has completed 🙂  This essentially queues the copies instead of letting them overlap and cause several concurrent rsync processes, which would just exacerbate the issue.

Advance use of flock

Timeout

You can add “-w <seconds>”, which is the amount of time to wait for the exclusive lock before exiting without running the command.  For example:

0,15,30,45 * * * * flock -w 300 -x /home/oracle/copy_Arch_to_AWS.lock -c '/home/oracle/copy_Arch_to_AWS.sh > /home/oracle/copy_Arch_to_AWS.log'

Now flock will wait up to 5 minutes for the previous archive log copy job to complete; if it still hasn't, flock exits without running the command for that run 🙂

Viewing lock

If you want to see the lock taken by flock, you can run:

[oracle@dc1sbxdb001 ~]$ fuser -v /home/oracle/copy_Arch_to_AWS.lock
                                    USER   PID    ACCESS COMMAND
/home/oracle/copy_Arch_to_AWS.lock: oracle 341039 f....  flock
                                    oracle 341040 f....  rsync

 

If you found this blog post useful, please like as well as follow me through my various Social Media avenues available on the sidebar and/or subscribe to this oracle blog via WordPress/e-mail.

Thanks

Zed DBA (Zahid Anwar)