Monday, February 10, 2014

Design to plan database capacity

Purpose

The database resources is limited on CPU, Memory and IO. We have to do capacity planning well ahead. That includes transaction arrival rate, Transaction's SQL resource usage: LIO / CPU / Memory / Redo / Undo, Lock, Latch,...etc. We have to design and build a synthetic transaction workload benchmark to get these matrices.

Also "We have really got to think about the data and how it will be used, how it will be accessed so that we can utilize the optimal structure for that.
Make sure that the database organizes the data in a way that’s efficient for us to retrieve that data."

As we talked, here is the database criteria need to be filled before build a new project involving database process, to get started.
(Actually, we'd better consider it from the day one design the system for the new project)

I. Function priority, asynchronous batch process OK?
II. work load ( Planning the performance )
- II.i. Data Usage
  - a1. UI query and OLTP transaction response time
  - a2. Batch throughput
III. Growth and Archive - Data lifecycle management
Target
- Capacity plan
- Data life cycle management
- SLA

---

I. Function priority, asynchronous batch process OK?

#) Downtime allowed

- -) HA implementation (physical standby or Stream replication or RMAN backup and recovery or logical data dump expdp/impdp)

#) Process delay/latency allowed or data consistency delay/latency allowed

- -) to select an ACID or BASE solution

II. work load ( Planning the performance )

#) OLTP: average and peak executions per hour per day and average/peak response time requirement

#) Batch: throughput requirement, see System Architecture Model

See Define+database+application+SLA

II.i. Data Usage

a1. UI query and OLTP transaction response time

 SELECT customer_name

 FROM customer

 WHERE customer_id = 12345678;

Type	Arrival rate	Time	Peak season	SLA 90%	SLA 99.9%
Avg	3 / minute			< 2 seconds	< 4 seconds
Peak	20 / minute	11am ~ 2pm	Xmas	< 3.5 seconds	< 5 seconds

a2. Batch throughput

INSERT INTO order_stats(time_id, product_id, order_cnt)

SELECT to_date('2011/01/01','yyyy/mm/dd'), product_id, count(*)

FROM order

WHERE order_date BETWEEN to_date('2011/01/01','yyyy/mm/dd')

                     AND to_date('2011/02/01','yyyy/mm/dd')

GROUP BY product_id;

Type	Arrival rate	Rows per Exec(Volume)	Time	Peak season	Throughput SLA
Avg	2 times / day	30			50,000 rows / day
Peak	2 times / day	200	12:00 ~ 14:00, 20:00 ~ 22:00	January	2,000,000 rows / day

III. Growth and Archive - Data lifecycle management

#) Growth rate, E.g. 30MB/Day

#) Data retention and archive plan, E.g. 1 year working data, 3 years reporting data in DW, 10 years archive data in Tape/DVD.

See An Information-centric Approach to Data Classification Reduces Data Volumes.

—

Target

Based on the requirements, the architects are able to design a solution.

We will select one data process pattern(good practice) or design a new one to meet the SLA and business objective.

Capacity plan

We will plan our hardware resource to meet the above SLA.

Data life cycle management

It is not necessary to put everything into our expensive OLTP system, it tightly bind to SLA.

Usage Profile	Data Retention	Priority	SLA	Storage/Disk
Write/Read hot	Last 2 months	priority one	99.999% HA, 7 x 24	Most Expensive
Read hot	Last 2 ~ 6 months	priority 2	99.99% HA, 7 x 24	Expensive, copy to DW
Read only cooler	Last 6 ~ 36 months	priority 3	nearline storage 99% HA, 5 x 8	Cheap
Read only cold	Older than 3 years	priority 4	move to DVD or tape, only mount online when needed.	Cheapest
Obsoleted	Older than 9 years	priority 5	Keep aggregate data and Drop it	...

There are many requirements to effectively manage the data lifecycle. Effective and efficient management of the lifecycle will reduce both the hardware and management costs associated with the database facility, and improve overall database performance. To begin the process, the lifecycle for each events/fact table data stream must be understood. This includes defining who will ultimately use the data, and how they will use it. Based upon the usage profile, the characteristics of the life of the data can be quantified.

Reference:

http://docs.oracle.com/database/121/VLDBG/part_lifecycle.htm

Data Usage Over Time

Data Lifecycle

Tuesday, January 14, 2014

How bad is the network round trip when accessing database.

It is a common good practice to save the network round trips when design and develop database applications.
But how bad it is? we need to setup a synthetic benchmark to find out.

My configuration is Oracle database 12.1, Linux 6.4, 64bit, on my laptop virtual box.
There are 2 database servers in one data center. To query remote table through database link.
5000 rows table t2, copy data from view all_objects.

Here is the benchmark results:

1 rows, query 1000 times.

6.39 seconds v.s. 0.02 seconds. 10M network.
1.09 seconds v.s. 0.02 seconds. 100M network.

1000 rows, row by row fetch v.s. bulk fetch. 100M network.

0.44 seconds : 0 seconds.

So the network latency is between 0.00044 and 0.0011 seconds per round trip.

For example, we have a system process 300 orders per second, run 300 SQL SELECT, 33% time are used for network.
It is considerable amount of time.

Here is the benchmark test setup, so you can reproduce it on your servers.

tnsnames.ora:

mac=
(description=
(address=(protocol=tcp)(host=192.168.1.104)(port=1521))

(connect_data=(service_name=pdb1)))

SQL>
drop public database link dev1_link;
create public database link dev1_link
connect to scott
identified by abc
using 'mac';

drop public database link dev1_link;
create public database link dev1_link
connect to scott
identified by abc
using 'local';

drop table t2 purge;
create table t2 as select * from all_objects where 1=0;

-- Test 1, remove v.s. local database,

set serveroutput on
declare
l_cnt number;
li number;
g_run1 number;
begin
l_cnt := 1000;

g_run1 := dbms_utility.get_time; --This function determines the current time in 100th's of a second.
for i in 1..l_cnt
loop
select count(*) into li from t2@dev1_link where rownum < 1;
end loop;
g_run1 := (dbms_utility.get_time-g_run1);
dbms_output.put_line( 'remote query ran in '||g_run1);

g_run1 := dbms_utility.get_time;
for i in 1..l_cnt
loop
select count(*) into li from t2 where rownum < 1;
end loop;
g_run1 := (dbms_utility.get_time-g_run1);
dbms_output.put_line( 'local query ran in '||g_run1);

end;
/

remote query ran in 105
local query ran in 1

set autot on

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
0 consistent gets
0 physical reads
0 redo size
541 bytes sent via SQL*Net to client
543 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
.

-- Test 2, Bulk Fetch, both from remote db,

drop table t2 purge;
create table t2 as select * from all_objects;

set serveroutput on
declare
l_cnt number;
li number;
g_run1 number;
cursor c1 is select * from t2@dev1_link where rownum <= 1000;
lr1 c1%rowtype;
type tab1 is table of c1%rowtype;
lt1 tab1;
begin
l_cnt := 1000;

g_run1 := dbms_utility.get_time; --This function determines the current time in 100th's of a second.

open c1;
loop
fetch c1 into lr1;
exit when c1%notfound;
end loop;
g_run1 := (dbms_utility.get_time-g_run1);
dbms_output.put_line( 'remote query ran in '||g_run1);

g_run1 := dbms_utility.get_time;
fetch c1 bulk collect into lt1;
g_run1 := (dbms_utility.get_time-g_run1);
dbms_output.put_line( 'local query ran in '||g_run1);

end;
/

remote query ran in 44

local query ran in 0

.

Let me know your benchmark results, the ball park.

Thanks,
Charlie | Database Architect Developer

Tuesday, January 07, 2014

Understanding business goal and process logic

Every time when there is a problem, and people come to me to ask solutions, or to give solutions, I must ask, as always, could someone describe the business goal?

Plus current data flow and data process logic would be better.

To help me understanding the problem we're trying to solve, to figure out what the problem really is.

A solid understanding of the goal of ABC.

“And most importantly of all - a solid understanding of the goal, of what the question is. Tuning a query or process is really hard (impossible I would say) - unless we understand the question in the first place. I cannot tell you how many times I've not been able to tune a query until I had the question in hand. Certainly we can derive a question from a query - however, many times that derived question is much more confining than the real question being asked. For example, many people use outer joins in all queries - they are "afraid" of losing a row (perhaps they got "burned" in some past experience and now use outer joins everywhere). If the objects are related in a one to one mandatory fashion - we don't need an outer join at all. The question derived from the query is much more confining than reality.” - AskTom

To help us better understanding an application business goal and application logic, could you please help us to find out below details?

Some follow up questions will be asked, or one or more teleconference meetings will be helpful.

And then some design options can be given from there.

/******* Question Start *******/

Business goal.

What is the goal of ABC system? Please describe the problem that ABC is try to solve. The big picture.

For example: Goal: Replicate data from MySQL to Oracle.

What is the goal for each module?

For example: Goal of module A: Capture incremental changes from MySQL.

Application logic and data flow for each transaction module.

What are the input, process, output for each step?
Which step could go wrong?
What would be the data look like after fail?
How often it goes wrong? e.g. 3 times a day.
What are the limitations? e.g. cannot process cash orders.
How to handle failure/exception? e.g. rollback.
Single process single connection, or multiple concurrent processes and multi-connections ?

publisher => subscription
master/coordinator => slave workers

What is performance SLA?

Response time. E.g. 90% < 2 seconds, 99.9% < 5 seconds per transaction.
Transaction volume. E.g. process 5000 orders/hours.

What is availability SLA? E.g. Max to 8 hours downtime a year.

Capture all SQL called in applicate by SQL Session Trace tool.

For all transactions that access table user_abc.table_a,

including every SQL SELECT/INSERT/UPDATE/DELETE.

(Another email will be sent to show you how to capture all SQL)

For example, the shopping cart checkout transaction would be:

Business goal: Ensure the online shopping process clear, fast, simple ...

The goal for each module:

create_order;

create_finance_journal;

notify_buyer;

commit_transaction;

Application logic and data flow.

/* create_order */

SELECT product_id, qty, ... FROM shopping_cart_item WHERE shopping_cart_id = 123

INSERT INTO order ...VALUES (...);

INSERT INTO order_item ...VALUES (...);

INSERT INTO order_audit ...VALUES (...);

DELETE shopping_cart ... WHERE shopping_cart_id = 123;

DELETE shopping_cart_item ... WHERE shopping_cart_id = 123;

UPDATE inventory SET qty = qty - 1 WHERE item_id = 7788123;

/* create_finance_journal */

INSERT into po_journal...VALUES (...);

INSERT into sub_ledger ...VALUES (...);

/* notify_buyer, call 3rd party API */

UTL_MAIL.SEND(...);

COMMIT;

/******* End *******/

I really appreciate your effort.

Design is to finding the problem, not solution.

We’ll be doing this for performance, scalability, security and so on.

We'll meet heavy resistance, big time - I am sure.

So we always be able to talk "numerously" - in numbers.

For example:

"if we do this ETL process slow by slow (row by row) and take just 1ms per row (pretty fast don't you think?) and you have to process 5,000,000 rows - it will take almost 1.5 hours. When we get to 50 million rows - it'll be over a half a day. When we get to 500 million rows - it will take a WEEK to process. if I am allowed to do it in the database in a single DDL statement to transform the data, I can be done with 500 million rows in 2 minutes."

See my blog …, for the proof of why row by row process is slow by slow process, and why set based process is much more better than row by row process.

Reference

"More orders equals more revenue". From a software architect in Google. http://www.eaipatterns.com/ramblings/18_starbucks.html

http://lifehacker.com/5795228/how-to-solve-just-about-any-problem

Thanks,

Charlie | Database Architect Developer

Wednesday, December 11, 2013

Disable RAC server (and client) side load balance

Here is how to disable RAC server (and client) side load balance, and the test:

* Client side load balance:

SALES_DB =
(DESCRIPTION_LIST=
(LOAD_BALANCE=off)
(FAILOVER=on)
(DESCRIPTION=
(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=rac01_vip)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=rac02_vip)(PORT=1521))

)

(CONNECT_DATA=(SERVICE_NAME=oltp_workload))))

* server side load balance:

srvctl modify service -d <dbname> -s <servicename> -B NONE

* Test

#!/bin/bash

echo "">t.txt

for i in {1..70}

sqlplus -s system/ABC@sales_db >> t.txt <<+++

set head off

set feedback off

set pagesize 0

select instance_name from v\$instance;

begin

for lc in (select * from all_objects where rownum < 500)

loop

Null;

end loop;

end;

exit;

+++

sleep 0.02

done

cat t.txt|uniq

rac01

If load balance is enabled, then the output will be 2 lines or more，for example,

rac01,

rac02.

Otherwise, it'll be always one line, e.g. : rac01.

* Reference

srvctl modify service -d db_unique_name -s service_name
     [-g server_pool] [-c {UNIFORM|SINGLETON}] [-P {BASIC|PRECONNECT|NONE}] 
     [-l {[PRIMARY] | [PHYSICAL_STANDBY] | [LOGICAL_STANDBY] | [SNAPSHOT_STANDBY]}
     [-q {TRUE|FALSE}] [-x {TRUE|FALSE}] [-j {SHORT|LONG}] 
     [-B {NONE|SERVICE_TIME|THROUGHPUT}] [-e {NONE|SESSION|SELECT}] 
     [-m {NONE|BASIC}] [-z failover_retries] [-w failover_delay]
     [-y {AUTOMATIC | MANUAL}]

Table A-79 srvctl modify service Options

Option	Description
-d db_unique_name	Unique name for the database
-s service_name	Service name
-g server_pool	The name of a server pool used when the database is policy managed. Note: This option is available only with Oracle RAC and only for policy-managed databases.
-c {UNIFORM \| SINGLETON}	The cardinality of the service, either `UNIFORM` (offered on all instances in the server pool) or`SINGLETON` (runs on only one instance at a time) Note: This option is available only with Oracle Clusterware.
-P {BASIC\|PRECONNECT\|NONE}	TAF failover policy Note: The `PRECONNECT` option applies only to administrator-managed databases.
-l {[PRIMARY] \| [PHYSICAL _STANDBY] \| [LOGICAL _STANDBY] \| [SNAPSHOT _STANDBY]}	The database modes for which the service should be started automatically.
-q {TRUE \| FALSE}	Indicates whether AQ HA notifications should be enabled (`TRUE`) for this service
-x {TRUE \| FALSE}	Indicates whether or not Distributed Transaction Processing should be enabled for this service Note: This option is available only with Oracle Clusterware.
-j {SHORT \| LONG}	Connection Load Balancing Goal
-B {NONE \| SERVICE_TIME \| THROUGHPUT}	Runtime Load Balancing Goal
-e {NONE \| SESSION \| SELECT}	Failover type
-m {NONE \| BASIC}	Failover method
-z failover_retries	The number of failover retry attempts
-w failover_delay	The time delay between failover attempts
-y {AUTOMATIC \| MANUAL}	Service management policy

Thanks for the help from Ricky Zhu