Wednesday, May 7, 2014

Steps for Hot Swapping MySQL Tables With Zero Downtime


Recently I had to alter some tables in our production database with minimal downtime. Using the process below, we had hardly any downtime. The table was around 90GB with 400M+ rows.

The steps I followed were:



1) Take the initial dump from one of the slaves, filtering out the unwanted records/org
mysqldump -p -u <username> db_name table_name --no-create-info --skip-tz-utc --complete-insert --compact --skip-comments --skip-lock-tables --where "org_id=<org_id> AND id<10000 AND auto_update_time>'2014-10-01'" > temp.sql


Note: --skip-tz-utc is very important if your table has TIMESTAMP fields
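For context, by default mysqldump converts TIMESTAMP values to UTC and writes the line below near the top of the dump; --skip-tz-utc suppresses both, keeping the values exactly as stored on the server:

/*!40103 SET TIME_ZONE='+00:00' */;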

Use the appropriate filters for your data. Please make sure the filter columns are indexed and you are running optimal queries (see the EXPLAIN sketch below).
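One way to sanity-check the filter before dumping is to EXPLAIN it on the slave (a sketch; the predicate mirrors the --where clause above):

EXPLAIN SELECT COUNT(*)
FROM db_name.table_name
WHERE id < 10000 AND auto_update_time > '2014-10-01';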

2) Create a new table with a similar structure in the data_cleanup database (if required, add additional columns and/or indexes)

You can run the query below to get the CREATE statement of the current table:



SHOW CREATE TABLE db_name.table_name;


Or you can create a new table with the same structure using the query below:

CREATE TABLE data_cleanup.table_name LIKE db_name.table_name;
## ( the auto-increment counter is reset by this query )

3) Load the dump taken from the slave into the new data_cleanup database
mysql -p -u <username> -A data_cleanup < temp.sql

4) Load the delta data based on org_id or auto_update_time into the temp database

## get the number of records
SELECT id into @temp_max_id from data_cleanup.table_name order by id desc limit 1;
SELECT @temp_max_id;

select count(*) from  db_name.table_name
WHERE id > @temp_max_id;

## load the delta to temp table
INSERT INTO data_cleanup.table_name
SELECT * FROM db_name.table_name
WHERE id > @temp_max_id;

Note: If the number of delta records is large, dump the delta from the slave again, load it, and repeat steps 1-3

5) Increase the auto-increment id of the table if required
select count(*) from db_name.table_name
WHERE id > @temp_max_id;
SELECT @temp_max_id;
## replace the new max id in the query below by hand (ALTER TABLE cannot read a user variable directly)
ALTER TABLE data_cleanup.table_name AUTO_INCREMENT=<@temp_max_id + 1000>;
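Alternatively, you can build the ALTER statement with a prepared statement instead of editing it by hand (a sketch):

SET @q := CONCAT('ALTER TABLE data_cleanup.table_name AUTO_INCREMENT=', @temp_max_id + 1000);
PREPARE stmt FROM @q;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;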

6) Swap the tables - please run this step as quickly as possible. The RENAME below is atomic, so clients never see a missing table.

SELECT id into @temp_max_id from data_cleanup.table_name order by id desc limit 1;
SELECT @temp_max_id;

select count(*) from  db_name.table_name
WHERE id > @temp_max_id;

INSERT INTO data_cleanup.table_name
SELECT * FROM db_name.table_name
WHERE id > @temp_max_id;

RENAME TABLE
`data_cleanup`.`table_name` to `data_cleanup`.`temp`,
`db_name`.`table_name` to `data_cleanup`.`table_name`,
`data_cleanup`.`temp` to `db_name`.`table_name` ;


7) You might need to take a delta of the records which were inserted during step 6; if required, follow step 4. If rows were updated during the window, you need to copy those updates to the new table as well. Please do this in the same connection, otherwise @temp_max_id won't be available. Note that after the swap, the stray rows live in data_cleanup.table_name (the old table), so the copy goes into the new live table:

INSERT INTO db_name.table_name
SELECT * FROM data_cleanup.table_name
WHERE id > @temp_max_id AND id < @temp_max_id+1000;

8) Test, and run some typical queries to warm up the caches

A Python script which does all of this is available in Git
Notes:
  • During this whole process, please make sure any regular import job into the table is stopped and no import is running.
  • The assumption made is that we have a database named "data_cleanup" in dbmaster. If not, create one using "CREATE DATABASE data_cleanup;"

Thursday, April 11, 2013

Php Script to load data to HIVE from MySQL tables


I used the code below to load data from MySQL into Hive, with no transformation whatsoever.

<?php
$startTime = microtime(true);
$host = "localhost";
$user = "root";
$password = "";
$backupFolder = '/tmp/hivebackup';
$hiveRestoreFolder = '/tmp/hivebackup';

$mysqli = new mysqli($host, $user, $password);

/* check connection */
if ($mysqli->connect_errno) {
    printf("Connect failed: %s\n", $mysqli->connect_error);
    exit();
}

//tables to backup
$tablesArr = array(
"db.tbl1",
"db2.tbl2",
);

$fp = fopen("$backupFolder/restore.sql", "w");
$hiveql = "\ncreate database if not exists db; ";
$hiveql.= "\ncreate database if not exists db2;";
fwrite($fp, $hiveql);

$tablesCompleted = 0;
//foreach table take backup
foreach($tablesArr as $table)
{
$tableStartTime = microtime(true);
print "\nProcessing $table";
@unlink("$backupFolder/$table.log");

//get the table desc
if ($result = $mysqli->query("DESC $table", MYSQLI_USE_RESULT))
{
$sql = "SELECT ";
$hiveql  = "\n\nDROP table if exists $table;";
$hiveql .= "\nCREATE TABLE $table(";

while($col = $result->fetch_array())
{
    // strip tabs, newlines and carriage returns so they cannot break the tab-delimited output file
    $sql .= ' replace(replace(replace( `'.$col["Field"].'`, "\t", ""), "\n", ""), "\r", "") ,';
    $hiveql .= "\n\t`". $col["Field"]. "` " .getHiveType($col["Type"]).",";
}
$result->free(); // release the unbuffered (MYSQLI_USE_RESULT) result before issuing the next query
$sql = substr($sql,0, -1);
$hiveql = substr($hiveql,0, -1);

$sql .= " INTO outfile '$backupFolder/$table.log' from $table \n";
$hiveql .= "\n)\n".'ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n"; ';
$hiveql .= "\nLOAD DATA INPATH '$hiveRestoreFolder/$table.log' INTO TABLE $table;";

$mysqli->query($sql);
fwrite($fp, $hiveql);

$tablesCompleted++;
}
print "\nBackup taken for $table in ". number_format(microtime(true) - $tableStartTime, 2). " sec";
}
fclose($fp);
print "\n$tablesCompleted tables are backed up in ".number_format(microtime(true) - $startTime, 2). " sec";

print "\n\t\t===== ===== ===== ===== ===== Done ===== ===== ===== ===== =====\n";

function getHiveType($mysqlType)
{
$mysqlType = explode('(', $mysqlType);
$mysqlType = $mysqlType[0];

switch(strtoupper($mysqlType))
{
case 'TINYINT' : return 'TINYINT';
case 'SMALLINT' : return 'SMALLINT';
case 'INT' : return 'INT';
case 'BIGINT' : return 'BIGINT';
case 'FLOAT' : return 'FLOAT';
case 'DECIMAL' : return 'FLOAT';
case 'DOUBLE' : return 'DOUBLE';
case 'BOOLEAN' : return 'BOOLEAN';
case 'TIMESTAMP': return 'TIMESTAMP';
case 'DATETIME' : return 'TIMESTAMP';
default : return 'STRING';

}
}
?>
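For reference, for a hypothetical table db.tbl1 with columns id int(11) and name varchar(30), the generated restore.sql would look roughly like this:

create database if not exists db; 
create database if not exists db2;

DROP table if exists db.tbl1;
CREATE TABLE db.tbl1(
	`id` INT,
	`name` STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n"; 
LOAD DATA INPATH '/tmp/hivebackup/db.tbl1.log' INTO TABLE db.tbl1;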

## Now execute the php file.
$ php myfile.php

# restore from the file you have created
$ hive -f  /tmp/hivebackup/restore.sql


Note: I needed only selected tables for my purpose. To load a whole DB, you can find all tables (or tables matching some condition) with SHOW TABLES FROM db1 LIKE 'xx%', as in the sketch below.
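If you want to build $tablesArr dynamically instead of hard-coding it, a sketch (db1 and the 'xx%' pattern are placeholders; $mysqli is the connection from the script above):

// Hypothetical: collect matching tables from one database into the backup list
$tablesArr = array();
if ($result = $mysqli->query("SHOW TABLES FROM db1 LIKE 'xx%'")) {
    while ($row = $result->fetch_row()) {
        $tablesArr[] = "db1." . $row[0];
    }
    $result->free();
}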

Saturday, February 23, 2013

More On Select Queries

We went through SELECT query fundamentals earlier; you can revisit them here. It's time to dig deeper into what more you can do with SELECT.

Let's try to refresh with a query.
Query 1: Get the names of all parents who have two or more girls in the school.
select p.name 
from parents as p 
inner join students as s on s.parent_id = p.id 
where s.gender = 'F'
group by p.id having count(*) >= 2;

Let's now see a select query inside a query - sub-queries.

Query 2: Find the list of students who have no parent details in the system.
There are two approaches - the common one is a sub-query with a NOT IN clause:


select s.name
from students as s 
where s.parent_id not in ( select id from parents)
The query works by first identifying all rows from the students table and checking each against the list of all parent ids to find the required records. But if you analyse further, the sub-query may be run for each matching row. It can be done better with a join:

select s.name
from students as s 
left join parents as p on p.id = s.parent_id
where p.id is null
For those who find the query confusing, ignore the WHERE clause first: it lists all students along with their parent ids, if they have one (we are using a LEFT JOIN). Now add the WHERE clause, which says to keep only the rows where no parent id was matched. The time taken by this query will be lower. But there can still be cases where sub-queries are useful.

Just for the sake of knowledge, I will show you another kind of query - but it is heavy and I would say: never ever use it.
Query 3: List all students along with the number of students from the same family currently studying in the school.

select s.name,
(select count(*) from students as s1 where s1.parent_id = s.parent_id) as students_from_family
from students as s 
Here again, the select sub-query is run for each matching row. This could be executed better using a JOIN query:
select s.name, count(s1.id)
from students as s 
left join students as s1 on s1.parent_id = s.parent_id
group by s.id

Now let's look at another type of query - the derived query. It uses a select query inside a FROM or JOIN clause.
Query 4: Get, for each parent with id greater than 4, the number of students.

select p.name, count(s.id)
from students as s 
left join (select id, name from parents where id > 4) as p 
on p.id = s.parent_id
group by p.id;
We could again get the result without a join, but it is good to keep all these weapons with you. If you run EXPLAIN, you will notice there is a temporary table creation involved here (the DERIVED row, id 2):

+----+-------------+------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+------------+-------+---------------+---------+---------+------+------+-------------+
|  1 | PRIMARY     | s          | ALL   | NULL          | NULL    | NULL    | NULL |    5 |             |
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL    | NULL    | NULL |    2 |             |
|  2 | DERIVED     | p          | range | PRIMARY       | PRIMARY | 4       | NULL |    2 | Using where |
+----+-------------+------------+-------+---------------+---------+---------+------+------+-------------+


I would rather suggest creating your own temporary table, executing the query by joining against it, and finally dropping the temporary table.

 create temporary table temp_parents(
 id int(11) not null,
 name varchar(30),
 index(id)
 )
 select id, name
 from parents 
where id > 4;



select p.name, count(s.id)
from students as s 
left join temp_parents as p 
on p.id = s.parent_id
group by p.id;

drop temporary table temp_parents;

The advantage of creating your own temp table is that you can define your own indexes and optimize the query. Also, as your join query runs against the temp table, the original huge table won't be locked during your query.

Note: Temp tables exist only within a session - i.e., the same table name can exist with different contents across two sessions, and the table expires as soon as the session is closed.
In the above create table query, we combined both CREATE and INSERT into a single query. See more options of CREATE TABLE here.

This temp table approach is especially useful when your parent table is huge (with millions of records) and your join clause needs to match only a few thousand rows from it.




Thursday, January 31, 2013

Common Functions and Expressions

Let's now peek into common functions in MySQL and when they are typically used.

Comparison Operators

Operator        | Operation                 | Usage               | Comments
=               | Equal                     | a = b               | true if equal, else false
!= or <>        | Not equal                 | a != b or a <> b    | true if not equal
<               | Less than                 | a < b               |
<=              | Less than or equal        | a <= b              |
>               | Greater than              | a > b               |
>=              | Greater than or equal     | a >= b              |
IN              | equal to any one in list  | a in (1,2,3)        | true if a=1 or a=2 or a=3
LIKE            | string matching           | a like 'C_d%'       | matches strings starting with C, followed by one character, then d, followed by any number of characters
BETWEEN ... AND | searches in a range       | a between 5 and 10  | true if 5 <= a <= 10


More, if interested, at: http://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html

Logical Operations

Operator | Operation                      | Usage
AND, &&  | Logical AND of two expressions | 1=2 AND 5>6
OR, ||   | Logical OR                     | 1=2 OR 5>6
NOT, !   | Negation                       | NOT a and b, !c
XOR      | Logical XOR                    | a XOR b

Control Functions

if - if(expr_to_check, true_expr, false_expr)
eg: 
if (5 > 3, "yes", "no")  => returns "yes"

ifnull - ifnull(expr_to_check, expr_if_first_is_null)
eg:
ifnull(5, 10) => returns 5
ifnull(5 and NULL, "was null") => returns "was null" (5 AND NULL evaluates to NULL)

case .. when
This is similar to switch..case in programming languages
eg:
case 1+3
when 4-2 then "4-2"
when 6-2 then "6-2"
when 4 then "4"
else "else"
end  
 => returns "6-2" (it won't reach WHEN 4, as WHEN 6-2 already matched)

Common Functions
concat 
eg: concat( "hello", " ", "world") => returns "hello world"

trim, rtrim, ltrim - Trim white spaces in a string
eg: trim(" a ") => returns "a"

date_format 
eg: date_format(now(), '%Y-%m-%d') => returns "2013-01-31"

adddate / subdate 
eg: adddate('2012-01-20', interval 5 day) => returns '2012-01-25'
INTERVAL is a keyword followed by the number of units and the unit to operate on

cast/convert - for type casting
eg: cast("124as" as unsigned) / convert("124as", unsigned) => returns 124
You might have noticed by now various instances of MySQL's implicit type casting happening here.


Some Tips
  • If a column is indexed, do not wrap it in a function if you want the index to be used. For example, cast(id as char) = "10" will not make use of the index on id (see the sketch below).
  • While using LIKE, if the search pattern starts with %, the index cannot be used.
  • concat(col1, col2) returns NULL if either one of them is NULL.
  • MySQL does implicit type casting, but queries can be a touch slower (as indexes might not be utilized) if the types do not match.
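A minimal sketch of the first tip, assuming the students table from the earlier posts with an index on id:

-- Can use the index on id: the indexed column is compared as-is
SELECT * FROM students WHERE id = 10;

-- Cannot use the index: the indexed column is wrapped in a function
SELECT * FROM students WHERE CAST(id AS CHAR) = '10';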
By now we have seen just the basics of running select queries, modifying data, and some basic DB schema design. It's time to move to phase 2, where we will dig deeper into everything we have seen so far.

Keep reading :)

Normalization


Now come a lot of rules and constraints, which we call normalization. They are applied at various levels, numbered as below.

1NF
The first normal form requires attributes to be atomic - meaning a column should contain only one piece of information. We should not save, say, the employee ID card number and the employee number in a single column separated by some delimiter. The schema we have is already in 1NF.
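A quick sketch of the idea, with hypothetical table and column names:

-- Violates 1NF: employee number and card number packed into one column
CREATE TABLE employees_bad (
    id INT AUTO_INCREMENT PRIMARY KEY,
    emp_and_card VARCHAR(40)    -- e.g. '1024|CARD-777'
);

-- 1NF: one atomic fact per column
CREATE TABLE employees (
    id INT AUTO_INCREMENT PRIMARY KEY,
    employee_number INT,
    card_number VARCHAR(20)
);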
2NF
The 2NF is a rule on the primary key: no column should depend on only a subset of the primary key. This is maintained in most cases with the help of an auto-increment id acting as the primary key.

3NF
The 3NF says that all the values in a table should depend on the primary key only. In other words, a column in a table should not be derivable from another column (or columns) in the same table. So in our case, employee_id and card_number in the swipe details and attendance register are mutually dependent: if you know one, you can get the other from the employee entity. So we will have to remove either the employee id or the card number. The employee id being the key identifier for the employee table, I would prefer to keep the employee id over the card number (also, the card number is a string, which occupies more space and is slower to search than a number).
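Continuing the hypothetical schema above, the 3NF fix keeps card_number only on the employee entity and stores just employee_id in the swipe details:

-- 3NF: card_number is derivable from employee_id via employees, so it is not repeated here
CREATE TABLE swipe_details (
    id INT AUTO_INCREMENT PRIMARY KEY,
    employee_id INT NOT NULL,
    swipe_time DATETIME,
    FOREIGN KEY (employee_id) REFERENCES employees(id)
);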

When I started my career, I always tried to apply normalization. But now I understand these are guidelines, not rules. The more you learn about the database and your application, the more you will find excuses not to follow them. If you are interested, read more on the same.