Database(Mysql,Oracle,PostgreSQL,MSSQL...)
原文:http://www.oschina.net/translate/the-road-to-500k-qps-with-mysql-5-7
英文:http://dimitrik.free.fr/blog/archives/2013/10/mysql-performance-the-road-to-500k-qps-with-mysql-57.html


本文提供 MySql5.7实现每秒50W查询 一文的细节以及基准测试结果,解释了我早期在Mysql Connect 发表的谈话。

回顾 MySQL / InnoDB 的改善历史。你能很容易发现。在MySQL 5.6稳定版本中从来没有read-only 这么快的提速,它很容易搞懂,以及在read-only(RO)有着良好的扩张性。也很期待它在read+write(RW)上达到一个较高水平。(特别是在读取数据是数据库主要工作的时候)


然而。我们对于RO在 MySQL 5.6的表现也十分的高兴,在5.7这个版本中,主要工作集中在 read+write (RW)上, 因为在大数据的处理上还没能达到我们的期望。但是RW依赖RO下。能够再次提高速度。 InnoDB 团队通过不断的改进,强烈的推进优化着5.7这个版本的每秒的性能。 

Linux上MySQL优化三板斧

[不指定 2014/12/08 11:31 | by ipaddr ]

原文: http://www.woqutech.com/?p=1200

现在MySQL运行的大部分环境都是在Linux上的,如何在Linux操作系统上根据MySQL进行优化,我们这里给出一些通用简单的策略。这些方法都有助于改进MySQL的性能。
闲话少说,进入正题。

一、CPU

首先从CPU说起。
你仔细检查的话,有些服务器上会有的一个有趣的现象:你cat /proc/cpuinfo时,会发现CPU的频率竟然跟它标称的频率不一样:

    #cat /proc/cpuinfo      processor : 5     model name : Intel(R) Xeon(R) CPU E5-2620 0 @2.00GHz     ...     cpu MHz : 1200.000

这个是Intel E5-2620的CPU,他是2.00G * 24的CPU,但是,我们发现第5颗CPU的频率为1.2G。
这是什么原因列?
这些其实都源于CPU最新的技术:节能模式。操作系统和CPU硬件配合,系统不繁忙的时候,为了节约电能和降低温度,它会将CPU降频。这对环保人士和抵制地球变暖来说是一个福音,但是对MySQL来说,可能是一个灾难。
为了保证MySQL能够充分利用CPU的资源,建议设置CPU为最大性能模式。这个设置可以在BIOS和操作系统中设置,当然,在BIOS中设置该选项更好,更彻底。由于各种BIOS类型的区别,设置为CPU为最大性能模式千差万别,我们这里就不具体展示怎么设置了。

 

二、内存

然后我们看看内存方面,我们有哪些可以优化的。

i)我们先看看numa
非一致存储访问结构 (NUMA : Non-Uniform Memory Access) 也是最新的内存管理技术。它和对称多处理器结构 (SMP : Symmetric Multi-Processor) 是对应的。简单的队别如下:

Smp numa

如图所示,详细的NUMA信息我们这里不介绍了。但是我们可以直观的看到:SMP访问内存的都是代价都是一样的;但是在NUMA架构下,本地内存的访问和非本地内存的访问代价是不一样的。对应的根据这个特性,操作系统上,我们可以设置进程的内存分配方式。目前支持的方式包括:

 

--interleave=nodes
--membind=nodes
--cpunodebind=nodes
--physcpubind=cpus
--localalloc
--preferred=node

简而言之,就是说,你可以指定内存在本地分配,在某几个CPU节点分配或者轮询分配。除非是设置为--interleave=nodes轮询分配方式,即内存可以在任意NUMA节点上分配这种方式以外。其他的方式就算其他NUMA节点上还有内存剩余,Linux也不会把剩余的内存分配给这个进程,而是采用SWAP的方式来获得内存。有经验的系统管理员或者DBA都知道SWAP导致的数据库性能下降有多么坑爹。
所以最简单的方法,还是关闭掉这个特性。
关闭特性的方法,分别有:可以从BIOS,操作系统,启动进程时临时关闭这个特性。
a)由于各种BIOS类型的区别,如何关闭NUMA千差万别,我们这里就不具体展示怎么设置了。
b)在操作系统中关闭,可以直接在/etc/grub.conf的kernel行最后添加numa=off,如下所示:

kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/mapper/VolGroup-root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=VolGroup/root rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=auto rd_LVM_LV=VolGroup/swap rhgb crashkernel=auto quiet KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM  numa=off  

另外可以设置 vm.zone_reclaim_mode=0尽量回收内存。
c)启动MySQL的时候,关闭NUMA特性:
  numactl --interleave=all  mysqld &

 

当然,最好的方式是在BIOS中关闭。

 

ii)我们再看看vm.swappiness。

vm.swappiness是操作系统控制物理内存交换出去的策略。它允许的值是一个百分比的值,最小为0,最大运行100,该值默认为60。vm.swappiness设置为0表示尽量少swap,100表示尽量将inactive的内存页交换出去。
具体的说:当内存基本用满的时候,系统会根据这个参数来判断是把内存中很少用到的inactive 内存交换出去,还是释放数据的cache。cache中缓存着从磁盘读出来的数据,根据程序的局部性原理,这些数据有可能在接下来又要被读取;inactive 内存顾名思义,就是那些被应用程序映射着,但是“长时间”不用的内存。
我们可以利用vmstat看到inactive的内存的数量:

#vmstat -an 1 
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free inact active si so bi bo in cs us sy id wa st
1 0 0 27522384 326928 1704644 0 0 0 153 11 10 0 0 100 0 0
0 0 0 27523300 326936 1704164 0 0 0 74 784 590 0 0 100 0 0
0 0 0 27523656 326936 1704692 0 0 8 8 439 1686 0 0 100 0 0
0 0 0 27524300 326916 1703412 0 0 4 52 198 262 0 0 100 0 0

 

通过/proc/meminfo 你可以看到更详细的信息:

#cat /proc/meminfo | grep -i inact 
Inactive: 326972 kB
Inactive(anon): 248 kB
Inactive(file): 326724 kB

 

这里我们对不活跃inactive内存进一步深入讨论。Linux中,内存可能处于三种状态:free,active和inactive。众所周知,Linux Kernel在内部维护了很多LRU列表用来管理内存,比如LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE , LRU_ACTIVE_FILE, LRU_UNEVICTABLE。其中LRU_INACTIVE_ANON, LRU_ACTIVE_ANON用来管理匿名页,LRU_INACTIVE_FILE , LRU_ACTIVE_FILE用来管理page caches页缓存。系统内核会根据内存页的访问情况,不定时的将活跃active内存被移到inactive列表中,这些inactive的内存可以被交换到swap中去。
一般来说,MySQL,特别是InnoDB管理内存缓存,它占用的内存比较多,不经常访问的内存也会不少,这些内存如果被Linux错误的交换出去了,将浪费很多CPU和IO资源。 InnoDB自己管理缓存,cache的文件数据来说占用了内存,对InnoDB几乎没有任何好处。
所以,我们在MySQL的服务器上最好设置vm.swappiness=0。

我们可以通过在sysctl.conf中添加一行:

echo "vm.swappiness = 0" >>/etc/sysctl.conf

 

并使用sysctl -p来使得该参数生效。

 

三、文件系统

最后,我们看一下文件系统的优化
i)我们建议在文件系统的mount参数上加上noatime,nobarrier两个选项。

用noatime mount的话,文件系统在程序访问对应的文件或者文件夹时,不会更新对应的access time。一般来说,Linux会给文件记录了三个时间,change time, modify time和access time。
我们可以通过stat来查看文件的三个时间:

stat libnids-1.16.tar.gz 
File: `libnids-1.16.tar.gz'
Size: 72309 Blocks: 152 IO Block: 4096 regular file
Device: 302h/770d Inode: 4113144 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access : 2008-05-27 15:13:03.000000000 +0800
Modify: 2004-03-10 12:25:09.000000000 +0800
Change: 2008-05-27 14:18:18.000000000 +0800

 

其中access time指文件最后一次被读取的时间,modify time指的是文件的文本内容最后发生变化的时间,change time指的是文件的inode最后发生变化(比如位置、用户属性、组属性等)的时间。一般来说,文件都是读多写少,而且我们也很少关心某一个文件最近什么时间被访问了。
所以,我们建议采用noatime选项,这样文件系统不记录access time,避免浪费资源。
现在的很多文件系统会在数据提交时强制底层设备刷新cache,避免数据丢失,称之为write barriers。但是,其实我们数据库服务器底层存储设备要么采用RAID卡,RAID卡本身的电池可以掉电保护;要么采用Flash卡,它也有自我保护机制,保证数据不会丢失。所以我们可以安全的使用nobarrier挂载文件系统。设置方法如下:
对于ext3, ext4和 reiserfs文件系统可以在mount时指定barrier=0;对于xfs可以指定nobarrier选项。

 

ii)文件系统上还有一个提高IO的优化万能钥匙,那就是deadline。

在Flash技术之前,我们都是使用机械磁盘存储数据的,机械磁盘的寻道时间是影响它速度的最重要因素,直接导致它的每秒可做的IO(IOPS)非常有限,为了尽量排序和合并多个请求,以达到一次寻道能够满足多次IO请求的目的,Linux文件系统设计了多种IO调度策略,已适用各种场景和存储设备。
Linux的IO调度策略包括:Deadline scheduler,Anticipatory scheduler,Completely Fair Queuing(CFQ),NOOP。每种调度策略的详细调度方式我们这里不详细描述,这里我们主要介绍CFQ和Deadline,CFQ是Linux内核2.6.18之后的默认调度策略,它声称对每一个 IO 请求都是公平的,这种调度策略对大部分应用都是适用的。但是如果数据库有两个请求,一个请求3次IO,一个请求10000次IO,由于绝对公平,3次IO的这个请求都需要跟其他10000个IO请求竞争,可能要等待上千个IO完成才能返回,导致它的响应时间非常慢。并且如果在处理的过程中,又有很多IO请求陆续发送过来,部分IO请求甚至可能一直无法得到调度被“饿死”。而deadline兼顾到一个请求不会在队列中等待太久导致饿死,对数据库这种应用来说更加适用。
实时设置,我们可以通过

echo deadline >/sys/block/sda/queue/scheduler

 

来将sda的调度策略设置为deadline。

我们也可以直接在/etc/grub.conf的kernel行最后添加elevator=deadline来永久生效。

 

 

总结

CPU方面
    关闭电源保护模式

内存:
    vm.swappiness = 0
    关闭numa

文件系统:
    用noatime,nobarrier挂载系统
    IO调度策略修改为deadline。

 

参考文档:
http://www.gentoo-wiki.info/FAQ_Linux_Memory_Management
http://bbs.gfan.com/android-4165836-1-1.html
https://wiki.archlinux.org/index.php/CPU_Frequency_Scaling_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)
http://www.mysqlperformanceblog.com/2013/12/07/linux-performance-tuning-tips-mysql/ 

和墨DBA:mysql安全删除大表

[不指定 2012/11/14 10:42 | by ipaddr ]

来源: http://www.mysqlsky.com/201211/large-tbl-drop

【问题隐患】
由于业务需求不断变化,可能在DB中存在超大表占用空间或影响性能;对这些表的处理操作,容易造成mysql性能急剧下降,IO性能占用严重等。先前有在生产库drop table造成服务不可用;rm 大文件造成io跑满,引发应用容灾;对大表的操作越轻柔越好
 
【解决办法】
1.通过硬链接减少mysql DDL时间,加快锁释放
2.通过truncate分段删除文件,避免IO hang
 
【生产案例】
某对mysql主备,主库写入较大时发现空间不足,需要紧急清理废弃大表,但不能影响应用访问响应
$ll /u01/mysql/data/test/tmp_large.ibd
-rw-r—– 1 mysql dba 289591525376 Mar 30  2012 tmp_large.ibd
270GB的大表删除变更过程如下:
#(备库先做灰度)
ln tmp_large.ibd /u01/bak/tmp_tbl.ibd  #建立硬链接
-rw-r—– 2 mysql dba 289591525376 Mar 30  2012 tmp_large.ibd
set session sql_log_bin=0;  #不计入bin log节省性能,并且防止主备不一致
desc test.tmp_large;
drop table test.tmp_large;
Query OK, 0 rows affected (10.46 sec)
mysql -uroot -e "start slave;"
 
cd /u01/bak;screen -S weixi_drop_table
for i in `seq 270 -1 1 ` ;do sleep 2;truncate -s ${i}G tmp_tbl.ibd;done
rm -rf tmp_tbl.ibd
 
【性能比较】
中间ctrl-C一次,可以看到truncate前后io的对比情况,基本上影响不大
 
文件大小也成功更新
 
 
【工具介绍】
truncate – shrink or extend the size of a file to the specified size
#来自coreutils工具集
wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.9.tar.gz
tar -zxvf coreutils-8.9.tar.gz
cd coreutils-8.9
./configure
make
sudo cp src/truncate /usr/bin/ 

Amoeba是一个类似MySQL Proxy的分布式数据库中间代理层软件,是由陈思儒开发的一个开源的java项目。其主要功能包括读写分离,垂直分库,水平分库等,经过测试,发现其功能和稳定性都非常的不错,如果需要构架分布式数据库环境,采用Amoeba是一个不错的方案。目前Amoeba一共包括For aladdin,For MySQL和For Oracle三个版本,本文主要关注For MySQL版本的一个读写分离实现。实际上垂直切分和水平切分的架构也相差不大,改动几个配置就可以轻松实现。

下图是一个采用Amoeba的读写分离技术结合MySQL的Master-Slave Replication的一个分布式系统的架构:
amoeba_mysql

Amoeba处于在应用和数据库之间,扮演一个中介的角色,将应用传递过来的SQL语句经过分析后,将写的语句交给Master库执行,将读的语句路由到Slave库执行(当然也可以到Master读,这个完全看配置)。Amoeba实现了简单的负载均衡(采用轮询算法)和Failover。如果配置了多个读的库,则任何一个读的库出现宕机,不会导致整个系统故障,Amoeba能自动将读请求路由到其他可用的库上,当然,写还是单点的依赖于Master数据库的,这个需要通过数据库的切换,或者水平分割等技术来提升Master库的可用性。

Amoeba可以在不同机器上启动多个,并且做同样的配置来进行水平扩展,以分担压力和提升可用性,可以将Amoeba和MySQL装在同一台机器,也可以装在不同的机器上,Amoeba本身不做数据缓存,所以对于内存消耗很少,主要是CPU占用。对于应用来说,图中的三个Amoeba就是三台一模一样的MySQL数据库,连接其中任何一台都是可以的,所以需要在应用端有一个Load balance和Failover的机制,需要连接数据库时从三台中随机挑选一台即可,如果其他任何一台出现故障,则可以自动Failover到剩余的可用机器上。MySQL的JDBC驱动从connector-j 3.17版本起已经提供了这样的负载均衡和故障切换的功能,那么剩下的事情对于应用来说就很简单了,不需要做太多的改动就能搭建一套高可用的MySQL分布式数据库环境,何乐而不为?

参考链接:
Amoeba开发者博客
Amoeba下载
Amoeba文档
JavaEye上关于Amoeba的讨论贴

Google开源Key-Value数据库LevelDB

[不指定 2011/08/01 10:54 | by ipaddr ]

Google宣布将LevelDB开源,并且遵守New BSD许可证。LevelDB是一个嵌入式的key-value数据库。它的键和关联值可以是任意的字节数组,并且按照键值排序,排序机制是可以被重载的。数据存储机制非常简单,仅仅支持Put,Get和Delete命令,然后还有前向和后向迭代遍历。

数据会自动使用Snappy压缩,这是一个压缩库,Google将其用于BigTable,MapReduce和RPC中,并且在四月份宣布开源。LevelDB也有一些局限:不支持SQL查询和索引,支持多线程单进程访问,并且可以用于嵌入式设备,这将使某些项目收益,但是也会给其他的项目来带麻烦。

LevelDB优化了批量写操作。它将多个修改请求有序缓存在内存中,在累计到配置文件预设置的阈值之后再写入到磁盘中。对于顺序和随机写操作,以及顺序读操作来说,它的性能非常优秀,根据Google的性能基准测试,它能在某些测试项目中得分领先SQLite两个数量级。SQLite在随机读操作中比LevelDB稍好,而在写入较大数据的时候速度两倍快于LevelDB。LevelDB同样也表现得比Kyoto Cabinet优秀,Kyoto Cabinet也是一个key-value数据库,不过Google并没有像SQLite那样在所有测试项目中均进行比较。同样,Riak进行了一些测试对比LevelDB和InnoDB,在一些测试项目中,Google的LevelDB要比InnoDB要优秀或者能达到相同性能。

LevelDB是使用C++编写,一些外部的依赖库已经成功地移植到Windows、Mac OS X、Android和各种Unix上。在实际的应用中,Chrome的一些实验性版本中已经使用了LevelDB,将其作为IndexDB API的实现。而Riak则将其用于节点级的存储。不仅如此,一家开发3D地图软件的公司UpNext也使用了这套系统。


http://www.mydumper.org/

Mydumper (aka. MySQL Data Dumper) is a high-performance multi-threaded backup (and restore) toolset for MySQL and Drizzle.  The main developers originally worked as Support Engineers at MySQL (one has moved to Facebook and another to SkySQL) and this is how they would envisage mysqldump based on years of user feedback.

Mydumper features

  • Lightweight C source
  • Up to 10x faster dumps compared to mysqldump
  • Consistent snapshots for transactional and non-transactional tables (in 0.2.2 onwards)
  • File compression on-the-fly
  • Binary log dumps
  • Multi-threaded restore utility (in 0.2.1 onwards)
  • Daemon mode for timed snapshots and continuous binary logs (in 0.5.0 onwards)
  • Open Source! (GNU GPLv3)

Mydumper is still under active development but is well tested/used in production on some large installations.


主从数据一致性校验的利器: mk-table-sync
http://www.maatkit.org/doc/mk-table-sync.html

Usage: mk-table-sync [OPTION...] DSN [DSN...]

mk-table-sync synchronizes data efficiently between MySQL tables.

This tool changes data, so for maximum safety, you should back up your data before you use it. When synchronizing a server that is a replication slave with the --replicate or --sync-to-master methods, it always makes the changes on the replication master, never the replication slave directly. This is in general the only safe way to bring a replica back in sync with its master; changes to the replica are usually the source of the problems in the first place. However, the changes it makes on the master should be no-op changes that set the data to their current values, and actually affect only the replica. Please read the detailed documentation that follows to learn more about this.

Sync db.tbl on host1 to host2:

  mk-table-sync --execute h=host1,D=db,t=tbl h=host2

Sync all tables on host1 to host2 and host3:

  mk-table-sync --execute host1 host2 host3

Make slave1 have the same data as its replication master:

  mk-table-sync --execute --sync-to-master slave1

Resolve differences that mk-table-checksum found on all slaves of master1:

  mk-table-sync --execute --replicate test.checksum master1

Same as above but only resolve differences on slave1:

  mk-table-sync --execute --replicate test.checksum \     --sync-to-master slave1

Sync master2 in a master-master replication configuration, where master2's copy of db.tbl is known or suspected to be incorrect:

  mk-table-sync --execute --sync-to-master h=master2,D=db,t=tbl

Note that in the master-master configuration, the following will NOT do what you want, because it will make changes directly on master2, which will then flow through replication and change master1's data:

  # Don't do this in a master-master setup!   mk-table-sync --execute h=master1,D=db,t=tbl master2
http://www.mysqlperformanceblog.com/2009/10/16/finding-your-mysql-high-availability-solution-%e2%80%93-the-questions/

How to determine which MySQL High-Availability solution is best?

What is really tricky with MySQL is the number of possible HA solutions. From the simplest the most complex let’s list the most common ones:

- MySQL replication with manual failover
- Master-Master with MMM manager
- Heartbeat/SAN
- Heartbeat/DRBD
- NDB Cluster

These technologies are by far, not a one size fits all and many deployments use combination of solutions. I will not cover ScaleDB and Continuent because I know almost nothing of these solutions. There are many more questions you need to ask yourself before being able to pick the right one. Below, I listed the most common questions, I might have missed some.

1. What level of HA do you need?

Since all the technologies do not offer the same level of availability, this is a first important sorting factor. Here are estimates of the level of availability offered by the various solutions.

 Level of availability
Simple replication98 to 99.9+%
Master-Master with MMM manager99%
Heartbeat/SAN (depends on SAN)99.5% to 99.9%
Heartbeat/DRBD99.9%
NDB Cluster99.999%

From the table, if your requirements are for 99.99%, you are restricted to NDB Cluster while if it is only 99% you have more options. I recall that the level of availability is hard to estimate and subject to debate. These are the usually accepted level of availability for these technologies.

2. Can you afford to lose data?

Obviously, if you are concerned about loss of data, you are most likely using the InnoDB storage engine, since MyISAM is not transactional and do not sync data to disk. Similarly, MySQL replication is an asynchronous process and although it is fairly fast at transferring data between the master and the slaves, there is a window of time where data loss is possible.

If you can afford to lose some data, you can consider “MySQL replication” and “Master-Master with MMM manager” otherwise, you can only consider the other three solutions.

 Data 100% safe
MySQL replicationno
Master-Master with MMM managerno
Heartbeat/SAN (depends on SAN)yes
Heartbeat/DRBD yes
NDB Clusteryes

3. Does your application use MyISAM only features?

There are some features like Full text indexes and GIS indexes that are supported only by MyISAM. The HA solutions that work well with MyISAM are “MySQL replication” and “Master-Master with MMM manager”. Depending on the application, the MyISAM Full text indexes might be replaced by another search engine like Sphinx in order to remove the restriction. There is no HA solution other than the ones based on replication that handles GIS indexes.

 HA solutions
Need MyISAM Full text or GIS indexes “MySQL replication” and “Master-Master with MMM manager”
Don’t use any special MyISAM featureAll
Can change MyISAM Full text to SphinxAll

4. What is the write load?

The HA solutions we present are not equal in term of their write capacity. Due to the way replication is implemented, only one thread on the slave can handle the write operations. If the replication master is multi-cores servers and is heavily writing using multiple threads, the slaves will likely not be able to keep up. Replication is not the only technology that put a strain on the write capacity, DRBD, a shared storage emulator for Linux, also reduce by about 30% (very dependent on hardware) the write capacity of a database server. In term of write capacity here are you choices.

 Write capacity
MySQL replicationFair
Master-Master with MMM managerFair
Heartbeat/SAN (depends on SAN)Excellent
Heartbeat/DRBDGood
NDB ClusterExcellent

5. For what level of growth are you planning?

Since NDB Cluster is an integrated sharding environment, if you are planning for a growth that will need sharding (splitting the database over multiple servers), then you might need to take a serious at that solution. If not, then, apart from the write capacity, all the solutions are approximately equal.

6. How qualified is your staff or support company?

There is a quite direct relationship between the level of availability and the complexity of the solution. In order to reach the promised level of availability, the staff maintaining the HA setup, either internal or external, must have the required level of expertise. The required expertise level is summarized in the table below.

 Expertise level
MySQL replicationTypical, average MySQL DBA + some Sysadmin skills
Master-Master with MMM managerGood, average MySQL DBA + good Sysadmin skills
Heartbeat/SAN (depends on SAN)High, Good MySQL DBA + strong Sysadmin skills
Heartbeat/DRBDHigh, Good MySQL DBA + strong Sysadmin skills
NDB ClusterVery high, Specific NDB knowledge, strom MySQL skills and strong Sysadmin skills

7. How deep are your pocket?

The last aspect that needs to be considered is the budget, complexity is expensive. We will consider two types of setup. The first one is a basic proof of concept of the technology with the hardware tested, the data imported and basic testing and documentation. A proof of concept setup is a good way to get used to a technology and experiment with it in a test environment. The other type of setup we will consider is a full production setup that includes extensive testing, fire drills, full documentation, monitoring, alerting, backups, migration to production and post migration monitoring. Of course, it is the safest way to migrate an HA solution to production. All the times here are estimates based on field experience, the values presented here are fairly typical and contains some buffers for unexpected problems. Although an HA solution can be built remotely through a KVM over IP and adequate remote power management, an on site intervention with physical access to the servers is the preferred way, especially for the most complex solutions.

 Proof of conceptMigration to Production
MySQL replication4 hours12 hours
Master-Master with MMM manager8 hours24 hours
Heartbeat/SAN (depends on SAN)32 hours120 hours
Heartbeat/DRBD40 hours120 hours
NDB Cluster40 hours120 hours+

Editor’s Note: We’ve gotten many questions about the time estimates mentioned here. The above estimates shouldn’t be used to compare against any specific situation. Time will vary greatly depending on your project. For example, “setting up replication” can be as simple as CHANGE MASTER TO, and can take as little as a few minutes in some circumstances. Yves’s estimate is for a project to create a replication slave for HA purposes, not for “setting up replication.” There is a big difference between an HA project and a DBA task. – Baron Schwartz

DRBD(Distributed Replicated Block Device),分布式复制块设备,是一种通过TCP/IP网络实现块设备数据实时镜像的方案。利用这种方案,单一主节点模式(single primary mode)双机系统能够实时地将业务数据保存在主备节点的磁盘中,正常情况下两个节点的数据是一模一样的。

HeartBeat在此是用来做热备切换的,两节点间通过心跳线连接,如果主节点死机的话,可以通过HA将DRBD快速切换到另外一个节点,同时抢占虚拟IP,并启动MySQL,对MySQL客户端来说,只有很短时间的中断。此模式能有效提高MySQL的可用性。

此文档参考了DRBD(http://www.drbd.org), HA(http://linux-ha.org), MySQL的相关官方文档。安装笔记由鱼漂(ipaddr)原创,最新版本可以在http://www.eit.name/找到,转载请注明。

--------------------------------
一.环境说明
--------------------------------
OS: CentOS 5.5 32位
MySQL: 系统自带
DRBD: 8.3.8.1编译安装 (http://www.drbd.org/)
HeartBeat: 3.0.3编译安装 (http://linux-ha.org/)

====Node1====
Hostname: db_node1.systemadmin.cn
Ip: 192.168.0.201
====Node2====
Hostname: db_node2.systemadmin.cn
Ip: 192.168.0.202

HB的虚IP: 192.168.0.200

用来做DRBD的分区:/dev/sdb1
挂载点:/data/dbdata

--------------------------------
二. 编译安装DRBD
--------------------------------
www.drbd.org下载drbd-8.3.8.1.tar.gz,分别在两个节点上编译安装:
注意,DRBD需要内核模块支持,2.6.33后的Linux内核已内置DRBD的支持,在此版本前,安装时需要kernel-devel包,同时编译drbd时要带上--with-km。

#./configure --prefix=/ --with-km
#make
#make install
#chkconfig --add drbd
#chkconfig drbd on

检查安装是否成功:
#drbdadm
看是否以执行成功?
#cat /proc/drbd
是否可以看到版本信息

重启后再次检查cat /proc/drbd

--------------------------------
三. 配置DRBD
--------------------------------
注意,新版本的DRBD的配置文件已拆分成多个,全部放在/etc/drbd.d/目录,具体可参考/etc/drbd.conf中的配置.
如果开了iptables防火墙,需要打开对应的7789端口。

在主节点配置/etc/drbd.d/dbdata.res:
=========dbdata.res begin===========
resource dbdata {
on db_node1.systemadmin.cn {
device /dev/drbd1;
disk /dev/sdb1;
address 192.168.0.201:7789;
meta-disk internal;
}
on db_node2.systemadmin.cn {
device /dev/drbd1;
disk /dev/sdb1;
address 192.168.0.202:7789;
meta-disk internal;
}
}
=========dbdata.res end===========

创建drbd设备
#drbdadm create-md all
#/etc/init.d/drbd restart

查看是否有相应的块设备:
#ls -l /dev/drbd1
#cat /proc/drbd

设置当前节点为主节点并进行格式化:
#drbdadm -- --overwrite-data-of-peer primary all
#mkfs.ext3 /dev/drbd1
#mount /dev/drbd1 /data/dbdata
#mkdir /data/dbdata/etc
#mkdir /data/dbdata/data

在从节点做如下配置:
复制dbdata.res到从节点的/etc/drbd.d/
#drbdadm create-md all
#/etc/init.d/drbd restart

通过cat /prod/drbd查看是否开始复制,以及两节点的状态。

此时开始同步两个节点的磁盘,需要一定时间,在同步完成前,请不要重启,否则会重新同步。

同步完成后,如果两个节点都是Secondary/Secondary,并且已经UpToDate了,可以挑一个节点设置为主节点:
#drbdadm primary all

DRDB的一些常规操作:
切换主从节点前,要注意umount
主从切换: drbdadm primary all, drbdadm secondary all
节点间连接:drbdadm connect|disconnect all

DRBD脑裂后的处理:
脑裂后,两个节点间数据不同步,主从关系失效,需要按下面的步骤修复:
a.在从节点如下操作:
#drbdadm secondary dbdata
#drbdadm -- --discard-my-data connect dbdata
b.在主节点上,通过cat /proc/drbd查看状态,如果不是WFConnection状态,需要再手动连接:
#drbdadm connect dbdadta

--------------------------------
四. 迁移MySQL到DRBD磁盘
--------------------------------
确保drbd当前工作正常,将/data/dbdata加载,再在主节点按如下步骤操作。
a. 关闭MySQL: /etc/init.d/mysqld stop
b. #mv /etc/my.cnf /data/dbdata/etc/
c. #ln -s /data/dbdata/etc/my.cnf /etc/my.cnf
d. 修改my.cnf,将数文件文件指向 /data/dbdata/data
e. 将原来的数据文件移动到 /data/dbdata/data
f. 再启动MySQL

从节点只需要删除/etc/my.cnf,同时,建立链接到/data/dbdata/etc/my.cnf即可。

现在已完成MySQL+DRBD的配置,但切换时仍然是手动进行,假设主节点从node1转到node2,手动操作如下:
a. 在node1上停止MySQL
b. 在node1上unmout /data/dbdata
c. 在node1上执地drbdadm secondary all,两个节点都是secondary状态后,才可以将node2设置为primary
d. 在node2上执行drbdadm primary all,将node2变成主节点
e. 观察两个节点的/proc/drbd是否正常
f. 在node2上mount /dev/drbd1 /data/dbdata
g. 在node2上面启动MySQL

鱼漂建议:据说MyISAM在DRBD上容易造成数据丢失,或表文件损坏,所以此模式下最好用InnoDB,打开Binlog,同时,innodb_flush_log_at_trx_commit=1可以保证数据完整,但会严重影响性能。设置为2的话性能约有10倍提升,不过故障时可能会丢失1秒的数据。

--------------------------------
五. 安装配置HA
--------------------------------
需分别下载和安装:
* Cluster Glue 1.0.6: glue-1.0.6.tar.bz2
* Resource Agents 1.0.3: agents-1.0.3.tar.bz2
* Heartbeat 3.0.3: Heartbeat-3-0-STABLE-3.0.3.tar.bz2

鱼漂特别提醒,编译安装过程中,还会连接网络下载文件和校验xml,请在安装前确保网络连接和DNS解析正常。

A. 先安装glue
1. 从linux-ha.org下载 glue-1.0.6.tar.bz2
2. 解压
3. 编译安装:
./autogen.sh
./configure

打开lib/stonith/main.c,将version相关的注释掉,否则编译不过,有点诡异。
1、找到其64行,将其注释掉。
2、找到其76到81行全部注释掉。
3、找到其390行,将其注释

groupadd haclient
useradd -g haclient hacluster
make
make install

B. 安装Cluster agent
#./autogen.sh
#./configure
#make
#make install


C. 安装HA3.0.3
下载HA3.0.3解压并进入到源代码目录:
#./bootstrap
#./ConfigureMe configure
#make
#make install

D. 配置/etc/ha.d/ha.cf
=========ha.cf begin==============
logfacility local0
keepalive 500ms
deadtime 10
warntime 5
initdead 60
mcast eth0 225.0.0.1 694 1 0
auto_failback off
node db_node1.systemadmin.cn
node db_node2.systemadmin.cn
=========ha.cf end==============
注意,防火墙需要打开端口为694,目标Ip为225.0.0.1的udp通信

E. 配置/etc/ha.d/haresources
========haresource begin===========
db_node1.systemadmin.cn drbddisk Filesystem::/dev/drbd1::/data/dbdata::ext3 IPaddr::192.168.0.200/24/eth0 mysql
========haresource end===========

F.生成/etc/ha.d/authkeys
#( echo -ne "auth 1\n1 sha1 "; \
dd if=/dev/urandom bs=512 count=1 | openssl md5 ) \
> /etc/ha.d/authkeys
#chmod 0600 /etc/ha.d/authkeys


G. 编写MySQL启动脚本/etc/ha.d/resource.d/mysql
=========mysql being============
#!/bin/bash
#
# Author: ipaddr ( admin.net [#] 163.com )
#
. /etc/ha.d/shellfuncs
case "$1" in
start)
res=`/etc/init.d/mysqld start`
ret=$?
ha_log $res
exit $ret
;;
stop)
res=`/etc/init.d/mysqld stop`
ret=$?
ha_log $res
exit $ret
;;
status)
if [[ `ps -ef | grep '[m]ysqld'` > 1 ]] ; then
echo "running"
else
echo "stopped"
fi
;;
*)
echo "Usage: mysql {start|stop|status}"
exit 1
;;
esac
exit 0
=========mysql end============


F. 拷贝一些常用脚本
1. 从Heartbeat-3-0源码目录的heartbeat/lib/复制hb_standby, hb_takeover到/usr/lib/heartbeat 并设置为可执行.
2. 将Cluster agent的一些文件拷到/etc/ha.d/:
cp /usr/etc/ha.d/shellfuncs /etc/ha.d/
cp -rf /etc/ha.d/* /usr/etc/ha.d/
(这个操作很奇怪,Cluster Agent与HA使用不同的目录,但又要共用文件,不知在编译agent时是否可以使用--prefix=/来解决,由于编译时间实在太长,所有没有测试)

G. 使用:
1. 设置为自启动:
chkconfig mysqld off
chkconfig --add heartbeat
chkconfig heartbeat on

2. 手动切换两个节点:
/usr/lib/heartbeat/hb_takeover 将当前节点设为主节点
/usr/lib/heartbeat/hb_standby 将当前节点设为从节点
3. 查看HA状态
查看当前节点heartbeat状态:
#/usr/lib/heartbeat/heartbeat -s

查看当前节点heartbeat状态:
#/usr/bin/cl_status hbstatus

列出所有节点:
#/usr/bin/cl_status listnodes
(如果是64位系统,HA的安装目录通常为/usr/lib64/)

一些题外话:
个人觉得HA3.0还不够成熟,尤其是在编译安装时,非常麻烦,主要有几个小问题:
a. glue必须把version相关的注释掉才可以编译通过,很诡异;
b. 个人认为HA是一个整体,把它拆成三个软件包编译安装实在是很难理解;
c. 安装完后,还需要手动复制一些脚本文件,另外,agent和ha默认使用不同的/etc/目录也很郁闷;
d. 几个包都要从网络下载xml文件,编译过程实在太久,花了一个下午才把agent编译完成,为什么不把相关xml打包在源码中一起下载呢?

当然,瑕不掩玉,HA还是非常了不起的一个开源作品。

附 MySQL的DRBD文档:

MySQL Server Memory Usage

[不指定 2009/12/25 10:50 | by ipaddr ]

Every so often people ask me the question how should they estimate memory consumption by MySQL Server in given configuration. What is the formula they could use.

  经常有人问我配置MySQL时该如何估算内存的消耗。那么该使用什么公式来计算呢?

  The reasons to worry about memory usage are quite understandable. If you configure MySQL Server so it uses too small amount of memory it will likey perform suboptimally. If you however configure it so it consumes too much memory it may be crashing , failing to execute queries or make operation to swap seriously slowing down. On now legacy 32bit platforms you could also run out of address space so that had to be watched as well.
Having said so, I do not think looking for the secret fomula to compute your possible memory usage is the right approach to this problem. The reasons are - this formula is very complex nowadays and what is even more important “theoretically possible” maximum it provides have nothing to do with real memory consumptions. In fact typical server with 8GB of memory will often run with maximum theoretical memory usage of 100GB or more. Furthermore there is no easy “overcommit factor” you can use - it really depends on application and configuration. Some applications will drive server to 10% of theoretical memory consumptions others only to 1%.

  关心内存怎么使用的原因是可以理解的。如果配置MySQL服务器使用太少的内存会导致性能不是最优的;如果配置了太多的内存则会导致崩溃,无法执行查询或者导致交换操作严重变慢。在现在的32位平台下,仍有可能把所有的地址空间都用完了,因此需要监视着。
话虽如此,但我并不觉得找到什么可以计算内存使用的秘诀公式就能很好地解决这个问题。原因有 -- 如今这个公式已经很复杂了,更重要的是,通过它计算得到的值只是“理论可能”并不是真正消耗的值。事实上,有8GB内存的常规服务器经常能运行到最大的理论值 -- 100GB甚至更高。此外,你轻易不会使用到“超额因素” -- 它实际上依赖于应用以及配置。一些应用可能需要理论内存的 10% 而有些仅需 1%。

  So what could you do instead ? First take a look at global buffers which are allocated at start and always where - these are key_buffer_size, innodb_buffer_pool_size, innodb_additional_memory_pool_size, innodb_log_buffer_size, query_cache_size. If you’re using MyISAM seriously you can also add the size of Operation System cache you would like MySQL to use for your table. Take this number add to it number of memory Operation System and other applications need, add might be 32MB more for MySQL Server code and various small static buffers. This is memory which you can consider used when you just start MySQL Server. The rest of memory is available for connections. For exampe with 8GB server you might have everything listed adding up to 6GB, so you have 2GB left for your threads.

 那么,我们可以做什么呢?首先,来看看那些在启动时就需要分配并且总是存在的全局缓冲 -- key_buffer_size, innodb_buffer_pool_size, innodb_additional_memory_pool_size, innodb_log_buffer_size, query_cache_size。如果你大量地使用MyISAM表,那么你也可以增加操作系统的缓存空间使得MySQL也能用得着。把这些也都加到操作系统和应用程序所需的内存值之中,可能需要增加32MB甚至更多的内存给MySQL服务器代码以及各种不同的小静态缓冲。这些就是你需要考虑的在 MySQL服务器启动时所需的内存。其他剩下的内存用于连接。例如有8GB内存的服务器,可能监听所有的服务就用了6GB的内存,剩下的2GB内存则留下来给线程使用。

  Each thread connecting to MySQL Server will needs its own buffers. About 256K is allocated at once even if thread is idle - they are used by default thread stack, net buffer etc. If transaction is started some more space can add up. Running small queries might only barely increase memory consumption for given thread, however if table will perform complex operations such as full table scans, sorts, or need temporary tables as much as read_buffer_size, sort_buffer_size, read_rnd_buffer_size, tmp_table_size of memory might be allocated. But they are only allocated upon the need and freed once given stage of query is done. Some of them are allocated as single chunk at once others, for example tmp_table_size is rather maximum amount of memory MySQL will allocate for this operation. Note it is more complicated than once may think - multiple buffers of the same type might be allocated for exampe to handle subqueries. For some special queries memory usage might be even larger - bulk inserts may allocate bulk_insert_buffer_size bytes of memory if done to MyISAM tables. myisam_sort_buffer_size used for ALTER TABLE, OPTIMIZE TABLE, REPAIR TABLE commands.

  每个连接到MySQL服务器的线程都需要有自己的缓冲。大概需要立刻分配256K,甚至在线程空闲时 -- 它们使用默认的线程堆栈,网络缓存等。事务开始之后,则需要增加更多的空间。运行较小的查询可能仅给指定的线程增加少量的内存消耗,然而如果对数据表做复杂的操作例如扫描、排序或者需要临时表,则需分配大约 read_buffer_size, sort_buffer_size, read_rnd_buffer_size, tmp_table_size 大小的内存空间。不过它们只是在需要的时候才分配,并且在那些操作做完之后就释放了。有的是立刻分配成单独的组块,例如 tmp_table_size 可能高达MySQL所能分配给这个操作的最大内存空间了。注意,这里需要考虑的不只有一点 -- 可能会分配多个同一种类型的缓存,例如用来处理子查询。一些特殊的查询的内存使用量可能更大 -- 如果在MyISAM表上做成批的插入时需要分配 bulk_insert_buffer_size 大小的内存。执行 ALTER TABLE, OPTIMIZE TABLE, REPAIR TABLE 命令时需要分配 myisam_sort_buffer_size 大小的内存。

For OLTP applications with simple queries memory consumption is often less than 1MB per thread with default buffers, and you really do not need to increase per thread buffers unless you have complex queries. Sorting 10 rows will be as fast with 1MB sort buffer as with 16MB (actually 16MB might be even slower but it is other story).

  只有简单查询OLTP应用的内存消耗经常是使用默认缓冲的每个线程小于1MB,除非需要使用复杂的查询否则无需增加每个线程的缓冲大小。使用1MB的缓冲来对10行记录进行排序和用16MB的缓冲基本是一样快的(实际上16MB可能会更慢,不过这是其他方面的事了)。

  Another approach you may take is to come up with amount of memory you want MySQL Server to consume at peak. This can be easily computed by memory needed for OS, File Cache and other applications. For 32bit envinronment you also should keep 32bit limits into account and probably limit “mysqld” size to about 2.5GB (exact number depens on a lot of factors). Now you can use “ps aux” to see VSZ - Virtual Memory allocated by MySQL process. You can also look at “Resident Memory” but I find it less helpful as it may down because of swapping - not what you would like to see. Monitor how the value changes so you know memory requirements with current settings and increase/decrease values appropriately.

  另外,就是找出MySQL服务器内存消耗的峰值。这很容易就能计算出操作系统所需的内存、文件缓存以及其他应用。在32位环境下,还需要考虑到32位的限制,限制 “mysqld” 的值大约为2.5G(实际上还要考虑到很多其他因素)。现在运行 “ps aux” 命令来查看 VSZ 的值 -- MySQL 进程分配的虚拟内存。也可以查看 “Resident Memory” 的值,不过我想它可能没多大用处,因为它会由于交换而变小 -- 这并不是你想看到的。监视着内存变化的值,就能知道是需要增加/减少当前的内存值了。

  Some may say, Hey we want to have 100% guarantee our server will never run out of memory, no matter which queries or users will decide to run. Unfortunately this is as much close to impossible to be impractical. Here is why:

可能有的人想说,我们想要让服务器能保证100%不会耗尽内存,不管决定用什么样的查询、什么样的用户。很不幸,这其实很不明智也不可能,因为:

  List of rarely considered MySQL Server Memory Requirements

  以下是很少考虑的MySQL服务器内存需求

  Thread buffers can be allocated more than once for each thread. Consider for example subqueries - each layer may need its own read_buffer,sort_buffer, tmp_table_size etc
每个线程可能会不止一次需要分配缓冲。 考虑到例如子查询 -- 每层都需要有自己的 read_buffer,sort_buffer, tmp_table_size 等。
Many variabes can be set per connection. So you can’t relay on global values if developers may use their local values to run some queries.
在每个连接中很多变量都可能需要重新设置。 如果开发者想设定自己的变量值来运行某些查询就不能继续使用全局值。
There can be mutiple key caches. Multiple key caches can be created to accomodate query executions
可能有多个索引缓存。 为了配合执行查询可能会创建多个索引缓存。
Query Parsing and optimization needs memory. This is usually small to be ignored but certain queries can have very large memory requrement for this step, especially specially crafted ones.
解析查询和优化都需要内存。 这些内存通常比较小,可以忽略,不过如果是某些查询在这个步骤中则需要大量内存,尤其是那些设计的比较特别的查询。
Stored Procedures. Compex stored procedures may require a lot of memory
存储过程。 复杂的存储过程可能会需要大量内存。
Prepared statements and Cursors. Single connection may have many prepared statements and cursors. Their number finally can be limited but each of them still can have very large memory consumption
准备查询语句以及游标。 单次链接可能会有很多的准备好的语句以及游标。它们的数量最后可以限定,但是仍然会消耗大量的内存。
Innodb Table Cache. Innodb has its own table cache in which meta data about each table Accessed from the start is stored. It is never purged and may be large if you have a lot of tables. It also means user having CREATE TABLE privilege should be able to run MySQL Server out of memory
Innodb表缓存。 Innnodb表有自己的缓存,它保存了从一开始访问每个表的元数据。它们从未被清除过,如果有很多Innodb表的话,那么这个量就很大了。这也就意味着拥有 CREATE TABLE 权限的用户就可能把MySQL服务器的内存耗尽。
MyISAM buffers. MyISAM may allocate buffer which is large enough to contain largest record in the given table which is held until table is closed.
MyISAM缓冲。 MyISAM表可能会分配一个足以装下指定表最大记录的缓冲,而且这个缓冲直到表关闭了才释放。

 Federated Storage Engine. This may have unbound memory requirements retriving result sets from remove queries.
FEDERATED存储引擎。 This may have unbound memory requirements retriving result sets from remove queries.
Blobs may require 3x time of memory. This is important if you’re deaing with large Blobs (your max_allowed_packet is large) Processing of 256MB of blob may require 768MB of memory.
Blobs可能需要3倍的内存。 这在处理很大(max_allowed_packet 的值较大)的Blobs数据时很重要,如果处理256MB的数据可能需要768MB的内存。
Storage Engines. In general storage engines may have their own per thread or global memory allocations which are not tuned as buffers. Watch for these especially now with many storage engines being released for MySQL by various parties.
存储引擎。 通常情况下,存储引擎会设置自己的每个线程的全局分配内存,它通常不能像缓存一样可以调节。现在应该通过各种方式来特别关注MySQL释放出来的存储引擎。
I do not pretend this to be complete list. On the contrary I’m quite sure I’ve missed something (drop me a note if you have something to add). But the main point is - there are a lot of memory consumers out where and trying to find peak possible usage for each is impractical - so my advice would be measure what you get in practice and how memory consumption reacts to changing various variables. For example you may find out increasing sort_buffer_size from 1MB to 4MB and 1000 max_connections increases peak memory consumption just 30MB not 3000MB as you might have counted.

  我想这还不是完成的列表,相反地,我觉得还是漏掉了一些(如果你知道,请给我回复加上)。但主要的原因是 -- 找到每次内存消耗峰值是不切实际的,因此我的这些建议可以用来衡量一下你实际修改一些变量值产生的反应。例如,把 sort_buffer_size 从1MB增加到4MB并且在 max_connections 为 1000 的情况下,内存消耗增长峰值并不是你所计算的3000MB而是30MB。

分页: 1/5 第一页 1 2 3 4 5 下页 最后页 [ 显示模式: 摘要 | 列表 ]