并发环境频繁死锁如何解决

lilinew 2011-03-10 08:07:00

发现一个长期的死锁丢失数据问题
发现数据丢失比较验证,以前知道但是没有量化。现在脚本记录发现一天有10来起

环境介绍
3台centos5 mysql5.0.45 web是tomcat java.
是2台机器(web服务器装有mysql) 同时向db mysql中心服务器写入数据造成的deadlock 发生数据丢失比如
show innodb status \g 得到如下结果
------------------------
LATEST DETECTED DEADLOCK
------------------------
110310 11:15:24
*** (1) TRANSACTION:
TRANSACTION 1 2454545738, ACTIVE 0 sec, process no 3501, OS thread id 1195465024 inserting
mysql tables in use 1, locked 1
............................
Bruce_lee(942561070) 19:29:57
ip 10.0.1.1
MySQL thread id 34501, query id 1170219884 prod2a 192.168.10.15 betbrain update
INSERT INTO Out (id, aId, b,c,e,f,g,h,i,k,...) values (8081,5209319,a,47,2,3,null,null,null,4.0,null,'C','N','Y') ,(8082,5209319,a,47,2,4,null,null,null,4.0,null,'C','N','Y')。。。。。有时候多达一百行
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 345492 n bits 232 index `PRIMARY` of table `db/Out` trx id 0 412100899 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0

ip 10.0.1.2
MySQL thread id 34533, query id 1170219891 prod2b 192.168.10.25 betbrain update
INSERT INTO Out (id, aId, b,c,e,f,g,h,i,k,...) values (8083,5209320,a,43,1,1,27330,null,null,null,null,'C','N','Y') ,(8084,5209320,a,43,1,1,27357,null,null,null,null,'C','N','Y'))。。。。。有时候多达一百行
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 345492 n bits 232 index `PRIMARY` of table `db/Out` trx id 0 412100900 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;

*** WE ROLL BACK TRANSACTION (2)
(2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 2191204 n bits 704 index `idx_BOH_id` of table out` trx id 1 2454545745 lock_mode X locks gap before rec
Record lock

分析原因
192.168.10.15
192.168.10.25
2台同时向机器 ip 10.0.1.1写入不同的 INSERT INTO Out (id, aId, b,c数据

观察在线db 表情况,最理想的情况是192.168.10.15
或者192.168.10.25有一边的数据插入成功. 一般2边插入都失败也就是那个时间点一个数据都没有进入.
类似out这样的每天能发生死锁表有大概5个.

一天少的情况下10起左右
脚本trace1-2周发现比想象的要严重. 数据清理后还有10起,如果严重的时候估计快16起左右.
能发生死锁表的记录数目
100万
150万
230万
70万
50万
都是gingle表类似上面的一次性插入多行数据。

开发架构 java 代码提交的。在线数据每1秒都在不间断提交数据。(2台web接受外界服务器通过哦tomcat的 java sock不停的推数据到本身的web服务器。之后web产生sql语句（一个循环分段提交提交所以上面是。。。。。有时候多达一百行）插入到中心数据库服务器)
机器已经升级。硬件好一倍死锁减少非常多。现在很少。但还是有，希望能从根本上解决这个问题。

...全文

322 6 打赏收藏转发到动态举报

写回复

用AI写文章

6 条回复

切换为时间正序

请发表友善的回复…

发表回复

小小小小周 2011-03-11

打赏
举报

这个要顶,求解答

lilinew 2011-03-11

打赏
举报

6:36:49这一秒的总共300多条.与 out有关的表有
25条. 全是单表的update insert 比如
update
insert等 . 如果是这样模拟的难度很大.不过还是可以test

lilinew 2011-03-11

打赏
举报

观察并发环境的binlog,在6:36:49这一秒有280--365条更新语句,取中间值300多条(insert update delete)语句 .

这个6:36:48有1条慢查询 2秒
6:36:50有2条满查询. 2 4秒但是这些慢查询sql跟out没有任何联系的.

基本可以断定是该秒300多条更新语句同时执行对语句有影响

lilinew 2011-03-11

打赏
举报

我用linux的多进程在不同用户在同一个时间点crontab 2个复杂(总体很复扎 )的insert sql 各自有51条记录组成一个sql文件. 也就是在同一时间点模拟deadlock是否能发生. 遗憾失败,2边记录同时插入,没有发现有死锁.思考在单机器 ,环境过于简单.而在线的时间是几十个进程都在select insert,很可能一边读(select)一边insert造成的deadlock ). 另外我在本地机器仅仅一个机器 ,上面的问题是2台机器一个时间点向另外1台机器的同一个表插入数据(同时该表还有好多的查询操作).

本来准备用存储过程再模拟deadlock的发生,---------------直觉只要我能把这个情况精确模拟,肯定能解决问题. 反向推. 但是现实很难模拟当时高并发在线的实际情况

lilinew 2011-03-10

打赏
举报

(2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 16403 n bits 72 index `PRIMARY` of table `test/iz` trx id 0 42313620 lock_mode X locks rec but not gap
Record lock, heap no 5 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000002; asc ;; 1: len 6; hex 00000285a78f; asc ;; 2: len 7; hex 00000040150110; asc @ ;; 3: len 10; hex 61202020202020202020; asc a ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 16403 n bits 72 index `PRIMARY` of table `test/iz` trx id 0 42313620 lock_mode X locks rec but not gap waiting
Record lock, heap no 4 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000001; asc ;; 1: len 6; hex 00000285a78e; asc ;; 2: len 7; hex 000000003411d9; asc 4 ;; 3: len 10; hex 61202020202020202020; asc a ;;

*** WE ROLL BACK TRANSACTION (2)

For last deadlock Innodb shows transactions which caused deadlocks, their state during deadlock, what locks they were holding and what they were waiting for, which of transactions Innodb decided to roll back to resolve deadlock. Note - Innodb only prints information about few of the locks which transaction is holding. Also only last statement from each transactions is displayed, while locks rows could be locked by one of previous statements. For complex deadlock investigations you might need to look at the log files to find truly conflicting statements. For most simple cases information from SHOW INNODB STATUS is good enough.

As for deadlock information we have similar information about last failed foreign key constraint:

PLAIN TEXTCODE: ------------------------
LATEST FOREIGN KEY ERROR
------------------------
060717 4:29:00 Transaction:
TRANSACTION 0 336342767, ACTIVE 0 sec, process no 3946, OS thread id 1151088992 inserting, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
3 lock struct(s), heap size 368, undo log entries 1
MySQL thread id 9697561, query id 188161264 localhost root update
insert into child values(2,2)
Foreign key constraint fails for table `test/child`:
,
CONSTRAINT `child_ibfk_1` FOREIGN KEY (`parent_id`) REFERENCES `parent` (`id`) ON DELETE CASCADE
Trying to add in child table, in index `par_ind` tuple:
DATA TUPLE: 2 fields;
0: len 4; hex 80000002; asc ;; 1: len 6; hex 000000000401; asc ;;

But in parent table `test/parent`, in index `PRIMARY`,
the closest match we can find is record:
PHYSICAL RECORD: n_fields 3; 1-byte offs TRUE; info bits 0
0: len 4; hex 80000001; asc ;; 1: len 6; hex 0000140c2d8f; asc - ;; 2: len 7; hex 80009c40050084; asc @ ;;

Innodb will print statement which caused error. Definition of foreign key which failed as well as the closest match which was located in parent table. There is a lot of scary information in hex but it is not important for most diagnostic cases - It is left by Innodb developers so SHOW INNODB STATUS can also be used as debugging tool.

Next section you will find in SHOW INNODB STATUS is information about currently active transactions:

PLAIN TEXTCODE: ------------
TRANSACTIONS
------------
Trx id counter 0 80157601
Purge done for trx's n:o <0 80154573 undo n:o <0 0
History list length 6
Total number of lock structs in row lock hash table 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 0, not started, process no 3396, OS thread id 1152440672
MySQL thread id 8080, query id 728900 localhost root
show innodb status
---TRANSACTION 0 80157600, ACTIVE 4 sec, process no 3396, OS thread id 1148250464, thread declared inside InnoDB 442
mysql tables in use 1, locked 0
MySQL thread id 8079, query id 728899 localhost root Sending data
select sql_calc_found_rows * from b limit 5
Trx read view will not see trx with id>= 0 80157601, sees <0 80157597
---TRANSACTION 0 80157599, ACTIVE 5 sec, process no 3396, OS thread id 1150142816 fetching rows, thread declared inside InnoDB 166
mysql tables in use 1, locked 0
MySQL thread id 8078, query id 728898 localhost root Sending data
select sql_calc_found_rows * from b limit 5
Trx read view will not see trx with id>= 0 80157600, sees <0 80157596
---TRANSACTION 0 80157598, ACTIVE 7 sec, process no 3396, OS thread id 1147980128 fetching rows, thread declared inside InnoDB 114
mysql tables in use 1, locked 0
MySQL thread id 8077, query id 728897 localhost root Sending data
select sql_calc_found_rows * from b limit 5
Trx read view will not see trx with id>= 0 80157599, sees <0 80157595
---TRANSACTION 0 80157597, ACTIVE 7 sec, process no 3396, OS thread id 1152305504 fetching rows, thread declared inside InnoDB 400
mysql tables in use 1, locked 0
MySQL thread id 8076, query id 728896 localhost root Sending data
select sql_calc_found_rows * from b limit 5
Trx read view will not see trx with id>= 0 80157598, sees <0 80157594

If you have small number of connections all connections will be printed in transaction list, if you have large number of connections Innodb will only print number of them, cutting the list so SHOW INNODB STATUS output will not grow too large.

Transaction id is current transaction identifier - it is incremented for each transaction. Purge done for trx's n:o is number of transaction to which purge is done. Innodb can only purge old versions if they there are no running transactions potentially needing them. Old stale uncommitted transactions may block purge process eating up resources. By looking at transaction counter difference between current and last purged transaction you will be able to spot it. In some rare cases purge also could have hard time to keep up with update rate, in this case difference between these values will also grow and innodb_max_purge_lag will become your friend. "undo n:o" will show the undo log record number which purge is currently processing, if it is active otherwise it will be zero.

History list length 6 is number of unpurged transactions in undo space. It is increased as transactions which have done updates are commited and decreased as purge runs.

Total number of lock structs in row lock hash table is number of row lock structures allocated by all transactions. Note not same as number of locked rows - there are normally many rows for each lock structure.

For each of connections for MySQL there will be ether not started state if there is no active Innodb transaction for this connection, or ACTIVE if transaction is active. Note transaction can be active even if connection is in "Sleep" stage - if it is multiple statement transaction. Innodb also will print OS thread_id and process id which may be helpful if you would like to use gdb to connect to running mysqld for troubleshooting purposes and similar things. Also transaction status is reported which is basically what transaction is doing it can be "fetching rows", "updating" and couple of other values. "Thread declared inside InnoDB 400" means thread is running inside Innodb kernel and still has 400 tickets to use. Innodb tries to limit thread concurrency allowing only innodb_thread_concurrency threads to run inside Innodb kernel at the same time. If thread is not runniing inside innodb kernel status could be "waiting in InnoDB queue" or "sleeping before joining InnoDB queue". Latest one is quite interesting - to avoid too many threads competing to enter innodb queue at the same time Innodb makes thread to sleep for some time before trying to wait (if no free slot was available). This may cause number of threads active inside kernel being less than number of threads allowed by innodb_thread_concurrency. For certain workloads it may help to decrease the time thread waits before it enters the queue. This is done by adjusting innodb_thread_sleep_delay variable. Value is specified in microseconds.

mysql tables in use 1, locked 0 is number of tables used by transaction in question (meaning it was accessed) and number of tables locked by transactions. Innodb does not lock tables for normal operation so number of tables locked normally stays 0, unless it is ALTER TABLE or similar statement, or if LOCK TABLES was used.

In addition to Innodb specific information, there is generic statement information which is visible in SHOW PROCESSLIST showed in SHOW INNODB STATUS, such as statement which is being executed, query id, query status etc.

lilinew 2011-03-10

打赏
举报

现在只要想办法在stage机器模拟出这个deadlock点就有办法
OS WAIT ARRAY INFO: reservation count 13569, signal count 11421
–Thread 1152170336 …/iz` trx id 0 42313619 lock_mode X locks rec but not gap waiting
Record lock, heap no 5 PHYSICAL RECORD:… bit conflicting information. This is by design as requiring global lock to provide consitent information would cause significant overhead
SHOW INNODB STATUS walk through
In SHOW INNODB STATUS many values are per second. If you're planning to use these values make sure they are sampled over decent period of time. In the very start of printout Innodb will print:

在 SHOW INNODB STATUS 中许多值都是按秒统计的，如果要使用这些值，就得隔适当的时间取样。SHOW INNODB STATUS 的开头应该是：

=====================================
060717 3:07:56 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 44 seconds

数据的取样应该至少是20－30秒。最好是写一个脚本来统计 SHOW INNODB STATUS 的结果。

下面是 Semaphores 信息:

----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 13569, signal count 11421
--Thread 1152170336 has waited at ./../include/buf0buf.ic line 630 for 0.00 seconds the semaphore:
Mutex at 0x2a957858b8 created file buf0buf.c line 517, lock var 0
waiters flag 0
wait is ending
--Thread 1147709792 has waited at ./../include/buf0buf.ic line 630 for 0.00 seconds the semaphore:
Mutex at 0x2a957858b8 created file buf0buf.c line 517, lock var 0
waiters flag 0
wait is ending
Mutex spin waits 5672442, rounds 3899888, OS waits 4719
RW-shared spins 5920, OS waits 2918; RW-excl spins 3463, OS waits 3163

该部分有2部分。一个是当前等待列表，如果并发数很大，Innodb就必须频繁等待 OS。那么该部分不包含由spinlock解决的等待。

该部分需要对源代码有一定了解，因为只能看到文件名和行数。但可以从文件名猜出其含义－buf0buf.ic是缓存池的内容。

“lock var”是当前互斥对象(locked=1/free=0)的值，"waiters flag"是当前等待线程数，还有一些状态信息："wait is ending"意思是互斥锁已经释放，但OS还没有分配，这时这个对象可以执行。

第二部分是事件信息。"reservation count" 和 "signal count" 表明InnoDB使用内部同步数组的频繁程度。这些数值表示InnoDB需要等待OS的频率。"OS Waits" 表明由于互斥的OS的等待。OS Wait is not exactly the same as "reservation" - before falling back to complex wait using sync_array Innodb tries to "yield" to OS hoping when name thread is scheduled next time object will be free already. OS Waits 相对较慢，如果每秒钟该值有几万，就有问题。另外一种方式来看该值是通过OS状态的 context switch.

另一个重要信息是"spin waits"和"spin rounds"的数量。与OS wait相比，Spin locks 是低消耗的等待，但它是一种活跃的等待，会浪费CPU资源，因此spin waits和spin rounds 如果很大，CPU资源会被浪费。如果该值达到几十万，CPU危险了。

死锁部分：

------------------------
LATEST DETECTED DEADLOCK
------------------------
060717 4:16:48
*** (1) TRANSACTION:
TRANSACTION 0 42313619, ACTIVE 49 sec, process no 10099, OS thread id 3771312 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 320
MySQL thread id 30898, query id 100626 localhost root Updating
update iz set pad='a' where i=2
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 16403 n bits 72 index `PRIMARY` of table `test/iz` trx id 0 42313619 lock_mode X locks rec but not gap waiting
Record lock, heap no 5 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000002; asc ;; 1: len 6; hex 00000285a78f; asc ;; 2: len 7; hex 00000040150110; asc @ ;; 3: len 10; hex 61202020202020202020; asc a ;;

*** (2) TRANSACTION:
TRANSACTION 0 42313620, ACTIVE 24 sec, process no 10099, OS thread id 4078512 starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
3 lock struct(s), heap size 320
MySQL thread id 30899, query id 100627 localhost root Updating
update iz set pad='a' where i=1
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 16403 n bits 72 index `PRIMARY` of table `test/iz` trx id 0 42313620 lock_mode X locks rec but not gap
Record lock, heap no 5 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000002; asc ;; 1: len 6; hex 00000285a78f; asc ;; 2: len 7; hex 00000040150110; asc @ ;; 3: len 10; hex 61202020202020202020; asc a ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 16403 n bits 72 index `PRIMARY` of table `test/iz` trx id 0 42313620 lock_mode X locks rec but not gap waiting
Record lock, heap no 4 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000001; asc ;; 1: len 6; hex 00000285a78e; asc ;; 2: len 7; hex 000000003411d9; asc 4 ;; 3: len 10; hex 61202020202020202020; asc a ;;

*** WE ROLL BACK TRANSACTION (2)

InnoDB只打印锁的几个信息，且只显示事务的最后一条语句，有可能被锁的行由前面一条语句引起。对复杂死锁的研究得看日志文件。

外键信息与死锁信息相似:

------------------------
LATEST FOREIGN KEY ERROR
------------------------
060717 4:29:00 Transaction:
TRANSACTION 0 336342767, ACTIVE 0 sec, process no 3946, OS thread id 1151088992 inserting, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
3 lock struct(s), heap size 368, undo log entries 1
MySQL thread id 9697561, query id 188161264 localhost root update
insert into child values(2,2)
Foreign key constraint fails for table `test/child`:
,
CONSTRAINT `child_ibfk_1` FOREIGN KEY (`parent_id`) REFERENCES `parent` (`id`) ON DELETE CASCADE
Trying to add in child table, in index `par_ind` tuple:
DATA TUPLE: 2 fields;
0: len 4; hex 80000002; asc ;; 1: len 6; hex 000000000401; asc ;;

But in parent table `test/parent`, in index `PRIMARY`,
the closest match we can find is record:
PHYSICAL RECORD: n_fields 3; 1-byte offs TRUE; info bits 0
0: len 4; hex 80000001; asc ;; 1: len 6; hex 0000140c2d8f; asc - ;; 2: len 7; hex 80009c40050084 LATEST DETECTED DEADLOCK，LASTEST FOREIGN KEY ERROR，TRANSACTIONS

The next section is about deadlock errors:

PLAIN TEXTCODE: ------------------------
LATEST DETECTED DEADLOCK
------------------------
060717 4:16:48
*** (1) TRANSACTION:
TRANSACTION 0 42313619, ACTIVE 49 sec, process no 10099, OS thread id 3771312 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 320
MySQL thread id 30898, query id 100626 localhost root Updating
update iz set pad='a' where i=2
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 16403 n bits 72 index `PRIMARY` of table `test/iz` trx id 0 42313619 lock_mode X locks rec but not gap waiting
Record lock, heap no 5 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000002; asc ;; 1: len 6; hex 00000285a78f; asc ;; 2: len 7; hex 00000040150110; asc @ ;; 3: len 10; hex 61202020202020202020; asc a ;;

*** (2) TRANSACTION:
TRANSACTION 0 42313620, ACTIVE 24 sec, process no 10099, OS thread id 4078512 starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
3 lock struct(s), heap size 320
MySQL thread id 30899, query id 100627 localhost root Updating
update iz set pad='a' where i=1
***