This article describes how to troubleshoot and resolve problems related to BE restarts
Reasons for BE going down
OOM
Check the kernel log to see whether the BE process was killed by the OOM killer:
dmesg -T | grep -i oom
Causes:
Causes of OOM in version 2.X
- The mem_limit setting in the BE configuration file (be.conf) is unreasonable
Set mem_limit = total machine memory - memory used by other services - 1 to 2 GB (reserved for the system)
For example, if the machine has 40 GB of memory and also runs a MySQL instance that can use up to 4 GB, set mem_limit = 34G (40 - 4 - 2)
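The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name calc_mem_limit and its arguments are hypothetical, all values are in whole GB, and the default reserve of 2 GB is simply the upper end of the 1 to 2 GB range mentioned above.

```shell
# Hypothetical helper (not part of StarRocks): prints a mem_limit line for be.conf.
# Arguments, all in whole GB:
#   $1 total machine memory
#   $2 memory used by other services
#   $3 system reserve (optional, defaults to 2)
calc_mem_limit() {
    total_gb=$1
    other_gb=$2
    reserved_gb=${3:-2}
    limit_gb=$((total_gb - other_gb - reserved_gb))
    if [ "$limit_gb" -le 0 ]; then
        echo "error: no memory left for BE" >&2
        return 1
    fi
    echo "mem_limit = ${limit_gb}G"
}

# The example from the text: 40 GB machine, MySQL capped at 4 GB, 2 GB reserved.
calc_mem_limit 40 4 2
```

With the numbers from the example this prints mem_limit = 34G, which matches the value given above.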
Crash
First check whether the crash is a known problem by looking up its stack in the common Crash/BUG stack query
When BE crashes, it prints the exception stack to be.out, so get be.out first.
- If the stack contains the keyword "LogMessageFatal", search be.INFO for the fatal log lines of the day the crash occurred to find the log that caused it. For example, if the crash occurred on November 28, search for F1128
start time: Tue Nov 29 11:45:56 CST 2022
log4j:WARN No appenders could be found for logger (org.apache.hadoop.fs.FileSystem).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
*** Check failure stack trace: ***
@ 0x3feb34d google::LogMessage::Fail()
@ 0x3fed7bf google::LogMessage::SendToLog()
@ 0x3feae9e google::LogMessage::Flush()
@ 0x3feddc9 google::LogMessageFatal::~LogMessageFatal()
@ 0x2d359f5 starrocks::type_dispatch_predicate<>()
@ 0x2d304c3 starrocks::vectorized::VectorizedBinaryPredicateFactory::from_thrift()
@ 0x2ca34e8 starrocks::Expr::create_vectorized_expr()
@ 0x2ca393a starrocks::Expr::create_tree_from_thrift()
@ 0x2ca39cd starrocks::Expr::create_tree_from_thrift()
@ 0x2ca39cd starrocks::Expr::create_tree_from_thrift()
@ 0x2ca3c0f starrocks::Expr::create_expr_tree()
@ 0x2794b58 starrocks::vectorized::ProjectNode::init()
@ 0x253dd9a starrocks::ExecNode::create_tree_helper()
@ 0x253df6d starrocks::ExecNode::create_tree()
@ 0x34eaccd starrocks::pipeline::FragmentExecutor::_prepare_exec_plan()
@ 0x34ed3e1 starrocks::pipeline::FragmentExecutor::prepare()
@ 0x33a3662 starrocks::PInternalServiceImplBase<>::_exec_plan_fragment()
@ 0x33a5947 starrocks::PInternalServiceImplBase<>::exec_plan_fragment()
@ 0x41296ce brpc::policy::ProcessRpcRequest()
@ 0x4120137 brpc::ProcessInputMessage()
@ 0x4120fe3 brpc::InputMessenger::OnNewMessages()
@ 0x41c7cae brpc::Socket::ProcessEvent()
@ 0x40d5c3f bthread::TaskGroup::task_runner()
@ 0x425e421 bthread_make_fcontext
- If there is no "LogMessageFatal" keyword and the pipeline engine is enabled on the cluster: when the BE crash is caused by a query, be.out contains the query id or fragment id. Use the query id to find the corresponding SQL in fe.audit.log, or use the fragment id to first find the query id in be.INFO and then find the SQL in fe.audit.log. Then get the query_dump file (see how to get query_dump file) and submit it to the R&D team for processing
- If the crash is not caused by a query, or the pipeline engine is not enabled, submit an SR ticket with be.out attached for follow-up
- If the crash reproduces consistently, it is recommended to enable core dump on one machine and obtain a core dump file for analysis
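The first two branches of the checklist above can be sketched as a shell helper. Treat it as a rough illustration under stated assumptions: the function names glog_fatal_prefix and triage_crash are hypothetical, the paths to be.out and be.INFO are placeholders you pass in, and the search string query_id is only a guess at how the id appears in be.out, so adjust it to what your log actually prints.

```shell
# Builds the fatal-line prefix "F<MMDD>" used to search be.INFO,
# e.g. month 11, day 28 gives F1128.
glog_fatal_prefix() {
    month=${1#0}   # drop one leading zero so printf does not read "08" as octal
    day=${2#0}
    printf 'F%02d%02d\n' "$month" "$day"
}

# Hypothetical triage helper.
#   $1 path to be.out, $2 path to be.INFO, $3 month and $4 day of the crash
triage_crash() {
    be_out=$1
    be_info=$2
    if grep -q 'LogMessageFatal' "$be_out"; then
        # Fatal check failure: show the fatal lines of that day from be.INFO.
        grep "^$(glog_fatal_prefix "$3" "$4")" "$be_info"
    else
        # Assumption: the query id line in be.out contains "query_id".
        grep -i 'query_id' "$be_out"
    fi
}
```

A call such as triage_crash be.out be.INFO 11 28 would then print either the F1128 lines from be.INFO or the lines of be.out that mention a query id, which you can look up in fe.audit.log.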
How to obtain a core dump
Enable core dump only while debugging, and disable it again once the core dump has been obtained.
Enable core dump
Specify the path and name of the generated core file: as root, open /etc/sysctl.conf (for example with vim /etc/sysctl.conf) and add the following two lines
kernel.core_pattern=/tmp/core_%e_%p
kernel.core_uses_pid=0
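Run sysctl -p as root so that the new settings take effect, then remove the core file size limit for the running BE process ($bePID is the process ID of BE):
sudo prlimit -p $bePID --core=unlimited:unlimited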
After setting, confirm it in the following way:
cat /proc/$bePID/limits
If the core file size entry (Max core file size) shows unlimited, core dump is enabled.
Disable core dump (remember to disable it after the core dump has been generated)
sudo prlimit -p $bePID --core=0:0
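The confirmation step can also be scripted. The sketch below assumes the usual layout of /proc/<pid>/limits, where the soft limit is the first value after the "Max core file size" label; the function name core_soft_limit is hypothetical.

```shell
# Prints the soft "Max core file size" value from a limits file
# (normally /proc/<pid>/limits). Fields 1-4 are the label, field 5 is the soft limit.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Typical sequence, with bePID set to the BE process ID:
#   sudo prlimit -p "$bePID" --core=unlimited:unlimited
#   core_soft_limit "/proc/$bePID/limits"    # "unlimited" means core dump is enabled
#   sudo prlimit -p "$bePID" --core=0:0      # disable again once the core file exists
```

After disabling, the same check should print 0.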
Core dump analysis
For versions after 3.0, download the corresponding debuginfo package separately and place it under be/lib
Download gdb
wget "https://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/dba/gdb" -O gdb
chmod +x gdb
Open the core file
./gdb be/lib/starrocks_be core_xxx
If the user can send files, collect the stacks of all threads as follows
Create a file named gdbcmd.txt that contains the following three lines
set pagination off
thread apply all bt
quit
Then execute with gdb
./gdb -c <corefile> -x gdbcmd.txt -batch be/lib/starrocks_be > all-bt.txt
Send us this all-bt.txt
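To avoid editing gdbcmd.txt by hand, the steps above could be wrapped in a short script. This is a sketch only: the function names write_gdbcmd and dump_all_stacks are hypothetical, and the three arguments are placeholders for your own gdb binary, BE binary, and core file paths.

```shell
# Writes the three gdb commands from the article into the given file.
write_gdbcmd() {
    cat > "$1" <<'EOF'
set pagination off
thread apply all bt
quit
EOF
}

# Dumps the stacks of all threads in a core file into all-bt.txt.
#   $1 path to gdb, $2 path to the BE binary, $3 path to the core file
dump_all_stacks() {
    write_gdbcmd gdbcmd.txt
    "$1" -c "$3" -x gdbcmd.txt -batch "$2" > all-bt.txt
}
```

For the paths used in this article, the call would be dump_all_stacks ./gdb be/lib/starrocks_be <corefile>.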