Basic Information
- Which business cluster and cluster version
- Fault symptoms and impact scope
- What operations and changes were made before the problem occurred
- Whether the problem can be stably reproduced
1. Data Import
1.1 Stream Load
- Import data format
- Detailed error information and Error_URL
- Stream load command
- Desensitized data to be imported
- Table creation statement
- Search fe logs by label
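Searching the FE logs by label can be sketched as follows. The label value and the log line format here are hypothetical, for illustration only; substitute your real label and the actual fe.log path.

```shell
# Hypothetical label and sample fe.log lines -- replace with real values.
label="stream_load_demo_001"
cat > /tmp/fe.log.sample <<'EOF'
2024-01-01 10:00:00 INFO new load job. label: stream_load_demo_001, db: demo_db
2024-01-01 10:00:01 INFO load job finished. label: other_label_xyz
EOF

# Search the FE log for all records of this load job by its label.
matches=$(grep "$label" /tmp/fe.log.sample)
echo "$matches"
```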
1.1.1 Call frontend service failed
Around the time of the fault:
- fe.log
- fe.gc.log
1.1.2 Reached timeout 60000ms
- fe.log
- fe.gc.log
- Monitoring metrics: Grafana BE_LOAD module
1.1.3 Slow import/errors
- Leader node fe.log
- The coordinator BE node's be.INFO log, plus the be.INFO log from any one other BE node. To find the coordinator: search the leader FE log for the import task label, then search those lines for "coordinator" to get the coordinator node's IP.
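The two-step lookup of the coordinator BE can be sketched like this. The label and the log line layout are assumptions for illustration; only the "grep the label, then grep coordinator" workflow comes from this guide.

```shell
# Hypothetical label and sample leader fe.log -- replace with real values.
label="insert_9f3a"
cat > /tmp/leader_fe.log.sample <<'EOF'
2024-01-01 10:00:00 INFO load job created. label: insert_9f3a, txn_id: 1234
2024-01-01 10:00:01 INFO label: insert_9f3a, coordinator: BE 10.0.0.12
EOF

# Step 1: find all lines for this load job; step 2: keep the coordinator line.
coord_line=$(grep "$label" /tmp/leader_fe.log.sample | grep -i "coordinator")
# Extract the IP (here it is the last field of the line).
coord_ip=$(echo "$coord_line" | awk '{print $NF}')
echo "coordinator BE: $coord_ip"
```

Then collect be.INFO from that IP, plus be.INFO from one other BE node for comparison.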
1.2 Flink-connector
1.2.1 Source
- Problem description
- Source configuration
- Flink, connector, and SR versions
- Complete Taskmanager logs
1.2.2 Sink
- Problem description
- Sink configuration
- Flink, connector, and SR versions
- Jobmanager and complete Taskmanager logs
- FE leader node fe.log
- SR cluster resource usage
- Flink parallelism
- If connector upgrade is involved, provide Pom file
- For memory-related issues, provide Taskmanager memory size
1.2.3 Slow import/errors
- Leader node fe.log
- The coordinator BE node's be.INFO log, plus the be.INFO log from any one other BE node. To find the coordinator: search the leader FE log for the import task label, then search those lines for "coordinator" to get the coordinator node's IP.
1.3 Routine Load
1.3.1 Task errors
- Detailed error information
SHOW ROUTINE LOAD TASK WHERE JobName = "test1";
SHOW ROUTINE LOAD FOR test1;
- Routine load task creation statement
- Desensitized data from Kafka
- Whether the data can be consumed directly with a Kafka consumer
- Table creation in SR
1.3.2 Slow task execution/stuck
SHOW ROUTINE LOAD FOR test1;
- Kafka configuration: version, number of partitions
1.3.3 Slow import/errors
- Leader node fe.log
- be.INFO logs in which the import task label can be found (grep for the label)
1.4 Broker Load
- Detailed error information
SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior';
- Broker load task creation statement
- From the fe logs, find the load id and the assigned BE, and verify that the HDFS file path is specified correctly; the path can point to specific files or to all files under a directory
- Check specific exceptions in BE, output to file:
grep $load_id be.INFO
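A sketch of pulling the BE-side exceptions for one broker load into a file for the support ticket. The load id value and the sample log lines are hypothetical; only the `grep $load_id be.INFO` step comes from this guide.

```shell
# Hypothetical load_id and sample be.INFO -- use the real load id found in fe.log.
load_id="20087"
cat > /tmp/be.INFO.sample <<'EOF'
I0101 10:00:00 broker scan ranges assigned, load_id=20087 file=hdfs://ns1/user/data/part-0
W0101 10:00:02 open file failed, load_id=20087 error=Permission denied
I0101 10:00:03 unrelated line, load_id=30099
EOF

# Grep the BE log for the load id and save the output to a file to attach.
grep "$load_id" /tmp/be.INFO.sample > /tmp/broker_load_${load_id}.log
wc -l < /tmp/broker_load_${load_id}.log
```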
1.4.1 Slow import/errors
- Leader node fe.log
- be.INFO logs in which the import task label can be found (grep for the label)
- Broker node logs: search them for exception information, and provide the info logs from the nodes that reported errors
1.5 Insert Into
- The detailed error information and tracking_url returned synchronously
- Check through Information Schema:
SELECT * FROM information_schema.loads WHERE database_name = 'database_name' and label = 'label_name'\G
1.6 Primary Key Model
1.6.1 Metadata Bloat
- Table creation:
SHOW CREATE TABLE [db_name.]table_name
- Data distribution and size:
SHOW DATA [FROM <db_name>[.<table_name>]]
- FE logs, be.INFO logs
- CPU, memory, I/O resource usage, especially resource consumption during query execution
2. Query Execution
Profile Collection Method:
Obtain Profile to analyze query bottlenecks
Pstack Collection Method:
wget "https://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/SRDebugUtil.tar.gz" -O SRDebugUtil.tar.gz
tar -xzf SRDebugUtil.tar.gz
./bin/pstack 4921 > pstack.log  # Dump BE thread stacks to a file; example BE pid is 4921
Alternatively, from a host with mysql-client installed, connect to the FE and run:
admin execute on $beid 'System.print(ExecEnv.get_stack_trace_for_all_threads())';
2.1 Slow Queries
- Provide profile
- If query is sometimes fast and sometimes slow, provide profiles for both fast and slow queries
- The query's EXPLAIN COSTS output, and how long the EXPLAIN takes to run
- Depending on the urgency of the problem, consider preserving query dump files
2.2 Query Errors
- Query SQL
- Complete exception stack, usually search for "exception" in fe.log to find corresponding SQL
- Query dump file
2.3 Query Stuck
- Cluster resource usage, whether any resources reach bottlenecks, CPU, memory, IO Util, network throughput screenshots
- Confirm whether the FE received the query (the middle string is the query_id):
grep "register query id = 385e0ac9-314e-11ec-9853-00163e04d4c2" fe.log
Matching results indicate the FE received it.
- Confirm whether the BE received the query:
grep "register query id = 385e0ac9-314e-11ec-9853-00163e04d4c2" be.INFO
Matching results indicate the BE received it.
- If the BE didn't receive the query, provide a jstack of the FE:
jstack -l $fe_pid > /tmp/jstack.log
- If the BE received the query but didn't process it, print the pstack:
admin execute on $beid 'System.print(ExecEnv.get_stack_trace_for_all_threads())';
- Confirm if FE had FullGC
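The FE/BE "did the query arrive?" checks above can be scripted. The query_id and the sample log files here are placeholders; the "register query id = " pattern is the one this guide greps for.

```shell
# Hypothetical query_id and sample logs -- substitute real values from your cluster.
query_id="385e0ac9-314e-11ec-9853-00163e04d4c2"
cat > /tmp/fe.log.sample <<'EOF'
2024-01-01 10:00:00 INFO register query id = 385e0ac9-314e-11ec-9853-00163e04d4c2
EOF
cat > /tmp/be.INFO.sample <<'EOF'
I0101 10:00:00 some unrelated BE log line
EOF

# Did the FE register the query?
if grep -q "register query id = $query_id" /tmp/fe.log.sample; then fe_got=yes; else fe_got=no; fi
# Did the BE receive the query?
if grep -q "register query id = $query_id" /tmp/be.INFO.sample; then be_got=yes; else be_got=no; fi
echo "FE received: $fe_got, BE received: $be_got"
```

In this sample the FE registered the query but the BE did not, which is the case where the guide asks for an FE jstack.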
2.4 Query Exhausts Resources
- If query can execute successfully, provide profile
- Specific resource usage, which resource is exhausted
- SQL statement
2.5 Incorrect/Inconsistent Query Results
- Profiles of the queries that return different results
- EXPLAIN COSTS output for each of those queries
- Table creation statements
- Data import methods
2.6 Materialized Views
First follow troubleshooting_asynchronous_materialized_views. If that does not resolve the issue, please provide the following:
- Materialized view statements
- Info and warning logs from all three FEs
- Info logs from problematic BEs
3. Data Export
- If exported data cannot be queried in Hive, metadata synchronization is needed
- Clarify the export file format and separators, and whether line breaks are handled correctly
- When using Spark to export data, first confirm the related connection parameters: IP, port, connection timeout, etc.
- When using Spark to import data, confirm the mapping between SR field types and Spark column types
4. Operations Management
4.1 BE Crash
- Crash time point, whether repeatedly crashing
- Core files
- If the core file itself is too large to upload, run gdb on it locally and provide the backtrace:
gdb -ex "thread apply all bt" -ex "quit" --batch -core /path/to/corefile /path/to/your/executable, where /path/to/your/executable is the binary file path, e.g., /lib/starrocks_be
- If no core file is available at all, provide:
- be.out
- If the keyword "LogMessageFatal" appears in be.out, also provide be.WARNING
- be.INFO
- dmesg -T output
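One common thing to look for in the dmesg output is a kernel OOM kill of the BE process. A minimal sketch, using a fabricated dmesg line for illustration (run `dmesg -T` on the real host instead):

```shell
# Sample dmesg output -- on the real host, use: dmesg -T > /tmp/dmesg.sample
cat > /tmp/dmesg.sample <<'EOF'
[Mon Jan  1 10:00:00 2024] Out of memory: Killed process 4921 (starrocks_be) total-vm:81234567kB
EOF

# Check whether the kernel OOM killer terminated a process.
if grep -qiE "out of memory|oom-killer" /tmp/dmesg.sample; then
  oom=yes
else
  oom=no
fi
echo "OOM kill found: $oom"
```

If an OOM kill is found, the crash is a memory issue on the host rather than a BE fault, which changes what logs matter.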
4.2 FE Crash
- fe.out logs
- fe.log logs
dmesg -T
- bdb/je.info.0
4.3 FE Stuck
- fe.log logs
- fe.gc.log logs
- Provide jstack:
jstack -l $fe_pid > /tmp/jstack.log
4.4 FE OOM
- Mem profile, usually in fe/log/proc_profile, obtain 3+ files around the problem time point
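Selecting the 3+ profile files around the problem time point can be done by modification time. A sketch under assumptions: the directory layout and file names below are fabricated stand-ins for fe/log/proc_profile, and the problem time (10:00) is a placeholder.

```shell
# Fabricated stand-in for fe/log/proc_profile with files at various times.
rm -rf /tmp/proc_profile.sample && mkdir -p /tmp/proc_profile.sample
touch -d "2024-01-01 09:50" /tmp/proc_profile.sample/profile_0950
touch -d "2024-01-01 10:00" /tmp/proc_profile.sample/profile_1000
touch -d "2024-01-01 10:10" /tmp/proc_profile.sample/profile_1010
touch -d "2024-01-01 12:00" /tmp/proc_profile.sample/profile_1200

# Files modified within +/-30 minutes of the problem time (10:00 here).
picked=$(find /tmp/proc_profile.sample -type f \
  -newermt "2024-01-01 09:30" ! -newermt "2024-01-01 10:30" | sort)
echo "$picked"
```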
4.5 IO Exhausted
mysql -uroot -h$feIP -P9030 -p -e "admin execute on 10004 'System.print(ExecEnv.get_stack_trace_for_all_threads())';" > be.log
admin execute on $backend_id 'System.print(ExecEnv.io_profile_and_get_topn_stats("all", 10, 3))';  # 10 is the collection period in seconds, 3 means top 3
- iotop results
- iostat -x -k 1 results
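When reading the iostat -x results, the %util column (last field) shows how close each device is to saturation. A sketch that flags busy devices, using fabricated sample output (pipe real `iostat -x -k 1` output instead):

```shell
# Fabricated iostat -x sample -- replace with real iostat output.
cat > /tmp/iostat.sample <<'EOF'
Device  r/s   w/s   rkB/s  wkB/s  %util
sda     10.0  5.0   120.0  80.0   12.5
sdb     900.0 800.0 90000  85000  98.7
EOF

# Print devices whose %util (last column) exceeds 90.
busy=$(awk 'NR>1 && $NF+0 > 90 {print $1}' /tmp/iostat.sample)
echo "saturated devices: $busy"
```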
4.6 Checkpoint Failure
Sometimes the BDB directory grows very large, usually because a bug caused checkpoint failure. Get the relevant logs with:
grep -A50 'Exception when generate new image file' fe.warn.log
4.7 Schema Change Stuck
In the leader fe.log, search for the job ID obtained from SHOW ALTER TABLE COLUMN:
grep -A10 $job_id fe.log
4.8 Compaction
- be.INFO logs:
grep "compaction_type:cumulative" be.INFO
grep "compaction_type:base" be.INFO
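A quick way to gauge compaction activity is to count the two event types. The sample be.INFO lines here are fabricated around the `compaction_type:` patterns this guide greps for; point the greps at the real be.INFO.

```shell
# Fabricated be.INFO sample -- use the real be.INFO instead.
cat > /tmp/be.INFO.sample <<'EOF'
I0101 10:00:00 start compaction. compaction_type:cumulative tablet_id=101
I0101 10:00:05 start compaction. compaction_type:cumulative tablet_id=102
I0101 10:00:09 start compaction. compaction_type:base tablet_id=101
EOF

# Count cumulative vs base compaction records.
cumulative=$(grep -c "compaction_type:cumulative" /tmp/be.INFO.sample)
base=$(grep -c "compaction_type:base" /tmp/be.INFO.sample)
echo "cumulative=$cumulative base=$base"
```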
4.9 BE Resource Usage Abnormalities
4.9.1 High Memory Usage
- be.INFO
- Manager metrics dashboard shows detailed thread memory usage information
4.9.2 High BE CPU Usage
- Collect CPU profile:
perf record -F 99 -ag -p $pid -- sleep 60
- Or:
pprof --svg --seconds=60 http://be_host:8040/pprof/profile > be.svg
top -Hp $be_pid
5. Lake Analytics
5.1 Connection Exceptions
- Need to provide fe.log from connected FE
- Catalog creation statements
5.2 Slow Queries
- Need to provide query profile
- Hive table data storage format
- Amount of data scanned, row count and data size
- Whether configurations and parameters were modified
5.3 Query Errors
- Query SQL
- Table creation statements for queried tables
- Detailed query error information, search for query SQL in corresponding connected FE logs to get complete context logs
- Provide corresponding time point be.out