Conversation

@qidaye
Contributor

@qidaye qidaye commented Dec 30, 2025

What problem does this PR solve?

Problem Summary:

When a Broker Load uses a wildcard path (e.g., /*) and the directory contains
0-byte metadata files such as _SUCCESS, the FE used the format/compression
detected for these metadata files (usually PLAIN) to overwrite the shared
parameters of the entire file group. As a result, the BE read compressed data
files (such as LZO) as plain text, and the import failed.
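
To make the failure mode concrete, here is a minimal, simplified model of it. This is not the actual FE code: the class, the `Compress` enum, and the suffix-based `detect` helper are all hypothetical, and the only point is that a single shared value ends up holding whatever the last processed file produced.

```java
import java.util.List;

public class SharedParamsOverwriteDemo {
    // Hypothetical stand-in for the compression types involved.
    enum Compress { PLAIN, LZO }

    // Hypothetical helper: infer compression from the file name suffix.
    static Compress detect(String path) {
        return path.endsWith(".lzo") ? Compress.LZO : Compress.PLAIN;
    }

    // Models the old behaviour: one shared value for the whole file group,
    // overwritten by every file that is processed, so the last file wins.
    static Compress sharedCompressAfterScan(List<String> paths) {
        Compress shared = Compress.PLAIN;
        for (String path : paths) {
            shared = detect(path);
        }
        return shared;
    }

    public static void main(String[] args) {
        List<String> files = List.of("/data/part-0.lzo", "/data/part-1.lzo", "/data/_SUCCESS");
        // Prints PLAIN even though every data file in the group is LZO.
        System.out.println(sharedCompressAfterScan(files));
    }
}
```

In this model the marker file happens to be listed last, which is enough to make the whole group look like plain text.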

This PR fixes the issue by:

  1. Filtering out _SUCCESS files in BrokerLoadPendingTask to avoid processing
    them as data.
  2. Setting format_type and compress_type in each TFileRangeDesc instead of
    the shared TFileScanRangeParams in FE (both legacy and Nereids paths).
  3. Prioritizing per-file format_type in BE's CsvReader to ensure the correct
    reader is initialized for each file.
  4. Adding unit tests to verify _SUCCESS filtering.
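
The first three steps above can be sketched roughly as follows. This is an illustrative model only, not the code in this PR: `FileRange` is a hypothetical, much simplified analogue of a per-file range descriptor, `detect` assumes compression can be inferred from a `.lzo` suffix, and `effectiveCompress` stands in for the reader-side preference for the per-file value.

```java
import java.util.ArrayList;
import java.util.List;

public class PerFileFormatSketch {
    enum Compress { PLAIN, LZO }

    /** Hypothetical, simplified analogue of a per-file range descriptor. */
    static final class FileRange {
        final String path;
        final Compress compress; // null means "not set for this file"

        FileRange(String path, Compress compress) {
            this.path = path;
            this.compress = compress;
        }
    }

    // Step 1: a zero-byte file named _SUCCESS is a job marker, not data.
    static boolean isSuccessMarker(String path, long size) {
        return size == 0 && (path.equals("_SUCCESS") || path.endsWith("/_SUCCESS"));
    }

    // Assumption for illustration: compression is inferred from the suffix.
    static Compress detect(String path) {
        return path.endsWith(".lzo") ? Compress.LZO : Compress.PLAIN;
    }

    // Steps 1-2: drop markers, then record the compression on each range
    // instead of on a value shared by the whole file group.
    static List<FileRange> buildRanges(String[] paths, long[] sizes) {
        List<FileRange> ranges = new ArrayList<>();
        for (int i = 0; i < paths.length; i++) {
            if (isSuccessMarker(paths[i], sizes[i])) {
                continue;
            }
            ranges.add(new FileRange(paths[i], detect(paths[i])));
        }
        return ranges;
    }

    // Step 3: the reader prefers the per-file value and only falls back
    // to the shared default when the range does not carry one.
    static Compress effectiveCompress(FileRange range, Compress sharedDefault) {
        return range.compress != null ? range.compress : sharedDefault;
    }

    public static void main(String[] args) {
        String[] paths = {"/data/part-0.lzo", "/data/_SUCCESS"};
        long[] sizes = {1024L, 0L};
        for (FileRange r : buildRanges(paths, sizes)) {
            System.out.println(r.path + " -> " + effectiveCompress(r, Compress.PLAIN));
        }
    }
}
```

Keeping the fallback to the shared default means ranges that carry no per-file value behave as they did before.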

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Dec 30, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@qidaye
Contributor Author

qidaye commented Dec 30, 2025

run buildall

@qidaye qidaye force-pushed the fix_broker_load_wildcard_import branch from d7134e0 to 98b984c December 30, 2025 08:23
@qidaye
Contributor Author

qidaye commented Dec 30, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 34667 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 98b984cadd471483b7e5cbc7c7072cf8a0aca310, data reload: false

------ Round 1 ----------------------------------
q1	17602	4241	4047	4047
q2	2015	341	238	238
q3	10207	1269	726	726
q4	10221	917	323	323
q5	7555	2098	1872	1872
q6	202	165	136	136
q7	926	801	655	655
q8	9287	1514	1095	1095
q9	6683	5175	5134	5134
q10	6779	1813	1401	1401
q11	508	292	286	286
q12	695	765	620	620
q13	17791	3820	3075	3075
q14	287	291	271	271
q15	596	526	520	520
q16	689	681	627	627
q17	700	756	586	586
q18	7436	7278	7918	7278
q19	1326	1031	646	646
q20	415	375	257	257
q21	4393	4246	3860	3860
q22	1112	1126	1014	1014
Total cold run time: 107425 ms
Total hot run time: 34667 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4243	4210	4233	4210
q2	327	426	318	318
q3	2213	2941	2423	2423
q4	1413	1889	1427	1427
q5	4356	4399	4266	4266
q6	222	165	127	127
q7	1968	1915	1723	1723
q8	2502	2377	2297	2297
q9	7020	7150	6908	6908
q10	2268	2519	2099	2099
q11	538	450	438	438
q12	669	697	588	588
q13	3304	3782	3082	3082
q14	282	277	264	264
q15	524	482	486	482
q16	602	649	600	600
q17	1067	1341	1335	1335
q18	7283	7249	7257	7249
q19	860	823	839	823
q20	1880	1938	1805	1805
q21	4449	4340	4171	4171
q22	1093	1031	989	989
Total cold run time: 49083 ms
Total hot run time: 47624 ms

@doris-robot

TPC-DS: Total hot run time: 173437 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 98b984cadd471483b7e5cbc7c7072cf8a0aca310, data reload: false

query5	5003	600	450	450
query6	327	225	225	225
query7	4211	448	258	258
query8	319	251	238	238
query9	8753	2581	2595	2581
query10	532	362	326	326
query11	15098	15202	14860	14860
query12	190	114	109	109
query13	1260	466	382	382
query14	6452	2969	2704	2704
query14_1	2560	2549	2570	2549
query15	212	197	173	173
query16	997	462	423	423
query17	1030	657	552	552
query18	2683	421	325	325
query19	211	214	194	194
query20	120	120	112	112
query21	214	135	114	114
query22	4079	3922	3913	3913
query23	15927	15571	15322	15322
query23_1	15497	15531	15381	15381
query24	7312	1574	1213	1213
query24_1	1207	1200	1206	1200
query25	525	430	384	384
query26	1232	264	155	155
query27	2768	471	300	300
query28	4532	2160	2174	2160
query29	788	515	417	417
query30	297	242	207	207
query31	804	637	535	535
query32	78	69	68	68
query33	539	328	270	270
query34	889	890	525	525
query35	728	780	691	691
query36	864	898	829	829
query37	123	94	76	76
query38	2747	2643	2659	2643
query39	756	758	720	720
query39_1	703	715	702	702
query40	225	140	123	123
query41	67	62	62	62
query42	110	102	101	101
query43	452	459	444	444
query44	1327	747	761	747
query45	184	182	176	176
query46	879	952	590	590
query47	1359	1458	1370	1370
query48	322	316	252	252
query49	612	401	321	321
query50	625	273	205	205
query51	3754	3754	3842	3754
query52	108	111	94	94
query53	289	328	269	269
query54	279	246	245	245
query55	86	77	73	73
query56	282	290	288	288
query57	1040	961	941	941
query58	264	247	244	244
query59	2173	2209	2095	2095
query60	315	305	296	296
query61	164	160	158	158
query62	399	359	313	313
query63	303	259	271	259
query64	5079	1404	987	987
query65	3744	3697	3752	3697
query66	1369	454	314	314
query67	14655	14571	15157	14571
query68	2685	1017	755	755
query69	421	336	309	309
query70	1006	965	915	915
query71	326	292	275	275
query72	6302	4818	5083	4818
query73	562	670	323	323
query74	8719	8736	8509	8509
query75	2797	2860	2510	2510
query76	2920	1033	651	651
query77	349	355	264	264
query78	9641	9909	9085	9085
query79	1022	859	602	602
query80	615	570	478	478
query81	467	265	227	227
query82	246	141	106	106
query83	261	253	239	239
query84	252	116	105	105
query85	863	502	456	456
query86	339	323	324	323
query87	2782	2838	2773	2773
query88	3163	2260	2211	2211
query89	375	344	328	328
query90	1966	151	145	145
query91	175	168	141	141
query92	73	64	66	64
query93	946	899	565	565
query94	438	324	289	289
query95	552	327	295	295
query96	596	461	202	202
query97	2306	2365	2260	2260
query98	214	207	192	192
query99	578	588	494	494
Total cold run time: 245658 ms
Total hot run time: 173437 ms

@doris-robot

ClickBench: Total hot run time: 26.84 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 98b984cadd471483b7e5cbc7c7072cf8a0aca310, data reload: false

query1	0.06	0.05	0.04
query2	0.10	0.05	0.05
query3	0.26	0.09	0.08
query4	1.60	0.11	0.11
query5	0.28	0.25	0.26
query6	1.14	0.66	0.65
query7	0.03	0.03	0.02
query8	0.05	0.04	0.04
query9	0.54	0.51	0.51
query10	0.57	0.55	0.55
query11	0.15	0.11	0.11
query12	0.16	0.12	0.13
query13	0.60	0.59	0.58
query14	1.01	0.97	0.96
query15	0.81	0.79	0.81
query16	0.39	0.39	0.39
query17	1.02	1.08	1.04
query18	0.23	0.21	0.21
query19	1.94	1.83	1.86
query20	0.01	0.02	0.01
query21	15.47	0.29	0.14
query22	4.81	0.05	0.05
query23	15.85	0.28	0.10
query24	1.03	0.24	0.50
query25	0.12	0.06	0.11
query26	0.15	0.13	0.15
query27	0.09	0.07	0.04
query28	3.93	1.05	0.88
query29	12.56	3.91	3.14
query30	0.28	0.14	0.11
query31	2.82	0.62	0.38
query32	3.23	0.55	0.46
query33	3.02	3.05	3.04
query34	16.54	5.08	4.45
query35	4.44	4.46	4.44
query36	0.64	0.50	0.48
query37	0.10	0.07	0.07
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.16	0.14	0.13
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 96.48 s
Total hot run time: 26.84 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.37% (18951/35508)
Line Coverage 39.25% (175886/448061)
Region Coverage 33.82% (136093/402431)
Branch Coverage 34.75% (58781/169140)

@qidaye qidaye force-pushed the fix_broker_load_wildcard_import branch from 98b984c to 41ecdde December 30, 2025 09:49
@qidaye
Contributor Author

qidaye commented Dec 30, 2025

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 71.43% (5/7) 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 36219 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 41ecdde9fcfdd8eb01a50f72babf50d3d6a25ff5, data reload: false

------ Round 1 ----------------------------------
q1	17649	4205	4112	4112
q2	2036	354	242	242
q3	10180	1337	731	731
q4	10207	811	322	322
q5	7560	2173	1991	1991
q6	208	172	139	139
q7	1001	828	671	671
q8	9279	1439	1201	1201
q9	6911	5205	5267	5205
q10	6892	1843	1420	1420
q11	502	317	287	287
q12	726	744	610	610
q13	17818	3886	3199	3199
q14	293	300	293	293
q15	590	506	500	500
q16	741	691	639	639
q17	703	778	674	674
q18	7544	7672	8104	7672
q19	1201	1065	705	705
q20	457	400	262	262
q21	4625	4581	4330	4330
q22	1169	1112	1014	1014
Total cold run time: 108292 ms
Total hot run time: 36219 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4308	4190	4314	4190
q2	332	407	333	333
q3	2392	2814	2491	2491
q4	1444	1859	1457	1457
q5	4359	4499	4290	4290
q6	226	169	136	136
q7	2136	1892	1868	1868
q8	2541	2628	2402	2402
q9	7178	7182	6960	6960
q10	2508	2713	2275	2275
q11	540	461	433	433
q12	675	703	565	565
q13	3396	3849	3106	3106
q14	282	274	264	264
q15	521	550	470	470
q16	615	651	627	627
q17	1142	1346	1376	1346
q18	7214	7145	7167	7145
q19	899	843	889	843
q20	1880	1942	1803	1803
q21	4615	4344	4243	4243
q22	1095	1054	996	996
Total cold run time: 50298 ms
Total hot run time: 48243 ms

@doris-robot

TPC-DS: Total hot run time: 174263 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 41ecdde9fcfdd8eb01a50f72babf50d3d6a25ff5, data reload: false

query5	4529	587	446	446
query6	340	227	226	226
query7	4229	457	281	281
query8	348	249	237	237
query9	8801	2594	2626	2594
query10	513	384	311	311
query11	15317	15129	14802	14802
query12	190	117	117	117
query13	1280	505	391	391
query14	6239	3018	2689	2689
query14_1	2669	2628	2657	2628
query15	210	213	175	175
query16	1025	485	476	476
query17	1113	718	599	599
query18	2467	442	353	353
query19	242	228	219	219
query20	126	122	119	119
query21	217	144	122	122
query22	3900	4024	3866	3866
query23	16035	15576	15408	15408
query23_1	15421	15366	15569	15366
query24	7448	1588	1218	1218
query24_1	1233	1197	1207	1197
query25	592	487	433	433
query26	1257	275	176	176
query27	2743	458	338	338
query28	4505	2187	2179	2179
query29	738	517	424	424
query30	311	248	214	214
query31	820	613	543	543
query32	76	68	65	65
query33	544	342	296	296
query34	921	874	533	533
query35	751	789	698	698
query36	876	868	821	821
query37	130	92	77	77
query38	2681	2707	2655	2655
query39	773	743	741	741
query39_1	720	719	701	701
query40	214	135	118	118
query41	67	66	64	64
query42	107	106	103	103
query43	450	452	411	411
query44	1346	759	740	740
query45	190	184	178	178
query46	858	958	602	602
query47	1352	1378	1294	1294
query48	325	338	261	261
query49	628	410	333	333
query50	636	288	203	203
query51	3737	3824	3742	3742
query52	108	109	98	98
query53	299	330	275	275
query54	323	265	245	245
query55	79	75	70	70
query56	277	286	289	286
query57	1047	1011	926	926
query58	267	261	244	244
query59	2144	2133	1959	1959
query60	323	332	298	298
query61	154	153	162	153
query62	415	337	304	304
query63	301	264	271	264
query64	4934	1326	992	992
query65	3738	3727	3684	3684
query66	1456	431	339	339
query67	14942	15808	14877	14877
query68	5227	1036	745	745
query69	499	353	311	311
query70	1071	943	968	943
query71	369	306	281	281
query72	6132	4776	4861	4776
query73	690	604	313	313
query74	8843	8921	8556	8556
query75	2908	2908	2526	2526
query76	3901	1064	664	664
query77	503	377	288	288
query78	9787	9873	9203	9203
query79	955	917	592	592
query80	1161	615	486	486
query81	554	266	232	232
query82	416	141	109	109
query83	381	257	241	241
query84	255	121	103	103
query85	915	532	472	472
query86	386	316	308	308
query87	2838	2878	2733	2733
query88	3165	2246	2246	2246
query89	380	341	337	337
query90	1959	149	159	149
query91	172	166	142	142
query92	69	65	66	65
query93	1002	930	567	567
query94	655	322	295	295
query95	582	380	308	308
query96	577	467	203	203
query97	2324	2377	2280	2280
query98	218	200	198	198
query99	573	573	509	509
Total cold run time: 251596 ms
Total hot run time: 174263 ms

@doris-robot

ClickBench: Total hot run time: 26.94 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 41ecdde9fcfdd8eb01a50f72babf50d3d6a25ff5, data reload: false

query1	0.06	0.05	0.04
query2	0.11	0.05	0.05
query3	0.25	0.08	0.08
query4	1.60	0.11	0.12
query5	0.27	0.28	0.25
query6	1.14	0.67	0.64
query7	0.03	0.03	0.02
query8	0.05	0.04	0.04
query9	0.57	0.50	0.50
query10	0.55	0.55	0.55
query11	0.15	0.11	0.11
query12	0.16	0.13	0.13
query13	0.62	0.60	0.59
query14	0.98	0.98	0.99
query15	0.81	0.79	0.82
query16	0.39	0.40	0.39
query17	1.02	1.03	1.02
query18	0.23	0.22	0.21
query19	1.87	1.80	1.88
query20	0.02	0.01	0.02
query21	15.44	0.29	0.13
query22	4.80	0.04	0.05
query23	16.10	0.28	0.12
query24	1.91	0.30	0.61
query25	0.08	0.09	0.08
query26	0.15	0.13	0.13
query27	0.07	0.05	0.05
query28	5.09	1.04	0.89
query29	12.58	3.98	3.18
query30	0.27	0.13	0.12
query31	2.82	0.60	0.38
query32	3.23	0.56	0.46
query33	3.05	3.02	3.06
query34	16.86	5.06	4.43
query35	4.52	4.48	4.47
query36	0.67	0.50	0.48
query37	0.10	0.07	0.06
query38	0.07	0.04	0.04
query39	0.04	0.03	0.03
query40	0.16	0.14	0.13
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 99.07 s
Total hot run time: 26.94 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 28.57% (2/7) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.37% (18951/35508)
Line Coverage 39.25% (175878/448068)
Region Coverage 33.83% (136122/402406)
Branch Coverage 34.75% (58777/169140)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.15% (25047/34713)
Line Coverage 58.90% (263218/446876)
Region Coverage 53.57% (217791/406569)
Branch Coverage 55.22% (93706/169702)

@qidaye qidaye requested review from morningman and starocean999 and removed request for morningman December 31, 2025 00:51
List<TBrokerFileStatus> filteredFileStatuses = Lists.newArrayList();
for (TBrokerFileStatus fstatus : fileStatuses) {
    if (fstatus.getSize() == 0 && isBinaryFileFormat) {
        boolean isSuccessFile = fstatus.path.endsWith("/_SUCCESS")
Contributor

@cambyzju cambyzju Dec 31, 2025

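Could we do something similar to this PR (#59398) and filter out all files whose names start with an underscore or a dot? If you are worried that the impact would be too broad, we could add a switch to the Broker Load options.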
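
As a rough illustration of the suggestion above, a name-based filter with an opt-in switch might look like the sketch below. The class, the `shouldLoad` method, and the `skipHidden` flag are hypothetical names used only for this example; they are not an existing Broker Load option or the code from #59398.

```java
public class HiddenFileFilterSketch {
    // Returns the last path component, e.g. "/data/_SUCCESS" -> "_SUCCESS".
    static String baseName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // skipHidden is a hypothetical switch: when false, nothing is filtered,
    // so existing loads keep their current behaviour.
    static boolean shouldLoad(String path, boolean skipHidden) {
        if (!skipHidden) {
            return true;
        }
        String name = baseName(path);
        return !(name.startsWith("_") || name.startsWith("."));
    }

    public static void main(String[] args) {
        String[] paths = {"/data/part-0.lzo", "/data/_SUCCESS", "/data/.part-0.lzo.crc"};
        for (String p : paths) {
            System.out.println(p + " load=" + shouldLoad(p, true));
        }
    }
}
```

Only the base name is checked, so a data file inside a directory whose name starts with an underscore would still be loaded in this sketch.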

Contributor Author

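Now that different file types can be specified for the files within a single file group, it would actually be fine not to filter here at all. The special-case filtering of _SUCCESS files was added for now just to be on the safe side.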
