On 10/01/2025 03:40, David Gibson wrote:
> On Thu, Jan 09, 2025 at 02:06:48PM +0100, Laurent Vivier wrote:

Yes, I spent my afternoon trying to understand that, but we must use the
same endianness between sum_16b() and csum_avx2(), and I found this:

diff --git a/checksum.c b/checksum.c
index 1c4354d35734..0543e86b0e67 100644
--- a/checksum.c
+++ b/checksum.c
@@ -458,8 +458,13 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
 	if (pad)
 		init += sum_16b(buf, pad);
 
-	if (len > pad)
-		init = csum_avx2((void *)align, len - pad, init);
+	if (len > pad) {
+		if (pad & 1)
+			init = __bswap_32(csum_avx2((void *)align, len - pad,
+						    __bswap_32(init)));
+		else
+			init = csum_avx2((void *)align, len - pad, init);
+	}
 
 	return init;
 }

Thanks,
Laurent

> > csum_unfolded() must call csum_avx2() with a 32-byte aligned base
> > address. To be able to do that when the buffer is not correctly
> > aligned, it splits the buffer in two parts: the second part is 32-byte
> > aligned and can be used with csum_avx2(); the first part is the
> > remainder, which is not 32-byte aligned, so we use sum_16b() to
> > compute its checksum.
> >
> > A problem appears if the length of the first part is odd, because the
> > checksum is computed over 16-bit words. If the length is odd, when the
> > second part is computed all words are shifted by one byte, meaning the
> > weights of the upper and lower bytes are swapped.
> >
> > For instance, a 13-byte buffer:
> >
> > bytes:       aa AA bb BB cc CC dd DD ee EE ff FF gg
> > 16-bit words: AAaa BBbb CCcc DDdd EEee FFff 00gg
> >
> > If we don't split the sequence, the checksum is:
> >
> >   AAaa + BBbb + CCcc + DDdd + EEee + FFff + 00gg
> >
> > If we split the sequence with an even length for the first part:
> >
> >   (AAaa + BBbb) + (CCcc + DDdd + EEee + FFff + 00gg)
> >
> > But if the first part has an odd length:
> >
> >   (AAaa + BBbb + 00cc) + (ddCC + eeDD + ffEE + ggFF)
> >
> > To avoid the problem, do not call csum_avx2() if the first part cannot
> > have an even length, and compute the checksum of the whole buffer
> > using sum_16b().
> > This is slower, but it can only happen if the buffer base address is
> > odd, and that can only happen if the binary is built using '-Os',
> > which means we have chosen to prioritize size over speed.
> >
> > Link: https://bugs.passt.top/show_bug.cgi?id=108
> > Signed-off-by: Laurent Vivier <lvivier(a)redhat.com>
>
> Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>
>
> In that it's a real bug and we need to fix it quickly. That said, I
> think we can do a bit better long term: I believe it should be possible
> to correct the value from the off-by-one csum_avx2(), I think with just
> an unconditional byteswap.
>
> The TCP/UDP checksum has the curious property that it doesn't matter if
> you compute it big-endian or little-endian, as long as you're
> consistent. We already rely on this. Having one odd-byte piece
> essentially means we're using inconsistent endianness between the two
> pieces.