New subject: [PATCH v3 01/20] tcp: Always pass NULL event with EPOLL_CTL_DEL

31 Jan 2025

...and finally connections survive migration from source to target,
at least the ones originating from the (source) guest. I didn't try
the other way around, small tweaks might be needed. Tested as follows,
roughly as instructed by Laurent:

Source:

  $ ./passt --vhost-user

  $ qemu-system-x86_64 -machine accel=kvm -cpu host -kernel ... \
   -initrd mbuto.img -nographic -serial mon:stdio -nodefaults \
   -append "console=ttyS0" \
   -chardev socket,id=chr0,path=/tmp/passt_1.socket \
   -netdev vhost-user,id=netdev0,chardev=chr0 \
   -device virtio-net,netdev=netdev0 \
   -object memory-backend-memfd,id=memfd0,share=on,size=$((2 * 1024 * 1024 * 1024)) \
   -numa node,memdev=memfd0 -m 2G

  # ./passt-repair /tmp/passt_1.socket.repair

Target (same host):

  $ ./passt --vhost-user

  $ qemu-system-x86_64 -machine accel=kvm -cpu host -kernel ... \
   -initrd mbuto.img -nographic -serial mon:stdio -nodefaults \
   -append "console=ttyS0" \
   -chardev socket,id=chr0,path=/tmp/passt_2.socket \
   -netdev vhost-user,id=netdev0,chardev=chr0 \
   -device virtio-net,netdev=netdev0 \
   -object memory-backend-memfd,id=memfd0,share=on,size=$((2 * 1024 * 1024 * 1024)) \
   -numa node,memdev=memfd0 -m 2G \
   -incoming tcp:0:4444

  # ./passt-repair /tmp/passt_2.socket.repair

Test server:

  $ nc -l 9091

Once the guest boots:

  # ip link set dev eth0 up
  # dhclient eth0
  # socat STDIN TCP:$DEFAULT_GW:9091
  abcd

  ^a-c
  migrate tcp:0:4444

Then continue typing in the target guest:

  efgh

The purpose of this is mostly to show the complete flow, but it needs
a number of reworks.

What's missing (leaving aside pending packet queues for a moment,
those are not strictly needed):

1. tests based on the two_guests layout/setup. Even with reverse-search
   in the shell, this is getting quite hard on wrists. I guess we can
   start QEMU with -monitor unix:mon.sock,server,nowait and
   send the 'migrate' command via socat STDIN UNIX-CONNECT:mon.sock

2. dump and transfer of *socket-side* MSS and window scale (I used
   hardcoded values): this needs more storage, so it needs to be
   transferred outside the flow table

3. dump, transfer and restore of TCP_REPAIR_WINDOW parameters (not
   strictly needed, but easy to add once we have appropriate storage)

4. perhaps some small bits of implementation for socket-originated
   connections (I tested only guest-originated ones so far)

5. UDP and ICMP flows (ping already happens to "survive" nicely, by
   the way)

6. man page for passt-repair, and man page changes for everything

7. packaging and Linux Security Module changes for passt-repair

8. error handling here and there, and repair rollback/migration abort

9. setting original receive/send buffer sizes and socket options
   (TCP_NODELAY)
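
For item 2, a minimal sketch of how socket-side MSS and window scale
could be read from a socket, assuming TCP_MAXSEG and TCP_INFO are an
acceptable source for them. The series itself uses hardcoded values, so
struct sock_tcp_params and sock_tcp_params_dump() are hypothetical
names, not something these patches provide:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* struct tcp_info in glibc's <netinet/tcp.h> */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical container for the socket-side values named in item 2 */
struct sock_tcp_params {
	uint32_t mss;		/* as reported by TCP_MAXSEG */
	uint8_t wscale_ok;	/* 1 if window scaling was negotiated */
	uint8_t snd_wscale;	/* shift count for the peer's window */
	uint8_t rcv_wscale;	/* shift count for our window */
};

/* Read MSS and window scale factors from TCP socket s into p.
 * Returns 0 on success, -1 if either getsockopt() call fails.
 */
int sock_tcp_params_dump(int s, struct sock_tcp_params *p)
{
	struct tcp_info ti;
	socklen_t sl;
	int mss;

	assert(p);

	sl = sizeof(mss);
	if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &sl))
		return -1;

	memset(&ti, 0, sizeof(ti));
	sl = sizeof(ti);
	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &sl))
		return -1;

	p->mss = (uint32_t)mss;
	/* The scale fields are only meaningful if the option was agreed */
	p->wscale_ok = !!(ti.tcpi_options & TCPI_OPT_WSCALE);
	p->snd_wscale = ti.tcpi_snd_wscale;
	p->rcv_wscale = ti.tcpi_rcv_wscale;

	return 0;
}
```

These three values don't fit in what's currently dumped to the flow
table, which is why they would have to travel outside of it.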

What clearly needs changes:

a. we can't dump more stuff to the flow table, because we would exceed
   128 bytes. We need to copy everything from tcp_tap_conn except for:

   - state in flow_common
   - in_epoll
   - sock
   - timer

   and on top of this we need:

   - values for TCPOPT_WINDOW and TCPOPT_MAXSEG
   - struct tcp_repair_window

   somewhat unexpectedly, this is actually bigger than a flow table
   entry. In any case, we need to implement a stream/per-entry
   migration right away.

b. at this point, I guess we can throw the header away, and just keep
   a magic (0xB1BB1D1B0BB1D1B0 has a missing 0 at the end but, well,
   https://en.wikipedia.org/wiki/Bibbidi-Bobbidi-Boo is the Magic
   Song: can we keep it?) and a version number. For the rest, let's
   go with big-endian (network) byte order, I'd say, and a 64-bit
   time_t

c. the declarative data thing is very convenient but we need to fetch
   stuff from struct ctx, as shown by the hash_secret example. What's
   very convenient about this approach is the iovec / writev() /
   readv() idea. I'm not sure if we can maintain that convenience,
   though
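
To make b. and c. a bit more concrete, here is a rough sketch of a
stream format with just a magic and a version, everything in network
byte order, still sent and received with iovec arrays. Only the magic
value comes from this series: the version number, the wire structures,
the choice of fields and the function names are all made up for
illustration, and a short read or write is simply treated as an error:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MIGRATE_MAGIC	0xB1BB1D1B0BB1D1B0ULL	/* from the cover letter */
#define MIGRATE_VERSION	1U			/* hypothetical */

/* Hypothetical wire layouts: byte arrays only, so no padding and no
 * dependency on host endianness or on the size of time_t
 */
struct wire_header { uint8_t magic[8]; uint8_t version[4]; };
struct wire_tcp { uint8_t seq_to_tap[4]; uint8_t seq_from_tap[4]; uint8_t ts[8]; };

static_assert(sizeof(struct wire_header) == 12, "unexpected padding");
static_assert(sizeof(struct wire_tcp) == 16, "unexpected padding");

static void put_be32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void put_be64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

static uint32_t get_be32(const uint8_t *p)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

static uint64_t get_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Send header and one TCP entry with a single writev(): 0 or -1 */
int migrate_send_one(int fd, uint32_t seq_to_tap, uint32_t seq_from_tap,
		     int64_t ts)
{
	struct wire_header h;
	struct wire_tcp t;
	struct iovec iov[2] = { { &h, sizeof(h) }, { &t, sizeof(t) } };

	put_be64(h.magic, MIGRATE_MAGIC);
	put_be32(h.version, MIGRATE_VERSION);
	put_be32(t.seq_to_tap, seq_to_tap);
	put_be32(t.seq_from_tap, seq_from_tap);
	put_be64(t.ts, (uint64_t)ts);	/* timestamp always 64-bit on wire */

	return writev(fd, iov, 2) == (ssize_t)(sizeof(h) + sizeof(t)) ? 0 : -1;
}

/* Receive and validate header, then decode one TCP entry: 0 or -1 */
int migrate_recv_one(int fd, uint32_t *seq_to_tap, uint32_t *seq_from_tap,
		     int64_t *ts)
{
	struct wire_header h;
	struct wire_tcp t;
	struct iovec iov[2] = { { &h, sizeof(h) }, { &t, sizeof(t) } };

	if (readv(fd, iov, 2) != (ssize_t)(sizeof(h) + sizeof(t)))
		return -1;
	if (get_be64(h.magic) != MIGRATE_MAGIC ||
	    get_be32(h.version) != MIGRATE_VERSION)
		return -1;

	*seq_to_tap = get_be32(t.seq_to_tap);
	*seq_from_tap = get_be32(t.seq_from_tap);
	*ts = (int64_t)get_be64(t.ts);
	return 0;
}
```

The explicit put/get step is the price for dropping the header: the
iovec convenience is still there, but the data behind it is a
serialised copy rather than the live structures.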

Patches that could be applied regardless of this series to make it
more manageable:

1/20  tcp: Always pass NULL event with EPOLL_CTL_DEL
2/20  util: Rename and make global vu_remove_watch()
6/20  util: Add read_remainder() and read_all_buf()
8/20  Introduce passt-repair
16/20 vhost_user: Turn vhost-user message reports to trace()
17/20 vhost_user: Make source quit after reporting migration state
18/20 tcp: Get our socket port using getsockname() when connecting from guest
19/20 tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros

Patches that we can throw away with the changes outlined above:

3/20  icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
4/20  flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits
15/20 flow, flow_table: Export declaration of hash table

David Gibson (6):
  tcp: Always pass NULL event with EPOLL_CTL_DEL
  util: Rename and make global vu_remove_watch()
  migrate: vu_migrate_{source,target}() aren't actually vu specific
  migrate: Move repair_sock_init() to vu_init()
  migrate: Make more handling common rather than vhost-user specific
  migrate: Don't handle the migration channel through epoll

Stefano Brivio (14):
  icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  flow, flow_table: Pad flow table entries to 128 bytes, hash entries to
    32 bits
  flow_table: Use size in extern declaration for flowtab
  util: Add read_remainder() and read_all_buf()
  Introduce facilities for guest migration on top of vhost-user
    infrastructure
  Introduce passt-repair
  Add interfaces and configuration bits for passt-repair
  flow, tcp: Basic pre-migration source handler to dump sequence numbers
  flow, flow_table: Export declaration of hash table
  vhost_user: Turn vhost-user message reports to trace()
  vhost_user: Make source quit after reporting migration state
  tcp: Get our socket port using getsockname() when connecting from
    guest
  tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros
  Implement target side of migration

 .gitignore     |   1 +
 Makefile       |  24 +--
 conf.c         |  44 +++++-
 epoll_type.h   |   6 +-
 flow.c         |  97 +++++++++++-
 flow.h         |  20 ++-
 flow_table.h   |  22 ++-
 icmp.c         |   2 +-
 icmp_flow.h    |   6 +-
 migrate.c      | 408 +++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h      |  84 ++++++++++
 passt-repair.c | 117 ++++++++++++++
 passt.1        |  11 ++
 passt.c        |  17 ++-
 passt.h        |  17 +++
 repair.c       | 193 +++++++++++++++++++++++
 repair.h       |  16 ++
 tap.c          |  64 +-------
 tcp.c          | 198 +++++++++++++++++++++++-
 tcp_conn.h     |   7 +
 tcp_internal.h |  10 +-
 tcp_splice.c   |   4 +-
 udp_flow.c     |   2 +-
 udp_flow.h     |   6 +-
 util.c         | 155 +++++++++++++++++++
 util.h         |   4 +
 vhost_user.c   |  94 +++---------
 virtio.h       |   4 -
 vu_common.c    |  62 +++-----
 vu_common.h    |   2 +-
 30 files changed, 1469 insertions(+), 228 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 passt-repair.c
 create mode 100644 repair.c
 create mode 100644 repair.h

-- 
2.43.0

[PATCH v3 00/20] Draft, incomplete series introducing state migration