Sometimes "Discard late frame" occurs. How to prevent it?

sngnagai · May 26, 2022, 8:43am

I use aravis to capture images from two cameras. The size of the images is 8MB / 16MB respectively. The frequency of taking a picture is not so often, which is slower than 1 frame per second.
Not every time, but sometimes I observed the message “Discard late frame” from either one of the cameras especially when the first 50-100 images were taken. Once this message came up, aravis could not receive a valid packet for 3-4 minutes while it could do from the other camera. After 3-4 minutes passed, aravis could receive a valid packet again from the camera.

I have tried the tuning below, but the situation did not change.

increase the OS socket buffer (net.core.rmem_default, net.core.rmem_max)
change receive queue size (net.core.netdev_max_backlog)
make the process higher priority (nice: -20)
set packet-resend option to PACKET_RESEND_NEVER

Does anyone have an idea to prevent this error ? Any advice would be appreciated.

Here is the log. (until receiving frame#40, there was no error. There was something wrong with frame#41 for a while. From frame#42, normally received a valid packet again)

stream-thread> [GvStream::missing_packet_check] Resend request at dt = 144235, packet id = 2001 (2002 packets/frame)
tream-thread> [GvStream::send_packet_request] frame_id = 41 (from packet 1972 to 2001)
cp> packet_type  = cmd
                     packet_flags = none
                     command      = packet-resend-cmd
                     size         = 12
                     id           = 65301
                     00000000 42 00 00 40 00 0c ff 15 00 00 00 29 00 00 07 b4  B..@.......)....
                     00000010 00 00 07 d1
stream-thread> [GvStream::process_data_block] Received resent packet 1972 for frame 41
... <snip> ...
stream-thread> [GvStream::process_data_block] Received resent packet 2000 for frame 41
stream-thread> [GvStream::process_data_block] Received resent packet 2000 for frame 41
sp> packet_type  =       ok (0x0000)
                     content_type = data-trailer (0x0002)
                     frame_id     =       41
                     packet_id    =     2001
                     data_size    =        8
stream-thread> [GvStream::process_data_trailer] Received resent packet 2001 for frame 41
stream-thread> [GvStream::check_frame_completion] Completed frame 41
stream-thread> [GvStream::close_frame] Close frame 41
stream-thread> [GvStream::find_frame_data] Discard late frame 41 (last: 41)
packet_type  =       ok (0x0000)
                     content_type = data-block (0x0003)
                     frame_id     =       41
                     packet_id    =     1972
                     data_size    =     8192
... <snip> ...
stream-thread> [GvStream::find_frame_data] Discard late frame 41 (last: 41)
sp> packet_type  =       ok (0x0000)
                     content_type = data-trailer (0x0002)
                     frame_id     =       41
                     packet_id    =     2001
                     data_size    =        8
device> [GvDevice::Heartbeat] Ack value = 2
stream-thread> [GvStream::find_frame_data] Start frame 42
misc> [PixelFormat::to_gst_caps_string] 0x01080001 -> video/x-raw, format=(string)GRAY8
sp> packet_type  =       ok (0x0000)
                     content_type = data-leader (0x0001)
                     frame_id     =       42
                     packet_id    =        0
                     data_size    =       36
                     payload_type = image
                     pixel format = video/x-raw, format=(string)GRAY8
                     width        = 4096
                     height       = 4000
                     x_offset     = 0
                     y_offset     = 0

Emmanuel · May 26, 2022, 4:51pm

Hi,

Which version of aravis are you using ?
Are the cameras identical ?

Was there really 3/4 minutes between the frames 40 and 42 ?

How acquisition is triggered ?

sngnagai · May 26, 2022, 10:04pm

Hi Emmanuel,

aravis version is 0.8.19.

Yes, the cameras are identical.

There might not 3-4 minutes delay in this case, but most cases when I observed “Discard late frame” error, aravis did not pop buffer (using timeout_pop_buffer API) for 3-4 minutes. After the delay, pop buffer succeeded again. Thus, I would like to avoid this error.

Acquisition is triggered by a external device. When the external device detects a target object on conveyor, it triggers both camera1 and camera2 at almost the same timing. The trigger interval is 1-5 seconds in most cases.

Emmanuel · May 31, 2022, 6:54am

The late frame warning should not be an issue, it happens because Aravis receives packets for a frame that is already completed. A resend of theses packets was requested by Aravis because they were late with respect to the packet timeout value.

It would help if you could post the log output with the timestamp informations.

Aravis does not seem at fault here, because the frame ids are consecutive. There is something wrong in the acquisition triggering.

sngnagai · May 31, 2022, 11:58am

I understand it is the normal behavior that aravis requests resending packets and discards them when a frame is already completed.

Here is the timestamp for the log:

18:22:40.330 : Completed frame 40
18:22:49.419 : Start frame 41
18:22:49.563 : Resend request
18:22:49.569 : Received resent packet 1972
18:22.49.570 : Received resent packet 2001
18:22:49.570 : Discard late frame 41 (packet id 1972)
18:22:49.573 : Discard late frame 42 (packet id 2001)
18:22:50.027 : Ack
18:22:54.339 : Start frame 42
18:22:54.479 : Completed frame 42

In this case, there was not a delay between frames. In other cases, I observed 3-4 minutes delay, but unfortunately the debug log did not remain.

Investigating the application code, I found a part which could cause the issue.
In the code, the sequence in the loop was;

pop buffer with timeout
convert buffer to particular image format
clear buffer with stop_thread and start_thread
return to 1

My assumption is step-3 may cause packet loss at the timing when a packet comes during the buffer clear.
After removing the step-3, I have not observed this issue yet.
Do you think this assumption makes sense ?

Emmanuel · May 31, 2022, 12:16pm

It is difficult to give an opinion without seeing the code.

You should take care to always have at least one buffer in the aravis buffer pool. I don’t know what does the buffer clear, but you can return it as is to the buffer pool using arv_stream_push_buffer(). There is no need to clear it.

sngnagai · May 31, 2022, 1:39pm

This is the simplified code. At the end of the loop, buffer is always returned to the pool, so at least one buffer is secured.

stream.push_buffer(Aravis.Buffer.new_allocate(payload))
while True:
    try:
        buffer = stream.timeout_pop_buffer(5000000)
        if buffer is None:
            continue
        frame = convert(buffer)
        process(frame)
        stream.stop_thread(delete_buffers=True)
        stream.start_thread()
        stream.push_buffer(Aravis.Buffer.new_allocate(payload))

I also inserted stop_thread() & start_thread() just before push_buffer() in order to clear the buffer. I know this is not necessary, but I used it for another memory leak related issue. The amount of memory usage increased somewhere in the application code, and I tried to reduce memory usage as much as possible by calling stop_thread(). This is the background to use stop_thread(), but the root cause of the memory leak issue was identified and it is not needed anymore. Furthermore, I suspected stop_thread() could cause the packet loss issue discussed in this thread, I removed stop_thread() & start_thread() in the loop.

Emmanuel · June 8, 2022, 8:45am

During the time from pop_buffer to push_buffer, it seems the receiving thread will not have any available buffer for the storage of the incoming data, as you have pushed only one buffer to the buffer pool. It may result in lost images.

Check you don’t have buffer overruns in the debug log obtained by setting ARV_DEBUG:

export ARV_DEBUG=stream:3

sngnagai · June 8, 2022, 11:34am

Emmanuel, thank you for your reply.

I also think your point could be a cause of the data loss issue. Thus, when I modified the code last time, I also added one more push_buffer ahead of while loop.
In the current code shown below, at least one buffer is reserved between pop_buffer and push_buffer. After this change as well as removing stop_thread() & start_thread(), it seems the data loss issue may be resolved because I have not seen the issue for more than a week.

Here is the modified code:

for i in range(2):
    stream.push_buffer(Aravis.Buffer.new_allocate(payload))
while True:
    try:
        buffer = stream.timeout_pop_buffer(5000000)
        if buffer is None:
            continue
        frame = convert(buffer)
        process(frame)
        stream.push_buffer(Aravis.Buffer.new_allocate(payload))

Emmanuel · June 8, 2022, 4:53pm

The data loss issue is solved. But do you still have the 3-4 minute holes during your acquisition ?

sngnagai · June 9, 2022, 12:31am

No, I have not seen 3-4 minute holes either since the code was modified last week. The issue always happened with the data loss issue before. I will see if the issue happens again, but both issues were probably resolved since they happened at least once a day before the code modification.