[GvStream::check_frame_completion] Timeout for frame when one core of a multi-core CPU is at 100% utilization

Hello, there.

I am working on a project that uses the Aravis Python bindings to get images from a GigE camera on an embedded board (8 cores), do some image processing on them, and display the result in a web server.

In the beginning the program works fine; the CPU utilization is like:
core no.1: 100% (producer that pops GigE camera buffers and puts them into a queue)
core no.2: 40-70% (consumer that handles the images from the queue)

The problem I am facing is that after running for a long time (perhaps 20 minutes?), the thread running on core no.2 climbs to almost 100% CPU utilization, and multiple frame timeout logs appear:
“[GvStream::check_frame_completion] Timeout for frame 11897 at dt = 203292”


I also checked on a PC with a 20-core CPU using the same procedure, and reproduced the issue.

To confirm whether this issue has anything to do with my code, I also tested with arv-camera-test-0.8 (on the PC (20 cores) and on the board (8 cores), with one core utilized at 100%).
htop information is below.

The procedure is as follows:

export ARV_DEBUG=all
arv-camera-test-0.8 --gv-packet-size 1500    # also tried 1400 and 500

Log: arv-camera-test-0.log (21.9 KB)

Am I not using the library correctly, or am I missing something?
Please kindly help.
Thank you for your help in advance.

Best,
Joe.


Reference:
My software architecture is as follows:

  • Thread 1: a producer that constantly gets frames from the camera using Aravis library buffers and puts them into a queue. The queue's size is unlimited, so there should be no problem even if thread no.2 can't handle the queued images in time, since the producer just keeps putting images from the GigE camera into the queue.
def video_stream_producer(g_stop_event, queue, stream, max_queue_size=100):
    try:
        while not g_stop_event.is_set():
            try:
                buffer = stream.try_pop_buffer()  # non-blocking: returns None immediately if no buffer is ready
                if buffer is None:
                    continue

                frame_data = buffer.get_data()
                if frame_data:
                    stream.push_buffer(buffer)  # push the buffer back to the stream as soon as used
                    if not queue.full():
                        queue.put(frame_data)
                    else:
                        print("[Debug] Queue is full, dropping frame.")
                    if queue.qsize() > max_queue_size:
                        print("[Debug] Queue size exceeded, clearing queue.")
                        with queue.mutex:  
                            queue.queue.clear()
                else:
                    stream.push_buffer(buffer)  
            except Exception as e:
                if g_stop_event.is_set():  
                    break
                print(f"[Error in producer thread] {e}")
    except Exception as e:
        print(f"[Error in producer] {e}")
    print("[Debug] Producer thread exiting.")
  • Thread 2: a consumer that gets data from the queue above, does some image processing (bit conversion), and serves it in the web service (OpenCV JPEG encode and yield); a rough sketch is shown below.
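
For reference, a minimal sketch of what the consumer looks like (the resolution, pixel format, and bit conversion here are simplified placeholders, not my exact code):

import cv2
import numpy as np
from queue import Empty

def video_stream_consumer(g_stop_event, frame_queue, width=1920, height=1080):
    while not g_stop_event.is_set():
        try:
            frame_data = frame_queue.get(timeout=1.0)
        except Empty:
            continue
        # Interpret the raw buffer as 16-bit mono and scale it down to 8 bit (placeholder conversion).
        frame = np.frombuffer(frame_data, dtype=np.uint16).reshape(height, width)
        frame_8bit = (frame >> 8).astype(np.uint8)
        ok, jpeg = cv2.imencode(".jpg", frame_8bit)
        if not ok:
            continue
        # Yield a multipart MJPEG chunk for the web streaming endpoint.
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")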

Here is some extra information to explain this issue:

1. htop information when only running arv-camera-test-0.8 --gv-packet-size 1500

2. Intentionally loading a single core to 100% by using the following code:

COREMASK="0x01"  # one core
taskset "$COREMASK" bash -c '
while :; do
    echo "$((RANDOM * RANDOM))" > /dev/null
done'

3. Intentionally loading a single core to 100% while also running
arv-camera-test-0.8 --gv-packet-size 1500


You can see on the left that there are frame timeout logs.

Here are the arv-camera-test-0.8 logs without and with one core utilized at 100%:
arv-camera-test-0.8_no_extra_cpu_utilization.log (16.6 KB)
arv-camera-test-0.8_extra_1_core_100_utilization.log (31.9 KB)

Best,
Joe.

Very sorry to post multiple times.

It seems the single-core 100% utilization is due to the kernel softirq for network RX, i.e. receiving the data from the GigE camera over the network.

# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
          HI:      11879          0          0          0          0        770          0          3
       TIMER:    1572296    2825843    7736379      82376     293854      84532    1029875     161614
      NET_TX:       6934        999       2067        129      14430         16        134        138
      NET_RX:  180617448    2823498   12479279       5956       4556       7711      10549       1583
       BLOCK:       3618      70560      18842       2877       3192       5651       6247       8047
    IRQ_POLL:          0          0          0          0          0          0          0          0
     TASKLET:      85027      13310      58133          8          0        185          3         87
       SCHED:    1476442    2517628    4667216     231373     246995     756514     517846      27747
     HRTIMER:         12          0          0          0          0          0          0          0
         RCU:    2375074    1729336    2380463     237983     375160     398170    1067986     471719
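
For reference, here is a small helper sketch (not part of my application) that samples /proc/softirqs twice and shows which core the NET_RX softirqs accumulate on:

import time

def read_net_rx():
    # Return the per-CPU NET_RX softirq counters from /proc/softirqs.
    with open("/proc/softirqs") as f:
        for line in f:
            if line.strip().startswith("NET_RX:"):
                return [int(v) for v in line.split()[1:]]
    return []

before = read_net_rx()
time.sleep(5)
after = read_net_rx()
for cpu, (b, a) in enumerate(zip(before, after)):
    print(f"CPU{cpu}: +{a - b} NET_RX softirqs in 5 s")
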
I tried to distribute the softirq handling across multiple cores by using irqbalance:

sudo apt install irqbalance
sudo systemctl enable irqbalance --now

I also bound the CPU-heavy task to a core other than core 0, which seems to be the default core the OS uses to handle new tasks, I guess:

import os
os.sched_setaffinity(0, {7})  # core 7 is the more powerful core on my board

And now my camera can run for more than 1 hour without problems.
There is no more timeout issue, currently.

By the way, is this 100% software IRQ utilization normal?
According to our HW guy, a hardware IRQ only happens when it is necessary.
Does this mean that, since I am using an MTU of 1500, the raw data from the GigE camera is large enough to create that many interrupts? Is this correct?

Best,
Joe.

Hi,

If you are near 100% CPU usage, timeouts may be due to the packet resend mechanism, which by itself increases the CPU use and the number of network packets.

Did you try to increase the socket buffer size?
Why do you limit the packet size to 1500? A bigger value would decrease the number of interrupts.

https://aravisproject.github.io/aravis/aravis-stable/ethernet.html
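
For example, something along these lines (untested sketch using the GvStream "socket-buffer" / "socket-buffer-size" properties) sets a fixed, larger socket receive buffer:

import gi
gi.require_version("Aravis", "0.8")
from gi.repository import Aravis

camera = Aravis.Camera.new(None)           # first camera found
stream = camera.create_stream(None, None)

# Use a fixed, larger kernel socket receive buffer for the stream.
# The kernel limit may also need raising (e.g. sysctl net.core.rmem_max).
stream.set_property("socket-buffer", Aravis.GvStreamSocketBuffer.FIXED)
stream.set_property("socket-buffer-size", 16 * 1024 * 1024)  # 16 MB, adjust as needed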

Emmanuel.

Hi @Emmanuel
Thank you for your reply.
I changed the MTU (GigE camera / board / PoE switch / PC) to 8192, or enabled jumbo frame support where applicable.
And the one-core-at-100% result is not changing.

## camera setting
        camera_mtu = 8192
        camera.gv_set_packet_size(camera_mtu)
##board

18: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8192 qdisc mq state UP group default qlen 1000
    link/ether ec:21:e5:10:4f:ea brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.9/24 brd 192.168.3.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::8a2f:84d1:f68:7472/64 scope link 
       valid_lft forever preferred_lft forever

PoE Switch Jumbo frame support enabled.

There is no special reason for setting the MTU to 1500; it is just that the MTU might not be configurable in our customer's environment, so we set it to the common value of 1500 (the default for some devices?).

Best,
Joe

I have no other suggestion to improve the situation. You should avoid setting the working point at 100% CPU use though, as in that case there is no room for packet resends. Try to decrease the image size, or use the binning feature.
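
For example, roughly (untested, adjust to the features your camera supports):

import gi
gi.require_version("Aravis", "0.8")
from gi.repository import Aravis

camera = Aravis.Camera.new(None)

# Either reduce the region of interest...
camera.set_region(0, 0, 1280, 720)   # x, y, width, height (example values)
# ...or enable 2x2 binning if the camera supports it.
camera.set_binning(2, 2)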

Hi @Emmanuel
Thank you for your reply.

You should avoid setting the working point at 100% CPU use though, as in that case there is no room for packet resends. Try to decrease the image size, or use the binning feature.

OK, thanks for pointing that out.
And if we would like to keep the image size as it is (largest resolution), then we will have to live with this 100% CPU use, right?

It is unexpected that you use 100% of a CPU with a single 1 Gb/s camera. What is the model of your CPU?


It is a QCS8550 SoC; I think I am using a Silver core, since the frequency for cores 0-2 is around 2000 MHz.
Ref:
1 x GoldPlus @3.2 GHz + (2+2) x Gold @2.8 GHz + 3 x Silver @2.0 GHz

It is supposed to be much more powerful than the i5-8365U in my laptop, which only uses 20-30% of a CPU for a 100 MB/s stream. I guess the difference is in the Ethernet driver implementation.

OK, thanks for the information, I will try to check the ethernet driver.

I checked this issue a little bit further.
If the core that is at 100% is not further interrupted by other system calls or whatever,
there is no timeout or packet loss from my observation, at least for 2 hours.

It also seems that if ksoftirqd is bound to core 0, timeouts and packet loss are much more likely to occur,
while if it is bound to core 1, 2, … (anything other than 0), the chance of timeouts and packet loss is much lower.

I also found that the QCS8550 SoC's SDK/firmware limits the IRQ affinity to cores 0-2 (I don't know why, but maybe they have their reasons), and I can't modify the firmware for now. So I chose to create a service that runs right after the board boots to further restrict the IRQ binding, forbidding core 0 from being bound to IRQs.

echo 6 > /proc/irq/default_smp_affinity   # mask 0b110: limit to cores 1 and 2
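
For illustration, a simplified sketch of a script such a service could run (not my exact script, needs root); it also masks core 0 out of already-registered IRQs, since default_smp_affinity mainly applies to IRQs requested after it is written:

import glob

ALLOWED_MASK = 0x6  # cores 1 and 2 only; core 0 excluded

def restrict_irq_affinity():
    # Default affinity mask for IRQs requested from now on.
    with open("/proc/irq/default_smp_affinity", "w") as f:
        f.write(f"{ALLOWED_MASK:x}\n")
    # Mask core 0 out of the IRQs that are already registered.
    for path in glob.glob("/proc/irq/*/smp_affinity"):
        try:
            with open(path, "r+") as f:
                current = int(f.read().strip(), 16)
                new = current & ALLOWED_MASK
                if new:  # never write an empty mask
                    f.seek(0)
                    f.write(f"{new:x}\n")
        except (OSError, ValueError):
            # Some IRQs (e.g. per-CPU ones) do not allow changing their affinity.
            pass

if __name__ == "__main__":
    restrict_irq_affinity()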

Currently this works like a charm.

I hope this helps someone in the future.

Best,
Joe