Hello, there.
I am working on a project that uses the Aravis Python bindings to grab images from a GigE camera on an embedded board (8 cores), do some image processing, and display the result in a web server.
In the beginning the program works fine; CPU utilization is like:
core no.1: 100% (the producer, getting GigE camera buffers into a queue)
core no.2: 40-70% (the consumer, handling the images from the queue)
The problem I am facing:
After running for a long time, perhaps 20 minutes, the thread running on core no.2 climbs to almost 100% CPU utilization, and there are multiple timeout logs for frames:
“[GvStream::check_frame_completion] Timeout for frame 11897 at dt = 203292”
I also checked on a PC with a 20-core CPU using the same procedure, and reproduced the issue.
To confirm whether this issue has something to do with my code, I also checked with arv-camera-test-0.8 (on the PC (20 cores) and on the board (8 cores), with one core 100% utilized).
Htop information below
Procedure is like below:
export ARV_DEBUG=all
arv-camera-test-0.8 --gv-packet-size 1500   # or 1400, or 500
Log:
arv-camera-test-0.log (21.9 KB)
Am I not using the library correctly, or am I missing something?
Please kindly help.
Thank you for your help in advance.
Best,
Joe.
Reference:
My software architecture is like this:
- Thread 1: a producer that constantly gets frames from the camera using Aravis stream buffers and puts them into a queue. The queue's size is unlimited, so there should be no problem even if thread no.2 can't handle the queued images in time, since the producer just keeps putting images from the GigE camera into the queue.
def video_stream_producer(g_stop_event, queue, stream, max_queue_size=100):
    try:
        while not g_stop_event.is_set():
            try:
                # Non-blocking pop: returns None when no buffer is ready,
                # so this loop busy-waits (hence the 100% core).
                buffer = stream.try_pop_buffer()
                if buffer is None:
                    continue
                # get_data() copies the frame into Python bytes, so the
                # Aravis buffer can be requeued right away.
                frame_data = buffer.get_data()
                if frame_data:
                    stream.push_buffer(buffer)  # push the buffer back to the stream as soon as used
                    if not queue.full():
                        queue.put(frame_data)
                    else:
                        print("[Debug] Queue is full, dropping frame.")
                    if queue.qsize() > max_queue_size:
                        print("[Debug] Queue size exceeded, clearing queue.")
                        with queue.mutex:
                            queue.queue.clear()
                else:
                    stream.push_buffer(buffer)
            except Exception as e:
                if g_stop_event.is_set():
                    break
                print(f"[Error in producer thread] {e}")
    except Exception as e:
        print(f"[Error in producer] {e}")
    print("[Debug] Producer thread exiting.")
- Thread 2: a consumer that gets data from the queue above, does some image processing (bit conversion), and shows it in the web service (OpenCV JPEG encode and yield); a minimal sketch follows below.
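A minimal sketch of the consumer (the WIDTH/HEIGHT values and the mono8 pixel format below are placeholders for illustration, not my real camera settings):

import queue as std_queue

import cv2
import numpy as np

WIDTH, HEIGHT = 1920, 1080  # placeholders; the real values come from the camera

def video_stream_consumer(g_stop_event, frame_queue):
    while not g_stop_event.is_set():
        try:
            frame_data = frame_queue.get(timeout=1.0)
        except std_queue.Empty:
            continue
        # Bit conversion step: interpret the raw bytes as a mono8 image.
        img = np.frombuffer(frame_data, dtype=np.uint8).reshape(HEIGHT, WIDTH)
        ok, jpeg = cv2.imencode(".jpg", img)
        if not ok:
            continue
        # Yield an MJPEG multipart chunk for the web server.
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")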
Here is some extra information to explain this issue:
1. htop information when only running arv-camera-test-0.8 --gv-packet-size 1500
2. Intentionally utilizing a single core to 100% by using the following code:
COREMASK="0x01"   # one core
taskset "$COREMASK" bash -c '
while :; do
    echo "$((RANDOM * RANDOM))" > /dev/null
done'
3. Intentionally utilizing a single core to 100% while also running arv-camera-test-0.8 --gv-packet-size 1500
You can see on the left that there are "Timeout for frame" logs.
Here are the arv-camera-test-0.8 logs without and with one core utilized to 100%:
arv-camera-test-0.8_no_extra_cpu_utilization.log (16.6 KB)
arv-camera-test-0.8_extra_1_core_100_utilization.log (31.9 KB)
Best,
Joe.
Very sorry to post multiple times.
It seems the one-core-100% utilization comes from the kernel: the NET_RX softirq that receives data from the GigE camera over the network.
# cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
HI: 11879 0 0 0 0 770 0 3
TIMER: 1572296 2825843 7736379 82376 293854 84532 1029875 161614
NET_TX: 6934 999 2067 129 14430 16 134 138
NET_RX: 180617448 2823498 12479279 5956 4556 7711 10549 1583
BLOCK: 3618 70560 18842 2877 3192 5651 6247 8047
IRQ_POLL: 0 0 0 0 0 0 0 0
TASKLET: 85027 13310 58133 8 0 185 3 87
SCHED: 1476442 2517628 4667216 231373 246995 756514 517846 27747
HRTIMER: 12 0 0 0 0 0 0 0
RCU: 2375074 1729336 2380463 237983 375160 398170 1067986 471719
I have tried to distribute the softirqs to multiple cores by using irqbalance:
sudo apt install irqbalance
sudo systemctl enable irqbalance --now
I also bound the CPU-heavy task to a core other than core 0, which seems to be the default core the OS uses for new tasks, I guess:
import os
os.sched_setaffinity(0, {7})  # core 7 is a more powerful core on my board
And now my camera can last more than 1 hour without problems.
There is no more timeout issue currently.
By the way, is the 100% software-IRQ utilization normal?
According to our HW guy, a hardware IRQ only happens when it is necessary.
Does this mean that, because I am using MTU == 1500, the raw data from the GigE camera is split into so many packets that it creates a lot of interrupts? Is this correct?
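My rough packet-rate math (back-of-the-envelope only: it assumes ~36 bytes of IP/UDP/GVSP header overhead per packet and ignores NIC interrupt coalescing):

link_bytes_per_s = 125_000_000  # a saturated 1 Gb/s link
for mtu in (1500, 8192):
    payload = mtu - 36  # assumed per-packet header overhead
    print(f"MTU {mtu}: ~{link_bytes_per_s / payload:,.0f} packets/s")
# -> MTU 1500: ~85,000 packets/s; MTU 8192: ~15,000 packets/s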
Best,
Joe.
Hi,
If you are near the 100% CPU usage, timeouts may be due to the packet resend mechanism, which by itself will increase the CPU use, and increase the number of network packets.
Did you try to increase the socket buffer size?
Why do you limit the packet size to 1500? A bigger value would decrease the number of interrupts.
https://aravisproject.github.io/aravis/aravis-stable/ethernet.html
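For reference, with the Python bindings the socket buffer can be enlarged through the ArvGvStream properties; a sketch (the 2 MiB value is only an example, and the kernel's net.core.rmem_max must be large enough for it to take effect):

import gi
gi.require_version("Aravis", "0.8")
from gi.repository import Aravis

camera = Aravis.Camera.new(None)            # first camera found
stream = camera.create_stream(None, None)
# Use a fixed, enlarged kernel socket buffer instead of the default.
stream.set_property("socket-buffer", Aravis.GvStreamSocketBuffer.FIXED)
stream.set_property("socket-buffer-size", 2 * 1024 * 1024)  # 2 MiB, example value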
Emmanuel.
Hi @Emmanuel
Thank you for your reply.
I changed the MTU (GigE camera / board / PoE switch / PC) to 8192 and enabled jumbo frame support.
And the one-core-100% result is not changing.
## camera setting
camera_mtu = 8192
camera.gv_set_packet_size(camera_mtu)
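(As an aside, instead of hard-coding the packet size, Aravis can probe the largest size the network path supports; a sketch, assuming the gv_auto_packet_size binding:)

packet_size = camera.gv_auto_packet_size()  # negotiate the largest working size
print(f"Negotiated GVSP packet size: {packet_size}")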
## board
18: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8192 qdisc mq state UP group default qlen 1000
link/ether ec:21:e5:10:4f:ea brd ff:ff:ff:ff:ff:ff
inet 192.168.3.9/24 brd 192.168.3.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::8a2f:84d1:f68:7472/64 scope link
valid_lft forever preferred_lft forever
The PoE switch has jumbo frame support enabled.
There is no special reason for the MTU of 1500; it is just that the MTU might not be configurable in our customer's environment, so we set it to the common value 1500 (the default for some devices?).
Best,
Joe
I have no other suggestion to improve the situation. You should avoid setting the working point at 100% CPU use though, as in this case there is no room for packet resends. Try to decrease the image size, or use the binning feature.
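With the Python bindings, binning would look something like this (a sketch, assuming the camera exposes the standard Binning features):

camera.set_binning(2, 2)           # 2x2 binning: a quarter of the pixels to transfer
x, y, width, height = camera.get_region()
print(f"Region after binning: {width}x{height}")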
Hi @Emmanuel
Thank you for your reply.
You should avoid setting the working point at 100% CPU use though, as in this case there is no room for packet resends. Try to decrease the image size, or use the binning feature.
OK thanks for pointing out that.
And if we would like to keep the image size as it is (largest resolution), then we will have to bear this 100% CPU use, right?
It is unexpected that you use 100% of a CPU with a single 1 Gb/s camera. What is the model of your CPU?
It is a QCS8550 SoC; I think I am using a Silver core, since the frequency of cores 0-2 is around 2000 MHz.
Ref: 1 x GoldPlus @ 3.2 GHz + (2+2) x Gold @ 2.8 GHz + 3 x Silver @ 2.0 GHz
It is supposed to be much more powerful than the i5-8365U in my laptop, which only uses 20-30% of a CPU for a 100 MB/s stream. I guess the difference is in the Ethernet driver implementation.
OK, thanks for the information. I will try to check the Ethernet driver.
I checked this issue a little bit further.
If the one 100%-utilized core is not further interrupted by other system calls or whatever, there is no timeout or packet loss from my observation, at least for 2 hours.
It also seems that if ksoftirqd is bound to core 0, timeouts and packet loss occur much more easily, while if it is bound to core 1/2… (anything other than 0), the chance of timeouts and packet loss is much lower.
I found that the QCS8550 SoC's SDK/firmware limits the IRQ affinity to cores 0-2 (I don't know why, but maybe they have their reasons), and I can't modify the firmware for now. So I chose to create a service that runs right after the board boots up and further restricts the IRQ binding, forbidding core 0 from being bound to any IRQ:
echo 6 > /proc/irq/default_smp_affinity  # limit to cores 1 and 2
Currently this works like a charm.
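For reference, the boot service is essentially that one-liner wrapped in a unit file; a sketch (the unit name and path are my own choice, and note that default_smp_affinity only applies to IRQs that are requested after it is written):

# /etc/systemd/system/irq-affinity.service  (hypothetical name and path)
[Unit]
Description=Keep IRQs off core 0

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 6 > /proc/irq/default_smp_affinity'

[Install]
WantedBy=multi-user.target

It is enabled with sudo systemctl enable irq-affinity --now, same as for irqbalance above.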
Hope this would help someone in the future.
Best,
Joe