Abnormal Basler disconnection after a long time

Hi,

I am using Aravis 0.8.20 with applications that run for a long time (possibly several days in a row), and I sometimes see odd behaviour when using multiple GigE Basler cameras on the same machine, each camera having its own dedicated port.

Without being physically moved or touched, one camera can disappear from the device list, trigger its control-lost signal and never come back despite still being connected. After that, if we physically reconnect the faulty camera, it doesn’t come back, and another camera disconnects.

The only way I have found to recover them is to restart the program using the driver.

In an attempt to understand what’s going wrong, I’ve run some experiments:

  • I tried to use arv_shutdown to recover the cameras without restarting the program, but it doesn’t work
  • I tried multiple Basler models and got the same result
  • I cannot reproduce this problem with other camera brands
  • I had a similar problem with pypylon, the official Python library from Basler. There, some driver functions never return and block completely, which is a lot harder to debug than with Aravis. I didn’t try to go further.

I could conclude that the problem comes from the Basler cameras and that nothing is wrong with my setup, but this conclusion doesn’t satisfy me because I’m sure I can mitigate the problem, or at least fix it manually, with only a few lines of code.

The major difficulty in solving this problem is that it happens randomly: sometimes it happens after 1 hour, sometimes I have no problem for three days. So whatever I try takes time, and when I have no issue for a long time I’m not sure whether it’s because my change worked.

I am coming here to find out whether someone has already seen this kind of behaviour, whether this is a known bug that I’m not aware of, and whether someone has ideas on how I could investigate this issue more efficiently. I would like to enable the Aravis debug logs, but since the camera behaves as if it were physically disconnected, I’m not sure I will get interesting logs. Also, I’m not sure which debug categories and levels would give useful information, and I don’t think I could enable everything and handle all those logs for that long.

Thanks

Hi,

Which driver ?

arv_shutdown is mainly useful for memory debugging.

I have experienced some trouble using Basler cameras, apparently similar to yours. It seemed to happen when there are a lot of lost packets (and consequently a lot of packet resend requests).

What camera model are you using ?

I have a setup with 4 acA1300-60gm-NIR running for months without issue. This reliability is mostly due to the fact the application is running on a realtime OS, and the network has a bandwidth dedicated to the data transfer between the cameras and the receiving host.

If the issue is related to the number of packet resend requests, you may try to tweak these stream properties:

  • packet-request-ratio
  • initial-packet-timeout
  • packet-timeout
  • frame-retention

For example:

double packet_request_ratio;

...

/* g_object_set() takes a NULL-terminated list of property/value pairs, not a GError. */
g_object_set (stream, "packet-request-ratio", packet_request_ratio, NULL);

https://aravisproject.github.io/aravis/class.GvStream.html#properties

An interesting addition to Aravis would be an option to keep the last n debug entries in memory, and a way to write them to a file, which the user would call after a given event is detected (in your case, the control-lost signal).
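The idea of keeping only the last n debug entries can be sketched as a small ring buffer. This is not an existing Aravis feature: LogRing, log_ring_push and log_ring_dump are hypothetical names, the capacity and entry size are arbitrary, and wiring the push into a log handler and the dump into a signal callback is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_RING_CAPACITY 4   /* hypothetical: keep only the last 4 entries */
#define LOG_ENTRY_SIZE 128    /* longer messages are truncated */

typedef struct {
    char entries[LOG_RING_CAPACITY][LOG_ENTRY_SIZE];
    size_t next;   /* slot the next entry will overwrite */
    size_t count;  /* number of valid entries, at most LOG_RING_CAPACITY */
} LogRing;

static void log_ring_init(LogRing *ring)
{
    memset(ring, 0, sizeof *ring);
}

/* Store one message, overwriting the oldest entry once the ring is full. */
static void log_ring_push(LogRing *ring, const char *message)
{
    assert(ring != NULL && message != NULL);
    snprintf(ring->entries[ring->next], LOG_ENTRY_SIZE, "%s", message);
    ring->next = (ring->next + 1) % LOG_RING_CAPACITY;
    if (ring->count < LOG_RING_CAPACITY)
        ring->count++;
}

/* Write the retained entries, oldest first; returns how many were written. */
static size_t log_ring_dump(const LogRing *ring, FILE *out)
{
    size_t start = (ring->next + LOG_RING_CAPACITY - ring->count) % LOG_RING_CAPACITY;
    for (size_t i = 0; i < ring->count; i++)
        fprintf(out, "%s\n", ring->entries[(start + i) % LOG_RING_CAPACITY]);
    return ring->count;
}
```

In an application, log_ring_push would be called for every debug message and log_ring_dump only when the event of interest fires, so days of logging cost a fixed amount of memory. If messages arrive from several threads, the ring would also need a lock.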

Please keep us informed if you manage to gather more information about this issue.

Cheers,

Emmanuel.

Hello,

I was talking about Aravis here. I have the habit of referring to it as a driver, but I guess that is not really the case.

I was not sure it would help, but I wanted to look for a way to get the cameras back manually. Is there a way I could “restart” Aravis manually in the meantime ? I saw that I can disable and enable interfaces, but I’m not sure it will magically recover my cameras.

I’m mainly using a2A2448-23gcBAS cameras, but I also have some acA2440-20gc and acA2500-14gc.

I will start some extensive tests in the next few weeks and will surely include these properties. I think I will also apply some tips from the Ethernet Device Performance page that I’ve not applied yet. If I reach some interesting conclusions from these tests, I will make sure to post them in this thread.

Thank you for your quick response

Ah ok. For now, Aravis is purely a user space library. It could be interesting to implement a GigEVision packet filter as a kernel module, for security reasons when you want to maximize performance. The most effective method currently is the AF_PACKET socket, but that requires running the application as root, or giving the application the raw socket access capability (which means it can monitor all the network traffic).

Once you have destroyed the ArvCamera object, there is no state kept in Aravis about the camera.

Thanks.

Emmanuel.

Sure, one of the first things I do when I receive a control_lost signal (which in this case should not happen) is to destroy the ArvCamera object.

But then, my camera is still physically plugged in, and the functions arv_update_device_list and arv_get_n_devices act as if no devices were connected anymore.

In the long term, I’ll try to mitigate this disconnection (solving the problem at the source), but in the short term I’m looking to recover my cameras, which means having the two functions above find my camera again, and being able to instantiate a new ArvCamera object as if it were a new camera. Since my cameras are found again when I restart my program, I wonder if I could get the camera back without restarting it.

I guess you are trying to reconnect immediately after destroying the ArvCamera object. Did you try to wait a bit before trying to reconnect ? If arv_update_device_list() acts as if there is no available device, that means the camera does not answer the discovery packet. There may be a delay before the camera is able to answer again after a disconnect event…

I schedule these calls in a thread every 10s (I made my functions thread safe for this case), and usually the problem happens when I’m not near my computer, so the program goes through multiple hours of arv_update_device_list calls that do not return the lost camera.

And after that, if you stop and restart your program, you regain control of the camera ? That is puzzling…

The GigEVision protocol is based on UDP, and is not connection based like TCP. The stop/start of your program should not change anything.

Could you check that you are able to ping the camera using arv-tool while your software is running and trying to reconnect ?

Also, could you check with Wireshark that your program is emitting GigEVision discovery packets every 10s ?

I don’t know if I am crazy or just unlucky, but now that I’m prepared to inspect network packets I’m unable to reproduce the exact error. I have had camera disconnections, but the cameras come back after replugging them and do not make the other ones disconnect. I looked at the network packets anyway when this happened, and the faulty camera does not show up in the results; same for arv-tool. It looks like when the disconnection happens, the camera firmware has crashed and needs to be rebooted, so I don’t think we can do anything with just software.

I will try to tweak some stream properties for now to see if I’m able to mitigate the disconnection, and I will report back here if I find something interesting.

Hello,

Sorry, I was busy lately so I couldn’t provide quick and clear feedback, but I want to make sure this thread has some conclusion in case somebody comes here.

I have done some experiments with network properties, and I still can’t provide a clear and final answer, but it feels like they largely increased stability. I believe this doesn’t solve the problem at the source. I still suspect the problem comes from the camera, and maybe from the hardware, but as long as it solves my problem, I am fine with that !

First I changed the MTU of my interface to 9000 and increased the packet_size to 8192, then I increased heartbeat_timeout from 3000 to 10000 and I switched socket_buffer to AUTO.
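To give a rough idea of why a larger packet size can reduce the number of resend requests, here is a small worked calculation of how many stream packets one frame needs. It is only a sketch under several assumptions that are not from this thread: a 2448 x 2048 frame at 8 bits per pixel, a previous packet size of 1500, 36 bytes of IP, UDP and GVSP headers counted inside the packet size, and leader and trailer packets ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-packet overhead: 20 (IP) + 8 (UDP) + 8 (GVSP header) bytes. */
#define GVSP_OVERHEAD 36

/* Number of data packets needed to carry one frame of frame_bytes bytes. */
static size_t packets_per_frame(size_t frame_bytes, size_t packet_size)
{
    assert(packet_size > GVSP_OVERHEAD);
    size_t payload = packet_size - GVSP_OVERHEAD;
    return (frame_bytes + payload - 1) / payload;  /* round up */
}
```

With these assumptions, packets_per_frame(2448 * 2048, 1500) gives 3425 packets and packets_per_frame(2448 * 2048, 8192) gives 615, so each frame offers roughly five times fewer opportunities for a packet to be lost.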

I also changed how I remove managed cameras in my program. Since I can manage uninstantiated cameras (which are just device id strings bundled in my class at this point), I couldn’t rely on the control-lost signal alone, so I added the thread calling arv_update_device_list, which adds new cameras and clears the ones that have disappeared. But sometimes a camera was missing from the result of arv_update_device_list just once, which cleared uninstantiated cameras but also instantiated, working ones. I added a rule so that instantiated cameras are only cleared by the control-lost signal.
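The removal rule described above can be sketched independently of Aravis. ManagedCamera, apply_device_list and on_control_lost are hypothetical names, not the poster's actual code and not Aravis API; the listed ids stand in for whatever a device list refresh returns, and adding newly discovered cameras is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *device_id;
    bool instantiated;  /* true once a camera object exists for this id */
    bool present;       /* false once removed from the managed set */
} ManagedCamera;

static bool id_in_list(const char *id, const char *const *ids, size_t n_ids)
{
    for (size_t i = 0; i < n_ids; i++)
        if (strcmp(id, ids[i]) == 0)
            return true;
    return false;
}

/* Apply one device list refresh: only uninstantiated cameras may be cleared
 * here, so a single missed discovery answer cannot drop a working camera. */
static void apply_device_list(ManagedCamera *cams, size_t n_cams,
                              const char *const *listed_ids, size_t n_listed)
{
    assert(cams != NULL || n_cams == 0);
    for (size_t i = 0; i < n_cams; i++) {
        if (!cams[i].present || cams[i].instantiated)
            continue;
        if (!id_in_list(cams[i].device_id, listed_ids, n_listed))
            cams[i].present = false;
    }
}

/* Called from the lost-signal handler: the only path that clears an
 * instantiated camera. */
static void on_control_lost(ManagedCamera *cam)
{
    assert(cam != NULL);
    cam->instantiated = false;
    cam->present = false;
}
```

Because the polling thread and the signal handler can run concurrently, both functions would need to be called under the same lock in a real program.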

I tried these modifications on a computer with cameras that were often disconnecting (at least twice a day), and they didn’t disconnect for almost a week.
