Using AI to confirm a wrongly cabled Exadata switch – or how to fix the verify_roce_cables.py script for Python3.


One of the preparation steps when installing an Exadata X10M is to verify that the RoCE switches are cabled correctly. The next step is to upgrade the Cisco switches to the latest firmware. During my intervention for Tradeware at the customer, the first step didn’t work because the provided script is not compatible with Python3, and the second complained about wrong cabling.

Here I show how I investigated the wrong cabling of the X10M switches and how I used Claude.ai (ChatGPT and other AI tools probably also work) to quickly fix the Python script provided by Oracle.

When using the verify_roce_cables.py script, I got the following error multiple times:

ibdiagtools# ./verify_roce_cables.py -n NODES.lst -s SWITCHES.lst

File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "./verify_roce_cables.py", line 12955, in get_vlan_from_host
    d[host] = [bytes2str(line).strip() for line in ret
  File "./verify_roce_cables.py", line 12956, in <listcomp>
    if re.search('bound to|ipv4|ipv6', line)]
  File "/usr/lib64/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object

For the moment I ignored it, until I tried to upgrade the switches’ firmware and got:

patch_switch_23.1.15.0.0.240605# ./patchmgr --roceswitches switches.lst --upgrade
 
2024-06-19 12:35:27 +0200        :Working: Initiate upgrade of 2 RoCE switch(es) to 10.2.4 Expect up to 15 minutes for each switch 
 
2024-06-19 12:35:30 +0200 1 of 2:Running upgrade on switch 10.10.20.21
 
2024-06-19 12:35:37 +0200:        [INFO     ] Performing Nodes connectivity tests on 10.10.20.21
2024-06-19 12:35:44 +0200:        [FAIL     ] PORT Eth1/26 ON SWITCH roceA.client.ch NOT CONNECTED:  Please check connections
2024-06-19 12:35:44 +0200:        [FAIL     ] PORT Eth1/25 ON SWITCH 10.10.20.21 NOT CONNECTED:  Please check connections
2024-06-19 12:35:44 +0200:        [FAIL     ] [FirmwareUpgradeError] Fabric health check failed
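
When the log is longer than this, the failing ports can be pulled out with a few lines of Python. This is only a sketch: the regular expression is derived from the two [FAIL] lines shown above, not from any documented patchmgr log format, and the function name `unconnected_ports` is my own.

```python
import re

# Log lines copied from the patchmgr output above.
log = """\
2024-06-19 12:35:37 +0200:        [INFO     ] Performing Nodes connectivity tests on 10.10.20.21
2024-06-19 12:35:44 +0200:        [FAIL     ] PORT Eth1/26 ON SWITCH roceA.client.ch NOT CONNECTED:  Please check connections
2024-06-19 12:35:44 +0200:        [FAIL     ] PORT Eth1/25 ON SWITCH 10.10.20.21 NOT CONNECTED:  Please check connections
2024-06-19 12:35:44 +0200:        [FAIL     ] [FirmwareUpgradeError] Fabric health check failed
"""

# Assumption: every unconnected port is reported as
# "[FAIL ...] PORT <port> ON SWITCH <switch> NOT CONNECTED".
FAIL_RE = re.compile(r"\[FAIL\s*\]\s+PORT (\S+) ON SWITCH (\S+) NOT CONNECTED")


def unconnected_ports(text):
    """Return (switch, port) pairs for every 'NOT CONNECTED' failure line."""
    return [(m.group(2), m.group(1)) for m in FAIL_RE.finditer(text)]


ports = unconnected_ports(log)
for switch, port in ports:
    print(switch, port)
```

The last [FAIL] line carries no port, so the pattern skips it on purpose.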

Hum, not nice. Based on this I went to the documentation – Cabling Tables for Oracle Exadata Database Machine X10M – and in Table 13-2, RDMA Network Fabric Cabling for Oracle Exadata Database Machine X10M Servers and Leaf Switches, I see:

The first column is the position of the target component.

The third column is the position of the RoCE switch (switch A on U20, switch B on U22).

The last column gives the switch port number.

So I can see that ports 25 and 26 on switch B do not have anything on the other side. Using the first column, U8 points (usually) to Storage Server 4 and U10 to Storage Server 5. This Exadata has 4 storage servers, so one cable is not right.

I connect directly to switch B and run the command:

switch# show interface status

That confirms it: there is a cable on Eth1/25 that should go to RU10, where there is no storage server.

I felt that “verify_roce_cables.py” would be useful to confirm my assumptions about the cabling, but I do not use Python enough to know what the problem could be. On Oracle Support I found out that the problem was related to the Python version:

Verify_roce_cables.py May Fail with Newer Versions of Python (Doc ID 2971639.1)

The provided “solution” doesn’t please me: the Exadata machine is installed in a protected network, and bringing extra software there is not easy.


SOLUTION

Use an older version of Python to Run verify_roce_cables.py
ibdiagtools#/usr/bin/python2.7 ./verify_roce_cables.py -n NODES.lst -s SWITCHES.lst
The plan is to provide a fix in Exadata image 23.2.0 when it is released


We are using the latest available version, 23.1.15, released 10 days ago, as listed in Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1).

But then I thought, maybe AI can help me. So I turned to my currently preferred tool, claude.ai, and provided this prompt:

This script works in Python 2.7 but not in Python 3. The error is the following:
...
what can be the cause?

Its answer was great, pointing out the line where the error occurred and explaining what changed in Python3.
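
The change in question can be reproduced in a few lines. In Python 2 the output read from a pipe is a plain `str`, so a string pattern matches it directly; in Python 3 the same output arrives as `bytes`, and `re` refuses to mix the two types. The sketch below is my own minimal reproduction, not Claude's answer, and the sample line is invented for illustration.

```python
import re

# Hypothetical sample of what a command pipe returns in Python 3: bytes, not str.
line = b"vlan 3888 bound to re0\n"


def search_raw(raw):
    """Mimic the failing call: a str pattern applied to a bytes object."""
    try:
        return bool(re.search('bound to|ipv4|ipv6', raw))
    except TypeError as exc:
        # Python 3 raises here; return the message so it can be inspected.
        return str(exc)


result = search_raw(line)
print(result)
```

Under Python 3 this prints the same message as the traceback from the script.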

Of the three options provided, the first did not seem right to me. The third would be a catch-all but not so nice (why does it convert the pattern to bytes, like in the first solution, before compiling?), so I decided to use the second solution.

I profit from one of my preferred things in Python – the dot notation – and change line 12956 to:

        if re.search('bound to|ipv4|ipv6', line.decode('utf-8'))]

Or using sed:

cp verify_roce_cables.py verify_roce_cables.py.orig.$(date +%Y%m%dT%H%M%S)
sed -i "s/if re\.search('bound to|ipv4|ipv6', line)]/if re\.search('bound to|ipv4|ipv6', line.decode('utf-8'))]/" verify_roce_cables.py
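
The one-line change can be tried in isolation before touching the real script. The sketch below assumes hypothetical command output in `ret` and uses a plain `decode` where the script calls its own `bytes2str`. It compares the approach I chose (decode the line) with the alternative of turning the pattern into bytes.

```python
import re

PATTERN = 'bound to|ipv4|ipv6'

# Hypothetical command output as bytes, the way a pipe delivers it in Python 3.
ret = [b"vlan 3888 bound to re0\n", b"unrelated line\n", b"ipv6 enabled\n"]

# Approach used in this post: decode each line and keep the str pattern.
matched = [line.decode('utf-8').strip() for line in ret
           if re.search(PATTERN, line.decode('utf-8'))]

# Alternative: leave the lines as bytes and encode the pattern instead.
matched_bytes = [line.strip() for line in ret
                 if re.search(PATTERN.encode('utf-8'), line)]

print(matched)
print(matched_bytes)
```

Both variants select the same lines; the difference is whether the result ends up as `str` or `bytes`. Note that `decode('utf-8')` raises `UnicodeDecodeError` on bytes that are not valid UTF-8, so `line.decode('utf-8', errors='replace')` may be the safer choice if the command output is not guaranteed to be clean.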

Now the script works wonderfully and confirms my findings:

PS: I’ve checked the version included in Exadata 24.1.1 (Patch 36651148), and there the Python code was changed, but they use a much more complex solution, with a self-written “bytes2str” function! Maybe an AI tool would also be useful at Oracle to improve coding. 🙂
