Since we moved datacenters, we've been having problems with the circuits that connect the campus to our machines (all the way in Texas). The vendor requested a window in which they could do some 'invasive' testing. During this window we would fail over to the secondary circuit, so no traffic should be affected.
Or so we thought...
Right after we switched over, I started getting alerts from Nagios and Gomerz, and started seeing lots of SAP timeout errors in the app logs. Uh-oh. We have a little standalone JCO/RFC client 'ping' that we use to validate basic connectivity back to SAP, and even it was failing. Boo. I play for a little bit, seeing how far we get:
1. Can we ping the SAP server? Check.
2. Can we establish a TCP 3-way handshake to the SAP server? Yup.
3. Does the client start sending data to the SAP server? Uh-huh.
4. Does the server send data back to the SAP client? Mmm-hmm.
5. Does the client send more data to the SAP server? You know it to be true.
6. Does the server respond to this? No!

For some reason SAP did not play nice with the secondary circuit. Since we hadn't anticipated this, we called off the testing window and regrouped to try to isolate the problem. I was able to capture one network trace, as well as a truss of the JCO/RFC ping utility.
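The ladder above (connect, send, read, send, read) is easy to walk by hand. Here's a minimal Python sketch of that kind of probe; the host, port, and message bytes are placeholders, not the real SAP endpoint or protocol:

```python
import socket

def probe(host, port, first_msg, second_msg, timeout=10.0):
    """Walk the connectivity ladder by hand: connect (steps 1-2),
    send (step 3), read (step 4), send more (step 5), read again
    (step 6, where our connection hung)."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(first_msg)   # step 3: client sends data
        s.recv(4096)           # step 4: server answers
        s.sendall(second_msg)  # step 5: client sends more data
        return s.recv(4096)    # step 6: times out on the bad path

# Hypothetical endpoint -- substitute your own gateway host/port.
# probe("sapgw.example.com", 3300, b"msg1...", b"msg2...")
```

On a healthy path the final `recv` returns data; on the secondary circuit it would sit there until the timeout fired.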
I spent a little time looking at what the ping utility did, and tried to figure out where it was breaking. After playing around a bit, I found that it fell over when it requested a list of services the SAP gateway offers.
[ Aside: I am no SAP expert, but here's what I do know: the client makes a TCP connection to the gateway, and the gateway tells the client which application server to connect to. The client then disconnects from the gateway and connects to the app server. This is SAP's method of load balancing. It is roughly analogous to a rendezvous server, or the Unix portmapper. ]
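The shape of that rendezvous pattern is simple to sketch. This is just the pattern, not the real SAP gateway protocol (which is binary and proprietary) -- the text request/response format here is invented for illustration:

```python
import socket

def rendezvous_connect(gw_host, gw_port, request):
    """Ask a rendezvous server which backend to use, then connect
    there directly. The "host:port" reply format is made up; the
    real SAP gateway speaks its own binary protocol."""
    with socket.create_connection((gw_host, gw_port)) as gw:
        gw.sendall(request)
        app_host, app_port = gw.recv(4096).decode().split(":")
    # Gateway connection is closed; talk to the app server directly.
    return socket.create_connection((app_host, int(app_port)))
```

The practical consequence for debugging: *two* TCP conversations have to survive the circuit, and a problem in the first one (the gateway exchange) kills everything downstream.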
I was able to cobble together a Perl script to get a list of IP addresses out of the gateway --
jwa@nimue:~/work/cvs/hacks/sap$ ./talk-to-gw
writing msg1 [114 bytes]
response length: 110
**MESSAGE**- MSG_SERVER got 110 bytes
writing msg2 [254 bytes]
response length: 1290

It was at the point where the client sends the second message of 254 bytes ('msg2' above) to the server that the connection to the gateway hung, and the client eventually timed out. It looked like the packet (and its subsequent retries) was either dropped by the network, or not properly read by the server.
parse
Name: [**MESSAGE**--??]  Type: 0x0 (0)
Name: [B2B_GRP LG_EYECAT]  Type: 0x1 (1)  IP: ———.251.197
Name: [BILLER_DIRECT LG_EYECAT]  Type: 0x1 (1)  IP: ———.3.118
Name: [DE1-APP LG_EYECAT]  Type: 0x1 (1)  IP: ———.251.197
Name: [DP_APP LG_EYECAT]  Type: 0x1 (1)  IP: ———.251.197
Name: [JCO_ORDER_STAT LG_EYECAT]  Type: 0x1 (1)  IP: ———.251.197
Name: [Long_Runner LG_EYECAT]  Type: 0x1 (1)  IP: ———.251.197
Name: [MAIL LG_EYECAT]  Type: 0x1 (1)  IP: ———.251.197
Name: [FAV_COMPUTE_TIME FAV_COMPUT]  Type: 0x54 (84)
Name: [FAV_COMPUTE_SERVER camsapd1_D]  Type: 0x44 (68)
Name: [SPACE LG_EYECAT]  Type: 0x1 (1)  IP: ———.3.118
Name:  Type: 0x0 (0)
Therefore, two speculations:
1. The secondary circuit modifies the characteristics of the packet (i.e., fragmentation, MTU shrinkage, or whatever) such that the 254-byte packet is delivered in two pieces. The SAP gateway expects exactly 254 bytes of data in a single read(), but since it gets (say) 128 bytes of data, it does the Wrong Thing.
2. The secondary circuit is dropping packets that have a certain byte pattern. (Like pinging a dialup user with a packet containing '+++': their stack echoes it back, the modem sees it as '+++' originating from the client, and it goes into command mode and breaks the user's dialup session.)
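Speculation #1 is a classic short-read bug, and it's worth spelling out: a single read()/recv() may legally return fewer bytes than requested, so a server that assumes one read equals one message breaks the moment the path starts fragmenting. A minimal sketch of the defensive loop (not SAP's actual code, obviously):

```python
import socket

def recv_exactly(sock, n):
    """Loop until exactly n bytes have arrived. If a 254-byte message
    is split in transit and only the first piece is in the buffer,
    a naive single recv(254) returns short -- a server that then
    treats the partial read as the whole message does the Wrong
    Thing (speculation #1)."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)
```

If the gateway reads this way, fragmentation is harmless; if it doesn't, any path change that alters packetization can break it.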
The circuit provider doesn't think they're dropping packets, and suspects it's an app issue.
Soo... we spend the next week figuring out how to replicate the problem without affecting production. We decide to route traffic between the SAP QA system and our QA system across the secondary circuit. (Much too much time is spent on getting approval from every remotely interested party.) We decide to place no fewer than four sniffers on the network -- one close to the SAP QA system, one at the edge of our network (where our network meets the circuit), one at the edge of our hosting provider's network (where their network meets the circuit), and one right off our QA network.
Saturday comes. The circuit provider routes traffic (it's neat to see the traceroutes change) across the secondary circuit. And, presto! We can replicate the problem... even with my little Perl script :-) We capture sniffer traces and spend ~2 hours analyzing them, this time with the assistance of our hosting provider's network folks. We see the 254-byte packet leaving the edge of the hosting provider's network, but we don't see it on our edge of the network. We see the TCP stack on the client end get lonely and issue retransmits, but we don't see those retransmissions on our edge either.
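With four captures of the same conversation, the quickest sanity check is mechanical: count how many packets of the suspect size appear in each trace, and see at which tap point the count drops to zero. A small sketch that does this against classic libpcap files (pcapng would need more work; the 24-byte global header and 16-byte per-record headers here are the documented classic pcap layout):

```python
import struct

def count_packets_of_length(pcap_bytes, length):
    """Count records in a classic libpcap capture whose original
    (on-the-wire) length matches `length`. Running this over the
    capture from each sniffer shows where the 254-byte packet
    disappears."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":    # little-endian capture
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":  # big-endian capture
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    off, hits = 24, 0                    # skip 24-byte global header
    while off + 16 <= len(pcap_bytes):
        _ts_sec, _ts_usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[off:off + 16])
        if orig_len == length:
            hits += 1
        off += 16 + incl_len             # jump past the packet data
    return hits
```

(In practice a display filter in Wireshark does the same job; this is just the idea reduced to code.)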
It seems pretty clear that the packet is getting lost in "the cloud". But why? No one knows. The people that might be able to help aren't on the call, and we agree to pick things up on Monday.
Afterwards, I went for a gentle jog around work. My knee is still acting up, though, so I walked on the downhill parts to minimize trauma. While I could feel it threatening to hurt, it felt mostly OK up until I felt a strange pain in the back of my knee. I walked for a bit after that; fortunately it was near the end. Going down the stairs later that evening I could definitely feel it. Gah.