network:
Ok, here's why we couldn't pass nulls:
The trouble was isolated to an ——— controlled SMX OC12 fiber mux on the ——— ——— campus that extends the circuit between campus buildings. The Option 1 circuit goes through the SMX mux that was not optioned to ignore excessive zeroes which is usually not a problem unless the customer's application is pushing excessive 0s across the circuit. In this case the customer's SAP application was padding 0s which resulted in a large string of 0s being transmitted across the connection and the fiber mux was not able to maintain synchronization when the string of 0s were detected. This resulted in the packet drops the customer was reporting.
——— actually has two different versions of fiber mux that they use when transporting connections between buildings on their fiber ring. They use what is called an Alcatel SMX mux (must be optioned to ignore excessive 0s if the customer traffic pattern generates this condition) and an Alcatel SM fiber mux which does not have the ability to be optioned to ignore excessive 0s. In ———'s case, the primary SMX fiber mux was already optioned to ignore excessive 0s when they are presented and the secondary fiber mux was not set to ignore the excessive 0s and the Engineer had to make the configuration change to correct that.
I don't know enough about the magic telco protocols to understand why this would affect synchronization.
* If it's synchronous, there's clocks at either end that are synchronized. Data spews forth in predictable (clocked) chunks from one side, and the other side reassembles the chunks according to the clock.
* If it's asynchronous, the byte should be encapsulated by a bit that indicates "start of data."
I thought these protocols were all sync, but maybe they're not. Spooky.
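For what it's worth, here is some general background, not a confirmed diagnosis of this particular mux. Many telco line codings recover the receive clock from pulses in the signal itself, so a long enough run of 0 bits gives the receiver nothing to lock on to unless the gear substitutes some other pattern for the zeroes. A rough way to see how "zero-heavy" a payload is would be to count its longest run of 0 bits. The sketch below is only that: the function name is made up, and it looks at the application payload, not at what actually ends up on the wire after framing.

```python
def longest_zero_bit_run(payload: bytes) -> int:
    """Return the length of the longest run of consecutive 0 bits in payload."""
    longest = current = 0
    for byte in payload:
        # Walk the bits of each byte, most significant bit first.
        for shift in range(7, -1, -1):
            if (byte >> shift) & 1:
                current = 0          # a 1 bit ends the current run
            else:
                current += 1
                longest = max(longest, current)
    return longest
```

A payload of 96 null bytes is one unbroken run of 768 zero bits by this measure, while ordinary text rarely produces runs longer than a handful of bits.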
The DS-3 can't pass zeroes. Why? Who knows.
This evening I had the circuit provider engineer route traffic from my Linux desktop across the DS-3. I used the perl doodad to generate the problematic SAP traffic from my desktop, and wrote another perl doodad to echo the traffic back to me. In essence it was a ping; send data to the server, and the server sends it right back 'atcha.
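The originals were quick perl hacks and aren't reproduced here. As a rough illustration of the same idea, a TCP echo "ping" might look like the Python sketch below. The function names, buffer size, and timeout are all assumptions made for the example, not what the real scripts used.

```python
import socket


def serve_echo(listener: socket.socket) -> None:
    """Accept one connection and send back whatever arrives until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def echo_roundtrip(host: str, port: int, payload: bytes, timeout: float = 5.0) -> bool:
    """Send payload and report whether the identical bytes come back in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = sock.recv(4096)
                if not chunk:
                    break                # peer closed before echoing everything
                received += chunk
    except OSError:
        # Covers timeouts, resets, and refused connections alike.
        return False
    return received == payload
```

Run `serve_echo` on the far side of the circuit and call `echo_roundtrip` from the near side; a `False` for a payload that works fine on the LAN points at the path in between.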
The ping failed. Failed! This means the problem isn't in the SAP server (hooray; troubleshooting that would have sucked bigtime.) I start chopping the packet down, trying to figure out which sequence of bytes is not passing. Finally, I am left with about 96 nulls. They don't pass. I mention this to the engineer. He says "Huh! that sounds familiar!" and a few seconds later says "I'm able to confirm this." He used ping on one of the routers to send a few dozen null bytes to the other router, and confirmed they don't pass. I play with 'ping -p' from my Linux machine and confirm as well. Hooray! Now that the circuit provider can replicate the problem, we can stop pointing fingers and get things going.
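I did the chopping by hand, but the process is mechanical enough to sketch. Assuming you have some function `fails(payload)` that returns True when a payload does not make it across (for example, the negation of an echo test), a greedy shrink could look like this. The names are hypothetical and this is a simplified cousin of delta debugging, not a description of what I actually ran.

```python
from typing import Callable


def shrink_failing_payload(payload: bytes, fails: Callable[[bytes], bool]) -> bytes:
    """Greedily remove chunks of payload for as long as the remainder still fails."""
    if not fails(payload):
        raise ValueError("the starting payload must reproduce the failure")
    chunk = len(payload) // 2
    while chunk >= 1:
        i = 0
        while i < len(payload):
            candidate = payload[:i] + payload[i + chunk:]
            if candidate and fails(candidate):
                payload = candidate   # this chunk was not needed; drop it
            else:
                i += chunk            # this chunk matters; step past it
        chunk //= 2                   # retry with finer-grained chunks
    return payload
```

Each probe costs one round trip across the circuit, so starting with large chunks keeps the number of test packets manageable.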
Since we moved datacenters, we've been having problems with the circuits that connect the campus to our machines (all the way in Texas.) The vendor requested a window in which they could do some 'invasive' testing. During this time we would fail over to the secondary circuit and so no traffic should be affected.
Or so we thought..
Right after we switched over, I started getting alerts from Nagios and Gomerz, and started seeing lots of SAP timeout errors in the app logs. Uh-oh. We have a little standalone JCO/RFC client 'ping' that we use to validate basic connectivity back to SAP, and even it was failing. Boo. I play for a little bit, seeing how far we get.
1. Can we ping the SAP server? Check.
2. Can we establish a TCP 3-way handshake to the SAP server? Yup.
3. Does the client start sending data to the SAP server? Uh-huh.
4. Does the server send data back to the SAP client? Mmm-hmm.
5. Does the client send more data to the SAP server? You know it to be true.
6. Does the server respond back to this? No!
For some reason SAP did not play nice with the secondary circuit. Since we didn't anticipate this, we called off the testing window and regrouped to try to isolate the problem. I was able to capture one network trace, as well as a truss from the JCO/RFC ping utility.
I spent a little time looking at what the ping utility did, and tried to figure out where it was breaking. After playing around a bit, I found that it fell over when it requested a list of services the SAP gateway offers.
[ Aside: I am no SAP expert, but here's what I do know: the client makes a TCP connection to the gateway, and the gateway tells the client which application server to connect to. The client then disconnects from the gateway and connects to the app server. This is SAP's method of load balancing. It is roughly analogous to a rendezvous-server, or the Unix portmapper. ]
I was able to cobble together a perl script to get a list of IP addresses out of the gateway --
jwa@nimue:~/work/cvs/hacks/sap$ ./talk-to-gw
writing msg1 [114 bytes]
response length: 110
**MESSAGE**- MSG_SERVER
got 110 bytes
writing msg2 [254 bytes]
response length: 1290
parse
Name: [**MESSAGE**--??] Type: 0x0 (0)
Name: [B2B_GRP LG_EYECAT] Type: 0x1 (1) IP: ———.251.197
Name: [BILLER_DIRECT LG_EYECAT] Type: 0x1 (1) IP: ———.3.118
Name: [DE1-APP LG_EYECAT] Type: 0x1 (1) IP: ———.251.197
Name: [DP_APP LG_EYECAT] Type: 0x1 (1) IP: ———.251.197
Name: [JCO_ORDER_STAT LG_EYECAT] Type: 0x1 (1) IP: ———.251.197
Name: [Long_Runner LG_EYECAT] Type: 0x1 (1) IP: ———.251.197
Name: [MAIL LG_EYECAT] Type: 0x1 (1) IP: ———.251.197
Name: [FAV_COMPUTE_TIME FAV_COMPUT] Type: 0x54 (84)
Name: [FAV_COMPUTE_SERVER camsapd1_D] Type: 0x44 (68)
Name: [SPACE LG_EYECAT] Type: 0x1 (1) IP: ———.3.118
Name: [] Type: 0x0 (0)
It was at the point that the client sends the second message of 254 bytes ('msg2' above) to the server that the connection to the gateway hung, and the client eventually timed out. It looked like the packet (& subsequent retries) was either dropped by the network, or not properly read by the server.
Therefore, two speculations:
1. The secondary circuit modifies the characteristics of the packet (i.e., fragmentation, or MTU shrinkage, or whatever) such that the 254 byte packet is delivered in two pieces. The SAP gateway expects exactly 254 bytes of data in a single read(), but since it gets (say) 128 bytes of data, it does the Wrong Thing.
2. The secondary circuit is dropping packets that have a certain byte pattern. (Like pinging a dialup user with a packet containing '+++'; their stack echoes it back, and the modem sees it as '+++' originating from the client, and goes into command mode & breaks the user's dialup session.)
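To make the first speculation concrete: TCP is a byte stream, so a single read on a socket may legitimately return fewer bytes than the sender wrote in one go, and a receiver that needs a fixed-size message has to loop. I have no idea how the SAP gateway actually reads its sockets; the sketch below (with a made-up helper name) just shows what the robust version of such a read generally looks like.

```python
import socket


def recv_exact(sock: socket.socket, length: int) -> bytes:
    """Read exactly length bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            # An empty read means the peer closed the connection early.
            raise ConnectionError(f"peer closed with {remaining} bytes outstanding")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A server written this way would not care whether the 254 bytes arrived as one segment or as 128 + 126, which is why a single-read() assumption would be an application bug rather than a network one.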
The circuit provider doesn't think they're dropping packets, and suspect it's an app issue.
Soo.. we spend the next week figuring out how to replicate the problem w/o affecting production. We decide to route traffic between the SAP QA system and our QA system across the secondary circuit. (Much too much time is spent on getting approval from any remotely interested party.) We decide to place no less than four sniffers on the network-- one close to the SAP QA system, one at the edge of our network (where our network meets the circuit), one at the edge of our hosting provider's network (where their network meets the circuit), and one right off our QA network.
Saturday comes. The circuit provider routes traffic (it's neat to see the traceroutes change) across the secondary circuit. And, presto! we can replicate the problem... even with my little perl script :-) We capture sniffer traces and spend ~2h analyzing them, this time with the assistance of our hosting provider's network folks. We see a 254 byte packet leaving the edge of the hosting provider's network, but we don't see it on our edge of the network. We see the TCP stack on the client end get lonely and issue retransmits, but we don't see those retransmissions on our edge.
It seems pretty clear that the packet is getting lost in "the cloud". But why? No one knows. The people that might be able to help aren't on the call, and we agree to pick things up on Monday.
Afterwards, I went for a gentle jog around work. My knee is still acting up, though, so I walked on the downhill parts to minimize trauma. While I could feel it threatening to hurt, it felt mostly OK up until I felt a strange pain in the back of my knee. I walked for a bit after that; fortunately it was near the end. Going down the stairs later that evening I could definitely feel it. Gah.