Friday, November 29, 2013

Bug hunting

This is the story of the hunt for a bug that I've been chasing, on and off, for the last month.

After my last post on the PDU, I began doing more exhaustive testing. I left Munin polling all of the stats every 5 minutes and kept the GUI open for maybe half an hour, turning channels on and off, and everything seemed fine...

Then I went away to do something else, came back to my desk, and saw that the GUI had frozen. The board wasn't responding to ping (or any network activity at all), and I had no idea why.

Other than confirming that the interconnect fabric was still working (by resolving the addresses of a few cores by the name server) there wasn't a ton I could do without adding some diagnostic code.

I then resynthesized the FPGA netlist with the gdb bridge enabled on the CPU. (The FPGA is packed pretty full; I don't leave the bridge in all the time because it substantially increases place-and-route runtime and sometimes requires decreasing the maximum clock rate). After waiting around half an hour for the build to complete I reloaded the FPGA with the new code, fired up the GUI, and went off to tidy up the lab.

After checking in a couple of times and not seeing a hang, I finally got it to crash a couple hours later. A quick inspection in gdb suggested that the CPU was executing instruction normally, had not segfaulted, and there was no sign of trouble. In each case the program counter was somewhere in RecvRPCMessage(), as would be expected when the message loop was otherwise idle. So what was the problem?

The next step was to remove the gdb bridge and insert a logic analyzer core. (As mentioned above the FPGA is filled to capacity and it's not possible to use both at the same time without removing application logic.)

After another multi-hour build-and-wait-for-hang cycle, I managed to figure out that the CPU was popping the inbound-message FIFO correctly and seemed to be still executing instructions. None of the error flags were set.

I thought for a while and decided to check the free-memory-page counter in the allocator. A few hours later, I saw that the free-page count was zero... a telltale sign of a memory leak.

I wasted untold hours and many rebuild cycles trying to find the source of the leak before sniffing the RPC link between the CPU and the network. As soon as I saw packets arriving and not being sent, I knew that the leak wasn't the problem. It was just another symptom. The CPU was getting stuck somewhere and never processing new Ethernet frames; as soon as enough frames had arrived to fill all memory then all processing halted.

Unfortunately, at this point I still had no idea what was causing the bug. I could reliably trigger the logic analyzer after the bug had happened and the CPU was busy-waiting (by triggering when free_page_count hit 1 or 0) but had no way to tell what led up to it.

RPC packet captures taken after the fault condition showed that the new-frame messages from the Ethernet MAC were arriving to the CPU just fine. The CPU could be clearly seen popping them from the hardware FIFO and storing them in memory immediately.

Eventually, I figured out just what looked funny about the RPC captures: the CPU was receiving RPC messages, issuing memory reads and writes, but never sending any RPC messages whatsoever. This started to give me a hint as to what was happening.

I took a closer look at the execution traces and found that the CPU was sitting in a RecvRPCMessage() call until a message showed up, then PushInterrupt()ing the message and returning to the start of the loop.

/**
 @brief Performs a function call through the RPC network.
 
 @param addr  Address of target node
 @param callnum The RPC function to call
 @param d0  First argument (only low 21 bits valid)
 @param d1  Second argument
 @param d2  Third argument
 @param retval Return value of the function
 
 @return zero on success, -1 on failure
 */
int __attribute__ ((section (".romlibs"))) RPCFunctionCall(
 unsigned int addr, 
 unsigned int callnum,
 unsigned int d0,
 unsigned int d1,
 unsigned int d2,
 RPCMessage_t* retval)
{
 //Send the query
 RPCMessage_t msg;
 msg.from = 0;
 msg.to = addr;
 msg.type = RPC_TYPE_CALL;
 msg.callnum = callnum;
 msg.data[0] = d0;
 msg.data[1] = d1;
 msg.data[2] = d2;
 SendRPCMessage(&msg);
 
 //Wait for a response
 while(1)
 {
  //Get the message
  RecvRPCMessage(retval);
  
  //Ignore anything not from the host of interest; save for future processing
  if(retval->from != addr)
  {
   //TODO: Support saving function calls / returns
   //TODO: Support out-of-order function call/return structures
   if(retval->type == RPC_TYPE_INTERRUPT)
    PushInterrupt(retval);
   continue;
  }
   
  //See what it is
  switch(retval->type)
  {
   //Send it again
   case RPC_TYPE_RETURN_RETRY:
    SendRPCMessage(&msg);
    break;
    
   //Fail
   case RPC_TYPE_RETURN_FAIL:
    return -1;
    
   //Success, we're done
   case RPC_TYPE_RETURN_SUCCESS:
    return 0;
    
   //We're not ready for interrupts, save them
   case RPC_TYPE_INTERRUPT:
    PushInterrupt(retval);
    break;
    
   default:
    break;
  }

 }
}

I spent most of a day repeatedly running the board until it hung to collect a sampling of different failures. A pattern started to emerge: addr was always 0x8003, the peripheral controller. This module contains a couple of peripherals that weren't big enough to justify the overhead of a full RPC router port on their own:
  • One ten-signal bidirectional GPIO port (debug/status LEDs plus a few reserved for future expansion)
  • One 32-bit timer with interrupt on overflow (used for polling environmental sensors for fault conditions, as well as socket timeouts)
  • One I2C master port (for talking to the DACs)
  • Three SPI master ports (for talking to the ADCs)
The two most common values for callnum in the hang state were PERIPH_SPI_SEND_BYTE and PERIPH_SPI_RECV_BYTE, but I saw a PERIPH_SPI_DEASSERT_CS call once. The GPIO and I2C peripherals aren't used during normal activity and are only touched when someone changes a breaker's trip point or the network link flaps, so I wasn't sure if the hang was SPI-specific or related to the peripheral controller in general.

After not seeing anything obviously amiss in the peripheral controller Verilog, I added one last bit of instrumentation: logging the last message successfully processed by the peripheral controller.

The next time the board froze, the CPU was in the middle of the first PERIPH_SPI_RECV_BYTE call in the function below (reading one channel of a MCP3204 quad ADC) but the peripheral controller was idle and had most recently processed the PERIPH_SPI_SEND_BYTE call on the line before.

unsigned int ADCRead(unsigned char spi_channel, unsigned char adc_channel)
{
 //Get the actual sensor reading
 RPCMessage_t rmsg;
 unsigned char opcode = 0x30;
 opcode |= (adc_channel << 1);
 opcode <<= 1;
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_ASSERT_CS, 0,  spi_channel, 0, &rmsg);
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_SEND_BYTE, opcode, spi_channel, 0, &rmsg); //Three dummy bits first
                      //then request read of CH0
                      //(single ended)
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_RECV_BYTE, 0,   spi_channel, 0, &rmsg); //Read first 8 data bits
 unsigned int d0 = rmsg.data[0];
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_RECV_BYTE, 0,  spi_channel, 0, &rmsg); //Read next 4 data bits
                      //followed by 4 garbage bits
 unsigned int d1 = rmsg.data[0];
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_DEASSERT_CS, 0,  spi_channel, 0, &rmsg);
 
 return ((d0 << 4) & 0xFF0) | ( (d1 >> 4) & 0xF);
}

Operating under the assumption that my well-tested interconnect IP didn't have a bug that could make it drop packets randomly, the only remaining explanation was that the peripheral controller was occasionally ignoring an incoming RPC.

I took another look at the code and found the bug near the end of the main state machine:

//Wait for RPC transmits to finish
STATE_RPC_TXHOLD: begin
 if(rpc_fab_tx_done) begin
  rpc_fab_rx_done <= 1;
  state <= STATE_IDLE;
 end
end //end STATE_RPC_TXHOLD 

I was setting the "done" flag to pop the receive buffer every time I finished sending a message... without checking that I was sending it in response to another message. The only time this was ever untrue is when sending a timer overflow interrupt.

As a result, if a new message arrived at the peripheral controller between the start and end of sending the timer overflow message, it would be dropped. The window for doing this is only four clock cycles every 50ms, which explains the extreme rarity of the hang.

EDIT: Just out of curiosity I ran a few numbers to calculate the probability of a hang:
  • At the 30 MHz CPU speed I was using for testing, the odds of any single RPC transaction hanging are 1 in 375,000.
  • Reading each of the 12 ADC channels requires 5 SPI transactions, or 60 in total. The odds of at least one of these triggering a hang is 1 in 6250.
  • The client GUI polls at 4 Hz.
  • The chance of a hang occurring within the first 15 minutes of runtime is 43%.
  • The chance of a hang occurring within the first half hour is 68%.
  • There is about a 10% chance that the board will run for over an hour without hanging, and yet the bug is still there.

Friday, November 1, 2013

Managed DC PDU

As I mentioned in my last post, powering all of the prototyping boards on my desk presents some unique challenges. With only one exception (the Xilinx AC701 board), each of the 22 boards requires 5VDC at somewhere between 0.1 and 2 amps. Some are strictly USB powered, some have a 5.5/2.1mm barrel jack, and some can be powered by either USB or a barrel jack.

Powered USB hubs would reduce the number of power sources required, so I did just that. Lots of cables would get in the way so I designed a custom "backplane" USB hub with male mini-B ports which could plug directly into small prototyping boards. (As a side note, the connectors for this board were nearly impossible to find. There are very few uses for a male mini-B connector that mounts to a PCB rather than being attached to a cable so nobody makes them!)

USB backplane hub
These reduced the problem, but did not come close to eliminating it. I still had to power three backplane hubs, six standalone FPGA boards, and four standalone MCU/SoC dev boards. All needed 5V except for the AC701 (which runs on 12V) but I wanted additional 12V capability for the future if I expanded into higher-power design.

The obvious first idea was an ATX supply. My calculations of peak power for the apparatus (including room for growth) were fairly high, though, and most ATX supplies put the bulk of their output on the 12V rail and have fairly limited (well under 100W) 5V capacity.

The next thing I considered was an off-the-shelf 5V supply. This looked like a nice idea, but (as with an ATX supply) the high output current capability would represent a fire hazard if something shorted. I would obviously need overcurrent protection.

Thinking a bit more, I realized that fusing was probably not the best option. Fuses need to be replaced once blown and in a lab environment overcurrent events happen fairly often. Classical current limiting techniques would be problematic as well since many of my boards have switching power supplies. Since a switcher is a nonlinear load, reducing the input voltage doesn't actually reduce the current. Instead, load current actually increases to maintain the output voltage, which can lead to runaway failure conditions. The safer way to handle overcurrent on a switcher is to shut it down entirely.

I also wanted the ability to power cycle boards on command to reset a stuck board or test power-up behavior. While jiggling cables may work in a hands-on lab environment, it isn't a viable option in the remote-controlled "embedded cloud" platform I'm trying to build.

This would obviously require some intelligence on the part of the power management system. The natural solution was a managed power distribution unit (PDU) of the sort commonly used in datacenters for feeding power to racks of servers. Managed PDUs often include current metering as well, which could be very useful to me when trying to minimize power consumption in a design.

There's just one problem: As far as I can tell, nobody makes managed PDUs for 5V loads. The only ones I saw were for 12/24/48V supplies and massively overpriced: this 8-channel 12V unit costs a whopping $1,757.

What to do? Build one myself, of course!

The first step was to come up with the requirements:
  • Remote control via SNMP
  • Ten DC outputs fed by external supply
  • 4A max load for any single channel, 20A max for entire board
  • Independent overcurrent shutdown for each channel with adjustable threshold
  • Inrush timers for overcurrent shutdown to prevent false positives during powerup
  • Remote switching
  • Current metering
  • Thermal shutdown
  • Under/overvoltage shutdown
  • Input reverse voltage protection
  • Able to operate at 5V or 12V (jumper selected)
Now that I had a good idea of what I was building, it was time to start the actual design. I decided to use an FPGA instead of a MCU since the parallel nature made it easy to meet the hard-realtime demands of the overcurrent protection system. I also wanted an opportunity to field-test my softcore gigabit-Ethernet MAC, one of my CPU designs, and several other components of my thesis architecture under real-world load conditions.

PDU block diagram

The output stage is key to the entire circuit so it was very important that it be designed correctly. I put quite a bit of effort into component selection here... perhaps a bit too much, as I missed a few bugs elsewhere on the board! More on that later.

Output stage
Working from the output terminal (right side, VOUT_1) we first encounter a a 5 mΩ 4-terminal shunt resistor which feeds the overcurrent shutdown circuit and current metering. This is followed by a an LC filter to smooth the output power and reduce coupling of noise between downstream devices.

The fuse is provided purely as a second line of defense in the event that the soft overcurrent protection fails. As a firmware/HDL developer I know all too well what bugs are capable of, so I like to include passive safeguards whenever reasonably possible. Assuming that my code works correctly, this fuse should never blow even if the output of the PDU was connected to a dead short. (This of course requires that my protection mechanism trip faster than the fuse. Given the 1ms response time of typical fuses to small overcurrents, this isn't a very difficult task.)

Power switching is done by a high-side P-channel MOSFET connected to VOUT (the main high-current power rail). The logic-level input from the control subsystem is shifted up to VOUT level by an N-channel MOSFET. A pullup and pulldown resistor ensure that the output is kept safely in the "off" state when the system is booting.

Current monitoring
The monitoring stage is even simpler: the shunt voltage is amplified by a TI INA199A2 instrumentation amplifier, then fed to an ADC (not shown in this schematic) for metering. A comparator checks the amplified voltage against a reference voltage set by a DAC (also not shown) and if the threshold is exceeded the overcurrent alarm output is asserted.

A module in the FPGA controls the output enables based on the overcurrent flags and internal state. When an output is first turned on the overcurrent flag is ignored for a programmable delay (usually a few ms) in order to avoid false triggering from inrush spikes. After this period, if the overcurrent flag is ever asserted the channel is turned off and placed in the "error-disable" state. In order to clear an error condition the channel must be manually cycled, much like a conventional circuit breaker.

Here's a view of the finished first-run prototype. As you can see the first layout revision had a few bugs ;) The dead-bugged oscillator turned out to not be necessary but it would have been more work to remove it so I'm keeping it until I do a respin with all of these fixes incorporated.
PDU board on my desk
The SNMP interface and IP protocol stack runs on a custom softcore CPU of my own design. The CPU is named GRAFTON, in keeping with my tradition of naming my processors after nearby towns. It is fairly similar to MIPS-1 at the ISA level and can be targeted by mips-linux-gnu gcc with carefully chosen flags, but does not implement unaligned load/store, interrupts, or the normal coprocessors. Coprocessor 0 exists but is used to interface with the RPC network.

GRAFTON's programming model is largely event-driven, in a model that will be somewhat familiar to anyone who has done raw Windows API programming. The CPU sleeps until an RPC interrupt packet shows up, then it is processed and it goes back to sleep. Unlike classical interrupt handling, user code running on GRAFTON cannot be pre-empted by an interrupt; it just sits in the queue until retrieved.

int main()
{
	//Do one-time setup
	Initialize();

	//Main message loop
	RPCMessage_t rmsg;
	while(1)
	{
		GetRPCInterrupt(&rmsg);
		ProcessInterrupt(&rmsg);
	}
	
	return 0;
}

RPCFunctionCall(), a simple C wrapper around the low-level SendRPCMessage and RecvRPCMessage() functions, abstracts the RPC network with a blocking C function call semantics. Any messages other than return values of the pending call are queued for future processing.

In the example below, I'm initializing the SPI modules for the A/D converters with a clock divisor computed on the fly from the system clock rate.

void ADCInitialize()
{
	//SPI clock = 250 KHz
	RPCMessage_t rmsg;
	RPCFunctionCall(g_sysinfoAddr, SYSINFO_GET_CYCFREQ, 0, 250 * 1000, 0, &rmsg);
	int spiclk = rmsg.data[1];
	for(unsigned int i=0; i<3; i++)
		RPCFunctionCall(g_periphAddr, PERIPH_SPI_SET_CLKDIV, spiclk, i, 0, &rmsg);
}

The firmware is about 4300 lines of C in total, including comments but not the 1165 lines of C and assembly in my C runtime library shared by all GRAFTON designs. It implements IPv4, UDP, DHCP, ARP, ICMP echo, and SNMPv2c. SNMPv3 security and IPv6 are planned but are on hold until I move firmware out of block RAM and into flash so I have some space to work in. Other than that, it's essentially feature-complete and I've been using the PDU in my lab for a while while working on my flash controller and some support stuff.

The PC-side UI, intended to control several PDUs, is written in C++ using gtkmm and communicates with the board over SNMP. One tab (not shown) contains summary information with one graph trace per PDU.

PDU control panel
With a few minutes of PHP scripting I was also able to get my Munin installation to connect to the PDU and collect long-term logs even when I don't have the panel up.

Munin logs of PDU
The board runs quite cool, the spikes of heat caused by my furnace kicking in are quite visible and dwarf thermal variations caused by changes in load.

It needs a little bit more work to be fully production-ready but is already saving me time around the lab.

My desk with the PDU installed
Here's a look at my desk after deploying the PDU. The power cable mess is almost completely gone :) I do need to tidy up the Ethernet cables at some point, though...