Hello LL,
I can give you a few answers to your questions.
1. You need to verify with your test house whether they took pre-stress and post-stress curve traces for every stress. If they did both curve traces properly for every stress, then a failure on an I/O pin is generally pretty obvious: the curve trace taken before the stress shows that the pin is healthy, the pin/domain is stressed, and the post-stress curve trace shows that it failed. However, many test houses do not do full curve traces before and after each stress. Some, for instance, will run an entire set of I/O stresses and only then curve trace the pins. In that case you are correct: you may not know which combination caused the failure.
One thing to remember: curve traces alone should never be the final criterion for failure. Done properly, the parts pass a functional test before stressing, the ESD stressing is applied, and the parts then go through a second round of functional testing to confirm they survived. If a part passes the curve traces but fails functional testing, you have to track down which pin combination caused the failure, and this becomes a whole new nightmare. If you are lucky with your HBM testing, though, and the tests were performed correctly, you should know exactly which pin stresses caused the failure.
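To make the bookkeeping concrete, here is a minimal Python sketch of how a per-stress log with pre- and post-stress curve trace readings lets you pick out the first stress that damaged a pin. Everything in it is an assumption for illustration: the record fields, the pin names, the leakage numbers, and the 10x leakage-shift threshold are hypothetical, and your test house's actual data format and failure criteria will differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StressRecord:
    """One ESD zap with curve trace readings taken right before and after it.

    All field names are hypothetical, not a test house format.
    """
    stressed_pin: str
    reference_pin: str    # pin or domain grounded during the zap
    voltage_v: int
    pre_leak_na: float    # leakage read from the pre-stress curve trace
    post_leak_na: float   # leakage read from the post-stress curve trace


def first_failing_stress(records: List[StressRecord],
                         max_ratio: float = 10.0,
                         floor_na: float = 1.0) -> Optional[StressRecord]:
    """Return the first stress whose post/pre leakage ratio exceeds max_ratio.

    max_ratio and floor_na are illustrative assumptions; the real limit
    comes from your test plan.
    """
    for rec in records:
        # Clamp the baseline so a near-zero reading does not inflate the ratio.
        baseline = max(rec.pre_leak_na, floor_na)
        if rec.post_leak_na / baseline > max_ratio:
            return rec
    return None


# Made-up example log, in the order the stresses were applied.
log = [
    StressRecord("IO3", "VSS", 2000, 2.0, 2.1),
    StressRecord("IO7", "VDDIO", 2000, 1.8, 950.0),
    StressRecord("IO7", "VSS", 2000, 950.0, 960.0),
]

hit = first_failing_stress(log)
if hit is not None:
    print(f"{hit.stressed_pin} vs {hit.reference_pin} at {hit.voltage_v} V")
    # prints "IO7 vs VDDIO at 2000 V"
```

Note that the third stress is not flagged, because its pre-stress trace already shows the damage. If only one curve trace had been taken at the end of the whole set, you would see a leaky IO7 with no way to tell which of the two IO7 stresses did it. A clean result here still does not replace the post-stress functional test.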
I recommend White Mountain Labs as a test house because their testing format helps avoid issues like the one you describe.
You can find a link to their website here:
http://srftechnologies.com/Affiliations.html (Full disclosure: I work with them a lot.)
2. This situation is not at all uncommon. It merely means that this cell is very marginal and that, because of the subtle variations among your different ESD networks, it sometimes fails and sometimes passes. What people tend to forget is that ESD is 80% chip architecture and only 20% individual devices. This is especially true in SoCs such as yours.
I hope this helps!
Stephen