I've been watching Stanford's CS231n lectures and I'm trying to wrap my head around some issues in CNN architectures. What I'm trying to understand is: are there general guidelines for picking convolution filter sizes and strides, or is this more an art than a science?
Pooling, I understand, exists mainly to induce some form of translation invariance in a model. On the other hand, I don't have a good intuition for how stride size is picked. Are there guidelines beyond compressing the current layer size or giving a neuron a larger receptive field? Does anyone know of good papers or similar that discuss this?
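To make the question concrete, here's how I currently think about the arithmetic. This is just a sketch of the standard output-size formula for a conv layer (the example layer settings are mine, not from any particular paper):

```python
# Standard conv/pool output-size formula:
#   out = (in - filter + 2*pad) // stride + 1
def conv_out_size(in_size, filter_size, stride=1, pad=0):
    return (in_size - filter_size + 2 * pad) // stride + 1

# 3x3 conv, stride 1, pad 1 ("same" padding) preserves spatial size: 32 -> 32
print(conv_out_size(32, 3, stride=1, pad=1))   # 32

# Same filter with stride 2 halves the spatial size: 32 -> 16
print(conv_out_size(32, 3, stride=2, pad=1))   # 16

# A large stride-2 stem conv (e.g. 7x7, pad 3) on a 224px input: 224 -> 112
print(conv_out_size(224, 7, stride=2, pad=3))  # 112
```

So I can compute what a given filter/stride combination does to the layer sizes; what I'm missing is the reasoning for choosing one combination over another.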