MC 6460 and Zoom (please email firstname.lastname@example.org for the Zoom meeting link)
Yanming Kang | Applied Mathematics, University of Waterloo
Transformer models have become the most popular choice for NLP tasks. In general, transformers with longer input sequences can achieve higher accuracy. However, due to the quadratic space complexity of dot-product attention, hardware constraints limit the maximum input length. Previous work has addressed this problem by applying fixed sparsity patterns to the attention matrix or by using methods such as k-means clustering and locality-sensitive hashing. We present Multi-level Transformer, which uses a hierarchy of resolutions when computing dot-product attention: information is summarized by convolution to a degree that depends on the distance between the input and output tokens. Multi-level attention has O(N log N) time and space complexity. We found that, compared to the standard transformer, Multi-level Transformer requires much less memory and is faster on long inputs. Our preliminary language-modeling results on WikiText-103 show that Multi-level Transformer achieves perplexity comparable to the transformer's (6% higher).
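To illustrate the general idea of attending at coarser resolution to more distant tokens, here is a minimal two-level sketch in NumPy. It is not the talk's actual method: the function name, the use of average pooling in place of a learned convolution, and all shapes and parameters are assumptions for illustration only. Each query attends at full resolution to nearby keys and to pooled block summaries of the rest of the sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_attention(q, k, v, window=4, stride=4):
    """Toy two-level attention (illustrative sketch, not the talk's method).

    Each query attends at full resolution to keys within `window`
    positions, and to stride-`stride` average-pooled summaries
    (standing in for a learned convolution) of blocks outside it.
    """
    n, d = q.shape
    # Coarse level: average-pool keys and values in blocks of `stride`.
    n_blocks = n // stride
    k_coarse = k[:n_blocks * stride].reshape(n_blocks, stride, d).mean(axis=1)
    v_coarse = v[:n_blocks * stride].reshape(n_blocks, stride, d).mean(axis=1)

    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Fine level: exact keys/values near position i.
        k_fine, v_fine = k[lo:hi], v[lo:hi]
        # Coarse level: pooled blocks that do not overlap the fine window
        # (boundary blocks are simply dropped in this toy version).
        mask = np.array([b * stride + stride <= lo or b * stride >= hi
                         for b in range(n_blocks)])
        k_all = np.concatenate([k_fine, k_coarse[mask]])
        v_all = np.concatenate([v_fine, v_coarse[mask]])
        w = softmax(k_all @ q[i] / np.sqrt(d))
        out[i] = w @ v_all
    return out
```

With two levels, each query sees roughly `window + N/stride` entries rather than N; stacking log N such levels of increasingly coarse summaries is what yields the O(N log N) cost mentioned above.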