by Shreyas Kar, Dhruv Pai & Andres Carranza
Traditional Transformer-based architectures, which rely on the attention mechanism, face a context-length bottleneck: attention scales quadratically with sequence length. State Space Models (SSMs) offer an alternative whose per-step cost is independent of context length, making them theoretically much faster than attention-based models on long sequences. Historically, however, SSMs have underperformed Transformer-based architectures, in part because they utilize GPUs less efficiently.
H3 is a novel language-model architecture that leverages SSMs as a powerful alternative to the traditional attention mechanism. H3 takes a large stride toward closing the expressivity gap between SSMs and attention, offering the potential for robust performance at long context lengths while remaining more efficient and better at utilizing hardware. However, there has been limited prior work on understanding its in-context learning (ICL) capabilities. This work addresses that gap.
We use the Quora Question Pairs (QQP) sub-dataset from GLUE [2] and evaluate on it as originally described in the GLUE paper, using the corresponding GLUE metric for the benchmark. For QQP, a binary classification task, that metric is simply accuracy. For robustness, we additionally report F1 score, as well as cross-entropy computed from the model's logit predictions of class probability; in the binary classification case, cross-entropy reduces to binary log loss. The cross-entropy score lets us compare the probabilities the model assigns to each class at a finer resolution than accuracy alone.
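As a concrete illustration, the sketch below computes all three metrics from per-example class logits. The function and variable names are our own and not taken from the original evaluation code.

```python
# Minimal sketch of the three QQP metrics described above, assuming an
# (N, 2) array of class logits and an (N,) array of 0/1 gold labels.
import numpy as np

def evaluate_qqp(logits: np.ndarray, labels: np.ndarray) -> dict:
    # Softmax over the two classes to get P(duplicate) per example.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    p_pos = probs[:, 1]
    preds = (p_pos >= 0.5).astype(int)

    # Accuracy: fraction of correct hard predictions (the GLUE metric for QQP).
    accuracy = (preds == labels).mean()

    # F1 on the positive (duplicate) class.
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    fn = ((preds == 0) & (labels == 1)).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Binary log loss (cross-entropy), the finer-grained probabilistic score.
    eps = 1e-12
    log_loss = -np.mean(labels * np.log(p_pos + eps)
                        + (1 - labels) * np.log(1 - p_pos + eps))

    return {"accuracy": accuracy, "f1": f1, "cross_entropy": log_loss}
```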
We used the 1.3B-parameter H3 model published by HazyResearch, with the same model hyperparameters used in the paper (d_model = 2048, n_layer = 24, n_heads = 16). For our baseline, we used the 1.3B-parameter GPT-2 model published by OpenAI [8]. To format an ICL sample with k in-context examples, we prepend k labeled demonstrations to the query to be classified (a sketch appears below). To run the experiments, we developed a Python script that automatically tests H3 in few- and many-shot settings (k = 1, 2, 4, 8, ..., 128) and compares accuracy across these settings.
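The following sketch shows one way the prompt construction and k-shot sweep could look. The prompt template, label verbalizer, and helper names are illustrative assumptions on our part, not the exact format used in the experiments.

```python
# Hypothetical sketch of ICL prompt formatting and the k-shot accuracy sweep.
import random

LABEL_WORDS = {0: "no", 1: "yes"}  # assumed verbalizer for QQP duplicate labels

def format_icl_prompt(demos, query):
    """demos: list of (q1, q2, label) tuples; query: (q1, q2) pair to classify."""
    parts = []
    for q1, q2, label in demos:
        parts.append(f"Question 1: {q1}\nQuestion 2: {q2}\n"
                     f"Duplicate: {LABEL_WORDS[label]}")
    # The unlabeled query goes last; the model completes the label word.
    parts.append(f"Question 1: {query[0]}\nQuestion 2: {query[1]}\nDuplicate:")
    return "\n\n".join(parts)

def run_sweep(model, train_pool, eval_set,
              shot_counts=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Evaluate accuracy at each shot count k. `model(prompt)` is assumed to
    return the predicted label word ("yes"/"no") for a prompt string."""
    results = {}
    for k in shot_counts:
        correct = 0
        for query, label in eval_set:
            demos = random.sample(train_pool, k)  # k labeled demonstrations
            pred = model(format_icl_prompt(demos, query))
            correct += int(pred.strip() == LABEL_WORDS[label])
        results[k] = correct / len(eval_set)
    return results
```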