George Hotz on GPT-4 as Mixture of Experts Model with 1.2 Trillion Parameters and Google’s GLaM

In summary, George Hotz suggests that GPT-4 might be a Mixture of Experts (MoE) model with around 1.2 trillion parameters. MoE models use a different subset of their parameters for each input example, creating a sparsely activated ensemble. Switch Transformers, developed by Google researchers, use a gating network to produce a sparse distribution over the available experts, typically by applying a softmax over expert scores and routing each input only to the top-k highest-scoring experts.
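
As a rough illustration, this kind of top-k softmax gating can be sketched in a few lines of PyTorch. The class and parameter names below are assumptions for illustration only, not the actual GPT-4 or Switch Transformer code:

```python
# Minimal sketch of a top-k softmax gating network for an MoE layer.
# Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Linear router that scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        logits = self.router(x)                      # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)            # dense distribution over experts
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalize so the k selected experts' weights sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        # topk_idx: which experts each token is routed to;
        # topk_probs: mixing weights for combining their outputs.
        return topk_idx, topk_probs, probs
```

Because each token is processed only by its k selected experts, the compute per token stays roughly constant even as the total parameter count grows, which is how a model can reach a trillion-plus parameters without a proportional increase in inference cost.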

However, balancing the selection of experts is challenging: routing must be spread out so that a few experts do not receive a disproportionate share of the inputs while the rest go undertrained. It is also worth noting that the cost of GPT-4 is estimated to be about 10 times that of GPT-3.
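
One common mitigation, used in the Switch Transformer work, is an auxiliary load-balancing loss that penalizes routing distributions concentrated on a few experts. The sketch below assumes the gate from the previous example and is illustrative rather than a reproduction of any specific codebase:

```python
# Sketch of a Switch Transformer-style auxiliary load-balancing loss.
# In the paper this term is scaled by a small coefficient before being
# added to the main training loss; that coefficient is omitted here.
import torch


def load_balancing_loss(probs: torch.Tensor, topk_idx: torch.Tensor, num_experts: int):
    # probs:    (num_tokens, num_experts) dense router probabilities
    # topk_idx: (num_tokens, k) experts actually chosen for each token
    # f: fraction of tokens dispatched to each expert
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    f = counts / topk_idx.numel()
    # p: mean routing probability assigned to each expert
    p = probs.mean(dim=0)
    # Minimized when tokens are spread uniformly across experts.
    return num_experts * torch.sum(f * p)
```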

Google’s GLaM model is an example of an MoE model with 1.2 trillion parameters and 64 experts. Ensemble networks have been powerful models in the past, and it seems like they might still hold a significant position in the neural network landscape.

Source: matt-rickard.com