Through the news, I’ve recently become aware of an interesting large language model (LLM) architecture, the Mixture of Experts (MoE), a concept which was actually established in a 1991 paper but has only recently come to prominence.
In an MoE model, the network is divided into a number of sub-networks, or “experts”, each of which specialises in one or more domains according to its training. During inference, a gating (or routing) mechanism selects only a subset of the experts to process a given prompt, depending on the nature of the input and the suitability of each expert. This reportedly improves computational efficiency and scalability.
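By way of illustration, the sketch below shows a minimal top-k gating scheme of the kind used in sparsely-gated MoE layers. It is a simplified toy example rather than any particular production implementation: the dimensions, the `top_k` value and the random weights are assumptions chosen purely for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions chosen for illustration only).
d_model, d_hidden, n_experts, top_k = 16, 32, 4, 2

# Each "expert" is a small feed-forward network with its own weights.
experts = [
    {"w1": rng.standard_normal((d_model, d_hidden)) * 0.1,
     "w2": rng.standard_normal((d_hidden, d_model)) * 0.1}
    for _ in range(n_experts)
]

# The gate is a single linear layer that scores every expert for a given input.
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1


def moe_layer(x):
    """Route a single token vector x through only the top-k scoring experts."""
    scores = x @ gate_w                      # one score per expert
    top = np.argsort(scores)[-top_k:]        # indices of the k best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        e = experts[idx]
        h = np.maximum(x @ e["w1"], 0.0)     # expert's hidden layer (ReLU)
        out += w * (h @ e["w2"])             # weighted sum of the selected experts
    return out, top


token = rng.standard_normal(d_model)
output, used = moe_layer(token)
print(f"Experts used for this token: {sorted(used.tolist())} of {n_experts}")
```

The key point is that only the selected experts’ weights are touched for any given token, which is where the claimed gains in computational efficiency and scalability come from.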
MoE has advanced significantly from 2010 onwards, including the scaling-up of the concept to a 100B+ parameter LSTM-based model applied to natural language processing tasks in 2017. Historically, there have been a number of hurdles to overcome in order to realise MoE’s full potential. For instance, due to the branching nature of MoE, such models have not been particularly well suited to computation on graphics processing units (GPUs). However, thanks to innovations in how training and inference are carried out for MoE models, they are becoming increasingly popular.
Innovations of this kind, which improve computational efficiency through their specific training or inference implementation, are often found to be technical before the European Patent Office (EPO), regardless of their application to any particular field of technology. Advantageously, a patent application directed to the specific technical implementation of an AI system may provide relatively broad protection, including in the field of natural language processing, which the EPO generally considers less amenable to patenting than, say, image processing.