Next, they took their experiment up a grade, so to speak. This time a bigger group of researchers used carefully selected publicly-available data that was filtered based on educational value and content quality to train Phi-1. After collecting publicly available information into an initial dataset, they used a prompting and seeding formula inspired by the one used for TinyStories, but took it one step further and made it more sophisticated, so that it would capture a wider scope of data. To ensure high quality, they repeatedly filtered the resulting content before feeding it back into a LLM for further synthesizing. In this way, over several weeks, they built up a corpus of data large enough to train a more capable SLM.