I'm kind of surprised it took this long, if it’s the first: it’s a large body of human-written text, generally under the same license terms as Wikipedia (which, when writing this, I had assumed was a source for training a lot of LLMs, but I can’t actually find a citation for that, so maybe I’m mistaken).
Are they pulling dumps or just hitting random pages? Dumps would indicate knowledge of the project, but random page hits may just be a scraper trying to suck up as much data as possible.