Bittensor Dataset

Infinite data - 3 lines of code

Methods that leverage more computation and data are ultimately the most effective by far.

The near-infinite supply of ubiquitous, freely available internet-hosted data has given rise to foundation models [1] such as the GPT series and BERT. Indeed, it is because data is so readily available that the focus is shifting away from supervised datasets and towards the real element of interest: intelligence itself. This is the commodity we mine at the intersection of data and computation.

Genesis Dataset

Bittensor sits at the intersection of internet-scale compute and data. It is powered by our custom Genesis Dataset: a production-grade corpus of unlabeled text pulled from the web.

Presently 1.5 TB in size and growing, the dataset consists of over 150 million files and is hosted entirely on IPFS. This lets us maintain high availability, decentralization, and ease of access. In fact, access is so easy that we have made the dataset available to our miners in three lines of code:

import bittensor
for text in bittensor.dataloader( dataset = 'genesis' ).dataloader( 1000 ):
    model( text )
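
For illustration, here is a minimal sketch of what a stand-in model could look like in the loop above; the vocabulary size and embedding dimension are placeholders, and the assumption that the loader yields 2D tensors of integer token ids is ours, not confirmed API.

import torch
import bittensor

# Placeholder model: an embedding bag that averages token embeddings.
# num_embeddings and embedding_dim are illustrative assumptions.
model = torch.nn.EmbeddingBag( num_embeddings = 50257, embedding_dim = 128 )

# Same loop as above, with the placeholder standing in for `model`.
for text in bittensor.dataloader( dataset = 'genesis' ).dataloader( 1000 ):
    output = model( text )  # assumes `text` is a batch of token-id rows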

Eventually, users will be able to plug in their own data simply by supplying a custom IPFS hash. This will help us on our mission to build the largest decentralized machine learning dataset ever:

import bittensor
identifier = '0E7071C59DF3B9454D1D18A15270AA36D54F89606A576DC621757AFD44AD1D2E'
for text in bittensor.dataloader( dataset = identifier ).dataloader( 1000 ):
    model( text )
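
As a rough sketch of where such an identifier might come from, the snippet below adds a local file to IPFS by shelling out to the ipfs CLI; the file name is a placeholder, a running local IPFS daemon is assumed, and the exact hash format bittensor expects is our assumption.

import subprocess

# Add a local file to IPFS and capture its content hash.
# Assumes the `ipfs` CLI is installed and a daemon is running;
# 'my_corpus.txt' is a placeholder file name.
identifier = subprocess.check_output(
    [ 'ipfs', 'add', '-Q', 'my_corpus.txt' ]
).decode().strip()

print( identifier )  # the hash you would hand to bittensor.dataloader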

References

[1] Bommasani, Rishi, et al. "On the Opportunities and Risks of Foundation Models." arXiv preprint arXiv:2108.07258 (2021).