LanceDB can open Lance datasets hosted on the Hugging Face Hub directly via hf:// URIs. This is enabled under the hood by the lance-huggingface integration, which allows users to stream Lance datasets from Hugging Face without needing to download them first.
For ML and AI engineers working with LanceDB, this capability is incredibly useful for quickly exploring multimodal datasets and reusing Lance datasets shared by others, without writing custom data loaders or preprocessing pipelines.
The snippets below use the lance-format/laion-1m
dataset published in Lance format. The dataset includes a million image-caption pairs, and the
Lance dataset can package image embeddings alongside the metadata. This makes it useful for
demonstrating LanceDB’s multimodal search capabilities in combination with easy sharing via the
Hugging Face Hub.
The LAION table includes multimodal columns such as:

- `image` (inline JPEG bytes)
- `caption` (text)
- `img_emb` (image embedding vector)
- metadata fields such as `url` and `similarity`
Install dependencies
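The examples below need the LanceDB Python package; `pillow` is optional and only needed if you want to decode the image bytes (package names assumed):

```bash
pip install lancedb pillow
```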
Open the dataset with LanceDB
LanceDB can open the dataset directly from the Hub, without needing to download it first. Note that in LanceDB, you need to specify the table name when opening a Lance dataset, and the Hugging Face convention is to use `train` and `test` splits for datasets.
The LAION dataset is uploaded as a single split named `train`, so we specify the table name that contains the `*.lance` files when opening the dataset.
Inspect schema and available indexes
The dataset ships with a vector index on the `img_emb` column and an FTS index on the `caption` column, which means we can run vector search on the image embeddings and keyword search on the captions directly, without needing to build the indexes ourselves!
If you see an empty list, it may be because the dataset author did not include the index files when uploading to Hugging Face. In that case, you can download the dataset locally and build the indexes yourself; see the indexing guide for instructions on building different types of indexes with LanceDB.
Projection scan
Run a simple scan by projecting relevant columns to get a feel for the dataset. For example, we can run a search without any filters or input parameters to get a small subset of the data.

Scan and filter data
Filtered search is a common pattern to narrow down interesting subsets of the data during early exploration.

Export image bytes to local files
To work with a subset of the data locally, you can export the image bytes from the table and save them as JPEG files.

Vector search
You can use LanceDB to run vector search directly on the data on the Hub, without needing to download the dataset or build your own vector index. This makes it easy to explore the dataset and iterate on your search queries before deciding to download a local copy for further experimentation.

| distance | caption |
|---|---|
| 0.17765313386917114 | Cordelia and Dudley on their wedding day last year |
| 0.17765313386917114 | Cordelia and Dudley on their wedding day last year |
| 0.17765313386917114 | Cordelia and Dudley on their wedding day last year |
Full-text search
Run an FTS search query that uses BM25 ranking on the `caption` column (on which we already have an FTS index):
| caption | url | _score |
|---|---|---|
| running with dog | https://www.doggytastic.com/wp… | 15.73168 |
| Dog Running in Water | https://static.wixstatic.com/m… | 14.756516 |
| Dogs on the run by heidiannemo… | http://ih2.redbubble.net/image… | 14.756516 |
Download the full dataset
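Assuming the `hf` CLI from `huggingface_hub` is installed, a sketch:

```bash
hf download lance-format/laion-1m --repo-type dataset --local-dir laion-1m
```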
The entire dataset can be downloaded via the Hugging Face CLI; a local copy is useful for heavier experimentation, such as building your own indexes.

Upload your own datasets to Hugging Face in Lance format
This section shows how you can upload your own Lance datasets to the Hugging Face Hub to share with the community. First, install the Hugging Face CLI and export both `OPENAI_API_KEY` and `HF_TOKEN`.
Then, create a Lance dataset using LanceDB on a local machine, and upload it to the Hub via a CLI command.
1. Upload your local directory to the Hub
Upload the full local directory to a specified repository on the Hugging Face Hub. The command below uploads the contents of your local LanceDB directory at `/path/to/your_local_dir` to a new repository named `your_hf_org/repo_name` under your Hugging Face account.
```bash
hf upload-large-folder your_hf_org/repo_name /path/to/your_local_dir --repo-type=dataset
```
The `upload-large-folder` command is designed for uploading large datasets (potentially terabytes in size) and will handle multipart uploads, retries, and resuming interrupted uploads.

2. Inspect dataset versions
Because you can query your remote dataset directly from Hugging Face with `hf://` URIs in LanceDB, you can easily inspect the dataset versions and updates on the Hub without needing to download the data locally. This is very useful for keeping track of changes to the dataset as you iterate on your data collection and curation process.
3. Add a dataset card
The Hub dataset card allows you to communicate the schema and usage of the dataset to other developers. It sits at the repo’s root in a file named `README.md` on the Hub.
This project keeps the source card text in `HF_DATASET_CARD.md`, so you can publish updates to the dataset card there and upload it as `README.md` via the HF CLI. Note that this requires a regular `hf upload`, because it is a single-file upload to a specific target path (a custom commit message can be added if you wish).
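For a single file, a plain `hf upload` with an explicit target path should do it (repo id as above; the commit message is optional):

```bash
hf upload your_hf_org/repo_name HF_DATASET_CARD.md README.md \
  --repo-type dataset \
  --commit-message "Update dataset card"
```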
4. Update the dataset
Over time, you may want to add new rows (append) or columns (backfill) to your dataset as your needs evolve. You can make the necessary updates to your local dataset using LanceDB, and then upload the updated version back to the Hub with the same `hf upload-large-folder` command.
```bash
hf upload-large-folder your_hf_org/repo_name /path/to/your_local_dir --repo-type=dataset
```
The updated dataset is then immediately queryable via hf:// URIs in LanceDB.
Explore more Lance datasets on Hugging Face
The LanceDB team is actively uploading useful and interesting datasets in Lance format to the Hugging Face Hub under the lance-format organization. We encourage the Hugging Face and LanceDB communities to upload their own Lance datasets to the Hub to share with others! In the meantime, feel free to browse the Hugging Face Hub to discover more Lance datasets uploaded by the community.

Click here to explore the latest trending Lance datasets on 🤗 Hugging Face!