- PyArrow schemas for explicit schema control
- LanceModel for Pydantic-based validation
Create a LanceDB Table
Initialize a LanceDB connection and create a table. Depending on the SDK, LanceDB can ingest arrays of records, Arrow tables or record batches, and Arrow batch iterators or readers. Let's take a look at some of the common patterns.
From list of objects
You can provide a list of objects to create a table. The Python and TypeScript SDKs support lists/arrays of dictionaries, while the Rust SDK supports lists of structs.
From a custom schema
You can define a custom Arrow schema for the table. This is useful when you want more control over the column types and metadata.
From an Arrow Table
You can also create LanceDB tables directly from Arrow tables. In Rust, use an Arrow RecordBatchReader for the same Arrow-native ingest flow.
From a Pandas DataFrame
Python Only
Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to, or provide a PyArrow Table directly.
The vector column needs to be a Vector type (defined as pyarrow.FixedSizeList).
From a Polars DataFrame
Python Only
LanceDB supports Polars, a modern, fast DataFrame library written in Rust. Just like with Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between LanceDB Tables and Polars DataFrames is on the way.
From Pydantic Models
Python Only
When you create an empty table without data, you must specify the table schema. LanceDB supports creating tables by specifying a PyArrow schema or a specialized Pydantic model called LanceModel.
For example, the following Content model specifies a table with 5 columns:
movie_id, vector, genres, title, and imdb_id. When you create a table, you can
pass the class as the value of the schema parameter to create_table.
The vector column is a Vector type, which is a specialized Pydantic type that
can be configured with the vector dimensions. It is also important to note that
LanceDB only understands subclasses of lancedb.pydantic.LanceModel
(which itself derives from pydantic.BaseModel).
Nested schemas
Sometimes your data model may contain nested objects. For example, you may want to store the document string and the document source name as a nested Document object. Using that model as the type of a LanceDB table column creates a struct column called "document" that has two subfields called "content" and "source".
Validators
Because LanceModel inherits from Pydantic's BaseModel, you can combine it with Pydantic's field validators. The example below shows how to add a validator to ensure that only timezone-aware datetime objects are used for a created_at field.
When you run this code, it should raise a ValidationError.
Loading Large Datasets
When ingesting large datasets, use table.add() on an existing table rather than
passing all data to create_table(). The add() method auto-parallelizes large
writes, while create_table(name, data) does not.
From files (Parquet, CSV, etc.)
Python Only
For file-based data, pass a pyarrow.dataset.Dataset to table.add(). This streams
data from disk without loading the entire dataset into memory.
pa.dataset() input is currently Python-only. TypeScript and Rust support for
file-based dataset ingestion is tracked in lancedb#3173.

From iterators (custom batch generation)
When you need custom batch logic, such as generating embeddings on the fly or transforming rows from an external source, use an iterator of RecordBatch objects.
Python can also consume iterators of other supported types like Pandas DataFrames or Python lists.
Write parallelism
For materialized data (pa.Table, pd.DataFrame, pa.dataset()), LanceDB automatically parallelizes large writes, with no configuration needed. Auto-parallelism targets approximately 1M rows or 2GB per write partition.

For streaming sources (iterators, RecordBatchReader), LanceDB cannot determine the total size upfront. A parallelism parameter to control this manually is planned but not yet exposed in Python or TypeScript (tracking issue).

Open existing tables
If you forget the name of your table, you can always get a listing of all table names.
Create empty table
You can create an empty table for scenarios where you want to add data to the table later. An example would be when you want to collect data from a stream/external file and then add it to a table in batches. An empty table can be initialized via an Arrow schema. Alternatively, you can use Pydantic to specify the schema for the empty table. Note that you do not import pydantic directly but instead use LanceModel from lancedb.pydantic, a subclass of pydantic.BaseModel that has been extended to support LanceDB-specific types like Vector.
Once the empty table has been created, you can append to it or modify its contents,
as explained in the updating and modifying tables section.
Drop a table
Use the drop_table() method on the database to remove a table.
This permanently removes the table and is not recoverable, unlike deleting rows.
By default, an exception is raised if the table does not exist. To suppress this,
you can pass in ignore_missing=True.