
How to Import Pre-Annotated Data into Label Studio and Run the Full Stack with Docker

Preparing a dataset for an object detection training workflow can take a long time and is often frustrating. Label Studio, an open-source data annotation tool, can lend a hand by providing an easy way to annotate datasets. It supports a wide variety of annotation templates, including computer vision, natural language processing, and audio or speech processing. In this article, however, we'll focus specifically on the object detection workflow.

But what if you want to take advantage of pre-annotated open-source datasets, such as the Pascal VOC dataset? In this article, I’ll show you how to easily import those tasks into Label Studio’s format while setting up the entire stack — including a PostgreSQL database, MinIO object storage, an Nginx reverse proxy, and the Label Studio backend. MinIO is an S3-compatible object storage service: you might use cloud-native storage in production, but you can also run it locally for development and testing.

In this tutorial, we’ll go through the following steps:

  1. Convert Pascal VOC annotations – transform bounding boxes from XML into Label Studio tasks in JSON format.
  2. Run the full stack – start Label Studio with PostgreSQL, MinIO, Nginx, and the backend using Docker Compose.
  3. Set up a Label Studio project – configure a new project inside the Label Studio interface.
  4. Upload images and tasks to MinIO – store your dataset in an S3-compatible bucket.
  5. Connect MinIO to Label Studio – add the cloud storage bucket to your project so Label Studio can fetch images and annotations directly.

Prerequisites

To follow this tutorial, make sure you have:

  • Docker and Docker Compose installed on your machine.
  • The Pascal VOC 2012 dataset downloaded locally.
  • Optionally, the AWS CLI for syncing files to MinIO.

From VOC to Label Studio: Preparing Annotations

The Pascal VOC dataset has a folder structure in which the training and validation splits are already defined. The Annotations folder contains the annotation files for each image. In total, the training set includes 17,125 images, each with a corresponding annotation file.

.
└── VOC2012
    ├── Annotations  # 17125 annotations
    ├── ImageSets 
    │   ├── Action
    │   ├── Layout
    │   ├── Main
    │   └── Segmentation
    ├── JPEGImages  # 17125 images
    ├── SegmentationClass
    └── SegmentationObject

The XML snippet below, taken from one of the annotations, defines a bounding box around an object labeled “person”. The box is specified using four pixel coordinates: xmin, ymin, xmax, and ymax.

XML snippet from the Pascal VOC dataset (Image by Author)
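Since the snippet above is rendered as an image, here is a comparable annotation in text form, trimmed to the relevant fields. The values are illustrative and chosen to be consistent with the converted JSON shown further below:

```xml
<annotation>
  <filename>2007_000027.jpg</filename>
  <size>
    <width>486</width>
    <height>500</height>
    <depth>3</depth>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>174</xmin>
      <ymin>101</ymin>
      <xmax>349</xmax>
      <ymax>351</ymax>
    </bndbox>
  </object>
</annotation>
```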

The illustration below shows the inner rectangle as the annotated bounding box, defined by the top-left corner (xmin, ymin) and the bottom-right corner (xmax, ymax), within the outer rectangle representing the image.

Pascal VOC bounding box coordinates in pixel format (Image by Author)

Label Studio expects each bounding box to be defined by its width, height, and top-left corner, expressed as percentages of the image size. Below is a working example of the converted JSON format for the annotation shown above.

{
  "data": {
    "image": "s3:////2007_000027.jpg"
  },
  "annotations": [
    {
      "result": [
        {
          "from_name": "label",
          "to_name": "image",
          "type": "rectanglelabels",
          "value": {
            "x": 35.802,
            "y": 20.20,
            "width": 36.01,
            "height": 50.0,
            "rectanglelabels": ["person"]
          }
        }
      ]
    }
  ]
}

As you can see in the JSON format, you also need to specify the location of the image file — for example, a path in MinIO or an S3 bucket if you’re using cloud storage.

While preprocessing the data, I merged the entire dataset, even though it was already divided into training and validation. This simulates a real-world scenario where you typically begin with a single dataset and perform the splitting into training and validation sets yourself before training.

Running the Full Stack with Docker Compose

I merged the docker-compose.yml and docker-compose.minio.yml files into a simplified single configuration so the entire stack can run on the same network. Both files were taken from the official Label Studio GitHub repository.



services:
  nginx:
    # Acts as a reverse proxy for Label Studio frontend/backend
    image: heartexlabs/label-studio:latest
    restart: unless-stopped
    ports:
      - "8080:8085" 
      - "8081:8086"
    depends_on:
      - app
    environment:
      - LABEL_STUDIO_HOST=${LABEL_STUDIO_HOST:-}
    
    volumes:
      - ./mydata:/label-studio/data:rw # Stores Label Studio projects, configs, and uploaded files
    command: nginx

  app:
    stdin_open: true
    tty: true
    image: heartexlabs/label-studio:latest
    restart: unless-stopped
    expose:
      - "8000"
    depends_on:
      - db
    environment:
      - DJANGO_DB=default
      - POSTGRE_NAME=postgres
      - POSTGRE_USER=postgres
      - POSTGRE_PASSWORD=
      - POSTGRE_PORT=5432
      - POSTGRE_HOST=db
      - LABEL_STUDIO_HOST=${LABEL_STUDIO_HOST:-}
      - JSON_LOG=1
    volumes:
      - ./mydata:/label-studio/data:rw  # Stores Label Studio projects, configs, and uploaded files
    command: label-studio-uwsgi

  db:
    image: pgautoupgrade/pgautoupgrade:13-alpine
    hostname: db
    restart: unless-stopped
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
      - POSTGRES_USER=postgres
    volumes:
      - ${POSTGRES_DATA_DIR:-./postgres-data}:/var/lib/postgresql/data  # Persistent storage for PostgreSQL database
  minio:
    image: "minio/minio:${MINIO_VERSION:-RELEASE.2025-04-22T22-12-26Z}"
    command: server /data --console-address ":9009"
    restart: unless-stopped
    ports:
      - "9000:9000"
      - "9009:9009"
    volumes:
      - minio-data:/data   # Stores uploaded dataset objects (like images or JSON tasks)
    # configure env vars in .env file or your systems environment
    environment:
      - MINIO_ROOT_USER=${MINIO_ROOT_USER:-minio_admin_do_not_use_in_production}
      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD:-minio_admin_do_not_use_in_production}
      - MINIO_PROMETHEUS_URL=${MINIO_PROMETHEUS_URL:-http://prometheus:9090}
      - MINIO_PROMETHEUS_AUTH_TYPE=${MINIO_PROMETHEUS_AUTH_TYPE:-public}
 
volumes:
  minio-data: # Named volume for MinIO object storage

This simplified Docker Compose file defines four core services with their volume mappings:

App – runs the Label Studio backend itself.

  • Shares the mydata directory with Nginx; this directory stores projects, configurations, and uploaded files.
  • Uses a bind mount: ./mydata:/label-studio/data:rw → maps a folder from your host into the container.

Nginx – acts as a reverse proxy for the Label Studio frontend and backend.

  • Shares the mydata directory with the App service.

PostgreSQL (db) – manages metadata and project information.

  • Stores persistent database files.
  • Uses a bind mount: ${POSTGRES_DATA_DIR:-./postgres-data}:/var/lib/postgresql/data.

MinIO – an S3-compatible object storage service.

  • Stores dataset objects such as images or JSON annotation tasks.
  • Uses a named volume: minio-data:/data.

When you mount host folders such as ./mydata and ./postgres-data, you need to assign ownership on the host to the same user that runs inside the container. Label Studio does not run as root — it uses a non-root user with UID 1001. If the host directories are owned by a different user, the container won’t have write access and you’ll run into permission denied errors.

After creating these folders in your project directory, you can adjust their ownership with:

mkdir mydata 
mkdir postgres-data
sudo chown -R 1001:1001 ./mydata ./postgres-data

Now that the directories are prepared, we can bring up the stack using Docker Compose. Simply run:

docker compose up -d

It may take a few minutes to pull the required images from Docker Hub and set up Label Studio. Once the setup is complete, open http://localhost:8080 in your browser, create an account, and log in. You can enable a legacy API token by going to Organization → API Token Settings. This token lets you communicate with the Label Studio API, which is especially useful for automation tasks.

Set up a Label Studio project

Now we can create our first data annotation project in Label Studio, specifically for an object detection workflow. Before you start annotating images, you need to define the classes annotators can choose from. The Pascal VOC dataset contains 20 pre-annotated object classes.

XML-style labeling setup (Image by Author)
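A minimal labeling configuration for this setup might look like the sketch below. The name attributes must match the from_name and to_name fields in the JSON tasks shown earlier, and only three of the 20 VOC classes are listed here:

```xml
<View>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image">
    <Label value="person"/>
    <Label value="car"/>
    <Label value="dog"/>
    <!-- add the remaining VOC classes here -->
  </RectangleLabels>
</View>
```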

Upload images and tasks to MinIO

You can open the MinIO console in your browser at localhost:9009 (port 9000 serves the S3 API, while the console runs on 9009, as configured in docker-compose.yml), and then log in using the credentials you specified under the minio service.

I created a bucket with folders, one of which is used for storing images and another for JSON tasks formatted according to the instructions above.

Screenshot of an example bucket in MinIO (Image by Author)

We set up an S3-compatible service locally, which lets us simulate S3 cloud storage without incurring any charges. If you later want to move your files to an actual S3 bucket on AWS, it's better to upload them directly from your machine, keeping data transfer costs in mind. The good news is that you can also interact with your MinIO bucket using the AWS CLI. To do this, add a profile in ~/.aws/config and provide the corresponding credentials in ~/.aws/credentials under the same profile name.
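For example, assuming a profile named minio (the profile name and credential values are placeholders; use the values from your docker-compose.yml):

```ini
# ~/.aws/config
[profile minio]
region = us-east-1

# ~/.aws/credentials
[minio]
aws_access_key_id = <your MINIO_ROOT_USER value>
aws_secret_access_key = <your MINIO_ROOT_PASSWORD value>
```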

And then, you can easily sync with your local folder using the following commands:

#!/bin/bash
set -e

PROFILE=
MINIO_ENDPOINT=   # e.g. http://localhost:9000
BUCKET_NAME=
SOURCE_DIR=    
DEST_DIR= 

aws s3 sync \
      --endpoint-url "$MINIO_ENDPOINT" \
      --no-verify-ssl \
      --profile "$PROFILE" \
      "$SOURCE_DIR" "s3://$BUCKET_NAME/$DEST_DIR"


Connect MinIO to Label Studio

After all the data, including the images and annotations, has been uploaded, we can move on to adding cloud storage to the project we created in the previous step.

From your project settings, go to Cloud Storage and add the required parameters, such as the endpoint (which points to the service name in the Docker stack along with the port number, e.g., minio:9000), the bucket name, and the relevant prefix where the annotation files are stored. Each path inside the JSON files will then point to the corresponding image.

Screenshot of the Cloud Storage settings (Image by Author)

After verifying that the connection is working, you can sync your project with the cloud storage. You may need to run the sync command multiple times since the dataset contains 22,263 images. It may appear to fail at first, but when you restart the sync, it continues to make progress. Eventually, all the Pascal VOC data will be successfully imported into Label Studio.

Screenshot of the task list (Image by Author)

You can see the imported tasks with their thumbnail images in the task list. When you click on a task, the image will appear with its pre-annotations.

Screenshot of an image with bounding boxes (Image by Author)

Conclusions

In this tutorial, we demonstrated how to import the Pascal VOC dataset into Label Studio by converting XML annotations into Label Studio’s JSON format, running a full stack with Docker Compose, and connecting MinIO as S3-compatible storage. This setup enables you to work with large-scale, pre-annotated datasets in a reproducible and cost-effective way, all on your local machine. Testing your project settings and file formats locally first will ensure a smoother transition when moving to cloud environments.

I hope this tutorial helps you kickstart your data annotation project with pre-annotated data that you can easily expand or validate. Once your dataset is ready for training, you can export all the tasks in popular formats such as COCO or YOLO.
