Private docker registry on AWS with S3

Creating a private docker registry is fairly simple and well documented. If you are just experimenting, Docker Hub might be a good place to start. A few things to figure out before setting up a private registry:

  • Storage. There are numerous options
    • File system
    • Azure
    • Google cloud (GCS)
    • AWS S3
    • Swift
    • OSS
    • In memory (not a good option unless you are testing)
  • Authentication
    • silly (as the name implies, it is really silly and not suitable for real deployments)
    • htpasswd (Apache htpasswd-style authentication; credentials are predefined in a file, and it is only suitable when used with TLS)
    • token (OAuth 2.0-style authentication using a Bearer token; this could be tricky if you have Jenkins or other CI systems building and pushing docker images)
  • Transport security
    • Use of TLS is strongly advised. If you don’t have an X509 cert/key, use the free letsencrypt service
  • Storage security
    • Ideally image data should also be secured at rest. See below for S3 storage security
  • Regions
    • If the images need to be accessed from multiple regions, docker registry provides the ability to serve them through CloudFront

Here is a quick and easy setup on AWS using S3 as storage:

  • Create an S3 bucket (my-docker-registry) in the region where you want to store the images
  • If you got burned by the AWS S3 outage a few months back, you may also want to replicate your bucket to another region 🙂 It is pretty simple to set up
  • I also recommend encrypting the data in the S3 bucket. You can do this using AWS Key Management Service (KMS) or using Server Side Encryption (SSE) with AES-256. If you are replicating the bucket data to other region(s), you cannot use KMS (at the time of writing)
  • For the bucket, set a bucket policy (under bucket permissions) to enforce encrypted data. Here is a sample bucket policy for enforcing SSE AES-256:
{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-docker-registry/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-docker-registry/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}
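To see how the two Deny statements in the policy above combine, here is a minimal Python sketch. It is a simplified model for illustration only, not AWS's policy engine; the function name `is_denied` and the dictionary standing in for the request context are assumptions made for this example.

```python
# Simplified model of the bucket policy above (illustration only,
# not how AWS actually evaluates policies internally).
SSE_KEY = "s3:x-amz-server-side-encryption"


def is_denied(request_context):
    """Return True if either Deny statement would match a PutObject request.

    request_context is a hypothetical dict of condition keys to values.
    """
    value = request_context.get(SSE_KEY)
    # DenyUnEncryptedObjectUploads: "Null": {key: "true"} matches
    # when the encryption key is absent from the request.
    header_missing = value is None
    # DenyIncorrectEncryptionHeader: StringNotEquals matches any
    # value other than AES256.
    wrong_algorithm = value != "AES256"
    return header_missing or wrong_algorithm


print(is_denied({}))                      # no encryption header
print(is_denied({SSE_KEY: "aws:kms"}))    # different algorithm
print(is_denied({SSE_KEY: "AES256"}))     # the only accepted upload
```

In this model only the upload that explicitly asks for AES256 gets through, which is why the registry configuration needs to request encryption on every write.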
  • Figure out where you are going to run the registry. Docker registry is distributed as a docker image, so it needs a host to run on; here that is an EC2 instance. It is better to have this instance in the same region as the S3 bucket. Ideally it should be in a VPC with an S3 endpoint configured. Whether the instance should have a public IP or not depends on where you are going to push/pull the images from!
  • Ideally the instance hosting the docker registry should be launched with an IAM role. This way there won’t be a need to provision access/secret keys. Here is a sample IAM policy for the role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": "arn:aws:s3:::my-docker-registry"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": "arn:aws:s3:::my-docker-registry/*"
        }
    ]
}
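Hand-editing the bucket name in two separate ARNs is easy to get wrong. As a small sketch (the helper name `registry_iam_policy` is made up for this example), the same policy can be generated from a single bucket name so that both statements always point at the same bucket:

```python
import json


def registry_iam_policy(bucket):
    """Build the IAM policy shown above for the given bucket name."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:ListBucketMultipartUploads",
                ],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to every key in the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject",
                    "s3:ListMultipartUploadParts",
                    "s3:AbortMultipartUpload",
                ],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = registry_iam_policy("my-docker-registry")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console as the policy attached to the instance role.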
  • Configure the Security Group for the instance appropriately. Ideally, block all incoming traffic except ports 22 and 443 from your specific IP address
  • Follow the installation instructions to install the latest docker on the instance
  • The user who is going to bring up the docker registry container should have access to talk to the docker daemon. You can either do this as the root user 😦 or add a regular user to the docker group (usermod -a -G docker userid)
  • Create a docker-compose.yml file. Here is a sample. In this case I used an X509 cert/key issued by a CA
registry:
  restart: always
  image: registry:2
  ports:
   - 443:5000
  volumes:
   - /host/path/to/certs:/certs
   - /host/path/to/config.yml:/etc/docker/registry/config.yml
  • Create the registry configuration at /host/path/to/config.yml. Here is a sample template with S3 storage and TLS configuration:
version: 0.1
storage:
  s3:
    region: us-east-1
    bucket: my-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    chunksize: 5242880
    multipartcopychunksize: 33554432
    multipartcopymaxconcurrency: 100
    multipartcopythresholdsize: 33554432
  cache:
    blobdescriptor: inmemory
http:
  addr: 0.0.0.0:5000
  net: tcp
  prefix: /
  host: https://<registry hostname>
  tls:
    certificate: /certs/hostname.crt
    key: /certs/hostname.key
  headers:
    X-Content-Type-Options: [nosniff]
  http2:
    disabled: false
  • Replace <registry hostname> with the appropriate value. In this case, I used a real X509 certificate and key that are copied to the host and made available to the docker registry container. The other option is to use the letsencrypt configuration
  • Bring up the docker registry:
$ docker-compose up -d
# Check logs
$ docker-compose logs registry
  • Now it should be possible to tag and push any image to your registry. For example:
$ docker pull ubuntu
$ docker tag ubuntu <registry hostname>/ubuntu
$ docker push <registry hostname>/ubuntu

At this point the registry should be working and usable, but because authentication is not yet set up, you should make sure it is only accessible from trusted hosts.
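A quick way to confirm the registry is answering is to query its HTTP API: the `/v2/_catalog` endpoint returns a JSON document with a `repositories` list. The following is a minimal sketch using only the Python standard library; the helper names are made up for this example, and it assumes the registry has no authentication yet and presents a certificate the client trusts.

```python
import json
import urllib.request


def api_url(hostname, path=""):
    """Build a registry v2 API URL for the given hostname."""
    return f"https://{hostname}/v2/{path}"


def parse_catalog(body):
    """Extract repository names from a /v2/_catalog response body."""
    return json.loads(body).get("repositories", [])


def list_repositories(hostname, timeout=10):
    """Fetch the repository list from a running registry (needs network access)."""
    url = api_url(hostname, "_catalog")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_catalog(response.read().decode("utf-8"))
```

Calling `list_repositories("<registry hostname>")` from a trusted host after pushing an image should return a list that includes that image's repository name.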

PostgreSQL to Hadoop/Hive

Ever tried to get data from PostgreSQL to Hive? I came across the CSV SerDe, which is bundled in the latest version of Apache Hive, but for all practical purposes it is useless: it treats every column as a string. So I wrote my own SerDe. You can find the source on GitHub. Dump your PostgreSQL table data using pg_dump or psql with COPY in plain text format.

Download the pgdump-serde jar to your local machine. Open the hive shell and add the jar. Create an external table and load the dump data. If you are using a pg_dump file, note that this SerDe cannot handle schema, comments, column headers, etc., so remove any header/footer that is not row data.

hive> add jar <path to>/pgdump-serde-1.0.4-1.2.0-all.jar;
hive> USE my_database;
hive> CREATE EXTERNAL TABLE `my_table_ext` (
  `id` string,
  `time` timestamp,
  `what` boolean,
  `size` int,
  ...
)
ROW FORMAT SERDE 'com.pasam.hive.serde.pg.PgDumpSerDe'
LOCATION '/tmp/my_table_ext.txt';
hive> LOAD DATA LOCAL INPATH '<path to dump directory>/my_table.dump' OVERWRITE INTO TABLE my_table_ext;
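The cleanup step mentioned above (removing everything from a pg_dump file that is not row data) can be scripted. In a plain-text pg_dump, the rows of a table sit between a `COPY ... FROM stdin;` line and a terminating `\.` line. Here is a minimal Python sketch that keeps only those rows; the function name `extract_copy_rows` and the sample data are made up for this example, and it is not part of the SerDe.

```python
def extract_copy_rows(lines):
    """Yield only the data rows between 'COPY ... FROM stdin;' and the end marker."""
    in_data = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if not in_data:
            # Everything before the COPY statement (comments, SET, DDL) is skipped.
            if stripped.startswith("COPY ") and stripped.endswith("FROM stdin;"):
                in_data = True
            continue
        if stripped == "\\.":
            # A backslash followed by a period on its own line ends the data block.
            in_data = False
            continue
        yield stripped


# Made-up sample resembling a plain-text pg_dump file.
sample = [
    "--\n",
    "-- PostgreSQL database dump\n",
    "--\n",
    "SET statement_timeout = 0;\n",
    'COPY public.my_table (id, "time", what, size) FROM stdin;\n',
    "a1\t2017-06-01 10:00:00\tt\t42\n",
    "b2\t2017-06-02 11:30:00\tf\t\\N\n",
    "\\.\n",
    "-- PostgreSQL database dump complete\n",
]

rows = list(extract_copy_rows(sample))
print(len(rows))
```

The rows are passed through unchanged, so the output file can be written one row per line and then loaded with the LOAD DATA command. If the dump contains several tables, this sketch keeps the rows of all of them, so dump one table at a time.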