Conchylicultor

Research Engineer

Member Since 7 years ago

Google, Berlin

549 followers · 14 following · 33 stars · 35 repos

365 contributions in the last year

Pinned
⚡ My TensorFlow implementation of "A neural conversational model", a deep-learning-based chatbot
⚡ Enumerate diverse machine learning training tricks.
⚡ Experiment with diverse deep learning models for music generation in TensorFlow
⚡ Google Chrome extension: group your tabs into groups.
⚡ Use a CNN architecture to segment and classify 3D meshes
⚡ My solutions to the Google Foobar Challenge (September 2016 edition)
Activity

Nov 25 · 3 days ago

Conchylicultor commented on an issue in tensorflow/datasets

Help needed using TFDS from config file in a Docker container

Coming from the issue on how to train a ResNet50 using ImageNet from scratch.

What I need help with / What I was wondering I'm trying to train a ResNet50 from scratch from the TF Model Garden using ImageNet. I need to prepare the dataset and I'm trying to use TFDS (loaded from a YAML config file, as suggested in the previously opened issue).

I got an error that says "Not enough disk space", but I do have more than 200GB available. Any further suggestions?

Note that I need to execute everything from a Docker container, because this container will be used to test several infrastructures.

What I've tried so far Here you can see: 1) the ImageNet data is downloaded, 2) I mount that volume in a Docker container, 3) inside that container I run train.py from the Model Garden as indicated in the previous issue, and a configuration is generated as shown, 4) I get the error OSError: Not enough disk space. Needed: 155.84 GiB, 5) however, df shows that there is more than 200 GB available.

user@host:/hdd500/data/imagenet_tars/imagenet$ ll

total 151020536
drwxr-xr-x 6 root root         4096 Nov 17 11:53 ./
drwxr-xr-x 3 root root         4096 Nov 17 11:53 ../
drwxr-xr-x 4 2016 2016         4096 Jun 14  2012 ILSVRC2012_devkit_t12/
-rw-r--r-- 1 root root      2568145 Jun 15  2012 ILSVRC2012_devkit_t12.tar.gz
-rw-r--r-- 1 root root 147897477120 Jun 14  2012 ILSVRC2012_img_train.tar
-rw-r--r-- 1 root root   6744924160 Jun 14  2012 ILSVRC2012_img_val.tar
drwxr-xr-x 2 root root         4096 Sep 21 16:28 __pycache__/
drwxr-xr-x 2 root root         4096 Nov 17 11:28 experiments/
-rw-r--r-- 1 root root        11629 Sep 21 16:22 imagenet.py
drwxr-xr-x 2 root root         4096 Nov 17 11:32 model_checkpoints/

user@host:/hdd500/data/imagenet_tars/imagenet$ ls experiments/

custom_tfds.yaml  gpu.yaml  imagenet_resnet50_gpu.yaml  imagenet_resnet50_gpu_custom.yaml

user@host:/hdd500/data/imagenet_tars/imagenet$ sudo docker run --net=host -it --gpus all -v /hdd500/data/imagenet_tars/imagenet:/root/tensorflow_datasets/downloads/manual manualresnet50 /bin/bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@container:/# ls /root/tensorflow_datasets/downloads/manual/

ILSVRC2012_devkit_t12  ILSVRC2012_devkit_t12.tar.gz  ILSVRC2012_img_train.tar  ILSVRC2012_img_val.tar  __pycache__  experiments  imagenet.py  model_checkpoints

root@container:/# python3 /usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py --experiment=resnet_imagenet --config_file=/root/tensorflow_datasets/downloads/manual/experiments/custom_tfds.yaml --mode=train_and_eval --model_dir=/root/tensorflow_datasets/downloads/manual/model_checkpoints

2021-11-17 11:55:13.435231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.449582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I1117 11:55:13.472259 140716787398464 train_utils.py:292] Final experiment parameters:
{'runtime': {'all_reduce_alg': None,
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': 'dynamic',
             'mixed_precision_dtype': 'float16',
             'num_cores_per_replica': 1,
             'num_gpus': 1,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worker_hosts': None},
 'task': {'evaluation': {'top_k': 5},
          'init_checkpoint': None,
          'init_checkpoint_modules': 'all',
          'losses': {'l2_weight_decay': 0.0001,
                     'label_smoothing': 0.1,
                     'one_hot': True},
          'model': {'add_head_batch_norm': False,
                    'backbone': {'resnet': {'depth_multiplier': 1.0,
                                            'model_id': 50,
                                            'replace_stem_max_pool': False,
                                            'resnetd_shortcut': False,
                                            'se_ratio': 0.0,
                                            'stem_type': 'v0',
                                            'stochastic_depth_drop_rate': 0.0},
                                 'type': 'resnet'},
                    'dropout_rate': 0.0,
                    'input_size': [224, 224, 3],
                    'norm_activation': {'activation': 'relu',
                                        'norm_epsilon': 1e-05,
                                        'norm_momentum': 0.9,
                                        'use_sync_bn': False},
                    'num_classes': 1001},
          'model_output_keys': [],
          'train_data': {'aug_policy': None,
                         'aug_rand_hflip': True,
                         'aug_type': None,
                         'block_length': 1,
                         'cache': False,
                         'cycle_length': 10,
                         'decode_jpeg_only': True,
                         'deterministic': None,
                         'drop_remainder': True,
                         'dtype': 'float16',
                         'enable_tf_data_service': False,
                         'file_type': 'tfrecord',
                         'global_batch_size': 256,
                         'image_field_key': 'image/encoded',
                         'input_path': '',
                         'is_multilabel': False,
                         'is_training': True,
                         'label_field_key': 'image/class/label',
                         'randaug_magnitude': 10,
                         'seed': None,
                         'sharding': True,
                         'shuffle_buffer_size': 10000,
                         'tf_data_service_address': None,
                         'tf_data_service_job_name': None,
                         'tfds_as_supervised': False,
                         'tfds_data_dir': '',
                         'tfds_name': 'imagenet2012',
                         'tfds_skip_decoding_feature': '',
                         'tfds_split': 'train'},
          'validation_data': {'aug_policy': None,
                              'aug_rand_hflip': True,
                              'aug_type': None,
                              'block_length': 1,
                              'cache': False,
                              'cycle_length': 10,
                              'decode_jpeg_only': True,
                              'deterministic': None,
                              'drop_remainder': False,
                              'dtype': 'float16',
                              'enable_tf_data_service': False,
                              'file_type': 'tfrecord',
                              'global_batch_size': 256,
                              'image_field_key': 'image/encoded',
                              'input_path': '',
                              'is_multilabel': False,
                              'is_training': False,
                              'label_field_key': 'image/class/label',
                              'randaug_magnitude': 10,
                              'seed': None,
                              'sharding': True,
                              'shuffle_buffer_size': 10000,
                              'tf_data_service_address': None,
                              'tf_data_service_job_name': None,
                              'tfds_as_supervised': False,
                              'tfds_data_dir': '',
                              'tfds_name': 'imagenet2012',
                              'tfds_skip_decoding_feature': '',
                              'tfds_split': 'validation'}},
 'trainer': {'allow_tpu_summary': False,
             'best_checkpoint_eval_metric': '',
             'best_checkpoint_export_subdir': '',
             'best_checkpoint_metric_comp': 'higher',
             'checkpoint_interval': 625,
             'continuous_eval_timeout': 3600,
             'eval_tf_function': True,
             'eval_tf_while_loop': False,
             'loss_upper_bound': 1000000.0,
             'max_to_keep': 5,
             'optimizer_config': {'ema': None,
                                  'learning_rate': {'stepwise': {'boundaries': [18750,
                                                                                37500,
                                                                                50000],
                                                                 'name': 'PiecewiseConstantDecay',
                                                                 'offset': 0,
                                                                 'values': [0.8,
                                                                            0.08,
                                                                            0.008,
                                                                            0.0008]},
                                                    'type': 'stepwise'},
                                  'optimizer': {'sgd': {'clipnorm': None,
                                                        'clipvalue': None,
                                                        'decay': 0.0,
                                                        'global_clipnorm': None,
                                                        'momentum': 0.9,
                                                        'name': 'SGD',
                                                        'nesterov': False},
                                                'type': 'sgd'},
                                  'warmup': {'linear': {'name': 'linear',
                                                        'warmup_learning_rate': 0,
                                                        'warmup_steps': 3125},
                                             'type': 'linear'}},
             'recovery_begin_steps': 0,
             'recovery_max_trials': 0,
             'steps_per_loop': 625,
             'summary_interval': 625,
             'train_steps': 56250,
             'train_tf_function': True,
             'train_tf_while_loop': True,
             'validation_interval': 625,
             'validation_steps': 25,
             'validation_summary_subdir': 'validation'}}
I1117 11:55:13.474529 140716787398464 train_utils.py:303] Saving experiment configuration to /root/tensorflow_datasets/downloads/manual/model_checkpoints/params.yaml
2021-11-17 11:55:13.493205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
I1117 11:55:13.493584 140716787398464 device_compatibility_check.py:121] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-PCIE-32GB, compute capability 7.0
2021-11-17 11:55:13.494813: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-17 11:55:13.495692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496071: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:13.496370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.416872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-17 11:55:14.417290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30997 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.080295 140716787398464 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1117 11:55:15.082595 140716787398464 train_utils.py:214] Running default trainer.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.146004 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.149153 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152038 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.152982 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.159561 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.162897 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.428227 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.430712 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.434357 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1117 11:55:15.435827 140716787398464 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-11-17 11:55:17.963453: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
I1117 11:55:18.894689 140716787398464 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I1117 11:55:19.755020 140716787398464 dataset_info.py:358] Load dataset info from /tmp/tmpnvfqzrkttfds
I1117 11:55:19.761893 140716787398464 dataset_info.py:413] Field info.description from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762345 140716787398464 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
I1117 11:55:19.762815 140716787398464 dataset_builder.py:400] Generating dataset imagenet2012 (/root/tensorflow_datasets/imagenet2012/5.1.0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 70, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/train.py", line 63, in main
    model_dir=model_dir)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_lib.py", line 78, in run_experiment
    params, model_dir))
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/core/train_utils.py", line 225, in create_trainer
    checkpoint_exporter=checkpoint_exporter)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 259, in __init__
    self.task.build_inputs, self.config.task.train_data)
  File "/usr/local/lib/python3.6/dist-packages/official/core/base_trainer.py", line 159, in distribute_dataset
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 85, in make_distributed_dataset
    return strategy.distribute_datasets_from_function(dataset_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1161, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 589, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 169, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1579, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2327, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/usr/local/lib/python3.6/dist-packages/orbit/utils/common.py", line 83, in dataset_fn
    return dataset_or_fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/official/vision/beta/tasks/image_classification.py", line 119, in build_inputs
    dataset = reader.read(input_context=input_context)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 415, in read
    self._tfds_builder)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 335, in _read_decode_and_parse_dataset
    dataset = self._read_tfds(input_context)
  File "/usr/local/lib/python3.6/dist-packages/official/core/input_reader.py", line 268, in _read_tfds
    self._tfds_builder.download_and_prepare()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 409, in download_and_prepare
    self.info.dataset_size,
OSError: Not enough disk space. Needed: 155.84 GiB (download: Unknown size, generated: 155.84 GiB)
  In call to configurable 'Trainer' (<class 'official.core.base_trainer.Trainer'>)
  In call to configurable 'create_trainer' (<function create_trainer at 0x7ffaa809a1e0>)

root@container:/# df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          79G   12G   63G  16% /
tmpfs            64M     0   64M   0% /dev
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/vdb         79G   12G   63G  16% /etc/hosts
/dev/vdd1       492G  266G  201G  57% /root/tensorflow_datasets/downloads/manual
tmpfs           7.7G   12K  7.7G   1% /proc/driver/nvidia
/dev/vda1        14G  3.6G  9.9G  27% /usr/bin/nvidia-smi
tmpfs           1.6G  996K  1.6G   1% /run/nvidia-persistenced/socket
udev            7.7G     0  7.7G   0% /dev/nvidia0
tmpfs           7.7G     0  7.7G   0% /proc/acpi
tmpfs           7.7G     0  7.7G   0% /proc/scsi
tmpfs           7.7G     0  7.7G   0% /sys/firmware

It would be nice if... anyone had any suggestion about what I missed.

Environment information Here is the Dockerfile:

# https://hub.docker.com/r/tensorflow/tensorflow
FROM tensorflow/tensorflow:2.6.0-gpu

RUN python3 -m pip install --upgrade pip

# https://github.com/tensorflow/models/tree/master/official
RUN pip install tf-models-official==2.6.0

# mount here the volume with imagenet downloaded data
RUN mkdir -p /root/tensorflow_datasets/downloads/manual

Conchylicultor:

I was just proposing a workaround. Could you try to comment out the following line: https://github.com/tensorflow/datasets/blob/30024eefca3aa0783e2374af32766717267335d0/tensorflow_datasets/core/dataset_builder.py#L404

We're using shutil.disk_usage to estimate the available space, which might conflict with Docker for some reason. Commenting out the line will bypass the error, so we can confirm this is the cause.
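For context, a minimal sketch of the kind of guard involved; this is illustrative only, the real check lives at the dataset_builder.py line linked above:

import shutil

def check_disk_space(needed_bytes: int, data_dir: str) -> None:
    """Illustrative stand-in for the TFDS pre-generation disk-space guard."""
    # shutil.disk_usage reports the filesystem backing data_dir.
    free = shutil.disk_usage(data_dir).free
    if needed_bytes > free:
        raise OSError(f"Not enough disk space. Needed: {needed_bytes / 2**30:.2f} GiB")

One observation from the df output above: the log shows the dataset being generated under /root/tensorflow_datasets/imagenet2012, which sits on the 79G overlay filesystem (63G free), not on the 201G mount at /root/tensorflow_datasets/downloads/manual, so the check may simply be measuring the smaller filesystem.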

Nov 24 · 4 days ago

Conchylicultor commented on an issue in tensorflow/datasets

Help needed using TFDS from config file in a Docker container

(Same issue body as quoted above.)

Conchylicultor:

Rather than generating ImageNet inside the Docker image, could you pre-generate the ImageNet .tfrecord files (e.g. with the tfds build imagenet2012 --manual_dir=... CLI), then package only the ~/tensorflow_datasets/imagenet2012/... directory rather than the original ILSVRC2012_devkit_t12.tar.gz?
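If the CLI is inconvenient, a rough sketch of the same pre-generation step through the Python API; the manual_dir path below is taken from the ls output above, and the exact call should be treated as an approximation rather than a verified recipe:

import tensorflow_datasets as tfds

# Sketch: build the imagenet2012 tfrecords once, outside the container.
# manual_dir is the directory holding ILSVRC2012_img_train.tar / ILSVRC2012_img_val.tar.
builder = tfds.builder("imagenet2012")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        manual_dir="/hdd500/data/imagenet_tars/imagenet",
    )
)

Afterwards, only the generated ~/tensorflow_datasets/imagenet2012/... directory needs to be shipped into the container.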

Conchylicultor commented on an issue in tensorflow/datasets

Small typo fix

Typo fix for the issue described here #3341

Conchylicultor:

Thank you, those pages are automatically generated. Could you update the .py file instead?

Conchylicultor commented on an issue in tensorflow/datasets

'Command not found' when trying to install tfds

Received the error message ‘command not found’ when trying to install tfds on a GCP VM. Have attempted via: conda install -c anaconda tensorflow-dataset, pip install tensorflow-datasets, pip install -q tfds-nightly

None of the above are working. Any suggestions would be very helpful!

Conchylicultor:

Maybe this is due to the missing __init__.py files. Let me add those now.

Nov 19 · 1 week ago

Starred a repository
Nov 17 · 1 week ago

Conchylicultor commented on an issue in tensorflow/datasets

'Command not found' when trying to install tfds

(Same issue body as quoted above.)

Conchylicultor:

Could you try to pip uninstall tensorflow_datasets, then re-install tfds-nightly? It's possible the two versions conflict.

Oct 27 · 1 month ago

Conchylicultor commented on an issue in tensorflow/datasets

Loading 1 layer deep data directories

Is your feature request related to a problem? Please describe. When I have a data set layout as follows

.
├── data
│   ├── bad
│   └── good

I'd like to have this just load up but it assumes that I have subdirectories for train/test/foo/bar

Describe the solution you'd like I'd like to have this just load up, but it assumes that I have subdirectories. Can I have something where I just take the directories as-is? Like, if good and bad don't have subdirectories, can it assume that my files are just in there?

Describe alternatives you've considered I can do this but it's kinda nasty.

builder = tfds.folder_dataset.ImageFolder("./")
print(builder.info)  # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='data', shuffle_files=True)
tfds.show_examples(ds, builder.info)

Additional context I love the API, but this one thing kinda bugs me.

Conchylicultor:

I guess this would make sense. Would something like:

builder = tfds.folder_dataset.ImageFolder(".../data/", no_split=True)
ds = builder.as_dataset(shuffle_files=True)

satisfy your use-case? If so, don't hesitate to send a PR.
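In the meantime, a possible workaround that keeps the current API; the helper below is hypothetical glue, not part of TFDS, and simply exposes the flat data/ directory under a synthetic split root so ImageFolder sees the <root>/<split>/<label>/ layout it expects:

import os
import tempfile

def as_single_split(data_dir: str, split_name: str = "data") -> str:
    """Hypothetical helper: wrap a flat <label>/ layout in one synthetic split."""
    root = tempfile.mkdtemp()
    os.symlink(os.path.abspath(data_dir), os.path.join(root, split_name))
    return root

# Usage sketch:
# builder = tfds.folder_dataset.ImageFolder(as_single_split("./data"))
# ds = builder.as_dataset(split="data", shuffle_files=True)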

Oct 24 · 1 month ago

Starred a repository

Oct 14 · 1 month ago

Conchylicultor commented on an issue in tensorflow/datasets

Slightly more powerful split syntax.

There are two additions to the split syntax that would make my life easier:

  1. Allow Python integer formatting like 50_000. This helps with readability and catches off-by-an-order-of-magnitude mistakes more easily. Currently train[:50_000] gives an error.
  2. If this is not insanely difficult: allow a sequence of slices, like train[:50_000][:10], that reduces to train[:10]. This would simplify the logic in my configuration files a lot, where I often define a minival split like train[:50_000] but have an additional "quickrun" flag that appends [:16] to all split names. Now it requires extra logic, but if the above syntax were allowed, it would be much simpler.
Conchylicultor:

Thanks for the feedback.

  1. It should be fairly straightforward to update our regex to support _. tensorflow_datasets/core/tfrecords_reader.py would have to be modified. Don't hesitate to send us a fix.
  2. This one should be doable too, but would be trickier. I have some ideas for supporting nested sub-splits, but I won't have time to implement this soon. If you're interested, I can give you some pointers.

Alternatively, in the meantime, I don't know if this would simplify your implementation, but you could have a look at tfds.core.ReadInstruction: https://www.tensorflow.org/datasets/splits#tfdscorereadinstruction_and_rounding

split = tfds.core.ReadInstruction.from_spec('train[:50_000]')
new_split = f'{split.name}[:10]'
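As a reference point for item 1, a self-contained sketch of underscore-aware parsing; the regex and helper are illustrative, not the actual tfrecords_reader.py code (note that Python's int() already accepts PEP 515 underscores in strings):

import re

# Illustrative: accept digit grouping such as 50_000 in "train[:50_000]".
_SLICE_RE = re.compile(r"^(?P<split>\w+)\[(?P<start>[\d_]*):(?P<stop>[\d_]*)\]$")

def parse_spec(spec: str):
    m = _SLICE_RE.match(spec)
    if not m:
        raise ValueError(f"Bad split spec: {spec!r}")
    to_int = lambda s: int(s) if s else None  # int("50_000") == 50000
    return m.group("split"), to_int(m.group("start")), to_int(m.group("stop"))

print(parse_spec("train[:50_000]"))  # ('train', None, 50000)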
Oct 11 · 1 month ago

Conchylicultor commented on an issue in tensorflow/datasets

About REUSE_CACHE_IF_EXISTS and max_examples_per_split usage

What I need help with / What I was wondering Hi, I have downloaded and processed an OPUS en-es dataset using the following snippet:

build_config = tfds.translate.opus.OpusConfig(
    version=tfds.core.Version("0.1.0"),
    language_pair=("es", "en"),
    subsets=["OpenSubtitles"]
)
# OPUS only provides one split: "train"
builder = tfds.builder("opus", config=build_config)

# By default stored in /root/tensorflow_datasets
dwn_config = tfds.download.DownloadConfig(
    extract_dir=dwn_path,  # store extracted files here
    download_mode=tfds.GenerateMode.REUSE_CACHE_IF_EXISTS,  # Reuse downloads, fresh dataset
    max_examples_per_split=num_items  # "train" split only
)

builder.download_and_prepare(
    download_dir=dwn_path,
    download_config=dwn_config
)

Initially I downloaded with max_examples_per_split=57000, and now I want to increase that number without re-downloading the dataset. When using download_mode=tfds.GenerateMode.REUSE_CACHE_IF_EXISTS I got an error saying that I'm trying to overwrite an existing dataset, yet overwriting is exactly what I was expecting to happen when using the REUSE_CACHE_IF_EXISTS flag, as the documentation explains.

Could you please let me know if I'm wrongly understanding the documentation? If so, how can I get a new set of training examples with more/less data from the already downloaded dataset?

What I've tried so far

  • The REUSE_CACHE_IF_EXISTS flag but it didn't help.
  • Deleting the prepared dataset. This helps, but involves a manual step.

Environment information (if applicable)

  • Operating System: Linux
  • Python version: 3.7.9
  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.4.0
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2.5.1 and 2.6.0
Conchylicultor:

> I find it useful during experimentation. Sometimes I download a large dataset and I start training a vanilla model with a small amount of prepared examples, then for parameter tuning I increase the number of samples. When I finish I increase the training samples again.

Yes, but isn't the --overwrite flag enough for this use-case? I usually experiment with

tfds build my_dataset.py --overwrite --max_examples_per_split=3

This does exactly what you want (it limits the number of examples and overwrites the existing dataset).

See useful flags in https://www.tensorflow.org/datasets/add_dataset#download_and_prepare_tfds_build, or the https://www.tensorflow.org/datasets/cli doc.
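For the Python API path, a minimal sketch of the same idea, reusing the builder from the snippet above; deleting the prepared directory stands in for --overwrite here, which is a workflow assumption rather than documented TFDS behavior:

import shutil

import tensorflow_datasets as tfds

# Drop only the prepared tfrecords; cached downloads/extractions are kept.
shutil.rmtree(builder.data_dir, ignore_errors=True)

builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        download_mode=tfds.download.GenerateMode.REUSE_CACHE_IF_EXISTS,
        max_examples_per_split=100_000,  # the new, larger cap
    )
)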
