vahidk

PhD in Computer Vision & Robotics. EM @ Pinterest. Ex-Google, Waymo, Microsoft, & Snap.

Member Since 11 years ago

California

2 organizations

Epic Games NVIDIA GameWorks

725 followers · 55 following · 282 stars · 9 repos

17 contributions in the last year


6 Pinned

⚡ TensorFlow tutorials and best practices.
⚡ PyTorch tutorials and best practices.
⚡ TFRecord reader for PyTorch
⚡ An extendable framework for training neural network models using TensorFlow.
⚡ An experimental rasterizer in TensorFlow
⚡ Visual Studio Code TensorFlow snippets
Jul
22
2 weeks ago
pull request

vahidk pull request vahidk/tfrecord

vahidk
vahidk

allow to switch off automatic seeding of workers

Hi,

Another PR, this time adding the ability to switch off automatic seeding of workers. If a user already has another mechanism for seeding workers (such as worker_init_fn in DataLoader), the current seeding interferes with it. This PR adds a switch to disable worker seeding.

Let's shelve this for now. I think the proper way to do this is to use a RandomState.
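One way to read the RandomState suggestion is that each worker keeps a private generator instead of reseeding the global one, so a user's own worker_init_fn seeding is left alone. The sketch below is only an illustration of that idea: it uses the standard library's `random.Random` as a stand-in for `numpy.random.RandomState`, and `make_worker_rng` and `shuffled_records` are hypothetical helpers, not part of the library.

```python
import random


def make_worker_rng(base_seed, worker_id):
    """Return a private generator for one worker (hypothetical helper)."""
    # random.Random stands in for numpy.random.RandomState in this sketch.
    return random.Random(base_seed * 1000 + worker_id)


def shuffled_records(records, rng):
    """Shuffle a copy of the records using only the private generator."""
    out = list(records)
    rng.shuffle(out)
    return out


# Seed the global generator the way a user's worker_init_fn might,
# and remember the first value it would produce.
random.seed(123)
expected_next = random.random()

# Reseed, then do some shuffling with a private generator in between.
random.seed(123)
rng = make_worker_rng(base_seed=7, worker_id=0)
order = shuffled_records(range(10), rng)

# The global stream was not advanced or reseeded by the private generator.
global_untouched = random.random() == expected_next
```

Because the shuffle draws only from `rng`, the global seed set by the user stays in effect, which is the interference the PR was trying to avoid.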

pull request

vahidk merge to vahidk/tfrecord

vahidk
vahidk

Implementation of sharded multirecord dataset

Hi, I am creating a PR for a properly sharded multi-record dataset, as discussed in #70. This PR should solve it. Thanks :)

This is not right. This won't read all of the data. Before sending a PR, I suggest creating at least one test example and making sure that it works.
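A test of the kind requested here could check that the shards, taken together, cover every input exactly once. The following is a minimal sketch assuming a simple round-robin assignment of files to shards; `shard_files` and `check_full_coverage` are hypothetical names and do not reflect the PR's actual code.

```python
def shard_files(files, shard_idx, shard_count):
    """Round-robin assignment of files to one shard (hypothetical helper)."""
    return files[shard_idx::shard_count]


def check_full_coverage(files, shard_count):
    """Return True if the shards together contain every file exactly once."""
    seen = []
    for shard_idx in range(shard_count):
        seen.extend(shard_files(files, shard_idx, shard_count))
    return sorted(seen) == sorted(files)


files = [f"train.tfrecord-{i:05d}" for i in range(10)]

# Include a shard count larger than the number of files: some shards are
# then empty, but nothing may be lost or duplicated.
ok = all(check_full_coverage(files, n) for n in (1, 3, 4, 16))
```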

What's the purpose of this?

Jul
19
2 weeks ago
issue

vahidk issue vahidk/tfrecord

vahidk
vahidk

Loading a multi-tfrecord multi-split dataset

I have downloaded the open_image_v4 dataset from TensorFlow using the tfds utility (as well as many other similar datasets), and now I have a folder with the following structure:

open_image_v4
├── open_images_v4-test.tfrecord-00001-of-00512
├── open_images_v4-test.tfrecord-00002-of-00512
├── ...
├── open_images_v4-train.tfrecord-00001-of-01024
├── ...
├── open_images_v4-validation.tfrecord-00001-of-00128
└── ...

As you can see, I have 3 splits: train, test and validation. However, each of those splits is itself sharded into multiple files.

I created an index file for each of those TFRecord files using GNU parallel:

parallel -j19 python3 -m tfrecord.tools.tfrecord2idx {} index/{}.index ::: *.tfrecord-*

From skimming over the reader code, it seems that you only support datasets that all fit in a single .tfrecord file. Did I miss something or is there a workaround I could employ to read large subsets (train, valid, test) of the dataset?

I initially thought it might be possible to provide the input pattern as:

tfrecord_pattern = "/tmp/{split}_{idx}.tfrecord"

or in the case of my above file system:

tfrecord_pattern = "/tmp/open_images_v4-{split}.tfrecord-{idx:05d}-of-{total:05d}"

if that may be of any help for future PRs.
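To make the proposed pattern concrete, here is a small sketch that expands it into shard paths with `str.format`. The `shard_paths` helper is hypothetical, and the starting index of 1 is an assumption based on the first file names visible in the listing.

```python
PATTERN = "/tmp/open_images_v4-{split}.tfrecord-{idx:05d}-of-{total:05d}"


def shard_paths(split, total, first_idx):
    """Expand the pattern into one path per shard (hypothetical helper)."""
    return [
        PATTERN.format(split=split, idx=idx, total=total)
        for idx in range(first_idx, first_idx + total)
    ]


# first_idx=1 is an assumption taken from the directory listing.
test_paths = shard_paths("test", total=512, first_idx=1)
first_path = test_paths[0]
```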

You can provide a pattern to MultiTFRecordDataset and it would read and blend all of the files simultaneously. This is already supported.
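As far as the project README describes it (worth verifying against the installed version), MultiTFRecordDataset takes a data pattern with a single `{}` placeholder, a matching index pattern, and a dict mapping each placeholder value to a sampling weight. Under that assumption, the sketch below builds such a dict for one split from the file names in the question. `build_splits` is a hypothetical helper, and equal weights are an arbitrary choice.

```python
def build_splits(filenames, split):
    """Map each shard suffix of one split to an equal sampling weight."""
    prefix = f"open_images_v4-{split}.tfrecord-"
    suffixes = sorted(
        name[len(prefix):] for name in filenames if name.startswith(prefix)
    )
    if not suffixes:
        return {}
    weight = 1.0 / len(suffixes)
    return {suffix: weight for suffix in suffixes}


# In practice this list would come from os.listdir on the dataset folder.
listing = [
    "open_images_v4-test.tfrecord-00001-of-00512",
    "open_images_v4-test.tfrecord-00002-of-00512",
    "open_images_v4-train.tfrecord-00001-of-01024",
    "open_images_v4-validation.tfrecord-00001-of-00128",
]
test_splits = build_splits(listing, "test")

# Patterns that would go with these keys (assumed layout, not verified here):
data_pattern = "open_image_v4/open_images_v4-test.tfrecord-{}"
index_pattern = "open_image_v4/index/open_images_v4-test.tfrecord-{}.index"
```

Each key of `test_splits` fills the `{}` in both patterns, so one dataset object per split (train, test, validation) would read all shards of that split.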

issue

vahidk issue vahidk/tfrecord

vahidk
vahidk

Losing shape information

Hi,

I am trying to load the ImageNet2012 tfrecords (loaded using tensorflow datasets).

I can load the tfrecords fine using TensorFlow. However, when I use this library I lose the shape information of the images (they come back as flattened lists). Is there any way to resolve this? Other than that, the library works great!

Thanks

This is expected behavior. You can pass a transform function to do your preprocessing.
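A transform of this kind might look like the sketch below. It assumes the transform receives a dict of features with a flat `image` entry and that the target shape is known in advance (or stored in another feature). Both are assumptions, and `reshape` and `restore_image_shape` are hypothetical helpers. The reshaping is written with plain lists to keep the example dependency-free; with NumPy one would normally call the array's `reshape` method instead.

```python
from math import prod


def reshape(flat, shape):
    """Reshape a flat sequence into nested lists with the given shape."""
    if prod(shape) != len(flat):
        raise ValueError(f"cannot reshape {len(flat)} values into {shape}")
    if len(shape) == 1:
        return list(flat)
    if shape[0] == 0:
        return []
    step = len(flat) // shape[0]
    return [
        reshape(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


def restore_image_shape(features, shape=(2, 2, 3)):
    """Hypothetical transform: put the flat 'image' entry back into shape."""
    out = dict(features)
    out["image"] = reshape(features["image"], shape)
    return out


# A tiny 2x2 RGB "image" flattened to 12 values, plus a label.
example = {"image": list(range(12)), "label": 5}
restored = restore_image_shape(example)
```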

issue

vahidk issue comment vahidk/tfrecord

vahidk
vahidk

shard not used in MultiTFRecordDataset

Hi, I just noticed that MultiTFRecordDataset does not use shard, whereas TFRecordDataset does. Is this the intended behavior? It very well could be, and there may be some dark magic I am not aware of, but to me it seems it is not. If it is not, I can submit a PR to fix it.

Thanks :)

vahidk
vahidk

Yes. It's currently not used. Sharding helps with parallelizing reads and better shuffling, but MultiTFRecordDataset can already read from pre-sharded data so it's not as necessary. If it doesn't overcomplicate the design we can do that.
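To illustrate how sharding parallelizes reads within a single file, the sketch below splits a record range into contiguous, non-overlapping slices, one per worker. This is a generic scheme written for illustration; `shard_range` is a hypothetical helper, and the library's actual reader may compute its boundaries differently.

```python
def shard_range(num_records, shard_idx, shard_count):
    """Return the half-open record range [start, end) for one shard."""
    start = (num_records * shard_idx) // shard_count
    end = (num_records * (shard_idx + 1)) // shard_count
    return start, end


# Ten records split across three workers.
ranges = [shard_range(10, worker, 3) for worker in range(3)]
```

Each range starts where the previous one ends, so the workers together read every record once even when the count does not divide evenly.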

fork

vahidk forked AFathi/ARVideoKit

⚡ Capture & record ARKit videos 📹, photos 🌄, Live Photos 🎇, and GIFs 🎆.
vahidk Apache License 2.0 Updated
issue

vahidk issue comment vahidk/tfrecord

vahidk
vahidk

__getitem__(index)

Hey, how about implementing __getitem__() on the tfrecord dataset? I use that function quite often to pick exactly which sample I want to look at. Regards. By the way, great work!
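Random access of this kind is possible once the byte offset of every record is known, which is what an index provides. The sketch below demonstrates the idea on an in-memory stream using the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). The CRC fields are written as zero placeholders and never checked, the payloads are raw bytes rather than parsed examples, and `IndexedRecords` is a hypothetical class, not the library's API.

```python
import io
import struct


def write_record(stream, payload):
    """Append one record in TFRecord framing with placeholder CRC fields."""
    stream.write(struct.pack("<Q", len(payload)))
    stream.write(struct.pack("<I", 0))  # length CRC, not computed here
    stream.write(payload)
    stream.write(struct.pack("<I", 0))  # payload CRC, not computed here


def build_index(stream):
    """Scan the stream once and return the byte offset of every record."""
    offsets = []
    stream.seek(0)
    while True:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            break
        (length,) = struct.unpack("<Q", header)
        stream.seek(4 + length + 4, io.SEEK_CUR)  # skip CRCs and payload
        offsets.append(start)
    return offsets


class IndexedRecords:
    """Hypothetical wrapper giving list-style access to framed records."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = build_index(stream)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        self.stream.seek(self.offsets[index])
        (length,) = struct.unpack("<Q", self.stream.read(8))
        self.stream.seek(4, io.SEEK_CUR)  # skip the length CRC
        return self.stream.read(length)


buf = io.BytesIO()
for payload in (b"first", b"second", b"third"):
    write_record(buf, payload)
records = IndexedRecords(buf)
```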
