Description
Currently, FRNN reports the following metrics related to computational speed/efficiency during training:
Per step, per epoch:
- Examples/sec
- sec/batch
- % of batch time spent in calculation vs. synchronization
- overall batch size = batch size per GPU x N_GPU
As we become more cognizant about our performance expectations for the code on various architectures, I think it would be valuable to make these metrics more informative.
- At end of each epoch: summarize Min/Max, Mean, Std-dev of Examples/sec, sec/batch, % calc, % sync over all steps within that epoch
- At end of all epochs: same statistics across epochs?
- Add greater granularity of timing information within an epoch?
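As a rough sketch of the per-epoch summary proposed above (summarize_epoch and the metric names here are illustrative, not existing FRNN code):

```python
import statistics

def summarize_epoch(step_metrics):
    """Summarize per-step timing metrics collected over one epoch.

    step_metrics: dict mapping a metric name (e.g. "examples/sec",
    "sec/batch", "% calc", "% sync") to a list of per-step values.
    """
    summary = {}
    for name, values in step_metrics.items():
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # stdev needs at least two samples
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

# Example: three steps' worth of examples/sec measurements
epoch = summarize_epoch({"examples/sec": [220.0, 225.0, 230.0]})
```

The same helper could be applied a second time to the per-epoch means to get the end-of-training statistics across epochs.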
For the final performance metrics (at the end of all epochs), we should probably exclude the first epoch or so, since TensorFlow invokes the cuDNN autotuner on the first call to tf.Session.run() whenever the undocumented environment variable TF_CUDNN_USE_AUTOTUNE=1 (the default behavior). See https://github.com/tensorflow/tensorflow/blob/fddd829a0795a98b1bdac63c5acaed2c3d8122ff/tensorflow/core/util/use_cudnn.cc#L36 and https://stackoverflow.com/questions/45063489/first-tf-session-run-performs-dramatically-different-from-later-runs-why for an explanation.
Although I wonder if the initial run that is thrown away in order to force compilation already accomplishes this?
plasma-python/plasma/models/mpi_runner.py
Lines 549 to 564 in c82ba61

    # run the model once to force compilation. Don't actually use these
    # values.
    if first_run:
        first_run = False
        t0_comp = time.time()
        # print('input_dimension:',batch_xs.shape)
        # print('output_dimension:',batch_ys.shape)
        _, _ = self.train_on_batch_and_get_deltas(
            batch_xs, batch_ys, verbose)
        self.comm.Barrier()
        sys.stdout.flush()
        # TODO(KGF): check line feed/carriage returns around this
        g.print_unique('\nCompilation finished in {:.2f}s'.format(
            time.time() - t0_comp))
        t_start = time.time()
        sys.stdout.flush()

since it calls Keras train_on_batch().
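For benchmarking either way, a minimal sketch of pinning down the autotuner behavior (the flag is read by TensorFlow's runtime, so it has to be in the environment before TensorFlow initializes; setting it in the job script or before the import is the safe choice):

```python
import os

# Must be set before TensorFlow initializes. "0" disables the cuDNN
# autotuner; leaving it unset (or "1") keeps autotuning enabled.
os.environ.setdefault("TF_CUDNN_USE_AUTOTUNE", "0")

# import tensorflow as tf  # only after the environment is configured
```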
Note: I have tested the effect of the TF_CUDNN_USE_AUTOTUNE variable on https://github.com/tensorflow/benchmarks , specifically

    srun -n 1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

on Traverse V100s and TigerGPU P100s; disabling the autotuner leads to a throughput loss of roughly 8% on both:
P100:
TensorFlow: 1.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 223.9 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 223.9 +/- 0.1 (jitter = 0.3) 7.593
20 images/sec: 223.7 +/- 0.1 (jitter = 0.3) 7.696
30 images/sec: 223.6 +/- 0.1 (jitter = 0.3) 7.753
40 images/sec: 223.6 +/- 0.1 (jitter = 0.3) 8.007
50 images/sec: 223.4 +/- 0.1 (jitter = 0.3) 7.520
60 images/sec: 223.3 +/- 0.1 (jitter = 0.4) 7.988
70 images/sec: 223.3 +/- 0.1 (jitter = 0.4) 8.028
80 images/sec: 223.4 +/- 0.1 (jitter = 0.4) 7.932
90 images/sec: 223.5 +/- 0.1 (jitter = 0.5) 7.848
100 images/sec: 223.5 +/- 0.1 (jitter = 0.5) 7.796
----------------------------------------------------------------
total images/sec: 223.38
----------------------------------------------------------------
-----------------------------------------------------------------------------
export TF_CUDNN_USE_AUTOTUNE=0
-----------------------------------------------------------------------------
TensorFlow: 1.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 204.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 204.8 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 204.8 +/- 0.0 (jitter = 0.2) 7.696
30 images/sec: 204.8 +/- 0.1 (jitter = 0.2) 7.753
40 images/sec: 204.8 +/- 0.1 (jitter = 0.2) 8.007
50 images/sec: 204.8 +/- 0.0 (jitter = 0.2) 7.520
60 images/sec: 204.8 +/- 0.0 (jitter = 0.2) 7.988
70 images/sec: 204.8 +/- 0.1 (jitter = 0.2) 8.027
80 images/sec: 204.8 +/- 0.1 (jitter = 0.2) 7.931
90 images/sec: 204.9 +/- 0.1 (jitter = 0.3) 7.850
100 images/sec: 204.9 +/- 0.1 (jitter = 0.3) 7.797
----------------------------------------------------------------
total images/sec: 204.77
----------------------------------------------------------------
V100:
TensorFlow: 1.14
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 339.6 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 339.9 +/- 0.1 (jitter = 0.4) 7.593
20 images/sec: 340.0 +/- 0.1 (jitter = 0.3) 7.696
30 images/sec: 340.1 +/- 0.1 (jitter = 0.4) 7.753
40 images/sec: 340.1 +/- 0.1 (jitter = 0.4) 8.007
50 images/sec: 340.0 +/- 0.1 (jitter = 0.4) 7.519
60 images/sec: 340.0 +/- 0.1 (jitter = 0.4) 7.988
70 images/sec: 340.0 +/- 0.1 (jitter = 0.5) 8.027
80 images/sec: 340.0 +/- 0.1 (jitter = 0.5) 7.931
90 images/sec: 340.0 +/- 0.1 (jitter = 0.5) 7.849
100 images/sec: 340.0 +/- 0.1 (jitter = 0.5) 7.797
----------------------------------------------------------------
total images/sec: 339.79
----------------------------------------------------------------
-----------------------------------------------------------------------------
export TF_CUDNN_USE_AUTOTUNE=0
-----------------------------------------------------------------------------
TensorFlow: 1.14
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 312.1 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 312.1 +/- 0.1 (jitter = 0.2) 7.593
20 images/sec: 312.1 +/- 0.1 (jitter = 0.2) 7.696
30 images/sec: 312.1 +/- 0.1 (jitter = 0.3) 7.753
40 images/sec: 312.2 +/- 0.1 (jitter = 0.3) 8.007
50 images/sec: 312.2 +/- 0.1 (jitter = 0.3) 7.520
60 images/sec: 312.1 +/- 0.1 (jitter = 0.3) 7.989
70 images/sec: 312.1 +/- 0.1 (jitter = 0.4) 8.026
80 images/sec: 312.1 +/- 0.1 (jitter = 0.4) 7.932
90 images/sec: 312.1 +/- 0.0 (jitter = 0.4) 7.849
100 images/sec: 312.1 +/- 0.0 (jitter = 0.4) 7.795
----------------------------------------------------------------
total images/sec: 312.01
----------------------------------------------------------------
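As a sanity check on those logs, the relative slowdown computed from the "total images/sec" lines works out to roughly 8% on both cards:

```python
# Relative throughput loss with TF_CUDNN_USE_AUTOTUNE=0, taken from the
# "total images/sec" totals reported above.
def slowdown_pct(autotune_on, autotune_off):
    return 100.0 * (autotune_on - autotune_off) / autotune_on

p100 = slowdown_pct(223.38, 204.77)  # TigerGPU P100, TF 1.13
v100 = slowdown_pct(339.79, 312.01)  # Traverse V100, TF 1.14
print('P100: {:.1f}%  V100: {:.1f}%'.format(p100, v100))
# prints: P100: 8.3%  V100: 8.2%
```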