
Increase detail when reporting computational throughput and latency #51

@felker

Currently, FRNN reports the following metrics related to computational speed/efficiency during training, per step and per epoch (a short sketch after the list shows how they relate):

  • Examples/sec
  • sec/batch
    • % of batch time spent in calculation vs. synchronization
  • overall batch size = batch size per GPU x N_GPU
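
For concreteness, here is a minimal sketch of how these quantities relate; all names and numbers below are illustrative, not existing FRNN variables:

# Illustrative relationships between the reported quantities
# (numbers are made up; none of these names exist in FRNN):
batch_size_per_gpu, num_gpus = 128, 4
t_calc, t_sync = 0.80, 0.20          # seconds per batch in calc vs. sync
sec_per_batch = t_calc + t_sync
global_batch_size = batch_size_per_gpu * num_gpus
examples_per_sec = global_batch_size / sec_per_batch
pct_calc = 100.0 * t_calc / sec_per_batch
print(examples_per_sec, pct_calc)    # 512.0 examples/sec, 80.0% calc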

As we become more cognizant of our performance expectations for the code on various architectures, I think it would be valuable to make these metrics more informative:

  • At end of each epoch: summarize the min/max, mean, and std-dev of Examples/sec, sec/batch, % calc, and % sync across all steps within that epoch (see the sketch after this list)
  • At end of all epochs: the same statistics across epochs?
  • Add greater granularity of timing information within an epoch?
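
A minimal sketch of what the per-step collection and end-of-epoch/end-of-training summaries could look like, assuming the four per-step quantities are appended to lists as training proceeds (the class and method names here are hypothetical, not existing FRNN code):

import numpy as np

FIELDS = ('examples_per_sec', 'sec_per_batch', 'pct_calc', 'pct_sync')

class ThroughputStats:
    """Accumulate per-step timing metrics; summarize at epoch boundaries."""
    def __init__(self):
        self.step_vals = {f: [] for f in FIELDS}
        self.epoch_means = {f: [] for f in FIELDS}

    def record_step(self, **metrics):
        for f in FIELDS:
            self.step_vals[f].append(metrics[f])

    def end_epoch(self, epoch):
        # Min/max, mean, std-dev of each metric over all steps in this epoch
        for f in FIELDS:
            x = np.asarray(self.step_vals[f])
            print('epoch {} {}: min={:.3g} max={:.3g} mean={:.3g} '
                  'std={:.3g}'.format(epoch, f, x.min(), x.max(),
                                      x.mean(), x.std()))
            self.epoch_means[f].append(x.mean())
            self.step_vals[f] = []

    def end_training(self, skip_epochs=1):
        # Same statistics across epochs, excluding warm-up epochs
        # (see the autotuner discussion below)
        for f in FIELDS:
            x = np.asarray(self.epoch_means[f][skip_epochs:])
            print('all epochs {}: min={:.3g} max={:.3g} mean={:.3g} '
                  'std={:.3g}'.format(f, x.min(), x.max(), x.mean(), x.std()))

# Hypothetical usage inside the training loop:
# stats = ThroughputStats()
# stats.record_step(examples_per_sec=512.0, sec_per_batch=1.0,
#                   pct_calc=80.0, pct_sync=20.0)
# stats.end_epoch(epoch=0); ...; stats.end_training(skip_epochs=1)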

For the final performance metrics (at the end of all epochs), we should probably exclude the first epoch or so, because TensorFlow invokes the cuDNN autotuner on the first call to tf.Session.run() when the undocumented environment variable TF_CUDNN_USE_AUTOTUNE=1 (the default). See https://github.com/tensorflow/tensorflow/blob/fddd829a0795a98b1bdac63c5acaed2c3d8122ff/tensorflow/core/util/use_cudnn.cc#L36 and https://stackoverflow.com/questions/45063489/first-tf-session-run-performs-dramatically-different-from-later-runs-why for an explanation.
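
For reference, if we ever wanted to rule out the autotuner entirely (at the steady-state cost measured below), the variable can be set from Python; it must be set before TensorFlow first reads it, so setting it before the import is the safe pattern. A sketch only:

import os
# Undocumented knob read in tensorflow/core/util/use_cudnn.cc; unset or =1
# (the default) enables autotuning on the first tf.Session.run().
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'
import tensorflow as tf  # import only after the variable is in place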

Although, I wonder if the initial run that is thrown away in order to force compilation already accomplishes this, since it calls Keras train_on_batch():

# run the model once to force compilation. Don't actually use these
# values.
if first_run:
    first_run = False
    t0_comp = time.time()
    # print('input_dimension:', batch_xs.shape)
    # print('output_dimension:', batch_ys.shape)
    _, _ = self.train_on_batch_and_get_deltas(
        batch_xs, batch_ys, verbose)
    self.comm.Barrier()
    sys.stdout.flush()
    # TODO(KGF): check line feed/carriage returns around this
    g.print_unique('\nCompilation finished in {:.2f}s'.format(
        time.time() - t0_comp))
    t_start = time.time()
    sys.stdout.flush()

Note: I have tested the effect of the TF_CUDNN_USE_AUTOTUNE variable with https://github.com/tensorflow/benchmarks, specifically:

srun -n 1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

on Traverse V100s and TigerGPU P100s; disabling the autotuner leads to a throughput loss of roughly 8% (223.4 -> 204.8 img/s on the P100, 339.8 -> 312.0 img/s on the V100):

P100:

TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 223.9 +/- 0.0 (jitter = 0.0)        8.169
10      images/sec: 223.9 +/- 0.1 (jitter = 0.3)        7.593
20      images/sec: 223.7 +/- 0.1 (jitter = 0.3)        7.696
30      images/sec: 223.6 +/- 0.1 (jitter = 0.3)        7.753
40      images/sec: 223.6 +/- 0.1 (jitter = 0.3)        8.007
50      images/sec: 223.4 +/- 0.1 (jitter = 0.3)        7.520
60      images/sec: 223.3 +/- 0.1 (jitter = 0.4)        7.988
70      images/sec: 223.3 +/- 0.1 (jitter = 0.4)        8.028
80      images/sec: 223.4 +/- 0.1 (jitter = 0.4)        7.932
90      images/sec: 223.5 +/- 0.1 (jitter = 0.5)        7.848
100     images/sec: 223.5 +/- 0.1 (jitter = 0.5)        7.796
----------------------------------------------------------------
total images/sec: 223.38
----------------------------------------------------------------

-----------------------------------------------------------------------------
export TF_CUDNN_USE_AUTOTUNE=0
-----------------------------------------------------------------------------

TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 204.8 +/- 0.0 (jitter = 0.0)        8.169
10      images/sec: 204.8 +/- 0.0 (jitter = 0.1)        7.593
20      images/sec: 204.8 +/- 0.0 (jitter = 0.2)        7.696
30      images/sec: 204.8 +/- 0.1 (jitter = 0.2)        7.753
40      images/sec: 204.8 +/- 0.1 (jitter = 0.2)        8.007
50      images/sec: 204.8 +/- 0.0 (jitter = 0.2)        7.520
60      images/sec: 204.8 +/- 0.0 (jitter = 0.2)        7.988
70      images/sec: 204.8 +/- 0.1 (jitter = 0.2)        8.027
80      images/sec: 204.8 +/- 0.1 (jitter = 0.2)        7.931
90      images/sec: 204.9 +/- 0.1 (jitter = 0.3)        7.850
100     images/sec: 204.9 +/- 0.1 (jitter = 0.3)        7.797
----------------------------------------------------------------
total images/sec: 204.77
----------------------------------------------------------------

V100:

TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 339.6 +/- 0.0 (jitter = 0.0)        8.169
10      images/sec: 339.9 +/- 0.1 (jitter = 0.4)        7.593
20      images/sec: 340.0 +/- 0.1 (jitter = 0.3)        7.696
30      images/sec: 340.1 +/- 0.1 (jitter = 0.4)        7.753
40      images/sec: 340.1 +/- 0.1 (jitter = 0.4)        8.007
50      images/sec: 340.0 +/- 0.1 (jitter = 0.4)        7.519
60      images/sec: 340.0 +/- 0.1 (jitter = 0.4)        7.988
70      images/sec: 340.0 +/- 0.1 (jitter = 0.5)        8.027
80      images/sec: 340.0 +/- 0.1 (jitter = 0.5)        7.931
90      images/sec: 340.0 +/- 0.1 (jitter = 0.5)        7.849
100     images/sec: 340.0 +/- 0.1 (jitter = 0.5)        7.797
----------------------------------------------------------------
total images/sec: 339.79
----------------------------------------------------------------

-----------------------------------------------------------------------------
export TF_CUDNN_USE_AUTOTUNE=0
-----------------------------------------------------------------------------

TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 312.1 +/- 0.0 (jitter = 0.0)        8.169
10      images/sec: 312.1 +/- 0.1 (jitter = 0.2)        7.593
20      images/sec: 312.1 +/- 0.1 (jitter = 0.2)        7.696
30      images/sec: 312.1 +/- 0.1 (jitter = 0.3)        7.753
40      images/sec: 312.2 +/- 0.1 (jitter = 0.3)        8.007
50      images/sec: 312.2 +/- 0.1 (jitter = 0.3)        7.520
60      images/sec: 312.1 +/- 0.1 (jitter = 0.3)        7.989
70      images/sec: 312.1 +/- 0.1 (jitter = 0.4)        8.026
80      images/sec: 312.1 +/- 0.1 (jitter = 0.4)        7.932
90      images/sec: 312.1 +/- 0.0 (jitter = 0.4)        7.849
100     images/sec: 312.1 +/- 0.0 (jitter = 0.4)        7.795
----------------------------------------------------------------
total images/sec: 312.01
----------------------------------------------------------------
