Improve performance on V100s #52

This mostly repeats private email and in-person discussions on this topic, recorded here for reference and posterity.

FRNN runs **about 3x slower** on the V100s of the two IBM AC922 systems, OLCF Summit and Princeton's Traverse cluster, than on the P100s of Princeton's TigerGPU cluster. The table below reports `d3d_0D` training throughput on Traverse and TigerGPU as a function of batch size (as suggested by @jnkh). I have also run these tests with 1, 2, and 8 GPUs and with several datasets.

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">


<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Machine (GPU Model)</th>
<th scope="col" class="org-right">N_{node}</th>
<th scope="col" class="org-right">N_{GPU}</th>
<th scope="col" class="org-right">Examples/sec</th>
<th scope="col" class="org-right">Sec/batch</th>
<th scope="col" class="org-right">Batch size</th>
</tr>
</thead>

<tbody>
<tr>
<td class="org-left">Traverse (V100)</td>
<td class="org-right">1</td>
<td class="org-right">4</td>
<td class="org-right">1.35e3</td>
<td class="org-right">0.75</td>
<td class="org-right">1024</td>
</tr>


<tr>
<td class="org-left">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">2.53e3</td>
<td class="org-right">0.80</td>
<td class="org-right">2048</td>
</tr>


<tr>
<td class="org-left">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">5.20e3</td>
<td class="org-right">0.80</td>
<td class="org-right">4096</td>
</tr>
</tbody>

<tbody>
<tr>
<td class="org-left">TigerGPU (P100)</td>
<td class="org-right">1</td>
<td class="org-right">4</td>
<td class="org-right">4.30e3</td>
<td class="org-right">0.24</td>
<td class="org-right">1024</td>
</tr>


<tr>
<td class="org-left">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">7.70e3</td>
<td class="org-right">0.26</td>
<td class="org-right">2048</td>
</tr>


<tr>
<td class="org-left">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">&#xa0;</td>
<td class="org-right">1.38e4</td>
<td class="org-right">0.30</td>
<td class="org-right">4096</td>
</tr>
</tbody>
</table>
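
As a quick sanity check on the "about 3x" figure, the slowdown can be recomputed from the throughput column of the table. This is only a sketch: the dictionary and function names are illustrative and are not part of FRNN.

```python
# Throughput in examples/sec, copied from the table above and keyed by batch size.
traverse_v100 = {1024: 1.35e3, 2048: 2.53e3, 4096: 5.20e3}
tigergpu_p100 = {1024: 4.30e3, 2048: 7.70e3, 4096: 1.38e4}


def slowdown(batch_size):
    """Ratio of P100 to V100 throughput; a value > 1 means the V100 node is slower."""
    return tigergpu_p100[batch_size] / traverse_v100[batch_size]


def sec_per_batch(examples_per_sec, batch_size):
    """Time for one batch implied by a measured throughput."""
    return batch_size / examples_per_sec


for bs in sorted(traverse_v100):
    print(f"batch {bs}: Traverse is {slowdown(bs):.2f}x slower, "
          f"{sec_per_batch(traverse_v100[bs], bs):.2f} s/batch")
```

The ratio falls between roughly 2.6x and 3.2x across the three batch sizes, and the implied seconds per batch agree with the measured column to within rounding.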


At first, I suspected an issue with my Conda / MPI environment on the Power 9 architecture. However, @ge-dong and I compared figures and confirmed that we are both independently observing this behavior. In fact, the original modules on Traverse were even slower, by about 20%.

@ASvyatkovskiy identified the primary issue: the TensorFlow backend for `tf.keras` or external Keras does not run the cuDNN autotuner, unlike vanilla TensorFlow architecture definitions. See my notes about the autotuner in #51. The default implementations of our layers might be slower on V100 than on P100.

He opened issues about this when he first ran on Summit over 1.5 years ago:
https://github.com/tensorflow/tensorflow/issues/18913, https://github.com/keras-team/keras/issues/9825. Related: https://github.com/keras-team/keras/issues/9321

He also proposed the following optimizations, especially for V100s:
- Use https://github.com/NVIDIA/nccl library to perform all-reduce directly on the GPU
- Use https://github.com/NVIDIA/apex mixed precision optimizers

> All these things are easier to enable/add in PyTorch, which now also support distributed training natively and through Horovod.

Also, I am systematically benchmarking the `LSTM` Keras layer definition against `CuDNNLSTM`; the latter seems to be at least an order of magnitude faster.
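
A minimal sketch of how such a layer comparison could be timed is given here. It is not the benchmark used for the numbers in this issue. It assumes TensorFlow 1.x, where both `tf.keras.layers.LSTM` and `tf.keras.layers.CuDNNLSTM` are available (the latter requires a GPU). The helper names and the tensor shapes are hypothetical placeholders, not FRNN's actual configuration.

```python
import time


def examples_per_sec(step_fn, batch_size, n_batches=10, warmup=2):
    """Call `step_fn` (one training step on one batch) repeatedly and return throughput."""
    for _ in range(warmup):
        step_fn()  # untimed calls, so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(n_batches):
        step_fn()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very coarse timer.
    return batch_size * n_batches / max(elapsed, 1e-12)


def make_keras_step(layer_name, batch_size=128, timesteps=128, features=14, units=200):
    """Build a one-layer recurrent model and return a zero-argument training step.

    Assumption: TensorFlow 1.x. All shapes are placeholders chosen for illustration.
    """
    import numpy as np
    import tensorflow as tf

    layer_cls = getattr(tf.keras.layers, layer_name)  # "LSTM" or "CuDNNLSTM"
    model = tf.keras.Sequential([
        layer_cls(units, return_sequences=True, input_shape=(timesteps, features)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    x = np.random.rand(batch_size, timesteps, features).astype("float32")
    y = np.random.rand(batch_size, timesteps, 1).astype("float32")
    return lambda: model.train_on_batch(x, y)


# Example use on a GPU node (not executed here):
# for name in ("LSTM", "CuDNNLSTM"):
#     print(name, examples_per_sec(make_keras_step(name), batch_size=128))
```

Keeping the timing helper independent of TensorFlow lets the same loop time any training step, and the warmup calls keep graph construction and kernel selection out of the measured interval.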


**IBM AC922 "Traverse" architecture details:**
- Processor is 16-core Power 9 running at 2.7 GHz
- Host memory 256 GB DDR4
- 4 X V100 with 32 GB HBM2

 
