How to install CUDA for torch 1.8.1

I’m using Ubuntu 20.04 LTS (via WSL) on Windows 10, and I’m going to use my GPU with torch.

What should I install?
CUDA | cuDNN | CUDA driver
Actually, I don’t know whether cuDNN is necessary.

My computer specifications are as follows.

  1. GPU: NVIDIA GeForce GTX 1050
  2. Framework: torch 1.8.1 (is 1.8.1+cu111 the latest version?)
  3. OS: Ubuntu 20.04 LTS

I tried a lot, but maybe my versions are not correct.

My main questions are as follows.

  1. Should I install all of CUDA, cuDNN, and the driver?
  2. What versions should I install?
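
(Not part of the original question, but a quick sanity check can help before choosing versions. The sketch below, which assumes some CUDA toolkit is already installed so that nvcc is available, prints the driver version, runtime version, and device count that the CUDA runtime reports from inside WSL.)

// check_cuda.cu -- minimal sanity check; build with: nvcc check_cuda.cu -o check_cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0, deviceCount = 0;

    cudaDriverGetVersion(&driverVersion);    // 0 means no driver is visible
    cudaRuntimeGetVersion(&runtimeVersion);  // version of the installed runtime
    cudaGetDeviceCount(&deviceCount);

    // Versions are encoded as 1000*major + 10*minor, e.g. 11010 for 11.1.
    printf("driver: %d, runtime: %d, devices: %d\n",
           driverVersion, runtimeVersion, deviceCount);
    return 0;
}

If the device count comes back as zero, the problem is likely on the driver/WSL side rather than with the torch build.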

cuda – cuFFT plan fails for a large size

I am doing a cuFFT on a large signal of size 2^29. However, cufftPlan1d fails to create the plan and returns

CUFFT_INTERNAL_ERROR = 5, // Driver or internal cuFFT library error

This is all I could find about this error in the cuFFT docs. How do I make a plan for this size? Is there another function I can use instead of this call?

cufftPlan1d(plan, SIZE, CUFFT_Z2Z, 1)

I have also checked that I am not running out of memory; the usage is well within range. The size is also a power of 2.
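
(An aside, not from the question: one alternative worth trying is cuFFT’s extensible plan API, whose 64-bit entry point avoids the 32-bit size limits of the legacy cufftPlan1d path and reports the required work-area size explicitly. The sketch below assumes SIZE is 2^29, as above.)

#include <cstdio>
#include <cufft.h>
#include <cufftXt.h>   // declares the 64-bit plan API

int main()
{
    cufftHandle plan;
    long long n = 1LL << 29;   // SIZE from the question
    size_t workSize = 0;

    cufftCreate(&plan);
    // 64-bit variant of cufftPlanMany; NULL embed pointers select a
    // simple contiguous 1-D layout, and batch = 1 as in the question.
    cufftResult r = cufftMakePlanMany64(plan, 1, &n,
                                        NULL, 1, n,
                                        NULL, 1, n,
                                        CUFFT_Z2Z, 1, &workSize);
    printf("status %d, required work area %zu bytes\n", (int)r, workSize);
    cufftDestroy(plan);
    return 0;
}

Note that a 2^29 CUFFT_Z2Z transform needs 8 GiB for the data alone, and the work area is typically of a similar order, so a plan can fail even when the input buffer itself fits in memory.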

installation – Sources disagree on hashes for supposely identical version when trying to install CUDA 11.2 toolkit

I am new to GPU computing and I am trying to install CUDA on my Ubuntu 20.04 (Focal, kernel 5.4.0) workstation, which is equipped with an NVIDIA Quadro P4000 GPU.

I have followed the instructions in this link:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions

When I ran step 3.8 (sudo apt-get install cuda), I got this message:

Sources disagree on hashes for supposely identical version ‘11.2.2-1’ of ‘cuda-command-line-tools-11-2:amd64’
The installation finished with the above message and a few other similar ones. When I then tried nvcc --version, I got nothing, which means it didn’t install.

I am actually doing the installation on a remote machine, so I was not able to do any restarts during the installation (if that is needed), because I don’t have access to the physical machine.

Can you please help me figure out what is going wrong with my installation?

Downgrading CUDA without changing NVIDIA driver version

I’m struggling to downgrade my current CUDA version. I am using Ubuntu 20.04 LTS with an NVIDIA GeForce RTX 3070 GPU, 460 drivers, and CUDA 11.2. I am using tensorflow 1.13.1 as part of a machine-learning software package, and for some reason the software doesn’t work properly. I suspect this is because of CUDA, as I use the same software on an NVIDIA TITAN V GPU with 450 drivers and CUDA 11.0, where it works fine.

I first tried downgrading the NVIDIA drivers to 450, as that automatically installs CUDA 11.0. However, it seems the RTX 3070 only supports the 460 drivers, so downgrading them is not an option.

Next, I tried downgrading only CUDA, without touching the drivers. First, I tried removing the current CUDA installation:

sudo apt-get --purge remove "*cublas*" "cuda*"

followed by installing CUDA 11.0 from the NVIDIA archive with the .deb (local) file (following the installation instructions on the website):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu2004-11-0-local_11.0.3-450.51.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-0-local_11.0.3-450.51.06-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-0-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

However, this always seems to automatically revert the drivers to 450, which leads to dependency conflicts. Based on this website, CUDA 11 should support >=450 drivers, so is it possible to downgrade CUDA 11.2 to 11.0 without changing the drivers?
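
(An aside, not part of the question: one more thing worth checking in this setup. The RTX 3070 reports compute capability 8.6, and native sm_86 support only arrived in CUDA 11.1, so an 11.0 toolchain has to rely on PTX JIT compilation for this card. A minimal sketch of mine to confirm what the device reports:)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device visible\n");
        return 1;
    }
    // An RTX 3070 is expected to print compute capability 8.6.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}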

Many thanks!

windows subsystem for linux – How to switch to the right CUDA version in WSL

I am working on WSL (Ubuntu 18.04), where I installed CUDA, TensorRT, PyTorch, and some other packages I forget. Now when I run nvcc -V in the bash shell, it shows me:

(torch1.6) sanwz@DESKTOP-NHKU0MT:/usr/local$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

So I found the nvcc binary at /usr/local/cuda/bin and ran ./nvcc -V, which gives the right info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

The cuda folder at /usr/local/ links to /usr/local/cuda-10.2,
and there is only one cuda-* folder.

Does this mean my default CUDA compiler version is 9.1? How should I change it?
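
(A sketch of mine, not from the question: one way to see which nvcc wins on the PATH is to compile a one-line program with a bare nvcc invocation and print the compiler’s own version macros.)

// version_probe.cu -- build with a bare `nvcc version_probe.cu -o probe`
// so that whichever nvcc is found first on the PATH is the one tested.
#include <cstdio>

int main()
{
    // These macros are defined by nvcc itself, so the output identifies
    // the toolkit that compiled this file, not the one /usr/local/cuda points to.
    printf("compiled by nvcc %d.%d\n",
           __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__);
    return 0;
}

If this prints 9.1, an older /usr/bin/nvcc (Ubuntu 18.04’s nvidia-cuda-toolkit package ships CUDA 9.1) is most likely shadowing the 10.2 install, and putting /usr/local/cuda/bin ahead of /usr/bin in PATH should fix it.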

beginner – gpuIncreaseOne Function Implementation in CUDA

I am trying to perform the basic + operation with CUDA for GPU computation. The function vectorIncreaseOne implements the operation details, and the gpuIncreaseOne function applies the operation to each element of the parameter data_for_calculation.

The experimental implementation

The experimental implementation of the gpuIncreaseOne function is as follows.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <helper_cuda.h>
#include <math.h>

__global__ void CUDACalculation::vectorIncreaseOne(const long double* input, long double* output, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        if (input[i] < 255)
        {
            output[i] = input[i] + 1;
        }
    }
}

int CUDACalculation::gpuIncreaseOne(float* data_for_calculation, int size)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = size;
    size_t DataSize = numElements * sizeof(float);

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    err = cudaMemcpy(d_A, data_for_calculation, DataSize, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorIncreaseOne <<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);

    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    err = cudaMemcpy(data_for_calculation, d_C, DataSize, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    return 0;
}

Test cases

The test case for the gpuIncreaseOne function is as follows.

auto data_pointer = (float*)malloc(100 * sizeof(float));
for (int i = 0; i < 100; i++)
{
    data_pointer[i] = static_cast<float>(1);
}
CUDACalculation::gpuIncreaseOne(data_pointer, 100);

free(data_pointer);

All suggestions are welcome.

If there are any possible improvements regarding:

  • Potential drawbacks or unnecessary overhead
  • Error handling (a sketch follows below)

please let me know.
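
(On the error-handling point above, one conventional pattern, sketched here rather than taken from the post, is to wrap every CUDA call in a checking macro so that each repeated if (err != cudaSuccess) block collapses to a single line:)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Conventional CUDA error-checking macro: report file/line and exit.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage, e.g.: CUDA_CHECK(cudaMalloc((void **)&d_A, DataSize));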

public key – Need help generating a single public key on a CUDA device

I am trying to generate a public key from a hash on a CUDA device. I couldn’t find a ready-made source for that task, so I want to try to modify or create a new function for it.

The source code I want to use is here: source on github.

I have a secret key (unsigned char) and just want to multiply it by G, and then print the public key.

Inside the GPU kernel:

unsigned char secret[32];

unsigned char publicK_x[32];
unsigned char publicK_y[32];

Here is where the point multiplication function is supposed to go:

dec_ge_set_gej(&p, &pj);

dec_fe_get_b32(publicK_x, &p.x);
dec_fe_get_b32(publicK_y, &p.y);

for (int i = 0; i < 32; i++)   // print x side
{
    printf("%02X", publicK_x[i]);
}
printf("\n");

for (int i = 0; i < 32; i++)   // print y side
{
    printf("%02X", publicK_y[i]);
}
printf("\n");

But since the source code uses pre-computed parallel point addition, I’m a bit stuck on getting it to multiply only a single point.
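
(Not from the question: as a sketch of the missing step, a single scalar multiplication can be implemented as a plain double-and-add loop over the secret-key bits. The names dec_gej_set_infinity, dec_gej_double_var, and dec_gej_add_ge_var below are assumptions modeled on the libsecp256k1-style dec_* helpers the snippet already uses; they may not exist under these names in that repository.)

// Hypothetical double-and-add: computes pj = secret * G on the device.
// All dec_* names are assumed, in the style of the snippet above.
__device__ void scalar_mult_g(dec_gej *pj, const unsigned char secret[32],
                              const dec_ge *G)
{
    dec_gej_set_infinity(pj);                    // start from the neutral element
    for (int byte = 0; byte < 32; byte++) {      // big-endian key, MSB first
        for (int bit = 7; bit >= 0; bit--) {
            dec_gej_double_var(pj, pj);          // pj = 2 * pj
            if ((secret[byte] >> bit) & 1)
                dec_gej_add_ge_var(pj, pj, G);   // pj = pj + G
        }
    }
}

Note that this left-to-right loop is not constant-time, so it leaks timing information about the key; for anything security-sensitive, a Montgomery ladder or the library’s own precomputed-table multiplication would be preferable.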