NVIDIA Nsight Eclipse Edition for Jetson TK1

NVIDIA® Nsight™ Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA® applications for either your local (x86) system or a remote (x86 or ARM) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an ARM-based development kit.

Nsight supports two remote development modes: cross-compilation and “synchronize projects” mode. Cross-compiling for ARM on your x86 host system requires that all of the ARM libraries with which you will link your application be present on your host system. In synchronize-projects mode, on the other hand, your source code is synchronized between host and target systems and compiled and linked directly on the remote target, which has the advantage that all your libraries get resolved on the target system and need not be present on the host. Neither of these remote development modes requires an NVIDIA GPU to be present in your host system.

Note: CUDA cross-compilation tools for ARM are available only in the Ubuntu 12.04 DEB package of the CUDA 6 Toolkit.  If your host system is running a Linux distribution other than Ubuntu 12.04, I recommend the synchronize-projects remote development mode, which I will cover in detail in a later blog post.

CUDA toolkit setup

The first step involved in cross-compilation is installing the CUDA 6 Toolkit on your host system. To get started, let’s download the required Ubuntu 12.04 DEB package from the CUDA download page. Installation instructions can be found in the Getting Started Guide for Linux, but I will summarize them below for CUDA 6.

1. Enable armhf as a foreign architecture to get the cross-armhf packages installed:

$ sudo sh -c 'echo "foreign-architecture armhf" >> /etc/dpkg/dpkg.cfg.d/multiarch'
$ sudo apt-get update

2. Run dpkg to install and update the repo meta-data:

$ sudo dpkg -i cuda-repo-ubuntu1204_6.0-37_amd64.deb
$ sudo apt-get update

3. Install cuda cross and ARM GNU packages (these will be linked in future toolkit versions):

$ sudo apt-get install cuda-cross-armhf
$ sudo apt-get install g++-4.6-arm-linux-gnueabihf

4. OPTIONAL – if you also wish to do native x86 CUDA development and have an NVIDIA GPU in your host system then you can install the full toolchain and driver:

$ sudo apt-get install cuda

If you installed the driver, reboot your system so that the NVIDIA driver gets loaded. Then update the paths to point to the toolkit install location as follows:

$ export PATH=/usr/local/cuda/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

At the end of these steps you should see the armv7-linux-gnueabihf folder and, if you installed the optional native toolchain, the x86_64-linux folder under /usr/local/cuda/targets/.
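A quick sanity check of the host setup can be sketched as a small shell function. This is only an illustration: `check_cross_setup` is a hypothetical helper name, and it assumes the default install location `/usr/local/cuda` and the `arm-linux-gnueabihf-g++-4.6` cross-compiler that the g++-4.6-arm-linux-gnueabihf package provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of the CUDA Toolkit): reports whether the
# ARM target folder and the ARM cross-compiler can be found on this host.
# The toolkit root defaults to /usr/local/cuda and can be passed as $1.
check_cross_setup() {
    root="${1:-/usr/local/cuda}"
    rc=0
    if [ -d "$root/targets/armv7-linux-gnueabihf" ]; then
        echo "ok: ARM target folder found"
    else
        echo "missing: $root/targets/armv7-linux-gnueabihf"
        rc=1
    fi
    if command -v arm-linux-gnueabihf-g++-4.6 >/dev/null 2>&1; then
        echo "ok: ARM cross-compiler on PATH"
    else
        echo "missing: arm-linux-gnueabihf-g++-4.6"
        rc=1
    fi
    return "$rc"
}
```

Calling `check_cross_setup` with no arguments checks the default location; a non-zero return status means at least one piece is missing.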

For your cross-development needs, Jetson TK1 comes prepopulated with Linux for Tegra (L4T), a modified Ubuntu (13.04 or higher) Linux distribution provided by NVIDIA. NVIDIA provides the board support package and a software stack that includes the CUDA Toolkit, OpenGL 4.4 drivers, and the NVIDIA VisionWorks™ Toolkit. You can download all of these, as well as examples and documentation, from the Jetson TK1 Support Page.

Importing Your First Jetson TK1 CUDA Sample into Nsight

With the CUDA Toolkit installed and the paths set up on the host system, launch Nsight by typing “nsight” (without the quotes) at the command line or by finding the Nsight icon in the Ubuntu dashboard. Once Nsight is loaded, navigate to File->New->CUDA C/C++ Project to start the Project Creation wizard, which you will use to import an existing CUDA sample. For the project name, enter “boxfilter-arm”, then select “Import CUDA Sample” as the project type and “CUDA Toolkit 6.0” as the toolchain. Next, choose the Boxfilter sample, which can be found under the Imaging category.

The remaining options in the wizard let you choose which GPU and CPU architectures to generate code for. First, choose the GPU code that the nvcc compiler should generate. Since Jetson TK1 includes an NVIDIA Kepler™ GPU, choose SM32 GPU binary code and SM30 PTX intermediate code. (The latter is so that any Kepler-class GPU can run this application.) The next page in the wizard lets you decide whether to do native x86 development or cross-compile for an ARM system. To cross-compile for ARM, choose ARM architecture in the CPU architecture drop-down box.

nsight_arm_cross_compiler_selection

Building Your First Jetson TK1 Application from Nsight

CUDA samples are generic code samples that can be imported and run on various hardware configurations. For this cross-build exercise, the ARM library dependencies used by this application have to be resolved first. Here’s how you can resolve them:

1. Right click on the project and navigate to Properties->Build->Settings->Tool Settings->NVCC Linker->Libraries and update the paths to point to linux/armv7l instead of linux/x86_64. This will resolve the libGLEW library dependencies. Also remove the entry for GLU since that library is unused.

nsight_cuda_samples_lib_updates

2. Click on the Miscellaneous tab and add a new -Xlinker option “--unresolved-symbols=ignore-in-shared-libs” (without the quotes).

3. In a terminal window, use the scp utility to copy the remaining libraries from your Jetson TK1 into the /usr/arm-linux-gnueabihf/lib folder on your host, and create a symlink to each one in that folder (libglut.so, libGL.so and libX11.so, respectively):

$ scp ubuntu@your.ip.address:/usr/lib/arm-linux-gnueabihf/libglut.so.3 /usr/arm-linux-gnueabihf/lib
$ scp ubuntu@your.ip.address:/usr/lib/arm-linux-gnueabihf/tegra/libGL.so.1 /usr/arm-linux-gnueabihf/lib
$ scp ubuntu@your.ip.address:/usr/lib/arm-linux-gnueabihf/libX11.so.6 /usr/arm-linux-gnueabihf/lib

Note: You need to copy these ARM libraries only for the first CUDA sample. You may need additional libraries for other samples.
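The symlinks that step 3 calls for could be created along these lines. This is a minimal sketch: `link_arm_libs` is a hypothetical helper name, and the folder defaults to the `/usr/arm-linux-gnueabihf/lib` location used above. Writing to that folder may require root privileges on your host.

```shell
#!/bin/sh
# Hypothetical helper: for each library copied from the Jetson TK1,
# create an unversioned symlink next to it (libglut.so -> libglut.so.3, etc.).
# The library folder defaults to /usr/arm-linux-gnueabihf/lib.
link_arm_libs() {
    libdir="${1:-/usr/arm-linux-gnueabihf/lib}"
    for lib in libglut.so.3 libGL.so.1 libX11.so.6; do
        if [ -e "$libdir/$lib" ]; then
            # Strip the trailing version suffix: libglut.so.3 -> libglut.so
            ln -sf "$lib" "$libdir/${lib%.*}"
        else
            echo "not copied yet: $libdir/$lib" >&2
        fi
    done
}
```

Libraries that have not been copied yet are reported on stderr and skipped, so the function can be rerun after each scp.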

The build process for ARM cross-development is similar to the local build process. Just click on the build “hammer” icon in the toolbar menu to build a debug ARM binary.  As part of the compilation process, Nsight will launch nvcc for the GPU code and the arm-linux-gnueabihf-g++-4.6 cross-compiler for the CPU code as follows:

Building file: ../src/boxFilter_kernel.cu
Invoking: NVCC Compiler
/usr/local/cuda-6.0/bin/nvcc -I"/usr/local/cuda-6.0/samples/3_Imaging" -I"/usr/local/cuda-6.0/samples/common/inc" 
-I"/home/satish/cuda-workspace_new/boxfilter-arm" -G -g -O0 -ccbin arm-linux-gnueabihf-g++-4.6 -gencode arch=compute_30,
code=sm_30 -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -odir "src" -M -o "src/boxFilter_kernel.d" 
"../src/boxFilter_kernel.cu"
/usr/local/cuda-6.0/bin/nvcc --compile -G -I"/usr/local/cuda-6.0/samples/3_Imaging" -I"/usr/local/cuda-6.0/samples/common/inc" 
-I"/home/satish/cuda-workspace_new/boxfilter-arm" -O0 -g -gencode arch=compute_30,code=compute_30 -gencode arch=compute_32,
code=sm_32 --target-cpu-architecture ARM -m32 -ccbin arm-linux-gnueabihf-g++-4.6  -x cu -o  "src/boxFilter_kernel.o" 
"../src/boxFilter_kernel.cu"
Finished building: ../src/boxFilter_kernel.cu

After the compilation steps, the linker will resolve all library references, giving you a boxfilter-arm binary that is ready to run.

Running Your First Jetson TK1 Application from Nsight

To run the code on the target Jetson TK1 system, click on Run As->Remote C/C++ Application to set up the target system user and host address.

nsight_remote_run

Once you finish the remote target system configuration setup, click on the Run icon and you will see a new entry to run the boxfilter-arm binary on the Jetson TK1.

Note: The Box Filter application relies on data files that reside in the data/ subfolder of the application, and these need to be copied to the target system. Use the scp utility to copy those files into the /tmp/nsight-debug/data/ folder on your Jetson TK1.
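One way to script that copy is sketched below. `copy_sample_data` and the `RUN` variable are hypothetical names introduced here for illustration; the destination folder is the one named in the note above, and creating it first with `mkdir -p` is an assumption in case it does not exist yet on the target.

```shell
#!/bin/sh
# Hypothetical helper: copy the sample's data files to the Jetson TK1.
#   $1 = local data folder of the sample
#   $2 = target login, e.g. ubuntu@your.ip.address
# Set RUN=echo to print the commands instead of running them (dry run).
copy_sample_data() {
    src_dir="$1"
    target="$2"
    dest=/tmp/nsight-debug/data
    # Make sure the destination folder exists, then copy every data file.
    ${RUN:-} ssh "$target" "mkdir -p $dest" &&
    ${RUN:-} scp "$src_dir"/* "$target:$dest/"
}
```

For example, `RUN=echo copy_sample_data ./data ubuntu@your.ip.address` prints the ssh and scp commands without contacting the board; dropping `RUN=echo` performs the copy.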

Next, edit the boxfilter.cpp file as follows:
1. To ensure that the application runs on the correct display device, add this line to the top of the main function:

setenv("DISPLAY", ":0", 0);

2. Add the following lines to the top of the display function so that the app auto-terminates after a few seconds. This is required to gather deterministic execution data across multiple runs of the application, which we will need later in the profiling section:

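static int icnt = 120;   // number of display calls left before auto-exit
if (icnt-- == 0)         // true once the counter has run down to zero
{
    cudaDeviceReset();   // release the GPU before exiting
    _exit(EXIT_SUCCESS);
}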

Click on Run to execute the modified Box Filter application on your Jetson TK1.

Debugging Your First Jetson TK1 Application in Nsight

The remote target system configuration that you set up in Nsight earlier will also be visible under the debugger icon in the toolbar.

Before you launch the debugger, note that by default Jetson TK1 does not allow any application to solely occupy the GPU 100% of the time. In order to run the debugger, we need to change this. On your Jetson TK1, log in as root (sudo su) and then disable the timeout as follows (in future releases of CUDA, the debugger will handle this automatically):

root@tegra-ubuntu:/home/ubuntu# echo N > /sys/kernel/debug/gk20a.0/timeouts_enabled
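To confirm that the setting took effect, the file can be read back on the Jetson TK1. The sketch below assumes the usual debugfs convention that a boolean file reads back as `Y` or `N`; `gpu_timeouts_state` is a hypothetical helper name, and reading the file generally requires root.

```shell
#!/bin/sh
# Hypothetical helper: report the state of the GPU timeout switch.
# The path defaults to the debugfs file used above; pass another path as $1.
gpu_timeouts_state() {
    f="${1:-/sys/kernel/debug/gk20a.0/timeouts_enabled}"
    if [ ! -r "$f" ]; then
        echo "unreadable"
        return 1
    fi
    # Assumption: the file reads back as "N" when timeouts are disabled.
    case "$(cat "$f")" in
        N*) echo "disabled" ;;
        *)  echo "enabled" ;;
    esac
}
```

Running `gpu_timeouts_state` as root on the board should report `disabled` before you start a debug session.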

Now we can launch the debugger using the debug icon back on the host system. Nsight will switch you to its debugger perspective and break on the first instruction in the CPU code. You can single-step a bit there to see the execution on the CPU and watch the variables and registers as they are updated.

To break on any and all CUDA kernels executing on the GPU, go to the breakpoint tab in the top-right pane of Nsight and click on the cube icon dropdown. Then select the “break on application kernel launches” feature to break on the first instruction of a CUDA kernel launch. You can now resume the application, which will run until the first breakpoint is hit in the CUDA kernel. From here, you can browse the CPU and GPU call stack in the top-left pane. You can also view the variables, registers and HW state in the top-right pane. In addition, you can see that the Jetson TK1’s GPU is executing 16 blocks of 64 threads each running on the single Streaming Multiprocessor (SMX) of this GK20A GPU.

You can also switch to disassembly view and watch the register values being updated by clicking on the i-> icon to do GPU instruction-level single-stepping.

nsight_debug_view

To “pin” (focus on) specific GPU threads, double click the thread(s) of interest in the CUDA tab in the top-right pane. The pinned CUDA threads will appear in the top-left pane, allowing you to select and single-step just those threads. (Keep in mind, however, that single-stepping a given thread causes the remaining threads of the same warp to step as well, since they share a program counter.)  You can experiment and watch this by pinning threads that belong to different warps.

There are more useful debug features that you will find by going into the debug configuration settings from the debug icon drop down, such as enabling cuda-memcheck and attaching to a running process (on the host system only).

To quit the application you are debugging, click the red stop button in the debugger perspective.

Profiling Your First Jetson TK1 Application in Nsight

Let’s switch back to the C++ project editor view to start the profiler run. The remote target system configuration you set up in Nsight earlier will also be visible to you under the profiler icon in the toolbar.

Before you launch the profiler, note that you need to create a release build with -lineinfo included in the compile options. This tells the compiler to generate information on source-to-instruction correlation. To do this, first go to the project settings by right-clicking on the project in the left pane. Then navigate to Properties->Build->Settings->Tool Settings->Debugging and check the box that says “Generate line-number…” and click Apply.

Back in the main window, click on the build hammer dropdown menu to create a release build. Resolve any build issues as you did during the first run above, then click on Run As->Remote C/C++ Application to run the release build of the application. At this point Nsight will overwrite the binary on the Jetson TK1 system with the release binary you want to profile and run it once.

Next, click on the profile icon dropdown and choose Profile Configurations, where you must select “Profile Remote Application” since the binary is already on the Jetson TK1. Nsight will then switch you to the profiler perspective while it runs the application to gather an execution timeline view of all the CUDA Runtime and Driver API calls and of the kernels that executed on the GPU. The properties tab displays details of any event you select from this timeline; the details of the events can also be viewed in text form in the Details tab in the lower pane.

nsight_profile_view

Below the timeline view in the lower pane, there is also an Analysis tab that is very useful for performance tuning. It guides you through a step-by-step approach to resolving performance bottlenecks in your application. You can switch between guided and unguided analysis by clicking on their icons under the Analysis tab.

You can also get a source-to-instruction correlation view, with hot spots (where the instructions-executed count was particularly high) identified in red as shown in the figure below. You get this view from within the guided analysis mode by first clicking on “Examine Individual Kernels” and selecting the highest ranked (100) kernel from the list of examined kernels, then clicking “Perform Kernel Analysis” followed by “Perform Compute Analysis.” From there, clicking “Show Kernel Profile” will show d_boxfilter_rgba_a kernel in the right pane. Double-click on the kernel name to see the source-to-instruction view. Clicking on a given line of source code highlights the corresponding GPU instructions.

nsight_rc_to_sass

As you can see, whether you are new to NVIDIA® Nsight™ Eclipse Edition or an avid Nsight user, Nsight makes it just as easy and straightforward to create CUDA applications for the Jetson TK1 platform as for all your CUDA-enabled GPUs.

About Satish Salian

Satish Salian is a Sr. Software Engineering Manager at NVIDIA responsible for CUDA tools and developer experience. Satish has twelve years of experience creating various tools and SDKs at NVIDIA. He has a Bachelor's degree in Computer Engineering from University of Pune, India.
  • Josh Smith

    Thanks for this post, Satish! I’m still waiting to get my Jetson TK1. When I get it, I intend on developing on it from my Mac (OS X Mavericks). Do you know if that setup is supported, or should I use a Ubuntu partition instead?

    • Satish Salian

      Josh, good to know that you have a board on the way. Please use Ubuntu 12.04 LTS on the host for cross development. Mac OS X is also a supported host platform, but “synchronize-projects” remote development mode is the way to go on Mac. I’ll add more details on Mac in a future post.

  • http://avrmp.com Mahan B

    Is there any way to compile the program inside the Jetson itself?! I couldn’t find any instruction for that?

  • Alexander Koumis

    Great guide Satish, looking forward to the synchronize-projects version.

  • Miner

    Hi, I followed the same steps but I keep on running into Xlib : extension “GLX” missing on display “:0” error… Can anybody guide me.

    • Satish

      You would usually see this error if you don’t have a active desktop running on Jetson TK1. Do you have a panel connected to Jetson TK1?

      • Miner

        Thank you. That helped.

  • Satish

    All my earlier replies on these questions/comments were made from the blog portal and were thus lost. So if you are seeing late replies you know why :-) I am now using disqus for the replies.

  • Loukas Bampis

    Hello, I followed the above instructions and everything worked very good. Thank you for your great tutorial. I have one problem though. I am using Nsight in order to debug and profile my code and when I time the output lets say that I get x seconds. If I take the exact same code and compile it on the board, I am getting y seconds, with y secs being smaller than x. So the algorithms run faster if I compile them on the board and without using cross-compilation. Does anybody have any idea for that?
    Thank you.

    • Satish

      The generated GPU (SASS) code will be the same whether cross compiled or natively compiled. Please check that the GPU code generation options (which I mentioned above in the blog) are the same in both the cross compile scenario and the native compile case; they both need to be SM32 for code and SM30 for PTX. Also check whether you are using any debug options such as -G, and make sure any such flags are the same across both compile paths.

  • payal talati

    Can I upgrade graphics driver in Jetson TK1 platform?

    - I am having NVIDIA Jetson TK1 kit and I am having Linux ubuntu inside. Now I need to try latest ES3.1 extension like tesselation shader or draw indirect but I am getting linker error as those functions are not available in the library.

    - I am assuming NVIDIA is working on new ES3.1 extension with google. So, I believe there must be new version of drivers for that toolkit.

    Thanks.

    • Satish

      No you should never update just the driver on JetsonTK1 since the driver is part of the L4T OS image. ES3.1 is supported in the upcoming Rel21 to be announced soon.

      • Shiney

        Ok, Thanks Satish.

  • Graham

    “sudo apt-get update” has problem something like this. Is there any suggestions for this?

    Err http://archive.ubuntu.com precise-security/universe armhf Packages
    404 Not Found [IP: 2001:67c:1360:8c01::19 80]

    • Mark Ebersole

      Graham, this is a known issue after adding “foreign-architecture armhf” to multiarch file and we’re working to fix it.

      However, this shouldn’t have any effect on your system (it’s a harmless error). Are you seeing other problems?

      • Graham

        So far it is good! Thank you Mark!!!

  • Archith

    Hi,

    The .deb file for the cross compilers installs the 6.5 version of CUDA’s cross compilers. On the otherhand, Jetson TK1 is at CUDA 6.0. This causes a version mismatch between the gdb server on TK1 and the gdb client on the host. Is there a way to resolve this? I am running ubuntu 12.04.

    Archith

    • http://www.markmark.net/ Mark Harris

      Hi Archith, please make sure you download the CUDA 6.0 cross compilation toolkit from the Jetson TK1 page (https://developer.nvidia.com/jetson-tk1-support), not the CUDA 6.5 toolkit from the CUDA download page. Alternatively you can wait for the next release of L4T, coming soon, which will support CUDA 6.5.

      • Archith

        Hi Mark,

        Thank you for your response.

        I did use the deb file for ubuntu 12.04. I have posted more details on the nvidia devtalk forum (https://devtalk.nvidia.com/default/topic/774786/cuda-setup-and-installation/cuda-6-0-on-ubuntu-14-04).

        The gist of the discussion was that the cross compiler deb file meant for ubuntu 12.04 points to cuda 6.5 tools instead of 6.0, and this causes an incompatibility which prevents cross-debugging. Is there an ETA on the 6.5 support for Jetson TK1?

        Archith

        • http://www.markmark.net/ Mark Harris

          The cross-compiler .deb file I referred to is this one:
          http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1204/x86_64/cuda-repo-ubuntu1204_6.0-37_amd64.deb

          That is CUDA 6.0.

          • Archith

            I have used that exact deb file and I still run into CUDA 6.5 tools when I execute ‘apt-get install cuda-cross-armhf’, which is quite puzzling.

            Archith

          • http://www.markmark.net/ Mark Harris

            Ahah! We figured it out. :) Due to updates since the .deb was posted, you need to specify that you want the 6.0 tools like this:

            apt-get install cuda-cross-armhf-6-0

          • Archith

            Yes, that was it! I can cross-debug now. Thank you for pointing that out. It might be useful to add a note somewhere for the benefit of CUDA newbies like me.

            Archith

    • Ashish Rajput

      Hi Archith,

      Did you use 32 bit or 64 bit? Also i got stuck on the “sudo dpkg -i cuda-repo-ubuntu1204_6.0-37_amd64.deb” command. It always gives me some error.
      In short, thing are not favorable. Any suggestion would be much appreciated.
      Thanks!!!

      • Archith

        Maybe you could post the error you are seeing?

        Archith

  • Ashish Rajput

    Hi, i am having trouble deciding version. Please share your suggestions. I will try keep it simple.

    Ubunut 14 or 12 ?
    32bit or 64 bit?
    Cuda toolkit 6 or 6.5?

    i have tried my best and it appears dpkg command is not working in any version. i have tried ubuntu 14, 12 32-64bit.

    This paper suggests we should use ubuntu 12 (32 bit) but this file “sudo dpkg -i cuda-repo-ubuntu1204_6.0-37_amd64.deb” seems 64bit to me.

    • Satish

      On the host packages there are no x86 32b debian packages from NVIDIA so host system has to be 64b system. For cross compilation stay with Ubuntu12.04 on the host. You are using the right CUDA6.0 toolkit package cuda-repo-ubuntu1204_6.0-37_amd64.deb for your 12.04 host system.

      Regarding CUDA 6.5 toolkit, please note the current shipping Jetson TK1 OS image for L4T (Linux for Tegra) does not contain the latest CUDA 6.5 toolkit or the related driver. CUDA6.5 toolkit will be available in a future L4T release (Rel21.2). You can check the L4T version with the following command:

      > head -1 /etc/nv_tegra_release

      If you want more flexibility on the host OS please use the Nsight synchronized-project mode.
      More info @ http://devblogs.nvidia.com/parallelforall/remote-application-development-nvidia-nsight-eclipse-edition/

      • Ashish Rajput

        Thanks for help. It’s working fine at the moment. Also, i manage to cross-compile cuda samples on both Ubuntu 14.01 and 12.04

  • Zhaoyufei

    I have installed the toolkit on my Jetson TK1, and checked that by ‘nvcc -v’ on the terminal. The document says that the Nsight is inside the toolkit, how can i find the Nsight? PS, I have tried typing ‘nsight’ on the terminal, but it said there is no such command. What should I do to get the Nsight on my TK1? And I only want to write native CUDA code, not cross-compilation, only TK1, is that possible?

    • Satish

      There are no native UI tools in ARM JetsonTK1 toolkit thus no NsightEclipse on Jetson TK1. If you want do native compilation instead of cross compilation, you can use the remote synchronized-project mode. More info @ http://devblogs.nvidia.com/parallelforall/remote-application-development-nvidia-nsight-eclipse-edition/

      • Zhaoyufei

        Thank you very much !
        I have some problems when I try to install the toolkit to my ubuntu 14.04(32bit) host(double OS with Win7 64bit). After install the driver Version 340 for my GT555m,I can’t get my UI desktop back by start lightdm.I can’t fix it .Is the problem of the driver ?

        As i don’t have another PC, can i make remote or cross compilation on Windows platform?

        Or maybe I should change to 12.04(64bit)?Is ubuntu-12.04.4-desktop-amd64 OK?

        Dose double OS make any effect to the host?

        • Ashish Rajput

          Hi Zhaoyufei,

          You should not use 32 bit version either of Ubuntu 14 or 12 OS (nvidia provides 64bit cross compilation toolkit package only).

          Does not matter whether you have graphics card installed on host PC.

          Windows, you should NOT do this.

          Yes, it would be a lot easier to set cross-compilation using Ubuntu 12.04 (64 bit) on host. The problem you may encounter installing dpkg package (run as root user would resolve it). Also, while installing cuda cross arm compiler use “sudo apt-get install cuda-cross-armhf-6-0” instead what mentioned above (it may install cuda-cross-6.5).

          Though i have not tested with double OS but based on my understading it should not affect at all.

          Hopefully, this will bring smile on your face.

          Regards,
          Ashish

          • Zhaoyufei

            Hi Ashish
            Thanks for your reply! I have installed the toolkit 6.5,so i am wandering will toolkit 6.5+cross 6.0 work?i am a undergraduate student,and the gurduate project my teacher gave me is to achieve an image enhance algorithm on Jetson TK1 by Gpu programing. I never involved cuda before,so could you please recommend some basic CUDA study materials ?

          • Satish

            Ashish, great to see you helping out Zhaoyufei with his questions after your recent success.
            Zhaoyufei, you will see version mismatch if you mix toolkit versions so stay with 6.0TK on the host too. Here’s links to imaging samples and programming guide: http://docs.nvidia.com/cuda/cuda-samples/index.html#imaging and http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

          • Zhaoyufei

            Hi Satish
            Thanks for your reply.I downloaded my toolkite 6.5 on Nvidia’s web :
            https://developer.nvidia.com/cuda-downloads
            There are only toolkit 6.5.Could you please tell me where can i find the toolkit 6.0? How can i return my toolkit to 6.0?Just uninstall the 6.5 and install the 6.0?Please give me some guides.
            Besides thanks for the links,that help me a lot.^_^

  • Zhaoyufei

    If my host uses toolkit 6.5 can I make cross compilation for jetson tk 1? What about the remote synchronized-project mode? If not, what Should I do ? Please

    • Ashish Rajput

      Hi Zhaoyufei,

      The current Linux for Tegra (L4T) r19 OS version does not have drivers included to support cuda-toolkit 6.5. It will be included in the forthcoming version of L4T r21. (I read it somewhere but could not recall exact source)

      Regards,
      Ashish