<?xml version="1.0" encoding="UTF-8"?><xml><records><record><source-app name="Biblio" version="7.x">Drupal-Biblio</source-app><ref-type>47</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Ian Elmor Lang</style></author><author><style face="normal" font="default" size="100%">Ziqiang Huang</style></author><author><style face="normal" font="default" size="100%">Nachiket Kapre</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">Exploring the Impact of Switch Arity on Butterfly Fat Tree FPGA NoCs</style></title><secondary-title><style face="normal" font="default" size="100%">The 28th IEEE International Symposium on Field-Programmable Custom Computing Machines</style></secondary-title></titles><dates><year><style face="normal" font="default" size="100%">2020</style></year></dates><language><style face="normal" font="default" size="100%">eng</style></language><abstract><style face="normal" font="default" size="100%">Overlay Networks-on-Chip (NoCs) for FPGAs based on the Butterfly-Fat Tree (BFT) topology with lightweight flow control deliver low LUT costs and features such as in-order delivery and livelock freedom. BFT NoCs make it possible to configure network bandwidth to match application requirements by choosing switch types with different numbers of ports (arity) for the layers of the tree hierarchy. We increase the design space of BFT NoC configurations available to designers by constructing networks with larger arity-4 switches, in addition to the arity-2 switches explored by previous works. When synthesized for the Xilinx UltraScale+ VU9P FPGA, our proposed BFT NoCs consume 38-45% fewer LUTs and 33-50% shorter wiring lengths than arity-2 BFT NoCs with the same Rent parameter, in exchange for a reduction in maximum clock frequency of up to 25%.
We simulate the operation of our proposed NoCs when routing various real-world workloads with 64 network clients and show that they consistently achieve better throughput/LUT cost ratios than arity-2 BFT NoCs with the same Rent parameter, with improvements of 15-120% depending on the benchmark and NoC topology.</style></abstract></record><record><source-app name="Biblio" version="7.x">Drupal-Biblio</source-app><ref-type>25</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Jose Alberto Joao</style></author><author><style face="normal" font="default" size="100%">Ziqiang Huang</style></author><author><style face="normal" font="default" size="100%">Alejandro Rico Carro</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">Defer buffer</style></title></titles><dates><year><style face="normal" font="default" size="100%">2019</style></year></dates><urls><web-urls><url><style face="normal" font="default" size="100%">https://patents.google.com/patent/US10275250B2/en</style></url></web-urls></urls><edition><style face="normal" font="default" size="100%">United States of America</style></edition><volume><style face="normal" font="default" size="100%">10275250</style></volume><language><style face="normal" font="default" size="100%">eng</style></language><abstract><style face="normal" font="default" size="100%">An apparatus comprises processing circuitry for executing instructions of two or more threads of processing, hardware registers to store context data for the two or more threads concurrently, and commit circuitry to commit results of executed instructions of the threads, where for each thread the commit circuitry commits the instructions of that thread in program order. At least one defer buffer is provided to buffer at least one blocked instruction for which execution by the processing circuitry is complete but execution of an earlier instruction of the same thread in the program order is incomplete. This can help to resolve inter-thread blocking and hence improve performance.</style></abstract><issue><style face="normal" font="default" size="100%">U.S. Patent and Trademark Office</style></issue><section><style face="normal" font="default" size="100%">US</style></section></record><record><source-app name="Biblio" version="7.x">Drupal-Biblio</source-app><ref-type>47</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Ziqiang Huang</style></author><author><style face="normal" font="default" size="100%">Jose A. Joao</style></author><author><style face="normal" font="default" size="100%">Alejandro Rico</style></author><author><style face="normal" font="default" size="100%">Andrew D. Hilton</style></author><author><style face="normal" font="default" size="100%">Benjamin C. Lee
</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">DynaSprint: Microarchitectural Sprints with Dynamic Utility and Thermal Management</style></title><secondary-title><style face="normal" font="default" size="100%">Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture</style></secondary-title></titles><dates><year><style face="normal" font="default" size="100%">2019</style></year></dates><urls><web-urls><url><style face="normal" font="default" size="100%">https://dl.acm.org/citation.cfm?id=3358301</style></url></web-urls></urls><pub-location><style face="normal" font="default" size="100%">52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH</style></pub-location><pages><style face="normal" font="default" size="100%">426-439</style></pages><language><style face="normal" font="default" size="100%">eng</style></language><abstract><style face="normal" font="default" size="100%">Sprinting is a class of mechanisms that provides a short but significant performance boost while temporarily exceeding the thermal design point. We propose DynaSprint, a software runtime that manages sprints by dynamically predicting utility and modeling thermal headroom. Moreover, we propose a new sprint mechanism for caches, increasing capacity briefly for enhanced performance. For a system that extends last-level cache capacity from 2MB to 4MB per core and can absorb 10J of heat, DynaSprint-guided cache sprints improve performance by 17% on average and by up to 40% over a non-sprinting system. These performance outcomes, within 95% of an oracular policy, are possible because DynaSprint accurately predicts phase behavior and sprint utility.</style></abstract></record><record><source-app name="Biblio" version="7.x">Drupal-Biblio</source-app><ref-type>25</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Jose Alberto Joao</style></author><author><style face="normal" font="default" size="100%">Alejandro Rico Carro</style></author><author><style face="normal" font="default" size="100%">Ziqiang Huang</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">Hardware Thread Scheduling</style></title></titles><dates><year><style face="normal" font="default" size="100%">2019</style></year></dates><urls><web-urls><url><style face="normal" font="default" size="100%">https://patents.google.com/patent/US10261835B2/en</style></url></web-urls></urls><edition><style face="normal" font="default" size="100%">United States of America</style></edition><volume><style face="normal" font="default" size="100%">10261835</style></volume><language><style face="normal" font="default" size="100%">eng</style></language><abstract><style face="normal" font="default" size="100%">An apparatus has processing circuitry to execute instructions from multiple threads and hardware registers to store context data for the multiple threads concurrently. At a given time, a certain number of software-scheduled threads may be scheduled for execution by software executed by the processing circuitry. Hardware thread scheduling circuitry is provided to select one or more active threads to be executed from among the software-scheduled threads.
The hardware thread scheduling circuitry adjusts the number of active threads in dependence on at least one performance metric indicating performance of the threads.</style></abstract><issue><style face="normal" font="default" size="100%">U.S. Patent and Trademark Office</style></issue><section><style face="normal" font="default" size="100%">US</style></section></record><record><source-app name="Biblio" version="7.x">Drupal-Biblio</source-app><ref-type>47</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Ziqiang Huang</style></author><author><style face="normal" font="default" size="100%">Andrew D. Hilton</style></author><author><style face="normal" font="default" size="100%">Benjamin C. Lee</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">Decoupling Loads for Nano-Instruction Set Computers</style></title><secondary-title><style face="normal" font="default" size="100%">Proceedings of the 43rd International Symposium on Computer Architecture</style></secondary-title></titles><dates><year><style face="normal" font="default" size="100%">2016</style></year></dates><urls><web-urls><url><style face="normal" font="default" size="100%">https://dl.acm.org/citation.cfm?id=3001181</style></url></web-urls></urls><volume><style face="normal" font="default" size="100%">44</style></volume><pages><style face="normal" font="default" size="100%">406-417</style></pages><language><style face="normal" font="default" size="100%">eng</style></language><abstract><style face="normal" font="default" size="100%">We propose an ISA extension that decouples the data access and register write operations in a load instruction. We describe system and hardware support for decoupled loads. Furthermore, we show how compilers can generate better static instruction schedules by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance with a geometric mean speedup of 8.4%.</style></abstract></record></records></xml>